
Structured Reinforcement Learning for Incentivized Stochastic Covert Optimization

Adit Jain and Vikram Krishnamurthy A. Jain and V. Krishnamurthy are with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, 14853, USA email: [email protected], [email protected]
Abstract

This paper studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG. The oracle probabilistically returns either a noisy gradient of the function or a non-informative measurement, depending on the oracle state and incentive. The learner’s query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This paper formulates the problem of the learner performing covert optimization by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper as a finite-horizon Markov decision process (MDP). Using conditions for interval-dominance on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task.

Keywords: Markov decision process, Covert optimization, Structural results, Interval dominance

I Introduction

The learner aims to obtain an estimate $\hat{x}$ for a point $x^{*}\in\arg\min_{x\in\mathbb{R}^{d}}f(x)$ (by $\arg\min$ or minimizer we mean a local stationary point of $f\in\mathcal{C}^{2}$) by querying a stochastic oracle. At each time $k=1,2,\dots$, the learner sends query $q_{k}\in\mathbb{R}^{d}$ and incentive $i_{k}$ to a stochastic oracle in state $o_{k}$. The oracle returns a noisy gradient $r_{k}$ evaluated at $q_{k}$ as follows:

$$r_{k}=\begin{cases}\nabla f(q_{k})+\eta_{k}&\text{with prob.}\ \Gamma(o_{k},i_{k})\\ \mathbf{0}\ \text{(non-informative)}&\text{with prob.}\ 1-\Gamma(o_{k},i_{k})\end{cases}. \quad (1)$$

Here $(\eta_{k})$ are independent, zero-mean, finite-variance random variables, and $\Gamma$ denotes the probability that the learner gets a noisy informative response from the oracle.
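To make the oracle model concrete, the following is a minimal Python sketch of the response model (1). The quadratic test function, the Gaussian noise, and the particular form of $\Gamma(o,i)$ are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def gamma(o, i, R=3, i_max=3):
    """Illustrative response probability Gamma(o, i), increasing in oracle state o and incentive i."""
    return min(1.0, 0.2 * o / R + 0.6 * i / i_max)

def oracle_response(q, o, i, grad_f, sigma=0.1, rng=None):
    """Simulate the oracle reply (1): noisy gradient with prob. Gamma(o, i), else a non-informative zero vector."""
    rng = rng or np.random.default_rng()
    s = int(rng.random() < gamma(o, i))              # success s_k ~ Bernoulli(Gamma(o_k, i_k))
    if s:
        eta = rng.normal(0.0, sigma, size=q.shape)   # zero-mean, finite-variance noise eta_k
        return grad_f(q) + eta, s
    return np.zeros_like(q), s                       # non-informative reply

# usage with an illustrative quadratic f(x) = ||x||^2 / 2, so grad f(x) = x
r, s = oracle_response(np.array([1.0, -2.0]), o=2, i=3, grad_f=lambda x: x)
```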

An eavesdropper observes the query $q_{k}$ and incentive $i_{k}$ but not the response $r_{k}$. The eavesdropper aims to estimate $\hat{x}$ as an approximation to the minimizer of the function it is interested in optimizing. This paper addresses the question: suppose the learner uses a stochastic gradient (SG) algorithm to obtain an estimate $\hat{x}$; how can the learner control the SG to hide $\hat{x}$ from an eavesdropper?

Our proposed approach is to dynamically switch between two SGs. Let $a_{k}\in\{0=\texttt{Obfuscate SG},1=\texttt{Learn SG}\}$ denote the chosen SG at time $k$. The first SG minimizes the function $f$ and updates the learner estimate $\hat{x}_{k}$. The second SG is for obfuscating the eavesdropper with estimates $\hat{z}_{k}$. The update of both SGs is given by the equation,

$$\begin{bmatrix}\hat{x}_{k+1}\\ \hat{z}_{k+1}\end{bmatrix}=\begin{bmatrix}\hat{x}_{k}\\ \hat{z}_{k}\end{bmatrix}-\mu_{k}\begin{bmatrix}\mathds{1}(a_{k}=1)&0\\ 0&\mathds{1}(a_{k}=0)\end{bmatrix}\begin{bmatrix}r_{k}\\ \bar{r}_{k}\end{bmatrix}, \quad (2)$$

where $\mu_{k}$ is the step size, $\bar{r}_{k}$ is a synthetic gradient response discussed later, and $a_{k}$ selects which SG is updated.

The query $q_{k}$ sent by the learner to the oracle is given by

$$q_{k}=\hat{x}_{k}\mathds{1}(a_{k}=1)+\hat{z}_{k}\mathds{1}(a_{k}=0), \quad (3)$$

and $u_{k}=(a_{k},i_{k})$ is the learner's control variable (action). The learner needs $M$ informative updates of (2) to achieve the learning objective in $N$ queries. We formulate an MDP whose policy $\pi$ controls the switching of the SGs and the incentivization by the learner, to minimize an expected cost balancing obfuscation and learning. The optimal policy $\pi^{*}$ solving the MDP is shown to have a threshold structure (Theorem 2) of the form,

$$\pi^{*}(b,o,n)=\begin{cases}a=0\ (\texttt{obfuscate}),\ &b\leq\bar{b}(o,n)\\ a=1\ (\texttt{learn}),\ &b>\bar{b}(o,n),\end{cases}$$

where $b$ is the number of informative learning steps left, $n$ is the number of queries left, and $\bar{b}$ is a threshold function depending on the oracle state $o$ and $n$ (the exact dependence on the incentive is discussed later). We propose estimating the optimal stationary policy with the threshold structure using a stochastic approximation algorithm, and we propose a multi-armed bandit based approach with finite-time regret bounds in Theorem 3. The optimal stationary policy with a threshold structure is benchmarked in a numerical study on covert federated hate-speech classification.
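The switched update (2) and query selection (3) amount to maintaining two iterates and moving only the one selected by $a_{k}$. A minimal sketch, where the synthetic response $\bar{r}_{k}$ is supplied by the obfuscation strategy of Section II-C:

```python
def sg_step(x_hat, z_hat, a_k, r_k, r_bar_k, mu_k):
    """Controlled update (2): only the SG selected by a_k is moved."""
    if a_k == 1:                  # Learn SG: update the learner estimate with the oracle response
        x_hat = x_hat - mu_k * r_k
    else:                         # Obfuscate SG: update the decoy estimate with the synthetic response
        z_hat = z_hat - mu_k * r_bar_k
    return x_hat, z_hat

def query(x_hat, z_hat, a_k):
    """Query selection (3): learning queries expose x_hat, obfuscating queries expose z_hat."""
    return x_hat if a_k == 1 else z_hat
```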

Motivation: The main application of covert (or learner-private) optimization is in centralized distributed optimization [12, 10, 9]. One motivating example is pricing optimization and inventory management: the learner (e.g., an e-retailer) queries the distributed oracle (e.g., customers) about pricing and product preferences to estimate the optimal price and quantity of a product that optimize the profit function [10, 1]. A competitor could spoof as a customer and use the optimal price and quantity for their competitive advantage. Our numerical experiment illustrates another application in federated learning, a form of distributed machine learning.

The current literature on covert optimization has focused on deriving upper and lower bounds on the query complexity for a given obfuscation level [12]. Query complexity for binary and convex covert optimization with a Bayesian eavesdropper has been studied in [10, 12]. These bounds assume a static oracle, so a random querying policy can be used to randomly obfuscate and learn. In contrast, the authors have studied dynamic covert optimization, where stochastic control is used to query a stochastic oracle optimally [4]. This is starkly different from the current literature, since a stochastic oracle models situations where the quality of gradient responses may vary (e.g., due to Markovian client participation). The success of a response can be determined by the learner (e.g., based on gradient quality [4]) or by the oracle (e.g., based on the availability of computational resources).

Differences from previous work [4]: To prove that the optimal policy has a monotone threshold structure, [4] requires supermodularity conditions. This paper proves the result under more relaxed conditions using interval dominance [8] in Theorem 2, which can accommodate convex cost functions and more general transition probabilities [5]. The action space in this work also includes an incentive the learner provides to the oracle. The incentive is motivated by the learner's cost of obtaining a gradient evaluation of the desired quality; it could be monetary compensation the learner pays or non-monetary, e.g., controlling the latency of services to participating clients [11]. We had a generic cost function in [4], whereas the costs considered in this paper are exact with respect to the learner's approximation of the eavesdropper's estimate of $\hat{x}$.

II Covert optimization for first-order stochastic gradient descent

This section describes the two stochastic gradient algorithms, between which the learner dynamically switches to either learn or obfuscate using the MDP formulation of the next section. This section states the assumptions about the oracle, the learner, the eavesdropper, and the obfuscation strategy. We state the result on the number of successful gradient steps the learner needs to achieve the learning objective. The problem formulation for covert optimization is illustrated in Fig. 1.

II-A Oracle

The oracle evaluates the gradient of the function $f$. The following is assumed about the oracle and the function $f$:

  1. O1:

    Function $f:\mathbb{R}^{d}\to\mathbb{R}$ is continuously differentiable and lower bounded by $f^{*}$. The gradient of $f$ is $\gamma$-Lipschitz continuous, $\|\nabla f(x)-\nabla f(z)\|\leq\gamma\|x-z\|\ \forall x,z\in\mathbb{R}^{d}$.

  2. O2:

    At time $k$, the oracle is in state $o_{k}\in\{1,\dots,R\}$, where $R$ is the number of oracle states, and for incentive $i_{k}\in\{i^{1},\dots,i^{n_{i}}\}$ it replies with probability $\Gamma(o_{k},i_{k})$. $s_{k}\sim\text{Bernoulli}(\Gamma(o_{k},i_{k}))$ denotes the success of the reply.

  3. O3:

    For a reply with success $s_{k}$ to the query $q_{k}\in\mathbb{R}^{d}$, the oracle returns a noisy gradient response $r_{k}$ according to (1). The noise terms $\eta_{k}$ are independent (a slightly weaker assumption based on conditional independence was considered in our paper [4]; we consider independence here for brevity), have zero mean and finite variance, $\mathds{E}[r_{k}]=\nabla f(q_{k})$ and $\mathds{E}[\|\eta_{k}\|^{2}]\leq\sigma^{2}$.

O1 and O3 are standard assumptions for analyzing oracle-based gradient descent [3]. O2 is motivated by an oracle with a stochastic state (e.g., client participation), and the success is determined by the oracle or by the learner.

Figure 1: Dynamic Covert Optimization. The learner sends query $q_{k}$ and incentive $i_{k}$ to the oracle ($\tilde{\nabla}f$) in state $o_{k}$. The oracle evaluates a noisy gradient of $f$ at $q_{k}$ and returns $r_{k}$ according to (1), with response success $s_{k}$. The query $q_{k}$ and incentive $i_{k}$ are visible to the eavesdropper, who aims to approximate the learner's estimate. The learner controls the incentive $i_{k}$ and the type of SG ($a_{k}$) used to form the query (3), in order to achieve the learning objective (4) while obfuscating the eavesdropper with belief (5).

II-B Learner

Similar to oracle-based first-order gradient descent [3], the learner aims to estimate $\hat{x}\in\mathbb{R}^{d}$ which is an $\epsilon$-close critical point of the function $f$,

$$\mathds{E}\left[\|\nabla f(\hat{x})\|^{2}\right]\leq\epsilon. \quad (4)$$

Since $f$ is non-convex and not known in closed form to the learner, in general the gradient at $z_{1}$ is non-informative about the gradient at $z_{2}$ far from $z_{1}$. Hence, at time $k$, the learner can send either a learning or an obfuscating query. We propose controlling the gradient descent of the learner by the query action $a_{k}\in\{0=\texttt{obfuscating},1=\texttt{learning}\}$. While learning, the learner updates its estimate $\hat{x}_{k}$ by performing the controlled stochastic gradient step of (2). Here, $\mu_{k}$ is the step size, chosen to be constant in this paper. In the next section, we formally state the action space composed of the type of query $a_{k}$ and the incentive $i_{k}$. In order to estimate the number of queries the learner has to spend on learning, we first define a successful gradient step and then state a result on the order of the number of successful gradient steps required to achieve the objective.

Definition 1

(Successful Gradient Step) A gradient step of (2) is successful when the learner queries the oracle with a learning query ($a_{k}=1$) and gets a successful reply ($s_{k}=1$).

Theorem 1

For an oracle satisfying assumptions (O1-O3), to obtain an estimate $\hat{x}$ which achieves objective (4), the learner needs to perform $M$ successful gradient steps (Def. 1) with step size $\mu=\min\left(\frac{1}{\gamma},\frac{\epsilon}{2\sigma^{2}\gamma}\right)$, where $M$ is of the order $O\left(\frac{\sigma^{2}}{\epsilon^{2}}+\frac{1}{\epsilon}\right)$. The exact expression is $M=\max\left(\frac{4F\gamma}{\epsilon},\frac{8F\gamma\sigma^{2}}{\epsilon^{2}}\right)$, where $F=\mathds{E}[f(x_{0})]-f^{*}$.

A proof for a general setting can be found in [4] and [3]. Theorem 1 characterizes the number of successful gradient steps $M$ that the learner needs to perform to achieve the learning objective of (4). Theorem 1 guarantees the existence of a finite queue state space in the next section, which models the number of successful gradient steps left to be taken. $M$ in the MDP formulation of the next section can be chosen heuristically or computed exactly if the parameters of the function are known. The theorem also shows that $M$ is inversely dependent on $\epsilon$, and it incorporates the descent dynamics into the structure of the optimal policy.
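As an illustration of Theorem 1, the step size and the exact expression for $M$ can be evaluated once $F$, $\gamma$, $\sigma^{2}$, and $\epsilon$ are fixed; the numerical values in the usage line below are placeholders, not parameters from the paper.

```python
import math

def steps_and_stepsize(F, gamma, sigma2, eps):
    """Step size mu and number of successful gradient steps M from Theorem 1."""
    mu = min(1.0 / gamma, eps / (2.0 * sigma2 * gamma))
    M = max(4.0 * F * gamma / eps, 8.0 * F * gamma * sigma2 / eps ** 2)
    return mu, math.ceil(M)

# placeholder values: F = E[f(x_0)] - f*, gamma-smoothness constant, noise variance sigma^2, target eps
mu, M = steps_and_stepsize(F=10.0, gamma=1.0, sigma2=0.5, eps=0.1)   # -> mu = 0.1, M = 4000
```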

II-C Obfuscation Strategy

Based on the chosen SG $a_{k}$, the learner poses queries using (3) and provides incentives to the oracle. To obfuscate the eavesdropper, the learner runs a parallel stochastic gradient with synthetic responses $\bar{r}_{k}$. The synthetic responses can be generated by suitably simulating an oracle; e.g., the learner can separately train a neural network on an unbalanced subset of the data, as was done in [4]. If the learner is sure that the eavesdropper has no public dataset to validate against, the learner can simply take mirrored gradients of (1). When obfuscating, the learner poses queries from the estimates of the second SG, $\hat{z}_{k}$.

The parallel stochastic gradient ensures that the eavesdropper cannot infer the true learning trajectory from the shape of the trajectory. In summary, the learner obfuscates and learns by dynamically choosing the query $q_{k}$ as either the current estimate $\hat{x}_{k}$ of the controlled stochastic gradient step or the estimate $\hat{z}_{k}$ of the parallel SG. We assume that the learner queries such that the two trajectories are sufficiently separated from each other, so that the eavesdropper can cluster the queries and uniquely distinguish them into two trajectories, as described next.

II-D Eavesdropper

At time $k$, the eavesdropper observes the query $q_{k}$ and the incentive $i_{k}$ of the learner. We use the following assumptions to approximate the posterior belief of the eavesdropper:

  1. E1:

    The eavesdropper is passive, does not affect the oracle, and cannot observe the oracle's responses $r_{k}$.

  2. E2:

    For $k>0$, the eavesdropper can classify each query of the observed query sequence $(q_{1},\dots,q_{k})$ into two unique trajectory sequences, $\mathcal{J}_{1}^{k}$ and $\mathcal{J}_{2}^{k}$, which can be separated by a hyperplane in $\mathbb{R}^{d}$.

  3. E3:

    The eavesdropper knows that one of $\mathcal{J}_{1}^{k}$ and $\mathcal{J}_{2}^{k}$ is the learning trajectory. Given the information up to time $k$, the eavesdropper computes a posterior belief $\delta_{k}$ that trajectory $\mathcal{J}_{1}^{k}$ is the learning trajectory,

    $$\delta_{k}=\mathds{P}(\hat{x}\in\mathcal{J}_{1}^{k}|(q_{1},i_{1},\dots,q_{k},i_{k})).$$

Assumption E1 holds if the eavesdropper is an insignificant part of the oracle and cannot observe the aggregated response. Assumptions E2 and E3 ensure the dynamic obfuscation is against a worst-case eavesdropper (as mentioned above, it is assumed that the queries are posed such that the two trajectories are sufficiently separated, by a metric known to the eavesdropper; one of the trajectories can be empty for the initial queries). We consider the eavesdropper using an incentive-weighted proportional sampling estimator for the posterior $\delta_{k}$ [12, 4],

$$\delta_{k}=\frac{\sum_{1\leq t\leq k}i_{t}\mathds{1}(q_{t}\in\mathcal{J}_{1}^{t})}{\sum_{1\leq t\leq k}i_{t}}. \quad (5)$$

The learner has access to the queries and the incentives and uses (5) as an approximation of the eavesdropper's $\arg\min$ estimate to compute the cost incurred when learning. The following section formulates an MDP to perform covert optimization using stochastic control. $M$ from Theorem 1 and the oracle state are used to model the state space, while the incentive $i_{k}$ and the type of SG $a_{k}$ in (2) model the action space.
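A minimal sketch of the incentive-weighted estimator (5), assuming the learner keeps a running record of which trajectory each past query was classified into (the uniform prior of $0.5$ before any incentive is observed is an added assumption):

```python
def eavesdropper_belief(incentives, in_traj1):
    """Incentive-weighted proportional-sampling belief (5) that trajectory J_1 is the learning one.

    incentives: incentives i_t observed so far.
    in_traj1:   booleans, True if query q_t was classified into trajectory J_1.
    """
    total = sum(incentives)
    if total == 0:
        return 0.5   # assumed uninformative prior before any incentive is observed
    return sum(i for i, flag in zip(incentives, in_traj1) if flag) / total

# example: incentives 3 and 1 spent on queries in J_1, incentive 2 on a query in J_2
delta_k = eavesdropper_belief([3, 2, 1], [True, False, True])   # = 4 / 6
```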

III MDP for achieving covert optimization

We formulate a finite-horizon MDP to solve the learner's decision problem. The learner chooses an incentive and dynamically either minimizes the function using the estimate $\hat{x}_{k}$ or obfuscates the eavesdropper using $\hat{z}_{k}$. The learner wants to perform $M$ successful gradient steps in $N$ total queries. Using interval dominance, we show that the optimal policy of the finite-horizon MDP has a threshold structure. The stochastic control approach is described in Algorithm 1.

III-A MDP formulation for optimally switching between stochastic gradient algorithms

Input: Policy $\pi$, Queries $N$, Successful Gradient Steps $M$
Initialize learner queue state $b_{N}=M$
for $k$ in $1,\dots,N$ do
     Obtain type of SG and incentive, $(a_{k},i_{k})=\pi(o_{k},b_{k})$
     Incur cost $c((a_{k},i_{k}),(o_{k},b_{k}))$ from (7)
     Query the oracle using query $q_{k}$ (3) and incentive $i_{k}$
     Receive response $r_{k}$ and success of reply $s_{k}$
     Update the estimates of the two SGs using (2)
     if $s_{k}=1$ then $b_{k+1}=b_{k}-s_{k}$
     Oracle state evolves, $o_{k+1}\sim\Delta(\cdot|o_{k})$
end for
Incur terminal cost $d(b_{0})$
Algorithm 1: Stochastic Control for Covert Optimization

The dynamic programming index $n=N,\dots,0$ denotes the number of queries left and decreases with time $k$.

State Space: The state space is denoted $\mathcal{Y}=\mathcal{Y}^{B}\times\mathcal{Y}^{O}$, where $\mathcal{Y}^{B}=\{0,1,\dots,M\}$ is the learner queue state space and $\mathcal{Y}^{O}=\{0,1,\dots,R\}$ is the oracle state space. The learner queue state $b_{n}\in\mathcal{Y}^{B}$ denotes the number of successful gradient steps (Def. 1) remaining to achieve (4). The oracle state space $\mathcal{Y}^{O}$ discretizes the stochastic state of the oracle into $R$ levels (e.g., percentages of client participation in FL). $y_{n}$ denotes the state with $n$ queries remaining.

Action Space: The action space is $\mathcal{U}=\{0=\texttt{obfuscate},1=\texttt{learn}\}\times\{i^{1},\dots,i^{n_{i}}\}$. The action with $n$ queries remaining is $u_{n}=(a_{n},i_{n})$, where $a_{n}\in\{0=\texttt{obfuscate},1=\texttt{learn}\}$ is the type of the query and $i_{n}\in\{i^{1},\dots,i^{n_{i}}\}$ is the incentive. To derive structural results on the optimal policy, we consider the following transformation of the action space, $\mathcal{U}=\{(0,i^{1}),\dots,(0,i^{n_{i}}),(1,i^{1}),\dots,(1,i^{n_{i}})\}$. A deterministic policy for the finite-horizon MDP is denoted by $\pi$, a sequence of functions $\pi=(u_{n}:n=0,\dots,N)$, where $u_{n}:\mathcal{Y}\to\mathcal{U}$ maps the state space to the action space. $\Pi$ denotes the space of all policies.

Transition Probabilities: We assume that the evolution of the oracle state and the learner queue state is Markovian. The oracle state evolves independently of the queue state evolution. In case of a successful gradient step (Def. 1), the queue decreases by one, and the oracle state evolves to a state $o^{\prime}\in\mathcal{Y}^{O}$ with probability $\Delta(o^{\prime}|o)>0$. Let $\mathcal{P}^{o,o^{\prime}}_{b}(u)=\mathds{P}((o^{\prime},\cdot)|o,b,u)/\Delta(o^{\prime}|o)$ denote the transition probability vector of the buffer state with future oracle state $o^{\prime}$ given $(o,b,u)$. The transition probability from state $y=(o,b)\in\mathcal{Y}$ to state $y^{\prime}=(o^{\prime},b^{\prime})\in\mathcal{Y}$ with action $u=(a,i)$ can be written as,

$$\begin{split}&\mathds{P}(y^{\prime}=(o^{\prime},b-1)|y,u)=\Delta(o^{\prime}|o)\Gamma(o,i)\mathds{1}(a)\ \forall o^{\prime},\\ &\mathds{P}(y^{\prime}=(o,b)|y,u)=\left(1-\Gamma(o,i)\right)\mathds{1}(a)+\left(1-\mathds{1}(a)\right),\end{split} \quad (6)$$

and is $0$ otherwise. The first equation corresponds to a successful gradient step, and the second to an unsuccessful one. We assume that $\Gamma(o,i)$ (from O2) is increasing in the incentive $i$.
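A sketch of the transition kernel, implementing (6) literally as stated; the oracle chain $\Delta$ and the response probabilities $\Gamma$ are assumed to be supplied as nested lists or arrays indexed by oracle state and incentive level:

```python
def transition_prob(y_next, y, u, Delta, Gamma):
    """Transition probability (6) from y = (o, b) to y_next = (o', b') under action u = (a, i_idx).

    Delta[o][o']   : oracle chain, Delta(o'|o).
    Gamma[o][i_idx]: success probability for oracle state o and incentive level i_idx.
    """
    (o, b), (o_next, b_next), (a, i_idx) = y, y_next, u
    if a == 1 and b_next == b - 1:               # successful gradient step: queue decreases by one
        return Delta[o][o_next] * Gamma[o][i_idx]
    if o_next == o and b_next == b:              # no successful step: state stays put
        return (1.0 - Gamma[o][i_idx]) if a == 1 else 1.0
    return 0.0                                   # all other transitions have probability 0
```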

Learning and Queueing Cost: The learning cost $c_{n}:\mathcal{Y}\times\mathcal{U}\to\mathbb{R}$ is the cost (with $n$ queries remaining) incurred after every action due to learning at the expense of reduced obfuscation. We consider the following learning cost, which is proportional to the logarithm of the improvement in the eavesdropper's estimate ($\propto\log(\delta_{n}/\delta_{n+1})$) and is given by,

$$\begin{split}c_{n}(y_{n},u_{n})&=\frac{\psi_{1}(b_{n})}{\psi_{2}(o_{n})}\log\left(\frac{I_{n}+i_{n}/\delta_{n}}{I_{n}+i_{n}}\right)\mathds{1}(a_{n})\\ &+\frac{\psi_{2}(o_{n})}{\psi_{1}(b_{n})}\log\left(\frac{I_{n}}{I_{n}+i_{n}}\right)(1-\mathds{1}(a_{n})),\end{split} \quad (7)$$

where $\psi_{1}:\mathcal{Y}^{B}\to\mathbb{R}^{+}$ and $\psi_{2}:\mathcal{Y}^{O}\to\mathbb{R}^{+}$ are positive, convex, and increasing cost functions, $I_{n}=\sum_{k=N-1}^{n+1}i_{k}$ is the sum of the previous incentives, and $\delta_{n}$ is the eavesdropper's belief that trajectory $\mathcal{J}_{1}$ is the true trajectory, computed using (5). $\psi_{1}$ and $\psi_{2}$ incorporate the cost with respect to the queue and oracle states; e.g., $\psi_{1}$ and $\psi_{2}$ are taken quadratic in the respective states in the experiments. The form of the fractions ensures the structural results discussed next. The first term in (7) is the cost incurred by a learning query and is non-negative ($0\leq\delta_{k}\leq 1$). The second term corresponds to an obfuscating query and is non-positive. The cost increases with the queue state and decreases with the oracle state. This incentivizes the learner to drive the system to a smaller queue and to learn when the oracle is in a good state. After $N$ queries, the learner pays a terminal queue cost computed using the function $d:\mathcal{Y}\to\mathbb{R}$. The queue cost accounts for the learning loss in terms of the terminal number of successful gradient steps left, $b_{0}$.
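A sketch of the per-query cost (7). The quadratic forms $\psi_{1}(b)=1+b^{2}$ and $\psi_{2}(o)=1+o^{2}$ are illustrative (the experiments only state that $\psi_{1},\psi_{2}$ are quadratic), and the sketch assumes at least one incentive has already been paid so that $I_{n}>0$:

```python
import math

def learning_cost(b, o, a, incentive, I_prev, delta,
                  psi1=lambda b: 1.0 + b ** 2, psi2=lambda o: 1.0 + o ** 2):
    """Per-query cost (7): non-negative for learning queries (a=1), non-positive for obfuscating ones (a=0)."""
    if a == 1:
        # learning query: cost grows with the improvement in the eavesdropper's belief delta
        return (psi1(b) / psi2(o)) * math.log((I_prev + incentive / delta) / (I_prev + incentive))
    # obfuscating query: negative cost (reward) for diluting the incentive-weighted belief (5)
    return (psi2(o) / psi1(b)) * math.log(I_prev / (I_prev + incentive))
```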

Remark: The incentive improves the response probability $\Gamma$, but also allows for better obfuscation than a non-incentivized setup (a high incentive can be used to misdirect the eavesdropper's belief in (5)).

III-B Optimization problem

The expected total cost for the finite-horizon MDP with initial state $y_{N}\in\mathcal{Y}$ and policy $\pi$ is given by,

$$V^{\pi}(y_{N})=\mathds{E}\left[\frac{1}{N}\sum_{n=1}^{N}c_{n}\left(y_{n},u_{n}\right)+d(y_{0})\mid y_{N},\pi\right]. \quad (8)$$

The optimization problem is to find the optimal policy $\pi^{*}$,

$$V^{\pi^{*}}(y)=\inf_{\pi\in\Pi}V^{\pi}(y)\ \forall\ y\in\mathcal{Y}. \quad (9)$$

To define the optimal policy using a recursive equation, we first define the value function $V_{n}$ with $n$ queries remaining,

$$V_{n}(y)=\min_{u\in\mathcal{U}}\left(c_{n}(y,u)+\sum_{y^{\prime}\in\mathcal{Y}}\mathds{P}(y^{\prime}|y,u)V_{n-1}(y^{\prime})\right). \quad (10)$$

Let the optimal policy be $\pi^{*}=(u^{*}_{n})_{n=N}^{1}$, where $u^{*}_{n}(\cdot)$ is the optimal action with $n$ remaining queries and is the solution of the following stochastic recursion (Bellman's equation),

$$u^{*}_{n}(y)=\underset{u\in\mathcal{U}}{\arg\min}\ Q_{n}(u,y), \quad (11)$$

where the Q-function $Q_{n}$ is defined as,

$$Q_{n}(u,y)=c_{n}(u,y)+\sum_{y^{\prime}\in\mathcal{Y}}\mathds{P}(y^{\prime}|y,u)V_{n-1}(y^{\prime}), \quad (12)$$

with $n=0,\dots,N$ and $V_{0}(y)=d(y)$. If the transition probabilities are unknown, Q-learning can be used to estimate the optimal policy of (11). However, the following subsection shows that the optimal policy has a threshold structure, which motivates efficient policy search algorithms.
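When the kernel (6) and cost (7) are known, the recursion (10)-(12) can be solved directly by backward induction. A compact sketch, assuming the state and action spaces have been enumerated as indices and the model supplied as arrays:

```python
import numpy as np

def backward_induction(P, C, d, N):
    """Solve (10)-(12) by dynamic programming.

    P: array of shape (U, Y, Y), P[u, y, y'] = transition probability P(y'|y,u).
    C: array of shape (N+1, U, Y), C[n, u, y] = cost c_n(y, u).
    d: array of shape (Y,), terminal cost, so that V_0 = d.
    Returns value functions V[n] and greedy policies pi[n] for n = 1..N.
    """
    U, Y, _ = P.shape
    V = np.zeros((N + 1, Y))
    pi = np.zeros((N + 1, Y), dtype=int)
    V[0] = d
    for n in range(1, N + 1):
        Q = C[n] + P @ V[n - 1]     # Q[u, y] = c_n(y, u) + sum_y' P(y'|y,u) V_{n-1}(y')
        V[n] = Q.min(axis=0)        # Bellman minimization (10)
        pi[n] = Q.argmin(axis=0)    # optimal action (11)
    return V, pi
```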

III-C Structural Results

The following is assumed to derive the structural results,

  1. R1:

    The learning cost $c_{n}$ is $\uparrow$ (increasing) and convex in the queue state $b_{n}$ for each action $u_{n}\in\mathcal{U}$.

  2. R2:

    The transition probability matrix $\mathds{P}(b^{\prime}|b,o,u)$ is TP3 (totally positive of order 3: every third-order minor of the matrix is non-negative) with $\sum_{b^{\prime}}b^{\prime}\mathds{P}(b^{\prime}|b,o,u)$ increasing and convex in $b$.

  3. R3:

    The terminal cost $d$ is $\uparrow$ and convex in the queue state $b$.

  4. R4:

    For $\alpha_{b^{\prime},b,u}>0$ and increasing $u$: $c(b^{\prime},o,u+1)-c(b^{\prime},o,u)\leq\alpha_{b^{\prime},b,u}\left(c(b,o,u+1)-c(b,o,u)\right)$, $\forall b^{\prime}>b$.

  5. R5:

    For $\beta_{b^{\prime},b,u}>0$ and increasing $u$,

    $$\frac{\mathcal{P}^{o,o^{\prime}}_{b^{\prime}}(u+1)+\beta_{b^{\prime},b,u}\mathcal{P}^{o,o^{\prime}}_{b^{\prime}}(u)}{1+\beta_{b^{\prime},b,u}}<_{c}\frac{\mathcal{P}^{o,o^{\prime}}_{b}(u)+\beta_{b^{\prime},b,u}\mathcal{P}^{o,o^{\prime}}_{b}(u+1)}{1+\beta_{b^{\prime},b,u}}$$

    for $b^{\prime}>b$, $\forall o^{\prime},o\in\mathcal{Y}^{O}$; $<_{c}$ denotes convex dominance (probability vector $p$ is convex dominated by probability vector $q$ iff $f^{\prime}p\geq f^{\prime}q$ for every increasing and convex vector $f$).

  6. R6:

    There exist $\alpha_{b^{\prime},b,u}=\beta_{b^{\prime},b,u}$ such that (R4) and (R5) hold.

Assumptions (R1) and (R3) hold by the construction of the cost in (7) and the terminal cost. (R2) is a standard assumption on bi-diagonal stochastic matrices made when analyzing structural results [5]. (R4), (R5), and (R6) generalize the supermodularity conditions used previously in [4] and are sufficient for interval dominance [5]. Assumption (R4) can be verified for the cost of (7) using algebraic manipulation with $\alpha_{b^{\prime},b,u}\leq 1$, and (R5) can be shown with $\beta_{b^{\prime},b,u}\leq 1$ for the bi-diagonal matrix of (6) with $\Gamma(o,i)$ increasing in $i$. Therefore (R6) can be satisfied for some $\gamma_{b^{\prime},b,u}=\alpha_{b^{\prime},b,u}=\beta_{b^{\prime},b,u}\leq 1$. We now state the main structural result.

Theorem 2

Under assumptions (R1-R6), the optimal action $u^{*}_{n}(y)$ (given by (11)) for the finite-horizon MDP of (9) is increasing in the queue state $b$.

Proof:

Step 1: Conditions of Interval Dominance:

The following condition with $\gamma_{b^{\prime},b,u}>0$,

$$Q_{n}(b^{\prime},u+1)-Q_{n}(b^{\prime},u)\leq\gamma_{b^{\prime},b,u}\left[Q_{n}(b,u+1)-Q_{n}(b,u)\right],\ b^{\prime}>b, \quad (13)$$

is sufficient for $\arg\min Q_{n}$ to be increasing in $b$ [8, 5],

$$u^{*}_{n}(b)=\arg\min_{u\in\mathcal{U}}Q_{n}(b,u)\uparrow b.$$

We omit the oracle state $o$ from the above expression.

By plugging (12) into (13), we need to show the following,

$$\begin{split}&\underbrace{c_{n}(b^{\prime},u+1)-c_{n}(b^{\prime},u)-\gamma_{b,b^{\prime},u}\left(c_{n}(b,u+1)-c_{n}(b,u)\right)}_{(a)}\\ &+\sum_{o^{\prime}\in\mathcal{Y}^{O}}\sum_{b^{\prime\prime}}\Big[\mathds{P}((o^{\prime},b^{\prime\prime})|(o,b^{\prime}),u+1)-\mathds{P}((o^{\prime},b^{\prime\prime})|(o,b^{\prime}),u)\\ &\quad-\gamma_{b,b^{\prime},u}\left(\mathds{P}((o^{\prime},b^{\prime\prime})|(o,b),u+1)-\mathds{P}((o^{\prime},b^{\prime\prime})|(o,b),u)\right)\Big]V(o^{\prime},b^{\prime\prime})\leq 0,\ b^{\prime}>b.\end{split}$$

By (R4), part $(a)$ of the above inequality is satisfied with a constant $\alpha_{b,b^{\prime},u}\leq 1$. The rest of the inequality can be shown using (R5) with a constant $\beta_{b,b^{\prime},u}\leq 1$ if we assume the value function is increasing and convex (see the definition of convex dominance in (R5) and [5]). Finally, we apply (R6) with $\alpha_{b,b^{\prime},u}=\beta_{b,b^{\prime},u}=\gamma_{b,b^{\prime},u}$ to show that (13) holds and the optimal action is $\uparrow$ in the learner state $b$. All that remains is to show that the value function is increasing and convex, which we now show using (R1, R2, R3) and induction.

Step 2: Value Function is Increasing in $b$: By (R3), $V_{0}(y)=d(b)$ is increasing in $b$. Let $V_{n}(y)\uparrow b$. TP3 (R2) implies TP2 and hence preserves monotone functions [5]. Therefore, by the preservation of TP2 and linear combination, $\sum_{o^{\prime}\in\mathcal{Y}^{O}}\Delta(o^{\prime}|o)\sum_{b^{\prime}\in\mathcal{Y}^{B}}\mathds{P}(b^{\prime}|b,o,u)V_{n}\uparrow b$. By (R1) and (12), $Q_{n+1}\uparrow b$, and therefore by (10), $V_{n+1}(y)\uparrow b$.

Step 3: Value Function is Convex in $b$: By (R3), $V_{0}(y)=d(b)$ is convex in $b$. Let $V_{n}$ be convex in $b$. Then by (R2), applying Lemma 1 of [5] along with the preservation of convexity under positive weighted sums, $\sum_{o^{\prime}\in\mathcal{Y}^{O}}\Delta(o^{\prime}|o)\sum_{b^{\prime}\in\mathcal{Y}^{B}}\mathds{P}(b^{\prime}|b,o,u)V_{n}$ is convex in $b$. Applying (R1) and (12), $Q_{n+1}$ is convex in $b$. Since minimization preserves convexity, $V_{n+1}=\min Q_{n+1}$ is convex in $b$. ∎

Theorem 2 implies that the policy is threshold in the learner queue state; hence, the learner learns more aggressively when more successful gradient steps (Def. 1) remain. This intuitively makes sense from an obfuscation perspective, since the learner should ideally spend more time obfuscating when it is closer to the minimizer (the queue state is small).

Using Theorem 2, we can parameterize the optimal policy by thresholds on the queue state. Although we can construct a stochastic approximation algorithm for estimating the non-stationary threshold policy, which is computationally cheaper than Q-learning, this approach still requires a number of parameters linear in the time horizon $N$. Given this insight, we restrict the search space to stationary policies with a monotone threshold structure; this restriction is common in the literature [4, 7].

Let the threshold on the queue state for oracle state $o$ and action $u$ be parameterized by $\bar{b}:\mathcal{Y}^{O}\times\mathcal{U}\to\mathcal{Y}^{B}$. The optimal stationary policy with a threshold structure can be written as,

$$\pi^{*}(y)=\sum_{u\in\mathcal{U}}u\mathds{1}(\bar{b}^{*}(o,u)\leq b<\bar{b}^{*}(o,u+1)), \quad (14)$$

where $\bar{b}^{*}$ is the optimal threshold function.
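Evaluating the threshold policy (14) reduces to a lookup over ordered thresholds; a minimal sketch, assuming the thresholds for each oracle state are stored as a non-decreasing list over the ordered action space with $\bar{b}(o,u^{1})=0$:

```python
def threshold_policy(b, o, thresholds):
    """Threshold policy (14): return the index of the largest action u with b_bar(o, u) <= b.

    thresholds[o] is a non-decreasing list [b_bar(o, u) for u in the ordered action space],
    with thresholds[o][0] = 0 so that some action is always selected.
    """
    u = 0
    for idx, b_bar in enumerate(thresholds[o]):
        if b >= b_bar:
            u = idx
    return u
```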

IV Estimating the optimal stationary policy with a threshold structure

In this section, we propose two methods to approximate the optimal stationary policy (referred to simply as the optimal policy in this section) for the finite-horizon MDP of (9), which has the monotone threshold structure of (14). The first method uses stochastic approximation to iteratively update the parameters over the learning episodes. The second method uses a multi-armed bandit formulation to perform discrete optimization over the space of thresholds. The proposed methods can be extended to a non-stationary policy space at the cost of increased time and memory complexity.

IV-A Simultaneous Perturbation Stochastic Approximation

Taking the thresholds of the stationary policy of  (14) as the parameters, a simultaneous perturbation stochastic approximation (SPSA) based algorithm can be used to find the parameters for the optimal policy. We update the policy parameters using approximate gradients of the costs computed using perturbed parameters. We use the following sigmoidal approximation for the threshold policy of (14),

$$\hat{\pi}(y,\bar{b})=\sum_{u\in\mathcal{U}}\frac{1}{1+\exp\left(-(b-\bar{b}(o,u))/\tau\right)}, \quad (15)$$

where $\tau$ is an approximation parameter. The parameters are the $F=|\mathcal{U}||\mathcal{Y}^{O}|$ threshold values and are represented by $\Theta$. For the optimal parameters, the approximate policy converges to the optimal policy as $\tau\to 0$ [7]. For learning episode $i$ and current parameter set $\Theta_{i}$, the actions are computed using the current approximate policy (15). The policy parameters are perturbed independently with probability $1/2$ by $\pm\delta$. Two learning episodes are performed, one with each set of perturbed policy parameters ($\Theta_{i}^{+}$, $\Theta_{i}^{-}$). The costs from the two episodes are used to obtain the approximate gradient $\tilde{\nabla}C_{i}$ by the method of finite differences. The policy parameters are updated using a step size $\phi_{i}$,

$$\Theta_{i+1}=\Theta_{i}-\phi_{i}\tilde{\nabla}C_{i}.$$

Under regularity conditions on the noise in the approximate gradients, the policy parameters asymptotically converge in distribution to the set of parameters of the optimal stationary policies with a threshold structure [6]. The SPSA algorithm can also be used with a constant step size to track changes in the system [4, 6]. The computational complexity for each learning episode is $O(F+N)$.
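A sketch of the SPSA iteration described above. The episode simulator `run_episode`, which runs Algorithm 1 under the sigmoidal policy (15) with the given thresholds and returns the episode cost (8), is an assumed black box:

```python
import numpy as np

def spsa(theta0, run_episode, num_iters=3000, phi=0.01, delta=0.1, seed=0):
    """SPSA search over the F threshold parameters Theta of the sigmoidal policy (15)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        pert = rng.choice([-1.0, 1.0], size=theta.shape)      # each parameter perturbed by +/- delta w.p. 1/2
        cost_plus = run_episode(theta + delta * pert)         # episode cost with Theta_i^+
        cost_minus = run_episode(theta - delta * pert)        # episode cost with Theta_i^-
        grad_est = (cost_plus - cost_minus) / (2.0 * delta * pert)   # finite-difference gradient estimate
        theta = theta - phi * grad_est                        # update Theta_{i+1} = Theta_i - phi_i * grad
    return theta
```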

IV-B Multi-armed bandit approach

The problem of searching for the thresholds of (14) is solved by considering the values each threshold can take and treating the product space of the thresholds as bandit arms. Each threshold takes values over the learner queue state space $\mathcal{Y}^{B}$, which has cardinality $M+1$. Considering each assignment of the $F=|\mathcal{U}||\mathcal{Y}^{O}|$ thresholds as an arm makes the total number of arms $(M+1)^{F}$. The selection of an arm gives a corresponding stationary policy of the form (14), and a reward (the negative of the cumulative cost of the episode) is obtained by interacting with the oracle for time horizon $N$. The noisy reward is sampled from a distribution centered at the expected value (8). (B1) The reward is assumed to be sampled independently for a given policy, and the noise is assumed to be sub-Gaussian [2]. For brevity, we omit the definition of regret and the exact upper bound, both of which can be found in Chapter 2 of [2]. The following is the result for the upper bound on the regret for searching the thresholds.

Theorem 3

Consider the finite-horizon MDP of (9) for covert optimization with an oracle (O1-O3) to achieve (4). The optimal stationary policy with a threshold structure (14) can be searched for using the upper confidence bound algorithm under (B1), with an expected regret after $T$ episodes bounded by $O\left(M^{F}\log T\right)$, where $M$ is of the order $O(1/\epsilon+\sigma^{2}/\epsilon^{2})$ and $F=|\mathcal{Y}^{O}||\mathcal{U}|$ is the number of thresholds.

The proof follows from Theorem 1 by plugging the number of arms into the standard regret bound for UCB [2]. Although the regret of this approach is bounded, its significant limitations are that the bound is exponential in the state and action space dimensions and that, compared to SPSA, it cannot track changes in the system.
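A minimal UCB1 sketch over the $(M+1)^{F}$ threshold arms behind Theorem 3. The function `episode_reward`, which runs one episode under the stationary threshold policy encoded by an arm and returns the negative cumulative cost, is an assumed black box, and the exploration constant is illustrative:

```python
import itertools
import math
import random

def ucb_threshold_search(M, F, episode_reward, T=1000, c=2.0):
    """UCB1 over all (M+1)^F threshold assignments; returns the best arm found after T episodes."""
    arms = list(itertools.product(range(M + 1), repeat=F))   # each arm is a tuple of F thresholds
    counts = {arm: 0 for arm in arms}
    means = {arm: 0.0 for arm in arms}
    for t in range(1, T + 1):
        unplayed = [a for a in arms if counts[a] == 0]
        if unplayed:
            arm = random.choice(unplayed)                     # play every arm once before using the index
        else:
            arm = max(arms, key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))
        r = episode_reward(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]          # running average of observed rewards
    return max(arms, key=lambda a: means[a])
```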

V Example: Covert Federated Learning for Hate-Speech Classification

We demonstrate the proposed covert optimization framework in a numerical experiment on hate-speech classification using federated learning in the presence of an eavesdropper. The eavesdropper spoofs as a client and misuses the optimal weights to generate hate speech that goes undetected by the classifier. The detailed motivation and experimental setup can be found in [4]. A balanced subset of the civil comments toxicity dataset by Jigsaw AI is used, which contains comments annotated for whether or not they are toxic. The federated learning setup consists of $N_{c}=35$ clients, each having $N_{d}=689$ data points. A fully connected neural network attached to a pre-trained transformer is trained with a cross-entropy loss to classify the comments as toxic or not. The accuracy is reported on a balanced validation dataset. (The results are reproducible and can be found in the GitHub repository github.com/aditj/CovertOptimization, which also contains links to the dataset, the complete set of experimental parameters, and a supplementary document with additional benchmarks and illustrations.)

We consider $M=45$ successful gradient steps and $N=100$ queries, and the oracle levels are based on client participation. Each client participates in a Markovian fashion with a probability of $0.8$ of staying connected (or disconnected). The $R=3$ oracle states correspond to minimum numbers of participating clients $\mathcal{Y}^{O}=[1=1,2=12,3=24]$. We consider $n_{i}=3$ incentive levels, $\{1,2,3\}$. The number of samples each client contributes in each round depends on the incentive, as $[10\%,40\%,80\%]$ of $N_{d}$ for the respective incentives. We consider a round successful if the number of samples exceeds $4000$. The empirical success probabilities are $\Gamma(o,i)=[[0,0.1,0.2],[0.1,0.2,0.6],[0.3,0.6,0.9]]$. The functions $\psi_{1}$ and $\psi_{2}$ in (7) are quadratic in $b$ and $o$, respectively. This satisfies assumptions (R1), (R2), (R3). The empirical success probabilities, along with the resulting cost function of (7), ensure that (R4), (R5), (R6) are satisfied for $\alpha_{b,b^{\prime},u}=\beta_{b,b^{\prime},u}\leq 1$. The queue cost is $d(b)\propto b^{4}$. The optimal stationary policy with the threshold structure is obtained using SPSA with $\phi_{k}=0.01$, $\delta=0.1$, and $H=3000$ episodes.

Type of Policy Learner Acc. Eaves. Acc. Incentive
Optimal Policy 90% 54% 254
Optimal Policy from [4] 89% 53% 290
Greedy Policy 91% 89% 300
Random Policy 52% 53% 190
TABLE I: The optimal stationary policy with a threshold structure outperforms the greedy policy by $35\%$ on eavesdropper accuracy and the random policy by $38\%$ on learner accuracy.

The results are averaged over $N_{mc}=100$ runs and reported in Table I. The greedy policy learns first with the maximum incentive, and the random policy samples uniformly from the action space. The optimal policy is better than the greedy policy in terms of the eavesdropper accuracy corresponding to the maximum a posteriori trajectory of (5). The optimal policy outperforms the random policy on learner accuracy. The learner saves $\sim 14\%$ of the incentive spent compared to the greedy policy. We also benchmark against the optimal policy from [4] with constant incentivization ($i_{k}=3$); similar to the greedy policy, the accuracies are comparable, but the optimal policy of this paper improves the incentive expenditure by $12\%$.

VI Conclusion

The proposed MDP framework solves the learner's problem of dynamically optimizing a function by querying and incentivizing a stochastic oracle while obfuscating an eavesdropper by switching between two stochastic gradient algorithms. Using interval dominance, we prove structural results on the monotone threshold nature of the optimal policy. In our numerical experiments, the optimal stationary policy with the threshold structure outperformed the greedy policy on eavesdropper accuracy and incentive spent. In future work, the problem of obfuscating sequential eavesdroppers can be formulated as a Bayesian social learning problem: initially, the eavesdropper is obfuscated maximally so that it stops participating, and its departure indicates to subsequent eavesdroppers that the learner is obfuscating. Hence, the eavesdroppers can eventually be made to herd, forming an information cascade so that they no longer eavesdrop, regardless of whether the learner is learning.

References

  • [1] C. Bersani, H. Dagdougui, C. Roncoli, and R. Sacile. Distributed Product Flow Control in a Network of Inventories With Stochastic Production and Demand. IEEE Access, 7:22486–22494, 2019.
  • [2] S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Now Publishers, 2012.
  • [3] S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23(4):2341–2368, Jan. 2013.
  • [4] A. Jain and V. Krishnamurthy. Controlling Federated Learning for Covertness. Transactions on Machine Learning Research, 2024.
  • [5] V. Krishnamurthy. Interval dominance based structural results for Markov decision process. Automatica, 153:111024, July 2023.
  • [6] H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media, July 2003.
  • [7] M. H. Ngo and V. Krishnamurthy. Monotonicity of Constrained Optimal Transmission Policies in Correlated Fading Channels With ARQ. IEEE Transactions on Signal Processing, 58(1):438–451, Jan. 2010.
  • [8] J. K.-H. Quah and B. Strulovici. Comparative Statics, Informativeness, and the Interval Dominance Order. Econometrica, 77(6):1949–1992, 2009. Publisher: Wiley, The Econometric Society.
  • [9] X. Shi, G. Wen, and X. Yu. Finite-Time Convergent Algorithms for Time-Varying Distributed Optimization. IEEE Control Systems Letters, 7:3223–3228, 2023.
  • [10] J. N. Tsitsiklis, K. Xu, and Z. Xu. Private Sequential Learning. Operations Research, 69(5):1575–1590, Sept. 2021.
  • [11] L. Witt, M. Heyer, K. Toyoda, W. Samek, and D. Li. Decentral and Incentivized Federated Learning Frameworks: A Systematic Literature Review. IEEE Internet of Things Journal, 10(4):3642–3663, Feb. 2023.
  • [12] J. Xu, K. Xu, and D. Yang. Learner-Private Convex Optimization. IEEE Transactions on Information Theory, 69(1):528–547, Jan. 2023.