Structured Reinforcement Learning for Incentivized Stochastic Covert Optimization
Abstract
This paper studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG. The oracle probabilistically returns either a noisy gradient of the function or a non-informative measurement, depending on the oracle state and incentive. The learner’s query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This paper formulates the problem of the learner performing covert optimization by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper as a finite-horizon Markov decision process (MDP). Using conditions for interval-dominance on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task.
Keywords: Markov decision process, Covert optimization, Structural results, Interval dominance
I Introduction
The learner aims to obtain an estimate of a local stationary point $x^*$ of a function $f$ (by minimizer we mean a local stationary point of $f$) by querying a stochastic oracle. At each time $k$, the learner sends a query $q_k$ and an incentive $i_k$ to a stochastic oracle in state $o_k$. The oracle returns a noisy gradient of $f$ evaluated at $q_k$ as follows:
$$ r_k \;=\; \begin{cases} \nabla f(q_k) + \eta_k, & \text{with probability } p(o_k, i_k), \\ \varnothing \;(\text{non-informative}), & \text{with probability } 1 - p(o_k, i_k). \end{cases} \qquad (1) $$
Here the noise terms $\eta_k$ are independent, zero-mean, finite-variance random variables, and $p(o_k, i_k)$ denotes the probability that the learner gets a noisy informative response from the oracle.
An eavesdropper observes the query $q_k$ and the incentive $i_k$ but not the response $r_k$. The eavesdropper aims to estimate $x^*$ as an approximation to the minimizer of the function it is interested in optimizing. This paper addresses the question: suppose the learner uses a stochastic gradient (SG) algorithm to obtain an estimate of $x^*$; how can the learner control the SG to hide this estimate from the eavesdropper?
Our proposed approach is to dynamically switch between two SGs. Let $u_k \in \{1, 2\}$ denote the SG chosen at time $k$. The first SG ($u_k = 1$) minimizes the function $f$ and updates the learner estimate $x_k$. The second SG ($u_k = 2$) obfuscates the eavesdropper with estimates $\hat{x}_k$. The update of both SGs is given by,
$$ x_{k+1} = x_k - \epsilon\, \mathbb{1}\{u_k = 1\}\, r_k, \qquad \hat{x}_{k+1} = \hat{x}_k - \epsilon\, \mathbb{1}\{u_k = 2\}\, \hat{r}_k, \qquad (2) $$
where $\epsilon$ is the step size (chosen constant in this paper), $\hat{r}_k$ is a synthetic gradient response discussed later, and $u_k$ controls which SG to update.
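To make the switching in (2) concrete, the following minimal Python sketch updates whichever SG is selected; the function name, the constant step size, and the convention that a non-informative reply is `None` are illustrative assumptions rather than the paper's implementation.

```python
def controlled_sg_step(x, x_obf, u, oracle_response, synthetic_response, eps=0.05):
    """One controlled stochastic gradient step in the spirit of (2).

    u == 1 updates the learning estimate x with the oracle response;
    u == 2 updates the obfuscating estimate x_obf with a synthetic gradient.
    A non-informative oracle reply (None) leaves the learning estimate unchanged.
    """
    if u == 1 and oracle_response is not None:    # informative reply (gamma_k = 1)
        x = x - eps * oracle_response
    elif u == 2:
        x_obf = x_obf - eps * synthetic_response  # parallel obfuscating SG
    return x, x_obf
```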
The query by the learner to the oracle is given by,
$$ q_k \;=\; \begin{cases} x_k, & u_k = 1 \;(\text{learning query}), \\ \hat{x}_k, & u_k = 2 \;(\text{obfuscating query}), \end{cases} \qquad (3) $$
and $u_k$ is the learner's control variable (action). The learner needs sufficiently many informative updates of (2) within the available query budget to achieve the learning objective. We formulate an MDP whose policy controls the switching of SGs and the incentivization by the learner, so as to minimize an expected cost balancing obfuscation and learning. The optimal policy solving the MDP is shown to have a threshold structure (Theorem 2) of the form,
$$ u^*_n \;=\; \begin{cases} \text{learn}, & l_n \ge \bar{l}(o_n, n), \\ \text{obfuscate}, & l_n < \bar{l}(o_n, n), \end{cases} $$
where $l_n$ is the number of informative learning steps left, $n$ is the number of queries left, and $\bar{l}$ is a threshold function depending on the oracle state $o_n$ and $n$. Note that the exact dependence on the incentive is discussed later. We propose a stochastic approximation algorithm to estimate the optimal stationary policy with a threshold structure, and a multi-armed bandit based approach with finite-time regret bounds (Theorem 3). The optimal stationary policy with a threshold structure is benchmarked in a numerical study on covert federated hate-speech classification.
Motivation: The main application of covert (or learner-private) optimization is in centralized distributed optimization [12, 10, 9]. One motivating example is pricing optimization and inventory management: the learner (e.g., an e-retailer) queries the distributed oracle (e.g., customers) about pricing and product preferences to estimate the optimal price and quantity of a product that maximize the profit function [10, 1]. A competitor could spoof as a customer and use the optimal price and quantity for their own competitive advantage. Our numerical experiment illustrates another application in federated learning, a form of distributed machine learning.
The current literature on covert optimization focuses on deriving upper and lower bounds on the query complexity for a given obfuscation level [12]. Query complexity for binary and convex covert optimization with a Bayesian eavesdropper has been studied in [10, 12]. These bounds assume a static oracle, so a random querying policy suffices to interleave obfuscation and learning. In contrast, the authors have studied dynamic covert optimization, where stochastic control is used to query a stochastic oracle optimally [4]. This differs markedly from the existing literature since a stochastic oracle models situations where the quality of gradient responses varies (e.g., due to Markovian client participation). The success of a response can be determined by the learner (e.g., based on gradient quality [4]) or by the oracle (e.g., based on the availability of computational resources).
Differences from previous work [4]: To prove that the optimal policy has a monotone threshold structure, [4] requires supermodularity conditions. This paper proves results under more relaxed conditions using interval dominance [8] in Theorem 2, which can incorporate convex cost functions and more general transition probabilities [5]. The action space in this work includes an incentive the learner provides to the oracle. The incentive is motivated by the learner's cost for obtaining a gradient evaluation of desired quality; it could be monetary compensation or non-monetary, e.g., controlling the latency of services to participating clients [11]. In [4] we used a generic cost function, whereas the costs considered in this paper are exact with respect to the learner's approximation of the eavesdropper's estimate of $x^*$.
II Covert optimization for first-order stochastic gradient descent
This section describes the two stochastic gradient algorithms, between which the learner dynamically switches to either learn or obfuscate using the MDP formulation of the next section. This section states the assumptions about the oracle, the learner, the eavesdropper, and the obfuscation strategy. We state the result on the number of successful gradient steps the learner needs to achieve the learning objective. The problem formulation for covert optimization is illustrated in Fig. 1.
II-A Oracle
The oracle evaluates the gradient of the function $f$. The following is assumed about the oracle and the function $f$:
- O1: The function $f$ is continuously differentiable and lower bounded. The gradient $\nabla f$ is Lipschitz continuous.
- O2: At time $k$, the oracle is in state $o_k \in \{1, \dots, O\}$, where $O$ is the number of oracle states, and for the incentive $i_k$ it replies with probability $p(o_k, i_k)$. The indicator $\gamma_k \in \{0, 1\}$ denotes the success of the reply.
- O3: For a reply with success $\gamma_k = 1$ to the query $q_k$, the oracle returns a noisy gradient response according to (1). The noise terms $\eta_k$ are independent (a slightly weaker assumption based on conditional independence was considered in our paper [4]; we assume independence here for brevity), zero-mean, and have finite variance.
O1 and O3 are standard assumptions for analyzing oracle-based gradient descent [3]. O2 is motivated by an oracle with a stochastic state (e.g., client participation), and the success is determined by the oracle or by the learner.
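The following sketch simulates an oracle consistent with O2-O3, assuming the success probabilities are stored in a lookup table `p[o][i]` and that a non-informative reply is returned as `None`; both conventions are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle_reply(grad_f, q, o, i, p, noise_std=0.1):
    """Illustrative stochastic oracle satisfying O2-O3.

    p[o][i] is the success probability (increasing in the incentive i). With
    probability p[o][i] the oracle returns a noisy gradient of f at the query q
    (zero-mean, finite-variance noise); otherwise it returns None, i.e. a
    non-informative reply.
    """
    if rng.random() < p[o][i]:                           # successful reply, gamma_k = 1
        eta = noise_std * rng.standard_normal(np.shape(q))  # gradient noise eta_k
        return grad_f(q) + eta
    return None                                          # non-informative, gamma_k = 0
```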
II-B Learner
Similar to oracle-based first-order gradient descent [3], the learner aims to estimate $\hat{x}$, a $\delta$-close critical point of the function $f$,
$$ \mathbb{E}\big[\|\nabla f(\hat{x})\|^2\big] \;\le\; \delta. \qquad (4) $$
Since $f$ is non-convex and not known in closed form to the learner, the gradient at a point is, in general, non-informative about the gradient at points far from it. Hence, at time $k$, the learner can either send a learning query or an obfuscating query. We propose controlling the learner's gradient descent by the query action $u_k$. While learning, the learner updates its estimate $x_k$ by performing the controlled stochastic gradient step of (2), where $\epsilon$ is a step size chosen to be constant in this paper. In the next section, we formally state the action space composed of the type of query $u_k$ and the incentive $i_k$. In order to estimate the number of queries that the learner has to spend on learning queries, we first define a successful gradient step and then state a result on the order of the number of successful gradient steps required to achieve the objective.
Definition 1
(Successful Gradient Step) A gradient step of (2) is successful when the learner queries the oracle with a learning query ($u_k = 1$) and gets a successful reply ($\gamma_k = 1$).
Theorem 1
Consider the learner with objective (4) querying an oracle satisfying (O1)-(O3) and performing the SG update (2) with a suitable constant step size $\epsilon$. Then the number of successful gradient steps (Def. 1), $L$, required to achieve (4) is of the order $\mathcal{O}(1/\delta)$.
The proof for a general setting can be found in [4] and in [3]. Theorem 1 characterizes the number of successful gradient steps $L$ that the learner needs to perform to achieve the learning objective of (4). It guarantees the existence of a finite queue state space in the MDP formulation of the next section, which models the number of successful gradient steps left to be taken. The value of $L$ can be chosen heuristically or computed exactly if the parameters of the function $f$ are known. The theorem also shows that $L$ is inversely dependent on $\delta$ and incorporates the descent dynamics into the structure of the optimal policy.
II-C Obfuscation Strategy
Based on the chosen SG $u_k$, the learner poses queries using (3) and provides incentives to the oracle. To obfuscate the eavesdropper, the learner runs a parallel stochastic gradient with synthetic responses $\hat{r}_k$. The synthetic responses can be generated by suitably simulating an oracle; e.g., the learner can separately train a neural network on an unbalanced subset of the data, as was done in [4]. If the learner is sure that the eavesdropper has no public dataset to validate against, the learner can simply mirror the gradients of (1). When obfuscating, the learner poses queries from the estimates $\hat{x}_k$ of the second SG.
The parallel stochastic gradient ensures that the eavesdropper cannot infer the true learning trajectory from the shape of the query trajectory. In summary, the learner obfuscates and learns by dynamically choosing the query $q_k$ either as the current estimate $x_k$ of the controlled stochastic gradient step or as the estimate $\hat{x}_k$ of the parallel SG. We assume that the learner queries such that the two trajectories are sufficiently separated from each other, so that the eavesdropper can cluster the queries and uniquely assign them to two trajectories, as described next.
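A minimal sketch of the query selection (3) and one possible synthetic-response generator follows; the mirrored-gradient choice is only one of the options mentioned above, and the helper names are ours.

```python
def make_query(x, x_obf, u):
    """Query (3): pose the learning estimate when u == 1, otherwise the
    obfuscating estimate of the parallel SG."""
    return x if u == 1 else x_obf


def mirrored_synthetic_gradient(last_informative_gradient):
    """One simple choice of synthetic response: mirror the last informative
    gradient, which suffices when the eavesdropper has no public data to
    validate against (see the text); training a separate network on an
    unbalanced data subset is the alternative used in [4]."""
    return -last_informative_gradient
```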
II-D Eavesdropper
At time $k$, the eavesdropper observes the query $q_k$ and the incentive $i_k$ of the learner. We use the following assumptions to approximate the posterior belief of the eavesdropper:
- E1: The eavesdropper is passive, does not affect the oracle, and cannot observe the oracle's responses $r_k$.
- E2: The eavesdropper can classify each query of the observed query sequence into one of two unique trajectory sequences, $\mathcal{T}_1$ and $\mathcal{T}_2$, which can be separated by a hyperplane in the query space.
- E3: The eavesdropper knows that either $\mathcal{T}_1$ or $\mathcal{T}_2$ is the learning trajectory. Given the information up to time $k$, the eavesdropper computes a posterior belief $\hat{\alpha}_k(j)$ that trajectory $\mathcal{T}_j$ is the learning trajectory.
Assumption E1 holds if the eavesdropper is an insignificant part of the oracle and cannot observe the aggregated response. Assumptions E2 and E3 ensure that the dynamic obfuscation is designed for a worst-case eavesdropper (as mentioned above, the queries are assumed to be posed such that the two trajectories are sufficiently separated by a metric known to the eavesdropper; one of the trajectories can be empty for the initial queries). We consider an eavesdropper using an incentive-weighted proportional sampling estimator for the posterior [12, 4],
$$ \hat{\alpha}_k(j) \;=\; \frac{\sum_{t \le k:\, q_t \in \mathcal{T}_j} i_t}{\sum_{t \le k} i_t}, \qquad j \in \{1, 2\}. \qquad (5) $$
The learner has access to the queries and the incentives, and uses (5) as an approximation of the eavesdropper's estimate to compute the cost incurred when learning. The following section formulates an MDP to perform covert optimization using stochastic control. The number of successful gradient steps $L$ from Theorem 1 and the oracle state are used to model the state space, while the incentive and the type of SG in (2) model the action space.
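One way to compute the approximation (5) of the eavesdropper's belief, assuming the learner records, for each past query, the incentive paid and the trajectory the query belongs to; the data layout is an illustrative assumption.

```python
def eavesdropper_posterior(incentives, labels):
    """Incentive-weighted proportional-sampling estimate in the spirit of (5).

    incentives[t] is the incentive observed at query t and labels[t] in {1, 2}
    is the trajectory the eavesdropper assigns that query to.  The belief that
    trajectory j is the learning trajectory is the fraction of the total
    incentive spent on queries in trajectory j.
    """
    total = float(sum(incentives))
    return {j: sum(i for i, lab in zip(incentives, labels) if lab == j) / total
            for j in (1, 2)}
```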
III MDP for achieving covert optimization
We formulate a finite-horizon MDP to solve the learner's decision problem. The learner chooses an incentive and dynamically either minimizes the function $f$ using the estimate $x_k$ or obfuscates the eavesdropper using $\hat{x}_k$. The learner wants to perform $L$ successful gradient steps in $N$ total queries. Using interval dominance, we show that the optimal policy of the finite-horizon MDP has a threshold structure. The stochastic control approach is summarized in Algorithm 1.
III-A MDP formulation for optimally switching between stochastic gradient algorithms
The dynamic programming index $n$ denotes the number of queries left and decreases with time $k$.
State Space: The state space is denoted by $\mathcal{S} = \mathcal{L} \times \mathcal{O}$, where $\mathcal{L} = \{0, 1, \dots, L\}$ is the learner queue state space and $\mathcal{O} = \{1, \dots, O\}$ is the oracle state space. The learner queue state $l_n$ denotes the number of successful gradient steps (Def. 1) remaining to achieve (4). The oracle state space discretizes the stochastic state of the oracle into $O$ levels (e.g., percentages of client participation in FL). $s_n = (l_n, o_n)$ denotes the state with $n$ queries remaining.
Action Space: The action space is $\mathcal{A} = \{1\ (\text{learn}), 2\ (\text{obfuscate})\} \times \mathcal{I}$, where $\mathcal{I}$ is the finite set of incentive levels. The action with $n$ queries remaining is $a_n = (u_n, i_n)$, where $u_n$ is the type of the query and $i_n$ is the incentive. To derive structural results on the optimal policy, we consider a transformation of the action space into a totally ordered set. A deterministic policy for the finite-horizon MDP is denoted by $\mu = (\mu_N, \dots, \mu_1)$, a sequence of functions, where each $\mu_n$ maps the state space to the action space. $\boldsymbol{\mu}$ denotes the space of all such policies.
Transition Probabilities: We assume that the evolution of the oracle state and the learner queue state is Markovian, and that the oracle state evolves independently of the queue state. In the case of a successful gradient step (Def. 1), the queue decreases by one, and the oracle state evolves from $o$ to $o'$ with probability $P_{o o'}$. Let $\mathbb{P}(\cdot \mid l, o', a)$ denote the transition probability vector of the buffer state given the future oracle state $o'$ and action $a$. The transition probability from state $(l, o)$ to state $(l', o')$ under action $a = (u, i)$ can be written as,
$$ \mathbb{P}\big((l-1, o') \,\big|\, (l, o), (1, i)\big) = p(o, i)\, P_{o o'}, \qquad \mathbb{P}\big((l, o') \,\big|\, (l, o), (u, i)\big) = \big(1 - p(o, i)\, \mathbb{1}\{u = 1\}\big)\, P_{o o'}, \qquad (6) $$
and is zero otherwise. The first equation corresponds to a successful gradient step, and the second to an unsuccessful one (or an obfuscating query). We assume that $p(o, i)$ (from O2) is increasing in the incentive $i$.
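The kernel below builds transition matrices consistent with (6), assuming the oracle chain `P_oracle` does not depend on the incentive and that an obfuscating query never decreases the queue; it is a sketch, not the paper's exact parameterization.

```python
import numpy as np

def transition_kernel(L, P_oracle, p_success):
    """Transition probabilities in the spirit of (6) (sketch).

    States are (l, o): l successful gradient steps remaining, oracle state o.
    Under a learning query (u = 1) with incentive index i, the queue decreases
    with probability p_success[o, i]; under an obfuscating query (u = 2) the
    queue is unchanged.  The oracle state evolves independently via P_oracle.
    Returns a dict mapping (u, i) to a transition matrix over states l * O + o.
    """
    O = P_oracle.shape[0]
    n_inc = p_success.shape[1]
    S = (L + 1) * O
    idx = lambda l, o: l * O + o
    T = {}
    for u in (1, 2):                      # 1 = learning query, 2 = obfuscating query
        for i in range(n_inc):
            M = np.zeros((S, S))
            for l in range(L + 1):
                for o in range(O):
                    p = p_success[o, i] if (u == 1 and l > 0) else 0.0
                    for o2 in range(O):
                        M[idx(l, o), idx(max(l - 1, 0), o2)] += p * P_oracle[o, o2]
                        M[idx(l, o), idx(l, o2)] += (1.0 - p) * P_oracle[o, o2]
            T[(u, i)] = M
    return T
```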
Learning and Queueing Cost: The learning cost $c_n(s_n, a_n)$ is the cost (with $n$ queries remaining) incurred after every action due to learning at the expense of reduced obfuscation. We consider a learning cost that is proportional to the logarithm of the improvement in the eavesdropper's estimate (5), given by,
$$ c_n\big((l, o), (u, i)\big) \;=\; \mathbb{1}\{u = 1\}\, \frac{h(l)}{g(o)}\, \log\frac{\hat{\alpha}^{+}(i)}{\hat{\alpha}} \;-\; \mathbb{1}\{u = 2\}\, \frac{g(o)}{h(l)}\, \log\frac{\hat{\alpha}}{\hat{\alpha}^{-}(i)}, \qquad (7) $$
where $g$ and $h$ are positive, convex, and increasing cost functions of the oracle and queue state, respectively, $\hat{\alpha}$ is the eavesdropper's estimate (5) of the trajectory being the true learning trajectory (computed from the sum of the previous incentives), and $\hat{\alpha}^{+}(i)$, $\hat{\alpha}^{-}(i)$ denote the updated estimates after a learning or an obfuscating query with incentive $i$. The functions $g$ and $h$ incorporate the cost with respect to the oracle and queue state; e.g., $g$, $h$, and the terminal cost function are taken quadratic in the respective states in the experiments. The form of the fractions ensures the structural results discussed next. The first term in (7) is the cost incurred by a learning query and is non-negative; the second term corresponds to an obfuscating query and is non-positive. The cost increases with the queue state and decreases with the oracle state. This incentivizes the learner to drive the system to a smaller queue and to learn when the oracle is in a good state. After $N$ queries, the learner pays a terminal queue cost computed using a function $c_T$; the queue cost accounts for the learning loss in terms of the terminal number of successful gradient steps left, $c_T(l_0)$.
Remark: The incentive improves the response probability $p(o, i)$, but also allows for better obfuscation than a non-incentivized setup (a high incentive can be used to misdirect the eavesdropper's belief in (5)).
III-B Optimization problem
The expected total cost for the finite-horizon MDP with initial state $s_N$ and policy $\mu$ is given by,
$$ J_{\mu}(s_N) \;=\; \mathbb{E}_{\mu}\Big[\, c_T(l_0) + \sum_{n=1}^{N} c_n(s_n, a_n) \;\Big|\; s_N \Big]. \qquad (8) $$
The optimization problem is to find the optimal policy $\mu^*$,
$$ \mu^* \;=\; \operatorname*{arg\,min}_{\mu \in \boldsymbol{\mu}}\; J_{\mu}(s_N). \qquad (9) $$
To define the optimal policy using a recursive equation, we first define the value function $V_n(s)$ with $n$ queries remaining,
$$ V_n(s) \;=\; \min_{\mu \in \boldsymbol{\mu}}\; \mathbb{E}_{\mu}\Big[\, c_T(l_0) + \sum_{m=1}^{n} c_m(s_m, a_m) \;\Big|\; s_n = s \Big]. \qquad (10) $$
Let the optimal policy be $\mu^* = (\mu^*_N, \dots, \mu^*_1)$, where $\mu^*_n(s)$ is the optimal action with $n$ queries remaining and is the solution of the following backward recursion (Bellman's equation),
$$ \mu^*_n(s) \;=\; \operatorname*{arg\,min}_{a \in \mathcal{A}}\; Q_n(s, a), \qquad V_n(s) \;=\; \min_{a \in \mathcal{A}}\; Q_n(s, a), \qquad (11) $$
where the Q-function is defined as,
$$ Q_n(s, a) \;=\; c_n(s, a) + \sum_{s' \in \mathcal{S}} \mathbb{P}(s' \mid s, a)\, V_{n-1}(s'), \qquad (12) $$
with the terminal condition $V_0\big((l, o)\big) = c_T(l)$. If the transition probabilities are unknown, then Q-learning can be used to estimate the optimal policy of (11). However, the following subsection shows that the optimal policy has a threshold structure, which motivates efficient policy-search algorithms.
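A compact backward-induction sketch of (10)-(12) over a flattened state space; for simplicity it assumes a time-invariant per-stage cost vector per action, whereas the cost (7) in the paper evolves with the eavesdropper's belief.

```python
import numpy as np

def backward_induction(T, cost, terminal_cost, N):
    """Finite-horizon dynamic programming for (10)-(12) (illustrative sketch).

    T[a] is the |S| x |S| transition matrix of (flattened) action a, cost[a]
    the per-stage cost vector of action a, and terminal_cost the vector c_T.
    Returns value functions V[n] and greedy actions policy[n] for
    n = 0, 1, ..., N queries remaining.
    """
    A, S = len(T), len(terminal_cost)
    V = np.zeros((N + 1, S))
    policy = np.zeros((N + 1, S), dtype=int)
    V[0] = terminal_cost                               # V_0(s) = c_T(l)
    for n in range(1, N + 1):
        Q = np.stack([cost[a] + T[a] @ V[n - 1] for a in range(A)])  # (12)
        policy[n] = Q.argmin(axis=0)                   # argmin in (11)
        V[n] = Q.min(axis=0)
    return V, policy
```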
III-C Structural Results
The following is assumed to derive the structural results,
- R1: The learning cost $c_n\big((l, o), a\big)$ is increasing and convex in the buffer state $l$ for each action $a$.
- R2: The transition probability matrix $P(a)$ is TP3 (totally positive of order 3: every third-order minor of $P(a)$ is non-negative) for each action $a$.
- R3: The terminal cost $c_T(l)$ is increasing and convex in the queue state $l$.
- R4: For actions $\bar{a} \ge a$ and a scaling constant $\lambda > 0$, the learning cost satisfies a single-crossing (interval-dominance) condition in the queue state $l$: the $\lambda$-scaled difference of the costs under $\bar{a}$ and $a$ is decreasing in $l$.
- R5: For actions $\bar{a} \ge a$ and a scaling constant $\lambda > 0$, the transition probability vectors of (6) under $\bar{a}$ convexly dominate the $\lambda$-scaled vectors under $a$ (probability vector $p$ is convex dominated by probability vector $q$ iff $p^{\top} v \le q^{\top} v$ for every increasing and convex vector $v$).
- R6: There exists a common constant $\lambda > 0$ such that (R4) and (R5) hold simultaneously.
Assumptions (R1) and (R3) hold by the construction of the cost in (7) and the terminal cost. (R2) is a standard assumption on bi-diagonal stochastic matrices made when analyzing structural results [5]. (R4), (R5), and (R6) generalize the supermodularity conditions used previously in [4] and are sufficient for interval dominance [5]. Assumption (R4) can be verified for the cost of (7) using algebraic manipulation, and (R5) can be shown for the bi-diagonal matrix of (6); therefore (R6) can be satisfied for some common constant $\lambda$. We now state the main structural result.
Theorem 2
Consider the finite-horizon MDP of (9) under assumptions (R1)-(R6). Then the optimal policy $\mu^*$ is monotone in the learner queue state: for each oracle state, incentive level, and number of queries remaining $n$, there exists a threshold on the queue state $l$ above which the optimal action is to learn and below which it is to obfuscate.
Proof:
Step 1 (Conditions for interval dominance): We need to show that the Q-function (12) satisfies the interval-dominance inequality (13) in the queue state $l$; the oracle state is omitted from the expressions for brevity. Plugging (12) into (13) yields an inequality with two parts. By (R4), part (a) of the inequality is satisfied with constant $\lambda$. The rest of the inequality follows from (R5) with a constant $\lambda$, provided the value function is increasing and convex (see the convex-dominance definition in (R5) and [5]). Finally, applying (R6) with a common $\lambda$ shows that (13) holds, so the optimal action is monotone (increasing) in the learner queue state $l$.
Step 2: It remains to show that the value function is increasing and convex in the queue state, which follows from (R1), (R2), (R3) by backward induction.
Theorem 2 implies that the policy is threshold in the learner queue state; hence, the learner learns more aggressively when more successful gradient steps (Def. 1) are left. This makes intuitive sense from an obfuscation perspective, since the learner should ideally spend more time obfuscating when it is closer to the minimizer (i.e., when the queue state is small).
Using Theorem 2, we can parameterize the optimal policy by thresholds on the queue state. Although a stochastic approximation algorithm can be constructed to estimate the non-stationary threshold policy, and it is computationally cheaper than Q-learning, the number of parameters would still be linear in the time horizon $N$. Given this insight, we restrict the search space to stationary policies with a monotone threshold structure; this restriction is common in the literature [4, 7].
Let the threshold on the queue state for oracle state $o$ and action $a$ be parameterized by $\theta(o, a)$. The optimal stationary policy with a threshold structure can be written as,
$$ \mu_{\theta^*}\big((l, o)\big) \;=\; \max\big\{\, a \in \mathcal{A} \;:\; l \ge \theta^*(o, a) \,\big\}, \qquad (14) $$
where $\theta^*$ is the optimal threshold function.
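A possible encoding of the stationary threshold policy (14), assuming the actions are totally ordered and thresholds are stored in a dictionary keyed by oracle state and action; this is a sketch of the structure, not the paper's code.

```python
def threshold_policy(l, o, theta, actions):
    """Stationary threshold policy of the form (14) (illustrative sketch).

    actions is a list of actions sorted in increasing order and theta[(o, a)]
    is the queue-state threshold for oracle state o and action a (with the
    smallest action's threshold set to 0 so that some action is always
    feasible).  The policy picks the largest action whose threshold the
    current queue state l exceeds.
    """
    feasible = [a for a in actions if l >= theta[(o, a)]]
    return feasible[-1] if feasible else actions[0]
```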
IV Estimating the optimal stationary policy with a threshold structure
In this section, we propose two methods to approximate the optimal stationary policy (referred to simply as the optimal policy in this section) for the finite-horizon MDP of (9), which has the monotone threshold structure of (14). The first method uses stochastic approximation to iteratively update the threshold parameters over learning episodes. The second method uses a multi-armed bandit formulation to perform discrete optimization over the space of thresholds. The proposed methods can be extended to a non-stationary policy space at increased time and memory complexity.
IV-A Simultaneous Perturbation Stochastic Approximation
Taking the thresholds of the stationary policy of (14) as the parameters, a simultaneous perturbation stochastic approximation (SPSA) based algorithm can be used to find the parameters for the optimal policy. We update the policy parameters using approximate gradients of the costs computed using perturbed parameters. We use the following sigmoidal approximation for the threshold policy of (14),
$$ \hat{\mu}_{\theta}\big((l, o), a\big) \;=\; \frac{1}{1 + \exp\!\big(-(l - \theta(o, a))/\tau\big)}, \qquad (15) $$
where $\tau > 0$ is an approximation parameter. The parameters are the threshold values, collected in the vector $\theta$. For the optimal parameters, the approximate policy converges to the optimal policy as $\tau \to 0$ [7]. For each learning episode with current parameter set $\theta$, the actions are computed using the current approximate policy (15). The policy parameters are perturbed simultaneously by a random perturbation of magnitude $\Delta$, with each component's sign chosen independently with probability $1/2$. Two learning episodes are performed, one with each set of perturbed policy parameters ($\theta + \Delta$, $\theta - \Delta$). The costs of the two episodes are used to obtain an approximate gradient by the method of finite differences, and the policy parameters are updated using a step size in the direction of the negative approximate gradient.
Under regularity conditions on the noise in the approximate gradients, the policy parameters asymptotically converge in distribution to the set of parameters of the optimal stationary policies with a threshold structure [6]. The SPSA algorithm can also be used with a constant step size to track changes in the system [4, 6]. Each parameter update requires only two learning episodes, irrespective of the number of threshold parameters.
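The SPSA iteration described above can be sketched as follows, with `run_episode` standing in for one simulated learning episode that returns its cumulative cost; the step size, perturbation size, and sigmoid temperature are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid_policy_prob(l, o, theta, tau=0.5):
    """Sigmoidal approximation in the spirit of (15): probability of sending a
    learning query given queue state l, oracle state o, and thresholds theta."""
    return 1.0 / (1.0 + np.exp(-(l - theta[o]) / tau))

def spsa_update(theta, run_episode, step=0.1, delta=0.5):
    """One SPSA iteration (sketch): perturb all thresholds simultaneously with
    independent +/- delta signs, run one episode with each perturbed parameter
    set, and move along the finite-difference gradient estimate.
    run_episode(theta) is assumed to return the noisy cumulative episode cost.
    """
    signs = rng.choice([-1.0, 1.0], size=theta.shape)
    pert = delta * signs
    cost_plus = run_episode(theta + pert)
    cost_minus = run_episode(theta - pert)
    grad_est = (cost_plus - cost_minus) / (2.0 * pert)   # SPSA gradient estimate
    return theta - step * grad_est
```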
IV-B Multi-armed bandit approach
The problem of searching the thresholds of (14) is solved by considering the values each threshold can take and treating the product space of the thresholds as bandit arms. Each threshold takes values over the learner queue state space $\mathcal{L}$, which has cardinality $L + 1$. Considering each combination of the thresholds as an arm, the total number of arms is $K = (L+1)^{d}$, where $d$ is the number of thresholds. The selection of an arm gives the corresponding stationary policy of the form (14), and a reward (the negative of the cumulative cost of the episode) is obtained by interacting with the oracle for the time horizon $N$. The noisy reward is sampled from a distribution centered at the expected value (8). (B1) The reward is assumed to be sampled independently for a given policy, and the noise is assumed to be sub-Gaussian [2]. For brevity, we omit the definition of regret and the exact upper bound, both of which can be found in Chapter 2 of [2]. The following is the upper bound on the regret for searching the thresholds.
Theorem 3
Consider the finite-horizon MDP of (9) for covert optimization with an oracle (O1-O3) to achieve (4). The optimal stationary policy with a threshold structure (14) can be searched for using the upper confidence bound (UCB) algorithm under (B1), with an expected regret after $T$ episodes bounded by $\mathcal{O}(\sqrt{K T \log T})$, where the number of arms $K$ is of the order $(L+1)^{d}$ and $d$ is the number of thresholds.
The proof follows from Theorem 1 by plugging the number of arms into the standard regret bound for UCB [2]. Although the regret of this approach is bounded, the significant limitations are that the bound is exponential in the state and action space and that, compared to SPSA, the approach cannot track changes in the system.
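A sketch of the UCB search over threshold arms under (B1); `run_episode` is assumed to return the noisy episode reward of the stationary policy induced by an arm, and the exploration constant is the textbook choice.

```python
import numpy as np
from itertools import product

def ucb_threshold_search(run_episode, queue_levels, n_thresholds, episodes=1000):
    """UCB over the product space of thresholds (illustrative sketch).

    Each arm is one assignment of the n_thresholds thresholds to a queue level;
    run_episode(arm) is assumed to return the noisy episode reward (negative
    cumulative cost) of the corresponding stationary threshold policy.
    """
    arms = list(product(queue_levels, repeat=n_thresholds))
    K = len(arms)
    counts = np.zeros(K)
    means = np.zeros(K)
    for t in range(episodes):
        if t < K:
            a = t                                        # play every arm once
        else:
            ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)
            a = int(np.argmax(ucb))
        reward = run_episode(arms[a])
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]      # incremental mean update
    return arms[int(np.argmax(means))]
```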
V Example: Covert Federated Learning for Hate-Speech Classification
We demonstrate the proposed covert optimization framework in a numerical experiment on hate-speech classification using federated learning in the presence of an eavesdropper. The eavesdropper spoofs as a client and misuses the optimal weights to generate hate speech that goes undetected by the classifier. The detailed motivation and experimental setup can be found in [4]. A balanced subset of the Civil Comments toxicity dataset by Jigsaw AI is used, consisting of comments annotated for whether or not they are toxic. The federated learning setup consists of multiple clients, each holding a fixed number of data points. A fully connected neural network attached to a pre-trained transformer is trained with a cross-entropy loss to classify comments as toxic or not. Accuracy is reported on a balanced validation dataset. The results are reproducible from the GitHub repository github.com/aditj/CovertOptimization, which also contains links to the dataset, the complete set of experimental parameters, and a supplementary document with additional benchmarks and illustrations.
We consider a fixed budget of $L$ successful gradient steps and $N$ total queries, and the oracle levels are based on client participation. Each client participates in a Markovian fashion with fixed probabilities of staying connected or disconnected. The oracle states correspond to the minimum number of participating clients. We consider three incentive levels; the number of samples each client contributes in each round depends on the incentive, as 10%, 40%, and 80% of its local data for the respective incentives. A round is considered successful if the number of samples exceeds a fixed threshold, and the corresponding empirical success probabilities are computed. The functions $g$ and $h$ in (7) are quadratic in the oracle and queue state, respectively; this satisfies assumptions R1, R2, and R3. The empirical success probabilities, together with the resulting cost function of (7), ensure that R4, R5, and R6 are satisfied. The queue cost is quadratic in the terminal queue state. The optimal stationary policy with the threshold structure is obtained using SPSA.
Type of Policy | Learner Acc. | Eaves. Acc. | Incentive
---|---|---|---
Optimal Policy | 90% | 54% | 254
Optimal Policy from [4] | 89% | 53% | 290
Greedy Policy | 91% | 89% | 300
Random Policy | 52% | 53% | 190
The results are averaged over multiple runs and reported in Table I. The greedy policy learns first with the maximum incentive, and the random policy samples uniformly from the action space. The optimal policy is better than the greedy policy in terms of the eavesdropper accuracy corresponding to the maximum a posteriori trajectory of (5), and it outperforms the random policy on learner accuracy. The learner also saves incentive expenditure compared to the greedy policy. We further benchmark against the optimal policy from [4] with constant incentivization: similar to the greedy policy, the accuracies are comparable, but the optimal policy of this paper reduces the incentive expenditure (Table I).
VI Conclusion
The proposed MDP framework solves the learner's problem of dynamically optimizing a function by querying and incentivizing a stochastic oracle while obfuscating an eavesdropper by switching between two stochastic gradients. Using interval dominance, we prove structural results on the monotone threshold nature of the optimal policy. In our numerical experiments, the optimal stationary policy with the threshold structure outperformed the greedy policy on eavesdropper accuracy and on the incentive spent. In future work, the problem of obfuscating sequential eavesdroppers can be formulated as a Bayesian social learning problem: the first eavesdropper is maximally obfuscated until it stops participating, and its departure signals to subsequent eavesdroppers that the learner is obfuscating. Hence, the eavesdroppers can eventually be made to herd, forming an information cascade in which they stop eavesdropping altogether, regardless of whether the learner is learning or not.
References
- [1] C. Bersani, H. Dagdougui, C. Roncoli, and R. Sacile. Distributed Product Flow Control in a Network of Inventories With Stochastic Production and Demand. IEEE Access, 7:22486–22494, 2019.
- [2] S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Now Publishers, 2012.
- [3] S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23(4):2341–2368, Jan. 2013.
- [4] A. Jain and V. Krishnamurthy. Controlling Federated Learning for Covertness. Transactions on Machine Learning Research, 2024.
- [5] V. Krishnamurthy. Interval dominance based structural results for Markov decision process. Automatica, 153:111024, July 2023.
- [6] H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media, 2003.
- [7] M. H. Ngo and V. Krishnamurthy. Monotonicity of Constrained Optimal Transmission Policies in Correlated Fading Channels With ARQ. IEEE Transactions on Signal Processing, 58(1):438–451, Jan. 2010.
- [8] J. K.-H. Quah and B. Strulovici. Comparative Statics, Informativeness, and the Interval Dominance Order. Econometrica, 77(6):1949–1992, 2009. Publisher: Wiley, The Econometric Society.
- [9] X. Shi, G. Wen, and X. Yu. Finite-Time Convergent Algorithms for Time-Varying Distributed Optimization. IEEE Control Systems Letters, 7:3223–3228, 2023.
- [10] J. N. Tsitsiklis, K. Xu, and Z. Xu. Private Sequential Learning. Operations Research, 69(5):1575–1590, Sept. 2021.
- [11] L. Witt, M. Heyer, K. Toyoda, W. Samek, and D. Li. Decentral and Incentivized Federated Learning Frameworks: A Systematic Literature Review. IEEE Internet of Things Journal, 10(4):3642–3663, Feb. 2023.
- [12] J. Xu, K. Xu, and D. Yang. Learner-Private Convex Optimization. IEEE Transactions on Information Theory, 69(1):528–547, Jan. 2023.