
BGTplanner: Maximizing Training Accuracy for Differentially Private Federated Recommenders via Strategic Privacy Budget Allocation

Xianzhi Zhang, Yipeng Zhou, Miao Hu, Di Wu, Pengshan Liao, Mohsen Guizani, and Michael Sheng

Xianzhi Zhang, Miao Hu, Di Wu, and Pengshan Liao are with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, and the Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China (e-mail: {zhangxzh9, liaopsh}@mail2.sysu.edu.cn; {humiao5, wudi27}@mail.sysu.edu.cn). Yipeng Zhou and Michael Sheng are with the Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, Australia (e-mail: {yipeng.zhou, michael.sheng}@mq.edu.au). Mohsen Guizani is with the Machine Learning Department, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE (e-mail: [email protected]).
Abstract

To mitigate the rising concern about privacy leakage, the federated recommender (FR) paradigm emerges, in which decentralized clients co-train the recommendation model without exposing their raw user-item rating data. The differentially private federated recommender (DPFR) further enhances FR by injecting differentially private (DP) noises into clients. Yet, current DPFRs, suffering from noise distortion, cannot achieve satisfactory accuracy. Various efforts have been dedicated to improving DPFRs by adaptively allocating the privacy budget over the learning process. However, due to the intricate relation between privacy budget allocation and model accuracy, existing works are still far from maximizing DPFR accuracy. To address this challenge, we develop BGTplanner (Budget Planner) to strategically allocate the privacy budget for each round of DPFR training, improving overall training performance. Specifically, we leverage the Gaussian process regression and historical information to predict the change in recommendation accuracy with a certain allocated privacy budget. Additionally, Contextual Multi-Armed Bandit (CMAB) is harnessed to make privacy budget allocation decisions by reconciling the current improvement and long-term privacy constraints. Our extensive experimental results on real datasets demonstrate that BGTplanner achieves an average improvement of 6.76% in training performance compared to state-of-the-art baselines.

Index Terms:
data-sharing platforms, differential privacy, budget allocation, online algorithm.

I Introduction

Nowadays, machine learning has been widely applied in various recommender systems, necessitating the processing of substantial user data. Consequently, there is increasing concern from users about the security and privacy of their data [1, 2]. To protect private raw data, the federated recommender (FR) paradigm [3, 4, 5] has emerged, offering a solution to the conundrum of balancing data-driven progress with the preservation of individual privacy. In this evolving landscape, organizations mainly exchange training parameters with distributed data sources for improving recommendations while respecting user privacy [6].

Nevertheless, revealing training parameters instead of raw data proves inadequate for ensuring privacy protection, invoking the development of differentially private federated recommenders (DPFRs) [6, 7]. Studies indicate that training parameters, including gradient values, can facilitate the reconstruction of a segment of the original data [8], or enable inference about whether certain records pertain to a specific participant [9, 10]. To ensure data privacy, DPFRs adopt an additional safeguard, i.e., differential privacy (DP) [11, 12, 13], which injects noise to distort training parameters and has emerged as a key consideration in privacy-preserving recommender systems.

However, the practical application of DPFRs is impeded by substantially compromised recommendation accuracy due to the distortion caused by DP noise. The challenge is rooted in the complicated relationship between privacy budget consumption and learning improvement over the multiple exposures of training parameters in the DPFR learning process. On the one hand, to guarantee privacy, the total privacy budget is constrained, and only a fixed small amount of privacy budget can be consumed by each exposure [14, 15]. However, a smaller privacy budget implies a larger noise scale and a more significant detriment to DPFR accuracy. On the other hand, the FR training process is highly dynamic with heterogeneous data scattered among users, resulting in a dynamic learning process [16]. Consequently, it is difficult to predict the improvement of each learning round in advance.

When facing multiple exposures, the naive approach evenly allocates a fixed amount of privacy budget to each training round, regardless of the discrepant learning improvements during the dynamic training process. Such an approach fails to maximize recommendation accuracy because the privacy budget is probably wasted if it is spent on protecting updates that bring little learning improvement. Recently, a few works have attempted to tune the privacy budget allocation over multiple learning rounds adaptively. However, most of these works, such as ADPML [17], are heuristics-based algorithms, far from the optimal budget allocation.

TABLE I: The main notations and definitions in the paper.
Notation | Definition
$u\in\mathcal{U}$ | The client index and the set of all clients.
$t\,/\,\tau\,/\,T$ | The training round index, the total training rounds with early termination, and the maximum number of training rounds.
$\bm{\mathrm{X}}^{t}\,/\,\bm{x}^{t}$ | The original context matrix of training data information and the SVD-reduced vector in round $t$.
$d$ | The hyper-parameter representing the dimension of the vector $\bm{x}^{t}$.
$\bm{\epsilon}^{total}\,/\,\bm{\epsilon}^{t}$ | The total privacy budget vector and the privacy budget cost vector for all clients in training round $t$.
$(\epsilon_{u}^{t},\delta_{u}^{t})$ | The privacy budget of client $u\in\mathcal{U}$ in round $t$.
$C_{u}(\cdot)$ | The budget generating function of DP noise on client $u$.
$\bm{\iota}_{u}^{t}\,/\,\bm{\nu}_{u}^{t}\,/\,\bm{\omega}^{t}_{u}$ | The parameters of the local item embedding, the local user embedding, and the recommender model for client $u$ in round $t$.
$\nabla\bm{\iota}_{u}^{t}\,/\,\nabla\bm{\nu}_{u}^{t}\,/\,\nabla\bm{\omega}^{t}_{u}$ | The gradients of the local item embedding, the local user embedding, and the recommender model for client $u$ in round $t$.
$a^{t}\in\mathcal{A}$ | The budget allocation action in round $t$ and the action space.
$A$ | The size of the action space.
$r^{t}\,/\,\bm{r}^{t}$ | The reward of model training in round $t$, defined as the change of training loss or recommendation accuracy between rounds $t$ and $t-1$ / the vector of all observed rewards until round $t$.
$\bm{z}^{t}$ | The concatenated input vector of action $a\in\mathcal{A}$ and context vector $\bm{x}^{t}$ for the GPR model.
$\hat{r}(\bm{z}^{t})$ or $\hat{r}(a,\bm{x}^{t})$ | The reward predicted by the GPR model with action $a\in\mathcal{A}$ and context vector $\bm{x}^{t}$.
$\mu(\bm{z}^{t})\,/\,\sigma(\bm{z}^{t})$ | The mean and variance functions of the conditional distribution $\mathcal{N}(\mu(\bm{z}^{t}),\sigma(\bm{z}^{t}))$ used by the GPR model to predict the reward $\hat{r}(a,\bm{x}^{t})$.
$\bm{\lambda}^{t}$ | The vector of dual variables (Lagrange multipliers) for the long-term budget constraint in round $t$.
$\mathcal{V}\,/\,\Lambda$ | The space of dual variables $\bm{\lambda}$ and the $l_{1}$-radius of the space $\mathcal{V}$.
$T_{0}$ | The number of rounds each action is played in the initial stage of BGTplanner.
$\beta^{t}(a)$ | The CMAB score function for action $a$ in round $t$.
$p^{t}(a)$ | The probability for CMAB to choose action $a$ in round $t$.

To address the challenges in allocating privacy budgets in DPFR, we introduce a novel online algorithm, called BGTplanner (Budget Planner). Different from prior approaches, our method employs the Gaussian process regression (GPR) to predict learning improvement on the fly based on historical information and dynamic contexts (e.g., dynamic information of training items). Next, Contextual Multi-Armed Bandit (CMAB) is leveraged to balance the tradeoff between the current learning improvement and long-term privacy budget constraints, which can avoid greedily depleting the privacy budget in a single round. Our contributions can be summarized as follows:

  • We formulate an online privacy budget allocation problem for the DPFR system. Then, we exploit historical information and Gaussian process regression to predict the dynamic learning improvement under different contexts.

  • We harness the Contextual Multi-Armed Bandit (CMAB) method to make privacy budget allocation decisions for reconciling the current learning improvement and long-term privacy budget constraints.

  • We provide sufficient theoretical analyses for BGTplanner. Furthermore, through extensive experiments, we demonstrate the superiority of our method compared to state-of-the-art baselines, with an average improvement of 6.76% in RMSE and 0.71% in F1-score.

The remainder of this paper is organized as follows. In Section II, we review the related works. Section III provides the necessary background on DPFR and CMAB. Our system model and problem formulation are discussed in Section IV. Section V introduces the design of our algorithm, and Section VI presents the differential privacy, regret, and overhead analyses. In Section VII, we present comprehensive performance evaluations. Finally, we conclude the paper in Section VIII.

II Related Work

II-A Training Federated Recommender with DP

Training recommender models often relies on sensitive data, such as users’ historical interactions with items. To protect original data from leaking to malicious attackers, FR, a distributed learning framework [18], has been widely adopted for training recommender models [3, 6]. Nevertheless, revealing training parameters rather than raw data has proven inadequate for ensuring privacy protection [8, 9, 10], leading to the development of DPFR. To safeguard data privacy, DPFRs incorporate DP mechanisms [11], which enhance privacy by injecting noise to distort the exposed training parameters. This approach has emerged as a key consideration in privacy-preserving recommenders [19], although its practical adoption is still hindered by reduced recommendation accuracy.

II-B Differential Privacy Budget Allocation

The privacy budget is a non-recoverable resource, which must be split into multiple small portions to restrict the privacy loss per training round [14, 15]. It has been reported in [19] that DP noises generated by a small privacy budget per round can severely impair model accuracy. Existing studies have initially attempted to address the influence of noise on DPFR. For example, [20] explored dynamic techniques for adjusting the learning rate, batch size, noise magnitude, and gradient clipping to improve DPFR accuracy. Yet, this work does not account for dynamic factors in the recommender training process, such as online interactive ratings and randomly generated noises [3]. The dynamic strategy introduced by [17] focused on privacy budget allocation in a heuristic-based method. However, when applied to DPFR, it confronts the limitation that a heuristically designed strategy can be far from the optimal privacy budget allocation.

CMAB is an online decision-making scheme applicable in complex dynamic environments [21]. [22] applied CMAB to improve personalized FL. [23, 21] applied CMAB to improve client selection for the FL framework by overcoming dynamic training environments. Because of the dynamic nature of user interactions with recommenders, we are among the first to utilize CMAB to optimize the allocation of the privacy budget in DPFR.

III Preliminaries

III-A Differential Privacy and Federated Recommender

Differential privacy (DP) [24, 12] is a mechanism designed to limit an attacker’s ability to gain additional knowledge about individuals in a dataset. A typical approach to achieve DP is by introducing randomness into computations, which conceals the details of individual data records. Formally, DP is characterized by $\epsilon$, which defines the upper bound on the difference between two neighboring inputs, and $\delta$, which represents the residual probability. The formal definition of classic ($\epsilon,\delta$)-DP is provided in Definition 1.

Definition 1 (($\epsilon,\delta$)-DP).

For any $\epsilon>0$ and $\delta\in[0,1]$, a randomized algorithm $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{O}$ satisfies ($\epsilon,\delta$)-DP if for any two neighboring inputs $D,D^{\prime}\in\mathcal{D}$ and any subset of output $O\subset\mathcal{O}$, the following holds:

$\mathbf{Pr}[\mathcal{M}(D)\in O]\leq e^{\epsilon}\cdot\mathbf{Pr}[\mathcal{M}(D^{\prime})\in O]+\delta.$ (1)

In the DP mechanism, revisiting the same dataset increases the risk of privacy leakage compared to a single access [25]. The composition theorem [26] states that the privacy budget per query should decrease as the number of accesses to the dataset grows. The most fundamental form of this is the sequential composition theorem, presented in Theorem 1.

Theorem 1 (Sequential Composition [26]).

Let the randomized algorithm $\mathcal{M}^{t}:\mathcal{D}\rightarrow\mathcal{O}$ satisfy ($\epsilon^{t},\delta^{t}$)-DP for all $t\leq T,t\in\mathbb{N}^{+}$. Then, the composition $\mathcal{M}$ of all algorithms for the dataset $D\in\mathcal{D}$, defined by $\mathcal{M}(D)=(\mathcal{M}_{1}(D),\dots,\mathcal{M}_{T}(D))$, satisfies ($\sum_{t=1}^{T}\epsilon^{t},\sum_{t=1}^{T}\delta^{t}$)-DP.

Following this, in federated learning [18, 27], clients can securely update the recommender models using private datasets and upload the noisy gradients to the server for global aggregation in each training round. Specifically, a client $u$ in the differentially private federated recommender (DPFR) framework computes the gradients of the local item embedding $\nabla\bm{\iota}_{u}^{t}$, the local user embedding $\nabla\bm{\nu}_{u}^{t}$, and the other recommender parameters $\nabla\bm{\omega}^{t}_{u}$, respectively [3, 28]. Before uploading these local updates to the server, the client distorts the local gradients by adding noise: $\nabla\tilde{\bm{\omega}}^{t}_{u}=\nabla\bm{\omega}^{t}_{u}+\bm{n}_{\omega}$ and $\nabla\tilde{\bm{\iota}}_{u}^{t}=\nabla\bm{\iota}_{u}^{t}+\bm{n}_{\iota}$, where $\bm{n}_{u}=[\bm{n}_{\omega},\bm{n}_{\iota}]$ represents the DP noise generated by a specific DP noise mechanism (e.g., Gaussian [14] or Laplace [15]) based on the allocated privacy budget $(\epsilon_{u}^{t},\delta_{u}^{t})$. (The user embedding is generally stored locally after being updated, and therefore does not require noise addition.)
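To make the client-side perturbation concrete, the following minimal sketch adds calibrated noise to a clipped gradient before uploading. The function name, the `sensitivity` argument, and the Gaussian calibration formula are illustrative assumptions rather than the paper's exact accounting.

```python
import numpy as np

def add_dp_noise(grad, epsilon, delta, sensitivity, mechanism="gaussian", rng=None):
    """Distort a clipped gradient with DP noise before uploading (a sketch).

    `sensitivity` is the clipping bound enforced on the gradient; the
    calibration below is the textbook one and may differ from the paper's
    accounting (e.g., RDP-based splitting for the Gaussian mechanism).
    """
    rng = rng or np.random.default_rng()
    if mechanism == "gaussian":
        # Analytic calibration for (epsilon, delta)-DP with l2 sensitivity.
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
        noise = rng.normal(0.0, sigma, size=grad.shape)
    else:
        # Laplace mechanism gives (epsilon, 0)-DP with l1 sensitivity.
        noise = rng.laplace(0.0, sensitivity / epsilon, size=grad.shape)
    return grad + noise
```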

III-B Contextual Multi-Armed Bandit Algorithm

The foundational concept of the Contextual Multi-Armed Bandit (CMAB) algorithm is essential for addressing decision-making challenges that involve balancing exploration and exploitation. Analogous to rows of slot machines (one-armed bandits), the CMAB algorithm deals with selecting actions over successive rounds to maximize cumulative rewards. Let $A$ denote the number of arms (or actions), and $t$ represent discrete time steps. At each time step $t\in\{1,\dots,T\}$, the algorithm selects an action $a^{t}$ from the action space $\mathcal{A}=\{1,2,\dots,A\}$ based on observed environmental information (e.g., a context matrix $\mathbf{X}^{t}$). The reward $r^{t}$ corresponding to the chosen action is then observed. The goal of the algorithm is to iteratively refine its action-selection strategy to maximize the expected cumulative reward over time, which can be expressed as:

$\max_{\bm{a}}\sum_{t=1}^{T}\mathbb{E}\left[r^{t}\mid a^{t},\mathbf{X}^{t}\right].$ (2)

IV System Model

This section encompasses preliminary knowledge of DPFR, the problem formulation, and an overview of BGTplanner. To facilitate readability, we have compiled a summary of the notations used in this paper in Table I.

IV-A System Overview

In Fig. 1, we present the workflow of a typical DPFR system. There are two main components: the server and clients. The server performs two functions: aggregating model parameters of the recommender and allocating the privacy budget to each client to determine the noise scale [29, 30]. Clients are responsible for conducting local model updates with private datasets. Before uploading local recommender models to the server, a DP noise adder generates noise according to the allocated privacy budget to distort the local models [3, 19, 6].

In the system, there are $U$ clients, denoted by the set $\mathcal{U}=\{1,\cdots,U\}$. The DPFR can conduct at most $T$ rounds of model training. The total privacy loss of each client is constrained by a fixed total privacy budget $(\epsilon_{u}^{total},\delta_{u}^{total}),\forall u\in\mathcal{U}$, to be consumed throughout the entire training. During the training process, clients and the server exchange information multiple times via the following five steps.

Figure 1: An overview of a typical DPFR system.

In Step ①, all clients upload the ID numbers of the item samples used for local training. The server reshapes this ID information into the context matrix $\bm{\mathrm{X}}^{t}=[\mathrm{X}_{u,i}^{t}]^{U\times I}$. Here, $I$ represents the total number of items, and $\mathrm{X}_{u,i}^{t}\in\{0,1\}$: $\mathrm{X}_{u,i}^{t}=1$ indicates that item $i$ is used for training on client $u$ in this round, while $\mathrm{X}_{u,i}^{t}=0$ signifies that the item is not used. Note that $\bm{\mathrm{X}}^{t}$ is commonly uploaded together with the updated parameters in Step ④ for aggregating item embeddings in related works [3]; it makes no difference which step $\bm{\mathrm{X}}^{t}$ is uploaded in. (It is worth mentioning that $\bm{\mathrm{X}}^{t}$ can be obfuscated by clients with pseudo records before uploading to enhance privacy. Since our focus is not on how to obfuscate $\bm{\mathrm{X}}^{t}$, we simply assume that existing obfuscation methods [3] are applied.)
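As a small illustration of Step ①, the sketch below assembles the binary context matrix $\bm{\mathrm{X}}^{t}$ from the uploaded item IDs; the helper name and argument layout are hypothetical.

```python
import numpy as np

def build_context_matrix(item_ids_per_client, num_items):
    """Form the binary U x I matrix X^t from the item IDs uploaded in Step 1.

    item_ids_per_client: list of length U, one iterable of item indices per
    client for the current round (illustrative data layout).
    """
    num_clients = len(item_ids_per_client)
    X = np.zeros((num_clients, num_items), dtype=np.int8)
    for u, items in enumerate(item_ids_per_client):
        X[u, list(items)] = 1  # X_{u,i} = 1 iff item i is used by client u
    return X
```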

In Step ②, the server allocates a privacy budget $(\epsilon_{u}^{t},\delta_{u}^{t})$ to client $u\in\mathcal{U}$ in round $t$. If a client exhausts its privacy budget, it quits the training process. Besides, the server can verify the noise scale [29, 30] and account for cumulative privacy budget consumption using privacy accounting methods for different DP mechanisms, such as native composition for Laplace noise [14] and the Moments Accountant for Gaussian noise [31].

In Steps ③ and ④, client $u$ updates the recommender model via the Local Trainer using the latest user-item interactions in its local data. Local updating yields the gradients of the local item embedding $\nabla\bm{\iota}_{u}^{t}$, the local user embedding $\nabla\bm{\nu}_{u}^{t}$, and the other recommender parameters $\nabla\bm{\omega}^{t}_{u}$, respectively [3]. After local training, client $u$ uploads $\nabla\tilde{\bm{\omega}}^{t}_{u}=\nabla\bm{\omega}^{t}_{u}+\bm{n}_{\omega}$ and $\nabla\tilde{\bm{\iota}}_{u}^{t}=\nabla\bm{\iota}_{u}^{t}+\bm{n}_{\iota}$, where $\bm{n}_{u}=[\bm{n}_{\omega},\bm{n}_{\iota}]$ represents the DP noise generated by the DP Noise Adder according to the allocated privacy budget $(\epsilon_{u}^{t},\delta_{u}^{t})$.

In Step ⑤, the Model Aggregator deployed at the server aggregates the recommender parameters and item embedding vectors based on $\bm{\mathrm{X}}^{t}$. For more details, please refer to [3]. Then, the server broadcasts the aggregated parameters and item embedding vectors to all clients.

IV-B Online Privacy Budget Allocation Problem

Our study primarily focuses on designing a more advanced BGTplanner. In our design, we suppose that the server can select an action $a^{t}\in\mathcal{A}$ for privacy budget allocation, where $\mathcal{A}=\{1,2,\ldots,A\}$ represents $A$ different levels of privacy budgets. The action $a^{t}$ corresponds to the privacy budget $(\epsilon_{u}^{t},\delta_{u}^{t})$, denoted by $(\epsilon_{u}^{t},\delta_{u}^{t})=C_{u}(a^{t})$, for generating DP noises on client $u$. Our objective is to maximize the final recommendation accuracy, gauged by the cumulative reward $\sum_{t=1}^{T}r^{t}$. Here, $r^{t}$ denotes the reward of round $t$, which can be defined as the change of training loss or recommendation accuracy between rounds $t$ and $t-1$. The FL training performance is highly dependent on data quality. Thus, $r^{t}$ should be related to the privacy budget allocated to round $t$ and the training records contributed by clients. According to the process in Fig. 1, training records are represented by $\bm{\mathrm{X}}^{t}$. Thus, we define the per-round objective as $\mathbb{E}\left[r^{t}\mid a^{t},\bm{\mathrm{X}}^{t}\right]$. Given the limited total privacy budget, our problem can be formulated as follows:

$\mathbb{P}1:\ \max_{\bm{a},\tau}\ \sum_{t=1}^{\tau}\mathbb{E}\left[r^{t}\mid a^{t},\bm{\mathrm{X}}^{t}\right],$ (3a)
$\text{s.t.}\ \sum_{t=1}^{\tau}(\epsilon_{u}^{t},\delta_{u}^{t})\leq(\epsilon_{u}^{total},\delta_{u}^{total}),\ \forall u\in\mathcal{U},$ (3b)
$(\epsilon_{u}^{t},\delta_{u}^{t})=C_{u}(a^{t}),\ \forall u\in\mathcal{U},\forall t,$ (3c)
$a^{t}\in\mathcal{A},\ \forall t,\ \tau\leq T.$ (3d)

There are two challenges in solving $\mathbb{P}1$. First, we need to specify the expression of $\mathbb{E}\left[r^{t}\mid a^{t},\bm{\mathrm{X}}^{t}\right]$. To the best of our knowledge, the exact expression is unavailable in existing works. Second, $\mathbb{P}1$ cannot be maximized by a greedy algorithm. If we only consider $\mathbb{E}\left[r^{t}\mid a^{t},\bm{\mathrm{X}}^{t}\right]$ for round $t$, we should allocate all budget to round $t$ such that $\mathbb{E}\left[r^{t}\mid a^{t},\bm{\mathrm{X}}^{t}\right]$ is maximized, which unfortunately depletes the privacy budget for other training rounds and results in poor final recommendation accuracy.

V BGTplanner Design

Our solution to $\mathbb{P}1$ is depicted in Fig. 2. Briefly speaking, the Predictor employs Gaussian process regression (GPR) to predict the reward $\mathbb{E}\left[r^{t}\mid a^{t},\bm{\mathrm{X}}^{t}\right]$ based on historical information. To avoid the trap of greedy budget allocation, the Constrainer properly penalizes the allocation of each round. The Allocator makes the final budget allocation decision by considering both the current improvement reward and the penalized budget cost.

Figure 2: The workflow of BGTplanner.

V-A Instant Training Reward Predictor Design

In our problem, accurate prediction of rewards serves as a cornerstone of BGTplanner. Given an action $a$, we employ GPR to model the relationship between the context input, i.e., $\bm{\mathrm{X}}^{t}$, and the corresponding reward.

In a typical recommender, $\bm{\mathrm{X}}^{t}$ is a high-dimensional matrix, making it impossible to directly apply GPR. To reduce the dimension of $\bm{\mathrm{X}}^{t}$, we conduct a Singular Value Decomposition (SVD) to obtain $\bm{\mathrm{X}}^{t}=\bm{\mathrm{U}}\Sigma\bm{\mathrm{V}}^{*}$. Then, we rank the diagonal elements of $\Sigma$ in descending order and reserve only the top $d$ elements, where $d$ is a tunable hyper-parameter. Let $\bm{x}^{t}$ denote the vector containing the reserved $d$ elements. Unless otherwise specified, we use $\bm{x}^{t}$ to denote the context vector hereafter.
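A minimal sketch of this dimension reduction is given below: it keeps the $d$ largest singular values of $\bm{\mathrm{X}}^{t}$ as the context vector $\bm{x}^{t}$, zero-padding when the matrix rank is below $d$ (the padding is our assumption).

```python
import numpy as np

def reduce_context(X, d):
    """Compress the U x I context matrix into its d largest singular values."""
    # numpy returns singular values already sorted in descending order.
    singular_values = np.linalg.svd(X, compute_uv=False)
    x = np.zeros(d)
    k = min(d, singular_values.shape[0])
    x[:k] = singular_values[:k]
    return x
```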

With a much lower dimension, GPR can predict $\mathbb{E}\left[r^{t}\mid a,\bm{x}^{t}\right],\forall a\in\mathcal{A}$, based on historical information. To distinguish it from $\mathbb{E}\left[r^{t}\mid a,\bm{x}^{t}\right]$, let $\hat{r}(a,\bm{x}^{t})$ denote the predicted reward. According to [32], $\hat{r}(a,\bm{x}^{t})$ can be modelled as a random variable sampled from the Gaussian process $\text{GP}\left(\mu_{0}(\bm{z}^{t}),k(\bm{z}^{t},\bm{z}^{\prime})\right)$, where $\mu_{0}(\bm{z}^{t})$ denotes the mean function (for simplicity and practicability, $\mu_{0}(\bm{z})$ is assumed to be zero [32]) and $k(\bm{z}^{t},\bm{z}^{\prime})$ is the covariance (or kernel) function between the input vector $\bm{z}^{t}$ and a historical input vector $\bm{z}^{\prime}$. To simplify notation, we concatenate any action $a\in\mathcal{A}$ (or a historical action $a^{t^{\prime}}\in\mathcal{A},\forall t^{\prime}<t$) with the context $\bm{x}^{t}$ (or $\bm{x}^{t^{\prime}}$) to form the vector $\bm{z}^{t}=[a,\bm{x}^{t}]$ (or $\bm{z}^{\prime}=[a^{t^{\prime}},\bm{x}^{t^{\prime}}]$). Note that $a^{t^{\prime}}$ is already fixed in round $t^{\prime}$ for $t^{\prime}<t$.

To adapt the GPR to model the temporal dynamics of observations (i.e., the inputs $\bm{z}$), we construct a composite kernel that incorporates both the squared exponential kernel function and the Ornstein-Uhlenbeck temporal covariance function [33], yielding

$k(\bm{z}^{t},\bm{z}^{\prime})=(1-\alpha)^{\frac{|t-t^{\prime}|}{2}}\exp\left(-\frac{\|\bm{z}^{t}-\bm{z}^{\prime}\|_{2}}{2s^{2}}\right),\ \forall\bm{z}^{\prime}\in\mathcal{Z}^{t-1},$ (4)

where $\mathcal{Z}^{t-1}=\{\bm{z}^{1},\dots,\bm{z}^{t-1}\}$ is the set of historical inputs up to round $t-1$, $|t-t^{\prime}|$ represents the temporal difference between $\bm{z}^{t}$ and the historical input $\bm{z}^{\prime}$, and $\alpha$ is a hyper-parameter tuning the weight assigned to temporal correlations. The hyper-parameter $s>0$ represents the length scale determining the influence of one input on another. Hence, a larger $k(\bm{z}^{t},\bm{z}^{\prime})$ indicates a stronger correlation between $\bm{z}^{t}$ and $\bm{z}^{\prime}$, implying that the historical observation with input $\bm{z}^{\prime}$ is more effective in predicting the reward with input $\bm{z}^{t}$.
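The composite kernel of Eq. (4) can be coded directly; the sketch below follows the formula as written, with the default hyper-parameter values taken from the experimental settings reported later.

```python
import numpy as np

def composite_kernel(z_t, t, z_prime, t_prime, alpha=0.001, s=0.2):
    """Eq. (4): Ornstein-Uhlenbeck temporal factor times a squared-exponential
    term over the concatenated inputs z = [a, x]."""
    temporal = (1.0 - alpha) ** (abs(t - t_prime) / 2.0)
    spatial = np.exp(-np.linalg.norm(np.asarray(z_t, dtype=float)
                                     - np.asarray(z_prime, dtype=float))
                     / (2.0 * s ** 2))
    return temporal * spatial
```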

Up to round $t$, we can collect the input set $\mathcal{Z}^{t-1}$ and the reward vector $\bm{r}^{t-1}=[r^{1},\dots,r^{t-1}]^{\top}$ containing $t-1$ observed rewards. Thus, we can establish a GPR with $t-1$ variables to predict the distribution of $r^{t}$. The detailed joint distribution of the GPR is presented in Appendix B. Using the joint distribution of the GPR and Bayes’ theorem, we can predict $\hat{r}(a,\bm{x}^{t})$ for any action $a\in\mathcal{A}$ with context $\bm{x}^{t}$ by the conditional distribution $\mathcal{N}(\mu(\bm{z}^{t}),\sigma(\bm{z}^{t}))$, parameterized by the mean function $\mu(\bm{z}^{t})$ in Eq. (5) and the variance function $\sigma(\bm{z}^{t})$ (due to the limited space, the expression of the variance function $\sigma(\bm{z}^{t})$ is presented in Appendix B), where $\bm{z}^{t}=[a,\bm{x}^{t}]$:

$\mu(\bm{z}^{t})=\mathbf{k}^{t-1}(\bm{z}^{t})\,[\mathbf{K}^{t-1}+\sigma_{\mu}^{2}\mathbf{I}]^{-1}\bm{r}^{t-1},\ \forall a\in\mathcal{A}.$ (5)

Here, $\mathbf{k}^{t-1}(\bm{z}^{t})=[k(\bm{z}^{t},\bm{z}^{1}),k(\bm{z}^{t},\bm{z}^{2}),\dots,k(\bm{z}^{t},\bm{z}^{t-1})]$ is the kernel vector between the historical inputs and the current input, $\mathbf{K}^{t-1}=[k(\bm{z},\bm{z}^{\prime})]_{\forall\bm{z},\bm{z}^{\prime}\in\mathcal{Z}^{t-1}}$ represents the kernel matrix among historical inputs, and $\mathbf{I}$ denotes an identity matrix. Additionally, $\sigma_{\mu}$ is the variance parameter of the distribution $\mathcal{N}_{\mu}(0,\sigma_{\mu}^{2})$, representing the deviation of the zero-mean random noise between the observed reward $r$ and the estimated reward $\hat{r}(a,\bm{x}^{t})$ [34].

For a given $a$, it can be concatenated with the context $\bm{x}^{t}$ to form the input $\bm{z}^{t}$. By leveraging Eq. (5), we can estimate the expected reward $\hat{r}(a,\bm{x}^{t})$. By enumerating $a\in\mathcal{A}$, we can predict the reward of each candidate action. Note that our Predictor design is not based on a fixed $\bm{x}^{t}$; it can process dynamic contexts and predict rewards under the evolving dynamics of user-item interactions over time.
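A compact sketch of the posterior mean in Eq. (5) follows; it reuses the `composite_kernel` helper from the previous sketch, and the observation-noise scale `sigma_mu` is an illustrative value.

```python
import numpy as np

def gpr_posterior_mean(z_t, t, Z_hist, t_hist, r_hist, sigma_mu=0.1):
    """Predict mu(z^t) of Eq. (5) from historical inputs and rewards."""
    n = len(Z_hist)
    r_hist = np.asarray(r_hist, dtype=float)
    # Kernel matrix among historical inputs and kernel vector to the new input.
    K = np.array([[composite_kernel(Z_hist[i], t_hist[i], Z_hist[j], t_hist[j])
                   for j in range(n)] for i in range(n)])
    k_vec = np.array([composite_kernel(z_t, t, Z_hist[j], t_hist[j])
                      for j in range(n)])
    # mu(z^t) = k^{t-1}(z^t) [K^{t-1} + sigma_mu^2 I]^{-1} r^{t-1}
    return float(k_vec @ np.linalg.solve(K + sigma_mu ** 2 * np.eye(n), r_hist))
```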

V-B Long-Term Privacy Budget Constrainer Design

We next discuss the design of the long-term privacy budget constrainer in BGTplanner.

In $\mathbb{P}1$, the privacy budget is constrained over the entire training process by (3b). To handle this, we decouple the time-dependent privacy budget constraint with an online queue optimization method [35, 36] so that the budget allocation in $\mathbb{P}1$ can be considered round by round. For training round $\tau$, considering that the inequality $\sum_{t=1}^{\tau}\epsilon_{u}^{t}\leq\sum_{t=1}^{T}\epsilon_{u}^{t}$ always holds when $\tau\leq T$, we can relax the constraint in Eq. (3b) as $\sum_{t=1}^{T}\epsilon_{u}^{t}\leq\epsilon_{u}^{total},\forall u$. $\delta_{u}^{t}$ can be constrained in a similar way. To save space, we simply use $\epsilon_{u}^{t}$ to denote the privacy budget hereafter.

The problem enforcing long-term privacy budget constraints can be transformed into a queue recursion optimization with Lagrange multipliers $\bm{\lambda}^{t}={[\lambda_{u}^{t}]^{U}}^{\top},\forall t$. Therefore, the Lagrangian of the reward in Eq. (3a) can be written as $L(\bm{a},\bm{\lambda})=\sum_{t=1}^{T}L^{t}(a^{t},\bm{\lambda}^{t})$, where

$L^{t}(a^{t},\bm{\lambda}^{t})=-\mathbb{E}\left[r^{t}\mid a^{t},\bm{x}^{t}\right]+\langle\frac{\bm{\epsilon}^{total}}{T}-\bm{\epsilon}^{t},\bm{\lambda}^{t}\rangle.$ (6)

Here, $\bm{\epsilon}^{total}=[\epsilon^{total}_{u}]^{U}$ is the total privacy budget vector, $\bm{\epsilon}^{t}=[\epsilon^{t}_{u}]^{U}$ is the per-round privacy consumption vector for all clients, and $\epsilon^{t}_{u}=C_{u}(a^{t})$ is the function mapping actions to privacy budgets. Additionally, $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors. The problem $\mathbb{P}1$ is converted to maximizing the reward with a constraint penalty by solving $\min_{\bm{a},\tau}\max_{\bm{\lambda}}L(\bm{a},\bm{\lambda})$ s.t. (3c), (3d). By penalizing the privacy budget consumption of each training round, we can simplify the problem to optimizing $L^{t}(a^{t},\bm{\lambda}^{t})$ for a particular round $t$.

By resorting to the online mirror descent (OMD) algorithm [37, 38], we can tackle the Lagrangian dual problem in every round $t$, formulated as $\min_{\bm{\lambda}^{t}\in\mathcal{V}}D^{t}(\bm{\lambda}^{t})$, where

$D^{t}(\bm{\lambda}^{t})=\max_{a^{t}}\left(\mathbb{E}\left[r^{t}\mid a^{t},\bm{x}^{t}\right]-\langle\frac{\bm{\epsilon}^{total}}{T}-\bm{\epsilon}^{t},\bm{\lambda}^{t}\rangle\right).$ (7)

Here, $\mathcal{V}=\left\{\bm{\lambda}\in\mathbb{R}^{N}:\bm{\lambda}\geq 0,\,\|\bm{\lambda}\|_{1}\leq\Lambda\right\}$ is the space of the dual variable $\bm{\lambda}$, and $\Lambda>0$ is the $l_{1}$-radius of $\mathcal{V}$.

With the determined space $\mathcal{V}$, the Constrainer can update $\bm{\lambda}^{t+1}$ in the remaining training rounds with the following rule:

$\bm{\lambda}^{t+1}=\arg\min_{\bm{\lambda}\in\mathcal{V}}\left(\langle\nabla D^{t}(\bm{\lambda}^{t}),\bm{\lambda}\rangle+\frac{B(\bm{\lambda},\bm{\lambda}^{t})}{\eta_{t}}\right),$ (8)

where $B(\cdot,\cdot)$ is the Bregman divergence of the generating function and $\eta_{t}$ is the step size of the OMD algorithm. Here, we select the negative entropy as the generating function, the same as in [36].
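With the negative-entropy generating function, the OMD step in Eq. (8) reduces to a multiplicative-weights update followed by a projection onto the $l_{1}$-ball of radius $\Lambda$. The sketch below makes that explicit; the gradient argument and the projection step are standard choices under this assumption, not necessarily the paper's exact implementation.

```python
import numpy as np

def omd_dual_update(lam, grad, eta, Lambda):
    """One OMD step (Eq. (8)) with the negative-entropy generating function.

    grad: (sub)gradient of D^t at lam, e.g. eps^t - eps^total / T evaluated at
    the maximizing action; lam stays nonnegative with ||lam||_1 <= Lambda.
    """
    new_lam = lam * np.exp(-eta * grad)   # mirror (multiplicative) step
    new_lam = np.maximum(new_lam, 1e-12)  # keep strictly positive
    if new_lam.sum() > Lambda:            # KL projection onto the l1-ball
        new_lam = Lambda * new_lam / new_lam.sum()
    return new_lam
```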

To sum up, the Constrainer enables the integration of long-term privacy budget constraints into instant reward maximization and focuses on dynamically tuning $\bm{\lambda}^{t}$ to properly penalize the privacy budget consumption.

V-C Privacy Budget Allocator Design

The last component of BGTplanner leverages the Contextual Multi-Armed Bandit (CMAB) to make the final budget allocation decisions. Briefly speaking, BGTplanner operates in two phases, i.e., the initial stage and the exploration-exploitation stage, which are described below.

V-C1 Initial Stage

The goal of the initial stage is to establish the space $\mathcal{V}$ of $\bm{\lambda}^{t}$ for the Constrainer, so that the privacy budget consumption can be properly penalized in the exploration-exploitation stage. Following [36], we employ a regression-based algorithm to determine $\Lambda$. The initial stage lasts $(A+1)\cdot T_{0}$ training rounds to estimate $\Lambda$. Here, $A$ is the size of the action space, and $T_{0}$ is a hyper-parameter representing the number of times each action is selected.

To ensure that the OMD algorithm achieves $\mathrm{O}(\sqrt{T})$ regret bounds, our preferred choice is to set $\Lambda=\frac{T\cdot\text{OPT}}{\epsilon}$, where OPT is defined in Eq. (14) [36]. However, OPT cannot be computed exactly in an online scenario. Therefore, we employ a regression algorithm in the initial stage, which consists of $(A+1)\cdot T_{0}$ training rounds, to estimate $\Lambda$. Generally speaking, the initial stage consists of two steps.

Step ①: For the first $A\cdot T_{0}$ rounds, we play each of the $A$ actions in CMAB evenly, $T_{0}$ times each, to gather the historical record set $\mathcal{H}^{A\cdot T_{0}}=\{(\bm{z}^{1},r^{1}),\cdots,(\bm{z}^{A\cdot T_{0}},r^{A\cdot T_{0}})\}$ of all inputs $\bm{z}^{t^{\prime}}=[a^{t^{\prime}},\bm{x}^{t^{\prime}}]$ and reward outputs $r^{t^{\prime}}$, where $t^{\prime}\in\{1,\cdots,A\cdot T_{0}\}$. With the historical records, we can formulate the GPR model as an oracle to predict the reward $\hat{r}(a,\bm{x}^{t})$ and the mean function of the reward $\mu(\bm{z}^{t})$ with input $\bm{z}^{t}=[a^{t},\bm{x}^{t}]$, where $t\in\mathcal{T}_{0}$ and $\mathcal{T}_{0}$ is defined in Eq. (9c).

$\hat{\text{OPT}}(T_{0})=\max_{\bm{o}\in\mathcal{O}}\frac{1}{T_{0}}\sum_{t\in\mathcal{T}_{0}}\sum_{a\in\mathcal{A}}\hat{r}(a,\bm{x}^{t})\cdot o_{a,t},$ (9a)
$\text{s.t.}\ \frac{1}{T_{0}}\sum_{t\in\mathcal{T}_{0}}\sum_{a\in\mathcal{A}}C_{u}(a)\cdot o_{a,t}\leq\frac{\epsilon_{min}^{total}}{T}+2M(T_{0}),$ (9b)
$\mathcal{T}_{0}=\left\{t:A\cdot T_{0}+1\leq t\leq(A+1)\cdot T_{0}\right\},$ (9c)
$M(T_{0})=\sqrt{A\cdot E(T_{0})+4\cdot\frac{\log(T\cdot U)}{T_{0}}},$ (9d)
$E(T_{0})=\frac{1}{T_{0}}\sum_{t\in\mathcal{T}_{0}}(\mu(\bm{z}^{t})-r^{t})^{2},\ \bm{z}^{t}=[a^{t},\bm{x}^{t}],$ (9e)
$\epsilon_{min}^{total}=\underset{u}{\arg\min}\{\epsilon_{u}^{total}\mid\forall u\in\mathcal{U}\}.$ (9f)

Step ②: For the subsequent $T_{0}$ rounds, we select actions $a^{t}$ randomly and observe the rewards $r^{t}$, where $t\in\mathcal{T}_{0}$. Then, we use the observed rewards and the GPR model to estimate $\hat{\text{OPT}}(T_{0})$ by solving the linear programming problem (9), where $\bm{o}$ is the solution matrix and the GPR model serves as the oracle to predict the reward $\hat{r}(a,\bm{x}^{t})$ and the mean of the reward $\mu(\bm{z}^{t})$. Additionally, $T$ is the maximum number of training rounds, $U$ is the total number of clients, $T_{0}$ is the hyper-parameter controlling how many rounds each action is selected, and $\epsilon_{min}^{total}$ is the smallest total privacy budget among all clients. Therefore, we can obtain all the parameters needed to solve problem (9) using any general linear programming solver.

Upon obtaining the estimated value of $\hat{\text{OPT}}(T_{0})$, we can set

$\Lambda=\frac{T}{\epsilon_{min}^{total}}\left(\hat{\text{OPT}}(T_{0})+M(T_{0})\right),$ (10)

where $M(T_{0})$ is defined in Eq. (9d). Consequently, the error between $\hat{\text{OPT}}(T_{0})$ and OPT is bounded by the estimation error guarantee, as demonstrated in Lemma 1 in Appendix E, based on the analysis in [36]. The detailed algorithm for determining $\Lambda$ is provided in Alg. 1.
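For illustration, the sketch below estimates $\hat{\text{OPT}}(T_{0})$ with a generic LP solver and then sets $\Lambda$ by Eq. (10). It assumes the feasible set $\mathcal{O}$ consists of one probability distribution over actions per round and uses a single representative budget-cost vector; both are our assumptions rather than details given in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def estimate_lambda(r_hat, budget_cost, eps_min_total, T, M_T0):
    """Solve the LP (9) for OPT-hat(T0) and return Lambda via Eq. (10).

    r_hat: (A, T0) rewards predicted by the GPR oracle; budget_cost: (A,)
    per-action budget cost C(a) of a representative client (assumption).
    """
    A, T0 = r_hat.shape
    c = -(r_hat / T0).ravel()  # linprog minimizes, so negate the objective
    # (1/T0) * sum_{t,a} C(a) * o_{a,t} <= eps_min/T + 2 M(T0)
    A_ub = (np.tile(budget_cost[:, None], (1, T0)).ravel() / T0)[None, :]
    b_ub = np.array([eps_min_total / T + 2.0 * M_T0])
    # One probability distribution over the A actions in each of the T0 rounds.
    A_eq = np.zeros((T0, A * T0))
    for t in range(T0):
        A_eq[t, t::T0] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.ones(T0),
                  bounds=(0.0, 1.0))
    opt_hat = -res.fun  # maximum of the relaxed objective
    return (T / eps_min_total) * (opt_hat + M_T0)
```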

Algorithm 1 Initial Stage to Estimate $\Lambda$ at the Server.

Input: Training rounds $T$; exploration rounds $T_{0}$; total privacy budget $\bm{\epsilon}^{total}$; item embedding $\bm{\iota}^{0}$; user embedding $\bm{\nu}^{0}$; recommender parameters $\bm{\omega}^{0}$.
Output: Radius $\Lambda$; remaining budget $\bm{\epsilon}^{re}$; historical record set $\mathcal{H}^{(A+1)\cdot T_{0}}$.

1:  Distribute $\bm{\iota}^{0}$, $\bm{\nu}^{0}$, $\bm{\omega}^{0}$ to all clients as the initialization of their local models; // Step ① //
2:  Initialize the remaining budget $\bm{\epsilon}^{re}\leftarrow\bm{\epsilon}^{total}$ and the historical record set $\mathcal{H}^{0}=\emptyset$ for the GPR model;
3:  for $a^{t}\in\mathcal{A}$ do
4:     for $t\in\{(a^{t}-1)\cdot T_{0}+1,\dots,a^{t}\cdot T_{0}\}$ do
5:        Receive $\bm{\mathrm{X}}^{t}$ from all clients, generate the context $\bm{x}^{t}$, and concatenate $\bm{z}^{t}=[a^{t},\bm{x}^{t}]$;
6:        Send $a^{t}$ to all clients and execute DPFR TRAINING (Alg. 2) to obtain $r^{t},\bm{\epsilon}_{u}^{t}$;
7:        Update $\bm{\epsilon}_{u}^{re}\leftarrow\bm{\epsilon}^{re}_{u}-\bm{\epsilon}_{u}^{t}$;
8:        Collect $\mathcal{H}^{t}\leftarrow\mathcal{H}^{t-1}\cup\{(\bm{z}^{t},r^{t})\}$;
9:     end for
10:  end for
11:  Model the GPR with $\mathcal{H}^{A\cdot T_{0}}$; // Step ② //
12:  Generate $\mathcal{T}_{0}$ by Eq. (9c) with $A$ and $T_{0}$;
13:  for $t\in\mathcal{T}_{0}$ do
14:     Receive $\bm{\mathrm{X}}^{t}$ from all clients and generate the context $\bm{x}^{t}$;
15:     Send $a^{t}$ to all clients and execute DPFR TRAINING (Alg. 2) to obtain $r^{t},\bm{\epsilon}_{u}^{t}$;
16:     Update $\bm{\epsilon}_{u}^{re}\leftarrow\bm{\epsilon}^{re}_{u}-\bm{\epsilon}_{u}^{t}$;
17:     Concatenate $\bm{z}^{t}=[a^{t},\bm{x}^{t}]$ and collect $\mathcal{H}^{t}\leftarrow\mathcal{H}^{t-1}\cup\{(\bm{z}^{t},r^{t})\}$;
18:     Update the GPR model with $\mathcal{H}^{t}$;
19:  end for
20:  Enumerate $a\in\mathcal{A}$ and $t\in\mathcal{T}_{0}$ to calculate $\hat{r}(\cdot,\cdot)$ with the GPR model and $\mathcal{H}^{(A+1)\cdot T_{0}}$;
21:  Calculate $\mu(\cdot)$ with $\bm{z}^{t},\forall t\in\mathcal{T}_{0}$ by Eq. (5);
22:  Estimate $\hat{\text{OPT}}(T_{0})$ by solving the linear program (9) with $\hat{r}(\cdot)$, $\mu(\cdot)$, and $\mathcal{H}^{(A+1)\cdot T_{0}}$;
23:  Determine the radius $\Lambda$ with $\hat{\text{OPT}}(T_{0})$ by Eq. (10).
Algorithm 2 DPFR TRAINING (One Round)

Input: Training round $t$; client set $\mathcal{U}$; budget allocation $a^{t}$.
Output: Training reward $r^{t}$; budget consumption $\bm{\epsilon}^{t}$.
// Clients $u\in\mathcal{U}$ //

1:  Train the local item embedding $\bm{\iota}^{t}_{u}$, the user embedding $\bm{\nu}^{t}_{u}$, and the recommender model $\bm{\omega}^{t}_{u}$ with local private data;
2:  Generate DP noise $\bm{n}_{u}=[\bm{n}_{\omega},\bm{n}_{\iota}]$ based on the budget allocation $a^{t}$ and the specific DP mechanism;
3:  Add DP noise to the local gradients: $\nabla\tilde{\bm{\omega}}^{t}_{u}=\nabla\bm{\omega}^{t}_{u}+\bm{n}_{\omega}$, $\nabla\tilde{\bm{\iota}}_{u}^{t}=\nabla\bm{\iota}_{u}^{t}+\bm{n}_{\iota}$;
4:  Upload the noisy gradients $\nabla\tilde{\bm{\omega}}^{t}_{u}$ and $\nabla\tilde{\bm{\iota}}_{u}^{t}$ to the server;

// Server //

1:  Receive the noisy gradients of the item embedding $\nabla\tilde{\bm{\iota}}^{t}_{u}$ and the recommender model $\nabla\tilde{\bm{\omega}}^{t}_{u}$ from all clients;
2:  Perform model aggregation to obtain the global $\bm{\iota}^{t}$, $\bm{\omega}^{t}$ with $\nabla\tilde{\bm{\iota}}^{t}_{u},\nabla\tilde{\bm{\omega}}^{t}_{u},\forall u\in\mathcal{U}$, and the training information $\mathbf{X}^{t}$;
3:  Distribute the aggregated parameters $\bm{\iota}^{t}$ and $\bm{\omega}^{t}$ to all clients and observe the training reward $r^{t}$;
4:  Calculate the consumed budgets $\bm{\epsilon}^{t}$ for all clients.
Algorithm 3 The BGTplanner Algorithm at the Server.

Input: Training rounds $T$; exploration rounds $T_{0}$; total privacy budget $\bm{\epsilon}^{total}$.

1:  Determine the radius $\Lambda$ by Alg. 1 and obtain $\bm{\epsilon}^{re}$ and $\mathcal{H}^{(A+1)\cdot T_{0}}$;
2:  Randomly initialize the dual variables $\bm{\lambda}^{(A+1)\cdot T_{0}+1}$ by Eq. (8) within the radius $\Lambda$;
3:  for $t\in\{(A+1)\cdot T_{0}+1,\cdots,T\}$ do
4:     if all clients have exhausted their privacy budgets then exit model training; end if // Step ① //
5:     Receive $\bm{\mathrm{X}}^{t}$ from all clients and generate the context $\bm{x}^{t}$;
6:     Predict $\mu(\bm{z}^{t})$ with $\bm{z}^{t}=[a,\bm{x}^{t}],\forall a\in\mathcal{A}$, by Eq. (5);
7:     Obtain $\beta^{t}(a),\forall a\in\mathcal{A}$, with $\mu(\bm{z}^{t})$ and $\bm{\lambda}^{t}$ based on Eq. (11); // Step ② //
8:     Calculate $p^{t}(a),\forall a\in\mathcal{A}$, with $\beta^{t}(a)$ based on Eqs. (12)-(13);
9:     Sample a budget allocation action $a^{t}\sim\bm{p}^{t}$;
10:     Send $a^{t}$ to all clients and execute DPFR TRAINING (Alg. 2) to obtain $r^{t},\bm{\epsilon}_{u}^{t}$; // Step ③ //
11:     Update $\bm{\epsilon}_{u}^{re}\leftarrow\bm{\epsilon}^{re}_{u}-\bm{\epsilon}_{u}^{t}$;
12:     Concatenate $\bm{z}^{t}=[a^{t},\bm{x}^{t}]$;
13:     Collect $\mathcal{H}^{t}\leftarrow\mathcal{H}^{t-1}\cup\{(\bm{z}^{t},r^{t})\}$;
14:     Update the GPR model with $\mathcal{H}^{t}$;
15:     Update the dual variables $\bm{\lambda}^{t+1}$ by Eq. (8) with $\bm{\epsilon}_{u}^{t}$ within the radius $\Lambda$.
16:  end for

V-C2 Exploration-Exploitation Stage

In this stage, BGTplanner utilizes the outputs of the Predictor and the Constrainer, together with the CMAB framework, to allocate the privacy budget. In general, the exploration-exploitation stage includes three steps in each round $t\in\left((A+1)\cdot T_{0},T\right]$, described as follows:

Step ① (Generating Action Scores): After observing the context $\bm{x}^{t}$, BGTplanner acquires the estimated mean reward $\mu(\bm{z}^{t})$ from the Predictor, where $\bm{z}^{t}=[a,\bm{x}^{t}],\forall a\in\mathcal{A}$, and calculates the predicted Lagrangian as an appropriate score function by considering both rewards and penalties:

$\beta^{t}(a)=\mu(\bm{z}^{t})-\langle\frac{\bm{\epsilon}^{total}}{T}-\bm{\epsilon}^{t},\bm{\lambda}^{t}\rangle,\ \forall a\in\mathcal{A},$ (11)

where $\bm{\lambda}^{t}$ is updated by the Constrainer in round $t-1$.

Step ② (Selecting the Optimal Action): Based on the estimated score of each action $a$, BGTplanner generates probabilities for all arms corresponding to the different privacy budget allocation actions in the CMAB model. Let $a^{max}$ denote the action with the highest score $\beta^{t}(a^{max})$. The probability of sampling each action is calculated as:

$p^{t}(a)=\frac{1}{A+\gamma\left(\beta^{t}(a^{max})-\beta^{t}(a)\right)},\ \forall a\in\mathcal{A}-\{a^{max}\},$ (12)

where $\gamma$ is a hyper-parameter representing the extent to which BGTplanner prefers exploitation over exploration. If $\gamma$ is smaller, $p^{t}(a)$ is closer to $\frac{1}{A}$, implying that BGTplanner samples actions more uniformly and hence prefers exploration. The probability of $a^{max}$ is:

$p^{t}(a^{max})=1-\sum_{a\in\mathcal{A},a\neq a^{max}}p^{t}(a).$ (13)

The final action $a^{t}$ is sampled with the probability distribution in Eqs. (12) and (13) under the context $\bm{x}^{t}$ and the dual variable $\bm{\lambda}^{t}$.
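A small sketch of this sampling rule follows; it computes the probabilities of Eqs. (12)-(13) from the score vector and draws an action.

```python
import numpy as np

def sample_action(beta, gamma, rng=None):
    """Sample a budget-allocation action from the scores beta^t(a)."""
    rng = rng or np.random.default_rng()
    beta = np.asarray(beta, dtype=float)
    A = beta.shape[0]
    a_max = int(np.argmax(beta))
    p = 1.0 / (A + gamma * (beta[a_max] - beta))  # Eq. (12) for a != a_max
    p[a_max] = 0.0
    p[a_max] = 1.0 - p.sum()                      # Eq. (13)
    return int(rng.choice(A, p=p)), p
```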

Step ③ (Updating Penalties): BGTplanner leverages the Constrainer to update the new dual variable vector $\bm{\lambda}^{t+1}$, which will be used for budget allocation in the next training round on the fly.

V-D Details of BGTplanner

We provide a detailed description of the BGTplanner algorithm in Alg. 3. The algorithm operates in two stages: an initial stage, described in Alg. 1, and a subsequent exploration-exploitation stage, which includes three main steps in each round. In each training round of the second stage, BGTplanner predicts the scores for all actions, selects the optimal action and adjusts the dual variables based on observed rewards and the remaining privacy budgets. With the chosen budget allocation action, the server and all clients execute one round of DPFR training, as detailed in Alg. 2, to update the model and embedding parameters. This process navigates a dynamic training environment, where the privacy budget is strategically allocated to maximize model performance while adhering to privacy constraints. The iterative process continues until the privacy budget is exhausted or the desired training performance is achieved.

VI Analysis and Implementation

In this section, we undertake privacy, regret, and complexity analyses to substantiate the theoretical guarantees of our approach.

VI-A Privacy Analysis.

Given $(\epsilon^{total}_{u},\delta^{total}_{u})$, each client can allocate its budget as follows. If the Laplace mechanism is adopted, $\delta^{t}_{u}=0$ and each client can employ the native composition rule [14] to accumulate the total privacy loss $\epsilon^{total}_{u}$ over the multiple rounds. Otherwise, if the Gaussian mechanism is adopted, clients can convert the budget to an RDP (Rényi Differential Privacy) budget, which can then simply be split across multiple training rounds [15]. In this case, Eq. (3b) should be expressed in terms of the RDP budget. It has been proved in [14, 15] that if clients follow these methods to accumulate their privacy loss, $(\epsilon^{total}_{u},\delta^{total}_{u})$-DP is satisfied on client $u$.
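For the Laplace case, the per-round accounting amounts to a running sum under Theorem 1; a minimal sketch is shown below (the Gaussian/RDP conversion mentioned above is not reproduced here).

```python
def remaining_laplace_budget(eps_total, eps_spent_history):
    """Remaining budget under naive sequential composition (Theorem 1)."""
    return eps_total - sum(eps_spent_history)

def can_afford(eps_total, eps_spent_history, eps_next):
    """Check whether spending eps_next in the next round stays within budget."""
    return sum(eps_spent_history) + eps_next <= eps_total
```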

VI-B Overhead Analysis.

BGTplanner incurs additional computation and memory overhead for making budget allocation decisions. Through our analysis, we show that this overhead is lightweight and affordable for the server. For the Predictor, the complexity of calculating the Gaussian kernel is $\mathrm{O}(A\cdot d\cdot t)$, where $d$ is the dimension of the input vector. The computational complexity of estimating the reward $\mu(\bm{z}^{t})$ in Eq. (5) is $\mathrm{O}(A\cdot d\cdot t^{2})$. Calculating the score vector $\beta^{t}(a)$ by Eq. (11) incurs a complexity of $\mathrm{O}(A\cdot U)$, and computing the probability vector $\bm{p}^{t}$ via Eqs. (12)-(13) has a complexity of $\mathrm{O}(A)$. In total, the computational complexity per round is $\mathrm{O}(A\cdot(t^{2}\cdot d+U))$. According to [37], the computational complexity of the Constrainer is also small when using the OMD algorithm.

Additionally, the memory complexity is $\mathrm{O}(A\cdot t^{2}+U)$, primarily due to the storage requirements for 1) the GP kernels, 2) the estimated means, scores, and probabilities for all actions, and 3) the client budget and penalty information.

VI-C Regret Analysis.

We then discuss the regret performance of BGTplanner. We define OPT as the optimal expected reward for one round while ensuring that resource utilization remains within a predefined budget [36], i.e.,

Definition 2.

Let $\hat{r}^{*}(a,\bm{x})$ denote the optimal regression characterizing the expectation of the reward distribution, and let the static randomized policy $p_{a}^{*}:\mathcal{X}\rightarrow P_{\mathcal{A}}$ be the optimal mapping function (defined on the context distribution space $\mathcal{X}$) assigning probabilities to actions in $\mathcal{A}$. We have

$\text{OPT}=\max_{p_{a}^{*}:\mathcal{X}\rightarrow P_{\mathcal{A}}}\ \mathbb{E}_{\bm{x}\sim P_{\mathcal{X}}}\left[\sum_{a\in\mathcal{A}}p_{a}^{*}(\bm{x})\,\hat{r}^{*}(a,\bm{x})\right],$ (14)
$\text{s.t.}\ \mathbb{E}_{\bm{x}\sim P_{\mathcal{X}}}\left[\sum_{a\in\mathcal{A}}p_{a}^{*}(\bm{x})\,C_{u}(a)\right]\leq\frac{\epsilon_{u}^{total}}{T},\ \forall u\in\mathcal{U}.$

Here, $P_{\mathcal{A}}$ and $P_{\mathcal{X}}$ represent probability distributions over the action set $\mathcal{A}$ and the context space $\mathcal{X}$, respectively. $C_{u}:\mathcal{A}\rightarrow\mathbb{R}_{+}$ is a mapping from actions to privacy budget consumption.

With Definition 2, letting $T\cdot\text{OPT}$ denote the upper bound of the optimal reward in online scenarios [39], we aim to minimize the regret between $T\cdot\text{OPT}$ and the actual reward obtained by the BGTplanner online algorithm:

$\text{Reg}(T)=T\cdot\text{OPT}-\sum_{t=1}^{T}\mathbb{E}\left[r^{t}\mid a^{t},\bm{x}^{t}\right].$ (15)

We first give the error guarantee for the estimation of the $\ell_{1}$-radius $\Lambda$.

Lemma 1.

Denoting the optimal value of (9) by $\hat{\text{OPT}}(T_{0})$ and setting $\Lambda=\frac{T}{\epsilon_{min}^{total}}\left(\hat{\text{OPT}}(T_{0})+M(T_{0})\right)$, we have, with probability at least $1-O(1/T^{2})$,

$\frac{T\cdot\text{OPT}}{\epsilon^{total}_{min}}\leq\Lambda\leq\left(\frac{6T\cdot M(T_{0})}{\epsilon^{total}_{min}}+1\right)\left(\frac{T\cdot\text{OPT}}{\epsilon^{total}_{min}}+1\right).$ (16)

Here, when $\epsilon^{total}_{min}=\mathrm{O}(T\cdot M(T_{0}))$, we can write $\Lambda\lesssim\frac{T\cdot\text{OPT}}{\epsilon^{total}_{min}}+1$, where the notation ‘$\lesssim$’ denotes an inequality that holds up to a constant multiplicative factor. The comprehensive proof of Lemma 1 can be found in Lemma 4.1 in [36]. With the estimation error guarantee of $\Lambda$ in Lemma 1, the OMD update of $\bm{\lambda}$ achieves optimal regret bounds of $\mathrm{O}(\sqrt{T})$ (Lemma 2.2 in [36]). Therefore, we can obtain the following regret guarantee for Alg. 3, restated as follows:

Theorem 2.

Denote $T$ as the total training rounds, $T_{0}$ as the number of times each action is selected in the initial phase, and $A$ as the number of actions; then the regret $\text{Reg}(T)=T\cdot\text{OPT}-\sum_{t=1}^{T}\mathbb{E}\left[r^{t}\mid a^{t},\bm{x}^{t}\right]$ of BGTplanner satisfies:

$\text{Reg}(T)\lesssim\left(\frac{T\cdot\text{OPT}}{\epsilon_{min}^{total}}+1\right)\sqrt{A\cdot T\left[\operatorname{Reg}^{r}(T)+1\right]}+\left(\frac{T\cdot\text{OPT}}{\epsilon_{min}^{total}}+1\right)A\cdot T_{0},$ (17)

where $\epsilon_{min}^{total}=\arg\min_{u}\{\epsilon_{u}^{total}\mid\forall u\in\mathcal{U}\}$ is the minimum budget among the clients. Meanwhile, $\epsilon_{min}^{total}$ constrains $T_{0}$ via $\epsilon_{min}^{total}>\max\{(A+2)\cdot T_{0},T\cdot M(T_{0})\}$, where $M(\cdot)$ is the bias function of the constraints of a linear programming problem defined in Appendix C. $\operatorname{Reg}^{r}(T)$ is the regret between the optimal regression function $R^{*}(a,\bm{x})$ and the GPR-based reward Predictor over the total $T$ rounds, with an upper bound of $\mathrm{O}(\sqrt{T\cdot\log(T)})$ proved in [33].

Proof.

We can directly deduce the regret from Theorem 4.1 in [36] by treating the reward regression oracle $\hat{r}(a,\bm{x})$ as the GPR model with a regret upper bound of $\mathrm{O}(\sqrt{T\cdot\log(T)})$ [33] and the constraint regression oracle $C(a)$ as an unbiased estimation. In Alg. 3, we use $\mathrm{O}(A\cdot T_{0})$ rounds in the initial stage, so the expected regret incurred by Alg. 1 is upper bounded by $\mathrm{O}\left(\frac{T\cdot\text{OPT}}{\epsilon_{min}^{total}}+1\right)\cdot A\cdot T_{0}$, guaranteed by $\Lambda\lesssim\frac{T\cdot\text{OPT}}{\epsilon_{min}^{total}}+1$ in Lemma 1. For the second stage of Alg. 3, the expected regret of the exploration-exploitation stage is upper bounded by $\left(\frac{T\cdot\text{OPT}}{\epsilon_{min}^{total}}+1\right)\sqrt{A\cdot T\left[\operatorname{Reg}^{r}(T)+1\right]}$, where $\sqrt{A\cdot T\left[\operatorname{Reg}^{r}(T)+1\right]}$ is the upper bound of the CMAB method employing the GPR model as the reward regression oracle and the OMD method to update the dual variables online over the total $T$ rounds (Theorem 4.1 in [36]). Then, Theorem 2 holds by adding the regrets of these two stages together. ∎

Theorem 2 guarantees that BGTplanner achieves an $\mathrm{O}(\sqrt{T})$ regret bound. Such a sub-linear regret bound is highly desirable, indicating that BGTplanner will not significantly diverge from the optimal budget allocation solution even under dynamic contexts.

Figure 3: Performance of all baselines across different settings of privacy budgets and DP noise mechanisms on the Filmtrust, ML-100K, and ML-1M datasets, respectively. Lower RMSE values indicate better performance, while higher F1-scores reflect better performance.
TABLE II: Performance in RMSE and F1-score under various datasets and noise mechanisms with the default privacy budget $\epsilon_{u}^{total}=10$. All methods are tested five times, and the average of the results is reported. Lower RMSE is better ($\downarrow$); higher F1-score is better ($\uparrow$). FedSGD (no DP noise) and DPSGD (Gaussian mechanism only) report a single value per dataset.

RMSE ($\downarrow$):
Methods | Filmtrust (GM / LM) | ML-100K (GM / LM) | ML-1M (GM / LM)
FedSGD | 0.2905 | 0.2017 | 0.2006
OURS | 0.3746 / 0.4114 | 0.3223 / 0.3449 | 0.3135 / 0.3484
ADPML | 0.3898 / 0.4258 | 0.3442 / 0.3815 | 0.3558 / 0.4217
ADAP | 0.3970 / 0.4435 | 0.3553 / 0.4092 | 0.3958 / 0.4621
DPSGD | 0.4093 | 0.3356 | 0.3210

F1-score ($\uparrow$):
Methods | Filmtrust (GM / LM) | ML-100K (GM / LM) | ML-1M (GM / LM)
FedSGD | 0.8354 | 0.9018 | 0.9137
OURS | 0.6685 / 0.6517 | 0.6444 / 0.6421 | 0.6524 / 0.6431
ADPML | 0.6395 / 0.6338 | 0.6358 / 0.6404 | 0.6407 / 0.6417
ADAP | 0.6389 / 0.6194 | 0.6366 / 0.6312 | 0.6404 / 0.6379
DPSGD | 0.6680 | 0.6412 | 0.6501

VII Performance Evaluation

In this section, we report the experimental results comparing BGTplanner with the state-of-the-art baselines. To facilitate peer review, we also anonymously open-source the code of our system deployment at https://anonymous.4open.science/r/BGTplanner-AAAI.

VII-A Experimental Settings

VII-A1 Datasets

For our experiments, we exploit the public rating datasets Filmtrust [40], MovieLens 100K (ML-100K) [41], and MovieLens 1M (ML-1M) [41], which are commonly used in evaluating recommender systems. Filmtrust includes 18,662 ratings from 874 users, ML-100K includes 100,000 ratings for 1,682 movies from 943 users, and ML-1M includes 1,000,209 ratings for 3,952 movies from 6,040 users. All rating scores are normalized to the range $[0,1]$. We implement FedGNN [3] as the recommendation model and set the number of FL clients according to the number of users in Filmtrust and ML-100K, respectively. Due to the larger number of users in ML-1M, we randomly allocate its 6,040 users to 605 clients.

Additionally, we split each rating dataset into a training set and a test set at a ratio of 4:1. To further simulate dynamic training contexts with continuously generated rating records, the training set is further randomly divided into two equal-sized parts. At the beginning of training, only the ratings in the first part can be used for training. As training progresses, training samples in the second part are randomly and evenly sampled and added to the first part to enhance training. Besides, all clients randomly select pseudo items from the set of non-interacted items for training FedGNN in each round [3]; the number of pseudo items is set to 50, 50, and 400 for Filmtrust, ML-100K, and ML-1M, respectively.

Figure 4: Sensitivity experiments of BGTplanner with respect to $T_{min}$, which controls the range of the action space, conducted across different privacy budget settings and DP noise mechanisms on the Filmtrust, ML-100K, and ML-1M datasets, respectively. Lower RMSE values indicate better performance, while higher F1-scores reflect better performance.

VII-A2 System Settings

Following [32], we set the GPR hyper-parameters to a length scale of s=0.2s=0.2 and α=0.001\alpha=0.001 for the squared exponential kernel. For the CMAB [36], the number of arms is set to A=5A=5, the number of initial rounds to T0=5T_{0}=5, and the trade-off parameter to γ=2AT(U+2)log(T)\gamma=2\cdot\sqrt{\frac{A\cdot T}{(U+2)\cdot\log(T)}} in Eq. (12), following [36]. By default, we set T=100T=100, the overall privacy budget ϵutotal=10\epsilon_{u}^{total}=10, and δutotal=e5\delta_{u}^{total}=e^{-5} [3] for all clients in all datasets. Additionally, the mapping function Cu()C_{u}(\cdot) uniformly distributes privacy budget values within the range [ϵutotalT,ϵutotalTmin][\frac{\epsilon_{u}^{total}}{T},\frac{\epsilon_{u}^{total}}{T_{min}}], where TminT_{min} defaults to 7070 and controls the maximum budget cost per round. We also conduct sensitivity experiments on the hyper-parameters ϵutotal\epsilon_{u}^{total}, TT, and TminT_{min} to further investigate their impact on performance.
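
For reference, the following sketch shows one way to instantiate these settings with scikit-learn's GaussianProcessRegressor; reading the paper's α\alpha as scikit-learn's alpha (the noise term added to the kernel diagonal) and the placeholder value of UU are assumptions, and the variable names are illustrative rather than taken from the released code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# CMAB / training configuration from this subsection.
A, T0, T, T_min = 5, 5, 100, 70
eps_total = 10.0          # epsilon_u^total
U = 1.0                   # symbol from Eq. (12); its value is assumed here

# Squared-exponential (RBF) kernel with length scale s = 0.2; alpha = 0.001 is
# treated as the observation-noise regularizer on the kernel diagonal.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=0.001)

# Trade-off parameter gamma = 2 * sqrt(A * T / ((U + 2) * log T)) for the dual update.
gamma = 2.0 * np.sqrt(A * T / ((U + 2) * np.log(T)))

# Mapping C_u(.): arm index -> per-round privacy budget, spread uniformly
# over [eps_total / T, eps_total / T_min].
arm_budgets = np.linspace(eps_total / T, eps_total / T_min, A)
print(gamma, arm_budgets)
```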

We adopt the Root Mean Square Error (RMSE) and the F1-score as evaluation metrics, both of which are widely used to assess recommenders. For the F1-score, we binarize the ratings and the recommender outputs to compute rating classification precision and recall, which are then combined into the F1-score; the binarization threshold is set to 0.50.5. The reward in round tt is quantified as rt=RMSEt1RMSEtr^{t}=\text{RMSE}^{t-1}-\text{RMSE}^{t}. All experiments are implemented in Python and executed on an Intel® server with a Xeon® E5-2660 v4 CPU and 189 GB of RAM.
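
The metrics and the reward signal can be computed as in the following minimal sketch; the helper names (rmse, binarized_f1, round_reward) are illustrative and not part of the released code.

```python
import numpy as np
from sklearn.metrics import f1_score

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def binarized_f1(y_true, y_pred, threshold=0.5):
    # Ratings and predictions lie in [0, 1]; binarize both at the 0.5 threshold
    # before computing precision, recall, and hence the F1-score.
    return f1_score(np.asarray(y_true) >= threshold, np.asarray(y_pred) >= threshold)

def round_reward(rmse_prev, rmse_curr):
    # r^t = RMSE^{t-1} - RMSE^t: positive when the model improved in round t.
    return rmse_prev - rmse_curr
```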

VII-A3 Baselines

To demonstrate the superiority of BGTplanner, we compare it with the following baselines. (1) FedSGD [18], the fundamental FL algorithm without DP noise, serves as the benchmark establishing the upper-bound accuracy for the DP-based baselines. (2) ADPML [17] determines the minimum privacy budget (ϵmin\epsilon_{min}) and the maximum privacy budget (ϵmax\epsilon_{max}) before training, and gradually increases the privacy budget used by clients during training. We set ϵmin=1\epsilon_{min}=1, ϵmax=10\epsilon_{max}=10, and the increasing rate to 0.90.9 for all datasets, following the default settings in [17]. (3) ADAP [42] dynamically adjusts the clipping threshold during training and gradually increases the privacy budget used by clients according to the trend of the test loss. All configurations for this baseline follow the defaults specified in [42]. (4) DPSGD [31] adopts a more advanced Gaussian mechanism (GM) and uses the Moments Accountant (essentially RDP) to accumulate and allocate the privacy loss for the GM. We use the open-source implementation with the same settings as [43] in our experiments.

Figure 5: Sensitivity of BGTplanner to TT, the maximum number of training rounds, across different privacy budget settings and DP noise mechanisms in the Filmtrust, ML-100K, and ML-1M datasets, respectively. Lower RMSE and higher F1-score indicate better performance.

VII-B Recommendation Performance Results

Table II shows the recommendation performance of BGTplanner and the other baselines under both the Gaussian mechanism (GM) and the Laplace mechanism (LM) with ϵutotal=10\epsilon_{u}^{total}=10. In Fig. 3, we further compare the final recommendation performance, together with its standard deviation, under various total privacy budgets. From Table II and Fig. 3, we make the following observations:

  1.

    BGTplanner consistently achieves the recommendation performance closest to FedSGD (which does not incorporate DP noises for gradient protection) across all experimental cases. In Table II, BGTplanner improves RMSE by an average of 3.40% with the GM and 10.12% with the LM. Additionally, in terms of the F1-score, BGTplanner outperforms the second-best baseline by an average of 0.71% across all datasets and noise mechanisms.

  2.

    As shown in Fig. 3, the performance gain diminishes as the total privacy budget increases: with a larger budget the injected noise variance becomes small, which shrinks the improvement space achievable by tuning per-round privacy budget consumption. Meanwhile, the results demonstrate the better training performance of BGTplanner under strict budget constraints, indicating that BGTplanner can allocate the privacy budget strategically with the CMAB method and the long-term budget Constrainter.

  3.

    DPSGD achieves the second-best performance among the DP-based algorithms in most cases when using the GM. This is because DPSGD uses RDP, which provides a tighter bound for tracking privacy loss during training. However, DPSGD allocates a uniform noise variance across iterations and thus fails to minimize the noise impact over the training process, which explains the superior performance of BGTplanner.

  4.

    ADPML and ADAP rely on heuristics or prior knowledge to set privacy budgets and adjust hyper-parameters. Such reliance is especially problematic under varying privacy requirements and makes it difficult for these methods to adapt to dynamic training contexts, which may lead to worse results.

VII-C Sensitivity Experiments on Hyper-parameters

Additionally, we conduct a series of experiments to analyze the sensitivity of BGTplanner to key hyper-parameters, specifically TminT_{min} and TT, which control the range of the action space and the maximum number of training rounds, respectively. The results are presented in Figs. 4-5, where all other parameters are held constant while ϵutotal\epsilon_{u}^{total} and TminT_{min} (or TT) are varied.

VII-C1 Overall Observations

In general, the experiments consistently show that as the total privacy budget increases, the model’s training performance improves. Moreover, the robustness of BGTplanner is evident from the relatively small variations in performance across different settings of TminT_{min} and TT, compared to baseline methods. This stability underscores the algorithm’s superiority in various training scenarios. Additionally, as illustrated in Figs. 4-5, BGTplanner enhances overall training efficiency as the total privacy cost increases for both Gaussian and Laplace mechanisms. However, this efficiency comes at the expense of greater privacy disclosure, highlighting the need to balance privacy and performance.

VII-C2 Sensitivity to TminT_{min}

Fig. 4 explores the impact of varying TminT_{min}, which sets the minimum number of training rounds. A smaller TminT_{min} implies a broader range of possible actions for the CMAB model, allowing it to allocate a larger portion of the privacy budget to each training round. Generally, we observe that reducing TminT_{min} tends to improve privacy budget allocation performance, as it provides more flexibility for distributing the budget across rounds. However, when the total privacy budget is limited, an excessively small TminT_{min} may cause the budget to be consumed rapidly, terminating training prematurely and yielding suboptimal performance.

VII-C3 Sensitivity to TT

Fig. 5 investigates the influence of the maximum number of training rounds TT on performance. The results reveal distinct trends for the Gaussian and Laplace mechanisms. Under the Gaussian mechanism, BGTplanner achieves better training efficiency with a larger TT in a large proportion of cases, owing to the advanced RDP method, which tracks privacy loss more tightly as the number of rounds increases. Conversely, under the Laplace mechanism, increasing TT tends to reduce training efficiency. This discrepancy likely arises because the basic composition theorem used with the Laplace mechanism is more sensitive to changes in TT, necessitating more careful budget allocation.
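
To make the contrast concrete, the following minimal sketch compares a uniform per-round budget under basic composition with the classical advanced composition bound from [26] for a Gaussian mechanism; advanced composition is used here only as an illustrative stand-in for the tighter RDP accountant that DPSGD actually employs, and all function names and numeric choices are illustrative.

```python
import numpy as np

def basic_composition_per_round(eps_total, T):
    # Basic (sequential) composition: per-round epsilons add up linearly,
    # so a uniform allocation leaves eps_total / T per round.
    return eps_total / T

def advanced_composition_total(eps_round, T, delta_prime=1e-5):
    # Classical advanced composition bound [26]:
    #   eps_total <= eps*sqrt(2*T*ln(1/delta')) + T*eps*(exp(eps) - 1).
    # Shown only to illustrate the ~sqrt(T) growth of the Gaussian-mechanism
    # privacy loss; DPSGD's RDP accountant is tighter than this bound.
    return (eps_round * np.sqrt(2 * T * np.log(1 / delta_prime))
            + T * eps_round * (np.exp(eps_round) - 1))

for T in (50, 100, 200):
    eps_lap = basic_composition_per_round(10.0, T)   # Laplace, basic composition
    eps_gm = advanced_composition_total(0.1, T)      # Gaussian, eps = 0.1 per round
    print(f"T={T:3d}  per-round eps (basic) = {eps_lap:.3f}  "
          f"total eps (advanced comp.) = {eps_gm:.2f}")
```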

Beyond the experiments presented above, we also examine the sensitivity of other crucial parameters in Appendix F, which illustrates how the hyper-parameters affect overall allocation performance. Overall, the results underscore the effectiveness of BGTplanner in achieving high model accuracy under various privacy budget conditions, demonstrating its superiority over the baseline methods.

VIII Conclusion

In this study, we devise BGTplanner, a novel online privacy budget allocation solution for differentially private federated recommenders (DPFRs) under dynamic contexts, addressing the challenging problem of privacy budget allocation in federated recommender systems. We formulate privacy budget allocation in federated recommenders (FRs) as a budget-constrained online problem, a novel contribution in this field. Gaussian process regression (GPR) is employed to predict the model learning progress, and a contextual multi-armed bandit (CMAB) algorithm makes the final budget allocation decisions on the fly with regret guarantees. The empirical evidence from extensive experiments demonstrates the superiority of our method over existing privacy budget allocation algorithms. To solidify and extend our findings, more comprehensive experimental analysis and the customization of privacy budgets for individual clients deserve exploration in future research.

References

  • [1] C. Li, A. He, Y. Wen, G. Liu, and A. T. Chronopoulos, “Optimal Trading Mechanism Based on Differential Privacy Protection and Stackelberg Game in Big Data Market,” IEEE Transactions on Services Computing, vol. 16, no. 5, pp. 3550–3563, 2023.
  • [2] M. Hu, D. Wu, R. Wu, Z. Shi, M. Chen, and Y. Zhou, “RAP: A Light-Weight Privacy-Preserving Framework for Recommender Systems,” IEEE Transactions on Services Computing, vol. 15, no. 5, pp. 2969–2981, 2022.
  • [3] C. Wu, F. Wu, L. Lyu, T. Qi, Y. Huang, and X. Xie, “A federated graph neural network framework for privacy-preserving personalization,” Nature Communications, vol. 13, no. 1, p. 3091, 2022.
  • [4] Y. Guo, H. Xie, Y. Miao, C. Wang, and X. Jia, “FedCrowd: A Federated and Privacy-Preserving Crowdsourcing Platform on Blockchain,” IEEE Transactions on Services Computing, vol. 15, no. 4, pp. 2060–2073, 2022.
  • [5] J. Wu, Y. Yang, M. Hu, Y. Zhou, and D. Wu, “FCER: A Federated Cloud-Edge Recommendation Framework with Cluster-based Edge Selection,” IEEE Transactions on Mobile Computing, vol. Early Access, pp. 1–13, 2024.
  • [6] N. Agrawal, A. K. Sirohi, S. Kumar, and J. , “No Prejudice! Fair Federated Graph Neural Networks for Personalized Recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 10, 2024, pp. 10775–10783.
  • [7] J. Bian, J. Huang, S. Ji, Y. Liao, X. Li, Q. Wang, J. Zhou, D. Dou, Y. Wang, and H. Xiong, “Feynman: Federated Learning-Based Advertising for Ecosystems-Oriented Mobile Apps Recommendation,” IEEE Transactions on Services Computing, vol. 16, no. 5, pp. 3361–3372, 2023.
  • [8] L. Melis, C. Song, E. D. Cristofaro, and V. Shmatikov, “Exploiting Unintended Feature Leakage in Collaborative Learning,” in IEEE Symposium on Security and Privacy (S&P).   IEEE, 2019, pp. 691–706.
  • [9] R. Liu, Y. Cao, Y. Wang, L. Lyu, Y. Chen, and H. Chen, “PrivateRec: Differentially Private Model Training and Online Serving for Federated News Recommendation,” in Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD).   ACM, 2023, pp. 4539–4548.
  • [10] C. Arroyo Arevalo, S. L. Noorbakhsh, Y. Dong, Y. Hong, and B. Wang, “Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 10, 2024, pp. 10909–10917.
  • [11] G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang, “Privacy at scale: Local differential privacy in practice,” in Proceedings of the International Conference on Management of Data (SIGMOD).   ACM, 2018, pp. 1655–1658.
  • [12] K. Ren, G. Liao, Q. Ma, and X. Chen, “Differentially Private Auction Design for Federated Learning With non-IID Data,” IEEE Transactions on Services Computing, vol. 17, no. 5, pp. 2236–2247, 2024.
  • [13] J. He, N. Wang, T. Xiang, Y. Wei, Z. Zhang, M. Li, and L. Zhu, “ABDP: Accurate Billing on Differentially Private Data Reporting for Smart Grids,” IEEE Transactions on Services Computing, vol. 17, no. 5, pp. 1938–1954, 2024.
  • [14] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. S. Quek, and H. Vincent Poor, “Federated Learning with Differential Privacy: Algorithms and Performance Analysis,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3454–3469, 2020.
  • [15] A. M. Girgis, D. Data, S. Diggavi, A. T. Suresh, and P. Kairouz, “On the Rényi Differential Privacy of the Shuffle Model,” in Proceedings of ACM SIGSAC Conference on Computer and Communications Security (CCS).   ACM, 2021, p. 2321–2341.
  • [16] X. Zhang, Y. Chen, C. Ma, Y. Fang, and I. King, “Influential Exemplar Replay for Incremental Learning in Recommender Systems,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 8, 2024, pp. 9368–9376.
  • [17] K. Pan and K. Feng, “Differential Privacy-Enabled Multi-Party Learning with Dynamic Privacy Budget Allocating Strategy,” Electronics, vol. 12, no. 3, p. 658, 2023.
  • [18] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the International Conference on Artificial intelligence and statistics (AISTATS), vol. 54.   PMLR, 2017, pp. 1273–1282.
  • [19] C. Chen, H. Wu, J. Su, L. Lyu, X. Zheng, and L. Wang, “Differential Private Knowledge Transfer for Privacy-Preserving Cross-Domain Recommendation,” in Proceedings of the ACM Web Conference (WWW).   ACM, 2022, pp. 1455–1465.
  • [20] J. Hong, Z. Wang, and J. Zhou, “Dynamic Privacy Budget Allocation Improves Data Efficiency of Differentially Private Gradient Descent,” in ACM Conference on Fairness, Accountability, and Transparency (FAccT).   ACM, 2022, pp. 11–35.
  • [21] D. B. Ami, K. Cohen, and Q. Zhao, “Client Selection for Generalization in Accelerated Federated Learning: A Multi-Armed Bandit Approach,” 2023. [Online]. Available: https://arxiv.org/abs/2303.10373
  • [22] C. Shi, C. Shen, and J. Yang, “Federated Multi-armed Bandits with Personalization,” in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS).   PMLR, 2021, pp. 2917–2925.
  • [23] C. Wu, Y. Zhu, R. Zhang, Y. Chen, F. Wang, and S. Cui, “FedAB: Truthful Federated Learning With Auction-Based Combinatorial Multi-Armed Bandit,” IEEE Internet of Things Journal, vol. 10, no. 17, pp. 15159–15170, 2023.
  • [24] X. Feng, W. Cheng, C. Cao, L. Wang, and V. S. Sheng, “DPFLA: Defending Private Federated Learning Against Poisoning Attacks,” IEEE Transactions on Services Computing, vol. 17, no. 4, pp. 1480–1491, 2024.
  • [25] L. Xiao, X. Zhang, D. Wu, M. Hu, Y. Zhou, and S. Yu, “History-Aware Privacy Budget Allocation for Model Training on Evolving Data-Sharing Platforms,” IEEE Transactions on Services Computing, vol. Early Access, pp. 1–15, 2024.
  • [26] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
  • [27] H. Zhou, H. Wang, Z. Yu, G. Bin, M. Xiao, and J. Wu, “Federated Distributed Deep Reinforcement Learning for Recommendation-enabled Edge Caching,” IEEE Transactions on Services Computing, vol. Early Access, pp. 1–17, 2024.
  • [28] T. Qi, F. Wu, C. Wu, Y. Huang, and X. Xie, “Privacy-Preserving News Recommendation Model Learning,” in Findings of the Association for Computational Linguistics: EMNLP 2020.   ACL, 2020, pp. 1423–1432.
  • [29] M. Nasr, R. Shokri, and A. Houmansadr, “Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning,” in IEEE Symposium on Security and Privacy (S&P).   IEEE, 2019, pp. 739–753.
  • [30] V. Shejwalkar, A. Houmansadr, P. Kairouz, and D. Ramage, “Back to the Drawing Board: A Critical Evaluation of Poisoning Attacks on Production Federated Learning,” in IEEE Symposium on Security and Privacy (S&P).   IEEE, 2022, pp. 1354–1371.
  • [31] M. Abadi, A. Chu, I. J. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep Learning with Differential Privacy,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS).   ACM, 2016, pp. 308–318.
  • [32] Y. Yang, Y. Zhou, M. Hu, D. Wu, and Q. Z. Sheng, “BARA: Efficient Incentive Mechanism with Online Reward Budget Allocation in Cross-Silo Federated Learning,” in Proceedings of International Joint Conference on Artificial Intelligence (IJCAI).   Morgan Kaufmann, 2023, pp. 4478–4485.
  • [33] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design,” in Proceedings of International Conference on Machine Learning (ICML).   Omnipress, 2010, pp. 1015–1022.
  • [34] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning, ser. Adaptive computation and machine learning.   MIT Press, 2006. [Online]. Available: https://www.worldcat.org/oclc/61285753
  • [35] T. Ouyang, K. Zhao, X. Zhang, Z. Zhou, and X. Chen, “Dynamic Edge-centric Resource Provisioning for Online and Offline Services Co-location,” in IEEE Conference on Computer Communications (INFOCOM).   IEEE, 2023, pp. 1–10.
  • [36] Y. Han, J. Zeng, Y. Wang, Y. Xiang, and J. Zhang, “Optimal Contextual Bandits with Knapsacks under Realizability via Regression Oracles,” in International Conference on Artificial Intelligence and Statistics (AISTATS).   PMLR, 2023, pp. 5011–5035.
  • [37] E. Hazan et al., “Introduction to online convex optimization,” Foundations and Trends® in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016.
  • [38] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
  • [39] S. Agrawal and N. R. Devanur, “Linear contextual bandits with knapsacks,” in Annual Conference on Neural Information Processing Systems (NeurIPS).   Curran Associates Inc., 2016, p. 3458–3467.
  • [40] F. Monti, M. M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in Proceedings of International Conference on Neural Information Processing Systems (NIPS).   Curran Associates Inc., 2017, p. 3700–3710.
  • [41] F. M. Harper and J. A. Konstan, “The MovieLens Datasets: History and Context,” ACM Transactions on Interactive Intelligent Systems, vol. 5, no. 4, 2015.
  • [42] J. Fu, Z. Chen, and X. Han, “Adap DP-FL: Differentially Private Federated Learning with Adaptive Noise,” in IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom).   IEEE, 2022, pp. 656–663.
  • [43] W. Yang, Y. Zhou, M. Hu, D. Wu, X. Zheng, J. H. Wang, S. Guo, and C. Li, “Gain without pain: Offsetting dp-injected noises stealthily in cross-device federated learning,” IEEE Internet of Things Journal, vol. 9, no. 22, pp. 22147–22157, 2022.
[Uncaptioned image] Xianzhi Zhang received his B.S. degree from Nanchang University (NCU), Nanchang, China, in 2019 and his M.S. degree from the School of Computer Science and Engineering, Sun Yat-sen University (SYSU), Guangzhou, China, in 2022. He is currently pursuing a Ph.D. degree in the School of Computer Science and Engineering at Sun Yat-sen University, Guangzhou, China, and is a visiting PhD student at the School of Computing, Macquarie University, Sydney, Australia. His current research interests include video caching, privacy protection, federated learning, and edge computing. His research has been published in IEEE TPDS and TSC, and won the Best Paper Award at PDCAT 2021.
[Uncaptioned image] Yipeng Zhou is a senior lecturer in computer science with the School of Computing at Macquarie University, and the recipient of an ARC DECRA in 2018. From Aug. 2016 to Feb. 2018, he was a research fellow with the Institute for Telecommunications Research (ITR) of the University of South Australia. From Sept. 2013 to Sept. 2016, he was a lecturer with the College of Computer Science and Software Engineering, Shenzhen University. He was a Postdoctoral Fellow with the Institute of Network Coding (INC) of The Chinese University of Hong Kong (CUHK) from Aug. 2012 to Aug. 2013. He received his PhD and MPhil degrees from the Information Engineering (IE) Department of CUHK, and his Bachelor's degree in Computer Science from the University of Science and Technology of China (USTC). His research interests lie in federated learning, privacy protection, and caching algorithm design in networks. He has published more than 80 papers in venues including IEEE INFOCOM, ICNP, IWQoS, IEEE ToN, JSAC, TPDS, TMC, and TMM.
[Uncaptioned image] Miao Hu is currently an Associate Professor with the School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China. He received the B.S. degree and the Ph.D. degree in communication engineering from Beijing Jiaotong University, Beijing, China, in 2011 and 2017, respectively. From Sept. 2014 to Sept. 2015, he was a Visiting Scholar with the Pennsylvania State University, PA, USA. His research interests include edge/cloud computing, multimedia communication and software defined networks.
[Uncaptioned image] Di Wu received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2000, the M.S. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2003, and the Ph.D. degree in computer science and engineering from the Chinese University of Hong Kong, Hong Kong, in 2007. He was a Post-Doctoral Researcher with the Department of Computer Science and Engineering, Polytechnic Institute of New York University, Brooklyn, NY, USA, from 2007 to 2009, advised by Prof. K. W. Ross. Dr. Wu is currently a Professor and the Associate Dean of the School of Computer Science and Engineering with Sun Yat-sen University, Guangzhou, China. He was the recipient of the IEEE INFOCOM 2009 Best Paper Award and the IEEE Jack Neubauer Memorial Award, among others. His research interests include edge/cloud computing, multimedia communication, Internet measurement, and network security.
[Uncaptioned image] Pengshan Liao received the BE degree and the MS degree from Sun Yat-sen University (SYSU), Guangzhou, China. His research interests include federated learning and recommendation system.
[Uncaptioned image] Mohsen Guizani received the B.S. (Hons.) and first M.S. degrees in electrical engineering, and the M.S. and Ph.D. degrees in computer engineering from Syracuse University, Syracuse, NY, USA, in 1984, 1986, 1987, and 1990, respectively. He is currently a Professor with the Machine Learning Department, Mohamed Bin Zayed University of Artificial Intelligence, and the Computer Science and Engineering Department, Qatar University, Qatar. Previously, he has served in different academic and administrative positions with the University of Idaho, Western Michigan University, the University of West Florida, the University of Missouri-Kansas City, the University of Colorado Boulder, and Syracuse University. He is the author of nine books and more than 600 publications in refereed journals and conferences. His research interests include wireless communications and mobile computing, computer networks, mobile cloud computing, security, and smart grid. Throughout his career, he received three teaching awards and four research awards. He was a recipient of the 2017 IEEE Communications Society Wireless Technical Committee Recognition Award, the 2018 Ad Hoc Technical Committee Recognition Award for his contribution to outstanding research in wireless communications and Ad-Hoc Sensor networks, and the 2019 IEEE Communications and Information Security Technical Recognition Award for outstanding contributions to the technological advancement of security. He has guest edited a number of special issues in IEEE journals and magazines. He is also the Editor-in-Chief of IEEE Network Magazine. He serves on the editorial boards of several international technical journals, and is the Founder and Editor-in-Chief of Wireless Communications and Mobile Computing (Wiley). He also served as a member, the chair, and the general chair of a number of international conferences. He was the Chair of the IEEE Communications Society Wireless Technical Committee and the Chair of the TAOS Technical Committee. He has served as an IEEE Computer Society Distinguished Speaker and is currently an IEEE ComSoc Distinguished Lecturer. He is a Senior Member of ACM.
[Uncaptioned image] Michael Sheng is a Distinguished Professor and Head of School of Computing at Macquarie University. Before moving to Macquarie, Michael spent 10 years at School of Computer Science, the University of Adelaide, serving in senior leadership roles such as Interim Head of School and Deputy Head of School. Michael holds a PhD degree in computer science from the University of New South Wales (UNSW) and did his post-doc as a research scientist at CSIRO ICT Centre. From 1999 to 2001, Sheng also worked at UNSW as a visiting research fellow. Prior to that, he spent 6 years as a senior software engineer in industries. Prof. Michael Sheng’s research interests include Web of Things, Internet of Things, Big Data Analytics, Web Science, Service-oriented Computing, Pervasive Computing, and Sensor Networks. He is ranked by Microsoft Academic as one of the Most Impactful Authors in Services Computing (ranked Top 5 all time worldwide) and in Web of Things (ranked Top 20 all time). He is the recipient of the AMiner Most Influential Scholar Award on IoT (2019), ARC Future Fellowship (2014), Chris Wallace Award for Outstanding Research Contribution (2012), and Microsoft Research Fellowship (2003). Prof Michael Sheng is Vice Chair of the Executive Committee of the IEEE Technical Community on Services Computing (IEEE TCSVC), the Associate Director (Smart Technologies) of Macquarie’s Smart Green Cities Research Centre, and a member of the ACS Technical Advisory Board on IoT.