FeDXL: Provable Federated Learning for Deep X-Risk Optimization
Abstract
In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of X-risks, to which no existing FL algorithms are applicable. In particular, the objective has the form of , where two sets of data are distributed over multiple machines, is a pairwise loss that only depends on the prediction outputs of the input data pairs . This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss, and partial AUROC maximization with a compositional loss. The challenges for designing an FL algorithm for X-risks lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. To this end, we propose an active-passive decomposition framework that decouples the gradient’s components with two types, namely active parts and passive parts, where the active parts depend on local data that are computed with the local model and the passive parts depend on other machines that are communicated/computed based on historical models and samples. Under this framework, we design two FL algorithms (FeDXL) for handling linear and nonlinear , respectively, based on federated averaging and merging and develop a novel theoretical analysis to combat the latency of the passive parts and the interdependency between the local model parameters and the involved data for computing local gradient estimators. We establish both iteration and communication complexities and show that using the historical samples and models for computing the passive parts do not degrade the complexities. We conduct empirical studies of FeDXL for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.
1 Introduction
This work is motivated by solving the following optimization problem arising in many ML applications in a federated learning (FL) setting:
(1) |
where and denote two sets of data points that are distributed over many machines, denotes the model of a prediction function , is a deterministic function that could be linear or non-linear (possibly non-convex), and denotes a pairwise loss that only depends on the prediction outputs of the input data . The above problem belongs to a broader family of machine learning problems called deep X-risk optimization (DXO) (Yang, 2022). We provide details of some X-risk minimization applications in Appendix B.
When is a linear function, the above problem is the classic pairwise loss minimization problem, which has applications in AUROC (AUC) maximization (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b; Yang & Ying, 2022), bipartite ranking (Cohen et al., 1997; Clémençon et al., 2008; Kotlowski et al., 2011; Dembczynski et al., 2012), and distance metric learning (Radenović et al., 2016; Wu et al., 2017). When is a non-linear function, the above problem is a special case of finite-sum coupled compositional optimization problem (Wang & Yang, 2022), which has found applications in various performance measure optimization such as partial AUC maximization (Zhu et al., 2022), average precision maximization (Qi et al., 2021; Wang et al., 2022), NDCG maximization (Qiu et al., 2022), p-norm push optimization (Rudin, 2009; Wang & Yang, 2022) and contrastive loss optimization (Goldberger et al., 2004; Yuan et al., 2022).
This is in sharp contrast with most existing studies on FL algorithms (Yang, 2013; Konečnỳ et al., 2016; McMahan et al., 2017; Kairouz et al., 2021; Smith et al., 2018; Stich, 2018; Yu et al., 2019a, b; Khaled et al., 2020; Woodworth et al., 2020b, a; Karimireddy et al., 2020b; Haddadpour et al., 2019), which focus on the following empirical risk minimization (ERM) problem with the data set distributed over different machines:
(2) |
The major differences between DXO and ERM are (i) the ERM’s objective is decomposable over training data, while the DXO is not; and (ii) the data-dependent losses in ERM are decoupled between different data points; in contrast the data-dependent loss in DXO couples different training data points. These differences pose a big challenge for DXO in the FL setting where the training data are distributed on different machines and are prohibited to be moved to a central server. In particular, the gradient of X-risk cannot be written as the sum of local gradients at individual machines that only depend on the local data in those machines. Instead, the gradient of DXO at each machine not only depends on local data but also on data in other machines. As a result, the design of communication-efficient FL algorithms for DXO is much more complicated than that for ERM. In addition, the presence of non-linear function makes the algorithm design and analysis even more challenging than that with linear . There are two levels of coupling in DXO with nonlinear with one level at the pairwise loss and another level at the non-linear risk of , which makes estimation of stochastic gradient more tricky.
Although DXO can be solved by existing algorithms in a centralized learning setting (Hu et al., 2020; Wang & Yang, 2022), extension of the existing algorithms to the FL setting is non-trivial. This is different from the extension of centralized algorithms for ERM problems to the FL setting. In the design and analysis of FL algorithms for ERM, the individual machines compute local gradients and update local models and communicate periodically to average models. The rationale of local FL algorithms for ERM is that as long as the gap error between local models and the averaged model is on par with the noise in the stochastic gradients by controlling the communication frequency, the convergence of local FL algorithms will not be sacrificed and is able to enjoy the parallel speed-up of using multiple machines. However, this rationale is not sufficient for developing FL algorithms for DXO optimization due to the challenges mentioned above.
To address these challenges, we propose two novel FL algorithms named FeDXL1 and FeDXL2 for DXO with linear and non-linear , respectively. The main innovation in the algorithm design lies at an active-passive decomposition framework that decouples the gradient of the objective into two types, active parts and passive parts. The active parts depend on data in local machines and the passive parts depend on data in other machines. We estimate the active parts using the local data and the local model and estimate the passive parts using the information with delayed communications from other machines that are computed at historical models in the previous round. In terms of analysis, the challenge is that the model used in the computation of stochastic gradient estimator depends on the (historical) samples for computing the passive parts at the current iteration, which is only exacerbated in the presence of non-linear function . We develop a novel analysis that allows us to transfer the error of the gradient estimator into the latency error of the passive parts and the gap error between local models and the global model. Hence, the rationale is that as long as the latency error of the passive parts and the gap error between local models and the global model is on par with the noise in the stochastic gradient estimator we are able to achieve convergence and linear speed-up.

The main contributions of this work are as follows:
-
•
We propose two novel communication-efficient algorithms, FeDXL1 and FeDXL2, for DXO with linear and nonlinear , respectively, based on federated averaging and merging. Besides communicating local models for federated averaging, the proposed algorithms need to communicate local prediction outputs only periodically for federated merging to enable the computing of passive parts. The diagram of the proposed FeDXL algorithms is shown in Figure 1.
-
•
We perform novel technical analysis to prove the convergence of both algorithms. We show that both algorithms enjoy parallel speed-up in terms of the iteration complexity, and a lower-order communication complexity.
-
•
We conduct empirical studies on two tasks for federated deep partial AUC optimization with a compositional loss and federated deep AUC optimization with a pairwise loss, and demonstrate the advantages of the proposed algorithms over several baselines.
2 Related Work
FL for ERM. The challenge of FL is how to utilize the distributed data to learn a ML model with light communication cost without harming the data privacy (Konečnỳ et al., 2016; McMahan et al., 2017). To reduce the communication cost, many algorithms have been proposed to skip communications (Stich, 2018; Yu et al., 2019a, b; Yang, 2013; Karimireddy et al., 2020b) or compress the communicated statistics (Stich et al., 2018; Basu et al., 2019; Jiang & Agrawal, 2018; Wangni et al., 2018; Bernstein et al., 2018). Tight analysis has been performed in various studies (Kairouz et al., 2021; Yu et al., 2019a, b; Khaled et al., 2020; Woodworth et al., 2020b, a; Karimireddy et al., 2020b; Haddadpour et al., 2019). However, most of these works target at ERM.
FL for Non-ERM Problems. In (Guo et al., 2020; Yuan et al., 2021a; Deng & Mahdavi, 2021; Deng et al., 2020; Liu et al., 2020; Sharma et al., 2022), federated minimax optimization algorithms have been studied, which are not applicable to our problem when is non-convex. Gao et al. (2022) considered a much simpler federated compositional optimization in the form of , where denotes the machine index. Compared with the X-risk, their objective does not involve interdependence between different machines. Li et al. (2022); Huang et al. (2022) analyzed FL algorithms for bi-level problems where only the low-level objective involves distribution over many machines. Tarzanagh et al. (2022) considered another federated bilevel problem, where both upper and lower level objective are distributed over many machines, but the lower level objective is not coupled with the data in the upper objective. Xing et al. (2022) studied a federated bilevel optimization in a server-clients setting, where the central server solves an objective that depends on optimal solutions of local clients. Our problem cannot be mapped into these federated bilevel optimization problems. There are works that optimize non-ERM problems using local data or data from other machines, which are mostly adhoc and lack of theoretical guarantees (Han et al., 2022; Zhang et al., 2020; Wu et al., 2022; Li & Huang, 2022).
Method | Sample Complexity | Setting | |
BSGD (Hu et al., 2020) | Inner Expectation + Outer Expectation | ||
BSpiderBoost (Hu et al., 2020) | Inner Expectation + Outer Expectation | ||
Centralized | SOX (Wang & Yang, 2022) | Inner Expectation + Outer Finite-sum | |
MSVR (Jiang et al., 2022) | Inner Expectation + Outer Finite-sum | ||
MSVR (Jiang et al., 2022) | Inner Finite-sum + Outer Finite-sum | ||
Federated | This Work | Inner Expectation + Outer Finite-sum |
Centralized Algorithms for DXO. In the centralized setting DXO has been considered in recent works (Qi et al., 2021; Wang et al., 2022; Wang & Yang, 2022; Qiu et al., 2022). In particular, Wang & Yang (2022) have proposed a stochastic algorithm named SOX for solving (1) and achieved state-of-the-art sample complexity of to ensure the expected convergence to an -stationary point. Nevertheless, it is non-trivial to extend the centralized algorithms to the FL setting due to the challenges mentioned earlier. Recently, (Jiang et al., 2022) further proposed an advanced variance-reduction technique named MSVR to improve the sample complexity of solving finite-sum coupled compositional optimization problems. We provide a summary of state-of-the-art sample complexities for solving DXO in both centralized and FL setting in Table 1.
3 FeDXL for DXO
We assume are split into non-overlapping subsets that are distributed over clients 111We use clients and machines interchangeably., i.e., and . We denote by . Denote by and the partial gradients in terms of the first argument and the second argument, respectively. Without loss of generality, we assume the dimensionality of is 1 (i.e., ) in the following presentation. Notations used in the algorithms are summarized in Appendix A.
3.1 FeDXL1 for DXO with linear
We consider the following FL objective for DXO:
(3) |
To highlight the challenge and motivate FeDXL, we decompose the gradient of the objective function into:
Let . Then .
With the above decomposition, we can see that the main task at the local client is to estimate the gradient terms and . Due to the symmetry between and , below, we only use as an illustration for explaining the proposed algorithm. The difficulty in computing lies at it relies on data in other machines due to the presence of for all . To overcome this difficulty, we decouple the data-dependent factors in into two types marked by green and blue shown below:
(4) |
It is notable that the three green terms can be estimated or computed based the local data. In particular, local1 can be estimated by sampling data from and local2 and local3 can be computed based on the sampled data and the local model parameter. The difficulty springs from estimating and computing the two blue terms that depend on data on all machines. We would like to avoid communicating at every iteration for estimating the blue terms as each communication would incur additional communication overhead. To tackle this, we propose to leverage the historical information computed in the previous round 222A round is defined as a sequence of local updates between two consecutive communications.. To put this into context of optimization, we consider the update at the -th iteration during the -th round, where . Let denote the local model in -th client at the -th iteration within -th round. Let denote the data sampled at the -th iteration from and , respectively. Each local machine will compute and , which will be used for computing the active parts. Across all iterations , we will accumulate the computed prediction outputs over sampled data and stored in two sets and . At the end of round , we will communicate and and to the central server, which will average the local models to get a global model and also merge and . These merged information will be broadcast to each individual client. Then, at the -th iteration in the -th round, we estimate the blue term by sampling without replacement and compute an estimator of by
(5) |
where represents a random variable that captures the randomness in the sampled client , iteration index and data sample , which is used for estimating the global1 in (4). We refer to the green factors in as the active parts and the blue factor in as the passive part. Similarly, we can estimate by
(6) |
where is a randomly sampled prediction output in the previous round with representing a random variable including a client sample and iteration sample and the data sample . Then we will update the local model parameter by using a gradient estimator .
We present the detailed steps of the proposed algorithm FeDXL1 in Algorithm 1. Several remarks are following: (i) at every round, the algorithm needs to communicate both the model parameters and the historical prediction outputs and , where is constructed by collecting all or sub-sampled computed predictions in the -th round. The bottom line for constructing is to ensure that contains at least independently sampled predictions that are from the previous round on all machines such that the corresponding data samples involved in can be used to approximate times. Hence, to keep the communication costs minimal, each client at least needs to sample sampled predictions from all iterations and send them to the server for constructing , which is then broadcast to all clients for computing the passive parts in the round . As a result, the minimal communication costs per-round per-client is . Nevertheless, for simplicity in Algorithm 1 we simply put all historical predictions into .
Similar to all other FL algorithms, FeDXL1 does not require communicating the raw input data, hence protects the privacy of the data. However, compared with most FL algorithms for ERM, FeDXL1 for DXO has an additional communication overhead at least which depends on the dimensionality of prediction output . For learning a high-dimensional model (e.g. deep neural network with ) with score-based pairwise losses (), the additional communication cost could be marginal. For updating the buffer and , we can simply flush the history and add the newly received and with random shuffling to and , respectively.
For analysis, we make the following assumptions regarding the DXO with linear problem, i.e., problem (3).
Assumption 3.1.
-
•
is differentiable, -smooth and -Lipschitz.
-
•
is differentiable, -smooth and -Lipschitz on for any .
-
•
.
-
•
such that .
The first three assumptions are standard in the optimization of DXO problems (Wang & Yang, 2022). The last assumption embodies the data heterogeneity that is also common in federated learning (Yu et al., 2019a; Karimireddy et al., 2020b). Next, we present the theoretical results on the convergence of FeDXL1.
Remark. To get , we just need to set , and . The number of communications is much less than the total number of iterations i.e., as long as . And the sample complexity on each machine is , which is linearly reduced by the number of machines .
Novelty of Analysis. As the passive parts are computed in different machines in a previous round, the gradient estimators and will involve the dependency between the local model parameter and the historical data contained in used for computing and , which makes the analysis more involved. We need to make sure that using the gradient estimator based on them can still result in “good” results. To this end, we borrow an analysis technique in (Yang et al., 2021b) to decouple the dependence between the current model parameter and the data used for computing the current gradient estimator, in which they used data in previous iteration to couple the data in the current iteration in order to compute a gradient of the pairwise loss . Nevertheless, in federated DXO controlling the error brought by the passive parts is more challenging since the delay is much longer and they were computed on different machines. In our analysis, we replace with to decouple the dependence between the model parameter and the historical data , then we need to control the latency error and the gap error between different machines such that the complexities are not compromised.
3.2 FeDXL2 for optimizing DXO with nonlinear
With nonlinear , we consider the following FL problem of DXO minimization,
(8) |
We compute the gradient and decompose it into:
(9) |
where
(10) |
Let . Then we have .
Compared to that in (4) for DXO with linear , the term above involves another factor , which cannot be computed locally as it depends on distributed over all machines. Similarly, the term above involves another non-locally computable factor . To address the challenge of estimating , we leverage the similar technique in the centralized setting (Wang & Yang, 2022) by tracking it using a moving average estimator based on random samples. In a centralized setting, one can maintain and update for estimating by
where is a random sample from . However, this is not possible in an FL setting as is distributed over many machines. To tackle this, we leverage the delay communication technique used in the last subsection. At the -th iteration in the -th round, we update for a sampled by
(11) |
where is a random sample from where captures the randomness in client, iteration index and data sample in the last round. Then, we can use in place of for estimating . However, it is more nuanced for estimating in since is not local random data. To address this, we propose to communicate . Then at the -iteration in the -th round of the -th client, we can estimate with a random sample from denoted by , where , i.e., by using . Then we estimate and by
(12) | ||||
(13) | ||||
where are random variables. Another difference from DXO with linear is that even in the centralized setting directly using will lead to a worse complexity due to that non-linear make the stochastic gradient estimator biased (Wang et al., 2017). Hence, in order to improve the convergence, we follow existing state-of-the-art algorithms for stochastic compositional optimization (Ghadimi et al., 2020; Wang & Yang, 2022) to compute a moving average estimator for the gradient at local machines, i.e., Step 17 in Algorithm 2. With these changes, we present the detailed steps of FeDXL2 for solving DXO with non-linear in Algorithm 2. The buffers and are updated similar to that for FeDXL1. Different from FeDXL1, there is an additional communication cost for communicating and an additional buffer at each local machine to store the received from aggregated . Nevertheless, these additional costs are marginal compared with communicating and maintaining the buffer .
We make the following assumptions regarding problem (8).
Assumption 3.3.
-
•
is differentiable, -smooth and -Lipschitz. .
-
•
is differentiable, -smooth and -Lipschitz.
-
•
is differentiable, -smooth and -Lipschitz on for any .
-
•
.
-
•
such that .
We present the convergence result of FeDXL2 below.
Theorem 3.4.
Remark. To get , we just set , , , and . The number of communications is less than the total number of iterations i.e., by a factor of . And the sample complexity on each machine is , which is less than that in (Wang & Yang, 2022) which has a sample complexity of . When the data are evenly distributed on different machines, we have achieved a linear speedup property. And in an extreme case where all data are on one machine, the sample complexity of FeDXL2 matches that in (Wang & Yang, 2022), which is expected. Compared with FeDXL1, the analysis of FeDXL2 has to deal with extra difficulties. First, with non-linear , the coupling between the inner function and outer function adds to the complexity of interdependence between different rounds and machines. Second, we have to deal with the error for the passive part related to .
Our analysis for FeDXL2 with moving average gradient estimator is different from previous studies for local momentum methods for ERM problems(Yu et al., 2019a; Karimireddy et al., 2020a), which used a fixed momentum parameter. In contrast, in FeDXL2 the momentum parameter is decreasing as increases, which is similar to centralized algorithms compositional problems (Ghadimi et al., 2020; Wang & Yang, 2022).
, | Centralized (OPAUC Loss) | Local SGD (CE Loss) | CODASCA (Min-Max AUC) | Local Pair (OPAUC Loss) | FeDXL2 (OPAUC Loss) | |
Cifar10 | FPR 0.3 | 0.76550.0039 | 0.68250.0047 | 0.72880.0035 | 0.74870.0059 | 0.75800.0034 |
FPR 0.5 | 0.80320.0039 | 0.72790.0050 | 0.77020.0029 | 0.78880.0052 | 0.79780.0026 | |
Cifar100 | FPR 0.3 | 0.62870.0037 | 0.58750.0016 | 0.61310.0054 | 0.62810.0032 | 0.63320.0024 |
FPR 0.5 | 0.64870.0026 | 0.61240.0021 | 0.64060.0041 | 0.65690.0017 | 0.66230.0022 | |
CheXpert | FPR 0.3 | 0.72200.0035 | 0.64950.0039 | 0.69030.0059 | 0.69020.0053 | 0.73440.0042 |
FPR 0.5 | 0.78610.0040 | 0.70170.0042 | 0.77700.0071 | 0.74830.0033 | 0.79180.0037 | |
ChestMNIST | FPR 0.3 | 0.63440.0053 | 0.59040.0012 | 0.60710.0040 | 0.58020.0039 | 0.62280.0048 |
FPR 0.5 | 0.66220.0029 | 0.60720.0034 | 0.62720.0038 | 0.60260.0025 | 0.64900.0039 |
, | Centralized (PSM Loss) | Local SGD (CE Loss) | CODASCA ( Min-Max AUC) | Local Pair (PSM Loss) | FeDXL1 (PSM Loss) |
Cifar10 | 0.73520.0043 | 0.65010.0024 | 0.64070.0044 | 0.72870.0027 | 0.73440.0038 |
Cifar100 | 0.61140.0038 | 0.57000.0031 | 0.59500.0039 | 0.61750.0045 | 0.62080.0041 |
CheXpert | 0.81490.0031 | 0.67820.0032 | 0.70620.0085 | 0.79240.0043 | 0.84310.0027 |
ChestMNIST | 0.72270.0026 | 0.56420.0041 | 0.65090.0033 | 0.67660.0019 | 0.69250.0030 |
4 Experiments
To verify our theories, we experiment on two tasks: federated deep partial AUC maximization and federated deep AUC maximization with a pairwise surrogate loss, which corresponds to (1) with non-linear and linear , respectively. Code is released at https://github.com/Optimization-AI/ICML2023_FeDXL.
Datasets and Neural Networks. We use four datasets: Cifar10, Cifar100 (Krizhevsky et al., 2009), CheXpert (Irvin et al., 2019), and ChestMNIST (Yang et al., 2021a), where the latter two datasets are large-scale medical image data. For Cifar10 and Cifar100, we sample 20% of the training data as validation set, and construct imbalanced binary versions with positive:negative = 1:5 in the training set similar to (Yuan et al., 2021b). For CheXpert, we consider the task of predicting Consolidation and use the last 1000 images in the training set as the validation set and use the original validation set as the testing set. For ChestMNIST, we consider the task of Mass prediction and use the provided train/valid/test split. We distribute training data to machines unless specified otherwise. To increase the heterogeneity of data on different machines, we add random Gaussian noise of to all training images, where that varies on different machines, i.e., for the -th machine out of the machines, its . We train ResNet18 from scratch for CIFAR-10 and CIFAR-100 data, and initialize DenseNet121 by an ImageNet pretrained model for CheXpert and ChestMNIST. We use the PyTorch framework (Paszke et al., 2019).
Baselines. We compare our algorithms with three local baselines: 1) Local SGD which optimizes a Cross-Entropy loss using classical local SGD algorithm; 2) CODASCA - a state-of-the-art FL algorithm for optimizing a min-max formulated AUC loss (Yuan et al., 2021a); and 3) Local Pair which optimizes the X-risk using only local pairs. As a reference, we also compare with the Centralized methods, i.e., mini-batch SGD for DXO with linear and SOX for DXO with non-linear . We tune the initial step size in using grid search and decay it by a factor of 0.1 every 5K iterations. All algorithms are run for 20k iterations. The mini-batch sizes (as in Step 11 of FeDXL1 and FeDXL2) are set to 32. The parameter of FeDXL2 (and corresponding Local Pair and Centralized method) is set to . In the Centralized method, we tune the batch size and from in an effort to benchmark the best performance.For CODASCA and Local SGD which are not using pairwise losses, we set the batch size to 64 for fair comparison with FeDXL. For all the non-centralized algorithms, we set the communication interval unless specified otherwise. In every run, we use the validation set to select the best performing model and finally use the selected model to evaluate on the testing set. For each algorithm, we repeat 3 times with different random seeds and report the averaged performance.
FeDXL2 for Federated Deep Partial AUC Maximization.
We consider the task of one way partial AUC maximization, which refers to the area under the ROC curve with false positive rate (FPR) restricted to be less than a threshold. We consider the KL-OPAUC loss function proposed in (Zhu et al., 2022),
(14) |
where denotes the set of positive data, denotes the set of negative data and where is a parameter tuned in . The experimental results are reported in Table 3. We can see: (i) FeDXL2 is better than all local methods (i.e., Local SGD, Local Pair and CODASCA), and achieves competitive performance as the Centralized method, which indicates the our algorithm can effectively utilize data on all machines. The better performance of FeDXL2 on CIFAR100 and CheXpert than the Centralized method is probably due to that the Centralized method may overfit the training data; (ii) FeDXL2 is better than the Local Pair method, which implies that using data pairs from all machines are helpful for improving the performance in terms of partial AUC maximization; and (iii) FeDXL2 is better than CODASCA, which is not surprising since CODASCA is designed to optimize AUC loss, while FeDXL2 is used to optimize partial AUC loss.
FeDXL1 for Federated Deep AUC maximization with Corrupted Labels. Second, we consider the task of federated deep AUC maximization. Since deep AUC maximization for solving a min-max loss (an equivalent form for the pairwise square loss) has been developed in previous works (Yuan et al., 2021a), we aim to justify the benefit of using the general pairwise loss formulation. According to (Charoenphakdee et al., 2019), a symmetric loss can be more robust to data with corrupted labels for AUC maximization, where a symmetric loss is one such that is a constant. Since the square loss is not symmetric, we conjecture that that min-max federated deep AUC maximization algorithm CODASCA is not robust to the noise in labels. In contrast, our algorithm FeDXL1 can optimize a symmetric pairwise loss; hence we expect FeDXL1 is better than CODASCA in the presence of corrupted labels. To verify this hypothesis, we generate corrupted data by flipping the labels of 20% of both the positive and negative training data. We use FeDXL1/Local Pair to optimize the symmetric pairwise sigmoid (PSM) loss (Calders & Jaroszewicz, 2007), which corresponds to (1) with linear and , where is a positive data score and is a negative data score. Specifically,
where denotes the set of positive data, denotes the set of negative data and . The results are reported in Table 3. We observe that FeDXL1 is more robust to label noises compared to other local methods, including Local SGD, Local Pair, and CODASCA that optimizes a min-max AUC loss. As before, FeDXL1 has competitive performance compared with the Centralized method.
The running time comparison, statistics of data, and ablation studies are in Appendix C.
5 Conclusion
We have considered federated learning (FL) for deep X-risk optimization. We have developed communication-efficient FL algorithms to alleviate the interdependence between different machines. Novel convergence analysis is performed to address the technical challenges and to improve both iteration and communication complexities of proposed algorithms. We have conducted empirical studies of the proposed FL algorithms for solving deep partial AUC maximization and deep AUC maximization and achieved promising results compared with several baselines.
6 Limitations and Potential Negative Societal Impacts
While the current communication complexity is , there may still be room for improvement to further reduce the communication cost because the state-of-the-art communication complexity for federated ERM problems is . Our experimental results indicate that FeDXL may offer better generalization performance than centralized algorithms. However, a more rigorous analysis is necessary to better understand this phenomenon and leverage it effectively. While this work has verified the performance of FeDXL on partial AUC maximization and AUC maximization problems, more studies are needed to test FeDXL on other federated DXO problems and beyond. We do not see any potential negative societal impact.
Acknowledgements
We appreciate the feedback provided by the anonymous reviewers. This work has been partially supported by NSF Career Award 2246753, NSF Grant 2246757 and NSF Grant 2246756.
References
- Basu et al. (2019) Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019.
- Bernstein et al. (2018) Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signsgd: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018.
- Boyd et al. (2013) Boyd, K., Eng, K. H., and Page, C. D. Area under the precision-recall curve: point estimates and confidence intervals. In Joint European conference on machine learning and knowledge discovery in databases, pp. 451–466. Springer, 2013.
- Calders & Jaroszewicz (2007) Calders, T. and Jaroszewicz, S. Efficient AUC optimization for classification. In Knowledge Discovery in Databases: PKDD 2007, 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, September 17-21, 2007, Proceedings, volume 4702 of Lecture Notes in Computer Science, pp. 42–53. Springer, 2007.
- Charoenphakdee et al. (2019) Charoenphakdee, N., Lee, J., and Sugiyama, M. On symmetric losses for learning from corrupted labels. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 961–970. PMLR, 2019.
- Clémençon et al. (2008) Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and empirical minimization of u-statistics. The Annals of Statistics, 36(2):844–874, 2008.
- Cohen et al. (1997) Cohen, W. W., Schapire, R. E., and Singer, Y. Learning to order things. Advances in neural information processing systems, 10, 1997.
- Dembczynski et al. (2012) Dembczynski, K., Kotlowski, W., and Hüllermeier, E. Consistent multilabel ranking through univariate losses. arXiv preprint arXiv:1206.6401, 2012.
- Deng & Mahdavi (2021) Deng, Y. and Mahdavi, M. Local stochastic gradient descent ascent: Convergence analysis and communication efficiency. In International Conference on Artificial Intelligence and Statistics, pp. 1387–1395. PMLR, 2021.
- Deng et al. (2020) Deng, Y., Kamani, M. M., and Mahdavi, M. Distributionally robust federated averaging. Advances in Neural Information Processing Systems, 33:15111–15122, 2020.
- Gao et al. (2022) Gao, H., Li, J., and Huang, H. On the convergence of local stochastic compositional gradient descent with momentum. In International Conference on Machine Learning, pp. 7017–7035. PMLR, 2022.
- Gao & Zhou (2015) Gao, W. and Zhou, Z. On the consistency of AUC pairwise optimization. In Yang, Q. and Wooldridge, M. J. (eds.), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 939–945. AAAI Press, 2015.
- Gao et al. (2013) Gao, W., Jin, R., Zhu, S., and Zhou, Z. One-pass AUC optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pp. 906–914. JMLR.org, 2013.
- Ghadimi et al. (2020) Ghadimi, S., Ruszczynski, A., and Wang, M. A single timescale stochastic approximation method for nested stochastic optimization. SIAM J. Optim., 30(1):960–979, 2020.
- Goldberger et al. (2004) Goldberger, J., Hinton, G. E., Roweis, S., and Salakhutdinov, R. R. Neighbourhood components analysis. Advances in neural information processing systems, 17, 2004.
- Guo et al. (2020) Guo, Z., Liu, M., Yuan, Z., Shen, L., Liu, W., and Yang, T. Communication-efficient distributed stochastic auc maximization with deep neural networks. In International Conference on Machine Learning, pp. 3864–3874. PMLR, 2020.
- Haddadpour et al. (2019) Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. Advances in Neural Information Processing Systems, 32, 2019.
- Han et al. (2022) Han, S., Park, S., Wu, F., Kim, S., Wu, C., Xie, X., and Cha, M. Fedx: Unsupervised federated learning with cross knowledge distillation, 2022. URL https://arxiv.org/abs/2207.09158.
- Hanley & McNeil (1982) Hanley, J. A. and McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.
- Hu et al. (2020) Hu, Y., Zhang, S., Chen, X., and He, N. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Huang et al. (2022) Huang, Y., Lin, Q., Street, N., and Baek, S. Federated learning on adaptively weighted nodes by bilevel optimization. arXiv preprint arXiv:2207.10751, 2022.
- Irvin et al. (2019) Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R. L., Shpanskaya, K. S., Seekins, J., Mong, D. A., Halabi, S. S., Sandberg, J. K., Jones, R., Larson, D. B., Langlotz, C. P., Patel, B. N., Lungren, M. P., and Ng, A. Y. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 590–597. AAAI Press, 2019.
- Jiang & Agrawal (2018) Jiang, P. and Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. Advances in Neural Information Processing Systems, 31, 2018.
- Jiang et al. (2022) Jiang, W., Li, G., Wang, Y., Zhang, L., and Yang, T. Multi-block-single-probe variance reduced estimator for coupled compositional optimization. In NeurIPS, 2022.
- Kairouz et al. (2021) Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
- Karimireddy et al. (2020a) Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
- Karimireddy et al. (2020b) Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020b.
- Khaled et al. (2020) Khaled, A., Mishchenko, K., and Richtárik, P. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 4519–4529. PMLR, 2020.
- Konečnỳ et al. (2016) Konečnỳ, J., McMahan, H. B., Ramage, D., and Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
- Kotlowski et al. (2011) Kotlowski, W., Dembczynski, K., and Hüllermeier, E. Bipartite ranking through minimization of univariate loss. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 1113–1120. Omnipress, 2011.
- Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images.(2009), 2009.
- Li & Huang (2022) Li, J. and Huang, H. Fedgrec: Federated graph recommender system with lazy update of latent embeddings. arXiv preprint arXiv:2210.13686, 2022.
- Li et al. (2022) Li, J., Pei, J., and Huang, H. Communication-efficient robust federated learning with noisy labels. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 914–924, 2022.
- Liu et al. (2020) Liu, M., Zhang, W., Mroueh, Y., Cui, X., Ross, J., Yang, T., and Das, P. A decentralized parallel algorithm for training generative adversarial nets. Advances in Neural Information Processing Systems, 33:11056–11070, 2020.
- McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR, 2017.
- Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- Qi et al. (2021) Qi, Q., Luo, Y., Xu, Z., Ji, S., and Yang, T. Stochastic optimization of areas under precision-recall curves with provable convergence. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 1752–1765, 2021.
- Qiu et al. (2022) Qiu, Z., Hu, Q., Zhong, Y., Zhang, L., and Yang, T. Large-scale stochastic optimization of NDCG surrogates for deep learning with provable convergence. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 18122–18152. PMLR, 2022.
- Radenović et al. (2016) Radenović, F., Tolias, G., and Chum, O. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In European conference on computer vision, pp. 3–20. Springer, 2016.
- Rudin (2009) Rudin, C. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. J. Mach. Learn. Res., 10:2233–2271, 2009.
- Sharma et al. (2022) Sharma, P., Panda, R., Joshi, G., and Varshney, P. Federated minimax optimization: Improved convergence analyses and algorithms. In International Conference on Machine Learning, pp. 19683–19730. PMLR, 2022.
- Smith et al. (2018) Smith, V., Forte, S., Chenxin, M., Takáč, M., Jordan, M. I., and Jaggi, M. Cocoa: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
- Stich (2018) Stich, S. U. Local sgd converges fast and communicates little. In International Conference on Learning Representations, 2018.
- Stich et al. (2018) Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified sgd with memory. Advances in Neural Information Processing Systems, 31, 2018.
- Tarzanagh et al. (2022) Tarzanagh, D. A., Li, M., Thrampoulidis, C., and Oymak, S. FedNest: Federated bilevel, minimax, and compositional optimization. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 21146–21179. PMLR, 17–23 Jul 2022.
- Wang & Yang (2022) Wang, B. and Yang, T. Finite-sum coupled compositional stochastic optimization: Theory and applications. In International Conference on Machine Learning, pp. 23292–23317. PMLR, 2022.
- Wang et al. (2022) Wang, G., Yang, M., Zhang, L., and Yang, T. Momentum accelerates the convergence of stochastic AUPRC maximization. In International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pp. 3753–3771. PMLR, 2022.
- Wang et al. (2017) Wang, M., Fang, E. X., and Liu, H. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program., 161(1-2):419–449, 2017.
- Wangni et al. (2018) Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems, 31, 2018.
- Woodworth et al. (2020a) Woodworth, B., Patel, K. K., Stich, S., Dai, Z., Bullins, B., Mcmahan, B., Shamir, O., and Srebro, N. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pp. 10334–10343. PMLR, 2020a.
- Woodworth et al. (2020b) Woodworth, B. E., Patel, K. K., and Srebro, N. Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281–6292, 2020b.
- Wu et al. (2017) Wu, C.-Y., Manmatha, R., Smola, A. J., and Krahenbuhl, P. Sampling matters in deep embedding learning. In Proceedings of the IEEE international conference on computer vision, pp. 2840–2848, 2017.
- Wu et al. (2022) Wu, Y., Wang, Z., Zeng, D., Li, M., Shi, Y., and Hu, J. Federated contrastive representation learning with feature fusion and neighborhood matching, 2022. URL https://openreview.net/forum?id=6LNPEcJAGWe.
- Xing et al. (2022) Xing, P., Lu, S., Wu, L., and Yu, H. Big-fed: Bilevel optimization enhanced graph-aided federated learning. IEEE Transactions on Big Data, pp. 1–12, 2022. doi: 10.1109/TBDATA.2022.3191439.
- Yang et al. (2021a) Yang, J., Shi, R., Wei, D., Liu, Z., Zhao, L., Ke, B., Pfister, H., and Ni, B. Medmnist v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. arXiv preprint arXiv:2110.14795, 2021a.
- Yang (2013) Yang, T. Trading computation for communication: Distributed stochastic dual coordinate ascent. Advances in Neural Information Processing Systems, 26, 2013.
- Yang (2022) Yang, T. Algorithmic foundation of deep x-risk optimization. CoRR, abs/2206.00439, 2022. doi: 10.48550/arXiv.2206.00439. URL https://doi.org/10.48550/arXiv.2206.00439.
- Yang & Ying (2022) Yang, T. and Ying, Y. Auc maximization in the era of big data and ai: A survey. ACM Comput. Surv., aug 2022. ISSN 0360-0300. doi: 10.1145/3554729. URL https://doi.org/10.1145/3554729. Just Accepted.
- Yang et al. (2021b) Yang, Z., Lei, Y., Wang, P., Yang, T., and Ying, Y. Simple stochastic and online gradient descent algorithms for pairwise learning. Advances in Neural Information Processing Systems, 34:20160–20171, 2021b.
- Yu et al. (2019a) Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In International Conference on Machine Learning, pp. 7184–7193. PMLR, 2019a.
- Yu et al. (2019b) Yu, H., Yang, S., and Zhu, S. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5693–5700, 2019b.
- Yuan et al. (2021a) Yuan, Z., Guo, Z., Xu, Y., Ying, Y., and Yang, T. Federated deep auc maximization for hetergeneous data with a constant communication complexity. In International Conference on Machine Learning, pp. 12219–12229. PMLR, 2021a.
- Yuan et al. (2021b) Yuan, Z., Yan, Y., Sonka, M., and Yang, T. Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 3020–3029. IEEE, 2021b.
- Yuan et al. (2022) Yuan, Z., Wu, Y., Qiu, Z., Du, X., Zhang, L., Zhou, D., and Yang, T. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 25760–25782. PMLR, 2022. URL https://proceedings.mlr.press/v162/yuan22b.html.
- Zhang et al. (2020) Zhang, F., Kuang, K., You, Z., Shen, T., Xiao, J., Zhang, Y., Wu, C., Zhuang, Y., and Li, X. Federated unsupervised representation learning. CoRR, abs/2010.08982, 2020. URL https://arxiv.org/abs/2010.08982.
- Zhao et al. (2011) Zhao, P., Hoi, S. C. H., Jin, R., and Yang, T. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 233–240, 2011.
- Zhu et al. (2022) Zhu, D., Li, G., Wang, B., Wu, X., and Yang, T. When AUC meets DRO: optimizing partial AUC for deep learning with non-convex convergence guarantee. CoRR, abs/2203.00176, 2022.
Appendix A Notations
Model parameters of the neural network, variables to be trained | |
Model parameters of machine at round , iteration | |
A data point | |
A data point from machine | |
A data point sampled on machine , at round iteration | |
Two independent data points sampled on machine , at round iteration | |
The prediction score of data by network | |
Stochastic estimators of components of gradient | |
Collected historical prediction scores on machine at round | |
Moving average estimator of the inner function | |
Moving average estimator of the inner function on machine at round , iteration | |
Collected historical on machine at round | |
Predictions scores sampled from the collected scores of round | |
Moving average estimator sampled from the collected moving average estimator of round |
Appendix B Applications of DXO Problems
We now present some concrete applications of the DXO problems, including AUROC maximization, partial AUROC maximization and AUPRC maximization. A more comprehensive list of DXO problems is discussed in the Intrduction section and can also be found in a recent survey (Yang, 2022).
AUROC Maximization The area under ROC curve (AUROC) is defined (Hanley & McNeil, 1982) as
(15) |
where are a pair of data features and are the corresponding labels. To maximize the AUROC, there are a number of surrogate losses , e.g. , that have proposed in the literature (Gao et al., 2013; Zhao et al., 2011; Gao & Zhou, 2015; Calders & Jaroszewicz, 2007; Charoenphakdee et al., 2019; Yang et al., 2021b), which formulates the problem into
(16) |
where is the set of data with positive labels and is the set of data with negative labels. This is a DXO problem of (1) with .
Partial AUROC Maximization In medical diagnosis, high false positive rates (FPR) and low true positive rates (TPR) may cause a large cost. To alleviate this, we will also consider optimizing partial AUC (pAUC). This task considers to maximize the area under ROC curve with the restriction that the false positive rate to be less than a certain level. In (Zhu et al., 2022), it has been shown that the partial AUROC maximization problem can be solved by the
(17) |
where is the set of positive data, is the set of negative data, is surrogate loss, and is associated with the tolerance level of false positive rate. This is a DXO problem of (1) with , and .
Appendix C Experiments
C.1 Statistics of Data
Statistics of used data sets are summarized in Table 5.
# of Training Data | # of Validation Data | # of Testing Data | |
Cifar10 | 24000 | 10000 | 10000 |
Cifar100 | 24000 | 10000 | 10000 |
CheXpert | 190027 | 1000 | 202 |
ChestMNIST | 78468 | 11219 | 22433 |
C.2 Running Time Comparison
Running time is reported in Tabel 6. Each algorithm was run on 16 client machines connected by InfiniBand where each machine uses a NVIDIA A100 GPU.
Local SGD (CE Loss) | CODASCA (Min-Max AUC) | Local Pair (OPAUC Loss) | FeDXL2 (OPAUC Loss) | |
Cifar10 | 157 (664s) | 147 (955s) | 168 (740s) | 160 (819s) |
Cifar100 | 160 (644s) | 163 (974s) | 162 (688s) | 159 (758s) |
CheXpert | 162 (2465s) | 151 (3501s) | 175 (2838s) | 182 (3246s) |
ChestMNIST | 172 (1537s) | 165 (3176s) | 164 (1484s) | 171 (1763s) |
C.3 Ablation Study.
We show an ablation study to further verify our theory. In particular, we show the benefit of using multiple machines and the lower communication complexity by using local updates between two communications. To verify the first effect, we fix and vary , and for the latter we fix and vary . We conduct experiments on the CIFAR-10 data for optimizing the X risk corresponding to partial AUC loss and the results are plotted in Figure 2. The left two figures demonstrate that our algorithm can tolerate a certain value of for skipping communications without harming the performance; and the right two figures demonstrate the advantage of FL by using FeDXL2, i.e., using data from more sources can dramatically improve the performance.




Appendix D Analysis of FeDXL1 for solving DXO with Linear
In this section, we present the analysis of the FeDXL1 algorithm. For and , we define
(20) |
Therefore, the
defined in (4) is equivalent to , where is a scored of a randomly sampled data that in computed in the round at machine and iteration . Technically, notations and are associated with and , but we omit this dependence when the context is clear to simplify notations.
Proof.
Under Assumption 3.1, it follows that is -smooth, with . Simiarly, also Lipschtz in and with some constant that depend on . Let .
Denote and suppose by proper setting of and . Using the -smoothness of , we have
(21) |
where
(22) |
where first inequality uses Young’s inequality, Lipschitz of , and the fact that data samples are independent samples after , therefore
(23) |
To bound the updates of after one round, we have
(24) |
Using the Lipschtz property of , we continue this inequality as
Thus,
(25) |
Using Assumption 3.1, we know that are both less than . Then, to bound the updates in one round of one machine as
(26) |
Appendix E FeDXL2 for Solving DXO with Non-Linear
In this section, we define the following notations:
(28) |
Based on Assumption 3.3, it follows that are Lipschitz with some constant modulus and are bounded by , is -smooth, where are some proper constants depend on Assumption 3.3. We denote to simplify notations.
For , define and for , we define
(29) |
It follows that is also -Lipschitz in and .
E.1 Analysis of the moving average estimator
Lemma E.1.
Under Assumption 3.3, the moving average estimator satisfies
Proof.
By update rules of , we have
(30) |
Or equivalently,
(31) |
Define , . Then it follows that
(32) |
which is
(33) |
where
(34) |
If , we have
(35) |
Then, we have
(36) |
Note that , which implies
(37) |
Since , we have that , and
. Besides, we have
(38) |
where
(39) |
Noting
(40) |
Then by multiplying to every term and rearranging terms using the setting of , we can obtain
(41) |
Dividing on both sides gives
(42) |
Using Young’s inequality,
∎
E.2 Analysis of the estimator of gradient
With update , we define , and . Then it follows that .
Proof.
(43) |
Using Young’s inequality and -Lipschtzness of , we can then derive
(44) |
By the fact that
(45) |
(46) |
and
(47) |
we obtain
∎
E.3 Analysis of Theorem 3.4
Proof.
By updating rules,
(48) |
and
(49) |
Similarly, we also have
(50) |
Appendix F FeDXL with Partial Client Participation
Considering that not all client machines are available to work at each round, in this section, we provide an algorithm that allows partial client participation in every round. The algorithm is given in Algorithm 3. We use the Assumption 3.3. The convergence results will be presented in Theorem F.3.
F.1 Analysis of the moving average estimator
Lemma F.1.
Under Assumption 3.3, the moving average estimator satisfies
Proof.
Denote as the clients that are sampled to take participation in the -th round. By update rules of , we have
(54) |
Or equivalently,
(55) |
Define , . Then it follows that
(56) |
where for it has
(57) |
If , we have for
(58) |
Then, we have
(59) |
Note that for , , which implies for
(60) |
Since , we have that , and
Besides, we have for that
(61) |
where
(62) |
Noting for ,
(63) |
With the client sampling and data sampling, we observe that
(64) |
Then by multiplying to every term and rearranging terms using the setting of , we can obtain
(65) |
Dividing on both sides gives
(66) |
Using Young’s inequality,
which yields
∎
F.2 Analysis of the estimator of gradient
With update , we define , and . Then it follows that .
Proof.
(67) |
Using Young’s inequality and -Lipschtzness of , we can then derive
(68) |
By the fact that
(69) |
(70) |
and
(71) |
we obtain
∎
F.3 Convergence Result
Theorem F.3.
Proof.
By updating rules, we have that for ,
(72) |
and
(73) |
Similarly, we also have
(74) |