Vertical Federated Learning Based on DFP and BFGS
Abstract
As data privacy becomes increasingly valued, federated learning (FL) has emerged for its potential to protect data. FL uses homomorphic encryption and differential privacy, on the premise of ensuring data security, to realize distributed machine learning by exchanging encrypted information between different data providers. However, many problems remain in FL, such as the communication efficiency between clients and the server and non-iid data. To address these two problems, we propose a novel vertical federated learning framework based on DFP and BFGS (denoted BDFL) and apply it to logistic regression. Finally, we perform experiments on real datasets to test the efficiency of the BDFL framework.
Index Terms:
Federated learning, Machine learning, Non-iid data, Data privacy.

I INTRODUCTION
On the one hand, due to the emergence of the General Data Protection Regulation, more and more people are paying attention to privacy protection in machine learning. On the other hand, in real situations, more and more data islands appear, making traditional machine learning difficult to carry out. Generally speaking, an AI service needs data provided by users to train on a server. However, this data may come from various institutions, and although each institution wants a well-performing model, it does not want to leak its own data. Therefore, in order to break data islands and achieve privacy protection, Google [1] proposed federated learning in 2016. In FL, AI services can perform machine learning without collecting data from the participating institutions: the model is trained locally, and only encrypted information is sent to the central server. The server aggregates the received data and sends the result back to every client, and each client then updates its parameters locally. For updating parameters there are GD, SGD, and mini-batch SGD, but these methods are all of first-order accuracy. We therefore consider a method of higher-order accuracy, Newton's method; however, the Hessian matrix may be singular, and even when its inverse exists it is extremely expensive to compute. We therefore adopt quasi-Newton methods, among which DFP and BFGS are two representative algorithms. Yang et al. [2] implemented BFGS within the logistic regression architecture and applied it to vertical federated learning, but problems remain in terms of communication. Therefore, we combine DFP and BFGS to propose a new algorithm, which we use in the logistic regression of vertical federated learning. In the end, compared with other algorithms, our algorithm achieves better results with fewer communication rounds.
II RELATED WORK
In recent years, a large number of studies on federated learning have emerged [3], [4], [5]. In these architectures, the use of gradient descent methods is common. However, the first-order gradient descent method converges more slowly than the second-order Newton method, while computing the inverse of the Hessian matrix is very expensive; quasi-Newton methods, with BFGS and DFP as the two representatives, came into being for this reason. A series of works on horizontal federated learning has been proposed [6], [7], where each client has a part of the samples but all of the data attributes. In vertical federated learning, each client holds part of the data attributes, and the samples overlap. [8] applies logistic regression under the vertical federated framework. Yang et al. [2] use L-BFGS to implement the logistic regression algorithm of vertical federated learning, reducing communication cost. [9] combines federated learning with blockchain, proposing BlockFL; thanks to the consensus mechanism of the blockchain, BlockFL can resist attacks from malicious clients. FedAvg [4] is an iterative method that has become a universal optimization method in FL. On the theoretical side, [10], [11] give convergence proofs for the FedAvg algorithm on non-iid data. In particular, [12] offers SecureBoost, a boosting method based on tree models. Recently, [13] proposed the FedProx algorithm, which extends FedAvg by adding a proximal term; FedProx clearly outperforms FedAvg under statistical and system heterogeneity.
In summary, FedAvg, as the baseline in FL, performs poorly under statistical and system heterogeneity. As an improvement of FedAvg, FedProx performs well in non-iid environments. The first-order gradient descent method of traditional machine learning is widely applicable, but for FL, when the communication cost far exceeds the computation cost, a higher-precision algorithm should be selected. In other words, higher computation cost should be traded for smaller communication cost.
III ANALYTICAL MODEL
In this work, inspired by BFGS in the logistic regression of vertical federated learning [2], we explore a broader framework, BDFL, that is capable of managing heterogeneous federated environments while ensuring privacy. Moreover, our novel framework performs better than BFGS [2] and SGD [14].
III-A Logistic Regression
In vertical federated learning, [14] realizes the classic logistic regression method. Let $X \in \mathbb{R}^{n \times d}$ be a data set containing $n$ data samples, where each instance $x_i$ has $d$ features and the corresponding data label is $y_i \in \{-1, +1\}$. Suppose there are two honest-but-curious participants, party A (host) and party B (guest). A has only features of the data, while B has both features and the labels. So $X^A$ is owned by A and $(X^B, y)$ is owned by B. Each party has different data features, but the sample IDs are aligned. Therefore, the goal of optimization is to train a classification model $\theta$ to solve

$$\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; x_i, y_i) \qquad (1)$$

where $\theta$ is the model parameter vector, so $\theta = (\theta^A, \theta^B)$ with $\theta^A \in \mathbb{R}^{d_A}$ and $\theta^B \in \mathbb{R}^{d_B}$. Moreover, $x_i$ represents the features of the $i$-th data instance and $y_i$ is the corresponding label. The loss function is the negative log-likelihood

$$\ell(\theta; x_i, y_i) = \log\left(1 + e^{-y_i \theta^{\top} x_i}\right) \qquad (2)$$
In [14], they use SGD by exchanging encrypted intermediate information at each iteration. Party A and Party B hold the encrypted partial gradients $[\![g^A]\!]$ and $[\![g^B]\!]$ respectively, which can be decrypted by the third party C. Furthermore, to achieve secure multi-party computation, additively homomorphic encryption is adopted. In the field of homomorphic encryption, a lot of work has been done [15], [16], and different computing requirements correspond to different encryption schemes, such as PHE, SHE, and FHE. After encryption, addition and scalar multiplication can be performed directly on the encrypted data, and decrypting the result of such an operation is consistent with performing the operation on the original data: $[\![u]\!] + [\![v]\!] = [\![u + v]\!]$ and $k \cdot [\![u]\!] = [\![ku]\!]$, where $[\![\cdot]\!]$ denotes encryption. However, homomorphic encryption cannot yet handle exponential computation, so equation (2) cannot be encrypted directly.
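As an illustration, the additive property can be checked with the python-paillier library; the library choice and all names below are our own, since the paper does not specify an implementation.

```python
# Minimal sketch of additively homomorphic encryption, assuming the
# python-paillier library (phe); the paper does not name one.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

u, v = 3.5, 2.25
enc_u = public_key.encrypt(u)  # [[u]]
enc_v = public_key.encrypt(v)  # [[v]]

# Ciphertext + ciphertext and scalar * ciphertext are supported;
# decrypting the result matches operating on the plaintexts.
assert abs(private_key.decrypt(enc_u + enc_v) - (u + v)) < 1e-9
assert abs(private_key.decrypt(enc_u * 2) - 2 * u) < 1e-9
```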
We consider using a Taylor expansion to approximate the loss function; fortunately, such an approximation is proposed in [14]:

$$\ell(\theta; x_i, y_i) \approx \log 2 - \frac{1}{2} y_i\, \theta^{\top} x_i + \frac{1}{8} \left(\theta^{\top} x_i\right)^2 \qquad (3)$$
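The sketch below (our own illustration, with toy values) compares the exact loss (2) with the approximation (3); the two agree closely when $\theta^{\top} x$ is small.

```python
# Sketch: exact logistic loss (2) vs. its second-order Taylor
# approximation (3), with labels in {-1, +1}.
import numpy as np

def logistic_loss(theta, x, y):
    return np.log(1.0 + np.exp(-y * (theta @ x)))

def taylor_loss(theta, x, y):
    z = theta @ x
    return np.log(2.0) - 0.5 * y * z + 0.125 * z ** 2

theta = np.array([0.1, -0.2, 0.05])
x = np.array([1.0, 0.5, -1.0])
print(logistic_loss(theta, x, 1.0), taylor_loss(theta, x, 1.0))
# -> roughly 0.7185 vs. 0.7184 for this small theta @ x
```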
III-B Newton Method
The basic idea of Newton's method is to use the first-order gradient and the second-order gradient (Hessian) at the current iterate to approximate the objective function by a quadratic function, and then use the minimum point of that quadratic model as the new iterate. This process is repeated until an approximate minimum satisfying the required accuracy is reached. Newton's method closely approximates the optimal value and converges quite fast; however, although it is fast, the computation per step is extremely heavy. For federated learning, this trade-off is ideal when exchanging larger computational cost for smaller communication cost.
For convenience, we mainly discuss the one-dimensional situation. For an objective function $f(x)$, the problem of finding an extremum of the function can be transformed into solving $f'(x) = 0$. The second-order Taylor expansion of the function at the iterate $x_k$ is

$$f(x) \approx f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2} f''(x_k)(x - x_k)^2 \qquad (4)$$

Taking the derivative of the above formula and setting it to 0 gives

$$f'(x_k) + f''(x_k)(x - x_k) = 0 \qquad (5)$$

$$x = x_k - \frac{f'(x_k)}{f''(x_k)} \qquad (6)$$

which, in the multi-dimensional case, is further organized into the following iterative expression:

$$\theta_{k+1} = \theta_k - \eta H_k^{-1} g_k \qquad (7)$$

where $\eta$ represents the step size, $g_k$ the gradient, and $H_k$ the Hessian at iteration $k$.
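A minimal one-dimensional sketch of iteration (7), with step size $\eta = 1$ and a toy function of our own choosing:

```python
# Sketch: scalar Newton iteration x <- x - f'(x)/f''(x), i.e.
# equation (7) with step size 1, applied to f(x) = x**4 - 3*x.
def f_prime(x):
    return 4.0 * x ** 3 - 3.0

def f_double_prime(x):
    return 12.0 * x ** 2

x = 1.0
for _ in range(10):
    x -= f_prime(x) / f_double_prime(x)
print(x)  # -> (3/4) ** (1/3) ~ 0.9086, the point where f'(x) = 0
```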
Equation (7) is the iterative formula of Newton's method. But this method has a fatal flaw: the inverse of the Hessian matrix is required. As we all know, not all matrices are invertible, and the computational complexity of the inversion operation is very high. Therefore, quasi-Newton methods such as BFGS and DFP arose to approximate Newton's method.
III-C Quasi-Newton Method
The central idea of the quasi-Newton method is to obtain a matrix similar to the inverse Hessian without computing the inverse of the Hessian. Therefore, the expression of the quasi-Newton method is similar to equation (7), as follows:

$$\theta_{k+1} = \theta_k - \eta D_k g_k \qquad (8)$$

where $D_k$ is the matrix used to approximate $H_k^{-1}$. In contrast, the update formula in SGD is

$$\theta_{k+1} = \theta_k - \eta g_k \qquad (9)$$
Different quasi-Newton methods differ in the iterative formula for $D_k$. Therefore, we give the iterative formulas of DFP, BFGS, and BDFL below, writing $s_k = \theta_{k+1} - \theta_k$ and $u_k = g_{k+1} - g_k$.
III-C1 DFP

$$D_{k+1} = D_k + \frac{s_k s_k^{\top}}{s_k^{\top} u_k} - \frac{D_k u_k u_k^{\top} D_k}{u_k^{\top} D_k u_k} \qquad (10)$$
III-C2 BFGS

$$D_{k+1} = \left(I - \frac{s_k u_k^{\top}}{u_k^{\top} s_k}\right) D_k \left(I - \frac{u_k s_k^{\top}}{u_k^{\top} s_k}\right) + \frac{s_k s_k^{\top}}{u_k^{\top} s_k} \qquad (11)$$
III-C3 BDFL

BDFL combines the two updates above through a convex combination of the DFP and BFGS rules:

$$D_{k+1} = (1 - \phi)\, D_{k+1}^{\mathrm{DFP}} + \phi\, D_{k+1}^{\mathrm{BFGS}}, \quad \phi \in [0, 1] \qquad (12)$$
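The three updates can be written compactly as below. This is our own sketch; in particular the mixing weight `phi` is an assumed illustrative parameter, not a value taken from the paper.

```python
# Sketch of the inverse-Hessian updates: DFP (10), BFGS (11), and the
# BDFL combination (12). s = theta_{k+1} - theta_k, u = g_{k+1} - g_k.
import numpy as np

def dfp_update(D, s, u):
    return (D + np.outer(s, s) / (s @ u)
            - D @ np.outer(u, u) @ D / (u @ D @ u))

def bfgs_update(D, s, u):
    rho = 1.0 / (u @ s)
    V = np.eye(len(s)) - rho * np.outer(s, u)
    return V @ D @ V.T + rho * np.outer(s, s)

def bdfl_update(D, s, u, phi=0.5):
    # Convex combination of the two updates; phi = 0.5 is illustrative.
    return (1.0 - phi) * dfp_update(D, s, u) + phi * bfgs_update(D, s, u)
```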
III-D Compute and Exchange Information

The gradient and the Hessian of the Taylor loss (3) for the $i$-th data sample are given by $g_i = \left(\frac{1}{4}\theta^{\top} x_i - \frac{1}{2} y_i\right) x_i$ and $H_i = \frac{1}{4} x_i x_i^{\top}$, respectively. For convenience, we calculate the intermediate variable $u_i$ and express the gradient through it:

$$u_i = \frac{1}{4}\theta^{\top} x_i - \frac{1}{2} y_i, \qquad g_i = u_i x_i \qquad (13)$$
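A small sketch of (13), assuming the gradient and Hessian derived from the Taylor loss (3) above; the function names are our own.

```python
# Sketch of equation (13): intermediate variable u_i and the
# per-sample gradient/Hessian of the Taylor loss (3).
import numpy as np

def intermediate(theta, x_i, y_i):
    return 0.25 * (theta @ x_i) - 0.5 * y_i    # u_i

def gradient(theta, x_i, y_i):
    return intermediate(theta, x_i, y_i) * x_i  # g_i = u_i * x_i

def hessian(x_i):
    return 0.25 * np.outer(x_i, x_i)            # H_i = x_i x_i^T / 4
```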
III-D1 Compute Gradient and Loss

Following the protocol of [14], party A computes its encrypted share of $u_i$ and sends it to B; B completes $[\![u_i]\!]$ and the encrypted Taylor loss, and each party then forms its encrypted gradient over its own features.
III-D2 Send Encrypted Information And Return
B sends the calculated encrypted loss to C, and C decrypts it and records the result. Then A and B send $[\![g^A]\!]$ and $[\![g^B]\!]$ to C. After decrypting the gradients, C sends back the respective gradient plaintexts.
III-D3 Update $D_k$ and $\theta$

After both A and B have received their respective gradients, they first update their approximation matrices $D_k$ using equation (12). Later, they update $\theta^A$ and $\theta^B$ using equation (8).
III-D4 Check Convergence

Parties A and B check whether $\theta^A$ and $\theta^B$ have converged. If both of them converge, output $\theta^A$ and $\theta^B$; if either of them does not converge, continue the loop.
Input: features $X^A$ (party A); features $X^B$ and labels $y$ (party B); step size $\eta$
Output: model parameters $\theta^A$ and $\theta^B$
Party C: generate a homomorphic key pair and send the public key to A and B
Parties A and B: repeat steps III-D1 to III-D4 until convergence
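Under the same assumptions as the earlier encryption sketch (the python-paillier library; all variable names are illustrative), one round of steps III-D1 and III-D2 might look like:

```python
# Single-round sketch of steps III-D1/III-D2 with one toy sample.
# Party C holds the private key; A and B encrypt with the public key.
import numpy as np
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Vertical split: A holds (x_a, theta_a); B holds (x_b, theta_b, y).
x_a, theta_a = np.array([0.5, -1.0]), np.array([0.1, 0.2])
x_b, theta_b, y = np.array([1.5]), np.array([-0.3]), 1.0

# III-D1: A encrypts its share of u_i and sends it to B, which adds
# its own share to form [[u_i]] (see equation (13)).
enc_u = public_key.encrypt(float(0.25 * (theta_a @ x_a)))
enc_u = enc_u + float(0.25 * (theta_b @ x_b) - 0.5 * y)

# B returns [[u_i]] to A; each party forms its encrypted gradient
# over its own features via scalar * ciphertext.
enc_g_a = [enc_u * float(v) for v in x_a]
enc_g_b = [enc_u * float(v) for v in x_b]

# III-D2: C decrypts and returns the plaintext gradients, which feed
# the D_k and theta updates of III-D3 (see bdfl_update above).
g_a = np.array([private_key.decrypt(c) for c in enc_g_a])
g_b = np.array([private_key.decrypt(c) for c in enc_g_b])
print(g_a, g_b)
```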
IV PERFORMANCE EVALUATION
Our numerical experiments have two parts. In both, we select 80% of the data for training and monitor the training loss; the remaining 20% is used as the test dataset to check the generalization ability of the model.
IV-A Compare Quasi-Newton with SGD
The first part compares SGD with the quasi-Newton methods. It uses the credit card dataset, which consists of 30000 instances, each holding 23 features. We shuffle the order of the instances; Party A holds 12 features, and Party B holds the remaining 11 features and the corresponding target. We then compare the two quasi-Newton methods, DFP and BFGS, against SGD.
[Figure 2: Training loss on the credit card dataset for DFP, BFGS, and SGD.]
IV-B Compare Quasi-Newton with BDFL
In the second part, going further, we compare our proposed BDFL against the plain quasi-Newton methods, using the breast cancer dataset, which has 569 instances, each with 30 attributes and a label. Since 20% of the data is held out for testing, the dataset is split into training and test sets accordingly. The attribute indices held by Party A are 10-29, and those held by Party B are 0-9.
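A sketch of this vertical split, assuming sklearn's bundled breast cancer dataset; the loader and random seed are our choices, not the paper's.

```python
# Sketch of the vertical split in the second experiment: Party A
# holds feature indices 10-29, Party B holds 0-9 plus the labels.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # 569 instances, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

X_a = X_train[:, 10:30]              # Party A: features only
X_b, y_b = X_train[:, :10], y_train  # Party B: features and labels
print(X_a.shape, X_b.shape)
```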
[Figure 3: Training loss on the breast cancer dataset for BDFL, DFP, and BFGS.]
Figure 2 shows clearly that BFGS converges much faster than SGD, and Figure 3 shows that BDFL is better than both DFP and BFGS. What is more, we run every model on the test dataset; the table below reports the test accuracy.
Method | Credit Card | Breast Cancer |
---|---|---
SGD | 90.90% | 86.26% |
DFP | 94.41% | 85.57% |
BFGS | 95.10% | 91.29% |
BDFL | – | 91.35% |
V CONCLUSIONS
In this article, we use the quasi-Newton method in place of the gradient descent method, with the aim of trading a larger amount of computation for a smaller communication cost. In addition, we improve on the original quasi-Newton updates and propose a novel framework, named BDFL, under vertical federated learning. Logistic regression is applied within the BDFL framework and tested on real datasets. The experiments show that BDFL supports multi-party modeling under the following two premises:
1. Data privacy is not leaked;
2. One party has only data but no labels, while the other has both data and labels.
Moreover, the convergence speed and accuracy of the model are better than those of traditional methods.

However, our model still has some problems: the convergence speed did not fully meet our expectations, and the amount of computation is large. We will continue to study these issues in future work.
References
- [1] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2017.
- [2] K. Yang, T. Fan, T. Chen, Y. Shi, and Q. Yang, “A quasi-newton method based vertical federated learning framework for logistic regression,” 2019.
- [3] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” 2019.
- [4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” 2017.
- [5] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, 2020.
- [6] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2017.
- [7] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” 2019.
- [8] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, and B. Thorne, “Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption,” 2017.
- [9] H. Kim, J. Park, M. Bennis, and S.-L. Kim, “Blockchained on-device federated learning,” 2019.
- [10] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid data,” 2020.
- [11] A. Khaled, K. Mishchenko, and P. Richtárik, “First analysis of local gd on heterogeneous data,” 2020.
- [12] K. Cheng, T. Fan, Y. Jin, Y. Liu, T. Chen, and Q. Yang, “Secureboost: A lossless federated learning framework,” 2019.
- [13] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” 2020.
- [14] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, and B. Thorne, “Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption,” 2017.
- [15] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, “A survey on homomorphic encryption schemes: Theory and implementation,” 2017.
- [16] H. Zhu, R. Wang, Y. Jin, K. Liang, and J. Ning, “Distributed additive encryption and quantization for privacy preserving federated deep learning,” 2020.