
Privacy-preserving Federated Adversarial Domain Adaptation over Feature Groups for Interpretability

Yan Kang, Yuanqin He, Jiahuan Luo, Tao Fan, Yang Liu, and Qiang Yang. Corresponding author: Yan Kang (email: [email protected]).
Abstract

We present PrADA, a novel privacy-preserving federated adversarial domain adaptation approach, to address an under-studied but practical cross-silo federated domain adaptation problem in which the party of the target domain lacks both samples and features. We handle the lack of features by extending the feature space through vertical federated learning with a feature-rich party, and we tackle the scarcity of samples by performing adversarial domain adaptation from the sample-rich source party to the target party. In this work, we focus on financial applications where interpretability is critical. However, existing adversarial domain adaptation methods typically apply a single feature extractor to learn feature representations that are of low interpretability with respect to the target task. To improve interpretability, we exploit domain expertise to split the feature space into multiple groups, each holding tightly relevant features, and we learn a semantically meaningful high-order feature from each feature group. In addition, we apply fine-grained domain adaptation to each feature group to improve transferability. We design a privacy-preserving vertical federated learning framework that enables PrADA to be performed securely and efficiently. We evaluate our approach on two tabular datasets. Experiments demonstrate both the effectiveness and practicality of our approach.

Index Terms:
Vertical Federated Learning, Privacy, Domain Adaptation, Interpretability.
Copyright (c) 2022 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

1 Introduction

Domain adaptation approaches [9, 28, 25, 30, 34] have shown notable success. These approaches typically establish alignment or minimize the discrepancy between source and target domains by learning domain-invariant feature representations in the form of deep neural network (DNN) feature extractors. In addition to the remarkable ability of DNNs to encode raw data into meaningful representations that yield high performance on target tasks, a major enabler of the adoption of DNNs in domain adaptation is the availability of large amounts of data with rich features (images and text) that support the representation learning of DNNs.

Due to increasingly strict legal and regulatory constraints on user privacy, private data from different organizations (domains) cannot be directly integrated for training machine learning models. In recent years, federated learning (FL) has emerged as a practicable solution to tackle data-silo issues without compromising user privacy. Initially, FL [19] was proposed to build models by utilizing the data of millions of mobile devices. [31] further extends the FL architecture to the enterprise setting, where the number of participating parties might be much smaller but privacy concerns are paramount. This setting is coined cross-silo federated learning [23].

Recently, a growing number of works have integrated domain adaptation into the cross-silo FL setting [24, 22, 16, 27] to solve domain shift among independent parties. These federated domain adaptation (FDA) methods typically conduct experiments on image and text data whose rich features support meaningful representation learning. However, in many real-world FL applications where data is stored in tabular format (i.e., a sample-feature matrix), the participating parties may have insufficient features for building DNN-based domain adaptation models. One promising way to address this lack of features is to enlarge the feature space by collaborating with a feature-rich party. For example, financial institutions with limited features (e.g., only basic user information) may have a large number of overlapping users with an e-commerce site that curates rich user information (e.g., product-browsing history and app-usage information), and thereby they can collaboratively build domain adaptation models on the enlarged feature space. This cross-silo FL setting, in which sample features are distributed across different parties, is categorized as vertical (feature-partitioned) federated learning (VFL) [31].

Although the enlarged feature space enables domain adaptation, mainstream adversarial DA methods [9, 28, 30] typically apply a single pair of feature extractor and domain discriminator over the whole feature space to learn feature representations that are not understandable by humans. In this work, we focus on financial applications in which model interpretability is an important concern. Thus, training models directly on top of raw feature representations cannot satisfy our requirements for model interpretability and regulatory compliance. In addition, a single pair of feature extractor and domain discriminator may not be effective at learning transferable feature representations. Therefore, we propose to group highly relevant features together and apply domain adaptation to each feature group, aiming to improve both interpretability and transferability.

Most FDA approaches apply differential privacy (DP) [6, 5, 7] to protect the privacy of participants' private data. However, DP suffers from precision loss, which is not acceptable in high-stakes decision-making applications (e.g., financial services and healthcare) where precision is crucial.

In this work, we propose PrADA, a privacy-preserving federated adversarial domain adaptation approach that enables participating parties to collaboratively conduct domain adaptation modeling in a privacy-preserving manner while taking model interpretability into account. The main contributions of this work are highlighted as follows:

  1. To our best knowledge, this work is the first study of the domain adaptation problem in the VFL setting for tabular data.

  2. This work proposes a fine-grained adversarial domain adaptation approach to reduce feature dimensionality, enhance model interpretability, and facilitate the learning of domain-invariant features.

  3. This work proposes a privacy-preserving VFL framework that allows participating parties to collaboratively conduct domain adaptation without exposing private local data under the semi-honest assumption.

2 Related Work

2.1 Federated Domain Adaptation

Traditional domain adaptation (DA) approaches assume that the data are centralized on one server, which limits their applicability to decentralized real-world scenarios. Federated domain adaptation aims to conduct domain adaptation modeling among independent parties of different domains without violating privacy. [24] applies a mixture-of-experts (MoE) strategy in which each participant combines a collaboratively learned general model and a domain-tuned private model to reconcile distribution differences among participants. [22] leverages federated adversarial domain alignment with a dynamic attention mechanism to enhance knowledge transfer. [16] applies the methods proposed by [24, 22] to functional magnetic resonance imaging (fMRI) analysis. [20] proposes agnostic federated learning, which optimizes the global model for any target distribution formed by a mixture of client distributions without overfitting the data of any particular client. One major limitation of these (both traditional and federated) DA approaches is that they are almost exclusively evaluated on computer vision datasets, and only a few of them (e.g., [20]) are evaluated on tabular data.

2.2 Deep Neural Network on Encrypted Data

Protecting privacy is a crucial element of federated learning, and homomorphic encryption (HE) is one of the major solutions to the privacy issue. Although HE allows computation to be performed on encrypted data, its expensive computational cost makes it impractical for training an entire DNN model. To address this issue, GELU-NET [32] adopts a client-server architecture in which the client encrypts the data while the server performs most of the computation on encrypted data. ACML [33] focuses on an enterprise scenario where data and labels are distributed between two independent parties. It adopts a SplitNN [29] approach in which each party is only responsible for updating its own portion of the whole DNN model. The novelty of ACML is that the costly encryption-decryption operations are performed only at the boundary between the two partial models, leaving the rest of the computation in plaintext.

2.3 Model Interpretability

A variety of research works have been proposed for interpreting deep neural networks [21]. Many of these methods focus on post-hoc interpretability, analyzing the relationships between the input and output of a trained model rather than elucidating the model's internal structure. Other methods [3, 13, 1] construct prototypes or general concepts that shed light on the decision-making process. For example, [3] proposes ProtoPNet, which learns a set of prototypes, each of which can be considered the latent representation of a small prototypical part of the training images. The label prediction is then calculated from a weighted combination of the similarity scores between parts of the image and the learned prototypes. [18] calculates SHAP values of every feature for every sample based on model predictions; complex models, such as ensemble methods or deep networks, can then be explained through these SHAP values.

3 Problem Definition

We consider the following cross-silo federated domain adaptation scenario involving three parties. Party A is from the target domain; it has a small number of labeled samples $(\mathbf{X}^{A}_{l},\mathbf{Y}^{A})\in\mathbb{R}^{n^{A}_{l}\times(m+1)}$ and some unlabeled samples $\mathbf{X}^{A}_{u}\in\mathbb{R}^{n^{A}_{u}\times m}$. Party B is from the source domain and has a large number of labeled samples $(\mathbf{X}^{B},\mathbf{Y}^{B})\in\mathbb{R}^{n^{B}\times(m+1)}$. $n^{A}=n^{A}_{l}+n^{A}_{u}$ and $n^{B}$ denote the sample sizes of parties A and B respectively, while $m$ denotes the feature dimension. These two parties share the same feature space and have similar tasks. We consider conducting domain adaptation (DA) from party B to party A, and we call these two parties active parties because they initiate the DA procedure. The two active parties have an insufficient number of features to support DA. Thus, we refer to a passive party C that is able to provide a sufficient amount of complementary features $\mathbf{X}^{B^{c}}\in\mathbb{R}^{n^{B}\times m^{c}}$ and $\mathbf{X}^{A^{c}}\in\mathbb{R}^{n^{A}\times m^{c}}$ for party B and party A, respectively. $\mathbf{X}^{B^{c}}$ and $\mathbf{X}^{A^{c}}$ share the same feature space with dimension $m^{c}$, and $n^{B}\gg n^{A}_{l}$ and $m^{c}\gg m$.

We align $\mathbf{X}^{A^{c}}$ with $(\mathbf{X}^{A}_{l},\mathbf{Y}^{A})$ and $\mathbf{X}^{A}_{u}$ respectively along the feature axis to form a virtual labeled dataset $\mathbf{D}^{t}_{l}=[\mathbf{X}^{A^{c}}_{l};\mathbf{X}^{A}_{l};\mathbf{Y}^{A}]$ and a virtual unlabeled dataset $\mathbf{D}^{t}_{u}=[\mathbf{X}^{A^{c}}_{u};\mathbf{X}^{A}_{u}]$ of the target domain. Likewise, we form a virtual dataset $\mathbf{D}^{s}=[\mathbf{X}^{B^{c}};\mathbf{X}^{B};\mathbf{Y}^{B}]$ of the source domain. The alignment can be performed by leveraging privacy-preserving entity matching approaches [11]. Figure 1 shows the federated view of the tabular datasets $\mathbf{D}^{s}$ and $\mathbf{D}^{t}$ among the three parties.
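
For illustration, a minimal, non-private stand-in for this alignment step is sketched below: it simply joins the two parties' tables on a shared identifier (the column name uid is hypothetical). In PrADA the same alignment would be obtained with privacy-preserving entity matching [11], so no raw identifiers are exchanged.

```python
import pandas as pd

def build_virtual_dataset(df_passive: pd.DataFrame, df_active: pd.DataFrame) -> pd.DataFrame:
    """Concatenate party C's features with an active party's features/labels
    along the feature axis for the samples the two parties have in common.
    'uid' is a hypothetical shared identifier; a real deployment would use
    privacy-preserving entity matching instead of a plaintext join."""
    return df_passive.merge(df_active, on="uid", how="inner")

# Example: D^s = [X^{B^c}; X^B; Y^B]
# d_source = build_virtual_dataset(df_party_c, df_party_b)
```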

Figure 1: View of the virtual tabular data of the cross-silo federated domain adaptation. Source party B has a large number of labeled samples $(\mathbf{X}^{B},\mathbf{Y}^{B})$, while target party A has a small number of labeled samples $(\mathbf{X}^{A}_{l},\mathbf{Y}^{A})$ and some unlabeled samples $\mathbf{X}^{A}_{u}$. Party C provides complementary features $\mathbf{X}^{A^{c}}$ for party A and $\mathbf{X}^{B^{c}}$ for party B. Thus, we form the virtual dataset $\mathbf{D}^{s}=[\mathbf{X}^{B^{c}};\mathbf{X}^{B};\mathbf{Y}^{B}]$ of the source domain, and the virtual datasets $\mathbf{D}^{t}_{l}=[\mathbf{X}^{A^{c}}_{l};\mathbf{X}^{A}_{l};\mathbf{Y}^{A}]$ and $\mathbf{D}^{t}_{u}=[\mathbf{X}^{A^{c}}_{u};\mathbf{X}^{A}_{u}]$ of the target domain.

Under this setting, our PrADA approach proceeds along two directions: (1) extending the feature space of active parties A and B through vertical federated learning with the feature-rich passive party C; (2) performing domain adaptation from party B of the sample-rich source domain to party A of the sample-scarce target domain based on the extended yet distributed feature space. Our ultimate goal is to improve the performance of the target model of party A.

Because $\mathbf{D}^{s}$ and $\mathbf{D}^{t}$ are composed of data from two independent parties, this domain adaptation is performed in a federated learning manner with a privacy-preserving protocol. We assume that all three parties are honest-but-curious, meaning they follow the federated learning protocol but attempt to deduce as much as possible from the information received from other parties.

4 Architecture Overview

Figure 2: The pre-training stage of PrADA. $\theta_{f_{i}},\theta_{d_{i}},\theta_{R^{B}}$ denote the model parameters of the $i$-th feature extractor, the $i$-th discriminator, and the label predictor $R^{B}$, respectively. Party C locally trains $g$ pairs $\{(\theta_{f_{i}},\theta_{d_{i}})\}_{i=1}^{g}$, each corresponding to a feature group, by optimizing (5). Party B and party C collaboratively train $\{\theta_{f_{i}}\}_{i=1}^{g}$ and $\theta_{R^{B}}$ by optimizing (6) using $\boldsymbol{\mu}^{B^{c}}$, $\mathbf{x}^{B}$, and $\mathbf{y}^{B}$.

PrADA involves two stages: pre-training and fine-tuning. The pre-training stage is performed between source party B and party C and aims to pre-train the feature extractors maintained by party C, while fine-tuning is performed between target party A and party C and aims to train the target label predictor of party A based on the pre-trained feature extractors. Figure 2 illustrates the workflow of the pre-training stage of the federated adversarial domain adaptation. Since the sole goal of fine-tuning is to train the target label predictor of party A, fine-tuning follows a workflow similar to pre-training, except that no domain adaptation is involved.

As illustrated in Figure 2, party B owns the label predictor $R^{B}$, while party C owns the feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$, their corresponding domain discriminators $\mathscr{D}=\{D_{i}\}_{i=1}^{g}$, and the aggregators $\mathscr{G}=\{G_{i}\}_{i=1}^{g}$. The pre-training stage of federated adversarial domain adaptation mainly involves three steps.

1

Feature grouping. Party C leverages domain expertise to group its raw features into $k$ feature groups, each comprising tightly relevant features. In addition, party C forms $z$ interactions between pairwise feature groups. Thus, this step yields $g=k+z$ feature groups in total ($k$ normal feature groups and $z$ interactive feature groups).

2

Adversarial domain adaptation. Party C leverages adversarial domain adaptation to train the feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$ to learn domain-invariant feature representations based upon the $g$ feature groups.

3

Vertical federated learning. Party B and party C collaboratively perform vertical federated learning to train the task-specific label predictor $R^{B}$ and the feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$ to learn domain-specific feature representations.

We discuss step 1 in Section 5 and elaborate on steps 2 and 3 in Section 6. In Section 7, we explain how our privacy-preserving vertical federated learning framework protects data privacy throughout the whole workflow.

5 Feature Grouping

The reasons that PrADA leverages feature grouping are twofold: (1) to improve the transferability of the feature extractors; (2) to improve the interpretability of the label predictors.

We propose that, with the help of domain expertise, party C creates $k$ feature groups out of its original feature space such that features in the same group are more relevant to each other than to features belonging to other groups. Based on this grouping, party C obtains $k$ groups of relevant features $\{\mathbf{x}^{p^{c}}_{(i)}\}_{i=1}^{k}$ for each sample $\mathbf{x}^{p^{c}}\in\mathbb{R}^{1\times m^{c}}$ drawn from $\mathbf{X}^{p^{c}},p\in\{A,B\}$. To explore interactive features, party C forms an interaction between each pair of the $k$ feature groups by concatenating the two feature groups, giving $z=C_{2}^{k}$ interactive feature groups. As a result, party C creates $g=k+z$ feature groups in total. Naturally, party C assigns each feature group a feature extractor along with a domain discriminator to learn domain-invariant feature representations. We hypothesize that this fine-grained domain adaptation between the two domains' feature groups, each containing tightly relevant features, helps improve the transferability of the domain-invariant feature representations.
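
To make the grouping step concrete, the sketch below (with hypothetical group names and column indices) shows how the $k$ normal groups and the $z=C_{2}^{k}$ interactive groups could be formed; in PrADA the actual grouping is chosen by domain experts.

```python
from itertools import combinations

import torch

# Hypothetical feature groups: each maps a group name to column indices of x^{p^c}.
normal_groups = {
    "employment": [0, 1, 2, 3],
    "demographics": [4, 5, 6],
    "household": [7, 8, 9, 10],
    "migration": [11, 12],
}

# Interactive groups: concatenate the columns of every pair of normal groups,
# giving z = C(k, 2) additional groups, so g = k + z groups in total.
interactive_groups = {
    f"{a}-{b}": normal_groups[a] + normal_groups[b]
    for a, b in combinations(normal_groups, 2)
}
all_groups = {**normal_groups, **interactive_groups}

def split_into_groups(x: torch.Tensor) -> dict:
    """Slice a batch of party C's raw features (batch, m^c) into g group tensors."""
    return {name: x[:, cols] for name, cols in all_groups.items()}
```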

We adopt a logistic regression (LR) model as the label predictor because it is a widely used interpretable model in financial applications. LR treats each feature and its associated weight as a fundamental interpretable unit. Therefore, instead of directly passing the dense feature representations output by the feature extractors to the LR model, party C leverages a set of aggregators $\{G_{i}\}_{i=1}^{g}$ that compress the output of each feature extractor into a scalar representing a high-order feature, and then feeds these high-order features into the LR model. As a result, the LR model takes as input a manageable number of high-order features (from party C), which are more explainable than the concatenation of multiple dense feature representations. We formalize the procedure by which party C generates the high-order feature vector $\boldsymbol{\mu}^{p^{c}}$ as follows:

$$\boldsymbol{\mu}^{p^{c}}=[G_{1}(\mathbf{f}^{p^{c}}_{(1)});\dots;G_{k}(\mathbf{f}^{p^{c}}_{(k)});\dots;G_{g}(\mathbf{f}^{p^{c}}_{(g)})] \qquad (1)$$

where $\mathbf{f}_{(i)}^{p^{c}}$ denotes the feature representation learned by the feature extractor $F_{i}(\mathbf{x}^{p^{c}}_{(i)})$, and $G_{i}(\mathbf{f}^{p^{c}}_{(i)})$ returns a scalar representing the high-order feature for feature group $\mathbf{x}_{(i)}^{p^{c}}$.
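
A minimal PyTorch sketch of Eq. (1) is given below: each group has its own feature extractor and a scalar-output aggregator, and party C stacks the $g$ scalars into $\boldsymbol{\mu}^{p^{c}}$. The layer sizes and module names are illustrative only.

```python
import torch
import torch.nn as nn

class GroupExtractor(nn.Module):
    """Per-group feature extractor F_i: raw group features -> dense representation f_i."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, out_dim), nn.LeakyReLU(),
        )

    def forward(self, x):
        return self.net(x)

class GroupAggregator(nn.Module):
    """Aggregator G_i: compresses f_i into a single high-order scalar feature."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, 1)

    def forward(self, f):
        return self.proj(f)  # shape (batch, 1)

def high_order_features(group_inputs, extractors, aggregators):
    """Implements Eq. (1): mu = [G_1(F_1(x_1)); ...; G_g(F_g(x_g))]."""
    mus = [agg(ext(x)) for x, ext, agg in zip(group_inputs, extractors, aggregators)]
    return torch.cat(mus, dim=1)  # shape (batch, g)
```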

To perform federated adversarial domain adaptation, party C feeds $\{\mathbf{f}_{(i)}^{p^{c}}\}_{i=1}^{g}$ into the corresponding domain discriminators to optimize the domain discrimination losses, and passes the high-order feature vector $\boldsymbol{\mu}^{p^{c}}$ to the active party $p$ (together with $p$'s own raw features) to optimize the label prediction loss, as described in Section 6.

6 Federated Adversarial Domain Adaptation

Federated adversarial domain adaptation in PrADA involves two stages: pre-training and fine-tuning. The pre-training stage is performed collaboratively between source party B and party C and aims to train feature extractors that learn both domain-invariant and label-discriminative features. The fine-tuning stage is performed collaboratively between target party A and party C and aims to train the target label predictor possessed by party A, leveraging the pre-trained feature extractors.

6.1 Pre-training Stage

The essential idea of adversarial domain adaptation is to train feature extractors that learn features that are both discriminative for the task and invariant to the change of domains. Thus, we have two optimization goals. (1) To obtain domain-invariant features, we perform adversarial domain adaptation, optimizing the feature extractors to maximize the domain classification loss while simultaneously optimizing the domain discriminators to minimize the domain classification loss. (2) To obtain task-specific discriminative feature representations, we perform vertical federated learning (VFL) to optimize the feature extractors and label predictor to minimize the label prediction loss.

In our federated learning setting, party C leverages $g$ feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$ and their corresponding $g$ domain discriminators $\mathscr{D}=\{D_{i}\}_{i=1}^{g}$ to learn domain-invariant feature representations from the $g$ feature groups. More specifically, the $i$-th feature extractor $F_{i}$ learns a feature representation from the $i$-th feature group, and the $i$-th domain discriminator $D_{i}$ maps this feature representation to a domain label $d\in\{0,1\}$. The overall domain classification loss is the sum of the domain classification losses of all domain discriminators in $\mathscr{D}$:

$$L_{adv}(\mathscr{F},\mathscr{D})=-\mathbb{E}_{\mathbf{x}^{A^{c}}\sim\mathbf{X}^{A^{c}}}\sum_{i=1}^{g}\log[D_{i}(F_{i}(\mathbf{x}_{(i)}^{A^{c}}))]-\mathbb{E}_{\mathbf{x}^{B^{c}}\sim\mathbf{X}^{B^{c}}}\sum_{i=1}^{g}\log[1-D_{i}(F_{i}(\mathbf{x}_{(i)}^{B^{c}}))] \qquad (2)$$

To make the feature extractors produce task-specific discriminative features, we optimize the label prediction loss to train both the label predictor and the feature extractors to classify the source samples correctly. We define the label prediction loss as:

$$L_{ce}(\mathscr{F},R^{B})=\mathbb{E}_{(\mathbf{x}^{B^{c}},\mathbf{x}^{B},\mathbf{y}^{B})\sim\mathbf{D}^{s}}[\ell_{ce}(R^{B}([\boldsymbol{\mu}^{B^{c}};\mathbf{x}^{B}]),\mathbf{y}^{B})] \qquad (3)$$

where $R^{B}$ is the label predictor curated at party B, $\boldsymbol{\mu}^{B^{c}}$ is the high-order feature vector passed from party C, and $\mathbf{x}^{B}$ is the feature vector possessed by party B.

The pre-training stage optimizes the two losses presented in (2) and (3). The complete loss function for the pre-training stage is:

$$L(\mathscr{F},\mathscr{D},R^{B})=L_{ce}(\mathscr{F},R^{B})-\lambda L_{adv}(\mathscr{F},\mathscr{D}) \qquad (4)$$

where $\lambda$ is a hyperparameter that controls the trade-off between the two losses that shape the feature representations during training.
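
One common way to realize this min-max objective is a gradient reversal layer, which flips the sign of the adversarial gradient flowing into the feature extractors so that the discriminators minimize the domain loss while the extractors maximize it. The sketch below illustrates this idea for the per-group losses in (2); it is an illustrative implementation choice, not necessarily the exact optimizer setup used in PrADA.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, -lambda * grad in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_domain_loss(extractors, discriminators, groups_src, groups_tgt, lam):
    """Sum of per-group domain losses (Eq. 2), with gradient reversal applied to the
    extractor outputs. Assumes each discriminator D_i ends with a sigmoid.
    Source (party B) samples get domain label 0, target (party A) samples label 1."""
    loss = 0.0
    for F_i, D_i, x_s, x_t in zip(extractors, discriminators, groups_src, groups_tgt):
        d_s = D_i(GradReverse.apply(F_i(x_s), lam))
        d_t = D_i(GradReverse.apply(F_i(x_t), lam))
        loss = loss + F.binary_cross_entropy(d_s, torch.zeros_like(d_s)) \
                    + F.binary_cross_entropy(d_t, torch.ones_like(d_t))
    return loss
```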

In our federated setting, $L_{adv}(\mathscr{F},\mathscr{D})$ is optimized locally at party C since it only involves party C's data, while $L_{ce}(\mathscr{F},R^{B})$ is optimized collaboratively by party B and party C in a federated manner since it involves data from both parties. To this end, we train the parameters $\{\theta_{f_{i}}\}_{i=1}^{g}$ of the feature extractors $\mathscr{F}$, $\{\theta_{d_{i}}\}_{i=1}^{g}$ of the domain discriminators $\mathscr{D}$, and $\theta_{R^{B}}$ of the label predictor $R^{B}$ by solving the following two optimization problems:

$$\mathop{\mathrm{argmin}}\limits_{\{\theta_{f_{i}}\}_{i=1}^{g}}\mathop{\mathrm{argmax}}\limits_{\{\theta_{d_{i}}\}_{i=1}^{g}}\;(-\lambda L_{adv}(\mathscr{F},\mathscr{D})) \qquad (5)$$
$$\mathop{\mathrm{argmin}}\limits_{\{\theta_{f_{i}}\}_{i=1}^{g},\,\theta_{R^{B}}}L_{ce}(\mathscr{F},R^{B}) \qquad (6)$$

$\{\theta_{d_{i}}\}_{i=1}^{g}$ are trained by minimizing the domain classification loss, $\theta_{R^{B}}$ is trained by minimizing the label prediction loss, and $\{\theta_{f_{i}}\}_{i=1}^{g}$ are trained by minimizing the label prediction loss while simultaneously maximizing the domain classification loss.

Figure 2 illustrates the overall workflow of the pre-training stage, and Algorithm 1 describes the procedure for optimizing (5) and (6). We assume that the entity alignment procedure has been run and the indices of $\mathbf{D}^{s}$ have been shuffled and synchronized between party B and party C before training. In each inner iteration, party C and party B fetch the same mini-batch of aligned samples from $\mathbf{D}^{s}$, but each holds its own portion of the private data: party C holds $\mathbf{x}^{B^{c}}$ while party B holds $\mathbf{x}^{B}$ and $\mathbf{y}^{B}$. In addition, party C samples a mini-batch $\mathbf{x}^{A^{c}}$ from its target data $\mathbf{X}^{A^{c}}$. Based on $\mathbf{x}^{B^{c}}$ and $\mathbf{x}^{A^{c}}$, party C optimizes (5) locally. Based on $\boldsymbol{\mu}^{B^{c}}$, $\mathbf{x}^{B}$ and $\mathbf{y}^{B}$, party B and party C collaboratively optimize (6) through Algorithm 3.

Algorithm 1 Federated Pre-training
1: Initialization: feature extractors $\mathscr{F}$, domain discriminators $\mathscr{D}$, batch indices $\mathcal{I}$
2: Input: $\mathbf{D}^{s}=[\mathbf{X}^{B^{c}};\mathbf{X}^{B};\mathbf{Y}^{B}]$, $\mathbf{X}^{A^{c}}$
3: for $e=1,2,\dots,E$ do
4:   for $i\in\mathcal{I}$ do
5:     Party C do:
6:       $\mathbf{x}^{A^{c}} \leftarrow$ sample a mini-batch from $\mathbf{X}^{A^{c}}$;
7:       $\mathbf{x}^{B^{c}} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{B^{c}}$;
8:       update models in $\mathscr{F},\mathscr{D}$ by optimizing (5) using $\mathbf{x}^{A^{c}}$ and $\mathbf{x}^{B^{c}}$;
9:       compute $\boldsymbol{\mu}^{B^{c}}$ by (1) using $\mathbf{x}^{B^{c}}$;
10:      encrypt $\boldsymbol{\mu}^{B^{c}}$ and send $[[\boldsymbol{\mu}^{B^{c}}]]$ to party B;
11:    Party B do:
12:      $\mathbf{x}^{B} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{B}$;
13:      $\mathbf{y}^{B} \leftarrow$ select the $i$-th mini-batch from $\mathbf{Y}^{B}$;
14:    Party B and Party C do:
15:      optimize (6) using Algorithm 3 with $[[\boldsymbol{\mu}^{B^{c}}]],\mathbf{x}^{B},\mathbf{y}^{B}$;
16:  end for
17: end for

6.2 Fine-tuning Stage

The fine-tuning stage aims to train the label predictor $R^{A}$ possessed by target party A using the target labeled data $\mathbf{D}^{t}_{l}$. Prior to fine-tuning, party C initializes its feature extractors with the pre-trained parameters. Note that since party A and party B are two independent parties, the trained label predictor $R^{B}$ of source party B cannot be used by target party A to initialize $R^{A}$. Thus, $R^{A}$ has to be trained from scratch.

In each iteration, party C applies (1) to compute the high-order feature vector $\boldsymbol{\mu}^{A^{c}}$ and then sends $\boldsymbol{\mu}^{A^{c}}$ to party A for computing the label prediction loss:

$$L_{ce}(\mathscr{F},R^{A})=\mathbb{E}_{(\mathbf{x}^{A^{c}}_{l},\mathbf{x}^{A}_{l},\mathbf{y}^{A})\sim\mathbf{D}^{t}_{l}}[\ell_{ce}(R^{A}([\boldsymbol{\mu}^{A^{c}};\mathbf{x}^{A}_{l}]),\mathbf{y}^{A})] \qquad (7)$$

Algorithm 2 describes the fine-tuning procedure. It is quite similar to Algorithm 1, except that it does not require party C to optimize (5).

Algorithm 2 Federated Fine-tuning
1: Initialization: feature extractors $\mathscr{F}$, batch indices $\mathcal{I}$
2: Input: $\mathbf{D}^{t}_{l}=[\mathbf{X}^{A^{c}}_{l};\mathbf{X}^{A}_{l};\mathbf{Y}^{A}]$
3: for $e=1,2,\dots,K$ do
4:   for $i\in\mathcal{I}$ do
5:     Party C do:
6:       $\mathbf{x}^{A^{c}} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{A^{c}}_{l}$;
7:       compute $\boldsymbol{\mu}^{A^{c}}$ by (1) using $\mathbf{x}^{A^{c}}$;
8:       encrypt $\boldsymbol{\mu}^{A^{c}}$ and send $[[\boldsymbol{\mu}^{A^{c}}]]$ to party A;
9:     Party A do:
10:      $\mathbf{x}^{A}_{l} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{A}_{l}$;
11:      $\mathbf{y}^{A} \leftarrow$ select the $i$-th mini-batch from $\mathbf{Y}^{A}$;
12:    Party A and Party C do:
13:      minimize (7) using Algorithm 3 with $[[\boldsymbol{\mu}^{A^{c}}]],\mathbf{x}^{A}_{l},\mathbf{y}^{A}$;
14:  end for
15: end for
Algorithm 3 Privacy-preserving Federated Training
1: Input: $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p},\mathbf{y}^{p}$, where $p\in\{A,B\}$
2: run Algorithm 4 with $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p},\mathbf{y}^{p}$;
3: run Algorithm 5 with $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p}$;

7 Privacy-preserving Vertical Federated Learning Framework

As shown in (3) and (7), minimizing the label prediction loss for training the label predictor involves data from an active party (either party A or party B) and the passive party C. Therefore, the label predictor should be trained in a privacy-preserving manner. In this section, we elaborate on our proposed privacy-preserving vertical federated learning framework (PP-VFL) of PrADA, which enables two independent parties to collaboratively train the label predictor without exposing their private data. First, we define the label predictor model, which is LR, as follows:

$$R^{p}([\boldsymbol{\mu}^{C};\mathbf{x}^{p}])=\sigma([\boldsymbol{\mu}^{C};\mathbf{x}^{p}]\mathbf{W}+b) \qquad (8)$$

where $\sigma$ is the sigmoid function, $p\in\{A,B\}$ denotes an active party, $\mathbf{W}\in\mathbb{R}^{m+g}$ is the weight vector of model $R^{p}$, and $b\in\mathbb{R}$ is the bias. In this section, we denote by $\boldsymbol{\mu}^{C}$ the high-order feature vector from party C and by $\mathbf{x}^{p}$ the raw features from the active party $p\in\{A,B\}$. We further decompose the input of $\sigma$ as follows:

$$z=\boldsymbol{\mu}^{C}\mathbf{W}^{C}+\mathbf{x}^{p}\mathbf{W}^{p}+b^{p} \qquad (9)$$

where $\mathbf{W}^{C}\in\mathbb{R}^{g}$ corresponds to the input $\boldsymbol{\mu}^{C}$ from party C, while $\mathbf{W}^{p}\in\mathbb{R}^{m}$ corresponds to the input $\mathbf{x}^{p}$ from party $p$. Both $\mathbf{W}^{p}$ and $\mathbf{W}^{C}$ are maintained by party $p$, but the real value of $\mathbf{W}^{C}$ is concealed from both party $p$ and party C, as elaborated in Sections 7.1 and 7.2.

We extend the PHE-based secure protocol of [33], designed for the setting where one party has features and the other has only labels, to our setting where features are distributed across two parties. Our new secure protocol includes two stages: (1) privacy-preserving forward propagation (Algorithm 4) and (2) privacy-preserving backward propagation (Algorithm 5). We denote PHE encryption, addition, and multiplication by $[[\cdot]]$, $\oplus$, and $\otimes$, respectively. Note that in our setting, only party C can encrypt and decrypt the exchanged messages.
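
As a quick illustration of these primitives, the sketch below uses the python-paillier library (our choice for illustration; the paper does not prescribe a specific PHE implementation): ciphertexts can be added to plaintexts or other ciphertexts ($\oplus$) and multiplied by plaintext scalars ($\otimes$), which is exactly what Algorithms 4 and 5 rely on.

```python
from phe import paillier  # python-paillier: pip install phe

# Party C holds the key pair; only it can encrypt and decrypt.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

mu = [0.7, -1.2, 0.3]                          # party C's high-order features mu^C
enc_mu = [public_key.encrypt(v) for v in mu]   # [[mu^C]], sent to the active party

# Active party p: compute [[z~^C]] = [[mu^C]] (x) W~^C and mask it with noise eps^p.
w_tilde_c = [0.5, 0.1, -0.4]
eps_p = 0.123
enc_terms = [m * w for m, w in zip(enc_mu, w_tilde_c)]  # ciphertext-by-plaintext products
enc_z_tilde = enc_terms[0]
for t in enc_terms[1:]:
    enc_z_tilde = enc_z_tilde + t                       # homomorphic addition (+)
enc_masked = enc_z_tilde + eps_p                        # [[z~^C + eps^p]]

# Party C decrypts only the masked value; it never sees z~^C itself.
masked = private_key.decrypt(enc_masked)
expected = sum(m * w for m, w in zip(mu, w_tilde_c)) + eps_p
assert abs(masked - expected) < 1e-6
```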

7.1 Privacy-Preserving Forward Propagation

Algorithm 4 computes the label prediction loss in (3) without compromising the data privacy of the participating parties. To achieve this, party C encrypts $\boldsymbol{\mu}^{C}$ with PHE and sends the encrypted $[[\boldsymbol{\mu}^{C}]]$ to party $p$ to prevent privacy leakage. Upon receiving $[[\boldsymbol{\mu}^{C}]]$, party $p$ could compute the logit $z$ according to (9). However, directly applying (9) yields $[[z]]$, which is not compatible with the logistic function. The workaround is that party $p$ first computes $[[\tilde{z}^{C}]]$ (Algo 4, line 5) and sends it to party C with random noise $\epsilon^{p}$ added (Algo 4, line 8). Party C then decrypts $[[\tilde{z}^{C}+\epsilon^{p}]]$ and adds $\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}$, which cancels out the accumulated random noise $\varepsilon^{C}_{t}$ that was embedded in $\widetilde{\mathbf{W}}_{t}^{C}$ during the backpropagation of the previous iteration. For now, we assume $\widetilde{\mathbf{W}}_{t}^{C}=\mathbf{W}_{t}^{C}-\varepsilon^{C}_{t}$, which we prove in Section 7.2. Here, we show that the logit $z^{C}$ is calculated correctly (Algo 4, line 11):

$$\begin{split}z^{C}&=\tilde{z}^{C}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}\\ &=\boldsymbol{\mu}^{C}\widetilde{\mathbf{W}}_{t}^{C}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}\\ &=\boldsymbol{\mu}^{C}\mathbf{W}_{t}^{C}-\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}\\ &=\boldsymbol{\mu}^{C}\mathbf{W}_{t}^{C}\end{split} \qquad (10)$$

As a result, party C obtains $z^{C}+\epsilon^{p}$. The noise $\epsilon^{p}$ prevents party C from accessing the plaintext $z^{C}$ and thereby recovering $\mathbf{W}_{t}^{C}=z^{C}/\boldsymbol{\mu}^{C}$. Party C sends $z^{C}+\epsilon^{p}$ back to party $p$, which removes the noise $\epsilon^{p}$, computes $z=z^{p}+z^{C}$, and finally computes the loss $\ell_{ce}(\sigma(z),\mathbf{y}^{p})$.
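
A tiny numeric check of the bookkeeping behind (10): party $p$ works only with the noisy weights $\widetilde{\mathbf{W}}^{C}_{t}=\mathbf{W}^{C}_{t}-\varepsilon^{C}_{t}$, yet after party C adds back $\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}$, the recovered logit equals $\boldsymbol{\mu}^{C}\mathbf{W}^{C}_{t}$ exactly. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_c = np.array([0.7, -1.2, 0.3])      # party C's high-order features mu^C
w_c = np.array([0.5, 0.1, -0.4])       # true weights W^C_t (held in the clear by neither party)
eps_acc = rng.normal(size=3)           # accumulated noise eps^C_t, held by party C
w_tilde = w_c - eps_acc                # noisy weights W~^C_t, held by party p

z_tilde = mu_c @ w_tilde               # party p's masked partial logit z~^C
z_c = z_tilde + mu_c @ eps_acc         # party C adds mu^C * eps^C_t (Algo 4, line 11)

assert np.isclose(z_c, mu_c @ w_c)     # Eq. (10): the true logit mu^C W^C_t is recovered
```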

Algorithm 4 Privacy-preserving Forward Propagation
1: Initialization: label predictor model $\widetilde{\mathbf{W}}_{0}^{C}$ and $\mathbf{W}_{0}^{p}$, accumulated noise $\varepsilon^{C}_{0}$
2: Input: $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p},\mathbf{y}^{p}$, $p\in\{A,B\}$
3: Party p:
4:   compute logit:
5:     $[[\tilde{z}^{C}]] \leftarrow [[\boldsymbol{\mu}^{C}]]\otimes\widetilde{\mathbf{W}}_{t}^{C}$;
6:     $z^{p} \leftarrow \mathbf{x}^{p}\mathbf{W}_{t}^{p}+b^{p}$;
7:   add noise $[[\tilde{z}^{C}+\epsilon^{p}]] \leftarrow [[\tilde{z}^{C}]]\oplus\epsilon^{p}$;
8:   send $[[\tilde{z}^{C}+\epsilon^{p}]]$ to party C;
9: Party C:
10:  $\tilde{z}^{C}+\epsilon^{p} \leftarrow$ decrypt $[[\tilde{z}^{C}+\epsilon^{p}]]$;
11:  $z^{C}+\epsilon^{p} \leftarrow \tilde{z}^{C}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}+\epsilon^{p}$;
12:  send $z^{C}+\epsilon^{p}$ to party p;
13: Party p:
14:  remove noise $z^{C} \leftarrow (z^{C}+\epsilon^{p})-\epsilon^{p}$;
15:  $z \leftarrow z^{p}+z^{C}$;
16:  compute loss $\ell_{ce}(\sigma(z),\mathbf{y}^{p})$;
Algorithm 5 Privacy-preserving Backward Propagation
1: Initialization: learning rate $\eta$
2: Input: $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p}$, $p\in\{A,B\}$
3: Party p:
4:   $\delta^{l} \leftarrow \nabla_{\sigma}\ell_{ce}$ w.r.t. the activation function $\sigma$;
5:   backpropagate gradients $\delta^{l}$:
6:     $[[\Delta\mathbf{W}^{C}_{t}]] \leftarrow [[\boldsymbol{\mu}^{C}]]\otimes\delta^{l}$;
7:     $\Delta\mathbf{W}^{p}_{t} \leftarrow \mathbf{x}^{p}\delta^{l}$;
8:     $\Delta b^{p}_{t} \leftarrow \delta^{l}$;
9:   add noise $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]] \leftarrow [[\Delta\mathbf{W}^{C}_{t}]]\oplus\epsilon^{p}$;
10:  send $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$ to party C;
11: Party C:
12:  $\Delta\mathbf{W}^{C}_{t}+\epsilon^{p} \leftarrow$ decrypt $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$;
13:  add noise $\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p} \leftarrow \Delta\mathbf{W}^{C}_{t}+\frac{\epsilon^{C}}{\eta}+\epsilon^{p}$;
14:  $\varepsilon^{C}_{t+1} \leftarrow \varepsilon^{C}_{t}+\epsilon^{C}$ and $[[\varepsilon^{C}_{t+1}]] \leftarrow$ encrypt $\varepsilon^{C}_{t+1}$;
15:  send $[[\varepsilon^{C}_{t+1}]]$ and $\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p}$ to party p;
16: Party p:
17:  remove noise $\Delta\widetilde{\mathbf{W}}^{C}_{t} \leftarrow (\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p})-\epsilon^{p}$;
18:  update weights and bias of the logistic regression model:
19:    $\widetilde{\mathbf{W}}^{C}_{t+1} \leftarrow \widetilde{\mathbf{W}}^{C}_{t}-\eta\Delta\widetilde{\mathbf{W}}^{C}_{t}$;
20:    $\mathbf{W}^{p}_{t+1} \leftarrow \mathbf{W}^{p}_{t}-\eta\Delta\mathbf{W}^{p}_{t}$;
21:    $b_{t+1}^{p} \leftarrow b_{t}^{p}-\eta\Delta b_{t}^{p}$;
22:  $[[\delta^{C}]] \leftarrow \delta^{l}\otimes(\widetilde{\mathbf{W}}^{C}_{t+1}\oplus[[\varepsilon^{C}_{t+1}]])$;
23:  send $[[\delta^{C}]]$ to party C;
24: Party C:
25:  $\delta^{C} \leftarrow$ decrypt $[[\delta^{C}]]$;
26:  update the feature aggregators in $\mathscr{G}$ and the feature extractors in $\mathscr{F}$ based on the gradient $\delta^{C}$ using SGD;

7.2 Privacy-Preserving Backward Propagation

During the privacy-preserving backward propagation described in Algorithm 5, the active party $p$ securely updates the logistic regression model $R^{p}$ and backpropagates gradients to party C. As shown in (9), the weights of $R^{p}$ are partitioned into $\mathbf{W}^{p}$ and $\mathbf{W}^{C}$. On the one hand, party $p$ can compute the gradients $\Delta\mathbf{W}^{p}_{t}$ and $\Delta b^{p}_{t}$ and update $\mathbf{W}^{p}_{t}$ and $b^{p}_{t}$ (Algo 5, lines 20-21) in plaintext, since party $p$ owns these parameters. On the other hand, party $p$ cannot directly update $[[\mathbf{W}^{C}_{t+1}]]\leftarrow\mathbf{W}^{C}_{t}-\eta[[\Delta\mathbf{W}^{C}_{t}]]$, since this leads to incompatibility with PHE when computing $[[\tilde{z}^{C}]]\leftarrow[[\boldsymbol{\mu}^{C}]]\otimes[[\mathbf{W}^{C}_{t+1}]]$ in the next iteration of forward propagation. To work around this issue, party $p$ could send the encrypted gradients $[[\Delta\mathbf{W}^{C}_{t}]]$ to party C and get the decrypted $\Delta\mathbf{W}^{C}_{t}$ back. However, this leads to privacy leakage for both parties: because $\Delta\mathbf{W}^{C}_{t}=\boldsymbol{\mu}^{C}\otimes\delta_{l}$, knowing $\Delta\mathbf{W}^{C}_{t}$ lets party $p$ infer the value of $\boldsymbol{\mu}^{C}$, while party C can infer the gradient $\delta_{l}$ during training. Therefore, to conceal the real value of $\Delta\mathbf{W}^{C}_{t}$ from both parties, the two parties mask $\Delta\mathbf{W}^{C}_{t}$ with their respective random noises. Specifically, party $p$ adds noise $\epsilon^{p}$ to $[[\Delta\mathbf{W}^{C}_{t}]]$ and sends $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$ to party C (Algo 5, lines 9-10). Party C in turn decrypts $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$ and sends $\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p}$ back to party $p$, where $\Delta\widetilde{\mathbf{W}}^{C}_{t}=\Delta\mathbf{W}^{C}_{t}+\frac{\epsilon^{C}}{\eta}$ (Algo 5, line 13), $\epsilon^{C}$ is the random noise generated by party C, and $\eta$ is the learning rate. Party $p$ then removes the noise $\epsilon^{p}$ and updates $\widetilde{\mathbf{W}}^{C}_{t+1}$ based on the gradient $\Delta\widetilde{\mathbf{W}}^{C}_{t}$ (Algo 5, line 19). Note that while the noise $\epsilon^{p}$ can be removed by party $p$, the noise $\epsilon^{C}$ added by party C accumulates in the weights $\widetilde{\mathbf{W}}^{C}_{t}$ through $\Delta\widetilde{\mathbf{W}}^{C}_{t}$ at each iteration. Intuitively, the real value of $\mathbf{W}^{C}_{t+1}$ can be seen as shared between party C and party $p$, a concept similar to secret sharing.

For party $p$ to correctly calculate the intermediate gradient $\delta^{C}$, party $p$ needs to cancel out the accumulated noise embedded in $\widetilde{\mathbf{W}}^{C}_{t+1}$. To this end, party C sends the encrypted accumulated noise $[[\varepsilon^{C}_{t+1}]]$ to party $p$, which then calculates the gradient $[[\delta^{C}]]$ of the loss $\ell_{ce}$ with respect to $\boldsymbol{\mu}^{C}$ (Algo 5, line 22) using $[[\varepsilon^{C}_{t+1}]]$. To show that the value of the gradient $\delta^{C}$ is calculated correctly, we prove that $\widetilde{\mathbf{W}}^{C}_{t+1}=\mathbf{W}^{C}_{t+1}-\varepsilon^{C}_{t+1}$ by mathematical induction, assuming $\widetilde{\mathbf{W}}^{C}_{t}=\mathbf{W}^{C}_{t}-\varepsilon^{C}_{t}$ and initializing $\varepsilon^{C}_{0}=0$:

$$\begin{split}\widetilde{\mathbf{W}}^{C}_{t+1}&=\widetilde{\mathbf{W}}^{C}_{t}-\eta\Delta\widetilde{\mathbf{W}}^{C}_{t}\\ &=\widetilde{\mathbf{W}}^{C}_{t}-\eta\left(\Delta\mathbf{W}^{C}_{t}+\frac{\epsilon^{C}}{\eta}\right)\\ &=(\mathbf{W}^{C}_{t}-\eta\Delta\mathbf{W}^{C}_{t})-(\varepsilon^{C}_{t}+\epsilon^{C})\\ &=\mathbf{W}^{C}_{t+1}-\varepsilon^{C}_{t+1}\end{split}$$

Finally, party $p$ sends $[[\delta^{C}]]$ back to party C, which decrypts $[[\delta^{C}]]$ and backpropagates $\delta^{C}$ locally to optimize its local models.
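
The following sketch replays one iteration of this noise bookkeeping with made-up numbers and checks the induction step $\widetilde{\mathbf{W}}^{C}_{t+1}=\mathbf{W}^{C}_{t+1}-\varepsilon^{C}_{t+1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
eta = 0.05                                  # learning rate

w_c = rng.normal(size=4)                    # true W^C_t (conceptually secret-shared)
eps_acc = rng.normal(size=4)                # accumulated noise eps^C_t at party C
w_tilde = w_c - eps_acc                     # noisy W~^C_t held by party p

grad_w_c = rng.normal(size=4)               # true gradient Delta W^C_t
eps_c = rng.normal(size=4)                  # fresh noise eps^C from party C
grad_tilde = grad_w_c + eps_c / eta         # Delta W~^C_t returned to party p (Algo 5, line 13)

w_tilde_next = w_tilde - eta * grad_tilde   # party p's update (Algo 5, line 19)
w_c_next = w_c - eta * grad_w_c             # the true (never materialized) update
eps_acc_next = eps_acc + eps_c              # accumulated noise kept by party C (Algo 5, line 14)

assert np.allclose(w_tilde_next, w_c_next - eps_acc_next)  # induction step holds
```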

TABLE I: Comparison between models in different settings on the Census Income dataset.
Positive labels | 40 | 80 | 160
Setting | Model | AUC (%) | KS (%) | AUC (%) | KS (%) | AUC (%) | KS (%)
A-Local | LR | 65.57 | 31.48 | 72.33 | 35.32 | 72.71 | 35.45
A-Local | XGBoost | 67.59 | 32.27 | 74.07 | 37.62 | 77.60 | 41.68
A-VFL | SecureLR | 69.67 | 32.21 | 72.72 | 36.87 | 75.07 | 39.48
A-VFL | SecureBoost | 71.97 | 34.73 | 77.02 | 41.61 | 80.08 | 46.80
A-VFL | PrADA w/o DA&FG&IR | 73.72±0.41 | 35.36±0.68 | 77.48±0.47 | 42.32±0.30 | 79.13±0.68 | 44.90±0.57
AB-VFL | SecureLR | 72.88 | 34.73 | 73.80 | 35.83 | 74.63 | 38.48
AB-VFL | SecureBoost | 78.06 | 42.18 | 79.56 | 45.56 | 80.82 | 47.87
AB-VFL | PrADA w/o DA&FG&IR | 77.65±0.38 | 43.09±0.64 | 78.97±0.49 | 45.91±0.51 | 80.56±0.31 | 47.62±0.48
B→A | PrADA w/o FG&IR | 78.98±0.13 | 43.42±0.50 | 80.17±0.28 | 46.86±0.93 | 81.10±0.57 | 48.14±0.75
B→A | PrADA w/o IR | 78.92±0.16 | 44.06±0.72 | 80.49±0.37 | 47.36±0.55 | 81.36±0.15 | 48.73±0.56
B→A | PrADA | 79.17±0.40 | 44.92±0.68 | 81.08±0.30 | 48.06±0.72 | 81.46±0.06 | 49.27±0.42

7.3 Discussions on Privacy Protection

In this section, we discuss the privacy-preserving capability of our PP-VFL, its possible privacy leakage, and the associated trade-offs.

Proposition 1

The active party $p$ cannot reveal the true value of the feature vector $\boldsymbol{\mu}^{C}$ passed from the passive party C during training and inference.

Proof. There are three ways party $p$ could attempt to recover the true value of the feature vector $\boldsymbol{\mu}^{C}$ during training. The first is to decrypt $[[\boldsymbol{\mu}^{C}]]$ directly; however, this is impossible without knowing the private key. The second is to derive $\boldsymbol{\mu}^{C}$ from $z^{C}/\widetilde{\mathbf{W}}_{t}^{C}$ according to (10); however, this requires party $p$ to remove the noise $\varepsilon^{C}_{t}$ from $\widetilde{\mathbf{W}}_{t}^{C}$. Suppose $\varepsilon^{C}_{t}$ is the random noise accumulated by party C at iteration $t$ and $\hat{\varepsilon}^{C}_{t}$ is a guess made by party $p$. The probability that $\varepsilon^{C}_{t}=\hat{\varepsilon}^{C}_{t}$ is $Pr(\varepsilon^{C}_{t}=\hat{\varepsilon}^{C}_{t})\leq(1-e^{-2/|\mathbb{Z}|})$ [26]. Because $|\mathbb{Z}|$ is typically a very large number, $Pr(\varepsilon^{C}_{t}=\hat{\varepsilon}^{C}_{t})$ is very close to zero. Third, $\boldsymbol{\mu}^{C}$ could be derived from $\Delta\mathbf{W}^{C}_{t}/\delta^{l}$ after the noise $\epsilon^{C}$ is removed from $\Delta\widetilde{\mathbf{W}}^{C}_{t}$; however, the probability $Pr(\epsilon^{C}=\hat{\epsilon}^{C})$ that a guess $\hat{\epsilon}^{C}$ made by party $p$ equals $\epsilon^{C}$ likewise approaches zero.

Proposition 2

The active party $p$ cannot infer the true value of the weights $\mathbf{W}^{C}$ during training and inference.

Proof. There are two ways party $p$ could infer $\mathbf{W}^{C}$. One is via $z^{C}/\boldsymbol{\mu}^{C}$ according to (10), and the other is by removing the accumulated noise $\varepsilon^{C}$ from $\widetilde{\mathbf{W}}^{C}$. According to Proposition 1, the true value of $\boldsymbol{\mu}^{C}$ is concealed from party $p$ during training and inference, and the probability $Pr(\varepsilon^{C}=\hat{\varepsilon}^{C})\leq(1-e^{-2/|\mathbb{Z}|})$ that party $p$ can generate a noise $\hat{\varepsilon}^{C}$ to cancel out $\varepsilon^{C}$ is close to zero. Therefore, party $p$ cannot infer the true value of $\mathbf{W}^{C}$ during training and inference.

The active party $p$ cannot infer the data of the passive party C during training because party $p$ has no access to $\boldsymbol{\mu}^{C}$, $\mathbf{W}^{C}$, or party C's local model. [12] proposes model inversion (MI), which enables the attacker (i.e., party $p$) to recover the private data of the victim (i.e., party C) during inference. To recover data of reasonably high quality, [12] makes the strong assumptions that the attacker knows the network structure of the victim and has access to training data that follows the same distribution as the victim's, so that the victim's local model can be approximated. However, these assumptions typically do not hold in scenarios like finance. For one thing, financial data are generally not publicly available because they are sensitive and their publication is regulated. For another, participating parties provide heterogeneous features in VFL and thus typically adopt different model structures. Besides, [12] demonstrates that a local model with a fully-connected layer on top can significantly degrade the quality of the recovered data; $\mathbf{W}^{C}$ automatically provides such a layer of protection.

Proposition 3

The passive party C cannot infer the true value of the weights $\mathbf{W}$ curated by the active party $p$ during training and inference.

Proof. The weights $\mathbf{W}$ are composed of $\mathbf{W}^{C}$ and $\mathbf{W}^{p}$. Party C receives no information on $\mathbf{W}^{p}$ of party $p$, and therefore can learn nothing about $\mathbf{W}^{p}$. There are two ways party C could infer $\mathbf{W}^{C}$. The first is via $(z^{C}+\epsilon^{p})/\boldsymbol{\mu}^{C}$ and the other is via $\delta^{C}/\delta^{l}$. The noise $\epsilon^{p}$ in the former prevents party C from revealing $\mathbf{W}^{C}$, because the probability $Pr(\epsilon^{p}=\hat{\epsilon}^{p})\leq(1-e^{-2/|\mathbb{Z}|})$ that party C can generate a noise $\hat{\epsilon}^{p}$ to cancel out $\epsilon^{p}$ is close to zero, while $\delta^{l}$ in the latter resides only at party $p$. Therefore, party C cannot reveal the true value of $\mathbf{W}^{C}$.

Recent research works show that an attacker (i.e., the passive party C) can leverage gradient inversion (GI) [35], model completion (MC) [8], and properties of the cut-layer gradient (PCG) [15] to recover the labels of the victim (i.e., the active party $p$). Our PP-VFL can prevent the GI attack because the attacker has access to neither the weights (i.e., $\mathbf{W}$) nor the gradient of the label predictor model [14, 10]. However, PP-VFL by itself cannot prevent the MC and PCG attacks, because the cut-layer gradient $\delta^{C}$ is passed to the attacker in plaintext and without any protection. The current form of PP-VFL trades a certain degree of increased label privacy leakage for enhanced model performance and training efficiency. In applications where the labels are important assets, PP-VFL can be equipped with other privacy protection mechanisms (e.g., MARVELL [15] for PCG and CoAE [17] for MC) to trade the protection of label privacy against a certain degree of degraded utility.

8 Experiments

8.1 Experimental datasets and settings

We evaluate our proposed PrADA on two datasets: one is the Census Income dataset, and the other is a real-world financial dataset called Loan Default. For each dataset, we run experiments under the following four settings:

  1. A-Local: Target party A only uses its local data to train models, without leveraging VFL or DA.

  2. A-VFL: Target party A uses the target-domain data $\mathbf{D}^{t}_{l}$ to train models via VFL with party C. This setting serves as the conventional VFL baseline that improves the model performance of party A with additional features from party C.

  3. AB-VFL: Assuming party A and party B belong to the same organization and privacy is not a concern, party A uses both $\mathbf{D}^{t}_{l}$ and $\mathbf{D}^{s}$ to train models via VFL with party C (with no DA). Models in this setting serve as strong baselines because they use all data together.

  4. B→A: We conduct the PrADA approach elaborated in Section 6 to perform federated adversarial domain adaptation from party B to party A.

In settings 2 and 3, we adopt SecureLR and SecureBoost implemented in FATE (https://github.com/FederatedAI/FATE), an industrial-grade federated learning framework, as comparison models. These two models are the VFL versions of the logistic regression model and the tree-boosting model respectively, and they use PHE to protect data privacy. To explore the effectiveness of the different components of PrADA, we evaluate three ablations:

  • PrADA w/o DA&FG&IR: without domain adaptation (DA), feature grouping (FG), and feature-group interaction (IR);

  • PrADA w/o FG&IR: applies domain adaptation, but without feature grouping and interaction;

  • PrADA w/o IR: applies domain adaptation based on feature grouping, but without interaction.

In this paper, we focus on the binary classification problem. Because imbalanced class labels are one of the major motivations for applying domain adaptation in real-world financial applications, we also investigate the effectiveness of our PrADA approach under different positive-label ratios. Specifically, we investigate scenarios in which the target training data has a positive-label ratio of {0.01, 0.02, 0.04}.

For SecureLR, we use the default hyperparameters, while for SecureBoost, we sweep over all combinations of max depth {2, 4, 6, 8} and number of trees {100, 200, 300, 400}, leaving the other hyperparameters at their defaults. For PrADA, we use a batch size of 128 for the Census Income data and 64 for the Loan Default data, and a learning rate of 0.0005 for pre-training and 0.0008 for fine-tuning on both datasets. Our PrADA is implemented with PyTorch. We repeat every experiment 5 times on each dataset, reporting the mean and standard deviation of the AUC and KS (Kolmogorov-Smirnov statistic) [2] of all trained models on the test data of target party A.
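
For reference, the two reported metrics could be computed as in the sketch below; here the KS statistic is taken as the maximum gap between the ROC true-positive and false-positive rates, a common definition in credit scoring (the paper cites [2] for KS).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_ks(y_true: np.ndarray, y_score: np.ndarray):
    """Return (AUC, KS) in percent for binary labels and predicted scores."""
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ks = np.max(tpr - fpr)  # KS = max separation between the two cumulative rates
    return 100.0 * auc, 100.0 * ks

# Example with dummy predictions:
# y = np.array([0, 0, 1, 1]); s = np.array([0.1, 0.4, 0.35, 0.8])
# print(auc_and_ks(y, s))
```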

[Figure 3 panels: (a) Non-adapted employment; (b) Non-adapted demographics; (c) Non-adapted household; (d) Non-adapted migration; (e) Adapted employment; (f) Adapted demographics; (g) Adapted household; (h) Adapted migration]
Figure 3: The effect of FG-based domain adaptation on the distribution of the learned feature representations (best viewed in color). (a)-(d) and (e)-(h) show the t-SNE embeddings of the feature representations learned by PrADA w/o DA&IR and PrADA w/o IR, respectively, on feature groups of the Census Income data. Red points correspond to source-domain samples, while blue points correspond to target-domain samples. In all feature groups, the adaptation brings the two distributions of learned feature representations much closer.

8.2 Experiments on Census Income

Census Income is a census dataset from the UCI Machine Learning Repository. We split it into an undergraduate source domain and a postgraduate target domain. The source domain has 80,000 labeled examples, while the target domain has 4,000 labeled samples and 9,000 unlabeled samples. Our goal is to help party A of the target domain predict whether a person's income exceeds 50,000 US dollars.

After data preprocessing, the Census Income dataset contains 36 features, 31 of which are categorical. We place the 5 numerical features on the active parties (i.e., A and B) and the 31 categorical features on the passive party C. We split the 31 features of party C into 4 feature groups (FG): employment (emp), demographics (demo), household (house), and migration (migr). Thus, we have $C^{4}_{2}$ (i.e., 6) interactive feature groups: emp-demo, emp-house, emp-migr, demo-house, demo-migr, and house-migr. We embed all categorical features into dense vectors. Table II shows the architecture of the feature extractor for each of the 10 feature groups, as well as the one (i.e., all_feat) used for all features when feature grouping is not applied.
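
For illustration, an extractor specified in Table II, e.g. FC(28→56)-FC(56→28)-FC(28→14) with Leaky ReLU for the emp group, could be instantiated in PyTorch as follows; the helper name is ours.

```python
import torch.nn as nn

def make_extractor(dims):
    """Build an MLP feature extractor from a list of layer widths,
    e.g. [28, 56, 28, 14] for the 'emp' group in Table II, with a
    Leaky ReLU after every fully-connected layer."""
    layers = []
    for in_dim, out_dim in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(in_dim, out_dim), nn.LeakyReLU()]
    return nn.Sequential(*layers)

emp_extractor = make_extractor([28, 56, 28, 14])    # FC(28->56)-FC(56->28)-FC(28->14)
demo_extractor = make_extractor([25, 50, 25, 12])   # FC(25->50)-FC(50->25)-FC(25->12)
```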

TABLE II: Architecture of feature extractors for Census Income dataset. All feature extractors only use fully-connected layers, and adopt Leaky ReLU as the activation function, which is omitted in the table for simplicity.
FG name feature extractor architecture
emp FC(28\rightarrow56)-FC(56\rightarrow28)-FC(28\rightarrow14)
demo FC(25\rightarrow50)-FC(50\rightarrow25)-FC(25\rightarrow12)
migr FC(56\rightarrow86)-FC(86\rightarrow56)-FC(56\rightarrow18)
house FC(27\rightarrow54)-FC(54\rightarrow27)-FC(27\rightarrow13)
emp-demo FC(53\rightarrow78)-FC(78\rightarrow53)-FC(53\rightarrow15)
emp-migr FC(84\rightarrow120)-FC(120\rightarrow84)-FC(84\rightarrow20)
emp-house FC(51\rightarrow81)-FC(81\rightarrow55)-FC(55\rightarrow15)
demo-migr FC(81\rightarrow120)-FC(120\rightarrow81)-FC(81\rightarrow20)
demo-house FC(52\rightarrow78)-FC(78\rightarrow52)-FC(52\rightarrow15)
migr-house FC(83\rightarrow120)-FC(120\rightarrow83)-FC(83\rightarrow20)
all_feat FC(136\rightarrow150)-FC(150\rightarrow60)-FC(60\rightarrow20)
TABLE III: Comparison between models in different settings on the Loan Default dataset.
Positive labels | 40 | 80 | 160
Setting | Model | AUC (%) | KS (%) | AUC (%) | KS (%) | AUC (%) | KS (%)
A-Local | LR | 57.17 | 12.65 | 56.51 | 13.16 | 57.77 | 15.14
A-Local | XGBoost | 56.66 | 11.49 | 57.90 | 14.89 | 58.91 | 17.47
A-VFL | SecureLR | 59.67 | 15.31 | 64.68 | 24.10 | 67.78 | 28.61
A-VFL | SecureBoost | 57.88 | 12.86 | 64.90 | 23.33 | 70.68 | 31.09
A-VFL | PrADA w/o DA&FG&IR | 63.26±0.88 | 21.14±1.26 | 67.49±0.78 | 28.06±1.12 | 68.66±0.94 | 29.59±1.39
AB-VFL | SecureLR | 72.72 | 35.81 | 73.04 | 35.90 | 74.22 | 36.87
AB-VFL | SecureBoost | 75.18 | 38.53 | 75.96 | 40.65 | 76.16 | 41.83
AB-VFL | PrADA w/o DA&FG&IR | 75.11±0.37 | 40.28±1.03 | 75.16±0.19 | 40.63±0.64 | 75.53±0.23 | 41.12±0.71
B→A | PrADA w/o FG&IR | 75.27±0.25 | 40.82±0.53 | 75.52±0.22 | 41.25±0.24 | 75.76±0.28 | 41.91±0.56
B→A | PrADA w/o IR | 75.63±0.11 | 41.42±0.74 | 75.84±0.09 | 42.04±0.61 | 76.43±0.08 | 42.61±0.17
B→A | PrADA | 75.75±0.12 | 41.69±0.36 | 75.99±0.05 | 42.48±0.21 | 76.58±0.18 | 43.48±0.62

The experimental results are shown in Table I. From these results, we observe the following: (1) SecureBoost and SecureLR in A-VFL outperform their counterparts in A-Local, demonstrating that leveraging additional features improves model performance. (2) The models in B→A significantly outperform the models in A-VFL across all positive-label settings. This is expected because a considerable amount of source data is involved in training. More specifically, when the number of positive labels is small (i.e., 40), the performance gain is the most significant: PrADA outperforms PrADA w/o DA&FG&IR in AUC by 5.45% and in KS by 9.56%, and outperforms SecureBoost in AUC by 7.2% and in KS by 10.19%. (3) PrADA w/o FG&IR in B→A outperforms PrADA w/o DA&FG&IR in AB-VFL in AUC by 1.02% and in KS by 0.60% on average, and outperforms SecureBoost in AUC by 0.60% and in KS by 0.94% on average, demonstrating the effectiveness of PrADA in bridging the divergence between the source and target domains. (4) In the B→A setting, PrADA w/o IR outperforms PrADA w/o FG&IR in AUC by 0.17% and in KS by 0.58% on average, demonstrating the effectiveness of FG-based domain adversarial training in improving the transferability of the feature extractors. In addition to boosting model performance, feature grouping also enhances the interpretability of the target model $R^{A}$, which we discuss in Section 8.4. (5) In the B→A setting, PrADA outperforms PrADA w/o IR in AUC by 0.31% and in KS by 0.70% on average, demonstrating that the interactions between feature groups help improve model performance.

To look more closely at the effect of FG-based domain adaptation on the learned feature representations, we visualize their t-SNE embeddings [4] in Figure 3. Figures 3(a)-(d) and 3(e)-(h) show the t-SNE embeddings of the feature representations learned by PrADA w/o DA&IR and PrADA w/o IR, respectively, on feature groups of the Census Income data. We observe that the adaptation in our method brings the two distributions of learned feature representations much closer in all feature groups.

8.3 Experiments on Loan Default

Loan Default is a loan default risk dataset for the online lending industry published by FinVolution Group. It contains loans issued in 2014. We treat the 40000 labeled samples of loans issued in the first three quarters of 2014 as the source domain, and the 4000 labeled samples plus 9000 unlabeled samples of loans issued in the fourth quarter as the target domain. This is an Out-Of-Time scenario in financial risk control. Our goal is to help party A build a predictor that estimates whether a loan will default.
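As a rough illustration of this Out-Of-Time split, the sketch below partitions a loan table by issue quarter with pandas; the file name and the issue_date column are hypothetical placeholders, since the released dataset uses anonymized names.

```python
# A rough sketch of the Out-Of-Time split, assuming a hypothetical CSV file with
# an `issue_date` column; actual column names in the released dataset are anonymized.
import pandas as pd

loans = pd.read_csv("loan_default_2014.csv", parse_dates=["issue_date"])
loans = loans[loans["issue_date"].dt.year == 2014]

source = loans[loans["issue_date"].dt.quarter <= 3]   # Q1-Q3 2014: source domain
target = loans[loans["issue_date"].dt.quarter == 4]   # Q4 2014: target domain

# The target domain is further split into a small labeled set and a larger
# unlabeled set (4,000 / 9,000 samples in our experiments).
target_labeled = target.sample(n=4000, random_state=0)
target_unlabeled = target.drop(target_labeled.index).sample(n=9000, random_state=0)
```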

After data preprocessing, the Loan Default dataset has 162 features, 27 of which are categorical. To protect privacy, user and feature names are anonymized. We put 6 demographic features and the labels on the active parties, and the remaining 156 features on the passive party C. We split the features of party C into 5 groups: user location (loc), third-party period (period), education (edu), social network (soc), and micro-blog (mblog). Thus, we have 10 interactive feature groups (all pairwise combinations of the 5 groups): loc-period, loc-edu, loc-soc, loc-mblog, period-edu, period-soc, period-mblog, edu-soc, edu-mblog, and soc-mblog. We embed all categorical features into dense vectors. Table IV shows the architecture of the feature extractor for each of the 15 feature groups, as well as the one (i.e., all_feat) used for all features when feature grouping is not applied.
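To make the grouping concrete, the sketch below enumerates the 10 interactive groups as all pairwise combinations of the 5 base groups; the column names are hypothetical placeholders, and the per-group sizes simply mirror the input dimensions listed in Table IV (after categorical features are embedded).

```python
# A minimal sketch of enumerating interactive feature groups on party C.
# Column names are hypothetical placeholders for the anonymized feature names;
# group sizes follow the input dimensions of the extractors in Table IV.
from itertools import combinations

base_groups = {
    "loc":    [f"loc_{i}" for i in range(15)],
    "period": [f"period_{i}" for i in range(85)],
    "edu":    [f"edu_{i}" for i in range(30)],
    "soc":    [f"soc_{i}" for i in range(18)],
    "mblog":  [f"mblog_{i}" for i in range(55)],
}

# The 10 interactive groups are all pairwise combinations of the 5 base groups.
interactive_groups = {
    f"{a}-{b}": base_groups[a] + base_groups[b]
    for a, b in combinations(base_groups, 2)
}
print(list(interactive_groups))   # ['loc-period', 'loc-edu', ..., 'soc-mblog']
```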

TABLE IV: Architectures of the feature extractors for the Loan Default dataset. All feature extractors use only fully-connected layers with Leaky ReLU activations, which are omitted in the table for simplicity.
FG name        feature extractor architecture
loc            FC(15→20)-FC(20→15)-FC(15→6)
period         FC(85→100)-FC(100→60)-FC(60→8)
edu            FC(30→50)-FC(50→30)-FC(30→6)
soc            FC(18→30)-FC(30→18)-FC(18→6)
mblog          FC(55→70)-FC(70→30)-FC(30→8)
loc-period     FC(100→120)-FC(120→75)-FC(75→14)
loc-edu        FC(45→70)-FC(70→45)-FC(45→12)
loc-soc        FC(33→50)-FC(50→33)-FC(33→12)
loc-mblog      FC(70→90)-FC(90→45)-FC(45→14)
period-edu     FC(115→150)-FC(150→90)-FC(90→14)
period-soc     FC(103→130)-FC(130→78)-FC(78→14)
period-mblog   FC(140→170)-FC(170→90)-FC(90→16)
edu-soc        FC(48→80)-FC(80→48)-FC(48→12)
edu-mblog      FC(85→120)-FC(120→60)-FC(60→14)
soc-mblog      FC(73→100)-FC(100→48)-FC(48→14)
all_feat       FC(203→210)-FC(210→70)-FC(70→20)
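A minimal PyTorch sketch of how such an extractor can be instantiated from a list of layer widths is given below; this is an assumed implementation rather than the paper's code, and whether an activation follows the last layer is a detail not fixed by the table.

```python
# A minimal PyTorch sketch (assumed implementation) of a feature extractor built
# from a list of layer widths, matching the FC(...)-style rows of Table IV.
import torch.nn as nn

def make_extractor(dims):
    """Stack FC layers with Leaky ReLU between them, e.g. dims=[15, 20, 15, 6]."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.LeakyReLU()]
    return nn.Sequential(*layers)

loc_extractor = make_extractor([15, 20, 15, 6])       # FC(15→20)-FC(20→15)-FC(15→6)
period_extractor = make_extractor([85, 100, 60, 8])   # FC(85→100)-FC(100→60)-FC(60→8)
```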

The experimental results are reported in Table III. From these results, we observe that: (1) Table III shows a trend similar to that of Table I: model performance improves from the A-Local setting to A-VFL and then to B→A as more data is involved in training. The performance gain is largest when the number of positive labels is small. Specifically, when the number of positive labels is 40, PrADA in the B→A setting outperforms PrADA w/o DA&FG&IR in the A-VFL setting in AUC and KS by 12.49% and 20.55%, respectively, and outperforms SecureBoost in AUC and KS by 17.87% and 28.83%, respectively. As the number of positive labels increases, the performance gain narrows. (2) PrADA w/o FG&IR in the B→A setting outperforms PrADA w/o DA&FG&IR in AB-VFL in AUC by 0.25% and in KS by 0.65% on average, demonstrating the effectiveness of PrADA in mitigating domain divergence. (3) In the B→A setting, PrADA w/o IR outperforms PrADA w/o FG&IR, demonstrating the superiority of FG-based DA over conventional DA, and PrADA outperforms PrADA w/o IR, demonstrating that the interaction between feature groups helps enhance model performance.

8.4 Model Interpretability

We demonstrate model interpretability by visualizing the impact of features on the target model R^A using SHAP [18], a tool widely used to explain black-box models. As discussed in section 7, the real values of the model parameters of R^A are accessible by neither party C nor party A. This means that party A cannot interpret the model by simply inspecting its parameters. SHAP provides party A with a way to interpret the model without accessing the model parameters. We select the Census Income dataset for this purpose because the feature names in the Loan Default dataset are anonymized.

Figure 4: The importance of features produced by SHAP. Each feature is either a raw feature from party A or an (interactive) feature group from party C.

Figure 4 lists the most influential features of model R^A in descending order. Features at the top have higher predictive power because they contribute more to the model than those at the bottom. For example, emp-demo, gender, capital_gain, migr-house, and demo-migr are the top-5 most influential features.

Figure 5: The impact of features on model predictions. Each feature is either a raw feature from party A or a high-level feature representing an (interactive) feature group from party C.

SHAP can further reveal the positive and negative relationships of features with the prediction target. Figure 5 plots the SHAP values of every feature for all samples to illustrate the impact of those features on the prediction output. Features are ranked in descending order of importance. The color represents the feature value (red high, blue low), and the horizontal location shows whether the effect of a feature value is associated with a higher or lower prediction. Specifically, emp-demo, capital_gain, and demo-migr are positively correlated with the prediction, while gender and migr-house are negatively correlated with it.
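To illustrate the workflow behind Figures 4 and 5, the sketch below runs SHAP's model-agnostic KernelExplainer on a toy logistic-regression stand-in for R^A with synthetic data; in the federated setting, the prediction function would instead invoke the secure inference protocol, and the inputs would be party A's raw features together with the group-level representations received from party C.

```python
# A self-contained toy sketch of the SHAP workflow behind Figures 4 and 5. The
# logistic-regression model, synthetic data, and feature names are stand-ins;
# predict_pos would call the secure inference of R^A in the federated setting.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["gender", "capital_gain", "emp-demo", "migr-house", "demo-migr"]
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 2] - X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def predict_pos(x):
    # Black-box scoring function: exposes only predictions, never model parameters.
    return model.predict_proba(x)[:, 1]

explainer = shap.KernelExplainer(predict_pos, shap.sample(X, 50))
shap_values = explainer.shap_values(X[:100])

# Bar plot of mean |SHAP| (as in Figure 4) and beeswarm plot of per-sample impact (as in Figure 5).
shap.summary_plot(shap_values, X[:100], feature_names=feature_names, plot_type="bar")
shap.summary_plot(shap_values, X[:100], feature_names=feature_names)
```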

8.5 Computation Cost

We compare the training time among SecureLR, SecureBoost, and PrADA using FATE 1.6. The experiments are conducted on a machine with 72 Intel Xeon Gold 6140 CPUs and 320 GB of RAM. All experiments are simulated in standalone deployment mode. Note that the privacy-preserving VFL framework (PP-VFL) discussed in section 7 has been integrated into FATE, while the federated adversarial domain adaptation (FADA) discussed in section 6 has not, because some of FADA's core functionalities are not yet supported by FATE. Thus, we estimate the training time of PrADA by simulation using PP-VFL on FATE.

TABLE V: Training time (hours) on the Census Income dataset. PT and FT denote pre-training time and fine-tuning time, respectively.
Setting   Model                 Time (h)
AB-VFL    SecureLR              ~1.12
          SecureBoost           ~2.16
          PrADA w/o DA&FG&IR    ~4.10
Setting   Model                 PT (h)    FT (h)
B→A       PrADA w/o IR          ~5.47     ~0.52
          PrADA                 ~8.24     ~1.32

Table V shows that the training time of PrADA w/o DA&FG&IR is roughly 4.10 hours, approximately twice the training time of SecureBoost. In the B→A setting, PrADA w/o IR takes 5.47 hours to pre-train because FG-based domain adaptation is involved. However, once pre-training is completed, PrADA w/o IR takes only about half an hour to fine-tune. PrADA takes 8.24 hours to pre-train because it spends extra time on feature-group interaction, additional feature extractor training, and feature representation encryption. As reported in Tables I and III, PrADA exceeds PrADA w/o IR by only a small margin. Therefore, if efficiency is a major concern, PrADA w/o IR is the better choice.

9 Conclusion

In this paper, we propose a privacy-preserving vertical federated adversarial domain adaptation approach. In particular, we develop a privacy-preserving VFL framework that allows participating parties to collaboratively conduct domain adaptation without exposing private data. To reduce feature dimensionality, enhance model interpretability, and facilitate the learning of domain-invariant features, we propose a fine-grained adversarial domain adaptation over feature groups, each of which holds tightly related features. Experiments demonstrate both the effectiveness and practicality of our approach.

Acknowledgments

This work is partially supported by the National Key Research and Development Program of China under grant [2018AAA0101100].

References

  • [1] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, NIPS’18, page 7786–7795, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • [2] I. M. Chakravarti, R. G. Laha, and J. Roy. Handbook of Methods of Applied Statistics, Volume I. John Wiley and Sons, NY, 1967.
  • [3] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. This looks like that: Deep learning for interpretable image recognition. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 8930–8941. Curran Associates, Inc., 2019.
  • [4] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 647–655, China, 22–24 Jun 2014. PMLR.
  • [5] Cynthia Dwork. A firm foundation for private data analysis. Communications of the ACM, 54(1):86–95, 2011.
  • [6] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [7] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • [8] Chong Fu, Xuhong Zhang, Shouling Ji, Jinyin Chen, Jingzheng Wu, Shanqing Guo, Jun Zhou, Alex X Liu, and Ting Wang. Label inference attacks against vertical federated learning. In 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, August 2022. USENIX Association.
  • [9] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
  • [10] Hanlin Gu, Lixin Fan, Bowen Li, Yan Kang, Yuan Yao, and Qiang Yang. Federated deep learning with bayesian privacy. CoRR, abs/2109.13012, 2021.
  • [11] Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. Private federated learning on vertically partitioned data via entity resolution and homomorphic encryption. CoRR, abs/1711.10677, 2017.
  • [12] Zecheng He, Tianwei Zhang, and Ruby B. Lee. Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, ACSAC ’19, page 148–162, NY, USA, 2019. Association for Computing Machinery.
  • [13] Daniel Ho. Nbdt: Neural-backed decision trees. Master’s thesis, EECS Department, University of California, Berkeley, May 2020.
  • [14] Yan Kang, Yuezhou Wu, Jiahuan Luo, Yuanqin He, and Qiang Yang. Fedcg: Leverage conditional GAN for protecting privacy and maintaining competitive performance in federated learning. CoRR, abs/2111.08211, 2021.
  • [15] Oscar Li, Jiankai Sun, Xin Yang, Weihao Gao, Hongyi Zhang, Junyuan Xie, Virginia Smith, and Chong Wang. Label leakage and protection in two-party split learning. In International Conference on Learning Representations, 2022.
  • [16] Xiaoxiao Li, Yufeng Gu, Nicha Dvornek, Lawrence H. Staib, Pamela Ventola, and James S. Duncan. Multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: Abide results. Medical Image Analysis, 65:101765, 2020.
  • [17] Yang Liu, Zhihao Yi, Yan Kang, Yuanqin He, Wenhan Liu, Tianyuan Zou, and Qiang Yang. Defending label inference and backdoor attacks in vertical federated learning. CoRR, abs/2112.05409, 2021.
  • [18] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
  • [19] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
  • [20] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In 36th International Conference on Machine Learning, ICML 2019, pages 8114–8124. International Machine Learning Society (IMLS), January 2019.
  • [21] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing: A Review Journal, 73:1–15, February 2018.
  • [22] Xingchao Peng, Zijun Huang, Yizhe Zhu, and Kate Saenko. Federated adversarial domain adaptation. arXiv preprint arXiv:1911.02054, 2019.
  • [23] Peter Kairouz, H. Brendan McMahan, Brendan Avent, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • [24] D. Peterson. Private federated learning with domain adaptation. ArXiv, abs/1912.06733, 2019.
  • [25] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [26] Bruce Schneier and Phil Sutherland. Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley and Sons, Inc., USA, 2007.
  • [27] L. Song, C. Ma, G. Zhang, and Y. Zhang. Privacy-preserving unsupervised domain adaptation in federated setting. IEEE Access, 8:143233–143240, 2020.
  • [28] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 2962–2971, 2017.
  • [29] Praneeth Vepakomma, Tristan Swedish, Ramesh Raskar, Otkrist Gupta, and Abhimanyu Dubey. No peek: A survey of private distributed deep learning, 2018.
  • [30] Z. Wang, Z. Dai, B. Póczos, and J. Carbonell. Characterizing and avoiding negative transfer. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11285–11294, 2019.
  • [31] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):12, 2019.
  • [32] Qiao Zhang, Cong Wang, Hongyi Wu, Chunsheng Xin, and Tran V. Phuong. Gelu-net: A globally encrypted, locally unencrypted deep neural network for privacy-preserved learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 3933–3939. International Joint Conferences on Artificial Intelligence Organization, 7 2018.
  • [33] Y. Zhang and Hao Zhu. Additively homomorphical encryption based deep neural network for asymmetrically collaborative machine learning. ArXiv, abs/2007.06849, 2020.
  • [34] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7404–7413, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • [35] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Yan Kang is currently a research team lead with the AI department of WeBank, Shenzhen, China. His work focuses on the research and implementation of privacy-preserving machine learning and federated learning. His research has been published in well-known conferences and journals including IEEE Intelligent Systems, IJCAI, and ACM TIST, and he coauthored the Federated Learning book.
Yuanqin He is currently a researcher with WeBank. He received the B.S. degree from Shanghai Jiao Tong University, and the Ph.D. degree in Physics from Technical University of Munich. His research interests include machine learning and federated learning.
Jiahuan Luo is currently a researcher with WeBank. He received the B.S. degree from Guangdong University of Foreign Studies and the Master degree in Software Engineering from South China University of Technology. His research interests include federated learning and representation learning.
Yang Liu is an associate professor with the Institute for AI Industry Research (AIR), Tsinghua University. Her research interests include federated learning, machine learning, multi-agent systems, statistical mechanics, and AI industrial applications. Her research work has been recognized with multiple awards, such as the AAAI Innovation Award and the CCF Technology Award.
Tao Fan is a tech lead with the AI department of WeBank, Shenzhen, China. He is now responsible for FATE, an industrial-grade open-source federated learning project. He has more than 8 years of experience in large-scale machine learning. He received his Master's degree from the University of Science and Technology of China in 2013.
Qiang Yang is a fellow of the Royal Society of Canada (RSC) and the Canadian Academy of Engineering (CAE), Chief Artificial Intelligence Officer of WeBank, and a Chair Professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology (HKUST). He is a fellow of AAAI, ACM, CAAI, IEEE, IAPR, and AAAS. His research interests are artificial intelligence, machine learning, data mining, and planning. His latest books are Transfer Learning, Federated Learning, and Practicing Federated Learning.