
Privacy-preserving Federated Adversarial Domain Adaptation over Feature Groups for Interpretability

Yan Kang, Yuanqin He, Jiahuan Luo, Tao Fan, Yang Liu, and Qiang Yang. Corresponding author: Yan Kang (email: [email protected]).
Abstract

We present PrADA, a novel privacy-preserving federated adversarial domain adaptation approach, to address an under-studied but practical cross-silo federated domain adaptation problem in which the party of the target domain lacks both samples and features. We handle the lack of features by extending the feature space through vertical federated learning with a feature-rich party, and we tackle the scarcity of samples by performing adversarial domain adaptation from the sample-rich source party to the target party. In this work, we focus on financial applications where interpretability is critical. However, existing adversarial domain adaptation methods typically apply a single feature extractor to learn feature representations that are of low interpretability with respect to the target task. To improve interpretability, we exploit domain expertise to split the feature space into multiple groups, each holding tightly relevant features, and we learn a semantically meaningful high-order feature from each feature group. In addition, we apply fine-grained domain adaptation to each feature group to improve transferability. We design a privacy-preserving vertical federated learning framework that enables PrADA to be performed securely and efficiently. We evaluate our approach on two tabular datasets. Experiments demonstrate both the effectiveness and practicality of our approach.

Index Terms:
Vertical Federated Learning, Privacy, Domain Adaptation, Interpretability.
Copyright (c) 2022 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

1 Introduction

Domain adaptation approaches [9, 28, 25, 30, 34] have shown notable success. These approaches typically establish alignment or minimize the discrepancy between source and target domains by learning domain-invariant feature representations in the form of deep neural network (DNN) feature extractors. In addition to the remarkable ability of DNNs to encode raw data into meaningful representations that yield high performance on target tasks, a major enabler of the adoption of DNNs in domain adaptation is the availability of large amounts of data with rich features (images and text) that support the representation learning of DNNs.

Due to increasingly strict legal and regulatory constraints on user privacy, private data from different organizations (domains) cannot be directly integrated for training machine learning models. In recent years, federated learning (FL) has emerged as a practicable solution to tackle data-silo issues without compromising user privacy. Initially, FL [19] was proposed to build models by utilizing the data of millions of mobile devices. [31] further extends the FL architecture to the enterprise setting, where the number of participating parties might be much smaller but privacy concerns are paramount. This setting is coined cross-silo federated learning [23].

Recently, a growing number of works have integrated domain adaptation into the cross-silo FL setting [24, 22, 16, 27] to solve domain shift among independent parties. These federated domain adaptation (FDA) methods typically conduct experiments on image and text data whose rich features support meaningful representation learning. However, in many real-world FL applications where data is stored in tabular format (i.e., a sample-feature matrix), the participating parties may have insufficient features for building DNN-based domain adaptation models. One promising way to address this lack of features is to enlarge the feature space by collaborating with a feature-rich party. For example, financial institutions with limited features (e.g., only basic user information) may have a large number of overlapping users with an e-commerce site that curates rich user information (e.g., product-browsing history and app-usage information), and thereby they can collaboratively build domain adaptation models on the enlarged feature space. This cross-silo FL setting, in which sample features are distributed across different parties, is categorized as vertical (feature-partitioned) federated learning (VFL) [31].

Although the enlarged feature space enables domain adaptation, mainstream adversarial DA methods [9, 28, 30] typically apply a single pair of feature extractor and domain discriminator over the whole feature space to learn feature representations that are not understandable by humans. In this work, we focus on financial applications in which model interpretability is an important concern. Thus, training models directly on top of raw feature representations cannot satisfy our requirements for model interpretability and regulatory compliance. In addition, a single pair of feature extractor and domain discriminator may not be effective at learning transferable feature representations. Therefore, we propose to group highly relevant features together and apply domain adaptation to each feature group, aiming to improve both interpretability and transferability.

Most FDA approaches apply differential privacy (DP) [6, 5, 7] to protect the privacy of participants' private data. However, DP suffers from precision loss, which is not acceptable in high-stakes decision-making applications (e.g., financial services and healthcare) where precision is crucial.

In this work, we propose PrADA, a privacy-preserving federated adversarial domain adaptation approach that enables participating parties to collaboratively conduct domain adaptation modeling in a privacy-preserving manner while taking model interpretability into account. The main contributions of this work are highlighted as follows:

  1. To our best knowledge, this work is the first study of the domain adaptation problem in the VFL setting for tabular data.

  2. This work proposes a fine-grained adversarial domain adaptation approach to reduce feature dimensionality, enhance model interpretability, and facilitate the learning of domain-invariant features.

  3. This work proposes a privacy-preserving VFL framework that allows participating parties to collaboratively conduct domain adaptation without exposing private local data under the semi-honest assumption.

2 Related Work

2.1 Federated Domain Adaptation

Traditional domain adaptation (DA) approaches assume that the data are centralized on one server, which limits their applicability to decentralized real-world scenarios. Federated domain adaptation aims to conduct domain adaptation modeling among independent parties of different domains without violating privacy. [24] applies a mixture-of-experts (MoE) strategy in which each participant combines a collaboratively learned general model and a domain-tuned private model to reconcile distribution differences among participants. [22] leverages federated adversarial domain alignment with a dynamic attention mechanism to enhance knowledge transfer. [16] applies the methods proposed by [24, 22] to functional magnetic resonance imaging (fMRI) analysis. [20] proposes agnostic federated learning, which optimizes the global model for any target distribution formed by a mixture of client distributions without overfitting the data of any particular client. One major limitation of these (both traditional and federated) DA approaches is that they are almost exclusively evaluated on computer vision datasets, and only a few of them (e.g., [20]) are evaluated on tabular data.

2.2 Deep Neural Network on Encrypted Data

Protecting privacy is a crucial element of federated learning, and homomorphic encryption (HE) is one of the major solutions to the privacy issue. Although HE allows computation to be performed on encrypted data, its expensive computational cost makes it impractical for training an entire DNN model. To address this issue, GELU-NET [32] adopts a client-server architecture in which the client encrypts the data while the server performs most of the computation on encrypted data. ACML [33] focuses on an enterprise scenario where data and labels are distributed between two independent parties. It adopts a SplitNN [29] approach in which each party is only responsible for updating its own portion of the whole DNN model. The novelty of ACML is that the costly encryption-decryption operations are performed only at the boundary between the two partial models, leaving the rest of the computation in plaintext.

2.3 Model Interpretability

A variety of research works have been proposed for interpreting deep neural networks [21]. Many of these methods focus on post-hoc interpretability, analyzing the relationships between the input and output of a trained model rather than elucidating the model's internal structure. Other methods [3, 13, 1] construct prototypes or general concepts that shed light on the decision-making process. For example, [3] proposes ProtoPNet, which learns a set of prototypes, each of which can be considered the latent representation of a small prototypical part of the training images. The label prediction is then calculated from a weighted combination of the similarity scores between parts of the image and the learned prototypes. [18] calculates SHAP values of every feature for every sample based on model predictions; complex models, such as ensemble methods or deep networks, can then be explained through these SHAP values.

3 Problem Definition

We consider the following cross-silo federated domain adaptation scenario involving three parties. Party A is from the target domain; it has a small number of labeled samples $(\mathbf{X}^{A}_{l},\mathbf{Y}^{A})\in\mathbb{R}^{n^{A}_{l}\times(m+1)}$ and some unlabeled samples $\mathbf{X}^{A}_{u}\in\mathbb{R}^{n^{A}_{u}\times m}$. Party B is from the source domain and has a large number of labeled samples $(\mathbf{X}^{B},\mathbf{Y}^{B})\in\mathbb{R}^{n^{B}\times(m+1)}$. $n^{A}=n^{A}_{l}+n^{A}_{u}$ and $n^{B}$ denote the sample sizes of parties A and B respectively, while $m$ denotes the feature dimension. These two parties share the same feature space and have similar tasks. We consider conducting domain adaptation (DA) from party B to party A, and we call these two parties active parties because they initiate the DA procedure. The two active parties have an insufficient number of features to support DA. Thus, we refer to a passive party C that is able to provide a sufficient amount of complementary features $\mathbf{X}^{B^{c}}\in\mathbb{R}^{n^{B}\times m^{c}}$ and $\mathbf{X}^{A^{c}}\in\mathbb{R}^{n^{A}\times m^{c}}$ for party B and party A, respectively. $\mathbf{X}^{B^{c}}$ and $\mathbf{X}^{A^{c}}$ share the same feature space with dimension $m^{c}$, and $n^{B}\gg n^{A}_{l}$ and $m^{c}\gg m$.

We align $\mathbf{X}^{A^{c}}$ with $(\mathbf{X}^{A}_{l},\mathbf{Y}^{A})$ and $\mathbf{X}^{A}_{u}$ respectively along the feature axis to form a virtual labeled dataset $\mathbf{D}^{t}_{l}=[\mathbf{X}^{A^{c}}_{l};\mathbf{X}^{A}_{l};\mathbf{Y}^{A}]$ and a virtual unlabeled dataset $\mathbf{D}^{t}_{u}=[\mathbf{X}^{A^{c}}_{u};\mathbf{X}^{A}_{u}]$ of the target domain. Likewise, we form a virtual dataset $\mathbf{D}^{s}=[\mathbf{X}^{B^{c}};\mathbf{X}^{B};\mathbf{Y}^{B}]$ of the source domain. The alignment can be performed by leveraging privacy-preserving entity matching approaches [11]. Figure 1 shows the federated view of the tabular datasets $\mathbf{D}^{s}$ and $\mathbf{D}^{t}$ among the three parties.
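
For illustration, a minimal, non-private stand-in for this alignment step is sketched below: it simply joins the two parties' tables on a shared identifier (the column name uid is hypothetical). In PrADA the same alignment would be obtained with privacy-preserving entity matching [11], so no raw identifiers are exchanged.

```python
import pandas as pd

def build_virtual_dataset(df_passive: pd.DataFrame, df_active: pd.DataFrame) -> pd.DataFrame:
    """Concatenate party C's features with an active party's features/labels
    along the feature axis for the samples the two parties have in common.
    'uid' is a hypothetical shared identifier; a real deployment would use
    privacy-preserving entity matching instead of a plaintext join."""
    return df_passive.merge(df_active, on="uid", how="inner")

# Example: D^s = [X^{B^c}; X^B; Y^B]
# d_source = build_virtual_dataset(df_party_c, df_party_b)
```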

Figure 1: View of the virtual tabular data of the cross-silo federated domain adaptation. Source party B has a large number of labeled samples $(\mathbf{X}^{B},\mathbf{Y}^{B})$, while target party A has a small number of labeled samples $(\mathbf{X}^{A}_{l},\mathbf{Y}^{A})$ and some unlabeled samples $\mathbf{X}^{A}_{u}$. Party C provides complementary features $\mathbf{X}^{A^{c}}$ for party A and $\mathbf{X}^{B^{c}}$ for party B. Thus, we form the virtual dataset $\mathbf{D}^{s}=[\mathbf{X}^{B^{c}};\mathbf{X}^{B};\mathbf{Y}^{B}]$ of the source domain, and the virtual datasets $\mathbf{D}^{t}_{l}=[\mathbf{X}^{A^{c}}_{l};\mathbf{X}^{A}_{l};\mathbf{Y}^{A}]$ and $\mathbf{D}^{t}_{u}=[\mathbf{X}^{A^{c}}_{u};\mathbf{X}^{A}_{u}]$ of the target domain.

Under this setting, our PrADA approach proceeds along two directions: (1) extending the feature space of active parties A and B through vertical federated learning with the feature-rich passive party C; (2) performing domain adaptation from party B of the sample-rich source domain to party A of the sample-scarce target domain based on the extended yet distributed feature space. Our ultimate goal is to improve the performance of the target model of party A.

Because $\mathbf{D}^{s}$ and $\mathbf{D}^{t}$ are composed of data from two independent parties, this domain adaptation is performed in a federated learning manner with a privacy-preserving protocol. We assume that all three parties are honest-but-curious, meaning they follow the federated learning protocol but attempt to deduce as much as possible from the information received from other parties.

4 Architecture Overview

Figure 2: The pre-training stage of PrADA. $\theta_{f_{i}},\theta_{d_{i}},\theta_{R^{B}}$ denote the model parameters of the $i$-th feature extractor, the $i$-th discriminator, and the label predictor $R^{B}$, respectively. Party C locally trains $g$ pairs $\{(\theta_{f_{i}},\theta_{d_{i}})\}_{i=1}^{g}$, each corresponding to a feature group, by optimizing (5). Party B and party C collaboratively train $\{\theta_{f_{i}}\}_{i=1}^{g}$ and $\theta_{R^{B}}$ by optimizing (6) using $\boldsymbol{\mu}^{B^{c}}$, $\mathbf{x}^{B}$, and $\mathbf{y}^{B}$.

PrADA involves two stages: pre-training and fine-tuning. The pre-training stage is performed between source party B and party C and aims to pre-train the feature extractors maintained by party C, while fine-tuning is performed between target party A and party C and aims to train the target label predictor of party A based on the pre-trained feature extractors. Figure 2 illustrates the workflow of the pre-training stage of the federated adversarial domain adaptation. Since the sole goal of fine-tuning is to train the target label predictor of party A, fine-tuning follows a workflow similar to pre-training, except that no domain adaptation is involved.

As illustrated in Figure 2, party B owns the label predictor $R^{B}$, while party C owns the feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$, their corresponding domain discriminators $\mathscr{D}=\{D_{i}\}_{i=1}^{g}$, and the aggregators $\mathscr{G}=\{G_{i}\}_{i=1}^{g}$. The pre-training stage of federated adversarial domain adaptation mainly involves three steps.

1

Feature grouping. Party C leverages domain expertise to group its raw features into $k$ feature groups, each comprising tightly relevant features. In addition, party C forms $z$ interactions between pairwise feature groups. Thus, this step yields $g=k+z$ feature groups in total ($k$ normal feature groups and $z$ interactive feature groups).

2

Adversarial domain adaptation. Party C leverages adversarial domain adaptation to train the feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$ to learn domain-invariant feature representations based upon the $g$ feature groups.

3

Vertical federated learning. Party B and party C collaboratively perform vertical federated learning to train the task-specific label predictor $R^{B}$ and the feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$ to learn domain-specific feature representations.

We discuss step 1 in Section 5 and elaborate on steps 2 and 3 in Section 6. In Section 7, we explain how our privacy-preserving vertical federated learning framework protects data privacy throughout the whole workflow.

5 Feature Grouping

The reasons that PrADA leverages feature grouping are twofold: (1) to improve the transferability of the feature extractors; (2) to improve the interpretability of the label predictors.

We propose that, with the help of domain expertise, party C creates $k$ feature groups out of its original feature space such that features in the same group are more relevant to each other than to features belonging to other groups. Based on this grouping, party C obtains $k$ groups of relevant features $\{\mathbf{x}^{p^{c}}_{(i)}\}_{i=1}^{k}$ for each sample $\mathbf{x}^{p^{c}}\in\mathbb{R}^{1\times m^{c}}$ drawn from $\mathbf{X}^{p^{c}},p\in\{A,B\}$. To explore interactive features, party C forms an interaction between each pair of the $k$ feature groups by concatenating the two feature groups, giving $z=C_{2}^{k}$ interactive feature groups. As a result, party C creates $g=k+z$ feature groups in total. Naturally, party C assigns each feature group a feature extractor along with a domain discriminator to learn domain-invariant feature representations. We hypothesize that this fine-grained domain adaptation between the two domains' feature groups, each containing tightly relevant features, helps improve the transferability of the domain-invariant feature representations.
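
To make the grouping step concrete, the sketch below (with hypothetical group names and column indices) shows how the $k$ normal groups and the $z=C_{2}^{k}$ interactive groups could be formed; in PrADA the actual grouping is chosen by domain experts.

```python
from itertools import combinations

import torch

# Hypothetical feature groups: each maps a group name to column indices of x^{p^c}.
normal_groups = {
    "employment": [0, 1, 2, 3],
    "demographics": [4, 5, 6],
    "household": [7, 8, 9, 10],
    "migration": [11, 12],
}

# Interactive groups: concatenate the columns of every pair of normal groups,
# giving z = C(k, 2) additional groups, so g = k + z groups in total.
interactive_groups = {
    f"{a}-{b}": normal_groups[a] + normal_groups[b]
    for a, b in combinations(normal_groups, 2)
}
all_groups = {**normal_groups, **interactive_groups}

def split_into_groups(x: torch.Tensor) -> dict:
    """Slice a batch of party C's raw features (batch, m^c) into g group tensors."""
    return {name: x[:, cols] for name, cols in all_groups.items()}
```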

We adopt a logistic regression (LR) model as the label predictor because it is a widely used interpretable model in financial applications. LR treats each feature and its associated weight as a fundamental interpretable unit. Therefore, instead of directly passing the dense feature representations output by the feature extractors to the LR model, party C leverages a set of aggregators $\{G_{i}\}_{i=1}^{g}$ that compress the output of each feature extractor into a scalar representing a high-order feature, and then feeds these high-order features into the LR model. As a result, the LR model takes as input a manageable number of high-order features (from party C), which are more explainable than the concatenation of multiple dense feature representations. We formalize the procedure by which party C generates the high-order feature vector $\boldsymbol{\mu}^{p^{c}}$ as follows:

$$\boldsymbol{\mu}^{p^{c}}=[G_{1}(\mathbf{f}^{p^{c}}_{(1)});\dots;G_{k}(\mathbf{f}^{p^{c}}_{(k)});\dots;G_{g}(\mathbf{f}^{p^{c}}_{(g)})] \qquad (1)$$

where $\mathbf{f}_{(i)}^{p^{c}}$ denotes the feature representation learned by the feature extractor $F_{i}(\mathbf{x}^{p^{c}}_{(i)})$, and $G_{i}(\mathbf{f}^{p^{c}}_{(i)})$ returns a scalar representing the high-order feature for feature group $\mathbf{x}_{(i)}^{p^{c}}$.
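
A minimal PyTorch sketch of Eq. (1) is given below: each group has its own feature extractor and a scalar-output aggregator, and party C stacks the $g$ scalars into $\boldsymbol{\mu}^{p^{c}}$. The layer sizes and module names are illustrative only.

```python
import torch
import torch.nn as nn

class GroupExtractor(nn.Module):
    """Per-group feature extractor F_i: raw group features -> dense representation f_i."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, out_dim), nn.LeakyReLU(),
        )

    def forward(self, x):
        return self.net(x)

class GroupAggregator(nn.Module):
    """Aggregator G_i: compresses f_i into a single high-order scalar feature."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, 1)

    def forward(self, f):
        return self.proj(f)  # shape (batch, 1)

def high_order_features(group_inputs, extractors, aggregators):
    """Implements Eq. (1): mu = [G_1(F_1(x_1)); ...; G_g(F_g(x_g))]."""
    mus = [agg(ext(x)) for x, ext, agg in zip(group_inputs, extractors, aggregators)]
    return torch.cat(mus, dim=1)  # shape (batch, g)
```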

To perform federated adversarial domain adaptation, party C feeds $\{\mathbf{f}_{(i)}^{p^{c}}\}_{i=1}^{g}$ into the corresponding domain discriminators to optimize the domain discrimination losses, and passes the high-order feature vector $\boldsymbol{\mu}^{p^{c}}$ to the active party $p$ (together with $p$'s own raw features) to optimize the label prediction loss, as described in Section 6.

6 Federated Adversarial Domain Adaptation

Federated adversarial domain adaptation in PrADA involves two stages: pre-training and fine-tuning. The pre-training stage is performed collaboratively between source party B and party C and aims to train feature extractors that learn both domain-invariant and label-discriminative features. The fine-tuning stage is performed collaboratively between target party A and party C and aims to train the target label predictor possessed by party A, leveraging the pre-trained feature extractors.

6.1 Pre-training Stage

The essential idea of adversarial domain adaptation is to train feature extractors that learn features that are both discriminative for the task and invariant to the change of domains. Thus, we have two optimization goals. (1) To obtain domain-invariant features, we perform adversarial domain adaptation, optimizing the feature extractors to maximize the domain classification loss while simultaneously optimizing the domain discriminators to minimize the domain classification loss. (2) To obtain task-specific discriminative feature representations, we perform vertical federated learning (VFL) to optimize the feature extractors and label predictor to minimize the label prediction loss.

In our federated learning setting, party C leverages $g$ feature extractors $\mathscr{F}=\{F_{i}\}_{i=1}^{g}$ and their corresponding $g$ domain discriminators $\mathscr{D}=\{D_{i}\}_{i=1}^{g}$ to learn domain-invariant feature representations from the $g$ feature groups. More specifically, the $i$-th feature extractor $F_{i}$ learns a feature representation from the $i$-th feature group, and the $i$-th domain discriminator $D_{i}$ maps this feature representation to a domain label $d\in\{0,1\}$. The overall domain classification loss is the sum of the domain classification losses of all domain discriminators in $\mathscr{D}$:

$$L_{adv}(\mathscr{F},\mathscr{D})=-\mathbb{E}_{\mathbf{x}^{A^{c}}\sim\mathbf{X}^{A^{c}}}\sum_{i=1}^{g}\log[D_{i}(F_{i}(\mathbf{x}_{(i)}^{A^{c}}))]-\mathbb{E}_{\mathbf{x}^{B^{c}}\sim\mathbf{X}^{B^{c}}}\sum_{i=1}^{g}\log[1-D_{i}(F_{i}(\mathbf{x}_{(i)}^{B^{c}}))] \qquad (2)$$

To make the feature extractors produce task-specific discriminative features, we optimize the label prediction loss to train both the label predictor and the feature extractors to classify the source samples correctly. We define the label prediction loss as:

$$L_{ce}(\mathscr{F},R^{B})=\mathbb{E}_{(\mathbf{x}^{B^{c}},\mathbf{x}^{B},\mathbf{y}^{B})\sim\mathbf{D}^{s}}[\ell_{ce}(R^{B}([\boldsymbol{\mu}^{B^{c}};\mathbf{x}^{B}]),\mathbf{y}^{B})] \qquad (3)$$

where $R^{B}$ is the label predictor curated at party B, $\boldsymbol{\mu}^{B^{c}}$ is the high-order feature vector passed from party C, and $\mathbf{x}^{B}$ is the feature vector possessed by party B.

The pre-training stage optimizes the two losses presented in (2) and (3). The complete loss function for the pre-training stage is:

$$L(\mathscr{F},\mathscr{D},R^{B})=L_{ce}(\mathscr{F},R^{B})-\lambda L_{adv}(\mathscr{F},\mathscr{D}) \qquad (4)$$

where $\lambda$ is a hyperparameter that controls the trade-off between the two losses that shape the feature representations during training.
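
One common way to realize this min-max objective is a gradient reversal layer, which flips the sign of the adversarial gradient flowing into the feature extractors so that the discriminators minimize the domain loss while the extractors maximize it. The sketch below illustrates this idea for the per-group losses in (2); it is an illustrative implementation choice, not necessarily the exact optimizer setup used in PrADA.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, -lambda * grad in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_domain_loss(extractors, discriminators, groups_src, groups_tgt, lam):
    """Sum of per-group domain losses (Eq. 2), with gradient reversal applied to the
    extractor outputs. Assumes each discriminator D_i ends with a sigmoid.
    Source (party B) samples get domain label 0, target (party A) samples label 1."""
    loss = 0.0
    for F_i, D_i, x_s, x_t in zip(extractors, discriminators, groups_src, groups_tgt):
        d_s = D_i(GradReverse.apply(F_i(x_s), lam))
        d_t = D_i(GradReverse.apply(F_i(x_t), lam))
        loss = loss + F.binary_cross_entropy(d_s, torch.zeros_like(d_s)) \
                    + F.binary_cross_entropy(d_t, torch.ones_like(d_t))
    return loss
```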

In our federated setting, $L_{adv}(\mathscr{F},\mathscr{D})$ is optimized locally at party C since it only involves party C's data, while $L_{ce}(\mathscr{F},R^{B})$ is optimized collaboratively by party B and party C in a federated manner since it involves data from both parties. To this end, we train the parameters $\{\theta_{f_{i}}\}_{i=1}^{g}$ of the feature extractors $\mathscr{F}$, $\{\theta_{d_{i}}\}_{i=1}^{g}$ of the domain discriminators $\mathscr{D}$, and $\theta_{R^{B}}$ of the label predictor $R^{B}$ by solving the following two optimization problems:

$$\mathop{\mathrm{argmin}}\limits_{\{\theta_{f_{i}}\}_{i=1}^{g}}\mathop{\mathrm{argmax}}\limits_{\{\theta_{d_{i}}\}_{i=1}^{g}}\;(-\lambda L_{adv}(\mathscr{F},\mathscr{D})) \qquad (5)$$
$$\mathop{\mathrm{argmin}}\limits_{\{\theta_{f_{i}}\}_{i=1}^{g},\,\theta_{R^{B}}}L_{ce}(\mathscr{F},R^{B}) \qquad (6)$$

$\{\theta_{d_{i}}\}_{i=1}^{g}$ are trained by minimizing the domain classification loss, $\theta_{R^{B}}$ is trained by minimizing the label prediction loss, and $\{\theta_{f_{i}}\}_{i=1}^{g}$ are trained by minimizing the label prediction loss while simultaneously maximizing the domain classification loss.

Figure 2 illustrates the overall workflow of the pre-training stage, and Algorithm 1 describes the procedure for optimizing (5) and (6). We assume that the entity alignment procedure has been run and the indices of $\mathbf{D}^{s}$ have been shuffled and synchronized between party B and party C before training. In each inner iteration, party C and party B fetch the same mini-batch of aligned samples from $\mathbf{D}^{s}$, but each holds its own portion of the private data: party C holds $\mathbf{x}^{B^{c}}$ while party B holds $\mathbf{x}^{B}$ and $\mathbf{y}^{B}$. In addition, party C samples a mini-batch $\mathbf{x}^{A^{c}}$ from its target data $\mathbf{X}^{A^{c}}$. Based on $\mathbf{x}^{B^{c}}$ and $\mathbf{x}^{A^{c}}$, party C optimizes (5) locally. Based on $\boldsymbol{\mu}^{B^{c}}$, $\mathbf{x}^{B}$ and $\mathbf{y}^{B}$, party B and party C collaboratively optimize (6) through Algorithm 3.

Algorithm 1 Federated Pre-training
1: Initialization: feature extractors $\mathscr{F}$, domain discriminators $\mathscr{D}$, batch indices $\mathcal{I}$
2: Input: $\mathbf{D}^{s}=[\mathbf{X}^{B^{c}};\mathbf{X}^{B};\mathbf{Y}^{B}]$, $\mathbf{X}^{A^{c}}$
3: for $e=1,2,\dots,E$ do
4:   for $i\in\mathcal{I}$ do
5:     Party C do:
6:       $\mathbf{x}^{A^{c}} \leftarrow$ sample a mini-batch from $\mathbf{X}^{A^{c}}$;
7:       $\mathbf{x}^{B^{c}} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{B^{c}}$;
8:       update models in $\mathscr{F},\mathscr{D}$ by optimizing (5) using $\mathbf{x}^{A^{c}}$ and $\mathbf{x}^{B^{c}}$;
9:       compute $\boldsymbol{\mu}^{B^{c}}$ by (1) using $\mathbf{x}^{B^{c}}$;
10:      encrypt $\boldsymbol{\mu}^{B^{c}}$ and send $[[\boldsymbol{\mu}^{B^{c}}]]$ to party B;
11:    Party B do:
12:      $\mathbf{x}^{B} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{B}$;
13:      $\mathbf{y}^{B} \leftarrow$ select the $i$-th mini-batch from $\mathbf{Y}^{B}$;
14:    Party B and Party C do:
15:      optimize (6) using Algorithm 3 with $[[\boldsymbol{\mu}^{B^{c}}]],\mathbf{x}^{B},\mathbf{y}^{B}$;
16:  end for
17: end for

6.2 Fine-tuning Stage

The fine-tuning stage aims to train the label predictor $R^{A}$ possessed by target party A using the target labeled data $\mathbf{D}^{t}_{l}$. Prior to fine-tuning, party C initializes its feature extractors with the pre-trained parameters. Note that since party A and party B are two independent parties, the trained label predictor $R^{B}$ of source party B cannot be used by target party A to initialize $R^{A}$. Thus, $R^{A}$ has to be trained from scratch.

In each iteration, party C applies (1) to compute the high-order feature vector $\boldsymbol{\mu}^{A^{c}}$ and then sends $\boldsymbol{\mu}^{A^{c}}$ to party A for computing the label prediction loss:

$$L_{ce}(\mathscr{F},R^{A})=\mathbb{E}_{(\mathbf{x}^{A^{c}}_{l},\mathbf{x}^{A}_{l},\mathbf{y}^{A})\sim\mathbf{D}^{t}_{l}}[\ell_{ce}(R^{A}([\boldsymbol{\mu}^{A^{c}};\mathbf{x}^{A}_{l}]),\mathbf{y}^{A})] \qquad (7)$$

Algorithm 2 describes the fine-tuning procedure. It is quite similar to Algorithm 1, except that it does not require party C to optimize (5).

Algorithm 2 Federated Fine-tuning
1: Initialization: feature extractors $\mathscr{F}$, batch indices $\mathcal{I}$
2: Input: $\mathbf{D}^{t}_{l}=[\mathbf{X}^{A^{c}}_{l};\mathbf{X}^{A}_{l};\mathbf{Y}^{A}]$
3: for $e=1,2,\dots,K$ do
4:   for $i\in\mathcal{I}$ do
5:     Party C do:
6:       $\mathbf{x}^{A^{c}} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{A^{c}}_{l}$;
7:       compute $\boldsymbol{\mu}^{A^{c}}$ by (1) using $\mathbf{x}^{A^{c}}$;
8:       encrypt $\boldsymbol{\mu}^{A^{c}}$ and send $[[\boldsymbol{\mu}^{A^{c}}]]$ to party A;
9:     Party A do:
10:      $\mathbf{x}^{A}_{l} \leftarrow$ select the $i$-th mini-batch from $\mathbf{X}^{A}_{l}$;
11:      $\mathbf{y}^{A} \leftarrow$ select the $i$-th mini-batch from $\mathbf{Y}^{A}$;
12:    Party A and Party C do:
13:      minimize (7) using Algorithm 3 with $[[\boldsymbol{\mu}^{A^{c}}]],\mathbf{x}^{A}_{l},\mathbf{y}^{A}$;
14:  end for
15: end for
Algorithm 3 Privacy-preserving Federated Training
1: Input: $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p},\mathbf{y}^{p}$, where $p\in\{A,B\}$
2: run Algorithm 4 with $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p},\mathbf{y}^{p}$;
3: run Algorithm 5 with $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p}$;

7 Privacy-preserving Vertical Federated Learning Framework

As shown in (3) and (7), minimizing the label prediction loss for training the label predictor involves data from an active party (either party A or party B) and the passive party C. Therefore, the label predictor should be trained in a privacy-preserving manner. In this section, we elaborate on our proposed privacy-preserving vertical federated learning framework (PP-VFL) of PrADA, which enables two independent parties to collaboratively train the label predictor without exposing their private data. First, we define the label predictor model, which is LR, as follows:

$$R^{p}([\boldsymbol{\mu}^{C};\mathbf{x}^{p}])=\sigma([\boldsymbol{\mu}^{C};\mathbf{x}^{p}]\mathbf{W}+b) \qquad (8)$$

where $\sigma$ is the sigmoid function, $p\in\{A,B\}$ denotes an active party, $\mathbf{W}\in\mathbb{R}^{m+g}$ is the weight vector of model $R^{p}$, and $b\in\mathbb{R}$ is the bias. In this section, we denote by $\boldsymbol{\mu}^{C}$ the high-order feature vector from party C and by $\mathbf{x}^{p}$ the raw features from the active party $p\in\{A,B\}$. We further decompose the input of $\sigma$ as follows:

$$z=\boldsymbol{\mu}^{C}\mathbf{W}^{C}+\mathbf{x}^{p}\mathbf{W}^{p}+b^{p} \qquad (9)$$

where $\mathbf{W}^{C}\in\mathbb{R}^{g}$ corresponds to the input $\boldsymbol{\mu}^{C}$ from party C, while $\mathbf{W}^{p}\in\mathbb{R}^{m}$ corresponds to the input $\mathbf{x}^{p}$ from party $p$. Both $\mathbf{W}^{p}$ and $\mathbf{W}^{C}$ are maintained by party $p$, but the real value of $\mathbf{W}^{C}$ is concealed from both party $p$ and party C, as elaborated in Sections 7.1 and 7.2.

We extend the PHE-based secure protocol of [33], designed for the setting where one party has features and the other has only labels, to our setting where features are distributed across two parties. Our new secure protocol includes two stages: (1) privacy-preserving forward propagation (Algorithm 4) and (2) privacy-preserving backward propagation (Algorithm 5). We denote PHE encryption, addition, and multiplication by $[[\cdot]]$, $\oplus$, and $\otimes$, respectively. Note that in our setting, only party C can encrypt and decrypt the exchanged messages.
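
As a quick illustration of these primitives, the sketch below uses the python-paillier library (our choice for illustration; the paper does not prescribe a specific PHE implementation): ciphertexts can be added to plaintexts or other ciphertexts ($\oplus$) and multiplied by plaintext scalars ($\otimes$), which is exactly what Algorithms 4 and 5 rely on.

```python
from phe import paillier  # python-paillier: pip install phe

# Party C holds the key pair; only it can encrypt and decrypt.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

mu = [0.7, -1.2, 0.3]                          # party C's high-order features mu^C
enc_mu = [public_key.encrypt(v) for v in mu]   # [[mu^C]], sent to the active party

# Active party p: compute [[z~^C]] = [[mu^C]] (x) W~^C and mask it with noise eps^p.
w_tilde_c = [0.5, 0.1, -0.4]
eps_p = 0.123
enc_terms = [m * w for m, w in zip(enc_mu, w_tilde_c)]  # ciphertext-by-plaintext products
enc_z_tilde = enc_terms[0]
for t in enc_terms[1:]:
    enc_z_tilde = enc_z_tilde + t                       # homomorphic addition (+)
enc_masked = enc_z_tilde + eps_p                        # [[z~^C + eps^p]]

# Party C decrypts only the masked value; it never sees z~^C itself.
masked = private_key.decrypt(enc_masked)
expected = sum(m * w for m, w in zip(mu, w_tilde_c)) + eps_p
assert abs(masked - expected) < 1e-6
```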

7.1 Privacy-Preserving Forward Propagation

Algorithm 4 computes the label prediction loss in (3) without compromising the data privacy of the participating parties. To achieve this, party C encrypts $\boldsymbol{\mu}^{C}$ with PHE and sends the encrypted $[[\boldsymbol{\mu}^{C}]]$ to party $p$ to prevent privacy leakage. Upon receiving $[[\boldsymbol{\mu}^{C}]]$, party $p$ could compute the logit $z$ according to (9). However, directly applying (9) yields $[[z]]$, which is not compatible with the logistic function. The workaround is that party $p$ first computes $[[\tilde{z}^{C}]]$ (Algo 4, line 5) and sends it to party C with random noise $\epsilon^{p}$ added (Algo 4, line 8). Party C then decrypts $[[\tilde{z}^{C}+\epsilon^{p}]]$ and adds $\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}$, which cancels out the accumulated random noise $\varepsilon^{C}_{t}$ that was embedded in $\widetilde{\mathbf{W}}_{t}^{C}$ during the backpropagation of the previous iteration. For now, we assume $\widetilde{\mathbf{W}}_{t}^{C}=\mathbf{W}_{t}^{C}-\varepsilon^{C}_{t}$, which we prove in Section 7.2. Here, we show that the logit $z^{C}$ is calculated correctly (Algo 4, line 11):

$$\begin{split}z^{C}&=\tilde{z}^{C}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}\\ &=\boldsymbol{\mu}^{C}\widetilde{\mathbf{W}}_{t}^{C}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}\\ &=\boldsymbol{\mu}^{C}\mathbf{W}_{t}^{C}-\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}\\ &=\boldsymbol{\mu}^{C}\mathbf{W}_{t}^{C}\end{split} \qquad (10)$$

As a result, party C obtains $z^{C}+\epsilon^{p}$. The noise $\epsilon^{p}$ prevents party C from accessing the plaintext $z^{C}$ and thereby recovering $\mathbf{W}_{t}^{C}=z^{C}/\boldsymbol{\mu}^{C}$. Party C sends $z^{C}+\epsilon^{p}$ back to party $p$, which removes the noise $\epsilon^{p}$, computes $z=z^{p}+z^{C}$, and finally computes the loss $\ell_{ce}(\sigma(z),\mathbf{y}^{p})$.
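
A tiny numeric check of the bookkeeping behind (10): party $p$ works only with the noisy weights $\widetilde{\mathbf{W}}^{C}_{t}=\mathbf{W}^{C}_{t}-\varepsilon^{C}_{t}$, yet after party C adds back $\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}$, the recovered logit equals $\boldsymbol{\mu}^{C}\mathbf{W}^{C}_{t}$ exactly. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_c = np.array([0.7, -1.2, 0.3])      # party C's high-order features mu^C
w_c = np.array([0.5, 0.1, -0.4])       # true weights W^C_t (held in the clear by neither party)
eps_acc = rng.normal(size=3)           # accumulated noise eps^C_t, held by party C
w_tilde = w_c - eps_acc                # noisy weights W~^C_t, held by party p

z_tilde = mu_c @ w_tilde               # party p's masked partial logit z~^C
z_c = z_tilde + mu_c @ eps_acc         # party C adds mu^C * eps^C_t (Algo 4, line 11)

assert np.isclose(z_c, mu_c @ w_c)     # Eq. (10): the true logit mu^C W^C_t is recovered
```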

Algorithm 4 Privacy-preserving Forward Propagation
1: Initialization: label predictor model $\widetilde{\mathbf{W}}_{0}^{C}$ and $\mathbf{W}_{0}^{p}$, accumulated noise $\varepsilon^{C}_{0}$
2: Input: $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p},\mathbf{y}^{p}$, $p\in\{A,B\}$
3: Party p:
4:   compute logit:
5:     $[[\tilde{z}^{C}]] \leftarrow [[\boldsymbol{\mu}^{C}]]\otimes\widetilde{\mathbf{W}}_{t}^{C}$;
6:     $z^{p} \leftarrow \mathbf{x}^{p}\mathbf{W}_{t}^{p}+b^{p}$;
7:   add noise $[[\tilde{z}^{C}+\epsilon^{p}]] \leftarrow [[\tilde{z}^{C}]]\oplus\epsilon^{p}$;
8:   send $[[\tilde{z}^{C}+\epsilon^{p}]]$ to party C;
9: Party C:
10:  $\tilde{z}^{C}+\epsilon^{p} \leftarrow$ decrypt $[[\tilde{z}^{C}+\epsilon^{p}]]$;
11:  $z^{C}+\epsilon^{p} \leftarrow \tilde{z}^{C}+\boldsymbol{\mu}^{C}\varepsilon^{C}_{t}+\epsilon^{p}$;
12:  send $z^{C}+\epsilon^{p}$ to party p;
13: Party p:
14:  remove noise $z^{C} \leftarrow (z^{C}+\epsilon^{p})-\epsilon^{p}$;
15:  $z \leftarrow z^{p}+z^{C}$;
16:  compute loss $\ell_{ce}(\sigma(z),\mathbf{y}^{p})$;
Algorithm 5 Privacy-preserving Backward Propagation
1: Initialization: learning rate $\eta$
2: Input: $[[\boldsymbol{\mu}^{C}]],\mathbf{x}^{p}$, $p\in\{A,B\}$
3: Party p:
4:   $\delta^{l} \leftarrow \nabla_{\sigma}\ell_{ce}$ w.r.t. the activation function $\sigma$;
5:   backpropagate gradients $\delta^{l}$:
6:     $[[\Delta\mathbf{W}^{C}_{t}]] \leftarrow [[\boldsymbol{\mu}^{C}]]\otimes\delta^{l}$;
7:     $\Delta\mathbf{W}^{p}_{t} \leftarrow \mathbf{x}^{p}\delta^{l}$;
8:     $\Delta b^{p}_{t} \leftarrow \delta^{l}$;
9:   add noise $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]] \leftarrow [[\Delta\mathbf{W}^{C}_{t}]]\oplus\epsilon^{p}$;
10:  send $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$ to party C;
11: Party C:
12:  $\Delta\mathbf{W}^{C}_{t}+\epsilon^{p} \leftarrow$ decrypt $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$;
13:  add noise $\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p} \leftarrow \Delta\mathbf{W}^{C}_{t}+\frac{\epsilon^{C}}{\eta}+\epsilon^{p}$;
14:  $\varepsilon^{C}_{t+1} \leftarrow \varepsilon^{C}_{t}+\epsilon^{C}$ and $[[\varepsilon^{C}_{t+1}]] \leftarrow$ encrypt $\varepsilon^{C}_{t+1}$;
15:  send $[[\varepsilon^{C}_{t+1}]]$ and $\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p}$ to party p;
16: Party p:
17:  remove noise $\Delta\widetilde{\mathbf{W}}^{C}_{t} \leftarrow (\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p})-\epsilon^{p}$;
18:  update weights and bias of the logistic regression model:
19:    $\widetilde{\mathbf{W}}^{C}_{t+1} \leftarrow \widetilde{\mathbf{W}}^{C}_{t}-\eta\Delta\widetilde{\mathbf{W}}^{C}_{t}$;
20:    $\mathbf{W}^{p}_{t+1} \leftarrow \mathbf{W}^{p}_{t}-\eta\Delta\mathbf{W}^{p}_{t}$;
21:    $b_{t+1}^{p} \leftarrow b_{t}^{p}-\eta\Delta b_{t}^{p}$;
22:  $[[\delta^{C}]] \leftarrow \delta^{l}\otimes(\widetilde{\mathbf{W}}^{C}_{t+1}\oplus[[\varepsilon^{C}_{t+1}]])$;
23:  send $[[\delta^{C}]]$ to party C;
24: Party C:
25:  $\delta^{C} \leftarrow$ decrypt $[[\delta^{C}]]$;
26:  update the feature aggregators in $\mathscr{G}$ and the feature extractors in $\mathscr{F}$ based on the gradient $\delta^{C}$ using SGD;

7.2 Privacy-Preserving Backward Propagation

During the privacy-preserving backward propagation described in Algorithm 5, the active party $p$ securely updates the logistic regression model $R^{p}$ and backpropagates gradients to party C. As shown in (9), the weights of $R^{p}$ are partitioned into $\mathbf{W}^{p}$ and $\mathbf{W}^{C}$. On the one hand, party $p$ can compute the gradients $\Delta\mathbf{W}^{p}_{t}$ and $\Delta b^{p}_{t}$ and update $\mathbf{W}^{p}_{t}$ and $b^{p}_{t}$ (Algo 5, lines 20-21) in plaintext, since party $p$ owns these parameters. On the other hand, party $p$ cannot directly update $[[\mathbf{W}^{C}_{t+1}]]\leftarrow\mathbf{W}^{C}_{t}-\eta[[\Delta\mathbf{W}^{C}_{t}]]$, since this leads to incompatibility with PHE when computing $[[\tilde{z}^{C}]]\leftarrow[[\boldsymbol{\mu}^{C}]]\otimes[[\mathbf{W}^{C}_{t+1}]]$ in the next iteration of forward propagation. To work around this issue, party $p$ could send the encrypted gradients $[[\Delta\mathbf{W}^{C}_{t}]]$ to party C and get the decrypted $\Delta\mathbf{W}^{C}_{t}$ back. However, this leads to privacy leakage for both parties: because $\Delta\mathbf{W}^{C}_{t}=\boldsymbol{\mu}^{C}\otimes\delta_{l}$, knowing $\Delta\mathbf{W}^{C}_{t}$ lets party $p$ infer the value of $\boldsymbol{\mu}^{C}$, while party C can infer the gradient $\delta_{l}$ during training. Therefore, to conceal the real value of $\Delta\mathbf{W}^{C}_{t}$ from both parties, the two parties mask $\Delta\mathbf{W}^{C}_{t}$ with their respective random noises. Specifically, party $p$ adds noise $\epsilon^{p}$ to $[[\Delta\mathbf{W}^{C}_{t}]]$ and sends $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$ to party C (Algo 5, lines 9-10). Party C in turn decrypts $[[\Delta\mathbf{W}^{C}_{t}+\epsilon^{p}]]$ and sends $\Delta\widetilde{\mathbf{W}}^{C}_{t}+\epsilon^{p}$ back to party $p$, where $\Delta\widetilde{\mathbf{W}}^{C}_{t}=\Delta\mathbf{W}^{C}_{t}+\frac{\epsilon^{C}}{\eta}$ (Algo 5, line 13), $\epsilon^{C}$ is the random noise generated by party C, and $\eta$ is the learning rate. Party $p$ then removes the noise $\epsilon^{p}$ and updates $\widetilde{\mathbf{W}}^{C}_{t+1}$ based on the gradient $\Delta\widetilde{\mathbf{W}}^{C}_{t}$ (Algo 5, line 19). Note that while the noise $\epsilon^{p}$ can be removed by party $p$, the noise $\epsilon^{C}$ added by party C accumulates in the weights $\widetilde{\mathbf{W}}^{C}_{t}$ through $\Delta\widetilde{\mathbf{W}}^{C}_{t}$ at each iteration. Intuitively, the real value of $\mathbf{W}^{C}_{t+1}$ can be seen as shared between party C and party $p$, a concept similar to secret sharing.

For party $p$ to correctly calculate the intermediate gradient $\delta^{C}$, party $p$ needs to cancel out the accumulated noise embedded in $\widetilde{\mathbf{W}}^{C}_{t+1}$. To this end, party C sends the encrypted accumulated noise $[[\varepsilon^{C}_{t+1}]]$ to party $p$, which then calculates the gradient $[[\delta^{C}]]$ of the loss $\ell_{ce}$ with respect to $\boldsymbol{\mu}^{C}$ (Algo 5, line 22) using $[[\varepsilon^{C}_{t+1}]]$. To show that the value of the gradient $\delta^{C}$ is calculated correctly, we prove that $\widetilde{\mathbf{W}}^{C}_{t+1}=\mathbf{W}^{C}_{t+1}-\varepsilon^{C}_{t+1}$ by mathematical induction, assuming $\widetilde{\mathbf{W}}^{C}_{t}=\mathbf{W}^{C}_{t}-\varepsilon^{C}_{t}$ and initializing $\varepsilon^{C}_{0}=0$:

$$\begin{split}\widetilde{\mathbf{W}}^{C}_{t+1}&=\widetilde{\mathbf{W}}^{C}_{t}-\eta\Delta\widetilde{\mathbf{W}}^{C}_{t}\\ &=\widetilde{\mathbf{W}}^{C}_{t}-\eta\left(\Delta\mathbf{W}^{C}_{t}+\frac{\epsilon^{C}}{\eta}\right)\\ &=(\mathbf{W}^{C}_{t}-\eta\Delta\mathbf{W}^{C}_{t})-(\varepsilon^{C}_{t}+\epsilon^{C})\\ &=\mathbf{W}^{C}_{t+1}-\varepsilon^{C}_{t+1}\end{split}$$

Finally, party $p$ sends $[[\delta^{C}]]$ back to party C, which decrypts $[[\delta^{C}]]$ and backpropagates $\delta^{C}$ locally to optimize its local models.
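
The following sketch replays one iteration of this noise bookkeeping with made-up numbers and checks the induction step $\widetilde{\mathbf{W}}^{C}_{t+1}=\mathbf{W}^{C}_{t+1}-\varepsilon^{C}_{t+1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
eta = 0.05                                  # learning rate

w_c = rng.normal(size=4)                    # true W^C_t (conceptually secret-shared)
eps_acc = rng.normal(size=4)                # accumulated noise eps^C_t at party C
w_tilde = w_c - eps_acc                     # noisy W~^C_t held by party p

grad_w_c = rng.normal(size=4)               # true gradient Delta W^C_t
eps_c = rng.normal(size=4)                  # fresh noise eps^C from party C
grad_tilde = grad_w_c + eps_c / eta         # Delta W~^C_t returned to party p (Algo 5, line 13)

w_tilde_next = w_tilde - eta * grad_tilde   # party p's update (Algo 5, line 19)
w_c_next = w_c - eta * grad_w_c             # the true (never materialized) update
eps_acc_next = eps_acc + eps_c              # accumulated noise kept by party C (Algo 5, line 14)

assert np.allclose(w_tilde_next, w_c_next - eps_acc_next)  # induction step holds
```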

TABLE I: Comparison between models in different settings on the Census Income dataset.
Positive labels | 40 | 80 | 160
Setting | Model | AUC (%) | KS (%) | AUC (%) | KS (%) | AUC (%) | KS (%)
A-Local | LR | 65.57 | 31.48 | 72.33 | 35.32 | 72.71 | 35.45
A-Local | XGBoost | 67.59 | 32.27 | 74.07 | 37.62 | 77.60 | 41.68
A-VFL | SecureLR | 69.67 | 32.21 | 72.72 | 36.87 | 75.07 | 39.48
A-VFL | SecureBoost | 71.97 | 34.73 | 77.02 | 41.61 | 80.08 | 46.80
A-VFL | PrADA w/o DA&FG&IR | 73.72±0.41 | 35.36±0.68 | 77.48±0.47 | 42.32±0.30 | 79.13±0.68 | 44.90±0.57
AB-VFL | SecureLR | 72.88 | 34.73 | 73.80 | 35.83 | 74.63 | 38.48
AB-VFL | SecureBoost | 78.06 | 42.18 | 79.56 | 45.56 | 80.82 | 47.87
AB-VFL | PrADA w/o DA&FG&IR | 77.65±0.38 | 43.09±0.64 | 78.97±0.49 | 45.91±0.51 | 80.56±0.31 | 47.62±0.48
B→A | PrADA w/o FG&IR | 78.98±0.13 | 43.42±0.50 | 80.17±0.28 | 46.86±0.93 | 81.10±0.57 | 48.14±0.75
B→A | PrADA w/o IR | 78.92±0.16 | 44.06±0.72 | 80.49±0.37 | 47.36±0.55 | 81.36±0.15 | 48.73±0.56
B→A | PrADA | 79.17±0.40 | 44.92±0.68 | 81.08±0.30 | 48.06±0.72 | 81.46±0.06 | 49.27±0.42

7.3 Discussions on Privacy Protection

In this section, we discuss the privacy-preserving capability of our PP-VFL, its possible privacy leakage, and the associated trade-offs.

Proposition 1

The active party $p$ cannot reveal the true value of the feature vector $\boldsymbol{\mu}^{C}$ passed from the passive party C during training and inference.

Proof. There are three ways party $p$ could attempt to recover the true value of the feature vector $\boldsymbol{\mu}^{C}$ during training. The first is to decrypt $[[\boldsymbol{\mu}^{C}]]$ directly; however, this is impossible without knowing the private key. The second is to derive $\boldsymbol{\mu}^{C}$ from $z^{C}/\widetilde{\mathbf{W}}_{t}^{C}$ according to (10); however, this requires party $p$ to remove the noise $\varepsilon^{C}_{t}$ from $\widetilde{\mathbf{W}}_{t}^{C}$. Suppose $\varepsilon^{C}_{t}$ is the random noise accumulated by party C at iteration $t$ and $\hat{\varepsilon}^{C}_{t}$ is a guess made by party $p$. The probability that $\varepsilon^{C}_{t}=\hat{\varepsilon}^{C}_{t}$ is $Pr(\varepsilon^{C}_{t}=\hat{\varepsilon}^{C}_{t})\leq(1-e^{-2/|\mathbb{Z}|})$ [26]. Because $|\mathbb{Z}|$ is typically a very large number, $Pr(\varepsilon^{C}_{t}=\hat{\varepsilon}^{C}_{t})$ is very close to zero. Third, $\boldsymbol{\mu}^{C}$ could be derived from $\Delta\mathbf{W}^{C}_{t}/\delta^{l}$ after the noise $\epsilon^{C}$ is removed from $\Delta\widetilde{\mathbf{W}}^{C}_{t}$; however, the probability $Pr(\epsilon^{C}=\hat{\epsilon}^{C})$ that a guess $\hat{\epsilon}^{C}$ made by party $p$ equals $\epsilon^{C}$ likewise approaches zero.

Proposition 2

The active party $p$ cannot infer the true value of the weights $\mathbf{W}^{C}$ during training and inference.

Proof. There are two ways party $p$ could infer $\mathbf{W}^{C}$. One is via $z^{C}/\boldsymbol{\mu}^{C}$ according to (10), and the other is by removing the accumulated noise $\varepsilon^{C}$ from $\widetilde{\mathbf{W}}^{C}$. According to Proposition 1, the true value of $\boldsymbol{\mu}^{C}$ is concealed from party $p$ during training and inference, and the probability $Pr(\varepsilon^{C}=\hat{\varepsilon}^{C})\leq(1-e^{-2/|\mathbb{Z}|})$ that party $p$ can generate a noise $\hat{\varepsilon}^{C}$ to cancel out $\varepsilon^{C}$ is close to zero. Therefore, party $p$ cannot infer the true value of $\mathbf{W}^{C}$ during training and inference.

The active party $p$ cannot infer the data of the passive party C during training because party $p$ has no access to $\boldsymbol{\mu}^{C}$, $\mathbf{W}^{C}$, or party C's local model. [12] proposes model inversion (MI), which enables the attacker (i.e., party $p$) to recover the private data of the victim (i.e., party C) during inference. To recover data of reasonably high quality, [12] makes the strong assumptions that the attacker knows the network structure of the victim and has access to training data that follows the same distribution as the victim's, so that the victim's local model can be approximated. However, these assumptions typically do not hold in scenarios like finance. For one thing, financial data are generally not publicly available because they are sensitive and their publication is regulated. For another, participating parties provide heterogeneous features in VFL and thus typically adopt different model structures. Besides, [12] demonstrates that a local model with a fully-connected layer on top can significantly degrade the quality of the recovered data; $\mathbf{W}^{C}$ automatically provides such a layer of protection.

Proposition 3

The passive party C cannot infer the true value of the weights $\mathbf{W}$ curated by the active party $p$ during training and inference.

Proof. The weights $\mathbf{W}$ are composed of $\mathbf{W}^{C}$ and $\mathbf{W}^{p}$. Party C receives no information on $\mathbf{W}^{p}$ of party $p$, and therefore can learn nothing about $\mathbf{W}^{p}$. There are two ways party C could infer $\mathbf{W}^{C}$. The first is via $(z^{C}+\epsilon^{p})/\boldsymbol{\mu}^{C}$ and the other is via $\delta^{C}/\delta^{l}$. The noise $\epsilon^{p}$ in the former prevents party C from revealing $\mathbf{W}^{C}$, because the probability $Pr(\epsilon^{p}=\hat{\epsilon}^{p})\leq(1-e^{-2/|\mathbb{Z}|})$ that party C can generate a noise $\hat{\epsilon}^{p}$ to cancel out $\epsilon^{p}$ is close to zero, while $\delta^{l}$ in the latter resides only at party $p$. Therefore, party C cannot reveal the true value of $\mathbf{W}^{C}$.

Recent research works show that an attacker (i.e., the passive party C) can leverage gradient inversion (GI) [35], model completion (MC) [8], and properties of the cut-layer gradient (PCG) [15] to recover the labels of the victim (i.e., the active party $p$). Our PP-VFL can prevent the GI attack because the attacker has access to neither the weights (i.e., $\mathbf{W}$) nor the gradient of the label predictor model [14, 10]. However, PP-VFL by itself cannot prevent the MC and PCG attacks, because the cut-layer gradient $\delta^{C}$ is passed to the attacker in plaintext and without any protection. The current form of PP-VFL trades a certain degree of increased label privacy leakage for enhanced model performance and training efficiency. In applications where the labels are important assets, PP-VFL can be equipped with other privacy protection mechanisms (e.g., MARVELL [15] for PCG and CoAE [17] for MC) to trade the protection of label privacy against a certain degree of degraded utility.

8 Experiments

8.1 Experimental datasets and settings

We evaluate our proposed PrADA on two datasets: one is the Census Income dataset, and the other is a real-world financial dataset called Loan Default. For each dataset, we run experiments under the following four settings:

  1. A-Local: Target party A only uses its local data to train models, without leveraging VFL or DA.

  2. A-VFL: Target party A uses the target-domain data $\mathbf{D}^{t}_{l}$ to train models via VFL with party C. This setting serves as the conventional VFL baseline that improves the model performance of party A with additional features from party C.

  3. AB-VFL: Assuming party A and party B belong to the same organization and privacy is not a concern, party A uses both $\mathbf{D}^{t}_{l}$ and $\mathbf{D}^{s}$ to train models via VFL with party C (with no DA). Models in this setting serve as strong baselines because they use all data together.

  4. B→A: We conduct the PrADA approach elaborated in Section 6 to perform federated adversarial domain adaptation from party B to party A.

In settings 2 and 3, we adopt SecureLR and SecureBoost implemented in FATE (https://github.com/FederatedAI/FATE), an industrial-grade federated learning framework, as comparison models. These two models are the VFL versions of the logistic regression model and the tree-boosting model respectively, and they use PHE to protect data privacy. To explore the effectiveness of the different components of PrADA, we evaluate three ablations:

  • PrADA w/o DA&FG&IR: without domain adaptation (DA), feature grouping (FG), and feature-group interaction (IR);

  • PrADA w/o FG&IR: applies domain adaptation, but without feature grouping and interaction;

  • PrADA w/o IR: applies domain adaptation based on feature grouping, but without interaction.

In this paper, we focus on the binary classification problem. Because imbalanced class labels are one of the major motivations for applying domain adaptation in real-world financial applications, we also investigate the effectiveness of our PrADA approach under different positive-label ratios. Specifically, we investigate scenarios in which the target training data has a positive-label ratio of {0.01, 0.02, 0.04}.

For SecureLR, we use the default hyperparameters, while for SecureBoost, we sweep over all combinations of max depth {2, 4, 6, 8} and number of trees {100, 200, 300, 400}, leaving the other hyperparameters at their defaults. For PrADA, we use a batch size of 128 for the Census Income data and 64 for the Loan Default data, and a learning rate of 0.0005 for pre-training and 0.0008 for fine-tuning on both datasets. Our PrADA is implemented with PyTorch. We repeat every experiment 5 times on each dataset, reporting the mean and standard deviation of the AUC and KS (Kolmogorov-Smirnov statistic) [2] of all trained models on the test data of target party A.
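
For reference, the two reported metrics could be computed as in the sketch below; here the KS statistic is taken as the maximum gap between the ROC true-positive and false-positive rates, a common definition in credit scoring (the paper cites [2] for KS).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_ks(y_true: np.ndarray, y_score: np.ndarray):
    """Return (AUC, KS) in percent for binary labels and predicted scores."""
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ks = np.max(tpr - fpr)  # KS = max separation between the two cumulative rates
    return 100.0 * auc, 100.0 * ks

# Example with dummy predictions:
# y = np.array([0, 0, 1, 1]); s = np.array([0.1, 0.4, 0.35, 0.8])
# print(auc_and_ks(y, s))
```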

[Figure 3 panels: (a) Non-adapted employment; (b) Non-adapted demographics; (c) Non-adapted household; (d) Non-adapted migration; (e) Adapted employment; (f) Adapted demographics; (g) Adapted household; (h) Adapted migration]
Figure 3: The effect of FG-based domain adaptation on the distribution of the learned feature representations (best viewed in color). (a)-(d) and (e)-(h) show the t-SNE embeddings of the feature representations learned by PrADA w/o DA&IR and PrADA w/o IR, respectively, on feature groups of the Census Income data. Red points correspond to source-domain samples, while blue points correspond to target-domain samples. In all feature groups, the adaptation brings the two distributions of learned feature representations much closer.

8.2 Experiments on Census Income

Census Income is a census dataset from the UCI Machine Learning Repository. We split it into an undergraduate source domain and a postgraduate target domain. The source domain has 80,000 labeled examples, while the target domain has 4,000 labeled samples and 9,000 unlabeled samples. Our goal is to help party A of the target domain predict whether a person's income exceeds 50,000 US dollars.

After data preprocessing, the Census Income dataset contains 36 features, 31 of which are categorical. We place the 5 numerical features on the active parties (i.e., A and B) and the 31 categorical features on the passive party C. We split the 31 features of party C into 4 feature groups (FG): employment (emp), demographics (demo), household (house), and migration (migr). Thus, we have $C^{4}_{2}$ (i.e., 6) interactive feature groups: emp-demo, emp-house, emp-migr, demo-house, demo-migr, and house-migr. We embed all categorical features into dense vectors. Table II shows the architecture of the feature extractor for each of the 10 feature groups, as well as the one (i.e., all_feat) used for all features when feature grouping is not applied.
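
For illustration, an extractor specified in Table II, e.g. FC(28→56)-FC(56→28)-FC(28→14) with Leaky ReLU for the emp group, could be instantiated in PyTorch as follows; the helper name is ours.

```python
import torch.nn as nn

def make_extractor(dims):
    """Build an MLP feature extractor from a list of layer widths,
    e.g. [28, 56, 28, 14] for the 'emp' group in Table II, with a
    Leaky ReLU after every fully-connected layer."""
    layers = []
    for in_dim, out_dim in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(in_dim, out_dim), nn.LeakyReLU()]
    return nn.Sequential(*layers)

emp_extractor = make_extractor([28, 56, 28, 14])    # FC(28->56)-FC(56->28)-FC(28->14)
demo_extractor = make_extractor([25, 50, 25, 12])   # FC(25->50)-FC(50->25)-FC(25->12)
```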

TABLE II: Architecture of feature extractors for Census Income dataset. All feature extractors only use fully-connected layers, and adopt Leaky ReLU as the activation function, which is omitted in the table for simplicity.
FG name feature extractor architecture
emp FC(28\rightarrow56)-FC(56\rightarrow28)-FC(28\rightarrow14)
demo FC(25\rightarrow50)-FC(50\rightarrow25)-FC(25\rightarrow12)
migr FC(56\rightarrow86)-FC(86\rightarrow56)-FC(56\rightarrow18)
house FC(27\rightarrow54)-FC(54\rightarrow27)-FC(27\rightarrow13)
emp-demo FC(53\rightarrow78)-FC(78\rightarrow53)-FC(53\rightarrow15)
emp-migr FC(84\rightarrow120)-FC(120\rightarrow84)-FC(84\rightarrow20)
emp-house FC(51\rightarrow81)-FC(81\rightarrow55)-FC(55\rightarrow15)
demo-migr FC(81\rightarrow120)-FC(120\rightarrow81)-FC(81\rightarrow20)
demo-house FC(52\rightarrow78)-FC(78\rightarrow52)-FC(52\rightarrow15)
migr-house FC(83\rightarrow120)-FC(120\rightarrow83)-FC(83\rightarrow20)
all_feat FC(136\rightarrow150)-FC(150\rightarrow60)-FC(60\rightarrow20)
TABLE III: Comparison between models in different settings on the Loan Default dataset.
Positive labels | 40 | 80 | 160
Setting | Model | AUC (%) | KS (%) | AUC (%) | KS (%) | AUC (%) | KS (%)
A-Local | LR | 57.17 | 12.65 | 56.51 | 13.16 | 57.77 | 15.14
A-Local | XGBoost | 56.66 | 11.49 | 57.90 | 14.89 | 58.91 | 17.47
A-VFL | SecureLR | 59.67 | 15.31 | 64.68 | 24.10 | 67.78 | 28.61
A-VFL | SecureBoost | 57.88 | 12.86 | 64.90 | 23.33 | 70.68 | 31.09
A-VFL | PrADA w/o DA&FG&IR | 63.26±0.88 | 21.14±1.26 | 67.49±0.78 | 28.06±1.12 | 68.66±0.94 | 29.59±1.39
AB-VFL | SecureLR | 72.72 | 35.81 | 73.04 | 35.90 | 74.22 | 36.87
AB-VFL | SecureBoost | 75.18 | 38.53 | 75.96 | 40.65 | 76.16 | 41.83
AB-VFL | PrADA w/o DA&FG&IR | 75.11±0.37 | 40.28±1.03 | 75.16±0.19 | 40.63±0.64 | 75.53±0.23 | 41.12±0.71
B→A | PrADA w/o FG&IR | 75.27±0.25 | 40.82±0.53 | 75.52±0.22 | 41.25±0.24 | 75.76±0.28 | 41.91±0.56
B→A | PrADA w/o IR | 75.63±0.11 | 41.42±0.74 | 75.84±0.09 | 42.04±0.61 | 76.43±0.08 | 42.61±0.17
B→A | PrADA | 75.75±0.12 | 41.69±0.36 | 75.99±0.05 | 42.48±0.21 | 76.58±0.18 | 43.48±0.62

The experimental results are shown in Table I. From these results, we observe the following: (1) SecureBoost and SecureLR in A-VFL outperform their counterparts in A-Local, demonstrating that leveraging additional features improves model performance. (2) The models in B→A significantly outperform the models in A-VFL across all positive-label settings. This is expected because a considerable amount of source data is involved in training. More specifically, when the number of positive labels is small (i.e., 40), the performance gain is the most significant: PrADA outperforms PrADA w/o DA&FG&IR in AUC by 5.45% and in KS by 9.56%, and outperforms SecureBoost in AUC by 7.2% and in KS by 10.19%. (3) PrADA w/o FG&IR in B→A outperforms PrADA w/o DA&FG&IR in AB-VFL in AUC by 1.02% and in KS by 0.60% on average, and outperforms SecureBoost in AUC by 0.60% and in KS by 0.94% on average, demonstrating the effectiveness of PrADA in bridging the divergence between the source and target domains. (4) In the B→A setting, PrADA w/o IR outperforms PrADA w/o FG&IR in AUC by 0.17% and in KS by 0.58% on average, demonstrating the effectiveness of FG-based domain adversarial training in improving the transferability of the feature extractors. In addition to boosting model performance, feature grouping also enhances the interpretability of the target model $R^{A}$, which we discuss in Section 8.4. (5) In the B→A setting, PrADA outperforms PrADA w/o IR in AUC by 0.31% and in KS by 0.70% on average, demonstrating that the interactions between feature groups help improve model performance.

To look more closely at the effect of FG-based domain adaptation on the learned feature representations, we visualize their t-SNE embeddings [4] in Figure 3. Figures 3(a)-(d) and 3(e)-(h) show the t-SNE embeddings of the feature representations learned by PrADA w/o DA&IR and PrADA w/o IR, respectively, on feature groups of the Census Income data. We observe that the adaptation in our method brings the two distributions of learned feature representations much closer in all feature groups.

8.3 Experiments on Loan Default

Loan Default is a loan default risk dataset for the online lending industry published by FinVolution Group. It contains loans issued in 2014. We treat the 40000 labeled samples of loans issued in the first three quarters of 2014 as the source domain, and the 4000 labeled samples plus 9000 unlabeled samples of loans issued in the fourth quarter as the target domain. This is an Out-Of-Time scenario in financial risk control. Our goal is to help party A build a predictor that estimates whether a loan will default.
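As a rough illustration of this Out-Of-Time split, the sketch below partitions a loan table by issue quarter with pandas; the file name and the issue_date column are hypothetical placeholders, since the released dataset uses anonymized names.

```python
# A rough sketch of the Out-Of-Time split, assuming a hypothetical CSV file with
# an `issue_date` column; actual column names in the released dataset are anonymized.
import pandas as pd

loans = pd.read_csv("loan_default_2014.csv", parse_dates=["issue_date"])
loans = loans[loans["issue_date"].dt.year == 2014]

source = loans[loans["issue_date"].dt.quarter <= 3]   # Q1-Q3 2014: source domain
target = loans[loans["issue_date"].dt.quarter == 4]   # Q4 2014: target domain

# The target domain is further split into a small labeled set and a larger
# unlabeled set (4,000 / 9,000 samples in our experiments).
target_labeled = target.sample(n=4000, random_state=0)
target_unlabeled = target.drop(target_labeled.index).sample(n=9000, random_state=0)
```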

After data preprocessing, the Loan Default dataset has 162 features, 27 of which are categorical. To protect privacy, user and feature names are anonymized. We put 6 demographic features and the labels on the active parties, and the remaining 156 features on the passive party C. We split the features of party C into 5 groups: user location (loc), third-party period (period), education (edu), social network (soc), and micro-blog (mblog). Thus, we have 10 interactive feature groups (all pairwise combinations of the 5 groups): loc-period, loc-edu, loc-soc, loc-mblog, period-edu, period-soc, period-mblog, edu-soc, edu-mblog, and soc-mblog. We embed all categorical features into dense vectors. Table IV shows the architecture of the feature extractor for each of the 15 feature groups, as well as the one (i.e., all_feat) used for all features when feature grouping is not applied.
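To make the grouping concrete, the sketch below enumerates the 10 interactive groups as all pairwise combinations of the 5 base groups; the column names are hypothetical placeholders, and the per-group sizes simply mirror the input dimensions listed in Table IV (after categorical features are embedded).

```python
# A minimal sketch of enumerating interactive feature groups on party C.
# Column names are hypothetical placeholders for the anonymized feature names;
# group sizes follow the input dimensions of the extractors in Table IV.
from itertools import combinations

base_groups = {
    "loc":    [f"loc_{i}" for i in range(15)],
    "period": [f"period_{i}" for i in range(85)],
    "edu":    [f"edu_{i}" for i in range(30)],
    "soc":    [f"soc_{i}" for i in range(18)],
    "mblog":  [f"mblog_{i}" for i in range(55)],
}

# The 10 interactive groups are all pairwise combinations of the 5 base groups.
interactive_groups = {
    f"{a}-{b}": base_groups[a] + base_groups[b]
    for a, b in combinations(base_groups, 2)
}
print(list(interactive_groups))   # ['loc-period', 'loc-edu', ..., 'soc-mblog']
```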

TABLE IV: Architectures of the feature extractors for the Loan Default dataset. All feature extractors use only fully-connected layers with Leaky ReLU activations, which are omitted in the table for simplicity.
FG name        feature extractor architecture
loc            FC(15→20)-FC(20→15)-FC(15→6)
period         FC(85→100)-FC(100→60)-FC(60→8)
edu            FC(30→50)-FC(50→30)-FC(30→6)
soc            FC(18→30)-FC(30→18)-FC(18→6)
mblog          FC(55→70)-FC(70→30)-FC(30→8)
loc-period     FC(100→120)-FC(120→75)-FC(75→14)
loc-edu        FC(45→70)-FC(70→45)-FC(45→12)
loc-soc        FC(33→50)-FC(50→33)-FC(33→12)
loc-mblog      FC(70→90)-FC(90→45)-FC(45→14)
period-edu     FC(115→150)-FC(150→90)-FC(90→14)
period-soc     FC(103→130)-FC(130→78)-FC(78→14)
period-mblog   FC(140→170)-FC(170→90)-FC(90→16)
edu-soc        FC(48→80)-FC(80→48)-FC(48→12)
edu-mblog      FC(85→120)-FC(120→60)-FC(60→14)
soc-mblog      FC(73→100)-FC(100→48)-FC(48→14)
all_feat       FC(203→210)-FC(210→70)-FC(70→20)
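A minimal PyTorch sketch of how such an extractor can be instantiated from a list of layer widths is given below; this is an assumed implementation rather than the paper's code, and whether an activation follows the last layer is a detail not fixed by the table.

```python
# A minimal PyTorch sketch (assumed implementation) of a feature extractor built
# from a list of layer widths, matching the FC(...)-style rows of Table IV.
import torch.nn as nn

def make_extractor(dims):
    """Stack FC layers with Leaky ReLU between them, e.g. dims=[15, 20, 15, 6]."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.LeakyReLU()]
    return nn.Sequential(*layers)

loc_extractor = make_extractor([15, 20, 15, 6])       # FC(15→20)-FC(20→15)-FC(15→6)
period_extractor = make_extractor([85, 100, 60, 8])   # FC(85→100)-FC(100→60)-FC(60→8)
```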

The experimental results are reported in Table III. From these results, we observe that: (1) Table III shows a trend similar to that of Table I: model performance improves from the A-Local setting to A-VFL and then to B→A as more data is involved in training. The performance gain is largest when the number of positive labels is small. Specifically, when the number of positive labels is 40, PrADA in the B→A setting outperforms PrADA w/o DA&FG&IR in the A-VFL setting in AUC and KS by 12.49% and 20.55%, respectively, and outperforms SecureBoost in AUC and KS by 17.87% and 28.83%, respectively. As the number of positive labels increases, the performance gain narrows. (2) PrADA w/o FG&IR in the B→A setting outperforms PrADA w/o DA&FG&IR in AB-VFL in AUC by 0.25% and in KS by 0.65% on average, demonstrating the effectiveness of PrADA in mitigating domain divergence. (3) In the B→A setting, PrADA w/o IR outperforms PrADA w/o FG&IR, demonstrating the superiority of FG-based DA over conventional DA, and PrADA outperforms PrADA w/o IR, demonstrating that the interaction between feature groups helps enhance model performance.

8.4 Model Interpretability

We demonstrate model interpretability by visualizing the impact of features on the target model R^A using SHAP [18], a tool widely used to explain black-box models. As discussed in section 7, the real values of the model parameters of R^A are accessible by neither party C nor party A. This means that party A cannot interpret the model by simply inspecting its parameters. SHAP provides party A with a way to interpret the model without accessing the model parameters. We select the Census Income dataset for this purpose because the feature names in the Loan Default dataset are anonymized.

Figure 4: The importance of features produced by SHAP. Each feature is either a raw feature from party A or an (interactive) feature group from party C.

Figure 4 lists the most influential features of model R^A in descending order. Features at the top have higher predictive power because they contribute more to the model than those at the bottom. For example, emp-demo, gender, capital_gain, migr-house, and demo-migr are the top-5 most influential features.

Figure 5: The impact of features on model predictions. Each feature is either a raw feature from party A or a high-level feature representing an (interactive) feature group from party C.

SHAP can further reveal the positive and negative relationships of features with the prediction target. Figure 5 plots the SHAP values of every feature for all samples to illustrate the impact of those features on the prediction output. Features are ranked in descending order of importance. The color represents the feature value (red high, blue low), and the horizontal location shows whether the effect of a feature value is associated with a higher or lower prediction. Specifically, emp-demo, capital_gain, and demo-migr are positively correlated with the prediction, while gender and migr-house are negatively correlated with it.
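To illustrate the workflow behind Figures 4 and 5, the sketch below runs SHAP's model-agnostic KernelExplainer on a toy logistic-regression stand-in for R^A with synthetic data; in the federated setting, the prediction function would instead invoke the secure inference protocol, and the inputs would be party A's raw features together with the group-level representations received from party C.

```python
# A self-contained toy sketch of the SHAP workflow behind Figures 4 and 5. The
# logistic-regression model, synthetic data, and feature names are stand-ins;
# predict_pos would call the secure inference of R^A in the federated setting.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["gender", "capital_gain", "emp-demo", "migr-house", "demo-migr"]
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 2] - X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def predict_pos(x):
    # Black-box scoring function: exposes only predictions, never model parameters.
    return model.predict_proba(x)[:, 1]

explainer = shap.KernelExplainer(predict_pos, shap.sample(X, 50))
shap_values = explainer.shap_values(X[:100])

# Bar plot of mean |SHAP| (as in Figure 4) and beeswarm plot of per-sample impact (as in Figure 5).
shap.summary_plot(shap_values, X[:100], feature_names=feature_names, plot_type="bar")
shap.summary_plot(shap_values, X[:100], feature_names=feature_names)
```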

8.5 Computation Cost

We compare the training time among SecureLR, SecureBoost, and PrADA using FATE 1.6. The experiments are conducted on a machine with 72 Intel Xeon Gold 6140 CPUs and 320 GB of RAM. All experiments are simulated in standalone deployment mode. Note that the privacy-preserving VFL framework (PP-VFL) discussed in section 7 has been integrated into FATE, while the federated adversarial domain adaptation (FADA) discussed in section 6 has not, because some of FADA's core functionalities are not yet supported by FATE. Thus, we estimate the training time of PrADA by simulation using PP-VFL on FATE.

TABLE V: Training time (hours) on the Census Income dataset. PT and FT denote pre-training time and fine-tuning time, respectively.
Setting   Model                 Time (h)
AB-VFL    SecureLR              ~1.12
          SecureBoost           ~2.16
          PrADA w/o DA&FG&IR    ~4.10
Setting   Model                 PT (h)    FT (h)
B→A       PrADA w/o IR          ~5.47     ~0.52
          PrADA                 ~8.24     ~1.32

Table V shows that the training time of PrADA w/o DA&FG&IR is roughly 4.10 hours, approximately twice the training time of SecureBoost. In the B→A setting, PrADA w/o IR takes 5.47 hours to pre-train because FG-based domain adaptation is involved. However, once pre-training is completed, PrADA w/o IR takes only about half an hour to fine-tune. PrADA takes 8.24 hours to pre-train because it spends extra time on feature-group interaction, additional feature extractor training, and feature representation encryption. As reported in Tables I and III, PrADA exceeds PrADA w/o IR by only a small margin. Therefore, if efficiency is a major concern, PrADA w/o IR is the better choice.

9 Conclusion

In this paper, we propose a privacy-preserving vertical federated adversarial domain adaptation approach. In particular, we develop a privacy-preserving VFL framework that allows participating parties to collaboratively conduct domain adaptation without exposing private data. To reduce feature dimensionality, enhance model interpretability, and facilitate the learning of domain-invariant features, we propose a fine-grained adversarial domain adaptation over feature groups, each of which holds tightly related features. Experiments demonstrate both the effectiveness and practicality of our approach.

Acknowledgments

This work is partially supported by the National Key Research and Development Program of China under grant [2018AAA0101100].

References

  • [1] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, NIPS’18, page 7786–7795, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • [2] I. M. Chakravarti, R. G. Laha, and J. Roy. Handbook of Methods of Applied Statistics, Volume I. John Wiley and Sons, NY, 1967.
  • [3] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. This looks like that: Deep learning for interpretable image recognition. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 8930–8941. Curran Associates, Inc., 2019.
  • [4] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 647–655, China, 22–24 Jun 2014. PMLR.
  • [5] Cynthia Dwork. A firm foundation for private data analysis. Communications of the ACM, 54(1):86–95, 2011.
  • [6] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [7] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • [8] Chong Fu, Xuhong Zhang, Shouling Ji, Jinyin Chen, Jingzheng Wu, Shanqing Guo, Jun Zhou, Alex X Liu, and Ting Wang. Label inference attacks against vertical federated learning. In 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, August 2022. USENIX Association.
  • [9] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
  • [10] Hanlin Gu, Lixin Fan, Bowen Li, Yan Kang, Yuan Yao, and Qiang Yang. Federated deep learning with bayesian privacy. CoRR, abs/2109.13012, 2021.
  • [11] Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. Private federated learning on vertically partitioned data via entity resolution and homomorphic encryption. CoRR, abs/1711.10677, 2017.
  • [12] Zecheng He, Tianwei Zhang, and Ruby B. Lee. Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, ACSAC ’19, page 148–162, NY, USA, 2019. Association for Computing Machinery.
  • [13] Daniel Ho. Nbdt: Neural-backed decision trees. Master’s thesis, EECS Department, University of California, Berkeley, May 2020.
  • [14] Yan Kang, Yuezhou Wu, Jiahuan Luo, Yuanqin He, and Qiang Yang. Fedcg: Leverage conditional GAN for protecting privacy and maintaining competitive performance in federated learning. CoRR, abs/2111.08211, 2021.
  • [15] Oscar Li, Jiankai Sun, Xin Yang, Weihao Gao, Hongyi Zhang, Junyuan Xie, Virginia Smith, and Chong Wang. Label leakage and protection in two-party split learning. In International Conference on Learning Representations, 2022.
  • [16] Xiaoxiao Li, Yufeng Gu, Nicha Dvornek, Lawrence H. Staib, Pamela Ventola, and James S. Duncan. Multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: Abide results. Medical Image Analysis, 65:101765, 2020.
  • [17] Yang Liu, Zhihao Yi, Yan Kang, Yuanqin He, Wenhan Liu, Tianyuan Zou, and Qiang Yang. Defending label inference and backdoor attacks in vertical federated learning. CoRR, abs/2112.05409, 2021.
  • [18] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
  • [19] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
  • [20] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In 36th International Conference on Machine Learning, ICML 2019, pages 8114–8124. International Machine Learning Society (IMLS), January 2019.
  • [21] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing: A Review Journal, 73:1–15, February 2018.
  • [22] Xingchao Peng, Zijun Huang, Yizhe Zhu, and Kate Saenko. Federated adversarial domain adaptation. arXiv preprint arXiv:1911.02054, 2019.
  • [23] Peter Kairouz, H. Brendan McMahan, Brendan Avent, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • [24] D. Peterson. Private federated learning with domain adaptation. ArXiv, abs/1912.06733, 2019.
  • [25] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [26] Bruce Schneier and Phil Sutherland. Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley and Sons, Inc., USA, 2007.
  • [27] L. Song, C. Ma, G. Zhang, and Y. Zhang. Privacy-preserving unsupervised domain adaptation in federated setting. IEEE Access, 8:143233–143240, 2020.
  • [28] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 2962–2971, 2017.
  • [29] Praneeth Vepakomma, Tristan Swedish, Ramesh Raskar, Otkrist Gupta, and Abhimanyu Dubey. No peek: A survey of private distributed deep learning, 2018.
  • [30] Z. Wang, Z. Dai, B. Póczos, and J. Carbonell. Characterizing and avoiding negative transfer. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11285–11294, 2019.
  • [31] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):12, 2019.
  • [32] Qiao Zhang, Cong Wang, Hongyi Wu, Chunsheng Xin, and Tran V. Phuong. Gelu-net: A globally encrypted, locally unencrypted deep neural network for privacy-preserved learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 3933–3939. International Joint Conferences on Artificial Intelligence Organization, 7 2018.
  • [33] Y. Zhang and Hao Zhu. Additively homomorphical encryption based deep neural network for asymmetrically collaborative machine learning. ArXiv, abs/2007.06849, 2020.
  • [34] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7404–7413, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • [35] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Yan Kang is currently a research team lead with the AI department of WeBank, Shenzhen, China. His work focuses on the research and implementation of privacy-preserving machine learning and federated learning. His research has been published in well-known conferences and journals including IEEE Intelligent Systems, IJCAI, and ACM TIST, and he coauthored the Federated Learning book.
Yuanqin He is currently a researcher with WeBank. He received the B.S. degree from Shanghai Jiao Tong University, and the Ph.D. degree in Physics from Technical University of Munich. His research interests include machine learning and federated learning.
Jiahuan Luo is currently a researcher with WeBank. He received the B.S. degree from Guangdong University of Foreign Studies and the Master degree in Software Engineering from South China University of Technology. His research interests include federated learning and representation learning.
Yang Liu is an associate professor with the Institute for AI Industry Research (AIR), Tsinghua University. Her research interests include federated learning, machine learning, multi-agent systems, statistical mechanics, and AI industrial applications. Her research work has been recognized with multiple awards, such as the AAAI Innovation Award and the CCF Technology Award.
Tao Fan is a tech lead with the AI department of WeBank, Shenzhen, China. He is now responsible for FATE, an industrial-grade open-source federated learning project. He has more than 8 years of experience in large-scale machine learning. He received his Master's degree from the University of Science and Technology of China in 2013.
Qiang Yang is a fellow of the Royal Society of Canada (RSC) and the Canadian Academy of Engineering (CAE), Chief Artificial Intelligence Officer of WeBank, and a Chair Professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology (HKUST). He is a fellow of AAAI, ACM, CAAI, IEEE, IAPR, and AAAS. His research interests are artificial intelligence, machine learning, data mining, and planning. His latest books are Transfer Learning, Federated Learning, and Practicing Federated Learning.