
Towards Open-World Feature Extrapolation:
An Inductive Graph Learning Approach

Qitian Wu, Chenxiao Yang, Junchi Yan
Department of Computer Science and Engineering
Shanghai Jiao Tong University
{echo740, chr26195, yanjunchi}@sjtu.edu.cn
Junchi Yan is the corresponding author.
Abstract

We target the open-world feature extrapolation problem, where the feature space of input data expands over time and a model trained on partially observed features must handle new features in test data without further retraining. The problem is of much significance for dealing with features incrementally collected from different fields. To this end, we propose a new learning paradigm based on graph representation and learning. Our framework contains two modules: 1) a backbone network (e.g., a feedforward neural net) as a lower model that takes features as input and outputs predicted labels; 2) a graph neural network as an upper model that learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data. Based on our framework, we design two training strategies, a self-supervised approach and an inductive learning approach, to endow the model with extrapolation ability and alleviate feature-level over-fitting. We also provide theoretical analysis of the generalization error on test data with new features, which dissects the impact of training features and algorithms on generalization performance. Our experiments over several classification datasets and large-scale advertisement click prediction datasets demonstrate that our model can produce effective embeddings for unseen features and significantly outperforms baseline methods that adopt KNN and local aggregation. The implementation code is publicly available at https://github.com/qitianwu/FATE.

1 Introduction

Learning a mapping from an observation $\mathbf{x}$ (a vector of attribute features) to a label $y$ is a fundamental and pervasive problem in the ML community, with extensive applications spanning from classification/regression tasks for tabular data to advertisement click prediction feature-ctr ; lr ; wd ; fm ; fwl , item recommendation feature-rec ; feature-rec2 ; youtube ; pinsage , question answering qa2 ; qa1 , and AI more broadly. Existing approaches focus on a fixed input feature space shared by training and test data. Nevertheless, practical ML systems interact with a dynamic open world where features are incrementally collected. For example, in recommender/advertisement systems, new user-profile features unseen before often emerge for current prediction tasks. Also, with the advances of multi-modal recognition mm-4 and federated learning mm-2 , a model trained with partial features is often required to incorporate new features from other fields for decision-making on a target task.

A challenge stems from the fact that off-the-shelf neural network models cannot deal with new features without re-training on new data. As shown in Fig. 1(a), a neural network builds a mapping from input features to a hidden representation through a weight matrix in the first layer. Given new features as input, the network needs to be augmented with new weight parameters and, without re-training, cannot map the new features to a desirable position in latent space.

However, model re-training is tricky and brings up several issues. First, re-training a model with both previous and new features is highly time-consuming and cannot meet the latency requirements of online systems. Alternatively, one can re-train the model only on new features, which induces risks of over-fitting the new data or forgetting the previous data.

Different from machines, fortunately, humans are often able to extrapolate to unseen features and distill the knowledge in new information for solving a target task without any re-training on new data. The inherent gap between existing ML approaches and human intelligence raises a research question: Can we design an ML model that is trained on one set of features and able to generalize to, and combine, new unseen features for the same task without further training?

Figure 1: (a) The open-world feature extrapolation problem is defined over training data with partial features and test data with an augmented feature space. Existing neural models cannot deal with new features without re-training. (b) Our approach has an analogy to the thinking process in the human brain. Imagine a person trained to use the features age and occ to predict one's income and tested on cases with a new feature edu. There are four processes from acquiring the observation to giving the final prediction.

Finding a desirable solution to this non-trivial question is challenging. Let us resort to how humans think and act in physical-world scenarios. Imagine that we are well trained on predicting a person's income from one's age and occupation in historical records. Now we are asked to estimate the income of a new person given his/her age, occ. and education (a new feature). We can modularize the process in brain systems that distills and leverages the knowledge in new observations into four steps, as shown in Fig. 1(b). 1) Perception: the observations are recognized by our senses; 2) Abstraction: the perceived information is aligned with concepts in our cognition; 3) Reasoning: we search similar concepts and observations in our memory to understand and absorb the new knowledge; 4) Decision: with the new understanding and abstraction, we make the final prediction.

This human thinking process inspires the methodology in our paper, where we propose a new learning paradigm for open-world feature extrapolation. Our proposed framework contains two modules: a backbone network, which can be a feedforward neural network, and a graph neural network. 1) The backbone network first maps input features to embeddings, which can be seen as perception from observation. 2) We then treat the observed data matrix (each row represents a feature vector for one instance) as a feature-data bipartite graph, which explicitly defines proximity and locality structures among features and instances. A graph neural network is then harnessed for neural message passing over adjacent features and instances in a latent space (abstraction). 3) The GNN inductively computes embeddings for new features based on those of existing ones, mimicking the reasoning process from familiar concepts to new ones in our brain. 4) The newly obtained embeddings, which capture both semantics and feature-level relations, are used to obtain hidden representations of new data with unseen features and make final decisions.

To endow the model with the ability to extrapolate to new features, we propose two training algorithms, one by self-supervised learning and one by inductive learning. The proposed model and its learning algorithm easily scale to large datasets (with millions of features and instances) via mini-batch training. We also provide a theoretical analysis of the generalization error on test data with new features, which allows us to dissect the impact of training features and algorithms on generalization performance. To verify our approach, we conduct extensive experiments on several real-world datasets, including six classification datasets from biology, engineering and social domains with diverse features, as well as two large-scale datasets for advertisement click prediction. Our model is trained on training data using partial features and tested on test data with a mixture of seen and unseen features. The results demonstrate that 1) our approach consistently outperforms models not using new features for inference; 2) our approach achieves on average $29.8\%$ higher Accuracy than baseline methods using KNN, pooling or local aggregation for feature extrapolation; 3) our approach even exceeds models using incremental training on new features, yielding $4.1\%$ higher Accuracy on average.

Our contributions are: 1) We formulate the open-world feature extrapolation problem and show that it is feasible to extend neural models to extrapolate to new features without re-training; 2) We propose a new graph-learning model and two training algorithms for the feature extrapolation problem; 3) Our theoretical analysis shows that the generalization error for data with new features relies on the number of training features and the randomness in training algorithms; 4) We conduct comprehensive experiments and show the effectiveness, applicability and scalability of the proposed method.

2 Methodology

We focus on attribute features as input in this paper. An input instance is a vector $\mathbf{r}_{i}=[r_{im}]_{m=1}^{d}\in\mathbb{R}^{d}$ where each entry $r_{im}$ denotes a raw feature (like age, occ., edu., etc.). If $r_{im}$ is a discrete/categorical raw feature with $R_{m}$ possible values, its space is an integer set $\{0,1,\cdots,R_{m}-1\}$. If $r_{im'}$ is a continuous one, a common practice is to convert it into a discrete feature within space $\{0,1,\cdots,R_{m'}-1\}$ by even division ffm or log transformation feature-ctr . We call $R_{m}$ the cardinality of the $m$-th raw feature.

An effective way to handle attribute features is via one-hot encoding wd ; ffm ; deepfm . For $r_{im}$ with cardinality $R_{m}$, we convert it into an $R_{m}$-dimensional one-hot vector $\mathbf{x}_{i}^{m}$ where the unique 1 indexes the value. In this way, one can convert an input $\mathbf{r}_{i}$ into a concatenation of one-hot vectors:

$$\mathbf{x}_{i}=\left[\mathbf{x}_{i}^{1},\mathbf{x}_{i}^{2},\cdots,\mathbf{x}_{i}^{d}\right],\quad\text{where for all}\;1\leq m\leq d,\;\mathbf{x}_{i}^{m}\in\{0,1\}^{R_{m}}\;\text{is a one-hot vector}.\qquad(1)$$

We use $x_{ij}$ to denote the $j$-th entry of $\mathbf{x}_{i}$ and call each $x_{ij}$ a feature in this paper. Let $D=\sum_{m=1}^{d}R_{m}$ denote the number of features; as a reminder, $d$ is the number of raw features.
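
To make the encoding concrete, here is a minimal sketch in NumPy (our own illustration, not the authors' released code; the cardinalities and values in the example are hypothetical):

```python
import numpy as np

def encode_instance(raw, cardinalities):
    """Concatenate one-hot encodings of d raw features into one binary vector.

    raw[m] is an integer in {0, ..., cardinalities[m] - 1}; the output has
    D = sum(cardinalities) entries, one per feature x_ij.
    """
    parts = []
    for value, R_m in zip(raw, cardinalities):
        one_hot = np.zeros(R_m, dtype=np.int8)
        one_hot[value] = 1  # the unique 1 indexes the raw feature's value
        parts.append(one_hot)
    return np.concatenate(parts)

# Hypothetical example: age bucket with 5 bins, occupation with 3 values.
x_i = encode_instance([2, 0], cardinalities=[5, 3])
print(x_i)  # [0 0 1 0 0 1 0 0], i.e., D = 8 features from d = 2 raw features
```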

We next give a formal definition of the open-world feature extrapolation problem studied in this paper: given training data $\{(\mathbf{x}_{i},y_{i})\}_{i\in I_{tr}}$ where $\mathbf{x}_{i}\in\mathcal{X}_{tr}=\{0,1\}^{D}$, $y_{i}\in\mathcal{Y}$ and $I_{tr}$ is a set of indices, we aim to learn a model that can generalize to test data $\{(\overline{\mathbf{x}}_{i'},y_{i'})\}_{i'\in I_{te}}$ where $\overline{\mathbf{x}}_{i'}\in\mathcal{X}_{te}=\{0,1\}^{\overline{D}}$, $y_{i'}\in\mathcal{Y}$ and $I_{te}$ is another set of indices. We term $\mathcal{X}_{tr}$ the training feature space and $\mathcal{X}_{te}$ the test feature space. We assume 1) the label space $\mathcal{Y}$ is shared by training and test data, and 2) $\mathcal{X}_{tr}\subset\mathcal{X}_{te}$, i.e., the test feature space is an extension of the training feature space. The feature space expansion stems from two possible causes: 1) new raw features are incrementally collected from other fields (i.e., $d$ increases), or 2) new values appear outside the known space of existing raw features (i.e., $R_{m}$ increases).

2.1 Proposed Model

Our model contains three parts: 1) a feature representation that builds a bipartite feature-data graph from input data; 2) a backbone network, essentially a neural network model that predicts the labels when fed with input data; 3) a GNN model that inductively computes feature embeddings based on their proximity and local structures to achieve feature extrapolation.

Feature Representation with Graphs. We stack the feature vectors of all the training data as a matrix $\mathbf{X}_{tr}=[\mathbf{x}_{i}]_{i\in I_{tr}}\in\{0,1\}^{N\times D}$ where $N=|I_{tr}|$. Then we treat each feature and instance as nodes and construct a bipartite graph between them. Formally, we define a node set $F_{tr}\cup I_{tr}$ where $F_{tr}=\{f_{j}\}_{j=1}^{D}$ with $f_{j}$ the $j$-th feature and $I_{tr}=\{o_{i}\}_{i=1}^{N}$ with $o_{i}$ the $i$-th instance in the training set. The binary matrix $\mathbf{X}_{tr}$ constitutes an adjacency matrix whose non-zero entries indicate edges connecting nodes in $F_{tr}$ and $I_{tr}$, respectively. The induced feature-data bipartite graph will play an important role in our extrapolation approach. The representation is flexible for variable-sized feature sets, enabling our model to handle test data $\overline{\mathbf{x}}_{i'}\in\{0,1\}^{\overline{D}}$, which gives $\mathbf{X}_{te}=[\overline{\mathbf{x}}_{i'}]_{i'\in I_{te}}$.
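
As a minimal sketch (our illustration using SciPy sparse matrices; the variable names are ours), the bipartite graph is simply the binary data matrix viewed as a biadjacency matrix, which extends naturally when test data introduce extra feature columns:

```python
import numpy as np
import scipy.sparse as sp

# Rows are instance nodes o_i, columns are feature nodes f_j;
# each nonzero entry x_ij = 1 is an edge between o_i and f_j.
X_tr = sp.csr_matrix(np.array([[1, 0, 1],
                               [0, 1, 1]], dtype=np.float32))  # N = 2, D = 3

# Test data may append new feature columns (D_bar > D) without changing
# how the graph is represented.
new_cols = sp.csr_matrix(np.array([[1], [0]], dtype=np.float32))
X_te = sp.hstack([X_tr, new_cols]).tocsr()                     # N = 2, D_bar = 4
print(X_te.shape)  # (2, 4)
```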

Backbone Networks. We next consider a prediction model $h_{\theta}(\cdot)$ as a backbone network that maps data features $\mathbf{x}_{i}$ to a predicted label $\hat{y}_{i}$. Without loss of generality, a default choice for $h_{\theta}$ is a feedforward neural network. The first layer serves as an embedding layer that projects $\mathbf{x}_{i}$ to an $H$-dimensional hidden vector $\mathbf{z}_{i}=\mathbf{x}_{i}\mathbf{W}$, where $\mathbf{W}\in\mathbb{R}^{D\times H}$ denotes a weight matrix. The subsequent network (called the classifier) is often a stack of neural layers that predicts the label $\hat{y}_{i}=\text{FFN}(\mathbf{z}_{i};\phi)$. We use the notation $\hat{y}_{i}=h(\mathbf{x}_{i};\phi,\mathbf{W})$ to highlight the two sets of parameters, and $\theta=[\phi,\mathbf{W}]$.

Notice that the matrix multiplication in the embedding layer is equivalent to a two-step procedure: 1) a lookup of feature embeddings and 2) a permutation-invariant aggregation. More specifically, we consider $\mathbf{W}$ as a stack of weight vectors $\mathbf{W}=[\mathbf{w}_{j}]_{j=1}^{D}$ where $\mathbf{w}_{j}\in\mathbb{R}^{1\times H}$ corresponds to the embedding of feature $f_{j}$. The non-zero entries in $\mathbf{x}_{i}$ index the corresponding rows of $\mathbf{W}$ and induce a set of embeddings $\{\mathbf{z}_{i}^{m}\}_{m=1}^{d}$ where $\mathbf{z}_{i}^{m}$ is the embedding given by $\mathbf{x}_{i}^{m}$ (i.e., the one-hot vector of the $m$-th raw feature). Then the hidden vector of the $i$-th instance can be obtained by aggregation, i.e., $\mathbf{z}_{i}=\sum_{m=1}^{d}\mathbf{z}_{i}^{m}$, which is permutation-invariant w.r.t. the order of feature embeddings in $\{\mathbf{z}_{i}^{m}\}_{m=1}^{d}$. A more intuitive illustration is presented in Fig. 2.
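
The equivalence can be checked directly; the following sketch (illustrative PyTorch, names ours) verifies that multiplying a multi-hot vector by $\mathbf{W}$ equals summing the rows of $\mathbf{W}$ indexed by the active features, in any order:

```python
import torch

D, H = 8, 4
W = torch.randn(D, H)                    # stacked feature embeddings w_j
x = torch.zeros(D)
x[[1, 5, 6]] = 1.0                       # multi-hot instance with 3 active features

z_matmul = x @ W                         # embedding layer: z_i = x_i W
z_lookup = W[[1, 5, 6]].sum(dim=0)       # lookup + sum aggregation
z_perm   = W[[6, 1, 5]].sum(dim=0)       # any permutation yields the same z_i

assert torch.allclose(z_matmul, z_lookup) and torch.allclose(z_lookup, z_perm)
```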

The permutation-invariant property opens a way to handle variable-length feature vectors $\overline{\mathbf{x}}_{i'}$ deepset . Essentially, as long as we have embeddings for the input features, we can add them up to get a fixed-dimensional hidden representation $\mathbf{z}_{i'}$ as input for the subsequent classifier. Therefore, the problem boils down to learning feature embeddings, especially how to extrapolate to embeddings of new features based on those of existing ones.

Remark. Instead of using sum aggregation, some existing architectures consider concatenation of the $d$ embedding vectors $\mathbf{z}_{i}^{m}$, which is essentially equivalent to sum aggregation (see Appendix A for more details). Therefore, the permutation-invariant property holds for widely adopted deep models.

Figure 2: An illustration of the feasibility of extrapolation to a new feature space with neural network models. (a) The backbone network is a feedforward neural network that is fed with an input feature vector $\mathbf{x}_{i}$ and outputs a predicted label $\hat{y}_{i}$. The first layer can be seen as an embedding layer where an embedding matrix is multiplied with the input vector to obtain an intermediate hidden representation $\mathbf{z}_{i}$. (b) The embedding layer can be equivalently replaced by an embedding lookup and an aggregation over the indexed feature embeddings. The aggregation is permutation-invariant w.r.t. the order of input features. (c) The permutation-invariant property enables the model to handle variable-length input feature vectors (new features may come from unseen values of existing raw features or from unseen raw features). The problem boils down to learning the new features' embeddings.

GNN for Feature Extrapolation. We proceed to propose a graph neural network (GNN) model for embedding learning over the feature-data graph. Our key insight is that the bipartite graph explicitly embodies features' co-occurrence in observed instances, which reflects the proximity among features. Once we conduct message passing for feature embeddings over the graph structure, the embeddings of similar features can be leveraged to compute and update each feature's embedding. The model can thus learn to extrapolate to new features' embeddings from those of existing features with locality structures in a data-driven manner. The message passing over the defined graph representation is inductive w.r.t. variable-sized feature and instance node sets, which enables the model to tackle new feature spaces with distinct feature sizes and supports.

Specifically, we consider the embedding $\mathbf{w}_{j}$ as the initial state $\mathbf{w}_{j}^{(0)}$ of node $f_{j}$ in $F_{tr}$. The initial states of instance nodes are set as zero vectors with the same dimension as the feature nodes, i.e., $\mathbf{s}_{i}^{(0)}=\mathbf{0}$. The interaction between the two sets of nodes $\{\mathbf{w}_{j}\}_{j=1}^{D}$ and $\{\mathbf{s}_{i}\}_{i=1}^{N}$ can be modeled via graph neural networks where the node states in the $l$-th layer are updated by

$$\mathbf{s}_{i}^{(l)}=\mathbf{P}^{(l)}\,\text{COMB}\left(\mathbf{s}_{i}^{(l-1)},\text{AGG}\left(\{\mathbf{w}_{k}^{(l-1)}\,|\,\forall k,\,x_{ik}=1\}\right)\right),\qquad(2)$$
$$\mathbf{w}_{j}^{(l)}=\mathbf{P}^{(l)}\,\text{COMB}\left(\mathbf{w}_{j}^{(l-1)},\text{AGG}\left(\{\mathbf{s}_{k}^{(l-1)}\,|\,\forall k,\,x_{kj}=1\}\right)\right),$$

where $\mathbf{P}^{(l)}\in\mathbb{R}^{H\times H}$ is a weight matrix; we do not use non-linearity since it degrades the performance empirically. For any new feature $f_{j'}$ in test data, we can set its initial state as a zero vector $\mathbf{w}_{j'}^{(0)}=\mathbf{0}$. The GNN model outputs updated embeddings for feature nodes, which we further use as the feature embeddings in the backbone network. Fig. 3 presents the feedforward computation of the proposed model. Formally, with an $L$-layer GNN, the GNN network $g$ gives updated feature embeddings $\hat{\mathbf{W}}=[\mathbf{w}_{j}^{(L)}]_{j=1}^{D}=g(\mathbf{W},\mathbf{X};\omega)$, and the backbone network then outputs the prediction $\hat{y}_{i}=h(\mathbf{x}_{i};\phi,\hat{\mathbf{W}})$, where $\mathbf{X}=\mathbf{X}_{tr}$ for training, $\mathbf{X}=\mathbf{X}_{te}$ for test, and $\omega=\{\mathbf{P}^{(l)}\}_{l=1}^{L}$ denotes $g$'s parameters.
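
A minimal sketch of this bipartite message passing is given below (illustrative PyTorch; we assume mean pooling for AGG and sum for COMB, one of several valid instantiations, and the function and variable names are ours). New feature columns simply start from zero states:

```python
import torch

def feature_extrapolation_gnn(X, W0, P_list):
    """Bipartite message passing in the spirit of Eq. (2).

    X:      (N, D) binary data matrix (biadjacency of the feature-data graph).
    W0:     (D, H) initial feature states; rows for new features are zeros.
    P_list: per-layer (H, H) weight matrices (no non-linearity, per the paper).
    """
    deg_i = X.sum(dim=1, keepdim=True).clamp(min=1)      # instance node degrees
    deg_f = X.sum(dim=0, keepdim=True).t().clamp(min=1)  # feature node degrees
    w = W0
    s = torch.zeros(X.shape[0], W0.shape[1])             # instance states start at 0
    for P in P_list:
        s_next = (s + X @ w / deg_i) @ P                 # instances gather from features
        w_next = (w + X.t() @ s / deg_f) @ P             # features gather from instances
        s, w = s_next, w_next
    return w                                             # updated embeddings W_hat

N, D, H = 5, 8, 4
X = (torch.rand(N, D) < 0.4).float()
W0 = torch.randn(D, H)
W0[-2:] = 0.0                                            # last two features are "new"
W_hat = feature_extrapolation_gnn(X, W0, [torch.randn(H, H) for _ in range(2)])
```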

Figure 3: Feedforward computation of our model with input $\{\mathbf{x}_{i}\}$: we build a feature-data graph between feature nodes $\{f_{j}\}$ and instance nodes $\{o_{i}\}$. A GNN is used to inductively compute feature embeddings via message passing, and a backbone network uses the updated embeddings to predict the labels $\{y_{i}\}$.

2.2 Model Learning

We next discuss approaches for model training. To enable the model to extrapolate to new features, we put forward two useful strategies. 1) Proxy training data: we only use partial features from the training set as observed ones for each update. 2) Asynchronous updates: we decouple the training of the backbone network and the GNN and use different update frequencies for them in a nested manner (see Fig. 4). Based on these, we propose two specific training approaches.

Figure 4: Illustration of the two proposed training approaches: (a) self-supervised learning with n-fold splitting ($n=4$), and (b) inductive learning with k-shot sampling ($k=3$). The two methods differ in the configuration of proxy data, and both adopt an asynchronous update rule: the backbone network is updated with $n$ folds of proxy data before the GNN is updated once with an accumulated loss.

Self-supervised Learning with N-fold Splits. To mimic new features arriving in the future, we can mask some observed features and let the model use the remaining features to estimate the embeddings of the masked ones. For a given feature set $F_{tr}=\{f_{j}\}_{j=1}^{D}$ of data $\{\mathbf{x}_{i}\}_{i\in I_{tr}}$, we consider an n-fold splitting method: in each iteration, the features are first randomly shuffled and evenly divided into $n$ disjoint subsets, denoted by $\{\overline{F}_{s}\}_{s=1}^{n}$. We then adopt an asynchronous update rule for the two networks: each iteration contains $n$ updates of the backbone network and one update of the GNN model. For the $s$-th update of the backbone, we mask the features in $\overline{F}_{s}$ and set the initial states of the masked features as zero vectors before they are fed into the GNN. The GNN uses the adjacency matrix $\mathbf{X}_{tr}$ to compute updated embeddings for the masked features. The embedding layer is then composed of the updated embeddings of masked features and the initial embeddings of the remaining features, based on which the backbone network outputs a prediction $\hat{y}_{i}$ for each $\mathbf{x}_{i}$ and computes the loss $\mathcal{L}_{s}=\frac{1}{N}\sum_{i\in I_{tr}}l(y_{i},\hat{y}_{i})$, where $l(\cdot,\cdot)$ can be cross-entropy for classification (note that we still use supervised labels in the loss although we call this approach self-supervised learning). After the $n$ updates of the backbone network, we use the accumulated loss $\mathcal{L}=\sum_{s=1}^{n}\mathcal{L}_{s}$ to update the GNN model. Training repeats the above process until a given time budget.
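
The sketch below outlines one such iteration (an illustrative PyTorch rendering, not the authors' released code; `gnn(W, X)` and `backbone(X, W)` are assumed callables matching Section 2.1, and letting the GNN gradients accumulate over the $n$ folds is one simple way to realize the asynchronous rule):

```python
import torch
import torch.nn.functional as F

def self_supervised_iteration(X, y, W, gnn, backbone, opt_backbone, opt_gnn, n=4):
    """One iteration of Fig. 4(a): n backbone updates over n feature folds,
    then one GNN update driven by the loss accumulated across folds."""
    folds = torch.randperm(W.shape[0]).chunk(n)  # shuffle and split feature ids
    for fold in folds:
        W_in = W.clone()
        W_in[fold] = 0.0                         # zero initial states of the masked fold
        W_hat = gnn(W_in, X)                     # GNN extrapolates the masked embeddings
        W_mix = W.clone()
        W_mix[fold] = W_hat[fold]                # masked rows: GNN output; rest: initial
        loss = F.cross_entropy(backbone(X, W_mix), y)
        loss.backward()                          # gradients reach backbone and GNN
        opt_backbone.step()                      # update the backbone right away ...
        opt_backbone.zero_grad()                 # ... while GNN gradients keep accumulating
    opt_gnn.step()                               # one GNN update per iteration
    opt_gnn.zero_grad()
```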

Inductive Learning with K-shot Samples. Alternatively, we can sample over the feature set and only expose partial features to the model for each update. For the $s$-th update, we randomly sample $k$ raw features (one can also directly sample a certain ratio of features from $F_{tr}$, which might lead to large variance) from the input data, which induces a new feature set $F_{s}\subset F_{tr}$; we extract the corresponding columns of $\mathbf{X}_{tr}=[\mathbf{x}_{i}]_{i\in I_{tr}}$ to form a proxy data matrix $\mathbf{X}_{s}\in\{0,1\}^{N\times|F_{s}|}$ (where each instance contains $|F_{s}|$ features). Then $\mathbf{X}_{s}$ is fed into the GNN to obtain updated embeddings of the features in $F_{s}$, based on which the backbone network outputs a prediction for each instance using the features in $F_{s}$.
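
A short sketch of the sampling step (our illustration; `feature_slices` is a hypothetical bookkeeping structure mapping each raw feature to its one-hot columns, so sampling raw features keeps each one-hot block intact):

```python
import torch

def k_shot_proxy(X, feature_slices, k=3):
    """Sample k raw features and slice the corresponding columns of X_tr
    to form the proxy data matrix X_s (Fig. 4b)."""
    d = len(feature_slices)
    chosen = torch.randperm(d)[:k]
    cols = torch.cat([feature_slices[m] for m in chosen])
    return X[:, cols], cols  # proxy matrix and the ids of the sampled features

X = (torch.rand(6, 8) < 0.4).float()
slices = [torch.tensor([0, 1, 2]), torch.tensor([3, 4]), torch.tensor([5, 6, 7])]
X_s, cols = k_shot_proxy(X, slices, k=2)
```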

By comparison, n-fold splitting contributes to better training stability, since the model is updated on every feature in each iteration, while inductive learning adds more randomness, which can help enhance the model's generalization. We further compare them in our experiments.

DropEdge for Regularization. To further alleviate over-fitting on training features, we use DropEdge dropedge to regularize our model. We set a threshold $\rho$ and randomly zero out nonzero entries in $\mathbf{X}_{tr}$ (for self-supervised learning) or $\mathbf{X}_{s}$ (for inductive learning) in each feedforward computation:

$$\tilde{\mathbf{X}}_{tr}=\text{DROPEDGE}(\mathbf{X}_{tr},\rho)=\{x_{ij}\,|\,x_{ij}\in\mathbf{X}_{tr},\,e_{ij}>\rho\},\quad\text{where}\;e_{ij}\sim\mathcal{U}(0,1).\qquad(3)$$
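
On a dense binary matrix, Eq. (3) is a one-liner (illustrative PyTorch):

```python
import torch

def drop_edge(X, rho):
    """Randomly remove edges of the feature-data graph (Eq. 3): each nonzero
    entry of X survives only if its uniform draw e_ij exceeds rho."""
    e = torch.rand_like(X)
    return X * (e > rho).float()  # zeros stay zero; edges kept w.p. 1 - rho

X_tilde = drop_edge((torch.rand(4, 6) < 0.5).float(), rho=0.2)
```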

Scaling to Large Systems. To handle prohibitively large datasets in practical systems, we can divide the data matrix into mini-batches along the instance dimension. We then feed each mini-batch into the model for one training step (including feature-level sampling/splitting, as shown in Fig. 4) or for inference. Since the number of nonzero features per instance is at most $d$ (a relatively small value), the number of edges in each mini-batch is bounded by $O(Bd)$ (assuming a mini-batch contains $B$ instances). Hence, the space cost can be effectively controlled using instance-level mini-batch partitioning. Note, however, that $B$ cannot be arbitrarily small, so as to guarantee sufficient message passing over diverse instances. We present the complete training algorithm in Appendix B, where the model is trained end-to-end using the self-supervised or inductive learning approach.
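
The partitioning itself is straightforward; a possible sketch (our illustration) is:

```python
import torch

def instance_minibatches(X, y, B=1024, shuffle=True):
    """Partition the data matrix along the instance dimension. Each batch holds
    at most B rows, so its bipartite graph has at most O(B * d) edges."""
    order = torch.randperm(X.shape[0]) if shuffle else torch.arange(X.shape[0])
    for start in range(0, X.shape[0], B):
        idx = order[start:start + B]
        yield X[idx], y[idx]
```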

3 Generalization Analysis

In this section, we analyze the generalization error on test data with new features. We simplify the setting for analysis: 1) the backbone network is a two-layer FNN (an embedding layer $\mathbf{W}\in\mathbb{R}^{D\times H}$ plus a fully-connected layer $\Phi\in\mathbb{R}^{H\times 1}$) with sigmoid output; 2) the GNN is an $L$-layer GCN that takes mean pooling aggregation over neighboring nodes without linear transformation or non-linearity in each layer; 3) the training algorithm is SGD. With the above settings, the model can be written as $\hat{y}_{i}=\sigma(\sum_{i'\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii'}^{L}\mathbf{x}_{i'}\mathbf{W}\Phi)$, where $\mathcal{N}_{\tilde{L}}(i)$ contains all the $\mathbf{x}_{i'}$'s that appear in the $\tilde{L}$-hop neighborhood of $\mathbf{x}_{i}$ in the feature-data graph, $\tilde{L}=2\lfloor\frac{L}{2}\rfloor$, and $c_{ii'}^{L}\in\mathbb{R}^{+}$ is a weight that quantifies the influence of $\mathbf{x}_{i'}$ on $\mathbf{x}_{i}$ through the $L$-layer mean-pooling graph convolution. More details of the derivation are in Appendix C. Also, we focus our analysis on the case of inductive learning; the results can be extended to the self-supervised approach, which we leave for future work.

The data generation process can be described as follows. First, features $f_{j}$ are sampled from an unknown distribution $\mathcal{F}$ and form a feature set $F_{tr}=\{f_{j}\}_{j=1}^{D}$. Then data $\{(\mathbf{x}_{i},y_{i})\}_{i\in I_{tr}}$ are sampled from a distribution $\mathcal{D}_{F_{tr}}$ whose support is over $\mathcal{X}_{F_{tr}}\times\mathcal{Y}$; we define $\mathbf{X}_{tr}=\{\mathbf{x}_{i}\}_{i\in I_{tr}}$ and $Y_{tr}=\{y_{i}\}_{i\in I_{tr}}$. Using $\psi=[\theta,\omega]$, the model can be denoted by $\hat{Y}=h(\mathbf{X};\psi)$ (a shorthand for $\hat{Y}=\{\hat{y}_{i}\}_{i\in I}=\{h(\mathbf{x}_{i};\psi)\}_{i\in I}$) with loss function $\mathcal{L}(Y,\hat{Y})=\frac{1}{|I|}\sum_{i\in I}l(y_{i},\hat{y}_{i})$.

Recall that in the training stage, we randomly partition the instances in $\mathbf{X}_{tr}$ into mini-batches of size $B$, and each mini-batch further samples $k$ raw features to form a feature subset $F_{s}$. In each update, the model is exposed to a $B\times|F_{s}|$ sub-matrix of $\mathbf{X}_{tr}$ as proxy training data and uses it for one feedforward and backward computation. With given training data $(\mathbf{X}_{tr},Y_{tr})$, we define $\mathcal{S}$ as the set of all proxy data sub-matrices that could be exposed to the model during training:

$$\mathcal{S}=\{(\mathbf{X}_{1},Y_{1}),(\mathbf{X}_{2},Y_{2}),\cdots,(\mathbf{X}_{m},Y_{m}),\cdots,(\mathbf{X}_{M},Y_{M})\},\quad\text{where}\;M\propto\mathcal{O}\left(\frac{d!}{(d-k)!\,k!}\right).$$

The training process can be seen as a sequence of operations, each of which samples a sub-matrix from $\mathcal{S}$ as proxy data in an i.i.d. manner and computes gradients for one SGD update (more discussions are in Appendix C). Define $A_{\mathcal{S}}$ as a learning algorithm trained on $\mathcal{S}$, which gives a trained model $h(\mathbf{X};\psi_{\mathcal{S}})$, abbreviated as $h_{\mathcal{S}}$. The generalization error $R(h_{\mathcal{S}})$ is defined as

$$R(h_{\mathcal{S}})=\mathbb{E}_{(\mathbf{X},Y)}\left[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))\right],\qquad(4)$$

where the expectation contains two stages of sampling: 1) a feature set $F=\{f_{j}\}$ is sampled according to $f_{j}\sim\mathcal{F}$, and 2) data $(\mathbf{X},Y)$ are sampled according to $(\mathbf{x}_{i},y_{i})\sim\mathcal{D}_{F}$. The empirical risk that our approach optimizes with the training data is

$$R_{emp}(h_{\mathcal{S}})=\frac{1}{M}\sum_{m=1}^{M}\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}})).\qquad(5)$$

We study the expected generalization gap

$$\mathbb{E}_{A}[R(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}})],\qquad(6)$$

where the expectation is taken over the randomness of $A_{\mathcal{S}}$ stemming from the sampling for SGD updates.

We assume that the loss function $l(y_{i},\hat{y}_{i})$ is Lipschitz-continuous and smooth w.r.t. the model output $\hat{y}_{i}$. Concretely, we have 1) $|l(y,f(\cdot))-l(y,f'(\cdot))|\leq\beta|f(\cdot)-f'(\cdot)|$ and 2) $|\nabla l(y,f(\cdot))-\nabla l(y,f'(\cdot))|\leq\beta'|\nabla f(\cdot)-\nabla f'(\cdot)|$. These conditions are satisfied by widely used loss functions such as cross-entropy and MSE. Then we have the following result (see Appendix C for the proof).

Theorem 1.

Assume the loss function is bounded by $l(y_{i},\hat{y}_{i})\leq\lambda$. For a learning algorithm trained on data $\{\mathbf{X}_{tr},Y_{tr}\}$ with $T$ iterations of SGD updates, with probability at least $1-\delta$, we have

$$\mathbb{E}_{A}[R(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}})]\leq\mathcal{O}\left(\frac{d^{T}}{M}\right)+\left(\mathcal{O}\left(\frac{d^{T}}{M^{2}}\right)+\lambda\right)\sqrt{\frac{\log(1/\delta)}{2M}}.\qquad(7)$$

The generalization gap depends on the number of raw features in the training data and the size of $\mathcal{S}$. The latter is determined by the configuration of proxy training data, in particular, the sampling over training features. If the sampling introduces more randomness (e.g., $k\approx d/2$, which maximizes the binomial coefficient), $\mathcal{S}$ becomes larger, contributing to a tighter gap. However, a large $\mathcal{S}$ also leads to large variance in training and amplifies optimization error. Therefore, there exists a trade-off w.r.t. how to sample/split observed features in the training stage. Furthermore, the generalization gap also depends on $d$, and a larger $d$ results in a looser bound (since one often has $T>d$). This is because a larger $d$ requires the model to deal with more features: as the network becomes wider and more complex, it is more prone to over-fitting.

4 Experiments

We apply our model FATE (short for FeATure Extrapolation networks) on real-world datasets. First, we consider six classification datasets from the UCI Machine Learning Repository uci-dataset : Gene, Protein, Robot, Drive, Calls and Github, collected from domains like biology, engineering and social networks. The feature numbers range from 219 to 4006 and the instance numbers vary from 1080 to 58509. We also consider two large-scale datasets, Avazu and Criteo, from real-world online advertisement systems, where the goal is to predict the Click-Through Rate (CTR) of advertisements exposed to users. The two datasets have $\sim$40 million clicking/non-clicking records as instances and $\sim$2 million features. More dataset information and implementation details are in Appendix D and E, respectively.

4.1 Experiment on UCI Datasets

Figure 5: Accuracy/ROC-AUC results for UCI datasets with $30\%\sim 80\%$ observed features for training. We run each experiment five times with different random seeds and report averaged scores.

Setup. We randomly split all the instances into training/validation/test data with the ratio 6:2:2. Then we randomly select a certain ratio ($30\%\sim 80\%$) of features as observed ones and use the remaining as unobserved. The model is trained with the observed features of training instances and tested with all the features of test instances. We adopt Accuracy as the metric for datasets with more than two classes (Gene, Protein, Robot, Drive and Calls) and ROC-AUC for Github, which has two classes.

Implementation. We specify FATE as follows. 1) Backbone: a 3-layer feedforward NN. 2) GNN: a 4-layer GCN. 3) Training: self-supervised learning with n-fold splits. Several baselines are considered for comparison, with architectures all specified as a 3-layer feedforward NN. First, Base-NN, Average-NN, Pooling-NN and KNN-NN are all trained on training instances with observed features. Base-NN only uses test instances' observed features for inference. Average-NN uses the averaged embeddings of observed features as the embeddings of unobserved ones. Pooling-NN (resp. KNN-NN) computes embeddings for unobserved features by replacing our GNN with mean pooling aggregation over neighborhoods (resp. KNN aggregation over all observed features). Furthermore, we consider Oracle-NN, which uses all the features of training data for training, and INL-NN, which is first trained on training data with observed features and then re-trained on training data with the remaining features.

Results and Discussions. Fig. 5 reports the mean Accuracy/ROC-AUC of five trials with ratios of observed features ranging from 0.3 to 0.8. FATE achieves on average $7.3\%$ higher Accuracy and $1.3\%$ higher ROC-AUC than Base-NN, which uses partial features for inference. The improvements are statistically significant at the $95\%$ confidence level. The results show that FATE can learn effective embeddings for new features that contribute to better classification performance. Furthermore, FATE achieves on average $29.8\%$ higher Accuracy and $0.5\%$ higher ROC-AUC than the baselines Average-NN, KNN-NN and Pooling-NN. These baselines perform worse than Base-NN especially when observed features are few, which suggests that directly aggregating embeddings of observed features for extrapolation degrades performance. By contrast, FATE possesses superior capability for extrapolating to new unseen features: even with only $30\%$ observed features, the model is able to distill useful knowledge from the $70\%$ unobserved features without re-training, providing decent classification performance. Notably, compared with INL-NN, FATE achieves higher accuracy in most cases, with a $4.1\%$ Accuracy improvement on average. The likely reason is that INL-NN is prone to over-fitting the new data and forgetting the previous data. Finally, FATE achieves performance very close to Oracle-NN when using sufficient observed features and can even slightly exceed it with $80\%$ features on Gene, Robot and Github. In fact, the GNN in FATE not only achieves feature extrapolation but also captures feature-level relations, which could be another merit of our method.

4.2 Experiment on CTR Prediction

Table 1: ROC-AUC results for 8 test splits (T1-T8) on Avazu and Criteo datasets.
Dataset Backbone Model T1 T2 T3 T4 T5 T6 T7 T8 Overall
Avazu NN Base 0.666 0.680 0.691 0.694 0.699 0.703 0.705 0.705 0.693 ± 0.012
Pooling 0.655 0.671 0.683 0.683 0.689 0.694 0.697 0.697 0.684 ± 0.011
FATE 0.689 0.699 0.708 0.710 0.715 0.720 0.721 0.721 0.710 ± 0.010
DeepFM Base 0.675 0.684 0.694 0.697 0.699 0.706 0.708 0.706 0.697 ± 0.009
Pooling 0.666 0.676 0.685 0.685 0.688 0.693 0.694 0.694 0.685 ± 0.009
FATE 0.692 0.702 0.711 0.714 0.718 0.722 0.724 0.724 0.713 ± 0.010
Criteo NN Base 0.761 0.761 0.763 0.763 0.765 0.766 0.766 0.766 0.764 ± 0.002
Pooling 0.761 0.762 0.764 0.763 0.766 0.767 0.768 0.768 0.765 ± 0.001
FATE 0.770 0.769 0.771 0.772 0.773 0.774 0.774 0.774 0.772 ± 0.001
DeepFM Base 0.772 0.771 0.772 0.772 0.774 0.774 0.774 0.774 0.773 ± 0.001
Pooling 0.772 0.772 0.773 0.774 0.776 0.776 0.776 0.776 0.774 ± 0.002
FATE 0.781 0.780 0.782 0.782 0.784 0.784 0.784 0.784 0.783 ± 0.001

Setup. We split each dataset in chronological order to simulate real-world cases. For the Avazu dataset, which contains time information over ten days, we use the data of the first/second day for training/validation and the data of the third to tenth days for test. For the Criteo dataset, whose records are given in temporal order, we split the dataset into ten contiguous subsets of equal size and use the first/second subset for training/validation and the third to tenth subsets for test. With such data splitting, we naturally obtain validation/test data with a mixture of features seen and unseen in the training data (the new features come from new values outside the known range of raw features). For Avazu/Criteo, there are $\sim$0.6/$\sim$1.3 million features in the training data, $\sim$0.2/$\sim$0.4 million new features in the validation data, and in total $\sim$1.1/$\sim$0.8 million new features across all the test splits. We use ROC-AUC as the evaluation metric.

Implementation Details. We consider two specifications for our backbone: 1) a 3-layer feedforward NN, and 2) DeepFM deepfm , a widely used model for advertisement click prediction that considers inter-feature interactions over an NN. The GNN is a 2-layer GraphSAGE model. The training method is inductive learning with k-shot samples and mini-batch partitioning. We also compare with the baselines Base and Pooling; the KNN method would suffer from scalability issues on these two large datasets.

Results and Discussions. Table 1 reports the ROC-AUC results for the different test splits, which show that FATE significantly outperforms Base and Pooling in all test splits with both NN and DeepFM as backbones. Overall, compared with Base, FATE improves the ROC-AUC by 0.017/0.016 (resp. 0.008/0.01) with NN/DeepFM as the backbone on Avazu (resp. Criteo). Note that even an improvement of 0.004 in ROC-AUC is considered significant in click prediction tasks wd ; deepfm . The results show that FATE can combine useful information from new features collected in the future for the target task without further training, and has promising power for enhancing real-world systems interacting with the open world. Compared with Pooling, FATE achieves significant AUC improvements on both datasets. The reason is that directly using average pooling in place of the GNN convolution leads to limited capacity and weakens the ability for effective concept abstraction and reasoning.

4.3 Further Discussions

Scalability. We study the model's scalability w.r.t. different batch sizes $B$ and numbers of features $D$ in Fig. 6 and 7, which show that our training and inference time/space scale linearly on the Criteo dataset.

Ablation Studies. Table 3 also provides ablation studies, which show that 1) the DropEdge operation helps regularize training and yields higher test accuracy; 2) using asynchronous updates for the two networks leads to performance gains over joint training; 3) n-fold splitting and k-shot sampling outperform each other in different cases, and both exceed leave-one-out partitioning. Table 4 compares different sampling sizes $k$ for the inductive learning approach on Avazu and Criteo, which verifies our theoretical analysis in Section 3. See more discussions in Appendix F.

Visualization. Fig. 6 visualizes the feature embeddings produced by FATE-NN and Oracle-NN. It unveils two interesting insights that can explain why FATE can sometimes outperform Oracle, which uses all the features for training. First, FATE-NN's embeddings for observed and unobserved features possess more dissimilar distributions in latent space than Oracle-NN's. Recall that feature embeddings are used by the backbone network to compute the intermediate hidden representation of each instance; this phenomenon implies that FATE manages to extract more informative knowledge from new features. Second, the embeddings of FATE-NN form particular structures (clusters, lines or curves) rather than distributing uniformly over the 2-D plane like Oracle-NN's. The reason is that the GNN leverages locality structures among features for further abstraction, which explicitly encodes feature-level relations and could help downstream classification.

Figure 6: Visualization of t-SNE embeddings of the feature embeddings produced by FATE-NN and Oracle-NN. We mark observed features in red and the remaining ones in blue.

5 Connection to Other Learning Paradigms

Our introduced problem setting, open-world feature extrapolation (OFE), can be treated as an instantiation of out-of-distribution generalization ood-classic-2 ; ood-classic-3 or the domain shift problem, focusing on distribution shift caused by feature space expansion. We next discuss the relationships of our problem setting and our model FATE with domain adaptation (DA), continual learning (CL), open-set learning (OSL) and zero-shot learning (ZSL). In general, OFE is orthogonal to these problems and opens a new direction that can potentially have promising intersections with the well-established ones.

Domain Adaptation adapts a model trained on a source domain to a target domain with a different distribution da-old1 ; da-old2 ; da-old3 ; da-nn1 ; da-nn2 ; da-nn3 ; da-nn4 . Our problem OFE differs from DA in two aspects: 1) the label space/distribution of training and test data is the same for OFE, while DA mostly considers different label distributions for source and target domains; 2) OFE focuses on combining new features that are related to the current task, while DA considers different tasks from different domains.

Continual Learning, or lifelong learning, aims at enabling a single model to learn from a stream of data from different tasks that cannot be seen at one time cl-old1 ; cl-1 ; cl-2 ; cl-survey . By contrast, OFE differs in two respects. First, OFE does not allow fine-tuning or further re-training on new data, which can be more challenging than CL. Second, CL mostly assumes each piece of data in the stream comes from a different task with different labels to handle. The key challenge of CL is catastrophic forgetting cl-old1 ; cl-survey , which requires the model to balance a trade-off between previous and new tasks, while FATE is naturally free from this issue since it does not require incremental learning.

Open-set Learning is another line of research related to ours. Differently, open-set recognition mostly focuses on expansion of the label set openset-cv1 ; openset-cv2 ; openset-cv3 ; openset-cv4 ; openset-cv5 ; openset-nlp . To our knowledge, we are the first to study feature set expansion, formulate it as OFE and solve it via graph learning.

Zero-shot Learning. Our problem setting is also linked with few-shot/zero-shot learning. In NLP domains, some studies focus on dealing with rare entities exposed a limited number of times or new entities unseen during training nlp-1 ; nlp-3 . Similar problems are also encountered and explored in cold-start recommender systems, where there are new users/items unseen before recsys-1 ; recsys-2 ; recsys-3 . These works commonly aim at inferring the embeddings of new entities based on some 'held-out' ones. With a similar end but a distinct methodological aspect, a recent study recsys-3 explores a new possibility via learning a latent graph between existing entities (users) and newly arrived ones through an attention mechanism for inductive representation learning. The core technical contributions of our work lie in the unique problem setting, which stays focused on feature space expansion (domain shift), and the proposed feature-level sampling/partition training strategies, backed up by our theoretical insights.

Extension and Outlooks. Our work can be extended to solve more problems and push development in broader areas. First, the input feature vectors can be replaced by feature maps given by a CNN or word/sentence embeddings given by a Transformer transformer , for handling multi-modal data in the context of federated learning mm-2 or multi-view learning mm-4 ; mm-6 ; mm-7 . Admittedly, our formulation assumes multi-hot feature vectors as input, which is a common practice for handling attribute features but not often the case for other data formats (like vision or text). For practitioners who would like to apply FATE to broader areas, one can extend our graph representation and GNN model to directly handle continuous features by treating feature values as edge weights, as done by graphlearning-missing . Second, as shown in our experiments, the classification layers can be replaced by more complex models with inter-feature interactions ffm ; xdeepfm ; nfm ; fwl or advanced architectures danser ; featevolve for more thorough feature-wise interaction and improved expressiveness.

6 Conclusion

We present a new framework to address features unseen in training, formulating this as the open-world feature extrapolation problem. We target the problem via graph representation learning: we treat observed data as a feature-data graph and harness a GNN to inductively compute embeddings for new features from those of existing ones, mimicking the abstraction and reasoning processes of the human brain. We also propose two training strategies for effective feature extrapolation learning. Our theoretical results show that the generalization error depends on the training features and the learning algorithm. Experiments verify the effectiveness of our method and its scalability to large-scale systems.

Potential Societal Impacts. When learning a mapping from features to labels, the model is at risk of focusing on dominant features from majority groups and ignoring scarce features from minority ones. An important extension of much significance is to develop debiased methods for feature extrapolation. We believe AI models can be guided to promote social justice and well-being.

References

  • [1] A. Asuncion and D. Newman. UCI machine learning repository. 2007.
  • [2] B. B. Avants, N. J. Tustison, and J. R. Stone. Similarity-driven multi-view embeddings from high-dimensional biomedical data. Nature Computational Science, 1:143–152, 2021.
  • [3] D. Baptista, P. G. Ferreira, and M. Rocha. Deep learning for drug response prediction in cancer. Briefings Bioinform, 22(1):360–379, 2021.
  • [4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [5] A. Bendale and T. E. Boult. Towards open world recognition. In Conference on Computer Vision and Pattern Recognition, pages 1893–1902, 2015.
  • [6] A. Bendale and T. E. Boult. Towards open set deep networks. In Conference on Computer Vision and Pattern Recognition, pages 1563–1572, 2016.
  • [7] G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems (NeurIPS), pages 2178–2186, 2011.
  • [8] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129–136, 2007.
  • [9] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
  • [10] C. Higuera, K. J. Gardiner, and K. J. Cichocki. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 10(6), 2015.
  • [11] W. Chen, M. Chang, E. Schlinger, W. Y. Wang, and W. W. Cohen. Open question answering over tables and text. In International Conference on Learning Representations, 2021.
  • [12] W. Chen, H. Zha, Z. Chen, W. Xiong, H. Wang, and W. Y. Wang. Hybridqa: A dataset of multi-hop question answering over tabular and textual data. In Conference on Empirical Methods in Natural Language Processing, pages 1026–1036, 2020.
  • [13] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & deep learning for recommender systems. In Workshop on Deep Learning for Recommender Systems, pages 7–10, 2016.
  • [14] P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Conference on Recommender Systems, pages 191–198, 2016.
  • [15] Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
  • [16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17:59:1–59:35, 2016.
  • [17] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He. Deepfm: A factorization-machine based neural network for CTR prediction. In International Joint Conference on Artificial Intelligence, pages 1725–1731, 2017.
  • [18] W. L. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • [19] B. Hao, J. Zhang, H. Yin, C. Li, and H. Chen. Pre-training graph neural networks for cold-start users and items representation. In ACM International Conference on Web Search and Data Mining, pages 265–273, 2021.
  • [20] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.
  • [21] M. Hassen and P. K. Chan. Learning a neural-network-based representation for open set recognition. In International Conference on Data Mining, pages 154–162, 2020.
  • [22] X. He and T. Chua. Neural factorization machines for sparse predictive analytics. In International Conference on Research and Development in Information Retrieval, pages 355–364, 2017.
  • [23] Y. Juan, Y. Zhuang, W. Chin, and C. Lin. Field-aware factorization machines for CTR prediction. In Conference on Recommender Systems, pages 43–50, 2016.
  • [25] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  • [26] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
  • [27] Z. Li, J. Zhang, Y. Gong, Y. Yao, and Q. Wu. Field-wise learning for multi-field categorical data. In Advances in Neural Information Processing Systems, pages 6639–6649, 2020.
  • [28] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In International Conference on Knowledge Discovery & Data Mining, pages 1754–1763, 2018.
  • [29] P. P. Liang, T. Liu, Z. Liu, R. Salakhutdinov, and L. Morency. Think locally, act globally: Federated learning with local and global representations. CoRR, abs/2001.01523, 2020.
  • [30] L. Logeswaran, M. Chang, K. Lee, K. Toutanova, J. Devlin, and H. Lee. Zero-shot entity linking by reading entity descriptions. In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Conference of the Association for Computational Linguistics, pages 3449–3460, 2019.
  • [31] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 1647–1657, 2018.
  • [32] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory, 2009.
  • [33] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning (ICML), pages 10–18, 2013.
  • [34] P. Oza and V. M. Patel. C2AE: class conditioned auto-encoder for open-set recognition. In Conference on Computer Vision and Pattern Recognition, pages 2307–2316. Computer Vision Foundation / IEEE, 2019.
  • [35] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
  • [36] G. I. Parisi, J. Tani, C. Weber, and S. Wermter. Lifelong learning of human actions with deep neural network self-organization. Neural Networks, 96:137–149, 2017.
  • [37] S. Rendle. Factorization machines. In International Conference on Data Mining, pages 995–1000, 2010.
  • [38] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In The Web Conference, pages 521–530, 2007.
  • [39] A. V. Robins. Catastrophic forgetting in neural networks: the role of rehearsal mechanisms. In First New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, ANNES ’93, Dunedin, New Zealand, November 24-26, 1993, pages 65–68. IEEE, 1993.
  • [40] Y. Rong, W. Huang, T. Xu, and J. Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2020.
  • [41] B. Rozemberczki, C. Allen, and R. Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2), 2021.
  • [42] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
  • [43] T. Schick and H. Schütze. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. In AAAI Conference on Artificial Intelligence, pages 8766–8774, 2020.
  • [44] L. Shu, H. Xu, and B. Liu. DOC: deep open classification of text documents. In Conference on Empirical Methods in Natural Language Processing, pages 2911–2916, 2017.
  • [45] Y. H. Tsai, P. P. Liang, A. Zadeh, L. Morency, and R. Salakhutdinov. Learning factorized multimodal representations. In International Conference on Learning Representations. OpenReview.net, 2019.
  • [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [47] S. Verma and Z. Zhang. Stability and generalization of graph convolutional neural networks. In International Conference on Knowledge Discovery & Data Mining, pages 1539–1548, 2019.
  • [48] Q. Wu, L. Jiang, X. Gao, X. Yang, and G. Chen. Feature evolution based multi-task learning for collaborative filtering with social trust. In International Joint Conference on Artificial Intelligence, pages 3877–3883, 2019.
  • [49] Q. Wu, H. Zhang, X. Gao, P. He, P. Weng, H. Gao, and G. Chen. Dual graph attention networks for deep latent representation of multifaceted social effects in recommender systems. In The Web Conference, pages 2091–2102, 2019.
  • [50] Q. Wu, H. Zhang, X. Gao, J. Yan, and H. Zha. Towards open-world recommendation: An inductive model-based collaborative filtering approach. In International Conference on Machine Learning, pages 11329–11339, 2021.
  • [51] T. Wu, E. K. I. Chio, H. Cheng, Y. Du, S. Rendle, D. Kuzmin, R. Agarwal, L. Zhang, J. R. Anderson, S. Singh, T. Chandra, E. H. Chi, W. Li, A. Kumar, X. Ma, A. Soares, N. Jindal, and P. Cao. Zero-shot heterogeneous transfer learning from recommender systems to cold-start search retrieval. In ACM International Conference on Information and Knowledge Management, pages 2821–2828, 2020.
  • [52] M. Xu, R. Jin, and Z. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, pages 2301–2309, 2013.
  • [53] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In International Conference on Knowledge Discovery & Data Mining, pages 974–983, 2018.
  • [54] R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura. Classification-reconstruction learning for open-set recognition. In Conference on Computer Vision and Pattern Recognition, pages 4016–4025, 2019.
  • [55] J. You, X. Ma, D. Y. Ding, M. J. Kochenderfer, and J. Leskovec. Handling missing data with graph representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [56] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.
  • [57] H. Zhao, S. Zhang, G. Wu, J. M. F. Moura, J. P. Costeira, and G. J. Gordon. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pages 8568–8579, 2018.

Appendix A More Discussions for Permutation-Invariant Property of Embedding Layers

In Section 2.1, we mentioned that the embedding layer in the backbone network can be equivalently seen as a combination of embedding lookup and a sum aggregation which is permutation-invariant w.r.t. the order of input features. We provide an illustration for this in Fig. 2.

Equivalence between Concatenation and Sum Aggregation. To support the remark in Section 2.1, we next illustrate the equivalence between concatenation of features' embeddings and sum aggregation/pooling over features' embeddings. Assume we have feature embeddings \{\mathbf{z}_{i}^{m}\}_{m=1}^{d} for instance \mathbf{x}_{i}. We concatenate all the embeddings into a vector \overline{\mathbf{z}_{i}}=[\mathbf{z}_{i}^{1},\mathbf{z}_{i}^{2},\cdots,\mathbf{z}_{i}^{d}] and feed it into a neural layer to obtain \mathbf{c}_{i}=\overline{\mathbf{z}_{i}}\mathbf{W}^{\prime}. Notice that the weight matrix \mathbf{W}^{\prime}\in\mathbb{R}^{dH\times H} can be decomposed into d sub-matrices \{\mathbf{W}^{\prime}_{m}\}_{m=1}^{d} where \mathbf{W}^{\prime}_{m}=\mathbf{W}^{\prime}[(m-1)H:mH,:]\in\mathbb{R}^{H\times H}, so that \overline{\mathbf{z}_{i}}\mathbf{W}^{\prime}=\sum_{m=1}^{d}\mathbf{z}_{i}^{m}\mathbf{W}^{\prime}_{m}, i.e., a sum aggregation over per-feature transformed embeddings. In particular, when the sub-matrices are tied, \mathbf{W}^{\prime}_{m}\equiv\mathbf{W}^{\prime\prime}\in\mathbb{R}^{H\times H}, this reduces exactly to sum aggregation/pooling \mathbf{z}_{i}=\sum_{m=1}^{d}\mathbf{z}_{i}^{m} followed by a fully-connected layer, since \overline{\mathbf{z}_{i}}\mathbf{W}^{\prime}=\sum_{m=1}^{d}\mathbf{z}_{i}^{m}\mathbf{W}^{\prime\prime}=\mathbf{z}_{i}\mathbf{W}^{\prime\prime}. Hence, concatenation plus a fully-connected layer is equivalent to sum pooling plus a fully-connected layer. This observation indicates that our reasoning in the main text applies to general neural network-based models for attribute features and enables them to handle input vectors with variable-length features.
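To make the equivalence concrete, here is a minimal NumPy sketch (an illustration added for clarity, not part of the released FATE code) that numerically checks both identities; the shapes d and H are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 4, 8                                   # number of features, embedding size
Z = rng.normal(size=(d, H))                   # feature embeddings {z_i^m}
W_prime = rng.normal(size=(d * H, H))         # FC weight over the concatenation

# Concatenation + fully-connected layer: z_bar @ W'.
c_concat = Z.reshape(-1) @ W_prime

# Sum of per-block transforms: sum_m z^m @ W'_m with W'_m = W'[(m-1)H:mH, :].
blocks = W_prime.reshape(d, H, H)
c_blocks = sum(Z[m] @ blocks[m] for m in range(d))
assert np.allclose(c_concat, c_blocks)

# With tied blocks W'_m == W'', this reduces to sum pooling + FC.
W_tied = rng.normal(size=(H, H))
assert np.allclose(Z.reshape(-1) @ np.tile(W_tied, (d, 1)),
                   Z.sum(axis=0) @ W_tied)
```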

Appendix B Training Algorithms

We present the training algorithms for our model in Alg. 1 where the model is trained end-to-end via self-supervised learning or inductive learning approaches.

INPUT: \mathbf{X}_{tr}=\{\mathbf{x}_{i}\}_{i\in I_{tr}}, training data matrix; F_{tr}=\{f_{j}\}, training feature set; \omega^{(0)}, initial parameters of the GNN; \theta^{(0)}=[\phi^{(0)},\mathbf{W}^{(0)}], initial parameters of the backbone network (where \phi^{(0)} denotes the classifier parameters and \mathbf{W}^{(0)} the initial feature embeddings); n, split number; k, sampling size; \rho, DropEdge ratio; \alpha_{s},\alpha_{f}, learning rates.
for t=1,2,\cdots,T_{max} do
      Sample a mini-batch \mathbf{X}^{b}=\{\mathbf{x}_{i}\}_{i\in I_{b}} from \mathbf{X}_{tr}=\{\mathbf{x}_{i}\}_{i\in I_{tr}} // for large datasets; otherwise use \mathbf{X}_{tr} directly
      if using self-supervised learning then
            Shuffle F_{tr}=\{f_{j}\} and split it into n disjoint subsets \{\overline{F}_{s}\}_{s=1}^{n}
            for s=1,\cdots,n do
                  \mathbf{W}^{(t)}[f_{j}]\leftarrow\mathbf{0} for f_{j}\in\overline{F}_{s}
                  \tilde{\mathbf{X}}^{b}=\textsc{DropEdge}(\mathbf{X}^{b},\rho)
                  Feed \mathbf{W}^{(t)} and \tilde{\mathbf{X}}^{b} into the GNN and obtain \hat{\mathbf{W}}^{(t)}=g(\mathbf{W}^{(t)},\tilde{\mathbf{X}}^{b};\omega^{(t)})
                  \hat{\mathbf{W}}^{(t)}[f_{j}]\leftarrow\mathbf{W}^{(t)}[f_{j}] for f_{j}\in F_{tr}\setminus\overline{F}_{s}
                  Feed \{\mathbf{x}_{i}\}_{i\in I_{b}} into the backbone network and obtain \{\hat{y}_{i}\}_{i\in I_{b}} by \hat{y}_{i}=h(\mathbf{x}_{i};\phi^{(t)},\hat{\mathbf{W}}^{(t)})
                  Compute the loss \mathcal{L}_{s}(Y^{b},\hat{Y}^{b})=\frac{1}{|I_{b}|}\sum_{i\in I_{b}}l(\hat{y}_{i},y_{i})
                  Update \theta^{(t+1)}\leftarrow\theta^{(t)}-\alpha_{f}\nabla_{\theta}\mathcal{L}_{s}(Y^{b},\hat{Y}^{b})
            Update \omega^{(t+1)}\leftarrow\omega^{(t)}-\alpha_{s}\sum_{s=1}^{n}\nabla_{\omega}\mathcal{L}_{s}(Y^{b},\hat{Y}^{b})
      if using inductive learning then
            for s=1,\cdots,n do
                  Sample k raw features, extract the induced feature subset F_{s} from F_{tr}=\{f_{j}\}, and extract the corresponding columns \mathbf{X}^{b}_{s} from \mathbf{X}^{b}
                  \tilde{\mathbf{X}}^{b}_{s}=\textsc{DropEdge}(\mathbf{X}^{b}_{s},\rho)
                  Feed \mathbf{W}^{(t)} and \tilde{\mathbf{X}}^{b}_{s} into the GNN and obtain \hat{\mathbf{W}}^{(t)}=g(\mathbf{W}^{(t)},\tilde{\mathbf{X}}^{b}_{s};\omega^{(t)})
                  Feed \{\mathbf{x}_{i}\}_{i\in I_{b}} into the backbone network and obtain \{\hat{y}_{i}\}_{i\in I_{b}} by \hat{y}_{i}=h(\mathbf{x}_{i};\phi^{(t)},\hat{\mathbf{W}}^{(t)})
                  Compute the loss \mathcal{L}_{s}(Y^{b},\hat{Y}^{b})=\frac{1}{|I_{b}|}\sum_{i\in I_{b}}l(\hat{y}_{i},y_{i})
                  Update \theta^{(t+1)}\leftarrow\theta^{(t)}-\alpha_{f}\nabla_{\theta}\mathcal{L}_{s}(Y^{b},\hat{Y}^{b})
            Update \omega^{(t+1)}\leftarrow\omega^{(t)}-\alpha_{s}\sum_{s=1}^{n}\nabla_{\omega}\mathcal{L}_{s}(Y^{b},\hat{Y}^{b})
OUTPUT: \theta=[\phi,\mathbf{W}], trained parameters of the backbone network; \omega, trained parameters of the GNN.
Algorithm 1 Training algorithm for feature extrapolation networks (FATE).
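For concreteness, below is a condensed PyTorch-style sketch of one training iteration of Alg. 1 (self-supervised branch). It is an illustrative re-implementation under our own simplifying assumptions, not the released FATE code: `gnn` and `backbone` stand for g and h as generic nn.Modules, `opt_f` optimizes the backbone parameters \theta=[\phi,\mathbf{W}], and `opt_s` optimizes the GNN parameters \omega.

```python
import torch
import torch.nn.functional as F

def drop_edge(X, rho):
    # Randomly remove a fraction rho of the edges (1-entries) of the 0-1 matrix X.
    return X * (torch.rand_like(X) > rho)

def train_iteration(X_b, y_b, W, gnn, backbone, opt_f, opt_s, n=5, rho=0.5):
    opt_s.zero_grad()                          # GNN grads accumulate over the n splits
    for masked in torch.randperm(W.shape[0]).chunk(n):
        W_in = W.clone()
        W_in[masked] = 0.0                     # mask embeddings of the s-th split
        W_hat = gnn(W_in, drop_edge(X_b, rho)).clone()  # extrapolate via message passing
        keep = torch.ones(W.shape[0], dtype=torch.bool)
        keep[masked] = False
        W_hat[keep] = W_in[keep]               # keep observed embeddings unchanged
        loss = F.binary_cross_entropy(backbone(X_b, W_hat), y_b)
        opt_f.zero_grad()                      # fast (per-split) update for the backbone
        loss.backward()
        opt_f.step()
    opt_s.step()                               # one slow update for the GNN
```

Stepping `opt_s` once after the inner loop, without zeroing its gradients in between, realizes the accumulated-loss update \omega^{(t+1)}\leftarrow\omega^{(t)}-\alpha_{s}\sum_{s}\nabla_{\omega}\mathcal{L}_{s}.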

Appendix C Analysis of Generalization Error

We provide a complete discussion and proof for the analysis of the generalization error of our approach. Some notations are redefined in order to keep this section self-contained. Recall that we focus our analysis on the case of inductive learning with the k-shot sampling approach. Also, we simplify the model as follows: 1) the backbone network is a two-layer FFN (an embedding layer \mathbf{W} plus a fully-connected layer \Phi) with sigmoid output; 2) the GNN is an L-layer GCN that uses mean-pooling aggregation over neighboring nodes without linear transformation or non-linearity in each layer; 3) the training algorithm is SGD.

Derivation for model function. With our settings in Section 3, we write the model as

\hat{y}_{i}=h(\mathbf{x}_{i};\phi,\hat{\mathbf{W}})=h(\mathbf{x}_{i};\phi,g(\mathbf{W},\mathbf{X};\omega))=\sigma\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\mathbf{W}\Phi\right), \quad (8)

where \mathcal{N}_{\tilde{L}}(i) is the set of \mathbf{x}_{i^{\prime}}'s that appear in the \tilde{L}-hop neighborhood of \mathbf{x}_{i} in the feature-data graph, \tilde{L}=2\cdot\lfloor\frac{L}{2}\rfloor, and c^{L}_{ii^{\prime}} is a weight that quantifies the influence of \mathbf{x}_{i^{\prime}} on \mathbf{x}_{i} through L layers of mean-pooling graph convolution. Here we provide the detailed derivation. In fact, the embedding layer in the backbone can be seen as a one-layer GCN convolution using sum pooling without linear transformation or non-linearity, which can be denoted by \hat{Y}=\sigma(\mathbf{Z}\Phi)=\sigma(\mathbf{X}\hat{\mathbf{W}}\Phi) where \mathbf{Z}=\{\mathbf{z}_{i}\} (recall that \mathbf{X} is treated as an adjacency matrix of the feature-data graph in our model). The GNN model, an L-layer GCN with mean pooling and without linear transformation or non-linearity in each layer, can be denoted by \hat{\mathbf{W}}=(\mathbf{D}_{out}^{-1}\mathbf{X}^{\top}\mathbf{D}_{in}^{-1}\mathbf{X})^{\lfloor L/2\rfloor}\mathbf{W}, where \mathbf{D}_{in}=diag(\{d_{in,i}\}_{i=1}^{N}) with d_{in,i}=\sum_{j\in F}x_{ij} and \mathbf{D}_{out}=diag(\{d_{out,j}\}_{j=1}^{F}) with d_{out,j}=\sum_{i\in I}x_{ij}. Hence, we have \mathbf{Z}=\mathbf{X}(\mathbf{D}_{out}^{-1}\mathbf{X}^{\top}\mathbf{D}_{in}^{-1}\mathbf{X})^{\lfloor L/2\rfloor}\mathbf{W}=(\mathbf{D}_{out}^{-1}\mathbf{D}_{in}^{-1})^{\lfloor L/2\rfloor}\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{\lfloor L/2\rfloor}\mathbf{W}=(\mathbf{D}_{out}^{-1}\mathbf{D}_{in}^{-1})^{\lfloor L/2\rfloor}(\mathbf{X}\mathbf{X}^{\top})^{\lfloor L/2\rfloor}\mathbf{X}\mathbf{W}. Let (\mathbf{D}_{out}^{-1}\mathbf{D}_{in}^{-1})^{\lfloor L/2\rfloor}(\mathbf{X}\mathbf{X}^{\top})^{\lfloor L/2\rfloor}=\mathbf{C}_{L}\in\mathbb{R}^{N\times N} with \mathbf{C}_{L}=\{c_{ii^{\prime}}^{L}\}, where c_{ii^{\prime}}^{L} denotes a weight that quantifies the influence between instances i and i^{\prime} through the L-layer GCN. Converting the global view of graph convolution into a local view over each node's ego-network, we obtain \hat{y}_{i}=\sigma(\mathbf{z}_{i}\Phi)=\sigma\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\mathbf{W}\Phi\right).
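As a sanity check on this closed form, the simplified model can be computed directly in NumPy; the toy shapes below are arbitrary and the sketch only illustrates the analyzed architecture, not the full FATE model.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(X, W, Phi, L):
    """X: (N, D) 0-1 data matrix, W: (D, H) feature embeddings, Phi: (H, 1)."""
    d_in = X.sum(axis=1, keepdims=True)        # D_in: instance (row) degrees
    d_out = X.sum(axis=0, keepdims=True).T     # D_out: feature (column) degrees
    d_in[d_in == 0] = 1.0                      # guard against isolated nodes
    d_out[d_out == 0] = 1.0
    W_hat = W
    for _ in range(L // 2):                    # floor(L/2) feature->instance->feature rounds
        W_hat = (X.T @ ((X @ W_hat) / d_in)) / d_out
    return sigmoid(X @ W_hat @ Phi)            # embedding layer + classifier

rng = np.random.default_rng(0)
X = (rng.random((6, 5)) < 0.5).astype(float)   # toy feature-data graph
y_hat = predict(X, rng.normal(size=(5, 8)), rng.normal(size=(8, 1)), L=4)
```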

With the given training data (\mathbf{X}_{tr},Y_{tr}), we define \mathcal{S} as the set of all data sub-matrices that could be sampled and exposed to the model during training:

\mathcal{S}=\{(\mathbf{X}_{1},Y_{1}),(\mathbf{X}_{2},Y_{2}),\cdots,(\mathbf{X}_{m},Y_{m}),\cdots,(\mathbf{X}_{M},Y_{M})\}.

The SGD training can be seen as a sequence of operations, each of which picks an instance from \mathcal{S} in an i.i.d. manner as a proxy training datum and leverages it to compute the updating gradient. We further introduce \mathcal{S}^{\setminus m}, which removes the m-th sub-matrix, and \mathcal{S}^{m}, which replaces the m-th sub-matrix with another one. Specifically, we have

\mathcal{S}^{\setminus m}=\{(\mathbf{X}_{1},Y_{1}),\cdots,(\mathbf{X}_{m-1},Y_{m-1}),(\mathbf{X}_{m+1},Y_{m+1}),\cdots,(\mathbf{X}_{M},Y_{M})\},
\mathcal{S}^{m}=\{(\mathbf{X}_{1},Y_{1}),\cdots,(\mathbf{X}_{m-1},Y_{m-1}),(\mathbf{X}_{m^{\prime}},Y_{m^{\prime}}),(\mathbf{X}_{m+1},Y_{m+1}),\cdots,(\mathbf{X}_{M},Y_{M})\}.

Justification of the i.i.d. Sampling. In fact, for our inductive learning approach in Section 2.2, the observed features for each proxy datum are randomly sampled. One round of feature-level sampling can be seen as k i.i.d. draws from all the raw features without replacement. Denote by C=\{c_{n}\}_{n=1}^{d} the set of all raw features in the training set, by \mathcal{K} a set of k distinct indices in \{1,\cdots,d\}, and by C_{\mathcal{K}} the subset of raw features with indices from \mathcal{K}. Obviously, there are {d\choose k} different configurations for \mathcal{K} (or C_{\mathcal{K}}) in total. We can thus equivalently treat one round of feature-level sampling as a single i.i.d. draw from a set of candidates \{\mathcal{K}_{m}\} containing {d\choose k} index sets, where each index set \mathcal{K}_{m} contains k indices from \{1,\cdots,d\}. Next we discuss two cases.

1) If we do not consider instance-level mini-batch partition, then the set \mathcal{S} consists of M={d\choose k} sub-matrices. Specifically, the m-th sub-matrix \mathbf{X}_{m} is induced by C_{\mathcal{K}_{m}}, i.e., it extracts the columns of \mathbf{X}_{tr} corresponding to the 0-1 features generated by the raw features in C_{\mathcal{K}_{m}}.

2) If we use instance-level mini-batch partition, the case becomes a bit more complicated. First of all, the instance-level partition is not a strictly i.i.d. sampling process over training instances, since there exists dependency among different mini-batches within one epoch. Yet, in practice, the batch size B is very large (e.g., B=100000 in our experiments), so the number of mini-batches in one epoch is small, which allows us to neglect the dependency within one epoch. Furthermore, since the instance-level selection is independent of the feature-level sampling, the whole sampling process for proxy data can be seen as a series of i.i.d. draws over the {N\choose B}\times{d\choose k} sub-matrices of \mathbf{X}_{tr}, which constitute the set \mathcal{S} in this case. A minimal sampling sketch is given below.
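In the sketch, `columns_of` (a hypothetical mapping from each raw feature to the indices of its induced 0-1 columns) is our own notational assumption for illustration.

```python
import numpy as np

def sample_proxy(X_tr, columns_of, B, k, rng):
    # One i.i.d. draw from S: a mini-batch of B instances crossed with the
    # 0-1 columns induced by k randomly chosen raw features.
    N, d = X_tr.shape[0], len(columns_of)
    rows = rng.choice(N, size=B, replace=False)          # instance-level mini-batch
    raw = rng.choice(d, size=k, replace=False)           # k-shot raw-feature sampling
    cols = np.concatenate([columns_of[j] for j in raw])  # induced feature set F_s
    return X_tr[np.ix_(rows, cols)]
```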

Next, we recall the generalization gap of our interest. The generalization error R(h_{\mathcal{S}}) is defined as

R(h_{\mathcal{S}})=\mathbb{E}_{(\mathbf{X},Y)}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))], \quad (9)

where the expectation involves two stages of sampling: 1) a feature set F=\{f_{j}\} is sampled according to f_{j}\sim\mathcal{F}, and 2) data (\mathbf{X},Y) are sampled according to (\mathbf{x}_{i},y_{i})\sim\mathcal{D}_{F}. The empirical risk that our approach optimizes with the training data is

R_{emp}(h_{\mathcal{S}})=\frac{1}{M}\sum_{m=1}^{M}\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}})). \quad (10)

Then the expected generalization gap would be

\mathbb{E}_{A}[R(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}})], \quad (11)

where the expectation is taken over the randomness of A_{\mathcal{S}} that stems from the sampling in SGD.

We next prove the result of Theorem 1 in our main text. Our proof is based on algorithmic stability analysis [9], following similar lines of reasoning as [20, 47]. The main idea of stability analysis is to bound the output difference of a loss function under a single data point perturbation. Differently, in our case, the 'data point' is a data sub-matrix in \mathcal{S}. Therefore, our proof can be seen as an extension of stability analysis to matrix data or graphs as input. The proof is divided into two parts. First, we derive a generalization error bound conditioned on the \gamma-uniform stability of the learning algorithm; then we prove the bound for \gamma based on our model architecture and SGD training.

C.1 Generalization error with uniform stability condition

We first introduce the definition of uniform stability for a randomized learning algorithm as a building block of our proof. A randomized learning algorithm A_{\mathcal{S}} is \gamma-uniformly stable with regard to a loss function \mathcal{L} if it satisfies

\sup_{\mathcal{S},(\mathbf{X},Y)}|\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))]-\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{\setminus m}}))]|\leq\gamma. \quad (12)

We first prove a generalization bound using the uniform stability as a condition and then we prove that the learning algorithm in our case satisfies the condition.

Theorem 2.

Assume a randomized algorithm A_{\mathcal{S}} is \gamma-uniformly stable with a bounded loss function 0\leq\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))\leq\lambda. Then with probability at least 1-\delta (0<\delta<1) over the random draw of \mathcal{S}, we have

\mathbb{E}_{A}[R(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}})]\leq 2\gamma+(4M\gamma+\lambda)\sqrt{\frac{\log\frac{1}{\delta}}{2M}}. \quad (13)
Proof.

Using the triangle inequality, the stability property in (12) yields

\begin{split}
&\sup_{\mathcal{S},(\mathbf{X},Y)}|\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))]-\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{m}}))]|\\
\leq&\sup_{\mathcal{S},(\mathbf{X},Y)}|\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))]-\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{\setminus m}}))]|\\
+&\sup_{\mathcal{S},(\mathbf{X},Y)}|\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{m}}))]-\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{\setminus m}}))]|\\
\leq&\;2\gamma.
\end{split} \quad (14)

We will use McDiarmid's concentration inequality in the following proof. Let \mathbf{Z} be a space of random variables and f:\mathbf{Z}^{M}\rightarrow\mathbb{R}. If f satisfies

\sup_{z_{1},\cdots,z_{m},\cdots,z_{M},z_{m}^{\prime}}|f(z_{1},\cdots,z_{m},\cdots,z_{M})-f(z_{1},\cdots,z_{m}^{\prime},\cdots,z_{M})|\leq c_{m}, \quad (15)

then we have

P(f(z_{1},\cdots,z_{M})-\mathbb{E}_{z_{1},\cdots,z_{M}}[f(z_{1},\cdots,z_{M})]\geq\epsilon)\leq\exp\left(-\frac{2\epsilon^{2}}{\sum_{m=1}^{M}c_{m}^{2}}\right). \quad (16)

Recall that the data (\mathbf{X}_{m},Y_{m}) are assumed to be i.i.d. sampled, so we have (writing \mathbf{O}=(\mathbf{X},Y))

\begin{split}
&\mathbb{E}_{\mathcal{S}}[\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}}))]\\
=&\int\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}}))p(\mathbf{O}_{1},\cdots,\mathbf{O}_{M})d\mathbf{O}_{1}\cdots d\mathbf{O}_{M}\\
=&\int\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}}))p(\mathbf{O}_{1})\cdots p(\mathbf{O}_{M})d\mathbf{O}_{1}\cdots d\mathbf{O}_{M}\\
=&\int\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))p(\mathbf{O}_{1})\cdots p(\mathbf{O}_{m^{\prime}})\cdots p(\mathbf{O}_{M})d\mathbf{O}_{1}\cdots d\mathbf{O}_{m^{\prime}}\cdots d\mathbf{O}_{M}\\
=&\int\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))p(\mathbf{O}_{1},\cdots,\mathbf{O}_{m^{\prime}},\cdots,\mathbf{O}_{M})d\mathbf{O}_{1}\cdots d\mathbf{O}_{m^{\prime}}\cdots d\mathbf{O}_{M}\times\int p(\mathbf{O}_{m})d\mathbf{O}_{m}\\
=&\int\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))p(\mathbf{O}_{1},\cdots,\mathbf{O}_{m^{\prime}},\mathbf{O}_{m},\cdots,\mathbf{O}_{M})d\mathbf{O}_{1}\cdots d\mathbf{O}_{M}d\mathbf{O}_{m^{\prime}}\\
=&\mathbb{E}_{\mathcal{S},\mathbf{O}_{m^{\prime}}}[\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))].
\end{split} \quad (17)

Using the above equation and the \gamma-uniform stability, we have

\begin{split}
&\mathbb{E}_{\mathcal{S}}[\mathbb{E}_{A}[R(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}})]]\\
=&\mathbb{E}_{\mathcal{S}}[\mathbb{E}_{\mathbf{O}}[\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))]]]-\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}_{\mathcal{S}}[\mathbb{E}_{A}[\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}}))]]\\
=&\mathbb{E}_{\mathcal{S}}[\mathbb{E}_{\mathbf{O}}[\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))]]]-\mathbb{E}_{\mathcal{S}}[\mathbb{E}_{A}[\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}}))]]\\
=&\mathbb{E}_{\mathcal{S},\mathbf{O}_{m^{\prime}}}[\mathbb{E}_{A}[\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}}))]]-\mathbb{E}_{\mathcal{S},\mathbf{O}_{m^{\prime}}}[\mathbb{E}_{A}[\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))]]\\
=&\mathbb{E}_{\mathcal{S},\mathbf{O}_{m^{\prime}}}[\mathbb{E}_{A}[\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}}))-\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))]]\\
\leq&\mathbb{E}_{\mathcal{S},\mathbf{O}_{m^{\prime}}}[\mathbb{E}_{A}[|\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}}))-\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))|]]\\
\leq&\;2\gamma.
\end{split} \quad (18)

Also, we have the following inequalities,

\begin{split}
|\mathbb{E}_{A}[R(h_{\mathcal{S}})-R(h_{\mathcal{S}^{m}})]|&=|\mathbb{E}_{\mathbf{O}}[\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))]]-\mathbb{E}_{\mathbf{O}}[\mathbb{E}_{A}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{m}}))]]|\\
&\leq\mathbb{E}_{\mathbf{O}}[\mathbb{E}_{A}[|\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))-\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{m}}))|]]\\
&\leq 2\gamma,
\end{split} \quad (19)
\begin{split}
|\mathbb{E}_{A}[R_{emp}(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}^{m}})]|\leq&\left|\frac{1}{M}\sum_{m^{\prime}=1,m^{\prime}\neq m}^{M}\mathbb{E}_{A}[\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}}))-\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))]\right|\\
+&\left|\frac{1}{M}\mathbb{E}_{A}[\mathcal{L}(Y_{m},h(\mathbf{X}_{m};\psi_{\mathcal{S}}))-\mathcal{L}(Y_{m^{\prime}},h(\mathbf{X}_{m^{\prime}};\psi_{\mathcal{S}^{m}}))]\right|\\
\leq&\;2\frac{M-1}{M}\gamma+\frac{\lambda}{M}\\
\leq&\;2\gamma+\frac{\lambda}{M}.
\end{split} \quad (20)

Letting K_{\mathcal{S}}=R(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}}) and using (19) and (20), we obtain

\begin{split}
|\mathbb{E}_{A}[K_{\mathcal{S}}]-\mathbb{E}_{A}[K_{\mathcal{S}^{m}}]|&=\left|\mathbb{E}_{A}[R(h_{\mathcal{S}})-R_{emp}(h_{\mathcal{S}})]-\mathbb{E}_{A}[R(h_{\mathcal{S}^{m}})-R_{emp}(h_{\mathcal{S}^{m}})]\right|\\
&\leq\left|\mathbb{E}_{A}[R(h_{\mathcal{S}})]-\mathbb{E}_{A}[R(h_{\mathcal{S}^{m}})]\right|+\left|\mathbb{E}_{A}[R_{emp}(h_{\mathcal{S}})]-\mathbb{E}_{A}[R_{emp}(h_{\mathcal{S}^{m}})]\right|\\
&\leq 2\gamma+\left(2\gamma+\frac{\lambda}{M}\right)\\
&=4\gamma+\frac{\lambda}{M}.
\end{split} \quad (21)

Based on the above fact, we can apply the result of (16),

P(\mathbb{E}_{A}[K_{\mathcal{S}}]-\mathbb{E}_{\mathcal{S}}[\mathbb{E}_{A}[K_{\mathcal{S}}]]\geq\epsilon)\leq\exp\left(-\frac{2\epsilon^{2}}{M(4\gamma+\frac{\lambda}{M})^{2}}\right). \quad (22)

Letting \delta=\exp\left(-\frac{2\epsilon^{2}}{M(4\gamma+\frac{\lambda}{M})^{2}}\right) and using (18), we obtain the following result, which concludes the proof:

P\left(\mathbb{E}_{A}[K_{\mathcal{S}}]\leq 2\gamma+(4M\gamma+\lambda)\sqrt{\frac{\log(1/\delta)}{2M}}\right)\geq 1-\delta. \quad (23)
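To get a feel for the scale of the bound, a back-of-the-envelope evaluation of (13) under purely illustrative values of \gamma, M, \lambda and \delta is sketched below.

```python
import math

def stability_bound(gamma, M, lam, delta):
    # Right-hand side of (13): 2*gamma + (4*M*gamma + lambda) * sqrt(log(1/delta) / (2M)).
    return 2 * gamma + (4 * M * gamma + lam) * math.sqrt(math.log(1 / delta) / (2 * M))

# e.g., gamma = 1e-4, M = 1e6 candidate sub-matrices, loss bound lambda = 1,
# confidence level 1 - delta = 0.99:
print(stability_bound(1e-4, 10**6, 1.0, 0.01))   # ~0.61
```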

C.2 Deriving the bound for \gamma

We proceed to prove our main result in Theorem 1 by deriving the bound for \gamma based on the SGD algorithm and our GNN model. Let \mathbf{W}_{\mathcal{S}},\Phi_{\mathcal{S}} and \mathbf{W}_{\mathcal{S}^{m}},\Phi_{\mathcal{S}^{m}} denote the embedding matrix and the classifier weights of the backbone network trained on \mathcal{S} and \mathcal{S}^{m}, respectively. Recall that our model is \hat{y}_{i}=h(\mathbf{x}_{i};\psi)=\sigma(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\mathbf{W}\Phi). Hence, we have

\begin{split}
&|\mathbb{E}_{SGD}[\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}}))-\mathcal{L}(Y,h(\mathbf{X};\psi_{\mathcal{S}^{m}}))]|\\
=&\left|\mathbb{E}_{SGD}\left[\frac{1}{N}\sum_{i\in I}l(y_{i},h(\mathbf{x}_{i};\psi_{\mathcal{S}}))-\frac{1}{N}\sum_{i\in I}l(y_{i},h(\mathbf{x}_{i};\psi_{\mathcal{S}^{m}}))\right]\right|\\
\leq&\frac{\beta}{N}\mathbb{E}_{SGD}\left[\sum_{i\in I}|h(\mathbf{x}_{i};\psi_{\mathcal{S}})-h(\mathbf{x}_{i};\psi_{\mathcal{S}^{m}})|\right]\quad(\mbox{since }l(\cdot,\cdot)\mbox{ is }\beta\mbox{-Lipschitz})\\
=&\frac{\beta}{N}\mathbb{E}_{SGD}\left[\sum_{i\in I}\left|\sigma\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\mathbf{W}_{\mathcal{S}}\Phi_{\mathcal{S}}\right)-\sigma\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\mathbf{W}_{\mathcal{S}^{m}}\Phi_{\mathcal{S}^{m}}\right)\right|\right]\\
\leq&\frac{\beta}{N}\mathbb{E}_{SGD}\left[\sum_{i\in I}\left|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\mathbf{W}_{\mathcal{S}}\Phi_{\mathcal{S}}-\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\mathbf{W}_{\mathcal{S}^{m}}\Phi_{\mathcal{S}^{m}}\right|\right]\quad(\mbox{due to the fact }|\sigma(x)-\sigma(y)|\leq|x-y|)\\
\leq&\frac{\beta}{N}\mathbb{E}_{SGD}\left[\sum_{i\in I}\left\|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\right\|_{2}\cdot\|\mathbf{W}_{\mathcal{S}}\Phi_{\mathcal{S}}-\mathbf{W}_{\mathcal{S}^{m}}\Phi_{\mathcal{S}^{m}}\|_{2}\right]\\
\leq&\frac{\beta}{N}\sum_{i\in I}\left\|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\right\|_{2}\,\mathbb{E}_{SGD}[\|\mathbf{W}_{\mathcal{S}}\Phi_{\mathcal{S}}-\mathbf{W}_{\mathcal{S}^{m}}\Phi_{\mathcal{S}^{m}}\|_{2}].
\end{split} \quad (24)

We need to bound the two terms in (24). First, notice that for any \mathbf{x}_{i}, it satisfies \|\mathbf{x}_{i}\|_{2}\leq\sqrt{d} and \|\mathbf{x}_{i}\|_{1}\leq d, and the graph convolution with mean pooling induces \left\|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c_{ii^{\prime}}^{L}\mathbf{x}_{i^{\prime}}\right\|_{1}\leq d. Using the inequality of arithmetic and geometric means, we have \left\|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\right\|_{2}\leq\sqrt{d}.

We proceed to bound the second term by considering the randomness of SGD. We define \Psi_{\mathcal{S}}=\mathbf{W}_{\mathcal{S}}\Phi_{\mathcal{S}} as the model parameters, and we need to derive a bound for \mathbb{E}_{SGD}[\|\Psi_{\mathcal{S}}-\Psi_{\mathcal{S}^{m}}\|_{2}]. Define a sequence of model parameters \{\Psi_{\mathcal{S},0},\Psi_{\mathcal{S},1},\cdots,\Psi_{\mathcal{S},T}\}, where \Psi_{\mathcal{S},t} denotes the model parameters learned by SGD on \mathcal{S}, with the update in the t-th step given by

\Psi_{\mathcal{S},t+1}=\Psi_{\mathcal{S},t}-\alpha\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S},t}),Y_{t})=\Psi_{\mathcal{S},t}-\alpha\frac{1}{N_{t}}\sum_{i\in I_{t}}\nabla_{\Psi}l(h(\mathbf{x}_{i};\Psi_{\mathcal{S},t}),y_{i}). \quad (25)

Similarly, \{\Psi_{\mathcal{S}^{m},0},\Psi_{\mathcal{S}^{m},1},\cdots,\Psi_{\mathcal{S}^{m},T}\} denotes the sequence of model parameters learned by SGD on \mathcal{S}^{m}. We then derive a bound for \Delta\Psi_{t}=\Psi_{\mathcal{S},t}-\Psi_{\mathcal{S}^{m},t} by considering two cases.

First, at step t, SGD picks a datum (\mathbf{X}_{t},Y_{t}) with t\neq m, i.e., (\mathbf{X}_{t},Y_{t}) exists in both \mathcal{S} and \mathcal{S}^{m}. This case happens with probability \frac{M-1}{M}. The derivative of the model output is

\frac{\partial h(\mathbf{x}_{i};\Psi)}{\partial\Psi}=\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi\right)\cdot\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}. \quad (26)

Using the fact |\sigma^{\prime}(x)-\sigma^{\prime}(y)|\leq|\sigma(x)-\sigma(y)|\leq|x-y|, we have

\begin{split}
&\|\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S},t}),Y_{t})-\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S}^{m},t}),Y_{t})\|_{2}\\
\leq&\frac{1}{N_{t}}\sum_{i\in I_{t}}\|\nabla_{\Psi}l(h(\mathbf{x}_{i};\Psi_{\mathcal{S},t}),y_{i})-\nabla_{\Psi}l(h(\mathbf{x}_{i};\Psi_{\mathcal{S}^{m},t}),y_{i})\|_{2}\\
\leq&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t}}\|\nabla_{\Psi}h(\mathbf{x}_{i};\Psi_{\mathcal{S},t})-\nabla_{\Psi}h(\mathbf{x}_{i};\Psi_{\mathcal{S}^{m},t})\|_{2}\\
\leq&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t}}\left\|\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S},t}\right)\cdot\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}-\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S}^{m},t}\right)\cdot\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\right\|_{2}\\
\leq&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t}}\left\|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\right\|_{2}\left\|\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S},t}\right)-\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S}^{m},t}\right)\right\|_{2}\\
\leq&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t}}\sqrt{d}\left\|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S},t}-\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S}^{m},t}\right\|_{2}\quad(\mbox{due to }|\sigma^{\prime}(x)-\sigma^{\prime}(y)|\leq|x-y|)\\
\leq&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t}}\sqrt{d}\left\|\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\right\|_{2}\|\Psi_{\mathcal{S},t}-\Psi_{\mathcal{S}^{m},t}\|_{2}\\
\leq&\;\beta^{\prime}d\|\Delta\Psi_{t}\|_{2}.
\end{split} \quad (27)

Second, at step t, SGD picks the m-th datum, i.e., the datum (\mathbf{X}_{t},Y_{t}) picked by the algorithm on \mathcal{S} and the datum (\mathbf{X}^{\prime}_{t},Y^{\prime}_{t}) picked by the algorithm on \mathcal{S}^{m} are distinct. This case happens with probability \frac{1}{M}. We have

\begin{split}
&\|\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S},t}),Y_{t})-\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}^{\prime}_{t};\Psi_{\mathcal{S}^{m},t}),Y^{\prime}_{t})\|_{2}\\
\leq&\frac{1}{N_{t}}\sum_{i\in I_{t},j=I^{\prime}_{t}[i]}\|\nabla_{\Psi}l(h(\mathbf{x}_{i};\Psi_{\mathcal{S},t}),y_{i})-\nabla_{\Psi}l(h(\mathbf{x}_{j}^{\prime};\Psi_{\mathcal{S}^{m},t}),y_{j}^{\prime})\|_{2}\quad(\mbox{since }N_{t}=N^{\prime}_{t}\mbox{, where }I^{\prime}_{t}[i]\mbox{ denotes the }i\mbox{-th entry of }I^{\prime}_{t})\\
\leq&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t},j=I^{\prime}_{t}[i]}\left\|\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S},t}\right)\cdot\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}-\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(j)\cup\{j\}}c^{L}_{ji^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S}^{m},t}\right)\cdot\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(j)\cup\{j\}}c^{L}_{ji^{\prime}}\mathbf{x}_{i^{\prime}}\right\|_{2}\\
\leq&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t},j=I^{\prime}_{t}[i]}\left\|\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S},t}\right)\cdot\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(i)\cup\{i\}}c^{L}_{ii^{\prime}}\mathbf{x}_{i^{\prime}}\right\|_{2}\\
+&\frac{\beta^{\prime}}{N_{t}}\sum_{i\in I_{t},j=I^{\prime}_{t}[i]}\left\|\sigma^{\prime}\left(\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(j)\cup\{j\}}c^{L}_{ji^{\prime}}\mathbf{x}_{i^{\prime}}\Psi_{\mathcal{S}^{m},t}\right)\cdot\sum_{i^{\prime}\in\mathcal{N}_{\tilde{L}}(j)\cup\{j\}}c^{L}_{ji^{\prime}}\mathbf{x}_{i^{\prime}}\right\|_{2}\\
\leq&\;2\beta^{\prime}\sqrt{d},
\end{split} \quad (28)

where the last inequality is due to |\sigma^{\prime}(x)|\leq 1.

Combining (27) and (28) we have

\begin{split}
&\mathbb{E}_{SGD}[\|\Delta\Psi_{t+1}\|_{2}]\\
\leq&\frac{M-1}{M}\mathbb{E}_{SGD}\left[\left\|(\Psi_{\mathcal{S},t}-\alpha\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S},t}),Y_{t}))-(\Psi_{\mathcal{S}^{m},t}-\alpha\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S}^{m},t}),Y_{t}))\right\|_{2}\right]\\
+&\frac{1}{M}\mathbb{E}_{SGD}\left[\left\|(\Psi_{\mathcal{S},t}-\alpha\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S},t}),Y_{t}))-(\Psi_{\mathcal{S}^{m},t}-\alpha\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}^{\prime}_{t};\Psi_{\mathcal{S}^{m},t}),Y^{\prime}_{t}))\right\|_{2}\right]\\
\leq&\mathbb{E}_{SGD}[\|\Delta\Psi_{t}\|_{2}]+\left(1-\frac{1}{M}\right)\alpha\,\mathbb{E}_{SGD}\left[\left\|\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S},t}),Y_{t})-\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S}^{m},t}),Y_{t})\right\|_{2}\right]\\
+&\frac{1}{M}\alpha\,\mathbb{E}_{SGD}\left[\left\|\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}_{t};\Psi_{\mathcal{S},t}),Y_{t})-\nabla_{\Psi}\mathcal{L}(h(\mathbf{X}^{\prime}_{t};\Psi_{\mathcal{S}^{m},t}),Y^{\prime}_{t})\right\|_{2}\right]\\
\leq&\mathbb{E}_{SGD}[\|\Delta\Psi_{t}\|_{2}]+\left(1-\frac{1}{M}\right)\beta^{\prime}d\,\mathbb{E}_{SGD}[\|\Delta\Psi_{t}\|_{2}]+\frac{2}{M}\beta^{\prime}\sqrt{d}\quad(\mbox{assuming the learning rate }\alpha\leq 1)\\
\leq&(1+\beta^{\prime}d)\,\mathbb{E}_{SGD}[\|\Delta\Psi_{t}\|_{2}]+\frac{2}{M}\beta^{\prime}\sqrt{d}.
\end{split} \quad (29)

The above inequality yields,

\mathbb{E}_{SGD}[\|\Delta\Psi_{T}\|_{2}]\leq\frac{2\beta^{\prime}\sqrt{d}}{M}\sum_{t=1}^{T}(1+\beta^{\prime}d)^{t-1}. \quad (30)

Plugging this result into (24), we obtain

\gamma\leq\frac{2\beta\beta^{\prime}d}{M}\sum_{t=1}^{T}(1+\beta^{\prime}d)^{t-1}. \quad (31)

This completes the proof of Theorem 1.

Appendix D Dataset Information

Table 2: Information for experiment datasets. The Github dataset directly provides preprocessed 0-1 features.
Dataset Domain #Instances #Raw Feat. Cardinality #0-1 Feat. #Class
Gene Life 3,190 60 4~6 287 3
Protein Life 1,080 80 2~8 743 8
Robot Computer 5,456 24 9 237 4
Drive Computer 58,509 49 9 378 11
Calls Life 7,195 10 4~10 219 10
Github Social 37,700 – – 4,006 2
Avazu Ad. 40,428,967 22 5~1,611,749 2,018,025 2
Criteo Ad. 45,840,617 39 5~541,311 2,647,481 2

D.1 Dataset Information

We present detailed information on the datasets we use, concerning data collection, preprocessing, and statistics.

UCI datasets. The six datasets are provided by the UCI Machine Learning Repository [1]. They come from different domains, including biology, engineering and social networks. The Gene dataset contains 60 DNA sequence elements, and the task is to recognize exon/intron boundaries of DNA. The Protein dataset [10] consists of the expression levels of 77 proteins/protein modifications, genotype, treatment type and behavior, and the task is to identify subsets of proteins that are discriminant between eight classes of mice. The Robot dataset is collected as a robot navigates through a room following a wall, with 24 ultrasound sensor readings, and the task is to predict the robot's behavior. The Drive dataset is extracted from electric current drive signals with 49 attributes, and the task is to identify 11 different classes with different conditions. The Calls dataset was created by segmenting audio recordings belonging to 4 different families, 8 genera, and 10 species, and the task is to identify the species. The Github dataset [41] is a large social network of GitHub developers with their location, starred repositories, employer and e-mail address, and the task is to predict whether a GitHub user is a web or a machine learning developer.

The six UCI datasets have diverse statistics. Overall, they contain thousands of instances and dozens of raw features, with a mixture of categorical and continuous ones. The categorical raw features have cardinalities ranging from 2 to 12. As mentioned in Section 2.1, the cardinality is the number of possible values of a discrete feature. For the continuous features in each dataset (if any), we first normalize the values to a 0-mean, 1-standard-deviation distribution and then hash them into 10 buckets evenly partitioned between the maximum and the minimum. Each raw feature can then be converted into a one-hot representation. After converting all the features into binary ones, we obtain up to hundreds of 0-1 features for each dataset. Table 2 summarizes the basic information for each dataset.
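A minimal sketch of this preprocessing pipeline (our own illustrative re-implementation; the released scripts may differ) is:

```python
import numpy as np

def bucketize(col, n_buckets=10):
    # Normalize to zero mean / unit std, then hash into equal-width buckets.
    col = (col - col.mean()) / (col.std() + 1e-12)
    edges = np.linspace(col.min(), col.max(), n_buckets + 1)[1:-1]
    return np.digitize(col, edges)              # bucket index in {0, ..., 9}

def one_hot(col):
    values, idx = np.unique(col, return_inverse=True)   # len(values) = cardinality
    out = np.zeros((len(col), len(values)))
    out[np.arange(len(col)), idx] = 1.0
    return out

def to_binary_features(X_raw, is_continuous):
    # X_raw: (N, d) array of raw features; returns the 0-1 feature matrix.
    cols = [bucketize(X_raw[:, j]) if is_continuous[j] else X_raw[:, j]
            for j in range(X_raw.shape[1])]
    return np.concatenate([one_hot(c) for c in cols], axis=1)
```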

CTR prediction datasets. The two click-through rate (CTR) prediction datasets have millions of instances and dozens of raw features with diverse cardinalities. The goal of the CTR prediction task is to estimate the probability that a user will click on an advertisement, given the user's profile features and the ad's content features. Specifically, Criteo (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/) is a widely used public benchmark dataset for developing CTR models, which includes 45 million users' click records, 13 continuous raw features and 26 categorical ones. (In the computational advertisement community, the raw features, e.g., site category, device id, device type, app domain, etc., are often called fields. We call them raw features in our paper to keep the notation self-contained.) We follow [24, 23] and use a log transformation to convert the continuous features into discrete ones. Avazu (https://www.kaggle.com/c/avazu-ctr-prediction) is another publicly accessible dataset for CTR prediction, which contains users' mobile behaviors, including whether a displayed mobile ad is clicked by a user or not. It has 40 million users' click records and 22 categorical raw features spanning from user/device features to ad attributes (all encoded to remove user identity information). The cardinalities of the raw features in these two datasets are very diverse, ranging from 5 to over a million. The raw features with very large cardinality include id features, e.g., device id, site id, app id, etc. For each dataset, we convert each raw feature into one-hot representations to obtain 0-1 features. Features appearing fewer than 4 times are grouped into a single feature. After preprocessing, we obtain roughly 2 million and 2.6 million 0-1 features for Avazu and Criteo, respectively, as shown in Table 2.
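The two preprocessing steps can be sketched as follows; the log transformation shown is the one commonly used for Criteo's continuous fields following [24, 23], and the helper names are our own.

```python
import math
from collections import Counter

def discretize_continuous(v):
    # Common Criteo log transform: map v to floor(log(v)^2) for v > 2.
    return int(math.floor(math.log(v) ** 2)) if v > 2 else int(v)

def build_vocab(column_values, min_count=4):
    # Features appearing fewer than min_count times share a single "rare" id.
    counts = Counter(column_values)
    vocab = {v: i for i, (v, c) in enumerate(counts.items()) if c >= min_count}
    vocab["<rare>"] = len(vocab)
    return vocab

def encode(value, vocab):
    return vocab.get(value, vocab["<rare>"])
```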

D.2 Dataset Splits

UCI datasets. For each UCI dataset, we first randomly partition all the instances into training/validation/test sets with a ratio of 6:2:2. Then we randomly select a certain ratio (30%~80% in our experiments) of features as observed features and use the remaining ones as unobserved. The model is trained with the observed features of training instances, validated with the observed features of validation instances, and tested with all the features of test instances.

CTR prediction datasets. As illustrated in Section 4, for Avazu/Criteo we split all the instances into ten folds in chronological order. We then use the first fold for training, the second fold for validation, and the third to tenth folds for test. In this way, the validation and test data naturally contain new features that do not appear in the training data. Here we provide more illustration of this point. As mentioned above, the Avazu dataset contains 22 categorical raw features, and some of them have very large cardinality. For example, the cardinalities of the raw features app id and device id are 5481 and 381763, respectively. In practical systems, new apps are introduced and new devices are observed by the system as time goes by; they act as new values out of the known range of existing raw features, and constitute new 0-1 features unseen by the model (as introduced in the beginning of Section 2.1). Since we chronologically divide the dataset into training/validation/test sets, the validation and test sets both contain a mixture of features seen in the training set and new features unseen in training. Concretely, for Avazu, there are in total 618411 features in the training set, 248614 new features (unseen in the training data) in the validation set, and in total 1151000 new features (unseen in both the training and validation sets) in all the test sets. For Criteo, there are in total 1340248 features in the training set, 472023 new features (unseen in the training data) in the validation set, and in total 835210 new features (unseen in both the training and validation sets) in all the test sets.
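A small sketch of this chronological split and the bookkeeping of unseen features is given below, assuming (for illustration only) that `X` stores per-instance lists of 0-1 feature ids sorted by timestamp.

```python
import numpy as np

def chronological_split(X, n_folds=10):
    folds = np.array_split(np.arange(len(X)), n_folds)   # time-ordered folds
    return folds[0], folds[1], folds[2:]                 # train, val, 8 test folds

def count_new_features(X, seen_idx, other_idx):
    seen = {f for i in seen_idx for f in X[i]}
    return len({f for i in other_idx for f in X[i]} - seen)

# e.g., count_new_features(X, train_idx, val_idx) recovers the number of
# validation features unseen in training (248614 on Avazu).
```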

Appendix E Implementation Details

We present the implementation details of our experiments for reproducibility. We implement our model as well as all the baselines with Python 3.8, PyTorch 1.7 and PyTorch Geometric 1.6. All experiments are run on an RTX 2080Ti, except the scalability test in Section 4.3, where we use an RTX 8000.

E.1 Details for UCI experiments

Architectures. For experiments on UCI datasets, the network architecture for our backbone network is

  • A three-layer neural network with hidden size 8 in each layer.

  • The activation function is ReLU.

  • The output layer is a softmax function for multi-class classification or sigmoid for two-class classification.

The architecture for our GNN network is

  • A four-layer GCN [25] network with hidden size 8 in each layer.

  • Adding self-loop and using normalization for graph convolution in each layer.

  • No activation unit is used.

Training Details. We adopt the self-supervised learning approach with n-fold splitting. Concretely, in each epoch, we feed the whole training data matrix into the model and randomly divide all the observed features into n=5 disjoint sets \{\overline{F}_{s}\}_{s=1}^{n}. Then a nested optimization is considered: 1) we update the backbone network with n steps, where in the s-th step we mask the observed features in \overline{F}_{s}; 2) we then update the GNN with one step using the accumulated loss of the n steps. The training procedure repeats the above process until a given budget of 200 epochs. Also, in each epoch, the validation loss is averaged over the n folds, where for the s-th fold the features in \overline{F}_{s} are masked and the model uses the remaining observed features for prediction. Finally, we report the test accuracy achieved at the epoch that gives the minimum logloss on the validation set.

Hyperparameters. Other hyper-parameters are searched with grid search on the validation set. We use the same hyperparameter settings for all six datasets, which indicates that our model is dataset-agnostic in some sense. The settings and search space are as follows:

  • The learning rates \alpha_{f}, \alpha_{s} are searched within [0.1, 0.01, 0.001]. We set \alpha_{f}=0.01 and \alpha_{s}=0.001.

  • The DropEdge ratio \rho is searched within [0.0, 0.2, 0.5]. We set \rho=0.5.

  • The fold number n for data partition is searched within [2, 5, 10]. We set n=5.

Baselines. All the baselines are implemented with a three-layer neural network, the same as the backbone network in our model. The baselines are all trained with a given budget of 200 epochs, and we report the test accuracy achieved at the epoch that gives the minimum logloss on the validation set. The differences among them lie in the ways of leveraging observed and unobserved (new) features during training and inference. The detailed information for the baseline methods is as follows.

  • Base-NN. Use observed features of training instances for model training, and observed features of validation/test instances for model validation/test.

  • Oracle-NN. Use all the features of training instances for model training, and all the features of validation/test instances for model validation/test.

  • INL-NN. The training process contains two stages. In the first stage, we train the model from its initialized parameters using the observed features of training instances for 200 epochs and save the model at the epoch that gives the minimum logloss on the validation set (with observed features). In the second stage, we load the model saved in the first stage, train it using the unobserved features of training instances for 200 epochs, and report the test accuracy (using all the features) achieved at the epoch that gives the minimum logloss on the validation set (using all the features).

  • Average-NN. Use observed features of training instances for model training. In test stage, we average the embeddings of observed features as the embeddings of unobserved features. Then the model would use all the features of test instances for inference (by using the trained embeddings of observed features and estimated embeddings of unobserved ones).

  • Pooling-NN. Use observed features of training instances for model training. In the test stage, we replace the GNN model in FATE with mean pooling over neighboring nodes. Specifically, the embeddings of unobserved features are obtained by non-parametric message passing using mean pooling over the feature-data bipartite graph.

  • KNN-NN. Use observed features of training instances for model training. In the test stage, we compute the Jaccard similarity between every pair of observed and unobserved features. Then for each unobserved feature, its embedding is obtained by averaging the embeddings of the observed features with the top 20% Jaccard similarities to the target unobserved feature. (A sketch of this procedure is shown below.)
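The KNN-NN estimation step can be sketched as follows, assuming X is the dense 0-1 test matrix and W_obs holds the trained embeddings of observed features (an illustrative re-implementation, not the exact released code):

```python
import numpy as np

def knn_embeddings(X, obs_idx, new_idx, W_obs, top_ratio=0.2):
    Xo = X[:, obs_idx].astype(float)                 # columns of observed features
    Xn = X[:, new_idx].astype(float)                 # columns of new features
    inter = Xn.T @ Xo                                # |A ∩ B| for 0-1 columns
    union = Xn.sum(0)[:, None] + Xo.sum(0)[None, :] - inter
    jac = inter / np.maximum(union, 1)               # Jaccard similarity matrix
    k = max(1, int(top_ratio * len(obs_idx)))
    top = np.argsort(-jac, axis=1)[:, :k]            # top-20% most similar features
    return W_obs[top].mean(axis=1)                   # averaged embeddings, (n_new, H)
```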

E.2 Details for Avazu/Criteo experiments

Architectures. For experiments on Criteo and Avazu datasets, we consider two specifications for the backbone network. First, we specify it as a feedforward NN, whose architecture is

  • A three-layer neural network with hidden size 10-400-400-1.

  • The activation function is ReLU, except for the last layer, which uses a sigmoid.

  • We use BatchNorm and Dropout with probability 0.5 in each layer.

Second, we specify it as a DeepFM network [17], which also contains an embedding layer and a subsequent classification layer. The embedding layer is an embedding lookup \mathbf{W} which maps each nonzero index in \mathbf{x}_{i} to an embedding, denoted as \{\mathbf{z}_{i}^{m}\}, where \mathbf{z}_{i}^{m} denotes the embedding for the m-th raw feature of instance i. The subsequent classification layer can be denoted by

\hat{y}_{i}=\sum_{j=1}^{D}w_{j}\cdot x_{ij}+\mbox{FM}(\{\mathbf{z}_{i}^{m}\})+\mbox{FNN}(\mathbf{z}_{i}), \quad (32)

where \mathbf{z}_{i}=\sum_{m=1}^{d}\mathbf{z}_{i}^{m}, FNN is a feedforward neural network, and FM is a factorization machine, which can be written as

\mbox{FM}(\{\mathbf{z}_{i}^{m}\})=\sum_{m<m^{\prime}}\langle\mathbf{z}_{i}^{m},\mathbf{z}_{i}^{m^{\prime}}\rangle. \quad (33)

For our model FATE-DeepFM, we use the GNN to compute the feature embeddings \hat{\mathbf{W}}, based on which we use the input feature vector \mathbf{x}_{i} of an instance to obtain \{\mathbf{z}_{i}^{m}\} and \mathbf{z}_{i}, and then plug them into the subsequent classification layer.
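As a minimal sketch, the FM term in (33) can be computed in O(dH) per instance with the standard identity \sum_{m<m^{\prime}}\langle\mathbf{z}^{m},\mathbf{z}^{m^{\prime}}\rangle=\frac{1}{2}(\|\sum_{m}\mathbf{z}^{m}\|^{2}-\sum_{m}\|\mathbf{z}^{m}\|^{2}); the tensor shapes here are illustrative.

```python
import torch

def fm_interaction(Z):                 # Z: (batch, d, H) raw-feature embeddings
    sum_sq = Z.sum(dim=1).pow(2)       # (sum_m z^m)^2, shape (batch, H)
    sq_sum = Z.pow(2).sum(dim=1)       # sum_m (z^m)^2, shape (batch, H)
    return 0.5 * (sum_sq - sq_sum).sum(dim=1)   # scalar FM term per instance
```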

The architecture for our GNN network is

  • A two-layer GraphSAGE [18] network with hidden size 10 in each layer.

  • No activation unit.

Training Details. We adopt the inductive learning approach with k-shot sampling for training our model. Furthermore, we use instance-level mini-batch partition to control the space cost. Concretely, in each epoch, we first randomly shuffle all the training instances and partition them into mini-batches of size B. Then, for each mini-batch, we consider a training iteration where the backbone network is updated with n steps and the GNN is updated with one step. For the s-th update step of the backbone, we uniformly sample k raw features from the d existing ones, obtain a new feature set F_{s} induced by the sampled raw features, and extract the corresponding columns of the data matrix to form a proxy datum, i.e., a B\times|F_{s}| sub-matrix of \mathbf{X}_{tr}. We then use the proxy data matrix to update the backbone. After the n update steps for the backbone, we update the GNN with the accumulated loss of the n steps.

All the models, including FATE and the baselines, are trained with a given budget of 100 epochs. Every 10 training iterations, we compute the loss and ROC-AUC on the validation set. Finally, we report the test ROC-AUC achieved at the epoch that gives the highest ROC-AUC on the validation set.

Hyperparameters. Other hyper-parameters are searched with grid search on the validation set. The settings and search space are as follows:

  • The learning rates \alpha_{f}, \alpha_{s} are searched within [0.1, 0.01, 0.001, 0.0001, 0.00001]. We set \alpha_{f}=0.0001 and \alpha_{s}=0.0001 for NN as the backbone. For DeepFM as the backbone, we set \alpha_{f}=0.0001 and \alpha_{s}=0.0001 on Criteo, and \alpha_{s}=0.00001 on Avazu.

  • The DropEdge ratio \rho is searched within [0.0, 0.2, 0.5]. We set \rho=0.5.

  • The batch size B is searched within [10000, 20000, 100000, 200000]. We set B=100000.

  • The sampling size k for data partition is searched within [7, 11, 13, 17, 20] for Avazu and [13, 16, 21, 24, 27] for Criteo. We set k=17 for Avazu and k=24 for Criteo.

Appendix F More Experiment Results

We supplement more experimental results as further discussion of our method, including scalability tests and ablation studies.

F.1 Scalability Test

We conduct scalability tests on the Criteo dataset. The scalability experiment is deployed on an RTX 8000 GPU with 48GB memory (though our comparison experiments in Sections 4.1 and 4.2 require less than 12GB memory per trial).

Impact of Batch Sizes. We report the running time per mini-batch for training and inference on the Criteo dataset in Fig. 6(a) and (b), where the batch size varies from 1e5 to 1e6. The results are averaged over 20 mini-batches. As the batch size increases, the training time and inference time both increase linearly, which shows that our model has linear scalability w.r.t. the number of instances for each update and inference. Also, in Fig. 6(c) and (d), we present the GPU memory cost for training and inference on the Criteo dataset. The space cost of FATE also increases linearly with respect to the batch size. Indeed, as discussed in Section 3.2, the time and space complexity of FATE is O(Bd) with mini-batch training, where d is a relatively small value (up to a hundred). Hence, the empirical results verify our analysis.

Impact of Feature Numbers. We also study the model's scalability w.r.t. the number of features, i.e., the dimension D of feature vectors for training data (D' for test data). There are in total 39 raw features in the Criteo dataset, and we use [39, 36, 33, 30, 27, 24, 21, 18, 15] of them for the experiments, which induces [2647481, 2475154, 1956637, 1949494, 1466711, 927535, 862927, 850016] 0-1 features, and we again compare the training/inference time per mini-batch and the GPU memory costs. The results are shown in Fig. 7(a)-(d). As the feature number increases, the time and space costs both go up linearly, which indicates that FATE has linear scalability w.r.t. the feature number D. In fact, more features require a larger model size (for the feature embeddings) and induce a larger computational graph due to the increase of d; the increase of d also requires more training/inference time according to our complexity analysis.

F.2 Ablation Studies

We next conduct ablation studies for some key components in our framework and discuss their impacts on our model. The results are shown in Table 3 and Table 4.

Effectiveness of DropEdge. In Table 3, we compare with a variant that does not use DropEdge regularization in the training stage. The results show that FATE consistently achieves superior accuracy across the six datasets, which demonstrates the effectiveness of the DropEdge regularization in alleviating over-fitting on training features.

Effectiveness of Asynchronous Updates. We also compare our asynchronous updates (alternating fast updates for the backbone network and slow updates for the GNN) with end-to-end joint training of the two networks. The results show that FATE with asynchronous updates outperforms the joint training approach by a large margin, which verifies the effectiveness of our proposed asynchronous updating rule. The reason is that asynchronous updates decouple the training of the two networks and further help both models learn useful information from the observed data. Also, we observe that using slow updates for the GNN, with the accumulated loss of several data splits, stabilizes its training and alleviates over-fitting.

Comparison between Training Approaches. We further investigate the n-fold splitting and k-shot sampling strategies used in our training approaches in Table 3. Recall that on the UCI datasets we adopt the self-supervised learning approach for training. Here we compare our n-fold splitting with leave-one-out, which leaves out a fixed subset of features for masking during training, and with k-shot sampling, which randomly samples \lfloor 0.8\cdot D\rfloor training features as observed ones and masks the remaining for each update. The results show that the n-fold splitting and k-shot sampling strategies both provide superior performance on the six datasets. Furthermore, for different n's, the relative performance of the n-fold splitting and k-shot sampling approaches diverges in different cases. Overall, we find that n-fold splitting with n=5 or n=10 works best on average. In fact, n-fold splitting and k-shot sampling both play the role of mimicking new features and exposing only part of the observed features to the model during training. The difference is that n-fold splitting guarantees that in each iteration the model is updated on every feature in the training set, while k-shot sampling introduces more randomness. Unlike the UCI datasets, on the two large-scale datasets Criteo and Avazu, where we adopt the inductive learning approach for training, we find that k-shot sampling works consistently better than n-fold splitting. One possible reason is that k-shot sampling increases the diversity of the proxy data (containing partial features and partial instances) used for each training update and can presumably help the model overcome feature-level over-fitting on large datasets. These results are consistent with our theoretical generalization error analysis in Section 3.

Impact of Sampling Sizes. We next study the impact of the sampling size k on model performance. We use different k's for inductive learning on Avazu and Criteo. The results are shown in Table 4. As k increases, the training AUC goes up, which shows that a larger sampling size helps optimization, since it reduces the sampling variance and enhances training stability. However, it is not always beneficial to increase k: when it becomes large enough and close to the number of raw features d, the model suffers from over-fitting. The results further demonstrate that a large sampling size leads to feature-level over-fitting, which echoes our theoretical results in Section 3. Recall that Theorem 1 shows that the model's generalization gap depends on the randomness of the sampling over training features. When k is large, there is less randomness in the feature-level data partition, which degrades the model's generalization ability.

Table 3: Ablation studies for DropEdge regularization, asynchronous updates of the two networks (compared with end-to-end joint training), and our sampling strategies (n-fold splitting and k-shot sampling, compared with leave-one-out) on six UCI datasets. We run each experiment five times with different random seeds and report the mean scores.
Models Gene Protein Robot Drive Calls Github
w/o DropEdge 0.9226 0.9031 0.8062 0.5261 0.9760 0.8688
End-to-end Joint 0.9257 0.8963 0.8454 0.1073 0.9762 0.7557
FATE (ours) 0.9345 0.9178 0.8815 0.6440 0.9839 0.8743
Leave-one-out 0.8564 0.6574 0.7641 0.4448 0.9334 0.8533
n-fold split (n=10) 0.8884 0.8426 0.8888 0.5910 0.9851 0.8723
n-fold split (n=5) 0.9345 0.9178 0.8815 0.6440 0.9839 0.8743
n-fold split (n=2) 0.9298 0.8398 0.8359 0.5234 0.9514 0.8771
k-shot sample (n=10) 0.9404 0.9046 0.8839 0.5559 0.9812 0.8712
k-shot sample (n=5) 0.9379 0.9102 0.8802 0.6060 0.9819 0.8712
k-shot sample (n=1) 0.9304 0.8778 0.8568 0.5408 0.9611 0.8722
Table 4: Ablation studies for different sampling sizes k in k-shot sampling for inductive learning on Avazu and Criteo. We report ROC-AUC on training data, validation data and 8-fold test data (T1-T8).
Dataset kk Train Val T1 T2 T3 T4 T5 T6 T7 T8
Avazu 11 0.7815 0.7369 0.6853 0.6950 0.7058 0.7093 0.7137 0.7186 0.7183 0.7193
14 0.7842 0.7399 0.6896 0.6989 0.7080 0.7091 0.7142 0.7190 0.7201 0.7210
17 0.7902 0.7433 0.6894 0.6995 0.7082 0.7105 0.7156 0.7203 0.7215 0.7216
20 0.7978 0.7420 0.6872 0.6978 0.7080 0.7091 0.7146 0.7201 0.7201 0.7202
Criteo 16 0.7955 0.7725 0.7669 0.7666 0.7688 0.7695 0.7714 0.7721 0.7722 0.7722
21 0.7988 0.7752 0.7699 0.7695 0.7714 0.7721 0.7736 0.7739 0.7741 0.7744
24 0.8005 0.7758 0.7701 0.7694 0.7712 0.7727 0.7732 0.7745 0.7740 0.7743
27 0.8025 0.7747 0.7698 0.7683 0.7711 0.7713 0.7727 0.7743 0.7734 0.7744
30 0.8057 0.7750 0.7690 0.7678 0.7701 0.7708 0.7725 0.7735 0.7723 0.7739
Figure 6: Scalability test of time and space costs w.r.t. batch size B for training and inference on the Criteo dataset. Panels: (a) training time per mini-batch; (b) inference time per mini-batch; (c) GPU memory cost for training; (d) GPU memory cost for inference.
Figure 7: Scalability test of time and space costs w.r.t. feature number D for training and inference on the Criteo dataset. Panels: (a) training time per mini-batch; (b) inference time per mini-batch; (c) GPU memory cost for training; (d) GPU memory cost for inference.