
Confidence-Based Channel-wise Feature Recovery from Partially Known Features on Graphs

Daeho Um, Jiwoong Park, Seulki Park, Jin Young Choi
Department of ECE, ASRI, Seoul National University
{daehoum1,ptywoong,seulki.park,jychoi}@snu.ac.kr
Abstract

In this paper, we tackle a missing feature recovery problem for graph learning tasks. Since existing methods recover missing features based only on inter-node diffusion via the edges of a graph, they cannot effectively handle channel-wise missing features. To overcome this limitation, we introduce a novel concept of channel-wise confidence in a node feature, which is assigned to each recovered channel feature of a node to reflect the certainty of the recovery. Since the true confidence is unavailable in an actual learning process, we design a pseudo-confidence based on the channel-wise shortest path distance between a missing-feature node and its nearest known-feature node. Based on the pseudo-confidence, we propose a novel feature recovery scheme that performs both channel-wise graph diffusion and node-wise inter-channel propagation, in contrast to existing methods. The scheme remains effective even at a very high missing rate (e.g., 99.5%) and achieves state-of-the-art accuracy for both semi-supervised node classification and link prediction on various datasets containing a high rate of missing features.

1 Introduction

In recent years, graph neural networks (GNNs) have received a lot of attention and have shown outstanding performance on numerous problems in various fields [zhou2020graph, wu2020comprehensive, zhang2020deep, zitnik2018modeling, ying2018graph, yan2018spatial, fout2017protein]. While various GNNs have been designed for node representation learning [velivckovic2017graph, xu2018representation, kipf2016semi, defferrard2016convolutional] and graph representation learning [kipf2016variational, sun2019infograph, velickovic2019deep], GNN models typically assume that the features of all nodes are fully observed. However, in real-world situations, features in graph-structured data are often only partially observed, as illustrated in the following cases. First, collecting complete data for a large graph is expensive or even impossible. Second, in social networks, most users want to selectively protect their personal information. In addition, measurement failures often occur, and new nodes may join the graph after measurement. Under these situations, most GNNs cannot be directly applied due to incomplete features.

Several methods have been proposed to handle missing features [taguchi2021graph, jiang2020incomplete, monti2017geometric, rossi2021unreasonable], but they suffer from significant performance degradation at very high rates of missing features. Feature Propagation (FP) [rossi2021unreasonable], which iteratively propagates known features among the nodes, shows improved performance but still cannot avoid a considerable accuracy drop at a high missing rate. In real-world scenarios, extremely sparse data [ref:wind_data, ref:extreme_sparse_18] are common, which makes this setting important yet challenging.

To fully exploit the sparsely known channel features of each node, in this paper, we design an elaborate feature recovery scheme that diffuses the known features with different intensities per channel, unlike the inter-node diffusion used in FP. To this end, we devise two schemes for channel-wise and node-wise feature recovery, called channel-wise graph diffusion and node-wise inter-channel propagation. To assign a diffusion intensity per channel in every node, we introduce a novel concept of channel-wise confidence, which reflects the quality of channel feature recovery.

The true confidence in a missing channel feature is not accessible before recovery. Thus, we propose a pseudo-confidence to be used in our recovery scheme in place of the true confidence. By using channel-wise confidence in our scheme, a less confident channel feature can be recovered by aggregating the highly confident channel features in the same node or the highly confident channel features diffused from neighboring nodes.

The key contributions of our work can be summarized as follows: (1) We propose a new concept of channel-wise confidence that represents the quality of a recovered channel feature. (2) We design a method to compute a pseudo-confidence that can be used in place of the unavailable true confidence in a missing channel feature. (3) Based on the pseudo-confidence, we propose a novel feature recovery scheme that achieves state-of-the-art performance for node classification and link prediction even at a very high missing rate (e.g., 99.5%).

2 Related Work

2.1 Learning on Graphs with Missing Node Features

The problem of missing data has been widely studied in the literature [little2019statistical, loh2011high, you2020handling, zhang2022non, najafi2022missing, gordon2021tsi]. Recently, focusing on graph-structured data, there have been several attempts to learn on graphs with missing node features. Monti et al. [monti2017geometric] proposed recurrent multi-graph convolutional neural networks (RMGCNN) and separable RMGCNN (sRMGCNN), a scalable version of RMGCNN. Both combine a multi-graph convolutional neural network and LSTM [hochreiter1997long] for matrix completion on multi-graphs. Structure-attribute transformer (SAT) [chen2020learning] models the joint distribution of graph structure and node attributes and then completes the missing node attributes. GCN for missing features (GCNMF) [taguchi2021graph] adapts graph convolutional networks (GCN) [kipf2016semi] to graphs with missing node features by representing the missing features with a Gaussian mixture model. Meanwhile, partial graph neural network (PaGNN) [jiang2020incomplete] leverages a partial message-propagation scheme that considers only known features during propagation. However, these methods suffer from large performance degradation at high feature missing rates. Feature propagation (FP) [rossi2021unreasonable] reconstructs missing features by diffusing known features. However, in the diffusion of FP, a missing feature is formed by aggregating features from neighboring nodes regardless of whether each feature is known or inferred. Moreover, FP does not consider any interdependency among feature channels. To utilize relationships among channels, we construct a covariance matrix of the pre-recovered features and further update the features.

2.2 Distance Encoding

Distance encoding (DE) on graphs defines extra features by using the distance from a node to the node set where the prediction is to be made. Zhang and Chen [zhang2018link] extract a local enclosing subgraph around each target node pair and use a GNN to learn graph structure features for link prediction. Li et al. [li2020distance] exploit structure-related features, termed DE, that encode the distance between a node and its neighboring node set by graph-distance measures (e.g., shortest path distance or generalized PageRank scores [li2019optimizing]). Zhang et al. [zhang2021labeling] unify the aforementioned techniques into a labeling trick. Heterogeneous graph neural network (HGNN) [ji2021heterogeneous] proposes a heterogeneous distance encoding that considers the multiple types of paths in enclosing subgraphs of heterogeneous graphs. Distance encoding in existing methods improves the representation power of GNNs. In contrast, we use distance encoding to distinguish missing features based on the shortest path distance from a missing feature to the known features in the same channel.

2.3 Graph Diffusion

Diffusion on graphs spreads the feature of each node to its neighboring nodes along the edges [coifman2006diffusion, shuman2013emerging, guille2013information]. There are two types of transition matrices commonly used for diffusion on graphs: the symmetric transition matrix [kipf2016semi, klicpera2019diffusion, rossi2021unreasonable] and the random walk matrix [page1999pagerank, chung2007heat, perozzi2014deepwalk, grover2016node2vec, atwood2016diffusion, klicpera2018predict, lim2021class]. While these matrices work well for their target tasks, from a node's point of view, the sum of the edge weights for aggregating features is not 1 in general. Therefore, since features are not updated at the same scale as the original features, these matrices are not suitable for missing feature recovery.

3 Proposed Method

3.1 Overview

Figure 1: Overall scheme of the proposed Pseudo-Confidence based Feature Recovery (PCFR) method. Based on the graph structure and partially known features, we calculate the channel-wise shortest path distance between a node with a missing feature and its nearest source node (SPD-S). Based on SPD-S, we determine the pseudo-confidence in the recovered feature, using a pre-determined hyperparameter $\alpha\,(0<\alpha<1)$. Pseudo-confidence plays an important role in our two stages: channel-wise graph diffusion and node-wise inter-channel propagation.

In our work, we tackle graph learning tasks with missing node features. The first target task, semi-supervised node classification, is to infer the labels of the remaining unlabeled/missing-feature nodes from the partially known features/labels and the fully known graph structure. The second target task, link prediction, is to predict whether two nodes are likely to have a link. Figure 1 depicts the overall flow of the proposed feature recovery scheme for graph learning tasks with missing features. Our key idea is to recover the missing features in a channel-wise manner. To this end, we devise two feature recovery stages: channel-wise graph diffusion and node-wise inter-channel propagation. The recovered features are then used for downstream tasks via an off-the-shelf GNN.

In Sec. 3.2, we first introduce the notations used in this paper. In Sec. 3.3, we give an outline of the proposed PC (pseudo-confidence)-based feature recovery (PCFR) scheme, which recovers missing node features. We then propose a method to determine the pseudo-confidence in Sec. 3.4. In Sec. 3.5, we present channel-wise graph diffusion, which iteratively propagates known features in consideration of PC. In Sec. 3.6, we present node-wise inter-channel propagation, which adjusts features based on the covariance between channels.

3.2 Notations

Basic notation on graphs. An undirected connected graph is represented as $\mathcal{G}=(\mathcal{V},\mathcal{E},A)$, where $\mathcal{V}=\{v_{i}\}^{N}_{i=1}$ is the set of $N$ nodes, $\mathcal{E}$ is the edge set with $(v_{i},v_{j})\in\mathcal{E}$, and $A\in\{0,1\}^{N\times N}$ denotes the adjacency matrix. $X=[x_{i,d}]\in\mathbb{R}^{N\times F}$ is an $F$-channel node feature matrix, i.e., $x_{i,d}$ is the $d$-th channel value of node $v_{i}$. $\mathcal{N}(v_{i})$ denotes the set of neighbors of $v_{i}$. Given an arbitrary matrix $M\in\mathbb{R}^{n\times m}$, we let $M_{i,:}$ denote the $i$-th row vector of $M$ and $M_{:,j}$ the $j$-th column vector of $M$.

Notation for graphs with missing node features. As we assume that only partial, or even very few, node features are known, we define $\mathcal{V}^{(d)}_{k}$ as the set of nodes whose $d$-th channel feature values are known ($k$ in $\mathcal{V}^{(d)}_{k}$ means ‘known’). The set of nodes with unknown $d$-th channel feature values is denoted by $\mathcal{V}^{(d)}_{u}=\mathcal{V}\setminus\mathcal{V}^{(d)}_{k}$. We refer to $\mathcal{V}^{(d)}_{k}$ and $\mathcal{V}^{(d)}_{u}$ as source nodes and missing nodes, respectively. By reordering the nodes according to whether the feature value is known for the $d$-th channel, we can write the graph signal for the $d$-th channel and the adjacency matrix as

x^{(d)}=\begin{bmatrix}x^{(d)}_{k}\\ x^{(d)}_{u}\end{bmatrix}\qquad A^{(d)}=\begin{bmatrix}A^{(d)}_{kk}&A^{(d)}_{ku}\\ A^{(d)}_{uk}&A^{(d)}_{uu}\end{bmatrix}.

Here, $x^{(d)}$, $x^{(d)}_{k}$, and $x^{(d)}_{u}$ are column vectors that represent the corresponding graph signals. Since the graph is undirected, $A^{(d)}$ is symmetric and thus $(A^{(d)}_{ku})^{\top}=A^{(d)}_{uk}$. Note that $A^{(d)}$ differs from $A$ due to the reordering, although they represent the same graph structure. $\tilde{X}=[\tilde{x}_{i,d}]$ denotes the features recovered for $X$ from $\{x^{(d)}_{k}\}^{F}_{d=1}$ and $\{A^{(d)}\}^{F}_{d=1}$. A summary of notations is given in Table LABEL:notation_table.
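To make the per-channel partitioning concrete, the following minimal sketch (not the authors' released code) reorders the nodes of one channel into the known-first order used above. The function name `reorder_channel`, the dense numpy adjacency `A`, and the boolean mask `known` (with `known[i, d] = True` iff $x_{i,d}$ is observed) are illustrative assumptions.

```python
import numpy as np

def reorder_channel(A, X, known, d):
    """Known-first reordering for channel d: returns (perm, x^(d), A^(d))."""
    src = np.where(known[:, d])[0]       # V_k^(d): source nodes for channel d
    mis = np.where(~known[:, d])[0]      # V_u^(d): missing nodes for channel d
    perm = np.concatenate([src, mis])    # maps new position -> original node index
    x_d = X[perm, d]                     # reordered graph signal x^(d)
    A_d = A[np.ix_(perm, perm)]          # block-partitioned adjacency A^(d)
    return perm, x_d, A_d
```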

3.3 PC-based Feature Recovery Scheme

The proposed PC-based feature recovery (PCFR) scheme leverages the shortest path distance between nodes to compute pseudo-confidence. PCFR consists of two stages: channel-wise graph diffusion and node-wise inter-channel propagation. The first stage finds $\tilde{X}$ (pre-recovered features for $X$) through PC-based feature diffusion on a given graph $\mathcal{G}$. Then, the second stage updates $\tilde{X}$ to the final recovered features $\hat{X}$ by considering both PC and the correlation between channels.

To perform node classification and link prediction, a GNN is trained with the recovered node features $\hat{X}$. In this work, PCFR is designed to perform these downstream tasks well. However, since PCFR is independent of the type of learning task, it is not limited to the two tasks and can be applied to various graph learning tasks with missing node features.

Formally, the proposed framework can be expressed as

\hat{X} = f(\{x^{(d)}_{k}\}^{F}_{d=1},\{A^{(d)}\}^{F}_{d=1}) (1a)
\hat{Y} = g_{\theta}(\hat{X},A), (1b)

where $f$ is PCFR and $\hat{Y}$ is the prediction for the desired output of a given task. Here, any GNN architecture can be adopted as $g_{\theta}$ according to the type of task.

3.4 Pseudo-Confidence

We first define the concept of confidence in the recovered feature $\hat{x}_{i,d}$ of a node $v_{i}$ for channel $d$.

Definition 1.

Confidence in the recovered channel feature $\hat{x}_{i,d}$ is defined as the similarity between $\hat{x}_{i,d}$ and the true value $x_{i,d}$, and takes a value between 0 and 1.

Note that the feature $x_{i,d}$ of a source node is fully observed, and thus its confidence is 1. When the recovered $\hat{x}_{i,d}$ is far from the true $x_{i,d}$, the confidence in $\hat{x}_{i,d}$ decreases towards 0. However, finding $\hat{x}_{i,d}$ and its confidence is a chicken-and-egg problem: the confidence in $\hat{x}_{i,d}$ is unavailable before obtaining $\hat{x}_{i,d}$ according to Definition 1, whereas the proposed scheme cannot yield $\hat{x}_{i,d}$ without the confidence.

To handle this issue, instead of the true confidence, we design a pseudo-confidence using the shortest path distance between a node and its nearest source node for a specific channel (SPD-S). For instance, the SPD-S of the $i$-th node for the $d$-th channel feature is denoted by $S_{i,d}$, which is calculated by

S_{i,d}=s(v_{i}\,|\,\mathcal{V}^{(d)}_{k},A^{(d)}), (2)

where $s(\cdot)$ yields the shortest path distance between the $i$-th node and its nearest source node in $\mathcal{V}^{(d)}_{k}$ on $A^{(d)}$. Note that if the $i$-th node is itself a source node, its nearest source node is itself, and thus $S_{i,d}$ is zero. We construct the SPD-S matrix $S\in\mathbb{R}^{N\times F}$ whose elements are $S_{i,d}$.
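As a concrete illustration of Eq. (2), the sketch below (our own hedged example, not the authors' code) computes the full SPD-S matrix $S$ with one multi-source BFS per channel; it assumes a connected graph in which every channel has at least one source node, an adjacency list `neighbors`, and the boolean mask `known` introduced above.

```python
from collections import deque
import numpy as np

def spd_s(neighbors, known):
    """S[i, d] = shortest path distance from node i to its nearest source node of channel d."""
    N, F = known.shape
    S = np.full((N, F), np.inf)
    for d in range(F):
        src = np.where(known[:, d])[0]
        S[src, d] = 0                    # a source node is its own nearest source
        queue = deque(src.tolist())
        while queue:                     # multi-source BFS over the unweighted graph
            i = queue.popleft()
            for j in neighbors[i]:
                if S[j, d] == np.inf:    # first visit gives the shortest distance
                    S[j, d] = S[i, d] + 1
                    queue.append(j)
    return S
```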

Consider $\hat{X}=[\hat{x}_{i,d}]$, the recovered features of $X$, in light of feature homophily [mcpherson2001birds], which is a local property on a graph [bisgin2010investigating, lauw2010homophily, bisgin2012study]. Under feature homophily, the feature similarity between any two nodes tends to increase as the shortest path distance between the two nodes decreases. Hence, the recovered feature $\hat{x}_{i,d}$ of a node $v_{i}$ becomes similar to the given feature of its nearest source node with higher confidence as the SPD-S of $v_{i}$ ($S_{i,d}$) decreases.

Based on this homophily, we define the pseudo-confidence using SPD-S in Definition 2.

Definition 2.

Pseudo-confidence (PC) in $\hat{x}_{i,d}$ is defined by the function ${\rho}_{i,d}=\alpha^{S_{i,d}}$, where $\alpha\in(0,1)$ is a hyperparameter.

By Definition 2, the PC is 1 on source nodes, where $\hat{x}_{i,d}={x}_{i,d}$. Also, the PC of a missing node feature decreases exponentially as $S_{i,d}$ increases. Thus, the PC reflects well the tendency of the confidence in Definition 1.

3.5 Channel-wise Graph Diffusion

To recover missing node features via graph diffusion, source nodes propagate their features to their neighbors for each channel independently. While source nodes keep their features, which are true values, missing nodes aggregate features from connected nodes; these include low-confidence features from other missing nodes in addition to high-confidence features from source nodes. Based on confidence, which is an essential concept in the missing-feature setting, we design a novel transition matrix for feature diffusion. Our design objective is that a missing node aggregates more of the features of high-PC nodes. To this end, we consider the PC of the features of the nodes neighboring a specific missing feature.

We define the relative PC, which represents the amount of PC in a particular node feature relative to another node feature, as follows.

Definition 3.

The relative PC of $\hat{x}_{j,d}$ with respect to $\hat{x}_{i,d}$ is defined by ${\rho}_{j/i,d}={\rho}_{j,d}/{\rho}_{i,d}=\alpha^{S_{j,d}-S_{i,d}}$.

Then, suppose a missing node feature $x_{i,d}$ of $v_{i}$ aggregates features from $v_{j}\in\mathcal{N}(v_{i})$. We observe that, from the point of view of a missing node, there are only three possible cases for the relative PC of a neighboring feature relative to its own feature, while there are two cases for a source node. Formally, based on this observation, the relative PCs in these cases are determined by Proposition 1 as follows.

Proposition 1.

If $S_{i,d}=m\geq 1$, then $\rho_{j/i,d}\in\{\alpha^{-1},1,\alpha\}$ for $v_{j}\in\mathcal{N}(v_{i})$. Otherwise, if $S_{i,d}=0$, then $\rho_{j/i,d}\in\{1,\alpha\}$ for $v_{j}\in\mathcal{N}(v_{i})$.

The proof of Proposition 1 is given in the supplementary material.

Before defining a transition matrix, we temporarily reorder the nodes according to whether the feature value is known for the $d$-th channel, i.e., $x^{(d)}$ and $A^{(d)}$ are reordered for each channel as described in Section 3.2. After the feature diffusion stage, we restore the original node ordering.

Building on Proposition 1, we construct a weighted adjacency matrix $W^{(d)}\in\mathbb{R}^{N\times N}$ for the $d$-th channel, defined as follows:

W^{(d)}_{i,j}=\begin{cases}\rho_{j/i,d}&\text{if}\;i\neq j,\,A^{(d)}_{i,j}=1\\ 0&\text{if}\;i\neq j,\,A^{(d)}_{i,j}=0\\ 1&\text{if}\;i=j.\end{cases} (3)

Note that self-loops are added to $W^{(d)}$ with weight 1 so that each node can keep some of its own feature.

$W^{(d)}_{i,j}$ is the edge weight corresponding to message passing from $v_{j}$ to $v_{i}$. According to Proposition 1, $W^{(d)}_{i,j}\in\{\alpha^{-1},1,\alpha\}$ when $v_{i}$ and $v_{j}$ are connected. Hence, if we consider aggregation in a node, $\alpha^{-1}$ is assigned as the edge weight for high-PC neighbors, while 1 is assigned for same-PC and $\alpha$ for low-PC neighbors. That is, from a node's point of view, $W^{(d)}$ allows the node to aggregate more of the features with high PC from its neighbors. Furthermore, consider message passing between two connected nodes $v_{i}$ and $v_{j}$ such that $\rho_{j/i,d}=\alpha$. By Definition 3, $\rho_{i/j,d}=\rho^{-1}_{j/i,d}$, so that $W^{(d)}_{j,i}=(W^{(d)}_{i,j})^{-1}=\alpha^{-1}$. This means that message passing from a more confident node to a less confident node occurs with a large weight, while message passing in the opposite direction occurs with a small weight.
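A minimal sketch of Eq. (3) follows (illustrative, not the authors' implementation): given the reordered dense adjacency `A_d` and the matching vector `S_d` of SPD-S values for channel $d$, the nonzero off-diagonal weights are the relative PCs $\alpha^{S_{j,d}-S_{i,d}}$ and the diagonal holds the unit self-loops.

```python
import numpy as np

def channel_weight_matrix(A_d, S_d, alpha):
    """Weighted adjacency W^(d) of Eq. (3) for one channel."""
    diff = S_d[None, :] - S_d[:, None]           # diff[i, j] = S_{j,d} - S_{i,d}
    W = np.where(A_d == 1, alpha ** diff, 0.0)   # relative PC on edges, 0 on non-edges
    np.fill_diagonal(W, 1.0)                     # self-loops with weight 1
    return W                                     # edge weights lie in {1/alpha, 1, alpha} by Prop. 1
```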

To ensure convergence of the diffusion process, we normalize $W^{(d)}$ to $\overline{W}^{(d)}=(D^{(d)})^{-1}W^{(d)}$ through row-stochastic normalization with $D^{(d)}_{ii}=\sum_{j}W^{(d)}_{i,j}$. Since $x^{(d)}_{k}$ with true feature values should be preserved, we replace the first $|\mathcal{V}^{(d)}_{k}|$ rows of $\overline{W}^{(d)}$ with one-hot vectors indicating $\mathcal{V}^{(d)}_{k}$. Finally, the channel-wise graph diffusion matrix $\widetilde{W}^{(d)}$ for the $d$-th channel is expressed as

\widetilde{W}^{(d)}=\begin{bmatrix}I&\mathbf{0}_{ku}\\ \overline{W}^{(d)}_{uk}&\overline{W}^{(d)}_{uu}\end{bmatrix}, (4)

where $I\in\mathbb{R}^{|\mathcal{V}^{(d)}_{k}|\times|\mathcal{V}^{(d)}_{k}|}$ is an identity matrix and $\mathbf{0}_{ku}\in\{0\}^{|\mathcal{V}^{(d)}_{k}|\times|\mathcal{V}^{(d)}_{u}|}$ is a zero matrix. Note that $\widetilde{W}^{(d)}$ is still row-stochastic despite the replacement. Aggregation in a specific node can be regarded as a weighted sum of the features of its neighboring nodes, and a row-stochastic transition matrix means that when a node aggregates features from its neighbors, the sum of the weights is 1. Therefore, unlike the symmetric transition matrix [kipf2016semi, klicpera2019diffusion, rossi2021unreasonable] or the column-stochastic (random walk) transition matrix [page1999pagerank, chung2007heat, perozzi2014deepwalk, grover2016node2vec, atwood2016diffusion, klicpera2018predict, lim2021class], the features of missing nodes are formed at the same scale as the known features. By preserving the original scale, the features can be recovered close to the actual features.
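The normalization and source-row replacement of Eq. (4) can be sketched as follows (a hedged example under the known-first ordering of Sec. 3.2, where `n_k` denotes $|\mathcal{V}^{(d)}_{k}|$):

```python
import numpy as np

def diffusion_matrix(W, n_k):
    """Row-stochastic diffusion matrix of Eq. (4)."""
    W_bar = W / W.sum(axis=1, keepdims=True)   # row-stochastic normalization (D^-1 W)
    W_tilde = W_bar.copy()
    W_tilde[:n_k, :] = 0.0
    W_tilde[:n_k, :n_k] = np.eye(n_k)          # source rows become one-hot, preserving x_k^(d)
    return W_tilde                             # still row-stochastic after the replacement
```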

Now, we define channel-wise graph diffusion for the $d$-th channel as

\begin{split}&\tilde{x}^{(d)}(0)=\begin{bmatrix}x^{(d)}_{k}\\ \mathbf{0}_{u}\end{bmatrix}\\ &\tilde{x}^{(d)}(t)=\widetilde{W}^{(d)}\tilde{x}^{(d)}(t-1),\end{split} (5)

where $\tilde{x}^{(d)}(t)$ is the recovered feature for $x^{(d)}$ after $t$ propagation steps, $\mathbf{0}_{u}$ is a zero column vector of size $|\mathcal{V}^{(d)}_{u}|$, and $t\in[1,K]$. Note that we initialize the missing feature values $x^{(d)}_{u}$ to zero. As $K\rightarrow\infty$, this recursion converges (the proof is included in the supplementary material). We approximate the steady state by $\tilde{x}^{(d)}(K)$, which is calculated as $(\widetilde{W}^{(d)})^{K}\tilde{x}^{(d)}(0)$ with a large enough $K$. The diffusion is performed for each channel and outputs $\{\tilde{x}^{(d)}(K)\}^{F}_{d=1}$.

Because the nodes are reordered per channel before diffusion, the node indices in $\tilde{x}^{(d)}(K)$ differ across $d\in\{1,\dots,F\}$. Therefore, after restoring the original node order of $X$ in each $\tilde{x}^{(d)}(K)$, we concatenate all $\tilde{x}^{(d)}(K)$ along the channel dimension into $\tilde{X}$, the final output of this stage.
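Putting the pieces together, a dense-matrix sketch of the whole stage (Eq. (5) plus the per-channel reordering and assembly) is given below; it reuses the hedged helper functions sketched above and is not the authors' released code.

```python
import numpy as np

def channel_wise_diffusion(A, X, known, alpha, K=50):
    """Pre-recovered features X~ via K steps of channel-wise graph diffusion."""
    N, F = X.shape
    neighbors = [np.where(A[i] == 1)[0].tolist() for i in range(N)]
    S = spd_s(neighbors, known)                       # SPD-S matrix in the original node order
    X_tilde = np.zeros((N, F))
    for d in range(F):
        perm, x_d, A_d = reorder_channel(A, X, known, d)
        n_k = int(known[:, d].sum())
        x = x_d.astype(float).copy()
        x[n_k:] = 0.0                                 # initialize missing values x_u^(d) to zero
        W = channel_weight_matrix(A_d, S[perm, d], alpha)
        W_tilde = diffusion_matrix(W, n_k)
        for _ in range(K):                            # approximate the steady state with large K
            x = W_tilde @ x
        X_tilde[perm, d] = x                          # undo the per-channel reordering
    return X_tilde
```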

3.6 Node-wise Inter-channel Propagation

In the previous stage, we obtain $\tilde{X}=[\tilde{x}_{i,d}]$ (pre-recovered features for $X$) via channel-wise graph diffusion performed for each channel separately. The proposed feature diffusion is based on the graph structure and pseudo-confidence but does not consider the dependency between channels. Since the dependency between channels can be another important factor for recovering missing node features, we develop an additional scheme to update $\tilde{X}$ by considering both the channel correlation and the pseudo-confidence.

We first prepare a covariance matrix $C=[C_{a,b}]\in\mathbb{R}^{F\times F}$ that gives the covariance between each pair of channels. $C_{a,b}$, the covariance between $\tilde{X}_{:,a}$ and $\tilde{X}_{:,b}$, is calculated as

C_{a,b}=\frac{1}{N-1}\sum^{N}_{i=1}(\tilde{x}_{i,a}-m_{a})(\tilde{x}_{i,b}-m_{b}), (6)

where $m_{d}$ denotes the mean of the graph signal for the $d$-th channel, i.e., $m_{d}=\frac{1}{N}\sum^{N}_{i=1}\tilde{x}_{i,d}$.
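Since Eq. (6) is simply the unbiased sample covariance of the pre-recovered channels, it can be obtained in one call (a sketch, assuming `X_tilde` holds $\tilde{X}$):

```python
import numpy as np

# C[a, b] = Cov(X~_{:,a}, X~_{:,b}) with the 1/(N-1) normalization of Eq. (6); shape (F, F)
C = np.cov(X_tilde, rowvar=False)
```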

In this stage, unlike the previous stage that looks across the nodes for each channel, we look across the channels for each node. As shown in the right-hand graph of Figure 1, we define directed graphs $\{\mathcal{H}^{(i)}\}^{N}_{i=1}$, called node-wise inter-channel propagation graphs, from the given graph $\mathcal{G}$. $\mathcal{H}^{(i)}$ for the $i$-th node in $\mathcal{G}$ is defined as

(i)=(𝒱(i),(i),B(i)),\mathcal{H}^{(i)}=(\mathcal{V}^{(i)},\mathcal{E}^{(i)},B^{(i)}){\color[rgb]{0,0,1},} (7)

where $\mathcal{V}^{(i)}=\{v^{(i)}_{d}\}^{F}_{d=1}$ is the set of nodes in $\mathcal{H}^{(i)}$, $\mathcal{E}^{(i)}$ is the set of directed edges in $\mathcal{H}^{(i)}$, and $B^{(i)}\in\mathbb{R}^{F\times F}$ is a weighted adjacency matrix used to update $\tilde{X}_{i,:}$. To update $\tilde{x}_{i,d}$ of the $i$-th node via inter-channel propagation, we assign $\tilde{x}_{i,d}$ to each $v^{(i)}_{d}$ as a scalar node feature for the $d$-th channel ($d\in\{1,\dots,F\}$). The edge weights in $\mathcal{E}^{(i)}$ are given by $B^{(i)}$, which is described below.

Our design goals for $B^{(i)}$ for inter-channel propagation in each node are threefold: (1) highly correlated channels should exchange more information with each other than less correlated channels, (2) a high-PC channel feature should propagate more information to the other channels of the node than a low-PC channel feature, and (3) a low-PC channel feature should receive more information from the other channels for recovery than a high-PC channel feature. Based on these three design goals, the weight of the directed edge from the $b$-th channel to the $a$-th channel ($B^{(i)}_{a,b}$) in $B^{(i)}$ is designed by

B^{(i)}_{a,b}=\begin{cases}\gamma(1-\alpha^{\prime\,S_{i,a}})\frac{\alpha^{S_{i,b}}}{\sum^{F}_{b^{\prime}\neq a}\alpha^{S_{i,b^{\prime}}}}C_{a,b}&\text{if}\;a\neq b\\ 0&\text{if}\;a=b,\end{cases} (8)

where $C_{a,b}$, $\frac{\alpha^{S_{i,b}}}{\sum^{F}_{b^{\prime}\neq a}\alpha^{S_{i,b^{\prime}}}}$, and $(1-\alpha^{\prime\,S_{i,a}})$ are the terms that meet design goals (1), (2), and (3), respectively. $\alpha$ and $\alpha^{\prime}$ are hyperparameters for the pseudo-confidence in Definition 2, and $\gamma$ is a scaling hyperparameter. Note that we normalize the confidence $\alpha^{S_{i,b}}$ to prevent the aggregated channel features from growing excessively as $F$ increases.
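A minimal sketch of Eq. (8) for a single node $v_i$ is shown below (illustrative names; `S_i` is the $i$-th row of the SPD-S matrix and `C` the covariance matrix from Eq. (6)):

```python
import numpy as np

def inter_channel_weights(S_i, C, alpha, alpha_prime, gamma):
    """Weighted adjacency B^(i) of Eq. (8) for node v_i."""
    rho = alpha ** S_i                          # PC of each channel feature of v_i
    recv = 1.0 - alpha_prime ** S_i             # goal (3): low-PC channels receive more
    denom = rho.sum() - rho                     # denom[a] = sum_{b' != a} alpha ** S_{i,b'}
    B = gamma * recv[:, None] * (rho[None, :] / denom[:, None]) * C
    np.fill_diagonal(B, 0.0)                    # B_{a,a} = 0: no self-edges
    return B
```

Note that for a channel $a$ with $S_{i,a}=0$, the receiving gate $(1-\alpha^{\prime\,S_{i,a}})$ is zero, so known channel features are left unchanged by this stage.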

Node-wise inter-channel propagation on $\mathcal{H}^{(i)}$ outputs the final recovered features for $X_{i,:}$. We define node-wise inter-channel propagation as

\hat{X}^{\top}_{i,:}=\tilde{X}^{\top}_{i,:}+B^{(i)}(\tilde{X}_{i,:}-[m_{1},\,m_{2},\cdots,m_{F}])^{\top}, (9)

where $\hat{X}_{i,:}$ and $\tilde{X}_{i,:}$ are row vectors. While preserving the pre-recovered channel feature values (as self-loops), message passing among different channel features is performed along the directed edges of $B^{(i)}$. After calculating $\hat{X}_{i,:}$ for $i\in\{1,\dots,N\}$, we obtain the final recovered features by concatenating them, i.e., $\hat{X}=[{\hat{X}_{1,:}}^{\top}\;{\hat{X}_{2,:}}^{\top}\;\cdots\;{\hat{X}_{N,:}}^{\top}]^{\top}$. Moreover, since $C$ is calculated from the pre-recovered features $\tilde{X}$ of all nodes in $\mathcal{G}$, the inter-channel propagation injects global information into the recovered features for $X$. $\hat{X}$ is the final output of PC-based feature recovery and is fed to a GNN to solve a downstream task.
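For completeness, a hedged sketch of applying Eq. (9) to every node follows (again with illustrative names, reusing `inter_channel_weights` from the previous sketch):

```python
import numpy as np

def inter_channel_propagation(X_tilde, S, C, alpha, alpha_prime, gamma):
    """Final recovered features X^ via node-wise inter-channel propagation, Eq. (9)."""
    N, F = X_tilde.shape
    m = X_tilde.mean(axis=0)                              # channel means m_1, ..., m_F
    X_hat = np.empty_like(X_tilde)
    for i in range(N):
        B_i = inter_channel_weights(S[i], C, alpha, alpha_prime, gamma)
        X_hat[i] = X_tilde[i] + B_i @ (X_tilde[i] - m)    # Eq. (9) for node v_i
    return X_hat                                          # fed to the downstream GNN
```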

4 Experiments

4.1 Datasets

To validate our method, we conducted experiments on 6 benchmark datasets from two different domains: citation networks (Cora, CiteSeer, PubMed [sen2008collective], and OGBN-Arxiv [hu2020open]) and recommendation networks (Amazon-Computers and Amazon-Photo [shchur2018pitfalls]). For link prediction, we evaluated the methods on the 5 benchmark datasets excluding OGBN-Arxiv, which causes out-of-memory errors. The descriptions of the datasets are reported in the supplementary material.

4.2 Experimental Setup

Compared Methods. For semi-supervised node classification, we compared our method with two baselines and four state-of-the-art methods. We set Baseline 1 to a simple scheme that directly feeds the graph data with missing features to a GNN without recovery, where all missing values in the feature matrix are set to zero. We set Baseline 2 to label propagation (LP) [zhu2002learning], which does not use node features and propagates only the partially known labels to infer the remaining labels. That is, LP corresponds to the case of 100% missing features. The four state-of-the-art methods can be categorized into two approaches: GCN-variant models (GCNMF [taguchi2021graph], PaGNN [jiang2020incomplete]) and feature imputation methods (sRMGCNN [monti2017geometric], FP [rossi2021unreasonable]). For Baseline 1, sRMGCNN, FP, and our method, a vanilla GCN [kipf2016semi] is used only for the downstream task. The GCN-variant models perform both feature recovery and the downstream task in a single GCN framework, where a feature recovery scheme is added to the vanilla GCN. In all the compared methods, the same GCN [kipf2016semi] is used.

For link prediction, we compared our method with sRMGCNN and FP, which follow the feature imputation approach. To perform link prediction, graph auto-encoder (GAE) [kipf2016variational] models are adopted, and the features inferred by each method are used as the input of the GAE models.

Data Settings. Regardless of the type of task, we removed features according to a missing rate $r_{m}$ $(0<r_{m}<1)$. Missing features were selected in one of the two ways below (a minimal sketch of both settings is given after the list).

  • Structural missing. We first randomly selected nodes in a ratio of $r_{m}$ among all nodes. Then, we set all features of the selected nodes to missing (unknown) values (zero).

  • Uniform missing. We randomly selected features in a ratio of $r_{m}$ from the node feature matrix $X$, and set the selected features to missing (unknown) values (zero).
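The following sketch generates the two missing-feature settings (our own illustrative code; the authors' actual data pipeline may differ). Both functions return a boolean mask `known` and a masked feature matrix with missing entries set to zero.

```python
import numpy as np

def structural_missing(X, r_m, rng=None):
    """All channels of a random r_m fraction of nodes are removed."""
    rng = np.random.default_rng() if rng is None else rng
    N, F = X.shape
    known = np.ones((N, F), dtype=bool)
    drop = rng.choice(N, size=int(round(r_m * N)), replace=False)
    known[drop, :] = False
    return known, np.where(known, X, 0.0)

def uniform_missing(X, r_m, rng=None):
    """A random r_m fraction of the N*F entries of X is removed."""
    rng = np.random.default_rng() if rng is None else rng
    N, F = X.shape
    drop = rng.choice(N * F, size=int(round(r_m * N * F)), replace=False)
    known = np.ones(N * F, dtype=bool)
    known[drop] = False
    return known.reshape(N, F), np.where(known.reshape(N, F), X, 0.0)
```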

For semi-supervised node classification, we randomly generated 10 different training/validation/test splits, except for OGBN-Arxiv, where the split was fixed according to the specified criteria. In each generated split, 20 nodes per class were assigned to the training set, and a total of 1500 nodes were assigned to the validation set regardless of class. For the test set, we used all nodes except the training and validation nodes.

For link prediction, we also randomly generated 10 different training/validation/test splits for each dataset. We split the edges with the same ratio for all the datasets. As in the setting of [kipf2016variational], we used 85% of the edges for training, 5% for validation, and 10% for testing.


Figure 2: Average accuracy (%) on the 6 datasets with $r_{m}\in\{0,0.5,0.9,0.995\}$. sRMGCNN and GCNMF are excluded due to out-of-memory results on some datasets and significantly poor performance on all the available datasets, as shown in Table LABEL:sota_table.

Evaluation. For node classification, we report the accuracy evaluated on 10 different pairs of splits for node labels and missing features. For link prediction, we report the area under the ROC curve (AUC) and average precision (AP) scores on 10 different pairs of splits for edges and missing features. A detailed description of the hyperparameters used in the experiments and the implementation of each method is given in the supplementary material.

4.3 Semi-Supervised Node Classification Results

Figure 2 shows the trend of the average accuracy of the compared methods for node classification on the 6 datasets with different $r_{m}$. The performance gain of PCFR is remarkable at $r_{m}=0.995$. In contrast, the average accuracy of the existing methods decreases rapidly as $r_{m}$ increases and is overtaken by LP, which does not utilize features. In the case of uniform missing features, FP shows better resistance than LP, but its gap from ours increases as $r_{m}$ increases. The high performance of PCFR comes from the proposed channel-wise confidence, which is an important concept for dealing with missing channel features. We conclude that PCFR effectively contributes to the performance improvement of node classification.

Table LABEL:sota_table shows the detailed results of node classification at $r_{m}=0.995$. sRMGCNN and GCNMF show significantly low performance in all experiments in this extremely challenging environment. Even PaGNN generally shows worse accuracy than Baseline 2 (LP). For all the datasets except OGBN-Arxiv, PCFR shows superior performance to the other methods at $r_{m}=0.995$. An ablation study analyzing the effectiveness of channel-wise graph diffusion and node-wise inter-channel propagation is provided in the supplementary material.

4.4 Link Prediction Results

Table LABEL:tab:lp shows the results for the link prediction task at $r_{m}=0.995$. While GCNMF and PaGNN can only deal with node classification, the feature imputation methods can be applied to link prediction. Among them, PCFR achieves state-of-the-art performance across all the datasets. Based on the results on semi-supervised node classification and link prediction, which are representative graph learning tasks, PCFR demonstrates its effectiveness at a very high rate of missing features.


5 Conclusion

We introduced a novel concept of channel-wise confidence to recover missing features in a graph even at a high missing rate. To replace the unavailable true confidence, we designed a pseudo-confidence that can be obtained from the channel-wise shortest path distance between a node and its nearest known-feature node. Using the pseudo-confidence, we developed a new framework for missing feature recovery that consists of channel-wise graph diffusion and node-wise inter-channel propagation. As validated in the experiments, the proposed method outperforms existing methods on both node classification and link prediction. The channel-wise confidence approach for missing feature recovery can easily be applied to various graph-related downstream tasks with missing node features.

Limitation. Many real-world networks are not limited to pairwise relations or single-type nodes adopted in our work. Therefore, it is an interesting future work to extend our work to more challenging structures such as hypergraphs for modeling higher-order relations or heterogeneous graphs with multi-type nodes.

Potential negative societal impacts. Intentionally removed private or confidential information could be recovered using the proposed method, and the recovered information could be misused. Therefore, we suggest that this work be used with the aim of positive societal impacts, such as health care [wang2020guardhealth, deng2020cola], crime prediction [wang2021hagen], and weather forecasting [han2022semi].

Checklist

1. For all authors…
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] See Section 1.
   (b) Did you describe the limitations of your work? [Yes] See Section 5.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 5.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] Our paper conforms to the ethics review guidelines.

2. If you are including theoretical results…
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 3.4 and Section 3.5.
   (b) Did you include complete proofs of all theoretical results? [Yes] See the supplementary material.

3. If you ran experiments…
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4.2 and the supplementary material.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Table LABEL:sota_table and Table LABEL:link_pre_table.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the supplementary material.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
   (a) If your work uses existing assets, did you cite the creators? [Yes] We cited them.
   (b) Did you mention the license of the assets? [Yes] See Sec. C.3 in the supplementary material.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No] We did not include any new assets.
   (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects…
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]