
Multi-Label Graph Convolutional Network Representation Learning

Min Shi1, Yufei Tang1, Xingquan Zhu1 and Jianxun Liu2 1 Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, USA
2 School of Computer Science and Engineering, Hunan University of Science and Technology, China
Email: {mshi2018, tangy, xzhu3}@fau.edu, [email protected]
Abstract

Knowledge representation of graph-based systems is fundamental across many disciplines. To date, most existing methods for representation learning primarily focus on networks with simplex labels, yet real-world objects (nodes) are inherently complex in nature and often carry rich semantics or labels; e.g., a user may belong to diverse interest groups of a social network, resulting in multi-label networks for many applications. In a multi-label network, each node not only has multiple labels, but such labels are often highly correlated, making existing methods ineffective or unable to handle such correlation for node representation learning. In this paper, we propose a novel multi-label graph convolutional network (ML-GCN) for learning node representations for multi-label networks. To fully explore label-label correlation and network topology structures, we propose to model a multi-label network with two Siamese GCNs built on a node-node-label graph and a label-label-node graph, respectively. The two GCNs each handle one aspect of representation learning, for nodes and for labels, and they are seamlessly integrated under one objective function. The learned label representations effectively preserve the label-label interaction and node label properties, and are then aggregated to enhance the node representation learning under a unified training framework. Experiments and comparisons on multi-label node classification validate the effectiveness of our proposed approach.

Index Terms:
Multi-label learning, network embedding, deep neural networks, label correlation

I Introduction

Graphs have become increasingly common structures for organizing data in many complex systems such as sensor networks, citation networks, social networks and many more [1]. Such developments have raised new requirements for efficient network representation or embedding learning algorithms in various real-world applications, which seek to learn low-dimensional vector representations of all nodes with graph topology structures, such as edge links, degrees, and communities, preserved. The graph edges inherently reflect semantic relevance between nodes, where nodes with similar neighborhood structures usually tend to share identical labeling information, i.e., clustering together characterized by a single grouping label. For example, in a scientific collaboration network, two connected authors often belong to a common area of science [2, 3], and in a protein-protein interaction network, proteins that co-appear in many identical protein complexes are likely to serve similar biological functions.

To date, a large body of work has been focused on the representation learning of graphs with simplex labels [4, 5], where each node only has a single label which is used to model node relationships, i.e., two nodes in a neighborhood are forced to have an identical unique label in the learning process. However, graph nodes associated with multiple labels are ubiquitous in many real-world applications. For example, in an image network, a photograph can belong to more than one semantic class, such as sunsets and beaches. In a patient social network, a patient may be suffering from diabetes and kidney cancer at the same time. Similarly, in many social networks, such as BlogCatalog and Flickr, users are allowed to join various groups that respectively represent their multiple interests. In all these networks, each node not only has content (or features) but is also associated with multiple class labels.

In general, multi-label graphs differ from simplex-label graphs primarily in two aspects. First, every node in a multi-label graph can be associated with a set of labels, so graph structures usually encode much more complicated relationships between nodes with shared labels, i.e., an edge could either reflect a simple relationship of some single label or interpret a very complex relationship of multiple combined labels. Second, it has been widely accepted that label correlations and dependencies are widespread among multiple labels [6, 7], e.g., sunsets are frequently correlated with beaches, and diabetes can eventually lead to kidney cancer. Therefore, the correlation and interaction between labels could provide implicit and supplemental factors to enhance and differentiate node relationships that cannot be explicitly captured by the discrete and independent labels in a simplex-label graph.

Indeed, multi-label learning is a fundamental problem in the machine learning community [8], with significant attention in many research domains such as computer vision [9], text classification [10] and tag recommendation [11]. However, research on multi-label graph learning is still in its infancy. Existing methods either consider graphs with simplex labels [5, 12] or treat multiple labels as plain attribute information to enhance the graph learning process [13, 14]. Such learning paradigms, however, neglect the fact that the information of one label may be helpful for the learning of another related label [6]; the label correlations may provide helpful extra information especially when some labels have insufficient training examples. To address this constraint and meanwhile push graph learning forward, we propose multi-label graph representation learning in this paper, where each node has a collection of features as well as a set of labels. Figure 1 illustrates the difference between our studied problem and traditional simplex-label graph learning. We argue that the key to multi-label graph learning is to efficiently combine network structures, node features and label correlations for enriched node relationship modeling in a mutually reinforced manner.

Figure 1: Illustration of the difference between simplex-label graph learning and multi-label graph learning, where the labels of each node are highlighted. In a simplex-label graph, each node is associated with only one label, and single-label classification is performed based on the learned representations. In a multi-label graph, each node may be associated with multiple labels, and these node labels are often highly correlated to represent node semantics.

Incorporating labels and their correlations with graph structures for graph representation learning is a nontrivial task. First, in a multi-label graph, two linked nodes may share one or multiple identical labels, so their affinity cannot be simply determined by one observed edge that is indistinguishable from others. Second, while each label can be seen as an abstraction of nodes sharing similar network structures and features, the label-label correlations can dramatically impact the node-node interactions, so it is hard to constrain and balance these two aspects of relation modeling for an optimal graph representation learning as a whole. Recently, a general class of neural networks called Graph Convolutional Networks (GCN) [15] has shown good performance in learning node representations from graph structures and features through supervised single-label node classification training. GCN operates directly on a graph and induces embedding vectors of nodes based on a spectral convolutional filter that enforces each node to aggregate features from all neighbors to form its representation.

In this paper, we advance this emerging tool to multi-label node classification and propose a novel model called Multi-Label GCN (ML-GCN) to specifically handle the multi-label graph learning problem. ML-GCN contains two Siamese GCNs to learn label and node representations from a high-layer label-label-node graph and a low-layer node-node-label graph, respectively. The high-layer graph learning serves to model label correlations; it only updates the label representations, with labels, label correlations and node community information preserved, by performing a single-label classification. The derived label representations are subsequently aggregated to enhance the low-layer graph learning, which carries out node representation learning from graph structures and features by performing a multi-label classification. Learning in these two layers can enhance each other in an alternating training manner to optimize a collective classification objective.

Our main contributions are summarized as follows:

  1. We advance the traditional simplex-label graph learning to a multi-label graph learning setting, which is more general and common in many real-world graph-based systems.

  2. Instead of treating multiple labels as flat attributes, as many existing methods do, we propose to leverage label correlations to strengthen and differentiate edge relationships between nodes.

  3. We propose a novel model, ML-GCN, to handle multi-label graphs. It can simultaneously integrate graph structures, features, and label correlations for enhanced node representation learning and classification.

The rest of this paper is organized as follows. Section II surveys the related work. Section III reviews some preliminaries, including definition of the multi-label graph learning problem and the graph convolutional networks used in our approach. The proposed model for multi-label graph embedding is introduced in Section IV. Section V reports experiments and comparisons to validate the proposed approach. Finally, Section VI concludes the paper.

II Related Works

This section presents existing works related to our studied problem in this paper, including multi-label learning and graph representation learning.

II-A Multi-label Learning

Multi-label learning is a classical research problem in the machine learning community with applications ranging from document classification and gene function prediction to automatic image annotation [16]. In a multi-label learning task, each instance is associated with multiple labels represented by a sparse label vector. The objective is to learn a classifier that can automatically assign an instance the most relevant subset of labels [8]. Techniques for multi-label classification can be broadly divided into two categories [17]: problem transformation-based and algorithm adaptation-based. The former class of methods generally transforms the multi-label classification task into a series of binary classification problems [6, 18], while adaptation-based methods try to generalize some popular learning algorithms to enable a multi-label learning setting [19, 20].

Multi-label learning methods for graph-based data did not attract much attention in the past. DeepWalk [21] was proposed to learn graph representations that are then used for training a multi-label classifier. However, DeepWalk only exploits graph structures, with valuable label and label correlation information not preserved in the learned node embeddings. Wang et al. [4] and Huang et al. [14] proposed to leverage labeling information along with graph structures for enriched representation learning. However, these methods either consider simplex-label graphs or treat multiple labels as plain attributes to support graph structure modeling. Such paradigms still neglect frequent label correlations and dependencies, which are demonstrably helpful properties in multi-label learning problems [9, 10].

II-B Graph Representation Learning

Graph representation learning [1, 22] seeks to learn low-dimensional vector representations of a given network such that various downstream analytic tasks like link prediction and node classification can benefit. Traditional methods in this area are generally developed based on shallow neural models, such as DeepWalk [21], Node2vec [23] and LINE [30]. To preserve the node neighborhood relationships, they typically perform truncated random walks over the whole graph to generate a collection of fixed-length node sequences, where nodes within the same sequence are assumed to have semantic connections and will be mapped to be close in the learned embedding space. However, the above methods only model edge links to constrain node relations, which may be insufficient especially when the network structures are very sparse. To mitigate this issue, many methods [3, 25] have been proposed to additionally embed the rich associated network contents or features, such as the user profiles in a social network and the publication descriptions in a citation network. For example, TriDNR [26] was proposed to learn simultaneously from network structures and textual contents, where structures and texts mutually boost each other to collectively constrain the similarities between learned node representations. In general, most real-world graphs are sparse in connectivity (e.g., each node connects to only a few others in a huge node space), while node contents or features can be leveraged to either enhance node relevance or repair the missing links in the original network structures [27].

However, the above representation learning methods belong to the class of shallow neural models, which may have limitations in learning complex relational patterns between graph nodes. Recently, there is growing interest in adapting deep neural networks to handle non-Euclidean graph data [12, 28]. Several works seek to apply the concepts of convolutional neural networks to process arbitrary graph structures [15, 24], with GCN [15] achieving state-of-the-art representation learning and node classification performance on a number of benchmark graph datasets. Following this success, Yao et al. [29] proposed a Text GCN for document embedding and text classification based on a constructed heterogeneous word-document graph. Graph Attention Networks (GAT) [12] are another recently proposed end-to-end neural network structure similar to GCNs, which introduce attention mechanisms that assign larger weights to the more important nodes, walks, or models. Inspired by these deep neural models targeted mostly at simplex-label graphs, we generalize GCN and propose a novel training framework, ML-GCN, to address the multi-label graph learning problem in this paper.

Figure 2: The proposed ML-GCN model for multi-label graph learning. (a) shows a multi-label graph organized in two layers. (b) shows the proposed architecture that contains two Siamese GCNs to learn from the label-label-node graph and node-node-label graph, respectively. The upper panel uses label-label-node graph to learn label representation (from right to left), and the lower panel uses node-node-label graph to learn node representation (from left to right).

III Problem Definition & Preliminaries

III-A Problem Definition

A multi-label graph is represented as $G=(\mathbf{v},\mathbf{e},\mathbf{l},\mathbf{A},\mathbf{B},\mathbf{X})$, where $\mathbf{v}=\{v_{i}\}_{i=1,\cdots,n}$ is a set of unique nodes, $\mathbf{e}=\{e_{i,j}\}_{i,j=1,\cdots,n;\,i\neq j}$ is a set of edges, and $\mathbf{l}=\{c_{r}\}_{r=1,\cdots,m}$ is a set of unique labels. $n$ is the total number of nodes in the graph and $m$ is the total number of labels in the labeling space. $\mathbf{A}$ is an $n\times n$ adjacency matrix with $\mathbf{A}_{i,j}=w_{i,j}>0$ if $e_{i,j}\in\mathbf{e}$ and $\mathbf{A}_{i,j}=0$ if $e_{i,j}\notin\mathbf{e}$. $\mathbf{B}$ is an $n\times m$ label affiliation matrix with $\mathbf{B}_{i,r}=1$ if $v_{i}$ has label $c_{r}\in\mathbf{l}$ and $\mathbf{B}_{i,r}=0$ otherwise. Finally, $\mathbf{X}\in\mathbb{R}^{n\times d_{i}}$ is a matrix containing the features of all $n$ nodes, i.e., $\mathbf{X}_{i}\in\mathbb{R}^{d_{i}}$ represents the feature vector of node $v_{i}$, where $d_{i}$ is the feature vector's dimension.

In this paper, multi-label graph learning aims to represent the nodes of graph $G$ in a new $m$-dimensional feature space $\mathcal{H}^{m}$ with information of graph structures, features, labels and label correlations preserved, i.e., to learn a mapping $f:G\rightarrow\{\mathbf{h}_{i}\}_{i=1,\cdots,n}$ such that $\mathbf{h}_{i}\in\mathcal{H}^{m}$ can be used to accurately infer the labels associated with node $v_{i}$.
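To make the notation concrete, the following minimal sketch (our own illustration with hypothetical toy sizes, not code released with the paper) shows one way the components of $G$ could be stored:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy sizes: n nodes, m labels, d_i input feature dimensions.
n, m, d_i = 5, 3, 4

A = sp.lil_matrix((n, n))        # n x n weighted adjacency matrix A
A[0, 1] = A[1, 0] = 1.0          # edge e_{0,1} with weight w_{0,1} = 1
A[1, 2] = A[2, 1] = 1.0

B = np.zeros((n, m))             # n x m label affiliation matrix B
B[0, [0, 2]] = 1                 # node v_0 carries labels c_0 and c_2 (multi-label)
B[1, 1] = 1

X = np.random.rand(n, d_i)       # feature matrix X; row i is the feature vector of node v_i
```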

III-B Graph Convolutional Networks

GCN [22] is a general class of convolutional neural networks that operate directly on graphs for node representation learning and classification by encoding both graph structures and node features. In this paper, we focus on the spectral-based GCN [15], which assumes that neighboring nodes tend to have identical labels, enforced by having each node gather features from all its neighbors to form its representation. Given a network $G=(\mathbf{v},\mathbf{e},\mathbf{X})$ with $n$ nodes, where each node has a set of $d_{i}$-dimensional features ($\mathbf{X}\in\mathbb{R}^{n\times d_{i}}$ denotes the feature matrix of all nodes), GCN takes this graph as input and obtains new low-dimensional vector representations of all nodes through a convolutional learning process. More specifically, with one convolutional layer, GCN is able to preserve the 1-hop neighborhood relationships between nodes, where each node is represented as an $m$-dimensional vector. The output feature matrix for all nodes, $\mathbf{X}^{(1)}\in\mathbb{R}^{n\times m}$, is computed by:

$\mathbf{X}^{(1)}=\rho(\tilde{\mathbf{A}}\mathbf{X}^{(0)}\mathbf{W}_{0})$   (1)

where $\tilde{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{I}+\mathbf{A})\mathbf{D}^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix, $\mathbf{D}$ is the degree matrix of $(\mathbf{I}+\mathbf{A})$, and $\mathbf{I}$ is an identity matrix of corresponding shape. $\mathbf{X}^{(0)}\in\mathbb{R}^{n\times d_{i}}$ is the input feature matrix (e.g., $\mathbf{X}^{(0)}=\mathbf{X}$) for GCN and $\mathbf{W}_{0}\in\mathbb{R}^{d_{i}\times m}$ is the weight matrix of the first convolutional layer. $\rho$ is an activation function, such as the ReLU $\rho(x)=\max(0,x)$. If it is necessary to encode $k$-hop neighborhood relationships, one can easily stack multiple GCN layers, where the output node features of the $j$th ($1\leq j\leq k$) layer are calculated by:

$\mathbf{X}^{(j+1)}=\rho(\tilde{\mathbf{A}}\mathbf{X}^{(j)}\mathbf{W}_{j})$   (2)

where $\mathbf{W}_{j}\in\mathbb{R}^{d_{h}\times d_{h}}$ is the weight matrix for the $j$th layer and $d_{h}$ is the dimension of the feature vectors output by the hidden convolutional layers.
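As a point of reference, the propagation rule of Eqs. (1)-(2) can be sketched in a few lines of NumPy; this is a hedged, dense toy illustration (random weights, a random toy graph, and names of our own choosing), not a tuned implementation:

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2}(I + A)D^{-1/2} used in Eq. (1)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops: I + A
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # inverse square root of the degrees of (I + A)
    return np.diag(d_inv_sqrt) @ A_hat @ np.diag(d_inv_sqrt)

def gcn_layer(A_tilde, X, W, activation=lambda x: np.maximum(0, x)):
    """One spectral convolution: X^{(j+1)} = rho(A_tilde X^{(j)} W_j), Eq. (2)."""
    return activation(A_tilde @ X @ W)

# Toy forward pass through a two-layer GCN with random weights (hypothetical sizes).
n, d_i, d_h, m = 6, 8, 4, 3
A = np.triu(np.random.binomial(1, 0.3, size=(n, n)), 1)
A = A + A.T                                        # random undirected adjacency matrix
X0 = np.random.rand(n, d_i)
W0, W1 = np.random.rand(d_i, d_h), np.random.rand(d_h, m)

A_tilde = normalize_adjacency(A)
X1 = gcn_layer(A_tilde, X0, W0)                    # encodes 1-hop neighborhoods
X2 = gcn_layer(A_tilde, X1, W1)                    # a second layer reaches 2-hop neighborhoods
```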

IV The Proposed Approach

In this section, we first present the proposed Multi-Label Graph Convolutional Network (ML-GCN) model, where node representations are learned and trained through supervised classification. Then, we provide the training and optimization details that integrate node and label representation learning under a collective objective, followed by a computational complexity analysis of ML-GCN.

IV-A Multi-Label Graph Convolutional Networks

As discussed in previous sections, the key challenge of multi-label graph learning is to simultaneously learn from graph structures, features, labels and label correlations, where different aspects of learning could enhance each other to achieve a globally good network representation. To support the incorporation of labels, we could simply build a heterogeneous node-label graph similar to Text GCN [29], where common nodes and label nodes are directly connected by their labeling relationships. However, such a design makes it hard to model higher-order label correlations, since labels must reach each other through common nodes, i.e., one cannot directly encode $k$-hop neighborhood node relations and label correlations with a GCN of $k$ convolutional layers. To enable immediate and flexible label interactions, we consider a stratified graph structure shown in Fig. 2(a), which is defined as follows:

Label-label-node graph: In this graph, labels connect each other by their co-occurrence relations, i.e., two labels have an edge if they appear at the same time in the label set of some common node. Meanwhile, common nodes are seen as attributes that link with label nodes based on their corresponding labeling relationships.

Node-node-label graph: In this graph, common nodes link together to form the original graph structure. Meanwhile, associated label nodes are seen as the attributes of the common node-node graph.

Such a construction of the layered multi-label graph in Fig. 2(a) brings three main favorable properties. First, the label-label connectivity allows direct and efficient higher-order label interactions by simply adjusting the number of convolutional layers in the GCN. In addition, taking common nodes as attributes of the label nodes makes it possible to encode graph community information in the learned label representations, as nodes with identical labels tend to form a cluster or community. Lastly, the learned node representations can naturally preserve labels, label correlations and graph community information by taking label nodes as attributes of the node-node graph. A sketch of how the two layered graphs could be assembled appears below.
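The following is a hedged sketch (our own code with hypothetical names, not released by the authors) of one way the two layered graphs could be built from the adjacency matrix $\mathbf{A}$ and the label affiliation matrix $\mathbf{B}$: labels are connected by co-occurrence, and the other layer's vertices are attached only as attributes.

```python
import numpy as np

def build_label_label_node(A, B):
    """Label-label-node graph: labels linked by co-occurrence; common nodes attached as attributes."""
    C = (B.T @ B > 0).astype(float)       # m x m label co-occurrence: labels sharing >= 1 node
    np.fill_diagonal(C, 0)                # self-loops are added later through the identity matrix
    # (m + n) x (m + n) block adjacency F: label-label block C, label-node block B^T
    F = np.block([[C, B.T],
                  [B, np.zeros_like(A)]])
    return C, F

def build_node_node_label(A, B):
    """Node-node-label graph: original node-node edges; label nodes attached as attributes."""
    m = B.shape[1]
    # (n + m) x (n + m) block adjacency E: node-node block A, node-label block B
    E = np.block([[A, B],
                  [B.T, np.zeros((m, m))]])
    return E
```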

Fig. 2(b) shows the proposed ML-GCN model, which contains two Siamese GCNs to simultaneously learn the label and node representations from the given multi-label graph, where the input feature vectors of both the attributed label and common nodes are regularly updated during training. First, the high-layer GCN learns label representations from the label-label-node graph through supervised single-label classification. Let $\mathbf{Y}\in\mathbb{R}^{m\times d_{i}}$ be the input feature matrix of all $m$ label nodes, $\mathbf{C}$ be the $m\times m$ adjacency matrix recording the co-occurrence relations between label nodes, and $\mathbf{F}$ be the $(n+m)\times(n+m)$ adjacency matrix of the input label-label-node graph. The first convolutional layer aggregates information from both the neighborhood label nodes and the associated common nodes (e.g., label node L1 in Fig. 2(a)), where the new $m$-dimensional label node feature matrix $\mathbf{L}^{(1)}\in\mathbb{R}^{m\times m}$ is computed by:

$\mathbf{L}^{(1)}=\rho(\tilde{\mathbf{F}}^{*}\mathbf{Y}^{*}\mathbf{W}_{0}^{l})$   (3)

where $\rho$ is an activation function such as the ReLU $\rho(x)=\max(0,x)$, $\mathbf{W}_{0}^{l}\in\mathbb{R}^{d_{i}\times m}$ is a weight matrix for the first label-label-node GCN layer, and $\mathbf{Y}^{*}=[\mathbf{Y};\mathbf{X}]^{T}$ is a vertically stacked $(m+n)\times d_{i}$ feature matrix. $\tilde{\mathbf{F}}^{*}$ is a truncated normalized symmetric adjacency matrix obtained by:

$\mathbf{F}^{*}=\mathbf{F}+\mathbf{I}_{m+n};\quad\tilde{\mathbf{F}}=\mathbf{D}_{f}^{-\frac{1}{2}}\mathbf{F}^{*}\mathbf{D}_{f}^{-\frac{1}{2}};\quad\tilde{\mathbf{F}}^{*}=\tilde{\mathbf{F}}[:m]$   (4)

where $\mathbf{I}_{m+n}$ is the identity matrix and $\mathbf{D}_{f}$ is the degree matrix with $\mathbf{D}_{f,ii}=\sum_{j}\mathbf{F}^{*}_{ij}$. A one-layer GCN only incorporates immediate label node neighbors. When higher-order label correlations need to be preserved, we can easily stack multiple GCN layers (e.g., layer number $k\geq 2$) by:
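A small sketch of Eq. (4), our own code rather than the authors' implementation, illustrates the truncation step; by symmetry the same routine also serves Eq. (10) with $\mathbf{E}$ and the first $n$ rows kept.

```python
import numpy as np

def truncated_normalized_adjacency(F, keep_rows):
    """Eq. (4): F* = F + I, F_tilde = D^{-1/2} F* D^{-1/2}, keep only the first `keep_rows` rows."""
    F_star = F + np.eye(F.shape[0])              # add self-loops over the joint (m + n) graph
    d = F_star.sum(axis=1)                       # degrees D_{f,ii} = sum_j F*_{ij}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    F_tilde = D_inv_sqrt @ F_star @ D_inv_sqrt   # symmetric normalization
    return F_tilde[:keep_rows]                   # truncate so the output rows cover only label nodes

# Usage (hypothetical): with F the (m + n) x (m + n) label-label-node adjacency and
# Y_star = [Y; X] the stacked features, Eq. (3) becomes
#   L1 = relu(truncated_normalized_adjacency(F, m) @ Y_star @ W0_l)
```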

$\mathbf{L}^{(k)}=\rho(\tilde{\mathbf{C}}\mathbf{L}^{(k-1)}\mathbf{W}_{k}^{l})$   (5)

where $\tilde{\mathbf{C}}=\mathbf{D}_{c}^{-\frac{1}{2}}(\mathbf{C}+\mathbf{I}_{m})\mathbf{D}_{c}^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix and $\mathbf{D}_{c,ii}=\sum_{j}(\mathbf{C}+\mathbf{I}_{m})_{ij}$. The label embeddings output by the last layer have the same size as the total number of labels, $m$, and are passed through a softmax classifier to perform single-label classification (e.g., assuming a two-layer GCN) by:

$\mathbf{O}^{l}=\tilde{\mathbf{C}}\,\mathrm{ReLU}(\tilde{\mathbf{F}}^{*}\mathbf{Y}^{*}\mathbf{W}_{0}^{l})\,\mathbf{W}_{1}^{l}$   (6)
$\mathbf{Z}^{l}=\mathrm{softmax}(\mathbf{O}^{l})=\frac{\exp(\mathbf{o}^{l})}{\sum_{i}\exp(\mathbf{o}_{i}^{l})}$   (7)

where $\mathbf{W}_{0}^{l}\in\mathbb{R}^{d_{i}\times d_{h}}$ and $\mathbf{W}_{1}^{l}\in\mathbb{R}^{d_{h}\times m}$ are the weight matrices for the first and second label-label-node GCN layers, respectively. Let $\mathbf{Y}^{l}$ be the one-hot label indicator matrix of all label nodes; the classification loss is then defined as the cross-entropy error:

$\mathcal{L}_{1}=-\sum_{d=1}^{m}\mathbf{Y}_{d}^{l}\ln\mathbf{Z}_{d}^{l}$   (8)

Then, the low-layer GCN learns node representations from the node-node-label graph. Similarly, in the first layer each node aggregates information from both its neighboring common nodes and its associated attributed label nodes (e.g., take node V2 in Fig. 2(a) as an example). Let $\mathbf{E}$ be the $(n+m)\times(n+m)$ adjacency matrix of the input node-node-label graph; the node embeddings output by the first GCN layer are computed as:

$\mathbf{N}^{(1)}=\rho(\tilde{\mathbf{E}}^{*}\mathbf{X}^{*}\mathbf{W}^{v}_{0})$   (9)

where $\mathbf{W}^{v}_{0}\in\mathbb{R}^{d_{i}\times m}$ is a weight matrix for the first node-node-label GCN layer, $\mathbf{X}^{*}=[\mathbf{X};\mathbf{Y}]^{T}$, and $\tilde{\mathbf{E}}^{*}$ is a truncated normalized symmetric adjacency matrix obtained by:

$\mathbf{E}^{*}=\mathbf{E}+\mathbf{I}_{n+m};\quad\tilde{\mathbf{E}}=\mathbf{D}_{e}^{-\frac{1}{2}}\mathbf{E}^{*}\mathbf{D}_{e}^{-\frac{1}{2}};\quad\tilde{\mathbf{E}}^{*}=\tilde{\mathbf{E}}[:n]$   (10)

where $\mathbf{D}_{e}$ is the degree matrix with $\mathbf{D}_{e,ii}=\sum_{j}\mathbf{E}^{*}_{ij}$. As in the label-label-node graph, we can also incorporate $k$-hop neighborhood information by stacking multiple GCN layers:

$\mathbf{N}^{(k)}=\rho(\tilde{\mathbf{A}}\mathbf{N}^{(k-1)}\mathbf{W}^{v}_{k})$   (11)

where $\tilde{\mathbf{A}}=\mathbf{D}_{a}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I}_{n})\mathbf{D}_{a}^{-\frac{1}{2}}$ and $\mathbf{D}_{a,ii}=\sum_{j}(\mathbf{A}+\mathbf{I}_{n})_{ij}$. The node embeddings output by the last layer have size $m$ and are passed through a sigmoid transformation to perform supervised multi-label classification, with the collective cross-entropy loss (a two-layer GCN is used in this paper) over all labeled nodes computed by:

$\mathbf{O}^{v}=\tilde{\mathbf{A}}\,\mathrm{ReLU}(\tilde{\mathbf{E}}^{*}\mathbf{X}^{*}\mathbf{W}^{v}_{0})\,\mathbf{W}^{v}_{1}$   (12)
$\mathcal{L}_{2}=-\sum_{i\in\mathbf{y}}\mathcal{L}_{2}^{*}$   (13)

where $\mathbf{W}^{v}_{0}\in\mathbb{R}^{d_{i}\times d_{h}}$ and $\mathbf{W}^{v}_{1}\in\mathbb{R}^{d_{h}\times m}$ are the weight matrices for the first and second node-node-label GCN layers, respectively, and $\mathbf{y}$ is the set of indices of nodes that have labels. Let $\mathbf{Y}^{v}$ be the binary label indicator matrix of all common nodes; then $\mathcal{L}_{2}^{*}$ is calculated as:

$\begin{aligned}
\mathcal{L}_{2}^{*} &= \mathbf{Y}_{i}^{v}\log(\sigma(\mathbf{O}_{i}^{v}))+(1-\mathbf{Y}_{i}^{v})\log(1-\sigma(\mathbf{O}_{i}^{v})) \\
&= \mathbf{Y}_{i}^{v}\log\left(\frac{1}{1+\exp(-\mathbf{O}_{i}^{v})}\right)+(1-\mathbf{Y}_{i}^{v})\log\left(\frac{\exp(-\mathbf{O}_{i}^{v})}{1+\exp(-\mathbf{O}_{i}^{v})}\right) \\
&= -\mathbf{Y}_{i}^{v}\log\left(1+\exp(-\mathbf{O}_{i}^{v})\right)-(1-\mathbf{Y}_{i}^{v})\left(\mathbf{O}_{i}^{v}+\log\left(1+\exp(-\mathbf{O}_{i}^{v})\right)\right) \\
&= -(1-\mathbf{Y}_{i}^{v})\,\mathbf{O}_{i}^{v}-\log\left(1+\exp(-\mathbf{O}_{i}^{v})\right).
\end{aligned}$   (14)
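For reference, the collective loss of Eqs. (13)-(14) over the labeled nodes can be sketched as follows; this is our own hedged implementation written in the numerically stable form (mathematically equivalent to the derivation above, including the outer negation of Eq. (13)), not the authors' code.

```python
import numpy as np

def multilabel_bce_loss(O, Y, labeled_idx):
    """Collective sigmoid cross-entropy of Eqs. (13)-(14).

    O: (n, m) logits from the node-node-label GCN; Y: (n, m) binary label indicator matrix;
    labeled_idx: indices of nodes with observed labels.  Uses the stable identity
    -[y*log(sigmoid(o)) + (1-y)*log(1-sigmoid(o))] = max(o, 0) - o*y + log(1 + exp(-|o|)).
    """
    O, Y = O[labeled_idx], Y[labeled_idx]
    per_entry = np.maximum(O, 0) - O * Y + np.log1p(np.exp(-np.abs(O)))
    return per_entry.sum()
```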
Input: A multi-label graph $G=(\mathbf{v},\mathbf{e},\mathbf{l},\mathbf{A},\mathbf{B},\mathbf{X})$
Output: The node representations $\mathbf{O}^{n}=\{h_{i}\}_{i=1,\cdots,n}$
Initialization: $i=0$, the number of training epochs $I$, the information updating frequencies $M$ and $N$
while $i\leq I$ do
       Feed the label-label-node graph to train label representations;
       Feed the node-node-label graph to train node representations;
       if $i\,\%\,M=0$ then
              Update the feature matrix by Eq. (15);
       end if
       if $i\,\%\,N=0$ then
              Update the feature matrix by Eq. (16);
       end if
       Optimize $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ by the collective classification objective of Eq. (17);
       $i = i + 1$.
end while
Algorithm 1: Training ML-GCN

The above two aspects of representation learning, for labels and for nodes, are trained together and impact one another by sharing the common classification label space $\mathbf{l}$ of the target graph $G$ and, in the meantime, a subset of input features, i.e., through the attributed label nodes in the low-level node-node graph and the attributed common nodes in the high-level label-label graph. Let the total number of training epochs for ML-GCN be $I$. After every $N$ epochs of training of common node representations, the input feature matrix for the label-label-node graph is updated as:

$\mathbf{X}_{new}=\rho(\mathbf{O}^{v}\mathbf{W}^{v});\quad\mathbf{Y}^{*}=[\mathbf{Y};\mathbf{X}_{new}]^{T},$   (15)

and in the meantime, after every $M$ epochs of training of label representations, the input feature matrix for the node-node-label graph is updated as:

$\mathbf{Y}_{new}=\rho(\mathbf{O}^{l}\mathbf{W}^{l});\quad\mathbf{X}^{*}=[\mathbf{X};\mathbf{Y}_{new}]^{T}$   (16)

where $\mathbf{W}^{v}\in\mathbb{R}^{m\times d_{i}}$ and $\mathbf{W}^{l}\in\mathbb{R}^{m\times d_{i}}$ are weight matrices. The collective training procedure for the ML-GCN model is summarized in Algorithm 1.
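The alternating procedure of Algorithm 1 can be sketched in PyTorch as follows. All adjacency matrices, features and hyper-parameter values are random toy placeholders of our own choosing (the real normalized adjacencies come from Eqs. (4) and (10)), and the two forward passes follow Eqs. (6) and (12); this is a hedged illustration, not the released implementation.

```python
import torch
import torch.nn.functional as Fn

n, m, d_i, d_h, I, M, N = 20, 4, 16, 8, 300, 50, 50
F_t, C_t = torch.rand(m, m + n), torch.eye(m)          # placeholders for \tilde{F}* and \tilde{C}
E_t, A_t = torch.rand(n, n + m), torch.eye(n)          # placeholders for \tilde{E}* and \tilde{A}
X, Y = torch.rand(n, d_i), torch.rand(m, d_i)          # common-node / label-node input features
Y_lab = torch.arange(m)                                # single-label targets of the m label nodes
Y_node = (torch.rand(n, m) > 0.7).float()              # multi-hot targets of the common nodes

W0l, W1l = torch.randn(d_i, d_h, requires_grad=True), torch.randn(d_h, m, requires_grad=True)
W0v, W1v = torch.randn(d_i, d_h, requires_grad=True), torch.randn(d_h, m, requires_grad=True)
Wl, Wv = torch.randn(m, d_i), torch.randn(m, d_i)      # projection matrices of Eqs. (15)-(16)
opt = torch.optim.SGD([W0l, W1l, W0v, W1v], lr=0.02)

Y_star, X_star = torch.cat([Y, X]), torch.cat([X, Y])  # stacked inputs [Y; X] and [X; Y]
for i in range(I):
    O_l = C_t @ torch.relu(F_t @ Y_star @ W0l) @ W1l   # Eq. (6): label-node logits
    O_v = A_t @ torch.relu(E_t @ X_star @ W0v) @ W1v   # Eq. (12): common-node logits
    loss = Fn.cross_entropy(O_l, Y_lab) \
         + Fn.binary_cross_entropy_with_logits(O_v, Y_node, reduction='sum')
    opt.zero_grad(); loss.backward(); opt.step()       # collective objective, Eq. (17)
    if i % N == 0:                                     # Eq. (15): refresh node attributes of the label graph
        Y_star = torch.cat([Y, torch.relu(O_v.detach() @ Wv)])
    if i % M == 0:                                     # Eq. (16): refresh label attributes of the node graph
        X_star = torch.cat([X, torch.relu(O_l.detach() @ Wl)])
```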

IV-B Algorithm Optimization and Complexity Analysis

As can be seen from Algorithm 1, the node representations and label representations are not learned independently, but depend on each other through shared embedding features learned by two reciprocally enhanced GCNs. In addition, the two-level GCNs conduct two supervised classification tasks within the same labeling space: the top label-label-node GCN performs a single-label classification and the bottom node-node-label GCN performs a multi-label node classification. Finally, the global learning objective is to minimize the following collective classification loss:

$\mathcal{L}=\mathcal{L}_{1}+\mathcal{L}_{2}$   (17)

In this paper, all weight parameters are optimized using gradient descent as in [15] and [29].

The training of ML-GCN is efficient in terms of computational complexity. In this paper, we adopt a two-layer GCN and a one-layer GCN for learning the node and label representations, respectively. Since the multiplication of the adjacency matrix (e.g., $\mathbf{A}$ for the node-node graph and $\mathbf{F}$ for the label-label graph) and the feature matrix (e.g., $\mathbf{Y}^{*}$ and $\mathbf{X}^{*}$ in Eqs. (6) and (12), respectively) can be implemented as a product of a sparse matrix with a dense matrix, the algorithmic complexity of ML-GCN can be represented as $O((Ed_{i}d_{h}m+nd_{i})+(Ld_{i}m+md_{i}))$, where $n$ and $m$ are the numbers of nodes and labels, and $E$ and $L$ are the numbers of edges in the node-node-label graph and the label-label-node graph, respectively. $d_{i}$ is the dimension (for both nodes and labels) of the input feature vectors, and $d_{h}$ is the dimension of the hidden feature vectors of all common nodes produced by the first node-node-label GCN layer. In addition, because $m$, $L$ and $n$ are generally far less than $E$ for most networks (e.g., as we will see in Section V, for the Flickr dataset $E$ is 4,332,620, while $m$, $L$ and $n$ are merely 194, 3,716 and 8,052, respectively), the complexity of ML-GCN is approximately equivalent to $O(Ed_{i}d_{h}m)$, which is the same as that of GCN. Meanwhile, since Eqs. (15) and (16) are not computed in every epoch (e.g., only every 50 epochs), the complexity of our model remains $O(Ed_{i}d_{h}m)$, the same theoretical asymptotic complexity as GCN.

V Experiments & Results

In this section, we compare the proposed approach against a set of strong baselines on three real-world datasets by conducting supervised node classification.

V-A Benchmark Datasets

We collect three multi-label networks [13, 21], BlogCatalog, Flickr, and YouTube, as the benchmark. They are described as follows.

BlogCatalog is a network of social relationships among 10,312 blogger authors (nodes), where the node labels represent bloggers' interests such as Education, Food and Health. There are 39 unique labels in total and each node may be associated with one or multiple labels. Users' labels of interest often interact and correlate with each other to enhance the affinities between blogger authors. For example, Food is highly related to Health in real life, so two users who have both labels Food and Health should be much closer than those who share only one of the two labels. There are 615 co-occurrence relationships (i.e., correlations) among all 39 labels in this dataset.

Flickr is a photo-sharing network between users, where node labels represent user interests, such as Landscapes and Travel. There are 8,052 users and 4,332,620 interactions (i.e., edges) among them. Each user may have one or multiple labels of interest from the same labeling space of 194 labels in total.

YouTube is a social network formed by video-sharing behaviors, where labels represent the interest groups of users who enjoy common video genres such as anime and wrestling. There are 22,693 users and 192,722 links between them. Each pair of linked users may share multiple identical labels out of the total of 47 labels. The number of correlations between these labels is 1,079.

The detailed statistics of the above three multi-label networks are summarized in Table I.

V-B Comparative Methods

We compare the performance of the proposed method with the following state-of-the-art methods for multi-label node classification:

  • DeepWalk [21] is a shallow network embedding model that only preserves topology structures. It captures the node neighborhood relations based on random walks and then derives node representations with the SkipGram model.

  • LINE [30] is also a structure-preserving method. It optimizes a carefully designed objective function that preserves both the local and global network structures, whereas DeepWalk encodes only the local structures.

  • Node2vec [23] adopts a more flexible neighborhood sampling process than DeepWalk to capture the node relations. The biased random walk of Node2vec can capture second-order and high-order node proximity for representation learning.

  • GENE [13] is a network embedding method that simultaneously preserves the topology structures and label information. Different from the proposed approach in this paper, GENE simply models labels as plain attributes to enhance structure-based representation learning process, whereas our model considers multi-label correlation and network structure for representation learning.

  • GCN [15] is a state-of-the-art method that can naturally learn node relations from network structures and features, where each node forms its representation by adopting a spectral-based convolutional filter to recursively aggregate features from all its neighborhood nodes.

  • Text GCN [29] is built on GCN and aims to embed heterogeneous information networks. In this paper, we construct a heterogeneous node-label graph for it, where common nodes and label nodes are directly connected by their labeling relationships.

  • ML-GCNnode is a variant of the proposed ML-GCN model that removes attributes of common nodes from the label-label-node graph. Therefore, the community information is not preserved in this method.

  • ML-GCN1n is a variant of the proposed ML-GCN model. The only difference from ML-GCN is that ML-GCN1n uses only one convolutional layer when learning from the node-node-label graph.

  • ML-GCN2l is a variant of the proposed ML-GCN model, which adopts two consecutive convolutional layers to learn from the label-label-node graph, compared with the one layer used in ML-GCN.

  • ML-GCN is our proposed multi-label learning approach in this paper. It considers a two-layer graph structure: a high-level label-label-node graph that allows the preservation of label correlations, and a low-level node-node-label graph that enables label correlation-enhanced node representation learning.

TABLE I: Dataset characteristics.
Items        BlogCatalog   Flickr      YouTube
# Nodes      10,312        8,052       22,693
# Edges      333,983       4,332,620   192,722
# Labels     39            194         47
# Co-occur.  615           3,716       1,079
TABLE II: Multi-label classification performance comparison. The 1st, 2nd, and 3rd best results are bold-faced, italic-formatted and underscored, respectively.
Metrics        Micro-F1 (%)                                Macro-F1 (%)
Datasets       BlogCatalog  Flickr      YouTube            BlogCatalog  Flickr      YouTube
DeepWalk       29.19±0.28   25.75±0.13  26.19±0.18         19.22±0.63   12.31±0.17  10.03±0.23
LINE           30.79±0.84   30.13±0.39  27.68±0.11         17.89±1.26   16.24±0.53  10.90±0.30
Node2vec       33.35±0.69   34.51±0.44  26.75±0.25         21.38±1.76   19.72±0.32  10.03±0.29
GENE           28.77±0.02   29.44±0.18  26.77±0.22         16.00±1.06   14.35±0.82  10.48±0.72
GCN            43.57±0.11   40.36±0.04  44.44±0.02         36.55±0.22   24.77±0.09  33.54±0.08
Text GCN       40.31±0.49   41.82±0.33  39.82±0.53         32.03±0.49   22.98±0.40  39.07±0.11
ML-GCNnode     43.72±0.31   39.99±0.07  44.14±0.05         38.39±0.38   22.70±0.03  33.96±0.18
ML-GCN1n       32.96±0.01   31.83±0.03  31.82±0.07         17.92±0.02   16.89±0.21  27.48±0.33
ML-GCN2l       43.86±0.07   38.41±0.23  42.33±0.03         38.63±0.24   26.28±0.30  32.62±0.11
ML-GCN         45.17±0.20   43.74±0.20  45.71±0.02         42.53±0.72   30.71±0.12  42.77±0.31

The above baselines can be roughly separated into three categories based on the types of information (e.g., network structures and labels) they use and how it is incorporated into the graph embedding models. The first class consists of methods that only preserve graph structures, including DeepWalk, Node2vec, LINE and GCN (for GCN, we use the structure-based identity matrix as the original features of all nodes). The second class includes GENE and Text GCN, which preserve both the graph structures and label information, where the labels are modeled as plain attribute information to enhance structure-based representation learning. The proposed method ML-GCN and its variants (ML-GCNnode, ML-GCN1n, and ML-GCN2l) represent the third class, which not only preserves structural and label information, but also the correlations between labels.

It is worth noting that we designed three variants of ML-GCN (ML-GCNnode, ML-GCN1n, and ML-GCN2l) to validate its performance under different settings. This allows us to fully observe ML-GCN's performance and conclude which components play major roles in multi-label GCN learning.

V-C Experiment Setup

There are many hyper-parameters involved. Some are empirically set [29] while others are selected through sensitivity experiments. For ML-GCN, we use a two-layer and a one-layer GCN to learn from the node-node-label graph and the label-label-node graph, respectively. We test the hidden embedding size $d_{h}$ between 50 and 500, the training ratio $\alpha$ of supervised labeled instances between 0.025 and 0.2, and the updating frequencies $N$ and $M$ from 10 to 100. We also compare the performance of ML-GCN with differing numbers of GCN convolutional layers (e.g., ML-GCN1n and ML-GCN2l). For comparison, we set the learning rate $\eta$ for gradient descent to 0.02, the number of training epochs to 300, the dropout probability to 0.5, the $L_{2}$-norm regularization weight decay to 0, and the default values of $d_{h}$, $\alpha$, $N$ and $M$ to 400, 0.2, 50 and 50, respectively. After selecting the labeled training instances, the rest is split into two parts: 10% as the validation set and 90% as the test set.

It is necessary to mention that all baselines are set to conduct multi-label node classification (i.e., each node can belong to multiple labels) within the same environment settings. Following the metrics used in [21] and [30], we adopt Micro-F1 and Macro-F1 to evaluate the node classification performance, which are defined as follows:

$\mathrm{Micro\text{-}F1}=\frac{\sum_{i=1}^{|\mathbf{l}|}2TP^{i}}{\sum_{i=1}^{|\mathbf{l}|}\left(2TP^{i}+FP^{i}+FN^{i}\right)}$   (18)
$\mathrm{Macro\text{-}F1}=\frac{1}{|\mathbf{l}|}\sum_{i=1}^{|\mathbf{l}|}\frac{2TP^{i}}{2TP^{i}+FP^{i}+FN^{i}}$   (19)

where $\mathbf{l}$ is the set of labels of the target graph $G$, and $TP^{i}$, $FN^{i}$ and $FP^{i}$ denote the numbers of true positives, false negatives and false positives w.r.t. the $i$th label category, respectively. All experiments are repeated 10 times, with the average results and their standard deviations reported.
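As a reference, Eqs. (18)-(19) correspond to the following computation on binary prediction and ground-truth matrices; this sketch (our own, hypothetical helper) is equivalent to scikit-learn's f1_score with average='micro' and average='macro'.

```python
import numpy as np

def micro_macro_f1(Y_true, Y_pred):
    """Micro- and Macro-F1 of Eqs. (18)-(19) for (samples, labels) binary indicator matrices."""
    tp = ((Y_pred == 1) & (Y_true == 1)).sum(axis=0).astype(float)   # true positives per label
    fp = ((Y_pred == 1) & (Y_true == 0)).sum(axis=0).astype(float)   # false positives per label
    fn = ((Y_pred == 0) & (Y_true == 1)).sum(axis=0).astype(float)   # false negatives per label
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())      # Eq. (18)
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)         # guard against empty labels
    return micro, per_label.mean()                                   # Eq. (19) is the macro average
```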

Figure 3: Algorithm performance comparison with respect to different training sample ratios (the x-axis denotes the ratio of training samples relative to the whole network).

V-D Experimental Results

Table II presents the comparative results of all methods with respect to the multi-label classification performance within the same environment settings, where the top three best results have been highlighted. From the table, we have the following four main observations.

  • Among all methods that encode only the graph topology structures, the shallow neural network-based methods (e.g., DeepWalk, LINE and Node2vec) perform poorly, with a wide gap compared with the deep model GCN over all three datasets; e.g., on the BlogCatalog network, the classification performance of GCN improves 30.6% and 70.9% over Node2vec w.r.t. Micro-F1 and Macro-F1, respectively. This is because shallow models have limitations in learning complex relational patterns among nodes [1]. For example, although Node2vec relies on a carefully designed random walk process to capture the node neighborhood relationships, it cannot differentiate the affinities between a node and the others within the same walk sequence. In comparison, GCN uses a more efficient way to constrain the neighborhood relations between nodes, where each node only interacts with its neighbors in each convolution layer. Such a learning paradigm maintains the actual node relevance reflected by the edge links more accurately, without introducing noisy neighborhood relationships as Node2vec does.

  • Regarding the performance of the methods (GENE and Text GCN) that incorporate labels to enhance structure modeling: GENE is built on DeepWalk to additionally preserve label information. We can observe that GENE performs slightly better than DeepWalk on both the Flickr and YouTube datasets. The reason is that each label can be considered a higher-level summarization of a group of similar nodes, and thus can be used to supervise and distinguish the neighborhood affinities between nodes within the same random walk sequence. Nevertheless, LINE and Node2vec perform better than GENE in most cases over the three datasets, e.g., the Micro-F1 performance of LINE and Node2vec on BlogCatalog is 2.0% and 4.6% higher, respectively. The reason is probably that they adopt more efficient processes to capture node neighborhood relationships. In addition, as we can see from Table II, the label-preserving model Text GCN has no advantage over the basic GCN model. The reason is probably that the labels are treated as attributes that are not leveraged in a meaningful manner. In other words, only the labeled nodes carry label attributes in the supervised node representation learning and training, and the scattered labels may become noisy information that confuses the modeling of neighborhood relationships between nodes.

    Figure 4: Impact of the information updating frequencies.
    Figure 5: Impact of the hidden node embedding size.
  • For deep models that preserve graph structures, labels and label correlations, we can observe that ML-GCN is consistently superior to Text GCN in learning multi-label graphs over all three datasets. The reasons are mainly threefold. First, ML-GCN allows immediate and efficient label correlation modeling without depending on common nodes, i.e., labels directly interact with each other over the label-label graph in the proposed model. Second, it is natural to use different numbers of convolutional layers to learn from the label and node graphs, respectively (e.g., based on the results observed by comparing ML-GCN and ML-GCN2l, where the former performs far better), since the node-node graph is much more complicated than the label-label graph, which involves simple label interaction patterns. To obtain a model that best fits the given label and node graphs of different scales, one can easily change the numbers of layers used by the two graph layers in ML-GCN independently, whereas it is hard for Text GCN to coordinate layer settings that are simultaneously suitable for modeling node relations and label relations in a single node-label graph. Finally, in the structural design of ML-GCN, each label preserves community information by taking all related common nodes as attributes, which makes the node relation modeling and the label correlation modeling more dependent on each other to optimize the global network representation learning, i.e., the node representations of one label could refine the node representation learning of another correlated label, as we will demonstrate in the case study later.

  • In terms of the performance of different variants of the proposed ML-GCN, we can conclude that ML-GCN is superior to ML-GCNnode, ML-GCN1n and ML-GCN2l, and the possible reasons are as follows: 1) The comparison between ML-GCN and ML-GCNnode demonstrates that taking the common nodes as attributes of the label-label network is beneficial, where labels can learn more enriched representations (encoding label correlations and node communities) to refine the neighborhood feature aggregation for node representation learning in the low-level node-node graph; 2) ML-GCN performs better than ML-GCN2l, which illustrates that a single-layer GCN is appropriate for modeling the label correlations, since compared with the node-node interactions, the label-label interactions are generally simple and explicit; 3) ML-GCN1n performs worse than ML-GCN, which demonstrates that exploring the high-order neighborhood relationships between nodes is important.

V-E Parameter Sensitivity

We designed extensive experiments to test the sensitivity of various parameters over a wide range of values, such as the training ratio $\alpha$ of labeled nodes, the feature updating frequencies $N$ and $M$, and the embedding size of the first hidden convolution layer when modeling the node-node-label graph. Fig. 3 shows the impact of different portions of labeled training instances. In general, for all tested models, we can observe that both the Micro-F1 (Fig. 3(a)) and Macro-F1 (Fig. 3(b)) performances increase with more labeled training nodes. This is reasonable since all these models adopt a supervised node representation learning and training manner, where the model parameters can be fully trained with more labeled data [29]. Fig. 4 shows the influence of the input feature-updating frequencies controlled by $N$ and $M$ (used in Eqs. (15) and (16)). From Fig. 4(a), the performance changes with $N$ but no clear pattern can be observed in Micro-F1, while Fig. 4(b) shows a decreasing trend of the Macro-F1 scores with larger values of $N$. In comparison, from Fig. 4(c) and Fig. 4(d), the accuracy first increases and then decreases with larger values of $M$ w.r.t. both Micro-F1 and Macro-F1 scores. We also test the impact of the node embedding size generated by the first convolutional layer, with the trend shown in Fig. 5. The accuracy fluctuates before peaking at 400 and 450 w.r.t. the Micro-F1 and Macro-F1 results, respectively, followed by a sharp decline.

V-F Case Study

To illustrate how label correlations affect the multi-label graph learning performance, we present the classification results of four related label categories in Fig. 6. Fig. 7 presents their correlation matrix, where deeper colors imply higher correlation between two corresponding labels, e.g., L1 and L2 are highly correlated. We can see in Fig. 6 that ML-GCN and GCN perform similarly with respect to the node classification of category L1. However, the accuracy of ML-GCN improves over GCN with respect to categories L2, L3 and L4. It is interesting to note that how much these categories improve (L2 > L3 > L4) is related to how frequently they correlate with label L1. This phenomenon might be caused by the fact that L1 has an impact on its correlated labels during training. This also verifies that label interaction is critical for multi-label graph learning, and that our proposed ML-GCN model can effectively capture and utilize this property.

VI Conclusions

In this paper, we formulated a new multi-label network representation learning problem, where each node of the network may have multiple labels. To simultaneously explore label-label correlation and the network topology, we proposed a multi-label graph convolutional network (ML-GCN) that builds a node-node-label graph and a label-label-node graph from the multi-label network, and carries out node representation and label representation learning with two Siamese GCNs simultaneously. Because the two GCNs are unified to achieve one optimization goal, the learning of node representations and label representations can mutually benefit each other for maximum performance gain. Experiments on three real-world datasets verified the effectiveness of ML-GCN in combining labels, label correlations, and graph structures for enhanced node representation learning and classification.

Figure 6: Classification performance (Macro-F1) with respect to different label categories.
Figure 7: Pair-wise label correlation matrix. A higher gray intensity value (excluding main diagonal values) indicates a stronger correlation between two labels.

References

  • [1] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation learning: A survey,” IEEE Trans. on Big Data, 2018.
  • [2] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.
  • [3] L. Liao, X. He, H. Zhang, and T.-S. Chua, “Attributed social network embedding,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 12, pp. 2257–2270, 2018.
  • [4] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, “Community preserving network embedding,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [5] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
  • [6] S.-J. Huang and Z.-H. Zhou, “Multi-label learning by exploiting label correlations locally,” in Twenty-sixth AAAI conference on artificial intelligence, 2012.
  • [7] W. Bi and J. T. Kwok, “Multilabel classification with label correlations and missing labels,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • [8] V. Kumar, A. K. Pujari, V. Padmanabhan, S. K. Sahu, and V. R. Kagita, “Multi-label classification using hierarchical embedding,” Expert Systems with Applications, vol. 91, pp. 263–269, 2018.
  • [9] J. Zhang, Q. Wu, C. Shen, J. Zhang, and J. Lu, “Multilabel image classification with regional latent semantic dependencies,” IEEE Transactions on Multimedia, vol. 20, no. 10, pp. 2801–2813, 2018.
  • [10] S. Burkhardt and S. Kramer, “Online multi-label dependency topic models for text classification,” Machine Learning, vol. 107, no. 5, pp. 859–886, 2018.
  • [11] M. Shi, J. Liu, D. Zhou, and Y. Tang, “A topic-sensitive method for mashup tag recommendation utilizing multi-relational service data,” IEEE Transactions on Services Computing, 2018.
  • [12] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [13] J. Chen, Q. Zhang, and X. Huang, “Incorporate group information to enhance network embedding,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management.   ACM, 2016, pp. 1901–1904.
  • [14] X. Huang, J. Li, and X. Hu, “Label informed attributed network embedding,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining.   ACM, 2017, pp. 731–739.
  • [15] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [16] W. Liu and I. W. Tsang, “Large margin metric learning for multi-label prediction,” in Twenty-Ninth AAAI Conf. on Artificial Intelligence, 2015.
  • [17] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE transactions on knowledge and data engineering, vol. 26, no. 8, pp. 1819–1837, 2014.
  • [18] M.-L. Zhang, Y.-K. Li, X.-Y. Liu, and X. Geng, “Binary relevance for multi-label learning: an overview,” Frontiers of Computer Science, vol. 12, no. 2, pp. 191–202, 2018.
  • [19] M.-L. Zhang and Z.-H. Zhou, “Ml-knn: A lazy learning approach to multi-label learning,” Pattern recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
  • [20] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, “Large-scale multi-label text classification—revisiting neural networks,” in Joint european conference on machine learning and knowledge discovery in databases.   Springer, 2014, pp. 437–452.
  • [21] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2014, pp. 701–710.
  • [22] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
  • [23] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proc. of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2016, pp. 855–864.
  • [24] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in neural information processing systems, 2016, pp. 3844–3852.
  • [25] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang, “Network representation learning with rich text information,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • [26] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” Network, vol. 11, no. 9, p. 12, 2016.
  • [27] T. M. Le and H. W. Lauw, “Probabilistic latent document network embedding,” in 2014 IEEE International Conference on Data Mining.   IEEE, 2014, pp. 270–279.
  • [28] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
  • [29] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” arXiv preprint arXiv:1809.05679, 2018.
  • [30] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th international conference on world wide web.   International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.