
DsMtGCN: A Direction-sensitive Multi-task framework for Knowledge Graph Completion

Jining Wang Chuan Chen [email protected] Zibin Zheng Yuren Zhou School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510275, China
Abstract

To address the inherent incompleteness of knowledge graphs (KGs), numerous knowledge graph completion (KGC) models have been proposed to predict missing links from known triples. Among them, several works have achieved more advanced results by exploiting the structural information of KGs with Graph Convolutional Networks (GCNs). However, we observe that existing GCN-based models simply average the entity embeddings aggregated from neighbors in different directions to complete a single task, ignoring the specific requirements of the forward and backward sub-tasks. In this paper, we propose a Direction-sensitive Multi-task GCN (DsMtGCN) to make full use of the direction information: multi-head self-attention is applied to combine embeddings from different directions specifically for various entities and sub-tasks, geometric constraints are imposed to adjust the distribution of embeddings, and the traditional binary cross-entropy loss is modified to reflect the triple uncertainty. Moreover, competitive experimental results on several benchmark datasets verify the effectiveness of our model.

keywords:
Knowledge graph completion, Direction-sensitive, Multi-task, Graph Convolutional Networks
journal: Information Processing and Management

1 Introduction

KGs are multi-relational directed graphs with entities as nodes and relations as edges, containing a large amount of structured knowledge in the form of (head entity, relation, tail entity) triples. They are widely applied in question answering [1], dialog systems [2], web search [3] and recommender systems [4]. However, due to the limitation of available resources, it is impractical to store all facts in KGs, which leads to their incompleteness [5]; KGC algorithms are therefore required to address this problem.

Refer to caption
(a)
Refer to caption
(b)
Figure 1: An example of different sub-tasks’ preferences for neighbors in the forward and backward directions. (a) The knowledge graph around Stan Lee extracted from FB15k-237. Blue arrows indicate original relations in the forward direction, while green ones denote inverse relations in the backward direction, written as r^{-1}. (b) The forward and backward sub-tasks about Stan Lee in FB15k-237; neighbors in specific directions that are critical to making decisions are listed on the left, and correct answers are underlined in purple on the right.

A large body of research focuses on KGC, or link prediction, aiming to infer missing facts automatically based on known ones. Pioneering additive models [6, 7, 8] treat the transformation from head entities to tail entities as a translation problem, while multiplicative models [9, 10, 11, 12] try to measure the plausibility of unknown triples by applying proper semantic similarity-based score functions. Benefiting from the development of neural networks, several works concentrate on deeper nonlinear interactions among entities and relations with innovative model structures [13, 14, 15, 16, 17, 18].

Furthermore, some recent studies introduce GCNs to take the structure information into consideration by aggregating neighborhood information [19, 20, 21, 22], which brings significant improvement. Despite their high performance, these models fail to utilize the direction information implied by different neighbors when merging them, which is important for making reasonable predictions. As shown in Fig. 1, there exist original and inverse edges (relations) in KGs. According to their direction, link prediction tasks can be divided into forward and backward sub-tasks, and neighbors can likewise be grouped into forward and backward neighbors; sub-tasks in different directions often have diverse preferences for neighbors. For example, the forward sub-task (Stan Lee, profession, ?) can be solved from backward neighbors including Spider-Man, Iron Man and Captain America; on the other hand, the answer to the query (Stan Lee, ethnicity-1, ?) can be derived from Judaism, which is a forward neighbor of Stan Lee. However, existing studies cannot model such preferences because they simply average neighbor embeddings in different directions, which inspires our work.

On the basis of the above observations, we propose a novel model called DsMtGCN in this paper. Concretely, we design a multi-task framework that completes the forward and backward prediction sub-tasks respectively, so that the direction information is merged with differential partiality. In consideration of the differences among entities and sub-tasks, we use multi-head self-attention at the entity level to combine the embeddings aggregated from neighbors in various directions under the forward and backward sub-tasks. Inspired by multi-view learning, we believe that the embeddings learned from neighbors in different directions can be seen as two views obtained from two sides, and we impose distance and conicity constraints on the embeddings to satisfy the principles of consensus and difference. Besides, we find that the difference in the number of possible tail entities leads to triple uncertainty. For example, if the task is (Stan Lee, profession, ?), there are many correct answers and thus large triple uncertainty; on the other hand, if the query is (Stan Lee, nationality, ?), the candidate tail entity is unique, with small triple uncertainty. For this reason, we redefine label smoothing for link prediction tasks to fully consider this uncertainty during training.

In summary, our contributions in this paper are the following:

  • 1.

    We propose a multi-task model called DsMtGCN to utilize the previously ignored direction information by considering the diverse direction preferences of the forward and backward sub-tasks, which, to the best of our knowledge, is the first study of this kind in the field.

  • 2.

    For embeddings learned in different directions, we apply multi-head self-attention to extract direction information specifically at the entity level under different sub-tasks, and adopt the geometric constraints to adjust the embedding distribution.

  • 3.

    We propose an improved label smoothing scheme under the multi-label classification scenario to reflect the triple uncertainty, which further increases the performance of our model.

  • 4.

    By analyzing experimental results of DsMtGCN on FB15k-237 and WN18RR datasets, we demonstrate the effectiveness of our proposed method.

2 Related work

In this part, we introduce related work from two aspects: link prediction and multi-task learning. They are described in detail below.

2.1 Link Prediction

Link prediction, or knowledge graph completion, aims to predict missing triples by scoring candidate triples with embeddings of entities and relations learned by well-designed models. Some non-neural approaches were first proposed in this field; they usually take relations as transfer vectors to convert head entities to tail entities, and different models differ in their score functions and representation spaces. For example, TransE[6], TransH[7], and TransR[8] are known as typical additive models, since they project entities and relations into a vector space, then optimize a distance-based score function with the goal of making the head entity plus the relation as close as possible to the tail entity. On the other hand, multiplicative models use semantic similarity-based score functions to measure the plausibility of given triples, enabling them to maintain good expression ability as the scale of the knowledge graph grows. Specifically, RESCAL[9] and DistMult[10] introduce bilinear operations in point-wise space, while Complex[11] and QuatE[12] represent embeddings in complex-valued space and quaternion space respectively to capture complicated relation patterns such as composition and antisymmetry.

With the wide application of neural networks in various fields, several neural network based models have been proposed to improve expression ability by capturing more complex interactions among triples. Considering that the simplicity and expressiveness of the Multi-Layer Perceptron (MLP) are suitable for enhancing score functions, NTN[23] and SME[13] explore nonlinear modeling with a fully-connected layer and an activation function, while NAM[14] adopts deeper hidden layers and utilizes information from the hidden encoding. Convolutional Neural Networks (CNN) have also been employed to improve expression ability by learning deeper features with fewer shared parameters. For instance, ConvE[15] applies 2D convolution over a two-dimensional matrix reshaped from stacked head entities and relations to model the interactions among them, ConvKB[16] adopts 1D filters directly on the same dimension of entities and relations to keep a better transitional characteristic, ConvR[17] replaces the global filters used by ConvE with relation-specific ones, and InteractE[18] proposes several techniques including feature permutation, checkered reshaping and circular convolution to maximize the number of heterogeneous and homogeneous interactions between entity and relation features. The information used by the methods mentioned above is limited to the triple level; in order to make better decisions, RSN[24] tries to capture long-term path dependence with Recurrent Neural Networks (RNN), while transformer-based models like CoKE[25] and KG-BERT[26] choose to mine contextual information from knowledge graphs.

Although knowledge graphs are fundamentally multi-relational graphs, the aforementioned methods, which are more suitable for Euclidean data, tend to overlook the structural information inherent in them. To make up for this, recent works introduce Graph Neural Networks (GNN) for graph context modeling under the encoder-decoder framework. R-GCN[19] proposes relation-specific filters to take different relations into account while modeling message passing on knowledge graphs. SACN[20] learns adaptive weights for adjacent nodes with the same relation instead of treating the neighborhood of each entity equally, so that node structure, node attributes and relation types all contribute to the performance gains. VR-GCN[21] explicitly learns the representations of entities and relations to make full use of relation information. CompGCN[22] introduces entity-relation composition operations while aggregating neighbors, and is proven to generalize previous GCN-based models including KipfGCN[27], R-GCN[19], D-GCN[28] and SACN[20]. KBGAT[29] claims to surpass others by learning graph attention based embeddings, but its results have been questioned by another study[30]. Nevertheless, none of the above studies consider that neighbors in different directions should be attended to specifically under different sub-tasks, which inspires our work.

2.2 Multi-task Learning

Multi-task Learning (MTL) is a learning paradigm designed to leverage information across tasks instead of learning each task in isolation [31]. Benefiting from its simplicity and efficiency, MTL has contributed to computer vision [32, 33], natural language processing [34], recommender systems [35] and so on.

In the field of link prediction, existing MTL models mainly focus on introducing additional tasks or datasets. Specifically, SENN [36] utilizes relation prediction to promote link prediction, and TransMTL [37] learns embeddings of multiple knowledge graphs simultaneously to mine the structure patterns shared among them. However, the aforementioned methods bring considerable extra computational overhead, and suitable auxiliary tasks or external datasets are not easy to find. In this paper, we define sub-tasks by decomposing the existing task, which reduces the number of parameters and avoids introducing extra noise.

3 Background and Definition

In this section, we aim to introduce several important concepts and notations related to our work.

3.1 Link prediction tasks

We denote a KG as \mathcal{G}=\{(h,r,t)\mid(h,r,t)\in\mathcal{T},h,t\in\mathcal{E},r\in\mathcal{R}\}, where h,t are the head and tail entities, r represents the relation connecting h and t, and \mathcal{T},\mathcal{E},\mathcal{R} denote the sets of triples, entities, and relations. The train, valid and test sets of triples are denoted as \mathcal{T}^{train}, \mathcal{T}^{valid} and \mathcal{T}^{test} respectively.

Given (h,r), the link prediction task aims to predict the missing t on \mathcal{G}. Let e_{h},e_{r},e_{t}\in R^{d} be the corresponding embeddings with dimension d. A score function \psi:R^{d}\times R^{d}\times R^{d}\rightarrow R takes e_{h},e_{r},e_{t} as input and outputs a score measuring the plausibility of the triple (h,r,t); the t with the highest score is selected as the prediction result.

3.2 Link prediction sub-tasks

Following [38], information in KGs is required to flow in the backward and self-loop directions in addition to the forward direction. For each triple (h,r,t), we generate its backward version (t,r^{-1},h) and self-loop version (h,r^{0},h) to augment the original data. The new relation set is \mathcal{R}^{\prime}=\mathcal{R}\cup\mathcal{R}^{-1}\cup\{r^{0}\}, where \mathcal{R}^{-1}=\{r^{-1}\mid r\in\mathcal{R}\}; the new triple set is \mathcal{T}^{\prime}=\mathcal{T}\cup\mathcal{T}^{-1}\cup\mathcal{T}^{0}, where \mathcal{T}^{-1}=\{(t,r^{-1},h)\mid(h,r,t)\in\mathcal{T}\} and \mathcal{T}^{0}=\{(h,r^{0},h)\mid h\in\mathcal{E}\}. Accordingly, the prediction tasks can be divided into forward sub-tasks of the form (h,r,?) and backward sub-tasks of the form (h,r^{-1},?).

For an entity h, its neighbors \mathcal{N}_{h} can be grouped into forward neighbors \mathcal{N}^{F}_{h} and backward neighbors \mathcal{N}^{B}_{h}, where \mathcal{N}^{F}_{h}=\{(r,t)\mid(h,r,t)\in\mathcal{T}\} and \mathcal{N}^{B}_{h}=\{(r^{-1},t)\mid(h,r^{-1},t)\in\mathcal{T}^{-1}\}, and the two groups are considered to contribute differently to the two sub-tasks. Furthermore, the embeddings learned from \mathcal{N}^{F}_{h} and \mathcal{N}^{B}_{h} are denoted as e^{f}_{h} and e^{b}_{h} respectively, and their collections can be written in matrix form as E^{f},E^{b}\in R^{|\mathcal{E}|\times d}.
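As an illustration, the following sketch (not the authors' released code; the relation-id offsets for r^{-1} and r^{0} are our own assumption) shows how the augmented triple set \mathcal{T}^{\prime} and the direction-specific neighbor sets could be built in Python:

def augment_triples(triples, num_relations):
    """triples: list of (head, relation, tail) integer-id tuples from T."""
    augmented = list(triples)                                            # forward triples T
    augmented += [(t, r + num_relations, h) for h, r, t in triples]      # backward triples T^-1
    entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
    self_loop_id = 2 * num_relations                                     # id assumed for r^0
    augmented += [(e, self_loop_id, e) for e in entities]                # self-loop triples T^0
    return augmented

def group_neighbors(triples, num_relations):
    """Group each entity's neighbors into forward (r, t) and backward (r^-1, h) pairs."""
    forward, backward = {}, {}
    for h, r, t in triples:
        forward.setdefault(h, []).append((r, t))                         # (r, t) in N^F_h
        backward.setdefault(t, []).append((r + num_relations, h))        # (r^-1, h) in N^B_t
    return forward, backward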

4 Model

Refer to caption
Figure 2: The structure of DsMtGCN and the modules it contains. (a) An overview of our DsMtGCN; the input data are shown on the left. (b) The influence of different geometric constraints; the blue and green planes indicate the spaces in which E^{f} and E^{b} are located, and colored points correspond to embeddings of different entities. (c) An illustration of how the multi-head self-attention mechanism produces e^{B}_{h} for the backward prediction sub-task.

In this part, we describe the structure of DsMtGCN. Overall, the essential idea of our work is to integrate neighbor information from specific directions at the entity level based on the sub-task, so that the direction information can be fully used; Fig. 2(a) provides an intuitive illustration of this point. In what follows, we introduce each module of DsMtGCN and how they are used jointly.

4.1 Message Passing

Inspired by CompGCN, we use an entity-relation composition operation \phi and a message passing function M to model the information flow from neighbors in different directions.

Specifically, the features of entity h and relation r are initialized as v_{h},v_{r}\in R^{d_{i}}. Given a known triple (h,r,t), the operation \phi is applied to get the distributed representation of (r,t), which is implemented in two ways, Multiplication (Mult) [10] and Circular-correlation (Corr) [39], defined as follows, where \circ denotes the Hadamard product and \star denotes the compressed tensor product (circular correlation).

\textbf{Mult:}\;\phi(r,t)=v_{r}\circ v_{t} (1)
\textbf{Corr:}\;\phi(r,t)=v_{r}\star v_{t} (2)

The representations are then aggregated by M, where W^{F},W^{B},W^{L}\in R^{d_{i}\times d} are direction-specific weights.

e^{f}_{h}=M(\mathcal{N}^{F}_{h},W^{F})=\sum_{(r,t)\in\mathcal{N}^{F}_{h}}W^{F}\phi(r,t) (3)
e^{b}_{h}=M(\mathcal{N}^{B}_{h},W^{B})=\sum_{(r^{-1},t)\in\mathcal{N}^{B}_{h}}W^{B}\phi(r^{-1},t) (4)
e^{l}_{h}=M(\{(r^{0},h)\},W^{L})=W^{L}\phi(r^{0},h) (5)
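A minimal PyTorch sketch of Eqs. (1)-(5) is given below; the module structure and tensor shapes are our own assumptions for illustration, and circular correlation is computed via the FFT identity corr(a, b) = IFFT(conj(FFT(a)) * FFT(b)).

import torch
import torch.nn as nn

def compose(v_r, v_t, op="mult"):
    if op == "mult":                                   # Eq. (1): Hadamard product
        return v_r * v_t
    # Eq. (2): circular correlation via FFT
    return torch.fft.irfft(torch.conj(torch.fft.rfft(v_r)) * torch.fft.rfft(v_t),
                           n=v_r.shape[-1])

class DirectionalAggregator(nn.Module):
    def __init__(self, d_in, d_out, op="mult"):
        super().__init__()
        self.op = op
        self.w_f = nn.Linear(d_in, d_out, bias=False)  # W^F for forward neighbors
        self.w_b = nn.Linear(d_in, d_out, bias=False)  # W^B for backward neighbors
        self.w_l = nn.Linear(d_in, d_out, bias=False)  # W^L for the self-loop

    def forward(self, v_h, v_r0, fwd_msgs, bwd_msgs):
        # fwd_msgs / bwd_msgs: (n_neighbors, d_in) pre-composed phi(r, t) features
        e_f = self.w_f(fwd_msgs).sum(dim=0)            # Eq. (3)
        e_b = self.w_b(bwd_msgs).sum(dim=0)            # Eq. (4)
        e_l = self.w_l(compose(v_r0, v_h, self.op))    # Eq. (5)
        return e_f, e_b, e_l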

4.2 Geometric constraints

In essence, e^{f}_{h} and e^{b}_{h} can be regarded as two views of h, which are expected to satisfy the principles of consensus and difference. Specifically, the consensus principle means that e^{f}_{h} and e^{b}_{h} should be appropriately close to each other, considering that they share similar semantics, while the difference principle requires that embeddings of different entities from the same view should be distinguishable as a whole.

To maintain the two principles, we impose a distance constraint dis(E^{f},E^{b}) to bring the two views closer, and conicity constraints con(E^{f}) and con(E^{b}) to increase the distinguishability of embeddings of different entities. The conicity term was originally defined as a geometric attribute [40], while it is regarded as an explicit constraint here. The formulas of dis and con are given below, where E^{f}_{i} stands for the i-th row of E^{f}, and \|x-y\|_{2} and cos(x,y) denote the Euclidean distance and cosine similarity.

dis(E^{f},E^{b})=\frac{1}{|\mathcal{E}|}\sum_{i=1}^{|\mathcal{E}|}\|E^{f}_{i}-E^{b}_{i}\|_{2} (6)
con(E^{f})=\frac{1}{|\mathcal{E}|}\sum_{i=1}^{|\mathcal{E}|}cos\left(E^{f}_{i},\frac{1}{|\mathcal{E}|}\sum_{j=1}^{|\mathcal{E}|}E^{f}_{j}\right) (7)

We illustrate the effect of these constraints in Fig. 2(b): the conicity constraints make it easier to distinguish points on the same plane, while the distance constraint helps to align points of the same color on different planes, and their combination achieves the best result.
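The two constraints translate directly into penalty terms; the following sketch (our own illustration) shows one way to compute them in PyTorch, with the trade-off weights \lambda_{1},\lambda_{2} corresponding to those in the final loss of Section 4.4.

import torch

def distance_constraint(E_f, E_b):
    # Eq. (6): mean Euclidean distance between the two views of each entity
    return (E_f - E_b).norm(p=2, dim=1).mean()

def conicity(E):
    # Eq. (7): mean cosine similarity between each embedding and the mean embedding
    center = E.mean(dim=0, keepdim=True)
    return torch.cosine_similarity(E, center, dim=1).mean()

def geometric_loss(E_f, E_b, lambda1, lambda2):
    # L_GC = lambda1 * dis(E^f, E^b) + lambda2 * (con(E^f) + con(E^b))
    return (lambda1 * distance_constraint(E_f, E_b)
            + lambda2 * (conicity(E_f) + conicity(E_b)))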

4.3 Attention mechanism

Considering the specific requirements for neighbor information in different directions under the forward and backward sub-tasks, we apply a multi-head self-attention mechanism [41] at the entity level.

For simplicity, we first explain how to calculate e^{F}_{h} for the forward sub-tasks with a single head. Suppose the dimension of the attention vectors is d_{a}. Given the stacked embedding E_{h}=[e_{h}^{f};e_{h}^{l};e_{h}^{b}]\in R^{3\times d}, we obtain the query vector Q_{h}^{F}\in R^{3\times d_{a}} and the key vector K_{h}^{F}\in R^{3\times d_{a}} as follows, where W^{F}_{Q},W^{F}_{K}\in R^{d\times d_{a}} are task-specific weights.

Q_{h}^{F}=E_{h}W^{F}_{Q},\quad K_{h}^{F}=E_{h}W^{F}_{K} (8)

Then the scaled dot-product result S^{F}\in R^{3\times 3} can be formulated as:

S^{F}=\frac{Q_{h}^{F}{K_{h}^{F}}^{T}}{\sqrt{d_{a}}} (9)

The final attention A^{F}\in R^{3} is obtained after softmax and mean-pooling (mp); the mp operator takes the mean value of each column of the matrix being processed. Note that the elements of A^{F} sum to 1, and the m-th element is denoted as A^{F}_{m}:

A^{F}=mp(softmax(S^{F})) (10)
A^{F}_{m}=\frac{1}{3}\sum_{j=1}^{3}\frac{\exp(S^{F}_{jm})}{\sum_{k=1}^{3}\exp(S^{F}_{jk})} (11)

To capture richer latent information and enhance the stability of the results, we can use more heads to compute A^{F}. If the number of heads is n_{h}, the updated formula is given below, where Q_{h}^{F_{i}}=E_{h}W_{Q}^{F_{i}} and K_{h}^{F_{i}}=E_{h}W^{F_{i}}_{K}:

A^{F}=\frac{1}{n_{h}}\sum_{i=1}^{n_{h}}A^{F_{i}}=\frac{1}{n_{h}}\sum_{i=1}^{n_{h}}mp\left(softmax\left(\frac{Q_{h}^{F_{i}}{K_{h}^{F_{i}}}^{T}}{\sqrt{d_{a}}}\right)\right) (12)

In the end, e_{h}^{F}\in R^{d} is the weighted sum of e_{h}^{f},e_{h}^{l},e_{h}^{b} as follows, and e_{h}^{B} can be calculated by a similar process, as shown in Fig. 2(c).

e_{h}^{F}=A^{F}_{1}e_{h}^{f}+A^{F}_{2}e_{h}^{l}+A^{F}_{3}e_{h}^{b}=A^{F}E_{h} (13)
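A compact PyTorch sketch of Eqs. (8)-(13) follows; the module layout is our own assumption, and a separate instance would be used for the forward and backward sub-tasks.

import torch
import torch.nn as nn

class DirectionAttention(nn.Module):
    def __init__(self, d, d_a, n_heads):
        super().__init__()
        self.d_a, self.n_heads = d_a, n_heads
        self.w_q = nn.ModuleList([nn.Linear(d, d_a, bias=False) for _ in range(n_heads)])
        self.w_k = nn.ModuleList([nn.Linear(d, d_a, bias=False) for _ in range(n_heads)])

    def forward(self, E_h):
        # E_h: (batch, 3, d) stacked [e^f; e^l; e^b] for each entity
        weights = 0
        for w_q, w_k in zip(self.w_q, self.w_k):
            Q, K = w_q(E_h), w_k(E_h)                               # Eq. (8)
            S = Q @ K.transpose(-1, -2) / self.d_a ** 0.5           # Eq. (9)
            weights = weights + S.softmax(dim=-1).mean(dim=-2)      # Eqs. (10)-(11): softmax then column mean
        A = weights / self.n_heads                                  # Eq. (12): average over heads
        return (A.unsqueeze(-1) * E_h).sum(dim=-2)                  # Eq. (13): weighted sum -> e^F_h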

4.4 Loss function

In this work, we use ConvE to model the interaction between entities and relations. Specifically, the score function \psi is defined as follows, where the convolution operation is applied on the 2D reshaping of e_{h} and e_{r} with filters \omega, and g denotes the nonlinear activation function.

\psi(e_{h},e_{r},e_{t})=g(vec(g([e_{h};e_{r}]*\omega))W)e_{t} (14)

Taking the different sub-tasks into account, one triple is scored as follows, where \sigma is the sigmoid function mapping the score to the interval (0,1), and \alpha is an indicator variable whose value is 1 if the current prediction task belongs to the forward prediction sub-tasks and 0 otherwise.

s(h,r,t)=\sigma(\alpha\psi(e_{h}^{F},e_{r},e_{t}^{F})+(1-\alpha)\psi(e_{h}^{B},e_{r},e_{t}^{B})) (15)

In most previous works, the training objective is to minimize the following standard binary cross-entropy loss for the i-th triple, where s_{i} is the predicted score, the label t_{i} indicates the existence or non-existence of the triple, and l controls the degree of label smoothing.

\mathcal{L}_{BCE}^{i}=\left\{\begin{array}{ll}-\log(\sigma(s_{i}))+\left(l-\frac{1}{|\mathcal{E}|}\right)s_{i}, & \text{if }t_{i}=1\\ -\log(\sigma(-s_{i}))-\frac{1}{|\mathcal{E}|}s_{i}, & \text{if }t_{i}=0\end{array}\right. (16)

To model the triple uncertainty, we propose a new label smoothing mechanism as follows, which prevents the model from assigning too low scores to potential targets, where u_{i}=|\{t\mid(h_{i},r_{i},t)\in\mathcal{T}^{train}\}|, ranging from 1 to |\mathcal{E}|, measures the uncertainty of the i-th triple, and k controls the scale of the adjustment:

\mathcal{L}_{TU}^{i}=\left\{\begin{array}{ll}-\log(\sigma(s_{i}))+\left(l-\frac{u_{i}^{k}}{|\mathcal{E}|}\right)s_{i}, & \text{if }t_{i}=1\\ -\log(\sigma(-s_{i}))-\frac{u_{i}^{k}}{|\mathcal{E}|}s_{i}, & \text{if }t_{i}=0\end{array}\right. (17)
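Implemented directly and averaged over the training triples, Eq. (17) can be written as the following PyTorch sketch; the tensor shapes are assumptions for a 1-vs-all scoring setup.

import torch
import torch.nn.functional as F

def triple_uncertainty_loss(scores, labels, u, num_entities, l=0.1, k=0.5):
    # scores, labels: (batch, |E|) raw scores s_i and binary indicators t_i
    # u: (batch, 1) number of known tails for (h_i, r_i) in the training set
    adjust = u.pow(k) / num_entities                      # u_i^k / |E|
    pos = F.softplus(-scores) + (l - adjust) * scores     # -log(sigmoid(s)) + (l - u^k/|E|) * s
    neg = F.softplus(scores) - adjust * scores            # -log(sigmoid(-s)) - (u^k/|E|) * s
    return torch.where(labels == 1, pos, neg).mean()      # mean over all scored triples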

The final loss function \mathcal{L} is defined as follows, where \lambda_{1} and \lambda_{2} are trade-off parameters:

\mathcal{L}=\mathcal{L}_{TU}+\mathcal{L}_{GC} (18)
\mathcal{L}_{TU}=\frac{1}{|\mathcal{T}^{train}|}\sum_{i=1}^{|\mathcal{T}^{train}|}\mathcal{L}_{TU}^{i} (19)
\mathcal{L}_{GC}=\lambda_{1}\,dis(E^{f},E^{b})+\lambda_{2}(con(E^{f})+con(E^{b})) (20)

5 Experiments

Table 1: Statistics of datasets.
Statistics FB15k-237 WN18RR
|\mathcal{E}| 14,541 40,943
|\mathcal{R}| 237 11
|\mathcal{T}^{train}| 272,115 86,835
|\mathcal{T}^{valid}| 17,535 3,034
|\mathcal{T}^{test}| 20,466 3,134

In this section, we demonstrate the effectiveness of DsMtGCN with extensive experiments; the experimental settings and result analysis are presented below.

5.1 Experimental Setting

Datasets

To evaluate the performance of DsMtGCN in a fair way, we adopt FB15k-237 [42] and WN18RR [15], two commonly used benchmarks. In particular, FB15k-237 provides general facts about actors, companies, films and so on, while WN18RR contains semantic knowledge of lexicons such as synonymy and hyponymy; both of them have their inverse relations removed to resolve the test leakage problem. Table 1 shows the detailed statistics of the above datasets.

Metrics

To evaluate the performance of different models on the link prediction task, several popular ranking-based metrics are reported, including mean reciprocal rank (MRR) and Hits@k (H@k) for k = 1, 3, 10. To avoid the unfairness caused by a model's tendency to output the same score for all triples, we adopt the random evaluation protocol [30]. Suppose r_{i} is the rank of the target tail entity for the i-th triple under the filtered setting, where valid candidates different from the target are filtered out [6]; MRR is defined as the mean value of \frac{1}{r_{i}} [43], while H@k measures the average proportion of r_{i} no greater than k. The formulas are defined as follows, where \mathcal{I} is the indicator function, whose value is 1 if the condition is satisfied and 0 otherwise.

MRR=\frac{1}{|\mathcal{T}^{test}|}\sum_{i=1}^{|\mathcal{T}^{test}|}\frac{1}{r_{i}} (21)
Hits@k=\frac{1}{|\mathcal{T}^{test}|}\sum_{i=1}^{|\mathcal{T}^{test}|}\mathcal{I}(r_{i}\leq k) (22)
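For reference, a plain-Python sketch of the filtered ranking and the two metrics (our own illustrative helpers, not the evaluation code used in the paper) could look as follows.

def filtered_rank(scores, target, known_tails):
    # scores: list of floats over all entities for one (h, r) query
    # known_tails: set of tail ids forming true triples with (h, r), filtered out before ranking
    target_score = scores[target]
    rank = 1
    for e, s in enumerate(scores):
        if e != target and e not in known_tails and s > target_score:
            rank += 1
    return rank

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    mrr = sum(1.0 / r for r in ranks) / len(ranks)                    # Eq. (21)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}   # Eq. (22)
    return mrr, hits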

Baselines

To illustrate the effectiveness of our model, the results of DsMtGCN are compared with following baselines.

  • 1.

    Non-neural models: Additive models including TransE [6], Complex [11], RotatE [44]. Multiplicative models including CrossE [45] and DistMult [10].

  • 2.

    Neural network based models: Models leveraging CNN or GCN to capture complex interactions between entities and relations. Typical models of the former include ConvE [15], ConvKB [16] and ConvR [17], while R-GCN [19], SACN [20], VR-GCN [21], KBGAT [29] and CompGCN [22] fall into the latter category.

  • 3.

    Multi-task models: Due to the lack of results of SENN [36], we only compare the TransMTL-E [37] model.

Implementations

We implement our model in PyTorch [46], and the Adam optimizer [47] is applied to speed up training. The optimal hyperparameters are determined by the best MRR evaluated on \mathcal{T}^{valid}. Concretely, we implement \phi with Corr and Mult for FB15k-237 and WN18RR [22], and the other hyperparameters are determined by grid search. For example, the learning rate ranges from 1e-4 to 1e-2, the batch size is selected from \{64,128,256,512\}, d_{i} and d are searched in \{100,200,300,400\}, and both l and k are within the range of 0 to 1. For FB15k-237, we set the learning rate to 0.001, the batch size to 128, d_{i} to 100, d to 200, l to 0.2 and k to 0.2. For WN18RR, we set the learning rate to 0.0003, the batch size to 256, d_{i} to 400, d to 200, l to 0.1 and k to 0.5.

5.2 Link Prediction Tasks

Table 2: Link prediction results of DsMtGCN and other baselines. The performance of other models is taken from the original or latest published papers, ”-” denotes missing results, and the best results are in bold.
Model FB15k-237 WN18RR
MRR H@1 H@3 H@10 MRR H@1 H@3 H@10
TransE 0.294 - - 0.465 0.226 - - 0.501
Complex 0.247 0.158 0.275 0.428 0.440 0.410 0.460 0.510
RotatE 0.338 0.241 0.375 0.533 0.476 0.428 0.492 0.571
CrossE 0.299 0.211 0.331 0.474 - - - -
DistMult 0.241 0.155 0.263 0.419 0.430 0.390 0.440 0.490
ConvE 0.325 0.237 0.356 0.501 0.430 0.400 0.440 0.520
ConvKB 0.243 0.155 0.371 0.421 0.249 0.057 0.417 0.524
ConvR 0.350 0.261 0.385 0.528 0.475 0.443 0.489 0.537
R-GCN 0.248 0.151 - 0.417 - - - -
SACN 0.350 0.261 0.390 0.540 0.470 0.430 0.480 0.540
VR-GCN 0.248 0.159 0.272 0.432 - - - -
KBGAT 0.157 - - 0.331 0.412 - - 0.554
CompGCN 0.355 0.264 0.390 0.535 0.479 0.443 0.494 0.546
TransMTL-E 0.336 - - 0.526 0.363 - - 0.541
DsMtGCN 0.363 0.273 0.398 0.545 0.481 0.445 0.495 0.544

We compare our DsMtGCN model with various baselines, and the results are summarized in Table 2. It can be observed that as the model structure becomes more complex, the overall performance improves: CNN-based models surpass non-neural models, and GNN-based models achieve better performance by leveraging graph structure. Compared with all of the above models, DsMtGCN outperforms them under all metrics on FB15k-237 and on 2 out of 4 metrics on WN18RR, indicating the strong competitiveness of our model.

Compared with other GCN models such as R-GCN, SACN and CompGCN, the improvement of our model can be attributed to the well-designed multi-task framework; on the other hand, the superior performance over previous MTL models can be explained by the introduction of the divided sub-tasks and the structure information aggregated from neighbors, which demonstrates the effectiveness of our model.

We can also observe several differences between metrics and datasets from the above results. In particular, mean rank (MR) is not a proper metric, as it often fails to keep pace with other metrics: a smaller MR does not imply a larger MRR or Hits@k, since the simple averaging it adopts is inevitably affected by extreme values. Besides, the performance of overly complex models on WN18RR is limited [30]; as it contains only 11 relations, the dataset is too simple to avoid over-fitting. In view of the complexity of real-world knowledge graphs as well as the fairness of comparison, our follow-up experiments pay more attention to metrics other than MR on FB15k-237.

5.3 Link Prediction Sub-tasks

Table 3: Results on link prediction sub-tasks by relation category on FB15k-237; the forward and backward sub-tasks correspond to Tail and Head Prediction in some other works.
Task ConvE InteractE SACN CompGCN DsMtGCN
MRR H@10 MRR H@10 MRR H@10 MRR H@10 MRR H@10
Forward Sub-tasks 1-1 0.193 0.385 0.386 0.547 0.422 0.547 0.457 0.604 0.468 0.589
1-N 0.068 0.116 0.106 0.192 0.093 0.187 0.112 0.190 0.113 0.214
N-1 0.438 0.638 0.466 0.647 0.454 0.647 0.471 0.656 0.484 0.67
N-N 0.246 0.436 0.276 0.476 0.261 0.459 0.275 0.474 0.286 0.486
Backward Sub-tasks 1-1 0.177 0.391 0.368 0.547 0.406 0.531 0.453 0.589 0.458 0.583
1-N 0.756 0.867 0.777 0.881 0.771 0.875 0.779 0.885 0.787 0.888
N-1 0.049 0.09 0.074 0.141 0.068 0.139 0.076 0.151 0.084 0.155
N-N 0.369 0.587 0.395 0.617 0.385 0.607 0.395 0.616 0.403 0.623
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Refer to caption
(e)
Refer to caption
(f)
Figure 3: Visualization comparison of DsMtGCN and DsMtGCN w/o GC. (a) Probability density of the cosine similarity of E^{f} from DsMtGCN. (b) Probability density of the cosine similarity of E^{b} from DsMtGCN. (c) Geometric distribution of E^{f} and E^{b} from DsMtGCN. (d) Probability density of the cosine similarity of E^{f} from DsMtGCN w/o GC. (e) Probability density of the cosine similarity of E^{b} from DsMtGCN w/o GC. (f) Geometric distribution of E^{f} and E^{b} from DsMtGCN w/o GC.

To further investigate the performance of DsMtGCN under different sub-tasks and relation types, we present the results of our model compared with ConvE [15], InteractE [18], SACN [20] and CompGCN [22] on FB15k-237 in Table 3. Following previous work, the relations are divided into four categories, namely one-to-one (1-1), one-to-many (1-N), many-to-one (N-1) and many-to-many (N-N), by counting the average number of tails per head and heads per tail [7]. It can be observed that both CompGCN and DsMtGCN outperform the other models with significant improvements on all four relation types, which can be attributed to the full exploitation of graph structure information. Moreover, DsMtGCN obtains the highest MRR for all relation types, especially complex ones like 1-N, N-1 and N-N, which indicates that aggregating information from neighbors in different directions specifically helps to handle complex relations under both the forward and backward sub-tasks.

5.4 Ablation Study

Table 4: Results of ablation study on FB15k-237.
Model MRR H@1 H@3 H@10
DsMtGCN 0.363 0.273 0.398 0.545
w/o GC 0.360 0.268 0.397 0.542
w/o MHSA 0.358 0.268 0.392 0.539
w/o TU 0.361 0.271 0.396 0.542

Since our model has outperformed the other competitors in the aforementioned comparative experiments, we conduct ablation experiments on FB15k-237 to further probe the contribution of each module to the final performance. Considering that the main innovations in this work can be divided into three modules, namely the geometric constraints, the multi-head fine-grained self-attention mechanism, and the label smoothing with triple uncertainty, we abbreviate them as GC, MHSA and TU respectively for convenience of explanation. To analyze the role of each module, we remove them from the original model in turn to obtain the following sub-models: DsMtGCN w/o GC, DsMtGCN w/o MHSA and DsMtGCN w/o TU, where w/o means without the specific module or mechanism. Specifically, we set \lambda_{1} and \lambda_{2} to 0 to obtain DsMtGCN w/o GC, we replace A^{F} with a vector filled with 1 to obtain DsMtGCN w/o MHSA, and k is set to 0 to obtain DsMtGCN w/o TU. These sub-models are trained under the same settings, and the results are shown in Table 4.

It’s obvious that the removal of MHSA brings the greatest decrease in effect, as it enables our model to make full use of structural information from neighbors in different directions at entity level, which is the core of our direction-sensitive multi-task framework. The effect of GC is evident as well, because it can help to improve the geometric distribution of embeddings. Lastly, the introduction of TU proves to be indispensable for further improvements by considering the number of possible tail entities.

To show the effect of the GC module more explicitly, we visualize the cosine similarity and geometric distribution of E^{f} and E^{b} in Fig. 3, where the mean value of the cosine similarity (i.e., conicity) is marked by a bold dotted line, and the scatter plots show the distribution of embeddings reduced to two dimensions by principal component analysis (PCA). From Fig. 3(a)(b)(d)(e), we can see that the curves with GC are flatter and their centers lie further to the left, which means that better diversity is preserved among the embeddings. Besides, the points in Fig. 3(c) are more evenly distributed than those in Fig. 3(f), and the dots with different colors cover each other more completely in Fig. 3(c), indicating the validity of the GC module.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 4: The visualization comparison of DsMtGCN r/w MHAA, DsMtGCN r/w MHPA and DsMtGCN. (a) The attention distribution and prediction performance for DsMtGCN r/w MHAA. (b) The attention distribution and prediction performance for DsMtGCN r/w MHPA. (c) The attention distribution and prediction performance for DsMtGCN.
Table 5: Several examples reflecting the preference of forward and backward sub-tasks to neighbors in different directions. The key neighbors playing prompt roles in prediction are bolded, and the relative attention weights of forward and backward neighbors are listed in the rightmost column.
Task \Rightarrow Target Forward \| Backward Neighbors Weights
(Soul music, artists, ?) \Rightarrow Tyrese Gibson (artists, Patti Labelle), (artists, Kim Carnes) \| (genre-1, Funk) (0.66, 0.34)
(John Cale, profession, ?) \Rightarrow Composer (profession, Actor), (nationality, Wales) \| (genre-1, Art rock), (instrument-1, Piano) (0.29, 0.71)
(Who Framed Roger Rabbit, genre, ?) \Rightarrow Fantasy (genre, Parody), (genre, Buddy film), (country, USA), (release, DVD) \| (actor-1, Joel Silver), (distributor-1, Disney), (director-1, Robert Zemeckis) (0.32, 0.68)
(Louise Fletcher, nomination-1,?) \Rightarrow Academy Award for Best Actress (profession, Actor), (gender, Female), (film, One Flew Over the Cuckoo’s Nest) \| (student-1, University of North Carolina) (0.73, 0.27)
(Charles S.Dutton, student-1, ?) \Rightarrow Towson University (place_of_birth, Baltimore), (place_live, Baltimore), (profession, Television director) \| (student-1, Yale University), (ethnicity-1, African American) (0.69, 0.31)
(Earth, area-1, ?) \Rightarrow Egypt (currency, United States dollar), (taxonomy, Library of Congress Classification) \| (area-1, Italy), (area-1, Kenya), (area-1, Singapore) (0.26, 0.74)

To further verify the effectiveness of the MHSA module, we replace it with a multi-head adaptive attention mechanism (MHAA) and a multi-head parameter-generated attention mechanism (MHPA). In particular, MHAA implements attention weights with randomly initialized learnable vectors, while MHPA applies a linear transformation from the embedding matrix to the attention weights. The attention distributions generated by the above models and the corresponding prediction performance are shown in Fig. 4, where r/w means replacing MHSA with the above substitutes. It is clear that when dealing with different sub-tasks, the preference for neighbor information in different directions changes evidently: backward neighbors receive more attention under the forward sub-tasks, while the opposite happens for the backward sub-tasks. Besides, there exist intersections between curves, which means that even for the same sub-task, different entities have different preferences, demonstrating the importance of applying attention at the entity level. Ultimately, the prediction performance in the figure increases gradually from left to right, and the overlapping parts between different curves become smaller, indicating that the MHSA module can better allocate attention weights according to different sub-tasks and entities.

5.5 Case Study

We employ a case study to further analyze why DsMtGCN works; several examples are shown in Table 5, where tasks and corresponding targets come from \mathcal{T}^{test}, and forward and backward neighbors are taken from \mathcal{T}^{train}. Generally speaking, backward neighbors provide more information for forward sub-tasks, and forward neighbors are more beneficial for backward sub-tasks, because they provide more complementary information in most cases. For example, when dealing with (Who Framed Roger Rabbit, genre, ?), neighbors in the same direction such as (genre, Parody) and (genre, Buddy film) are unable to provide additional information about Fantasy, as they are at the same level, while (distributor-1, Disney) gives a strong signal; when the query is (Charles S. Dutton, student-1, ?), knowing that Charles S. Dutton is a student of Yale University is of little help, but his deep connection with Baltimore can remind us of Towson University in Baltimore. On the other hand, this rule does not hold for queries about abstract concepts such as (Soul music, artists, ?) or (Earth, area-1, ?), as the opposite direction usually corresponds to an abstract hierarchy, while the answer is concrete.

6 Conclusion

In this work, we notice that the neighbor information aggregated in the forward and backward directions has specific significance for different entities and sub-tasks. To take advantage of such sensitivity to direction, we propose a novel multi-task GCN for knowledge graphs, DsMtGCN, which decomposes the single link prediction task into forward and backward sub-tasks; multi-head self-attention is utilized to combine neighbor information from different directions, and the geometric constraints and label smoothing with triple uncertainty are applied to obtain further performance gains. Through extensive comparison and ablation experiments, we demonstrate the effectiveness and rationality of our model. For future work, we will explore richer interactions between entities and relations, and the idea of triple uncertainty is also expected to be applied to more multi-label classification tasks.

References

  • Yin et al. [2016] J. Yin, X. Jiang, Z. Lu, L. Shang, H. Li, X. Li, Neural generative question answering, in: S. Kambhampati (Ed.), Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, IJCAI/AAAI Press, 2016, pp. 2972–2978. URL: http://www.ijcai.org/Abstract/16/422.
  • Ma et al. [2015] Y. Ma, P. A. Crook, R. Sarikaya, E. Fosler-Lussier, Knowledge graph inference for spoken dialog systems, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, IEEE, 2015, pp. 5346–5350. URL: https://doi.org/10.1109/ICASSP.2015.7178992. doi:10.1109/ICASSP.2015.7178992.
  • Szumlanski and Gomez [2010] S. R. Szumlanski, F. Gomez, Automatically acquiring a semantic network of related concepts, in: J. Huang, N. Koudas, G. J. F. Jones, X. Wu, K. Collins-Thompson, A. An (Eds.), Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010, ACM, 2010, pp. 19–28. URL: https://doi.org/10.1145/1871437.1871445. doi:10.1145/1871437.1871445.
  • Zhang et al. [2016] F. Zhang, N. J. Yuan, D. Lian, X. Xie, W. Ma, Collaborative knowledge base embedding for recommender systems, in: B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, R. Rastogi (Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, ACM, 2016, pp. 353–362. URL: https://doi.org/10.1145/2939672.2939673. doi:10.1145/2939672.2939673.
  • Dong et al. [2014] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, W. Zhang, Knowledge vault: a web-scale approach to probabilistic knowledge fusion, in: S. A. Macskassy, C. Perlich, J. Leskovec, W. Wang, R. Ghani (Eds.), The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, ACM, 2014, pp. 601–610. URL: https://doi.org/10.1145/2623330.2623623. doi:10.1145/2623330.2623623.
  • Bordes et al. [2013] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data (2013) 2787–2795. URL: https://proceedings.neurips.cc/paper/2013/hash/1cecc7a77928ca8133fa24680a88d2f9-Abstract.html.
  • Wang et al. [2014] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes, in: C. E. Brodley, P. Stone (Eds.), Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27 -31, 2014, Québec City, Québec, Canada, AAAI Press, 2014, pp. 1112–1119. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8531.
  • Lin et al. [2015] Y. Lin, Z. Liu, M. Sun, Y. Liu, X. Zhu, Learning entity and relation embeddings for knowledge graph completion, in: B. Bonet, S. Koenig (Eds.), Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, AAAI Press, 2015, pp. 2181–2187. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9571.
  • Nickel et al. [2011] M. Nickel, V. Tresp, H. Kriegel, A three-way model for collective learning on multi-relational data, in: L. Getoor, T. Scheffer (Eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, Omnipress, 2011, pp. 809–816. URL: https://icml.cc/2011/papers/438_icmlpaper.pdf.
  • Yang et al. [2015] B. Yang, W. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases (2015). URL: http://arxiv.org/abs/1412.6575.
  • Trouillon et al. [2016] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: M. Balcan, K. Q. Weinberger (Eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, JMLR.org, 2016, pp. 2071–2080. URL: http://proceedings.mlr.press/v48/trouillon16.html.
  • Zhang et al. [2019] S. Zhang, Y. Tay, L. Yao, Q. Liu, Quaternion knowledge graph embeddings (2019) 2731–2741. URL: https://proceedings.neurips.cc/paper/2019/hash/d961e9f236177d65d21100592edb0769-Abstract.html.
  • Glorot et al. [2013] X. Glorot, A. Bordes, J. Weston, Y. Bengio, A semantic matching energy function for learning with multi-relational data (2013). URL: http://arxiv.org/abs/1301.3485.
  • Liu et al. [2016] Q. Liu, H. Jiang, Z. Ling, S. Wei, Y. Hu, Probabilistic reasoning via deep learning: Neural association models, CoRR abs/1603.07704 (2016). URL: http://arxiv.org/abs/1603.07704. arXiv:1603.07704.
  • Dettmers et al. [2018] T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel, Convolutional 2d knowledge graph embeddings, in: S. A. McIlraith, K. Q. Weinberger (Eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 1811–1818. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17366.
  • Nguyen et al. [2018] D. Q. Nguyen, T. D. Nguyen, D. Q. Nguyen, D. Q. Phung, A novel embedding model for knowledge base completion based on convolutional neural network, in: M. A. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 327–333. URL: https://doi.org/10.18653/v1/n18-2053. doi:10.18653/v1/n18-2053.
  • Jiang et al. [2019] X. Jiang, Q. Wang, B. Wang, Adaptive convolution for multi-relational learning, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 978–987. URL: https://doi.org/10.18653/v1/n19-1103. doi:10.18653/v1/n19-1103.
  • Vashishth et al. [2020] S. Vashishth, S. Sanyal, V. Nitin, N. Agrawal, P. P. Talukdar, Interacte: Improving convolution-based knowledge graph embeddings by increasing feature interactions, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 3009–3016. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5694.
  • Schlichtkrull et al. [2018] M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, M. Welling, Modeling relational data with graph convolutional networks, in: A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, M. Alam (Eds.), The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, volume 10843 of Lecture Notes in Computer Science, Springer, 2018, pp. 593–607. URL: https://doi.org/10.1007/978-3-319-93417-4_38. doi:10.1007/978-3-319-93417-4\_38.
  • Shang et al. [2019] C. Shang, Y. Tang, J. Huang, J. Bi, X. He, B. Zhou, End-to-end structure-aware convolutional networks for knowledge base completion, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, AAAI Press, 2019, pp. 3060–3067. URL: https://doi.org/10.1609/aaai.v33i01.33013060. doi:10.1609/aaai.v33i01.33013060.
  • Ye et al. [2019] R. Ye, X. Li, Y. Fang, H. Zang, M. Wang, A vectorized relational graph convolutional network for multi-relational network alignment, in: S. Kraus (Ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 4135–4141. URL: https://doi.org/10.24963/ijcai.2019/574. doi:10.24963/ijcai.2019/574.
  • Vashishth et al. [2020] S. Vashishth, S. Sanyal, V. Nitin, P. P. Talukdar, Composition-based multi-relational graph convolutional networks, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=BylA_C4tPr.
  • Socher et al. [2013] R. Socher, D. Chen, C. D. Manning, A. Y. Ng, Reasoning with neural tensor networks for knowledge base completion, in: C. J. C. Burges, L. Bottou, Z. Ghahramani, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 926–934. URL: https://proceedings.neurips.cc/paper/2013/hash/b337e84de8752b27eda3a12363109e80-Abstract.html.
  • Guo et al. [2019] L. Guo, Z. Sun, W. Hu, Learning to exploit long-term relational dependencies in knowledge graphs, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 2505–2514. URL: http://proceedings.mlr.press/v97/guo19c.html.
  • Wang et al. [2019] Q. Wang, P. Huang, H. Wang, S. Dai, W. Jiang, J. Liu, Y. Lyu, Y. Zhu, H. Wu, Coke: Contextualized knowledge graph embedding, CoRR abs/1911.02168 (2019). URL: http://arxiv.org/abs/1911.02168. arXiv:1911.02168.
  • Yao et al. [2019] L. Yao, C. Mao, Y. Luo, KG-BERT: BERT for knowledge graph completion, CoRR abs/1909.03193 (2019). URL: http://arxiv.org/abs/1909.03193. arXiv:1909.03193.
  • Kipf and Welling [2016] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, CoRR abs/1609.02907 (2016). URL: http://arxiv.org/abs/1609.02907. arXiv:1609.02907.
  • Marcheggiani and Titov [2017] D. Marcheggiani, I. Titov, Encoding sentences with graph convolutional networks for semantic role labeling, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, Association for Computational Linguistics, 2017, pp. 1506–1515. URL: https://doi.org/10.18653/v1/d17-1159. doi:10.18653/v1/d17-1159.
  • Nathani et al. [2019] D. Nathani, J. Chauhan, C. Sharma, M. Kaul, Learning attention-based embeddings for relation prediction in knowledge graphs, in: A. Korhonen, D. R. Traum, L. Màrquez (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 4710–4723. URL: https://doi.org/10.18653/v1/p19-1466. doi:10.18653/v1/p19-1466.
  • Sun et al. [2020] Z. Sun, S. Vashishth, S. Sanyal, P. P. Talukdar, Y. Yang, A re-evaluation of knowledge graph completion methods, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 5516–5522. URL: https://doi.org/10.18653/v1/2020.acl-main.489. doi:10.18653/v1/2020.acl-main.489.
  • Vandenhende et al. [2020] S. Vandenhende, S. Georgoulis, M. Proesmans, D. Dai, L. V. Gool, Revisiting multi-task learning in the deep learning era, CoRR abs/2004.13379 (2020). URL: https://arxiv.org/abs/2004.13379. arXiv:2004.13379.
  • Zhang et al. [2014] Z. Zhang, P. Luo, C. C. Loy, X. Tang, Facial landmark detection by deep multi-task learning, in: D. J. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI, volume 8694 of Lecture Notes in Computer Science, Springer, 2014, pp. 94–108. URL: https://doi.org/10.1007/978-3-319-10599-4_7. doi:10.1007/978-3-319-10599-4\_7.
  • Misra et al. [2016] I. Misra, A. Shrivastava, A. Gupta, M. Hebert, Cross-stitch networks for multi-task learning, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 3994–4003. URL: https://doi.org/10.1109/CVPR.2016.433. doi:10.1109/CVPR.2016.433.
  • Liu et al. [2019] X. Liu, P. He, W. Chen, J. Gao, Multi-task deep neural networks for natural language understanding, in: A. Korhonen, D. R. Traum, L. Màrquez (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 4487–4496. URL: https://doi.org/10.18653/v1/p19-1441. doi:10.18653/v1/p19-1441.
  • Wang et al. [2019] H. Wang, F. Zhang, M. Zhao, W. Li, X. Xie, M. Guo, Multi-task feature learning for knowledge graph enhanced recommendation, in: L. Liu, R. W. White, A. Mantrach, F. Silvestri, J. J. McAuley, R. Baeza-Yates, L. Zia (Eds.), The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp. 2000–2010. URL: https://doi.org/10.1145/3308558.3313411. doi:10.1145/3308558.3313411.
  • Guan et al. [2018] S. Guan, X. Jin, Y. Wang, X. Cheng, Shared embedding based neural networks for knowledge graph completion, in: A. Cuzzocrea, J. Allan, N. W. Paton, D. Srivastava, R. Agrawal, A. Z. Broder, M. J. Zaki, K. S. Candan, A. Labrinidis, A. Schuster, H. Wang (Eds.), Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018, ACM, 2018, pp. 247–256. URL: https://doi.org/10.1145/3269206.3271704. doi:10.1145/3269206.3271704.
  • Dou et al. [2021] J. Dou, B. Tian, Y. Zhang, C. Xing, A novel embedding model for knowledge graph completion based on multi-task learning, in: C. S. Jensen, E. Lim, D. Yang, W. Lee, V. S. Tseng, V. Kalogeraki, J. Huang, C. Shen (Eds.), Database Systems for Advanced Applications - 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part I, volume 12681 of Lecture Notes in Computer Science, Springer, 2021, pp. 240–255. URL: https://doi.org/10.1007/978-3-030-73194-6_17. doi:10.1007/978-3-030-73194-6\_17.
  • Marcheggiani and Titov [2017] D. Marcheggiani, I. Titov, Encoding sentences with graph convolutional networks for semantic role labeling (2017) 1506–1515. URL: https://doi.org/10.18653/v1/d17-1159. doi:10.18653/v1/d17-1159.
  • Nickel et al. [2016] M. Nickel, L. Rosasco, T. A. Poggio, Holographic embeddings of knowledge graphs, in: D. Schuurmans, M. P. Wellman (Eds.), Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, AAAI Press, 2016, pp. 1955–1961. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12484.
  • Chandrahas et al. [2018] Chandrahas, A. Sharma, P. P. Talukdar, Towards understanding the geometry of knowledge graph embeddings, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, Association for Computational Linguistics, 2018, pp. 122–131. URL: https://aclanthology.org/P18-1012/. doi:10.18653/v1/P18-1012.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  • Toutanova and Chen [2015] K. Toutanova, D. Chen, Observed versus latent features for knowledge base and text inference, in: Proceedings of the 3rd workshop on continuous vector space models and their compositionality, 2015, pp. 57–66.
  • Zhang et al. [2019] W. Zhang, B. Paudel, W. Zhang, A. Bernstein, H. Chen, Interaction embeddings for prediction and explanation in knowledge graphs, in: J. S. Culpepper, A. Moffat, P. N. Bennett, K. Lerman (Eds.), Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, ACM, 2019, pp. 96–104. URL: https://doi.org/10.1145/3289600.3291014. doi:10.1145/3289600.3291014.
  • Sun et al. [2019] Z. Sun, Z. Deng, J. Nie, J. Tang, Rotate: Knowledge graph embedding by relational rotation in complex space (2019). URL: https://openreview.net/forum?id=HkgEQnRqYQ.
  • Zhang et al. [2019] W. Zhang, B. Paudel, W. Zhang, A. Bernstein, H. Chen, Interaction embeddings for prediction and explanation in knowledge graphs, in: J. S. Culpepper, A. Moffat, P. N. Bennett, K. Lerman (Eds.), Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, ACM, 2019, pp. 96–104. URL: https://doi.org/10.1145/3289600.3291014. doi:10.1145/3289600.3291014.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library (2019) 8024–8035. URL: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
  • Kingma and Ba [2015] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization (2015). URL: http://arxiv.org/abs/1412.6980.