DeepE: a deep neural network for knowledge graph embedding
Abstract
Recently, neural network based methods have shown their power in learning more expressive features on the task of knowledge graph embedding (KGE). However, the performance of deep methods often falls behind that of shallow ones on simple graphs. One possible reason is that deep models are difficult to train, while shallow models might suffice for accurately representing the structure of simple KGs.
In this paper, we propose a neural network based model, named DeepE, to address the problem, which stacks multiple building blocks to predict the tail entity based on the head entity and the relation. Each building block is an addition of a linear and a non-linear function. The stacked building blocks are equivalent to a group of learning functions with different non-linear depth. Hence, DeepE allows deep functions to learn deep features, and shallow functions to learn shallow features. Through extensive experiments, we find DeepE outperforms other state-of-the-art baseline methods. A major advantage of DeepE is its robustness. DeepE achieves a Mean Rank (MR) score that is 6%, 30% and 65% lower than the best baseline methods on FB15k-237, WN18RR and YAGO3-10 respectively. Our design makes it possible to train much deeper networks on KGE, e.g. 40 layers on FB15k-237, without sacrificing precision on simple relations. The code and data of DeepE will be released around 2022.11.30 at https://github.com/zhudanhao/DeepE.
Zhu Danhao (Department of Criminal Science and Technology, Jiangsu Police Institute; [email protected]), Huang Shujian (Department of Computer Science and Technology, Nanjing University), Shen Si (School of Economics and Management, Nanjing University of Science and Technology), Yin Chang and Ding Ziqi (Department of Computer Information and Network Security, Jiangsu Police Institute)
1 Introduction
Knowledge graphs (KGs) are collections of facts. Well-known knowledge graphs such as Freebase Bollacker et al. (2008), WordNet Miller (1995), YAGO Suchanek et al. (2007) and NELL Mitchell et al. (2018) have proved effective for a variety of downstream applications, such as question answering Ferrucci et al. (2010), information extraction Mintz et al. (2009) and recommender systems Zhang et al. (2016).
Real world KGs suffer from the problem of incompleteness. The task of Knowledge Graph Embedding (KGE) learns low-dimensional feature vectors for entities and relations, and then predicts the missing facts. The core of a KGE method is its score function: the score of a valid fact is expected to be higher than that of an invalid one. One main category of KGE methods is built on shallow score functions, e.g. TransE Bordes et al. (2013), DistMult Yang et al. (2014), RESCAL Nickel et al. (2011) and RotatE Sun et al. (2019). More recently, neural network based models have been proposed to learn more expressive features; representative methods include ConvE Dettmers et al. (2018) and follow-up works such as HypER Balažević et al. (2019), InteractE Vashishth et al. (2020) and JointE Zhou et al. (2022).
However, deep models do not always bring precision improvements. ConvE Dettmers et al. (2018) is found to have an advantage over shallow models on complex graphs that contain nodes with high average relation-specific indegree, but on simple graphs with low average indegree, ConvE cannot outperform its linear baselines. A similar phenomenon has also been observed by Vashishth et al. (2020). Dettmers et al. (2018)'s explanation is that deep models are difficult to train, while shallow models might suffice for accurately representing the structure of the simple KGs.
Unlike image recognition or language processing, one distinct characteristic of KGE is that shallow features also matter a lot. In image recognition, for example, pixel-level features can hardly identify a dog directly. In contrast, a large proportion of relation patterns in knowledge graphs are linear, such as symmetry/antisymmetry, inversion and composition Sun et al. (2019), and shallow functions are enough for scoring these linear patterns. Although deep functions are more expressive in theory, they may not learn as well as shallow functions on simple relations, whether because of overfitting or training difficulty.
Is there any way to build a model that enjoys the ability to learn deep features without paying the price of losing shallow features? In this paper, we propose a novel deep neural network, named DeepE, to achieve this goal. The key component of DeepE is the DeepE building block, whose output is an addition of linear and non-linear features. Stacking n building blocks yields a learning function consisting of n+1 sub-functions, with non-linear depths ranging from 0 to n. Each sub-function is then expected to be responsible for its own duty: deep functions for learning deep features, and shallow functions for learning shallow features.
The architecture of hybrid learning functions has two consequences. First, it is now possible to train very deep networks without losing precision on simple relations; for example, on FB15k-237 the best DeepE model contains 40 layers, while previous methods can afford only 1-4 layers. Second, DeepE is robust in various situations, especially when data is sparse or relations are difficult: the result will not be too bad as long as one sub-function works.
The contributions of the paper are summarized as follows.
• We propose DeepE, a KGE method that can learn very deep features without sacrificing the ability to learn shallow features.
• We provide a theoretical analysis of the key component of our method, the DeepE building block, to show why DeepE can learn features with different non-linear depth.
• Through extensive experiments on various simple and complex KG datasets, we demonstrate the effectiveness of DeepE.
2 Definitions
2.1 Problem definition
A knowledge graph is formalized as a set of knowledge triples or facts (h, r, t), each consisting of a relation r that points from a head entity h to a tail entity t.
Predicting the missing links of a KG can be formalized as a ranking problem. A KGE model first maps the entities and relations to vectors h, r, t, and then defines a score function that maps a triple to a scalar proportional to the likelihood of the triple. To predict the tail entity for (h, r, ?), all candidate entities are ranked by their scores; a similar process can be used for finding a head entity. In this paper, we do not distinguish head prediction from tail prediction: for each triple (h, r, t) in the training set, we insert a reverse triple (t, r′, h), where r′ is the reverse relation of r. At test time, finding the head of (?, r, t) is equivalent to solving (t, r′, ?).
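As an illustration of this augmentation, here is a minimal Python sketch (the helper name and the "_reverse" suffix are our own choices for exposition, not necessarily those used in the released DeepE code):

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

def add_reverse_triples(triples: List[Triple]) -> List[Triple]:
    """For every (h, r, t), also add (t, r_reverse, h), so that head prediction
    for (?, r, t) becomes tail prediction for (t, r_reverse, ?)."""
    augmented = list(triples)
    for h, r, t in triples:
        augmented.append((t, r + "_reverse", h))
    return augmented

# Example: predicting the head of (?, born_in, Warsaw) is handled as
# predicting the tail of (Warsaw, born_in_reverse, ?).
train = [("Marie_Curie", "born_in", "Warsaw")]
print(add_reverse_triples(train))
```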
2.2 Non-linear depth of functions
We formally define the non-linear depth of a function; a deeper function contains more nested non-linear transformations.
Definition 2.1 (0th order non-linear function)
A 0th order non-linear function is a function with no non-linear transformations.
Definition 2.2 (kth order non-linear function)
A kth order non-linear function is a non-linear transformation of a (k-1)th order non-linear function.
The linear combination of a kth and a jth (j ≤ k) order non-linear function is a kth order non-linear function.
Linear methods belong to the family of 0th order non-linear functions, and most existing neural network based methods are 1st to 4th order non-linear functions. The higher the order of a non-linear function, the deeper it is, and the deeper the features it produces.
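To make the definitions concrete, a small numerical illustration of non-linear depth (our own toy example; ReLU plays the role of the non-linear transformation):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)          # the non-linear transformation
W0, W1, W2 = (np.random.randn(4, 4) for _ in range(3))
x = np.random.randn(4)

f0 = W0 @ x            # 0th order: purely linear, no non-linear transformation
f1 = relu(W1 @ f0)     # 1st order: one nested non-linearity
f2 = relu(W2 @ f1)     # 2nd order: two nested non-linearities
g  = f0 + f2           # a linear combination of a 0th and a 2nd order function
                       # is still a 2nd order non-linear function
```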
3 The proposed method
3.1 Overall framework
The overall framework of DeepE is shown in Fig. 1. DeepE is composed of two multi-layer neural networks, named the feature extraction network and the project network. The former learns features from the head entity and the relation, while the latter projects the tail entity into the same space as the learned features. Similar to previous works Dettmers et al. (2018); Vashishth et al. (2020), h, r and t are first mapped to embeddings h, r, t ∈ R^d, where d denotes the dimension of the embeddings.
[Figure 1: The overall framework of DeepE, with (a) the DeepE building block and (b) the ResNet building block.]
3.2 Feature extraction network
The feature extraction network first concatenates h and r, and then passes the result through a BN layer and a dropout layer. The obtained vector is fed into multiple stacked DeepE building blocks, which return a feature vector v.
The DeepE building block is the basic component of the feature extraction network; we stack DeepE building blocks to extract features from the head entity and the relation.
3.2.1 A DeepE building block
The structure of a DeepE building block is shown in Fig. 1 (a). x and y are the input and output vectors of a building block. Formally, a DeepE building block is defined as follows:
y = x + W_2 σ(W_1 x)    (1)
W_1 and W_2 are the weight matrices, and σ is the non-linear activation function used throughout the paper. For convenience, we omit the bias terms, dropout layers and Batch Normalization (BN) layers here. Generally, a BN layer Ioffe and Szegedy (2015) and a dropout layer Srivastava et al. (2014) are inserted right after each linear layer, to reduce overfitting and stabilize training.
The proposed DeepE building block is quite similar to that of ResNet He et al. (2016a), shown in Fig. 1 (b). The only difference is that the DeepE building block removes the final non-linear layer of the ResNet building block. Hence, a DeepE building block outputs the sum of a linear feature vector x and a non-linear feature vector W_2 σ(W_1 x). In contrast, a ResNet building block can only produce a non-linear feature vector σ(x + W_2 σ(W_1 x)).
If the dimensions of x and y are not equal, a linear projection W_s is applied to x to make the element-wise addition possible:
y = W_s x + W_2 σ(W_1 x)    (2)
In the feature extraction network, the input and output dimensions of the first building block are 2d and d respectively, so it computes Eqn. 2 with W_s ∈ R^(d×2d). The other building blocks share the same input and output dimension d, so they compute Eqn. 1.
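A minimal PyTorch sketch of a DeepE building block following Eqns. 1 and 2 (our own illustration: the layer names, the choice of ReLU and the exact ordering of BN and dropout are assumptions and may differ from the released implementation):

```python
import torch
import torch.nn as nn

class DeepEBlock(nn.Module):
    """One DeepE building block: y = shortcut(x) + W2 * sigma(W1 * x) (Eqns. 1-2).
    The output is the sum of a linear term and a single non-linear term."""
    def __init__(self, in_dim, out_dim, dropout=0.4):
        super().__init__()
        self.linear1 = nn.Linear(in_dim, out_dim)      # W1
        self.bn1 = nn.BatchNorm1d(out_dim)
        self.drop1 = nn.Dropout(dropout)
        self.linear2 = nn.Linear(out_dim, out_dim)     # W2
        self.bn2 = nn.BatchNorm1d(out_dim)
        self.drop2 = nn.Dropout(dropout)
        # Linear projection W_s is only needed when the dimensions differ (Eqn. 2).
        self.shortcut = (nn.Identity() if in_dim == out_dim
                         else nn.Linear(in_dim, out_dim, bias=False))

    def forward(self, x):
        h = torch.relu(self.drop1(self.bn1(self.linear1(x))))   # the only non-linearity
        h = self.drop2(self.bn2(self.linear2(h)))
        return self.shortcut(x) + h    # no final activation, unlike a ResNet block
```

Under these assumptions, the first block of the feature extraction network would be instantiated as DeepEBlock(2 * d, d) and all later blocks as DeepEBlock(d, d).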
3.2.2 Stacking multiple building blocks
The identity mapping of x can well address the problem of network degradation He et al. (2016a), which allows training a network with multiple stacked building blocks. Consider a simple case of stacking two building blocks, where the superscript (i) denotes the ith building block:
y = x + W_2^(1) σ(W_1^(1) x) + W_2^(2) σ(W_1^(2) (x + W_2^(1) σ(W_1^(1) x)))    (3)
Eqn. 3 contains three terms, whose orders of non-linearity are 0, 1 and 2 respectively. It can easily be derived that when stacking n DeepE building blocks, the output consists of n+1 terms whose orders of non-linearity are 0, 1, ..., n respectively. Therefore, deep and shallow features can be learned by functions with different non-linear depth. In contrast, stacking n ResNet building blocks can only produce a non-linear function of nth-order depth, which is not suitable for extracting shallow features. Our experiments compare the performance of ResNet and DeepE building blocks in Section 4.3.2.
It is not recommended to add more non-linear layers inside a DeepE building block, e.g. blocks of the form x + W_3 σ(W_2 σ(W_1 x)). Consider stacking n such blocks, each with 2 inner non-linear layers: the output consists of n+1 terms with non-linear orders 0, 2, 4, ..., 2n. Clearly, the more inner non-linear layers a building block contains, the more levels of non-linear order the stacked architecture loses.
3.2.3 Dropout on identity mapping
Although identity mapping is beneficial for training deep networks, it may cause the problem of diminishing feature reuse Srivastava et al. (2015): the gradients prefer to go through the identity mappings rather than the weights within the building blocks.
To address the problem, we propose to drop out the identity mapping with a small ratio p. For a network of n stacked building blocks, the kth order non-linear feature vector has to go through the identity mappings n-k times, so its total dropout probability is 1-(1-p)^(n-k). This design gradually drops out more shallow features than deep features. For example, for a DeepE model with 40 building blocks (the one we use for FB15k-237) and p = 0.01, the final dropout probabilities of the 0th, 10th, 20th and 30th order non-linear features are 0.331, 0.260, 0.182 and 0.096 respectively.
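The cumulative probabilities above follow directly from this formula; a quick sanity check (with n = 40 and p = 0.01, the FB15k-237 setting):

```python
def identity_dropout_prob(p: float, n_blocks: int, order: int) -> float:
    """Total probability that the k-th order feature is dropped, given it
    passes through (n_blocks - order) dropped-out identity mappings."""
    passes = n_blocks - order
    return 1.0 - (1.0 - p) ** passes

for k in (0, 10, 20, 30):
    print(k, round(identity_dropout_prob(0.01, 40, k), 3))
# -> 0.331, 0.260, 0.182, 0.096
```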
Note that He et al. (2016b) also experimented with dropout on the identity mapping in ResNet, and found the network failed to converge to a good solution. One probable reason is that they used a dropout ratio of 0.5, which is too large: as discussed before, such a large value impedes signal propagation.
3.3 Project network
v is obtained via a series of functions with different orders of non-linearity, and thus lies in a space quite different from the original entity space. Therefore, it is better to project t into a space close to v. t is fed into a stack of ResNet building blocks, which returns a vector t′. In practice, the number of ResNet building blocks is no more than 2.
3.4 Score function
The score function is the dot product between v and t′. Overall, our score function is written as follows:
ψ(h, r, t) = F([h; r]) · P(t)    (4)
where [· ; ·] is the concatenation operator, and F and P are the feature extraction network and the project network respectively. For training, we use the standard cross entropy loss function.
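Putting the pieces together, here is a condensed sketch of the score function in Eqn. 4, reusing the DeepEBlock sketch from Section 3.2.1 (a simplified illustration of the architecture described above, not the authors' exact implementation; the class names, default dimensions and block counts are placeholders):

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Residual block used in the project network: y = relu(x + W2 * relu(W1 * x))."""
    def __init__(self, dim):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim)
        self.linear2 = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(x + self.linear2(torch.relu(self.linear1(x))))

class DeepE(nn.Module):
    """Condensed DeepE: feature extraction on [h; r], projection of t, dot-product score."""
    def __init__(self, n_entities, n_relations, dim=300, n_blocks=4, n_proj_blocks=1):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        # Feature extraction network: the first block maps 2d -> d, the rest d -> d.
        blocks = [DeepEBlock(2 * dim, dim)] + [DeepEBlock(dim, dim) for _ in range(n_blocks - 1)]
        self.extractor = nn.Sequential(*blocks)
        # Project network: a small stack of ResNet blocks applied to tail embeddings.
        self.projector = nn.Sequential(*[ResNetBlock(dim) for _ in range(n_proj_blocks)])

    def forward(self, head_idx, rel_idx):
        x = torch.cat([self.ent(head_idx), self.rel(rel_idx)], dim=-1)
        v = self.extractor(x)                        # F([h; r])
        all_tails = self.projector(self.ent.weight)  # P(t) for every candidate tail
        return v @ all_tails.t()                     # one score per entity (Eqn. 4)

# Training treats the scores as logits over all entities:
# loss = nn.CrossEntropyLoss()(model(h_idx, r_idx), t_idx)
```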
3.5 Space complexity
There are three types of parameters in DeepE. First, the embedding matrices for entities and relations, with size (|E| + |R|) × d. Second, the parameters of the DeepE building blocks in the feature extraction network: each building block has two weight matrices, so the parameter size is about 2nd², where n is the number of DeepE building blocks. Third, the parameters of the ResNet building blocks in the project network, whose parameter number is about 2md², where m denotes the number of ResNet building blocks.
In general, n and m are much smaller than |E| and |R|, and m is no more than 2. The final space complexity can therefore be reduced to O((|E| + |R|)d + nd²).
On large KGs with a massive amount of entities, DeepE is very parameter efficient, since the nd² term can also be neglected relative to |E|d.
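A back-of-the-envelope count following this analysis (our own illustration; biases, BN and dropout parameters are ignored, the first 2d-to-d block is approximated as two d × d matrices, and the configuration below is hypothetical):

```python
def deepe_param_count(n_entities, n_relations, dim, n_deepe_blocks, n_resnet_blocks):
    """Approximate parameter count following the three groups above."""
    embeddings = (n_entities + n_relations) * dim      # (|E| + |R|) x d
    extractor  = 2 * n_deepe_blocks * dim * dim        # ~2 d x d matrices per DeepE block
    projector  = 2 * n_resnet_blocks * dim * dim       # ~2 d x d matrices per ResNet block
    return embeddings, extractor + projector

# Hypothetical YAGO3-10-sized configuration: d = 500, 2 DeepE blocks, 1 ResNet block.
emb, blocks = deepe_param_count(123182, 37, 500, 2, 1)
print(emb, blocks)   # ~61.6M vs ~1.5M: the embedding term dominates on large KGs
```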
4 Experiments
In this section, we want to investigate the following research questions:
RQ1. How does DeepE perform compared with the baseline methods? (Section 4.2)
RQ2. What is the impact of the different modules in DeepE? (Section 4.3)
RQ3. What is the impact of functions with different orders of non-linearity? (Section 4.4)
RQ4. Why is DeepE more robust than other methods? (Section 4.5)
4.1 Experiment setup
4.1.1 Datasets
Following Dettmers et al. (2018); Sun et al. (2019); Vashishth et al. (2020), the three most common benchmark datasets are used. Their summary statistics are presented in Table 1.
Dataset | FB15k-237 | WN18RR | YAGO3-10
---|---|---|---
#Entities | 14541 | 40943 | 123182
#Relations | 237 | 11 | 37
#Train | 272115 | 86835 | 1079040
#Valid | 17535 | 3034 | 5000
#Test | 20446 | 3134 | 5000
• FB15k-237 Toutanova and Chen (2015) is a subset of FB15k in which inverse relations are removed.
• WN18RR Dettmers et al. (2018) is a subset of WN18 in which inverse relations are removed.
• YAGO3-10 Suchanek et al. (2007) is a subset of YAGO3. Most of the triples deal with descriptive attributes of people.
4.1.2 Evaluation protocol
Four metrics are used to measure performance: Mean Rank (MR), Mean Reciprocal Rank (MRR), Hit@1 and Hit@10. We follow the filtered setting of Bordes et al. (2013), i.e. when ranking candidate entities, all other true entities that appear in the train, valid and test sets are excluded. Note that models with higher MRR, Hit@1 and Hit@10, and lower MR, are preferred.
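For reference, a compact sketch of the filtered ranking protocol of Bordes et al. (2013) (our own illustration of the standard procedure, not code from the paper):

```python
import numpy as np

def filtered_rank(scores, true_tail, known_tails):
    """Rank of the true tail after filtering out all *other* tails that are
    known to be valid for this (head, relation) in train/valid/test."""
    scores = np.array(scores, dtype=float)
    other_true = [t for t in known_tails if t != true_tail]
    scores[other_true] = -np.inf                 # the filtered setting
    # rank = 1 + number of candidates scored strictly higher than the true tail
    return int((scores > scores[true_tail]).sum()) + 1

def metrics(ranks):
    ranks = np.asarray(ranks, dtype=float)
    return {"MR": ranks.mean(),
            "MRR": (1.0 / ranks).mean(),
            "Hit@1": (ranks <= 1).mean(),
            "Hit@10": (ranks <= 10).mean()}

# Toy example: 5 candidates, entity 2 is the gold tail, entity 4 is another known tail.
ranks = [filtered_rank([0.1, 0.9, 0.7, 0.2, 0.8], true_tail=2, known_tails=[2, 4])]
print(metrics(ranks))   # rank 2 -> MR 2.0, MRR 0.5, Hit@1 0.0, Hit@10 1.0
```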
4.1.3 Baseline methods
We compare our method with various KGE methods, which can be classified into three categories.
• Linear (shallow) methods, including DistMult Yang et al. (2014) and RotatE Sun et al. (2019).
• Convolutional neural network based methods, including ConvE Dettmers et al. (2018), HypER Balažević et al. (2019), InteractE Vashishth et al. (2020), AcrE Ren et al. (2020) and JointE Zhou et al. (2022).
• A multi-layer perceptron (MLP) method, ER-MLP-2d Ravishankar et al. (2017). Although ER-MLP-2d is not a very strong baseline, both it and DeepE are based on MLPs rather than CNNs; hence, our method is closer to ER-MLP-2d than to the other baselines.
4.1.4 Training details
All parameters are randomly initialized with the Xavier normal distribution Glorot and Bengio (2010). We use the Adam Kingma and Ba (2014) optimizer with an initial learning rate of 0.003. The learning rate is decreased by a factor of 0.8 when the training loss does not decrease for 5 epochs. The other training parameters for each dataset are shown in Table 2; a rough sketch of this optimisation schedule is given after the table. Training terminates when the MRR on the valid set does not improve for 10 epochs, and the best model is selected at the best valid MRR. We report the average results of 5 runs with different random initializations. The maximum number of training epochs is 1000; in practice, WN18RR and YAGO3-10 finish training in about 200 epochs, while FB15k-237 takes about 700 epochs.
Hyper-parameter | FB15k-237 | WN18RR | YAGO3-10
---|---|---|---
P1 | 300 | 250 | 500
P2 | 5e-8 | 5e-5 | 5e-8
P3 (number of DeepE building blocks) | 40 | 1 | 2
P4 | 1 | 2 | 1
P5 | 2 | 3 | 2
P6 | 0.4 | 0.4 | 0.4
P7 (identity dropout ratio p) | 0.01 | 0 | 0
P8 | 0.4 | 0 | 0
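The optimisation schedule described above translates roughly into the following PyTorch configuration (a sketch under the stated hyper-parameters; the stand-in model and the placeholder training/evaluation lines mark where the real DeepE model, training epoch, and filtered-MRR evaluation would go):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # stand-in so the snippet runs; in practice, the DeepE model

optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
# Decay the learning rate by a factor of 0.8 when the training loss
# has not decreased for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.8, patience=5)

best_mrr, bad_epochs, MAX_EPOCHS, PATIENCE = 0.0, 0, 1000, 10
for epoch in range(MAX_EPOCHS):
    train_loss = 1.0 / (epoch + 1)   # placeholder: one epoch of training goes here
    scheduler.step(train_loss)
    valid_mrr = 0.0                  # placeholder: filtered MRR on the valid set
    if valid_mrr > best_mrr:
        best_mrr, bad_epochs = valid_mrr, 0
        torch.save(model.state_dict(), "best_deepe.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= PATIENCE:   # early stopping on valid MRR
            break
```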
4.2 Performance Comparison
4.2.1 Main results
Model | FB15k-237 | WN18RR | YAGO3-10 | |||||||||
MR | MRR | Hit@1 | Hit@10 | MR | MRR | Hit@1 | Hit@10 | MR | MRR | Hit@1 | Hit@10 | |
Distmult | 254 | 0.241 | 0.155 | 0.419 | 5110 | 0.43 | 0.39 | 0.49 | 5926 | 0.34 | 0.24 | 0.54 |
RotatE | 177 | 0.338 | 0.241 | 0.533 | 3340 | 0.476 | 0.428 | 0.571 | 1767 | 0.495 | 0.402 | 0.67 |
ConvE | 244 | 0.325 | 0.237 | 0.501 | 4187 | 0.43 | 0.4 | 0.52 | 1671 | 0.44 | 0.35 | 0.62 |
HypER | 250 | 0.341 | 0.252 | 0.52 | 5798 | 0.465 | 0.436 | 0.522 | 2529 | 0.533 | 0.455 | 0.678 |
InteractE | 172 | 0.354 | 0.263 | 0.535 | 5202 | 0.463 | 0.43 | 0.528 | 2375 | 0.541 | 0.462 | 0.687 |
AcrE | - | 0.358 | 0.266 | 0.545 | - | 0.459 | 0.422 | 0.532 | - | - | - | - |
JointE | 177 | 0.356 | 0.262 | 0.543 | 4655 | 0.471 | 0.438 | 0.537 | - | 0.556 | 0.481 | 0.695 |
ER-MLP2d | 234 | 0.338 | - | 0.547 | 4233 | 0.358 | - | 0.421 | - | - | - | - |
DeepE | 161 | 0.358 | 0.266 | 0.544 | 2337 | 0.487 | 0.445 | 0.567 | 591 | 0.558 | 0.485 | 0.692 |
The main results are presented in Table 3. Overall, DeepE achieves the best or the second best results on most metrics. In particular, our method outperforms the baseline methods on MR by a large margin, i.e. 172 -> 161 (-6%) on FB15k-237, 3340 -> 2337 (-30%) on WN18RR, and 1671 -> 591 (-65%) on YAGO3-10. Compared with MRR, Hit@1 and Hit@10, MR is a metric more concerned with robustness, since a single bad prediction can pull the value down a lot. We analyze the robustness of DeepE in Section 4.5.
FB15k-237 is the dataset most suited to deep methods: InteractE, AcrE and JointE outperform RotatE on almost all metrics. On YAGO3-10, the advantage is less significant; InteractE and JointE are better than RotatE on 3 out of 4 metrics. But on WN18RR the situation changes: RotatE outperforms all other non-linear methods on MR, MRR and Hit@10. The parameter settings of our method (Table 2) also reflect this phenomenon: the numbers of building blocks for the best DeepE models are 40, 2 and 1 on FB15k-237, YAGO3-10 and WN18RR respectively. However, unlike other non-linear methods that suffer from inconsistent performance, DeepE works well on all three datasets.
As the closest baseline, being MLP based, ER-MLP-2d falls far behind DeepE. Note that DeepE may be the first MLP based method to achieve SOTA results, which validates the effectiveness of our motivation.
4.2.2 Additional comparison
Some recent studies exploit graph structure information for link prediction on knowledge graphs. These methods often use KGE methods as tools for extracting features Dai et al. (2022); Wu et al. (2021). Since the score function of a graph based method is built on a sub-graph rather than on a single triple, it is meaningless to compare non-linear orders between these methods and DeepE. Table 4 presents the results of some SOTA graph based methods as an additional comparison with DeepE, including MRGAT Dai et al. (2022), CompGCN Vashishth et al. (2019), ReinceptionE Xie et al. (2020) and DisenKGAT Wu et al. (2021). Although DeepE does not model graph structure information explicitly, it still achieves the first or second best results on most metrics.
Model | FB15k-237 | WN18RR | ||
---|---|---|---|---|
MR | MRR | MR | MRR | |
MRGAT | - | 0.355 | - | 0.481 |
CompGCN | 197 | 0.355 | 3533 | 0.479 |
ReinceptionE | 173 | 0.349 | 1894 | 0.483 |
DisenKGAT | 179 | 0.368 | 1504 | 0.486 |
DeepE | 161 | 0.358 | 2337 | 0.487 |
4.3 Effect of different modules
4.3.1 Ablation analysis
The results of the ablation analysis are shown in Table 5. The project network is essential for performance: without it, MRR decreases by 0.016 on FB15k-237 and by 0.026 on WN18RR. Without identity dropout, MRR decreases by 0.003 on FB15k-237. As discussed in Section 3.2.3, identity dropout is designed for training very deep networks; we do not use it on WN18RR and YAGO3-10, since their best models contain only 1 or 2 DeepE building blocks.
Model | FB15k-237 | WN18RR |
---|---|---|
Original | 0.358 | 0.487 |
- project network | 0.342 (-0.016) | 0.461 (-0.026) |
- identity dropout | 0.355 (-0.003) | - |
4.3.2 Effect of depth
[Figure 3: MRR at different numbers of DeepE or ResNet building blocks on FB15k-237 and WN18RR.]
Fig. 3 shows the performance at different model depths, i.e. numbers of building blocks. On FB15k-237, the MRR of the ResNet building blocks drops quickly as the network goes deeper. In contrast, DeepE gains precision with deeper layers. The result shows that the DeepE building block is essential for training deeper networks for KGE.
On WN18RR, both DeepE and ResNet building blocks perform worse when more layers are added, but DeepE is more robust. When a new building block is stacked, a network based on ResNet building blocks replaces the old learning function with a new one, while DeepE adds a new function to the original learning function without modifying it. Even if the new function is not suitable, DeepE's performance will not decrease much, since the original learning function still works. We believe this is why the DeepE building block is more robust than that of ResNet across different depths.
4.4 Effect of functions with different non-linear orders
4.4.1 A single DeepE building block
A model with only one DeepE building block is trained on FB15k-237 to investigate the effect of the linear and non-linear functions within a building block. The results are shown in Table 6. The 0th order non-linear function suffices for accurate prediction on the simple N-1 relations. The main contribution of the 1st order non-linear function is on the complex 1-N relations (an improvement of 59% on MRR).
The results validate our hypothesis. For a single DeepE building block, the linear function is responsible for learning simple relations, and the non-linear function mainly serves for complex relations.
Rel type | 0th | 0th + 1st
---|---|---|
1-1 | 0.287 | 0.337 (+17%) |
N-1 | 0.679 | 0.701 (+3%) |
1-N | 0.064 | 0.102 (+59%) |
N-N | 0.27 | 0.31 (+15%) |
4.4.2 Stacked DeepE building blocks
[Figure 4: MRR of different relation groups with different numbers of DeepE building blocks.]
We train several models with different numbers of DeepE building blocks; more building blocks result in more orders of non-linear learning functions. The results are shown in Fig. 4. DeepE addresses the problem of training deep neural networks on KGE: deeper non-linear functions no longer degrade learning on simple relations, and relations in all groups achieve consistent improvements. For the simplest relation group, N-1, the curve is almost flat at all depths. The results validate the advantage of using learning functions with different orders of non-linearity for KGE.
4.5 Robustness
[Figure 5: MRR of DeepE and InteractE on FB15k-237 entities grouped by degree.]
RotatE | ConvE | InteractE | DeepE | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MRR | MR | Hit@10 | MRR | MR | Hit@10 | MRR | MR | Hit@10 | MRR | MR | Hit@10 | ||
Head Pred | 1-1 | 0.498 | 359 | 0.593 | 0.374 | 223 | 0.505 | 0.386 | 175 | 0.547 | 0.458 | 548 | 0.615 |
1-N | 0.092 | 614 | 0.174 | 0.091 | 700 | 0.17 | 0.106 | 573 | 0.192 | 0.121 | 519 | 0.222 | |
N-1 | 0.471 | 108 | 0.674 | 0.444 | 73 | 0.644 | 0.466 | 69 | 0.647 | 0.461 | 102 | 0.657 | |
N-N | 0.261 | 141 | 0.476 | 0.261 | 158 | 0.459 | 0.276 | 148 | 0.476 | 0.282 | 139 | 0.485 | |
Tail Pred | 1-1 | 0.484 | 307 | 0.578 | 0.366 | 261 | 0.51 | 0.368 | 308 | 0.547 | 0.424 | 358 | 0.573 |
1-N | 0.749 | 41 | 0.674 | 0.762 | 33 | 0.878 | 0.777 | 27 | 0.881 | 0.789 | 47 | 0.886 | |
N-1 | 0.074 | 578 | 0.138 | 0.069 | 682 | 0.15 | 0.074 | 625 | 0.141 | 0.08 | 576 | 0.162 | |
N-N | 0.364 | 90 | 0.608 | 0.375 | 100 | 0.603 | 0.395 | 92 | 0.617 | 0.39 | 83 | 0.616 |
In this subsection, we show that DeepE is robust across different entities and relations, especially when data is sparse and learning is difficult. Since DeepE is in effect a hybrid of learning functions with different non-linear depths, the prediction will not be too bad as long as one function works. In contrast, if a learning case relies on deep features, a method based on shallow functions may fail, and vice versa. The problem becomes more severe when data is sparse.
4.5.1 Performance on entities with low degrees
Learning on entities with low degree is difficult, since the number of training samples is small. However, these entities make up a large proportion of the total; for example, 25% of the entities in FB15k-237 have a degree smaller than 10.
We compare the performance of DeepE and InteractE on the low-degree entities of FB15k-237, as shown in Fig. 5. DeepE is more robust on low-degree entities; in particular, the MRR of DeepE is about 3-4 times higher than that of InteractE when the degree is 1 or 2.
4.6 Performance on relation types
We compare DeepE with some baseline methods on different relation types, as shown in the table above. Overall, DeepE achieves the best results on the majority of metrics. DeepE's advantage is particularly prominent on the more difficult relations, e.g. head prediction for 1-N and N-N, and tail prediction for N-1. The results validate that DeepE is robust on difficult relations.
5 Related works
ConvE Dettmers et al. (2018) is a pioneering work that uses convolutional neural networks for KGE. Convolution increases the interactions between the head entity and the relation while keeping parameter efficiency. Later, InteractE Vashishth et al. (2020), AcrE Ren et al. (2020) and JointE Zhou et al. (2022) explored the idea further: more interactions are obtained by adding filters, permuting features and changing convolution functions. Compared with these methods, DeepE learns shallow features better.
AcrE Ren et al. (2020) also applies identity mapping, using a skip connection from the input layer to the output layer. Its learning function can be viewed as the addition of a linear and a non-linear function, while DeepE contains functions with more levels of non-linear order.
The idea of DeepE's project network is similar to the ones in TransH Wang et al. (2014) and TransR Lin et al. (2015). The difference is that their projection functions are linear, while the one used by DeepE is a ResNet; since the feature extraction function of DeepE is non-linear, a linear function may not suffice for the projection.
Some recent graph based methods Dai et al. (2022); Vashishth et al. (2019); Xie et al. (2020); Wu et al. (2021) use neighborhood information for knowledge graph completion, while traditional KGE methods only score a single triple. The graph based methods often employ KGE methods as feature extraction kernels Dai et al. (2022); Wu et al. (2021). Generally, the more powerful the kernel, the better the results a graph based method may achieve. Hence, the advances of DeepE may further improve the performance of graph based methods.
6 Conclusion
In this paper, we propose a deep neural network for knowledge graph embedding. DeepE outperforms the SOTA methods and shows outstanding robustness. The experiments validate our hypothesis: shallow functions suffice for learning simple relations, while deep functions serve more for learning complex relations. Unlike other neural network based KGE methods, DeepE no longer suffers from performance degradation on simple relations, and hence can be very deep while still gaining performance improvements.
For future work, we will explore learning kernels other than a simple MLP; for example, CNNs are parameter efficient and have proved useful in many KGE methods. Another research direction is to extend the idea of DeepE to other graph-based learning problems, such as graph neural networks, since the hybrid learning functions seem well suited to graph data.
References
- Balažević et al. (2019) Ivana Balažević, Carl Allen, and Timothy M Hospedales. 2019. Hypernetwork knowledge graph embeddings. In International Conference on Artificial Neural Networks, pages 553–565. Springer.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250.
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26.
- Dai et al. (2022) Guoquan Dai, Xizhao Wang, Xiaoying Zou, Chao Liu, and Si Cen. 2022. Mrgat: Multi-relational graph attention network for knowledge graph completion. Neural Networks, 154:234–245.
- Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2d knowledge graph embeddings. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
- Ferrucci et al. (2010) David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building watson: An overview of the deepqa project. AI magazine, 31(3):59–79.
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings.
- He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence.
- Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
- Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011.
- Mitchell et al. (2018) Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi, Matt Gardner, Bryan Kisiel, et al. 2018. Never-ending learning. Communications of the ACM, 61(5):103–115.
- Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Icml.
- Ravishankar et al. (2017) Srinivas Ravishankar, Partha Pratim Talukdar, et al. 2017. Revisiting simple neural networks for learning representations of knowledge graphs. arXiv preprint arXiv:1711.05401.
- Ren et al. (2020) Feiliang Ren, Juchen Li, Huihui Zhang, Shilei Liu, Bochao Li, Ruicheng Ming, and Yujia Bai. 2020. Knowledge graph embedding with atrous convolution and residual learning. arXiv preprint arXiv:2010.12121.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
- Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
- Suchanek et al. (2007) Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706.
- Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. Rotate: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197.
- Toutanova and Chen (2015) Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd workshop on continuous vector space models and their compositionality, pages 57–66.
- Vashishth et al. (2020) Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, Nilesh Agrawal, and Partha Talukdar. 2020. Interacte: Improving convolution-based knowledge graph embeddings by increasing feature interactions. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 3009–3016.
- Vashishth et al. (2019) Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. 2019. Composition-based multi-relational graph convolutional networks. arXiv preprint arXiv:1911.03082.
- Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI conference on artificial intelligence, volume 28.
- Wu et al. (2021) Junkang Wu, Wentao Shi, Xuezhi Cao, Jiawei Chen, Wenqiang Lei, Fuzheng Zhang, Wei Wu, and Xiangnan He. 2021. Disenkgat: knowledge graph embedding with disentangled graph attention network. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2140–2149.
- Xie et al. (2020) Zhiwen Xie, Guangyou Zhou, Jin Liu, and Xiangji Huang. 2020. Reinceptione: relation-aware inception network with joint local-global structural information for knowledge graph embedding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5929–5939.
- Yang et al. (2014) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.
- Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 353–362.
- Zhou et al. (2022) Zhehui Zhou, Can Wang, Yan Feng, and Defang Chen. 2022. Jointe: Jointly utilizing 1d and 2d convolution for knowledge graph embedding. Knowledge-Based Systems, 240:108100.