MöbiusE: Knowledge Graph Embedding on Möbius Ring

Yao Chen [email protected] Jiangang Liu Zhe Zhang Shiping Wen Wenjun Xiong Department of Computer Science, Southwestern University of Finance and Economics, China Centre for Artificial Intelligence, University of Technology Sydney, Sydney, Australia

Abstract

In this work, we propose a novel Knowledge Graph Embedding (KGE) strategy, called MöbiusE, in which the entities and relations are embedded to the surface of a Möbius ring. The proposition of such a strategy is inspired by the classic TorusE, in which the addition of two arbitrary elements is subject to a modulus operation. In this sense, TorusE naturally guarantees the critical boundedness of embedding vectors in KGE. However, the nonlinear property of addition operation on Torus ring is uniquely derived by the modulus operation, which in some extent restricts the expressiveness of TorusE. As a further generalization of TorusE, MöbiusE also uses modulus operation to preserve the closeness of addition operation on it, but the coordinates on Möbius ring interacts with each other in the following way: any vector on the surface of a Möbius ring moves along its parametric trace will goes to the right opposite direction after a cycle. Hence, MöbiusE assumes much more nonlinear representativeness than that of TorusE, and in turn it generates much more precise embedding results. In our experiments, MöbiusE outperforms TorusE and other classic embedding strategies in several key indicators.

keywords:

Möbius ring, Torus ring , Knowledge graph , Embedding

^†^†journal: Journal of LaTeX Templates

1 Introduction

Graph or network structure can usually be used to model the intrinsic information behind data [1]. As an typical example, a knowledge graph (KG) is a collection of facts of the real world, related examples include DBpedia [2], YAGO [3] and Freebase [4]. These built KG database can be applied in many realistic engineering tasks, such as knowledge inference, question answering, and sample labeling. Driven by realistic applications, an intriguing and challenging problem for KG is: how to infer those unknown knowledge (or missing data) via the existing built KG database.

For any KG, the basic element of knowledge is represented by a triplet $(h,r,t)$ , which composed of two entities $h,t$ and one relation $r$ . In detail, $h$ is called the head, $r$ is called the tail, and $(h,r,t)$ means head $h$ maps to tail $t$ via the operation of relation $r$ . For example, the entity $h=$ U.S.A projects to $t=$ DonaldTrump under the relation $h=$ ThePresidentOf, and the entity $h=$ U.S.A projects to $t=$ MikePence under the relation $h=$ TheVicePresidentOf. Based on these two triplets, and considering the fact of “the president and vice president of a country at the same time are colleagues”, one predicts the existence of the triplet $($ DonaldTrump, IsColleagueOf, MikePence $)$ if such triplet is not contained in the given database. For realistic applications such as question answering, the core task is to predict those triplets which are not listed in the given database, such as task is called link prediction in KG.

There exists many types of method for link prediction in KG, among which translation-based methods treats each entity and relation as embedded vectors in some constructed algebraic space, and each true triplet can be translated to a simple constraint described by an algebraic expression $f(h,r,t)$ . Using such a method, if some $(h,r,t)$ is not presented in KG but the corresponding algebraic constraint index $f(h,r,t)$ is very small, then it is reasonable to believe that $(h,r,t)$ is a missing fact. Due to the choosing of different algebraic space, many different translation-based models have be obtained. TransE [5] is the first translation-based model for link prediction tasks, in which it uses the Euclidean space $\mathbb{R}^{n}$ as the embedding space and the constraint quantity for each triplet is set as simple as $\|h+r-t\|$ . TransR [6] builds the constraint equation by mapping the entities to a subspace derived by the relation vector. TansH [7] improves TransE in dealing with reflexive/ONE-TO-MANY/MANY-TO-ONE/MANY-TO-MANY cases in KG, such an improvement is caused by treating each relation as a translating operation on a hyperplane. DistMult [8] utilizes a simple bilinear formulation to model the constraint generated by triplets, in this way the composition of relation is characterized by matrix multiplication, which makes such a model good at capturing relational semantics. ComplEx [9] uses the standard dot product between embeddings and demonstrates the capability of complex-valued embeddings instead of real-valued numbers. TorusE [10] chooses a compact Lie group, a torus, as the embedding space, and is more scalable to large-size knowledge graphs because of its lower computational complexity. Other types of embedding methods include TransA [11], a locally and temporally adaptive translation-based approach; ConvE [12], a multi-layer convolutional network embedding model; TransF, an embedding strategy with flexibility in each triplet constraint; QuaternionE [13], an embedding method treats each relation as a quaternion in hypercomplex space.

Among the above embedding methods, the basic operation behind TorusE is a simple modulus operation and this operation automatically regularizes the calculated results to the given bounded area. In this way, there is no need to make an extra regularization operation in TorusE. Inspired by this property of TorusE, we move beyond one-dimensional modulus operation on Lie group and propose MöbiusE, which takes a Möbius ring (whose basic dimension is two) as the embedding space. This MöbiusE has the following major benefits for embedding tasks: At first, MöbiusE provides much more flexibility since each point on the surface of a Möbius ring has two different expressions; Second, the expressiveness of MöbiusE is much greater than TorusE due to the complex distance function built in Möbius ring; Third, MöbiusE also automatically truncates the training vectors to constraint space as that of TorusE; Finally, MöbiusE subsumes TorusE, and inherits all the attractive properties of TorusE, such as its ability to model symmetry/antisymmetry, inversion, and composition.

The rest of the paper is organized as follows: Section 2 revisits the idea of TorusE and proposes the basic idea of MöbiusE, especially, we discuss about how to define the distance function on a Möbius ring. Section 3 provides the experiment results for MöbiusE on datasets FB15K and WN18. Section 4 discuss about related works and compare MöbiusE with other embedding methods. Section 5 concludes this paper. In the end, Section 6 lists the proof for several key propositions on Möbius ring.

Name	Triplet Score Function	Embedding Space
TransE [5]	$\|\|\bf{h}+\bf{r}-\bf{t}\|\|$	$\bf{h,r,t}\in\mathbb{R}^{n}$
TransH [7]	$\|\|\bf{h}-\bf{w_{r}^{\mathrm{T}}hw_{r}}+d_{r}-(\bf{t}-\bf{w_{r}^{\mathrm{T}}tw_{r}})\|\|$	$\bf{h,t,d_{r},w_{r}}\in\mathbb{R}^{n}$
RESCAL [14]	$\bf{h}^{T}\bf{M}_{r}\bf{t}$	$\\|\bf{h}\\|_{2}\leq 1,\\|\bf{t}\\|_{2}\leq 1,\\|\bf{M}_{r}\\|_{F}\leq 1$
TorusE [10]	$\|\|\bf{h}+\bf{r}-\bf{t}\|\|_{T^{n}}$	$\bf{h,r,t}\in\mathbb{T}^{n}$
Distmult [8]	$-\bf{h^{\mathrm{T}}}diag(\bf{r})\bf{t}$	$\bf{h,r,t}\in\mathbb{R}^{n}$
ComplEx [9]	$-{\rm Re}(\bf{h^{\mathrm{T}}diag(r)\bar{t}})$	$\bf{h,r,t}\in\mathbb{C}^{n}$
MöbiusE (ours)	${\rm dist}(\bf{h}\oplus\bf{r},\bf{t})$	$\bf{h},\bf{r},\bf{t}\in\mathbb{M}^{q/p}_{n}$

Table 1: Scoring Functions of Typical KGE Models

2 Embedding on Möbius Ring

A KG is described by a set of triples $\Delta$ , where each $l=(h,r,t)\in\Delta$ contains $h$ , $r$ , and $t$ as head entity, relation, and tail entity, respectively. Denote $E=\{h,t|(h,r,t)\in\Delta\}$ and $R=\{r|(h,r,t)\in\Delta\}$ as the set of entities and relations of $\Delta$ . Let $f$ be the scoring function, the task of KGE is to find vectors $\mathbf{h},\mathbf{r},\mathbf{t}$ corresponding to $h,r,t$ which minimize:

\displaystyle\mathcal{I}=\sum_{l=(h,r,t)\in\Delta}\sum_{(h^{\prime},r,t^{\prime})\in\Delta_{l}^{\prime}}[\gamma+f(\mathbf{h},\mathbf{r},\mathbf{t})-f(\mathbf{h}^{\prime},\mathbf{r},\mathbf{t}^{\prime})]_{+},

(1)

where

$\displaystyle H_{l}$	$\displaystyle=$	$\displaystyle\{(e,r,t)\|e\in E,e\neq h\},\quad l=(h,r,t),$	(2)
$\displaystyle T_{l}$	$\displaystyle=$	$\displaystyle\{(h,r,e)\|e\in E,e\neq t\},\quad l=(h,r,t),$	(3)
$\displaystyle\Delta_{l}^{\prime}$	$\displaystyle=$	$\displaystyle H_{l}\cup T_{l}-\Delta,$	(4)

and $l=(h,r,t)\in\Delta$ , $\gamma$ is a margin hyperparameter. The function $[\cdot]_{+}$ is defined by $[x]_{+}=\max(0,x)$ for $x\in\mathbb{R}$ .

Suppose $l=(h,r,t)$ be a positive triple, the trueness of $l$ assessed by $f$ is given by

	$\displaystyle{\rm rank}_{l}$	$\displaystyle=$	$\displaystyle\mbox{The rank of }f(l)\mbox{ in the sequence of }$		(5)
			$\displaystyle\{f(l)\}\cup f(\Delta_{l}^{\prime})\mbox{ in ascending order}.$		(5)

In general, we use the value ${\rm rank}_{l}$ to evaluate the trueness of a candidate triplet.

In order to begin with the following description, we give two critical functions $\mathbbm{m}_{k}(\cdot)$ and $\mathbbm{d}_{k}(\cdot)$ .

For any $u\in\mathbb{R}$ and $k\in\mathbb{Z}^{+}$ , $\mathbbm{m}_{k}(u)$ is defined as¹¹1 $\mathbbm{m}_{1}(u)$ is equivalent to $[u]$ in [10].

\displaystyle\mathbbm{m}_{k}(u)\in[0,k)\mbox{ and }u\equiv\mathbbm{m}_{k}(u)\,{\rm mod}\,k.

(6)

$\mathbbm{d}_{k}(u)$ is defined as

\displaystyle\mathbbm{d}_{k}(u)=\min(\mathbbm{m}_{k}(u),k-\mathbbm{m}_{k}(u)).

(7)

Based on these definitions, for any $u\in\mathbb{R}$ , it is not difficult to verify that $\mathbbm{m}_{k}(u)=\mathbbm{m}_{k}(u\pm k)$ , $\mathbbm{m}_{k}(-u)=k-\mathbbm{m}_{k}(u)$ , $\mathbbm{d}_{k}(-u)=\mathbbm{d}_{k}(u)$ , and $\mathbbm{d}_{k}(u)=\mathbbm{d}_{k}(u\pm k)$ . In particular, $\mathbbm{d}_{k}(u)$ can be reformulated as $\mathbbm{d}_{k}(u)=\min_{i\in\mathbb{Z}}|u+ik|$ .

2.1 Revisit Torus Ring $\mathbb{T}^{2}$

Before introducing $\mathbb{T}^{2}$ , we would like review the definition of Torus ring $\mathbb{T}^{1}$ . As shown in Fig. 1, $\mathbb{T}^{1}$ is a one-dimensional ring. In particular, any two points $A$ and $B$ on $\mathbb{T}^{1}$ can be uniquely defined by their angles $\theta=2\pi x$ and $\omega=2\pi y$ with $x,y\in[0,1)$ , these variables are all depicted in Fig. 1.

The addition operation $\oplus$ of any two points $A$ and $B$ is defined by $x\oplus y=\mathbbm{m}_{1}(x+y)$ . The distance between $A$ and $B$ can be viewed as the minimal value among the rotating angles from $A$ to $B$ and from $B$ to $A$ , i.e., ${\rm arc}_{AB}$ and ${\rm arc}_{BA}$ as shown in Fig. 1. Describing such an idea in math, we have

\displaystyle{\rm dist}(x,y)=\min\{\mathbbm{m}_{1}(x-y),\mathbbm{m}_{1}(y-x)\}=\mathbbm{d}_{1}(x-y).

This distance function (and all the following distance functions) satisfies

\displaystyle\left\{\begin{array}[]{l}\rm{dist}(x,x)=0,\\ \rm{dist}(x,y)=\rm{dist}(y,x)\geq 0,\\ \rm{dist}(x+y,z)\leq\rm{dist}(x,z)+\rm{dist}(y,z),\end{array}\right.

(11)

where $x,y,z\in[0,1)$ .

Torus ring $\mathbb{T}^{2}$ can be viewed as the independent stacking of two Torus ring $\mathbb{T}^{1}$ . Given $x=(x_{1},x_{2}),y=(y_{1},y_{2})\in[0,1)\times[0,1)$ as two points on Torus ring $\mathbb{T}^{2}$ , then the addition operation on $\mathbb{T}^{2}$ is defined by $x\oplus y=(\mathbbm{m}_{1}(x_{1}+y_{1}),\mathbbm{m}_{1}(x_{2}+y_{2}))$ . The distance function on $\mathbb{T}^{2}$ is given as

\displaystyle\rm{dist}(x,y)=\|\digamma\|_{\tau},\quad\digamma=(\rm{dist}(x_{1},y_{1}),\rm{dist}(x_{2},y_{2})),

where $\|\cdot\|_{\tau}$ is some vector norm.

Torus ring $\mathbb{T}^{2}$ can be extended to $n$ -dimensional case, let $x=(x_{i}),y=(y_{i})\in[0,1)^{\times n}$ as two points on $\mathbb{T}^{n}$ , it holds

	$\displaystyle x\oplus y$	$\displaystyle=$	$\displaystyle(\mathbbm{m}_{1}(x_{1}+y_{1}),\cdots,\mathbbm{m}_{1}(x_{n}+y_{n})),$		(12)
	$\displaystyle\rm{dist}(x,y)$	$\displaystyle=$	$\displaystyle\\|\digamma\\|_{\tau},,$		(13)

where $\digamma=(\mathbbm{d}_{1}(x_{1}-y_{1}),\mathbbm{d}_{1}(x_{2}-y_{2}),\cdots,\mathbbm{d}_{1}(x_{n}-y_{n})).$

Summarize the above discussion, the objective function (1) becomes TorusE when the scoring function $f(\cdot)$ is set as

\displaystyle f(\mathbf{h},\mathbf{r},\mathbf{t})={\rm dist}(\mathbf{h}\oplus\mathbf{r},\mathbf{t}),\quad\mathbf{h},\mathbf{r},\mathbf{t}\in\mathbb{T}^{n},

where ${\rm dist}(\cdot)$ is given by (13) and $\oplus$ is given by (12).

Refer to caption — Figure 1: Illustration of Torus ring $\mathbb{T}^{1}$ .

2.2 Möbius Ring $\mathbb{M}^{2}$

As an intuitive illustration of Möbius ring, we plot the surface of Möbius ring $\mathbb{M}^{2}$ in Fig. 3. In this figure, let the radius from the center of the hole to the center of the tube be $R$ , the radius of the tube be $r$ , then the parametric equation for Möbius ring is

\displaystyle\left\{\begin{array}[]{lcl}x(\theta,\omega)&=&(R+r\cos(\frac{\theta}{2}+\omega))\cos\theta,\\ y(\theta,\omega)&=&(R+r\cos(\frac{\theta}{2}+\omega))\sin\theta,\\ z(\theta,\omega)&=&r\sin(\frac{\theta}{2}+\omega).\\ \end{array}\right.

(17)

Letting $\theta=2\pi x_{1}$ and $\omega=2\pi x_{2}$ , then any point on $\mathbb{M}^{2}$ can be uniquely defined by $(x_{1},x_{2})$ . In particular, the period of $x_{1}$ is $2$ and the period of $x_{2}$ is $1$ , hence we set $(x_{1},x_{2})\in[0,2)\times[0,1)$ .

Given $x=(x_{1},x_{2}),y=(y_{1},y_{2})\in[0,2)\times[0,1)$ as two points on a Möbius ring, the addition operation $\oplus$ on $\mathbb{M}^{2}$ is defined by

\displaystyle x\oplus y=(\mathbbm{m}_{2}(x_{1}+y_{1}),\mathbbm{m}_{1}(x_{2}+y_{2})),

(18)

i.e., the addition on $\mathbb{M}^{2}$ is formulated by the modulus addition on each dimension with modulus $2$ and $1$ , respectively.

The distance function ${\rm dist}(\cdot)$ between $x$ and $y$ is defined as

$\displaystyle{\rm dist}(x,y)$	$\displaystyle=$	$\displaystyle\min(w_{1},w_{2}),$	(19)
$\displaystyle w_{1}$	$\displaystyle=$	$\displaystyle\mathbbm{d}_{2}(y_{1}-x_{1})+\mathbbm{d}_{1}(y_{2}-x_{2}),$	(20)
$\displaystyle w_{2}$	$\displaystyle=$	$\displaystyle\mathbbm{d}_{2}(y_{1}-x_{1}+1)+\mathbbm{d}_{1}\Big{(}y_{2}-x_{2}+\frac{1}{2}\Big{)}.$	(21)

One can further verify that the above definition follows the basic properties in (11). Moreover, the above distance satisfies:

Proposition 1.

For distance defined in (19) for $\mathbb{M}^{2}$ and any two points $x=(x_{1},x_{2})$ , $y=(y_{1},y_{2})$ in $[0,2)\times[0,1)$ , it holds $0\leq{\rm dist}(x,y)\leq\frac{3}{4}$ , where the upperbound $\frac{3}{4}$ achieves when $x_{1}-y_{1}=k\pm\frac{1}{2}$ and $x_{2}-y_{2}=\frac{k^{\prime}}{2}\pm\frac{1}{4}$ for some integer $k$ and $k^{\prime}$ .

We give the proof of Proposition 1 in Appendix 6.3. In what follows, we will explain the reason for the definition (19) by using equation (17).

Given $x=(x_{1},x_{2}),y=(y_{1},y_{2})\in\mathbb{M}^{2}$ and the corresponding points $A$ and $B$ on $\mathbb{M}^{2}$ , transforming $A$ to $B$ (or, $B$ to $A$ ) needs one of the following two operations:

a).

If transform $A$ to $B$ , then add $y_{1}-x_{1}+2k$ to the first component of $x$ , and add $y_{2}-x_{2}+k^{\prime}$ to the second component of $x$ for some integer $k$ and $k^{\prime}$ . If transform $B$ to $A$ , then add $x_{1}-y_{1}+2k$ to the first component of $y$ , and add $x_{2}-y_{2}+k^{\prime}$ to the second component of $y$ for some integer $k$ and $k^{\prime}$ . Hence, when we measure the distance in the first dimension using modulus $2$ and considering that $x_{1},y_{1}\in[0,2)$ , such a distance is $\mathbbm{d_{2}}(y_{1}-x_{1})$ . Similarly, if we measure the distance between $x$ and $y$ in the second dimension using modulus $1$ and considering that $x_{2},y_{2}\in[0,1)$ , the obtained distance is $\mathbbm{d_{1}}(y_{2}-x_{2})$ . By adding the distance in the two dimensions, we simply obtain the distance in this case be $\mathbbm{d}_{2}(y_{1}-x_{1})+\mathbbm{d}_{1}(y_{2}-x_{2})$ , which conforms with (20).

b).

If transform $A$ to $B$ , then add $y_{1}-x_{1}+2k+1$ to the first component of $x$ , and add $y_{2}-x_{2}+k^{\prime}+\frac{1}{2}$ to the second component of $x$ for some integer $k$ and $k^{\prime}$ . If transform $B$ to $A$ , then add $x_{1}-y_{1}+2k+1$ to the first component of $y$ , and add $x_{2}-y_{2}+k^{\prime}+\frac{1}{2}$ to the second component of $y$ for some integer $k$ and $k^{\prime}$ . In order to make an intuitive calculation, we calculate the value of $\frac{\theta}{2}+\omega$ and $\theta$ for $A$ in (17) as

\frac{\theta}{2}+\omega=\pi x_{1}+2\pi x_{2},\quad\theta=2\pi x_{1},

when add $y_{1}-x_{1}+2k+1$ to $x_{1}$ , and add $y_{2}-x_{2}+k^{\prime}+\frac{1}{2}$ to $x_{2}$ , it becomes

$\displaystyle\theta$	$\displaystyle=$	$\displaystyle 2\pi y_{1}+(4k+2)\pi,$
$\displaystyle\frac{\theta}{2}+\omega$	$\displaystyle=$	$\displaystyle\pi(y_{1}+2k+1)+2\pi(y_{2}+k^{\prime}+\frac{1}{2})$
	$\displaystyle=$	$\displaystyle\pi y_{1}+2\pi y_{2}+2(k+k^{\prime}+1)\pi,$

which coincides with the coordinate of $B$ on $\mathbb{M}^{2}$ (ignore multiple of $2\pi$ difference). Similar calculation can be conducted for the case of transforming $B$ to $A$ and hence we omit the details. To sum it up, when we measure the distance in the first dimension using modulus $2$ and considering that $x_{1},y_{1}\in[0,2)$ , such a distance is $\mathbbm{d_{2}}(y_{1}-x_{1}+1)$ . Similarly, if we measure the distance between $x$ and $y$ in the second dimension using modulus $1$ and considering that $x_{2},y_{2}\in[0,1)$ , the obtained distance is $\mathbbm{d}_{1}(y_{2}-x_{2}+\frac{1}{2})$ . By adding the distance in the two dimensions, we simply obtain the distance in this case be $\mathbbm{d}_{2}(y_{1}-x_{1}+1)+\mathbbm{d}_{1}(y_{2}-x_{2}+\frac{1}{2})$ , which conforms with (21).

Summarize the above discussion, the distance between $x,y\in\mathbb{M}^{2}$ should be the minimum of the two cases, which is equivalent to (19).

In fact, Möbius ring $\mathbb{M}^{2}$ degenerates to Torus ring in some special case, as shown in the following proposition, whose proof is given in Appendix 6.3.

Proposition 2.

If the first dimension of $\mathbb{M}^{2}$ is set to zero, then $\mathbb{M}^{2}$ is equivalent to $\mathbb{T}^{1}$ .

For comparison between Torus ring and Möbius ring, we list the parametric equations for Torus ring $\mathbb{T}^{2}$ as:

\displaystyle\left\{\begin{array}[]{lcl}x(\theta,\omega)&=&(R+r\cos\omega)\cos\theta,\\ y(\theta,\omega)&=&(R+r\cos\omega)\sin\theta,\\ z(\theta,\omega)&=&r\sin\omega.\\ \end{array}\right.

(25)

As we can see in (25), the period of $\theta$ and $\omega$ are both $2\pi$ , hence the following points on $\mathbb{T}^{2}$ are equivalent:

\displaystyle(\theta,\omega)\Leftrightarrow(\theta+2k\pi,\omega+2k^{\prime}\pi),\quad k,k^{\prime}\in\mathbb{Z}.

However, in Möbius (17), the period of $\theta$ and $\omega$ are $4\pi$ and $2\pi$ respectively. In particular, the following points can be viewed as equivalent points on Möbius ring:

	$\displaystyle(\theta,\omega)$	$\displaystyle\Leftrightarrow$	$\displaystyle(\theta+4k\pi,\omega+2k^{\prime}\pi)$
		$\displaystyle\Leftrightarrow$	$\displaystyle(\theta+(4k+2)\pi,\omega+(2k^{\prime}+1)\pi),\quad k,k^{\prime}\in\mathbb{Z},$

which can be verified in (17).

As shown in Fig. 2 and Fig. 3, the parametric curve on Torus ring and Möbius ring are also quite different: fixing $\omega$ in (25) generates a cycle, but fixing $\omega$ in (17) generates a twisted cycle.

2.3 Möbius Ring $\mathbb{M}^{q/p}$

The simple Möbius ring $\mathbb{M}^{2}$ can be extended to more general case $\mathbb{M}^{q/p}$ . Möbius ring $\mathbb{M}^{q/p}$ is a space constructed on $[0,q)\times[0,p)$ , where $p,q$ are co-prime positive integers, any $x,y\in\mathbb{M}^{q/p}$ satisfies:

$\displaystyle x\oplus y$	$\displaystyle=$	$\displaystyle(\mathbbm{m}_{q}(x_{1}+y_{1}),\mathbbm{m}_{p}(x_{2}+y_{2})),$	(26)
$\displaystyle\rm{dist}(x,y)$	$\displaystyle=$	$\displaystyle\min_{j=0}^{pq-1}\Big{\{}\mathbbm{d}_{q}(y_{1}-x_{1}+\frac{1}{p}j)+$	(27)
		$\displaystyle\mathbbm{d}_{p}(y_{2}-x_{2}+\frac{1}{q}j)\Big{\}}.$	(27)

The reason of why $\rm{dist}(x,y)$ is defined as (30) is similar to that of (19) and hence omitted here. The distance $\rm{dist}(\cdot,\cdot)$ satisfies properties (11) too, and in particular there is:

Proposition 3.

The distance defined in (30) for $\mathbb{M}^{q/p}$ satisfies

{\rm dist}(x,y)\in\Big{[}0,\frac{1}{2p}+\frac{1}{2q}\Big{)},

where the upperbound achieves when $(x,y)=\big{(}\frac{1}{2p},\frac{1}{2q}\big{)}$ .

The proof of the above proposition is similar to that of Proposition 1 and hence omitted in this paper. The proof of $\rm{dist}(x+y,z)\leq\rm{dist}(x,z)+\rm{dist}(y,z)$ in $\mathbb{M}^{q/p}$ in (11) is given in Appendix 6.1.

In order to increase the embedding dimensions for KGE, we define

\displaystyle\mathbb{M}^{q/p}_{n}=\underbrace{\mathbb{M}^{q/p}\uplus\mathbb{M}^{q/p}\uplus\cdots\uplus\mathbb{M}^{q/p}}_{n-1\;\;\uplus}

(28)

as the direct sum of $n$ Möbius ring $\mathbb{M}^{q/p}$ . For each $u=(u_{1},u_{2},\cdots,$ $u_{n})$ , $v=(v_{1},v_{2},\cdots v_{n})\in\mathbb{M}^{q/p}_{n}$ , there is $v_{i},u_{i}\in\mathbb{M}^{q/p}$ for each $1\leq i\leq n$ and

	$\displaystyle u\oplus v$	$\displaystyle=$	$\displaystyle(u_{1}\oplus v_{1},\cdots,u_{n}\oplus v_{n}),$		(29)
	$\displaystyle{\rm dist}(u,v)$	$\displaystyle=$	$\displaystyle\\|\digamma\\|_{\tau},$		(30)

where $\digamma=(\rm{dist}(u_{1},v_{1}),\cdots,\rm{dist}(u_{n},v_{n}))$ and $\|\cdot\|_{\tau}$ is some vector norm.

For embedding on Möbius ring, the scoring function $f(\mathbf{h},\mathbf{r},\mathbf{t})$ in (1) will be set as

\displaystyle f(\mathbf{h},\mathbf{r},\mathbf{t})={\rm dist}(\mathbf{h}\oplus\mathbf{r},\mathbf{t}),\quad\mathbf{h},\mathbf{r},\mathbf{t}\in\mathbb{M}^{q/p}_{n},

where the operation $\oplus$ is defined in (29) and distance function is defined in (30), which is listed in Tab. 1.

In what follows, we list a key proposition for Möbius ring $\mathbb{M}^{q/p}$ , whose proof is intuitive and hence omitted here.

Proposition 4.

The solutions of the following equation

\displaystyle{\rm dist}(x,0)=0

(31)

in the region $[0,q)\times[0,p)$ for $\mathbb{M}^{q/p}$ are $x=(iq/p,jp/q)$ with $0\leq i<p$ and $0\leq j<q$ .

In the next, we call the solutions of equation (31) be zero points of $\mathbb{M}^{q/p}$ .

For any given KG described by set of triplets $\Delta$ , one of the difficulties for KGE comes from the cycle structure in the given KG data, a cycle structure in KG means the existence of entities $h_{i}$ and relations $r_{i}$ ( $0\leq i\leq m$ ) such that

\displaystyle(h_{i},r_{i},h_{i+1})\in\Delta\mbox{ or }(h_{i+1},r_{i},h_{i})\in\Delta\mbox{ for }0\leq i\leq m-1,

and $(h_{m},r_{m},h_{0})\in\Delta$ or $(h_{0},r_{m},h_{m})\in\Delta$ . In the case that the given set $\Delta$ contains no cycle structure, the geometric structure of $\Delta$ becomes a tree, and the embedding task for this KG without considering the negative samples can be implemented via a simple iteration strategy. However, realistic KG data contains a great amount of logic cycles and an appropriate KGE strategy should choose an algebraic structure which is capable of fitting the constraint generated by tremendous number of cycles.

Consider the properties in (11), the constraint of the above cycle without entities can be described by an equation of relations on $\mathbb{M}^{q/p}$ :

\displaystyle{\rm dist}(\gamma_{0}\hat{r}_{0}+\gamma_{1}\hat{r}_{1}+\cdots\gamma_{s}\hat{r}_{s},0)=0,

(32)

where $\gamma_{i}\in\mathbb{Z}$ , $\hat{r}_{i}\in\{r_{0},r_{1},\cdots,r_{m}\}$ .

In general, the number of cycles in a given KG dataset is much larger than that of the relations, hence the set of equations (32) in Euclidean space is hardly to have an exact solution. Intuitively, since the number of zero points in $\mathbb{M}^{q/p}$ in the region $[0,q)\times[0,p)$ is $pq$ , the set of equations (32) derived by all the cycles in given KG data will have more chance to get an exact solution. However, the number of zero point in $\mathbb{T}^{2}=[0,1)\times[0,1)$ is only $1$ for TorusE, which is much smaller than that of $\mathbb{M}^{q/p}$ . In the above sense, MöbiusE assumes more powerful expressiveness than TorusE.

3 Experiments

The performance of the proposed MöbiusE is tested on two datasets: FB15K and WN18 [5], which are extracted from realistic knowledge graphs. The basic properties on these two datasets is given in Tab. 2.

We conduct the link prediction task by using the same method reported in [5]. For each test triplet (which is a positive sample), we replace its head (or tail) to generate a corrupted triplet, then the score of each corrupted triplet is calculated by the scoring function $f(\cdot)$ , and the ranking of the test triplet is obtained according to these scores, i.e., the definition of ${\rm rank}_{l}$ in (5). It should be noted from (4) that the set of generated corrupted elements excludes the positive triplets $\Delta$ , we call such a set be “filtered” [10], and the ranking values in our experiments are all obtained on the “filtered” set.

We choose stochastic gradient descent algorithm to optimize the objective (1). In order to generate the negative sample in (1), we employ the “Bern” method [7] for negative sampling because of the datasets only contains positive triplets.

To evaluate our model, we use Mean Rank (MR), Mean Reciprocal Rank (MRR) and HIT@m as evaluating indicators [5]. MR is calculated by ${\rm mean}_{l\in\Delta}({\rm rank}_{l})$ with $\Delta$ be the set of triplet data and ${\rm rank}_{l}$ is defined in (5), MRR is calculated by ${\rm mean}_{l\in\Delta}(1/{\rm rank}_{l})$ , HIT@m is defined by $|\{l\in\Delta:{\rm rank}_{l}\leq m\}|/|\Delta|$ .

We conduct a grid search to find a set of optimal hyper-parameters for each dataset, the searching area for margin $\gamma$ in (1) is $\{2000,1000,500,200,100\}$ and the searching area for learning rate $\alpha$ in stochastic gradient descent is $\{0.002,0.001,$ $0.0005,0.0002,0.0001\}$ . Finally, the learning rate $\alpha$ is set to $0.0005$ both for WN18 and FB15K, the margin $\gamma$ is set to $2,000$ for WN18 and $500$ for FB15K.

We conduct our experiments on the augmented Möbius ring $\mathbb{M}^{q/p}_{n}$ as defined in (28), and we choose three different types of Möbisu ring for embedding, i.e., $\mathbb{M}^{2/1}_{n}$ , $\mathbb{M}^{3/1}_{n}$ , $\mathbb{M}^{3/2}_{n}$ . In choosing the distance function in $\mathbb{M}^{q/p}_{n}$ , we choose $\|\cdot\|_{\tau}=\|\cdot\|_{1}$ in ${\rm dist}(\cdot)$ in (30). We call the above corresponding MöbiusE be MöbiusE(2,1), MöbiusE(3,1), MöbiusE(3,2) in Tab. 3 and Tab. 4. The value of $n$ in $\mathbb{M}^{q/p}_{n}$ is set to $5,000$ , which means in each embedding vector we have $10,000$ parameters to be trained. The dimensions for other types of models in Tab. 3 and Tab. 4 are all set to $10,000$ .

As shown in Tab. 3, Möbius(3,1) outperforms the other models in 4 of the 5 critical indicators. In Tab. 4, Möbius(2,1) outperforms the other models also in 4 of the 5 critical indicators.

Dataset	#Ent	#Rel	#Train	#Valid	#Test
WN18	40,943	18	141,442	5,000	5,000
FB15K	14,951	1,345	483,142	50,000	59,071

Table 2: Statistics of tested datasets.

Model	MRR	MR	HIT@10	HIT@3	HIT@1
TransE	0.516	73.21	0.847	0.738	0.269
TransH	0.559	82.48	0.795	0.680	0.404
RESCAL	0.354	-	0.587	0.409	0.235
DistMult	0.690	151.40	0.818	0.749	0.609
ComplEx	0.673	210.17	0.796	0.713	0.606
TorusE	0.746	143.93	0.839	0.784	0.689
MöbiusE(2,1)	0.782	118.53	0.862	0.817	0.731
MöbiusE(3,1)	0.783	114.50	0.863	0.817	0.734
MöbiusE(3,2)	0.767	73.87	0.863	0.816	0.702

Table 3: Performance of MöbiusE in FB15K

Model	MRR	MR	HIT@10	HIT@3	HIT@1
TransE	0.414	-	0.688	0.534	0.247
TransR	0.218	-	0.582	0.404	0.218
RESCAL	0.890	-	0.928	0.904	0.842
DistMult	0.797	655	0.946	-	-
ComplEx	0.941	-	0.947	0.945	0.936
TorusE	0.947	577.62	0.954	0.950	0.943
MöbiusE(2,1)	0.947	539.80	0.954	0.950	0.944
MöbiusE(3,1)	0.947	531.80	0.954	0.950	0.943
MöbiusE(3,2)	0.944	425.51	0.954	0.949	0.937

Table 4: Performance of MöbiusE in WN18

4 Related Works

The key for different types of KGE strategies is the choosing of embedding spaces, and in turn different addition operations and different scoring functions, i.e., different $f(\cdot)$ in (1). As the first work in KGE, TranE [5] uses the basic algebraic addition to generate the score function, and regularizes the obtained vectors via a normalized condition. In TransE, the left (right) entity and the relation uniquely defines the right (left) entity. However, for realistic KG, it may have the following properties: a). ONE-TO-MANY mapping, given any fixed relation $r$ , the left (right) entity corresponding to a the right (left) entity is not unique; b): MANY-TO-ONE mapping, given any fixed entities $h$ and $t$ , the relation between $h$ and $t$ may not be unique. The original TransE cannot resolve the above two drawbacks in the beginning.

As an improvement of TransE to overcome the above drawback a), TransR [6] maps each relation $r$ to a vector and a matrix, the corresponding scoring function is derived by the image of such a relation matrix. In this sense, different entities may be mapped to a same vector in the image space and the drawback a) of TransE is resolved. Similar to TransR, TransH [7] uses a single vector to obtain the image of each entities. As further generalization of TransR and TransH, TransD [15] uses a dynamic matrix which is determined both by relation and entities to generate the above mapping matrix. As another solution to the drawbacks of TransE, TransM [16] introduces a relation-based adjust factor to the scoring function in TransE, such a factor will be much smaller in MANY-TO-MANY case than that of ONE-TO-ONE case and hence penalizing effects works.

Introducing flexibility in the scoring function in some extent enhances the generalization capability of KGE. As an implementation of such an idea, TransF [17] uses a flexible scoring function in KGE, in which the matching degree is desribed by the inner products of desired entities and given relation. Increasing the complexity in the scoring function enhances the nonlinear representative capability of KGE, RESCAL [14] and DistMult [8] uses a matrix-induced inner product to represent the scoring function in KGE, and ComplEx [9] applies the similar but embeds both relation and entities to a complex-valued space. TransA [11] combines the idea of TransE and RESCAL and incorporates the residue vector in TransE to the vector-based norm in RESCAL, such a strategy can adaptively find the loss function according to the structure of knowledge graphs.

MöbiusE is inspired from TorusE [10], however, MöbiusE differs from TorusE in the following aspects: The minimal dimension of Torus ring is $1$ , but the minimal dimension of Möbius ring is $2$ ; The addition in Torus ring chooses an unique value for modulus operation (generally $1$ ), but the addition in Möbius ring chooses different values (generally $p$ and $q$ in $\mathbbm{M}^{q/p}$ ); The distance function is Möbius ring is strongly nonlinear (see, (30)), which is much more complicated than that of Torus ring. These major difference guarantees the strong expressiveness of MöbiusE.

5 Conclusions

A novel KGE strategy is proposed by taking advantage of the intertwined rotating property of Möbius ring. As a first step, we defined the basic addition operation on Möbius ring and discussed its related properties. Next, in order to obtain an appropriate scoring function based on Möbius ring, we built a distance function on Möbius ring and constructed the corresponding distance-induced scoring function. Finally, a complete KGE strategy is obtained via the above constructions, which outperforms several key KGE strategies in our conducted experiments.

6 Appendix

6.1 Proof of $\rm{dist}(x,z)\leq\rm{dist}(x,y)+\rm{dist}(y,z)$ in $\mathbb{M}^{q/p}$ .

According to the definition of $\rm{dist}(\cdot,\cdot)$ on $\mathbb{M}^{q/p}$ , we only need to prove $\rm{dist}(x+y,0)\leq\rm{dist}(x,0)+\rm{dist}(y,0)$ .

Note that for any $i_{1},i_{2},i_{1}^{\prime},i_{2}^{\prime},j,j^{\prime}\in\mathbb{Z}$ , there is

			$\displaystyle\mathbbm{d}_{q}(y_{1}+x_{1}+\frac{1}{p}j)+\mathbbm{d}_{p}(y_{2}+x_{2}+\frac{1}{q}j)$
		$\displaystyle\leq$	$\displaystyle\|y_{1}+x_{1}+\frac{1}{p}j+i_{1}q\|+\|y_{2}+x_{2}+\frac{1}{q}j+i_{2}p\|+$
		$\displaystyle\leq$	$\displaystyle\|x_{1}+\frac{1}{p}j+\frac{1}{p}j^{\prime}+i_{1}q+i_{1}^{\prime}q\|+$
			$\displaystyle\|x_{2}+\frac{1}{q}j+\frac{1}{q}j^{\prime}+i_{2}p+i_{2}^{\prime}p\|+$
			$\displaystyle\|y_{1}-\frac{1}{p}j^{\prime}-i_{1}^{\prime}q\|+\|y_{2}-\frac{1}{q}j^{\prime}-i_{2}^{\prime}p\|$

based on which and the arbitrariness of $i_{1}$ and $i_{2}$ , there is

			$\displaystyle{\rm dist}(x+y,0)$
		$\displaystyle=$	$\displaystyle\min_{j=0}^{pq-1}\left\{\mathbbm{d}_{q}(y_{1}+x_{1}+\frac{1}{p}j)+\mathbbm{d}_{p}(y_{2}+x_{2}+\frac{1}{q}j)\right\}$
		$\displaystyle\leq$	$\displaystyle\min_{j=0}^{pq-1}\left(\min_{i_{1}}\|x_{1}+\frac{1}{p}j+\frac{1}{p}j^{\prime}+i_{1}q+i_{1}^{\prime}q\|\right.+$
			$\displaystyle\left.\min_{i_{2}}\|x_{2}+\frac{1}{q}j+\frac{1}{q}j^{\prime}+i_{2}p+i_{2}^{\prime}p\|\right)+$
			$\displaystyle\|y_{1}-\frac{1}{p}j^{\prime}-i_{1}^{\prime}q\|+\|y_{2}-\frac{1}{q}j^{\prime}-i_{2}^{\prime}p\|$
		$\displaystyle=$	$\displaystyle\min_{j=0}^{pq-1}\left(\mathbbm{d}_{q}(x_{1}+\frac{1}{p}(j+j^{\prime}))+\mathbbm{d}_{p}(x_{2}+\frac{1}{q}(j+j^{\prime}))\right)$
			$\displaystyle+\|y_{1}-\frac{1}{p}j^{\prime}-i_{1}^{\prime}q\|+\|y_{2}-\frac{1}{q}j^{\prime}-i_{2}^{\prime}p\|.$
		$\displaystyle=$	$\displaystyle{\rm dist}(x,0)+\|y_{1}-\frac{1}{p}j^{\prime}-i_{1}^{\prime}q\|+\|y_{2}-\frac{1}{q}j^{\prime}-i_{2}^{\prime}p\|.$

Due to the arbitrariness of $j^{\prime},i_{1}^{\prime},i_{2}^{\prime}$ , there is ${\rm dist}(x+y,0)\leq{\rm dist}(x,0)+{\rm dist}(y,0)$ .

6.2 Proof of Proposition 1

We define a function $g(\alpha,\beta)=\min\{g_{1},g_{2}\}$ with $g_{1}=\mathbbm{d}_{2}(\alpha)+\mathbbm{d}_{1}(\beta)$ , $g_{2}=\mathbbm{d}_{2}(\alpha+1)+\mathbbm{d}_{1}(\beta+\frac{1}{2})$ . Based on which we know $\alpha$ has a period of $2$ and $\beta$ has a period of $1$ . Next, we divide the interval $[-1,0)$ to $4$ subintervals $I_{\alpha,i}=[-1+0.5i,-0.5+0.5i)$ , and divide $[-0.5,0.5)$ to $4$ subintervals $I_{\beta,j}=[-0.5+0.25j,-0.25+0.25j)$ with $i,j=0,1,2,3$ , then the value of $\sup_{x,y\in I_{\alpha_{i}}\times I_{\beta,j}}g$ can be calculated and all these $16$ values are $3/4$ .

6.3 Proof of Proposition 2

Let $x=(0,x_{2})$ , $y=(0,y_{2})$ , then ${\rm dist}(x,y)=\min(\mathbbm{d}_{1}(y_{2}-x_{2}),1+\mathbbm{d}_{1}(y_{2}-x_{2}+\frac{1}{2}))=\mathbbm{d}_{1}(y_{2}-x_{2})$ , hence $\mathbb{M}^{2}$ can be viewed as $\mathbb{T}^{1}$ by constraining the first dimension to zero.

References

[1] J. Lu, J. Xuan, G. Zhang, X. Luo, Structural property-aware multilayer network embedding for latent factor analysis, Pattern Recognition 76 (2018) 228 – 241.
[2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, Dbpedia: a nucleus for a web of open data, in: Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, 2007, pp. 722–735.
[3] M. F. Suchanek, G. Kasneci, G. Weikum, Yago: a core of semantic knowledge, in: Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 697–706.
[4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247–1250.
[5] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Advances in Neural Information Processing Systems, 2013, pp. 2787–2795.
[6] Y. Lin, Z. Liu, M. Sun, Y. Liu, X. Zhu, Learning entity and relation embeddings for knowledge graph completion, in: Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015, pp. 2181–2187.
[7] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes, in: Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014, pp. 1112–1119.
[8] B. Yang, W. T. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, in: Proceedings of the 4th International Conference of Learning Representations, 2015, pp. 1412–1423.
[9] T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 2071–2080.
[10] T. Ebisu, R. Ichise, Toruse: Knowledge graph embedding on a lie group, in: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018, pp. 1819–1826.
[11] Y. Jia, Y. Wang, X. Jin, H. Lin, X. Cheng, Knowledge graph embedding: A locally and temporally adaptive translation-based approach, ACM Transactions on the Web (2018) 1559–1131.
[12] T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel, Convolutional 2D knowledge graph embeddings, in: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018, pp. 1811–1818.
[13] S. Zhang, Y. Tay, L. Yao, Q. Liu, Quaternion knowledge graph embeddings, in: Advances in Neural Information Processing Systems, 2019, pp. 2731–2741.
[14] M. Nickel, V. Tresp, H. P. Kriegel, A three-way model for collective learning on multi-relational data, in: Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 809–816.
[15] G. Ji, S. He, L. Xu, K. Liu, J. Zhao, Knowledge graph embedding via dynamic mapping matrix, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, 2015, pp. 687–696.
[16] M. Fan, Q. Zhou, E. Chang, T. F. Zheng, Transition-based knowledge graph embedding with relational mapping properties, in: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computation, 2014, pp. 328–337.
[17] J. Feng, M. Huang, M. Wang, M. Zhou, Y. Hao, X. Zhu, Knowlege graph embedding by flexible translation, in: Proceedings of the 15th International Conference on Principles of Knowledge Representation and Reasoning, 2016, pp. 557–560.

			$\displaystyle\mathbbm{d}_{q}(y_{1}+x_{1}+\frac{1}{p}j)+\mathbbm{d}_{p}(y_{2}+x_{2}+\frac{1}{q}j)$
		$\displaystyle\leq$	$\displaystyle\|y_{1}+x_{1}+\frac{1}{p}j+i_{1}q\|+\|y_{2}+x_{2}+\frac{1}{q}j+i_{2}p\|+$
		$\displaystyle\leq$	$\displaystyle\|x_{1}+\frac{1}{p}j+\frac{1}{p}j^{\prime}+i_{1}q+i_{1}^{\prime}q\|+$
			$\displaystyle\|x_{2}+\frac{1}{q}j+\frac{1}{q}j^{\prime}+i_{2}p+i_{2}^{\prime}p\|+$
			$\displaystyle\|y_{1}-\frac{1}{p}j^{\prime}-i_{1}^{\prime}q\|+\|y_{2}-\frac{1}{q}j^{\prime}-i_{2}^{\prime}p\|$