
^1 Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, China
^2 Department of Computer Science, Xi'an Jiaotong University, Xi'an, China
Emails: {zengweixin13,xiangzhao,jiuyang_tang}@nudt.edu.cn, [email protected], {minnluo,qhzheng}@mail.xjtu.edu.cn

Towards Entity Alignment in the Open World:
An Unsupervised Approach

Weixin Zeng^1, Xiang Zhao^1, Jiuyang Tang^1, Xinyi Li^1, Minnan Luo^2, Qinghua Zheng^2
Abstract

Entity alignment (EA) aims to discover the equivalent entities in different knowledge graphs (KGs). It is a pivotal step for integrating KGs to increase knowledge coverage and quality. Recent years have witnessed a rapid increase of EA frameworks. However, state-of-the-art solutions tend to rely on labeled data for model training. Additionally, they work under the closed-domain setting and cannot deal with entities that are unmatchable.

To address these deficiencies, we offer an unsupervised framework that performs entity alignment in the open world. Specifically, we first mine useful features from the side information of KGs. Then, we devise an unmatchable entity prediction module to filter out unmatchable entities and produce preliminary alignment results. These preliminary results are regarded as the pseudo-labeled data and forwarded to the progressive learning framework to generate structural representations, which are integrated with the side information to provide a more comprehensive view for alignment. Finally, the progressive learning framework gradually improves the quality of structural embeddings and enhances the alignment performance by enriching the pseudo-labeled data with alignment results from the previous round. Our solution does not require labeled data and can effectively filter out unmatchable entities. Comprehensive experimental evaluations validate its superiority.

Keywords:
Entity alignment · Unsupervised learning · Knowledge graph

1 Introduction

Knowledge graphs (KGs) have been applied to various fields such as natural language processing and information retrieval. To improve the quality of KGs, many efforts have been dedicated to the alignment of KGs, since different KGs usually contain complementary information. Particularly, entity alignment (EA), which aims to identify equivalent entities in different KGs, is a crucial step of KG alignment and has been intensively studied over the last few years [1, 2, 3, 4, 5, 6, 7, 8]. We use Example 1 to illustrate this task.

Example 1

Figure 1 depicts a partial English KG and a partial Spanish KG concerning the director Hirokazu Koreeda, where the dashed lines indicate known alignments (i.e., seeds). The task of EA is to identify equivalent entity pairs between the two KGs, e.g., (Shoplifters, Manbiki Kazoku).

Figure 1: An example of EA.

State-of-the-art EA solutions [9, 10, 11, 12] assume that equivalent entities usually possess similar neighboring information. Consequently, they utilize KG embedding models, e.g., TransE [13], or graph neural network (GNN) models, e.g., GCN [14], to generate structural embeddings of entities in individual KGs. Then, these separated embeddings are projected into a unified embedding space by using the seed entity pairs as connections, so that the entities from different KGs are directly comparable. Finally, to determine the alignment results, the majority of current works [15, 16, 3, 17] formalize the alignment process as a ranking problem; that is, for each entity in the source KG, they rank all the entities in the target KG according to some distance metric, and the closest entity is considered as the equivalent target entity.
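To make the ranking formulation concrete, the following minimal sketch (our NumPy illustration, not code from any of the cited systems) ranks all target entities for a given source entity by distance in the unified embedding space and returns the closest one as the predicted match. Note that it necessarily outputs a match for every source entity, which foreshadows the unmatchable issue discussed below.

```python
import numpy as np

def rank_targets(M, u):
    """Conventional ranking-based alignment for source entity u.

    M is a |E1| x |E2| distance matrix between source and target
    entities in the unified embedding space. Returns the predicted
    match (the closest target) and the full ranking. A match is
    produced for every source entity, even an unmatchable one.
    """
    ranking = np.argsort(M[u])   # target indices, ascending distance
    return ranking[0], ranking
```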

Nevertheless, we still observe several issues from current EA works:

  • Reliance on labeled data. Most of the approaches rely on pre-aligned seed entity pairs to connect two KGs and use the unified KG structural embeddings to align entities. These labeled data, however, might not exist in real-life settings. For instance, in Example 1, the equivalence between Hirokazu Koreeda in $KG_{EN}$ and Hirokazu Koreeda in $KG_{ES}$ might not be known in advance. In this case, state-of-the-art methods that solely rely on the structural information would fall short, as there are no seeds to connect these individual KGs.

  • Closed-domain setting. All current EA solutions work under the closed-domain setting [18]; that is, they assume every entity in the source KG has an equivalent entity in the target KG. Nevertheless, in practical settings, there always exist unmatchable entities. For instance, in Example 1, the source entity Ryo Kase has no equivalent entity in the target KG. Therefore, an ideal EA system should be capable of predicting the unmatchable entities.

In response to these issues, we put forward an unsupervised EA solution UEA that is capable of addressing the unmatchable problem. Specifically, to mitigate the reliance on labeled data, we mine useful features from the KG side information and use them to produce preliminary pseudo-labeled data. These preliminary seeds are forwarded to our devised progressive learning framework to generate unified KG structural representations, which are integrated with the side information to provide a more comprehensive view for alignment. This framework also progressively augments the training data and improves the alignment results in a self-training fashion. Besides, to tackle the unmatchable issue, we design an unmatchable entity prediction module, which leverages thresholded bi-directional nearest neighbor search (TBNNS) to filter out the unmatchable entities and excludes them from the alignment results. We embed the unmatchable entity prediction module into the progressive learning framework to control the pace of progressive learning by dynamically adjusting the thresholds in TBNNS.

Contribution.    The main contributions of the article can be summarized as follows:

  • We identify the deficiencies of existing EA methods, namely, requiring labeled data and working under the closed-domain setting, and propose an unsupervised EA framework UEA that is able to deal with unmatchable entities. This is done by (1) exploiting the side information of KGs to generate preliminary pseudo-labeled data; (2) devising an unmatchable entity prediction module that leverages the thresholded bi-directional nearest neighbor search strategy to produce alignment results, which can effectively exclude unmatchable entities; and (3) offering a progressive learning algorithm to improve the quality of KG embeddings and enhance the alignment performance.

  • We empirically evaluate our proposal against state-of-the-art methods, and the comparative results demonstrate the superiority of UEA.

Organization.    In Section 2, we formally define the task of EA and introduce related work. Section 3 elaborates the framework of UEA. In Section 4, we introduce experimental results and conduct detailed analysis. Section 5 concludes this article.

2 Task Definition and Related Work

In this section, we formally define the task of EA, and then introduce the related work.

Task definition.    The inputs to EA are a source KG $G_1$ and a target KG $G_2$. The task of EA is defined as finding the equivalent entities between the two KGs, i.e., $\Psi=\{(u,v)\,|\,u\in E_1, v\in E_2, u\leftrightarrow v\}$, where $E_1$ and $E_2$ refer to the entity sets in $G_1$ and $G_2$, respectively, and $u\leftrightarrow v$ indicates that the source entity $u$ and the target entity $v$ are equivalent, i.e., $u$ and $v$ refer to the same real-world object.

Most current EA solutions assume that there exists a set of seed entity pairs $\Psi_s=\{(u_s,v_s)\,|\,u_s\in E_1, v_s\in E_2, u_s\leftrightarrow v_s\}$. Nevertheless, in this work, we focus on unsupervised EA and do not assume the availability of such labeled data.

Entity alignment.    The majority of state-of-the-art methods are supervised or semi-supervised, and they can be roughly divided into three categories: methods merely using the structural information, methods that utilize the iterative training strategy, and methods using information in addition to the structural information [19].

The approaches in the first category aim to mine useful structural signals for alignment, and devise structure learning models such as recurrent skipping networks [20] and multi-channel GNN [17], or exploit existing models such as TransE [9, 3, 21, 22, 23] and graph attention networks [3]. The embedding spaces of different KGs are connected by seed entity pairs. Based on the distances in the unified embedding space, the alignment results can then be predicted.

Methods in the second category iteratively label likely EA pairs as the training set and gradually improve alignment results [22, 15, 23, 21, 24]. A more detailed discussion of these methods and their differences from our framework is provided in Section 3.3. Methods in the third category incorporate the side information to offer a complementary view to the KG structure, including the attributes [10, 25, 26, 27, 28, 29], entity descriptions [30, 16], and entity names [31, 32, 12, 24, 33, 34]. These methods devise various models to encode the side information and consider them as features parallel to the structural information. In comparison, the side information in this work has an additional role, i.e., generating pseudo-labeled data for learning unified structural representations.

Unsupervised entity alignment.    A few methods have investigated the alignment without labeled data. Qu et al. [35] propose an unsupervised approach towards knowledge graph alignment with the adversarial training framework. Nevertheless, the experimental results are extremely poor. He et al. [36] utilize the shared attributes between heterogeneous KGs to generate aligned entity pairs, which are used to detect more equivalent attributes. They perform entity alignment and attribute alignment alternately, leading to more high-quality aligned entity pairs, which are used to train a relation embedding model. Finally, they combine the alignment results generated by attribute and relation triples using a bivariate regression model. The overall procedure of this work might seem similar to our proposed model. However, there are many notable differences; for instance, the KG embeddings in our work are updated progressively, which can lead to more accurate alignment results, and our model can deal with unmatchable entities. We empirically demonstrate the superiority of our model in Section 4.

We notice that there are some entity resolution (ER) approaches established in a setting similar to EA, represented by PARIS [37]. They adopt collective alignment algorithms such as similarity propagation so as to model the relations among entities. We include them in the experimental study for the comprehensiveness of the article.

3 Methodology

In this section, we first introduce the outline of UEA. Then, we elaborate its components.

Figure 2: Outline of UEA. Colored lines represent the progressive learning process.

As shown in Figure 2, given two KGs, UEA first mines useful features from the side information. These features are forwarded to the unmatchable entity prediction module to generate initial alignment results, which are regarded as pseudo-labeled data. Then, the progressive learning framework uses these pseudo seeds to connect two KGs and learn unified entity structural embeddings. It further combines the alignment signals from the side information and structural information to provide a more comprehensive view for alignment. Finally, it progressively improves the quality of structural embeddings and augments the alignment results by iteratively updating the pseudo-labeled data with results from the previous round, which also leads to increasingly better alignment.

3.1 Side Information

There is abundant side information in KGs, such as attributes, descriptions and classes. In this work, we use a particular form of attribute, the entity name, as it exists in the majority of KGs. To make the most of the entity name information, inspired by [12], we exploit it at the semantic level and the string level and generate a textual distance matrix between the entities in the two KGs.

More specifically, we use the averaged word embeddings to represent the semantic meanings of entity names. Given the semantic embeddings of a source and a target entity, we obtain the semantic distance score by subtracting their cosine similarity score from 1. We denote the semantic distance matrix between the entities in two KGs as $\mathbf{M}^n$, where rows represent source entities, columns denote target entities, and each element in the matrix denotes the distance score between a pair of source and target entities. As for the string-level feature, we adopt the Levenshtein distance [38] to measure the difference between two sequences. We denote the string distance matrix as $\mathbf{M}^l$.

To obtain a more comprehensive view for alignment, we combine these two distance matrices and generate the textual distance matrix as $\mathbf{M}^t=\alpha\mathbf{M}^n+(1-\alpha)\mathbf{M}^l$, where $\alpha$ is a hyper-parameter that balances the weights. Then, we forward the textual distance matrix $\mathbf{M}^t$ into the unmatchable entity prediction module to produce alignment results, which are considered as the pseudo-labeled data for training KG structural embeddings. The details are introduced in the next subsection.
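As an illustration, the sketch below computes $\mathbf{M}^n$, $\mathbf{M}^l$, and $\mathbf{M}^t$ under mild assumptions: `emb` is an assumed helper returning the averaged word embedding (e.g., fastText) of an entity name, and the Levenshtein distance is normalized by the longer name so that both matrices lie in $[0,1]$.

```python
import numpy as np

def levenshtein(s, t):
    # Classic dynamic-programming edit distance between two strings.
    prev_row = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        row = [i]
        for j, ct in enumerate(t, 1):
            row.append(min(prev_row[j] + 1,                  # deletion
                           row[j - 1] + 1,                   # insertion
                           prev_row[j - 1] + (cs != ct)))    # substitution
        prev_row = row
    return prev_row[-1]

def textual_distance(src_names, tgt_names, emb, alpha=0.5):
    """M^t = alpha * M^n + (1 - alpha) * M^l.

    emb(name) is an assumed helper that returns the averaged word
    embedding (e.g., fastText) of an entity name.
    """
    S = np.stack([emb(n) for n in src_names])
    T = np.stack([emb(n) for n in tgt_names])
    S /= np.linalg.norm(S, axis=1, keepdims=True)
    T /= np.linalg.norm(T, axis=1, keepdims=True)
    Mn = 1.0 - S @ T.T                          # semantic distance matrix
    Ml = np.array([[levenshtein(a, b) / max(len(a), len(b), 1)
                    for b in tgt_names] for a in src_names])
    return alpha * Mn + (1.0 - alpha) * Ml      # textual distance matrix
```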

Remark.    The goal of this step is to exploit available side information to generate useful features for alignment. Other types of side information, e.g., attributes and entity descriptions, can also be leveraged. Besides, more advanced textual encoders, such as misspelling oblivious word embeddings [39] and convolutional embedding for edit distance [40], can be utilized. We will investigate them in the future.

3.2 Unmatchable Entity Prediction

State-of-the-art EA solutions generate for each source entity a corresponding target entity and fail to consider the potential unmatchable issue. Nevertheless, as mentioned in [19], in real-life settings, KGs contain entities that other KGs do not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of entities in YAGO 4 are related to movies, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB. These unmatchable entities would increase the difficulty of EA. Therefore, in this work, we devise an unmatchable entity prediction module to predict the unmatchable entities and filter them out from the alignment results.

More specifically, we put forward a novel strategy, i.e., thresholded bi-directional nearest neighbor search (TBNNS), to generate the alignment results, and the resulting unaligned entities are predicted to be unmatchable. As can be observed from Algorithm 1, given a source entity $u$ and a target entity $v$, if $u$ and $v$ are the nearest neighbors of each other, and the distance between them is below a given threshold $\theta$, we consider $(u,v)$ an aligned entity pair. Note that $\mathbf{M}(u,v)$ represents the element in the $u$-th row and $v$-th column of the distance matrix $\mathbf{M}$.

Input: $G_1$ and $G_2$: the two KGs to be aligned; $E_1$ and $E_2$: the entity sets in $G_1$ and $G_2$; $\theta$: a given threshold; $\mathbf{M}$: a distance matrix.
Output: $S$: alignment results.
1  foreach $u \in E_1$ do
2      $v \leftarrow \arg\min_{\hat{v}\in E_2} \mathbf{M}(u,\hat{v})$;
3      if $\arg\min_{\hat{u}\in E_1} \mathbf{M}(v,\hat{u}) = u$ and $\mathbf{M}(u,v) < \theta$ then
4          $S \leftarrow S + \{(u,v)\}$;
5  return $S$.
Algorithm 1: TBNNS in the unmatchable entity prediction module

The TBNNS strategy exerts strong constraints on alignment, since it requires that the matched entities prefer each other the most and that the distance between their embeddings be below a certain value. Therefore, it can effectively predict unmatchable entities and prevent them from being aligned. Notably, the threshold $\theta$ plays a significant role in this strategy. A larger threshold leads to more matches, but it also increases the risk of including erroneous matches or unmatchable entities. In contrast, a small threshold yields only a few aligned entity pairs, almost all of which are correct. This is further discussed and verified in Section 4.4. Therefore, our progressive learning framework dynamically adjusts the threshold value to produce more accurate alignment results (to be discussed in the next subsection).
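For reference, a minimal NumPy rendering of Algorithm 1 is given below; it operates on a precomputed distance matrix and leaves the threshold to the caller.

```python
import numpy as np

def tbnns(M, theta):
    """Thresholded bi-directional nearest neighbor search (Algorithm 1).

    M: |E1| x |E2| distance matrix; theta: distance threshold.
    Returns aligned (u, v) index pairs; source entities that end up
    unaligned are predicted to be unmatchable.
    """
    pairs = []
    src_nn = M.argmin(axis=0)          # nearest source of each target
    for u in range(M.shape[0]):
        v = int(M[u].argmin())         # nearest target of u
        if src_nn[v] == u and M[u, v] < theta:
            pairs.append((u, v))
    return pairs
```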

3.3 The Progressive Learning Framework

To exploit the rich structural patterns in KGs that could provide useful signals for alignment, we design a progressive learning framework to combine structural and textual features for alignment and improve the quality of both structural embeddings and alignment results in a self-training fashion.

Structural information.    As mentioned above, we forward the textual distance matrix $\mathbf{M}^t$ generated by using the side information to the unmatchable entity prediction module to produce the preliminary alignment results, which are considered as pseudo-labeled data for learning unified KG embeddings. Concretely, following [25], we adopt GCN to capture the neighboring information of entities. (More advanced structural learning models, such as recurrent skipping networks [20], could also be used here; we will explore these alternatives in the future.) We omit the implementation details, which can be found in [25], since they are not the focus of this paper.

Given the learned structural embedding matrix $\mathbf{Z}$, we calculate the structural distance score between a source and a target entity by subtracting the cosine similarity score between their embeddings from 1. We denote the resultant structural distance matrix as $\mathbf{M}^s$. Then, we combine the textual and structural information to generate more accurate signals for alignment: $\mathbf{M}=\beta\mathbf{M}^t+(1-\beta)\mathbf{M}^s$, where $\beta$ is a hyper-parameter that balances the weights. The fused distance matrix $\mathbf{M}$ can be used to generate more accurate matches.
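The fusion step can be sketched as follows (continuing the NumPy conventions above; `Z1` and `Z2` denote the learned structural embeddings of the source and target entities):

```python
import numpy as np

def fused_distance(Z1, Z2, Mt, beta=0.5):
    """M = beta * M^t + (1 - beta) * M^s.

    Z1, Z2: structural embeddings of source/target entities;
    Mt: textual distance matrix from the side information.
    """
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    Ms = 1.0 - Z1 @ Z2.T      # structural distance from cosine similarity
    return beta * Mt + (1.0 - beta) * Ms
```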

The progressive learning algorithm.    The amount of training data has an impact on the quality of the unified KG embeddings, which in turn affects the alignment performance [10, 41]. Thus, we devise an algorithm (Algorithm 2) to progressively augment the pseudo training data, so as to improve the quality of KG embeddings and enhance the alignment performance. The algorithm starts with learning unified structural embeddings and generating the fused distance matrix $\mathbf{M}$ by using the preliminary pseudo-labeled data $S_0$ (line 1). Then, the fused distance matrix is used to produce the new alignment results $\Delta S$ using TBNNS (line 2). These newly generated entity pairs $\Delta S$ are added to the alignment results (which are considered as pseudo-labeled data for the next round), and the entities in the alignment results $S$ are removed from the entity sets (lines 3-6). In order to progressively improve the quality of KG embeddings and detect more alignment results, we perform the aforementioned process recursively until the number of newly generated entity pairs falls below a given threshold $\gamma$ (lines 7-13).

Input: $G_1$ and $G_2$: the two KGs to be aligned; $E_1$ and $E_2$: the entity sets in $G_1$ and $G_2$; $\mathbf{M}^t$: textual distance matrix; $S_0$: preliminary labeled data; $\theta_0$: the initial threshold.
Output: $S$: alignment results.
1   Use $S_0$ to learn structural embeddings, generate $\mathbf{M}^s$ and $\mathbf{M}$;
2   $\Delta S \leftarrow$ TBNNS($G_1$, $G_2$, $E_1$, $E_2$, $\theta_0$, $\mathbf{M}$);
3   $S \leftarrow S_0 + \Delta S$;
4   $\theta \leftarrow \theta_0 + \eta$;
5   $E_1 \leftarrow \{e\,|\,e\in E_1, e\notin S\}$;
6   $E_2 \leftarrow \{e\,|\,e\in E_2, e\notin S\}$;
7   while the number of newly generated alignment results is above $\gamma$ do
8       Use $S$ to learn structural embeddings, generate $\mathbf{M}^s$ and $\mathbf{M}$;
9       $\Delta S \leftarrow$ TBNNS($G_1$, $G_2$, $E_1$, $E_2$, $\theta$, $\mathbf{M}$);
10      $S \leftarrow S + \Delta S$;
11      $\theta \leftarrow \theta + \eta$;
12      $E_1 \leftarrow \{e\,|\,e\in E_1, e\notin S\}$;
13      $E_2 \leftarrow \{e\,|\,e\in E_2, e\notin S\}$;
14  return $S$.
Algorithm 2: Progressive learning.

Notably, in the learning process, once a pair of entities is considered a match, the entities are removed from the entity sets (lines 5-6 and 12-13). This gradually reduces the alignment search space and lowers the difficulty of aligning the remaining entities. Obviously, this strategy suffers from the error propagation issue, which, however, can be effectively mitigated by the progressive learning process that dynamically adjusts the threshold. We will verify the effectiveness of this setting in Section 4.3.
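Putting the pieces together, the loop below is a compact sketch of Algorithm 2 that reuses `fused_distance` and `tbnns` from above. `train_embeddings` is an assumed stand-in for GCN training on the current pseudo seeds, the default hyper-parameter values follow Section 4.1, and entity removal (lines 5-6 and 12-13) is emulated by masking the rows and columns of aligned entities with infinity.

```python
import numpy as np

def progressive_learning(Mt, S0, train_embeddings, theta0=0.05,
                         eta=0.1, gamma=30, theta_cap=0.45, beta=0.5):
    """Sketch of Algorithm 2.

    Mt: textual distance matrix; S0: preliminary pseudo-labeled pairs;
    train_embeddings(S): assumed helper that trains the GCN on the
    pseudo seeds S and returns embedding matrices (Z1, Z2).
    """
    S, theta = list(S0), theta0
    while True:
        Z1, Z2 = train_embeddings(S)
        M = fused_distance(Z1, Z2, Mt, beta)
        # Exclude already-aligned entities from the search space
        # (lines 5-6 and 12-13 of Algorithm 2).
        for u, v in S:
            M[u, :], M[:, v] = np.inf, np.inf
        delta = tbnns(M, theta)
        S += delta
        theta = min(theta + eta, theta_cap)   # dynamic threshold adjustment
        if len(delta) <= gamma:               # termination condition (line 7)
            return S
```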

Dynamic threshold adjustment.    It can be observed from Algorithm 2 that the matches generated by the unmatchable entity prediction module are not only part of the eventual alignment results, but also the pseudo training data for learning subsequent structural embeddings. Therefore, to enhance the overall alignment performance, the alignment results generated in each round should, ideally, have both large quantity and high quality. Unfortunately, these two goals cannot be achieved at the same time. This is because, as stated in Section 3.2, a larger threshold in TBNNS can generate more alignment results (large quantity), whereas some of them might be erroneous (low quality). These wrongly aligned entity pairs can cause the error propagation problem and result in more erroneous matches in the following rounds. In contrast, a smaller threshold leads to fewer alignment results (small quantity), while almost all of them are correct (high quality).

To address this issue, we aim to balance the quantity and the quality of the matches generated in each round. An intuitive idea is to set the threshold to a moderate value. However, this fails to take into account the characteristics of the progressive learning process. That is, in the beginning, the quality of the matches should be prioritized, as these alignment results will have a long-term impact on the subsequent rounds. In comparison, in the later stages where most of the entities have been aligned, the quantity becomes more important, as we need to include more possible matches that might not have a small distance score. In this connection, we set the initial threshold $\theta_0$ to a very small value so as to reduce potential errors. Then, in the following rounds, we gradually increase the threshold by $\eta$, so that more possible matches can be detected. We will empirically validate the superiority of this strategy over a fixed threshold in Section 4.3.

Remark.    As mentioned in the related work, there are some existing EA approaches that exploit the iterative learning (bootstrapping) strategy to improve EA performance. Particularly, BootEA calculates for each source entity the alignment likelihood to every target entity, and includes those with likelihood above a given threshold in a maximum likelihood matching process under the 1-to-1 mapping constraint, producing a solution containing confident EA pairs [22]. This strategy is also adopted by [15, 23]. Zhu et al. use a threshold to select the entity pairs with very close distances as the pseudo-labeled data [21]. DAT employs a bi-directional margin-based constraint to select the confident EA pairs as labels [24]. Our progressive learning strategy differs from these existing solutions in three aspects: (1) we exclude the entities in the confident EA pairs from the test sets; (2) we use the dynamic threshold adjustment strategy to control the pace of the learning process; and (3) our strategy can deal with unmatchable entities. The superiority of our strategy is validated in Section 4.3.

4 Experiment

This section reports the experiment results with in-depth analysis. The source code is available at https://github.com/DexterZeng/UEA.

4.1 Experiment Settings

Datasets.    Following existing works, we adopt the DBP15K dataset [10] for evaluation. This dataset consists of three multilingual KG pairs extracted from DBpedia. Each KG pair contains 15 thousand inter-language links as gold standards. The statistics can be found in Table 1. We note that state-of-the-art studies merely consider the labeled entities and divide them into training and testing sets. Nevertheless, as can be observed from Table 1, there exist unlabeled entities, e.g., 4,388 and 4,572 entities in the Chinese and English KGs of DBP15K_ZH-EN, respectively. In this connection, we adapt the dataset by including the unmatchable entities. Specifically, for each KG pair, we keep 30% of the labeled entity pairs as the training set (for training the supervised or semi-supervised methods). Then, to construct the test set, we include the rest of the entities in the first KG and the rest of the labeled entities in the second KG, so that the unlabeled entities in the first KG become unmatchable. The statistics of the test sets can be found in the #Test set column of Table 1.

Table 1: The statistics of the evaluation benchmarks.

Dataset        KG pair             #Triples  #Entities  #Labeled Ents  #Relations  #Test set
DBP15K_ZH-EN   DBpedia (Chinese)   70,414    19,388     15,000         1,701       14,888
               DBpedia (English)   95,142    19,572     15,000         1,323       10,500
DBP15K_JA-EN   DBpedia (Japanese)  77,214    19,814     15,000         1,299       15,314
               DBpedia (English)   93,484    19,780     15,000         1,153       10,500
DBP15K_FR-EN   DBpedia (French)    105,998   19,661     15,000         903         15,161
               DBpedia (English)   115,722   19,993     15,000         1,208       10,500

Parameter settings.    For the side information module, we utilize the fastText embeddings [42] as word embeddings. To deal with cross-lingual KG pairs, following [32], we use Google Translate to translate the entity names from one language to another, i.e., translating Chinese, Japanese and French to English. $\alpha$ is set to 0.5. For the structural information learning, we set $\beta$ to 0.5. Notably, since there is no training or validation set for parameter tuning, we set $\alpha$ and $\beta$ to the default value (0.5); we further verify in Section 4.4 that these hyper-parameters do not have a large influence on the final results. For progressive learning, we set the initial threshold $\theta_0$ to 0.05, the incremental parameter $\eta$ to 0.1, and the termination threshold $\gamma$ to 30. Note that if the threshold $\theta$ exceeds 0.45, we reset it to 0.45.

Evaluation metrics.    We use precision (P), recall (R) and F1 score as evaluation metrics. The precision is computed as the number of correct matches divided by the number of matches found by a method. The recall is computed as the number of correct matches found by a method divided by the number of gold matches. The F1 score is the harmonic mean between precision and recall.
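For concreteness, these metrics can be computed as follows (a small helper assuming matches are represented as (source, target) pairs):

```python
def evaluate(predicted, gold):
    """Precision, recall, and F1 as defined above.

    predicted: matches found by a method; gold: gold-standard matches.
    Both are collections of (source, target) pairs.
    """
    correct = len(set(predicted) & set(gold))
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```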

Competitors.    We select the most performant state-of-the-art solutions for comparison. Within the group that solely utilizes structural information, we compare with BootEA [22], TransEdge [15], MRAEA [41] and SSP [43]. Among the methods incorporating other sources of information, we compare with GCN-Align [25], HMAN [16], HGCN [11], RE-GCN [44], DAT [24] and RREA [45]. We also include the unsupervised approaches, i.e., IMUSE [36] and PARIS [37]. To make a fair comparison, we only use entity name labels as the side information.

Table 2: Alignment results.

              ZH-EN                JA-EN                FR-EN
Method        P      R      F1     P      R      F1     P      R      F1
BootEA        0.444  0.629  0.520  0.426  0.622  0.506  0.452  0.653  0.534
TransEdge     0.518  0.735  0.608  0.493  0.719  0.585  0.492  0.710  0.581
MRAEA         0.534  0.757  0.626  0.520  0.758  0.617  0.540  0.780  0.638
SSP           0.521  0.739  0.611  0.494  0.721  0.587  0.512  0.739  0.605
GCN-Align     0.291  0.413  0.342  0.274  0.399  0.325  0.258  0.373  0.305
HMAN          0.614  0.871  0.720  0.641  0.935  0.761  0.674  0.973  0.796
HGCN          0.508  0.720  0.596  0.525  0.766  0.623  0.618  0.892  0.730
RE-GCN        0.518  0.735  0.608  0.548  0.799  0.650  0.646  0.933  0.764
DAT           0.556  0.788  0.652  0.573  0.835  0.679  0.639  0.922  0.755
RREA          0.580  0.822  0.680  0.629  0.918  0.747  0.667  0.963  0.788
IMUSE         0.608  0.862  0.713  0.625  0.911  0.741  0.618  0.892  0.730
PARIS         0.976  0.777  0.865  0.981  0.785  0.872  0.972  0.793  0.873
UEA           0.913  0.902  0.907  0.940  0.932  0.936  0.953  0.950  0.951

4.2 Results

Table 2 reports the alignment results, which show that state-of-the-art supervised or semi-supervised methods have rather low precision values. This is because these approaches cannot predict the unmatchable source entities and instead generate a target entity for every source entity (including the unmatchable ones). Particularly, methods incorporating additional information attain relatively better performance than the methods in the first group, demonstrating the benefit of leveraging such additional information.

Regarding the unsupervised methods, although IMUSE cannot deal with the unmatchable entities and achieves a low precision score, it outperforms most of the supervised or semi-supervised methods in terms of recall and F1 score. This indicates that, for the EA task, the KG side information is useful for mitigating the reliance on labeled data. In contrast to the abovementioned methods, PARIS attains very high precision, since it only generates matches that it believes to be highly probable, which can effectively filter out the unmatchable entities. It also achieves the second best F1 score among all approaches, showcasing its effectiveness when unmatchable entities are involved. Our proposal, UEA, achieves the best balance between precision and recall and attains the best F1 score, outperforming the second-best method by a large margin and validating its effectiveness. Notably, although UEA does not require labeled data, it achieves even better performance than the most performant supervised method HMAN (except for the recall values on DBP15K_JA-EN and DBP15K_FR-EN).

4.3 Ablation Study

In this subsection, we examine the usefulness of the proposed modules by conducting an ablation study. More specifically, in Table 3, we report the results of UEA w/o Unm, which excludes the unmatchable entity prediction module, and UEA w/o Prg, which excludes the progressive learning process. The table shows that removing the unmatchable entity prediction module (UEA w/o Unm) brings down the performance on all metrics and datasets, validating its effectiveness in detecting the unmatchable entities and enhancing the overall alignment performance. Besides, without the progressive learning (UEA w/o Prg), the precision increases, while the recall and F1 score values drop significantly. This shows that the progressive learning framework can discover more correctly aligned entity pairs and is crucial to the alignment process.

To provide insights into the progressive learning framework, we report the results of UEA w/o Adj, which does not adjust the threshold, and UEA w/o Excl, which does not exclude the entities in the alignment results from the entity sets during progressive learning. Table 3 shows that setting the threshold to a fixed value (UEA w/o Adj) leads to worse F1 results, verifying that the progressive learning process depends on the choice of the threshold and the quality of the intermediate alignment results. We will further discuss the setting of the threshold in the next subsection. Besides, the performance also decreases if we do not exclude the matched entities from the entity sets (UEA w/o Excl), validating that this strategy indeed reduces the difficulty of aligning entities.

Moreover, we replace our progressive learning framework with other state-of-the-art iterative learning strategies (i.e., MWGM [22], TH [21] and DAT-I [24]) and report the results in Table 3. It shows that using our progressive learning framework (UEA) can attain the best F1 score, verifying its superiority.

Table 3: Ablation results.

                      ZH-EN                JA-EN                FR-EN
Method                P      R      F1     P      R      F1     P      R      F1
UEA                   0.913  0.902  0.907  0.940  0.932  0.936  0.953  0.950  0.951
w/o Unm               0.553  0.784  0.648  0.578  0.843  0.686  0.603  0.871  0.713
w/o Prg               0.942  0.674  0.786  0.966  0.764  0.853  0.972  0.804  0.880
w/o Adj               0.889  0.873  0.881  0.927  0.915  0.921  0.941  0.936  0.939
w/o Excl              0.974  0.799  0.878  0.982  0.862  0.918  0.985  0.887  0.933
MWGM                  0.930  0.789  0.853  0.954  0.858  0.903  0.959  0.909  0.934
TH                    0.743  0.914  0.820  0.795  0.942  0.862  0.807  0.953  0.874
DAT-I                 0.974  0.805  0.881  0.985  0.866  0.922  0.988  0.875  0.928
UEA-$\mathbf{M}^l$    0.908  0.902  0.905  0.926  0.924  0.925  0.937  0.931  0.934
$\mathbf{M}^l$        0.935  0.721  0.814  0.960  0.803  0.875  0.948  0.750  0.838
UEA-$\mathbf{M}^n$    0.758  0.727  0.742  0.840  0.807  0.823  0.906  0.899  0.903
$\mathbf{M}^n$        0.891  0.497  0.638  0.918  0.562  0.697  0.959  0.752  0.843

4.4 Quantitative Analysis

In this subsection, we perform quantitative analysis of the modules in UEA.

Figure 3: Alignment results given different threshold values. Correct-$\theta$ refers to the number of correct matches generated by the progressive learning framework at each round given the threshold value $\theta$. Wrong refers to the number of erroneous matches generated in each round.

The threshold $\theta$ in TBNNS.    We discuss the setting of $\theta$ to reveal the trade-off between the risk and the gain of generating alignment results during progressive learning. Identifying a match leads to the integration of additional structural information, which benefits the subsequent learning. However, for the same reason, the identification of a false positive, i.e., an incorrect match, potentially leads to mistakenly modifying the connections between KGs, with the risk of amplifying the error in successive rounds. As shown in Figure 3, a smaller $\theta$ (e.g., 0.05) brings low risk and low gain; that is, it merely generates a small number of matches, almost all of which are correct. In contrast, a higher $\theta$ (e.g., 0.45) increases the risk and brings relatively higher gain; that is, it results in many more aligned entity pairs, while a certain portion of them are erroneous. Additionally, using a higher threshold leads to increasingly more alignment results, while for a lower threshold, the progressive learning process barely increases the number of matches. This is consistent with our analysis in Section 3.2.

Unmatchable entity prediction.    Zhao et al. [19] propose an intuitive strategy (U-TH) to predict unmatchable entities. They set an NIL threshold, and if the distance between a source entity and its closest target entity is above this threshold, they consider the source entity unmatchable. We compare our unmatchable entity prediction strategy with U-TH in terms of the F1 score and the percentage of unmatchable entities included in the final alignment results. On DBP15K_ZH-EN, replacing our unmatchable entity prediction strategy with U-TH attains an F1 score of 0.837, which is 8.4% lower than that of UEA. Besides, in the alignment results generated by U-TH, 18.9% of the entities are unmatchable, while this figure for UEA is merely 3.9%. This demonstrates the superiority of our unmatchable entity prediction strategy.
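For contrast with TBNNS, the following is a hedged sketch of the U-TH baseline as we read it from [19]: a one-directional nearest neighbor search with a NIL threshold and no mutual-preference check.

```python
import numpy as np

def u_th(M, nil_threshold):
    """Sketch of the U-TH baseline [19]: align each source entity to
    its closest target unless the distance exceeds the NIL threshold,
    in which case the source entity is predicted unmatchable.
    Unlike TBNNS, there is no bi-directional nearest neighbor check.
    """
    pairs = []
    for u in range(M.shape[0]):
        v = int(M[u].argmin())
        if M[u, v] <= nil_threshold:
            pairs.append((u, v))
    return pairs
```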

Figure 4: The F1 scores obtained by setting $\alpha$ and $\beta$ to different values.

Influence of parameters.    As mentioned in Section 4.1, we set $\alpha$ and $\beta$ to 0.5 since there are no training/validation data. Here, we aim to show that different values of these parameters do not have a large influence on the final results. More specifically, we keep $\alpha$ at 0.5 and choose $\beta$ from [0.3, 0.4, 0.5, 0.6, 0.7]; then we keep $\beta$ at 0.5 and choose $\alpha$ from [0.3, 0.4, 0.5, 0.6, 0.7]. It can be observed from Figure 4 that, although smaller $\alpha$ and $\beta$ lead to slightly better results, the performance does not change significantly.

Influence of input side information.    We adopt different side information as input to examine the performance of UEA. More specifically, we report the results of UEA-$\mathbf{M}^l$, which merely uses the string-level feature of entity names as input, and UEA-$\mathbf{M}^n$, which only uses the semantic embeddings of entity names as input. We also provide the results of $\mathbf{M}^l$ and $\mathbf{M}^n$, which use the string-level and semantic information, respectively, to directly generate alignment results (without progressive learning).

As shown in Table 3, the performance of solely using the input side information is not very promising ($\mathbf{M}^l$ and $\mathbf{M}^n$). Nevertheless, by forwarding the side information into our model, the results of UEA-$\mathbf{M}^l$ and UEA-$\mathbf{M}^n$ become much better. This reveals that UEA can work with different types of side information and consistently improve the alignment results. Additionally, comparing UEA-$\mathbf{M}^l$ with UEA-$\mathbf{M}^n$ makes it evident that the input side information does affect the final results, and the quality of the side information is significant to the overall alignment performance.

Pseudo-labeled data.    We further examine the usefulness of the preliminary alignment results generated by the side information, i.e., the pseudo-labeled data. Concretely, we replace the training data in HGCN with these pseudo-labeled data, resulting in HGCN-U, and then compare its alignment results with the original performance. Regarding the F1 score, HGCN-U is 4% lower than HGCN on DBP15K_ZH-EN, 2.9% lower on DBP15K_JA-EN, and 2.8% lower on DBP15K_FR-EN. The minor differences validate the effectiveness of the pseudo-labeled data generated by the side information. This also demonstrates that the strategy can be applied to other supervised or semi-supervised frameworks to reduce their reliance on labeled data.

5 Conclusion

In this article, we propose an unsupervised EA solution that is capable of dealing with unmatchable entities. We first exploit the side information of KGs to generate preliminary alignment results, which are considered as pseudo-labeled data and forwarded to the progressive learning framework to produce better KG embeddings and alignment results in a self-training fashion. We also devise an unmatchable entity prediction module to detect the unmatchable entities. The experimental results validate the usefulness of our proposed model and its superiority over state-of-the-art approaches.

Acknowledgments.    This work was partially supported by the Ministry of Science and Technology of China under grant No. 2020AAA0108800, NSFC under grants Nos. 61872446 and 71971212, the NSF of Hunan Province under grant No. 2019JJ20024, and the Postgraduate Scientific Research Innovation Project of Hunan Province under grant No. CX20190033.

References

  • [1] Yanchao Hao, Yuanzhe Zhang, Shizhu He, Kang Liu, and Jun Zhao. A joint embedding method for entity alignment of knowledge bases. In CCKS, pages 3–14, 2016.
  • [2] Xiaofei Shi and Yanghua Xiao. Modeling multi-mapping relations for precise cross-lingual entity alignment. In EMNLP, pages 813–822, 2019.
  • [3] Chengjiang Li, Yixin Cao, Lei Hou, Jiaxin Shi, Juanzi Li, and Tat-Seng Chua. Semi-supervised entity alignment via joint knowledge embedding model and cross-graph model. In EMNLP, pages 2723–2732, 2019.
  • [4] Zequn Sun, Chengming Wang, Wei Hu, Muhao Chen, Jian Dai, Wei Zhang, and Yuzhong Qu. Knowledge graph alignment network with gated multi-hop neighborhood aggregation. In AAAI, pages 222–229, 2020.
  • [5] Kun Xu, Linfeng Song, Yansong Feng, Yan Song, and Dong Yu. Coordinated reasoning for cross-lingual knowledge graph alignment. In AAAI, pages 9354–9361, 2020.
  • [6] Jia Chen, Binbin Gu, Zhixu Li, Pengpeng Zhao, An Liu, and Lei Zhao. SAEA: self-attentive heterogeneous sequence learning model for entity alignment. In DASFAA, pages 452–467, 2020.
  • [7] Yuting Wu, Xiao Liu, Yansong Feng, Zheng Wang, and Dongyan Zhao. Neighborhood matching network for entity alignment. In ACL, pages 6477–6487, 2020.
  • [8] Zequn Sun, Qingheng Zhang, Wei Hu, Chengming Wang, Muhao Chen, Farahnaz Akrami, and Chengkai Li. A benchmarking study of embedding-based entity alignment for knowledge graphs. Proc. VLDB Endow., 13(11):2326–2340, 2020.
  • [9] Muhao Chen, Yingtao Tian, Mohan Yang, and Carlo Zaniolo. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In IJCAI, pages 1511–1517, 2017.
  • [10] Zequn Sun, Wei Hu, and Chengkai Li. Cross-lingual entity alignment via joint attribute-preserving embedding. In ISWC, pages 628–644, 2017.
  • [11] Yuting Wu, Xiao Liu, Yansong Feng, Zheng Wang, and Dongyan Zhao. Jointly learning entity and relation representations for entity alignment. In EMNLP, pages 240–249, 2019.
  • [12] Weixin Zeng, Xiang Zhao, Jiuyang Tang, and Xuemin Lin. Collective entity alignment via adaptive features. In ICDE, pages 1870–1873, 2020.
  • [13] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
  • [14] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
  • [15] Zequn Sun, JiaCheng Huang, Wei Hu, Muhao Chen, Lingbing Guo, and Yuzhong Qu. Transedge: Translating relation-contextualized embeddings for knowledge graphs. In ISWC, pages 612–629, 2019.
  • [16] Hsiu-Wei Yang, Yanyan Zou, Peng Shi, Wei Lu, Jimmy Lin, and Xu Sun. Aligning cross-lingual entities with multi-aspect information. In EMNLP, pages 4430–4440, 2019.
  • [17] Yixin Cao, Zhiyuan Liu, Chengjiang Li, Zhiyuan Liu, Juanzi Li, and Tat-Seng Chua. Multi-channel graph neural network for entity alignment. In ACL, pages 1452–1461, 2019.
  • [18] Sven Hertling and Heiko Paulheim. The knowledge graph track at OAEI - gold standards, baselines, and the golden hammer bias. In ESWC, volume 12123, pages 343–359, 2020.
  • [19] Xiang Zhao, Weixin Zeng, Jiuyang Tang, Wei Wang, and Fabian Suchanek. An experimental study of state-of-the-art entity alignment approaches. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2020.
  • [20] Lingbing Guo, Zequn Sun, and Wei Hu. Learning to exploit long-term relational dependencies in knowledge graphs. In ICML, pages 2505–2514, 2019.
  • [21] Hao Zhu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Iterative entity alignment via joint knowledge embeddings. In IJCAI, pages 4258–4264, 2017.
  • [22] Zequn Sun, Wei Hu, Qingheng Zhang, and Yuzhong Qu. Bootstrapping entity alignment with knowledge graph embedding. In IJCAI, pages 4396–4402, 2018.
  • [23] Qiannan Zhu, Xiaofei Zhou, Jia Wu, Jianlong Tan, and Li Guo. Neighborhood-aware attentional representation for multilingual knowledge graphs. In IJCAI, pages 1943–1949, 2019.
  • [24] Weixin Zeng, Xiang Zhao, Wei Wang, Jiuyang Tang, and Zhen Tan. Degree-aware alignment for entities in tail. In SIGIR, pages 811–820, 2020.
  • [25] Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pages 349–357, 2018.
  • [26] Bayu Distiawan Trisedya, Jianzhong Qi, and Rui Zhang. Entity alignment between knowledge graphs using attribute embeddings. In AAAI, pages 297–304, 2019.
  • [27] Kai Yang, Shaoqin Liu, Junfeng Zhao, Yasha Wang, and Bing Xie. COTSAE: co-training of structure and attribute embeddings for entity alignment. In AAAI, pages 3025–3032, 2020.
  • [28] Bo Chen, Jing Zhang, Xiaobin Tang, Hong Chen, and Cuiping Li. Jarka: Modeling attribute interactions for cross-lingual knowledge alignment. In PAKDD, volume 12084, pages 845–856, 2020.
  • [29] Xiaobin Tang, Jing Zhang, Bo Chen, Yang Yang, Hong Chen, and Cuiping Li. BERT-INT: A bert-based interaction model for knowledge graph alignment. In IJCAI, pages 3174–3180, 2020.
  • [30] Muhao Chen, Yingtao Tian, Kai-Wei Chang, Steven Skiena, and Carlo Zaniolo. Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In IJCAI, pages 3998–4004, 2018.
  • [31] Kun Xu, Liwei Wang, Mo Yu, Yansong Feng, Yan Song, Zhiguo Wang, and Dong Yu. Cross-lingual knowledge graph alignment via graph matching neural network. In ACL, pages 3156–3161, 2019.
  • [32] Yuting Wu, Xiao Liu, Yansong Feng, Zheng Wang, Rui Yan, and Dongyan Zhao. Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI, pages 5278–5284, 2019.
  • [33] Matthias Fey, Jan Eric Lenssen, Christopher Morris, Jonathan Masci, and Nils M. Kriege. Deep graph matching consensus. In ICLR, 2020.
  • [34] Weixin Zeng, Xiang Zhao, Jiuyang Tang, Xuemin Lin, and Paul Groth. Reinforcement learning based collective entity alignment with adaptive features. ACM Transactions on Information Systems, To appear, 2021.
  • [35] Meng Qu, Jian Tang, and Yoshua Bengio. Weakly-supervised knowledge graph alignment with adversarial learning. CoRR, abs/1907.03179, 2019.
  • [36] Fuzhen He, Zhixu Li, Qiang Yang, An Liu, Guanfeng Liu, Pengpeng Zhao, Lei Zhao, Min Zhang, and Zhigang Chen. Unsupervised entity alignment using attribute triples and relation triples. In DASFAA, pages 367–382, 2019.
  • [37] Fabian M. Suchanek, Serge Abiteboul, and Pierre Senellart. PARIS: probabilistic alignment of relations, instances, and schema. PVLDB, 5(3):157–168, 2011.
  • [38] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710, 1966.
  • [39] Bora Edizel, Aleksandra Piktus, Piotr Bojanowski, Rui Ferreira, Edouard Grave, and Fabrizio Silvestri. Misspelling oblivious word embeddings. In NAACL-HLT, pages 3226–3234, 2019.
  • [40] Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, and James Cheng. Convolutional embedding for edit distance. In SIGIR, pages 599–608, 2020.
  • [41] Xin Mao, Wenting Wang, Huimin Xu, Man Lan, and Yuanbin Wu. MRAEA: an efficient and robust entity alignment approach for cross-lingual knowledge graph. In WSDM, pages 420–428, 2020.
  • [42] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
  • [43] Hao Nie, Xianpei Han, Le Sun, Chi Man Wong, Qiang Chen, Suhui Wu, and Wei Zhang. Global structure and local semantics-preserved embeddings for entity alignment. In IJCAI, pages 3658–3664, 2020.
  • [44] Jinzhu Yang, Wei Zhou, Lingwei Wei, Junyu Lin, Jizhong Han, and Songlin Hu. RE-GCN: relation enhanced graph convolutional network for entity alignment in heterogeneous knowledge graphs. In DASFAA, pages 432–447, 2020.
  • [45] Xin Mao, Wenting Wang, Huimin Xu, Yuanbin Wu, and Man Lan. Relational reflection entity alignment. In CIKM, pages 1095–1104, 2020.