Knowledge Graph Completion with Text-aided Regularization
Abstract
Knowledge graph completion is the task of expanding a knowledge graph/base by estimating which entities (proper nouns) can be connected through a set of predefined relations (verbs or predicates describing how two things are interconnected). Generally, we describe this problem as adding new edges to an existing network of vertices and edges. Traditional approaches mainly use the graphical information intrinsic to the graph and train embeddings that capture it; however, we believe that text corpora related to the entities also contain information that can positively influence the embeddings and lead to better predictions. In this project, we explore several ways of using extracted or raw textual information to help existing KG embedding frameworks reach better prediction results, by adding a similarity function to the regularization term of the loss function. Results show that we achieve decent improvements over baseline KG embedding methods.
1 Introduction
A knowledge graph (knowledge base) is a collection of structured tuples connecting entities via relations. Entities are generalized concepts or names that appear in the head/tail positions and form the nodes/vertices of the graph; relations are verbs or predicates describing how two entities are logically connected, forming the edges of the graph.
Formally, a knowledge graph can be represented as a directed multigraph, i.e., a graph that may have multiple edges between the same pair of vertices.
Knowledge graphs have many applications, such as structured search, question-answering systems, and recommendation systems. However, most knowledge graphs are large but far from complete. In this project, we aim to provide a modern approach to create or predict missing links in the graph, formally cast as link prediction: given two elements of a triple, predict the missing item.
Let us formally define our task. Let $\mathcal{E}$ be the set of entities and $\mathcal{R}$ be the set of relations. A KG is then defined as a collection of triple facts $(h, r, t)$ with $h, t \in \mathcal{E}$ and $r \in \mathcal{R}$, and the task of knowledge graph reasoning is to find the missing entry in a triple fact.
Previously, work on knowledge graph embedding and completion has mainly focused on various methods for capturing the intrinsic, latent semantics of the knowledge graph, which more or less rely on the existing items or structure of the graph itself. A further concern is that the information is then limited to the existing relations and semantics of the knowledge graph; so, to complete and expand a knowledge graph, it would be very helpful to introduce external information from the text corpora that a knowledge graph or even a commonsense graph relies on, be it general (like FreeBase or YAGO) or domain-specific (such as UMLS).
Many recent joint graph and text embedding methods focus on learning better knowledge graph embeddings for reasoning (Han et al. (2018)), but we aim for better graph embeddings in a more language-oriented sense. In this research, we propose a general framework that uses regularization techniques to train entity embeddings capturing nuances of relation and connection between entities that may not lie in the knowledge graph structure itself, but rather on the text corpus side. By making the text signal a constituent part of the regularization, our framework stays compatible with most commonly available knowledge graph embedding models and has the potential to migrate to the state-of-the-art models of the time.
2 Related Work
2.1 Knowledge Graph Reasoning
Given a knowledge graph, the task of knowledge graph reasoning includes predicting missing links between entities, predicting missing entities, and predicting whether a graph triple is true or false. A variety of methods have been applied to this task, and recently embedding-based methods have gained popularity and yielded promising results, such as linear models in Bordes et al. (2011), bilinear and complex-valued factorization models in Yang et al. (2014) and Trouillon et al. (2016), and convolutional neural networks in Dettmers et al. (2018). In spite of promising results, these models have limited interpretability. Reinforcement learning models that exploit a policy network have better interpretability, such as MINERVA in Das et al. (2017), DeepPath in Xiong et al. (2017), and Multi-hop in Lin et al. (2018).
2.2 Open-world Knowledge Graph Completion.
Using only the information inside a static, structured knowledge graph limits our ability to learn representations for entities and relations. Therefore, Shi & Weninger (2018) introduced the open-world knowledge graph completion problem, which utilizes external information to connect unseen entities to the knowledge graph. There have been several attempts to utilize text corpora to improve knowledge graph embeddings, such as the mutual attention model of Han et al. (2018) and the latent relation language models of Hayashi et al. (2019). However, these approaches have the following limitations. First, the models focus on local information in the text corpus: each entity is trained only with its own description, without considering corpus-level statistics. Second, the models need to align two disjoint latent spaces, the knowledge graph space and the text corpus space. Third, the models tend to couple the knowledge graph and text components closely. Last but not least, the models typically provide no way of learning relation-specific representations of entities.
2.3 Text Embedding Models and Vector Sets.
Word embeddings have proved useful as standalone features in NLP tasks (Turian et al. (2010)). The key idea of word embedding is to use multi-dimensional vectors to represent the meanings of words. Influential models can be divided into two categories: count-based models and prediction-based models. A good example of a count-based model is GloVe (Pennington et al. (2014a)), a log-linear model trained on window-based local co-occurrence information about word pairs. Popular prediction-based models include continuous bag-of-words (CBOW) models, skip-gram models, and transformer-based models. Based on these models, pre-trained vector sets such as Word2Vec (Mikolov et al. (2013)), FastText (Mikolov et al. (2018)), and GloVe have been released and are ready to be used in NLP tasks.
In our model, word embeddings can be used both statically and dynamically. The static way is to initialize the entity embeddings with pretrained vectors from Pennington et al. (2014a) or Mikolov et al. (2013), and then train them with models such as Dettmers et al. (2018) and Bordes et al. (2011). The dynamic way is to combine the loss functions of the word embedding model and the knowledge graph model to train the entity and relation embeddings jointly.
3 Text-regularized KG Completion
In this research, we propose a general framework that uses regularization techniques to train entity embeddings capturing nuances of relation and connection between entities that may not lie in the knowledge graph structure itself, but rather on the text corpus side. Many recent joint graph and text embedding methods focus on learning better knowledge graph embeddings for reasoning (Han et al. (2018)), but we aim for better graph embeddings in a more language-oriented sense.
Existing joint training methods, such as Toutanova et al. (2015), Han et al. (2018), and Hayashi et al. (2019), share the following shortcomings:
• These methods typically focus only on a neighboring/localized space that is labeled as relevant to some given entity-relation-entity triples; their embeddings are trained only with the related sentence descriptions that are labeled in the corpus.
• They do not take corpus-level statistics and language-model regularities into account, for the same reason that they rely on labeled, partitioned datasets.
• They need to align two disjoint latent/embedding spaces (the KG embeddings and the text embeddings), which usually do not overlap much considering the word tokens available from both sides.
• Entity representations are typically generic and do not account for relation variations, so no way of learning relation-specific embeddings is offered.
Here we propose a method of text-enhanced knowledge graph embedding, which uses similarity functions as regularizers added to the training loss of the knowledge graph. The underlying motivation is that we typically find similar descriptions, and especially overlapping words, for two entities/concepts that fall into similar domains and the same categories of name types; at the same time, these entities also share similar characteristics in their connectivity, topology, and the types of relations linked to them in the knowledge graph. Thus, from the descriptions or contexts of two such concepts, we should be able to train similar embeddings for them.
For example, Microsoft and Amazon are both concept names that describe a company; their descriptions would cover the fact that they both have headquarters in the Seattle area, and that they both have a founder (Bill Gates and Jeff Bezos, respectively).
To learn corpus-regularized representations for the relations and entities, we need to build a loss function that properly trains the embeddings towards better predictability. We define a loss term on the text domain that captures the similarities between the relevant descriptions or contexts of entities $e_1$ and $e_2$, as shown in Figure 1. There is plenty of room for deciding which type of similarity function to use, such as word overlap on context, TF-IDF over word pairs, the dot product of existing trained embeddings, and so on.

Formally, we define a new loss function that minimizes losses from both the knowledge base side and the text side:

$$\mathcal{L} = \mathcal{L}_{\mathrm{KG}} + \lambda \, \mathcal{L}_{\mathrm{text}},$$

where $\mathcal{L}_{\mathrm{KG}}$ is the loss of the base knowledge graph embedding model, $\mathcal{L}_{\mathrm{text}}$ is a similarity-based regularization term computed from text, and $\lambda$ controls the strength of the regularization.
To exemplify one possible similarity function, we propose a more GloVe-like approach (Pennington et al. (2014a)), which makes entities with similar descriptions closer to each other in the embedding space.
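As a minimal sketch of how such a regularizer attaches to a base model, the following PyTorch snippet shows the generic form: a base KG loss plus a similarity-weighted term over sampled entity pairs. The names (`kg_loss`, `sim_matrix`, the pair-sampling scheme, and the weight `lam`) are illustrative assumptions, not the exact implementation.

```python
import torch

def total_loss(kg_loss, entity_emb, sim_matrix, pairs, lam=0.1):
    """Combine a base KG loss with a text-similarity regularizer.

    kg_loss    : scalar tensor, loss of the base model (e.g. the TransE margin loss)
    entity_emb : (num_entities, dim) entity embedding matrix
    sim_matrix : (num_entities, num_entities) precomputed text similarity scores
    pairs      : (batch, 2) long tensor of sampled entity index pairs
    lam        : weight of the regularization term
    """
    e_i = entity_emb[pairs[:, 0]]                 # (batch, dim)
    e_j = entity_emb[pairs[:, 1]]                 # (batch, dim)
    sim = sim_matrix[pairs[:, 0], pairs[:, 1]]    # (batch,)
    # pull entities with similar text descriptions closer together
    reg = (sim * (e_i - e_j).pow(2).sum(dim=1)).mean()
    return kg_loss + lam * reg
```

The concrete regularizers in Section 4 differ only in how the similarity scores (or the per-entity targets) are computed.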
4 Variants of Similarity functions
Here we describe the types of similarity functions we have tested. They include converging to generic word embeddings, similarity computed from entity-tagged newspaper text, and similarity computed from associated Wikipedia article text.
4.1 Variant A: Converging to existing embeddings
A quick way to start from existing extracted textual information is to converge towards existing word embeddings. Given the premise that we are trying to incorporate textual structure into entity embeddings, we can reasonably assume that word embeddings also contain information about connections between words.
Since the latent structures underlying a knowledge graph can be drastically different from those extracted from text (which focus more on predicting which words follow the current word), we prefer to minimize the difference between the embedding of an entity and the embedding obtained when the entity is treated as words. Because the distribution of latent features and the number of dimensions also differ between the two types of embeddings, we add a fully connected layer to extract and redistribute the features, making the two embeddings comparable.
The easiest and most prominent choice is to use cosine similarity to align the two:

$$\mathcal{L}_{\mathrm{text}} = \sum_{e \in \mathcal{E}} \left(1 - \cos\!\left(W \mathbf{v}_e, \mathbf{w}_e\right)\right),$$

where $\mathbf{v}_e$ is the entity embedding, $\mathbf{w}_e$ is the pretrained word embedding associated with entity $e$, and $W$ is the fully connected layer described above.
Another choice is the rank-based distance defined by Santus et al. (2018), which extracts the most prominent ranks (dimensions) of the two embeddings and compares the differences between them.
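A sketch of the cosine variant above, assuming a PyTorch setup in which `word_targets` holds one frozen pretrained vector per entity (e.g. averaged GloVe vectors); the class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineTextRegularizer(nn.Module):
    """Variant A sketch: pull entity embeddings toward pretrained word vectors."""

    def __init__(self, entity_dim, word_dim, word_targets):
        super().__init__()
        # fully connected layer that redistributes features so the two
        # embedding spaces become comparable
        self.proj = nn.Linear(entity_dim, word_dim)
        # frozen pretrained targets, one vector per entity (e.g. averaged GloVe)
        self.register_buffer("word_targets", word_targets)

    def forward(self, entity_emb, entity_ids):
        projected = self.proj(entity_emb[entity_ids])   # (batch, word_dim)
        targets = self.word_targets[entity_ids]         # (batch, word_dim)
        cos = F.cosine_similarity(projected, targets, dim=1)
        return (1.0 - cos).mean()                       # 0 when perfectly aligned
```

The returned value plays the role of $\mathcal{L}_{\mathrm{text}}$ and is added to the base TransE loss with a weight $\lambda$.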
4.2 Variant B: Using Entity-tagged Text
In Pennington et al. (2014a), word similarities are extracted from a corpus by constructing a co-occurrence matrix. Entity similarities can be calculated in the same manner: given a corpus, we construct a matrix $X$ where $X_{ij}$ is the number of times entity $j$ occurs in the context of entity $i$. We select the New York Times corpus as the source for extracting the co-occurrence matrix $X$. Summing over the entity pairs in the knowledge graph, we obtain our entity-tagged text regularization term:

$$\mathcal{L}_{\mathrm{text}} = \sum_{i,j=1}^{|\mathcal{E}|} f(X_{ij}) \left(\mathbf{v}_i^{\top} \tilde{\mathbf{v}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,$$

which follows the GloVe objective with entity embeddings $\mathbf{v}_i$, context embeddings $\tilde{\mathbf{v}}_j$, biases $b_i, \tilde{b}_j$, and the GloVe weighting function $f$.
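A sketch of how the entity co-occurrence counts $X$ could be collected from entity-tagged sentences. Using the sentence as the context window is our illustrative assumption; the actual window definition may differ.

```python
from collections import defaultdict
from itertools import combinations

def entity_cooccurrence(tagged_sentences):
    """Count how often two entities are tagged in the same sentence.

    tagged_sentences : iterable of lists of entity ids, one list per sentence
                       (output of an entity linker run over the NYT corpus)
    Returns a symmetric dict X with X[(i, j)] = co-occurrence count.
    """
    X = defaultdict(float)
    for entities in tagged_sentences:
        for i, j in combinations(sorted(set(entities)), 2):
            X[(i, j)] += 1.0
            X[(j, i)] += 1.0
    return X
```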
4.3 Variant C: Using Associated Wikipedia Text
As described above, similar entities should have similar text descriptions on Wikipedia and similar text structures. For example, Wikipedia articles about counties in the US all discuss their climate, economy, population, etc., and articles about actors all discuss their career works, life experiences, and so on. To exploit such similarity to regularize the entity embeddings, we measure two correlations between the Wikipedia documents of entities: first, a modified Relaxed Word Mover's Distance (Kusner et al. (2015)) between two entity documents, and second, the TF-IDF similarity between two entity documents.
4.3.1 Relaxed Word Mover Distance
In Kusner et al. (2015), the Relaxed Word Mover's Distance between two documents $d$ and $d'$ is calculated as

$$\mathrm{RWMD}(d, d') = \sum_{i} f_i \min_{j} c(i, j),$$

where $i$ and $j$ index the words of the two entity documents, $f_i$ is the normalized frequency of word $i$ in document $d$, and $c(i, j)$ is the distance between the pre-trained word embeddings of words $i$ and $j$. In our proposed method, to speed up the calculation, we use the dot product of two word embeddings to measure their similarity instead of the distance between them. Accordingly, we introduce the RWMD-Gain, the maximum gain of transforming the document of one entity into that of another:

$$\mathrm{Gain}(d \rightarrow d') = \sum_{i} f_i \max_{j} \mathbf{x}_i^{\top} \mathbf{x}_j,$$

where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the pre-trained embeddings of word $i$ in $d$ and word $j$ in $d'$.
We employ the bidirectional gain as the semantic similarity of two entity documents:

$$\mathrm{sim}(d, d') = \frac{1}{2}\left(\mathrm{Gain}(d \rightarrow d') + \mathrm{Gain}(d' \rightarrow d)\right).$$
Summing over all entity pairs in the knowledge graph, we obtain our RWMD regularization term:

$$\mathcal{L}_{\mathrm{text}} = \sum_{e_i, e_j \in \mathcal{E}} \mathrm{sim}(d_i, d_j)\, \lVert \mathbf{v}_i - \mathbf{v}_j \rVert_2^2,$$

which pulls entities with semantically similar Wikipedia documents closer together in the embedding space.
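A NumPy sketch of the RWMD-Gain and the bidirectional similarity above, assuming the normalized word frequencies and pretrained embeddings of each document are precomputed; the array names are illustrative.

```python
import numpy as np

def rwmd_gain(freq_a, emb_a, emb_b):
    """One-directional RWMD-Gain from document A to document B.

    freq_a : (n_a,) normalized word frequencies of document A
    emb_a  : (n_a, d) pretrained embeddings of document A's words
    emb_b  : (n_b, d) pretrained embeddings of document B's words
    Each word of A is matched to its highest dot-product counterpart in B.
    """
    dots = emb_a @ emb_b.T                        # (n_a, n_b) pairwise dot products
    return float(np.sum(freq_a * dots.max(axis=1)))

def rwmd_similarity(freq_a, emb_a, freq_b, emb_b):
    """Bidirectional gain used as the semantic similarity of two documents."""
    return 0.5 * (rwmd_gain(freq_a, emb_a, emb_b) + rwmd_gain(freq_b, emb_b, emb_a))
```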
4.3.2 TF-IDF Similarity
To compute the TF-IDF similarity between two entity documents, we first convert each document $d$ to a TF-IDF vector $\mathbf{v}_d$, in which each element is calculated as

$$v_{d,i} = \mathrm{tf}_{i,d} \cdot \log \frac{N}{n_i},$$

where $i$ is the index of the word, $\mathrm{tf}_{i,d}$ is the number of occurrences of word $i$ in document $d$, $N$ is the total number of documents, and $n_i$ is the number of documents containing word $i$.
The similarity of documents $d_1$ and $d_2$ is the dot product of their TF-IDF vectors, $\mathrm{sim}(d_1, d_2) = \mathbf{v}_{d_1}^{\top} \mathbf{v}_{d_2}$.
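A short sketch of the TF-IDF vectors and dot-product similarity described above. Raw term counts and a $\log(N/n_i)$ IDF are assumed; the exact weighting variant may differ.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """TF-IDF vectors for a list of tokenized documents (one per entity article).

    Returns a list of dicts mapping word -> tf-idf weight.
    """
    N = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))                  # document frequency of each word
    vectors = []
    for doc in documents:
        tf = Counter(doc)                    # raw term counts in this document
        vectors.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return vectors

def tfidf_similarity(v1, v2):
    """Dot product of two sparse TF-IDF vectors."""
    if len(v2) < len(v1):
        v1, v2 = v2, v1                      # iterate over the shorter vector
    return sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
```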
5 Experiments
5.1 Data Sources and formulations
From Fu et al. (2019) we select the FB60K dataset as the base knowledge graph (https://github.com/thunlp/OpenNRE). In that paper, the authors studied the datasets and found that the relation distributions are very imbalanced; still, FB60K contains the most entities among the FreeBase-derived KG variants, such as FB15K and FB15K-237. In the original dataset, every entity is named by its Freebase ID; to properly link the partially anonymized graph to the underlying raw text, we use an openly available dictionary to convert the IDs to their corresponding names. After linking the IDs to names, we can see that the FB60K dataset mainly contains locations (both famous and obscure), celebrity names, schools, sports unions, etc.
The external sources used by the three variants are as follows:
• Variant A (using existing embeddings): we use the pretrained GloVe embeddings (Pennington et al. (2014b)) as the embeddings we want to converge to. Since word embeddings are normally for single words, while entities are mostly proper-noun phrases, we generate the final embedding of an entity by averaging over all constituent words related to the entity (an entity can have multiple words and multiple names referring to it) according to the open dictionaries; see the sketch after this list.
• Variant B (using entity-labeled text): we use the NYT10 dataset supplied along with the FB60K dataset. Entities are labeled using common named entity recognition toolkits, and the source text is excerpted from the New York Times corpus. The sentences are also labeled with the relations they contain, but we have not used these labels yet. An analysis of the information overlap (i.e., alignment) between the corpus and the KG is given in Table 1. A higher CT/CE (the ratio of corpus triples to corpus entities) indicates that adding corpus edges to the KG increases the average degree more significantly, leading to a larger reduction in sparsity.
• Variant C (using linked Wikipedia articles): we use articles from the December 2019 Wikipedia dump. By linking each entity's Freebase ID to its name, and finding the corresponding English Wikipedia article, we extracted over 54,000 articles linked to existing entities (around 80% of them). We use the Stanford CoreNLP tokenizer (Manning et al. (2014)) to tokenize and lowercase all words in the text, after removing structured text and tables, which cannot be easily consumed.
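The averaging over constituent words mentioned under Variant A can be sketched as follows; the dictionary format and the 300-dimensional GloVe vectors are our assumptions.

```python
import numpy as np

def entity_vector(entity_names, glove, dim=300):
    """Average pretrained GloVe vectors over an entity's surface-form words.

    entity_names : list of alias strings for one entity (from the ID-to-name dictionary)
    glove        : dict mapping lowercase word -> np.ndarray of length dim
    Words missing from the GloVe vocabulary are skipped.
    """
    vecs = [glove[w] for name in entity_names
            for w in name.lower().split() if w in glove]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```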
We choose TransE (Bordes et al. (2011)) as the base KG embedding method and as the basis of comparison for all variants. TransE is an older method for computing KG embeddings, but it is the easiest to implement and still holds decent performance despite the newer, more state-of-the-art models introduced since.
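For reference, the base objective we regularize is TransE's standard margin-based ranking loss over corrupted triples; a compact PyTorch sketch (argument names are illustrative):

```python
import torch

def transe_loss(h, r, t, h_neg, t_neg, margin=1.0, p=1):
    """Margin-based ranking loss of TransE over a batch of triples.

    h, r, t      : (batch, dim) embeddings of the positive triples
    h_neg, t_neg : (batch, dim) embeddings of corrupted (negative) triples
    The text regularization terms of Section 4 are added on top of this loss.
    """
    pos = torch.norm(h + r - t, p=p, dim=1)          # distance of true triples
    neg = torch.norm(h_neg + r - t_neg, p=p, dim=1)  # distance of corrupted triples
    return torch.relu(margin + pos - neg).mean()
```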
Table 1: Statistics of the FB60K-NYT10 dataset (C = corpus side, G = knowledge graph side; CT/CE = corpus triples per corpus entity; CR/KR = corpus relations per KG relation).

Dataset | #triples(C) | #triples(G) | #entities(C) | #entities(G) | #rel(C) | #rel(G) | S(train) | S(test) | CT/CE | CR/KR |
---|---|---|---|---|---|---|---|---|---|---|
FB60K-NYT10 | 172,448 | 268,280 | 63,696 | 69,514 | 57 | 1,327 | 570k | 172k | 2.71 | 0.04 |
5.2 Experimental Result
We run all of our proposed variants on the FB60K-NYT10 dataset. Our baseline is TransE (Bordes et al. (2011)). We add each proposed regularization term to TransE individually to compare their performance. The experimental results are shown in Table 2.
Table 2: Results of TransE with different text regularization terms on FB60K-NYT10 (raw and filtered metrics).

Method | Hits@1 | Hits@3 | Hits@10 | MRR | Hits@1 Filtered | Hits@3 Filtered | Hits@10 Filtered | MRR Filtered |
---|---|---|---|---|---|---|---|---|
TransE | 0.2144 | 0.4414 | 0.5431 | 0.3379 | 0.3392 | 0.6245 | 0.7016 | 0.4902 |
TransE + Cosine | 0.2303 | 0.4455 | 0.5436 | 0.3487 | 0.3957 | 0.6310 | 0.7035 | 0.5216 |
TransE + GloVe-NYT | 0.2112 | 0.4366 | 0.5436 | 0.3363 | 0.3378 | 0.6086 | 0.6997 | 0.4841 |
TransE + GloVe-Wiki | 0.2305 | 0.4474 | 0.5493 | 0.3500 | 0.3552 | 0.6170 | 0.6951 | 0.4957 |
TransE + GloVe-RWMD | 0.2417 | 0.4000 | 0.5147 | 0.3396 | 0.4292 | 0.5816 | 0.6787 | 0.5194 |
5.3 Experimental Analysis
From Table 2, we can see that TransE + Cosine and TransE + GloVe-Wiki generally outperform the other methods and the baseline. For Hits@1, TransE + GloVe-RWMD gives the best result and outperforms the baseline significantly. For Hits@3, Hits@10, and MRR, TransE + GloVe-Wiki is the best. GloVe-RWMD considers semantic distances, which helps find the single best-suited entity but negatively impacts the scores of the other potential entities. TransE + Cosine and TransE + GloVe-Wiki are simpler and more intuitive; they consider only word frequencies, which proves more useful when ranking a list of entities. The amount of injected information directly influences the quality of the similarity matrix: Wikipedia is considerably larger than the New York Times corpus, and a larger amount of information gives more accurate entity similarity scores, so GloVe-Wiki outperforms GloVe-NYT.
6 Conclusion
From the experiments, our methods showed decent improvements over the baseline model. Although TransE is a relatively simple model, we believe the proposed regularization methods can fit later state-of-the-art models with modest adaptations. However, due to time limitations, the hyper-parameters were not fine-tuned, so the results do not show a very significant improvement. Future work can therefore further improve the implementation of the regularization, or choose a better set of hyper-parameters. Moreover, the design of the decoders that transform textual latent features into entity or knowledge graph latent features can be further improved. After all, we believe that more information should always be better than less, but we need to find a good way to utilize it.
In this project, we explored a diverse set of text-assisted KG embedding and reasoning methods. Given that these methods all require extensive frameworks to extract external information, we think there may be unexplored yet simpler ways to migrate textual latent features into KG embeddings. Further study, however, requires a more balanced and diverse knowledge base, which we lack today due to the demise of the FreeBase graph.
References
- Bordes et al. (2011) Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, et al. Learning structured embeddings of knowledge bases. In AAAI, volume 6, pp. 6, 2011.
- Das et al. (2017) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv preprint arXiv:1711.05851, 2017.
- Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Fu et al. (2019) Cong Fu, Tong Chen, Meng Qu, Woojeong Jin, and Xiang Ren. Collaborative policy learning for open knowledge graph reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2672–2681, 2019.
- Han et al. (2018) Xu Han, Zhiyuan Liu, and Maosong Sun. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of AAAI, 2018.
- Hayashi et al. (2019) Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. Latent relation language models. arXiv preprint arXiv:1908.07690, 2019.
- Kusner et al. (2015) Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International conference on machine learning, pp. 957–966, 2015.
- Lin et al. (2018) Xi Victoria Lin, Richard Socher, and Caiming Xiong. Multi-hop knowledge graph reasoning with reward shaping. arXiv preprint arXiv:1808.10568, 2018.
- Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
- Mikolov et al. (2018) Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
- Pennington et al. (2014a) Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Citeseer, 2014a.
- Pennington et al. (2014b) Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014b.
- Santus et al. (2018) Enrico Santus, Hongmin Wang, Emmanuele Chersoni, and Yue Zhang. A rank-based similarity metric for word embeddings. arXiv preprint arXiv:1805.01923, 2018.
- Shi & Weninger (2018) Baoxu Shi and Tim Weninger. Open-world knowledge graph completion. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1499–1509, 2015.
- Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pp. 2071–2080, 2016.
- Turian et al. (2010) Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P10-1040.
- Xiong et al. (2017) Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. arXiv preprint arXiv:1707.06690, 2017.
- Yang et al. (2014) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.