
Diversified and Adaptive Negative Sampling on Knowledge Graphs

Ran Liu Zhongzhou Liu Xiaoli Li Hao Wu [email protected] Yuan Fang [email protected]
Abstract

In knowledge graph embedding, aside from positive triplets (i.e., facts in the knowledge graph), the negative triplets used for training also have a direct influence on the model performance. In reality, since knowledge graphs are sparse and incomplete, negative triplets often lack explicit labels, and thus they are often obtained from various sampling strategies (e.g., randomly replacing an entity in a positive triplet). An ideal sampled negative triplet should be informative enough to help the model train better. However, existing methods often ignore diversity and adaptiveness in their sampling process, which harms the informativeness of negative triplets. As such, we propose a generative adversarial approach called Diversified and Adaptive Negative Sampling (DANS) on knowledge graphs. DANS is equipped with a two-way generator that generates more diverse negative triplets through two pathways, and an adaptive mechanism that produces more fine-grained examples by localizing the global generator for different entities and relations. On the one hand, the two-way generator increases the overall informativeness with more diverse negative examples; on the other hand, the adaptive mechanism increases the individual, sample-wise informativeness with more fine-grained sampling. Finally, we evaluate the performance of DANS on three benchmark knowledge graphs to demonstrate its effectiveness through quantitative and qualitative experiments.

keywords:
Knowledge graphs, Graph representation learning, Graph neural networks, Negative sampling
journal: Neural Networks

Affiliations: (1) Singapore Management University, 81 Victoria St, Singapore 188065, Singapore; (2) Agency for Science, Technology and Research, 1 Fusionopolis Way, Singapore 138632, Singapore; (3) Beijing Normal University, 19 Xinwai Ave, Beitaipingzhuang, Haidian District, Beijing 100875, China

Highlights

We design a two-way generator to produce diverse negative triplets, to increase the overall informativeness.

We employ a FiLM layer to adapt the global generator model into local models, to increase the individual informativeness of the negative triplets.

We conduct extensive experiments on three benchmark datasets. The results demonstrate the superiority of our proposed approach.

1 Introduction

Knowledge graphs have been widely used to encode facts about the real world. Typically, each fact describes a relationship between a head and a tail entity in the form of a triplet $\langle$head, relation, tail$\rangle$, and different entities across facts are interconnected to form a graph structure. The rich facts contained in a large-scale knowledge graph can be used to enhance numerous applications that rely on real-world knowledge, such as question answering [41, 16, 33], object detection [9, 15, 19] and recommendation [4, 12, 8, 42]. To effectively exploit the facts for these applications, a common approach is to first perform knowledge graph embedding, which converts the symbolic entities and relations to a latent vector space. The learned embedding aims to capture relevant structural and semantic information in the knowledge graph, which can then be integrated with other machine learning models.

In this paper, we focus on the problem of knowledge graph embedding. The high-level idea is that the embedding vectors of entities and relations co-occurring in the same fact should be bounded by certain constraints due to their relatedness. For instance, consider a fact $\tau=\langle h=\mathtt{Beijing}, r=\mathtt{isCapitalOf}, t=\mathtt{China}\rangle$ and a classic method TransE [2]. TransE maps each entity and relation in the fact to vectors $\mathbf{e}_{h}$, $\mathbf{e}_{r}$, $\mathbf{e}_{t}$, respectively, so that they approximately satisfy the constraint $\mathbf{e}_{h}+\mathbf{e}_{r}\approx\mathbf{e}_{t}$ by minimizing the loss $\|\mathbf{e}_{h}+\mathbf{e}_{r}-\mathbf{e}_{t}\|$. On the contrary, a non-fact such as $\langle h=\mathtt{Beijing}, r=\mathtt{isCapitalOf}, t^{\prime}=\mathtt{Russia}\rangle$ would maximize the loss $\|\mathbf{e}_{h}+\mathbf{e}_{r}-\mathbf{e}_{t^{\prime}}\|$. Given this contrast, the factual triplets are known as positive triplets (or examples), whereas the non-factual triplets are called negative triplets. Although positive triplets are readily available, negative triplets are often obtained through random sampling. More recent works [43, 3, 1, 48, 30, 50] explore advanced constraints or losses [2, 36, 44] on the triplets, but the sampling strategy for negative triplets remains a crucial yet less explored problem.

Earlier negative sampling approaches resort to random sampling, e.g., by replacing the tail (or head) entity in a positive triplet with a random entity from the knowledge graph, sampled in a uniform [42] or popularity-weighted manner [20]. Although random sampling is straightforward, it is often inadequate to optimize the informativeness of negative triplets. Informativeness refers to how much information each negative triplet contributes to model learning. Intuitively, a more informative negative triplet would improve the efficiency of model training and accelerate model convergence. For instance, w.r.t. the positive triplet $\tau$ given earlier, $\langle\mathtt{Beijing}, \mathtt{isCapitalOf}, \mathtt{Russia}\rangle$ is considered a more informative negative triplet than $\langle\mathtt{Beijing}, \mathtt{isCapitalOf}, \mathtt{KFC}\rangle$, as the latter can be easily identified as negative and thus helps little in refining the decision boundary. Although various scoring functions [2, 36, 44, 32, 31] help to judge the informativeness of negative triplets, they do not consider the diversity and adaptiveness of the sampling process, which are the two aspects we propose to study in this work.

On one hand, diversity helps to increase the overall informativeness of all the negative triplets collectively. We observe that negative triplets can be associated with both entities and relations. For example, the tail entity of the positive triplet $\tau$ can be replaced by entities associated not only with the head entity $\mathtt{Beijing}$, such as $\mathtt{GreatWall}$ and $\mathtt{Shanghai}$, but also with the relation $\mathtt{isCapitalOf}$, such as $\mathtt{Russia}$ (a country with some capital city) and $\mathtt{London}$ (the capital city of some country). On the other hand, adaptive sampling of negative triplets would make entity- or relation-specific adjustments to sample selection, which increases the individual informativeness of each triplet in a finer-grained manner. For instance, selecting a tail entity for $\mathtt{Beijing}$, $\mathtt{Tokyo}$ or $\mathtt{KFC}$ using a global sampling model could be suboptimal given the variability among these entities. Instead, local models that condition on each entity would be able to adapt to such differences and make each triplet more informative.

In view of the above, we propose a Diversified and Adaptive Negative Sampling (DANS) approach for knowledge graph embedding, to improve both the overall and individual informativeness of negative triplets. Similar to previous state-of-the-art approaches such as KBGAN [3], we adopt a generative adversarial network (GAN) [38] for the generation of negative samples. However, there are two significant differences from previous GAN-based negative sampling on the knowledge graph. First, we design a two-way generator to produce diversified samples that are associated with both entities and relations w.r.t. a positive triplet, which aims to increase the overall informativeness of the samples. More specifically, the generator consists of two pathways to produce two different kinds of negative triplets associated with a given entity and entity-relation, respectively. Second, we design an adaptive mechanism to modulate the global generator model into local models to handle the differences across entities and relations, which aims to increase the individual informativeness of the samples in a finer-grained manner. In particular, we employ a Feature-wise Linear Modulation (FiLM) layer [26] that conditions the generator on a given entity or entity-relation input. In summary, we make the following contributions:

  • We design a two-way generator to produce diverse negative triplets, to increase the overall informativeness.

  • We employ a FiLM layer to adapt the global generator model into local models, to increase the individual informativeness of the negative triplets.

  • We conduct extensive experiments on three benchmark datasets. The results demonstrate the superiority of our proposed approach.

2 Background

Negative sampling is an important issue in various machine learning tasks such as recommendation systems [28] and natural language processing [20]. In the context of knowledge graph embedding, negative triplets are often constructed by replacing the tail or head entity in a positive triplet with a randomly sampled entity [2, 40, 17]. Unfortunately, in uniform [2] or popularity-weighted sampling [20], the sampled entity could be completely unrelated to the head or the relation, and therefore be less informative.

To sample more informative negative triplets, researchers have leveraged different heuristics or learning strategies. Several structure-aware models [1, 46, 18] exploit the graph structure, generally selecting negative examples in the neighborhood of positive examples. For example, SANS [1] hypothesizes that entities that are in close proximity to each other, but do not share a direct relationship, are better candidates for negative sampling. In a similar spirit, PinSage [18] generates localized graphs via random walks to extract informative negative samples. However, these approaches have a high risk of selecting false negatives, as entities in close proximity that are not explicitly related could still form positive triplets due to the incompleteness of the observed graph.

Other approaches seek to quantify the informativeness of negative triplets through various learning strategies, including GANs [3, 38, 11, 47], reinforcement learning [39, 49], and importance sampling [48]. These methods provide a more explicit and systematic scoring of negative triplets, which often leads to better performance. However, they do not consider the diversity and adaptiveness of negative sampling, which are crucial to the overall and individual informativeness of the negative triplets, respectively.

Besides, recent studies [45, 27] show that the optimal negative sampling distribution should be positively but sub-linearly correlated to the positive sampling distribution. Although our proposed model shares a similar view by learning the underlying distribution of positive samples to produce negative samples, we take one step further to consider the diversity and adaptiveness of the negative samples in an adversarial setting. In particular, toward adaptiveness, we borrow the idea from Feature-wise Linear Modulation (FiLM) [26], which was first introduced in the area of visual question answering. Its mechanism includes a learnable feature-wise affine transformation on the hidden neurons of a neural network, conditioned on an arbitrary input. In our context, we employ a FiLM layer to adapt the global generators into local models conditioned on individual input (entity or relation).

3 Methodology

In this section, we introduce the problem formulation and some preliminaries on knowledge graph embedding, followed by our proposed approach DANS.

Refer to caption
Figure 1: Overall framework of DANS. The toy example only shows how to generate fake tail entities, while generating fake head entities follows a similar process.

Before we delve into the details, we first sketch the overall framework in Figure 1. The model consists of four main parts: (a) a base embedding model, which learns the embeddings for entities and relations; (b) the two-way adaptive generator, which generates “fake” entity samples to construct negative examples; (c) the two-way discriminator, which utilizes both adversarial and auxiliary losses to improve the quality of the produced samples; (d) model training with negative sampling, where we replace one entity in a positive triplet with a generated fake entity to form negative triplets, and train the base model together with the original positive triplets.

3.1 Problem formulation and preliminaries

A knowledge graph (KG) is defined by an entity (node) set $\mathcal{V}$, a relation set $\mathcal{R}$ and a ground-truth or positive triplet (edge) set $\mathcal{E}$. Given a triplet $\tau=\langle h,r,t\rangle$ for some $h,t\in\mathcal{V}$ and $r\in\mathcal{R}$, a typical KG model aims to learn a scoring function $\mathcal{F}(\tau)$ to estimate the probability that $\tau$ is a positive triplet, i.e., $\tau$ is a fact that should appear in the ground-truth set $\mathcal{E}$.

Given the power of graph convolutional networks, in this paper we adopt a multi-layer relational graph convolutional network (RGCN) [29] as our base embedding model in Figure 1(a). The base model encodes the entities in layer $l+1$ into vectors $\mathbf{e}_{i}^{l+1}\in\mathbb{R}^{d^{l+1}}$ in a latent embedding space, by aggregating their embeddings $\mathbf{e}_{j}^{l}\in\mathbb{R}^{d^{l}}$ from the previous layer $l$, as follows.

\mathbf{e}_{i}^{l+1}=\textsc{ReLU}\left(\sum_{r\in\mathcal{R}}\sum_{j\in\mathcal{N}_{i}^{r}}\frac{1}{|\mathcal{N}_{i}^{r}|}W_{r}^{l}\mathbf{e}_{j}^{l}+W_{0}^{l}\mathbf{e}_{i}^{l}\right), \quad (1)

where $\mathcal{N}_{i}^{r}$ is the set of neighbors of entity $i$ under relation $r$, $W_{r}^{l}$ is a trainable weight matrix for $r$, $W_{0}^{l}$ is an additional trainable weight matrix to capture the self-information of each entity in layer $l$, and ReLU is the activation function. Assuming a total of $L$ layers are stacked, the embeddings in the last layer are the output embeddings, which we simply write as $\mathbf{e}_{i}\in\mathbb{R}^{d}, \forall i\in\mathcal{V}$.
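As an illustration of Eq. (1), the following PyTorch sketch implements one RGCN message-passing layer with a dense per-relation weight matrix. It is a simplified, assumption-level example (no basis decomposition or sparse optimizations), and the class and argument names are our own, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SimpleRGCNLayer(nn.Module):
    """One message-passing layer following Eq. (1); illustrative only."""

    def __init__(self, num_relations: int, d_in: int, d_out: int):
        super().__init__()
        # One weight matrix W_r per relation, plus W_0 for self-information.
        self.W_r = nn.Parameter(torch.randn(num_relations, d_in, d_out) * 0.01)
        self.W_0 = nn.Linear(d_in, d_out, bias=False)
        self.num_relations = num_relations

    def forward(self, e: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # e: (num_entities, d_in) embeddings from layer l.
        # edges: (num_edges, 3) long tensor with rows (head i, relation r, tail j).
        i, r, j = edges[:, 0], edges[:, 1], edges[:, 2]
        # Message W_r e_j for every edge.
        msg = torch.einsum('eio,ei->eo', self.W_r[r], e[j])
        # Normalize by |N_i^r|, the number of r-neighbors of entity i.
        key = i * self.num_relations + r
        counts = torch.zeros(e.size(0) * self.num_relations).index_add_(
            0, key, torch.ones(len(key)))
        msg = msg / counts[key].unsqueeze(-1)
        # Aggregate messages and add the self-loop term W_0 e_i.
        out = self.W_0(e).index_add(0, i, msg)
        return torch.relu(out)
```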

To optimize the parameters, a set of training triplets $\mathcal{D}_{\text{tr}}$ that consists of both positive and negative triplets is used. As shown in Figure 1(d), our objective is to sample a set of high-quality negative triplets, which, together with positive triplets, will be used to minimize the following cross-entropy loss:

-\sum_{\tau\in\mathcal{D}_{\text{tr}}}y_{\tau}\log\mathcal{F}(\tau)+\left(1-y_{\tau}\right)\log\left(1-\mathcal{F}(\tau)\right), \quad (2)

where $y_{\tau}=1$ if $\tau\in\mathcal{E}$, else $y_{\tau}=0$. We implement $\mathcal{F}$ using three popular decoders, namely, DistMult [44], ComplEx [36] and RotatE [34]. We provide the DistMult function below, and leave the details of ComplEx and RotatE to Appendix A.

\mathcal{F}(\langle h,r,t\rangle)=\sigma(\mathbf{e}_{h}^{\top}\text{Diag}(\mathbf{e}_{r})\mathbf{e}_{t}), \quad (3)

where $\sigma$ is the sigmoid activation, $\mathbf{e}_{h},\mathbf{e}_{t}$ are the head and tail entity embeddings from RGCN, and $\text{Diag}(\mathbf{e}_{r})\in\mathbb{R}^{d\times d}$ is a diagonal matrix whose diagonal is $\mathbf{e}_{r}$, an $r$-specific trainable vector of the decoder. Therefore, the full set of training parameters of the base model is $\Theta=\{W_{r}^{l}:r\in\mathcal{R},l\leq L\}\cup\{W_{0}^{l}:l\leq L\}\cup\{\mathbf{e}_{r}:r\in\mathcal{R}\}$.
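To make the decoder and training objective concrete, below is a minimal sketch of the DistMult score in Eq. (3) and the cross-entropy loss in Eq. (2); the function names and tensor layout are our own assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F


def distmult_score(e_h, e_r, e_t):
    # sigma(e_h^T Diag(e_r) e_t) reduces to a sum over element-wise products.
    return torch.sigmoid((e_h * e_r * e_t).sum(dim=-1))


def triplet_loss(scores, labels):
    # Binary cross-entropy over positive (y=1) and negative (y=0) triplets, Eq. (2).
    return F.binary_cross_entropy(scores, labels)
```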

3.2 Adaptive two-way generator

A common way to obtain a negative triplet is to replace the tail (or head) entity in a positive triplet by a randomly sampled entity. Beyond simple random sampling, generative adversarial nets (GAN) [10] such as KBGAN [3], IGAN [38], HeGAN [11] and GNDN [47], which learn the underlying sample distributions, have been shown to be effective in negative sampling on KG or other graph structures.

Formally, given a positive triplet $\langle h,r,t\rangle$, a generator $G$ aims to produce a “fake” tail entity $t^{\prime}$ to replace the real tail $t$, resulting in a negative triplet $\langle h,r,t^{\prime}\rangle$. More precisely, $G$ is a function that maps a noise vector $\epsilon$ (typically sampled from a prior distribution) to a vector $\mathbf{e}_{t^{\prime}}$ in the entity embedding space. Although we follow a similar process, distinct from existing GAN-based approaches, we propose an adaptive two-way generator, as shown in Figure 1(b). It not only diversifies the generation of fake entities, but also localizes the global generator model to adapt to fine-grained differences across entities.

Diversity. Classical GANs generate fake samples through a single pathway and assume a fixed prior distribution, which limits the diversity of fake entity generation. Particularly, in the context of KG, we can generate a fake tail entity associated with either the head entity only, or the relation as well. This improves the diversity of the resulting negative triplets and increases the overall informativeness. Hence, we propose a two-way generator that consists of two pathways, namely $G_{E}$ and $G_{R}$, to generate negative triplets associated with a given entity and entity-relation pair, respectively. Furthermore, having personalized priors for each entity or relation further enhances the diversification. Specifically, to replace the tail entity in a positive triplet $\langle h,r,t\rangle$ (the same process also applies to replacing the head entity $h$), we generate fake tail entity embeddings $\mathbf{e}_{t^{\prime}}$ and $\mathbf{e}_{t^{\prime\prime}}$ from the two pathways, as follows.

\mathbf{e}_{t^{\prime}} = G_{E}(\epsilon;\Theta_{G_{E}}), \quad \text{s.t. } \epsilon\sim N(\mathbf{e}_{h},\sigma^{2}I), \quad (4)
\mathbf{e}_{t^{\prime\prime}} = G_{R}(\epsilon;\Theta_{G_{R}}), \quad \text{s.t. } \epsilon\sim N(\mathbf{e}_{h}\otimes\mathbf{e}_{r},\sigma^{2}I), \quad (5)

where each pathway has its own parameters, i.e., $G_{E}$ parameterized by $\Theta_{G_{E}}$ and $G_{R}$ parameterized by $\Theta_{G_{R}}$. The noise vector $\epsilon$ that feeds into each pathway is sampled from a personalized multivariate Gaussian distribution for each entity/relation, $N(\mathbf{e}_{h},\sigma^{2}I)$ or $N(\mathbf{e}_{h}\otimes\mathbf{e}_{r},\sigma^{2}I)$ depending on the pathway. $N$ denotes the prior Gaussian distribution for sampling the generator input $\epsilon$, $\sigma$ is a hyper-parameter controlling the covariance of the multivariate Gaussian, $I$ is the identity matrix, and $\otimes$ stands for element-wise multiplication. Intuitively, as the prior Gaussian distributions in Eqs. (4) and (5) are centered on different embeddings, $\mathbf{e}_{h}$ or $\mathbf{e}_{h}\otimes\mathbf{e}_{r}$, they help to diversify the generated samples from the two pathways.

Each pathway is implemented as a multi-layer perceptron (MLP). Taking $G_{E}$ as an example, its MLP is parameterized by $\Theta_{G_{E}}$, which consists of the weights and biases in each layer. Let $\mathbf{x}_{G_{E}}^{m}$ denote the activations of the $m$-th MLP layer, where the activations of the last MLP layer are simply the output embedding of $G_{E}$. The architecture of $G_{R}$ mirrors that of $G_{E}$.
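To make Eqs. (4)-(5) concrete, the following is a minimal sketch of the two generator pathways with personalized Gaussian priors; the MLP sizes, the default `sigma`, and the function names are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


def make_pathway(d: int, hidden: int = 128) -> nn.Sequential:
    # A small MLP standing in for G_E or G_R.
    return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))


def generate_fakes(G_E, G_R, e_h, e_r, sigma: float = 0.1):
    # Pathway G_E: noise centered on the head entity embedding e_h, Eq. (4).
    eps_e = e_h + sigma * torch.randn_like(e_h)
    e_t_prime = G_E(eps_e)
    # Pathway G_R: noise centered on the element-wise product e_h * e_r, Eq. (5).
    eps_r = e_h * e_r + sigma * torch.randn_like(e_h)
    e_t_dprime = G_R(eps_r)
    return e_t_prime, e_t_dprime
```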

Adaptiveness. While more diverse samples help increase the overall informativeness, it is also important to improve the informativeness of individual samples. On the one hand, a single global generator model shared by all input entities or relations is unable to fully adapt to fine-grained differences across them. On the other hand, training one model for each entity or relation can cause severe overfitting and incur large overheads. To address this dilemma, we still train a global generator model, but allow the global model to be modulated through a Feature-wise Linear Modulation (FiLM) layer conditioned on each input entity or relation, which essentially adapts the shared global model into local models. Thus, in addition to the global model parameters, the adaptive mechanism only needs to learn the parameters of the FiLM layer, instead of one set of model parameters for each entity or relation.

Consider the pathway $G_{E}$, which generates a fake tail entity for a head entity $h$. We adapt the global model $G_{E}$ to suit the head entity $h$ by modulating the activations in each hidden layer of $G_{E}$:

\tilde{\mathbf{x}}_{G_{E}}^{m}=\mathbf{x}_{G_{E}}^{m}\otimes\alpha_{h}^{m}+\beta_{h}^{m}, \quad (6)

where $\alpha_{h}^{m}$ and $\beta_{h}^{m}$ are vectors conditioned on the head entity $h$ and have the same dimension as the $m$-th layer of $G_{E}$. They are used to scale and shift the activations $\mathbf{x}_{G_{E}}^{m}$ of the $m$-th layer of $G_{E}$. That is, the global $G_{E}$ is adapted into a local model conditioned on $h$. More specifically, $\alpha_{h}^{m}$ and $\beta_{h}^{m}$ are the output of the FiLM layer $F_{E}$ applied to the $m$-th layer of $G_{E}$, as follows.

\alpha_{h}^{m}=F_{E}(\mathbf{e}_{h};\Theta_{F_{E},\alpha}^{m}), \quad (7)
\beta_{h}^{m}=F_{E}(\mathbf{e}_{h};\Theta_{F_{E},\beta}^{m}). \quad (8)

Note that the head entity embedding $\mathbf{e}_{h}$ is the input to $F_{E}$, making the output adaptive to and conditioned on $h$. $F_{E}$ can be implemented as an MLP, parameterized by $\Theta_{F_{E},\alpha}^{m}$ and $\Theta_{F_{E},\beta}^{m}$ for the $m$-th layer of $G_{E}$. Similarly, the second pathway $G_{R}$ can be modulated by a FiLM layer $F_{R}$, whose input is $\mathbf{e}_{h}\otimes\mathbf{e}_{r}$, to generate a fake tail entity for a head entity $h$ and relation $r$. $F_{R}$ is parameterized by $\Theta_{F_{R},\alpha}^{m}$ and $\Theta_{F_{R},\beta}^{m}$ for the $m$-th layer of $G_{R}$, and outputs $\alpha_{h,r}^{m}$ and $\beta_{h,r}^{m}$ to scale and shift the activations in $G_{R}$.

To sum up, the trainable parameters in the adaptive two-way generator, $\Theta_{G}$, include the weights of the two global pathways and the FiLM layer weights for each layer in each pathway. Assuming a total of $M$ hidden layers in the global pathways, we have $\Theta_{G}=\{\Theta_{G_{E}},\Theta_{G_{R}}\}\cup\{\Theta_{F_{E},\alpha}^{m},\Theta_{F_{E},\beta}^{m},\Theta_{F_{R},\alpha}^{m},\Theta_{F_{R},\beta}^{m}:m\leq M\}$.
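The sketch below illustrates how a FiLM layer can modulate one hidden layer of a generator pathway following Eqs. (6)-(8). Here the FiLM network is a single linear layer producing both scale and shift, and the pathway has one hidden layer; these are simplifying assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn


class FiLMPathway(nn.Module):
    """A generator pathway whose hidden layer is FiLM-modulated; illustrative only."""

    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(d, hidden)
        self.fc2 = nn.Linear(hidden, d)
        # FiLM layer: outputs (alpha, beta) for the hidden layer, conditioned
        # on the input embedding (e_h for G_E, or e_h * e_r for G_R).
        self.film = nn.Linear(d, 2 * hidden)

    def forward(self, eps: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        alpha, beta = self.film(cond).chunk(2, dim=-1)   # Eqs. (7)-(8)
        x = torch.relu(self.fc1(eps))
        x = x * alpha + beta                             # Eq. (6): scale and shift
        return self.fc2(x)
```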

3.3 Two-way discriminator

As in a standard GAN architecture, a discriminator is needed to help the generator produce high-quality fake entities that mimic real entities. Specifically, the discriminator and the generator compete with each other in a minimax game, in which the generator aims to fool the discriminator by producing realistic looking entities, while the discriminator aims to beat the generator by distinguishing the real and fake entities. In our case, given the two-way generator, we further equip the discriminator with the ability to distinguish the fake entities generated by the two pathways, which can further differentiate and diversify the two pathways.

Concretely, as shown in Figure 1(c), the discriminator also has two pathways: $D_{\text{Adv}}$, an adversarial pathway to distinguish fake and real entities, and $D_{\text{Aux}}$, an auxiliary pathway to distinguish fake entities generated by $G_{E}$ and $G_{R}$. Taking the generation of tail entities as an example, given the real tail entity $t$ in a positive triplet, as well as the fake entities $t^{\prime}$ generated by $G_{E}$ and $t^{\prime\prime}$ generated by $G_{R}$, $D_{\text{Adv}}$ tries to distinguish $t$ from $t^{\prime}$ and $t^{\prime\prime}$, while $D_{\text{Aux}}$ tries to distinguish $t^{\prime}$ from $t^{\prime\prime}$. In other words, each of them involves a binary classification:

\hat{y}_{\text{Adv},i} = D_{\text{Adv}}(\tilde{\mathbf{e}}_{i};\Theta_{D_{\text{Adv}}}), \quad (9)
\hat{y}_{\text{Aux},i} = D_{\text{Aux}}(\tilde{\mathbf{e}}_{i};\Theta_{D_{\text{Aux}}}), \quad \text{s.t. } i\neq t, \quad (10)

where $D_{\text{Adv}}$ and $D_{\text{Aux}}$ are each implemented as a fully connected layer, and $\tilde{\mathbf{e}}_{i}=\textsc{MLP}(\mathbf{e}_{i};\Theta_{D_{S}})$ is a shared hidden representation computed from the embedding $\mathbf{e}_{i}$ of a real or fake entity $i$. The shared hidden representation allows $D_{\text{Adv}}$ and $D_{\text{Aux}}$ to benefit from each other during training, as in Odena [23], since they collectively try to distinguish three different classes of samples ($t$, $t^{\prime}$, $t^{\prime\prime}$).

Note that $\hat{y}_{\text{Adv},i}$ (or $\hat{y}_{\text{Aux},i}$) is the predicted value of the ground-truth label $y_{\text{Adv},i}$ (or $y_{\text{Aux},i}$), such that $y_{\text{Adv},i}=1$ if $i$ is a real entity, else $y_{\text{Adv},i}=0$. Furthermore, for a fake entity $i$, we define $y_{\text{Aux},i}=1$ if $\mathbf{e}_{i}$ is generated via Eq. (4), or $0$ if generated via Eq. (5). Subsequently, we employ a cross-entropy loss on the two discriminator pathways:

\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},i},y_{\text{Adv},i})=-y_{\text{Adv},i}\log\hat{y}_{\text{Adv},i}-(1-y_{\text{Adv},i})\log(1-\hat{y}_{\text{Adv},i}), \quad (11)
\mathcal{L}_{\text{Aux}}(\hat{y}_{\text{Aux},i},y_{\text{Aux},i})=-y_{\text{Aux},i}\log\hat{y}_{\text{Aux},i}-(1-y_{\text{Aux},i})\log(1-\hat{y}_{\text{Aux},i}). \quad (12)

In summary, the set of trainable parameters of the two-way discriminator includes the shared parameters and the weights of each classifier, i.e., $\Theta_{D}=\{\Theta_{D_{S}},\Theta_{D_{\text{Adv}}},\Theta_{D_{\text{Aux}}}\}$.
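For concreteness, a minimal sketch of the two-way discriminator with a shared hidden representation and the two heads of Eqs. (9)-(10) is given below; the layer sizes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn


class TwoWayDiscriminator(nn.Module):
    """Shared MLP followed by an adversarial head and an auxiliary head."""

    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.adv_head = nn.Linear(hidden, 1)   # D_Adv: real vs. fake, Eq. (9)
        self.aux_head = nn.Linear(hidden, 1)   # D_Aux: G_E fake vs. G_R fake, Eq. (10)

    def forward(self, e: torch.Tensor):
        h = self.shared(e)
        y_adv = torch.sigmoid(self.adv_head(h))
        y_aux = torch.sigmoid(self.aux_head(h))
        return y_adv, y_aux
```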

3.4 Adversarial training

Lastly, we train the generator, discriminator, and base embedding model jointly. On the one hand, the generator aims to fool the adversarial pathway of the discriminator, making it harder for $D_{\text{Adv}}$ to distinguish real and fake entities, as below.

\arg\max_{\Theta_{G}}\ \mathbb{E}_{t}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},t},1)+\mathbb{E}_{t^{\prime}}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},t^{\prime}},0)+\mathbb{E}_{t^{\prime\prime}}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},t^{\prime\prime}},0)+\lambda\sum_{m,h}\left(\|\alpha_{h}^{m}-\mathbf{1}\|^{2}+\|\beta_{h}^{m}\|^{2}\right), \quad (13)

where $t$ is a real tail entity, and $t^{\prime},t^{\prime\prime}$ are fake tail entities from $G_{E}$ and $G_{R}$, respectively (again, we only illustrate the case where the tail entity in a positive triplet is replaced). The last term in Eq. (13) is a regularization term on the scaling and shifting factors to prevent overfitting, as in Oreshkin et al. [24], and $\lambda$ is a hyper-parameter to control the strength of regularization. On the other hand, the goal of the discriminator is to overcome the generator by distinguishing fake and real entities, as well as fake entities from different generator pathways, as follows.

\arg\min_{\Theta_{D}}\ \mathbb{E}_{i}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},i},y_{\text{Adv},i})+\mathbb{E}_{i\neq t}\,\mathcal{L}_{\text{Aux}}(\hat{y}_{\text{Aux},i},y_{\text{Aux},i}), \quad (14)

where $i$ can be either a real or fake entity in the first term, but $i\neq t$ can only be a fake entity in the second term.

Following a typical adversarial training scheme for negative sampling on knowledge graphs, as in KBGAN [3], we alternate the model updates among the three parties as follows. First, we train the generator by updating the generator parameters $\Theta_{G}$ with Eq. (13), while freezing the discriminator parameters $\Theta_{D}$ and the base model parameters $\Theta$. Next, we update $\Theta_{D}$ with Eq. (14), while freezing $\Theta_{G}$ and $\Theta$. Finally, we update $\Theta$ by minimizing the loss on the positive and negative triplets in Eq. (2), while freezing the other two parameter sets. We repeat the three steps until all parties converge.
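The alternating scheme can be summarized by the sketch below. Here `losses` and `optimizers` are hypothetical handles to the three parties' objectives and optimizers (the generator "loss" being the negation of Eq. (13), since optimizers minimize); they are illustrative assumptions, not part of the paper's code.

```python
def adversarial_training_step(batch, losses, optimizers):
    """One round of the alternating scheme: generator, then discriminator, then base model.

    `losses` and `optimizers` are dicts keyed by 'G', 'D' and 'base'; only the
    active party's optimizer steps, so the other two parameter sets stay frozen.
    """
    for party in ('G', 'D', 'base'):
        optimizers[party].zero_grad()
        losses[party](batch).backward()   # Eq. (13) for 'G' (negated), Eq. (14) for 'D', Eq. (2) for 'base'
        optimizers[party].step()
```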

4 Experiments

We perform empirical evaluation on three benchmark knowledge graphs. We first compare the empirical performance of the proposed model DANS (code is included in the Supplementary Materials) with state-of-the-art baselines. In addition, we seek to address a number of research questions (RQ) through more in-depth empirical analysis. RQ1: Does the two-way design in the generator improve model performance? RQ2: Does the adaptive FiLM layer in the generator improve model performance? RQ3: What is the impact of the number of negative triplets and the adaptive regularization, respectively? RQ4: Can we observe the diversity and adaptiveness of generated triplets?

4.1 Experimental Design

Table 1: Statistics of datasets.
Dataset Entities Relations Train Val Test Total
WN18RR 40,943 11 86,835 3,034 3,134 93,003
NELL-995 75,492 200 149,678 543 3,992 154,213
UMLS 135 46 5,216 652 661 6,529

Datasets. Three benchmark knowledge graphs are used for our experiments. (1) WN18RR [2], a harder variant of WN18 [7], is derived from WordNet and consists of hyponym and hypernym relations between words. Compared to WN18, WN18RR removes inverse relations to minimize leakage from training. (2) NELL-995 [43] is a subset of the web-based facts collected by the 995th iteration of the NELL system [5]; it contains a large pool of entity types, and only the top 200 relations are retained. (3) UMLS [7] is a specialized knowledge base containing medical entities and their semantic relationships. The entities are biomedical concepts (e.g., disease, antibiotic), and the relations include interactions such as $\mathtt{treats}$ and $\mathtt{diagnosis}$. Table 1 gives a summary of the datasets used.

Task and evaluation. We employ the standard knowledge graph completion task [21, 4, 13, 6]. Specifically, for each positive test triplet, we construct a list of candidate triplets that also includes negative triplets, which are obtained by replacing either the head or the tail of the positive triplet with every other entity in the dataset. To avoid false negatives, we follow previous work (Bordes et al. [2]) by adopting their “filtered setting”. We then rank the candidate triplets based on the scoring function. For evaluation, we adopt several standard ranking metrics including Mean Reciprocal Ranking (MRR), Hit ratio at 1 (H@1) and Normalized discounted cumulative gain at 5 (NDCG@5) [35]. Details of these ranking metrics can be found in Appendix B.

Baselines. We compare with baselines in two distinct categories:

(1) Negative samplers with the same RGCN backbone [29] and decoders. In other words, they are flexible “plug-ins” that only replace the sampling strategy for a fair comparison to our method DANS. They include Rand, which replaces the head or tail entity with a uniformly sampled random entity; Pop [20]: a variant of Rand that substitutes uniform sampling with popularity-weighted sampling; Self-adv [35]: a self-adversarial negative sampling methodology; MCNS [45]: a model which derives negative samples from a distribution that is positively but sub-linearly correlated with the positive distribution.

(2) Other state-of-the-art baselines for knowledge graph embedding which may employ a variety of different backbones, heuristics and techniques that diverge from DANS, for a comprehensive comparison. They include SANS-RW [1]: a structure-aware model that selects negative samples in close proximity to positive nodes via random walks on the graph; NSCaching [48]: a model that employs importance sampling to sample more informative negative triplets; KBGAN [3]: a GAN-based model that learns to generate informative negative triplets; CAKE [22]: a framework which leverages extra information such as entity types from factual triplets to sample negative triplets; SMiLE [25]: a framework which employs specific contextual information influenced by entity types to sample negative triplets.

Parameter settings. Our model DANS and the other negative samplers (Rand, Pop, Self-adv and MCNS) employ RGCN [29] as the backbone, following JinheonBaek's PyTorch implementation. RGCN is first pre-trained for 15,000 epochs, and our base embedding model is then initialized using the pre-trained weights. We train the model for 5,000 epochs, using a learning rate of 0.001 and a mini-batch size of 1,000 for UMLS, WN18RR and NELL-995. In each mini-batch, the generator and discriminator epochs are set to 5 and 1, respectively, and their learning rates are set to 1e-3 and 1e-4, respectively. The regularization coefficient $\lambda$ for the FiLM layer in Eq. (13) is set to 1e-4 for all three datasets, as it is optimal among the candidate set {1e-2, 1e-3, 1e-4, 1e-5, 1e-6}.

Furthermore, we generate $N_{s}=20$ negative triplets for each positive triplet, out of which the first ten are equally split between the two generator pathways, while the remaining ten are obtained via uniform random sampling to further increase the diversity. In all cases, either the head or the tail of a positive triplet is randomly replaced with a negative entity, but not both. We set the output embedding dimension $d$ to 100 for all methods, except SANS-RW, where $d$ is set to the recommended 1,000 to achieve optimal performance. RGCN, RGCN-P, RGCN-Adv and RGCN-MCNS follow the same implementation and settings as the backbone of DANS.

In addition, the hyper-parameters related to negative sampling via Metropolis-Hastings in RGCN-MCNS are copied from the link prediction experiments in the original paper (Yang et al. [45]). To reduce the variance resulting from parameter initialization, the experimental results of all methods are averaged over five runs with different seeds. Furthermore, every method is standardized to use the triplet loss in Eq. (2). Other baseline settings have also been tuned according to the recommendations in the literature. Additional details can be found in Appendix C.

4.2 Results and Analysis

Table 2: Performance comparison with other negative sampling methods, which are plugged into the same backbone (RGCN) and decoders (DistMult, RotatE or ComplEx). The best results are in bold, and the runner-ups are underlined.
Sampling WN18RR NELL-995 UMLS
method MRR H@1 NDCG@5 MRR H@1 NDCG@5 MRR H@1 NDCG@5
DistMult
Rand .372±.002 .343±.003 .369±.005 .218±.001 .146±.002 .219±.002 .696±.010 .607±.082 .693±.007
Pop .374±.002 .342±.002 .376±.006 .216±.001 .142±.002 .216±.003 .680±.009 .589±.012 .692±.005
Self-adv .370±.007 .332±.010 .373±.006 .238±.003 .156±.003 .241±.003 .717±.009 .624±.015 .733±.008
MCNS .376±.004 .340±.005 .374±.006 .226±.002 .144±.002 .221±.003 .700±.002 .606±.008 .717±.002
DANS .381±.006 .352±.007 .386±.008 .227±.004 .162±.007 .220±.009 .724±.008 .641±.009 .725±.008
RotatE
Rand .234±.009 .110±.003 .260±.008 .182±.003 .093±.003 .189±.003 .817±.015 .683±.021 .855±.013
Pop .235±.007 .095±.003 .268±.007 .181±.002 .131±.002 .200±.003 .800±.005 .673±.010 .839±.004
Self-adv .202±.007 .058±.010 .235±.006 .186±.002 .096±.003 .194±.002 .809±.007 .677±.007 .848±.007
MCNS .242±.009 .132±.004 .288±.006 .194±.003 .122±.004 .200±.004 .822±.005 .682±.006 .884±.006
DANS .249±.002 .154±.001 .274±.003 .195±.010 .135±.011 .208±.010 .833±.004 .716±.006 .866±.005
ComplEx
Rand .386±.007 .346±.005 .390±.006 .245±.004 .172±.003 .251±.006 .898±.008 .822±.017 .920±.015
Pop .389±.011 .341±.007 .387±.012 .241±.005 .179±.006 .245±.004 .840±.009 .747±.009 .865±.008
Self-adv .375±.006 .329±.011 .382±.013 .250±.005 .181±.007 .277±.008 .908±.009 .844±.006 .925±.010
MCNS .392±.008 .343±.007 .394±.008 .248±.007 .177±.004 .264±.009 .879±.007 .835±.005 .892±.011
DANS .404±.005 .347±.004 .392±.009 .257±.006 .186±.010 .255±.008 .920±.007 .857±.011 .927±.008
Table 3: Performance comparison with baselines (all using the DistMult decoder). See Table 2 caption for entry styles.
WN18RR NELL-995 UMLS
Model MRR H@1 NDCG@5 MRR H@1 NDCG@5 MRR H@1 NDCG@5
SANS-RW .349±.010 .340±.013 .334±.010 .135±.006 .109±.008 .110±.008 .510±.008 .369±.009 .478±.003
NSCaching .374±.002 .337±.003 .374±.002 .177±.004 .150±.003 .140±.002 .625±.004 .508±.021 .607±.004
KBGAN .172±.004 .070±.006 .155±.002 .170±.002 .077±.004 .195±.009 .680±.005 .556±.023 .654±.004
CAKE .353±.007 .345±.005 .351±.008 .204±.006 .130±.007 .175±.012 .441±.013 .365±.008 .383±.010
SMiLE .315±.006 .291±.007 .294±.012 .131±.004 .127±.005 .105±.008 .414±.015 .345±.007 .372±.013
DANS .381±.006 .352±.007 .386±.008 .227±.004 .162±.007 .220±.009 .724±.008 .641±.009 .725±.008

Table 2 reports the quantitative comparison against the first category of baselines, involving different negative samplers under the same backbone and decoder. Overall, our model DANS consistently leads to better performance with the DistMult, RotatE and ComplEx decoders. This shows the robustness of our approach across various decoders. In general, DANS performs better than Rand and its variant Pop, showing that it is important to account for the informativeness of negative triplets, which is missing in random and popularity-weighted sampling. Since Self-adv accounts for the informativeness by giving more weight to higher-quality triplets, it generally outperforms Rand and Pop. It still lags behind DANS in most cases, as it ignores the concepts of diversity and adaptiveness. MCNS shows better performance than Rand and its variant Pop but loses to DANS, as MCNS was originally designed for homogeneous graphs.

Next, Table 3 compares DANS with the second category of baselines. Negative sampling in SANS-RW is not relation-aware and thus performs poorly on datasets with a greater variety of relations, namely NELL-995 and UMLS. In addition, KBGAN falls short on the two bigger datasets, WN18RR and NELL-995, as it ignores graph structure in the sampling process; furthermore, its adversarial training process potentially suffers from instability and degeneracy. On the other hand, NSCaching employs a more streamlined importance sampling approach, contributing to its competitive performance despite not considering graph structure for negative sampling. As CAKE and SMiLE leverage extra side information such as entity types to enhance their performance, their results deteriorate because such information is not available in the standard knowledge graph completion benchmarks used in this paper.

Overall, DANS obtains favorable performance, showing the importance of diversity and adaptiveness during negative sampling. We conduct a further ablation study in the next subsection to examine the contribution of each aspect. Finally, we include the experimental results on the FB15k-237 dataset, which show favorable performance with the ComplEx decoder, in Appendix D.

4.3 Additional research questions

Refer to caption
(a) Ablation study on generator and FiLM
Refer to caption
(b) No. of negative triplets $N_{s}$
Refer to caption
(c) Extent of regularization $\lambda$
Figure 2: Investigation of research questions. Each dataset uses its own $y$-axis, and the metric on the $y$-axes is MRR. (a) Ablation study on the contribution of individual generator pathways and the adaptive FiLM layer. (b) Parameter sensitivity w.r.t. the number of negative triplets $N_{s}$ and (c) the extent of adaptive regularization $\lambda$.

In this part, we seek to investigate RQ1–RQ4 listed at the beginning of this section. All experiments in this part are conducted using the DistMult function as the decoder.

Ablation study (RQ1, RQ2). We investigate the contribution from major design choices through an ablation study. As depicted in Figure 2(a), we compare DANS with the following variants, none of which employs the FiLM layer. (1) $G_{E}$: only the pathway $G_{E}$ in the generator; (2) $G_{R}$: only the pathway $G_{R}$ in the generator; (3) $G_{ER}$: both pathways $G_{E}$ and $G_{R}$.

From the results, among the single pathways (either $G_{E}$ or $G_{R}$), there is no consistent winner; it depends on the dataset. However, it is clear that using both pathways in the generator ($G_{ER}$) outperforms using just a single pathway. This addresses RQ1 and shows that diversifying the negative triplets with the two-way generator improves model performance and the overall informativeness of the negative triplets.

Furthermore, comparing $G_{ER}$ (i.e., both pathways without FiLM) and the proposed model DANS (i.e., both pathways with FiLM), our model obtains a significant lead in performance. This addresses RQ2 and shows the effectiveness of our adaptive design using FiLM.

Parameter sensitivity (RQ3). To answer RQ3, we perform a parameter sensitivity analysis. We first analyze how the number of negative triplets per positive triplet, $N_{s}$, impacts model performance. As shown in Figure 2(b), as we increase $N_{s}$ on each dataset, we consistently observe that the MRR performance improves and peaks at $N_{s}=20$. A larger $N_{s}$ allows for greater diversity, which explains the initial increase in performance. However, when $N_{s}\geq 20$, performance starts to plateau or even deteriorate, due to highly imbalanced training data.

Next, we investigate the impact of the adaptive regularization controlled by $\lambda$ in Figure 2(c). Generally, having such regularization (i.e., $\lambda>0$) avoids excessive scaling and shifting from the FiLM layer, and thus reduces overfitting to individual entities or relations. In particular, the MRR performance improves as $\lambda$ increases and is optimal for all three datasets when $\lambda$ is around 1e-4. As the optimal values of $N_{s}$ and $\lambda$ are largely stable across the three datasets, our model is not sensitive to these hyperparameter settings and potentially requires less effort in hyperparameter tuning. We also note that the performance on the UMLS dataset tends to be more sensitive to changes in both parameters. This could be because UMLS is a smaller dataset than the other two, containing only 5,216 positive training triplets, which increases the risk of overfitting to particular settings.

Case study (RQ4). We conduct a qualitative evaluation of DANS to demonstrate the diversity and adaptiveness of negative triplets generated by DANS.

Refer to caption
(a) $\mathtt{hasPart}$
(WN18RR)
Refer to caption
(b) $\mathtt{animalType}$
(NELL-995)
Refer to caption
(c) $\mathtt{isA}$
(UMLS)
Figure 3: Visualization of diversity compared to Random Negative Sampling (RNS). Best viewed in color.
Refer to caption
(a) $\mathtt{hasPart}$
(WN18RR)
Refer to caption
(b) $\mathtt{animalType}$
(NELL-995)
Refer to caption
(c) $\mathtt{isA}$
(UMLS)
Figure 4: Visualization of diversity compared to Popularity-weighted Negative Sampling (PNS). Best viewed in color.

Note that the ablation study has demonstrated improved model performance with the two-way generator and FiLM layers, providing direct evidence of the importance of the diverse pathways and FiLM in the model architecture. However, it is not immediately clear whether having two pathways and FiLM layers in the generator would indeed produce diverse and adaptive examples in the embedding space. Instead, as asked in RQ4, can we observe the diversity and adaptiveness of some generated examples?

To demonstrate the notion of diversity in RQ4, we present a few case studies of how DANS can produce more diverse examples than uniform random sampling (RNS) or popularity-weighted random sampling (PNS). In Figures 3 and 4, we visualize the positive and negative tail entities w.r.t. a given relation and all its head entities on each dataset. More specifically, each point represents one tail entity, which can be a positive (real) tail entity or a negative tail entity. The negative entity can be generated by one of the pathways $G_{E}$ or $G_{R}$ of the generator, or randomly sampled by RNS or PNS. The high-dimensional embedding space is projected onto a Cartesian plane using the t-SNE algorithm [37]. In Figure 3, we compare the diversity of negative entities generated by DANS with that of RNS-based negative entities. For our case study, we select one relation for each dataset, namely, $\mathtt{hasPart}$ on WN18RR, $\mathtt{animalType}$ on NELL-995 and $\mathtt{isA}$ on UMLS, so that all the positive and negative tail entities for a common relation (and the same original head entities) can be contrasted in one visualization. The results show that DANS provides more diverse negative entities for model training, where those generated by $G_{E}$ and $G_{R}$ occupy different subspaces from the positive entities. In contrast, RNS lacks diversity and samples negative entities in the same subspace as the positive entities. This could even contribute to false negative triplets, as they are not well separated from the real ones. Similarly, in Figure 4, we compare the negative entities generated by $G_{E}$ and $G_{R}$ in DANS against the output of PNS, which replaces uniform negative sampling with popularity-weighted sampling. The results again show that DANS produces more diverse negative entities in different subspaces from the positive entities, whereas PNS samples negative entities that mostly overlap with the positive ones and exhibit less diversity.

Refer to caption

(a) Generated samples w/o FiLM

Refer to caption

(b) Generated samples w/ FiLM

Figure 5: Visualization of adaptiveness for relations $\mathtt{DomainRegion}$ and $\mathtt{DomainUsage}$ in WN18RR. (a) Generated samples w/o FiLM layers and (b) w/ FiLM layers. Best viewed in color.

On the other hand, to demonstrate the notion of adaptiveness, we compare two different relations from each dataset to visualize the differences with and without FiLM layers. For example, Figure 5 visualizes the impact of adaptiveness for relations $\mathtt{DomainRegion}$ and $\mathtt{DomainUsage}$ in the WN18RR dataset. “Positive1” and “Positive2” denote the existing positive entities of each of the two relations, respectively; “Fake1” and “Fake2” denote the corresponding negative samples produced by the generators for each of the two relations, respectively. As shown in Figure 5(a), without FiLM layers, “Fake1” and “Fake2” both spread across the plane without any clear association with the corresponding “Positive1” and “Positive2”, thus offering less discriminative power to improve the learning of each relation. In contrast, in Figure 5(b), after FiLM layers are added, we can clearly see that “Fake1” and “Fake2” are adapted to “Positive1” and “Positive2”, as they move closer to their respective positive samples, improving the discriminative power of learning. This demonstrates that the global generators are adapted into local models conditioned on individual input (entity or relation) when FiLM layers are present. Similar patterns can be observed in the NELL-995 and UMLS datasets, which are presented in Appendix E.

5 Conclusion and Future Work

In this work, we introduced DANS, a negative sampling strategy for knowledge graph embedding that explicitly accounts for the informativeness of negative triplets. On one hand, we proposed a two-way generator to increase the overall informativeness by diversifying the negative triplets based on their association with not only entities but also relations. On the other hand, we adapt the global generator model into local models, which generate negative triplets in a finer-grained manner to improve their individual informativeness. Empirically, DANS has outperformed state-of-the-art baselines on three benchmark knowledge graphs through both quantitative and qualitative experiments. In the future, we believe that the concept of diversity and adaptiveness can be further extended to other graph representation learning problems with a lean structure that is less parameter intensive.

6 Funding

This research / project is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (Proposal ID: T2EP20122-0041). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.

Appendices

Appendix A Scoring function

ComplEx: ComplEx [36] extends DistMult [44] by introducing complex-valued embeddings to better model asymmetric relations. In ComplEx, entity and relation embeddings no longer lie in a real space but in a complex space $\mathbb{C}^{k}$, and the score is taken through the real-part operator $\operatorname{Re}$.

\mathcal{F}(\langle h,r,t\rangle)_{\text{ComplEx}}=\operatorname{Re}(\mathbf{e}_{h}^{\top}\text{Diag}(\mathbf{e}_{r})\mathbf{e}_{t})

ComplEx maps the entities and relations to complex vectors, where $\mathbf{e}_{h},\mathbf{e}_{t}\in\mathbb{C}^{k}$, $\operatorname{Re}$ takes the real part, and $\text{Diag}(\mathbf{e}_{r})$ is a diagonal matrix whose diagonal is $\mathbf{e}_{r}$, an $r$-specific trainable vector of the decoder.

RotatE: Inspired by TransE [2], RotatE [34] also operates in the complex vector space and is motivated by Euler's identity. The model defines each relation as a rotation and measures the distance from the rotated source entity to the target entity, in order to account for three relation patterns: symmetry/anti-symmetry, inversion and composition.

\mathcal{F}(\langle h,r,t\rangle)_{\text{RotatE}}=-\|\mathbf{e}_{h}\circ\mathbf{e}_{r}-\mathbf{e}_{t}\|^{2}

RotatE maps the entities and relations to complex vectors, where $\mathbf{e}_{h},\mathbf{e}_{t}\in\mathbb{C}^{k}$ and $\circ$ denotes the Hadamard (element-wise) product.
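As an assumption-level illustration, both scoring functions can be implemented with real-valued tensors holding the real and imaginary parts of the complex embeddings. The ComplEx sketch below uses the conjugate of the tail embedding, as in the original ComplEx formulation; the function names and tensor layout are our own.

```python
import torch


def complex_score(h_re, h_im, r_re, r_im, t_re, t_im):
    # Re(<e_h, e_r, conj(e_t)>) expanded into real-valued operations.
    return (h_re * r_re * t_re + h_im * r_re * t_im
            + h_re * r_im * t_im - h_im * r_im * t_re).sum(dim=-1)


def rotate_score(h_re, h_im, r_phase, t_re, t_im):
    # Each relation is a rotation e^{i*theta}; the score is the negative
    # squared distance ||e_h o e_r - e_t||^2 in the complex plane.
    rot_re = h_re * torch.cos(r_phase) - h_im * torch.sin(r_phase)
    rot_im = h_re * torch.sin(r_phase) + h_im * torch.cos(r_phase)
    return -((rot_re - t_re) ** 2 + (rot_im - t_im) ** 2).sum(dim=-1)
```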

Appendix B Evaluation Metrics

Mean Reciprocal Ranking (MRR): For each candidate list we record the reciprocal of the ranking position of the positive triplet, and take the average over all lists. It reflects the absolute ranking of the first relevant item in the list.

\operatorname{MRR}=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\operatorname{rank}_{i}}

where $\operatorname{rank}_{i}$ is the position of the relevant result in the $i$-th query and $|Q|$ is the total number of queries.

Hit ratio at $K$ (H@$K$): We compute the fraction of candidate lists in which the positive triplet falls within the first $K$ positions.

\operatorname{HR}=\frac{|U_{\text{hit}}^{K}|}{|U_{\text{all}}|}

where $|U_{\text{hit}}^{K}|$ is the number of times a positive triplet is ranked within the top $K$ positions of its candidate list, and $|U_{\text{all}}|$ is the total number of candidate lists.

Normalized discounted cumulative gain at KK (NDCG@KK): We compare the ranked list with the ideal list, where a match at a lower position would have a discounted gain. The gain is further normalized across lists, and we measure the average over all lists. It reflects the relevance at the top KK positions, taking position significance into account.

\mathrm{DCG}_{K}=\sum_{i=1}^{K}\frac{Gain_{i}}{\log_{2}(i+1)}
\mathrm{IDCG}_{K}=\sum_{i=1}^{K}\frac{Gain_{i}^{\text{ideal}}}{\log_{2}(i+1)}
\mathrm{nDCG}_{K}=\frac{\mathrm{DCG}_{K}}{\mathrm{IDCG}_{K}}

where $\mathrm{DCG}_{K}$ is computed over our ranked list and $\mathrm{IDCG}_{K}$ over the ideal list. The numerator $Gain_{i}$ equals 1 when the positive triplet falls within the top $K$ positions and 0 otherwise. In addition, $i$ refers to the ranking position of the positive triplet.
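A minimal sketch of the three metrics is given below, assuming the 1-based filtered rank of each positive triplet has already been computed; with a single relevant item of gain 1 per list, the ideal DCG is $1/\log_{2}(2)=1$, so NDCG@$K$ reduces to the discounted gain of the positive triplet.

```python
import math


def mrr(ranks):
    # Mean reciprocal rank over all candidate lists.
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k=1):
    # Fraction of lists where the positive triplet is within the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k=5):
    # Discounted gain of the positive triplet, normalized by the ideal DCG (= 1).
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)
```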

Appendix C Efficiency Comparison

Table 4: Training time comparisons (minutes) for different negative sampling strategies with the RGCN backbone and DistMult decoder on the WN18RR dataset.
Number of Negative Samples
Model 50 20 10 5
RGCN 184 87 79 68
RGCN-P 182 94 79 67
RGCN-Adv 191 108 91 70
RGCN-MCNS 325 282 169 137
DANS 761 461 271 201

To compare the efficiency of different negative sampling methods, we report the runtime of DANS and other negative sampling strategies including random negative sampling in RGCN, popularity biased negative sampling in RGCN-P, self-adversarial negative sampling in RGCN-Adv and negative sampling via positive distribution in RGCN-MCNS in Table 4. We standardize the experimental setting and use the same RGCN backbone with DistMult scoring function for fair comparison. While DANS incurs more time than other negative sampling methods, the increase over more advanced methods like RGCN-MCNS is by a manageable constant factor. Furthermore, the growth in time is linear when more negative samples are generated.

Using 20 negative samples for reference, the results show that RGCN-Adv, which weighs the negative triplets, takes 108 minutes to complete, slightly more than the 87 minutes and 94 minutes of RGCN and RGCN-P. RGCN-MCNS, which samples from the positive distribution, requires more time (135 minutes) to finish. Overall, DANS, being a more complex model that generally performs better than the baselines, takes approximately 3 to 5 times longer to run. As the number of negative samples increases, the run-time gap between DANS and the other model variants widens, as DANS takes more computational resources to generate synthetic negative samples.

Appendix D Performance on the FB15k-237 dataset

Table 5: Performance comparison with baselines (all using the ComplEx decoder). See Table 2 caption for entry styles.
FB15k-237
Model MRR H@1 H@5 H@10 NDCG@5
ComplEx .211±.005 .128±.003 .234±.006 .381±.006 .128±.004
SANS-RW .195±.006 .153±.004 .204±.08 .267±.008 .119±.004
NSCaching .261±.007 .183±.006 .284±.007 .417±.006 .192±.008
KBGAN .259±.008 .169±.0106 .279±.007 .413±.010 .188±.006
CAKE .195±.007 .132±.004 .222±.006 .319±.009 .130±.003
SMiLE .179±.003 .114±.003 .197±.005 .284±.007 .112±.004
DANS .253±.007 .185±.006 .333±.009 .446±.011 .252±.007

The DistMult decoder we use on the other datasets tends to perform poorly on FB15k-237, as reported in Dettmers et al. [7], Ji et al. [14] and Zhou et al. [51]. Hence, on FB15k-237, we implement the baselines and DANS using the ComplEx decoder, as shown in Table 5. The results show that DANS still achieves competitive performance in comparison to the baselines. In particular, DANS achieves notable gains of 17.3%, 6.95% and 31.3% in terms of H@5, H@10 and NDCG@5, respectively.

Appendix E Case Study for Adaptiveness

Refer to caption

(a) Generated samples without FiLM

Refer to caption

(b) Generated samples with FiLM

Figure 6: Visualization of adaptiveness for relations $\mathtt{Animal.eat.vegetable}$ and $\mathtt{sports.school.in.country}$ on NELL-995. Best viewed in color.
Refer to caption

(a) Generated samples without FiLM

Refer to caption

(b) Generated samples with FiLM

Figure 7: Visualization of adaptiveness for relations $\mathtt{Domain\_Location\_of}$ and $\mathtt{Domain\_Property\_of}$ on UMLS. Best viewed in color.

We conduct additional experiments to qualitatively assess the impact of the FiLM layer on the output of the generator pathways through a case study. To demonstrate adaptiveness, we compare two different relations from each dataset and visualize the generated samples with and without the FiLM layer. The visualization on the WN18RR dataset has already been discussed in the main paper. Here, we include the visualizations on the NELL-995 and UMLS datasets in Figures 6 and 7, respectively, which exhibit patterns similar to those explained in the main paper.
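To make the conditioning mechanism behind these visualizations concrete, the sketch below shows a generic FiLM layer in the spirit of Perez et al.: a relation embedding is mapped to per-dimension scale and shift parameters that modulate the hidden features of a generator pathway, so that a single global generator behaves differently for different relations. This is a simplified illustration of feature-wise modulation under our own naming and dimension choices, not the exact DANS architecture.

import torch
import torch.nn as nn

class FiLMConditioner(nn.Module):
    """Feature-wise linear modulation (FiLM) of generator hidden features."""

    def __init__(self, rel_dim: int, hidden_dim: int):
        super().__init__()
        # The relation embedding produces per-dimension scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(rel_dim, hidden_dim)
        self.to_beta = nn.Linear(rel_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, rel_emb: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(rel_emb)   # (batch, hidden_dim)
        beta = self.to_beta(rel_emb)     # (batch, hidden_dim)
        return gamma * hidden + beta     # feature-wise affine modulation

# Toy usage: modulate 64-dimensional generator features with 32-dimensional relation embeddings.
film = FiLMConditioner(rel_dim=32, hidden_dim=64)
hidden = torch.randn(4, 64)
rel_emb = torch.randn(4, 32)
print(film(hidden, rel_emb).shape)  # torch.Size([4, 64])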

References

  • Ahrabian et al. [2020] Kian Ahrabian, Aarash Feizi, Yasmin Salehi, William L Hamilton, and Avishek Joey Bose. Structure aware negative sampling in knowledge graphs. arXiv preprint arXiv:2009.11355, 2020.
  • Bordes et al. [2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26, 2013.
  • Cai and Wang [2017] Liwei Cai and William Yang Wang. KBGAN: Adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071, 2017.
  • Cao et al. [2019] Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The world wide web conference, pages 151–161, 2019.
  • Carlson et al. [2010] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam Hruschka, and Tom Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the AAAI conference on artificial intelligence, pages 1306–1313, 2010.
  • Chen et al. [2020] Zhe Chen, Yuehan Wang, Bin Zhao, Jing Cheng, Xin Zhao, and Zongtao Duan. Knowledge graph completion: A review. IEEE Access, 8:192435–192456, 2020.
  • Dettmers et al. [2018] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 1811–1818, February 2018.
  • Du et al. [2021] Yu Du, Sylvie Ranwez, Nicolas Sutton-Charani, and Vincent Ranwez. Is diversity optimization always suitable? toward a better understanding of diversity within recommendation approaches. Information processing & management, 58(6):102721, 2021.
  • Fang et al. [2017] Yuan Fang, Kingsley Kuan, Jie Lin, Cheston Tan, and Vijay Chandrasekhar. Object detection meets knowledge graphs. In International Joint Conferences on Artificial Intelligence, 2017.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Hu et al. [2019] Binbin Hu, Yuan Fang, and Chuan Shi. Adversarial learning on heterogeneous information networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 120–129, 2019.
  • Isufi et al. [2021] Elvin Isufi, Matteo Pocchiari, and Alan Hanjalic. Accuracy-diversity trade-off in recommender systems via graph convolutions. Information Processing & Management, 58(2):102459, 2021.
  • Ji et al. [2016] Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In Proceedings of the AAAI conference on artificial intelligence, 2016.
  • Ji et al. [2020] Kexi Ji, Bei Hui, and Guangchun Luo. Graph attention networks with local structure awareness for knowledge graph completion. IEEE Access, 8:224860–224870, 2020.
  • Li et al. [2023] Jianping Li, Guozhen Tan, Xiao Ke, Huaiwei Si, and Yanfei Peng. Object detection based on knowledge graph network. Applied Intelligence, 53(12):15045–15066, 2023.
  • Li et al. [2021] Sirui Li, Kok Wai Wong, Chun Che Fung, and Dengya Zhu. Improving question answering over knowledge graphs using graph summarization. In Neural Information Processing: 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8–12, 2021, Proceedings, Part IV 28, pages 489–500. Springer, 2021.
  • Lin et al. [2015] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI conference on artificial intelligence, 2015.
  • Liu et al. [2020] Hai Liu, Kairong Hu, Fu-Lee Wang, and Tianyong Hao. Aggregating neighborhood information for negative sampling for knowledge graph embedding. Neural Computing and Applications, 32:17637–17653, 2020.
  • Lv et al. [2023] Wen Lv, Hongbo Shi, Shuai Tan, Bing Song, and Yang Tao. A dynamic semantic knowledge graph for zero-shot object detection. The Visual Computer, 39(10):4513–4527, 2023.
  • Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013.
  • Nathani et al. [2019] Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning attention-based embeddings for relation prediction in knowledge graphs. arXiv preprint arXiv:1906.01195, 2019.
  • Niu et al. [2022] Guanglin Niu, Bo Li, Yongfei Zhang, and Shiliang Pu. CAKE: A scalable commonsense-aware framework for multi-view knowledge graph completion. arXiv preprint arXiv:2202.13785, 2022.
  • Odena [2016] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
  • Oreshkin et al. [2018] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31, 2018.
  • Peng et al. [2022] Miao Peng, Ben Liu, Qianqian Xie, Wenjie Xu, Hua Wang, and Min Peng. SMiLE: Schema-augmented multi-level contrastive learning for knowledge graph link prediction. arXiv preprint arXiv:2210.04870, 2022.
  • Perez et al. [2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  • Qian [2021] Jing Qian. Understanding negative sampling in knowledge graph embedding. International Journal of Artificial Intelligence and Applications (IJAIA), 12(1), 2021.
  • Rendle et al. [2012] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618, 2012.
  • Schlichtkrull et al. [2018] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15, pages 593–607. Springer, 2018.
  • Shaffi et al. [2022] Noushath Shaffi, Faizal Hajamohideen, Mufti Mahmud, Abdelhamid Abdesselam, Karthikeyan Subramanian, and Arwa Al Sariri. Triplet-loss based siamese convolutional neural network for 4-way classification of alzheimer’s disease. In International Conference on Brain Informatics, pages 277–287. Springer, 2022.
  • Shen et al. [2022] Jianhao Shen, Chenguang Wang, Linyuan Gong, and Dawn Song. Joint language semantic and structure embedding for knowledge graph completion. arXiv preprint arXiv:2209.08721, 2022.
  • Di et al. [2021] Shimin Di, Quanming Yao, Yongqi Zhang, and Lei Chen. Efficient relation-aware scoring function search for knowledge graph embedding. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 1104–1115. IEEE, 2021.
  • Shin et al. [2019] Sangjin Shin, Xiongnan Jin, Jooik Jung, and Kyong-Ho Lee. Predicate constraints based question answering over knowledge graph. Information Processing & Management, 56(3):445–462, 2019.
  • Sun et al. [2019] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197, 2019.
  • Sun et al. [2020] Zhu Sun, Di Yu, Hui Fang, Jie Yang, Xinghua Qu, Jie Zhang, and Cong Geng. Are we evaluating rigorously? benchmarking recommendation for reproducible evaluation and fair comparison. In Proceedings of the 14th ACM Conference on Recommender Systems, pages 23–32, 2020.
  • Trouillon et al. [2016] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International conference on machine learning, pages 2071–2080. PMLR, 2016.
  • Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  • Wang et al. [2018] Peifeng Wang, Shuangyin Li, and Rong Pan. Incorporating gan for negative sampling in knowledge representation learning. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  • Wang et al. [2020] Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, and Tat-Seng Chua. Reinforced negative sampling over knowledge graph for recommendation. In Proceedings of the web conference 2020, pages 99–109, 2020.
  • Wang et al. [2014] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI conference on artificial intelligence, 2014.
  • Wu et al. [2016] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4622–4630, 2016.
  • Xi et al. [2021] Wu-Dong Xi, Ling Huang, Chang-Dong Wang, Yin-Yu Zheng, and Jian-Huang Lai. Deep rating and review neural network for item recommendation. IEEE Transactions on Neural Networks and Learning Systems, 33(11):6726–6736, 2021.
  • Xiong et al. [2017] Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. arXiv preprint arXiv:1707.06690, 2017.
  • Yang et al. [2014] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
  • Yang et al. [2020] Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. Understanding negative sampling in graph representation learning. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1666–1676, 2020.
  • Ying et al. [2018] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 974–983, 2018.
  • Zeng et al. [2020] Jiehang Zeng, Lu Liu, and Xiaoqing Zheng. Learning structured embeddings of knowledge graphs with adversarial learning framework. arXiv preprint arXiv:2004.07265, 2020.
  • Zhang et al. [2019] Yongqi Zhang, Quanming Yao, Yingxia Shao, and Lei Chen. NSCaching: Simple and efficient negative sampling for knowledge graph embedding. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 614–625. IEEE, 2019.
  • Zhao et al. [2022] Jianli Zhao, Hao Li, Lijun Qu, Qinzhi Zhang, Qiuxia Sun, Huan Huo, and Maoguo Gong. Dcfgan: An adversarial deep reinforcement learning framework with improved negative sampling for session-based recommender systems. Information sciences, 596:222–235, 2022.
  • Zhou et al. [2021] Xiaofei Zhou, Lingfeng Niu, Qiannan Zhu, Xingquan Zhu, Ping Liu, Jianlong Tan, and Li Guo. Knowledge graph embedding by double limit scoring loss. IEEE Transactions on Knowledge and Data Engineering, 34(12):5825–5839, 2021.
  • Zhou et al. [2022] Zhehui Zhou, Can Wang, Yan Feng, and Defang Chen. Jointe: Jointly utilizing 1d and 2d convolution for knowledge graph embedding. Knowledge-Based Systems, 240:108100, 2022.