
Diversified and Adaptive Negative Sampling on Knowledge Graphs

Ran Liu Zhongzhou Liu Xiaoli Li Hao Wu [email protected] Yuan Fang [email protected]
Abstract

In knowledge graph embedding, aside from positive triplets (i.e., facts in the knowledge graph), the negative triplets used for training also have a direct influence on the model performance. In reality, since knowledge graphs are sparse and incomplete, negative triplets often lack explicit labels, and thus they are often obtained from various sampling strategies (e.g., randomly replacing an entity in a positive triplet). An ideal sampled negative triplet should be informative enough to help the model train better. However, existing methods often ignore diversity and adaptiveness in their sampling process, which harms the informativeness of negative triplets. As such, we propose a generative adversarial approach called Diversified and Adaptive Negative Sampling (DANS) on knowledge graphs. DANS is equipped with a two-way generator that generates more diverse negative triplets through two pathways, and an adaptive mechanism that produces more fine-grained examples by localizing the global generator for different entities and relations. On the one hand, the two-way generator increases the overall informativeness with more diverse negative examples; on the other hand, the adaptive mechanism increases the individual, sample-wise informativeness with more fine-grained sampling. Finally, we evaluate the performance of DANS on three benchmark knowledge graphs to demonstrate its effectiveness through quantitative and qualitative experiments.

keywords:
Knowledge graphs, Graph representation learning, Graph neural networks, Negative sampling
journal: Neural Networks

Affiliations: (1) Singapore Management University, 81 Victoria St, Singapore 188065, Singapore; (2) Agency for Science, Technology and Research, 1 Fusionopolis Way, Singapore 138632, Singapore; (3) Beijing Normal University, 19 Xinwai Ave, Beitaipingzhuang, Haidian District, Beijing 100875, China

Highlights

We design a two-way generator to produce diverse negative triplets, to increase the overall informativeness.

We employ a FiLM layer to adapt the global generator model into local models, to increase the individual informativeness of the negative triplets.

We conduct extensive experiments on three benchmark datasets. The results demonstrate the superiority of our proposed approach.

1 Introduction

Knowledge graphs have been widely used to encode facts about the real world. Typically, each fact describes a relationship between a head and a tail entity in the form of a triplet $\langle$head, relation, tail$\rangle$, and different entities across facts are interconnected to form a graph structure. The rich facts contained in a large-scale knowledge graph can be used to enhance numerous applications that rely on real-world knowledge, such as question answering [41, 16, 33], object detection [9, 15, 19] and recommendation [4, 12, 8, 42]. To effectively exploit the facts for these applications, a common approach is to first perform knowledge graph embedding, which converts the symbolic entities and relations to a latent vector space. The learned embedding aims to capture relevant structural and semantic information in the knowledge graph, which can then be integrated with other machine learning models.

In this paper, we focus on the problem of knowledge graph embedding. The high-level idea is that the embedding vectors of entities and relations co-occurring in the same fact should be bounded by certain constraints due to their relatedness. For instance, consider a fact $\tau=\langle h=\mathtt{Beijing}, r=\mathtt{isCapitalOf}, t=\mathtt{China}\rangle$ and a classic method TransE [2]. TransE maps each entity and relation in the fact to vectors $\mathbf{e}_{h}$, $\mathbf{e}_{r}$, $\mathbf{e}_{t}$, respectively, so that they approximately satisfy the constraint $\mathbf{e}_{h}+\mathbf{e}_{r}\approx\mathbf{e}_{t}$ by minimizing the loss $\|\mathbf{e}_{h}+\mathbf{e}_{r}-\mathbf{e}_{t}\|$. On the contrary, a non-fact such as $\langle h=\mathtt{Beijing}, r=\mathtt{isCapitalOf}, t^{\prime}=\mathtt{Russia}\rangle$ would maximize the loss $\|\mathbf{e}_{h}+\mathbf{e}_{r}-\mathbf{e}_{t^{\prime}}\|$. Given this contrast, the factual triplets are known as positive triplets (or examples), whereas the non-factual triplets are called negative triplets. Although positive triplets are readily available, negative triplets are often obtained through random sampling. More recent works [43, 3, 1, 48, 30, 50] explore advanced constraints or losses [2, 36, 44] on the triplets, but the sampling strategy for negative triplets remains a crucial yet less explored problem.

Earlier negative sampling approaches resort to random sampling, e.g., by replacing the tail (or head) entity in a positive triplet with a random entity from the knowledge graph, sampled in a uniform [42] or popularity-weighted manner [20]. Although random sampling is straightforward, it is often inadequate to optimize the informativeness of negative triplets. Informativeness refers to how much information each negative triplet contributes to model learning. Intuitively, a more informative negative triplet would improve the efficiency of model training and accelerate model convergence. For instance, w.r.t. the positive triplet $\tau$ given earlier, $\langle\mathtt{Beijing}, \mathtt{isCapitalOf}, \mathtt{Russia}\rangle$ is considered a more informative negative triplet than $\langle\mathtt{Beijing}, \mathtt{isCapitalOf}, \mathtt{KFC}\rangle$, as the latter can be easily identified as negative and thus helps little in refining the decision boundary. Although various scoring functions [2, 36, 44, 32, 31] help to judge the informativeness of negative triplets, they do not consider the diversity and adaptiveness of the sampling process, which are the two aspects we propose to study in this work.

On one hand, diversity helps to increase the overall informativeness of all the negative triplets collectively. We observe that negative triplets can be associated with both entities and relations. For example, the tail entity of the positive triplet $\tau$ can be replaced by entities associated not only with the head entity $\mathtt{Beijing}$, such as $\mathtt{GreatWall}$ and $\mathtt{Shanghai}$, but also with the relation $\mathtt{isCapitalOf}$, such as $\mathtt{Russia}$ (a country with some capital city) and $\mathtt{London}$ (the capital city of some country). On the other hand, adaptive sampling of negative triplets would make entity- or relation-specific adjustments to sample selection, which increases the individual informativeness of each triplet in a finer-grained manner. For instance, selecting a tail entity for $\mathtt{Beijing}$, $\mathtt{Tokyo}$ or $\mathtt{KFC}$ using a global sampling model could be suboptimal given the variability among these entities. Instead, local models that condition on each entity would be able to adapt to such differences and make each triplet more informative.

In view of the above, we propose a Diversified and Adaptive Negative Sampling (DANS) approach for knowledge graph embedding, to improve both the overall and individual informativeness of negative triplets. Similar to previous state-of-the-art approaches such as KBGAN [3], we adopt a generative adversarial network (GAN) [38] for the generation of negative samples. However, there are two significant differences from previous GAN-based negative sampling on the knowledge graph. First, we design a two-way generator to produce diversified samples that are associated with both entities and relations w.r.t. a positive triplet, which aims to increase the overall informativeness of the samples. More specifically, the generator consists of two pathways to produce two different kinds of negative triplets associated with a given entity and entity-relation, respectively. Second, we design an adaptive mechanism to modulate the global generator model into local models to handle the differences across entities and relations, which aims to increase the individual informativeness of the samples in a finer-grained manner. In particular, we employ a Feature-wise Linear Modulation (FiLM) layer [26] that conditions the generator on a given entity or entity-relation input. In summary, we make the following contributions:

  • We design a two-way generator to produce diverse negative triplets, to increase the overall informativeness.

  • We employ a FiLM layer to adapt the global generator model into local models, to increase the individual informativeness of the negative triplets.

  • We conduct extensive experiments on three benchmark datasets. The results demonstrate the superiority of our proposed approach.

2 Background

Negative sampling is an important issue in various machine learning tasks such as recommendation systems [28] and natural language processing [20]. In the context of knowledge graph embedding, negative triplets are often constructed by replacing the tail or head entity in a positive triplet with a randomly sampled entity [2, 40, 17]. Unfortunately, in uniform [2] or popularity-weighted sampling [20], the sampled entity could be completely unrelated to the head or the relation, and therefore be less informative.

To sample more informative negative triplets, researchers have leveraged different heuristics or learning strategies. Several structure-aware models [1, 46, 18] exploit the graph structure, generally selecting negative examples in the neighborhood of positive examples. For example, SANS [1] hypothesizes that entities that are in close proximity to each other, but do not share a direct relationship, are better candidates for negative sampling. In a similar spirit, PinSage [18] generates localized graphs via random walks to extract informative negative samples. However, these approaches have a high risk of selecting false negatives, as entities in close proximity that are not explicitly related could still form positive triplets due to the incompleteness of the observed graph.

Other approaches seek to quantify the informativeness of negative triplets through various learning strategies, including GANs [3, 38, 11, 47], reinforcement learning [39, 49], and importance sampling [48]. These methods provide a more explicit and systematic scoring of negative triplets, which often leads to better performance. However, they do not consider the diversity and adaptiveness of negative sampling, which are crucial to the overall and individual informativeness of the negative triplets, respectively.

Besides, recent studies [45, 27] show that the optimal negative sampling distribution should be positively but sub-linearly correlated to the positive sampling distribution. Although our proposed model shares a similar view by learning the underlying distribution of positive samples to produce negative samples, we take one step further to consider the diversity and adaptiveness of the negative samples in an adversarial setting. In particular, toward adaptiveness, we borrow the idea from Feature-wise Linear Modulation (FiLM) [26], which was first introduced in the area of visual question answering. Its mechanism includes a learnable feature-wise affine transformation on the hidden neurons of a neural network, conditioned on an arbitrary input. In our context, we employ a FiLM layer to adapt the global generators into local models conditioned on individual input (entity or relation).

3 Methodology

In this section, we introduce the problem formulation and some preliminaries on knowledge graph embedding, followed by our proposed approach DANS.

Refer to caption
Figure 1: Overall framework of DANS. The toy example only shows how to generate fake tail entities, while generating fake head entities follows a similar process.

Before we delve into the details, we first sketch the overall framework in Figure 1. The model consists of four main parts: (a) a base embedding model, which learns the embeddings for entities and relations; (b) the two-way adaptive generator, which generates “fake” entity samples to construct negative examples; (c) the two-way discriminator, which utilizes both adversarial and auxiliary losses to improve the quality of the produced samples; (d) model training with negative sampling, where we replace one entity in a positive triplet with a generated fake entity to form negative triplets, and train the base model together with the original positive triplets.

3.1 Problem formulation and preliminaries

A knowledge graph (KG) is defined by an entity (node) set $\mathcal{V}$, a relation set $\mathcal{R}$ and a ground-truth or positive triplet (edge) set $\mathcal{E}$. Given a triplet $\tau=\langle h,r,t\rangle$ for some $h,t\in\mathcal{V}$ and $r\in\mathcal{R}$, a typical KG model aims to learn a scoring function $\mathcal{F}(\tau)$ to estimate the probability that $\tau$ is a positive triplet, i.e., $\tau$ is a fact that should appear in the ground-truth set $\mathcal{E}$.

Given the power of graph convolutional networks, in this paper we adopt a multi-layer relational graph convolutional network (RGCN) [29] as our base embedding model in Figure 1(a). The base model encodes the entities in layer $l+1$ into vectors $\mathbf{e}_{i}^{l+1}\in\mathbb{R}^{d^{l+1}}$ in a latent embedding space, by aggregating their embeddings $\mathbf{e}_{j}^{l}\in\mathbb{R}^{d^{l}}$ from the previous layer $l$, as follows.

\mathbf{e}_{i}^{l+1}=\textsc{ReLU}\left(\sum_{r\in\mathcal{R}}\sum_{j\in\mathcal{N}_{i}^{r}}\frac{1}{|\mathcal{N}_{i}^{r}|}W_{r}^{l}\mathbf{e}_{j}^{l}+W_{0}^{l}\mathbf{e}_{i}^{l}\right), \quad (1)

where $\mathcal{N}_{i}^{r}$ is the set of neighbors of entity $i$ under relation $r$, $W_{r}^{l}$ is a trainable weight matrix for $r$, $W_{0}^{l}$ is an additional trainable weight matrix to capture the self-information of each entity in layer $l$, and ReLU is the activation function. Assuming a total of $L$ layers are stacked, the embeddings in the last layer are the output embeddings, which we simply write as $\mathbf{e}_{i}\in\mathbb{R}^{d}, \forall i\in\mathcal{V}$.
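As an illustration of Eq. (1), the following PyTorch sketch implements one RGCN message-passing layer with a dense per-relation weight matrix. It is a simplified, assumption-level example (no basis decomposition or sparse optimizations), and the class and argument names are our own, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SimpleRGCNLayer(nn.Module):
    """One message-passing layer following Eq. (1); illustrative only."""

    def __init__(self, num_relations: int, d_in: int, d_out: int):
        super().__init__()
        # One weight matrix W_r per relation, plus W_0 for self-information.
        self.W_r = nn.Parameter(torch.randn(num_relations, d_in, d_out) * 0.01)
        self.W_0 = nn.Linear(d_in, d_out, bias=False)
        self.num_relations = num_relations

    def forward(self, e: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # e: (num_entities, d_in) embeddings from layer l.
        # edges: (num_edges, 3) long tensor with rows (head i, relation r, tail j).
        i, r, j = edges[:, 0], edges[:, 1], edges[:, 2]
        # Message W_r e_j for every edge.
        msg = torch.einsum('eio,ei->eo', self.W_r[r], e[j])
        # Normalize by |N_i^r|, the number of r-neighbors of entity i.
        key = i * self.num_relations + r
        counts = torch.zeros(e.size(0) * self.num_relations).index_add_(
            0, key, torch.ones(len(key)))
        msg = msg / counts[key].unsqueeze(-1)
        # Aggregate messages and add the self-loop term W_0 e_i.
        out = self.W_0(e).index_add(0, i, msg)
        return torch.relu(out)
```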

To optimize the parameters, a set of training triplets $\mathcal{D}_{\text{tr}}$ that consists of both positive and negative triplets is used. As shown in Figure 1(d), our objective is to sample a set of high-quality negative triplets, which, together with positive triplets, will be used to minimize the following cross-entropy loss:

-\sum_{\tau\in\mathcal{D}_{\text{tr}}}y_{\tau}\log\mathcal{F}(\tau)+\left(1-y_{\tau}\right)\log\left(1-\mathcal{F}(\tau)\right), \quad (2)

where $y_{\tau}=1$ if $\tau\in\mathcal{E}$, else $y_{\tau}=0$. We implement $\mathcal{F}$ using three popular decoders, namely, DistMult [44], ComplEx [36] and RotatE [34]. We provide the DistMult function below, and leave the details of ComplEx and RotatE to Appendix A.

\mathcal{F}(\langle h,r,t\rangle)=\sigma(\mathbf{e}_{h}^{\top}\text{Diag}(\mathbf{e}_{r})\mathbf{e}_{t}), \quad (3)

where $\sigma$ is the sigmoid activation, $\mathbf{e}_{h},\mathbf{e}_{t}$ are the head and tail entity embeddings from RGCN, and $\text{Diag}(\mathbf{e}_{r})\in\mathbb{R}^{d\times d}$ is a diagonal matrix whose diagonal is $\mathbf{e}_{r}$, an $r$-specific trainable vector of the decoder. Therefore, the full set of training parameters of the base model is $\Theta=\{W_{r}^{l}:r\in\mathcal{R},l\leq L\}\cup\{W_{0}^{l}:l\leq L\}\cup\{\mathbf{e}_{r}:r\in\mathcal{R}\}$.
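To make the decoder and training objective concrete, below is a minimal sketch of the DistMult score in Eq. (3) and the cross-entropy loss in Eq. (2); the function names and tensor layout are our own assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F


def distmult_score(e_h, e_r, e_t):
    # sigma(e_h^T Diag(e_r) e_t) reduces to a sum over element-wise products.
    return torch.sigmoid((e_h * e_r * e_t).sum(dim=-1))


def triplet_loss(scores, labels):
    # Binary cross-entropy over positive (y=1) and negative (y=0) triplets, Eq. (2).
    return F.binary_cross_entropy(scores, labels)
```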

3.2 Adaptive two-way generator

A common way to obtain a negative triplet is to replace the tail (or head) entity in a positive triplet by a randomly sampled entity. Beyond simple random sampling, generative adversarial nets (GAN) [10] such as KBGAN [3], IGAN [38], HeGAN [11] and GNDN [47], which learn the underlying sample distributions, have been shown to be effective in negative sampling on KG or other graph structures.

Formally, given a positive triplet $\langle h,r,t\rangle$, a generator $G$ aims to produce a “fake” tail entity $t^{\prime}$ to replace the real tail $t$, resulting in a negative triplet $\langle h,r,t^{\prime}\rangle$. More precisely, $G$ is a function that maps a noise vector $\epsilon$ (typically sampled from a prior distribution) to a vector $\mathbf{e}_{t^{\prime}}$ in the entity embedding space. Although we follow a similar process, distinct from existing GAN-based approaches, we propose an adaptive two-way generator, as shown in Figure 1(b). It not only diversifies the generation of fake entities, but also localizes the global generator model to adapt to fine-grained differences across entities.

Diversity. Classical GANs generate fake samples through a single pathway and assume a fixed prior distribution, which limits the diversity of fake entity generation. Particularly, in the context of KG, we can generate a fake tail entity associated with either the head entity only, or the relation as well. This improves the diversity of the resulting negative triplets and increases the overall informativeness. Hence, we propose a two-way generator that consists of two pathways, namely $G_{E}$ and $G_{R}$, to generate negative triplets associated with a given entity and entity-relation pair, respectively. Furthermore, having personalized priors for each entity or relation further enhances the diversification. Specifically, to replace the tail entity in a positive triplet $\langle h,r,t\rangle$ (the same process also applies to replacing the head entity $h$), we generate fake tail entity embeddings $\mathbf{e}_{t^{\prime}}$ and $\mathbf{e}_{t^{\prime\prime}}$ from the two pathways, as follows.

\mathbf{e}_{t^{\prime}} = G_{E}(\epsilon;\Theta_{G_{E}}), \quad \text{s.t. } \epsilon\sim N(\mathbf{e}_{h},\sigma^{2}I), \quad (4)
\mathbf{e}_{t^{\prime\prime}} = G_{R}(\epsilon;\Theta_{G_{R}}), \quad \text{s.t. } \epsilon\sim N(\mathbf{e}_{h}\otimes\mathbf{e}_{r},\sigma^{2}I), \quad (5)

where each pathway has its own parameters, i.e., $G_{E}$ parameterized by $\Theta_{G_{E}}$ and $G_{R}$ parameterized by $\Theta_{G_{R}}$. The noise vector $\epsilon$ that feeds into each pathway is sampled from a personalized multivariate Gaussian distribution for each entity/relation, $N(\mathbf{e}_{h},\sigma^{2}I)$ or $N(\mathbf{e}_{h}\otimes\mathbf{e}_{r},\sigma^{2}I)$ depending on the pathway. $N$ denotes the prior Gaussian distribution for sampling the generator input $\epsilon$, $\sigma$ is a hyper-parameter controlling the covariance of the multivariate Gaussian, $I$ is the identity matrix, and $\otimes$ stands for element-wise multiplication. Intuitively, as the prior Gaussian distributions in Eqs. (4) and (5) are centered on different embeddings, $\mathbf{e}_{h}$ or $\mathbf{e}_{h}\otimes\mathbf{e}_{r}$, they help to diversify the generated samples from the two pathways.

Each pathway is implemented as a multi-layer perceptron (MLP). Taking $G_{E}$ as an example, its MLP is parameterized by $\Theta_{G_{E}}$, which consists of the weights and biases in each layer. Let $\mathbf{x}_{G_{E}}^{m}$ denote the activations of the $m$-th MLP layer, where the activations of the last MLP layer are simply the output embedding of $G_{E}$. The architecture of $G_{R}$ mirrors that of $G_{E}$.
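To make Eqs. (4)-(5) concrete, the following is a minimal sketch of the two generator pathways with personalized Gaussian priors; the MLP sizes, the default `sigma`, and the function names are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


def make_pathway(d: int, hidden: int = 128) -> nn.Sequential:
    # A small MLP standing in for G_E or G_R.
    return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))


def generate_fakes(G_E, G_R, e_h, e_r, sigma: float = 0.1):
    # Pathway G_E: noise centered on the head entity embedding e_h, Eq. (4).
    eps_e = e_h + sigma * torch.randn_like(e_h)
    e_t_prime = G_E(eps_e)
    # Pathway G_R: noise centered on the element-wise product e_h * e_r, Eq. (5).
    eps_r = e_h * e_r + sigma * torch.randn_like(e_h)
    e_t_dprime = G_R(eps_r)
    return e_t_prime, e_t_dprime
```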

Adaptiveness. While more diverse samples help increase the overall informativeness, it is also important to improve the informativeness of individual samples. On the one hand, a single global generator model shared by all input entities or relations is unable to fully adapt to fine-grained differences across them. On the other hand, training one model for each entity or relation can cause severe overfitting and incur large overheads. To address this dilemma, we still train a global generator model, but allow the global model to be modulated through a Feature-wise Linear Modulation (FiLM) layer conditioned on each input entity or relation, which essentially adapts the shared global model into local models. Thus, in addition to the global model parameters, the adaptive mechanism only needs to learn the parameters of the FiLM layer, instead of one set of model parameters for each entity or relation.

Consider the pathway $G_{E}$, which generates a fake tail entity for a head entity $h$. We adapt the global model $G_{E}$ to suit the head entity $h$ by modulating the activations in each hidden layer of $G_{E}$:

\tilde{\mathbf{x}}_{G_{E}}^{m}=\mathbf{x}_{G_{E}}^{m}\otimes\alpha_{h}^{m}+\beta_{h}^{m}, \quad (6)

where $\alpha_{h}^{m}$ and $\beta_{h}^{m}$ are vectors conditioned on the head entity $h$ and have the same dimension as the $m$-th layer of $G_{E}$. They are used to scale and shift the activations $\mathbf{x}_{G_{E}}^{m}$ of the $m$-th layer of $G_{E}$. That is, the global $G_{E}$ is adapted into a local model conditioned on $h$. More specifically, $\alpha_{h}^{m}$ and $\beta_{h}^{m}$ are the output of the FiLM layer $F_{E}$ applied to the $m$-th layer of $G_{E}$, as follows.

\alpha_{h}^{m}=F_{E}(\mathbf{e}_{h};\Theta_{F_{E},\alpha}^{m}), \quad (7)
\beta_{h}^{m}=F_{E}(\mathbf{e}_{h};\Theta_{F_{E},\beta}^{m}). \quad (8)

Note that the head entity embedding $\mathbf{e}_{h}$ is the input to $F_{E}$, making the output adaptive to and conditioned on $h$. $F_{E}$ can be implemented as an MLP, parameterized by $\Theta_{F_{E},\alpha}^{m}$ and $\Theta_{F_{E},\beta}^{m}$ for the $m$-th layer of $G_{E}$. Similarly, the second pathway $G_{R}$ can be modulated by a FiLM layer $F_{R}$, whose input is $\mathbf{e}_{h}\otimes\mathbf{e}_{r}$, to generate a fake tail entity for a head entity $h$ and relation $r$. $F_{R}$ is parameterized by $\Theta_{F_{R},\alpha}^{m}$ and $\Theta_{F_{R},\beta}^{m}$ for the $m$-th layer of $G_{R}$, and outputs $\alpha_{h,r}^{m}$ and $\beta_{h,r}^{m}$ to scale and shift the activations in $G_{R}$.

To sum up, the trainable parameters in the adaptive two-way generator, $\Theta_{G}$, include the weights of the two global pathways and the FiLM layer weights for each layer in each pathway. Assuming a total of $M$ hidden layers in the global pathways, we have $\Theta_{G}=\{\Theta_{G_{E}},\Theta_{G_{R}}\}\cup\{\Theta_{F_{E},\alpha}^{m},\Theta_{F_{E},\beta}^{m},\Theta_{F_{R},\alpha}^{m},\Theta_{F_{R},\beta}^{m}:m\leq M\}$.
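The sketch below illustrates how a FiLM layer can modulate one hidden layer of a generator pathway following Eqs. (6)-(8). Here the FiLM network is a single linear layer producing both scale and shift, and the pathway has one hidden layer; these are simplifying assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn


class FiLMPathway(nn.Module):
    """A generator pathway whose hidden layer is FiLM-modulated; illustrative only."""

    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(d, hidden)
        self.fc2 = nn.Linear(hidden, d)
        # FiLM layer: outputs (alpha, beta) for the hidden layer, conditioned
        # on the input embedding (e_h for G_E, or e_h * e_r for G_R).
        self.film = nn.Linear(d, 2 * hidden)

    def forward(self, eps: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        alpha, beta = self.film(cond).chunk(2, dim=-1)   # Eqs. (7)-(8)
        x = torch.relu(self.fc1(eps))
        x = x * alpha + beta                             # Eq. (6): scale and shift
        return self.fc2(x)
```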

3.3 Two-way discriminator

As in a standard GAN architecture, a discriminator is needed to help the generator produce high-quality fake entities that mimic real entities. Specifically, the discriminator and the generator compete with each other in a minimax game, in which the generator aims to fool the discriminator by producing realistic looking entities, while the discriminator aims to beat the generator by distinguishing the real and fake entities. In our case, given the two-way generator, we further equip the discriminator with the ability to distinguish the fake entities generated by the two pathways, which can further differentiate and diversify the two pathways.

Concretely, as shown in Figure 1(c), the discriminator also has two pathways: $D_{\text{Adv}}$, an adversarial pathway to distinguish fake and real entities, and $D_{\text{Aux}}$, an auxiliary pathway to distinguish fake entities generated by $G_{E}$ and $G_{R}$. Taking the generation of tail entities as an example, given the real tail entity $t$ in a positive triplet, as well as the fake entities $t^{\prime}$ generated by $G_{E}$ and $t^{\prime\prime}$ generated by $G_{R}$, $D_{\text{Adv}}$ tries to distinguish $t$ from $t^{\prime}$ and $t^{\prime\prime}$, while $D_{\text{Aux}}$ tries to distinguish $t^{\prime}$ from $t^{\prime\prime}$. In other words, each of them involves a binary classification:

\hat{y}_{\text{Adv},i} = D_{\text{Adv}}(\tilde{\mathbf{e}}_{i};\Theta_{D_{\text{Adv}}}), \quad (9)
\hat{y}_{\text{Aux},i} = D_{\text{Aux}}(\tilde{\mathbf{e}}_{i};\Theta_{D_{\text{Aux}}}), \quad \text{s.t. } i\neq t, \quad (10)

where $D_{\text{Adv}}$ and $D_{\text{Aux}}$ are each implemented as a fully connected layer, and $\tilde{\mathbf{e}}_{i}=\textsc{MLP}(\mathbf{e}_{i};\Theta_{D_{S}})$ is a shared hidden representation computed from the embedding $\mathbf{e}_{i}$ of a real or fake entity $i$. The shared hidden representation allows $D_{\text{Adv}}$ and $D_{\text{Aux}}$ to benefit from each other during training, as in Odena [23], since they collectively try to distinguish three different classes of samples ($t$, $t^{\prime}$, $t^{\prime\prime}$).

Note that $\hat{y}_{\text{Adv},i}$ (or $\hat{y}_{\text{Aux},i}$) is the predicted value of the ground-truth label $y_{\text{Adv},i}$ (or $y_{\text{Aux},i}$), such that $y_{\text{Adv},i}=1$ if $i$ is a real entity, else $y_{\text{Adv},i}=0$. Furthermore, for a fake entity $i$, we define $y_{\text{Aux},i}=1$ if $\mathbf{e}_{i}$ is generated via Eq. (4), or $0$ if generated via Eq. (5). Subsequently, we employ a cross-entropy loss on the two discriminator pathways:

\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},i},y_{\text{Adv},i})=-y_{\text{Adv},i}\log\hat{y}_{\text{Adv},i}-(1-y_{\text{Adv},i})\log(1-\hat{y}_{\text{Adv},i}), \quad (11)
\mathcal{L}_{\text{Aux}}(\hat{y}_{\text{Aux},i},y_{\text{Aux},i})=-y_{\text{Aux},i}\log\hat{y}_{\text{Aux},i}-(1-y_{\text{Aux},i})\log(1-\hat{y}_{\text{Aux},i}). \quad (12)

In summary, the set of trainable parameters of the two-way discriminator includes the shared parameters and the weights of each classifier, i.e., $\Theta_{D}=\{\Theta_{D_{S}},\Theta_{D_{\text{Adv}}},\Theta_{D_{\text{Aux}}}\}$.
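For concreteness, a minimal sketch of the two-way discriminator with a shared hidden representation and the two heads of Eqs. (9)-(10) is given below; the layer sizes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn


class TwoWayDiscriminator(nn.Module):
    """Shared MLP followed by an adversarial head and an auxiliary head."""

    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.adv_head = nn.Linear(hidden, 1)   # D_Adv: real vs. fake, Eq. (9)
        self.aux_head = nn.Linear(hidden, 1)   # D_Aux: G_E fake vs. G_R fake, Eq. (10)

    def forward(self, e: torch.Tensor):
        h = self.shared(e)
        y_adv = torch.sigmoid(self.adv_head(h))
        y_aux = torch.sigmoid(self.aux_head(h))
        return y_adv, y_aux
```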

3.4 Adversarial training

Lastly, we train the generator, discriminator, and base embedding model jointly. On the one hand, the generator aims to fool the adversarial pathway of the discriminator, making it harder for $D_{\text{Adv}}$ to distinguish real and fake entities, as below.

\arg\max_{\Theta_{G}}\ \mathbb{E}_{t}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},t},1)+\mathbb{E}_{t^{\prime}}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},t^{\prime}},0)+\mathbb{E}_{t^{\prime\prime}}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},t^{\prime\prime}},0)+\lambda\sum_{m,h}\left(\|\alpha_{h}^{m}-\mathbf{1}\|^{2}+\|\beta_{h}^{m}\|^{2}\right), \quad (13)

where $t$ is a real tail entity, and $t^{\prime},t^{\prime\prime}$ are fake tail entities from $G_{E}$ and $G_{R}$, respectively (again, we only illustrate the case where the tail entity in a positive triplet is replaced). The last term in Eq. (13) is a regularization term on the scaling and shifting factors to prevent overfitting, as in Oreshkin et al. [24], and $\lambda$ is a hyper-parameter to control the strength of regularization. On the other hand, the goal of the discriminator is to overcome the generator by distinguishing fake and real entities, as well as fake entities from different generator pathways, as follows.

\arg\min_{\Theta_{D}}\ \mathbb{E}_{i}\,\mathcal{L}_{\text{Adv}}(\hat{y}_{\text{Adv},i},y_{\text{Adv},i})+\mathbb{E}_{i\neq t}\,\mathcal{L}_{\text{Aux}}(\hat{y}_{\text{Aux},i},y_{\text{Aux},i}), \quad (14)

where $i$ can be either a real or fake entity in the first term, but $i\neq t$ can only be a fake entity in the second term.

Following a typical adversarial training scheme for negative sampling on knowledge graphs, as in KBGAN [3], we alternate the model updates among the three parties as follows. First, we train the generator by updating the generator parameters $\Theta_{G}$ with Eq. (13), while freezing the discriminator parameters $\Theta_{D}$ and the base model parameters $\Theta$. Next, we update $\Theta_{D}$ with Eq. (14), while freezing $\Theta_{G}$ and $\Theta$. Finally, we update $\Theta$ by minimizing the loss on the positive and negative triplets in Eq. (2), while freezing the other two parameter sets. We repeat the three steps until all parties converge.
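The alternating scheme can be summarized by the sketch below. Here `losses` and `optimizers` are hypothetical handles to the three parties' objectives and optimizers (the generator "loss" being the negation of Eq. (13), since optimizers minimize); they are illustrative assumptions, not part of the paper's code.

```python
def adversarial_training_step(batch, losses, optimizers):
    """One round of the alternating scheme: generator, then discriminator, then base model.

    `losses` and `optimizers` are dicts keyed by 'G', 'D' and 'base'; only the
    active party's optimizer steps, so the other two parameter sets stay frozen.
    """
    for party in ('G', 'D', 'base'):
        optimizers[party].zero_grad()
        losses[party](batch).backward()   # Eq. (13) for 'G' (negated), Eq. (14) for 'D', Eq. (2) for 'base'
        optimizers[party].step()
```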

4 Experiments

We perform empirical evaluation on three benchmark knowledge graphs. We first compare the empirical performance of the proposed model DANS (code is included in the Supplementary Materials) with state-of-the-art baselines. In addition, we seek to address a number of research questions (RQ) through more in-depth empirical analysis. RQ1: Does the two-way design in the generator improve model performance? RQ2: Does the adaptive FiLM layer in the generator improve model performance? RQ3: What is the impact of the number of negative triplets and the adaptive regularization, respectively? RQ4: Can we observe the diversity and adaptiveness of generated triplets?

4.1 Experimental Design

Table 1: Statistics of datasets.
Dataset Entities Relations Train Val Test Total
WN18RR 40,943 11 86,835 3,034 3,134 93,003
NELL-995 75,492 200 149,678 543 3,992 154,213
UMLS 135 46 5,216 652 661 6,529

Datasets. Three benchmark knowledge graphs are used for our experiments. (1) WN18RR [2], a harder variant of WN18 [7], is derived from WordNet and consists of hyponym and hypernym relations between words. Compared to WN18, WN18RR removes inverse relations to minimize leakage from training. (2) NELL-995 [43] is a subset of the web-based facts collected by the 995th iteration of the NELL system [5]; it contains a large pool of entity types, and only the top 200 relations are retained. (3) UMLS [7] is a specialized knowledge base containing medical entities and their semantic relationships. The entities are biomedical concepts (e.g., disease, antibiotic), and the relations include interactions such as $\mathtt{treats}$ and $\mathtt{diagnosis}$. Table 1 gives a summary of the datasets used.

Task and evaluation. We employ the standard knowledge graph completion task [21, 4, 13, 6]. Specifically, for each positive test triplet, we construct a list of candidate triplets that also includes negative triplets, which are obtained by replacing either the head or the tail of the positive triplet with every other entity in the dataset. To avoid false negatives, we follow previous work (Bordes et al. [2]) by adopting their “filtered setting”. We then rank the candidate triplets based on the scoring function. For evaluation, we adopt several standard ranking metrics including Mean Reciprocal Ranking (MRR), Hit ratio at 1 (H@1) and Normalized discounted cumulative gain at 5 (NDCG@5) [35]. Details of these ranking metrics can be found in Appendix B.

Baselines. We compare with baselines in two distinct categories:

(1) Negative samplers with the same RGCN backbone [29] and decoders. In other words, they are flexible “plug-ins” that only replace the sampling strategy for a fair comparison to our method DANS. They include Rand, which replaces the head or tail entity with a uniformly sampled random entity; Pop [20]: a variant of Rand that substitutes uniform sampling with popularity-weighted sampling; Self-adv [35]: a self-adversarial negative sampling methodology; MCNS [45]: a model which derives negative samples from a distribution that is positively but sub-linearly correlated with the positive distribution.

(2) Other state-of-the-art baselines for knowledge graph embedding which may employ a variety of different backbones, heuristics and techniques that diverge from DANS, for a comprehensive comparison. They include SANS-RW [1]: a structure-aware model that selects negative samples in close proximity to positive nodes via random walks on the graph; NSCaching [48]: a model that employs importance sampling to sample more informative negative triplets; KBGAN [3]: a GAN-based model that learns to generate informative negative triplets; CAKE [22]: a framework which leverages extra information such as entity types from factual triplets to sample negative triplets; SMiLE [25]: a framework which employs specific contextual information influenced by entity types to sample negative triplets.

Parameter settings. Our model DANS and the other negative samplers (Rand, Pop, Self-adv and MCNS) employ RGCN [29] as the backbone, following JinheonBaek's PyTorch implementation. RGCN is first pre-trained for 15,000 epochs, and our base embedding model is then initialized using the pre-trained weights. We train the model for 5,000 epochs, using a learning rate of 0.001 and a mini-batch size of 1,000 for UMLS, WN18RR and NELL-995. In each mini-batch, the generator and discriminator epochs are set to 5 and 1, respectively, and their learning rates are set to 1e-3 and 1e-4, respectively. The regularization coefficient $\lambda$ for the FiLM layer in Eq. (13) is set to 1e-4 for all three datasets, as it is optimal among the candidate set {1e-2, 1e-3, 1e-4, 1e-5, 1e-6}.

Furthermore, we generate $N_{s}=20$ negative triplets for each positive triplet, out of which the first ten are equally split between the two generator pathways, while the remaining ten are obtained via uniform random sampling to further increase the diversity. In all cases, either the head or the tail of a positive triplet is randomly replaced with a negative entity, but not both. We set the output embedding dimension $d$ to 100 for all methods, except SANS-RW, where $d$ is set to the recommended 1,000 to achieve optimal performance. RGCN, RGCN-P, RGCN-Adv and RGCN-MCNS follow the same implementation and settings as the backbone of DANS.

In addition, the hyper-parameters related to negative sampling via Metropolis-Hastings in RGCN-MCNS are copied from the link prediction experiments in the original paper (Yang et al. [45]). To reduce the variance resulting from parameter initialization, the experimental results of all methods are averaged over five runs with different seeds. Furthermore, every method is standardized to use the triplet loss in Eq. (2). Other baseline settings have also been tuned according to the recommendations in the literature. Additional details can be found in Appendix C.

4.2 Results and Analysis

Table 2: Performance comparison with other negative sampling methods, which are plugged into the same backbone (RGCN) and decoders (DistMult, RotatE or ComplEx). The best results are in bold, and the runner-ups are underlined.
Sampling WN18RR NELL-995 UMLS
method MRR H@1 NDCG@5 MRR H@1 NDCG@5 MRR H@1 NDCG@5
DistMult
Rand .372±.002 .343±.003 .369±.005 .218±.001 .146±.002 .219±.002 .696±.010 .607±.082 .693±.007
Pop .374±.002 .342±.002 .376±.006 .216±.001 .142±.002 .216±.003 .680±.009 .589±.012 .692±.005
Self-adv .370±.007 .332±.010 .373±.006 .238±.003 .156±.003 .241±.003 .717±.009 .624±.015 .733±.008
MCNS .376±.004 .340±.005 .374±.006 .226±.002 .144±.002 .221±.003 .700±.002 .606±.008 .717±.002
DANS .381±.006 .352±.007 .386±.008 .227±.004 .162±.007 .220±.009 .724±.008 .641±.009 .725±.008
RotatE
Rand .234±.009 .110±.003 .260±.008 .182±.003 .093±.003 .189±.003 .817±.015 .683±.021 .855±.013
Pop .235±.007 .095±.003 .268±.007 .181±.002 .131±.002 .200±.003 .800±.005 .673±.010 .839±.004
Self-adv .202±.007 .058±.010 .235±.006 .186±.002 .096±.003 .194±.002 .809±.007 .677±.007 .848±.007
MCNS .242±.009 .132±.004 .288±.006 .194±.003 .122±.004 .200±.004 .822±.005 .682±.006 .884±.006
DANS .249±.002 .154±.001 .274±.003 .195±.010 .135±.011 .208±.010 .833±.004 .716±.006 .866±.005
ComplEx
Rand .386±.007 .346±.005 .390±.006 .245±.004 .172±.003 .251±.006 .898±.008 .822±.017 .920±.015
Pop .389±.011 .341±.007 .387±.012 .241±.005 .179±.006 .245±.004 .840±.009 .747±.009 .865±.008
Self-adv .375±.006 .329±.011 .382±.013 .250±.005 .181±.007 .277±.008 .908±.009 .844±.006 .925±.010
MCNS .392±.008 .343±.007 .394±.008 .248±.007 .177±.004 .264±.009 .879±.007 .835±.005 .892±.011
DANS .404±.005 .347±.004 .392±.009 .257±.006 .186±.010 .255±.008 .920±.007 .857±.011 .927±.008
Table 3: Performance comparison with baselines (all using the DistMult decoder). See Table 2 caption for entry styles.
WN18RR NELL-995 UMLS
Model MRR H@1 NDCG@5 MRR H@1 NDCG@5 MRR H@1 NDCG@5
SANS-RW .349±.010 .340±.013 .334±.010 .135±.006 .109±.008 .110±.008 .510±.008 .369±.009 .478±.003
NSCaching .374±.002 .337±.003 .374±.002 .177±.004 .150±.003 .140±.002 .625±.004 .508±.021 .607±.004
KBGAN .172±.004 .070±.006 .155±.002 .170±.002 .077±.004 .195±.009 .680±.005 .556±.023 .654±.004
CAKE .353±.007 .345±.005 .351±.008 .204±.006 .130±.007 .175±.012 .441±.013 .365±.008 .383±.010
SMiLE .315±.006 .291±.007 .294±.012 .131±.004 .127±.005 .105±.008 .414±.015 .345±.007 .372±.013
DANS .381±.006 .352±.007 .386±.008 .227±.004 .162±.007 .220±.009 .724±.008 .641±.009 .725±.008

Table 2 reports the quantitative comparison against the first category of baselines, involving different negative samplers under the same backbone and decoder. Overall, our model DANS consistently leads to better performance with the DistMult, RotatE and ComplEx decoders. This shows the robustness of our approach across various decoders. In general, DANS performs better than Rand and its variant Pop, showing that it is important to account for the informativeness of negative triplets, which is missing in random and popularity-weighted sampling. Since Self-adv accounts for the informativeness by giving more weight to higher-quality triplets, it generally outperforms Rand and Pop. It still lags behind DANS in most cases, as it ignores the concepts of diversity and adaptiveness. MCNS shows better performance than Rand and its variant Pop but loses to DANS, as MCNS was originally designed for homogeneous graphs.

Next, Table 3 compares DANS with the second category of baselines. Negative sampling in SANS-RW is not relation-aware and thus performs poorly on datasets with a greater variety of relations, namely NELL-995 and UMLS. In addition, KBGAN falls short on the two bigger datasets, WN18RR and NELL-995, as it ignores graph structure in the sampling process; furthermore, its adversarial training process potentially suffers from instability and degeneracy. On the other hand, NSCaching employs a more streamlined importance sampling approach, contributing to its competitive performance despite not considering graph structure for negative sampling. As CAKE and SMiLE leverage extra side information such as entity types to enhance their performance, their results deteriorate because such information is not available in the standard knowledge graph completion benchmarks used in this paper.

Overall, DANS obtains favorable performance, showing the importance of diversity and adaptiveness during negative sampling. We conduct a further ablation study in the next subsection to examine the contribution of each aspect. Finally, we include the experimental results on the FB15k-237 dataset, which show favorable performance with the ComplEx decoder, in Appendix D.

4.3 Additional research questions

Refer to caption
(a) Ablation study on generator and FiLM
Refer to caption
(b) No. of negative triplets $N_{s}$
Refer to caption
(c) Extent of regularization $\lambda$
Figure 2: Investigation of research questions. Each dataset uses its own $y$-axis, and the metric on the $y$-axes is MRR. (a) Ablation study on the contribution of individual generator pathways and the adaptive FiLM layer. (b) Parameter sensitivity w.r.t. the number of negative triplets $N_{s}$ and (c) the extent of adaptive regularization $\lambda$.

In this part, we seek to investigate RQ1–RQ4 listed at the beginning of this section. All experiments in this part are conducted using the DistMult function as the decoder.

Ablation study (RQ1, RQ2). We investigate the contribution from major design choices through an ablation study. As depicted in Figure 2(a), we compare DANS with the following variants, none of which employs the FiLM layer. (1) $G_{E}$: only the pathway $G_{E}$ in the generator; (2) $G_{R}$: only the pathway $G_{R}$ in the generator; (3) $G_{ER}$: both pathways $G_{E}$ and $G_{R}$.

From the results, among the single pathways (either $G_{E}$ or $G_{R}$), there is no consistent winner; it depends on the dataset. However, it is clear that using both pathways in the generator ($G_{ER}$) outperforms using just a single pathway. This addresses RQ1 and shows that diversifying the negative triplets with the two-way generator improves model performance and the overall informativeness of the negative triplets.

Furthermore, comparing $G_{ER}$ (i.e., both pathways without FiLM) and the proposed model DANS (i.e., both pathways with FiLM), our model obtains a significant lead in performance. This addresses RQ2 and shows the effectiveness of our adaptive design using FiLM.

Parameter sensitivity (RQ3). To answer RQ3, we perform a parameter sensitivity analysis. We first analyze how the number of negative triplets per positive triplet, $N_{s}$, impacts model performance. As shown in Figure 2(b), as we increase $N_{s}$ on each dataset, we consistently observe that the MRR performance improves and peaks at $N_{s}=20$. A larger $N_{s}$ allows for greater diversity, which explains the initial increase in performance. However, when $N_{s}\geq 20$, performance starts to plateau or even deteriorate, due to highly imbalanced training data.

Next, we investigate the impact of the adaptive regularization controlled by $\lambda$ in Figure 2(c). Generally, having such regularization (i.e., $\lambda>0$) avoids excessive scaling and shifting from the FiLM layer, and thus reduces overfitting to individual entities or relations. In particular, the MRR performance improves as $\lambda$ increases and is optimal for all three datasets when $\lambda$ is around 1e-4. As the optimal values of $N_{s}$ and $\lambda$ are largely stable across the three datasets, our model is not sensitive to these hyperparameter settings and potentially requires less effort in hyperparameter tuning. We also note that the performance on the UMLS dataset tends to be more sensitive to changes in both parameters. This could be because UMLS is a smaller dataset than the other two, containing only 5,216 positive training triplets, which increases the risk of overfitting to particular settings.

Case study (RQ4). We conduct a qualitative evaluation of DANS to demonstrate the diversity and adaptiveness of negative triplets generated by DANS.

Refer to caption
(a) $\mathtt{hasPart}$
(WN18RR)
Refer to caption
(b) $\mathtt{animalType}$
(NELL-995)
Refer to caption
(c) $\mathtt{isA}$
(UMLS)
Figure 3: Visualization of diversity compared to Random Negative Sampling (RNS). Best viewed in color.
Refer to caption
(a) $\mathtt{hasPart}$
(WN18RR)
Refer to caption
(b) $\mathtt{animalType}$
(NELL-995)
Refer to caption
(c) $\mathtt{isA}$
(UMLS)
Figure 4: Visualization of diversity compared to Popularity-weighted Negative Sampling (PNS). Best viewed in color.

Note that the ablation study has demonstrated improved model performance with the two-way generator and FiLM layers, providing direct evidence of the importance of the diverse pathways and FiLM in the model architecture. However, it is not immediately clear whether having two pathways and FiLM layers in the generator would indeed produce diverse and adaptive examples in the embedding space. Instead, as asked in RQ4, can we observe the diversity and adaptiveness of some generated examples?

To demonstrate the notion of diversity in RQ4, we present a few case studies of how DANS can produce more diverse examples than uniform random sampling (RNS) or popularity-weighted random sampling (PNS). In Figures 3 and 4, we visualize the positive and negative tail entities w.r.t. a given relation and all its head entities on each dataset. More specifically, each point represents one tail entity, which can be a positive (real) tail entity or a negative tail entity. The negative entity can be generated by one of the pathways $G_{E}$ or $G_{R}$ of the generator, or randomly sampled by RNS or PNS. The high-dimensional embedding space is projected onto a Cartesian plane using the t-SNE algorithm [37]. In Figure 3, we compare the diversity of negative entities generated by DANS with that of RNS-based negative entities. For our case study, we select one relation for each dataset, namely, $\mathtt{hasPart}$ on WN18RR, $\mathtt{animalType}$ on NELL-995 and $\mathtt{isA}$ on UMLS, so that all the positive and negative tail entities for a common relation (and the same original head entities) can be contrasted in one visualization. The results show that DANS provides more diverse negative entities for model training, where those generated by $G_{E}$ and $G_{R}$ occupy different subspaces from the positive entities. In contrast, RNS lacks diversity and samples negative entities in the same subspace as the positive entities. This could even contribute to false negative triplets, as they are not well separated from the real ones. Similarly, in Figure 4, we compare the negative entities generated by $G_{E}$ and $G_{R}$ in DANS against the output of PNS, which replaces uniform negative sampling with popularity-weighted sampling. The results again show that DANS produces more diverse negative entities in different subspaces from the positive entities, whereas PNS samples negative entities that mostly overlap with the positive ones and exhibit less diversity.

Refer to caption

(a) Generated samples w/o FiLM

Refer to caption

(b) Generated samples w/ FiLM

Figure 5: Visualization of adaptiveness for relations $\mathtt{DomainRegion}$ and $\mathtt{DomainUsage}$ in WN18RR. (a) Generated samples w/o FiLM layers and (b) w/ FiLM layers. Best viewed in color.

On the other hand, to demonstrate the notion of adaptiveness, we compare two different relations from each dataset to visualize the differences with and without FiLM layers. For example, Figure 5 visualizes the impact of adaptiveness for relations $\mathtt{DomainRegion}$ and $\mathtt{DomainUsage}$ in the WN18RR dataset. “Positive1” and “Positive2” denote the existing positive entities of each of the two relations, respectively; “Fake1” and “Fake2” denote the corresponding negative samples produced by the generators for each of the two relations, respectively. As shown in Figure 5(a), without FiLM layers, “Fake1” and “Fake2” both spread across the plane without any clear association with the corresponding “Positive1” and “Positive2”, thus offering less discriminative power to improve the learning of each relation. In contrast, in Figure 5(b), after FiLM layers are added, we can clearly see that “Fake1” and “Fake2” are adapted to “Positive1” and “Positive2”, as they move closer to their respective positive samples, improving the discriminative power of learning. This demonstrates that the global generators are adapted into local models conditioned on individual input (entity or relation) when FiLM layers are present. Similar patterns can be observed in the NELL-995 and UMLS datasets, which are presented in Appendix E.

5 Conclusion and Future Work

In this work, we introduced DANS, a negative sampling strategy for knowledge graph embedding that explicitly accounts for the informativeness of negative triplets. On one hand, we proposed a two-way generator to increase the overall informativeness by diversifying the negative triplets based on their association with not only entities but also relations. On the other hand, we adapt the global generator model into local models, which generate negative triplets in a finer-grained manner to improve their individual informativeness. Empirically, DANS has outperformed state-of-the-art baselines on three benchmark knowledge graphs through both quantitative and qualitative experiments. In the future, we believe that the concept of diversity and adaptiveness can be further extended to other graph representation learning problems with a lean structure that is less parameter intensive.

6 Funding

This research / project is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (Proposal ID: T2EP20122-0041). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.

Appendices

Appendix A Scoring function

ComplEx: ComplEx [36] extends DistMult [44] by introducing complex-valued embeddings to better model asymmetric relations. In ComplEx, entity and relation embeddings no longer lie in a real space but in a complex space $\mathbb{C}^{k}$, and the score is taken through the real-part operator $\operatorname{Re}$.

\mathcal{F}(\langle h,r,t\rangle)_{\text{ComplEx}}=\operatorname{Re}(\mathbf{e}_{h}^{\top}\text{Diag}(\mathbf{e}_{r})\mathbf{e}_{t})

ComplEx maps the entities and relations to complex vectors, where $\mathbf{e}_{h},\mathbf{e}_{t}\in\mathbb{C}^{k}$, $\operatorname{Re}$ takes the real part, and $\text{Diag}(\mathbf{e}_{r})$ is a diagonal matrix whose diagonal is $\mathbf{e}_{r}$, an $r$-specific trainable vector of the decoder.

RotatE: Inspired by TransE [2], RotatE [34] also operates in the complex vector space and is motivated by Euler's identity. The model defines each relation as a rotation and measures the distance from the rotated source entity to the target entity, in order to account for three relation patterns: symmetry/anti-symmetry, inversion and composition.

\mathcal{F}(\langle h,r,t\rangle)_{\text{RotatE}}=-\|\mathbf{e}_{h}\circ\mathbf{e}_{r}-\mathbf{e}_{t}\|^{2}

RotatE maps the entities and relations to complex vectors, where $\mathbf{e}_{h},\mathbf{e}_{t}\in\mathbb{C}^{k}$ and $\circ$ denotes the Hadamard (element-wise) product.
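As an assumption-level illustration, both scoring functions can be implemented with real-valued tensors holding the real and imaginary parts of the complex embeddings. The ComplEx sketch below uses the conjugate of the tail embedding, as in the original ComplEx formulation; the function names and tensor layout are our own.

```python
import torch


def complex_score(h_re, h_im, r_re, r_im, t_re, t_im):
    # Re(<e_h, e_r, conj(e_t)>) expanded into real-valued operations.
    return (h_re * r_re * t_re + h_im * r_re * t_im
            + h_re * r_im * t_im - h_im * r_im * t_re).sum(dim=-1)


def rotate_score(h_re, h_im, r_phase, t_re, t_im):
    # Each relation is a rotation e^{i*theta}; the score is the negative
    # squared distance ||e_h o e_r - e_t||^2 in the complex plane.
    rot_re = h_re * torch.cos(r_phase) - h_im * torch.sin(r_phase)
    rot_im = h_re * torch.sin(r_phase) + h_im * torch.cos(r_phase)
    return -((rot_re - t_re) ** 2 + (rot_im - t_im) ** 2).sum(dim=-1)
```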

Appendix B Evaluation Metrics

Mean Reciprocal Ranking (MRR): For each candidate list we record the reciprocal of the ranking position of the positive triplet, and take the average over all lists. It reflects the absolute ranking of the first relevant item in the list.

\operatorname{MRR}=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\operatorname{rank}_{i}}

where $\operatorname{rank}_{i}$ is the position of the relevant result in the $i$-th query and $|Q|$ is the total number of queries.

Hit ratio at $K$ (H@$K$): We compute the fraction of candidate lists in which the positive triplet falls within the first $K$ positions.

\operatorname{HR}=\frac{|U_{\text{hit}}^{K}|}{|U_{\text{all}}|}

where $|U_{\text{hit}}^{K}|$ is the number of times a positive triplet is ranked within the top $K$ positions of its candidate list, and $|U_{\text{all}}|$ is the total number of candidate lists.

Normalized discounted cumulative gain at KK (NDCG@KK): We compare the ranked list with the ideal list, where a match at a lower position would have a discounted gain. The gain is further normalized across lists, and we measure the average over all lists. It reflects the relevance at the top KK positions, taking position significance into account.

\mathrm{DCG}_{K}=\sum_{i=1}^{K}\frac{Gain_{i}}{\log_{2}(i+1)}
\mathrm{IDCG}_{K}=\sum_{i=1}^{K}\frac{Gain_{i}^{\text{ideal}}}{\log_{2}(i+1)}
\mathrm{nDCG}_{K}=\frac{\mathrm{DCG}_{K}}{\mathrm{IDCG}_{K}}

where $\mathrm{DCG}_{K}$ is computed over our ranked list and $\mathrm{IDCG}_{K}$ over the ideal list. The numerator $Gain_{i}$ equals 1 when the positive triplet falls within the top $K$ positions and 0 otherwise. In addition, $i$ refers to the ranking position of the positive triplet.
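A minimal sketch of the three metrics is given below, assuming the 1-based filtered rank of each positive triplet has already been computed; with a single relevant item of gain 1 per list, the ideal DCG is $1/\log_{2}(2)=1$, so NDCG@$K$ reduces to the discounted gain of the positive triplet.

```python
import math


def mrr(ranks):
    # Mean reciprocal rank over all candidate lists.
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k=1):
    # Fraction of lists where the positive triplet is within the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k=5):
    # Discounted gain of the positive triplet, normalized by the ideal DCG (= 1).
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)
```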

Appendix C Efficiency Comparison

Table 4: Training time comparisons (minutes) for different negative sampling strategies with the RGCN backbone and DistMult decoder on the WN18RR dataset.
Number of Negative Samples
Model 50 20 10 5
RGCN 184 87 79 68
RGCN-P 182 94 79 67
RGCN-Adv 191 108 91 70
RGCN-MCNS 325 282 169 137
DANS 761 461 271 201

To compare the efficiency of different negative sampling methods, we report the runtime of DANS and other negative sampling strategies including random negative sampling in RGCN, popularity biased negative sampling in RGCN-P, self-adversarial negative sampling in RGCN-Adv and negative sampling via positive distribution in RGCN-MCNS in Table 4. We standardize the experimental setting and use the same RGCN backbone with DistMult scoring function for fair comparison. While DANS incurs more time than other negative sampling methods, the increase over more advanced methods like RGCN-MCNS is by a manageable constant factor. Furthermore, the growth in time is linear when more negative samples are generated.

Using 20 negative samples for reference, the results show that RGCN-Adv, which weighs the negative triplets, takes 108 minutes to complete, slightly more than the 87 minutes and 94 minutes of RGCN and RGCN-P. RGCN-MCNS, which samples from the positive distribution, requires more time (135 minutes) to finish. Overall, DANS, being a more complex model that generally performs better than the baselines, takes approximately 3 to 5 times longer to run. As the number of negative samples increases, the run-time gap between DANS and the other model variants widens, as DANS takes more computational resources to generate synthetic negative samples.

Appendix D Performance on the FB15k-237 dataset

Table 5: Performance comparison with baselines (all using the ComplEx decoder). See Table 2 caption for entry styles.
FB15k-237
Model MRR H@1 H@5 H@10 NDCG@5
ComplEx .211±.005 .128±.003 .234±.006 .381±.006 .128±.004
SANS-RW .195±.006 .153±.004 .204±.08 .267±.008 .119±.004
NSCaching .261±.007 .183±.006 .284±.007 .417±.006 .192±.008
KBGAN .259±.008 .169±.0106 .279±.007 .413±.010 .188±.006
CAKE .195±.007 .132±.004 .222±.006 .319±.009 .130±.003
SMiLE .179±.003 .114±.003 .197±.005 .284±.007 .112±.004
DANS .253±.007 .185±.006 .333±.009 .446±.011 .252±.007

The DistMult decoder we use on the other datasets tends to perform poorly on FB15k-237, as reported in Dettmers et al. [7], Ji et al. [14] and Zhou et al. [51]. Hence, on FB15k-237, we implement the baselines and DANS using the ComplEx decoder, as shown in Table 5. The results show that DANS still achieves competitive performance in comparison to the baselines. In particular, DANS achieves notable gains of 17.3%, 6.95% and 31.3% in terms of H@5, H@10 and NDCG@5, respectively.

Appendix E Case Study for Adaptiveness

Refer to caption

(a) Generated samples without FiLM

Refer to caption

(b) Generated samples with FiLM

Figure 6: Visualization of adaptiveness for relations $\mathtt{Animal.eat.vegetable}$ and $\mathtt{sports.school.in.country}$ on NELL-995. Best viewed in color.
Refer to caption

(a) Generated samples without FiLM

Refer to caption

(b) Generated samples with FiLM

Figure 7: Visualization of adaptiveness for relations $\mathtt{Domain\_Location\_of}$ and $\mathtt{Domain\_Property\_of}$ on UMLS. Best viewed in color.

We conduct additional experiments to qualitatively assess the impact of the FiLM layer on the output of the generator pathways through a case study. To demonstrate adaptiveness, we compare two different relations from each dataset and visualize the generated samples with and without the FiLM layer. The visualization on the WN18RR dataset has already been discussed in the main paper. Here, we include the visualizations on the NELL-995 and UMLS datasets in Figures 6 and 7, respectively, which exhibit patterns similar to those explained in the main paper.
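To make the conditioning mechanism behind these visualizations concrete, the sketch below shows a generic FiLM layer in the spirit of Perez et al.: a relation embedding is mapped to per-dimension scale and shift parameters that modulate the hidden features of a generator pathway, so that a single global generator behaves differently for different relations. This is a simplified illustration of feature-wise modulation under our own naming and dimension choices, not the exact DANS architecture.

import torch
import torch.nn as nn

class FiLMConditioner(nn.Module):
    """Feature-wise linear modulation (FiLM) of generator hidden features."""

    def __init__(self, rel_dim: int, hidden_dim: int):
        super().__init__()
        # The relation embedding produces per-dimension scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(rel_dim, hidden_dim)
        self.to_beta = nn.Linear(rel_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, rel_emb: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(rel_emb)   # (batch, hidden_dim)
        beta = self.to_beta(rel_emb)     # (batch, hidden_dim)
        return gamma * hidden + beta     # feature-wise affine modulation

# Toy usage: modulate 64-dimensional generator features with 32-dimensional relation embeddings.
film = FiLMConditioner(rel_dim=32, hidden_dim=64)
hidden = torch.randn(4, 64)
rel_emb = torch.randn(4, 32)
print(film(hidden, rel_emb).shape)  # torch.Size([4, 64])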

References

  • Ahrabian et al. [2020] Kian Ahrabian, Aarash Feizi, Yasmin Salehi, William L Hamilton, and Avishek Joey Bose. Structure aware negative sampling in knowledge graphs. arXiv preprint arXiv:2009.11355, 2020.
  • Bordes et al. [2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26, 2013.
  • Cai and Wang [2017] Liwei Cai and William Yang Wang. KBGAN: Adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071, 2017.
  • Cao et al. [2019] Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The world wide web conference, pages 151–161, 2019.
  • Carlson et al. [2010] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam Hruschka, and Tom Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the AAAI conference on artificial intelligence, pages 1306–1313, 2010.
  • Chen et al. [2020] Zhe Chen, Yuehan Wang, Bin Zhao, Jing Cheng, Xin Zhao, and Zongtao Duan. Knowledge graph completion: A review. IEEE Access, 8:192435–192456, 2020.
  • Dettmers et al. [2018] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 1811–1818, February 2018.
  • Du et al. [2021] Yu Du, Sylvie Ranwez, Nicolas Sutton-Charani, and Vincent Ranwez. Is diversity optimization always suitable? toward a better understanding of diversity within recommendation approaches. Information processing & management, 58(6):102721, 2021.
  • Fang et al. [2017] Yuan Fang, Kingsley Kuan, Jie Lin, Cheston Tan, and Vijay Chandrasekhar. Object detection meets knowledge graphs. In International Joint Conferences on Artificial Intelligence, 2017.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Hu et al. [2019] Binbin Hu, Yuan Fang, and Chuan Shi. Adversarial learning on heterogeneous information networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 120–129, 2019.
  • Isufi et al. [2021] Elvin Isufi, Matteo Pocchiari, and Alan Hanjalic. Accuracy-diversity trade-off in recommender systems via graph convolutions. Information Processing & Management, 58(2):102459, 2021.
  • Ji et al. [2016] Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In Proceedings of the AAAI conference on artificial intelligence, 2016.
  • Ji et al. [2020] Kexi Ji, Bei Hui, and Guangchun Luo. Graph attention networks with local structure awareness for knowledge graph completion. IEEE Access, 8:224860–224870, 2020.
  • Li et al. [2023] Jianping Li, Guozhen Tan, Xiao Ke, Huaiwei Si, and Yanfei Peng. Object detection based on knowledge graph network. Applied Intelligence, 53(12):15045–15066, 2023.
  • Li et al. [2021] Sirui Li, Kok Wai Wong, Chun Che Fung, and Dengya Zhu. Improving question answering over knowledge graphs using graph summarization. In Neural Information Processing: 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8–12, 2021, Proceedings, Part IV 28, pages 489–500. Springer, 2021.
  • Lin et al. [2015] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI conference on artificial intelligence, 2015.
  • Liu et al. [2020] Hai Liu, Kairong Hu, Fu-Lee Wang, and Tianyong Hao. Aggregating neighborhood information for negative sampling for knowledge graph embedding. Neural Computing and Applications, 32:17637–17653, 2020.
  • Lv et al. [2023] Wen Lv, Hongbo Shi, Shuai Tan, Bing Song, and Yang Tao. A dynamic semantic knowledge graph for zero-shot object detection. The Visual Computer, 39(10):4513–4527, 2023.
  • Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013.
  • Nathani et al. [2019] Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning attention-based embeddings for relation prediction in knowledge graphs. arXiv preprint arXiv:1906.01195, 2019.
  • Niu et al. [2022] Guanglin Niu, Bo Li, Yongfei Zhang, and Shiliang Pu. CAKE: A scalable commonsense-aware framework for multi-view knowledge graph completion. arXiv preprint arXiv:2202.13785, 2022.
  • Odena [2016] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
  • Oreshkin et al. [2018] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31, 2018.
  • Peng et al. [2022] Miao Peng, Ben Liu, Qianqian Xie, Wenjie Xu, Hua Wang, and Min Peng. SMiLE: Schema-augmented multi-level contrastive learning for knowledge graph link prediction. arXiv preprint arXiv:2210.04870, 2022.
  • Perez et al. [2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  • Qian [2021] Jing Qian. Understanding negative sampling in knowledge graph embedding. International Journal of Artificial Intelligence and Applications (IJAIA), 12(1), 2021.
  • Rendle et al. [2012] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618, 2012.
  • Schlichtkrull et al. [2018] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15, pages 593–607. Springer, 2018.
  • Shaffi et al. [2022] Noushath Shaffi, Faizal Hajamohideen, Mufti Mahmud, Abdelhamid Abdesselam, Karthikeyan Subramanian, and Arwa Al Sariri. Triplet-loss based siamese convolutional neural network for 4-way classification of alzheimer’s disease. In International Conference on Brain Informatics, pages 277–287. Springer, 2022.
  • Shen et al. [2022] Jianhao Shen, Chenguang Wang, Linyuan Gong, and Dawn Song. Joint language semantic and structure embedding for knowledge graph completion. arXiv preprint arXiv:2209.08721, 2022.
  • Di et al. [2021] Shimin Di, Quanming Yao, Yongqi Zhang, and Lei Chen. Efficient relation-aware scoring function search for knowledge graph embedding. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 1104–1115. IEEE, 2021.
  • Shin et al. [2019] Sangjin Shin, Xiongnan Jin, Jooik Jung, and Kyong-Ho Lee. Predicate constraints based question answering over knowledge graph. Information Processing & Management, 56(3):445–462, 2019.
  • Sun et al. [2019] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197, 2019.
  • Sun et al. [2020] Zhu Sun, Di Yu, Hui Fang, Jie Yang, Xinghua Qu, Jie Zhang, and Cong Geng. Are we evaluating rigorously? benchmarking recommendation for reproducible evaluation and fair comparison. In Proceedings of the 14th ACM Conference on Recommender Systems, pages 23–32, 2020.
  • Trouillon et al. [2016] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International conference on machine learning, pages 2071–2080. PMLR, 2016.
  • Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  • Wang et al. [2018] Peifeng Wang, Shuangyin Li, and Rong Pan. Incorporating gan for negative sampling in knowledge representation learning. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  • Wang et al. [2020] Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, and Tat-Seng Chua. Reinforced negative sampling over knowledge graph for recommendation. In Proceedings of the web conference 2020, pages 99–109, 2020.
  • Wang et al. [2014] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI conference on artificial intelligence, 2014.
  • Wu et al. [2016] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4622–4630, 2016.
  • Xi et al. [2021] Wu-Dong Xi, Ling Huang, Chang-Dong Wang, Yin-Yu Zheng, and Jian-Huang Lai. Deep rating and review neural network for item recommendation. IEEE Transactions on Neural Networks and Learning Systems, 33(11):6726–6736, 2021.
  • Xiong et al. [2017] Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. arXiv preprint arXiv:1707.06690, 2017.
  • Yang et al. [2014] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
  • Yang et al. [2020] Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. Understanding negative sampling in graph representation learning. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1666–1676, 2020.
  • Ying et al. [2018] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 974–983, 2018.
  • Zeng et al. [2020] Jiehang Zeng, Lu Liu, and Xiaoqing Zheng. Learning structured embeddings of knowledge graphs with adversarial learning framework. arXiv preprint arXiv:2004.07265, 2020.
  • Zhang et al. [2019] Yongqi Zhang, Quanming Yao, Yingxia Shao, and Lei Chen. NSCaching: Simple and efficient negative sampling for knowledge graph embedding. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 614–625. IEEE, 2019.
  • Zhao et al. [2022] Jianli Zhao, Hao Li, Lijun Qu, Qinzhi Zhang, Qiuxia Sun, Huan Huo, and Maoguo Gong. Dcfgan: An adversarial deep reinforcement learning framework with improved negative sampling for session-based recommender systems. Information sciences, 596:222–235, 2022.
  • Zhou et al. [2021] Xiaofei Zhou, Lingfeng Niu, Qiannan Zhu, Xingquan Zhu, Ping Liu, Jianlong Tan, and Li Guo. Knowledge graph embedding by double limit scoring loss. IEEE Transactions on Knowledge and Data Engineering, 34(12):5825–5839, 2021.
  • Zhou et al. [2022] Zhehui Zhou, Can Wang, Yan Feng, and Defang Chen. Jointe: Jointly utilizing 1d and 2d convolution for knowledge graph embedding. Knowledge-Based Systems, 240:108100, 2022.