
Exploring Entity Interactions for Few-Shot Relation Learning (Student Abstract)

Yi Liang1, Shuai Zhao1, Bo Cheng1, Yuwei Yin2, Hao Yang3
Abstract

Few-shot relation learning aims to infer facts for relations with only a limited number of observed triples. Existing metric-learning methods for this problem mostly neglect entity interactions within and between triples. In this paper, we explore these fine-grained semantics and propose our model TransAM. Specifically, we serialize reference entities and query entities into a sequence and apply a transformer structure with local-global attention to capture both intra- and inter-triple entity interactions. Experiments on two public benchmark datasets, NELL-One and Wiki-One, under the 1-shot setting demonstrate the effectiveness of TransAM.

Introduction

Knowledge graphs (KGs) store facts as triples $(h, r, t)$, each reflecting a relation $r$ between a head entity $h$ and a tail entity $t$. Few-shot relation learning, or few-shot knowledge graph completion (FSKGC), which aims to complete relations given only a limited number of observed triples (few-shot references), has attracted research attention in recent years. However, existing FSKGC models mostly fail to model entity-level interactions within and between triples, as they perform matching over entity-pair representations.

In this paper, we argue that entity interactions provide valuable fine-grained semantics and propose TransAM, a Transformer (Vaswani et al. 2017) Appending Matcher with local-global attention for FSKGC. In particular, we serialize the references of a few-shot relation $r$ and the query pair, led by a special token [CLS], into an entity sequence and feed it to a transformer. To capture intra-triple entity interactions, we construct a block attention mask matrix that constrains every entity to attend only to the triple it belongs to (local attention). Besides, we introduce a rotary operation that encodes the head or tail role of each entity to model relation patterns, i.e., symmetry/anti-symmetry or inversion. Global attention focuses on modelling inter-triple entity interactions. We devise a separate triple position encoding for the self-attention module to preserve the triple structure, and we decouple entity representations from triple position embeddings since we regard them as two kinds of heterogeneous features. Finally, the representation of [CLS] is used to predict the plausibility of the entity sequence, which indicates whether the query fact $(h_q, r, t_q)$ holds.

Proposed Method

Task Definition

Knowledge graph ๐’ข\mathcal{G} is formulated as ๐’ข:={โ„ฐ,โ„›,๐’ฏ}\mathcal{G}:=\{\mathcal{E},\mathcal{R},\mathcal{T}\}. โ„ฐ\mathcal{E} is entity set, โ„›\mathcal{R} is a set of relations and ๐’ฏ\mathcal{T} is the triple set. Background graph โ„ฌโ€‹๐’ข\mathcal{BG} is a set of known triples. Giving a new relation rr with KK entity pairs as support set ๐’ฎr={(hi,ti)|(hi,r,ti)โˆˆ๐’ข}\mathcal{S}_{r}=\{(h_{i},t_{i})|(h_{i},r,t_{i})\in\mathcal{G}\} and a query entity pair (hq,tq)(h_{q},t_{q}), the goal of FSKGC is to estimate whether (hq,r,tq)(h_{q},r,t_{q}) is a missing fact. In this work, we solve this problem by serializing support set and query pair to an entity sequence sq:=[[CLS],h1,t1,โ€ฆ,hK,tK,hq,tq]\textbf{s}_{q}:=[\texttt{[CLS]},h_{1},t_{1},\dots,h_{K},t_{K},h_{q},t_{q}] and train the model to compute the probability of whether sq\textbf{s}_{q} holds.

Entity Encoder

Following previous work (Zhang et al. 2020), we enhance entity representations with a heterogeneous graph encoder. For each entity $e$ in the input sequence $\mathbf{s}_q$, after extracting its neighbors $\mathcal{N}_e = \{(r_i, t_i)\}$ from $\mathcal{BG}$, we calculate each neighbor's contribution $\alpha_i = \frac{\exp(\mathbf{u}^{T}(\text{ReLU}(\mathbf{W}_1(\mathbf{v}_{r_i} \| \mathbf{v}_{t_i}))) + b_1)}{\sum_{j=1}^{|\mathcal{N}_e|} \exp(\mathbf{u}^{T}(\text{ReLU}(\mathbf{W}_1(\mathbf{v}_{r_j} \| \mathbf{v}_{t_j}))) + b_1)}$. We then aggregate neighbor information as $\mathbf{h}_e = \sum_{i=1}^{|\mathcal{N}_e|} \alpha_i \mathbf{v}_{t_i}$. Further, a fusion mechanism is applied to preserve the original embedding $\mathbf{v}_e$ and obtain the final representation $\mathbf{x}_e = \sigma(\mathbf{W}_2 \mathbf{v}_e + \mathbf{W}_3 \mathbf{h}_e)$. Here, $\|$ denotes concatenation, and $\sigma$ is a non-linear function, for which we use $\tanh$. $\mathbf{u} \in \mathbb{R}^{d_e \times 1}$, $b_1 \in \mathbb{R}$, $\mathbf{W}_1 \in \mathbb{R}^{d_e \times 2d_e}$, $\mathbf{W}_2 \in \mathbb{R}^{d_e \times d_e}$, and $\mathbf{W}_3 \in \mathbb{R}^{d_e \times d_e}$ are learnable parameters, where $d_e$ is the embedding dimension. We pack the encoder outputs of all entities in $\mathbf{s}_q$ into a matrix $\mathbf{X}_q$.
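As an illustration (a minimal PyTorch sketch of the encoder above, not the released implementation; the module layout is our assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborEncoder(nn.Module):
    """Heterogeneous neighbor encoder sketch; names follow the paper's notation."""
    def __init__(self, d_e):
        super().__init__()
        self.W1 = nn.Linear(2 * d_e, d_e, bias=False)  # W_1
        self.u = nn.Linear(d_e, 1)                      # u^T(.) + b_1
        self.W2 = nn.Linear(d_e, d_e, bias=False)       # W_2
        self.W3 = nn.Linear(d_e, d_e, bias=False)       # W_3

    def forward(self, v_e, v_r_nbrs, v_t_nbrs):
        # v_e: (d_e,) entity embedding; v_r_nbrs, v_t_nbrs: (|N_e|, d_e)
        scores = self.u(F.relu(self.W1(torch.cat([v_r_nbrs, v_t_nbrs], dim=-1))))
        alpha = torch.softmax(scores, dim=0)             # contribution of each neighbor
        h_e = (alpha * v_t_nbrs).sum(dim=0)              # weighted neighbor aggregation
        return torch.tanh(self.W2(v_e) + self.W3(h_e))   # fused representation x_e
```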

Transformer Matching Processor

The Transformer Matching Processor contains a stack of $L$ identical blocks, each consisting of a multi-head local-global attention module, a position-wise feed-forward network (FFN), and layer normalization (LayerNorm). For $\mathbf{X}_q$, the multi-head local-global attention (MHA) is computed as $\text{MHA}(\mathbf{X}_q) = \big(\|_{i=1}^{H} (Attn_{\text{global}}(\mathbf{X}_q) + Attn_{\text{local}}(\mathbf{X}_q))\big)\,\mathbf{W}^{O}$, where $Attn_{\text{global}}(\cdot)$ and $Attn_{\text{local}}(\cdot)$ denote the global and local attention functions described below, $H$ is the number of attention heads, and $\mathbf{W}^{O} \in \mathbb{R}^{H d_e \times H d_e}$ is a learnable projection matrix. Fig. 1 illustrates the architectures of local and global attention. Please refer to (Vaswani et al. 2017) for details on the FFN and LayerNorm.
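A rough sketch of the per-head combination (the factory-based layout is our assumption; the local and global branches themselves are described in the next two subsections):

```python
import torch
import torch.nn as nn

class LocalGlobalMHA(nn.Module):
    """Per head, the local and global attention outputs are summed; the heads
    are concatenated and projected by W^O. Layout is illustrative only."""
    def __init__(self, d_e, n_heads, make_local_attn, make_global_attn):
        super().__init__()
        self.local = nn.ModuleList([make_local_attn() for _ in range(n_heads)])
        self.global_ = nn.ModuleList([make_global_attn() for _ in range(n_heads)])
        self.W_O = nn.Linear(n_heads * d_e, n_heads * d_e, bias=False)

    def forward(self, X_q):                                  # X_q: (2K + 3, d_e)
        outs = [loc(X_q) + glo(X_q)                          # sum per head
                for loc, glo in zip(self.local, self.global_)]
        return self.W_O(torch.cat(outs, dim=-1))             # concat heads, project
```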

Local Attention

First, we construct a block attention mask matrix $\mathbf{M}_{\text{local}}$ so that the self-attention module captures intra-triple entity interaction features. We further apply a rotary operation $f_R^{\{Q,K\}}(e, m) = (\mathbf{W}^{\{Q,K\}} \mathbf{x}_e)\,e^{im\theta}$ to each entity $e$ to endow it with role information (head or tail) within its triple, where $m$ is 1 for heads and 2 for tails. Finally, the local attention output is computed as $\mathbf{H}_{\text{local}} = \text{Softmax}\big(\frac{f_R^{Q}(\mathbf{X}_q) \cdot f_R^{K}(\mathbf{X}_q)^{T}}{\sqrt{d}} + \mathbf{M}_{\text{local}}\big)\,\mathbf{X}_q \mathbf{W}^{V}$. $\mathbf{M}_{\text{local}}$ is obtained as follows:

$$m_{i,j}=\begin{cases}0, & i=0 \ \text{or}\ i=j \ \text{or}\ \big(j=i+1,\ i\in\{1,3,\dots,2K+1\}\big) \ \text{or}\ \big(j=i-1,\ i\in\{2,4,\dots,2K+2\}\big),\\ -\infty, & \text{otherwise.}\end{cases} \qquad (1)$$
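To make the mask of Eq. (1) and the rotary role encoding concrete, here is a minimal PyTorch sketch (our own illustration, not the released code); `theta` is assumed to be a per-dimension-pair angle vector as in standard rotary embeddings:

```python
import torch

def build_local_mask(K):
    """Block mask M_local from Eq. (1): [CLS] (row 0) attends to every position,
    and each entity attends to itself and to the other entity of its own pair."""
    n = 2 * K + 3                          # [CLS] + K support pairs + query pair
    mask = torch.full((n, n), float("-inf"))
    mask[0, :] = 0.0                       # [CLS] attends to all positions
    idx = torch.arange(n)
    mask[idx, idx] = 0.0                   # i = j
    for i in range(1, n - 1, 2):           # odd i (heads): allow j = i + 1
        mask[i, i + 1] = 0.0
        mask[i + 1, i] = 0.0               # even i (tails): allow j = i - 1
    return mask

def rotary(x, m, theta):
    """Rotate consecutive feature pairs of x by angle m * theta
    (m = 1 for head entities, 2 for tails); assumes an even feature dimension."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = torch.cos(m * theta), torch.sin(m * theta)
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```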

Global Attention

We first assign position $i$ to both entities $h_i$ and $t_i$ of triple $(h_i, t_i)$, while the positions of [CLS] and the query pair $(h_q, t_q)$ are set to 0 and $K+1$, respectively. A $d_e$-dimensional learnable embedding is used to represent each position, and these embeddings are packed into a matrix $\tilde{\mathbf{P}}$. Finally, we compute the global-stage hidden state as $\mathbf{H}_{\text{global}} = \text{Softmax}(\mathbf{A}_g)\,\mathbf{X}_q \mathbf{W}^{V}$, where $\mathbf{A}_g = \frac{(\mathbf{X}_q \mathbf{W}^{Q})(\mathbf{X}_q \mathbf{W}^{K})^{T} + (\tilde{\mathbf{P}} \mathbf{U}^{Q})(\tilde{\mathbf{P}} \mathbf{U}^{K})^{T}}{\sqrt{2 d_e}}$.
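A minimal sketch of this global branch, with entity-entity and position-position scores computed separately (decoupled) and summed as in the formula for $\mathbf{A}_g$ above (the module layout is our assumption):

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Global attention sketch with decoupled triple position encoding."""
    def __init__(self, d_e, K):
        super().__init__()
        self.pos = nn.Embedding(K + 2, d_e)              # \tilde{P}: positions 0 .. K+1
        self.W_Q = nn.Linear(d_e, d_e, bias=False)
        self.W_K = nn.Linear(d_e, d_e, bias=False)
        self.W_V = nn.Linear(d_e, d_e, bias=False)
        self.U_Q = nn.Linear(d_e, d_e, bias=False)
        self.U_K = nn.Linear(d_e, d_e, bias=False)
        # triple positions: 0 for [CLS], i for (h_i, t_i), K+1 for the query pair
        pos_ids = [0] + [i for i in range(1, K + 1) for _ in range(2)] + [K + 1, K + 1]
        self.register_buffer("pos_ids", torch.tensor(pos_ids))

    def forward(self, X_q):                              # X_q: (2K + 3, d_e)
        P = self.pos(self.pos_ids)                       # (2K + 3, d_e)
        A_e = self.W_Q(X_q) @ self.W_K(X_q).T            # entity-entity scores
        A_p = self.U_Q(P) @ self.U_K(P).T                # position-position scores
        A_g = (A_e + A_p) / (2 * X_q.size(-1)) ** 0.5    # scale by sqrt(2 * d_e)
        return torch.softmax(A_g, dim=-1) @ self.W_V(X_q)
```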

Figure 1: (a) Details of local attention; (b) Details of global attention. We sum LocalAttnOut and GlobalAttnOut and feed the result to the rest of the transformer block.

Prediction and Optimization

Let the last hidden state of [CLS] be $\mathbf{z}^{L}_{\text{CLS}}$. We obtain the final score $\bar{\mathbf{s}}_q$ of $(h_q, t_q)$ as $\bar{\mathbf{s}}_q = \text{Softmax}(\mathbf{U}^{T}_{2}(\mathbf{W}_4 \mathbf{z}^{L}_{\text{CLS}}))$, where $\mathbf{W}_4 \in \mathbb{R}^{d_e \times H d_e}$ and $\mathbf{U}^{T}_{2} \in \mathbb{R}^{d_e \times 2}$. $\bar{\mathbf{s}}_q \in \mathbb{R}^{2}$ is a 2-dimensional real vector whose component $s_q^{(1)}$ gives the probability that the query $(h_q, r, t_q)$ holds. We compute the loss $\mathcal{L} = -\sum_{q \in \mathbb{S}^{+} \cup \mathbb{S}^{-}} \big((1 - y_q)\log(s_q^{(0)}) + y_q \log(s_q^{(1)})\big)$, where $\mathbb{S}^{+}$ and $\mathbb{S}^{-}$ denote positive and negative query sequences, respectively, and $y_q \in \{0, 1\}$ is the sequence label (negative or positive).
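A minimal sketch of the prediction head and loss (names are illustrative, not the authors' code); cross-entropy over the two logits matches the loss $\mathcal{L}$ above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Project the final [CLS] state (dimension H * d_e) to two class logits."""
    def __init__(self, d_e, n_heads):
        super().__init__()
        self.W4 = nn.Linear(n_heads * d_e, d_e, bias=False)   # W_4
        self.U2 = nn.Linear(d_e, 2, bias=False)                # U_2

    def forward(self, z_cls):                    # z_cls: (batch, H * d_e)
        return self.U2(self.W4(z_cls))           # logits; softmax gives \bar{s}_q

def loss_fn(logits, y_q):
    # y_q in {0, 1}; cross_entropy applies log-softmax internally,
    # matching -[(1 - y) log s^(0) + y log s^(1)] summed over queries
    return F.cross_entropy(logits, y_q, reduction="sum")
```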

Experiments

Datasets and Baselines

We use two public datasets, NELL-One and Wiki-One, proposed by GMatching (Xiong et al. 2018). We select Mean Reciprocal Rank (MRR), Hits@1 (H1), and Hits@10 (H10) as evaluation metrics. The few-shot size $K$ is set to 1, i.e., one-shot. We compare TransAM with the three baselines most relevant to our research: GMatching (Xiong et al. 2018), FSRL (Zhang et al. 2020), and FAAN (Sheng et al. 2020). The results of FSRL and FAAN are obtained using their official implementations with the recommended hyperparameters reported in their papers.

Implementation Details

We use ComplEx pre-trained embedding vectors to initialize all entities and relations. The embedding dimension is 100 for NELL-One and 50 for Wiki-One. We set the number of transformer layers to 3 and 4, and the number of heads to 4 and 8, for NELL-One and Wiki-One, respectively. We warm up the model over the first 10k steps by linearly increasing the learning rate to $5\times 10^{-5}$ for NELL-One and $6\times 10^{-5}$ for Wiki-One, then decrease it linearly to 0 until the last step.
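A sketch of this schedule (the optimizer choice and `total_steps` value are assumptions for illustration):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, warmup_steps=10_000, total_steps=100_000):
    """Linear warm-up to the peak rate, then linear decay to 0 at the last step."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

# usage: optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # NELL-One peak LR
#        scheduler = make_scheduler(optimizer); call scheduler.step() each training step
```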

Results and Conclusions

From Table 1, we observe that TransAM outperforms all three baselines in MRR and Hits@1 on both datasets. Given only one reference instance, matching models are expected to capture accurate semantic meanings for prediction. These results show that TransAM is well suited to addressing the few-shot issue in knowledge graph completion. Our future work may consider introducing more sophisticated structural biases into the transformer for complex few-shot relations.

Model        NELL-One             Wiki-One
             H1     H10    MRR    H1     H10    MRR
GMatching†   .119   .313   .185   .120   .336   .200
FSRL         .114   .294   .170   .091   .267   .149
FAAN         .116   .316   .187   .146   .362   .215
TransAM      .152   .360   .225   .184   .358   .242
Table 1: Experimental results on NELL-One and Wiki-One. Best results are boldfaced. † indicates results taken from the original paper.

Acknowledgments

This work is supported by the Beijing Nova Program of Science and Technology (Grant No. Z191100001119031) and the Guangxi Key Laboratory of Cryptography and Information Security (No. GCIS202111). Bo Cheng is the corresponding author.

References

  • Sheng et al. (2020) Sheng, J.; Guo, S.; Chen, Z.; Yue, J.; Wang, L.; Liu, T.; and Xu, H. 2020. Adaptive Attentional Network for Few-Shot Knowledge Graph Completion. In EMNLP.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NeurIPS.
  • Xiong et al. (2018) Xiong, W.; Yu, M.; Chang, S.; Guo, X.; and Wang, W. Y. 2018. One-Shot Relational Learning for Knowledge Graphs. In EMNLP.
  • Zhang et al. (2020) Zhang, C.; Yao, H.; Huang, C.; Jiang, M.; Li, Z.; and Chawla, N. 2020. Few-Shot Knowledge Graph Completion. In AAAI.