
Exploring Entity Interactions for Few-Shot Relation Learning (Student Abstract)

Yi Liang1, Shuai Zhao1, Bo Cheng1, Yuwei Yin2, Hao Yang3
Abstract

Few-shot relation learning aims to infer facts for relations with only a limited number of observed triples. Existing metric-learning methods for this problem mostly neglect entity interactions within and between triples. In this paper, we explore these fine-grained semantics and propose our model TransAM. Specifically, we serialize reference entities and query entities into a sequence and apply a transformer structure with local-global attention to capture both intra- and inter-triple entity interactions. Experiments on two public benchmark datasets, NELL-One and Wiki-One, under the 1-shot setting demonstrate the effectiveness of TransAM.

Introduction

Knowledge graphs (KGs) store facts as triples $(h, r, t)$, each reflecting a relation $r$ between a head entity $h$ and a tail entity $t$. Few-shot relation learning, or few-shot knowledge graph completion (FSKGC), which aims to complete relations given only a limited number of observed triples (few-shot references), has attracted research attention in recent years. However, existing FSKGC models mostly fail to model entity-level interactions within and between triples, as they perform matching over entity-pair representations.

In this paper, we argue that entity interactions provide valuable fine-grained semantics and propose TransAM, a Transformer (Vaswani et al. 2017) Appending Matcher with local-global attention for FSKGC. In particular, we serialize the references of a few-shot relation $r$ and the query pair, led by a special token [CLS], into an entity sequence and feed it to a transformer. To capture intra-triple entity interactions, we construct a block attention mask matrix that constrains every entity to attend only to the triple it belongs to (local attention). Besides, we introduce a rotary operation that encodes the head or tail role of each entity to model relation patterns, i.e., symmetry/anti-symmetry or inversion. Global attention focuses on modelling inter-triple entity interactions. We devise a separate triple position encoding for the self-attention module to preserve the triple structure, and we decouple entity representations from triple position embeddings since we regard them as two kinds of heterogeneous features. Finally, the representation of [CLS] is used to predict the plausibility of the entity sequence, which indicates whether the query fact $(h_q, r, t_q)$ holds.

Proposed Method

Task Definition

Knowledge graph ๐’ข\mathcal{G} is formulated as ๐’ข:={โ„ฐ,โ„›,๐’ฏ}\mathcal{G}:=\{\mathcal{E},\mathcal{R},\mathcal{T}\}. โ„ฐ\mathcal{E} is entity set, โ„›\mathcal{R} is a set of relations and ๐’ฏ\mathcal{T} is the triple set. Background graph โ„ฌโ€‹๐’ข\mathcal{BG} is a set of known triples. Giving a new relation rr with KK entity pairs as support set ๐’ฎr={(hi,ti)|(hi,r,ti)โˆˆ๐’ข}\mathcal{S}_{r}=\{(h_{i},t_{i})|(h_{i},r,t_{i})\in\mathcal{G}\} and a query entity pair (hq,tq)(h_{q},t_{q}), the goal of FSKGC is to estimate whether (hq,r,tq)(h_{q},r,t_{q}) is a missing fact. In this work, we solve this problem by serializing support set and query pair to an entity sequence sq:=[[CLS],h1,t1,โ€ฆ,hK,tK,hq,tq]\textbf{s}_{q}:=[\texttt{[CLS]},h_{1},t_{1},\dots,h_{K},t_{K},h_{q},t_{q}] and train the model to compute the probability of whether sq\textbf{s}_{q} holds.

Entity Encoder

Following previous work (Zhang et al. 2020), we enhance entity representations with a heterogeneous graph encoder. For each entity $e$ in the input sequence $\mathbf{s}_q$, after extracting its neighbors $\mathcal{N}_e = \{(r_i, t_i)\}$ from $\mathcal{BG}$, we calculate each neighbor's contribution $\alpha_i = \frac{\exp(\mathbf{u}^{T}(\text{ReLU}(\mathbf{W}_1(\mathbf{v}_{r_i} \| \mathbf{v}_{t_i}))) + b_1)}{\sum_{j=1}^{|\mathcal{N}_e|} \exp(\mathbf{u}^{T}(\text{ReLU}(\mathbf{W}_1(\mathbf{v}_{r_j} \| \mathbf{v}_{t_j}))) + b_1)}$. We then aggregate neighbor information as $\mathbf{h}_e = \sum_{i=1}^{|\mathcal{N}_e|} \alpha_i \mathbf{v}_{t_i}$. Further, a fusion mechanism is applied to preserve the original embedding $\mathbf{v}_e$ and obtain the final representation $\mathbf{x}_e = \sigma(\mathbf{W}_2 \mathbf{v}_e + \mathbf{W}_3 \mathbf{h}_e)$. Here, $\|$ denotes concatenation, and $\sigma$ is a non-linear function, for which we use $\tanh$. $\mathbf{u} \in \mathbb{R}^{d_e \times 1}$, $b_1 \in \mathbb{R}$, $\mathbf{W}_1 \in \mathbb{R}^{d_e \times 2d_e}$, $\mathbf{W}_2 \in \mathbb{R}^{d_e \times d_e}$, and $\mathbf{W}_3 \in \mathbb{R}^{d_e \times d_e}$ are learnable parameters, where $d_e$ is the embedding dimension. We pack the encoder outputs of all entities in $\mathbf{s}_q$ into a matrix $\mathbf{X}_q$.
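As an illustration (a minimal PyTorch sketch of the encoder above, not the released implementation; the module layout is our assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborEncoder(nn.Module):
    """Heterogeneous neighbor encoder sketch; names follow the paper's notation."""
    def __init__(self, d_e):
        super().__init__()
        self.W1 = nn.Linear(2 * d_e, d_e, bias=False)  # W_1
        self.u = nn.Linear(d_e, 1)                      # u^T(.) + b_1
        self.W2 = nn.Linear(d_e, d_e, bias=False)       # W_2
        self.W3 = nn.Linear(d_e, d_e, bias=False)       # W_3

    def forward(self, v_e, v_r_nbrs, v_t_nbrs):
        # v_e: (d_e,) entity embedding; v_r_nbrs, v_t_nbrs: (|N_e|, d_e)
        scores = self.u(F.relu(self.W1(torch.cat([v_r_nbrs, v_t_nbrs], dim=-1))))
        alpha = torch.softmax(scores, dim=0)             # contribution of each neighbor
        h_e = (alpha * v_t_nbrs).sum(dim=0)              # weighted neighbor aggregation
        return torch.tanh(self.W2(v_e) + self.W3(h_e))   # fused representation x_e
```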

Transformer Matching Processor

The Transformer Matching Processor contains a stack of $L$ identical blocks, each consisting of a multi-head local-global attention module, a position-wise feed-forward network (FFN), and layer normalization (LayerNorm). For $\mathbf{X}_q$, the multi-head local-global attention (MHA) is computed as $\text{MHA}(\mathbf{X}_q) = \big(\|_{i=1}^{H} (Attn_{\text{global}}(\mathbf{X}_q) + Attn_{\text{local}}(\mathbf{X}_q))\big)\,\mathbf{W}^{O}$, where $Attn_{\text{global}}(\cdot)$ and $Attn_{\text{local}}(\cdot)$ denote the global and local attention functions described below, $H$ is the number of attention heads, and $\mathbf{W}^{O} \in \mathbb{R}^{H d_e \times H d_e}$ is a learnable projection matrix. Fig. 1 illustrates the architectures of local and global attention. Please refer to (Vaswani et al. 2017) for details on the FFN and LayerNorm.
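A rough sketch of the per-head combination (the factory-based layout is our assumption; the local and global branches themselves are described in the next two subsections):

```python
import torch
import torch.nn as nn

class LocalGlobalMHA(nn.Module):
    """Per head, the local and global attention outputs are summed; the heads
    are concatenated and projected by W^O. Layout is illustrative only."""
    def __init__(self, d_e, n_heads, make_local_attn, make_global_attn):
        super().__init__()
        self.local = nn.ModuleList([make_local_attn() for _ in range(n_heads)])
        self.global_ = nn.ModuleList([make_global_attn() for _ in range(n_heads)])
        self.W_O = nn.Linear(n_heads * d_e, n_heads * d_e, bias=False)

    def forward(self, X_q):                                  # X_q: (2K + 3, d_e)
        outs = [loc(X_q) + glo(X_q)                          # sum per head
                for loc, glo in zip(self.local, self.global_)]
        return self.W_O(torch.cat(outs, dim=-1))             # concat heads, project
```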

Local Attention

First, we construct a block attention mask matrix $\mathbf{M}_{\text{local}}$ so that the self-attention module captures intra-triple entity interaction features. We further apply a rotary operation $f_R^{\{Q,K\}}(e, m) = (\mathbf{W}^{\{Q,K\}} \mathbf{x}_e)\,e^{im\theta}$ to each entity $e$ to endow it with role information (head or tail) within its triple, where $m$ is 1 for heads and 2 for tails. Finally, the local attention output is computed as $\mathbf{H}_{\text{local}} = \text{Softmax}\big(\frac{f_R^{Q}(\mathbf{X}_q) \cdot f_R^{K}(\mathbf{X}_q)^{T}}{\sqrt{d}} + \mathbf{M}_{\text{local}}\big)\,\mathbf{X}_q \mathbf{W}^{V}$. $\mathbf{M}_{\text{local}}$ is obtained as follows:

$$m_{i,j}=\begin{cases}0, & i=0 \ \text{or}\ i=j \ \text{or}\ \big(j=i+1,\ i\in\{1,3,\dots,2K+1\}\big) \ \text{or}\ \big(j=i-1,\ i\in\{2,4,\dots,2K+2\}\big),\\ -\infty, & \text{otherwise.}\end{cases} \qquad (1)$$
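To make the mask of Eq. (1) and the rotary role encoding concrete, here is a minimal PyTorch sketch (our own illustration, not the released code); `theta` is assumed to be a per-dimension-pair angle vector as in standard rotary embeddings:

```python
import torch

def build_local_mask(K):
    """Block mask M_local from Eq. (1): [CLS] (row 0) attends to every position,
    and each entity attends to itself and to the other entity of its own pair."""
    n = 2 * K + 3                          # [CLS] + K support pairs + query pair
    mask = torch.full((n, n), float("-inf"))
    mask[0, :] = 0.0                       # [CLS] attends to all positions
    idx = torch.arange(n)
    mask[idx, idx] = 0.0                   # i = j
    for i in range(1, n - 1, 2):           # odd i (heads): allow j = i + 1
        mask[i, i + 1] = 0.0
        mask[i + 1, i] = 0.0               # even i (tails): allow j = i - 1
    return mask

def rotary(x, m, theta):
    """Rotate consecutive feature pairs of x by angle m * theta
    (m = 1 for head entities, 2 for tails); assumes an even feature dimension."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = torch.cos(m * theta), torch.sin(m * theta)
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```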

Global Attention

We first assign position $i$ to both entities $h_i$ and $t_i$ of triple $(h_i, t_i)$, while the positions of [CLS] and the query pair $(h_q, t_q)$ are set to 0 and $K+1$, respectively. A $d_e$-dimensional learnable embedding is used to represent each position, and these embeddings are packed into a matrix $\tilde{\mathbf{P}}$. Finally, we compute the global-stage hidden state as $\mathbf{H}_{\text{global}} = \text{Softmax}(\mathbf{A}_g)\,\mathbf{X}_q \mathbf{W}^{V}$, where $\mathbf{A}_g = \frac{(\mathbf{X}_q \mathbf{W}^{Q})(\mathbf{X}_q \mathbf{W}^{K})^{T} + (\tilde{\mathbf{P}} \mathbf{U}^{Q})(\tilde{\mathbf{P}} \mathbf{U}^{K})^{T}}{\sqrt{2 d_e}}$.
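A minimal sketch of this global branch, with entity-entity and position-position scores computed separately (decoupled) and summed as in the formula for $\mathbf{A}_g$ above (the module layout is our assumption):

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Global attention sketch with decoupled triple position encoding."""
    def __init__(self, d_e, K):
        super().__init__()
        self.pos = nn.Embedding(K + 2, d_e)              # \tilde{P}: positions 0 .. K+1
        self.W_Q = nn.Linear(d_e, d_e, bias=False)
        self.W_K = nn.Linear(d_e, d_e, bias=False)
        self.W_V = nn.Linear(d_e, d_e, bias=False)
        self.U_Q = nn.Linear(d_e, d_e, bias=False)
        self.U_K = nn.Linear(d_e, d_e, bias=False)
        # triple positions: 0 for [CLS], i for (h_i, t_i), K+1 for the query pair
        pos_ids = [0] + [i for i in range(1, K + 1) for _ in range(2)] + [K + 1, K + 1]
        self.register_buffer("pos_ids", torch.tensor(pos_ids))

    def forward(self, X_q):                              # X_q: (2K + 3, d_e)
        P = self.pos(self.pos_ids)                       # (2K + 3, d_e)
        A_e = self.W_Q(X_q) @ self.W_K(X_q).T            # entity-entity scores
        A_p = self.U_Q(P) @ self.U_K(P).T                # position-position scores
        A_g = (A_e + A_p) / (2 * X_q.size(-1)) ** 0.5    # scale by sqrt(2 * d_e)
        return torch.softmax(A_g, dim=-1) @ self.W_V(X_q)
```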

Figure 1: (a) Details of local attention; (b) Details of global attention. We sum LocalAttnOut and GlobalAttnOut and feed the result to the rest of the transformer block.

Prediction and Optimization

Let the last hidden state of [CLS] be $\mathbf{z}^{L}_{\text{CLS}}$. We obtain the final score $\bar{\mathbf{s}}_q$ of $(h_q, t_q)$ as $\bar{\mathbf{s}}_q = \text{Softmax}(\mathbf{U}^{T}_{2}(\mathbf{W}_4 \mathbf{z}^{L}_{\text{CLS}}))$, where $\mathbf{W}_4 \in \mathbb{R}^{d_e \times H d_e}$ and $\mathbf{U}^{T}_{2} \in \mathbb{R}^{d_e \times 2}$. $\bar{\mathbf{s}}_q \in \mathbb{R}^{2}$ is a 2-dimensional real vector whose component $s_q^{(1)}$ gives the probability that the query $(h_q, r, t_q)$ holds. We compute the loss $\mathcal{L} = -\sum_{q \in \mathbb{S}^{+} \cup \mathbb{S}^{-}} \big((1 - y_q)\log(s_q^{(0)}) + y_q \log(s_q^{(1)})\big)$, where $\mathbb{S}^{+}$ and $\mathbb{S}^{-}$ denote positive and negative query sequences, respectively, and $y_q \in \{0, 1\}$ is the sequence label (negative or positive).
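A minimal sketch of the prediction head and loss (names are illustrative, not the authors' code); cross-entropy over the two logits matches the loss $\mathcal{L}$ above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Project the final [CLS] state (dimension H * d_e) to two class logits."""
    def __init__(self, d_e, n_heads):
        super().__init__()
        self.W4 = nn.Linear(n_heads * d_e, d_e, bias=False)   # W_4
        self.U2 = nn.Linear(d_e, 2, bias=False)                # U_2

    def forward(self, z_cls):                    # z_cls: (batch, H * d_e)
        return self.U2(self.W4(z_cls))           # logits; softmax gives \bar{s}_q

def loss_fn(logits, y_q):
    # y_q in {0, 1}; cross_entropy applies log-softmax internally,
    # matching -[(1 - y) log s^(0) + y log s^(1)] summed over queries
    return F.cross_entropy(logits, y_q, reduction="sum")
```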

Experiments

Datasets and Baselines

We use two public datasets, NELL-One and Wiki-One, proposed by GMatching (Xiong et al. 2018). We select Mean Reciprocal Rank (MRR), Hits@1 (H1), and Hits@10 (H10) as evaluation metrics. The few-shot size $K$ is set to 1, i.e., one-shot. We compare TransAM with the three baselines most relevant to our research: GMatching (Xiong et al. 2018), FSRL (Zhang et al. 2020), and FAAN (Sheng et al. 2020). The results of FSRL and FAAN are obtained using their official implementations with the recommended hyperparameters reported in their papers.

Implementation Details

We use ComplEx pre-trained embedding vectors to initialize all entities and relations. The embedding dimension is 100 for NELL-One and 50 for Wiki-One. We set the number of transformer layers to 3 and 4, and the number of heads to 4 and 8, for NELL-One and Wiki-One, respectively. We warm up the model over the first 10k steps by linearly increasing the learning rate to $5\times 10^{-5}$ for NELL-One and $6\times 10^{-5}$ for Wiki-One, then decrease it linearly to 0 until the last step.
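A sketch of this schedule (the optimizer choice and `total_steps` value are assumptions for illustration):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, warmup_steps=10_000, total_steps=100_000):
    """Linear warm-up to the peak rate, then linear decay to 0 at the last step."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

# usage: optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # NELL-One peak LR
#        scheduler = make_scheduler(optimizer); call scheduler.step() each training step
```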

Results and Conclusions

From Table 1, we observe that TransAM outperforms all three baselines in MRR and Hits@1 on both datasets. Given only one reference instance, matching models are expected to capture accurate semantic meanings for prediction. These results show that TransAM is well suited to addressing the few-shot issue in knowledge graph completion. Our future work may consider introducing more sophisticated structural biases into the transformer for complex few-shot relations.

Model        NELL-One             Wiki-One
             H1     H10    MRR    H1     H10    MRR
GMatching†   .119   .313   .185   .120   .336   .200
FSRL         .114   .294   .170   .091   .267   .149
FAAN         .116   .316   .187   .146   .362   .215
TransAM      .152   .360   .225   .184   .358   .242
Table 1: Experimental results on NELL-One and Wiki-One. Best results are boldfaced. † indicates results taken from the original paper.

Acknowledgments

This work is supported by the Beijing Nova Program of Science and Technology (Grant No. Z191100001119031) and the Guangxi Key Laboratory of Cryptography and Information Security (No. GCIS202111). Bo Cheng is the corresponding author.

References

  • Sheng et al. (2020) Sheng, J.; Guo, S.; Chen, Z.; Yue, J.; Wang, L.; Liu, T.; and Xu, H. 2020. Adaptive Attentional Network for Few-Shot Knowledge Graph Completion. In EMNLP.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NeurIPS.
  • Xiong et al. (2018) Xiong, W.; Yu, M.; Chang, S.; Guo, X.; and Wang, W. Y. 2018. One-Shot Relational Learning for Knowledge Graphs. In EMNLP.
  • Zhang et al. (2020) Zhang, C.; Yao, H.; Huang, C.; Jiang, M.; Li, Z.; and Chawla, N. 2020. Few-Shot Knowledge Graph Completion. In AAAI.