Exploring Entity Interactions for Few-Shot Relation Learning (Student Abstract)
Abstract
Few-shot relation learning aims to infer facts for relations with only a limited number of observed triples. Existing metric-learning methods for this problem mostly neglect entity interactions within and between triples. In this paper, we explore these fine-grained semantics and propose our model TransAM. Specifically, we serialize reference entities and query entities into a sequence and apply a Transformer with local-global attention to capture both intra- and inter-triple entity interactions. Experiments on two public benchmark datasets, NELL-One and Wiki-One, under the 1-shot setting demonstrate the effectiveness of TransAM.
Introduction
Knowledge graphs (KGs) store facts in the form of triples $(h, r, t)$, each reflecting a relation $r$ between a head entity $h$ and a tail entity $t$. Few-shot relation learning, or few-shot knowledge graph completion (FSKGC), which aims to complete relations given only a limited number of observed triples (few-shot references), has attracted research attention in recent years. However, existing FSKGC models mostly fail to model entity-level interactions within and between triples, as they perform matching over entity-pair representations.
In this paper, we argue that entity interactions can provide valuable fine-grained semantic information and propose TransAM, a Transformer (Vaswani et al. 2017) Appending Matcher with local-global attention for FSKGC. In particular, we serialize the references of a few-shot relation and the query pair, together with a special token [CLS], into an entity sequence and feed it to a Transformer. To capture intra-triple entity interactions, we construct a block attention mask matrix that constrains every entity to attend only to the triple it belongs to (local attention). Besides, we introduce a rotary operation that encodes the head or tail role of each entity so as to model relation patterns, i.e., symmetry/anti-symmetry and inversion. Global attention focuses on modelling inter-triple entity interactions. We devise a separated triple position encoding for the self-attention module to preserve the triple structure, decoupling entity representations from triple position embeddings since we regard them as two kinds of heterogeneous features. Finally, the representation of [CLS] is used to predict the plausibility of the entity sequence, which indicates whether the query fact holds.
Proposed Method
Task Definition
A knowledge graph is formulated as $\mathcal{G} = \{\mathcal{E}, \mathcal{R}, \mathcal{T}\}$, where $\mathcal{E}$ is the entity set, $\mathcal{R}$ is the set of relations, and $\mathcal{T}$ is the triple set. A background graph $\mathcal{G}'$ is a set of known triples. Given a new relation $r$ with entity pairs $\mathcal{S}_r = \{(h_i, t_i)\}_{i=1}^{K}$ as the support set and a query entity pair $(h_q, t_q)$, the goal of FSKGC is to estimate whether $(h_q, r, t_q)$ is a missing fact. In this work, we solve this problem by serializing the support set and the query pair into an entity sequence $S$ and training the model to compute the probability that $(h_q, r, t_q)$ holds.
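As a toy illustration of this serialization step, the following Python sketch builds the entity sequence [CLS], $h_1$, $t_1$, ..., $h_q$, $t_q$ for a 1-shot task; the entity names and the entity-to-id mapping are hypothetical, not taken from the datasets.

```python
# Minimal sketch of serializing a 1-shot support pair and a query pair into
# an entity sequence [CLS, h_1, t_1, h_q, t_q]. The [CLS] id and the
# entity-to-id mapping below are illustrative assumptions.
CLS_ID = 0  # reserved token id for [CLS]

def serialize(support_pairs, query_pair, entity2id):
    """support_pairs: list of (head, tail) names; query_pair: (head, tail)."""
    sequence = [CLS_ID]
    for h, t in support_pairs:
        sequence += [entity2id[h], entity2id[t]]
    sequence += [entity2id[query_pair[0]], entity2id[query_pair[1]]]
    return sequence

# Example usage with hypothetical entities for a relation "athleteHomeStadium".
entity2id = {"[CLS]": 0, "lionel_messi": 1, "camp_nou": 2,
             "kylian_mbappe": 3, "parc_des_princes": 4}
seq = serialize([("lionel_messi", "camp_nou")],
                ("kylian_mbappe", "parc_des_princes"), entity2id)
print(seq)  # [0, 1, 2, 3, 4]
```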
Entity Encoder
Following previous work (Zhang et al. 2020), we enhance entity representations with a heterogeneous graph encoder. For each entity $e$ in the input sequence $S$, after extracting the neighbors $\mathcal{N}_e = \{(r_i, e_i)\}$ starting with $e$ from $\mathcal{G}'$, we calculate the attention contribution $\alpha_i$ of each neighbor. Then we aggregate the neighbor information following $c_e = \sum_{i} \alpha_i e_i$. Further, a fusion mechanism is applied to preserve the original embedding and obtain the final representation $x_e = \sigma\big(W[e \oplus c_e] + b\big)$. Here, $\oplus$ denotes the concatenation operation and $\sigma$ is a non-linear activation function; the projection matrices and bias vectors involved are learnable parameters. We pack the encoder outputs of the entities in $S$ into a matrix $H^{0} \in \mathbb{R}^{L \times d}$, where $d$ is the embedding dimension.
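The following PyTorch sketch shows one possible form of such a neighbor encoder, with attention-weighted aggregation followed by a concatenation-based fusion; the scoring layer, the tanh activation, and the single fusion matrix are illustrative assumptions rather than the exact formulation used by TransAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborEncoder(nn.Module):
    """Sketch of a heterogeneous neighbor encoder: attention-weighted
    aggregation of (relation, neighbor) pairs, fused with the entity's
    own embedding. Scoring and fusion details are illustrative assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # scores each (relation, neighbor) pair
        self.fuse = nn.Linear(2 * dim, dim)  # fuses aggregated context with the entity

    def forward(self, ent, rel_nbr, ent_nbr):
        # ent: (d,); rel_nbr, ent_nbr: (num_neighbors, d)
        alpha = F.softmax(self.score(torch.cat([rel_nbr, ent_nbr], dim=-1)), dim=0)
        context = (alpha * ent_nbr).sum(dim=0)            # weighted aggregation c_e
        fused = torch.cat([ent, context], dim=-1)         # e concatenated with c_e
        return torch.tanh(self.fuse(fused))               # tanh is an assumed activation
```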
Transformer Matching Processor
The Transformer Matching Processor contains a stack of identical blocks, each consisting of a multi-head local-global attention module, a position-wise feed-forward network (FFN), and layer normalization (LayerNorm). For the $l$-th block with input $H^{l}$, the multi-head local-global attention (MHA) can be denoted as $\mathrm{MHA}(H^{l}) = \big[\mathrm{Attn}_g(H^{l}) \oplus \mathrm{Attn}_l(H^{l})\big] W^{O}$, where $\mathrm{Attn}_g$ and $\mathrm{Attn}_l$ indicate the global attention and local attention functions we describe later, $n$ is the number of attention heads, and $W^{O}$ is a learnable projection matrix. Fig. 1 illustrates the architectures of local and global attention. Please refer to Vaswani et al. (2017) for more details about FFN and LayerNorm.
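A structural sketch of how the two attention branches could be combined and projected is given below, assuming each branch contributes half of the model dimension across its heads; the attn_local and attn_global callables stand in for the functions defined in the next two subsections.

```python
import torch
import torch.nn as nn

class LocalGlobalMHA(nn.Module):
    """Sketch: concatenate the outputs of the local-attention and
    global-attention branches (each assumed to return half of the model
    dimension across its heads) and project with W_O. The even split and
    the attn_local / attn_global callables are illustrative assumptions."""
    def __init__(self, dim, attn_local, attn_global):
        super().__init__()
        self.attn_local, self.attn_global = attn_local, attn_global
        self.w_o = nn.Linear(dim, dim)  # learnable output projection W_O

    def forward(self, hidden, mask, pos_emb):
        local_out = self.attn_local(hidden, mask)       # intra-triple interactions
        global_out = self.attn_global(hidden, pos_emb)  # inter-triple interactions
        return self.w_o(torch.cat([local_out, global_out], dim=-1))
```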
Local Attention
Firstly, we construct a block attention mask matrix $M$ that enables the self-attention module to capture intra-triple entity interaction features. We further apply a rotary operation to each entity to endow it with role information (head or tail) within its triple, where the role index is 1 for heads and 2 for tails. Finally, the local attention output is computed as $\mathrm{Attn}_l(H) = \mathrm{softmax}\big(\frac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d_k}} + M\big)V$, where $\tilde{Q}$ and $\tilde{K}$ are the query and key projections of $H$ after the rotary operation and $V$ is the value projection. The mask $M$ is obtained as follows:

$$M_{ij} = \begin{cases} 0, & \text{if entities } i \text{ and } j \text{ belong to the same triple,} \\ -\infty, & \text{otherwise.} \end{cases} \qquad (1)$$
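A small sketch of building such a block mask for the serialized sequence [CLS], $h_1$, $t_1$, ..., $h_q$, $t_q$ follows; letting [CLS] attend to (and be attended by) every position is an assumption made for illustration.

```python
import torch

def block_attention_mask(num_pairs):
    """Sketch of the local-attention mask for a sequence
    [CLS, h_1, t_1, ..., h_K, t_K, h_q, t_q] with num_pairs = K + 1 entity pairs.
    0 keeps a position, -inf blocks it; letting [CLS] attend everywhere is an
    illustrative assumption."""
    seq_len = 1 + 2 * num_pairs
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[0, :] = 0.0  # [CLS] attends to every position (assumption)
    mask[:, 0] = 0.0  # every entity may attend back to [CLS] (assumption)
    for i in range(num_pairs):
        s = 1 + 2 * i
        mask[s:s + 2, s:s + 2] = 0.0  # head/tail of the same triple see each other
    return mask

# Example: 1-shot, i.e., one support pair plus the query pair.
print(block_attention_mask(2))
```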
Global Attention
We first assign position $i$ to both the head entity $h_i$ and the tail entity $t_i$ of the $i$-th reference triple, while the positions for [CLS] and the query pair are set to 0 and $K+1$, respectively. A $d$-dimensional learnable embedding is applied to obtain the representation of each position, and we pack them into a matrix $P \in \mathbb{R}^{L \times d}$. Finally, we compute the global-stage hidden state following $\mathrm{Attn}_g(H) = \mathrm{softmax}\big(\frac{QK^{\top} + Q_P K_P^{\top}}{\sqrt{d_k}}\big)V$, where $Q$, $K$, and $V$ are linear projections of $H$ and $Q_P$, $K_P$ are separate linear projections of $P$.
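A sketch of a decoupled global attention in this spirit is shown below, where content-content and position-position scores are computed with independent projections and summed before the softmax; the exact way the two terms are combined is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledGlobalAttention(nn.Module):
    """Sketch of global attention with a separated triple-position term:
    content-content scores and position-position scores use independent
    projections and are summed before softmax (illustrative assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q_p, self.k_p = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, hidden, pos_emb):
        # hidden, pos_emb: (seq_len, dim)
        content = self.q(hidden) @ self.k(hidden).transpose(-1, -2)     # entity-entity scores
        position = self.q_p(pos_emb) @ self.k_p(pos_emb).transpose(-1, -2)  # position-position scores
        scores = (content + position) / self.scale
        return F.softmax(scores, dim=-1) @ self.v(hidden)
```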

Prediction and Optimization
Assuming the last hidden state of [CLS] is $c$, we obtain the final score of the query following $s = \mathrm{softmax}(W_s c + b_s)$, in which $W_s \in \mathbb{R}^{2 \times d}$ and $b_s \in \mathbb{R}^{2}$. $s$ is a 2-dimensional real vector whose components indicate the probability of whether the query holds. We calculate the cross-entropy loss $\mathcal{L} = -\sum_{S \in \mathcal{Q}^{+} \cup \mathcal{Q}^{-}} \big(y_S \log s_{S,1} + (1 - y_S)\log s_{S,0}\big)$, where $\mathcal{Q}^{+}$ and $\mathcal{Q}^{-}$ indicate the positive and negative query sequences, respectively, and $y_S$ is the sequence label (1 for positive, 0 for negative).
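A compact sketch of the scoring head and the loss over positive and negative query sequences; the two-way linear head and the cross-entropy form follow the description above, while the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Maps the final [CLS] hidden state to a 2-way plausibility distribution."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 2)

    def forward(self, cls_hidden):       # cls_hidden: (batch, dim)
        return self.proj(cls_hidden)     # logits over {implausible, plausible}

def sequence_loss(logits_pos, logits_neg):
    """Cross-entropy over positive (label 1) and negative (label 0) query sequences."""
    logits = torch.cat([logits_pos, logits_neg], dim=0)
    labels = torch.cat([torch.ones(len(logits_pos), dtype=torch.long),
                        torch.zeros(len(logits_neg), dtype=torch.long)])
    return F.cross_entropy(logits, labels)
```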
Experiments
Datasets and Baselines
We use two public datasets, NELL-One and Wiki-One, proposed by GMatching (Xiong et al. 2018). We select Mean Reciprocal Rank (MRR), Hits@1 (H1), and Hits@10 (H10) as evaluation metrics. The few-shot size is set to 1, i.e., one-shot. We compare TransAM with the three baselines most relevant to our research: GMatching (Xiong et al. 2018), FSRL (Zhang et al. 2020), and FAAN (Sheng et al. 2020). Results of FSRL and FAAN are obtained using their official implementations with the recommended hyperparameters reported in their papers.
Implementation Details
We choose ComplEx pre-trained embedding vectors to initialize all entities and relations. The embedding dimensionality is 100 for NELL-One and 50 for Wiki-One. We set the number of Transformer layers to 3 and 4, and the number of attention heads to 4 and 8, for NELL-One and Wiki-One, respectively. We warm up the model over the first 10k steps by linearly increasing the learning rate to its peak value for each dataset, then decrease it linearly to 0 until the last step.
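A minimal sketch of such a linear warm-up followed by linear decay using PyTorch's LambdaLR; the optimizer choice, peak learning rate, and total step count here are placeholders rather than the values used in our experiments.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_decay(optimizer, warmup_steps=10_000, total_steps=100_000):
    """Linearly increase the LR to its peak over warmup_steps, then decay it to 0."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

# Placeholder usage: the model, optimizer, and peak LR are illustrative.
model = torch.nn.Linear(100, 100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # peak LR is a placeholder
scheduler = linear_warmup_decay(optimizer)
for step in range(5):
    optimizer.step()
    scheduler.step()
```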
Results and Conclusions
From Table 1, we observe that TransAM outperforms all three baselines in MRR and Hits@1 on both datasets. Given only one reference instance, matching models are expected to capture accurate semantic meanings for prediction. These results reveal that TransAM is well suited to addressing the few-shot issue in knowledge graph completion. Our future work may consider introducing sophisticated structural bias into the Transformer for complex few-shot relations.
Table 1: One-shot results on NELL-One and Wiki-One.

| Model | NELL-One H1 | NELL-One H10 | NELL-One MRR | Wiki-One H1 | Wiki-One H10 | Wiki-One MRR |
|---|---|---|---|---|---|---|
| GMatching | .119 | .313 | .185 | .120 | .336 | .200 |
| FSRL | .114 | .294 | .170 | .091 | .267 | .149 |
| FAAN | .116 | .316 | .187 | .146 | .362 | .215 |
| TransAM | .152 | .360 | .225 | .184 | .358 | .242 |
Acknowledgments
This work is supported by the Beijing Nova Program of Science and Technology (Grant No. Z191100001119031) and the Guangxi Key Laboratory of Cryptography and Information Security (No. GCIS202111). Bo Cheng is the corresponding author.
References
- Sheng et al. (2020) Sheng, J.; Guo, S.; Chen, Z.; Yue, J.; Wang, L.; Liu, T.; and Xu, H. 2020. Adaptive Attentional Network for Few-Shot Knowledge Graph Completion. In EMNLP.
- Vaswani et al. (2017) Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NeurIPS.
- Xiong et al. (2018) Xiong, W.; Yu, M.; Chang, S.; Guo, X.; and Wang, W. Y. 2018. One-Shot Relational Learning for Knowledge Graphs. In EMNLP.
- Zhang et al. (2020) Zhang, C.; Yao, H.; Huang, C.; Jiang, M.; Li, Z.; and Chawla, N. 2020. Few-Shot Knowledge Graph Completion. In AAAI.