
TDRE: A Tensor Decomposition Based Approach for Relation Extraction

Bin-Bin Zhao [email protected] Liang Li [email protected], [email protected] Hui-Dong Zhang School of Mathematical Sciences, University of Electronic Science and Technology of China, 611731, Chengdu, P.R. China
Abstract

Extracting entity pairs along with their relation types from unstructured text is a fundamental subtask of information extraction. Most existing joint models rely on a fine-grained labeling scheme or on shared embedding parameters. These methods directly model the joint probability of multi-labeled triplets and therefore tend to extract redundant triplets over all relation types, even though each sentence usually contains very few relation types. In this paper, we first model the final triplet extraction result as a third-order tensor of word-to-word pairs enriched with each relation type. To obtain the relations contained in a sentence, we introduce an independent but jointly trained relation classification module. A tensor decomposition strategy is then used to decompose the triplet tensor into relational components guided by the predicted relations, which omits the calculations for unpredicted relation types. Based on this decomposition, we propose the Tensor Decomposition based Relation Extraction (TDRE) approach, which is able to extract overlapping triplets and avoid detecting unnecessary entity pairs. Experiments on the benchmark datasets NYT10, CoNLL04 and ADE demonstrate that the proposed method outperforms existing strong baselines.

keywords:
Relation extraction , Tensor decomposition , Natural language processing , Deep neural network

1 Introduction

Relation Extraction (RE) aims to extract entity pairs with relation types from unstructured text, which facilitates many other Natural Language Processing (NLP) tasks, including knowledge base construction and question answering. For example, given the sentence “Bill Gates co-founded Microsoft with his friend Paul Allen”, the task is to extract the triplet (Bill Gates, Founder, Microsoft). Traditional pipeline models consist of two separate subtasks, named entity recognition and relation classification: they first extract entities and then classify entity pairs into relation types, decoding the results into triplet form. However, pipeline models suffer from error propagation and do not fully exploit the relevance between the two subtasks. To tackle these problems, most recent studies focus on joint models, which integrate entity recognition and relation classification into a single model trained jointly, much like a multi-task learning paradigm. Owing to this reasonable information sharing, joint models have achieved better performance than pipeline models.

Most existing joint models rely on a fine-grained labeling scheme [1] or train by parameter sharing [2, 3] in the embedding layers. However, a fine-grained labeling scheme is unable to identify overlapping relations. As an improvement, the authors of [4] model triplet extraction as a multi-head selection problem that approximately calculates the joint probability $p(e_s, r, e_t)$, where $(e_s, r, e_t)$ is a triplet in the given sentence, $e_s$ denotes the source entity, $e_t$ the target entity and $r$ the relation between them. With these notations, the predicted approximate joint probabilities can be modeled as an $N \times N \times K$ tensor, where $N$ is the number of words in the sentence $s$ under consideration and $K$ is the number of predefined relation types. Each element of this tensor denotes the probability that a relation holds between a word pair. However, directly modeling the joint probabilities ignores the strong influence of relation classification, and wrong relation predictions can lead to unnecessary triplets. A triplet is correct only when all of its elements are predicted correctly, which depends on correctly predicted relations. Wrongly predicted relation types decrease the accuracy of the extracted triplets and cost a lot of unnecessary computation.

In this paper, we attempt to avoid predicting redundant entity pairs while still extracting overlapping triplets. In other words, we prune the extracted tensor from which overlapping relations are decoded. Based on the tensor notation above, we consider the final extracted triplets (the joint probability) as a three-dimensional tensor. By introducing the tensor-based decomposition into directional components (DEDICOM) algorithm [5], we decompose the triplet tensor into relational components so that it can effectively capture the interactions among relation types. From the perspective of probability, decomposing $p(e_s, r, e_t \mid s) = p(e_s, e_t \mid s, r)\, p(r \mid s)$ is an effective way to simplify the original joint probabilities. To extract the relation types $p(r \mid s)$ that a sentence contains, we treat relation classification as an independent module that is nevertheless part of the joint training process. We then define a new operation between the diagonal tensor of the DEDICOM algorithm and the result of relation classification, restricting the detection of entity pairs to the predicted relation types and thereby avoiding redundant triplets. The DEDICOM strategy also captures the asymmetric interactions among word pairs in a sentence, which suits our task because the extracted triplet $(e_s, r, e_t)$ is directional: $e_s$ is the source entity and $e_t$ is the target entity.

The main work of this paper includes:

  1. We propose a new joint learning framework with a tensor decomposition strategy for relation extraction (TDRE). Additionally, we employ the DEDICOM strategy to capture the interactions among relation types.

  2. We introduce a new jointly trained relation classification module and feed its results into the tensor decomposition process, which avoids extracting redundant triplets.

  3. We conduct extensive experiments on three benchmark datasets, NYT10, CoNLL04 and ADE, which show that our model achieves better performance than most existing strong baseline models.

2 Related Work

Many previous relation extraction approaches focus on manual feature extraction, which heavily relies on NLP tools and is sometimes only applicable to specific fields. In recent years, neural networks and deep learning have attracted more and more attention, and deep neural network models have made significant progress in relation extraction without complicated handcrafted features. Relation extraction methods can be divided into two categories: pipeline models [6, 7] and joint models [8, 1, 4, 2]. Pipeline models split the task into two separate models, a named entity recognition model and a relation classification model. However, separate models cannot exploit the potential relevance between the two subtasks and suffer from error propagation. Joint models combine the subtasks into a single model trained jointly through shared parameters.

Zheng et al. [1] introduced a novel tagging scheme where each word is tagged with a unique tag combining an entity type and a relation type; hence, the set of word tags is the Cartesian product of the relation types and the entity types. However, this scheme is unable to deal with overlapping relations, i.e., one word may be mapped to many other words under different relations in the text. Zeng et al. [9] proposed a sequence-to-sequence copy mechanism to decode overlapping triplets. Bekoulis et al. [4] cast entity-relation extraction as a multi-head selection problem, which also effectively handles overlapping relations. With the development of graph convolutional neural networks, graph representations have been used to capture not only ordered features along the timeline but also features between different nodes in space. Sun et al. [2] made full use of the relevance between entities and relation types by building an entity-relation graph to capture combinations of entity pairs and valid relations. Guo et al. [10] also used graph convolutional networks to capture structured information from dependency trees and to select relevant sub-structures with a soft-pruning approach. Nevertheless, these methods predict every relation for every word-to-word pair and still suffer from redundant triplets. Takanobu et al. [11] first detected relation types and then identified the related entity pairs with a hierarchical reinforcement learning strategy, which avoids predicting redundant relation types to some extent.

Inspired by the aforementioned methods, we introduce an independent but jointly trained relation classification module. Based on the DEDICOM strategy, we develop a joint learning framework for relation extraction with tensor decomposition algorithms. The proposed tensor decomposition based relation extraction model effectively avoids predicting redundant triplets.

3 Model

In this section, we present our tensor decomposition based relation extraction model (TDRE) which is illustrated in Figure 1.

Figure 1: The overall architecture of the proposed tensor decomposition based model for relation extraction.

The relation extraction problem is defined as follows. We are given the relation type set $\mathcal{R} = R \cup \{\text{``N''}\}$, where $R$ is the set of predefined relation types and “N” stands for the none-relation type; a word that is not matched with any other word is assigned the relation type “N”. Given a sentence $S$ consisting of $n$ words $w_1 w_2 \ldots w_n$, the goal is to structure the text into multiple triplets $(e_s, r, e_t)$, where $e_s$ denotes the source entity, $e_t$ the target entity and $r$ the relation between them.

3.1 Embedding Layer and BiLSTM layer

To map each word into a word vector, we use pre-trained word embeddings from the skip-gram word2vec model [12]. We combine them with character embeddings processed by convolutional neural networks (CNN), since these capture morphological features such as prefixes and suffixes. A Bidirectional Long Short-Term Memory network (BiLSTM) is then used to encode bidirectional contextual information [13],

$h_i = \text{BiLSTM}(x_i, h_{i-1}),$ (1)

where $h_i$ is the BiLSTM hidden state, obtained by concatenating the forward and backward hidden states at position $i$ of the given sequence, and $x_i$ is the embedding representation of the $i$th word. Formally, $x_i = [\text{word\_emb}(w_i); \text{CNN}(\text{char\_emb}(w_i))]$, combining the pre-trained word vector and the CNN-based character embedding. This layer produces a final representation matrix $A$ for the given sentence:

$A = [h_1, h_2, \ldots, h_n] \in \mathbb{R}^{n \times d_h},$ (2)

where $n$ is the sequence length and $d_h$ is the dimension of the concatenated bidirectional hidden states of the BiLSTM.
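
To make this layer concrete, the following is a minimal PyTorch sketch of the encoder; the class name, vocabulary sizes, character-CNN pooling and default hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Word embedding + char-CNN embedding encoded by a BiLSTM (Eqs. 1-2)."""
    def __init__(self, n_words, n_chars, word_dim=50, char_dim=16,
                 char_filters=30, hidden=64, num_layers=3):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)   # pre-trained word2vec in the paper
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden, num_layers=num_layers,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, n); char_ids: (batch, n, max_word_len)
        b, n, L = char_ids.shape
        w = self.word_emb(word_ids)                                   # (batch, n, word_dim)
        c = self.char_emb(char_ids).view(b * n, L, -1).transpose(1, 2)
        c = torch.relu(self.char_cnn(c)).max(dim=-1).values           # max-pool over characters
        c = c.view(b, n, -1)
        x = torch.cat([w, c], dim=-1)                                 # x_i = [word_emb; CNN(char_emb)]
        A, _ = self.bilstm(x)                                         # A in R^{n x d_h}, d_h = 2*hidden
        return A
```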

3.2 Named Entity Recognition Model

To identify the entities in a given sentence, we regard entity recognition as a sequence labeling problem, as in [14]. A Conditional Random Field (CRF) layer is employed here. Its state transition matrix makes full use of neighboring tag information, which helps both entity type identification and boundary recognition. Given the sentence representation $A$ after the embedding and BiLSTM layers, the entity tag set $\{y_1, y_2, \ldots, y_m\}$ and the feature score functions $F_1, F_2, \ldots, F_J$, the probabilistic model is formulated as:

$p(y_k \mid h_i) = \dfrac{\exp\left(\sum_{j=1}^{J} \lambda_j F_j(y_k, h_i)\right)}{\sum_{y'} \exp\left(\sum_{j=1}^{J} \lambda_j F_j(y', h_i)\right)},$ (3)

where $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, m$, and $\lambda_j$ is the weight of the $j$th feature function in the final probabilistic score. Each feature function $F_j$ may consist of several sub-feature functions and can be written as $F_j = \sum_i f_j(y_{i-1}, y_i, A, i)$, where $i$ is the position of the current word in the sentence, $y_i$ is the entity tag of the current word and $y_{i-1}$ is the entity tag of the previous word. With these definitions, the loss function is:

$L_{ner} = -\log p(y \mid A)$ (4)
$\qquad\; = -\sum_{j=1}^{J} \lambda_j F_j(y, A) + \log Z(\lambda, A),$ (5)

where $y$ is the ground-truth label sequence and $Z(\lambda, A) = \sum_{y'} \exp\left(\sum_{j=1}^{J} \lambda_j F_j(y', A)\right)$. Our optimization goal is to minimize this loss function, and entity tags are finally decoded with the Viterbi algorithm.
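
As a rough illustration of this tagging head, the sketch below places a linear emission layer and a CRF on top of the shared representation $A$; it assumes the third-party pytorch-crf package and is a sketch rather than the authors' implementation.

```python
import torch.nn as nn
from torchcrf import CRF  # third-party `pytorch-crf` package (assumed available)

class NERHead(nn.Module):
    """Linear emission scores over A followed by a CRF layer (Eqs. 3-5)."""
    def __init__(self, d_h, n_tags):
        super().__init__()
        self.emission = nn.Linear(d_h, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def loss(self, A, tags, mask):
        # Negative log-likelihood -log p(y | A), as in Eqs. (4)-(5)
        return -self.crf(self.emission(A), tags, mask=mask, reduction='mean')

    def decode(self, A, mask):
        # Viterbi decoding of the most likely tag sequence
        return self.crf.decode(self.emission(A), mask=mask)
```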

3.3 Relation Classification Model

Classifying relations alone is much easier than simultaneously classifying relations together with their related entity pairs. In this module, we treat relation classification as an independent multi-label classification problem, since each sentence may contain more than one relation type [15, 16, 17, 18]. To share the word representations, we use the outputs of the BiLSTM network as the inputs of the classification model, which helps capture the interactions between entity recognition and relation classification. Note that we classify the entire sentence here rather than every word in the sentence.

A linear layer is used in the relation classification model, denoted as $f(A) = WA + b$, where $A$ is the sentence representation and $W$ and $b$ are trainable parameters. For the multi-label setting, the classification result is expressed as $P = (P_1, P_2, \ldots, P_K)$, and each element is calculated as follows:

$P_k = \begin{cases} 1, & \text{if } \sigma(f(A))_k \geq \gamma_1; \\ 0, & \text{if } \sigma(f(A))_k < \gamma_1, \end{cases}$ (6)

where $\sigma$ is the sigmoid activation function and $\gamma_1$ is the threshold for predicting the relation types contained in the sentence.

If the true relation types are denoted by $y$, then the goal of this classification module is to minimize the loss function:

$L_{rel} = -\sum_{k=1}^{K} y_k \log p_k,$ (7)

where $y_k$ is the ground-truth indicator of the $k$th relation type and $p_k$ is the predicted probability that the current sentence contains the $k$th relation type. During training, the sigmoid cross-entropy loss is used because this is a multi-label problem.
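
A minimal sketch of this module is given below. It assumes mean pooling of the BiLSTM states into a single sentence vector (the pooling choice and the default threshold are assumptions); the sigmoid threshold corresponds to Eq. (6) and the sigmoid cross-entropy to Eq. (7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    """Sentence-level multi-label relation classification (Eqs. 6-7)."""
    def __init__(self, d_h, n_relations, gamma1=0.5):
        super().__init__()
        self.linear = nn.Linear(d_h, n_relations)   # f(A) = WA + b
        self.gamma1 = gamma1                        # threshold gamma_1

    def forward(self, A, mask):
        # A: (batch, n, d_h); mask: (batch, n) with 1 for real tokens
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1)
        sent = (A * mask.unsqueeze(-1)).sum(dim=1) / lengths   # mean-pooled sentence vector
        return self.linear(sent)                               # one logit per relation type

    def predict(self, A, mask):
        # P_k = 1 iff sigmoid(f(A))_k >= gamma_1  (Eq. 6)
        return (torch.sigmoid(self.forward(A, mask)) >= self.gamma1).float()

def relation_loss(logits, gold):
    # Sigmoid cross-entropy over the K relation types (multi-label, Eq. 7)
    return F.binary_cross_entropy_with_logits(logits, gold)
```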

3.4 Triplet Extraction Model

Given a sentence with $n$ words as defined above, the target of relation extraction is to extract the triplets over the predefined relation type set $\mathcal{R}$ from the unstructured text. Each extracted triplet $(e_s, r, e_t)$ corresponds to specified entity spans $e_s, e_t$ and the relation $r$ between them. For a specific relation type $k$, we map each word to the other words in the sentence with an indicator function. The word-to-word mapping can be modeled as a matrix $X \in \mathbb{R}^{n \times n}$, where the element $x_{ij}$ indicates whether the $i$th word and the $j$th word form a triplet-related word pair; here $i$ and $j$ denote the last words of the source entity and the target entity, respectively. The indicator function gives $x_{ij} = 1$ when the $i$th word maps to the $j$th word under the specific relation type $k$. Extending the matrix $X$ with a relation type component, we obtain a tensor representation $\mathcal{X}$ of the extraction, defined as follows:

$\mathcal{X}_{ijk} = \begin{cases} 1, & \text{if there is a triplet } (i, k, j); \\ 0, & \text{otherwise,} \end{cases}$ (8)

Note that in this formula, $i$ and $j$ are the tail words of the two entity spans.

To extract useful information from this tensor, we employ a tensor decomposition algorithm, the decomposition into directional components (DEDICOM) strategy introduced by Harshman et al. [5]. For a sequence with $n$ words, the mapping matrix $X \in \mathbb{R}^{n \times n}$ must describe the asymmetric relationships between each word and the others. This suits our setting because each extracted triplet is directional: the source entity and the target entity are ordered. Triplet extraction can thus be modeled as a three-way DEDICOM model whose target tensor is $\mathcal{X} \in \mathbb{R}^{n \times n \times K}$, where $K$ denotes the number of relation types. The decomposition is:

$\mathcal{X}_k \approx A \mathcal{D}_k H \mathcal{D}_k A^{\mathsf{T}} \quad \text{for } k = 1, \ldots, K,$ (9)

where $A \in \mathbb{R}^{n \times d_h}$ is the sequence representation produced by the BiLSTM, and $\mathcal{D} \in \mathbb{R}^{d_h \times d_h \times K}$ and $H \in \mathbb{R}^{d_h \times d_h}$ are parameters. The matrices $\mathcal{D}_k \in \mathbb{R}^{d_h \times d_h}$ are diagonal, and the diagonal entry $(\mathcal{D}_k)_{ii}$, $i = 1, 2, \ldots, d_h$, indicates the participation of the $i$th latent component in relation $k$.

As noted above, this formulation would require predicting all relation types for each word pair, while a sentence may contain only a few relation types. To address this problem and to further utilize the relation types predicted by the relation classification model, we make specific settings for the factor $\mathcal{D}$ in the tensor decomposition. It is unnecessary to compute the mapping matrix $\mathcal{X}_r$ when the $r$th relation type is not predicted: we set $\mathcal{D}_r'$ to a zero matrix, so that $\mathcal{X}_r$ becomes a zero matrix after calculation, meaning there is no triplet in the $r$th relation component. The decomposition strategy therefore avoids predicting unnecessary triplets and simultaneously reduces prediction time. Accordingly, we define a custom operation $\circ$ as follows:

$\mathcal{D}_k' = \mathcal{D} \circ P_k = \begin{cases} \text{diagonal matrix}, & \text{if } P_k = 1; \\ \text{zero matrix}, & \text{if } P_k = 0. \end{cases}$ (10)

The final decomposition can then be formalized as:

$\mathcal{X} \approx A (\mathcal{D} \circ P) H (\mathcal{D} \circ P) A^{\mathsf{T}},$ (11)

where $P \in \mathbb{R}^{K}$ denotes the prediction result of the relation classification module.
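
The following PyTorch sketch illustrates the masked decomposition of Eqs. (9)-(11): only the diagonals of the matrices $\mathcal{D}_k$ are stored, and the predicted relation vector $P$ zeroes out the factors of unpredicted relations. The exact parameterization and initialization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    """Masked DEDICOM scoring: X'_k = A (D_k ∘ P_k) H (D_k ∘ P_k) A^T  (Eqs. 9-11)."""
    def __init__(self, d_h, n_relations):
        super().__init__()
        # Only the diagonals of the K diagonal matrices D_k are stored: shape (K, d_h)
        self.diag = nn.Parameter(torch.randn(n_relations, d_h) * 0.01)
        self.H = nn.Parameter(torch.randn(d_h, d_h) * 0.01)

    def forward(self, A, P):
        # A: (batch, n, d_h) sentence representation; P: (batch, K) 0/1 relation predictions
        d = self.diag.unsqueeze(0) * P.unsqueeze(-1)        # D' = D ∘ P  (Eq. 10), (batch, K, d_h)
        AD = A.unsqueeze(1) * d.unsqueeze(2)                # A D'_k, (batch, K, n, d_h)
        ADH = torch.einsum('bkne,ef->bknf', AD, self.H)     # A D'_k H
        scores = torch.einsum('bknf,bkmf->bnmk', ADH, AD)   # A D'_k H D'_k A^T, (batch, n, n, K)
        return scores
```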

From the perspective of probability distributions, we do not directly model the joint probability as in [4] and [3], but decompose the joint probability into a relation component and a conditional probability:

$p(x_i, r_k, x_j \mid S) = p(x_i, x_j \mid r_k, S)\, p(r_k \mid S),$ (12)

for $i, j = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, K$, where $p$ denotes a probability distribution, $x_i$ and $x_j$ denote the $i$th and $j$th word representations in the sentence $S$, and $r_k$ denotes the $k$th relation type. Since each word may be mapped to more than one word and relation type, we apply the sigmoid function here to amplify the differences. The final triplet decoding procedure is then:

$P(x_i, r_k, x_j \mid S) = \begin{cases} 1, & \sigma(\mathcal{X}_{ijk}') \geq \gamma_2; \\ 0, & \sigma(\mathcal{X}_{ijk}') < \gamma_2, \end{cases}$ (13)

where $\mathcal{X}_{ijk}'$ is the calculated tensor and $\gamma_2$ is the threshold for judging whether a triplet is true.

The goal of this decomposed relation extraction module is to minimize the cross-entropy loss $L_{extract}$ during training:

$L_{extract} = -\sum_{i,j,k} \mathcal{X}_{ijk} \log \mathcal{X}_{ijk}',$ (14)

where $\mathcal{X}_{ijk}$ is the ground-truth label and $\mathcal{X}_{ijk}'$ is the predicted triplet tensor.
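
A short sketch of the decoding rule in Eq. (13) and the extraction loss in Eq. (14) follows; using binary_cross_entropy_with_logits (which also includes the negative term) is an implementation assumption made for numerical stability.

```python
import torch
import torch.nn.functional as F

def decode_triplets(scores, gamma2=0.5):
    """Eq. (13): keep cells (i, j, k) whose sigmoid score reaches the threshold gamma_2."""
    keep = torch.sigmoid(scores) >= gamma2      # (batch, n, n, K) boolean tensor
    return keep.nonzero(as_tuple=False)         # rows of (batch, i, j, k) indices

def extraction_loss(scores, gold):
    """Cross-entropy between the scored tensor and the 0/1 gold tensor X (Eq. 14)."""
    return F.binary_cross_entropy_with_logits(scores, gold.float())
```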

Algorithm 1 Tensor Decomposition for Relation Extraction (TDRE)

Input: Sentence $S$, predefined relation type set $R$
Output: Triplets $\{(e_s, r, e_t)_j\}_{j=1}^{m}$, where $e_s, e_t$ are the source and target entities, $r$ denotes the relation type, and $m$ denotes the number of triplets.

1:  Define the length of the given sentence: $n$, and the number of relations: $K$.
2:  repeat
3:     Calculate word representation $A$ by Eq. (2);
4:     Identify entity types $Y_{ner}$ by Eq. (3);
5:     if entity label embedding then
6:        $A \leftarrow A + \text{label\_embedding}(Y_{ner})$
7:     end if
8:     Initialize decomposition factors $\mathcal{D}$ and $H$;
9:     if relation classification model then
10:        Classify relation types $P$ by Eq. (6);
11:        Calculate the masked diagonal relation matrices $\mathcal{D}'$ by Eq. (10);
12:        Calculate triplet tensor $\mathcal{X}$ by Eq. (11);
13:     else
14:        Calculate triplet tensor $\mathcal{X}$ by Eq. (9);
15:     end if
16:     Decode triplets with Eq. (13) and Eq. (8).
17:  until TDRE converges

3.5 Loss Function

To jointly train the proposed model, we combine the three objective loss functions and optimize all parameters together. The final loss function is computed as follows:

$L = L_{ner} + L_{rel} + L_{extract},$ (15)

where $L_{ner}$ is the loss of the named entity recognition module, $L_{rel}$ is the loss of the relation classification module and $L_{extract}$ is the loss of the final triplet extraction module computed with the DEDICOM algorithm. We optimize the model with the Adam optimizer and train the joint model in a multi-task learning fashion.
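
A sketch of one joint training step under Eq. (15) is given below, reusing the modules from the earlier sketches; the batch keys and the choice to feed the gold relation vector to the scorer during training are assumptions, not details specified in the paper.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, ner_head, rel_clf, scorer, optimizer, batch):
    """One joint update minimizing L = L_ner + L_rel + L_extract (Eq. 15)."""
    A = encoder(batch['word_ids'], batch['char_ids'])
    loss_ner = ner_head.loss(A, batch['tags'], batch['mask'])
    rel_logits = rel_clf(A, batch['mask'].float())
    loss_rel = F.binary_cross_entropy_with_logits(rel_logits, batch['relations'])
    # Assumption: the gold relation vector is fed to the scorer during training;
    # the paper only states that the three modules are trained jointly.
    scores = scorer(A, batch['relations'])
    loss_ext = F.binary_cross_entropy_with_logits(scores, batch['triplet_tensor'])
    loss = loss_ner + loss_rel + loss_ext
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```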

Finally, our TDRE algorithm is summarized in Algorithm 1.

4 Experiments

4.1 Datasets

We conduct experiments on three benchmark datasets for relation extraction: (i) NYT10, the original New York Times corpus [19] built with distant supervision and released by Takanobu et al. [11], who filtered the dataset by removing relations that appear in the training set but not in the test set, as well as sentences containing no relations. (ii) CoNLL04 [20], using the split of Gupta et al. [21] and Adel et al. [22]. The official entity type set is {Location, Organization, Person, Other}, where we omit the beginning, inside and outside tags of entity types here, and the official relation type set is {Kill, Live in, Located in, OrgBased in, Work for}. (iii) The Adverse Drug Events dataset (ADE) [23], where we use 80% of the data for training and 20% as the test set. Since there are no official entity types, the entity tag set {Beginning, Inside, Outside} is applied, and the relation type set is {Adverse-Effect, Drug-Disease Treatment}.

The statistics of the public datasets are reported in Table 1. For the ADE dataset, approximate triplet counts are shown because we conduct cross-validation on it.

Table 1: Statistics of the datasets
Datasets NYT10 CoNLL04 ADE
Relation types 29 5 2
Entity types 7 4 3
Training set 70339 910 3416
Training triplets 87739 1273 5650
Test set 4006 288 854
Test triplets 5859 422 1450

4.2 Details of Implementation

Similar to previous works, we use the 50-dimensional word embeddings trained on Wikipedia that are also used by [22]. Character embeddings are initialized randomly and updated during training, and the kernel size of the convolutional neural network is set to 3. We then concatenate the word embeddings and character embeddings as the final input embeddings. For the representation learning layers, we use a three-layer BiLSTM network with 64 hidden units. Unlike most existing models, we randomly initialize entity label embeddings of size 128 for all datasets and update them during optimization. Training is performed with the Adam optimizer with learning rate $10^{-3}$, and dropout with rate 0.1 is applied to the input embeddings and the BiLSTM hidden layers. Early stopping on the validation set is used to avoid overfitting.

Table 2: Main results on the NYT10 and NYT10-sub datasets, where NYT10-sub is selected from the NYT10 test set. All baseline results are taken from the HRL paper [11].
Methods NYT10 NYT10-sub
Pre Rec F1 Pre Rec F1
Tagging [1] 59.3 38.1 46.4 25.6 23.7 24.6
CopyR [9] 56.9 45.2 50.4 39.2 26.3 31.5
HRL [11] 71.4 58.6 64.4 81.5 47.5 60.0
MrMep [24] 71.7 63.5 67.3 83.2 55.0 66.2
TDRE (ours) 81.3 68.3 74.3 82.2 70.2 75.7

4.3 Results

Evaluation Metrics

For entity recognition, an entity is correctly identified only when its full span is recognized and assigned the correct type. For triplet extraction, a triplet is considered correct if and only if the source entity, the target entity and the relation type are all correct. We adopt precision, recall and micro-F1 to evaluate performance.
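
For reference, the following is a simple sketch of how micro precision, recall and F1 over extracted triplets can be computed from per-sentence sets of (source, relation, target) tuples; it is a standard routine, not the authors' evaluation script.

```python
def micro_prf(pred_triplets, gold_triplets):
    """pred_triplets / gold_triplets: one set of (source, relation, target) tuples per sentence."""
    tp = sum(len(p & g) for p, g in zip(pred_triplets, gold_triplets))
    n_pred = sum(len(p) for p in pred_triplets)
    n_gold = sum(len(g) for g in gold_triplets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```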

Baselines

For comparison, we choose the following models as baselines:

  1. Tagging [1]. It casts joint extraction as a sequence labeling problem with a special tagging scheme where the label of each word encodes the entity type, the entity role (source or target entity) and the relation type.

  2. CopyR [9]. A sequence-to-sequence learning framework with a copy mechanism that can handle overlapping relational triplets.

  3. HRL [11]. It applies hierarchical reinforcement learning, first detecting relations and then extracting the participating entities, treating the related entities as the arguments of a relation.

  4. MrMep [24]. A joint learning model which first classifies relations and then uses triplet attention to reinforce the connections between relations and entity pairs.

  5. Multi-head selection model [4]. It casts relation extraction as a multi-head selection problem, which can identify multiple relation types for each entity pair.

  6. Multi-head with adversarial training regularization [25]. It also achieves better performance on overlapping relations.

  7. Li et al. [26]. A neural joint model that simultaneously extracts biomedical entities and their relations using hand-crafted features and features derived from NLP tools.

  8. MultiQA [27]. It casts the task as multi-turn question answering, identifying answer spans (entities) from questions constructed for the predefined relations.

Experiment Results

The detailed comparisons between baseline models and our proposed model are summarized in Table 2 and Table 3 with precision, recall, micro-average F1. The best performances are marked in bold.

Table 3: Main results of our method compared with previous baseline models on the CoNLL04 and ADE benchmarks. Overall F1 is the average of the two subtasks. Results of the compared baseline models come directly from the original papers.
Dataset Model Entity Recognition Triplet Extraction Overall F1
Pre Rec F1 Pre Rec F1
CoNLL04 Multi-Head 83.75 84.06 83.9 63.75 60.43 62.04 72.97
MultiHead-AT - - 83.61 - - 61.95 72.78
MultiQA 89.0 86.6 87.8 69.2 68.2 68.9 78.35
TDRE (ours) 91.85 91.34 91.59 80.90 76.67 78.73 85.16
ADE Li [26] 82.70 86.70 84.60 67.50 75.80 71.40 78.00
Multi-Head 84.72 88.16 86.40 72.10 77.24 74.58 80.49
MultiHead-AT - - 86.73 - - 75.52 81.13
TDRE (ours) 88.82 88.28 88.55 81.63 76.20 78.82 83.69

On the NYT10 dataset, the proposed TDRE model significantly outperforms the others. The sub test set NYT10-sub published by HRL [11] mostly contains overlapping relations, where triplets share both the source entity and the target entity. Our model shows the best performance particularly on NYT10-sub, which indicates that TDRE is more effective on overlapping problems. Table 3 further shows that our approach performs better than the baseline methods in both entity recognition and relation identification on the CoNLL04 and ADE datasets.

Focusing on the precision of triplet extraction in particular, the experimental results show large improvements. Theoretically, our model avoids extra computation for unpredicted relation types, whereas most other models compute every relation type for every entity pair. The agreement between the theoretical analysis and the experimental results shows that the model can indeed reduce the prediction of redundant triplets. All the experimental results indicate that modeling the relation extraction process by tensor decomposition into relation components is more effective than the baseline models, even without using BERT.

Ablation Study

To illustrate the effectiveness of the various parts of our model, we conduct ablation experiments on the CoNLL04 dataset. Concretely, we separately remove character embeddings (Char Emb.), entity label embeddings (Label Emb.) and the relation classification module (CLS). According to the results in Table 4, character embeddings, which capture additional information such as suffixes, prefixes and other morphological features, are helpful for triplet extraction. Entity label embeddings incorporate entity type and boundary information, which helps capture the integrity of entities. When the classification module is removed, the proposed approach degenerates into directly modeling the joint probability and the performance degrades severely in both entity recognition and triplet extraction. This indicates that the proposed tensor decomposition into relation components contributes greatly to triplet extraction.

Table 4: Ablation study on CoNLL04 dataset.
Methods Entity Triplet Extraction
F1 Pre Rec F1
TDRE 91.59 80.90 76.67 78.73
- Char Emb. 85.65 75.91 64.52 69.75
- Label Emb. 89.51 75.74 66.90 71.05
- CLS 88.80 74.53 66.90 70.51

Analysis of Relation Classification Model

Experiments have been conducted on the CoNLL04 dataset with recurrent neural networks (RNN), recurrent convolutional neural networks (RCNN) and bidirectional long short-term memory networks (BiLSTM) as the relation classification model. The results in Table 5 show that our model is still better than the previously compared models [4, 27] even without the classification module. We conjecture that this is because the raw DEDICOM strategy already decomposes with a diagonal tensor that interprets the connections between relation types, and it is originally suited to capturing asymmetry, which fits our ordered triplets. Looking at the results with RNN and RCNN, the performance is better than the multi-head model but not better than MultiQA [27]. There are two possible reasons: (1) these classification models focus on identifying relations at the expense of entity recognition accuracy; (2) their parameters are not shared with named entity recognition, ignoring the inherent connections between relation identification and entity recognition. After parameter sharing through the BiLSTM network, the model achieves substantial improvements in both entity recognition and triplet extraction. Better performance in relation identification or entity recognition yields better triplet extraction performance.

Table 5: Various classification models on the CoNLL04 dataset.
Model Rel Entity Triplet Extraction
F1 F1 Pre Rec F1
- CLS - 88.8 74.5 66.9 70.5
RNN 97.6 85.0 70.5 60.2 64.9
RCNN 99.7 87.7 75.2 66.4 70.5
BiLSTM 94.1 91.6 80.9 76.7 78.7

5 Conclusion

In this paper, we introduce a joint neural model with a tensor decomposition based strategy for multi-labeled relation extraction (TDRE). The introduced tensor models the connections of word pairs in each relation component, which handles overlapping relations. Relation classification is treated as an independent but jointly trained module that predicts the relation types contained in a sentence. By embedding the predicted relations into the tensor decomposition formula as relational component factors, the model avoids identifying entity pairs for unrelated relation types. Experiments on public datasets demonstrate that the proposed model achieves significant improvements over previous state-of-the-art baselines. In the future, we plan to explore other tensor decomposition approaches and to extend our model to other tensor based modeling tasks.

Acknowledgments

This work is supported by the Science and Technology Plan Project of Sichuan Province (Key R & D Project, 2020YFS0465) and Mathematics Teaching Research and Development Center for Colleges (CMC20190501).

References