
TDRE: A Tensor Decomposition Based Approach for Relation Extraction

Bin-Bin Zhao [email protected] Liang Li [email protected], [email protected] Hui-Dong Zhang School of Mathematical Sciences, University of Electronic Science and Technology of China, 611731, Chengdu, P.R. China
Abstract

Extracting entity pairs along with their relation types from unstructured text is a fundamental subtask of information extraction. Most existing joint models rely on a fine-grained labeling scheme or on shared embedding parameters. These methods directly model the joint probability of multi-labeled triplets and therefore tend to extract redundant triplets over all relation types, even though each sentence usually contains very few relation types. In this paper, we first model the final triplet extraction result as a third-order tensor of word-to-word pairs enriched with each relation type. To obtain the relations contained in a sentence, we introduce an independent but jointly trained relation classification module. A tensor decomposition strategy is then used to decompose the triplet tensor into relational components guided by the predicted relations, which omits the calculations for unpredicted relation types. Based on this decomposition, we propose the Tensor Decomposition based Relation Extraction (TDRE) approach, which is able to extract overlapping triplets and avoid detecting unnecessary entity pairs. Experiments on the benchmark datasets NYT10, CoNLL04 and ADE demonstrate that the proposed method outperforms existing strong baselines.

keywords:
Relation extraction , Tensor decomposition , Natural language processing , Deep neural network

1 Introduction

Relation Extraction (RE) aims to extract entity pairs with relation types from unstructured text, which facilitates many other Natural Language Processing (NLP) tasks, including knowledge base construction and question answering. For example, given the sentence “Bill Gates co-founded Microsoft with his friend Paul Allen”, the task is to extract the triplet (Bill Gates, Founder, Microsoft). Traditional pipeline models consist of two separate subtasks, named entity recognition and relation classification: they first extract entities and then classify entity pairs into relation types, decoding the results into triplet form. However, pipeline models suffer from error propagation and do not fully exploit the relevance between the two subtasks. To tackle these problems, most recent studies focus on joint models, which integrate entity recognition and relation classification into a single model trained jointly, much like a multi-task learning paradigm. Owing to this reasonable information sharing, joint models have achieved better performance than pipeline models.

Most existing joint models rely on a fine-grained labeling scheme [1] or train by parameter sharing [2, 3] in the embedding layers. However, a fine-grained labeling scheme is unable to identify overlapping relations. As an improvement, the authors of [4] model triplet extraction as a multi-head selection problem that approximately calculates the joint probability $p(e_s, r, e_t)$, where $(e_s, r, e_t)$ is a triplet in the given sentence, $e_s$ denotes the source entity, $e_t$ the target entity and $r$ the relation between them. With these notations, the predicted approximate joint probabilities can be modeled as an $N \times N \times K$ tensor, where $N$ is the number of words in the sentence $s$ under consideration and $K$ is the number of predefined relation types. Each element of this tensor denotes the probability that a relation holds between a word pair. However, directly modeling the joint probabilities ignores the strong influence of relation classification, and wrong relation predictions can lead to unnecessary triplets. A triplet is correct only when all of its elements are predicted correctly, which depends on correctly predicted relations. Wrongly predicted relation types decrease the accuracy of the extracted triplets and cost a lot of unnecessary computation.

In this paper, we attempt to avoid predicting redundant entity pairs while still extracting overlapping triplets. In other words, we prune the extracted tensor from which overlapping relations are decoded. Based on the tensor notation above, we consider the final extracted triplets (the joint probability) as a three-dimensional tensor. By introducing the tensor-based decomposition into directional components (DEDICOM) algorithm [5], we decompose the triplet tensor into relational components so that it can effectively capture the interactions among relation types. From the perspective of probability, decomposing $p(e_s, r, e_t \mid s) = p(e_s, e_t \mid s, r)\, p(r \mid s)$ is an effective way to simplify the original joint probabilities. To extract the relation types $p(r \mid s)$ that a sentence contains, we treat relation classification as an independent module that is nevertheless part of the joint training process. We then define a new operation between the diagonal tensor of the DEDICOM algorithm and the result of relation classification, restricting the detection of entity pairs to the predicted relation types and thereby avoiding redundant triplets. The DEDICOM strategy also captures the asymmetric interactions among word pairs in a sentence, which suits our task because the extracted triplet $(e_s, r, e_t)$ is directional: $e_s$ is the source entity and $e_t$ is the target entity.

The main work of this paper includes:

  1. We propose a new joint learning framework with a tensor decomposition strategy for relation extraction (TDRE). Additionally, we employ the DEDICOM strategy to capture the interactions among relation types.

  2. We introduce a new jointly trained relation classification module and feed its results into the tensor decomposition process, which avoids extracting redundant triplets.

  3. We conduct extensive experiments on three benchmark datasets, NYT10, CoNLL04 and ADE, which show that our model achieves better performance than most existing strong baseline models.

2 Related Work

Many previous relation extraction approaches focus on manual feature extraction, which heavily relies on NLP tools and is sometimes only applicable to specific fields. In recent years, neural networks and deep learning have attracted more and more attention, and deep neural network models have made significant progress in relation extraction without complicated handcrafted features. Relation extraction methods can be divided into two categories: pipeline models [6, 7] and joint models [8, 1, 4, 2]. Pipeline models split the task into two separate models, a named entity recognition model and a relation classification model. However, separate models cannot exploit the potential relevance between the two subtasks and suffer from error propagation. Joint models combine the subtasks into a single model trained jointly through shared parameters.

Zheng et al. [1] introduced a novel tagging scheme where each word is tagged with a unique tag combining an entity type and a relation type; hence, the set of word tags is the Cartesian product of the relation types and the entity types. However, this scheme is unable to deal with overlapping relations, i.e., one word may be mapped to many other words under different relations in the text. Zeng et al. [9] proposed a sequence-to-sequence copy mechanism to decode overlapping triplets. Bekoulis et al. [4] cast entity-relation extraction as a multi-head selection problem, which also effectively handles overlapping relations. With the development of graph convolutional neural networks, graph representations have been used to capture not only ordered features along the timeline but also features between different nodes in space. Sun et al. [2] made full use of the relevance between entities and relation types by building an entity-relation graph to capture combinations of entity pairs and valid relations. Guo et al. [10] also used graph convolutional networks to capture structured information from dependency trees and to select relevant sub-structures with a soft-pruning approach. Nevertheless, these methods predict every relation for every word-to-word pair and still suffer from redundant triplets. Takanobu et al. [11] first detected relation types and then identified the related entity pairs with a hierarchical reinforcement learning strategy, which avoids predicting redundant relation types to some extent.

Inspired by the aforementioned methods, we introduce an independent but jointly trained relation classification module. Based on the DEDICOM strategy, we develop a joint learning framework for relation extraction with tensor decomposition algorithms. The proposed tensor decomposition based relation extraction model effectively avoids predicting redundant triplets.

3 Model

In this section, we present our tensor decomposition based relation extraction model (TDRE) which is illustrated in Figure 1.

Figure 1: The overall architecture of the proposed tensor decomposition based model for relation extraction.

The relation extraction problem is defined as follows. We are given the relation type set $\mathcal{R} = R \cup \{\text{``N''}\}$, where $R$ is the set of predefined relation types and “N” stands for the none-relation type; a word that is not matched with any other word is assigned the relation type “N”. Given a sentence $S$ consisting of $n$ words $w_1 w_2 \ldots w_n$, the goal is to structure the text into multiple triplets $(e_s, r, e_t)$, where $e_s$ denotes the source entity, $e_t$ the target entity and $r$ the relation between them.

3.1 Embedding Layer and BiLSTM layer

To map each word into a word vector, we use pre-trained word embeddings from the skip-gram word2vec model [12]. We combine them with character embeddings processed by convolutional neural networks (CNN), since these capture morphological features such as prefixes and suffixes. A Bidirectional Long Short-Term Memory network (BiLSTM) is then used to encode bidirectional contextual information [13],

$h_i = \text{BiLSTM}(x_i, h_{i-1}),$ (1)

where $h_i$ is the BiLSTM hidden state, obtained by concatenating the forward and backward hidden states at position $i$ of the given sequence, and $x_i$ is the embedding representation of the $i$th word. Formally, $x_i = [\text{word\_emb}(w_i); \text{CNN}(\text{char\_emb}(w_i))]$, combining the pre-trained word vector and the CNN-based character embedding. This layer produces a final representation matrix $A$ for the given sentence:

$A = [h_1, h_2, \ldots, h_n] \in \mathbb{R}^{n \times d_h},$ (2)

where $n$ is the sequence length and $d_h$ is the dimension of the concatenated bidirectional hidden states of the BiLSTM.
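
To make this layer concrete, the following is a minimal PyTorch sketch of the encoder; the class name, vocabulary sizes, character-CNN pooling and default hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Word embedding + char-CNN embedding encoded by a BiLSTM (Eqs. 1-2)."""
    def __init__(self, n_words, n_chars, word_dim=50, char_dim=16,
                 char_filters=30, hidden=64, num_layers=3):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)   # pre-trained word2vec in the paper
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden, num_layers=num_layers,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, n); char_ids: (batch, n, max_word_len)
        b, n, L = char_ids.shape
        w = self.word_emb(word_ids)                                   # (batch, n, word_dim)
        c = self.char_emb(char_ids).view(b * n, L, -1).transpose(1, 2)
        c = torch.relu(self.char_cnn(c)).max(dim=-1).values           # max-pool over characters
        c = c.view(b, n, -1)
        x = torch.cat([w, c], dim=-1)                                 # x_i = [word_emb; CNN(char_emb)]
        A, _ = self.bilstm(x)                                         # A in R^{n x d_h}, d_h = 2*hidden
        return A
```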

3.2 Named Entity Recognition Model

To identify the entities in a given sentence, we regard entity recognition as a sequence labeling problem, as in [14]. A Conditional Random Field (CRF) layer is employed here. Its state transition matrix makes full use of neighboring tag information, which helps both entity type identification and boundary recognition. Given the sentence representation $A$ after the embedding and BiLSTM layers, the entity tag set $\{y_1, y_2, \ldots, y_m\}$ and the feature score functions $F_1, F_2, \ldots, F_J$, the probabilistic model is formulated as:

$p(y_k \mid h_i) = \dfrac{\exp\left(\sum_{j=1}^{J} \lambda_j F_j(y_k, h_i)\right)}{\sum_{y'} \exp\left(\sum_{j=1}^{J} \lambda_j F_j(y', h_i)\right)},$ (3)

where $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, m$, and $\lambda_j$ is the weight of the $j$th feature function in the final probabilistic score. Each feature function $F_j$ may consist of several sub-feature functions and can be written as $F_j = \sum_i f_j(y_{i-1}, y_i, A, i)$, where $i$ is the position of the current word in the sentence, $y_i$ is the entity tag of the current word and $y_{i-1}$ is the entity tag of the previous word. With these definitions, the loss function is:

$L_{ner} = -\log p(y \mid A)$ (4)
$\qquad\; = -\sum_{j=1}^{J} \lambda_j F_j(y, A) + \log Z(\lambda, A),$ (5)

where $y$ is the ground-truth label sequence and $Z(\lambda, A) = \sum_{y'} \exp\left(\sum_{j=1}^{J} \lambda_j F_j(y', A)\right)$. Our optimization goal is to minimize this loss function, and entity tags are finally decoded with the Viterbi algorithm.
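
As a rough illustration of this tagging head, the sketch below places a linear emission layer and a CRF on top of the shared representation $A$; it assumes the third-party pytorch-crf package and is a sketch rather than the authors' implementation.

```python
import torch.nn as nn
from torchcrf import CRF  # third-party `pytorch-crf` package (assumed available)

class NERHead(nn.Module):
    """Linear emission scores over A followed by a CRF layer (Eqs. 3-5)."""
    def __init__(self, d_h, n_tags):
        super().__init__()
        self.emission = nn.Linear(d_h, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def loss(self, A, tags, mask):
        # Negative log-likelihood -log p(y | A), as in Eqs. (4)-(5)
        return -self.crf(self.emission(A), tags, mask=mask, reduction='mean')

    def decode(self, A, mask):
        # Viterbi decoding of the most likely tag sequence
        return self.crf.decode(self.emission(A), mask=mask)
```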

3.3 Relation Classification Model

Classifying relations alone is much easier than simultaneously classifying relations together with their related entity pairs. In this module, we treat relation classification as an independent multi-label classification problem, since each sentence may contain more than one relation type [15, 16, 17, 18]. To share the word representations, we use the outputs of the BiLSTM network as the inputs of the classification model, which helps capture the interactions between entity recognition and relation classification. Note that we classify the entire sentence here rather than every word in the sentence.

A linear layer is used in the relation classification model, denoted as $f(A) = WA + b$, where $A$ is the sentence representation and $W$ and $b$ are trainable parameters. For the multi-label setting, the classification result is expressed as $P = (P_1, P_2, \ldots, P_K)$, and each element is calculated as follows:

$P_k = \begin{cases} 1, & \text{if } \sigma(f(A))_k \geq \gamma_1; \\ 0, & \text{if } \sigma(f(A))_k < \gamma_1, \end{cases}$ (6)

where $\sigma$ is the sigmoid activation function and $\gamma_1$ is the threshold for predicting the relation types contained in the sentence.

If the true relation types are denoted by $y$, then the goal of this classification module is to minimize the loss function:

$L_{rel} = -\sum_{k=1}^{K} y_k \log p_k,$ (7)

where $y_k$ is the ground-truth indicator of the $k$th relation type and $p_k$ is the predicted probability that the current sentence contains the $k$th relation type. During training, the sigmoid cross-entropy loss is used because this is a multi-label problem.
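
A minimal sketch of this module is given below. It assumes mean pooling of the BiLSTM states into a single sentence vector (the pooling choice and the default threshold are assumptions); the sigmoid threshold corresponds to Eq. (6) and the sigmoid cross-entropy to Eq. (7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    """Sentence-level multi-label relation classification (Eqs. 6-7)."""
    def __init__(self, d_h, n_relations, gamma1=0.5):
        super().__init__()
        self.linear = nn.Linear(d_h, n_relations)   # f(A) = WA + b
        self.gamma1 = gamma1                        # threshold gamma_1

    def forward(self, A, mask):
        # A: (batch, n, d_h); mask: (batch, n) with 1 for real tokens
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1)
        sent = (A * mask.unsqueeze(-1)).sum(dim=1) / lengths   # mean-pooled sentence vector
        return self.linear(sent)                               # one logit per relation type

    def predict(self, A, mask):
        # P_k = 1 iff sigmoid(f(A))_k >= gamma_1  (Eq. 6)
        return (torch.sigmoid(self.forward(A, mask)) >= self.gamma1).float()

def relation_loss(logits, gold):
    # Sigmoid cross-entropy over the K relation types (multi-label, Eq. 7)
    return F.binary_cross_entropy_with_logits(logits, gold)
```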

3.4 Triplet Extraction Model

Given a sentence with $n$ words as defined above, the target of relation extraction is to extract the triplets over the predefined relation type set $\mathcal{R}$ from the unstructured text. Each extracted triplet $(e_s, r, e_t)$ corresponds to specified entity spans $e_s, e_t$ and the relation $r$ between them. For a specific relation type $k$, we map each word to the other words in the sentence with an indicator function. The word-to-word mapping can be modeled as a matrix $X \in \mathbb{R}^{n \times n}$, where the element $x_{ij}$ indicates whether the $i$th word and the $j$th word form a triplet-related word pair; here $i$ and $j$ denote the last words of the source entity and the target entity, respectively. The indicator function gives $x_{ij} = 1$ when the $i$th word maps to the $j$th word under the specific relation type $k$. Extending the matrix $X$ with a relation type component, we obtain a tensor representation $\mathcal{X}$ of the extraction, defined as follows:

$\mathcal{X}_{ijk} = \begin{cases} 1, & \text{if there is a triplet } (i, k, j); \\ 0, & \text{otherwise,} \end{cases}$ (8)

Note that in this formula, $i$ and $j$ are the tail words of the two entity spans.

To extract useful information from this tensor, we employ a tensor decomposition algorithm, the decomposition into directional components (DEDICOM) strategy introduced by Harshman et al. [5]. For a sequence with $n$ words, the mapping matrix $X \in \mathbb{R}^{n \times n}$ must describe the asymmetric relationships between each word and the others. This suits our setting because each extracted triplet is directional: the source entity and the target entity are ordered. Triplet extraction can thus be modeled as a three-way DEDICOM model whose target tensor is $\mathcal{X} \in \mathbb{R}^{n \times n \times K}$, where $K$ denotes the number of relation types. The decomposition is:

$\mathcal{X}_k \approx A \mathcal{D}_k H \mathcal{D}_k A^{\mathsf{T}} \quad \text{for } k = 1, \ldots, K,$ (9)

where $A \in \mathbb{R}^{n \times d_h}$ is the sequence representation produced by the BiLSTM, and $\mathcal{D} \in \mathbb{R}^{d_h \times d_h \times K}$ and $H \in \mathbb{R}^{d_h \times d_h}$ are parameters. The matrices $\mathcal{D}_k \in \mathbb{R}^{d_h \times d_h}$ are diagonal, and the diagonal entry $(\mathcal{D}_k)_{ii}$, $i = 1, 2, \ldots, d_h$, indicates the participation of the $i$th latent component in relation $k$.

As noted above, this formulation would require predicting all relation types for each word pair, while a sentence may contain only a few relation types. To address this problem and to further utilize the relation types predicted by the relation classification model, we make specific settings for the factor $\mathcal{D}$ in the tensor decomposition. It is unnecessary to compute the mapping matrix $\mathcal{X}_r$ when the $r$th relation type is not predicted: we set $\mathcal{D}_r'$ to a zero matrix, so that $\mathcal{X}_r$ becomes a zero matrix after calculation, meaning there is no triplet in the $r$th relation component. The decomposition strategy therefore avoids predicting unnecessary triplets and simultaneously reduces prediction time. Accordingly, we define a custom operation $\circ$ as follows:

$\mathcal{D}_k' = \mathcal{D} \circ P_k = \begin{cases} \text{diagonal matrix}, & \text{if } P_k = 1; \\ \text{zero matrix}, & \text{if } P_k = 0. \end{cases}$ (10)

The final decomposition can then be formalized as:

$\mathcal{X} \approx A (\mathcal{D} \circ P) H (\mathcal{D} \circ P) A^{\mathsf{T}},$ (11)

where $P \in \mathbb{R}^{K}$ denotes the prediction result of the relation classification module.
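
The following PyTorch sketch illustrates the masked decomposition of Eqs. (9)-(11): only the diagonals of the matrices $\mathcal{D}_k$ are stored, and the predicted relation vector $P$ zeroes out the factors of unpredicted relations. The exact parameterization and initialization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    """Masked DEDICOM scoring: X'_k = A (D_k ∘ P_k) H (D_k ∘ P_k) A^T  (Eqs. 9-11)."""
    def __init__(self, d_h, n_relations):
        super().__init__()
        # Only the diagonals of the K diagonal matrices D_k are stored: shape (K, d_h)
        self.diag = nn.Parameter(torch.randn(n_relations, d_h) * 0.01)
        self.H = nn.Parameter(torch.randn(d_h, d_h) * 0.01)

    def forward(self, A, P):
        # A: (batch, n, d_h) sentence representation; P: (batch, K) 0/1 relation predictions
        d = self.diag.unsqueeze(0) * P.unsqueeze(-1)        # D' = D ∘ P  (Eq. 10), (batch, K, d_h)
        AD = A.unsqueeze(1) * d.unsqueeze(2)                # A D'_k, (batch, K, n, d_h)
        ADH = torch.einsum('bkne,ef->bknf', AD, self.H)     # A D'_k H
        scores = torch.einsum('bknf,bkmf->bnmk', ADH, AD)   # A D'_k H D'_k A^T, (batch, n, n, K)
        return scores
```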

From the perspective of probability distributions, we do not directly model the joint probability as in [4] and [3], but decompose the joint probability into a relation component and a conditional probability:

$p(x_i, r_k, x_j \mid S) = p(x_i, x_j \mid r_k, S)\, p(r_k \mid S),$ (12)

for $i, j = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, K$, where $p$ denotes a probability distribution, $x_i$ and $x_j$ denote the $i$th and $j$th word representations in the sentence $S$, and $r_k$ denotes the $k$th relation type. Since each word may be mapped to more than one word and relation type, we apply the sigmoid function here to amplify the differences. The final triplet decoding procedure is then:

$P(x_i, r_k, x_j \mid S) = \begin{cases} 1, & \sigma(\mathcal{X}_{ijk}') \geq \gamma_2; \\ 0, & \sigma(\mathcal{X}_{ijk}') < \gamma_2, \end{cases}$ (13)

where $\mathcal{X}_{ijk}'$ is the calculated tensor and $\gamma_2$ is the threshold for judging whether a triplet is true.

The goal of this decomposed relation extraction module is to minimize the cross-entropy loss $L_{extract}$ during training:

$L_{extract} = -\sum_{i,j,k} \mathcal{X}_{ijk} \log \mathcal{X}_{ijk}',$ (14)

where $\mathcal{X}_{ijk}$ is the ground-truth label and $\mathcal{X}_{ijk}'$ is the predicted triplet tensor.
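
A short sketch of the decoding rule in Eq. (13) and the extraction loss in Eq. (14) follows; using binary_cross_entropy_with_logits (which also includes the negative term) is an implementation assumption made for numerical stability.

```python
import torch
import torch.nn.functional as F

def decode_triplets(scores, gamma2=0.5):
    """Eq. (13): keep cells (i, j, k) whose sigmoid score reaches the threshold gamma_2."""
    keep = torch.sigmoid(scores) >= gamma2      # (batch, n, n, K) boolean tensor
    return keep.nonzero(as_tuple=False)         # rows of (batch, i, j, k) indices

def extraction_loss(scores, gold):
    """Cross-entropy between the scored tensor and the 0/1 gold tensor X (Eq. 14)."""
    return F.binary_cross_entropy_with_logits(scores, gold.float())
```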

Algorithm 1 Tensor Decomposition for Relation Extraction (TDRE)

Input: Sentence $S$, predefined relation type set $R$
Output: Triplets $\{(e_s, r, e_t)_j\}_{j=1}^{m}$, where $e_s, e_t$ are the source and target entities, $r$ denotes the relation type, and $m$ denotes the number of triplets.

1:  Define the length of the given sentence: $n$, and the number of relations: $K$.
2:  repeat
3:     Calculate word representation $A$ by Eq. (2);
4:     Identify entity types $Y_{ner}$ by Eq. (3);
5:     if entity label embedding then
6:        $A \leftarrow A + \text{label\_embedding}(Y_{ner})$
7:     end if
8:     Initialize decomposition factors $\mathcal{D}$ and $H$;
9:     if relation classification model then
10:        Classify relation types $P$ by Eq. (6);
11:        Calculate the masked diagonal relation matrices $\mathcal{D}'$ by Eq. (10);
12:        Calculate triplet tensor $\mathcal{X}$ by Eq. (11);
13:     else
14:        Calculate triplet tensor $\mathcal{X}$ by Eq. (9);
15:     end if
16:     Decode triplets with Eq. (13) and Eq. (8).
17:  until TDRE converges

3.5 Loss Function

To jointly train the proposed model, we combine the three objective loss functions and optimize all parameters together. The final loss function is computed as follows:

$L = L_{ner} + L_{rel} + L_{extract},$ (15)

where $L_{ner}$ is the loss of the named entity recognition module, $L_{rel}$ is the loss of the relation classification module and $L_{extract}$ is the loss of the final triplet extraction module computed with the DEDICOM algorithm. We optimize the model with the Adam optimizer and train the joint model in a multi-task learning fashion.
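
A sketch of one joint training step under Eq. (15) is given below, reusing the modules from the earlier sketches; the batch keys and the choice to feed the gold relation vector to the scorer during training are assumptions, not details specified in the paper.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, ner_head, rel_clf, scorer, optimizer, batch):
    """One joint update minimizing L = L_ner + L_rel + L_extract (Eq. 15)."""
    A = encoder(batch['word_ids'], batch['char_ids'])
    loss_ner = ner_head.loss(A, batch['tags'], batch['mask'])
    rel_logits = rel_clf(A, batch['mask'].float())
    loss_rel = F.binary_cross_entropy_with_logits(rel_logits, batch['relations'])
    # Assumption: the gold relation vector is fed to the scorer during training;
    # the paper only states that the three modules are trained jointly.
    scores = scorer(A, batch['relations'])
    loss_ext = F.binary_cross_entropy_with_logits(scores, batch['triplet_tensor'])
    loss = loss_ner + loss_rel + loss_ext
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```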

Finally, our TDRE algorithm is summarized in Algorithm 1.

4 Experiments

4.1 Datasets

We conduct experiments on three benchmark datasets for relation extraction: (i) NYT10, the original New York Times corpus [19] built with distant supervision and released by Takanobu et al. [11], who filtered the dataset by removing relations that appear in the training set but not in the test set, as well as sentences containing no relations. (ii) CoNLL04 [20], using the split of Gupta et al. [21] and Adel et al. [22]. The official entity type set is {Location, Organization, Person, Other}, where we omit the beginning, inside and outside tags of entity types here, and the official relation type set is {Kill, Live in, Located in, OrgBased in, Work for}. (iii) The Adverse Drug Events dataset (ADE) [23], where we use 80% of the data for training and 20% as the test set. Since there are no official entity types, the entity tag set {Beginning, Inside, Outside} is applied, and the relation type set is {Adverse-Effect, Drug-Disease Treatment}.

The statistics of the public datasets are reported in Table 1. For the ADE dataset, approximate triplet counts are shown because we conduct cross-validation on it.

Table 1: Statistics of the datasets
Datasets NYT10 CoNLL04 ADE
Relation types 29 5 2
Entity types 7 4 3
Training set 70339 910 3416
Training triplets 87739 1273 5650
Test set 4006 288 854
Test triplets 5859 422 1450

4.2 Details of Implementation

Similar to previous works, we use the 50-dimensional word embeddings trained on Wikipedia that are also used by [22]. Character embeddings are initialized randomly and updated during training, and the kernel size of the convolutional neural network is set to 3. We then concatenate the word embeddings and character embeddings as the final input embeddings. For the representation learning layers, we use a three-layer BiLSTM network with 64 hidden units. Unlike most existing models, we randomly initialize entity label embeddings of size 128 for all datasets and update them during optimization. Training is performed with the Adam optimizer with learning rate $10^{-3}$, and dropout with rate 0.1 is applied to the input embeddings and the BiLSTM hidden layers. Early stopping on the validation set is used to avoid overfitting.

Table 2: Main results on the NYT10 and NYT10-sub datasets, where NYT10-sub is selected from the NYT10 test set. All baseline results are taken from the HRL paper [11].
Methods NYT10 NYT10-sub
Pre Rec F1 Pre Rec F1
Tagging [1] 59.3 38.1 46.4 25.6 23.7 24.6
CopyR [9] 56.9 45.2 50.4 39.2 26.3 31.5
HRL [11] 71.4 58.6 64.4 81.5 47.5 60.0
MrMep [24] 71.7 63.5 67.3 83.2 55.0 66.2
TDRE (ours) 81.3 68.3 74.3 82.2 70.2 75.7

4.3 Results

Evaluation Metrics

For entity recognition, an entity is correctly identified only when its full span is recognized and assigned the correct type. For triplet extraction, a triplet is considered correct if and only if the source entity, the target entity and the relation type are all correct. We adopt precision, recall and micro-F1 to evaluate performance.
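
For reference, the following is a simple sketch of how micro precision, recall and F1 over extracted triplets can be computed from per-sentence sets of (source, relation, target) tuples; it is a standard routine, not the authors' evaluation script.

```python
def micro_prf(pred_triplets, gold_triplets):
    """pred_triplets / gold_triplets: one set of (source, relation, target) tuples per sentence."""
    tp = sum(len(p & g) for p, g in zip(pred_triplets, gold_triplets))
    n_pred = sum(len(p) for p in pred_triplets)
    n_gold = sum(len(g) for g in gold_triplets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```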

Baselines

For comparison, we choose the following models as baselines:

  1. Tagging [1]. It casts joint extraction as a sequence labeling problem with a special tagging scheme where the label of each word encodes the entity type, the entity role (source or target entity) and the relation type.

  2. CopyR [9]. A sequence-to-sequence learning framework with a copy mechanism that can handle overlapping relational triplets.

  3. HRL [11]. It applies hierarchical reinforcement learning, first detecting relations and then extracting the participating entities, treating the related entities as the arguments of a relation.

  4. MrMep [24]. A joint learning model which first classifies relations and then uses triplet attention to reinforce the connections between relations and entity pairs.

  5. Multi-head selection model [4]. It casts relation extraction as a multi-head selection problem, which can identify multiple relation types for each entity pair.

  6. Multi-head with adversarial training regularization [25]. It also achieves better performance on overlapping relations.

  7. Li et al. [26]. A neural joint model that simultaneously extracts biomedical entities and their relations using hand-crafted features and features derived from NLP tools.

  8. MultiQA [27]. It casts the task as multi-turn question answering, identifying answer spans (entities) from questions constructed for the predefined relations.

Experiment Results

The detailed comparisons between baseline models and our proposed model are summarized in Table 2 and Table 3 with precision, recall, micro-average F1. The best performances are marked in bold.

Table 3: Main results of our method compared with previous baseline models on the CoNLL04 and ADE benchmarks. Overall F1 is the average of the two subtasks. Results of the compared baseline models come directly from the original papers.
Dataset Model Entity Recognition Triplet Extraction Overall F1
Pre Rec F1 Pre Rec F1
CoNLL04 Multi-Head 83.75 84.06 83.9 63.75 60.43 62.04 72.97
MultiHead-AT - - 83.61 - - 61.95 72.78
MultiQA 89.0 86.6 87.8 69.2 68.2 68.9 78.35
TDRE (ours) 91.85 91.34 91.59 80.90 76.67 78.73 85.16
ADE Li [26] 82.70 86.70 84.60 67.50 75.80 71.40 78.00
Multi-Head 84.72 88.16 86.40 72.10 77.24 74.58 80.49
MultiHead-AT - - 86.73 - - 75.52 81.13
TDRE (ours) 88.82 88.28 88.55 81.63 76.20 78.82 83.69

On the NYT10 dataset, the proposed TDRE model significantly outperforms the others. The sub test set NYT10-sub published by HRL [11] mostly contains overlapping relations, where triplets share both the source entity and the target entity. Our model shows the best performance particularly on NYT10-sub, which indicates that TDRE is more effective on overlapping problems. Table 3 further shows that our approach performs better than the baseline methods in both entity recognition and relation identification on the CoNLL04 and ADE datasets.

Focusing on the precision of triplet extraction in particular, the experimental results show large improvements. Theoretically, our model avoids extra computation for unpredicted relation types, whereas most other models compute every relation type for every entity pair. The agreement between the theoretical analysis and the experimental results shows that the model can indeed reduce the prediction of redundant triplets. All the experimental results indicate that modeling the relation extraction process by tensor decomposition into relation components is more effective than the baseline models, even without using BERT.

Ablation Study

To illustrate the effectiveness of the various parts of our model, we conduct ablation experiments on the CoNLL04 dataset. Concretely, we separately remove character embeddings (Char Emb.), entity label embeddings (Label Emb.) and the relation classification module (CLS). According to the results in Table 4, character embeddings, which capture additional information such as suffixes, prefixes and other morphological features, are helpful for triplet extraction. Entity label embeddings incorporate entity type and boundary information, which helps capture the integrity of entities. When the classification module is removed, the proposed approach degenerates into directly modeling the joint probability and the performance degrades severely in both entity recognition and triplet extraction. This indicates that the proposed tensor decomposition into relation components contributes greatly to triplet extraction.

Table 4: Ablation study on CoNLL04 dataset.
Methods Entity Triplet Extraction
F1 Pre Rec F1
TDRE 91.59 80.90 76.67 78.73
- Char Emb. 85.65 75.91 64.52 69.75
- Label Emb. 89.51 75.74 66.90 71.05
- CLS 88.80 74.53 66.90 70.51

Analysis of Relation Classification Model

Experiments have been conducted on the CoNLL04 dataset with recurrent neural networks (RNN), recurrent convolutional neural networks (RCNN) and bidirectional long short-term memory networks (BiLSTM) as the relation classification model. The results in Table 5 show that our model is still better than the previously compared models [4, 27] even without the classification module. We conjecture that this is because the raw DEDICOM strategy already decomposes with a diagonal tensor that interprets the connections between relation types, and it is originally suited to capturing asymmetry, which fits our ordered triplets. Looking at the results with RNN and RCNN, the performance is better than the multi-head model but not better than MultiQA [27]. There are two possible reasons: (1) these classification models focus on identifying relations at the expense of entity recognition accuracy; (2) their parameters are not shared with named entity recognition, ignoring the inherent connections between relation identification and entity recognition. After parameter sharing through the BiLSTM network, the model achieves substantial improvements in both entity recognition and triplet extraction. Better performance in relation identification or entity recognition yields better triplet extraction performance.

Table 5: Various classification models on the CoNLL04 dataset.
Model Rel Entity Triplet Extraction
F1 F1 Pre Rec F1
- CLS - 88.8 74.5 66.9 70.5
RNN 97.6 85.0 70.5 60.2 64.9
RCNN 99.7 87.7 75.2 66.4 70.5
BiLSTM 94.1 91.6 80.9 76.7 78.7

5 Conclusion

In this paper, we introduce a joint neural model with a tensor decomposition based strategy for multi-labeled relation extraction (TDRE). The introduced tensor models the connections of word pairs in each relation component, which handles overlapping relations. Relation classification is treated as an independent but jointly trained module that predicts the relation types contained in a sentence. By embedding the predicted relations into the tensor decomposition formula as relational component factors, the model avoids identifying entity pairs for unrelated relation types. Experiments on public datasets demonstrate that the proposed model achieves significant improvements over previous state-of-the-art baselines. In the future, we plan to explore other tensor decomposition approaches and to extend our model to other tensor based modeling tasks.

Acknowledgments

This work is supported by the Science and Technology Plan Project of Sichuan Province (Key R & D Project, 2020YFS0465) and Mathematics Teaching Research and Development Center for Colleges (CMC20190501).

References