Curriculum-Meta Learning for Order-Robust Continual Relation Extraction
Abstract
Continual relation extraction is an important task that focuses on extracting new facts incrementally from unstructured text. Given the sequential arrival order of the relations, this task is prone to two serious challenges, namely catastrophic forgetting and order-sensitivity. We propose a novel curriculum-meta learning method to tackle the above two challenges in continual relation extraction. We combine meta learning and curriculum learning to quickly adapt model parameters to a new task and to reduce interference of previously seen tasks on the current task. We design a novel relation representation learning method through the distribution of domain and range types of relations. Such representations are utilized to quantify the difficulty of tasks for the construction of curricula. Moreover, we also present novel difficulty-based metrics to quantitatively measure the extent of order-sensitivity of a given model, suggesting new ways to evaluate model robustness. Our comprehensive experiments on three benchmark datasets show that our proposed method outperforms the state-of-the-art techniques. The code is available at https://github.com/wutong8023/AAAI-CML.
1 Introduction
Relation extraction (Han et al. 2020b) aims at extracting structured facts as triples from unstructured text. As an essential component of information extraction, relation extraction has been widely utilized in downstream applications such as knowledge base construction (Turki et al. 2019) and population (Mahdisoltani, Biega, and Suchanek 2015). However, given the continuous and iterative nature of the update process, continual relation extraction (Wang et al. 2019; Obamuyide and Vlachos 2019) is a more realistic and useful setting. Yet due to the limitations of storage and computational resources, it is impractical to grant the relation extractor access to all the training instances in previously seen tasks. Thus, this continual learning formulation is in contrast to the conventional relation extraction setting where the extractor is generally trained from scratch with the full access to the training corpus.
Catastrophic Forgetting (CF) is a well-known problem in continual learning (Chen and Liu 2016). The problem is that when a neural network is utilized to learn a sequence of tasks, the learning of the later tasks may degrade the performance of the learned model on the previous tasks. Various recent works tackle the CF problem, including consolidation-based methods (Zenke, Poole, and Ganguli 2017; Kirkpatrick et al. 2016), dynamic architecture methods (Chen, Goodfellow, and Shlens 2016; Rusu et al. 2016), and memory-based methods (Rebuffi et al. 2017; Lopez-Paz and Ranzato 2017; Chaudhry et al. 2019). These methods have mostly been demonstrated on simple image classification tasks, yet the memory-based methods have proven to be the most promising for NLP applications. EA-EMR (Wang et al. 2019) proposes a sentence embedding alignment mechanism in memory maintenance and adopts it for continual relation extraction. Based on EA-EMR, MLLRE (Obamuyide and Vlachos 2019) introduces a meta-learning framework for fast adaptation, and EMAR (Han et al. 2020a) introduces a multi-turn co-training procedure for memory consolidation. Most of these methods study the CF problem through the overall performance on task sequences, but lack an in-depth analysis of the characteristics of each subtask and the corresponding model performance.
Order-sensitivity (OS) is another major problem in continual learning, which is relatively under-explored (Chen and Liu 2016; Yoon et al. 2020). It refers to the phenomenon that the performance on tasks varies with the order of the task arrival sequence. This is due not only to the CF incurred by the different sequences of previous tasks, but also to the unidirectional knowledge transfer from the previous tasks. Order-sensitivity can be problematic in various respects: (i) ethical AI considerations in continual learning, e.g. fairness in the medical domain (Yoon et al. 2020); (ii) benchmarking of continual learning algorithms, as most of the existing works pick an arbitrary and random sequence of the given tasks for evaluation (Chen and Liu 2016); (iii) uncertainty about the quality of extracted knowledge in the realistic knowledge base population scenario, where the model is faced with only one sequence.
In this paper, we introduce the curriculum-meta learning (CML) method to tackle both the catastrophic forgetting and order-sensitivity problems. Taking a memory-based approach, CML is based on the following observations about the catastrophic forgetting and order-sensitivity issues of previous works: (i) over-fitting to the experience memory, meaning that the performance on any task decreases as training progresses, and (ii) interference between similar tasks, meaning that the model performs better on less intrusive tasks. We therefore design a mechanism which selectively reduces the replay frequency of the memory to avoid over-fitting, and steers the model to learn the bias between the current task and the most similar previous tasks to reduce order-sensitivity.
Our CML method contains two steps. In the first step, it samples instances from the memory based on the difficulty of the previous tasks for the current task, resulting in a curriculum for continual learning. Then, it trains the model on both the curriculum and training instances of the current task. We further introduce a knowledge-based method to quantify task difficulty according to the similarity of pairs of relations. Taking a relation as a function mapping to named entities in its domain and range, we define a similarity measure between two relations based on the conceptual distribution of their head and tail entities.
Our contributions are summarized as follows:
- We propose a novel curriculum-meta learning method to tackle the order-sensitivity and catastrophic forgetting problems in continual relation extraction.
- We introduce a new relation representation learning method via the conceptual distribution of head and tail entities of relations, which is utilized to quantify the difficulty of each relation extraction task for constructing the curriculum.
- We conduct comprehensive experiments to analyze the order-sensitivity and catastrophic forgetting problems in state-of-the-art methods, and empirically demonstrate that our proposed method outperforms the state-of-the-art methods on three benchmark datasets.
2 Related Work
The conventional relation extraction methods can be categorized into three classes by the way data is used: supervised methods (Zelenko, Aone, and Richardella 2002; Liu et al. 2013; Zeng et al. 2014; Lin et al. 2016; Miwa and Bansal 2016), semi-supervised methods (Chen et al. 2006; Sun, Grishman, and Sekine 2011; Lin et al. 2019), and distantly supervised methods (Yao et al. 2011; Marcheggiani and Titov 2016). Most of these methods assume a predefined relation schema and thus cannot be easily generalized to new relations. To overcome this problem, several challenging tasks, including open relation learning and continual relation learning, have been proposed to detect and learn relations without a predefined relation schema.
In this paper, we address the continual relation learning problem (Wang et al. 2019), a relatively new and less investigated task. Continual learning in general faces two major challenges: catastrophic forgetting and order-sensitivity.
Catastrophic forgetting (CF) is a prominent line of research in continual learning (Chen and Liu 2016; Thrun 1998). Methods addressing CF can be broadly divided into three categories. (i) Consolidation-based methods (Kirkpatrick et al. 2016; Zenke, Poole, and Ganguli 2017; Liu et al. 2018; Ritter, Botev, and Barber 2018) consolidate model parameters important to previous tasks and reduce their learning rates. These methods employ sophisticated mechanisms to evaluate parameter importance for tasks. (ii) Dynamic architecture methods (Lesort et al. 2019; Mallya, Davis, and Lazebnik 2018) dynamically expand model architectures to learn new tasks and effectively prevent forgetting of old tasks. The sizes of these models grow dramatically with the number of tasks, making them unsuitable for NLP applications. (iii) Memory-based methods (Lopez-Paz and Ranzato 2017; Rebuffi et al. 2017; Shin et al. 2017; Aljundi et al. 2018; Chaudhry et al. 2019) remember a few examples from old tasks and continually learn them with emerging new tasks to alleviate catastrophic forgetting. Among these, memory-based methods have proven to be the most promising for NLP applications (Sun, Ho, and Lee 2020; de Masson d’Autume et al. 2019), including continual relation learning (Han et al. 2020a; Wang et al. 2019).
Order-sensitivity (OS) (Chen and Liu 2016; Yoon et al. 2020) is another major problem in continual learning that is relatively under-explored. It is the phenomenon that a model’s performance is sensitive to the order in which tasks arrive. In this paper, we tackle this problem by leveraging a curriculum learning method (Bengio et al. 2009). Briefly, we construct our curriculum by the similarity of tasks, thus minimizing the impact and interference of previous tasks.
3 Curriculum-Meta Learning
Problem Formulation.
In continual relation extraction, given a sequence of tasks $(\mathcal{T}^1, \mathcal{T}^2, \ldots, \mathcal{T}^K)$, each task $\mathcal{T}^k$ is a conventional supervised classification task, containing a series of examples and their corresponding labels $\{(x^k_i, y^k_i)\}$, where $x^k_i$ is the input data, containing the natural-language context and the candidate relations, and $y^k_i$ is the ground-truth relation label of the context. The model can access the training data $D^k_{train}$ of the current task and is trained by optimising a loss function $\mathcal{L}$. The goal of continual learning is to train the model such that it continually learns new tasks while avoiding catastrophically forgetting the previously learned tasks. Due to various constraints, the learner is typically allowed to maintain and observe only a subset of the training data of the previous tasks, which is contained in a memory set $\mathcal{M}$.
The performance of the model is measured in the conventional way, by whole accuracy $ACC_w = acc(f_\theta, \bigcup_{k} D^{k}_{test})$ on the entire test set. Moreover, model performance at task $\mathcal{T}^k$ is evaluated with average accuracy $ACC_a = \frac{1}{k}\sum_{j=1}^{k} acc(f_\theta, D^{j}_{test})$ on the test sets of all the tasks up to this task in the sequence. Average accuracy is a better measure of the effect of catastrophic forgetting as it emphasizes a model's performance on earlier tasks.
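For concreteness, a minimal sketch of these two metrics is given below; the `evaluate(model, instances)` helper is a hypothetical placeholder for the underlying relation classifier's accuracy computation, not part of our released code.

```python
from typing import Callable, List, Sequence

def whole_accuracy(evaluate: Callable, model, test_sets: List[Sequence]) -> float:
    """ACC_w: accuracy of the current model on the union of all test sets."""
    union = [example for test_set in test_sets for example in test_set]
    return evaluate(model, union)

def average_accuracy(evaluate: Callable, model, seen_test_sets: List[Sequence]) -> float:
    """ACC_a at stage k: per-task accuracies averaged over the tasks seen so far,
    so every task contributes equally regardless of its test-set size."""
    return sum(evaluate(model, ts) for ts in seen_test_sets) / len(seen_test_sets)
```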
Framework.
Our curriculum-meta learning (CML) framework is described in Algorithm 1. CML maintains initialization parameters $\phi$ and a memory set $\mathcal{M}$ that stores the prototype instances of previous tasks. It performs the following operations at each time step $t$ during the learning phase. (1) The meta-learner fetches the initialization parameters $\phi^{t-1}$ from the memory to initialize the model $f_\theta$. (2) $f_\theta$ replays on the curriculum set, which is sampled and sorted by the knowledge-based curriculum module. (3) $f_\theta$ trains on the support set of the current task $\mathcal{T}^t$. (4) Finally, CML updates the learned parameters to obtain $\phi^t$ and stores a small number of prototype instances of the current task into the memory. During the evaluation phase, the trained model is given a target set with labeled unseen instances from all observed tasks (see Appendix A for the workflow of CML).
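The following simplified sketch illustrates one pass of this framework over a task sequence. It abstracts the learner into hypothetical `train` and `evaluate` callables and the knowledge-based teacher of Section 4 into a `teacher` callable; random sampling stands in for prototype selection, and the meta update anticipates the Reptile-style rule of Eq. (2).

```python
import random
import numpy as np

def cml_loop(tasks, init_params, train, evaluate, teacher,
             mem_per_task=10, epsilon=0.5, seed=0):
    """Simplified sketch of the CML framework (steps 1-4 above).

    tasks       : list of dicts, each with 'train' and 'test' instance lists
    init_params : np.ndarray holding the maintained initialization parameters
    train       : callable(params, instances) -> adapted params
    evaluate    : callable(params, instances) -> accuracy in [0, 1]
    teacher     : callable(memory, task) -> sampled and sorted replay instances
    """
    rng = random.Random(seed)
    phi = init_params.copy()                 # meta initialization parameters
    memory, seen_tests, history = [], [], []
    for task in tasks:
        theta = phi.copy()                                # (1) initialize from phi
        if memory:
            theta = train(theta, teacher(memory, task))   # (2) curriculum replay
        theta = train(theta, task["train"])               # (3) adapt to the support set
        phi = phi + epsilon * (theta - phi)               # (4a) meta update (Eq. 2)
        memory.extend(rng.sample(task["train"],           # (4b) random sampling stands
                                 min(mem_per_task, len(task["train"]))))  # in for prototypes
        seen_tests.append(task["test"])
        history.append(np.mean([evaluate(theta, ts) for ts in seen_tests]))  # ACC_a so far
    return history
```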
We will introduce the framework in terms of (1) the utilization of the initialization parameters (i.e. meta training) and (2) the utilization of the memory set (i.e. the curriculum-based memory replay).
Meta Training. Meta learning, or learning to learn, aims at developing algorithms that learn the generic knowledge of how to solve tasks from a given distribution of tasks. With a given basic relation extraction model $f_\theta$ (Yu et al. 2017) parameterized by $\theta$, we employ the gradient-based meta-learning method Reptile (Nichol, Achiam, and Schulman 2018) to learn a prior initialization $\phi^t$ at each time step $t$. During adaptation to a new task, the model parameters are quickly updated from the initialization $\phi^{t-1}$ to the task-specific $\theta^t$ with a few steps of gradient descent. Formally, the meta learner updates $\phi$, which is optimized for the following objective:
$\phi^{t}_{*} = \operatorname{argmin}_{\phi}\ \mathbb{E}_{\mathcal{T}^t}\!\left[\mathcal{L}_{\mathcal{T}^t}\!\left(U_{\mathcal{T}^t}(\phi;\, D^{t}_{train})\right)\right]$   (1)

where $D^{t}_{train}$ is the training data, $\mathcal{L}_{\mathcal{T}^t}$ is the loss function for task $\mathcal{T}^t$, and $U_{\mathcal{T}^t}$ is the optimizer of $f_\theta$ (a few steps of gradient descent on the task). Then, when it converges on the current task, the model will generate the initialization parameter $\phi^{t}$ for the next time step $t+1$:

$\phi^{t} = \phi^{t-1} + \epsilon\,\frac{1}{n}\sum_{i=1}^{n}\left(\theta^{t}_{i} - \phi^{t-1}\right)$   (2)

where $\theta^{t}_{i}$ is the updated parameter for the current task at time step $t$, $\epsilon$ is the meta step size, and $n$ is the number of instances which may be processed in parallel at a time step.
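In PyTorch-style code, one such meta step can be sketched as follows (a first-order, single-task variant, i.e. $n{=}1$ in Eq. (2); the model, data batches, and hyper-parameter values are placeholders):

```python
import copy
import torch
import torch.nn.functional as F

def reptile_meta_step(model, task_batches, inner_lr=1e-2, inner_steps=5, epsilon=0.5):
    """One first-order meta step: adapt a copy of the model on the current task
    with a few SGD steps, then move the initialization phi toward the adapted
    parameters theta, as in Eq. (2) with n = 1."""
    adapted = copy.deepcopy(model)                       # theta initialized from phi
    optimizer = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for step, (inputs, labels) in enumerate(task_batches):
        if step >= inner_steps:
            break
        loss = F.cross_entropy(adapted(inputs), labels)  # task loss L_T
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():                                # phi <- phi + eps * (theta - phi)
        for phi, theta in zip(model.parameters(), adapted.parameters()):
            phi.add_(epsilon * (theta - phi))
    return model
```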
Curriculum-based Memory Replay. The meta learner reviews the previous tasks in an orderly way before learning the new task. Here, we denote by $g$ a function representing the teacher, which prepares the curriculum for the student network (i.e. the relation extractor $f_\theta$) to replay. Different from conventional experience-replay based models, the teacher function needs to master three skills:
1. Assessing the difficulty of tasks. When a new task arrives, this function determines which of the previously observed tasks interfere with the current task.
2. Sampling instances from the memory. Sampling reduces the time consumed in the replay stage and alleviates the over-fitting caused by the high frequency of updates on the memory.
3. Ranking the sampled instances by a certain strategy. The teacher instructs the student model to learn the bias between the current task and the observed similar tasks in the most efficient way.
We sample the memory randomly and sort the sampled instances according to the difficulty of each previous task with respect to the current task. Based on the above requirements, we implement a knowledge-based curriculum module, which is introduced in the next section.
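A minimal sketch of this teacher function is shown below. It assumes that each memory entry is stored together with the identifier of its source task and that `difficulty_fn` implements the semantic difficulty measure of Section 4; the easy-to-hard ordering follows the usual curriculum-learning convention (Bengio et al. 2009) and is an implementation choice rather than a requirement of the method.

```python
import random
from functools import partial

def knowledge_based_teacher(memory, current_task, difficulty_fn,
                            sample_ratio=0.5, seed=0):
    """Sketch of the teacher g: sample instances from memory (skill 2) and sort
    them by the estimated difficulty of their source task with respect to the
    current task (skills 1 and 3).

    memory        : list of (instance, source_task_id) pairs
    difficulty_fn : callable(source_task_id, current_task) -> float, e.g. the
                    semantic similarity-based measure of Section 4
    """
    if not memory:
        return []
    rng = random.Random(seed)
    k = max(1, int(sample_ratio * len(memory)))     # sub-sample to reduce replay frequency
    sampled = rng.sample(memory, k)
    # easy-to-hard ordering is an implementation choice (Bengio et al. 2009)
    sampled.sort(key=lambda item: difficulty_fn(item[1], current_task))
    return [instance for instance, _ in sampled]

# A two-argument teacher, as used in the framework sketch above, can be obtained with
# partial(knowledge_based_teacher, difficulty_fn=my_difficulty_fn).
```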
4 Knowledge-based Curriculum
Intrinsically, order-sensitivity is caused by a model's inability to guarantee optimal performance on all previous tasks. Empirically, however, order-sensitivity is closely related to the unbalanced forgetting rates (or the unbalanced difficulty) of different tasks, where we assume that task difficulty is due to the interactions between semantically similar relations in an observed task sequence. Intuitively, if the conceptual distributions of two relations are similar, the two relations tend to be expressed in similar natural-language contexts, such as the relations “father” and “mother”.
Semantic Embedding-based Difficulty Function.
To formalize this intuition, we define a difficulty estimation function based on the semantic embeddings of the relations in each task. Given a set of tasks $\{\mathcal{T}^1, \ldots, \mathcal{T}^K\}$, the difficulty of task $\mathcal{T}^i$ is defined as:

$d(\mathcal{T}^i) = \frac{1}{K-1}\sum_{j=1,\, j\neq i}^{K} \mathrm{Sim}(\mathcal{T}^i, \mathcal{T}^j)$   (3)

where $\mathrm{Sim}(\mathcal{T}^i, \mathcal{T}^j)$ is the similarity score between tasks $\mathcal{T}^i$ and $\mathcal{T}^j$, which is defined as the average similarity among relation pairs from the two tasks:

$\mathrm{Sim}(\mathcal{T}^i, \mathcal{T}^j) = \frac{1}{|R^i|\,|R^j|}\sum_{r_m \in R^i}\sum_{r_n \in R^j} \mathrm{sim}(r_m, r_n)$   (4)

where $|R^i|$ and $|R^j|$ are the numbers of relations in the two tasks respectively, and $\mathrm{sim}(r_m, r_n)$ is the cosine similarity between the embeddings of the two relations, $\mathrm{sim}(r_m, r_n) = \cos(\mathbf{e}_{r_m}, \mathbf{e}_{r_n})$. Using $d$, we calculate the difficulty of each relation in the memory with respect to the relations in the current task, in order to sort and sample the relations stored in memory into the final curriculum.
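These two equations translate directly into code. The sketch below assumes each task is represented by the list of its relation embedding vectors (learned as described next) and mirrors Eqs. (3)–(4):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two relation embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def task_similarity(relations_i, relations_j):
    """Eq. (4): average pairwise cosine similarity between the relation
    embeddings of two tasks."""
    sims = [cosine(u, v) for u in relations_i for v in relations_j]
    return sum(sims) / len(sims)

def task_difficulty(task_id, tasks):
    """Eq. (3): difficulty of a task, here its average similarity to the other
    observed tasks. `tasks` maps task ids to lists of relation embeddings."""
    others = [t for t in tasks if t != task_id]
    return sum(task_similarity(tasks[task_id], tasks[t]) for t in others) / len(others)
```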
Relation Representation Learning.
In order to calculate the semantic embedding of each relation, inspired by (Chen et al. 2019), we introduce a knowledge- and distribution-based representation learning method. Intuitively, the representation of a relation is learned from the types of its head and tail entities. Consider a knowledge graph $\mathcal{G} = \{(h, r, t)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$, where $h$ and $t$ are the head and tail entities, $r$ is the relation between them, and $\mathcal{E}$ and $\mathcal{R}$ represent the sets of entities and relations respectively. We reduce the relation representation learning task to the problem of learning the conceptual distribution of each relation, which is optimized based on the following objective:

$\max\ \sum_{(h,\, r,\, t) \in \mathcal{G}} \left[\log p(c_h \mid \mathbf{u}_r) + \log p(c_t \mid \mathbf{v}_r)\right]$   (5)

where $c_h$ and $c_t$ are the concepts (i.e. the hypernyms obtained from the knowledge graph) of the head and tail entities respectively, $p(c_h \mid \mathbf{u}_r) = \mathrm{softmax}(g_\omega(\mathbf{u}_r))$ and $p(c_t \mid \mathbf{v}_r) = \mathrm{softmax}(g_\omega(\mathbf{v}_r))$, where $g_\omega$ is a two-layer neural network parameterized with $\omega$. Finally, we obtain two representations $\mathbf{u}_r$ and $\mathbf{v}_r$ for each relation, which indicate the conceptual distributions of its head and tail entities respectively. We concatenate these two embeddings to generate the final representation of the relation, $\mathbf{e}_r = [\mathbf{u}_r; \mathbf{v}_r]$.
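A compact PyTorch sketch of one possible instantiation of this representation learner is given below; the shared two-layer network, the embedding dimensions, and the cross-entropy objective over concept labels are our assumptions about a concrete implementation of Eq. (5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationConceptModel(nn.Module):
    """Learns, for every relation, embeddings u_r and v_r whose induced concept
    distributions match the observed concepts of head and tail entities (Eq. 5)."""

    def __init__(self, n_relations, n_concepts, dim=128, hidden=256):
        super().__init__()
        self.head_emb = nn.Embedding(n_relations, dim)   # u_r
        self.tail_emb = nn.Embedding(n_relations, dim)   # v_r
        # shared two-layer network g mapping a relation embedding to concept logits
        self.g = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_concepts))

    def loss(self, rel_ids, head_concepts, tail_concepts):
        """Negative log-likelihood of the observed head/tail concepts."""
        return (F.cross_entropy(self.g(self.head_emb(rel_ids)), head_concepts) +
                F.cross_entropy(self.g(self.tail_emb(rel_ids)), tail_concepts))

    def relation_embedding(self, rel_ids):
        """Final relation representation: concatenation of u_r and v_r."""
        return torch.cat([self.head_emb(rel_ids), self.tail_emb(rel_ids)], dim=-1)
```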
5 Experiments
In this section, we aim to empirically address the following research questions related to our contributions:
RQ1: Why and to what extent do current memory replay-based approaches suffer from catastrophic forgetting and order-sensitivity?
RQ2: How to qualitatively and quantitatively understand task difficulty?
RQ3: Compared with the state-of-the-art methods, can our method (curriculum-meta learning with the knowledge-based curriculum) effectively alleviate catastrophic forgetting and order-sensitivity?
Datasets. We conduct our experiments on three datasets, Continual-FewRel, Continual-SimpleQuestions, and Continual-TACRED, which were introduced in (Han et al. 2020a). FewRel (Han et al. 2018) is a labelled dataset which contains 80 relations and 700 instances per relation. SimpleQuestions is a knowledge-based question answering dataset containing single-relation questions (Bordes et al. 2015), from which a relation extraction dataset was extracted (Yu et al. 2017); the latter contains 1,785 relations and 72,238 training instances. TACRED (Zhang et al. 2017) is a well-constructed RE dataset that contains 42 relations and 21,784 examples. Considering the special relation “n/a” (i.e., not available) in TACRED, we follow (Han et al. 2020a), filter out the examples with the relation “n/a”, and use the remaining 13,012 examples for Continual-TACRED.
Following Wang et al. (2019); Obamuyide and Vlachos (2019), we partition the relations of each dataset into several groups and then consider each group of relations as a distinct task $\mathcal{T}^k$.
We form a training and a testing set for each task, based on the instances in the original dataset that are labeled with the relations in that task.
Following previous work, we employ two relation partitioning methods. The first is an unbalanced division based on clustering, which applies the K-means algorithm to the averaged word embeddings (Pennington, Socher, and Manning 2014) of the relation names (Wang et al. 2019). The second is a random partitioning into groups with a similar number of relations (Obamuyide and Vlachos 2019). For Continual-FewRel, we partition its 80 relations into 10 distinct tasks. Similarly, we partition the 1,785 relations in Continual-SimpleQuestions into 20 disjoint tasks, and the 41 remaining relations in Continual-TACRED into 10 tasks.
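Both partitioning schemes are straightforward to implement. The sketch below assumes a `word_vectors` dictionary of pre-trained GloVe embeddings and uses scikit-learn's K-means for the cluster-based division; it is an illustration rather than the exact partitioning script.

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def cluster_partition(relation_names, word_vectors, n_tasks=10, seed=0):
    """Unbalanced division: K-means over the averaged word embeddings of the
    relation names (Wang et al. 2019)."""
    dim = len(next(iter(word_vectors.values())))
    embs = np.stack([
        np.mean([word_vectors[w] for w in name.split() if w in word_vectors]
                or [np.zeros(dim)], axis=0)          # zero vector for unseen names
        for name in relation_names
    ])
    labels = KMeans(n_clusters=n_tasks, random_state=seed).fit_predict(embs)
    return [[r for r, l in zip(relation_names, labels) if l == k]
            for k in range(n_tasks)]

def random_partition(relation_names, n_tasks=10, seed=0):
    """Random division into groups with a similar number of relations
    (Obamuyide and Vlachos 2019)."""
    shuffled = list(relation_names)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_tasks] for i in range(n_tasks)]
```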
Evaluation Metrics. We employ the following four metrics to measure model performance. Note that the last two metrics, the average forgetting rate and the error bound, are new metrics we propose in this paper.
- Whole Accuracy ($ACC_w$): the accuracy of the resulting model, at the end of the continual learning process, on the full test sets of all tasks.
- Average Accuracy ($ACC_a$): the accuracy of the model trained up to task $\mathcal{T}^k$, averaged over the test sets of all tasks seen up to stage $k$ of the continual learning process. Compared to $ACC_w$, $ACC_a$ highlights the catastrophic forgetting problem. However, as we will empirically show, $ACC_a$ is subject to the order-sensitivity of the task sequence, and thus does not accurately measure the level of forgetting on a specific task.
- Average Forgetting Rate ($f^{K}_{i}$): a new metric to evaluate task-specific model performance with respect to order-sensitivity, defined for task $\mathcal{T}^i$ after $K$ time steps as

  $f^{K}_{i} = \frac{1}{K-1}\sum_{j=1}^{K-1}\frac{\bar{a}^{i}_{[K]} - \bar{a}^{i}_{[j]}}{\bar{a}^{i}_{[K]}}$   (6)

  where $\bar{a}^{i}_{[j]}$ is the model's average performance on task $\mathcal{T}^i$ when it appears at the $j$-th position of distinct task permutations:

  $\bar{a}^{i}_{[j]} = \frac{1}{|\mathcal{P}^{i}_{[j]}|}\sum_{p \in \mathcal{P}^{i}_{[j]}} a^{i}_{p}$   (7)

  where $a^{i}_{p}$ is the final accuracy on task $\mathcal{T}^i$ of the model trained on the permutation $p$, and $\mathcal{P}^{i}_{[j]}$ is the set of all permutations of the tasks $\{\mathcal{T}^1, \ldots, \mathcal{T}^K\}$ in which task $\mathcal{T}^i$ is fixed at position $j$; the number of such permutations is $(K-1)!$. Since the number of possible task permutations grows exponentially, we cannot compute $\bar{a}^{i}_{[j]}$ exactly and instead estimate it by Monte Carlo sampling of permutations.
- Error Bound ($EB$): a new metric to evaluate the overall model performance with respect to order-sensitivity,

  $EB = z \cdot \frac{\sigma}{\sqrt{n}}$   (8)

  where $z$ is the confidence coefficient of the chosen confidence level, and $\sigma$ is the standard deviation of the accuracy obtained over the $n$ sampled distinct task permutations. Note that a model with a lower error bound shows better robustness and less order-sensitivity to the input sequences.
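The Monte Carlo estimate of Eq. (7) and the error bound of Eq. (8) can be sketched as follows, assuming a hypothetical `acc_fn(task, run)` oracle that trains the continual learner on the task sequence `run` and returns its final accuracy on `task`:

```python
import math
import random
from statistics import mean, stdev

def position_accuracy(acc_fn, tasks, task, position, n_samples=20, seed=0):
    """Monte Carlo estimate of Eq. (7): the average final accuracy on `task`
    over sampled permutations in which `task` is fixed at the given position."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        others = [t for t in tasks if t != task]
        rng.shuffle(others)
        run = others[:position - 1] + [task] + others[position - 1:]
        samples.append(acc_fn(task, run))   # train on `run`, then test on `task`
    return mean(samples)

def error_bound(accuracies, z=1.96):
    """Eq. (8): half-width of the confidence interval of the mean accuracy over
    n distinct task permutations (z = 1.96 for a 95% confidence level)."""
    return z * stdev(accuracies) / math.sqrt(len(accuracies))
```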
Baseline Models. We compare our proposed CML (curriculum-meta learning with the knowledge-based curriculum) with the following baseline models, among which Vanilla is employed as the base learner for both CML and the other methods (see Appendix B for hyper-parameters):
1. Vanilla (Yu et al. 2017), the basic model for conventional supervised relation extraction, not specifically designed for the continual learning setup.
2. EWC (Kirkpatrick et al. 2016), which adopts elastic weight consolidation to add a special regularization on parameter changes. EWC uses Fisher information to measure the importance of parameters to old tasks, and slows down the updates of those parameters that are important to old tasks.
3. AGEM (Chaudhry et al. 2019), which takes the gradient on examples sampled from memory as the only constraint on the optimization direction of the current task.
4. EA-EMR (Wang et al. 2019), which maintains a memory of previous tasks to alleviate the catastrophic forgetting problem.
5. MLLRE (Obamuyide and Vlachos 2019), which leverages meta-learning to improve the usage efficiency of training instances.
6. EMAR (Han et al. 2020a), which introduces episodic memory activation and reconsolidation to continual relation learning.
Main Results
To evaluate the overall performance of our model CML (RQ3), we conduct experiments on the three datasets under both task division methods: unbalanced cluster-based task division and the uniform random task division.
| Setting | Model | Continual-FewRel $ACC_w$ ± $EB$ | Continual-FewRel $ACC_a$ ± $EB$ | Continual-SimpQ $ACC_w$ ± $EB$ | Continual-SimpQ $ACC_a$ ± $EB$ | Continual-TACRED $ACC_w$ ± $EB$ | Continual-TACRED $ACC_a$ ± $EB$ |
|---|---|---|---|---|---|---|---|
| Cluster | Vanilla | 16.3 ± 4.10 | 19.7 ± 3.90 | 60.3 ± 2.52 | 58.3 ± 2.30 | 12.0 ± 3.21 | 8.7 ± 2.35 |
| Cluster | EWC | 27.1 ± 2.32 | 30.2 ± 2.10 | 67.2 ± 3.16 | 59.0 ± 2.20 | 14.5 ± 2.51 | 14.5 ± 2.90 |
| Cluster | AGEM | 36.1 ± 2.51 | 42.5 ± 2.63 | 77.6 ± 2.11 | 72.2 ± 2.72 | 12.5 ± 2.24 | 16.5 ± 2.20 |
| Cluster | EA-EMR | 59.8 ± 1.50 | 74.8 ± 1.30 | 82.7 ± 0.48 | 86.2 ± 0.33 | 17.8 ± 1.01 | 25.4 ± 1.17 |
| Cluster | EMAR | 53.8 ± 1.30 | 68.6 ± 0.71 | 80.0 ± 0.83 | 76.9 ± 1.39 | 42.7 ± 2.92 | 52.5 ± 1.74 |
| Cluster | MLLRE | 56.8 ± 1.30 | 70.2 ± 0.93 | 84.5 ± 0.35 | 86.7 ± 0.46 | 34.4 ± 0.49 | 41.2 ± 1.37 |
| Cluster | CML (ours) | 60.2 ± 0.71 | 76.0 ± 0.24 | 85.6 ± 0.34 | 87.5 ± 0.32 | 44.4 ± 1.16 | 49.3 ± 1.01 |
| Random | Vanilla | 19.1 ± 1.20 | 19.3 ± 1.30 | 55.0 ± 1.30 | 55.2 ± 1.30 | 10.2 ± 2.02 | 10.4 ± 2.31 |
| Random | EWC | 30.1 ± 1.07 | 30.2 ± 1.05 | 66.4 ± 0.81 | 66.7 ± 0.83 | 15.3 ± 1.70 | 15.4 ± 1.79 |
| Random | AGEM | 36.9 ± 0.80 | 37.0 ± 0.83 | 76.4 ± 1.02 | 76.7 ± 1.01 | 13.4 ± 1.47 | 14.3 ± 1.62 |
| Random | EA-EMR | 61.4 ± 0.81 | 61.6 ± 0.76 | 83.1 ± 0.41 | 83.2 ± 0.47 | 27.3 ± 1.01 | 30.3 ± 0.70 |
| Random | EMAR | 62.7 ± 0.63 | 62.8 ± 0.62 | 82.4 ± 0.86 | 84.0 ± 0.78 | 45.1 ± 1.48 | 46.4 ± 2.00 |
| Random | MLLRE | 59.8 ± 0.91 | 59.8 ± 0.94 | 85.2 ± 0.25 | 85.5 ± 0.31 | 36.4 ± 0.66 | 38.0 ± 0.58 |
| Random | CML (ours) | 62.9 ± 0.62 | 63.0 ± 0.59 | 86.5 ± 0.22 | 86.9 ± 0.28 | 43.7 ± 0.83 | 45.3 ± 0.72 |
| Model (metric) | Train 100 / Mem 25 | Train 100 / Mem 50 | Train 200 / Mem 50 | Train all / Mem 50 |
|---|---|---|---|---|
| EA-EMR ($ACC_a$) | 70.7 | 75.5 | 74.8 | 73.9 |
| EA-EMR ($ACC_w$) | 53.2 | 57.4 | 59.8 | 59.6 |
| MLLRE ($ACC_a$) | 68.4 | 72.1 | 70.2 | 51.0 |
| MLLRE ($ACC_w$) | 51.9 | 57.8 | 56.8 | 47.3 |
| EMAR ($ACC_a$) | 60.1 | 66.7 | 68.6 | 74.1 |
| EMAR ($ACC_w$) | 43.7 | 51.2 | 53.8 | 57.7 |
| CML ($ACC_a$) | 73.6 | 76.4 | 76.0 | 58.0 |
| CML ($ACC_w$) | 54.7 | 60.3 | 60.2 | 49.1 |
The following observations can be made from Table 2. (i) Our model CML achieves the best $ACC_w$ and $ACC_a$ in both settings and on the three datasets in the majority of cases. (ii) Specifically, CML achieves the best $ACC_w$ and $ACC_a$ on the two larger datasets Continual-FewRel and Continual-SimpQ. (iii) CML obtains the lowest error bounds in the majority of cases, demonstrating better stability and lower order-sensitivity. (iv) The two task division methods produce the most prominent difference on Continual-FewRel for CML: when the data is evenly distributed (i.e. Random), CML's $ACC_a$ is significantly reduced to be almost equal to its $ACC_w$ (from 76.0 to 63.0). On the other two datasets, the performance difference is much less noticeable.
Although the three metrics $ACC_w$, $ACC_a$, and $EB$ are good measures of the overall model performance, they do not provide task-specific insights, which we will discuss further in the following subsection.
Analysis of Unbalanced Forgetting
We designed another experiment to better understand the causes of catastrophic forgetting and order-sensitivity (RQ1). In this experiment, each task is assigned a fixed ID ($\mathcal{T}^0$–$\mathcal{T}^9$). Starting with an initial “run” of tasks $(\mathcal{T}^0, \mathcal{T}^1, \ldots, \mathcal{T}^9)$, we test model performance on ten different runs generated by cyclic shifts of the initial run. The results of EA-EMR on Continual-FewRel are summarised in Table 3 (see Appendix C for the results for EMAR, MLLRE, and CML).
| Position | $\mathcal{T}^0$ | $\mathcal{T}^1$ | $\mathcal{T}^2$ | $\mathcal{T}^3$ | $\mathcal{T}^4$ | $\mathcal{T}^5$ | $\mathcal{T}^6$ | $\mathcal{T}^7$ | $\mathcal{T}^8$ | $\mathcal{T}^9$ | $ACC_a$ | $ACC_w$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [1] | 88.1 | 39.0 | 49.9 | 59.9 | 100.0 | 23.8 | 28.2 | 48.5 | 52.6 | 53.9 | 76.8 | 62.7 |
| [2] | 82.2 | 43.5 | 68.8 | 33.9 | 100.0 | 30.7 | 56.7 | 39.6 | 71.1 | 59.0 | 73.4 | 63.9 |
| [3] | 83.0 | 49.9 | 66.7 | 75.9 | 100.0 | 26.7 | 57.4 | 74.7 | 58.5 | 64.1 | 72.6 | 61.2 |
| [4] | 91.9 | 48.3 | 76.3 | 77.0 | 100.0 | 23.2 | 54.3 | 77.0 | 83.7 | 50.6 | 62.8 | 59.0 |
| [5] | 90.4 | 44.1 | 73.6 | 79.6 | 100.0 | 43.5 | 47.5 | 69.6 | 86.7 | 77.6 | 77.7 | 58.4 |
| [6] | 97.0 | 42.8 | 71.2 | 79.9 | 100.0 | 46.0 | 74.9 | 68.5 | 80.0 | 76.8 | 79.5 | 61.6 |
| [7] | 98.5 | 79.2 | 69.3 | 75.5 | 100.0 | 52.4 | 82.6 | 88.1 | 88.1 | 80.4 | 74.6 | 58.7 |
| [8] | 97.0 | 80.2 | 92.1 | 67.9 | 100.0 | 57.1 | 81.9 | 87.6 | 93.3 | 81.4 | 67.4 | 56.3 |
| [9] | 91.9 | 81.5 | 91.9 | 92.7 | 99.3 | 68.0 | 86.1 | 93.4 | 92.6 | 90.8 | 77.4 | 56.1 |
| [10] | 100.0 | 89.7 | 97.4 | 96.0 | 100.0 | 82.1 | 91.9 | 94.7 | 99.3 | 93.1 | 77.7 | 58.7 |
| $\mu$ | 92.0 | 59.8 | 75.7 | 73.8 | 99.9 | 45.4 | 66.2 | 74.2 | 80.6 | 72.8 | 74.0 | 59.7 |
| $\sigma$ | 6.25 | 20.06 | 14.39 | 17.51 | 0.22 | 19.92 | 20.41 | 18.50 | 15.35 | 14.99 | 5.26 | 2.61 |
As shown in Table 3, comparing, for each task, the results in rows $[10]$ and $[1]$, we can observe that most tasks see a significant drop in accuracy from position $[10]$ (when the task is the last one seen by the model) to position $[1]$ (when the task is the first one seen by the model), indicating that they suffer from catastrophic forgetting.
Intuitively, forgetting on a task reflects an increase in empirical error. We hypothesized that this is most likely due to frequent replay of the limited task-related memory, in other words, over-fitting on the memory. We tested this hypothesis by adjusting the ratio of training data to memory. Table 2 shows that when the memory size is fixed (e.g. 50), more training data (i.e. all vs. 200 vs. 100) results in poorer performance for all models except EMAR, indicating that the models over-fit on the memory due to a higher replay frequency. When we fix the ratio of training data to memory (e.g. 100:25 and 200:50), model performance conforms to the general rule of better performance with more data. Thus, the results shown in Table 2 support our hypothesis.
When reading each column in Table 3 separately, we find that the forgetting rate of each task is different, which cannot be explained solely by over-fitting on the memory. For example, the model performance on $\mathcal{T}^0$ and $\mathcal{T}^4$ is exactly the same (100%) when they appear at the last position $[10]$. However, performance on $\mathcal{T}^4$ does not degrade regardless of its position, whereas we can observe a decreasing trend of performance on $\mathcal{T}^0$ as its position moves toward the front of the run.
Moreover, we observe that order-sensitivity may be related to the difficulty of the earlier tasks in a run, where difficulty refers to a task's tendency to be more easily forgotten by the model. For instance, the final $ACC_a$ differs considerably across runs, ranging from 62.8 for the hardest run to 79.5 for the easiest one. Also, among all tasks at position $[1]$, task $\mathcal{T}^4$ has the highest accuracy while task $\mathcal{T}^5$ has the lowest.
Both of the above observations may be explained by task difficulty, which we further study in the next subsection.
| Task | $\mathcal{T}^0$ | $\mathcal{T}^1$ | $\mathcal{T}^2$ | $\mathcal{T}^3$ | $\mathcal{T}^4$ | $\mathcal{T}^5$ | $\mathcal{T}^6$ | $\mathcal{T}^7$ | $\mathcal{T}^8$ | $\mathcal{T}^9$ | Corr. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Prior difficulty | 0.121 | 0.141 | 0.168 | 0.054 | 0.035 | 0.186 | 0.112 | 0.152 | 0.146 | 0.137 | - |
| EA-EMR | 0.060 | 0.098 | 0.061 | 0.028 | 0.001 | 0.137 | 0.139 | 0.060 | 0.044 | 0.051 | 0.559 |
| MLLRE | 0.022 | 0.078 | 0.091 | 0.085 | -0.004 | 0.147 | 0.060 | 0.064 | 0.069 | 0.075 | 0.667 |
| EMAR | 0.036 | 0.016 | 0.027 | 0.007 | 0.006 | 0.016 | 0.008 | 0.026 | 0.020 | 0.005 | 0.499 |
| CML | -0.002 | 0.108 | 0.051 | 0.065 | -0.002 | 0.113 | 0.108 | 0.046 | 0.070 | 0.068 | 0.470 |
Analysis of Task Difficulty
In this section, we present the qualitative and quantitative analyses in order to better understand the difficulty of tasks (RQ2). We choose EA-EMR as the case study on the Continual-FewRel dataset with the tasks constructed through clustering.
Qualitative Analysis.
[Figure 1: t-SNE visualization of the relation embeddings learned by EA-EMR on Continual-FewRel; nodes are relations and colors indicate tasks.]
Figure 1 shows the t-SNE visualization of the relations, where nodes represent relations, colors represent the tasks, and the distance is calculated from the hidden layer in EA-EMR.
As can be seen from the figure, two tasks have only one relation each; one of these relations is far away from all the others, while the other is much closer to them. The difference in their distances from the other relations may explain the difference in their task-specific performance in Table 3: the task with the distant relation ($\mathcal{T}^4$) does not suffer from catastrophic forgetting, whereas the other one does. In other words, $\mathcal{T}^4$ is easy whereas the other task is more difficult.
Similarly, we can observe that another task (colored in black in the figure) contains only two relations and overlaps significantly with the other tasks. Therefore, catastrophic forgetting is more serious on this task, i.e., it is difficult. Finally, two further tasks are difficult, as their centroids are very close to each other; both suffer from serious catastrophic forgetting, as can be seen from Table 3.
Therefore, in continual learning, the difficulty of a task may be characterized by the correlation between it and the other observed tasks. In the relation extraction scenario, we define this correlation as the semantic similarity between relations.
Quantitative Analysis.
Based on the above analysis, we hypothesize that a model could achieve better performance if we can measure the similarity of the current task and previous tasks and guide the model to distinguish similar tasks.
For ease of expression, we refer to the measured difficulty of each task as the prior difficulty, and to the average forgetting rate as the posterior difficulty. The prior difficulty is estimated from the knowledge-graph-based relation embeddings, while the posterior difficulty is related to the performance of each model.
Specifically, we use our embedding-based task difficulty function defined in Eq. (3) as the prior difficulty, and the average forgetting rate defined in Eq. (6) as the posterior difficulty. Table 4 shows the correlation between the prior and posterior difficulties, and thus offers evidence of the effectiveness of our representation learning method.
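The correlation in the last column of Table 4 is computed per model from its row of forgetting rates and the prior-difficulty row; the sketch below uses Pearson's coefficient as one natural choice of correlation measure.

```python
import numpy as np

def prior_posterior_correlation(prior_difficulty, forgetting_rates):
    """Per-model correlation between the embedding-based prior difficulty (Eq. 3)
    and the per-task average forgetting rate (Eq. 6); Pearson's r is used here
    as one natural choice of correlation coefficient."""
    return float(np.corrcoef(prior_difficulty, forgetting_rates)[0, 1])
```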
We can draw four main conclusions from Table 4: (i) The correlation coefficients of the three baseline models demonstrate that the semantic embedding-based prior difficulty is positively correlated with the forgetting rate, and thus indicates which tasks are difficult for a model. (ii) Comparing the values of CML with those of the other three methods, our proposed CML with the knowledge-based curriculum does alleviate the interference between similar tasks, as its forgetting rates are the least correlated with the similarity-based prior difficulty. (iii) For tasks $\mathcal{T}^0$ and $\mathcal{T}^4$ in CML, the average forgetting rate is negative, which means that the accuracy of these tasks decreases as they move towards the end of a run. (iv) Based on the analysis of Table 2 and Table 3 (see Appendix C for the other three relevant tables), it is evident that CML improves the overall accuracy and effectively alleviates the order-sensitivity problem. We note that these improvements come at the cost of moderately decreased accuracy on the simplest tasks.
6 Conclusion
In this paper, we proposed a novel curriculum-meta learning method to tackle the catastrophic forgetting and order-sensitivity issues in continual relation learning. The construction of the curriculum is based on the notion of task difficulty, which is defined through a novel relation representation learning method that learns from the distribution of domain and range types of relations. Our comprehensive experiments on the three benchmark datasets Continual-FewRel, Continual-SimpleQuestions and Continual-TACRED show that our proposed method outperforms the state-of-the-art models, and is less prone to catastrophic forgetting and less order-sensitive. In the future, we will investigate an end-to-end curriculum model and a new dynamic difficulty measurement based on the framework presented in this paper.
Acknowledgments
Research in this paper was partially supported by the National Key Research and Development Program of China under grants (2018YFC0830200, 2017YFB1002801), the Natural Science Foundation of China grants (U1736204), the Judicial Big Data Research Centre, School of Law at Southeast University.
References
- Aljundi et al. (2018) Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; and Tuytelaars, T. 2018. Memory Aware Synapses: Learning What (not) to Forget. In Proceedings of ECCV, 144–161.
- Bengio et al. (2009) Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of ICML, 41–48.
- Bordes et al. (2015) Bordes, A.; Usunier, N.; Chopra, S.; and Weston, J. 2015. Large-scale Simple Question Answering with Memory Networks. CoRR abs/1506.02075.
- Chaudhry et al. (2019) Chaudhry, A.; Ranzato, M.; Rohrbach, M.; and Elhoseiny, M. 2019. Efficient Lifelong Learning with A-GEM. In Proceedings of ICLR.
- Chen et al. (2006) Chen, J.; Ji, D.; Tan, C. L.; and Niu, Z.-Y. 2006. Relation extraction using label propagation based semi-supervised learning. In Proceedings of ACL, 129–136.
- Chen, Goodfellow, and Shlens (2016) Chen, T.; Goodfellow, I. J.; and Shlens, J. 2016. Net2Net: Accelerating Learning via Knowledge Transfer. In Proceedings of ICLR.
- Chen et al. (2019) Chen, W.; Zhu, H.; Han, X.; Liu, Z.; and Sun, M. 2019. Quantifying Similarity between Relations with Fact Distribution. In Proceedings of ACL, 2882–2894.
- Chen and Liu (2016) Chen, Z.; and Liu, B. 2016. Lifelong Machine Learning.
- de Masson d’Autume et al. (2019) de Masson d’Autume, C.; Ruder, S.; Kong, L.; and Yogatama, D. 2019. Episodic Memory in Lifelong Language Learning. In Proceedings of NeurIPS, 13122–13131.
- Donahue et al. (2014) Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proceedings of ICML, 647–655.
- Han et al. (2020a) Han, X.; Dai, Y.; Gao, T.; Lin, Y.; Liu, Z.; Li, P.; Sun, M.; and Zhou, J. 2020a. Continual Relation Learning via Episodic Memory Activation and Reconsolidation. In Proceedings of ACL, 6429–6440.
- Han et al. (2020b) Han, X.; Gao, T.; Lin, Y.; Peng, H.; Yang, Y.; Xiao, C.; Liu, Z.; Li, P.; Sun, M.; and Zhou, J. 2020b. More Data, More Relations, More Context and More Openness: A Review and Outlook for Relation Extraction. CoRR abs/2004.03186.
- Han et al. (2018) Han, X.; Zhu, H.; Yu, P.; Wang, Z.; Yao, Y.; Liu, Z.; and Sun, M. 2018. FewRel: A Large-Scale Supervised Few-shot Relation Classification Dataset with State-of-the-Art Evaluation. In Proceedings of EMNLP, 4803–4809.
- Kirkpatrick et al. (2016) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N. C.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2016. Overcoming catastrophic forgetting in neural networks. CoRR abs/1612.00796.
- Lesort et al. (2019) Lesort, T.; Caselles-Dupré, H.; Ortiz, M. G.; Stoian, A.; and Filliat, D. 2019. Generative Models from the perspective of Continual Learning. In Proceedings of IJCNN, 1–8.
- Lin et al. (2019) Lin, H.; Yan, J.; Qu, M.; and Ren, X. 2019. Learning Dual Retrieval Module for Semi-supervised Relation Extraction. In Proceedings of WWW, 1073–1083.
- Lin et al. (2016) Lin, Y.; Shen, S.; Liu, Z.; Luan, H.; and Sun, M. 2016. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of ACL, 2124–2133.
- Liu et al. (2013) Liu, C.; Sun, W.; Chao, W.; and Che, W. 2013. Convolution Neural Network for Relation Extraction. In Motoda, H.; Wu, Z.; Cao, L.; Zaïane, O. R.; Yao, M.; and Wang, W., eds., Proceedings of ADMA, 231–242.
- Liu et al. (2018) Liu, X.; Masana, M.; Herranz, L.; van de Weijer, J.; López, A. M.; and Bagdanov, A. D. 2018. Rotate your Networks: Better Weight Consolidation and Less Catastrophic Forgetting. In Proceedings of ICPR, 2262–2268.
- Lopez-Paz and Ranzato (2017) Lopez-Paz, D.; and Ranzato, M. 2017. Gradient Episodic Memory for Continual Learning. In Proceedings of NeurIPS, 6467–6476.
- Mahdisoltani, Biega, and Suchanek (2015) Mahdisoltani, F.; Biega, J.; and Suchanek, F. M. 2015. YAGO3: A Knowledge Base from Multilingual Wikipedias. In Proceedings of CIDR, 1–11.
- Mallya, Davis, and Lazebnik (2018) Mallya, A.; Davis, D.; and Lazebnik, S. 2018. Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights. In Proceedings of ECCV, 72–88.
- Marcheggiani and Titov (2016) Marcheggiani, D.; and Titov, I. 2016. Discrete-State Variational Autoencoders for Joint Discovery and Factorization of Relations. Transactions of the Association for Computational Linguistics 231–244.
- Miwa and Bansal (2016) Miwa, M.; and Bansal, M. 2016. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of ACL, 1105–1116.
- Nichol, Achiam, and Schulman (2018) Nichol, A.; Achiam, J.; and Schulman, J. 2018. On First-Order Meta-Learning Algorithms. CoRR abs/1803.02999.
- Obamuyide and Vlachos (2019) Obamuyide, A.; and Vlachos, A. 2019. Meta-Learning Improves Lifelong Relation Extraction. In Proceedings of RepL4NLP@ACL, 224–229.
- Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global Vectors for Word Representation. In Proceedings of EMNLP, 1532–1543.
- Rebuffi et al. (2017) Rebuffi, S.; Kolesnikov, A.; Sperl, G.; and Lampert, C. H. 2017. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of CVPR, 5533–5542.
- Ritter, Botev, and Barber (2018) Ritter, H.; Botev, A.; and Barber, D. 2018. Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting. In Proceedings of NeurIPS, 3742–3752.
- Rusu et al. (2016) Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive Neural Networks. CoRR abs/1606.04671.
- Shin et al. (2017) Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual Learning with Deep Generative Replay. In Proceedings of NeurIPS, 2990–2999.
- Sun, Grishman, and Sekine (2011) Sun, A.; Grishman, R.; and Sekine, S. 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of NAACL-HLT, 521–529.
- Sun, Ho, and Lee (2020) Sun, F.; Ho, C.; and Lee, H. 2020. LAMOL: LAnguage MOdeling for Lifelong Language Learning. In Proceedings of ICLR.
- Thrun (1998) Thrun, S. 1998. Lifelong Learning Algorithms. In Learning to Learn, 181–209.
- Turki et al. (2019) Turki, H.; Shafee, T.; Taieb, M. A. H.; Aouicha, M. B.; Vrandecic, D.; Das, D.; and Hamdi, H. 2019. Wikidata: A large-scale collaborative ontological medical database. J. Biomed. Informatics 99.
- Wang et al. (2019) Wang, H.; Xiong, W.; Yu, M.; Guo, X.; Chang, S.; and Wang, W. Y. 2019. Sentence Embedding Alignment for Lifelong Relation Extraction. In Proceedings of NAACL-HLT, 796–806.
- Yao et al. (2011) Yao, L.; Haghighi, A.; Riedel, S.; and McCallum, A. 2011. Structured Relation Discovery using Generative Models. In Proceedings of EMNLP, 1456–1466.
- Yoon et al. (2020) Yoon, J.; Kim, S.; Yang, E.; and Hwang, S. J. 2020. Scalable and Order-robust Continual Learning with Additive Parameter Decomposition. In Proceedings of ICLR.
- Yu et al. (2017) Yu, M.; Yin, W.; Hasan, K. S.; dos Santos, C. N.; Xiang, B.; and Zhou, B. 2017. Improved Neural Relation Detection for Knowledge Base Question Answering. In Proceedings of ACL, 571–581.
- Zelenko, Aone, and Richardella (2002) Zelenko, D.; Aone, C.; and Richardella, A. 2002. Kernel Methods for Relation Extraction. In Proceedings of EMNLP, 71–78.
- Zeng et al. (2014) Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; and Zhao, J. 2014. Relation Classification via Convolutional Deep Neural Network. In Proceedings of COLING, 2335–2344.
- Zenke, Poole, and Ganguli (2017) Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual Learning through Synaptic Intelligence. In Proceedings of ICML, 3987–3995.
- Zhang et al. (2017) Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, C. D. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. In Proceedings of EMNLP, 35–45.