Curriculum-Meta Learning for Order-Robust Continual Relation Extraction

Tongtong Wu¹, Xuekai Li¹, Yuan-Fang Li², Gholamreza Haffari², Guilin Qi¹,
Yujin Zhu³, Guoqiang Xu³
This work has been done by Tongtong Wu during the visiting period at Monash University, and the original idea was generated in the internship at Gamma Lab, Ping An OneConnect.Contact author.

Abstract

Continual relation extraction is an important task that focuses on extracting new facts incrementally from unstructured text. Given the sequential arrival order of the relations, this task is prone to two serious challenges, namely catastrophic forgetting and order-sensitivity. We propose a novel curriculum-meta learning method to tackle the above two challenges in continual relation extraction. We combine meta learning and curriculum learning to quickly adapt model parameters to a new task and to reduce interference of previously seen tasks on the current task. We design a novel relation representation learning method through the distribution of domain and range types of relations. Such representations are utilized to quantify the difficulty of tasks for the construction of curricula. Moreover, we also present novel difficulty-based metrics to quantitatively measure the extent of order-sensitivity of a given model, suggesting new ways to evaluate model robustness. Our comprehensive experiments on three benchmark datasets show that our proposed method outperforms the state-of-the-art techniques. The code is available at the anonymous GitHub repository https://github.com/wutong8023/AAAI-CML.

1 Introduction

Relation extraction (Han et al. 2020b) aims at extracting structured facts as triples from unstructured text. As an essential component of information extraction, relation extraction has been widely utilized in downstream applications such as knowledge base construction (Turki et al. 2019) and population (Mahdisoltani, Biega, and Suchanek 2015). However, given the continuous and iterative nature of the update process, continual relation extraction (Wang et al. 2019; Obamuyide and Vlachos 2019) is a more realistic and useful setting. Yet due to the limitations of storage and computational resources, it is impractical to grant the relation extractor access to all the training instances in previously seen tasks. Thus, this continual learning formulation is in contrast to the conventional relation extraction setting where the extractor is generally trained from scratch with the full access to the training corpus.

Catastrophic Forgetting (CF) is a well-known problem in continual learning (Chen and Liu 2016). The problem is that when a neural network is utilized to learn a sequence of tasks, the learning of the later tasks may degrade the performance of the learned model for the previous tasks. Various recent works tackle the CF problem, including consolidation-based methods (Zenke, Poole, and Ganguli 2017; Kirkpatrick et al. 2016), dynamic architecture methods (Chen, Goodfellow, and Shlens 2016; Rusu et al. 2016), or memory-based methods (Rebuffi et al. 2017; Lopez-Paz and Ranzato 2017; Chaudhry et al. 2019). These methods have been demonstrated in simple image classiﬁcation tasks. Yet the memory-based methods have been proven to be the most promising for NLP applications. EA-EMR (Wang et al. 2019) proposes a sentence embedding alignment mechanism in memory maintenance and adopt it to continual relation extraction learning. Based on EA-EMR, MLLRE (Obamuyide and Vlachos 2019) introduces a meta-learning framework for fast adaptation and EMAR (Han et al. 2020a) introduces a multi-turn co-training procedure for memory-consolidation. Most of these methods explore CF problem in the overall performance of task sequences, but they lack the insight analysis of the characteristics of each subtask and the corresponding model performance.

Order-sensitivity (OS) is another major problem in continual learning, which is relatively under-explored (Chen and Liu 2016; Yoon et al. 2020). It refers to the phenomenon that the performance of tasks varies based on the order of the task arrival sequence. This is due to not only the CF incurred by the different sequences of previous tasks but also the unidirectional knowledge transfer from the previous tasks. Order-sensitivity can be problematic in various aspects: (i) ethical AI considerations in continual learning, e.g. fairness in the medical domain (Yoon et al. 2020); (ii) bench-marking of continual learning algorithms as most of the existing works pick an arbitrary and random sequence of the given tasks for evaluation (Chen and Liu 2016); (iii) uncertainty to the quality of extracted knowledge in the realistic scenario for knowledge base population, where the model is faced with only one sequence.

In this paper, we introduce the curriculum-meta learning (CML) method to tackle both the catastrophic forgetting and order-sensitivity problems. Taking a memory-based approach, CML is based on the following observations about the catastrophic forgetting and order sensitivity issues of the previous works: (i) over-fitting to the experience memory, indicating that the performance on any task will decrease as training progresses, and (ii) the interference between similar tasks, indicating that the model performs better on less intrusive tasks. We therefore design a mechanism which selectively reduces the replay frequency of memory to avoid over-fitting, and steer the model to learn the bias between the current task and previous most similar tasks to reduce the order-sensitivity.

Our CML method contains two steps. In the first step, it samples instances from the memory based on the difficulty of the previous tasks for the current task, resulting in a curriculum for continual learning. Then, it trains the model on both the curriculum and training instances of the current task. We further introduce a knowledge-based method to quantify task difficulty according to the similarity of pairs of relations. Taking a relation as a function mapping to named entities in its domain and range, we define a similarity measure between two relations based on the conceptual distribution of their head and tail entities.

Our contributions are summarized as follows:

•

We propose a novel curriculum-meta learning method to tackle the order-sensitivity and catastrophic forgetting problems in continual relation extraction.
•

We introduce a new relation representation learning method via the conceptual distribution of head and tail entities of relations, which is utilized to quantify the difficulty of each relation extraction task for constructing the curriculum.
•

We conduct comprehensive experiments to analyze the order-sensitivity and catastrophic forgetting problems in state-of-the-art methods, and empirically demonstrate that our proposed method outperforms the state-of-the-art methods on three benchmark datasets.

2 Related Work

The conventional relation extraction methods could be categorized into three domains by the way data is used: supervised methods (Zelenko, Aone, and Richardella 2002; Liu et al. 2013; Zeng et al. 2014; Lin et al. 2016; Miwa and Bansal 2016), semi-supervised methods (Chen et al. 2006; Sun, Grishman, and Sekine 2011; Lin et al. 2019), and distantly supervised methods (Yao et al. 2011; Marcheggiani and Titov 2016). Most of these methods assume a predefined relation schema and thus cannon be easily generalized to new relations. To overcome this problem, several challenging tasks, including open relation learning and continual relation learning, have been proposed to detect and learn relations without a predefined relation schema.

In this paper, we address the continual relation learning problem (Wang et al. 2019), a relatively new and less investigated task. Continual learning in general faces two major challenges: catastrophic forgetting and order-sensitivity.

Catastrophic forgetting (CF) is a prominent line of research in continual learning (Chen and Liu 2016; Thrun 1998). Methods addressing CF can be broadly divided into three categories. (i) Consolidation-based methods (Kirkpatrick et al. 2016; Zenke, Poole, and Ganguli 2017; Liu et al. 2018; Ritter, Botev, and Barber 2018) consolidate model parameters important to previous tasks and reduce their learning weights. These methods employ sophisticated mechanisms to evaluate parameter importance for tasks. (ii) Dynamic architecture methods (Lesort et al. 2019; Mallya, Davis, and Lazebnik 2018) dynamically expand model architectures to learn new tasks and effectively prevent forgetting of old tasks. Sizes of these methods grow dramatically with new tasks, making them unsuitable for NLP applications. (iii) Memory-based methods (Lopez-Paz and Ranzato 2017; Rebuffi et al. 2017; Shin et al. 2017; Aljundi et al. 2018; Chaudhry et al. 2019) remember a few examples in old tasks and continually learn them with emerging new tasks to alleviate catastrophic forgetting. Among these methods, the memory-based methods have been proven to be the most promising for NLP applications (Sun, Ho, and Lee 2020; de Masson d’Autume et al. 2019), including both relation learning (Han et al. 2020a; Wang et al. 2019).

Order-sensitivity (OS) (Chen and Liu 2016; Yoon et al. 2020) is another major problem in continual learning that is relatively under-explored. It is the phenomenon that a model’s performance is sensitive to the order in which tasks arrive. In this paper, we tackle this problem by leveraging a curriculum learning method (Bengio et al. 2009). Briefly, we construct our curriculum by the similarity of tasks, thus minimizing the impact and interference of previous tasks.

3 Curriculum-Meta Learning

Problem Formulation.

In continual relation extraction, given a sequence of $K$ tasks $\{\mathcal{T}_{1},\mathcal{T}_{2},\ldots,\mathcal{T}_{K}\}$ , each task $\mathcal{T}_{k}$ is a conventional supervised classification task, containing a series of examples and their corresponding labels $\{(x^{(i)},y^{(i)})\}$ , where $x^{(i)}$ is the input data, containing the natural-language context and the candidate relations, and $y^{(i)}$ is the ground-truth relation label of the context. The model $f_{\theta}(.)$ can access the training data of the current task $\mathcal{D}^{train}_{k}$ and is trained by optimising a loss function $l(f_{\theta}(x),y)$ . The goal of continual learning is to train the model $f_{\theta}(.)$ such that it continually learns new tasks while avoiding catastrophically forgetting the previously learned tasks. Due to various constraints, the learner is typically allowed to maintain and observe only a subset of the training data of the previous tasks, which is contained in a memory set $\mathcal{M}$ .

The performance of the model $f_{\theta}(.)$ is measured in the conventional way, by whole accuracy $Acc_{w}=acc_{f,\mathcal{D}^{test}}$ , on the entire test set, where $\mathcal{D}^{test}=\bigcup_{i=1}^{K}\mathcal{D}_{i}^{test}$ . Moreover, model performance at task $k$ is evaluated with average accuracy on the test sets of all the tasks up to this task in the sequence $Acc_{a}=\frac{1}{k}\sum^{k}_{i=1}acc_{f,i}$ . Average accuracy is a better measure of the effect of catastrophic forgetting as it emphasizes on a model’s performance of earlier tasks.

Framework.

Input: Stream of incoming tasks

\mathcal{T}_{1},\mathcal{T}_{2},..

; Classification model

f_{\theta}

, Step size

\epsilon

, Learning rate

\alpha

, Relations embedding

r

, Memory buffer

\mathcal{M}

, Curriculum size

k

, Curriculum instance size

n

, Knowledge-based curriculum module

g_{\phi}

2Initialize

\theta

3while there are still tasking do

5 Retrieve current task

\mathcal{T}_{t}

6 Initialize

\theta_{t}\leftarrow\theta

7 while not convergence do

9 if $\mathcal{M}$ is not empty then

11 Set

\mathcal{D}_{t}^{train}\leftarrow\mathcal{D}_{t}^{train}\cup\mathcal{M}

13 for each relation $R_{i}$ in $\mathcal{T}_{t}$ do

15 Initialize

\theta_{t_{i}}\leftarrow\theta_{t}

16 Sample

D_{t,i}^{train}

from

D_{t}^{train}

for

R_{i}

17 if $\mathcal{T}_{t}$ is not the first task then

19 Sample

k

curriculum relations from

\mathcal{M}

with

g_{\phi}

20 Sample

n

instances for each curriculum relation

21 Construct sorted mini-batches, each contains

D_{t,i}^{curri}

k\times n

instances.

22 Evaluate

\nabla_{\theta_{t_{i}}}L_{R_{i}}(f_{\theta_{t_{i}}})

using

D_{t,i}^{curri}

23 Update

\theta_{t_{i}}^{\prime}=Adam(\theta_{t_{i}},L_{R_{i}}(f_{\theta_{t_{i}}}),\alpha)

25 Evaluate

\nabla_{\theta_{t_{i}}^{\prime}}L_{R_{i}}(f_{\theta_{t_{i}}^{\prime}})

using

D_{t,i}^{train}

26 Update

\theta_{t_{i}}^{*}=Adam(\theta_{t_{i}}^{\prime},L_{R_{i}}(f_{\theta_{t_{i}}^{\prime}}),\alpha)

29 Update

\theta_{t+1}=\theta_{t}+\frac{\epsilon}{N}\sum^{N}_{i=1}(\theta_{t_{i}}^{*}-\theta_{t})

30 Sample

sub\mathcal{T}_{t}

from

\mathcal{T}_{t}

31 Update

\mathcal{M}\leftarrow\mathcal{M}\cup sub\mathcal{T}_{t}

32 Update

\theta\leftarrow\theta_{t}

33 Fine-tune

\theta

\mathcal{M}

Algorithm 1 Curriculum-Meta Learning

Our curriculum-meta learning (CML) framework is described in Algorithm 1. CML maintains initialization parameters $\theta_{t}$ and a memory set $\mathcal{M}$ that stores the prototype instances of previous tasks. It performs the following operations at each time step $t$ during the learning phase. (1) The meta-learner $L$ fetches the initialization parameters $\theta_{t}$ from the memory to initialize the model $f_{\theta_{t}}(.)$ . (2) $L$ replays on the curriculum set $D_{t,i}^{curri}$ which is sampled and sorted by the knowledge-based curriculum module. (3) $L$ trains on the support set $D_{t,i}^{train}$ of the current task $\mathcal{T}_{t}$ . (4) Finally, $L$ updates the learned parameters $\theta_{t+1}$ and stores a small number of prototype instances of the current task into the memory. During the evaluation phase, the trained model is given a target set with labeled unseen instances from all observed tasks (See Appendix A for the workflow of CML.)

We will introduce the framework in terms of (1) the utilization of the initialization parameters (i.e. meta training) and (2) the utilization of the memory set (i.e. the curriculum-based memory replay).

Meta Training. Meta learning, or learning to learn, aims at developing algorithms to learn the generic knowledge of how to solve tasks from a given distribution of tasks. With a given basic relation extraction model (Yu et al. 2017) $f_{\theta}(.)$ parameterized by $\theta$ , we employ the gradient-based meta-learning method (Nichol, Achiam, and Schulman 2018) to learn a prior initialization $\theta_{t}$ at each time step $t$ . During adaption to a new task, the model parameters $\theta$ are quickly updated from $\theta_{t}$ to the task-specific $\theta_{t}^{*}$ with a few steps of gradient descent. Formally, the meta learner $L$ updates $f_{\theta}(.)$ that is optimized for the following objectives:

\displaystyle\min_{\theta}\mathbb{E}_{\mathcal{T}\thicksim p(\mathcal{(}T)}[\mathcal{L}(\theta^{*})]=\min_{\theta}\mathbb{E}_{\mathcal{T}\thicksim p(\mathcal{(}T)}[\mathcal{L}(\mathcal{U}(\mathcal{D_{T}}\theta))]

(1)

where $\mathcal{D_{T}}$ is the training data, $\mathcal{L}$ is the loss function for task $\mathcal{T}$ , and $\mathcal{U}$ is the optimizer of $f_{\theta}(.)$ . Then, when it converges on the current task, the model will generate the initialization parameter $\theta_{t+1}$ for the next time step $t+1$ :

\theta_{t+1}=\theta+\frac{\epsilon}{n}\sum_{t=1}^{n}(\theta_{t}^{*}-\theta_{t}),

(2)

where the $\theta_{t}^{*}$ is the updated parameter for the current task $\mathcal{T}_{k}$ at time step $t$ , and $n$ is the number of instances which may be processed in parallel at a time step.

Curriculum-based Memory Replay. Meta learner $L$ reviews the previous tasks in an orderly way before learning the new task. Here, we denote by $g_{\phi}(.)$ a function to represent the teacher which prepares the curriculum for the student network (i.e. the relation extractor $f_{\theta}(.)$ ) for replay. Different from conventional experience-replay based models, the teacher function needs to master three skills:

1.

Assessing the difficulty of tasks. When a new task arrives, this function calculates which of all observed previous tasks interferes with the current task.
2.

Sampling instances from the memory. By sampling, we can reduce the time consumption in the replay stage and alleviate the over-fitting problems caused by the high frequency of updates on the memory.
3.

Ranking the sampled instances by a certain strategy. The teacher instructs the student model to learn the bias between the current task and observed similar tasks in the most efficient way.

We sample the memory randomly and sort the sampled instances according to the difficulty of each previous task with respect to the current task. Based on the above requirements, we implement a knowledge-based curriculum module, which is introduced in the next section.

4 Knowledge-based Curriculum

Intrinsically, order-sensitivity is caused by a model’s inability to guarantee optimal performance for all previous tasks. However, from the perspective of experiments, order-sensitivity is closely related to the unbalanced forgetting rate (or the unbalanced difficulty) of different tasks, where we assume that task difficulty is due to the interactions between semantically similar relations in an observed task sequence. Intuitively, if the conceptual distribution of two relations is similar, these two relations tend to be expressed in similar natural language contexts, such as the relations “father” and “mother”.

Semantic Embedding-based Difficulty Function.

To formalize this intuition, we define a difficulty estimation function based on the semantic embeddings of relations in each task. Given a set of $K$ tasks $\{\mathcal{T}_{1},\mathcal{T}_{2},...,\mathcal{T}_{K}\}$ , the difficulty of task $\mathcal{T}_{i}$ is defined as:

Dl_{i}:=\frac{1}{K-1}\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{K}S_{i}^{j}

(3)

$S_{j}^{i}$ is the similarity score between tasks $\mathcal{T}_{i}$ and $\mathcal{T}_{j}$ , which is defined as the average similarity among relation pairs from the two tasks:

S_{i}^{j}:=\frac{1}{M\times N}\sum_{m=1}^{M}\sum_{n=1}^{N}s_{m}^{n},

(4)

where $M$ and $N$ are the numbers of relations in each task respectively, and $s_{m}^{n}$ calculates the Cosine similarity between the embeddings of the two relations: $s_{m}^{n}:=\cos(emd_{m},emd_{n})$ . Using $s_{m}^{n}$ , we calculate the difficulty of each relation in the memory with respect to the relations in the current task, in order to sort and sample the relations stored in memory into the final curriculum.

Relation Representation Learning.

In order to calculate the semantic embedding of each relation, inspired by (Chen et al. 2019), we introduce a knowledge- and distribution-based representation learning method. Intuitively, the representation of a relation is learned from the types of its head and tail entities. Consider a knowledge graph $\mathcal{G}=\{(h,r,t)\in\mathcal{E\times R\times E}\}$ , where $h$ and $t$ are the head and tail entities, $r$ is the relation between them, and $\mathcal{E}$ and $\mathcal{R}$ represent the sets of entities and relations respectively. We reduce the relation representation learning task into the problem of learning the conceptual distribution of each relation which is optimized based on the following objective:

		$\displaystyle\min_{\phi}\mathcal{L}(\phi;\mathcal{G})=$		(5)
		$\displaystyle\min_{\phi}\sum_{(h^{\prime},t^{\prime};r)\in\mathcal{G}}[-\log P_{\phi}(h^{\prime}\|r)-\log P_{\phi}(t^{\prime}\|r)]$		(5)

where $h^{\prime}$ and $t^{\prime}$ are the concepts (i.e. the hypernyms obtained from the knowledge graph) of the head and tail entities respectively. $P_{\phi}(h^{\prime}|r)=\exp(NN_{\phi_{1}}(h^{\prime},r))$ , and $P_{\phi}(t^{\prime}|r)=\exp(NN_{\phi_{2}}(t^{\prime},r))$ , where $NN_{\phi}(a,b)=MLP_{\phi}(\mathbf{a})^{\top}\mathbf{b}$ is two-layer neural network parameterized with $\phi$ . Finally, we obtain two representations $emd^{h}_{r}$ and $emd^{t}_{r}$ for each relation, which indicate the conceptual-distribution of head entities and tail entities respectively. We concatenate these two embeddings to generate the final representation of the relation $emd_{r}:=[emd^{h}_{r};emd^{t}_{r}]$ .

5 Experiments

In this section, we aim to empirically address the following research questions related to our contributions:
RQ1: Why and to what extent do current memory replay-based approaches suffer from catastrophic forgetting and order-sensitivity?
RQ2: How to qualitatively and quantitatively understand task difficulty?
RQ3: Compared with the state-of-the-art methods, can our method (curriculum-meta learning with the knowledge-based curriculum) effectively alleviate catastrophic forgetting and order-sensitivity?

Datasets. We conduct our experiments on three datasets, including Continual-FewRel, Continual-SimpleQuestions, and Continual-TACRED, which were introduced in (Han et al. 2020a). FewRel (Han et al. 2018) is a labelled dataset which contains 80 relations and 700 instances per relation. SimpleQuestions is a knowledge-based question answering dataset containing single-relation questions (Bordes et al. 2015), from which a relation extraction dataset was extracted (Yu et al. 2017). The relation extraction dataset contains 1,785 relations and 72,238 training instances. TACRED (Zhang et al. 2017) is a well-constructed RE dataset that contains 42 relations and 21,784 examples. Considering the special relation “n/a” (i.e, not available) in TACRED, we follow (Han et al. 2020a) and filter out these examples with the relation “n/a” and use the remaining 13,012 examples for Continual-TACRED.

Following Wang et al. (2019); Obamuyide and Vlachos (2019), we partition the relations of each dataset into some groups and then consider each group of relations as a distinct task $\mathcal{T}_{k}$ . We form training and testing set for each task, based on the instances in the original dataset labeled by the relations in the task. Following the previous work, we employ two relation partitioning methods. Firstly, the unbalanced division is based on clustering, using the averaged word embeddings (Pennington, Socher, and Manning 2014) of relation names with the K-means clustering algorithm (Wang et al. 2019). Secondly, the random partitioning into groups with a similar number of relations (Obamuyide and Vlachos 2019). For Continual-FewRel, we partition its 80 relations into 10 distinct tasks. Similarly, we partition the 1,785 relations in Continual-SimpleQuestions into 20 disjoint tasks, so as well as partition the 41 relations in Continual-TACRED into 10 tasks.

Evaluation Metrics. We employ the following four metrics to measure model performance. Note that the last two metrics, the average forgetting rate and the error bound, are new metrics we propose in this paper.

Whole Accuracy

of the resulting model at the end of the continual learning process on the full test sets of all tasks, $Acc_{w}:=acc_{f,\mathcal{D}_{test}}.$

Average Accuracy

of the resulting model trained on task $\mathcal{T}_{k}$ on all the test sets of all tasks seen up to stage $k$ of the continual learning process, $Acc_{a}:=\frac{1}{k}\sum^{k}_{i=1}acc_{f,i}.$ Compared to $Acc_{w}$ , $Acc_{a}$ highlights the catastrophic forgetting problem. However, as we will empirically show, $Acc_{a}$ is subject to order-sensitivity of the tasks sequence, and thus does not accurately measure the level of forgetting on a specific task.

Average Forgetting Rate

for task $j$ after $k$ time steps, $Fr^{j}_{avg}$ is a new metric to evaluate task-specific model performance on order-sensitivity.

Fr^{j}_{avg}:=\frac{1}{k-1}\sum^{k-1}_{i=1}\frac{\overline{acc}^{j}_{i+1}-\overline{acc}^{j}_{i}}{\overline{acc}^{j}_{i}},

(6)

where $\overline{acc}^{j}_{i}$ is the model’s average performance on a specific task $\mathcal{T}_{j}$ when it appears in the $i$ th position of distinct task permutations:

\overline{acc}^{j}_{i}:=\frac{1}{(J-1)!}\sum_{\pi\in\Pi_{[1,\ldots,J]}\text{ st }\pi_{i}=j}acc_{i}(\pi)

(7)

where $acc_{i}(\pi)$ is the final accuracy on task $\mathcal{T}_{i}$ of the model trained on the permutation $\pi$ , $\Pi_{[1,..,J]}$ is the set of all permutations of the tasks $\{\mathcal{T}_{1},\ldots,\mathcal{T}_{J}\}$ , and $\pi_{i}$ is the index of Task $\mathcal{T}_{i}$ of the sequence. We note that the number of all permutations in which task $\mathcal{T}_{i}$ is fixed at position $j$ is $(J-1)!$ . Of course we may not be able to exactly compute $\overline{acc}^{j}_{f,i}$ as the size of possible tasks persmutations grows exponentially. Therefore, we estimate this quantity by Monte Carlo sampling of some permutations.

Error Bound

is a new metric to evaluate the overall model performance regarding order-sensitivity.

EB:=Z_{\frac{\alpha}{2}}\times\frac{\delta}{\sqrt{n}},

(8)

where $Z_{\frac{\alpha}{2}}$ is the confidence coefficient of confidence level $\alpha$ , and $\delta$ is the standard deviation of accuracy obtained from $n$ distinct task permutations. Note that a model with a lower error bound shows better robustness and less order-sensitivity for the input sequences.

Baseline Models. We compare our proposed CML with knowledge-based curriculum-meta learning with the following baseline models, among which Vanilla is employed as the base learner for both CML and the other models (see Appendix B for hyper-parameters.):

1.

Vanilla (Yu et al. 2017), which is the basic model for conventional supervised relation extraction not specifically designed for the continual learning setup.
2.

EWC (Kirkpatrick et al. 2016), which adopts elastic weight consolidation to add special $L_{2}$ regularization on parameter changes. Then, EWC uses Fisher information to measure the parameter importance to old tasks, and slow down the update of those parameters important to old tasks.
3.

AGEM (Chaudhry et al. 2019), which takes the gradient on sampled memorized examples from memory as the only constraint on the optimization directions of the current task.
4.

EA-EMR (Wang et al. 2019), which maintains a memory of previous tasks to alleviate the catastrophic forgetting problem.
5.

MLLRE (Obamuyide and Vlachos 2019), which leverages meta-learning to improve the usage efficiency of training instances.
6.

EMAR (Han et al. 2020a), which introduces episodic memory activation and reconsolidation to continual relation learning.

Main Results

To evaluate the overall performance of our model CML (RQ3), we conduct experiments on the three datasets under both task division methods: unbalanced cluster-based task division and the uniform random task division.

		Continual-FewRel				Continual-SimpQ				Continual-TACRED
		$Acc_{w}$		$Acc_{a}$		$Acc_{w}$		$Acc_{a}$		$Acc_{w}$		$Acc_{a}$
Setting	Model	$Acc$	$EB$	$Acc$	$EB$	$Acc$	$EB$	$Acc$	$EB$	$Acc$	$EB$	$Acc$	$EB$
	Vanilla $\ddagger$	16.3	$\pm$ 4.10	19.7	$\pm$ 3.90	60.3	$\pm$ 2.52	58.3	$\pm$ 2.30	12.0	$\pm$ 3.21	8.7	$\pm$ 2.35
	EWC $\dagger$	27.1	$\pm$ 2.32	30.2	$\pm$ 2.10	67.2	$\pm$ 3.16	59.0	$\pm$ 2.20	14.5	$\pm$ 2.51	14.5	$\pm$ 2.90
	AGEM $\dagger$	36.1	$\pm$ 2.51	42.5	$\pm$ 2.63	77.6	$\pm$ 2.11	72.2	$\pm$ 2.72	12.5	$\pm$ 2.24	16.5	$\pm$ 2.20
	EA-EMR $\ddagger$	59.8	$\pm$ 1.50	74.8	$\pm$ 1.30	82.7	$\pm$ 0.48	86.2	$\pm$ 0.33	17.8	$\pm$ 1.01	25.4	$\pm$ 1.17
	EMAR $\dagger$	53.8	$\pm$ 1.30	68.6	$\pm$ 0.71	80.0	$\pm$ 0.83	76.9	$\pm$ 1.39	42.7	$\pm$ 2.92	52.5	$\pm$ 1.74
	MLLRE	56.8	$\pm$ 1.30	70.2	$\pm$ 0.93	84.5	$\pm$ 0.35	86.7	$\pm$ 0.46	34.4	$\pm$ 0.49	41.2	$\pm$ 1.37
Cluster	CML (ours)	60.2	$\pm$ 0.71	76.0	$\pm$ 0.24	85.6	$\pm$ 0.34	87.5	$\pm$ 0.32	44.4	$\pm$ 1.16	49.3	$\pm$ 1.01
	Vanilla $\ddagger$	19.1	$\pm$ 1.20	19.3	$\pm$ 1.30	55.0	$\pm$ 1.30	55.2	$\pm$ 1.30	10.2	$\pm$ 2.02	10.4	$\pm$ 2.31
	EWC $\dagger$	30.1	$\pm$ 1.07	30.2	$\pm$ 1.05	66.4	$\pm$ 0.81	66.7	$\pm$ 0.83	15.3	$\pm$ 1.70	15.4	$\pm$ 1.79
	AGEM $\dagger$	36.9	$\pm$ 0.80	37.0	$\pm$ 0.83	76.4	$\pm$ 1.02	76.7	$\pm$ 1.01	13.4	$\pm$ 1.47	14.3	$\pm$ 1.62
	EA-EMR $\ddagger$	61.4	$\pm$ 0.81	61.6	$\pm$ 0.76	83.1	$\pm$ 0.41	83.2	$\pm$ 0.47	27.3	$\pm$ 1.01	30.3	$\pm$ 0.70
	EMAR $\dagger$	62.7	$\pm$ 0.63	62.8	$\pm$ 0.62	82.4	$\pm$ 0.86	84.0	$\pm$ 0.78	45.1	$\pm$ 1.48	46.4	$\pm$ 2.00
	MLLRE	59.8	$\pm$ 0.91	59.8	$\pm$ 0.94	85.2	$\pm$ 0.25	85.5	$\pm$ 0.31	36.4	$\pm$ 0.66	38.0	$\pm$ 0.58
Random	CML (ours)	62.9	$\pm$ 0.62	63.0	$\pm$ 0.59	86.5	$\pm$ 0.22	86.9	$\pm$ 0.28	43.7	$\pm$ 0.83	45.3	$\pm$ 0.72

Table 1: The average accuracy

Acc_{a}

and whole accuracy

Acc_{w}

with error bounds by 0.95 confidence, on the test sets of observed tasks at the final time step, where

\dagger

and

\ddagger

indicate the result generated from the source code provided by (Han et al. 2020a)¹¹1https://github.com/thunlp/ContinualRE and (Wang et al. 2019)²²2https://github.com/hongwang600/ respectively.

Train		100		200	all
Memory		25	50	50	50
EA-EMR	$Acc_{a}$	70.7	75.5	74.8	73.9
EA-EMR	$Acc_{w}$	53.2	57.4	59.8	59.6
MLLRE	$Acc_{a}$	68.4	72.1	70.2	51.0
MLLRE	$Acc_{w}$	51.9	57.8	56.8	47.3
EMAR	$Acc_{a}$	60.1	66.7	68.6	74.1
EMAR	$Acc_{w}$	43.7	51.2	53.8	57.7
CML	$Acc_{a}$	73.6	76.4	76.0	58.0
CML	$Acc_{w}$	54.7	60.3	60.2	49.1

Table 2: Experimental results on the impact of the amount and ratio of memory on model performance over Continual-FewRel.

The following observations can be made from Table 2. (i) Our model CML achieves the best $Acc_{w}$ and $Acc_{a}$ in both settings and on the three datasets in the majority of cases. (ii) Specifically, CML achieves the best $Acc_{w}$ and $Acc_{a}$ in the two larger datasets Continual-FewRel and Continual-SimpQ. (iii) CML obtains the lowest error bounds $EB$ in the majority of cases, demonstrating better stability and lower order-sensitivity. (iv) The two task division methods produce the most prominent $Arcc_{a}$ differences on Continual-Fewrel for CML. when the data is evenly distributed (i.e. Random), CML’s $Acc_{a}$ is significantly reduced to be almost equal to $Acc_{w}$ (from 76.0 to 63.0). On the other two datasets, the performance difference is much less noticeable.

Although the three metrics $Acc_{a}$ , $Acc_{w}$ , and $EB$ are good measures of the overall model performance, they do not provide task-specific insights, which we will further discuss in the following subsection.

Analysis of Unbalanced Forgetting

We designed another experiment to better understand the reason for catastrophic forgetting and order-sensitivity (RQ1). In this experiment, each task is assigned a fixed ID. Starting with an initial “run” of tasks $0,1,\ldots,9$ , we test model performance on ten different runs generated by the cyclic shift of the initial run. The results of EA-EMR on Continual-Fewrel are summarised in Table 3 (See Appendix C for the result for EMAR, MLLRE, and CML.)

$taskID$	$\mathcal{T}_{0}$	$\mathcal{T}_{1}$	$\mathcal{T}_{2}$	$\mathcal{T}_{3}$	$\mathcal{T}_{4}$	$\mathcal{T}_{5}$	$\mathcal{T}_{6}$	$\mathcal{T}_{7}$	$\mathcal{T}_{8}$	$\mathcal{T}_{9}$
$runID$	0	1	2	3	4	5	6	7	8	9	$Acc_{a}$	$Acc_{w}$
$\mathcal{P}_{0}$	88.1	39.0	49.9	59.9	100.0	23.8	28.2	48.5	52.6	53.9	76.8	62.7
$\mathcal{P}_{1}$	82.2	43.5	68.8	33.9	100.0	30.7	56.7	39.6	71.1	59.0	73.4	63.9
$\mathcal{P}_{2}$	83.0	49.9	66.7	75.9	100.0	26.7	57.4	74.7	58.5	64.1	72.6	61.2
$\mathcal{P}_{3}$	91.9	48.3	76.3	77.0	100.0	23.2	54.3	77.0	83.7	50.6	62.8	59.0
$\mathcal{P}_{4}$	90.4	44.1	73.6	79.6	100.0	43.5	47.5	69.6	86.7	77.6	77.7	58.4
$\mathcal{P}_{5}$	97.0	42.8	71.2	79.9	100.0	46.0	74.9	68.5	80.0	76.8	79.5	61.6
$\mathcal{P}_{6}$	98.5	79.2	69.3	75.5	100.0	52.4	82.6	88.1	88.1	80.4	74.6	58.7
$\mathcal{P}_{7}$	97.0	80.2	92.1	67.9	100.0	57.1	81.9	87.6	93.3	81.4	67.4	56.3
$\mathcal{P}_{8}$	91.9	81.5	91.9	92.7	99.3	68.0	86.1	93.4	92.6	90.8	77.4	56.1
$\mathcal{P}_{9}$	100.0	89.7	97.4	96.0	100.0	82.1	91.9	94.7	99.3	93.1	77.7	58.7
$\mu$	92.0	59.8	75.7	73.8	99.9	45.4	66.2	74.2	80.6	72.8	74.0	59.7
$\sigma$	6.25	20.06	14.39	17.51	0.22	19.92	20.41	18.50	15.35	14.99	5.26	2.61

Table 3: A case study of EA-EMR on the FewRel dataset, where

taskID=\{\mathcal{T}_{0},...,\mathcal{T}_{9}\}

represents the fixed ID of each task,

runID=\{0,...,9\}

represents the runs with loop offset from 0 to 9,

Position={\mathcal{P}_{0},...,\mathcal{P}_{9}}

represents the position of a task in a run.

\mu

and

\delta

are the mean and variance of accuracy (read vertically) respectively. Different runs are distinguished by color in the table (read diagnally), and the same color represents the same round of experiment, for example, when

\mathcal{T}_{0}

is at

\mathcal{P}_{1}

and

runID=9

, we can obtain the

acc_{0}^{1}=82.2\%

at the final time step, and

Acc_{a}=76.8\%

Acc_{w}=62.7\%

. See Appendix C for such result of EMAR, MLLRE and the proposed CML.

As shown in Table 3, comparing the results in the rows of $\mathcal{P}_{0}$ and $\mathcal{P}_{9}$ of each task, we can observe that most tasks see a significant drop in accuracy from $\mathcal{P}_{9}$ (when the task is the last one seen by the model) to $\mathcal{P}_{0}$ (when the task is the first one seen by the model), indicating that they suffer from catastrophic forgetting.

Intuitively, forgetting on a task reflects an increase in empirical error. We hypothesized that this is most likely due to frequent replay of limited task-related memory, in other words, over-fitting on memory. We tested this hypothesis by adjusting the ratio of training data to memory. Table 2 shows that when memory size is fixed (e.g. 50), more training data (i.e. all vs 200 vs 100) results in poorer performance for all models except EMAR, indicating the model is over-fitting on memory. This is due to a higher replay frequency of the memory. When we fix the ratio of the training data to memory (e.g. 100:25 and 200:50), model performance conforms to the general rule of better performance with more data. Thus, the results shown in Table 2 support our hypothesis.

When reading each column in Table 3 separately, we find that the forgetting rate of each task is different, which cannot be solely explained by the issue of over-fitting on memory. For example, the model performance on $\mathcal{T}_{4}$ and $\mathcal{T}_{0}$ is exactly the same (100%) when they appear at the last position $\mathcal{P}_{9}$ . However, performance on $\mathcal{T}_{4}$ does not degrade with the advance of its position. On the other hand, we can observe a decreasing trend of performance on $\mathcal{T}_{0}$ as its position moves back.

Moreover, we observe that order-sensitivity may be related to the difficulty of the earlier task in a run, where difficulty refers to a task’s tendency of being more easily forgotten by the model. For instance, we can observe that the final $Acc_{a}$ of the $runID=4$ experiment (in dark-orange) is the highest, whereas the final $Acc_{a}$ of the $runID=6$ experiment (in light orange) is the lowest. Also, among all tasks at position $\mathcal{P}_{0}$ , task $\mathcal{T}_{4}$ has the highest accuracy while task $\mathcal{T}_{6}$ has the lowest.

Both of the above observations may be explained by task difficulty, which we further study in the next subsection.

		$\mathcal{T}_{0}$	$\mathcal{T}_{1}$	$\mathcal{T}_{2}$	$\mathcal{T}_{3}$	$\mathcal{T}_{4}$	$\mathcal{T}_{5}$	$\mathcal{T}_{6}$	$\mathcal{T}_{7}$	$\mathcal{T}_{8}$	$\mathcal{T}_{9}$	$PCC_{s}$
$D_{prior}$		0.121	0.141	0.168	0.054	0.035	0.186	0.112	0.152	0.146	0.137	-
$D_{post}$	EA-EMR	0.060	0.098	0.061	0.028	0.001	0.137	0.139	0.060	0.044	0.051	0.559
	MLLRE	0.022	0.078	0.091	0.085	-0.004	0.147	0.060	0.064	0.069	0.075	0.667
	EMAR	0.036	0.016	0.027	0.007	0.006	0.016	0.008	0.026	0.020	0.005	0.499
	CML	-0.002	0.108	0.051	0.065	-0.002	0.113	0.108	0.046	0.070	0.068	0.470

Table 4: Prior difficulty and posterior difficulty of each task, where

PCCs

indicates the Pearson correlation coefficient between the estimated prior difficulty by Eq.(3) (i.e.,

D_{prior}

) and the average forgetting rate of each model (i.e.,

D_{post}

Analysis of Task Difficulty

In this section, we present the qualitative and quantitative analyses in order to better understand the difficulty of tasks (RQ2). We choose EA-EMR as the case study on the Continual-FewRel dataset with the tasks constructed through clustering.

Qualitative Analysis.

Refer to caption — Figure 1: The t-SNE (Donahue et al. 2014) visualization of the encoding of relations generated by the hidden layer in EA-EMR over Continual-FewRel, where each point represents a relationship, each color represents a task, and the position anchored by the label is the center of the task.

Figure 1 shows the t-SNE visualization of the relations, where nodes represent relations, colors represent the tasks, and the distance is calculated from the hidden layer in EA-EMR.

As can be seen from the figure, tasks $\mathcal{T}_{4}$ and $\mathcal{T}_{8}$ have only one relation each, and that the relation in $\mathcal{T}_{4}$ is far away from the others but the relation $\mathcal{T}_{8}$ is much closer. The difference in their distances from the other relations may explain the difference in their task-specific performance in Table 3, where $\mathcal{T}_{4}$ does not suffer from catastrophic forgetting but $\mathcal{T}_{8}$ does. In other words, $\mathcal{T}_{4}$ is easy whereas $\mathcal{T}_{8}$ is more difficult.

Similarly, we can observe that $\mathcal{T}_{3}$ (colored in black) contains only two relations, and it overlaps significantly with the other tasks. Therefore, catastrophic forgetting is more serious on $\mathcal{T}_{3}$ , i.e., $\mathcal{T}_{3}$ is difficult. Finally, we can observe that tasks $\mathcal{T}_{1}$ and $\mathcal{T}_{6}$ are difficult tasks, as their centroids are very close. They both suffer from serious catastrophic forgetting as can be seen from Table 3.

Therefore, in continual learning, the difficulty of a task $\mathcal{T}$ may be characterized by the correlation between $\mathcal{T}$ and the other observed tasks. In the relation extraction scenario, we define this correlation as the semantic similarity between the relations.

Quantitative Analysis.

Based on the above analysis, we hypothesize that a model could achieve better performance if we can measure the similarity of the current task and previous tasks and guide the model to distinguish similar tasks.

For ease of expression, we denote the measure difficulty of each task $\mathcal{T}_{i}$ as the prior difficulty $D_{prior}$ , and denote the average forgetting rate $Fr^{i}_{avg}$ as the posterior difficulty $D_{post}$ . The prior difficulty $D_{prior}$ is the estimated value of knowledge graph embedding, and the posterior difficulty is related to the performance of each model.

Specifically, we use our embedding-based task difficulty function defined in Eq. (3) as the prior difficulty $D_{prior}$ , and average forgetting rate defined in Eq. (6) as the posterior difficulty. Table 4 shows the correlation between the prior difficulty and the posterior difficulty, and thus offers evidence of the effectiveness of our representation learning method.

We can learn from Table 4 four main conclusions: (i) The $PCCs$ of the three models demonstrate that the semantic embedding-based prior difficulty does positive to the forgetting rate indicating which tasks are difficult for a model. (ii) Comparing the $PCCs$ values of CML with the other three methods, our proposed CML with the knowledge-based curriculum does alleviate the interference between similar tasks as it achieves the lowest forgetting rate. (iii) For tasks $\mathcal{T}_{0}$ and $\mathcal{T}_{4}$ in CML, the average forgetting rate is negative, which means that the accuracy of these tasks decreases as they move towards the end. (iv) Based on the analysis of Table 2, Table 3 (See Appendix C for the other three relevant tables), it is evident that CML improves the overall accuracy and effectively alleviates the order-sensitivity problem. We note that the improvements do come at a cost of moderately decreased accuracy on simple tasks such as $\mathcal{T}_{0}$ and $\mathcal{T}_{4}$ .

6 Conclusion

In this paper, we proposed a novel curriculum-meta learning method to tackle the catastrophic forgetting and order-sensitivity issues in continual relation learning. The construction of the curriculum is based on the notion of task difficulty, which is defined through a novel relation representation learning method that learns from the distribution of domain and range types of relations. Our comprehensive experiments on the three benchmark datasets Continual-FewRel, Continual-SimpleQuestions and Continual-Tacred show that our proposed method outperforms the state-of-the-art models, and is less prone to catastrophic forgetting and less order-sensitive. In future, we will investigate an end-to-end curriculum model and a new dynamic difficulty measurement based on the framework presented in this paper.

Acknowledgments

Research in this paper was partially supported by the National Key Research and Development Program of China under grants (2018YFC0830200, 2017YFB1002801), the Natural Science Foundation of China grants (U1736204), the Judicial Big Data Research Centre, School of Law at Southeast University.

References

Aljundi et al. (2018) Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; and Tuytelaars, T. 2018. Memory Aware Synapses: Learning What (not) to Forget. In Proceedings of ECCV, 144–161.
Bengio et al. (2009) Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of ICML, 41–48.
Bordes et al. (2015) Bordes, A.; Usunier, N.; Chopra, S.; and Weston, J. 2015. Large-scale Simple Question Answering with Memory Networks. CoRR abs/1506.02075.
Chaudhry et al. (2019) Chaudhry, A.; Ranzato, M.; Rohrbach, M.; and Elhoseiny, M. 2019. Efficient Lifelong Learning with A-GEM. In Proceedings of ICLR.
Chen et al. (2006) Chen, J.; Ji, D.; Tan, C. L.; and Niu, Z.-Y. 2006. Relation extraction using label propagation based semi-supervised learning. In Proceedings of ACL, 129–136.
Chen, Goodfellow, and Shlens (2016) Chen, T.; Goodfellow, I. J.; and Shlens, J. 2016. Net2Net: Accelerating Learning via Knowledge Transfer. In Proceedings of ICLR.
Chen et al. (2019) Chen, W.; Zhu, H.; Han, X.; Liu, Z.; and Sun, M. 2019. Quantifying Similarity between Relations with Fact Distribution. In Proceedings of ACL, 2882–2894.
Chen and Liu (2016) Chen, Z.; and Liu, B. 2016. Lifelong Machine Learning.
de Masson d’Autume et al. (2019) de Masson d’Autume, C.; Ruder, S.; Kong, L.; and Yogatama, D. 2019. Episodic Memory in Lifelong Language Learning. In Proceedings of NeurIPS, 13122–13131.
Donahue et al. (2014) Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proceedings of ICML, 647–655.
Han et al. (2020a) Han, X.; Dai, Y.; Gao, T.; Lin, Y.; Liu, Z.; Li, P.; Sun, M.; and Zhou, J. 2020a. Continual Relation Learning via Episodic Memory Activation and Reconsolidation. In Proceedings of ACL, 6429–6440.
Han et al. (2020b) Han, X.; Gao, T.; Lin, Y.; Peng, H.; Yang, Y.; Xiao, C.; Liu, Z.; Li, P.; Sun, M.; and Zhou, J. 2020b. More Data, More Relations, More Context and More Openness: A Review and Outlook for Relation Extraction. CoRR abs/2004.03186.
Han et al. (2018) Han, X.; Zhu, H.; Yu, P.; Wang, Z.; Yao, Y.; Liu, Z.; and Sun, M. 2018. FewRel: A Large-Scale Supervised Few-shot Relation Classification Dataset with State-of-the-Art Evaluation. In Proceedings of EMNLP, 4803–4809.
Kirkpatrick et al. (2016) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N. C.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2016. Overcoming catastrophic forgetting in neural networks. CoRR abs/1612.00796.
Lesort et al. (2019) Lesort, T.; Caselles-Dupré, H.; Ortiz, M. G.; Stoian, A.; and Filliat, D. 2019. Generative Models from the perspective of Continual Learning. In Proceedings of IJCNN, 1–8.
Lin et al. (2019) Lin, H.; Yan, J.; Qu, M.; and Ren, X. 2019. Learning Dual Retrieval Module for Semi-supervised Relation Extraction. In Proceedings of WWW, 1073–1083.
Lin et al. (2016) Lin, Y.; Shen, S.; Liu, Z.; Luan, H.; and Sun, M. 2016. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of ACL, 2124–2133.
Liu et al. (2013) Liu, C.; Sun, W.; Chao, W.; and Che, W. 2013. Convolution Neural Network for Relation Extraction. In Motoda, H.; Wu, Z.; Cao, L.; Zaïane, O. R.; Yao, M.; and Wang, W., eds., Proceedings of ADMA, 231–242.
Liu et al. (2018) Liu, X.; Masana, M.; Herranz, L.; van de Weijer, J.; López, A. M.; and Bagdanov, A. D. 2018. Rotate your Networks: Better Weight Consolidation and Less Catastrophic Forgetting. In Proceedings of ICPR, 2262–2268.
Lopez-Paz and Ranzato (2017) Lopez-Paz, D.; and Ranzato, M. 2017. Gradient Episodic Memory for Continual Learning. In Proceedings of NeurIPS, 6467–6476.
Mahdisoltani, Biega, and Suchanek (2015) Mahdisoltani, F.; Biega, J.; and Suchanek, F. M. 2015. YAGO3: A Knowledge Base from Multilingual Wikipedias. In Proceedings of CIDR, 1–11.
Mallya, Davis, and Lazebnik (2018) Mallya, A.; Davis, D.; and Lazebnik, S. 2018. Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights. In Proceedings of ECCV, 72–88.
Marcheggiani and Titov (2016) Marcheggiani, D.; and Titov, I. 2016. Discrete-State Variational Autoencoders for Joint Discovery and Factorization of Relations. Transactions of the Association for Computational Linguistics 231–244.
Miwa and Bansal (2016) Miwa, M.; and Bansal, M. 2016. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of ACL, 1105–1116.
Nichol, Achiam, and Schulman (2018) Nichol, A.; Achiam, J.; and Schulman, J. 2018. On First-Order Meta-Learning Algorithms. CoRR abs/1803.02999.
Obamuyide and Vlachos (2019) Obamuyide, A.; and Vlachos, A. 2019. Meta-Learning Improves Lifelong Relation Extraction. In Proceedings of RepL4NLP@ACL, 224–229.
Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global Vectors for Word Representation. In Proceedings of EMNLP, 1532–1543.
Rebuffi et al. (2017) Rebuffi, S.; Kolesnikov, A.; Sperl, G.; and Lampert, C. H. 2017. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of CVPR, 5533–5542.
Ritter, Botev, and Barber (2018) Ritter, H.; Botev, A.; and Barber, D. 2018. Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting. In Proceedings of NeurIPS, 3742–3752.
Rusu et al. (2016) Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive Neural Networks. CoRR abs/1606.04671.
Shin et al. (2017) Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual Learning with Deep Generative Replay. In Proceedings of NeurIPS, 2990–2999.
Sun, Grishman, and Sekine (2011) Sun, A.; Grishman, R.; and Sekine, S. 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of NAACL-HLT, 521–529.
Sun, Ho, and Lee (2020) Sun, F.; Ho, C.; and Lee, H. 2020. LAMOL: LAnguage MOdeling for Lifelong Language Learning. In Proceedings of ICLR.
Thrun (1998) Thrun, S. 1998. Lifelong Learning Algorithms. In Learning to Learn, 181–209.
Turki et al. (2019) Turki, H.; Shafee, T.; Taieb, M. A. H.; Aouicha, M. B.; Vrandecic, D.; Das, D.; and Hamdi, H. 2019. Wikidata: A large-scale collaborative ontological medical database. J. Biomed. Informatics 99.
Wang et al. (2019) Wang, H.; Xiong, W.; Yu, M.; Guo, X.; Chang, S.; and Wang, W. Y. 2019. Sentence Embedding Alignment for Lifelong Relation Extraction. In Proceedings of NAACL-HLT, 796–806.
Yao et al. (2011) Yao, L.; Haghighi, A.; Riedel, S.; and McCallum, A. 2011. Structured Relation Discovery using Generative Models. In Proceedings of EMNLP, 1456–1466.
Yoon et al. (2020) Yoon, J.; Kim, S.; Yang, E.; and Hwang, S. J. 2020. Scalable and Order-robust Continual Learning with Additive Parameter Decomposition. arXiv: Learning .
Yu et al. (2017) Yu, M.; Yin, W.; Hasan, K. S.; dos Santos, C. N.; Xiang, B.; and Zhou, B. 2017. Improved Neural Relation Detection for Knowledge Base Question Answering. In Proceedings of ACL, 571–581.
Zelenko, Aone, and Richardella (2002) Zelenko, D.; Aone, C.; and Richardella, A. 2002. Kernel Methods for Relation Extraction. In Proceedings of EMNLP, 71–78.
Zeng et al. (2014) Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; and Zhao, J. 2014. Relation Classification via Convolutional Deep Neural Network. In Proceeding of COLING, 2335–2344.
Zenke, Poole, and Ganguli (2017) Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual Learning through Synaptic Intelligence. In Proceedings of ICMR, 3987–3995.
Zhang et al. (2017) Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, C. D. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. In Proceedings of EMNLP, 35–45.