
Incremental Few-shot Text Classification with Multi-round New Classes:
Formulation, Dataset and System

First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Abstract

Text classification is usually studied by labeling natural language texts with relevant categories from a predefined set. In the real world, new classes may keep challenging the existing system with limited labeled data, and the system should be intelligent enough to recognize upcoming new classes from a few examples. In this work, we define a new task in the NLP domain, "incremental few-shot text classification", where the system first handles some base classes with rich annotations, then copes with multiple rounds of new classes. In each round, a batch of new classes arrives with a few labeled examples per class. This new task poses two major challenges: (i) for the learning process, the system should incrementally learn new classes round by round without re-training on the examples of preceding classes; (ii) for the performance, the system should perform well on new classes without much loss on preceding classes. In addition to formalizing the new task, we also release two benchmark datasets in the incremental few-shot setting: intent classification and relation classification. Moreover, we propose an approach, Entailment, which shows promise for solving this novel problem.

1 Introduction

Text classification has achieved great success in the past decades with the development of deep learning techniques kowsari2019text. However, decent performance relies heavily on the quality and quantity of the training data. Recently, few-shot text classification yu2018diverse has attracted increasing attention from researchers, since large-scale labeled data for new classes is rarely available in reality.

Typically, few-shot text classification is formulated as follows: the system first sees a set of base classes $C_b$ that have a large number of labeled examples, then a group of new classes $C_n$ is provided with $k$ examples per class. For a testing instance, the system is required to search for its label in the space of $C_b \cup C_n$ or merely $C_n$.

However, if we think about few-shot text classification in real scenarios, new challenges arise that make it worth further exploration. First, taking a bank's customer service system as an example, queries with new intents continuously appear (e.g., in a sequence of rounds) without enough labeled data. The system should be able to keep learning and recognizing new intents round by round; for each query, it needs to pick the most appropriate intent from the incrementally growing label space or return "none of them". Second, existing incremental training strategies suffer from catastrophic forgetting mccloskey1989catastrophic: systems tend to forget knowledge learned in the past when continually fine-tuning on new classes, which leads to a significant performance drop on base classes. For a real-world application such as the aforementioned customer service system, the system is expected to perform well on all classes, regardless of when a class was added or how many examples it has.

In this work, our contribution lies in three aspects. First, we formally define the problem of "incremental few-shot text classification". In our definition, the system is first provided with some base classes $C_b$ with rich annotations, then $m$ rounds of new classes (i.e., $C_n^1, C_n^2, \cdots, C_n^m$) arrive sequentially. Each $C_n^i$ ($i = 1, \cdots, m$) consists of a group of new classes, and each class has $k$ labeled examples ($k$ is in the range [1, 5] and varies for different classes). For testing, we require the system to either select the best class from $C_b \cup C_n^1 \cup C_n^2 \cup \cdots \cup C_n^m$ or output "none of them", which means no existing class applies to the input. As far as we know, this is the first work in the NLP community that studies text classification with an incrementally growing set of classes. A few papers work on incremental few-shot classification in the field of computer vision DBLPRenLFZ19, but they only consider one round of new classes; it is difficult to evaluate a system's ability for incremental learning without multiple rounds of new classes.

Furthermore, we consider an extreme case where no base classes are available, which we call incremental few-shot text classification without base classes. This situation arises whenever a system is built from scratch: in real-world scenarios, base classes with rich annotations might not be available at the beginning. In this setting, we need to solve the cold-start problem, where no initial annotations are available. All previous few-shot learning models DBLPSnellSZ17; DBLPGidarisK18; DBLPRenLFZ19 fail to solve this problem, since they rely on large-scale labeled data for base classes to train a robust system.

Second, to evaluate the aforementioned two settings, incremental few-shot text classification with and without base classes, we release a benchmark dataset, IFS-Intent. This benchmark simulates a task like the bank's customer service just mentioned. Another important feature of our benchmark is that we do not provide dev sets. Existing systems are commonly evaluated on a dev set to choose the best training model. We argue that in real-world (incremental) few-shot applications, we cannot expect extra labeled data beyond the $k$ examples. This is in line with the observation in DBLP07676. If a system has to rely on a dev set to find the best parameters, it is not suitable for the incremental few-shot setting.

Third, we propose our approach, Entailment, to solve this new problem. Entailment models text classification as a textual entailment dagan2013recognizing problem: to figure out whether an input $x$ belongs to a class $y$, Entailment tries to infer the truth value of $y$ (i.e., a hypothesis) given $x$ (i.e., the premise). The main benefit of this formulation is that the system learns the task not only from the label-specific examples, but, more importantly, from large-scale entailment datasets. In other words, we make use of indirect supervision from textual entailment datasets to address the target few-shot task. Specifically, for each round $C_n^i$, which has $h$ new classes with $k$ examples each, we first build positive (premise, hypothesis) pairs by accompanying each input with its gold class, and negative (premise, hypothesis) pairs by accompanying the input with the other $h-1$ classes in $C_n^i$. It is worth noting that we only use the few-shot examples and label names in $C_n^i$ for the incremental training.

The current state-of-the-art work zhangdiscriminative for few-shot text classification also utilizes textual entailment for pre-training, but it ignores the information in the class labels. Instead of inferring the truth value of a class $y$ conditioned on the input text $x$, it infers whether two text inputs ($x_i$, $x_j$) are in the same class. This increases the computation cost, since the number of examples is much larger than the number of labels. Moreover, their final predictions are made by searching for the nearest neighbor among all the examples, so the performance depends heavily on the examples chosen; poorly chosen examples lead to poor results.

2 Related Work

Incremental few-shot learning.

As far as we know, there is no prior work in the NLP domain that studies incremental few-shot text classification, so in this section we mainly introduce work from the computer vision domain. As mentioned before, these works assume that a single round of new classes $C_n$ is appended to the base classes $C_b$. Generally, they learn class representations for classification, and different approaches differ in the way of representing base classes and new classes. Hereafter, we use $W_b$ and $W_n$ as the representations of $C_b$ and $C_n$, respectively.

DBLPSnellSZ17 proposes the Prototypical Network, in which both $W_b$ and $W_n$ are stored as the average embedding of the few-shot support examples of a class. Although the Prototypical Network was not designed for incremental few-shot learning, it can easily be adapted to the incremental setting by providing representations for all the classes: it trains a nearest-neighbor algorithm on the base classes and tests directly on the union of base and new classes. DBLPQiBL18 proposes an "imprinting" mechanism: the base representations $W_b$ are learned regularly through supervised pre-training (e.g., the weight matrix in a softmax classifier), and $W_n$ are computed as averaged representations, as in the Prototypical Network.
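
For concreteness, the following is a minimal PyTorch sketch (our illustration, not code from the cited works) of prototype-based classification: a class representation is the mean of its support embeddings, and prediction picks the nearest prototype.

    import torch

    def class_prototype(support_embeddings: torch.Tensor) -> torch.Tensor:
        # support_embeddings: (k, d) embeddings of a class's k support examples
        return support_embeddings.mean(dim=0)

    def nearest_prototype(query: torch.Tensor, prototypes: torch.Tensor) -> int:
        # query: (d,); prototypes: (num_classes, d); returns the predicted class index
        distances = torch.cdist(query.unsqueeze(0), prototypes).squeeze(0)
        return int(distances.argmin())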

In DBLPGidarisK18, the base representations $W_b$ are learned through supervised pre-training. The representation of the $i$-th novel class, $W_{n,i}$, comes from two origins: (i) the prototypical average, $w_{avg}$; (ii) an attention-weighted sum over base representations, $w_{att}$. Namely, $W_{n,i} = \phi_{avg} \odot w_{avg} + \phi_{att} \odot w_{att}$, where $\phi_{avg}$ and $\phi_{att}$ are learnable weight vectors. In the few-shot training stage, the original base classes $C_b$ are split into "new base classes" and "fake novel classes" for each episode, and the loss is computed on predictions for both. In testing, $W_n$, the representations of the novel classes, are constructed from the $k$ examples and $W_b$.
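
A rough sketch of this combination (again our illustration; the exact attention mechanism in DBLPGidarisK18 differs in its details) could look like:

    import torch
    import torch.nn.functional as F

    def novel_class_weight(w_avg, w_att, phi_avg, phi_att):
        # W_{n,i} = phi_avg (*) w_avg + phi_att (*) w_att, (*) elementwise
        return phi_avg * w_avg + phi_att * w_att

    def attention_weighted_base(support_emb, W_b):
        # support_emb: (k, d) few-shot embeddings; W_b: (num_base, d) base weights
        att = torch.softmax(F.normalize(support_emb, dim=-1)
                            @ F.normalize(W_b, dim=-1).T, dim=-1)  # (k, num_base)
        return (att @ W_b).mean(dim=0)  # w_att: attention-weighted sum over W_b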

In DBLPRenLFZ19, both $W_b$ and $W_n$ are learned through supervised training: $W_b$ are classifier parameters pre-trained on the base classes, and $W_n$ are classifier parameters learned on the new classes. During training, the support set and the query set are constructed differently for new classes: the support set consists of examples only from new classes, while the query set contains examples from both new and base classes (because the training goal is to maximize the performance on all classes). Training in this work has two phases. The first phase is few-shot episode training, which learns $W_n$ and optimizes the performance on the support set; the second phase (called meta-learning training) optimizes the performance on the query set, with a regularization term enforcing $W_n$ to be as close as possible to the attention-weighted sum of $W_b$.

To summarize, compared with DBLPSnellSZ17 and DBLPQiBL18, both DBLPGidarisK18 and DBLPRenLFZ19 build connections between the representations of base classes and new classes. However, these methods cannot be directly applied to our problem for the following reasons. (i) Despite claims in some literature of dealing with incremental or dynamic few-shot problems, they only consider a single round of new classes DBLPQiBL18; DBLPGidarisK18; DBLPRenLFZ19; it is unclear whether these systems can maintain their performance when multiple rounds of new classes are considered. (ii) During training on the new classes, they often rely on extra labeled data beyond the $k$ examples, such as the query set in DBLPRenLFZ19. (iii) Unlike their setting, incremental few-shot text classification has an extra "out-of-distribution" label: it is not guaranteed that an input, such as a customer's utterance, always falls into the range of seen labels.

Using textual entailment for text classification.

zhangdiscriminative is a state-of-the-art work on few-shot text classification. It proposes a discriminative nearest-neighbor classification (DNNC) model that compares whether two examples are in the same class. A matching model $S(x_i, x_j)$ is trained as a binary classifier, such that $S(x_i, x_j)$ is close to 1.0 if $x_i$ and $x_j$ belong to the same class, and close to 0.0 otherwise; this model can be pre-trained on large-scale textual entailment datasets. Given a test query $x$, they compare it with all the stored examples, and the final prediction is made by searching for the nearest neighbor, i.e., the example $x_i$ with the highest matching score $S(x, x_i)$. As mentioned before, the computation cost is high and the performance depends heavily on the quality of the chosen examples.

Moreover, comparing whether two examples are in the same class is different from textual entailment, in which a human reads a premise to infer whether a hypothesis is true. The fact that two examples are in the same class does not mean they entail each other, so DNNC cannot fully utilize the pre-trained entailment model. Instead, our proposed model, Entailment, checks whether the label is entailed by a given utterance, which is much more efficient and makes fuller use of the pre-trained entailment model.

DBLPYinHR19 is another work that utilizes textual entailment, for zero-shot text classification. They convert zero-shot text classification into the problem of filling a label candidate into a hypothesis. For example, they combine "emotion" labels with the question "this text expresses ?", and ask the model whether this hypothesis is true, given the text. This work focuses on zero-shot learning and needs to design different questions for different labels.

3 Problem Formulation

In this section, we give a formal description of the problem of "incremental few-shot text classification" with and without base classes.

Training data.

For the setting with base classes, the system is provided with a set of base classes $C_b = \{C_{b,1}, C_{b,2}, \cdots, C_{b,g}\}$; each base class $C_{b,i}$ ($i = 1, \cdots, g$) has $l$ labeled examples, where $l$ is usually a large number, like several hundred or several thousand. For the setting without base classes, $C_b$ is ignored. Both settings have $m$ rounds of new classes arriving sequentially: $\{C_n^1, \cdots, C_n^m\}$. Each round $C_n^i$ has $h$ new classes, namely $C_n^i = \{C_{n,1}^i, \cdots, C_{n,h}^i\}$. Each new class has only $k$ examples ($k \in [1,5]$); the value of $k$ is not fixed and may differ across new classes in the same round, i.e., $k_{C_{n,p}^i} \neq k_{C_{n,q}^i}$ is allowed.

We create the multi-round setting since it evaluates the system more precisely while it learns a long sequence of new classes. In each round, we set $k \in [1,5]$ and allow the flexibility that different classes in the same round have different $k$. This setting is more in line with reality, where we can only collect a handful of examples for upcoming classes and the number of examples cannot be guaranteed.
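
To make the setup concrete, here is a minimal sketch of the data layout we assume (the class names and utterances are hypothetical):

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Round:
        # class name -> its few-shot utterances; k = len(list) may differ per class
        classes: Dict[str, List[str]]

    @dataclass
    class IncrementalTask:
        base: Dict[str, List[str]]  # base classes with l examples each; {} if absent
        rounds: List[Round]         # m rounds arriving sequentially

    task = IncrementalTask(
        base={"check balance": ["what is my balance?"] * 200},
        rounds=[Round({"get physical card": ["send me a new card",
                                             "i need a physical card"],   # k = 2
                       "lost or stolen card": ["my card was stolen"]})],  # k = 1
    )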

Development data.

To simulate real scenarios in the experiments, we assume that only the $k$ examples are available for the new classes during the incremental training. Thus, our formulation does not provide any development set to help select the best model; hyper-parameters should be selected based on experience or related tasks.

Testing data.

To evaluate the system, the test data consists of examples across all the classes. For the setting with base classes, the potential label space is $C_b \cup C_n^1 \cup \cdots \cup C_n^m \cup C_o$. For the setting without base classes, $C_b$ is excluded and we only search among the classes in $C_n^1 \cup \cdots \cup C_n^m \cup C_o$. $C_o$ is an extra out-of-distribution (OOD) class whose examples fall outside all the seen classes. It gives us a chance to check the system's ability to detect instances that reject all the known classes, which is crucial for an open-set problem like incremental learning, since there are always examples from upcoming classes that do not belong to any existing class.

Requirements.

(i) For the training of the $i$-th round $C_n^i$, the system can only access the newly added few-shot examples of this round and the names of the preceding classes in $C_b \cup C_n^1 \cup \cdots \cup C_n^{i-1}$. The system is not allowed to re-train on the (full or partial) examples of preceding classes. (ii) For the evaluation, we care about the performance on different types of classes, including base classes, different rounds of new classes, and the OOD classes in $C_o$. We expect a system that can continuously recognize new classes well from few-shot examples; in the meantime, the performance drop on preceding classes is also considered, and a system showing more severe catastrophic forgetting is less preferred.

4 Our Model: Entailment

Our approach Entailment casts the text classification problem into textual entailment: the input text acts as a premise, and the class name, such as "open a bank account" in intent detection, acts as a hypothesis. Asking whether the input belongs to a class is then equivalent to asking whether the hypothesis is true given the premise. Transforming text classification into entailment brings two benefits. First, we can make use of indirect supervision from large-scale entailment datasets williams2018broad to benefit the few-shot setting. Second, it enables us to utilize not only the few-shot examples but also the information in the class names. Typical text classification approaches treat classes as mere indices; in fact, class names usually contain informative signals.

Entailment pairs.

To transfer the text classification problem into textual entailment, we construct positive and negative entailment pairs for the training. Positive entailment pairs ($x_i$, $y_i$) are constructed from an utterance $x_i$ and its gold label name $y_i$, where $y_i \in C_b$ for base classes and $y_i \in C_n^i$ for new classes. Negative entailment pairs consist of ($x_i$, $y_j$), where $y_j$ is an incorrect label in the current round: for base classes, $y_j \in C_b$ but $y_j \neq y_i$; for new classes, $y_j \in C_n^i$ but $y_j \neq y_i$.

In zhangdiscriminative, by comparison, entailment pairs are constructed from two utterances in the same round: ($x_i$, $x_j$) is a positive pair if they belong to the same class, and a negative pair otherwise. To explore the potential of different combinations, we also propose a hybrid entailment model that uses both (utterance, label) pairs ($x_i$, $y_i$) and (utterance, utterance) pairs ($x_i$, $x_j$). This hybrid model is trained with entailment pairs from both our proposed model and zhangdiscriminative.

In the setting with $g$ base classes and $l$ examples per base class, $g \cdot l$ positive entailment pairs and $(g-1) \cdot g \cdot l$ negative pairs are generated. For a round $C_n^i$ containing $h$ new classes with $k$ examples each, there are $h \cdot k$ positive entailment pairs and $(h-1) \cdot h \cdot k$ negative entailment pairs. For simplicity, we use the same $k$ for all new classes here; in the real datasets, different new classes may have different numbers of few-shot examples, in which case the number of generated pairs changes accordingly.
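
The pair construction can be sketched as follows (a minimal illustration; the actual preprocessing may differ in details):

    def build_entailment_pairs(round_classes):
        # round_classes: class name -> list of its k utterances
        pairs = []  # (premise, hypothesis, label), label 1 = entail, 0 = not
        names = list(round_classes)
        for gold, utterances in round_classes.items():
            for x in utterances:
                pairs.append((x, gold, 1))  # one positive pair per utterance
                pairs.extend((x, y, 0) for y in names if y != gold)  # h-1 negatives
        return pairs
    # With h classes and k examples each: h*k positives, (h-1)*h*k negatives.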

For each entailment pair ($x$, $y$), positive or negative, we concatenate the utterance $x$ with the label $y$ and feed the result into the RoBERTa liu2019roberta encoder. Given an utterance $x = (X_1, X_2, \ldots, X_{T_1})$ with $T_1$ words and a label $y = (Y_1, Y_2, \ldots, Y_{T_2})$ with $T_2$ words, we add a special start-of-sequence ([CLS]) token at the beginning of the input and a special end-of-sequence ([SEP]) token at the end of each sentence. The whole input is ([CLS], $X_1$, $X_2$, …, $X_{T_1}$, [SEP], $Y_1$, $Y_2$, …, $Y_{T_2}$, [SEP]). We use the [CLS] embedding output from the RoBERTa encoder with a fully connected layer for binary textual entailment:

$h = \text{RoBERTa}(x, y)$, (1)
$p = \text{softmax}(Wh + b)$, (2)

where $h \in \mathbb{R}^d$ is the embedding of the [CLS] token, and $W \in \mathbb{R}^{2 \times d}$ and $b \in \mathbb{R}^2$ are parameters.
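
A minimal sketch of this scoring step with the Huggingface Transformers package (our illustration; the checkpoint name is a placeholder and the [not-entail, entail] label ordering is an assumption):

    import torch
    from transformers import RobertaTokenizer, RobertaForSequenceClassification

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                             num_labels=2)

    def entailment_prob(premise: str, hypothesis: str) -> float:
        # The tokenizer adds RoBERTa's start/separator tokens, which play the
        # role of the [CLS]/[SEP] markers in the input format described above.
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits  # shape (1, 2)
        return torch.softmax(logits, dim=-1)[0, 1].item()  # assumed entail index: 1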

Training strategy.

Our model is a binary classifier that can utilize the indirect supervision from textual entailment. We first pre-train the model on a large-scale entailment dataset williams2018broad. For the setting with base classes, we then fine-tune on entailment pairs from the base classes to obtain a base model, which is subsequently fine-tuned on the new classes $C_n^i$ of each round. In the setting without base classes, the model is directly fine-tuned on the entailment pairs of the new classes in each round $C_n^i$, as sketched below.
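
A sketch of the schedule; `fine_tune` is a hypothetical helper wrapping the usual training loop:

    def incremental_training(model, base_pairs, rounds_of_pairs, fine_tune):
        # MNLI-style entailment pre-training is assumed to have happened already.
        if base_pairs:                       # setting with base classes
            fine_tune(model, base_pairs)
        for round_pairs in rounds_of_pairs:  # m rounds, arriving one by one
            fine_tune(model, round_pairs)    # only this round's few-shot pairs
        return model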

Inference strategy.

After the training, we use the model to infer the class of a test input. For each input, we generate entailment pairs by pairing the input with every class except $C_o$. Each pair gets a score $\lambda \in [0, 1]$ indicating whether the input belongs to that class; $\lambda > 0.5$ indicates "yes", and "no" otherwise. If at least one class is labeled "yes", the class with the maximal $\lambda$ score is returned; otherwise, the system returns $C_o$. We choose the threshold 0.5 because entailment recognition is a binary classification problem.
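
Combined with the scoring function sketched earlier, inference reduces to the following (again a sketch):

    def predict(utterance, seen_classes, entailment_prob, threshold=0.5):
        # seen_classes: all class names accumulated so far (excluding C_o)
        scores = {y: entailment_prob(utterance, y) for y in seen_classes}
        best = max(scores, key=scores.get)
        return best if scores[best] > threshold else "OOD"  # C_o if nothing entails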

Next, we compare our model with some related systems that could potentially be applied to incremental few-shot text classification.

Entailment vs. DNNC.

DNNC zhangdiscriminative also converts the text classification problem into textual entailment, but discriminates whether two text inputs ($x_i$, $x_j$) are in the same class. As a result, for a round $C_n^i$ with originally $h \cdot k$ examples, this baseline generates $h \cdot k \cdot (k-1)$ positive pairs and $h \cdot (h-1) \cdot k^2$ negative pairs. In testing, a query needs to be compared with all the examples of all classes, so the computation cost of this baseline is much higher than that of our proposed model.
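
A worked example with hypothetical values h = 10 and k = 5 makes the gap concrete:

    h, k = 10, 5  # hypothetical round: 10 new classes, 5 shots each
    ours = (h * k, (h - 1) * h * k)               # (50, 450) pairs for Entailment
    dnnc = (h * k * (k - 1), h * (h - 1) * k**2)  # (200, 2250) pairs for DNNC
    # At test time, DNNC scores a query against all h*k = 50 stored examples,
    # while Entailment scores it against only the h = 10 class names.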

Entailment vs. Prototypical Network.

Prototypical Network DBLPSnellSZ17 tries to solve few-shot target tasks given a collection of training tasks, where the distributions of the training tasks and the target tasks are required to be similar. It uses episode training to learn a nearest-neighbor algorithm and hopes it generalizes well to the target tasks.

The few-shot learning problem solved by the Prototypical Network is slightly different from our incremental few-shot setting. In the Prototypical Network, the label space for target tasks only contains the new classes; in the incremental few-shot setting, the target label space keeps growing as new classes are added. Due to this essential distinction, applying the Prototypical Network to incremental few-shot learning is very likely to cause a performance drop on base classes when fine-tuning on new classes.

Entailment vs. Incremental few-shot approaches in computer vision.

In Related Work, we introduced some typical computer vision approaches to the incremental few-shot problem. Those methods consistently learn representations for classes and examples separately (i.e., the $W_b$ and $W_n$ in Section 2). In our model, there are no individual representation vectors for classes or examples; instead, the model learns an overall representation for the whole (input, class) pair. Our solution enables the input and the class to interact with each other, which has widely demonstrated its superiority in modeling the relation between two elements DBLP12808; zhangdiscriminative.

In addition, the approaches in computer vision mostly rely on large-scale labeled data for base classes to train a robust system. We argue that base classes with rich annotations may not be available in real-world applications. Our system, which can instead be pre-trained on entailment datasets, does not rely on base classes; this makes it applicable to more scenarios.

            C_b    C_n^1  C_n^2  C_n^3  C_n^4  C_n^5  C_o
# class       20     10     10     10     10     10      7
# train     2088     30     30     30     30     30      -
# test       800    400    400    400    400    400    280

Table 1: Statistics of the benchmark dataset IFS-Intent. C_b: base classes; {C_n^1, ..., C_n^5}: five rounds of new classes; C_o: OOD classes. Note that C_o is never used for training.

5 Experiments

5.1 Datasets

IFS-Intent.

This is our benchmark for incremental few-shot intent detection. IFS-Intent is converted from BANKING77 (https://github.com/PolyAI-LDN/task-specific-datasets) casanueva2020efficient, a single-domain intent detection dataset comprising 13,083 annotated examples over 77 intents (on average, 170 examples per intent). Each intent class is described by a short name, such as "get physical card" or "lost or stolen card".

We split the 77 intents into a base group (i.e., $C_b$), 5 rounds of new intents (i.e., $\{C_n^1, \cdots, C_n^5\}$), and a group of out-of-distribution intents (i.e., $C_o$). Each upcoming round has 10 new classes. We randomly split the 10 classes into 5 groups (each with 2 classes), then intentionally give the 5 groups different sizes of $k$-shot examples ($k \in [1,5]$), as sketched below. Detailed statistics are reported in Table 1.
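
The shot-count assignment can be sketched as follows (our illustration of the procedure; the actual random split behind the released data may differ):

    import random

    def assign_shot_counts(new_classes, seed=0):
        # new_classes: the 10 class names of one round
        rng = random.Random(seed)
        shuffled = rng.sample(new_classes, len(new_classes))
        shots = {}
        for k, pair in enumerate(zip(shuffled[0::2], shuffled[1::2]), start=1):
            for cls in pair:
                shots[cls] = k  # the i-th group of 2 classes gets k = i examples
        return shots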

5.2 Experimental setting

Baselines.

Since this is the first work that studies the incremental few-shot text classification problem, there is no prior system that deals with exactly the same task. We compare our model with two types of baselines. Two baselines DBLPYinHR19; zhangdiscriminative solve text classification as a textual entailment problem and use large-scale entailment datasets for pre-training. We also adapt two incremental few-shot learning models from computer vision DBLPSnellSZ17; DBLPGidarisK18 to incremental few-shot text classification; for these two baselines, we replace their encoders with RoBERTa to fit the text classification task.

• Textual Entailment. DBLPYinHR19 uses a pre-trained textual entailment system to cope with zero-shot text classification, in which the input text acts as a premise and the class name or its definition acts as a hypothesis. It is similar to our approach, except for the reminder mechanism proposed in this work. This textual entailment baseline thus keeps fine-tuning on the regular entailment pairs round by round.

• DNNC. zhangdiscriminative proposes an alternative way to implement the entailment idea for few-shot text classification. Instead of inferring the truth value of a class $y$ conditioned on the input text $x_i$, they infer whether two text inputs ($x_i$, $x_j$) are in the same class. As a result, for a round $C_n^i$ with originally $h \cdot k$ examples, this baseline generates $h \cdot k \cdot (k-1)$ positive pairs and $h \cdot (h-1) \cdot k^2$ negative pairs. In testing, a query needs to be compared with all the examples of all classes, so the computation cost of this baseline is high.

• Prototypical Network DBLPSnellSZ17. The Prototypical Network is trained on the base classes with the episode training method. Each episode randomly selects $h$ base classes; each selected class is equipped with $k$-shot supporting examples and a set of query examples. The representations of both base and new classes are computed as the average embedding of the $k$-shot supporting examples. The network trains a nearest-neighbor algorithm to optimize prediction on the query sets. In testing, the model compares the distances of a query example to all the class representations and chooses the nearest neighbor as its label.

• DyFewShot DBLPGidarisK18. We introduced this baseline in Section 2. We extend it to address multi-round few-shot classes: for the present round $C_n^t$, all the preceding classes, including those in $C_b$ and $\{C_n^1, \cdots, C_n^{t-1}\}$, are viewed as "base classes".

Implementation and setting.

For all the textual entailment models, we use MNLI williams2018broad to pre-train the model. All systems are implemented with the Huggingface Transformers package (https://github.com/huggingface/transformers). We always fine-tune for 5 epochs on each round, with learning rate 1e-6 and batch size 16. We run the same program with 3 different seeds and report the average performance.

Accuracy is reported for $C_b$ and $C_n^1, \cdots, C_n^5$; the F1 score is reported for $C_o$.

After   System                 C_b          C_n^1        C_n^2        C_n^3        C_n^4        C_n^5        C_o
C_b     ProtoNet               87.25±0.10   -            -            -            -            -            53.4±10.68
        Entailment (baseline)  96.42±0.41   -            -            -            -            -            64.73±3.84
        Clu4Fewshot            95.96±0.68   -            -            -            -            -            61.89±4.78
        DyFewShot              81.04±1.91   -            -            -            -            -            55.01±2.52
        Entailment (ours)      96.12±0.12   -            -            -            -            -            58.92±1.22
C_n^1   ProtoNet               85.83±1.94   31.67±1.48   -            -            -            -            43.66±3.08
        Entailment (baseline)  94.42±0.21   75.42±1.56   -            -            -            -            56.38±5.29
        Clu4Fewshot            95.75±0.41   74.83±1.64   -            -            -            -            64.54±2.02
        DyFewShot              81.29±1.56   0.0±0.0      -            -            -            -            39.33±1.25
        Entailment (ours)      95.62±1.00   77.75±0.25   -            -            -            -            58.41±5.10
C_n^2   ProtoNet               83.92±0.33   24.92±5.54   38.83±3.43   -            -            -            31.14±9.83
        Entailment (baseline)  94.29±0.16   71.92±1.45   84.83±1.33   -            -            -            48.12±3.20
        Clu4Fewshot            95.42±0.62   72.92±4.37   75.08±3.3    -            -            -            49.02±3.23
        DyFewShot              81.29±1.56   0.0±0.0      0.5±0.71     -            -            -            33.94±1.42
        Entailment (ours)      96.44±0.19   76.75±2.75   75.0±1.0     -            -            -            42.11±0.30
C_n^3   ProtoNet               81.08±2.06   24.33±5.54   30.67±6.17   22.5±1.34    -            -            23.62±6.99
        Entailment (baseline)  92.71±0.41   70.75±0.54   82.83±2.16   73.92±2.52   -            -            29.34±3.31
        Clu4Fewshot            95.67±0.33   68.17±2.37   66.33±5.02   71.25±3.78   -            -            45.69±1.73
        DyFewShot              81.29±1.56   0.0±0.0      0.5±0.71     0.0±0.0      -            -            27.48±1.24
        Entailment (ours)      95.44±0.44   73.62±0.62   71.62±2.62   73.5±0.75    -            -            33.69±3.66
C_n^4   ProtoNet               81.17±2.52   17.83±2.58   31.75±0.94   24.92±1.9    22.25±3.19   -            28.19±4.78
        Entailment (baseline)  91.67±0.36   65.92±2.18   79.92±1.78   73.75±0.74   69.08±0.12   -            45.73±2.80
        Clu4Fewshot            95.29±0.16   68.75±2.35   66.75±3.82   67.0±3.4     57.75±1.41   -            42.09±3.72
        DyFewShot              81.54±1.71   0.25±0.35    0.17±0.24    0.0±0.0      0.0±0.0      -            23.52±1.51
        Entailment (ours)      95.69±0.06   72.12±0.62   67.75±1.25   70.25±0.25   72.62±1.38   -            38.85±0.89
C_n^5   ProtoNet               80.00±2.65   21.83±5.45   29.17±3.7    24.67±3.12   23.17±3.6    30.33±4.17   29.24±2.96
        Entailment (baseline)  89.17±0.60   65.08±2.45   78.5±0.94    69.08±1.12   68.25±0.35   70.67±1.3    39.48±1.45
        Clu4Fewshot            95.12±0.47   67.50±0.89   67.92±4.7    64.42±4.17   52.42±1.2    53.33±2.09   30.46±5.92
        DyFewShot              81.50±1.27   0.08±0.12    0.83±0.62    0.0±0.0      0.0±0.0      0.5±0.71     21.23±1.34
        Entailment (ours)      95.56±0.06   68.75±2.75   67.38±0.62   63.75±1.75   65.12±3.62   61.62±2.38   37.65±0.44

Table 2: System performance on the benchmark IFS-Intent in the setting with base classes. Horizontal direction: different groups of testing classes (base classes C_b, five rounds of novel classes C_n^1, ..., C_n^5, and the OOD classes C_o); vertical direction: timeline of incremental learning over new rounds of novel classes. "Entailment (baseline)" refers to the textual entailment baseline DBLPYinHR19; "Entailment (ours)" is our model. Numbers are averaged over results of three random seeds.
After   System                 C_n^1        C_n^2        C_n^3        C_n^4        C_n^5        C_o
C_n^1   Entailment (baseline)  65.17±1.36   -            -            -            -            75.43±0.41
        Clu4Fewshot            55.50±2.27   -            -            -            -            72.29±0.20
        Entailment (ours)      70.08±0.77   -            -            -            -            78.25±0.19
C_n^2   Entailment (baseline)  64.08±2.04   76.33±1.01   -            -            -            64.68±0.71
        Clu4Fewshot            64.58±0.42   77.75±1.08   -            -            -            61.72±0.90
        Entailment (ours)      74.25±1.34   86.67±1.01   -            -            -            64.39±0.27
C_n^3   Entailment (baseline)  75.50±1.63   83.83±0.62   75.25±1.24   -            -            56.56±2.43
        Clu4Fewshot            65.25±1.67   79.58±1.50   64.67±1.93   -            -            50.25±0.52
        Entailment (ours)      74.25±1.08   85.92±1.05   76.58±1.05   -            -            53.09±1.73
C_n^4   Entailment (baseline)  68.33±1.16   72.67±0.77   68.58±1.9    69.50±1.34   -            53.92±0.75
        Clu4Fewshot            66.75±0.54   79.08±0.51   60.5±2.35    62.25±1.08   -            42.56±0.76
        Entailment (ours)      73.75±1.41   85.50±1.06   71.67±1.53   75.83±2.44   -            52.75±0.63
C_n^5   Entailment (baseline)  67.58±0.82   73.50±1.24   67.83±0.47   71.83±0.66   73.75±0.74   50.95±0.68
        Clu4Fewshot            65.33±0.62   76.75±1.59   62.83±3.17   59.75±2.83   57.25±2.32   36.66±1.07
        Entailment (ours)      70.75±1.27   82.50±1.27   72.42±0.96   76.67±1.05   71.0±0.41    47.05±1.60

Table 3: System performance on the benchmark IFS-Intent in the setting without base classes. Horizontal direction: different groups of testing classes (five rounds of novel classes C_n^1, ..., C_n^5 and the OOD classes C_o); vertical direction: timeline of incremental learning over new rounds of novel classes. Numbers are averaged over results of three random seeds.

5.3 Experimental results

Following the problem formulation in Section 3, we want to investigate two questions. $\mathcal{Q}_1$: can our system get better performance in each round? $\mathcal{Q}_2$: can our system hold more stable performance during the incremental learning process?

Tables 2 and 3 list the results on the benchmark IFS-Intent in the settings with and without base classes, respectively. Our system Entailment is compared with the baselines on the seven batches of testing classes (base, five rounds, and OOD) along the timeline of incremental learning from the base classes to the fifth round.

As for question $\mathcal{Q}_1$, we summarize our observations as follows. (i) ProtoNet generally performs worst in most cases, regardless of the test classes and the timeline. This is likely because ProtoNet does not fine-tune on the new classes, so no incremental learning takes place. (ii) The baselines Entailment (baseline) and Clu4Fewshot, which perform incremental fine-tuning, generally outperform ProtoNet; in addition, they are mostly comparable to each other. (iii) Our system Entailment consistently obtains the best results across all test classes and the timeline.

Figure 1: Average performance on new classes in different rounds; (a) incremental setting with base classes, (b) incremental setting without base classes. The x-axis is the round number and the y-axis is the average accuracy on the new classes of that round.

To answer question $\mathcal{Q}_2$, we need to quantify the performance changes of all systems along the timeline, not only on $C_b$ but also on all $C_n^i$ ($i = 1, \cdots, 5$). Given a list of $m$ result values $r \in \mathbb{R}^m$, we first use linear regression to fit these numbers: we fit a line $f(t) = at + b$, where $a$ is the slope ($a < 0$ for a dropping curve), $b$ is the intercept, and $t = 1, \cdots, m$ is the time stamp. The performance drop $d$ reflected by this list is calculated as $d = (f(1) - f(m)) / f(1)$. Since the linear regression is more reliable when $m$ is larger, we compute the drop values for $C_b$, $C_n^1$, and $C_n^2$ only, and average them as the final evaluation of a system in response to $\mathcal{Q}_2$; a sketch of this metric is given below.
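
The drop metric can be computed as follows (a sketch; the example numbers are hypothetical):

    import numpy as np

    def performance_drop(results):
        # Fit f(t) = a*t + b over time stamps t = 1..m, then report
        # d = (f(1) - f(m)) / f(1), the relative drop along the fitted line.
        t = np.arange(1, len(results) + 1)
        a, b = np.polyfit(t, results, deg=1)
        f = lambda x: a * x + b
        return (f(1) - f(len(results))) / f(1)

    print(performance_drop([96.1, 95.6, 96.4, 95.4, 95.7, 95.6]))  # hypothetical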

6 Conclusion

In this work, we define a new challenge in the NLP domain: incremental few-shot text classification with multi-round new classes. In addition to the problem formulation, we also release two benchmark datasets for this particular challenge, IFS-Intent and IFS-Relation, and propose a novel approach, Entailment, to solve the problem. Entailment converts text classification into textual entailment, which can be pre-trained on large-scale entailment datasets. The reminder mechanism in Entailment mitigates the catastrophic forgetting problem in the incremental setting. Experiments on these benchmark datasets show the effectiveness of our proposed model.