
Incremental Few-shot Text Classification with Multi-round New Classes:
Formulation, Dataset and System

First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Abstract

Text classification is usually studied by labeling natural language texts with relevant categories from a predefined set. In the real world, new classes may keep challenging the existing system with limited labeled data, and the system should be intelligent enough to recognize upcoming new classes from a few examples. In this work, we define a new task in the NLP domain, "incremental few-shot text classification", where the system first handles some base classes with rich annotations, then copes with multiple rounds of new classes. In each round, a batch of new classes arrives with a few labeled examples per class. This new task poses two major challenges: (i) for the learning process, the system should incrementally learn new classes round by round without re-training on the examples of preceding classes; (ii) for the performance, the system should perform well on new classes without much loss on preceding classes. In addition to formalizing the new task, we also release two benchmark datasets in the incremental few-shot setting: intent classification and relation classification. Moreover, we propose an approach, Entailment, which shows promise for solving this novel problem.

1 Introduction

Text classification has achieved great success in the past decades with the development of deep learning techniques kowsari2019text. However, decent performance relies heavily on the quality and quantity of the training data. Recently, few-shot text classification yu2018diverse has attracted increasing attention from researchers, since large-scale labeled data for new classes is rarely available in reality.

Typically, few-shot text classification is formulated as follows: the system first sees a set of base classes $C_b$ that have a large number of labeled examples, then a group of new classes $C_n$ is provided with $k$ examples per class. For a testing instance, the system is required to search for its label in the space of $C_b \cup C_n$ or merely $C_n$.

However, if we think about few-shot text classification in real scenarios, new challenges arise that make it worth further exploration. First, taking a bank's customer service system as an example, queries with new intents continuously appear (e.g., in a sequence of rounds) without enough labeled data. The system should be able to keep learning and recognizing new intents round by round; for each query, it needs to pick the most appropriate intent from the incrementally growing label space or return "none of them". Second, existing incremental training strategies suffer from catastrophic forgetting mccloskey1989catastrophic: systems tend to forget knowledge learned in the past when continually fine-tuning on new classes, which leads to a significant performance drop on base classes. For a real-world application such as the aforementioned customer service system, the system is expected to perform well on all classes, regardless of when a class was added or how many examples it has.

In this work, our contribution lies in three aspects. First, we formally define the problem of "incremental few-shot text classification". In our definition, the system is first provided with some base classes $C_b$ with rich annotations, then $m$ rounds of new classes (i.e., $C_n^1, C_n^2, \cdots, C_n^m$) arrive sequentially. Each $C_n^i$ ($i = 1, \cdots, m$) consists of a group of new classes, and each class has $k$ labeled examples ($k$ is in the range [1, 5] and varies for different classes). For testing, we require the system to either select the best class from $C_b \cup C_n^1 \cup C_n^2 \cup \cdots \cup C_n^m$ or output "none of them", which means no existing class applies to the input. As far as we know, this is the first work in the NLP community that studies text classification with an incrementally growing set of classes. A few papers work on incremental few-shot classification in the field of computer vision DBLPRenLFZ19, but they only consider one round of new classes; it is difficult to evaluate a system's ability for incremental learning without multiple rounds of new classes.

Furthermore, we consider an extreme case where no base classes are available, which we call incremental few-shot text classification without base classes. This situation arises whenever a system is built from scratch: in real-world scenarios, base classes with rich annotations might not be available at the beginning. In this setting, we need to solve the cold-start problem, where no initial annotations are available. All previous few-shot learning models DBLPSnellSZ17; DBLPGidarisK18; DBLPRenLFZ19 fail to solve this problem, since they rely on large-scale labeled data for base classes to train a robust system.

Second, to evaluate the aforementioned two settings, incremental few-shot text classification with and without base classes, we release a benchmark dataset, IFS-Intent. This benchmark simulates a task like the bank's customer service just mentioned. Another important feature of our benchmark is that we do not provide dev sets. Existing systems are commonly evaluated on a dev set to choose the best training model. We argue that in real-world (incremental) few-shot applications, we cannot expect extra labeled data beyond the $k$ examples. This is in line with the observation in DBLP07676. If a system has to rely on a dev set to find the best parameters, it is not suitable for the incremental few-shot setting.

Third, we propose our approach, Entailment, to solve this new problem. Entailment models text classification as a textual entailment dagan2013recognizing problem: to figure out whether an input $x$ belongs to a class $y$, Entailment tries to infer the truth value of $y$ (i.e., a hypothesis) given $x$ (i.e., the premise). The main benefit of this formulation is that the system learns the task not only from the label-specific examples, but, more importantly, from large-scale entailment datasets. In other words, we make use of indirect supervision from textual entailment datasets to address the target few-shot task. Specifically, for each round $C_n^i$, which has $h$ new classes with $k$ examples each, we first build positive (premise, hypothesis) pairs by accompanying each input with its gold class, and negative (premise, hypothesis) pairs by accompanying the input with the other $h-1$ classes in $C_n^i$. It is worth noting that we only use the few-shot examples and label names in $C_n^i$ for the incremental training.

The current state-of-the-art work zhangdiscriminative for few-shot text classification also utilizes textual entailment for pre-training, but it ignores the information in the class labels. Instead of inferring the truth value of a class $y$ conditioned on the input text $x$, it infers whether two text inputs ($x_i$, $x_j$) are in the same class. This increases the computation cost, since the number of examples is much larger than the number of labels. Moreover, their final predictions are made by searching for the nearest neighbor among all the examples, so the performance depends heavily on the examples chosen; poorly chosen examples lead to poor results.

2 Related Work

Incremental few-shot learning.

As far as we know, there is no prior work in the NLP domain that studies incremental few-shot text classification, so in this section we mainly introduce work from the computer vision domain. As mentioned before, these works assume that a single round of new classes $C_n$ is appended to the base classes $C_b$. Generally, they learn class representations for classification, and different approaches differ in the way of representing base classes and new classes. Hereafter, we use $W_b$ and $W_n$ as the representations of $C_b$ and $C_n$, respectively.

DBLPSnellSZ17 proposes the Prototypical Network, in which both $W_b$ and $W_n$ are stored as the average embedding of the few-shot support examples of a class. Although the Prototypical Network was not designed for incremental few-shot learning, it can easily be adapted to the incremental setting by providing representations for all the classes: it trains a nearest-neighbor algorithm on the base classes and tests directly on the union of base and new classes. DBLPQiBL18 proposes an "imprinting" mechanism: the base representations $W_b$ are learned regularly through supervised pre-training (e.g., the weight matrix in a softmax classifier), and $W_n$ are computed as averaged representations, as in the Prototypical Network.
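
For concreteness, the following is a minimal PyTorch sketch (our illustration, not code from the cited works) of prototype-based classification: a class representation is the mean of its support embeddings, and prediction picks the nearest prototype.

    import torch

    def class_prototype(support_embeddings: torch.Tensor) -> torch.Tensor:
        # support_embeddings: (k, d) embeddings of a class's k support examples
        return support_embeddings.mean(dim=0)

    def nearest_prototype(query: torch.Tensor, prototypes: torch.Tensor) -> int:
        # query: (d,); prototypes: (num_classes, d); returns the predicted class index
        distances = torch.cdist(query.unsqueeze(0), prototypes).squeeze(0)
        return int(distances.argmin())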

In DBLPGidarisK18, the base representations $W_b$ are learned through supervised pre-training. The representation of the $i$-th novel class, $W_{n,i}$, comes from two origins: (i) the prototypical average, $w_{avg}$; (ii) an attention-weighted sum over base representations, $w_{att}$. Namely, $W_{n,i} = \phi_{avg} \odot w_{avg} + \phi_{att} \odot w_{att}$, where $\phi_{avg}$ and $\phi_{att}$ are learnable weight vectors. In the few-shot training stage, the original base classes $C_b$ are split into "new base classes" and "fake novel classes" for each episode, and the loss is computed on predictions for both. In testing, $W_n$, the representations of the novel classes, are constructed from the $k$ examples and $W_b$.
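
A rough sketch of this combination (again our illustration; the exact attention mechanism in DBLPGidarisK18 differs in its details) could look like:

    import torch
    import torch.nn.functional as F

    def novel_class_weight(w_avg, w_att, phi_avg, phi_att):
        # W_{n,i} = phi_avg (*) w_avg + phi_att (*) w_att, (*) elementwise
        return phi_avg * w_avg + phi_att * w_att

    def attention_weighted_base(support_emb, W_b):
        # support_emb: (k, d) few-shot embeddings; W_b: (num_base, d) base weights
        att = torch.softmax(F.normalize(support_emb, dim=-1)
                            @ F.normalize(W_b, dim=-1).T, dim=-1)  # (k, num_base)
        return (att @ W_b).mean(dim=0)  # w_att: attention-weighted sum over W_b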

In DBLPRenLFZ19, both $W_b$ and $W_n$ are learned through supervised training: $W_b$ are classifier parameters pre-trained on the base classes, and $W_n$ are classifier parameters learned on the new classes. During training, the support set and the query set are constructed differently for new classes: the support set consists of examples only from new classes, while the query set contains examples from both new and base classes (because the training goal is to maximize the performance on all classes). Training in this work has two phases. The first phase is few-shot episode training, which learns $W_n$ and optimizes the performance on the support set; the second phase (called meta-learning training) optimizes the performance on the query set, with a regularization term enforcing $W_n$ to be as close as possible to the attention-weighted sum of $W_b$.

To summarize, compared with DBLPSnellSZ17 and DBLPQiBL18, both DBLPGidarisK18 and DBLPRenLFZ19 build connections between the representations of base classes and new classes. However, these methods cannot be directly applied to our problem for the following reasons. (i) Despite claims in some literature of dealing with incremental or dynamic few-shot problems, they only consider a single round of new classes DBLPQiBL18; DBLPGidarisK18; DBLPRenLFZ19; it is unclear whether these systems can maintain their performance when multiple rounds of new classes are considered. (ii) During training on the new classes, they often rely on extra labeled data beyond the $k$ examples, such as the query set in DBLPRenLFZ19. (iii) Unlike their setting, incremental few-shot text classification has an extra "out-of-distribution" label: it is not guaranteed that an input, such as a customer's utterance, always falls into the range of seen labels.

Using textual entailment for text classification.

zhangdiscriminative is a state-of-the-art work on few-shot text classification. It proposes a discriminative nearest-neighbor classification (DNNC) model that compares whether two examples are in the same class. A matching model $S(x_i, x_j)$ is trained as a binary classifier, such that $S(x_i, x_j)$ is close to 1.0 if $x_i$ and $x_j$ belong to the same class, and close to 0.0 otherwise; this model can be pre-trained on large-scale textual entailment datasets. Given a test query $x$, they compare it with all the stored examples, and the final prediction is made by searching for the nearest neighbor, i.e., the example $x_i$ with the highest matching score $S(x, x_i)$. As mentioned before, the computation cost is high and the performance depends heavily on the quality of the chosen examples.

Moreover, comparing whether two examples are in the same class is different from textual entailment, in which a human reads a premise to infer whether a hypothesis is true. The fact that two examples are in the same class does not mean they entail each other, so DNNC cannot fully utilize the pre-trained entailment model. Instead, our proposed model, Entailment, checks whether the label is entailed by a given utterance, which is much more efficient and makes fuller use of the pre-trained entailment model.

DBLPYinHR19 is another work that utilizes textual entailment, for zero-shot text classification. They convert zero-shot text classification into the problem of filling a label candidate into a hypothesis. For example, they combine "emotion" labels with the question "this text expresses ?", and ask the model whether this hypothesis is true, given the text. This work focuses on zero-shot learning and needs to design different questions for different labels.

3 Problem Formulation

In this section, we give a formal description of the problem of "incremental few-shot text classification" with and without base classes.

Training data.

For the setting with base classes, the system is provided with a set of base classes $C_b = \{C_{b,1}, C_{b,2}, \cdots, C_{b,g}\}$; each base class $C_{b,i}$ ($i = 1, \cdots, g$) has $l$ labeled examples, where $l$ is usually a large number, like several hundred or several thousand. For the setting without base classes, $C_b$ is ignored. Both settings have $m$ rounds of new classes arriving sequentially: $\{C_n^1, \cdots, C_n^m\}$. Each round $C_n^i$ has $h$ new classes, namely $C_n^i = \{C_{n,1}^i, \cdots, C_{n,h}^i\}$. Each new class has only $k$ examples ($k \in [1,5]$); the value of $k$ is not fixed and may differ across new classes in the same round, i.e., $k_{C_{n,p}^i} \neq k_{C_{n,q}^i}$ is allowed.

We create the multi-round setting since it evaluates the system more precisely while it learns a long sequence of new classes. In each round, we set $k \in [1,5]$ and allow the flexibility that different classes in the same round have different $k$. This setting is more in line with reality, where we can only collect a handful of examples for upcoming classes and the number of examples cannot be guaranteed.
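
To make the setup concrete, here is a minimal sketch of the data layout we assume (the class names and utterances are hypothetical):

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Round:
        # class name -> its few-shot utterances; k = len(list) may differ per class
        classes: Dict[str, List[str]]

    @dataclass
    class IncrementalTask:
        base: Dict[str, List[str]]  # base classes with l examples each; {} if absent
        rounds: List[Round]         # m rounds arriving sequentially

    task = IncrementalTask(
        base={"check balance": ["what is my balance?"] * 200},
        rounds=[Round({"get physical card": ["send me a new card",
                                             "i need a physical card"],   # k = 2
                       "lost or stolen card": ["my card was stolen"]})],  # k = 1
    )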

Development data.

To simulate real scenarios in the experiments, we assume that only the $k$ examples are available for the new classes during the incremental training. Thus, our formulation does not provide any development set to help select the best model; hyper-parameters should be selected based on experience or related tasks.

Testing data.

To evaluate the system, the test data consists of examples across all the classes. For the setting with base classes, the potential label space is $C_b \cup C_n^1 \cup \cdots \cup C_n^m \cup C_o$. For the setting without base classes, $C_b$ is excluded and we only search among the classes in $C_n^1 \cup \cdots \cup C_n^m \cup C_o$. $C_o$ is an extra out-of-distribution (OOD) class whose examples fall outside all the seen classes. It gives us a chance to check the system's ability to detect instances that reject all the known classes, which is crucial for an open-set problem like incremental learning, since there are always examples from upcoming classes that do not belong to any existing class.

Requirements.

(i) For the training of the $i$-th round $C_n^i$, the system can only access the newly added few-shot examples of this round and the names of the preceding classes in $C_b \cup C_n^1 \cup \cdots \cup C_n^{i-1}$. The system is not allowed to re-train on the (full or partial) examples of preceding classes. (ii) For the evaluation, we care about the performance on different types of classes, including base classes, different rounds of new classes, and the OOD classes in $C_o$. We expect a system that can continuously recognize new classes well from few-shot examples; in the meantime, the performance drop on preceding classes is also considered, and a system showing more severe catastrophic forgetting is less preferred.

4 Our Model: Entailment

Our approach Entailment casts the text classification problem into textual entailment: the input text acts as a premise, and the class name, such as "open a bank account" in intent detection, acts as a hypothesis. Asking whether the input belongs to a class is then equivalent to asking whether the hypothesis is true given the premise. Transforming text classification into entailment brings two benefits. First, we can make use of indirect supervision from large-scale entailment datasets williams2018broad to benefit the few-shot setting. Second, it enables us to utilize not only the few-shot examples but also the information in the class names. Typical text classification approaches treat classes as mere indices; in fact, class names usually contain informative signals.

Entailment pairs.

To transfer the text classification problem into textual entailment, we construct positive and negative entailment pairs for the training. Positive entailment pairs ($x_i$, $y_i$) are constructed from an utterance $x_i$ and its gold label name $y_i$, where $y_i \in C_b$ for base classes and $y_i \in C_n^i$ for new classes. Negative entailment pairs consist of ($x_i$, $y_j$), where $y_j$ is an incorrect label in the current round: for base classes, $y_j \in C_b$ but $y_j \neq y_i$; for new classes, $y_j \in C_n^i$ but $y_j \neq y_i$.

In zhangdiscriminative, by comparison, entailment pairs are constructed from two utterances in the same round: ($x_i$, $x_j$) is a positive pair if they belong to the same class, and a negative pair otherwise. To explore the potential of different combinations, we also propose a hybrid entailment model that uses both (utterance, label) pairs ($x_i$, $y_i$) and (utterance, utterance) pairs ($x_i$, $x_j$). This hybrid model is trained with entailment pairs from both our proposed model and zhangdiscriminative.

In the setting with $g$ base classes and $l$ examples per base class, $g \cdot l$ positive entailment pairs and $(g-1) \cdot g \cdot l$ negative pairs are generated. For a round $C_n^i$ containing $h$ new classes with $k$ examples each, there are $h \cdot k$ positive entailment pairs and $(h-1) \cdot h \cdot k$ negative entailment pairs. For simplicity, we use the same $k$ for all new classes here; in the real datasets, different new classes may have different numbers of few-shot examples, in which case the number of generated pairs changes accordingly.
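
The pair construction can be sketched as follows (a minimal illustration; the actual preprocessing may differ in details):

    def build_entailment_pairs(round_classes):
        # round_classes: class name -> list of its k utterances
        pairs = []  # (premise, hypothesis, label), label 1 = entail, 0 = not
        names = list(round_classes)
        for gold, utterances in round_classes.items():
            for x in utterances:
                pairs.append((x, gold, 1))  # one positive pair per utterance
                pairs.extend((x, y, 0) for y in names if y != gold)  # h-1 negatives
        return pairs
    # With h classes and k examples each: h*k positives, (h-1)*h*k negatives.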

For each entailment pair ($x$, $y$), positive or negative, we concatenate the utterance $x$ with the label $y$ and feed the result into the RoBERTa liu2019roberta encoder. Given an utterance $x = (X_1, X_2, \ldots, X_{T_1})$ with $T_1$ words and a label $y = (Y_1, Y_2, \ldots, Y_{T_2})$ with $T_2$ words, we add a special start-of-sequence ([CLS]) token at the beginning of the input and a special end-of-sequence ([SEP]) token at the end of each sentence. The whole input is ([CLS], $X_1$, $X_2$, …, $X_{T_1}$, [SEP], $Y_1$, $Y_2$, …, $Y_{T_2}$, [SEP]). We use the [CLS] embedding output from the RoBERTa encoder with a fully connected layer for binary textual entailment:

$h = \text{RoBERTa}(x, y)$, (1)
$p = \text{softmax}(Wh + b)$, (2)

where $h \in \mathbb{R}^d$ is the embedding of the [CLS] token, and $W \in \mathbb{R}^{2 \times d}$ and $b \in \mathbb{R}^2$ are parameters.
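
A minimal sketch of this scoring step with the Huggingface Transformers package (our illustration; the checkpoint name is a placeholder and the [not-entail, entail] label ordering is an assumption):

    import torch
    from transformers import RobertaTokenizer, RobertaForSequenceClassification

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                             num_labels=2)

    def entailment_prob(premise: str, hypothesis: str) -> float:
        # The tokenizer adds RoBERTa's start/separator tokens, which play the
        # role of the [CLS]/[SEP] markers in the input format described above.
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits  # shape (1, 2)
        return torch.softmax(logits, dim=-1)[0, 1].item()  # assumed entail index: 1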

Training strategy.

Our model is a binary classifier that can utilize the indirect supervision from textual entailment. We first pre-train the model on a large-scale entailment dataset williams2018broad. For the setting with base classes, we then fine-tune on entailment pairs from the base classes to obtain a base model, which is subsequently fine-tuned on the new classes $C_n^i$ of each round. In the setting without base classes, the model is directly fine-tuned on the entailment pairs of the new classes in each round $C_n^i$, as sketched below.
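
A sketch of the schedule; `fine_tune` is a hypothetical helper wrapping the usual training loop:

    def incremental_training(model, base_pairs, rounds_of_pairs, fine_tune):
        # MNLI-style entailment pre-training is assumed to have happened already.
        if base_pairs:                       # setting with base classes
            fine_tune(model, base_pairs)
        for round_pairs in rounds_of_pairs:  # m rounds, arriving one by one
            fine_tune(model, round_pairs)    # only this round's few-shot pairs
        return model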

Inference strategy.

After the training, we use the model to infer the class of a test input. For each input, we generate entailment pairs by pairing the input with every class except $C_o$. Each pair gets a score $\lambda \in [0, 1]$ indicating whether the input belongs to that class; $\lambda > 0.5$ indicates "yes", and "no" otherwise. If at least one class is labeled "yes", the class with the maximal $\lambda$ score is returned; otherwise, the system returns $C_o$. We choose the threshold 0.5 because entailment recognition is a binary classification problem.
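
Combined with the scoring function sketched earlier, inference reduces to the following (again a sketch):

    def predict(utterance, seen_classes, entailment_prob, threshold=0.5):
        # seen_classes: all class names accumulated so far (excluding C_o)
        scores = {y: entailment_prob(utterance, y) for y in seen_classes}
        best = max(scores, key=scores.get)
        return best if scores[best] > threshold else "OOD"  # C_o if nothing entails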

Next, we compare our model with some related systems that could potentially be applied to incremental few-shot text classification.

Entailment vs. DNNC.

DNNC zhangdiscriminative also converts the text classification problem into textual entailment, but discriminates whether two text inputs ($x_i$, $x_j$) are in the same class. As a result, for a round $C_n^i$ with originally $h \cdot k$ examples, this baseline generates $h \cdot k \cdot (k-1)$ positive pairs and $h \cdot (h-1) \cdot k^2$ negative pairs. In testing, a query needs to be compared with all the examples of all classes, so the computation cost of this baseline is much higher than that of our proposed model.
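
A worked example with hypothetical values h = 10 and k = 5 makes the gap concrete:

    h, k = 10, 5  # hypothetical round: 10 new classes, 5 shots each
    ours = (h * k, (h - 1) * h * k)               # (50, 450) pairs for Entailment
    dnnc = (h * k * (k - 1), h * (h - 1) * k**2)  # (200, 2250) pairs for DNNC
    # At test time, DNNC scores a query against all h*k = 50 stored examples,
    # while Entailment scores it against only the h = 10 class names.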

Entailment vs. Prototypical Network.

Prototypical Network DBLPSnellSZ17 tries to solve few-shot target tasks given a collection of training tasks, where the distributions of the training tasks and the target tasks are required to be similar. It uses episode training to learn a nearest-neighbor algorithm and hopes it generalizes well to the target tasks.

The few-shot learning problem solved by the Prototypical Network is slightly different from our incremental few-shot setting. In the Prototypical Network, the label space for target tasks only contains the new classes; in the incremental few-shot setting, the target label space keeps growing as new classes are added. Due to this essential distinction, applying the Prototypical Network to incremental few-shot learning is very likely to cause a performance drop on base classes when fine-tuning on new classes.

Entailment vs. Incremental few-shot approaches in computer vision.

In Related Work, we introduced some typical computer vision approaches to the incremental few-shot problem. Those methods consistently learn representations for classes and examples separately (i.e., the $W_b$ and $W_n$ in Section 2). In our model, there are no individual representation vectors for classes or examples; instead, the model learns an overall representation for the whole (input, class) pair. Our solution enables the input and the class to interact with each other, which has widely demonstrated its superiority in modeling the relation between two elements DBLP12808; zhangdiscriminative.

In addition, the approaches in computer vision mostly rely on large-scale labeled data for base classes to train a robust system. We argue that base classes with rich annotations may not be available in real-world applications. Our system, which can instead be pre-trained on entailment datasets, does not rely on base classes; this makes it applicable to more scenarios.

            C_b    C_n^1  C_n^2  C_n^3  C_n^4  C_n^5  C_o
# class       20     10     10     10     10     10      7
# train     2088     30     30     30     30     30      -
# test       800    400    400    400    400    400    280

Table 1: Statistics of the benchmark dataset IFS-Intent. C_b: base classes; {C_n^1, ..., C_n^5}: five rounds of new classes; C_o: OOD classes. Note that C_o is never used for training.

5 Experiments

5.1 Datasets

IFS-Intent.

This is our benchmark for incremental few-shot intent detection. IFS-Intent is converted from BANKING77 (https://github.com/PolyAI-LDN/task-specific-datasets) casanueva2020efficient, a single-domain intent detection dataset comprising 13,083 annotated examples over 77 intents (on average, 170 examples per intent). Each intent class is described by a short name, such as "get physical card" or "lost or stolen card".

We split the 77 intents into a base group (i.e., $C_b$), 5 rounds of new intents (i.e., $\{C_n^1, \cdots, C_n^5\}$), and a group of out-of-distribution intents (i.e., $C_o$). Each upcoming round has 10 new classes. We randomly split the 10 classes into 5 groups (each with 2 classes), then intentionally give the 5 groups different sizes of $k$-shot examples ($k \in [1,5]$), as sketched below. Detailed statistics are reported in Table 1.
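
The shot-count assignment can be sketched as follows (our illustration of the procedure; the actual random split behind the released data may differ):

    import random

    def assign_shot_counts(new_classes, seed=0):
        # new_classes: the 10 class names of one round
        rng = random.Random(seed)
        shuffled = rng.sample(new_classes, len(new_classes))
        shots = {}
        for k, pair in enumerate(zip(shuffled[0::2], shuffled[1::2]), start=1):
            for cls in pair:
                shots[cls] = k  # the i-th group of 2 classes gets k = i examples
        return shots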

5.2 Experimental setting

Baselines.

Since this is the first work that studies the incremental few-shot text classification problem, there is no prior system that deals with exactly the same task. We compare our model with two types of baselines. Two baselines DBLPYinHR19; zhangdiscriminative solve text classification as a textual entailment problem and use large-scale entailment datasets for pre-training. We also adapt two incremental few-shot learning models from computer vision DBLPSnellSZ17; DBLPGidarisK18 to incremental few-shot text classification; for these two baselines, we replace their encoders with RoBERTa to fit the text classification task.

• Textual Entailment. DBLPYinHR19 uses a pre-trained textual entailment system to cope with zero-shot text classification, in which the input text acts as a premise and the class name or its definition acts as a hypothesis. It is similar to our approach, except for the reminder mechanism proposed in this work. This textual entailment baseline thus keeps fine-tuning on the regular entailment pairs round by round.

• DNNC. zhangdiscriminative proposes an alternative way to implement the entailment idea for few-shot text classification. Instead of inferring the truth value of a class $y$ conditioned on the input text $x_i$, they infer whether two text inputs ($x_i$, $x_j$) are in the same class. As a result, for a round $C_n^i$ with originally $h \cdot k$ examples, this baseline generates $h \cdot k \cdot (k-1)$ positive pairs and $h \cdot (h-1) \cdot k^2$ negative pairs. In testing, a query needs to be compared with all the examples of all classes, so the computation cost of this baseline is high.

• Prototypical Network DBLPSnellSZ17. The Prototypical Network is trained on the base classes with the episode training method. Each episode randomly selects $h$ base classes; each selected class is equipped with $k$-shot supporting examples and a set of query examples. The representations of both base and new classes are computed as the average embedding of the $k$-shot supporting examples. The network trains a nearest-neighbor algorithm to optimize prediction on the query sets. In testing, the model compares the distances of a query example to all the class representations and chooses the nearest neighbor as its label.

• DyFewShot DBLPGidarisK18. We introduced this baseline in Section 2. We extend it to address multi-round few-shot classes: for the present round $C_n^t$, all the preceding classes, including those in $C_b$ and $\{C_n^1, \cdots, C_n^{t-1}\}$, are viewed as "base classes".

Implementation and setting.

For all the textual entailment models, we use MNLI williams2018broad to pre-train the model. All systems are implemented with the Huggingface Transformers package (https://github.com/huggingface/transformers). We always fine-tune for 5 epochs on each round, with learning rate 1e-6 and batch size 16. We run the same program with 3 different seeds and report the average performance.

Accuracy is reported for $C_b$ and $C_n^1, \cdots, C_n^5$; the F1 score is reported for $C_o$.

After   System                 C_b          C_n^1        C_n^2        C_n^3        C_n^4        C_n^5        C_o
C_b     ProtoNet               87.25±0.10   -            -            -            -            -            53.4±10.68
        Entailment (baseline)  96.42±0.41   -            -            -            -            -            64.73±3.84
        Clu4Fewshot            95.96±0.68   -            -            -            -            -            61.89±4.78
        DyFewShot              81.04±1.91   -            -            -            -            -            55.01±2.52
        Entailment (ours)      96.12±0.12   -            -            -            -            -            58.92±1.22
C_n^1   ProtoNet               85.83±1.94   31.67±1.48   -            -            -            -            43.66±3.08
        Entailment (baseline)  94.42±0.21   75.42±1.56   -            -            -            -            56.38±5.29
        Clu4Fewshot            95.75±0.41   74.83±1.64   -            -            -            -            64.54±2.02
        DyFewShot              81.29±1.56   0.0±0.0      -            -            -            -            39.33±1.25
        Entailment (ours)      95.62±1.00   77.75±0.25   -            -            -            -            58.41±5.10
C_n^2   ProtoNet               83.92±0.33   24.92±5.54   38.83±3.43   -            -            -            31.14±9.83
        Entailment (baseline)  94.29±0.16   71.92±1.45   84.83±1.33   -            -            -            48.12±3.20
        Clu4Fewshot            95.42±0.62   72.92±4.37   75.08±3.3    -            -            -            49.02±3.23
        DyFewShot              81.29±1.56   0.0±0.0      0.5±0.71     -            -            -            33.94±1.42
        Entailment (ours)      96.44±0.19   76.75±2.75   75.0±1.0     -            -            -            42.11±0.30
C_n^3   ProtoNet               81.08±2.06   24.33±5.54   30.67±6.17   22.5±1.34    -            -            23.62±6.99
        Entailment (baseline)  92.71±0.41   70.75±0.54   82.83±2.16   73.92±2.52   -            -            29.34±3.31
        Clu4Fewshot            95.67±0.33   68.17±2.37   66.33±5.02   71.25±3.78   -            -            45.69±1.73
        DyFewShot              81.29±1.56   0.0±0.0      0.5±0.71     0.0±0.0      -            -            27.48±1.24
        Entailment (ours)      95.44±0.44   73.62±0.62   71.62±2.62   73.5±0.75    -            -            33.69±3.66
C_n^4   ProtoNet               81.17±2.52   17.83±2.58   31.75±0.94   24.92±1.9    22.25±3.19   -            28.19±4.78
        Entailment (baseline)  91.67±0.36   65.92±2.18   79.92±1.78   73.75±0.74   69.08±0.12   -            45.73±2.80
        Clu4Fewshot            95.29±0.16   68.75±2.35   66.75±3.82   67.0±3.4     57.75±1.41   -            42.09±3.72
        DyFewShot              81.54±1.71   0.25±0.35    0.17±0.24    0.0±0.0      0.0±0.0      -            23.52±1.51
        Entailment (ours)      95.69±0.06   72.12±0.62   67.75±1.25   70.25±0.25   72.62±1.38   -            38.85±0.89
C_n^5   ProtoNet               80.00±2.65   21.83±5.45   29.17±3.7    24.67±3.12   23.17±3.6    30.33±4.17   29.24±2.96
        Entailment (baseline)  89.17±0.60   65.08±2.45   78.5±0.94    69.08±1.12   68.25±0.35   70.67±1.3    39.48±1.45
        Clu4Fewshot            95.12±0.47   67.50±0.89   67.92±4.7    64.42±4.17   52.42±1.2    53.33±2.09   30.46±5.92
        DyFewShot              81.50±1.27   0.08±0.12    0.83±0.62    0.0±0.0      0.0±0.0      0.5±0.71     21.23±1.34
        Entailment (ours)      95.56±0.06   68.75±2.75   67.38±0.62   63.75±1.75   65.12±3.62   61.62±2.38   37.65±0.44

Table 2: System performance on the benchmark IFS-Intent in the setting with base classes. Horizontal direction: different groups of testing classes (base classes C_b, five rounds of novel classes C_n^1, ..., C_n^5, and the OOD classes C_o); vertical direction: timeline of incremental learning over new rounds of novel classes. "Entailment (baseline)" refers to the textual entailment baseline DBLPYinHR19; "Entailment (ours)" is our model. Numbers are averaged over results of three random seeds.
After   System                 C_n^1        C_n^2        C_n^3        C_n^4        C_n^5        C_o
C_n^1   Entailment (baseline)  65.17±1.36   -            -            -            -            75.43±0.41
        Clu4Fewshot            55.50±2.27   -            -            -            -            72.29±0.20
        Entailment (ours)      70.08±0.77   -            -            -            -            78.25±0.19
C_n^2   Entailment (baseline)  64.08±2.04   76.33±1.01   -            -            -            64.68±0.71
        Clu4Fewshot            64.58±0.42   77.75±1.08   -            -            -            61.72±0.90
        Entailment (ours)      74.25±1.34   86.67±1.01   -            -            -            64.39±0.27
C_n^3   Entailment (baseline)  75.50±1.63   83.83±0.62   75.25±1.24   -            -            56.56±2.43
        Clu4Fewshot            65.25±1.67   79.58±1.50   64.67±1.93   -            -            50.25±0.52
        Entailment (ours)      74.25±1.08   85.92±1.05   76.58±1.05   -            -            53.09±1.73
C_n^4   Entailment (baseline)  68.33±1.16   72.67±0.77   68.58±1.9    69.50±1.34   -            53.92±0.75
        Clu4Fewshot            66.75±0.54   79.08±0.51   60.5±2.35    62.25±1.08   -            42.56±0.76
        Entailment (ours)      73.75±1.41   85.50±1.06   71.67±1.53   75.83±2.44   -            52.75±0.63
C_n^5   Entailment (baseline)  67.58±0.82   73.50±1.24   67.83±0.47   71.83±0.66   73.75±0.74   50.95±0.68
        Clu4Fewshot            65.33±0.62   76.75±1.59   62.83±3.17   59.75±2.83   57.25±2.32   36.66±1.07
        Entailment (ours)      70.75±1.27   82.50±1.27   72.42±0.96   76.67±1.05   71.0±0.41    47.05±1.60

Table 3: System performance on the benchmark IFS-Intent in the setting without base classes. Horizontal direction: different groups of testing classes (five rounds of novel classes C_n^1, ..., C_n^5 and the OOD classes C_o); vertical direction: timeline of incremental learning over new rounds of novel classes. Numbers are averaged over results of three random seeds.

5.3 Experimental results

Following the problem formulation in Section 3, we want to investigate two questions. $\mathcal{Q}_1$: can our system get better performance in each round? $\mathcal{Q}_2$: can our system hold more stable performance during the incremental learning process?

Tables 2 and 3 list the results on the benchmark IFS-Intent in the settings with and without base classes, respectively. Our system Entailment is compared with the baselines on the seven batches of testing classes (base, five rounds, and OOD) along the timeline of incremental learning from the base classes to the fifth round.

As for question $\mathcal{Q}_1$, we summarize our observations as follows. (i) ProtoNet generally performs worst in most cases, regardless of the test classes and the timeline. This is likely because ProtoNet does not fine-tune on the new classes, so no incremental learning takes place. (ii) The baselines Entailment (baseline) and Clu4Fewshot, which perform incremental fine-tuning, generally outperform ProtoNet; in addition, they are mostly comparable to each other. (iii) Our system Entailment consistently obtains the best results across all test classes and the timeline.

Figure 1: Average performance on new classes in different rounds; (a) incremental setting with base classes, (b) incremental setting without base classes. The x-axis is the round number and the y-axis is the average accuracy on the new classes of that round.

To answer question $\mathcal{Q}_2$, we need to quantify the performance changes of all systems along the timeline, not only on $C_b$ but also on all $C_n^i$ ($i = 1, \cdots, 5$). Given a list of $m$ result values $r \in \mathbb{R}^m$, we first use linear regression to fit these numbers: we fit a line $f(t) = at + b$, where $a$ is the slope ($a < 0$ for a dropping curve), $b$ is the intercept, and $t = 1, \cdots, m$ is the time stamp. The performance drop $d$ reflected by this list is calculated as $d = (f(1) - f(m)) / f(1)$. Since the linear regression is more reliable when $m$ is larger, we compute the drop values for $C_b$, $C_n^1$, and $C_n^2$ only, and average them as the final evaluation of a system in response to $\mathcal{Q}_2$; a sketch of this metric is given below.
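
The drop metric can be computed as follows (a sketch; the example numbers are hypothetical):

    import numpy as np

    def performance_drop(results):
        # Fit f(t) = a*t + b over time stamps t = 1..m, then report
        # d = (f(1) - f(m)) / f(1), the relative drop along the fitted line.
        t = np.arange(1, len(results) + 1)
        a, b = np.polyfit(t, results, deg=1)
        f = lambda x: a * x + b
        return (f(1) - f(len(results))) / f(1)

    print(performance_drop([96.1, 95.6, 96.4, 95.4, 95.7, 95.6]))  # hypothetical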

6 Conclusion

In this work, we define a new challenge in the NLP domain: incremental few-shot text classification with multi-round new classes. In addition to the problem formulation, we also release two benchmark datasets for this particular challenge, IFS-Intent and IFS-Relation, and propose a novel approach, Entailment, to solve the problem. Entailment converts text classification into textual entailment, which can be pre-trained on large-scale entailment datasets. The reminder mechanism in Entailment mitigates the catastrophic forgetting problem in the incremental setting. Experiments on these benchmark datasets show the effectiveness of our proposed model.