PROMPT LEARNING WITH KNOWLEDGE MEMORIZING PROTOTYPES FOR GENERALIZED FEW-SHOT INTENT DETECTION

Abstract

Generalized Few-Shot Intent Detection (GFSID) is challenging and realistic because it needs to categorize both seen and novel intents simultaneously. Previous GFSID methods rely on the episodic learning paradigm, which makes it hard to extend to a generalized setup as they do not explicitly learn the classification of seen categories and the knowledge of seen intents. To address the dilemma, we propose to convert the GFSID task into the class incremental learning paradigm. Specifically, we propose a two-stage learning framework, which sequentially learns the knowledge of different intents in various periods via prompt learning. And then we exploit prototypes for categorizing both seen and novel intents. Furthermore, to achieve the transfer knowledge of intents in different stages, for different scenarios we design two knowledge preservation methods which close to realistic applications. Extensive experiments and detailed analyses on two widely used datasets show that our framework based on the class incremental learning paradigm achieves promising performance.

Index Terms— Intent Detection, Knowledge Preservation, Generalized Few-Shot Learning

1 Introduction

Intent Detection is a vital component in dialog systems, which categorizes user utterances into corresponding intent and then systems execute the corresponding instructions or other downstream tasks. In the real world, the systems have two major problems that need to be solved urgently: (1). Lacking a large amount of labeled data. Because annotating data is costly and the utterances of users are complex and diverse. (2). Systems require the addition of novel intents to accommodate more needs. So in this paper, we are interested in Generalized Few-Shot Intent Detection (GFSID), which is more challenging and realistic. In this setting, models need to classify both seen intents and novel intents simultaneously.

For the problem of data scarcity [1, 2], previous researchers have devoted many efforts in several different areas [3, 4, 5, 6, 7]. Some valuable backbone networks have emerged, such as prototype network [8], matching network [9], and relation network [10, 11].

These above methods are hard to extend to a generalized setup directly where both seen and novel intents co-exist [12]. In response to this situation, [13] propose a solution first in the field of intent detection. [13] considers that previous few-shot approaches rely on a static similarity measure and overly fine-grained matching components, which inhibit generalizing capability towards GFSID. So they propose a Semantic Matching and Aggregation Network which exploits additional knowledge by automatically extracting multiple semantic components. Besides, [14] thinks a few instances can not cover the diversity of user expressions, which leads to previous few-shot approaches falling shot in adapting to the generalized setup. To deal with this issue, [14] introduces a diversity feature enhanced prototypical network, which generates diversity features to enhance novel intents representation using an auxiliary set. However, both two methods ignore the learning of classification of seen categories and knowledge of seen intents.

To address the dilemma, we propose to convert the GFSID task into the class incremental learning paradigm and introduce a two-stage learning framework. Motivated by recent work around prompt learning, we design a predefined template for improving performance with the use of PLMs. Since prototypes have a good generalized capability [8] and are easy to transfer knowledge in vector space, we use prototypes for classification toward both seen and novel intents. During training, our framework first learns the knowledge of seen intents with a large amount of labeled data. And then in the second phase, the framework learns the knowledge of novel intents with a few labeled data. In the meantime, we use two methods to memorize the prototypes of seen intents and transfer old knowledge for the limitations of different application scenarios.

The main contributions of our work are as follows:

•

We innovatively convert the GFSID task into the class incremental learning paradigm and propose a framework for learning knowledge of seen and novel intents.
•

To close to realistic applications, we present two knowledge preservation methods for memorizing the prototypes of seen intents and transferring old knowledge for different scenarios.
•

We evaluate our framework in two publicly available intent detection datasets. Extensive experiments show that our framework significantly outperforms previous methods in most experimental settings.

2 RELATED WORK

Generalized Few-Shot Learning. Few-shot learning (FSL) methods relied on a meta-learning paradigm, which trained models to recognize novel classes with a limited number of samples. Seminal methods for FSL were introduced[8, 9, 10]. Building upon these foundations, subsequent research has explored multi-level matching, aggregation approaches and adversarial adaptation networks [3, 4, 5, 15, 16]. Generalized Few-Shot Learning(GFSL) represents a novel research direction that is still being explored. [13] leverage semantic knowledge to provide additional information, while [14] generate diverse features to enhance the representation of novel intents using an auxiliary set. [17] propose a system capable of learning novel categories with limited training data while retaining knowledge of previously learned base classes.

Prompting PLM. Inspired by GPT-3 [18], many researchers have employed “Prompts” or “Templates” to establish connections between pre-training tasks and downstream tasks[19, 20, 21, 22]. Another crucial component in prompt learning is the ”Verbalizer,” which maps the model output to the corresponding label [19, 20].

3 METHODOLOGY

3.1 Prompt-based Learning for Intent Detection

We represent the input sentence as $x=[w_{1},...,w_{n}]$ , and the corresponding output label as $y$ . We introduce a template function $T(\cdot)$ that takes $x$ as input and generates $x_{prompt}=T(x)=x.\ The\ intent\ is\ to\ [MASK]$ . Furthermore, our framework aims to achieve effective adaptation and low computational cost when learning novel classes with limited samples. To accomplish this, we utilize prototypes as mapping objects for each original label. Specifically, given a PLM $M$ , the hidden vector at the position of the [MASK] token in the input sentence wrapped with the template is represented as:

h_{[MASK]}=M(T(x))

(1)

where $h_{[MASK]}\in\mathbb{R}^{h}$ is the last layer’s hidden representation of the PLM $M$ . Then, the hidden vector in prototype space can be computed as follow:

v_{[MASK]}=Wh_{[MASK]}

(2)

where $v_{[MASK]}\in\mathbb{R}^{c}$ and $W$ are the encoding representation of $h_{[MASK]}$ in prototype space and a matrix for linear transformation respectively. Then we apply the cosine similarity function to calculate the similarity between two samples in prototype space, which is expressed in the following form:

similarity=Sim(v_{i},v_{j})=\frac{v_{i}\cdot v_{j}}{\|{v_{i}}\|\|{v_{j}}\|}

(3)

3.2 Knowledge Preservation

Refer to caption — Fig. 1: Overview of our framework. We utilize the class incremental learning paradigm to solve the GFSID task.

Data-Agnostic Knowledge Preservation. It is critical to transfer knowledge for memorizing the prototypes of seen intents. In real-life scenarios, we cannot access past data repeatedly for some reasons such as protecting user privacy. In this case, we propose to utilize explicit weight constraints for the model, whose parameters are adaptive to seen intents. We formulate it in terms of the L2 penalization:

L^{\theta}_{2}=\sum_{p}^{\theta}{\|{p}_{joint}-{p}_{seen}\|}^{2}

(4)

where $p_{seen}$ denotes adaptive parameters of the PLM $M$ and seen prototypes after training in the data of seen intents. $p_{joint}$ indicates the adaptive parameters excluding novel prototypes after training in the data of joint intents. The weight constraints method mentioned above forces the model to retain more old knowledge on seen intents but also allows the model to adjust the parameters to fit joint intents.

Data-Dependent Knowledge Preservation. In some cases, the system or the model is still permitted to access past data. Some approaches usually maintain a virtual memory for storing old data in the lifelong learning task[23]. And when novel classes emerge, the system will be trained with old data and novel data together.

In this paper, we keep a virtual fixed-size memory to store the data of seen intents, whose data are randomly selected from the original dataset in a constant ratio. And to transfer the knowledge of seen intents we use knowledge distillation [24]. To be specific, we use the model whose parameters are adapted in seen intents to get soft labels of data stored in the memory instead of hard labels. Then when training with the data of joint intents, we distill the knowledge from the old model by minimizing the loss function in Eq. 5:

\begin{split}L_{KD}&=-\frac{1}{N}\sum_{i}^{N}p_{i}\log{q_{i}}\end{split}

(5)

where soft labels calculated by the old model and predicted probabilities of N classes calculated by the new model are represented as $p$ and $q$ respectively. N denotes the number of seen intents.

3.3 Supervised Contrastive Learning

To better understand the user’s intent and achieve a superior representation, inspired by [25, 26] we utilize the supervised contrastive learning method to make a sample have high similarity with other samples of the same intent and high similarity with its prototype simultaneously. Specifically, we treat two samples belonging to the same intent as a positive pair and two samples from different intents as a negative pair. Therefore, the loss function to be minimized is as follows:

L_{ii}=\frac{-1}{T^{2}}\sum_{i=1}^{T}\sum_{j=1}^{T}\log\frac{1_{y_{i}=y_{j}}\cdot\exp Sim(v_{i},v_{j})}{\sum_{k=1}^{T}\exp Sim(v_{i},v_{k})}

(6)

where $T$ denotes the number of total instances. $v_{i}$ and $v_{j}$ are samples from the same intent. In addition, the loss function between samples and prototypes is represented as:

L_{is}=\frac{-1}{CT}\sum_{i=1}^{C}\sum_{j=1}^{T}\log\frac{\exp Sim(v_{j},c_{i})}{\sum_{k=1}^{C}Sim(v_{j},c_{k})}

(7)

The i-th prototype and the number of prototypes are denoted as $c_{i}$ and $C$ respectively.

3.4 Overall Framework

As shown in Figure 1, the entire framework is divided into two phases. In the 1st Phase, the framework learns the knowledge of seen intents with a large amount of labeled data. During the 2nd Phase, we initialize prototypes of novel intents in the prototype space and then fine-tune the parameters to fit all intents using a few instances while transferring old knowledge to new vector space with $L_{2}^{\theta}$ or Knowledge Distillation.

1st Phase - Seen Intents Training. We denote $\theta_{seen}$ as the parameters of PLM and seen prototypes. In order to explicitly learn the classification of seen intents, we train $\theta_{seen}$ on a great deal of labeled data of seen intents. The overall loss for the first phase is defined as:

L=L_{cls}+L_{ii}+L{is}

(8)

2nd Phase - Joint Intents Training and Knowledge Transfer. We denote $\theta_{joint}$ as the parameters excluding novel intents prototypes during the second phase. In this phase, a few instances of seen and novel intents will emerge together. We optimize all parameters with a few instances of seen and novel intents. In the meantime, to memorize old knowledge and the prototypes of seen intents, we use different methods for different application scenarios. For scenarios where accessing old data is not allowed due to privacy policy or other issues, the optimization goal is:

L=L_{cls}+L_{ii}+L{is}+\lambda L_{2}^{\theta}

(9)

For scenarios that allow access to old data, we design a virtual memory for storing a small amount of old data and then use knowledge distillation to consolidate old knowledge. The overall loss is thus:

L=L_{cls}+L_{ii}+L{is}+L_{KD}

(10)

Table 1: Experiment results of GFSID (1-shot and 5-shot) on intent detection datasets (SNIPS and NLUE).

Model	SNIPS				NLUE
	1-shot		5-shot		1-shot		5-shot
	noneps	eps	noneps	eps	noneps	eps	noneps	eps
MN (2016)	73.50	82.67	77.31	84.60	62.30	76.21	56.27	78.85
PN (2017)	71.61	87.04	85.31	91.05	62.63	80.78	66.20	85.13
RN (2018)	74.94	85.63	64.09	79.25	56.75	73.57	46.50	75.23
HATT (2019)	71.54	84.51	86.53	91.85	64.01	81.39	67.86	78.41
MLMAN (2019)	78.61	87.77	79.58	89.27	63.12	82.65	60.70	84.45
HAPN (2019)	74.33	85.37	86.19	89.40	60.44	82.00	68.34	84.75
SMAN (2020)	81.85	88.10	87.87	93.18	66.10	89.54	72.18	87.76
MLADA (2021)	71.15	79.14	88.90	92.89	45.31	67.91	61.93	82.56
DFEPN (2022)	92.30	95.41	93.42	96.27	72.09	87.35	82.28	93.17
DDKP (ours)	94.11	93.99	96.33	96.91	82.27	89.91	88.77	93.94
DAKP (ours)	94.10	94.18	96.58	96.85	80.75	89.75	88.29	93.66

4 EXPERIMENT

Following the previous methodology [13] we perform extensive experiments on SNIPS and NLUE datasets to demonstrate the effectiveness of our framework. We conduct performance evaluations of GFSID on two datasets. The GFSID testing samples comprise 20% instances from the seen intents and all instances from the novel intents. The remaining 80% of samples from seen intents are used for training. For testing, support instances are pre-selected for both the 1-shot and 5-shot settings. Besides, we compare the proposed approach with strong baselines[3, 4, 5, 8, 9, 10, 13, 14, 15]. All performance results of the methods selected for comparison are taken from [14].

Our framework is implemented using PyTorch and the OpenPrompt toolkit [27]. $RoBERTa_{base}$ [28] is used as the text encoder. In the first Phase, we utilize the Adam optimizer with a learning rate of $1e-5$ and train with a batch size of 64. In the second Phase, the learning rate is $1e-4$ . For SNIPS, the batch size is set to 5. Training all parameters for 20 epochs. For NLUE, we set the batch size to 16 and train all parameters for 15 epochs. Following previous approach [13], we evaluate all methods in two different settings: Non-episodic Evaluation (noneps) and Episodic Evaluation (eps).

Episodic Evaluation: 1000 episodes for C-way K-shot testing, where C is 4 or 10 for SNIPS and NLUE, respectively. Query set consists of 5 samples for each intent.

Non-episodic Evaluation: Conduct a single round of testing on all unlabeled samples, which is more challenging and realistic.

4.1 Experimental Results

Our primary findings are presented in Table 1. For the SNIPS dataset, we computed the average accuracy across 3 distinct random seeds for 3 experimental groups. To evaluate our framework on the NLUE dataset, we employed 10-fold cross-validation. The experimental results demonstrate that our approach consistently outperforms previous methods in the majority of scenarios. Particularly in the more challenging and realistic noneps setting, our method demonstrates significant improvements of 10.18% and 6.49% for 1-shot and 5-shot respectively, as compared to the baseline. In the eps setting, our method achieve improvements of 2.56% and 0.77% in the 1-shot and 5-shot setting respectively, on the NLUE dataset. On the SNIPS dataset, our method exhibit a 0.64% improvement in the 5-shot setting. Additionally, employing the data-dependent knowledge preservation method yields superior performance in most cases.

4.2 Ablation Study

Table 2: The comparison of two knowledge preservation methods and directly learning without any knowledge preservation. Results on SNIPS dataset (1-shot and 5-shot noneps).

Method	1-shot	5-shot
DDKP	94.11	96.33
DAKP	94.10	96.58
w/o $L_{KD}$ w/o $L_{2}^{\theta}$	88.98	95.49

The observations from Figure 2 and Table 2 are as follows:

(1) Our proposed DDKP and DAKP play a crucial role in preserving old knowledge, and the effects are significant.

(2) DDKP provides more stable retention of old knowledge. This is because DAKP directly relies on parameter constraints to achieve knowledge preservation, which leads to some performance loss.

(3) DDKP and DAKP have an advantage when learning novel intents knowledge. This is because seen intents and novel intents share common knowledge, and the retained common knowledge can assist in learning novel intents knowledge.

5 CONCLUSION

In this paper, we innovatively introduce a conversion of the GFSID task into the class incremental learning paradigm to extend it to a generalized setup. We propose a straightforward and efficient two-stage learning framework for sequentially acquiring knowledge of various intents. Additionally, to facilitate the transfer of intent knowledge across different stages, we introduce two knowledge preservation methods that closely align with realistic applications. We conduct extensive experiments and demonstrate the effectiveness of our proposed framework through results obtained from two real-world intent detection datasets.

References

[1] Shirong Ma, Yinghui Li, Rongyi Sun, Qingyu Zhou, Shulin Huang, Ding Zhang, Yangning Li, Ruiyang Liu, Zhongli Li, Yunbo Cao, Haitao Zheng, and Ying Shen, “Linguistic rules-based corpus generation for native chinese grammatical error correction,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, Eds. 2022, pp. 576–589, Association for Computational Linguistics.
[2] Yinghui Li, Shulin Huang, Xinwei Zhang, Qingyu Zhou, Yangning Li, Ruiyang Liu, Yunbo Cao, Hai-Tao Zheng, and Ying Shen, “Automatic context pattern generation for entity set expansion,” CoRR, vol. abs/2207.08087, 2022.
[3] Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun, “Hybrid attention-based prototypical networks for noisy few-shot relation classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 6407–6414.
[4] Shengli Sun, Qingfeng Sun, Kevin Zhou, and Tengchao Lv, “Hierarchical attention prototypical networks for few-shot text classification,” in Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 2019, pp. 476–485.
[5] Zhi-Xiu Ye and Zhen-Hua Ling, “Multi-level matching and aggregation network for few-shot relation classification,” arXiv preprint arXiv:1906.06678, 2019.
[6] Ruiyang Liu, Yinghui Li, Linmi Tao, Dun Liang, and Hai-Tao Zheng, “Are we ready for a new paradigm shift? A survey on visual deep MLP,” Patterns, vol. 3, no. 7, pp. 100520, 2022.
[7] Yinghui Li, Qingyu Zhou, Yangning Li, Zhongli Li, Ruiyang Liu, Rongyi Sun, Zizhen Wang, Chao Li, Yunbo Cao, and Hai-Tao Zheng, “The past mistake is the future wisdom: Error-driven contrastive probability optimization for chinese spell checking,” in Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, Eds. 2022, pp. 3202–3213, Association for Computational Linguistics.
[8] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
[9] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning,” Advances in neural information processing systems, vol. 29, 2016.
[10] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1199–1208.
[11] Yinghui Li, Yangning Li, Yuxin He, Tianyu Yu, Ying Shen, and Hai-Tao Zheng, “Contrastive learning with hard negative entities for entity set expansion,” in SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, Enrique Amigó, Pablo Castells, Julio Gonzalo, Ben Carterette, J. Shane Culpepper, and Gabriella Kazai, Eds. 2022, pp. 1077–1086, ACM.
[12] Yangning Li, Yinghui Li, Xi Chen, Hai-Tao Zheng, Ying Shen, and Hong-Gee Kim, “Active relation discovery: Towards general and label-aware open relation extraction,” CoRR, vol. abs/2211.04215, 2022.
[13] Hoang Nguyen, Chenwei Zhang, Congying Xia, and S Yu Philip, “Dynamic semantic matching and aggregation network for few-shot intent detection,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1209–1218.
[14] Fengyi Yang, Xi Zhou, Yi Wang, Abibulla Atawulla, and Ran Bi, “Diversity features enhanced prototypical network for few-shot intent detection,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Lud De Raedt, Ed. 7 2022, pp. 4447–4453, International Joint Conferences on Artificial Intelligence Organization, Main Track.
[15] Chengcheng Han, Zeqiu Fan, Dongxiang Zhang, Minghui Qiu, Ming Gao, and Aoying Zhou, “Meta-learning adversarial domain adaptation network for few-shot text classification,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 1664–1673.
[16] Yinghui Li, Yong Jiang, Shen Huang, Xingyu Lu, Yangning Li, Pengjun Xie, Fei Huang, and Hai-Tao Zheng, “Bidirectional end-to-end learning of retriever-reader paradigm for entity linking,” CoRR, vol. abs/2306.12245, 2023.
[17] Spyros Gidaris and Nikos Komodakis, “Dynamic few-shot visual learning without forgetting,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4367–4375.
[18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[19] Timo Schick and Hinrich Schütze, “Exploiting cloze-questions for few-shot text classification and natural language inference,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 255–269.
[20] Tianyu Gao, Adam Fisch, and Danqi Chen, “Making pre-trained language models better few-shot learners,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3816–3830.
[21] Brian Lester, Rami Al-Rfou, and Noah Constant, “The power of scale for parameter-efficient prompt tuning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3045–3059.
[22] Yinghui Li, Haojing Huang, Shirong Ma, Yong Jiang, Yangning Li, Feng Zhou, Hai-Tao Zheng, and Qingyu Zhou, “On the (in)effectiveness of large language models for chinese text correction,” CoRR, vol. abs/2307.09007, 2023.
[23] Vaibhav Varshney, Mayur Patidar, Rajat Kumar, Lovekesh Vig, and Gautam Shroff, “Prompt augmented generative replay via supervised contrastive learning for lifelong intent detection,” in Findings of the Association for Computational Linguistics: NAACL 2022, 2022, pp. 1113–1127.
[24] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al., “Distilling the knowledge in a neural network,” .
[25] Beliz Gunel, Jingfei Du, Alexis Conneau, and Ves Stoyanov, “Supervised contrastive learning for pre-trained language model fine-tuning,” arXiv preprint arXiv:2011.01403, 2020.
[26] Yinghui Li, Shirong Ma, Qingyu Zhou, Zhongli Li, Yangning Li, Shulin Huang, Ruiyang Liu, Chao Li, Yunbo Cao, and Haitao Zheng, “Learning from the dictionary: Heterogeneous knowledge guided fine-tuning for chinese spell checking,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, Eds. 2022, pp. 238–249, Association for Computational Linguistics.
[27] Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Haitao Zheng, and Maosong Sun, “Openprompt: An open-source framework for prompt-learning,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2022, pp. 105–113.
[28] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.