
Evaluating and Improving Continual Learning in Spoken
Language Understanding

Muqiao Yang1 Xiang Li1 Umberto Cappellazzo2
Shinji Watanabe1 Bhiksha Raj1,3
1Carnegie Mellon University  2University of Trento
3Mohamed bin Zayed University of Artificial Intelligence
{muqiaoy, xl6, bhiksha}@cs.cmu.edu,
[email protected], [email protected]
Abstract

Continual learning has emerged as an increasingly important challenge across various tasks, including Spoken Language Understanding (SLU). In SLU, its objective is to effectively handle the emergence of new concepts and evolving environments. The evaluation of continual learning algorithms typically involves assessing the model's stability, plasticity, and generalizability as fundamental criteria. However, existing continual learning metrics primarily focus on only one or two of these properties. They neglect the overall performance across all tasks, and do not adequately disentangle the plasticity versus stability/generalizability trade-offs within the model. In this work, we propose an evaluation methodology that provides a unified assessment of stability, plasticity, and generalizability in continual learning. By employing the proposed metric, we demonstrate how introducing various levels of knowledge distillation can improve different aspects of these three properties of the SLU model. We further show that our proposed metric is more sensitive in capturing the impact of task ordering in continual learning, making it better suited for practical use-case scenarios.



1 Introduction

Spoken Language Understanding (SLU) focuses on interpreting and understanding human speech in order to extract meaningful information from it Tur and De Mori (2011). With the ubiquity of voice assistants and home devices, SLU systems have become an integral part of our society. While traditional SLU systems make use of the concatenation of an automatic speech recognition (ASR) and a natural language understanding block Mesnil et al. (2014); Coucke et al. (2018), recently, end-to-end (E2E) strategies have gained traction, buttressed by their reduced latency and error propagation Saxon et al. (2021); Arora et al. (2022).

Figure 1: An illustration of the proposed Dual-transfer Matching Index (DMI) vs. other continual learning related metrics, including backward transfer and forward transfer Lopez-Paz and Ranzato (2017). The vertical and horizontal axes represent the sequence of tasks presented to the network for learning. $T$ is the total number of tasks. A circle at index $(i,j)$ denotes the evaluation on task $j$ after finishing training on task $i$. A green circle indicates the evaluated performance of a seen task, while a grey circle assesses unseen classes in future tasks. By covering the whole $T\times T$ matrix, our DMI provides an evaluation of three aspects of model capability: stability, plasticity, and generalizability.


SLU systems are often expected to learn to recognize a continually expanding “vocabulary” of intents, as they encounter them in training data – a task known as continual or lifelong learning. Without appropriate techniques, unlike human cognition, SLU models are susceptible to a phenomenon called catastrophic forgetting French (1999), where previously learned knowledge is lost when acquiring new information. Overlooking previously acquired knowledge or skills as newer ones are acquired can hinder the performance of the models. Therefore, continual learning aims to circumvent the forgetting effect with the goal of continually adapting the model by learning from infinite streams of data Kirkpatrick et al. (2017). An ideal SLU model should be stable, retaining previously acquired knowledge, while also being plastic enough to learn new intents effectively. Furthermore, it should also be generalizable, retaining the ability to learn novel yet-to-be-encountered information. Evaluating the effectiveness of continual learning algorithms requires measuring their performance across these three criteria.

However, existing continual learning evaluation metrics have certain limitations. Among stability, plasticity, and generalizability, most mainstream continual learning evaluation metrics focus on only one or two aspects of the three properties. A brief illustration is provided as a train-evaluation matrix in Figure 1. By immediately evaluating the performance after learning one task, current task accuracy (Figure 1a) only focuses on the diagonal entries of the matrix, thus measuring the plasticity of the model. Backward transfer and forward transfer (Figure 1b and 1c) evaluate the stability and generalizability by focusing on the lower and upper triangular matrix Lopez-Paz and Ranzato (2017). However, they only measure the partial impact of the current task on either past or future tasks, neglecting the overall performance of all seen and unseen tasks. Meanwhile, these metrics do not fully disentangle the plasticity and stability/generalizability trade-offs in continual learning, as their computation involves operations on the diagonal entries in the matrix. As a result, they may not fully capture the dynamic behaviors of models during the training process, and are less sensitive to the order of tasks particularly when the contents of adjacent tasks exhibit large semantic variations.

In this paper, we propose the Dual-transfer Matching Index (DMI) as a unified continual learning metric that evaluates all three properties: stability, plasticity, and generalizability. We compute the DMI by employing the Hungarian maximum matching algorithm to compare model predictions with ground-truth intent labels for all seen and unseen classes at the end of each task. This approach allows us to disentangle and quantify the three properties in a comprehensive manner. Guided by the DMI evaluation, we apply various knowledge distillation (KD) techniques to enhance each of the three properties. We demonstrate that introducing different levels of KD can improve all three criteria and provide a better understanding of the continual learning behaviors of SLU models. Furthermore, we show that our proposed DMI metric is more sensitive than previous metrics in capturing the effect of task ordering.

2 Evaluation in Continual Learning

2.1 Stability-plasticity Dilemma

Figure 2: Train-evaluation performance matrix $\mathbf{A}$ in continual learning. The diagonal entries (red) represent the plasticity with the performance of current tasks. The lower triangular matrix (blue) represents the focused field of stability, while the upper triangular matrix (grey) measures generalizability to unseen tasks.

The main issue that continual learning aims to address was formulated as the stability-plasticity dilemma Mermillod et al. (2013). An ideal system is expected to possess both plasticity, enabling the rapid acquisition of new knowledge, and stability, preventing the forgetting effect of previously learned knowledge.

As two sides of the same coin, stability and plasticity are often regarded as a trade-off between learning and remembering. Assume that a model continuously encounters data from different tasks $\{\mathcal{D}_1, \mathcal{D}_2, \cdots, \mathcal{D}_T\}$ in chronological order, where $T$ denotes the total number of tasks. Across the different $\mathcal{D}_t$'s, the distribution of data and labels may shift. If the model has high memory stability, it does not easily erase the knowledge learned in past tasks, which may in turn hinder its capability to swiftly accommodate new data distributions. Conversely, excessive plasticity may impede the stability of the system by prioritizing adaptation to distribution changes. Therefore, different metrics are often applied to assess the stability and plasticity of continual learning algorithms.

We formulate the metric evaluation in continual learning with the train-evaluation performance matrix $\mathbf{A}$, which is a $T\times T$ matrix where $T$ is the total number of tasks. Each entry $\mathbf{A}_{i,j}$ represents the evaluated performance on task $j$ after the model has learned task $i$. Figure 2 provides a visual representation of the matrix. The entries can be classified into two categories, where the green circles represent the seen tasks and the grey circles represent the unseen tasks.

In this work, we establish a taxonomy of evaluation metrics based on the property that a metric assesses in the model, namely stability, plasticity, and generalizability. We propose a novel metric and compare it with existing metrics in terms of formulation and experimental results. Considering these different perspectives provides valuable insights into the evaluation of continual learning systems.

2.2 Formulation of Continual Metrics

Based on the train-evaluation performance matrix, we provide a formulation of the existing continual learning metrics. Along the diagonal of the train-evaluation matrix in Figure 2, the circles indicate the immediate evaluation of task $i$ right after learning that same task. They therefore reflect the plasticity of the model by showing how well and how quickly the model adapts to new knowledge. Current task accuracy (cur-ACC) is one common metric to assess plasticity, calculated as the average accuracy of tasks from $1$ to the current task $t$ along the diagonal of the evaluation matrix:

\text{cur-ACC}_t=\frac{1}{t}\sum_{i=1}^{t}\mathbf{A}_{i,i},\quad t\in\{1,\cdots,T\} \qquad (1)

The lower triangular part of $\mathbf{A}$ reflects the stability. At task $t$, it covers the evaluation results of all previous tasks $1$ to $t-1$. One of the most straightforward stability-based metrics is last accuracy (last-ACC), which assesses the stability by calculating the average accuracy of all tasks after the last task $T$:

\text{last-ACC}=\frac{1}{T}\sum_{i=1}^{T}\mathbf{A}_{T,i} \qquad (2)

However, last-ACC provides only a one-time evaluation at the end and does not cover all entries in the lower triangular part. Therefore, it provides little information on the stability of the entire learning process. To better quantify the overall stability, backward transfer (BWT) measures the performance degradation of an arbitrary task $t$ on its previous tasks $1$ to $t-1$ Lopez-Paz and Ranzato (2017). It calculates the average difference between the evaluation on a previous task $j$ after learning the current task $i$ ($\mathbf{A}_{i,j}$) and the immediate evaluation of that previous task $j$ ($\mathbf{A}_{j,j}$), i.e.,

\text{BWT}_t=\frac{2}{t(t-1)}\sum_{i=2}^{t}\sum_{j=1}^{i-1}(\mathbf{A}_{i,j}-\mathbf{A}_{j,j}),\quad t\in\{2,\cdots,T\} \qquad (3)

Typically, the BWT score is a negative number, indicating that losing past information is almost inevitable during the learning process of new tasks.

Forward transfer (FWT) is a metric used to assess generalizability, and is commonly viewed as the counterpart of BWT. We follow the implementation of the Continuum library (https://github.com/Continvvm/continuum), which evaluates the influence that learning task $t-1$ has on the performance of the future incoming task $t$. Since the model has not seen task $t$ yet, FWT quantifies generalizability by calculating the gain of the current model on the incoming future task ($\mathbf{A}_{i-1,i}$) compared with the zero-shot evaluation result on that prospective task ($\bar{\mathbf{A}}_i$), i.e.,

\text{FWT}_t=\frac{1}{t-1}\sum_{i=2}^{t}(\mathbf{A}_{i-1,i}-\bar{\mathbf{A}}_{i}),\quad t\in\{2,\cdots,T\} \qquad (4)
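
To make the formulas above concrete, the following is a minimal NumPy sketch (not the authors' code) that computes the four prior metrics from a 0-indexed $T\times T$ performance matrix; A_zero is an assumed array holding the zero-shot accuracies $\bar{\mathbf{A}}_i$.

```python
import numpy as np

def cur_acc(A: np.ndarray, t: int) -> float:
    """Current task accuracy (Eq. 1): mean of the diagonal entries up to task t."""
    return float(np.mean([A[i, i] for i in range(t)]))

def last_acc(A: np.ndarray) -> float:
    """Last accuracy (Eq. 2): mean accuracy over all tasks after the final task T."""
    return float(A[-1, :].mean())

def bwt(A: np.ndarray, t: int) -> float:
    """Backward transfer (Eq. 3): average change on earlier tasks after later training."""
    diffs = [A[i, j] - A[j, j] for i in range(1, t) for j in range(i)]
    return 2.0 / (t * (t - 1)) * sum(diffs)

def fwt(A: np.ndarray, A_zero: np.ndarray, t: int) -> float:
    """Forward transfer (Eq. 4): gain on each upcoming task over its zero-shot accuracy."""
    gains = [A[i - 1, i] - A_zero[i] for i in range(1, t)]
    return sum(gains) / (t - 1)
```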

3 Continual Learning with Unified Evaluation

3.1 DMI: Dual-transfer Matching Index

In this work, going beyond limited and entangled evaluations of stability, plasticity, or generalizability, we propose an evaluation metric named Dual-transfer Matching Index (DMI) to provide a unified evaluation for continual learning. Unlike the existing metrics discussed in Section 2.2, which only show how knowledge transfers in a unidirectional manner, either backward or forward, our metric evaluates the continual learning behavior of the model in terms of all three properties. Meanwhile, instead of entangling plasticity with stability and generalizability, DMI provides a disentangled quantification of each of the three properties.

Assume that our training process elapses along the x-axis of the train-evaluation performance matrix $\mathbf{A}$, with the model continuously learning a sequence of tasks $1,2,\dots,T$. At the end of training task $i$, instead of evaluating the accuracy of one single previous or future task, we evaluate the current model on all seen and unseen classes. However, since the model has not yet gathered information for tasks $t>i$, we can only evaluate these tasks in a class-agnostic manner.

Let us denote the total number of data samples as $N$ and the number of classes as $K$. Specifically, we first extract the predicted intent class embeddings $\{\mathbf{x}^{(i)}_{n}\}_{n=1}^{N}$ at each task $i$. Then we perform k-means clustering $\mathcal{K}(\cdot)$ MacQueen (1967); Lloyd (1982) on $\mathbf{x}^{(i)}$, with the total number of clusters set to $K$. Denoting $\boldsymbol{\mu}_{k}^{(i)}$ as the centroid of cluster $k$ at task $i$, we obtain the class-agnostic assignment as

\mathbf{k}^{(i)}=\mathcal{K}(\mathbf{x}^{(i)}),\quad k_{n}^{(i)}=\arg\min_{k}\|\mathbf{x}_{n}^{(i)}-\boldsymbol{\mu}_{k}^{(i)}\| \qquad (5)

At each task $i$, $\mathbf{k}^{(i)}=(k_{1}^{(i)},k_{2}^{(i)},\dots,k_{N}^{(i)})$ indicates the class-agnostic assignment for all $N$ samples. Class-agnostic means that $\mathbf{k}^{(i)}$ is not aware of the exact class index, as there exist unseen tasks that the model has not learned from yet. We then compute the matching score between the class-agnostic assignment $\mathbf{k}^{(i)}$ and the ground-truth class labels $\mathbf{k}_{\text{gt}}$ using the Hungarian maximum matching algorithm $\mathcal{H}(\cdot,\cdot)$ Kuhn (1955). As a result, $\mathcal{H}(\mathbf{k}^{(i)},\mathbf{k}_{\text{gt}})$ gives us an optimal matched mapping from $\mathbf{k}^{(i)}$ to $\mathbf{k}_{\text{gt}}$. After that, we calculate the accuracies of the best mapping against the ground-truth labels to populate the train-evaluation performance matrix $\mathbf{A}$, i.e.,

\mathbf{k}^{(i)*}=\mathcal{H}(\mathbf{k}^{(i)},\mathbf{k}_{\text{gt}}),\quad \mathbf{A}_{i,j}=\frac{\sum_{k\in\mathcal{J}}\sum_{n}\mathbbm{1}(k_{n}^{(i)*}=k_{\text{gt},n}=k)}{\sum_{k\in\mathcal{J}}\sum_{n}\mathbbm{1}(k_{\text{gt},n}=k)} \qquad (6)

where $\mathbbm{1}(\cdot)$ returns $1$ if the condition is true and $0$ otherwise. $k_{n}^{(i)*}$ and $k_{\text{gt},n}$ denote the $n$th elements of $\mathbf{k}^{(i)*}$ and $\mathbf{k}_{\text{gt}}$, respectively. $\mathcal{J}$ refers to the set of classes that appear in task $j$. The numerator counts the correctly matched samples that belong to task $j$, and the denominator is the total number of samples in task $j$. Therefore, $\mathbf{A}_{i,j}$ is the class-agnostic accuracy of task $j$ after training on task $i$.

We perform similar computations for each task $i$ from $1$ to $T$ to fill all entries of the matrix $\mathbf{A}$. Finally, we compute the DMI as a quantification of stability, plasticity, and generalizability, respectively, by taking the mean of the entries in the corresponding regions shown in Figure 2:

\text{DMI}_{stab}=\frac{2}{T(T-1)}\sum_{i=2}^{T}\sum_{j=1}^{i-1}\mathbf{A}_{i,j},\quad \text{DMI}_{plas}=\frac{1}{T}\sum_{i=1}^{T}\mathbf{A}_{i,i},\quad \text{DMI}_{gen}=\frac{2}{T(T-1)}\sum_{j=2}^{T}\sum_{i=1}^{j-1}\mathbf{A}_{i,j} \qquad (7)

Note that we populate the performance matrix $\mathbf{A}$ differently from prior metrics: each entry is computed as the overall clustering accuracy of a past, current, or future task. In this way, DMI consists of three separate scores that measure the stability, plasticity, and generalizability during the training process of the model. Therefore, it is expected to provide a disentangled quantification of the three properties, and to reflect how the learned knowledge accumulates and evolves in a unified manner.
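
As a concrete illustration, the following is a possible sketch of the DMI computation using scikit-learn k-means and SciPy's Hungarian solver; the names x (predicted intent embeddings), k_gt (ground-truth class indices), and task_of_class (class-to-task mapping) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def class_agnostic_accuracy(x, k_gt, task_of_class, j, K):
    """Eqs. 5-6: cluster embeddings, match clusters to labels with the Hungarian
    algorithm, and score only samples whose ground-truth class belongs to task j."""
    k_gt = np.asarray(k_gt)
    k_pred = KMeans(n_clusters=K, n_init=10).fit_predict(x)   # class-agnostic assignment
    # Build a K x K contingency table and solve the maximum matching.
    contingency = np.zeros((K, K), dtype=int)
    for p, g in zip(k_pred, k_gt):
        contingency[p, g] += 1
    row, col = linear_sum_assignment(-contingency)            # maximize matched counts
    mapping = dict(zip(row, col))
    k_star = np.array([mapping[p] for p in k_pred])           # matched predictions
    in_task_j = np.array([task_of_class[g] == j for g in k_gt])
    return float((k_star[in_task_j] == k_gt[in_task_j]).mean())

def dmi(A):
    """Eq. 7: disentangled stability / plasticity / generalizability scores."""
    T = A.shape[0]
    stab = np.tril(A, k=-1).sum() * 2.0 / (T * (T - 1))   # lower triangle
    plas = np.trace(A) / T                                # diagonal
    gen = np.triu(A, k=1).sum() * 2.0 / (T * (T - 1))     # upper triangle
    return stab, plas, gen
```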

3.2 Pipeline Formulation

We establish our pipeline in a class-incremental learning (CIL) setting in SLU. Specifically, the model is continuously adapted to a sequence of different tasks, and incremental intent labels emerge sequentially across tasks. CIL is a challenging setting in real-world scenarios, as the model is agnostic to task labels during inference time Hsu et al. (2018).

We utilize a combined architecture for automatic speech recognition (ASR) and SLU, as the auxiliary use of ASR transcriptions has been shown to lead to better SLU performance Arora et al. (2022); Cha et al. (2021). Under this setting, we prepend intent tokens to the original transcription and separate them with a special token $\langle\textit{\_SEP}\rangle$. This operation extends the transcription and turns the modeling into a sequence-to-sequence (seq2seq) problem. As a result, our architecture diverges from conventional continual learning pipelines that consist of a feature extractor followed by a classifier. Instead, we employ a single ASR decoder that takes the encoder outputs and generates predicted tokens sequentially. Assume that the input audio-text pair is denoted as $(\mathbf{a},\mathbf{t})$. The goal of the ASR decoder is to find the most probable extended transcription sequence $\mathbf{t}$ given the audio input $\mathbf{a}$, i.e.,

\arg\max_{\mathbf{t}}P(\mathbf{t}|\mathbf{a},\theta)=\arg\max_{\mathbf{t}}P(\mathbf{t}_{1},\mathbf{t}_{2},\cdots,\mathbf{t}_{M}|\mathbf{a},\theta)=\arg\max_{\mathbf{t}}\prod_{m=2}^{M}P(\mathbf{t}_{m}|\mathbf{t}_{m-1},\mathbf{a},\theta) \qquad (8)

where $\theta$ is the parameter set of the model and $M$ is the length of the transcription. Conditional independence is assumed in the last step. $\mathbf{t}_{2}$ is the intent token, succeeding $\langle\textit{CLS}\rangle$ as the first token. The formulation of the overall pipeline is illustrated in Figure 3.
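
For illustration, a hypothetical sketch of the extended transcription target is shown below; the exact special-token strings and intent-token naming are assumptions, not the paper's tokenizer output.

```python
def build_extended_transcription(intent: str, transcription: str) -> str:
    """Construct the seq2seq target described above: <CLS>, then the intent
    token, then <_SEP>, then the original transcription."""
    return f"<CLS> {intent} <_SEP> {transcription}"

# Example (hypothetical intent label and utterance):
# build_extended_transcription("bring_newspaper", "bring me the newspaper")
# -> "<CLS> bring_newspaper <_SEP> bring me the newspaper"
```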

Figure 3: Pipeline overview of our SLU training. Dashed blocks indicate the knowledge distillation from previous tasks.

3.3 Knowledge Distillations

The objective of knowledge distillation (KD) Hinton et al. (2015) aligns with continual learning by transferring knowledge from a teacher model to a student model. In continual learning, the teacher can be the model learned from previous tasks or a large pretrained network, while the student is the model currently being trained. It is noteworthy that a small portion of previous data (typically less than $5\%$), together with its extracted features, is saved into a rehearsal memory for future use by teacher models. We apply multiple types of KD to improve the stability, plasticity, and generalizability of the SLU model. Figure 3 shows the KD techniques that we leverage in our pipeline.

Audio-KD.

Catastrophic forgetting at the audio encoder is one of the main factors affecting the continual learning performance of SLU. To address this issue, we introduce KD at the output of the audio encoder. Let us denote the teacher and student audio encoders as $h_{i-1}^{a}$ and $h_{i}^{a}$, respectively, where $i-1$ and $i$ are task indices, meaning that we distill knowledge from the previously learned model into the current student model. Assume that the sampled rehearsal memory is $\mathcal{M}$. We add an audio-KD loss to maintain the continuity between the audio representations produced by the student audio encoder at the current task $i$ and those of the teacher at the previous task $i-1$ for the rehearsal data, i.e.,

\mathcal{L}_{\mathrm{audio-KD}}=\frac{1}{|\mathcal{M}|}\sum_{(\mathbf{a},\mathbf{t})\in\mathcal{M}}\langle h_{i-1}^{a}(\mathbf{a}),h_{i}^{a}(\mathbf{a})\rangle \qquad (9)

where $\langle\cdot,\cdot\rangle$ denotes cosine similarity. The objective of audio-KD is to regularize the change of the parameters of the audio encoder, such that the student audio encoder retains partial information from the teacher audio encoder. Therefore, it is expected to improve the stability of the model.
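
A hedged PyTorch sketch of this term is given below; the sign convention is our assumption (we minimize the negative cosine similarity so that a lower loss corresponds to better-aligned teacher and student representations), and the encoder interfaces are illustrative.

```python
import torch
import torch.nn.functional as F

def audio_kd_loss(student_enc, teacher_enc, rehearsal_audio):
    """Sketch of the audio-KD in Eq. (9): keep the student audio encoder close to
    the frozen previous-task teacher on samples from the rehearsal memory M."""
    losses = []
    for a in rehearsal_audio:                         # audio drawn from M
        with torch.no_grad():
            z_teacher = teacher_enc(a)                # h_{i-1}^a(a), frozen teacher
        z_student = student_enc(a)                    # h_i^a(a), current encoder
        cos = F.cosine_similarity(z_teacher.flatten(), z_student.flatten(), dim=0)
        losses.append(-cos)                           # assumed sign: maximize similarity
    return torch.stack(losses).mean()
```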

Seq-KD.

As discussed in Section 3.2, our pipeline employs an encoder-decoder architecture to predict the intents. Similar to the audio encoder, the ASR decoder may also suffer from catastrophic forgetting. To address this issue, we introduce a sequence-level KD at the output of the ASR decoder to reduce the forgetting effect at the sequence level. It works by pushing the student model to generate sequences that are close to the sequence-level distribution of the teacher model. The seq-KD loss is defined as

\mathcal{L}_{\mathrm{seq-KD}}=-\frac{1}{|\mathcal{M}|}\sum_{(\mathbf{a},\hat{\mathbf{t}})\in\mathcal{M}}\log P(\hat{\mathbf{t}}|\mathbf{a},\theta)=-\frac{1}{|\mathcal{M}|}\sum_{(\mathbf{a},\hat{\mathbf{t}})\in\mathcal{M}}\log\prod_{m=2}^{M}P(\hat{\mathbf{t}}_{m}|\hat{\mathbf{t}}_{m-1},\mathbf{a},\theta) \qquad (10)

where $\theta$ is the parameter set of the student model. $\hat{\mathbf{t}}$ is the output sequence generated with beam search using the teacher model, and it is saved into the rehearsal memory $\mathcal{M}$ together with the paired audio $\mathbf{a}$ during the previous task.
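
The computation can be sketched as follows, assuming a hypothetical helper student_log_prob(t_hat, a) that returns the student's summed token log-probabilities; this is an illustration of Eq. (10), not the authors' code.

```python
def seq_kd_loss(student_log_prob, rehearsal_pairs):
    """Negative log-likelihood, under the student, of the teacher's beam-search
    outputs t_hat stored in the rehearsal memory M (Eq. 10)."""
    nll = 0.0
    for a, t_hat in rehearsal_pairs:                  # (audio, teacher output) pairs from M
        # student_log_prob is assumed to return sum_m log P(t_hat_m | t_hat_{m-1}, a, theta)
        nll = nll - student_log_prob(t_hat, a)
    return nll / len(rehearsal_pairs)
```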

Sent-KD.

Both audio-KD and seq-KD aim to increase the stability of the system by distilling knowledge from previous tasks; they operate at the encoder and decoder levels of the pipeline, respectively. In addition, we introduce another KD from a pretrained text encoder to our student audio encoder. Its objective is to increase the plasticity and generalizability of the model with pretrained knowledge. With knowledge distilled from the pretrained text representations, the learned audio representation can be aligned to the shared embedding space. By doing so, the generalizability of the model is expected to improve, as the learned audio embedding is assumed to carry sentence-level semantic information similar to that of the text input.

Let us denote the pretrained text encoder as $h^{t}$. It remains frozen during our training process, while the audio encoder is fine-tuned. Since the output of the audio encoder is a sequential embedding, we first average-pool the audio embedding vectors to reduce their temporal dimension, so that they are comparable with the sentence embedding. The sentence KD is then defined as

\bar{h}_{i}^{a}(\mathbf{a})=\mathrm{pool}(h_{i}^{a}(\mathbf{a})),\quad \mathcal{L}_{\mathrm{sent-KD}}=\|h^{t}(\mathbf{t})-\bar{h}_{i}^{a}(\mathbf{a})\|^{2} \qquad (11)

The final loss function used to update the model is then

\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\lambda_{\mathrm{audio}}\mathcal{L}_{\mathrm{audio-KD}}+\lambda_{\mathrm{seq}}\mathcal{L}_{\mathrm{seq-KD}}+\lambda_{\mathrm{sent}}\mathcal{L}_{\mathrm{sent-KD}} \qquad (12)

where $\mathcal{L}_{\mathrm{CE}}$ refers to the cross-entropy loss between the predicted and ground-truth tokens. $\lambda_{\mathrm{audio}}$, $\lambda_{\mathrm{seq}}$, and $\lambda_{\mathrm{sent}}$ are coefficients controlling the weights of the corresponding KD terms.
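
A hedged sketch of Eqs. (11)-(12) is shown below; the encoder interfaces are illustrative, and the default 0.1 weights follow the KD weights reported in Appendix A.1.

```python
import torch

def sent_kd_loss(text_enc, student_audio_enc, audio, text):
    """Eq. (11): average-pool the student's audio embedding over the time axis and
    match it to the frozen pretrained sentence embedding with a squared L2 distance."""
    with torch.no_grad():
        z_text = text_enc(text)                          # h^t(t), frozen text encoder
    z_audio = student_audio_enc(audio).mean(dim=0)       # pooled h_i^a(a) over time
    return ((z_text - z_audio) ** 2).sum()

def total_loss(ce, audio_kd, seq_kd, sent_kd,
               lambda_audio=0.1, lambda_seq=0.1, lambda_sent=0.1):
    """Eq. (12): cross entropy plus the three weighted KD terms."""
    return ce + lambda_audio * audio_kd + lambda_seq * seq_kd + lambda_sent * sent_kd
```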

Overall, we introduce a combination of knowledge distillation techniques to boost the stability, plasticity, and generalizability of our continually learning SLU model. In the next section, we empirically show how these properties, as quantified by different metrics, improve under various settings.

4 Experiments

4.1 Experimental Setup

Dataset          FSC                                              SLURP
Metric           Acc    BWT    FWT    DMI (stab/plas/gen)    Acc    BWT    FWT    DMI (stab/plas/gen)
Fine-tune        29.9   -61.2  0      22.1 / 63.6 / 25.7     31.9   -64.5  0      20.2 / 61.7 / 22.4
Random           68.6   -16.3  63.2   63.7 / 78.2 / 24.9     66.6   -18.7  63.5   63.8 / 75.3 / 22.7
  + audio-KD     72.9   -13.3  67.3   73.0 / 75.9 / 25.6     68.1   -15.9  63.9   70.3 / 70.8 / 24.7
  + seq-KD       75.1   -10.2  71.4   76.4 / 75.6 / 25.3     68.9   -14.7  65.0   72.5 / 70.9 / 25.8
  + sent-KD      76.7   -10.1  72.9   76.9 / 75.7 / 28.8     70.3   -14.0  66.8   73.2 / 71.3 / 28.6
Herding          69.8   -15.4  66.3   66.3 / 78.4 / 24.8     68.1   -16.2  63.7   64.5 / 75.8 / 23.1
  + audio-KD     73.5   -11.7  70.1   78.3 / 75.7 / 25.9     68.7   -12.8  64.0   74.4 / 72.9 / 24.7
  + seq-KD       75.7   -9.6   72.9   79.1 / 75.1 / 26.4     70.6   -12.3  67.1   76.9 / 72.3 / 25.2
  + sent-KD      77.9   -9.0   73.3   80.0 / 75.7 / 30.8     71.1   -12.1  67.9   77.8 / 70.5 / 30.4

Table 1: Quantitative evaluation results in terms of Acc ($\uparrow$), BWT ($\uparrow$), FWT ($\uparrow$), and our proposed DMI ($\uparrow$). The three numbers of DMI are in the order of stability, plasticity, and generalizability.

We base our experiments on the intent classification of the Fluent Speech Commands (FSC) Lugosch et al. (2019) and Spoken Language Understanding Resource Package (SLURP) Bastianelli et al. (2020) datasets. FSC includes 30,043 English utterances, recorded at 16 kHz, covering 248 different utterances mapped to 31 different intents (i.e., our classes). SLURP contains roughly 72K audio recordings of single-turn user interactions with a home assistant, annotated with scenarios, actions, and entities. Overall, there are 18 unique scenarios and 69 intents. Following Cappellazzo et al. (2023), we divide the SLURP dataset into tasks using the scenario labels as the splitting criterion, whereas for FSC we use the intents. We divide each dataset into $T=6$ tasks, with each task containing different classes. Implementation details can be found in Appendix A.1.

By default, we order the tasks by prioritizing the most populated scenarios. This is expected to let the model learn richer information in early tasks, thus mitigating the catastrophic forgetting effect. We provide the overall evaluation results under this default setting in Section 4.2, and further investigate the effect of task orderings in Section 4.3.

4.2 Evaluation Results and Analysis

We evaluate the continual learning performance on both FSC and SLURP with metrics introduced in Section 2.2 including accuracy (Acc), backward transfer (BWT), forward transfer (FWT) Lopez-Paz and Ranzato (2017) and our proposed DMI. The accuracy is reported as last-ACC in Section 2.2. The experimental results are shown in Table 1.

The first row in the table is the naive fine-tuning result, without any continual learning algorithms applied. Therefore, it serves as the lower bound of our performance. The accuracy is relatively low and the BWT gives a substantial negative number, which offers us a general understanding of the extent to which the forgetting effect impacts the performance in the absence of continual learning techniques.

Recall from Section 3.3 that we save a portion of past data as the rehearsal memory for use in later tasks. In our implementation, we set the ratio to $1\%$. To retrieve samples for the rehearsal memory, we employ two sampling strategies: random sampling and herding-based sampling Rebuffi et al. (2017). Unlike random sampling, the herding-based strategy selects the samples that are closest to a small set of exemplars of their class. This is expected to make the sampling process more robust to changes in data representations, thus improving continual learning performance.
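
As an illustration of herding-based selection, here is a hedged sketch of the common greedy variant from Rebuffi et al. (2017), which picks samples whose running mean best approximates the class mean; the exact selection criterion used here may differ in detail.

```python
import numpy as np

def herding_selection(features: np.ndarray, m: int) -> list:
    """Greedily select m exemplar indices for one class from its (N, D) feature matrix."""
    mu = features.mean(axis=0)                     # class prototype
    selected, running_sum = [], np.zeros_like(mu)
    for k in range(1, m + 1):
        # choose the sample that moves the selected-set mean closest to mu
        candidate_means = (running_sum + features) / k
        dists = np.linalg.norm(mu - candidate_means, axis=1)
        dists[selected] = np.inf                   # avoid re-selecting an exemplar
        idx = int(np.argmin(dists))
        selected.append(idx)
        running_sum += features[idx]
    return selected
```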

From Table 1, the results of random and herding-based sampling both yield significantly better performance than the fine-tuning experiment, demonstrating improved continual learning performance. We can also observe that herding-based sampling performs almost consistently better than random sampling.

Additionally, we incorporate the various KD techniques described in Section 3.3 to validate their effectiveness. As introduced previously, audio-KD and seq-KD are expected to increase the stability of the system by regularizing parameter changes at the encoder and decoder levels, while sent-KD may increase the generalizability of continual learning by leveraging semantic information from a pretrained text encoder. Table 1 presents the results of adding the three KDs sequentially. We observe that both BWT and the first number of DMI ($\text{DMI}_{stab}$) increase after adding audio-KD and seq-KD, indicating that the stability of the model is improved. On the other hand, sent-KD boosts generalizability by leveraging pretrained semantic information, reflected in the increase of FWT and the third number of DMI ($\text{DMI}_{gen}$). However, as an effect of the stability-plasticity dilemma discussed in Section 2.1, the plasticity of the model ($\text{DMI}_{plas}$) may decrease as a trade-off when KD techniques are added to regularize the model. The reason is that we penalize parameter changes with KDs from the teacher model of the previous task, thus potentially slowing down learning on the current task. Such a phenomenon is not reflected in prior metrics.

We also note that both BWT and FWT improve with the introduction of additional KD techniques. This is due to their entanglement with plasticity, as shown in Figure 1; therefore, they may not fully reflect the stability/generalizability of the model. As a comparison, our DMI metric provides a disentangled and unified evaluation of all three properties, and a more comprehensive view of how the model behavior changes and evolves during the continual learning process.

Figure 4: Qualitative results showing the change of clustering across tasks.

In addition to the quantitative performance, we show qualitative clustering results with t-SNE Van der Maaten and Hinton (2008) in Figure 4. For visualization purposes, we select one class from each task to plot. Figures 4(a) to (c) depict how the class-agnostic clustering evolves during the learning process from the first task to the last; the greater the distance between clusters of different colors, the better the model's performance. In the beginning, although the model has not yet encountered 5 of the 6 classes from later tasks, it still exhibits some generalizability in distinguishing unseen classes from seen intents. As the continual learning process progresses, the model gains an increasingly better capability to distinguish intents from each other. Finally, Figure 4(c) shows a clustering result with each of the classes clearly separated. This validates the increasing effectiveness and generalizability of our continual learning SLU model from a qualitative perspective.

4.3 Effect of Task Ordering on Evaluation

Metric               Frequency Order    Close-semantic Order    Diverse-semantic Order
Acc                  77.9               75.6                    73.9
BWT                  -9.0               -9.1                    -10.9
FWT                  73.3               74.9                    69.3
$\text{DMI}_{stab}$  80.0               71.5                    60.2
$\text{DMI}_{plas}$  75.7               60.2                    58.2
$\text{DMI}_{gen}$   30.8               44.7                    50.4

Table 2: Evaluation results for the effect of different task orderings on metrics.

Continual learning performance is affected by the ordering of tasks Bell and Lawrence (2022). In the case of SLU, since each transcription and its intent have different semantic meanings, the learning order of tasks may impact the overall performance, including stability, plasticity, and generalizability.

To validate this, we refer to the ordering scheme introduced in Section 4.1 as the frequency order, since it ranks tasks by their frequencies. Alternatively, we rank the tasks based on their semantic meanings in two opposite ways, namely the close-semantic order and the diverse-semantic order. The close-semantic order groups data instances with similar semantic meanings within the same task; for example, utterances with the intents “bring newspaper” and “bring shoes” are grouped together to establish a close semantic relationship within one task. Under the diverse-semantic order, the intents are strategically arranged to ensure a mix of semantic meanings in close proximity.

The effect of different task orderings on the metrics is presented in Table 2. Using the DMI metric, we observe that both the close-semantic and diverse-semantic orders improve generalizability while decreasing stability and plasticity compared to the frequency order, with the impact of the diverse-semantic order being particularly significant. The reason may be that the frequency order benefits stability and plasticity by exposing the model to a larger amount of data early on, while the close-semantic order assists the model by grouping semantically similar classes together within a task. On the other hand, the diverse-semantic order covers a wider range of classes in each task, thereby equipping the model with higher generalizability. However, as a trade-off, the complexity of individual tasks increases, resulting in decreased stability and plasticity of the model.

In comparison, these changing tendencies are not effectively captured by prior metrics. As shown in Table 2, BWT and FWT are largely correlated with Acc: a higher Acc mostly leads to higher BWT and FWT scores. This is because these metrics are entangled with plasticity, and may not fully reflect the stability and generalizability. In contrast, our DMI metric provides a more sensitive measurement of stability, plasticity, and generalizability when the order of tasks changes. This makes our metric better suited for practical use-case scenarios.

5 Related Work

While the majority of research on continual learning has focused on computer vision, there have been notable efforts to extend it to the speech and text domains. In speech, continual learning has been explored for keyword spotting Xiao et al. (2022), ASR Yang et al. (2022); Chang et al. (2021); Diwan et al. (2023), and SLU Cappellazzo et al. (2022, 2023). In NLP, some works aim to equip language models with continual learning capabilities: for example, Razdaibiedina et al. (2023) propose a prompt learning-based method, whereas Ke et al. (2023) combine soft masking units and contrastive learning to alleviate forgetting. A taxonomy of these continual learning approaches is given in Appendix A.2. However, there has been limited prior research on comprehensive continual evaluation metrics. Although De Lange et al. (2022) propose a new set of metrics to quantify the worst-case performance on previous tasks, their work differs from our proposed metric in that it does not provide a unified evaluation encompassing multiple aspects of the continual learning system.

6 Conclusion

In this paper, we propose the Dual-transfer Matching Index as a new evaluation metric for continual learning. It provides a unified and disentangled evaluation in terms of stability, plasticity, and generalizability. Moreover, we show that introducing multiple knowledge distillation techniques helps the SLU model improve on all three criteria. We also empirically demonstrate that our metric is more sensitive than previous evaluation metrics in capturing the effect of task ordering.

7 Limitations

In this work, we experiment with spoken language understanding as the main target task for the proposed evaluation metric; however, the concept could also be adapted to other, similar classification tasks. Specifically in SLU, our contributions include both the proposed generalized DMI evaluation metric and improved SLU training with different knowledge distillation methods, whose benefits for stability, plasticity, and generalizability are reflected by DMI. In other relevant tasks, due to different model architectures and task settings, the observed stability/plasticity/generalizability is also likely to differ, and the specific techniques for improving these three capabilities in continual learning models may depend on the selected task.

References

  • Adel et al. (2019) Tameem Adel, Han Zhao, and Richard E Turner. 2019. Continual learning with adaptive weights (claw). arXiv preprint arXiv:1911.09514.
  • Aljundi et al. (2018) Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154.
  • Arora et al. (2022) Siddhant Arora, Siddharth Dalmia, Brian Yan, Florian Metze, Alan W Black, and Shinji Watanabe. 2022. Token-level sequence labeling for spoken language understanding using compositional end-to-end models. arXiv preprint arXiv:2210.15734.
  • Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.
  • Bang et al. (2021) Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. 2021. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8218–8227.
  • Bastianelli et al. (2020) Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser. 2020. Slurp: A spoken language understanding resource package. arXiv preprint arXiv:2011.13205.
  • Bell and Lawrence (2022) Samuel J Bell and Neil D Lawrence. 2022. The effect of task ordering in continual learning. arXiv preprint arXiv:2205.13323.
  • Cappellazzo et al. (2022) Umberto Cappellazzo, Daniele Falavigna, and Alessio Brutti. 2022. Exploring the joint use of rehearsal and knowledge distillation in continual learning for spoken language understanding. arXiv preprint arXiv:2211.08161.
  • Cappellazzo et al. (2023) Umberto Cappellazzo, Muqiao Yang, Daniele Falavigna, and Alessio Brutti. 2023. Sequence-level knowledge distillation for class-incremental end-to-end spoken language understanding. arXiv preprint arXiv:2305.13899.
  • Cha et al. (2021) Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Kuo, Samuel Thomas, and Edmilson Morais. 2021. Speak or chat with me: End-to-end spoken language understanding system with flexible inputs. arXiv preprint arXiv:2104.05752.
  • Chang et al. (2021) Heng-Jui Chang, Hung-yi Lee, and Lin-shan Lee. 2021. Towards lifelong learning of end-to-end asr. arXiv preprint arXiv:2104.01616.
  • Chang et al. (2022) Heng-Jui Chang, Shu-wen Yang, and Hung-yi Lee. 2022. Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit bert. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7087–7091. IEEE.
  • Coucke et al. (2018) Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
  • De Lange et al. (2022) Matthias De Lange, Gido van de Ven, and Tinne Tuytelaars. 2022. Continual evaluation for lifelong learning: Identifying the stability gap. arXiv preprint arXiv:2205.13452.
  • Diwan et al. (2023) Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, and Abdelrahman Mohamed. 2023. Continual learning for on-device speech recognition using disentangled conformers. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
  • Douillard et al. (2020) Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. 2020. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 86–102. Springer.
  • French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Hsu et al. (2018) Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, and Zsolt Kira. 2018. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488.
  • Hu et al. (2021) Xinting Hu, Kaihua Tang, Chunyan Miao, Xian-Sheng Hua, and Hanwang Zhang. 2021. Distilling causal effect of data in class-incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 3957–3966.
  • Ke et al. (2023) Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2023. Continual pre-training of language models. In The Eleventh International Conference on Learning Representations.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.
  • Kuhn (1955) Harold W Kuhn. 1955. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97.
  • Li and Hoiem (2017) Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947.
  • Lloyd (1982) Stuart Lloyd. 1982. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137.
  • Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30.
  • Lugosch et al. (2019) Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. 2019. Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670.
  • MacQueen (1967) J MacQueen. 1967. Classification and analysis of multivariate observations. In 5th Berkeley Symp. Math. Statist. Probability, pages 281–297. University of California Los Angeles LA USA.
  • Mermillod et al. (2013) Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. 2013. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects.
  • Mesnil et al. (2014) Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2014. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.
  • Park et al. (2019) Dongmin Park, Seokil Hong, Bohyung Han, and Kyoung Mu Lee. 2019. Continual learning by asymmetric loss approximation with single-side overestimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3335–3344.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  • Razdaibiedina et al. (2023) Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. 2023. Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314.
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
  • Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. 2019. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32.
  • Saxon et al. (2021) Michael Saxon, Samridhi Choudhary, Joseph P McKenna, and Athanasios Mouchtaris. 2021. End-to-end spoken language understanding for generalized voice assistants. arXiv preprint arXiv:2106.09009.
  • Schwarz et al. (2018) Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. 2018. Progress & compress: A scalable framework for continual learning. In International conference on machine learning, pages 4528–4537. PMLR.
  • Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Shin et al. (2017) Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. 2017. Continual learning with deep generative replay. Advances in neural information processing systems, 30.
  • Smith et al. (2023) James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. 2023. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919.
  • Tur and De Mori (2011) Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).
  • Wang et al. (2021) Liyuan Wang, Mingtian Zhang, Zhongfan Jia, Qian Li, Chenglong Bao, Kaisheng Ma, Jun Zhu, and Yi Zhong. 2021. Afec: Active forgetting of negative transfer in continual learning. Advances in Neural Information Processing Systems, 34:22379–22391.
  • Wang et al. (2023) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. 2023. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487.
  • Wang et al. (2022) Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. 2022. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149.
  • Xiang et al. (2019) Ye Xiang, Ying Fu, Pan Ji, and Hua Huang. 2019. Incremental learning using conditional adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6619–6628.
  • Xiao et al. (2022) Yang Xiao, Nana Hou, and Eng Siong Chng. 2022. Rainbow keywords: Efficient incremental learning for online spoken keyword spotting. arXiv preprint arXiv:2203.16361.
  • Yan et al. (2021) Shipeng Yan, Jiangwei Xie, and Xuming He. 2021. Der: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3014–3023.
  • Yang et al. (2022) Muqiao Yang, Ian Lane, and Shinji Watanabe. 2022. Online continual learning of end-to-end speech recognition models. Proceedings of Interspeech, pages 2668–2672.
  • Zhou et al. (2023) Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. 2023. Deep class-incremental learning: A survey. arXiv preprint arXiv:2302.03648.
  • Zhou et al. (2022) Da-Wei Zhou, Qi-Wei Wang, Han-Jia Ye, and De-Chuan Zhan. 2022. A model or 603 exemplars: Towards memory-efficient class-incremental learning. arXiv preprint arXiv:2205.13218.

Appendix A Appendix

A.1 Implementation Details

For both datasets, the text encoder utilizes a standard text embedding layer of size 768. We use Sentence-BERT Reimers and Gurevych (2019) as the pretrained text encoder to extract the sentence-level information. Regarding the audio encoder, for SLURP we employ a base Wav2vec 2.0 model Baevski et al. (2020) pre-trained and fine-tuned on 960 hours of LibriSpeech (∼94.3M parameters), while for FSC we utilize the base DistilHuBERT Chang et al. (2022) (∼23.5M parameters). As FSC is a less challenging dataset than SLURP, we found that a smaller pre-trained encoder is sufficient to achieve state-of-the-art results. Both encoders have a hidden size of 768, and their feature extractors remain fixed during training. Similar to Radford et al. (2021), we employ linear projection layers to map the representations from each encoder to the audio-text embedding space, which has a dimension of 512. The ASR decoder is a transformer with 6 layers, a hidden size of 768, 8 attention heads, and feedforward layers with a dimension of 2048.
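
For concreteness, the projection into the shared audio-text space can be sketched as follows (a hedged illustration in the spirit of Radford et al. (2021); layer and variable names are our own, not the released code):

```python
import torch
import torch.nn as nn

# Map 768-dimensional encoder outputs to the 512-dimensional shared space.
audio_proj = nn.Linear(768, 512)
text_proj = nn.Linear(768, 512)

audio_feat = torch.randn(1, 768)    # e.g. a pooled audio-encoder output
text_feat = torch.randn(1, 768)     # e.g. a Sentence-BERT sentence embedding
audio_emb = audio_proj(audio_feat)  # (1, 512) in the shared embedding space
text_emb = text_proj(text_feat)     # (1, 512) in the shared embedding space
```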

For tokenization, we utilize Byte-Pair Encoding (BPE) Sennrich et al. (2015) with a vocabulary size of 1,000 and a BPE dropout rate of 0.1 for SLURP. For FSC, due to the limited number of unique words, we use word tokenization, resulting in 139 tokens. BPE automatically assigns a dedicated token to each intent, while for FSC we manually add the intent tokens. During inference, and for the computation of the soft labels for seq-KD, we apply beam search with a beam width of 10 for FSC and 20 for SLURP. We use the validation set to tune the hyperparameters and select the best model for each task. All KD weights are set to $0.1$.

A.2 Taxonomy of continual learning approaches

Following the standard nomenclature, continual learning strategies can be grouped into a few categories, according to the approach they rely on Wang et al. (2023); Zhou et al. (2023). Regularization-based methods introduce ad-hoc regularization terms to combat forgetting. Some methods regularize the weights of the network Kirkpatrick et al. (2017); Schwarz et al. (2018); Aljundi et al. (2018); Park et al. (2019); Wang et al. (2021); Adel et al. (2019), whereas others penalize changes to the model’s intermediate or final outputs, usually by means of knowledge distillation (KD) Hinton et al. (2015). Denoting the model trained on the previous task as the teacher and the model trained on the current task as the student, KD fosters the transfer of the knowledge accrued in the teacher onto the student. We can identify logit distillation methods Li and Hoiem (2017); Rebuffi et al. (2017), which align the predictions of the student and teacher models, and feature distillation methods, which operate on the intermediate feature embeddings produced by the feature encoders Douillard et al. (2020); Hu et al. (2021). It is also possible to combine logit and feature distillation Cappellazzo et al. (2022).

Replay-based methods either maintain a memory buffer in which a subset of old training samples is stored Rolnick et al. (2019); Bang et al. (2021) (experience replay) or generate data samples by means of a dedicated generative model Shin et al. (2017); Xiang et al. (2019) (generative replay). Finally, architecture-based approaches introduce task-specific parameters, either by expanding the network itself Yan et al. (2021); Zhou et al. (2022) or by keeping the network frozen and learning a small number of additional parameters (i.e., prompts) Wang et al. (2022); Smith et al. (2023).