
Visual Prompt Tuning in Null Space for Continual Learning

Yue Lu1,  Shizhou Zhang1,  De Cheng2,  Yinghui Xing1,
 Nannan Wang2,  Peng Wang1,  Yanning Zhang1
1 School of Computer Science, Northwestern Polytechnical University, China
2 School of Telecommunications Engineering, Xidian University, China
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
Corresponding authors
Abstract

Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL) by selecting and updating relevant prompts in vision-transformer models. In contrast, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference with tasks that have been learned and thereby overcome catastrophic forgetting in CL. However, unlike orthogonal projection in the traditional CNN architecture, prompt gradient orthogonal projection in the ViT architecture poses completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation, and 2) the drift of the prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we deduce two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution is proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performance to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL .

1 Introduction

Continual learning (CL) is crucial for AI models to adapt to ever-changing environments by learning from sequentially arriving data, where catastrophic forgetting is the key challenge [21, 28]. Recently, prompt tuning-based continual learning methods [40, 32, 34, 43, 10, 22, 38, 45, 20, 12, 18] have been attracting increasing attention due to their impressive performance in the CL field. Existing prompt tuning-based works tackle the downstream continual learning problem by selecting and updating relevant prompts, which encode task-specific knowledge while exploiting the general knowledge of the pre-trained ViTs [40, 39].

In contrast, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference with tasks that have been learned and thus overcome catastrophic forgetting in CL. It is worth noting that forgetting can be theoretically resolved by gradient orthogonal projection methods [42, 31, 36, 44], which have been extensively explored, especially for adapting CNN models. Nevertheless, a huge gap remains in introducing the orthogonal projection-based methods of CNNs to visual prompt tuning, due to the following challenges: 1) the high-order and non-linear self-attention operation; 2) the drift of the prompt distribution brought by the LayerNorm in the transformer block. For the linear operation in convolutional or fully-connected layers, the output features of old tasks can remain unchanged by updating the weights in the orthogonal subspace of previous input features. For self-attention, in contrast, three linear transformations are applied to the input tokens, followed by high-order and non-linear operations for the self-attention interaction among tokens. This makes the relationship between the prompt update and the output image tokens far more complex than a linear one.

In this work, we theoretically deduce two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. Concretely, we first take the full self-attention and LayerNorm into consideration and derive a strict condition for eliminating the interference through a comprehensive analysis of the forward propagation of the ViT layer. We then convert the condition on self-attention into two sufficient conditions, which enables us to address the challenge of high order and nonlinearity. Finally, we propose a constraint of invariant prompt distribution that removes the obstacle to the final simplification of the conditions brought by the LayerNorm. The consistency conditions reveal that if the prompt update can be orthogonal to (1) the normalized previous input image tokens projected with the second-order qkv-transformation matrices of the pre-trained model, and (2) the activated attention map generated by image queries and prompt keys, the interference in visual prompt tuning can be eliminated theoretically.

In practice, based on the proposed consistency conditions, an effective null-space-based approximation solution [36] is proposed to implement the prompt gradient orthogonal projection, while the invariant prompt distribution constraint is implemented by incorporating a loss function that penalizes the drift of the prompt distribution over sequential tasks. We validate our Null-Space Projection for Prompts (NSP2) approach on extensive class-incremental benchmarks: 10- and 20-split CIFAR-100, 10-split ImageNet-R [39] and 10-split DomainNet [38], with sequentially fine-tuned VPT and CLIP models as baselines. Our approach brings 4%-10% improvements in accuracy and reduces forgetting by 9%-17%, which is superior to state-of-the-art methods.

Our contributions are summarized as follows: (1) We introduce orthogonal projection into visual prompt tuning for continual learning, comprehensively considering all operations of a transformer layer when analyzing the interference problem. (2) Two sufficient consistency conditions for the self-attention and an invariant prompt distribution constraint for the LayerNorm are theoretically deduced, based on which an effective null-space-based approximation solution is introduced to implement the prompt gradient orthogonal projection for visual prompt tuning. (3) Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performance to state-of-the-art methods.

2 Related Work

Prompting-Based Approaches: Most prompting-based approaches adopt a two-stage framework [37, 39, 14, 15, 32, 34, 35, 11, 18, 19]: querying a group of prompts for an individual sample and using them to prompt the pre-trained models. For example, L2P [40] first selects a group of prompts from a prompt pool and then feeds them into the ViT. CPrompt [11] proposes to mitigate the gap between the training and testing stages to enhance prediction robustness and boost prompt selection accuracy. These approaches essentially focus on the acquisition of task-specific prompts tailored to individual samples. There are also several one-stage methods [2, 22, 38, 43, 20] based on prompt tuning. (1) Slowly updating trainable parameters [10, 43]: e.g., LAE [10] updates an offline expert with a large momentum to reduce the change of features. (2) Expandable backbones [45, 20]: e.g., EASE [45] trains a distinct lightweight adapter module for each new task and designs a semantic mapping to compensate for the drift of old class prototypes. (3) Enhancing classifiers rather than focusing on learning features [38, 22, 12]: e.g., ESN [38] proposes an anchor-based classifier alignment approach based on energy-based models. As introduced above, these works still lack a theoretical solution to the interference problem for visual prompt tuning. In our work, we conduct a deep analysis of this problem and provide theoretical guidance on eliminating the interference.

Orthogonal Projection-Based Approaches: Orthogonal projection-based approaches [42, 4, 8, 31, 36, 17, 44] can theoretically eliminate the interference of new tasks on old tasks for linear layers. OWM [42] constructs a projector to find the direction orthogonal to the input space. GPM [31] first projects new gradients onto the subspace important to the old tasks and then subtracts the projected components before updating the parameters. Adam-NSCL [36] projects the parameter updates into the approximate null space of previous inputs. However, due to the different relationships between parameter updates and outputs in linear operations and in self-attention, the consistency condition used in CNNs is not directly applicable to prompt tuning in ViTs. In our work, we derive the consistency conditions for visual prompt tuning, enabling the application of orthogonal projection-based approaches to it, where the null-space projection [36] is adopted to obtain an approximate solution efficiently. We notice that a recently emerged work, PGP [26], applies GPM [31] to prompt-based frameworks. However, it arrives at the same conclusion as for the linear operation under a simplified attention, which limits its applicability and performance, as compared in Appendix D.

3 Preliminaries

Continual Learning: In the setting of continual learning, a network $f(\cdot|\mathbf{\Theta})$ with parameters $\mathbf{\Theta}$ is sequentially trained on a stream of disjoint tasks $\{\mathcal{T}_{1},\mathcal{T}_{2},\cdots,\mathcal{T}_{T}\}$, where task $\mathcal{T}_{t}$ is associated with paired data $\{(\mathcal{X}_{t}^{<i>},y_{t}^{<i>})\}_{i=1}^{|\mathcal{T}_{t}|}$ of size $|\mathcal{T}_{t}|$. When a task $\mathcal{T}_{t}$ arrives, the model $f(\cdot|\mathbf{\Theta})$ is trained on the current task, while the data from previous tasks is unreachable.

Forward Propagation of Visual Prompt Tuning in ViT Layers: We describe the forward propagation process of a ViT layer for visual prompt tuning, as illustrated in Figure 1. Let ${\bf X}\in\mathbb{R}^{N\times D}$ and ${\bf P}\in\mathbb{R}^{M\times D}$ denote the $N$ input image tokens of a sample (including the pre-trained class token if available) and the $M$ prompts, respectively, where $D$ is the dimension of each token. In the ViT layer, only the prompts ${\bf P}$ are trainable parameters. The remaining parameters in the LayerNorm, the qkv-transformations and the subsequent MLP introduced below are pre-trained and kept frozen. We use ${\bf Z}=[{\bf X};{\bf P}]\in\mathbb{R}^{(N+M)\times D}$ to denote the concatenated input tokens. First, they undergo the LayerNorm [1] operation $\mathrm{LN}(\cdot)$:

\mathrm{LN}({\bf Z})=\frac{{\bf Z}-\boldsymbol{\mu}_{{\bf Z}}}{\boldsymbol{\sigma}_{{\bf Z}}}\odot\boldsymbol{\alpha}+\boldsymbol{\beta},    (1)

where $\boldsymbol{\mu}_{{\bf Z}},\boldsymbol{\sigma}_{{\bf Z}}\in\mathbb{R}^{N+M}$ and $\boldsymbol{\alpha},\boldsymbol{\beta}\in\mathbb{R}^{D}$. The $\odot$ and the division here denote the element-wise (Hadamard) product and division, respectively. Note that the vectors $\boldsymbol{\mu}_{{\bf Z}}$, $\boldsymbol{\sigma}_{{\bf Z}}$, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are broadcast to matrices of dimensions $(N+M)\times D$, enabling them to operate with ${\bf Z}$. Then the normalized tokens are fed into the qkv-transformations:

{\bf Q}_{{\bf Z}}=\mathrm{LN}({\bf Z}){\bf W}_{q}+\boldsymbol{b}_{q},~{\bf K}_{{\bf Z}}=\mathrm{LN}({\bf Z}){\bf W}_{k}+\boldsymbol{b}_{k},~{\bf V}_{{\bf Z}}=\mathrm{LN}({\bf Z}){\bf W}_{v}+\boldsymbol{b}_{v},    (2)

where ${\bf W}_{\{q,k,v\}}\in\mathbb{R}^{D\times D}$. The vector $\boldsymbol{b}_{\{q,k,v\}}\in\mathbb{R}^{D}$ is broadcast to a matrix of dimensions $(N+M)\times D$ to facilitate the addition. Next is the self-attention:

{\bf F}_{{\bf Z}}=f_{\mathrm{SA}}({\bf Z})=\mathrm{softmax}\left(\frac{{\bf Q}_{\bf X}{\bf K}_{\bf Z}^{\top}}{\sqrt{D}}\right){\bf V}_{{\bf Z}},    (3)

where ${\bf Q}_{\bf X}$ denotes the queries of the image tokens. Eq. (3) can be expanded into Affinity, softmax (over rows) and Aggregation operations:

{\bf A}_{{\bf Z}}=f_{\mathrm{aff}}({\bf Q}_{\bf X},{\bf K}_{\bf Z})=\frac{{\bf Q}_{{\bf X}}{\bf K}_{{\bf Z}}^{\top}}{\sqrt{D}}=\frac{{\bf Q}_{{\bf X}}\begin{bmatrix}{\bf K}_{{\bf X}}^{\top}&{\bf K}_{{\bf P}}^{\top}\end{bmatrix}}{\sqrt{D}}\in\mathbb{R}^{N\times(N+M)},    (4)
{\bf S}_{\bf Z}=\mathrm{softmax}({\bf A}_{\bf Z})=\mathrm{softmax}(\begin{bmatrix}{\bf A}_{{\bf X}}\in\mathbb{R}^{N\times N}&{\bf A}_{{\bf P}}\in\mathbb{R}^{N\times M}\end{bmatrix})=\begin{bmatrix}{\bf S}_{{\bf X}}&{\bf S}_{{\bf P}}\end{bmatrix},    (5)
{\bf F}_{{\bf Z}}=f_{\mathrm{agg}}({\bf S}_{\bf Z},{\bf V}_{\bf Z})={\bf S}_{\bf Z}{\bf V}_{\bf Z}=\begin{bmatrix}{\bf S}_{{\bf X}}&{\bf S}_{{\bf P}}\end{bmatrix}\begin{bmatrix}{\bf V}_{{\bf X}}\\ {\bf V}_{{\bf P}}\end{bmatrix}\in\mathbb{R}^{N\times D}.    (6)

It is worth noting that the rows of the attention map where the prompts serve as queries (i.e., ${\bf Q}_{\bf P}$) do not need to be computed, as formulated in Eq. (4) and illustrated in Figure 1. The reason is that in VPT-Deep [13], the output prompts of this ViT layer are replaced with new trainable prompts in the subsequent layer. Omitting ${\bf Q}_{\bf P}$ has no impact on the output image tokens of the ViT layer, as the subsequent Aggregation, LayerNorm and MLP operations are performed independently for each token. If no new prompts are added in the next layer, the output prompts can simply be discarded as well.

After the self-attention, operations consisting of another LayerNorm and the MLP are applied to each token individually, without any interaction among the tokens. Finally, the output fine-tuned image tokens are fed into the next ViT layer.
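To make the token shapes and the role of the prompts concrete, the following NumPy sketch walks through Eq. (1)-(6) for a single layer and a single head, using toy dimensions and randomly initialized stand-ins for the frozen pre-trained weights (the LayerNorm scale and shift are set to 1 and 0 for brevity); it is an illustration of the forward pass described above, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 8, 4, 32                        # toy sizes: image tokens, prompts, embedding dim
X = rng.normal(size=(N, D))               # image tokens of one sample
P = rng.normal(size=(M, D))               # trainable prompts
Wq, Wk, Wv = (0.1 * rng.normal(size=(D, D)) for _ in range(3))   # stand-ins for frozen qkv weights
bq, bk, bv = (0.1 * rng.normal(size=D) for _ in range(3))        # stand-ins for frozen qkv biases

def ln(T, eps=1e-6):
    """Per-token LayerNorm, Eq. (1), with alpha = 1 and beta = 0 for brevity."""
    return (T - T.mean(1, keepdims=True)) / (T.std(1, keepdims=True) + eps)

def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

Z = np.concatenate([X, P], axis=0)        # Z = [X; P], shape (N+M, D)
Q_X = ln(X) @ Wq + bq                     # queries from image tokens only (Q_P is never needed)
K_Z = ln(Z) @ Wk + bk                     # keys from image tokens and prompts, Eq. (2)
V_Z = ln(Z) @ Wv + bv                     # values from image tokens and prompts, Eq. (2)

A_Z = Q_X @ K_Z.T / np.sqrt(D)            # Affinity, Eq. (4): shape (N, N+M)
S_Z = softmax(A_Z)                        # row-wise softmax, Eq. (5)
F_Z = S_Z @ V_Z                           # Aggregation, Eq. (6): shape (N, D)
print(F_Z.shape)                          # (8, 32)
```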

Figure 1: Illustration of the forward propagation in a ViT layer. Residual connections are omitted. The red crosses indicate the rows of the attention map and the output prompts that can be neglected.

Orthogonal Projection in Convolutional Layers: A convolutional operation is essentially a linear operation. For a convolutional layer $f_{\mathrm{conv}}(\cdot|\mathbf{\Theta}_{t})$ in task $\mathcal{T}_{t}$, we use $\mathbf{\Theta}_{t}\in\mathbb{R}^{D_{\mathrm{in}}\times D_{\mathrm{out}}}$ to denote its unrolled convolutional kernel matrix [5]. Here, $D_{\mathrm{in}}$ represents the number of pixels within a kernel, and $D_{\mathrm{out}}$ corresponds to the number of kernels. Each convolutional patch from the input feature map is flattened into a row vector of dimension $D_{\mathrm{in}}$. These row vectors, from a total of $n_{p}$ patches, compose the input feature matrix ${\bf X}_{t}\in\mathbb{R}^{n_{p}\times D_{\mathrm{in}}}$. The output feature for ${\bf X}_{t}$ in task $\mathcal{T}_{t}$ is expected to remain unchanged (referred to as consistent) in the next task $\mathcal{T}_{t+1}$ to prevent forgetting:

f_{\mathrm{conv}}({\bf X}_{t}|\mathbf{\Theta}_{t})=f_{\mathrm{conv}}({\bf X}_{t}|\mathbf{\Theta}_{t+1}).    (7)

By substituting $\mathbf{\Theta}_{t+1}=\mathbf{\Theta}_{t}+\Delta\mathbf{\Theta}$, with $\Delta\mathbf{\Theta}\neq\mathbf{0}$ denoting the weight update in $\mathcal{T}_{t+1}$, the consistency condition for the convolutional layer is established as follows:

{\bf X}_{t}\mathbf{\Theta}_{t}={\bf X}_{t}(\mathbf{\Theta}_{t}+\Delta\mathbf{\Theta}),    (8)

which can be further simplified as:

{\bf X}_{t}\Delta\mathbf{\Theta}=\mathbf{0}.    (9)

Eq. (9) suggests that if the weight update $\Delta\mathbf{\Theta}$ is orthogonal to the previous input feature ${\bf X}_{t}$ during training in the new task, the corresponding output feature will remain unchanged. Thereby, the interference of the new task on the old task is eliminated. This can be realized by projecting the candidate weight update $\mathbf{\Theta}_{\mathcal{G}}$ into the orthogonal subspace of ${\bf X}_{t}$: $\Delta\mathbf{\Theta}=\mathcal{P}\mathbf{\Theta}_{\mathcal{G}}$, where $\mathcal{P}\in\mathbb{R}^{D_{\mathrm{in}}\times D_{\mathrm{in}}}$ is an orthogonal projection matrix [42, 36, 31].
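As a quick illustration of Eq. (7)-(9), the sketch below projects a random candidate update into the null space of the old inputs and checks that the old outputs are unchanged. The projector construction via SVD mirrors the style of [36], but the toy sizes (chosen so that an exact null space of ${\bf X}_{t}$ exists) and the tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_p, D_in, D_out = 20, 64, 16               # toy sizes: patches, kernel pixels, kernels
X_t = rng.normal(size=(n_p, D_in))          # unrolled input features from the old task
Theta_t = rng.normal(size=(D_in, D_out))    # weights learned up to task t
Theta_G = rng.normal(size=(D_in, D_out))    # candidate update proposed by the optimizer

# Null-space basis of X_t: right singular vectors with (near-)zero singular values.
_, s, Vh = np.linalg.svd(X_t, full_matrices=True)
rank = int((s > s.max() * 1e-10).sum())
U0 = Vh[rank:].T                            # shape (D_in, D_in - rank)
Proj = U0 @ U0.T                            # the orthogonal projection matrix P in the text

Delta = Proj @ Theta_G                      # projected update: X_t @ Delta = 0, Eq. (9)
print(np.abs(X_t @ Delta).max())                                # ~1e-14
print(np.abs(X_t @ Theta_t - X_t @ (Theta_t + Delta)).max())    # old outputs unchanged, Eq. (8)
```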

Similarly, for prompt tuning, which fine-tunes the prompts ${\bf P}_{t}$ in a ViT layer $f_{\mathrm{vit}}({\bf X}_{t}|{\bf P}_{t})$, we also aim to satisfy the following consistency objective for the purpose of anti-forgetting:

f_{\mathrm{vit}}({\bf X}_{t}|{\bf P}_{t})=f_{\mathrm{vit}}({\bf X}_{t}|{\bf P}_{t+1}).    (10)

However, the consistency condition in Eq. (9) does not hold for Eq. (10), since $f_{\mathrm{vit}}({\bf X}_{t}|{\bf P}_{t})\neq{\bf X}_{t}{\bf P}_{t}$ in prompt tuning. Instead, all the tokens ${\bf X}_{t}$ and ${\bf P}_{t}$ first undergo a LayerNorm and then interact via the self-attention mechanism, as previously described. The complicated forward propagation within the ViT layer poses a huge challenge to analyzing the consistency conditions in relation to the prompt update $\Delta{\bf P}$. In the next section, we tackle this challenge and derive the consistency conditions for visual prompt tuning.

4 Method

We use ${\bf Z}_{t}=[{\bf X}_{t};{\bf P}_{t}]$ and ${\bf Z}_{t+1}=[{\bf X}_{t};{\bf P}_{t+1}]$ to denote the input tokens before and after updating the prompts, respectively, where ${\bf P}_{t+1}={\bf P}_{t}+\Delta{\bf P}$ with $\Delta{\bf P}\neq\mathbf{0}$. Our goal is to analyze how to satisfy Eq. (10) and derive one or more conditions expressed in terms of the prompt update $\Delta{\bf P}$. These conditions will subsequently guide the application of orthogonal projection to $\Delta{\bf P}$.

4.1 Analysis of Consistency Conditions

As can be seen in Figure 1, the outputs of the LayerNorm and the qkv-transformations corresponding to the image tokens remain unaffected by updates to the prompts. Hence, attaining the consistency objective in Eq. (10) reduces to analyzing how to keep the output of the self-attention in Eq. (3) unchanged as the prompts are updated, i.e., satisfying:

{\bf F}_{{\bf Z}_{t}}={\bf F}_{{\bf Z}_{t+1}}.    (11)

However, the nonlinear operation (i.e., $\mathrm{softmax}$) and the potential higher-order term ${\bf W}_{k}^{\top}{\bf Z}^{\top}{\bf Z}{\bf W}_{v}$ arising from ${\bf K}_{\bf Z}^{\top}{\bf V}_{\bf Z}$ in Eq. (3) complicate the direct resolution of this objective. Specifically, the non-injective property of the $\mathrm{softmax}$ function causes non-unique solutions, and the multiplication ${\bf K}_{{\bf Z}_{t+1}}^{\top}{\bf V}_{{\bf Z}_{t+1}}$ yields a quadratic term $\mathrm{LN}({\bf P}_{t}+\Delta{\bf P})^{\top}\mathrm{LN}({\bf P}_{t}+\Delta{\bf P})$, which makes the optimization of $\Delta{\bf P}$ difficult.

To address this issue, we propose two sufficient conditions consisting solely of linear operations. Specifically, we split the self-attention into two primary stages, i.e., the Affinity described by Eq. (4) and the Aggregation outlined in Eq. (6). We can achieve Eq. (11) by ensuring the consistency of each stage:

f_{\mathrm{aff}}({\bf Q}_{{\bf X}_{t}},{\bf K}_{{\bf Z}_{t}})=f_{\mathrm{aff}}({\bf Q}_{{\bf X}_{t}},{\bf K}_{{\bf Z}_{t+1}}),    (12)
f_{\mathrm{agg}}({\bf S}_{{\bf Z}_{t}},{\bf V}_{{\bf Z}_{t}})=f_{\mathrm{agg}}({\bf S}_{{\bf Z}_{t+1}},{\bf V}_{{\bf Z}_{t+1}}).    (13)

We first analyze the consistency objective of Affinity, i.e., Eq. (12), for ${\bf Z}_{t}$ and ${\bf Z}_{t+1}$:

f_{\mathrm{aff}}({\bf Q}_{{\bf X}_{t}},{\bf K}_{{\bf Z}_{t}})={\bf Q}_{{\bf X}_{t}}\begin{bmatrix}{\bf K}_{{\bf X}_{t}}^{\top}&{\bf K}_{{\bf P}_{t}}^{\top}\end{bmatrix}=\begin{bmatrix}{\bf Q}_{{\bf X}_{t}}{\bf K}_{{\bf X}_{t}}^{\top}&{\bf Q}_{{\bf X}_{t}}\left[\mathrm{LN}({\bf P}_{t}){\bf W}_{k}+\boldsymbol{b}_{k}\right]^{\top}\end{bmatrix},    (14)
f_{\mathrm{aff}}({\bf Q}_{{\bf X}_{t}},{\bf K}_{{\bf Z}_{t+1}})=\begin{bmatrix}{\bf Q}_{{\bf X}_{t}}{\bf K}_{{\bf X}_{t}}^{\top}&{\bf Q}_{{\bf X}_{t}}\left[\mathrm{LN}({\bf P}_{t+1}){\bf W}_{k}+\boldsymbol{b}_{k}\right]^{\top}\end{bmatrix},    (15)

where $\sqrt{D}$ is omitted for simplicity. Upon fulfilling Eq. (12), we obtain ${\bf S}_{{\bf Z}_{t}}={\bf S}_{{\bf Z}_{t+1}}$, corresponding to the output of Eq. (5). Subsequently, we analyze the consistency objective of Aggregation in Eq. (13), yielding results for ${\bf Z}_{t}$ and ${\bf Z}_{t+1}$ as:

f_{\mathrm{agg}}({\bf S}_{{\bf Z}_{t}},{\bf V}_{{\bf Z}_{t}})={\bf S}_{{\bf X}_{t}}{\bf V}_{{\bf X}_{t}}+{\bf S}_{{\bf P}_{t}}{\bf V}_{{\bf P}_{t}}={\bf S}_{{\bf X}_{t}}{\bf V}_{{\bf X}_{t}}+{\bf S}_{{\bf P}_{t}}\left[\mathrm{LN}({\bf P}_{t}){\bf W}_{v}+\boldsymbol{b}_{v}\right],    (16)
f_{\mathrm{agg}}({\bf S}_{{\bf Z}_{t+1}},{\bf V}_{{\bf Z}_{t+1}})=f_{\mathrm{agg}}({\bf S}_{{\bf Z}_{t}},{\bf V}_{{\bf Z}_{t+1}})={\bf S}_{{\bf X}_{t}}{\bf V}_{{\bf X}_{t}}+{\bf S}_{{\bf P}_{t}}\left[\mathrm{LN}({\bf P}_{t+1}){\bf W}_{v}+\boldsymbol{b}_{v}\right].    (17)

Based on Eq. (12-17), we are able to derive the following two equations, respectively:

{\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}\mathrm{LN}({\bf P}_{t})^{\top}={\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}\mathrm{LN}({\bf P}_{t+1})^{\top}={\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}\mathrm{LN}({\bf P}_{t}+\Delta{\bf P})^{\top},    (18)
{\bf S}_{{\bf P}_{t}}\mathrm{LN}({\bf P}_{t}){\bf W}_{v}={\bf S}_{{\bf P}_{t}}\mathrm{LN}({\bf P}_{t+1}){\bf W}_{v}={\bf S}_{{\bf P}_{t}}\mathrm{LN}({\bf P}_{t}+\Delta{\bf P}){\bf W}_{v}.    (19)

Note that we expect to further deduce Eq. (18) and Eq. (19) to obtain relations among $\mathrm{LN}({\bf P}_{t})$, $\mathrm{LN}({\bf P}_{t}+\Delta{\bf P})$ and $\Delta{\bf P}$. However, due to the square root and quadratic terms in the expressions of the standard deviations $\boldsymbol{\sigma}_{{\bf P}_{t}}$ and $\boldsymbol{\sigma}_{{\bf P}_{t}+\Delta{\bf P}}$, it is difficult to express $\boldsymbol{\sigma}_{{\bf P}_{t}+\Delta{\bf P}}$ in terms of $\boldsymbol{\sigma}_{{\bf P}_{t}}$ and $\boldsymbol{\sigma}_{\Delta{\bf P}}$. Consequently, it is challenging to derive a straightforward equation that relates $\mathrm{LN}({\bf P}_{t})$ and $\mathrm{LN}({\bf P}_{t}+\Delta{\bf P})$ through $\Delta{\bf P}$.

To simplify the problem, we introduce an additional constraint on the distribution of the prompts. Concretely, we require that the updated prompts ${\bf P}_{t}+\Delta{\bf P}$ retain the same distribution as ${\bf P}_{t}$, i.e., meet the following assumption:

\boldsymbol{\mu}_{{\bf P}_{t}+\Delta{\bf P}}=\boldsymbol{\mu}_{{\bf P}_{t}},\quad\boldsymbol{\sigma}_{{\bf P}_{t}+\Delta{\bf P}}=\boldsymbol{\sigma}_{{\bf P}_{t}}.    (20)

In this way, we can establish a straightforward mathematical relationship connecting $\mathrm{LN}({\bf P}_{t}+\Delta{\bf P})$, $\mathrm{LN}({\bf P}_{t})$ and $\Delta{\bf P}$:

\mathrm{LN}({\bf P}_{t}+\Delta{\bf P})=\frac{{\bf P}_{t}+\Delta{\bf P}-\boldsymbol{\mu}_{{\bf P}_{t}+\Delta{\bf P}}}{\boldsymbol{\sigma}_{{\bf P}_{t}+\Delta{\bf P}}}\odot\boldsymbol{\alpha}+\boldsymbol{\beta}=\frac{{\bf P}_{t}-\boldsymbol{\mu}_{{\bf P}_{t}}+\Delta{\bf P}}{\boldsymbol{\sigma}_{{\bf P}_{t}}}\odot\boldsymbol{\alpha}+\boldsymbol{\beta}=\mathrm{LN}({\bf P}_{t})+\frac{\Delta{\bf P}}{\boldsymbol{\sigma}_{{\bf P}_{t}}}\odot\boldsymbol{\alpha}.    (21)
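A small numerical check of Eq. (21): if we construct a $\Delta{\bf P}$ that preserves each prompt's mean and standard deviation (here, by permuting the entries within every row, which is one convenient way to satisfy Eq. (20) exactly), the LayerNorm output indeed decomposes linearly as predicted. The toy sizes and the small $\epsilon$ term are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, D = 4, 32
P = rng.normal(size=(M, D))
alpha = rng.normal(size=D)
beta = rng.normal(size=D)

def ln(T, eps=1e-6):
    mu = T.mean(1, keepdims=True)
    sigma = T.std(1, keepdims=True)
    return (T - mu) / (sigma + eps) * alpha + beta

# A prompt update that keeps each row's mean and std unchanged, Eq. (20):
# permuting the entries within every row leaves the per-token statistics intact.
P_new = np.stack([rng.permutation(row) for row in P])
dP = P_new - P

sigma_P = P.std(1, keepdims=True) + 1e-6
lhs = ln(P + dP)
rhs = ln(P) + dP / sigma_P * alpha       # right-hand side of Eq. (21)
print(np.abs(lhs - rhs).max())           # ~1e-15: the linear relation holds up to float error
```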

Consequently, we can apply Eq. (21) to simplify Eq. (18) and (19) as:

{\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}\mathrm{LN}({\bf P}_{t})^{\top}={\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}\mathrm{LN}({\bf P}_{t})^{\top}+{\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}\left({\Delta{\bf P}}/{\boldsymbol{\sigma}_{{\bf P}_{t}}}\odot\boldsymbol{\alpha}\right)^{\top},    (22)
{\bf S}_{{\bf P}_{t}}\mathrm{LN}({\bf P}_{t}){\bf W}_{v}={\bf S}_{{\bf P}_{t}}\mathrm{LN}({\bf P}_{t}){\bf W}_{v}+{\bf S}_{{\bf P}_{t}}\left({\Delta{\bf P}}/{\boldsymbol{\sigma}_{{\bf P}_{t}}}\odot\boldsymbol{\alpha}\right){\bf W}_{v}.    (23)

It should be noted that in Eq. (22) and Eq. (23), ${\bf W}_{k}$, ${\bf W}_{v}$ and $\boldsymbol{\alpha}$ are pre-trained parameters kept frozen throughout the continual learning process, while ${\bf Q}_{{\bf X}_{t}}$ and ${\bf S}_{{\bf P}_{t}}$ are two matrices derived from the input ${\bf X}_{t}$. As our objective is to ensure that the above two equations remain valid for the variables ${\bf Q}_{{\bf X}_{t}}$ and ${\bf S}_{{\bf P}_{t}}$, it is sufficient to meet the following conditions, in which ${\bf W}_{v}$ can be ignored whereas ${\bf W}_{k}$ remains crucial:

{\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}\Delta{\bf P}^{\top}=\mathbf{0},    (24)
{\bf S}_{{\bf P}_{t}}\Delta{\bf P}=\mathbf{0}.    (25)

Now we have obtained the simplified formulas expressed in terms of $\Delta{\bf P}$ in Eq. (24) and (25).

To sum up, we convert the overall consistency equation Eq. (11) into two sufficient conditions, Eq. (12) and (13), for Affinity and Aggregation, respectively. Consequently, we derive the two corresponding consistency conditions Eq. (24) and (25) expressed by the prompt update $\Delta{\bf P}$, under the constraint of invariant prompt distribution formulated in Eq. (20). The deduced conditions can satisfy the consistency objective in Eq. (10), thereby achieving the goal of eliminating the interference of the new task on the old tasks for visual prompt tuning.

As ${\bf Q}_{{\bf X}_{t}}=\mathrm{LN}({\bf X}_{t}){\bf W}_{q}+\boldsymbol{b}_{q}$, Eq. (24) implies that if the (transposed) prompt update is orthogonal to the normalized previous input image tokens ${\bf X}_{t}$ projected with the second-order transformation matrix ${\bf W}_{q}{\bf W}_{k}^{\top}$ of the pre-trained ViT, the consistency of Affinity is guaranteed. When we ignore the normalization and the bias term in ${\bf Q}_{{\bf X}_{t}}$, Eq. (24) can be simplified as ${\bf X}_{t}{\bf W}_{q}{\bf W}_{k}^{\top}\Delta{\bf P}^{\top}=\mathbf{0}$. The simplified condition is still essentially different from the consistency condition of linear layers (i.e., Eq. (9)) and from that deduced in [26] (i.e., ${\bf X}_{t}\Delta{\bf P}^{\top}=\mathbf{0}$). It indicates that the interaction between the image tokens and the prompts within ViT layers is fundamentally distinct, leading to a unique consistency condition related to the second-order transformation matrix ${\bf W}_{q}{\bf W}_{k}^{\top}$ of the pre-trained model. Moreover, Eq. (25) is also essential, serving as the other sufficient condition for the consistency of the whole ViT layer. It implies that if the prompt update is orthogonal to the activated attention map generated by the image queries (${\bf Q}_{\bf X}$) and prompt keys (${\bf K}_{\bf P}$), the consistency of Aggregation is achieved.

4.2 Optimization of Consistency Conditions

To jointly optimize Eq. (24) and (25), we need to solve for a $\Delta{\bf P}$ that meets both equations concurrently. Here, we employ a separate optimization approach to obtain an approximate solution efficiently. It first ensures that $\Delta{\bf P}^{\top}$ is orthogonal to the subspace spanned by ${\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}$ to satisfy Eq. (24), and then makes $\Delta{\bf P}$ orthogonal to the subspace spanned by ${\bf S}_{{\bf P}_{t}}$ to satisfy Eq. (25).

Specifically, we use ${\bf P}_{\mathcal{G}}$ to denote the candidate parameter update generated by the optimizer for the prompts. We aim to obtain a projection matrix $\mathcal{B}$ such that $\Delta{\bf P}=\mathcal{B}{\bf P}_{\mathcal{G}}$. Following the separate optimization strategy mentioned above, we first ensure $\Delta{\bf P}^{\top}$ is orthogonal to ${\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}$ via the projection matrix $\mathcal{B}_{1}$: $\Delta{\bf P}^{\top}=\mathcal{B}_{1}{\bf P}_{\mathcal{G}}^{\top}$. Then $\Delta{\bf P}$ is made orthogonal to ${\bf S}_{{\bf P}_{t}}$ via another projection matrix $\mathcal{B}_{2}$: $\Delta{\bf P}=\mathcal{B}_{2}{\bf P}_{\mathcal{G}}$. Therefore, the objective of the optimization turns into obtaining the two projection matrices $\mathcal{B}_{1}$ and $\mathcal{B}_{2}$ that satisfy Eq. (24) and (25). Inspired by the null-space projection method [36], the bases of $\mathcal{B}_{1}$ and $\mathcal{B}_{2}$ correspond to the null-space bases of ${\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}$ and ${\bf S}_{{\bf P}_{t}}$, respectively. We use ${\bf U}_{1,0}\in\mathbb{R}^{D\times R_{1}}$ and ${\bf U}_{2,0}\in\mathbb{R}^{M\times R_{2}}$ to denote the bases of the null spaces of ${\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}$ and ${\bf S}_{{\bf P}_{t}}$, where $R_{1}$ and $R_{2}$ indicate their nullities. ${\bf U}_{1,0}$ and ${\bf U}_{2,0}$ can be obtained as the right singular vectors associated with the zero singular values of $\mathrm{SVD}(({\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top})^{\top}{\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top})$ and $\mathrm{SVD}({\bf S}_{{\bf P}_{t}}^{\top}{\bf S}_{{\bf P}_{t}})$, respectively. In this way, we obtain the projection matrices $\mathcal{B}_{1}={\bf U}_{1,0}{\bf U}_{1,0}^{\top}\in\mathbb{R}^{D\times D}$ and $\mathcal{B}_{2}={\bf U}_{2,0}{\bf U}_{2,0}^{\top}\in\mathbb{R}^{M\times M}$, which are the solutions enabling $\Delta{\bf P}$ to jointly satisfy Eq. (24) and (25):

\Delta{\bf P}=\mathcal{B}_{2}{\bf P}_{\mathcal{G}}\mathcal{B}_{1}=({\bf U}_{2,0}{\bf U}_{2,0}^{\top}){\bf P}_{\mathcal{G}}({\bf U}_{1,0}{\bf U}_{1,0}^{\top}).    (26)
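A minimal sketch of Eq. (26), assuming toy dimensions for which exact null spaces exist (in practice $N>M$ and the paper resorts to approximate null spaces with adaptively chosen nullities): it builds $\mathcal{B}_{1}$ and $\mathcal{B}_{2}$ from the SVDs described above, projects a random candidate update, and verifies the conditions of Eq. (24) and (25) numerically.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, D = 8, 16, 32                      # toy sizes chosen so that exact null spaces exist
Q_X = rng.normal(size=(N, D))            # image queries collected from the old task
W_k = rng.normal(size=(D, D))            # stand-in for the frozen key transformation
S_P = rng.normal(size=(N, M))            # activated attention map on the prompt keys
P_G = rng.normal(size=(M, D))            # candidate prompt update from the optimizer

def null_space_projector(C, rtol=1e-10):
    """U0 @ U0.T, where U0 spans the (near-)null space of the covariance matrix C."""
    _, s, Vh = np.linalg.svd(C)
    rank = int((s > s.max() * rtol).sum())
    U0 = Vh[rank:].T
    return U0 @ U0.T

A = Q_X @ W_k.T                           # (N, D): the matrix appearing in condition (24)
B1 = null_space_projector(A.T @ A)        # (D, D)
B2 = null_space_projector(S_P.T @ S_P)    # (M, M)

dP = B2 @ P_G @ B1                        # Eq. (26)
print(np.abs(A @ dP.T).max())             # condition (24): ~0
print(np.abs(S_P @ dP).max())             # condition (25): ~0
```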

For the constraint Eq. (20), we incorporate an additional loss function aimed at penalizing the drift of prompt distribution, hence realizing a relaxed version of this constraint:

\mathcal{L}_{\mathrm{LN}}=\|\boldsymbol{\mu}_{{\bf P}_{t+1}}-\boldsymbol{\mu}_{{\bf P}_{t}}\|_{1}+\|\boldsymbol{\sigma}_{{\bf P}_{t+1}}-\boldsymbol{\sigma}_{{\bf P}_{t}}\|_{1}.    (27)

In Eq. (27), $\boldsymbol{\mu}_{{\bf P}_{t}}$ and $\boldsymbol{\sigma}_{{\bf P}_{t}}$ represent the target prompt distribution obtained in task $\mathcal{T}_{t}$, while $\boldsymbol{\mu}_{{\bf P}_{t+1}}$ and $\boldsymbol{\sigma}_{{\bf P}_{t+1}}$ denote the distribution to be optimized in task $\mathcal{T}_{t+1}$.
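A possible PyTorch sketch of Eq. (27); the statistics of the previous task's prompts are detached constants, and the biased standard deviation is used here to match the LayerNorm convention (an assumption of this sketch).

```python
import torch

def prompt_distribution_loss(P_new, mu_old, sigma_old):
    """Relaxed version of the constraint in Eq. (20) via the penalty of Eq. (27)."""
    mu_new = P_new.mean(dim=1)                      # per-prompt mean over the D dimension
    sigma_new = P_new.std(dim=1, unbiased=False)    # biased std, matching LayerNorm statistics
    return (mu_new - mu_old).abs().sum() + (sigma_new - sigma_old).abs().sum()

# Usage sketch: the statistics of the prompts frozen at the end of task t serve as targets
# while the prompts are being tuned on task t+1.
P_old = torch.randn(4, 32)
mu_old = P_old.mean(dim=1).detach()
sigma_old = P_old.std(dim=1, unbiased=False).detach()
P_new = (P_old + 0.1 * torch.randn(4, 32)).requires_grad_()
loss_ln = prompt_distribution_loss(P_new, mu_old, sigma_old)
loss_ln.backward()
print(loss_ln.item())
```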

To sum up, we use Eq. (26) to realize Eq. (24) and (25), and use Eq. (27) to meet Eq. (20), thereby achieving the consistency objective Eq. (10) for anti-forgetting. We provide the full algorithm of our approach in Appendix A.

4.3 Extension to Multi-Heads

We further extend the consistency conditions Eq. (24) and (25) to multi-head self-attention, a common feature of current transformer-based models. Suppose there are $H$ heads and $d=D/H$ is the dimension of each token in a head. We use ${\bf Q}_{{\bf X}_{t}.h}\in\mathbb{R}^{N\times d}$, ${\bf W}_{k.h}\in\mathbb{R}^{D\times d}$ and ${\bf S}_{{\bf P}_{t}.h}\in\mathbb{R}^{N\times M}$ to denote the corresponding matrices in Eq. (24) and (25) for the $h$-th head. The objective is to ensure these conditions are met across all heads, i.e., ${\bf Q}_{{\bf X}_{t}.h}{\bf W}_{k.h}^{\top}\Delta{\bf P}^{\top}=\mathbf{0}$ and ${\bf S}_{{\bf P}_{t}.h}\Delta{\bf P}=\mathbf{0}$ for all $h\in\{1,2,\cdots,H\}$. Let $\mathbf{\Omega}_{1,t}=[{\bf Q}_{{\bf X}_{t}.1}{\bf W}_{k.1}^{\top};\cdots;{\bf Q}_{{\bf X}_{t}.H}{\bf W}_{k.H}^{\top}]\in\mathbb{R}^{HN\times D}$ and $\mathbf{\Omega}_{2,t}=[{\bf S}_{{\bf P}_{t}.1};\cdots;{\bf S}_{{\bf P}_{t}.H}]\in\mathbb{R}^{HN\times M}$ denote the matrices concatenated over all heads. Based on the properties of block matrices, the two sets of conditions can be formulated as $\mathbf{\Omega}_{1,t}\Delta{\bf P}^{\top}=\mathbf{0}$ and $\mathbf{\Omega}_{2,t}\Delta{\bf P}=\mathbf{0}$. In short, the main difference from the single-head case is that the parameter update should be orthogonal to the subspace spanned by the concatenated matrices from all heads. Therefore, the multi-head variant only requires an additional concatenation step in our algorithm, as sketched below.
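The concatenation itself is a one-liner; the sketch below (with toy per-head matrices standing in for the quantities collected during the forward pass) builds $\mathbf{\Omega}_{1,t}$ and $\mathbf{\Omega}_{2,t}$, which then take the place of ${\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}$ and ${\bf S}_{{\bf P}_{t}}$ when computing the projection matrices.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, D, H = 8, 16, 32, 4
d = D // H
# Per-head matrices, standing in for the quantities collected during the forward pass.
Q_heads = [rng.normal(size=(N, d)) for _ in range(H)]      # Q_{X_t.h}
Wk_heads = [rng.normal(size=(D, d)) for _ in range(H)]     # W_{k.h}
S_heads = [rng.normal(size=(N, M)) for _ in range(H)]      # S_{P_t.h}

# Row-wise concatenation over heads gives the matrices of the multi-head conditions.
Omega1 = np.concatenate([Q @ Wk.T for Q, Wk in zip(Q_heads, Wk_heads)], axis=0)  # (H*N, D)
Omega2 = np.concatenate(S_heads, axis=0)                                          # (H*N, M)
print(Omega1.shape, Omega2.shape)          # (32, 32) (32, 16)
# Omega1 and Omega2 replace Q_{X_t} W_k^T and S_{P_t} when building B1 and B2.
```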

5 Experiments

5.1 Experimental Setups

In our experiments, we mainly utilize VPT [13] with a ViT-B/16 backbone [9] pre-trained on ImageNet-21k. Additionally, we validate the effectiveness on the CLIP [27] model, wherein the visual prompts are inserted into the image encoder. Our experiments are conducted across four class-incremental benchmarks: 10- and 20-split CIFAR-100, 10-split ImageNet-R and 10-split DomainNet. We report the mean values of the final average accuracy and final average forgetting over 3 runs with different random seeds. Given that the null spaces of ${\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}$ and ${\bf S}_{{\bf P}_{t}}$ may not always exist in practice, we compute approximate null spaces and determine the nullities $R_{1}$ and $R_{2}$ in an adaptive manner, rather than in the way suggested in [36]. For more detailed information regarding the experimental setups, please refer to Appendix B.

Table 1: Comparison with the baselines ("-Seq") on four benchmarks using two types of models. The upper-bound means jointly training all the classes in the dataset.
Method 10S-CIFAR-100 20S-CIFAR-100 10S-ImageNet-R 10S-DomainNet
Acc. \uparrow Forgetting \downarrow Acc. \uparrow Forgetting \downarrow Acc. \uparrow Forgetting \downarrow Acc. \uparrow Forgetting \downarrow
VPT-Seq 87.27 12.33 82.36 17.36 72.46 19.41 73.28 25.65
VPT-NSP2 91.74 3.28 89.89 4.91 78.88 5.06 83.54 8.54
Upper-bound 93.87 - 93.87 - 84.60 - 89.25 -
CLIP-Seq 72.91 15.13 71.37 17.89 75.69 19.21 67.73 35.60
CLIP-NSP2 80.96 12.45 79.83 13.77 82.17 6.42 77.04 18.33
Upper-bound 84.52 - 84.52 - 84.86 - 81.65 -
Figure 2: Task-by-task accuracy changing curves of VPT-Seq and VPT-NSP2 on two benchmarks.

5.2 Main Results

Validation of Effectiveness: The comparison between our approach and the sequential fine-tuning VPT and CLIP baselines is shown in Table 1. For the VPT model, the proposed NSP2 achieves 4.47%-10.26% improvements in accuracy on the four benchmarks, while reducing forgetting by 9.05%-17.11%. As for the CLIP model, NSP2 improves accuracy by 6.48%-9.31% and reduces forgetting by 2.68%-17.27%. We calculate the accuracy across all previously encountered tasks after completing training on each task. The accuracy curves of VPT-Seq and VPT-NSP2 on 10-split CIFAR-100 and 10-split ImageNet-R are displayed in Figure 2. They demonstrate that our approach consistently outperforms the baseline throughout the sequential learning of tasks.

We conduct additional experiments with the VPT model, utilizing the weights pre-trained on different datasets as well as different paradigms, as shown in Figure 3 . The pre-training paradigms and datasets include: naive classification on ImageNet-1k [30], DINO [3] on ImageNet-1k, MIIL [29] on ImageNet21k-P and CLIP on LAION-2B [6] (we only use its image encoder). As can be seen from the figure, our approach not only significantly enhances accuracy but also markedly mitigates forgetting. These results further demonstrate the generalizability of the proposed approach.

Figure 3: Results of utilizing different pre-training datasets and paradigms. The blue and yellow bars represent accuracy and forgetting, respectively. The upward arrows indicate the accuracy increasing from VPT-Seq to VPT-NSP2, whereas the downward arrows denote the reduction in forgetting.
Table 2: Comparison with existing methods that use the pre-trained ViT-B/16 on ImageNet-21k. The standard deviations are also reported if available. Missing results in the corresponding papers are denoted as "-". The results marked with †  and ‡  are implemented by [11] and [10], respectively. The highest accuracies are in bold, and the second highest accuracies are underlined.
Method Venue 10S-CIFAR-100 20S-CIFAR-100 10S-ImageNet-R 10S-DomainNet
Acc. Forgetting Acc. Forgetting Acc. Forgetting Acc. Forgetting
L2P [40] CVPR’22 83.83±0.04 7.63±0.30 80.10±0.72 - 61.57±0.66 9.73±0.47 81.17±0.83 8.98±1.25
DualPrompt [39] ECCV’22 86.51±0.33 5.16±0.09 82.02±0.32 - 68.13±0.49 4.68±0.20 81.70±0.78 8.04±0.31
CODA-P [32] CVPR’23 86.25±0.74 1.67±0.26 - - 75.45±0.56 1.64±0.10 80.04±0.79 10.16±0.35
ESN [38] AAAI’23 86.34±0.52 4.76±0.14 80.56±0.94 - 62.61±0.96 - 79.22±2.04 10.62±2.12
APG [33] ICCV’23 89.35 6.01 88.64 6.51 73.27 8.59 - -
LAE [10] ICCV’23 85.59±0.46 - 83.93±0.28 - 72.66±0.63 - - -
DualP-LGCL [15] ICCV’23 87.23±0.21 5.10±0.15 - - 69.46±0.04 4.20±0.06 - -
C-LN [23] ICCVW’23 86.95±0.37 6.98±0.43 - - 76.36±0.51 8.31±1.28 - -
EvoPrompt [18] AAAI’24 87.97±0.30 2.60±0.42 84.64±0.14 3.98±0.24 76.83±0.08 2.78±0.06 - -
OVOR-Deep [12] ICLR’24 85.99±0.89 6.42±2.03 - - 76.11±0.21 7.16±0.34 - -
DualP-PGP [26] ICLR’24 86.92±0.05 5.35±0.19 83.74±0.01 7.91±0.15 69.34±0.05 4.53±0.04 - -
InfLoRA [20] CVPR’24 87.06±0.25 - - - 75.65±0.14 - - -
EASE [45] CVPR’24 87.76 - 85.80 - 76.17 - - -
CPrompt [11] CVPR’24 87.82±0.21 5.06±0.50 - - 77.14±0.11 5.97±0.68 82.97±0.34 7.45±0.93
VPT-NSP2 This work 91.74±0.63 3.28±0.45 89.89±0.72 4.91±0.59 78.88±0.50 5.06±0.26 83.54±0.77 8.54±0.48

Comparison with Existing Methods: We compare our method with existing methods in Table 2, where the competitors include many recent works. The proposed VPT-NSP2 achieves state-of-the-art performance on the four benchmarks, surpassing the second-best approach by an average of 1.49% in accuracy. The forgetting of our approach is not the lowest, which is reasonable since our approach sacrifices some stability for a better trade-off between stability and plasticity. The superior accuracy demonstrates the effectiveness of our method.

Table 3: Ablation studies of each component in our approach on the four benchmarks.
$\mathcal{B}_{1}$ $\mathcal{B}_{2}$ $\mathcal{L}_{\mathrm{LN}}$ 10S-CIFAR-100 20S-CIFAR-100 10S-ImageNet-R 10S-DomainNet
Acc. \uparrow Forgetting \downarrow Acc. \uparrow Forgetting \downarrow Acc. \uparrow Forgetting \downarrow Acc. \uparrow Forgetting \downarrow
87.27 12.33 82.36 17.36 72.46 19.41 73.28 25.65
\surd 90.58 6.91 88.13 10.27 78.05 8.14 82.31 10.89
\surd 88.74 10.85 83.32 16.48 74.71 14.69 78.87 17.81
\surd \surd 91.33 4.22 88.96 6.42 78.37 6.25 83.17 8.95
\surd \surd 91.42 3.94 88.46 8.64 78.30 6.31 83.13 9.32
\surd \surd 89.36 9.32 86.67 11.59 75.27 13.35 79.45 16.50
\surd \surd \surd 91.74 3.28 89.89 4.91 78.88 5.06 83.54 8.54

Ablation Study: The two consistency conditions Eq. (24) and (25), along with the constraint Eq. (20), constitute the main components of our approach. They correspond to $\mathcal{B}_{1}$ and $\mathcal{B}_{2}$ in Eq. (26), and $\mathcal{L}_{\mathrm{LN}}$ in Eq. (27). We study their effects on the four benchmarks using VPT-NSP2, with results presented in Table 3. We can see that the projection for Affinity ($\mathcal{B}_{1}$) plays a crucial role, bringing a 3.31%-9.03% improvement in accuracy and a 5.42%-14.76% reduction in forgetting. Furthermore, the projection for Aggregation ($\mathcal{B}_{2}$) and the loss $\mathcal{L}_{\mathrm{LN}}$ for an invariant prompt distribution are indispensable as well for minimizing forgetting. Optimal accuracy is achieved when all three components are applied.

Model Analysis: We analyze the evolution of the training losses on the 10-split CIFAR-100 and 10-split ImageNet-R benchmarks, as shown in Figure 4. Each point on the curve represents the training loss of the data in $\mathcal{T}_{1}$/$\mathcal{T}_{2}$ after the model has been trained on subsequent tasks. As can be seen, the losses of VPT-NSP2 on previous tasks are almost retained, confirming that our approach can effectively mitigate the interference of new tasks on old tasks.

Figure 4: Training loss curves of VPT-NSP2 and VPT-Seq on tasks $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ when the models are trained on sequential tasks.
Figure 5: Effect of the projection matrix weight $\bar{\eta}$ on the accuracy and forgetting for the stability-plasticity trade-off on the four benchmarks.

Trade-off between Stability and Plasticity: We first adaptively determine the nullities $R_{1}$ and $R_{2}$ for $\mathcal{B}_{1}$ and $\mathcal{B}_{2}$ to achieve near-minimum forgetting. On this basis, we assign two weights $\eta_{1}$ and $\eta_{2}$ to the projection matrices to control the trade-off between stability and plasticity: $\Delta{\bf P}=\left[\eta_{2}\mathcal{B}_{2}+(1-\eta_{2}){\bf I}\right]{\bf P}_{\mathcal{G}}\left[\eta_{1}\mathcal{B}_{1}+(1-\eta_{1}){\bf I}\right]$, where ${\bf I}$ denotes the identity matrix. The effect of $\eta_{1}$ and $\eta_{2}$, set to the same value $\bar{\eta}$, is shown in Figure 5. As the weight decreases, the accuracy first increases owing to better plasticity, and then decreases due to worse stability caused by forgetting. This implies that a trade-off can be achieved via the two projection weights.
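A minimal sketch of the weighted projection used for this trade-off; $\eta_{1}=\eta_{2}=1$ recovers the pure null-space update of Eq. (26), while smaller values blend in the unprojected candidate update (the default values used here are assumptions, not the paper's settings).

```python
import numpy as np

def blended_update(P_G, B1, B2, eta1=0.9, eta2=0.9):
    """Trade-off variant of Eq. (26): eta = 1 gives the full null-space projection
    (maximum stability), eta = 0 keeps the raw candidate update (maximum plasticity)."""
    M, D = P_G.shape
    B1_mix = eta1 * B1 + (1.0 - eta1) * np.eye(D)   # blended right projector
    B2_mix = eta2 * B2 + (1.0 - eta2) * np.eye(M)   # blended left projector
    return B2_mix @ P_G @ B1_mix
```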

Table 4: Results for 50 tasks and 100 tasks on the CIFAR-100, ImageNet-R and DomainNet datasets. † indicates no plasticity enhancement, and ‡ indicates using the balanced plasticity enhancement where $\bar{\eta}$ is the default value less than 1. Our approach still outperforms other methods in long sequences of tasks.
Method 50S-CIFAR100 50S-ImageNet-R 50S-DomainNet 100S-ImageNet-R 100S-DomainNet
Acc. Forgetting Acc. Forgetting Acc. Forgetting Acc. Forgetting Acc. Forgetting
L2P 76.19 12.06 48.53 12.99 59.45 11.53 38.87 15.26 50.52 17.66
EvoPrompt 76.60 13.86 68.53 10.03 67.68 10.41 61.84 15.84 56.35 21.39
OVOR 65.69 14.28 60.08 5.86 66.27 7.43 40.49 8.12 47.65 8.91
InfLoRA 61.49 13.68 59.02 11.02 69.96 9.51 38.16 15.11 44.32 17.85
EASE 74.47 9.31 68.17 7.76 61.20 10.01 47.55 8.22 33.08 32.14
CPrompt 74.97 7.45 68.47 8.16 67.87 9.36 56.95 10.20 53.73 12.14
VPT-Seq 70.47 29.21 56.38 37.91 58.39 44.79 49.72 45.53 46.39 49.34
VPT-NSP2† 81.92 6.56 67.32 6.35 70.13 9.92 59.97 10.07 54.44 11.04
VPT-NSP2‡ 82.98 6.66 69.48 6.51 71.28 11.36 62.23 12.13 57.35 13.82

Long-Sequence Continual Learning: We experiment on five benchmarks under the protocols of 50 tasks and 100 tasks to validate that our approach remains effective in the context of long-sequence continual learning. The results are presented in Table 4. Despite lacking the plasticity enhancement, VPT-NSP2 can outperform existing state-of-the-art approaches and especially surpasses L2P by a large margin. This demonstrates that forgetting is still the predominant factor affecting performance in long sequences of tasks. With the plasticity enhancement, VPT-NSP2 achieves a significant increase in accuracy (by 1.1%-2.9%), demonstrating that the plasticity enhancement is effective for learning new knowledge in long-sequence continual learning.

6 Conclusion

In this paper, we study the interference problem of visual prompt tuning in ViTs and propose two consistency conditions which can eliminate the interference in theory under the constraint of an invariant prompt distribution. They guarantee the consistency of Affinity and Aggregation and of the prompt distribution in LayerNorm, respectively, which jointly achieve the consistency objective of the whole ViT layer. We adopt the null-space projection to implement the two conditions and utilize an extra loss to satisfy the constraint. Our experiments on various benchmarks demonstrate the effectiveness of the proposed conditions for anti-forgetting, and our approach achieves state-of-the-art performance.

Limitation Discussion: To simplify the derivation of our consistency conditions, we introduce a constraint of invariant prompt distribution. Although the superior results show that it may not be a very strong assumption, it is not an exact solution.

References

  • [1] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. CoRR, abs/1607.06450, 2016.
  • [2] Benjamin Bowman, Alessandro Achille, Luca Zancato, Matthew Trager, Pramuditha Perera, Giovanni Paolini, and Stefano Soatto. À-la-carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14984–14993, 2023.
  • [3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. In IEEE/CVF International Conference on Computer Vision, pages 9630–9640, 2021.
  • [4] Arslan Chaudhry, Naeemullah Khan, Puneet K. Dokania, and Philip H. S. Torr. Continual Learning in Low-rank Orthogonal Subspaces. In Advances in Neural Information Processing Systems, 2020.
  • [5] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In Tenth international workshop on frontiers in handwriting recognition. Suvisoft, 2006.
  • [6] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible Scaling Laws for Contrastive Language-Image Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  • [7] Ekin D. Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning Augmentation Strategies From Data. In IEEE/CVF International Conference on Computer Vision, pages 113–123, 2019.
  • [8] Danruo Deng, Guangyong Chen, Jianye Hao, Qiong Wang, and Pheng-Ann Heng. Flattening Sharpness for Dynamic Gradient Projection Memory Benefits Continual Learning. In Advances in Neural Information Processing Systems, volume 34, pages 18710–18721, 2021.
  • [9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.
  • [10] Qiankun Gao, Chen Zhao, Yifan Sun, Teng Xi, Gang Zhang, Bernard Ghanem, and Jian Zhang. A Unified Continual Learning Framework with General Parameter-Efficient Tuning. In IEEE/CVF International Conference on Computer Vision, pages 11449–11459, 2023.
  • [11] Zhanxin Gao, Jun Cen, and Xiaobin Chang. Consistent Prompting for Rehearsal-Free Continual Learning. CoRR, abs/2403.08568, 2024.
  • [12] Wei-Cheng Huang, Chun-Fu Chen, and Hsiang Hsu. OVOR: OnePrompt with virtual outlier regularization for rehearsal-free class-incremental learning. In International Conference on Learning Representations, 2024.
  • [13] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual Prompt Tuning. In European Conference on Computer Vision, volume 13693, pages 709–727, 2022.
  • [14] Dahuin Jung, Dongyoon Han, Jihwan Bang, and Hwanjun Song. Generating instance-level prompts for rehearsal-free continual learning. In IEEE/CVF International Conference on Computer Vision, pages 11813–11823, October 2023.
  • [15] Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, and Muhammad Zeshan Afzal. Introducing language guidance in prompt-based continual learning. In IEEE/CVF International Conference on Computer Vision, pages 11429–11439, October 2023.
  • [16] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
  • [17] Yajing Kong, Liu Liu, Zhen Wang, and Dacheng Tao. Balancing Stability and Plasticity Through Advanced Null Space in Continual Learning. In European Conference on Computer Vision, volume 13686, pages 219–236, 2022.
  • [18] Muhammad Rifki Kurniawan, Xiang Song, Zhiheng Ma, Yuhang He, Yihong Gong, Yang Qi, and Xing Wei. Evolving Parameterized Prompt Memory for Continual Learning. In AAAI Conference on Artificial Intelligence, pages 13301–13309, 2024.
  • [19] Zhuowei Li, Long Zhao, Zizhao Zhang, Han Zhang, Di Liu, Ting Liu, and Dimitris N. Metaxas. Steering Prototypes with Prompt-Tuning for Rehearsal-Free Continual Learning. In IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2523–2533, 2024.
  • [20] Yan-Shuo Liang and Wu-Jun Li. InfLoRA: Interference-free low-rank adaptation for continual learning. arXiv preprint arXiv:2404.00228, 2024.
  • [21] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
  • [22] Mark D. McDonnell, Dong Gong, Amin Parvaneh, Ehsan Abbasnejad, and Anton van den Hengel. RanPAC: Random Projections and Pre-trained Models for Continual Learning. In Advances in Neural Information Processing Systems, 2023.
  • [23] Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, and Elisa Ricci. On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers. In IEEE/CVF International Conference on Computer Vision Workshops, pages 3577–3586, 2023.
  • [24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
  • [25] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment Matching for Multi-Source Domain Adaptation. In IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
  • [26] Jingyang Qiao, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yong Peng, and Yuan Xie. Prompt Gradient Projection for Continual Learning. In International Conference on Learning Representations, 2024.
  • [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning, volume 139, pages 8748–8763, 2021.
  • [28] Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990.
  • [29] Tal Ridnik, Emanuel Ben Baruch, Asaf Noy, and Lihi Zelnik. ImageNet-21K Pretraining for the Masses. In NeurIPS Datasets and Benchmarks, 2021.
  • [30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [31] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient Projection Memory for Continual Learning. In International Conference on Learning Representations, 2021.
  • [32] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919, June 2023.
  • [33] Yu-Ming Tang, Yi-Xing Peng, and Wei-Shi Zheng. When Prompt-based Incremental Learning Does Not Meet Strong Pretraining. In IEEE/CVF International Conference on Computer Vision, pages 1706–1716, 2023.
  • [34] Liyuan Wang, Jingyi Xie, Xingxing Zhang, Mingyi Huang, Hang Su, and Jun Zhu. Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality. In Advances in Neural Information Processing Systems, 2023.
  • [35] Runqi Wang, Xiaoyue Duan, Guoliang Kang, Jianzhuang Liu, Shaohui Lin, Songcen Xu, Jinhu Lv, and Baochang Zhang. AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3654–3663, June 2023.
  • [36] Shipeng Wang, Xiaorong Li, Jian Sun, and Zongben Xu. Training Networks in Null Space of Feature Covariance for Continual Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 184–193, June 2021.
  • [37] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for Domain Incremental Learning. In Advances in Neural Information Processing Systems, 2022.
  • [38] Yabin Wang, Zhiheng Ma, Zhiwu Huang, Yaowei Wang, Zhou Su, and Xiaopeng Hong. Isolation and Impartial Aggregation: A Paradigm of Incremental Learning without Interference. In AAAI Conference on Artificial Intelligence, pages 10209–10217, 2023.
  • [39] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer G. Dy, and Tomas Pfister. DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning. In European Conference on Computer Vision, volume 13686, pages 631–648, 2022.
  • [40] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to Prompt for Continual Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, June 2022.
  • [41] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • [42] Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual Learning of Context-Dependent Processing in Neural Networks. Nature Machine Intelligence, 1(8):364–372, August 2019.
  • [43] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model. In IEEE/CVF International Conference on Computer Vision, pages 19091–19101, October 2023.
  • [44] Zhen Zhao, Zhizhong Zhang, Xin Tan, Jun Liu, Yanyun Qu, Yuan Xie, and Lizhuang Ma. Rethinking Gradient Projection Continual Learning: Stability/Plasticity Feature Space Decoupling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3718–3727, June 2023.
  • [45] Da-Wei Zhou, Hai-Long Sun, Han-Jia Ye, and De-Chuan Zhan. Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning. CoRR, abs/2403.12030, 2024.

Appendix: Visual Prompt Tuning in Null Space for Continual Learning

Appendix A Algorithm

Figure 6: Illustration of our algorithm. The input image tokens with prompts are fed into the ViT layer for forward propagation. During optimization, the gradients of the prompts are projected into the direction orthogonal to the subspace of the previous task ${\cal T}_{t-1}$. The projected prompt update is then used to update the prompts for anti-forgetting.

An overview and the algorithm of our approach are provided in Figure 6 and Algorithm 1, respectively. We first initialize the overall uncentered covariance matrices [36] ${\bf C}_{1}$ and ${\bf C}_{2}$, as well as the null-space projection matrices $\mathcal{B}_{1}$ and $\mathcal{B}_{2}$. During training, the cross-entropy loss for classification and the prompt-distribution loss $\mathcal{L}_{\mathrm{LN}}$ are jointly utilized for optimization. Subsequently, we get the candidate prompt update ${\bf P}_{\mathcal{G}}$ computed by the optimizer. Then ${\bf P}_{\mathcal{G}}$ is projected by the null-space projection matrices $\mathcal{B}_{1}$ and $\mathcal{B}_{2}$ to update the prompts. After convergence, we build the matrices ${\bf J}_{1}$ and ${\bf J}_{2}$ to temporarily store ${\bf Q}_{{\bf X}_{t}}{\bf W}_{k}^{\top}$ and ${\bf S}_{{\bf P}_{t}}$ for the data of the current task. They are then used to update the uncentered covariance matrices ${\bf C}_{1}$ and ${\bf C}_{2}$ by addition. Finally, we update the null-space projection matrices using the uncentered covariance matrices, which will be used in the next task.
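A hedged sketch of the end-of-task statistics collection described above, assuming the uncentered covariances of [36] are accumulated as sums of $\mathbf{J}^{\top}\mathbf{J}$ contributions; the exact accumulation rule follows Algorithm 1 and is an assumption of this sketch.

```python
import numpy as np

def update_covariances(C1, C2, per_sample_QWk, per_sample_SP):
    """Accumulate the uncentered covariance matrices after finishing task t.

    per_sample_QWk: list of Q_{X_t} W_k^T matrices, each of shape (N, D).
    per_sample_SP:  list of S_{P_t} matrices, each of shape (N, M).
    Each sample contributes J^T J, following the uncentered covariance of [36];
    the exact accumulation rule in Algorithm 1 may differ in detail.
    """
    for J1, J2 in zip(per_sample_QWk, per_sample_SP):
        C1 += J1.T @ J1        # (D, D)
        C2 += J2.T @ J2        # (M, M)
    return C1, C2
```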

Algorithm 2 shows how a null-space projection matrix is computed. First, the input uncentered covariance matrix {\bf C} is decomposed by SVD, which yields its singular values and right singular vectors. Next, we determine the nullity R (i.e., the dimension of the null space) of {\bf C} according to the maximum second derivative of the singular-value curve, as introduced in Section C. We then select the R right singular vectors corresponding to the R smallest singular values, which are considered close to 0, as the bases of the null space. Finally, we compute the normalized projection matrix, which bounds the scale of the projected gradients and prevents excessive gradient magnitudes. In our implementation, the null-space projection matrix is blended with an identity matrix using a weight \eta (specifically \eta_1 for \mathcal{B}_1 and \eta_2 for \mathcal{B}_2). \eta is a hyper-parameter for the trade-off between stability and plasticity, which is also introduced in Section C.

Algorithm 1 NSP2 for Visual Prompt Tuning
Require:  Datasets \mathcal{D}_t=\{(\mathcal{X}_t^{<i>}, y_t^{<i>})\}_{i=1}^{|\mathcal{T}_t|} for task \mathcal{T}_t\in\{\mathcal{T}_1,\mathcal{T}_2,\cdots\}, ViT model f_{\mathrm{model}}(\cdot|{\bf P}_t) with the prompts {\bf P}_t to be optimized (the classifier is omitted for simplicity), uncentered covariance matrices {\bf C}_1 and {\bf C}_2, projection matrices \mathcal{B}_1 and \mathcal{B}_2
Ensure:  The optimized prompts {\bf P}_t
1:  Initialization: randomly initialize {\bf P}_t; {\bf C}_1={\bf 0}, {\bf C}_2={\bf 0}, \mathcal{B}_1={\bf I}, \mathcal{B}_2={\bf I}
2:  for task \mathcal{T}_t\in\{\mathcal{T}_1,\mathcal{T}_2,\cdots\} do
3:     repeat
4:        Sample a mini-batch \boldsymbol{\mathcal{X}}_t,\boldsymbol{y}_t\sim\mathcal{D}_t
5:        Obtain the prediction \boldsymbol{\hat{y}}_t\leftarrow f_{\mathrm{model}}(\boldsymbol{\mathcal{X}}_t|{\bf P}_t)
6:        Compute the classification loss \mathcal{L}_{total}\leftarrow\mathrm{CrossEntropy}(\boldsymbol{\hat{y}}_t,\boldsymbol{y}_t)
7:        if t>1 then
8:           Compute the prompt-distribution loss \mathcal{L}_{\mathrm{LN}} by Eq. (27)
9:           Accumulate the losses \mathcal{L}_{total}\leftarrow\mathcal{L}_{total}+\mathcal{L}_{\mathrm{LN}}
10:        end if
11:        Get the candidate prompt update {\bf P}_{\mathcal{G}} from the optimizer using the loss \mathcal{L}_{total}
12:        if t>1 then
13:           Compute the prompt update \Delta{\bf P}\leftarrow\mathcal{B}_2{\bf P}_{\mathcal{G}}\mathcal{B}_1 by the null-space projection in Eq. (26)
14:        else
15:           Directly adopt the candidate prompt update \Delta{\bf P}\leftarrow{\bf P}_{\mathcal{G}}
16:        end if
17:        Update the prompts by {\bf P}_t\leftarrow{\bf P}_t-learning\_rate\times\Delta{\bf P}
18:     until convergence
19:     Initialize two temporary matrices {\bf J}_1=[~] and {\bf J}_2=[~]
20:     for \mathcal{X}_t^{<i>}\in\mathcal{D}_t do
21:        Get the matrices ({\bf Q}_{{\bf X}_t}{\bf W}_k^{\top})^{<i>} and {\bf S}_{{\bf P}_t}^{<i>} by the forward propagation f_{\mathrm{model}}(\mathcal{X}_t^{<i>}|{\bf P}_t)
22:        Update {\bf J}_1 and {\bf J}_2 by concatenating ({\bf Q}_{{\bf X}_t}{\bf W}_k^{\top})^{<i>} with {\bf J}_1 and {\bf S}_{{\bf P}_t}^{<i>} with {\bf J}_2, respectively
23:     end for
24:     Update the uncentered covariance matrices {\bf C}_1\leftarrow{\bf C}_1+{\bf J}_1^{\top}{\bf J}_1 and {\bf C}_2\leftarrow{\bf C}_2+{\bf J}_2^{\top}{\bf J}_2
25:     Compute the null-space projection matrices \mathcal{B}_1 and \mathcal{B}_2 by Algorithm 2 using {\bf C}_1 and {\bf C}_2
26:  end for
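To make the projection step of Algorithm 1 concrete, the following PyTorch sketch shows how a candidate prompt update could be projected by \mathcal{B}_1 and \mathcal{B}_2 and how the uncentered covariances could be accumulated after convergence. It is a minimal sketch under assumed shapes (M prompts of dimension D); the names `project_prompt_update`, `B1`, `B2`, `J1` and `J2` are ours for illustration and do not reproduce the released implementation.

```python
import torch

def project_prompt_update(P_grad, B1, B2):
    # Eq. (26): Delta_P = B2 @ P_G @ B1 projects the candidate update into the
    # (approximate) null space built from previous tasks' statistics.
    return B2 @ P_grad @ B1

# Illustrative shapes: M = 4 prompts per layer, embedding dimension D = 768.
M, D = 4, 768
prompts = torch.randn(M, D, requires_grad=True)
B1, B2 = torch.eye(D), torch.eye(M)     # identity projections leave task 1 unconstrained
lr = 0.01

loss = prompts.pow(2).sum()             # placeholder for the actual training loss
loss.backward()
with torch.no_grad():
    delta = project_prompt_update(prompts.grad, B1, B2)
    prompts -= lr * delta               # projected prompt update
    prompts.grad = None

# After convergence on task t: stack the per-sample matrices and accumulate the
# uncentered covariances used to build B1 and B2 for the next task.
J1 = torch.randn(1000, D)               # placeholder for stacked Q_X W_k^T rows
J2 = torch.randn(1000, M)               # placeholder for stacked S_P rows
C1 = torch.zeros(D, D) + J1.T @ J1      # C1 <- C1 + J1^T J1
C2 = torch.zeros(M, M) + J2.T @ J2      # C2 <- C2 + J2^T J2
```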
Algorithm 2 Computing Null-Space Projection Matrix
Require:  Uncentered covariance matrix {\bf C}, hyper-parameter \eta\in[0,1] for the trade-off between stability and plasticity (introduced in Section C)
Ensure:  Null-space projection matrix \mathcal{B}
1:  Get the singular values \varLambda in descending order and the corresponding right singular vectors {\bf U} by singular value decomposition \varLambda,{\bf U}^{\top}\leftarrow\mathrm{SVD}({\bf C}), where the left singular vectors are omitted
2:  Calculate the nullity R by the maximum second derivative as introduced in Eq. (28)
3:  Select the right singular vectors corresponding to the R smallest singular values in {\bf U} as {\bf U}_0\leftarrow{\bf U}_{[D-R:D]}
4:  Compute the projection matrix \mathcal{B}\leftarrow\frac{{\bf U}_0{\bf U}_0^{\top}}{\|{\bf U}_0{\bf U}_0^{\top}\|_{\mathrm{F}}}
5:  Update \mathcal{B} with the weight \eta by \mathcal{B}\leftarrow\eta\mathcal{B}+(1-\eta){\bf I} (corresponding to Eq. (29))
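For reference, a possible PyTorch rendering of Algorithm 2 is sketched below; the function name and the use of `torch.linalg` are our own choices, with the adaptive nullity following Eq. (28) and the blending following Eq. (29).

```python
import torch

def null_space_projection(C: torch.Tensor, eta: float) -> torch.Tensor:
    """Compute the blended null-space projection matrix from an uncentered
    covariance matrix C, following the steps of Algorithm 2."""
    # Step 1: SVD. For a symmetric PSD matrix the right singular vectors
    # coincide with its eigenvectors; singular values come in descending order.
    _, S, Vh = torch.linalg.svd(C)
    U = Vh.T                                     # right singular vectors as columns
    D = S.numel()

    # Step 2: adaptive nullity via the maximum second derivative (Eq. (28)).
    second_diff = S[:-2] - 2 * S[1:-1] + S[2:]   # entries correspond to j = 2, ..., D-1
    elbow = int(torch.argmax(second_diff)) + 1   # 0-based index of lambda_{j*} in S
    R = D - (elbow + 1)                          # Eq. (28): R = D - j*

    # Step 3: bases of the (approximate) null space.
    U0 = U[:, D - R:]

    # Step 4: normalized projection matrix.
    B = U0 @ U0.T
    B = B / torch.linalg.norm(B, ord='fro')

    # Step 5: blend with the identity to trade stability for plasticity (Eq. (29)).
    return eta * B + (1 - eta) * torch.eye(D, dtype=C.dtype)

# Usage in the outer loop of Algorithm 1, e.g.:
# B1 = null_space_projection(C1, eta=0.97)
# B2 = null_space_projection(C2, eta=0.97)
```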

Appendix B Experimental Setups and Implementation Details

Models: We validate our approach on the Vision Transformer (ViT) [9] and CLIP [27] models, both of which use a ViT-Base/16 backbone [9]. The ViT is pre-trained on ImageNet-21k, and we insert 4 prompts into each of the 12 layers for fine-tuning, which is referred to as "VPT" [13]. The classifiers are trained separately for each task, and the orthogonal projection is not applied to them; during inference, the classifiers of all available tasks are concatenated to make predictions. For the CLIP model pre-trained on WebImageText, we insert 4 prompts into each of the first 3 layers of the image encoder, while the text encoder is kept frozen. The logit scale, a learnable scalar that scales the cosine similarities between image and text features, is also set to trainable. We observed serious cross-task confusion in the CLIP model. Hence, we follow [43] and use the class-wise mean and covariance of previous features, extracted before the embedding projection head (i.e., the last linear layer of the image encoder), to refine the projection head after the prompt-tuning stage of each task.
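For readers unfamiliar with deep visual prompt tuning, the sketch below illustrates how a fixed number of learnable prompts could be prepended to the token sequence at every ViT layer while the backbone stays frozen. The wrapper class `DeepPromptedViT` is purely illustrative and is not our released implementation.

```python
import torch
import torch.nn as nn

class DeepPromptedViT(nn.Module):
    """Illustrative deep prompt tuning: learnable prompts are prepended to the
    token sequence at each layer and discarded before the next layer."""
    def __init__(self, vit_blocks, embed_dim=768, n_prompts=4):
        super().__init__()
        self.blocks = vit_blocks                  # pre-trained, frozen ViT blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.zeros(n_prompts, embed_dim))
             for _ in range(len(vit_blocks))]
        )

    def forward(self, tokens):                    # tokens: (B, N, D) patch + [CLS] embeddings
        n = self.prompts[0].shape[0]
        for blk, prm in zip(self.blocks, self.prompts):
            prm = prm.unsqueeze(0).expand(tokens.size(0), -1, -1)
            x = torch.cat([prm, tokens], dim=1)   # prepend this layer's prompts
            tokens = blk(x)[:, n:]                # keep only the non-prompt outputs
        return tokens
```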

Benchmarks: We conduct experiments under the class-incremental learning protocol, where the classes in each task are disjoint and the task identity is unknown during inference. Four class-incremental benchmarks built on three widely used datasets are adopted: 10- and 20-split CIFAR-100, 10-split ImageNet-R [39] and 10-split DomainNet [25, 38]. For CIFAR-100, the 100 classes are randomly split into 10 or 20 tasks, which evaluates the ability to handle different numbers of tasks. We follow [39] to randomly split the 200 classes of ImageNet-R into 10 tasks, forming the 10-split ImageNet-R benchmark. For 10-split DomainNet, we follow the protocol of [38] and [11] to select the 200 classes with the most images from the original DomainNet [25] and randomly split them into 10 tasks with 20 classes per task. 25% of the training samples in each dataset are held out as a validation set for searching the optimal hyper-parameters.

Metrics: Formally, the final average accuracy and final average forgetting are defined as:

\begin{split}
\mathrm{Final~average~accuracy} &= \frac{1}{T}\sum_{i=1}^{T} a_{T,i},\\
\mathrm{Final~average~forgetting} &= \frac{1}{T-1}\sum_{i=1}^{T-1}\max_{j\in\{1,2,\cdots,T-1\}}\left(a_{j,i}-a_{T,i}\right),
\end{split}

where T is the number of tasks, a_{T,i} is the accuracy of the final (T-th) model on the samples of the i-th task, and a_{j,i} is the accuracy of the j-th model on the samples of the i-th task.
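As a concrete reading of these definitions, the helper below computes both metrics from a matrix of per-task accuracies; storing a_{j,i} as `acc[j-1][i-1]` is our own convention for this example.

```python
def final_average_accuracy(acc):
    """acc[j][i] = a_{j+1, i+1}: accuracy of the model after task j+1 on task i+1."""
    T = len(acc)
    return sum(acc[T - 1][i] for i in range(T)) / T

def final_average_forgetting(acc):
    T = len(acc)
    return sum(
        max(acc[j][i] for j in range(T - 1)) - acc[T - 1][i]
        for i in range(T - 1)
    ) / (T - 1)

# Toy example with T = 2 tasks: a_{1,1} = 90, a_{2,1} = 85, a_{2,2} = 95.
acc = [[90.0, 0.0], [85.0, 95.0]]       # acc[0][1] is never used
print(final_average_accuracy(acc))      # (85 + 95) / 2 = 90.0
print(final_average_forgetting(acc))    # 90 - 85 = 5.0
```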

Higher accuracy means the model performs better, while lower forgetting indicates stronger stability (i.e., the ability to retain old knowledge). However, lower forgetting does not always yield higher accuracy, since accuracy is also affected by plasticity (i.e., the ability to learn new knowledge). Accuracy is the main metric to focus on, as it reflects the classification performance in practice.

Implementation Details: For all datasets and models, the input images are resized to 224\times 224 pixels and augmented by AutoAugment [7] during training. For the VPT-based models, we use the Adam optimizer [16] with \beta_1=0.9, \beta_2=0.999 and a weight decay of 5\times 10^{-5} to train for 100 epochs with an initial learning rate of 0.01 and a batch size of 256 on all benchmarks. The learning rate is scaled by a factor of 0.1 at the 50-th and 80-th epochs. The training loss consists of the cross-entropy loss for classification and the loss \mathcal{L}_{\mathrm{LN}} in Eq. (27), whose coefficient is set to 1. Through cross-validation on the validation set, we set the temperature in the cross-entropy loss to 28, 25, 30 and 30 for the 10-split CIFAR-100, 20-split CIFAR-100, 10-split ImageNet-R and 10-split DomainNet benchmarks, respectively. The two hyper-parameters \eta_1 and \eta_2 control the trade-off between stability and plasticity in the null-space projection, as introduced in Section C; we set both of them to 0.97, 0.95, 0.94 and 0.95 for the four benchmarks, respectively, by cross-validation.

For the CLIP-based models, the training settings differ as follows. We train for 20 epochs with a batch size of 220 and a learning rate of 0.001, which decays at the 10-th and 16-th epochs. The temperatures are all set to 1 since the logit scale is trainable. \eta_1 and \eta_2 are both set to 0.98, which works well across all benchmarks. We refine the embedding projection head for 50 epochs using the SGD optimizer with a learning rate of 0.001, a momentum of 0.9 and a weight decay of 1\times 10^{-4}.

We implement our approach in PyTorch [24] with the timm library [41]. The experiments are performed on a server with 128 GB RAM and four NVIDIA RTX 4090 GPUs. Each experiment finishes within three hours.

Appendix C Trade-off between Stability and Plasticity

Given that the null space of a covariance matrix does not always exist in practice, Wang et al. [36] suggest approximating it by selecting the bases whose associated singular values approach zero, i.e., the singular values smaller than a specified multiple (denoted as \gamma in our paper) of the smallest one. However, we experimentally find that this strategy, and the rule of thumb for selecting \gamma, are not suitable for prompt tuning in ViTs when determining the nullities R_1 and R_2 of the uncentered covariance matrices {\bf C}_1 and {\bf C}_2 in Algorithm 1. To solve this problem, we propose an adaptive strategy to determine the nullities. Exploiting the observation that the curve of descending singular values forms an "L" shape, we split the curve at the point where its slope changes fastest, so as to cover most of the small singular values. This is realized by finding the maximum second derivative over the points:

\left\{\begin{aligned}
R_1 &= D-\arg\max_j\{\lambda_{j-1}-2\lambda_j+\lambda_{j+1}\}_{j=2}^{D-1},\\
R_2 &= M-\arg\max_j\{\lambda_{j-1}-2\lambda_j+\lambda_{j+1}\}_{j=2}^{M-1},
\end{aligned}\right.    (28)

where \lambda_j denotes the j-th singular value. We find that this choice reaches near-minimum forgetting in our experiments, i.e., near-optimal stability. Furthermore, to enhance plasticity, we fuse the projection matrices with identity matrices using the weights \eta_1\in[0,1] and \eta_2\in[0,1], which should be close to 1:

\Delta{\bf P}=\left[\eta_2\mathcal{B}_2+(1-\eta_2){\bf I}\right]{\bf P}_{\mathcal{G}}\left[\eta_1\mathcal{B}_1+(1-\eta_1){\bf I}\right].    (29)

In this way, we trade off stability against plasticity by enhancing plasticity on top of near-optimal stability, with \eta_1 and \eta_2 as the controlling hyper-parameters.
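As a toy illustration of Eq. (28), the elbow of an "L"-shaped singular-value curve can be located by the maximum second difference; the numbers below are made up for illustration.

```python
# Descending singular values with an obvious "L" shape (D = 8, made-up values).
lams = [9.0, 8.5, 8.0, 0.6, 0.05, 0.04, 0.03, 0.02]
# Second derivative at j = 2, ..., D-1 (1-based): lambda_{j-1} - 2*lambda_j + lambda_{j+1}
second = [lams[j - 1] - 2 * lams[j] + lams[j + 1] for j in range(1, len(lams) - 1)]
j_star = second.index(max(second)) + 2   # 1-based index of the elbow: here j* = 4
R = len(lams) - j_star                   # nullity R = D - j* = 4 near-zero singular values
```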

Appendix D Comparison with PGP

D.1 Difference in Methods

The main differences between our method and PGP [26] are summarized as follows. (1) We derive a different consistency condition for Affinity, even when the LayerNorm operation and the bias terms in the qkv-transformation are ignored. Specifically, our simplified consistency condition for Affinity is {\bf X}_t{\bf W}_q{\bf W}_k^{\top}\Delta{\bf P}^{\top}={\bf 0}, in contrast to {\bf X}_t\Delta{\bf P}^{\top}={\bf 0} in PGP. (2) We analyze the consistency conditions for the complete self-attention, i.e., \mathrm{softmax}(\frac{{\bf Q}_{\bf X}{\bf K}_{\bf Z}^{\top}}{\sqrt{D}}){\bf V}_{\bf Z}, which contains the Aggregation operation, whereas PGP does not account for Aggregation. (3) We take the LayerNorm before self-attention into consideration and propose an invariant prompt-distribution constraint, which is ignored in PGP.
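The difference in the simplified Affinity condition can also be checked numerically: if a prompt update is projected so that {\bf X}_t{\bf W}_q{\bf W}_k^{\top}\Delta{\bf P}^{\top}={\bf 0}, the affinity between image queries and prompt keys stays unchanged. The toy check below uses random matrices and an exact SVD-based null-space projector; all names and shapes are illustrative.

```python
import torch

torch.manual_seed(0)
N, D, M = 16, 32, 4                          # image tokens, embedding dim, prompts
dt = torch.float64                           # double precision for a tight check
X, Wq, Wk = (torch.randn(s, D, dtype=dt) for s in (N, D, D))
P, dP = torch.randn(M, D, dtype=dt), torch.randn(M, D, dtype=dt)

A = X @ Wq @ Wk.T                            # condition: A @ dP.T = 0
_, _, Vh = torch.linalg.svd(A)               # full SVD: Vh is (D, D)
U0 = Vh[N:].T                                # null-space bases of A (rank(A) = N < D here)
dP_proj = dP @ (U0 @ U0.T)                   # project the update onto the null space

aff_old = X @ Wq @ (P @ Wk).T                # Q_X K_P^T before the update
aff_new = X @ Wq @ ((P + dP_proj) @ Wk).T    # ... and after the projected update
print(torch.allclose(aff_old, aff_new, atol=1e-8))   # True: the Affinity is preserved
```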

In conclusion, we conduct a comprehensive analysis of prompt tuning for the consistency objective, which provides a complete guarantee of eliminating the interference of new tasks on previous tasks. As demonstrated in the ablation study in the Experiment section, the consistency of Aggregation and LayerNorm also contributes to reducing forgetting, and thus they should not be ignored. We compare the performance of PGP and our approach in the next subsection.

D.2 Performance Comparison

We compare with PGP [26] using the VPT-Seq and L2P [40] baselines on the four benchmarks. The results are shown in Table 5. We apply PGP to VPT (i.e., VPT-PGP) under the same training settings as VPT-NSP2 for a fair comparison. For the L2P-based methods, we insert prompts into the first three layers instead of only the first layer as in the original implementation [40]. An orthogonal projection is also applied to the prompt pool, which is essentially a linear layer in L2P-based models. We follow the training settings of PGP to train the L2P-based methods. The results in Table 5 demonstrate that our full approach achieves larger accuracy improvements and reduces more forgetting than PGP. Even when applying only the projection matrix \mathcal{B}_1 for the Affinity operation, our approach still outperforms PGP, demonstrating the effectiveness of the proposed method in mitigating the interference problem.

Table 5: Comparison with PGP on four benchmarks and two continual-learning baselines. "-\mathcal{B}_1" indicates that only the projection matrix \mathcal{B}_1 is used in our approach. For each benchmark we report Acc. (higher is better) and Forgetting (lower is better).

Method                  | 10S-CIFAR-100      | 20S-CIFAR-100      | 10S-ImageNet-R     | 10S-DomainNet
                        | Acc.   Forgetting  | Acc.   Forgetting  | Acc.   Forgetting  | Acc.   Forgetting
VPT-Seq                 | 87.27  12.33       | 82.36  17.36       | 72.46  19.41       | 73.28  25.65
VPT-PGP                 | 87.76  11.98       | 82.71  16.85       | 73.12  18.92       | 73.98  25.15
VPT-NSP2-\mathcal{B}_1  | 90.58   6.91       | 88.13  10.27       | 78.05   8.14       | 82.31  10.89
VPT-NSP2                | 91.74   3.28       | 89.89   4.91       | 78.88   5.06       | 83.54   8.54
L2P                     | 84.12   6.36       | 81.46   8.69       | 61.25   9.32       | 65.73  10.19
L2P-PGP                 | 84.70   5.96       | 82.04   8.11       | 62.01   8.55       | 66.31   9.63
L2P-NSP2-\mathcal{B}_1  | 86.39   4.60       | 82.99   7.34       | 64.10   7.17       | 67.48   8.21
L2P-NSP2                | 86.78   4.22       | 83.37   6.93       | 64.66   6.84       | 68.14   7.79