
Appendix of “ConFiT: Fine-Tuning Continually without Representational Shift”

1 Proofs

1.1 Proof of Theorem 1

Theorem 1.

Suppose $Conv(\cdot)$ denotes a 2D-Conv layer with $stride=1$ and $padding=kernel\ size-1$, and its output has a shape of $B\times C^{\prime}\times H^{\prime}\times W^{\prime}$. Then:

\[
\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(\boldsymbol{a})_{b,:,h,w}=\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(AvgPool_{DP}(\boldsymbol{a}))_{b,:,h,w}
\]
Proof.

Without loss of generality, we suppose the Conv layer does not have a bias parameter. We denote its kernel size as $K$ and its weight parameter as $\boldsymbol{W}\in\mathbb{R}^{C\times C^{\prime}\times K\times K}$. When the padding size is $K-1$, the input $\boldsymbol{a}$ is zero-padded to $\boldsymbol{a}^{\prime}\in\mathbb{R}^{B\times C\times(H+2K-2)\times(W+2K-2)}$. We have:

\begin{align*}
&\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(\boldsymbol{a})_{b,:,h,w}\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}\sum_{c=1}^{C}\sum_{i=1}^{K}\sum_{j=1}^{K}\boldsymbol{W}_{c,:,i,j}\,\boldsymbol{a}^{\prime}_{b,c,h+i-1,w+j-1}\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\sum_{i=1}^{K}\sum_{j=1}^{K}\Big(\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}\boldsymbol{a}^{\prime}_{b,c,h+i-1,w+j-1}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\Big(\sum_{i=1}^{K}\sum_{j=1}^{K}\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H}\sum_{w=1}^{W}\boldsymbol{a}_{b,c,h,w}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\Big(\sum_{i=1}^{K}\sum_{j=1}^{K}\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H}\sum_{w=1}^{W}AvgPool_{DP}(\boldsymbol{a})_{b,c,h,w}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\sum_{i=1}^{K}\sum_{j=1}^{K}\Big(\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}AvgPool_{DP}(\boldsymbol{a})^{\prime}_{b,c,h+i-1,w+j-1}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(AvgPool_{DP}(\boldsymbol{a}))_{b,:,h,w}
\end{align*}

Here the third and fifth equalities hold because, with $padding=K-1$, every shifted window indexed by $(i,j)$ covers the entire unpadded input, so the inner sum over the zero-padded tensor equals the sum over the original tensor for every $(i,j)$.
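As a sanity check of this derivation, the short PyTorch snippet below (not part of the paper's code) verifies the invariance the proof relies on: with $stride=1$, $padding=K-1$ and no bias, the per-output-channel mean of $Conv(\boldsymbol{a})$ depends only on the per-channel sums of $\boldsymbol{a}$. The channel-wise constant tensor a_bar stands in for $AvgPool_{DP}(\boldsymbol{a})$ (whose exact definition is given in the main paper); all that matters here is that it preserves the per-channel sums.

import torch
import torch.nn as nn

# Sanity check: replacing a by a channel-wise constant tensor with the same
# per-channel mean leaves the per-output-channel means of Conv(a) unchanged.
B, C, C_out, H, W, K = 4, 3, 6, 8, 8, 3
conv = nn.Conv2d(C, C_out, kernel_size=K, stride=1, padding=K - 1, bias=False)

a = torch.randn(B, C, H, W)
a_bar = a.mean(dim=(0, 2, 3), keepdim=True).expand_as(a)  # same per-channel sum as a

with torch.no_grad():
    mean_a = conv(a).mean(dim=(0, 2, 3))        # per-output-channel mean over (B, H', W')
    mean_a_bar = conv(a_bar).mean(dim=(0, 2, 3))

assert torch.allclose(mean_a, mean_a_bar, atol=1e-4)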

Figure 1: Breaking down $\boldsymbol{a}$ and $\boldsymbol{W}$ to transform a convolution with $stride=2$ into 4 convolutions with $stride=1$.

If $stride\neq 1$, we take $stride=2$ as an instance. We can transform the convolution with $stride=2$ into 4 convolutions with $stride=1$ by breaking down both $\boldsymbol{a}$ and $\boldsymbol{W}$ as in Figure 1, and the theorem still holds for each of the 4 convolutions. Meanwhile, we need to store 4 pre-convolution means with respect to the 4 parts of $\boldsymbol{a}$ (see the sketch below). Similarly, if $stride=m$, we just need to transform the convolution into $m^{2}$ convolutions with $stride=1$ and store $m^{2}$ pre-convolution means.
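For concreteness, the following sketch (our own illustration with a hypothetical function name, assuming no padding for simplicity) shows how a stride-2 convolution can be computed as the sum of 4 stride-1 convolutions over the 4 parts of $\boldsymbol{a}$ and the matching parts of $\boldsymbol{W}$:

import torch
import torch.nn.functional as F

def conv_stride2_as_four_stride1(a, weight):
    # Illustrative decomposition (no padding): split the input and the kernel
    # into their even/odd spatial sub-grids and sum the 4 stride-1 convolutions.
    B, C, H, W = a.shape
    K = weight.shape[-1]
    out_h, out_w = (H - K) // 2 + 1, (W - K) // 2 + 1
    out = a.new_zeros(B, weight.shape[0], out_h, out_w)
    for p in (0, 1):
        for q in (0, 1):
            a_pq = a[:, :, p::2, q::2]        # one of the 4 parts of a
            w_pq = weight[:, :, p::2, q::2]   # matching part of the kernel
            if 0 in w_pq.shape[-2:]:
                continue                      # empty sub-kernel (e.g. K = 1)
            y = F.conv2d(a_pq, w_pq, stride=1)
            out += y[:, :, :out_h, :out_w]    # crop to the stride-2 output size
    return out

# sanity check against a single stride-2 convolution
a, w = torch.randn(2, 3, 8, 8), torch.randn(5, 3, 3, 3)
assert torch.allclose(conv_stride2_as_four_stride1(a, w), F.conv2d(a, w, stride=2), atol=1e-4)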

Dataset                                    CIFAR100   CUB200   Caltech101   Flowers102
#classes                                   100        200      101          102
#tasks                                     20         10       10           10
#classes in each task                      5          20       10           10
max #images of a class in training set     500        30       640          206
min #images of a class in training set     500        29       25           32
#images in training set                    50000      5994     6941         6551
#images in test set                        10000      5794     1736         1638

Table 1: Main statistics of datasets.

As for the cases where $padding\neq K-1$, we hypothesize that the pixels at the edges of images are less informative background, which can be safely ignored, and thus the theorem holds with negligible error.

1.2 Proof of Theorem 2

Theorem 2.

Let $X=\{\boldsymbol{x}\,|\,\boldsymbol{x}\in\mathcal{T}_{t}\}$ be the training data, $\mathcal{S}=SpanSpace(X)$ and $\mathcal{R}=RowSpace(\boldsymbol{B}_{t-1}^{*})$ be orthogonal bases of the corresponding spaces, and $(\boldsymbol{B}_{*},\boldsymbol{v}_{*})$ be the optimal parameters on both the current task $t$ and a previous task $t^{\prime}$. If $\sigma_{k}(\mathcal{R}^{\top}\mathcal{S}^{\perp})>0$, then after fine-tuning on task $t$, the loss on task $t^{\prime}$ is lower bounded:

\[
\sqrt{\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t}^{*})}\geq\frac{\sigma_{k}(\mathcal{R}^{\top}\mathcal{S}^{\perp})}{\sqrt{k}}\cdot\frac{\min(\phi,\phi^{2}/\|\boldsymbol{B}_{*}\boldsymbol{v}_{*}\|_{2})}{(1+\|\boldsymbol{B}_{*}\boldsymbol{v}_{*}\|_{2})^{2}}-\epsilon
\]

where $\sigma_{k}$ denotes the $k$-th largest singular value, $\phi^{2}=|({\boldsymbol{v}_{t-1}^{*}}^{\top}\boldsymbol{v}_{*})^{2}-(\boldsymbol{v}_{*}^{\top}\boldsymbol{v}_{*})^{2}|$ is the alignment between $\boldsymbol{v}_{t-1}^{*}$ and $\boldsymbol{v}_{*}$, and $\epsilon=\min_{\boldsymbol{U}}\|\boldsymbol{B}_{t-1}^{*}-\boldsymbol{U}\boldsymbol{B}_{*}\|^{2}_{2}$ is the distance between $\boldsymbol{B}_{t-1}^{*}$ and $\boldsymbol{B}_{*}$ under a rotation.

Proof.

This theorem follows immediately from Theorem 3.1 in lpft by regarding the inputs of task $t$ as in-distribution data and the inputs of task $t^{\prime}$ as out-of-distribution data. ∎

As for the multi-head setting, we have

\begin{align*}
\sqrt{\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t^{\prime}}^{*})}&=\sqrt{\max_{\|\boldsymbol{x}\|\leq 1}\big(\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}\boldsymbol{x}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t^{\prime}}^{*}\boldsymbol{x}\big)^{2}}\\
&=\|\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t^{\prime}}^{*}\|_{2}\\
&\geq\|\boldsymbol{v}_{t}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t^{\prime}}^{*}\|_{2}-\|\boldsymbol{v}_{t}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}\|_{2}\\
&=\sqrt{\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t}^{*})}-\epsilon_{2}
\end{align*}

in which $\epsilon_{2}=\|\boldsymbol{v}_{t}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}\|_{2}$ is the divergence between the two sub-networks corresponding to the two heads. We can regard $\epsilon_{2}$ as a relaxation provided by the multi-head setting.

1.3 Proof of Theorem 3

Theorem 3.

If for all tasks $t^{\prime}\leq t$: (i) $\boldsymbol{v}$ is initialized with $\boldsymbol{v}_{t^{\prime}}^{lp}$, and (ii) there exists $\boldsymbol{v}_{0}$ such that $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{0},\boldsymbol{v}_{0})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})$ (i.e., $\boldsymbol{B}_{0}$ is good enough), then for all tasks $t^{\prime}\leq t$:

\[
\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t^{\prime}}^{*})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})
\]
Proof.

For any task $t^{\prime}\leq t$, if $\boldsymbol{B}_{t^{\prime}-1}^{*}=\boldsymbol{B}_{0}$, then there exists $\boldsymbol{v}_{0}$ such that $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}-1}^{*},\boldsymbol{v}_{0})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})$, and we know from Proposition 3.2 in lpft that $\boldsymbol{v}_{t^{\prime}}^{lp}=\boldsymbol{v}_{0}$. Since an overparameterized model can achieve zero loss, we have $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}-1}^{*},\boldsymbol{v}_{t^{\prime}}^{lp})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{0},\boldsymbol{v}_{0})=0$. Hence the gradient is also zero, so the feature extractor does not change at all, which means $\boldsymbol{B}_{t^{\prime}}^{*}=\boldsymbol{B}_{0}$. Then, by induction, for all tasks $t^{\prime}\leq t$ we have $\boldsymbol{B}_{t^{\prime}}^{*}=\boldsymbol{B}_{0}$, which leads to $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t^{\prime}}^{*})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})$ as desired. ∎

Method   Hyperparameter   Search Range                                    CIFAR100   CUB200   Caltech101   Flowers102
EWC      $\lambda$        {3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5}             1e3        3e3      3e4          1e5
SI       $c$              {1, 2, 5, 10, 20, 50}                           2          5        10           20
RWalk    $\lambda$        {5, 10, 20, 50, 100, 200}                       20         20       50           100
MAS      $\lambda$        {1, 3, 10, 30, 100, 300}                        10         30       30           30
CPR      $\beta$          {0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5}        0.2        0.02     0.01         0.1
AFEC     $\lambda_{e}$    {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000}     0.01       100      1000         10

Table 2: Results of hyperparameter search.

2 Experiment Details

2.1 Dataset

For CIFAR100 and CUB200, we use the official train/test split. For Caltech101 and Flowers102, we randomly choose 80% of the images as the training set and the rest as the test set. During hyperparameter search, we further randomly choose 10% of the images in the training set for validation. The main dataset statistics are given in Table 1.
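For illustration, a minimal sketch of these splits (not the paper's actual pipeline, which is built with Avalanche; the fixed seed is our own assumption) could look as follows:

import torch
from torch.utils.data import random_split
from torchvision.datasets import Caltech101

# 80/20 train/test split, plus a 10% validation subset carved out of the
# training set for hyperparameter search (illustrative only).
full = Caltech101(root="data", download=True)
n_train = int(0.8 * len(full))
train_set, test_set = random_split(
    full, [n_train, len(full) - n_train],
    generator=torch.Generator().manual_seed(0))

n_val = int(0.1 * len(train_set))
train_subset, val_set = random_split(
    train_set, [len(train_set) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))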

2.2 Implementation

All experiments are implemented using PyTorch 1.10 with CUDA 10.2 on an NVIDIA 1080Ti GPU. The datasets and dataloaders are built with Avalanche Avalanche. We use the ResNet18 implemented by timm timm, and its pre-trained parameters are downloaded from torchvision. The baselines are reproduced based on the official code of cpr; afec; ccll, in which we only modify the dataset interfaces.
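A sketch of this backbone setup (not the paper's exact code) is given below; it assumes that the parameter names of timm's and torchvision's ResNet18 line up, and uses strict=False to surface any that do not:

import timm
from torchvision.models import resnet18

# timm ResNet18 initialized from torchvision's ImageNet weights.
model = timm.create_model("resnet18", pretrained=False)
tv_state = resnet18(pretrained=True).state_dict()
missing, unexpected = model.load_state_dict(tv_state, strict=False)
print(missing, unexpected)   # ideally both empty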

2.3 Data Pre-processing

Each image is pre-processed with the following torchvision transforms:

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4850, 0.4560, 0.4060],
                         std=[0.2290, 0.2240, 0.2250]),
])

No extra data augmentation is applied.

2.4 Hyperparameters Search

For each baseline, we extensively search for the best hyperparameter; the search ranges and selected values are shown in Table 2. For each method, we train the model for a fixed number of epochs, which depends on the complexity of the dataset. We randomly choose 10% of the images in the training set for validation, exclude them from training, and use them to determine the best hyperparameter. After the hyperparameter search, we retrain the model on the whole training set and report the results on the test set. Note that our ConFiT does not need any hyperparameter search.
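As an illustration, the search protocol amounts to the following sketch; the evaluate callable, which trains for the fixed number of epochs on the reduced training set and returns validation accuracy, is a placeholder for the per-method training code, not the paper's implementation:

from typing import Callable, Iterable, Tuple

def grid_search(candidates: Iterable[float],
                evaluate: Callable[[float], float]) -> Tuple[float, float]:
    # Return the hyperparameter value with the highest validation accuracy.
    best_value, best_acc = None, float("-inf")
    for value in candidates:
        acc = evaluate(value)   # train for the fixed epochs, score on the 10% validation split
        if acc > best_acc:
            best_value, best_acc = value, acc
    return best_value, best_acc

# e.g. EWC's lambda over the search range from Table 2:
# best_lambda, _ = grid_search([3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5], evaluate=...)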

3 Difference between IRS and Internal Covariate Shift

We notice that there is another BN-related concept called Internal Covariate Shift (ICS). To avoid confusion, we clarify here the difference between IRS and ICS.

First, we emphasize that ICS and IRS are totally different concepts. ICS describes the instability of the distribution of intermediate representations during training, i.e., the inputs of intermediate layers keep changing as training proceeds, which impedes the learning of the network. One of the motivations of BN is to address ICS, so as to make the network easier to train.

IRS, in contrast, is defined in the scenario of continual learning: the intermediate representations of previous tasks' data are shifted because the network is fitting the data of the newest task. The shifted representations disrupt the function of BN at test time, since the running moments are no longer representative of the true moments of the intermediate representations.
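As a toy illustration of this effect (our own example, not the paper's code), the snippet below estimates a BN layer's running moments on one distribution and then feeds it shifted representations in eval mode; the output is no longer zero-centered because the stored moments are stale:

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(8)

bn.train()
for _ in range(200):
    bn(torch.randn(64, 8))             # representations seen while learning the old task

bn.eval()
shifted = torch.randn(64, 8) + 3.0      # representations after they have shifted (IRS)
print(bn(shifted).mean().item())        # roughly 3 instead of 0: running moments are stale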

Overall, BN can solve ICS, but suffers from IRS in continual learning, which is exactly the problem we attempt to address.