Appendix of “ConFiT: Fine-Tuning Continually without Representational Shift”
1 Proofs
1.1 Proof of Theorem 1
Theorem 1.
Suppose denotes a 2D-Conv layer with and , and its output has a shape of . Then:
Proof.
Without loss of generality, we suppose the Conv layer does not have a bias parameter. We denote its kernel size as , and its weight parameter as . When the padding size is , the input is zero-padded to . We have:
∎
If , we take as an instance. We can transform the convolution with into 4 convolutions with , by breaking down both and as shown in Figure 1. The theorem then still holds for each of the 4 convolutions. Meanwhile, we need to store 4 pre-convolution means, one for each of the 4 parts of . Similarly, if , we only need to transform the convolution into convolutions with , and store the corresponding pre-convolution means.
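This decomposition can be checked numerically. The following is a minimal sketch, assuming the stride here is 2 (consistent with the 4-way decomposition above) and an even kernel size so that the four sub-kernels have equal shapes: a stride-2 convolution equals the sum of four stride-1 convolutions applied to the four interleaved sub-grids of the input with the matching sub-kernels.

```python
import torch
import torch.nn.functional as F

# Sketch only: verifies the 4-way decomposition of a stride-2 convolution described above.
# The shapes (8x8 input, 4x4 kernel) are illustrative assumptions.
torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)    # input: N x C_in x H x W
w = torch.randn(16, 3, 4, 4)   # kernel: C_out x C_in x K x K (K even)

# Reference: a single convolution with stride 2 (no padding, no bias).
ref = F.conv2d(x, w, stride=2)

# Decomposition: break both the input and the kernel into 4 interleaved parts.
out = torch.zeros_like(ref)
for ri in (0, 1):
    for rj in (0, 1):
        x_sub = x[:, :, ri::2, rj::2]   # sub-grid of the input with offset (ri, rj)
        w_sub = w[:, :, ri::2, rj::2]   # matching sub-kernel
        out += F.conv2d(x_sub, w_sub, stride=1)

print(torch.allclose(ref, out, atol=1e-5))  # True: the two computations agree
```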
Table 1: Dataset statistics.
Dataset | CIFAR100 | CUB200 | Caltech101 | Flowers102
#classes | 100 | 200 | 101 | 102
#tasks | 20 | 10 | 10 | 10
#classes in each task | 5 | 20 | 10 | 10
max #images of a class in training set | 500 | 30 | 640 | 206
min #images of a class in training set | 500 | 29 | 25 | 32
#images in training set | 50000 | 5994 | 6941 | 6551
#images in test set | 10000 | 5794 | 1736 | 1638
As for the cases where , we hypothesize that the pixels at the edges of images are mostly less informative background and can be safely ignored, so the theorem holds with negligible error.
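The role of the stored pre-convolution means and the boundary effect mentioned above can also be illustrated numerically. The sketch below is our reading of this relation rather than a verbatim restatement of Theorem 1: it assumes that the channel-wise spatial mean of a stride-1 Conv output is approximately the channel-wise pre-convolution mean mapped through the spatially summed kernel weights, with the discrepancy coming only from pixels near the border.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: relates the per-channel mean of a Conv output to the per-channel
# mean of its input ("pre-convolution mean") through the spatially summed kernel.
# The exact statement of Theorem 1 is assumed here, not quoted.
torch.manual_seed(0)
x = torch.randn(1, 3, 64, 64) + torch.tensor([1.0, -2.0, 0.5]).view(1, 3, 1, 1)
w = torch.randn(8, 3, 3, 3)

out = F.conv2d(x, w, stride=1, padding=0)   # stride 1, no padding
mu_in = x.mean(dim=(0, 2, 3))               # pre-convolution means, one per input channel
mu_out = out.mean(dim=(0, 2, 3))            # post-convolution means, one per output channel

# Predicted output means: sum the kernel over its spatial extent, then apply it
# to the pre-convolution means as a (C_out x C_in) matrix.
w_summed = w.sum(dim=(2, 3))                # C_out x C_in
mu_pred = w_summed @ mu_in

# The two agree up to a small error contributed by pixels near the image border.
print((mu_out - mu_pred).abs().max().item())
```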
1.2 Proof of Theorem 2
Theorem 2.
Let be the training data, let be the orthogonal bases of the corresponding spaces, and let be the optimal parameters on both the current task and the previous task . If , then after fine-tuning on task , the loss on task is lower bounded:
where denotes the k-th largest singular value, is the alignment between and , and is the distance between and under a rotation.
Proof.
This theorem is an immediate corollary of Theorem 3.1 in lpft, obtained by simply regarding the inputs of task as in-distribution data and the inputs of task as out-of-distribution data. ∎
As for the multi-head setting, we have
in which is the divergence between the two sub-networks corresponding to the two heads. We can regard as a relaxation provided by the multi-head setting.
1.3 Proof of Theorem 3
Theorem 3.
If, for every task : (i) is initialized with , and (ii) there exists such that (i.e., is good enough), then for all tasks :
Proof.
For every task , if there exists such that , then we know from Proposition 3.2 in lpft that . Since an overparameterized model can achieve zero loss, we have . Hence the gradient is also zero, so the feature extractor does not change at all, which means . Inductively, for all tasks we have , which leads to , as desired. ∎
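The key step of this argument, that zero loss implies zero gradient and hence an unchanged feature extractor, can be illustrated with a toy sketch (ours, using an MSE loss on a tiny linear network purely for illustration, not the paper's training setup):

```python
import torch
import torch.nn as nn

# Toy illustration: if the current parameters already achieve zero loss, the gradient
# is zero and a gradient step leaves the feature extractor unchanged.
torch.manual_seed(0)
feature_extractor = nn.Linear(4, 4, bias=False)
head = nn.Linear(4, 2, bias=False)

x = torch.randn(16, 4)
# Construct targets so that the current model is already optimal (zero MSE loss).
with torch.no_grad():
    y = head(feature_extractor(x))

params_before = feature_extractor.weight.clone()
opt = torch.optim.SGD(list(feature_extractor.parameters()) + list(head.parameters()), lr=0.1)

loss = nn.functional.mse_loss(head(feature_extractor(x)), y)
loss.backward()
opt.step()

print(loss.item())                                           # 0.0
print(torch.equal(params_before, feature_extractor.weight))  # True: the extractor did not move
```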
Table 2: Hyperparameter search ranges and the values selected for each dataset.
Method | Hyperparameter Search Range | CIFAR100 | CUB200 | Caltech101 | Flowers102
EWC | {3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5} | 1e3 | 3e3 | 3e4 | 1e5
SI | {1, 2, 5, 10, 20, 50} | 2 | 5 | 10 | 20
RWalk | {5, 10, 20, 50, 100, 200} | 20 | 20 | 50 | 100
MAS | {1, 3, 10, 30, 100, 300} | 10 | 30 | 30 | 30
CPR | {0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5} | 0.2 | 0.02 | 0.01 | 0.1
AFEC | {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000} | 0.01 | 100 | 1000 | 10
2 Experiment Details
2.1 Dataset
For CIFAR100 and CUB200, we use the official train/test split. For Caltech101 and Flowers102, we randomly choose 80% of the images as the training set and the rest as the test set. During the hyperparameter search, we further randomly choose 10% of the images in the training set for validation. The dataset statistics are given in Table 1.
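A hedged sketch of this splitting follows; the dataset class, seed, and use of `random_split` are illustrative assumptions, since the actual splits in our experiments are built through Avalanche.

```python
import torch
from torch.utils.data import random_split
from torchvision.datasets import Caltech101

# Illustrative only: an 80/20 train/test split, plus a 10% validation split carved
# out of the training set for hyperparameter search.
dataset = Caltech101(root="data", download=True)

n_total = len(dataset)
n_train = int(0.8 * n_total)
generator = torch.Generator().manual_seed(0)   # assumed seed
train_set, test_set = random_split(dataset, [n_train, n_total - n_train], generator=generator)

n_val = int(0.1 * len(train_set))
train_subset, val_set = random_split(train_set, [len(train_set) - n_val, n_val], generator=generator)
```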
2.2 Implementation
All experiments are implemented with PyTorch 1.10 and CUDA 10.2 on an NVIDIA 1080Ti GPU. The datasets and dataloaders are built via Avalanche Avalanche. We use the ResNet18 implemented by timm timm, and its pre-trained parameters are downloaded from torchvision. The baselines are reproduced on the basis of the official code of cpr; afec; ccll, in which we only modify the dataset interfaces.
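A hedged sketch of this backbone setup is given below; whether torchvision's parameter names map one-to-one onto the timm ResNet18 is an assumption here, so the returned key lists should be checked.

```python
import timm
import torchvision

# Sketch: build the timm ResNet18 and initialize it with torchvision's ImageNet weights.
model = timm.create_model("resnet18", pretrained=False, num_classes=0)  # num_classes=0: feature extractor only

tv_state = torchvision.models.resnet18(pretrained=True).state_dict()
# strict=False because the classifier head (and possibly a few buffer names) differ
# between the two implementations; inspect the missing/unexpected keys that are returned.
missing, unexpected = model.load_state_dict(tv_state, strict=False)
print(missing, unexpected)
```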
2.3 Data Pre-processing
For each image, we pre-process it as follows (PyTorch style):
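As an assumed, typical ImageNet-style pipeline for a torchvision-pretrained ResNet18 (the resize/crop sizes and normalization constants below are our assumptions, not necessarily the values used in the paper):

```python
from torchvision import transforms

# Assumed, typical preprocessing for an ImageNet-pretrained ResNet18;
# not necessarily the exact transforms used in the paper.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```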
There is no extra data augmentation applied.
2.4 Hyperparameters Search
For each baseline, we extensively search for the best hyperparameter, as shown in Table 2. For each method, we train the model for a fixed number of epochs, which depends on the complexity of the dataset. We randomly choose 10% of the images in the training set for validation, exclude them from training, and use them to determine the best hyperparameters. After the hyperparameter search, we retrain the model on the whole training set and report the results on the test set. Note that our ConFiT does not need a hyperparameter search.
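A hedged sketch of this protocol; `train_model` and `evaluate` below are hypothetical stand-ins, not functions from the released code.

```python
import random

# Illustrative grid-search protocol matching the description above.
def train_model(data, hyperparameter, epochs):
    return {"hyperparameter": hyperparameter}   # placeholder "model"

def evaluate(model, data):
    return random.random()                      # placeholder validation accuracy

def search_hyperparameter(candidates, train_data, val_data, test_data, epochs=10):
    best_value, best_score = None, float("-inf")
    for value in candidates:
        model = train_model(train_data, value, epochs)   # train without the validation images
        score = evaluate(model, val_data)                # validate to pick the hyperparameter
        if score > best_score:
            best_value, best_score = value, score
    # Retrain on the whole training set (train + validation) with the selected value.
    final_model = train_model(train_data + val_data, best_value, epochs)
    return best_value, evaluate(final_model, test_data)
```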
3 Difference between IRS and Internal Covariate Shift
We notice that there is another BN-related concept called Internal Covariate Shift (ICS). To avoid confusion, we clarify here the difference between IRS and ICS.
Firstly, we emphasize that ICS and IRS are totally different concepts. ICS describes the instability of the distribution of intermediate representations during training, i.e., the inputs of intermediate layers keep changing as training proceeds, which impedes the learning of the network. One of the motivations of BN is to address ICS, so as to make the network easier to train.
IRS, in contrast, is defined in the scenario of continual learning: the intermediate representations of the data of previous tasks are shifted because the network is fitting the data of the newest task. The shifted representations disrupt the function of BN at test time, since the running moments are no longer representative of the true moments of the intermediate representations.
Overall, BN can mitigate ICS, but it suffers from IRS in continual learning, which is what we have attempted to address.
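The distinction can be made concrete with a toy sketch (ours, for illustration only): a BN layer whose running moments are updated while fitting a new, differently distributed task ends up normalizing the old task's representations with stale moments.

```python
import torch
import torch.nn as nn

# Toy illustration of IRS: BN running moments estimated on task A drift after
# training continues on task B, so task A's data is mis-normalized at test time.
torch.manual_seed(0)
bn = nn.BatchNorm1d(8)

task_a = torch.randn(512, 8)               # "old task" representations
task_b = torch.randn(512, 8) * 3.0 + 5.0   # "new task" representations with shifted moments

bn.train()
for _ in range(100):
    bn(task_a)                             # running moments converge to task A's statistics
mean_after_a = bn.running_mean.clone()

for _ in range(100):
    bn(task_b)                             # continual training: moments drift toward task B
mean_after_b = bn.running_mean.clone()

bn.eval()
normalized_a = bn(task_a)                  # task A is now normalized with stale running moments
print(mean_after_a[:3], mean_after_b[:3])
print(normalized_a.mean().item())          # far from 0: the old task is no longer well normalized
```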