
Appendix of “ConFiT: Fine-Tuning Continually without Representational Shift”

1 Proofs

1.1 Proof of Theorem 1

Theorem 1.

Suppose $Conv(\cdot)$ denotes a 2D-Conv layer with $stride=1$ and $padding=kernel\ size-1$, and its output has a shape of $B\times C^{\prime}\times H^{\prime}\times W^{\prime}$. Then:

\[
\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(\boldsymbol{a})_{b,:,h,w}=\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(AvgPool_{DP}(\boldsymbol{a}))_{b,:,h,w}
\]
Proof.

Without loss of generality, we suppose the Conv layer does not have a bias parameter. We denote its kernel size as $K$ and its weight parameter as $\boldsymbol{W}\in\mathbb{R}^{C\times C^{\prime}\times K\times K}$. When the padding size is $K-1$, the input $\boldsymbol{a}$ is zero-padded to $\boldsymbol{a}^{\prime}\in\mathbb{R}^{B\times C\times(H+2K-2)\times(W+2K-2)}$. We have:

\begin{align*}
&\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(\boldsymbol{a})_{b,:,h,w}\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}\sum_{c=1}^{C}\sum_{i=1}^{K}\sum_{j=1}^{K}\boldsymbol{W}_{c,:,i,j}\,\boldsymbol{a}^{\prime}_{b,c,h+i-1,w+j-1}\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\sum_{i=1}^{K}\sum_{j=1}^{K}\Big(\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}\boldsymbol{a}^{\prime}_{b,c,h+i-1,w+j-1}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\Big(\sum_{i=1}^{K}\sum_{j=1}^{K}\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H}\sum_{w=1}^{W}\boldsymbol{a}_{b,c,h,w}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\Big(\sum_{i=1}^{K}\sum_{j=1}^{K}\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H}\sum_{w=1}^{W}AvgPool_{DP}(\boldsymbol{a})_{b,c,h,w}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{c=1}^{C}\sum_{i=1}^{K}\sum_{j=1}^{K}\Big(\boldsymbol{W}_{c,:,i,j}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}AvgPool_{DP}(\boldsymbol{a})^{\prime}_{b,c,h+i-1,w+j-1}\Big)\\
=\;&\frac{1}{BH^{\prime}W^{\prime}}\sum_{b=1}^{B}\sum_{h=1}^{H^{\prime}}\sum_{w=1}^{W^{\prime}}Conv(AvgPool_{DP}(\boldsymbol{a}))_{b,:,h,w}
\end{align*}

Here the third and fifth equalities hold because, with $padding=K-1$, every shifted window indexed by $(i,j)$ covers the entire unpadded input, so the inner sum over the zero-padded tensor equals the sum over the original tensor for every $(i,j)$.
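As a sanity check of this derivation, the short PyTorch snippet below (not part of the paper's code) verifies the invariance the proof relies on: with $stride=1$, $padding=K-1$ and no bias, the per-output-channel mean of $Conv(\boldsymbol{a})$ depends only on the per-channel sums of $\boldsymbol{a}$. The channel-wise constant tensor a_bar stands in for $AvgPool_{DP}(\boldsymbol{a})$ (whose exact definition is given in the main paper); all that matters here is that it preserves the per-channel sums.

import torch
import torch.nn as nn

# Sanity check: replacing a by a channel-wise constant tensor with the same
# per-channel mean leaves the per-output-channel means of Conv(a) unchanged.
B, C, C_out, H, W, K = 4, 3, 6, 8, 8, 3
conv = nn.Conv2d(C, C_out, kernel_size=K, stride=1, padding=K - 1, bias=False)

a = torch.randn(B, C, H, W)
a_bar = a.mean(dim=(0, 2, 3), keepdim=True).expand_as(a)  # same per-channel sum as a

with torch.no_grad():
    mean_a = conv(a).mean(dim=(0, 2, 3))        # per-output-channel mean over (B, H', W')
    mean_a_bar = conv(a_bar).mean(dim=(0, 2, 3))

assert torch.allclose(mean_a, mean_a_bar, atol=1e-4)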

Figure 1: Breaking down $\boldsymbol{a}$ and $\boldsymbol{W}$ to transform a convolution with $stride=2$ into 4 convolutions with $stride=1$.

If $stride\neq 1$, we take $stride=2$ as an instance. We can transform the convolution with $stride=2$ into 4 convolutions with $stride=1$ by breaking down both $\boldsymbol{a}$ and $\boldsymbol{W}$ as in Figure 1, and the theorem still holds for each of the 4 convolutions. Meanwhile, we need to store 4 pre-convolution means with respect to the 4 parts of $\boldsymbol{a}$ (see the sketch below). Similarly, if $stride=m$, we just need to transform the convolution into $m^{2}$ convolutions with $stride=1$ and store $m^{2}$ pre-convolution means.
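For concreteness, the following sketch (our own illustration with a hypothetical function name, assuming no padding for simplicity) shows how a stride-2 convolution can be computed as the sum of 4 stride-1 convolutions over the 4 parts of $\boldsymbol{a}$ and the matching parts of $\boldsymbol{W}$:

import torch
import torch.nn.functional as F

def conv_stride2_as_four_stride1(a, weight):
    # Illustrative decomposition (no padding): split the input and the kernel
    # into their even/odd spatial sub-grids and sum the 4 stride-1 convolutions.
    B, C, H, W = a.shape
    K = weight.shape[-1]
    out_h, out_w = (H - K) // 2 + 1, (W - K) // 2 + 1
    out = a.new_zeros(B, weight.shape[0], out_h, out_w)
    for p in (0, 1):
        for q in (0, 1):
            a_pq = a[:, :, p::2, q::2]        # one of the 4 parts of a
            w_pq = weight[:, :, p::2, q::2]   # matching part of the kernel
            if 0 in w_pq.shape[-2:]:
                continue                      # empty sub-kernel (e.g. K = 1)
            y = F.conv2d(a_pq, w_pq, stride=1)
            out += y[:, :, :out_h, :out_w]    # crop to the stride-2 output size
    return out

# sanity check against a single stride-2 convolution
a, w = torch.randn(2, 3, 8, 8), torch.randn(5, 3, 3, 3)
assert torch.allclose(conv_stride2_as_four_stride1(a, w), F.conv2d(a, w, stride=2), atol=1e-4)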

Dataset                                    CIFAR100   CUB200   Caltech101   Flowers102
#classes                                   100        200      101          102
#tasks                                     20         10       10           10
#classes in each task                      5          20       10           10
max #images of a class in training set     500        30       640          206
min #images of a class in training set     500        29       25           32
#images in training set                    50000      5994     6941         6551
#images in test set                        10000      5794     1736         1638

Table 1: Main statistics of datasets.

As for the cases where $padding\neq K-1$, we hypothesize that the pixels at the edges of images are less informative background, which can be safely ignored, and thus the theorem holds with negligible error.

1.2 Proof of Theorem 2

Theorem 2.

Let $X=\{\boldsymbol{x}\,|\,\boldsymbol{x}\in\mathcal{T}_{t}\}$ be the training data, $\mathcal{S}=SpanSpace(X)$ and $\mathcal{R}=RowSpace(\boldsymbol{B}_{t-1}^{*})$ be orthogonal bases of the corresponding spaces, and $(\boldsymbol{B}_{*},\boldsymbol{v}_{*})$ be the optimal parameters on both the current task $t$ and a previous task $t^{\prime}$. If $\sigma_{k}(\mathcal{R}^{\top}\mathcal{S}^{\perp})>0$, then after fine-tuning on task $t$, the loss on task $t^{\prime}$ is lower bounded:

\[
\sqrt{\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t}^{*})}\geq\frac{\sigma_{k}(\mathcal{R}^{\top}\mathcal{S}^{\perp})}{\sqrt{k}}\cdot\frac{\min(\phi,\phi^{2}/\|\boldsymbol{B}_{*}\boldsymbol{v}_{*}\|_{2})}{(1+\|\boldsymbol{B}_{*}\boldsymbol{v}_{*}\|_{2})^{2}}-\epsilon
\]

where $\sigma_{k}$ denotes the $k$-th largest singular value, $\phi^{2}=|({\boldsymbol{v}_{t-1}^{*}}^{\top}\boldsymbol{v}_{*})^{2}-(\boldsymbol{v}_{*}^{\top}\boldsymbol{v}_{*})^{2}|$ is the alignment between $\boldsymbol{v}_{t-1}^{*}$ and $\boldsymbol{v}_{*}$, and $\epsilon=\min_{\boldsymbol{U}}\|\boldsymbol{B}_{t-1}^{*}-\boldsymbol{U}\boldsymbol{B}_{*}\|^{2}_{2}$ is the distance between $\boldsymbol{B}_{t-1}^{*}$ and $\boldsymbol{B}_{*}$ under a rotation.

Proof.

This theorem follows immediately from Theorem 3.1 in lpft by regarding the inputs of task $t$ as in-distribution data and the inputs of task $t^{\prime}$ as out-of-distribution data. ∎

As for the multi-head setting, we have

\begin{align*}
\sqrt{\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t^{\prime}}^{*})}&=\sqrt{\max_{\|\boldsymbol{x}\|\leq 1}\big(\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}\boldsymbol{x}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t^{\prime}}^{*}\boldsymbol{x}\big)^{2}}\\
&=\|\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t^{\prime}}^{*}\|_{2}\\
&\geq\|\boldsymbol{v}_{t}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t^{\prime}}^{*}\|_{2}-\|\boldsymbol{v}_{t}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}\|_{2}\\
&=\sqrt{\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t}^{*})}-\epsilon_{2}
\end{align*}

in which $\epsilon_{2}=\|\boldsymbol{v}_{t}^{*\top}\boldsymbol{B}_{t}^{*}-\boldsymbol{v}_{t^{\prime}}^{*\top}\boldsymbol{B}_{t}^{*}\|_{2}$ is the divergence between the two sub-networks corresponding to the two heads. We can regard $\epsilon_{2}$ as a relaxation provided by the multi-head setting.

1.3 Proof of Theorem 3

Theorem 3.

If for all tasks $t^{\prime}\leq t$: (i) $\boldsymbol{v}$ is initialized with $\boldsymbol{v}_{t^{\prime}}^{lp}$, and (ii) there exists $\boldsymbol{v}_{0}$ such that $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{0},\boldsymbol{v}_{0})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})$ (i.e., $\boldsymbol{B}_{0}$ is good enough), then for all tasks $t^{\prime}\leq t$:

\[
\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t^{\prime}}^{*})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})
\]
Proof.

For any task $t^{\prime}\leq t$, if $\boldsymbol{B}_{t^{\prime}-1}^{*}=\boldsymbol{B}_{0}$, then there exists $\boldsymbol{v}_{0}$ such that $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}-1}^{*},\boldsymbol{v}_{0})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})$, and we know from Proposition 3.2 in lpft that $\boldsymbol{v}_{t^{\prime}}^{lp}=\boldsymbol{v}_{0}$. Since an overparameterized model can achieve zero loss, we have $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}-1}^{*},\boldsymbol{v}_{t^{\prime}}^{lp})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{0},\boldsymbol{v}_{0})=0$. Hence the gradient is also zero, so the feature extractor does not change at all, which means $\boldsymbol{B}_{t^{\prime}}^{*}=\boldsymbol{B}_{0}$. Then, by induction, for all tasks $t^{\prime}\leq t$ we have $\boldsymbol{B}_{t^{\prime}}^{*}=\boldsymbol{B}_{0}$, which leads to $\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t}^{*},\boldsymbol{v}_{t^{\prime}}^{*})=\mathcal{L}_{t^{\prime}}(\boldsymbol{B}_{t^{\prime}}^{*},\boldsymbol{v}_{t^{\prime}}^{*})$ as desired. ∎

Method   Hyperparameter   Search Range                                    CIFAR100   CUB200   Caltech101   Flowers102
EWC      $\lambda$        {3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5}             1e3        3e3      3e4          1e5
SI       $c$              {1, 2, 5, 10, 20, 50}                           2          5        10           20
RWalk    $\lambda$        {5, 10, 20, 50, 100, 200}                       20         20       50           100
MAS      $\lambda$        {1, 3, 10, 30, 100, 300}                        10         30       30           30
CPR      $\beta$          {0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5}        0.2        0.02     0.01         0.1
AFEC     $\lambda_{e}$    {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000}     0.01       100      1000         10

Table 2: Results of hyperparameter search.

2 Experiment Details

2.1 Dataset

For CIFAR100 and CUB200, we use the official train/test split. For Caltech101 and Flowers102, we randomly choose 80% of the images as the training set and the rest as the test set. During hyperparameter search, we further randomly choose 10% of the images in the training set for validation. The main dataset statistics are given in Table 1.
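For illustration, a minimal sketch of these splits (not the paper's actual pipeline, which is built with Avalanche; the fixed seed is our own assumption) could look as follows:

import torch
from torch.utils.data import random_split
from torchvision.datasets import Caltech101

# 80/20 train/test split, plus a 10% validation subset carved out of the
# training set for hyperparameter search (illustrative only).
full = Caltech101(root="data", download=True)
n_train = int(0.8 * len(full))
train_set, test_set = random_split(
    full, [n_train, len(full) - n_train],
    generator=torch.Generator().manual_seed(0))

n_val = int(0.1 * len(train_set))
train_subset, val_set = random_split(
    train_set, [len(train_set) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))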

2.2 Implementation

All experiments are implemented using PyTorch 1.10 with CUDA 10.2 on an NVIDIA 1080Ti GPU. The datasets and dataloaders are built with Avalanche Avalanche. We use the ResNet18 implemented by timm timm, and its pre-trained parameters are downloaded from torchvision. The baselines are reproduced based on the official code of cpr; afec; ccll, in which we only modify the dataset interfaces.
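A sketch of this backbone setup (not the paper's exact code) is given below; it assumes that the parameter names of timm's and torchvision's ResNet18 line up, and uses strict=False to surface any that do not:

import timm
from torchvision.models import resnet18

# timm ResNet18 initialized from torchvision's ImageNet weights.
model = timm.create_model("resnet18", pretrained=False)
tv_state = resnet18(pretrained=True).state_dict()
missing, unexpected = model.load_state_dict(tv_state, strict=False)
print(missing, unexpected)   # ideally both empty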

2.3 Data Pre-processing

Each image is pre-processed with the following torchvision transforms:

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4850, 0.4560, 0.4060],
                         std=[0.2290, 0.2240, 0.2250]),
])

No extra data augmentation is applied.

2.4 Hyperparameters Search

For each baseline, we extensively search for the best hyperparameter; the search ranges and selected values are shown in Table 2. For each method, we train the model for a fixed number of epochs, which depends on the complexity of the dataset. We randomly choose 10% of the images in the training set for validation, exclude them from training, and use them to determine the best hyperparameter. After the hyperparameter search, we retrain the model on the whole training set and report the results on the test set. Note that our ConFiT does not need any hyperparameter search.
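As an illustration, the search protocol amounts to the following sketch; the evaluate callable, which trains for the fixed number of epochs on the reduced training set and returns validation accuracy, is a placeholder for the per-method training code, not the paper's implementation:

from typing import Callable, Iterable, Tuple

def grid_search(candidates: Iterable[float],
                evaluate: Callable[[float], float]) -> Tuple[float, float]:
    # Return the hyperparameter value with the highest validation accuracy.
    best_value, best_acc = None, float("-inf")
    for value in candidates:
        acc = evaluate(value)   # train for the fixed epochs, score on the 10% validation split
        if acc > best_acc:
            best_value, best_acc = value, acc
    return best_value, best_acc

# e.g. EWC's lambda over the search range from Table 2:
# best_lambda, _ = grid_search([3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5], evaluate=...)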

3 Difference between IRS and Internal Covariate Shift

We notice that there is another BN-related concept called Internal Covariate Shift (ICS). To avoid confusion, we clarify here the difference between IRS and ICS.

First, we emphasize that ICS and IRS are totally different concepts. ICS describes the instability of the distribution of intermediate representations during training, i.e., the inputs of intermediate layers keep changing as training proceeds, which impedes the learning of the network. One of the motivations of BN is to address ICS, so as to make the network easier to train.

IRS, in contrast, is defined in the scenario of continual learning: the intermediate representations of previous tasks' data are shifted because the network is fitting the data of the newest task. The shifted representations disrupt the function of BN at test time, since the running moments are no longer representative of the true moments of the intermediate representations.
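As a toy illustration of this effect (our own example, not the paper's code), the snippet below estimates a BN layer's running moments on one distribution and then feeds it shifted representations in eval mode; the output is no longer zero-centered because the stored moments are stale:

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(8)

bn.train()
for _ in range(200):
    bn(torch.randn(64, 8))             # representations seen while learning the old task

bn.eval()
shifted = torch.randn(64, 8) + 3.0      # representations after they have shifted (IRS)
print(bn(shifted).mean().item())        # roughly 3 instead of 0: running moments are stale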

Overall, BN can solve ICS, but suffers from IRS in continual learning, which is exactly the problem we attempt to address.