
Non-Exemplar Online Class-incremental Continual Learning via
Dual-prototype Self-augment and Refinement
——Appendix——

Fushuo Huo1, Wenchao Xu1, Jingcai Guo1,2, Haozhao Wang3 (Corresponding author), Yunfeng Fan1

Overview

The appendix presents additional experimental settings, results, and analyses as follows:

Appendix A: More implementation details.

Appendix B: More results on different dataset partitions.

Appendix C: Results and analysis on different base session training strategies.

Appendix D: Hyperparameter analysis.

Appendix E: Computation overhead analysis.

Appendix A: More Implementation Details

Dataset Overview. We conduct experiments on three widely used datasets: CORE-50 (Lomonaco and Maltoni 2017), CIFAR100 (Krizhevsky and Hinton 2009), and Mini-ImageNet (Vinyals et al. 2016). Here we give brief introductions. CORE-50 is a benchmark designed for class-incremental learning with 50 classes. Each class has around 2,398 training images and 900 testing images of size $3\times 128\times 128$. CIFAR100 contains 60,000 images of size $32\times 32$ from 100 classes; each class includes 500 training images and 100 testing images. Mini-ImageNet contains 100 classes and is divided into 10 sub-datasets for 10 disjoint tasks, each containing 10 classes. Each task comprises 5,000 training images and 1,000 testing images, all of size $3\times 84\times 84$.

Training Details. For OCL methods, we employ the same dataset partitions and training protocols as NO-CL, i.e., pre-training on the base classes and then online class-incremental learning with exemplar buffers. Other hyperparameters are kept at their default values. The exemplar buffers are stored and retrieved throughout the training procedure with the default updating pipeline. MIR (Aljundi et al. 2019), GDumb (Prabhu, Torr, and Dokania 2020), ASER (Shim et al. 2021), SCR (Mai et al. 2021), and DVC (Gu et al. 2022) are based on the OCL codebase (https://github.com/RaptorMai/online-continual-learning); the remaining methods are implemented with their publicly released code. For the FS-CL methods FACT (Zhou et al. 2022a) and ALICE (Peng et al. 2022), we also adopt the same protocols as NO-CL. The prototypes of novel classes are computed from all data samples rather than few-shot samples. During the inference phase, FACT and ALICE directly classify incremental data samples via the computed prototypes without finetuning the network. For NE-CL methods (Zhu et al. 2021b, 2022), as (Zhu et al. 2022) does not provide training scripts, we adopt the third-party code (Zhou et al. 2022b, https://github.com/G-U-N/PyCIL) on the CIFAR100 dataset. For (Zhu et al. 2021b), we employ the same dataset partitions and training and testing protocols as NO-CL. Note that all methods employ the same reduced ResNet-18 as the feature extractor for fair comparison. All experiments are conducted on an NVIDIA RTX3090 GPU with CUDA 11.4 using the PyTorch framework.
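For reference, a minimal sketch of this prototype-based inference is given below. It assumes generic feature tensors and nearest-class-mean classification; the function names are ours for illustration and do not come from the released code.

```python
import torch

@torch.no_grad()
def compute_prototypes(features, labels, num_classes):
    # Class prototype = mean feature over all samples of that class
    # (all available samples, not a few-shot subset).
    return torch.stack([features[labels == c].mean(dim=0)
                        for c in range(num_classes)])

@torch.no_grad()
def prototype_predict(query_features, prototypes):
    # Nearest-class-mean inference: assign each query to the class
    # whose prototype is closest in Euclidean distance.
    dists = torch.cdist(query_features, prototypes)  # (N, C)
    return dists.argmin(dim=1)
```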

Base Session Training Details. Here, we give the details of our base session training strategy. We apply base training losses to the outputs of the feature extractor and the projection module to obtain vanilla and high-dimensional prototypes for the sequential online sessions: $L^{base}=L^{base}_{vp}+L^{base}_{hp}$, where $L^{base}_{vp}=Loss(Proj_{vp}(\theta_{1}(x)),y)$ and $L^{base}_{hp}=Loss(Proj_{hp}(\theta_{2}(\theta_{1}(x))),y)$. Here $x$, $y$, $\theta_{1}$, and $\theta_{2}$ denote input samples, labels, the feature extractor, and the projection module, respectively. $Proj_{vp/hp}$ are linear layers that align the vanilla and high-dimensional prototypes for loss calculation. For the cross-entropy (CE) loss, $Proj_{vp/hp}$ are one-layer MLPs whose output dimension equals the number of base classes. For the supervised contrastive (SC) loss (Khosla et al. 2020), we follow SCR (Mai et al. 2021) and adopt its hyperparameters: $Proj_{vp/hp}$ are two-layer MLPs with dimensions of 160 and 128, and the temperature is set to 0.1.
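A minimal PyTorch sketch of the CE variant of this objective is given below. The module structure is illustrative, not the released implementation: we assume a single linear layer for the projection module $\theta_{2}$, pass the feature extractor in as a generic backbone, and pick example dimensions (hd_dim=2048 follows Table 5).

```python
import torch.nn as nn
import torch.nn.functional as F

class BaseSessionModel(nn.Module):
    # Sketch of the base-session objective L^base = L^base_vp + L^base_hp.
    def __init__(self, backbone, feat_dim=160, hd_dim=2048, num_base=60):
        super().__init__()
        self.theta1 = backbone                         # feature extractor
        self.theta2 = nn.Linear(feat_dim, hd_dim)      # projection module (assumed linear)
        self.proj_vp = nn.Linear(feat_dim, num_base)   # Proj_vp: one-layer head
        self.proj_hp = nn.Linear(hd_dim, num_base)     # Proj_hp: one-layer head

    def base_loss(self, x, y):
        f = self.theta1(x)                             # vanilla features
        h = self.theta2(f)                             # high-dimensional features
        loss_vp = F.cross_entropy(self.proj_vp(f), y)  # L^base_vp
        loss_hp = F.cross_entropy(self.proj_hp(h), y)  # L^base_hp
        return loss_vp + loss_hp                       # L^base
```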

Appendix B: Results on Different Dataset Partitions

Due to space constraints in the main paper, in this subsection we report additional results on different dataset partitions. Concretely, as shown in Tables 1 and 2, we conduct experiments with the configurations $40\%+6\%\times 10$ and $80\%+2\%\times 10$, where $40\%$ and $80\%$ of the classes are selected as base classes and the remaining classes are continually fed to the network over 10 sessions. Moreover, a setting with more incremental sessions (i.e., 20 sessions), $60\%+2\%\times 20$, is compared in Table 3. Five representative state-of-the-art methods are compared under the same training and inference protocols as Non-exemplar Online Class-incremental continual Learning (NO-CL), including Online Class-incremental continual Learning (OCL) methods (i.e., SCR (Mai et al. 2021), OCM (Guo, Liu, and Zhao 2022), and DVC (Gu et al. 2022)), a Non-Exemplar Class-incremental continual Learning (NE-CL) method (i.e., PASS (Zhu et al. 2021b)), and a Few-Shot Class-incremental Learning (FS-CL) method (i.e., ALICE (Peng et al. 2022)). As Tables 1 and 2 show, fewer base classes result in poorer performance on both base and novel classes, because the network lacks sufficient pre-trained information to generalize to novel classes. Meanwhile, since our method depends on the inner-prototype computed by the pre-trained backbone, a well-trained backbone substantially benefits prototype refinement. Notably, even with only $40\%$ base classes, our method still achieves the best performance in the Acc and HM metrics, which validates the robustness of the prototype refinement strategies. Moreover, with more incremental sessions in Table 3, the performance drops only slightly. Overall, experiments on different dataset partitions validate the effectiveness and robustness of our method.
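For clarity, the class partitions used above can be generated as in the following sketch (the helper name and the fixed class ordering are our assumptions):

```python
def make_partition(num_classes=100, base_frac=0.4, num_sessions=10):
    # e.g. 40% + 6% x 10: 40 base classes, then 6 novel classes per session.
    num_base = int(num_classes * base_frac)
    per_session = (num_classes - num_base) // num_sessions
    base_classes = list(range(num_base))
    sessions = [list(range(num_base + s * per_session,
                           num_base + (s + 1) * per_session))
                for s in range(num_sessions)]
    return base_classes, sessions

# 40% + 6% x 10 on a 100-class dataset: 40 base classes, 10 sessions of 6.
base_classes, sessions = make_partition(100, 0.4, 10)
```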

Methods CORE-50 CIFAR100 Mini-ImageNet
Metrics Acc(base/novel)||HM Acc(base/novel)||HM Acc(base/novel)||HM
ALICE 35.0(49.3/25.5)||33.6 37.1(59.2/22.3)||32.4 36.5(58.6/21.8)||31.8
PASS 25.2(62.8/0.2)||0.4 26.4(65.1/0.6)||1.2 25.8(63.4/0.8)||1.6
MS 1000 2000 1000 2000 1000 2000
SCR 37.3(36.9/37.5)||37.2 38.7(38.9/38.6)||38.7 32.7(34.2/31.7)||32.9 34.9(36.9/33.6)||35.2 29.8(29.1/30.3)||29.7 36.3(39.2/34.4)||36.6
SCRft 34.2(45.2/26.8)||33.6 37.7(49.8/29.6)||37.1 30.9(45.2/21.3)||21.3 36.4(50.8/26.8)||35.1 32.9(43.2/26.1)||32.5 36.8(50.8/27.5)||35.7
OCM 37.6(37.5/37.6)||37.5 39.5(40.5/38.9)||39.7 31.6(32.9/30.8)||31.8 37.1(36.1/37.8)||36.9 31.3(30.8/31.6)||31.2 31.7(32.9/30.9)||31.9
OCMft 35.8(43.1/30.9)||36.0 36.7(46.1/30.6)||36.7 32.4(48.2/21.8)||30.1 35.9(50.6/26.1)||36.8 30.2(38.1/24.9)||31.4 37.7(51.2/28.7)||36.8
DVC 37.5(36.9/37.9)||37.3 38.7(39.8/38.0)||38.9 29.9(30.0/29.8)||29.9 37.2(34.6/39.0)||36.6 29.7(29.9/29.6)||29.7 33.3(35.6/31.8)||33.6
DVCft 36.7(45.8/30.6)||36.7 37.9(45.1/33.1)||38.2 32.3(43.6/24.7)||31.5 36.5(46.5/29.8)||36.3 31.4(39.4/26.1)||31.3 32.6(38.5/28.6)||32.8
Ours 45.5(44.2/46.4)||45.3 38.6(43.8/35.2)||39.0 40.4(55.4/30.4)||39.3
Table 1: Quantitative analysis of the dataset partition $40\%+6\%\times 10$. Class-wise accuracy (Acc) at the end of training on all classes, base classes, and novel classes, together with the harmonic accuracy (HM), is reported. MS and ft denote the exemplar memory size and the finetuning variants, respectively. The best results are marked in bold.
Methods CORE-50 CIFAR100 Mini-ImageNet
Metrics Acc(base/novel)||HM Acc(base/novel)||HM Acc(base/novel)||HM
ALICE 44.7(48.2/30.6)||37.4 50.5(57.2/23.8)||33.6 50.3(56.4/26.1)||35.7
PASS 48.4(60.3/0.6)||1.2 50.4(62.8/0.8)||1.6 49.8(62.0/0.9)||1.8
MS 1000 2000 1000 2000 1000 2000
SCR 39.9(39.2/42.9)||40.9 42.9(43.5/40.8)||42.1 40.7(42.2/34.7)||38.1 46.3(47.2/42.8)||44.9 40.4(40.6/39.7)||40.1 44.5(43.8/47.3)||45.5
SCRft 45.4(48.9/31.2)||38.1 49.7(53.7/33.7)||54.6 47.4(50.6/34.8)||41.2 49.2(53.2/33.4)||41.0 40.2(45.2/32.8)||38.0 47.0(49.2/38.5)||43.5
OCM 43.4(43.8/42.2)||42.9 43.2(43.6/41.8)||42.7 39.9(39.8/40.6)||40.2 43.6(43.9/42.3)||43.1 39.5(38.9/41.8)||40.3 43.5(42.9/46.0)||44.3
OCMft 46.8(49.8/34.8)||40.9 47.8(49.8/39.7)||44.2 46.1(48.8/35.1)||40.8 48.7(51.8/36.3)||42.7 44.1(46.1/36.3)||40.6 46.3(48.6/37.2)||42.1
DVC 43.0(42.8/43.8)||43.3 45.2(45.8/42.8)||44.2 39.4(38.6/42.5)||40.5 40.1(39.7/41.9)||40.8 41.6(42.6/37.4)||39.8 45.4(45.6/44.8)||45.2
DVCft 48.0(50.3/39.0)||43.9 50.3(53.7/36.9)||43.7 41.1(41.6/38.9)||40.2 42.4(43.9/36.4)||39.8 45.3(48.6/31.9)||38.5 47.8(50.8/36.2)||42.2
Ours 55.2(55.6/53.7)||54.6 54.3(56.2/46.8)||51.1 52.2(52.6/50.8)||51.7
Table 2: Quantitative analysis of the dataset partition $80\%+2\%\times 10$. Class-wise accuracy (Acc) at the end of training on all classes, base classes, and novel classes, together with the harmonic accuracy (HM), is reported. MS and ft denote the exemplar memory size and the finetuning variants, respectively. The best results are marked in bold.
Methods CORE-50 CIFAR100 Mini-ImageNet
Metrics Acc(base/novel)||HM Acc(base/novel)||HM Acc(base/novel)||HM
ALICE 39.5(46.2/29.5)||36.0 42.5(53.5/25.9)||34.9 41.1(51.4/25.7)||34.3
PASS 35.2(58.3/0.8)||1.6 37.9(62.6/1.0)||2.0 37.2(61.2/1.1)||2.2
MS 1000 2000 1000 2000 1000 2000
SCR 38.6(37.2/40.6)||38.8 38.6(39.2/37.6)||38.4 36.9(38.8/34.1)||36.3 40.7(42.8/37.6)||40.0 34.4(34.6/34.2)||34.4 35.1(38.6/29.8)||33.6
SCRft 36.6(42.4/27.9)||33.7 41.3(49.8/28.6)||36.3 38.6(49.2/22.8)||31.2 41.1(52.8/23.7)||32.7 37.9(43.2/30.0)||35.4 39.5(43.7/33.2)||37.7
OCM 38.3(38.8/37.6)||38.2 41.1(42.1/39.6)||40.8 36.3(36.5/35.9)||36.2 41.4(42.7/39.6)||41.1 35.9(35.2/37.1)||36.1 39.5(43.7/33.2)||37.7
OCMft 37.7(43.1/29.7)||35.2 43.3(47.9/36.4)||41.4 39.7(45.7/30.7)||36.7 40.9(45.9/33.4)||38.7 36.8(39.8/32.2)||35.4 41.5(43.5/38.4)||40.7
DVC 38.1(37.8/38.6)||38.2 40.5(41.8/38.6)||40.1 37.5(37.2/38.1)||37.7 40.2(41.6/38.2)||39.8 34.4(32.9/36.7)||34.7 37.1(35.4/39.6)||37.4
DVCft 39.8(45.9/30.7)||36.8 41.8(46.3/35.0)||39.9 37.8(42.1/31.4)||35.9 40.6(44.8/34.3)||38.9 35.1(38.9/29.4)||33.4 37.7(38.7/36.1)||37.4
Ours 49.5(49.1/50.2)||49.6 47.6(51.6/41.7)||46.1 49.1(53.8/42.1)||47.2
Table 3: Quantitative analysis of the dataset partition $60\%+2\%\times 20$. Class-wise accuracy (Acc) at the end of training on all classes, base classes, and novel classes, together with the harmonic accuracy (HM), is reported. MS and ft denote the exemplar memory size and the finetuning variants, respectively. The best results are marked in bold.

Appendix C: Results and Analysis on Different Base Session Training Strategies

The stability-plasticity dilemma is a thorny problem in continual learning. To deal with it, previous NE-CL (Zhu et al. 2021b, a) and FS-CL (Peng et al. 2022; Kalla and Biswas 2022) methods employ self-supervised learning (Jing and Tian 2021) as well as class and data augmentation to learn task-agnostic and transferable representations. For the NO-CL problem, the base session training strategy also matters for this dilemma. For fair comparison, similar to (Mai et al. 2021; Gu et al. 2022; Guo, Liu, and Zhao 2022), we also employ supervised contrastive (SC) learning. Here, we evaluate two additional training strategies: adding an extra self-supervised learning loss (Lee, Hwang, and Shin 2020) (+SSL) as in (Zhu et al. 2021b; Kalla and Biswas 2022), and using the data augmentation strategy (+DA) proposed by (Zhu et al. 2021a; Peng et al. 2022); a sketch of the former is given below. The results in Table 4 show that these elaborately designed pre-training strategies improve accuracy on both base and novel classes. Therefore, developing more robust pre-training strategies is a promising direction for the proposed NO-CL problem.
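The sketch below illustrates the +SSL variant with a rotation-based label-augmentation loss in the spirit of (Lee, Hwang, and Shin 2020); the joint classification head and all names are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn.functional as F

def ssl_label_augment_loss(backbone, joint_head, x, y):
    # Rotate each image by 0/90/180/270 degrees and predict the joint
    # (class, rotation) label with a (num_classes * 4)-way head.
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    x_aug = torch.cat(rotations, dim=0)                        # (4B, C, H, W)
    y_joint = torch.cat([y * 4 + k for k in range(4)], dim=0)  # joint labels
    logits = joint_head(backbone(x_aug))                       # (4B, num_classes * 4)
    return F.cross_entropy(logits, y_joint)
```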

Ablations CIFAR100 Mini-ImageNet
Metrics Acc(base/novel) Acc(base/novel)
Ours(+CE) 45.8(50.0/39.6) 47.7(52.6/40.3)
Ours(+SC) 48.6(52.4/42.9) 50.7(56.1/42.6)
+SSL 50.4(53.8/45.2) 52.2(57.3/44.6)
+DA 51.2(55.7/44.6) 53.2(58.2/45.8)
Table 4: Results of different training strategies on CIFAR100 and Mini-ImageNet. SSL and DA denote self-supervised learning and data augmentation, respectively.
Figure 1: Quantitative results of varying $\lambda$.
Figure 2: Quantitative results of varying $T$.
Figure 3: Quantitative results of varying $K_{0}$, $K_{1}$, and $K_{2}$.
CIFAR100 Mini-ImageNet
Ablations Acc(base/novel)||HM Acc(base/novel)||HM
w/ 256 44.8(49.5/37.9)||42.9 47.4(53.6/38.1)||44.5
w/ 1024 47.4(51.8/40.8)||45.7 49.7(55.4/41.1)||47.2
w/ 2048 48.6(52.4/42.9)||47.2 50.7(56.1/42.6)||48.4
w/ 3074 48.4(52.3/42.7)||47.0 50.8(56.3/42.6)||48.4
Table 5: Quantitative results of varying the dimension of the hyperdimensional embedding.
Figure 4: Training losses of the bi-level optimization procedure on (a) CORE-50, (b) CIFAR100, and (c) Mini-ImageNet.
CIFAR100 Mini-ImageNet
Metrics ALICE SSRE SCR DVC OCM Ours ALICE SSRE SCR DVC OCM Ours
Time (s) 512 291 165 126 561 35 793 457 254 194 831 61
Memory (GB) 1.9 3.2 1.8 1.6 12.8 1.4 4.4 6.8 4.1 2.8 21.4 1.9
Table 6: Quantitative comparisons of computation overhead in terms of online training time and memory footprint.

Appendix D: Hyperparameter Analysis

Here, we analyze the hyperparameters, including the feature transform coefficient $\lambda$, the number of online iterations $T$, and the number of sampled prototypes $K$. We provide quantitative results on the Mini-ImageNet dataset in Figures 1, 2, and 3. In addition, experiments varying the dimension of the hyperdimensional embedding are reported in Table 5.

In Figure 1, we can see that $\lambda>1$ leads to degraded performance, as the feature distribution becomes more concentrated close to 0. Meanwhile, decreasing $\lambda$ too much makes the distribution scattered and less aligned with the calibrated Gaussian distribution. Therefore, we set $\lambda$ to 0.5 in our experiments.
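The exact transform is defined in the main paper; purely for intuition, the behavior described above is consistent with a Tukey-style power transform, sketched below under that assumption.

```python
import torch

def power_transform(features, lam=0.5, eps=1e-6):
    # Assumed Tukey-style power transform, for intuition only: for typical
    # sub-unit non-negative features, lam > 1 concentrates values near 0,
    # while a very small lam scatters them away from the calibrated Gaussian.
    return torch.pow(features.clamp(min=0) + eps, lam)
```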

In Figure 2, we vary the number of online iterations $T$. Too few iterations prevent the network from accommodating novel classes online by refining the hyperdimensional prototypes and aligning the projection module, while more iterations lead to only slight degradation, which validates the plasticity of our method. Therefore, we set $T$ to 20 to achieve a stability-plasticity trade-off.
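A simplified sketch of this online loop is given below; the prototype refinement objective is a stand-in placeholder, and the actual bi-level formulation is given in the main paper.

```python
import torch
import torch.nn.functional as F

def online_refine(proj, hd_protos, vanilla_protos, refine_loss, T=20, lr=1e-3):
    # Alternate for T iterations between (i) refining the hyperdimensional
    # prototypes and (ii) aligning the projection module to them; the
    # feature extractor stays frozen throughout.
    hd_protos = hd_protos.clone().requires_grad_(True)
    proto_opt = torch.optim.SGD([hd_protos], lr=lr)
    proj_opt = torch.optim.SGD(proj.parameters(), lr=lr)
    for _ in range(T):
        proto_opt.zero_grad()
        refine_loss(hd_protos).backward()   # stand-in refinement objective
        proto_opt.step()
        proj_opt.zero_grad()
        F.mse_loss(proj(vanilla_protos), hd_protos.detach()).backward()
        proj_opt.step()
    return hd_protos.detach(), proj
```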

In Figure 3, we vary the number of sampled prototypes $K$. Concretely, we vary the number of sampled prototypes of the base classes $K_{base}$, the novel classes $K_{novel}$, and all classes $K$. We can see that imbalanced sampling across classes leads to performance degradation, which resembles the class imbalance problem (Hou et al. 2019; Wu et al. 2019). Also, increasing $K$ brings few gains while inducing computation overhead. Therefore, we set $K=20$.

From Table 5, we can see that increasing the dimension of the hyperdimensional embedding benefits the proposed method, while an overly large dimension brings little gain. Therefore, we set the dimension of the hyperdimensional embedding to 2048 in our experiments.

Appendix E: Computation Overhead Analysis

Computation overhead during online learning is a common concern in OCL scenarios (Fini et al. 2020); we therefore provide quantitative comparisons with OCL, NE-CL, and FS-CL methods in Table 6. The batch size of exemplar-based methods is set to 10. Since we only align prototypes by finetuning the projection module, which is much more efficient than training the whole network, our method has clear advantages in computation overhead for online continual learning, as the sketch below illustrates. Meanwhile, the bi-level optimization converges quickly, as shown in Figure 4.
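This efficiency stems from updating only the lightweight projection module during online sessions; below is a minimal sketch, reusing the illustrative backbone and projection names from the Appendix A sketch.

```python
import torch

# Only the lightweight projection module is updated during online sessions;
# the backbone is frozen, keeping per-step time and memory low (cf. Table 6).
for p in backbone.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.SGD(proj.parameters(), lr=0.01)
```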

References

  • Aljundi et al. (2019) Aljundi, R.; Belilovsky, E.; Tuytelaars, T.; Charlin, L.; Caccia, M.; Lin, M.; and Page-Caccia, L. 2019. Online Continual Learning with Maximal Interfered Retrieval. In NeurIPS.
  • Fini et al. (2020) Fini, E.; Lathuilière, S.; Sangineto, E.; Nabi, M.; and Ricci, E. 2020. Online Continual Learning under Extreme Memory Constraints. In ECCV, 720–735.
  • Gu et al. (2022) Gu, Y.; Yang, X.; Wei, K.; and Deng, C. 2022. Not Just Selection, but Exploration: Online Class-Incremental Continual Learning via Dual View Consistency. In CVPR, 7442–7451.
  • Guo, Liu, and Zhao (2022) Guo, Y.; Liu, B.; and Zhao, D. 2022. Online Continual Learning through Mutual Information Maximization. In ICML.
  • Hou et al. (2019) Hou, S.; Pan, X.; Loy, C. C.; Wang, Z.; and Lin, D. 2019. Learning a Unified Classifier Incrementally via Rebalancing. In CVPR.
  • Jing and Tian (2021) Jing, L.; and Tian, Y. 2021. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE TPAMI, 43(11): 4037–4058.
  • Kalla and Biswas (2022) Kalla, J.; and Biswas, S. 2022. S3C: Self-Supervised Stochastic Classifiers for Few-Shot Class-Incremental Learning. In ECCV, 432–448. Cham.
  • Khosla et al. (2020) Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; and Krishnan, D. 2020. Supervised Contrastive Learning. In NeurIPS, 18661–18673.
  • Krizhevsky and Hinton (2009) Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. In Technical Report.
  • Lee, Hwang, and Shin (2020) Lee, H.; Hwang, S. J.; and Shin, J. 2020. Self-supervised Label Augmentation via Input Transformations. In ICML, 5714–5724.
  • Lomonaco and Maltoni (2017) Lomonaco, V.; and Maltoni, D. 2017. CORe50: a New Dataset and Benchmark for Continuous Object Recognition. In CoRL, 17–26.
  • Mai et al. (2021) Mai, Z.; Li, R.; Kim, H.; and Sanner, S. 2021. Supervised Contrastive Replay: Revisiting the Nearest Class Mean Classifier in Online Class-Incremental Continual Learning. In CVPR Workshops, 3589–3599.
  • Peng et al. (2022) Peng, C.; Zhao, K.; Wang, T.; Li, M.; and Lovell, B. C. 2022. Few-Shot Class-Incremental Learning from an Open-Set Perspective. In ECCV, 382–397.
  • Prabhu, Torr, and Dokania (2020) Prabhu, A.; Torr, P. H. S.; and Dokania, P. K. 2020. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In ECCV, 524–540.
  • Shim et al. (2021) Shim, D.; Mai, Z.; Jeong, J.; Sanner, S.; Kim, H.; and Jang, J. 2021. Online Class-Incremental Continual Learning with Adversarial Shapley Value. AAAI.
  • Vinyals et al. (2016) Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching Networks for One Shot Learning. In Lee, D.; Sugiyama, M.; Luxburg, U.; Guyon, I.; and Garnett, R., eds., NeurIPS.
  • Wu et al. (2019) Wu, Y.; Chen, Y.; Wang, L.; Ye, Y.; Liu, Z.; Guo, Y.; and Fu, Y. 2019. Large Scale Incremental Learning. In CVPR.
  • Zhou et al. (2022a) Zhou, D.-W.; Wang, F.-Y.; Ye, H.-J.; Ma, L.; Pu, S.; and Zhan, D.-C. 2022a. Forward Compatible Few-Shot Class-Incremental Learning. In CVPR, 9046–9056.
  • Zhou et al. (2022b) Zhou, D.-W.; Wang, F.-Y.; Ye, H.-J.; and Zhan, D.-C. 2022b. PyCIL: A Python Toolbox for Class-Incremental Learning. SCIENCE CHINA Information Sciences.
  • Zhu et al. (2021a) Zhu, F.; Cheng, Z.; Zhang, X.-y.; and Liu, C.-l. 2021a. Class-Incremental Learning via Dual Augmentation. In NeurIPS, 14306–14318.
  • Zhu et al. (2021b) Zhu, F.; Zhang, X.-Y.; Wang, C.; Yin, F.; and Liu, C.-L. 2021b. Prototype Augmentation and Self-Supervision for Incremental Learning. In CVPR, 5871–5880.
  • Zhu et al. (2022) Zhu, K.; Zhai, W.; Cao, Y.; Luo, J.; and Zha, Z.-J. 2022. Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning. In CVPR, 9296–9305.