
Response to Gradient Regularized Contrastive Learning for Continual Domain Adaptation

Response to R2
Q1: Highlight the similarities/differences with GEM in terms of the methodological solution to catastrophic forgetting.
A1: Similarities: GEM and GRCL both regularize gradients to overcome catastrophic forgetting. Differences: (1) GEM addresses class-incremental problems, while GRCL addresses domain-incremental problems with a shared label space. (2) Whereas GEM uses a different classifier layer for every task, the parameters of the classifier layer in GRCL are shared across all tasks. (3) Whereas GEM treats each incremental task independently, in practice we unify all old target domains into one task (Eq. 9), which is not done in GEM.
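For context, a minimal sketch of the shared gradient-regularization idea: project the current gradient so that it does not conflict with a reference gradient computed on old tasks/domains, in the spirit of GEM's constraints. The single-constraint simplification and the function name below are illustrative, not the exact GRCL procedure.

    import torch

    def project_gradient(g_new: torch.Tensor, g_old: torch.Tensor) -> torch.Tensor:
        # If the current gradient conflicts with the old-task reference gradient
        # (negative dot product), remove the conflicting component so the update
        # does not increase the old-task loss; otherwise keep it unchanged.
        dot = torch.dot(g_new, g_old)
        if dot < 0:
            g_new = g_new - (dot / (g_old.norm() ** 2 + 1e-12)) * g_old
        return g_new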
Q2: Highlighting how target-domain semantic labels are obtained would be important.
A2: Semantic labels are obtained by k-means clustering. There are two reasons why the pseudo labels are accurate. (1) After adaptation to the target domain, the features are discriminative, which makes the clustering result accurate. (2) The label space is given in this setting, so k-means can take full advantage of the known number of classes, which also ensures accurate pseudo labels.
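A minimal sketch of this step, assuming target-domain features have already been extracted; mapping cluster indices to the shared label space (e.g., by matching cluster centroids to class prototypes) is an illustrative detail not specified above.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_pseudo_labels(features: np.ndarray, num_classes: int, seed: int = 0) -> np.ndarray:
        # Cluster target-domain features into the known number of classes;
        # the cluster indices serve as pseudo labels, up to a permutation that
        # still needs to be aligned with the shared label space.
        kmeans = KMeans(n_clusters=num_classes, n_init=10, random_state=seed)
        return kmeans.fit_predict(features)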
Response to R3
Q3: A lot of "practical" and theoretical contributions are relegated to the supplement. Put the algorithm into the main text. The formulation of $\mathcal{L}_{ce}$ should be presented in the main text.
A3: We will put the pseudo code of GRCL in the main text in our final version.
Q4: Why is Forward Transfer (FWT) not discussed?
A4: For typical continual learning, FWT is not a common metric since it measures the model's generalization ability to unseen classes, which is not the main challenge there. However, in continual domain adaptation, we agree FWT is meaningful as it measures the model's generalization ability to unseen domains. We report FWT as follows:
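For reference, a minimal sketch of FWT under the GEM-style definition (accuracy on a task evaluated just before training on it, minus a random-initialization baseline, averaged over tasks); the array names R and b and the exact evaluation protocol are assumptions for illustration.

    import numpy as np

    def forward_transfer(R: np.ndarray, b: np.ndarray) -> float:
        # R[i, j]: accuracy on task j after training on task i;
        # b[j]: accuracy of a randomly initialized model on task j.
        T = R.shape[0]
        return float(np.mean([R[i - 1, i] - b[i] for i in range(1, T)]))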
Q5: Are these results consistent for other permutations?
A5: Yes, the results for other permutations are consistent. We conduct experiments on Digits with the order MNIST, SynNum, MNIST-M, SVHN, USPS. The ACC of GRCL is 87.42%, while the ACC of CUA is only 81.35%.
Response to R4
Q6: Do you think that your approach is scalable to cross-domain object detection? If yes, what kinds of problems do you think may arise when adapting a detector model to new domains?
A6:
Response to R5
Q7: The subscript of $\mathcal{D}$ sometimes refers to the data source (e.g., $s$ for source and $t$ for target) but sometimes refers to target-domain indices, e.g., $\mathcal{D}_k$, where $k \leq t$. Although readers may be able to guess, it is better to make the notation consistent.
A7:
Q8: Would it require a large, complex model to perform well on every target domain and the source domain it encounters? Unless the model size is initially big enough, the model will eventually become unable to achieve this goal after an extended period of use.
A8: No. (1) In domain adaptation, learning domain-invariant information is more important than learning domain-specific information. There may be a large amount of domain-specific information, but the domain-invariant information is limited, so a moderate-sized model that has learnt the most important domain-invariant information can also generalize to different domains. (2) Even for domain-related information, some common parts learnt in previous domains can be transferred to the new domain, which reduces the amount of new information the model needs to remember. To show that a small model also works with our method, we run ResNet-34 on DomainNet. The performance drops from 37.73% to 36.94%, i.e., by only 0.8%, which demonstrates the effectiveness of our method with small models.
Q9: $\mathbf{q}$ seems to be the gradient of $f_{\theta_t}(x)$. What is the reason for computing the dot product between $\mathbf{q}$ and $\mathbf{k}$ in Eq. 4?
A9: $\mathbf{q} = g_t(f_{\theta_t}(x))$ is a typo; $\mathbf{q}$ should be $q_t(f_{\theta_t}(x))$, which is the representation of sample $x$, not a gradient. The dot product of $\mathbf{q}$ and $\mathbf{k}$ computes the similarity between the query and the positive key, following the standard formulation of the contrastive loss in MoCo.
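A minimal sketch of that MoCo-style contrastive loss, assuming L2-normalized features and a memory queue of negative keys; the function and variable names are illustrative.

    import torch
    import torch.nn.functional as F

    def moco_contrastive_loss(q, k_pos, queue, temperature=0.07):
        # q: (N, C) query representations; k_pos: (N, C) positive keys;
        # queue: (K, C) negative keys from the memory bank.
        q = F.normalize(q, dim=1)
        k_pos = F.normalize(k_pos, dim=1)
        queue = F.normalize(queue, dim=1)
        l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(1)  # similarity between query and positive key
        l_neg = torch.einsum('nc,kc->nk', q, queue)              # similarities to negative keys
        logits = torch.cat([l_pos, l_neg], dim=1) / temperature
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels)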
Response to R6
Q10: It is contradictory to adapt to domain drift while maintaining performance on the previous datasets. Since the environment is dynamic, historical information may degrade the current result, and maintaining historical performance would decrease the adaptation performance in some cases.
A10:
Q11: The paper does not define what type of domain drift it aims to solve, such as covariate shift or label shift.
A11:
Q12: The paper does not compare with newer continuous domain adaptation methods, such as "Understanding Self-Training for Gradual Domain Adaptation", which makes the baseline comparison less trustworthy.
A12:
Q13: Compared to recent continuous domain adaptation methods, such as “Understanding Self-Training for Gradual Domain Adaptation”, what are the strengths of your proposed method?
A13:
Q14: The proposed method relies on accurate pseudo labels. How do you guarantee the accuracy of the pseudo labels?
A14: Please refer to Q2.