Cluster-based Contrastive Disentangling for Generalized Zero-Shot Learning
Reviewer 1 Q2 and Reviewer 4 Q1: What is the reason for learning a decoder to reconstruct an image based on the swapped representations? First, Reviewer 1 seems to have misunderstood the decoupling results of our approach. We first decompose the latent vectors into semantic-unspecific (uns) and semantic-matched (mat) variables, and then further disentangle the semantic-matched variables into class-shared (cs) and class-unique (cu) variables according to the clustering results. That is, the three parts are the semantic-unspecific (uns), class-shared (cs), and class-unique (cu) variables, not the semantic-matched (mat), class-shared (cs), and class-unique (cu) variables as the reviewer stated.
The decoder D2 uses the swapped representations to generate images; its purpose is to learn the essential features needed for image reconstruction. We propose two random-swap strategies.
(i) The mat is fixed and the uns is randomly selected from other samples in the batch. Since mat is aligned with the attributes, the image reconstructed from the mat of a sample combined with the uns of another randomly selected sample should be similar to the original image X.
(ii) The uns and cu are fixed, and the cs is randomly selected from other samples in the batch. The image reconstructed from the uns and cu of a sample combined with the cs of another randomly selected sample should be similar to X. This is because cs is similar within the same set, so randomly swapping it does not significantly affect the reconstruction results.
Swap strategy (i) pushes mat toward attribute alignment; swap strategy (ii) makes cs tend to capture the similarity information within a set (or the discriminative information of each set), while cu carries the discriminative information of each category. During inference, we use mat for classification.
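The two swap strategies above can be sketched as follows. This is a minimal illustration, not our actual implementation: the function name, the `decoder` callable, and the array shapes are all assumptions made for clarity.

```python
import numpy as np

def swap_reconstruct(decoder, uns, cs, cu, rng):
    """Sketch of the two random-swap strategies.

    uns, cs, cu: (B, d) arrays holding the semantic-unspecific,
    class-shared, and class-unique parts of a batch's latent codes;
    decoder maps a concatenated code back to image space.
    """
    B = uns.shape[0]

    # Strategy (i): keep mat = [cs, cu] fixed, swap uns across the batch.
    x_swap_uns = decoder(np.concatenate([uns[rng.permutation(B)], cs, cu], axis=1))

    # Strategy (ii): keep uns and cu fixed, swap cs across the batch
    # (meaningful when the shuffled samples lie in the same set).
    x_swap_cs = decoder(np.concatenate([uns, cs[rng.permutation(B)], cu], axis=1))

    # During training, both reconstructions are pushed toward the
    # original images X by a reconstruction loss.
    return x_swap_uns, x_swap_cs
```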
Reviewer 1 Q1: Why is the prior distribution p(z|a) instead of p(z)?
The semantic attributes are needed in zero-shot learning as an information bridge between categories, so the encoder approximates p(z|a) instead of p(z). This is a common setting, e.g., in FREE [free], TF-VAEGAN [tfvaegan], and f-VAEGAN-D2 [f-vaegan-d2].
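Concretely, conditioning the prior on a replaces the usual KL term against N(0, I) with a KL term against an attribute-conditioned Gaussian. The following is a generic sketch of that KL term for diagonal Gaussians, not code from our model; the function name and shapes are illustrative.

```python
import numpy as np

def kl_gaussian(mu_q, lv_q, mu_p, lv_p):
    """KL(q || p) for diagonal Gaussians, with q = q(z|x) from the
    encoder and p = p(z|a) from an attribute-conditioned prior network.
    lv_* are log-variances; all inputs have shape (batch, z_dim)."""
    return 0.5 * np.sum(
        lv_p - lv_q + (np.exp(lv_q) + (mu_q - mu_p) ** 2) / np.exp(lv_p) - 1.0,
        axis=1,
    )
```

With an unconditional prior, mu_p and lv_p would be fixed at zero; here they are produced from the attribute vector a, which is what ties the latent space to the semantic bridge between seen and unseen classes.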
Reviewer 1 Q3: In which space does clustering work, and what features are used for clustering?
Clustering is performed in the input space. Our input is the widely used 2048-dimensional feature vector extracted by ResNet-101.
The reason for not clustering in the attribute space: the attribute space and the input space are not perfectly matched, which is exactly why we decompose the latent vectors into semantic-unspecific and semantic-matched variables.
The reason for not clustering in the latent space: first, the model has not yet converged at the early stage of training, so clustering on latent vectors is bound to introduce larger errors. In addition, since the second swap is set-based, the samples must be clustered into N sets in advance for the swap to be meaningful, and the learned disentangled representations are reasonable under the guidance of this rule. Clustering after decomposition does not fit our idea.
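The pre-clustering step amounts to a plain k-means pass over the input features. The sketch below is illustrative only (the function name, the deterministic farthest-point initialization, and the defaults are assumptions, not our exact setup):

```python
import numpy as np

def kmeans_sets(features, n_sets, n_iter=50):
    """Minimal k-means sketch for pre-clustering features into N sets.

    features: (num_samples, d) array, e.g. 2048-d ResNet-101 vectors.
    Returns one integer set id per sample.
    """
    # Greedy farthest-point initialization keeps the sketch deterministic.
    centers = [features[0]]
    for _ in range(n_sets - 1):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.stack(centers)

    for _ in range(n_iter):
        # Assign each sample to the nearest center in the input space.
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        ids = dists.argmin(axis=1)
        # Update each non-empty set's center.
        for k in range(n_sets):
            if np.any(ids == k):
                centers[k] = features[ids == k].mean(axis=0)
    return ids
```

The resulting set ids are fixed before training starts, so the set-based swap and the set-based contrastive loss always operate on a stable partition.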
Reviewer 1 Q4: Why does 'Sec 3.2' appear twice in Lines 678-690?
Sec 3.2 introduces not only set-based contrastive learning but also class-based contrastive learning.
Reviewer 3 Q1: Why does our approach not improve performance much?
In fact, GSMFlow and SDGZSL are nearly the two best-performing methods in current generalized zero-shot learning. We do not have the official performance of GSMFlow on fine-tuned features; according to our own experiments, the performance of CCD is comparable to that of GSMFlow. With or without fine-tuning, CCD outperforms GSMFlow on CUB and FLO, but is slightly worse on AWA2. It is worth noting that GSMFlow uses a flow-based generative model, which itself outperforms a VAE. If we replaced our VAE with a flow-based generative model, the performance would also improve, but we cannot include these experimental results because of the CVPR rebuttal regulations.
Reviewer 3 Q1: Why does your framework seem a little complicated? Our framework is based on a generative model. In the training phase, the VAE branch serves to generate unseen-class samples and to train a fully supervised classifier jointly with the seen samples. In the inference phase, the VAE is not used, contrary to the reviewer's speculation; using the AE for decoupling is sufficient.
Questions about computational overhead and hyper-parameters
To constrain the disentangled feature representation, we propose a novel contrastive learning method with the help of semantic alignment and random-swap strategies. In Sec. 4, Figures 5, 6, and 7 show the impact of the number of synthetic unseen-class samples, the batch size, and the number of sets, respectively; as long as the hyperparameters are not extreme, their impact on performance is not significant. We chose the weighting coefficients of the loss terms empirically. Each experiment takes about one hour on a single RTX 3090, so it is not difficult to find the best combination.