Classifier Crafting: Turn Your ConvNet into a Zero-Shot Learner!

Jacopo Cavazza
Istituto Italiano di Tecnologia
Pattern Analysis and Computer Vision (PAVIS) Department
Visual Geometry and Modelling (VGM)
GREAT Campus: Science and Technology Park in Genoa
Via Enrico Melen, 83, 16152 Genoa GE, Italy
[email protected]

Abstract

In Zero-shot learning (ZSL), we classify unseen categories using textual descriptions about their expected appearance when observed (class embeddings) and a disjoint pool of seen classes, for which annotated visual data are accessible. We tackle ZSL by casting a “vanilla” convolutional neural network (e.g. AlexNet [16], ResNet-101 [10], DenseNet-201 [12] or DarkNet-53 [28]) into a zero-shot learner. We do so by crafting the softmax classifier: we freeze its weights using fixed seen classification rules, either semantic (seen class embeddings) or visual (seen class prototypes). Then, we learn a data-driven and ZSL-tailored feature representation on seen classes only to match these fixed classification rules. Given that the latter seamlessly generalize towards unseen classes, while requiring not actual unseen data to be computed, we can perform ZSL inference by augmenting the pool of classification rules at test time while keeping the very same representation we learnt: nowhere re-training or fine-tuning on unseen data is performed. The combination of semantic and visual crafting (by simply averaging softmax scores) improves prior state-of-the-art methods in benchmark datasets for standard, inductive ZSL. After rebalancing predictions to better handle the joint inference over seen and unseen classes, we outperform prior generalized, inductive ZSL methods as well. Also, we gain interpretability at no additional cost, by using neural attention methods (e.g., grad-CAM [30]) as they are. Code will be made publicly available.

\begin{overpic}[width=397.48651pt]{images/new_motivational.png} \end{overpic}

Figure 1: The idea. We propose a learnable representation, tailored for ZSL, obtained from a convolutional network in which we crafted the softmax classifier’s weights. As opposed to optimize these weights (

), we use fixed classification rules: e.g., we replace the softmax weights’ with class embeddings, by either leveraging Osherson’s default probability scores [18] or LSTM-CNN text embeddings trained on Wikipedia [34]. Then, we learn a feature representation to match these fixed classification rules: we posit this to be the proxy to turn a “vanilla” convnet, trained on seen images only, into a zero-shot learner. Further, we achieve interpretability at no additional cost, using neural attention (e.g., grad-CAM [30]): our ZSL-crafted nets seem to focus on finer semantic details (e.g., legs or proboscis when classifying animals) and to better deal with confounding factors (e.g., in this case, a person).

1 Introduction

Due to the intrinsic difficulty of gathering a sufficient number of annotations for visual categories worth to be recognized, inductive zero-shot learning (ZSL) is advantageous since tackling the “extreme” scenario where some of the classes are never observed by the learner at training time (unseen classes), but still predicted at inference time (see Fig. 2). In fact, unseen classes can be recognized by exploiting textual descriptions (dubbed class embeddings) that inform the learner about how unseen classes are supposed to appear when observed [18, 6]. In conjunction to this, a disjoint pool of categories (seen classes) is accessible in terms of both annotated visual data (here, images) and, again, class embeddings. In abstract terms, ZSL it’s about relating seen class embeddings (and their semantic content) with visual seen data by means of a mapping which, despite being optimized on seen data only, has to transfer over unseen classes (standard ZSL) and over both unseen classes and unseen instances from seen classes (generalized ZSL, GZSL).

Previously, the way of making the aforementioned “mapping” transferrable, has been mainly tackled from the algorithmic perspective (either by finding a mathematical alignment between visual and semantic domains - [34] for a review - or training a feature generator to synthesize visual features from class embeddings [35, 29, 13, 19, 37, 32, 15, 8]) but almost always relying on pre-computed visual features that are extracted from ImageNet pre-trained neural networks [34, 35, 29, 13, 19, 37, 32, 15, 8].

Contributions.

In this paper, we propose to entangle the problem of making the aforementioned “seen-to-unseen mapping” generalizable with the problem of learning ZSL-specific visual descriptors. We craft the weights $w_{i}$ of the softmax classifier’s of a standard convnet: that is, we keep $w_{i}$ freezed and we optimize by gradient descent the rest of the convnet accordingly. We propose two strategies for that. 1) Semantic crafting: we replace $w_{i}$ with the class embeddings which, thanks to Osherson’s default probabilities¹¹1E.g., we will give probability “1” for a zebra to be four legged and probability “0” for a flying otter. Note that these probabilities are not visually grounded [4] and only convey information on the expected appearance: not all images of a zebra will actually show four legs! [18] or CNN-LSTM text embeddings [35] can be casted into numerical vectors (that’s what $w_{i}$ are). 2) Visual crafting: we replace $w_{i}$ with visual prototypes. While seen prototypes are simply obtained by averaging visual features, unseen prototypes are inferred through a linear projection (optimized in closed-form) from class embeddings, in order not to violate the ZSL assumption. In both cases, we fine-tune the hierarchical feature representation of the convnet using seen images only, once the classifier is frozen after it has been visually or semantically crafted. Since our classification rules naturally extend towards unseen classes (albeit not requiring unseen data to be computed), we claim that the combination of them with a learnable and ZSL-tailored representation is a valuable proxy to generalize over unseen classes while being able to fine-tune a learnable representation only on the seen classes.

Refer to caption — Figure 2: In this paper, we consider the problem of inductive zero-shot learning, considering both standard and generalized evaluation paradigms, recognizing unseen classes (no visual data at training time) using class embeddings (*e.g*., a list of attributes) and seen categories (provided of annotated visual data).

Since visual and semantic data are complementary in nature, we combine these two crafted convnets in an ensemble by simply averaging their predictions obtained from visually- and semantically-crafted softmax operators: in this manner, we improve prior art in standard, inductive ZSL [35, 29, 13, 19, 37, 32, 15, 8] – see Table 2.

To better deal with the GZSL setup, we propose a novel confidence rebalancing mechanism to fill the gap between seen predictions and (arguably) under-confident unseen predictions, given that unseen images are totally unobserved at training time. We do so by learning a discriminator to predict whether a vector of logits is seen (or not) using auxiliary task-irrelevant data to optimize it. Afterwards, we can use it to re-modulate, on the fly, the seen and unseen softmax probabilities at test time before taking a decision (again, neither training nor fine-tuning on unseen data is performed). By increasing the confidence of unseen classes and decreasing the ones of seen classes, (see Fig. 6), our rebalanced ensemble outperforms prior art in generalized, inductive ZSL [35, 29, 13, 19, 37, 32, 15, 8] – see Table 2.

We dissect the impact in performance of the visual and semantic crafting strategies, the rebalancing mechanism and the ensemble in ablation studies. We also provide grad-CAM [30] visualizations to gain in interpretability at no additional cost (see Fig. 1).

To summarize, these are the main contribution of our work:

•

We propose (visual and semantic) fixed classification rules to craft (i.e., replace and keep freezed) the softmax classifier’s weights. This turns a convnet into a zero-shot learner, Consequently, we learn from seen visual data only a feature representation which is tailored for ZSL and, thus, effective to recognize unseen classes as well.
•

With a simple ensemble method (average of predictions) our visual and semantic crafted networks improve prior art in standard, inductive ZSL. Once endowed of a novel predictions’ rebalancing mechanism, which we propose, our crafted networks improve improve prior art in inductive GZSL as well.
•

We provide ablation studies to quantify the role of each computational module in performance, comparing with prior rebalancing mechanisms [6] and [1]. We also gain interpretability simply using grad-CAM [30] as it is.

Outline of the paper.

We first explain the proposed approach (Section 2) and present the supporting experimental evidences (Section 3). The discussion of related work is deferred to Section 4, in order to, hopefully, gain in readability. Conclusions will be drawn in Section 5.

2 Method

For seen classes, input data are triplets of image, class label and a class embedding (the latter being a numerical vector in one-to-one known correspondence to a class). Unseen classes are totally deprived of images and are known only in terms of their class ebmeddings: unseen visual data are disclosed only at test time and no re-training on them is allowed.

Backbone Network.

We consider a convolutional feed-forward neural network (convnet) with a softmax classifier, trained in and end-to-end on raw images. Using the softmax, inference over a fixed pool of classes (indexed over $j$ ), can done through $\arg\max_{j}\mathbf{p}_{j}$ where

\mathbf{p}_{j}=\frac{\exp(\mathbf{f}\cdot w_{j})}{\sum_{\ell}\exp(\mathbf{f}\cdot w_{\ell})}

(1)

is the probability of predicting the $j$ -th class²²2For efficiency issues, $\arg\max_{j}\mathbf{p}_{j}$ is usually replaced by $\arg\max_{j}\mathbf{f}\cdot w_{j}$ in standard deep learning libraries (e.g., PyTorch, MATLAB, Tensorflow, …), since both operations give the same result.. In Eq. (1), the classifier’s weights are denoted by $w_{j},\dots,w_{\ell},\dots$ (one weight $w_{j}$ per class) and $\mathbf{f}$ refers to the higher lever feature representation that is learnt (and fed to the classifier to compute the logits $\mathbf{f}\cdot w_{j}$ ).

2.1 Semantic and Visual Classifier Crafting

We craft the softmax weights, that is, we replace their data-driven optimization by gradient descent with classification rules $r$ kept fixed during training. By “classification rule” $r$ , we mean vectorial embeddings to be used to replace the weights $w_{j}$ in Eq. (1). Classification rules are “fixed”: gradient descent is done to optimize $\mathbf{f}$ (and the whole backbone network of Fig. 3) to match them, while we keep them freezed. The rationale of doing this is the following: first, we endow our fixed classification rules of semantic patterns that seamlessly generalize onto unseen classes. That is, we either exploit (continuous) class embeddings - which we have for both seen and unseen classes - or visual prototypes - computed by averaging seen descriptors or using a semantic-to-visual mapping to infer unseen ones. Second, we learn a feature representation to match these fixed classification rules: this is the proxy we propose to adopt to generalize onto unseen classes, while being able to train on seen only.

As shown in Algorithm 1, we perform inference by computing the inner product between $\mathbf{f}$ and the classification rules $r$ . Seen classification rules $r_{i}$ are used to learn $\mathbf{f}$ . Crucially, we can still use $\mathbf{f}$ to recognize unseen classes as well, by augmenting the pool of seen logits $\mathbf{f}\cdot r_{i}$ with additional ones: we compute the inner product $\mathbf{f}\cdot\widetilde{r}_{k}$ between the very same representation $\mathbf{f}$ we learnt from seen data (and now kept freezed at inference time) and $\widetilde{r}_{k}$ . $\widetilde{r}_{k}$ is a generic unseen classification rule. Thus, upon augmenting $\mathbf{f}\cdot r_{i}$ with $\mathbf{f}\cdot\widetilde{r}_{k}$ , we can perform inference over seen and unseen classes jointly, with a single $\arg\max$ pass over $\mathbf{f}\cdot r_{i},\dots,\mathbf{f}\cdot\widetilde{r}_{k},\dots$ .

Data: Fixed rules:

r_{i}

(seen) and

\widetilde{r}_{k}

(unseen).

\boldsymbol{\rm Training}

(only using seen rules

r_{i}

and seen data) Replace the softmax weights with

r_{i}

;

2 while not converged do

3 Keep

r_{i}

frozen and fine-tune the backbone;

5 end while

\boldsymbol{\rm Testing}

foreach test image do

7 Forward the test image in the backbone

\longrightarrow\overline{\mathbf{f}}

;

8 Compute logits

\overline{\mathbf{f}}\cdot r_{i},\dots,\overline{\mathbf{f}}\cdot\widetilde{r}_{k},\dots

;

9 Predict the class of maximal logit;

11 end foreach

Algorithm 1 Classifier Crafting

Semantic Classifier Crafting $\colon$ S-CC. (Fig. 4 - left pane). We select the fixed classification rules $r_{i}$ and $\widetilde{r}_{k}$ , for seen classes $i$ and unseen classes $k$ , to be equal to the class embeddings.

Visual Classifier Crafting $\colon$ V-CC (Fig. 4 - right pane). We select $r_{i}$ and $\widetilde{r}_{k}$ as tvisual prototypes. Seen prototypes $\mathbf{m}_{i}$ are computed by a plain average of seen training features. Unseen prototypes $\widetilde{\mathbf{m}}_{k}$ are inferred using a linear projection, the latter being trained to map seen class embeddings onto seen prototypes. Please note that not a single unseen visual features is used to compute $\widetilde{\mathbf{m}}_{k}$ , $k$ being a generic unseen class.

Ensemble mechanism $\colon$ V&S-CC. Given the arguable complementarity of semantic and visual crafting, we propose a simple ensemble strategy which is based on averaging the softmax scores of a S-CC an V-CC model. Although arguably simple, this method is capable of outperforming prior art in standard, inductive ZSL (Table 2, left columns) showing the benefit of learning a tailored representation for ZSL from raw image data, as opposed to build upon pre-computed visual embeddings.

\sidecaptionvpos

figure*t

2.2 GZSL: rebalancing seen & unseen predictions

Given the unavailability of unseen data at training time, fine-tuning our crafted network on seen data only will inherently bias the network towards them. This is expecting to result in much lower confidence for unseen classes if compared to the seen ones. This is surely a problem to tackle, since limits generalization: prior methods mainly approached this through unseen feature generation [35, 29, 13, 19, 37, 32, 15, 8], but we propose to follow the perspective of [6] and [1] and try to resolve the problem in the prediction space, by rebalancing the network’s confidence - so that the we can increase the one for unseen classes and jointly decrease the one for seen ones.

In inductive GZSL, a prickly problem is mis-classifying an unseen class as if it was seen [6]. As faced in [6] and [1], we can formalized it as ”finding the best approximation for an oracle which, for every test instance (either seen or unseen) to be classifier, is able to tell whether the instance was sampled from the seen classes or not”.

In [6], this is loosely implemented by subtracting seen predictions scores by a fixed quantity $\gamma$ (calibrate stacking) while in [1] and adaptive prior is designed and rooted in a probabilistic framework of confidence smoothing (COSMO). Differently, in this paper we aim at pursuing a fully discriminative approach, and training a discriminator $D$ to predict whether a given test instance is seen or not. Thus, once we do so, we can re-modulate the estimated softmax probability scores $\mathbf{p}_{j}$ of our crafted network in the following manner

\overline{\mathbf{p}}_{j}=\begin{cases}p_{D}\cdot\mathbf{p}_{j}&\mbox{if $j$ is a seen class,}\\ (1-p_{D})\cdot\mathbf{p}_{j}&\mbox{if $j$ is an unseen class.}\\ \end{cases}

(2)

In Eq. (2), $p_{D}$ is the probability estimated by $D$ for a test instance to belong to the seen pool of classes. That is, we seek to train $D$ to achieve the best approximation for an oracle, who would have given $p_{D}=1$ for all seen test instances and $p_{D}=0$ for all the unseen ones.

Training the discriminator $D$ .

We extract $p_{D}$ from a logistic regressor [2] trained with a binary cross-entropy loss to classify whether a given logits is seen or not. The logits are obtained from our crafted convnet and we can exploit it to obtaining seen logits from seen training data. Since unseen classes are unavailable at training time, we need to synthesize unseen logits and, to do so, we take advantage of the idea we visualize in Figure 5.

To synthesize unseen logits, we consider task-irrelevant data from PASCAL VOC 2008 and we make sure to remove any overlapping class (e.g., removing animals when working on Animals with Attributes [18]). We fed this task-irrelevant data into the network and extract logits from a trained model: e.g., we compute flowers-related logits for an image of a bag, see Fig. 5. Albeit surely not reliable in their predictions, we can still capitalize on those logits to increase the uncertainty of the model and simulate the out-of-distribution regime. But, since task-irrelevant data are surely bringing more uncertainty with respect to the one we would need for unseen data, we exploit mixup [39] (in the feature space) to mitigate this. We combine task-irrelevant logits with seen training one: that’s how the computable surrogate logits are extracted and used as negative data to train $D$ .

Observation.

Note that this rebalance mechanism is 1) only necessary for generalized ZSL and 2) can be combined with the semantic crafting S-CC, the visual crafting V-CC and the ensemble of the two (V&S-CC) - see Fig. 6.

3 Experiments

Datasets and evaluation metrics

We run experiments on AWA [17], CUB [33], SUN [36], FLO [24] datasets using the proposed splits “PS” which are ImageNet-compatible [34]. Class embeddings are extracted using the TF-IDF [13] for AWA, CUB, SUN and the LSTM-based approach of [35] for FLO. Standard performance metrics are used [34]: in standard, inductive ZSL performance is evaluated as mean top-1 classification accuracy ( $T1$ ) over unseen classes. In generalized, inductive ZSL, we report $H$ , the harmonic mean between mean per-class accuracy scores $S$ and $U$ , computed over seen or unseen classes independently [34].

Turn your favorite convnet into a zero-shot learner!

Our idea of crafting the classifier’s weights of a softmax operator to solve ZSL is broadly applicable to a number of deep convolutional networks, designed for image classification. In fact, in Fig. 7, we show how different networks – in this case, AlexNet [16], ResNet-101 [10], DenseNet-201 [12] and DarkNet-53 [28] (with 256 $\times$ 256 input layer) – can be turned into a classifier that is able to discriminate classes that remain unseen at training time (on FLO using the V&S-CC ensemble of visual and semantic crafting, we achieve $T1_{\rm AlexNet}=$ 67.45%, $T1_{\rm ResNet-101}=$ 77.51%, $T1_{\rm DenseNet-201}=$ 77.92% and $T1_{\rm DarkNet-53}=$ 78.10%).

Ablation on predictions’ rebalancing.

We compare our proposed confidence rebalancing mechanism (Section 2.2) with two standard, state-of-the-art alternatives: calibrate stacking [6] and COSMO [1]. As the reader can see in Fig. 8 for FLO, our method is more effective than prior solutions [6, 1] in resolving the bias towards seen classes that are used to train S-CC, V-CC and V&S-CC models, making them effective to generalize onto unseen classes also.

\begin{overpic}[width=216.81pt]{images/ablation_.png} \put(6.0,31.0){\rotatebox{0.0}{\bf{\color[rgb]{0,0,1}AlexNet}}} \put(16.0,20.0){\scalebox{0.6}{67.45\%}} \put(29.4,31.0){\rotatebox{0.0}{\bf{\color[rgb]{1,0,0}ResNet-101}}} \put(41.0,23.0){\scalebox{0.6}{77.51\%}} \put(53.0,31.0){\rotatebox{0.0}{\bf{\color[rgb]{0,0.5,0}\scalebox{0.9}{DenseNet-201}}}} \put(65.8,23.4){\scalebox{0.6}{77.92\%}} \put(80.0,31.0){\rotatebox{0.0}{\bf\scalebox{0.9}{DarkNet-53}}} \put(90.0,23.4){\scalebox{0.6}{78.10\%}} \end{overpic}

Figure 7: Different craftings for standard, inductive ZSL. On FLO, using the

T1

metric, we ablate the impact of different crafting mechanism, across different newtorks, once we craft them into a zero-shot learner

On the same dataset, we provide a numerical ablation study to quantify the impact of performance of our rebalancing mechanism “R” (across all different crafting strategies) while also comparing with an “oracle” discriminator $\mathcal{O}$ which is always able to spot whether a test instance is seen or not: as shown in Tab. 1, the gap in between “ $R$ ” and “ $\mathcal{O}$ ” is only 1.8% for S-CC, increases to 6.6% for V-CC (but also the performance increases) and stabilizes to 8.1%. Albeit improvable, this gap certifies that the discriminator is effective enough in handling unseen classes and make predictions more balanced: after adding “ $R$ ”, the performance in GZSL, measured through $H$ , is always (more than) doubled.

	GZSL: $H$ [%]
	S-CC	V-CC	V&S-CC
	29.9	36.0	35.4
$R$	72.0	75.5	78.8
$\boldsymbol{\mathcal{O}}$	73.8	82.1	86.9

Table 1: Numerical ablation (on FLO, crafting a ResNet-101 model) on the effect of rebalancing “R” versus an oracle seen/unseen instance selector “

\mathcal{O}

”.

Learning a ZSL-tailored feature representation.

For V-CC, the crafting rules $r_{i},\dots,\widetilde{r}_{k},\dots$ are obtained by averaging seen features extracted from a pre-trained convnet (and learning the prototypes for the unseen). Hence, we can still perform ZSL inference by simply crafting the classifier, while skipping any fine-tuning on the feature representation $\mathbf{f}$ . By doing so, we can really evaluate the benefit of tailoring visual descriptors for ZSL: it turns out that this is indeed advantageous, since the fine-tuning free variant of V-CC is gapped in performance with respect to the full (i.e., fine-tuned) version (e.g., about -20% on CUB).

		Zero-Shot Learning $\colon T1$ [%]				Generalized ZSL $\colon H$ [%]
		AWA	CUB	SUN	FLO	AWA	CUB	SUN	FLO
CondZSL	${}^{\rm 2019}\leavevmode\nobreak\$ [20]	71.1	54.4	62.6	–	66.7	47.5	39.3	–
LisGAN	${}^{\rm 2019}\leavevmode\nobreak\$ [19]	70.6	58.8	61.7	69.6	62.3	51.6	40.2	68.3
$f$ -VAEGAN- $D2$	${}^{\rm 2019}\leavevmode\nobreak\$ [35]	70.3	72.9	65.6	70.4	65.2	68.9	43.1	75.1
EXEM	${}^{\rm 2020}\leavevmode\nobreak\$ [5]	68.1	58.6	62.9	69.4	46.5	50.1	39.1	48.2
ZSML	${}^{\rm 2020}\leavevmode\nobreak\$ [32]	76.1	69.6	60.2	–	65.8	55.7	–	–
APN	${}^{\rm 2020}\leavevmode\nobreak\$ [37]	71.7	73.8	65.7	–	65.6	70.0	43.7	–
ZS-OCD	${}^{\rm 2020}\leavevmode\nobreak\$ [15]	71.3	60.3	63.5	–	65.7	51.3	43.8	–
IZF	${}^{\rm 2020}\leavevmode\nobreak\$ [31]	74.5	67.1	68.4	–	68.0	59.4	54.8	–
$tf$ -VAEGAN	${}^{\rm 2020}\leavevmode\nobreak\$ [23]	73.4	74.3	66.7	74.7	66.7	70.7	46.3	79.4
AFRNet	${}^{\rm 2020}\leavevmode\nobreak\$ [22]	75.1	–	64.0	–	70.1	–	41.5	–
CN	${}^{\rm 2021}\leavevmode\nobreak\$ [14]	–	–	–	–	67.6	50.3	43.1	–
A&G-ZSL	${}^{\rm 2021}\leavevmode\nobreak\$ [8]	76.4	77.2	66.2	–	71.3	72.6	46.5	–
S-CC-(R)	${}^{\rm 2021}\leavevmode\nobreak\$ ours	71.7	79.7	63.5	60.3	70.0	70.5	55.0	72.0
V-CC-(R)	${}^{\rm 2021}\leavevmode\nobreak\$ ours	76.8	80.9	67.0	71.3	73.3	72.1	56.9	75.5
V&S-CC-(R)	${}^{\rm 2021}\leavevmode\nobreak\$ ours	78.5	81.2	69.3	77.5	75.5	73.1	58.0	78.8

Table 2: Comparison against recent state-of-the-art methods for standard and generalize, inductive ZSL. For fair comparisons, a ResNet-101 feature encoder is adopted since used in prior art. While reporting either Top-1 classification accuracy

T1

(ZSL) or the harmonic mean

H

between unseen/seen mean per class accuracy (GZSL), the best performance is bolded. S-CC, V-CC and V&S-CC are endowed of the rebalancing module “R” only for GZSL (in standard ZSL is simply unnecessary).

3.1 Comparison with SOTA in inductive (G)ZSL

We compare our proposed visually and semantically crafting V&S-CC against the state-of-the-art in inductive ZSL (standard and generalized protocols) by selecting a ResNet-101 backbone, since this common shared encoder is adopted from all prior art [20, 19, 5, 32, 37, 15, 31, 23, 22, 14, 8]. We can therefore ensure a fair comparison - but note that we could have achieved slightly better results if using more recent architectures such as DarkNet-53 (see Fig. 7).

In details we compare with the conditional/probabilistic approaches of [20, 15], the syntesis of both classifier and exemples [5], and a number of feature generating approaches using either GANs [19, 22, 8], VAE-GANs [35, 23] and GANs plus meta-learning [32]. We also compare against very recent works, relying on either invertible networks [31] or feature scaling [14].

As show in Table 2, our method improved in performance ( $T1$ ) previous methods in standard, inductive ZSL on AWA (+2.1% over [8]), CUB (+4.0% over [8]), SUN (+0.9% over [31]) and FLO (+2.8% over [23]). We also improve prior art in generalized, inductive ZSL on AWA (+4.2% over [8]) CUB (+0.5% over [8]) and SUN (+3.2% over [31]). On FLO, we improve all prior methods except to tf-VAEGAN [23], only. We pay a gap of -0.6%: we deem the reason of that to be in the limited size of the dataset. In fact, tf-VAEGAN seems to resolve this problem by generating much more synthesized features if compared to the seen examples available (1200 generated descriptors per unseen class, while each seen class is represented only by one or two hundreds of examples).

Interpretations with Saliency Maps.

To better ground why our recorder performance over prior art is superior, we exploited grad-CAM [30] to inspect what happens when using manually annotated attributes’ list for crafting on a S-CC model. As shown in Fig. 9, when we compare the learnt representation on all attributes with the one obtained once we removed “is timid”, we see very little variations: effectively such an attribute is not visually grounded and this result is actually understandable. At the same time, interestingly, if we remove the attribute “is black”, then we observe much more changes in the grad-CAM maps (right-most column of Fig. 9) since this attribute is instead visually grounded [4] and therefore it is actually impacting the learnt representation. We claim that this dependency between neural attention and attributes, can better justify why our proposed crafting is superior to prior methods, mainly because we better intertwine semantic and visual cues, and this is arguably a desirable proxy for an effective generalization towards unseen classes.

\begin{overpic}[width=216.81pt]{images/visualizations2.PNG} \put(1.0,67.0){All attributes} \put(36.0,67.0){\textst{Is Timid}} \put(69.0,67.0){\textst{Is Black}} \end{overpic}

Figure 9: Effect of removing non-visually grounded (e.g., is timid) and visually grounded (e.g., is black) attributes from class embeddings on Animals with Attributes [17].

4 Discussion and relationship with prior art

(Generalized) Zero-Shot Learning.

In the seminal work [18], ZSL is framed as the regression problem of predicting attributes from visual data. Differently, we exploit seen class embeddings and seen images to obtain better features with which we can do zero-shot inference on unseen classes, using a softmax classifier. Due to the absence of feature learning in [18], attributes’ predictions is highly suboptimal to our crafting - e.g., -32% on AWA and -30% on SUN). In CondZSL [20] and EXEM [5], classifiers explicitly depends upon class embeddings, using a conditional/causal perspective or exploiting data self-representation, respectively. However, neither [20] nor [5] perform feature learning from raw image data (they exploit pre-computed ImageNet-pretrained features) and, again they are suboptimal to us (V&S-CC always improve in performance both CondZSL nd EXEM, as one can check in Table 2).

End-to-end learnable zero-shot learning methods are emerging as recent directions for both video-based [3] (not considered here) or image-based applications [21]. However, in [21], the absence of a confidence rebalancing mechanism forbid the methods to be tested in GZSL - while in standard ZSL we are on par with [21] on AWA (see [21, Tab. 2]), superior on SUN (+2.3%) and sharply superior on CUB (+15%). Note that we did not insert [21] in Table 2 since it did not provide results using ResNet-101 features, using GoogleNet features instead. Note that, in fact, we could have furthermore improved our performance in Table 2 is using DarkNet-53 as opposed to ResNet-101 (e.g., about +3% improvement for $H$ on FLO), but this would not have been fair with respect to prior art. Ensemble approaches have been recently investigated in the (easiest) transductive ZSL setup - where unannotated visual data from unseen classes are accessible during training - [9, 38]. To the best of our knowledge we are the first of proposing an ensemble in inductive (standard and generalized) ZSL.

There are ZSL methods which, similarly to us, explicitly achieve interpretability using class attention maps: [7, 37]. Please, note that, in [7], an explicit patch-based optimization is designed, while [37] exploits body parts annotations (and not all ZSL datasets are provided of that and, usually this type of meta-data is not adopted). Our work is different since we achieve interpretability at not additional cost (no direct optimization for that, no extra annotation is required beyond class embeddings).

Classifiers’ Crafting.

We briefly discuss methods which combine a fixed classifier with a learnable feature representation. There is a recent sequence of works [27, 11, 25, 26] applying this idea to diverse applications, namely, vanilla classification [11, 25], incremental learning [26] or few-shot learning. To the best of our knowledge, our work is the first one to progress this sequence, applying this paradigm in inductive zero-shot learning.

In [11, 25, 26], the classifier’s weights are fixed to a pre-defined geometrical configuration (for instance, mapping all classes to be recognized over the vertex of a simplex [25]). We cannot access all classes to be recognized at training time, but only the seen. Thus, we cannot adapt on unseen using newly incoming data from them (as in [26]) but we need to carry out predictions from zero shots. Hence, the classifier’s weights we craft cannot be explicitly dependent upon unseen data - and this is different from the few-shot learning approach of [27] in which data from all classes are used for this. We therefore opted for a solution in which we rely on the intrinsic seen-to-unseen transferrability of our fixed classification rules (used for crafting) so that the representation, that we learn by matching them, can be transferrable in turn.

Recalibrated Predictions.

As shown in Fig. 8. our proposed discriminator, predicting if a logit is seen or not while training on real seen logits and fake auxiliary unseen ones, seems valid in re-calibrating predictions in the generalized zero-shot learning regime (since enhancing the performance in GZSL quite sharply). Relatively to calibrate stacking [6] or confidence smoothing [1], it appears more effective in this respect. Beyond performance, we also technical differ from these works: our rebalancing is fully discriminative, while [6] is based on thresholding (by manually decreasing the predictor’s confidence) and [1] seeks for an adaptive probabilistic prior.

5 Conclusion

Our papers shows that we can turn a convnet into a zero-shot learner by crafting the weights of the softmax operator, using fixed semantic/visual classification rules. We show that this strategy, when combined with an ensemble, and furthermore boosted by a predictions’ rebalancing, outperforms prior art in inductive zero-shot learning, on standard and generalized evaluation protocols, respectively. Since using a convnet for ZSL inference, we can achieve an interpretable predictor, at no additional cost, through neural attention by using methods such as GRAD-CAM as they are.

References

[1] Yuval Atzmon and Gal Chechik. Adaptive confidence smoothing for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[2] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
[3] Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Perona, and Krzysztof Chalupka. Rethinking zero-shot video classification: End-to-end training for realistic applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4613–4623, 2020.
[4] Matteo Bustreo, Jacopo Cavazza, and Vittorio Murino. Enhancing visual embeddings through weakly supervised captioning for zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[5] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Classifier and exemplar synthesis for zero-shot learning. International Journal of Computer Vision, 128(1):166–201, 2020.
[6] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Springer European Conference on Computer Vision (ECCV), 2016.
[7] Zhi Chen, Sen Wang, Jingjing Li, and Zi Huang. Rethinking generative zero-shot learning: An ensemble learning perspective for recognising visual patches. In Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[8] Yu-Ying Chou, Hsuan-Tien Lin, and Tyng-Luh Liu. Adaptive and generative zero-shot learning. In International Conference on Learning Representations (ICLR), 2021.
[9] Rafael Felix, Michele Sasdelli, Ian Reid, and Gustavo Carneiro. Multi-modal ensemble classification for generalized zero shot learning, 2019.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] Elad Hoffer, Itay Hubara, and Daniel Soudry. Fix your classifier: the marginal value of training the last weight layer. International Conference on Learning Representations (ICLR), 2018.
[12] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] He Huang, Changhu Wang, Philip S. Yu, and Chang-Dong Wang. Generative dual adversarial network for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[14] Mohamed Elhoseiny Ivan Skorokhodov. Class normalization for zero-shot learning. In International Conference on Learning Representations (ICLR), 2021.
[15] Rohit Keshari, Richa Singh, and Mayank Vatsa. Generalized zero-shot learning via over-complete distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS), 2012.
[17] CH. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[18] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[19] Jingjing Li, Mengmeng Jing, Ke Lu, Zhengming Ding, Lei Zhu, and Zi Huang. Leveraging the invariant side of generative zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[20] Kai Li, Martin Renqiang Min, and Yun Fu. Rethinking zero-shot learning: A conditional visual classification perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision (CVPR), 2019.
[21] Yan Li, Zhen Jia, Junge Zhang, Kaiqi Huang, and Tieniu Tan. Deep semantic structural constraints for zero-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[22] Bo Liu, Qiulei Dong, and Zhanyi Hu. Zero-shot learning from adversarial feature residual to compact visual feature. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
[23] Sanath Narayan, Akshita Gupta, Fahad Shahbaz Khan, Cees GM Snoek, and Ling Shao. Latent embedding feedback and discriminative features for zero-shot classification. In The Springer European Conference on Computer Vision (ECCV), 2020.
[24] M-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[25] Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. Maximally compact and separated features with regular polytope networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[26] Federico Pernici, Matteo Bruni, Claudio Baecchi, Francesco Turchini, and Alberto Del Bimbo. Class-incremental learning with pre-allocated fixed classifiers. arXiv preprint arXiv:2010.08657, 2020.
[27] Hang Qi, Matthew Brown, and David G Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
[28] Joseph Redmon. Darknet: Open source neural networks in c. https://pjreddie.com/darknet/. Accessed: December 2020.
[29] Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[30] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017.
[31] Yuming Shen, Jie Qin, Lei Huang, Li Liu, Fan Zhu, and Ling Shao. Invertible zero-shot recognition flows. In The Springer European Conference on Computer Vision (ECCV), 2020.
[32] Vinay Kumar Verma, Dhanajit Brahma, and Piyush Rai. Meta-learning for generalized zero-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
[33] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[34] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
[35] Yongqin Xian, Saurabh Sharma, Bernt Schiele, and Zeynep Akata. f-VAEGAN-D2: A feature generating framework for any-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[36] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[37] Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, and Zeynep Akata. Attribute prototype network for zero-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 2020.
[38] Meng Ye and Yuhong Guo. Progressive ensemble networks for zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[39] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In The International Conference on Learning Representations (ICLR), 2018.