
Unveiling Backbone Effects in CLIP:
Exploring Representational Synergies and Variances

Cristian Rodriguez-Opazo1    Edison Marrese-Taylor2    Ehsan Abbasnejad1    Hamed Damirchi1    Ignacio M. Jara3    Felipe Bravo-Marquez3    Anton van den Hengel1,4

   1Australian Institute for Machine Learning, University of Adelaide    2The University of Tokyo, AIST 3University of Chile, CENIA & IMFD 4Amazon
Abstract

Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various neural architectures, spanning Transformer-based models like Vision Transformers (ViTs) to Convolutional Networks (ConvNets) like ResNets, are trained with CLIP and serve as universal backbones across diverse vision tasks. Despite being trained on the same data with the same objective, it remains unclear how the representations learned by these architectures differ in effectiveness. Our investigation explores the differences in CLIP performance among these backbone architectures, revealing significant disparities in their predictions. Notably, normalizing these representations results in substantial performance variations. Our findings showcase a remarkable potential synergy between backbone predictions that could yield an improvement of over 20% through informed selection of the appropriate backbone for each sample. Moreover, we propose a simple yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%. We will release the code for reproducing the results.

1 Introduction

Large pre-trained models, also known as Foundation Models (FMs), are reshaping the landscape of learning, particularly in computer vision [39, 17, 24, 32] and natural language processing [9, 5, 44]. These models are trained through self-supervised and contrastive learning objectives on internet-scale data, eliminating the need for manual annotations [21, 42]. Contrastive Language Image Pre-training (CLIP) [39] stands as a pioneering contribution in the ongoing research endeavors within this domain, establishing itself as the state-of-the-art across diverse downstream tasks. The training objective is to learn to align text and image representations in an internet-scale dataset. This versatile framework is applicable to a wide range of commonly used architectures, facilitating the training of general image representation learning models. These representation learning models have significantly advanced the field by acting as backbones for other applications, producing state-of-the-art performance in various cross-modal alignment tasks, including but not limited to zero-shot classification [48, 29, 21], cross-modal retrieval [30, 31, 47], and the evaluation of machine-generated captions [27, 22].

Despite the extensive body of prior research on CLIP, there remains a notable gap when it comes to thoroughly examining these image backbones. While there is extensive work evaluating the backbones employed across various tasks to showcase the generalization capabilities of specific methods [14, 29, 49, 13], these studies typically report general performance improvements when using larger models. This supports the prevailing perception that larger models inherently lead to better representations, implying that they not only learn the same patterns but also more of them. If that were true, increasing the size of the model alone would be enough to improve generalization. However, even within the same family of backbones, architectural choices introduce nuanced inductive biases, so different models capture distinct patterns in their representations [15]. Nevertheless, there is a conspicuous absence of work thoroughly examining the synergies and variances among these backbones as learned by CLIP, particularly considering their shared training dataset and learning objective.

In light of these challenges, this paper conducts a comprehensive empirical investigation to unravel the intricacies of backbones within the CLIP framework. Our qualitative and quantitative evaluations reveal a notable “orthogonality” in the behaviour of these backbones. These backbones adeptly respond to diverse patterns in the input, manifested through different representations, and exhibit distinct confidence levels in their predictions. We observe that employing different normalizations on the representations obtained from these backbones results in varied performance, complementing previous studies [43, 50].

Based on these findings, we consider combining predictions from various backbones to form the final prediction. To that end, we first consider a set of baseline methods following [16], motivated by calibration techniques [38], as well as techniques used in ensemble methods [26, 10]. We illustrate that these methods do not consistently enhance generalization performance. Instead, we introduce a straightforward parametric approach in which a set of temperatures is learned to adjust the logits of each backbone. The inspiration for this temperature parameter stems from the observation of diverse performance outcomes when manipulating the logits in CLIP [43].

To determine these temperatures, we explore two approaches: (1) utilizing a genetic algorithm as a search strategy to identify a dataset-specific value, and (2) employing a multi-layer perceptron (MLP) to learn and adjust them individually for each sample. We observe that the MLP efficiently learns to predict these temperatures from few samples. This method allows for the seamless integration of information from different backbones, contributing to improved predictive outcomes.

In summary, our contributions are as follows:

  • We perform extensive experiments across eight datasets, revealing that the representations produced by CLIP backbones and their subsequent predictions exhibit a notable diversity rather than uniformity. For instance, we observe a slight majority, with just over half of the ImageNet test-set correct predictions aligning within the ViT family. The remaining instances only exhibit partial agreement with other backbones.

  • We underscore that merely discerning the optimal backbone for each individual test instance can yield up to a 13% improvement in zero-shot accuracy.

  • We demonstrate that $L_2$ normalization of the representations generally outperforms the alternatives for zero-shot classification, complementing the previous findings of [50].

  • Leveraging these insights, we introduce a simple approach to aggregate predictions from multiple backbones, enhancing the performance of the best single model by up to 6.34%. Our neural network combination (NNC) approach demonstrates remarkable adaptability, requiring as few as a single training sample per class to outperform the best CLIP backbone. This performance boost is achieved solely by employing a scalar per backbone.

2 Related work

Vision Foundation Models have extended the paradigm of pre-training to encompass wide-ranging datasets, ranging from hundreds of millions to billions of images. This expansion was significantly influenced by the introduction of Vision Transformers (ViTs) [11], which highlighted the feasibility of training Transformers [45] on such massive datasets within the field of computer vision. Subsequently, various large-scale pre-training methods have surfaced in the domain of computer vision [39, 47, 17]. Among these, a notable category of Vision Foundation Models is exemplified by CLIP [39, 42, 12], which specializes in aligning noisy image-text pairs extracted from web sources. The prominence of CLIP stems not only from its scalability but also from its capability to generate meaningful alignments with prompts that facilitate zero-shot classification [13, 49, 29]. There are multiple versions of CLIP trained on different datasets and with different backbones, including well-studied convolutional architectures such as Residual Networks (ResNets) [18] and ViTs [11] in various configurations. The focus of our current work is on studying the differences in image classification behaviour of each backbone used to train CLIP.

Model Ensembling Improving performance by combining the output of multiple models is a foundational technique in machine learning, with studies on its effectiveness dating back at least three decades [10, 2, 4, 26]. More recent work has shown that these ensembling techniques can also be combined with deep neural networks, leading to what is known as deep ensembles [26]. In this context, previous work has shown that deep ensembles exhibit high accuracy under distribution shift [36], and that higher divergence in training methodology leads to uncorrelated errors and better ensemble accuracy. In contrast to techniques utilizing the same model with varied random initializations to achieve orthogonal predictions, our approach involves combining predictions from pre-trained models with distinct backbones. These backbones inherently embody diverse biases and properties [34, 1, 19], facilitating the characterization of different properties associated with each label. By leveraging the inherent variations among these pre-trained backbones, our methodology provides a nuanced and effective means of capturing and utilizing diverse label-related features.

Backbone studies Goldblum et al. [14] undertake a comprehensive comparison involving a diverse array of widely utilized pretrained backbones and randomly initialized baselines. Their evaluation spans extensive downstream tasks, encompassing image classification across diverse domains such as natural, medical, and satellite images. Additionally, the study delves into tasks like object detection and segmentation, evaluates models for out-of-distribution generalization, and assesses performance in image retrieval. Compared to this study, our work focuses specifically on revealing the orthogonality in predictions among the backbones utilized in the CLIP framework for the task of image classification. We present a method that explores and leverages this orthogonality to combine predictions effectively, aiming to boost performance in a straightforward manner.

3 Proposed Approach

CLIP is trained using a self-supervised objective to align the outputs (i.e., representations) obtained from its language and visual encoders, which we denote as $\phi_l$ and $\phi_v$, respectively. This alignment serves a critical purpose, as we can map textual or visual inputs through their corresponding neural encoders into the same semantic space, assuming the outputs are normalized.

One way to use these models is to consider that the joint probability of an image $x$ and its corresponding description in natural language $l$ is proportional to a compatibility score between them, which we compute as the inner product between the encoded inputs, as Equation 1 shows below.

$\text{score}(x,l)=\phi_v(x)^{\top}\phi_l(l)$ (1)

To construct a probability distribution from these scores, we use the exponential function, thus allowing us to write the relationship shown below.

$p(x,l)\propto\exp\big(\phi_v(x)^{\top}\phi_l(l)\big)$ (2)

In Equation 2, $\mathbb{X}$ and $\mathbb{L}$ denote the visual and language spaces of possible inputs to the model, respectively. Then, given a downstream multi-class classification task with the label set $\mathbb{Y}\subset\mathbb{L}$ and $|\mathbb{Y}|=C$, we can write the following.

$p(y\mid x)=\dfrac{\exp\big(\phi_l(y)^{\top}\phi_v(x)\big)}{\sum_{y'\in\mathbb{Y}}\exp\big(\phi_l(y')^{\top}\phi_v(x)\big)}$ (3)
$y^{\star}=\arg\max_{y'\in\mathbb{Y}}p(y'\mid x)$ (4)
| Dataset | RN50 UN | RN50 DN | RN50 L2 | RN50 DN+L2 | RN101 UN | RN101 DN | RN101 L2 | RN101 DN+L2 | ViT-B-32 UN | ViT-B-32 DN | ViT-B-32 L2 | ViT-B-32 DN+L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pets | 58.44 | 83.43 | 85.80 | 83.07 | 29.11 | 84.38 | 86.86 | 84.90 | 39.68 | 83.07 | 87.46 | 83.70 |
| Cars | 48.59 | 53.67 | 54.22 | 53.00 | 50.29 | 59.26 | 61.12 | 60.73 | 47.23 | 57.18 | 59.73 | 58.05 |
| CUB | 17.45 | 44.79 | 46.57 | 49.79 | 4.00 | 38.95 | 49.64 | 50.72 | 7.56 | 46.46 | 52.99 | 52.16 |
| DTD | 34.36 | 40.43 | 41.22 | 41.38 | 27.39 | 41.33 | 43.67 | 42.98 | 34.57 | 45.96 | 43.99 | 43.03 |
| FGVC | 10.89 | 17.49 | 17.07 | 17.49 | 9.12 | 18.33 | 18.63 | 19.23 | 14.04 | 19.74 | 19.65 | 19.32 |
| Food | 63.40 | 74.48 | 77.91 | 77.23 | 52.77 | 74.23 | 81.86 | 80.96 | 64.48 | 78.13 | 82.58 | 82.80 |
| Flowers | 28.04 | 63.41 | 66.12 | 64.35 | 0.89 | 48.71 | 65.20 | 63.12 | 20.07 | 62.43 | 66.48 | 64.40 |
| ImgNet-1k | 52.11 | 58.06 | 59.84 | 58.05 | 29.58 | 58.05 | 62.28 | 61.19 | 49.62 | 60.62 | 63.35 | 61.40 |

| Dataset | ViT-B-16 UN | ViT-B-16 DN | ViT-B-16 L2 | ViT-B-16 DN+L2 | ViT-L-14 UN | ViT-L-14 DN | ViT-L-14 L2 | ViT-L-14 DN+L2 | Oracle UN | Oracle DN | Oracle L2 | Oracle DN+L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pets | 52.96 | 86.32 | 89.07 | 87.14 | 81.93 | 92.15 | 93.59 | 91.88 | 87.24 | 97.47 | 98.06 | 95.88 |
| Cars | 54.76 | 62.04 | 64.61 | 63.16 | 73.49 | 75.59 | 77.75 | 76.83 | 86.93 | 88.09 | 90.85 | 89.54 |
| CUB | 16.86 | 49.10 | 55.28 | 55.25 | 45.22 | 60.44 | 62.06 | 62.55 | 54.49 | 76.01 | 81.20 | 79.36 |
| DTD | 39.10 | 47.23 | 45.11 | 45.16 | 47.45 | 55.69 | 55.32 | 55.21 | 67.18 | 71.54 | 69.63 | 68.30 |
| FGVC | 18.48 | 24.36 | 24.39 | 24.99 | 30.93 | 34.29 | 31.71 | 33.42 | 46.20 | 52.87 | 52.09 | 50.83 |
| Food | 77.96 | 86.55 | 87.91 | 87.82 | 85.80 | 91.91 | 92.32 | 92.48 | 92.52 | 96.16 | 96.60 | 96.63 |
| Flowers | 44.53 | 68.52 | 71.43 | 68.91 | 69.28 | 77.98 | 79.05 | 76.83 | 74.78 | 84.37 | 86.32 | 83.59 |
| ImgNet-1k | 57.18 | 65.78 | 68.34 | 66.83 | 72.01 | 74.37 | 75.54 | 74.20 | 80.68 | 84.25 | 85.30 | 83.76 |

Table 1: Zero-shot performance of CLIP backbones using different normalization techniques: an unnormalized version (UN), Distribution Normalization (DN), L2 normalization (L2), and the combination of both (DN+L2). We also compare against the upper-bound prediction obtained if we could perfectly combine these backbones, denoted as Oracle.

ImageNet-1K [figure]

Figure 1: ImageNet-1k Venn diagram of the correct predictions of each backbone. The top part of the diagram shows how many backbones correctly predict each set of images. Each column represents a set of image instances that are predicted correctly by a particular group of backbones. Each row shows, in colour, that the corresponding backbone correctly predicts that set of image instances, and in grey that it does not. The bottom part shows the number of images in each set. The right part shows the total number of correctly predicted images per backbone.

Based on these principles, to perform zero-shot image classification using CLIP, we follow previous work and generate prompts that encapsulate the labels of the target dataset (e.g., an image of {label}, which is a pet). Subsequently, we calculate the similarity between the image feature and these prompts, and the prompt exhibiting the highest similarity is chosen as the label for the given image. Importantly, we note that when performing zero-shot classification in this fashion, the normalization defined in Equation 3 is not required, since it does not change the argmax; in practice, this step is often skipped to reduce computational requirements.

Since our interest is to combine multiple CLIP backbones operating simultaneously in this fashion, our first key insight is to note that Equation 3 is essentially equivalent to computing a softmax over a vector of logit-like values $\bar{\bm{z}}\in\mathbb{R}^{C}$, whose entries are constructed using the inner product between a given input $x$ and the set of labels $\mathbb{Y}$. Concretely, let $\bar{z}_c$ denote the $c$-th entry in the vector $\bar{\bm{z}}$; then we define the following.

$\bar{z}_c=\phi_l(y_c)^{\top}\phi_v(x)\quad\forall c\in\{1,\ldots,C\}$ (5)

We can use the equation above to rewrite Equation 3 simply as $p(y\mid x)=\text{softmax}(\bar{\bm{z}})$, meaning that we can now essentially equate the output of our zero-shot CLIP classification pipeline to that of a regular classifier, and interpret the output as a probability distribution over the label set $\mathbb{Y}$.
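For concreteness, the snippet below sketches this zero-shot pipeline end to end, assuming the OpenCLIP API [20]; the model name, pretrained tag, image path, and label set are illustrative placeholders rather than our exact configuration.

```python
import torch
from PIL import Image
import open_clip

# Load one backbone (the name and pretrained tag are illustrative).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

labels = ["cat", "dog", "rabbit"]                            # hypothetical label set Y
prompts = [f"an image of a {c}, which is a pet" for c in labels]

with torch.no_grad():
    txt = model.encode_text(tokenizer(prompts))              # (C, d) prompt embeddings phi_l(y_c)
    img = model.encode_image(
        preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0))  # (1, d) image embedding phi_v(x)
    txt = txt / txt.norm(dim=-1, keepdim=True)               # L2-normalize both sides
    img = img / img.norm(dim=-1, keepdim=True)

z = img @ txt.T                                              # Eq. (5): z_c = phi_l(y_c)^T phi_v(x)
prediction = labels[z.argmax(dim=-1).item()]                 # Eq. (4): the softmax is unnecessary for the argmax
```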

Since we are interested in combining backbones, we note that a key goal when constructing traditional model ensembles is to increase robustness by increasing the overall confidence of the predictions. In the supervised setting, confidence is generally measured as the probability associated with the predicted label. Although a network is expected to provide a calibrated confidence measure in addition to its prediction, it has been shown that deep neural nets are poorly calibrated in terms of confidence [16].

In order to shed light on this issue for the case of CLIP models, we first perform an in-depth study of the output of 5 different single-backbone models used for zero-shot classification on 8 benchmark datasets (please see Section 4 for details on backbones and datasets). Concretely, we propose to assess the “orthogonality” of zero-shot predictions by means of two distinct approaches. Firstly, we establish an Oracle prediction that is informed about the appropriate backbone to select for each image sample in the target dataset. While a more sophisticated combination of probabilities may yield enhanced performance, we consider this oracle prediction a pragmatic upper bound for evaluating the potential of combining these backbones. Secondly, we propose to employ Venn diagrams to visually analyze the discrepancies and overlaps in predictions, which allows us to qualitatively characterize the nature of the orthogonality across models.

At this point, it is also important to highlight that the features extracted from the image and language encoders must first inhabit the same embedding space for effective comparison between prompts and the corresponding image. While different methods for doing this exist, a prevalent technique for achieving this alignment involves normalizing the features using the $L_2$-norm. However, Zhou et al. [50] recently argued that a discrepancy exists between the pre-training objective and the subsequent application in downstream tasks, and proposed Distribution Normalization (DN), which consists of subtracting half of the estimated mean from each of the features,

$\bar{z}^{\prime}_c=\Big(\phi_l(y_c)-\frac{1}{2}\mu_y\Big)^{\top}\Big(\phi_v(x)-\frac{1}{2}\mu_x\Big)$ (6)

where $\mu_x$ and $\mu_y$ are the mean image and text features computed on a subset of the target dataset. We, therefore, also study the impact of different normalization techniques across our selected backbones and datasets, comparing against the option of not performing any normalization (UN).
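As a concrete illustration of these variants, the sketch below applies UN, DN, $L_2$, and DN+$L_2$ to pre-extracted feature matrices; the tensor shapes are placeholders, and the 1/2 factor on the means follows Equation 6.

```python
import torch

def score(img_feat, txt_feat, mode="l2", mu_x=None, mu_y=None):
    """img_feat: (N, d) image features; txt_feat: (C, d) prompt features."""
    if mode in ("dn", "dn+l2"):
        img_feat = img_feat - 0.5 * mu_x                  # mu_x: mean image feature of a held-out subset
        txt_feat = txt_feat - 0.5 * mu_y                  # mu_y: mean prompt feature
    if mode in ("l2", "dn+l2"):
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return img_feat @ txt_feat.T                          # (N, C) similarity scores; "un" leaves features untouched

# Usage with placeholder shapes: 8 images, 10 classes, 512-d features.
img, txt = torch.randn(8, 512), torch.randn(10, 512)
scores = score(img, txt, mode="dn+l2", mu_x=img.mean(0), mu_y=txt.mean(0))
```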

Table 1 summarizes our quantitative results. We first see that, while the necessity of normalization for aligning the vision and language spaces is evident, determining the optimal normalization for each backbone and dataset proves less straightforward. In contrast to the findings of Zhou et al. [50], our results indicate that, overall, $L_2$ normalization consistently demonstrates superior accuracy in our experimental setup, despite the potential insights that could be derived from other properties of Distribution Normalization [50].

Furthermore, within the ResNet family, we observe that, for the Flowers dataset, the deeper model, ResNet-101, exhibits inferior performance compared to ResNet-50. Conversely, the ViT models consistently demonstrate improved performance with increasing model size, and ViT-L-14 emerges as the best-performing backbone across all datasets by a significant margin.

Finally, we see that our Oracle consistently outperforms the best individual backbone. Specifically, we see that on FGVC, CUB, and Cars the oracle exhibits improvements of 20.38%, 19.14%, and 13.11%, respectively, for the L2 Normalization cases. It is important to notice that regardless of the normalization method used, a distinct orthogonality is evident when assessing the performance of the proposed Oracle.

In qualitative terms, as shown in Figure 1 via a Venn diagram with the correct predictions of each backbone on ImageNet-1k [8], we see that ViT-L-14 accurately predicts 37,766 images from the test set, with 21,593 correctly predicted across all five backbones (please refer to the Supplementary Material for results on other datasets). Leveraging the informed Oracle increases the overall correct predictions to 42,644 samples out of 50,000. Notably, each backbone exhibits correct predictions that are unique to that specific backbone: RN50 and RN101 contribute 435 and 514 exclusive predictions, respectively, while ViTs B-32, B-16, and L-14 present 558, 605, and 2,489 distinctive predictions, respectively. We highlight that the predictions within each backbone family do not constitute a subset of the best model’s predictions in that family.

Based on these empirical findings, we hypothesize that we can effectively improve classification performance by learning to combine CLIP backbones. Thus, we propose an approach that is directly inspired by previous work on network calibration, but extended to multiple model combinations. Concretely, our framework can be regarded as a set of variations of Platt scaling [38]. In Platt scaling, the logits of a model are used as features for a logistic regression model, which is trained on the validation set to return probabilities. More concretely, given logit scores $z_i$ for an example $i$, Platt scaling learns scalar parameters $a,b\in\mathbb{R}$ and outputs $\hat{q}_i=\sigma(a z_i+b)$ as the calibrated probability. Parameters $a$ and $b$ can be optimized using the NLL loss over the validation set, while the parameters of the original model remain fixed.

In this work, we adapt the simplest extension of Platt scaling, also known as temperature scaling [16], to our backbone combination setting. In temperature scaling, a single scalar parameter $t>0$ is set for all classes of a given model. The new, calibrated confidence prediction is given by Equation 7 below, where $t$ is called the temperature and $z_i$ is the vector of logits for example $i$ returned by the uncalibrated model.

$\hat{q}_i=\max_k\,\text{softmax}(z_i/t)^{(k)}$ (7)

In the original setting, $t$ is optimized with respect to the NLL on the validation set, aiming to reduce the overconfidence of the model in its predictions and produce more reliable probabilities. Because the parameter $t$ does not change the maximum of the softmax function, the class prediction $\hat{y}_i$ remains unchanged, meaning that the accuracy of a given model remains the same.
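For reference, the sketch below fits such a single-backbone temperature on held-out logits and labels by minimizing the NLL; the optimizer choice and step count are illustrative. As noted above, this changes the confidence but not the argmax, so accuracy is unaffected.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """logits: (N, C) uncalibrated scores; labels: (N,) ground-truth class indices."""
    log_t = torch.zeros(1, requires_grad=True)        # parameterize t = exp(log_t) so that t > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)  # NLL of the rescaled logits
        loss.backward()
        opt.step()
    return log_t.exp().item()
```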

Different from the above, in this work we aim to jointly optimize a set of temperature parameters $t_b$ with $b\in[1,\ldots,B]$ for a set of $B$ CLIP backbones, combining the predictions of the backbones into a final prediction that adjusts the confidence of each backbone depending on the confidence of the others and on the input. To the best of our knowledge, there is no prior use of temperature scaling in the context of calibrating mixtures of probabilistic models such as the one proposed above. Thus, we aim to improve classification performance by learning the temperatures $t_b$ that weigh the logits $z_i^b$ of backbone $b$ for example $i$, using the cross-entropy loss as expressed in the following equations:

$p(y\mid x)=\text{softmax}\Big(\sum_{b=1}^{B}t_b\,z^b_i\Big)$ (8)
$\mathcal{L}_c=-\sum_{y\in\mathbb{Y}}y_i\,\log p(y\mid x)$
where $y_i$ denotes the one-hot ground-truth label of example $i$.

Following the original proposals, which rely on data to find the optimal calibration temperature $t$, we propose utilizing two existing machine learning techniques to find the set of $B$ parameters that optimize the joint confidence (the temperatures for each backbone considered in the mix).

  • We utilize a genetic algorithm (GAC) to find the set of temperatures that best combines the backbones, improving classification performance by regulating the confidence of each expert (backbone) relative to the others. We use mutation and crossover operators to search for the temperatures that minimize the classification error.

  • We train a simple neural network combination (NNC), a single-layer MLP, to predict the set of temperatures that best calibrates our backbone mixture. As input, this model receives the concatenated representations obtained by passing the images through the encoder $\phi_v$ of each backbone $b\in\mathbb{B}$. The neural net directly produces a vector of temperatures $\bm{t}\in\mathbb{R}^{B}$ and is trained using a cross-entropy loss; a minimal sketch is given below.
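The following is a minimal sketch of such an NNC combiner: a single linear layer maps the concatenated image features of the $B$ backbones to one temperature per backbone, and the temperature-weighted sum of logits (Equation 8) is trained with cross-entropy. Feature dimensions, batch size, and the number of classes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NNC(nn.Module):
    def __init__(self, feat_dims, num_backbones):
        super().__init__()
        self.mlp = nn.Linear(sum(feat_dims), num_backbones)   # one temperature per backbone

    def forward(self, feats, logits):
        # feats: list of B tensors of shape (N, d_b); logits: (N, B, C) per-backbone scores.
        t = self.mlp(torch.cat(feats, dim=-1))                 # (N, B) sample-wise temperatures
        return (t.unsqueeze(-1) * logits).sum(dim=1)           # (N, C) combined logits, Eq. (8)

# One training step with placeholder shapes: B=2 backbones, C=100 classes.
feats = [torch.randn(16, 512), torch.randn(16, 768)]
logits = torch.randn(16, 2, 100)
labels = torch.randint(0, 100, (16,))
model = NNC([512, 768], num_backbones=2)
loss = F.cross_entropy(model(feats, logits), labels)
loss.backward()
```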

| Dataset | Best | Vote T-1 | Vote T-3 | Conf | Log-Avg | C-Conf | C-Log-Avg | GAC | NNC |
|---|---|---|---|---|---|---|---|---|---|
| Pets | 93.59 | 91.63 ↓-1.96 | 93.00 ↓-0.59 | 92.26 ↓-1.33 | 92.86 ↓-0.73 | 92.89 ↓-0.70 | 92.94 ↓-0.65 | 93.30 ↓-0.29 | 94.58 ↑0.99 |
| Cars | 77.75 | 73.67 ↓-4.08 | 75.70 ↓-2.05 | 75.56 ↓-2.19 | 75.75 ↓-2.00 | 76.78 ↓-0.97 | 75.96 ↓-1.79 | 78.96 ↑1.21 | 80.30 ↑2.55 |
| CUB | 62.06 | 63.32 ↑1.26 | 64.38 ↑2.32 | 53.71 ↓-8.35 | 65.53 ↑3.47 | 61.79 ↓-0.27 | 65.01 ↑2.95 | 66.29 ↑4.23 | 68.40 ↑6.34 |
| DTD | 55.32 | 53.72 ↓-1.60 | 55.69 ↑0.37 | 42.55 ↓-12.77 | 54.36 ↓-0.96 | 55.59 ↑0.27 | 55.21 ↓-0.11 | 56.12 ↑0.80 | 58.94 ↑3.62 |
| FGVC | 31.71 | 28.95 ↓-2.76 | 30.96 ↓-0.75 | 30.63 ↓-1.08 | 31.53 ↓-0.18 | 31.29 ↓-0.42 | 31.89 ↑0.18 | 33.18 ↑1.47 | 35.88 ↑4.17 |
| Food | 92.32 | 90.01 ↓-2.31 | 90.87 ↓-1.45 | 89.62 ↓-2.70 | 91.09 ↓-1.23 | 91.43 ↓-0.89 | 90.91 ↓-1.41 | 92.91 ↑0.59 | 93.07 ↑0.75 |
| Flowers | 79.05 | 75.70 ↓-3.35 | 76.28 ↓-2.77 | 75.85 ↓-3.20 | 75.54 ↓-3.51 | 77.60 ↓-1.45 | 76.25 ↓-2.80 | 78.16 ↓-0.89 | 81.10 ↑2.05 |
| ImgNet-1k | 75.54 | 72.67 ↓-2.87 | 73.95 ↓-1.59 | 67.46 ↓-8.08 | 73.89 ↓-1.65 | 74.86 ↓-0.68 | 73.86 ↓-1.68 | 76.22 ↑0.68 | 76.59 ↑1.05 |
| Mean Δ | – | -2.21 | -0.81 | -4.96 | -0.85 | -0.64 | -0.66 | 0.98 | 2.69 |
| Max Δ | – | 1.26 | 2.32 | -1.08 | 3.47 | 0.27 | 2.95 | 4.23 | 6.34 |
| Min Δ | – | -4.08 | -2.77 | -12.77 | -3.51 | -1.45 | -2.80 | -0.89 | 0.75 |
Table 2: Our results on combining the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, compared to the best-performing single backbone (Best). We present the improvement (↑) and deterioration (↓) in accuracy for each method relative to the Best backbone. Mean, Max, and Min Δ summarize the difference in performance across datasets.
| Dataset | RN50 | RN101 | ViT-B-32 | ViT-B-16 | ViT-L-14 | Oracle |
|---|---|---|---|---|---|---|
| Pets | 85.53 ↓-0.27 | 89.13 ↑2.26 | 87.71 ↑0.25 | 91.66 ↑2.59 | 94.63 ↑1.04 | 98.23 ↑0.16 |
| Cars | 75.15 ↑20.92 | 80.91 ↑19.79 | 76.77 ↑17.04 | 83.24 ↑18.63 | 89.13 ↑11.38 | 95.77 ↑4.92 |
| CUB | 65.64 ↑19.07 | 70.69 ↑21.06 | 70.87 ↑17.88 | 76.49 ↑21.21 | 83.29 ↑21.23 | 92.56 ↑11.36 |
| DTD | 68.99 ↑27.77 | 69.68 ↑26.01 | 71.76 ↑27.77 | 74.95 ↑29.84 | 78.30 ↑22.98 | 89.57 ↑19.95 |
| FGVC | 39.87 ↑22.80 | 41.91 ↑23.28 | 41.34 ↑21.69 | 49.17 ↑24.78 | 60.88 ↑29.16 | 77.53 ↑25.44 |
| Food | 81.73 ↑3.82 | 84.49 ↑2.63 | 83.25 ↑0.67 | 88.07 ↑0.15 | 92.42 ↑0.10 | 97.80 ↑1.20 |
| Flowers | 90.91 ↑24.78 | 92.73 ↑27.53 | 92.65 ↑26.17 | 94.57 ↑23.14 | 98.36 ↑19.30 | 99.25 ↑12.93 |
| ImgNet-1k | 70.25 ↑10.40 | 72.44 ↑10.15 | 73.01 ↑9.66 | 77.47 ↑9.12 | 82.15 ↑6.63 | 90.09 ↑4.80 |
| Mean Δ | 16.16 | 16.59 | 15.14 | 16.07 | 13.98 | 10.10 |
| Max Δ | 27.77 | 27.53 | 27.77 | 29.84 | 29.16 | 25.44 |
| Min Δ | -0.27 | 2.26 | 0.25 | 0.15 | 0.10 | 0.16 |
Table 3: Linear probe accuracy across multiple datasets and backbones, showing the performance improvement achieved by the linear probe relative to the zero-shot version with L2 normalization. The last column presents the Oracle performance, i.e., the accuracy obtained when the optimal backbone for each image sample is known in advance. The final three rows summarize the performance improvements.
| Dataset | Best | Vote T-1 | Vote T-3 | Conf | Log-Avg | C-Conf | C-Log-Avg | MoE | GAC | NNC |
|---|---|---|---|---|---|---|---|---|---|---|
| Pets | 94.63 | 93.13 ↓-1.50 | 93.00 ↓-1.64 | 90.32 ↓-4.31 | 92.78 ↓-1.85 | 92.80 ↓-1.83 | 93.40 ↓-1.23 | 92.17 ↓-2.46 | 94.46 ↓-0.17 | 94.99 ↑0.36 |
| Cars | 89.13 | 87.94 ↓-1.19 | 88.07 ↓-1.06 | 84.39 ↓-4.74 | 89.14 ↑0.01 | 88.67 ↓-0.46 | 89.08 ↓-0.05 | 88.51 ↓-0.62 | 90.00 ↑0.87 | 90.19 ↑1.06 |
| CUB | 83.29 | 82.05 ↓-1.24 | 81.79 ↓-1.50 | 74.09 ↓-9.20 | 82.72 ↓-0.57 | 82.10 ↓-1.19 | 83.34 ↑0.05 | 76.80 ↓-6.49 | 84.55 ↑1.26 | 84.88 ↑1.59 |
| DTD | 78.30 | 78.35 ↑0.05 | 78.40 ↑0.11 | 73.03 ↓-5.27 | 78.56 ↑0.27 | 76.65 ↓-1.65 | 79.10 ↑0.80 | 72.44 ↓-5.86 | 80.00 ↑1.70 | 79.14 ↑0.84 |
| FGVC | 60.88 | 55.84 ↓-5.04 | 56.08 ↓-4.80 | 49.53 ↓-11.34 | 58.36 ↓-2.52 | 58.96 ↓-1.92 | 59.35 ↓-1.53 | 60.31 ↓-0.57 | 61.51 ↑0.63 | 62.20 ↑1.32 |
| Food | 92.42 | 93.03 ↑0.61 | 92.94 ↑0.52 | 84.21 ↓-8.21 | 93.99 ↑1.57 | 92.83 ↑0.41 | 93.64 ↑1.22 | 94.14 ↑1.72 | 93.46 ↑1.04 | 93.72 ↑1.30 |
| Flowers | 98.36 | 96.76 ↓-1.59 | 96.81 ↓-1.54 | 94.19 ↓-4.16 | 96.99 ↓-1.37 | 97.84 ↓-0.52 | 97.17 ↓-1.19 | 86.56 ↓-11.80 | 98.06 ↓-0.30 | 98.39 ↑0.03 |
| ImgNet-1k | 82.15 | 77.20 ↓-4.95 | 77.65 ↓-4.50 | 75.36 ↓-6.79 | 77.57 ↓-4.59 | 81.01 ↓-1.14 | 80.29 ↓-1.87 | 78.81 ↓-3.34 | 82.37 ↑0.22 | 82.48 ↑0.33 |
| Mean Δ | – | -1.86 | -1.80 | -6.75 | -1.13 | -1.04 | -0.48 | -3.68 | 0.66 | 0.85 |
| Max Δ | – | 0.61 | 0.52 | -4.16 | 1.57 | 0.41 | 1.22 | 1.72 | 1.70 | 1.59 |
| Min Δ | – | -5.04 | -4.80 | -11.34 | -4.59 | -1.92 | -1.87 | -11.80 | -0.30 | 0.03 |
Table 4: Our results on combining the LinearProbe CLIP predictions with different backbones, which we group into non-parametric and parametric techniques. We present the improvement (↑) and deterioration (↓) in accuracy for each method relative to the best-performing single backbone (Best). Mean, Max, and Min Δ summarize the difference in performance across datasets.

4 Experimental Setup

For our experiments, we utilize OpenCLIP [20], an open-source implementation of the original CLIP [39]. Our research is predominantly centered on the context of zero-shot learning, where we operate under the fundamental constraint of lacking access to labeled data. As a result, our initial investigation primarily revolves around evaluating the model’s performance in comparison to the baseline zero-shot CLIP paradigm. However, our study also encompasses experimental investigations into the differences among adapted CLIPs obtained via linear probing, and into the use of limited samples for combining the backbones.

To help characterize the improvements that can be achieved with our proposed approach, we consider common model ensembling techniques as baselines that are grouped into Non-Parametric and Parametric.

In the non-parametric group, we consider four approaches commonly used in ensembles; a sketch of three of these combination rules is provided after their descriptions:

Logit averaging (Log-Avg)

: We take the average of logits to produce a new logit that goes through a softmax to predict the class label for a certain sample.

Voting (Vote T-1 and Vote T-3)

: Voting enables us to combine conceptually different classifiers to predict the class label. It consists of using the majority of votes between the backbones to produce a final prediction. We use the top-1 (Vote T-1) prediction from each backbone, and the final prediction is determined by the label with the highest number of votes. In cases where multiple labels receive the same number of votes, we select the one with the highest probability. Additionally, we experiment with top-3 voting (Vote T-3), where we seek a consensus among backbones by considering the three most likely predictions from each of them. These predictions are weighted based on their position within the top-3 list.

Confidence (Conf)

: Using the Shannon entropy to evaluate the confidence of each backbone in each prediction, we select the backbone with the highest confidence for a prediction as the source of the final prediction.
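A minimal sketch of three of these rules (Log-Avg, Vote T-1, and Conf), operating on a stack of per-backbone logits; shapes are placeholders, and tie-breaking in voting follows the description above only approximately.

```python
import torch
import torch.nn.functional as F

def log_avg(logits):                                    # logits: (B, N, C) per-backbone scores
    return logits.mean(dim=0).argmax(-1)                # average the logits, then take the argmax

def vote_top1(logits):
    probs = logits.softmax(-1)                          # (B, N, C)
    votes = F.one_hot(probs.argmax(-1), probs.shape[-1])    # (B, N, C): one vote per backbone
    counts = votes.sum(0).float()                       # (N, C) vote counts per class
    tie_break = probs.max(dim=0).values                 # resolve ties with the highest class probability
    return (counts + 1e-3 * tie_break).argmax(-1)

def conf_select(logits):
    probs = logits.softmax(-1)                          # (B, N, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # (B, N) Shannon entropy
    best = entropy.argmin(dim=0)                        # most confident (lowest-entropy) backbone per sample
    return probs.argmax(-1).gather(0, best.unsqueeze(0)).squeeze(0)

# Usage with placeholder shapes: B=5 backbones, N=4 samples, C=10 classes.
predictions = vote_top1(torch.randn(5, 4, 10))
```

The calibrated variants described next simply rescale each backbone’s logits with an independently fitted temperature before applying the same combination rules.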

We also consider two variations of parametric calibration where we perform model-wise calibration before combining models, as follows.

Calibrated confidence (C-Conf)

For this approach, we first calibrate the probabilities of each backbone independently using temperature scaling. We then follow the procedure above and utilize the Shannon entropy of each backbone to select the one with the highest confidence as the source of the final prediction.

Calibrated logit averaging (C-Log-Avg)

We first calibrate the probabilities of each backbone independently using temperature scaling. We then simply average the calibrated logits using the independently-obtained temperature parameters.

We also investigate the adaptability and orthogonality exhibited by each backbone when subjected to adaptation for a particular dataset. This examination seeks to understand how effectively each backbone can be tailored to a target dataset’s specific characteristics and determine if the orthogonality remains after adaptation. This adaptation is facilitated through the utilization of linear probing. In contrast to a comprehensive fine-tuning of the pre-trained models, linear probing emerges as a more straightforward and effective methodology for tailoring foundational models to a target dataset. The linear probing technique employs a linear layer, wherein the weights of the pre-trained backbone remain frozen, allowing for the efficient learning of a classifier tailored to the attributes of the specific dataset.

We initialize the weights of the linear probe using language weights, a practice detailed in [29]. This initialization strategy proves notably superior to random initialization, exhibiting enhanced stability, particularly in scenarios involving few-shot learning. We use the output of each backbone $\phi_b$, normalized with $L_2$, as input to the linear probe. Each linear probe is trained on the target dataset’s training set, allocating 90% for the actual linear probe training and reserving the remaining 10% for the integration of backbones through the parametric approaches.
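As an illustration, the sketch below builds such a language-initialized probe on frozen, $L_2$-normalized image features; feature dimensions, optimizer settings, and the single gradient step are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_probe(text_features):
    """text_features: (C, d) L2-normalized prompt embeddings used to initialize the classifier."""
    num_classes, dim = text_features.shape
    probe = nn.Linear(dim, num_classes, bias=True)
    with torch.no_grad():
        probe.weight.copy_(text_features)   # language initialization instead of random weights
        probe.bias.zero_()
    return probe

# One training step on pre-extracted, L2-normalized image features (placeholder shapes).
probe = make_probe(torch.randn(100, 512))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
feats = torch.randn(64, 512)
feats = feats / feats.norm(dim=-1, keepdim=True)
labels = torch.randint(0, 100, (64,))
loss = F.cross_entropy(probe(feats), labels)
loss.backward()
opt.step()
```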

Finally, we make use of Mixture of Experts (MoE), one of the most popular techniques in machine learning for enhancing model performance by combining the strengths of multiple specialized submodels [41, 51, 28, 6]. MoEs divide the overall model into multiple experts, each adept at addressing specific regions of the input space. A gating network determines the relevance of each expert for a given input, orchestrating the collaboration. We use the implementation of Sparse MoE [51] by [41], where the input to the MoE layer is the concatenation of the vision features $x_b$ from each $\phi_b$. We use five experts and train for the classification task with a cross-entropy loss, using Adam [23] with a learning rate of $2\times10^{-5}$ for 300 epochs.

4.1 Datasets and Backbones

We evaluate our proposed approach on eight popular image classification datasets: Describable Textures Dataset (DTD) [7], FGVC-Aircraft (FGVC) [33], Caltech-UCSD Birds-200-2011 (CUB) [46], Flowers (Flowers) [35], Pets (Pets) [37], Stanford Cars (Cars) [25], Food (Food) [3] and ImageNet-1k (ImgNet-1k) [8]. These datasets consist of 47, 100, 200, 102, 37, 196, 101 and 1,000 classes, respectively. Our objective is to evaluate the orthogonality of the predictions with respect to the different backbones used to train CLIP. Moreover, we want to evaluate linear probing and our method for fusing these backbones to improve performance on the image classification task. We, therefore, use accuracy to evaluate the performance of each CLIP backbone and of the fusion mechanisms.

We consider a broad selection of backbones. In particular, this work explores the ResNet [18] family with its variants RN50 and RN101, and the ViT [11] family with B-16, B-32 and L-14 variants for the image encoder $\phi_v$.

5 Results

| Dataset | Best | NNC(1) | NNC(2) | NNC(4) | NNC(8) | NNC(16) | NNC(32) |
|---|---|---|---|---|---|---|---|
| Pets | 93.59 | 94.09 ↑0.50 | 94.03 ↑0.44 | 93.81 ↑0.22 | 94.18 ↑0.59 | 94.19 ↑0.60 | 94.19 ↑0.60 |
| Cars | 77.75 | 79.08 ↑1.33 | 79.27 ↑1.52 | 79.09 ↑1.34 | 79.31 ↑1.56 | 79.29 ↑1.54 | 79.38 ↑1.63 |
| CUB | 62.06 | 66.76 ↑4.70 | 66.97 ↑4.91 | 66.81 ↑4.75 | 66.91 ↑4.85 | 66.95 ↑4.89 | 66.86 ↑4.80 |
| DTD | 55.32 | 56.97 ↑1.65 | 57.34 ↑2.02 | 57.50 ↑2.18 | 57.23 ↑1.91 | 57.71 ↑2.39 | 57.34 ↑2.02 |
| FGVC | 31.71 | 33.21 ↑1.50 | 33.03 ↑1.32 | 32.82 ↑1.11 | 33.00 ↑1.29 | 35.28 ↑3.57 | 35.76 ↑4.05 |
| Food | 92.32 | 92.50 ↑0.18 | 92.97 ↑0.65 | 93.03 ↑0.71 | 93.02 ↑0.70 | 93.07 ↑0.75 | 93.07 ↑0.75 |
| Flowers | 79.05 | 78.65 ↓-0.40 | 78.95 ↓-0.10 | 78.95 ↓-0.10 | 79.05 ↑0.00 | 81.80 ↑2.75 | 81.80 ↑2.75 |
| ImgNet-1k | 75.54 | 76.19 ↑0.65 | 76.40 ↑0.86 | 76.34 ↑0.80 | 76.37 ↑0.83 | 75.98 ↑0.44 | 76.34 ↑0.80 |
| Mean Δ | – | 1.26 | 1.45 | 1.38 | 1.47 | 2.12 | 2.18 |
| Max Δ | – | 4.70 | 4.91 | 4.75 | 4.85 | 4.89 | 4.80 |
| Min Δ | – | -0.40 | -0.10 | -0.10 | 0.00 | 0.60 | 0.60 |
Table 5: Ablation of NNC performance when changing the number of samples used for training. We use NNC($n$) to denote NNC trained with $n$ samples per class. ↑ and ↓ indicate the improvement or deterioration in performance compared with the Best backbone in the zero-shot setting.

Combination of backbones.

The challenges of effectively fusing different backbones for improved predictions are evident in Table 2. Among the non-parametric baselines, leveraging the confidence of each backbone in its predictions consistently fails to enhance the overall performance beyond that of the best backbone. The lack of calibrated probabilities in Conf contributes to overconfidence in some backbones, resulting in performance degradation when combined. Calibrating the confidence (C-Conf) leads to improvements, although the performance still falls short of matching the best backbone, except on DTD. This trend persists for the Log-Avg approach, where averaging across backbones does not exploit their orthogonal prediction capabilities effectively, yielding an average delta accuracy of -0.85%. Notably, the Log-Avg approach shows a substantial improvement over the best backbone on the CUB dataset. When the backbones are calibrated (C-Log-Avg), logit averaging improves on all datasets except CUB, Food, and ImgNet-1k compared to the non-calibrated version. Intriguingly, conventional ensemble techniques such as Vote T-1 and Vote T-3 also prove ineffective in providing a significant boost in prediction accuracy beyond that of the best backbone. For the parametric methods, we utilize the entire training set of the target datasets. Our proposed NNC approach demonstrates a noteworthy capability to enhance the performance of the best backbone, achieving a substantial improvement of up to 6.34% in the case of the CUB dataset. On average, our method exhibits a commendable improvement of 2.69% when compared to the performance of the best backbone across the evaluated datasets.

Table 3 illustrates the accuracy results of linear probing for each dataset and backbone. Notably, a substantial enhancement in performance is observed through this adaptation technique when compared to the zero-shot $L_2$-norm results (Tab. 1). RN101 stands out as the backbone experiencing the most significant improvement, averaging a remarkable 16.59% across all datasets, surpassing even ViT-B-32. Despite this, its performance falls short of ViT-B-16 and ViT-L-14. Even after linear probing, ViT-L-14 remains the top-performing backbone across all datasets. Noteworthy is the considerable room for improvement indicated by the Oracle of linear probes, which shows superior performance compared to any individual backbone, suggesting potential gains through a combination of backbones. Venn diagrams for the linear probes are also shown in the Appendix. Although the gap between the Oracle and the best linear probe is consistent across datasets, we find that the room for improvement is much smaller than in the zero-shot setting.

Table 4 presents the results of combining the linear probing versions of each backbone, using both non-parametric and parametric approaches. Similar to the zero-shot setting, our proposed NNC consistently demonstrates improvement across all datasets, achieving an average enhancement of 0.85% compared to the best linear-probe backbone. Notably, this improvement is more modest than in the zero-shot version, which records an average improvement of 2.69%. Intriguingly, the MoE approach does not reach the performance level of the best backbone and exhibits a negative improvement of -3.68%. This discrepancy suggests that MoE might face challenges in effectively partitioning the input space into distinct clusters specific to certain experts, potentially hindering its optimal functionality.

In our exploration of the effectiveness of the NNC approach, we extend our analysis to a scenario where we limit the number of samples used to combine the zero-shot CLIPs. This experiment allows us to assess the adaptability and performance of our proposed method under limited training data conditions. Table 5 presents the performance of NNC for different numbers of samples per class. Although the performance of NNC generally improves when more data is available to combine the backbones, in the majority of cases using just one sample per class, NNC(1), is enough to improve over the best backbone. Notably, there is a dip in the improvement trend when four samples per class are available, NNC(4); we believe this is because we use the same training hyperparameters across all sample counts, and they likely require further tuning.

6 Conclusion

In this paper, we have undertaken a comprehensive analysis of various backbones within the CLIP framework, specifically focusing on the image classification task. Unlike broader studies that span various downstream tasks, such as that by Goldblum et al. [14], our emphasis lies in understanding and leveraging the unique contributions of each backbone within the CLIP context. Our research unveils a distinctive orthogonality in predictions between backbones of zero-shot CLIP and LinearProbe CLIP, presenting a valuable avenue for enhancing CLIP’s performance through a synergistic combination of these backbones. The introduction of an Oracle highlights this orthogonality, emphasizing the potential of mixing different backbones to optimize performance in image classification tasks. We propose an approach that learns a set of temperatures, refining performance by appropriately weighting the logits from each backbone to yield an enhanced prediction. Notably, our proposed method introduces two efficient techniques for learning these temperatures, a genetic algorithm and a multi-layer perceptron (MLP), with the MLP showcasing the ability to combine backbones effectively even with limited samples, e.g., one per class. This methodology provides a streamlined means of capitalizing on the individual strengths of diverse backbones, ultimately boosting predictive accuracy in the realm of image classification. Our findings contribute to a finer understanding of CLIP’s backbone behaviour and offer a practical strategy for optimizing its performance in real-world applications.

References

  • [1] Antonio A Abello, Roberto Hirata, and Zhangyang Wang. Dissecting the high-frequency bias in convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 863–871, 2021.
  • [2] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning, 36:105–139, 1999.
  • [3] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014.
  • [4] Leo Breiman. Bagging predictors. Machine learning, 24:123–140, 1996.
  • [5] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • [6] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 23049–23062. Curran Associates, Inc., 2022.
  • [7] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [10] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
  • [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  • [12] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, and Ludwig Schmidt. Datacomp: In search of the next generation of multimodal datasets, 2023.
  • [13] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
  • [14] Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, and Tom Goldstein. Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks, 2023.
  • [15] Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 478(2266):20210068, 2022.
  • [16] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330. PMLR, July 2017.
  • [17] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [19] Katherine Hermann, Ting Chen, and Simon Kornblith. The origins and prevalence of texture bias in convolutional neural networks. Advances in Neural Information Processing Systems, 33:19000–19015, 2020.
  • [20] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021.
  • [21] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  • [22] Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. Tiger: Text-to-image grounding for image caption evaluation, 2019.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
  • [25] Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. 2013.
  • [26] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
  • [27] Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. Vilbertscore: Evaluating image caption using vision-and-language bert. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 34–39, 2020.
  • [28] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
  • [29] Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. Elevater: A benchmark and toolkit for evaluating language-augmented visual models. Neural Information Processing Systems, 2022.
  • [30] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11336–11344, 2020.
  • [31] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer, 2020.
  • [32] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
  • [33] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
  • [34] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296–23308, 2021.
  • [35] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
  • [36] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13991–14002, 2019.
  • [37] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
  • [38] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
  • [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [40] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8), 2019.
  • [41] David Rau. Sparsely-gated mixture-of-experts pytorch implementation, 2019.
  • [42] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [43] Peiyang Shi, Michael C. Welle, Mårten Björkman, and Danica Kragic. Towards understanding the modality gap in CLIP. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023.
  • [44] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023.
  • [45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [46] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-ucsd birds-200-2011. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [47] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  • [48] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18123–18133, June 2022.
  • [49] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
  • [50] Yifei Zhou, Juntao Ren, Fengyu Li, Ramin Zabih, and Ser-Nam Lim. Test-time distribution normalization for contrastively learned visual-language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [51] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam M. Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models. 2022.

Appendix A Venn diagrams ZeroShot CLIP

Cars / CUB / DTD [figures]

Figure 2: Venn diagrams of the correct predictions of each backbone on Cars, CUB, and DTD; the layout follows the description in Figure 1.

FGVC Refer to caption Flowers Refer to caption Food Refer to caption

Figure 3: Venn diagram of the correct predictions of each backbone. The top part of the diagram shows how many backbones correctly predict a given set of images. Each column represents a set of image instances that are correctly predicted by a particular group of backbones. Each row shows in colour the backbone that correctly predicts that set of instances, and in grey the backbones that do not. The bottom part shows the number of images in each set, and the right part shows the total number of correctly predicted images per backbone.

Pets Refer to caption

Figure 4: Venn diagram of the correct predictions of each backbone. The top part of the diagram shows how many backbones correctly predict a given set of images. Each column represents a set of image instances that are correctly predicted by a particular group of backbones. Each row shows in colour the backbone that correctly predicts that set of instances, and in grey the backbones that do not. The bottom part shows the number of images in each set, and the right part shows the total number of correctly predicted images per backbone.

Figures 2, 3, and 4 show the Venn diagrams for selected benchmark datasets using the predictions of ZeroShot CLIP. Across these datasets, we observe a consistent divergence in predictions among the different backbone families. Particularly noteworthy are Cars, CUB, DTD, and FGVC, where less than 50% of the correct predictions are shared by all backbones: the agreement is 36.43% for Cars, 31.71% for CUB, 34.07% for DTD, and 10.83% for FGVC.

Furthermore, considering the agreement within each backbone family, Cars shows a 52.48% agreement within the ViT family and 61.15% within the ResNet family; CUB records 48.72% (ViT) and 57.37% (ResNet); DTD shows 49.19% (ViT) and 56.16% (ResNet); and FGVC shows 17.14% (ViT) and 36.31% (ResNet). Overall, agreement within the ResNet family is consistently larger than within the ViT family, suggesting a smaller scope for improvement given its lower diversity of predictions. These insights highlight the distinct predictive characteristics within and between backbone families, and the potential for leveraging this diversity to enhance overall model performance.
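As a concrete reference, the sketch below shows one plausible way to compute such agreement percentages from per-backbone predictions. The `preds` and `labels` variables, the backbone names, and the intersection-over-union reading of "agreement" are illustrative assumptions, not the exact procedure used to build the Venn diagrams.

```python
# Minimal sketch: per-backbone sets of correctly predicted images and their agreement.
# `preds` maps a backbone name to its predicted class indices (one per test image,
# in a fixed order); `labels` holds the ground-truth class indices.
import numpy as np


def correct_sets(preds: dict, labels: np.ndarray) -> dict:
    """Return, for each backbone, the set of image indices it classifies correctly."""
    return {name: set(np.flatnonzero(p == labels)) for name, p in preds.items()}


def agreement(preds: dict, labels: np.ndarray, backbones: list) -> float:
    """Percentage of images correct for at least one listed backbone that all of them get right."""
    sets = correct_sets({b: preds[b] for b in backbones}, labels)
    union = set().union(*sets.values())        # correct for at least one backbone
    inter = set.intersection(*sets.values())   # correct for every backbone
    return 100.0 * len(inter) / max(len(union), 1)


# Example usage (backbone names are assumptions):
# agreement(preds, labels, ["RN50", "RN101"])                                  # within ResNet family
# agreement(preds, labels, ["RN50", "RN101", "ViT-B-32", "ViT-B-16", "ViT-L-14"])  # all backbones
```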

Appendix B All possible combinations per dataset

Tables 6, 7, 8, 9, 10, 11, 12, and 13 present the results for all possible combinations of backbones using the non-parametric and parametric approaches proposed in the paper. Notably, NNC consistently achieves the best performance across backbone combinations and datasets when compared to the other methods.

It is noteworthy that there are cases where combining a specific subset of backbones yields a larger performance boost than using all backbones together. For instance, on the Pets dataset, combining ResNet-50, ResNet-101, and ViT-B-32 results in a delta improvement of 2.37%, surpassing the 0.99% improvement achieved with all backbones. This phenomenon recurs across datasets with different backbone combinations: on Cars, combining ResNet-101 and ViT-B-32 gives a boost of 5.71%, compared to 2.55% when using all five backbones; similarly, on DTD, combining all backbones improves accuracy by 3.62%, whereas combining ResNet-50, ResNet-101, and ViT-B-16 yields a higher improvement of 5.11%. While the best delta improvement does not necessarily come from combining all backbones, the best overall accuracy is consistently obtained when all backbones are combined.
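To make the combination schemes concrete, the following is a minimal sketch of top-1 majority voting, a simple average of per-backbone class probabilities (in the spirit of the averaging-based rules), and the Oracle upper bound. The function names and the `probs`/`labels` variables are illustrative assumptions; the parametric GAC and NNC combiners are not reproduced here.

```python
# `probs` is a list of (num_images, num_classes) softmax outputs, one per backbone;
# `labels` holds the ground-truth class indices for the same images.
import numpy as np


def vote_top1(probs: list) -> np.ndarray:
    """Majority vote over each backbone's top-1 prediction (ties broken by class index)."""
    votes = np.stack([p.argmax(axis=1) for p in probs], axis=1)      # (N, num_backbones)
    n_classes = probs[0].shape[1]
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)


def prob_average(probs: list) -> np.ndarray:
    """Average the per-class probabilities across backbones, then take the arg-max."""
    return np.mean(np.stack(probs, axis=0), axis=0).argmax(axis=1)


def oracle_accuracy(probs: list, labels: np.ndarray) -> float:
    """Count an image as correct if *any* backbone predicts it correctly (upper bound)."""
    correct = np.stack([p.argmax(axis=1) == labels for p in probs], axis=0).any(axis=0)
    return 100.0 * correct.mean()
```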

Pets
Single-backbone accuracy: ResNet-50 85.80, ResNet-101 86.86, ViT-B-32 87.46, ViT-B-16 89.07, ViT-L-14 93.59.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ    -0.05       0.35     -0.77      0.42      0.06      0.40      0.58    0.98     4.31
Max Δ      1.39       1.83      1.34      2.26      1.09      1.99      2.02    2.37     7.22
Min Δ     -2.02      -0.79     -2.86     -0.84     -0.95     -0.79     -0.44    0.11     2.40

Table 6: Our results on the Pets dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

Cars
Single-backbone accuracy: ResNet-50 54.23, ResNet-101 61.12, ViT-B-32 59.73, ViT-B-16 64.61, ViT-L-14 77.75.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ    -0.41       0.63     -2.42      1.04      0.40      0.98      2.25    2.43    11.36
Max Δ      3.51       4.34      2.29      5.77      2.59      4.89      5.73    5.71    19.57
Min Δ     -4.08      -2.05    -10.67     -2.00     -0.97     -1.79      0.65    0.82     6.13

Table 7: Our results on the Cars dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

CUB
Single-backbone accuracy: ResNet-50 46.57, ResNet-101 49.64, ViT-B-32 52.99, ViT-B-16 55.28, ViT-L-14 62.06.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ     1.31       2.42     -3.74      3.94      0.45      3.15      4.24    4.47    13.68
Max Δ      4.16       5.33      0.97      6.52      2.45      6.11      6.68    7.21    20.31
Min Δ     -0.81       0.26     -8.35      1.52     -1.26      0.71      1.62    1.86     8.25

Table 8: Our results on the CUB dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

DTD
Single-backbone accuracy: ResNet-50 41.22, ResNet-101 43.67, ViT-B-32 43.99, ViT-B-16 45.11, ViT-L-14 55.32.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ     0.76       1.78     -7.65      1.61      1.00      1.87      2.36    2.63    11.00
Max Δ      3.56       4.57      3.19      5.00      3.51      4.84      5.27    5.11    18.09
Min Δ     -2.13      -0.16    -13.99     -0.96     -0.43     -0.59      0.11    1.12     6.28

Table 9: Our results on the DTD dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

FGVC
Single-backbone accuracy: ResNet-50 17.07, ResNet-101 18.63, ViT-B-32 19.65, ViT-B-16 24.39, ViT-L-14 31.71.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ    -0.83      -0.16     -1.64      0.33     -0.27      0.74      1.06    2.43    12.01
Max Δ      1.35       1.86      1.05      2.10      1.08      2.37      2.07    4.17    20.37
Min Δ     -3.06      -1.71     -7.20     -1.50     -1.08     -0.36     -0.36    0.99     5.88

Table 10: Our results on the FGVC dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

Food
Single-backbone accuracy: ResNet-50 77.91, ResNet-101 81.86, ViT-B-32 82.58, ViT-B-16 87.91, ViT-L-14 92.32.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ    -0.61      -0.14     -1.85      0.00     -0.17     -0.07      0.87    0.96     4.19
Max Δ      2.39       2.73      1.90      3.11      2.04      2.96      3.24    3.26     8.58
Min Δ     -3.47      -2.12     -7.85     -1.71     -1.09     -2.09      0.17    0.41     2.19

Table 11: Our results on the Food dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

Flowers
Single-backbone accuracy: ResNet-50 66.12, ResNet-101 65.20, ViT-B-32 66.48, ViT-B-16 71.43, ViT-L-14 79.05.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ    -0.81      -0.37     -1.24     -0.75     -0.14     -0.35      0.30    1.57     6.76
Max Δ      2.89       3.50      2.63      3.09      2.63      3.61      3.15    4.44    12.08
Min Δ     -4.36      -3.37     -6.28     -3.87     -1.48     -3.53     -1.22    0.02     3.56

Table 12: Our results on the Flowers dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

ImgNet-1k
Single-backbone accuracy: ResNet-50 59.84, ResNet-101 62.28, ViT-B-32 63.35, ViT-B-16 68.34, ViT-L-14 75.54.

Accuracy difference (Δ) against the best single backbone in each combination, summarized over all backbone combinations:

         Vote T-1   Vote T-3   Conf     Log-Avg   C-Conf   C-Log-Avg   GAC     NNC     Oracle
Mean Δ    -0.54       0.16     -7.09      0.29     -0.40      0.21      1.28    1.68     7.99
Max Δ      2.79       3.40      1.04      3.97      2.16      3.72      4.11    4.51    13.03
Min Δ     -3.38      -1.82    -15.30     -1.86     -1.60     -1.92      0.44    0.64     4.60

Table 13: Our results on the ImgNet-1k dataset for all possible combinations of the zero-shot predictions of CLIP backbones, which we group into non-parametric and parametric techniques, together with the best-performing single backbone (Best) and the Oracle performance. Mean, Max, and Min Δ summarize the difference in accuracy between each combination method and the Best backbone across all backbone combinations.

Appendix C Venn diagrams LinearProbe CLIP

In Figures 5, 6, and 7, we present the Venn diagrams for the LinearProbe versions of CLIP. As with ZeroShot CLIP, the diversity of predictions among backbones is preserved, albeit with a comparatively smaller potential for improvement. Compared to the cases discussed in Appendix A, Cars, CUB, DTD, and FGVC exhibit a higher overall agreement between the different backbones in the LinearProbe setting. Specifically, Cars shows a 61.19% agreement, a notable increase from the 36.43% observed in the ZeroShot case; CUB shows 51.34%, compared to 31.71%; DTD shows 57.19%, a substantial increase from 34.07%; and FGVC shows 24.68%, compared to 10.83%. Despite this increased agreement, there remains a notable scope for improvement, underscoring the continued orthogonality of predictions across different backbones in the LinearProbe versions.

Cars Refer to caption CUB Refer to caption DTD Refer to caption

Figure 5: Venn diagram of the correct predictions of each backbone. The top part of the diagram shows how many backbones correctly predict a given set of images. Each column represents a set of image instances that are correctly predicted by a particular group of backbones. Each row shows in colour the backbone that correctly predicts that set of instances, and in grey the backbones that do not. The bottom part shows the number of images in each set, and the right part shows the total number of correctly predicted images per backbone.

FGVC Refer to caption Flowers Refer to caption Food Refer to caption

Figure 6: Venn diagram of the correct predictions of each backbone. The top part of the diagram shows how many backbones correctly predict a given set of images. Each column represents a set of image instances that are correctly predicted by a particular group of backbones. Each row shows in colour the backbone that correctly predicts that set of instances, and in grey the backbones that do not. The bottom part shows the number of images in each set, and the right part shows the total number of correctly predicted images per backbone.

Pets Refer to caption ImgNet-1k Refer to caption

Figure 7: Venn diagram of the correct predictions of each backbone. The top part of the diagram shows how many backbones correctly predict a given set of images. Each column represents a set of image instances that are correctly predicted by a particular group of backbones. Each row shows in colour the backbone that correctly predicts that set of instances, and in grey the backbones that do not. The bottom part shows the number of images in each set, and the right part shows the total number of correctly predicted images per backbone.

Appendix D Alpha values

In Figure 8, we present box plots of the distribution of alpha values for the NNC method, normalized by their maximum value. Across all datasets, ViT-L-14 clearly dominates the weight distribution, most prominently on the Pets and Cars benchmarks, where ViT-L-14, the largest backbone in the ViT family, carries the most weight. Interestingly, the most heavily weighted backbone within the ResNet family is not consistently ResNet-101, despite its deeper architecture: on datasets such as Pets, CUB, FGVC, and Flowers, the mean alpha of ResNet-50 surpasses that of ResNet-101.

Furthermore, the distribution of alpha weights across backbones is more uniform on ImgNet-1k, FGVC, and DTD than on the other datasets. This suggests that, on these datasets, the NNC method draws on the strengths of every backbone rather than relying predominantly on a single one to arrive at the correct label for each sample.
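A small sketch of how such a normalized box plot could be produced is given below. The `alphas` array (per-sample weights assigned by the NNC to each backbone), the per-sample normalization by the maximum, and the plotting details are assumptions based on the description above, not the exact plotting code.

```python
# Sketch: per-backbone distribution of NNC alpha weights, normalized by their maximum.
import numpy as np
import matplotlib.pyplot as plt


def plot_alpha_boxplot(alphas: np.ndarray, backbone_names: list, title: str) -> None:
    """`alphas` has shape (num_samples, num_backbones); one box per backbone."""
    # Normalize each sample's weights by its maximum so every row lies in [0, 1]
    # (one plausible reading of "normalized by their maximum value").
    normed = alphas / np.clip(alphas.max(axis=1, keepdims=True), 1e-12, None)
    plt.boxplot([normed[:, i] for i in range(normed.shape[1])], labels=backbone_names)
    plt.ylabel("alpha / max alpha")
    plt.title(title)
    plt.show()


# Example usage (backbone names are assumptions):
# plot_alpha_boxplot(alphas, ["RN50", "RN101", "ViT-B-32", "ViT-B-16", "ViT-L-14"], "Pets")
```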

Refer to caption
(a) Pets
Refer to caption
(b) Cars
Refer to caption
(c) CUB
Refer to caption
(d) DTD
Refer to caption
(e) FGVC
Refer to caption
(f) Food
Refer to caption
(g) Flowers
Refer to caption
(h) ImgNet-1k
Figure 8: Alpha values for each dataset using NNC