Task-Specific Preconditioner for Cross-Domain Few-Shot Learning
Abstract
Cross-Domain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.
1 Introduction
Few-Shot Learning (FSL) aims to learn a model that can generalize to novel classes using a few labeled examples. Recent advancements in FSL have been significantly propelled by meta-learning methods (Snell, Swersky, and Zemel 2017; Finn, Abbeel, and Levine 2017; Sung et al. 2018; Oreshkin, Rodríguez López, and Lacoste 2018; Garnelo et al. 2018; Rajeswaran et al. 2019). These approaches have achieved outstanding results in single-domain FSL benchmarks such as Omniglot (Lake et al. 2011) and miniImagenet (Ravi and Larochelle 2016). However, recent studies (Chen et al. 2019; Tian et al. 2020) have revealed that many existing FSL methods struggle to generalize in the cross-domain setting, where the test data originates from domains that are either unknown or previously unseen. To study the challenge of generalization in cross-domain few-shot tasks, Triantafillou et al. (2019) introduced the Meta-Dataset, a more realistic, large-scale, and diverse benchmark. It includes multiple datasets from a variety of domains for both the meta-training and meta-testing phases.
Leveraging the Meta-Dataset, various Cross-Domain Few-Shot Learning (CDFSL) methods have been developed (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Dvornik, Schmid, and Mairal 2020; Liu et al. 2020; Guo et al. 2023; Tian et al. 2024), demonstrating significant advancements in this field. These approaches typically parameterize deep neural networks with a large set of task-agnostic parameters alongside a smaller set of task-specific parameters. Task-specific parameters are optimized to the target task through an adaptation mechanism, generally following one of two primary methodologies. The first approach utilizes an auxiliary network functioning as a parameter generator, which, upon receiving a few labeled examples from the target task, outputs optimized task-specific parameters (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2020, 2021). The second approach directly fine-tunes the task-specific parameters through gradient descent using a few labeled examples from the target task (Dvornik, Schmid, and Mairal 2020; Li, Liu, and Bilen 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2022; Tian et al. 2024).



While both approaches have improved CDFSL performance through their adaptation mechanisms, they share a common limitation: both employ a fixed optimization strategy across different target tasks. However, Figure 1(a) shows that the optimal choice of optimizer may vary significantly depending on the given domain or target task. This implies that performance can be significantly improved by adapting the optimization strategy to the target domain and task. Devising an effective and reliable scheme for doing so, however, has been challenging.
One promising approach for establishing a robust adaptive optimization scheme is to leverage Preconditioned Gradient Descent (PGD) (Himmelblau et al. 2018). PGD operates by specifying a preconditioning matrix, often referred to as a preconditioner, which re-scales the geometry of the parameter space. In the field of machine learning, previous research has shown that if the preconditioner is positive definite (PD), it establishes a valid Riemannian metric, which represents the geometric characteristics (e.g., curvature) of the parameter space and steers preconditioned gradients in the direction of steepest descent (Amari 1967, 1996, 1998; Amari and Douglas 1998). While the effectiveness of positive definiteness in PGD is supported by existing theoretical findings, its efficacy as an adaptive optimization scheme in CDFSL can be examined through a simple comparison. In Figure 1(b), we compare PGD with and without a PD constraint for the preconditioner on the Meta-Dataset. Without a PD constraint, PGD shows markedly inferior performance, especially in unseen domains. Conversely, with a PD constraint, PGD consistently exhibits performance improvements across seen and unseen domains compared to the baseline using GD. This supports the pivotal role of positive definiteness in PGD for CDFSL.
Inspired by these findings, we introduce a novel adaptation mechanism named Task-Specific Preconditioned gradient descent (TSP). In our approach, we establish a Task-Specific Preconditioner that is constrained to be positive definite and adapt it to the specific nature of the target task. This preconditioner consists of two components. The first component is the set of Domain-Specific Preconditioners (DSPs), which are uniquely defined for each meta-training domain and meta-trained on tasks sampled from these domains through bi-level optimization during the meta-training phase. The second component is the set of task-coefficients, which approximate the compatibility between the target task and each meta-training domain. Figure 2 illustrates the construction of the Task-Specific Preconditioner. For a given target task, the Task-Specific Preconditioner is constructed by linearly combining the DSPs from multiple seen domains, with each DSP weighted by its corresponding task-coefficient. This process produces a preconditioner specifically adapted to the geometric characteristics of the target task’s parameter space. By integrating knowledge from multiple seen domains, TSP distinguishes itself from traditional PGD techniques, such as GAP (Kang et al. 2023), which are discussed further in Section 6. Applying our approach to state-of-the-art CDFSL methods, such as TSA or TA2-Net, significantly enhances performance on Meta-Dataset. For example, in the multi-domain setting, applying TSP to TA2-Net (Guo et al. 2023) achieves the best performance across all datasets.
2 Related Works
Meta-Learning for Few-Shot Learning
Until recently, numerous approaches in the field of few-shot learning have adopted the meta-learning framework. These approaches can be mainly divided into three types: metric-based, model-based, and optimization-based methods. Metric-based methods (Garcia and Bruna 2017; Sung et al. 2018; Snell, Swersky, and Zemel 2017; Oreshkin, Rodríguez López, and Lacoste 2018) train a feature encoder to extract features from support and query samples. They employ a nearest neighbor classifier with various distance functions to calculate similarity scores for predicting the labels of query samples. Model-based methods (Santoro et al. 2016; Munkhdalai and Yu 2017; Mishra et al. 2017; Garnelo et al. 2018) train an encoder to generate task-specific models from a few support samples. Optimization-based methods (Ravi and Larochelle 2016; Finn, Abbeel, and Levine 2017; Yoon et al. 2018; Rajeswaran et al. 2019) train a model that can quickly adapt to new tasks with a few support samples, employing a bi-level optimization. In our method, we employ the bi-level optimization used in the optimization-based methods.
Cross-Domain Few-Shot Learning (CDFSL)
Recent CDFSL methods define the universal model as a deep neural network and partition it into task-agnostic and task-specific parameters. The task-agnostic parameters represent generic characteristics that are valid for a range of tasks from various domains. On the other hand, the task-specific parameters represent adaptable attributes that are optimized to the target tasks through an adaptation mechanism. Task-agnostic parameters can be designed as a single network or multiple networks. The single network is trained on a large dataset from a single domain (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021) or multiple domains (Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), whereas the multiple networks are trained individually on each domain (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020). Task-specific parameters can be designed as selection parameters (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020), a pre-classifier transformation (Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), Feature-wise Linear Modulation (FiLM) layers (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021), or Residual Adapters (RA) (Li, Liu, and Bilen 2022; Guo et al. 2023). As the adaptation mechanism for the task-specific parameters, several studies (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2020, 2021) meta-learn an auxiliary network, which generates task-specific parameters adapted to the target task. On the other hand, other studies (Dvornik, Schmid, and Mairal 2020; Li, Liu, and Bilen 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2022) employ gradient descent to adapt task-specific parameters to the target task. In our work, we propose a novel adaptation mechanism in the form of a task-specific optimizer, which adapts task-specific parameters to the target task.
Preconditioned Gradient Descent in Meta-Learning
In meta-learning, several optimization-based approaches (Li et al. 2017; Lee and Choi 2018; Park and Oliva 2019; Rajasegaran et al. 2020; Simon et al. 2020; Zhao et al. 2020; Von Oswald et al. 2021; Kang et al. 2023) have incorporated Preconditioned Gradient Descent (PGD) to adapt a network’s parameters to the target task (i.e., inner-level optimization). They meta-learn a preconditioning matrix, called a preconditioner, which is used to precondition the gradient. The preconditioner was kept static in most of the previous works (Li et al. 2017; Lee and Choi 2018; Park and Oliva 2019; Zhao et al. 2020; Von Oswald et al. 2021). Several prior studies have devised preconditioners that adapt per inner step (Rajasegaran et al. 2020), per task (Simon et al. 2020), or both simultaneously (Kang et al. 2023). Motivated by previous works (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998), Kang et al. (2023) recently investigated constraining the preconditioner to satisfy the condition for a Riemannian metric (i.e., positive definiteness). They demonstrated that enforcing this constraint on the preconditioner was essential for improving the performance in few-shot learning. In our study, we propose a novel preconditioned gradient descent method with a meta-learned task-specific preconditioner that guarantees positive definiteness for improving performance in CDFSL.
3 Backgrounds
Task Formulation for Meta-Learning in CDFSL
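In CDFSL, a task is formulated differently from traditional few-shot learning. In traditional few-shot learning, tasks are sampled from a single domain, so they take the same form in both meta-training and meta-testing:
$$\mathcal{T} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}\} \tag{1}$$
where $\mathcal{S}_{\mathcal{T}}$ is a support set and $\mathcal{Q}_{\mathcal{T}}$ is a query set. On the other hand, in CDFSL, tasks are sampled from multiple domains, leading to different forms in meta-training and meta-testing:
$$\mathcal{T}_{\text{train}} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}, d_{\mathcal{T}}\}, \qquad \mathcal{T}_{\text{test}} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}\} \tag{2}$$
where $d_{\mathcal{T}}$ is a domain label indicating the domain from which the task was sampled. For instance, the domain label is an integer between $1$ and $K$ for $K$ domains (i.e., $d_{\mathcal{T}} \in \{1, \dots, K\}$).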
Bi-level Optimization in Meta-Learning
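Bi-level optimization (Rajeswaran et al. 2019) consists of two nested optimization processes: inner-level and outer-level optimization. Let $f_{\theta(\phi)}$ be a model, where the parameter $\theta(\phi)$ is parameterized by the meta-parameter $\phi$. For a task $\mathcal{T} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}\}$, the inner-level optimization is defined as:
$$\theta_{\mathcal{T},T}(\phi) = \theta_{\mathcal{T},0}(\phi) - \alpha_{\text{in}} \sum_{t=0}^{T-1} \nabla_{\theta} \mathcal{L}_{\text{in}}\big(\theta_{\mathcal{T},t}(\phi); \mathcal{S}_{\mathcal{T}}\big) \tag{3}$$
where $\theta_{\mathcal{T},0}(\phi) = \theta(\phi)$, $\alpha_{\text{in}}$ is the learning rate for the inner-level optimization, $\mathcal{L}_{\text{in}}$ is the inner-level’s loss function, and $T$ is the total number of gradient descent steps. With $\mathcal{Q}_{\mathcal{T}}$ in each task, we can define outer-level optimization as:
$$\phi \leftarrow \phi - \alpha_{\text{out}} \nabla_{\phi} \sum_{\mathcal{T}} \mathcal{L}_{\text{out}}\big(\theta_{\mathcal{T},T}(\phi); \mathcal{Q}_{\mathcal{T}}\big) \tag{4}$$
where $\alpha_{\text{out}}$ is the learning rate for the outer-level optimization, and $\mathcal{L}_{\text{out}}$ is the outer-level’s loss function.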
Preconditioned Gradient Descent (PGD)
PGD is a technique that minimizes empirical risk by using a gradient update with a preconditioner that re-scales the geometry of the parameter space. Given model parameters $\theta$ and task $\mathcal{T} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}\}$, we can formally define the preconditioned gradient descent with a preconditioner $\mathbf{P}$ as follows:
$$\theta_{\mathcal{T},t} = \theta_{\mathcal{T},t-1} - \alpha\, \mathbf{P}\, \nabla_{\theta} \mathcal{L}\big(\theta_{\mathcal{T},t-1}; \mathcal{S}_{\mathcal{T}}\big), \quad t = 1, \dots, T \tag{5}$$
where $\theta_{\mathcal{T},0} = \theta$, $\alpha$ is the learning rate, $\mathcal{L}$ is the empirical loss associated with the task $\mathcal{T}$, and $\theta_{\mathcal{T},t}$ denotes the parameters after $t$ steps. When the preconditioner $\mathbf{P}$ is chosen to be the identity matrix $\mathbf{I}$, Eq. (5) becomes the standard Gradient Descent (GD). Choices of $\mathbf{P}$ that leverage second-order information include the inverse Fisher information matrix $\mathbf{F}^{-1}$, leading to Natural Gradient Descent (NGD) (Amari 1998); the inverse Hessian matrix $\mathbf{H}^{-1}$, corresponding to Newton’s method (LeCun et al. 2002); and a diagonal matrix estimated from past gradients, which results in adaptive gradient methods (Duchi, Hazan, and Singer 2011; Kingma and Ba 2014). These preconditioners often reduce the effect of pathological curvature and speed up the optimization (Amari et al. 2020).
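As a minimal illustration of the preconditioned update described above, the following NumPy sketch runs the update on a toy quadratic loss. The function name `pgd`, the matrix `A`, and the step sizes are illustrative assumptions and do not come from the paper; the sketch only contrasts the identity preconditioner (plain GD) with the inverse Hessian (Newton's method).

```python
import numpy as np

def pgd(grad_fn, theta0, precond, lr, steps):
    """Preconditioned gradient descent: theta <- theta - lr * P @ grad(theta)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        theta = theta - lr * precond @ grad_fn(theta)
    return theta

# Toy quadratic loss 0.5 * theta^T A theta with an ill-conditioned Hessian A (assumed).
A = np.diag([100.0, 1.0])
grad_fn = lambda theta: A @ theta
theta0 = np.array([1.0, 1.0])

# Identity preconditioner: plain GD, whose step size is capped by the largest curvature.
theta_gd = pgd(grad_fn, theta0, np.eye(2), lr=0.01, steps=50)

# Inverse-Hessian preconditioner: Newton's method, exact after one step on a quadratic.
theta_newton = pgd(grad_fn, theta0, np.linalg.inv(A), lr=1.0, steps=1)

print("GD after 50 steps:   ", theta_gd)
print("Newton after 1 step: ", theta_newton)
```

In this toy problem plain GD is still far from the minimizer along the low-curvature coordinate after 50 steps, while the inverse-Hessian preconditioner rescales both coordinates and reaches the minimizer in one step.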
Dataset Classifier
In CDFSL, the Dataset Classifier (Triantafillou et al. 2021) reads the support set of a few-shot task and predicts from which of the training datasets it was sampled. Formally, let $\mathcal{T}_{\text{train}} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}, d_{\mathcal{T}}\}$ be a train task sampled from $K$ domains. Let $g$ be a dataset classifier that takes the support set $\mathcal{S}_{\mathcal{T}}$ as input and generates logits as follows:
$$z_{\mathcal{T}} = g(\mathcal{S}_{\mathcal{T}}) \in \mathbb{R}^{K} \tag{6}$$
In Triantafillou et al. (2021), the dataset classifier is trained to minimize the cross-entropy loss for the dataset classification problem (i.e., a classification problem with $K$ classes).
4 Method


In this section, we propose a novel adaptation mechanism named Task-Specific Preconditioned gradient descent (TSP). We first introduce Domain-Specific Preconditioner (DSP) and task-coefficients. Then, we describe the construction of Task-Specific Preconditioner using DSP and task-coefficients. Lastly, we show the positive definiteness of Task-Specific Preconditioner, which establishes it as a valid Riemannian metric. The algorithm for the training and testing procedures is provided in Appendix B.
4.1 Domain-Specific Preconditioner (DSP)
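Consider task-specific parameters $\theta = \{\theta^{l}\}_{l=1}^{L}$. For $K$ domains, we first define meta-parameters $\mathbf{M}_{1}, \dots, \mathbf{M}_{K}$ as follows:
$$\mathbf{M}_{k} = \{\mathbf{M}_{k}^{l}\}_{l=1}^{L}, \quad k = 1, \dots, K \tag{7}$$
Then, for all $k$ and $l$, we define Domain-Specific Preconditioners (DSPs) $\mathbf{P}_{k}^{l}$ using the meta-parameters as follows:
$$\mathbf{P}_{k}^{l} = (\mathbf{M}_{k}^{l})^{\top} \mathbf{M}_{k}^{l} + \mathbf{I} \tag{8}$$
We compare various DSP designs (see Table 3) in Section 5.3 and choose the Gram-matrix-plus-identity form of Eq. (8). Through bi-level optimization, DSPs can be meta-learned as follows.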
Inner-level Optimization
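For each train task $\mathcal{T}_{\text{train}} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}, d_{\mathcal{T}}\}$, in the inner-level optimization, we optimize the task-specific parameters $\theta$ through preconditioned gradient descent using the DSP of the task's domain, $\mathbf{P}_{d_{\mathcal{T}}}^{l}$, updating as follows:
$$\theta^{l}_{\mathcal{T},t} = \theta^{l}_{\mathcal{T},t-1} - \alpha_{\text{in}}\, \mathbf{P}_{d_{\mathcal{T}}}^{l}\, \nabla_{\theta^{l}} \mathcal{L}_{\text{in}}\big(\theta_{\mathcal{T},t-1}; \mathcal{S}_{\mathcal{T}}\big), \quad t = 1, \dots, T \tag{9}$$
where $\theta_{\mathcal{T},0} = \theta$, $\alpha_{\text{in}}$ is the learning rate for the inner-level optimization, $T$ is the total number of gradient descent steps, and $\mathcal{L}_{\text{in}}$ is the inner-level’s loss function.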
Outer-level Optimization
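In the outer-level optimization, we meta-learn the meta-parameters $\mathbf{M}_{d_{\mathcal{T}}}$ of the task's domain as follows:
$$\mathbf{M}_{d_{\mathcal{T}}} \leftarrow \mathbf{M}_{d_{\mathcal{T}}} - \alpha_{\text{out}} \nabla_{\mathbf{M}_{d_{\mathcal{T}}}} \mathcal{L}_{\text{out}}\big(\theta_{\mathcal{T},T}; \mathcal{Q}_{\mathcal{T}}\big) \tag{10}$$
where $\alpha_{\text{out}}$ is the learning rate for outer-level optimization and $\mathcal{L}_{\text{out}}$ is the outer-level’s loss function.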
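The bi-level procedure above can be sketched on a toy problem. The example below is an illustration under strong simplifying assumptions that are not the paper's setup: a single task, a single inner step, quadratic support and query losses defined by the made-up matrices `A_s` and `A_q`, and a hand-derived outer gradient in place of automatic differentiation. It meta-learns a matrix `M` that defines a Gram-plus-identity preconditioner so that the query loss after the inner step decreases.

```python
import numpy as np

def inner_step(theta0, M, A_s, lr_in):
    """One preconditioned step on the support loss 0.5 * theta^T A_s theta."""
    P = M.T @ M + np.eye(theta0.size)      # Gram-plus-identity preconditioner
    g = A_s @ theta0                        # support-loss gradient at theta0
    return theta0 - lr_in * P @ g, g

def outer_loss(theta0, M, A_s, A_q, lr_in):
    """Query loss 0.5 * theta^T A_q theta evaluated after the inner step."""
    theta1, _ = inner_step(theta0, M, A_s, lr_in)
    return 0.5 * theta1 @ A_q @ theta1

def outer_grad(theta0, M, A_s, A_q, lr_in):
    """Gradient of the query loss w.r.t. M, differentiating through the inner step."""
    theta1, g = inner_step(theta0, M, A_s, lr_in)
    q = A_q @ theta1                        # query-loss gradient at theta1 (A_q symmetric)
    G = -lr_in * np.outer(q, g)             # d(loss)/dP
    return M @ (G + G.T)                    # chain rule through P = M^T M + I

theta0 = np.array([1.0, 1.0])
A_s = np.diag([10.0, 1.0])                  # toy support-set curvature (assumed)
A_q = np.diag([8.0, 1.5])                   # toy query-set curvature (assumed)
lr_in, lr_out = 0.05, 0.01

M = 0.1 * np.eye(2)                         # small non-zero init; M = 0 has zero gradient
loss_before = outer_loss(theta0, M, A_s, A_q, lr_in)
for _ in range(100):                        # outer-level updates of the meta-parameter
    M = M - lr_out * outer_grad(theta0, M, A_s, A_q, lr_in)
loss_after = outer_loss(theta0, M, A_s, A_q, lr_in)

print(f"query loss before meta-training: {loss_before:.4f}")
print(f"query loss after meta-training:  {loss_after:.4f}")
```

In practice the outer gradient would come from an automatic differentiation framework and would be averaged over many sampled tasks; the closed-form gradient here only keeps the sketch dependency-free.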
4.2 Task-coefficients
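Consider the dataset classifier $g$. Given a train task $\mathcal{T}_{\text{train}} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}, d_{\mathcal{T}}\}$, we define task-coefficients $p_{\mathcal{T}}$ as follows:
$$p_{\mathcal{T}} = \mathrm{Softmax}\big(g(\mathcal{S}_{\mathcal{T}})\big) \tag{11}$$
where $p_{\mathcal{T}} = (p_{\mathcal{T},1}, \dots, p_{\mathcal{T},K})$. Note that we use the sigmoid function instead of softmax in the single-domain setting because the output dimension of the dataset classifier is one. While Triantafillou et al. (2021) update the parameters of $g$ to minimize only the cross-entropy loss $\mathcal{L}_{\text{CE}}$ with respect to the dataset label $d_{\mathcal{T}}$, we train the dataset classifier to minimize the following augmented loss:
$$\mathcal{L}_{\text{CE}}\big(g(\mathcal{S}_{\mathcal{T}}), d_{\mathcal{T}}\big) + \lambda\, \mathcal{L}_{\text{Aux}} \tag{12}$$
where $\lambda$ is a regularization parameter and $\mathcal{L}_{\text{Aux}}$ is the auxiliary loss, defined as follows:
$$\mathcal{L}_{\text{Aux}} = \mathcal{L}\big(\theta_{\mathcal{T},T}; \mathcal{Q}_{\mathcal{T}}\big) \tag{13}$$
Here, task-specific parameters $\theta_{\mathcal{T},T}$ can be obtained as follows:
$$\theta^{l}_{\mathcal{T},t} = \theta^{l}_{\mathcal{T},t-1} - \alpha_{\text{in}} \Big( \sum_{k=1}^{K} p_{\mathcal{T},k}\, \mathbf{P}_{k}^{l} \Big) \nabla_{\theta^{l}} \mathcal{L}\big(\theta_{\mathcal{T},t-1}; \mathcal{S}_{\mathcal{T}}\big), \quad t = 1, \dots, T \tag{14}$$
where $\mathbf{P}_{k}^{l}$ is the $l$-th DSP of domain $k$ and $\alpha_{\text{in}}$ is the inner-level learning rate. In Eq. (12), the cross-entropy loss guides the dataset classifier to prioritize the ground-truth domain of the support set. Concurrently, the auxiliary loss guides the task-coefficients toward DSPs that minimize any adverse effects on the performance of the query set during the inner-level optimization.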
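To make the role of the task-coefficients and the augmented loss concrete, here is a small NumPy sketch. The function names, the example logits, and the auxiliary-loss value are hypothetical; the sketch only mirrors what the text states, namely softmax in the multi-domain case, sigmoid when the classifier has a single output, and a cross-entropy term plus a weighted auxiliary term. Domain labels are 0-indexed in the code, and the loss helper is written for the multi-domain case only.

```python
import numpy as np

def task_coefficients(logits):
    """Map dataset-classifier logits to task-coefficients."""
    logits = np.atleast_1d(np.asarray(logits, dtype=float))
    if logits.size == 1:                       # single-domain setting: sigmoid
        return 1.0 / (1.0 + np.exp(-logits))
    exp = np.exp(logits - logits.max())        # multi-domain setting: stable softmax
    return exp / exp.sum()

def augmented_loss(logits, domain_label, aux_loss, lam=0.1):
    """Cross-entropy on the domain label plus a weighted auxiliary (query) loss."""
    probs = task_coefficients(logits)
    cross_entropy = -np.log(probs[domain_label])
    return cross_entropy + lam * aux_loss

logits = np.array([2.0, 0.5, -1.0])            # hypothetical logits for K = 3 domains
coeffs = task_coefficients(logits)
loss = augmented_loss(logits, domain_label=0, aux_loss=0.8)

print("task-coefficients:", coeffs)
print("augmented loss:   ", loss)
```

The default `lam=0.1` follows the weighting factor reported in the experimental setup; the auxiliary loss is passed in as a number here, whereas in the method it would be the query loss obtained after the preconditioned inner-level updates.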
Test Dataset | SUR | URT | FLUTE | tri-M | URL | TSA | TA2-Net | MOKD | TSP† | TSP†† |
ImageNet | 56.2±1.0 | 56.8±1.1 | 58.6±1.0 | 51.8±1.1 | 58.8±1.1 | 59.5±1.0 | 59.6±1.0 | 57.3±1.1 | 60.5±1.0 | 60.7±1.0 |
Omniglot | 94.1±0.4 | 94.2±0.4 | 92.0±0.6 | 93.2±0.5 | 94.5±0.4 | 94.9±0.4 | 95.5±0.4 | 94.2±0.5 | 95.6±0.4 | 96.0±0.4 |
Aircraft | 85.5±0.5 | 85.8±0.5 | 82.8±0.7 | 87.2±0.5 | 89.4±0.4 | 89.9±0.4 | 90.5±0.4 | 88.4±0.5 | 90.5±0.4 | 91.2±0.4 |
Birds | 71.0±1.0 | 76.2±0.8 | 75.3±0.8 | 79.2±0.8 | 80.7±0.8 | 81.1±0.8 | 81.4±0.8 | 80.4±0.8 | 82.3±0.7 | 82.5±0.7 |
Textures | 71.0±0.8 | 71.6±0.7 | 71.2±0.8 | 68.8±0.8 | 77.2±0.7 | 77.5±0.7 | 77.4±0.7 | 76.5±0.7 | 78.6±0.6 | 79.1±0.6 |
Quick Draw | 81.8±0.6 | 82.4±0.6 | 77.3±0.7 | 79.5±0.7 | 82.5±0.6 | 81.7±0.6 | 82.5±0.6 | 82.2±0.6 | 83.0±0.7 | 83.2±0.6 |
Fungi | 64.3±0.9 | 64.0±1.0 | 48.5±1.0 | 58.1±1.1 | 68.1±0.9 | 66.3±0.8 | 66.3±0.9 | 68.6±1.0 | 68.6±0.9 | 69.7±0.8 |
VGG Flower | 82.9±0.8 | 87.9±0.6 | 90.5±0.5 | 91.6±0.6 | 92.0±0.5 | 92.2±0.5 | 92.6±0.4 | 92.5±0.5 | 93.3±0.4 | 93.4±0.4 |
Traffic Sign | 51.0±1.1 | 48.2±1.1 | 63.0±1.0 | 58.4±1.1 | 63.3±1.1 | 82.8±1.0 | 87.4±0.8 | 64.5±1.1 | 88.5±0.7 | 89.4±0.8 |
MSCOCO | 52.0±1.1 | 51.5±1.1 | 52.8±1.1 | 50.0±1.0 | 57.3±1.0 | 57.6±1.0 | 57.9±0.9 | 55.5±1.0 | 58.5±0.9 | 59.8±0.9 |
MNIST | 94.3±0.4 | 90.6±0.5 | 96.2±0.3 | 95.6±0.5 | 94.7±0.4 | 96.7±0.4 | 97.0±0.4 | 95.1±0.4 | 97.1±0.3 | 97.1±0.4 |
CIFAR-10 | 66.5±0.9 | 67.0±0.8 | 75.4±0.8 | 78.6±0.7 | 74.2±0.8 | 82.9±0.7 | 82.1±0.8 | 72.8±0.8 | 83.5±0.7 | 83.7±0.8 |
CIFAR-100 | 56.9±1.1 | 57.3±1.0 | 62.0±1.0 | 67.1±1.0 | 63.5±1.0 | 70.4±0.9 | 70.9±0.9 | 63.9±1.0 | 71.3±1.0 | 72.2±0.9 |
Avg Seen | 75.9 | 77.4 | 74.5 | 76.2 | 80.4 | 80.4 | 80.7 | 80.0 | 81.6 | 82.0 |
Avg Unseen | 64.1 | 62.9 | 69.9 | 69.9 | 70.6 | 78.1 | 79.1 | 70.3 | 79.8 | 80.4 |
Avg All | 71.3 | 71.8 | 72.7 | 73.8 | 76.6 | 79.5 | 80.1 | 76.3 | 80.9 | 81.4 |
Avg Rank | 8.8 | 8.2 | 8.0 | 7.8 | 5.5 | 4.3 | 3.2 | 5.8 | 1.9 | 1.0 |
4.3 Task-Specific Preconditioner
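Given a test task $\mathcal{T}_{\text{test}} = \{\mathcal{S}_{\mathcal{T}}, \mathcal{Q}_{\mathcal{T}}\}$, we define the Task-Specific Preconditioner $\mathbf{P}_{\mathcal{T}}^{l}$ as follows:
$$\mathbf{P}_{\mathcal{T}}^{l} = \sum_{k=1}^{K} p_{\mathcal{T},k}\, \mathbf{P}_{k}^{l} \tag{15}$$
where $\mathbf{P}_{k}^{l}$ is the $l$-th DSP of domain $k$, and $p_{\mathcal{T},k}$ is the task-coefficient for the given task $\mathcal{T}$ and domain $k$. By employing $\mathbf{P}_{\mathcal{T}}^{l}$ as the preconditioning matrix, we can define Task-Specific Preconditioned gradient descent (TSP) as follows:
$$\theta^{l}_{\mathcal{T},t} = \theta^{l}_{\mathcal{T},t-1} - \beta\, \mathbf{P}_{\mathcal{T}}^{l}\, \nabla_{\theta^{l}} \mathcal{L}\big(\theta_{\mathcal{T},t-1}; \mathcal{S}_{\mathcal{T}}\big), \quad t = 1, \dots, T \tag{16}$$
where $\beta$ is the learning rate used to adapt the task-specific parameters.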
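The construction described in this subsection, a coefficient-weighted sum of domain-specific preconditioners followed by a preconditioned gradient step, can be sketched as follows. The array shapes, the random stand-in matrices, and the coefficient values are assumptions for illustration only; in the method the DSPs are meta-learned and the coefficients come from the dataset classifier.

```python
import numpy as np

def task_specific_preconditioner(dsps, coeffs):
    """Weighted sum of domain-specific preconditioners: sum_k coeffs[k] * dsps[k]."""
    return np.tensordot(coeffs, dsps, axes=1)

def tsp_step(theta, grad, dsps, coeffs, lr):
    """One preconditioned gradient step using the task-specific preconditioner."""
    return theta - lr * task_specific_preconditioner(dsps, coeffs) @ grad

rng = np.random.default_rng(0)
K, m = 3, 4                                             # 3 seen domains, 4-dim parameter block
Ms = rng.normal(size=(K, m, m))                         # stand-ins for meta-learned matrices
dsps = np.stack([M.T @ M + np.eye(m) for M in Ms])      # Gram-plus-identity DSPs
coeffs = np.array([0.7, 0.2, 0.1])                      # hypothetical task-coefficients

P_task = task_specific_preconditioner(dsps, coeffs)
theta = rng.normal(size=m)
grad = rng.normal(size=m)                               # stand-in support-set gradient
theta_next = tsp_step(theta, grad, dsps, coeffs, lr=0.1)

print("smallest eigenvalue of the task-specific preconditioner:",
      np.linalg.eigvalsh(P_task).min())
```

With a one-hot coefficient vector the combination reduces to a single DSP, so the seen-domain case is recovered as a special case of the same code path.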
4.4 Positive Definiteness of TSP’s Preconditioner
A preconditioner satisfying positive definiteness ensures a valid Riemannian metric, which represents the geometric characteristics of the parameter space (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998). Task-Specific Preconditioner is designed to be a positive definite matrix, which is verified in Theorem 1.
Theorem 1.
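Let $p_{k} \in [0, 1]$, $k = 1, \dots, K$, be the task-coefficients satisfying $\sum_{k=1}^{K} p_{k} = 1$. For the Domain-Specific Preconditioners $\mathbf{P}_{k}$, $k = 1, \dots, K$, the Task-Specific Preconditioner defined as $\mathbf{P} = \sum_{k=1}^{K} p_{k} \mathbf{P}_{k}$ is positive definite.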
The proof is provided in Appendix C. Drawing from prior research (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998), a preconditioner satisfying positive definiteness promotes gradients to point toward the steepest descent direction while avoiding undesirable paths in the parameter space. As shown in Figure 1(b), positive definiteness improves CDFSL performance, especially in unseen domains. In Section 6, we will discuss why this property helps in CDFSL.
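The intuition behind Theorem 1 can be sketched in two lines. This is a standard convex-combination argument written under assumed notation ($p_k$ for the task-coefficients, $\mathbf{P}_k$ for the DSPs, $\mathbf{M}_k$ for the matrices of the Gram design); it is not a reproduction of the proof in Appendix C.

```latex
% Sketch, assuming each P_k is symmetric positive definite,
% p_k >= 0, and \sum_k p_k = 1 (so at least one p_j > 0).
\begin{align*}
x^{\top}\mathbf{P}x
  &= \sum_{k=1}^{K} p_k\, x^{\top}\mathbf{P}_k x
  \;\ge\; p_j\, x^{\top}\mathbf{P}_j x \;>\; 0
  && \text{for any } x \neq 0 \text{ and any } j \text{ with } p_j > 0, \\
x^{\top}\mathbf{P}x
  &= \sum_{k=1}^{K} p_k \left( \lVert \mathbf{M}_k x \rVert^{2} + \lVert x \rVert^{2} \right)
  \;\ge\; \lVert x \rVert^{2}
  && \text{when } \mathbf{P}_k = \mathbf{M}_k^{\top}\mathbf{M}_k + \mathbf{I}.
\end{align*}
```

The second line shows that, for the Gram-plus-identity design, the smallest eigenvalue of the combined preconditioner is at least one, whatever the task-coefficients are.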
5 Experiments
5.1 Experimental Setup
Implementation Details
In the experiments, we use Meta-Dataset (Triantafillou et al. 2019), which is the standard benchmark for evaluating the performance of CDFSL. To demonstrate the effectiveness of TSP as an adaptation mechanism, we apply it to the state-of-the-art CDFSL methods TSA (Li, Liu, and Bilen 2022) and TA2-Net (Guo et al. 2023), both of which are publicly available as open source. Following previous studies (Bateni et al. 2022; Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), we adopt ResNet-18 as the backbone for the feature extractor. In all experiments, we follow the standard protocol described in Triantafillou et al. (2019). For the dataset classifier loss, the weighting factor $\lambda$ is set to 0.1, as it performs best compared to other values, as shown in Appendix D.1. Details of the Meta-Dataset, hyper-parameters, and additional implementation are available in Appendix E.
Baselines
For the baselines, we compare our methods to the state-of-the-art CDFSL methods, including BOHB (Saikia, Brox, and Schmid 2020), SUR (Dvornik, Schmid, and Mairal 2020), URT (Liu et al. 2020), Simple-CNAPS (Bateni et al. 2020), FLUTE (Triantafillou et al. 2021), tri-M (Liu et al. 2021), URL (Li, Liu, and Bilen 2021), TSA (Li, Liu, and Bilen 2022), TA2-Net (Guo et al. 2023), ALFA (Baik et al. 2023)+Proto-MAML, GAP+Proto-MAML (Kang et al. 2023), and MOKD (Tian et al. 2024).
Test Dataset | ALFA+Proto-MAML | BOHB | GAP+Proto-MAML | FLUTE | TSA | TA2-Net | MOKD | TSP† | TSP†† |
ImageNet | 52.8±1.1 | 51.9±1.1 | 56.7 | 46.9±1.1 | 59.5±1.1 | 59.3±1.1 | 57.3±1.1 | 60.1±1.1 | 60.6±1.1 |
Omniglot | 61.9±1.5 | 67.6±1.2 | 77.6 | 61.6±1.4 | 78.2±1.2 | 81.1±1.1 | 70.9±1.3 | 83.3±1.1 | 85.2±1.1 |
Aircraft | 63.4±1.1 | 54.1±0.9 | 68.5 | 48.5±1.0 | 72.2±1.0 | 72.6±0.9 | 59.8±1.0 | 73.2±1.0 | 73.5±1.1 |
Birds | 69.8±1.1 | 70.7±0.9 | 73.5 | 47.9±1.0 | 74.9±0.9 | 75.1±0.9 | 73.6±0.9 | 76.0±0.9 | 76.6±0.9 |
Textures | 70.8±0.9 | 68.3±0.8 | 71.4 | 63.8±0.8 | 77.3±0.7 | 76.8±0.8 | 76.1±0.7 | 78.2±0.7 | 78.3±0.7 |
Quick Draw | 59.2±1.2 | 50.3±1.0 | 65.4 | 57.5±1.0 | 67.6±0.9 | 68.4±0.9 | 61.2±1.0 | 70.8±0.9 | 71.5±0.9 |
Fungi | 41.5±1.2 | 41.4±1.1 | 38.6 | 31.8±1.0 | 44.7±1.0 | 45.3±1.0 | 47.0±1.1 | 46.6±1.0 | 47.0±1.0 |
VGG Flower | 86.0±0.8 | 87.3±0.6 | 86.8 | 80.1±0.9 | 90.9±0.6 | 91.0±0.6 | 88.5±0.6 | 91.8±0.5 | 92.2±0.6 |
Traffic Sign | 60.8±1.3 | 51.8±1.0 | 66.9 | 46.5±1.1 | 82.5±0.8 | 84.1±0.7 | 61.6±1.1 | 87.5±0.8 | 88.7±0.8 |
MSCOCO | 48.1±1.1 | 48.0±1.0 | 46.8 | 41.4±1.0 | 59.0±1.0 | 58.0±1.0 | 55.3±1.0 | 59.4±1.0 | 58.6±1.0 |
MNIST | - | - | 94.0 | 80.8±0.8 | 93.9±0.6 | 94.9±0.5 | 88.3±0.7 | 94.5±0.5 | 95.3±0.6 |
CIFAR-10 | - | - | 74.5 | 65.4±0.8 | 82.1±0.7 | 82.0±0.7 | 72.2±0.8 | 83.1±0.5 | 83.2±0.7 |
CIFAR-100 | - | - | 63.2 | 52.7±1.1 | 70.7±0.9 | 70.8±0.9 | 63.1±1.0 | 71.2±0.9 | 72.8±0.9 |
Avg Seen | 52.8 | 51.9 | 56.7 | 46.9 | 59.5 | 59.3 | 57.3 | 60.1 | 60.6 |
Avg Unseen | 62.4 | 59.9 | 68.9 | 56.5 | 74.5 | 75.0 | 68.1 | 76.3 | 76.9 |
Avg All | 61.4 | 59.1 | 68.0 | 55.8 | 73.3 | 73.8 | 67.3 | 75.0 | 75.7 |
Avg Rank | 7.0 | 7.5 | 6.1 | 8.9 | 3.7 | 3.4 | 5.1 | 2.0 | 1.2 |
5.2 Performance Comparison to State-of-The-Art Methods
Following the experimental setup in Li, Liu, and Bilen (2022), we first evaluate our method using multi-domain and single-domain feature extractors in the Varying-Way Varying-Shot setting (i.e., the Multi-domain and Single-domain settings). Then, we assess our approach with the multi-domain feature extractor in the more challenging Varying-Way Five-Shot and Five-Way One-Shot settings. We provide the performance comparison results for the Varying-Way Five-Shot and Five-Way One-Shot settings in Appendix F.
Multi-Domain Setting
In Table 1, we evaluate TSP by applying it to TSA and TA2-Net, both of which employ URL (Li, Liu, and Bilen 2021) as the multi-domain feature extractor. We report average accuracies over seen, unseen, and all domains, along with average rank, following previous works (Liu et al. 2021; Li, Liu, and Bilen 2022; Guo et al. 2023). TSP† denotes TSP applied to TSA, while TSP†† denotes TSP applied to TA2-Net. TSP† outperforms the previous state-of-the-art methods on 11 out of 13 datasets, and TSP†† achieves the best results on all datasets. For example, TSP†† outperforms the state-of-the-art method (TA2-Net) by 1.7%, 3.4%, 2.0%, and 1.9% on Textures, Fungi, Traffic Sign, and MSCOCO, respectively. These results imply that TSP can construct a desirable task-specific optimizer that effectively adapts the task-specific parameters for a given target task.
Single-Domain Setting
We evaluate TSP by applying it to TSA and TA2-Net, both of which employ the single-domain feature extractor pretrained solely on the ImageNet dataset. In Table 2, TSP†† achieves the best results on 12 out of 13 datasets, while TSP† leads on the remaining one. Compared to recently proposed meta-learning methods based on PGD, such as Approximate GAP+Proto-MAML and GAP+Proto-MAML (Kang et al. 2023), both TSP† and TSP†† consistently outperform them across all 13 datasets by a significant margin. Furthermore, TSP†† outperforms the previous best methods by a clear margin on several datasets such as Quick Draw (+3.1%), Omniglot (+4.1%), and Traffic Sign (+4.6%). Despite being trained only on a single dataset, TSP improves performance by effectively constructing a task-specific optimizer tailored to the target task.
5.3 Ablation Studies
In this section, all ablation studies are performed using TSP† in the multi-domain setting to isolate the effects originating from the RL model in TSP††. Additional ablation studies are provided in Appendix D.
Matrix Design for DSP
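DSP designs | $\mathbf{L}\mathbf{L}^{\top}$ | $\mathbf{L}\mathbf{L}^{\top} + \mathbf{I}$ | $\mathbf{M}^{\top}\mathbf{M} + \mathbf{I}$ |
---|---|---|---|
Avg Seen | 80.8 | 81.2 | 81.6 |
Avg Unseen | 79.0 | 79.4 | 79.8 |
Avg All | 80.1 | 80.5 | 80.9 |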
To design the Domain-Specific Preconditioner (DSP), we consider three matrix designs that guarantee positive definiteness. The first is the product of a real-valued lower triangular matrix and its transpose (i.e., $\mathbf{L}\mathbf{L}^{\top}$), where the lower triangular matrix $\mathbf{L}$ is constrained to have positive diagonal entries. This form is commonly known as the Cholesky factorization (Horn and Johnson 2012). The second is the sum of $\mathbf{L}\mathbf{L}^{\top}$ and the identity matrix (i.e., $\mathbf{L}\mathbf{L}^{\top} + \mathbf{I}$). The last is the sum of the Gram matrix (Horn and Johnson 2012) and the identity matrix (i.e., $\mathbf{M}^{\top}\mathbf{M} + \mathbf{I}$). In Table 3, we compare three TSPs with these three DSP designs. Among them, the Gram matrix design achieves the highest average accuracies in both seen and unseen domains. Therefore, we choose the Gram matrix design for DSP.
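The three positive definite designs discussed above can be compared numerically. In the sketch below, the way the diagonal of the triangular factor is kept positive (an elementwise exponential) is an assumption made for illustration, since the text only states that the diagonal is constrained to be positive; the helper names and the random input are likewise hypothetical.

```python
import numpy as np

def lower_triangular(raw):
    """Lower-triangular factor with a strictly positive diagonal (exp is an assumed choice)."""
    return np.tril(raw, k=-1) + np.diag(np.exp(np.diag(raw)))

def cholesky_design(raw):
    """First design: L @ L.T with a positive-diagonal lower-triangular L."""
    L = lower_triangular(raw)
    return L @ L.T

def cholesky_plus_identity(raw):
    """Second design: L @ L.T + I."""
    return cholesky_design(raw) + np.eye(raw.shape[0])

def gram_plus_identity(M):
    """Third design: Gram matrix M.T @ M plus the identity."""
    return M.T @ M + np.eye(M.shape[1])

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 4))                  # unconstrained parameters (stand-in values)

min_eigs = {
    "LL^T": np.linalg.eigvalsh(cholesky_design(raw)).min(),
    "LL^T + I": np.linalg.eigvalsh(cholesky_plus_identity(raw)).min(),
    "M^T M + I": np.linalg.eigvalsh(gram_plus_identity(raw)).min(),
}
for name, value in min_eigs.items():
    print(f"{name:10s} smallest eigenvalue = {value:.4f}")
```

All three designs yield strictly positive eigenvalues, but only the two designs with the added identity bound the smallest eigenvalue below by one, which keeps the preconditioned step from shrinking the gradient in any direction.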
Preconditioner | w/ PD constraint | w/o PD constraint |
---|---|---|
Avg Seen | 81.6 | 80.0 |
Avg Unseen | 79.8 | 73.8 |
Avg All | 80.9 | 77.6 |
DSP | ImageNet | Omniglot | Aircraft | Birds | Textures | Quick Draw | Fungi | VGG Flower | Average |
Non-PD rate | 0.24 | 0.29 | 0.35 | 0.24 | 0.35 | 0.35 | 0.18 | 0.29 | 0.29 |
6 Discussion
In this section, all experiments are conducted using TSP†.
The Necessity of Positive Definite Constraint
Even without an explicit PD constraint, one might assume that initializing the preconditioner as a positive definite matrix would be enough to keep it positive definite throughout meta-training, given the significant role of this property. However, as illustrated in Table 4 and Table 5, this assumption does not hold. In Table 4, we compare preconditioners with and without a PD constraint, both initialized as positive definite. Specifically, the former adopts the Task-Specific Preconditioner (see Eq. (15)), while the latter employs a Task-Specific Preconditioner whose DSPs are left unconstrained and are only initialized as positive definite. Evaluations are conducted using the multi-domain feature extractor (URL) in the multi-domain setting. After meta-training, DSPs without a PD constraint tend to lose positive definiteness, as shown in Table 5, leading to poor performance, as shown in Table 4. These findings underscore the necessity of explicitly constraining the preconditioner to maintain positive definiteness, as relying solely on optimization fails to preserve this crucial property.
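A check of the kind summarized in Table 5 could be sketched as below. The text does not spell out how the non-PD rate is computed, so this sketch assumes it is the fraction of a domain's preconditioner matrices that fail a positive definiteness test; the helper names and the example matrices are hypothetical.

```python
import numpy as np

def is_positive_definite(P):
    """True if x^T P x > 0 for all nonzero x, tested on the symmetric part of P."""
    sym = 0.5 * (P + P.T)
    return bool(np.linalg.eigvalsh(sym).min() > 0.0)

def non_pd_rate(preconditioners):
    """Fraction of matrices in the collection that are not positive definite."""
    flags = [not is_positive_definite(P) for P in preconditioners]
    return sum(flags) / len(flags)

# Made-up examples: two positive definite matrices and two indefinite ones.
mats = [
    np.eye(2),
    np.diag([1.0, -0.5]),
    np.array([[2.0, 0.0], [0.0, 3.0]]),
    np.array([[0.0, 1.0], [1.0, 0.0]]),
]
rate = non_pd_rate(mats)
print("non-PD rate:", rate)
```

Testing the symmetric part is the relevant criterion for an unconstrained, possibly non-symmetric matrix, because the quadratic form $x^{\top}\mathbf{P}x$ depends only on that part.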
Positive Definite DSP Designs with and without the Identity Matrix
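Setting | Varying-Way Varying-Shot | Varying-Way Varying-Shot | Varying-Way Five-Shot | Varying-Way Five-Shot |
---|---|---|---|---|
DSP designs | $\mathbf{L}\mathbf{L}^{\top}$ | $\mathbf{L}\mathbf{L}^{\top} + \mathbf{I}$ | $\mathbf{L}\mathbf{L}^{\top}$ | $\mathbf{L}\mathbf{L}^{\top} + \mathbf{I}$ |
Avg Seen | 80.8 | 81.2 | 76.8 | 76.6 |
Avg Unseen | 79.0 | 79.4 | 72.1 | 71.5 |
Avg All | 80.1 | 80.5 | 75.0 | 74.6 |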
Apart from ensuring positive definiteness, a notable characteristic of our Gram matrix design is its inclusion of the identity matrix. To explore the impact of this inclusion, we compare two positive definite DSP designs: $\mathbf{L}\mathbf{L}^{\top}$ and $\mathbf{L}\mathbf{L}^{\top} + \mathbf{I}$. We focus on these two DSP designs because $\mathbf{M}^{\top}\mathbf{M}$ does not guarantee positive definiteness. However, we also provide a comparison between $\mathbf{M}^{\top}\mathbf{M}$ and $\mathbf{M}^{\top}\mathbf{M} + \mathbf{I}$ in Appendix G. The experiments are conducted using the multi-domain feature extractor (URL). In Table 6, we observe that the DSP design with the added identity matrix performs better in the Varying-Way Varying-Shot setting but worse in the Varying-Way Five-Shot setting. This outcome aligns with prior theoretical findings (Amari et al. 2020) indicating that PGD performs better than GD in noisy gradient conditions, while GD excels when gradients are accurate. With more shots, gradients tend to be more accurate due to increased data. In the Varying-Way Varying-Shot setting, where tasks typically involve more than five shots, gradients are more accurate, making GD more beneficial than in the other setting. Including the identity matrix can be viewed as a regularization of PGD towards GD. Consequently, $\mathbf{L}\mathbf{L}^{\top} + \mathbf{I}$ aligns more closely with GD than $\mathbf{L}\mathbf{L}^{\top}$ does, resulting in improved performance due to the abundance of shots in the Varying-Way Varying-Shot setting. Conversely, in the Varying-Way Five-Shot setting, where tasks involve fewer shots, $\mathbf{L}\mathbf{L}^{\top}$ exhibits superior performance to $\mathbf{L}\mathbf{L}^{\top} + \mathbf{I}$ due to the scarcity of shots.
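The claim that adding the identity matrix regularizes PGD towards GD can be checked numerically: for a positive definite preconditioner, the preconditioned direction is never less aligned with the raw gradient after the identity is added. The sketch below uses an arbitrary random preconditioner and a random stand-in gradient, both of which are assumptions for illustration rather than quantities from the experiments.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
P = B @ B.T + 1e-3 * np.eye(5)     # an arbitrary positive definite preconditioner
g = rng.normal(size=5)             # stand-in gradient

# Alignment between the plain GD direction (g) and each preconditioned direction.
cos_without_identity = cosine(g, P @ g)
cos_with_identity = cosine(g, (P + np.eye(5)) @ g)

print("cosine(g, P g):       ", cos_without_identity)
print("cosine(g, (P + I) g): ", cos_with_identity)
```

The ordering holds for any positive semi-definite preconditioner and any nonzero gradient, which follows from the Cauchy-Schwarz inequality, so the random seed does not matter here.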




Effectiveness of Positive Definiteness in Cross-Domain Tasks
A positive definite preconditioner is known to mitigate the negative effects of pathological loss curvature and accelerate optimization, thereby facilitating convergence (Nocedal and Wright 1999; Saad 2003; Li 2017). This leads to a consistent reduction in the objective function. However, without positive definiteness, this effect is not guaranteed and may result in failure to converge. In Figure 4, we compare the learning curves of PGD with and without a PD constraint across both seen and unseen domains. Without the PD constraint, PGD fails to converge in some of the seen domains and in all the unseen domains. With the PD constraint, PGD successfully converges in all the seen and unseen domains. These results suggest that, in cross-domain tasks, a PD constraint of a preconditioner is crucial for achieving convergence and is beneficial for improving performance, which is also related to Figure 1(b).
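Why positive definiteness matters for a consistent reduction of the objective can be seen from the first-order change in the loss: a step along $-\mathbf{P}g$ changes the loss by approximately $-\eta\, g^{\top}\mathbf{P}g$, which is negative for every nonzero gradient only when $\mathbf{P}$ is positive definite. The toy loss and the two diagonal preconditioners below are made-up values chosen to show both cases.

```python
import numpy as np

def loss(theta):
    """Toy convex loss 0.5 * ||theta||^2, whose gradient is theta itself."""
    return 0.5 * float(theta @ theta)

def step(theta, P, lr):
    """One preconditioned gradient step on the toy loss."""
    return theta - lr * P @ theta

theta = np.array([0.1, 1.0])
P_pd = np.diag([1.0, 2.0])          # positive definite preconditioner
P_indef = np.diag([1.0, -2.0])      # indefinite preconditioner

# First-order change in the loss per unit step size: -g^T P g, with g = theta here.
slope_pd = -theta @ P_pd @ theta
slope_indef = -theta @ P_indef @ theta

loss_start = loss(theta)
loss_pd = loss(step(theta, P_pd, lr=0.01))
loss_indef = loss(step(theta, P_indef, lr=0.01))

print(f"start {loss_start:.4f} | PD step {loss_pd:.4f} | indefinite step {loss_indef:.4f}")
```

With the indefinite preconditioner the step moves uphill along the coordinate with the negative entry, so repeated updates can fail to converge even on a convex loss.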
TSP vs. Previous PGD Methods: Leveraging Multi-Domain Knowledge for Task-Specific Preconditioner
Compared to previous PGD methods like GAP (Kang et al. 2023), TSP is specifically designed for cross-domain few-shot learning (CDFSL), where unseen domains are not accessed during meta-training. The key challenge in CDFSL is to effectively leverage information from multiple seen domains to quickly adapt to each unseen domain. Previous PGD methods fall short in this regard because they rely on a single preconditioner, even when multiple seen domains are available. For example, GAP uses only one preconditioner to extract information from multiple seen domains, which limits its adaptability to unseen domains with distinct characteristics. In contrast, TSP meta-trains a distinct Domain-Specific Preconditioner (DSP) for each seen domain and combines them to construct a Task-Specific Preconditioner that is better suited to each unseen domain. TSP produces this Task-Specific Preconditioner effectively, as shown in Tables 1 and 2, and time-efficiently, as further detailed in Appendix H.
7 Conclusion
In this study, we have introduced a robust and effective adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP) to enhance CDFSL performance. Thanks to the meta-trained Domain-Specific Preconditioners (DSPs) and Task-coefficients, TSP can flexibly adjust the optimization strategy according to the geometric characteristics of the parameter space for the target task. Owing to these components, the proposed TSP demonstrates notable performance improvements on Meta-Dataset across various settings.
8 Acknowledgements
This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1A2C2007139) and in part by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) ([No. RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], [No. RS-2023-00235293, Development of autonomous driving big data processing, management, search, and sharing interface technology to provide autonomous driving data according to the purpose of usage]).
References
- Amari (1967) Amari, S. 1967. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, (3): 299–307.
- Amari (1996) Amari, S.-i. 1996. Neural learning in structured parameter spaces-natural Riemannian gradient. Advances in neural information processing systems, 9.
- Amari (1998) Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural computation, 10(2): 251–276.
- Amari et al. (2020) Amari, S.-i.; Ba, J.; Grosse, R.; Li, X.; Nitanda, A.; Suzuki, T.; Wu, D.; and Xu, J. 2020. When does preconditioning help or hurt generalization? arXiv preprint arXiv:2006.10732.
- Amari and Douglas (1998) Amari, S.-I.; and Douglas, S. C. 1998. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 2, 1213–1216. IEEE.
- Baik et al. (2023) Baik, S.; Choi, M.; Choi, J.; Kim, H.; and Lee, K. M. 2023. Learning to learn task-adaptive hyperparameters for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Bateni et al. (2022) Bateni, P.; Barber, J.; Van de Meent, J.-W.; and Wood, F. 2022. Enhancing few-shot image classification with unlabelled examples. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2796–2805.
- Bateni et al. (2020) Bateni, P.; Goyal, R.; Masrani, V.; Wood, F.; and Sigal, L. 2020. Improved few-shot visual classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14493–14502.
- Chen et al. (2019) Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C. F.; and Huang, J.-B. 2019. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232.
- Cimpoi et al. (2014) Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; and Vedaldi, A. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3606–3613.
- Duchi, Hazan, and Singer (2011) Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).
- Dvornik, Schmid, and Mairal (2020) Dvornik, N.; Schmid, C.; and Mairal, J. 2020. Selecting relevant features from a multi-domain representation for few-shot classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, 769–786. Springer.
- Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, 1126–1135. PMLR.
- Garcia and Bruna (2017) Garcia, V.; and Bruna, J. 2017. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
- Garnelo et al. (2018) Garnelo, M.; Rosenbaum, D.; Maddison, C.; Ramalho, T.; Saxton, D.; Shanahan, M.; Teh, Y. W.; Rezende, D.; and Eslami, S. A. 2018. Conditional neural processes. In International conference on machine learning, 1704–1713. PMLR.
- Guo et al. (2023) Guo, Y.; Du, R.; Dong, Y.; Hospedales, T.; Song, Y.-Z.; and Ma, Z. 2023. Task-aware Adaptive Learning for Cross-domain Few-shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1590–1599.
- Ha and Eck (2017) Ha, D.; and Eck, D. 2017. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477.
- Himmelblau et al. (2018) Himmelblau, D. M.; et al. 2018. Applied nonlinear programming. McGraw-Hill.
- Horn and Johnson (2012) Horn, R. A.; and Johnson, C. R. 2012. Matrix analysis. Cambridge university press.
- Houben et al. (2013) Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; and Igel, C. 2013. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In The 2013 international joint conference on neural networks (IJCNN), 1–8. IEEE.
- Kakade (2001) Kakade, S. M. 2001. A natural policy gradient. Advances in neural information processing systems, 14.
- Kang et al. (2023) Kang, S.; Hwang, D.; Eo, M.; Kim, T.; and Rhee, W. 2023. Meta-Learning with a Geometry-Adaptive Preconditioner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16080–16090.
- Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
- Lake et al. (2011) Lake, B.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. 2011. One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, volume 33.
- Lake, Salakhutdinov, and Tenenbaum (2015) Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266): 1332–1338.
- LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.
- LeCun et al. (2002) LeCun, Y.; Bottou, L.; Orr, G. B.; and Müller, K.-R. 2002. Efficient backprop. In Neural networks: Tricks of the trade, 9–50. Springer.
- Lee and Choi (2018) Lee, Y.; and Choi, S. 2018. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, 2927–2936. PMLR.
- Li, Liu, and Bilen (2021) Li, W.-H.; Liu, X.; and Bilen, H. 2021. Universal representation learning from multiple domains for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9526–9535.
- Li, Liu, and Bilen (2022) Li, W.-H.; Liu, X.; and Bilen, H. 2022. Cross-domain few-shot learning with task-specific adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7161–7170.
- Li (2017) Li, X.-L. 2017. Preconditioned stochastic gradient descent. IEEE transactions on neural networks and learning systems, 29(5): 1454–1466.
- Li et al. (2017) Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
- Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
- Liu et al. (2020) Liu, L.; Hamilton, W.; Long, G.; Jiang, J.; and Larochelle, H. 2020. A universal representation transformer layer for few-shot image classification. arXiv preprint arXiv:2006.11702.
- Liu et al. (2021) Liu, Y.; Lee, J.; Zhu, L.; Chen, L.; Shi, H.; and Yang, Y. 2021. A multi-mode modulator for multi-domain few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8453–8462.
- Maji et al. (2013) Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; and Vedaldi, A. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
- Mishra et al. (2017) Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P. 2017. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141.
- Munkhdalai and Yu (2017) Munkhdalai, T.; and Yu, H. 2017. Meta networks. In International conference on machine learning, 2554–2563. PMLR.
- Nilsback and Zisserman (2008) Nilsback, M.-E.; and Zisserman, A. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, 722–729. IEEE.
- Nocedal and Wright (1999) Nocedal, J.; and Wright, S. J. 1999. Numerical optimization. Springer.
- Oreshkin, Rodríguez López, and Lacoste (2018) Oreshkin, B.; Rodríguez López, P.; and Lacoste, A. 2018. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31.
- Park and Oliva (2019) Park, E.; and Oliva, J. B. 2019. Meta-curvature. Advances in Neural Information Processing Systems, 32.
- Rajasegaran et al. (2020) Rajasegaran, J.; Khan, S.; Hayat, M.; Khan, F. S.; and Shah, M. 2020. Meta-learning the learning trends shared across tasks. arXiv preprint arXiv:2010.09291.
- Rajeswaran et al. (2019) Rajeswaran, A.; Finn, C.; Kakade, S. M.; and Levine, S. 2019. Meta-learning with implicit gradients. Advances in neural information processing systems, 32.
- Ravi and Larochelle (2016) Ravi, S.; and Larochelle, H. 2016. Optimization as a model for few-shot learning. In International conference on learning representations.
- Requeima et al. (2019) Requeima, J.; Gordon, J.; Bronskill, J.; Nowozin, S.; and Turner, R. E. 2019. Fast and flexible multi-task classification using conditional neural adaptive processes. Advances in Neural Information Processing Systems, 32.
- Roy and Vetterli (2007) Roy, O.; and Vetterli, M. 2007. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, 606–610. IEEE.
- Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115: 211–252.
- Saad (2003) Saad, Y. 2003. Iterative methods for sparse linear systems. SIAM.
- Saikia, Brox, and Schmid (2020) Saikia, T.; Brox, T.; and Schmid, C. 2020. Optimized generic feature learning for few-shot classification across domains. arXiv preprint arXiv:2001.07926.
- Santoro et al. (2016) Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning, 1842–1850. PMLR.
- Schroeder and Cui (2018) Schroeder, B.; and Cui, Y. 2018. Fgvcx fungi classification challenge 2018. Available online: github.com/visipedia/fgvcx_fungi_comp (accessed on 14 July 2021).
- Simon et al. (2020) Simon, C.; Koniusz, P.; Nock, R.; and Harandi, M. 2020. On modulating the gradient for meta-learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, 556–572. Springer.
- Snell, Swersky, and Zemel (2017) Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
- Sung et al. (2018) Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1199–1208.
- Tian et al. (2024) Tian, H.; Liu, F.; Liu, T.; Du, B.; Cheung, Y.-m.; and Han, B. 2024. MOKD: Cross-domain Finetuning for Few-shot Classification via Maximizing Optimized Kernel Dependence. arXiv preprint arXiv:2405.18786.
- Tian et al. (2020) Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J. B.; and Isola, P. 2020. Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, 266–282. Springer.
- Triantafillou et al. (2021) Triantafillou, E.; Larochelle, H.; Zemel, R.; and Dumoulin, V. 2021. Learning a universal template for few-shot dataset generalization. In International Conference on Machine Learning, 10424–10433. PMLR.
- Triantafillou et al. (2019) Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Evci, U.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.; Manzagol, P.-A.; et al. 2019. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.
- Von Oswald et al. (2021) Von Oswald, J.; Zhao, D.; Kobayashi, S.; Schug, S.; Caccia, M.; Zucchet, N.; and Sacramento, J. 2021. Learning where to learn: Gradient sparsity in meta and continual learning. Advances in Neural Information Processing Systems, 34: 5250–5263.
- Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
- Yoon et al. (2018) Yoon, J.; Kim, T.; Dia, O.; Kim, S.; Bengio, Y.; and Ahn, S. 2018. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31.
- Zaheer et al. (2017) Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep sets. Advances in neural information processing systems, 30.
- Zhao et al. (2020) Zhao, D.; Kobayashi, S.; Sacramento, J.; and von Oswald, J. 2020. Meta-learning via hypernetworks. In 4th Workshop on Meta-Learning at NeurIPS 2020 (MetaLearn 2020). NeurIPS.
Appendix for the paper
“Task-Specific Preconditioner for Cross-Domain Few-Shot Learning”
Appendix A The three preconditioners used in Figure 1(b) and Figure 4
To establish the motivation for enforcing the positive definite constraint in CDFSL, we conduct a comparative analysis of three adaptation mechanisms—PGD methods with varying preconditioners—using Meta-Dataset. These mechanisms are applied on the state-of-the-art CDFSL method, TSA (Li, Liu, and Bilen 2022). In these comparisons, with task-specific parameters and a task , we update using PGD with a preconditioner as follows:
(17) |
where and is the empirical loss associated with and . The first PGD method is identical to Gradient Descent (GD), which utilizes the fixed identity matrix as (i.e., the baseline for gradient descent). The second method is Task-Specific Preconditioned Gradient Descent (TSP), which utilizes a Task-Specific Preconditioner designed as and initialized as (i.e., PGD without a positive definite constraint). The final method is Task-Specific Preconditioned gradient descent (TSP), which utilizes the Task-Specific Preconditioner defined in Section 4.3 (i.e., PGD with a positive definite constraint).
Appendix B Meta-Training and Meta-Testing Algorithms
Appendix C Proofs of Theorems
Lemma 1.
For the meta parameter , the Domain-Specific Preconditioner defined as is positive definite.
Proof.
is symmetric, as shown below:
For ,
The first term on the right-hand side is non-negative, while the second term is positive:
since . Thus, we conclude:
(18) |
which confirms the positive-definiteness of the Domain-Specific Preconditioner . ∎
Theorem 1.
Let , be the task-coefficients satisfying . For the Domain-Specific Preconditioners , Task-Specific Preconditioner defined as is positive definite.
Proof.
By Lemma 1, is symmetric. Therefore, is symmetric, as shown below:
For ,
Since is the Domain-Specific Preconditioner, by Lemma 1, each summand on the right-hand side is non-negative (see Eq. (18)):
(19) |
because . Since and , there exists at least one such that , implying that at least one term in Eq. (19) is positive:
(20) |
Combining Eq. (19) and Eq. (20), we conclude:
which confirms the positive-definiteness of the Task-Specific Preconditioner . ∎
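The chain of inequalities behind Theorem 1 can be written out as a short worked derivation. The notation here is assumed for illustration: $P_k$ denotes the $k$-th of $K$ Domain-Specific Preconditioners, $p_k$ the corresponding task-coefficient with $p_k \ge 0$ and $\sum_k p_k = 1$, $P_{\mathcal{T}} = \sum_k p_k P_k$ the Task-Specific Preconditioner, and $x$ an arbitrary nonzero vector. The last line illustrates the two-term argument of Lemma 1 under the additional assumption that a DSP has the form $M M^\top + I$; any other form with a positive semi-definite part plus the identity would follow the same pattern.

```latex
\begin{align*}
x^\top P_{\mathcal{T}}\, x
  &= x^\top \Big( \sum_{k=1}^{K} p_k P_k \Big) x
   = \sum_{k=1}^{K} p_k \, x^\top P_k\, x, \\
p_k \, x^\top P_k\, x &\ge 0
  \quad \text{for every } k,
  \text{ since } p_k \ge 0 \text{ and } x^\top P_k\, x > 0, \\
p_j \, x^\top P_j\, x &> 0
  \quad \text{for at least one } j,
  \text{ since } \textstyle\sum_{k} p_k = 1 \text{ forces some } p_j > 0, \\
\Longrightarrow \quad x^\top P_{\mathcal{T}}\, x &> 0. \\[6pt]
% Lemma 1 pattern under the assumed form P_k = M M^T + I:
x^\top \big( M M^\top + I \big)\, x
  &= \lVert M^\top x \rVert_2^2 + \lVert x \rVert_2^2 > 0
  \quad \text{for } x \neq 0 .
\end{align*}
```

Symmetry of $P_{\mathcal{T}}$ follows in the same way, because a weighted sum of symmetric matrices is symmetric.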
Appendix D Additional Ablation Studies
D.1 Weighting Factor of the Dataset Classifier Loss
In Table 7, we compare different dataset classifier losses by adjusting the weighting factor in Eq. (12). Additionally, we include the results obtained when utilizing the auxiliary loss in Eq. (12) as the dataset classifier loss. From the results, we can observe that yields the optimal performance, while using other losses results in inferior performance in both seen and unseen domains. This performance gap is more pronounced in unseen domains, highlighting the importance of balancing between the two losses for generalization to unseen domains. Based on these findings, we adopt for the dataset classifier loss in all experiments presented in this manuscript.
| | | | | | |
---|---|---|---|---|---|---|
Avg Seen | 80.8 | 81.3 | 81.6 | 81.0 | 80.9 | 80.1 |
Avg Unseen | 79.3 | 79.7 | 79.8 | 79.1 | 78.8 | 78.2 |
Avg All | 80.2 | 80.7 | 80.9 | 80.3 | 80.1 | 79.4 |
D.2 Interpreting and visualizing task-coefficient
To better understand how DSPs from the eight training domains combine to form Task-Specific Preconditioner, we illustrate the task-coefficients of TSP used in various test tasks in Figure 5(a). We first randomly sample the test tasks from each of the four domains (ImageNet, Birds, Traffic Sign, and MSCOCO). The blue heatmaps illustrate the task-coefficient values utilized in each test task. These heatmaps exhibit consistent patterns within each domain, although the values vary across tasks. For instance, in the Birds domain, all 5 tasks primarily rely on DSPs from ImageNet and Birds. Meanwhile, Task 2 evenly distributes task-coefficient values between them, while Task 4 assigns significantly more values to the Birds DSP compared to the ImageNet counterpart. In Figure 5(b), Figure 5(c), and Figure 5(d), we randomly sample the test tasks from other domains. Similar to the patterns observed in the heatmaps presented in Figure 5(a), these heatmaps also demonstrate consistent patterns within each domain, although the values vary across tasks.




Appendix E Implementation Details
E.1 Dataset
Meta-Dataset (Triantafillou et al. 2019) is the standard benchmark for evaluating the performance of cross-domain few-shot classification. Initially, it comprised ten datasets, including ILSVRC 2012 (Russakovsky et al. 2015), Omniglot (Lake, Salakhutdinov, and Tenenbaum 2015), FGVC-Aircraft (Maji et al. 2013), CUB-200-2011 (Wah et al. 2011), Describable Textures (Cimpoi et al. 2014), QuickDraw (Ha and Eck 2017), FGVCx Fungi (Schroeder and Cui 2018), VGG Flower (Nilsback and Zisserman 2008), Traffic Signs (Houben et al. 2013), and MSCOCO (Lin et al. 2014). Later, it was further expanded to include MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky, Hinton et al. 2009), and CIFAR-100 (Krizhevsky, Hinton et al. 2009).
E.2 Architecture for the dataset classifier
As the dataset classifier, we use a permutation-invariant set encoder (Zaheer et al. 2017) followed by a linear layer. We adopt the implementation of a permutation-invariant set encoder as described in previous studies (Requeima et al. 2019; Triantafillou et al. 2021). We implement this encoder as Conv- backbone, comprising modules with convolutions employing filters, followed by batch normalization, ReLU activation, and max-pooling with a stride of . Subsequently, global average pooling is applied to the output, followed by averaging over the first dimension (representing different examples within the support set), resulting in the set representation of the given support set. This representation is then fed into a linear layer to classify the given support set into one of the -training datasets.
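The pooling and classification stages described above can be sketched as follows, assuming the convolutional backbone has already mapped each support image to a feature vector. The feature values, the weight matrix, the biases, and the use of a softmax to turn logits into probabilities are hypothetical choices made for illustration; the backbone itself is not reproduced.

```python
import math


def set_representation(example_features):
    """Average per-example feature vectors over the support set.

    Averaging over the example dimension makes the result independent of
    the order in which support examples are presented.
    """
    count = len(example_features)
    dim = len(example_features[0])
    return [sum(f[d] for f in example_features) / count for d in range(dim)]


def linear_layer(representation, weights, biases):
    """Map the set representation to one logit per training dataset."""
    return [
        sum(w * r for w, r in zip(row, representation)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Numerically stable softmax (an assumed choice of normalization)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical features for a support set of three examples.
FEATURES = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
# Hypothetical linear layer for three training datasets.
WEIGHTS = [[1.0, 0.0], [0.0, 0.5], [-0.5, 0.0]]
BIASES = [0.0, 0.0, 0.0]

probs = softmax(linear_layer(set_representation(FEATURES), WEIGHTS, BIASES))
# Reordering the support examples leaves the prediction unchanged.
probs_permuted = softmax(
    linear_layer(set_representation(FEATURES[::-1]), WEIGHTS, BIASES)
)
print(probs)
print(probs_permuted)
```

Because the only interaction between examples is the average, the classifier accepts support sets of any size, which matters when the number of ways and shots varies from task to task.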
E.3 Hyper-parameters
Setting | Varying-Way Varying-Shot | Varying-Way 5-Shot | 5-Way 1-Shot | Varying-Way Varying-Shot |
---|---|---|---|---|
Batch size | 16 | 16 | ||
Weight decay | 0.0007 | 0.0007 | ||
T-max | 2500 | 2500 | ||
Max iteration | 160000 | 40000 | ||
Initialization for | ||||
Inner learning rate | 0.1 | 0.1 | ||
Outer learning rate | 0.1 | 0.1 | ||
The number of training inner-step | 5 | 5 | ||
The number of testing inner-step | 40 | 40 | ||
Test learning rate for seen domain | (0.05, 0.30) | (0.05, 0.30) | (0.05, 0.30) | (0.05, 0.20) |
Test learning rate for unseen domain | (0.25, 0.05) | (0.25, 0.05) | (0.25, 0.05) | (0.25, 0.05) |
Hyper-parameter | |
---|---|
Batch size | 16 |
Weight decay | 0.0007 |
T-max | 500 |
Max iteration | 4000 |
Learning rate | 0.001 |
Appendix F Additional Results
F.1 Varying-Way Five-Shot setting
In the standard Meta-Dataset benchmark, tasks vary in the number of classes per task (‘way’) and the number of support images per class (‘shot’), with ‘shot’ ranging up to 100. Here, we evaluate TSP in Varying-Way Five-Shot setting, which poses a greater challenge due to the limited number of support images. The results in Table 11 show that TSP†† achieves top performance for 10 out of 13 datasets, including all 5 unseen datasets. Notably, TSP†† demonstrates significantly higher scores in unseen domains compared to the previous best result ().
F.2 Five-Way One-Shot setting
We evaluate TSP under a more challenging setting, where only a single support image per class is available. As shown in Table 12, TSP†† consistently outperforms the previous best results for 10 out of 13 datasets, while TSP† also achieves the best or near-best results. Notably, applying TSP to TA2-Net results in a significant performance improvement in unseen domains () compared to using TA2-Net alone, demonstrating its efficacy in this highly challenging setting.
Setting | Varying-Way Varying-Shot | | Varying-Way Five-Shot | |
---|---|---|---|---|
DSP designs | | | | |
Avg Seen | 80.2 | 81.6 | 77.0 | 77.9 |
Avg Unseen | 68.2 | 79.8 | 63.3 | 73.7 |
Avg All | 75.6 | 80.9 | 71.7 | 76.3 |
Appendix G Comparison between and : Failure of DSP design
In Section 6, we compared and to explore the impact of including the identity matrix in the DSP design. Extending this analysis, we now include two additional DSP designs, and , for comprehensive examination, despite the latter’s failure to meet positive definiteness. Our findings, presented in Table 10, demonstrate that the DSP design consistently outperforms in both Varying-Way Varying-Shot and Varying-Way Five-Shot settings. This contrasts with the results in Table 6 of the manuscript, where including the identity matrix was effective in Varying-Way Varying-Shot setting. Furthermore, displays significantly lower performance compared to TSA (Li, Liu, and Bilen 2022), which serves as the baseline for our method.
To investigate the failure of the DSP design , we compare the effective ranks (Roy and Vetterli 2007) of Task-Specific Preconditioners between two DSP designs, and . The effective rank provides a numerical approximation of a matrix’s rank, indicating the number of singular values distant from zero. Since positive definite matrices possess solely positive singular values, an effective rank significantly lower than the full rank implies that the preconditioners are far from positive definite. Table 13 and Table 14 present the averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design , differing only in settings: Varying-Way Varying-Shot and Varying-Way Five-Shot, respectively. Several preconditioners in these tables exhibit notably lower effective rank than the full rank, indicating their departure from positive definiteness. Consequently, these non-positive definite preconditioners may fail to determine the steepest descent direction in the parameter space, leading to the degraded performance observed in Table 10. In contrast, Table 15 and Table 16 reveal that the averaged effective rank of 17 Task-Specific Preconditioners of the DSP design closely approaches full rank. This finding aligns with the Cholesky factorization’s assertion (Horn and Johnson 2012) that is positive definite, which confirms that Task-Specific Preconditioners constructed with are positive definite (Theorem 1 holds as long as DSP satisfies the positive definiteness). This observation corroborates the results presented in Table 6 of the manuscript.
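For reference, the effective rank of Roy and Vetterli (2007) is the exponential of the Shannon entropy of the normalized singular values. The sketch below assumes the singular values have already been computed (for example by an SVD routine, which is outside this snippet); the input lists are hypothetical and are not taken from Tables 13 to 16.

```python
import math


def effective_rank(singular_values):
    """Effective rank: exp of the entropy of the normalized singular values.

    Zero singular values contribute nothing to the entropy and are skipped.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Equal singular values: effective rank is approximately the full rank (4).
balanced = effective_rank([1.0, 1.0, 1.0, 1.0])
# One dominant singular value: effective rank collapses toward 1.
collapsed = effective_rank([1.0, 1e-9, 1e-9, 1e-9])
print(balanced, collapsed)
```

An effective rank close to the matrix dimension therefore indicates that no singular value is negligible relative to the others, which is how the tables in this appendix are read.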
Appendix H Time-efficiency of TSP compared to GAP
Unlike GAP (Kang et al. 2023), which suffers from significant inference time due to time-intensive singular value decomposition (SVD) calculations at every neural network layer during each inner-level training iteration, TSP achieves a much faster inference time. Specifically, GAP requires approximately 14.2 seconds per task, while TSP applied on TSA (Li, Liu, and Bilen 2022) completes inference in just 1.1 seconds, about 13 times faster than GAP, by leveraging pre-calculated DSPs and a dataset classifier to avoid time-intensive calculations during inference. (Inference time for GAP and TSP is measured on a single RTX3090 GPU and averaged over 100 test tasks.) This highlights the superior practical efficiency of TSP compared to the previous PGD-based method, GAP.
Appendix I Application Details for TSP† and TSP††
As shown in Figure 6, we apply TSP to the state-of-the-art CDFSL methods, TSA (Li, Liu, and Bilen 2022) and TA2-Net (Guo et al. 2023). Figure 6(a) illustrates PGD with Domain-Specific Preconditioner (DSP) applied to TSA during meta-training, where the DSP is selected based on the task’s domain label, and PGD optimizes each task-specific parameter using the corresponding DSP. Figure 6(b) shows PGD with a Task-Specific Preconditioner applied to TSA during meta-testing. In this case, each preconditioner is constructed by DSPs with task coefficients generated by the Dataset Classifier, and PGD optimizes each using the constructed preconditioner. Figure 6(c) depicts PGD with DSP applied to TA2-Net during meta-training. Unlike TSA, which optimizes a single task-specific parameter for each module, TA2-Net optimizes multiple task-specific parameters . To accommodate this, we apply the same PGD to optimize the multiple parameters . Figure 6(d) shows PGD with Task-Specific Preconditioner applied to TA2-Net during meta-testing, where, similar to the meta-training phase, the same PGD is applied to optimize the multiple task-specific parameters .
Test Dataset | | SUR | URT | URL | TSA | TA2-Net | MOKD | TSP† | TSP†† |
---|---|---|---|---|---|---|---|---|---|
ImageNet | 47.2±1.0 | 46.7±1.0 | 48.6±1.0 | 49.4±1.0 | 48.3±1.0 | 49.3±1.0 | 47.5±1.0 | 50.6±1.0 | 50.8±1.0 |
Omniglot | 95.1±0.3 | 95.8±0.3 | 96.0±0.3 | 96.0±0.3 | 96.8±0.3 | 96.6±0.2 | 96.0±0.3 | 97.2±0.3 | 97.1±0.3 |
Aircraft | 74.6±0.6 | 82.1±0.6 | 81.2±0.6 | 84.8±0.5 | 85.5±0.5 | 85.9±0.4 | 84.4±0.5 | 86.2±0.5 | 86.7±0.5 |
Birds | 69.6±0.7 | 62.8±0.9 | 71.2±0.7 | 76.0±0.6 | 76.6±0.6 | 77.3±0.6 | 76.8±0.6 | 77.0±0.6 | 77.8±0.6 |
Textures | 57.5±0.7 | 60.2±0.7 | 65.2±0.7 | 69.1±0.6 | 68.3±0.7 | 68.3±0.6 | 66.3±0.6 | 69.1±0.6 | 69.3±0.6 |
Quick Draw | 70.9±0.6 | 79.0±0.5 | 79.2±0.5 | 78.2±0.5 | 77.9±0.6 | 78.5±0.5 | 78.9±0.5 | 78.7±0.6 | 78.8±0.6 |
Fungi | 50.3±1.0 | 66.5±0.8 | 66.9±0.9 | 70.0±0.8 | 70.4±0.8 | 70.3±0.8 | 68.8±0.9 | 73.6±0.9 | 72.9±0.8 |
VGG Flower | 86.5±0.4 | 76.9±0.6 | 82.4±0.5 | 89.3±0.4 | 89.5±0.4 | 90.0±0.4 | 89.1±0.4 | 90.8±0.4 | 91.1±0.4 |
Traffic Sign | 55.2±0.8 | 44.9±0.9 | 45.1±0.9 | 57.5±0.8 | 72.3±0.6 | 76.7±0.5 | 59.2±0.8 | 79.6±0.5 | 80.9±0.5 |
MSCOCO | 49.2±0.8 | 48.1±0.9 | 52.3±0.9 | 56.1±0.8 | 56.0±0.8 | 56.0±0.8 | 51.8±0.8 | 57.5±0.9 | 59.3±0.8 |
MNIST | 88.9±0.4 | 90.1±0.4 | 86.5±0.5 | 89.7±0.4 | 92.5±0.4 | 93.3±0.3 | 89.4±0.3 | 93.0±0.3 | 93.9±0.3 |
CIFAR-10 | 66.1±0.7 | 50.3±1.0 | 61.4±0.7 | 66.0±0.7 | 72.0±0.7 | 73.1±0.7 | 58.8±0.7 | 73.5±0.7 | 74.2±0.7 |
CIFAR-100 | 53.8±0.9 | 46.4±0.9 | 52.5±0.9 | 57.0±0.9 | 64.1±0.8 | 64.1±0.8 | 55.3±0.9 | 65.0±0.8 | 65.6±0.9 |
Avg Seen | 69.0 | 71.3 | 73.8 | 76.6 | 76.7 | 77.0 | 76.0 | 77.9 | 78.1 |
Avg Unseen | 62.6 | 56.0 | 59.6 | 65.3 | 71.4 | 72.6 | 63.0 | 73.7 | 74.8 |
Avg All | 66.5 | 65.4 | 68.3 | 72.2 | 74.6 | 75.3 | 71.0 | 76.3 | 76.8 |
Avg Rank | 7.9 | 7.8 | 6.6 | 4.9 | 4.3 | 3.5 | 5.8 | 2.2 | 1.4 |
Test Dataset | | SUR | URT | URL | TSA | TA2-Net | MOKD | TSP† | TSP†† |
---|---|---|---|---|---|---|---|---|---|
ImageNet | 42.6±0.9 | 40.7±1.0 | 47.4±1.0 | 49.6±1.1 | 48.0±1.0 | 48.8±1.1 | 46.0±1.0 | 50.1±1.0 | 50.5±1.0 |
Omniglot | 93.1±0.5 | 93.0±0.7 | 95.6±0.5 | 95.8±0.5 | 96.3±0.4 | 95.7±0.4 | 95.5±0.5 | 96.6±0.4 | 96.8±0.4 |
Aircraft | 65.8±0.9 | 67.1±1.4 | 77.9±0.9 | 79.6±0.9 | 79.6±0.9 | 79.8±0.9 | 78.6±0.9 | 81.1±0.9 | 81.3±0.9 |
Birds | 67.9±0.9 | 59.2±1.0 | 70.9±0.9 | 74.9±0.9 | 74.5±0.9 | 74.4±0.9 | 75.9±0.9 | 75.7±0.9 | 75.7±0.9 |
Textures | 42.2±0.8 | 42.5±0.8 | 49.4±0.9 | 53.6±0.9 | 54.5±0.9 | 54.1±0.8 | 51.4±0.9 | 55.5±0.9 | 55.4±0.9 |
Quick Draw | 70.5±0.9 | 79.8±0.9 | 79.6±0.9 | 79.0±0.8 | 79.3±0.9 | 78.9±0.9 | 78.9±0.9 | 80.7±0.8 | 80.2±0.9 |
Fungi | 58.3±1.1 | 64.8±1.1 | 71.0±1.0 | 75.2±1.0 | 75.3±1.0 | 75.2±0.9 | 71.1±1.0 | 77.8±0.9 | 78.1±0.8 |
VGG Flower | 79.9±0.7 | 65.0±1.0 | 72.7±0.0 | 79.9±0.8 | 80.3±0.8 | 80.1±0.8 | 79.8±0.8 | 81.0±0.8 | 81.1±0.8 |
Traffic Sign | 55.3±0.9 | 44.6±0.9 | 52.7±0.9 | 57.9±0.9 | 57.2±1.0 | 54.1±1.0 | 57.0±0.9 | 57.4±1.0 | 56.9±1.0 |
MSCOCO | 48.8±0.9 | 47.8±1.1 | 56.9±1.1 | 59.2±1.0 | 59.9±1.0 | 58.1±1.0 | 50.9±0.8 | 59.7±1.0 | 60.5±1.0 |
MNIST | 80.1±0.9 | 77.1±0.9 | 75.6±0.9 | 78.7±0.9 | 80.1±0.9 | 80.3±0.9 | 72.5±0.9 | 81.4±0.8 | 81.7±0.9 |
CIFAR-10 | 50.3±0.9 | 35.8±0.8 | 47.3±0.9 | 54.7±0.9 | 55.8±0.9 | 52.9±1.0 | 47.3±0.8 | 55.9±0.9 | 56.0±0.9 |
CIFAR-100 | 53.8±0.9 | 42.9±1.0 | 54.9±1.1 | 61.8±1.0 | 63.7±1.0 | 61.0±1.1 | 60.2±1.0 | 65.2±1.0 | 65.6±1.0 |
Avg Seen | 65.0 | 64.0 | 70.6 | 73.5 | 73.5 | 73.4 | 72.2 | 74.8 | 74.9 |
Avg Unseen | 57.7 | 49.6 | 57.5 | 62.5 | 63.3 | 61.3 | 57.5 | 63.9 | 64.1 |
Avg All | 62.2 | 58.5 | 65.5 | 69.2 | 69.6 | 68.7 | 66.5 | 70.6 | 70.8 |
Avg Rank | 7.5 | 8.2 | 6.8 | 4.2 | 3.5 | 4.8 | 6.2 | 1.9 | 1.5 |
| | | | Birds | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
layer1-0- | 60.04 | 60.97 | 63.43 | 61.72 | 60.27 | 64.00 | 62.47 | 63.28 | 62.27 | 60.44 | 64.00 | 60.17 | 59.69 | 64 | ||||||||||||||||||||||||||||
layer1-0- | 62.70 | 61.46 | 62.92 | 62.77 | 61.63 | 64.00 | 61.59 | 62.73 | 62.99 | 62.80 | 64.00 | 62.80 | 62.69 | 64 | ||||||||||||||||||||||||||||
layer1-1- | 54.84 | 57.39 | 40.04 | 57.87 | 47.18 | 63.96 | 35.14 | 49.95 | 50.58 | 54.76 | 63.99 | 56.37 | 56.88 | 64 | ||||||||||||||||||||||||||||
layer1-1- | 62.17 | 63.87 | 57.23 | 62.60 | 62.29 | 64.00 | 59.25 | 60.77 | 62.08 | 62.31 | 64.00 | 62.34 | 62.27 | 64 | ||||||||||||||||||||||||||||
layer2-0- | 95.20 | 125.21 | 33.71 | 72.66 | 58.40 | 127.69 | 38.69 | 96.12 | 79.15 | 93.68 | 127.90 | 98.73 | 104.02 | 128 | ||||||||||||||||||||||||||||
layer2-0- | 125.09 | 127.15 | 109.02 | 113.07 | 116.99 | 127.99 | 117.76 | 122.92 | 124.18 | 125.19 | 128.00 | 124.84 | 125.40 | 128 | ||||||||||||||||||||||||||||
layer2-1- | 126.98 | 127.73 | 115.54 | 124.03 | 119.52 | 128.00 | 126.41 | 127.16 | 127.15 | 126.96 | 128.00 | 126.90 | 126.91 | 128 | ||||||||||||||||||||||||||||
layer2-1- | 127.74 | 127.43 | 126.58 | 126.58 | 127.04 | 128.00 | 127.61 | 127.39 | 127.79 | 127.77 | 128.00 | 127.71 | 127.72 | 128 | ||||||||||||||||||||||||||||
layer3-0- | 230.14 | 255.05 | 120.37 | 96.81 | 7.69 | 255.77 | 216.22 | 202.50 | 129.93 | 149.37 | 255.94 | 214.94 | 199.54 | 256 | ||||||||||||||||||||||||||||
layer3-0- | 250.48 | 254.61 | 156.72 | 142.09 | 196.29 | 255.97 | 253.60 | 218.84 | 247.46 | 250.71 | 255.98 | 240.82 | 247.83 | 256 | ||||||||||||||||||||||||||||
layer3-1- | 196.75 | 252.20 | 53.97 | 10.92 | 41.12 | 254.19 | 252.10 | 24.33 | 126.43 | 195.93 | 254.96 | 125.57 | 164.58 | 256 | ||||||||||||||||||||||||||||
layer3-1- | 255.58 | 254.85 | 253.45 | 253.98 | 253.79 | 255.99 | 255.87 | 254.82 | 255.73 | 255.62 | 255.99 | 255.52 | 255.51 | 256 | ||||||||||||||||||||||||||||
layer4-0- | 491.73 | 509.82 | 466.61 | 482.44 | 16.00 | 511.75 | 509.44 | 501.00 | 305.02 | 335.14 | 511.93 | 489.17 | 438.08 | 512 | ||||||||||||||||||||||||||||
layer4-0- | 460.91 | 493.62 | 410.53 | 73.12 | 387.58 | 511.91 | 507.25 | 478.68 | 497.04 | 482.00 | 511.92 | 385.93 | 438.04 | 512 | ||||||||||||||||||||||||||||
layer4-1- | 431.05 | 501.79 | 223.91 | 243.72 | 7.22 | 488.68 | 451.76 | 420.43 | 201.02 | 239.09 | 489.71 | 410.19 | 351.43 | 512 | ||||||||||||||||||||||||||||
layer4-1- | 498.54 | 509.57 | 509.23 | 507.46 | 496.55 | 511.55 | 510.78 | 509.13 | 507.39 | 500.13 | 511.55 | 499.06 | 496.86 | 512 | ||||||||||||||||||||||||||||
pa-weight | 11.60 | 3.13 | 3.88 | 7.07 | 4.58 | 2.79 | 3.33 | 3.36 | 9.06 | 11.60 | 2.74 | 11.84 | 12.16 | 512 |

| | | | | Birds | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
layer1-0- | 60.14 | 61.04 | 63.43 | 61.70 | 60.49 | 64.00 | 62.46 | 63.28 | 62.00 | 60.37 | 64.00 | 60.31 | 59.76 | 64 |
layer1-0- | 62.70 | 61.52 | 62.93 | 62.79 | 61.86 | 64.00 | 61.61 | 62.73 | 62.98 | 62.80 | 64.00 | 62.80 | 62.70 | 64 |
layer1-1- | 54.41 | 57.46 | 40.18 | 57.88 | 48.46 | 63.96 | 35.38 | 49.97 | 51.36 | 55.34 | 63.99 | 55.87 | 56.69 | 64 |
layer1-1- | 62.13 | 63.86 | 57.28 | 62.61 | 62.40 | 64.00 | 59.31 | 60.78 | 62.15 | 62.36 | 64.00 | 62.31 | 62.27 | 64 |
layer2-0- | 93.59 | 124.77 | 34.00 | 73.67 | 62.76 | 127.68 | 39.19 | 96.04 | 81.61 | 95.96 | 127.90 | 96.43 | 103.21 | 128 |
layer2-0- | 124.92 | 127.16 | 109.17 | 113.52 | 118.17 | 127.99 | 117.88 | 122.93 | 124.42 | 125.25 | 128.00 | 124.63 | 125.37 | 128 |
layer2-1- | 126.98 | 127.74 | 115.65 | 124.15 | 120.53 | 128.00 | 126.43 | 127.16 | 127.15 | 126.91 | 128.00 | 126.89 | 126.92 | 128 |
layer2-1- | 127.74 | 127.45 | 126.59 | 126.63 | 127.16 | 128.00 | 127.61 | 127.39 | 127.80 | 127.76 | 128.00 | 127.71 | 127.73 | 128 |
layer3-0- | 229.05 | 254.92 | 121.04 | 99.77 | 9.22 | 255.76 | 216.45 | 202.70 | 139.21 | 136.38 | 255.94 | 211.75 | 200.61 | 256 |
layer3-0- | 249.93 | 254.62 | 157.35 | 144.87 | 202.20 | 255.97 | 253.61 | 219.02 | 248.38 | 249.63 | 255.98 | 239.71 | 248.04 | 256 |
layer3-1- | 193.29 | 251.15 | 54.49 | 11.90 | 47.84 | 254.12 | 251.87 | 24.43 | 138.76 | 184.25 | 254.95 | 127.65 | 166.39 | 256 |
layer3-1- | 255.58 | 254.88 | 253.48 | 254.05 | 254.07 | 255.99 | 255.87 | 254.83 | 255.72 | 255.59 | 255.99 | 255.53 | 255.52 | 256 |
layer4-0- | 492.50 | 509.86 | 466.98 | 482.91 | 19.81 | 511.75 | 509.23 | 501.06 | 322.40 | 308.72 | 511.93 | 487.09 | 439.86 | 512 |
layer4-0- | 457.24 | 493.82 | 411.01 | 77.99 | 398.85 | 511.91 | 507.03 | 478.88 | 496.18 | 476.10 | 511.92 | 382.92 | 440.78 | 512 |
layer4-1- | 431.12 | 501.57 | 225.25 | 248.87 | 8.84 | 488.65 | 451.73 | 420.80 | 220.69 | 213.26 | 489.71 | 405.75 | 354.55 | 512 |
layer4-1- | 499.01 | 509.62 | 509.24 | 507.28 | 497.76 | 511.55 | 510.73 | 509.15 | 506.38 | 499.75 | 511.55 | 499.69 | 497.14 | 512 |
pa-weight | 11.43 | 3.21 | 3.95 | 7.29 | 5.60 | 2.79 | 3.45 | 3.39 | 9.62 | 11.74 | 2.74 | 11.65 | 12.10 | 512 |

| | | | | Birds | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
layer1-0- | 63.97 | 63.95 | 63.99 | 63.98 | 63.99 | 64.00 | 63.99 | 64.00 | 63.99 | 63.98 | 64.00 | 63.96 | 63.97 | 64 |
layer1-0- | 63.97 | 63.96 | 64.00 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
layer1-1- | 63.92 | 63.90 | 63.99 | 63.96 | 63.99 | 64.00 | 63.99 | 63.99 | 63.97 | 63.95 | 64.00 | 63.91 | 63.92 | 64 |
layer1-1- | 63.97 | 63.99 | 63.99 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
layer2-0- | 127.87 | 127.97 | 127.97 | 127.92 | 127.96 | 128.00 | 127.98 | 127.98 | 127.96 | 127.91 | 128.00 | 127.84 | 127.86 | 128 |
layer2-0- | 127.97 | 127.99 | 127.99 | 127.98 | 127.98 | 128.00 | 127.99 | 127.99 | 127.99 | 127.98 | 128.00 | 127.97 | 127.97 | 128 |
layer2-1- | 127.99 | 128.00 | 127.99 | 127.99 | 127.98 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 128.00 | 127.99 | 127.99 | 128 |
layer2-1- | 127.99 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 127.99 | 128 |
layer3-0- | 255.95 | 256.00 | 255.96 | 255.94 | 255.66 | 256.00 | 255.97 | 255.98 | 255.98 | 255.96 | 256.00 | 255.94 | 255.95 | 256 |
layer3-0- | 255.98 | 256.00 | 255.95 | 255.96 | 255.90 | 256.00 | 255.99 | 255.99 | 255.99 | 255.98 | 256.00 | 255.97 | 255.98 | 256 |
layer3-1- | 255.99 | 256.00 | 255.93 | 255.91 | 255.59 | 256.00 | 255.98 | 255.97 | 255.99 | 255.98 | 256.00 | 255.99 | 255.99 | 256 |
layer3-1- | 256.00 | 256.00 | 255.98 | 255.99 | 255.99 | 256.00 | 256.00 | 255.99 | 256.00 | 256.00 | 256.00 | 255.99 | 256.00 | 256 |
layer4-0- | 511.97 | 512.00 | 511.91 | 511.95 | 510.54 | 512.00 | 511.95 | 511.97 | 511.97 | 511.95 | 512.00 | 511.97 | 511.96 | 512 |
layer4-0- | 511.96 | 511.98 | 511.79 | 511.81 | 511.66 | 512.00 | 511.90 | 511.92 | 511.95 | 511.96 | 512.00 | 511.96 | 511.97 | 512 |
layer4-1- | 511.90 | 511.97 | 511.54 | 511.78 | 507.13 | 511.92 | 511.58 | 511.84 | 511.77 | 511.76 | 511.93 | 511.91 | 511.89 | 512 |
layer4-1- | 511.97 | 511.99 | 511.97 | 511.97 | 511.81 | 512.00 | 511.97 | 511.98 | 511.98 | 511.98 | 512.00 | 511.97 | 511.97 | 512 |
pa-weight | 494.46 | 481.54 | 494.69 | 495.60 | 490.04 | 486.75 | 484.23 | 493.87 | 490.07 | 493.48 | 486.75 | 495.33 | 495.25 | 512 |

| | | | | Birds | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
layer1-0- | 63.97 | 63.95 | 63.99 | 63.98 | 63.99 | 64.00 | 63.99 | 64.00 | 63.99 | 63.98 | 64.00 | 63.96 | 63.97 | 64 |
layer1-0- | 63.97 | 63.96 | 64.00 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
layer1-1- | 63.92 | 63.90 | 63.99 | 63.96 | 63.99 | 64.00 | 63.99 | 63.99 | 63.97 | 63.94 | 64.00 | 63.91 | 63.92 | 64 |
layer1-1- | 63.97 | 63.99 | 63.99 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
layer2-0- | 127.87 | 127.97 | 127.97 | 127.93 | 127.96 | 128.00 | 127.98 | 127.98 | 127.95 | 127.91 | 128.00 | 127.85 | 127.86 | 128 |
layer2-0- | 127.97 | 127.99 | 127.99 | 127.98 | 127.99 | 128.00 | 127.99 | 127.99 | 127.99 | 127.98 | 128.00 | 127.97 | 127.97 | 128 |
layer2-1- | 127.99 | 128.00 | 127.99 | 127.99 | 127.99 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 128.00 | 127.99 | 127.99 | 128 |
layer2-1- | 127.99 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 127.99 | 128 |
layer3-0- | 255.95 | 256.00 | 255.97 | 255.94 | 255.78 | 256.00 | 255.97 | 255.98 | 255.98 | 255.96 | 256.00 | 255.94 | 255.95 | 256 |
layer3-0- | 255.98 | 256.00 | 255.95 | 255.95 | 255.94 | 256.00 | 255.99 | 255.99 | 255.99 | 255.98 | 256.00 | 255.97 | 255.98 | 256 |
layer3-1- | 255.99 | 256.00 | 255.93 | 255.90 | 255.74 | 256.00 | 255.98 | 255.97 | 255.99 | 255.98 | 256.00 | 255.99 | 255.99 | 256 |
layer3-1- | 256.00 | 256.00 | 255.98 | 255.99 | 255.99 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256 |
layer4-0- | 511.97 | 512.00 | 511.91 | 511.95 | 511.06 | 512.00 | 511.95 | 511.97 | 511.97 | 511.94 | 512.00 | 511.97 | 511.97 | 512 |
layer4-0- | 511.96 | 511.98 | 511.80 | 511.79 | 511.78 | 512.00 | 511.90 | 511.92 | 511.96 | 511.96 | 512.00 | 511.96 | 511.97 | 512 |
layer4-1- | 511.89 | 511.97 | 511.55 | 511.76 | 508.74 | 511.92 | 511.58 | 511.84 | 511.79 | 511.73 | 511.92 | 511.91 | 511.89 | 512 |
layer4-1- | 511.97 | 511.99 | 511.97 | 511.97 | 511.87 | 512.00 | 511.97 | 511.98 | 511.98 | 511.97 | 512.00 | 511.97 | 511.97 | 512 |
pa-weight | 494.37 | 481.54 | 494.72 | 495.46 | 490.94 | 486.75 | 484.26 | 493.87 | 490.64 | 494.00 | 486.75 | 495.13 | 495.09 | 512 |



