Task-Specific Preconditioner for Cross-Domain Few-Shot Learning

Suhyun Kang¹, Jungwon Park², Wonseok Lee³, Wonjong Rhee^2,3 Corresponding author

Abstract

Cross-Domain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.

1 Introduction

Few-Shot Learning (FSL) aims to learn a model that can generalize to novel classes using a few labeled examples. Recent advancements in FSL have been significantly propelled by meta-learning methods (Snell, Swersky, and Zemel 2017; Finn, Abbeel, and Levine 2017; Sung et al. 2018; Oreshkin, Rodríguez López, and Lacoste 2018; Garnelo et al. 2018; Rajeswaran et al. 2019). These approaches have achieved outstanding results in single-domain FSL benchmarks such as Omniglot (Lake et al. 2011) and miniImagenet (Ravi and Larochelle 2016). However, recent studies (Chen et al. 2019; Tian et al. 2020) have revealed that many existing FSL methods struggle to generalize in the cross-domain setting, where the test data originates from domains that are either unknown or previously unseen. To study the challenge of generalization in cross-domain few-shot tasks, Triantafillou et al. (2019) introduced the Meta-Dataset, a more realistic, large-scale, and diverse benchmark. It includes multiple datasets from a variety of domains for both meta-training and meta-testing phases.

Leveraging the Meta-Dataset, various Cross-Domain Few-Shot Learning (CDFSL) methods have been developed (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Dvornik, Schmid, and Mairal 2020; Liu et al. 2020; Guo et al. 2023; Tian et al. 2024), demonstrating significant advancements in this field. These approaches typically parameterize deep neural networks with a large set of task-agnostic parameters alongside a smaller set of task-specific parameters. Task-specific parameters are optimized to the target task through an adaptation mechanism, generally following one of two primary methodologies. The first approach utilizes an auxiliary network functioning as a parameter generator, which, upon receiving a few labeled examples from the target task, outputs optimized task-specific parameters (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2020, 2021). The second approach directly fine-tunes the task-specific parameters through gradient descent using a few labeled examples from the target task (Dvornik, Schmid, and Mairal 2020; Li, Liu, and Bilen 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2022; Tian et al. 2024).

While both approaches have improved CDFSL performance through their adaptation mechanisms, they share a common limitation: each employs a fixed optimization strategy across different target tasks. However, Figure 1(a) shows that the optimal choice of optimizer may vary significantly depending on the given domain or target task. This implies that performance can be significantly improved by adapting the optimization strategy to the target domain and task. Devising an effective and reliable scheme for doing so, however, has been challenging.

One promising approach for establishing a robust adaptive optimization scheme is to leverage Preconditioned Gradient Descent (PGD) (Himmelblau et al. 2018). PGD operates by specifying a preconditioning matrix, often referred to as a preconditioner, which re-scales the geometry of the parameter space. In the field of machine learning, previous research has shown that if the preconditioner is positive definite (PD), it establishes a valid Riemannian metric, which represents the geometric characteristics (e.g., curvature) of the parameter space and steers preconditioned gradients in the direction of steepest descent (Amari 1967, 1996, 1998; Amari and Douglas 1998). While the effectiveness of positive definiteness in PGD is supported by existing theoretical findings, its efficacy as an adaptive optimization scheme in CDFSL can be examined through a simple comparison. In Figure 1(b), we compare PGD with and without a PD constraint for the preconditioner on the Meta-Dataset. Without a PD constraint, PGD shows markedly inferior performance, especially in unseen domains. Conversely, with a PD constraint, PGD consistently exhibits performance improvements across seen and unseen domains compared to the baseline using GD. This supports the pivotal role of positive definiteness in PGD for CDFSL.

Inspired by these findings, we introduce a novel adaptation mechanism named Task-Specific Preconditioned gradient descent (TSP). In our approach, we establish a Task-Specific Preconditioner that is constrained to be positive definite and adapt it to the specific nature of the target task. This preconditioner consists of two components. The first component consists of the Domain-Specific Preconditioners (DSPs), which are uniquely defined for each meta-training domain and meta-trained on tasks sampled from these domains through bi-level optimization during the meta-training phase. The second component is the set of task-coefficients, which approximate the compatibility between the target task and each meta-training domain. Figure 2 illustrates the construction of the Task-Specific Preconditioner. For a given target task ${\mathcal{T}}$ , the Task-Specific Preconditioner $\mathbf{P}_{\mathcal{T}}$ is constructed by linearly combining the DSPs ${\mathbf{P}_{k}}$ from multiple seen domains, with each weighted by the corresponding task-coefficient ${p_{\mathcal{T},k}}$ . This process produces a preconditioner specifically adapted to the geometric characteristics of the target task’s parameter space. By integrating knowledge from multiple seen domains, TSP distinguishes itself from traditional PGD techniques, such as GAP (Kang et al. 2023), which are discussed further in Section 6. Applying our approach to state-of-the-art CDFSL methods, such as TSA or TA²-Net, significantly enhances performance on the Meta-Dataset. For example, in multi-domain settings, applying TSP to TA²-Net (Guo et al. 2023) achieves the best performance across all datasets.

2 Related Works

Meta-Learning for Few-Shot Learning

Until recently, numerous approaches in the field of few-shot learning have adopted the meta-learning framework. These approaches can be mainly divided into three types: metric-based, model-based, and optimization-based methods. Metric-based methods (Garcia and Bruna 2017; Sung et al. 2018; Snell, Swersky, and Zemel 2017; Oreshkin, Rodríguez López, and Lacoste 2018) train a feature encoder to extract features from support and query samples. They employ a nearest neighbor classifier with various distance functions to calculate similarity scores for predicting the labels of query samples. Model-based methods (Santoro et al. 2016; Munkhdalai and Yu 2017; Mishra et al. 2017; Garnelo et al. 2018) train an encoder to generate task-specific models from a few support samples. Optimization-based methods (Ravi and Larochelle 2016; Finn, Abbeel, and Levine 2017; Yoon et al. 2018; Rajeswaran et al. 2019) train a model that can quickly adapt to new tasks with a few support samples, employing a bi-level optimization. In our method, we employ the bi-level optimization used in the optimization-based methods.

Cross-Domain Few-Shot Learning (CDFSL)

Recent CDFSL methods define the universal model as a deep neural network and partition it into task-agnostic and task-specific parameters. The task-agnostic parameters represent generic characteristics that are valid for a range of tasks from various domains. On the other hand, the task-specific parameters represent adaptable attributes that are optimized to the target tasks through an adaptation mechanism. Task-agnostic parameters can be designed as a single network or multiple networks. The single network is trained on a large dataset from a single domain (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021) or multiple domains (Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), whereas the multiple networks are trained individually on each domain (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020). Task-specific parameters can be designed as selection parameters (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020), a pre-classifier transformation (Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), Feature-wise Linear Modulation (FiLM) layers (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021), or Residual Adapters (RA) (Li, Liu, and Bilen 2022; Guo et al. 2023). As the adaptation mechanism for the task-specific parameters, several studies (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2020, 2021) meta-learn an auxiliary network, which generates task-specific parameters adapted to the target task. On the other hand, other studies (Dvornik, Schmid, and Mairal 2020; Li, Liu, and Bilen 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2022) employ gradient descent to adapt task-specific parameters to the target task. In our work, we propose a novel adaptation mechanism in the form of a task-specific optimizer, which adapts task-specific parameters to the target task.

Preconditioned Gradient Descent in Meta-Learning

In meta-learning, several optimization-based approaches (Li et al. 2017; Lee and Choi 2018; Park and Oliva 2019; Rajasegaran et al. 2020; Simon et al. 2020; Zhao et al. 2020; Von Oswald et al. 2021; Kang et al. 2023) have incorporated Preconditioned Gradient Descent (PGD) to adapt a network’s parameters to the target task (i.e., inner-level optimization). They meta-learn a preconditioning matrix, called a preconditioner, which is utilized to precondition the gradient. The preconditioner was kept static in most of the previous works (Li et al. 2017; Lee and Choi 2018; Park and Oliva 2019; Zhao et al. 2020; Von Oswald et al. 2021). Several prior studies have devised preconditioners that adapt per inner step (Rajasegaran et al. 2020), per task (Simon et al. 2020), or both simultaneously (Kang et al. 2023). Motivated by previous works (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998), Kang et al. (2023) recently investigated constraining the preconditioner to satisfy the condition for a Riemannian metric (i.e., positive definiteness). They demonstrated that enforcing this constraint on the preconditioner was essential for improving the performance in few-shot learning. In our study, we propose a novel preconditioned gradient descent method with a meta-learned task-specific preconditioner that guarantees positive definiteness for improving performance in CDFSL.

3 Backgrounds

Task Formulation for Meta-Learning in CDFSL

In CDFSL, task $\mathcal{T}$ is formulated differently compared to traditional few-shot learning. In traditional few-shot learning, tasks are sampled from a single domain, resulting in the same form in both meta-training and meta-testing:

\text{meta-training and meta-testing: }\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}

(1)

where $\mathcal{S}_{\mathcal{T}}$ is a support set and $\mathcal{Q}_{\mathcal{T}}$ is a query set. On the other hand, in CDFSL, tasks are sampled from multiple domains, leading to different forms in meta-training and meta-testing:

\begin{split}&\text{meta-training: }\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\},\\ &\text{meta-testing: }\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\},\end{split}

(2)

where $d_{\mathcal{T}}$ is a domain label indicating the domain from which the task was sampled. For instance, the domain label is an integer between $1$ and $K$ for $K$ domains (i.e., $1\leq d_{\mathcal{T}}\leq K$ ).

Bi-level Optimization in Meta-Learning

Bi-level optimization (Rajeswaran et al. 2019) consists of two levels of main optimization processes: inner-level and outer-level optimizations. Let $f_{\theta(\phi)}$ be a model, where the parameter $\theta(\phi)$ is parameterized by the meta-parameter $\phi$ . For a task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$ , the inner-level optimization is defined as:

\begin{split}\theta_{\mathcal{T},T}(\phi)=\theta_{\mathcal{T},0}(\phi)-\alpha_{\text{in}}\cdot\sum_{t=0}^{T-1}\nabla_{\theta}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t}(\phi);\mathcal{S}_{\mathcal{T}})\end{split}

(3)

where $\theta_{\mathcal{T},0}(\phi)=\theta(\phi)$ , $\alpha_{\text{in}}$ is the learning rate for the inner-level optimization, $\mathcal{L}_{\text{in}}$ is the inner-level’s loss function, and $T$ is the total number of gradient descent steps. With $\mathcal{Q}_{\mathcal{T}}$ in each task, we can define outer-level optimization as:

\phi\leftarrow\phi-\alpha_{\text{out}}\cdot\nabla_{\phi}\mathbb{E}_{\mathcal{T}}\Big{[}\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T}(\phi);\mathcal{Q}_{\mathcal{T}})\Big{]}

(4)

where $\alpha_{\text{out}}$ is the learning rate for the outer-level optimization, and $\mathcal{L}_{\text{out}}$ is the outer-level’s loss function.
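As a rough illustration of Eqs. (3) and (4), the sketch below runs bi-level optimization on a deliberately tiny scalar problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the meta-parameter $\phi$ is a scalar initialization (so $\theta(\phi)=\phi$), both losses are quadratics around hypothetical support and query targets, and the meta-gradient uses the closed-form derivative that holds only for this quadratic inner loss.

```python
def inner_adapt(theta0, support_target, lr_in, steps):
    """Unrolled inner loop (Eq. 3) for L_in(theta) = 0.5 * (theta - support_target)**2."""
    theta = theta0
    for _ in range(steps):
        grad = theta - support_target      # dL_in / dtheta
        theta = theta - lr_in * grad
    return theta


def outer_step(phi, tasks, lr_in, lr_out, steps):
    """One outer update (Eq. 4), assuming theta(phi) = phi (a meta-learned initialization)."""
    meta_grad = 0.0
    for support_target, query_target in tasks:
        theta_T = inner_adapt(phi, support_target, lr_in, steps)
        # For this quadratic inner loss, d theta_T / d phi = (1 - lr_in) ** steps.
        d_theta_d_phi = (1.0 - lr_in) ** steps
        # L_out(theta_T) = 0.5 * (theta_T - query_target)**2, chain rule through theta_T.
        meta_grad += (theta_T - query_target) * d_theta_d_phi
    return phi - lr_out * meta_grad / len(tasks)


def mean_query_loss(phi, tasks, lr_in, steps):
    """Expected outer loss over the toy task set."""
    losses = [0.5 * (inner_adapt(phi, s, lr_in, steps) - q) ** 2 for s, q in tasks]
    return sum(losses) / len(losses)


# Hypothetical tasks: (support target, query target) pairs.
tasks = [(1.0, 1.2), (3.0, 2.7), (-2.0, -1.5)]
phi = 0.0
loss_before = mean_query_loss(phi, tasks, lr_in=0.1, steps=5)
for _ in range(200):
    phi = outer_step(phi, tasks, lr_in=0.1, lr_out=0.5, steps=5)
loss_after = mean_query_loss(phi, tasks, lr_in=0.1, steps=5)
print(loss_before, loss_after)
```

In practice the derivative of $\theta_{\mathcal{T},T}$ with respect to $\phi$ is obtained by automatic differentiation through the unrolled inner loop rather than in closed form.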

Preconditioned Gradient Descent (PGD)

PGD is a technique that minimizes empirical risk by using a gradient update with a preconditioner that re-scales the geometry of the parameter space. Given model parameters $\theta$ and task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$ , we can formally define the preconditioned gradient descent with a preconditioner $\mathbf{P}$ as follows:

\theta_{\mathcal{T},t}=\theta_{\mathcal{T},t-1}-\alpha\cdot\mathbf{P}\nabla_{\theta}\mathcal{L}(\theta_{\mathcal{T},t-1};\mathcal{S}_{\mathcal{T}}),\,\,\,\,t=1,\cdots

(5)

where $\theta_{\mathcal{T},0}=\theta$ , $\mathcal{L}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}})$ is the empirical loss associated with the task $\mathcal{T}$ , and $\theta_{\mathcal{T},t}$ denotes the parameters after $t$ update steps. When the preconditioner $\mathbf{P}$ is chosen to be the identity matrix $\mathbf{I}$ , Eq. (5) becomes the standard Gradient Descent (GD). The choice of $\mathbf{P}$ to leverage second-order information offers several options, including the inverse Fisher information matrix $\mathbf{F}^{-1}$ , leading to the Natural Gradient Descent (NGD) (Amari 1998), the inverse Hessian matrix $\mathbf{H}^{-1}$ , corresponding to Newton’s method (LeCun et al. 2002), and the diagonal matrix estimation with the past gradients, which results in adaptive gradient methods (Duchi, Hazan, and Singer 2011; Kingma and Ba 2014). These preconditioners often reduce the effect of pathological curvature and speed up the optimization (Amari et al. 2020).
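A minimal sketch of Eq. (5) on a toy problem may make the role of $\mathbf{P}$ concrete. The ill-conditioned quadratic loss, its Hessian `H`, the step sizes, and all function names below are assumptions chosen for illustration; they do not come from the paper. The sketch compares $\mathbf{P}=\mathbf{I}$ (plain GD) with $\mathbf{P}=\mathbf{H}^{-1}$ (a Newton step).

```python
def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


def pgd(theta0, grad_fn, preconditioner, lr, steps):
    """Preconditioned gradient descent: theta <- theta - lr * P @ grad(theta)."""
    theta = list(theta0)
    for _ in range(steps):
        direction = matvec(preconditioner, grad_fn(theta))
        theta = [t - lr * d for t, d in zip(theta, direction)]
    return theta


# Toy loss L(theta) = 0.5 * theta^T H theta with strongly unequal curvature.
H = [[100.0, 0.0], [0.0, 1.0]]


def loss(theta):
    return 0.5 * sum(t * g for t, g in zip(theta, matvec(H, theta)))


def grad(theta):
    return matvec(H, theta)


identity = [[1.0, 0.0], [0.0, 1.0]]
inverse_hessian = [[0.01, 0.0], [0.0, 1.0]]

# P = I: the step size is limited by the steepest direction, so the flat one barely moves.
theta_gd = pgd([1.0, 1.0], grad, identity, lr=0.01, steps=10)
# P = H^{-1}: the rescaled geometry allows a unit step that reaches the minimum at once.
theta_newton = pgd([1.0, 1.0], grad, inverse_hessian, lr=1.0, steps=1)
print(loss(theta_gd), loss(theta_newton))
```

The two runs use different step sizes on purpose: with the identity preconditioner a step size near 1 would diverge along the high-curvature coordinate.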

Dataset Classifier

In CDFSL, Dataset Classifier (Triantafillou et al. 2021) reads a support set in a few-shot task and predicts from which of the training datasets it was sampled. Formally, let $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}$ be a train task sampled from $K$ domains. Let $g$ be a dataset classifier that takes the support set $\mathcal{S}_{\mathcal{T}}$ as input and generates logits as follows:

g(\mathcal{S}_{\mathcal{T}})=z_{\mathcal{T}}=(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K})\in\mathbb{R}^{K}

(6)

In (Triantafillou et al. 2021), the dataset classifier $g$ is trained to minimize the cross-entropy loss for the dataset classification problem (i.e., classification problem with $K$ classes).

4 Method

In this section, we propose a novel adaptation mechanism named Task-Specific Preconditioned gradient descent (TSP). We first introduce Domain-Specific Preconditioner (DSP) and task-coefficients. Then, we describe the construction of Task-Specific Preconditioner using DSP and task-coefficients. Lastly, we show the positive definiteness of Task-Specific Preconditioner, which establishes it as a valid Riemannian metric. The algorithm for the training and testing procedures is provided in Appendix B.

4.1 Domain-Specific Preconditioner (DSP)

Consider $L$ task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}_{l=1}^{L}$ . For $K$ domains, we first define meta-parameters $\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$ as follows:

\mathcal{M}_{k}=\{\mathbf{M}^{l}_{k}\in\mathbb{R}^{m_{l}\times m_{l}}\}_{l=1}^{L},\,\,\,\,k=1,\cdots,K

(7)

Then, for all $l$ , we define Domain-Specific Preconditioners (DSPs) $\mathbf{P}^{l}_{k}$ using the meta-parameters as follows:

\mathbf{P}^{l}_{k}=\mathbf{M}_{k}^{l\mathbf{T}}\mathbf{M}^{l}_{k}+\mathbf{I},\,\,\,\,k=1,\cdots,K

(8)
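The construction in Eq. (8) can be checked numerically with a small sketch. The matrix `M` below is an arbitrary illustrative value rather than a meta-learned parameter, and the helper names are hypothetical. The point is that $x^{\mathbf{T}}\mathbf{P}x=\lVert\mathbf{M}x\rVert^{2}+\lVert x\rVert^{2}$, so the quadratic form never falls below $\lVert x\rVert^{2}$.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Product of two nested-list matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def make_dsp(meta_param):
    """Build P = M^T M + I from a square meta-parameter M, as in Eq. (8)."""
    gram = matmul(transpose(meta_param), meta_param)
    n = len(gram)
    return [[gram[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]


def quadratic_form(matrix, x):
    """Compute x^T A x."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))


M = [[0.5, -1.0], [2.0, 0.0]]   # arbitrary illustrative meta-parameter
P = make_dsp(M)
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, -3.0], [-2.0, 0.5]):
    # The quadratic form is bounded below by the squared norm of x.
    print(x, quadratic_form(P, x), sum(v * v for v in x))
```

Because the bound holds for every $\mathbf{M}$, positive definiteness is preserved no matter how the outer-level updates move the meta-parameters.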

We compare various DSP designs (See Table 3) in Section 5.3 and choose the form of Eq. (8). Through bi-level optimization, DSPs can be meta-learned as follows.

Inner-level Optimization

For each train task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}$ , in the inner-level optimization, we optimize the task-specific parameters $\theta$ through preconditioned gradient descent using $\mathbf{P}^{l}_{d_{\mathcal{T}}}$ , updating $\theta$ as follows:

\theta^{l}_{\mathcal{T},T}=\theta^{l}_{\mathcal{T},0}-\alpha_{\text{in}}\cdot\sum^{T-1}_{t=0}\mathbf{P}^{l}_{d_{\mathcal{T}}}\nabla_{\theta^{l}_{\mathcal{T},t}}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}}),

(9)

where $\theta^{l}_{\mathcal{T},0}=\theta^{l}$ , $\alpha_{\text{in}}$ is the learning rate for the inner-level optimization, $T$ is the total number of gradient descent steps, and $\mathcal{L}_{\text{in}}$ is the inner-level’s loss function.

Outer-level Optimization

In the outer-level optimization, we meta-learn meta-parameters $\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$ as follows:

\mathcal{M}_{k}\leftarrow\mathcal{M}_{k}-\alpha_{\text{out}}\cdot\nabla_{\mathcal{M}_{k}}\mathbb{E}_{\mathcal{T}}\Big{[}\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})\Big{]},\,\,\,\,k=1,\cdots,K

(10)

where $\alpha_{\text{out}}$ is the learning rate for outer-level optimization and $\mathcal{L}_{\text{out}}$ is the outer-level’s loss function.

4.2 Task-coefficients

Consider the dataset classifier $g$ . Given a train task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}$ , we define task-coefficients $p_{\mathcal{T},1},\cdots,p_{\mathcal{T},K}$ as follows:

(p_{\mathcal{T},1},\cdots,p_{\mathcal{T},K})=\text{Softmax}(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K})

(11)

where $g(\mathcal{S}_{\mathcal{T}})=(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K})$ . Note that we use the sigmoid function instead of softmax in the single-domain setting because the output dimension of the dataset classifier is one. While Triantafillou et al. (2021) update the parameters of $g$ to minimize only the cross-entropy loss $\mathcal{L}_{\text{CE}}$ with respect to the dataset label $d_{\mathcal{T}}$ , we train the dataset classifier $g$ to minimize the following augmented loss:

\mathcal{L}_{\text{CE}}+\lambda\cdot\mathcal{L}_{\text{Aux}}

(12)

where $\lambda$ is a regularization parameter and $\mathcal{L}_{\text{Aux}}$ is the auxiliary loss, defined as follows:

\mathcal{L}_{\text{Aux}}=\mathbb{E}_{\mathcal{T}}\Big{[}\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})\Big{]}

(13)

Here, task-specific parameters $\theta^{l}_{\mathcal{T},T}$ can be obtained as follows:

\theta^{l}_{\mathcal{T},T}=\theta^{l}_{\mathcal{T},0}-\alpha_{\text{in}}\cdot\sum^{T-1}_{t=0}\sum^{K}_{k=1}p_{\mathcal{T},k}\cdot\mathbf{P}^{l}_{k}\nabla_{\theta^{l}_{\mathcal{T},t}}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}})

(14)

where $\mathbf{P}^{l}_{k}$ is the $l$ -th DSP of domain $k$ . In Eq. (12), the cross-entropy loss guides the dataset classifier to prioritize the ground-truth domain of the support set. Concurrently, the auxiliary loss guides the classifier toward the DSPs that minimize any adverse effect of the inner-level optimization on query-set performance.
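The following sketch shows how Eqs. (11) and (12) fit together for a single task. The logits, the auxiliary-loss value, and the function names are hypothetical placeholders: in the actual method the logits come from the dataset classifier $g$ and the auxiliary loss comes from the query set after the inner loop of Eq. (14). Note that the code uses a 0-based domain index, whereas the text numbers domains from $1$ to $K$.

```python
import math


def softmax(logits):
    """Numerically stable softmax, giving the task-coefficients of Eq. (11)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dataset_classifier_loss(logits, domain_index, aux_loss, lam=0.1):
    """Augmented loss of Eq. (12): cross-entropy on the domain label plus lam * L_Aux.

    aux_loss is passed in as a plain number here; computing it requires the
    query-set loss after the preconditioned inner loop, which is outside this sketch.
    """
    task_coeffs = softmax(logits)
    cross_entropy = -math.log(task_coeffs[domain_index])
    return cross_entropy + lam * aux_loss


logits = [2.0, 0.5, -1.0]                 # hypothetical output of g for K = 3 domains
task_coeffs = softmax(logits)
total_loss = dataset_classifier_loss(logits, domain_index=0, aux_loss=1.3)
print(task_coeffs, total_loss)
```

The default `lam=0.1` mirrors the value of $\lambda$ reported in the implementation details; the other numbers are arbitrary.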

Table 1: Performance comparison to state-of-the-art methods in a multi-domain setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP^† denotes TSP applied on TSA. TSP^†† denotes TSP applied on TA²-Net.

Test Dataset	SUR	URT	FLUTE	tri-M	URL	TSA	TA²-Net	MOKD	TSP^†	TSP^††
ImageNet	56.2 $\pm$ 1.0	56.8 $\pm$ 1.1	58.6 $\pm$ 1.0	51.8 $\pm$ 1.1	58.8 $\pm$ 1.1	59.5 $\pm$ 1.0	59.6 $\pm$ 1.0	57.3 $\pm$ 1.1	60.5 $\pm$ 1.0	60.7 $\pm$ 1.0
Omniglot	94.1 $\pm$ 0.4	94.2 $\pm$ 0.4	92.0 $\pm$ 0.6	93.2 $\pm$ 0.5	94.5 $\pm$ 0.4	94.9 $\pm$ 0.4	95.5 $\pm$ 0.4	94.2 $\pm$ 0.5	95.6 $\pm$ 0.4	96.0 $\pm$ 0.4
Aircraft	85.5 $\pm$ 0.5	85.8 $\pm$ 0.5	82.8 $\pm$ 0.7	87.2 $\pm$ 0.5	89.4 $\pm$ 0.4	89.9 $\pm$ 0.4	90.5 $\pm$ 0.4	88.4 $\pm$ 0.5	90.5 $\pm$ 0.4	91.2 $\pm$ 0.4
Birds	71.0 $\pm$ 1.0	76.2 $\pm$ 0.8	75.3 $\pm$ 0.8	79.2 $\pm$ 0.8	80.7 $\pm$ 0.8	81.1 $\pm$ 0.8	81.4 $\pm$ 0.8	80.4 $\pm$ 0.8	82.3 $\pm$ 0.7	82.5 $\pm$ 0.7
Textures	71.0 $\pm$ 0.8	71.6 $\pm$ 0.7	71.2 $\pm$ 0.8	68.8 $\pm$ 0.8	77.2 $\pm$ 0.7	77.5 $\pm$ 0.7	77.4 $\pm$ 0.7	76.5 $\pm$ 0.7	78.6 $\pm$ 0.6	79.1 $\pm$ 0.6
Quick Draw	81.8 $\pm$ 0.6	82.4 $\pm$ 0.6	77.3 $\pm$ 0.7	79.5 $\pm$ 0.7	82.5 $\pm$ 0.6	81.7 $\pm$ 0.6	82.5 $\pm$ 0.6	82.2 $\pm$ 0.6	83.0 $\pm$ 0.7	83.2 $\pm$ 0.6
Fungi	64.3 $\pm$ 0.9	64.0 $\pm$ 1.0	48.5 $\pm$ 1.0	58.1 $\pm$ 1.1	68.1 $\pm$ 0.9	66.3 $\pm$ 0.8	66.3 $\pm$ 0.9	68.6 $\pm$ 1.0	68.6 $\pm$ 0.9	69.7 $\pm$ 0.8
VGG Flower	82.9 $\pm$ 0.8	87.9 $\pm$ 0.6	90.5 $\pm$ 0.5	91.6 $\pm$ 0.6	92.0 $\pm$ 0.5	92.2 $\pm$ 0.5	92.6 $\pm$ 0.4	92.5 $\pm$ 0.5	93.3 $\pm$ 0.4	93.4 $\pm$ 0.4
Traffic Sign	51.0 $\pm$ 1.1	48.2 $\pm$ 1.1	63.0 $\pm$ 1.0	58.4 $\pm$ 1.1	63.3 $\pm$ 1.1	82.8 $\pm$ 1.0	87.4 $\pm$ 0.8	64.5 $\pm$ 1.1	88.5 $\pm$ 0.7	89.4 $\pm$ 0.8
MSCOCO	52.0 $\pm$ 1.1	51.5 $\pm$ 1.1	52.8 $\pm$ 1.1	50.0 $\pm$ 1.0	57.3 $\pm$ 1.0	57.6 $\pm$ 1.0	57.9 $\pm$ 0.9	55.5 $\pm$ 1.0	58.5 $\pm$ 0.9	59.8 $\pm$ 0.9
MNIST	94.3 $\pm$ 0.4	90.6 $\pm$ 0.5	96.2 $\pm$ 0.3	95.6 $\pm$ 0.5	94.7 $\pm$ 0.4	96.7 $\pm$ 0.4	97.0 $\pm$ 0.4	95.1 $\pm$ 0.4	97.1 $\pm$ 0.3	97.1 $\pm$ 0.4
CIFAR-10	66.5 $\pm$ 0.9	67.0 $\pm$ 0.8	75.4 $\pm$ 0.8	78.6 $\pm$ 0.7	74.2 $\pm$ 0.8	82.9 $\pm$ 0.7	82.1 $\pm$ 0.8	72.8 $\pm$ 0.8	83.5 $\pm$ 0.7	83.7 $\pm$ 0.8
CIFAR-100	56.9 $\pm$ 1.1	57.3 $\pm$ 1.0	62.0 $\pm$ 1.0	67.1 $\pm$ 1.0	63.5 $\pm$ 1.0	70.4 $\pm$ 0.9	70.9 $\pm$ 0.9	63.9 $\pm$ 1.0	71.3 $\pm$ 1.0	72.2 $\pm$ 0.9
Avg Seen	75.9	77.4	74.5	76.2	80.4	80.4	80.7	80.0	81.6	82.0
Avg Unseen	64.1	62.9	69.9	69.9	70.6	78.1	79.1	70.3	79.8	80.4
Avg All	71.3	71.8	72.7	73.8	76.6	79.5	80.1	76.3	80.9	81.4
Avg Rank	8.8	8.2	8.0	7.8	5.5	4.3	3.2	5.8	1.9	1.0

4.3 Task-Specific Preconditioner

Given a test task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$ , we define Task-Specific Preconditioner $\mathbf{P}^{l}_{\mathcal{T}}$ as follows:

\mathbf{P}^{l}_{\mathcal{T}}=\sum^{K}_{k=1}p_{\mathcal{T},k}\cdot\mathbf{P}^{l}_{k},\,\,\,\,l=1,\cdots,L

(15)

where $\mathbf{P}^{l}_{k}$ is the $l$ -th DSP of domain $k$ , and $p_{\mathcal{T},k}$ is the task-coefficient for the given task $\mathcal{T}$ and domain $k$ . By employing $\mathbf{P}^{l}_{\mathcal{T}}$ as the preconditioning matrix, we can define Task-Specific Preconditioned gradient descent (TSP), as follows:

\theta^{l}_{\mathcal{T},T}=\theta^{l}_{\mathcal{T},0}-\beta\cdot\sum^{T-1}_{t=0}\mathbf{P}^{l}_{\mathcal{T}}\nabla_{\theta^{l}_{\mathcal{T},t}}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}}),

(16)

where $\beta$ is the learning rate used to adapt the task-specific parameters.
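Eqs. (15) and (16) can be sketched for a single layer as follows. The two DSPs, the task-coefficients, the gradient, and the step size are illustrative values only, and treating the layer parameter and its gradient as small square matrices multiplied on the left by the preconditioner is an assumption that follows the shapes given in Section 4.1.

```python
def matmul(a, b):
    """Product of two nested-list matrices."""
    b_cols = [list(col) for col in zip(*b)]
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def combine_preconditioners(dsps, task_coeffs):
    """Eq. (15): P_T = sum_k p_k * P_k for one layer."""
    n = len(dsps[0])
    return [
        [sum(p * dsp[i][j] for p, dsp in zip(task_coeffs, dsps)) for j in range(n)]
        for i in range(n)
    ]


def tsp_step(theta, grad, preconditioner, lr):
    """One step of Eq. (16): theta <- theta - lr * P_T @ grad."""
    direction = matmul(preconditioner, grad)
    return [
        [t - lr * d for t, d in zip(t_row, d_row)]
        for t_row, d_row in zip(theta, direction)
    ]


# Two illustrative DSPs of the form M^T M + I:
# M = [[1, 0], [0, 0]] gives [[2, 0], [0, 1]]; M = [[1, 1], [0, 1]] gives [[2, 1], [1, 3]].
dsps = [[[2.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 3.0]]]
task_coeffs = [0.75, 0.25]                # hypothetical softmax output, sums to 1
P_task = combine_preconditioners(dsps, task_coeffs)

theta = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical layer parameter
grad = [[1.0, 0.0], [0.0, 2.0]]           # hypothetical support-set gradient
theta_new = tsp_step(theta, grad, P_task, lr=0.5)
print(P_task, theta_new)
```

When the task-coefficients are one-hot, the combination reduces to the single DSP of the selected domain, which matches the inner-level update of Eq. (9).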

4.4 Positive Definiteness of TSP’s Preconditioner

A preconditioner satisfying positive definiteness ensures a valid Riemannian metric, which represents the geometric characteristics of the parameter space (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998). Task-Specific Preconditioner $\mathbf{P}^{l}_{\mathcal{T}}$ is designed to be a positive definite matrix, which is verified in Theorem 1.

Theorem 1.

Let $p_{k}\in[0,1],k=1,\cdots,K$ , be the task-coefficients satisfying $\sum^{K}_{k=1}p_{k}=1$ . For the Domain-Specific Preconditioners $\mathbf{P}_{k}\in\mathbb{R}^{m\times m},k=1,\cdots,K$ , Task-Specific Preconditioner $\mathbf{P}$ defined as $\mathbf{P}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}$ is positive definite.

The proof is provided in Appendix C. Drawing from prior research (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998), a preconditioner satisfying positive definiteness promotes gradients to point toward the steepest descent direction while avoiding undesirable paths in the parameter space. As shown in Figure 1(b), positive definiteness improves CDFSL performance, especially in unseen domains. In Section 6, we will discuss why this property helps in CDFSL.
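For intuition, one possible argument for Theorem 1 is sketched below. It is not necessarily the proof given in Appendix C; it assumes each DSP has the form of Eq. (8), $\mathbf{P}_{k}=\mathbf{M}_{k}^{\mathbf{T}}\mathbf{M}_{k}+\mathbf{I}$, and evaluates the quadratic form for an arbitrary nonzero vector $x\in\mathbb{R}^{m}$.

```latex
\begin{aligned}
x^{\mathbf{T}}\mathbf{P}x
&=\sum_{k=1}^{K}p_{k}\,x^{\mathbf{T}}\big(\mathbf{M}_{k}^{\mathbf{T}}\mathbf{M}_{k}+\mathbf{I}\big)x\\
&=\sum_{k=1}^{K}p_{k}\big(\lVert\mathbf{M}_{k}x\rVert_{2}^{2}+\lVert x\rVert_{2}^{2}\big)\\
&\geq\lVert x\rVert_{2}^{2}\sum_{k=1}^{K}p_{k}
=\lVert x\rVert_{2}^{2}>0\quad\text{for all }x\neq 0 .
\end{aligned}
```

Each $\mathbf{P}_{k}$ is symmetric, so $\mathbf{P}$ is symmetric as well, and the inequality uses only $p_{k}\geq 0$ and $\sum_{k}p_{k}=1$.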

5 Experiments

5.1 Experimental Setup

Implementation Details

In the experiments, we use the Meta-Dataset (Triantafillou et al. 2019), which is the standard benchmark for evaluating the performance of CDFSL. To demonstrate the effectiveness of TSP as an adaptation mechanism, we apply it to the state-of-the-art CDFSL methods, TSA (Li, Liu, and Bilen 2022) and TA²-Net (Guo et al. 2023), which are publicly available as open-source. Following previous studies (Bateni et al. 2022; Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), we adopt ResNet-18 as the backbone for the feature extractor. In all experiments, we follow the standard protocol described in (Triantafillou et al. 2019). For the dataset classifier loss, the weighting factor $\lambda$ is set to 0.1, as it performs best compared to other values, as shown in Appendix D.1. Details of the Meta-Dataset, hyper-parameters, and additional implementation are available in Appendix E.

Baselines

For the baselines, we compare our methods to the state-of-the-art CDFSL methods, including BOHB (Saikia, Brox, and Schmid 2020), SUR (Dvornik, Schmid, and Mairal 2020), URT (Liu et al. 2020), Simple-CNAPS (Bateni et al. 2020), FLUTE (Triantafillou et al. 2021), tri-M (Liu et al. 2021), URL (Li, Liu, and Bilen 2021), TSA (Li, Liu, and Bilen 2022), TA²-Net (Guo et al. 2023), ALFA (Baik et al. 2023)+Proto-MAML, GAP+Proto-MAML (Kang et al. 2023), and MOKD (Tian et al. 2024).

Table 2: Performance comparison to state-of-the-art methods in a single-domain setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP^† denotes TSP applied on TSA. TSP^†† denotes TSP applied on TA²-Net.

Test Dataset	ALFA+Proto-MAML	BOHB	GAP+Proto-MAML	FLUTE	TSA	TA²-Net	MOKD	TSP^†	TSP^††
ImageNet	52.8 $\pm$ 1.1	51.9 $\pm$ 1.1	56.7	46.9 $\pm$ 1.1	59.5 $\pm$ 1.1	59.3 $\pm$ 1.1	57.3 $\pm$ 1.1	60.1 $\pm$ 1.1	60.6 $\pm$ 1.1
Omniglot	61.9 $\pm$ 1.5	67.6 $\pm$ 1.2	77.6	61.6 $\pm$ 1.4	78.2 $\pm$ 1.2	81.1 $\pm$ 1.1	70.9 $\pm$ 1.3	83.3 $\pm$ 1.1	85.2 $\pm$ 1.1
Aircraft	63.4 $\pm$ 1.1	54.1 $\pm$ 0.9	68.5	48.5 $\pm$ 1.0	72.2 $\pm$ 1.0	72.6 $\pm$ 0.9	59.8 $\pm$ 1.0	73.2 $\pm$ 1.0	73.5 $\pm$ 1.1
Birds	69.8 $\pm$ 1.1	70.7 $\pm$ 0.9	73.5	47.9 $\pm$ 1.0	74.9 $\pm$ 0.9	75.1 $\pm$ 0.9	73.6 $\pm$ 0.9	76.0 $\pm$ 0.9	76.6 $\pm$ 0.9
Textures	70.8 $\pm$ 0.9	68.3 $\pm$ 0.8	71.4	63.8 $\pm$ 0.8	77.3 $\pm$ 0.7	76.8 $\pm$ 0.8	76.1 $\pm$ 0.7	78.2 $\pm$ 0.7	78.3 $\pm$ 0.7
Quick Draw	59.2 $\pm$ 1.2	50.3 $\pm$ 1.0	65.4	57.5 $\pm$ 1.0	67.6 $\pm$ 0.9	68.4 $\pm$ 0.9	61.2 $\pm$ 1.0	70.8 $\pm$ 0.9	71.5 $\pm$ 0.9
Fungi	41.5 $\pm$ 1.2	41.4 $\pm$ 1.1	38.6	31.8 $\pm$ 1.0	44.7 $\pm$ 1.0	45.3 $\pm$ 1.0	47.0 $\pm$ 1.1	46.6 $\pm$ 1.0	47.0 $\pm$ 1.0
VGG Flower	86.0 $\pm$ 0.8	87.3 $\pm$ 0.6	86.8	80.1 $\pm$ 0.9	90.9 $\pm$ 0.6	91.0 $\pm$ 0.6	88.5 $\pm$ 0.6	91.8 $\pm$ 0.5	92.2 $\pm$ 0.6
Traffic Sign	60.8 $\pm$ 1.3	51.8 $\pm$ 1.0	66.9	46.5 $\pm$ 1.1	82.5 $\pm$ 0.8	84.1 $\pm$ 0.7	61.6 $\pm$ 1.1	87.5 $\pm$ 0.8	88.7 $\pm$ 0.8
MSCOCO	48.1 $\pm$ 1.1	48.0 $\pm$ 1.0	46.8	41.4 $\pm$ 1.0	59.0 $\pm$ 1.0	58.0 $\pm$ 1.0	55.3 $\pm$ 1.0	59.4 $\pm$ 1.0	58.6 $\pm$ 1.0
MNIST	-	-	94.0	80.8 $\pm$ 0.8	93.9 $\pm$ 0.6	94.9 $\pm$ 0.5	88.3 $\pm$ 0.7	94.5 $\pm$ 0.5	95.3 $\pm$ 0.6
CIFAR-10	-	-	74.5	65.4 $\pm$ 0.8	82.1 $\pm$ 0.7	82.0 $\pm$ 0.7	72.2 $\pm$ 0.8	83.1 $\pm$ 0.5	83.2 $\pm$ 0.7
CIFAR-100	-	-	63.2	52.7 $\pm$ 1.1	70.7 $\pm$ 0.9	70.8 $\pm$ 0.9	63.1 $\pm$ 1.0	71.2 $\pm$ 0.9	72.8 $\pm$ 0.9
Avg Seen	52.8	51.9	56.7	46.9	59.5	59.3	57.3	60.1	60.6
Avg Unseen	62.4	59.9	68.9	56.5	74.5	75.0	68.1	76.3	76.9
Avg All	61.4	59.1	68.0	55.8	73.3	73.8	67.3	75.0	75.7
Avg Rank	7.0	7.5	6.1	8.9	3.7	3.4	5.1	2.0	1.2

5.2 Performance Comparison to State-of-The-Art Methods

Following the experimental setup in (Li, Liu, and Bilen 2022), we first evaluate our method using multi-domain and single-domain feature extractors in the Varying-Way Varying-Shot setting (i.e., the multi-domain and single-domain settings). Then, we assess our approach with the multi-domain feature extractor in the more challenging Varying-Way Five-Shot and Five-Way One-Shot settings. We provide the performance comparison results for the Varying-Way Five-Shot and Five-Way One-Shot settings in Appendix F.

Multi-Domain Setting

In Table 1, we evaluate TSP by applying it to TSA and TA²-Net, both of which employ URL (Li, Liu, and Bilen 2021) as the multi-domain feature extractor. We report average accuracies over seen, unseen, and all domains, along with average rank following the previous works (Liu et al. 2021; Li, Liu, and Bilen 2022; Guo et al. 2023). TSP^† denotes TSP applied on TSA, while TSP^†† indicates TSP applied on TA²-Net. TSP^† outperforms the previous state-of-the-art methods on 11 out of 13 datasets, and TSP^†† achieves the best results on all datasets. For example, TSP^†† outperforms the state-of-the-art method (TA²-Net) by 1.7%, 3.4%, 2.0%, and 1.9% on Textures, Fungi, Traffic Sign, and MSCOCO, respectively. These results imply that TSP can construct a desirable task-specific optimizer that effectively adapts the task-specific parameters for a given target task.
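The summary rows of Table 1 can be reproduced from the per-dataset accuracies with a short sketch. The seen/unseen split used below (the first eight datasets seen, the last five unseen) is an assumption based on the standard Meta-Dataset multi-domain protocol and on the eight domains listed in Table 5; the accuracies are the TSP^†† column of Table 1.

```python
# TSP^†† accuracies copied from Table 1.
tsp_ta2net = {
    "ImageNet": 60.7, "Omniglot": 96.0, "Aircraft": 91.2, "Birds": 82.5,
    "Textures": 79.1, "Quick Draw": 83.2, "Fungi": 69.7, "VGG Flower": 93.4,
    "Traffic Sign": 89.4, "MSCOCO": 59.8, "MNIST": 97.1,
    "CIFAR-10": 83.7, "CIFAR-100": 72.2,
}

# Assumed split: domains used for meta-training versus held-out domains.
SEEN = ["ImageNet", "Omniglot", "Aircraft", "Birds",
        "Textures", "Quick Draw", "Fungi", "VGG Flower"]
UNSEEN = ["Traffic Sign", "MSCOCO", "MNIST", "CIFAR-10", "CIFAR-100"]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


avg_seen = mean(tsp_ta2net[name] for name in SEEN)
avg_unseen = mean(tsp_ta2net[name] for name in UNSEEN)
avg_all = mean(tsp_ta2net.values())
print(round(avg_seen, 1), round(avg_unseen, 1), round(avg_all, 1))
```

The rounded values agree with the Avg Seen, Avg Unseen, and Avg All entries reported for TSP^†† in Table 1.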

Single-Domain Setting

We evaluate TSP by applying it to TSA and TA²-Net, both of which employ the single-domain feature extractor pretrained solely on the ImageNet dataset. In Table 2, TSP^†† achieves the best results on 12 out of 13 datasets, while TSP^† leads on the remaining one. Compared to recently proposed meta-learning methods based on PGD, such as Approximate GAP $+$ Proto-MAML and GAP $+$ Proto-MAML (Kang et al. 2023), both TSP^† and TSP^†† consistently outperform them across all 13 datasets by a significant margin. Furthermore, TSP^†† outperforms the previous best methods by a clear margin on several datasets such as Quick Draw ( $+3.1\%$ ), Omniglot ( $+4.1\%$ ), and Traffic Sign ( $+4.6\%$ ). Despite being trained only on a single dataset, TSP improves performance by effectively constructing a task-specific optimizer tailored to the target task.

5.3 Ablation Studies

In this section, all ablation studies are performed using TSP^† in the multi-domain setting, which excludes any effects originating from the RL model in TSP^††. Additional ablation studies are provided in Appendix D.

Matrix Design for DSP

Table 3: Performance comparison of three TSPs with different DSP designs.

DSP designs	$\mathbf{LL^{T}}$	$\mathbf{LL^{T}+I}$	$\mathbf{M^{T}M+I}$
Avg Seen	80.8	81.2	81.6
Avg Unseen	79.0	79.4	79.8
Avg All	80.1	80.5	80.9

To design the Domain-Specific Preconditioner (DSP), we consider three matrix designs that guarantee positive definiteness. The first one is the product of a real-valued lower triangular matrix and its transpose (i.e., $\mathbf{LL^{T}}$), where the lower triangular matrix $\mathbf{L}$ is constrained to have positive diagonals. This form is commonly known as the Cholesky factorization (Horn and Johnson 2012). The second one is the sum of $\mathbf{LL^{T}}$ and the identity matrix (i.e., $\mathbf{LL^{T}+I}$). The last one is the sum of the Gram matrix (Horn and Johnson 2012) and the identity matrix (i.e., $\mathbf{M^{T}}\mathbf{M}+\mathbf{I}$). In Table 3, we compare three TSPs with these three DSP designs. Among them, the Gram matrix design achieves the highest average accuracies in both seen and unseen domains. Therefore, we choose the Gram matrix design for DSP.
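The three designs can be illustrated numerically. The following is a minimal pure-Python sketch, not the authors' implementation: the $2\times 2$ matrices `L` and `M`, the probe vectors, and all helper names are hypothetical choices made for illustration. It evaluates the quadratic form $\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}$ on a few probe vectors, which illustrates (but does not prove) positive definiteness.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def add_identity(a):
    return [[v + (1.0 if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(a)]


def quad_form(p, x):
    """Return the quadratic form x^T P x."""
    n = len(x)
    return sum(x[i] * p[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical parameters: L is lower triangular with a positive diagonal,
# M is an unconstrained square matrix (chosen singular on purpose).
L = [[0.5, 0.0], [0.3, 0.2]]
M = [[1.0, 2.0], [2.0, 4.0]]

gram = matmul(transpose(M), M)  # M^T M alone is only positive semi-definite
designs = {
    "LL^T": matmul(L, transpose(L)),
    "LL^T + I": add_identity(matmul(L, transpose(L))),
    "M^T M + I": add_identity(gram),
}

# Smallest quadratic form over a handful of non-zero probe vectors.
probes = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [2.0, -1.0]]
min_quad = {name: min(quad_form(p, x) for x in probes)
            for name, p in designs.items()}
```

In this example the probe $\mathbf{x}=(2,-1)$ lies in the null space of `M`, so the Gram matrix alone gives a quadratic form of zero, while adding the identity raises it to $\|\mathbf{x}\|^{2}$.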

Table 4: Performance comparison of TSPs with and without a PD constraint. $\alpha$ is set to 0.1.

Preconditioner	w/ PD constraint	w/o PD constraint
Avg Seen	81.6	80.0
Avg Unseen	79.8	73.8
Avg All	80.9	77.6

Table 5: The rates of non-PD Domain-Specific Preconditioners (DSPs) after meta-training without a positive definite constraint. For the ResNet-18 backbone, there are 17 DSP preconditioners for each domain. All DSPs are initialized as $0.1\cdot\mathbf{I}$. The average rate is provided in the right column.

DSP	ImageNet	Omniglot	Aircraft	Birds	Textures	Quick Draw	Fungi	VGG Flower	Average
Non-PD rate	0.24	0.29	0.35	0.24	0.35	0.35	0.18	0.29	0.29

6 Discussion

In this section, all experiments are conducted using TSP^†.

The Necessity of Positive Definite Constraint

Even without an explicit PD constraint, one might assume that a preconditioner initialized as positive definite, such as $\alpha\cdot\mathbf{I}$, would remain positive definite throughout meta-training because of the significant role this property plays in optimization. However, as illustrated in Table 4 and Table 5, this assumption does not hold. In Table 4, we compare preconditioners with and without a PD constraint, both initialized as positive definite. Specifically, the former adopts the Task-Specific Preconditioner (see Eq. (15)), while the latter employs a Task-Specific Preconditioner with DSP designed as $\mathbf{P}^{l}_{k}=\mathbf{M}^{l}_{k}$ and initialized as $\mathbf{M}^{l}_{k}=0.1\cdot\mathbf{I}$. Evaluations are conducted using the multi-domain feature extractor (URL) in the multi-domain setting. After meta-training, DSPs without a PD constraint tend to lose positive definiteness, as shown in Table 5, leading to poor performance, as shown in Table 4. These findings underscore the necessity of explicitly constraining the preconditioner to maintain positive definiteness, as relying solely on optimization fails to preserve this crucial property.
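A non-PD rate of the kind reported in Table 5 could be computed along the following lines. This is a hedged sketch rather than the procedure used in the paper: it assumes that an unconstrained matrix is judged by its symmetric part (which determines the sign of $\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}$), uses a Cholesky attempt as the test, and operates on four made-up $2\times 2$ matrices. The function names are hypothetical.

```python
import math


def is_positive_definite(p, tol=1e-12):
    """Test x^T P x > 0 for all x != 0 via Cholesky of the symmetric part.

    Assumption: a non-symmetric P is judged by (P + P^T) / 2.
    """
    n = len(p)
    s = [[0.5 * (p[i][j] + p[j][i]) for j in range(n)] for i in range(n)]
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                if acc <= tol:  # a non-positive pivot means P is not PD
                    return False
                c[i][i] = math.sqrt(acc)
            else:
                c[i][j] = acc / c[j][j]
    return True


def non_pd_rate(preconditioners):
    """Fraction of matrices in the list that are not positive definite."""
    flags = [not is_positive_definite(p) for p in preconditioners]
    return sum(flags) / len(flags)


# Hypothetical matrices standing in for learned, unconstrained DSPs.
dsps = [
    [[0.1, 0.0], [0.0, 0.1]],   # unchanged initialization 0.1 * I: PD
    [[0.3, 0.2], [0.2, 0.4]],   # PD
    [[0.2, 0.5], [0.5, 0.1]],   # indefinite (negative determinant)
    [[-0.1, 0.0], [0.0, 0.3]],  # one negative eigenvalue
]
rate = non_pd_rate(dsps)
```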

Positive Definite DSP Designs with and without the Identity Matrix

Table 6: Performance comparison of two positive definite DSP designs with and without adding an identity matrix.

Setting	Varying-Way Varying-Shot	Varying-Way Varying-Shot	Varying-Way Five-Shot	Varying-Way Five-Shot
DSP designs	$\mathbf{LL^{T}}$	$\mathbf{LL^{T}+I}$	$\mathbf{LL^{T}}$	$\mathbf{LL^{T}+I}$
Avg Seen	80.8	81.2	76.8	76.6
Avg Unseen	79.0	79.4	72.1	71.5
Avg All	80.1	80.5	75.0	74.6

Apart from ensuring positive definiteness, a notable characteristic of our Gram matrix design $\mathbf{M^{T}}\mathbf{M}+\mathbf{I}$ is its inclusion of the identity matrix. To explore the impact of this inclusion, we compare two positive definite DSP designs: $\mathbf{LL^{T}}$ and $\mathbf{LL^{T}+I}$. We focus on these two DSP designs because $\mathbf{M^{T}M}$ alone is only positive semi-definite and thus does not guarantee positive definiteness. However, we also provide a comparison between $\mathbf{M^{T}M+I}$ and $\mathbf{M^{T}M}$ in Appendix G. The experiments are conducted using the multi-domain feature extractor (URL). In Table 6, we observe that the DSP design with the added identity matrix performs better in the Varying-Way Varying-Shot setting but worse in the Varying-Way Five-Shot setting. This outcome aligns with prior theoretical findings (Amari et al. 2020) indicating that PGD performs better than GD in noisy gradient conditions, while GD excels when gradients are accurate. With more shots, gradients tend to be more accurate due to increased data. In the Varying-Way Varying-Shot setting, where tasks typically involve more than five shots, gradients are more accurate, making GD more beneficial than in the other setting. Including the identity matrix can be viewed as a regularization of PGD towards GD. Consequently, $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$ aligns more closely with GD than $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ does, resulting in improved performance due to the abundance of shots in the Varying-Way Varying-Shot setting. Conversely, in the Varying-Way Five-Shot setting, where tasks involve fewer shots, $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ exhibits superior performance to $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$ due to the scarcity of shots.
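One way to read the statement that the identity regularizes PGD towards GD is through eigenvalues: adding $\mathbf{I}$ shifts every eigenvalue of $\mathbf{LL^{T}}$ up by one, so the ratio of the largest to the smallest eigenvalue moves towards 1, the value it takes for plain GD ($\mathbf{P}=\mathbf{I}$). The sketch below illustrates this on a hypothetical $2\times 2$ example using the closed-form eigenvalues of a symmetric $2\times 2$ matrix. The matrix and the function names are assumptions for illustration only.

```python
import math


def sym2x2_eigenvalues(p):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    a, b, d = p[0][0], p[0][1], p[1][1]
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean - radius, mean + radius


def condition_number(p):
    lo, hi = sym2x2_eigenvalues(p)
    return hi / lo


# Hypothetical LL^T obtained from L = [[2.0, 0.0], [0.5, 0.5]].
llt = [[4.0, 1.0], [1.0, 0.5]]
llt_plus_i = [[5.0, 1.0], [1.0, 1.5]]

eig_llt = sym2x2_eigenvalues(llt)
eig_llt_plus_i = sym2x2_eigenvalues(llt_plus_i)
cond_llt = condition_number(llt)
cond_llt_plus_i = condition_number(llt_plus_i)
```

In this example the condition number drops from roughly 18 to roughly 4, meaning that the preconditioned update rescales gradient directions less unevenly and therefore behaves more like an unpreconditioned GD step.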

Effectiveness of Positive Definiteness in Cross-Domain Tasks

A positive definite preconditioner is known to mitigate the negative effects of pathological loss curvature and accelerate optimization, thereby facilitating convergence (Nocedal and Wright 1999; Saad 2003; Li 2017). This leads to a consistent reduction in the objective function. However, without positive definiteness, this effect is not guaranteed, and optimization may fail to converge. In Figure 4, we compare the learning curves of PGD with and without a PD constraint across both seen and unseen domains. Without the PD constraint, PGD fails to converge in some of the seen domains and in all the unseen domains. With the PD constraint, PGD successfully converges in all the seen and unseen domains. These results suggest that, in cross-domain tasks, a PD constraint on the preconditioner is crucial for achieving convergence and is beneficial for improving performance, which is also related to Figure 1(b).
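The contrast between the two cases can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not a reproduction of Figure 4: it runs preconditioned gradient descent on a hypothetical quadratic loss $\frac{1}{2}\theta^{\mathbf{T}}\mathbf{A}\theta$ with one positive definite and one indefinite preconditioner. The matrices, step size, and function names are all made up for the example.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def loss(theta, a):
    """Quadratic loss 0.5 * theta^T A theta."""
    return 0.5 * sum(t * g for t, g in zip(theta, matvec(a, theta)))


def pgd(theta, a, p, lr, steps):
    """Run theta <- theta - lr * P * grad and record the loss at each step."""
    history = [loss(theta, a)]
    for _ in range(steps):
        grad = matvec(a, theta)   # gradient of the quadratic loss
        step = matvec(p, grad)    # preconditioned gradient
        theta = [t - lr * s for t, s in zip(theta, step)]
        history.append(loss(theta, a))
    return history


A = [[1.0, 0.0], [0.0, 10.0]]        # hypothetical ill-conditioned curvature
P_pd = [[1.0, 0.0], [0.0, 0.1]]      # positive definite preconditioner
P_indef = [[1.0, 0.0], [0.0, -0.1]]  # one negative eigenvalue

hist_pd = pgd([1.0, 1.0], A, P_pd, lr=0.5, steps=10)
hist_indef = pgd([1.0, 1.0], A, P_indef, lr=0.5, steps=10)
```

With the positive definite preconditioner the loss shrinks at every step, whereas the negative eigenvalue turns the update along the second coordinate into gradient ascent and the loss grows.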

TSP vs. Previous PGD Methods: Leveraging Multi-Domain Knowledge for Task-Specific Preconditioner

Compared to previous PGD methods like GAP (Kang et al. 2023), TSP is specifically designed for cross-domain few-shot learning (CDFSL), where unseen domains are not accessed during meta-training. The key challenge in CDFSL is to effectively leverage information from multiple seen domains to quickly adapt to each unseen domain. Previous PGD methods fall short in this regard because they rely on a single preconditioner, even when multiple seen domains are available. For example, GAP uses only one preconditioner to extract information from multiple seen domains, which limits its adaptability to unseen domains with distinct characteristics. In contrast, TSP meta-trains a distinct Domain-Specific Preconditioner (DSP) for each seen domain and combines them to construct a Task-Specific Preconditioner that is better suited to each unseen domain. TSP produces this Task-Specific Preconditioner effectively, as shown in Tables 1 and 2, and time-efficiently, as further detailed in Appendix H.

7 Conclusion

In this study, we have introduced a robust and effective adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP) to enhance CDFSL performance. Thanks to the meta-trained Domain-Specific Preconditioners (DSPs) and Task-coefficients, TSP can flexibly adjust the optimization strategy according to the geometric characteristics of the parameter space for the target task. Owing to these components, the proposed TSP demonstrates notable performance improvements on Meta-Dataset across various settings.

8 Acknowledgements

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1A2C2007139) and in part by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) ([No. RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], [No. RS-2023-00235293, Development of autonomous driving big data processing, management, search, and sharing interface technology to provide autonomous driving data according to the purpose of usage]).

References

Amari (1967) Amari, S. 1967. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, (3): 299–307.
Amari (1996) Amari, S.-i. 1996. Neural learning in structured parameter spaces-natural Riemannian gradient. Advances in neural information processing systems, 9.
Amari (1998) Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural computation, 10(2): 251–276.
Amari et al. (2020) Amari, S.-i.; Ba, J.; Grosse, R.; Li, X.; Nitanda, A.; Suzuki, T.; Wu, D.; and Xu, J. 2020. When does preconditioning help or hurt generalization? arXiv preprint arXiv:2006.10732.
Amari and Douglas (1998) Amari, S.-I.; and Douglas, S. C. 1998. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 2, 1213–1216. IEEE.
Baik et al. (2023) Baik, S.; Choi, M.; Choi, J.; Kim, H.; and Lee, K. M. 2023. Learning to learn task-adaptive hyperparameters for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Bateni et al. (2022) Bateni, P.; Barber, J.; Van de Meent, J.-W.; and Wood, F. 2022. Enhancing few-shot image classification with unlabelled examples. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2796–2805.
Bateni et al. (2020) Bateni, P.; Goyal, R.; Masrani, V.; Wood, F.; and Sigal, L. 2020. Improved few-shot visual classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14493–14502.
Chen et al. (2019) Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C. F.; and Huang, J.-B. 2019. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232.
Cimpoi et al. (2014) Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; and Vedaldi, A. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3606–3613.
Duchi, Hazan, and Singer (2011) Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).
Dvornik, Schmid, and Mairal (2020) Dvornik, N.; Schmid, C.; and Mairal, J. 2020. Selecting relevant features from a multi-domain representation for few-shot classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, 769–786. Springer.
Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, 1126–1135. PMLR.
Garcia and Bruna (2017) Garcia, V.; and Bruna, J. 2017. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
Garnelo et al. (2018) Garnelo, M.; Rosenbaum, D.; Maddison, C.; Ramalho, T.; Saxton, D.; Shanahan, M.; Teh, Y. W.; Rezende, D.; and Eslami, S. A. 2018. Conditional neural processes. In International conference on machine learning, 1704–1713. PMLR.
Guo et al. (2023) Guo, Y.; Du, R.; Dong, Y.; Hospedales, T.; Song, Y.-Z.; and Ma, Z. 2023. Task-aware Adaptive Learning for Cross-domain Few-shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1590–1599.
Ha and Eck (2017) Ha, D.; and Eck, D. 2017. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477.
Himmelblau et al. (2018) Himmelblau, D. M.; et al. 2018. Applied nonlinear programming. McGraw-Hill.
Horn and Johnson (2012) Horn, R. A.; and Johnson, C. R. 2012. Matrix analysis. Cambridge university press.
Houben et al. (2013) Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; and Igel, C. 2013. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In The 2013 international joint conference on neural networks (IJCNN), 1–8. IEEE.
Kakade (2001) Kakade, S. M. 2001. A natural policy gradient. Advances in neural information processing systems, 14.
Kang et al. (2023) Kang, S.; Hwang, D.; Eo, M.; Kim, T.; and Rhee, W. 2023. Meta-Learning with a Geometry-Adaptive Preconditioner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16080–16090.
Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
Lake et al. (2011) Lake, B.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. 2011. One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, volume 33.
Lake, Salakhutdinov, and Tenenbaum (2015) Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266): 1332–1338.
LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.
LeCun et al. (2002) LeCun, Y.; Bottou, L.; Orr, G. B.; and Müller, K.-R. 2002. Efficient backprop. In Neural networks: Tricks of the trade, 9–50. Springer.
Lee and Choi (2018) Lee, Y.; and Choi, S. 2018. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, 2927–2936. PMLR.
Li, Liu, and Bilen (2021) Li, W.-H.; Liu, X.; and Bilen, H. 2021. Universal representation learning from multiple domains for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9526–9535.
Li, Liu, and Bilen (2022) Li, W.-H.; Liu, X.; and Bilen, H. 2022. Cross-domain few-shot learning with task-specific adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7161–7170.
Li (2017) Li, X.-L. 2017. Preconditioned stochastic gradient descent. IEEE transactions on neural networks and learning systems, 29(5): 1454–1466.
Li et al. (2017) Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
Liu et al. (2020) Liu, L.; Hamilton, W.; Long, G.; Jiang, J.; and Larochelle, H. 2020. A universal representation transformer layer for few-shot image classification. arXiv preprint arXiv:2006.11702.
Liu et al. (2021) Liu, Y.; Lee, J.; Zhu, L.; Chen, L.; Shi, H.; and Yang, Y. 2021. A multi-mode modulator for multi-domain few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8453–8462.
Maji et al. (2013) Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; and Vedaldi, A. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
Mishra et al. (2017) Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P. 2017. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141.
Munkhdalai and Yu (2017) Munkhdalai, T.; and Yu, H. 2017. Meta networks. In International conference on machine learning, 2554–2563. PMLR.
Nilsback and Zisserman (2008) Nilsback, M.-E.; and Zisserman, A. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, 722–729. IEEE.
Nocedal and Wright (1999) Nocedal, J.; and Wright, S. J. 1999. Numerical optimization. Springer.
Oreshkin, Rodríguez López, and Lacoste (2018) Oreshkin, B.; Rodríguez López, P.; and Lacoste, A. 2018. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31.
Park and Oliva (2019) Park, E.; and Oliva, J. B. 2019. Meta-curvature. Advances in Neural Information Processing Systems, 32.
Rajasegaran et al. (2020) Rajasegaran, J.; Khan, S.; Hayat, M.; Khan, F. S.; and Shah, M. 2020. Meta-learning the learning trends shared across tasks. arXiv preprint arXiv:2010.09291.
Rajeswaran et al. (2019) Rajeswaran, A.; Finn, C.; Kakade, S. M.; and Levine, S. 2019. Meta-learning with implicit gradients. Advances in neural information processing systems, 32.
Ravi and Larochelle (2016) Ravi, S.; and Larochelle, H. 2016. Optimization as a model for few-shot learning. In International conference on learning representations.
Requeima et al. (2019) Requeima, J.; Gordon, J.; Bronskill, J.; Nowozin, S.; and Turner, R. E. 2019. Fast and flexible multi-task classification using conditional neural adaptive processes. Advances in Neural Information Processing Systems, 32.
Roy and Vetterli (2007) Roy, O.; and Vetterli, M. 2007. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, 606–610. IEEE.
Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115: 211–252.
Saad (2003) Saad, Y. 2003. Iterative methods for sparse linear systems. SIAM.
Saikia, Brox, and Schmid (2020) Saikia, T.; Brox, T.; and Schmid, C. 2020. Optimized generic feature learning for few-shot classification across domains. arXiv preprint arXiv:2001.07926.
Santoro et al. (2016) Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning, 1842–1850. PMLR.
Schroeder and Cui (2018) Schroeder, B.; and Cui, Y. 2018. FGVCx fungi classification challenge 2018. Available online: github.com/visipedia/fgvcx_fungi_comp (accessed on 14 July 2021).
Simon et al. (2020) Simon, C.; Koniusz, P.; Nock, R.; and Harandi, M. 2020. On modulating the gradient for meta-learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, 556–572. Springer.
Snell, Swersky, and Zemel (2017) Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
Sung et al. (2018) Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1199–1208.
Tian et al. (2024) Tian, H.; Liu, F.; Liu, T.; Du, B.; Cheung, Y.-m.; and Han, B. 2024. MOKD: Cross-domain Finetuning for Few-shot Classification via Maximizing Optimized Kernel Dependence. arXiv preprint arXiv:2405.18786.
Tian et al. (2020) Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J. B.; and Isola, P. 2020. Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, 266–282. Springer.
Triantafillou et al. (2021) Triantafillou, E.; Larochelle, H.; Zemel, R.; and Dumoulin, V. 2021. Learning a universal template for few-shot dataset generalization. In International Conference on Machine Learning, 10424–10433. PMLR.
Triantafillou et al. (2019) Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Evci, U.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.; Manzagol, P.-A.; et al. 2019. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.
Von Oswald et al. (2021) Von Oswald, J.; Zhao, D.; Kobayashi, S.; Schug, S.; Caccia, M.; Zucchet, N.; and Sacramento, J. 2021. Learning where to learn: Gradient sparsity in meta and continual learning. Advances in Neural Information Processing Systems, 34: 5250–5263.
Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
Yoon et al. (2018) Yoon, J.; Kim, T.; Dia, O.; Kim, S.; Bengio, Y.; and Ahn, S. 2018. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31.
Zaheer et al. (2017) Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep sets. Advances in neural information processing systems, 30.
Zhao et al. (2020) Zhao, D.; Kobayashi, S.; Sacramento, J.; and von Oswald, J. 2020. Meta-learning via hypernetworks. In 4th Workshop on Meta-Learning at NeurIPS 2020 (MetaLearn 2020). NeurIPS.

Appendix for the paper
“Task-Specific Preconditioner for Cross-Domain Few-Shot Learning”

Appendix A The three preconditioners used in Figure 1(b) and Figure 4

To establish the motivation for enforcing the positive definite constraint in CDFSL, we conduct a comparative analysis of three adaptation mechanisms—PGD methods with varying preconditioners—using Meta-Dataset. These mechanisms are applied on the state-of-the-art CDFSL method, TSA (Li, Liu, and Bilen 2022). In these comparisons, with task-specific parameters $\theta$ and a task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$ , we update $\theta$ using PGD with a preconditioner $\mathbf{P}$ as follows:

\theta_{\mathcal{T},t}=\theta_{\mathcal{T},t-1}-\alpha\cdot\mathbf{P}\nabla_{\theta}\mathcal{L}(\theta_{\mathcal{T},t-1};\mathcal{S}_{\mathcal{T}}),\;t=1,2,\cdots, \tag{17}

where $\theta_{\mathcal{T},0}=\theta$ and $\mathcal{L}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}})$ is the empirical loss associated with $\mathcal{T}$ and $\theta_{\mathcal{T},t}$. The first PGD method is identical to Gradient Descent (GD): it uses the fixed identity matrix $\mathbf{I}$ as $\mathbf{P}$ and serves as the baseline. The second method is Task-Specific Preconditioned gradient descent (TSP) with a Task-Specific Preconditioner designed as $\mathbf{P}^{l}_{k}=\mathbf{M}^{l}_{k}$ and initialized as $(-1)\cdot\mathbf{I}$ (i.e., PGD without a positive definite constraint). The final method is TSP with the Task-Specific Preconditioner defined in Section 4.3 (i.e., PGD with a positive definite constraint).

Appendix B Meta-Training and Meta-Testing Algorithms

Algorithm 1 Meta-Training for Domain-Specific Preconditioner (DSP)

$p(\mathcal{T})$: Task distribution across $K$ train domains
$\alpha_{\text{in}}$, $\alpha_{\text{out}}$: The learning rates
$\mathcal{L}_{\text{in}}$, $\mathcal{L}_{\text{out}}$: The inner and outer-level loss functions

1: Initialize task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
2: Initialize meta-parameters $\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$ where $\mathcal{M}_{k}=\{\mathbf{M}^{l}_{k}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
3: while not converged do
4:   Sample a batch of train tasks $\mathcal{T}_{B}\sim p(\mathcal{T})$
5:   for all $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}\in\mathcal{T}_{B}$ do
6:     for $l=1$ to $L$ do
7:       Compute DSP $\mathbf{P}^{l}_{d_{\mathcal{T}}}=\mathbf{M}_{d_{\mathcal{T}}}^{l\mathbf{T}}\mathbf{M}^{l}_{d_{\mathcal{T}}}+\mathbf{I}$
8:       Compute updated task-specific parameters via Eq. (3)
9:     end for
10:    Compute outer-level loss $\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})$
11:  end for
12:  Update the meta-parameters via Eq. (10)
13: end while

Algorithm 2 Meta-Training the dataset classifier for Task-coefficients

$p(\mathcal{T})$: Task distribution across $K$ train domains
$g_{\phi}(\cdot)$: The dataset classifier
$\mathcal{L}_{\text{in}}$, $\mathcal{L}_{\text{out}}$: The inner and outer-level loss functions
$\lambda$: The regularization parameter
$\alpha$: The learning rate

1: Initialize task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
2: Initialize the parameters $\phi$ of $g_{\phi}(\cdot)$
3: while not converged do
4:   Sample a batch of train tasks $\mathcal{T}_{B}\sim p(\mathcal{T})$
5:   for all $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}\in\mathcal{T}_{B}$ do
6:     Compute logits $(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K})=g_{\phi}(\mathcal{S}_{\mathcal{T}})$
7:     Compute task-coefficients via Eq. (11)
8:     Compute $\mathcal{L}_{\text{CE}}^{\mathcal{T}}=-\sum^{K}_{k=1}d_{\mathcal{T},k}\cdot\log(p_{\mathcal{T},k})$
9:     With $\theta_{\mathcal{T},T}=\{\theta_{\mathcal{T},T}^{l}\}^{L}_{l=1}$ obtained via Eq. (14), compute $\mathcal{L}_{\text{Aux}}^{\mathcal{T}}=\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})$
10:    Compute $\mathcal{L}^{\mathcal{T}}=\mathcal{L}_{\text{CE}}^{\mathcal{T}}+\lambda\cdot\mathcal{L}_{\text{Aux}}^{\mathcal{T}}$
11:  end for
12:  Compute $\mathcal{L}=\sum_{\mathcal{T}\in\mathcal{T}_{B}}\mathcal{L}^{\mathcal{T}}$
13:  Update the parameters $\phi\leftarrow\phi-\alpha\cdot\nabla_{\phi}\mathcal{L}$
14: end while
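The per-task loss in steps 6 to 10 of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that Eq. (11) is a softmax over the classifier logits and that the domain label $d_{\mathcal{T}}$ is one-hot, and it takes the auxiliary loss as a precomputed number instead of running the inner-loop adaptation. The function names and input values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax (assumed form of the task-coefficients)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dataset_classifier_loss(logits, domain_index, aux_loss, lam=0.1):
    """Return (L_CE + lam * L_Aux, task-coefficients) for a single task.

    With a one-hot domain label, the cross-entropy reduces to
    -log(p_k) for the true domain k.
    """
    coeffs = softmax(logits)
    ce = -math.log(coeffs[domain_index])
    return ce + lam * aux_loss, coeffs


# Hypothetical task: K = 4 domains, uninformative logits, auxiliary loss 2.0.
total_loss, coeffs = dataset_classifier_loss(
    logits=[0.0, 0.0, 0.0, 0.0], domain_index=2, aux_loss=2.0, lam=0.1
)
```

The default `lam=0.1` mirrors the value of $\lambda$ adopted in Appendix D.1. In an actual training loop, `aux_loss` would depend on the task-coefficients through the adapted parameters, so gradients would also flow through that term.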

Algorithm 3 Meta-Testing through Task-Specific Preconditioned gradient descent (TSP)

$p(\mathcal{T})$: Task distribution across all domains
$\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$: Meta-trained meta-parameters
$g_{\phi}(\cdot)$: Meta-trained dataset classifier
$\mathcal{L}_{\text{in}}$: The inner-level loss function
$\beta$: The learning rate

1: Initialize task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
2: Sample a test task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$
3: for $l=1$ to $L$ do
4:   For all $k$, compute DSP $\mathbf{P}^{l}_{k}=\mathbf{M}_{k}^{l\mathbf{T}}\mathbf{M}_{k}^{l}+\mathbf{I}$
5:   Compute Task-Specific Preconditioner via Eq. (15)
6:   Update the task-specific parameters via Eq. (16)
7: end for
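For a single layer, the steps of Algorithm 3 can be sketched as below. This is a toy illustration under several assumptions: the Task-Specific Preconditioner is taken to be the coefficient-weighted sum of DSPs (as in the proof of Theorem 1), the update is the preconditioned gradient step of Eq. (17), the parameters are a plain vector, and the loss is a made-up quadratic. The meta-parameters, coefficients, and function names are hypothetical.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dsp(m):
    """Domain-Specific Preconditioner P_k = M_k^T M_k + I."""
    g = matmul(transpose(m), m)
    n = len(g)
    return [[g[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def task_specific_preconditioner(ms, coeffs):
    """Weighted sum of DSPs: P = sum_k p_k * P_k."""
    ps = [dsp(m) for m in ms]
    n = len(ps[0])
    return [[sum(c * p[i][j] for c, p in zip(coeffs, ps)) for j in range(n)]
            for i in range(n)]


def adapt(theta, grad_fn, precond, lr, steps):
    """Repeat theta <- theta - lr * P * grad(theta)."""
    for _ in range(steps):
        update = matvec(precond, grad_fn(theta))
        theta = [t - lr * u for t, u in zip(theta, update)]
    return theta


# Hypothetical meta-parameters for K = 2 domains and task-coefficients.
ms = [[[0.1, 0.0], [0.0, 0.1]], [[0.5, 0.5], [0.0, 0.5]]]
coeffs = [0.7, 0.3]
target = [1.0, -2.0]


def grad_fn(theta):
    """Gradient of the toy loss 0.5 * ||theta - target||^2."""
    return [t - c for t, c in zip(theta, target)]


P = task_specific_preconditioner(ms, coeffs)
theta0 = [0.0, 0.0]
theta = adapt(theta0, grad_fn, P, lr=0.05, steps=40)
initial_err = sum((t - c) ** 2 for t, c in zip(theta0, target))
final_err = sum((t - c) ** 2 for t, c in zip(theta, target))
```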

Appendix C Proofs of Theorems

Lemma 1.

For the meta parameter $\mathbf{M}\in\mathbb{R}^{m\times m}$ , the Domain-Specific Preconditioner $\mathbf{P}$ defined as $\mathbf{P}=\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}$ is positive definite.

Proof.

$\mathbf{P}$ is symmetric, as shown below:

\mathbf{P}^{\mathbf{T}}=(\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I})^{\mathbf{T}}=(\mathbf{M}^{\mathbf{T}}\mathbf{M})^{\mathbf{T}}+\mathbf{I}^{\mathbf{T}}=\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}=\mathbf{P}.

For $\forall\,\mathbf{x}\in\mathbb{R}^{m}\backslash\{\mathbf{0}\}$ ,

\begin{aligned}
\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}&=\mathbf{x}^{\mathbf{T}}(\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I})\mathbf{x}\\
&=\mathbf{x}^{\mathbf{T}}\mathbf{M}^{\mathbf{T}}\mathbf{M}\mathbf{x}+\mathbf{x}^{\mathbf{T}}\mathbf{I}\mathbf{x}\\
&=(\mathbf{M}\mathbf{x})^{\mathbf{T}}(\mathbf{M}\mathbf{x})+\mathbf{x}^{\mathbf{T}}\mathbf{x}\\
&=\|\mathbf{M}\mathbf{x}\|^{2}+\|\mathbf{x}\|^{2}.
\end{aligned}

The first term on the right-hand side is non-negative, while the second term is positive:

\|\mathbf{M}\mathbf{x}\|^{2}\geq 0,\qquad\|\mathbf{x}\|^{2}>0

since $\mathbf{x}\neq\mathbf{0}$ . Thus, we conclude:

\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}>0, \tag{18}

which confirms the positive-definiteness of the Domain-Specific Preconditioner $\mathbf{P}$ . ∎

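Theorem 1.

Let $p_{k}\in[0,1]$, $k=1,\cdots,K$, be the task-coefficients satisfying $\sum^{K}_{k=1}p_{k}=1$. For the Domain-Specific Preconditioners $\mathbf{P}_{k}$, $k=1,\cdots,K$, the Task-Specific Preconditioner $\mathbf{P}$ defined as $\mathbf{P}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}$ is positive definite.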

Proof.

By Lemma 1, $\mathbf{P}_{k}$ is symmetric. Therefore, $\mathbf{P}$ is symmetric, as shown below:

\mathbf{P}^{\mathbf{T}}=\left(\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}\right)^{\mathbf{T}}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}^{\mathbf{T}}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}=\mathbf{P}.

For $\forall\,\mathbf{x}\in\mathbb{R}^{m}\backslash\{\mathbf{0}\}$ ,

\begin{aligned}
\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}&=\mathbf{x}^{\mathbf{T}}\left(\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}\right)\mathbf{x}\\
&=\sum^{K}_{k=1}p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}.
\end{aligned}

Since $\mathbf{P}_{k}$ is the Domain-Specific Preconditioner, by Lemma 1, each summand on the right-hand side is non-negative (see Eq. (18)):

p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}\geq 0,\;\;k=1,\cdots,K \tag{19}

because $0\leq p_{k}\leq 1$ . Since $\sum^{K}_{k=1}p_{k}=1$ and $p_{k}\in[0,1],k=1,\cdots,K$ , there exists at least one $p_{k}$ such that $p_{k}>0$ , implying that at least one term in Eq. (19) is positive:

\exists\,k\in\{1,2,3,\dots,K\}\>\;\text{such that}\>\;p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}>0. \tag{20}

Combining Eq. (19) and Eq. (20), we conclude:

\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}>0,

which confirms the positive-definiteness of the Task-Specific Preconditioner $\mathbf{P}$ . ∎

Appendix D Additional Ablation Studies

D.1 Weighting Factor $\lambda$ of the Dataset Classifier Loss

In Table 7, we compare different dataset classifier losses by adjusting the weighting factor $\lambda$ in Eq. (12). Additionally, we include the results obtained when using only the cross-entropy loss $\mathcal{L}_{\text{CE}}$ or only the auxiliary loss $\mathcal{L}_{\text{Aux}}$ in Eq. (12) as the dataset classifier loss. From the results, we observe that $\lambda=0.1$ yields the best performance, while the other losses result in inferior performance in both seen and unseen domains. This performance gap is more pronounced in unseen domains, highlighting the importance of balancing the two losses for generalization to unseen domains. Based on these findings, we adopt $\lambda=0.1$ for the dataset classifier loss in all experiments presented in this manuscript.

Table 7: Mean accuracy ($\%$) for different values of $\lambda$ in the dataset classifier loss (see Eq. (12)).

	$\lambda=10$	$\lambda=1$	$\lambda=0.1$ (Ours)	$\lambda=0.01$	Only $\mathcal{L}_{\text{CE}}$	Only $\mathcal{L}_{\text{Aux}}$
Avg Seen	80.8	81.3	81.6	81.0	80.9	80.1
Avg Unseen	79.3	79.7	79.8	79.1	78.8	78.2
Avg All	80.2	80.7	80.9	80.3	80.1	79.4

D.2 Interpreting and visualizing task-coefficient

To better understand how DSPs from the eight training domains combine to form the Task-Specific Preconditioner, we illustrate the task-coefficients of TSP used in various test tasks in Figure 5(a). We first randomly sample the test tasks from each of the four domains (ImageNet, Birds, Traffic Sign, and MSCOCO). The blue heatmaps illustrate the task-coefficient values utilized in each test task. These heatmaps exhibit consistent patterns within each domain, although the values vary across tasks. For instance, in the Birds domain, all 5 tasks primarily rely on DSPs from ImageNet and Birds. However, Task 2 distributes its task-coefficient values evenly between them, while Task 4 assigns significantly more weight to the Birds DSP than to the ImageNet DSP. In Figure 5(b), Figure 5(c), and Figure 5(d), we randomly sample the test tasks from other domains. Similar to the patterns observed in Figure 5(a), these heatmaps also demonstrate consistent patterns within each domain, although the values vary across tasks.

Appendix E Implementation Details

E.1 Dataset

Meta-Dataset (Triantafillou et al. 2019) is the standard benchmark for evaluating the performance of cross-domain few-shot classification. Initially, it comprised ten datasets, including ILSVRC 2012 (Russakovsky et al. 2015), Omniglot (Lake, Salakhutdinov, and Tenenbaum 2015), FGVC-Aircraft (Maji et al. 2013), CUB-200-2011 (Wah et al. 2011), Describable Textures (Cimpoi et al. 2014), QuickDraw (Ha and Eck 2017), FGVCx Fungi (Schroeder and Cui 2018), VGG Flower (Nilsback and Zisserman 2008), Traffic Signs (Houben et al. 2013), and MSCOCO (Lin et al. 2014). Later, it was further expanded to include MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky, Hinton et al. 2009), and CIFAR-100 (Krizhevsky, Hinton et al. 2009).

E.2 Architecture for the dataset classifier

As the dataset classifier, we use a permutation-invariant set encoder $g$ (Zaheer et al. 2017) followed by a linear layer. We adopt the implementation of the permutation-invariant set encoder described in previous studies (Requeima et al. 2019; Triantafillou et al. 2021). We implement this encoder as a Conv-$5$ backbone, comprising $5$ modules with $3\times 3$ convolutions employing $256$ filters, each followed by batch normalization, ReLU activation, and $2\times 2$ max-pooling with a stride of $2$. Subsequently, global average pooling is applied to the output, followed by averaging over the first dimension (representing different examples within the support set), resulting in the set representation of the given support set. This representation is then fed into a linear layer to classify the given support set into one of the $K$ training datasets.
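The pooling and classification stages described above can be sketched in NumPy as follows. The convolutional backbone is omitted and replaced by random stand-in feature maps; the function names, shapes, and weights are hypothetical and serve only to show why the result does not depend on the order of the support examples.

```python
import numpy as np

def set_representation(feature_maps):
    """feature_maps: (N, C, H, W) backbone outputs for N support examples."""
    per_example = feature_maps.mean(axis=(2, 3))  # global average pooling -> (N, C)
    return per_example.mean(axis=0)               # average over examples -> (C,)

def classify_dataset(feature_maps, weight, bias):
    """Linear layer plus softmax over the K training datasets."""
    z = set_representation(feature_maps)
    logits = weight @ z + bias
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, C, H, W_sp, K = 10, 256, 2, 2, 8           # toy sizes; 256 filters as in the text
feats = rng.normal(size=(N, C, H, W_sp))      # stand-in for Conv-5 outputs
weight = rng.normal(size=(K, C)) * 0.01       # hypothetical linear-layer weights
bias = np.zeros(K)

probs = classify_dataset(feats, weight, bias)
probs_perm = classify_dataset(feats[rng.permutation(N)], weight, bias)
print(np.allclose(probs, probs_perm))
```

Because the examples are only ever combined through a mean, shuffling the support set leaves the output unchanged, which is the permutation invariance the set encoder relies on.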

E.3 Hyper-parameters

For all the experiments, we use the hyper-parameters in Table 8 and Table 9.

Table 8: Hyper-parameters used for training DSP on various experimental settings. For the test learning rates, the first value corresponds to the learning rate for the Residual Adapter, while the second value corresponds to the learning rate for the pre-classifier transformation.

Setting	Varying-Way	Varying-Way	5-Way	Varying-Way
Setting	Varying-Shot	5-Shot	1-Shot	Varying-Shot
Batch size	16			16
Weight decay	0.0007			0.0007
T-max	2500			2500
Max iteration	160000			40000
Initialization for $\mathbf{M}$	$0.1\cdot\mathbf{I}$			$0.1\cdot\mathbf{I}$
Inner learning rate $\alpha_{\text{in}}$	0.1			0.1
Outer learning rate $\alpha_{\text{out}}$	0.1			0.1
The number of training inner-step	5			5
The number of testing inner-step	40			40
Test learning rate $\beta$ for seen domain	(0.05, 0.30)	(0.05, 0.30)	(0.05, 0.30)	(0.05, 0.20)
Test learning rate $\beta$ for unseen domain	(0.25, 0.05)	(0.25, 0.05)	(0.25, 0.05)	(0.25, 0.05)

Table 9: Hyper-parameters used for training Dataset Classifier on various experimental settings.

Hyper-parameter
Batch size	16
Weight decay	0.0007
T-max	500
Max iteration	4000
Learning rate	0.001

Appendix F Additional Results

F.1 Varying-Way Five-Shot setting

In the standard Meta-Dataset benchmark, tasks vary in the number of classes per task (‘way’) and the number of support images per class (‘shot’), with ‘shot’ ranging up to 100. Here, we evaluate TSP in the Varying-Way Five-Shot setting, which poses a greater challenge due to the limited number of support images. The results in Table 11 show that TSP^†† achieves top performance for 10 out of 13 datasets, including all 5 unseen datasets. Notably, TSP^†† achieves a significantly higher average score in unseen domains than the previous best result ($+2.2\%$).

F.2 Five-Way One-Shot setting

We evaluate TSP under a more challenging setting, where only a single support image per class is available. As shown in Table 12, TSP^†† consistently outperforms the previous best results for 10 out of 13 datasets, while TSP^† also achieves the best or near-best results. Notably, applying TSP to TA²-Net results in a significant performance improvement in unseen domains ( $+2.8\%$ ) compared to using TA²-Net alone, demonstrating its efficacy in this highly challenging setting.

Table 10: Comparison of two DSP designs with and without the identity matrix. The DSP design $\mathbf{M^{T}M}$ does not guarantee positive definiteness.

Setting	Varying-Way Varying-Shot	Varying-Way Varying-Shot	Varying-Way Five-Shot	Varying-Way Five-Shot
DSP design	$\mathbf{M^{T}M}$	$\mathbf{M^{T}M+I}$	$\mathbf{M^{T}M}$	$\mathbf{M^{T}M+I}$
Avg Seen	80.2	81.6	77.0	77.9
Avg Unseen	68.2	79.8	63.3	73.7
Avg All	75.6	80.9	71.7	76.3

Appendix G Comparison between $\mathbf{M^{T}M+I}$ and $\mathbf{M^{T}M}$: Failure of DSP design $\mathbf{M^{T}M}$

In Section 6, we compared $\mathbf{LL^{T}+I}$ and $\mathbf{LL^{T}}$ to explore the impact of including the identity matrix in the DSP design. Extending this analysis, we now examine two additional DSP designs, $\mathbf{M^{T}M+I}$ and $\mathbf{M^{T}M}$, even though the latter does not guarantee positive definiteness. Our findings, presented in Table 10, show that the DSP design $\mathbf{M^{T}M+I}$ consistently outperforms $\mathbf{M^{T}M}$ in both the Varying-Way Varying-Shot and Varying-Way Five-Shot settings. This contrasts with the results in Table 6 of the manuscript, where including the identity matrix was effective in the Varying-Way Varying-Shot setting. Furthermore, $\mathbf{M^{T}M}$ performs significantly worse than TSA (Li, Liu, and Bilen 2022), which serves as the baseline for our method.
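Why adding the identity matters for positive definiteness can be seen from a standard linear-algebra argument, sketched here for an arbitrary real matrix $\mathbf{M}$ and any nonzero vector $x$ (this is a textbook identity, not a restatement of the proofs in Appendix C):

```latex
\[
\begin{aligned}
x^{\top}\mathbf{M}^{\top}\mathbf{M}\,x
  &= \lVert \mathbf{M}x \rVert_2^2 \;\ge\; 0, \\
x^{\top}\bigl(\mathbf{M}^{\top}\mathbf{M}+\mathbf{I}\bigr)x
  &= \lVert \mathbf{M}x \rVert_2^2 + \lVert x \rVert_2^2
  \;\ge\; \lVert x \rVert_2^2 \;>\; 0 .
\end{aligned}
\]
```

The first quantity is zero whenever $\mathbf{M}x = 0$, so $\mathbf{M^{T}M}$ is only positive semi-definite unless $\mathbf{M}$ has full column rank, whereas every eigenvalue of $\mathbf{M^{T}M+I}$ is at least $1$.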

To investigate the failure of the DSP design $\mathbf{M^{T}M}$, we compare the effective ranks (Roy and Vetterli 2007) of Task-Specific Preconditioners between two DSP designs, $\mathbf{M^{T}M}$ and $\mathbf{LL^{T}}$. The effective rank provides a numerical approximation of a matrix’s rank, indicating the number of singular values distant from zero. Since positive definite matrices have only positive singular values, an effective rank significantly lower than the full rank implies that the preconditioners are far from positive definite. Table 13 and Table 14 present the averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{M^{T}M}$, differing only in settings: Varying-Way Varying-Shot and Varying-Way Five-Shot, respectively. Several preconditioners in these tables exhibit notably lower effective rank than the full rank, indicating their departure from positive definiteness. Consequently, these non-positive definite preconditioners may fail to determine the steepest descent direction in the parameter space, leading to the degraded performance observed in Table 10. In contrast, Table 15 and Table 16 reveal that the averaged effective rank of the 17 Task-Specific Preconditioners of the DSP design $\mathbf{LL^{T}}$ closely approaches full rank. This finding aligns with the Cholesky factorization’s assertion (Horn and Johnson 2012) that $\mathbf{LL^{T}}$ is positive definite, which confirms that Task-Specific Preconditioners constructed with $\mathbf{LL^{T}}$ are positive definite (Theorem 1 holds as long as each DSP $\mathbf{P}_{k}$ is positive definite). This observation corroborates the results presented in Table 6 of the manuscript.
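The effective rank of Roy and Vetterli (2007) is commonly defined as the exponential of the Shannon entropy of the normalized singular values. The sketch below follows that definition in NumPy; the matrices are toy examples (a deliberately rank-deficient factor `M_low`), not the learned preconditioners, and the small `eps` guard is an added assumption to avoid division by zero.

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """exp(entropy) of the singular-value distribution of A."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values to a distribution
    p = p[p > 0]              # drop exact zeros so log is defined
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
d = 64
M_low = rng.normal(size=(4, d))   # toy factor with rank at most 4
P_mtm = M_low.T @ M_low           # M^T M: positive semi-definite, rank-deficient
P_mtm_i = P_mtm + np.eye(d)       # M^T M + I: every eigenvalue is at least 1

print(effective_rank(P_mtm), effective_rank(P_mtm_i), effective_rank(np.eye(d)))
```

In this toy case the effective rank of `P_mtm` cannot exceed the four nonzero singular values, far below the full rank of 64, which mirrors the kind of gap reported for several $\mathbf{M^{T}M}$ preconditioners in Table 13 and Table 14.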

Appendix H Time-efficiency of TSP compared to GAP

Unlike GAP (Kang et al. 2023), whose inference is slow because it performs a time-intensive singular value decomposition (SVD) at every neural network layer during each inner-level training iteration, TSP achieves a much faster inference time. Specifically, GAP requires approximately 14.2 seconds per task, while TSP applied on TSA (Li, Liu, and Bilen 2022) completes inference in just 1.1 seconds, about 13 times faster than GAP, by leveraging pre-calculated DSPs and a dataset classifier to avoid time-intensive calculations during inference. Inference time for GAP and TSP is measured on a single RTX3090 GPU and averaged over 100 test tasks. This highlights the superior practical efficiency of TSP compared to the previous PGD-based method, GAP.

Appendix I Application Details for TSP^† and TSP^††

As shown in Figure 6, we apply TSP to the state-of-the-art CDFSL methods, TSA (Li, Liu, and Bilen 2022) and TA²-Net (Guo et al. 2023). Figure 6(a) illustrates PGD with a Domain-Specific Preconditioner (DSP) applied to TSA during meta-training, where the DSP is selected based on the task’s domain label, and PGD optimizes each task-specific parameter $\theta^{l}$ using the corresponding DSP. Figure 6(b) shows PGD with a Task-Specific Preconditioner applied to TSA during meta-testing. In this case, each preconditioner is constructed from the DSPs using task coefficients generated by the Dataset Classifier, and PGD optimizes each $\theta^{l}$ using the constructed preconditioner. Figure 6(c) depicts PGD with DSP applied to TA²-Net during meta-training. Unlike TSA, which optimizes a single task-specific parameter for each module, TA²-Net optimizes multiple task-specific parameters $\theta^{l,j}$. To accommodate this, we apply the same PGD to optimize the multiple parameters $\theta^{l,j}$. Figure 6(d) shows PGD with a Task-Specific Preconditioner applied to TA²-Net during meta-testing, where, as in the meta-training phase, the same PGD is applied to optimize the multiple task-specific parameters $\theta^{l,j}$.
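The per-parameter update described above can be sketched as a loop in which each task-specific parameter is stepped with its own preconditioner. This is a minimal toy under stated assumptions: the parameters are vectors rather than adapter weight matrices, the loss is a hypothetical quadratic, the names `layer1` and `layer2` are placeholders, and the preconditioners are set to the inverse curvature purely for illustration. The 40 steps match the number of testing inner-steps in Table 8; the step size 0.1 is an illustrative value.

```python
import numpy as np

def pgd_step(params, grads, preconds, beta):
    """One preconditioned step per parameter: theta <- theta - beta * P @ grad."""
    return {k: params[k] - beta * preconds[k] @ grads[k] for k in params}

rng = np.random.default_rng(0)
# Hypothetical per-layer curvature matrices of a toy quadratic loss.
curv = {"layer1": np.diag([1.0, 4.0, 9.0]), "layer2": np.diag([2.0, 8.0])}
targets = {k: rng.normal(size=A.shape[0]) for k, A in curv.items()}

def loss_and_grads(params):
    loss, grads = 0.0, {}
    for k, A in curv.items():
        diff = params[k] - targets[k]
        loss += 0.5 * diff @ A @ diff
        grads[k] = A @ diff
    return loss, grads

# One positive definite preconditioner per task-specific parameter.
preconds = {k: np.linalg.inv(A) for k, A in curv.items()}
params = {k: np.zeros(A.shape[0]) for k, A in curv.items()}

initial_loss, _ = loss_and_grads(params)
for _ in range(40):
    _, grads = loss_and_grads(params)
    params = pgd_step(params, grads, preconds, beta=0.1)
final_loss, _ = loss_and_grads(params)
print(initial_loss, final_loss)
```

For TA²-Net, the same `pgd_step` would simply receive more entries per module (one for each $\theta^{l,j}$), since the update rule itself does not change.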

Table 11: Comparison to state-of-the-art methods in Varying-Way Five-Shot setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP^† denotes TSP applied on TSA. TSP^†† denotes TSP applied on TA²-Net.

Test Dataset	Simple CNAPS	SUR	URT	URL	TSA	TA²-Net	MOKD	TSP^†	TSP^††
ImageNet	47.2 ± 1.0	46.7 ± 1.0	48.6 ± 1.0	49.4 ± 1.0	48.3 ± 1.0	49.3 ± 1.0	47.5 ± 1.0	50.6 ± 1.0	50.8 ± 1.0
Omniglot	95.1 ± 0.3	95.8 ± 0.3	96.0 ± 0.3	96.0 ± 0.3	96.8 ± 0.3	96.6 ± 0.2	96.0 ± 0.3	97.2 ± 0.3	97.1 ± 0.3
Aircraft	74.6 ± 0.6	82.1 ± 0.6	81.2 ± 0.6	84.8 ± 0.5	85.5 ± 0.5	85.9 ± 0.4	84.4 ± 0.5	86.2 ± 0.5	86.7 ± 0.5
Birds	69.6 ± 0.7	62.8 ± 0.9	71.2 ± 0.7	76.0 ± 0.6	76.6 ± 0.6	77.3 ± 0.6	76.8 ± 0.6	77.0 ± 0.6	77.8 ± 0.6
Textures	57.5 ± 0.7	60.2 ± 0.7	65.2 ± 0.7	69.1 ± 0.6	68.3 ± 0.7	68.3 ± 0.6	66.3 ± 0.6	69.1 ± 0.6	69.3 ± 0.6
Quick Draw	70.9 ± 0.6	79.0 ± 0.5	79.2 ± 0.5	78.2 ± 0.5	77.9 ± 0.6	78.5 ± 0.5	78.9 ± 0.5	78.7 ± 0.6	78.8 ± 0.6
Fungi	50.3 ± 1.0	66.5 ± 0.8	66.9 ± 0.9	70.0 ± 0.8	70.4 ± 0.8	70.3 ± 0.8	68.8 ± 0.9	73.6 ± 0.9	72.9 ± 0.8
VGG Flower	86.5 ± 0.4	76.9 ± 0.6	82.4 ± 0.5	89.3 ± 0.4	89.5 ± 0.4	90.0 ± 0.4	89.1 ± 0.4	90.8 ± 0.4	91.1 ± 0.4
Traffic Sign	55.2 ± 0.8	44.9 ± 0.9	45.1 ± 0.9	57.5 ± 0.8	72.3 ± 0.6	76.7 ± 0.5	59.2 ± 0.8	79.6 ± 0.5	80.9 ± 0.5
MSCOCO	49.2 ± 0.8	48.1 ± 0.9	52.3 ± 0.9	56.1 ± 0.8	56.0 ± 0.8	56.0 ± 0.8	51.8 ± 0.8	57.5 ± 0.9	59.3 ± 0.8
MNIST	88.9 ± 0.4	90.1 ± 0.4	86.5 ± 0.5	89.7 ± 0.4	92.5 ± 0.4	93.3 ± 0.3	89.4 ± 0.3	93.0 ± 0.3	93.9 ± 0.3
CIFAR-10	66.1 ± 0.7	50.3 ± 1.0	61.4 ± 0.7	66.0 ± 0.7	72.0 ± 0.7	73.1 ± 0.7	58.8 ± 0.7	73.5 ± 0.7	74.2 ± 0.7
CIFAR-100	53.8 ± 0.9	46.4 ± 0.9	52.5 ± 0.9	57.0 ± 0.9	64.1 ± 0.8	64.1 ± 0.8	55.3 ± 0.9	65.0 ± 0.8	65.6 ± 0.9
Avg Seen	69.0	71.3	73.8	76.6	76.7	77.0	76.0	77.9	78.1
Avg Unseen	62.6	56.0	59.6	65.3	71.4	72.6	63.0	73.7	74.8
Avg All	66.5	65.4	68.3	72.2	74.6	75.3	71.0	76.3	76.8
Avg Rank	7.9	7.8	6.6	4.9	4.3	3.5	5.8	2.2	1.4

Table 12: Comparison to state-of-the-art methods in Five-Way One-Shot setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP^† denotes TSP applied on TSA. TSP^†† denotes TSP applied on TA²-Net.

Test Dataset	Simple CNAPS	SUR	URT	URL	TSA	TA²-Net	MOKD	TSP^†	TSP^††
ImageNet	42.6 ± 0.9	40.7 ± 1.0	47.4 ± 1.0	49.6 ± 1.1	48.0 ± 1.0	48.8 ± 1.1	46.0 ± 1.0	50.1 ± 1.0	50.5 ± 1.0
Omniglot	93.1 ± 0.5	93.0 ± 0.7	95.6 ± 0.5	95.8 ± 0.5	96.3 ± 0.4	95.7 ± 0.4	95.5 ± 0.5	96.6 ± 0.4	96.8 ± 0.4
Aircraft	65.8 ± 0.9	67.1 ± 1.4	77.9 ± 0.9	79.6 ± 0.9	79.6 ± 0.9	79.8 ± 0.9	78.6 ± 0.9	81.1 ± 0.9	81.3 ± 0.9
Birds	67.9 ± 0.9	59.2 ± 1.0	70.9 ± 0.9	74.9 ± 0.9	74.5 ± 0.9	74.4 ± 0.9	75.9 ± 0.9	75.7 ± 0.9	75.7 ± 0.9
Textures	42.2 ± 0.8	42.5 ± 0.8	49.4 ± 0.9	53.6 ± 0.9	54.5 ± 0.9	54.1 ± 0.8	51.4 ± 0.9	55.5 ± 0.9	55.4 ± 0.9
Quick Draw	70.5 ± 0.9	79.8 ± 0.9	79.6 ± 0.9	79.0 ± 0.8	79.3 ± 0.9	78.9 ± 0.9	78.9 ± 0.9	80.7 ± 0.8	80.2 ± 0.9
Fungi	58.3 ± 1.1	64.8 ± 1.1	71.0 ± 1.0	75.2 ± 1.0	75.3 ± 1.0	75.2 ± 0.9	71.1 ± 1.0	77.8 ± 0.9	78.1 ± 0.8
VGG Flower	79.9 ± 0.7	65.0 ± 1.0	72.7 ± 0.0	79.9 ± 0.8	80.3 ± 0.8	80.1 ± 0.8	79.8 ± 0.8	81.0 ± 0.8	81.1 ± 0.8
Traffic Sign	55.3 ± 0.9	44.6 ± 0.9	52.7 ± 0.9	57.9 ± 0.9	57.2 ± 1.0	54.1 ± 1.0	57.0 ± 0.9	57.4 ± 1.0	56.9 ± 1.0
MSCOCO	48.8 ± 0.9	47.8 ± 1.1	56.9 ± 1.1	59.2 ± 1.0	59.9 ± 1.0	58.1 ± 1.0	50.9 ± 0.8	59.7 ± 1.0	60.5 ± 1.0
MNIST	80.1 ± 0.9	77.1 ± 0.9	75.6 ± 0.9	78.7 ± 0.9	80.1 ± 0.9	80.3 ± 0.9	72.5 ± 0.9	81.4 ± 0.8	81.7 ± 0.9
CIFAR-10	50.3 ± 0.9	35.8 ± 0.8	47.3 ± 0.9	54.7 ± 0.9	55.8 ± 0.9	52.9 ± 1.0	47.3 ± 0.8	55.9 ± 0.9	56.0 ± 0.9
CIFAR-100	53.8 ± 0.9	42.9 ± 1.0	54.9 ± 1.1	61.8 ± 1.0	63.7 ± 1.0	61.0 ± 1.1	60.2 ± 1.0	65.2 ± 1.0	65.6 ± 1.0
Avg Seen	65.0	64.0	70.6	73.5		73.4	72.2	74.8	74.9
Avg Unseen	57.7	49.6	57.5	62.5	63.3	61.3	57.5	63.9	64.1
Avg All	62.2	58.5	65.5	69.2	69.6	68.7	66.5	70.6	70.8
Avg Rank	7.5	8.2	6.8	4.2	3.5	4.8	6.2	1.9	1.5

Table 13: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{M^{T}M}$ in Varying-Way Varying-Shot setting. We average the effective ranks using 600 tasks randomly sampled from each domain. The left column denotes the name of each task-specific weight, while the right column indicates the full rank of each task-specific weight.

Weight’s

Name

Image

-Net

Omni

-glot

Airc

-raft

Birds

Tex

-tures

Quick

-Draw

Fun

-gi

VGG

Flower

Traffic

Sign

-COCO

-IST

CIFAR

-10

CIFAR

-100

Full

Rank

layer1-0-

\alpha_{1}

60.04

60.97

63.43

61.72

60.27

64.00

62.47

63.28

62.27

60.44

64.00

60.17

59.69

layer1-0-

\alpha_{2}

62.70

61.46

62.92

62.77

61.63

64.00

61.59

62.73

62.99

62.80

64.00

62.80

62.69

layer1-1-

\alpha_{1}

54.84

57.39

40.04

57.87

47.18

63.96

35.14

49.95

50.58

54.76

63.99

56.37

56.88

layer1-1-

\alpha_{2}

62.17

63.87

57.23

62.60

62.29

64.00

59.25

60.77

62.08

62.31

64.00

62.34

62.27

layer2-0-

\alpha_{1}

95.20

125.21

33.71

72.66

58.40

127.69

38.69

96.12

79.15

93.68

127.90

98.73

104.02

128

layer2-0-

\alpha_{2}

125.09

127.15

109.02

113.07

116.99

127.99

117.76

122.92

124.18

125.19

128.00

124.84

125.40

128

layer2-1-

\alpha_{1}

126.98

127.73

115.54

124.03

119.52

128.00

126.41

127.16

127.15

126.96

128.00

126.90

126.91

128

layer2-1-

\alpha_{2}

127.74

127.43

126.58

127.04

128.00

127.61

127.39

127.79

127.77

128.00

127.71

127.72

128

layer3-0-

\alpha_{1}

230.14

255.05

120.37

96.81

7.69

255.77

216.22

202.50

129.93

149.37

255.94

214.94

199.54

256

layer3-0-

\alpha_{2}

250.48

254.61

156.72

142.09

196.29

255.97

253.60

218.84

247.46

250.71

255.98

240.82

247.83

256

layer3-1-

\alpha_{1}

196.75

252.20

53.97

10.92

41.12

254.19

252.10

24.33

126.43

195.93

254.96

125.57

164.58

256

layer3-1-

\alpha_{2}

255.58

254.85

253.45

253.98

253.79

255.99

255.87

254.82

255.73

255.62

255.99

255.52

255.51

256

layer4-0-

\alpha_{1}

491.73

509.82

466.61

482.44

16.00

511.75

509.44

501.00

305.02

335.14

511.93

489.17

438.08

512

layer4-0-

\alpha_{2}

460.91

493.62

410.53

73.12

387.58

511.91

507.25

478.68

497.04

482.00

511.92

385.93

438.04

512

layer4-1-

\alpha_{1}

431.05

501.79

223.91

243.72

7.22

488.68

451.76

420.43

201.02

239.09

489.71

410.19

351.43

512

layer4-1-

\alpha_{2}

498.54

509.57

509.23

507.46

496.55

511.55

510.78

509.13

507.39

500.13

511.55

499.06

496.86

512

pa-weight

11.60

3.13

3.88

7.07

4.58

2.79

3.33

3.36

9.06

11.60

2.74

11.84

12.16

512

Table 14: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{M^{T}M}$ in Varying-Way Five-Shot setting. We average the effective ranks using 600 tasks randomly sampled from each domain. The left column denotes the name of each task-specific weight, while the right column indicates the full rank of each task-specific weight.

Weight’s Name	ImageNet	Omniglot	Aircraft	Birds	Textures	Quick-Draw	Fungi	VGG Flower	Traffic Sign	MSCOCO	MNIST	CIFAR-10	CIFAR-100	Full Rank
layer1-0-$\alpha_{1}$	60.14	61.04	63.43	61.70	60.49	64.00	62.46	63.28	62.00	60.37	64.00	60.31	59.76	
layer1-0-$\alpha_{2}$	62.70	61.52	62.93	62.79	61.86	64.00	61.61	62.73	62.98	62.80	64.00	62.80	62.70	
layer1-1-$\alpha_{1}$	54.41	57.46	40.18	57.88	48.46	63.96	35.38	49.97	51.36	55.34	63.99	55.87	56.69	
layer1-1-$\alpha_{2}$	62.13	63.86	57.28	62.61	62.40	64.00	59.31	60.78	62.15	62.36	64.00	62.31	62.27	
layer2-0-$\alpha_{1}$	93.59	124.77	34.00	73.67	62.76	127.68	39.19	96.04	81.61	95.96	127.90	96.43	103.21	128
layer2-0-$\alpha_{2}$	124.92	127.16	109.17	113.52	118.17	127.99	117.88	122.93	124.42	125.25	128.00	124.63	125.37	128
layer2-1-$\alpha_{1}$	126.98	127.74	115.65	124.15	120.53	128.00	126.43	127.16	127.15	126.91	128.00	126.89	126.92	128
layer2-1-$\alpha_{2}$	127.74	127.45	126.59	126.63	127.16	128.00	127.61	127.39	127.80	127.76	128.00	127.71	127.73	128
layer3-0-$\alpha_{1}$	229.05	254.92	121.04	99.77	9.22	255.76	216.45	202.70	139.21	136.38	255.94	211.75	200.61	256
layer3-0-$\alpha_{2}$	249.93	254.62	157.35	144.87	202.20	255.97	253.61	219.02	248.38	249.63	255.98	239.71	248.04	256
layer3-1-$\alpha_{1}$	193.29	251.15	54.49	11.90	47.84	254.12	251.87	24.43	138.76	184.25	254.95	127.65	166.39	256
layer3-1-$\alpha_{2}$	255.58	254.88	253.48	254.05	254.07	255.99	255.87	254.83	255.72	255.59	255.99	255.53	255.52	256
layer4-0-$\alpha_{1}$	492.50	509.86	466.98	482.91	19.81	511.75	509.23	501.06	322.40	308.72	511.93	487.09	439.86	512
layer4-0-$\alpha_{2}$	457.24	493.82	411.01	77.99	398.85	511.91	507.03	478.88	496.18	476.10	511.92	382.92	440.78	512
layer4-1-$\alpha_{1}$	431.12	501.57	225.25	248.87	8.84	488.65	451.73	420.80	220.69	213.26	489.71	405.75	354.55	512
layer4-1-$\alpha_{2}$	499.01	509.62	509.24	507.28	497.76	511.55	510.73	509.15	506.38	499.75	511.55	499.69	497.14	512
pa-weight	11.43	3.21	3.95	7.29	5.60	2.79	3.45	3.39	9.62	11.74	2.74	11.65	12.10	512

Table 15: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{LL^{T}}$

Weight’s

Name

Image

-Net

Omni

-glot

Airc

-raft

Birds

Tex

-tures

Quick

-Draw

Fun

-gi

VGG

Flower

Traffic

Sign

-COCO

-IST

CIFAR

-10

CIFAR

-100

Full

Rank

layer1-0-

\alpha_{1}

63.97

63.95

63.99

63.98

63.99

64.00

63.99

64.00

63.99

63.98

64.00

63.96

63.97

layer1-0-

\alpha_{2}

63.97

63.96

64.00

63.99

64.00

63.99

63.98

64.00

63.97

layer1-1-

\alpha_{1}

63.92

63.90

63.99

63.96

63.99

64.00

63.99

63.97

63.95

64.00

63.91

63.92

layer1-1-

\alpha_{2}

63.97

63.99

64.00

63.99

63.98

64.00

63.97

layer2-0-

\alpha_{1}

127.87

127.97

127.92

127.96

128.00

127.98

127.96

127.91

128.00

127.84

127.86

128

layer2-0-

\alpha_{2}

127.97

127.99

127.98

128.00

127.99

127.98

128.00

127.97

128

layer2-1-

\alpha_{1}

127.99

128.00

127.99

127.98

128.00

127.99

128.00

127.99

128

layer2-1-

\alpha_{2}

127.99

128.00

127.99

128

layer3-0-

\alpha_{1}

255.95

256.00

255.96

255.94

255.66

256.00

255.97

255.98

255.96

256.00

255.94

255.95

256

layer3-0-

\alpha_{2}

255.98

256.00

255.95

255.96

255.90

256.00

255.99

255.98

256.00

255.97

255.98

256

layer3-1-

\alpha_{1}

255.99

256.00

255.93

255.91

255.59

256.00

255.98

255.97

255.99

255.98

256.00

255.99

256

layer3-1-

\alpha_{2}

256.00

255.98

255.99

256.00

255.99

256.00

255.99

256.00

256

layer4-0-

\alpha_{1}

511.97

512.00

511.91

511.95

510.54

512.00

511.95

511.97

511.95

512.00

511.97

511.96

512

layer4-0-

\alpha_{2}

511.96

511.98

511.79

511.81

511.66

512.00

511.90

511.92

511.95

511.96

512.00

511.96

511.97

512

layer4-1-

\alpha_{1}

511.90

511.97

511.54

511.78

507.13

511.92

511.58

511.84

511.77

511.76

511.93

511.91

511.89

512

layer4-1-

\alpha_{2}

511.97

511.99

511.97

511.81

512.00

511.97

511.98

512.00

511.97

512

pa-weight

494.46

481.54

494.69

495.60

490.04

486.75

484.23

493.87

490.07

493.48

486.75

495.33

495.25

512

Table 16: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{LL^{T}}$

Weight’s

Name

Image

-Net

Omni

-glot

Airc

-raft

Birds

Tex

-tures

Quick

-Draw

Fun

-gi

VGG

Flower

Traffic

Sign

-COCO

-IST

CIFAR

-10

CIFAR

-100

Full

Rank

layer1-0-

\alpha_{1}

63.97

63.95

63.99

63.98

63.99

64.00

63.99

64.00

63.99

63.98

64.00

63.96

63.97

layer1-0-

\alpha_{2}

63.97

63.96

64.00

63.99

64.00

63.99

63.98

64.00

63.97

layer1-1-

\alpha_{1}

63.92

63.90

63.99

63.96

63.99

64.00

63.99

63.97

63.94

64.00

63.91

63.92

layer1-1-

\alpha_{2}

63.97

63.99

64.00

63.99

63.98

64.00

63.97

layer2-0-

\alpha_{1}

127.87

127.97

127.93

127.96

128.00

127.98

127.95

127.91

128.00

127.85

127.86

128

layer2-0-

\alpha_{2}

127.97

127.99

127.98

127.99

128.00

127.99

127.98

128.00

127.97

128

layer2-1-

\alpha_{1}

127.99

128.00

127.99

128.00

127.99

128.00

127.99

128

layer2-1-

\alpha_{2}

127.99

128.00

127.99

128

layer3-0-

\alpha_{1}

255.95

256.00

255.97

255.94

255.78

256.00

255.97

255.98

255.96

256.00

255.94

255.95

256

layer3-0-

\alpha_{2}

255.98

256.00

255.95

255.94

256.00

255.99

255.98

256.00

255.97

255.98

256

layer3-1-

\alpha_{1}

255.99

256.00

255.93

255.90

255.74

256.00

255.98

255.97

255.99

255.98

256.00

255.99

256

layer3-1-

\alpha_{2}

256.00

255.98

255.99

256.00

256

layer4-0-

\alpha_{1}

511.97

512.00

511.91

511.95

511.06

512.00

511.95

511.97

511.94

512.00

511.97

512

layer4-0-

\alpha_{2}

511.96

511.98

511.80

511.79

511.78

512.00

511.90

511.92

511.96

512.00

511.96

511.97

512

layer4-1-

\alpha_{1}

511.89

511.97

511.55

511.76

508.74

511.92

511.58

511.84

511.79

511.73

511.92

511.91

511.89

512

layer4-1-

\alpha_{2}

511.97

511.99

511.97

511.87

512.00

511.97

511.98

511.97

512.00

511.97

512

pa-weight

494.37

481.54

494.72

495.46

490.94

486.75

484.26

493.87

490.64

494.00

486.75

495.13

495.09

512
