Task-Specific Preconditioner for Cross-Domain Few-Shot Learning

Suhyun Kang1, Jungwon Park2, Wonseok Lee3, Wonjong Rhee2,3 Corresponding author
Abstract

Cross-Domain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.

1 Introduction

Few-Shot Learning (FSL) aims to learn a model that can generalize to novel classes using a few labeled examples. Recent advancements in FSL have been significantly propelled by meta-learning methods (Snell, Swersky, and Zemel 2017; Finn, Abbeel, and Levine 2017; Sung et al. 2018; Oreshkin, Rodríguez López, and Lacoste 2018; Garnelo et al. 2018; Rajeswaran et al. 2019). These approaches have achieved outstanding results on single-domain FSL benchmarks such as Omniglot (Lake et al. 2011) and miniImagenet (Ravi and Larochelle 2016). However, recent studies (Chen et al. 2019; Tian et al. 2020) have revealed that many existing FSL methods struggle to generalize in the cross-domain setting, where the test data originates from domains that are either unknown or previously unseen. To study the challenge of generalization in cross-domain few-shot tasks, Triantafillou et al. (2019) introduced the Meta-Dataset, a more realistic, large-scale, and diverse benchmark. It includes multiple datasets from a variety of domains for both the meta-training and meta-testing phases.

Leveraging the Meta-Dataset, various Cross-Domain Few-Shot Learning (CDFSL) methods have been developed (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Dvornik, Schmid, and Mairal 2020; Liu et al. 2020; Guo et al. 2023; Tian et al. 2024), demonstrating significant advancements in this field. These approaches typically parameterize deep neural networks with a large set of task-agnostic parameters alongside a smaller set of task-specific parameters. Task-specific parameters are optimized to the target task through an adaptation mechanism, generally following one of two primary methodologies. The first approach utilizes an auxiliary network functioning as a parameter generator, which, upon receiving a few labeled examples from the target task, outputs optimized task-specific parameters (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2020, 2021). The second approach directly fine-tunes the task-specific parameters through gradient descent using a few labeled examples from the target task (Dvornik, Schmid, and Mairal 2020; Li, Liu, and Bilen 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2022; Tian et al. 2024).

Figure 1: All experiments are conducted based on TSA. (a) The optimal optimization strategy can vary significantly depending on the nature of the target task, leading to notable differences in performance on the Meta-Dataset. (b) Accuracy on the seen and unseen domains of the Meta-Dataset. Compared to the gradient descent baseline, adopting a preconditioner without a PD constraint can be unreliable. With a PD constraint, adapting the preconditioner to the target task becomes reliable. Further details on these preconditioners are provided in Appendix A.

Figure 2: Illustration of forming a Task-Specific Preconditioner based on three DSPs that have been meta-trained for three meta-training domains.

While both approaches have improved CDFSL performance through their adaptation mechanisms, a common limitation persists in the optimization strategies they employ. Specifically, both approaches use a fixed optimization strategy across different target tasks. However, Figure 1(a) shows that the optimal choice of optimizer may vary significantly depending on the given domain or target task. This implies that performance can be significantly improved by adapting the optimization strategy to align well with the target domain and task. However, devising an effective and reliable scheme for its implementation has been challenging.

One promising approach for establishing a robust adaptive optimization scheme is to leverage Preconditioned Gradient Descent (PGD) (Himmelblau et al. 2018). PGD operates by specifying a preconditioning matrix, often referred to as a preconditioner, which re-scales the geometry of the parameter space. In the field of machine learning, previous research has shown that if the preconditioner is positive definite (PD), it establishes a valid Riemannian metric, which represents the geometric characteristics (e.g., curvature) of the parameter space and steers preconditioned gradients in the direction of steepest descent (Amari 1967, 1996, 1998; Amari and Douglas 1998). While the effectiveness of positive definiteness in PGD is supported by existing theoretical findings, its efficacy as an adaptive optimization scheme in CDFSL can be examined through a simple comparison. In Figure 1(b), we compare PGD with and without a PD constraint for the preconditioner on the Meta-Dataset. Without a PD constraint, PGD shows markedly inferior performance, especially in unseen domains. Conversely, with a PD constraint, PGD consistently exhibits performance improvements across seen and unseen domains compared to the baseline using GD. This supports the pivotal role of positive definiteness in PGD for CDFSL.

Inspired by these findings, we introduce a novel adaptation mechanism named Task-Specific Preconditioned gradient descent (TSP). In our approach, we establish a Task-Specific Preconditioner that is constrained to be positive definite and adapt it to the specific nature of the target task. This preconditioner consists of two components. The first component is the Domain-Specific Preconditioners (DSPs), which are uniquely defined for each meta-training domain and meta-trained on tasks sampled from these domains through bi-level optimization during the meta-training phase. The second component is the task-coefficient, which approximates the compatibility between the target task and each meta-training domain. Figure 2 illustrates the construction of the Task-Specific Preconditioner. For a given target task $\mathcal{T}$, the Task-Specific Preconditioner $\mathbf{P}_{\mathcal{T}}$ is constructed by linearly combining the DSPs $\mathbf{P}_{k}$ from multiple seen domains, with each weighted by the corresponding task-coefficient $p_{\mathcal{T},k}$. This process produces a preconditioner specifically adapted to the geometric characteristics of the target task's parameter space. By integrating knowledge from multiple seen domains, TSP distinguishes itself from traditional PGD techniques, such as GAP (Kang et al. 2023), which are discussed further in Section 6. Applying our approach to state-of-the-art CDFSL methods, such as TSA or TA2-Net, significantly enhances performance on Meta-Dataset. For example, in multi-domain settings, applying TSP to TA2-Net (Guo et al. 2023) achieves the best performance across all datasets.

2 Related Works

Meta-Learning for Few-Shot Learning

Until recently, numerous approaches in the field of few-shot learning have adopted the meta-learning framework. These approaches can be mainly divided into three types: metric-based, model-based, and optimization-based methods. Metric-based methods (Garcia and Bruna 2017; Sung et al. 2018; Snell, Swersky, and Zemel 2017; Oreshkin, Rodríguez López, and Lacoste 2018) train a feature encoder to extract features from support and query samples. They employ a nearest neighbor classifier with various distance functions to calculate similarity scores for predicting the labels of query samples. Model-based methods (Santoro et al. 2016; Munkhdalai and Yu 2017; Mishra et al. 2017; Garnelo et al. 2018) train an encoder to generate task-specific models from a few support samples. Optimization-based methods (Ravi and Larochelle 2016; Finn, Abbeel, and Levine 2017; Yoon et al. 2018; Rajeswaran et al. 2019) train a model that can quickly adapt to new tasks with a few support samples, employing a bi-level optimization. In our method, we employ the bi-level optimization used in the optimization-based methods.

Cross-Domain Few-Shot Learning (CDFSL)

Recent CDFSL methods define the universal model as a deep neural network and partition it into task-agnostic and task-specific parameters. The task-agnostic parameters represent generic characteristics that are valid for a range of tasks from various domains. The task-specific parameters, on the other hand, represent adaptable attributes that are optimized to the target tasks through an adaptation mechanism. Task-agnostic parameters can be designed as a single network or as multiple networks. The single network is trained on a large dataset from a single domain (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021) or multiple domains (Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), whereas the multiple networks are trained individually on each domain (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020). Task-specific parameters can be designed as selection parameters (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020), a pre-classifier transformation (Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), a Feature-wise Linear Modulation (FiLM) layer (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021), or a Residual Adapter (RA) (Li, Liu, and Bilen 2022; Guo et al. 2023). As the adaptation mechanism for the task-specific parameters, several studies (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2020, 2021) meta-learn an auxiliary network, which generates task-specific parameters adapted to the target task. Other studies (Dvornik, Schmid, and Mairal 2020; Li, Liu, and Bilen 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2022) employ gradient descent to adapt task-specific parameters to the target task. In our work, we propose a novel adaptation mechanism in the form of a task-specific optimizer, which adapts task-specific parameters to the target task.

Preconditioned Gradient Descent in Meta-Learning

In meta-learning, several optimization-based approaches (Li et al. 2017; Lee and Choi 2018; Park and Oliva 2019; Rajasegaran et al. 2020; Simon et al. 2020; Zhao et al. 2020; Von Oswald et al. 2021; Kang et al. 2023) have incorporated Preconditioned Gradient Descent (PGD) to adapt the network's parameters to the target task (i.e., inner-level optimization). They meta-learn a preconditioning matrix, called a preconditioner, which is used to precondition the gradient. The preconditioner was kept static in most of the previous works (Li et al. 2017; Lee and Choi 2018; Park and Oliva 2019; Zhao et al. 2020; Von Oswald et al. 2021). Several prior studies have devised preconditioners that adapt per inner step (Rajasegaran et al. 2020), per task (Simon et al. 2020), or both simultaneously (Kang et al. 2023). Motivated by previous works (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998), Kang et al. (2023) recently investigated constraining the preconditioner to satisfy the condition for a Riemannian metric (i.e., positive definiteness). They demonstrated that enforcing this constraint on the preconditioner was essential for improving performance in few-shot learning. In our study, we propose a novel preconditioned gradient descent method with a meta-learned task-specific preconditioner that guarantees positive definiteness for improving performance in CDFSL.

3 Backgrounds

Task Formulation for Meta-Learning in CDFSL

In CDFSL, task $\mathcal{T}$ is formulated differently compared to traditional few-shot learning. In traditional few-shot learning, tasks are sampled from a single domain, resulting in the same form in both meta-training and meta-testing:

$$\text{meta-training and meta-testing: }\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\} \tag{1}$$

where $\mathcal{S}_{\mathcal{T}}$ is a support set and $\mathcal{Q}_{\mathcal{T}}$ is a query set. On the other hand, in CDFSL, tasks are sampled from multiple domains, leading to different forms in meta-training and meta-testing:

$$\begin{split}&\text{meta-training: }\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\},\\ &\text{meta-testing: }\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\},\end{split} \tag{2}$$

where $d_{\mathcal{T}}$ is a domain label indicating the domain from which the task was sampled. For instance, the domain label is an integer between $1$ and $K$ for $K$ domains (i.e., $1\leq d_{\mathcal{T}}\leq K$).

Bi-level Optimization in Meta-Learning

Bi-level optimization (Rajeswaran et al. 2019) consists of two levels of optimization: inner-level and outer-level. Let $f_{\theta(\phi)}$ be a model, where the parameter $\theta(\phi)$ is parameterized by the meta-parameter $\phi$. For a task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$, the inner-level optimization is defined as:

$$\theta_{\mathcal{T},T}(\phi)=\theta_{\mathcal{T},0}(\phi)-\alpha_{\text{in}}\cdot\sum_{t=0}^{T-1}\nabla_{\theta}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t}(\phi);\mathcal{S}_{\mathcal{T}}) \tag{3}$$

where $\theta_{\mathcal{T},0}(\phi)=\theta(\phi)$, $\alpha_{\text{in}}$ is the learning rate for the inner-level optimization, $\mathcal{L}_{\text{in}}$ is the inner-level loss function, and $T$ is the total number of gradient descent steps. With $\mathcal{Q}_{\mathcal{T}}$ in each task, we can define the outer-level optimization as:

$$\phi\leftarrow\phi-\alpha_{\text{out}}\cdot\nabla_{\phi}\mathbb{E}_{\mathcal{T}}\Big[\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T}(\phi);\mathcal{Q}_{\mathcal{T}})\Big] \tag{4}$$

where $\alpha_{\text{out}}$ is the learning rate for the outer-level optimization, and $\mathcal{L}_{\text{out}}$ is the outer-level loss function.
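As a minimal sketch of Eqs. (3) and (4), consider a scalar toy problem in which the meta-parameter $\phi$ is simply the initialization $\theta_{\mathcal{T},0}$. The quadratic losses, the function names, and the task values below are assumptions made for illustration; they are not part of the method described in this paper.

```python
# Toy bi-level optimization on scalars (illustrative assumptions throughout).
# Inner loss: 0.5 * (theta - s)^2 on a "support" target s.
# Outer loss: 0.5 * (theta_T - q)^2 on a "query" target q.
# The meta-parameter phi is the initialization theta_0.

def inner_adapt(phi, s, alpha_in, T):
    """Run T inner gradient steps starting from theta_0 = phi (Eq. 3)."""
    theta = phi
    for _ in range(T):
        grad = theta - s                  # derivative of the inner loss
        theta = theta - alpha_in * grad
    return theta

def outer_grad(phi, s, q, alpha_in, T):
    """Derivative of the outer loss w.r.t. phi, taken through the inner loop."""
    theta_T = inner_adapt(phi, s, alpha_in, T)
    # For this quadratic, every inner step scales d(theta)/d(phi) by (1 - alpha_in).
    dtheta_dphi = (1.0 - alpha_in) ** T
    return (theta_T - q) * dtheta_dphi

def meta_train(tasks, phi=0.0, alpha_in=0.1, alpha_out=0.5, T=3, iters=200):
    """Outer loop (Eq. 4): average the outer gradient over tasks, then update phi."""
    for _ in range(iters):
        g = sum(outer_grad(phi, s, q, alpha_in, T) for s, q in tasks) / len(tasks)
        phi = phi - alpha_out * g
    return phi

tasks = [(1.0, 1.2), (2.0, 1.8), (3.0, 3.1)]   # hypothetical (support, query) targets
phi_star = meta_train(tasks)
```

The point of the sketch is that the outer gradient depends on how $\theta_{\mathcal{T},T}$ changes with $\phi$, which is why the inner loop must be differentiated through.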

Preconditioned Gradient Descent (PGD)

PGD is a technique that minimizes empirical risk by using a gradient update with a preconditioner that re-scales the geometry of the parameter space. Given model parameters $\theta$ and task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$, we can formally define the preconditioned gradient descent with a preconditioner $\mathbf{P}$ as follows:

$$\theta_{\mathcal{T},t}=\theta_{\mathcal{T},t-1}-\alpha\cdot\mathbf{P}\nabla_{\theta}\mathcal{L}(\theta_{\mathcal{T},t-1};\mathcal{S}_{\mathcal{T}}),\quad t=1,\cdots \tag{5}$$

where $\theta_{\mathcal{T},0}=\theta$, $\mathcal{L}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}})$ is the empirical loss associated with the task $\mathcal{T}$, and $\theta_{\mathcal{T},t}$ denotes the parameters at step $t$. When the preconditioner $\mathbf{P}$ is chosen to be the identity matrix $\mathbf{I}$, Eq. (5) becomes the standard Gradient Descent (GD). The choice of $\mathbf{P}$ to leverage second-order information offers several options, including the inverse Fisher information matrix $\mathbf{F}^{-1}$, leading to the Natural Gradient Descent (NGD) (Amari 1998), the inverse Hessian matrix $\mathbf{H}^{-1}$, corresponding to Newton's method (LeCun et al. 2002), and the diagonal matrix estimation with the past gradients, which results in adaptive gradient methods (Duchi, Hazan, and Singer 2011; Kingma and Ba 2014). They often reduce the effect of pathological curvature and speed up the optimization (Amari et al. 2020).
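To make Eq. (5) concrete, the following sketch runs PGD on an ill-conditioned two-dimensional quadratic, once with $\mathbf{P}=\mathbf{I}$ (plain GD) and once with $\mathbf{P}=\mathbf{H}^{-1}$ (a Newton-like step). The quadratic loss, the helper names, and the step sizes are illustrative assumptions.

```python
# PGD on the toy loss 0.5 * theta^T H theta (an assumption for illustration).

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def loss(H, theta):
    return 0.5 * sum(t * g for t, g in zip(theta, matvec(H, theta)))

def pgd(P, H, theta0, alpha, steps):
    """Eq. (5): theta_t = theta_{t-1} - alpha * P * grad."""
    theta = list(theta0)
    for _ in range(steps):
        grad = matvec(H, theta)           # gradient of the quadratic loss
        step = matvec(P, grad)            # preconditioned gradient
        theta = [t - alpha * s for t, s in zip(theta, step)]
    return theta

H = [[100.0, 0.0], [0.0, 1.0]]            # curvature differs by 100x across axes
I = [[1.0, 0.0], [0.0, 1.0]]              # P = I recovers standard GD
H_inv = [[0.01, 0.0], [0.0, 1.0]]         # P = H^{-1} corresponds to Newton's method

theta0 = [1.0, 1.0]
gd_result = pgd(I, H, theta0, alpha=0.01, steps=50)
newton_result = pgd(H_inv, H, theta0, alpha=1.0, steps=1)
```

With the identity preconditioner, the step size is limited by the steepest axis, so the flat axis converges slowly; the inverse-Hessian preconditioner rescales both axes and reaches the minimizer of this quadratic in one step.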

Dataset Classifier

In CDFSL, the Dataset Classifier (Triantafillou et al. 2021) reads a support set in a few-shot task and predicts from which of the training datasets it was sampled. Formally, let $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}$ be a train task sampled from $K$ domains. Let $g$ be a dataset classifier that takes the support set $\mathcal{S}_{\mathcal{T}}$ as input and generates logits as follows:

$$g(\mathcal{S}_{\mathcal{T}})=z_{\mathcal{T}}=(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K})\in\mathbb{R}^{K} \tag{6}$$

In (Triantafillou et al. 2021), the dataset classifier $g$ is trained to minimize the cross-entropy loss for the dataset classification problem (i.e., a classification problem with $K$ classes).

4 Method

Figure 3: (a) PGD with Domain-Specific Preconditioner (DSP) in the inner-level optimization. During meta-training, for a train task $\mathcal{T}$, the DSP is chosen based on the domain label $d_{\mathcal{T}}$, and each task-specific parameter $\theta^{l}$ is optimized using PGD with the selected DSP $\mathbf{P}^{l}_{d_{\mathcal{T}}}$. (b) PGD with Task-Specific Preconditioner. During meta-testing, for a test task, each Task-Specific Preconditioner $\mathbf{P}^{l}_{\mathcal{T}}$ is constructed using DSPs and task-coefficients generated by the Dataset Classifier. Each task-specific parameter $\theta^{l}$ is then optimized using PGD with $\mathbf{P}^{l}_{\mathcal{T}}$.

In this section, we propose a novel adaptation mechanism named Task-Specific Preconditioned gradient descent (TSP). We first introduce Domain-Specific Preconditioner (DSP) and task-coefficients. Then, we describe the construction of Task-Specific Preconditioner using DSP and task-coefficients. Lastly, we show the positive definiteness of Task-Specific Preconditioner, which establishes it as a valid Riemannian metric. The algorithm for the training and testing procedures is provided in Appendix B.

4.1 Domain-Specific Preconditioner (DSP)

Consider $L$ task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}_{l=1}^{L}$. For $K$ domains, we first define meta-parameters $\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$ as follows:

$$\mathcal{M}_{k}=\{\mathbf{M}^{l}_{k}\in\mathbb{R}^{m_{l}\times m_{l}}\}_{l=1}^{L},\quad k=1,\cdots,K \tag{7}$$

Then, for all $l$, we define Domain-Specific Preconditioners (DSPs) $\mathbf{P}^{l}_{k}$ using the meta-parameters as follows:

$$\mathbf{P}^{l}_{k}={\mathbf{M}^{l}_{k}}^{\mathbf{T}}\mathbf{M}^{l}_{k}+\mathbf{I},\quad k=1,\cdots,K \tag{8}$$

We compare various DSP designs (see Table 3) in Section 5.3 and choose the form of Eq. (8). Through bi-level optimization, DSPs can be meta-learned as follows.

Inner-level Optimization

For each train task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}$, in the inner-level optimization, we optimize the task-specific parameters $\theta$ through preconditioned gradient descent using $\mathbf{P}^{l}_{d_{\mathcal{T}}}$, updating $\theta$ as follows:

$$\theta^{l}_{\mathcal{T},T}=\theta^{l}_{\mathcal{T},0}-\alpha_{\text{in}}\cdot\sum^{T-1}_{t=0}\mathbf{P}^{l}_{d_{\mathcal{T}}}\nabla_{\theta^{l}_{\mathcal{T},t}}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}}), \tag{9}$$

where $\theta^{l}_{\mathcal{T},0}=\theta^{l}$, $\alpha_{\text{in}}$ is the learning rate for the inner-level optimization, $T$ is the total number of gradient descent steps, and $\mathcal{L}_{\text{in}}$ is the inner-level loss function.

Outer-level Optimization

In the outer-level optimization, we meta-learn the meta-parameters $\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$ as follows:

$$\mathcal{M}_{k}\leftarrow\mathcal{M}_{k}-\alpha_{\text{out}}\cdot\nabla_{\mathcal{M}_{k}}\mathbb{E}_{\mathcal{T}}\Big[\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})\Big],\quad k=1,\cdots,K \tag{10}$$

where $\alpha_{\text{out}}$ is the learning rate for the outer-level optimization and $\mathcal{L}_{\text{out}}$ is the outer-level loss function.
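The DSP construction of Eq. (8) and the inner-level update of Eq. (9) can be sketched for a single layer as follows. The toy Frobenius loss, the helper names (`make_dsp`, `inner_pgd`), and the numeric values are assumptions for illustration; the outer-level update of Eq. (10) would additionally require differentiating through this loop and is omitted here.

```python
# Single-layer sketch of Eqs. (8) and (9) with plain Python lists (illustrative only).

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def make_dsp(M):
    """Eq. (8): P = M^T M + I for a square meta-parameter matrix M."""
    m = len(M)
    G = matmul(transpose(M), M)
    return [[G[i][j] + (1.0 if i == j else 0.0) for j in range(m)] for i in range(m)]

def inner_pgd(theta, P, grad_fn, alpha_in, T):
    """Eq. (9): T preconditioned gradient steps on one task-specific parameter."""
    for _ in range(T):
        step = matmul(P, grad_fn(theta))
        theta = [[t - alpha_in * s for t, s in zip(tr, sr)]
                 for tr, sr in zip(theta, step)]
    return theta

# Hypothetical setup: K = 2 domains, one 2x2 parameter,
# toy inner loss 0.5 * ||theta - target||_F^2.
meta_params = {1: [[0.5, 0.0], [0.2, 0.3]], 2: [[0.0, 1.0], [1.0, 0.0]]}
dsps = {k: make_dsp(M) for k, M in meta_params.items()}
target = [[1.0, 2.0], [3.0, 4.0]]
grad_fn = lambda th: [[t - g for t, g in zip(tr, gr)] for tr, gr in zip(th, target)]

d_T = 1                                    # domain label of the sampled train task
theta0 = [[0.0, 0.0], [0.0, 0.0]]
theta_T = inner_pgd(theta0, dsps[d_T], grad_fn, alpha_in=0.1, T=5)
```

During meta-training only the DSP indexed by the domain label is used, so each $\mathcal{M}_{k}$ receives gradient signal from tasks of its own domain.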

4.2 Task-coefficients

Consider the dataset classifier $g$. Given a train task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}$, we define task-coefficients $p_{\mathcal{T},1},\cdots,p_{\mathcal{T},K}$ as follows:

$$(p_{\mathcal{T},1},\cdots,p_{\mathcal{T},K})=\text{Softmax}(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K}) \tag{11}$$

where $g(\mathcal{S}_{\mathcal{T}})=(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K})$. Note that we use the sigmoid function instead of softmax in the single-domain setting because the output dimension of the dataset classifier is one. While Triantafillou et al. (2021) update the parameters of $g$ to minimize only the cross-entropy loss $\mathcal{L}_{\text{CE}}$ with respect to the dataset label $d_{\mathcal{T}}$, we train the dataset classifier $g$ to minimize the following augmented loss:

$$\mathcal{L}_{\text{CE}}+\lambda\cdot\mathcal{L}_{\text{Aux}} \tag{12}$$

where $\lambda$ is a regularization parameter and $\mathcal{L}_{\text{Aux}}$ is the auxiliary loss, defined as follows:

$$\mathcal{L}_{\text{Aux}}=\mathbb{E}_{\mathcal{T}}\Big[\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})\Big] \tag{13}$$

Here, task-specific parameters $\theta^{l}_{\mathcal{T},T}$ can be obtained as follows:

$$\theta^{l}_{\mathcal{T},T}=\theta^{l}_{\mathcal{T},0}-\alpha_{\text{in}}\cdot\sum^{T-1}_{t=0}\sum^{K}_{k=1}p_{\mathcal{T},k}\cdot\mathbf{P}^{l}_{k}\nabla_{\theta^{l}_{\mathcal{T},t}}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}}) \tag{14}$$

where $\mathbf{P}^{l}_{k}$ is the $l$-th DSP of domain $k$. In Eq. (12), the cross-entropy loss guides the dataset classifier to prioritize the ground-truth domain of the support set. Concurrently, the auxiliary loss guides the classifier toward DSPs that minimize any adverse effects on query-set performance during the inner-level optimization.
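A minimal sketch of Eqs. (11) and (12) is given below. The logits, the auxiliary-loss value, and the function names are hypothetical; in the actual method the auxiliary loss comes from the query set after the inner-level update of Eq. (14), whereas here it is passed in as a plain number. The default $\lambda=0.1$ follows the value reported in Section 5.1.

```python
import math

def softmax(logits):
    """Numerically stable softmax over the K dataset-classifier logits (Eq. 11)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Used in place of softmax when the classifier has a single output."""
    return 1.0 / (1.0 + math.exp(-z))

def augmented_loss(logits, d_T, aux_loss, lam=0.1):
    """Eq. (12): cross-entropy on the domain label plus lam times the auxiliary loss.

    d_T is 1-indexed, matching the domain labels in the text; aux_loss is
    supplied by the caller (an assumption of this sketch).
    """
    p = softmax(logits)
    ce = -math.log(p[d_T - 1])
    return ce + lam * aux_loss

z = [2.0, 0.5, -1.0]                  # hypothetical logits g(S_T) for K = 3 domains
p = softmax(z)                        # task-coefficients, non-negative and summing to 1
total_loss = augmented_loss(z, d_T=1, aux_loss=0.8)
```

Because the coefficients come from a softmax, they lie in $[0,1]$ and sum to one, which is the condition assumed on the task-coefficients in Theorem 1.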

Table 1: Performance comparison to state-of-the-art methods in a multi-domain setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP denotes TSP applied on TSA. TSP†† denotes TSP applied on TA2-Net.

| Test Dataset | SUR | URT | FLUTE | tri-M | URL | TSA | TA2-Net | MOKD | TSP | TSP†† |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ImageNet | 56.2±1.0 | 56.8±1.1 | 58.6±1.0 | 51.8±1.1 | 58.8±1.1 | 59.5±1.0 | 59.6±1.0 | 57.3±1.1 | 60.5±1.0 | 60.7±1.0 |
| Omniglot | 94.1±0.4 | 94.2±0.4 | 92.0±0.6 | 93.2±0.5 | 94.5±0.4 | 94.9±0.4 | 95.5±0.4 | 94.2±0.5 | 95.6±0.4 | 96.0±0.4 |
| Aircraft | 85.5±0.5 | 85.8±0.5 | 82.8±0.7 | 87.2±0.5 | 89.4±0.4 | 89.9±0.4 | 90.5±0.4 | 88.4±0.5 | 90.5±0.4 | 91.2±0.4 |
| Birds | 71.0±1.0 | 76.2±0.8 | 75.3±0.8 | 79.2±0.8 | 80.7±0.8 | 81.1±0.8 | 81.4±0.8 | 80.4±0.8 | 82.3±0.7 | 82.5±0.7 |
| Textures | 71.0±0.8 | 71.6±0.7 | 71.2±0.8 | 68.8±0.8 | 77.2±0.7 | 77.5±0.7 | 77.4±0.7 | 76.5±0.7 | 78.6±0.6 | 79.1±0.6 |
| Quick Draw | 81.8±0.6 | 82.4±0.6 | 77.3±0.7 | 79.5±0.7 | 82.5±0.6 | 81.7±0.6 | 82.5±0.6 | 82.2±0.6 | 83.0±0.7 | 83.2±0.6 |
| Fungi | 64.3±0.9 | 64.0±1.0 | 48.5±1.0 | 58.1±1.1 | 68.1±0.9 | 66.3±0.8 | 66.3±0.9 | 68.6±1.0 | 68.6±0.9 | 69.7±0.8 |
| VGG Flower | 82.9±0.8 | 87.9±0.6 | 90.5±0.5 | 91.6±0.6 | 92.0±0.5 | 92.2±0.5 | 92.6±0.4 | 92.5±0.5 | 93.3±0.4 | 93.4±0.4 |
| Traffic Sign | 51.0±1.1 | 48.2±1.1 | 63.0±1.0 | 58.4±1.1 | 63.3±1.1 | 82.8±1.0 | 87.4±0.8 | 64.5±1.1 | 88.5±0.7 | 89.4±0.8 |
| MSCOCO | 52.0±1.1 | 51.5±1.1 | 52.8±1.1 | 50.0±1.0 | 57.3±1.0 | 57.6±1.0 | 57.9±0.9 | 55.5±1.0 | 58.5±0.9 | 59.8±0.9 |
| MNIST | 94.3±0.4 | 90.6±0.5 | 96.2±0.3 | 95.6±0.5 | 94.7±0.4 | 96.7±0.4 | 97.0±0.4 | 95.1±0.4 | 97.1±0.3 | 97.1±0.4 |
| CIFAR-10 | 66.5±0.9 | 67.0±0.8 | 75.4±0.8 | 78.6±0.7 | 74.2±0.8 | 82.9±0.7 | 82.1±0.8 | 72.8±0.8 | 83.5±0.7 | 83.7±0.8 |
| CIFAR-100 | 56.9±1.1 | 57.3±1.0 | 62.0±1.0 | 67.1±1.0 | 63.5±1.0 | 70.4±0.9 | 70.9±0.9 | 63.9±1.0 | 71.3±1.0 | 72.2±0.9 |
| Avg Seen | 75.9 | 77.4 | 74.5 | 76.2 | 80.4 | 80.4 | 80.7 | 80.0 | 81.6 | 82.0 |
| Avg Unseen | 64.1 | 62.9 | 69.9 | 69.9 | 70.6 | 78.1 | 79.1 | 70.3 | 79.8 | 80.4 |
| Avg All | 71.3 | 71.8 | 72.7 | 73.8 | 76.6 | 79.5 | 80.1 | 76.3 | 80.9 | 81.4 |
| Avg Rank | 8.8 | 8.2 | 8.0 | 7.8 | 5.5 | 4.3 | 3.2 | 5.8 | 1.9 | 1.0 |

4.3 Task-Specific Preconditioner

Given a test task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$, we define the Task-Specific Preconditioner $\mathbf{P}^{l}_{\mathcal{T}}$ as follows:

$$\mathbf{P}^{l}_{\mathcal{T}}=\sum^{K}_{k=1}p_{\mathcal{T},k}\cdot\mathbf{P}^{l}_{k},\quad l=1,\cdots,L \tag{15}$$

where $\mathbf{P}^{l}_{k}$ is the $l$-th DSP of domain $k$, and $p_{\mathcal{T},k}$ is the task-coefficient for the given task $\mathcal{T}$ and domain $k$. By employing $\mathbf{P}^{l}_{\mathcal{T}}$ as the preconditioning matrix, we can define Task-Specific Preconditioned gradient descent (TSP) as follows:

$$\theta^{l}_{\mathcal{T},T}=\theta^{l}_{\mathcal{T},0}-\beta\cdot\sum^{T-1}_{t=0}\mathbf{P}^{l}_{\mathcal{T}}\nabla_{\theta^{l}_{\mathcal{T},t}}\mathcal{L}_{\text{in}}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}}), \tag{16}$$

where $\beta$ is the learning rate used to adapt the task-specific parameters.
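Eqs. (15) and (16) amount to a weighted sum of matrices followed by preconditioned gradient steps. The sketch below shows this for one layer at meta-test time; the DSP values, the task-coefficients, the toy loss, and the helper names are all assumptions made for illustration.

```python
# Meta-test sketch of Eqs. (15) and (16) for a single layer (illustrative only).

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def combine_dsps(coeffs, dsps):
    """Eq. (15): P_T = sum_k p_k * P_k."""
    m = len(dsps[0])
    return [[sum(p * P[i][j] for p, P in zip(coeffs, dsps)) for j in range(m)]
            for i in range(m)]

def tsp_adapt(theta, P_T, grad_fn, beta, T):
    """Eq. (16): T gradient steps preconditioned by the task-specific P_T."""
    for _ in range(T):
        step = matmul(P_T, grad_fn(theta))
        theta = [[t - beta * s for t, s in zip(tr, sr)]
                 for tr, sr in zip(theta, step)]
    return theta

# Hypothetical meta-trained DSPs for K = 2 domains (both positive definite).
dsps = [[[1.5, 0.2], [0.2, 1.1]], [[2.0, 0.0], [0.0, 2.0]]]
coeffs = [0.75, 0.25]                  # hypothetical task-coefficients from Eq. (11)
P_T = combine_dsps(coeffs, dsps)

# Toy support loss 0.5 * ||theta - target||_F^2 standing in for L_in.
target = [[1.0, -1.0], [0.5, 2.0]]
grad_fn = lambda th: [[t - g for t, g in zip(tr, gr)] for tr, gr in zip(th, target)]
theta_T = tsp_adapt([[0.0, 0.0], [0.0, 0.0]], P_T, grad_fn, beta=0.1, T=40)
```

Note that no domain label is needed here: the test task influences the optimizer only through the coefficients predicted from its support set.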

4.4 Positive Definiteness of TSP’s Preconditioner

A preconditioner satisfying positive definiteness ensures a valid Riemannian metric, which represents the geometric characteristics of the parameter space (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998). The Task-Specific Preconditioner $\mathbf{P}^{l}_{\mathcal{T}}$ is designed to be a positive definite matrix, which is verified in Theorem 1.

Theorem 1.

Let $p_{k}\in[0,1]$, $k=1,\cdots,K$, be the task-coefficients satisfying $\sum^{K}_{k=1}p_{k}=1$. For the Domain-Specific Preconditioners $\mathbf{P}_{k}\in\mathbb{R}^{m\times m}$, $k=1,\cdots,K$, the Task-Specific Preconditioner $\mathbf{P}$ defined as $\mathbf{P}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}$ is positive definite.

The proof is provided in Appendix C. Drawing from prior research (Amari 1967, 1996, 1998; Kakade 2001; Amari and Douglas 1998), a preconditioner satisfying positive definiteness promotes gradients to point toward the steepest descent direction while avoiding undesirable paths in the parameter space. As shown in Figure 1(b), positive definiteness improves CDFSL performance, especially in unseen domains. In Section 6, we will discuss why this property helps in CDFSL.
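One way to see why the statement holds, assuming each DSP has the form of Eq. (8), is the following short argument. It is offered as a sketch and may differ from the proof given in Appendix C.

```latex
% Sketch under the assumption P_k = M_k^T M_k + I (Eq. 8); x is any nonzero vector.
\begin{align*}
x^{\mathbf{T}}\mathbf{P}_{k}x
  &= x^{\mathbf{T}}\mathbf{M}_{k}^{\mathbf{T}}\mathbf{M}_{k}x + x^{\mathbf{T}}x
   = \lVert\mathbf{M}_{k}x\rVert^{2} + \lVert x\rVert^{2}
   \;\geq\; \lVert x\rVert^{2} \;>\; 0, \\
x^{\mathbf{T}}\mathbf{P}x
  &= \sum_{k=1}^{K}p_{k}\,x^{\mathbf{T}}\mathbf{P}_{k}x
   \;\geq\; \Big(\sum_{k=1}^{K}p_{k}\Big)\lVert x\rVert^{2}
   = \lVert x\rVert^{2} \;>\; 0 .
\end{align*}
```

Each $\mathbf{P}_{k}$ is symmetric, so $\mathbf{P}$ is symmetric as well, and the second line uses only $p_{k}\geq 0$ and $\sum_{k}p_{k}=1$.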

5 Experiments

5.1 Experimental Setup

Implementation Details

In the experiments, we use Meta-Dataset (Triantafillou et al. 2019), which is the standard benchmark for evaluating the performance of CDFSL. To demonstrate the effectiveness of TSP as an adaptation mechanism, we apply it to the state-of-the-art CDFSL methods, TSA (Li, Liu, and Bilen 2022) and TA2-Net (Guo et al. 2023), which are publicly available as open source. Following previous studies (Bateni et al. 2022; Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), we adopt ResNet-18 as the backbone for the feature extractor. In all experiments, we follow the standard protocol described in (Triantafillou et al. 2019). For the Dataset Classifier loss, the weighting factor $\lambda$ is set to 0.1, as it performs best among the values compared in Appendix D.1. Details of the Meta-Dataset, hyper-parameters, and additional implementation are available in Appendix E.

Baselines

For the baselines, we compare our methods to the state-of-the-art CDFSL methods, including BOHB (Saikia, Brox, and Schmid 2020), SUR (Dvornik, Schmid, and Mairal 2020), URT (Liu et al. 2020), Simple-CNAPS (Bateni et al. 2020), FLUTE (Triantafillou et al. 2021), tri-M (Liu et al. 2021), URL (Li, Liu, and Bilen 2021), TSA (Li, Liu, and Bilen 2022), TA2-Net (Guo et al. 2023), ALFA (Baik et al. 2023)+Proto-MAML, GAP+Proto-MAML (Kang et al. 2023), and MOKD (Tian et al. 2024).

Table 2: Performance comparison to state-of-the-art methods in a single-domain setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP denotes TSP applied on TSA. TSP†† denotes TSP applied on TA2-Net.

| Test Dataset | ALFA+Proto-MAML | BOHB | GAP+Proto-MAML | FLUTE | TSA | TA2-Net | MOKD | TSP | TSP†† |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ImageNet | 52.8±1.1 | 51.9±1.1 | 56.7 | 46.9±1.1 | 59.5±1.1 | 59.3±1.1 | 57.3±1.1 | 60.1±1.1 | 60.6±1.1 |
| Omniglot | 61.9±1.5 | 67.6±1.2 | 77.6 | 61.6±1.4 | 78.2±1.2 | 81.1±1.1 | 70.9±1.3 | 83.3±1.1 | 85.2±1.1 |
| Aircraft | 63.4±1.1 | 54.1±0.9 | 68.5 | 48.5±1.0 | 72.2±1.0 | 72.6±0.9 | 59.8±1.0 | 73.2±1.0 | 73.5±1.1 |
| Birds | 69.8±1.1 | 70.7±0.9 | 73.5 | 47.9±1.0 | 74.9±0.9 | 75.1±0.9 | 73.6±0.9 | 76.0±0.9 | 76.6±0.9 |
| Textures | 70.8±0.9 | 68.3±0.8 | 71.4 | 63.8±0.8 | 77.3±0.7 | 76.8±0.8 | 76.1±0.7 | 78.2±0.7 | 78.3±0.7 |
| Quick Draw | 59.2±1.2 | 50.3±1.0 | 65.4 | 57.5±1.0 | 67.6±0.9 | 68.4±0.9 | 61.2±1.0 | 70.8±0.9 | 71.5±0.9 |
| Fungi | 41.5±1.2 | 41.4±1.1 | 38.6 | 31.8±1.0 | 44.7±1.0 | 45.3±1.0 | 47.0±1.1 | 46.6±1.0 | 47.0±1.0 |
| VGG Flower | 86.0±0.8 | 87.3±0.6 | 86.8 | 80.1±0.9 | 90.9±0.6 | 91.0±0.6 | 88.5±0.6 | 91.8±0.5 | 92.2±0.6 |
| Traffic Sign | 60.8±1.3 | 51.8±1.0 | 66.9 | 46.5±1.1 | 82.5±0.8 | 84.1±0.7 | 61.6±1.1 | 87.5±0.8 | 88.7±0.8 |
| MSCOCO | 48.1±1.1 | 48.0±1.0 | 46.8 | 41.4±1.0 | 59.0±1.0 | 58.0±1.0 | 55.3±1.0 | 59.4±1.0 | 58.6±1.0 |
| MNIST | - | - | 94.0 | 80.8±0.8 | 93.9±0.6 | 94.9±0.5 | 88.3±0.7 | 94.5±0.5 | 95.3±0.6 |
| CIFAR-10 | - | - | 74.5 | 65.4±0.8 | 82.1±0.7 | 82.0±0.7 | 72.2±0.8 | 83.1±0.5 | 83.2±0.7 |
| CIFAR-100 | - | - | 63.2 | 52.7±1.1 | 70.7±0.9 | 70.8±0.9 | 63.1±1.0 | 71.2±0.9 | 72.8±0.9 |
| Avg Seen | 52.8 | 51.9 | 56.7 | 46.9 | 59.5 | 59.3 | 57.3 | 60.1 | 60.6 |
| Avg Unseen | 62.4 | 59.9 | 68.9 | 56.5 | 74.5 | 75.0 | 68.1 | 76.3 | 76.9 |
| Avg All | 61.4 | 59.1 | 68.0 | 55.8 | 73.3 | 73.8 | 67.3 | 75.0 | 75.7 |
| Avg Rank | 7.0 | 7.5 | 6.1 | 8.9 | 3.7 | 3.4 | 5.1 | 2.0 | 1.2 |

5.2 Performance Comparison to State-of-The-Art Methods

Following the experimental setup in (Li, Liu, and Bilen 2022), we first evaluate our method using multi-domain and single-domain feature extractors in the Varying-Way Varying-Shot setting (i.e., the Multi-domain and Single-domain settings). Then, we assess our approach with the multi-domain feature extractor in the more challenging Varying-Way Five-Shot and Five-Way One-Shot settings. The performance comparison results for the Varying-Way Five-Shot and Five-Way One-Shot settings are provided in Appendix F.

Multi-Domain Setting

In Table 1, we evaluate TSP by applying it to TSA and TA2-Net, both of which employ URL (Li, Liu, and Bilen 2021) as the multi-domain feature extractor. We report average accuracies over seen, unseen, and all domains, along with average rank following the previous works (Liu et al. 2021; Li, Liu, and Bilen 2022; Guo et al. 2023). TSP denotes TSP applied on TSA, while TSP†† indicates TSP applied on TA2-Net. TSP outperforms the previous state-of-the-art methods on 11 out of 13 datasets, and TSP†† achieves the best results on all datasets. For example, TSP†† outperforms the state-of-the-art method (TA2-Net) by 1.7%, 3.4%, 2.0%, and 1.9% on Textures, Fungi, Traffic Sign, and MSCOCO, respectively. These results imply that TSP can construct a desirable task-specific optimizer that effectively adapts the task-specific parameters for a given target task.

Single-Domain Setting

We evaluate TSP by applying it to TSA and TA2-Net, both of which employ the single-domain feature extractor pretrained solely on the ImageNet dataset. In Table 2, TSP†† achieves the best results on 12 out of 13 datasets, while TSP leads on the remaining one. Compared to recently proposed meta-learning methods based on PGD, such as Approximate GAP++Proto-MAML and GAP++Proto-MAML (Kang et al. 2023), both TSP and TSP†† consistently outperform them across all 13 datasets by a significant margin. Furthermore, TSP†† outperforms the previous best methods by a clear margin on several datasets such as Quick Draw (+3.1%), Omniglot (+4.1%), and Traffic Sign (+4.6%). Despite being trained on only a single dataset, TSP improves performance by effectively constructing a task-specific optimizer tailored to the target task.

5.3 Ablation Studies

In this section, all ablation studies are performed using TSP in the multi-domain setting to isolate the effects originating from the RL model in TSP††. Additional ablation studies are provided in Appendix D.

Matrix Design for DSP

Table 3: Performance comparison of three TSPs with different DSP designs.
DSP designs $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$ $\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}$
Avg Seen 80.8 81.2 81.6
Avg Unseen 79.0 79.4 79.8
Avg All 80.1 80.5 80.9

To design the Domain-Specific Preconditioner (DSP), we consider three matrix designs that guarantee positive definiteness. The first is the product of a real-valued lower triangular matrix and its transpose (i.e., $\mathbf{L}\mathbf{L}^{\mathbf{T}}$), where the lower triangular matrix $\mathbf{L}$ is constrained to have positive diagonal entries. This form is commonly known as the Cholesky factorization (Horn and Johnson 2012). The second is the sum of $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ and the identity matrix (i.e., $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$). The last is the sum of the Gram matrix (Horn and Johnson 2012) and the identity matrix (i.e., $\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}$). In Table 3, we compare three TSPs built on these three DSP designs. Among them, the Gram matrix design achieves the highest average accuracies in both seen and unseen domains. Therefore, we choose the Gram matrix design for DSP.
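
The three designs can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names are hypothetical, and the use of `exp` to keep the diagonal of the triangular factor positive is an assumption (the text only states that the diagonal is constrained to be positive).

```python
import numpy as np

def dsp_cholesky(raw):
    """L L^T, with L lower triangular and a strictly positive diagonal."""
    # Assumption: positivity of the diagonal is enforced with exp(); the text
    # only states that the diagonal entries are constrained to be positive.
    L = np.tril(raw, k=-1) + np.diag(np.exp(np.diag(raw)))
    return L @ L.T

def dsp_cholesky_plus_identity(raw):
    """L L^T + I."""
    return dsp_cholesky(raw) + np.eye(raw.shape[0])

def dsp_gram_plus_identity(M):
    """M^T M + I, with an unconstrained square matrix M."""
    return M.T @ M + np.eye(M.shape[1])

def min_eigenvalue(P):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(P).min())

rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 4))  # arbitrary unconstrained parameters

min_eigs = {
    "LL^T": min_eigenvalue(dsp_cholesky(raw)),
    "LL^T + I": min_eigenvalue(dsp_cholesky_plus_identity(raw)),
    "M^T M + I": min_eigenvalue(dsp_gram_plus_identity(raw)),
}
for name, value in min_eigs.items():
    print(f"{name}: smallest eigenvalue = {value:.4f}")
```

All three smallest eigenvalues are positive, and the two designs that add the identity are bounded below by 1, which also bounds how far the preconditioner can shrink a gradient component.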

Table 4: Performance comparison of TSPs with and without a PD constraint. $\alpha$ is set to 0.1.
Preconditioner w/ PD constraint w/o PD constraint
Avg Seen 81.6 80.0
Avg Unseen 79.8 73.8
Avg All 80.9 77.6
Table 5: The rates of non-PD Domain-Specific Preconditioners (DSPs) after meta-training without a positive definite constraint. For the ResNet-18 backbone, there are 17 DSPs for each domain. All DSPs are initialized as $0.1\cdot\mathbf{I}$. The average rate is provided in the rightmost column.
DSP ImageNet Omniglot Aircraft Birds Textures Quick Draw Fungi VGG Flower Average
Non-PD rate 0.24 0.29 0.35 0.24 0.35 0.35 0.18 0.29 0.29

6 Discussion

In this section, all experiments are conducted using TSP.

The Necessity of Positive Definite Constraint

Even without an explicit PD constraint, one might assume that a preconditioner initialized as positive definite, such as $\alpha\cdot\mathbf{I}$, would remain positive definite throughout meta-training because of its significant role in optimization. However, as illustrated in Table 4 and Table 5, this assumption does not hold. In Table 4, we compare preconditioners with and without a PD constraint, both initialized as positive definite. Specifically, the former adopts the Task-Specific Preconditioner (see Eq. (15)), while the latter employs a Task-Specific Preconditioner with DSPs designed as $\mathbf{P}^{l}_{k}=\mathbf{M}^{l}_{k}$ and initialized as $\mathbf{M}^{l}_{k}=0.1\cdot\mathbf{I}$. Evaluations are conducted using the multi-domain feature extractor (URL) in the multi-domain setting. After meta-training, DSPs without a PD constraint tend to lose positive definiteness, as shown in Table 5, leading to poor performance, as shown in Table 4. These findings underscore the necessity of explicitly constraining the preconditioner to remain positive definite, as relying solely on optimization fails to preserve this crucial property.
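
A rate such as those in Table 5 can be computed by testing each meta-trained matrix for positive definiteness. The sketch below is illustrative only: the paper does not state how the test was performed, so the criterion (all eigenvalues of the symmetric part are positive) and the toy matrices are assumptions. With 17 DSPs per domain, a rate of 0.24 is consistent with 4 non-PD matrices out of 17.

```python
import numpy as np

def is_positive_definite(P, tol=1e-10):
    """x^T P x > 0 for all x != 0 iff the symmetric part of P is PD."""
    sym = 0.5 * (P + P.T)
    return bool(np.linalg.eigvalsh(sym).min() > tol)

def non_pd_rate(preconditioners):
    """Fraction of matrices in the list that are not positive definite."""
    flags = [not is_positive_definite(P) for P in preconditioners]
    return sum(flags) / len(flags)

# Hypothetical stand-in for one domain: 13 matrices that stayed PD and
# 4 matrices that acquired a negative eigenvalue during meta-training.
toy_dsps = [0.1 * np.eye(3) for _ in range(13)]
toy_dsps += [np.diag([0.1, 0.1, -0.05]) for _ in range(4)]

rate = non_pd_rate(toy_dsps)
print(f"non-PD rate = {rate:.2f}")
```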

Positive Definite DSP Designs with and without the Identity Matrix
Table 6: Performance comparison of two positive definite DSP designs with and without adding an identity matrix.
Setting: Varying-Way Varying-Shot (first two columns), Varying-Way Five-Shot (last two columns)
DSP designs $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$ $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$
Avg Seen 80.8 81.2 76.8 76.6
Avg Unseen 79.0 79.4 72.1 71.5
Avg All 80.1 80.5 75.0 74.6

Apart from ensuring positive definiteness, a notable characteristic of our Gram matrix design $\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}$ is its inclusion of the identity matrix. To explore the impact of this inclusion, we compare two positive definite DSP designs: $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ and $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$. We focus on these two DSP designs because $\mathbf{M}^{\mathbf{T}}\mathbf{M}$ alone does not guarantee positive definiteness. Nevertheless, we also provide a comparison between $\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}$ and $\mathbf{M}^{\mathbf{T}}\mathbf{M}$ in Appendix G. The experiments are conducted using the multi-domain feature extractor (URL). In Table 6, we observe that the DSP design with the added identity matrix performs better in the Varying-Way Varying-Shot setting but worse in the Varying-Way Five-Shot setting. This outcome aligns with prior theoretical findings (Amari et al. 2020) indicating that PGD performs better than GD in noisy gradient conditions, while GD excels when gradients are accurate. With more shots, gradients tend to be more accurate due to the increased amount of data. In the Varying-Way Varying-Shot setting, where tasks typically involve more than five shots, gradients are more accurate, making GD more beneficial than in the other setting. Including the identity matrix can be viewed as a regularization of PGD towards GD. Consequently, $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$ is closer to GD than $\mathbf{L}\mathbf{L}^{\mathbf{T}}$, which improves performance when shots are abundant, as in the Varying-Way Varying-Shot setting. Conversely, in the Varying-Way Five-Shot setting, where tasks involve fewer shots, $\mathbf{L}\mathbf{L}^{\mathbf{T}}$ outperforms $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$.
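
One way to see the "regularization towards GD" interpretation is to measure how closely the preconditioned direction follows the raw gradient. The sketch below is an illustration under assumed toy values (a random triangular factor and a random gradient vector `g`), not an experiment from the paper. For a positive definite $\mathbf{A}$, the cosine between $\mathbf{g}$ and $(\mathbf{A}+c\mathbf{I})\mathbf{g}$ is non-decreasing in $c\geq 0$, so adding the identity can only move the update direction closer to the GD direction.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
m = 5

# Random lower-triangular factor with a positive diagonal (toy values).
L = np.tril(rng.standard_normal((m, m)))
L[np.diag_indices(m)] = np.abs(np.diag(L)) + 0.1
A = L @ L.T                       # the L L^T design

g = rng.standard_normal(m)        # stand-in for a gradient

cos_plain = cosine(g, A @ g)                   # direction under L L^T
cos_shifted = cosine(g, (A + np.eye(m)) @ g)   # direction under L L^T + I

print(f"cos(g, LL^T g)       = {cos_plain:.4f}")
print(f"cos(g, (LL^T + I) g) = {cos_shifted:.4f}")
```

The second cosine is at least as large as the first, meaning the $\mathbf{L}\mathbf{L}^{\mathbf{T}}+\mathbf{I}$ update is more aligned with plain gradient descent.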

Figure 4: Learning curves of PGD with and without the PD constraint across both seen and unseen domains. Panels: (a) w/o PD, Seen; (b) w/o PD, Unseen; (c) w/ PD, Seen; (d) w/ PD, Unseen. Further details on the preconditioners used in this figure can be found in Appendix A.

Effectiveness of Positive Definiteness in Cross-Domain Tasks

A positive definite preconditioner is known to mitigate the negative effects of pathological loss curvature and accelerate optimization, thereby facilitating convergence (Nocedal and Wright 1999; Saad 2003; Li 2017). This leads to a consistent reduction in the objective function. Without positive definiteness, however, this effect is not guaranteed, and optimization may fail to converge. In Figure 4, we compare the learning curves of PGD with and without a PD constraint across both seen and unseen domains. Without the PD constraint, PGD fails to converge in some of the seen domains and in all the unseen domains. With the PD constraint, PGD converges in all the seen and unseen domains. These results suggest that, in cross-domain tasks, a PD constraint on the preconditioner is crucial for achieving convergence and is beneficial for improving performance, which is also related to Figure 1(b).
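
The link between positive definiteness and a decreasing objective can be made explicit with a standard first-order argument. This is a sketch under the assumptions that $\mathcal{L}$ is differentiable and the step size $\alpha$ is small; it is not a derivation taken from the paper. Writing $\mathbf{g}=\nabla_{\theta}\mathcal{L}(\theta)$:

```latex
\mathcal{L}(\theta-\alpha\,\mathbf{P}\mathbf{g})
  = \mathcal{L}(\theta)
  - \alpha\,\mathbf{g}^{\mathbf{T}}\mathbf{P}\,\mathbf{g}
  + O(\alpha^{2}).
```

If $\mathbf{P}$ is positive definite, then $\mathbf{g}^{\mathbf{T}}\mathbf{P}\mathbf{g}>0$ whenever $\mathbf{g}\neq\mathbf{0}$, so a sufficiently small step decreases the loss. If $\mathbf{P}$ is not positive definite, $\mathbf{g}^{\mathbf{T}}\mathbf{P}\mathbf{g}$ can be zero or negative for some gradients, and the same step may leave the loss unchanged or increase it.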

TSP vs. Previous PGD Methods: Leveraging Multi-Domain Knowledge for Task-Specific Preconditioner

Compared to previous PGD methods like GAP (Kang et al. 2023), TSP is specifically designed for cross-domain few-shot learning (CDFSL), where unseen domains are not accessed during meta-training. The key challenge in CDFSL is to effectively leverage information from multiple seen domains to quickly adapt to each unseen domain. Previous PGD methods fall short in this regard because they rely on a single preconditioner, even when multiple seen domains are available. For example, GAP uses only one preconditioner to extract information from multiple seen domains, which limits its adaptability to unseen domains with distinct characteristics. In contrast, TSP meta-trains a distinct domain-specific preconditioner (DSP) for each seen domain and combines them to construct a Task-Specific Preconditioner that is better suited to each unseen domain. TSP produces this Task-Specific Preconditioner effectively, as shown in Tables 1 and 2, and time-efficiently, as further detailed in Appendix H.

7 Conclusion

In this study, we have introduced a robust and effective adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP) to enhance CDFSL performance. Thanks to the meta-trained Domain-Specific Preconditioners (DSPs) and Task-coefficients, TSP can flexibly adjust the optimization strategy according to the geometric characteristics of the parameter space for the target task. Owing to these components, the proposed TSP demonstrates notable performance improvements on Meta-Dataset across various settings.

8 Acknowledgements

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1A2C2007139) and in part by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) ([No. RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], [No. RS-2023-00235293, Development of autonomous driving big data processing, management, search, and sharing interface technology to provide autonomous driving data according to the purpose of usage]).

References

  • Amari (1967) Amari, S. 1967. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, (3): 299–307.
  • Amari (1996) Amari, S.-i. 1996. Neural learning in structured parameter spaces-natural Riemannian gradient. Advances in neural information processing systems, 9.
  • Amari (1998) Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural computation, 10(2): 251–276.
  • Amari et al. (2020) Amari, S.-i.; Ba, J.; Grosse, R.; Li, X.; Nitanda, A.; Suzuki, T.; Wu, D.; and Xu, J. 2020. When does preconditioning help or hurt generalization? arXiv preprint arXiv:2006.10732.
  • Amari and Douglas (1998) Amari, S.-I.; and Douglas, S. C. 1998. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 2, 1213–1216. IEEE.
  • Baik et al. (2023) Baik, S.; Choi, M.; Choi, J.; Kim, H.; and Lee, K. M. 2023. Learning to learn task-adaptive hyperparameters for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Bateni et al. (2022) Bateni, P.; Barber, J.; Van de Meent, J.-W.; and Wood, F. 2022. Enhancing few-shot image classification with unlabelled examples. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2796–2805.
  • Bateni et al. (2020) Bateni, P.; Goyal, R.; Masrani, V.; Wood, F.; and Sigal, L. 2020. Improved few-shot visual classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14493–14502.
  • Chen et al. (2019) Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C. F.; and Huang, J.-B. 2019. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232.
  • Cimpoi et al. (2014) Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; and Vedaldi, A. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3606–3613.
  • Duchi, Hazan, and Singer (2011) Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7).
  • Dvornik, Schmid, and Mairal (2020) Dvornik, N.; Schmid, C.; and Mairal, J. 2020. Selecting relevant features from a multi-domain representation for few-shot classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, 769–786. Springer.
  • Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, 1126–1135. PMLR.
  • Garcia and Bruna (2017) Garcia, V.; and Bruna, J. 2017. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
  • Garnelo et al. (2018) Garnelo, M.; Rosenbaum, D.; Maddison, C.; Ramalho, T.; Saxton, D.; Shanahan, M.; Teh, Y. W.; Rezende, D.; and Eslami, S. A. 2018. Conditional neural processes. In International conference on machine learning, 1704–1713. PMLR.
  • Guo et al. (2023) Guo, Y.; Du, R.; Dong, Y.; Hospedales, T.; Song, Y.-Z.; and Ma, Z. 2023. Task-aware Adaptive Learning for Cross-domain Few-shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1590–1599.
  • Ha and Eck (2017) Ha, D.; and Eck, D. 2017. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477.
  • Himmelblau et al. (2018) Himmelblau, D. M.; et al. 2018. Applied nonlinear programming. McGraw-Hill.
  • Horn and Johnson (2012) Horn, R. A.; and Johnson, C. R. 2012. Matrix analysis. Cambridge university press.
  • Houben et al. (2013) Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; and Igel, C. 2013. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In The 2013 international joint conference on neural networks (IJCNN), 1–8. IEEE.
  • Kakade (2001) Kakade, S. M. 2001. A natural policy gradient. Advances in neural information processing systems, 14.
  • Kang et al. (2023) Kang, S.; Hwang, D.; Eo, M.; Kim, T.; and Rhee, W. 2023. Meta-Learning with a Geometry-Adaptive Preconditioner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16080–16090.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
  • Lake et al. (2011) Lake, B.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. 2011. One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, volume 33.
  • Lake, Salakhutdinov, and Tenenbaum (2015) Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266): 1332–1338.
  • LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.
  • LeCun et al. (2002) LeCun, Y.; Bottou, L.; Orr, G. B.; and Müller, K.-R. 2002. Efficient backprop. In Neural networks: Tricks of the trade, 9–50. Springer.
  • Lee and Choi (2018) Lee, Y.; and Choi, S. 2018. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, 2927–2936. PMLR.
  • Li, Liu, and Bilen (2021) Li, W.-H.; Liu, X.; and Bilen, H. 2021. Universal representation learning from multiple domains for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9526–9535.
  • Li, Liu, and Bilen (2022) Li, W.-H.; Liu, X.; and Bilen, H. 2022. Cross-domain few-shot learning with task-specific adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7161–7170.
  • Li (2017) Li, X.-L. 2017. Preconditioned stochastic gradient descent. IEEE transactions on neural networks and learning systems, 29(5): 1454–1466.
  • Li et al. (2017) Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
  • Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
  • Liu et al. (2020) Liu, L.; Hamilton, W.; Long, G.; Jiang, J.; and Larochelle, H. 2020. A universal representation transformer layer for few-shot image classification. arXiv preprint arXiv:2006.11702.
  • Liu et al. (2021) Liu, Y.; Lee, J.; Zhu, L.; Chen, L.; Shi, H.; and Yang, Y. 2021. A multi-mode modulator for multi-domain few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8453–8462.
  • Maji et al. (2013) Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.; and Vedaldi, A. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
  • Mishra et al. (2017) Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P. 2017. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141.
  • Munkhdalai and Yu (2017) Munkhdalai, T.; and Yu, H. 2017. Meta networks. In International conference on machine learning, 2554–2563. PMLR.
  • Nilsback and Zisserman (2008) Nilsback, M.-E.; and Zisserman, A. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, 722–729. IEEE.
  • Nocedal and Wright (1999) Nocedal, J.; and Wright, S. J. 1999. Numerical optimization. Springer.
  • Oreshkin, Rodríguez López, and Lacoste (2018) Oreshkin, B.; Rodríguez López, P.; and Lacoste, A. 2018. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31.
  • Park and Oliva (2019) Park, E.; and Oliva, J. B. 2019. Meta-curvature. Advances in Neural Information Processing Systems, 32.
  • Rajasegaran et al. (2020) Rajasegaran, J.; Khan, S.; Hayat, M.; Khan, F. S.; and Shah, M. 2020. Meta-learning the learning trends shared across tasks. arXiv preprint arXiv:2010.09291.
  • Rajeswaran et al. (2019) Rajeswaran, A.; Finn, C.; Kakade, S. M.; and Levine, S. 2019. Meta-learning with implicit gradients. Advances in neural information processing systems, 32.
  • Ravi and Larochelle (2016) Ravi, S.; and Larochelle, H. 2016. Optimization as a model for few-shot learning. In International conference on learning representations.
  • Requeima et al. (2019) Requeima, J.; Gordon, J.; Bronskill, J.; Nowozin, S.; and Turner, R. E. 2019. Fast and flexible multi-task classification using conditional neural adaptive processes. Advances in Neural Information Processing Systems, 32.
  • Roy and Vetterli (2007) Roy, O.; and Vetterli, M. 2007. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, 606–610. IEEE.
  • Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115: 211–252.
  • Saad (2003) Saad, Y. 2003. Iterative methods for sparse linear systems. SIAM.
  • Saikia, Brox, and Schmid (2020) Saikia, T.; Brox, T.; and Schmid, C. 2020. Optimized generic feature learning for few-shot classification across domains. arXiv preprint arXiv:2001.07926.
  • Santoro et al. (2016) Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning, 1842–1850. PMLR.
  • Schroeder and Cui (2018) Schroeder, B.; and Cui, Y. 2018. Fgvcx fungi classification challenge 2018. Available online: github.com/visipedia/fgvcx_fungi_comp (accessed on 14 July 2021).
  • Simon et al. (2020) Simon, C.; Koniusz, P.; Nock, R.; and Harandi, M. 2020. On modulating the gradient for meta-learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, 556–572. Springer.
  • Snell, Swersky, and Zemel (2017) Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
  • Sung et al. (2018) Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1199–1208.
  • Tian et al. (2024) Tian, H.; Liu, F.; Liu, T.; Du, B.; Cheung, Y.-m.; and Han, B. 2024. MOKD: Cross-domain Finetuning for Few-shot Classification via Maximizing Optimized Kernel Dependence. arXiv preprint arXiv:2405.18786.
  • Tian et al. (2020) Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J. B.; and Isola, P. 2020. Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, 266–282. Springer.
  • Triantafillou et al. (2021) Triantafillou, E.; Larochelle, H.; Zemel, R.; and Dumoulin, V. 2021. Learning a universal template for few-shot dataset generalization. In International Conference on Machine Learning, 10424–10433. PMLR.
  • Triantafillou et al. (2019) Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Evci, U.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.; Manzagol, P.-A.; et al. 2019. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.
  • Von Oswald et al. (2021) Von Oswald, J.; Zhao, D.; Kobayashi, S.; Schug, S.; Caccia, M.; Zucchet, N.; and Sacramento, J. 2021. Learning where to learn: Gradient sparsity in meta and continual learning. Advances in Neural Information Processing Systems, 34: 5250–5263.
  • Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
  • Yoon et al. (2018) Yoon, J.; Kim, T.; Dia, O.; Kim, S.; Bengio, Y.; and Ahn, S. 2018. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31.
  • Zaheer et al. (2017) Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep sets. Advances in neural information processing systems, 30.
  • Zhao et al. (2020) Zhao, D.; Kobayashi, S.; Sacramento, J.; and von Oswald, J. 2020. Meta-learning via hypernetworks. In 4th Workshop on Meta-Learning at NeurIPS 2020 (MetaLearn 2020). NeurIPS.

Appendix for the paper
“Task-Specific Preconditioner for Cross-Domain Few-Shot Learning”

Appendix A The three preconditioners used in Figure 1(b) and Figure 4

To establish the motivation for enforcing the positive definite constraint in CDFSL, we conduct a comparative analysis of three adaptation mechanisms (PGD methods with varying preconditioners) using Meta-Dataset. These mechanisms are applied on the state-of-the-art CDFSL method, TSA (Li, Liu, and Bilen 2022). In these comparisons, with task-specific parameters $\theta$ and a task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$, we update $\theta$ using PGD with a preconditioner $\mathbf{P}$ as follows:

$$\theta_{\mathcal{T},t}=\theta_{\mathcal{T},t-1}-\alpha\cdot\mathbf{P}\nabla_{\theta}\mathcal{L}(\theta_{\mathcal{T},t-1};\mathcal{S}_{\mathcal{T}}),\;t=1,2,\cdots, \tag{17}$$

where $\theta_{\mathcal{T},0}=\theta$ and $\mathcal{L}(\theta_{\mathcal{T},t};\mathcal{S}_{\mathcal{T}})$ is the empirical loss associated with $\mathcal{T}$ and $\theta_{\mathcal{T},t}$. The first PGD method is identical to Gradient Descent (GD), which uses the fixed identity matrix $\mathbf{I}$ as $\mathbf{P}$ (i.e., the gradient descent baseline). The second method is Task-Specific Preconditioned gradient descent (TSP) with a Task-Specific Preconditioner whose DSPs are designed as $\mathbf{P}^{l}_{k}=\mathbf{M}^{l}_{k}$ and initialized as $(-1)\cdot\mathbf{I}$ (i.e., PGD without a positive definite constraint). The final method is TSP with the Task-Specific Preconditioner defined in Section 4.3 (i.e., PGD with a positive definite constraint).
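
The qualitative behavior of the three preconditioners can be reproduced on a toy problem. The sketch below applies the update in Eq. (17) to a two-dimensional quadratic loss, which is only a stand-in for the support-set loss; the matrix `M`, the Hessian `H`, the step size, and the number of steps are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def pgd(theta0, grad_fn, P, alpha, steps):
    """Iterate theta_t = theta_{t-1} - alpha * P @ grad(theta_{t-1}), as in Eq. (17)."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - alpha * P @ grad_fn(theta)
    return theta

# Toy stand-in for the loss: L(theta) = 0.5 * theta^T H theta.
H = np.diag([1.0, 4.0])

def loss_fn(theta):
    return 0.5 * float(theta @ H @ theta)

def grad_fn(theta):
    return H @ theta

theta0 = np.array([1.0, 1.0])
M = np.array([[0.5, 0.2],
              [0.0, 0.3]])  # hypothetical meta-parameter

preconditioners = {
    "GD (P = I)": np.eye(2),
    "non-PD (P = -I)": -np.eye(2),
    "PD (P = M^T M + I)": M.T @ M + np.eye(2),
}

initial_loss = loss_fn(theta0)
final_loss = {
    name: loss_fn(pgd(theta0, grad_fn, P, alpha=0.05, steps=20))
    for name, P in preconditioners.items()
}
for name, value in final_loss.items():
    print(f"{name}: loss {initial_loss:.3f} -> {value:.3f}")
```

With the identity and with the positive definite preconditioner the loss decreases, whereas with $-\mathbf{I}$ every step moves along the ascent direction and the loss grows.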

Appendix B Meta-Training and Meta-Testing Algorithms

Algorithm 1 Meta-Training for Domain-Specific Preconditioner (DSP)
Require:  $p(\mathcal{T})$: Task distribution across $K$ train domains
Require:  $\alpha_{\text{in}}$, $\alpha_{\text{out}}$: The learning rates
Require:  $\mathcal{L}_{\text{in}}$, $\mathcal{L}_{\text{out}}$: The inner and outer-level loss functions
1:  Initialize task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
2:  Initialize meta-parameters $\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$ where $\mathcal{M}_{k}=\{\mathbf{M}^{l}_{k}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
3:  while not converged do
4:     Sample a batch of train tasks $\mathcal{T}_{B}\sim p(\mathcal{T})$
5:     for all $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}\in\mathcal{T}_{B}$ do
6:        for $l=1$ to $L$ do
7:           Compute DSP $\mathbf{P}^{l}_{d_{\mathcal{T}}}=\mathbf{M}_{d_{\mathcal{T}}}^{l\mathbf{T}}\mathbf{M}^{l}_{d_{\mathcal{T}}}+\mathbf{I}$
8:           Compute updated task-specific parameters via Eq. (3)
9:        end for
10:        Compute outer-level loss $\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})$
11:     end for
12:     Update the meta-parameters via Eq. (10)
13:  end while
Algorithm 2 Meta-Training the dataset classifier for Task-coefficients
Require:  $p(\mathcal{T})$: Task distribution across $K$ train domains
Require:  $g_{\phi}(\cdot)$: The dataset classifier
Require:  $\mathcal{L}_{\text{in}}$, $\mathcal{L}_{\text{out}}$: The inner and outer-level loss functions
Require:  $\lambda$: The regularization parameter
Require:  $\alpha$: The learning rate
1:  Initialize task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
2:  Initialize the parameters $\phi$ of $g_{\phi}(\cdot)$
3:  while not converged do
4:     Sample a batch of train tasks $\mathcal{T}_{B}\sim p(\mathcal{T})$
5:     for all $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}},d_{\mathcal{T}}\}\in\mathcal{T}_{B}$ do
6:        Compute logits $(z_{\mathcal{T},1},\cdots,z_{\mathcal{T},K})=g_{\phi}(\mathcal{S}_{\mathcal{T}})$
7:        Compute task-coefficients via Eq. (11)
8:        Compute $\mathcal{L}_{\text{CE}}^{\mathcal{T}}=-\sum^{K}_{k=1}d_{\mathcal{T},k}\cdot\log(p_{\mathcal{T},k})$
9:        With $\theta_{\mathcal{T},T}=\{\theta_{\mathcal{T},T}^{l}\}^{L}_{l=1}$ via Eq. (14), compute $\mathcal{L}_{\text{Aux}}^{\mathcal{T}}=\mathcal{L}_{\text{out}}(\theta_{\mathcal{T},T};\mathcal{Q}_{\mathcal{T}})$
10:        Compute $\mathcal{L}^{\mathcal{T}}=\mathcal{L}_{\text{CE}}^{\mathcal{T}}+\lambda\cdot\mathcal{L}_{\text{Aux}}^{\mathcal{T}}$
11:     end for
12:     Compute $\mathcal{L}=\sum_{\mathcal{T}\in\mathcal{T}_{B}}\mathcal{L}^{\mathcal{T}}$
13:     Update the parameters $\phi\leftarrow\phi-\alpha\cdot\nabla_{\phi}\mathcal{L}$
14:  end while
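
The per-task loss in steps 6 to 10 of Algorithm 2 can be sketched as follows. This is a hedged illustration rather than the authors' code: Eq. (11) is not reproduced in this part of the paper, so computing the task-coefficients with a plain softmax over the logits is an assumption, as is treating $d_{\mathcal{T}}$ as a one-hot domain label. The logits and the auxiliary loss value are hypothetical numbers.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dataset_classifier_loss(logits, domain_index, aux_loss, lam=0.1):
    """Per-task loss L^T = L_CE + lam * L_Aux (step 10 of Algorithm 2)."""
    p = softmax(logits)              # assumed form of the task-coefficients
    ce = -np.log(p[domain_index])    # cross-entropy with a one-hot domain label
    return float(ce + lam * aux_loss), p

# Hypothetical values for a task drawn from domain 0 out of K = 3 domains.
logits = np.array([2.0, 0.5, -1.0])
total, p = dataset_classifier_loss(logits, domain_index=0, aux_loss=1.5)

print("task-coefficients:", np.round(p, 3))
print(f"per-task loss: {total:.4f}")
```

In the full procedure, `aux_loss` would be the outer-level loss on the query set after adaptation, and the gradient of the summed per-task losses with respect to $\phi$ would drive step 13.
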
Algorithm 3 Meta-Testing through Task-Specific Preconditioned gradient descent (TSP)
Require:  $p(\mathcal{T})$: Task distribution across all domains
Require:  $\mathcal{M}_{1},\cdots,\mathcal{M}_{K}$: Meta-trained meta-parameters
Require:  $g_{\phi}(\cdot)$: Meta-trained dataset classifier
Require:  $\mathcal{L}_{\text{in}}$: The inner-level loss function
Require:  $\beta$: The learning rate
1:  Initialize task-specific parameters $\theta=\{\theta^{l}\in\mathbb{R}^{m_{l}\times m_{l}}\}^{L}_{l=1}$
2:  Sample a test task $\mathcal{T}=\{\mathcal{S}_{\mathcal{T}},\mathcal{Q}_{\mathcal{T}}\}$
3:  for $l=1$ to $L$ do
4:     For all $k$, compute DSP $\mathbf{P}^{l}_{k}=\mathbf{M}_{k}^{l\mathbf{T}}\mathbf{M}_{k}^{l}+\mathbf{I}$
5:     Compute Task-Specific Preconditioner via Eq. (15)
6:     Update the task-specific parameters via Eq. (16)
7:  end for
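
For a single layer, Algorithm 3 can be sketched in NumPy as below. This is a minimal illustration under stated assumptions: the task-coefficients `p` are given directly rather than produced by a dataset classifier, the inner loss is a toy squared error toward a hypothetical `target`, and the update is taken to be a left-multiplication of the gradient by the preconditioner, since Eqs. (15) and (16) are not reproduced in this appendix.

```python
import numpy as np

def task_specific_preconditioner(Ms, p):
    """P = sum_k p_k * (M_k^T M_k + I) for one layer (steps 4 and 5)."""
    m = Ms[0].shape[1]
    return sum(p_k * (M_k.T @ M_k + np.eye(m)) for p_k, M_k in zip(p, Ms))

def tsp_adapt(theta, grad_fn, P, beta, steps):
    """Preconditioned gradient descent on the task-specific parameters (step 6)."""
    for _ in range(steps):
        theta = theta - beta * P @ grad_fn(theta)
    return theta

rng = np.random.default_rng(0)
K, m = 3, 4
Ms = [0.5 * rng.standard_normal((m, m)) for _ in range(K)]  # toy meta-parameters
p = np.array([0.6, 0.3, 0.1])                               # toy task-coefficients

# Toy inner loss: 0.5 * ||theta - target||_F^2, with gradient theta - target.
target = rng.standard_normal((m, m))

def loss_fn(theta):
    return 0.5 * float(np.sum((theta - target) ** 2))

def grad_fn(theta):
    return theta - target

theta0 = np.zeros((m, m))
P = task_specific_preconditioner(Ms, p)
theta_T = tsp_adapt(theta0, grad_fn, P, beta=0.1, steps=10)

loss_before, loss_after = loss_fn(theta0), loss_fn(theta_T)
print(f"inner loss: {loss_before:.4f} -> {loss_after:.4f}")
```

In the full method this computation is repeated for every layer $l$, each with its own set of DSPs.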

Appendix C Proofs of Theorems

Lemma 1.

For the meta-parameter $\mathbf{M}\in\mathbb{R}^{m\times m}$, the Domain-Specific Preconditioner $\mathbf{P}$ defined as $\mathbf{P}=\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}$ is positive definite.

Proof.

$\mathbf{P}$ is symmetric, as shown below:

$$\mathbf{P}^{\mathbf{T}}=(\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I})^{\mathbf{T}}=(\mathbf{M}^{\mathbf{T}}\mathbf{M})^{\mathbf{T}}+\mathbf{I}^{\mathbf{T}}=\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}=\mathbf{P}.$$

For all $\mathbf{x}\in\mathbb{R}^{m}\backslash\{\mathbf{0}\}$,

$$\begin{aligned}
\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x} &=\mathbf{x}^{\mathbf{T}}(\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I})\mathbf{x}\\
&=\mathbf{x}^{\mathbf{T}}\mathbf{M}^{\mathbf{T}}\mathbf{M}\mathbf{x}+\mathbf{x}^{\mathbf{T}}\mathbf{I}\mathbf{x}\\
&=(\mathbf{M}\mathbf{x})^{\mathbf{T}}(\mathbf{M}\mathbf{x})+\mathbf{x}^{\mathbf{T}}\mathbf{x}\\
&=\|\mathbf{M}\mathbf{x}\|^{2}+\|\mathbf{x}\|^{2}.
\end{aligned}$$

The first term on the right-hand side is non-negative, while the second term is positive:

$$\|\mathbf{M}\mathbf{x}\|^{2}\geq 0,\qquad \|\mathbf{x}\|^{2}>0,$$

since $\mathbf{x}\neq\mathbf{0}$. Thus, we conclude:

$$\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}>0, \tag{18}$$

which confirms the positive-definiteness of the Domain-Specific Preconditioner $\mathbf{P}$. ∎

Theorem 1.

Let $p_{k}\in[0,1],k=1,\cdots,K$, be the task-coefficients satisfying $\sum^{K}_{k=1}p_{k}=1$. For the Domain-Specific Preconditioners $\mathbf{P}_{k}\in\mathbb{R}^{m\times m},k=1,\cdots,K$, the Task-Specific Preconditioner $\mathbf{P}$ defined as $\mathbf{P}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}$ is positive definite.

Proof.

By Lemma 1, $\mathbf{P}_{k}$ is symmetric. Therefore, $\mathbf{P}$ is symmetric, as shown below:

$$\mathbf{P}^{\mathbf{T}}=\left(\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}\right)^{\mathbf{T}}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}^{\mathbf{T}}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}=\mathbf{P}.$$

For all $\mathbf{x}\in\mathbb{R}^{m}\backslash\{\mathbf{0}\}$,

$$\begin{aligned}
\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x} &=\mathbf{x}^{\mathbf{T}}\left(\sum^{K}_{k=1}p_{k}\cdot\mathbf{P}_{k}\right)\mathbf{x}\\
&=\sum^{K}_{k=1}p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}.
\end{aligned}$$

Since $\mathbf{P}_{k}$ is the Domain-Specific Preconditioner, by Lemma 1, each summand on the right-hand side is non-negative (see Eq. (18)):

$$p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}\geq 0,\;\;k=1,\cdots,K \tag{19}$$

because $0\leq p_{k}\leq 1$. Since $\sum^{K}_{k=1}p_{k}=1$ and $p_{k}\in[0,1],k=1,\cdots,K$, there exists at least one $p_{k}$ such that $p_{k}>0$, implying that at least one term in Eq. (19) is positive:

$$\exists\,k\in\{1,2,3,\dots,K\}\;\,\text{such that}\;\,p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}>0. \tag{20}$$

Combining Eq. (19) and Eq. (20), we conclude:

$$\mathbf{x}^{\mathbf{T}}\mathbf{P}\mathbf{x}=\sum^{K}_{k=1}p_{k}\cdot\mathbf{x}^{\mathbf{T}}\mathbf{P}_{k}\mathbf{x}>0,$$

which confirms the positive-definiteness of the Task-Specific Preconditioner $\mathbf{P}$. ∎
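
Theorem 1 can also be checked numerically, including the case where some task-coefficients are exactly zero. The sketch below uses random toy matrices and is not part of the paper's proof. It additionally verifies a slightly stronger standard fact (a consequence of Weyl's inequality): the smallest eigenvalue of the combination is at least the coefficient-weighted average of the individual smallest eigenvalues, which here is at least 1 because every DSP has the form $\mathbf{M}^{\mathbf{T}}\mathbf{M}+\mathbf{I}$.

```python
import numpy as np

rng = np.random.default_rng(1)
K, m = 4, 6

# Toy Domain-Specific Preconditioners P_k = M_k^T M_k + I.
Ps = []
for _ in range(K):
    M = rng.standard_normal((m, m))
    Ps.append(M.T @ M + np.eye(m))

# Task-coefficients on the simplex; one coefficient is exactly zero.
p = np.array([0.0, 0.2, 0.3, 0.5])

P = sum(p_k * P_k for p_k, P_k in zip(p, Ps))

lam_min_combined = float(np.linalg.eigvalsh(P).min())
lam_min_weighted = float(
    sum(p_k * np.linalg.eigvalsh(P_k).min() for p_k, P_k in zip(p, Ps))
)

print(f"smallest eigenvalue of the combination: {lam_min_combined:.4f}")
print(f"weighted average of smallest eigenvalues: {lam_min_weighted:.4f}")
```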

Appendix D Additional Ablation Studies

D.1 Weighting Factor $\lambda$ of the Dataset Classifier Loss

In Table 7, we compare different dataset classifier losses by adjusting the weighting factor $\lambda$ in Eq. (12). Additionally, we include the results obtained when using only the cross-entropy loss $\mathcal{L}_{\text{CE}}$ or only the auxiliary loss $\mathcal{L}_{\text{Aux}}$ in Eq. (12) as the dataset classifier loss. The results show that $\lambda=0.1$ yields the best performance among the tested values, while the other losses result in inferior performance in both seen and unseen domains. This performance gap is more pronounced in unseen domains, highlighting the importance of balancing the two losses for generalization to unseen domains. Based on these findings, we adopt $\lambda=0.1$ for the dataset classifier loss in all experiments presented in this manuscript.

Table 7: Mean accuracy (%) for different values of $\lambda$ in the dataset classifier loss (see Eq. (12)).
| | $\lambda=10$ | $\lambda=1$ | $\lambda=0.1$ (Ours) | $\lambda=0.01$ | Only $\mathcal{L}_{\text{CE}}$ | Only $\mathcal{L}_{\text{Aux}}$ |
| Avg Seen | 80.8 | 81.3 | 81.6 | 81.0 | 80.9 | 80.1 |
| Avg Unseen | 79.3 | 79.7 | 79.8 | 79.1 | 78.8 | 78.2 |
| Avg All | 80.2 | 80.7 | 80.9 | 80.3 | 80.1 | 79.4 |

D.2 Interpreting and visualizing task-coefficient

To better understand how DSPs from the eight training domains combine to form the Task-Specific Preconditioner, we illustrate the task-coefficients of TSP used in various test tasks in Figure 5(a). We first randomly sample test tasks from each of four domains (ImageNet, Birds, Traffic Sign, and MSCOCO). The blue heatmaps illustrate the task-coefficient values used in each test task. These heatmaps exhibit consistent patterns within each domain, although the values vary across tasks. For instance, in the Birds domain, all 5 tasks primarily rely on DSPs from ImageNet and Birds. However, Task 2 distributes the task-coefficient values evenly between them, while Task 4 assigns significantly more weight to the Birds DSP than to the ImageNet DSP. In Figure 5(b), Figure 5(c), and Figure 5(d), we randomly sample test tasks from the other domains. As in Figure 5(a), these heatmaps show consistent patterns within each domain, although the values vary across tasks.

Figure 5: Task-coefficient values used in the construction of the Task-Specific Preconditioner. Columns represent DSPs trained on one of the eight training domains of Meta-Dataset. (a) Rows represent five test tasks randomly sampled from each of four domains: 2 seen domains (ImageNet, Birds) and 2 unseen domains (Traffic Sign and MSCOCO). (b) Rows represent five test tasks randomly sampled from each of three seen domains: Omniglot, Aircraft, and Textures. (c) Rows represent five test tasks randomly sampled from each of three seen domains: Quick Draw, Fungi, and VGG Flower. (d) Rows represent five test tasks randomly sampled from each of three unseen domains: MNIST, CIFAR-10, and CIFAR-100.

Appendix E Implementation Details

E.1 Dataset

Meta-Dataset (Triantafillou et al. 2019) is the standard benchmark for evaluating the performance of cross-domain few-shot classification. Initially, it comprised ten datasets, including ILSVRC 2012 (Russakovsky et al. 2015), Omniglot (Lake, Salakhutdinov, and Tenenbaum 2015), FGVC-Aircraft (Maji et al. 2013), CUB-200-2011 (Wah et al. 2011), Describable Textures (Cimpoi et al. 2014), QuickDraw (Ha and Eck 2017), FGVCx Fungi (Schroeder and Cui 2018), VGG Flower (Nilsback and Zisserman 2008), Traffic Signs (Houben et al. 2013), and MSCOCO (Lin et al. 2014). Later, it was further expanded to include MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky, Hinton et al. 2009), and CIFAR-100 (Krizhevsky, Hinton et al. 2009).

E.2 Architecture for the dataset classifier

As the dataset classifier, we use a permutation-invariant set encoder $g$ (Zaheer et al. 2017) followed by a linear layer. We adopt the implementation of the permutation-invariant set encoder described in previous studies (Requeima et al. 2019; Triantafillou et al. 2021). We implement this encoder as a Conv-5 backbone, comprising 5 modules with $3\times 3$ convolutions employing 256 filters, followed by batch normalization, ReLU activation, and $2\times 2$ max-pooling with a stride of 2. Subsequently, global average pooling is applied to the output, followed by averaging over the first dimension (representing different examples within the support set), resulting in the set representation of the given support set. This representation is then fed into a linear layer to classify the given support set into one of the $K$ training datasets.
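The pooling and classification steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: random arrays stand in for the Conv-5 backbone outputs, the linear-layer weights are random, and the softmax that turns logits into task-coefficients is an assumption chosen so that the outputs lie in $[0,1]$ and sum to one, as Theorem 1 requires. All function and variable names are hypothetical.

```python
import numpy as np

def set_representation(feature_maps):
    """Pool backbone outputs of shape (N, C, H, W) into one vector of shape (C,)."""
    per_example = feature_maps.mean(axis=(2, 3))   # global average pooling -> (N, C)
    return per_example.mean(axis=0)                # average over support examples -> (C,)

def task_coefficients(set_repr, W, b):
    """Linear layer followed by a softmax (assumed) over the K training datasets."""
    logits = W @ set_repr + b
    logits = logits - logits.max()                 # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
N, C, H, W_sp, K = 5, 256, 2, 2, 8                 # assumed sizes; 256 filters, K = 8 datasets
feats = rng.standard_normal((N, C, H, W_sp))       # stand-in for Conv-5 backbone outputs
W = 0.01 * rng.standard_normal((K, C))             # stand-in linear-layer weights
b = np.zeros(K)

rep = set_representation(feats)
p = task_coefficients(rep, W, b)
print(p.shape, float(p.sum()))
```

Because the representation is a mean over the support examples, reordering the support set leaves it unchanged, which is the permutation invariance the encoder is meant to provide.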

E.3 Hyper-parameters

For all the experiments, we use the hyper-parameters in Table 8 and Table 9.

Table 8: Hyper-parameters used for training DSP on various experimental settings. For the test learning rates, the first value corresponds to the learning rate for the Residual Adapter, while the second value corresponds to the learning rate for the pre-classifier transformation.
| Setting | Varying-Way Varying-Shot | Varying-Way 5-Shot | 5-Way 1-Shot | Varying-Way Varying-Shot |
| Batch size | 16 | 16 |
| Weight decay | 0.0007 | 0.0007 |
| T-max | 2500 | 2500 |
| Max iteration | 160000 | 40000 |
| Initialization for $\mathbf{M}$ | $0.1\cdot\mathbf{I}$ | $0.1\cdot\mathbf{I}$ |
| Inner learning rate $\alpha_{\text{in}}$ | 0.1 | 0.1 |
| Outer learning rate $\alpha_{\text{out}}$ | 0.1 | 0.1 |
| Number of training inner steps | 5 | 5 |
| Number of testing inner steps | 40 | 40 |
| Test learning rate $\beta$ for seen domain | (0.05, 0.30) | (0.05, 0.30) | (0.05, 0.30) | (0.05, 0.20) |
| Test learning rate $\beta$ for unseen domain | (0.25, 0.05) | (0.25, 0.05) | (0.25, 0.05) | (0.25, 0.05) |

Table 9: Hyper-parameters used for training the Dataset Classifier on various experimental settings.
| Hyper-parameter | Value |
| Batch size | 16 |
| Weight decay | 0.0007 |
| T-max | 500 |
| Max iteration | 4000 |
| Learning rate | 0.001 |

Appendix F Additional Results

F.1 Varying-Way Five-Shot setting

In the standard Meta-Dataset benchmark, tasks vary in the number of classes per task (‘way’) and the number of support images per class (‘shot’), with ‘shot’ ranging up to 100. Here, we evaluate TSP in the Varying-Way Five-Shot setting, which poses a greater challenge due to the limited number of support images. The results in Table 11 show that TSP†† achieves top performance for 10 out of 13 datasets, including all 5 unseen datasets. Notably, TSP†† achieves a higher average score in unseen domains than the previous best result (+2.2%).

F.2 Five-Way One-Shot setting

We evaluate TSP under a more challenging setting, where only a single support image per class is available. As shown in Table 12, TSP†† outperforms the previous best results for 10 out of 13 datasets, while TSP also achieves the best or near-best results. Notably, applying TSP to TA2-Net results in a significant performance improvement in unseen domains (+2.8%) compared to using TA2-Net alone, demonstrating its efficacy in this highly challenging setting.

Table 10: Comparison of two DSP designs with and without the identity matrix. The DSP design $\mathbf{M^{T}M}$ does not guarantee positive definiteness.
| Setting | Varying-Way Varying-Shot | Varying-Way Varying-Shot | Varying-Way Five-Shot | Varying-Way Five-Shot |
| DSP design | $\mathbf{M^{T}M}$ | $\mathbf{M^{T}M+I}$ | $\mathbf{M^{T}M}$ | $\mathbf{M^{T}M+I}$ |
| Avg Seen | 80.2 | 81.6 | 77.0 | 77.9 |
| Avg Unseen | 68.2 | 79.8 | 63.3 | 73.7 |
| Avg All | 75.6 | 80.9 | 71.7 | 76.3 |

Appendix G Comparison between $\mathbf{M^{T}M+I}$ and $\mathbf{M^{T}M}$: Failure of DSP design $\mathbf{M^{T}M}$

In Section 6, we compared $\mathbf{LL^{T}+I}$ and $\mathbf{LL^{T}}$ to explore the impact of including the identity matrix in the DSP design. Extending this analysis, we now include two additional DSP designs, $\mathbf{M^{T}M+I}$ and $\mathbf{M^{T}M}$, for comprehensive examination, despite the latter’s failure to meet positive definiteness. Our findings, presented in Table 10, demonstrate that the DSP design $\mathbf{M^{T}M+I}$ consistently outperforms $\mathbf{M^{T}M}$ in both the Varying-Way Varying-Shot and Varying-Way Five-Shot settings. This contrasts with the results in Table 6 of the manuscript, where including the identity matrix was effective in Varying-Way Varying-Shot setting. Furthermore, $\mathbf{M^{T}M}$ displays significantly lower performance compared to TSA (Li, Liu, and Bilen 2022), which serves as the baseline for our method.

To investigate the failure of the DSP design $\mathbf{M^{T}M}$, we compare the effective ranks (Roy and Vetterli 2007) of Task-Specific Preconditioners between two DSP designs, $\mathbf{M^{T}M}$ and $\mathbf{LL^{T}}$. The effective rank provides a numerical approximation of a matrix’s rank, indicating the number of singular values distant from zero. Since positive definite matrices possess solely positive singular values, an effective rank significantly lower than the full rank implies that the preconditioners are far from positive definite. Table 13 and Table 14 present the averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{M^{T}M}$ in the Varying-Way Varying-Shot and Varying-Way Five-Shot settings, respectively. Several preconditioners in these tables exhibit a notably lower effective rank than the full rank, indicating their departure from positive definiteness. Consequently, these non-positive definite preconditioners may fail to determine the steepest descent direction in the parameter space, leading to the degraded performance observed in Table 10. In contrast, Table 15 and Table 16 reveal that the averaged effective ranks of the 17 Task-Specific Preconditioners of the DSP design $\mathbf{LL^{T}}$ closely approach the full rank. This finding aligns with the Cholesky factorization result (Horn and Johnson 2012) that $\mathbf{LL^{T}}$ is positive definite, which confirms that Task-Specific Preconditioners constructed with $\mathbf{LL^{T}}$ are positive definite (Theorem 1 holds as long as each DSP $\mathbf{P}_{k}$ is positive definite). This observation corroborates the results presented in Table 6 of the manuscript.
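For readers unfamiliar with the effective rank, the sketch below computes it following the definition of Roy and Vetterli (2007): the exponential of the Shannon entropy of the singular values normalized to sum to one. The function name, the threshold `eps` used to drop numerically zero singular values, and the toy matrices are illustrative assumptions. The sketch also assumes a nonzero input matrix and is not the authors' evaluation code.

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """Effective rank: exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / s.sum()                    # normalize singular values (assumes A != 0)
    p = p[p > eps]                     # drop numerically zero entries (0 * log 0 := 0)
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
m = 8

# A matrix M with only two nonzero rows gives an M^T M of rank at most 2.
M = np.zeros((m, m))
M[:2] = rng.standard_normal((2, m))
low = effective_rank(M.T @ M)

# The identity has equal singular values, so its effective rank equals its size.
full = effective_rank(np.eye(m))
print(low, full)
```

In this toy case the effective rank of $\mathbf{M^{T}M}$ cannot exceed 2 although the matrix is $8\times 8$, which mirrors how the low entries in Table 13 and Table 14 signal preconditioners that are far from full rank.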

Appendix H Time-efficiency of TSP compared to GAP

Unlike GAP (Kang et al. 2023), which suffers from long inference time due to time-intensive singular value decomposition (SVD) calculations at every neural network layer during each inner-level training iteration, TSP achieves a much faster inference time. Specifically, GAP requires approximately 14.2 seconds per task, while TSP applied on TSA (Li, Liu, and Bilen 2022) completes inference in just 1.1 seconds, about 13 times faster than GAP, by leveraging pre-calculated DSPs and a dataset classifier to avoid time-intensive calculations during inference. (Inference time for GAP and TSP is measured on a single RTX3090 GPU and averaged over 100 test tasks.) This highlights the superior practical efficiency of TSP compared to the previous PGD-based method, GAP.

Appendix I Application Details for TSP and TSP††

As shown in Figure 6, we apply TSP to the state-of-the-art CDFSL methods, TSA (Li, Liu, and Bilen 2022) and TA2-Net (Guo et al. 2023). Figure 6(a) illustrates PGD with a Domain-Specific Preconditioner (DSP) applied to TSA during meta-training, where the DSP is selected based on the task’s domain label, and PGD optimizes each task-specific parameter $\theta^{l}$ using the corresponding DSP. Figure 6(b) shows PGD with a Task-Specific Preconditioner applied to TSA during meta-testing. In this case, each preconditioner is constructed from DSPs with task-coefficients generated by the Dataset Classifier, and PGD optimizes each $\theta^{l}$ using the constructed preconditioner. Figure 6(c) depicts PGD with DSP applied to TA2-Net during meta-training. Unlike TSA, which optimizes a single task-specific parameter for each module, TA2-Net optimizes multiple task-specific parameters $\theta^{l,j}$. To accommodate this, we apply the same PGD to optimize the multiple parameters $\theta^{l,j}$. Figure 6(d) shows PGD with a Task-Specific Preconditioner applied to TA2-Net during meta-testing, where, similar to the meta-training phase, the same PGD is applied to optimize the multiple task-specific parameters $\theta^{l,j}$.
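To make the meta-testing procedure concrete, the sketch below runs preconditioned gradient descent of the assumed form $\theta \leftarrow \theta - \beta\,\mathbf{P}\,\nabla_{\theta}\mathcal{L}$ on a toy quadratic loss, where $\mathbf{P}$ is a convex combination of DSPs. The update form, the toy loss, the random stand-in for the Dataset Classifier output, and all names are assumptions made for illustration. Only the 40 test-time inner steps and the learning rate scale echo Table 8.

```python
import numpy as np

def pgd_step(theta, grad, P, beta):
    """One preconditioned gradient descent step: theta <- theta - beta * P @ grad."""
    return theta - beta * (P @ grad)

rng = np.random.default_rng(0)
m, K = 4, 8                                    # assumed parameter size and domain count

# Hypothetical DSPs of the form M^T M + I (positive definite by construction).
dsps = []
for _ in range(K):
    M = 0.1 * rng.standard_normal((m, m))
    dsps.append(M.T @ M + np.eye(m))

# Stand-in task-coefficients; in the method these come from the Dataset Classifier.
p = rng.dirichlet(np.ones(K))
P = sum(p_k * P_k for p_k, P_k in zip(p, dsps))

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta.
A = np.diag([1.0, 2.0, 3.0, 4.0])

def loss(theta):
    return 0.5 * theta @ A @ theta

theta = rng.standard_normal(m)
losses = [loss(theta)]
for _ in range(40):                            # 40 test-time inner steps, as in Table 8
    theta = pgd_step(theta, A @ theta, P, beta=0.05)
    losses.append(loss(theta))

print(losses[0], losses[-1])
```

Because $\mathbf{P}$ is positive definite, $-\mathbf{P}\nabla_{\theta}\mathcal{L}$ is a descent direction, so with a sufficiently small step size the toy loss decreases at every step.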

Table 11: Comparison to state-of-the-art methods in Varying-Way Five-Shot setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP denotes TSP applied on TSA. TSP†† denotes TSP applied on TA2-Net.
| Test Dataset | Simple CNAPS | SUR | URT | URL | TSA | TA2-Net | MOKD | TSP | TSP†† |
| ImageNet | 47.2±1.0 | 46.7±1.0 | 48.6±1.0 | 49.4±1.0 | 48.3±1.0 | 49.3±1.0 | 47.5±1.0 | 50.6±1.0 | 50.8±1.0 |
| Omniglot | 95.1±0.3 | 95.8±0.3 | 96.0±0.3 | 96.0±0.3 | 96.8±0.3 | 96.6±0.2 | 96.0±0.3 | 97.2±0.3 | 97.1±0.3 |
| Aircraft | 74.6±0.6 | 82.1±0.6 | 81.2±0.6 | 84.8±0.5 | 85.5±0.5 | 85.9±0.4 | 84.4±0.5 | 86.2±0.5 | 86.7±0.5 |
| Birds | 69.6±0.7 | 62.8±0.9 | 71.2±0.7 | 76.0±0.6 | 76.6±0.6 | 77.3±0.6 | 76.8±0.6 | 77.0±0.6 | 77.8±0.6 |
| Textures | 57.5±0.7 | 60.2±0.7 | 65.2±0.7 | 69.1±0.6 | 68.3±0.7 | 68.3±0.6 | 66.3±0.6 | 69.1±0.6 | 69.3±0.6 |
| Quick Draw | 70.9±0.6 | 79.0±0.5 | 79.2±0.5 | 78.2±0.5 | 77.9±0.6 | 78.5±0.5 | 78.9±0.5 | 78.7±0.6 | 78.8±0.6 |
| Fungi | 50.3±1.0 | 66.5±0.8 | 66.9±0.9 | 70.0±0.8 | 70.4±0.8 | 70.3±0.8 | 68.8±0.9 | 73.6±0.9 | 72.9±0.8 |
| VGG Flower | 86.5±0.4 | 76.9±0.6 | 82.4±0.5 | 89.3±0.4 | 89.5±0.4 | 90.0±0.4 | 89.1±0.4 | 90.8±0.4 | 91.1±0.4 |
| Traffic Sign | 55.2±0.8 | 44.9±0.9 | 45.1±0.9 | 57.5±0.8 | 72.3±0.6 | 76.7±0.5 | 59.2±0.8 | 79.6±0.5 | 80.9±0.5 |
| MSCOCO | 49.2±0.8 | 48.1±0.9 | 52.3±0.9 | 56.1±0.8 | 56.0±0.8 | 56.0±0.8 | 51.8±0.8 | 57.5±0.9 | 59.3±0.8 |
| MNIST | 88.9±0.4 | 90.1±0.4 | 86.5±0.5 | 89.7±0.4 | 92.5±0.4 | 93.3±0.3 | 89.4±0.3 | 93.0±0.3 | 93.9±0.3 |
| CIFAR-10 | 66.1±0.7 | 50.3±1.0 | 61.4±0.7 | 66.0±0.7 | 72.0±0.7 | 73.1±0.7 | 58.8±0.7 | 73.5±0.7 | 74.2±0.7 |
| CIFAR-100 | 53.8±0.9 | 46.4±0.9 | 52.5±0.9 | 57.0±0.9 | 64.1±0.8 | 64.1±0.8 | 55.3±0.9 | 65.0±0.8 | 65.6±0.9 |
| Avg Seen | 69.0 | 71.3 | 73.8 | 76.6 | 76.7 | 77.0 | 76.0 | 77.9 | 78.1 |
| Avg Unseen | 62.6 | 56.0 | 59.6 | 65.3 | 71.4 | 72.6 | 63.0 | 73.7 | 74.8 |
| Avg All | 66.5 | 65.4 | 68.3 | 72.2 | 74.6 | 75.3 | 71.0 | 76.3 | 76.8 |
| Avg Rank | 7.9 | 7.8 | 6.6 | 4.9 | 4.3 | 3.5 | 5.8 | 2.2 | 1.4 |

Table 12: Comparison to state-of-the-art methods in Five-Way One-Shot setting. Mean accuracy and 95% confidence interval are reported. The best results are highlighted in bold. TSP denotes TSP applied on TSA. TSP†† denotes TSP applied on TA2-Net.
| Test Dataset | Simple CNAPS | SUR | URT | URL | TSA | TA2-Net | MOKD | TSP | TSP†† |
| ImageNet | 42.6±0.9 | 40.7±1.0 | 47.4±1.0 | 49.6±1.1 | 48.0±1.0 | 48.8±1.1 | 46.0±1.0 | 50.1±1.0 | 50.5±1.0 |
| Omniglot | 93.1±0.5 | 93.0±0.7 | 95.6±0.5 | 95.8±0.5 | 96.3±0.4 | 95.7±0.4 | 95.5±0.5 | 96.6±0.4 | 96.8±0.4 |
| Aircraft | 65.8±0.9 | 67.1±1.4 | 77.9±0.9 | 79.6±0.9 | 79.6±0.9 | 79.8±0.9 | 78.6±0.9 | 81.1±0.9 | 81.3±0.9 |
| Birds | 67.9±0.9 | 59.2±1.0 | 70.9±0.9 | 74.9±0.9 | 74.5±0.9 | 74.4±0.9 | 75.9±0.9 | 75.7±0.9 | 75.7±0.9 |
| Textures | 42.2±0.8 | 42.5±0.8 | 49.4±0.9 | 53.6±0.9 | 54.5±0.9 | 54.1±0.8 | 51.4±0.9 | 55.5±0.9 | 55.4±0.9 |
| Quick Draw | 70.5±0.9 | 79.8±0.9 | 79.6±0.9 | 79.0±0.8 | 79.3±0.9 | 78.9±0.9 | 78.9±0.9 | 80.7±0.8 | 80.2±0.9 |
| Fungi | 58.3±1.1 | 64.8±1.1 | 71.0±1.0 | 75.2±1.0 | 75.3±1.0 | 75.2±0.9 | 71.1±1.0 | 77.8±0.9 | 78.1±0.8 |
| VGG Flower | 79.9±0.7 | 65.0±1.0 | 72.7±0.0 | 79.9±0.8 | 80.3±0.8 | 80.1±0.8 | 79.8±0.8 | 81.0±0.8 | 81.1±0.8 |
| Traffic Sign | 55.3±0.9 | 44.6±0.9 | 52.7±0.9 | 57.9±0.9 | 57.2±1.0 | 54.1±1.0 | 57.0±0.9 | 57.4±1.0 | 56.9±1.0 |
| MSCOCO | 48.8±0.9 | 47.8±1.1 | 56.9±1.1 | 59.2±1.0 | 59.9±1.0 | 58.1±1.0 | 50.9±0.8 | 59.7±1.0 | 60.5±1.0 |
| MNIST | 80.1±0.9 | 77.1±0.9 | 75.6±0.9 | 78.7±0.9 | 80.1±0.9 | 80.3±0.9 | 72.5±0.9 | 81.4±0.8 | 81.7±0.9 |
| CIFAR-10 | 50.3±0.9 | 35.8±0.8 | 47.3±0.9 | 54.7±0.9 | 55.8±0.9 | 52.9±1.0 | 47.3±0.8 | 55.9±0.9 | 56.0±0.9 |
| CIFAR-100 | 53.8±0.9 | 42.9±1.0 | 54.9±1.1 | 61.8±1.0 | 63.7±1.0 | 61.0±1.1 | 60.2±1.0 | 65.2±1.0 | 65.6±1.0 |
| Avg Seen | 65.0 | 64.0 | 70.6 | 73.5 | 73.5 | 73.4 | 72.2 | 74.8 | 74.9 |
| Avg Unseen | 57.7 | 49.6 | 57.5 | 62.5 | 63.3 | 61.3 | 57.5 | 63.9 | 64.1 |
| Avg All | 62.2 | 58.5 | 65.5 | 69.2 | 69.6 | 68.7 | 66.5 | 70.6 | 70.8 |
| Avg Rank | 7.5 | 8.2 | 6.8 | 4.2 | 3.5 | 4.8 | 6.2 | 1.9 | 1.5 |

Table 13: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{M^{T}M}$ in Varying-Way Varying-Shot setting. We average the effective ranks using 600 tasks randomly sampled from each domain. The left column denotes the name of each task-specific weight, while the right column indicates the full rank of each task-specific weight.
| Weight’s Name | ImageNet | Omniglot | Aircraft | Birds | Textures | Quick Draw | Fungi | VGG Flower | Traffic Sign | MSCOCO | MNIST | CIFAR-10 | CIFAR-100 | Full Rank |
| layer1-0-$\alpha_{1}$ | 60.04 | 60.97 | 63.43 | 61.72 | 60.27 | 64.00 | 62.47 | 63.28 | 62.27 | 60.44 | 64.00 | 60.17 | 59.69 | 64 |
| layer1-0-$\alpha_{2}$ | 62.70 | 61.46 | 62.92 | 62.77 | 61.63 | 64.00 | 61.59 | 62.73 | 62.99 | 62.80 | 64.00 | 62.80 | 62.69 | 64 |
| layer1-1-$\alpha_{1}$ | 54.84 | 57.39 | 40.04 | 57.87 | 47.18 | 63.96 | 35.14 | 49.95 | 50.58 | 54.76 | 63.99 | 56.37 | 56.88 | 64 |
| layer1-1-$\alpha_{2}$ | 62.17 | 63.87 | 57.23 | 62.60 | 62.29 | 64.00 | 59.25 | 60.77 | 62.08 | 62.31 | 64.00 | 62.34 | 62.27 | 64 |
| layer2-0-$\alpha_{1}$ | 95.20 | 125.21 | 33.71 | 72.66 | 58.40 | 127.69 | 38.69 | 96.12 | 79.15 | 93.68 | 127.90 | 98.73 | 104.02 | 128 |
| layer2-0-$\alpha_{2}$ | 125.09 | 127.15 | 109.02 | 113.07 | 116.99 | 127.99 | 117.76 | 122.92 | 124.18 | 125.19 | 128.00 | 124.84 | 125.40 | 128 |
| layer2-1-$\alpha_{1}$ | 126.98 | 127.73 | 115.54 | 124.03 | 119.52 | 128.00 | 126.41 | 127.16 | 127.15 | 126.96 | 128.00 | 126.90 | 126.91 | 128 |
| layer2-1-$\alpha_{2}$ | 127.74 | 127.43 | 126.58 | 126.58 | 127.04 | 128.00 | 127.61 | 127.39 | 127.79 | 127.77 | 128.00 | 127.71 | 127.72 | 128 |
| layer3-0-$\alpha_{1}$ | 230.14 | 255.05 | 120.37 | 96.81 | 7.69 | 255.77 | 216.22 | 202.50 | 129.93 | 149.37 | 255.94 | 214.94 | 199.54 | 256 |
| layer3-0-$\alpha_{2}$ | 250.48 | 254.61 | 156.72 | 142.09 | 196.29 | 255.97 | 253.60 | 218.84 | 247.46 | 250.71 | 255.98 | 240.82 | 247.83 | 256 |
| layer3-1-$\alpha_{1}$ | 196.75 | 252.20 | 53.97 | 10.92 | 41.12 | 254.19 | 252.10 | 24.33 | 126.43 | 195.93 | 254.96 | 125.57 | 164.58 | 256 |
| layer3-1-$\alpha_{2}$ | 255.58 | 254.85 | 253.45 | 253.98 | 253.79 | 255.99 | 255.87 | 254.82 | 255.73 | 255.62 | 255.99 | 255.52 | 255.51 | 256 |
| layer4-0-$\alpha_{1}$ | 491.73 | 509.82 | 466.61 | 482.44 | 16.00 | 511.75 | 509.44 | 501.00 | 305.02 | 335.14 | 511.93 | 489.17 | 438.08 | 512 |
| layer4-0-$\alpha_{2}$ | 460.91 | 493.62 | 410.53 | 73.12 | 387.58 | 511.91 | 507.25 | 478.68 | 497.04 | 482.00 | 511.92 | 385.93 | 438.04 | 512 |
| layer4-1-$\alpha_{1}$ | 431.05 | 501.79 | 223.91 | 243.72 | 7.22 | 488.68 | 451.76 | 420.43 | 201.02 | 239.09 | 489.71 | 410.19 | 351.43 | 512 |
| layer4-1-$\alpha_{2}$ | 498.54 | 509.57 | 509.23 | 507.46 | 496.55 | 511.55 | 510.78 | 509.13 | 507.39 | 500.13 | 511.55 | 499.06 | 496.86 | 512 |
| pa-weight | 11.60 | 3.13 | 3.88 | 7.07 | 4.58 | 2.79 | 3.33 | 3.36 | 9.06 | 11.60 | 2.74 | 11.84 | 12.16 | 512 |

Table 14: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{M^{T}M}$ in Varying-Way Five-Shot setting. We average the effective ranks using 600 tasks randomly sampled from each domain. The left column denotes the name of each task-specific weight, while the right column indicates the full rank of each task-specific weight.
| Weight’s Name | ImageNet | Omniglot | Aircraft | Birds | Textures | Quick Draw | Fungi | VGG Flower | Traffic Sign | MSCOCO | MNIST | CIFAR-10 | CIFAR-100 | Full Rank |
| layer1-0-$\alpha_{1}$ | 60.14 | 61.04 | 63.43 | 61.70 | 60.49 | 64.00 | 62.46 | 63.28 | 62.00 | 60.37 | 64.00 | 60.31 | 59.76 | 64 |
| layer1-0-$\alpha_{2}$ | 62.70 | 61.52 | 62.93 | 62.79 | 61.86 | 64.00 | 61.61 | 62.73 | 62.98 | 62.80 | 64.00 | 62.80 | 62.70 | 64 |
| layer1-1-$\alpha_{1}$ | 54.41 | 57.46 | 40.18 | 57.88 | 48.46 | 63.96 | 35.38 | 49.97 | 51.36 | 55.34 | 63.99 | 55.87 | 56.69 | 64 |
| layer1-1-$\alpha_{2}$ | 62.13 | 63.86 | 57.28 | 62.61 | 62.40 | 64.00 | 59.31 | 60.78 | 62.15 | 62.36 | 64.00 | 62.31 | 62.27 | 64 |
| layer2-0-$\alpha_{1}$ | 93.59 | 124.77 | 34.00 | 73.67 | 62.76 | 127.68 | 39.19 | 96.04 | 81.61 | 95.96 | 127.90 | 96.43 | 103.21 | 128 |
| layer2-0-$\alpha_{2}$ | 124.92 | 127.16 | 109.17 | 113.52 | 118.17 | 127.99 | 117.88 | 122.93 | 124.42 | 125.25 | 128.00 | 124.63 | 125.37 | 128 |
| layer2-1-$\alpha_{1}$ | 126.98 | 127.74 | 115.65 | 124.15 | 120.53 | 128.00 | 126.43 | 127.16 | 127.15 | 126.91 | 128.00 | 126.89 | 126.92 | 128 |
| layer2-1-$\alpha_{2}$ | 127.74 | 127.45 | 126.59 | 126.63 | 127.16 | 128.00 | 127.61 | 127.39 | 127.80 | 127.76 | 128.00 | 127.71 | 127.73 | 128 |
| layer3-0-$\alpha_{1}$ | 229.05 | 254.92 | 121.04 | 99.77 | 9.22 | 255.76 | 216.45 | 202.70 | 139.21 | 136.38 | 255.94 | 211.75 | 200.61 | 256 |
| layer3-0-$\alpha_{2}$ | 249.93 | 254.62 | 157.35 | 144.87 | 202.20 | 255.97 | 253.61 | 219.02 | 248.38 | 249.63 | 255.98 | 239.71 | 248.04 | 256 |
| layer3-1-$\alpha_{1}$ | 193.29 | 251.15 | 54.49 | 11.90 | 47.84 | 254.12 | 251.87 | 24.43 | 138.76 | 184.25 | 254.95 | 127.65 | 166.39 | 256 |
| layer3-1-$\alpha_{2}$ | 255.58 | 254.88 | 253.48 | 254.05 | 254.07 | 255.99 | 255.87 | 254.83 | 255.72 | 255.59 | 255.99 | 255.53 | 255.52 | 256 |
| layer4-0-$\alpha_{1}$ | 492.50 | 509.86 | 466.98 | 482.91 | 19.81 | 511.75 | 509.23 | 501.06 | 322.40 | 308.72 | 511.93 | 487.09 | 439.86 | 512 |
| layer4-0-$\alpha_{2}$ | 457.24 | 493.82 | 411.01 | 77.99 | 398.85 | 511.91 | 507.03 | 478.88 | 496.18 | 476.10 | 511.92 | 382.92 | 440.78 | 512 |
| layer4-1-$\alpha_{1}$ | 431.12 | 501.57 | 225.25 | 248.87 | 8.84 | 488.65 | 451.73 | 420.80 | 220.69 | 213.26 | 489.71 | 405.75 | 354.55 | 512 |
| layer4-1-$\alpha_{2}$ | 499.01 | 509.62 | 509.24 | 507.28 | 497.76 | 511.55 | 510.73 | 509.15 | 506.38 | 499.75 | 511.55 | 499.69 | 497.14 | 512 |
| pa-weight | 11.43 | 3.21 | 3.95 | 7.29 | 5.60 | 2.79 | 3.45 | 3.39 | 9.62 | 11.74 | 2.74 | 11.65 | 12.10 | 512 |

Table 15: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{LL^{T}}$ in Varying-Way Varying-Shot setting. We average the effective ranks using 600 tasks randomly sampled from each domain. The left column denotes the name of each task-specific weight, while the right column indicates the full rank of each task-specific weight.
| Weight’s Name | ImageNet | Omniglot | Aircraft | Birds | Textures | Quick Draw | Fungi | VGG Flower | Traffic Sign | MSCOCO | MNIST | CIFAR-10 | CIFAR-100 | Full Rank |
| layer1-0-$\alpha_{1}$ | 63.97 | 63.95 | 63.99 | 63.98 | 63.99 | 64.00 | 63.99 | 64.00 | 63.99 | 63.98 | 64.00 | 63.96 | 63.97 | 64 |
| layer1-0-$\alpha_{2}$ | 63.97 | 63.96 | 64.00 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
| layer1-1-$\alpha_{1}$ | 63.92 | 63.90 | 63.99 | 63.96 | 63.99 | 64.00 | 63.99 | 63.99 | 63.97 | 63.95 | 64.00 | 63.91 | 63.92 | 64 |
| layer1-1-$\alpha_{2}$ | 63.97 | 63.99 | 63.99 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
| layer2-0-$\alpha_{1}$ | 127.87 | 127.97 | 127.97 | 127.92 | 127.96 | 128.00 | 127.98 | 127.98 | 127.96 | 127.91 | 128.00 | 127.84 | 127.86 | 128 |
| layer2-0-$\alpha_{2}$ | 127.97 | 127.99 | 127.99 | 127.98 | 127.98 | 128.00 | 127.99 | 127.99 | 127.99 | 127.98 | 128.00 | 127.97 | 127.97 | 128 |
| layer2-1-$\alpha_{1}$ | 127.99 | 128.00 | 127.99 | 127.99 | 127.98 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 128.00 | 127.99 | 127.99 | 128 |
| layer2-1-$\alpha_{2}$ | 127.99 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 127.99 | 128 |
| layer3-0-$\alpha_{1}$ | 255.95 | 256.00 | 255.96 | 255.94 | 255.66 | 256.00 | 255.97 | 255.98 | 255.98 | 255.96 | 256.00 | 255.94 | 255.95 | 256 |
| layer3-0-$\alpha_{2}$ | 255.98 | 256.00 | 255.95 | 255.96 | 255.90 | 256.00 | 255.99 | 255.99 | 255.99 | 255.98 | 256.00 | 255.97 | 255.98 | 256 |
| layer3-1-$\alpha_{1}$ | 255.99 | 256.00 | 255.93 | 255.91 | 255.59 | 256.00 | 255.98 | 255.97 | 255.99 | 255.98 | 256.00 | 255.99 | 255.99 | 256 |
| layer3-1-$\alpha_{2}$ | 256.00 | 256.00 | 255.98 | 255.99 | 255.99 | 256.00 | 256.00 | 255.99 | 256.00 | 256.00 | 256.00 | 255.99 | 256.00 | 256 |
| layer4-0-$\alpha_{1}$ | 511.97 | 512.00 | 511.91 | 511.95 | 510.54 | 512.00 | 511.95 | 511.97 | 511.97 | 511.95 | 512.00 | 511.97 | 511.96 | 512 |
| layer4-0-$\alpha_{2}$ | 511.96 | 511.98 | 511.79 | 511.81 | 511.66 | 512.00 | 511.90 | 511.92 | 511.95 | 511.96 | 512.00 | 511.96 | 511.97 | 512 |
| layer4-1-$\alpha_{1}$ | 511.90 | 511.97 | 511.54 | 511.78 | 507.13 | 511.92 | 511.58 | 511.84 | 511.77 | 511.76 | 511.93 | 511.91 | 511.89 | 512 |
| layer4-1-$\alpha_{2}$ | 511.97 | 511.99 | 511.97 | 511.97 | 511.81 | 512.00 | 511.97 | 511.98 | 511.98 | 511.98 | 512.00 | 511.97 | 511.97 | 512 |
| pa-weight | 494.46 | 481.54 | 494.69 | 495.60 | 490.04 | 486.75 | 484.23 | 493.87 | 490.07 | 493.48 | 486.75 | 495.33 | 495.25 | 512 |

Table 16: Averaged effective ranks for 17 Task-Specific Preconditioners of the DSP design $\mathbf{LL^{T}}$ in Varying-Way Five-Shot setting. We average the effective ranks using 600 tasks randomly sampled from each domain. The left column denotes the name of each task-specific weight, while the right column indicates the full rank of each task-specific weight.
| Weight’s Name | ImageNet | Omniglot | Aircraft | Birds | Textures | Quick Draw | Fungi | VGG Flower | Traffic Sign | MSCOCO | MNIST | CIFAR-10 | CIFAR-100 | Full Rank |
| layer1-0-$\alpha_{1}$ | 63.97 | 63.95 | 63.99 | 63.98 | 63.99 | 64.00 | 63.99 | 64.00 | 63.99 | 63.98 | 64.00 | 63.96 | 63.97 | 64 |
| layer1-0-$\alpha_{2}$ | 63.97 | 63.96 | 64.00 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
| layer1-1-$\alpha_{1}$ | 63.92 | 63.90 | 63.99 | 63.96 | 63.99 | 64.00 | 63.99 | 63.99 | 63.97 | 63.94 | 64.00 | 63.91 | 63.92 | 64 |
| layer1-1-$\alpha_{2}$ | 63.97 | 63.99 | 63.99 | 63.99 | 64.00 | 64.00 | 64.00 | 64.00 | 63.99 | 63.98 | 64.00 | 63.97 | 63.97 | 64 |
| layer2-0-$\alpha_{1}$ | 127.87 | 127.97 | 127.97 | 127.93 | 127.96 | 128.00 | 127.98 | 127.98 | 127.95 | 127.91 | 128.00 | 127.85 | 127.86 | 128 |
| layer2-0-$\alpha_{2}$ | 127.97 | 127.99 | 127.99 | 127.98 | 127.99 | 128.00 | 127.99 | 127.99 | 127.99 | 127.98 | 128.00 | 127.97 | 127.97 | 128 |
| layer2-1-$\alpha_{1}$ | 127.99 | 128.00 | 127.99 | 127.99 | 127.99 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 128.00 | 127.99 | 127.99 | 128 |
| layer2-1-$\alpha_{2}$ | 127.99 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 127.99 | 127.99 | 128 |
| layer3-0-$\alpha_{1}$ | 255.95 | 256.00 | 255.97 | 255.94 | 255.78 | 256.00 | 255.97 | 255.98 | 255.98 | 255.96 | 256.00 | 255.94 | 255.95 | 256 |
| layer3-0-$\alpha_{2}$ | 255.98 | 256.00 | 255.95 | 255.95 | 255.94 | 256.00 | 255.99 | 255.99 | 255.99 | 255.98 | 256.00 | 255.97 | 255.98 | 256 |
| layer3-1-$\alpha_{1}$ | 255.99 | 256.00 | 255.93 | 255.90 | 255.74 | 256.00 | 255.98 | 255.97 | 255.99 | 255.98 | 256.00 | 255.99 | 255.99 | 256 |
| layer3-1-$\alpha_{2}$ | 256.00 | 256.00 | 255.98 | 255.99 | 255.99 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256.00 | 256 |
| layer4-0-$\alpha_{1}$ | 511.97 | 512.00 | 511.91 | 511.95 | 511.06 | 512.00 | 511.95 | 511.97 | 511.97 | 511.94 | 512.00 | 511.97 | 511.97 | 512 |
| layer4-0-$\alpha_{2}$ | 511.96 | 511.98 | 511.80 | 511.79 | 511.78 | 512.00 | 511.90 | 511.92 | 511.96 | 511.96 | 512.00 | 511.96 | 511.97 | 512 |
| layer4-1-$\alpha_{1}$ | 511.89 | 511.97 | 511.55 | 511.76 | 508.74 | 511.92 | 511.58 | 511.84 | 511.79 | 511.73 | 511.92 | 511.91 | 511.89 | 512 |
| layer4-1-$\alpha_{2}$ | 511.97 | 511.99 | 511.97 | 511.97 | 511.87 | 512.00 | 511.97 | 511.98 | 511.98 | 511.97 | 512.00 | 511.97 | 511.97 | 512 |
| pa-weight | 494.37 | 481.54 | 494.72 | 495.46 | 490.94 | 486.75 | 484.26 | 493.87 | 490.64 | 494.00 | 486.75 | 495.13 | 495.09 | 512 |

Figure 6: TSP applied on TSA and TA2-Net. (a) PGD with Domain-Specific Preconditioner (DSP) applied on TSA during meta-training. (b) PGD with Task-Specific Preconditioner applied on TSA during meta-testing. (c) PGD with Domain-Specific Preconditioner (DSP) applied on TA2-Net during meta-training. (d) PGD with Task-Specific Preconditioner applied on TA2-Net during meta-testing.