
Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement

Gaurav Patel{}^{\dagger} , Chris Sandino{}^{\ddagger}, Behrooz Mahasseni{}^{\ddagger}, Ellen Zippi{}^{\ddagger}
Erdrin Azemi{}^{\ddagger}, Ali Moin{}^{\ddagger}, Juri Minxha{}^{\ddagger}
Purdue University{}^{\dagger}, Apple{}^{\ddagger}
Work done during an internship at Apple.
Contact: [email protected], [email protected].
Abstract

In this paper, we propose a framework for efficient Source-Free Domain Adaptation (SFDA) in the context of time series, focusing on enhancing both parameter efficiency and data-sample utilization. Our approach introduces an improved paradigm for source-model preparation and target-side adaptation, aiming to enhance training efficiency during target adaptation. Specifically, we reparameterize the source model’s weights in a Tucker-style decomposed manner, factorizing the model into a compact form during the source-model preparation phase. During target-side adaptation, only a subset of these decomposed factors is fine-tuned, leading to significant improvements in training efficiency. We demonstrate using PAC-Bayesian analysis that this selective fine-tuning strategy implicitly regularizes the adaptation process by constraining the model’s learning capacity. Furthermore, this re-parameterization reduces the overall model size and enhances inference efficiency, making the approach particularly well-suited for resource-constrained devices. Additionally, we demonstrate that our framework is compatible with various SFDA methods and achieves significant computational efficiency, reducing the number of fine-tuned parameters and the inference overhead in terms of MACs by over 90% while maintaining model performance.

1 Introduction

In a typical Source-Free Domain Adaptation (SFDA) setup, the source-pretrained model must adapt to the target distribution using unlabeled samples from the target domain. SFDA strategies have become prevalent due to the restrictive nature of conventional domain adaptation methods (Li et al., 2024), which require access to both source and target domain data simultaneously and therefore may not be feasible in real-world scenarios due to privacy and confidentiality concerns. Although SFDA techniques have been extensively investigated for visual tasks (Liang et al., 2020; 2021; Li et al., 2020; Yang et al., 2021b; a; 2022; 2023; Kim et al., 2021; Xia et al., 2021; Kundu et al., 2021; 2022b), their application to time-series analysis remains relatively nascent (Ragab et al., 2023b; Gong et al., 2024). Nevertheless, adapting time-series models is crucial due to the nonstationary and heterogeneous nature of time-series data (Park et al., 2023), where each user’s data exhibit distinct patterns, necessitating adaptive models that can learn idiosyncratic features.

Despite growing interest in SFDA, sample and parameter efficiency during adaptation is still largely unexplored (Karim et al., 2023; Lee et al., 2023). These aspects of SFDA are of particular importance in situations where there is a large resource disparity between the source and the target. For instance, a source-pretrained model may be deployed to a resource-constrained target device, rendering full-model adaptation impractical. Additionally, the target side often has access to substantially fewer reliable samples than the volume of data used during source pretraining, increasing the risk of overfitting.

In response to these challenges, we revisit both the source-model preparation and target-side adaptation processes. We demonstrate that disentangling the backbone parameter subspace during source-model preparation and then fine-tuning a selected subset of those parameters during target-side adaptation leads to a computationally efficient adaptation process while maintaining superior predictive performance. Figure 1B illustrates our proposed framework. During source-model preparation, we re-parameterize the backbone weights in a low-rank, Tucker-style factorized (Tucker, 1966; Lathauwer et al., 2000) manner, decomposing the model into compact parameter subspaces. Tucker-style factorization offers great flexibility and interpretability by breaking down tensors into a core tensor and factor matrices along each mode, capturing multi-dimensional interactions while independently reducing dimensionality. This reparameterization results in a model that is both parameter- and computation-efficient, significantly reducing the model size and inference overhead in terms of Multiply-Accumulate Operations (MACs).

Figure 1: Prior Paradigm vs. Ours. A. Existing approaches (Liang et al., 2020; Yang et al., 2021a; 2022; Ragab et al., 2023b) adapt the entire model to the target distribution. B. Our method disentangles the backbone parameters using Tucker-style factorization. Fine-tuning a small subset of these parameters proves both parameter- and sample-efficient (cf. Section 3), while also being robust against overfitting (cf. Figure 2B and Section 3), leading to superior adaptation to the target distribution.

The re-parameterized source-model is then deployed to the target side. During target-side adaptation, we perform selective fine-tuning (SFT) within a selected subspace (i.e. the core tensor) that constitutes a very small fraction of the total number of parameters in the backbone. Our findings show that this strategy not only enhances parameter efficiency but also serves as a form of regularization, mitigating overfitting to unreliable target samples by restricting the model’s learning capacity. We provide theoretical insights into this regularization effect using PAC-Bayesian generalization bounds (McAllester, 1998; 1999; Li & Zhang, 2021; Wang et al., 2023) for the fine-tuned target model.

Empirical results demonstrate that low-rank weight disentanglement during source-model preparation enables parameter-efficient adaptation on the target side, consistently improving performance across various SFDA methods (Liang et al., 2020; Yang et al., 2021a; 2022; Ragab et al., 2023b) and time-series benchmarks (Ragab et al., 2023a; b). It is important to emphasize that our contribution does not introduce a novel SFDA method for time-series data. Instead, we focus on making the target adaptation process more parameter- and sample-efficient, and demonstrating that our framework can be seamlessly integrated with existing, and potentially future, SFDA techniques. In summary, our key contributions are:

  1. We propose a novel strategy for source-model preparation by reparameterizing the backbone network’s weights using low-rank Tucker-style factorization. This decomposition into a core tensor and factor matrices creates a compact parameter subspace representation, leading to a parameter- and computation-efficient model with reduced size and inference overhead.

  2. During target-side adaptation, we introduce a selective fine-tuning (SFT) approach that adjusts only a small fraction of the backbone’s parameters (i.e., the core tensor) within the decomposed subspace. This strategy enhances parameter efficiency and acts as an implicit regularization mechanism, mitigating the risk of overfitting to unreliable target samples by limiting the model’s learning capacity.

  3. We ground the regularization effect of SFT in a PAC-Bayesian generalization bound. Empirical analyses demonstrate that our proposed framework improves the parameter and sample efficiency of the adaptation process for various existing SFDA methods across time-series benchmarks, showcasing its generalizability.

2 Related Work and Observations

Unsupervised Domain Adaptation in Time Series.

Unsupervised Domain Adaptation (UDA) for time-series data addresses the critical challenge of distribution shifts between the source and target domains. Discrepancy-based methods, such as those of Cai et al. (2021), Liu & Xue (2021) and He et al. (2023), align feature representations through statistical measures, while adversarial approaches such as Wilson et al. (2020), Wilson et al. (2023), Ragab et al. (2022), Jin et al. (2022) and Ozyurt et al. (2023) focus on minimizing distributional discrepancies through adversarial training. The comprehensive study by Ragab et al. (2023a) provides a broad overview of domain adaptation in time series. However, SFDA assumes access to only the source-pretrained model (Liang et al., 2020; 2021; Li et al., 2020; Yang et al., 2021a; 2022; 2023; Kim et al., 2021; Xia et al., 2021). In SFDA, one of the most commonly used strategies is based on unsupervised clustering techniques (Yang et al., 2022; Li et al., 2024), promoting the discriminability and diversity of feature spaces through information maximization (Liang et al., 2020), neighborhood clustering (Yang et al., 2021a) or contrastive learning objectives (Yang et al., 2022). Recent advances incorporate auxiliary tasks, such as masking and imputation, to enhance SFDA performance, as demonstrated by Ragab et al. (2023b) and Gong et al. (2024), taking inspiration from Liang et al. (2021) and Kundu et al. (2022a) to include masked reconstruction as an auxiliary task. However, exploration of SFDA in time series contexts remains limited, especially with regard to parameter and sample efficiency during target-side adaptation (Ragab et al., 2023a; Gong et al., 2024).

Figure 2: A. Original Dimensions vs. Effective Rank: Comparison of input/output channel dimensions ($C_{\text{in}}$, $C_{\text{out}}$) in the last two layers of the source model trained on the HHAR (Stisen et al., 2015) and SSC (Goldberger et al., 2000) datasets, alongside their effective rank computed via VBMF (Nakajima et al., 2013). B. Regularization Effect: Training dynamics illustrate the regularization effect of our source decomposition and selective fine-tuning (SFT) compared to the SHOT (Baseline) (Liang et al., 2020) on the SSC dataset.
Parameter Redundancy and Low-Rank Subspaces.

Neural network pruning has demonstrated that substantial reductions in parameters, often exceeding 90%, can be achieved with minimal accuracy loss, revealing significant redundancy in trained deep networks (LeCun et al., 1989; Hassibi & Stork, 1992; Li et al., 2017; Frankle & Carbin, 2018; Sharma et al., 2024). Structured pruning methods, such as those of Molchanov et al. (2017) and Hoefler et al. (2021), have optimized these reductions to maintain or even improve inference speeds. Low-rank models have played a crucial role in pruning, with techniques such as Singular Value Decomposition (SVD) (Eckart & Young, 1936) applied to fully connected layers (Denil et al., 2013) and tensor-train decomposition used to compress neural networks (Novikov et al., 2015). Methods developed by Jaderberg et al. (2014), Denton et al. (2014), Tai et al. (2016), and Lebedev et al. (2015) accelerate Convolutional Neural Networks (CNNs) through low-rank regularization. Decomposition models such as CANDECOMP/PARAFAC (CP) (Carroll & Chang, 1970) and Tucker decomposition (Tucker, 1966) have effectively reduced the computational complexity of CNNs (Lebedev et al., 2015; Kim et al., 2016). Recent works have extended these techniques to recurrent and transformer layers (Ye et al., 2018; Ma et al., 2019), broadening their applicability. In Figure 2A, we identify significant parameter redundancies in source-pretrained models in terms of effective parameter ranks, revealing low effective ranks. Our motivation for a factored re-parameterization is driven by the specific demands of our proposition—parameter and sample efficiency. We utilize low-rank tensor factorization (Tucker-style decomposition) to disentangle weight tensors into compact disentangled subspaces, enabling selective fine-tuning during target adaptation. This strategy enhances the parameter efficiency of the adaptation process and implicitly mitigates overfitting. As shown in Figure 2B, the validation F1 score for baseline methods declines over epochs, whereas our proposed SFT method maintains consistent performance, demonstrating its robustness against overfitting.

3 Proposed Methodology

In this section, we discuss our proposed methodology through several key steps: (1) We begin by discussing the SFDA setup. (2) We describe the Tucker-style tensor factorization applied to the weight tensors of the pre-trained source model, decomposing them into a core tensor and mode-specific factor matrices; here we introduce the rank factor ($RF$) hyperparameter to control the mode ranks in the decomposition, allowing for flexible trade-offs between model compactness and capacity. (3) During target-side adaptation, we selectively fine-tune the core tensor while keeping the mode factor matrices fixed, effectively adapting to distributional shifts in the target domain while reducing computational overhead and mitigating overfitting. (4) Finally, we offer theoretical insights grounded in PAC-Bayesian generalization analysis, demonstrating how source weight decomposition and target-side selective fine-tuning implicitly regularize the adaptation process, leading to enhanced generalization in predictive performance while rendering the process parameter- and sample-efficient.

SFDA Setup.

SFDA aims to adapt a pre-trained source model to a target domain without access to the original source data. Specifically, in a classification problem, we start with a source dataset $\mathcal{D}_{s}=\{(x_{s},y_{s});\,x_{s}\in\mathcal{X}_{s},\,y_{s}\in\mathcal{C}_{g}\}$, where $\mathcal{X}_{s}$ denotes the input space and $\mathcal{C}_{g}$ represents the set of class labels for the goal task in a closed-set setting. On the target side, we have access to an unlabeled target dataset $\mathcal{D}_{t}=\{x_{t};\,x_{t}\in\mathcal{X}_{t}\}$, where $\mathcal{X}_{t}$ is the input space for the target domain. The aim of SFDA is to learn a target function $f_{t}:\mathcal{X}_{t}\rightarrow\mathcal{C}_{g}$ that can accurately infer the labels $y_{t}\in\mathcal{C}_{g}$ for the samples in $\mathcal{D}_{t}$. $f_{t}$ is obtained using only the unlabeled target samples in $\mathcal{D}_{t}$ and the source model $f_{s}:\mathcal{X}_{s}\rightarrow\mathcal{C}_{g}$, which is pre-trained on the source domain. Each $x$ (either from the source or target domain) is a multivariate time-series sample, i.e., $x\in\mathbb{R}^{M\times L}$, where $L$ is the number of time steps (sequence length) and $M$ denotes the number of observations (channels) at each time step.

Figure 3: Illustration of Tucker-style factorization (Tucker, 1966; Lathauwer et al., 2000).
Tucker-style Factorization.

After the supervised source pre-training (cf. Appendix A.7 for details on source pre-training), we adopt Tucker-style factorization (Tucker, 1966; Lathauwer et al., 2000) of the weight tensors for its effectiveness in capturing multi-way interactions among different mode-independent factors. The compact core tensor encapsulates these interactions, and the factor matrices in the decomposition offer low-dimensional representations for each tensor mode. As illustrated in Figure 3A, the rank-$(R_{1},R_{2},R_{3})$ Tucker-style factorization of the 3-way tensor $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{I_{1}\times I_{2}\times I_{3}}$ is represented as:

\boldsymbol{\mathcal{A}}=\boldsymbol{\mathcal{G}}\times_{1}\mathbf{U}^{(1)}\times_{2}\mathbf{U}^{(2)}\times_{3}\mathbf{U}^{(3)}\quad\text{or,}\quad\boldsymbol{\mathcal{A}}_{i,j,k}=\sum_{r_{1}=1}^{R_{1}}\sum_{r_{2}=1}^{R_{2}}\sum_{r_{3}=1}^{R_{3}}\boldsymbol{\mathcal{G}}_{r_{1},r_{2},r_{3}}\,\mathbf{U}^{(1)}_{i,r_{1}}\mathbf{U}^{(2)}_{j,r_{2}}\mathbf{U}^{(3)}_{k,r_{3}}, \quad (1)

where $\boldsymbol{\mathcal{G}}\in\mathbb{R}^{R_{1}\times R_{2}\times R_{3}}$ is referred to as the core tensor, and $\mathbf{U}^{(1)}\in\mathbb{R}^{I_{1}\times R_{1}}$, $\mathbf{U}^{(2)}\in\mathbb{R}^{I_{2}\times R_{2}}$, and $\mathbf{U}^{(3)}\in\mathbb{R}^{I_{3}\times R_{3}}$ as the factor matrices.
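For concreteness, the mode products in Equation (1) can be evaluated as a single tensor contraction. The short NumPy sketch below (with arbitrarily chosen shapes, purely for illustration) builds $\boldsymbol{\mathcal{A}}$ from a random core tensor and three factor matrices.

```python
import numpy as np

# Illustrative check of Equation (1); all shapes are arbitrary example values.
I1, I2, I3, R1, R2, R3 = 8, 6, 5, 3, 2, 2
G = np.random.randn(R1, R2, R3)    # core tensor
U1 = np.random.randn(I1, R1)       # mode-1 factor matrix
U2 = np.random.randn(I2, R2)       # mode-2 factor matrix
U3 = np.random.randn(I3, R3)       # mode-3 factor matrix

# A = G x_1 U1 x_2 U2 x_3 U3, i.e., A[i,j,k] = sum_{r1,r2,r3} G[r1,r2,r3] U1[i,r1] U2[j,r2] U3[k,r3]
A = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
assert A.shape == (I1, I2, I3)
```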

Furthermore, the linear operations involved in Tucker-style representation (Equation (1)), which are primarily mode-independent matrix multiplications between the core tensor and factor matrices, simplify both mathematical treatment and computational implementation. This linearity ensures that linear operations (e.g., convolution or linear projection) in deep networks using the complete tensor can be efficiently represented as sequences of linear suboperations on the core tensor and factor matrices.

For instance, in one-dimensional convolutional neural networks (1D-CNNs), the convolution operation transforms an input representation $I\in\mathbb{R}^{C_{\text{in}}\times L}$ into an output representation $O\in\mathbb{R}^{C_{\text{out}}\times L^{\prime}}$ using a kernel $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{C_{\text{out}}\times C_{\text{in}}\times K}$. Here, $C_{\text{in}}$, $C_{\text{out}}$, $K$, $L$, and $L^{\prime}$ denote the number of input channels, output channels, kernel size, input sequence length, and output sequence length, respectively. The operation is mathematically defined as:

O_{i,l}=\sum_{j=1}^{C_{\text{in}}}\sum_{k=-\frac{K}{2}}^{\frac{K}{2}}\boldsymbol{\mathcal{W}}_{i,j,k}\,I_{j,l-k}. \quad (2)

Given that the channel dimensions ($C_{\text{in}}$ and $C_{\text{out}}$) are typically much larger than the kernel size ($K$), we restrict the decomposition to modes 1 and 2 only, focusing on the input and output channels. The 2-mode decomposition is represented by:

\boldsymbol{\mathcal{W}}=\boldsymbol{\mathcal{T}}\times_{1}\mathbf{V}^{(1)}\times_{2}\mathbf{V}^{(2)}\quad\text{or,}\quad\boldsymbol{\mathcal{W}}_{i,j,k}=\sum_{r_{1}=1}^{R_{\text{out}}}\sum_{r_{2}=1}^{R_{\text{in}}}\boldsymbol{\mathcal{T}}_{r_{1},r_{2},k}\,\mathbf{V}^{(1)}_{i,r_{1}}\mathbf{V}^{(2)}_{j,r_{2}}, \quad (3)

where $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{R_{\text{out}}\times R_{\text{in}}\times K}$ is the 2-mode decomposed core tensor, and $\mathbf{V}^{(1)}\in\mathbb{R}^{C_{\text{out}}\times R_{\text{out}}}$ and $\mathbf{V}^{(2)}\in\mathbb{R}^{C_{\text{in}}\times R_{\text{in}}}$ represent the respective factor matrices, as also illustrated in Figure 3B. With this decomposed form, the convolution operation can be represented as a sequence of the following linear operations:

Z_{r_{2},l}=\sum_{j=1}^{C_{\text{in}}}\mathbf{V}^{(2)}_{j,r_{2}}I_{j,l},\qquad\text{(Channel Down-Projection)} \quad (4)
Z^{\prime}_{r_{1},l}=\sum_{r_{2}=1}^{R_{\text{in}}}\sum_{k=-\frac{K}{2}}^{\frac{K}{2}}\boldsymbol{\mathcal{T}}_{r_{1},r_{2},k}Z_{r_{2},l-k},\qquad\text{(Core Convolution)} \quad (5)
O_{i,l^{\prime}}=\sum_{r_{1}=1}^{R_{\text{out}}}\mathbf{V}^{(1)}_{i,r_{1}}Z^{\prime}_{r_{1},l^{\prime}},\qquad\text{(Channel Up-Projection)} \quad (6)

where both the Channel Down-Projection and the Channel Up-Projection operations are implemented as unit window-sized convolution operations.
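In code, the factorized layer of Equations (4)-(6) can be realized as a kernel-size-1 down-projection, a small core convolution, and a kernel-size-1 up-projection. The PyTorch module below is a minimal sketch of this reading; the class and argument names are ours and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class TuckerConv1d(nn.Module):
    """Sketch of a Tucker-factorized 1D convolution (Equations 4-6)."""
    def __init__(self, c_in, c_out, k, r_in, r_out, stride=1, padding=0):
        super().__init__()
        # Channel down-projection with V^(2): C_in -> R_in, kernel size 1 (Eq. 4).
        self.down = nn.Conv1d(c_in, r_in, kernel_size=1, bias=False)
        # Core convolution with the core tensor T: R_in -> R_out, kernel size K (Eq. 5).
        self.core = nn.Conv1d(r_in, r_out, kernel_size=k, stride=stride,
                              padding=padding, bias=False)
        # Channel up-projection with V^(1): R_out -> C_out, kernel size 1 (Eq. 6).
        self.up = nn.Conv1d(r_out, c_out, kernel_size=1, bias=True)

    def forward(self, x):          # x: (batch, C_in, L)
        z = self.down(x)           # (batch, R_in, L)
        z = self.core(z)           # (batch, R_out, L')
        return self.up(z)          # (batch, C_out, L')
```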

Building on these insights, we decompose the pre-trained source model weights using Tucker decomposition, optimized via the Higher-Order Orthogonal Iteration (HOOI) algorithm, an alternating least squares method for low-rank tensor approximation (detailed algorithm in Appendix A.1). The optimization problem is defined as:

\boldsymbol{\mathcal{T}}^{*},\mathbf{V}^{(1)*},\mathbf{V}^{(2)*}=\operatorname*{arg\,min}_{\boldsymbol{\mathcal{T}},\mathbf{V}^{(1)},\mathbf{V}^{(2)}}\|\boldsymbol{\mathcal{W}}-\boldsymbol{\mathcal{T}}\times_{1}\mathbf{V}^{(1)}\times_{2}\mathbf{V}^{(2)}\|_{F}. \quad (7)
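The HOOI procedure used to solve Equation (7) is specified in Appendix A.1; as a rough, self-contained illustration of what it computes for a 2-mode decomposition of a 1D-convolution kernel, the sketch below alternates truncated SVDs of the two channel-mode unfoldings. The function names and fixed iteration count are our own choices, not the authors' implementation.

```python
import numpy as np

def mode_unfold(tensor, mode):
    """Unfold a 3-way tensor along `mode` (that mode indexes the rows)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def partial_hooi(weight, r_out, r_in, n_iter=10):
    """Illustrative 2-mode HOOI for W in R^{C_out x C_in x K} (Equation 7)."""
    # HOSVD-style initialization: truncated SVDs of the channel-mode unfoldings.
    V1 = np.linalg.svd(mode_unfold(weight, 0), full_matrices=False)[0][:, :r_out]
    V2 = np.linalg.svd(mode_unfold(weight, 1), full_matrices=False)[0][:, :r_in]
    for _ in range(n_iter):
        # Update V1: project the input-channel mode onto V2, take leading left singular vectors.
        proj = np.einsum('oik,ir->ork', weight, V2)              # (C_out, r_in, K)
        V1 = np.linalg.svd(mode_unfold(proj, 0), full_matrices=False)[0][:, :r_out]
        # Update V2: project the output-channel mode onto V1, then do the same.
        proj = np.einsum('oik,or->rik', weight, V1)              # (r_out, C_in, K)
        V2 = np.linalg.svd(mode_unfold(proj, 1), full_matrices=False)[0][:, :r_in]
    # Core tensor: contract both channel modes of W with the (orthonormal) factors.
    core = np.einsum('oik,or,is->rsk', weight, V1, V2)           # (r_out, r_in, K)
    return core, V1, V2

# Reconstruction for the Frobenius error of Equation (7):
#   W_hat = np.einsum('rsk,or,is->oik', core, V1, V2)
```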
Figure 4: The two columns show the inputs from MNIST-1D (Greydanus & Kobak, 2024) and the core tensors for the source and target domains. The target domain is generated manually by vertically flipping (i.e., negating) the source. Both $R_{\text{in}}$ and $R_{\text{out}}$ are set to 1 (cf. Figure 3B) to facilitate visualization of the core tensors. Domain-invariant output features are achieved as the core tensor adapts to the target samples to mitigate the domain shift (negation).

The weight tensor is then re-parameterized with the core tensor $\boldsymbol{\mathcal{T}}^{*}\in\mathbb{R}^{R_{\text{out}}\times R_{\text{in}}\times K}$ and mode factor matrices $\mathbf{V}^{(1)*}\in\mathbb{R}^{C_{\text{out}}\times R_{\text{out}}}$ and $\mathbf{V}^{(2)*}\in\mathbb{R}^{C_{\text{in}}\times R_{\text{in}}}$, as formulated in Equations (4), (5), and (6).

However, the re-parameterization is obtained by minimizing only the linear reconstruction error of the weights (Equation 7), which may degrade the predictive performance on the source domain (Kim et al., 2016). To mitigate this effect, we fine-tune the core tensor and factor matrices using the source data for an additional 2-3 epochs prior to deploying the model on the target side. This brief fine-tuning effectively restores the original predictive performance while retaining the benefits of the reduced parameter count. This behavior strongly aligns with the Lottery Ticket Hypothesis (Frankle & Carbin, 2018), which suggests that compressing and retraining pre-trained, over-parameterized deep networks can result in superior performance and storage efficiency compared to training smaller networks from scratch. Similar observations have been made by Zimmer et al. (2023) in the context of large language model training.

Moreover, we introduce the rank factor ($RF$), a hyperparameter to standardize the mode-rank analysis. The mode ranks are set as $(R_{\text{in}},R_{\text{out}})=\left(\left\lfloor\frac{C_{\text{in}}}{RF}\right\rfloor,\left\lfloor\frac{C_{\text{out}}}{RF}\right\rfloor\right)$, where a higher $RF$ results in lower mode ranks, and vice versa. Although $RF$ can be set independently for the input and output channels, for simplicity in our analysis we control both $R_{\text{in}}$ and $R_{\text{out}}$ with a single hyperparameter, $RF$. This tunable parameter allows for flexible control over the trade-off between parameter reduction and model capacity.

Target Side Adaptation.

The core tensor in our setup plays a pivotal role by capturing multi-way interactions between different modes, encoding essential inter-modal relationships together with the factor matrices. Additionally, its sensitivity to temporal dynamics (cf. Equation (5)) makes it particularly well-suited for addressing distributional shifts in time-series data, where discrepancies often arise due to temporal variations (Fan et al., 2023). Therefore, by selectively fine-tuning the core tensor during target adaptation, we can effectively adapt to these shifts while maintaining a compact parameterization. Figure 4 presents a toy experiment demonstrating that fine-tuning the core tensor suffices to address domain shift, aligning the model more effectively with the target domain. This approach also enhances parameter efficiency by reducing the number of parameters that need to be updated, minimizing computational overhead. Moreover, selective fine-tuning mitigates the risk of overfitting, providing a more robust adaptation process, as shown in Figure 2B. This strategy leverages the expressiveness of the core tensor to address temporal domain shifts while maintaining the computational benefits of a structured, low-rank parameterization.
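Operationally, selective fine-tuning amounts to leaving gradients enabled only for the core tensors. A minimal sketch, assuming the factorized backbone exposes its core weights under parameter names containing 'core' (as in the hypothetical TuckerConv1d module sketched earlier):

```python
def select_core_parameters(model):
    """Freeze everything except the core tensors of the factorized backbone (SFT)."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = 'core' in name   # keep only core-tensor weights trainable
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Target-side adaptation then optimizes only this subset, e.g.
#   optimizer = torch.optim.Adam(select_core_parameters(backbone), lr=1e-4)
# while the unsupervised SFDA objective (SHOT, NRC, AAD, MAPU, ...) is left unchanged.
```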

Computational and Parameter Efficiency.

The decomposition significantly reduces the parameter count from $C_{\text{out}}\times C_{\text{in}}\times K$ to:

\underbrace{R_{\text{out}}\times R_{\text{in}}\times K}_{\text{Core Tensor }(\boldsymbol{\mathcal{T}})}+\underbrace{C_{\text{out}}\times R_{\text{out}}}_{\text{Factor Matrix }(\mathbf{V}^{(1)})}+\underbrace{C_{\text{in}}\times R_{\text{in}}}_{\text{Factor Matrix }(\mathbf{V}^{(2)})} \quad (8)

where $R_{\text{out}}=\left\lfloor\frac{C_{\text{out}}}{RF}\right\rfloor\ll C_{\text{out}}$ and $R_{\text{in}}=\left\lfloor\frac{C_{\text{in}}}{RF}\right\rfloor\ll C_{\text{in}}$, ensuring substantial parameter savings. Furthermore, since we selectively fine-tune the core tensor, the parameter efficiency is further enhanced.

In terms of computational efficiency, the factorized convolution reduces the operation count from $\mathcal{O}(C_{\text{out}}\times C_{\text{in}}\times K\times L^{\prime})$ to a series of smaller convolutions with complexity:

\underbrace{\mathcal{O}(C_{\text{in}}\times R_{\text{in}}\times L)}_{\text{Equation (4)}}+\underbrace{\mathcal{O}(R_{\text{out}}\times R_{\text{in}}\times K\times L^{\prime})}_{\text{Equation (5)}}+\underbrace{\mathcal{O}(C_{\text{out}}\times R_{\text{out}}\times L^{\prime})}_{\text{Equation (6)}} \quad (9)

resulting in a notable reduction in computational cost, dependent on the values of $R_{\text{out}}$ and $R_{\text{in}}$ (see Appendix A.2 for a detailed explanation).
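As a worked example of Equations (8) and (9), the helper below (hypothetical, with illustrative layer sizes) counts dense versus factorized parameters and MACs for a single 1D-convolution layer at a given rank factor; it approximates the input length by the output length, which is exact for unit stride with 'same' padding.

```python
def factorized_costs(c_in, c_out, k, length, rf):
    """Illustrative parameter (Eq. 8) and MAC (Eq. 9) counts for one factorized 1D-conv layer."""
    r_in, r_out = c_in // rf, c_out // rf
    dense_params = c_out * c_in * k
    fact_params = r_out * r_in * k + c_out * r_out + c_in * r_in        # Equation (8)
    dense_macs = c_out * c_in * k * length
    fact_macs = (c_in * r_in * length            # channel down-projection, Eq. (4)
                 + r_out * r_in * k * length     # core convolution, Eq. (5)
                 + c_out * r_out * length)       # channel up-projection, Eq. (6)
    return (dense_params, fact_params), (dense_macs, fact_macs)

# Example: c_in = c_out = 64, k = 5, length = 128, rf = 4 gives
# parameters 20480 -> 3328 and MACs ~2.62M -> ~0.43M (both roughly 84% lower).
```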

Robust Target Adaptation.
Figure 5: Layer-wise parameter distance between the source-pretrained model and the target-adapted model using the SFT strategy for different rank-factor values ($RF\in\{2,4,8\}$) on the SSC (Goldberger et al., 2000), MFD (Lessmeier et al., 2016), and HHAR (Stisen et al., 2015) datasets. The values represent the average parameter distances across all source-target pairs provided for the respective datasets. Lower values indicate smaller parameter distances.

In this subsection, we aim to uncover the underlying factors contributing to the robustness of the proposed SFT strategy, particularly in terms of sample efficiency and its implicit regularization effects during adaptation. To achieve this, we leverage the PAC-Bayesian generalization theory (McAllester, 1998; 1999), which provides a principled framework for bounding the generalization error in deep neural networks during fine-tuning (Dziugaite & Roy, 2017; Li & Zhang, 2021; Wang et al., 2023). We analyze the generalization error on the network parameters $\boldsymbol{W}=\{\boldsymbol{\mathcal{W}}^{(i)}\}_{i=1}^{D}$, i.e., $\mathcal{L}(\boldsymbol{W})-\hat{\mathcal{L}}(\boldsymbol{W})$, where $\mathcal{L}(\boldsymbol{W})$ is the test loss, $\hat{\mathcal{L}}(\boldsymbol{W})$ is the empirical training loss, and $D$ denotes the number of layers.

Theorem 1.

(PAC-Bayes generalization bound for fine-tuning) Let $\boldsymbol{W}$ be some hypothesis class (network parameters). Let $P$ be a prior (source) distribution on $\boldsymbol{W}$ that is independent of the target training set. Let $Q(S)$ be a posterior (target) distribution on $\boldsymbol{W}$ that depends on the target training set $S$ consisting of $n$ samples. Suppose the loss function $\mathcal{L}(\cdot)$ is bounded by $C$. Set the prior distribution to $P=\mathcal{N}(\boldsymbol{W}_{\text{src}},\sigma^{2}I)$, where $\boldsymbol{W}_{\text{src}}$ are the weights of the pre-trained network, and center the posterior distribution $Q(S)$ at the fine-tuned model, i.e., $Q(S)=\mathcal{N}(\boldsymbol{W}_{\text{trg}},\sigma^{2}I)$. Then, with probability $1-\delta$ over the randomness of the training set, the following holds:

\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\mathcal{L}(\boldsymbol{W})\right]\leq\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\hat{\mathcal{L}}(\boldsymbol{W},S)\right]+C\sqrt{\frac{\sum_{i=1}^{D}\|\boldsymbol{\mathcal{W}}^{(i)}_{\text{trg}}-\boldsymbol{\mathcal{W}}^{(i)}_{\text{src}}\|_{F}^{2}}{2\sigma^{2}n}+\frac{k\ln\frac{n}{\delta}+l}{n}}. \quad (10)

for some $\delta,k,l>0$, where $\boldsymbol{\mathcal{W}}^{(i)}_{\text{trg}}\in\boldsymbol{W}_{\text{trg}}$ and $\boldsymbol{\mathcal{W}}^{(i)}_{\text{src}}\in\boldsymbol{W}_{\text{src}}$ for all $1\leq i\leq D$, with $D$ denoting the total number of layers.

Proof.

See Appendix A.4 (Theorem 1) ∎

In Equation (10), the test error (LHS) is bounded by the empirical training loss and the divergence between the source pre-trained weights ($\boldsymbol{W}_{\text{src}}$) and the target fine-tuned weights ($\boldsymbol{W}_{\text{trg}}$), measured using the layer-wise Frobenius norm. Motivated by this theoretical insight, we empirically analyze the layer-wise parameter distances between the source and target models in terms of the Frobenius norm. Figure 5 illustrates these distances for both vanilla fine-tuning (Baseline: fine-tuning all parameters) and our SFT approach, which selectively fine-tunes the core subspace, across various time-series datasets, including SSC (Goldberger et al., 2000), MFD (Lessmeier et al., 2016), and HHAR (Stisen et al., 2015), for different $RF$ values. The results show that SFT naturally constrains the parametric distance between the source and target weights, with a higher $RF$ exhibiting a greater constraining ability, thus regulating the adaptation process in line with the distance-based regularization hypothesis of Gouk et al. (2021). Furthermore, in Appendix A.3 (Lemma 2), we provide a formal bound on the parameter distance in terms of the decomposed ranks of the weight matrices. This observation reveals a trade-off in the source-target parameter distance: while a smaller distance tightens the generalization bound, it also limits the model’s capacity, potentially regularizing or, in extreme cases, hindering adaptability. In Section 4, we use this bound to further discuss SFT’s sample efficiency.
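The quantity plotted in Figure 5 is straightforward to compute from two checkpoints. A sketch, assuming the source-pretrained and target-adapted weights are available as PyTorch state dicts with matching keys (the checkpoint paths below are hypothetical):

```python
import torch

def layerwise_parameter_distance(src_state, trg_state):
    """Per-layer Frobenius distances ||W_trg^(i) - W_src^(i)||_F entering Equation (10)."""
    distances = {}
    for name, w_src in src_state.items():
        if name in trg_state and w_src.is_floating_point():
            distances[name] = torch.norm(trg_state[name].float() - w_src.float()).item()
    return distances

# Usage (hypothetical checkpoint paths):
#   src = torch.load('source_model.pt');  trg = torch.load('target_adapted_model.pt')
#   dists = layerwise_parameter_distance(src, trg)
# Under SFT only the core tensors move, so most entries are exactly zero and the
# summed distance in the bound of Equation (10) stays small.
```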

Figure 6: Comparison of the predictive performance of our selective fine-tuning (SFT) strategy w.r.t. F1 score (%) at different target sample ratios against baseline methods (SHOT, NRC, AAD, MAPU). Adaptations are conducted using 0.5%, 5%, and 100% of the total unlabeled target samples, randomly sampled in a stratified manner. The figure demonstrates performance differences across methods as the amount of target data varies, highlighting the sample efficiency of the proposed SFT strategy. Results are averaged over three runs.

4 Experiments and Analysis

Datasets and Methods.

We utilize the AdaTime benchmarks proposed by Ragab et al. (2023a; b) to evaluate SFDA methods on five datasets: SSC (Goldberger et al., 2000), MFD (Lessmeier et al., 2016), HHAR (Stisen et al., 2015), UCIHAR (Anguita et al., 2013), and WISDM (Kwapisz et al., 2011). Each dataset involves distinct domains based on individual subjects (SSC), devices (HHAR, UCIHAR, WISDM), or entities (MFD). For comprehensive dataset descriptions and domain details, refer to Appendix A.5. To assess the effectiveness and generalizability of our decomposition framework, we integrate it with prominent SFDA methods in time-series contexts: SHOT (Liang et al., 2020), NRC (Yang et al., 2021a), AAD (Yang et al., 2022), and MAPU (Ragab et al., 2023b). For more details on the adaptation methods, we direct readers to Appendix 8.

Experimental Setup.

Our experimental setup systematically evaluates the proposed strategy (SFT) alongside contemporary SFDA methods evaluated by Ragab et al. (2023b); Gong et al. (2024), including SHOT (Liang et al., 2020), NRC (Yang et al., 2021a), AAD (Yang et al., 2022), and MAPU (Ragab et al., 2023b). In SFT, backbone fine-tuning is restricted to the core subspace of the decomposed parameters, and the hyperparameters for source training and target adaptation are kept consistent between each vanilla SFDA method and its SFT variant. In addition, we assess the impact of different rank factors ($RF$). For values of $RF$ greater than 8, we observed significant underfitting, leading us to limit our evaluation to $RF\in\{2,4,8\}$. The experiments are carried out using the predefined source-target pairs from the AdaTime benchmark (Ragab et al., 2023a). The predictive performance of the methods is evaluated by comparing the average F1 score of the target-adapted model across all source-target pairs for the respective dataset in the benchmark. Each experiment is repeated three times with different random seeds to ensure robustness. More details are provided in Appendix A.7.

Table 1: Performance and efficiency comparison on SSC, MFD, and HHAR datasets across SFDA methods, reported as average F1 score (%) at target sample ratios (0.5%, 5%, 100%), inference MACs (M), and fine-tuned parameters (K). Highlighted rows show results for SFT, where only the core tensor is fine-tuned at different $RF$ values. Green numbers represent average percentage improvement, while Red numbers indicate reduction in MACs and fine-tuned parameters.
F1 Score (%) \uparrow
Methods $RF$ 0.5% 5% 100% Average MACs (M) \downarrow # Params. (K) \downarrow
SSC
SHOT (Liang et al., 2020) - 62.32 ± 1.57 63.95 ± 1.51 67.95 ±1.04 64.74 12.92 83.17
8 62.53 ± 0.46 66.55 ± 0.46 67.50 ± 1.33 65.53 (1.22%) 0.80 (93.81%) 1.38 (98.34%)
4 62.71 ± 0.57 67.16 ± 1.06 68.56 ±0.44 66.14 (2.16%) 1.99 (84.60%) 5.32 (93.60%)
SHOT + SFT 2 63.05 ± 0.32 65.44 ± 0.83 67.48 ±0.89 65.32 (0.90%) 5.54 (57.12%) 20.88 (74.89%)
NRC (Yang et al., 2021a) - 59.92 ± 1.19 63.56 ± 1.35 65.23 ± 0.59 62.90 12.92 83.17
8 60.65 ± 1.37 63.60 ± 1.43 65.05 ± 1.66 63.10 (0.32%) 0.80 (93.81%) 1.38 (98.34%)
4 61.95 ± 0.62 65.11 ± 1.34 67.19 ± 0.14 64.75 (2.94%) 1.99 (84.60%) 5.32 (93.60%)
NRC + SFT 2 60.60 ± 0.58 65.06 ± 0.24 66.83 ± 0.51 64.16 (2.00%) 5.54 (57.12%) 20.88 (74.89%)
AAD (Yang et al., 2022) - 59.39 ± 1.80 63.21 ± 1.53 63.71 ± 2.06 62.10 12.92 83.17
8 60.62 ± 1.40 65.38 ± 0.93 65.80 ± 1.17 63.93 (2.95%) 0.80 (93.81%) 1.38 (98.34%)
4 61.92 ± 0.68 66.59 ± 0.95 67.14 ± 0.57 65.22 (5.02%) 1.99 (84.60%) 5.32 (93.60%)
AAD + SFT 2 60.59 ± 0.59 63.96 ± 2.04 64.39 ± 1.45 62.98 (1.42%) 5.54 (57.12%) 20.88 (74.89%)
MAPU (Ragab et al., 2023b) - 60.35 ± 2.15 62.48 ± 1.57 66.73 ± 0.85 63.19 12.92 83.17
8 63.52 ± 1.24 64.77 ± 0.22 66.09 ± 0.19 64.79 (2.53%) 0.80 (93.81%) 1.38 (98.34%)
4 62.21 ± 0.88 65.20 ± 1.25 67.40 ± 0.59 64.94 (2.77%) 1.99 (84.60%) 5.32 (93.60%)
MAPU + SFT 2 62.25 ± 1.74 66.19 ± 1.02 67.85 ± 0.62 65.43 (3.54%) 5.54 (57.12%) 20.88 (74.89%)
MFD
SHOT (Liang et al., 2020) - 90.74 ± 1.27 91.53 ± 1.44 91.93 ± 1.37 91.40 58.18 199.3
8 92.14 ± 1.77 93.36 ± 0.53 93.13 ± 0.54 92.88 (1.62%) 3.52 (93.95%) 3.33 (98.33%)
4 92.98 ± 1.04 93.36 ± 0.75 93.21 ± 0.64 93.18 (1.95%) 8.80 (84.87%) 12.80 (93.58%)
SHOT + SFT 2 93.01 ± 0.79 93.4 ± 0.41 93.92 ± 0.33 93.44 (2.23%) 24.66 (57.61%) 50.18 (74.82%)
NRC (Yang et al., 2021a) - 89.03 ± 3.84 91.74 ± 1.31 92.20 ± 1.33 90.99 58.18 199.3
8 92.14 ± 1.77 93.36 ± 0.51 93.14 ± 0.70 92.88 (2.08%) 3.52 (93.95%) 3.33 (98.33%)
4 92.91 ± 1.05 93.25 ± 0.69 92.97 ± 0.45 93.04 (2.25%) 8.80 (84.87%) 12.80 (93.58%)
NRC + SFT 2 92.98 ± 0.78 93.22 ± 0.58 93.86 ± 0.36 93.35 (2.59%) 24.66 (57.61%) 50.18 (74.82%)
AAD (Yang et al., 2022) - 89.03 ± 3.82 91.06 ± 2.21 90.65 ± 2.48 90.25 58.18 199.3
8 92.15 ± 1.78 93.36 ± 0.58 93.28 ± 0.55 92.93 (2.97%) 3.52 (93.95%) 3.33 (98.33%)
4 92.88 ± 1.12 93.05 ± 0.61 93.67 ± 0.41 93.20 (3.27%) 8.80 (84.87%) 12.80 (93.58%)
AAD + SFT 2 92.97 ± 0.79 93.45 ± 0.53 94.41 ± 0.19 93.61 (3.72%) 24.66 (57.61%) 50.18 (74.82%)
MAPU (Ragab et al., 2023b) - 85.32 ± 1.25 86.49 ± 1.59 91.71 ± 0.47 87.84 58.18 199.3
8 91.07 ± 1.76 91.58 ± 1.14 92.11 ± 1.56 91.59 (4.27%) 3.52 (93.95%) 3.33 (98.33%)
4 91.01 ± 0.97 92.10 ± 1.02 91.18 ± 1.64 91.43 (4.09%) 8.80 (84.87%) 12.80 (93.58%)
MAPU + SFT 2 91.87 ± 2.85 91.94 ± 1.44 91.84 ± 1.85 91.88 (4.60%) 24.66 (57.61%) 50.18 (74.82%)
HHAR
SHOT (Liang et al., 2020) - 72.08 ± 1.20 79.80 ± 1.70 80.12 ± 0.29 77.33 9.04 198.21
8 73.65 ± 2.39 79.07 ± 1.88 81.73 ± 1.47 78.15 (1.06%) 0.53 (94.14%) 3.19 (98.39%)
4 75.47 ± 1.34 79.98 ± 2.01 81.05 ± 0.45 78.83 (1.94%) 1.34 (85.18%) 12.53 (93.68%)
SHOT + SFT 2 75.14 ± 1.85 77.70 ± 1.37 82.92 ± 0.27 78.59 (1.63%) 3.79 (58.08%) 49.63 (74.96%)
NRC (Yang et al., 2021a) - 72.38 ± 0.87 75.52 ± 1.15 77.80 ± 0.16 75.23 9.04 198.21
8 71.41 ± 0.62 76.15 ± 0.99 80.09 ± 0.56 75.88 (0.86%) 0.53 (94.14%) 3.19 (98.39%)
4 72.77 ± 0.43 76.60 ± 1.97 80.27 ± 0.55 76.55 (1.75%) 1.34 (85.18%) 12.53 (93.68%)
NRC + SFT 2 72.34 ± 0.14 75.53 ± 1.31 79.45 ± 1.51 75.77 (0.72%) 3.79 (58.08%) 49.63 (74.96%)
AAD (Yang et al., 2022) - 20.18 ± 2.70 76.55 ± 2.17 82.25 ± 1.14 59.66 9.04 198.21
8 57.93 ± 0.90 78.57 ± 0.74 82.92 ± 1.66 73.14 (22.59%) 0.53 (94.14%) 3.19 (98.39%)
4 54.43 ± 1.04 78.57 ± 0.74 83.15 ± 0.88 72.05 (20.77%) 1.34 (85.18%) 12.53 (93.68%)
AAD + SFT 2 49.46 ± 2.22 79.31 ± 0.72 83.25 ± 0.34 70.67 (18.45%) 3.79 (58.08%) 49.63 (74.96%)
MAPU (Ragab et al., 2023b) - 71.18 ± 2.78 78.67 ± 1.20 80.32 ± 1.16 76.72 9.04 198.21
8 73.73 ± 2.35 78.54 ± 2.37 80.16 ± 2.38 77.48 (0.99%) 0.53 (94.14%) 3.19 (98.39%)
4 75.12 ± 0.88 80.04 ± 0.42 80.24 ± 1.22 78.47 (2.28%) 1.34 (85.18%) 12.53 (93.68%)
MAPU + SFT 2 76.06 ± 1.24 78.95 ± 2.04 80.69 ± 2.45 78.57 (2.41%) 3.79 (58.08%) 49.63 (74.96%)
Sample-efficiency across SFDA Methods and Datasets.

We assess the sample efficiency of the proposed SFT method by evaluating its predictive performance in the low-data regime. To conduct this analysis, we randomly select 0.5% and 5% of the total available unlabeled target samples in a stratified manner, utilizing fixed seeds for consistency. These sampled subsets serve exclusively for the adaptation process. In Figure 6, we present a comparative analysis of the average F1 score post-adaptation, contrasting SFT with baseline methods (SHOT, NRC, AAD, and MAPU) across multiple datasets: SSC, MFD, HHAR, WISDM, and UCIHAR. Notably, our empirical observations are grounded in the theoretical framework established by Theorem 1, which is encapsulated in Equation (10). While all SFDA methods aim to minimize the empirical loss on the target domain (the first term on the right-hand side of Equation (10)) through their unsupervised objectives, the second term, which depends on both the number of samples and the parameter distance, plays a critical role in generalization. In low-data regimes, as the number of target samples decreases, this second term increases, thereby loosening the generalization bound. As a result, baseline methods exhibit a noticeable drop in predictive performance, observed most distinctly for the MFD and UCIHAR datasets (Figure 6), when adapting under the low sample-ratio setting. In contrast, SFT mitigates the adverse effect of having few samples by implicitly constraining the parameter-distance term (the numerator of the second term on the RHS), as empirically observed in Figure 5, thereby keeping the second term small and the bound tight. This renders SFT significantly more sample-efficient, preserving robust generalization even in low-data regimes. We extend this analysis in Appendix A.7.
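For reference, a low-data adaptation subset of this kind could be drawn as sketched below; the paper only states that sampling is stratified with fixed seeds, so the use of scikit-learn's train_test_split and of target labels purely for stratifying the draw (never for adaptation) are our assumptions.

```python
from sklearn.model_selection import train_test_split

def stratified_target_subset(X, y, ratio, seed=0):
    """Draw a fixed-seed, class-stratified fraction of the target pool for adaptation."""
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=ratio, stratify=y, random_state=seed)
    return X_sub, y_sub

# e.g. stratified_target_subset(X_target, y_target, ratio=0.05, seed=0) gives a 5% split
# as in Figure 6; ratio=0.005 gives the 0.5% split (labels are discarded before adaptation).
```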

Parameter efficiency across SFDA Methods and Datasets.

In addition, Table 1 demonstrates consistent improvements in the F1 score alongside enhanced parameter efficiency during the adaptation process, as evidenced by the reduced count of fine-tuned parameters (# Params.). As the rank factor ($RF$) increases, the size of the core subspace decreases (cf. Equation (8)), leading to fewer parameters that require updates during adaptation. This reduction not only enhances efficiency but also lowers computational costs (MACs) (cf. Equation (9)). The observed performance gains and improved efficiency are consistent across different SFDA methods and benchmark datasets, underscoring the robustness, generalizability, and efficiency of our approach. In Appendix A.7, we extend this analysis to the WISDM and UCIHAR datasets.

Comparison with Parameter-Efficient Tuning Methods.
Method Baseline SFT (RF-2) SFT (RF-4) SFT (RF-8) LoRA-2 LoRA-4 LoRA-8 LoRA-16 LoKrA
MACs (M) \downarrow 12.92 5.54 1.99 0.80 12.92 12.92 12.92 12.92 12.92
# Params. (K) \downarrow 83.17 20.88 5.32 1.38 0.81 1.94 5.19 15.63 1.84
Figure 7: Comparison of LoRA (Hu et al., 2021) and LoKrA (Edalati et al., 2022; Yeh et al., 2024) (in Purple) against baseline methods (SHOT, NRC, AAD, MAPU) and our proposed approach (SFT) on the SSC dataset, evaluated across varying target sample ratios used during adaptation. The table at the bottom shows the target model’s inference overhead after adaptation in terms of MACs and the number of parameters fine-tuned during adaptation.

To further substantiate the robustness of SFT, we benchmark its performance against Parameter-Efficient Fine-Tuning (PEFT) approaches (Han et al., 2024; Xin et al., 2024). PEFT methods primarily aim to match the performance of full model fine-tuning while substantially reducing the number of trainable parameters. Among the most prominent approaches, Low-Rank Adaptation (LoRA) (Hu et al., 2021) has gained significant traction in this domain. In Figure 7, we analyze the performance of LoRA at varying intermediate ranks ({2, 4, 8, 16}), and Low-Rank Kronecker Adaptation (LoKrA) (Edalati et al., 2022; Yeh et al., 2024) under its most parameter-efficient configuration, alongside SFT, on the SSC dataset (additional results are provided in Appendix Figure 11). Remarkably, SFT consistently outperforms these PEFT baselines across SFDA tasks, demonstrating superior adaptability and performance. This advantage arises from the structured decomposition applied during source-model preparation, which explicitly disentangles the parameter subspace. In contrast, LoRA-type methods introduce a residual branch over the pretrained weights and rely on the fine-tuning objective to uncover the underlying low-rank subspaces (Hu et al., 2021), which results in weaker control over the adaptation process. It is important to highlight that PEFT methods can also be seamlessly integrated with SFT during fine-tuning. For instance, LoRA- or LoKrA-style adaptation can be applied to the core tensor, further enhancing the efficiency of the adaptation process, which we discuss in detail in Appendix A.8. However, while LoRA-type methods can effectively adapt model weights to the target domain, they do not reduce inference overhead; achieving this requires explicit decomposition and reparameterization of the weights.
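As an illustration of the combination mentioned above (and discussed in Appendix A.8), a LoRA-style residual can be attached to the frozen core convolution. The module below is a hedged sketch of that idea, not the configuration evaluated in the paper; names, initialization, and scaling follow common LoRA practice.

```python
import torch
import torch.nn as nn

class LoRACore1d(nn.Module):
    """Sketch: LoRA-style low-rank residual on a frozen core convolution T (R_out x R_in x K)."""
    def __init__(self, core_conv, rank=2, alpha=1.0):
        super().__init__()
        self.core = core_conv                       # frozen nn.Conv1d(R_in, R_out, K)
        for p in self.core.parameters():
            p.requires_grad = False
        r_out, r_in, k = core_conv.weight.shape
        self.A = nn.Parameter(torch.zeros(r_out, rank))            # zero-init so delta starts at 0
        self.B = nn.Parameter(torch.randn(rank, r_in * k) * 0.01)
        self.scale = alpha / rank

    def forward(self, x):
        # Low-rank update reshaped to the kernel layout and added to the frozen core weight.
        delta = (self.A @ self.B).view_as(self.core.weight) * self.scale
        return nn.functional.conv1d(x, self.core.weight + delta, bias=self.core.bias,
                                    stride=self.core.stride, padding=self.core.padding)
```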

Ablation studies on Fine-tuning different Parameter Subspaces.

In Table 2, we present an ablation study evaluating the impact of fine-tuning different components of the decomposed backbone for MAPU. We compare against two baselines: (1) fine-tuning the entire backbone and (2) tuning only the Batch-Norm (BN) parameters. For our decomposed framework, we assess:

  1. Fine-tuning only the factor matrices: we update the factor matrices while keeping the core tensor fixed, adjusting directional transformations in the weight space.

  2. Fine-tuning only the core tensor: we freeze the factor matrices and tune only the core tensor, the smallest low-rank subspace, which captures the critical multi-dimensional interactions. This is efficient, especially at high $RF$ values, since the core tensor remains compact, improving performance with fewer parameters and enhancing sample efficiency.

  3. Fine-tuning both the core tensor and factor matrices: both are updated, offering more flexibility but introducing more parameters and risking overfitting.

Primarily tuning the core tensor is advantageous due to its compact representation, yielding better generalization, while freezing the factor matrices preserves mode-specific transformations and optimizes only the key interactions. We observe that core-tensor fine-tuning strikes an optimal balance between adaptation and parameter efficiency, delivering strong performance with a minimal computational footprint. The corresponding ablation analyses for SHOT, NRC, and AAD are provided in the Appendix.

Table 2: Ablation studies on fine-tuning different parameter subspaces during target-side adaptation on MAPU. The asterisk (*) on top of the dataset names denotes the Many-to-one setting (cf. Appendix A.7 for reference).
SSC SSC* HHAR HHAR*
Method $RF$ Parameters Fine-tuned F1 Score (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow F1 Score (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow F1 Score (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow F1 Score (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow
No Adaptation - None 54.96 ± 0.87 12.92 0 63.94 ± 0.84 12.92 0 66.67 ± 1.03 9.04 0 82.88 ± 1.30 9.04 0
MAPU - Entire Backbone 66.73 ± 0.85 12.92 83.17 71.74 ± 1.30 12.92 83.17 80.32 ± 1.16 9.04 198.21 95.16 ± 4.29 9.04 198.21
MAPU - BN 61.07 ± 1.58 12.92 0.45 69.17 ± 2.70 12.92 0.45 71.40 ± 1.63 9.04 0.64 86.15 ± 3.71 9.04 0.64
Factor Matrices ($\mathbf{V}^{(1)}$, $\mathbf{V}^{(2)}$) 66.05 ± 0.75 0.80 3.33 68.73 ± 1.22 0.80 3.33 80.42 ± 1.51 0.53 7.18 98.00 ± 0.04 0.53 7.18
Core Tensor ($\boldsymbol{\mathcal{T}}$) 66.09 ± 0.19 0.80 1.38 68.85 ± 1.30 0.80 1.38 80.16 ± 2.38 0.53 3.19 98.12 ± 0.04 0.53 3.19
8 Factor Matrices ($\mathbf{V}^{(1)}$, $\mathbf{V}^{(2)}$) + Core Tensor ($\boldsymbol{\mathcal{T}}$) 66.17 ± 0.53 0.80 4.71 68.58 ± 0.96 0.80 4.71 80.29 ± 1.04 0.53 10.37 98.06 ± 0.05 0.53 10.37
Factor Matrices ($\mathbf{V}^{(1)}$, $\mathbf{V}^{(2)}$) 67.6 ± 0.07 1.99 6.66 72.08 ± 1.48 1.99 6.66 79.88 ± 1.20 1.34 14.35 98.19 ± 0.10 1.34 14.35
Core Tensor ($\boldsymbol{\mathcal{T}}$) 67.4 ± 0.59 1.99 5.32 71.94 ± 1.10 1.99 5.32 80.24 ± 1.22 1.34 12.53 98.25 ± 0.03 1.34 12.53
4 Factor Matrices ($\mathbf{V}^{(1)}$, $\mathbf{V}^{(2)}$) + Core Tensor ($\boldsymbol{\mathcal{T}}$) 67.82 ± 0.21 1.99 11.98 71.52 ± 0.94 1.99 11.98 79.63 ± 1.08 1.34 26.88 98.20 ± 0.08 1.34 26.88
Factor Matrices ($\mathbf{V}^{(1)}$, $\mathbf{V}^{(2)}$) 67.13 ± 0.69 5.54 13.31 72.84 ± 1.15 5.54 13.31 80.51 ± 2.41 3.79 28.68 95.06 ± 5.17 3.79 28.68
Core Tensor ($\boldsymbol{\mathcal{T}}$) 67.85 ± 0.62 5.54 20.88 72.73 ± 0.83 5.54 20.88 80.69 ± 2.45 3.79 49.63 94.96 ± 5.18 3.79 49.63
MAPU + SFT 2 Factor Matrices ($\mathbf{V}^{(1)}$, $\mathbf{V}^{(2)}$) + Core Tensor ($\boldsymbol{\mathcal{T}}$) 67.37 ± 0.36 5.54 34.19 72.93 ± 0.67 5.54 34.19 80.69 ± 1.26 3.79 78.31 95.02 ± 5.11 3.79 78.31

5 Conclusion

In this work, we presented a framework for improving the parameter and sample efficiency of SFDA methods in time-series data through a low-rank decomposition of source-pretrained models. By leveraging Tucker-style tensor factorization during the source-model preparation phase, we were able to reparameterize the backbone of the model into a compact subspace. This enabled selective fine-tuning (SFT) of the core tensor on the target side, achieving robust adaptation with significantly fewer parameters. Our empirical results demonstrated that the proposed SFT strategy consistently outperformed baseline SFDA methods across various datasets, especially in resource-constrained and low-data scenarios. Theoretical analysis grounded in PAC-Bayesian generalization bounds provided insights into the regularization effect of SFT, highlighting its ability to mitigate overfitting by constraining the distance between source and target model parameters. Our ablation studies further reinforced the effectiveness of selective fine-tuning, showing that this approach strikes a balance between adaptation flexibility and parameter efficiency. In Appendix A.9, we discuss the limitations of the presented work. Overall, our contributions positively complement SFDA methods, offering a framework that can be seamlessly integrated with existing SFDA techniques to improve adaptation efficiency. This lays the groundwork for future research into more resource-efficient and personalized domain adaptation techniques.

References

  • Anguita et al. (2013) Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge Luis Reyes-Ortiz, et al. A public domain dataset for human activity recognition using smartphones. In Esann, 2013.
  • Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In NeurIPS, 2017.
  • Bartlett et al. (2019) Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. JMLR, 2019.
  • Cai et al. (2021) Ruichu Cai, Jiawei Chen, Zijian Li, Wei Chen, Keli Zhang, Junjian Ye, Zhuozhang Li, Xiaoyan Yang, and Zhenjie Zhang. Time series domain adaptation via sparse associative structure alignment. In AAAI, 2021.
  • Carroll & Chang (1970) J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika, 1970.
  • Cheng et al. (2024) Mingyue Cheng, Jiqian Yang, Tingyue Pan, Qi Liu, and Zhi Li. Convtimenet: A deep hierarchical fully convolutional model for multivariate time series analysis. arXiv preprint arXiv:2403.01493, 2024.
  • Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando De Freitas. Predicting parameters in deep learning. In NeurIPS, 2013.
  • Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NeurIPS, 2014.
  • Donghao & Xue (2024) Luo Donghao and Wang Xue. ModernTCN: A modern pure convolution structure for general time series analysis. In ICLR, 2024.
  • Dziugaite & Roy (2017) Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI, 2017.
  • Eckart & Young (1936) C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1936.
  • Edalati et al. (2022) Ali Edalati, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J Clark, and Mehdi Rezagholizadeh. Krona: Parameter efficient tuning with kronecker adapter. arXiv preprint arXiv:2212.10650, 2022.
  • Fan et al. (2023) Wei Fan, Pengyang Wang, Dongkun Wang, Dongjie Wang, Yuanchun Zhou, and Yanjie Fu. Dish-ts: a general paradigm for alleviating distribution shift in time series forecasting. In AAAI, 2023.
  • Frankle & Carbin (2018) Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2018.
  • Goldberger et al. (2000) Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation, 2000.
  • Gong et al. (2024) Peiliang Gong, Mohamed Ragab, Emadeldeen Eldele, Wenyu Zhang, Min Wu, Chuan-Sheng Foo, Daoqiang Zhang, Xiaoli Li, and Zhenghua Chen. Evidentially calibrated source-free time-series domain adaptation with temporal imputation. arXiv preprint arXiv:2406.02635, 2024.
  • Gouk et al. (2021) Henry Gouk, Timothy M Hospedales, and Massimiliano Pontil. Distance-based regularisation of deep networks for fine-tuning. In ICLR, 2021.
  • Greydanus & Kobak (2024) Sam Greydanus and Dmitry Kobak. Scaling down deep learning with mnist-1d. In ICML, 2024.
  • Han et al. (2024) Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.
  • Hassibi & Stork (1992) Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. In NeurIPS, 1992.
  • He et al. (2023) Huan He, Owen Queen, Teddy Koker, Consuelo Cuevas, Theodoros Tsiligkaridis, and Marinka Zitnik. Domain adaptation for time series under feature and label shifts. In ICML, 2023.
  • Hoefler et al. (2021) Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. JMLR, 2021.
  • Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021.
  • Jaderberg et al. (2014) M Jaderberg, A Vedaldi, and A Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
  • Jiang* et al. (2020) Yiding Jiang*, Behnam Neyshabur*, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In ICLR, 2020.
  • Jin et al. (2022) Xiaoyong Jin, Youngsuk Park, Danielle Maddix, Hao Wang, and Yuyang Wang. Domain adaptation for time series forecasting via attention sharing. In ICML, 2022.
  • Karim et al. (2023) Nazmul Karim, Niluthpol Chowdhury Mithun, Abhinav Rajvanshi, Han-pang Chiu, Supun Samarasekera, and Nazanin Rahnavard. C-sfda: A curriculum learning aided self-training framework for efficient source free domain adaptation. In CVPR, 2023.
  • Kim et al. (2016) Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
  • Kim et al. (2021) Youngeun Kim, Donghyeon Cho, Kyeongtak Han, Priyadarshini Panda, and Sungeun Hong. Domain adaptation without source data. IEEE TAI, 2021.
  • Kingma (2014) Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kolda & Bader (2009) Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 2009.
  • Kundu et al. (2021) Jogendra Nath Kundu, Akshay Kulkarni, Amit Singh, Varun Jampani, and R Venkatesh Babu. Generalize then adapt: Source-free domain adaptive semantic segmentation. In ICCV, 2021.
  • Kundu et al. (2022a) Jogendra Nath Kundu, Suvaansh Bhambri, Akshay Kulkarni, Hiran Sarkar, Varun Jampani, and R Venkatesh Babu. Concurrent subsidiary supervision for unsupervised source-free domain adaptation. In ECCV, 2022a.
  • Kundu et al. (2022b) Jogendra Nath Kundu, Akshay R Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Anand Kulkarni, Varun Jampani, and Venkatesh Babu Radhakrishnan. Balancing discriminability and transferability for source-free domain adaptation. In ICML, 2022b.
  • Kwapisz et al. (2011) Jennifer R. Kwapisz, Gary M. Weiss, and Samuel A. Moore. Activity recognition using cell phone accelerometers. SIGKDD Explor. Newsl., 2011.
  • Lathauwer et al. (2000) Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 2000.
  • Lebedev et al. (2015) V Lebedev, Y Ganin, M Rakhuba, I Oseledets, and V Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In ICLR, 2015.
  • LeCun et al. (1989) Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In NeurIPS, 1989.
  • Lee et al. (2023) Suho Lee, Seungwon Seo, Jihyo Kim, Yejin Lee, and Sangheum Hwang. Few-shot fine-tuning is all you need for source-free domain adaptation. arXiv preprint arXiv:2304.00792, 2023.
  • Lessmeier et al. (2016) Christian Lessmeier, James Kuria Kimotho, Detmar Zimmer, and Walter Sextro. Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set for data-driven classification. In PHM Society European Conference, 2016.
  • Li & Zhang (2021) Dongyue Li and Hongyang Zhang. Improved regularization and robustness for fine-tuning in neural networks. In NeurIPS, 2021.
  • Li et al. (2017) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
  • Li et al. (2024) Jingjing Li, Zhiqi Yu, Zhekai Du, Lei Zhu, and Heng Tao Shen. A comprehensive survey on source-free domain adaptation. IEEE TPAMI, 2024.
  • Li et al. (2020) Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In CVPR, 2020.
  • Liang et al. (2020) Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In ICML, 2020.
  • Liang et al. (2021) Jian Liang, Dapeng Hu, Yunbo Wang, Ran He, and Jiashi Feng. Source data-absent unsupervised domain adaptation through hypothesis transfer and labeling transfer. IEEE TPAMI, 2021.
  • Liu & Xue (2021) Qiao Liu and Hui Xue. Adversarial spectral kernel matching for unsupervised time series domain adaptation. In IJCAI, 2021.
  • Ma et al. (2019) Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Ming Zhou, and Dawei Song. A tensorized transformer for language modeling. In NeurIPS, 2019.
  • McAllester (2003) David McAllester. Simplified pac-bayesian margin bounds. In COLT, 2003.
  • McAllester (1998) David A. McAllester. Some pac-bayesian theorems. In COLT, 1998.
  • McAllester (1999) David A. McAllester. Pac-bayesian model averaging. In COLT, 1999.
  • Molchanov et al. (2017) Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
  • Müller et al. (2019) Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In NeurIPS, 2019.
  • Nakajima et al. (2013) Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, and Ryota Tomioka. Global analytic solution of fully-observed variational bayesian matrix factorization. JMLR, 2013.
  • Neyshabur et al. (2018) Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR, 2018.
  • Novikov et al. (2015) Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In NeurIPS, 2015.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Ozyurt et al. (2023) Yilmazcan Ozyurt, Stefan Feuerriegel, and Ce Zhang. Contrastive learning for unsupervised domain adaptation of time series. In ICLR, 2023.
  • Park et al. (2023) JaeYeon Park, Kichang Lee, Sungmin Lee, Mi Zhang, and JeongGil Ko. Attfl: A personalized federated learning framework for time-series mobile and embedded sensor data processing. In Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 2023.
  • Ragab et al. (2022) Mohamed Ragab, Emadeldeen Eldele, Zhenghua Chen, Min Wu, Chee-Keong Kwoh, and Xiaoli Li. Self-supervised autoregressive domain adaptation for time series data. IEEE TNNLS, 2022.
  • Ragab et al. (2023a) Mohamed Ragab, Emadeldeen Eldele, Wee Ling Tan, Chuan-Sheng Foo, Zhenghua Chen, Min Wu, Chee-Keong Kwoh, and Xiaoli Li. Adatime: A benchmarking suite for domain adaptation on time series data. ACM TKDD, 2023a.
  • Ragab et al. (2023b) Mohamed Ragab, Emadeldeen Eldele, Min Wu, Chuan-Sheng Foo, Xiaoli Li, and Zhenghua Chen. Source-free domain adaptation with temporal imputation for time series data. In KDD, 2023b.
  • Sharma et al. (2024) Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. In ICLR, 2024.
  • Stisen et al. (2015) Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In ACM Conference on Embedded Networked Sensor Systems, 2015.
  • Tai et al. (2016) Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, and E Weinan. Convolutional neural networks with low-rank regularization. In ICLR, 2016.
  • Tucker (1966) L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 1966.
  • Wang et al. (2023) Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, and Radu Soricut. Improving robust generalization by direct pac-bayesian bound minimization. In CVPR, 2023.
  • Wilson et al. (2020) Garrett Wilson, Janardhan Rao Doppa, and Diane J Cook. Multi-source deep domain adaptation with weak supervision for time-series sensor data. In KDD, 2020.
  • Wilson et al. (2023) Garrett Wilson, Janardhan Rao Doppa, and Diane J Cook. Calda: Improving multi-source time series domain adaptation with contrastive adversarial learning. IEEE TPAMI, 2023.
  • Xia et al. (2021) Haifeng Xia, Handong Zhao, and Zhengming Ding. Adaptive adversarial network for source-free domain adaptation. In ICCV, 2021.
  • Xin et al. (2024) Yi Xin, Siqi Luo, Haodi Zhou, Junlong Du, Xiaohong Liu, Yue Fan, Qing Li, and Yuntao Du. Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv preprint arXiv:2402.02242, 2024.
  • Yang et al. (2021a) Shiqi Yang, Joost Van de Weijer, Luis Herranz, Shangling Jui, et al. Exploiting the intrinsic neighborhood structure for source-free domain adaptation. In NeurIPS, 2021a.
  • Yang et al. (2021b) Shiqi Yang, Yaxing Wang, Joost Van De Weijer, Luis Herranz, and Shangling Jui. Generalized source-free domain adaptation. In ICCV, 2021b.
  • Yang et al. (2022) Shiqi Yang, Shangling Jui, Joost van de Weijer, et al. Attracting and dispersing: A simple approach for source-free domain adaptation. In NeurIPS, 2022.
  • Yang et al. (2023) Shiqi Yang, Yaxing Wang, Joost Van de Weijer, Luis Herranz, Shangling Jui, and Jian Yang. Trust your good friends: Source-free domain adaptation by reciprocal neighborhood clustering. IEEE TPAMI, 2023.
  • Ye et al. (2018) Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In CVPR, 2018.
  • Yeh et al. (2024) Shih-Ying Yeh, Yu-Guan Hsieh, Zhidong Gao, Bernard BW Yang, Giyeong Oh, and Yanmin Gong. Navigating text-to-image customization: From lycoris fine-tuning to model evaluation. In ICLR, 2024.
  • Zhou et al. (2019) Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: A pac-bayesian compression approach. In ICLR, 2019.
  • Zimmer et al. (2023) Max Zimmer, Megi Andoni, Christoph Spiegel, and Sebastian Pokutta. Perp: Rethinking the prune-retrain paradigm in the era of llms. arXiv preprint arXiv:2312.15230, 2023.

Appendix A Appendix

A.1 Higher Order Orthogonal Iteration

Input: Tensor $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}$, truncation ranks $(R_{1},R_{2},\ldots,R_{N})$, initial guess $\{\mathbf{U}^{(n)}_{0}:n=1,2,\ldots,N\}$
Output: Core tensor $\boldsymbol{\mathcal{G}}$, factor matrices $\{\mathbf{U}^{(n)}_{k}:n=1,2,\ldots,N\}$
$k\leftarrow 0$;
while not converged do
       for all $n\in\{1,2,\ldots,N\}$ do
              $\boldsymbol{\mathcal{B}}\leftarrow\boldsymbol{\mathcal{A}}\times_{1}(\mathbf{U}^{(1)}_{k+1})^{\top}\times_{2}\cdots\times_{n-1}(\mathbf{U}^{(n-1)}_{k+1})^{\top}\times_{n+1}(\mathbf{U}^{(n+1)}_{k})^{\top}\cdots\times_{N}(\mathbf{U}^{(N)}_{k})^{\top}$;
              $\mathbf{B}_{(n)}\leftarrow$ mode-$n$ matricization of $\boldsymbol{\mathcal{B}}$;
              $\mathbf{U},\mathbf{\Sigma},\mathbf{V}^{\top}\leftarrow$ truncated rank-$R_{n}$ SVD of $\mathbf{B}_{(n)}$;
              $\mathbf{U}^{(n)}_{k+1}\leftarrow\mathbf{U}$;
       $k\leftarrow k+1$;
$\boldsymbol{\mathcal{G}}\leftarrow\mathbf{\Sigma}\mathbf{V}^{\top}$ in tensor format;
Algorithm 1 The higher-order orthogonal iteration (HOOI) algorithm (Lathauwer et al., 2000; Kolda & Bader, 2009).
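To make Algorithm 1 concrete, below is a minimal NumPy sketch of the 2-mode variant used throughout this paper, in which only the two channel modes of a 1D-convolution weight tensor are truncated and the kernel mode is left intact. The function and variable names are ours, not those of the released implementation.

```python
# A minimal NumPy sketch of 2-mode HOOI for a conv weight tensor W of shape
# (C_out, C_in, K), as used in Section A.2.1. The kernel mode is left undecomposed.
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: move `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def tucker2_hooi(W, r_out, r_in, n_iter=20):
    """Return (core, U_out, U_in) with W ~= core x_1 U_out x_2 U_in."""
    # Initialize factors with leading left singular vectors (HOSVD initialization).
    U_out = np.linalg.svd(unfold(W, 0), full_matrices=False)[0][:, :r_out]
    U_in = np.linalg.svd(unfold(W, 1), full_matrices=False)[0][:, :r_in]
    for _ in range(n_iter):
        # Update U_out: project W onto the current input-channel subspace.
        B = np.einsum('oik,ir->ork', W, U_in)
        U_out = np.linalg.svd(unfold(B, 0), full_matrices=False)[0][:, :r_out]
        # Update U_in: project W onto the current output-channel subspace.
        B = np.einsum('oik,or->rik', W, U_out)
        U_in = np.linalg.svd(unfold(B, 1), full_matrices=False)[0][:, :r_in]
    # Core tensor: W contracted with both (orthonormal) factor matrices.
    core = np.einsum('oik,or,iq->rqk', W, U_out, U_in)
    return core, U_out, U_in

W = np.random.randn(64, 32, 5)
core, U_out, U_in = tucker2_hooi(W, r_out=16, r_in=8)
W_hat = np.einsum('rqk,or,iq->oik', core, U_out, U_in)
print('relative error:', np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Because each sweep solves a best rank-$R_{n}$ subspace problem with the other factor fixed, the reconstruction error is non-increasing across iterations.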

A.2 Parameter and Computational Efficiency with Tucker Decomposition

A.2.1 Tucker Factorization for CNN Weights

Consider a 3D weight tensor $\boldsymbol{\mathcal{W}}\in\mathbb{R}^{C_{\text{out}}\times C_{\text{in}}\times K}$ for a 1D convolutional layer, where $C_{\text{out}}$ and $C_{\text{in}}$ denote the output and input channels, and $K$ is the kernel size. Using Tucker factorization, $\boldsymbol{\mathcal{W}}$ can be decomposed into a core tensor $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{R_{1}\times R_{2}\times K}$ and factor matrices $\mathbf{U}^{(1)}\in\mathbb{R}^{C_{\text{out}}\times R_{1}}$ and $\mathbf{U}^{(2)}\in\mathbb{R}^{C_{\text{in}}\times R_{2}}$, such that:

𝓦=𝓣×1𝐔(1)×2𝐔(2)\boldsymbol{\mathcal{W}}=\boldsymbol{\mathcal{T}}\times_{1}\mathbf{U}^{(1)}\times_{2}\mathbf{U}^{(2)} (11)
Parameter Efficiency.

The number of parameters before factorization is:

Paramsoriginal=Cout×Cin×K\text{Params}_{\text{original}}=C_{\text{out}}\times C_{\text{in}}\times K (12)

After 2-mode Tucker factorization, with output-channel rank $R_{\text{out}}=R_{1}$ and input-channel rank $R_{\text{in}}=R_{2}$, the number of parameters becomes:

ParamsTucker=Rout×Rin×K+Cout×Rout+Cin×Rin\text{Params}_{\text{Tucker}}=R_{\text{out}}\times R_{\text{in}}\times K+C_{\text{out}}\times R_{\text{out}}+C_{\text{in}}\times R_{\text{in}} (13)

Given that RinCinR_{\text{in}}\ll C_{\text{in}}, and RoutCoutR_{\text{out}}\ll C_{\text{out}}, the reduction in the number of parameters is significant, leading to a parameter-efficient model.

Computational Efficiency.

The convolution operation requires 𝒪(Cout×Cin×K×L)\mathcal{O}(C_{\text{out}}\times C_{\text{in}}\times K\times L^{\prime}) operations, where LL^{\prime} is the length of the output feature map. After Tucker factorization, the convolutional operations become a sequence of smaller convolutions involving the factor matrices and the core tensor, reducing the computational complexity to:

𝒪(Cin×Rin×L)+𝒪(Rout×Rin×K×L)+𝒪(Cout×Rout×L)\mathcal{O}(C_{\text{in}}\times R_{\text{in}}\times L)+\mathcal{O}(R_{\text{out}}\times R_{\text{in}}\times K\times L^{\prime})+\mathcal{O}(C_{\text{out}}\times R_{\text{out}}\times L^{\prime}) (14)

where $L$ denotes the length of the input sequence. The savings depend on the ranks $R_{\text{in}}$ and $R_{\text{out}}$, leading to lower computational costs compared to the original convolution.
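To illustrate how the factorized convolution in Equation (11) is realized at inference time, the following PyTorch sketch (ours, with illustrative layer sizes) composes a pointwise projection, the small core convolution, and a second pointwise projection, mirroring the reparameterized architecture in Figure 8. In the SFT setting of Table 4, only the core convolution would be left trainable during target adaptation, while the two pointwise projections stay frozen.

```python
# A minimal PyTorch sketch (ours, not the paper's code) of the factorized 1D
# convolution implied by Eq. (11): a 1x1 conv (input projection), the core conv,
# and a final 1x1 conv (output projection).
import torch
import torch.nn as nn

class TuckerConv1d(nn.Module):
    def __init__(self, c_in, c_out, k, r_in, r_out, padding=0):
        super().__init__()
        self.proj_in = nn.Conv1d(c_in, r_in, kernel_size=1, bias=False)    # acts as (U^(2))^T
        self.core = nn.Conv1d(r_in, r_out, kernel_size=k, bias=False,      # core tensor T
                              padding=padding)
        self.proj_out = nn.Conv1d(r_out, c_out, kernel_size=1, bias=True)  # acts as U^(1)

    def forward(self, x):
        return self.proj_out(self.core(self.proj_in(x)))

full = nn.Conv1d(128, 128, kernel_size=5, padding=2)
tucker = TuckerConv1d(128, 128, k=5, r_in=16, r_out=16, padding=2)
n_full = sum(p.numel() for p in full.parameters())
n_tucker = sum(p.numel() for p in tucker.parameters())
print(n_full, n_tucker)  # counts follow Eqs. (12) and (13) up to biases
```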

A.2.2 Tucker Factorization for Fully Connected Weights

Consider a 2D weight matrix 𝐖M×N\mathbf{W}\in\mathbb{R}^{M\times N} in a fully connected (FC) layer, where MM is the input dimension and NN is the output dimension.

Using Tucker factorization, 𝐖\mathbf{W} can be decomposed into a core matrix 𝐆R1×R2\mathbf{G}\in\mathbb{R}^{R_{1}\times R_{2}} and factor matrices 𝐔(1)M×R1\mathbf{U}^{(1)}\in\mathbb{R}^{M\times R_{1}} and 𝐔(2)N×R2\mathbf{U}^{(2)}\in\mathbb{R}^{N\times R_{2}}, such that:

𝐖=𝐔(1)𝐆(𝐔(2))\mathbf{W}=\mathbf{U}^{(1)}\mathbf{G}(\mathbf{U}^{(2)})^{\top}
Parameter Efficiency.

The number of parameters before factorization is:

Paramsoriginal=M×N\text{Params}_{\text{original}}=M\times N

After Tucker factorization, the number of parameters becomes:

ParamsTucker=R1×R2+M×R1+N×R2\text{Params}_{\text{Tucker}}=R_{1}\times R_{2}+M\times R_{1}+N\times R_{2}

Again, with R1MR_{1}\ll M and R2NR_{2}\ll N, this leads to a substantial reduction in the number of parameters.

Computational Efficiency.

The original matrix multiplication requires 𝒪(M×N)\mathcal{O}(M\times N) operations.

After Tucker factorization, the computation is broken down into smaller matrix multiplications:

𝒪(M×R1)+𝒪(R1×R2)+𝒪(R2×N)\mathcal{O}(M\times R_{1})+\mathcal{O}(R_{1}\times R_{2})+\mathcal{O}(R_{2}\times N)

This decomposition reduces the computational cost, particularly when R1R_{1} and R2R_{2} are much smaller than MM and NN.
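A corresponding sketch for the fully connected case (ours, with illustrative dimensions) stores the three factors explicitly and applies them as a chain of small matrix products; the printed parameter count equals $R_{1}R_{2}+MR_{1}+NR_{2}$.

```python
# A minimal sketch (ours) of the factorized FC layer y = x W with W = U^(1) G (U^(2))^T,
# realized as three small matrix products instead of one M x N product.
import torch
import torch.nn as nn

class TuckerLinear(nn.Module):
    def __init__(self, m, n, r1, r2):
        super().__init__()
        self.U1 = nn.Parameter(torch.randn(m, r1) * 0.02)  # M x R1 factor
        self.G = nn.Parameter(torch.randn(r1, r2) * 0.02)  # R1 x R2 core
        self.U2 = nn.Parameter(torch.randn(n, r2) * 0.02)  # N x R2 factor

    def forward(self, x):                        # x: (batch, M)
        return x @ self.U1 @ self.G @ self.U2.T  # (batch, N), i.e., x W

layer = TuckerLinear(m=512, n=256, r1=16, r2=16)
print(sum(p.numel() for p in layer.parameters()))  # R1*R2 + M*R1 + N*R2
```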

In sum, Tucker factorization significantly reduces both the number of parameters and the computational cost for 1D CNN and FC weights, making it an effective technique for achieving parameter efficiency and computational efficiency in deep neural networks.

A.3 Lemmas

Lemma 1.

Let 𝐖0M×N\mathbf{W}_{0}\in\mathbb{R}^{M\times N} be the initial weight matrix, and 𝐖tM×N\mathbf{W}_{t}\in\mathbb{R}^{M\times N} denote the weight matrix after tt iterations of gradient descent with a fixed learning rate η>0\eta>0. Suppose the gradient of the loss function (𝐖)\mathcal{L}(\mathbf{W}) is bounded, such that for all iterations k=0,1,,t1k=0,1,\dots,t-1, we have (𝐖k)G\|\nabla\mathcal{L}(\mathbf{W}_{k})\|_{\infty}\leq G where G>0G>0 is a constant.

Then, the Frobenius norm of the difference between the initial weight matrix $\mathbf{W}_{0}$ and the weight matrix $\mathbf{W}_{t}$ after $t$ iterations is bounded by:

𝐖t𝐖0FηtMNG.\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}\leq\eta t\sqrt{MN}G.
Proof.

The gradient descent algorithm updates the weight matrix according to the rule:

𝐖k+1=𝐖kη(𝐖k).\mathbf{W}_{k+1}=\mathbf{W}_{k}-\eta\nabla\mathcal{L}(\mathbf{W}_{k}).

Starting from the initial weights 𝐖0\mathbf{W}_{0}, the weights after tt iterations are:

𝐖t=𝐖0ηk=0t1(𝐖k).\mathbf{W}_{t}=\mathbf{W}_{0}-\eta\sum_{k=0}^{t-1}\nabla\mathcal{L}(\mathbf{W}_{k}).

Taking the Frobenius norm of the difference between $\mathbf{W}_{t}$ and $\mathbf{W}_{0}$, we have:

𝐖t𝐖0F=ηk=0t1(𝐖k)F.\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}=\eta\left\|\sum_{k=0}^{t-1}\nabla\mathcal{L}(\mathbf{W}_{k})\right\|_{F}.

Using the triangle inequality for norms:

𝐖t𝐖0Fηk=0t1(𝐖k)F.\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}\leq\eta\sum_{k=0}^{t-1}\|\nabla\mathcal{L}(\mathbf{W}_{k})\|_{F}.

Since every entry of $\nabla\mathcal{L}(\mathbf{W}_{k})$ is bounded in magnitude by $G$, the Frobenius norm satisfies $\|\nabla\mathcal{L}(\mathbf{W}_{k})\|_{F}\leq\sqrt{MN}\,\|\nabla\mathcal{L}(\mathbf{W}_{k})\|_{\infty}\leq\sqrt{MN}\,G$, and it follows that:

k=0t1(𝐖k)FtMNG.\sum_{k=0}^{t-1}\|\nabla\mathcal{L}(\mathbf{W}_{k})\|_{F}\leq t\sqrt{MN}G.

Substituting this into the earlier expression, we obtain the bound:

𝐖t𝐖0FηtMNG.\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}\leq\eta t\sqrt{MN}G.
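The bound of Lemma 1 can be checked numerically. The toy least-squares example below (ours) tracks the entrywise gradient bound $G$ over the run and compares both sides of the inequality; the bound holds but is typically loose.

```python
# A small numeric sanity check (ours) of the Lemma 1 bound
# ||W_t - W_0||_F <= eta * t * sqrt(M*N) * G for entrywise-bounded gradients.
import numpy as np

rng = np.random.default_rng(0)
M, N, eta, T = 20, 30, 1e-2, 50
W0 = rng.standard_normal((M, N))
X = rng.standard_normal((N, 8))             # toy regression inputs
Y = rng.standard_normal((M, 8))             # toy regression targets

W, G = W0.copy(), 0.0
for _ in range(T):
    grad = (W @ X - Y) @ X.T / X.shape[1]   # gradient of 0.5 * mean ||W X - Y||^2
    G = max(G, np.abs(grad).max())          # running bound on the sup-norm
    W -= eta * grad

lhs = np.linalg.norm(W - W0)
rhs = eta * T * np.sqrt(M * N) * G
print(lhs <= rhs, lhs, rhs)                 # the bound holds, with a sizeable gap
```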

Lemma 2.

Consider the weight matrices 𝐖0\mathbf{W}_{0} and 𝐖t\mathbf{W}_{t} expressed as:

$\mathbf{W}_{0}=\mathbf{U}_{0}^{(1)}\mathbf{G}_{0}({\mathbf{U}_{0}^{(2)}})^{\top}\quad\text{and}\quad\mathbf{W}_{t}=\mathbf{U}_{0}^{(1)}\mathbf{G}_{t}({\mathbf{U}_{0}^{(2)}})^{\top},$

where $\mathbf{U}_{0}^{(1)}\in\mathbb{R}^{M\times R_{1}}$ and $\mathbf{U}_{0}^{(2)}\in\mathbb{R}^{N\times R_{2}}$ are fixed matrices, and $\mathbf{G}_{0},\mathbf{G}_{t}\in\mathbb{R}^{R_{1}\times R_{2}}$ represent the evolving core matrices. Assume that the entries of $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ are bounded in magnitude by constants $C_{u}>0$ and $C_{v}>0$, respectively.

Let the Frobenius norm of the difference between 𝐆t\mathbf{G}_{t} and 𝐆0\mathbf{G}_{0} after tt iterations of gradient descent be bounded as:

𝐆t𝐆0FηtR1R2Gk,\|\mathbf{G}_{t}-\mathbf{G}_{0}\|_{F}\leq\eta t\sqrt{R_{1}R_{2}}G_{k},

where η>0\eta>0 is the learning rate, and Gk>0G_{k}>0 is a constant.

Then, the Frobenius norm of the difference between 𝐖t\mathbf{W}_{t} and 𝐖0\mathbf{W}_{0} is bounded by:

𝐖t𝐖0FR1R2tηMNCuGkCv.\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}\leq R_{1}R_{2}t\eta\sqrt{MN}\cdot C_{u}G_{k}C_{v}.
Proof.

We are given $\mathbf{W}_{0}=\mathbf{U}_{0}^{(1)}\mathbf{G}_{0}({\mathbf{U}_{0}^{(2)}})^{\top}$ and $\mathbf{W}_{t}=\mathbf{U}_{0}^{(1)}\mathbf{G}_{t}({\mathbf{U}_{0}^{(2)}})^{\top}$, with $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ fixed in both $\mathbf{W}_{0}$ and $\mathbf{W}_{t}$, where $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ are matrices of dimensions $M\times R_{1}$ and $N\times R_{2}$, respectively, and $\mathbf{G}_{0}$ and $\mathbf{G}_{t}$ are matrices of dimensions $R_{1}\times R_{2}$.

The Frobenius norm of the difference between 𝐖t\mathbf{W}_{t} and 𝐖0\mathbf{W}_{0} is:

$\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}=\|\mathbf{U}_{0}^{(1)}\mathbf{G}_{t}({\mathbf{U}_{0}^{(2)}})^{\top}-\mathbf{U}_{0}^{(1)}\mathbf{G}_{0}({\mathbf{U}_{0}^{(2)}})^{\top}\|_{F}$

Since 𝐔0(1)\mathbf{U}_{0}^{(1)} and 𝐔0(2)\mathbf{U}_{0}^{(2)} are common in both 𝐖0\mathbf{W}_{0} and 𝐖t\mathbf{W}_{t}, we can factor them out:

𝐖t𝐖0F=𝐔0(1)(𝐆t𝐆0)(𝐔0(2))F\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}=\|\mathbf{U}_{0}^{(1)}(\mathbf{G}_{t}-\mathbf{G}_{0})({\mathbf{U}_{0}^{(2)}})^{\top}\|_{F}

Using the sub-multiplicative property of the Frobenius norm:

𝐔0(1)(𝐆t𝐆0)(𝐔0(2))F𝐔0(1)F𝐆t𝐆0F𝐔0(2)F\|\mathbf{U}_{0}^{(1)}(\mathbf{G}_{t}-\mathbf{G}_{0})({\mathbf{U}_{0}^{(2)}})^{\top}\|_{F}\leq\|\mathbf{U}_{0}^{(1)}\|_{F}\|\mathbf{G}_{t}-\mathbf{G}_{0}\|_{F}\|{\mathbf{U}_{0}^{(2)}}\|_{F}

Here $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ have dimensions $M\times R_{1}$ and $N\times R_{2}$, and $(\mathbf{G}_{t}-\mathbf{G}_{0})$ is $R_{1}\times R_{2}$. Since the entries of $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ are bounded in magnitude by the constants $C_{u}$ and $C_{v}$:

𝐔0(1)F\displaystyle\|\mathbf{U}_{0}^{(1)}\|_{F} MR1Cu\displaystyle\leq\sqrt{M\cdot R_{1}}\cdot C_{u}
𝐆t𝐆0F\displaystyle\|\mathbf{G}_{t}-\mathbf{G}_{0}\|_{F} ηtR1R2Gk(from Lemma 1)\displaystyle\leq\eta t\cdot\sqrt{R_{1}\cdot R_{2}}\cdot G_{k}\ \text{(from Lemma \ref{lem:wdb})}
𝐔0(2)F\displaystyle\|\mathbf{U}_{0}^{(2)}\|_{F} NR2Cv\displaystyle\leq\sqrt{N\cdot R_{2}}\cdot C_{v}

Thus:

𝐖t𝐖0FMR1CuηtR1R2GkNR2Cv\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}\leq\sqrt{M\cdot R_{1}}\cdot C_{u}\cdot\eta t\cdot\sqrt{R_{1}\cdot R_{2}}\cdot G_{k}\cdot\sqrt{N\cdot R_{2}}\cdot C_{v}

This simplifies to:

𝐖t𝐖0FR1R2tηMNCuGkCv\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{F}\leq R_{1}R_{2}t\eta\sqrt{MN}\cdot C_{u}G_{k}C_{v}
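Analogously, the sketch below (ours) checks the Lemma 2 bound in the core-only update regime, with $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ frozen and only the core matrix updated by gradient descent on a toy objective.

```python
# A numeric check (ours) that core-only updates obey the Lemma 2 bound
# ||W_t - W_0||_F <= R1*R2*t*eta*sqrt(M*N)*Cu*Gk*Cv when U1 and U2 stay frozen.
import numpy as np

rng = np.random.default_rng(1)
M, N, R1, R2, eta, T = 40, 60, 4, 5, 1e-2, 30
U1, U2 = rng.uniform(-1, 1, (M, R1)), rng.uniform(-1, 1, (N, R2))
Cu, Cv = np.abs(U1).max(), np.abs(U2).max()
G0 = rng.standard_normal((R1, R2))
target = rng.standard_normal((M, N))

G, Gk = G0.copy(), 0.0
for _ in range(T):
    grad = U1.T @ (U1 @ G @ U2.T - target) @ U2   # gradient of 0.5*||W - target||_F^2 w.r.t. G
    Gk = max(Gk, np.abs(grad).max())              # running bound on the sup-norm
    G -= eta * grad

lhs = np.linalg.norm(U1 @ G @ U2.T - U1 @ G0 @ U2.T)
rhs = R1 * R2 * T * eta * np.sqrt(M * N) * Cu * Gk * Cv
print(lhs <= rhs, lhs, rhs)
```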

A.4 Proof of Theorem 1

Background.

PAC-Bayesian generalization theory offers an appealing way to incorporate data-dependent aspects, such as noise robustness and sharpness, into generalization bounds. Recent studies, such as (Bartlett et al., 2017; Neyshabur et al., 2018), have extended these bounds to deep neural networks to address the question of why such models generalize effectively despite possessing more trainable parameters than training samples. Traditionally, the VC dimension of neural networks has been approximated by their number of parameters (Bartlett et al., 2019). While these refined bounds mark a step forward over classical learning theory, questions remain as to whether they are sufficiently tight or non-vacuous. To address this, Dziugaite & Roy (2017) proposed a computational framework that optimizes the PAC-Bayes bound, resulting in a tighter bound and lower test error, and Zhou et al. (2019) validated this framework in a large-scale study. More recently, Jiang* et al. (2020) compared different complexity measures and found that PAC-Bayes-based tools align better with empirical results. Furthermore, Li & Zhang (2021) and Wang et al. (2023) used this bound to motivate their proposed improved regularization and generalization, respectively. The classical PAC-Bayesian framework (McAllester, 1998; 1999) provides generalization guarantees for randomized predictors (McAllester, 2003; Li & Zhang, 2021). In particular, let $f_{\boldsymbol{W}}$ be any predictor (not necessarily a neural network) learned from the training data and parametrized by $\boldsymbol{W}$. We consider the distribution $Q$ over parameters of predictors of the form $f_{\boldsymbol{W}}$, where $\boldsymbol{W}$ is a random variable whose distribution may depend on the training data $S$. Given a prior distribution $P$ over the set of predictors that is independent of the training data $S$, the PAC-Bayes theorem states that, with probability at least $1-\delta$ over the draw of the training data, the expected error of $f_{\boldsymbol{W}}$ can be bounded as follows

𝔼𝑾Q(S)[(𝑾)]𝔼𝑾Q(S)[^(𝑾,S)]+CKL(Q(S)P)+klnnδ+ln,\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\mathcal{L}(\boldsymbol{W})\right]\leq\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\hat{\mathcal{L}}(\boldsymbol{W},S)\right]+C\sqrt{\frac{\text{KL}(Q(S)\parallel P)+k\ln\frac{n}{\delta}+l}{n}}, (15)

for some $C,k,l>0$. Based on the above bound, we specialize it to our setting, where the prior distribution over the parameters is centered at the source-pretrained weights and the posterior distribution at the target-adapted weights, and obtain Theorem 1.

Theorem 1.

(PAC-Bayes generalization bound for fine-tuning) Let $\boldsymbol{W}$ be some hypothesis class (network parameters). Let $P$ be a prior (source) distribution on $\boldsymbol{W}$ that is independent of the target training set. Let $Q(S)$ be a posterior (target) distribution on $\boldsymbol{W}$ that depends on the target training set $S$ consisting of $n$ samples. Suppose the loss function $\mathcal{L}(\cdot)$ is bounded by $C$. Set the prior distribution $P=\mathcal{N}(\boldsymbol{W}_{\text{src}},\sigma^{2}I)$, where $\boldsymbol{W}_{\text{src}}$ are the weights of the pre-trained network, and let the posterior distribution $Q(S)=\mathcal{N}(\boldsymbol{W}_{\text{trg}},\sigma^{2}I)$ be centered at the fine-tuned model. Then, with probability $1-\delta$ over the randomness of the training set, the following holds:

𝔼𝑾Q(S)[(𝑾)]𝔼𝑾Q(S)[^(𝑾,S)]+Ci=1D𝓦(i)trg𝓦(i)srcF22σ2n+klnnδ+ln.\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\mathcal{L}(\boldsymbol{W})\right]\leq\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\hat{\mathcal{L}}(\boldsymbol{W},S)\right]+C\sqrt{\frac{\sum_{i=1}^{D}\|\boldsymbol{\mathcal{W}}^{(i)}_{\text{trg}}-\boldsymbol{\mathcal{W}}^{(i)}_{\text{src}}\|_{F}^{2}}{2\sigma^{2}n}+\frac{k\ln\frac{n}{\delta}+l}{n}}. (16)

for some $k,l>0$, where $\boldsymbol{\mathcal{W}}^{(i)}_{\text{trg}}\in\boldsymbol{W}_{\text{trg}}$ and $\boldsymbol{\mathcal{W}}^{(i)}_{\text{src}}\in\boldsymbol{W}_{\text{src}}$ for all $1\leq i\leq D$, with $D$ denoting the total number of layers.

It is important to note that the original formulation in (McAllester, 1999) assumes the loss function is restricted to values between 0 and 1. In contrast, the modified version discussed here extends the applicability to loss functions that are bounded between 0 and some positive constant $C$. This adjustment is made by rescaling the loss function by a factor of $\frac{1}{C}$, which introduces the constant $C$ on the right-hand side of Equation (15).

Proof.

We expand the KL divergence using the densities of the multivariate normal distributions.

KL(Q(S)P)\displaystyle\text{KL}(Q(S)\parallel P) =𝔼𝑾Q(S)[log(Pr(𝑾Q(S))Pr(𝑾P))]\displaystyle=\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\log\left(\frac{\text{Pr}(\boldsymbol{W}\sim Q(S))}{\text{Pr}(\boldsymbol{W}\sim P)}\right)\right]

Substituting the densities of the Gaussian distributions, this can be written as:

KL(Q(S)P)=𝔼𝑾Q(S)[logexp(12σ2𝑾𝑾trg2)exp(12σ2𝑾𝑾src2)].\text{KL}(Q(S)\parallel P)=\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\log\frac{\exp\left(-\frac{1}{2\sigma^{2}}\|\boldsymbol{W}-\boldsymbol{W}_{\text{trg}}\|^{2}\right)}{\exp\left(-\frac{1}{2\sigma^{2}}\|\boldsymbol{W}-\boldsymbol{W}_{\text{src}}\|^{2}\right)}\right]. (17)

This simplifies to:

KL(Q(S)P)=12σ2𝔼𝑾Q(S)[𝑾𝑾src2𝑾𝑾trg2].\text{KL}(Q(S)\parallel P)=\frac{1}{2\sigma^{2}}\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\|\boldsymbol{W}-\boldsymbol{W}_{\text{src}}\|^{2}-\|\boldsymbol{W}-\boldsymbol{W}_{\text{trg}}\|^{2}\right]. (18)

Expanding the squared terms:

KL(Q(S)P)=12σ2𝔼𝑾Q(S)[𝑾trg𝑾src2+2𝑾𝑾trg,𝑾trg𝑾src].\text{KL}(Q(S)\parallel P)=\frac{1}{2\sigma^{2}}\mathbb{E}_{\boldsymbol{W}\sim Q(S)}\left[\|\boldsymbol{W}_{\text{trg}}-\boldsymbol{W}_{\text{src}}\|^{2}+2\langle\boldsymbol{W}-\boldsymbol{W}_{\text{trg}},\boldsymbol{W}_{\text{trg}}-\boldsymbol{W}_{\text{src}}\rangle\right]. (19)

Since the expectation 𝔼𝑾Q(S)[𝑾𝑾trg]=0\mathbb{E}_{\boldsymbol{W}\sim Q(S)}[\boldsymbol{W}-\boldsymbol{W}_{\text{trg}}]=0 (because 𝑾\boldsymbol{W} is distributed around 𝑾trg\boldsymbol{W}_{\text{trg}}), the cross-term vanishes:

KL(Q(S)P)=12σ2𝑾trg𝑾srcF2.\implies\text{KL}(Q(S)\parallel P)=\frac{1}{2\sigma^{2}}\|\boldsymbol{W}_{\text{trg}}-\boldsymbol{W}_{\text{src}}\|_{F}^{2}. (20)
KL(Q(S)P)=12σ2𝑾trg𝑾srcF2i=1D𝓦(i)trg𝓦(i)srcF22σ2.\implies\text{KL}(Q(S)\parallel P)=\frac{1}{2\sigma^{2}}\|\boldsymbol{W}_{\text{trg}}-\boldsymbol{W}_{\text{src}}\|_{F}^{2}\leq\frac{\sum_{i=1}^{D}\|\boldsymbol{\mathcal{W}}^{(i)}_{\text{trg}}-\boldsymbol{\mathcal{W}}^{(i)}_{\text{src}}\|_{F}^{2}}{2\sigma^{2}}. (21)

where $\boldsymbol{\mathcal{W}}^{(i)}_{\text{trg}}\in\boldsymbol{W}_{\text{trg}}$ and $\boldsymbol{\mathcal{W}}^{(i)}_{\text{src}}\in\boldsymbol{W}_{\text{src}}$ for all $1\leq i\leq D$, with $D$ denoting the total number of layers. We then obtain Equation (16) by substituting Equation (21) into Equation (15). ∎
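The key step, Equation (20), is the standard KL divergence between two isotropic Gaussians with equal variance. The short Monte Carlo check below (ours) confirms the closed form numerically.

```python
# A quick numeric check (ours) of Eq. (20): for two isotropic Gaussians with
# equal variance, KL(Q || P) = ||W_trg - W_src||^2 / (2*sigma^2).
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_samples = 200, 0.1, 20000
w_src = rng.standard_normal(d)
w_trg = w_src + 0.01 * rng.standard_normal(d)   # small fine-tuning shift

closed_form = np.sum((w_trg - w_src) ** 2) / (2 * sigma ** 2)

# Monte Carlo estimate of E_{W~Q}[log q(W) - log p(W)] with Q = N(w_trg, sigma^2 I);
# the Gaussian normalization constants cancel because the variances are equal.
W = w_trg + sigma * rng.standard_normal((n_samples, d))
log_q = -np.sum((W - w_trg) ** 2, axis=1) / (2 * sigma ** 2)
log_p = -np.sum((W - w_src) ** 2, axis=1) / (2 * sigma ** 2)
print(closed_form, np.mean(log_q - log_p))      # the two values agree closely
```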

A.5 Dataset Description

Table 3: Dataset Summary (Ragab et al., 2023a).
Dataset # Users/Domains # Channels # Classes Sequence Length Training Set Testing Set
UCIHAR 30 9 6 128 2300 990
WISDM 36 3 6 128 1350 720
HHAR 9 3 6 128 12716 5218
SSC 20 1 5 3000 14280 6130
MFD 4 1 3 5120 7312 3604

We utilize the benchmark datasets provided by AdaTime (Ragab et al., 2023a). These datasets exhibit diverse attributes such as varying complexity, sensor types, sample sizes, class distributions, and degrees of domain shift, allowing for a comprehensive evaluation across multiple factors.

Table 3 outlines the specific details of each dataset, including the number of domains, sensor channels, class categories, sample lengths, and the total sample count for both training and testing sets. A detailed description of the selected datasets is provided below:

  • UCIHAR (Anguita et al., 2013): The UCIHAR dataset consists of data collected from three types of sensors—accelerometer, gyroscope, and body sensors—used on 30 different subjects. Each subject participated in six distinct activities: walking, walking upstairs, walking downstairs, standing, sitting, and lying down. Given the variability across subjects, each individual is considered a separate domain. From the numerous possible cross-domain combinations, we selected the ten scenarios set by Ragab et al. (2023a).

  • WISDM (Kwapisz et al., 2011): The WISDM dataset uses accelerometer sensors to gather data from 36 subjects engaged in the same activities as those in the UCIHAR dataset. However, this dataset presents additional challenges due to class imbalance among different subjects. Specifically, some subjects only contribute samples from a limited set of the overall activity classes. As with the UCIHAR dataset, each subject is treated as an individual domain, and ten cross-domain scenarios set by Ragab et al. (2023a) are used for evaluation.

  • HHAR (Stisen et al., 2015): The Heterogeneity Human Activity Recognition (HHAR) dataset was collected from 9 subjects using sensor data from both smartphones and smartwatches. Ragab et al. (2023a) standardized the use of a single device, specifically a Samsung smartphone, across all subjects to minimize variability. Each subject is treated as an independent domain, and a total of 10 cross-domain scenarios are created by randomly selecting subjects.

  • SSC (Goldberger et al., 2000): The sleep stage classification (SSC) task focuses on categorizing electroencephalography (EEG) signals into five distinct stages: Wake (W), Non-Rapid Eye Movement stages (N1, N2, N3), and Rapid Eye Movement (REM). This dataset is derived from the Sleep-EDF dataset, which provides EEG recordings from 20 healthy individuals. Consistent with prior research (Ragab et al., 2023a), we select a single EEG channel (Fpz-Cz) and ten cross-domain scenarios for evaluation.

  • MFD (Lessmeier et al., 2016): The Machine Fault Diagnosis (MFD) dataset has been collected by Paderborn University to identify various types of incipient faults using vibration signals. The data was collected under four different operating conditions, and in our experiments, each of these conditions was treated as a separate domain. We used twelve cross-condition scenarios to evaluate the domain adaptation performance. Each sample in the dataset consists of a single univariate channel.

A.6 SFDA Methods

Figure 8: A. Baseline architecture (Ragab et al., 2023a; b). B. Architecture reparametrization after Tucker-style factorization of the convolution weights. $C_{\text{in}}$ and $K$ denote the number of input channels and the filter size, respectively; $R^{(1)}_{\text{out}}$, $R^{(i)}_{\text{in}}$, and $R^{(i)}_{\text{out}}$ denote the mode ranks for the $i^{th}$ layer.


  • SHOT (Liang et al., 2020; 2021): SHOT optimizes mutual information by minimizing the conditional entropy $H(Y|X)$ to enforce unambiguous cluster assignments, while maximizing the marginal entropy $H(Y)$ to ensure uniform cluster sizes, thereby preventing degeneracy (a minimal code sketch of this objective is given after this list).

  • NRC (Yang et al., 2021a; 2023): NRC leverages neighborhood clustering, with an objective function comprising two main components: a neighborhood clustering term for prediction consistency and a marginal entropy term H(Y)H(Y) to promote prediction diversity.

  • AAD (Yang et al., 2022): AAD also utilizes neighborhood clustering but incorporates a contrastive objective similar to InfoNCE (Oord et al., 2018), which attracts predictions of nearby features in the feature space while dispersing (repelling) those that are farther apart.

  • MAPU (Ragab et al., 2023b; Gong et al., 2024): Building on the concepts from Liang et al. (2021) and Kundu et al. (2022a), and focusing on time-series context, MAPU introduces masked reconstruction as an auxiliary task to enhance SFDA performance.
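As referenced in the SHOT entry above, the following is a minimal sketch (ours) of the information-maximization objective on a batch of target predictions; tensor shapes and the smoothing constant are illustrative, and the actual implementation follows Liang et al. (2020; 2021).

```python
# A minimal sketch (ours) of SHOT's information-maximization objective:
# minimize the conditional entropy H(Y|X) of individual predictions while
# maximizing the marginal entropy H(Y) of the batch-level class distribution.
import torch
import torch.nn.functional as F

def information_maximization_loss(logits, eps=1e-6):
    probs = F.softmax(logits, dim=1)                              # (batch, classes)
    cond_ent = -(probs * torch.log(probs + eps)).sum(1).mean()    # H(Y|X)
    marginal = probs.mean(0)                                      # batch class marginal
    marg_ent = -(marginal * torch.log(marginal + eps)).sum()      # H(Y)
    return cond_ent - marg_ent                                    # minimized during adaptation

logits = torch.randn(32, 6, requires_grad=True)                   # e.g., 6 activity classes
loss = information_maximization_loss(logits)
loss.backward()
```

Minimizing this quantity drives individual predictions toward one-hot assignments while keeping the batch-level class usage balanced, which is what prevents the degenerate single-cluster solution.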

A.7 Training Details, Experimental Setup, and Extended Analysis

Figure 9: A visual representation of the experimental setup used for evaluating the Source-Free Domain Adaptation (SFDA) frameworks.

For all datasets, we utilize a simple 3-layer 1D-CNN backbone following (Ragab et al., 2023b), which has shown superior performance compared to more complex architectures (Donghao & Xue, 2024; Cheng et al., 2024). Figure 8 illustrates both the baseline (vanilla) architecture and the proposed Tucker-factorized architecture obtained after reparametrizing the weights. The filter size ($K$) of the input layer varies across datasets to account for differences in sequence lengths. Specifically, we set the filter sizes to 25 for SSC, 32 for MFD, and 5 for HHAR, WISDM, and UCIHAR, following (Ragab et al., 2023a).

Figure 10: The same experimental analysis as in Figure 6, with an additional many-to-one SFDA adaptation experiment on SSC (SSC*) and HHAR (HHAR*), marked with an asterisk (*).
Table 4: Performance and efficiency comparison on SSC, MFD, HHAR, WISDM, and UCIHAR datasets across SFDA methods, reported as average F1 score (%) at target sample ratios (0.5%, 5%, 100%), inference MACs (M), and fine-tuned parameters (K). Highlighted rows show results for SFT, where only the core tensor is fine-tuned at different RFRF values. Green numbers represent average percentage improvement, while Red numbers indicate reduction in MACs and fine-tuned parameters.
F1 Score (%) \uparrow
Methods RFRF 0.5% 5% 100% Average MACs (M) \downarrow # Params. (K) \downarrow
SSC
SHOT (Liang et al., 2020) - 62.32 ± 1.57 63.95 ± 1.51 67.95 ± 1.04 64.74 12.92 83.17
8 62.53 ± 0.46 66.55 ± 0.46 67.50 ± 1.33 65.53 (1.22%) 0.80 (93.81%) 1.38 (98.34%)
4 62.71 ± 0.57 67.16 ± 1.06 68.56 ± 0.44 66.14 (2.16%) 1.99 (84.60%) 5.32 (93.60%)
SHOT + SFT 2 63.05 ± 0.32 65.44 ± 0.83 67.48 ±0.89 65.32 (0.90%) 5.54 (57.12%) 20.88 (74.89%)
NRC (Yang et al., 2021a) - 59.92 ± 1.19 63.56 ± 1.35 65.23 ± 0.59 62.90 12.92 83.17
8 60.65 ± 1.37 63.60 ± 1.43 65.05 ± 1.66 63.10 (0.32%) 0.80 (93.81%) 1.38 (98.34%)
4 61.95 ± 0.62 65.11 ± 1.34 67.19 ± 0.14 64.75 (2.94%) 1.99 (84.60%) 5.32 (93.60%)
NRC + SFT 2 60.60 ± 0.58 65.06 ± 0.24 66.83 ± 0.51 64.16 (2.00%) 5.54 (57.12%) 20.88 (74.89%)
AAD (Yang et al., 2022) - 59.39 ± 1.80 63.21 ± 1.53 63.71 ± 2.06 62.10 12.92 83.17
8 60.62 ± 1.40 65.38 ± 0.93 65.80 ± 1.17 63.93 (2.95%) 0.80 (93.81%) 1.38 (98.34%)
4 61.92 ± 0.68 66.59 ± 0.95 67.14 ± 0.57 65.22 (5.02%) 1.99 (84.60%) 5.32 (93.60%)
AAD + SFT 2 60.59 ± 0.59 63.96 ± 2.04 64.39 ± 1.45 62.98 (1.42%) 5.54 (57.12%) 20.88 (74.89%)
MAPU (Ragab et al., 2023b) - 60.35 ± 2.15 62.48 ± 1.57 66.73 ± 0.85 63.19 12.92 83.17
8 63.52 ± 1.24 64.77 ± 0.22 66.09 ± 0.19 64.79 (2.53%) 0.80 (93.81%) 1.38 (98.34%)
4 62.21 ± 0.88 65.20 ± 1.25 67.40 ± 0.59 64.94 (2.77%) 1.99 (84.60%) 5.32 (93.60%)
MAPU + SFT 2 62.25 ± 1.74 66.19 ± 1.02 67.85 ± 0.62 65.43 (3.54%) 5.54 (57.12%) 20.88 (74.89%)
MFD
SHOT (Liang et al., 2020) - 90.74 ± 1.27 91.53 ± 1.44 91.93 ± 1.37 91.40 58.18 199.3
8 92.14 ± 1.77 93.36 ± 0.53 93.13 ± 0.54 92.88 (1.62%) 3.52 (93.95%) 3.33 (98.33%)
4 92.98 ± 1.04 93.36 ± 0.75 93.21 ± 0.64 93.18 (1.95%) 8.80 (84.87%) 12.80 (93.58%)
SHOT + SFT 2 93.01 ± 0.79 93.40 ± 0.41 93.92 ± 0.33 93.44 (2.23%) 24.66 (57.61%) 50.18 (74.82%)
NRC (Yang et al., 2021a) - 89.03 ± 3.84 91.74 ± 1.31 92.20 ± 1.33 90.99 58.18 199.3
8 92.14 ± 1.77 93.36 ± 0.51 93.14 ± 0.7 92.88 (2.08%) 3.52 (93.95%) 3.33 (98.33%)
4 92.91 ± 1.05 93.25 ± 0.69 92.97 ± 0.45 93.04 (2.25%) 8.80 (84.87%) 12.80 (93.58%)
NRC + SFT 2 92.98 ± 0.78 93.22 ± 0.58 93.86 ± 0.36 93.35 (2.59%) 24.66 (57.61%) 50.18 (74.82%)
AAD (Yang et al., 2022) - 89.03 ± 3.82 91.06 ± 2.21 90.65 ± 2.48 90.25 58.18 199.3
8 92.15 ± 1.78 93.36 ± 0.58 93.28 ± 0.55 92.93 (2.97%) 3.52 (93.95%) 3.33 (98.33%)
4 92.88 ± 1.12 93.05 ± 0.61 93.67 ± 0.41 93.20 (3.27%) 8.80 (84.87%) 12.80 (93.58%)
AAD + SFT 2 92.97 ± 0.79 93.45 ± 0.53 94.41 ± 0.19 93.61 (3.72%) 24.66 (57.61%) 50.18 (74.82%)
MAPU (Ragab et al., 2023b) - 85.32 ± 1.25 86.49 ± 1.59 91.71 ± 0.47 87.84 58.18 199.3
8 91.07 ± 1.76 91.58 ± 1.14 92.11 ± 1.56 91.59 (4.27%) 3.52 (93.95%) 3.33 (98.33%)
4 91.01 ± 0.97 92.10 ± 1.02 91.18 ± 1.64 91.43 (4.09%) 8.80 (84.87%) 12.80 (93.58%)
MAPU + SFT 2 91.87 ± 2.85 91.94 ± 1.44 91.84 ± 1.85 91.88 (4.60%) 24.66 (57.61%) 50.18 (74.82%)
HHAR
SHOT (Liang et al., 2020) - 72.08 ± 1.20 79.80 ± 1.70 80.12 ± 0.29 77.33 9.04 198.21
8 73.65 ± 2.39 79.07 ± 1.88 81.73 ± 1.47 78.15 (1.06%) 0.53 (94.14%) 3.19 (98.39%)
4 75.47 ± 1.34 79.98 ± 2.01 81.05 ± 0.45 78.83 (1.94%) 1.34 (85.18%) 12.53 (93.68%)
SHOT + SFT 2 75.14 ± 1.85 77.70 ± 1.37 82.92 ± 0.27 78.59 (1.63%) 3.79 (58.08%) 49.63 (74.96%)
NRC (Yang et al., 2021a) - 72.38 ± 0.87 75.52 ± 1.15 77.80 ± 0.16 75.23 9.04 198.21
8 71.41 ± 0.62 76.15 ± 0.99 80.09 ± 0.56 75.88 (0.86%) 0.53 (94.14%) 3.19 (98.39%)
4 72.77 ± 0.43 76.60 ± 1.97 80.27 ± 0.55 76.55 (1.75%) 1.34 (85.18%) 12.53 (93.68%)
NRC + SFT 2 72.34 ± 0.14 75.53 ± 1.31 79.45 ± 1.51 75.77 (0.72%) 3.79 (58.08%) 49.63 (74.96%)
AAD (Yang et al., 2022) - 20.18 ± 2.70 76.55 ± 2.17 82.25 ± 1.14 59.66 9.04 198.21
8 57.93 ± 0.90 78.57 ± 0.74 82.92 ± 1.66 73.14 (22.59%) 0.53 (94.14%) 3.19 (98.39%)
4 54.43 ± 1.04 78.57 ± 0.74 83.15 ± 0.88 72.05 (20.77%) 1.34 (85.18%) 12.53 (93.68%)
AAD + SFT 2 49.46 ± 2.22 79.31 ± 0.72 83.25 ± 0.34 70.67 (18.45%) 3.79 (58.08%) 49.63 (74.96%)
MAPU (Ragab et al., 2023b) - 71.18 ± 2.78 78.67 ± 1.20 80.32 ± 1.16 76.72 9.04 198.21
8 73.73 ± 2.35 78.54 ± 2.37 80.16 ± 2.38 77.48 (0.99%) 0.53 (94.14%) 3.19 (98.39%)
4 75.12 ± 0.88 80.04 ± 0.42 80.24 ± 1.22 78.47 (2.28%) 1.34 (85.18%) 12.53 (93.68%)
MAPU + SFT 2 76.06 ± 1.24 78.95 ± 2.04 80.69 ± 2.45 78.57 (2.41%) 3.79 (58.08%) 49.63 (74.96%)
WISDM
SHOT (Liang et al., 2020) - 57.90 ± 2.07 59.52 ± 2.75 60.49 ± 1.53 59.3 9.04 198.21
8 60.57 ± 0.79 61.35 ± 1.91 62.08 ± 2.00 61.33 (3.42%) 0.53 (94.14%) 3.19 (98.39%)
4 58.06 ± 3.58 60.79 ± 1.86 61.17 ± 0.32 60.01 (1.20%) 1.34 (85.18%) 12.53 (93.68%)
SHOT + SFT 2 60.01 ± 0.72 63.03 ± 0.61 62.04 ± 2.75 61.69 (4.03%) 3.79 (58.08%) 49.63 (74.96%)
NRC (Yang et al., 2021a) - 53.58 ± 1.24 54.49 ± 0.85 58.64 ± 1.39 55.57 9.04 198.21
8 57.08 ± 0.94 57.61 ± 1.94 60.57 ± 1.63 58.42 (5.13%) 0.53 (94.14%) 3.19 (98.39%)
4 55.89 ± 2.98 56.82 ± 1.57 60.57 ± 1.63 57.76 (3.94%) 1.34 (85.18%) 12.53 (93.68%)
NRC + SFT 2 55.68 ± 2.28 55.74 ± 0.95 59.21 ± 1.18 56.88 (2.36%) 3.79 (58.08%) 49.63 (74.96%)
AAD (Yang et al., 2022) - 54.16 ± 0.85 52.79 ± 1.17 56.33 ± 1.40 54.43 9.04 198.21
8 56.65 ± 0.50 57.59 ± 1.67 57.69 ± 2.06 57.31 (5.29%) 0.53 (94.14%) 3.19 (98.39%)
4 55.90 ± 2.67 56.71 ± 1.47 58.57 ± 2.39 57.06 (4.83%) 1.34 (85.18%) 12.53 (93.68%)
AAD + SFT 2 55.83 ± 2.22 54.86 ± 1.25 58.07 ± 2.75 56.25 (3.34%) 3.79 (58.08%) 49.63 (74.96%)
MAPU (Ragab et al., 2023b) - 56.33 ± 4.02 58.42 ± 3.39 60.92 ± 1.70 58.56 9.04 198.21
8 59.12 ± 3.73 61.50 ± 2.23 62.55 ± 2.17 61.06 (4.27%) 0.53 (94.14%) 3.19 (98.39%)
4 62.94 ± 4.08 61.00 ± 1.77 66.01 ± 3.60 63.32 (8.13%) 1.34 (85.18%) 12.53 (93.68%)
MAPU + SFT 2 59.36 ± 5.74 63.25 ± 1.01 62.45 ± 2.66 61.69 (5.34%) 3.79 (58.08%) 49.63 (74.96%)
UCIHAR
SHOT (Liang et al., 2020) - 89.06 ± 1.17 88.89 ± 1.26 91.49 ± 0.60 89.81 9.28 200.13
8 89.57 ± 0.44 89.05 ± 0.69 92.60 ± 1.99 90.41 (0.67%) 0.57 (93.86%) 3.43 (98.29%)
4 90.02 ± 0.86 89.41 ± 0.56 92.86 ± 1.02 90.76 (1.06%) 1.41 (84.81%) 13.01 (93.50%)
SHOT + SFT 2 90.61 ± 0.76 90.28 ± 0.28 91.65 ± 0.73 90.85 (1.16%) 3.92 (57.76%) 50.59 (74.72%)
NRC (Yang et al., 2021a) - 48.13 ± 2.43 47.78 ± 1.83 91.24 ± 1.08 62.38 9.28 200.13
8 86.23 ± 0.60 86.99 ± 1.48 91.86 ± 0.85 88.36 (41.65%) 0.57 (93.86%) 3.43 (98.29%)
4 85.21 ± 1.35 86.34 ± 1.84 94.05 ± 1.01 88.53 (41.92%) 1.41 (84.81%) 13.01 (93.50%)
NRC + SFT 2 84.29 ± 1.21 86.38 ± 2.02 91.51 ± 1.65 87.39 (40.09%) 3.92 (57.76%) 50.59 (74.72%)
AAD (Yang et al., 2022) - 57.27 ± 6.75 56.21 ± 4.25 91.55 ± 0.42 68.34 9.28 200.13
8 86.62 ± 1.20 87.38 ± 0.17 91.56 ± 0.66 88.52 (29.53%) 0.57 (93.86%) 3.43 (98.29%)
4 86.73 ± 1.36 87.42 ± 0.64 92.24 ± 0.40 88.80 (29.94%) 1.41 (84.81%) 13.01 (93.50%)
AAD + SFT 2 85.51 ± 1.69 85.53 ± 2.62 91.60 ± 0.18 87.55 (28.11%) 3.92 (57.76%) 50.59 (74.72%)
MAPU (Ragab et al., 2023b) - 87.65 ± 3.02 86.47 ± 1.46 90.86 ± 1.62 88.33 9.28 200.13
8 89.57 ± 0.17 89.10 ± 0.98 91.93 ± 0.77 90.20 (2.12%) 0.57 (93.86%) 3.43 (98.29%)
4 89.54 ± 2.26 89.39 ± 0.72 92.81 ± 1.73 90.58 (2.55%) 1.41 (84.81%) 13.01 (93.50%)
MAPU + SFT 2 89.34 ± 1.74 90.05 ± 0.74 91.46 ± 0.49 90.28 (2.21%) 3.92 (57.76%) 50.59 (74.72%)
Table 5: F1 scores for each source-target pair, with domain IDs as established by Ragab et al. (2023a), on the SSC dataset across different SFDA methods.
Method RFRF 0 \rightarrow 11 12 \rightarrow 5 13 \rightarrow 17 16 \rightarrow 1 18 \rightarrow 12 3 \rightarrow 19 5 \rightarrow 15 6 \rightarrow 2 7 \rightarrow 18 9 \rightarrow 14 Average (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow
SHOT - 46.83 ± 1.58 69.19 ± 5.28 64.19 ± 0.18 67.99 ± 3.36 59.25 ± 2.53 75.31 ± 0.77 75.08 ± 1.00 71.52 ± 1.47 74.7 ± 0.54 75.39 ± 4.37 67.95 ± 1.04 12.92 83.17
8 46.68 ± 3.81 68.98 ± 2.56 66.83 ± 3.96 66.88 ± 5.08 62.74 ± 0.67 72.93 ± 1.71 71.58 ± 5.43 67.75 ± 0.78 74.59 ± 1.17 76.03 ± 2.35 67.50 ± 1.33 0.80 1.38
4 51.32 ± 3.33 70.48 ± 1.16 64.61 ± 0.67 64.81 ± 6.70 60.05 ± 3.08 74.61 ± 1.14 74.29 ± 0.98 73.41 ± 0.88 75.21 ± 0.47 76.83 ± 0.99 68.56 ± 0.44 1.99 5.32
SHOT + SFT 2 49.06 ± 5.45 68.21 ± 2.77 65.67 ± 1.86 66.53 ± 4.80 58.56 ± 1.41 73.44 ± 0.83 74.81 ± 0.76 72.40 ± 0.42 74.50 ± 0.43 71.62 ± 9.77 67.48 ± 0.89 5.54 20.88
NRC - 42.51 ± 4.43 65.93 ± 0.30 61.98 ± 0.40 62.21 ± 0.71 60.82 ± 2.61 68.09 ± 0.58 71.49 ± 4.75 72.36 ± 1.51 72.81 ± 0.85 74.16 ± 1.66 65.23 ± 0.59 12.92 83.17
8 47.14 ± 1.76 56.97 ± 9.93 61.63 ± 1.25 64.28 ± 0.59 64.29 ± 2.05 69.59 ± 0.95 72.81 ± 3.67 65.83 ± 2.03 74.73 ± 1.74 73.25 ± 0.94 65.05 ± 1.66 0.80 1.38
4 51.01 ± 4.43 65.71 ± 2.33 62.65 ± 0.86 62.22 ± 0.24 63.43 ± 2.19 71.41 ± 1.69 75.13 ± 3.54 71.25 ± 1.55 74.53 ± 0.57 74.59 ± 1.13 67.19 ± 0.14 1.99 5.32
NRC + SFT 2 50.53 ± 4.21 64.57 ± 4.55 63.93 ± 0.17 60.87 ± 0.34 60.55 ± 1.62 68.77 ± 6.08 76.94 ± 1.51 72.88 ± 0.70 73.93 ± 0.16 75.30 ± 1.53 66.83 ± 0.51 5.54 20.88
AAD - 47.83 ± 4.05 60.68 ± 2.05 57.71 ± 2.08 63.64 ± 1.29 55.70 ± 2.50 71.51 ± 0.81 61.46 ± 16.01 73.53 ± 1.08 73.73 ± 3.37 71.35 ± 5.25 63.71 ± 2.06 12.92 83.17
8 47.94 ± 3.36 64.35 ± 2.71 58.85 ± 1.59 64.48 ± 0.65 61.13 ± 1.71 71.31 ± 0.95 73.07 ± 4.57 69.09 ± 1.66 75.37 ± 0.68 72.40 ± 0.81 65.80 ± 1.17 0.80 1.38
4 48.58 ± 5.64 69.56 ± 2.74 60.04 ± 3.42 65.13 ± 5.29 59.44 ± 2.78 72.91 ± 0.85 74.97 ± 3.48 72.17 ± 1.11 75.53 ± 0.28 73.07 ± 1.17 67.14 ± 0.57 1.99 5.32
AAD + SFT 2 52.66 ± 1.20 64.00 ± 9.31 62.04 ± 2.91 62.99 ± 0.77 61.47 ± 1.98 72.09 ± 2.10 48.76 ± 15.24 73.48 ± 1.76 74.53 ± 0.40 71.91 ± 2.60 64.39 ± 1.45 5.54 20.88
MAPU - 47.43 ± 2.31 68.24 ± 2.98 60.04 ± 0.56 61.04 ± 0.86 57.35 ± 0.98 73.18 ± 0.26 76.35 ± 0.49 69.89 ± 1.93 76.44 ± 0.56 77.33 ± 1.97 66.73 ± 0.85 12.92 83.17
8 46.72 ± 1.69 68.86 ± 2.64 63.01 ± 0.96 61.19 ± 0.37 59.20 ± 3.02 70.19 ± 2.33 75.39 ± 3.45 66.24 ± 1.79 75.51 ± 0.37 74.61 ± 2.31 66.09 ± 0.19 0.80 1.38
4 47.06 ± 0.32 70.34 ± 3.06 63.33 ± 1.38 64.90 ± 1.81 55.99 ± 2.05 72.63 ± 1.37 78.31 ± 1.21 68.17 ± 1.71 75.47 ± 1.01 77.83 ± 2.81 67.40 ± 0.59 1.99 5.32
MAPU + SFT 2 50.99 ± 0.53 71.86 ± 0.73 64.21 ± 0.54 64.99 ± 5.66 54.38 ± 1.89 73.65 ± 1.03 77.55 ± 0.84 69.98 ± 1.08 75.27 ± 0.61 75.62 ± 1.23 67.85 ± 0.62 5.54 20.88
Table 6: F1 scores for each source-target pair, with domain IDs as established by Ragab et al. (2023a), on the MFD dataset across different SFDA methods.
Method RFRF 0 \rightarrow 1 0 \rightarrow 2 0 \rightarrow 3 1 \rightarrow 0 1 \rightarrow 2 1 \rightarrow 3 2 \rightarrow 0 2 \rightarrow 1 2 \rightarrow 3 3 \rightarrow 0 3 \rightarrow 1 3 \rightarrow 2 Average (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow
SHOT - 98.26 ± 1.89 86.02 ± 7.72 97.77 ± 2.40 87.56 ± 3.47 87.88 ± 2.60 99.97 ± 0.05 72.39 ± 17.01 98.81 ± 1.25 99.18 ± 0.99 88.72 ± 1.55 100.00 ± 0.00 86.56 ± 3.26 91.93 ± 1.37 58.18 199.3
8 99.19 ± 0.35 87.42 ± 2.85 99.02 ± 0.49 86.57 ± 3.77 88.48 ± 1.70 99.97 ± 0.05 82.21 ± 2.93 98.59 ± 1.10 99.21 ± 0.73 88.57 ± 2.30 100.00 ± 0.00 88.31 ± 3.31 93.13 ± 0.54 3.52 3.33
4 99.59 ± 0.08 86.31 ± 1.81 99.73 ± 0.19 84.91 ± 3.69 90.67 ± 0.52 99.97 ± 0.05 83.63 ± 2.20 97.15 ± 2.21 99.05 ± 0.47 87.60 ± 2.47 100.00 ± 0.00 89.88 ± 1.16 93.21 ± 0.64 8.80 12.80
SHOT + SFT 2 99.87 ± 0.12 87.05 ± 2.89 99.81 ± 0.13 85.72 ± 0.59 91.06 ± 1.00 99.97 ± 0.05 87.49 ± 1.10 99.19 ± 0.49 99.54 ± 0.31 88.88 ± 0.69 100.00 ± 0.00 88.48 ± 0.66 93.92 ± 0.33 24.66 50.18
NRC - 98.10 ± 2.44 85.00 ± 10.08 98.31 ± 1.92 88.53 ± 1.73 88.50 ± 3.31 100.00 ± 0.00 74.52 ± 20.08 98.64 ± 1.19 99.02 ± 1.00 88.69 ± 1.68 100.00 ± 0.00 87.08 ± 4.27 92.20 ± 1.33 58.18 199.3
8 99.27 ± 0.42 87.48 ± 2.94 99.16 ± 0.37 86.58 ± 3.62 88.43 ± 1.62 99.97 ± 0.05 81.96 ± 2.38 98.64 ± 1.06 99.32 ± 0.69 88.44 ± 2.20 100.00 ± 0.00 88.43 ± 3.39 93.14 ± 0.70 3.52 3.33
4 99.56 ± 0.12 86.40 ± 1.96 99.76 ± 0.22 84.96 ± 3.91 90.53 ± 0.74 99.97 ± 0.05 80.91 ± 5.74 97.15 ± 2.12 99.02 ± 0.51 87.62 ± 2.65 100.00 ± 0.00 89.77 ± 1.10 92.97 ± 0.45 8.80 12.80
NRC + SFT 2 99.92 ± 0.08 86.77 ± 3.66 99.87 ± 0.09 85.70 ± 0.60 91.03 ± 1.15 99.97 ± 0.05 87.02 ± 0.89 99.16 ± 0.40 99.51 ± 0.28 88.88 ± 0.69 100.00 ± 0.00 88.48 ± 0.66 93.86 ± 0.36 24.66 50.18
AAD - 98.20 ± 2.32 85.04 ± 10.09 99.40 ± 0.49 81.12 ± 8.98 86.88 ± 2.31 100.00 ± 0.00 62.98 ± 18.46 98.75 ± 1.26 99.16 ± 1.04 88.98 ± 1.78 100.00 ± 0.00 87.33 ± 4.41 90.65 ± 2.48 58.18 199.3
8 99.87 ± 0.09 88.60 ± 3.10 99.87 ± 0.12 84.73 ± 3.75 89.45 ± 1.35 100.00 ± 0.00 80.73 ± 3.37 98.97 ± 0.69 99.62 ± 0.45 87.99 ± 1.58 100.00 ± 0.00 89.55 ± 2.86 93.28 ± 0.55 3.52 3.33
4 99.97 ± 0.05 90.53 ± 1.96 99.95 ± 0.05 84.00 ± 4.74 91.22 ± 1.55 100.00 ± 0.00 81.07 ± 3.35 98.62 ± 0.33 99.49 ± 0.38 88.24 ± 1.67 100.00 ± 0.00 90.92 ± 1.42 93.67 ± 0.41 8.80 12.80
AAD + SFT 2 100.00 ± 0.00 90.92 ± 1.75 100.00 ± 0.00 84.11 ± 2.67 92.06 ± 1.52 100.00 ± 0.00 88.62 ± 0.52 99.48 ± 0.05 99.84 ± 0.08 88.80 ± 0.49 100.00 ± 0.00 89.08 ± 0.58 94.41 ± 0.19 24.66 50.18
MAPU - 94.76 ± 1.08 87.53 ± 1.05 94.09 ± 1.72 86.86 ± 0.39 87.51 ± 1.72 99.08 ± 0.91 78.61 ± 3.75 99.16 ± 0.38 98.70 ± 1.34 88.48 ± 0.99 100.00 ± 0.00 85.80 ± 2.82 91.71 ± 0.47 58.18 199.3
8 94.85 ± 2.99 86.16 ± 3.50 94.84 ± 4.56 84.31 ± 2.02 86.63 ± 2.43 99.81 ± 0.33 86.10 ± 3.25 99.32 ± 0.62 99.24 ± 1.04 88.81 ± 1.28 100.00 ± 0.00 85.22 ± 1.66 92.11 ± 1.56 3.52 3.33
4 89.45 ± 5.03 85.45 ± 3.94 92.15 ± 3.78 86.33 ± 4.70 86.31 ± 0.67 99.67 ± 0.57 84.70 ± 1.40 98.23 ± 2.02 98.42 ± 1.97 88.21 ± 3.24 99.84 ± 0.28 85.38 ± 1.02 91.18 ± 1.64 8.80 12.80
MAPU + SFT 2 92.39 ± 7.56 86.29 ± 4.12 93.62 ± 5.38 85.61 ± 3.81 86.01 ± 1.08 100.00 ± 0.00 86.27 ± 1.36 99.37 ± 0.38 99.56 ± 0.12 88.18 ± 1.87 100.00 ± 0.00 84.83 ± 0.15 91.84 ± 1.85 24.66 50.18
Table 7: F1 scores for each source-target pair, with domain IDs as established by Ragab et al. (2023a), on the HHAR dataset across different SFDA methods.
Method RFRF 0 \rightarrow 2 0 \rightarrow 6 1 \rightarrow 6 2 \rightarrow 7 3 \rightarrow 8 4 \rightarrow 5 5 \rightarrow 0 6 \rightarrow 1 7 \rightarrow 4 8 \rightarrow 3 Average (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow
SHOT - 78.72 ± 1.36 62.94 ± 0.41 92.71 ± 0.72 64.15 ± 0.82 81.65 ± 0.18 97.41 ± 0.19 32.42 ± 0.02 97.58 ± 0.44 96.44 ± 1.74 97.14 ± 0.12 80.12 ± 0.29 9.04 198.21
8 76.05 ± 9.93 66.31 ± 6.43 92.97 ± 0.35 64.59 ± 0.49 92.50 ± 9.28 97.36 ± 0.11 34.85 ± 4.88 97.90 ± 0.12 97.80 ± 0.67 97.01 ± 0.21 81.73 ± 1.47 0.53 3.19
4 78.07 ± 6.48 64.06 ± 3.29 93.04 ± 0.09 64.75 ± 0.19 92.59 ± 9.41 97.60 ± 0.19 35.08 ± 4.62 97.88 ± 0.11 97.68 ± 0.24 89.76 ± 12.54 81.05 ± 0.45 1.34 12.53
SHOT + SFT 2 81.15 ± 1.21 62.98 ± 0.23 92.80 ± 1.20 65.11 ± 0.17 98.91 ± 0.16 97.48 ± 0.10 37.59 ± 4.63 98.10 ± 0.32 98.04 ± 0.43 97.00 ± 0.00 82.92 ± 0.27 3.79 49.63
NRC - 73.05 ± 0.68 72.40 ± 1.29 92.96 ± 0.29 61.61 ± 0.31 80.63 ± 0.22 97.04 ± 0.30 33.77 ± 2.08 96.03 ± 0.28 94.71 ± 0.36 75.81 ± 0.31 77.80 ± 0.16 9.04 198.21
8 72.77 ± 0.25 68.05 ± 5.07 93.20 ± 0.27 58.41 ± 9.00 97.03 ± 1.78 97.41 ± 0.19 42.83 ± 0.02 97.70 ± 0.25 97.81 ± 0.12 75.71 ± 0.38 80.09 ± 0.56 0.53 3.19
4 74.37 ± 0.32 70.01 ± 3.77 93.38 ± 0.39 63.27 ± 0.26 96.47 ± 0.27 97.54 ± 0.11 39.73 ± 2.58 97.14 ± 0.09 95.19 ± 0.22 75.57 ± 0.33 80.27 ± 0.55 1.34 12.53
NRC + SFT 2 74.79 ± 0.12 68.67 ± 4.77 93.52 ± 0.28 58.29 ± 9.25 97.83 ± 1.37 97.66 ± 0.22 34.50 ± 5.01 97.30 ± 0.38 96.54 ± 0.98 75.35 ± 0.13 79.45 ± 1.51 3.79 49.63
AAD - 77.82 ± 9.33 64.86 ± 0.83 93.16 ± 0.46 66.14 ± 0.07 97.55 ± 1.71 97.89 ± 0.21 31.83 ± 2.59 97.58 ± 0.94 98.45 ± 0.10 97.20 ± 0.00 82.25 ± 1.14 9.04 198.21
8 78.16 ± 2.25 76.63 ± 16.99 92.84 ± 0.56 64.30 ± 0.17 97.73 ± 1.66 98.12 ± 0.50 27.98 ± 5.05 97.74 ± 0.53 98.65 ± 0.13 97.09 ± 0.12 82.92 ± 1.66 0.53 3.19
4 77.85 ± 16.03 74.10 ± 19.41 92.68 ± 0.59 65.66 ± 0.36 97.27 ± 0.21 98.28 ± 0.48 32.28 ± 0.67 97.79 ± 0.34 98.52 ± 0.22 97.07 ± 0.11 83.15 ± 0.88 1.34 12.53
AAD + SFT 2 82.72 ± 4.32 64.89 ± 4.07 92.81 ± 0.21 65.46 ± 0.70 99.23 ± 0.20 97.87 ± 0.09 35.84 ± 5.02 98.19 ± 0.09 98.51 ± 0.30 97.00 ± 0.00 83.25 ± 0.34 3.79 49.63
MAPU - 75.26 ± 4.86 62.89 ± 0.65 95.49 ± 1.02 65.20 ± 0.51 99.27 ± 0.32 97.31 ± 0.21 31.76 ± 4.63 95.65 ± 3.32 98.10 ± 0.12 82.22 ± 13.01 80.32 ± 1.16 9.04 198.21
8 68.40 ± 5.72 74.03 ± 18.84 93.75 ± 0.31 64.78 ± 0.26 92.23 ± 11.57 97.48 ± 0.13 29.24 ± 4.70 97.65 ± 0.25 97.90 ± 0.69 86.18 ± 18.72 80.16 ± 2.38 0.53 3.19
4 74.85 ± 6.47 62.81 ± 0.53 93.28 ± 0.22 64.60 ± 0.25 93.33 ± 10.44 97.48 ± 0.10 30.50 ± 2.99 97.65 ± 0.50 98.03 ± 0.12 89.89 ± 12.29 80.24 ± 1.22 1.34 12.53
MAPU + SFT 2 75.62 ± 7.66 63.15 ± 0.50 93.43 ± 0.11 64.55 ± 0.38 99.21 ± 0.22 97.54 ± 0.12 30.75 ± 2.78 98.32 ± 0.37 98.24 ± 0.12 86.12 ± 18.65 80.69 ± 2.45 3.79 49.63
Table 8: F1 scores for each source-target pair, with domain IDs as established by Ragab et al. (2023a), on the WISDM dataset across different SFDA methods.
Method RFRF 17 \rightarrow 23 20 \rightarrow 30 23 \rightarrow 32 28 \rightarrow 4 2 \rightarrow 11 33 \rightarrow 12 35 \rightarrow 31 5 \rightarrow 26 6 \rightarrow 19 7 \rightarrow 18 Average (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow
SHOT - 58.58 ± 2.34 70.21 ± 1.29 72.16 ± 9.68 61.68 ± 4.00 78.72 ± 1.75 50.70 ± 2.30 65.89 ± 0.51 31.20 ± 0.96 74.85 ± 4.97 40.89 ± 9.92 60.49 ± 1.53 9.04 198.21
8 54.53 ± 18.59 72.03 ± 0.40 81.69 ± 1.95 53.97 ± 0.00 72.69 ± 3.82 59.70 ± 12.21 64.56 ± 9.23 37.24 ± 8.27 76.58 ± 0.81 47.82 ± 4.29 62.08 ± 2.00 0.53 3.19
4 53.46 ± 6.77 75.43 ± 1.58 76.05 ± 7.35 62.35 ± 4.17 76.05 ± 3.74 50.38 ± 5.95 66.42 ± 1.46 30.25 ± 0.50 77.42 ± 2.27 43.92 ± 5.09 61.17 ± 0.32 1.34 12.53
SHOT + SFT 2 58.18 ± 14.69 70.93 ± 20.90 80.10 ± 1.56 60.46 ± 8.39 75.42 ± 0.35 61.82 ± 11.81 66.26 ± 2.29 32.85 ± 1.51 80.85 ± 8.22 33.49 ± 4.54 62.04 ± 2.75 3.79 49.63
NRC - 47.23 ± 11.21 64.66 ± 1.45 81.43 ± 2.65 60.71 ± 11.67 76.62 ± 4.43 52.22 ± 3.66 58.28 ± 10.37 34.44 ± 2.75 78.61 ± 4.57 32.22 ± 0.28 58.64 ± 1.39 9.04 198.21
8 39.61 ± 1.75 69.40 ± 7.84 81.39 ± 1.58 60.90 ± 13.30 75.01 ± 0.00 60.79 ± 10.03 67.58 ± 2.29 40.11 ± 7.27 77.85 ± 3.01 33.09 ± 1.06 60.57 ± 1.63 0.53 3.19
4 47.73 ± 7.38 75.93 ± 1.37 76.58 ± 8.77 64.32 ± 6.75 76.71 ± 4.83 48.80 ± 4.91 67.99 ± 2.02 30.60 ± 0.80 76.27 ± 0.27 41.71 ± 6.55 60.66 ± 0.92 1.34 12.53
NRC + SFT 2 39.01 ± 1.09 70.63 ± 5.44 81.63 ± 1.34 63.71 ± 10.13 75.32 ± 0.00 49.26 ± 0.70 67.16 ± 1.17 31.44 ± 1.42 76.03 ± 0.13 37.94 ± 4.02 59.21 ± 1.18 3.79 49.63
AAD - 56.63 ± 2.31 65.89 ± 3.00 66.63 ± 12.44 58.24 ± 7.55 77.97 ± 4.18 51.83 ± 3.47 47.96 ± 14.2 29.22 ± 0.66 67.11 ± 5.68 41.79 ± 6.70 56.33 ± 1.40 9.04 198.21
8 47.82 ± 6.26 71.76 ± 3.24 66.46 ± 7.21 65.12 ± 10.69 75.10 ± 5.69 47.25 ± 4.73 67.99 ± 2.02 29.40 ± 0.26 61.80 ± 1.55 44.21 ± 1.22 57.69 ± 2.06 0.53 3.19
4 52.32 ± 7.72 74.97 ± 0.86 66.65 ± 5.75 68.28 ± 5.36 68.57 ± 13.05 47.25 ± 5.33 65.85 ± 1.36 29.31 ± 0.13 68.23 ± 6.30 44.29 ± 2.56 58.57 ± 2.39 1.34 12.53
AAD + SFT 2 52.36 ± 9.50 70.70 ± 3.96 72.38 ± 4.30 67.40 ± 2.31 69.02 ± 13.95 44.21 ± 0.15 69.00 ± 1.63 30.42 ± 1.45 61.58 ± 1.64 43.62 ± 2.42 58.07 ± 2.75 3.79 49.63
MAPU - 65.36 ± 9.33 66.00 ± 0.61 73.10 ± 13.28 66.81 ± 10.92 68.50 ± 6.65 58.73 ± 12.04 68.35 ± 5.11 31.48 ± 0.16 79.44 ± 9.08 31.47 ± 1.14 60.92 ± 1.70 9.04 198.21
8 68.09 ± 10.85 70.04 ± 4.02 79.14 ± 6.07 60.25 ± 11.34 72.88 ± 3.48 59.47 ± 6.28 63.03 ± 7.22 31.35 ± 0.14 73.98 ± 3.69 47.31 ± 13.28 62.55 ± 2.17 0.53 3.19
4 73.47 ± 3.85 69.68 ± 3.93 80.85 ± 0.72 67.60 ± 19.71 73.96 ± 1.82 61.60 ± 9.32 64.94 ± 0.00 33.86 ± 1.86 80.85 ± 8.22 53.31 ± 4.72 66.01 ± 3.60 1.34 12.53
MAPU + SFT 2 60.04 ± 4.66 72.04 ± 4.64 65.85 ± 12.32 61.97 ± 9.95 75.13 ± 0.44 60.42 ± 10.41 64.02 ± 2.62 33.03 ± 2.50 81.90 ± 8.84 50.11 ± 1.06 62.45 ± 2.66 3.79 49.63
Table 9: F1 scores for each source-target pair, with domain IDs as established by Ragab et al. (2023a), on the UCIHAR dataset across different SFDA methods.
Method RFRF 12 \rightarrow 16 18 \rightarrow 27 20 \rightarrow 5 24 \rightarrow 8 28 \rightarrow 27 2 \rightarrow 11 30 \rightarrow 20 6 \rightarrow 23 7 \rightarrow 13 9 \rightarrow 18 Average (%) \uparrow MACs (M) \downarrow # Params. (K) \downarrow
SHOT - 53.11 ± 4.28 100.00 ± 0.00 90.83 ± 4.60 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 84.68 ± 5.09 97.82 ± 1.89 100.00 ± 0.00 88.48 ± 5.22 91.49 ± 0.60 9.28 200.13
8 54.63 ± 17.43 100.00 ± 0.00 95.83 ± 4.53 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 84.91 ± 2.87 97.82 ± 1.89 98.94 ± 1.06 93.92 ± 6.96 92.60 ± 1.99 0.57 3.43
4 64.21 ± 11.40 100.00 ± 0.00 92.48 ± 9.43 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 85.40 ± 1.80 98.91 ± 1.89 100.00 ± 0.00 87.55 ± 11.89 92.86 ± 1.02 1.41 13.01
SHOT + SFT 2 70.97 ± 2.38 100.00 ± 0.00 82.84 ± 1.63 98.89 ± 1.93 100.00 ± 0.00 99.63 ± 0.64 83.58 ± 1.23 97.82 ± 1.89 100.00 ± 0.00 82.74 ± 6.20 91.65 ± 0.73 3.92 50.59
NRC - 56.47 ± 10.19 100.00 ± 0.00 96.10 ± 0.64 96.91 ± 5.35 100.00 ± 0.00 100.00 ± 0.00 83.48 ± 3.33 97.82 ± 1.89 100.00 ± 0.00 81.65 ± 5.26 91.24 ± 1.08 9.28 200.13
8 66.04 ± 6.76 100.00 ± 0.00 87.09 ± 1.05 98.89 ± 1.93 100.00 ± 0.00 100.00 ± 0.00 87.92 ± 1.28 96.73 ± 0.00 96.72 ± 2.98 85.21 ± 0.96 91.86 ± 0.85 0.57 3.43
4 67.98 ± 2.76 100.00 ± 0.00 95.27 ± 0.09 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 88.79 ± 5.64 96.73 ± 0.00 97.78 ± 3.85 93.92 ± 6.96 94.05 ± 1.01 1.41 13.01
NRC + SFT 2 71.22 ± 2.80 100.00 ± 0.00 85.88 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 84.60 ± 5.80 96.73 ± 0.00 97.78 ± 3.85 78.93 ± 12.81 91.51 ± 1.65 3.92 50.59
AAD - 72.85 ± 2.04 100.00 ± 0.00 85.88 ± 0.89 96.66 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 84.83 ± 3.14 96.73 ± 0.00 93.33 ± 0.00 85.23 ± 0.94 91.55 ± 0.42 9.28 200.13
8 74.30 ± 0.08 100.00 ± 0.00 87.69 ± 0.00 98.89 ± 1.93 100.00 ± 0.00 100.00 ± 0.00 81.61 ± 5.49 96.73 ± 0.00 97.78 ± 3.85 78.62 ± 5.26 91.56 ± 0.66 0.57 3.43
4 74.05 ± 0.40 100.00 ± 0.00 86.77 ± 0.94 97.77 ± 1.93 100.00 ± 0.00 100.00 ± 0.00 84.35 ± 5.65 96.73 ± 0.00 100.00 ± 0.00 82.74 ± 6.20 92.24 ± 0.40 1.41 13.01
AAD + SFT 2 73.30 ± 2.56 100.00 ± 0.00 82.78 ± 2.76 97.77 ± 1.93 100.00 ± 0.00 100.00 ± 0.00 79.19 ± 2.13 98.91 ± 1.89 97.78 ± 3.85 86.32 ± 0.00 91.60 ± 0.18 3.92 50.59
MAPU - 60.45 ± 8.59 100.00 ± 0.00 82.94 ± 2.06 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 85.17 ± 0.44 100.00 ± 0.00 100.00 ± 0.00 80.05 ± 13.71 90.86 ± 1.62 9.28 200.13
8 60.50 ± 9.50 100.00 ± 0.00 86.63 ± 8.60 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 88.76 ± 1.46 97.82 ± 1.89 99.29 ± 1.22 86.25 ± 0.13 91.93 ± 0.77 0.57 3.43
4 68.07 ± 4.73 99.75 ± 0.43 88.29 ± 6.68 99.63 ± 0.65 100.00 ± 0.00 99.26 ± 0.64 89.21 ± 4.08 98.91 ± 1.89 97.43 ± 3.59 87.55 ± 11.89 91.93 ± 0.77 1.41 13.01
MAPU + SFT 2 72.29 ± 2.93 100.00 ± 0.00 83.47 ± 3.06 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 87.68 ± 0.42 97.82 ± 1.89 97.78 ± 3.85 75.58 ± 0.00 91.46 ± 0.49 3.92 50.59
Source Pretraining.

Following Liang et al. (2020; 2021), we pre-train a deep neural network source model fs:𝒳s𝒞sf_{s}:\mathcal{X}_{s}\rightarrow\mathcal{C}_{s}, where fs=gshsf_{s}=g_{s}\circ h_{s}, with hsh_{s} as the backbone and gsg_{s} as the classifier. The model is trained by minimizing the standard cross-entropy loss:

src(fs)=𝔼(xs,ys)𝒳s×𝒞gk=1Kqklogδk(fs(xs)),\mathcal{L}_{src}(f_{s})=-\mathbb{E}_{(x_{s},y_{s})\in\mathcal{X}_{s}\times\mathcal{C}_{g}}\sum_{k=1}^{K}q_{k}\log\delta_{k}(f_{s}(x_{s})), (22)

where δk(a)=exp(ak)iexp(ai)\delta_{k}(a)=\frac{\exp(a_{k})}{\sum_{i}\exp(a_{i})} represents the kk-th element of the softmax output, and qq is the one-hot encoding of the label ysy_{s}. To enhance the model’s discriminability, we incorporate label smoothing as described in Müller et al. (2019). The loss function with label smoothing is:

srcls(fs)=𝔼(xs,ys)𝒳s×𝒞gk=1Kqklslogδk(fs(xs)),\mathcal{L}_{src}^{ls}(f_{s})=-\mathbb{E}_{(x_{s},y_{s})\in\mathcal{X}_{s}\times\mathcal{C}_{g}}\sum_{k=1}^{K}q_{k}^{ls}\log\delta_{k}(f_{s}(x_{s})), (23)

where qkls=(1α)qk+α/Kq_{k}^{ls}=(1-\alpha)q_{k}+\alpha/K represents the smoothed label, with the smoothing parameter α\alpha set to 0.1.

Additionally, MAPU (Ragab et al., 2023b) introduces an auxiliary objective optimized alongside the cross-entropy loss, namely the imputation task loss. This auxiliary task is performed by an imputer network jsj_{s}, which takes the output features of the masked input from the backbone and maps them to the output features of the non-masked input. The imputer network minimizes the following loss:

mapu(js)=𝔼(xs,ys)𝒳s×𝒞ghs(xs)js(hs(x^s))22,wherex^s=MASK(xs).\mathcal{L}_{mapu}(j_{s})=\mathbb{E}_{(x_{s},y_{s})\in\mathcal{X}_{s}\times\mathcal{C}_{g}}\|h_{s}(x_{s})-j_{s}(h_{s}(\hat{x}_{s}))\|_{2}^{2},\quad\text{where}\ \hat{x}_{s}=\texttt{MASK}(x_{s}). (24)

The backbone and classifier weights are optimized using the Adam optimizer (Kingma, 2014) with a learning rate of 1e-3. We follow Ragab et al. (2023b) for setting all other hyperparameters.
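For concreteness, the sketch below (ours) combines the label-smoothed cross-entropy of Equation (23) with the masked-imputation loss of Equation (24). The backbone, imputer, masking pattern, and tensor shapes are illustrative placeholders rather than the exact configuration used in the paper.

```python
# A compact sketch (ours) of the source-pretraining objectives in Eqs. (23)-(24):
# label-smoothed cross-entropy for the classifier and the masked-imputation loss
# used by MAPU. The imputer architecture and mask ratio below are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def label_smoothed_ce(logits, targets, num_classes, alpha=0.1):
    q = F.one_hot(targets, num_classes).float()
    q_ls = (1 - alpha) * q + alpha / num_classes               # Eq. (23) smoothed labels
    return -(q_ls * F.log_softmax(logits, dim=1)).sum(1).mean()

def mapu_imputation_loss(backbone, imputer, x, mask_ratio=0.5):
    mask = (torch.rand_like(x) > mask_ratio).float()           # a simple stand-in for MASK(x_s)
    feats = backbone(x)                                        # features of the clean input
    feats_masked = backbone(x * mask)                          # features of the masked input
    return F.mse_loss(imputer(feats_masked), feats)            # Eq. (24)

backbone = nn.Sequential(nn.Conv1d(9, 16, 5, padding=2), nn.ReLU(),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten())
imputer = nn.Linear(16, 16)
x = torch.randn(8, 9, 128)                                     # e.g., a UCIHAR-sized batch
y = torch.randint(0, 6, (8,))
logits = nn.Linear(16, 6)(backbone(x))
loss = label_smoothed_ce(logits, y, num_classes=6) + mapu_imputation_loss(backbone, imputer, x)
loss.backward()
```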

Target Adaptation.

For target adaptation, we apply the objective functions and training strategies specific to each SFDA method (detailed in Appendix A.6). The backbone weights are optimized to adapt to the target distribution, with Adam (Kingma, 2014) used as the optimizer to learn the target-adapted weights. We experiment with a range of learning rates, {5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6, 5e-7, 1e-7}, for each method (including the baseline) and report the best performance achieved.

One-to-One Analysis.

In a typical SFDA evaluation setting, a source model is pre-trained using labeled data from one domain (the source domain) and subsequently adapted to another domain (the target domain), where it is evaluated using unlabeled target data. This evaluation strategy, known as One-to-one evaluation, involves a single source domain for pretraining and a single target domain for adaptation. In this paper, we use multiple source-target pairs derived from the datasets introduced in (Ragab et al., 2023a; b). Each source-target pair is evaluated separately, and the average performance across all pairs is reported to assess the effectiveness of the employed SFDA methods.

Many-to-one Analysis.

To provide a more comprehensive and realistic evaluation, we also conduct an experiment under the many-to-one setting. In this setting, data from all source domains in the source-target pairs are merged to form a single, larger source domain for pretraining. The pre-trained model is then adapted to target domains that were not part of the combined source domain. This many-to-one approach allows for a more robust evaluation. Figure 9 visually describes the overall evaluation setup. In Figure 10, we extend our analysis from Figure 6 to include results for the many-to-one setup of the SSC (Goldberger et al., 2000) and HHAR (Stisen et al., 2015) datasets; results for the many-to-one setting are marked with an asterisk (*).

Parameter efficiency across SFDA Methods and Datasets (Extended Analysis).

In Table 4, we extend the results from Table 1 by including outcomes for the WISDM and UCIHAR datasets. Consistent improvements are observed across various sample ratios, along with a significant reduction in both inference overhead and the number of parameters fine-tuned during adaptation. Additionally, we provide the F1 scores for each source-target pair across all the discussed datasets in Tables 5-9, for a direct comparison with the numbers reported by Ragab et al. (2023b).

A.8 Combining SFT with LoRA-style PEFT methods.

To evaluate the hybrid integration of SFT with LoRA-style PEFT methods, we conducted an in-depth analysis, applying a range of SFDA approaches—namely SHOT, NRC, AAD, and MAPU—across the SSC, HHAR, and MFD datasets in the One-to-one and Many-to-one settings (cf. Appendix A.7), as depicted in Figures 12-LABEL:fig:ssc_m_sft_lora. This evaluation explores the efficacy of combining SFT with LoRA-like frameworks under different domain adaptation scenarios, offering insights into the trade-offs between performance and efficiency across a diverse set of tasks. Additionally, Table LABEL:tab:sft_lora_macs presents a detailed comparison of parameter efficiency and inference overhead for all hybrid combinations tested.

Our results indicate that the parameter efficiency of SFT can be further improved by incorporating LoRA-style PEFTs during the fine-tuning stage. However, we observed a slight degradation in predictive performance compared to the vanilla SFT. This decline can likely be attributed to the compounded effect of low-rank approximations: the initial rank reduction from source model decomposition, followed by an additional low-rank fine-tuning for target-adaptation, excessively constrains the model’s learning capacity. Nevertheless, combining SFT with LoRA-style adaptation still outperforms using LoRA alone in terms of predictive performance, while also achieving superior parameter efficiency, demonstrating the advantage of source model decomposition.
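To make the hybrid concrete, the sketch below stacks a trainable LoRA branch on top of a frozen low-rank factorized linear layer. This is a schematic only: the actual SFT parameterization is a Tucker-style decomposition applied to the backbone's layers rather than the simple two-factor linear form shown here, and all names and ranks are illustrative.

```python
import torch
import torch.nn as nn

class FactorizedLinearWithLoRA(nn.Module):
    """Illustrative hybrid: a frozen low-rank factorized weight (U @ V) from
    source-model decomposition, plus a trainable LoRA branch (B @ A) learned
    during target adaptation."""

    def __init__(self, d_in, d_out, factor_rank=16, lora_rank=4):
        super().__init__()
        # Frozen factors standing in for the decomposed source weights.
        self.U = nn.Parameter(torch.randn(d_out, factor_rank) * 0.02,
                              requires_grad=False)
        self.V = nn.Parameter(torch.randn(factor_rank, d_in) * 0.02,
                              requires_grad=False)
        # Trainable LoRA branch; B starts at zero so the adapter is initially inert.
        self.A = nn.Parameter(torch.randn(lora_rank, d_in) * 0.02)
        self.B = nn.Parameter(torch.zeros(d_out, lora_rank))

    def forward(self, x):
        W = self.U @ self.V + self.B @ self.A   # effective (d_out, d_in) weight
        return x @ W.t()
```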

A.9 Limitations

The proposed method has a limitation in that it is not rank-adaptive: it does not adjust the rank based on the intrinsic low-rank structure of the data or model layers. Each dataset and mode (e.g., training vs. inference) has its own optimal low rank, and this ideal rank can also vary across different layers of the model (Sharma et al., 2024). When the rank is not chosen appropriately, the model risks either under-fitting, if the rank is too low to capture the necessary complexity, or over-fitting, if the rank is too high and introduces unnecessary complexity that leads to poor generalization. Since the current method uses a fixed rank for the entire source model, it may yield suboptimal performance, with inefficiencies in both parameter usage and generalization. Incorporating rank-adaptiveness would allow the model to adjust its rank dynamically based on the properties of the data and of individual layers, reducing the risk of under-fitting and over-fitting and improving overall performance.

Moreover, another limitation of this work is the lack of analysis of the interpretability of the factor matrices obtained during decomposition. These factors are presumed to capture representations that generalize across domains (i.e., source vs. target); however, we do not explicitly analyze or interpret what these factors represent, which leaves open the question of whether, or how, they effectively capture cross-dataset generalization. This aspect of understanding the model's behavior is left for future work and could provide valuable insights into how the method operates across various domains.

Dataset \downarrow Method \rightarrow Baseline SFT (RF2) SFT (RF4) SFT (RF8) LoRA-2 LoRA-4 LoRA-8 LoRA-16 LoKrA
SSC/SSC* MACs (M) \downarrow 12.92 5.54 1.99 0.80 12.92 12.92 12.92 12.92 12.92
# Params. (K) \downarrow 83.17 20.88 5.32 1.38 0.81 1.94 5.19 15.63 1.84
HHAR/HHAR* MACs (M) \downarrow 9.04 3.79 1.34 0.53 9.04 9.04 9.04 9.04 9.04
# Params. (K) \downarrow 198.21 49.63 12.53 3.19 1.11 2.4 5.46 13.62 3.33
MFD MACs (M) \downarrow 58.18 24.66 8.80 3.52 58.18 58.18 58.18 58.18 58.18
# Params. (K) \downarrow 199.3 50.18 12.8 3.33 1.22 2.82 7.18 20.50 3.46
Figure 11: Comparison of LoRA (Hu et al., 2021) and LoKrA (Edalati et al., 2022; Yeh et al., 2024) (in Purple) against baseline methods (SHOT, NRC, AAD, MAPU) and our proposed approach (SFT) on the SSC, HHAR, and MFD datasets in the One-to-one setting, and in the Many-to-one setting for SSC (SSC*) and HHAR (HHAR*), evaluated across varying target sample ratios used during adaptation. The table at the bottom shows the target model's inference overhead after adaptation in terms of MACs and the number of parameters fine-tuned at the time of adaptation. Extension of the analysis presented in Figure 7.
Figure 12: Performance comparison of SFT with LoRA-style PEFT methods on the MFD dataset at different sample ratios of target samples used for fine-tuning, with SHOT, NRC, AAD, and MAPU. The results highlight the trade-offs between parameter efficiency and predictive performance across different adaptation methods.
Figure 13: Performance comparison of SFT with LoRA-style PEFT methods on the HHAR dataset at different sample ratios of target samples used for fine-tuning, with SHOT, NRC, AAD, and MAPU. The results highlight the trade-offs between parameter efficiency and predictive performance across different adaptation methods.