
LLM-based Knowledge Pruning for Time Series Data Analytics on Edge-computing Devices

Ruibing Jin, Qing Xu, Min Wu, Yuecong Xu, Dan Li, Xiaoli Li, Zhenghua Chen
Abstract

Limited by the scale and diversity of time series data, neural networks trained on time series data often overfit and show unsatisfactory performances. In comparison, large language models (LLMs) have recently exhibited impressive generalization across diverse fields. Although many LLM-based approaches have been proposed for time series tasks, these methods require loading the whole LLM during both training and inference. Such high computational demands limit practical applications in resource-constrained settings, like edge-computing and IoT devices. To address this issue, we propose Knowledge Pruning (KP), a novel paradigm for time series learning in this paper. For a specific downstream task, we argue that the world knowledge learned by LLMs is largely redundant and only the related knowledge, termed "pertinent knowledge", is useful. Unlike other methods, our KP prunes the redundant knowledge and distills only the pertinent knowledge into the target model. This reduces model size and computational costs significantly. Additionally, different from existing LLM-based approaches, our KP does not require loading the LLM during training or testing, further easing computational burdens. With our proposed KP, a lightweight network can effectively learn the pertinent knowledge, achieving satisfactory performances at a low computation cost. To verify the effectiveness of our KP, two fundamental tasks on edge-computing devices are investigated in our experiments, where eight diverse environments or benchmarks with different networks are used to verify the generalization of our KP. Through experiments, our KP demonstrates effective learning of pertinent knowledge, achieving notable performance improvements in regression (19.7% on average) and classification (up to 13.7%) tasks, showcasing state-of-the-art results.

1 Introduction

With the advancement of deep learning, numerous methods have been proposed for time series learning across different fields such as healthcare Zhao et al. (2019); Chen et al. (2018), transportation Jin et al. (2023a), energy Zhu et al. (2023) and industry Chen et al. (2020). Although these approaches show significant improvements on some benchmarks, it is still challenging to generalize them to complex scenarios Jin et al. (2023b).

The main issue limiting the generalization of existing time series approaches is that different measurements are applied in the process of time series data collection. Unlike computer vision and language, it is difficult to combine time series datasets collected under different measurements into a large-scale dataset. Limited by the scale and diversity of a single time series dataset, the generalization of neural networks trained on time series data is not satisfactory.

Figure 1: The world knowledge learned by LLMs. Different parts of the knowledge of LLMs may contribute differently to diverse tasks. For a specific task, the world knowledge in LLMs is actually redundant and only the related knowledge, termed pertinent knowledge, is useful. Our proposed Knowledge Pruning (KP) aims to prune the redundant knowledge and effectively transfer the pertinent knowledge to the target model, significantly reducing computation cost while retaining satisfactory performances.

Recently, large language models (LLMs) with tens of billions of parameters demonstrate remarkable generalization capabilities in different tasks Touvron et al. (2023); Peng et al. (2023). Pre-trained on massive corpora of self-supervised data, these foundation models implicitly capture knowledge about the world, which enables them to be zero-shot transferable to downstream tasks. To alleviate the issues in time series learning, some methods Xue and Salim (2023); Chang et al. (2023); Zhou et al. (2023); Gruver et al. (2023) are proposed to integrate the knowledge from LLMs into their frameworks. Nevertheless, there are two issues in these LLM-based time series methods.

  • These approaches often require loading the whole LLM during training and inference, which is computationally expensive and time-consuming.

  • These methods are generally based on a pre-trained and fixed LLM, which largely limits their flexibility.

Limited by these issues, it is challenging for existing LLM-based methods to flexibly design models of different scales according to the requirements of tasks, especially in computation-constrained scenarios.

To address this problem, we re-evaluate the impact of the world knowledge acquired by LLMs on downstream tasks. We argue that for a specific downstream task, it is not necessary to transfer the entire knowledge of a pre-trained LLM into a target model. Instead, as illustrated in Fig. 1, we contend that this world knowledge can be divided into two parts for a specific task: related knowledge and redundant knowledge. Only the related knowledge, termed "pertinent knowledge", needs to be transferred to the target model. Motivated by this observation, we propose a novel compression paradigm called Knowledge Pruning (KP) for LLMs, which is able to identify the pertinent knowledge, prune the redundant knowledge and effectively distill the pertinent knowledge into our target model.

Knowledge is implicitly stored in a neural network, and it is generally difficult to directly obtain a specified part of knowledge from a network. However, unlike traditional networks, LLMs can produce related knowledge descriptions via prompts based on a dialogue scheme. Following this scheme, our proposed KP first generates a knowledge prompt set (KPS) for a specific task, where these prompts are forwarded to a pre-trained LLM to produce corresponding embeddings. In our proposed KP, these embeddings are called knowledge anchor points (KAPs). Although the latent space of the pertinent knowledge is continuous, these KAPs can be used to roughly represent this latent space. After that, metric learning is applied to learn this prior knowledge via knowledge distillation and transfer the pertinent knowledge to our target model. Additionally, the regression task requires a network to learn the continuous domain of the task and predict arbitrary values. To fulfill this requirement, an anchor voting scheme (AVS) is proposed, where the confidence distribution among different anchor points is used to predict the expected output.

To verify the effectiveness of our proposed KP, extensive experiments are conducted on two fundamental tasks on edge-computing devices, where different network architectures are investigated on two task categories in time series learning: classification and regression. In classification, we evaluate the performance of our KP on four different benchmarks of human activity recognition, where our approach effectively improves the performances by up to 13.7%. In regression, we investigate the performance of our KP on remaining useful life prediction under four different scenarios. Through experiments, our proposed KP significantly improves the accuracy by 19.7% on average. Our proposed KP achieves state-of-the-art performances on both tasks. Overall, our contributions are summarized as below:

  • We discover that the knowledge in LLMs is largely redundant for a specific downstream task. Instead of the entire knowledge, only the pertinent knowledge needs to be transferred to the target model.

  • A novel compression paradigm, Knowledge Pruning (KP), is proposed to effectively distill the pertinent knowledge into the target model, which achieves satisfactory performances while maintaining a low computation cost.

  • An anchor voting scheme (AVS) is proposed based on the scores of knowledge anchor points to predict arbitrary values for the regression task.

  • Experiments are extensively conducted on two fundamental tasks in time series learning: classification and regression, where different networks are employed across 8 different scenarios or benchmarks. For the regression task, our KP significantly improves the accuracy by 19.7% on average. For the classification task, the performances are largely improved by up to 13.7%. With our KP, state-of-the-art performances are achieved on both tasks.

2 Related Work

Large language models (LLMs) have recently witnessed significant progress and shown impressive performances across a multitude of fields including natural language processing (NLP) Zhao et al. (2023) and computer vision (CV) Awais et al. (2023). To integrate the knowledge representations of LLMs into time series analytics, many approaches have been proposed. PromptCast Xue and Salim (2023) first attempts to utilize LLMs for time series forecasting, where the time series data is converted into prompts. OFA Zhou et al. (2023) proposes to fine-tune a pre-trained LLM for downstream tasks in time series analytics. Time-LLM Jin et al. (2023c) and LLM4TS Chang et al. (2023) aim to repurpose a pre-trained LLM by aligning the time series domain to that of language for time series tasks. TEST Sun et al. (2023) combines text prompts with time series encoding to better align time series data to language. To fully utilize the generalization capability of LLMs, TEMPO Cao et al. (2023) augments the raw time series data by data decomposition and fine-tunes a pre-trained LLM on the augmented data. Although these LLM-based approaches achieve significant performances on time series tasks, they are built on a pre-trained and fixed LLM and require loading the whole LLM during training and inference. These drawbacks limit their flexibility and their applications in scenarios with limited computation resources. To address this issue, we propose Knowledge Pruning (KP), which prunes the redundant knowledge and effectively transfers the pertinent knowledge to a target model without retaining the LLM during training or inference, significantly reducing the computation cost while maintaining satisfactory performances.

3 Main Work

Large language models (LLMs) have high computational demands, and the knowledge stored in them is largely redundant for a specific task. To alleviate these drawbacks and facilitate the application of LLMs in computationally constrained scenarios, we propose a new compression paradigm, Knowledge Pruning (KP), in this section.

Figure 2: The pipeline of our knowledge pruning (KP). Our KP consists of two stages: a pre-processing stage and a training stage. For a specific task, a knowledge prompt set is first produced, where these prompts are forwarded into a pre-trained LLM to obtain the corresponding language embeddings. Then, these embeddings are used as knowledge anchor points to estimate the pertinent knowledge and prune the redundant knowledge of the LLM. After that, in the training stage, these knowledge anchor points are regarded as prior knowledge. The prior knowledge is transferred to the target model via knowledge distillation.

Knowledge is implicitly stored in neural networks, and it is generally difficult to directly obtain a specified part of it from a network. To address this problem, we leverage the dialogue scheme of LLMs to generate a series of language embeddings based on prompts. The pipeline of our KP is shown in Fig. 2, where our KP is composed of two stages: a pre-processing stage and a training stage. Given a specific downstream task, a knowledge prompt set (KPS) is first generated. The prompts in the KPS are then forwarded into a pre-trained LLM to produce corresponding embeddings, which serve as knowledge anchor points (KAPs) and are used to represent the pertinent knowledge. We regard this pertinent knowledge as prior knowledge for the target model. At the training stage, metric learning and knowledge distillation are leveraged to transfer this prior knowledge into the target model. Additionally, the output based on metric learning is generally discrete. To extend the application of our KP to tasks with continuous outputs, an anchor voting scheme (AVS) is proposed, which enables our KP to produce arbitrary values, achieving significant improvements on both classification and regression tasks.

3.1 Knowledge Pruning

Our knowledge pruning consists of three steps: knowledge prompt set generation, knowledge anchor point production and pertinent knowledge distillation.

Knowledge Prompt Set Generation The knowledge prompt set (KPS) contains the prompts which are forwarded into a pre-trained LLM to obtain the knowledge anchor points (KAPs). In this paper, the prompts in the KPS describe the corresponding data. Since we apply our proposed KP to two fundamental tasks, regression and classification, two different prompt templates are proposed. In regression, remaining useful life prediction is used to evaluate the performance of our KP, and the prompt template is “The remaining useful life is {num}.”, where num indicates the corresponding ground truth value and ranges over [$y_{min}$, $y_{max}$]. In classification, we apply our KP to human activity recognition, and the prompt template is “The subject is {action}.”, where action is the name of the corresponding activity.
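
As an illustration, the snippet below sketches how the two prompt sets could be assembled; the RUL value range and the activity names are placeholders rather than the exact settings used in our experiments.

```python
# Minimal sketch of knowledge prompt set (KPS) generation with the two templates above.
y_min, y_max = 0, 125  # hypothetical RUL range [y_min, y_max]
rul_kps = [f"The remaining useful life is {num}." for num in range(y_min, y_max + 1)]

activities = ["walking", "sitting", "standing"]  # hypothetical activity names
har_kps = [f"The subject is {action}." for action in activities]
```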

Knowledge Anchor Point Production After obtaining the KPS, we forward the prompts in the KPS to a pre-trained LLM to obtain the language embeddings, which can be formulated as follows:

z_{i}=F_{l}(\mathcal{P}_{i}), (1)

where $\mathcal{P}_{i}$ indicates the $i$-th prompt, $F_{l}$ represents a pre-trained LLM, and $z_{i}$ is the produced language embedding, termed a knowledge anchor point (KAP). These KAPs are used to represent the space of pertinent knowledge.
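
Since the pre-trained text encoder of CLIP is used as the LLM in our experiments (Section 4.1), KAP production can be sketched as follows; the ViT-B/32 variant and the final L2 normalization are illustrative assumptions.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # pre-trained text encoder F_l

prompts = [f"The remaining useful life is {num}." for num in range(0, 126)]  # example KPS
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    kaps = model.encode_text(tokens)              # Eq. (1): z_i = F_l(P_i), one KAP per prompt
kaps = kaps / kaps.norm(dim=-1, keepdim=True)     # assumed L2 normalization of the KAPs
```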

Pertinent Knowledge Distillation Instead of transferring the entire knowledge of an LLM, our KP only transfers the pertinent knowledge indicated by the KAPs. However, there is a domain gap between the knowledge learned by the LLM and the knowledge of downstream tasks. To alleviate this issue, an alignment module consisting of 2 fully connected layers is used to project these KAPs into the latent space of the downstream task. This process is computed as below:

k_{i}=\phi(z_{i}), (2)

where $\phi$ denotes the alignment module and $k_{i}$ is the transformed feature vector, which serves as prior knowledge. Given a segment of time series data $x_{i}$, we forward it into our target model $F_{t}$ and obtain the predicted feature vector $f_{i}$. To utilize the prior knowledge $\mathcal{Z}=\{z_{i}\,|\,i=1,\dots,N\}$, metric learning is leveraged.
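
A minimal sketch of the alignment module $\phi$ is given below; the embedding sizes and the ReLU between the two fully connected layers are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    """Two fully connected layers projecting a KAP z_i into the task latent space (Eq. 2)."""

    def __init__(self, llm_dim: int = 512, feat_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(llm_dim, feat_dim)
        self.fc2 = nn.Linear(feat_dim, feat_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (N, llm_dim) KAPs
        return self.fc2(torch.relu(self.fc1(z)))         # k_i = phi(z_i)
```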

Moreover, to optimize the target model and the alignment module simultaneously, we develop a bi-directional metric learning based on the unidirectional metric learning in Prototypical Networks Snell et al. (2017). For optimizing the target model, the process is computed as:

p_{t}(i)=\frac{\exp(-d(k_{i},x_{i}))}{\sum_{t=1}^{|\mathcal{Z}|}\exp(-d(k_{t},x_{i}))}, (3)

where $d$ denotes the distance function. To improve the numerical stability and computational efficiency, we further refine this computation and compute the prediction as below:

p_{t}(i)=\log\left(\frac{\exp({\rm simi}(k_{i},f_{i}))}{\sum_{t=1}^{|\mathcal{Z}|}\exp({\rm simi}(k_{t},f_{i}))}\right), (4)

where simi is the cosine similarity. For the alignment optimization part, the process can be formulated as:

p_{l}(i)=\log\left(\frac{\exp({\rm simi}(k_{i},f_{i}))}{\sum_{t=1}^{|\mathcal{B}|}\exp({\rm simi}(k_{i},f_{t}))}\right), (5)

where $|\mathcal{B}|$ denotes the batch size. Finally, to distill the pertinent knowledge to the target model, the Kullback–Leibler divergence (KL-div) is used and the final loss is:

L=0.5\,D_{KL}(p_{t},p_{g})+0.5\,D_{KL}(p_{l},p_{g}^{T}), (6)

where $p_{g}$ is the ground truth distribution and is defined as follows:

p_{g}(i)=\frac{\exp(g_{i}\,\tau)}{\sum_{t=1}^{|\mathcal{B}|}\exp(g_{t}\,\tau)}, (7)

where $g_{i}$ is equal to one for the corresponding prompt description and zero otherwise. As in knowledge distillation, $\tau$ is a temperature hyper-parameter.
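
As a minimal sketch, assuming each training segment is paired with a one-hot ground-truth anchor, L2-normalized features, and a PyTorch implementation, the bi-directional distillation loss of Eqs. (4)–(7) could be written as below; the exact batch construction and the handling of $p_{g}^{T}$ may differ in practice.

```python
import torch
import torch.nn.functional as F

def kp_distillation_loss(feats, kaps, targets, tau=10.0):
    """Sketch of the bi-directional pertinent-knowledge distillation loss (Eqs. 4-7).

    feats:   (B, D) features f_i produced by the target model for a batch of segments.
    kaps:    (N, D) aligned knowledge anchor points k_i from the alignment module.
    targets: (B,) index of the matching anchor for each segment (where g_i = 1).
    """
    feats = F.normalize(feats, dim=-1)
    kaps = F.normalize(kaps, dim=-1)
    sim = feats @ kaps.t()                       # (B, N) cosine similarities simi(k, f)

    log_p_t = F.log_softmax(sim, dim=1)          # Eq. (4): normalise over the anchors
    log_p_l = F.log_softmax(sim, dim=0)          # Eq. (5): normalise over the batch

    # Eq. (7): softened one-hot ground-truth distribution with temperature tau.
    one_hot = F.one_hot(targets, num_classes=kaps.size(0)).float()
    p_g_t = F.softmax(one_hot * tau, dim=1)      # target distribution for p_t
    p_g_l = F.softmax(one_hot * tau, dim=0)      # "transposed" target for p_l

    # Eq. (6): average of the two KL-divergence terms.
    loss_t = F.kl_div(log_p_t, p_g_t, reduction="batchmean")
    loss_l = F.kl_div(log_p_l, p_g_l, reduction="batchmean")
    return 0.5 * loss_t + 0.5 * loss_l
```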

3.2 Anchor Voting Scheme

Since our KP is based on metric learning, the prediction of the target model is discrete. To extend our KP to tasks with continuous outputs such as regression, an anchor voting scheme (AVS) is proposed.

Given the prediction distribution $\mathcal{S}=\{p_{t}(i)\,|\,i=1,\dots,|\mathcal{Z}|\}$, we first sort these scores in descending order as below:

\hat{\mathcal{S}}={\rm sort}(\mathcal{S}). (8)

After that, these scores are accumulated according to Eq. 9.

\mathcal{S}_{a}={\rm cumsum}(\hat{\mathcal{S}}) (9)

Then, the accumulated scores which are larger than $\theta$ are formed into a voting set $\mathcal{V}=\{v_{i}\,|\,i=1,\dots,|\mathcal{V}|\}$. The final prediction is generated as follows:

o=\frac{\sum_{i=1}^{|\mathcal{V}|}v_{i}\,n_{i}}{\sum_{i=1}^{|\mathcal{V}|}v_{i}}, (10)

where $n_{i}$ indicates the numerical value described by the KAP $v_{i}$. With our proposed AVS, our proposed KP is effectively extended to the regression task, achieving significant performances.
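
As a sketch, one plausible reading of the voting-set construction is a top-$p$ style selection: keep the highest-scoring anchors until their cumulative score exceeds $\theta$, then average the values they describe, weighted by their scores. The snippet below assumes the scores are normalized softmax probabilities rather than the log-probabilities of Eq. (4); `anchor_voting` is a hypothetical helper name.

```python
import torch

def anchor_voting(scores, anchor_values, theta=0.9):
    """Sketch of the anchor voting scheme (Eqs. 8-10) under a top-p style reading.

    scores:        (N,) softmax scores of the target model over the KAPs.
    anchor_values: (N,) numerical value n_i described by each KAP (e.g. its RUL).
    theta:         cumulative-score threshold used to build the voting set.
    """
    sorted_scores, order = torch.sort(scores, descending=True)  # Eq. (8)
    cum_scores = torch.cumsum(sorted_scores, dim=0)              # Eq. (9)

    # Smallest prefix of top-ranked anchors whose cumulative score exceeds theta.
    cutoff = int((cum_scores < theta).sum().item()) + 1
    votes = sorted_scores[:cutoff]                               # voting set V
    values = anchor_values[order[:cutoff]]

    return (votes * values).sum() / votes.sum()                  # Eq. (10)
```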

4 Experiments

To verify the effectiveness of our Knowledge Pruning (KP), extensive experiments are conducted in this section.

Table 1: Comparison with other methods in regression. Compared with other methods, our KP performs much better, achieving the best performances on nearly all subsets.
Dataset FD001 FD002 FD003 FD004 AVG
Evaluation RMSE Score RMSE Score RMSE Score RMSE Score RMSE Score
Li et al. 12.61 273.70 22.36 10412.00 12.64 284.10 23.31 12466.00 17.73 5858.95
BLCNN 13.18 302.27 19.09 1558.00 13.75 381.37 20.97 3859.00 16.75 1525.16
PE-Net 13.98 280.87 14.69 881.73 12.33 272.85 15.40 1103.18 14.10 634.66
DGRU 18.54 1467.00 20.06 4085.00 19.28 1488.00 20.88 3872.00 19.69 2728.00
AdaNet 13.12 248.45 15.20 890.71 12.41 231.06 15.02 883.21 13.94 563.36
Jang et al. 12.47 253.00 18.18 1618.00 11.88 270.00 22.11 2797.00 16.16 1234.5
KDnet 13.68 362.08 14.47 929.20 12.95 327.27 15.96 1303.19 14.27 730.44
Two-Stream BiLSTM 12.07 208.11 14.97 847.98 11.84 211.80 14.94 906.61 13.45 543.63
KP (ours) 12.42 197.05 12.86 584.56 11.29 175.50 14.09 788.75 12.66 436.47

4.1 Datasets and Experimental Setup

Datasets To comprehensively investigate the performance of our KP, two fundamental tasks on edge-computing devices, classification and regression, are evaluated in this paper. In classification, the human activity recognition (HAR) task is studied and four different benchmarks are used: UCI_HAR Anguita et al. (2013), Opportunity Roggen et al. (2010), PAMAP2 Reiss and Stricker (2012), and WISDM Kwapisz et al. (2011). These benchmarks contain different numbers of activity categories ranging from 6 to 17, with different scales between 3k and 29k samples. In regression, the remaining useful life (RUL) prediction task is adopted for evaluation, where the C-MAPSS Saxena et al. (2008) dataset is used. C-MAPSS contains four subsets, FD001, FD002, FD003 and FD004, covering different scenarios.

Experimental Setup In classification, for consistency and meaningful comparison, the training and inference processes on UCI_HAR, Opportunity and PAMAP2 are conducted according to the protocol of iSPLInception Ronald et al. (2021). Since the experiments in iSPLInception Ronald et al. (2021) do not include the WISDM benchmark, the experiments on WISDM follow the setting in Multi CNN-BiLSTM Challa et al. (2022). For fair comparison, the other compared methods are re-implemented under the same setting. Following Ronald et al. (2021); Challa et al. (2022), the F1-Score is used for evaluation in HAR tasks. In regression, some methods are also re-implemented under the same conditions. The training and inference processes are conducted according to classic RUL methods Jin et al. (2022a); Chen et al. (2020). RMSE and the scoring function are used as evaluation metrics. The two hyper-parameters $\tau$ and $\theta$ are set to 10 and 0.9, respectively, for all experiments. The pre-trained text encoder in CLIP Radford et al. (2021) is used as the pre-trained LLM in the experiments. Experiments are conducted on a workstation with a GeForce RTX 4080 GPU and 128 GB memory, taking 1 to 4 hours.

4.2 Comparison with other methods

To evaluate the performance of our KP, we compare our approach with other state-of-the-art (SOTA) methods. The experimental results in regression and classification are listed in Table 1 and Table 2, respectively.

In the RUL task, several SOTA approaches are compared with our method: Li et al. Li et al. (2018), BLCNN Liu et al. (2019), PE-Net Jin et al. (2022b), DGRU Behera and Misra (2021), AdaNet Jin et al. (2023d), Jang et al. Jang and Kim (2021) and KDnet Xu et al. (2021). Among these methods, Li et al. propose a CNN-based network to predict the RUL. PE-Net integrates a position encoding scheme with an optimized CNN architecture for the RUL task. AdaNet introduces deformable convolution into the RUL task. BLCNN devises a hybrid network which combines RNN and CNN to improve the prediction accuracy. DGRU applies adversarial learning to the RUL task. A self-supervised learning approach is proposed in Jang et al. KDnet utilizes knowledge distillation to transfer the knowledge in an RNN to a CNN model. Benefiting from the pertinent knowledge learned from a pre-trained LLM, our KP performs much better than these methods and achieves the best performances.

Table 2: Comparison with other methods in classification. Compared with other methods, our proposed KP achieves the best performances in F1-Score among four different HAR benchmarks.
Methods UCI_HAR Opportunity PAMAP2 WISDM
LSTM-CNN 93.14 78.19 72.22 96.02
CNN 93.21 79.73 62.29 95.51
Multi CNN-GRU 94.05 83.92 68.13 96.15
Multi CNN-BiLSTM 93.60 84.53 70.23 95.8
GRU_INC 93.67 72.58 75.88 83.93
DTL 93.11 74.03 82.16 96.42
iSPLInception 93.08 84.45 80.01 96.14
KP (ours) 96.63 86.74 85.28 98.25

In the HAR task, we compare our KP with seven SOTA methods: LSTM-CNN Xia et al. (2020), CNN Van Kuppevelt et al. (2020), Multi CNN-GRU Dua et al. (2021), Multi CNN-BiLSTM Challa et al. (2022), GRU_INC Mim et al. (2023), DTL Ige and Noor (2023) and iSPLInception Ronald et al. (2021). These compared approaches employ different network architectures such as RNNs, CNNs and hybrid networks. As a novel model compression paradigm, our proposed KP is fundamentally orthogonal to existing HAR methods and can be applied to any existing HAR approach. We apply our KP to two different methods, DTL and iSPLInception, and list the best performances we achieved in Table 2. The experiments demonstrate that our KP effectively transfers the pertinent knowledge of a pre-trained LLM to the target model, achieving the best performances among the SOTA methods.

Table 3: Ablation study in regression. Baseline1 indicates the Bi-LSTM, baseline2 is the PE-Net, and baseline3 represents the Two-Stream BiLSTM. With our proposed KP, the performances of three different baselines are improved by a large margin.
Dataset FD001 FD002 FD003 FD004 AVG
Evaluation RMSE Score RMSE Score RMSE Score RMSE Score RMSE Score
Baseline1 13.09 260.67 15.85 871.62 13.23 266.12 15.81 1065.47 14.50 615.97
Baseline1+KP (ours) 12.82 201.85 14.09 716.48 11.59 190.94 15.17 953.94 13.42 515.80
Baseline2 13.98 280.87 14.69 881.73 12.33 272.85 15.40 1103.18 14.10 634.66
Baseline2 + KP(ours) 13.63 251.64 14.11 721.38 12.44 197.96 15.52 938.36 13.92 527.34
Baseline3 12.07 208.11 14.97 847.98 11.84 211.80 14.94 906.61 13.45 543.63
Baseline3 + KP (ours) 12.42 197.05 12.86 584.56 11.29 175.50 14.09 788.75 12.66 436.47
Table 4: Ablation study for AVS. Baseline1 indicates the Bi-LSTM. Through experiments, it shows that without our proposed AVS, the performances of KP on the regression task are limited. After applying our AVS, the performances on the regression task are significantly improved.
Dataset FD001 FD002 FD003 FD004 AVG
Evaluation RMSE Score RMSE Score RMSE Score RMSE Score RMSE Score
Baseline 13.09 260.67 15.85 871.62 13.23 266.12 15.81 1065.47 14.50 615.97
Baseline+KP w/o AVS (ours) 16.07 339.03 14.27 963.37 13.35 219.47 17.29 1340.89 15.25 715.69
Baseline+KP (ours) 12.82 201.85 14.09 716.48 11.59 190.94 15.17 953.94 13.42 515.80

4.3 Ablation Study

To verify the effectiveness, ablation experiments are presented. Our KP is orthogonal to approaches for time series analytics and can be directly applied to these methods. To show the generalization of our KP, we apply our KP to several different networks and show the performance improvements. Experimental results on the regression task (RUL) and the classification task (HAR) are listed in Table 3 and Table 5, respectively.

In the RUL task, three different approaches are used as our baselines: Bi-LSTM, Two-Stream BiLSTM Jin et al. (2022a) and PE-Net Jin et al. (2022b). Bi-LSTM is a shallow network which consists of two bi-directional LSTM layers. Two-Stream BiLSTM integrates a handcrafted feature flow Jin et al. (2022a) with the raw time series data via a Bi-LSTM based network. PE-Net designs a CNN with a position encoding scheme to predict the RUL. In Table 3, Bi-LSTM is used as baseline1, PE-Net as baseline2 and Two-Stream BiLSTM as baseline3. With our proposed KP, all three methods are remarkably improved across four different scenarios. Compared with RMSE, the Score is generally regarded as a more important evaluation metric, since it gives more penalty to late predictions, which is closer to the practical setting. Among these three baselines, baseline3 achieves the best performances on average. After applying our KP, its performances are further improved by 19.7% in Score.

In the HAR task, two different methods are used as our baselines: DTL Ige and Noor (2023) and iSPLInception Ronald et al. (2021). DTL is a hybrid network which combines CNN and RNN together to capture the temporal features. In comparison, iSPLInception utilizes an inception-based CNN network to classify human activities. In Table 5, baseline1 indicates DTL, and baseline2 represents iSPLInception. According to the experimental results, our KP effectively improves the performances of these two baselines on all four benchmarks. The improvements range from 0.8% to 13.7%.

Table 5: Ablation study in classification. Baseline1 indicates the DTL method, and baseline2 represents the iSPLInception method. Our proposed KP is able to effectively improve the performances on two different network architectures among four different HAR benchmarks.
Methods UCI_HAR Opportunity PAMAP2 WISDM
Baseline1 93.11 74.03 82.16 96.42
Baseline1+KP (ours) 96.63 84.14 85.28 97.18
Baseline2 93.08 84.45 80.01 96.14
Baseline2 + KP (ours) 94.75 86.47 83.99 98.25

These experiments show that our proposed KP effectively identifies the pertinent knowledge and transfers it to the target model. With our KP, all five baselines are improved by a large margin.

Effectiveness of AVS AVS is proposed to enable the metric learning based network to predict continuous values for regression tasks such as RUL. Experiments are designed to show the effectiveness of our proposed AVS, which are listed in Table 4. The Bi-LSTM is used as the baseline. The experimental results indicate that without our proposed AVS, the performances of KP on the regression task are not satisfactory. After applying our proposed AVS, our KP effectively improves the accuracy of the RUL prediction by a large margin.

Based on the experiments above, it can be found that our proposed KP consistently improves the performances across different tasks and benchmarks. Since the improvement by KP ranges from 0.8% to 13.7%, the effectiveness of our KP may be affected by the specific data distribution and the neural network architecture.

4.4 Computation Efficiency

Our KP is proposed to alleviate the issue of the computation cost of LLMs. Experiments on computation efficiency are carried out and listed in Table 6, where FLOPs and Params are used to compare the computation efficiency.

Table 6: Experiments on computation efficiency.
Methods FLOPs (G) Params (M)
LLM in CLIP 5.96 63.43
Two-Stream BiLSTM + KP (ours) 0.002 0.042
DTL+ KP (ours) 0.024 1.12

As listed in Table 6, we apply our KP to two networks, Two-Stream BiLSTM and DTL, for the RUL and HAR tasks, respectively. Since DTL applies a hybrid network composed of CNN and RNN and is more complex than the Two-Stream BiLSTM, the computational complexity of DTL is higher than that of Two-Stream BiLSTM. Nevertheless, the computation demands of these two approaches are much lower than that of the LLM in CLIP. According to the experimental results, our proposed KP is able to effectively prune the redundant knowledge of the LLM. The computation issue of LLMs is well alleviated, and the performances of the target model are improved.
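
For reference, a parameter count in the style of the "Params (M)" column can be obtained with a simple helper like the one below (FLOPs are typically measured with a profiling library such as thop or fvcore); the function name is hypothetical.

```python
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Trainable parameters in millions, matching the "Params (M)" convention of Table 6."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```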

Figure 3: Experiments on the RUL task with different $\tau$ values on FD004.

4.5 Sensitivity Analysis

Our proposed KP involves two hyper-parameters: $\tau$ and $\theta$. To investigate the impact of different values of these two parameters, several experiments are conducted and discussed. For the parameter $\tau$, we gradually increase its value and carry out experiments on the HAR and RUL tasks, respectively. The experimental results on RUL and HAR are illustrated in Fig. 3 and Fig. 4, respectively.

In Fig. 3, our KP is applied to Two-Stream BiLSTM with different $\tau$ values. Experiments are carried out on the FD004 subset, which contains the most complex scenarios. Although different performances are obtained on the RUL task, they are all still better than the baseline. With our KP, the performances of Two-Stream BiLSTM are consistently improved under different values of $\tau$.

Figure 4: Experiments on the HAR task with different $\tau$ values.

In Fig. 4, we apply our KP to DTL on two different datasets, the UCI_HAR and WISDM benchmarks. It shows that our KP improves the performances of DTL under different $\tau$ values.

AVS is proposed for regression tasks, enabling our network to predict continuous values for the RUL task. The hyper-parameter $\theta$ in AVS is used as a threshold to select the anchors for voting. To investigate the stability of our AVS, we apply our KP to the Two-Stream BiLSTM and design experiments on FD004 with different $\theta$ values, which are presented in Table 7.

Table 7: Experiments for AVS with different $\theta$ values on FD004.
Metric baseline 0.9 0.8 0.7 0.6 0.5
RMSE 14.94 14.09 13.92 13.91 14.11 14.30
Score 906.61 788.75 810.19 824.61 853.73 891.58

As listed in Table 7, our AVS with different $\theta$ values consistently improves the performances of Two-Stream BiLSTM. This demonstrates that our proposed AVS is robust to the variation of $\theta$.

5 Conclusions

In this paper, we have proposed a new model compression paradigm, Knowledge Pruning (KP). Our KP consists of three steps: knowledge prompt set generation, knowledge anchor point production and pertinent knowledge distillation. Furthermore, since our KP is based on metric learning, the performances on regression tasks may be limited. To extend our KP to the regression task, an anchor voting scheme has been proposed. Through experiments, our KP effectively prunes the redundant knowledge of LLMs for a specific downstream task and accurately transfers the pertinent knowledge to the target model. With our KP, the computation cost introduced by LLMs is largely reduced, and satisfactory performances are achieved. Our KP has shown significant improvements on both the classification task (HAR) and the regression task (RUL), achieving state-of-the-art performances.

References

  • Zhao et al. [2019] Rui Zhao, Ruqiang Yan, Zhenghua Chen, Kezhi Mao, Peng Wang, and Robert X Gao. Deep learning and its applications to machine health monitoring. Mechanical Systems and Signal Processing, 115:213–237, 2019.
  • Chen et al. [2018] Zhenghua Chen, Le Zhang, Chaoyang Jiang, Zhiguang Cao, and Wei Cui. Wifi csi based passive human activity recognition using attention based blstm. IEEE Transactions on Mobile Computing, 18(11):2714–2724, 2018.
  • Jin et al. [2023a] Guangyin Jin, Yuxuan Liang, Yuchen Fang, Zezhi Shao, Jincai Huang, Junbo Zhang, and Yu Zheng. Spatio-temporal graph neural networks for predictive learning in urban computing: A survey. IEEE Transactions on Knowledge and Data Engineering, 2023a.
  • Zhu et al. [2023] Zhaoyang Zhu, Weiqi Chen, Rui Xia, Tian Zhou, Peisong Niu, Bingqing Peng, Wenwei Wang, Hengbo Liu, Ziqing Ma, Qingsong Wen, et al. eforecaster: unifying electricity forecasting with robust, flexible, and explainable machine learning algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15630–15638, 2023.
  • Chen et al. [2020] Zhenghua Chen, Min Wu, Rui Zhao, Feri Guretno, Ruqiang Yan, and Xiaoli Li. Machine remaining useful life prediction via an attention-based deep learning approach. IEEE Transactions on Industrial Electronics, 68(3):2521–2531, 2020.
  • Jin et al. [2023b] Ming Jin, Qingsong Wen, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, et al. Large models for time series and spatio-temporal data: A survey and outlook. arXiv preprint arXiv:2310.10196, 2023b.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Peng et al. [2023] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
  • Xue and Salim [2023] Hao Xue and Flora D Salim. Promptcast: A new prompt-based learning paradigm for time series forecasting. IEEE Transactions on Knowledge and Data Engineering, 2023.
  • Chang et al. [2023] Ching Chang, Wen-Chih Peng, and Tien-Fu Chen. Llm4ts: Two-stage fine-tuning for time-series forecasting with pre-trained llms. arXiv preprint arXiv:2308.08469, 2023.
  • Zhou et al. [2023] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained lm. arXiv preprint arXiv:2302.11939, 2023.
  • Gruver et al. [2023] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. arXiv preprint arXiv:2310.07820, 2023.
  • Zhao et al. [2023] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • Awais et al. [2023] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721, 2023.
  • Jin et al. [2023c] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023c.
  • Sun et al. [2023] Chenxi Sun, Yaliang Li, Hongyan Li, and Shenda Hong. Test: Text prototype aligned embedding to activate llm’s ability for time series. arXiv preprint arXiv:2308.08241, 2023.
  • Cao et al. [2023] Defu Cao, Furong Jia, Sercan O Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, and Yan Liu. Tempo: Prompt-based generative pre-trained transformer for time series forecasting. arXiv preprint arXiv:2310.04948, 2023.
  • Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
  • Anguita et al. [2013] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge Luis Reyes-Ortiz, et al. A public domain dataset for human activity recognition using smartphones. In Esann, volume 3, page 3, 2013.
  • Roggen et al. [2010] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz, David Bannach, Gerald Pirkl, Alois Ferscha, et al. Collecting complex activity datasets in highly rich networked sensor environments. In 2010 Seventh international conference on networked sensing systems (INSS), pages 233–240. IEEE, 2010.
  • Reiss and Stricker [2012] Attila Reiss and Didier Stricker. Introducing a new benchmarked dataset for activity monitoring. In 2012 16th international symposium on wearable computers, pages 108–109. IEEE, 2012.
  • Kwapisz et al. [2011] Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. Activity recognition using cell phone accelerometers. ACM SigKDD Explorations Newsletter, 12(2):74–82, 2011.
  • Saxena et al. [2008] Abhinav Saxena, Kai Goebel, Don Simon, and Neil Eklund. Damage propagation modeling for aircraft engine run-to-failure simulation. In 2008 international conference on prognostics and health management, pages 1–9. IEEE, 2008.
  • Ronald et al. [2021] Mutegeki Ronald, Alwin Poulose, and Dong Seog Han. isplinception: An inception-resnet deep learning architecture for human activity recognition. IEEE Access, 9:68985–69001, 2021.
  • Challa et al. [2022] Sravan Kumar Challa, Akhilesh Kumar, and Vijay Bhaskar Semwal. A multibranch cnn-bilstm model for human activity recognition using wearable sensor data. The Visual Computer, 38(12):4095–4109, 2022.
  • Jin et al. [2022a] Ruibing Jin, Zhenghua Chen, Keyu Wu, Min Wu, Xiaoli Li, and Ruqiang Yan. Bi-lstm-based two-stream network for machine remaining useful life prediction. IEEE Transactions on Instrumentation and Measurement, 71:1–10, 2022a.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Li et al. [2018] Xiang Li, Qian Ding, and Jian-Qiao Sun. Remaining useful life estimation in prognostics using deep convolution neural networks. Reliability Engineering & System Safety, 172:1–11, 2018.
  • Liu et al. [2019] Hui Liu, Zhenyu Liu, Weiqiang Jia, and Xianke Lin. A novel deep learning-based encoder-decoder model for remaining useful life prediction. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
  • Jin et al. [2022b] Ruibing Jin, Min Wu, Keyu Wu, Kaizhou Gao, Zhenghua Chen, and Xiaoli Li. Position encoding based convolutional neural networks for machine remaining useful life prediction. IEEE/CAA Journal of Automatica Sinica, 9(8):1427–1439, 2022b.
  • Behera and Misra [2021] Sourajit Behera and Rajiv Misra. Generative adversarial networks based remaining useful life estimation for iiot. Computers & Electrical Engineering, 92:107195, 2021.
  • Jin et al. [2023d] Ruibing Jin, Duo Zhou, Min Wu, Xiaoli Li, and Zhenghua Chen. An adaptive and dynamical neural network for machine remaining useful life prediction. IEEE Transactions on Industrial Informatics, 2023d.
  • Jang and Kim [2021] Jaeyeon Jang and Chang Ouk Kim. Siamese network-based health representation learning and robust reference-based remaining useful life prediction. IEEE Transactions on Industrial Informatics, 18(8):5264–5274, 2021.
  • Xu et al. [2021] Qing Xu, Zhenghua Chen, Keyu Wu, Chao Wang, Min Wu, and Xiaoli Li. Kdnet-rul: A knowledge distillation framework to compress deep neural networks for machine remaining useful life prediction. IEEE Transactions on Industrial Electronics, 2021.
  • Xia et al. [2020] Kun Xia, Jianguang Huang, and Hanyu Wang. Lstm-cnn architecture for human activity recognition. IEEE Access, 8:56855–56866, 2020.
  • Van Kuppevelt et al. [2020] D Van Kuppevelt, C Meijer, F Huber, A van der Ploeg, S Georgievska, and Vincent T van Hees. Mcfly: Automated deep learning on time series. SoftwareX, 12:100548, 2020.
  • Dua et al. [2021] Nidhi Dua, Shiva Nand Singh, and Vijay Bhaskar Semwal. Multi-input cnn-gru based human activity recognition using wearable sensors. Computing, 103:1461–1478, 2021.
  • Mim et al. [2023] Taima Rahman Mim, Maliha Amatullah, Sadia Afreen, Mohammad Abu Yousuf, Shahadat Uddin, Salem A Alyami, Khondokar Fida Hasan, and Mohammad Ali Moni. Gru-inc: An inception-attention based approach using gru for human activity recognition. Expert Systems with Applications, 216:119419, 2023.
  • Ige and Noor [2023] Ayokunle Olalekan Ige and Mohd Halim Mohd Noor. A deep local-temporal architecture with attention for lightweight human activity recognition. Applied Soft Computing, 149:110954, 2023.