
EdgeFD: An Edge-Friendly Drift-Aware Fault Diagnosis System for Industrial IoT

Jiao Chen1, Fengjian Mao1, Zuohong Lv1, and Jianhua Tang1,2
1 Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China
2 Pazhou Lab, Guangzhou, China
{202110190459,201930362330,202220159664}@mail.scut.edu.cn, [email protected]
This work was supported in part by the National Natural Science Foundation of China under Grant 62001168.
Abstract

Recent transfer learning (TL) approaches in industrial intelligent fault diagnosis (FD) mostly follow the “pre-train and fine-tune” paradigm to address data drift, which emerges from variable working conditions. However, we find that this approach is prone to catastrophic forgetting. Furthermore, performing frequent model fine-tuning on resource-constrained edge nodes is computationally expensive and often unnecessary, given the strong transferability demonstrated by existing models. In this work, we propose Drift-Aware Weight Consolidation (DAWC), a method optimized for edge deployment that mitigates the challenges posed by frequent data drift in the industrial Internet of Things (IIoT). DAWC efficiently manages multiple data drift scenarios, minimizing the need for constant model fine-tuning on edge devices and thereby conserving computational resources. By detecting drift from classifier confidence and estimating parameter importance with the Fisher Information Matrix (a tool that measures parameter sensitivity in probabilistic models), we introduce a drift detection module and a continual learning module that gradually equip the FD model with powerful generalization capabilities. Experimental results demonstrate that our proposed DAWC achieves superior performance compared to existing techniques while also ensuring compatibility with edge computing constraints. Additionally, we have developed a comprehensive diagnosis and visualization platform. The project webpage is https://github.com/tenderzada/BearingEdge.

Index Terms:
Mechanical Fault Diagnosis, Drift Detection, Continual Learning, Industrial Internet of Things, Edge Computing

I Introduction

The integration of the Industrial Internet of Things (IIoT) and its key technologies (e.g., reconfigurable intelligent surface [1]) into industrial operations has triggered a profound transformation in operational methodologies. IIoT harnesses interconnected sensors, devices, and machinery to capture real-time data, facilitating data-driven decision-making and predictive maintenance strategies [2, 3].

In industry, bearings and harmonic reducers are pivotal mechanical components that support and transmit rotary motions within mechanical systems. However, bearing failures due to prolonged operation and load-bearing can result in equipment downtime and production disruptions, leading to significant economic losses and safety hazards. In response to these challenges, there exists a growing demand for intelligent fault diagnosis (FD), early warning, and visualization of bearing faults [4, 5, 6].

In recent years, integrating edge computing [7] into the IIoT framework has offered a promising avenue for enhancing intelligent diagnostic systems. Edge computing involves processing and analyzing data at or near the data source, enabling real-time analysis and minimizing data transmission latency. Notably, the convergence of IIoT and edge computing renders the concept of “source-end detection” in mechanical FD both practical and effective.

Figure 1: Comparative analysis of learning approaches in the context of evolving data distributions. The data distribution of a specific Inner Race Fault (IRF) changes over time due to varying rpm or loads. The objective is to create a model that effectively generalizes across this evolving data distribution. (a) Transfer Learning: Task-specific parameters are fine-tuned for each data drift. (b) Meta-Learning: The model is quickly adapted using a few samples from the new data distribution. (c) Our Approach: Important weights from past data distributions are consolidated while adapting to new tasks, ensuring a balance of specificity and generality.

The potential of intelligent FD methods has been showcased in various studies, yet a formidable challenge persists, namely, data drift, which leads to a notable degradation in diagnosis accuracy [8]. Data drift, encompassing challenges such as domain incremental learning (where new operating domains emerge over time), changes in prior probabilities, and covariate drift (changes in the distribution of input data), hampers the deployment of FD models on edge devices [9]. In industrial settings, especially where data can be scarce or the environment rapidly evolves, this problem becomes particularly pronounced.

To effectively address the issue of data drift in mechanical fault diagnosis, Transfer Learning (TL) [10, 11] and Meta-Learning (ML) [12] are often employed. TL and ML address the data drift problem by facilitating the transfer of knowledge from previously learned tasks to new ones, which enhances model robustness and data efficiency—essential traits in resource-constrained or swiftly evolving industrial settings. These methodologies promote better generalization across diverse fault scenarios. Moreover, ML excels in rapid adaptability to new fault conditions, pushing real-time fault diagnosis closer to realization.

As illustrated in Fig. 1(a), TL involves adapting a shared backbone network for diverse tasks. However, a significant downside of TL is catastrophic forgetting, where the model, while learning new tasks, tends to forget the previously learned knowledge. This phenomenon is demonstrated in Fig. 2. On the other hand, as depicted in Fig. 1(b), ML showcases rapid task adaptation, albeit at a computational cost [13], which might hinder its application in time-sensitive scenarios.

Given the limitations of TL and ML in addressing data drift, exploring alternative or complementary strategies becomes imperative to mitigate its impact on mechanical fault diagnosis. Addressing these challenges head-on, we propose a novel approach that seamlessly adapts to dynamic data distributions. Our system, EdgeFD, combines a data drift detection module with a weight consolidation mechanism, allowing the model to adapt to and remain optimized under dynamic data distributions. Our contributions are as follows:

\bullet We introduce a confidence-based data drift detection module that enables real-time monitoring and precise detection of abnormal data drift in bearing FD tasks.

\bullet Our utilization of the continual learning (CL) method based on weight consolidation effectively adapts to drifting data, facilitating incremental learning as new data emerges while efficiently adjusting the drift component.

\bullet For situations with frequent data drift and a high emphasis on long-term learning and knowledge retention, our Drift-Aware Weight Consolidation (DAWC) approach stands out as a superior solution.

\bullet The EdgeFD system harnesses the benefits of edge computing, enabling bearing FD tasks to be conducted on edge devices, minimizing dependence on cloud resources and data transfer latency, and ensuring real-time responsiveness.

Figure 2: When using transfer learning, a model learns four sequential tasks—one in each round. But as the neural network adapts to new tasks, its performance on the first task deteriorates by the second, third, and fourth rounds. This decline stems from weight adjustments for new tasks, which can inadvertently overwrite knowledge from previous tasks.

II Problem Definition

In this section, we outline the problem’s core attributes, particularly focusing on the challenges faced when naively implementing solutions based solely on TL and ML.

II-A Problem Definition

Mechanical equipment often exhibits drift in the underlying data distribution of its operating conditions. Factors such as varying rotational speed, measured in rpm (revolutions per minute), or different loads, measured in hp (horsepower), can instigate such drifts.

As mechanical systems operate, the FD model is iteratively fine-tuned over a sequence of tasks $\{\mathcal{T}^{(1)},\mathcal{T}^{(2)},\cdots,\mathcal{T}^{(T)}\}$, where $\mathcal{T}^{(t)}=\{\bm{x}_{i},\bm{y}_{i}\}_{i=1}^{N_{t}}$ is the labeled dataset of the $t$-th task, consisting of $N_{t}$ pairs of instances $\bm{x}_{i}$ and their corresponding labels $\bm{y}_{i}$. To reflect the most realistic setting, we treat the task sequence as a stream with an unknown arrival order, so that the model can access $\mathcal{T}^{(t)}$ only during the training period of task $\mathcal{T}^{(t)}$; the data become inaccessible afterward. Given $\mathcal{T}^{(t)}$ and the model learned so far, the learning objective at step $t$ is

\min_{\bm{\theta}^{(t)}}\ \mathcal{L}(\bm{\theta}^{(t)};\bm{\theta}^{(t-1)},\mathcal{T}^{(t)}),

where $\mathcal{L}(\cdot;\cdot)$ is the loss function that quantifies the difference between predictions made with the current model parameters $\bm{\theta}^{(t)}$ and the ground truth associated with task $\mathcal{T}^{(t)}$. The role of $\bm{\theta}^{(t-1)}$ is to provide an initial point for the optimization process, enabling the model to adapt quickly and effectively to the new task $\mathcal{T}^{(t)}$.

To solve the optimization problem described above, various methods have been developed that leverage the knowledge contained in $\bm{\theta}^{(t-1)}$ to enhance learning of the new task $\mathcal{T}^{(t)}$. In the context of diagnosis model training, techniques based on TL and ML have been the most widely explored.

II-B Transfer Learning and Meta Learning

Transfer Learning: The goal of TL is to speed up the learning of new tasks by sharing knowledge between different tasks. The optimization goal of TL can be expressed as minimizing empirical risk, for example using a cross-entropy loss function

\min_{\bm{\theta}^{(t)}}\sum_{\tau=1}^{t}\mathbb{E}_{(\bm{x}_{i},\bm{y}_{i})\sim\mathcal{T}^{(\tau)}}\left[\ell\left(\bm{y}_{i},f(\bm{x}_{i};\bm{\theta}^{(t)})\right)\right], (1)

where $f(\cdot;\cdot)$ is the corresponding classifier function. The model parameters for the $t$-th task combine the backbone network $\bm{u}^{(t)}$ and the classifier $\bm{v}^{(t)}$: $\bm{\theta}^{(t)}:=(\bm{u}^{(t)}\circ\bm{v}^{(t)})$. The TL process freezes the backbone network, $\bm{u}^{(t)}=\bm{u}^{(t-1)}$, and updates only the parameters of the classification head $\bm{v}^{(t)}$. $\ell(\bm{y}_{i},f(\bm{x}_{i};\bm{\theta}))$ is the loss function measuring the difference between the predicted output $f(\bm{x}_{i};\bm{\theta})$ and the true labels $\bm{y}_{i}$.
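To make this concrete, the following PyTorch sketch freezes the backbone and fine-tunes only the classification head; the `backbone`/`classifier` attribute names, the optimizer, and the schedule are illustrative assumptions rather than details prescribed by the method above.

```python
import torch

def finetune_head(model, loader, epochs=5, lr=1e-3, device="cpu"):
    """Naive TL adaptation: freeze the backbone u^(t) = u^(t-1), update only the head v^(t).

    Assumes `model` exposes `backbone` and `classifier` sub-modules (hypothetical names).
    """
    model.to(device)
    for p in model.backbone.parameters():
        p.requires_grad = False                      # keep the shared feature extractor fixed
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()          # the empirical risk in Eq. (1)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```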

Meta-Learning: The goal of ML is to learn the updating rules from multiple tasks, optimizing the learning process for subsequent tasks. A popular algorithm in this domain is Model-Agnostic Meta-Learning (MAML), where the optimization objective is

\min_{\bm{\theta}}\sum_{\tau=1}^{t}\mathbb{E}_{(D_{\tau}^{s},D_{\tau}^{q})\sim\mathcal{T}^{(\tau)}}\left[\ell\left(f_{\bm{\theta}^{\prime}}(D_{\tau}^{q}),D_{\tau}^{q}\right)\right], (2)

where $\bm{\theta}^{\prime}=\bm{\theta}-\alpha\nabla_{\bm{\theta}}\ell(f_{\bm{\theta}}(D_{\tau}^{s}),D_{\tau}^{s})$ is obtained by adapting the model on the support set, and $f_{\bm{\theta}^{\prime}}(D_{\tau}^{q})$ denotes the adapted model's prediction on the query set of task $\mathcal{T}^{(\tau)}$. In (2), $\bm{\theta}$ represents the global model parameters, $D_{\tau}^{s}$ is the support set of task $\mathcal{T}^{(\tau)}$ (used to fine-tune the model), $D_{\tau}^{q}$ is the query set (used to evaluate performance), and $\alpha$ is the inner-loop learning rate.
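For comparison, the sketch below shows one MAML-style inner/outer step corresponding to (2); the functional interface `model_fn(params, x)` and the single-step inner loop are simplifying assumptions, not the exact training procedure of [13].

```python
import torch

def maml_meta_loss(params, model_fn, loss_fn, support, query, alpha=0.01):
    """One inner/outer MAML step for a single task.

    params   : list of tensors with requires_grad=True (the meta-parameters theta)
    model_fn : functional forward pass, model_fn(params, x) -> predictions (assumed interface)
    support  : (x_s, y_s) used for the inner-loop adaptation
    query    : (x_q, y_q) used to evaluate the adapted parameters theta'
    """
    x_s, y_s = support
    x_q, y_q = query
    # Inner loop: theta' = theta - alpha * grad of the support-set loss.
    inner_loss = loss_fn(model_fn(params, x_s), y_s)
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)  # keep graph for 2nd-order terms
    adapted = [p - alpha * g for p, g in zip(params, grads)]
    # Outer loss: evaluate theta' on the query set; backprop through the inner update.
    return loss_fn(model_fn(adapted, x_q), y_q)
```

A meta-optimizer would then back-propagate this outer loss to `params`, which is the source of the second-order derivative cost discussed below.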

While achieving certain successes, the existing TL-based FD methods often suffer from catastrophic forgetting (as shown in Fig. 2), leading to continuous model adaptation and an increased computational burden on edge devices. Meanwhile, ML-based approaches struggle to handle sequential tasks and involve expensive second-order derivative computations. Moreover, these studies assume the learner is aware of task distribution changes, allowing a new posterior approximation to be computed. This assumption, however, diverges from real-world IIoT scenarios, which encompass gradual, sudden, and repeated drifts.

Therefore, data drift detection is necessary and important before model adaptation, and the detection of data drift can serve as a valuable cue for model adaptation. For example, the drift intensity helps to reduce the search space and ease the learning process [14]. Specifically, by dynamically adjusting the consolidation weights of previous parameters based on the drift detection results, we can effectively address the challenges posed by significant data drifts.

III Learning Methodology

In this section, we introduce a novel framework termed DAWC that effectively detects drift in data streams and employs weight consolidation for model fine-tuning. DAWC addresses the challenges faced by TL and ML methods while also reducing computational costs for edge devices.

III-A Data Drift Detection

Figure 3: The simplified flow of the drift detection algorithm. $T_{h}$ is the predefined threshold, which is related to the sensitivity $\lambda$.

Inspired by the CUSUM-type [15] detection method, we propose a sliding window data drift detection method based on the Beta distribution, as illustrated in Fig. 3. In the realm of statistical modeling, the Beta distribution stands out when dealing with random variables bounded within a fixed interval, especially those representing proportions or probabilities. This distribution is naturally defined between 0 and 1, making it a prime choice for situations where we model uncertainties of confidence scores. A key strength of the Beta distribution lies in its versatility; its shape can adapt from U-shaped to bell-shaped, contingent upon its parameters. In our context, the confidence scores from model predictions fall between 0 and 1. As data drift manifests, these confidence score distributions might shift. By leveraging the adaptability of the Beta distribution, we can effectively capture and compare these shifts, making it a powerful tool for drift detection in our methodology.

Within a sliding window of length $N$, the tuple $[\hat{y}_{i},q_{i}]$ denotes the model's predicted category and the corresponding confidence score for the input data sample $\bm{x}_{i}$. The sliding window is partitioned into two segments: the reference window $Q_{r}$, housing historical confidences, and the target window $Q_{t}$, which contains the latest confidences. It is assumed that the confidence scores in the two windows follow different Beta distributions. When data drift occurs, the CNN model's classification accuracy declines, causing small confidence values to appear in the target window and leading to discrepancies between the two Beta distributions. To detect data drift, we compare the cumulative probability density function (PDF) values of the confidences in the target window under these two Beta distributions.

Step 1. Estimate Beta Distribution Parameters: We estimate the parameters of the Beta distribution using the Maximum Likelihood Estimation (MLE) method. The PDF of the Beta distribution is given by

f(q_{i}|\alpha,\beta)=\begin{cases}\frac{q_{i}^{\alpha-1}(1-q_{i})^{\beta-1}}{B(\alpha,\beta)},&\text{if }0<q_{i}<1,\\ 0,&\text{otherwise},\end{cases} (3)

where

B(\alpha,\beta)=\int_{0}^{1}q_{i}^{\alpha-1}(1-q_{i})^{\beta-1}\,dq_{i}. (4)

Step 2. Calculate Dissimilarity Score: Compute the dissimilarity score of the PDF values under the two Beta distributions, and sum these scores to obtain the cumulative dissimilarity score $s_{k}$.

Step 3. Detect Drift: If the cumulative dissimilarity score $s_{k}$ surpasses a predefined threshold $T_{h}$, data drift is confirmed. The threshold $T_{h}$ is a critical value used to determine the occurrence of data drift; it is inversely related to the sensitivity $\lambda$, i.e., a smaller $\lambda$ results in a higher threshold, making drift detection more stringent.

Step 4. Determine Drift Location: Resize the sub-windows and shift the divider, recalculating the dissimilarity scores. If the new cumulative dissimilarity score $s_{f}^{\prime}$ is larger, the new divider is closer to the drift point. This process continues until the maximum dissimilarity score $s_{f}$ is reached, pinpointing the position at which the data drift occurred.

Algorithm 1 provides a structured procedure to detect data drift using the cumulative dissimilarity score. Given a sliding window $Q$, the sensitivity to change $\lambda$, a padding $\delta$, and a maximum window size $N_{max}$, it iterates through the sliding window and evaluates the cumulative dissimilarity score against the threshold to determine whether drift has occurred.

Input: Sliding window $Q$, sensitivity to change $\lambda$, padding $\delta$, maximum sliding-window size $N_{max}$.
Output: A boolean value indicating whether drift is detected.

$s_{f}\leftarrow 0$, $T_{h}\leftarrow-\log(\lambda)$, $N\leftarrow|Q|$
for $k\leftarrow\delta$ to $N-\delta$ do
    $m_{r}\leftarrow\textnormal{mean}(q_{1}:q_{k}\in Q)$
    $m_{t}\leftarrow\textnormal{mean}(q_{k+1}:q_{N}\in Q)$
    if $m_{t}\leq(1-\lambda)\cdot m_{r}$ then
        $s_{k}\leftarrow 0$
        $[\hat{\alpha}_{r},\hat{\beta}_{r}]\leftarrow\textnormal{estimateParams}(q_{1}:q_{k})$
        $[\hat{\alpha}_{t},\hat{\beta}_{t}]\leftarrow\textnormal{estimateParams}(q_{k+1}:q_{N})$
        for $i\leftarrow k+1$ to $N$ do
            $s_{k}\leftarrow s_{k}+\log\left(\frac{f(q_{i}|\hat{\alpha}_{t},\hat{\beta}_{t})}{f(q_{i}|\hat{\alpha}_{r},\hat{\beta}_{r})}\right)$
        $s_{f}\leftarrow\max(s_{f},s_{k})$
if $s_{f}>T_{h}$ then
    return true
else
    return false

Algorithm 1 Drift Detection Algorithm
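For concreteness, a minimal Python sketch of Algorithm 1 is given below; it uses `scipy.stats.beta` for the MLE step (`estimateParams`) and additionally records the split point with the largest score, which Step 4 uses to localize the drift. The clipping constant and the returned location are implementation assumptions.

```python
import numpy as np
from scipy.stats import beta

def detect_drift(Q, lam=0.05, delta=100):
    """Beta-distribution drift detection over a window of confidence scores Q (Algorithm 1).

    Returns (drift_detected, split_index), where split_index is the divider position
    that maximizes the cumulative dissimilarity score (used to localize the drift).
    """
    q = np.clip(np.asarray(Q, dtype=float), 1e-6, 1 - 1e-6)  # keep scores strictly in (0, 1)
    N = len(q)
    T_h = -np.log(lam)                       # threshold tied to the sensitivity lambda
    s_f, k_best = 0.0, None
    for k in range(delta, N - delta):
        ref, tgt = q[:k], q[k:]              # reference window Q_r and target window Q_t
        if tgt.mean() <= (1.0 - lam) * ref.mean():
            # MLE of the Beta parameters on each sub-window (support fixed to [0, 1]).
            a_r, b_r, _, _ = beta.fit(ref, floc=0, fscale=1)
            a_t, b_t, _, _ = beta.fit(tgt, floc=0, fscale=1)
            # Cumulative log-ratio of target-window PDFs under the two fitted distributions.
            s_k = np.sum(np.log(beta.pdf(tgt, a_t, b_t)) - np.log(beta.pdf(tgt, a_r, b_r)))
            if s_k > s_f:
                s_f, k_best = s_k, k
    return s_f > T_h, k_best
```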

III-B Weight Consolidation

Once drift is identified using our approach, the next challenge is fine-tuning the model without compromising or forgetting the previously acquired knowledge. Zhang et al. [16] presented a CL method centered on data prototype replay to address this. While effective, this method incurs increased storage and computational costs. On the other hand, regularization, a tool traditionally employed to mitigate model overfitting, has garnered attention in the CL domain. Within this sphere, the chief avenues for regularization stem from parameter importance estimates [17, 18] and knowledge distillation [19].

Building upon these insights, we introduce our approach. Drawing from the Elastic Weight Consolidation (EWC) [17] methodology, we employ an approximate Bayesian CL strategy, aiming to seamlessly adapt models in continual learning scenarios. Let $\bm{\theta}$ be the parameter vector, and consider the posterior of $\bm{\theta}$:

\begin{split}p(\bm{\theta}|\mathcal{T}^{(1:t)})&\propto p(\bm{\theta})\prod_{i=1}^{t}p(\mathcal{T}^{(i)}|\bm{\theta})\\ &\propto p(\bm{\theta}|\mathcal{T}^{(1:t-1)})\,p(\mathcal{T}^{(t)}|\bm{\theta}).\end{split} (5)

Here, the factorization in (5) arises due to the conditional independence assumption of task data. However, computing the exact posteriors is challenging, leading to Laplace’s approximation:

p(\mathcal{T}^{(t)}|\bm{\theta})\approx\mathcal{N}(\bm{\theta}_{t}^{*},\mathcal{F}_{t}^{-1}), (6)

where the mean $\bm{\theta}_{t}^{*}$ is centered at the maximum a posteriori parameters obtained when learning task $t$, and the precision is given by the Fisher information matrix $\mathcal{F}_{t}$ evaluated at $\bm{\theta}_{t}^{*}$. This matrix approximates the Hessian of the negative log-likelihood, ensuring positive semidefiniteness.
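A diagonal approximation of $\mathcal{F}_{t}$ can be estimated from squared gradients of the task loss, as in the PyTorch sketch below; the cross-entropy loss and the batch-averaged accumulation are assumptions of this sketch rather than the exact estimator used in our implementation.

```python
import torch

def diagonal_fisher(model, loader, device="cpu"):
    """Diagonal approximation of the Fisher information F_t at the current parameters."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    criterion = torch.nn.CrossEntropyLoss()
    model.eval()
    n_batches = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = criterion(model(x), y)        # negative log-likelihood of the task data
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2   # squared gradients approximate the diagonal
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}
```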

The overall optimization objective is to minimize the empirical risk while simultaneously accounting for the significance of parameters from prior tasks through a regularization term:

-\log p(\mathcal{T}^{(t)}|\bm{\theta})+\frac{\lambda}{2}\sum_{i=1}^{t-1}\mathcal{F}_{i}(\bm{\theta}-\bm{\theta}_{i}^{*})^{2}, (7)

where $\lambda$ is a hyperparameter that controls the penalty on important parameters from previous tasks, $\mathcal{F}_{i}$ weights the regularization term according to the importance of the parameters learned for previous task $i$, and $\mathcal{T}^{(t)}$ represents the current task's data. This regularization mechanism effectively preserves prior knowledge while adapting to new tasks.
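Assembling (7) in code then amounts to adding a quadratic consolidation penalty to the current task loss, as in the sketch below; the dictionary-based bookkeeping of past Fisher matrices and anchor parameters $\bm{\theta}_{i}^{*}$ is an implementation assumption.

```python
def ewc_loss(model, task_loss, fishers, anchors, lam):
    """Eq. (7): task loss plus (lam / 2) * sum_i F_i * (theta - theta_i*)^2.

    fishers : list of dicts {param_name: diagonal Fisher} from previous tasks
    anchors : list of dicts {param_name: theta_i*} saved after each previous task
    """
    penalty = 0.0
    for fisher, star in zip(fishers, anchors):
        for name, p in model.named_parameters():
            if name in fisher:
                penalty = penalty + (fisher[name] * (p - star[name]) ** 2).sum()
    return task_loss + 0.5 * lam * penalty
```

After finishing task $t$, one would store the estimated Fisher matrix and a detached copy of the parameters as the anchor for subsequent tasks.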

More detailed mathematical derivation of Eqs. (5)-(7) and additional experimental validations will be included in the extended version of this paper, which will provide an in-depth exploration of the presented methodology.

IV Experiments

In this section, we present the experimental settings and evaluate the performance of our method against baseline methods. Moreover, we conduct extensive ablation studies to provide a deeper understanding of our method.

Baseline methods. The chosen baseline methods represent common strategies in the domain of task-oriented learning:

  • STL: Single-task learning, where each task is treated independently. This serves as the most basic comparison.

  • FCB [20]: A naive TL approach that preserves the feature extraction capabilities of the backbone network while fine-tuning only the classification head.

  • MAML [13]: An advanced meta-learning technique aiming to find a model initialization conducive to rapid adaptation to new tasks.

IV-A Datasets and Experimental Details

Experimental Setup: All experiments are conducted on a system equipped with an NVIDIA RTX 3090 GPU, using PyTorch version 1.9.

Dataset: This study employs the publicly accessible CWRU Bearing Dataset [21], which we frame as a 10-class classification task. The classes correspond to different failure sites (Inner Race, Outer Race, Rolling Body) and fault sizes (0.007 in., 0.014 in., 0.021 in.), plus the normal condition. The dataset encompasses bearing data with failures under diverse loads (0 hp, 1 hp, 2 hp, 3 hp), reflecting variations in real-world data collection conditions. As a result, we segment the drifting tasks based on these distinct load conditions.

Backbone Network: Inspired by [22], we adopt the WDCNN as our core network, recognized for its distinct attributes: (1) a wide initial layer featuring convolutional kernels of size $64\times 1$; (2) multiple compact $3\times 1$ convolutional kernels stacked after the wide kernels. The wide initial convolutional kernels capture broad features, while the stacked smaller kernels delve into finer details.
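A minimal PyTorch sketch of this wide-kernel-first design is shown below; the channel counts, strides, and pooling sizes are illustrative assumptions and not the exact WDCNN configuration of [22].

```python
import torch.nn as nn

class WDCNNStyleBackbone(nn.Module):
    """Wide first convolution (64 x 1 kernel) followed by stacks of small 3 x 1 kernels."""
    def __init__(self, in_channels=1, feature_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=64, stride=16, padding=24),  # wide kernel
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),                         # small kernels
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveMaxPool1d(4),                                             # fixed-length output
        )
        self.proj = nn.Linear(64 * 4, feature_dim)

    def forward(self, x):                     # x: (batch, channels, signal_length)
        return self.proj(self.features(x).flatten(1))
```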

Training Setting: The dataset comprises 5000 samples in total, 1250 per load condition. We employ an 80-20 split: 1000 samples from each load condition are used for model training, while the remaining 250 samples per condition serve as the test set to evaluate model performance.

Performance Metric: We use two metrics from [23] to evaluate algorithms when the number of tasks is large, i.e., Average Accuracy (AA) and Average Forgetting (AF).
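The two metrics can be computed from a matrix of per-task test accuracies, as in the sketch below, which follows the usual formulation of AA and AF in [23]; the matrix layout is an assumption of this sketch.

```python
import numpy as np

def aa_and_af(acc):
    """acc[t][k]: test accuracy on task k after training through task t (T x T matrix).

    AA: mean accuracy over all tasks after the final task.
    AF: mean drop from each old task's best accuracy to its final accuracy.
    """
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    aa = acc[T - 1].mean()
    af = 0.0
    if T > 1:
        drops = [acc[:T - 1, k].max() - acc[T - 1, k] for k in range(T - 1)]
        af = float(np.mean(drops))
    return aa, af
```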

IV-B Main Results

To ensure the stability and reliability of our results, we conduct three independent repetitions of each experiment; the standard deviations across these trials are reported in parentheses.

Within our experiments, we focus on four distinct heterogeneous/drifting tasks. After completing training on 4000 samples, we aggregate the experimental outcomes in Table I. Encouragingly, our DAWC showcases remarkable performance. Specifically, DAWC demonstrates a notable 4.64% increase in accuracy compared to the strongest baseline method (i.e., MAML). It is worth noting that DAWC not only excels in terms of accuracy but also exhibits a significant advantage in terms of forgetting rate, which validates the efficacy of our proposed weight consolidation strategy.

TABLE I: Performance on CWRU dataset
Method | Type | AA (%) | AF
STL | Baseline | 81.25 (± 0.63) | 0.22 (± 0.02)
FCB | Transfer Learning | 86.67 (± 0.72) | 0.16 (± 0.02)
MAML | Meta-Learning | 89.28 (± 0.16) | 0.14 (± 0.01)
DAWC (ours) | Continual Learning | 93.92 (± 0.15) | 0.07 (± 0.01)

Fig. 4 shows the trends in AA throughout the learning process for the four methods. Our DAWC approach demonstrates superior performance, particularly in the face of multiple instances of data drift. This serves as evidence of the considerable potential of DAWC in handling heterogeneous tasks.

Figure 4: The comparison of the AA over the 4 tasks during the learning phase.

In summary, the DAWC method outperformed the existing baseline methods in terms of both accuracy and forgetting rate, thereby validating the effectiveness of the proposed weight consolidation strategy in handling drifting tasks.

IV-C Effectiveness of Core Designs

Effect of Drift Detection: We conduct experiments with the parameters $\lambda=0.05$, $N_{max}=1000$, and $\delta=100$. Each alteration in bearing load signifies a shift in the underlying data distribution, i.e., data drift. These shifts occur at samples 1000, 2000, and 3000. In Fig. 5, the vertical dashed lines indicate where a drift is detected by our detection method. We can observe that all drifts are detected shortly after they occur.

Figure 5: Accuracy for the four tasks in DAWC. The vertical dashed lines indicate the times at which a change in the distribution was detected.
Figure 6: Accuracy comparison of four methods across sequential tasks.

Effect of Weight Consolidation: We evaluate the accuracy of the four methods for each task, as depicted in Fig. 6. It is observed that the accuracy steadily improves with fewer fluctuations as the training for new tasks progresses, highlighting the robustness of our method.

The effectiveness of our approach in mitigating forgetting can be attributed to efficient weight consolidation, which helps preserve previously learned information.

Computational Costs: Fig. 7 compares the average computational cost of adapting to drift for the three methods throughout the learning process. When confronted with multiple instances of data drift, both TL and DAWC reduce the computational cost on edge devices compared with the STL method. However, our approach outperforms TL in this respect.

This superiority arises because TL repeatedly triggers model fine-tuning, adjusting its parameters to accommodate each data drift. In contrast, our DAWC scheme rapidly adapts to various data drift scenarios without triggering full model fine-tuning each time. As a result, it significantly alleviates computational expense and is hence edge-friendly.

Figure 7: Comparison of average computational costs for drift adaptation.

V System Demo

Based on the framework of DAWC, we develop a system for bearing FD and alerting. The operational workflow of the entire system is depicted in Fig. 8, showcasing the major components and stages involved. This system leverages the power of DAWC to discern and classify 10 different types of bearing faults, returning corresponding confidence scores. Upon the detection of a bearing fault, the system promptly sends fault alert notifications to administrators, allowing for the timely initiation of necessary maintenance measures.

Figure 8: Illustration of the system’s operational workflow.

To provide users with a comprehensive understanding of the system’s status and diagnostic results, we design an intuitive visualization interface, as partially depicted in Fig. 9. At the top of the interface, distinct identifiers for various fault categories are presented. Specifically, “Nm” indicates “Normal”, “IR0.007” stands for “0.007-inch Inner Race Fault”, “OR0.021” represents “0.021-inch Outer Race Fault”, and “B0.014” means “0.014-inch Ball Fault”. Additionally, beneath each fault classification, the upper section of a rectangular bar displays the associated confidence score, allowing users to grasp the system’s confidence in each fault classification.

Regarding the system's handling of drift data, as showcased in Fig. 10, when the system's confidence in the incoming data is comparatively low, it labels the sample with the term “Drift” accompanied by a red bar, ensuring that users can promptly recognize instances of data drift.

Figure 9: Visualization of the system interface, displaying various fault classifications and their corresponding confidence scores.

Through our meticulous design, we not only achieve significant progress in bearing FD and alerting but also provide users with the convenience of seamlessly monitoring and understanding the system’s operational status.

Figure 10: Highlighting the system’s approach to drift data, using the term “Drift” and a red bar for low-confidence drift instances.

VI Conclusion

In this study, we have presented an edge-oriented method for mechanical fault diagnosis that incorporates drift detection and weight consolidation mechanisms. Distinctively, our method stands out for its significant reduction in computational overhead, especially in scenarios plagued by frequent data drift, offering a resource-efficient solution for edge devices. A prominent characteristic of this method is its simplicity, making it adaptable to various backbone network architectures. This attribute paves the way for leveraging more powerful pre-trained models in future large-scale edge-cloud collaborative settings. By combining this approach with advanced pre-trained models, we can further enhance the performance and scalability of fault diagnosis systems.

References

  • [1] J. Xiao, J. Tang, and J. Chen, “Efficient radar detection for RIS-aided dual-functional radar-communication system,” in Proc. IEEE VTC2023-Spring, Florence, Italy, Jun. 2023, pp. 1–6.
  • [2] B. Yin, J. Tang, and M. Wen, “Connectivity maximization in non-orthogonal network slicing enabled Industrial Internet-of-Things with multiple services,” IEEE Trans. Wireless Commun., vol. 22, no. 8, pp. 5642–5656, Aug. 2023.
  • [3] Y. Zhao, D. Saxena, and J. Cao, “Memory-efficient domain incremental learning for Internet of Things,” in Proc. SenSys, Boston, MA, USA, Nov. 2022, pp. 1175–1181.
  • [4] J. Chen, J. Tang, and W. Li, “Industrial edge intelligence: Federated-meta learning framework for few-shot fault diagnosis,” IEEE Trans. Network Sci. Eng., pp. 1–13, 2023, early access.
  • [5] T. Wang, H. Liu, D. Guo, and X.-M. Sun, “Continual residual reservoir computing for remaining useful life prediction,” IEEE Trans. Ind. Inf., pp. 1–10, 2023, early access.
  • [6] G. Chen, Y. Lu, and R. Su, “Interpretable fault diagnosis with shapelet temporal logic: Theory and application,” Automatica, vol. 142, p. 110350, May 2022.
  • [7] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Trans. Commun., vol. 65, no. 8, pp. 3571–3584, Aug. 2017.
  • [8] S. Bandyopadhyay, A. Datta, A. Pal, and S. R. R. Gadepally, “Intelligent continuous monitoring to handle data distributional changes for IoT systems,” in Proc. SenSys, Boston, MA, USA, Nov. 2022, pp. 1189–1195.
  • [9] J. Hurtado, D. Salvati, R. Semola, M. Bosio, and V. Lomonaco, “Continual learning for predictive maintenance: Overview and challenges,” Intelligent Systems with Applications, p. 200251, Jun. 2023.
  • [10] J. Chen, R. Huang, Z. Chen, W. Mao, and W. Li, “Transfer learning algorithms for bearing remaining useful life prediction: A comprehensive review from an industrial application perspective,” Mechanical Systems and Signal Processing, vol. 193, p. 110239, Mar. 2023.
  • [11] J. Chen, D. Li, R. Huang, Z. Chen, and W. Li, “Aero-engine remaining useful life prediction method with self-adaptive multimodal data fusion and cluster-ensemble transfer regression,” Reliability Engineering & System Safety, vol. 234, p. 109151, Feb. 2023.
  • [12] Y. Feng, J. Chen, J. Xie, T. Zhang, H. Lv, and T. Pan, “Meta-learning as a promising approach for few-shot cross-domain fault diagnosis: Algorithms, applications, and prospects,” Knowledge-Based Systems, vol. 235, p. 107646, Jan. 2022.
  • [13] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, Sydney, Australia, Aug. 2017, pp. 1126–1135.
  • [14] T. Lesort, M. Caccia, and I. Rish, “Understanding continual learning settings with data distribution drift analysis,” 2021. [Online]. Available: https://arxiv.org/abs/2104.01678
  • [15] A. Haque, L. Khan, and M. Baron, “Sand: Semi-supervised adaptive novel class detection and classification over data stream,” in Proc. AAAI, vol. 30, no. 1, Phoenix, AZ, USA, Feb. 2016.
  • [16] L. Zhang, G. Gao, and H. Zhang, “Spatial-temporal federated learning for lifelong person re-identification on distributed edges,” IEEE Trans. Circuits Syst. Video Technol., 2023, early access.
  • [17] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, Mar. 2017.
  • [18] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” in Proc. ECCV, Munich, Germany, Sep. 2018, pp. 139–154.
  • [19] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara, “Dark experience for general continual learning: a strong, simple baseline,” in Proc. NeurIPS, vol. 33, Virtual, Dec. 2020, pp. 15920–15930.
  • [20] T. Han, C. Liu, W. Yang, and D. Jiang, “Learning transferable features in deep convolutional neural networks for diagnosing unseen machine conditions,” ISA Transactions, vol. 93, pp. 341–353, Oct. 2019.
  • [21] W. A. Smith and R. B. Randall, “Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study,” Mechanical Systems and Signal Processing, vol. 64-65, pp. 100–131, Jun. 2015.
  • [22] W. Zhang, G. Peng, C. Li, Y. Chen, and Z. Zhang, “A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals,” Sensors, vol. 17, no. 2, p. 425, Feb. 2017.
  • [23] S. I. Mirzadeh, M. Farajtabar, R. Pascanu, and H. Ghasemzadeh, “Understanding the role of training regimes in continual learning,” in Proc. NeurIPS, vol. 33, Virtual, Dec. 2020, pp. 7308–7320.