EdgeFD: An Edge-Friendly Drift-Aware Fault Diagnosis System for Industrial IoT
Abstract
Recent transfer learning (TL) approaches to industrial intelligent fault diagnosis (FD) mostly follow the "pre-train and fine-tune" paradigm to address data drift, which emerges from variable working conditions. However, we find that this approach is prone to catastrophic forgetting. Furthermore, frequent model fine-tuning on resource-constrained edge nodes is computationally expensive and often unnecessary, given the excellent transferability demonstrated by existing models. In this work, we propose Drift-Aware Weight Consolidation (DAWC), a method optimized for edge deployments that mitigates the challenges posed by frequent data drift in the industrial Internet of Things (IIoT). DAWC efficiently manages multiple data drift scenarios, minimizing the need for constant model fine-tuning on edge devices and thereby conserving computational resources. By detecting drift from classifier confidence scores and estimating parameter importance with the Fisher Information Matrix, a tool that measures parameter sensitivity in probabilistic models, we design a drift detection module and a continual learning module that gradually equip the FD model with powerful generalization capabilities. Experimental results demonstrate that our proposed DAWC achieves superior performance compared to existing techniques while remaining compatible with edge computing constraints. Additionally, we have developed a comprehensive diagnosis and visualization platform. The project webpage is https://github.com/tenderzada/BearingEdge.
Index Terms:
Mechanical Fault Diagnosis, Drift Detection, Continual Learning, Industrial Internet of Things, Edge Computing

I Introduction
The integration of the Industrial Internet of Things (IIoT) and its key technologies (e.g., reconfigurable intelligent surface [1]) into industrial operations has triggered a profound transformation in operational methodologies. IIoT harnesses interconnected sensors, devices, and machinery to capture real-time data, facilitating data-driven decision-making and predictive maintenance strategies [2, 3].
In industry, bearings and harmonic reducers are pivotal mechanical components that support and transmit rotary motions within mechanical systems. However, bearing failures due to prolonged operation and load-bearing can result in equipment downtime and production disruptions, leading to significant economic losses and safety hazards. In response to these challenges, there exists a growing demand for intelligent fault diagnosis (FD), early warning, and visualization of bearing faults [4, 5, 6].
In recent years, integrating edge computing [7] into the IIoT framework has offered a promising avenue for enhancing intelligent diagnostic systems. Edge computing involves processing and analyzing data at or near the data source, enabling real-time analysis and minimizing data transmission latency. Notably, the convergence of IIoT and edge computing renders the concept of "source-end detection" in mechanical FD both practical and effective.

The potential of intelligent FD methods has been showcased in various studies, yet a formidable challenge persists, namely data drift, which leads to a notable degradation in diagnosis accuracy [8]. Data drift, encompassing challenges such as domain incremental learning (where new operating domains emerge over time), changes in prior probabilities, and covariate drift (changes in the distribution of input data), hampers the deployment of FD models on edge devices [9]. In industrial settings, especially where data can be scarce or the environment rapidly evolves, this problem becomes particularly pronounced.
To effectively address data drift in mechanical fault diagnosis, Transfer Learning (TL) [10, 11] and Meta-Learning (ML) [12] are often employed. TL and ML tackle the data drift problem by transferring knowledge from previously learned tasks to new ones, which enhances model robustness and data efficiency, traits essential in resource-constrained or rapidly evolving industrial settings. These methodologies promote better generalization across diverse fault scenarios. Moreover, ML excels in rapid adaptation to new fault conditions, pushing real-time fault diagnosis closer to realization.
As illustrated in Fig. 1(a), TL involves adapting a shared backbone network for diverse tasks. However, a significant downside of TL is catastrophic forgetting, where the model, while learning new tasks, tends to forget the previously learned knowledge. This phenomenon is demonstrated in Fig. 2. On the other hand, as depicted in Fig. 1(b), ML showcases rapid task adaptation, albeit at a computational cost [13], which might hinder its application in time-sensitive scenarios.
Given the limitations of TL and ML in addressing data drift, exploring alternative or complementary strategies becomes imperative. Addressing these challenges head-on, we propose a novel approach that seamlessly adapts to dynamic data distributions. Our system, EdgeFD, couples a data drift detection module with a weight consolidation mechanism, enabling adaptation to drifting distributions without repeated full fine-tuning. Our contributions are as follows:
• We introduce a confidence-based data drift detection module that enables real-time monitoring and precise detection of abnormal data drift in bearing FD tasks.
• Our continual learning (CL) method based on weight consolidation effectively adapts to drifting data, facilitating incremental learning as new data emerge while efficiently adjusting only the drifted component.
• For situations with frequent data drift and a strong emphasis on long-term learning and knowledge retention, our Drift-Aware Weight Consolidation (DAWC) approach stands out as a superior solution.
• The EdgeFD system harnesses the benefits of edge computing, enabling bearing FD tasks to run on edge devices, minimizing dependence on cloud resources and data transfer latency, and ensuring real-time responsiveness.

II Problem Definition
In this section, we outline the problem’s core attributes, particularly focusing on the challenges faced when naively implementing solutions based solely on TL and ML.
II-A Problem Formulation
Mechanical equipment often exhibits drift in the underlying data distribution of its operating conditions. Factors such as varying rotational speeds, measured in rpm (revolutions per minute), or different loads, measured in hp (horsepower), can instigate such drifts.
As mechanical systems operate, the FD model is iteratively fine-tuned over a series of tasks, represented as $\{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$, where $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{N_t}$ is the labeled dataset of the $t$-th task, consisting of pairs of instances $x_i$ and their corresponding labels $y_i$. Assuming the most realistic situation, we consider the case where the task sequence is a task stream with an unknown arrival order, such that the model can access $\mathcal{D}_t$ only during the training period of task $t$; the data become inaccessible afterward. Given $\mathcal{D}_t$ and the model $\theta_{t-1}$ learned so far, the learning objective at task $t$ is as follows:

$$\theta_t = \arg\min_{\theta}\ \mathcal{L}(\theta; \mathcal{D}_t), \quad \theta \text{ initialized at } \theta_{t-1},$$

where $\mathcal{L}$ is the loss function that quantifies the difference between predictions made with the current model parameters $\theta$ and the ground-truth labels of task $t$. The role of $\theta_{t-1}$ is to provide an initial point for the optimization process, enabling the model to adapt quickly and effectively to the new task $t$.
To solve the optimization problem described above, various methods have been developed that leverage the knowledge contained in $\theta_{t-1}$ to enhance the learning process for the new task $t$. In the context of diagnosis model training, techniques based on TL and ML have been explored most widely.
II-B Transfer Learning and Meta Learning
Transfer Learning: The goal of TL is to speed up the learning of new tasks by sharing knowledge between different tasks. The optimization goal of TL can be expressed as minimizing empirical risk, for example using a cross-entropy loss function
$$\min_{\phi_t}\ \frac{1}{N_t} \sum_{i=1}^{N_t} \ell\big(f_{\phi_t}(h_{\psi}(x_i)), y_i\big), \tag{1}$$

where $f_{\phi_t}$ is the corresponding classifier function. The model parameter for the $t$-th task is the combination of the backbone network $\psi$ and the classifier $\phi_t$: $\theta_t = \{\psi, \phi_t\}$. The TL process freezes the backbone network $h_{\psi}$ and updates only the parameters of the classification head $\phi_t$. $\ell(\cdot, \cdot)$ is the loss function measuring the difference between the predicted output and the true labels $y_i$.
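As a concrete illustration, the following is a minimal PyTorch sketch of this freeze-and-retrain scheme. The attribute names `model.backbone` and `model.head` are our own placeholders for the actual network definition.

```python
import torch
import torch.nn as nn

def prepare_transfer_model(model: nn.Module, num_classes: int):
    """Freeze the shared backbone (psi) and re-initialize the head (phi_t)."""
    for p in model.backbone.parameters():
        p.requires_grad = False                        # backbone stays fixed
    in_features = model.head.in_features
    model.head = nn.Linear(in_features, num_classes)   # fresh classification head
    # Only the head's parameters are handed to the optimizer.
    return torch.optim.Adam(model.head.parameters(), lr=1e-3)
```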
Meta-Learning: The goal of ML is to learn the updating rules from multiple tasks, optimizing the learning process for subsequent tasks. A popular algorithm in this domain is Model-Agnostic Meta-Learning (MAML), where the optimization objective is
$$\min_{\theta}\ \sum_{t=1}^{T} \mathcal{L}\big(\theta_t'; \mathcal{Q}_t\big), \tag{2}$$

where $\theta_t' = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta; \mathcal{S}_t)$, and $\mathcal{L}(\theta_t'; \mathcal{Q}_t)$ denotes the model's prediction loss on task $t$. In (2), $\theta$ represents the global model parameters, $\mathcal{S}_t$ is the support set of the $t$-th task (used to fine-tune the model), $\mathcal{Q}_t$ is the query set (used to evaluate performance), and $\alpha$ is the inner-loop learning rate.
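To illustrate where the second-order cost discussed below comes from, here is a minimal sketch of one MAML meta-update, assuming PyTorch 2.x with `torch.func`; the task tuples and learning rates are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def maml_meta_step(model, tasks, meta_opt, inner_lr=0.01):
    """One MAML outer update over a batch of tasks, following (2)."""
    meta_loss = 0.0
    params = dict(model.named_parameters())
    for sx, sy, qx, qy in tasks:  # (support_x, support_y, query_x, query_y)
        # Inner loop: one gradient step on the support set S_t.
        s_loss = F.cross_entropy(torch.func.functional_call(model, params, (sx,)), sy)
        grads = torch.autograd.grad(s_loss, list(params.values()), create_graph=True)
        adapted = {k: v - inner_lr * g
                   for (k, v), g in zip(params.items(), grads)}
        # Outer objective: adapted parameters evaluated on the query set Q_t.
        meta_loss = meta_loss + F.cross_entropy(
            torch.func.functional_call(model, adapted, (qx,)), qy)
    meta_opt.zero_grad()
    meta_loss.backward()   # differentiates through the inner step
    meta_opt.step()
```

The `create_graph=True` flag forces differentiation through the inner update, which is precisely the expensive second-order computation noted below.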
While achieving certain successes, it is noteworthy that the existing TL-based FD methods often suffer from catastrophic forgetting (as shown in Fig. 2), leading to continuous model adaptation and increased computational burden on edge devices. Meanwhile, ML-based approaches struggle to handle sequential tasks and involve expensive second-order derivative computations. Moreover, these studies assume the learner’s awareness of task distribution changes, allowing the computation of a new posterior approximation. However, this assumption diverges from real-world IIoT scenarios encompassing gradual, sudden, and repeated drifts.
Therefore, data drift detection is essential before model adaptation, and its results can serve as a valuable cue for the adaptation itself. For example, the drift intensity helps to reduce the search space and ease the learning process [14]. Specifically, by dynamically adjusting the consolidation weights of previous parameters based on the drift detection results, we can effectively address the challenges posed by significant data drifts.
III Learning Methodology
In this section, we introduce a novel framework termed DAWC that effectively detects drift in data streams and employs weight consolidation for model fine-tuning. DAWC addresses the challenges faced by TL and ML methods while also reducing computational costs for edge devices.
III-A Data Drift Detection

Inspired by the CUSUM-type [15] detection method, we propose a sliding-window data drift detection method based on the Beta distribution, as illustrated in Fig. 3. In statistical modeling, the Beta distribution stands out when dealing with random variables bounded within a fixed interval, especially those representing proportions or probabilities. It is naturally defined on [0, 1], making it a prime choice for modeling the uncertainty of confidence scores. A key strength of the Beta distribution lies in its versatility: its shape can adapt from U-shaped to bell-shaped, contingent upon its parameters. In our context, the confidence scores from model predictions fall between 0 and 1, and as data drift manifests, the distribution of these scores may shift. By leveraging the adaptability of the Beta distribution, we can effectively capture and compare these shifts, making it a powerful tool for drift detection in our methodology.
Within a sliding window of length $n$, the tuple $(\hat{y}_i, c_i)$ denotes the model's predicted category and the corresponding confidence score for the input data sample $x_i$. The sliding window is partitioned into two segments: the reference window $W_r$, housing historical confidences, and the target window $W_t$, which contains the latest confidences. It is assumed that the confidence scores in the two windows follow different Beta distributions. When data drift occurs, the CNN model's classification accuracy declines, causing small confidence values to appear in the target window and creating a discrepancy between the two Beta distributions. To detect data drift, we compare the accumulated probability density function (PDF) values of the target-window confidences under these two Beta distributions.
Step 1. Estimate Beta Distribution Parameters: Using the Maximum Likelihood Estimation (MLE) method, we estimate the parameters $\alpha$ and $\beta$ of the Beta distribution. The PDF of the Beta distribution is given by

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \tag{3}$$

where

$$B(\alpha, \beta) = \int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}. \tag{4}$$
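A minimal sketch of Step 1 with SciPy, assuming confidence scores strictly inside (0, 1); fixing `floc` and `fscale` pins the support to [0, 1] so that only $\alpha$ and $\beta$ are estimated.

```python
import numpy as np
from scipy.stats import beta

def fit_beta(confidences):
    """MLE estimate of (alpha, beta) for confidence scores in (0, 1)."""
    eps = 1e-6
    c = np.clip(np.asarray(confidences, dtype=float), eps, 1.0 - eps)
    a, b, _, _ = beta.fit(c, floc=0, fscale=1)  # loc/scale fixed to [0, 1]
    return a, b
```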
Step 2. Calculate Dissimilarity Score: For each confidence in the target window, compute the dissimilarity between its PDF values under the two fitted Beta distributions. Sum these scores to obtain the cumulative dissimilarity score $S$.
Step 3. Detect Drift: If the cumulative dissimilarity score $S$ surpasses a predefined threshold $\tau$, data drift is confirmed. The threshold $\tau$ is the critical value used to determine the occurrence of data drift; it is inversely related to the sensitivity $\delta$, i.e., a smaller $\delta$ results in a higher threshold, making the drift detection more stringent.
Step 4. Determine Drift Location: Resize the sub-windows and shift the divider, recalculating the dissimilarity scores. If the new cumulative dissimilarity score is larger, it implies the new divider is closer to the drift point. This process continues until the maximum dissimilarity score is reached, pinpointing the position of the data drift occurrence.
Algorithm 1 provides a structured procedure for detecting data drift with the cumulative dissimilarity score. Given a sliding window $W$, a sensitivity to change $\delta$, a padding size $p$, and a maximum window size $n_{\max}$, the algorithm iterates through the window, evaluating the cumulative dissimilarity score against the threshold to determine whether, and at which point, a drift has occurred.
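Since Algorithm 1 is summarized rather than reproduced here, the sketch below illustrates Steps 2-4 under explicit assumptions: the per-sample dissimilarity is taken to be the absolute log-density gap between the two fitted Beta PDFs, and the threshold is an illustrative function of the sensitivity $\delta$; neither choice should be read as the paper's exact definitions. It reuses `fit_beta` from the previous sketch.

```python
import numpy as np
from scipy.stats import beta

def cumulative_dissimilarity(window, split):
    """Steps 1-2: fit Betas to the two sub-windows, then accumulate the
    per-sample dissimilarity of the target confidences under both fits."""
    ref, tgt = window[:split], window[split:]
    a_r, b_r = fit_beta(ref)
    a_t, b_t = fit_beta(tgt)
    eps = 1e-12
    t = np.clip(tgt, 1e-6, 1 - 1e-6)
    return np.sum(np.abs(np.log(beta.pdf(t, a_t, b_t) + eps)
                         - np.log(beta.pdf(t, a_r, b_r) + eps)))

def detect_and_locate_drift(window, delta=0.05, pad=25):
    """Steps 3-4: flag drift when the score exceeds the threshold, then
    shift the divider to find the split that maximizes the score."""
    window = np.asarray(window, dtype=float)
    best_split, best_score = None, -np.inf
    for split in range(pad, len(window) - pad):
        score = cumulative_dissimilarity(window, split)
        # Illustrative threshold: stricter (higher) as delta shrinks.
        threshold = np.log(1.0 / delta) * (len(window) - split)
        if score > threshold and score > best_score:
            best_split, best_score = split, score
    return best_split  # None means no drift confirmed
```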
III-B Weight Consolidation
Once the drift is identified using our approach, the next challenge is fine-tuning the model without compromising or forgetting the previously acquired knowledge. Zhang et al. [16] presented a CL method centered on data prototype replay to address this. While effective, this method comes with the downside of increased storage and computational costs. On the other hand, regularization, a tool traditionally employed to mitigate model overfitting, has garnered attention in the CL domain. Within this sphere, the chief avenues for regularization stem from parameter importance estimates [17, 18] and knowledge distillation [19].
Building upon these insights, we introduce our approach. Drawing from the Elastic Weight Consolidation (EWC) [17] methodology, we employ an approximate Bayesian CL strategy, aiming to seamlessly adapt models in continual learning scenarios. Let $\theta$ be the parameter vector, and consider the posterior of $\theta$ given the data of all tasks seen so far:

$$\log p(\theta \mid \mathcal{D}_{1:t}) = \log p(\mathcal{D}_t \mid \theta) + \log p(\theta \mid \mathcal{D}_{1:t-1}) - \log p(\mathcal{D}_t \mid \mathcal{D}_{1:t-1}). \tag{5}$$
Here, the factorization in (5) arises due to the conditional independence assumption of task data. However, computing the exact posteriors is challenging, leading to Laplace’s approximation:
$$p(\theta \mid \mathcal{D}_{1:t-1}) \approx \mathcal{N}\big(\theta;\, \theta_{t-1}^{*},\, F_{t-1}^{-1}\big), \tag{6}$$

with the mean centered at the maximum a posteriori parameter $\theta_{t-1}^{*}$ obtained when learning task $t-1$, and the precision given by the Fisher information matrix $F_{t-1}$ evaluated at $\theta_{t-1}^{*}$. This matrix approximates the Hessian of the negative log-likelihood, ensuring positive semidefiniteness.
The overall optimization objective is to minimize the empirical risk while simultaneously accounting for the significance of parameters from prior tasks through a regularization term:
$$\min_{\theta}\ \mathcal{L}(\theta; \mathcal{D}_t) + \frac{\lambda}{2} \sum_{i} F_{t-1,i} \big(\theta_i - \theta_{t-1,i}^{*}\big)^2, \tag{7}$$

where $\lambda$ is a hyperparameter that controls the penalty on important parameters from previous tasks, the second term is the importance-weighted regularization term for the previous task's parameters, and $\mathcal{D}_t$ represents the current task's data. This regularization mechanism effectively preserves prior knowledge while adapting to new tasks.
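A compact PyTorch sketch of this consolidation step, under the usual diagonal-Fisher approximation: the Fisher diagonal is estimated from squared log-likelihood gradients over the previous task's data, and the quadratic penalty of (7) is added to the new task's loss. All function and variable names are our own.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, loader):
    """Diagonal Fisher estimate at the current parameters theta*_{t-1}:
    average squared gradient of the log-likelihood over the task data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    n_batches = 0
    for x, y in loader:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_loss(model, new_task_loss, fisher, old_params, lam):
    """Objective (7): task loss plus importance-weighted quadratic penalty."""
    penalty = sum((fisher[n] * (p - old_params[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return new_task_loss + 0.5 * lam * penalty
```

Here `old_params` is a snapshot of the parameters after the previous task, so the penalty anchors important weights near $\theta_{t-1}^{*}$ while leaving unimportant ones free to adapt.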
IV Experiments
In this section, we present the experimental settings and evaluate the performances of our method by comparing them with baseline methods. Moreover, we conduct extensive ablation studies to provide a deeper understanding of our method.
Baseline methods. The chosen baseline methods represent common strategies in the domain of task-oriented learning:
• STL: Single-task learning, where each task is treated independently. This serves as the most basic comparison.
• FCB [20]: A naive TL approach that preserves the feature extraction capabilities of the backbone network while fine-tuning only the classification head.
• MAML [13]: An advanced meta-learning technique aimed at finding a model initialization conducive to rapid adaptation to new tasks.
IV-A Datasets and experimental details
Experimental Setup: All experiments are conducted on a system equipped with an NVIDIA RTX 3090 GPU, using PyTorch version 1.9.
Dataset: This study employs the publicly accessible CWRU Bearing Dataset [21], which covers 10 classification tasks. These tasks correspond to different failure sites (inner race, outer race, rolling body) and sizes (0.007 in, 0.014 in, 0.021 in). The dataset contains bearing data with failures under diverse loads (0 hp, 1 hp, 2 hp, 3 hp), reflecting variations in real-world data collection conditions. As a result, we segment the drifting tasks based on these distinct load conditions.
Backbone Network: Inspired by [22], we adopt the WDCNN as our core network, recognized for its distinct attributes: (1) a broad initial layer featuring wide ($64\times1$) convolutional kernels; (2) multiple compact convolutional kernels stacked after the wide kernels. The wide initial convolutional kernels capture broad features, while the stacked smaller kernels delve into finer details.
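For reference, a minimal sketch of such a wide-kernel-first 1-D CNN, assuming 2048-point vibration segments and 10 classes; the channel counts are illustrative and the sketch follows the pattern described above rather than reproducing the exact architecture of [22].

```python
import torch.nn as nn

class WDCNNSketch(nn.Module):
    """Illustrative WDCNN-style backbone for 1-D vibration signals."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            # Wide first layer: large kernel captures broad, low-frequency features.
            nn.Conv1d(1, 16, kernel_size=64, stride=16, padding=24),
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(2),
            # Stacked small kernels refine finer details.
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(), nn.AdaptiveMaxPool1d(4),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 4, num_classes))

    def forward(self, x):          # x: (batch, 1, 2048)
        return self.head(self.backbone(x))
```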
Training Setting: Of the total 5000 data samples, each load condition contributes 1250. We employ an 80-20 split strategy: 1000 samples from each load condition are used for model training, while the remaining 250 samples (per condition) serve as the test set to evaluate model performance.
Performance Metric: We use two metrics from [23] to evaluate algorithms when the number of tasks is large, i.e., Average Accuracy (AA) and Average Forgetting (AF).
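For concreteness, the sketch below computes both metrics from the task-accuracy matrix `acc`, where `acc[i][j]` is the accuracy on task j after training on task i, following the standard definitions in [23].

```python
import numpy as np

def aa_af(acc):
    """Average Accuracy and Average Forgetting from a T x T accuracy matrix."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    aa = acc[-1].mean()  # mean final accuracy over all tasks
    # Forgetting per task: best accuracy ever achieved minus final accuracy.
    af = float(np.mean([acc[:-1, j].max() - acc[-1, j] for j in range(T - 1)]))
    return aa, af
```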
IV-B Main Results
To ensure the stability and reliability of our results, we conduct three independent repetitions of each experiment and report the standard deviation across trials in parentheses.
Within our experiments, we focus on four distinct heterogeneous/drifting tasks. After completing training on 4000 samples, we aggregate the experimental outcomes in Table I. Encouragingly, our DAWC showcases remarkable performance: it achieves a notable 4.64% increase in accuracy over the current SOTA baseline method (i.e., MAML). DAWC not only excels in accuracy but also exhibits a significant advantage in forgetting rate, validating the efficacy of our proposed weight consolidation strategy.
TABLE I: Main results across the four drifting tasks (standard deviations in parentheses).

Method | Type | AA (%) | AF
---|---|---|---
STL | Baseline | 81.25 (± 0.63) | 0.22 (± 0.02)
FCB | Transfer Learning | 86.67 (± 0.72) | 0.16 (± 0.02)
MAML | Meta-Learning | 89.28 (± 0.16) | 0.14 (± 0.01)
DAWC (ours) | Continual Learning | 93.92 (± 0.15) | 0.07 (± 0.01)
Through Fig. 4, we can observe the trends in AA throughout the learning process for the four methods. Our DAWC approach demonstrates superior performance, notably in the face of multiple instances of data drift. This serves as evidence of the considerable potential of DAWC in handling heterogeneous tasks.

In summary, the DAWC method outperformed the existing baseline methods in terms of both accuracy and forgetting rate, thereby validating the effectiveness of the proposed weight consolidation strategy in handling drifting tasks.
IV-C Effectiveness of Core Designs
Effect of Drift Detection: We conduct experiments with fixed values of the detector parameters $\delta$, $p$, and $n_{\max}$. Each alteration in bearing load signifies a shift in the underlying data distribution, indicating data drift; these shifts occur at samples 1000, 2000, and 3000. In Fig. 5, the vertical dashed lines mark where our detection method flags a drift. We can see that all drifts are detected shortly after they theoretically occur.


Effect of Weight Consolidation: We evaluate the accuracy of the four methods for each task, as depicted in Fig. 6. It is observed that the accuracy steadily improves with fewer fluctuations as the training for new tasks progresses, highlighting the robustness of our method.
The effectiveness of our approach in mitigating forgetting can be attributed to efficient weight consolidation, which helps preserve previously learned information.
Computational Costs: Fig. 7 illustrates the accuracy variations for each task throughout the learning process for the three methods. When confronted with multiple instances of data drift, both TL and DAWC demonstrate a certain reduction in computational costs for edge devices compared to the STL method. However, our approach outperforms TL in this aspect.
This superiority arises because TL continually triggers its fine-tuning threshold, adjusting parameters to accommodate each data drift. In contrast, our DAWC scheme rapidly adapts to various data drift scenarios without repeatedly triggering model fine-tuning. As a result, it significantly alleviates computational expense and is hence edge-friendly.

V System Demo
Based on the framework of DAWC, we develop a system for bearing FD and alerting. The operational workflow of the entire system is depicted in Fig. 8, showcasing the major components and stages involved. This system leverages the power of DAWC to discern and classify 10 different types of bearing faults, returning corresponding confidence scores. Upon the detection of a bearing fault, the system promptly sends fault alert notifications to administrators, allowing for the timely initiation of necessary maintenance measures.

To provide users with a comprehensive understanding of the system’s status and diagnostic results, we design an intuitive visualization interface, as partially depicted in Fig. 9. At the top of the interface, distinct identifiers for various fault categories are presented. Specifically, “Nm” indicates “Normal”, “IR0.007” stands for “0.007-inch Inner Race Fault”, “OR0.021” represents “0.021-inch Outer Race Fault”, and “B0.014” means “0.014-inch Ball Fault”. Additionally, beneath each fault classification, the upper section of a rectangular bar displays the associated confidence score, allowing users to grasp the system’s confidence in each fault classification.
Regarding the system's handling of drift data, as showcased in Fig. 10, when the system's confidence on drifting data is comparatively low, it displays the label "Drift" accompanied by a red bar, ensuring that users can promptly recognize instances of data drift.

Through our meticulous design, we not only achieve significant progress in bearing FD and alerting but also provide users with the convenience of seamlessly monitoring and understanding the system’s operational status.

VI Conclusion
In this study, we have presented an edge-oriented method for mechanical fault diagnosis that incorporates drift detection and weight consolidation mechanisms. Distinctively, our method stands out for its significant reduction in computational overhead, especially in scenarios plagued by frequent data drift, offering a resource-efficient solution for edge devices. A prominent characteristic of this method is its simplicity, making it adaptable to various backbone network architectures. This attribute paves the way for leveraging more powerful pre-trained models in future large-scale edge-cloud collaborative settings. By combining this approach with advanced pre-trained models, we can further enhance the performance and scalability of fault diagnosis systems.
References
- [1] J. Xiao, J. Tang, and J. Chen, “Efficient radar detection for RIS-aided dual-functional radar-communication system,” in Proc. IEEE VTC2023-Spring, Florence, Italy, Jun. 2023, pp. 1–6.
- [2] B. Yin, J. Tang, and M. Wen, “Connectivity maximization in non-orthogonal network slicing enabled Industrial Internet-of-Things with multiple services,” IEEE Trans. Wireless Commun., vol. 22, no. 8, pp. 5642–5656, Aug. 2023.
- [3] Y. Zhao, D. Saxena, and J. Cao, “Memory-efficient domain incremental learning for Internet of Things,” in Proc. SenSys, Boston, MA, USA, Nov. 2022, pp. 1175–1181.
- [4] J. Chen, J. Tang, and W. Li, “Industrial edge intelligence: Federated-meta learning framework for few-shot fault diagnosis,” IEEE Trans. Network Sci. Eng., pp. 1–13, 2023, early access.
- [5] T. Wang, H. Liu, D. Guo, and X.-M. Sun, “Continual residual reservoir computing for remaining useful life prediction,” IEEE Trans. Ind. Inf., pp. 1–10, 2023, early access.
- [6] G. Chen, Y. Lu, and R. Su, “Interpretable fault diagnosis with shapelet temporal logic: Theory and application,” Automatica, vol. 142, p. 110350, May 2022.
- [7] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Trans. Commun., vol. 65, no. 8, pp. 3571–3584, Aug. 2017.
- [8] S. Bandyopadhyay, A. Datta, A. Pal, and S. R. R. Gadepally, “Intelligent continuous monitoring to handle data distributional changes for IoT systems,” in Proc. SenSys, Boston, MA, USA, Nov. 2022, pp. 1189–1195.
- [9] J. Hurtado, D. Salvati, R. Semola, M. Bosio, and V. Lomonaco, “Continual learning for predictive maintenance: Overview and challenges,” Intelligent Systems with Applications, p. 200251, Jun. 2023.
- [10] J. Chen, R. Huang, Z. Chen, W. Mao, and W. Li, “Transfer learning algorithms for bearing remaining useful life prediction: A comprehensive review from an industrial application perspective,” Mechanical Systems and Signal Processing, vol. 193, p. 110239, Mar. 2023.
- [11] J. Chen, D. Li, R. Huang, Z. Chen, and W. Li, “Aero-engine remaining useful life prediction method with self-adaptive multimodal data fusion and cluster-ensemble transfer regression,” Reliability Engineering & System Safety, vol. 234, p. 109151, Feb. 2023.
- [12] Y. Feng, J. Chen, J. Xie, T. Zhang, H. Lv, and T. Pan, “Meta-learning as a promising approach for few-shot cross-domain fault diagnosis: Algorithms, applications, and prospects,” Knowledge-Based Systems, vol. 235, p. 107646, Jan. 2022.
- [13] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, Sydney, Australia, Aug. 2017, pp. 1126–1135.
- [14] T. Lesort, M. Caccia, and I. Rish, “Understanding continual learning settings with data distribution drift analysis,” 2021. [Online]. Available: https://arxiv.org/abs/2104.01678
- [15] A. Haque, L. Khan, and M. Baron, "SAND: Semi-supervised adaptive novel class detection and classification over data stream," in Proc. AAAI, vol. 30, no. 1, Phoenix, AZ, USA, Feb. 2016.
- [16] L. Zhang, G. Gao, and H. Zhang, “Spatial-temporal federated learning for lifelong person re-identification on distributed edges,” IEEE Trans. Circuits Syst. Video Technol., 2023, early access.
- [17] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, Mar. 2017.
- [18] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” in Proc. ECCV, Munich, Germany, Sep. 2018, pp. 139–154.
- [19] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara, "Dark experience for general continual learning: A strong, simple baseline," in Proc. NeurIPS, vol. 33, Virtual, Dec. 2020, pp. 15920–15930.
- [20] T. Han, C. Liu, W. Yang, and D. Jiang, “Learning transferable features in deep convolutional neural networks for diagnosing unseen machine conditions,” ISA Transactions, vol. 93, pp. 341–353, Oct. 2019.
- [21] W. A. Smith and R. B. Randall, "Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study," Mechanical Systems and Signal Processing, vol. 64-65, pp. 100–131, Jun. 2015.
- [22] W. Zhang, G. Peng, C. Li, Y. Chen, and Z. Zhang, “A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals,” Sensors, vol. 17, no. 2, p. 425, Feb. 2017.
- [23] S. I. Mirzadeh, M. Farajtabar, R. Pascanu, and H. Ghasemzadeh, “Understanding the role of training regimes in continual learning,” in Proc. NeurIPS, vol. 33, Virtual, Dec. 2020, pp. 7308–7320.