Bayesian Deep Learning for Remaining Useful Life Estimation via Stein Variational Gradient Descent

Luca Della Libera, Jacopo Andreoli, Davide Dalle Pezze, Mirco Ravanelli, Gian Antonio Susto

Abstract

A crucial task in predictive maintenance is estimating the remaining useful life of physical systems. In the last decade, deep learning has improved considerably upon traditional model-based and statistical approaches in terms of predictive performance. However, in order to optimally plan maintenance operations, it is also important to quantify the uncertainty inherent to the predictions. This issue can be addressed by turning standard frequentist neural networks into Bayesian neural networks, which are naturally capable of providing confidence intervals around the estimates. Several methods exist for training those models. Researchers have focused mostly on parametric variational inference and sampling-based techniques, which notoriously suffer from limited approximation power and large computational burden, respectively. In this work, we use Stein variational gradient descent, a recently proposed algorithm for approximating intractable distributions that overcomes the drawbacks of the aforementioned techniques. In particular, we show through experimental studies on simulated run-to-failure turbofan engine degradation data that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform with respect to convergence speed and predictive performance both the same models trained via parametric variational inference and their frequentist counterparts trained via backpropagation. Furthermore, we propose a method to enhance performance based on the uncertainty information provided by the Bayesian models. We release the source code at https://github.com/lucadellalib/bdl-rul-svgd.

Keywords: Prognostics and health management, Remaining useful life, Bayesian deep learning, Stein variational gradient descent

PACS: 89.20.Ff

MSC: 68T05

Journal: Engineering Applications of Artificial Intelligence

Affiliations:

[inst1] Concordia University, Gina Cody School of Engineering and Computer Science, 1455 Boul. de Maisonneuve Ouest, Montreal, Quebec H3G 1M8, Canada

[inst2] Università degli Studi di Padova, Dipartimento di Ingegneria dell’Informazione, Via Gradenigo 6/b, 35131 Padova, Veneto, Italy

[inst3] Mila-Quebec AI Institute, 6666 Rue Saint-Urbain, Montreal, Quebec H2S 3H1, Canada

1 Introduction

Predictive maintenance is a strategy that optimizes maintenance activities based on real-time monitoring of machinery health conditions. It has become increasingly successful in recent years as a result of its efficacy in eliminating needless interventions and enhancing the reliability of equipment [1]. One of the key tasks in predictive maintenance is estimating the remaining useful life (RUL) of physical systems, i.e. the available time before a failure occurs in one of the system’s components. RUL prediction has traditionally been performed using model-based and statistical methods [2]. While the former characterize the deterioration process by means of a mathematical model of the underlying failure mechanism, the latter use existing failure data to fit a statistical model without depending on any physical principle. Artificial intelligence techniques have lately experienced a surge in popularity thanks to their ability to learn deterioration patterns directly from observations. Deep learning in particular has emerged as a powerful approach for RUL estimation, outperforming traditional prognostic algorithms [3]. It has gained extensive traction and adoption across a diverse array of domains such as aviation [4, 5, 6, 7, 8, 9, 10, 11, 12, 13], semiconductor manufacturing [14, 3], transport [15], and machine tools [14]. Although deep learning models can offer reasonably precise estimates, achieving $100\%$ accuracy is unlikely. Moreover, errors in RUL estimation do not all have the same impact: for example, overestimating the RUL could lead to unexpected breakdowns, while underestimating it could result in unnecessary maintenance [16]. Users may want to tune the behavior of predictive maintenance tools based on various factors, whose importance can change over time, such as the availability of operators and spare parts, the presence of maintenance plans associated with other tasks in the line, production plans, and deadlines associated with product delivery.
Thus, in order to make informed decisions on how to schedule maintenance operations, it is critical to quantify the uncertainty inherent to the predictions. A standard way to achieve that is to adopt a Bayesian viewpoint and treat the model’s parameters as random variables, whose prior distribution is updated into a posterior distribution according to Bayes’ rule after observing the data. Unfortunately, exactly computing the posterior is often infeasible when dealing with deep learning models. Various approximate Bayesian inference schemes such as Bayes by Backprop [17], Monte Carlo dropout [18], and Markov chain Monte Carlo [19] have therefore been employed, with promising results in many practical problems, including RUL estimation [20, 21, 22, 23, 24, 25]. However, those techniques present limitations: either they constrain inference to a fixed family of distributions (multivariate normal with diagonal covariance matrix in Bayes by Backprop and Bernoulli in Monte Carlo dropout), or they are computationally intensive and therefore not scalable to big data (Markov chain Monte Carlo).

Stein variational gradient descent [26] is a recent gradient-based variational inference algorithm for approximating intractable distributions that overcomes the drawbacks of the aforementioned approaches. In particular, differently from Monte Carlo dropout and Bayes by Backprop, it is more expressive, as it does not constrain the posterior to a fixed family of distributions, and differently from Markov chain Monte Carlo, it is more efficient, as it is amenable to batch optimization and parallelization. Furthermore, being sampling-free and easily adaptable to the geometry of the target space, it is more stable and converges faster. Despite its success in generative modeling [27] and deep reinforcement learning [28], it is still underexplored, and, to the best of our knowledge, it has not been applied to RUL estimation yet. As a novel contribution to the field, we show through an experimental study on simulated run-to-failure turbofan engine degradation data that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform in terms of convergence speed and predictive performance both the same models trained via Bayes by Backprop, which is the de-facto standard for training large scale Bayesian neural networks [29], and their frequentist (i.e. non-Bayesian) counterparts trained via backpropagation. Additionally, we propose a refreshingly simple yet effective heuristic to enhance the performance based on the uncertainty information provided by the Bayesian models. To foster reproducibility and promote further research in the field, we release the source code at https://github.com/lucadellalib/bdl-rul-svgd.

The remainder of this paper is structured as follows. In Sec. 2 we summarize related work in frequentist and Bayesian deep learning for RUL prediction. In Sec. 3 we introduce our method. In Sec. 4 we describe the experimental setup. In Sec. 5 we present and discuss the experimental results. Lastly, in Sec. 6 we draw the conclusions and outline future work directions.

2 Related Work

In the last decade, considerable effort has been dedicated to advancing the state of the art of RUL estimation through deep learning. One of the earliest successful works is the one by Tian [4], who developed a dense neural network model to predict the RUL of a group of pumps based on historical vibration data collected from bearings. Despite achieving a relatively low error rate, extensive preprocessing was necessary to extract relevant information from the data due to the weak inductive bias of dense layers, which complicates the learning of temporal and spatial correlations. To address the issue of modeling temporal correlations, Gugulothu et al. [5] utilized recurrent neural networks for better capturing the deterioration patterns in the multivariate trajectories, obtaining state-of-the-art performance on the turbofan engine degradation dataset from the IEEE PHM 2008 data challenge [30]. Zheng et al. [6] later improved upon Gugulothu et al.’s method through long short-term memory networks [31]. Other researchers focused instead on extracting spatial correlations between different sensor measurements by means of convolutional neural networks. Babu et al. [7] introduced a novel convolutional neural network architecture for RUL prediction that outperformed a dense neural network, a support vector machine, and a relevance vector machine on NASA’s Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset [32, 33]. Li et al. [8] further reduced the error rate compared to both Zheng et al. and Babu et al. with a deeper convolutional neural network architecture regularized via dropout [34] and trained through the adaptive learning rate optimizer Adam [35]. An attempt to combine the best of both recurrent neural networks and convolutional neural networks was carried out by Jayasinghe et al. [9], with promising results for data obtained from complex environments.
More recently, transformer-based methods [36], widely used in natural language processing and computer vision, have been employed in predictive maintenance too with excellent results [10, 11, 37]. Other techniques such as variational autoencoders [38] and restricted Boltzmann machines [39] have been applied as well in a semi-supervised fashion to reduce the amount of labeled data required to train the model without compromising performance [12, 13].

All the aforementioned studies contributed significantly to improving the state of the art of RUL estimation; however, none of them properly addressed the problem of quantifying the uncertainty associated with the predictions. It is only in the last three years that Bayesian deep learning has started to draw the attention of the research community. For example, Kraus et al. [20] developed a combined approach consisting of a non-parametric explicit lifetime model, a Bayesian linear model, and a Bayesian recurrent neural network trained via Bayes by Backprop that achieved competitive performance on the C-MAPSS dataset with the additional advantage of being interpretable and providing confidence intervals around predictions. Another Bayesian method is that of Peng et al. [21], who used Bayes by Backprop and Monte Carlo dropout to train a Bayesian multiscale convolutional neural network and a Bayesian bidirectional long short-term memory network on the ball bearing dataset from the IEEE PHM 2012 data challenge [40, 41] and the C-MAPSS dataset, respectively. Experimental results showed superior performance compared to their frequentist counterparts trained via backpropagation. Huang et al. [22] later proposed a Bayesian deep learning framework for capturing both the uncertainty in the model and the noise in the data. In particular, they assumed the RUL to follow a normal distribution parameterized by a Bayesian dense neural network trained via Bayes by Backprop. A similar setup was adopted by Caceres et al. [23], who replaced the model by Huang et al. with a Bayesian recurrent neural network trained via both Monte Carlo dropout and Bayes by Backprop with flipout [42]. Li et al. [24] further extended Huang et al.’s work by experimenting with different RUL distributions such as normal, logistic, and Weibull [43], parameterized by a Bayesian gated recurrent unit network [44] trained via Monte Carlo dropout.
Furthermore, they presented a sequential Bayesian boosting procedure to enhance the predictive accuracy, which was validated on real-world circuit breaker run-to-failure data.

Although not as popular as Bayes by Backprop and Monte Carlo dropout, Markov chain Monte Carlo was also used in predictive maintenance. Benker et al. [25] recently trained dense and convolutional Bayesian neural networks via Hamiltonian Monte Carlo [45] and Bayes by Backprop on the C-MAPSS dataset with some success. Moreover, they leveraged the uncertainty information obtained from the Bayesian models to further reduce the error rate. Despite improving over vanilla Markov chain Monte Carlo, such an algorithm still suffers from efficiency issues in big data settings due to the high computational cost of simulating Hamiltonian dynamics. On the contrary, the method we utilize – Stein variational gradient descent – is both expressive like Markov chain Monte Carlo and efficient like backpropagation.

3 Methodology

3.1 Bayesian Neural Networks

In frequentist neural networks uncertainty is totally disregarded: weights are assumed to be deterministic and their optimal values are learned directly from data $\mathcal{D}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{N}$ via maximum likelihood (or maximum a posteriori when a regularization term is included) estimation. On the contrary, in Bayesian neural networks weights are assumed to be random variables with a prior distribution $p(\mathbf{w})$ that explicitly encodes domain knowledge about the task at hand, and a posterior distribution $p(\mathbf{w}\mid\mathcal{D})$ , i.e. the conditional distribution that results from updating $p(\mathbf{w})$ with information from the likelihood $p(\mathcal{D}\mid\mathbf{w})$ [29]. Following Bayes’ rule, the posterior can be calculated as:

$$p(\mathbf{w}\mid\mathcal{D})=\frac{p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})}, \tag{1}$$

where $p(\mathcal{D})=\int_{\mathbf{w^{\prime}}}p(\mathcal{D}\mid\mathbf{w^{\prime}})p(\mathbf{w^{\prime}})d\mathbf{w^{\prime}}$ is known as the evidence. Knowing the posterior enables us to determine the posterior predictive distribution $p(\hat{y}\mid\hat{\mathbf{x}},\mathcal{D})$ , i.e. the distribution of output $\hat{y}$ as a function of input $\hat{\mathbf{x}}$ , which is calculated as the weighted mean of the likelihoods of the infinite ensemble of neural networks induced by $p(\mathbf{w}\mid\mathcal{D})$ [17]:

$$\begin{aligned}
p(\hat{y}\mid\hat{\mathbf{x}},\mathcal{D}) &= \mathbb{E}_{p(\mathbf{w}\mid\mathcal{D})}[p(\hat{y}\mid\hat{\mathbf{x}},\mathbf{w})] \\
&= \int_{\mathbf{w}}p(\hat{y}\mid\hat{\mathbf{x}},\mathbf{w})\,p(\mathbf{w}\mid\mathcal{D})\,d\mathbf{w}.
\end{aligned} \tag{2}$$

In practice, we can approximate it using a finite number of Monte Carlo samples from the posterior and calculate useful statistics such as mean, variance, skewness, etc. In particular, the mean or the mode of the posterior predictive distribution is typically used as a prediction. Moreover, we can quantify the uncertainty in the neural network’s weights (commonly referred to as epistemic), which is captured by a fraction of the predictive variance and can be lowered by collecting more data [46]. A further advantage of adopting a Bayesian perspective is that, since a prior is imposed on $\mathbf{w}$ , regularization is naturally provided and overfitting is less of a concern even in low data regimes.
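As a minimal sketch of this Monte Carlo approximation (not the released implementation), suppose each of $M$ weight samples has already produced a point prediction for each of $N$ test inputs. The function name `predictive_stats` and the toy numbers are illustrative assumptions. The spread across weight samples computed here reflects only the epistemic part of the predictive variance.

```python
import numpy as np


def predictive_stats(sampled_predictions):
    """Summarize Monte Carlo predictions of shape (M, N).

    Row m holds the predictions of the network obtained with the m-th
    weight sample (hypothetical input layout, chosen for illustration).
    """
    preds = np.asarray(sampled_predictions, dtype=float)
    mean = preds.mean(axis=0)        # prediction: mean over weight samples
    std = preds.std(axis=0, ddof=1)  # spread across weight samples
    return mean, std


# Made-up example: M = 4 weight samples, N = 2 test inputs.
toy_predictions = [[100.0, 50.0], [110.0, 50.0], [90.0, 50.0], [100.0, 50.0]]
mu, sigma = predictive_stats(toy_predictions)
```

In this toy case the four networks disagree on the first input and agree on the second, so only the first prediction carries a nonzero spread.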

Even though in theory $p(\mathbf{w}\mid\mathcal{D})$ can be derived via exact Bayesian inference using Eq. 1, this is mostly impossible due to the high dimensionality of $\mathbf{w}$ and a functional form of the neural network not amenable to integration [17]. A possible solution is to approximate the posterior through variational inference [47], a technique that substitutes the potentially multimodal and/or heavy-tailed true posterior with a simpler surrogate distribution.

3.2 Bayes by Backprop

Bayes by Backprop [48, 17] is an iterative gradient-based variational inference approach that assumes the surrogate posterior to be a parametric distribution $q(\mathbf{w}\mid{\bm{\theta}})$ from a tractable family (typically a multivariate normal with diagonal covariance matrix). The goal is to learn the optimal parameters ${\bm{\theta}}^{\star}$ that minimize the Kullback-Leibler divergence (KL) of $q(\mathbf{w}\mid{\bm{\theta}})$ from $p(\mathbf{w}\mid\mathcal{D})$:

$$\begin{aligned}
{\bm{\theta}}^{\star} &= \arg\min_{\bm{\theta}}\text{KL}[q(\mathbf{w}\mid{\bm{\theta}})\mid\mid p(\mathbf{w}\mid\mathcal{D})] \\
&= \arg\min_{\bm{\theta}}\int_{\mathbf{w}}q(\mathbf{w}\mid{\bm{\theta}})\log\frac{q(\mathbf{w}\mid{\bm{\theta}})}{p(\mathbf{w})p(\mathcal{D}\mid\mathbf{w})}\,\textrm{d}\mathbf{w} \\
&= \arg\min_{\bm{\theta}}\text{KL}\left[q(\mathbf{w}\mid{\bm{\theta}})\mid\mid p(\mathbf{w})\right]-\mathbb{E}_{q(\mathbf{w}\mid{\bm{\theta}})}\left[\log p(\mathcal{D}\mid\mathbf{w})\right].
\end{aligned} \tag{3}$$

The resulting loss function is the negative of the evidence lower bound (ELBO), also known as the variational free energy (VFE). It makes explicit the trade-off between staying close to the simple prior (complexity loss) and fitting the complexity of the data (likelihood loss) [49]. It is denoted as

$$\mathcal{F}(\mathcal{D},{\bm{\theta}})=\underbrace{\text{KL}\left[q(\mathbf{w}\mid{\bm{\theta}})\mid\mid p(\mathbf{w})\right]}_{\text{complexity loss}}-\underbrace{\mathbb{E}_{q(\mathbf{w}\mid{\bm{\theta}})}\left[\log p(\mathcal{D}\mid\mathbf{w})\right]}_{\text{likelihood loss}}. \tag{4}$$

Exactly minimizing this loss is computationally infeasible in many cases [17]. Bayes by Backprop approximates it via Monte Carlo sampling:

$$\mathcal{F}(\mathcal{D},{\bm{\theta}})\approx\sum_{i=1}^{M}\left[\log q(\mathbf{w}_{i}\mid{\bm{\theta}})-\log p(\mathbf{w}_{i})-\log p(\mathcal{D}\mid\mathbf{w}_{i})\right], \tag{5}$$

where $M$ denotes the number of Monte Carlo samples and $\mathbf{w}_{i}$ the $i$ -th Monte Carlo sample drawn from $q(\mathbf{w}\mid{\bm{\theta}})$ using the reparameterization trick [38]. Gradient-based optimization can be used to minimize Eq. 5 and learn the optimal parameters of the surrogate posterior.
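To make Eq. 5 concrete, the following sketch evaluates the Monte Carlo loss for a toy linear model. It is an illustration under stated assumptions, not the authors' code: the model $y=\mathbf{x}^{\top}\mathbf{w}$, the Gaussian likelihood with unit noise, and all function names are hypothetical. The surrogate posterior is a diagonal normal whose standard deviation is parameterized through softplus, and samples are drawn with the reparameterization trick.

```python
import numpy as np


def softplus(x):
    return np.log1p(np.exp(x))


def log_normal(x, mean, std):
    """Sum of elementwise log-densities of N(x | mean, std^2)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi) - np.log(std)
                  - 0.5 * ((x - mean) / std) ** 2)


def bbb_loss_estimate(X, y, mu, rho, prior_std=0.1, noise_std=1.0, M=10, seed=0):
    """Monte Carlo estimate of Eq. 5 (a sum over M samples, as written there).

    Assumptions for illustration: linear model y = X @ w and a Gaussian
    likelihood with standard deviation `noise_std`.
    """
    rng = np.random.default_rng(seed)
    sigma = softplus(rho)                  # strictly positive standard deviation
    total = 0.0
    for _ in range(M):
        eps = rng.standard_normal(mu.shape)
        w = mu + sigma * eps               # reparameterization trick
        log_q = log_normal(w, mu, sigma)   # log q(w_i | theta)
        log_prior = log_normal(w, 0.0, prior_std)
        log_lik = log_normal(y, X @ w, noise_std)
        total += log_q - log_prior - log_lik
    return total


# Made-up data generated by known weights.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_true = np.array([0.1, -0.05])
y = X @ w_true
rho = np.full(2, -3.0)
loss_near = bbb_loss_estimate(X, y, mu=w_true, rho=rho)
loss_far = bbb_loss_estimate(X, y, mu=np.array([5.0, 5.0]), rho=rho)
```

A surrogate centered on the generating weights yields a lower loss than one centered far from them. In an actual training loop the same quantity would be differentiated with respect to `mu` and `rho` by an automatic differentiation framework.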

3.3 Stein Variational Gradient Descent

Stein variational gradient descent [26] is an iterative gradient-based variational inference algorithm that assumes the surrogate posterior $q(\mathbf{w}\mid{\{\mathbf{w}_{i}\}_{i=1}^{M}})$ to be a non-parametric distribution represented by a finite set of $M$ randomly initialized particles $\{\mathbf{w}_{i}\}_{i=1}^{M}$ that are gradually evolved toward the true posterior $p(\mathbf{w}\mid{\mathcal{D}})$ through a sequence of transforms. It has a simple form that is similar to standard gradient ascent, and equivalent to it when $M=1$ (in this case it yields the maximum a posteriori estimate). As a result, it is very adaptable and scalable and readily usable in combination with other optimization techniques such as stochastic gradient ascent, momentum, and adaptive learning rates (e.g. Adam [35]). Let $\bm{\tau}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{D}$ denote the transform applied to the particles at the $l$ -th iteration. The goal is to choose $\bm{\tau}$ such that it maximally reduces the Kullback-Leibler divergence between the true posterior $p(\mathbf{w}\mid{\mathcal{D}})$ and the transformed surrogate posterior $q_{\bm{\tau}}(\mathbf{w}\mid{\{\mathbf{w}_{i}\}_{i=1}^{M}})$ :

$$\bm{\tau}^{\star}=\arg\min_{\bm{\tau}}\text{KL}[q_{\bm{\tau}}(\mathbf{w}\mid{\{\mathbf{w}_{i}\}_{i=1}^{M}})\mid\mid p(\mathbf{w}\mid{\mathcal{D}})]. \tag{6}$$

In order to do so, it is necessary to impose a few constraints on its functional form. First, we require $\bm{\tau}$ to be invertible, such that the density of the transformed variable can be easily computed via the change of variables formula. To ensure that, we define it as a small perturbation of the identity map:

$$\bm{\tau}(\mathbf{w})=\mathbf{w}+\epsilon\bm{\phi}(\mathbf{w}), \tag{7}$$

where $\epsilon\in\mathbb{R}$ denotes the perturbation magnitude and $\bm{\phi}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{D}$ the perturbation direction. If $\lvert\,\epsilon\,\rvert$ is sufficiently small and $\bm{\phi}$ is smooth, then $\bm{\tau}$ is invertible. Second, we restrict $\bm{\phi}$ to be within the unit ball of a $D$ -dimensional reproducing kernel Hilbert space [50] induced by a positive-definite kernel $k:\mathbb{R}^{D}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ . Given these conditions, the following closed-form expression holds for the direction of steepest descent of $\text{KL}[q_{\bm{\tau}}(\mathbf{w}\mid{\{\mathbf{w}_{i}\}_{i=1}^{M}})\mid\mid p(\mathbf{w}\mid{\mathcal{D}})]$ at the $l$ -th iteration:

$$\begin{aligned}
\bm{\phi}^{\star}(\mathbf{w}^{(l)}_{i}) &= \frac{1}{M}\sum_{j=1}^{M}\underbrace{k(\mathbf{w}^{(l)}_{j},\mathbf{w}^{(l)}_{i})\nabla_{\mathbf{w}^{(l)}_{j}}\log\hat{p}(\mathbf{w}^{(l)}_{j}\mid\mathcal{D})}_{\text{driving force}} \\
&+ \frac{1}{M}\sum_{j=1}^{M}\underbrace{\nabla_{\mathbf{w}^{(l)}_{j}}k(\mathbf{w}^{(l)}_{j},\mathbf{w}^{(l)}_{i})}_{\text{repulsive force}},
\end{aligned} \tag{8}$$

where $\hat{p}(\mathbf{w}\mid\mathcal{D})=p({\mathcal{D}}\mid\mathbf{w})p(\mathbf{w})$ denotes the unnormalized weight posterior. The first term represents a driving force that guides the particles toward high probability regions of $p(\mathbf{w}\mid{\mathcal{D}})$ , and the second a repulsive force that prevents the particles from collapsing into local modes of $p(\mathbf{w}\mid{\mathcal{D}})$ . The updated weights are then calculated as

$$\mathbf{w}^{(l+1)}_{i}=\mathbf{w}^{(l)}_{i}+\epsilon^{(l)}\bm{\phi}^{\star}(\mathbf{w}^{(l)}_{i}), \tag{9}$$

where $\epsilon^{(l)}$ denotes the perturbation magnitude at the $l$-th iteration. It is straightforward to derive a corresponding loss function from Eq. 8 and apply standard gradient-based optimization to learn the optimal particle-based approximation of the true posterior.
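The update of Eqs. 8 and 9 can be sketched in a few lines of NumPy. This is an illustrative toy, not the BayesTorch implementation: the target is a one-dimensional Gaussian with mean $2$ standing in for the weight posterior, a constant step size replaces Adam, and the bandwidth follows one common form of the median heuristic (median squared pairwise distance divided by $\log(M+1)$), which is an assumption here.

```python
import numpy as np


def rbf_kernel(W):
    """RBF kernel matrix and its gradient with respect to the first argument.

    W has shape (M, D). Returns K[j, i] = k(w_j, w_i) and
    grad_K[j, i] = gradient of k(w_j, w_i) with respect to w_j.
    """
    diff = W[:, None, :] - W[None, :, :]            # diff[j, i] = w_j - w_i
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Median-based bandwidth (assumed form); small constant avoids division by zero.
    h = np.median(sq_dist) / np.log(W.shape[0] + 1) + 1e-8
    K = np.exp(-sq_dist / h)
    grad_K = -2.0 / h * diff * K[:, :, None]
    return K, grad_K


def svgd_step(W, grad_log_p, step_size):
    """One particle update following Eqs. 8 and 9."""
    K, grad_K = rbf_kernel(W)
    G = grad_log_p(W)                 # row j: gradient of log p_hat at w_j
    driving = K.T @ G                 # sum_j k(w_j, w_i) * G_j
    repulsive = grad_K.sum(axis=0)    # sum_j grad_{w_j} k(w_j, w_i)
    phi = (driving + repulsive) / W.shape[0]
    return W + step_size * phi


def grad_log_gaussian(W, mean=2.0):
    """Score of a unit-variance Gaussian centered at `mean` (toy target)."""
    return -(W - mean)


rng = np.random.default_rng(0)
particles = rng.standard_normal((20, 1))   # M = 20 particles, D = 1
for _ in range(1000):
    particles = svgd_step(particles, grad_log_gaussian, step_size=0.05)
```

With a single particle the kernel equals $1$ and its gradient vanishes, so `svgd_step` reduces to a plain gradient ascent step on the log-posterior, in line with the $M=1$ case discussed above.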

3.4 Uncertainty-Informed Predictions

The Bayesian theoretical framework offers a natural way to reason about uncertainty in predictions, which is crucial for informed decision-making in real-world applications. For example, if the RUL estimate provided by the model is highly uncertain (i.e. the predictive variance is large), a machine operator might opt to preemptively schedule maintenance before reaching the predicted threshold, since the costs associated with unforeseen equipment breakdowns generally exceed the expenses incurred through underutilization. We incorporate this conservative strategy directly into the prognostic procedure by proposing a simple heuristic that helps prevent late predictions. Given a test sample $\hat{\mathbf{x}}$, let $\mu(\hat{{y}})$ and $\sigma(\hat{{y}})$ denote the mean and the standard deviation of the posterior predictive distribution $p(\hat{y}\mid\hat{\mathbf{x}},\mathcal{D})$, respectively. Assuming that $\mu(\hat{{y}})$ represents the RUL estimate (as mentioned in Sec. 3.1, the mode of the posterior predictive distribution could be an alternative), we propose the following modification:

$$\mu^{\ast}(\hat{{y}})=\mu(\hat{{y}})-p_{\text{late}}\,k\,\sigma(\hat{{y}}), \tag{10}$$

where $\mu^{\ast}(\hat{{y}})$ is the corrected RUL estimate, $p_{\text{late}}\in[0,1]$ the probability of late prediction, and $k>0$ a hyperparameter that controls the correction strength. If $p_{\text{late}}=0$, then the model is risk-averse and no correction is necessary. Conversely, if $p_{\text{late}}=1$, the model is risk-seeking, hence a correction of $k\,\sigma(\hat{{y}})$ is applied in full. The probability of late prediction $p_{\text{late}}$ can be estimated on a subset of held-out data $\tilde{\mathcal{D}}=\{(\tilde{\mathbf{x}}_{i},\tilde{y}_{i})\}_{i=1}^{\tilde{N}}$ as

$$p_{\text{late}}=\frac{1}{\tilde{N}}\sum_{i=1}^{\tilde{N}}\mathds{1}(\mu(\hat{\tilde{{y}}}_{i})>\tilde{y}_{i}), \tag{11}$$

where $\mathds{1}(\cdot)$ denotes the indicator function. It is worth noting that our method can be easily adjusted as new data are acquired by reestimating $p_{\text{late}}$ on such data. This adaptability could also prove beneficial in addressing scenarios characterized by domain shift.
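The heuristic of Eqs. 10 and 11 amounts to a few lines of code. The sketch below uses made-up held-out values and hypothetical function names. It only illustrates the arithmetic.

```python
def late_probability(predicted_means, true_ruls):
    """Eq. 11: fraction of held-out samples whose predicted RUL exceeds the true RUL."""
    late = sum(1 for mu, y in zip(predicted_means, true_ruls) if mu > y)
    return late / len(true_ruls)


def corrected_estimate(mu, sigma, p_late, k=1.0):
    """Eq. 10: shift the estimate toward earlier predictions."""
    return mu - p_late * k * sigma


# Made-up held-out predictions and targets: 3 of the 4 predictions are late.
held_out_mu = [100.0, 80.0, 60.0, 40.0]
held_out_y = [90.0, 85.0, 50.0, 30.0]
p_late = late_probability(held_out_mu, held_out_y)

# Made-up test prediction with mean 70 and standard deviation 8.
mu_star = corrected_estimate(mu=70.0, sigma=8.0, p_late=p_late, k=1.0)
```

Here $p_{\text{late}}=0.75$, so a prediction of $70$ cycles with a standard deviation of $8$ is corrected to $70-0.75\cdot 1\cdot 8=64$ cycles.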

4 Experimental Setup

4.1 Dataset

The proposed Bayesian methods are evaluated using the simulated turbofan engine degradation data from NASA’s publicly accessible C-MAPSS dataset (https://www.nasa.gov/content/prognostics-center-of-excellence-data-set-repository) [32, 33], which comprises $4$ subsets of multivariate trajectories from $21$ sensors. Every subset includes a training set and a test set. The training set consists of run-to-failure sensor recordings of several engines acquired under various operating conditions and fault modes. The engines are assumed to be healthy at first; however, their initial wear and manufacturing variations are unknown. They deteriorate over time until a system failure occurs, with the last data point corresponding to the time cycle when the unit fails. In contrast, sensor records in the test set end some time prior to the failure. The objective is to estimate the RUL of each engine in the test set. The true RUL values of the test engines are provided for performance evaluation. Each subset has $26$ columns: engine number, time cycle, $3$ operational settings (which define the operating condition), and $21$ sensor measurements. Detailed information about the $4$ subsets, identified as FD001, FD002, FD003, and FD004, is shown in Tab. 1.

4.2 Preprocessing

Feature selection

In FD001 and FD003, sensors $1$ , $5$ , $6$ , $10$ , $16$ , $18$ , and $19$ are constant throughout the engine’s lifespan, hence they do not provide any meaningful information for predicting the RUL. Furthermore, FD001 and FD003 are subjected to a single operating condition. As a result, in FD001 and FD003 only $14$ of the $21$ sensors are utilized as input features, with indices $2$ , $3$ , $4$ , $7$ , $8$ , $9$ , $11$ , $12$ , $13$ , $14$ , $15$ , $17$ , $20$ and $21$ , while the $3$ operational settings are ignored. In contrast, there are no constant measurements in FD002 and FD004, and the existence of $6$ operating conditions makes it more difficult to detect deterioration patterns. Therefore, in FD002 and FD004 all the $21$ sensors and the $3$ operational settings are employed as input features.

Normalization

The collected sensor data and operational settings are normalized to be within the interval $\left[-1,1\right]$ by means of min-max normalization:

$$x_{ij}^{norm}=\frac{2(x_{ij}-x_{j}^{min})}{x_{j}^{max}-x_{j}^{min}}-1,\ \ \forall\ i,j, \tag{12}$$

where $x_{ij}$ denotes feature $j$ at time step $i$ , $x_{ij}^{norm}$ the normalized value of $x_{ij}$ , $x_{j}^{max}$ the maximum value of feature $j$ across time steps of all trajectories and $x_{j}^{min}$ the minimum value.
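A minimal NumPy sketch of Eq. 12 follows. The function names are hypothetical, and computing the statistics once on a reference array and reusing them elsewhere is an assumption of this example. A constant feature would make the denominator zero, which is one reason constant sensors have to be removed beforehand.

```python
import numpy as np


def fit_min_max(reference):
    """Per-feature minimum and maximum over all time steps (rows)."""
    reference = np.asarray(reference, dtype=float)
    return reference.min(axis=0), reference.max(axis=0)


def min_max_normalize(x, x_min, x_max):
    """Eq. 12: rescale each feature to the interval [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0


# Made-up data: 3 time steps, 2 features.
toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
x_min, x_max = fit_min_max(toy)
toy_norm = min_max_normalize(toy, x_min, x_max)
```

Each column of `toy_norm` runs from $-1$ at the feature minimum to $1$ at the feature maximum.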

Sliding window segmentation

We can generally extract more information from a multivariate trajectory by examining the temporal sequence data as opposed to only considering the individual data points at each time step. To retain the deterioration patterns hidden in the time dimension, the sliding window segmentation method originally proposed by Babu et al. [7] is used to partition the trajectory into overlapping segments of fixed size. Let $T$ denote the window size and $F$ the number of features. At each time step, all past sensor records within the window are gathered into a $T\times F$ matrix (see Fig. 1) such that from a trajectory of length $L$ exactly $L-T+1$ segments are extracted. If $L<T$, the trajectory is discarded. The resulting segments are then labeled with the RUL of the last data point in the window and given as input to the deep learning models described in Sec. 4.3.

Figure 1: Min-max normalized sample from FD001 test set with $T=30$ and $F=14$.
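The segmentation step can be sketched as follows. This is an illustrative NumPy version with hypothetical names, assuming each window ends at the current time step and is labeled with the RUL of its last row.

```python
import numpy as np


def sliding_windows(trajectory, rul, window_size):
    """Split an (L, F) trajectory into L - T + 1 overlapping (T, F) segments.

    Each segment is labeled with the RUL of its last time step.
    Trajectories shorter than the window are discarded (empty output).
    """
    trajectory = np.asarray(trajectory, dtype=float)
    rul = np.asarray(rul, dtype=float)
    length, num_features = trajectory.shape
    if length < window_size:
        return np.empty((0, window_size, num_features)), np.empty((0,))
    segments = np.stack([
        trajectory[end - window_size:end]
        for end in range(window_size, length + 1)
    ])
    labels = rul[window_size - 1:]
    return segments, labels


# Made-up trajectory with L = 5 time steps, F = 2 features, window T = 3.
L, T, F = 5, 3, 2
toy_trajectory = np.arange(L * F, dtype=float).reshape(L, F)
toy_rul = np.arange(L - 1, -1, -1)     # 4, 3, 2, 1, 0
segments, labels = sliding_windows(toy_trajectory, toy_rul, T)
```

With $L=5$ and $T=3$ the function returns $L-T+1=3$ segments with labels $2$, $1$, and $0$.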

Target rectification

A popular technique in the literature [5, 6, 7, 8, 25] is to use a piece-wise linear degradation model in which the RUL target function is assumed to be constant until a threshold value, $R_{early}$ , beyond which it linearly decreases to $0$ . From an implementation point of view, this means rectifying the targets, i.e. setting the RUL to $R_{early}$ for all samples whose RUL is larger than $R_{early}$ . The intuition behind it is that a system generally operates correctly at first and starts to degrade only after a certain amount of wear. Following the aforementioned literature, we set $R_{early}=125$ for both the training and the test set. Hyperparameters of the preprocessing phase are reported in Tab. 2.
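As a small illustration of the rectified targets, the sketch below builds the piece-wise linear RUL curve for one run-to-failure trajectory. The function name is hypothetical, and the convention that the last cycle has a RUL of $0$ is an assumption of this example.

```python
def rectified_rul(length, r_early=125):
    """Piece-wise linear RUL targets for a run-to-failure trajectory.

    Cycle t (1-indexed) has raw RUL `length - t`, so the last cycle has
    RUL 0 (assumed convention); values above `r_early` are clipped.
    """
    return [min(length - t, r_early) for t in range(1, length + 1)]


# Made-up engine that fails after 200 cycles.
targets = rectified_rul(200)
```

For this engine the target stays at $125$ for the first $75$ cycles and then decreases by one per cycle until it reaches $0$ at failure.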

Table 1: C-MAPSS dataset.

	C-MAPSS
	FD001	FD002	FD003	FD004
Training trajectories	$100$	$260$	$100$	$249$
Test trajectories	$100$	$259$	$100$	$248$
Operating conditions	$1$	$6$	$1$	$6$
Fault modes	$1$	$1$	$2$	$2$

Table 2: Hyperparameters of the preprocessing phase.

	C-MAPSS
	FD001	FD002	FD003	FD004
$T$	$30$	$20$	$30$	$15$
$F$	$14$	$24$	$14$	$24$
Training samples	$17731$	$48819$	$21820$	$57763$
Test samples	$100$	$259$	$100$	$248$
$R_{early}$	$125$	$125$	$125$	$125$

4.3 Deep Learning Models

Table 3: Hyperparameters of the training algorithms. $\mathbf{0}$ denotes the origin in $\mathbb{R}^{D}$ and $\mathds{I}$ the identity matrix in $\mathbb{R}^{D\times D}$, where $D$ is the number of weights in the neural network.

Hyperparameter	Value
Backpropagation
Number of epochs	$50$
Batch size	$512$
Negative log-likelihood	$\text{Huber}_{\delta=100}$ [51]
Optimizer	Adam [35]
Learning rate	$0.01$
Learning rate decay epoch	$40$
Learning rate decay factor	$0.1$
Late prediction correction factor	$1$
Dropout [34] probability (drop)	$0.2$
Weight initialization	Kaiming uniform [52]
Bayes by Backprop
Prior	$\mathcal{N}(\mathbf{0},\mathds{I}\cdot 0.01)$
Surrogate posterior	$\mathcal{N}(\mathbf{0},\mathds{I}\cdot\mathrm{softplus}^{2}(1))$
Number of Monte Carlo samples	$10$
Stein variational gradient descent
Prior	$\mathcal{N}(\mathbf{0},\mathds{I}\cdot 0.01)$
Number of particles	$10$
Kernel	Radial basis function (a)

(a) Bandwidth is chosen using the median-based heuristic by Liu et al. [26].

The goal of this work is to explore the use of Stein variational gradient descent to train Bayesian deep learning models for RUL estimation. More specifically, we investigate whether it converges faster and yields better predictive accuracy than a competitive baseline such as Bayes by Backprop. To this end, we implement the two simple yet effective neural network architectures proposed by Benker et al. [25], train them under identical experimental conditions with both methods, and compare their performance. For the sake of completeness, we also consider backpropagation as a simpler baseline for training the frequentist counterparts of the selected models, which are defined as follows:

1. Dense3 (D3): dense neural network architecture originally proposed by Benker et al. The $T\times F$ input matrix, with $T$ denoting the window size and $F$ the number of features, is flattened into a $1$-dimensional vector, which is then fed to $3$ consecutive dense layers with $100$ neurons each. A final output neuron returns the RUL prediction. After each dense hidden layer, a sigmoid activation function is applied.
2. Conv2Pool2 (C2P2): convolutional neural network architecture originally proposed by Babu et al. [7] and later used by Benker et al. as well. The $T\times F$ input matrix is fed to a $5\times 14$ convolutional layer with $8$ channels followed by a $2\times 1$ average pooling layer. A second $2\times 1$ convolution with $14$ channels is applied, followed by another $2\times 1$ average pooling. The output is flattened into a $1$-dimensional vector and a final output neuron returns the RUL prediction. After each convolutional hidden layer, a sigmoid activation function is applied.
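To give a sense of the size of the two architectures, the following sketch works out the number of trainable parameters of Dense3 and the length of the flattened feature vector of Conv2Pool2 for the FD001 setting ($T=30$, $F=14$). The text does not state padding or strides, so this example assumes no padding, unit stride for convolutions, and a pooling stride equal to the pooling size. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution or pooling along one axis."""
    return (size - kernel) // stride + 1


def dense3_param_count(T, F, hidden=100):
    """Weights and biases of three hidden dense layers plus one output neuron."""
    sizes = [T * F, hidden, hidden, hidden, 1]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


def c2p2_flat_features(T, F):
    """Length of the flattened vector fed to the output neuron of Conv2Pool2.

    Assumed: no padding, stride 1 for convolutions, stride 2 for 2x1 pooling.
    """
    h, w = conv_out(T, 5), conv_out(F, 14)   # 5x14 convolution, 8 channels
    h = conv_out(h, 2, stride=2)             # 2x1 average pooling
    h = conv_out(h, 2)                       # 2x1 convolution, 14 channels
    h = conv_out(h, 2, stride=2)             # 2x1 average pooling
    return 14 * h * w


d3_params_fd001 = dense3_param_count(T=30, F=14)
c2p2_features_fd001 = c2p2_flat_features(T=30, F=14)
```

Under these assumptions Dense3 has $62401$ parameters on FD001, while the convolutional stack of Conv2Pool2 reduces the $30\times 14$ input to $84$ features before the output neuron.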

All models are trained for $50$ epochs using the Huber loss [51] as the negative log-likelihood, with the residual threshold $\delta$ arbitrarily set to $100$. The Huber loss is less sensitive to outliers than the plain mean squared error; it is the overall cost to minimize for frequentist models and only part of it for Bayesian ones (see Eq. 5 and Sec. 3.3). For each epoch, training samples and their corresponding targets are randomly split into batches of size $512$ and fed to the model. The neural network's weights are updated based on the gradient of the loss function computed batchwise by means of the Adam optimizer [35] with a learning rate of $0.01$, reduced by a factor of $10$ after $40$ epochs for fine-tuning. For frequentist variants trained via backpropagation, weights are initialized according to Kaiming's uniform scheme [52] and dropout with a drop probability of $0.2$ is applied to avoid overfitting. For Bayesian variants trained via Bayes by Backprop, prior and surrogate posterior are modeled as zero-mean multivariate normal distributions with diagonal covariance matrices and per-dimension initial standard deviations equal to $0.1$ and $\mathrm{softplus}(\rho)=\ln(1+e^{\rho})$ with $\rho=1$, respectively. Note the use of the softplus reparameterization as per reference [17] to ensure a strictly positive standard deviation during training. The same prior is used for Bayesian variants trained via Stein variational gradient descent. In this case, the surrogate posterior is approximated by $10$ particles, initialized by drawing that number of samples from the prior. As a kernel, we employ a radial basis function with bandwidth chosen using the median-based heuristic by Liu and Wang [26]. As mentioned in Sec. 3.4, we use the mean of the posterior predictive distribution as the RUL estimate. For Bayesian models, we correct the prediction according to Eq. 10, where we set the correction factor $k=1$. The probability of late prediction $p_{\text{late}}$ is computed on the training set (since the C-MAPSS dataset does not include a validation set) based on Eq. 11. Hyperparameters of the training algorithms are summarized in Tab. 3.
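
To make two of the training settings concrete, the sketch below implements the standard per-sample Huber loss with $\delta=100$ and the step learning-rate schedule in plain Python. The function names, the zero-based epoch index and the absence of any batch reduction are assumptions for illustration; the actual experiments rely on the PyTorch-based code released with the paper.

```python
def huber_loss(residual, delta=100.0):
    """Standard Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def learning_rate(epoch, base_lr=0.01, drop_epoch=40, factor=10.0):
    """Step schedule: base_lr, divided by `factor` from `drop_epoch` onwards.

    `epoch` is assumed to be zero-based, so epochs 0-39 use base_lr.
    """
    return base_lr if epoch < drop_epoch else base_lr / factor


# Small residual: identical to half the squared error
print(huber_loss(20.0))      # 200.0
# Large residual: grows linearly instead of quadratically
print(huber_loss(150.0))     # 10000.0, versus 11250.0 for 0.5 * r**2
print(learning_rate(39), learning_rate(40))
```

With a threshold as large as $100$, only residuals exceeding $100$ cycles are treated linearly, so the loss behaves like half the squared error for most samples and dampens the influence of the largest errors only.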

4.4 Performance Metrics

For performance evaluation, we calculate root mean squared error (RMSE), mean absolute error (MAE) and the score function (Score) proposed by Saxena et al. [33]:

\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{d_{i}^{2}}},

(13)

\text{MAE}=\frac{1}{N}\sum_{i=1}^{N}{\lvert d_{i}\rvert},

(14)

\text{Score}=\sum_{i=1}^{N}{\left(e^{s_{i}}-1\right)},\ \ \text{where}\ \ s_{i}=\begin{cases}-\frac{d_{i}}{13},&\text{for}\ \ d_{i}<0\\ \hphantom{-}\frac{d_{i}}{10},&\text{for}\ \ d_{i}\geq 0\end{cases},

(15)

where $N$ is the number of test samples and $d_{i}$ the error, i.e. the difference between estimated and true RUL values for the $i$-th test sample. Good models should achieve low values on all three metrics. As shown in Fig. 2, RMSE and MAE penalize early and late predictions equally, while the asymmetric score function penalizes late predictions ($d_{i}>0$) more than early ones ($d_{i}<0$). Minimizing the score is crucial because late predictions often have more catastrophic consequences than early ones: maintenance is planned too late. Nonetheless, it is useful to monitor RMSE and MAE too, since the score is particularly sensitive to outliers due to the exponentiation.
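
As an illustration of Eqs. 13–15, the following minimal sketch computes the three metrics in plain Python and compares an early and a late prediction of the same magnitude. The function name `rul_metrics` and the toy values are assumptions for illustration only.

```python
import math


def rul_metrics(y_pred, y_true):
    """Return (RMSE, MAE, Score) for estimated and true RUL values."""
    # d_i = estimated RUL - true RUL; d_i > 0 is a late prediction
    errors = [p - t for p, t in zip(y_pred, y_true)]
    n = len(errors)
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    mae = sum(abs(d) for d in errors) / n
    # Asymmetric exponential penalty: divisor 13 if early, 10 if late
    score = sum(math.exp(-d / 13 if d < 0 else d / 10) - 1 for d in errors)
    return rmse, mae, score


# Same absolute error of 10 cycles, once early and once late
early = rul_metrics([90.0], [100.0])
late = rul_metrics([110.0], [100.0])
print(early)  # RMSE = MAE = 10, score approximately 1.16
print(late)   # RMSE = MAE = 10, score approximately 1.72
```

Both predictions have identical RMSE and MAE, while the late one receives a score roughly $50\%$ higher, which is the asymmetry described above.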

4.5 Implementation and Hardware

Software for the experimental evaluation was implemented in Python 3.8.13. In particular, we used NumPy 1.23.4 (https://github.com/numpy/numpy/tree/v1.23.4) [53] to preprocess the data, PyTorch 1.12.1 (https://github.com/pytorch/pytorch/tree/v1.12.1) [54] to implement the deep learning models and the training loops, BayesTorch 0.0.1 (https://github.com/lucadellalib/bayestorch/tree/v0.0.1) to implement the Bayesian inference algorithms, Ray AIR 2.0.1 (https://github.com/ray-project/ray/tree/ray-2.0.1) [55, 56] for experiment execution, and Matplotlib 3.6.2 (https://github.com/matplotlib/matplotlib/tree/v3.6.2) [57] and Seaborn 0.12.1 (https://github.com/mwaskom/seaborn/tree/v0.12.1) [58] for plotting. All the experiments were run on an Ubuntu 20.04.5 LTS machine with an Intel i7-10875H CPU ($8$ cores @ $2.30$ GHz), $32$ GB of RAM and an NVIDIA GeForce RTX $3070$ GPU ($8$ GB) with CUDA Toolkit 11.3.1.

5 Experimental Results and Discussion

Experimental results for D3 and C2P2 trained via backpropagation (BP), Bayes by Backprop (BBB) and Stein variational gradient descent (SVGD) are reported in Tab. 4. They show that Stein variational gradient descent yields lower errors than the other two approaches in most cases, with D3-SVGD achieving the lowest RMSE and MAE on all subsets and the lowest score on all subsets except FD004. In particular, as illustrated in Fig. 3, Stein variational gradient descent tends to produce a narrower posterior and a narrower posterior predictive distribution, with mean closer to the true RUL value, compared to Bayes by Backprop. The lower variance in the predictions is a direct consequence of the fact that the neural network is more certain about the values of its weights, suggesting that Stein variational gradient descent converged faster and to a more accurate solution than Bayes by Backprop. This is consistent with the generally lower standard deviations of the performance metrics of D3-SVGD and C2P2-SVGD, a further indication of a less brittle optimization process.

Table 4: Mean and standard deviation (std) of the performance metrics on the rectified test set averaged over $10$ random seeds ($0$–$9$) of Dense3 (D3) and Conv2Pool2 (C2P2) trained via backpropagation (BP), Bayes by Backprop (BBB) and Stein variational gradient descent (SVGD). Metrics followed by “$\ast$” are computed using the uncertainty-based heuristic described in Sec. 3.4. The best values for each model are highlighted in bold. The best overall values are framed.

Subset	Metric	D3-BP		D3-BBB		D3-SVGD		C2P2-BP		C2P2-BBB		C2P2-SVGD	
		mean	std	mean	std	mean	std	mean	std	mean	std	mean	std
FD001	RMSE	$14.25$	$0.55$	$14.33$	$0.57$	$\mathbf{13.17}$	$0.14$	$17.48$	$0.06$	$22.29$	$0.61$	$\mathbf{17.35}$	$0.09$
	MAE	$10.09$	$0.37$	$10.27$	$0.28$	$\mathbf{9.55}$	$0.15$	$\mathbf{12.94}$	$0.08$	$15.76$	$0.51$	$12.98$	$0.06$
	Score	$394$	$43$	$435$	$94$	$\mathbf{334}$	$16$	$677$	$18$	$4882$	$763$	$\mathbf{648}$	$21$
	RMSE$^{\ast}$	$-$	$-$	$14.18$	$0.38$	$\mathbf{13.03}$	$0.12$	$-$	$-$	$21.62$	$0.57$	$\mathbf{17.31}$	$0.09$
	MAE$^{\ast}$	$-$	$-$	$10.14$	$0.23$	$\mathbf{9.36}$	$0.13$	$-$	$-$	$15.27$	$0.42$	$\mathbf{12.93}$	$0.05$
	Score$^{\ast}$	$-$	$-$	$369$	$60$	$\mathbf{318}$	$13$	$-$	$-$	$3738$	$803$	$\mathbf{639}$	$20$
FD002	RMSE	$38.92$	$9.62$	$19.23$	$0.25$	$\mathbf{18.27}$	$0.40$	$19.55$	$0.23$	$19.86$	$0.32$	$\mathbf{19.26}$	$0.23$
	MAE	$33.14$	$9.11$	$14.77$	$0.21$	$\mathbf{13.81}$	$0.29$	$15.43$	$0.18$	$15.43$	$0.19$	$\mathbf{15.17}$	$0.18$
	Score	$60994$	$29513$	$2743$	$235$	$\mathbf{2259}$	$319$	$2699$	$136$	$3400$	$734$	$\mathbf{2470}$	$139$
	RMSE$^{\ast}$	$-$	$-$	$19.28$	$0.24$	$\mathbf{18.60}$	$0.37$	$-$	$-$	$19.85$	$0.30$	$\mathbf{19.24}$	$0.23$
	MAE$^{\ast}$	$-$	$-$	$14.82$	$0.21$	$\mathbf{14.07}$	$0.27$	$-$	$-$	$15.50$	$0.19$	$\mathbf{15.20}$	$0.19$
	Score$^{\ast}$	$-$	$-$	$2430$	$181$	$\mathbf{2034}$	$268$	$-$	$-$	$3139$	$617$	$\mathbf{2356}$	$129$
FD003	RMSE	$14.97$	$2.53$	$15.97$	$0.81$	$\mathbf{12.33}$	$0.24$	$\mathbf{19.75}$	$0.40$	$22.79$	$0.79$	$19.99$	$0.28$
	MAE	$10.56$	$2.15$	$11.90$	$0.51$	$\mathbf{8.53}$	$0.15$	$\mathbf{14.60}$	$0.33$	$16.70$	$0.51$	$14.81$	$0.16$
	Score	$625$	$407$	$725$	$198$	$\mathbf{307}$	$20$	$\mathbf{1430}$	$144$	$3233$	$590$	$1491$	$111$
	RMSE$^{\ast}$	$-$	$-$	$15.31$	$0.46$	$\mathbf{12.13}$	$0.21$	$-$	$-$	$21.95$	$0.68$	$\mathbf{19.88}$	$0.28$
	MAE$^{\ast}$	$-$	$-$	$11.53$	$0.26$	$\mathbf{8.45}$	$0.14$	$-$	$-$	$16.19$	$0.44$	$\mathbf{14.74}$	$0.16$
	Score$^{\ast}$	$-$	$-$	$561$	$136$	$\mathbf{287}$	$17$	$-$	$-$	$2587$	$501$	$\mathbf{1447}$	$112$
FD004	RMSE	$36.20$	$10.68$	$21.77$	$0.13$	$\mathbf{21.00}$	$0.31$	$22.88$	$0.24$	$22.98$	$0.43$	$\mathbf{22.07}$	$0.15$
	MAE	$29.72$	$10.04$	$16.40$	$0.11$	$\mathbf{15.51}$	$0.18$	$17.66$	$0.23$	$17.52$	$0.31$	$\mathbf{16.71}$	$0.22$
	Score	$83040$	$61583$	$\mathbf{6320}$	$732$	$6648$	$1119$	$7310$	$850$	$\mathbf{6537}$	$1052$	$6577$	$782$
	RMSE$^{\ast}$	$-$	$-$	$21.66$	$0.12$	$\mathbf{21.02}$	$0.30$	$-$	$-$	$22.87$	$0.40$	$\mathbf{22.00}$	$0.16$
	MAE$^{\ast}$	$-$	$-$	$16.29$	$0.11$	$\mathbf{15.49}$	$0.21$	$-$	$-$	$17.47$	$0.30$	$\mathbf{16.64}$	$0.22$
	Score$^{\ast}$	$-$	$-$	$\mathbf{5657}$	$582$	$5830$	$1061$	$-$	$-$	$\mathbf{6050}$	$920$	$6141$	$687$

Despite the differences between Bayes by Backprop and Stein variational gradient descent, we observe that Bayesian deep learning models trained with these methods generally perform better than their frequentist counterparts, especially D3-BP, which fails to achieve good performance on FD002 and FD004. This stabilizing effect is likely due to the implicit ensemble learning property that characterizes Bayesian neural networks. However, contrary to Benker et al. [25], we did not find Bayes by Backprop to clearly improve over backpropagation with respect to the score, which might be attributable to a different hyperparameter configuration.

Regarding the comparison between different subsets, we notice that the predictive performance varies drastically, with a substantial drop in FD002 and FD004 for all the configurations. This is in line with expectations since detecting deterioration patterns is more difficult in those subsets due to the presence of $6$ instead of $1$ operating conditions. A performance gap exists between D3 and C2P2 too. In fact, the latter performs slightly worse than D3 on the more complex FD002 and FD004, and significantly worse on the less challenging FD001 and FD003. This might be due to overfitting or to a suboptimal neural network architecture. The evaluation of more competitive deep learning models is beyond the scope of this research and will be addressed in future work.

Lastly, we emphasize that our uncertainty-based heuristic consistently improves the score metric across all models. For instance, in the case of D3-SVGD, the score on FD001 improves from $334$ to $318$ . While a similar improvement is frequently observed for RMSE and MAE, it is less consistent, reflecting the fact that our method is primarily designed to mitigate the risk associated with late predictions rather than targeting error reduction.
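
To quantify the effect of the heuristic for one configuration, the short sketch below computes the relative reduction of the mean score of D3-SVGD on each subset, using only the mean values reported in Tab. 4. The helper name `relative_reduction` is an assumption for illustration, and the result is a ratio of seed-averaged means rather than a per-seed statistic.

```python
# (Score, Score*) means for D3-SVGD, copied from Tab. 4
d3_svgd_scores = {
    "FD001": (334, 318),
    "FD002": (2259, 2034),
    "FD003": (307, 287),
    "FD004": (6648, 5830),
}


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


reductions = {
    subset: relative_reduction(score, score_star)
    for subset, (score, score_star) in d3_svgd_scores.items()
}

for subset, value in reductions.items():
    print(f"{subset}: {value:.1f}% lower mean score with the heuristic")
```

For D3-SVGD this gives reductions of roughly $4.8\%$, $10.0\%$, $6.5\%$ and $12.3\%$ on FD001 to FD004, respectively.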

6 Conclusions and Future Work

In this work, Stein variational gradient descent was successfully applied to the task of RUL estimation for the first time. Bayesian dense and convolutional neural networks were trained using this method and compared to both the same models trained via Bayes by Backprop and their frequentist counterparts trained via backpropagation. Experiments on the simulated run-to-failure turbofan engine degradation data of the C-MAPSS dataset showed that Bayesian deep learning models trained via Stein variational gradient descent consistently outperformed the other two approaches. Furthermore, a heuristic that exploits the uncertainty information provided by the Bayesian models to reduce the risk of late predictions and enhance performance was proposed. However, further investigations are necessary to harness the full potential of this technique. First, the impact of the number of particles and the choice of the kernel should be analyzed. Second, state-of-the-art transformer-based architectures should be included in the comparative study. Lastly, extensions to Stein variational gradient descent such as neural variational gradient descent [59], which eliminates the need for a kernel, could be explored.

CRediT Author Statement

Luca Della Libera: Conceptualization, Investigation, Methodology, Software, Validation, Visualization, Writing - Original Draft.
Jacopo Andreoli: Investigation, Methodology, Software, Validation, Visualization, Writing - Original Draft.
Davide Dalle Pezze: Methodology, Validation, Writing - Review & Editing.
Mirco Ravanelli: Funding acquisition, Resources, Supervision, Writing - Review & Editing.
Gian Antonio Susto: Funding acquisition, Resources, Supervision, Writing - Review & Editing.

Declaration of Competing Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

We thank Francesco Paissan for valuable mathematical discussions, and Domenico Lopez and Amrit Singh for software testing. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Digital Research Alliance of Canada (alliancecan.ca). For GA Susto, this study was partially carried out within the MICS (Made in Italy – Circular and Sustainable) Extended Partnership and received funding from Next-GenerationEU (Italian PNRR – M4 C2, Invest 1.3 – D.D. 1551.11-10-2022, PE00000004). Moreover, this study was also partially carried out within the PNRR research activities of the consortium iNEST (Interconnected Nord-Est Innovation Ecosystem) funded by Next-GenerationEU (Italian PNRR – M4 C2, Invest 1.5 – D.D. 1058.23-06-2022, ECS00000043).

References

[1] Y. Lei, N. Li, L. Guo, N. Li, T. Yan, J. Lin, Machinery health prognostics: A systematic review from data acquisition to RUL prediction, Mechanical Systems and Signal Processing (2018) 799–836.
[2] A. Jardine, D. Lin, D. Banjevic, A review on machinery diagnostics and prognostics implementing condition-based maintenance, Mechanical Systems and Signal Processing (2006) 1483–1510.
[3] L. Lorenti, D. Dalle Pezze, J. Andreoli, C. Masiero, N. Gentner, Y. Yang, G. A. Susto, Predictive maintenance in the industry: A comparative study on deep learning-based remaining useful life estimation, in: IEEE International Conference on Industrial Informatics (INDIN), 2023, pp. 1–9.
[4] Z. Tian, An artificial neural network method for remaining useful life prediction of equipment subject to condition monitoring, Journal of Intelligent Manufacturing (2012) 227–237.
[5] N. Gugulothu, T. Vishnu, P. Malhotra, L. Vig, P. Agarwal, G. M. Shroff, Predicting remaining useful life using time series embeddings based on recurrent neural networks, International Journal of Prognostics and Health Management (2020).
[6] S. Zheng, K. Ristovski, A. Farahat, C. Gupta, Long short-term memory network for remaining useful life estimation, in: IEEE International Conference on Prognostics and Health Management (ICPHM), 2017, pp. 88–95.
[7] G. Sateesh Babu, P. Zhao, X.-L. Li, Deep convolutional neural network based regression approach for estimation of remaining useful life, in: Database Systems for Advanced Applications, 2016, pp. 214–228.
[8] X. Li, Q. Ding, J.-Q. Sun, Remaining useful life estimation in prognostics using deep convolution neural networks, Reliability Engineering & System Safety (2018) 1–11.
[9] L. Jayasinghe, T. Samarasinghe, C. Yuen, J. C. Ni Low, S. Sam Ge, Temporal convolutional memory networks for remaining useful life estimation of industrial machinery, in: IEEE International Conference on Industrial Technology (ICIT), 2019, pp. 915–920.
[10] Y. Mo, Q. Wu, X. Li, B. Huang, Remaining useful life estimation via transformer encoder enhanced by a gated convolutional unit, Journal of Intelligent Manufacturing (2021) 1997–2006.
[11] L. Liu, X. Song, Z. Zhou, Aircraft engine remaining useful life estimation via a double attention-based data-driven architecture, Reliability Engineering & System Safety (2022) 108330.
[12] A. S. Yoon, T. Lee, Y. Lim, D. Jung, P. Kang, D. Kim, K. Park, Y. Choi, Semi-supervised learning with deep generative models for asset failure prediction, in: KDD17 Workshop on Machine Learning for Prognostics and Health Management, 2017.
[13] A. L. Ellefsen, E. Bjørlykhaug, V. Æsøy, S. Ushakov, H. Zhang, Remaining useful life predictions for turbofan engine degradation using semi-supervised deep architecture, Reliability Engineering & System Safety (2019) 240–251.
[14] G. A. Susto, A. Beghi, Dealing with time-series data in predictive maintenance problems, in: IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), 2016, pp. 1–4.
[15] O. Fink, E. Zio, U. Weidmann, A classification framework for predicting components’ remaining useful life based on discrete-event diagnostic data, IEEE Transactions on Reliability (2015) 1049–1056.
[16] G. A. Susto, A. Schirru, S. Pampuri, S. McLoone, A. Beghi, Machine learning for predictive maintenance: A multiple classifier approach, IEEE Transactions on Industrial Informatics (2015) 812–820.
[17] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, in: International Conference on Machine Learning (ICML), 2015, pp. 1613–1622.
[18] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning (ICML), 2016, pp. 1050–1059.
[19] R. M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag, 1996.
[20] M. Kraus, S. Feuerriegel, Forecasting remaining useful life: Interpretable deep learning approach via variational Bayesian inferences, Decision Support Systems (2019) 113100.
[21] W. Peng, Z.-S. Ye, N. Chen, Bayesian deep-learning-based health prognostics toward prognostics uncertainty, IEEE Transactions on Industrial Electronics (2020) 2283–2293.
[22] D. Huang, R. Bai, S. Zhao, P. Wen, S. Wang, S. Chen, Bayesian neural network based method of remaining useful life prediction and uncertainty quantification for aircraft engine, in: IEEE International Conference on Prognostics and Health Management (ICPHM), 2020, pp. 1–8.
[23] J. Caceres, D. Gonzalez, T. Zhou, E. L. Droguett, A probabilistic Bayesian recurrent neural network for remaining useful life prognostics considering epistemic and aleatory uncertainties, Structural Control and Health Monitoring (2021) e2811.
[24] G. Li, L. Yang, C.-G. Lee, X. Wang, M. Rong, A Bayesian deep learning RUL framework integrating epistemic and aleatoric uncertainties, IEEE Transactions on Industrial Electronics (2021) 8829–8841.
[25] M. Benker, L. Furtner, T. Semm, M. F. Zaeh, Utilizing uncertainty information in remaining useful life estimation via Bayesian neural networks and Hamiltonian Monte Carlo, Journal of Manufacturing Systems (2021) 799–807.
[26] Q. Liu, D. Wang, Stein variational gradient descent: A general purpose Bayesian inference algorithm, in: International Conference on Neural Information Processing Systems (NeurIPS), 2016, pp. 2378–2386.
[27] Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, L. Carin, VAE learning via Stein variational gradient descent, in: International Conference on Neural Information Processing Systems (NeurIPS), 2017.
[28] Y. Liu, P. Ramachandran, Q. Liu, J. Peng, Stein variational policy gradient, in: Conference on Uncertainty in Artificial Intelligence, 2017.
[29] L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, M. Bennamoun, Hands-on Bayesian neural networks - a tutorial for deep learning users, IEEE Computational Intelligence Magazine (2022) 29–48.
[30] A. Saxena, K. Goebel, PHM08 challenge data set (2008).
[31] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. (1997) 1735–1780.
[32] A. Saxena, K. Goebel, Turbofan engine degradation simulation data set (2008).
[33] A. Saxena, K. Goebel, D. Simon, N. Eklund, Damage propagation modeling for aircraft engine run-to-failure simulation, in: International Conference on Prognostics and Health Management, 2008, pp. 1–9.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research (2014) 1929–1958.
[35] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2015.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: International Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 6000–6010.
[37] D. Dalle Pezze, D. Deronjic, C. Masiero, D. Tosato, A. Beghi, G. A. Susto, A multi-label continual learning framework to scale deep learning approaches for packaging equipment monitoring, Engineering Applications of Artificial Intelligence (2023) 106610.
[38] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, in: International Conference on Learning Representations (ICLR), 2014.
[39] D. H. Ackley, G. E. Hinton, T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science (1985) 147–169.
[40] P. Nectoux, R. Gouriveau, K. Medjaher, E. Ramasso, B. Morello, N. Zerhouni, C. Varnier, PHM12 challenge data set (2012).
[41] P. Nectoux, R. Gouriveau, K. Medjaher, E. Ramasso, B. Chebel-Morello, N. Zerhouni, C. Varnier, PRONOSTIA: An experimental platform for bearings accelerated degradation tests, in: 2012 International Conference on Prognostics and Health Management, 2012, pp. 1–8.
[42] Y. Wen, P. Vicol, J. Ba, D. Tran, R. Grosse, Flipout: Efficient pseudo-independent weight perturbations on mini-batches, in: International Conference on Learning Representations (ICLR), 2018.
[43] C.-D. Lai, D. Murthy, M. Xie, Weibull distributions and their applications, Springer, 2006, pp. 63–78.
[44] K. Cho, B. van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder–decoder approaches, in: Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111.
[45] S. Duane, A. Kennedy, B. J. Pendleton, D. Roweth, Hybrid Monte Carlo, Physics Letters B (1987) 216–222.
[46] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, in: International Conference on Neural Information Processing Systems (NeurIPS), 2017.
[47] M. J. Wainwright, M. I. Jordan, Graphical models, exponential families, and variational inference, Found. Trends Mach. Learn. (2008) 1–305.
[48] A. Graves, Practical variational inference for neural networks, in: International Conference on Neural Information Processing Systems (NeurIPS), 2011.
[49] R. M. Neal, G. E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in: Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models, 1998, pp. 355–368.
[50] A. Berlinet, C. Thomas-Agnan, Reproducing Kernel Hilbert Space in Probability and Statistics, Springer, 2004.
[51] P. J. Huber, Robust estimation of a location parameter, The Annals of Mathematical Statistics (1964) 73–101.
[52] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[53] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with NumPy, Nature (2020) 357–362.
[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: International Conference on Neural Information Processing Systems (NeurIPS), 2019.
[55] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, I. Stoica, Ray: A distributed framework for emerging AI applications, in: Symposium on Operating Systems Design and Implementation, 2018, pp. 561–577.
[56] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, I. Stoica, Tune: A research platform for distributed model selection and training, in: International Conference on Machine Learning (ICML), 2018.
[57] J. D. Hunter, Matplotlib: A 2D graphics environment, Computing in Science & Engineering (2007) 90–95.
[58] M. L. Waskom, seaborn: statistical data visualization, Journal of Open Source Software (2021) 3021.
[59] L. Langosco, V. Fortuin, H. Strathmann, Neural variational gradient descent, arXiv preprint arXiv:2107.10731 (2021).