¹¹institutetext: Dynamics Research Group, Department of Mechanical Engineering, University of Sheffield, Sheffield,
S1 3JD, UK ²²institutetext: The Alan Turing Institute, The British Library, 96 Euston Road, London, NW1 2DB, UK
²²email: [email protected]

Mitigating sampling bias in risk-based active learning via an EM algorithm

A.J. Hughes 11 L.A. Bull 22 P. Gardner 11 N. Dervilis 11 K. Worden 11

Abstract

Risk-based active learning is an approach to developing statistical classifiers for online decision-support. In this approach, data-label querying is guided according to the expected value of perfect information for incipient data points. For SHM applications, the value of information is evaluated with respect to a maintenance decision process, and the data-label querying corresponds to the inspection of a structure to determine its health state. Sampling bias is a known issue within active-learning paradigms; this occurs when an active learning process over- or undersamples specific regions of a feature-space, thereby resulting in a training set that is not representative of the underlying distribution. This bias ultimately degrades decision-making performance, and as a consequence, results in unnecessary costs incurred.

The current paper outlines a risk-based approach to active learning that utilises a semi-supervised Gaussian mixture model. The semi-supervised approach counteracts sampling bias by incorporating pseudo-labels for unlabelled data via an EM algorithm. The approach is demonstrated on a numerical example representative of the decision processes found in SHM.

Keywords:

active learning semi-supervised learning decision-making risk value of information

1 Introduction

In structural health monitoring (SHM) systems, statistical pattern recognition (SPR), can be used to classify feature vectors $\mathbf{x}\in\mathbb{R}^{D}$ extracted from data, according to a descriptive label $y\in\{1,\ldots,K\}$ , that may correspond to damage location, type or extent [6]. To account for the inherent uncertainty associated with monitoring structures, statistical classifiers are used to infer probability distributions over health states, i.e. $p(y=k|\mathbf{x})$ . Gaining the ability to make robust and informed decisions regarding the operation and maintenance of structures provides significant motivation for the development and implementation of SHM technologies.

In [9], a method is proposed for considering the decision-support application within the learning process for the statistical classifiers used in SHM. Termed risk-based active learning, the approach utilises expected value of perfect information (EVPI) to trigger structural inspections, in order to obtain health-state labels for observations. Using these labelled data, a classification model is subsequently relearned/updated. In [9], it is shown that decision-making performance is improved via risk-based active learning, however, a slight decline in decision-making performance is observed during the later stages of the risk-based active learning process. This degradation in performance can be attributed to sampling bias [4, 3].

Sampling bias occurs due to the fact that risk-based active learning preferentially queries data with high expected value of information. The preferential querying process results in specific regions of the feature space and certain classes being disproportionately represented in the training data. This bias is particularly problematic for generative classification models as the learning process typically relies on the assumption that training data are representative of the underlying distribution. The bias introduced via the queried samples facilitates the learning of decision boundaries in a cost-effective manner; however, it also causes a decline in decision-making performance for later queries.

The current paper aims to mitigate the effects of sampling bias via introducing semi-supervised learning into the risk-based active learning approach. Semi-supervised methods incorporate unlabelled data into the learning process, thereby rectifying the bias in the training dataset introduced by active learning. It is demonstrated here, via numerical example, that semi-supervised learning (specifically, the expectation-maximisation algorithm) can improve risk-based active learning by reducing the effects of sampling bias; improving decision-making performance, and reducing the number of inspections made throughout a monitoring campaign. The layout of the paper is as follows. Firstly, background information on risk-based active learning and semi-supervised learning are provided. Subsequently, risk-based active learning with semi-supervised learning incorporated is demonstrated on a synthetic dataset. Finally, concluding remarks are provided.

2 Risk-based Active Learning

Active learning is a form of partially-supervised learning. Partially-supervised learning is characterised by the use of both labelled and unlabelled data.

Active-learning algorithms construct a labelled data set $\mathcal{D}_{l}$ by querying labels for otherwise unlabelled data $D_{u}$ . Labels for incipient data are queried preferentially according to some measure of their desirability. Models are then trained in a supervised manner using $\mathcal{D}_{l}$ . A general heuristic for active learning is presented in Figure 1.

Figure 1: A general active learning heuristic.

In [1], an online active learning-approach is shown to overcome the primary challenge for classifier development in the context of SHM - initial scarcity of comprehensive labelled data. This result is achieved via the construction of a labelled dataset from a querying process that corresponds to the inspection of a structure by an engineer to determine its health state. In [9], an alternative risk-based approach to active learning is proposed in which expected value of perfect information (EVPI) is used as a measure to guide querying.

The expected value of perfect information (EVPI) can be interpreted as the price that a decision-maker should be willing to pay in order to gain access to perfect information of an otherwise unknown or uncertain state. Here, it should be clarified that the terminology ‘perfect information’ refers to ground-truth states observed without uncertainty. More formally, the expected value of observing $y$ with perfect information prior to making decision $d$ is given by,

\text{EVPI}(d|y):=\text{MEU}(\mathcal{I}_{y\rightarrow d})-\text{MEU}(\mathcal{I})

(1)

where $\mathcal{I}$ is an influence diagram representing an SHM decision process with decision $d$ influenced by random variable $y$ , and $\mathcal{I}_{y\rightarrow d}$ is a modified influence diagram incorporating an additional informational link from $y$ to $d$ indicating that $y$ is observed prior to $d$ . An example calculation of EVPI is presented in [9]. Details of the procedure for forming SHM decision processes as influence diagrams can be found in [7].

A convenient criterion for mandating structural inspections can be derived from the EVPI. Put simply, if the EVPI of a label $y_{t}$ for a data point $\tilde{\mathbf{x}}_{t}$ exceeds the cost of making an inspection $C_{\text{ins}}$ then the ground-truth for $y_{t}$ should be obtained prior to $d_{t}$ . Subsequently, the labelled dataset $\mathcal{D}_{l}$ can be extended to include the newly-acquired data-label pair $(\mathbf{x}_{t},y_{t})$ , and the classifier retrained. While the assumption of perfect information may not hold in all cases, the principles and methodologies discussed in the current paper hold in general for value of information. Furthermore, the perfect information assumption may be relaxed via the introduction of an additional probabilistic model that quantifies uncertainty in inspections. Full details of the risk-based active-learning algorithm can be found in [9].

Adopting a risk-based approach to active learning allows one to learn statistical classifiers with consideration for the decision support contexts in which they may be employed. In [9], it is demonstrated that this approach provides a cost-efficient manner for classifier development.

3 Semi-supervised Learning

Alongside active learning, semi-supervised learning is a form of partially-supervised learning – utilising both $\mathcal{D}_{l}$ and $\mathcal{D}_{u}$ to inform the classification mapping. The fundamental principle of semi-supervised learning that distinguishes it from active learning, is as follows; data in the unlabelled dataset $\mathcal{D}_{u}$ can be given pseudo-labels that are informed by the ground-truth labels available in $\mathcal{D}_{l}$ . By incorporating pseudo-labels for unqueried data into the risk-based active-learning algorithm, class imbalance and inadequate coverage of the feature space can be rectified.

Semi-supervised learning has been applied to pattern recognition problems for SHM [2]. This method of learning brings several benefits such as increased the utilisation of information obtained via costly structural inspections.

3.1 Expectation-Maximisation

The approach to semi-supervised learning considered in the current paper aims to exploit a generative model form of statistical classifier, specifically a Gaussian mixture model, the details of which can be found in [1, 9]. Generative models can conveniently account for labelled and unlabelled data. This capability is achieved by modifying the expectation-maximisation (EM) algorithm [5], typically used for unsupervised density estimation, such that the log-likelihood of the model is maximised over both unlabelled and labelled data. For details of the approach, the reader is referred to [2].

Within the risk-based active learning algorithm, it is possible to apply EM every time a new unlabelled observation is acquired; however, this would be computationally expensive. To limit the computational cost of the modified active learning algorithm, the EM update was only applied following the retraining of the model subsequent to the acquisition of a new ground-truth label obtained via inspection.

4 Case Study

To draw attention to the effects of sampling bias in risk-based active learning algorithms, a modified version of the synthetic dataset used in [9] is adopted. The dataset consists of a two-dimensional input space $\mathbf{x}_{t}=\{x_{t}^{1},x_{t}^{2}\}$ and a four-class label space $y_{t}\in\{1,2,3,4\}$ . Framed as an SHM dataset, one can consider the inputs as features used to discriminate between classes corresponding to four health states of interest for a structure $S$ where:

•

Class 1 corresponds to the structure being undamaged and fully functional.
•

Class 2 corresponds to the structure possessing minor damage whilst remaining fully functional.
•

Class 3 corresponds to the structure possessing significant damage resulting in a reduced operational capacity.
•

Class 4 corresponds to the structure possessing critical damage resulting in operational incapacity, i.e. total structural failure.

The original dataset comprised of 1997 data points ordered according to health state – from Class 1 through to Class 4. For the current case study, the dataset was extended by drawing an additional 2000 independent samples from a generative distribution learned from the original dataset. This procedure was conducted a total of five times, resulting in 11997 data points that repeatedly progress from Class 1 through to Class 4, thereby emulating the process of structural deterioration and subsequent repair - a pattern that could conceivably be experienced by the fictitious structure of interest $S$ . A visualisation of the extended dataset is presented in Figure 2.

4.1 Decision Process

In order to employ risk-based active learning for the development of a statistical classifier, a decision process for the structure $S$ must first be specified. For consistency with [9], an identical binary maintenance decision process is selected for the current study.

Consider an agent that, at discrete-time instances $t\in\mathbb{N}$ , is tasked with making a binary decision $d_{t}$ . The agent must select an action from $\text{dom}(d_{t})=\{0\text{ (do nothing)},1\text{ (repair)}\}$ , such that some degree of operational capacity is maintained for $S$ at instance $t+1$ . It is assumed that the agent has access to the observed discriminative features $\mathbf{x}_{t}$ , and must infer the current and future latent health structural states $y_{t}$ and $y_{t+1}$ via a statistical classifier and a transition model, respectively. An influence diagram for such a decision process is shown in Figure 3.

Figure 3: An influence diagram representation of the decision process associated with structure

S

As aforementioned, the agent requires a transition model in order to obtain a forecast for the future health state $y_{t+1}$ . For the current case study, it is assumed that this model is known a priori so that focus can be on the development of the statistical classifier. Examples of this transition model being developed from data and prior knowledge of physics are presented in [10] and [8], respectively. As shown in Figure 3, the future health state $y_{t+1}$ is conditionally dependent on the current health state $y_{t}$ and the decision $d_{t}$ . Given $d_{t}=0$ , it is assumed that the structure will monotonically degrade with a propensity to remain in its current health state. This assumption is reflected in the conditional probability distribution $P(y_{t+1}|y_{t},d_{t}=0)$ presented in Table 2. Given $d_{t}=1$ , it is assumed that the structure is returned to its undamaged state with probability 0.99 and remains in its current state with probability 0.01. This assumption is reflected in the conditional probability distribution $P(y_{t+1}|y_{t},d_{t}=1)$ presented in Table 2.

Table 1: The conditional probability
table

P(y_{t+1}|y_{t},d_{t})

for

d_{t}=0

		$y_{t+1}$
		1	2	3	4
$y_{t}$	1	0.8	0.18	0.015	0.005
	2	0	0.8	0.15	0.05
	3	0	0	0.8	0.2
	4	0	0	0	1

Table 2: The conditional probability
table

P(y_{t+1}|y_{t},d_{t})

for

d_{t}=1

		$y_{t+1}$
		1	2	3	4
$y_{t}$	1	1	0	0	0
	2	0.99	0.01	0	0
	3	0.99	0	0.01	0
	4	0.99	0	0	0.01

For simplicity, and for more compact influence diagrams, within the current case study it is assumed that health states may be mapped directly to utilities. Once again, the utility function used here is specified to reflect the relative utility values that may be expected in a typical SHM application. The utility function $U(y_{t+1})$ is provided in Table 4. It can be seen from Table 4, that for health-states 1 and 2, in which the structure is at full operational capacity, positive utility has been assigned. For health-state 3, which corresponds to a reduced operational capacity, a lesser positive utility has been assigned. For health-state 4, which corresponds to critical damage, a relatively-large negative utility has been assigned, to reflect a complete loss of operational capacity and an additional severe consequence associated with structural failure, e.g. environmental damage.

Figure 3 indicates that the actions in the domain of $d_{t}$ have associated costs, specified by the utility function $U(d_{t})$ and shown in Table 4. It is assumed that $d_{t}=0$ has no utility associated with it. On the other hand, $d_{t}=1$ has negative utility because of the expenditure necessary to cover the material and labour costs associated with structural maintenance. It is worth noting that, in many practical applications, the specification of utility function is non-trivial and is an active topic of research outside the scope of the current paper. Hence, for this case study, relative utility values are selected to be only somewhat representative of the SHM context.

Table 3: The utility function

U(y_{t+1})

$y_{t+1}$	$U(y_{t+1})$
$1$	$10$
$2$	$10$
$3$	$5$
$4$	$-75$

Table 4: The utility function

U(d_{t})

where

d_{t}=0

and

d_{t}=1

denote the ‘do nothing’ and ‘repair’ actions, respectively.

$d_{t}$	$U(d_{t})$
$0$	$0$
$1$	$-30$

As alluded to previously, a probability distribution over the current health state $y_{t}$ is inferred from the observed features $\mathbf{x}_{t}$ via a statistical classifier subject to risk-based active learning – for the current case study a four-class Gaussian mixture model is used. To fully specify the contextual parameters for risk-based active learning, it is assumed that the ground-truth health state at discrete-time instance $t$ can be obtained via inspection at the cost of $C_{\text{ins}}=7$ . Here, cost can be interpreted as a strictly non-positive utility.

4.2 Results

In order to assess the effects of sampling bias on decision-making performance and for comparison to highlight the effects of introducing semi-supervised learning, risk-based active learning was used to learn a GMM for use within the decision process outlined in the previous section. This process was repeated 1000 times. For each repetition, the dataset was randomly halved into a test set and a training set $\mathcal{D}$ . From the training set, a small ( $\sim$ 0.2%) random subset retain their corresponding ground-truth labels. These data form the initialised labelled dataset $\mathcal{D}_{l}$ . The remaining majority of data from $\mathcal{D}$ have their ground-truth labels hidden, forming the unlabelled dataset $\mathcal{D}_{u}$ .

Figure 4(a) shows a GMM from one of the 1000 repetitions after the risk-based active-learning process. From Figure 4(a), one can see that during the active-learning process, querying is concentrated in highly-localised regions of the feature space. Specifically, regions between the clusters for Class 3 (moderate damage) and Class 4 (severe damage) have been queried preferentially. It is clear from Figure 4(a) that the subset of data $\mathcal{D}_{l}$ is not representative of the underlying distribution indicating that sampling bias is present. Nonetheless, the queried data have been somewhat successful in learning a decision boundary - as can be deduced by observing the region of high EVPI between the means of clusters for Class 3 and Class 4 in Figure 4(b).

The EM approach to semi-supervised learning was incorporated into the risk-based active learning process and applied to the case study. Once again, 1000 repetitions were conducted, each with randomly selected training and test datasets and with $\mathcal{D}_{l}$ randomly initialised as a small subset of the training data.

Figure 5 shows a GMM for one of the 1000 runs after the risk-based active learning process incorporating EM.

It can be seen from Figure 5(a) that, similar to the GMM without semi-supervised learning, risk-based active learning with semi-supervised learning results in labels being obtained for localised regions of the feature space, with Class 3 (moderate damage) being preferentially queried. Nonetheless, this figure shows that the clusters learned in a semi-supervised manner fit the data very well. Furthermore, examination of Figure 5(b) reveals that the resulting EVPI distributions over the feature spaces distinctly differ from that in Figure 4(b). For the EM approach, the introduction of semi-supervised learning into the risk-based active-learning process has enabled the inference of a well-defined decision-boundary indicated by the band of high EVPI separating the $2\sigma$ clusters for Class 3 and Class 4.

Figure 6 shows how the number of queries varies between each approach to risk-based active learning. It is immediately obvious from Figure 6, that incorporating semi-supervised learning reduces the number of queries made substantially. The significance of this result becomes most apparent if one recalls that, in the posed SHM decision context, the number of queries can be mapped directly onto inspection expenditure. This result is to be expected as, when employing semi-supervised learning, each time a query is made, additional information from the unlabelled dataset is utilised to define a class. This allows clusters to become well-defined more quickly, reducing the area of high-EVPI regions (as is visible in Figure 5), meaning fewer queries are made later in the dataset. This result is evident from Figure 7.

Figure 7 compares the total number of queries for each index in $\mathcal{D}_{u}$ over the 1000 repetitions of risk-based active learning conducted with a GMM, and a GMM with EM. It can be seen from Figure 7, that the incorporation of semi-supervised learning into the risk-based active approach results in relatively more queries being obtained during the first occurrence of each class, with relatively fewer queries being made at later occurrences. It can be seen that Class 3 is heavily investigated at each occurrence when semi-supervised methods are not employed. This phenomenon is to be expected when one considers the differences between the high-EVPI regions shown in Figure 4(b) and Figure 5(b); as previously discussed, semi-supervised learning results in a well-defined decision boundary, thereby reducing the likelihood that data will have high value of information.

Figure 8 provides a comparison of the median decision accuracies and $f_{1}$ -scores throughout the querying process between risk-based active learning, with and without semi-supervision.

Here, the decision-making performance achieved via each classifier is assessed via a quantity termed ‘decision accuracy’. The decision accuracy is a comparison between the actions selected by an agent using the classifier being evaluated, and the optimal (correct) actions selected by an agent in possession of perfect information. For details of this performance measure, the reader is directed to [7].

From Figure 8(a), one may be inclined to deduce that the introduction of semi-supervised learning has been detrimental to the performance of risk-based active learning as the introduction of EM appears to delay the observed increase in decision accuracy. However, when considered alongside Figure 7, one can realise that, because semi-supervised learning results in increased querying early in the dataset, decision accuracy over the whole dataset is improved. Furthermore, a decline in decision performance is not observed for the algorithm incorporating semi-supervised learning; this result is because of the reduction in sampling bias obtained via the inclusion of unlabelled data. Figure 8(b) shows the $f_{1}$ -score classification performance, these results provide further indication that the models learned via semi-supervised risk-based active learning better represent the underlying distribution of data and that the detrimental effects of sampling bias have been reduced.

In summary, the case study has shown that, for some applications, semi-supervised learning provides a suitable approach to reducing the effects of sampling bias. The generative distributions obtained via these semi-supervised risk-based active-learning approaches better fit the underlying data distributions, whilst also establishing well-defined decision boundaries in a cost-effective manner.

5 Conclusions

To conclude, it has been shown that risk-based active learning of a statistical classifier can improve decision-making performance over the course of an SHM campaign. However, because of issues relating to sampling bias, degradation in performance is observed in the latter stages of the learning process. To rectify this issue, a novel approach to risk-based active learning was presented, in which semi-supervised learning, by means of the EM algorithm, was demonstrated using a numerical case study. The results of the case study showed that decision-making performance can be improved, and sampling bias is counteracted as a well-defined decision boundary is inferred with the additional information provided by the unlabelled data. Furthermore, the EM approach to risk-based active learning significantly reduced the number of queries made, suggesting that resource expenditure during monitoring campaigns can be reduced via careful design of statistical classifiers.

Acknowledgments

The authors would like to acknowledge the support of the UK EPSRC via the Programme Grant EP/R006768/1. KW would also like to acknowledge support via the EPSRC Established Career Fellowship EP/R003625/1. LAB was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/W006022/1, particularly the Ecosystems of Digital Twins theme within that grant and The Alan Turing Institute. The authors would like to thank Dr. Robert Barthorpe of the University of Sheffield for providing valuable discussion.

References

[1] Bull, L., Rogers, T., Wickramarachchi, C., Cross, E., Worden, K., Dervilis, N.: Probabilistic active learning: An online framework for structural health monitoring. Mechanical Systems and Signal Processing 134, 106294 (2019)
[2] Bull, L., Rogers, T., Wickramarachchi, C., Cross, E., Worden, K., Dervilis, N.: Towards semi-supervised and probabilistic classification in structural health monitoring. Mechanical Systems and Signal Processing 140, 106653 (2020)
[3] Dasgupta, S.: Two faces of active learning. Theoretical Computer Science 412(19), 1767–1781 (2011)
[4] Dasgupta, S., Hsu, D.: Hierarchical sampling for active learning. In: Proceedings of the 25th International Conference on Machine Learning ACM (2008)
[5] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1), 1–38 (1977)
[6] Farrar, C., Worden, K.: Structural Health Monitoring: A Machine Learning Perspective. John Wiley & Sons, Ltd (2013)
[7] Hughes, A., Barthorpe, R., Dervilis, N., Farrar, C., Worden, K.: A probabilistic risk-based decision framework for structural health monitoring. Mechanical Systems and Signal Processing 150, 107339 (2021)
[8] Hughes, A., Barthorpe, R., Worden, K.: On health-state transition models for risk-based structural health monitoring. In: Dynamics of Civil Structures, Volume 2. pp. 49–60. Springer International Publishing (2022)
[9] Hughes, A., Bull, L., Gardner, P., Barthorpe, R., Dervilis, N., Worden, K.: On risk-based active learning for structural health monitoring. Mechanical Systems and Signal Processing 167, 108569 (2022)
[10] Vega, M., Todd, M.: A variational Bayesian neural network for structural health monitoring and cost-informed decision-making in miter gates. Structural Health Monitoring (2020)