
Predicting the Reliability of an Image Classifier under Image Distortion

Dang Nguyen (corresponding author), Sunil Gupta, Kien Do, Svetha Venkatesh
Applied Artificial Intelligence Institute (A2I2), Deakin University, Geelong, Australia
{d.nguyen, sunil.gupta, k.do, svetha.venkatesh}@deakin.edu.au
Abstract

In image classification tasks, deep learning models are vulnerable to image distortions, i.e. their accuracy drops significantly if the input images are distorted. An image-classifier is considered "reliable" if its accuracy on distorted images is above a user-specified threshold. For quality control purposes, it is important to predict whether the image-classifier is reliable or non-reliable under a given distortion level. In other words, we want to predict whether a distortion level makes the image-classifier "non-reliable" or "reliable". Our solution is to construct a training set consisting of distortion levels along with their "non-reliable" or "reliable" labels, and train a machine learning predictive model (called a distortion-classifier) to classify unseen distortion levels. However, learning an effective distortion-classifier is challenging because the training set is highly imbalanced. To address this problem, we propose two Gaussian process based methods to rebalance the training set. We conduct extensive experiments to show that our method significantly outperforms several baselines on six popular image datasets.

keywords:
Image classification; Reliability prediction; Image distortion; Imbalance classification; Gaussian process.

1 Introduction

Many image classification models have assisted humans in tasks ranging from everyday activities such as shopping (Google Lens) and entertainment (FaceApp) to critical applications such as healthcare (Calorie Mama) and authentication (BioID).

A well-known weakness of image-classifiers is that they are often vulnerable to image distortions, i.e. their performance drops significantly if the input images are distorted [1]. As illustrated in Figure 1, a ResNet model achieved 99% accuracy on a set of CIFAR-10 images. When the images were slightly rotated, its accuracy dropped to 82%. It predicted wrong labels for $20^{\circ}$-rotated images even though these images were easily recognized by humans. In practice, input images can be distorted in various ways, e.g. rotated images due to an unstable camera, dark images due to poor lighting conditions, noisy images due to rainy weather, etc.

Figure 1: Overall accuracy and class-wise accuracy of the ResNet model on original images (a) and distorted images (b). The overall accuracy dropped 17% from 0.99 to 0.82 when the images were rotated $20^{\circ}$. Some misclassified images are shown in (c).

For quality control purposes, we need to evaluate the image-classifier under different distortion levels to check in which cases it is reliable or non-reliable. As this task is very time- and cost-consuming, it is important to perform it automatically using a machine learning (ML) approach. In particular, we ask the question "can we predict the model reliability under an image distortion?", and formulate it as a binary classification problem. Assume that we have an image-classifier $T$ and a set of labeled images ${\cal D}$ (we call it the verification set). We define the search space of distortion levels ${\cal C}$ as follows: (1) each dimension of ${\cal C}$ is a distortion type (e.g. rotation, brightness) and (2) each point $c\in{\cal C}$ is a distortion level (e.g. {rotation=20, brightness=0.5}), which is used to modify the images in ${\cal D}$ to create a set of distorted images ${\cal D}^{\prime}_{c}$. The model $T$ is called "reliable" under a distortion level $c$ if $T$'s accuracy on ${\cal D}_{c}^{\prime}$ is above a stipulated threshold $h$, otherwise "non-reliable". Recalling the earlier example, if we choose the threshold $h=95\%$, then the ResNet model is non-reliable under $20^{\circ}$-rotation as its accuracy is only 82%. In other words, the distortion level $20^{\circ}$-rotation has the label "non-reliable". Our goal is to build a distortion-classifier $S$ that receives a distortion level $c\in{\cal C}$ and classifies it as 0 ("non-reliable") or 1 ("reliable"). For simplicity, we treat "non-reliable" as the negative label and "reliable" as the positive label.

The process to train the distortion-classifier $S$ consists of three steps. (1) Construct a training set: a typical way to create a training set for $S$ is to randomly sample distortion levels from the search space ${\cal C}$ and compute their corresponding labels. Given a $c_{i}\in{\cal C}$, we compute the accuracy $a_{i}$ of the model $T$ on the set of distorted images ${\cal D}^{\prime}_{c_{i}}$. If $a_{i}\geq h$, we assign $c_{i}$ the label "1", otherwise the label "0". As a result, we obtain a training set ${\cal R}=\{c_{i},\mathbb{I}_{a_{i}\geq h}\}_{i=1}^{I}$, where $\mathbb{I}_{a_{i}\geq h}$ is an indicator function and $I$ is the sampling budget. We illustrate this procedure to construct the training set ${\cal R}$ in Figure 2. (2) Rebalance the training set: using random distortion levels often leads to an imbalanced training set ${\cal R}$ as a majority of them fall under the negative class, especially when the threshold $h$ is high or the model's performance under distortion is generally poor. Thus, we need to rebalance ${\cal R}$ using an imbalance handling technique such as SMOTE [2], NearMiss [3], or generative models [4]. (3) Train the distortion-classifier: we use the rebalanced version of ${\cal R}$ to train an ML predictive model, e.g. a neural network, to classify unseen distortion levels.
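
The sketch below illustrates step (1) in Python. It is a minimal illustration only: the helpers sample_levels, distort, and evaluate are hypothetical placeholders for the sampling strategy, the distortion operators, and the accuracy computation described above, and each distortion level is assumed to be a numeric vector over the dimensions of ${\cal C}$.

```python
import numpy as np

def build_training_set(sample_levels, distort, evaluate, model, images, labels,
                       h=0.95, budget=600):
    """Label sampled distortion levels to build the training set R.

    sample_levels(budget) draws `budget` points from the search space C,
    distort(images, c) applies distortion level c to the verification set D,
    evaluate(model, X, y) returns the image-classifier's accuracy.
    """
    R = []
    for c in sample_levels(budget):                 # random or GP-based sampling
        distorted = distort(images, c)              # build D'_c from D
        acc = evaluate(model, distorted, labels)    # accuracy of T on D'_c
        R.append((c, int(acc >= h)))                # 1 = "reliable", 0 = "non-reliable"
    X = np.array([c for c, _ in R], dtype=float)
    y = np.array([lbl for _, lbl in R], dtype=int)
    return X, y
```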

Although current imbalance handling methods can rebalance the training data, they often suffer from generating false positive samples, leading to a sub-optimal training set for the distortion-classifier $S$. In this paper, we improve the training set ${\cal R}$ of $S$ with two contributions: (1) a Gaussian process (GP) based sampling technique to create a training set ${\cal R}$ with a higher fraction of real positive samples and (2) a GP-based imbalance handling technique to further rebalance ${\cal R}$ by generating more synthetic positive samples.

Figure 2: Construction of the training set ${\cal R}$. For example, three distortion levels $c_{1}$, $c_{2}$, and $c_{3}$ are randomly sampled from the search space ${\cal C}$ with two dimensions {rotation, brightness}. The label of $c_{1}$ is computed as follows. First, $c_{1}$ is used to modify images in the verification set ${\cal D}$ to create the set of distorted images ${\cal D}_{c_{1}}^{\prime}$. Then, the image-classifier $T$ is evaluated on ${\cal D}_{c_{1}}^{\prime}$ to compute its accuracy (0.82). Finally, $c_{1}$ has the label 0 (i.e. "non-reliable") as $T$'s accuracy (0.82) is below the threshold $h=0.95$. The pair $(c_{1},0)$ will be a sample of the training set ${\cal R}$.

GP-based sampling: we consider the mapping from a distortion level $c$ to the model's accuracy on the set of distorted images ${\cal D}^{\prime}_{c}$ as a black-box, expensive function $f:{\cal C}\rightarrow[0,1]$. The function $f$ is black-box as we do not know its expression, and $f$ is expensive as we have to measure the model's accuracy over all distorted images in ${\cal D}^{\prime}_{c}$. We approximate $f$ using a GP [5], which is a popular method to model black-box, expensive functions. We use $f$'s predictive distribution to design an acquisition function that searches for distortion levels with a high chance of being positive samples. We update the GP with new samples, and repeat the sampling process until the sampling budget $I$ is depleted. Finally, we obtain a training set ${\cal R}=\{c_{t},\mathbb{I}_{f(c_{t})\geq h}\}_{t=1}^{I}$.

GP-based imbalance handling: we use SMOTE on ${\cal R}$ to generate synthetic positive samples. However, SMOTE often generates many false positive samples [6]. To solve this problem, we assign an uncertainty score to each synthetic positive sample via the variance function of a GP, and filter out synthetic samples whose uncertainty scores are high. This helps to reduce the false positive rate of SMOTE.

To summarize, we make the following contributions.

  1. We are the first to define the problem of Prediction of Model Reliability under Image Distortion, and propose a distortion-classifier to predict whether the model is reliable under a distortion level.

  2. We propose a GP-based sampling technique to construct a training set for the distortion-classifier with an increased fraction of real positive samples.

  3. We propose a GP-based imbalance handling method to reduce the false positive rate when generating synthetic positive samples.

  4. We extensively evaluate our method on six benchmark image datasets, and compare it with several strong baselines. We show that it is significantly better than other methods.

  5. The significance of our work lies in providing the ability to predict the reliability of any image-classifier under a variety of distortions on any image dataset.

The remainder of the paper is organized as follows. In Section 2, related work on image distortion, model reliability, and imbalanced classification is reviewed. Our main contributions are presented in Section 3, where we describe two GP-based methods for addressing class imbalance. Experimental results are discussed in Section 4, while conclusions and future work are presented in Section 5.

2 Related Work

Image distortion. Most deep learning models are sensitive to image distortion, where a small amount of distortion can severely reduce their performance. Many methods have been proposed to detect and correct distortion in input images [7, 1]; they can be categorized into two groups: non-reference and full-reference. Non-reference methods correct the distortion without any direct comparison between the original and distorted images [8, 9]. Other works developed models that are robust to image distortion, most of which fine-tune pre-trained models on a pre-defined set of distorted images [10, 11, 12]. While these methods focus on improving model quality, which is useful in the model development phase, our work focuses on predicting model reliability, which is useful in the quality control phase.

Model reliability prediction. Assessing the reliability of a ML model is an important step in the quality control process [13]. Existing works on model reliability focus on defect/bug prediction [14], where they classify a model as “defective” if its source code has bugs. A typical solution has three main steps [15, 16]. First, we collect both “clean” and “defective” code samples from the model repository to construct a training set. Second, we rebalance the training set. Finally, we train a ML predictive model with the rebalanced training set.

Some works target other reliability aspects of a model such as relevance and reproducibility [17]. However, no existing work addresses the problem of model reliability prediction under image distortion.

Imbalance classification. As classification with imbalanced data has been studied for many years, imbalanced classification has a rich literature. Most existing methods are based on SMOTE (Synthetic Minority Oversampling Technique) [2], where synthetic minority samples are generated by linearly combining two real minority samples. Several variants have been developed to address SMOTE's weaknesses such as outliers and noise [18, 19, 20, 21, 22]. Other approaches for rebalancing data are under-sampling techniques [3], ensemble methods [23, 24], and generative models [25, 26, 27].

3 Framework

3.1 Problem statement

Let $T$ be an image-classifier, ${\cal D}=\{x_{i},y_{i}\}_{i=1}^{N}$ be a set of labeled images (i.e. the verification set), and ${\cal E}=\{E_{1},...,E_{d}\}$ be a set of $d$ image distortions, e.g. rotation, brightness, etc. Each $E_{i}$ has a value range $[l_{E_{i}},u_{E_{i}}]$, where $l_{E_{i}}$ and $u_{E_{i}}$ are the lower and upper bounds. We define a compact subset ${\cal C}$ of $\mathbb{R}^{d}$ as the set of all possible values for image distortion (i.e. ${\cal C}$ is the search space of all possible distortion levels).

We consider a mapping function $f:{\cal C}\rightarrow[0,1]$, which receives a distortion level $c\in{\cal C}$ as input and returns the accuracy of $T$ on the set of distorted images ${\cal D}_{c}^{\prime}=\{x_{i}^{\prime},y_{i}\}_{i=1}^{N}$ as output. Here, each image $x_{i}^{\prime}\in{\cal D}_{c}^{\prime}$ is a distorted version of an original image $x_{i}\in{\cal D}$, caused by the distortion level $c$. Given a threshold $h\in[0,1]$, $T$ is considered "reliable" under $c$ if $f(c)\geq h$, otherwise "non-reliable". Without loss of generality, we treat "non-reliable" as the negative label (i.e. class 0) and "reliable" as the positive label (i.e. class 1).

Our goal is to build a distortion-classifier $S$ to classify any distortion level $c\in{\cal C}$ into the positive or negative class.

3.2 Proposed method

The distortion-classifier $S$ is trained in three main steps. First, we create a training set ${\cal R}=\{c_{i},\mathbb{I}_{f(c_{i})\geq h}\}_{i=1}^{I}$, where $c_{i}$ is randomly sampled from ${\cal C}$ and $I$ is the sampling budget. However, ${\cal R}$ is often highly imbalanced, with the number of negative samples far exceeding the number of positive samples. Second, we rebalance ${\cal R}$ using an imbalance handling technique such as SMOTE or a generative model. Finally, we use the rebalanced version of ${\cal R}$ to train $S$, which can be any ML predictive model, e.g. a random forest, a neural network, etc.

We improve the quality of the training set ${\cal R}$ with two new approaches. First, instead of random sampling, we propose a GP-based sampling method to select the $c_{i}$ used to construct ${\cal R}$. Second, we further rebalance ${\cal R}$ using a novel GP-based imbalance handling technique.

3.2.1 GP-based sampling

Our goal is to obtain more positive samples when constructing ${\cal R}$. To achieve this, we treat the mapping function $f:{\cal C}\rightarrow[0,1]$ as a black-box function and approximate it with a GP. We then use the GP predictive distribution to guide our sampling process. The detailed steps are described as follows.

  1. We initialize the training set ${\cal R}_{t}$ with a small set of randomly sampled distortion levels $[c_{1},...,c_{t}]$ and compute their function values $\boldsymbol{f}_{1:t}=[f(c_{1}),...,f(c_{t})]$, where $t$ is a small number, i.e. $t\ll I$ (recall that $I$ is the sampling budget).

  2. We use ${\cal R}_{t}=\{c_{i},f(c_{i})\}_{i=1}^{t}$ to learn a GP that approximates $f$. We assume that $f$ is a smooth function drawn from a GP, i.e. $f(c)\sim\text{GP}(m(c),k(c,c^{\prime}))$, where $m(c)$ and $k(c,c^{\prime})$ are the mean and covariance functions. The predictive distribution for $f(c)$ at any point $c$ is a Gaussian distribution, with mean and variance functions:

     $\mu_{t}(c)=\boldsymbol{k}^{\mathsf{T}}K^{-1}\boldsymbol{f}_{1:t}$ (1)
     $\sigma_{t}^{2}(c)=k(c,c)-\boldsymbol{k}^{\mathsf{T}}K^{-1}\boldsymbol{k}$ (2)

     where $\boldsymbol{k}$ is a vector whose $i$-th element is $k(c_{i},c)$ and $K$ is a matrix of size $t\times t$ whose $(i,j)$-th element is $k(c_{i},c_{j})$.

  3. We iteratively update the training set ${\cal R}_{t}$ by adding a new point $\{c_{t+1},f(c_{t+1})\}$ until the sampling budget $I$ is depleted, updating the GP at each iteration. Instead of sampling $c_{t+1}$ randomly, we select $c_{t+1}$ by maximizing the following acquisition function $q(c)$:

     $q(c)=\beta\times\sigma(c)+(\mu(c)-h)$, (3)
     $c_{t+1}=\underset{c\in{\cal C}}{\text{argmax}}\ q(c)$ (4)

     where $\mu(c)$ and $\sigma(c)$ are the predictive mean and standard deviation from Equations (1) and (2). The coefficient $\beta=2\times[\log(d\times t\times\pi^{2})-\log(6\times\delta)]$ is computed following [28], where $d$ is the number of dimensions of the search space ${\cal C}$ and $\delta=0.1$ is a small constant.

  4. We construct the training set ${\cal R}=\{c_{t},\mathbb{I}_{f(c_{t})\geq h}\}_{t=1}^{I}$, where $c_{t}$ is sampled using our acquisition function in Equation (4).
Our sampling strategy achieves two goals: (1) sampling $c$ where the model's accuracy is higher than the threshold $h$ (i.e. large $(\mu(c)-h)$) and (2) sampling $c$ where the model's accuracy is highly uncertain (i.e. large $\sigma(c)$). As a result, we can efficiently find more positive samples. In the experiments, we show that our GP-based sampling method retrieves a much higher fraction of positive samples than the random sampling method.

Discussion. We want to highlight that our sampling strategy is very flexible. If $f(c)\geq h$ is the minority class, as in our setting, we use $(\mu(c)-h)$. If $f(c)<h$ is the minority class, we can simply change it to $(h-\mu(c))$.
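
To make the sampling loop concrete, the sketch below implements Equations (1)-(4) with scikit-learn's Gaussian process regressor. It is a minimal illustration under two assumptions not stated in the paper: the acquisition function is maximized over a random candidate set rather than with a continuous optimizer, and accuracy_under_distortion is a hypothetical helper standing in for $f(c)$.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def gp_based_sampling(accuracy_under_distortion, bounds, h, budget=600, n_init=20,
                      delta=0.1, n_candidates=5000, seed=0):
    """Sample distortion levels with the acquisition q(c) = beta*sigma(c) + (mu(c) - h)."""
    rng = np.random.default_rng(seed)
    d = len(bounds)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])

    # Step 1: a few random distortion levels and their accuracies f(c).
    C = rng.uniform(lo, hi, size=(n_init, d))
    f = np.array([accuracy_under_distortion(c) for c in C])

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for t in range(n_init, budget):
        gp.fit(C, f)                                        # Step 2: update the GP posterior
        beta = 2.0 * (np.log(d * t * np.pi**2) - np.log(6.0 * delta))
        cand = rng.uniform(lo, hi, size=(n_candidates, d))  # candidate points in C
        mu, sigma = gp.predict(cand, return_std=True)
        q = beta * sigma + (mu - h)                         # Equation (3)
        c_next = cand[np.argmax(q)]                         # Equation (4), approximated
        C = np.vstack([C, c_next])
        f = np.append(f, accuracy_under_distortion(c_next))

    labels = (f >= h).astype(int)                           # Step 4: labels of R
    return C, labels
```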

3.2.2 GP-based imbalance handling

We further rebalance the training set ${\cal R}$ by generating synthetic positive samples. Existing over-sampling methods such as SMOTE [2] and its variants [29] suffer from a high false positive rate. To address this problem, we propose a novel method combining SMOTE and a GP, which we call SMOTE-GP.

We use SMOTE on ${\cal R}$ to generate synthetic positive samples. Given a real positive sample $c_{i}$, a new synthetic positive sample is $\hat{c}_{i}=c_{i}+\epsilon\times(c_{j}-c_{i})$, where $c_{j}$ is another real positive sample and $\epsilon\in(0,1)$ is a random number. We denote the set of synthetic positive samples as $\hat{{\cal R}}^{+}$. However, SMOTE tends to generate false positive samples when the line connecting two real positive samples crosses the negative region, as shown in Figure 3(a). Our goal is to reject such false positive samples.

Figure 3: SMOTE (a) vs. our method SMOTE-GP (b). The black curve indicates the mapping function $f(c)$ (Section 3.1). The red line indicates the threshold $h$ separating negative samples (blue dots) from positive samples (orange dots). SMOTE generates a synthetic positive sample (green dot) along the line connecting two positive samples, which is a false positive (a). SMOTE-GP computes an uncertainty score for the synthetic positive sample via a GP variance function. As the uncertainty score is high, SMOTE-GP rejects this synthetic sample (b).

When SMOTE generates a new synthetic positive sample $\hat{c}_{i}$, it simply assigns $\hat{c}_{i}$ the label 1, without providing any confidence estimate for this assignment. If we had such a confidence measure, we could reject synthetic samples whose confidence is low (i.e. whose uncertainty is high).

To measure the uncertainty of a SMOTE assignment, we use the variance function of a GP. First, we retrieve the set of real positive samples in ${\cal R}$, denoted ${\cal R}^{+}$. Second, we train the GP using the real positive samples in ${\cal R}^{+}$ along with their function values. As the GP is trained with only real positive samples, it can approximate the generation process of SMOTE. Then, we compute an uncertainty score $u_{\hat{c}}$ for each synthetic positive sample $\hat{c}\in\hat{{\cal R}}^{+}$ (recall $\hat{{\cal R}}^{+}$ is the set of synthetic positive samples generated by SMOTE):

$u_{\hat{c}} = \sigma^{2}(\hat{c}) = k(\hat{c},\hat{c})-\boldsymbol{k}^{\mathsf{T}}K^{-1}\boldsymbol{k}$, (5)

where $\sigma^{2}(\hat{c})$ is the variance function of the GP, $\boldsymbol{k}$ is a vector whose $i$-th element is $k(c_{i},\hat{c})$, and $K$ is a matrix of size $|{\cal R}^{+}|\times|{\cal R}^{+}|$ whose $(i,j)$-th element is $k(c_{i},c_{j})$.

Finally, as $u_{\hat{c}}$ measures the uncertainty of the synthetic positive sample $\hat{c}$, we keep $\hat{c}$ if $u_{\hat{c}}$ is smaller than a threshold $\upsilon$; otherwise, we discard $\hat{c}$. The procedure of our SMOTE-GP is shown in Algorithm 1.

Input: Imbalanced training set ${\cal R}$ and uncertainty threshold $\upsilon$
Output: Rebalanced training set ${\cal R}^{*}$
begin
    ${\cal R}^{*}={\cal R}$
    generate synthetic positive samples $\hat{{\cal R}}^{+}=\text{SMOTE}({\cal R})$
    train a GP with real positive samples ${\cal R}^{+}=\{(c_{i},f(c_{i}))\mid c_{i}\in{\cal R}\wedge\text{label}(c_{i})=1\}$
    for each $\hat{c}_{i}\in\hat{{\cal R}}^{+}$ do
        compute its uncertainty score $u_{\hat{c}_{i}}$ using Equation (5)
        if $u_{\hat{c}_{i}}\leq\upsilon$ then
            ${\cal R}^{*}={\cal R}^{*}\cup\{\hat{c}_{i}\}$
        end if
    end for
end
Algorithm 1: Our SMOTE-GP algorithm.
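
A minimal Python sketch of Algorithm 1 is given below, assuming imbalanced-learn's SMOTE for over-sampling and scikit-learn's GP regressor for the uncertainty score. It treats the synthetic points appended by fit_resample as $\hat{{\cal R}}^{+}$; this ordering is an implementation assumption, not part of the method.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def smote_gp(C, labels, f_values, upsilon=0.005, seed=0):
    """Rebalance (C, labels) by keeping only low-uncertainty SMOTE samples.

    C        : (n, d) array of distortion levels in R
    labels   : (n,) array of 0/1 reliability labels
    f_values : (n,) array of accuracies f(c), used to fit the GP on real positives
    """
    # Generate synthetic positive samples; the resampled array contains the
    # original samples followed by the synthetic ones (assumed ordering).
    X_res, y_res = SMOTE(random_state=seed).fit_resample(C, labels)
    synth = X_res[len(C):]                       # \hat{R}^+

    # Fit a GP only on the real positive samples R^+ and their accuracies.
    pos = labels == 1
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(C[pos], f_values[pos])

    # Keep synthetic samples whose predictive variance (Equation 5) is <= upsilon.
    _, std = gp.predict(synth, return_std=True)
    kept = synth[std**2 <= upsilon]

    X_out = np.vstack([C, kept])
    y_out = np.concatenate([labels, np.ones(len(kept), dtype=int)])
    return X_out, y_out
```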

4 Experiments

4.1 Experiment settings

Figure 4 recalls the steps involved in the training and test phases of our model reliability prediction task under image distortion. Below, we provide their implementation details.

Figure 4: Steps involved in the training and test phases of a model reliability prediction task. The module $\text{CT}(\{c_{1},...,c_{I}\}\mid T,{\cal D},h)$ that constructs the training set ${\cal R}$ is described in Figure 2. There are two differences between our method and the baseline: (1) the GP-based sampling and (2) the SMOTE-GP.

Search space of distortion levels ${\cal C}$. We predict the reliability of image-classifiers against six image distortions, including geometric distortions [30], lighting distortion [31], and rain distortion [32]. We illustrate the six distortion types in Figure 5. The value range of each image distortion is shown in Table 1. Note that our method is applicable to any distortion type that can be defined by a range of values.

Figure 5: Original image plus six distortion types.
Table 1: List of distortions along with their domains.
Distortion Domain Description
Scale $[0.7,1.3]$ Zoom in/out 0-30%
Rotation $[0,90]$ Rotate $0^{\circ}$-$90^{\circ}$
Translation-X $[-0.2,0.2]$ Shift left/right 0-20%
Translation-Y $[-0.2,0.2]$ Shift up/down 0-20%
Darkness $[0.7,1.3]$ Darken/brighten 0-30%
Rain $[0,1]$ 0: no rain, 1: a lot of rain
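
As an illustration of how a point $c\in{\cal C}$ is turned into a distorted image, the sketch below applies the geometric and lighting distortions of Table 1 with torchvision's functional transforms. The add_rain helper is hypothetical, standing in for the rain model of [32], and the exact operators used in the paper may differ. The distorted verification set ${\cal D}^{\prime}_{c}$ is then obtained by mapping this function over every image in ${\cal D}$.

```python
import torchvision.transforms.functional as TF

def apply_distortion(img, c, add_rain=None):
    """Apply a distortion level c = {scale, rotation, tx, ty, darkness, rain}
    to a PIL image. A sketch only; the rain distortion is a stub."""
    w, h = img.size  # PIL images expose (width, height)
    out = TF.affine(
        img,
        angle=c["rotation"],                              # degrees, domain [0, 90]
        translate=(int(c["tx"] * w), int(c["ty"] * h)),   # fractions of image size
        scale=c["scale"],                                 # domain [0.7, 1.3]
        shear=0.0,
    )
    out = TF.adjust_brightness(out, c["darkness"])        # <1 darkens, >1 brightens
    if add_rain is not None and c["rain"] > 0:
        out = add_rain(out, intensity=c["rain"])          # rain model of [32], not shown
    return out
```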

Sampling method. While the baseline uses random sampling, our method uses the GP-based sampling. After the sampling process, we obtain a set of distortion levels $\{c_{1},...,c_{I}\}$, where $I=600$ is the sampling budget.

Construction of training set ${\cal R}$. Given distortion levels $\{c_{1},...,c_{I}\}$, the module $\text{CT}(\{c_{1},...,c_{I}\}\mid T,{\cal D},h)$ assigns the label 0 or 1 to each $c_{i}$ (see Figure 2). It requires the image-classifier $T$, the verification set ${\cal D}$, and the reliability threshold $h$.

For the image-classifiers $T$, we use five pre-trained models from [33] for the image datasets MNIST, Fashion, CIFAR-10, CIFAR-100, and Tiny-ImageNet. They achieved similar accuracy to those reported in [34, 35, 36]. We also use the pre-trained ResNet50 model from the Keras library (https://keras.io/api/applications/resnet/#resnet50-function) for ImageNette (https://www.tensorflow.org/datasets/catalog/imagenette). As pointed out by [12, 1], we expect these image-classifiers to lose performance when evaluated on distorted images.

For each image dataset, we use 10% of its data samples as the verification set ${\cal D}$. The size of ${\cal D}$, the accuracy of $T$ on ${\cal D}$, and the reliability threshold $h$ are shown in Table 2.

Table 2: Size of the verification set ${\cal D}$, accuracy of $T$ on ${\cal D}$, and reliability threshold $h$ used in our experiments.
Dataset $|{\cal D}|$ Accuracy of $T$ $h$
MNIST 6,000 0.9967 0.90
Fashion 6,000 0.9908 0.75
CIFAR-10 5,000 0.9902 0.85
CIFAR-100 5,000 0.9340 0.65
Tiny-ImageNet 10,000 0.6275 0.45
ImageNette 1,000 0.8290 0.70

Imbalance handling method. As the training set ${\cal R}$ is highly imbalanced, we rebalance it before training the distortion-classifier $S$. We use SMOTE-GP and compare it with state-of-the-art imbalance handling methods, including cost-sensitive learning [37], the under-sampling method NearMiss [3], the over-sampling methods SMOTE [2] and AdaSyn [38], the ensemble methods SPE [23] and MESA [24], and the generative models GAN [27], VAE [25], CTGAN, and TVAE [26].

As our method is based on SMOTE, we also compare it with SMOTE variants, including SMOTE-Borderline [20], SMOTE-SVM [21], SMOTE-ENN [19], SMOTE-TOMEK [18], and SMOTE-WB [22]. For a fair comparison, we use the source code released by the authors or implementations from well-known public libraries. The details are provided in Appendix B.

Distortion-classifier $S$. We train five popular ML predictive models on the rebalanced training set ${\cal R}^{*}$: decision tree, random forest, logistic regression, support vector machine, and neural network. Each of them is a distortion-classifier. In the test phase, we report the averaged result over the five distortion-classifiers.
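
A minimal sketch of this step is shown below, assuming scikit-learn implementations of the five predictive models and default hyperparameters; the F1 evaluation mirrors the metric described later in this section.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Train the five distortion-classifiers on the rebalanced set R* and
    return their F1-scores on the test distortion levels."""
    models = {
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
        "neural_network": MLPClassifier(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test))
    scores["average"] = float(np.mean(list(scores.values())))
    return scores
```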

Construction of test set ${\cal R}^{\prime}$. To evaluate the performance of the distortion-classifiers, we need to construct a test set. We create a grid of distortion levels $\{c_{1},...,c_{G}\}$ in ${\cal C}$. For each dimension, we use four points, resulting in $4^{6}=4{,}096$ grid points in total. For each test point $c$, we determine its label using the procedure in Figure 2. In the end, there are 4,096 test distortion levels along with their labels. We report the numbers of positive and negative test points in Appendix A.
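
The grid can be built, for instance, as a Cartesian product of equally spaced values over the domains in Table 1. The sketch below assumes four points per dimension, matching the 4,096 test points reported in Appendix A; the exact grid values used in the paper may differ.

```python
import itertools
import numpy as np

# Domains of the six distortions (Table 1).
domains = {
    "scale": (0.7, 1.3),
    "rotation": (0.0, 90.0),
    "tx": (-0.2, 0.2),
    "ty": (-0.2, 0.2),
    "darkness": (0.7, 1.3),
    "rain": (0.0, 1.0),
}

# Four equally spaced values per dimension -> 4^6 = 4,096 grid points.
axes = [np.linspace(lo, hi, 4) for lo, hi in domains.values()]
grid = np.array(list(itertools.product(*axes)))
print(grid.shape)  # (4096, 6)
```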

Evaluation metric. We evaluate each distortion-classifier on the test set ${\cal R}^{\prime}$ and compute the F1-score. As each imbalance handling method is combined with five ML predictive models to form five distortion-classifiers, we report the averaged F1-score. A higher F1-score means a better prediction.

We repeat each method three times with random seeds, and report the averaged F1-score. As the standard deviations are small (< 0.06), we do not report them to save space.

4.2 Comparison of sampling methods

We compare our GP-based sampling with random sampling. From Figure 6, our GP-based sampling obtains many more positive points than random sampling. For example, on CIFAR-10, among 600 sampled points, random sampling obtains only 27 positive samples to construct the training set ${\cal R}$. In contrast, our GP-based sampling retrieves 130 positive samples, constructing a more balanced ${\cal R}$.

Figure 6: Number of real positive samples sampled from ${\cal C}$ by random sampling and our GP-based sampling.

4.3 Comparison of imbalance handling methods

We compare our imbalance handling method SMOTE-GP with current state-of-the-art imbalance handling methods. Our method has two versions: (1) SMOTE-GP combined with random sampling and (2) SMOTE-GP combined with our GP-based sampling. We use the uncertainty threshold $\upsilon=0.05$ for CIFAR-10 and $\upsilon=0.005$ for the other datasets.

Table 3 shows that our SMOTE-GP combined with our GP-based sampling is the best method and significantly outperforms other methods. Its improvements are around 5% on MNIST, 9% on Fashion, 3% on CIFAR-10, 8% on CIFAR-100, 6% on Tiny-ImageNet, and 14% on ImageNette.

Table 3: F1-scores of our method SMOTE-GP and other imbalance handling methods. Datasets include M: MNIST, F: Fashion, C10: CIFAR-10, C100: CIFAR-100, T-IN: Tiny-ImageNet, and IN: ImageNette.
Category Sampling Imbalance M F C10 C100 T-IN IN
Standard Random None 0.3657 0.2105 0.6507 0.4130 0.3587 0.6561
Cost-sensitive Random Re-weight 0.5478 0.3940 0.6938 0.5593 0.5256 0.6677
Under-sampling Random RandomUnder 0.2553 0.1467 0.5531 0.2824 0.2870 0.3989
Random NearMiss 0.3358 0.1514 0.6588 0.4610 0.3443 0.6764
Over-sampling Random RandomOver 0.6194 0.4554 0.7230 0.5939 0.5797 0.7063
Random SMOTE 0.6157 0.4379 0.7310 0.5933 0.5658 0.7100
Random Adasyn 0.6090 0.4370 0.7306 0.5955 0.5663 0.7065
Ensemble Random SPE 0.5237 0.2984 0.7269 0.5113 0.4816 0.6808
Random MESA 0.4337 0.2120 0.6402 0.4394 0.3739 0.5802
Deep learning Random GAN 0.3202 0.2186 0.5157 0.3358 0.3551 0.4739
Random VAE 0.3831 0.2264 0.6635 0.4123 0.3821 0.6638
Random CTGAN 0.2958 0.1966 0.4124 0.2968 0.3144 0.3904
Random TVAE 0.3364 0.2124 0.5319 0.3365 0.3249 0.4673
Ours Random SMOTE-GP 0.6356 0.4611 0.7433 0.6440 0.5856 0.7929
GP-based SMOTE 0.6361 0.5327 0.7562 0.6525 0.5952 0.7988
GP-based SMOTE-GP 0.6635 0.5467 0.7616 0.6780 0.6381 0.8559

When using random sampling, our SMOTE-GP is still better than the other methods by 1-8%. Among the imbalance handling baselines, SMOTE often achieves the best results. When SMOTE is combined with our GP-based sampling, its performance improves significantly. This shows that our GP-based sampling is better than random sampling.

In general, imbalance handling methods often improve the performance of the distortion-classifier, compared to the standard distortion-classifier. Over-sampling methods are always better than under-sampling methods. Deep learning methods based on generative models do not show any real benefit.

Comparison with SMOTE variants. We also compare our SMOTE-GP with SMOTE-based imbalance handling methods in Table 4. Our method is the best, while the other SMOTE-based methods perform similarly to one another.

Table 4: F1-scores of our method SMOTE-GP and SMOTE variants.
Method MNIST Fashion CIFAR-10 CIFAR-100 Tiny-ImageNet ImageNette
SMOTE 0.6157 0.4379 0.7310 0.5933 0.5658 0.7100
SMOTE-Borderline 0.6118 0.4313 0.7246 0.5924 0.5673 0.7057
SMOTE-SVM 0.6120 0.4231 0.7335 0.6020 0.5711 0.7187
SMOTE-ENN 0.6046 0.3931 0.6752 0.5504 0.5213 0.6744
SMOTE-TOMEK 0.6155 0.4381 0.7312 0.5931 0.5658 0.7096
SMOTE-WB 0.6210 0.4511 0.7276 0.5859 0.5659 0.7079
SMOTE-GP (Ours) 0.6635 0.5467 0.7616 0.6780 0.6381 0.8559

4.4 Ablation study

We conduct further experiments on CIFAR-10 to analyze our method under different settings.

Uncertainty threshold $\upsilon$. Our SMOTE-GP uses the uncertainty threshold $\upsilon$ to filter out false positive synthetic samples. We investigate how different values of $\upsilon$ affect our performance.

Figure 7 shows that our SMOTE-GP is better than SMOTE over a wide range of $\upsilon$ values. When $\upsilon$ is too small (i.e. $\upsilon<0.01$), the F1-score may drop as most of the synthetic positive samples are filtered out. When $\upsilon$ is too large (i.e. $\upsilon>0.05$), the F1-score may also drop since many false positive samples are introduced.

Figure 7: Our F1-score vs. the uncertainty threshold $\upsilon$.

Sampling budget $I$. We investigate the effect of the number of sampling queries (i.e. the size of the sampling budget $I$) on the performance of our method.

Figure 8 shows that both methods improve as the number of sampling queries increases, as expected. More queries result in more training data and a greater chance of obtaining positive samples. However, our SMOTE-GP is always better than SMOTE by a large margin.

Figure 8: Our F1-score vs. the sampling budget $I$.

Reliability threshold $h$. We investigate how our performance changes with different reliability thresholds $h$.

Figure 9 shows that both methods reduce their F1-scores when the reliability threshold $h$ becomes larger, as the image-classifier $T$ is reliable under fewer distortion levels (i.e. there are fewer positive samples). This leads to a very highly imbalanced training set ${\cal R}$.

Figure 9: Our F1-score vs. the reliability threshold $h$.

Visualization. For a qualitative evaluation, we use t-SNE [39] to visualize the synthetic positive samples generated by each method. From Figure 10, SMOTE and its variants generate noisy synthetic samples in two situations. Only our SMOTE-GP avoids both problems.

Figure 10: Visualization of original and synthetic samples on CIFAR-10. Blue and orange dots are real negative and positive samples, while green dots are synthetic positive samples generated by each method. Compared to the original data (the top left figure), SMOTE variants suffer from two problems. First, they generate suspicious positive samples in the red circle although there is no original data in this region; only SMOTE-SVM and our method SMOTE-GP overcome this problem. Second, they generate noisy (most likely wrong) positive samples in the blue square although there are only negative samples in this region; only SMOTE-WB and our method SMOTE-GP avoid this issue. In summary, only our SMOTE-GP overcomes both cases where SMOTE and its variants generate incorrect synthetic samples.

5 Conclusion

Predicting model reliability is an important task in the quality control process. In this paper, we solve this task in the context of image distortion, i.e. we predict whether an image-classifier is reliable under a distortion level. We formulate the task as a binary classification problem with three main steps: (1) construct a training set, (2) rebalance the training set, and (3) train a distortion-classifier. As the training set is highly imbalanced, we propose two methods to handle the imbalance: (1) GP-based sampling and (2) SMOTE-GP.

In the GP-based sampling, we approximate, with a GP, the black-box function mapping a distortion level to the model's accuracy on distorted images. We then leverage the GP's mean and variance functions to guide our sampling process.

In the SMOTE-GP method, we compute an uncertainty score for each synthetic positive sample. We then filter out those whose uncertainty scores exceed a threshold.

We demonstrate the benefits of our method on six image datasets, where it greatly outperforms other baselines.

References

  • [1] X. Li, B. Zhang, P. Sander, J. Liao, Blind geometric distortion correction on images through deep learning, in: CVPR, 2019, pp. 4855–4864.
  • [2] N. Chawla, K. Bowyer, L. Hall, P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
  • [3] I. Mani, J. Zhang, kNN approach to unbalanced data distributions: a case study involving information extraction, in: ICML Workshop, Vol. 126, 2003, pp. 1–7.
  • [4] V. Sampath, I. Maurtua, J. J. Aguilar Martin, A. Gutierrez, A survey on generative adversarial networks for imbalance problems in computer vision tasks, Journal of Big Data 8 (2021) 1–59.
  • [5] B. Shahriari, K. Swersky, Z. Wang, R. Adams, N. Freitas, Taking the human out of the loop: A review of bayesian optimization, Proceedings of the IEEE 104 (1) (2016) 148–175.
  • [6] A. Fernández, S. Garcia, F. Herrera, N. Chawla, SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary, Journal of Artificial Intelligence Research 61 (2018) 863–905.
  • [7] N. Ahn, B. Kang, K.-A. Sohn, Image distortion detection using convolutional neural network, in: IEEE Asian Conference on Pattern Recognition (ACPR), 2017, pp. 220–225.
  • [8] L. Kang, P. Ye, Y. Li, D. Doermann, Convolutional neural networks for no-reference image quality assessment, in: CVPR, 2014, pp. 1733–1740.
  • [9] S. Bosse, D. Maniry, T. Wiegand, W. Samek, A deep neural network for image quality assessment, in: IEEE International Conference on Image Processing (ICIP), 2016, pp. 3773–3777.
  • [10] Y. Zhou, S. Song, N.-M. Cheung, On classification of distorted images with deep convolutional neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 1213–1217.
  • [11] S. Dodge, L. Karam, Quality robust mixtures of deep neural networks, IEEE Transactions on Image Processing 27 (11) (2018) 5553–5562.
  • [12] M. T. Hossain, S. W. Teng, D. Zhang, S. Lim, G. Lu, Distortion robust image classification using deep convolutional neural network with discrete cosine transform, in: IEEE International Conference on Image Processing (ICIP), 2019, pp. 659–663.
  • [13] F. Thung, S. Wang, D. Lo, L. Jiang, An empirical study of bugs in machine learning systems, in: International Symposium on Software Reliability Engineering, 2012, pp. 271–280.
  • [14] F. Jafarinejad, K. Narasimhan, M. Mezini, NerdBug: automated bug detection in neural networks, in: International Workshop on AI and Software Testing/Analysis, 2021, pp. 13–16.
  • [15] J. Wang, C. Zhang, Software reliability prediction using a deep learning model based on the RNN encoder–decoder, Reliability Engineering & System Safety 170 (2018) 73–82.
  • [16] G. Giray, K. E. Bennin, Ö. Köksal, Ö. Babur, B. Tekinerdogan, On the use of deep learning in software defect prediction, Journal of Systems and Software 195 (2023) 111537.
  • [17] M. M. Morovati, A. Nikanjam, F. Khomh, Z. M. Jiang, Bugs in machine learning-based systems: a faultload benchmark, Empirical Software Engineering 28 (3) (2023) 62.
  • [18] G. Batista, A. Bazzan, M. C. Monard, et al., Balancing training data for automated annotation of keywords: a case study, WoB 3 (2003) 10–8.
  • [19] G. Batista, R. Prati, M. C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter 6 (1) (2004) 20–29.
  • [20] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, 2005, pp. 878–887.
  • [21] H. Nguyen, E. Cooper, K. Kamei, Borderline over-sampling for imbalanced data classification, International Journal of Knowledge Engineering and Soft Data Paradigms 3 (1) (2011) 4–21.
  • [22] F. Sağlam, M. A. Cengiz, A novel SMOTE-based resampling technique trough noise detection and the boosting procedure, Expert Systems with Applications 200 (2022) 117023.
  • [23] Z. Liu, W. Cao, Z. Gao, J. Bian, H. Chen, Y. Chang, T.-Y. Liu, Self-paced ensemble for highly imbalanced massive data classification, in: ICDE, 2020, pp. 841–852.
  • [24] Z. Liu, P. Wei, J. Jiang, W. Cao, J. Bian, Y. Chang, MESA: Boost Ensemble Imbalanced Learning with MEta-SAmpler, in: NeurIPS, Vol. 33, 2020, pp. 14463–14474.
  • [25] D. Kingma, M. Welling, et al., An introduction to variational autoencoders, Foundations and Trends in Machine Learning 12 (4) (2019) 307–392.
  • [26] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using Conditional GAN, in: NeurIPS, Vol. 32, 2019, pp. 7335–7345.
  • [27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (11) (2020) 139–144.
  • [28] N. Srinivas, A. Krause, S. Kakade, M. Seeger, Information-theoretic regret bounds for gaussian process optimization in the bandit setting, IEEE Transactions on Information Theory 58 (5) (2012) 3250–3265.
  • [29] G. Kovács, Smote-variants: A python implementation of 85 minority oversampling techniques, Neurocomputing 366 (2019) 352–354.
  • [30] S. Gopakumar, S. Gupta, S. Rana, V. Nguyen, S. Venkatesh, Algorithmic assurance: An active approach to algorithmic testing using bayesian optimisation, in: NIPS, 2018, pp. 5466–5474.
  • [31] H. Sellahewa, S. Jassim, Image-quality-based adaptive face recognition, IEEE Transactions on Instrumentation and measurement 59 (4) (2010) 805–813.
  • [32] P. Patil, S. Gupta, S. Rana, S. Venkatesh, Video restoration framework and its meta-adaptations to data-poor conditions, in: ECCV, 2022, pp. 143–160.
  • [33] D. Nguyen, S. Gupta, K. Do, S. Venkatesh, Black-box few-shot knowledge distillation, in: ECCV, 2022.
  • [34] Y. Tian, D. Krishnan, P. Isola, Contrastive representation distillation, in: ICLR, 2020.
  • [35] D. Wang, Y. Li, L. Wang, B. Gong, Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model, in: CVPR, 2020, pp. 1498–1507.
  • [36] P. Bhat, E. Arani, B. Zonooz, Distill on the go: Online knowledge distillation in self-supervised learning, in: CVPR Workshop, 2021, pp. 2678–2687.
  • [37] N. Thai-Nghe, Z. Gantner, L. Schmidt-Thieme, Cost-sensitive learning methods for imbalanced data, in: IJCNN, 2010, pp. 1–8.
  • [38] H. He, Y. Bai, E. Garcia, S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: IJCNN, 2008, pp. 1322–1328.
  • [39] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (11) (2008) 2579–2605.

Appendix A Test set {\cal R}^{\prime}

Table 5 reports the numbers of negative and positive samples in the test set ${\cal R}^{\prime}$, which is used to evaluate the performance of the distortion-classifiers (see Figure 4).

Table 5: Test sets ${\cal R}^{\prime}$ used to evaluate the distortion-classifiers in the test phase.
Dataset #negative #positive
MNIST 3,957 139
Fashion 4,017 79
CIFAR-10 3,884 212
CIFAR-100 3,977 119
Tiny-ImageNet 3,991 105
ImageNette 3,940 156

Appendix B Implementation of baselines

For a fair comparison with other methods, we use the source code released by the authors or implementations from well-known public libraries. Table 6 shows the link to the implementation of each baseline.

Table 6: Method and its implementation link.
Method Implementation link
Re-weight https://scikit-learn.org/stable/
RandomUnder https://imbalanced-learn.org/stable/
NearMiss https://imbalanced-learn.org/stable/
RandomOver https://imbalanced-learn.org/stable/
SMOTE https://imbalanced-learn.org/stable/
SMOTE-Borderline https://imbalanced-learn.org/stable/
SMOTE-SVM https://imbalanced-learn.org/stable/
SMOTE-ENN https://imbalanced-learn.org/stable/
SMOTE-TOMEK https://imbalanced-learn.org/stable/
Adasyn https://imbalanced-learn.org/stable/
SMOTE-WB https://github.com/analyticalmindsltd/smote_variants
SPE https://github.com/ZhiningLiu1998/imbalanced-ensemble
MESA https://github.com/ZhiningLiu1998/mesa
GAN https://github.com/dialnd/imbalanced-algorithms
VAE https://github.com/dialnd/imbalanced-algorithms
CTGAN https://github.com/sdv-dev/CTGAN
TVAE https://github.com/sdv-dev/CTGAN