
Predicting the Reliability of an Image Classifier under Image Distortion

Dang Nguyen (corresponding author), Sunil Gupta, Kien Do, Svetha Venkatesh
Applied Artificial Intelligence Institute (A2I2), Deakin University, Geelong, Australia
{d.nguyen, sunil.gupta, k.do, svetha.venkatesh}@deakin.edu.au
Abstract

In image classification tasks, deep learning models are vulnerable to image distortions, i.e. their accuracy drops significantly if the input images are distorted. An image-classifier is considered "reliable" if its accuracy on distorted images is above a user-specified threshold. For quality control purposes, it is important to predict whether the image-classifier is reliable or non-reliable under a given distortion level. In other words, we want to predict whether a distortion level makes the image-classifier "non-reliable" or "reliable". Our solution is to construct a training set consisting of distortion levels along with their "non-reliable" or "reliable" labels, and train a machine learning predictive model (called a distortion-classifier) to classify unseen distortion levels. However, learning an effective distortion-classifier is challenging because the training set is highly imbalanced. To address this problem, we propose two Gaussian process based methods to rebalance the training set. We conduct extensive experiments to show that our method significantly outperforms several baselines on six popular image datasets.

keywords:
Image classification; Reliability prediction; Image distortion; Imbalance classification; Gaussian process.

1 Introduction

Many image classification models have assisted humans in tasks ranging from everyday activities such as shopping (Google Lens) and entertainment (FaceApp) to critical applications such as healthcare (Calorie Mama) and authentication (BioID).

A well-known weakness of image-classifiers is that they are often vulnerable to image distortions, i.e. their performance drops significantly if the input images are distorted [1]. As illustrated in Figure 1, a ResNet model achieved 99% accuracy on a set of CIFAR-10 images. When the images were slightly rotated, its accuracy dropped to 82%. It predicted wrong labels for $20^{\circ}$-rotated images even though these images were easily recognized by humans. In practice, input images can be distorted in various ways, e.g. rotated images due to an unstable camera, dark images due to poor lighting conditions, noisy images due to rainy weather, etc.

Figure 1: Overall accuracy and class-wise accuracy of the ResNet model on original images (a) and distorted images (b). The overall accuracy dropped 17% from 0.99 to 0.82 when the images were rotated $20^{\circ}$. Some misclassified images are shown in (c).

For quality control purposes, we need to evaluate the image-classifier under different distortion levels to check in which cases it is reliable or non-reliable. As this task is very time- and cost-consuming, it is important to perform it automatically using a machine learning (ML) approach. In particular, we ask the question "can we predict the model reliability under an image distortion?", and formulate it as a binary classification problem. Assume that we have an image-classifier $T$ and a set of labeled images ${\cal D}$ (we call it the verification set). We define the search space of distortion levels ${\cal C}$ as follows: (1) each dimension of ${\cal C}$ is a distortion type (e.g. rotation, brightness) and (2) each point $c\in{\cal C}$ is a distortion level (e.g. {rotation=20, brightness=0.5}), which is used to modify the images in ${\cal D}$ to create a set of distorted images ${\cal D}^{\prime}_{c}$. The model $T$ is called "reliable" under a distortion level $c$ if $T$'s accuracy on ${\cal D}_{c}^{\prime}$ is above a stipulated threshold $h$, otherwise "non-reliable". Recalling the earlier example, if we choose the threshold $h=95\%$, then the ResNet model is non-reliable under $20^{\circ}$-rotation as its accuracy is only 82%. In other words, the distortion level $20^{\circ}$-rotation has the label "non-reliable". Our goal is to build a distortion-classifier $S$ that receives a distortion level $c\in{\cal C}$ and classifies it as 0 ("non-reliable") or 1 ("reliable"). For simplicity, we treat "non-reliable" as the negative label and "reliable" as the positive label.

The process to train the distortion-classifier $S$ consists of three steps. (1) Construct a training set: a typical way to create a training set for $S$ is to randomly sample distortion levels from the search space ${\cal C}$ and compute their corresponding labels. Given a $c_{i}\in{\cal C}$, we compute the accuracy $a_{i}$ of the model $T$ on the set of distorted images ${\cal D}^{\prime}_{c_{i}}$. If $a_{i}\geq h$, we assign $c_{i}$ the label "1", otherwise the label "0". As a result, we obtain a training set ${\cal R}=\{c_{i},\mathbb{I}_{a_{i}\geq h}\}_{i=1}^{I}$, where $\mathbb{I}_{a_{i}\geq h}$ is an indicator function and $I$ is the sampling budget. We illustrate this procedure to construct the training set ${\cal R}$ in Figure 2. (2) Rebalance the training set: using random distortion levels often leads to an imbalanced training set ${\cal R}$ as a majority of them fall under the negative class, especially when the threshold $h$ is high or the model's performance under distortion is generally poor. Thus, we need to rebalance ${\cal R}$ using an imbalance handling technique such as SMOTE [2], NearMiss [3], or generative models [4]. (3) Train the distortion-classifier: we use the rebalanced version of ${\cal R}$ to train an ML predictive model, e.g. a neural network, to classify unseen distortion levels.
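
The sketch below illustrates step (1) in Python. It is a minimal illustration only: the helpers sample_levels, distort, and evaluate are hypothetical placeholders for the sampling strategy, the distortion operators, and the accuracy computation described above, and each distortion level is assumed to be a numeric vector over the dimensions of ${\cal C}$.

```python
import numpy as np

def build_training_set(sample_levels, distort, evaluate, model, images, labels,
                       h=0.95, budget=600):
    """Label sampled distortion levels to build the training set R.

    sample_levels(budget) draws `budget` points from the search space C,
    distort(images, c) applies distortion level c to the verification set D,
    evaluate(model, X, y) returns the image-classifier's accuracy.
    """
    R = []
    for c in sample_levels(budget):                 # random or GP-based sampling
        distorted = distort(images, c)              # build D'_c from D
        acc = evaluate(model, distorted, labels)    # accuracy of T on D'_c
        R.append((c, int(acc >= h)))                # 1 = "reliable", 0 = "non-reliable"
    X = np.array([c for c, _ in R], dtype=float)
    y = np.array([lbl for _, lbl in R], dtype=int)
    return X, y
```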

Although current imbalance handling methods can rebalance the training data, they often suffer from generating false positive samples, leading to a sub-optimal training set for the distortion-classifier $S$. In this paper, we improve the training set ${\cal R}$ of $S$ with two contributions: (1) a Gaussian process (GP) based sampling technique to create a training set ${\cal R}$ with a higher fraction of real positive samples and (2) a GP-based imbalance handling technique to further rebalance ${\cal R}$ by generating more synthetic positive samples.

Figure 2: Construction of the training set ${\cal R}$. For example, three distortion levels $c_{1}$, $c_{2}$, and $c_{3}$ are randomly sampled from the search space ${\cal C}$ with two dimensions {rotation, brightness}. The label of $c_{1}$ is computed as follows. First, $c_{1}$ is used to modify images in the verification set ${\cal D}$ to create the set of distorted images ${\cal D}_{c_{1}}^{\prime}$. Then, the image-classifier $T$ is evaluated on ${\cal D}_{c_{1}}^{\prime}$ to compute its accuracy (0.82). Finally, $c_{1}$ has the label 0 (i.e. "non-reliable") as $T$'s accuracy (0.82) is below the threshold $h=0.95$. The pair $(c_{1},0)$ will be a sample of the training set ${\cal R}$.

GP-based sampling: we consider the mapping from a distortion level $c$ to the model's accuracy on the set of distorted images ${\cal D}^{\prime}_{c}$ as a black-box, expensive function $f:{\cal C}\rightarrow[0,1]$. The function $f$ is black-box as we do not know its expression, and $f$ is expensive as we have to measure the model's accuracy over all distorted images in ${\cal D}^{\prime}_{c}$. We approximate $f$ using a GP [5], which is a popular method to model black-box, expensive functions. We use $f$'s predictive distribution to design an acquisition function that searches for distortion levels with a high chance of being positive samples. We update the GP with new samples, and repeat the sampling process until the sampling budget $I$ is depleted. Finally, we obtain a training set ${\cal R}=\{c_{t},\mathbb{I}_{f(c_{t})\geq h}\}_{t=1}^{I}$.

GP-based imbalance handling: we use SMOTE on ${\cal R}$ to generate synthetic positive samples. However, SMOTE often generates many false positive samples [6]. To solve this problem, we assign an uncertainty score to each synthetic positive sample via the variance function of a GP, and filter out synthetic samples whose uncertainty scores are high. This helps to reduce the false positive rate of SMOTE.

To summarize, we make the following contributions.

  1. We are the first to define the problem of Prediction of Model Reliability under Image Distortion, and propose a distortion-classifier to predict whether the model is reliable under a distortion level.

  2. We propose a GP-based sampling technique to construct a training set for the distortion-classifier with an increased fraction of real positive samples.

  3. We propose a GP-based imbalance handling method to reduce the false positive rate when generating synthetic positive samples.

  4. We extensively evaluate our method on six benchmark image datasets, and compare it with several strong baselines. We show that it is significantly better than other methods.

  5. The significance of our work lies in providing the ability to predict the reliability of any image-classifier under a variety of distortions on any image dataset.

The remainder of the paper is organized as follows. In Section 2, related work on image distortion, model reliability, and imbalanced classification is reviewed. Our main contributions are presented in Section 3, where we describe two GP-based methods for addressing class imbalance. Experimental results are discussed in Section 4, while conclusions and future work are presented in Section 5.

2 Related Work

Image distortion. Most deep learning models are sensitive to image distortion, where a small amount of distortion can severely reduce their performance. Many methods have been proposed to detect and correct distortion in input images [7, 1]; they can be categorized into two groups: non-reference and full-reference. Non-reference methods correct the distortion without any direct comparison between the original and distorted images [8, 9]. Other works developed models that are robust to image distortion, most of which fine-tune pre-trained models on a pre-defined set of distorted images [10, 11, 12]. While these methods focus on improving model quality, which is useful in the model development phase, our work focuses on predicting model reliability, which is useful in the quality control phase.

Model reliability prediction. Assessing the reliability of a ML model is an important step in the quality control process [13]. Existing works on model reliability focus on defect/bug prediction [14], where they classify a model as “defective” if its source code has bugs. A typical solution has three main steps [15, 16]. First, we collect both “clean” and “defective” code samples from the model repository to construct a training set. Second, we rebalance the training set. Finally, we train a ML predictive model with the rebalanced training set.

Some works target other reliability aspects of a model such as relevance and reproducibility [17]. However, no existing work addresses the problem of model reliability prediction under image distortion.

Imbalance classification. As classification with imbalanced data has been studied for many years, imbalanced classification has a rich literature. Most existing methods are based on SMOTE (Synthetic Minority Oversampling Technique) [2], where synthetic minority samples are generated by linearly combining two real minority samples. Several variants have been developed to address SMOTE's weaknesses such as outliers and noise [18, 19, 20, 21, 22]. Other approaches for rebalancing data are under-sampling techniques [3], ensemble methods [23, 24], and generative models [25, 26, 27].

3 Framework

3.1 Problem statement

Let $T$ be an image-classifier, ${\cal D}=\{x_{i},y_{i}\}_{i=1}^{N}$ be a set of labeled images (i.e. the verification set), and ${\cal E}=\{E_{1},...,E_{d}\}$ be a set of $d$ image distortions, e.g. rotation, brightness, etc. Each $E_{i}$ has a value range $[l_{E_{i}},u_{E_{i}}]$, where $l_{E_{i}}$ and $u_{E_{i}}$ are the lower and upper bounds. We define a compact subset ${\cal C}$ of $\mathbb{R}^{d}$ as the set of all possible values for image distortion (i.e. ${\cal C}$ is the search space of all possible distortion levels).

We consider a mapping function $f:{\cal C}\rightarrow[0,1]$, which receives a distortion level $c\in{\cal C}$ as input and returns the accuracy of $T$ on the set of distorted images ${\cal D}_{c}^{\prime}=\{x_{i}^{\prime},y_{i}\}_{i=1}^{N}$ as output. Here, each image $x_{i}^{\prime}\in{\cal D}_{c}^{\prime}$ is a distorted version of an original image $x_{i}\in{\cal D}$, caused by the distortion level $c$. Given a threshold $h\in[0,1]$, $T$ is considered "reliable" under $c$ if $f(c)\geq h$, otherwise "non-reliable". Without loss of generality, we treat "non-reliable" as the negative label (i.e. class 0) and "reliable" as the positive label (i.e. class 1).

Our goal is to build a distortion-classifier $S$ to classify any distortion level $c\in{\cal C}$ into the positive or negative class.

3.2 Proposed method

The distortion-classifier $S$ is trained in three main steps. First, we create a training set ${\cal R}=\{c_{i},\mathbb{I}_{f(c_{i})\geq h}\}_{i=1}^{I}$, where $c_{i}$ is randomly sampled from ${\cal C}$ and $I$ is the sampling budget. However, ${\cal R}$ is often highly imbalanced, with the number of negative samples far exceeding the number of positive samples. Second, we rebalance ${\cal R}$ using an imbalance handling technique such as SMOTE or a generative model. Finally, we use the rebalanced version of ${\cal R}$ to train $S$, which can be any ML predictive model, e.g. a random forest, a neural network, etc.

We improve the quality of the training set ${\cal R}$ with two new approaches. First, instead of random sampling, we propose a GP-based sampling method to select the $c_{i}$ used to construct ${\cal R}$. Second, we further rebalance ${\cal R}$ using a novel GP-based imbalance handling technique.

3.2.1 GP-based sampling

Our goal is to obtain more positive samples when constructing ${\cal R}$. To achieve this, we treat the mapping function $f:{\cal C}\rightarrow[0,1]$ as a black-box function and approximate it with a GP. We then use the GP predictive distribution to guide our sampling process. The detailed steps are described as follows.

  1. We initialize the training set ${\cal R}_{t}$ with a small set of randomly sampled distortion levels $[c_{1},...,c_{t}]$ and compute their function values $\boldsymbol{f}_{1:t}=[f(c_{1}),...,f(c_{t})]$, where $t$ is a small number, i.e. $t\ll I$ (recall that $I$ is the sampling budget).

  2. We use ${\cal R}_{t}=\{c_{i},f(c_{i})\}_{i=1}^{t}$ to learn a GP that approximates $f$. We assume that $f$ is a smooth function drawn from a GP, i.e. $f(c)\sim\text{GP}(m(c),k(c,c^{\prime}))$, where $m(c)$ and $k(c,c^{\prime})$ are the mean and covariance functions. The predictive distribution for $f(c)$ at any point $c$ is a Gaussian distribution, with mean and variance functions:

     $\mu_{t}(c)=\boldsymbol{k}^{\mathsf{T}}K^{-1}\boldsymbol{f}_{1:t}$ (1)
     $\sigma_{t}^{2}(c)=k(c,c)-\boldsymbol{k}^{\mathsf{T}}K^{-1}\boldsymbol{k}$ (2)

     where $\boldsymbol{k}$ is a vector whose $i$-th element is $k(c_{i},c)$ and $K$ is a matrix of size $t\times t$ whose $(i,j)$-th element is $k(c_{i},c_{j})$.

  3. We iteratively update the training set ${\cal R}_{t}$ by adding a new point $\{c_{t+1},f(c_{t+1})\}$ until the sampling budget $I$ is depleted, updating the GP at each iteration. Instead of sampling $c_{t+1}$ randomly, we select $c_{t+1}$ by maximizing the following acquisition function $q(c)$:

     $q(c)=\beta\times\sigma(c)+(\mu(c)-h)$, (3)
     $c_{t+1}=\underset{c\in{\cal C}}{\text{argmax}}\ q(c)$ (4)

     where $\mu(c)$ and $\sigma(c)$ are the predictive mean and standard deviation from Equations (1) and (2). The coefficient $\beta=2\times[\log(d\times t\times\pi^{2})-\log(6\times\delta)]$ is computed following [28], where $d$ is the number of dimensions of the search space ${\cal C}$ and $\delta=0.1$ is a small constant.

  4. We construct the training set ${\cal R}=\{c_{t},\mathbb{I}_{f(c_{t})\geq h}\}_{t=1}^{I}$, where $c_{t}$ is sampled using our acquisition function in Equation (4).
Our sampling strategy achieves two goals: (1) sampling $c$ where the model's accuracy is higher than the threshold $h$ (i.e. large $(\mu(c)-h)$) and (2) sampling $c$ where the model's accuracy is highly uncertain (i.e. large $\sigma(c)$). As a result, we can efficiently find more positive samples. In the experiments, we show that our GP-based sampling method retrieves a much higher fraction of positive samples than the random sampling method.

Discussion. We want to highlight that our sampling strategy is very flexible. If $f(c)\geq h$ is the minority class, as in our setting, we use $(\mu(c)-h)$. If $f(c)<h$ is the minority class, we can simply change it to $(h-\mu(c))$.
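
To make the sampling loop concrete, the sketch below implements Equations (1)-(4) with scikit-learn's Gaussian process regressor. It is a minimal illustration under two assumptions not stated in the paper: the acquisition function is maximized over a random candidate set rather than with a continuous optimizer, and accuracy_under_distortion is a hypothetical helper standing in for $f(c)$.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def gp_based_sampling(accuracy_under_distortion, bounds, h, budget=600, n_init=20,
                      delta=0.1, n_candidates=5000, seed=0):
    """Sample distortion levels with the acquisition q(c) = beta*sigma(c) + (mu(c) - h)."""
    rng = np.random.default_rng(seed)
    d = len(bounds)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])

    # Step 1: a few random distortion levels and their accuracies f(c).
    C = rng.uniform(lo, hi, size=(n_init, d))
    f = np.array([accuracy_under_distortion(c) for c in C])

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for t in range(n_init, budget):
        gp.fit(C, f)                                        # Step 2: update the GP posterior
        beta = 2.0 * (np.log(d * t * np.pi**2) - np.log(6.0 * delta))
        cand = rng.uniform(lo, hi, size=(n_candidates, d))  # candidate points in C
        mu, sigma = gp.predict(cand, return_std=True)
        q = beta * sigma + (mu - h)                         # Equation (3)
        c_next = cand[np.argmax(q)]                         # Equation (4), approximated
        C = np.vstack([C, c_next])
        f = np.append(f, accuracy_under_distortion(c_next))

    labels = (f >= h).astype(int)                           # Step 4: labels of R
    return C, labels
```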

3.2.2 GP-based imbalance handling

We further rebalance the training set ${\cal R}$ by generating synthetic positive samples. Existing over-sampling methods such as SMOTE [2] and its variants [29] suffer from a high false positive rate. To address this problem, we propose a novel method combining SMOTE and a GP, which we call SMOTE-GP.

We use SMOTE on ${\cal R}$ to generate synthetic positive samples. Given a real positive sample $c_{i}$, a new synthetic positive sample is $\hat{c}_{i}=c_{i}+\epsilon\times(c_{j}-c_{i})$, where $c_{j}$ is another real positive sample and $\epsilon\in(0,1)$ is a random number. We denote the set of synthetic positive samples as $\hat{{\cal R}}^{+}$. However, SMOTE tends to generate false positive samples when the line connecting two real positive samples crosses the negative region, as shown in Figure 3(a). Our goal is to reject such false positive samples.

Figure 3: SMOTE (a) vs. our method SMOTE-GP (b). The black curve indicates the mapping function $f(c)$ (Section 3.1). The red line indicates the threshold $h$ separating negative samples (blue dots) from positive samples (orange dots). SMOTE generates a synthetic positive sample (green dot) along the line connecting two positive samples, which is a false positive (a). SMOTE-GP computes an uncertainty score for the synthetic positive sample via a GP variance function. As the uncertainty score is high, SMOTE-GP rejects this synthetic sample (b).

When SMOTE generates a new synthetic positive sample $\hat{c}_{i}$, it simply assigns $\hat{c}_{i}$ the label 1, without providing any confidence estimate for this assignment. If we had such a confidence measure, we could reject synthetic samples whose confidence is low (i.e. whose uncertainty is high).

To measure the uncertainty of a SMOTE assignment, we use the variance function of a GP. First, we retrieve the set of real positive samples in ${\cal R}$, denoted ${\cal R}^{+}$. Second, we train the GP using the real positive samples in ${\cal R}^{+}$ along with their function values. As the GP is trained with only real positive samples, it can approximate the generation process of SMOTE. Then, we compute an uncertainty score $u_{\hat{c}}$ for each synthetic positive sample $\hat{c}\in\hat{{\cal R}}^{+}$ (recall $\hat{{\cal R}}^{+}$ is the set of synthetic positive samples generated by SMOTE):

$u_{\hat{c}} = \sigma^{2}(\hat{c}) = k(\hat{c},\hat{c})-\boldsymbol{k}^{\mathsf{T}}K^{-1}\boldsymbol{k}$, (5)

where $\sigma^{2}(\hat{c})$ is the variance function of the GP, $\boldsymbol{k}$ is a vector whose $i$-th element is $k(c_{i},\hat{c})$, and $K$ is a matrix of size $|{\cal R}^{+}|\times|{\cal R}^{+}|$ whose $(i,j)$-th element is $k(c_{i},c_{j})$.

Finally, as $u_{\hat{c}}$ measures the uncertainty of the synthetic positive sample $\hat{c}$, we keep $\hat{c}$ if $u_{\hat{c}}$ is smaller than a threshold $\upsilon$; otherwise, we discard $\hat{c}$. The procedure of our SMOTE-GP is shown in Algorithm 1.

Input: Imbalanced training set ${\cal R}$ and uncertainty threshold $\upsilon$
Output: Rebalanced training set ${\cal R}^{*}$
begin
    ${\cal R}^{*}={\cal R}$
    generate synthetic positive samples $\hat{{\cal R}}^{+}=\text{SMOTE}({\cal R})$
    train a GP with real positive samples ${\cal R}^{+}=\{(c_{i},f(c_{i}))\mid c_{i}\in{\cal R}\wedge\text{label}(c_{i})=1\}$
    for each $\hat{c}_{i}\in\hat{{\cal R}}^{+}$ do
        compute its uncertainty score $u_{\hat{c}_{i}}$ using Equation (5)
        if $u_{\hat{c}_{i}}\leq\upsilon$ then
            ${\cal R}^{*}={\cal R}^{*}\cup\{\hat{c}_{i}\}$
        end if
    end for
end
Algorithm 1: Our SMOTE-GP algorithm.
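
A minimal Python sketch of Algorithm 1 is given below, assuming imbalanced-learn's SMOTE for over-sampling and scikit-learn's GP regressor for the uncertainty score. It treats the synthetic points appended by fit_resample as $\hat{{\cal R}}^{+}$; this ordering is an implementation assumption, not part of the method.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def smote_gp(C, labels, f_values, upsilon=0.005, seed=0):
    """Rebalance (C, labels) by keeping only low-uncertainty SMOTE samples.

    C        : (n, d) array of distortion levels in R
    labels   : (n,) array of 0/1 reliability labels
    f_values : (n,) array of accuracies f(c), used to fit the GP on real positives
    """
    # Generate synthetic positive samples; the resampled array contains the
    # original samples followed by the synthetic ones (assumed ordering).
    X_res, y_res = SMOTE(random_state=seed).fit_resample(C, labels)
    synth = X_res[len(C):]                       # \hat{R}^+

    # Fit a GP only on the real positive samples R^+ and their accuracies.
    pos = labels == 1
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(C[pos], f_values[pos])

    # Keep synthetic samples whose predictive variance (Equation 5) is <= upsilon.
    _, std = gp.predict(synth, return_std=True)
    kept = synth[std**2 <= upsilon]

    X_out = np.vstack([C, kept])
    y_out = np.concatenate([labels, np.ones(len(kept), dtype=int)])
    return X_out, y_out
```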

4 Experiments

4.1 Experiment settings

Figure 4 recalls the steps involved in the training and test phases of our model reliability prediction task under image distortion. Below, we provide their implementation details.

Figure 4: Steps involved in the training and test phases of a model reliability prediction task. The module $\text{CT}(\{c_{1},...,c_{I}\}\mid T,{\cal D},h)$ that constructs the training set ${\cal R}$ is described in Figure 2. There are two differences between our method and the baseline: (1) the GP-based sampling and (2) the SMOTE-GP.

Search space of distortion levels ${\cal C}$. We predict the reliability of image-classifiers against six image distortions, including geometric distortions [30], lighting distortion [31], and rain distortion [32]. We illustrate the six distortion types in Figure 5. The value range of each image distortion is shown in Table 1. Note that our method is applicable to any distortion type that can be defined by a range of values.

Figure 5: Original image plus six distortion types.
Table 1: List of distortions along with their domains.
Distortion Domain Description
Scale $[0.7,1.3]$ Zoom in/out 0-30%
Rotation $[0,90]$ Rotate $0^{\circ}$-$90^{\circ}$
Translation-X $[-0.2,0.2]$ Shift left/right 0-20%
Translation-Y $[-0.2,0.2]$ Shift up/down 0-20%
Darkness $[0.7,1.3]$ Darken/brighten 0-30%
Rain $[0,1]$ 0: no rain, 1: a lot of rain
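
As an illustration of how a point $c\in{\cal C}$ is turned into a distorted image, the sketch below applies the geometric and lighting distortions of Table 1 with torchvision's functional transforms. The add_rain helper is hypothetical, standing in for the rain model of [32], and the exact operators used in the paper may differ. The distorted verification set ${\cal D}^{\prime}_{c}$ is then obtained by mapping this function over every image in ${\cal D}$.

```python
import torchvision.transforms.functional as TF

def apply_distortion(img, c, add_rain=None):
    """Apply a distortion level c = {scale, rotation, tx, ty, darkness, rain}
    to a PIL image. A sketch only; the rain distortion is a stub."""
    w, h = img.size  # PIL images expose (width, height)
    out = TF.affine(
        img,
        angle=c["rotation"],                              # degrees, domain [0, 90]
        translate=(int(c["tx"] * w), int(c["ty"] * h)),   # fractions of image size
        scale=c["scale"],                                 # domain [0.7, 1.3]
        shear=0.0,
    )
    out = TF.adjust_brightness(out, c["darkness"])        # <1 darkens, >1 brightens
    if add_rain is not None and c["rain"] > 0:
        out = add_rain(out, intensity=c["rain"])          # rain model of [32], not shown
    return out
```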

Sampling method. While the baseline uses random sampling, our method uses the GP-based sampling. After the sampling process, we obtain a set of distortion levels $\{c_{1},...,c_{I}\}$, where $I=600$ is the sampling budget.

Construction of training set ${\cal R}$. Given distortion levels $\{c_{1},...,c_{I}\}$, the module $\text{CT}(\{c_{1},...,c_{I}\}\mid T,{\cal D},h)$ assigns the label 0 or 1 to each $c_{i}$ (see Figure 2). It requires the image-classifier $T$, the verification set ${\cal D}$, and the reliability threshold $h$.

For the image-classifiers $T$, we use five pre-trained models from [33] for the image datasets MNIST, Fashion, CIFAR-10, CIFAR-100, and Tiny-ImageNet. They achieved similar accuracy to those reported in [34, 35, 36]. We also use the pre-trained ResNet50 model from the Keras library (https://keras.io/api/applications/resnet/#resnet50-function) for ImageNette (https://www.tensorflow.org/datasets/catalog/imagenette). As pointed out by [12, 1], we expect these image-classifiers to lose performance when evaluated on distorted images.

For each image dataset, we use 10% of its data samples as the verification set ${\cal D}$. The size of ${\cal D}$, the accuracy of $T$ on ${\cal D}$, and the reliability threshold $h$ are shown in Table 2.

Table 2: Size of the verification set ${\cal D}$, accuracy of $T$ on ${\cal D}$, and reliability threshold $h$ used in our experiments.
Dataset $|{\cal D}|$ Accuracy of $T$ $h$
MNIST 6,000 0.9967 0.90
Fashion 6,000 0.9908 0.75
CIFAR-10 5,000 0.9902 0.85
CIFAR-100 5,000 0.9340 0.65
Tiny-ImageNet 10,000 0.6275 0.45
ImageNette 1,000 0.8290 0.70

Imbalance handling method. As the training set ${\cal R}$ is highly imbalanced, we rebalance it before training the distortion-classifier $S$. We use SMOTE-GP and compare it with state-of-the-art imbalance handling methods, including cost-sensitive learning [37], the under-sampling method NearMiss [3], the over-sampling methods SMOTE [2] and AdaSyn [38], the ensemble methods SPE [23] and MESA [24], and the generative models GAN [27], VAE [25], CTGAN, and TVAE [26].

As our method is based on SMOTE, we also compare it with SMOTE variants, including SMOTE-Borderline [20], SMOTE-SVM [21], SMOTE-ENN [19], SMOTE-TOMEK [18], and SMOTE-WB [22]. For a fair comparison, we use the source code released by the authors or implementations from well-known public libraries. The details are provided in Appendix B.

Distortion-classifier $S$. We train five popular ML predictive models on the rebalanced training set ${\cal R}^{*}$: decision tree, random forest, logistic regression, support vector machine, and neural network. Each of them is a distortion-classifier. In the test phase, we report the averaged result over the five distortion-classifiers.
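
A minimal sketch of this step is shown below, assuming scikit-learn implementations of the five predictive models and default hyperparameters; the F1 evaluation mirrors the metric described later in this section.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Train the five distortion-classifiers on the rebalanced set R* and
    return their F1-scores on the test distortion levels."""
    models = {
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
        "neural_network": MLPClassifier(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test))
    scores["average"] = float(np.mean(list(scores.values())))
    return scores
```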

Construction of test set ${\cal R}^{\prime}$. To evaluate the performance of the distortion-classifiers, we need to construct a test set. We create a grid of distortion levels $\{c_{1},...,c_{G}\}$ in ${\cal C}$. For each dimension, we use four points, resulting in $4^{6}=4{,}096$ grid points in total. For each test point $c$, we determine its label using the procedure in Figure 2. In the end, there are 4,096 test distortion levels along with their labels. We report the numbers of positive and negative test points in Appendix A.
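
The grid can be built, for instance, as a Cartesian product of equally spaced values over the domains in Table 1. The sketch below assumes four points per dimension, matching the 4,096 test points reported in Appendix A; the exact grid values used in the paper may differ.

```python
import itertools
import numpy as np

# Domains of the six distortions (Table 1).
domains = {
    "scale": (0.7, 1.3),
    "rotation": (0.0, 90.0),
    "tx": (-0.2, 0.2),
    "ty": (-0.2, 0.2),
    "darkness": (0.7, 1.3),
    "rain": (0.0, 1.0),
}

# Four equally spaced values per dimension -> 4^6 = 4,096 grid points.
axes = [np.linspace(lo, hi, 4) for lo, hi in domains.values()]
grid = np.array(list(itertools.product(*axes)))
print(grid.shape)  # (4096, 6)
```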

Evaluation metric. We evaluate each distortion-classifier on the test set ${\cal R}^{\prime}$ and compute the F1-score. As each imbalance handling method is combined with five ML predictive models to form five distortion-classifiers, we report the averaged F1-score. A higher F1-score means a better prediction.

We repeat each method three times with random seeds, and report the averaged F1-score. As the standard deviations are small (< 0.06), we do not report them to save space.

4.2 Comparison of sampling methods

We compare our GP-based sampling with random sampling. From Figure 6, our GP-based sampling obtains many more positive points than random sampling. For example, on CIFAR-10, among 600 sampled points, random sampling obtains only 27 positive samples to construct the training set ${\cal R}$. In contrast, our GP-based sampling retrieves 130 positive samples, constructing a more balanced ${\cal R}$.

Figure 6: Number of real positive samples sampled from ${\cal C}$ by random sampling and our GP-based sampling.

4.3 Comparison of imbalance handling methods

We compare our imbalance handling method SMOTE-GP with current state-of-the-art imbalance handling methods. Our method has two versions: (1) SMOTE-GP combined with random sampling and (2) SMOTE-GP combined with our GP-based sampling. We use the uncertainty threshold $\upsilon=0.05$ for CIFAR-10 and $\upsilon=0.005$ for the other datasets.

Table 3 shows that our SMOTE-GP combined with our GP-based sampling is the best method and significantly outperforms other methods. Its improvements are around 5% on MNIST, 9% on Fashion, 3% on CIFAR-10, 8% on CIFAR-100, 6% on Tiny-ImageNet, and 14% on ImageNette.

Table 3: F1-scores of our method SMOTE-GP and other imbalance handling methods. Datasets include M: MNIST, F: Fashion, C10: CIFAR-10, C100: CIFAR-100, T-IN: Tiny-ImageNet, and IN: ImageNette.
Category Sampling Imbalance M F C10 C100 T-IN IN
Standard Random None 0.3657 0.2105 0.6507 0.4130 0.3587 0.6561
Cost-sensitive Random Re-weight 0.5478 0.3940 0.6938 0.5593 0.5256 0.6677
Under-sampling Random RandomUnder 0.2553 0.1467 0.5531 0.2824 0.2870 0.3989
Random NearMiss 0.3358 0.1514 0.6588 0.4610 0.3443 0.6764
Over-sampling Random RandomOver 0.6194 0.4554 0.7230 0.5939 0.5797 0.7063
Random SMOTE 0.6157 0.4379 0.7310 0.5933 0.5658 0.7100
Random Adasyn 0.6090 0.4370 0.7306 0.5955 0.5663 0.7065
Ensemble Random SPE 0.5237 0.2984 0.7269 0.5113 0.4816 0.6808
Random MESA 0.4337 0.2120 0.6402 0.4394 0.3739 0.5802
Deep learning Random GAN 0.3202 0.2186 0.5157 0.3358 0.3551 0.4739
Random VAE 0.3831 0.2264 0.6635 0.4123 0.3821 0.6638
Random CTGAN 0.2958 0.1966 0.4124 0.2968 0.3144 0.3904
Random TVAE 0.3364 0.2124 0.5319 0.3365 0.3249 0.4673
Ours Random SMOTE-GP 0.6356 0.4611 0.7433 0.6440 0.5856 0.7929
GP-based SMOTE 0.6361 0.5327 0.7562 0.6525 0.5952 0.7988
GP-based SMOTE-GP 0.6635 0.5467 0.7616 0.6780 0.6381 0.8559

When using random sampling, our SMOTE-GP is still better than the other methods by 1-8%. Among the imbalance handling baselines, SMOTE often achieves the best results. When SMOTE is combined with our GP-based sampling, its performance improves significantly. This shows that our GP-based sampling is better than random sampling.

In general, imbalance handling methods often improve the performance of the distortion-classifier, compared to the standard distortion-classifier. Over-sampling methods are always better than under-sampling methods. Deep learning methods based on generative models do not show any real benefit.

Comparison with SMOTE variants. We also compare our SMOTE-GP with SMOTE-based imbalance handling methods in Table 4. Our method is the best, while the other SMOTE-based methods perform similarly to one another.

Table 4: F1-scores of our method SMOTE-GP and SMOTE variants.
Method MNIST Fashion CIFAR-10 CIFAR-100 Tiny-ImageNet ImageNette
SMOTE 0.6157 0.4379 0.7310 0.5933 0.5658 0.7100
SMOTE-Borderline 0.6118 0.4313 0.7246 0.5924 0.5673 0.7057
SMOTE-SVM 0.6120 0.4231 0.7335 0.6020 0.5711 0.7187
SMOTE-ENN 0.6046 0.3931 0.6752 0.5504 0.5213 0.6744
SMOTE-TOMEK 0.6155 0.4381 0.7312 0.5931 0.5658 0.7096
SMOTE-WB 0.6210 0.4511 0.7276 0.5859 0.5659 0.7079
SMOTE-GP (Ours) 0.6635 0.5467 0.7616 0.6780 0.6381 0.8559

4.4 Ablation study

We conduct further experiments on CIFAR-10 to analyze our method under different settings.

Uncertainty threshold $\upsilon$. Our SMOTE-GP uses the uncertainty threshold $\upsilon$ to filter out false positive synthetic samples. We investigate how different values of $\upsilon$ affect our performance.

Figure 7 shows that our SMOTE-GP is better than SMOTE over a wide range of $\upsilon$ values. When $\upsilon$ is too small (i.e. $\upsilon<0.01$), the F1-score may drop as most of the synthetic positive samples are filtered out. When $\upsilon$ is too large (i.e. $\upsilon>0.05$), the F1-score may also drop since many false positive samples are introduced.

Figure 7: Our F1-score vs. the uncertainty threshold $\upsilon$.

Sampling budget $I$. We investigate the effect of the number of sampling queries (i.e. the size of the sampling budget $I$) on the performance of our method.

Figure 8 shows that both methods improve as the number of sampling queries increases, as expected. More queries result in more training data and a greater chance of obtaining positive samples. However, our SMOTE-GP is always better than SMOTE by a large margin.

Figure 8: Our F1-score vs. the sampling budget $I$.

Reliability threshold $h$. We investigate how our performance changes with different reliability thresholds $h$.

Figure 9 shows that both methods reduce their F1-scores when the reliability threshold $h$ becomes larger, as the image-classifier $T$ is reliable under fewer distortion levels (i.e. there are fewer positive samples). This leads to a very highly imbalanced training set ${\cal R}$.

Figure 9: Our F1-score vs. the reliability threshold $h$.

Visualization. For a qualitative evaluation, we use t-SNE [39] to visualize the synthetic positive samples generated by each method. From Figure 10, SMOTE and its variants generate noisy synthetic samples in two situations. Only our SMOTE-GP avoids both problems.

Figure 10: Visualization of original and synthetic samples on CIFAR-10. Blue and orange dots are real negative and positive samples, while green dots are synthetic positive samples generated by each method. Compared to the original data (the top left figure), SMOTE variants suffer from two problems. First, they generate suspicious positive samples in the red circle although there is no original data in this region; only SMOTE-SVM and our method SMOTE-GP overcome this problem. Second, they generate noisy (most likely wrong) positive samples in the blue square although there are only negative samples in this region; only SMOTE-WB and our method SMOTE-GP avoid this issue. In summary, only our SMOTE-GP overcomes both cases where SMOTE and its variants generate incorrect synthetic samples.

5 Conclusion

Predicting model reliability is an important task in the quality control process. In this paper, we solve this task in the context of image distortion, i.e. we predict whether an image-classifier is reliable under a distortion level. We formulate the task as a binary classification problem with three main steps: (1) construct a training set, (2) rebalance the training set, and (3) train a distortion-classifier. As the training set is highly imbalanced, we propose two methods to handle the imbalance: (1) GP-based sampling and (2) SMOTE-GP.

In the GP-based sampling, we approximate, with a GP, the black-box function mapping a distortion level to the model's accuracy on distorted images. We then leverage the GP's mean and variance functions to guide our sampling process.

In the SMOTE-GP method, we compute an uncertainty score for each synthetic positive sample. We then filter out those whose uncertainty scores exceed a threshold.

We demonstrate the benefits of our method on six image datasets, where it greatly outperforms other baselines.

References

  • [1] X. Li, B. Zhang, P. Sander, J. Liao, Blind geometric distortion correction on images through deep learning, in: CVPR, 2019, pp. 4855–4864.
  • [2] N. Chawla, K. Bowyer, L. Hall, P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
  • [3] I. Mani, J. Zhang, kNN approach to unbalanced data distributions: a case study involving information extraction, in: ICML Workshop, Vol. 126, 2003, pp. 1–7.
  • [4] V. Sampath, I. Maurtua, J. J. Aguilar Martin, A. Gutierrez, A survey on generative adversarial networks for imbalance problems in computer vision tasks, Journal of Big Data 8 (2021) 1–59.
  • [5] B. Shahriari, K. Swersky, Z. Wang, R. Adams, N. Freitas, Taking the human out of the loop: A review of bayesian optimization, Proceedings of the IEEE 104 (1) (2016) 148–175.
  • [6] A. Fernández, S. Garcia, F. Herrera, N. Chawla, SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary, Journal of Artificial Intelligence Research 61 (2018) 863–905.
  • [7] N. Ahn, B. Kang, K.-A. Sohn, Image distortion detection using convolutional neural network, in: IEEE Asian Conference on Pattern Recognition (ACPR), 2017, pp. 220–225.
  • [8] L. Kang, P. Ye, Y. Li, D. Doermann, Convolutional neural networks for no-reference image quality assessment, in: CVPR, 2014, pp. 1733–1740.
  • [9] S. Bosse, D. Maniry, T. Wiegand, W. Samek, A deep neural network for image quality assessment, in: IEEE International Conference on Image Processing (ICIP), 2016, pp. 3773–3777.
  • [10] Y. Zhou, S. Song, N.-M. Cheung, On classification of distorted images with deep convolutional neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 1213–1217.
  • [11] S. Dodge, L. Karam, Quality robust mixtures of deep neural networks, IEEE Transactions on Image Processing 27 (11) (2018) 5553–5562.
  • [12] M. T. Hossain, S. W. Teng, D. Zhang, S. Lim, G. Lu, Distortion robust image classification using deep convolutional neural network with discrete cosine transform, in: IEEE International Conference on Image Processing (ICIP), 2019, pp. 659–663.
  • [13] F. Thung, S. Wang, D. Lo, L. Jiang, An empirical study of bugs in machine learning systems, in: International Symposium on Software Reliability Engineering, 2012, pp. 271–280.
  • [14] F. Jafarinejad, K. Narasimhan, M. Mezini, NerdBug: automated bug detection in neural networks, in: International Workshop on AI and Software Testing/Analysis, 2021, pp. 13–16.
  • [15] J. Wang, C. Zhang, Software reliability prediction using a deep learning model based on the RNN encoder–decoder, Reliability Engineering & System Safety 170 (2018) 73–82.
  • [16] G. Giray, K. E. Bennin, Ö. Köksal, Ö. Babur, B. Tekinerdogan, On the use of deep learning in software defect prediction, Journal of Systems and Software 195 (2023) 111537.
  • [17] M. M. Morovati, A. Nikanjam, F. Khomh, Z. M. Jiang, Bugs in machine learning-based systems: a faultload benchmark, Empirical Software Engineering 28 (3) (2023) 62.
  • [18] G. Batista, A. Bazzan, M. C. Monard, et al., Balancing training data for automated annotation of keywords: a case study, WoB 3 (2003) 10–8.
  • [19] G. Batista, R. Prati, M. C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter 6 (1) (2004) 20–29.
  • [20] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, 2005, pp. 878–887.
  • [21] H. Nguyen, E. Cooper, K. Kamei, Borderline over-sampling for imbalanced data classification, International Journal of Knowledge Engineering and Soft Data Paradigms 3 (1) (2011) 4–21.
  • [22] F. Sağlam, M. A. Cengiz, A novel SMOTE-based resampling technique trough noise detection and the boosting procedure, Expert Systems with Applications 200 (2022) 117023.
  • [23] Z. Liu, W. Cao, Z. Gao, J. Bian, H. Chen, Y. Chang, T.-Y. Liu, Self-paced ensemble for highly imbalanced massive data classification, in: ICDE, 2020, pp. 841–852.
  • [24] Z. Liu, P. Wei, J. Jiang, W. Cao, J. Bian, Y. Chang, MESA: Boost Ensemble Imbalanced Learning with MEta-SAmpler, in: NeurIPS, Vol. 33, 2020, pp. 14463–14474.
  • [25] D. Kingma, M. Welling, et al., An introduction to variational autoencoders, Foundations and Trends in Machine Learning 12 (4) (2019) 307–392.
  • [26] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using Conditional GAN, in: NeurIPS, Vol. 32, 2019, pp. 7335–7345.
  • [27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (11) (2020) 139–144.
  • [28] N. Srinivas, A. Krause, S. Kakade, M. Seeger, Information-theoretic regret bounds for gaussian process optimization in the bandit setting, IEEE Transactions on Information Theory 58 (5) (2012) 3250–3265.
  • [29] G. Kovács, Smote-variants: A python implementation of 85 minority oversampling techniques, Neurocomputing 366 (2019) 352–354.
  • [30] S. Gopakumar, S. Gupta, S. Rana, V. Nguyen, S. Venkatesh, Algorithmic assurance: An active approach to algorithmic testing using bayesian optimisation, in: NIPS, 2018, pp. 5466–5474.
  • [31] H. Sellahewa, S. Jassim, Image-quality-based adaptive face recognition, IEEE Transactions on Instrumentation and measurement 59 (4) (2010) 805–813.
  • [32] P. Patil, S. Gupta, S. Rana, S. Venkatesh, Video restoration framework and its meta-adaptations to data-poor conditions, in: ECCV, 2022, pp. 143–160.
  • [33] D. Nguyen, S. Gupta, K. Do, S. Venkatesh, Black-box few-shot knowledge distillation, in: ECCV, 2022.
  • [34] Y. Tian, D. Krishnan, P. Isola, Contrastive representation distillation, in: ICLR, 2020.
  • [35] D. Wang, Y. Li, L. Wang, B. Gong, Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model, in: CVPR, 2020, pp. 1498–1507.
  • [36] P. Bhat, E. Arani, B. Zonooz, Distill on the go: Online knowledge distillation in self-supervised learning, in: CVPR Workshop, 2021, pp. 2678–2687.
  • [37] N. Thai-Nghe, Z. Gantner, L. Schmidt-Thieme, Cost-sensitive learning methods for imbalanced data, in: IJCNN, 2010, pp. 1–8.
  • [38] H. He, Y. Bai, E. Garcia, S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: IJCNN, 2008, pp. 1322–1328.
  • [39] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (11) (2008) 2579–2605.

Appendix A Test set {\cal R}^{\prime}

Table 5 reports the numbers of negative and positive samples in the test set ${\cal R}^{\prime}$, which is used to evaluate the performance of the distortion-classifiers (see Figure 4).

Table 5: Test sets ${\cal R}^{\prime}$ used to evaluate the distortion-classifiers in the test phase.
Dataset #negative #positive
MNIST 3,957 139
Fashion 4,017 79
CIFAR-10 3,884 212
CIFAR-100 3,977 119
Tiny-ImageNet 3,991 105
ImageNette 3,940 156

Appendix B Implementation of baselines

For a fair comparison with other methods, we use the source code released by the authors or implementations from well-known public libraries. Table 6 shows the link to the implementation of each baseline.

Table 6: Method and its implementation link.
Method Implementation link
Re-weight https://scikit-learn.org/stable/
RandomUnder https://imbalanced-learn.org/stable/
NearMiss https://imbalanced-learn.org/stable/
RandomOver https://imbalanced-learn.org/stable/
SMOTE https://imbalanced-learn.org/stable/
SMOTE-Borderline https://imbalanced-learn.org/stable/
SMOTE-SVM https://imbalanced-learn.org/stable/
SMOTE-ENN https://imbalanced-learn.org/stable/
SMOTE-TOMEK https://imbalanced-learn.org/stable/
Adasyn https://imbalanced-learn.org/stable/
SMOTE-WB https://github.com/analyticalmindsltd/smote_variants
SPE https://github.com/ZhiningLiu1998/imbalanced-ensemble
MESA https://github.com/ZhiningLiu1998/mesa
GAN https://github.com/dialnd/imbalanced-algorithms
VAE https://github.com/dialnd/imbalanced-algorithms
CTGAN https://github.com/sdv-dev/CTGAN
TVAE https://github.com/sdv-dev/CTGAN