Imputation of Missing Data with Class Imbalance using Conditional Generative Adversarial Networks

Saqib Ejaz Awan Mohammed Bennamoun Ferdous Sohel Frank M Sanfilippo Girish Dwivedi
Abstract

Missing data is a common problem in real-world datasets. Imputation is a widely used technique to estimate the missing data. State-of-the-art imputation approaches model the distribution of observed data to approximate the missing values. Such approaches usually model a single distribution for the entire dataset, which overlooks the class-specific characteristics of the data. Class-specific characteristics are especially useful when there is a class imbalance. We propose a new method for imputing missing data based on its class-specific characteristics by adapting the popular Conditional Generative Adversarial Networks (CGAN). Our Conditional Generative Adversarial Imputation Network (CGAIN) imputes the missing data using class-specific distributions, which produces more accurate estimates of the missing values. We tested our approach on benchmark datasets and achieved superior performance compared with state-of-the-art and popular imputation approaches.

keywords:
missing data imputation , generative adversarial network , conditional generative adversarial network , class imbalance
journal: Neurocomputing

1 Introduction

The growing use of machine learning and deep learning techniques demands more and more data. One big challenge associated with real-world data is that values of certain attributes may be missing. The reasons for missingness in real-world data include equipment failure, data corruption, privacy concerns of users, or human error [20, 25]. The missing data problem is categorised into three types based on the relationship between the missing and the observed values: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) [9, 13]. MCAR occurs when the missingness is totally independent of all the variables present in the data. In MAR, the missingness is related only to the observed variables. MNAR exists when the missingness depends on both the observed and the missing variables.

To perform a task, e.g. classification or prediction, statistical and machine learning algorithms generally require complete data [20, 19, 21]. This highlights the need to handle missing data properly. A simple approach to achieve completeness is the complete-case analysis, which only uses the samples with no missing values [7]. This approach is suitable when only a few samples contain missing values and produces biased results otherwise [7]. Another approach is to replace the missing data with plausible approximations learnt from the observed data, which is called missing data imputation [20]. Simple imputation approaches replace the missing values of a variable/column with a statistical estimate, such as the mean or median of all the non-missing values of that variable/column. These approaches replace all the missing values in a variable with the same estimated value, thus underestimating the variance of the imputed values and leading to poor performance. Advanced approaches, such as Multiple Imputation by Chained Equations (MICE) [1], explore the correlation between the variables to better approximate the missing values. Some joint modelling approaches, such as Expectation Maximization (EM), assume a multivariate normal distribution and assert a joint distribution on the entire data to impute missing values [18].
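
To make this distinction concrete, the following minimal sketch (assuming scikit-learn and NumPy are available; the toy matrix and variable names are purely illustrative) contrasts a univariate mean imputer with a MICE-style iterative imputer:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with two missing entries (NaN).
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

# Univariate: every missing entry in a column receives the same column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# MICE-style: each incomplete column is regressed on the other columns and
# the imputations are refined over several rounds.
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)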

Another problem inherent in real-world data is skewed class distributions. This condition, commonly known as the class imbalance problem, arises when the majority of the data belongs to one class and significantly fewer samples belong to the remaining classes [8]. For example, in a binary classification problem, it is natural to expect only a few cases of fraudulent transactions and a significantly larger number of non-fraudulent transactions. Since most machine learning models are designed with the assumption of an equal number of samples per class, they over-classify the majority class and ignore the minority class [17]. In most cases, the minority class in real-world data is the class of interest [16], e.g., a fraudulent transaction or a cancerous image. Thus, the performance of these analytical models degrades as the class imbalance in the real-world data grows.

Imputation of missing data in imbalanced datasets is a challenging task because only a few samples are available from the minority class. In this case, an advanced imputation technique such as the Generative Adversarial Imputation Network (GAIN) [27] models one distribution for the entire data. However, the characteristics of the samples of one class may differ from those of another class, so such an approach will not produce high performance (Section 4.2). This issue is further exacerbated if the data has class imbalance, i.e., one class has more samples than the others. Therefore, intuitively, an approach which takes into account the individual class distributions can potentially provide better imputation performance. This leads us to propose our Conditional Generative Adversarial Imputation Network (CGAIN) approach, which imputes the missing data conditioned on its class, making the imputation process rely on the individual class characteristics. The main contributions of this paper are:

  1. A novel missing data imputation technique that incorporates the class distribution of the missing data.

  2. State-of-the-art imputation performance on imbalanced datasets.

The rest of this paper is structured as follows. We briefly discuss the related work on missing data imputation in Section 2. We introduce our proposed approach in Section 3 and present the experimental results and analysis in Section 4. Section 5 concludes this work.

2 Related work

Imputation of missing data is an active research area. Several imputation approaches have been proposed to complete the missing data for various applications such as medical data analysis, image inpainting, and financial data [6, 4, 10]. Missing data imputation techniques are commonly divided into two groups: single imputation and multiple imputation. Single imputation replaces the missing values with estimated values only once, while multiple imputation repeats the imputation process several times and combines the results of all imputations in the end.

Single imputation approaches are further classified into univariate and multivariate approaches. Univariate approaches use the observed values of the same column to replace the missing values with a statistical estimate such as the mean, the median, the most frequent value, or the last observation carried forward [28]. This process, however, underestimates the variance of the imputed values. Multivariate imputation approaches exploit the correlation between different columns of the data to impute the missing values. A common approach is to predict the missing values in a column using a regression model fitted on the observed columns. This process is then repeated for all the columns with missing values to complete the data.

Multiple imputation allows for the uncertainty in the missing data by creating multiple plausible imputed datasets and combining the results obtained from each of them [22]. First, multiple copies of the data are created, each with the missing values replaced by imputed values. Then, analytical methods are used to fit the model of interest to each of the imputed datasets, and the final results are produced by pooling the results from all the copies of the data.

With the recent advancements in machine learning and deep learning, new imputation approaches have been proposed which improve the imputation performance of existing approaches and have reduced the reliance on purely statistical imputation, and new approaches continue to emerge which claim to surpass the performance of existing methods. Popular new imputation approaches can broadly be categorized into discriminative and generative methods [24]. Discriminative methods, such as support vector machines and decision trees, learn and model the decision boundary between the classes present in the data to perform imputation; such approaches include MissForest and matrix completion [12, 26, 2, 3, 15]. Generative methods model the actual distribution from which the data is generated to perform imputation; popular generative imputation approaches include MICE, denoising autoencoders, and Gaussian mixture models [1, 11, 22].

Recently, GAIN [27] has been introduced for imputation, which combines a generative and a discriminative model in an adversarial manner. The two models compete with each other to excel at their respective tasks [27]. The aim of this method is to obtain a generative model which produces new samples whose distribution is so close to the original data distribution that the discriminative model is unable to distinguish between a real sample and a generated one. The generator in GAIN receives two inputs: the input data and a mask. The mask represents the presence/absence of each value: a present value is marked as 1, while an absent value is marked as 0. The discriminator predicts the complete mask, whose elements indicate the probability that the corresponding input value was observed, unlike the standard discriminator of a Generative Adversarial Network (GAN), which only tells whether its input is real or fake as a whole. GAIN also introduces a hint mechanism, which becomes an additional input of the discriminator. This approach learns to model the distribution of the entire data as a whole, which may overlook the unique characteristics of a minority class in the case of imbalanced data.

In this work, we propose a new imputation approach which aims to learn the unique class-specific characteristics and use them to impute missing data from that class. The data imputed using our approach will be based on individual class distributions rather than the entire data distribution and we hypothesize that this will produce more accurate estimates of the missing data.

3 Proposed approach: CGAIN

Our proposed CGAIN approach aims at training a generator which can produce fake data pertaining to a class. This means that our generator not only produces data, but is also aware of which class of data it has to produce. We train the generator in an adversarial manner with a discriminator. The discriminator receives the fake data and predicts, for each data element, whether it was missing or observed in the original data. This step forces the generator to produce values, for the missing positions of the data, which are very close to the original data distribution. Figure 1 shows the block diagram of our approach. The generator produces fake data using the original data with missing values, class labels, and random data. This fake data contains values for both the missing and observed positions in the original data. The discriminator receives this data and predicts which components were originally missing. The generator is given feedback (using the cross-entropy loss) on how successful the discriminator was in discriminating missing values from observed values. The generator also receives the mean squared error between the fake and original values. Based on this feedback, the generator adjusts its parameters and attempts to produce fake samples which appear real enough to fool the discriminator. At the end of the training phase, we get a generator which is capable of producing realistic (fake) values for the missing values in the original data.


Figure 1: Block diagram of the proposed CGAIN approach.

3.1 Problem Formulation for Missing Data Imputation

Suppose we represent our data with a data vector $\textbf{X}=(X_{1},X_{2},\ldots,X_{d})$, where $\textbf{X}$ is a random variable in a $d$-dimensional data space $\mathcal{X}=\mathcal{X}_{1}\times\mathcal{X}_{2}\times\ldots\times\mathcal{X}_{d}$ and $d$ denotes the dimensionality of the data. We represent the labels/outcomes of our data as a label vector $\textbf{Y}=(Y_{1},Y_{2},\ldots,Y_{d})^{T}$, where each vector element takes its values in $\{0,1\}^{m}$ and $m$ is the total number of classes/outcomes. Also assume a mask vector $\textbf{M}=(M_{1},M_{2},\ldots,M_{d})$, which takes its values in $\{0,1\}^{d}$. We can define a missing data vector $\tilde{\textbf{X}}=(\tilde{X}_{1},\tilde{X}_{2},\ldots,\tilde{X}_{d})$, which replaces each missing value with an asterisk $(*)$ as shown in Equation 1.

$\tilde{X}_{i}=\begin{cases}X_{i},&\text{if } M_{i}=1\\ *,&\text{otherwise}\end{cases}$ (1)

The goal of an imputation approach is to impute the missing values in $\tilde{\textbf{X}}$. Our missing data imputation aims at generating samples according to the conditional probability of $\textbf{X}$ given $\tilde{\textbf{X}}$ and $\textbf{Y}$, i.e., $P(\textbf{X}\,|\,\tilde{\textbf{X}},\textbf{Y})$.
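
For illustration, the notation above can be written as a small NumPy sketch, where NaN stands in for the asterisk of Equation 1 (the toy values are arbitrary):

import numpy as np

X = np.array([0.2, 1.5, -0.7, 3.1])      # complete data vector (toy values)
M = np.array([1, 0, 1, 1])               # mask: 1 = observed, 0 = missing
X_tilde = np.where(M == 1, X, np.nan)    # missing entries marked, as in Equation 1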

3.2 Proposed Conditional Generative Adversarial Imputation Networks

This section describes our CGAIN-based approach to model $P(\textbf{X}\,|\,\tilde{\textbf{X}},\textbf{Y})$. Figure 1 shows a concise summary of our approach. Our approach contains one generator and one discriminator. Our generator generates fake data using the original data with missing values, a mask which locates the missing values, the encoded class labels, and a noise matrix. The discriminator discriminates between the observed and the missing values in the data by predicting the mask matrix. The main components of our approach are discussed below.

3.2.1 Generator ($G$)

The generator receives the input data with missing values $\tilde{\textbf{X}}$, mask $\textbf{M}$, labels $\textbf{Y}$, and noise $\textbf{Z}$, and outputs a vector of imputations $\bar{\textbf{X}}$. The noise vector $\textbf{Z}=(Z_{1},Z_{2},\ldots,Z_{d})$ contains $d$-dimensional noise, while $\textbf{Y}$ is a one-hot encoded vector. The generator $G$ can be mathematically defined as $G:\tilde{\mathcal{X}}\times\mathcal{M}\times\mathcal{Z}\times\mathcal{Y}\to\mathcal{X}$, where $\mathcal{M}$, $\mathcal{Z}$, and $\mathcal{Y}$ denote the spaces of the mask, noise, and labels/outcomes, respectively.

Since $G$ outputs a value for every component rather than producing an estimate of the missing values only, we can create a completed data vector $\hat{\textbf{X}}$, which combines the observed values from $\tilde{\textbf{X}}$ and the imputed values from $\bar{\textbf{X}}$ as given in Equations 2 and 3.

$\bar{\textbf{X}}=G(\tilde{\textbf{X}}\,|\,\textbf{Y},\textbf{M},(\textbf{1}-\textbf{M})\odot\textbf{Z})$ (2)
$\hat{\textbf{X}}=\textbf{M}\odot\tilde{\textbf{X}}+(\textbf{1}-\textbf{M})\odot\bar{\textbf{X}}$ (3)

where $\odot$ represents element-wise multiplication and $\textbf{1}$ denotes a $d$-dimensional vector of 1s. As seen in Equation 3, $\hat{\textbf{X}}$ takes the observed values from $\tilde{\textbf{X}}$ and replaces each $*$ with its corresponding value from $\bar{\textbf{X}}$. This setup is inspired by a standard GAN and the generator used by [27].
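
A minimal sketch of Equation 3 (NumPy; the generator output x_bar is assumed to come from a hypothetical generator) is:

import numpy as np

def complete_data(x_tilde, m, x_bar):
    # Keep observed values where m = 1 and take generator outputs where m = 0,
    # as in Equation 3; NaN placeholders are first zeroed so arithmetic is valid.
    x_tilde = np.nan_to_num(x_tilde)
    return m * x_tilde + (1 - m) * x_bar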

3.2.2 Discriminator ($D$)

The discriminator is generally introduced as an adversary to train the generator. Conventionally, $D$ produces a single output, i.e., its input is judged to be either completely real or completely fake. In an imputation setting, however, the input to $D$ contains multiple components, some of which are real (observed) while others are fake (imputed). So, our $D$ tries to distinguish the real (observed) components from the fake (missing) components. This is achieved by predicting a mask vector $\textbf{m}$. We can then compare this predicted mask with the original mask $\textbf{M}$. Formally, our $D$ can be mathematically defined as $D:\mathcal{X}\times\mathcal{Y}\to[0,1]^{d}$, where $[0,1]^{d}$ represents the space of the predicted mask vector $\textbf{m}$.

3.2.3 Hint ($H$)

We also use a hint mechanism in our approach, similar to Yoon et al. [27]. The hint is expressed as a random variable $\textbf{H}$ which takes its values in a hint space $\mathcal{H}$. The hint vector supports $D$ by revealing some of the imputed and observed positions, which allows $D$ to decide whether the other values are imputed or observed. $\textbf{H}$ is passed as an additional input to $D$, which is then mathematically expressed as $D:\mathcal{X}\times\mathcal{H}\times\mathcal{Y}\to[0,1]^{d}$. The hint is deemed necessary because $G$ can reproduce several distributions for which $D$ cannot distinguish between a real and a fake value; giving the hint $\textbf{H}$ to $D$ restricts the solution to a single distribution. $\textbf{H}$ is obtained using Equation 4.

$\textbf{H}=\textbf{B}\odot\textbf{M}+0.5\,(\textbf{1}-\textbf{B})$ (4)

where $\textbf{B}\in\{0,1\}^{d}$ is a random variable obtained by uniformly sampling $k$ from $\{1,2,\ldots,d\}$ and applying Equation 5. The term $0.5$ in Equation 4 represents a hint value similar to that used by Yoon et al. [27].

$B_{j}=\begin{cases}1,&\text{if } j\neq k\\ 0,&\text{if } j=k\end{cases}$ (5)
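
The hint construction of Equations 4 and 5 can be sketched as follows (NumPy; this is an illustration rather than the released implementation):

import numpy as np

def sample_hint(m, rng):
    # Equation 5: B is all ones except at one uniformly sampled position k.
    d = m.shape[0]
    b = np.ones(d)
    b[rng.integers(d)] = 0.0
    # Equation 4: reveal the true mask where b = 1 and give 0.5 (no information) at k.
    h = b * m + 0.5 * (1 - b)
    return h, b

h, b = sample_hint(np.array([1.0, 0.0, 1.0, 1.0]), np.random.default_rng(0))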

3.2.4 The objective function

The objective function of our CGAIN approach has two parts as inspired by the standard Conditional Generative Adversarial Network (CGAN) [14]. First, we train D to maximize the correct prediction of M. Secondly, we train G to minimize the probability of D correctly predicting M. The overall objective function and loss function of our CGAIN approach are given in Equations 6 and 7, respectively.

$\min_{G}\max_{D}\;\mathcal{L}(D,G)$ (6)
$\mathcal{L}(D,G)=\mathbb{E}_{\hat{\textbf{X}},\textbf{M},\textbf{H},\textbf{Y}}\left[\textbf{M}^{T}\log D\big((\hat{\textbf{X}},\textbf{H})\,|\,\textbf{Y}\big)+(\textbf{1}-\textbf{M})^{T}\log\big(\textbf{1}-D((\hat{\textbf{X}},\textbf{H})\,|\,\textbf{Y})\big)\right]$ (7)

Since the output of $D$ can be expressed as $\hat{\textbf{M}}=D((\hat{\textbf{X}},\textbf{H})\,|\,\textbf{Y})$, the loss function of $D$ can be expressed by the cross-entropy in Equation 8.

$\mathcal{L}_{D}=\sum_{\forall i:\,b_{i}=0}\big[m_{i}\log(\hat{m}_{i})+(1-m_{i})\log(1-\hat{m}_{i})\big]$ (8)

where $b_{i}=0$ corresponds to those values of $\hat{\textbf{M}}$ for which $\textbf{H}$ is 0.5 according to Equation 4. This ensures that $D$ learns the mask values for which an absolute hint value (0 for missing, 1 for observed) was not provided.
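
A sketch of Equation 8 is given below (NumPy); the negation is an implementation choice on our part so that the quantity can be minimised by gradient descent, and the small epsilon guards against log(0):

import numpy as np

def discriminator_loss(m, m_hat, b, eps=1e-8):
    # Sum only over positions where the hint was uninformative (b_i = 0).
    idx = (b == 0)
    return -np.sum(m[idx] * np.log(m_hat[idx] + eps)
                   + (1 - m[idx]) * np.log(1 - m_hat[idx] + eps))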

Similar to [27], the loss function of $G$ consists of two parts, since the output of $G$ contains imputed values for both the missing and the observed positions. The first part is the loss on the imputed values, whereas the second part is the loss on the observed values. The combined loss function $\mathcal{L}_{G}$ is given in Equation 9.

$\mathcal{L}_{G}=\sum_{\forall i:\,b_{i}=0}(1-m_{i})\log(\hat{m}_{i})+\alpha\sum_{i=1}^{d}m_{i}\,L_{obs}(x_{i},x_{i}^{\prime})$ (9)

where, similar to Yoon et al. [27], $\alpha$ is a positive hyper-parameter and $L_{obs}(x_{i},x_{i}^{\prime})$ is given in Equation 10:

$L_{obs}(x_{i},x_{i}^{\prime})=\begin{cases}(x_{i}^{\prime}-x_{i})^{2},&\text{if } x_{i}\text{ is continuous}\\ -x_{i}\log(x_{i}^{\prime}),&\text{if } x_{i}\text{ is binary}\end{cases}$ (10)
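
For continuous features, Equations 9 and 10 can be sketched as follows (NumPy; as above, the adversarial term is negated for minimisation, which is an assumption rather than part of the stated equations, and the default value of alpha is purely illustrative):

import numpy as np

def generator_loss(m, m_hat, x, x_prime, b, alpha=10.0, eps=1e-8):
    # Adversarial part: push D towards 'observed' on the imputed entries (b_i = 0).
    idx = (b == 0)
    adv = -np.sum((1 - m[idx]) * np.log(m_hat[idx] + eps))
    # Reconstruction part: L_obs on the observed entries (continuous case of Equation 10).
    rec = np.sum(m * (x_prime - x) ** 2)
    return adv + alpha * rec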

3.3 CGAIN Algorithm

The training of our proposed CGAIN algorithm is inspired by the original GAN approach, which iteratively trains D and G. We designed G and D as fully connected neural networks with two hidden layers. We kept the number of neurons in each hidden layer as three times the number of columns/features in the input data.
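
One way such networks could be defined is sketched below (PyTorch); only the layer count and widths follow the description above, while the activation functions and output squashing are our assumptions:

import torch.nn as nn

def make_mlp(in_dim, out_dim, n_features):
    # Two hidden layers, each three times as wide as the number of input features.
    h = 3 * n_features
    return nn.Sequential(
        nn.Linear(in_dim, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, out_dim), nn.Sigmoid(),  # outputs in [0, 1], e.g. scaled data or mask probabilities
    )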

We first optimized $D$ with a fixed $G$ using mini-batches of 128 samples. For every mini-batch, including the corresponding labels $\textbf{Y}$, $n$ independent samples of $\textbf{Z}$, $\textbf{B}$, and $\textbf{M}$ are drawn to compute the completed data $\hat{\textbf{X}}$ according to Equations 2 and 3. The hint vector $\textbf{H}$ is then produced using Equation 4. Next, the estimated mask $\hat{\textbf{M}}$ is obtained using $D((\hat{\textbf{X}},\textbf{H})\,|\,\textbf{Y})$, followed by the optimization of $D$.

The next step is to update G by keeping the newly trained D fixed. Again, n independent samples of Z, B, and M are drawn for every mini-batch to compute H and update G. The CGAIN algorithm is presented in Algorithm 1.

Algorithm 1 Pseudo-code of the proposed CGAIN algorithm

Input: Discriminator batch size $n_{D}$, Generator batch size $n_{G}$
    Output: Trained CGAIN algorithm

  while training loss does not converge do
     (A) Discriminator optimization
     Draw $n_{D}$ data samples $\{(\tilde{x},y,m)\}$ from the dataset, noise samples $z$ from $\textbf{Z}$, and hint samples $b$ from $\textbf{B}$
     for $i=1,\ldots,n_{D}$ do
        $\bar{x}_{i}\leftarrow G(\tilde{x}_{i},y_{i},m_{i},z_{i})$
        $\hat{x}_{i}\leftarrow m_{i}\odot\tilde{x}_{i}+(1-m_{i})\odot\bar{x}_{i}$
        $h_{i}\leftarrow b_{i}\odot m_{i}+0.5\,(1-b_{i})$
     end for
     Update $D$ using stochastic gradient descent (SGD)
     (B) Generator optimization
     Draw $n_{G}$ data samples $\{(\tilde{x},y,m)\}$ from the dataset, noise samples $z$ from $\textbf{Z}$, and hint samples $b$ from $\textbf{B}$
     for $i=1,\ldots,n_{G}$ do
        $h_{i}\leftarrow b_{i}\odot m_{i}+0.5\,(1-b_{i})$
     end for
     Update $G$ using stochastic gradient descent (SGD) with $D$ fixed
  end while
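
A hedged sketch of a single iteration of Algorithm 1 is shown below (PyTorch). The modules G and D, the optimizers, the approximate sampling of B, and the zero-filling of missing entries in x_tilde are illustrative assumptions, not the released implementation:

import torch

def train_step(G, D, opt_D, opt_G, x_tilde, y, m, alpha=10.0):
    # x_tilde: data with missing entries zero-filled, m: mask, y: one-hot labels.
    z = torch.rand_like(x_tilde)                          # noise Z
    b = (torch.rand_like(m) > 1.0 / m.shape[1]).float()   # roughly one uninformative position per row
    h = b * m + 0.5 * (1 - b)                             # hint (Equation 4)

    # (A) Discriminator optimization with G fixed
    x_bar = G(torch.cat([x_tilde, m, z, y], dim=1))
    x_hat = m * x_tilde + (1 - m) * x_bar                 # Equation 3
    m_hat = D(torch.cat([x_hat.detach(), h, y], dim=1))
    loss_D = -torch.mean(m * torch.log(m_hat + 1e-8)
                         + (1 - m) * torch.log(1 - m_hat + 1e-8))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # (B) Generator optimization with D fixed
    m_hat = D(torch.cat([x_hat, h, y], dim=1))
    loss_G = (-torch.mean((1 - m) * torch.log(m_hat + 1e-8))
              + alpha * torch.mean(m * (x_bar - x_tilde) ** 2))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()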

4 Experiments and Results

We tested our proposed CGAIN approach on multiple publicly available real-world datasets from the University of California Irvine (UCI) Machine Learning Repository [5]. We compared our approach with the state-of-the-art GAIN approach [27] and other popular imputation approaches. We also evaluated our approach on various percentages of missing data ranging from 5% to 20%. In all experiments, the missing data was created in an MCAR manner by randomly removing values.
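
The MCAR missingness used throughout the experiments can be reproduced with a short sketch such as the following (NumPy; the exact sampling routine is an assumption):

import numpy as np

def introduce_mcar(X, miss_rate, rng):
    # Drop every entry independently with probability miss_rate (MCAR).
    m = (rng.random(X.shape) > miss_rate).astype(float)   # 1 = observed, 0 = removed
    x_tilde = np.where(m == 1, X, np.nan)
    return x_tilde, m

x_tilde, m = introduce_mcar(np.random.default_rng(0).random((100, 5)), 0.2, np.random.default_rng(1))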

4.1 Datasets and Methods

The details of the datasets used in this work are shown in Table 1. We tested our approach on 5 UCI repository datasets. The choice of these datasets is mainly inspired by the experiments of Yoon et al. [27]. This selection allows us to present a comparative analysis of the performance of our approach with the state-of-the-art GAIN approach.

The Breast Cancer dataset contains features computed from digitized images of cell nuclei, such as their radius, texture, and perimeter. The Spambase dataset contains email features such as the occurrence of specific words in an email and the length of sequences of consecutive capital letters. The Letter Recognition dataset contains features extracted from images of capital letters. The Default Credit Card dataset is a classification dataset for predicting whether a customer will default, based on age, the amount of given credit, and the history of past payments. The News Popularity dataset contains features of online news articles, such as the number of words in the title, the number of hyperlinks in the article, and the average word length.

The Breast Cancer, Spambase, Default Credit Card, and News Popularity datasets are binary datasets (two classes only). The Letter Recognition dataset is multi-class, with 26 classes. Our results show improved imputation performance on binary as well as multi-class datasets, highlighting the efficacy of our approach.

Table 1: Characteristics of datasets used in this work.
Dataset Number of instances Number of classes Majority class vs minority class (%) Number of attributes
Breast cancer 569 2 62.74 vs 37.26 30
Spambase 4,601 2 60.60 vs 39.40 57
Default credit card 30,000 2 77.88 vs 22.12 24
News popularity 39,644 2 53.36 vs 46.64 61
Letter recognition 20,000 26 multi-class 17

We selected GAIN [27], MICE [1], MissForest [23], and Matrix completion [2] approaches to compare with our proposed CGAIN approach. MICE is a popular statistical imputation approach, whereas GAIN, MissForest, and Matrix completion were the best-performing methods, on the datasets used in this work, in a recent study [27].

4.2 Performance of our proposed CGAIN

The comparative performance of our proposed CGAIN approach is given in Table 2 to Table 5. We report the average Root Mean Squared Error (RMSE) of 10-fold cross-validated experiments. We compare our proposed CGAIN approach with the state-of-the-art GAIN approach [27] and other popular imputation approaches. Table 2 to Table 5 show the RMSE of all the approaches with the proportion of missing data ranging from 5% to 20%. Our proposed CGAIN approach (code available at https://github.com/saqibejaz/CGAIN.git) provided superior performance compared to the other approaches on all the datasets. We used the publicly available GitHub code of GAIN (https://github.com/jsyoon0823/GAIN) in our experiments. Other approaches, such as MICE, MissForest, and imputation using matrix completion, were implemented using the publicly available Python libraries (missingpy, sklearn, and matrix_completion).
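
For reference, the reported RMSE can be computed with a sketch such as the following, under the assumption that the error is measured only over the artificially removed entries:

import numpy as np

def imputation_rmse(X_true, X_imputed, m):
    # m = 0 marks the entries that were removed and then imputed.
    missing = (m == 0)
    return np.sqrt(np.mean((X_true[missing] - X_imputed[missing]) ** 2))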

Our CGAIN consistently outperforms all the other techniques. CGAIN provides a lower RMSE (mean ± std) of 0.0643 ± 0.0014, 0.0628 ± 0.0024, 0.0673 ± 0.0039, and 0.0637 ± 0.0092 compared with the second best RMSE of 0.0658 ± 0.0022, 0.0692 ± 0.0017, 0.0689 ± 0.0058, and 0.0726 ± 0.0038 at 5%, 10%, 15%, and 20% missing values on the Breast Cancer dataset, respectively. For the Default Credit dataset, our proposed CGAIN shows a lower RMSE of 0.2329 ± 0.0039, 0.2009 ± 0.0022, 0.2314 ± 0.0035, and 0.2213 ± 0.0099 compared with GAIN's 0.2428 ± 0.0093, 0.2109 ± 0.0344, 0.2442 ± 0.0089, and 0.2426 ± 0.0090 at 5%, 10%, 15%, and 20% missing values, respectively.

Table 2: RMSE performance (mean±std ) of our proposed CGAIN approach on 5% missing data.
Dataset CGAIN GAIN MICE MissForest Matrix
Breast Cancer 0.0643±0.0014 0.1372±0.0013 0.0854±0.0013 \ul0.0658±0.0022 0.6881±0.0034
Spambase 0.0611±0.0060 \ul0.0723±0.0018 0.0747±0.0045 0.0771±0.0071 0.0943±0.0009
Letter 0.1066±0.0078 0.1437±0.0029 0.1833±0.0008 \ul0.1189±0.0018 0.4545±0.0020
Default Credit 0.2329±0.0039 \ul0.2428±0.0093 0.2479±0.0079 0.2902±0.0010 0.2565±0.0089
News 0.1964±0.0033 0.2822±0.0024 \ul0.2010±0.0025 0.2114±0.0014 0.4178±0.0015

Best results are shown in boldface, while the second best results are underlined.

Table 3: RMSE performance (mean±std ) of our proposed CGAIN approach on 10% missing data.
Dataset CGAIN GAIN MICE MissForest Matrix
Breast Cancer 0.0628±0.0024 0.0931±0.0010 0.0881±0.0054 \ul0.0692±0.0017 0.6895±0.0038
Spambase 0.0664±0.0017 \ul0.0702±0.0031 0.0793±0.0040 0.0783±0.0029 0.0906±0.0011
Letter 0.1057±0.0014 0.1309±0.0008 0.1878±0.0010 \ul0.1103±0.0021 0.4539±0.0018
Default Credit 0.2009±0.0022 \ul0.2109±0.0344 0.2491±0.0085 0.2439±0.0079 0.2559±0.0075
News 0.1937±0.0074 0.2680±0.0015 \ul0.2124±0.0013 0.2442±0.0015 0.4175±0.0016

Best results are shown in boldface, while the second best results are underlined.

Table 4: RMSE performance (mean±std ) of our proposed CGAIN approach on 15% missing data.
Dataset CGAIN GAIN MICE MissForest Matrix
Breast Cancer 0.0673±0.0039 0.0986±0.0033 0.0877±0.0056 \ul0.0689±0.0058 0.7042±0.0016
Spambase 0.0607±0.0033 \ul0.0739±0.0025 0.0784±0.0024 0.0777±0.0021 0.0902±0.0052
Letter 0.1021±0.0010 0.1326±0.0091 0.1836±0.0010 \ul0.1125±0.0012 0.4537±0.0024
Default Credit 0.2314±0.0035 \ul0.2442±0.0089 0.2479±0.0074 0.2672±0.0025 0.2565±0.0059
News 0.1992±0.0069 0.2869±0.0036 \ul0.2283±0.0015 0.2918±0.0015 0.4177±0.0015

Best results are shown in boldface, while the second best results are underlined.

Table 5: RMSE performance (mean±std ) of our proposed CGAIN approach on 20% missing data.
Dataset CGAIN GAIN MICE MissForest Matrix
Breast Cancer 0.0637±0.0092 0.1053±0.0046 0.0903±0.0064 \ul0.0726±0.0038 0.6858±0.0012
Spambase 0.0601±0.0013 \ul0.0764±0.0034 0.0796±0.0032 0.0786±0.0059 0.0896±0.0019
Letter 0.1040±0.0027 0.1302±0.0031 0.1886±0.0010 \ul0.1163±0.0013 0.4545±0.0028
Default Credit 0.2213±0.0099 \ul0.2426±0.0090 0.2480±0.0091 0.2646±0.0026 0.2537±0.0051
News 0.1931±0.0014 0.2686±0.0010 \ul0.2424±0.0022 0.3907±0.0015 0.4176±0.0015

Best results are shown in boldface, while the second best results are underlined.

4.3 Performance of CGAIN on imbalanced data

We discussed earlier that our proposed CGAIN approach takes into account the individual class distributions to impute the missing data. Therefore, we expect CGAIN to improve the RMSE of the individual classes present in the data. We performed this experiment on the Default Credit dataset and present the results in Table 6. For this experiment, we randomly deleted rows of data belonging to one class to introduce imbalance in the data. We tested the imputation performance using 10%, 25%, 40%, and 50% of the data belonging to the minority class (see column 1 of Table 6, where $n_{0}$ and $n_{1}$ show the proportions of samples belonging to the majority and minority class, respectively). As with the previous experiments, we deleted 20% of the data (in an MCAR manner) to induce missingness. We also validated this experiment on unseen test data using 10-fold cross-validation.
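
One way such imbalanced versions of a dataset could be constructed is sketched below (NumPy; the subsampling routine and its parameters are assumptions made for illustration):

import numpy as np

def subsample_minority(X, y, minority_label, minority_fraction, rng):
    # Randomly keep just enough minority rows so that they form the requested
    # fraction of the resulting dataset; all majority rows are retained.
    maj = np.where(y != minority_label)[0]
    mino = np.where(y == minority_label)[0]
    n_keep = int(len(maj) * minority_fraction / (1 - minority_fraction))
    keep = rng.choice(mino, size=min(n_keep, len(mino)), replace=False)
    idx = np.concatenate([maj, keep])
    return X[idx], y[idx]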

Table 6 shows that our proposed CGAIN approach consistently outperforms GAIN and the other imputation approaches. As the data becomes more imbalanced (e.g. $n_{1}=10\%$), our proposed CGAIN approach provides a superior RMSE of 0.2462±0.0057 compared with 0.2632 ± 0.0076, 0.2685 ± 0.0047, 0.2934 ± 0.0095, and 0.2847 ± 0.0072 for the GAIN, MICE, MissForest, and Matrix completion approaches, respectively. Table 6 also shows that the improvement in the overall RMSE of our approach arises from the improvement in the RMSE of the individual classes present in the data. As such, our proposed CGAIN approach provides better imputation performance for balanced as well as imbalanced datasets.

Table 6: Comparative performance [RMSE (mean±std)] of our proposed CGAIN approach on various imbalanced versions of the Default Credit dataset.
Proportion of classes Class CGAIN GAIN MICE MissForest Matrix
$n_{0}=90\%,\ n_{1}=10\%$ Class 0 0.2197±0.0022 0.2384±0.0034 \ul0.2203±0.0070 0.2304±0.0076 0.2317±0.0066
Class 1 0.2462±0.0057 \ul0.2632±0.0076 0.2685±0.0047 0.2934±0.0095 0.2847±0.0072
$n_{0}=75\%,\ n_{1}=25\%$ Class 0 0.2204±0.0031 0.2385±0.0014 \ul0.2280±0.0055 0.2320±0.0059 0.2327±0.0022
Class 1 0.2319±0.0054 \ul0.2491±0.0056 0.2580±0.0031 0.2760±0.0051 0.2689±0.0073
$n_{0}=60\%,\ n_{1}=40\%$ Class 0 0.2224±0.0024 0.2450±0.0019 \ul0.2321±0.0086 0.2391±0.0042 0.2384±0.0083
Class 1 0.2174±0.0097 \ul0.2301±0.0025 0.2498±0.0072 0.2538±0.0076 0.2580±0.0063
$n_{0}=50\%,\ n_{1}=50\%$ Class 0 0.2300±0.0027 0.2514±0.0009 \ul0.2353±0.0016 0.2450±0.0044 0.2458±0.0028
Class 1 0.1963±0.0077 \ul0.2238±0.0012 0.2412±0.0012 0.2459±0.0076 0.2577±0.0051

Best results are shown in boldface, while the second best results are underlined.

4.4 Computational cost analysis

A comparison of the computational time taken by the state-of-the-art GAIN approach and our proposed CGAIN approach is shown in Figure 2. We performed all our experiments on a Core i7 machine supported by an NVIDIA Quadro P5000 Graphics Processing Unit (GPU). Figure 2 shows the total time taken to perform 10-fold cross-validation of a dataset using the GAIN or CGAIN approach. Our approach takes slightly more time than the GAIN approach, which is reasonable considering the use of label encoding in both the generator and the discriminator of our approach.


Figure 2: Computational time analysis of GAIN and our proposed CGAIN approach.

5 Conclusion

In this work, we have proposed CGAIN, an approach which conditions the imputation of missing data on class labels using label encoding. This allows our approach to learn class-specific distributions to impute the missing values, especially in imbalanced scenarios. Our CGAIN approach shows superior imputation performance compared with popular approaches on publicly available datasets.

Acknowledgment

This work is partially supported by Australian Research Council Grants DP150100294 and DP150104251, and the UWA SIRF scholarship. We thank the contributors of the UCI machine learning repository who collected the data and made it publicly available. We also acknowledge NVIDIA Corporation for providing the Quadro P5000 GPU used in this work.

References

  • [1] S van Buuren and Karin Groothuis-Oudshoorn. MICE: Multivariate imputation by chained equations in R. Journal of Statistical Software, pages 1–68, 2010.
  • [2] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
  • [3] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
  • [4] Xiaobo Chen, Yingfeng Cai, Qiaolin Ye, Lei Chen, and Zuoyong Li. Graph regularized local self-representation for missing value imputation with applications to on-road traffic sensor data. Neurocomputing, 303:47–59, 2018.
  • [5] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • [6] Oleg Ivanov, Michael Figurnov, and Dmitry P. Vetrov. Variational autoencoder with arbitrary conditioning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States, May 6–9, 2019.
  • [7] Janus Christian Jakobsen, Christian Gluud, Jørn Wetterslev, and Per Winkel. When and how should multiple imputation be used for handling missing data in randomised clinical trials–a practical guide with flowcharts. BMC Medical Research Methodology, 17(1):162, 2017.
  • [8] Salman H Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A Sohel, and Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573–3587, 2017.
  • [9] Sang Kyu Kwak and Jong Hae Kim. Statistical data preparation: management of missing values and outliers. Korean Journal of Anesthesiology, 70(4):407, 2017.
  • [10] Wei-Chao Lin and Chih-Fong Tsai. Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review, 53(2):1487–1509, 2020.
  • [11] Haw-minn Lu, Giancarlo Perrone, and José Unpingco. Multiple imputation with denoising autoencoder using metamorphic truth and imputation feedback. In 16th International Conference on Machine Learning and Data Mining, MLDM 2020, Amsterdam, The Netherlands, July 20-21, 2020, Proceedings, pages 197–208.
  • [12] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.
  • [13] Diego PP Mesquita, João PP Gomes, Amauri H Souza Junior, and Juvêncio S Nobre. Euclidean distance estimation in incomplete datasets. Neurocomputing, 248:11–18, 2017.
  • [14] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [15] Kaushik Mitra, Sameer Sheorey, and Rama Chellappa. Large-scale matrix factorization with missing data under additional constraints. In Advances in Neural Information Processing Systems, pages 1651–1659, 2010.
  • [16] Krystyna Napierala and Jerzy Stefanowski. Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of Intelligent Information Systems, 46(3):563–597, 2016.
  • [17] Giang Hoang Nguyen, Abdesselam Bouzerdoum, and Son Lam Phung. Learning pattern classification tasks with imbalanced data sets. Pattern Recognition, pages 193–208, 2009.
  • [18] Md Geaur Rahman and Md Zahidul Islam. Missing value imputation using a fuzzy clustering-based EM approach. Knowledge and Information Systems, 46(2):389–422, 2016.
  • [19] Zoila Ruiz-Chavez, Jaime Salvador-Meneses, and Jose Garcia-Rodriguez. Machine learning methods based preprocessing to improve categorical data classification. In International Conference on Intelligent Data Engineering and Automated Learning, pages 297–304. Springer, 2018.
  • [20] Cátia M Salgado, Carlos Azevedo, Hugo Proença, and Susana M Vieira. Missing data. In Secondary Analysis of Electronic Health Records, pages 143–162. Springer, 2016.
  • [21] Marek Śmieja, Łukasz Struski, Jacek Tabor, Bartosz Zieliński, and Przemysław Spurek. Processing of missing data by neural networks. In Advances in Neural Information Processing Systems, pages 2719–2729, 2018.
  • [22] Dušan Sovilj, Emil Eirola, Yoan Miche, Kaj-Mikael Björk, Rui Nian, Anton Akusok, and Amaury Lendasse. Extreme learning machine for missing data using multiple imputations. Neurocomputing, 174:220–231, 2016.
  • [23] Daniel J Stekhoven and Peter Bühlmann. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2012.
  • [24] Karma Tarap. Hit and Miss: An evaluation of imputation techniques from machine learning. 2019.
  • [25] Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. Missing modalities imputation via cascaded residual autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1405–1414, 2017.
  • [26] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.
  • [27] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GAIN: missing data imputation using generative adversarial nets. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 5675–5684. PMLR, 2018.
  • [28] Zhongheng Zhang. Missing data imputation: focusing on single imputation. Annals of Translational Medicine, 4(1), 2016.