
Managing dataset shift by adversarial validation for credit scoring

Hongyi Qian [email protected] Baohui Wang [email protected] Ping Ma [email protected] Lei Peng [email protected] Songfeng Gao [email protected] You Song [email protected] School of Computer Science and Engineering, Beihang University, Beijing 100191, PR China School of Software, Beihang University, Beijing 100191, PR China HuaRong RongTong (Beijing) Technology Co., Ltd, Beijing 100033, PR China
Abstract

Dataset shift is common in credit scoring scenarios, and the inconsistency between the distribution of the training data and the data that actually needs to be predicted is likely to cause poor model performance. However, most current studies do not take this into account and directly mix data from different time periods when training their models. This brings about two problems. Firstly, there is a risk of data leakage, i.e., using future data to predict the past. This can produce inflated results in offline validation but unsatisfactory results in practical applications. Secondly, the macroeconomic environment and risk control strategies are likely to differ between time periods, and the behavior patterns of borrowers may also change, so a model trained with past data may not be applicable to the recent period. Therefore, we propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios. In this method, adversarial validation selects the training set samples whose distribution is closest to the data to be predicted for cross-validation, ensuring the generalization performance of the trained model on the predicted samples. In addition, through a simple splicing method, the training samples whose distribution is inconsistent with the test data also take part in the training process of cross-validation, which makes full use of all the data and further improves model performance. To verify the effectiveness of the proposed method, comparative experiments with several other data split methods are conducted on the data provided by Lending Club. The experimental results demonstrate the importance of dataset shift in the field of credit scoring and the superiority of the proposed method.

keywords:
Credit scoring, Dataset shift, Adversarial validation, Cross validation
journal: Information Sciences

1 Introduction

With the rapid development of internet finance in recent years, users can simply use online platforms to complete peer-to-peer transactions without going through the complex approval process of banks. As the core business of online financial institutions, credit loans not only bring huge profits to the platform but also cause many credit risk problems. How to effectively evaluate the borrowers’ solvency based on multi-dimensional information and reduce default risk has become an important research area in the academic and business community [1, 2].

With the continuous development of intelligent machine learning methods, credit scoring models have made a series of advances in balanced sampling methods [3, 4], feature selection [5, 6], and ensemble models [7, 8]. These advances have allowed credit scoring to reach new heights of accuracy, but the vast majority of studies still use traditional cross-validation schemes for data segmentation [9, 10, 11]. The whole dataset is randomly split without considering the dataset shift problem.

Dataset shift [12] is an important topic in machine learning. As shown in Fig 1, it refers to the scenario where the joint distribution of inputs and outputs differs between the training and testing phases. This inconsistency is usually caused by sample selection bias, which can lead to a loss in the generalization performance of the model on new data. Applications such as demand prediction [13], customer profiling for marketing [13], and recommender systems [13, 14] are susceptible to dataset shift. The phenomenon is particularly evident in non-stationary environments such as credit scoring [15], where changes in the macroeconomic environment and risk control strategies can invalidate models trained with past data.

Figure 1: Classifier performance loss caused by different data distribution.

However, there are few studies on dataset shift in the field of credit scoring. For example, Bravo et al. [16] proposed a model-dependent backtesting strategy designed to identify significant changes in the covariates, relating a confidence zone of the change to a maximal deviance measure obtained from the coefficients of the model. This rule-based statistical method generalizes poorly across different datasets. Maldonado et al. [17] proposed an algorithmic-level machine learning solution using a novel fuzzy support vector machine (FSVM) strategy, in which the traditional hinge loss function is redefined to account for dataset shift. However, to the best of our knowledge, no data-level machine learning solution has been proposed to solve the dataset shift problem in the field of credit scoring.

The main reason why the dataset shift problem has not been highlighted in the credit scoring field is that the major public credit scoring datasets released in the past do not provide timestamp information for their samples. Examples include German [18], Australian [18], Taiwan [18, 19], and Japan [18] in the UCI repository (https://archive.ics.uci.edu/ml/index.php), as well as PAKDD (https://pakdd.org/archive/pakdd2009/front/show/competition.html), Give Me Some Credit (https://www.kaggle.com/c/GiveMeSomeCredit), and Home Credit Default Risk (https://www.kaggle.com/c/home-credit-default-risk/data), which were provided in data mining competitions. As a result, researchers are unable to construct the training and testing sets in chronological order.

However, this changed with the release of the Lending Club (https://www.lendingclub.com/) dataset. Lending Club is a US peer-to-peer lending company headquartered in San Francisco, California. It was the first peer-to-peer lending institution to register its offerings as securities with the Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market. Lending Club has grown into the world’s largest peer-to-peer lending platform, and it provides a large amount of real credit data for practitioners and scholars to study. The provided data carry specific timestamp information that allows researchers to easily study the effect of dataset shift on a model’s performance on the latest data.

The goal of this paper is to propose a data-level machine learning solution to deal with the problem of dataset shift in credit scoring scenarios. The proposed methods are based on adversarial validation to ensure the generalization performance of the trained model on the predicted samples. Specifically, for the best solution, cross-validation is performed by selecting partial samples in the training set that are closer to the distribution of the predicted data through adversarial validation. In addition, the remaining training samples that are not consistent with the distribution of the test data are also involved in the cross-validation training process, but not in the validation, which makes full use of all the data and further improves the model performance. To sum up, the main contributions of this paper are as follows:

  1. i.

    This paper is the first to consider the dataset shift problem on Lending Club data, and it presents the first data-level machine learning solution to dataset shift in the credit scoring field. Dataset shift is an important topic in machine learning, but there is little research related to it in the credit scoring field [16, 17]. This paper recommends paying more attention to the impact of data distribution on model effectiveness, rather than just minimizing the classification error.

  2. ii.

    The method used to solve the dataset shift problem in this paper is based on adversarial validation. Uber researchers [20] have previously proposed a method that uses adversarial validation to filter features to deal with dataset shift. However, that method trades off improved generalization performance against the information lost by dropping features. In contrast, the method proposed in this paper, also based on adversarial validation, makes full use of all the data to improve the model's generalization performance.

  3. iii.

    Experiments on Lending Club data show that the method proposed in this paper achieves the best results compared with existing approaches that partition data by ordinary cross-validation or timeline filtering.

The rest of this paper is organized as follows. Section 2 presents some theoretical background of dataset shift. Section 3 details the adversarial validation based method to help balance the training and testing sets. Section 4 shows the design details and results of the experiments and discusses them. Section 5 gives the conclusion and illustrates the direction for future research.

2 Dataset shift

2.1 Definition of dataset shift

The term dataset shift was first introduced by J. Quiñonero-Candela et al. [12]. In other studies, it has also been called concept shift [21], changes of classification [22], fracture points [23] or contrast mining in classification learning [24]. Such inconsistent terminology confounds the discussion of this important problem. In this paper, the term dataset shift denotes the situation where the data used to train the classifier and the environment in which the classifier is deployed do not follow the same distribution, i.e., $P_{train}(y,x)\neq P_{test}(y,x)$.

2.2 Types of dataset shift

A classification problem consists of three parts, namely a set of features or covariates $x$, a target variable $y$, and a joint distribution $P(y,x)$. There are two kinds of classification problems according to [25]:

  • i.

    $X\rightarrow Y$ problems, where the class label is causally determined by the values of the covariates. For example, in credit scoring, user behavior, represented by the covariate space $X$, determines the class label $Y$: good or bad users.

  • ii.

    $Y\rightarrow X$ problems, where the class label causally determines the values of the covariates. Medical diagnostics typically fall into this category, where the disease, modeled as the class label $Y$, determines the symptoms, represented as covariates $X$ in the machine learning task.

There are three different types of shift for the above two kinds of problems, depending on which probabilities change or not:

  • i.

    Covariate shift represents the situation where the training and testing data distributions may differ arbitrarily, but there is only one unknown target conditional class distribution. In other words, it appears only in $X\rightarrow Y$ problems, and is mathematically defined as the case where $P_{train}(y\mid x)=P_{test}(y\mid x)$ and $P_{train}(x)\neq P_{test}(x)$.

  • ii.

    Prior probability shift is the reverse case of covariate shift. It appears only in $Y\rightarrow X$ problems, and is defined as the case where $P_{train}(x\mid y)=P_{test}(x\mid y)$ and $P_{train}(y)\neq P_{test}(y)$.

  • iii.

    Concept shift happens when the relationship between the input and class variables changes, which is defined as

    •

      $P_{train}(y\mid x)\neq P_{test}(y\mid x)$ and $P_{train}(x)=P_{test}(x)$ in $X\rightarrow Y$ problems.

    •

      $P_{train}(x\mid y)\neq P_{test}(x\mid y)$ and $P_{train}(y)=P_{test}(y)$ in $Y\rightarrow X$ problems.

Both covariate shift and concept shift can occur in the credit scoring scenario. One example of covariate shift is that, as the economy grows and per capita wages rise, an old model that uses a person's income to determine his or her credit rating may slowly fail. For concept shift, a common example is a sudden change in the macroeconomic environment or loan policy: after a loan interest rate change, the risk level may change even for the same borrower.

2.3 Causes of dataset shift

There are many possible reasons for dataset shift, the two most important of which are as follows:

Reason 1. Sample Selection Bias is a systematic defect in the data collection or labeling process, where the training set is obtained by a biased method and this non-uniform selection will cause the training set to fail to represent the real sample space. For example, in social science research, students at the researcher’s university or former research participants are more likely to be surveyed than other populations. These “easy” groups may be overrepresented in the training samples, while “difficult” groups (e.g., prisoners) may be underrepresented or completely excluded.

Quiñonero-Candela et al. [26] give a mathematical definition of sample selection bias:

  • i.

    $P_{train}=P(s=1\mid y,x)\,P(y\mid x)\,P(x)$ and $P_{test}=P(y\mid x)\,P(x)$ in $X\rightarrow Y$ problems.

  • ii.

    $P_{train}=P(s=1\mid x,y)\,P(x\mid y)\,P(y)$ and $P_{test}=P(x\mid y)\,P(y)$ in $Y\rightarrow X$ problems.

where $s$ is a binary selection variable that decides whether an instance is included in the training samples ($s=1$) or rejected from them ($s=0$).

The problem of operating under sample selection bias has received substantially more attention in other domains than it has in the machine learning community [25]. In the credit scoring literature it goes by the name of reject inference, because potential credit applicants who are rejected under the previous model are not available to train future models [27, 28].

Reason 2. Non-stationary environments are often caused by temporal or spatial changes and are very common in real-world applications. Depending on the type of classification problem, a non-stationary environment can lead to different kinds of shift:

  • i.

    In $X\rightarrow Y$ problems, a non-stationary environment could create changes in either $P(x)$ or $P(y\mid x)$, generating covariate shift or concept shift, respectively.

  • ii.

    In $Y\rightarrow X$ problems, it could generate prior probability shift with a change in $P(y)$, or concept shift with a change in $P(x\mid y)$.

Non-stationary environments often appear in adversarial classification problems such as network intrusion detection [29], spam detection [30, 31] and fraud detection [32, 33]. The presence of an adversary trying to bypass the existing classifier can introduce any kind of dataset shift, and the bias can change dynamically. This kind of problem is receiving more and more attention in the machine learning community [34, 35].

2.4 Solutions for dataset shift

A common approach to cope with the dataset shift problem in production systems is monitoring and redeploying, i.e., continuously monitoring model performance and retraining the model with new data when performance degrades but has not yet completely failed [13]. Throughout this process, the model goes through a cycle of being built, deployed, deprecated, and rebuilt, and this cyclical strategy is important for maintaining an accurate and stable system. In the field of credit scoring, the Kolmogorov-Smirnov statistic (KS) and the Population Stability Index (PSI) are often used as indicators to monitor model performance [36]. Although this approach can reduce the impact of dataset shift to some extent, it still encounters many challenges in practical use. For example, model predictions usually take some time to show their effects, and there is no immediately observable feedback. To solve this problem, some scholars have proposed intelligent machine learning methods to deal with dataset shift directly, which fall into two main categories: algorithmic-level and data-level solutions.
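
As a minimal illustration of the PSI monitoring indicator mentioned above (a sketch only; the decile-binning convention and the clipping constant are common practical choices rather than details specified in this paper):

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline score sample (e.g. training-time scores) and a
    recent sample, using decile bins derived from the baseline distribution."""
    edges = np.unique(np.percentile(expected, np.linspace(0, 100, n_bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf              # cover the full value range
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_frac = np.clip(exp_frac, 1e-6, None)           # avoid log(0) and division by zero
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```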

Algorithmic-level solutions propose models that are robust in the presence of dataset shift. There are many types of algorithmic-level solutions, of which Bayesian models are a common one, but they are mostly designed for regression rather than classification problems, such as stock price [37] or temperature prediction [38]. In the scenario of online learning or data stream mining, many algorithmic-level solutions have also emerged. For example, Kolter et al. [39] proposed dynamic weighted majority (DWM) for concept drift, which dynamically creates and removes weighted experts in response to changes in performance. In addition, Mello et al. [40] combined the twin support vector machine with a fuzzy membership function (FBTWSVM) to deal with large datasets and learn from data streams. In the field of credit scoring, Maldonado et al. [17] proposed a generalized version of the hinge loss function that applies aggregation operators to deal with dataset shift via fuzzy logic.

For data-level solutions, a common approach is the forgetting mechanism [13, 21], which removes outdated samples with a sliding time window. There is a trade-off between the system's reactivity and its robustness to noise: the more abrupt the forgetting, the faster the reaction to change, but the higher the risk of capturing noise. Another common data-level solution is to use novel variants of cross-validation [41, 42, 43], which have the advantage of not requiring timestamps because they are designed to evaluate changes in the data distribution.

3 Methodology

3.1 Adversarial validation

Adversarial validation is a method to detect dataset shift. It requires training a binary classifier that judges whether a sample comes from the training set or the testing set. If the performance of this classifier is close to that of random guessing, it is difficult to distinguish whether a sample comes from the training set or the testing set, i.e., the two distributions are relatively consistent. On the contrary, if the classifier performs much better than random guessing, the sample distributions of the training set and testing set are inconsistent.

It should be noted that the adversarial validation discussed in this paper is not the same as the adversarial training introduced in Generative Adversarial Networks (GAN) [44]. The GAN framework corresponds to a minimax two-player game that simultaneously trains two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than from $G$. GAN is becoming more and more popular in the field of content generation [45, 46]. In the field of credit scoring, GAN has been used to solve the sample imbalance problem [47].

Specifically, the process of adversarial validation can be divided into three steps (a code sketch follows the list):

  1. i.

    For the original dataset $\{train\_X, train\_y, val\_X, val\_y\}$, remove the old label columns $\{train\_y, val\_y\}$ and add new label columns $\{train\_y_s, val\_y_s\}$ that mark the source of the data, labeling the samples in the training set as 0 (i.e., $train\_y_s=0$) and the samples in the testing set as 1 (i.e., $val\_y_s=1$).

  2. ii.

    Train the classifier on the dataset $\{train\_X, train\_y_s, val\_X, val\_y_s\}$ with the newly labeled column. The output of the classifier is the probability that a sample belongs to the testing set. In this paper, 5-fold cross-validation is used.

  3. iii.

    Observe the results of the classifier. If the model performance is poor (AUC score is close to 0.5), it indicates that the classifier cannot distinguish whether the samples come from the training set or the testing set. It can be judged that the distribution of the training set and testing set in the original data is consistent. On the contrary, if the model performance is high (AUC score is close to 1), it indicates that the classifier can easily distinguish sample sources. It can be determined that the training set and the testing set are very different in data distribution.
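
The three steps above can be sketched as follows (a minimal sketch using LightGBM and scikit-learn; the `params` dictionary is assumed to carry a binary objective, the AUC metric and an early-stopping setting, e.g. the values later listed in Table 4, and the function name is illustrative only). The out-of-fold probabilities returned for the training samples are reused by the three methods described in Section 3.2.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def adversarial_validation(train_X: pd.DataFrame, test_X: pd.DataFrame,
                           params: dict, n_splits: int = 5):
    """Train a train-vs-test classifier with 5-fold CV; return its AUC and the
    out-of-fold probability of each training sample belonging to the testing set."""
    X = pd.concat([train_X, test_X], axis=0, ignore_index=True)
    y_source = np.r_[np.zeros(len(train_X)), np.ones(len(test_X))]  # 0 = train, 1 = test

    oof = np.zeros(len(X))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for tr_idx, va_idx in skf.split(X, y_source):
        booster = lgb.train(
            params,
            lgb.Dataset(X.iloc[tr_idx], label=y_source[tr_idx]),
            valid_sets=[lgb.Dataset(X.iloc[va_idx], label=y_source[va_idx])],
        )
        oof[va_idx] = booster.predict(X.iloc[va_idx])

    auc = roc_auc_score(y_source, oof)   # close to 0.5: consistent distributions; close to 1: shifted
    return auc, oof[:len(train_X)]       # per-sample similarity of the training data to the testing set
```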

In addition to detecting inconsistent data distributions, the results of adversarial validation can further help balance the training and testing sets and improve model performance on the testing set. This technique appears in some data competitions, but there is still relatively little published research on it. Researchers at Uber [20] proposed an adversarial validation based approach that addresses the concept drift commonly found in large-scale user targeting automation systems, where the distribution of complex user features keeps evolving. They use the feature importance obtained from adversarial validation to sequentially filter out the most inconsistently distributed features. However, this method trades off improved generalization performance against the information lost by dropping features from the model. The method proposed in this paper improves the generalization performance of the model on the testing set without losing information.

3.2 The adversarial validation based method to deal with dataset shift

The method proposed in this paper based on adversarial validation can not only judge whether the dataset distribution is consistent, but also further balance the training and testing sets. Specifically, the gradient boosting decision tree (GBDT) [48], a boosting model that iteratively fits the residuals during training, is used as the classifier in both the adversarial validation and credit scoring phases. It was chosen because GBDT-based methods have proven very effective in recent credit scoring research [49, 50]. Apart from that, GBDT has a very efficient modern implementation, LightGBM [51] (https://github.com/microsoft/LightGBM), which has won many data competitions and is used in this paper to build the model.

There are many ways to use the results of adversarial validation, and a total of three schemes are proposed in this paper.

Method 1. Use the adversarial validation results as sample weights added to the training process.

The adversarial validation probability of a training set sample represents its similarity to the testing set, which can be used to determine how much it is involved in the actual modeling process. If a sample's distribution is closer to that of the testing set, it can be given a higher weight during training. Conversely, a lower weight is assigned. Modern GBDT libraries can specify the weight of each sample during training, which makes it convenient to control the contribution of each sample. Specifically, the “weight” parameter of the LightGBM Dataset API is set to change the weight of each instance.
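
A minimal sketch of Method 1, assuming `adv_prob` holds the adversarial validation probabilities of the training samples (e.g. the out-of-fold output of the sketch in Section 3.1) and `params` the LightGBM settings of Table 4; the paper applies this inside 5-fold cross-validation, a single split is shown here for brevity:

```python
import lightgbm as lgb

# Per-sample weights: the closer a training sample is to the testing-set
# distribution, the larger its contribution to the loss during training.
weighted_train = lgb.Dataset(train_X, label=train_y, weight=adv_prob)
booster = lgb.train(
    params,
    weighted_train,
    valid_sets=[lgb.Dataset(val_X, label=val_y)],  # early stopping on validation AUC
)
test_pred = booster.predict(test_X)
```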

Method 2. Use only the data with the top-ranked adversarial validation results for 5-fold cross-validation.

Apart from reducing the weight of samples inconsistent with the distribution of the testing set, such samples can also be removed from the training process. In particular, the training data can be divided into two parts by thresholding the adversarial validation probabilities at a certain value. The samples that are more consistent with the testing set distribution are called $data\_X_a = \{train\_X_a, val\_X_a\}$, and the remaining samples are called $data\_X_b$, with $P_{data\_X_a} \approx P_{test\_X} \neq P_{data\_X_b}$. Only $data\_X_a$ is retained for 5-fold cross-validation.

As a result, model evaluation metrics on the validation data should give similar results on the testing data, which means that if the model works well on the validation data, it should also work well on the testing data. Specifically, the LightGBM parameter “early_stopping_rounds” is set during training, and the model stops training when the AUC on the validation set $val\_X_a$ has not improved for a certain number of iterations. The model with the best validation result is retained and used to predict the testing set.
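
A sketch of Method 2 under the same assumptions; `keep_frac`, the fraction of training samples retained as $data\_X_a$, is a tuning choice examined in Section 4:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

keep_frac = 0.75                                   # keep the most test-like 75% of training samples
threshold = np.quantile(adv_prob, 1 - keep_frac)
mask_a = adv_prob >= threshold                     # data_X_a: close to the testing-set distribution
data_Xa, data_ya = train_X[mask_a], train_y[mask_a]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_preds = []
for tr_idx, va_idx in skf.split(data_Xa, data_ya):
    booster = lgb.train(
        params,  # early_stopping_rounds in params halts training on the fold's validation AUC
        lgb.Dataset(data_Xa.iloc[tr_idx], label=data_ya.iloc[tr_idx]),
        valid_sets=[lgb.Dataset(data_Xa.iloc[va_idx], label=data_ya.iloc[va_idx])],
    )
    test_preds.append(booster.predict(test_X))
test_pred = np.mean(test_preds, axis=0)            # average the 5 fold models on the testing set
```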

Method 3. All data are used for training, and only the data with the top-ranked adversarial validation results are used for validation.

Although Method 2 alleviates the problem of inconsistent data distribution between the training and testing sets, it has a drawback in data utilization: only $data\_X_a$ is used in the whole training process, and $data\_X_b$ is wasted. To solve this problem, $data\_X_b$ is added to the training data of each fold of the 5-fold cross-validation to assist training, but it does not participate in the validation. This not only maintains the consistency between validation and testing results, but also makes full use of all the data.
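
A sketch of Method 3, reusing the split from the Method 2 sketch: the rejected samples $data\_X_b$ are appended to the training part of every fold, while validation and early stopping still use only $data\_X_a$:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

data_Xb, data_yb = train_X[~mask_a], train_y[~mask_a]   # samples far from the testing-set distribution

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_preds = []
for tr_idx, va_idx in skf.split(data_Xa, data_ya):
    # splice data_X_b into each fold's training portion only
    fold_X = pd.concat([data_Xa.iloc[tr_idx], data_Xb], axis=0)
    fold_y = pd.concat([data_ya.iloc[tr_idx], data_yb], axis=0)
    booster = lgb.train(
        params,
        lgb.Dataset(fold_X, label=fold_y),
        valid_sets=[lgb.Dataset(data_Xa.iloc[va_idx], label=data_ya.iloc[va_idx])],
    )
    test_preds.append(booster.predict(test_X))
test_pred = np.mean(test_preds, axis=0)
```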

4 Experimental study

4.1 Data collection

The dataset used in this paper comes from Lending Club, whose timestamp information makes it possible to divide the training and testing sets in strict chronological order. The data include samples from 2018M1 to 2020M9, a period of 33 months (the representation of time in this paper consists of the year before the letter M and the month after it, so 2018M1 represents January 2018). The original dataset contains 1,160,066 samples with 151 features and “loan_status” as the target variable. As shown in Table 1, “loan_status” has 8 states: “Current”, “Fully Paid”, “Charged Off”, “Late (31-120 days)”, “In Grace Period”, “Issued”, “Late (16-30 days)”, and “Default”. Following the practice of previous papers [3, 52], only the samples with “Charged Off” and “Fully Paid” status are taken as positive and negative samples respectively; all loans with other statuses are filtered out because their final status is unknown. This results in an imbalanced dataset containing 276,685 samples with a positive sample ratio of 21.93%.

Table 1: Dataset characterization.
Loan status Amount
Current 859,320
Fully Paid 216,019
Charged Off 60,666
Late (31-120 days) 12,283
In Grace Period 7,476
Issued 2,062
Late (16-30 days) 2,056
Default 184
Total 1,160,066

The number of samples in each month is shown in Fig 2. In chronological order, the data of the 18 months from 2018M1 to 2019M6 are taken as the training set, containing 247,276 samples in total, and the data from 2019M7 to 2020M9 are taken as the testing set, including 29,409 samples.
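
A sketch of this label construction and chronological split, assuming the raw Lending Club export has been saved as a CSV; the timestamp column name (“issue_d”), its month format, and the file name are assumptions about the raw file rather than details stated in the paper:

```python
import pandas as pd

raw = pd.read_csv("lending_club_2018_2020.csv", low_memory=False)  # hypothetical file name

# Keep only finished loans; "Charged Off" is the positive (default) class
data = raw[raw["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
data["loan_status"] = (data["loan_status"] == "Charged Off").astype(int)

# Strictly chronological split: 2018M1-2019M6 for training, 2019M7-2020M9 for testing
issue_date = pd.to_datetime(data["issue_d"], format="%b-%Y")   # e.g. "Jan-2018" (assumed format)
train = data[(issue_date >= "2018-01-01") & (issue_date < "2019-07-01")]
test = data[(issue_date >= "2019-07-01") & (issue_date <= "2020-09-30")]
```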

Figure 2: Sample size distribution by month.

Many features in the original data have a high proportion of missing values, and some of the remaining variables are unavailable to an investor before deciding to fund the loan. As a result, 25 variables are actually used, including the target variable “loan_status”, and each field is described in Table 2.

The specific processing methods for some variables are as follows (a short preprocessing sketch follows the list):

  • i.

    The original “emp_length” variable contains <1 year, 1 year, ..., 9 years, and 10+ years, a total of 11 categories, which are transformed into integer values from 0 to 10 for easier use;

  • ii.

    The original FICO score provides two values, “fico_range_low” and “fico_range_high”, whose difference is always 4 points; to remove this redundancy, the two are replaced by their average;

  • iii.

    Log transformation is performed on numerical variables with power-law distribution, including “annual_inc” and “revol_bal”;

  • iv.

    For category variables, including “sub_grade”, “home_ownership”, “verification_status”, “initial_list_status”, “purpose”, “addr_state”, and “application_type”, we rely on LightGBM's built-in support for categorical features.
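
A sketch of steps i-iv, applied to a pandas DataFrame `df` restricted to the Table 2 columns; the exact “emp_length” category strings and the base-10 logarithm are assumptions (the latter is consistent with the magnitudes reported in Table 3):

```python
import numpy as np

# i. map employment length categories to integers 0-10
emp_map = {"< 1 year": 0, "1 year": 1, "10+ years": 10}
emp_map.update({f"{i} years": i for i in range(2, 10)})
df["emp_length"] = df["emp_length"].map(emp_map)

# ii. replace the redundant FICO pair by its average
df["fico_score"] = (df["fico_range_low"] + df["fico_range_high"]) / 2
df = df.drop(columns=["fico_range_low", "fico_range_high"])

# iii. log-transform the power-law distributed amounts
for col in ["annual_inc", "revol_bal"]:
    df[f"log_{col}"] = np.log10(1 + df[col])
df = df.drop(columns=["annual_inc", "revol_bal"])

# iv. mark string variables as pandas categoricals so that LightGBM's
# built-in categorical handling is applied automatically
cat_cols = ["sub_grade", "home_ownership", "verification_status",
            "initial_list_status", "purpose", "addr_state", "application_type"]
for col in cat_cols:
    df[col] = df[col].astype("category")
```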

Table 2: Indicator descriptions.
LoanStatNew Description
addr_state The state provided by the borrower in the loan application.
annual_inc The self-reported annual income provided by the borrower during registration.
application_type Indicates whether the loan is an individual application or a joint application with two co-borrowers.
dti Borrower’s total monthly debt payments divided by the borrower’s self-reported monthly income.
earliest_cr_line The month the borrower’s earliest reported credit line was opened.
emp_length Employment length in years.
fico_range_high The upper boundary range the borrower’s FICO at loan origination belongs to.
fico_range_low The lower boundary range the borrower’s FICO at loan origination belongs to.
home_ownership The home ownership status provided by the borrower during registration.
initial_list_status The initial listing status of the loan.
installment The monthly payment owed by the borrower if the loan originates.
int_rate Interest Rate on the loan.
loan_amnt The listed amount of the loan applied for by the borrower.
loan_status Current status of the loan.
mort_acc Number of mortgage accounts.
open_acc The number of open credit lines in the borrower’s credit file.
pub_rec Number of derogatory public records.
pub_rec_bankruptcies Number of public record bankruptcies.
purpose A category provided by the borrower for the loan request.
revol_bal Total credit revolving balance.
revol_util Revolving line utilization rate, or the amount of credit relative to all available revolving credit.
sub_grade LC assigned loan subgrade.
term The number of payments on the loan. Values are in months and can be either 36 or 60.
total_acc The total number of credit lines currently in the borrower’s credit file.
verification_status Indicates if income was verified by LC, not verified, or if the income source was verified.

After the above processing, the actual number of variables input to the model is 23. Table 3 shows the statistical characteristics of the numerical variables.

Table 3: Indicators statistics.
variable name count mean std min 25% 50% 75% max
loan_amnt 276685 15202.28175 10022.38 1000 7500 12000 20000 40000
term 276685 42.295072 10.55719 36 36 36 60 60
int_rate 276685 13.27946 5.383279 5.31 8.81 12.4 16.46 30.99
installment 276685 454.716117 291.7174 28.77 238.02 372.88 619.47 1676.23
emp_length 252697 5.729265 3.757685 0 2 5 10 10
loan_status 276685 0.21926 0.413746 0 0 0 0 1
dti 275967 19.339198 21.08527 0 11.08 17.37 24.68 999
earliest_cr_line 276685 2002.429882 7.689656 1944 1999 2004 2007 2017
open_acc 276685 11.509359 5.920323 0 7 10 14 86
pub_rec 276685 0.145812 0.39563 0 0 0 0 52
revol_util 276289 40.718864 25.19198 0 20.4 38 59 146.3
total_acc 276685 23.802024 12.615 2 15 22 31 148
mort_acc 276685 1.456483 1.784587 0 0 1 2 24
pub_rec_bankruptcies 276685 0.136889 0.350828 0 0 0 0 7
log_annual_inc 276685 4.816797 0.35534 0 4.676703 4.832515 4.986776 6.968483
fico_score 276685 709.598766 36.96096 662 682 702 732 847.5
log_revol_bal 276685 3.883877 0.687627 0 3.665769 3.990605 4.26257 6.167597

4.2 Parameter set-up

The same LightGBM hyperparameter settings are used in both the adversarial validation and credit scoring phases. Specifically, “num_boost_round” is set to 50000, which is a relatively large value; however, by setting the “early_stopping_rounds” parameter, the model stops training if the validation AUC does not improve within 200 rounds. This not only ensures sufficient training but also prevents over-fitting. The detailed hyperparameter settings are shown in Table 4, and a corresponding parameter sketch follows the table.

Table 4: Parameters of LightGBM.
Parameters Values
num_boost_round 50000
early_stopping_rounds 200
learning_rate 0.1
max_depth 4
num_leaves 8
colsample_bytree 0.8
subsample 0.8
subsample_freq 3
the others default
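
Expressed as a LightGBM parameter dictionary, the Table 4 settings might look as follows (a sketch; the binary objective and AUC metric are implied by the task and by the AUC-based early stopping rather than listed in the table, and all other parameters keep their library defaults):

```python
import lightgbm as lgb

params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.1,
    "max_depth": 4,
    "num_leaves": 8,
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "subsample_freq": 3,
}

booster = lgb.train(
    params,
    lgb.Dataset(train_X, label=train_y),
    num_boost_round=50000,
    valid_sets=[lgb.Dataset(val_X, label=val_y)],
    callbacks=[lgb.early_stopping(stopping_rounds=200)],  # stop when validation AUC stalls for 200 rounds
)
```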

4.3 Experiment set-up

As shown in Fig 3, in order to demonstrate the effect of adversarial validation, a total of 5 sets of experiments are set up. The testing set data of each experiment are all from 2019M7 to 2020M9, while the training and validation sets are divided in different ways.

Figure 3: Experimental setup.

Experimental Set 1. Select data according to chronological order for 5-fold cross-validation.

A fixed time point is set and only data after that time point are used for 5-fold cross-validation. Specifically, the starting month of the cross-validation data is selected from 2018M1 to 2019M6, for a total of 18 experiments. Among these, the experiment with starting point 2018M1, which uses all the data for training, serves as a benchmark.

Experimental Set 2. Select data according to chronological order, training data before, validation data after.

Different from the 5-fold cross-validation used in Experimental Set 1, Experimental Set 2 only uses data closer to the testing set for validation. Specifically, there are three choices of data time range: all data, the data from 2018M7 onward, and the data from 2019M1 onward. Each of these three groups is divided into training and validation data along the timeline, for a total of 17+11+5=33 experiments.

Experimental Set 3. Use the adversarial validation results as sample weights added to the training process.

The output probability of the adversarial validation classifier for each sample, i.e., its similarity to the testing set samples, is directly used as its weight in the training process. All data are used in this single experiment, and there is no need to divide them by time or quantile.

Experimental Set 4. Use only the data with the top-ranked adversarial validation results for 5-fold cross-validation.

The ranking of the output probability of the sample by the adversarial validation classifier is regarded as the criterion of data partitioning. Data that is more inconsistent with the distribution of the testing set will be discarded, and the remaining data will be subjected to 5-fold cross-validation. Specifically, 0%, 5% … 90%, 95% of the data were discarded, respectively, for a total of 20 experiments.

Experimental Set 5. All data are used for training, and only the data with the top-ranked adversarial validation results are used for validation.

Although Experimental Set 4 constructs a dataset that is more consistent with the testing set distribution for cross-validation, it wastes a lot of data. This can also harm model performance, especially when the amount of discarded data is large. Experimental Set 5 further optimizes Experimental Set 4 by adding the discarded data, which are inconsistent with the testing set distribution, to the cross-validation training data, while keeping them out of the validation. This not only addresses the problem of dataset shift, but also makes full use of all data. Similarly, the data are divided according to the ranking of the adversarial validation output probabilities, and the number of experiments is the same as in Experimental Set 4, for a total of 20 experiments.

There is a total of 18+33+1+20+20=92 experiments in the 5 sets, comparing the testing set performance of models trained on data divided by time or by adversarial validation results.

4.4 Results and discussion

Figure 4: Experimental Set 1 results.

Analysis of Experimental Set 1.

Fig 4 shows the results of Experimental Set 1. As the starting month of the selected data moves later, the validation set AUC shows a gradually declining trend, and the decline accelerates as the amount of selected data decreases. For the testing set, the AUC fluctuates within a stable range as long as the selected data start before 2019M2, and only after that does it begin to show a clear decreasing trend. Although adding data far from the testing set period to training improves the offline validation results, it does not improve performance on the test data. This confirms that the dataset shift problem does exist: the distribution of data accumulated in the past is not consistent with that of the data to be predicted in the recent period, which harms model generalization. The 5-fold cross-validation with all the data can be used as a benchmark, i.e., the starting month of the selected data is set to 2018M1; in this case the testing set AUC is 0.7237. Among all the experiments in Experimental Set 1, selecting the data from 2018M2 onward for cross-validation performs best, with the testing set AUC reaching 0.7256.

Analysis of Experimental Set 2.

Fig 5 shows the results of Experimental Set 2. From Fig 5 (a) to (c), regardless of whether the selected range of training and validation data starts from 2018M1, 2018M7, or 2019M1, as more data are assigned to the training set, the AUC of the validation set and the testing set both show an increasing trend, and the gap between them gradually decreases. This indicates that postponing the time point that splits the training and validation sets improves both the model performance and the consistency between validation and testing results.

Fig 5 (d) integrates the testing set AUC results of the three sub-experiments, and the optimal results achievable by the three are relatively close. The best result occurs when the data from 2018M7 to 2019M5 are used as the training set and the 2019M6 data as the validation set, with the testing set AUC reaching 0.7220. This result is lower than directly using all the original data for 5-fold cross-validation, since the 2019M6 data, which are closest to the testing set distribution, are only involved in validation and not in training.

Figure 5: Experimental Set 2 results.

Analysis of Experimental Set 3.

Experimental Set 3 uses all the data for the 5-fold cross-validation, and the prediction results of the adversarial validation are added as sample weights in the training process, so there is only 1 experiment. The AUC result of adversarial validation is 0.9681, much higher than 0.5, which indicates that the classifier can easily distinguish the training data from the test data, and the two are indeed inconsistent in distribution.

In Experimental Set 3, the final AUC values obtained on the validation and testing sets are 0.7149 and 0.7202, respectively, which is inferior to the benchmark of using the full data directly for 5-fold cross-validation. This indicates that changing only the sample weights, without changing the sample selection, does not effectively solve the dataset shift problem.

Analysis of Experimental Set 4.

According to the results of adversarial validation, Experimental Set 4 selects the data above a given quantile for 5-fold cross-validation. Fig 6 (a) shows the results: as the adversarial validation probability quantile increases, the amount of training data also increases, and the testing set AUC first rises gradually and then fluctuates within a relatively stable range. When the quantile is set to 75%, the testing set AUC reaches a maximum of 0.7315, an improvement over the previous three experimental sets. When the quantile is small, the model performance degrades greatly due to the lack of available training data, exposing the drawback that this method fails to make full use of all the data.

Analysis of Experimental Set 5.

In Experimental Set 4, the choice of quantile has a large impact on the experimental results. Experimental Set 5 adds the lower-ranked data to the training data of each fold, without letting them participate in the validation phase. This preserves the model's ability to generalize to the testing set while making full use of all the data. Fig 6 (b) illustrates the results: the testing set AUC of Experimental Set 5 fluctuates much more stably. When the quantile is chosen to be 40%, the testing set AUC reaches a maximum of 0.7327, which is also the highest score among all experiments.

Figure 6: Experimental Set 4 and 5 results.

In Experimental Sets 4 and 5, the data partitioning quantiles that achieve the optimal accuracy are 75% and 40%, respectively. Fig 7 shows the monthly distribution of the selected data. The closer a month is to the testing set period, the higher the proportion of that month's original data that is selected, which is consistent with the observation that data from similar time periods have more consistent distributions.

Figure 7: Comparison of sample distribution under different sampling quantiles according to the results of adversarial validation.

Comprehensive analysis of all experiments.

Fig 8 shows the comprehensive comparison of the results of all 5 experimental sets, which can be summarized as follows:

  • i.

    The dataset shift problem does exist in credit scoring, and dividing the training and validation sets in different ways will indeed affect the model performance on the testing set.

  • ii.

    Compared with other partitioning or data utilization methods, using adversarial validation to select data more consistent with the testing set distribution for cross-validation improves the optimal accuracy. However, attention should be paid to the amount of data selected; too little data may degrade model performance.

  • iii.

    To further improve the model performance, data that are not consistent with the testing set distribution can also be added to the training process, but not involved in the validation. This further improves the optimal accuracy on the testing set, and the results obtained with different quantiles are more stable.

Figure 8: AUC comparison of testing set for all experiments.

5 Conclusion

This paper proposes a method based on adversarial validation to deal with the dataset shift problem in the credit scoring field. Only the training samples whose distribution is consistent with the testing set are used for cross-validation to ensure the generalization performance of the model on the testing data. Meanwhile, to make full use of all data information, the remaining training samples whose distribution is inconsistent with the testing set are added into the training process of each fold of cross-validation, but not involved in the validation.

Experiments on the Lending Club dataset show that, in scenarios where the data distributions of the training and testing sets are inconsistent, the proposed method improves performance more effectively than simply dividing the data in chronological order. This work demonstrates the importance of the dataset shift problem in credit scoring. For the sake of performance on new data, it recommends paying more attention to the impact of data distribution on model effectiveness, rather than just minimizing the classification error.

In future work, more ways of exploiting the data partitioned by adversarial validation can be explored. Transfer learning, which aims to improve model performance in different but related target domains, would be a good choice. Beyond credit scoring, the application of adversarial validation can also be explored in other scenarios with inconsistent data distributions.

Acknowledgments

This work was supported by HuaRong RongTong (Beijing) Technology Co., Ltd. We acknowledge HuaRong RongTong (Beijing) for providing us with high-performance machines for computation. We also acknowledge the anonymous reviewers for proposing detailed modification advice to help us improve the quality of this manuscript.

References

  • [1] D. Karlan, J. Zinman, Microcredit in Theory and Practice: Using Randomized Credit Scoring for Impact Evaluation, Science 332 (6035) (2011) 1278–1284.
  • [2] S. Maldonado, G. Peters, R. Weber, Credit scoring using three-way decisions with probabilistic rough sets, Information Sciences 507 (2020) 700–714.
  • [3] V. Moscato, A. Picariello, G. Sperlí, A benchmark of machine learning approaches for credit score prediction, Expert Systems with Applications 165 (mar 2021).
  • [4] K. Niu, Z. Zhang, Y. Liu, R. Li, Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending, Information Sciences 536 (2020) 120–134.
  • [5] J. López, S. Maldonado, Profit-based credit scoring based on robust optimization and feature selection, Information Sciences 500 (2019) 190–202.
  • [6] N. Kozodoi, S. Lessmann, K. Papakonstantinou, Y. Gatsoulis, B. Baesens, A multi-objective approach for profit-driven feature selection in credit scoring, Decision Support Systems 120 (January) (2019) 106–117.
  • [7] A. I. Marqués, V. García, J. S. Sánchez, Exploring the behaviour of base classifiers in credit scoring ensembles, Expert Systems with Applications 39 (11) (2012) 10244–10250.
  • [8] Y. Xia, J. Zhao, L. He, Y. Li, M. Niu, A novel tree-based dynamic heterogeneous ensemble method for credit scoring, Expert Systems with Applications 159 (nov 2020).
  • [9] Y. Song, Y. Wang, X. Ye, D. Wang, Y. Yin, Y. Wang, Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending, Information Sciences 525 (2020) 182–204.
  • [10] Y. Xia, C. Liu, B. Da, F. Xie, A novel heterogeneous ensemble credit scoring model based on bstacking approach, Expert Systems with Applications 93 (2018) 182–199.
  • [11] J. Xiao, Y. Wang, J. Chen, L. Xie, J. Huang, Impact of resampling methods and classification models on the imbalanced credit scoring problems, Information Sciences 569 (2021) 508–526.
  • [12] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence, Dataset Shift in Machine Learning, The MIT Press, 2009.
  • [13] J. Gama, I. Žliobaitundefined, A. Bifet, M. Pechenizkiy, A. Bouchachia, A Survey on Concept Drift Adaptation, ACM Comput. Surv. 46 (4) (2014).
  • [14] G. Ditzler, M. Roveri, C. Alippi, R. Polikar, Learning in Nonstationary Environments: A Survey, IEEE Comput. Intell. Mag. 10 (4) (2015) 12–25.
  • [15] G. Castermans, D. Martens, T. V. Gestel, B. Hamers, B. Baesens, An overview and framework for PD backtesting and benchmarking, Journal of the Operational Research Society 61 (3) (2010) 359–373.
  • [16] C. Bravo, S. Maldonado, Fieller Stability Measure: A novel model-dependent backtesting approach, Journal of the Operational Research Society 66 (11) (2015) 1895–1905.
  • [17] S. Maldonado, J. López, C. Vairetti, Time-weighted Fuzzy Support Vector Machines for classification in changing environments, Information Sciences 559 (2021) 97–110.
  • [18] D. Dua, C. Graff, UCI machine learning repository (2017).
  • [19] I.-C. Yeh, C.-h. Lien, The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients, Expert Syst. Appl. 36 (2) (2009) 2473–2480.
  • [20] J. Pan, V. Pham, M. Dorairaj, H. Chen, J.-Y. Lee, Adversarial Validation Approach to Concept Drift Problem in User Targeting Automation Systems at Uber, arXiv preprint arXiv:2004.03045 (2020).
  • [21] G. Widmer, M. Kubat, Learning in the Presence of Concept Drift and Hidden Contexts, Mach. Learn. 23 (1) (1996) 69–101.
  • [22] K. Wang, S. Zhou, A. W.-C. Fu, J. X. Yu, Mining Changes of Classification by Correspondence Tracing, in: D. Barbará, C. Kamath (Eds.), Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1-3, 2003, SIAM, 2003, pp. 95–106.
  • [23] D. A. Cieslak, N. V. Chawla, A framework for monitoring classifiers’ performance: when and why failure occurs?, Knowl. Inf. Syst. 18 (1) (2009) 83–108.
  • [24] Y. Yang, X. Wu, X. Zhu, Conceptual equivalence for contrast mining in classification learning, Data Knowl. Eng. 67 (3) (2008) 413–429.
  • [25] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, F. Herrera, A unifying view on dataset shift in classification, Pattern Recognition 45 (1) (2012) 521–530.
  • [26] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence, When Training and Test Sets Are Different: Characterizing Learning Transfer, 2009.
  • [27] J. N. Crook, J. Banasik, Does reject inference really improve the performance of application scoring models, Journal of Banking and Finance 28 (2004) 857–874.
  • [28] D. J. Hand, W. E. Henley, Statistical Classification Methods in Consumer Credit Scoring: a Review, Journal of The Royal Statistical Society Series A-statistics in Society 160 (1997) 523–541.
  • [29] A. Kolcz, C. H. Teo, Feature Weighting for Improved Classifier Robustness, in: CEAS 2009, 2009.
  • [30] N. Dalvi, P. Domingos, Mausam, S. Sanghai, D. Verma, Adversarial Classification, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, Association for Computing Machinery, New York, NY, USA, 2004, pp. 99–108.
  • [31] M. Barreno, B. Nelson, A. D. Joseph, J. D. Tygar, The security of machine learning, Machine Learning 81 (2010) 121–148.
  • [32] T. Fawcett, F. J. Provost, Adaptive Fraud Detection, Data Mining and Knowledge Discovery 1 (2004) 291–316.
  • [33] T. E. Senator, Ongoing Management and Application of Discovered Knowledge in a Large Regulatory Organization: A Case Study of the Use and Impact of NASD Regulation’s Advanced Detection System (RADS), in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’00, Association for Computing Machinery, New York, NY, USA, 2000, pp. 44–53.
  • [34] B. Biggio, G. Fumera, F. Roli, Multiple classifier systems for robust classifier design in adversarial environments, International Journal of Machine Learning and Cybernetics 1 (2010) 27–41.
  • [35] P. Laskov, R. Lippmann, Machine learning in adversarial environments, Machine Learning 81 (2010) 115–119.
  • [36] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, 2005.
  • [37] T. Dangl, M. Halling, Predictive regressions with time-varying coefficients, Journal of Financial Economics 106 (1) (2012) 157–181.
  • [38] G. Storvik, A. Frigessi, D. Hirst, Stationary space-time Gaussian fields and their time autoregressive representation, Statistical Modeling 2 (2) (2002) 139–161.
  • [39] J. Z. Kolter, M. A. Maloof, Dynamic weighted majority: An ensemble method for drifting concepts, Journal of Machine Learning Research 8 (2007) 2755–2790.
  • [40] A. R. Mello, M. R. Stemmer, A. L. Koerich, Incremental and decremental fuzzy bounded twin support vector machine, Information Sciences 526 (2020) 20–38. arXiv:1907.09613.
  • [41] V. López, A. Fernández, F. Herrera, On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed, Information Sciences 257 (2014) 1–13.
  • [42] J. G. Moreno-Torres, J. A. Saez, F. Herrera, Study on the impact of partition-induced dataset shift on k-fold cross-validation, IEEE Transactions on Neural Networks and Learning Systems 23 (8) (2012) 1304–1312.
  • [43] M. Sugiyama, M. Krauledat, K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, Journal of Machine Learning Research 8 (2007) 985–1005.
  • [44] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, Y. Bengio, Generative Adversarial Nets, in: NIPS, 2014.
  • [45] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks, 2017 IEEE International Conference on Computer Vision (ICCV) (2017) 2242–2251.
  • [46] Y. Choi, Y. Uh, J. Yoo, J.-W. Ha, StarGAN v2: Diverse Image Synthesis for Multiple Domains, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8185–8194.
  • [47] J. Engelmann, S. Lessmann, Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning, Expert Systems with Applications 174 (jul 2021). arXiv:2008.09202.
  • [48] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics 29 (5) (2001) 1189–1232.
  • [49] H. He, W. Zhang, S. Zhang, A novel ensemble method for credit scoring: Adaption of different imbalance ratios, Expert Systems with Applications 98 (2018) 105–117.
  • [50] W. Liu, H. Fan, M. Xia, Step-wise multi-grained augmented gradient boosting decision trees for credit scoring, Engineering Applications of Artificial Intelligence 97 (October 2020) (2021) 104036.
  • [51] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T. Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems 2017-December (2017) 3147–3155.
  • [52] A. Namvar, M. Siami, F. Rabhi, M. Naderpour, Credit risk prediction in an imbalanced social lending environment, International Journal of Computational Intelligence Systems 11 (1) (2018) 925–935.