Concept Drift Detection: Dealing with Missing Values via Fuzzy Distance Estimations
Abstract
In data streams, the data distribution of arriving observations at different time points may change – a phenomenon called concept drift. While detecting concept drift is a relatively mature area of study, solutions to the uncertainty introduced by observations with missing values have only been studied in isolation. No one has yet explored whether or how these solutions might impact drift detection performance. We, however, believe that data imputation methods may actually increase uncertainty in the data rather than reduce it. We also conjecture that imputation can introduce bias into the process of estimating distribution changes during drift detection, which can make it more difficult to train a learning model. Our idea is to focus on estimating the distance between observations rather than estimating the missing values, and to define membership functions that allocate observations to histogram bins according to the estimation errors. Our solution comprises a novel masked distance learning (MDL) algorithm to reduce the cumulative errors caused by iteratively estimating each missing value in an observation and a fuzzy-weighted frequency (FWF) method for identifying discrepancies in the data distribution. The concept drift detection algorithm proposed in this paper is a singular, unified algorithm that handles missing values directly, rather than an imputation algorithm combined with a concept drift detection algorithm. Experiments on both synthetic and real-world data sets demonstrate the advantages of this method and show its robustness in detecting drift in data with missing values. The results show that, compared to the best-performing algorithm that handles imputation and drift detection separately, MDL-FWF reduced the average drift detection difference from 10.75% to 5.83%. This is a nearly 46% improvement. These findings reveal that missing values exert a profound impact on concept drift detection, but using fuzzy set theory to model observations can produce more reliable results than imputation.
Index Terms:
concept drift, machine learning, fuzzy distance, fuzzy clustering, fuzzy weighting, missing value
I Introduction
In life, ideas constantly change and evolve. In data streams, these evolving ideas are reflected as changes in the data distribution of arriving observations. Areas where monitoring, control, and security are important are particularly dependent on fast and accurate detection of data distribution changes [1, 2] – for example, mobile tracking systems that monitor changes in user behavior, or intrusion detection systems that look for unusual activity [3, 4]. The change of distribution in data streams is referred to as concept drift in the literature [5, 6].
Drift detection systems typically infer changes in a situation by compiling sets of observations at different time points, estimating the discrepancies in the data distributions for each set, and comparing the results [1]. For machine learning models, the inevitability of concept drift means keeping up-to-date and adapting to the current state of ideas through incremental training as new observations arrive [7, 8]. Without these continual self-refinements to the model, classification accuracy will eventually degrade beyond usefulness [9, 10, 11, 12].
However, as with learning from streaming data generally, a common problem in exploratory data analysis is handling missing values [13, 14]. Missing values are extremely common for dozens of reasons – human error, faulty sensors, power outages, etc. – and can have serious implications for machine learning [15]. One of the most serious of these implications is the increased level of uncertainty an incomplete data set brings [16]. This makes concept drift detection a much more complex task.
A common practice to mitigate the problem is to apply an imputation method to fill in the missing values with plausible substitutes [15]. As discussed in [17], the state-of-the-art imputation methods fall into two major categories: discriminative and generative. Discriminative methods include MICE [18], MissForest [19], and matrix completion [20, 21, 22]. Generative methods include algorithms based on either expectation maximization [23] or deep learning [24, 25, 17]. Regardless of the category, these algorithms were proposed to resolve missing values under different circumstances. They have achieved remarkable success and can easily be combined with any learning algorithm. However, most are based on value-oriented imputation, which means they focus on minimizing the difference between the estimated values and the ground truth. This flexibility is a double-edged sword: such methods pay no attention to what the data will be used for. In other words, they do not take a task-oriented perspective.
Similarly, most drift detection researchers consider imputation to be an independent research task and so intentionally leave mechanisms for handling missing values out of their frameworks. We, however, believe that data imputation methods can have a significant impact on the accuracy and robustness of subsequent drift detection processes. An inappropriate value estimation will increase the bias of the statistics when comparing distributions, resulting in increased errors, as shown in Fig. 1. Therefore, we propose a new concept, task-oriented imputation: we directly minimize the difference between the ground truth of the statistics we need and the statistics acquired by imputation. In our solution, those statistics are the distances between observations.

The fundamental reason that data imputation impairs the performance of a drift detection method is that the replaced values decrease the data variance (Problem 1). Further, the errors in the missing value estimations are not considered in the distribution estimations (Problem 2). For example, using the mean of the data as a replacement for missing values forces the data toward a central focal point and reduces sample diversity [26, 27]. Small errors aggregate into large errors, introducing bias into the hypothesis testing for concept drift detection. Meanwhile, treating data imputation and concept drift detection as two independent tasks ignores the discrepancy between the replaced value and the true value. For instance, assume we have a set of 1-dimensional observations $X = \{x_1, \dots, x_n\}$, where $n$ is the size, and among these observations there are $m$ missing values. The imputation is the mean of the non-missing observations, denoted as $\bar{x}$. Then the error $\epsilon_i = x_i - \bar{x}$ of each imputed value would not be considered in concept drift detection. Uncertainty in data and its impact on learning models has been well studied. For instance, in [28, 29], the authors evaluate the performance of classifiers in terms of their fuzziness and the problem’s complexity, and discover that the generalization ability of a classifier is closely related to both its fuzziness and the target problem. Extrapolating from this conclusion, we suspect that the accuracy of a drift detection algorithm may also be closely related to data uncertainty and the fuzziness of the drift detection methodology.
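To see Problem 1 concretely, the short sketch below (ours, for illustration only) mean-imputes 30% MCAR missing values in a Gaussian sample and compares the variances before and after:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)   # complete 1-D sample

mask = rng.random(1000) < 0.3                   # 30% of values go missing (MCAR)
x_imp = x.copy()
x_imp[mask] = x[~mask].mean()                   # mean imputation

print(round(x.var(), 3), round(x_imp.var(), 3))
# The imputed sample's variance is roughly 30% smaller: every replaced
# value sits exactly on the mean, pulling the data toward a central
# focal point and shrinking sample diversity (Problem 1).
```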
To address Problem 1, we propose a masked distance learning algorithm. The idea is to convert the distances calculated by different data imputation algorithms into a new feature set and to treat mapping this feature set to the true distance as a regression learning task, thereby minimizing the difference between the estimated distance and the true distance while preserving the variance of the data. To address Problem 2, the estimated distances are considered to be fuzzy to account for possible errors. This is coupled with a novel weighted frequency measuring method that tallies observations based on their degree of membership to histogram bins, where membership accounts for the estimation errors in the distance learning process.
To detect drifts, we apply Pearson’s chi-square test to the observed frequencies as a hypothesis test [30, 31]. The significance level is governed by a parameter $\alpha$, which determines the sensitivity of the hypothesis test to concept drift. In summary, the main contributions of this paper are:
• The notion of estimating the distance between samples rather than estimating the missing values, which reduces the cumulative errors caused by iterative estimations.
• A novel concept drift detection algorithm that is robust to missing values, along with a masked distance learning algorithm and a fuzzy-weighted frequency observing method – MDL-FWF when combined.
• An ablation study evaluating the improvement of using fuzzy set theory over not using it. The results indicate that applying fuzzy set theory for drift detection with missing values achieves more reliable results.
• A comprehensive evaluation of the proposed algorithms and framework on both synthetic and real-world data sets, with results revealing that applying fuzzy set theory to handle missing values helps reduce false alarms in concept drift detection.
The rest of this paper is organized as follows. In Section II, we discuss the problem of concept drift detection and data imputation, followed by the preliminaries relating to our drift detection method. Section III presents the proposed masked distance learning and the fuzzy-weighted frequency drift detection algorithms. Section IV evaluates the proposed distance learning performance and drift detection accuracy. Section V concludes this study with a discussion of future work.
II Literature Review and Preliminaries
This section presents the problem of concept drift detection (Section II-A) and missing values (Section II-B). The preliminary for our solution, Pearson’s chi-square test, is introduced in Section II-C.
II-A Concept Drift Detection
In an evolving data stream, the distribution of available training samples may vary over time [1]. Consider a feature space (a topological space) denoted as $\mathcal{X} \subseteq \mathbb{R}^d$, where $d$ is the dimensionality of the feature space. A tuple $(\mathbf{x}, y)$ denotes a data instance, where $\mathbf{x} \in \mathcal{X}$ is the feature vector, $y \in \{1, \dots, C\}$ is the class label, and $C$ is the number of classes. A data stream, denoted as $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots\}$, can then be represented as a sequence of data instances. A time window strategy chunks the sequence over a time interval as a batch. These batches are denoted as $B_t = \{(\mathbf{x}_t, y_t), \dots, (\mathbf{x}_{t+w}, y_{t+w})\}$, where $w$ is a given time interval that defines the time window. A concept drift has occurred between two time windows $B_{t_0}$ and $B_{t_1}$ if the joint probabilities of $\mathbf{x}$ and $y$ differ, that is, $P_{t_0}(\mathbf{x}, y) \neq P_{t_1}(\mathbf{x}, y)$ [5, 6].
There are two approaches to estimating the density discrepancy: parametric and nonparametric [32, 31]. The parametric approach assumes the data is drawn from a known distribution, with the major challenge being how to estimate the parameters to reach the maximum likelihood [31]. Nonparametric approaches assume that it is too difficult to devise an analytic expression of the target distribution’s probability density function [33]. Therefore, the density is estimated empirically rather than by matching the data with any given parametric family. Given that concept drift detection inherently concerns evolving data distributions, it is difficult to rely on one distribution family for an accurate description. Therefore, most drift detection methods are nonparametric [1].
To quantify the distribution discrepancy between two multivariate sample sets, some researchers have proposed new test statistics, while others have proposed novel mapping methodologies that convert multivariate observations into univariate observations. For example, NN-DVI [34] and MMD [35] construct new statistical variables for comparing distributions to determine the differences. The performance of these methods is excellent, but that performance usually comes with a high computational cost. Alternatively, if the distribution of the proposed statistic is unclear, a Monte Carlo method can be applied to approximate the significance level. Examples include kdqTree combined with Kullback-Leibler divergence [36], a competence model combined with total variation [7], QuantTree combined with total variation, or QuantTree combined with chi-square statistics [37]. On yet another track, the solution involves building a model to map the discrepancy in distributions with powerful statistical tools, such as Pearson’s chi-square [31]. The general premise of these approaches is to convert multivariate data samples into a univariate two-sample test problem, such as an incremental Kolmogorov–Smirnov test [38] or a multi-Wald-Wolfowitz test [39]. However, no existing research has considered missing values in concept drift detection.
II-B Types of Missing Values and Handling Methodologies
Missing values are merely observations that were not recorded, perhaps due to human error or a sensor malfunction, which is very common in real-world data [40, 16, 15]. There are three main categories of missing values based on the correlation between the observations and the values that are missing. These are: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [27, 41]. A recent survey by Santos et al. [15] provides a set of notations and definitions that demonstrate the difference between each category quite well, as follows:
Assume we have a data set with $n$ instances and $d$ features. The values of the data set are represented as an $n \times d$ matrix $X = (x_{ij})$. Plus, a missing data indicator matrix is defined as an $n \times d$ zero-one matrix, denoted as $M = (m_{ij})$, where $m_{ij}$ is the missing value indicator that equals 1 if $x_{ij}$ is missing, and 0 otherwise. The relationship describing the missing data is defined as a conditional probability function $P(M \mid X, \phi)$, where $\phi$ denotes the parameters of the missing values [26]. In practice, $\phi$ itself is not important; only the relation between $M$ and the components of $X$ is, i.e., $P(M \mid X_{obs}, X_{mis}, \phi)$, where $X_{obs}$ is the observations without missing values, and $X_{mis}$ is the observations with missing values.
MCAR means that a missing value has nothing to do with its hypothetical value or with the values of other variables [15, 16]. The probability of missingness depends only on the parameters $\phi$, i.e., $P(M \mid X_{obs}, X_{mis}, \phi) = P(M \mid \phi)$ [15].
MAR means the missing values are not related to themselves, but they do depend on other features somehow. In other words, the conditional probability of missing values is $P(M \mid X_{obs}, X_{mis}, \phi) = P(M \mid X_{obs}, \phi)$ [5].
With MNAR, the missing values depend on both the observed and the unobserved information – $X_{obs}$ and $X_{mis}$ – and the conditional probability of the missing values, $P(M \mid X_{obs}, X_{mis}, \phi)$, cannot be simplified [15]. Conventional methods of handling missing values include complete case analysis, available case analysis, dummy variable adjustment, and imputation [26].
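For illustration, the following sketch (ours; the logistic missingness probabilities are arbitrary choices) generates masks under the three mechanisms for a feature $x_2$, conditioning on nothing (MCAR), on an observed feature $x_1$ (MAR), or on $x_2$ itself (MNAR):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)  # always-observed feature
x2 = rng.normal(size=n)  # feature that will contain missing values

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

m_mcar = rng.random(n) < 0.2          # P(m=1) depends only on a constant phi
m_mar  = rng.random(n) < sigmoid(x1)  # P(m=1) depends on the observed x1 only
m_mnar = rng.random(n) < sigmoid(x2)  # P(m=1) depends on the missing value itself

x2_mcar = np.where(m_mcar, np.nan, x2)  # apply one of the masks
```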
Imputation is currently the most widely used method for handling missing values [16]. Like dummy variable adjustment, missing values are replaced but this time with plausible values, not constants or a mean. The model is then trained just as though no values were missing [26]. However, there are two serious problems with most imputation methods. First, variances tend to be underestimated, which can lead to bias in the learning parameters [26]. Second, the standard error calculations presume that all data are real, which means the inherent uncertainty and sampling variability in the imputed values is not taken into account [26]. Of course, this results in confidence intervals that are too narrow.
II-C Pearson’s Chi-square Test
Pearson’s Chi-square Test has been widely used to determine whether there are significant differences between expected frequencies and observed frequencies [30]. When there is no significant difference (i.e., when the null hypothesis is true), the statistic follows a chi-square distribution. Thus, the premise of the test is to assume the null hypothesis is true and then evaluate how likely a specific observation would be.
The standard process of the chi-square test is to use sample data to find: the degrees of freedom, the expected frequencies, the test statistic, and the p-value associated with the test statistic [30]. Given a contingency table with $r$ rows and $c$ columns of categorical variables, the degrees of freedom are equal to $df = (r-1)(c-1)$.
The expected frequency counts are computed separately for each level of one categorical variable at each level of the other categorical variable. The expected frequency at the $i$-th row and $j$-th column of the contingency table is calculated with the equation
$E_{i,j} = (n_{i\cdot} \times n_{\cdot j}) / n$,
where $n_{i\cdot}$ is the sum of the frequencies for all columns in row $i$, $n_{\cdot j}$ is the sum of the frequencies for all rows of column $j$, and $n$ is the sum over all rows and columns. The test statistic is a chi-square random variable defined by
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$, (1)
where $O_{i,j}$ is the observed frequency count at row $i$ and column $j$, and $E_{i,j}$ is the expected frequency count at row $i$ and column $j$. The p-value is the probability of observing a sample statistic as extreme as the test statistic; since the test statistic follows a chi-square distribution under the null hypothesis, the p-value can be computed with the chi-square probability distribution function.
Pearson’s chi-square test should be used with a sufficiently large sample set. According to the central limit theorem, a $\chi^2_k$ distribution is the sum of $k$ independent random variables with finite mean and variance, which converges to a normal distribution for large $k$. For practical purposes, Box et al. [30] claim that, once the degrees of freedom and the expected frequencies are sufficiently large, the distribution of the estimated test statistic is close enough to a normal distribution to ignore the difference.
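As a concrete illustration of the test procedure, the following sketch (ours) uses SciPy's chi2_contingency on a 2 x 4 contingency table of the kind our detector builds, i.e., two time windows against four histogram bins; the counts are made up:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: two time windows; columns: histogram bins. Each cell holds the
# observed frequency of that window's observations in that bin.
observed = np.array([[25, 30, 22, 23],
                     [40, 18, 21, 21]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}")  # dof = (2-1)(4-1) = 3
# Reject the null hypothesis of identical distributions when p < alpha.
```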
III Concept Drift Detection with Missing Values via Fuzzy Distance Estimations
This section formally presents our proposed solution for concept drift with missing values. In Section III-A, we present a histogram-based concept drift detection method. Section III-B introduces our strategy for handling missing values. Last, Section III-C discusses how we leverage fuzzy distance estimations to address the uncertainty of missing values in drift detection.
III-A Concept Drift Detection via Histogram Distribution Estimation
Problem Statement. Recall the drift detection problem outlined in Section II-A. Let $X_1$ and $X_2$ be random variables defined on a topological space $\mathcal{X}$ at different periods in a data stream, with distributions $P_1, P_2 \in \mathcal{B}_{\mathcal{X}}$, where $\mathcal{B}_{\mathcal{X}}$ consists of all Borel probability measures on $\mathcal{X}$. Given two observation sets $S_1$ and $S_2$ drawn from $P_1$ and $P_2$, the problem is: how confident are we to claim that $P_1 \neq P_2$ based on $S_1$ and $S_2$?
To simplify this problem, it is commonly assumed that the observations in $S_1$ and $S_2$ are i.i.d., which makes the research objective equivalent to a two-sample test problem. The general procedure is to measure the difference between the observations $S_1$ and $S_2$ (sample sets), then infer the significance of the difference between $P_1$ and $P_2$ (populations).
Remark: Recall the definition of concept drift, $P_{t_0}(\mathbf{x}, y) \neq P_{t_1}(\mathbf{x}, y)$; a data instance is defined as a tuple $(\mathbf{x}, y)$. In a supervised learning setting, the target variable $y$ can be considered as the $(d+1)$-th feature of the observations, while in an unsupervised setting, drift detection is equivalent to covariate shift detection [42].
The most intuitive method for capturing this difference is a histogram. A histogram is a set of intervals, i.e., bins, and density is estimated by counting the number of observations in each bin. A general framework for using a histogram and Pearson’s chi-square test to detect concept drift is shown in Fig. 2.
The next question is how to partition the feature space as bins to build a histogram. In the literature, the partitions of most multivariate histograms are based on a grid or a tree structure [37, 31]. We have used clusters because, given our problem, this method makes it easy to determine which bin an observation belongs to. More specifically, with grid- and tree-based histograms, observations are allocated to a bin according to the value of each feature, whereas, with cluster-based histograms, the bin is determined by the distance between the observations and the centroids, as demonstrated in Fig. 3.


In practice, there are three advantages to using clusters as histogram bins.
1) The shape of the bins is flexible. Both grid and tree-based histograms have hyper-rectangular-shaped bins. However, the shape of the bins in a cluster-based histogram dynamically adapts to the data distribution, which is beneficial when trying to maintain a similar number of observations per bin and not leaving any empty bins.
2) Missing values in an observation do not need to be estimated one-by-one with cluster-based histograms. Rather, only the observation-to-centroid distances need to be modeled to determine the appropriate observation-to-cluster membership. In addition, the error of the distance estimation can be characterized by fuzzifying the distance then using that fuzzy distance to calculate the observation’s degree of membership to a cluster.
3) Since our algorithm performs task-oriented rather than value-oriented imputation, our drift detection algorithm makes no assumption about the data type, i.e., both numerical and categorical data can be addressed. A minimal sketch of the cluster-based detection framework on complete data follows.
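The sketch below is our illustration of the framework in Fig. 2 before any missing value machinery is added: k-means clusters serve as histogram bins and Pearson's chi-square test compares the bin frequencies of two windows. The function name, window sizes, and bin count are illustrative choices:

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import KMeans

def detect_drift(window_a, window_b, n_bins=8, alpha=0.05):
    """Cluster-based histogram drift test for two complete data windows."""
    # Build histogram bins by clustering the reference window.
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit(window_a)
    # Observed frequencies: count each window's observations per cluster.
    freq_a = np.bincount(km.predict(window_a), minlength=n_bins)
    freq_b = np.bincount(km.predict(window_b), minlength=n_bins)
    _, p_value, _, _ = chi2_contingency(np.vstack([freq_a, freq_b]))
    return p_value < alpha   # True means a drift is reported

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 4))
b = rng.normal(0.5, 1.0, size=(500, 4))   # mean-shifted second window
print(detect_drift(a, b))
```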
III-B Masked Distance Learning
This section sets out our formulation for a distance learning method to estimate observation-to-observation distances with missing values. The intuition behind the idea is to retrieve the observations without missing values, denoted as $X_{obs}$, and artificially introduce null values as a mask to simulate the missing values, that is,
$X_{mask} = f_{mask}(X_{obs})$,
where $f_{mask}(\cdot)$ denotes the missing value masking function. Since our target variable is known, i.e., the actual pairwise distances between the retrieved observations, we can use the masked samples and their actual distances as a training set to build a regression model, denoted as $f_{dist}$. The standard deviation of the residuals of $f_{dist}$ can be used as a coefficient for fuzzifying observation-to-observation distances.

The workflow of this masked distance learning method comprises three major steps, as shown in Fig. 4. Put simply, masked distance learning asks and answers three questions: 1) How must the masking function $f_{mask}$ be defined so that the observations with masked values have the same distribution as the original sample set? 2) Which method of building the learning model will minimize the prediction errors? 3) How should the error of $f_{dist}$ be estimated? Steps 1–3 address these questions in turn.
Step 1 is a sample selection process. Retrieving observations with no missing values can be completed with a for loop over the sample set. The runtime complexity is $O(nd)$, where $n$ is the number of observations and $d$ is the dimension of the feature space. The number of missing values in each feature is then counted and divided by the total number of observations to produce a missing rate vector, denoted as $\mathbf{r} = (r_1, \dots, r_d)$, where $r_j$ is the missing rate of the $j$-th feature. At this stage, we assume the missing values are MCAR. Based on this assumption, the probability mass function of the masking variable for each feature is
$P(m_{ij} = 1) = r_j, \quad P(m_{ij} = 0) = 1 - r_j$, (2)
and we have the masked observations $X_{mask} = f_{mask}(X_{obs})$. The actual pairwise distances of $X_{obs}$ for Step 2 are also calculated at this time.
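A sketch of this masking step (our code; f_mask and the rate vector r are illustrative names) follows:

```python
import numpy as np

def f_mask(X_complete, missing_rate, seed=0):
    """Mask each feature j independently with probability r_j (Eq. 2)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X_complete.shape) < missing_rate  # r broadcasts over rows
    X_masked = X_complete.copy()
    X_masked[mask] = np.nan
    return X_masked

X_obs = np.random.default_rng(1).normal(size=(200, 5))  # complete observations
r = np.array([0.2, 0.2, 0.0, 0.1, 0.0])                 # per-feature missing rates
X_masked = f_mask(X_obs, r)
```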
Step 2 is the prediction model training. In this step, we want to build a learning model that maps the masked observations to the actual pairwise distances while minimizing the error between the predictions and the actual values, i.e.,
$f_{dist} = \operatorname{arg\,min}_{h \in \mathcal{H}} \sum_{i<j} \big( h(\tilde{x}_i, \tilde{x}_j) - \| x_i - x_j \| \big)^2$,
where $\mathcal{H}$ is the hypothesis space and $h$ is a hypothesis mapping a pair of (masked) observations to a real-valued distance. A few existing data imputation methods are applied to the masked observations, and the resulting imputed distances are calculated as new training features. These new training features, along with the actual distances prior to masking, are used to train a masked distance learning model. In our solution, we use a gradient boosted decision tree (GBDT) model to perform the learning task, which is a non-linear regression model; non-linear regression models often outperform linear models in complicated cases.
The basic assumption is that the true distance is close to the imputed distances, but which imputation method provides the best results depends on the data. For example, mean replacement may be the best option for uniformly or Gaussian-distributed features, whereas most-frequent-value replacement could be the best for Poisson-distributed features. So, to make the optimal choice, the imputed distances are passed to a regression model and learned. The masked training set contains $|X_{mask}|(|X_{mask}|-1)/2$ pairwise instances for training, where $|\cdot|$ denotes the cardinality of the set. Under the MCAR assumption, the masked distance learning model should perform similarly across the entire sample set.
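The sketch below illustrates our reading of Step 2: distances computed under several scikit-learn imputers become the regression features, and a GBDT learns the mapping to the true distances. It continues from the Step-1 sketch; build_training_set is an illustrative name, and the imputer pool mirrors the five methods used in our experiments:

```python
import numpy as np
from itertools import combinations
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import GradientBoostingRegressor

def build_training_set(X_obs, X_masked):
    """Distances under each imputer become features; true distances are targets."""
    imputers = [SimpleImputer(strategy=s)
                for s in ("constant", "mean", "median", "most_frequent")]
    imputers.append(IterativeImputer(random_state=0))
    imputed = [imp.fit_transform(X_masked) for imp in imputers]
    pairs = list(combinations(range(len(X_obs)), 2))
    feats = np.array([[np.linalg.norm(Xi[i] - Xi[j]) for Xi in imputed]
                      for i, j in pairs])
    target = np.array([np.linalg.norm(X_obs[i] - X_obs[j]) for i, j in pairs])
    return feats, target

feats, target = build_training_set(X_obs, X_masked)  # from the Step-1 sketch
f_dist = GradientBoostingRegressor(random_state=0).fit(feats, target)
```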
Step 3 is to estimate the performance of the trained distance learning model $f_{dist}$. Following the standard regression assumption, i.e., that $y = f_{dist}(x) + \varepsilon$ where the error term $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, we can use the error term (residual) to fuzzify the predicted distance. To estimate the variance, the residual standard deviation is calculated by the equation below and then stored,
$\hat{\sigma} = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }$, (3)
where $y_i$ is an actual distance, $\hat{y}_i$ is the corresponding prediction, and $N$ is the number of training pairs. For each predicted distance $\hat{d}$, we have high confidence that the true distance is located in the interval $[\hat{d} - 3\hat{\sigma}, \hat{d} + 3\hat{\sigma}]$ according to the three-sigma rule. To avoid sampling bias, we apply cross-validation, which splits the training set into $k$ paired training–testing subsets, after which $\hat{\sigma}$ is computed on the held-out predictions, where $k$ is the number of splits.
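A sketch of the cross-validated residual estimate, our reading of Step 3 and Eq. (3), continuing from the Step-2 sketch:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def residual_sigma(feats, target, k=5):
    """Cross-validated standard deviation of the prediction residuals (Eq. 3)."""
    residuals = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(feats):
        model = GradientBoostingRegressor(random_state=0).fit(feats[tr], target[tr])
        residuals.append(target[te] - model.predict(feats[te]))
    return float(np.std(np.concatenate(residuals)))

sigma = residual_sigma(feats, target)  # used later to fuzzify predicted distances
```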
Alg. 1 presents the pseudocode of masked distance learning. The inputs are the observations, a default regression model, a parameter for cross-validation, and a set of data imputation methods. The outputs are a trained $f_{dist}$ and its residual standard deviation $\hat{\sigma}$. The overall runtime complexity is broken down as follows. The complexity of Lines 1–3, which initialize the data set statistics, is $O(nd)$. Computing the pairwise distances has a complexity of $O(n^2 d)$ without any optimization. The worst case for the observation masking, i.e., when every value is masked, is $O(nd)$. The worst case for the data imputation is multiple iterative imputations, which dominate the imputation pool. The $k$-fold training–testing cost depends on the chosen regression model, and calculating the standard deviation is linear in the number of training pairs. Therefore, the runtime of Alg. 1 is feasible and is dominated by the pairwise distance computation, the iterative imputation, and the regression training.
Algorithm 1 (MDL). Input: 1. Observations, $X$; 2. Regression model, $f$, default GBDT regressor; 3. Cross-validation times, $k$; 4. Data imputation algorithm pool (default: zero, mean, median, most frequent, iterative). Output: 1. Trained distance learning model, $f_{dist}$; 2. Prediction error, $\hat{\sigma}$.
III-C Fuzzy-weighted Frequency
After building the missing value distance prediction model and obtaining the standard deviation of the errors, we can now start estimating the frequency of observations in histogram bins. For any two observations with missing values, we use the same masking function to convert them, then pass them to $f_{dist}$. The returned result, together with the standard deviation $\hat{\sigma}$, is used to construct the fuzzy distance that represents the observation's membership to a bin. In fact, there are two options with the predicted distance $\hat{d}$. One is simply to use the predictions as crisp distances without considering the errors of $f_{dist}$. The observations would then be assigned to histogram bins in a many-to-one manner, which means each observation would only be counted once, by one bin. The other option is to fuzzify the distance according to $\hat{\sigma}$, that is, to consider the estimated distance as a fuzzy number, as shown in Fig. 5.

The workflow of fuzzy-weighted frequency is as follows. Step 1) Model the estimated distance between an observation and a cluster centroid as a fuzzy distance $\tilde{d}$. Step 2) Calculate the degree of membership of every observation to each bin based on the fuzzy distances. Step 3) Normalize the degrees of membership as the observation’s weights. Step 4) Accumulate the weights in each bin as the frequency. The novelty of FWF compared to fuzzy clustering is that the degree of membership is calculated based on fuzzy distance rather than crisp distance.
In this case, the sample set $X$ is the universe of discourse. The value $\mu_A(x)$ for each $x \in X$ is the grade of membership of $x$ in $A$. The function $\mu_A : X \to [0, 1]$ is the membership function of a fuzzy set $A$, where a fuzzy set corresponds to a cluster, denoted as $C_j$. When measuring the frequency of observations, the fuzzy distance is used to determine the degree of membership of an observation to each cluster.
We propose two types of fuzzy numbers to represent the estimated distance: the triangular fuzzy number (TFN) and the Gaussian fuzzy number (GFN). We chose the TFN as a baseline for comparison because it is the most widely used [12], and we chose the GFN as the most reasonable choice, since the predictions and errors are assumed to follow a Gaussian distribution. Once the fuzzy distance formulation is determined, we can leverage it to calculate the degree of membership. The contingency table is then built with the degrees of membership.

In general, the membership degree of an observation with missing values to a cluster is determined by its estimated distances to the cluster centroids and the standard deviation of the errors. This is formulated by the membership function $\mu_{C_j}(x_i)$. Without any prior knowledge of the sample set, we assume the observations all have equal weights of 1. Therefore, the fuzzy weighting function can be formulated as a normalized degree of membership, denoted as
$w(x_i, C_j) = \frac{\mu_{C_j}(x_i)}{\sum_{j'=1}^{K} \mu_{C_{j'}}(x_i)}$. (4)
Fig. 6 illustrates how to calculate the weights based on the triangular fuzzy distance. In this example, the estimated distances between an observation $x$ and the cluster centroids are represented as triangular fuzzy numbers. As shown, $x$ is closest to the centroid of $C_1$ and furthest from the centroid of $C_4$. Even considering a possible error of the estimate, it is still very unlikely that $x$ should fall into cluster $C_4$, which is reflected by the condition that their fuzzy distances do not overlap. To calculate how likely it is that observation $x$ should fall into $C_1$, $C_2$, and $C_3$, we use the overlapping areas as a membership function. The most likely cluster for $x$ is $C_1$, with the largest overlapping area $s_1$; the areas $s_2$ and $s_3$ for $C_2$ and $C_3$ are smaller, and $s_4 = 0$. The sum of the total overlapped area is $s_1 + s_2 + s_3 + s_4$. Therefore, we can normalize the weight of $x$ to each cluster as $s_1 / \sum_j s_j$, $s_2 / \sum_j s_j$, $s_3 / \sum_j s_j$, and $s_4 / \sum_j s_j$.
The degree of membership of $x_i$ is calculated based on the fuzzy distance. We define the triangular fuzzy distance as a TFN, that is,
$\tilde{d}_{ij} = \big( \hat{d}_{ij} - 3\hat{\sigma},\; \hat{d}_{ij},\; \hat{d}_{ij} + 3\hat{\sigma} \big)$,
where $\hat{d}_{ij}$ is the predicted distance between observation $x_i$ and the cluster $C_j$. Since all the triangles are congruent, the overlapping triangles formed between pairs of them must be similar. Hence, for simplicity in explaining how the degrees of membership are normalized, we have assumed that the degree of membership of an observation to its closest cluster is equal to 1, i.e., $\mu_{(1)}(x_i) = 1$. Based on the area theorem of similar triangles, if two triangles are similar, then the ratio of their areas is proportional to the square of the ratio of their corresponding sides. Therefore, the degree of membership of observation $x_i$ to its $q$-th closest cluster is
$\mu_{(q)}(x_i) = \rho_q^2$, (5)
where $\rho_q^2$ is the square of the ratio of the corresponding sides of the overlapping triangle and the full triangle. Substituting into Eq. (4), we have the triangular fuzzy-weighted frequency represented as
$F_{tri}(C_j) = \sum_{x_i \in X} \frac{\mu_{C_j}(x_i)}{\sum_{j'=1}^{K} \mu_{C_{j'}}(x_i)}$. (6)
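The following sketch implements our reading of Eqs. (4)-(6); tfn_weights is an illustrative name, and the overlap-area geometry is reconstructed from the similar-triangles argument above rather than taken verbatim from the paper:

```python
import numpy as np

def tfn_weights(d_hat, sigma):
    """Normalized TFN memberships of one observation to K cluster centroids.

    Every predicted distance is a TFN (d - 3s, d, d + 3s) with the same
    half-width w = 3*sigma, so the overlap between a cluster's TFN and the
    closest cluster's TFN is a similar triangle whose area is the squared
    side ratio -- our reading of Eq. (5).
    """
    w = 3.0 * sigma
    delta = d_hat - d_hat.min()                   # offset from the closest centroid
    side_ratio = np.clip(1.0 - delta / (2.0 * w), 0.0, None)
    membership = side_ratio ** 2                  # closest cluster gets exactly 1
    return membership / membership.sum()          # normalization of Eq. (4)

print(tfn_weights(np.array([1.0, 1.4, 2.3, 9.9]), sigma=0.5))
```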
Because of their smoothness and concise notation, Gaussian and bell membership functions are popular methods for specifying fuzzy sets. With these functions, the distance between the cluster centroids and the observations is a Gaussian fuzzy number, and the predicted value plus the standard deviation are used to model the distance, that is,
$\tilde{d}_{ij} \sim \mathcal{N}\big( \hat{d}_{ij}, \hat{\sigma}^2 \big)$.
The degree of membership of $x_i$ to $C_j$ is the overlapping area under the curves and is formulated as
$\mu_{C_j}(x_i) = 2\,\Phi\!\left( -\frac{\hat{d}_{ij} - \hat{d}_{i(1)}}{2\hat{\sigma}} \right)$, (7)
where $\hat{d}_{i(1)}$ is the predicted distance to the closest cluster and $\Phi$ is the standard normal cumulative distribution function. And, similar to the triangular fuzzy number, the Gaussian fuzzy-weighted frequency is
$F_{gau}(C_j) = \sum_{x_i \in X} \frac{\mu_{C_j}(x_i)}{\sum_{j'=1}^{K} \mu_{C_{j'}}(x_i)}$. (8)
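A matching sketch for our reading of Eqs. (7)-(8); the closed-form overlap of two equal-variance Gaussians is a standard identity, but its use here is our reconstruction:

```python
import numpy as np
from scipy.stats import norm

def gfn_weights(d_hat, sigma):
    """Normalized GFN memberships, the Gaussian analogue of tfn_weights.

    Each distance is modeled as N(d, sigma^2); for two equal-variance
    Gaussians whose means differ by delta, the overlapping area under
    the curves is 2 * Phi(-delta / (2 * sigma)) -- our reading of Eq. (7).
    """
    delta = d_hat - d_hat.min()
    membership = 2.0 * norm.cdf(-delta / (2.0 * sigma))  # equals 1 when delta = 0
    return membership / membership.sum()                 # normalization of Eq. (4)

print(gfn_weights(np.array([1.0, 1.4, 2.3, 9.9]), sigma=0.5))
```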
In contrast to distance fuzzification, the crisp frequency is measured based on the predicted distance $\hat{d}$ alone, and the frequency measuring proceeds by assigning each observation to its closest cluster.
To simplify the computation and to avoid the standard deviation becoming too large to provide reasonable fuzzy distances, we propose a top-Q cluster constraint for normalizing the weights. That is, only the $Q$ closest clusters are considered for frequency measuring, and Eq. (4) is updated as follows:
$w(x_i, C_j) = \frac{ \mu_{C_j}(x_i)\, \mathbb{1}_Q(x_i, C_j) }{ \sum_{j'=1}^{K} \mu_{C_{j'}}(x_i)\, \mathbb{1}_Q(x_i, C_{j'}) }$, (9)
where $\mathbb{1}_Q(x_i, C_j)$ is the indicator function that equals 1 if $C_j$ is among the $Q$ closest clusters of $x_i$, and 0 otherwise. When $Q = 1$, the measured frequency is equal to the crisp distance frequency. Alg. 2 presents the pseudocode of fuzzy-weighted frequency measuring. With the frequency, we can apply the chi-square test to accept or reject the null hypothesis.
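A sketch of the top-Q constraint of Eq. (9), applied to memberships computed by either of the previous sketches; topq_weights is an illustrative name:

```python
import numpy as np

def topq_weights(membership, Q=3):
    """Zero out all but the Q largest memberships before normalizing (Eq. 9)."""
    keep = np.argsort(membership)[-Q:]      # indices of the Q closest clusters
    constrained = np.zeros_like(membership)
    constrained[keep] = membership[keep]    # the indicator function of Eq. (9)
    return constrained / constrained.sum()

m = np.array([1.0, 0.75, 0.32, 0.0])
print(topq_weights(m, Q=2))   # Q = 1 reduces to the crisp-distance frequency
```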
The runtime complexity of Alg. 2 is determined by the $k$-means clustering (Lines 1–3), the imputation (Lines 4–8), and the measuring of the fuzzy-weighted frequency (Lines 9–15). For the clustering, according to Lloyd’s $k$-means algorithm, the runtime complexity of the clustering procedure in our algorithm is $O(nKdi)$, where $K$ is the number of clusters and $i$ the number of iterations. The complexity of imputing the data is based on the worst case among the imputation algorithms in the pool; given our default settings, the worst case is the iterative imputation. The fuzzy weighting process has a linear runtime complexity of $O(nK)$. Therefore, the runtime complexity of Alg. 2 is dominated by the clustering and the imputation. Combining Algs. 1 and 2, the overall runtime complexity of MDL-FWF is bounded by the MDL algorithm.
Algorithm 2 (FWF). Input: 1. Observations, $X$; 2. Data imputation algorithm pool (default: zero, mean, median, most frequent, iterative); 3. Number of clusters (bins), $K$; 4. Top-Q constraint, $Q$; 5. Trained distance learning model, $f_{dist}$; 6. Prediction error, $\hat{\sigma}$; 7. Distance fuzzification model, default TFN. Output: 1. Fuzzy-weighted frequencies; 2. Cluster centroids.
IV Experiments and Evaluation
In this section, we evaluate the proposed algorithms in three respects. Section IV-A evaluates the distance estimation algorithm in comparison to various data imputation methods. Section IV-B tests MDL-FWF’s performance at concept drift detection on five synthetic data sets with MCAR missing values. In this set of experiments, we also examine whether fuzzifying the observations improves performance. Sections IV-C and IV-D present comparisons with state-of-the-art concept drift detection algorithms.
IV-A Masked Distance Learning vs. Calculating Distance from Imputed Values
Experiment 1 - Distance estimation with MCAR.
This experiment evaluates the extent to which masked distance learning can reduce estimation errors.
Data sets.
Distribution | Distribution Config | MV Config
Uniform | | 20% missing rate on the first dimensions
Gaussian | | Same as above
Exponential | | Same as above
Poisson | | 20% missing rate on the first dimensions
Categorical | | Same as above
For each distribution, we generated 500 observations and randomly removed some values in different configurations, as summarized in Table I. For the uniform, Gaussian, and exponential distributions, we generated multi-dimensional data whose first dimensions had a missing value rate of 20%. For the Poisson and uniform categorical distributions, we likewise generated multi-dimensional data where the first dimensions had a missing value rate of 20%.
Imputation methods and configurations.
Missing values were replaced with five kinds of imputed values: zeros, means, medians, most frequent, and iterative. We generated five sample sets with different distributions and stored the true pairwise distance matrix for each. Then we randomly masked some values as null and applied different distance estimation methods to estimate the pairwise distance matrix. The error between the estimated distance and the true distance was used to evaluate performance.
Evaluation metric.
We used both root mean squared error (RMSE) and mean absolute error (MAE) as the metrics for evaluating the distance estimations. RMSE gives a relatively high weight to large errors, which makes it more useful when large errors are particularly undesirable. MAE does not penalize large errors as heavily as RMSE, but it is a good metric for reflecting overall prediction accuracy.
| | impute_zero | impute_mean | impute_medi | impute_mfre | impute_iter | MIDAS | GAIN | MDL
Uni | MAE | 0.071±0.012 | 0.036±0.006 | 0.036±0.006 | 0.071±0.012 | 0.036±0.006 | 0.042±0.008 | 0.039±0.007 | NA
Uni | RMSE | 0.217±0.022 | 0.109±0.011 | 0.012±0.041 | 0.216±0.022 | 0.109±0.011 | 0.133±0.016 | 0.125±0.016 | NA
Gau | MAE | 0.802±0.116 | 0.113±0.020 | 0.113±0.020 | 0.418±0.065 | 0.113±0.020 | 0.442±0.077 | 0.133±0.029 | NA
Gau | RMSE | 2.189±0.176 | 0.372±0.048 | 0.141±0.636 | 1.177±0.113 | 0.373±0.048 | 1.294±0.142 | 0.440±0.073 | NA
Exp | MAE | 0.105±0.021 | 0.105±0.021 | 0.100±0.023 | 0.143±0.030 | 0.105±0.021 | 0.104±0.021 | 0.112±0.023 | NA
Exp | RMSE | 0.370±0.069 | 0.371±0.069 | 0.156±1.243 | 0.527±0.088 | 0.371±0.069 | 0.370±0.069 | 0.410±0.070 | NA
Poi | MAE | 6.417±1.145 | 0.975±0.200 | 0.974±0.200 | 1.025±0.216 | 0.976±0.199 | 5.629±1.101 | 1.179±0.278 | NA
Poi | RMSE | 16.002±1.749 | 2.524±0.404 | 6.566±27.245 | 2.672±0.450 | 2.529±0.404 | 14.989±1.710 | 3.081±0.667 | NA
Cat | MAE | 0.643±0.134 | 0.293±0.053 | 0.285±0.056 | 0.342±0.075 | 0.293±0.053 | 0.579±0.123 | 0.320±0.068 | NA
Cat | RMSE | 1.713±0.248 | 0.694±0.092 | 0.544±1.884 | 0.925±0.149 | 0.695±0.092 | 1.566±0.242 | 0.823±0.140 | NA
| | impute_zero | impute_mean | impute_medi | impute_mfre | impute_iter | MIDAS | GAIN | MDL
Uni | MAE | 0.146±0.003 | 0.087±0.003 | 0.087±0.003 | 0.145±0.003 | 0.087±0.003 | 0.087±0.003 | 0.087±0.003 | 0.077±0.002
Uni | RMSE | 0.202±0.003 | 0.127±0.004 | 0.127±0.004 | 0.201±0.004 | 0.127±0.004 | 0.127±0.003 | 0.127±0.004 | 0.104±0.002
Gau | MAE | 3.731±0.031 | 0.294±0.015 | 0.294±0.015 | 1.247±0.122 | 0.294±0.015 | 1.456±0.132 | 0.304±0.026 | 0.280±0.014
Gau | RMSE | 4.429±0.034 | 0.469±0.024 | 0.469±0.024 | 1.598±0.147 | 0.469±0.024 | 1.959±0.145 | 0.473±0.034 | 0.403±0.019
Exp | MAE | 0.295±0.025 | 0.296±0.025 | 0.289±0.026 | 0.314±0.023 | 0.296±0.025 | 0.295±0.025 | 0.297±0.025 | 0.309±0.026
Exp | RMSE | 0.572±0.069 | 0.572±0.069 | 0.575±0.069 | 0.583±0.060 | 0.572±0.069 | 0.572±0.069 | 0.570±0.067 | 0.542±0.072
Poi | MAE | 21.252±0.204 | 2.214±0.112 | 2.214±0.112 | 2.242±0.113 | 2.215±0.112 | 19.098±0.241 | 2.396±0.300 | 2.073±0.098
Poi | RMSE | 29.043±0.230 | 3.803±0.203 | 3.805±0.202 | 3.834±0.200 | 3.803±0.203 | 26.766±0.302 | 4.018±0.451 | 3.284±0.164
Cat | MAE | 1.390±0.029 | 0.648±0.029 | 0.638±0.029 | 0.692±0.057 | 0.648±0.029 | 1.221±0.026 | 0.660±0.041 | 0.598±0.027
Cat | RMSE | 2.168±0.042 | 1.046±0.051 | 1.048±0.050 | 1.131±0.097 | 1.046±0.051 | 1.945±0.042 | 1.069±0.067 | 0.915±0.049

Findings and discussion.
The distance estimation results are shown in Table III, and the feature importance is plotted in Fig. 7. In this experiment, we found that combining multiple imputation methods and training a regression model improves the overall estimations. This is intuitive: the regression model learns which imputation methods work best and weights them accordingly. Compared to calculating distance from imputed values, using masked distance learning to predict the distance had a lower RMSE and MAE. This inspired us to think that combining multiple instances of uncertainty into one problem and searching for the best overall solution may produce a better result than solving each uncertainty issue individually.
Regarding the feature importance returned by the gradient boosted decision tree, we noticed that different imputation methods resulted in different importance ranks for different distributions. For example, mean imputation contributed the most to the distance estimation with missing values on the uniform and Poisson distributions while, for exponential distributions, zero imputation was more important. The overall estimation results show that integrating diverse imputation methods for distance estimation with missing values is effective.
IV-B Fuzzy-weighted Frequency vs. Crisp Frequency Drift Detection






Experiment 2 – Algorithm ablation test
In this experiment, we analyzed the contribution of each component of MDL-FWF to its performance.
Data sets.
First, we needed to simulate drift, so we generated data sets using three different distributions and five different types of drift, adopting the synthetic drift generation methods commonly used for drift detection [7, 36, 37, 39].
Configurations.
Config. 1 – No missing values (Complete). The first configuration is the MDL-FWF baseline, i.e., drift detection with no missing values in the data set. In this configuration, we created k-Means space partitions and counted the observations in each cluster as the frequency. The presence of a drift was determined from a Pearson’s chi-square test.
Config. 2 – Impute missing values (MV Impute). In this setting, we duplicated the data set generated in Config. 1 and introduced MCAR missing values by randomly replacing some values with null. Those missing values were then imputed with iterative imputation. We chose iterative imputation because, after comparing the five imputation methods, the iterative approach gave the best performance apart from MDL. k-Means and Pearson’s chi-square tests were used to detect drifts.
Config. 3 – Masked distance learning only (MDL). In this configuration, we used the masked distance learning algorithm to estimate the distance between the cluster centroids and the observations with missing values. Those estimated distances were then used to compute the contingency table. Again, Pearson’s chi-square test was used to detect drifts.
Configs. 4.1 & 4.2 – MDL-FWF. Similar to Config. 3, we predicted the distance between the cluster centroids and the observations with missing values via MDL. However, as opposed to using the estimated distances to build the contingency table, we used the fuzzy-weighted frequency measurements via a Gaussian fuzzy membership function (Config. 4.1) and a triangular fuzzy membership function (Config. 4.2). Pearson’s chi-square tests determined drift.
Configs. 5.1 & 5.2 – MDL-FWF with the top-Q clusters (MDL-FWF-TopQ). Based on Configs. 4.1 and 4.2, we measured the fuzzy-weighted frequency of the top-Q clusters, instead of considering the degree of membership to all clusters. The same membership functions as in Configs. 4.1 and 4.2 apply to Configs. 5.1 and 5.2.
Evaluation metric.
The conventional drift detection evaluation metrics are Type-I and Type-II errors. A Type-I error is the rejection of a true null hypothesis (a “false positive”), and a Type-II error is the failure to reject a false null hypothesis (a “false negative”). For concept drift detection, the null hypothesis is that there is no concept drift between the populations. Commonly, the Type-I and Type-II errors are shown separately, and it is difficult to visualize them in one plot. Therefore, to better visualize the results of each drift detection method, we used the drift detection ratio as the metric [34], which is the number of rejected hypotheses divided by the total number of tests.
To quantify the performance of each configuration, we used Pearson’s correlation coefficient as an indicator of the similarity between different drift detection configurations. Since the base model is k-means, the drift detection ratios of Config. 1 (Complete) form the baseline, and Configs. 2 to 5 show the impact of each component. A higher Pearson’s correlation coefficient with Config. 1 means that the component is less affected by the missing values.
Drift Severity Level | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
1. Complete | 0.012 | 0.011 | 0.021 | 0.067 | 0.150 | 0.288 | 0.473 | 0.636 | 0.888 | 0.916 | 0.919 |
2. MV Impute | 0.026 | 0.027 | 0.034 | 0.075 | 0.174 | 0.296 | 0.429 | 0.524 | 0.656 | 0.684 | 0.713 |
3. MDL | 0.034 | 0.032 | 0.061 | 0.101 | 0.293 | 0.411 | 0.585 | 0.739 | 0.951 | 0.988 | 0.995 |
4.1 MDL-FWF-Gau | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.006 | 0.022 | 0.114 | 0.285 | 0.465 | 0.609 |
4.2 MDL-FWF-Tri | 0.000 | 0.000 | 0.000 | 0.001 | 0.004 | 0.007 | 0.026 | 0.078 | 0.194 | 0.364 | 0.542 |
5.1 MDL-FWF-Gau-TopQ | 0.000 | 0.000 | 0.001 | 0.005 | 0.052 | 0.137 | 0.280 | 0.434 | 0.745 | 0.894 | 0.953 |
5.2 MDL-FWF-Tri-TopQ | 0.003 | 0.002 | 0.006 | 0.020 | 0.117 | 0.229 | 0.369 | 0.540 | 0.853 | 0.956 | 0.977 |
Findings and discussion.
Fig. 8 summarizes the results of Experiment 2, and Table IV records the average drift detection ratio at each drift severity level. The Pearson’s correlation coefficients for each configuration follow; the highest coefficient indicates the best configuration.
• Config. 1 vs. 2: 0.798
• Config. 1 vs. 3: 0.922
• Config. 1 vs. 4.1: 0.651
• Config. 1 vs. 4.2: 0.694
• Config. 1 vs. 5.1: 0.924
• Config. 1 vs. 5.2: 0.944
These coefficients reveal the impact of each configuration on the drift detection results. Note that the closer the coefficient is to 1, the better the missing value handling method.
According to the coefficients, it is clear that simply applying data imputation and then concept drift detection (Config. 2, MV Impute) did not produce very accurate results. Table IV shows that imputation might increase sensitivity for drifts of low severity but, for drifts of high severity, it could have the opposite impact. Config. 3, where the distance between observations was estimated with masked distance learning, had a coefficient (0.922) very close to the baseline. However, the average drift detection ratios shown in Table IV tell us that MDL increased the drift sensitivity in most cases. This opens the possibility that MDL may raise false alarms when there is no drift and, therefore, not reach the required confidence level. By contrast, Configs. 4.1 and 4.2 (MDL-FWF with Gaussian and triangular functions) performed very poorly. These configurations treat distance estimation errors as an uncertainty issue and apply a fuzzy membership function to model the weight of observations. However, in some cases, the distance estimation errors can be very large, which results in an overestimation of the observations’ weights, in turn reducing the sensitivity of drift detection. Our solution was to consider only the top-Q clusters to handle these overestimated weights. As Configs. 5.1 and 5.2 show, performance with this approach was almost as good as the baseline without increasing the false alarm rate, where the baseline was drift detection on the complete data sets.


IV-C A Comparative Study Against Other Drift Detection Algorithms
Experiment 3 – A comparative study
These experiments were designed to evaluate MDL-FWF’s performance in comparison to other concept drift detection algorithms. We chose four state-of-the-art methods. The evaluation data sets and metric are the same as in Experiment 2.
Configurations.
Multivariate Wald-Wolfowitz test (MWW test). This test was developed by Friedman and Rafsky [39] as a multivariate generalization of the Wald-Wolfowitz test and is the baseline for this experiment. The implementation is publicly available online (https://gist.github.com/vmonaco/e9ff0ac61fcb3b1b60ba). No specific parameters are required to run this test.
QuantTree with Pearson statistics is a histogram-based change detection method for multivariate data streams [37]. Pearson’s statistic was selected because it has the best reported performance. The implementation is available online (http://home.deib.polimi.it/boracchi/Projects/projects.html). The partition size was set to the sample size divided by the number of bins, which is the same as the default setting for MDL-FWF.
MDL-FWF-TopQ. Our proposed algorithm, implemented with two different membership functions, i.e., Gaussian and triangular. The implementation code is available online (https://github.com/Anjin-Liu/TFS-MDL-FWF).
ME test is a generic normalized mean embedding (ME) test using a specified kernel, as used in [43, 44]. The implementation of the ME test is available online (https://github.com/wittawatj/interpretable-test). The algorithm parameters were set to the defaults, except for one parameter, which was set following the authors’ recommendation.
Quadratic MMD test [45], where the null distribution is computed by permutation. The implementation is available online (https://github.com/wittawatj/interpretable-test). The parameters were set following the authors’ recommendation.
Findings and discussion.
Drift Severity Level | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Complete MWW | 0.053 | 0.056 | 0.066 | 0.082 | 0.143 | 0.179 | 0.265 | 0.383 | 0.681 | 0.807 | 0.873 |
Complete QuantTree | 0.065 | 0.064 | 0.085 | 0.121 | 0.364 | 0.49 | 0.624 | 0.755 | 0.862 | 0.957 | 0.991 |
Complete MDL | 0.012 | 0.011 | 0.021 | 0.067 | 0.150 | 0.288 | 0.473 | 0.636 | 0.888 | 0.916 | 0.919 |
MDL-FWF-Gau-TopQ | 0.012 | 0.011 | 0.021 | 0.067 | 0.150 | 0.288 | 0.473 | 0.636 | 0.888 | 0.916 | 0.919 |
MDL-FWF-Tri-TopQ | 0.012 | 0.011 | 0.021 | 0.067 | 0.150 | 0.288 | 0.473 | 0.636 | 0.888 | 0.916 | 0.919 |
Complete ME | 0.043 | 0.042 | 0.069 | 0.094 | 0.2576 | 0.302 | 0.340 | 0.402 | 0.544 | 0.598 | 0.607 |
Complete MMD | 0.022 | 0.041 | 0.127 | 0.26 | 0.4308 | 0.520 | 0.644 | 0.713 | 0.924 | 0.971 | 0.996 |
Drift Severity Level | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Impute MWW | 0.144 | 0.153 | 0.217 | 0.308 | 0.520 | 0.623 | 0.692 | 0.764 | 0.902 | 0.953 | 0.990 |
Impute QuantTree | 0.090 | 0.094 | 0.120 | 0.182 | 0.293 | 0.398 | 0.462 | 0.531 | 0.669 | 0.764 | 0.852 |
Impute MDL | 0.026 | 0.027 | 0.034 | 0.075 | 0.174 | 0.296 | 0.429 | 0.524 | 0.656 | 0.684 | 0.713 |
MDL-FWF-Gau-TopQ | 0.000 | 0.000 | 0.001 | 0.005 | 0.052 | 0.137 | 0.280 | 0.434 | 0.745 | 0.894 | 0.953 |
MDL-FWF-Tri-TopQ | 0.003 | 0.002 | 0.006 | 0.020 | 0.117 | 0.229 | 0.369 | 0.540 | 0.853 | 0.956 | 0.977 |
Impute ME | 0.072 | 0.077 | 0.128 | 0.178 | 0.318 | 0.354 | 0.392 | 0.453 | 0.618 | 0.709 | 0.816 |
Impute MMD | 0.094 | 0.146 | 0.300 | 0.445 | 0.735 | 0.803 | 0.836 | 0.868 | 0.939 | 0.976 | 0.997 |
The drift detection results are plotted in Fig. 9, and the average ratios on each of the data sets are shown in Tables V and VI. The Pearson’s correlation coefficients of the 50 drift detection ratios were as follows:
• ME missing vs. complete: 0.235
• MWW missing vs. complete: 0.667
• QuantTree missing vs. complete: 0.785
• MMD missing vs. complete: 0.887
• MDL-FWF-Gau-TopQ vs. complete: 0.924
• MDL-FWF-Tri-TopQ vs. complete: 0.944
Missing values impacted the ME test the most, followed by the MWW test, QuantTree, the MMD test, and the MDL-FWF methods. These results confirm our assumption that missing values introduce uncertainty into distribution-based concept drift detection. Although it is difficult to examine their general performance on all distributions, these results at least demonstrate the need to consider missing values when dealing with distribution changes. Another conclusion we draw from these results is that fuzzy theory has good potential for handling missing values in tasks that require accurate and sensitive concept drift detection.
IV-D Drift detection on real-world data sets with synthetic drifts and missing values
Experiment 4 – Evaluation with real-world data
Our final test was to compare MDL-FWF with the state-of-the-art algorithms in real-world scenarios.
Real-I.a | Real-I.b | Real-II.a | Real-II.b | Real-III.a | Real-III.b | TtDiff | Rank | |||||||
Impu | Comp | Impu | Comp | Impu | Comp | Impu | Comp | Impu | Comp | Impu | Comp | |||
MWW | 0.508 | 0.008 | 0.874 | 0.076 | 0.038 | 0.006 | 0.040 | 0.014 | 0.014 | 0.002 | 0.028 | 0.004 | 1.39 | 7 |
QuantTree | 0.048 | 0.000 | 0.170 | 0.036 | 0.068 | 0.022 | 0.200 | 0.092 | 0.278 | 0.062 | 0.482 | 0.090 | 0.94 | 6 |
MDL | 0.002 | 0.014 | 0.016 | 0.004 | 0.010 | 0.044 | 0.000 | 0.000 | 0.014 | 0.028 | 0.038 | 0.008 | 0.10 | 2 |
MDL-FWF-Gau-TopQ | 0.000 | 0.014 | 0.012 | 0.004 | 0.010 | 0.044 | 0.000 | 0.000 | 0.014 | 0.028 | 0.000 | 0.008 | 0.08 | 1 |
MDL-FWF-Tri-TopQ | 0.000 | 0.014 | 0.022 | 0.004 | 0.002 | 0.044 | 0.000 | 0.000 | 0.000 | 0.028 | 0.000 | 0.008 | 0.11 | 3 |
ME | 0.164 | 0.020 | 0.040 | 0.164 | 0.052 | 0.080 | 0.004 | 0.000 | 0.092 | 0.064 | 0.186 | 0.010 | 0.50 | 5 |
MMD | 0.226 | 0.020 | 0.256 | 0.060 | 0.008 | 0.008 | 0.002 | 0.000 | 0.000 | 0.000 | 0.012 | 0.010 | 0.41 | 4 |
Real-I.a | Real-I.b | Real-II.a | Real-II.b | Real-III.a | Real-III.b | TtDiff | Rank | |||||||
Impu | Comp | Impu | Comp | Impu | Comp | Impu | Comp | Impu | Comp | Impu | Comp | |||
MWW | 0.024 | 0.824 | 0.013 | 0.593 | 0.909 | 0.933 | 0.843 | 0.960 | 0.985 | 1.005 | 0.966 | 0.995 | 1.57 | 6 |
QuantTree | 0.294 | 0.785 | 0.369 | 0.459 | 0.733 | 0.970 | 0.245 | 0.987 | 0.505 | 0.261 | 0.431 | 0.829 | 2.20 | 7 |
MDL | 0.808 | 0.470 | 0.780 | 0.868 | 0.415 | 0.426 | 0.308 | 0.535 | 0.963 | 0.967 | 0.634 | 0.816 | 0.85 | 3 |
MDL-FWF-Gau-TopQ | 0.500 | 0.469 | 0.996 | 0.862 | 0.483 | 0.423 | 0.757 | 0.532 | 0.960 | 0.969 | 0.981 | 0.817 | 0.62 | 1 |
MDL-FWF-Tri-TopQ | 0.481 | 0.468 | 0.971 | 0.865 | 0.418 | 0.423 | 1.002 | 0.536 | 1.006 | 0.965 | 1.009 | 0.816 | 0.82 | 2 |
ME | 0.729 | 0.881 | 0.918 | 0.786 | 0.796 | 0.932 | 0.970 | 0.927 | 0.972 | 0.703 | 0.467 | 0.777 | 1.04 | 5 |
MMD | 0.168 | 0.544 | 0.133 | 0.323 | 0.142 | 0.188 | 0.226 | 0.367 | 1.008 | 1.009 | 0.500 | 0.626 | 0.88 | 4 |
Data sets.
HIGGS Bosons and Background Data Set [35]. The objective of this data set is to distinguish the signatures of processes that produce Higgs boson particles from background processes that do not. We selected four low-level indicators of azimuthal angular momenta for four particle jets as features, the same as in [35]. The jet momenta distributions of the background processes are denoted as Back, and the processes that produce Higgs bosons are denoted as Higgs. There were four types of data integration: Back-Back (Type-I Real I.a), where both sample sets were drawn from Back; Higgs-Higgs (Type-I Real I.b), where both sample sets were drawn from Higgs; Back-Higgs (Type-II Real I.a); and Higgs-Back (Type-II Real I.b).
Arabic Digit Mixture Data Set [38]. This data set contains audio features of 88 people (44 females and 44 males) pronouncing Arabic digits between 0 and 9. We applied the same configuration as Denis et al. [38]: each time series is replaced by its mean and standard deviation as attributes. Mixture distribution $W_1$ contained randomly selected samples of both males and females, with one subset of the digit labels drawn from the female samples and the remaining labels from the male samples. Mixture distribution $W_2$ reversed the labels. Drift detection on $(W_1, W_1)$ is Type-I Real II.a; $(W_2, W_2)$ is Type-I Real II.b; $(W_1, W_2)$ is Type-II Real II.a; and $(W_2, W_1)$ is Type-II Real II.b.
Insects Mixture Data Set [38]. This data set contains 49 features from a laser sensor. The task is to distinguish between five possible species of flying insects that pass through a laser in a controlled environment (Flies, Aedes, Tarsalis, Quinx, and Fruit). A preliminary analysis showed no drift in the feature space. However, the class distributions gradually change over time. To simulate drift in multiple clusters, we selected the samples from the different insects and grouped them together at different percentages, so that the data distribution varies. The populations are generated based on different mixture ratios of the five species.
Evaluation metric.
According to David [46], more samples than are available here are required to reliably estimate Pearson, Kendall, and Spearman correlations. Therefore, the results were evaluated based on the absolute total difference between the complete and the imputed drift detection ratios in terms of Type-I and Type-II errors, i.e.,
$\mathrm{TtDiff} = \sum_{i=1}^{N} \left| r_i^{comp} - r_i^{impu} \right|$,
where $r^{comp}$ indicates the drift detection ratio on the complete data sets, $r^{impu}$ is the ratio on the data with missing values and imputation, and $N$ denotes the number of comparisons.
Findings and discussion.
The results of Experiment 4 are summarized in Tables VII and VIII. The results show a similar pattern to Experiment 3; that is, imputing missing values introduces uncertainty and decreases the sensitivity of drift detection. However, MDL-FWF is more robust to missing values than its comparators. From Tables VII and VIII, we can see that MMD was the most powerful drift detector on the complete data sets but, after introducing missing values, the Type-I errors increased dramatically and the Type-II errors dropped significantly. The total difference in Type-I errors for MMD on the six data sets was 0.41, and the difference in Type-II errors was 0.88. Thus, the mean of all errors was $(0.41 + 0.88)/12 = 10.75\%$. In contrast, MDL-FWF had a mean of only $(0.08 + 0.62)/12 = 5.83\%$ for all errors. This is an improvement of nearly 46%. Such a change in the drift detection results is dangerous because it may increase the number of false alarms and result in unnecessary warnings. Based on the results of these four experiments, we conclude that missing values should be carefully addressed when comparing sample distributions, and we find that using fuzzy set theory to model the uncertainty introduced by missing values has a positive effect on distribution discrepancy analysis.
V Conclusion
In this paper, we investigated the influence of missing values on concept drift detection and approached imputation from a task perspective rather than a value perspective – that is, estimating the statistics we need, as opposed to estimating the missing values and then calculating the statistics. The proposed algorithm is one integrated concept drift detection algorithm that can handle missing values, rather than an imputation algorithm combined with a concept drift detection algorithm. The results show that applying data imputation and then conducting drift detection impairs the performance of today’s drift detection algorithms. We attribute this phenomenon to imputation errors and propose an observation fuzzification model to account for those errors. Our proposed drift detection algorithm consists of two major steps. The first is a masked distance learning (MDL) scheme that estimates the distances between observations with missing values and their surrounding cluster centroids. The second is the fuzzy-weighted frequency (FWF) method, which accounts for MDL’s estimation errors and uses fuzzy membership to normalize the weighted frequency of observations. Four experiments were conducted to show the efficacy of our method.
In future work, we need to explore a method that can address the disadvantages of complete case analysis, as well as a robust histogram construction method that can handle missing values in extreme cases. The dimensionality of the data is also worth mentioning. Since two observations are concatenated horizontally, which doubles the number of dimensions, computation costs can increase dramatically with high-dimensional data sets. We consider this our next challenge to solve, and our intention is to begin by exploring whether any dimension reduction techniques might mitigate the problem. At the current stage, we have only applied a very shallow concept of fuzzy distance with our fuzzy-distance estimations. However, there is still potential for a distribution estimation based on uncertainty measures. This is a more theoretically oriented strategy that we intend to explore in a future study.
Acknowledgments
The work presented in this paper was supported by the Australian Research Council (ARC) under Discovery Project DP190101733.
References
- [1] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under concept drift: A review,” IEEE Transactions on Knowledge and Data Engineering, 2018.
- [2] M. Pratama, J. Lu, E. Lughofer, G. Zhang, and M. J. Er, “An incremental learning of concept drifts using evolving type-2 recurrent fuzzy neural networks,” IEEE Transactions on Fuzzy Systems, vol. 25, no. 5, pp. 1175–1192, 2016.
- [3] I. Žliobaitė, M. Pechenizkiy, and J. Gama, An overview of concept drift applications, ch. 4, pp. 91–114. Springer, 2016.
- [4] J. Lu, A. Liu, Y. Song, and G. Zhang, “Data-driven decision support under concept drift in streamed big data,” Complex & Intelligent Systems, vol. 6, no. 1, pp. 157–163, 2020.
- [5] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM Computing Surveys, vol. 46, no. 4, pp. 1–37, 2014.
- [6] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, “Learning in nonstationary environments: A survey,” IEEE Computational Intelligence Magazine, vol. 10, pp. 12–25, 2015.
- [7] N. Lu, G. Zhang, and J. Lu, “Concept drift detection via competence models,” Artificial Intelligence, vol. 209, pp. 11–28, 2014.
- [8] N. Lu, J. Lu, G. Zhang, and R. L. De Mantaras, “A concept drift-tolerant case-base editing technique,” Artificial Intelligence, vol. 230, pp. 108–133, 2016.
- [9] M. Pratama, J. Lu, and G. Zhang, “Evolving type-2 fuzzy classifier,” IEEE Transactions on Fuzzy Systems, vol. 24, no. 3, pp. 574–589, 2015.
- [10] M. Pratama, S. G. Anavatti, M. Joo, and E. D. Lughofer, “pClass: An effective classifier for streaming examples,” IEEE Transactions on Fuzzy Systems, vol. 23, no. 2, pp. 369–386, 2015.
- [11] M. Das, M. Pratama, S. Savitri, and Z. Jie, “MUSE-RNN: A multilayer self-evolving recurrent neural network for data stream classification,” in Proceedings of the 19th IEEE International Conference on Data Mining, 2019.
- [12] Y. Song, J. Lu, H. Lu, and G. Zhang, “Fuzzy clustering-based adaptive regression for drifting data streams,” IEEE Transactions on Fuzzy Systems, vol. 28, no. 3, pp. 544–557, 2020.
- [13] Z. Ning, J. Yu, Y. Pan, and H. Li, “Adaptive event-triggered fault detection for fuzzy stochastic systems with missing measurements,” IEEE Transactions on Fuzzy Systems, vol. 26, no. 4, pp. 2201–2212, 2017.
- [14] J. Song, Y. Niu, J. Lam, and H.-K. Lam, “Fuzzy remote tracking control for randomly varying local nonlinear models under fading and missing measurements,” IEEE Transactions on Fuzzy Systems, vol. 26, no. 3, pp. 1125–1137, 2017.
- [15] M. S. Santos, R. C. Pereira, A. F. Costa, J. P. Soares, J. Santos, and P. H. Abreu, “Generating synthetic missing data: A review by missing mechanism,” IEEE Access, vol. 7, pp. 11651–11667, 2019.
- [16] R. J. Little and D. B. Rubin, Statistical analysis with missing data, vol. 793. Hoboken, New Jersey, United States: John Wiley & Sons, 2019.
- [17] J. Yoon, J. Jordon, and M. van der Schaar, “GAIN: Missing data imputation using generative adversarial nets,” in Proceedings of the 2018 International Conference on Machine Learning, pp. 5689–5698, 2018.
- [18] S. van Buuren and K. Groothuis-Oudshoorn, “MICE: Multivariate imputation by chained equations in R,” Journal of Statistical Software, pp. 1–68, 2010.
- [19] D. J. Stekhoven and P. Bühlmann, “MissForest: Non-parametric missing value imputation for mixed-type data,” Bioinformatics, vol. 28, no. 1, pp. 112–118, 2011.
- [20] R. Mazumder, T. Hastie, and R. Tibshirani, “Spectral regularization algorithms for learning large incomplete matrices,” Journal of Machine Learning Research, vol. 11, pp. 2287–2322, 2010.
- [21] H.-F. Yu, N. Rao, and I. S. Dhillon, “Temporal regularized matrix factorization for high-dimensional time series prediction,” in Proceedings of the 2016 Advances in Neural Information Processing Systems, pp. 847–855, 2016.
- [22] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims, “Recommendations as treatments: Debiasing learning and evaluation,” in Proceedings of the 2016 International Conference on Machine Learning, pp. 1670–1679, 2016.
- [23] P. J. García-Laencina, J.-L. Sancho-Gómez, and A. R. Figueiras-Vidal, “Pattern classification with missing data: A review,” Neural Computing and Applications, vol. 19, no. 2, pp. 263–282, 2010.
- [24] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 2008 International Conference on Machine Learning, pp. 1096–1103, 2008.
- [25] L. Gondara and K. Wang, “MIDA: Multiple imputation using denoising autoencoders,” in Proceedings of the 2018 Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 260–272, Springer, 2018.
- [26] P. D. Allison, Missing data. Thousand Oaks, California, United States: Sage Publications, 2010.
- [27] C. K. Enders, Applied missing data analysis. Guilford Press, 2010.
- [28] X. Wang, R. Wang, and C. Xu, “Discovering the relationship between generalization and uncertainty by incorporating complexity of classification,” IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 703–715, 2018.
- [29] X. Wang, H. Xing, Y. Li, Q. Hua, C. Dong, and W. Pedrycz, “A study on relationship between generalization abilities and fuzziness of base classifiers in ensemble learning,” IEEE Transactions on Fuzzy Systems, vol. 23, no. 5, pp. 1638–1654, 2015.
- [30] G. E. Box, W. G. Hunter, and J. S. Hunter, Statistics for experimenters: An introduction to design, data analysis, and model building, vol. 1. John Wiley & Sons, 1978.
- [31] A. Liu, G. Zhang, and J. Lu, “Concept drift detection via equal intensity k-means space partitioning,” IEEE Transactions on Cybernetics, early access, pp. 1–14, 2020.
- [32] B. W. Silverman, Density estimation for statistics and data analysis. Routledge, 2018.
- [33] L. Bu, C. Alippi, and D. Zhao, “A pdf-free change detection test based on density difference estimation,” IEEE Transactions on Neural Networks Learning Systems, vol. 29, no. 2, pp. 324–334, 2018.
- [34] A. Liu, J. Lu, F. Liu, and G. Zhang, “Accumulating regional density dissimilarity for concept drift detection in data streams,” Pattern Recognition, vol. 76, pp. 256–272, 2018.
- [35] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, pp. 723–773, 2012.
- [36] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi, “An information-theoretic approach to detecting changes in multi-dimensional data streams,” in Proceedings of the 2006 Symposium on the Interface of Statistics, Computing Science, and Applications, pp. 1–24, 2006.
- [37] G. Boracchi, D. Carrera, C. Cervellera, and D. Maccio, “QuantTree: Histograms for change detection in multivariate data streams,” in Proceedings of the 2018 International Conference on Machine Learning, pp. 638–647, 2018.
- [38] D. M. dos Reis, P. Flach, S. Matwin, and G. Batista, “Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test,” in Proceedings of the 2016 International Conference on Knowledge Discovery and Data Mining, pp. 1545–1554, ACM, 2016.
- [39] J. H. Friedman and L. C. Rafsky, “Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests,” The Annals of Statistics, pp. 697–717, 1979.
- [40] K. Honda and H. Ichihashi, “Linear fuzzy clustering techniques with missing values and their application to local principal component analysis,” IEEE Transactions on Fuzzy Systems, vol. 12, no. 2, pp. 183–193, 2004.
- [41] S. Van Buuren, Flexible imputation of missing data. Chapman and Hall/CRC, 2018.
- [42] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, “A unifying view on dataset shift in classification,” Pattern Recognition, vol. 45, no. 1, pp. 521–530, 2012.
- [43] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton, “Fast two-sample testing with analytic representations of probability measures,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 1981–1989, Curran Associates, Inc., 2015.
- [44] W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton, “Interpretable distribution features with maximum testing power,” in Proceedings of the 2016 Advances in Neural Information Processing Systems, pp. 181–189, 2016.
- [45] W. Bounliphone, E. Belilovsky, M. Blaschko, I. Antonoglou, and A. Gretton, “A test of relative similarity for model selection in generative models,” in Proceedings of the 2016 International Conference on Learning Representations, 2016.
- [46] F. N. David, Tables of the ordinates and probability integral of the distribution of the correlation coefficient in small samples. Cambridge, United Kingdom: Cambridge University Press, 1938.
Anjin Liu is a Postdoctoral Research Associate in the Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, Australia. He received the BIT degree (Honours) from the University of Sydney in 2012. His research interests include concept drift detection, adaptive data stream learning, multi-stream learning, machine learning, and big data analytics.
Jie Lu (F’18) is an Australian Laureate Fellow, IEEE Fellow, and IFSA Fellow. She is an internationally renowned scientist in the area of computational intelligence, specifically in fuzzy transfer learning, concept drift, decision support systems, and recommender systems. She is the Director of the Centre for Artificial Intelligence (CAI), which has over 200 staff and students working on over 50 research projects, and the Associate Dean (Research Excellence) in the Faculty of Engineering and Information Technology at the University of Technology Sydney (UTS). She has published six research books and over 450 papers in Artificial Intelligence, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Neural Networks and Learning Systems, Decision Support Systems, and other refereed journals and conference proceedings. She has won 10 ARC Discovery Projects, led 15 industry projects, and supervised 40 PhD students to completion. She serves as Editor-in-Chief of Knowledge-Based Systems (Elsevier) and Editor-in-Chief of the International Journal of Computational Intelligence Systems (Atlantis). She is the Chair of the IEEE CIS Emerging Technologies Technical Committee and the Chair of the IEEE CIS Subcommittee for Senior Members. She has delivered 25 keynote speeches at IEEE and other international conferences and chaired 15 international conferences. She has received the UTS Medal for Research and Teaching Integration (2010), the UTS Medal for Research Excellence (2019), the Computer Journal Wilkes Award (2018), the IEEE Transactions on Fuzzy Systems Outstanding Paper Award (2019), and the Australian Most Innovative Engineer Award (2019), was a finalist for the Australian Eureka Prize for Advanced Data Science (2017, 2018), and has received other national and international awards.
Guangquan Zhang is an Associate Professor and Director of the Decision Systems and e-Service Intelligence (DeSI) Research Laboratory at the University of Technology Sydney, Australia. He received the Ph.D. degree in applied mathematics from Curtin University of Technology, Australia, in 2001. His research interests include fuzzy machine learning, fuzzy optimization, and machine learning. He has authored five monographs, five textbooks, and 460 papers, including 220 refereed international journal papers. Dr. Zhang has won seven Australian Research Council (ARC) Discovery Project grants and many other research grants. He was awarded an ARC QEII Fellowship in 2005. He has served as a member of the editorial boards of several international journals, as a guest editor of eight special issues of IEEE Transactions and other international journals, and has co-chaired several international conferences and workshops in the area of fuzzy decision-making and knowledge engineering.