Online Updating Huber Robust Regression for Big Data Streams
Abstract
Big data streams are attracting increasing attention with the development of modern science and information technology. Because limited computer memory cannot accommodate the high volume of streaming data, real-time methods that avoid storing historical data are worth investigating. Moreover, outliers may occur as high-velocity data streams are generated, calling for more robust analysis. Motivated by these concerns, a novel Online Updating Huber Robust Regression algorithm is proposed in this paper. By extracting key features of newly arriving data subsets, it obtains a computationally efficient online updating estimator without storing historical data. Meanwhile, by integrating Huber regression into the framework, the estimator is robust to contaminated data streams, such as heavy-tailed or heterogeneously distributed ones as well as those with outliers. Moreover, the proposed online updating estimator is asymptotically equivalent to the Oracle estimator obtained from the entire dataset and has lower computation complexity. Extensive numerical simulations and a real data analysis are conducted to evaluate the estimation and calculation efficiency of the proposed method.
Keywords: Online Updating; Huber Regression; Big Data Streams; Divide-and-conquer
1 Introduction
1.1 Background
In recent years, big data has become a significant area of research and garnered extensive attention across various industries, including finance, manufacturing, and astronomy. Compared with traditional datasets, big data is mainly characterized by the “5V” properties, i.e., high Volume, high Velocity, high Variety, low Veracity, and high Value (Jin et al., 2015). For big data streams, namely big data generated sequentially in a streaming style, the characteristics of high Volume and high Velocity are particularly prominent, posing several challenges for analysis. One such challenge concerns computer memory limitations. As big data streams are generated in such large volumes, it is unrealistic to store them completely in computer memory. Consequently, standard statistical methods that assume adequate computer capacity are no longer applicable in this setting. The second challenge pertains to the presence of outliers in big data streams, which can arise from recording errors in the fast data-generating process. While linear regression-based methods are often used for big data analysis, they are highly sensitive to outliers, thereby limiting their effectiveness in this context. As such, it is imperative to develop computationally efficient, memory-conserving, and robust methods for big data stream analysis that can address these challenges.
1.2 Related Works
To address the storage limitation problem in big data analysis, many scholars have designed different methodologies, which can be loosely divided into three categories, i.e., subsampling-based approaches, divide-and-conquer approaches and online updating approaches (Wang et al., 2016). According to Wang et al. (2016), classic subsampling methods include the bag of little bootstraps (Kleiner et al., 2014), leveraging (Ma and Sun, 2015), mean log-likelihood (Liang et al., 2013) and subsample-based MCMC (Liang et al., 2016). Taking the bag of little bootstraps as an example, Kleiner et al. (2014) combined the bootstrap with subsampling, which addressed the high computational cost of the bootstrap in big data settings. As for divide-and-conquer, its simplest form, also called naïve divide-and-conquer (naïve-dc) or one-shot, is gaining much popularity. Its main idea is to divide the complete dataset into several blocks, obtain a local estimator on each block, and compute the global one by simple averaging. Naïve-dc has many applications in regression. For example, to solve large-scale empirical risk minimization problems, Zhang et al. (2012) analyzed a communication-efficient average mixture algorithm based on naïve-dc that is robust to the amount of parallelization. Chen and Xie (2014) designed a split-and-conquer penalized regression covering penalties such as the LASSO, SCAD and MCP; their estimator is asymptotically equivalent to the penalized estimator using the entire dataset. Furthermore, some research has also applied naïve-dc to high-dimensional big datasets. Lee et al. (2015) devised an averaged debiased LASSO estimator that can be obtained in a distributed and communication-efficient way in high-dimensional settings. Battey et al. (2018) focused on hypothesis testing: they applied divide-and-conquer to sparse high-dimensional Rao and Wald tests, addressing the increase in computational complexity caused by high-dimensional regularization. We can also find nonlinear applications such as kernel ridge regression (Zhang et al., 2013) and SVM (Lian and Fan, 2018). More naïve-dc related literature can be found in Lin and Xi (2011), Zhao et al. (2016) and Zhu et al. (2021). However, the complete dataset is needed for naïve-dc before analysis in order to choose an appropriate block number. While this issue might not arise for conventional big datasets, it is a crucial challenge for big data streams, since the latter are incessantly generated with significant volume and velocity, making it infeasible to obtain the complete data.
Different from divide-and-conquer, online updating, which targets big data stream analysis, has no requirement to store historical data. Specifically, it uses key features extracted from newly arriving data batches to continuously update the historical results, thereby achieving computational efficiency and minimal storage. According to Schifano et al. (2016), they were the first group to investigate statistical inference in the online updating scenario. Their online updating framework can handle predictive residual tests in linear regression as well as the estimating equation setting. However, this algorithm needs to start from scratch if new predictive variables are introduced as the data stream is generated. To overcome this disadvantage, Wang et al. (2018) adjusted the online updating framework by bias correction. Through the use of nested models (Allison, 1995; Clogg et al., 1995) in comparing regression coefficients, their method improved estimation efficiency under a variety of practical situations. Lee et al. (2020) extended the aforementioned work to the case where biased predictive variables in big data streams are corrected afterward under the linear model framework; however, their method is still biased if the exact data adjustment point is unknown. Recently, online updating has also been incorporated into other models for big data analysis, such as online updating GIC for variable selection in generalized linear models (Xue and Hu, 2021), the Cox proportional hazards model in an online updating framework (Wu et al., 2021), and online nonparametric methods for functional data (Yang and Yao, 2021). To conclude, online updating is applicable not only to ordinary big datasets but also to big data streams, thereby providing a wider range of applications than naïve-dc. Nevertheless, most existing online updating frameworks are combined with linear regression, which is known to be sensitive to outliers and heavy-tailed distributions. To address this issue, novel online updating frameworks that are designed to achieve robustness to outliers are urgently needed.
Regarding the outlier problem, one popular solution is quantile regression (Koenker and Bassett, 1978). It studies the conditional distribution of the response at each quantile, thus overcoming the outlier problem and being able to explain potential heteroscedasticity in the data. Inspired by its good performance, much research has incorporated quantile regression into big data analysis frameworks, especially divide-and-conquer. Take Chen et al. (2019) as an example: they used a kernel function to smooth the quantile loss function and operated divide-and-conquer iteratively to obtain a linear quantile regression estimator, relaxing the constraint on the number of machines in naïve-dc. Moreover, Hu et al. (2021) addressed the non-differentiability of quantile regression by developing the CSL (Jordan et al., 2019) method to speed up big data algorithms. We can also find other research combining quantile regression with subsampling and online updating, such as Chen and Zhou (2020) and Wang et al. (2022). By using online updating, the method of Wang et al. (2022) can be applied to big data streams. In particular, Wang et al. (2022) analyzed quantile regression from the view of maximum likelihood and incorporated it into an online updating algorithm. Without operations on the check loss function, it is more convenient than the work of Chen et al. (2019) and Chen and Zhou (2020). Some variants of quantile regression, such as composite quantile regression and adaptive quantile regression, have also been investigated in big data analysis; see Jiang et al. (2018) and Jiang et al. (2021) for details.
1.3 Motivations
While quantile regression has been investigated thoroughly, other members of the M-estimation family, such as Huber regression (Huber, 1964), have received little attention. Huber regression transforms the loss of large residuals from quadratic form into linear form to cut down their weights, thus achieving good robustness to outliers. Therefore, Huber regression presents a feasible alternative to quantile regression for solving outlier-related problems. Regrettably, there has been little research on the use of Huber regression in big data analysis. To date, the only work using the Huber loss in big data regression is Luo et al. (2022). They embedded the Huber loss function in Jordan's CSL (Jordan et al., 2019) framework, adding two data-driven parameters to Huber regression to achieve a balance between statistical optimality and communication efficiency. Nonetheless, no literature has focused on combining Huber regression with online updating to analyze big data streams.
In order to solve the storage limitation and outlier problems mentioned above in big data stream analysis, we propose a novel Online Updating Huber Regression algorithm, which is inspired by the maximum likelihood idea in the online updating algorithm of Wang et al. (2022). The resulting online updating estimator in our algorithm is renewed by the current data and summary statistics of the historical data, which is computationally efficient. Specifically, by assuming the local estimators are generated from a multivariate normal distribution and analyzing them from the view of maximum likelihood, a weighted least squares type objective function can be obtained. From this objective function, two key features of each batch can be extracted, which can be updated and easily operated on in big data streams. Meanwhile, we prove that the proposed estimator is asymptotically equivalent to the Oracle estimator using the entire dataset and has lower computation complexity than the Oracle one. Numerical experiments on both simulated and real data are also included to verify the theoretical results and illustrate the good performance of our new method.
Our study makes two primary contributions. First, we employ online updating instead of the widely used divide-and-conquer method to address the limited computer memory constraint, so that the method can be applied to the analysis of big data streams. Furthermore, by integrating Huber regression into the online updating framework, we broaden the usage of online updating to robust regression beyond plain linear regression. The second contribution concerns Huber regression. Based on the state of the art in big data analysis using loss functions from M-estimation, we adopt Huber regression as an alternative to quantile regression to increase robustness. This not only fills certain research gaps concerning Huber regression, but also studies the feasibility of generalizing from quantile regression to other loss functions in M-estimation.
The rest of this paper is organized as follows. Section 2 explains the methodology of our Online Updating Huber Regression by obtaining the main online updating estimator. Section 3 proves some good properties of the Online Updating Huber Regression estimator, including asymptotic equivalency and low computation complexity. Section 4 introduces simulation and real data analysis results in order to verify the good performance of our algorithm. Section 5 is about conclusions and future work of our research.
2 Methodology
2.1 Notation and Oracle Estimator
Suppose there are independent and identically distributed samples $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the response variable and $\boldsymbol{x}_i$ is a $p$-dimensional covariate vector. In the big data streams scenario, suppose the samples are generated sequentially in the form of data subsets $D_1, D_2, \ldots, D_K$, where $D_k$ stands for the $k$-th subset of size $n_k$. Our aim is to fit the model $y_i = \boldsymbol{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i$, where $\boldsymbol{\beta}$ is a $p$-dimensional unknown parameter with true value $\boldsymbol{\beta}_0$, and $\varepsilon_i$ is a random error which can be heavy-tailed, heteroscedastic, or contaminated by outliers; $f(\cdot)$ denotes the probability density function of $\varepsilon_i$.
As we later prove that our Online Updating estimator is asymptotically equivalent to the Oracle one using the entire dataset, we first give the expression of the Oracle Huber Regression estimator. According to Huber (1964), the Oracle Huber Regression estimator of $\boldsymbol{\beta}$ is as follows:
$$\hat{\boldsymbol{\beta}}_{N}=\arg\min_{\boldsymbol{\beta}}\sum_{i=1}^{N}\ell_{\tau}\left(y_{i}-\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}\right), \qquad (1)$$
where $\ell_{\tau}(u)=\frac{u^{2}}{2}I(|u|\le\tau)+\left(\tau|u|-\frac{\tau^{2}}{2}\right)I(|u|>\tau)$ is the Huber loss with robustification parameter $\tau$.
To establish the asymptotic properties, the following regularity conditions are assumed.
(N1) For the Huber loss $\ell_{\tau}(\cdot)$ with derivative $\psi_{\tau}(\cdot)$, the true value $\boldsymbol{\beta}_0$ of $\boldsymbol{\beta}$ satisfies $E\{\psi_{\tau}(y_i-\boldsymbol{x}_i^{\top}\boldsymbol{\beta}_0)\mid\boldsymbol{x}_i\}=0$.
(N2) The distribution of the random error $\varepsilon$ satisfies the standard moment and smoothness conditions required for M-estimation with the Huber loss.
(N3) The covariates $\boldsymbol{x}_i$ satisfy a bounded moment condition.
(N4) There exists a positive definite matrix $\boldsymbol{\Sigma}$ satisfying $N^{-1}\sum_{i=1}^{N}\boldsymbol{x}_i\boldsymbol{x}_i^{\top}\xrightarrow{p}\boldsymbol{\Sigma}$, where $\xrightarrow{p}$ stands for convergence in probability.
Under these conditions, the asymptotic normal distribution of the Oracle Huber Regression estimator is
$$\sqrt{N}\left(\hat{\boldsymbol{\beta}}_{N}-\boldsymbol{\beta}_{0}\right)\xrightarrow{d}N\left(\mathbf{0},\,\sigma_{\tau}^{2}\boldsymbol{\Sigma}^{-1}\right), \qquad (2)$$
where $\sigma_{\tau}^{2}$ is a function of $\tau$ and $\xrightarrow{d}$ stands for convergence in distribution.
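To make Eq.(1) concrete, the following minimal R sketch (not the authors' code; the function name oracle_huber and the tuning constant 1.345 are our assumptions) computes the Oracle Huber fit on the full dataset with IRLS via MASS::rlm, the routine used later in this paper.

```r
# Minimal sketch of the Oracle Huber fit in Eq.(1); names and tuning are assumptions.
library(MASS)
oracle_huber <- function(X, y, huber_k = 1.345) {
  # IRLS fit of Huber regression without intercept (add a column of 1s to X if needed)
  fit <- rlm(y ~ X - 1, psi = psi.huber, k = huber_k, maxit = 1000)
  coef(fit)
}
```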
2.2 Online Updating Huber Robust Regression Estimator
In this part, we mainly derive the expression of the Online Updating Huber Robust Regression (UHR) estimator. First, we calculate the local Huber estimator on each subset. Suppose $\hat{\boldsymbol{\beta}}_k$ is the local Huber estimator on the $k$-th subset $D_k$. According to Huber (1964), its formula is given in Eq.(3):
$$\hat{\boldsymbol{\beta}}_{k}=\arg\min_{\boldsymbol{\beta}}\sum_{i\in D_{k}}\ell_{\tau}\left(y_{i}-\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}\right). \qquad (3)$$
As the entire dataset is divided into several subsets, the subset size $n_k$ can be much smaller than the total sample size $N$. Therefore, the calculation of Eq.(3) is more computationally efficient and less storage intensive than that of Eq.(1), so it can be solved faster. Furthermore, under regularity conditions (N1)-(N4), the local estimator follows the asymptotic normal distribution in Eq.(4), where $\boldsymbol{\Sigma}_k$ denotes the positive definite limit in probability of $n_k^{-1}\sum_{i\in D_k}\boldsymbol{x}_i\boldsymbol{x}_i^{\top}$.
$$\sqrt{n_{k}}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}_{0}\right)\xrightarrow{d}N\left(\mathbf{0},\,\sigma_{\tau}^{2}\boldsymbol{\Sigma}_{k}^{-1}\right). \qquad (4)$$
Taking the idea in Wang et al. (2022) as an inspiration, we consider the local estimators from the perspective of maximum likelihood. Because the samples are independent and identically distributed, the local estimators $\hat{\boldsymbol{\beta}}_1, \ldots, \hat{\boldsymbol{\beta}}_K$ are mutually independent. Thus, we can approximately regard each $\hat{\boldsymbol{\beta}}_k$ as a sample generated from the following multivariate normal distribution:
$$\hat{\boldsymbol{\beta}}_{k}\sim N\left(\boldsymbol{\beta},\;\frac{\sigma_{\tau}^{2}}{n_{k}}\boldsymbol{\Sigma}_{k}^{-1}\right). \qquad (5)$$
Based on Eq.(5), the likelihood function of $\boldsymbol{\beta}$ is
$$L(\boldsymbol{\beta})=\prod_{k=1}^{K}(2\pi)^{-p/2}\left|\frac{\sigma_{\tau}^{2}}{n_{k}}\boldsymbol{\Sigma}_{k}^{-1}\right|^{-1/2}\exp\left\{-\frac{n_{k}}{2\sigma_{\tau}^{2}}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right)^{\top}\boldsymbol{\Sigma}_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right)\right\}, \qquad (6)$$
and the corresponding log-likelihood is
$$\log L(\boldsymbol{\beta})=C-\frac{1}{2\sigma_{\tau}^{2}}\sum_{k=1}^{K}n_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right)^{\top}\boldsymbol{\Sigma}_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right), \qquad (7)$$
where $C$ is a constant and $\sigma_{\tau}^{2}$ is a function of $\tau$, both irrelevant to $\boldsymbol{\beta}$. By maximizing Eq.(7), we can obtain the UHR estimator. Meanwhile, based on the non-negativity of the quadratic form, maximizing Eq.(7) is equivalent to minimizing the following loss function:
$$Q(\boldsymbol{\beta})=\sum_{k=1}^{K}n_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right)^{\top}\boldsymbol{\Sigma}_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right). \qquad (8)$$
By replacing $\boldsymbol{\Sigma}_{k}$ with its sample version $n_{k}^{-1}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}$ according to (N4), we obtain the final objective function as follows, where $\boldsymbol{X}_{k}$ denotes the design matrix of the $k$-th subset.
$$\tilde{Q}(\boldsymbol{\beta})=\sum_{k=1}^{K}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right)^{\top}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}\right). \qquad (9)$$
Thus the Online Updating Huber Robust Regression estimator of $\boldsymbol{\beta}$, denoted $\hat{\boldsymbol{\beta}}_{UHR}$, is obtained by minimizing Eq.(9):
$$\hat{\boldsymbol{\beta}}_{UHR}=\left(\sum_{k=1}^{K}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\right)^{-1}\sum_{k=1}^{K}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\hat{\boldsymbol{\beta}}_{k}. \qquad (10)$$
According to Eq.(10), we only need to extract two key features from each subset, i.e., $\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}$ and $\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\hat{\boldsymbol{\beta}}_{k}$, with no need to store the entire dataset. This enables us to save computer memory and speed up the computation process, making the method more suitable for big data streams.
The UHR implementation process can be seen in Figure 1. To start the algorithm, we import the first subset $D_1$. By using the function rlm of the package MASS in R, the local estimator $\hat{\boldsymbol{\beta}}_1$ of the first subset can be calculated, and the two features $\boldsymbol{X}_{1}^{\top}\boldsymbol{X}_{1}$ and $\boldsymbol{X}_{1}^{\top}\boldsymbol{X}_{1}\hat{\boldsymbol{\beta}}_{1}$ can be obtained. For the second subset $D_2$, we use the same method to calculate $\boldsymbol{X}_{2}^{\top}\boldsymbol{X}_{2}$ and $\boldsymbol{X}_{2}^{\top}\boldsymbol{X}_{2}\hat{\boldsymbol{\beta}}_{2}$ and add them to the features of the first subset; the resulting sums are the results of subset 1 updated by subset 2. The remaining subsets are processed in the same manner. After the last subset is imported, we obtain the accumulated key elements $\sum_{k}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}$ and $\sum_{k}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\hat{\boldsymbol{\beta}}_{k}$, and finally the Online Updating estimator updated by the entire dataset. The pseudo-code is given in Algorithm 1, and an R sketch of the loop follows it.
Input: data subsets $D_1, D_2, \ldots, D_K$ arriving sequentially
Output: Online Updating Huber estimator $\hat{\boldsymbol{\beta}}_{UHR}$
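The following is a minimal R sketch of this updating loop (the function names uhr_init, uhr_update and uhr_estimate and the tuning constant 1.345 are our assumptions, not part of the original implementation). Each arriving subset contributes only $\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}$ and $\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\hat{\boldsymbol{\beta}}_{k}$, so no historical raw data is stored.

```r
# Sketch of Algorithm 1 under the assumptions stated above.
library(MASS)

uhr_init <- function(p) list(A = matrix(0, p, p), b = matrix(0, p, 1))

uhr_update <- function(state, X_new, y_new, huber_k = 1.345) {
  beta_k <- coef(rlm(y_new ~ X_new - 1, psi = psi.huber, k = huber_k))  # local Huber fit
  A_k <- crossprod(X_new)                    # X_k' X_k
  b_k <- A_k %*% beta_k                      # X_k' X_k * beta_k
  list(A = state$A + A_k, b = state$b + b_k) # renewed summary statistics
}

uhr_estimate <- function(state) solve(state$A, state$b)  # Eq.(10)

## Illustrative usage on a toy stream with heavy-tailed errors:
set.seed(1)
p <- 5; beta0 <- rep(1, p); state <- uhr_init(p)
for (k in 1:50) {
  X <- matrix(rnorm(1000 * p), 1000, p)
  y <- drop(X %*% beta0) + rcauchy(1000)
  state <- uhr_update(state, X, y)
}
uhr_estimate(state)
```

In this sketch, the final estimator requires only one $p \times p$ linear solve on the accumulated features, which is why the memory footprint does not grow with the number of processed subsets.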
3 Properties of the Online Updating estimator
3.1 Asymptotic Normality
In order to prove the asymptotic normality of our UHR estimator, we first add some assumptions to the regularity conditions (N1)-(N4), following Wang et al. (2022).
(N5) Let $n=N/K$ denote the average size of the subsets. Suppose $n_1, \ldots, n_K$ and $n$ all diverge in the same order, i.e., $c_1\le n_k/n\le c_2$ for some positive constants $c_1$ and $c_2$.
(N6) There exist positive definite matrices $\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K$ satisfying $n_k^{-1}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\xrightarrow{p}\boldsymbol{\Sigma}_k$, and the convergence holds at the same rate for all subsets, i.e., the convergence rate is irrelevant to $k$.
According to (N5) and (N6), all subsets therefore share the same convergence rate.
Under regularity conditions (N1)-(N6), the UHR estimator has the following asymptotic distribution:
$$\sqrt{N}\left(\hat{\boldsymbol{\beta}}_{UHR}-\boldsymbol{\beta}_{0}\right)\xrightarrow{d}N\left(\mathbf{0},\,\sigma_{\tau}^{2}\bar{\boldsymbol{\Sigma}}^{-1}\right), \qquad (11)$$
where $\bar{\boldsymbol{\Sigma}}=N^{-1}\sum_{k=1}^{K}n_{k}\boldsymbol{\Sigma}_{k}$. The proof is as follows.
For the local Huber estimator $\hat{\boldsymbol{\beta}}_{k}$, Eq.(4) gives
$$\sqrt{n_{k}}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}_{0}\right)\xrightarrow{d}N\left(\mathbf{0},\,\sigma_{\tau}^{2}\boldsymbol{\Sigma}_{k}^{-1}\right), \qquad (12)$$
where the $\hat{\boldsymbol{\beta}}_{k}$ are mutually independent across subsets. According to Eq.(10), the relationship between $\hat{\boldsymbol{\beta}}_{UHR}$ and the local estimators is
$$\hat{\boldsymbol{\beta}}_{UHR}-\boldsymbol{\beta}_{0}=\left(\sum_{k=1}^{K}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\right)^{-1}\sum_{k=1}^{K}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}_{0}\right). \qquad (13)$$
By direct calculation and regularity conditions (N1)-(N6),
$$\frac{1}{N}\sum_{k=1}^{K}\boldsymbol{X}_{k}^{\top}\boldsymbol{X}_{k}\xrightarrow{p}\frac{1}{N}\sum_{k=1}^{K}n_{k}\boldsymbol{\Sigma}_{k}=\bar{\boldsymbol{\Sigma}}. \qquad (14)$$
As $N=\sum_{k=1}^{K}n_{k}$, we have
$$\sqrt{N}\left(\hat{\boldsymbol{\beta}}_{UHR}-\boldsymbol{\beta}_{0}\right)=\bar{\boldsymbol{\Sigma}}^{-1}\frac{1}{\sqrt{N}}\sum_{k=1}^{K}n_{k}\boldsymbol{\Sigma}_{k}\left(\hat{\boldsymbol{\beta}}_{k}-\boldsymbol{\beta}_{0}\right)+o_{p}(1). \qquad (15)$$
By the independence of the subsets and Eq.(12), the asymptotic variance of the right-hand side is
$$\sigma_{\tau}^{2}\,\bar{\boldsymbol{\Sigma}}^{-1}\left(\frac{1}{N}\sum_{k=1}^{K}n_{k}\boldsymbol{\Sigma}_{k}\right)\bar{\boldsymbol{\Sigma}}^{-1}=\sigma_{\tau}^{2}\,\bar{\boldsymbol{\Sigma}}^{-1}. \qquad (16)$$
Then we obtain
$$\sqrt{N}\left(\hat{\boldsymbol{\beta}}_{UHR}-\boldsymbol{\beta}_{0}\right)\xrightarrow{d}N\left(\mathbf{0},\,\sigma_{\tau}^{2}\bar{\boldsymbol{\Sigma}}^{-1}\right), \qquad (17)$$
which is Eq.(11).
If the covariates of each batch are homogeneous, i.e., $\boldsymbol{\Sigma}_{1}=\cdots=\boldsymbol{\Sigma}_{K}=\boldsymbol{\Sigma}$, then $\bar{\boldsymbol{\Sigma}}=\boldsymbol{\Sigma}$, and the asymptotic variance coincides with that of the oracle estimator in Eq.(2). In this case the updating estimator $\hat{\boldsymbol{\beta}}_{UHR}$ is asymptotically equivalent to the oracle estimator $\hat{\boldsymbol{\beta}}_{N}$.
3.2 Computation Complexity
In this part, we give a rough calculation of the computation complexity of Oracle Huber Regression and Online Updating Huber Regression, showing that our Online Updating algorithm has lower computation complexity.
In our work, we use the function rlm in the MASS package of R to calculate the local estimators and the Oracle estimator obtained from the entire dataset. The function rlm is implemented by Iteratively Reweighted Least Squares. For the sake of calculation, suppose rlm's maximum number of iterations is $T$, the total sample size is $N$, and the data are divided evenly into $K$ subsets with $n$ samples each. Then the computation complexities of solving a local estimator and the Oracle Huber estimator are about $O(Tn^{2}p)$ and $O(TN^{2}p)$, respectively. As the Online Updating algorithm contains $K$ loops and calculating the key features of each subset costs about $O(np^{2})$, the total computation complexity of the Online Updating algorithm is about $O\!\left(K(Tn^{2}p+np^{2})\right)=O(TNnp+Np^{2})$. Since $p\ll n$, we can simplify this to $O(TNnp)$. Although $T$ is set to a large number (mostly 1000), rlm typically converges within 10 iterations in practice, while the total sample size can reach a million or more, so $T\ll N$. As for $p$, we only consider low-dimensional cases where $p\ll n$. To sum up, the dominant parts of the computation complexity of the Oracle and Online Updating estimators are $O(TN^{2}p)$ and $O(TNnp)$, respectively; the only difference between the two is $N$ versus $n$, where $n\ll N$. In conclusion, our Online Updating Huber Regression theoretically has lower computation complexity than Oracle Huber Regression.
4 Numerical Studies
In this section, we demonstrate the performance of our UHR algorithm by comparing it with four other algorithms on a variety of synthetic settings. We also apply UHR to a real airline dataset released by the American Statistical Association, showing that it can handle real problems as well.
4.1 Simulations
4.1.1 Experiment Settings
In the simulations, we generate data from the following model: $y_i=\boldsymbol{x}_i^{\top}\boldsymbol{\beta}_0+\sigma(\boldsymbol{x}_i)\,\varepsilon_i$,
where the true coefficient vector $\boldsymbol{\beta}_0$ is fixed, the covariates follow a multivariate normal distribution with zero mean and an independence-type (identity) covariance matrix, and the scale function $\sigma(\cdot)$ controls the homoscedasticity or heteroscedasticity of the random error. Specifically, if the random error is homogeneous, $\sigma(\cdot)$ is constant; otherwise it depends on the covariates. For the random error $\varepsilon$, consider the following five cases (an R sketch of this data-generating design is given after the list):
(1) N(0, 1);
(2) a heavy-tailed distribution;
(3) the distribution in case (2), with a fraction of the responses replaced by outliers;
(4) 0.85×N(0,1) + 0.15×N(0,8);
(5) Cauchy(0, 1).
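The sketch below illustrates one way to generate such a subset in R; the details that are not fully specified above (the heavy-tailed law in case 2, the outlier fraction and magnitude in case 3, and the heteroscedastic scale function) are replaced here by illustrative assumptions.

```r
# Hedged sketch of the data-generating design; unspecified details are assumptions.
gen_subset <- function(n, p, beta0, err_case = 1, heterogeneous = FALSE) {
  X <- matrix(rnorm(n * p), n, p)                       # N(0, I_p) covariates
  eps <- switch(err_case,
    rnorm(n),                                           # case 1: N(0,1)
    rt(n, df = 3),                                      # case 2: heavy-tailed (t_3 assumed)
    rt(n, df = 3),                                      # case 3: outliers injected below
    ifelse(runif(n) < 0.85, rnorm(n), rnorm(n, sd = sqrt(8))),  # case 4: normal mixture
    rcauchy(n))                                         # case 5: Cauchy(0,1)
  scale_fac <- if (heterogeneous) 1 + abs(X[, 1]) else 1  # assumed scale function sigma(x)
  y <- drop(X %*% beta0) + scale_fac * eps
  if (err_case == 3) {                                  # replace a fraction of responses by outliers
    idx <- sample(n, ceiling(0.05 * n))                 # fraction and magnitude are assumptions
    y[idx] <- y[idx] + 50
  }
  list(X = X, y = y)
}
```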
Besides our Online Updating Huber Regression algorithm (UHR), we choose four competing algorithms: Ordinary Least Squares (OLS), Renewable Least Squares (RLS; Luo and Song, 2020), Oracle Huber Regression (OHR; Huber, 1964), and Divide-and-Conquer Huber Regression (DC-HR). OLS and RLS are included to demonstrate the robustness advantage of our UHR algorithm. OHR is used to investigate whether the Online Updating estimator has the same estimation efficiency as the Oracle one. DC-HR is a simple combination of naïve-dc and Huber regression, included to compare divide-and-conquer with online updating. Moreover, to reduce randomness, we repeat the whole simulation process 500 times.
As for evaluation indexes, we judge the results in terms of estimation and calculation efficiency. There are three indexes, MSE, MAE and Time, whose expressions are given in Eq.(18)-Eq.(20):
$$\mathrm{MSE}=\frac{1}{500}\sum_{r=1}^{500}\sum_{j=1}^{p}\left(\hat{\beta}_{j}^{(r)}-\beta_{0j}\right)^{2}, \qquad (18)$$
$$\mathrm{MAE}=\frac{1}{500}\sum_{r=1}^{500}\sum_{j=1}^{p}\left|\hat{\beta}_{j}^{(r)}-\beta_{0j}\right|, \qquad (19)$$
$$\mathrm{Time}=\frac{1}{500}\sum_{r=1}^{500}t^{(r)}. \qquad (20)$$
Here $\hat{\beta}_{j}^{(r)}$ is the estimate of the $j$-th dimension of $\boldsymbol{\beta}$ in the $r$-th repetition, so MSE and MAE are the averages over 500 repetitions of the summed squared and absolute errors over all dimensions, and Time is the average elapsed time $t^{(r)}$ over the 500 runs for a given random error case, recorded with the function system.time() in R.
Our simulations include 4 experiments in total:
Experiment 1: Fix the total sample size $N$ and change the subset number $K$ ($K=100, 200, 500, 1000$). This experiment compares the differences in estimation and calculation efficiency between the algorithms and the influence of the subset number on them.
Experiment 2: Fix the subset size at $n=100$ and change the subset number $K$ from 10 to 10000. Accordingly, the total sample size $N$ varies from $10^{3}$ to $10^{6}$. This experiment investigates how the estimation efficiency of each algorithm changes as the subset number increases.
Experiment 3: Fix the subset size $n$ and change the total sample size $N$; accordingly, the subset number $K$ varies with $N$. This experiment investigates the influence of $N$ on each algorithm. We also examine the relationship between $N$ and running time, to verify from a simulation perspective the computation complexity obtained before.
Experiment 4: Fix the subset size at a smaller value than in Experiment 3 and change the total sample size $N$; accordingly, the subset number $K$ varies with $N$. This experiment is very similar to Experiment 3 but with a smaller $n$, aiming to compare the behavior under different scales of $n$.
4.1.2 Estimation Efficiency
For estimation efficiency, we focus on Experiment 1 and Experiment 2. For Experiment 1, Table 1 to Table 5 present the MSE and MAE under the five random error cases. According to Table 1 to Table 5, we can draw the following conclusions:
(1) For heavy-tailed errors, heterogeneous errors, and random errors with outliers, the estimation errors of the Huber family (OHR, DC-HR, UHR) are much smaller than those of the Linear Regression family (OLS, RLS), demonstrating that the Huber algorithms are more robust than the Linear Regression algorithms.
(2) When the random error follows cases 1-4, our UHR has estimation errors very close to those of OHR. This verifies, from an experimental perspective, that the Online Updating estimator is asymptotically equivalent to the Oracle Huber estimator.
(3) When the random error follows case 5, i.e., the Cauchy distribution, the Linear Regression algorithms fail, with extremely large MSE and MAE, while the Huber algorithms still perform well. It is worth mentioning that the distributed algorithms DC-HR and UHR even show clear estimation improvements over the Oracle Huber estimator. This phenomenon may be related to the fact that the variance of the Cauchy distribution does not exist: due to the high volatility of the data, using the whole dataset may not yield the best estimation efficiency, while subsets with smaller sizes may perform better.
(4) Comparing the two distributed Huber algorithms, we find that in most cases UHR has smaller estimation errors than DC-HR, which attests that the online updating algorithm performs better than divide-and-conquer. Even though DC-HR can outperform UHR in some cases, we still prefer UHR due to the wider range of applications of online updating in big data streams.
(5) In order to better display the influence of the subset number on the estimation errors of UHR, we plot the relationship between the subset number and MSE in the heterogeneous cases (Figure 2). To sum up, for the normal distribution, the heavy-tailed distribution, the distribution with outliers and the mixed symmetric normal distribution (cases 1-4), the estimation error changes only slightly as $K$ increases; that is, the estimation results of UHR are robust to the subset number. For the Cauchy case (case 5), the estimation error decreases as $K$ increases, since the subset size decreases, implying the same conclusion as in (3).
case 1: N(0,1) | |||||
---|---|---|---|---|---|
homogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 0.04122 | 0.04122 | 0.04122 | 0.04122 | |
RLS | 0.04122 | 0.04122 | 0.04122 | 0.04122 | |
OHR | 0.04325 | 0.04325 | 0.04325 | 0.04325 | |
DC-HR | 0.04335 | 0.04329 | 0.04336 | 0.04342 | |
UHR | 0.04334 | 0.04333 | 0.04333 | 0.04336 | |
OLS | 0.32565 | 0.32565 | 0.32565 | 0.32565 | |
RLS | 0.32565 | 0.32565 | 0.32565 | 0.32565 | |
OHR | 0.33423 | 0.33423 | 0.33423 | 0.33423 | |
DC-HR | 0.33456 | 0.33414 | 0.33414 | 0.33500 | |
UHR | 0.33458 | 0.33456 | 0.33454 | 0.33486 | |
heterogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 0.24578 | 0.24578 | 0.24578 | 0.24578 | |
RLS | 0.24578 | 0.24578 | 0.24578 | 0.24578 | |
OHR | 0.13903 | 0.13903 | 0.13903 | 0.13903 | |
DC-HR | 0.13879 | 0.13911 | 0.13995 | 0.13974 | |
UHR | 0.13884 | 0.13929 | 0.14001 | 0.13999 | |
OLS | 0.79640 | 0.79640 | 0.79640 | 0.79640 | |
RLS | 0.79640 | 0.79640 | 0.79640 | 0.79640 | |
OHR | 0.60189 | 0.60189 | 0.60189 | 0.60189 | |
DC-HR | 0.60066 | 0.60161 | 0.60297 | 0.60211 | |
UHR | 0.60094 | 0.60205 | 0.60376 | 0.60315 |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case.
case 2: | |||||
---|---|---|---|---|---|
homogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 0.11885 | 0.11885 | 0.11885 | 0.11885 | |
RLS | 0.11885 | 0.11885 | 0.11885 | 0.11885 | |
OHR | 0.06468 | 0.06468 | 0.06468 | 0.06468 | |
DC-HR | 0.06479 | 0.06491 | 0.06500 | 0.06528 | |
UHR | 0.06463 | 0.06464 | 0.06471 | 0.06473 | |
OLS | 0.54831 | 0.54831 | 0.54831 | 0.54831 | |
RLS | 0.54831 | 0.54831 | 0.54831 | 0.54831 | |
OHR | 0.40403 | 0.40403 | 0.40403 | 0.40403 | |
DC-HR | 0.40427 | 0.40454 | 0.40473 | 0.40503 | |
UHR | 0.40381 | 0.40383 | 0.40394 | 0.40364 | |
heterogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 0.74695 | 0.74695 | 0.74695 | 0.74695 | |
RLS | 0.74695 | 0.74695 | 0.74695 | 0.74695 | |
OHR | 0.19960 | 0.19960 | 0.19960 | 0.19960 | |
DC-HR | 0.19773 | 0.19802 | 0.19812 | 0.19878 | |
UHR | 0.19823 | 0.19859 | 0.19902 | 0.19987 | |
OLS | 1.37955 | 1.37955 | 1.37955 | 1.37955 | |
RLS | 1.37955 | 1.37955 | 1.37955 | 1.37955 | |
OHR | 0.71031 | 0.71031 | 0.71031 | 0.71031 | |
DC-HR | 0.70748 | 0.70796 | 0.70763 | 0.70854 | |
UHR | 0.70816 | 0.70844 | 0.70886 | 0.71044 |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case.
case 3: with outliers | |||||
---|---|---|---|---|---|
homogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 2.49242 | 2.49242 | 2.49242 | 2.49242 | |
RLS | 2.49242 | 2.49242 | 2.49242 | 2.49242 | |
OHR | 0.12271 | 0.12271 | 0.12271 | 0.12271 | |
DC-HR | 0.12207 | 0.12241 | 0.12333 | 0.12442 | |
UHR | 0.12185 | 0.12199 | 0.12258 | 0.12310 | |
OLS | 2.54406 | 2.54406 | 2.54406 | 2.54406 | |
RLS | 2.54406 | 2.54406 | 2.54406 | 2.54406 | |
OHR | 0.55517 | 0.55517 | 0.55517 | 0.55517 | |
DC-HR | 0.55402 | 0.55446 | 0.55679 | 0.55884 | |
UHR | 0.55336 | 0.55369 | 0.55487 | 0.55608 | |
heterogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 3.18120 | 3.18120 | 3.18120 | 3.18120 | |
RLS | 3.18120 | 3.18120 | 3.18120 | 3.18120 | |
OHR | 0.36930 | 0.36930 | 0.36930 | 0.36930 | |
DC-HR | 0.36105 | 0.36105 | 0.36166 | 0.36678 | |
UHR | 0.36110 | 0.36135 | 0.36180 | 0.36654 | |
OLS | 2.87060 | 2.87060 | 2.87060 | 2.87060 | |
RLS | 2.87060 | 2.87060 | 2.87060 | 2.87060 | |
OHR | 0.96724 | 0.96724 | 0.96724 | 0.96724 | |
DC-HR | 0.95639 | 0.95577 | 0.95761 | 0.96439 | |
UHR | 0.95605 | 0.95594 | 0.95672 | 0.96274 |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case.
case 4: 0.85×N(0,1)+0.15×N(0,8) | |||||
---|---|---|---|---|---|
homogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 0.07771 | 0.07771 | 0.07771 | 0.07771 | |
RLS | 0.07771 | 0.07771 | 0.07771 | 0.47451 | |
OHR | 0.05530 | 0.05530 | 0.05530 | 0.05530 | |
DC-HR | 0.05538 | 0.05543 | 0.05544 | 0.05565 | |
UHR | 0.05524 | 0.05524 | 0.05518 | 0.05519 | |
OLS | 0.44224 | 0.44224 | 0.44224 | 0.44224 | |
RLS | 0.44224 | 0.44224 | 0.44224 | 1.11088 | |
OHR | 0.37259 | 0.37259 | 0.37259 | 0.37259 | |
DC-HR | 0.37309 | 0.37324 | 0.37331 | 0.37356 | |
UHR | 0.37238 | 0.37240 | 0.37233 | 0.37205 | |
heterogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 0.47451 | 0.47451 | 0.47451 | 0.47451 | |
RLS | 0.47451 | 0.47451 | 0.47451 | 0.47451 | |
OHR | 0.19964 | 0.19964 | 0.19964 | 0.19964 | |
DC-HR | 0.19859 | 0.19857 | 0.19953 | 0.19965 | |
UHR | 0.19881 | 0.19866 | 0.19967 | 0.19990 | |
OLS | 1.11088 | 1.11088 | 1.11088 | 1.11088 | |
RLS | 1.11088 | 1.11088 | 1.11088 | 1.11088 | |
OHR | 0.71519 | 0.71519 | 0.71519 | 0.71519 | |
DC-HR | 0.71321 | 0.71335 | 0.71535 | 0.71578 | |
UHR | 0.71364 | 0.71377 | 0.71559 | 0.71599 |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case.
case 5: Cauchy(0,1) | |||||
---|---|---|---|---|---|
homogeneous | |||||
100 | 200 | 500 | 1000 | ||
OLS | 11711466.60880 | 11711466.60880 | 11711466.60880 | 11711466.60880 | |
RLS | 11711466.60880 | 11711466.60880 | 11711466.60880 | 11711466.60880 | |
OHR | 7.16525 | 7.16525 | 7.16525 | 7.16525 | |
DC-HR | 0.15638 | 0.14910 | 0.14287 | 0.14240 | |
UHR | 0.15626 | 0.14878 | 0.14190 | 0.14109 | |
OLS | 1502.46026 | 1502.46026 | 1502.46026 | 1502.46026 | |
RLS | 1502.46026 | 1502.46026 | 1502.46026 | 1502.46026 | |
OHR | 3.23099 | 3.23099 | 3.23099 | 3.23099 | |
DC-HR | 0.63014 | 0.61556 | 0.60233 | 0.60258 | |
UHR | 0.63009 | 0.61492 | 0.60000 | 0.60016 | |
heterogeneous | |||||
b | 100 | 200 | 500 | 1000 | |
OLS | 29518873.58901 | 29518873.58901 | 29518873.58901 | 29518873.58901 | |
RLS | 29518873.58901 | 29518873.58901 | 29518873.58901 | 29518873.58901 | |
OHR | 102.19791 | 102.19791 | 102.19791 | 102.19791 | |
DC-HR | 0.53841 | 0.42683 | 0.38114 | 0.37118 | |
UHR | 0.53882 | 0.42715 | 0.38089 | 0.37050 | |
OLS | 2651.78745 | 2651.78745 | 2651.78745 | 2651.78745 | |
RLS | 2651.78745 | 2651.78745 | 2651.78745 | 2651.78745 | |
OHR | 13.48746 | 13.48746 | 13.48746 | 13.48746 | |
DC-HR | 1.16842 | 1.04156 | 0.98940 | 0.97666 | |
UHR | 1.16899 | 1.04247 | 0.98857 | 0.97579 |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case.
Table 6 to Table 10 show the results of Experiment 2, where the total sample size $N$ varies from $10^{3}$ to $10^{6}$ while the subset size is kept fixed at $n=100$. We can draw the following conclusions from these tables.
(1) As $N$ increases, all the algorithms perform better. However, when $N$ is small and the random error is heavily contaminated (such as cases 4 and 5 with heterogeneous error), the estimation error of OLS is much larger than that of the Huber algorithms. UHR also performs rather well, which indicates that our algorithm suits not only big data streams but also normal-scale datasets.
(2) For cases 1-4, OHR has the smallest estimation error. Although UHR does not outperform OHR, it differs from OHR only slightly even when $N$ is small, indicating that UHR achieves good asymptotic equivalence with OHR.
(3) For case 5, the estimation error of UHR is much smaller than that of OHR. This points to the same conclusion as Experiment 1: smaller subsets may have better estimation efficiency than the entire dataset for the Cauchy distribution.
(4) UHR performs better than DC-HR in most cases, showing the superiority of the online updating algorithm.
(5) Lastly, in its settings Experiment 2 is close to a big data stream: each subset has a small size of 100 samples, while the subset number increases from 10 to 10000, just like the process of new subsets arriving in a big data stream. We find that the estimation error of UHR decreases as the subset number increases and remains similar to that of OHR, which implies that UHR is applicable to big data streams.
case 1: N(0,1) | |||||
---|---|---|---|---|---|
homogeneous | |||||
OLS | 41.65896 | 3.88942 | 0.39393 | 0.04122 | |
RLS | 41.65896 | 3.88942 | 0.39393 | 0.04122 | |
OHR | 44.27814 | 4.07750 | 0.41684 | 0.04325 | |
DC-HR | 46.34740 | 4.35515 | 0.44800 | 0.04593 | |
UHR | 44.27868 | 4.09200 | 0.42152 | 0.04357 | |
OLS | 10.40966 | 3.18110 | 1.00481 | 0.32565 | |
RLS | 10.40966 | 3.18110 | 1.00481 | 0.32565 | |
OHR | 10.74373 | 3.25098 | 1.03314 | 0.33423 | |
DC-HR | 10.94708 | 3.35412 | 1.06948 | 0.34447 | |
UHR | 10.7334727* | 3.2551126* | 1.0428132* | 0.3357810* | |
heterogeneous | |||||
OLS | 227.80014 | 23.45107 | 2.41744 | 0.24578 | |
RLS | 227.80014 | 23.45107 | 2.41744 | 0.24578 | |
OHR | 146.58109 | 14.57030 | 1.38268 | 0.13903 | |
DC-HR | 153.32625* | 15.35639* | 1.44728 | 0.14974 | |
UHR | 153.78150 | 15.44100 | 1.44311* | 0.14645* | |
OLS | 24.24029 | 7.76669 | 2.49384 | 0.79640 | |
RLS | 24.24029 | 7.76669 | 2.49384 | 0.79640 | |
OHR | 19.38650 | 6.11912 | 1.87599 | 0.60189 | |
DC-HR | 19.80401* | 6.27307* | 1.91918 | 0.61996 | |
UHR | 19.84922 | 6.31367 | 1.91440* | 0.61727* |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case; an asterisk (*) marks the better of UHR and DC-HR.
case 2: | |||||
---|---|---|---|---|---|
homogeneous | |||||
OLS | 117.55450 | 11.13197 | 1.16896 | 0.11885 | |
RLS | 117.55450 | 11.13197 | 1.16896 | 0.11885 | |
OHR | 65.99914 | 6.21516 | 0.61343 | 0.06468 | |
DC-HR | 70.83631 | 6.79250 | 0.66002 | 0.07046 | |
UHR | 67.51753 | 6.35419 | 0.62871 | 0.06651 | |
OLS | 17.16898 | 5.29056 | 1.73030 | 0.54831 | |
RLS | 17.16898 | 5.29056 | 1.73030 | 0.54831 | |
OHR | 13.04886 | 3.96196 | 1.25701 | 0.40403 | |
DC-HR | 13.47303 | 4.12096 | 1.30451 | 0.41925 | |
UHR | 13.16859* | 4.00056* | 1.27431* | 0.40905* | |
heterogeneous | |||||
OLS | 713.28555 | 72.67236 | 7.44372 | 0.74695 | |
RLS | 713.28555 | 72.67236 | 7.44372 | 0.74695 | |
OHR | 210.04680 | 21.71120 | 1.38268 | 0.19960 | |
DC-HR | 223.54581* | 23.37703 | 2.21219* | 0.21934 | |
UHR | 225.35099 | 23.18896* | 2.23103 | 0.21598* | |
OLS | 41.74305 | 13.58001 | 4.35484 | 1.37955 | |
RLS | 41.74305 | 13.58001 | 4.35484 | 1.37955 | |
OHR | 22.96835 | 7.49359 | 1.87599 | 0.71031 | |
DC-HR | 23.71750* | 7.79915 | 2.38534* | 0.74458 | |
UHR | 23.79698 | 7.72857* | 2.39061 | 0.73697* |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case; an asterisk (*) marks the better of UHR and DC-HR.
case 3: with outliers | |||||
---|---|---|---|---|---|
homogeneous | |||||
OLS | 2507.30950 | 246.33631 | 25.46565 | 2.49242 | |
RLS | 2507.30950 | 246.33631 | 25.46565 | 2.49242 | |
OHR | 122.73570 | 11.76220 | 1.12463 | 0.12271 | |
DC-HR | 147.05309 | 14.42111 | 1.37263 | 0.15004 | |
UHR | 140.65747 | 13.72919 | 1.30735 | 0.14124 | |
OLS | 79.30791 | 25.08737 | 8.04774 | 2.54406 | |
RLS | 79.30791 | 25.08737 | 8.04774 | 2.54406 | |
OHR | 17.63695 | 5.45740 | 1.67765 | 0.55517 | |
DC-HR | 19.35555 | 6.00961 | 1.86798 | 0.61847 | |
UHR | 18.83415* | 5.84983* | 1.81094* | 0.59913* | |
heterogeneous | |||||
OLS | 3160.12456 | 304.54568 | 30.63899 | 3.18120 | |
RLS | 3160.12456 | 304.54568 | 30.63899 | 3.18120 | |
OHR | 378.19492 | 37.93189 | 3.64422 | 0.36930 | |
DC-HR | 449.14772 | 45.55376 | 4.46745 | 0.44603 | |
UHR | 443.33399* | 44.43532* | 4.36829* | 0.43609* | |
OLS | 89.66953 | 28.08533 | 8.77479 | 2.87060 | |
RLS | 89.66953 | 28.08533 | 8.77479 | 2.87060 | |
OHR | 30.89182 | 9.82228 | 3.03822 | 0.96724 | |
DC-HR | 33.87220 | 10.77172 | 3.37132 | 1.06330 | |
UHR | 33.60676* | 10.62887* | 3.34340* | 1.05017* |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case; an asterisk (*) marks the better of UHR and DC-HR.
case 4: 0.85×N(0,1)+0.15×N(0,8) | |||||
---|---|---|---|---|---|
homogeneous | |||||
OLS | 83.55944 | 8.00806 | 0.77406 | 0.07771 | |
RLS | 83.55944 | 8.00806 | 0.77406 | 0.07771 | |
OHR | 56.64867 | 5.47834 | 0.55143 | 0.05530 | |
DC-HR | 59.51882 | 5.80763 | 0.58022 | 0.05906 | |
UHR | 57.78717* | 5.56164* | 0.56295* | 0.05617* | |
OLS | 14.42020 | 4.52082 | 1.38455 | 0.44224 | |
RLS | 14.42020 | 4.52082 | 1.38455 | 0.44224 | |
OHR | 11.89712 | 3.78035 | 1.17328 | 0.37259 | |
DC-HR | 12.18530 | 3.89420 | 1.19697 | 0.38471 | |
UHR | 11.97547* | 3.82015* | 1.18172* | 0.37528* | |
heterogeneous | |||||
OLS | 472.41924 | 51.87671 | 14382.78626 | 0.47451 | |
RLS | 472.41924 | 51.87671 | 14382.78626 | 0.47451 | |
OHR | 194.14143 | 19.97540 | 1.94836 | 0.19964 | |
DC-HR | 208.04891 | 21.31645 | 2.04540* | 0.20936* | |
UHR | 206.90744* | 21.03077* | 2.06516 | 0.21083 | |
OLS | 34.72670 | 11.48330 | 239.77327 | 1.11088 | |
RLS | 34.72670 | 11.48330 | 239.77327 | 1.11088 | |
OHR | 22.14798 | 7.18230 | 2.24135 | 0.71519 | |
DC-HR | 22.89192 | 7.35959 | 2.30214* | 0.73328* | |
UHR | 22.88671* | 7.30928* | 2.30965 | 0.73398 |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case; an asterisk (*) marks the better of UHR and DC-HR.
case 5: Cauchy(0,1) | |||||
---|---|---|---|---|---|
homogeneous | |||||
OLS | 13031861.67103 | 37026406.50190 | 6041016.96141 | 11711466.60880 | |
RLS | 13031861.67103 | 37026406.50190 | 6041016.96141 | 11711466.60880 | |
OHR | 146.07957 | 16.91701 | 3.81870 | 7.16525 | |
DC-HR | 169.68684 | 17.13059 | 1.69887 | 0.16616 | |
UHR | 162.31759* | 16.29104* | 1.61969* | 0.15643* | |
OLS | 1431.93975 | 1881.11733 | 1147.11445 | 1502.46026 | |
RLS | 1431.93975 | 1881.11733 | 1147.11445 | 1502.46026 | |
OHR | 19.26524 | 6.32966 | 2.65689 | 3.23099 | |
DC-HR | 20.69384 | 6.62144 | 2.08405 | 0.65167 | |
UHR | 20.25285* | 6.45609* | 2.03275* | 0.63216* | |
heterogeneous | |||||
OLS | 47087625.58229 | 123184496.07168 | 92413723.44581 | 29518873.58901 | |
RLS | 47087625.58229 | 123184496.07168 | 92413723.44581 | 29518873.58901 | |
OHR | 362.02026 | 3340.24291 | 40.58347 | 102.19791 | |
DC-HR | 418.52258 | 46.02272 | 4.17859 | 0.42415 | |
UHR | 413.20117* | 45.54626* | 4.13358* | 0.41610* | |
OLS | 2705.91096 | 3340.24291 | 2526.77046 | 2651.78745 | |
RLS | 2705.91096 | 3340.24291 | 2526.77046 | 2651.78745 | |
OHR | 30.23063 | 11.18484 | 7.40727 | 13.48746 | |
DC-HR | 32.69547 | 10.84272 | 3.26736 | 1.05280 | |
UHR | 32.45720* | 10.82107* | 3.23621* | 1.03787* |
Note: MSE and MAE are reported in scaled units; the values shown in bold are the optimal indexes in each case; an asterisk (*) marks the better of UHR and DC-HR.
4.1.3 Calculation Efficiency
For calculation efficiency, we mainly focus on Experiments 1, 3 and 4. Experiment 1 investigates the influence of the subset number on calculation efficiency when the total sample size is fixed. Experiment 3 focuses on the influence of the total sample size when the subset size is fixed and relatively large. Experiment 4 is similar to Experiment 3, but its fixed subset size is smaller, aiming to compare the behavior under different subset-size scales.
For Experiment 1, we obtain the average running time for each random error case according to Eq.(20). Before analyzing the results, it should be mentioned that OLS and OHR, which use the entire dataset, should show no difference in running time across the subset-number settings, since the total sample size is fixed. However, in execution we run OLS and OHR four times under the different subset-number settings, and random effects on the computer lead to small differences in their recorded running times. Details are shown in Table 11 to Table 14. Based on these results, we can make the following conclusions:
(1) Our UHR runs faster than OHR, DC-HR and RLS in most cases. As the subset number increases, the running time of RLS jumps up. Although the running time of UHR increases as well, the trend is much slower than that of RLS, indicating the better property of UHR. Details can be seen in Figure 3; as the results of the six cases are similar, we only visualize case 6.
(2) Overall, the Linear Regression algorithms have better calculation efficiency than the Huber algorithms. However, they are not robust to contaminated random errors according to Experiment 1. As a result, we still prefer the UHR algorithm to Linear Regression.
(3) When the subset number is rather small, UHR has better calculation efficiency than OHR. However, this may no longer hold when the subset number increases to 1000. A possible reason is that as the subset number increases, the number of loop iterations increases as well, and the accumulated running time leads to a longer total time. To alleviate this, we can artificially set the subset number to a smaller value, since it has little effect on estimation efficiency. In big data streams, however, we tend to prefer a rather small subset size, so setting a small subset number for higher calculation efficiency may not be feasible. Even so, as the entire dataset cannot be obtained in the big data streams scenario, OHR is no longer applicable and UHR is the only feasible method, so a slightly longer running time of UHR is tolerable.
case 1 | case 2 | case 3 | case 4 | case 5 | case 6 | |
homogeneous | ||||||
OLS | 0.98250 | 0.98310 | 0.95790 | 0.98050 | 0.97960 | 0.97840 |
RLS | 1.96290 | 1.94380 | 1.95850 | 1.97450 | 1.96220 | 1.95190 |
OHR | 2.02670 | 2.28570 | 2.37480 | 2.22820 | 2.29830 | 2.28450 |
DC-HR | 1.77900 | 1.86630 | 1.87670 | 1.83970 | 1.77620 | 1.83630 |
UHR | 1.75810 | 1.84130 | 1.83750 | 1.81400 | 1.85240 | 1.82630 |
heterogeneous | ||||||
OLS | 0.97090 | 0.97520 | 0.97280 | 0.97380 | 0.97270 | 0.99380 |
RLS | 1.96890 | 2.04620 | 1.97240 | 1.95330 | 1.96670 | 1.95050 |
OHR | 2.52180 | 2.57020 | 2.45510 | 2.52730 | 4.55150 | 2.32000 |
DC-HR | 2.02770 | 1.97980 | 2.00170 | 2.07250 | 2.49510 | 1.96030 |
UHR | 2.01890 | 2.03020 | 1.93080 | 2.04700 | 2.46830 | 1.90100 |
Note: the values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.
case 1 | case 2 | case 3 | case 4 | case 5 | case 6 | |
homogeneous | ||||||
OLS | 0.97720 | 0.98290 | 1.01140 | 0.98460 | 0.98500 | 0.97810 |
RLS | 1.67800 | 1.65770 | 1.65050 | 1.65270 | 1.64270 | 1.76660 |
OHR | 2.10430 | 2.40010 | 2.46390 | 2.32010 | 2.42110 | 2.36800 |
DC-HR | 2.36680 | 2.43230 | 2.42160 | 2.41890 | 2.42460 | 2.42040 |
UHR | 2.34550 | 2.40360 | 2.38780 | 2.39220 | 2.37850 | 2.40130 |
heterogeneous | ||||||
OLS | 1.02260 | 0.98390 | 0.97880 | 1.01350 | 0.97970 | 0.98880 |
RLS | 1.75910 | 1.61520 | 1.77020 | 1.69830 | 1.61170 | 1.67900 |
OHR | 2.61610 | 2.68210 | 2.54390 | 2.64820 | 4.81750 | 2.41370 |
DC-HR | 2.62560 | 2.63130 | 2.53630 | 2.64060 | 2.97750 | 2.50600 |
UHR | 2.59650 | 2.57810 | 2.50440 | 2.59220 | 2.91410 | 2.47980 |
Note: the values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.
case 1 | case 2 | case 3 | case 4 | case 5 | case 6 | |
homogeneous | ||||||
OLS | 1.00940 | 0.99870 | 1.01550 | 1.02750 | 1.08760 | 1.01830 |
RLS | 3.42860 | 3.29650 | 3.14050 | 3.29470 | 3.16500 | 3.16210 |
OHR | 2.01170 | 2.39080 | 2.30080 | 2.33080 | 2.35250 | 2.26460 |
DC-HR | 2.34690 | 2.53610 | 2.38140 | 2.44700 | 2.52070 | 2.41760 |
UHR | 2.11160 | 2.19880 | 2.17990 | 2.13890 | 2.19090 | 2.15200 |
heterogeneous | ||||||
OLS | 1.02390 | 0.98440 | 1.09730 | 1.01810 | 1.01980 | 1.01280 |
RLS | 3.54000 | 3.19630 | 3.15630 | 3.24660 | 3.16610 | 3.26170 |
OHR | 2.47570 | 2.59900 | 2.36000 | 2.62570 | 4.64180 | 2.31450 |
DC-HR | 2.71570 | 2.92090 | 2.62800 | 2.97190 | 3.29770 | 2.67450 |
UHR | 2.50870 | 2.66040 | 2.35530 | 2.59340 | 2.87450 | 2.27890 |
Note: the values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.
case 1 | case 2 | case 3 | case 4 | case 5 | case 6 | |
homogeneous | ||||||
OLS | 1.00480 | 0.99670 | 0.99650 | 0.99950 | 1.00350 | 1.00390 |
RLS | 6.69460 | 6.77830 | 6.39620 | 4.47510 | 4.64710 | 4.46410 |
OHR | 2.30300 | 2.60550 | 2.77850 | 2.53540 | 2.62030 | 2.61390 |
DC-HR | 2.73700 | 2.88140 | 2.83060 | 2.94160 | 2.70820 | 2.89440 |
UHR | 3.06820 | 3.20160 | 3.01010 | 2.72710 | 2.86340 | 2.83450 |
heterogeneous | ||||||
OLS | 1.02390 | 0.98440 | 1.09730 | 1.01810 | 1.01980 | 1.01280 |
RLS | 3.54000 | 3.19630 | 3.15630 | 3.24660 | 3.16610 | 3.26170 |
OHR | 2.47570 | 2.59900 | 2.36000 | 2.62570 | 4.64180 | 2.31450 |
DC-HR | 2.71570 | 2.92090 | 2.62800 | 2.97190 | 3.29770 | 2.67450 |
UHR | 2.50870 | 2.66040 | 2.35530 | 2.59340 | 2.87450 | 2.27890 |
Note: the values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.
For Experiment 3 and Experiment 4, we divide the results into homogeneous and heterogeneous cases and average the running time over the six random error cases. The relationships between sample size and running time are presented in Figure 4 and Figure 5. We can draw the following conclusions.
(1) Overall, the Linear Regression algorithms run faster than the Huber algorithms, matching the earlier results. Nevertheless, estimation accuracy is the priority, so we still prefer the Huber algorithms despite the better calculation efficiency of Linear Regression.
(2) Among the Huber algorithms, UHR outperforms OHR in calculation efficiency in most cases. According to the trend of the curves, the running time of OHR climbs faster than that of UHR as the total sample size increases. Moreover, the computation complexities of UHR and OHR obtained before are $O(TNnp)$ and $O(TN^{2}p)$, respectively. This is consistent with Figure 4, where the curves of UHR and OHR are close to linear and quadratic functions of $N$, which verifies from the simulation perspective that UHR has lower computation complexity than OHR.
Unlike Experiment 3, Experiment 4 uses a smaller fixed subset size, so Figure 5 shows a distinct pattern from Figure 4. For smaller total sample sizes, UHR in Figure 5 still runs faster than OHR, presenting the same outcome as Figure 4. However, UHR gradually shows a disadvantage in calculation as the total sample size grows further. This result is not owing to the increase in the total sample size itself: in Experiment 3, UHR runs faster than OHR over the same range of sample sizes. Therefore, we may attribute the worse performance in Experiment 4 to the other varying quantity, the subset number, which is larger here because the subset size is smaller. This is similar to the observations in Experiment 1 and may result from the accumulated running time of the increased number of loop iterations. A possible solution is to use parallel or distributed calculation on multiple machines inside the algorithm to cut down the total running time.
4.2 Real Data Analysis
In the real data analysis, we apply the UHR, OLS, RLS, OHR and DC-HR algorithms used in the simulations, together with the DC-AHR algorithm, to perform the regression. DC-AHR is a combination of naïve-dc and Adaptive Huber regression (Sun et al., 2020; Wang et al., 2021). By comparing the results, we want to show that our algorithm also performs well in real applications.
Our real dataset is the airline dataset from the 2009 ASA Data Expo (http://statcomputing.org/dataexpo/2009/the-data.html). It contains departure and landing information on about 12 million flights in America between 1978 and 2008. We choose the data of 2007 as the research object, which has 7,453,215 records, about 0.654 GB in total. After removing canceled flights and records with missing values, 7,275,288 samples remain. We use the first 3,000,000 samples as the training set and the rest as the test set. The subsets are taken with a rather small size to better imitate the arrival process of big data streams.
The model investigated is a four-covariate linear regression model proposed by Schifano et al. (2016), which is widely used in big data regression studies (Jiang et al., 2021; Wang et al., 2016; Jiang et al., 2018):
$$R_{i}=\beta_{0}+\beta_{1}\,\mathrm{DepHour}_{i}+\beta_{2}\,\mathrm{Kdis}_{i}+\beta_{3}\,\mathrm{night}_{i}+\beta_{4}\,\mathrm{weekend}_{i}+\varepsilon_{i}. \qquad (21)$$
In Eq.(21), $R$ is a continuous variable obtained by transforming the raw arrival-delay variable. DepHour stands for the hour of departure, a discrete variable ranging from 1 to 24. Kdis is the flight distance in thousands of miles. night and weekend are both dummy variables: night equals 1 if the flight took off between 8 pm and 5 am of the next day and 0 otherwise, and weekend equals 1 for take-off on weekends and 0 on weekdays.
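A hedged R sketch of how these variables can be constructed from the 2007 Data Expo file is given below; the column names (DepTime, Distance, DayOfWeek, ArrDelay, Cancelled) follow the ASA documentation, and the response transformation shown is only a placeholder since the exact formula used in the paper is not reproduced here.

```r
# Hedged sketch of variable construction for the airline data; file name,
# filtering rules and the response transformation are assumptions.
flights <- read.csv("2007.csv")
flights <- subset(flights, Cancelled == 0 & !is.na(ArrDelay) & !is.na(DepTime))

DepHour <- floor(flights$DepTime / 100)                   # hour of departure
Kdis    <- flights$Distance / 1000                        # distance in thousands of miles
night   <- as.integer(DepHour >= 20 | DepHour <= 5)       # took off between 8 pm and 5 am
weekend <- as.integer(flights$DayOfWeek %in% c(6, 7))     # Saturday or Sunday
R       <- log(flights$ArrDelay - min(flights$ArrDelay) + 1)  # placeholder transformation

train <- data.frame(R, DepHour, Kdis, night, weekend)[1:3e6, ]  # first 3,000,000 samples
```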
We choose three evaluation indexes. The first is the vector of regression coefficients: by comparing the coefficients and their significance, the differences among the six algorithms can be shown. It should be mentioned that in the significance tests we use the bootstrap to estimate the standard errors and then calculate the $t$-values, because for the distributed algorithms the $t$-values cannot be obtained directly as for the Oracle ones. The second index is regression error, including the out-of-sample MSPE and MAPE, whose expressions are given in Eq.(22) and Eq.(23). The last index is running time, measuring calculation efficiency.
$$\mathrm{MSPE}=\frac{1}{n_{\mathrm{test}}}\sum_{i\in\mathrm{test}}\left(y_{i}-\boldsymbol{x}_{i}^{\top}\hat{\boldsymbol{\beta}}\right)^{2}, \qquad (22)$$
$$\mathrm{MAPE}=\frac{1}{n_{\mathrm{test}}}\sum_{i\in\mathrm{test}}\left|y_{i}-\boldsymbol{x}_{i}^{\top}\hat{\boldsymbol{\beta}}\right|. \qquad (23)$$
The outcomes are presented in Table 15. First, there is little difference between the coefficients of the algorithms, and the $t$-values are all significant, showing that all methods are interpretable under this model. In estimation efficiency, the estimation error of UHR is the smallest and also rather close to that of Oracle Huber, which again supports its asymptotic equivalence with the Oracle algorithm. Meanwhile, its estimation error is smaller than that of DC-HR and DC-AHR as well, attesting to the better performance of Online Updating compared with divide-and-conquer. As for calculation efficiency, the Huber algorithms run slower than the Linear Regression algorithms; however, according to MSPE and MAPE, the Huber algorithms reduce the estimation error substantially compared with Linear Regression, which is why we prefer them. Moreover, UHR stands out among the Huber algorithms, running about 5 seconds faster than Oracle Huber Regression, which achieves our goal of speeding up the calculation process. Overall, UHR performs impressively in the real data analysis.
| (Intercept) | DepHour | Kdis | night | weekend | MSPE | MAPE | time |
OLS | 5.72482 | 0.00356 | -0.00194 | 0.02196 | -0.00754 | 0.00905 | 0.0882 | 1.62 |
(30937.344) | (283.338) | (-19.550) | (121.83) | (-59.634) | ||||
RLS | 5.72482 | 0.00356 | -0.00194 | 0.02196 | -0.00754 | 0.00905 | 0.0882 | 0.695 |
(30455.939) | (247.18) | (-18.169) | (89.745) | (-61.248) | ||||
OHR | 5.72547 | 0.00226 | -0.00147 | 0.00799 | -0.00549 | 0.00832 | 0.08129 | 13.168 |
(50804.848) | (295.477) | (-24.236) | (72.776) | (-71.197) | ||||
DC-HR | 5.72599 | 0.00232 | -0.00131 | 0.00844 | -0.00557 | 0.00831 | 0.08113 | 8.273 |
(54921.000) | (301.501) | (-18.600) | (61.269) | (-75.822) | ||||
DC-AHR | 5.72582 | 0.00228 | -0.00118 | 0.00826 | -0.00549 | 0.00829 | 0.08096 | 134.377 |
UHR | 5.7256 | 0.00231 | -0.00146 | 0.00886 | -0.00546 | 0.00829 | 0.08092 | 8.653 |
(55172.597) | (289.168) | (-21.160) | (67.636) | (-70.772) |
Note: the values shown in bold are the optimal model indexes in each case; the value in parentheses is the $t$-value; the $t$-value of DC-AHR is not reported because it takes too long to calculate.
5 Conclusions and Discussions
In order to perform statistical analysis on big data streams and solve the outlier problem simultaneously, this paper proposes an Online Updating Huber Regression algorithm. By combining the online updating method with Huber regression, UHR achieves both big data stream regression, through continuously updating summary statistics of the historical data, and robust estimation on contaminated datasets. As shown by the theoretical and simulation results, UHR is asymptotically equivalent to the Oracle estimator using the entire dataset and has lower computation complexity. After constructing the algorithm, we apply it to simulations and a real data analysis. In the simulations, UHR performs outstandingly when the random error is heavy-tailed or contains a large number of outliers. It also has good calculation efficiency compared with the Oracle one, especially when the subset number is relatively small. In the real data analysis, UHR achieves good regression results on the airline dataset, with the smallest estimation error, a much faster calculation speed than the Oracle algorithm and significant regression coefficients, attesting to its feasibility in real applications.
However, some issues need further investigation in the future. First, we assume that the true value of the unknown parameter does not change during the data-generating process. This assumption ignores the concept drift issue in big data streams, which may not hold in real applications; thus, more complicated streaming data models need to be investigated. Second, within the concept of big data, we only consider data with high volume but not high dimension. In future work, by combining the method with penalty functions, we can investigate high-dimensional regularized online updating problems. Lastly, the proposed algorithm is designed for independent data, which is a rather simple assumption; we can further focus on solutions for dependent or non-stationary big data streams in the future.
References
- Allison (1995) P. D. Allison. The impact of random predictors on comparisons of coefficients between models: Comment on clogg, petkova, and haritou. American Journal of Sociology, 100(5):1294–1305, 1995.
- Koenker and Bassett (1978) Roger Koenker and Gilbert Bassett, Jr. Regression quantiles. Econometrica, 46(1):33–50, 1978.
- Battey et al. (2018) Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3), June 2018. ISSN 0090-5364. doi: 10.1214/17-AOS1587.
- Chen and Zhou (2020) Lanjue Chen and Yong Zhou. Quantile regression in big data: A divide and conquer based strategy. Computational Statistics & Data Analysis, 144:106892, April 2020. ISSN 01679473. doi: 10.1016/j.csda.2019.106892.
- Chen et al. (2019) Xi Chen, Weidong Liu, and Yichen Zhang. Quantile regression under memory constraint. The Annals of Statistics, 47(6), December 2019. ISSN 0090-5364. doi: 10.1214/18-AOS1777.
- Chen and Xie (2014) Xueying Chen and Min-ge Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 2014. ISSN 10170405. doi: 10.5705/ss.2013.088.
- Clogg et al. (1995) Clifford C. Clogg, Eva Petkova, and Adamantios Haritou. Statistical methods for comparing regression coefficients between models. American Journal of Sociology, 1995.
- Liang et al. (2013) Faming Liang, Yichen Cheng, Qifan Song, Jincheol Park, and Ping Yang. A resampling-based stochastic approximation method for analysis of large geostatistical data. Journal of the American Statistical Association, 2013.
- Yang and Yao (2021) Ying Yang and Fang Yao. Online estimation for functional data. Journal of the American Statistical Association, pages 1–15, November 2021. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2021.2002158.
- Hu et al. (2021) Aijun Hu, Yuling Jiao, Yanyan Liu, Yueyong Shi, and Yuanshan Wu. Distributed quantile regression for massive heterogeneous data. Neurocomputing, 448:249–262, August 2021. ISSN 09252312. doi: 10.1016/j.neucom.2021.03.041.
- Huber (1964) Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
- Huber (1981) Peter J. Huber. Robust Statistics. Wiley Series in Probability and Statistics. Wiley, New York, 1981.
- Jiang et al. (2018) Rong Jiang, Xueping Hu, Keming Yu, and Weimin Qian. Composite quantile regression for massive datasets. Statistics, 52(5):980–1004, September 2018. ISSN 0233-1888, 1029-4910. doi: 10.1080/02331888.2018.1500579.
- Jiang et al. (2021) Rong Jiang, Wei-wei Chen, and Xin Liu. Adaptive quantile regressions for massive datasets. Statistical Papers, 62(4):1981–1995, August 2021. ISSN 0932-5026, 1613-9798. doi: 10.1007/s00362-020-01170-8.
- Jin et al. (2015) Xiaolong Jin, Benjamin W. Wah, Xueqi Cheng, and Yuanzhuo Wang. Significance and Challenges of Big Data Research. Big Data Research, 2(2):59–64, June 2015. ISSN 22145796. doi: 10.1016/j.bdr.2015.01.006.
- Jordan et al. (2019) Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-Efficient Distributed Statistical Inference. Journal of the American Statistical Association, 114(526):668–681, April 2019. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2018.1429274.
- Kleiner et al. (2014) Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816, September 2014. ISSN 13697412. doi: 10.1111/rssb.12050.
- Lee et al. (2015) J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor. Communication-efficient sparse regression: a one-shot approach. Computer Science, 2015.
- Lee et al. (2020) JooChul Lee, HaiYing Wang, and Elizabeth D. Schifano. Online updating method to correct for measurement error in big data streams. Computational Statistics & Data Analysis, 149:106976, September 2020. ISSN 01679473. doi: 10.1016/j.csda.2020.106976.
- Lian and Fan (2018) H. Lian and Z. Fan. Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. Journal of Machine Learning Research, 18:1–26, 2018.
- Liang et al. (2016) Faming Liang, Jinsu Kim, and Qifan Song. A bootstrap Metropolis–Hastings algorithm for Bayesian analysis of big data. Technometrics, 2016.
- Luo et al. (2022) Jiyu Luo, Qiang Sun, and Wen-Xin Zhou. Distributed adaptive Huber regression. Computational Statistics & Data Analysis, 169:107419, May 2022. ISSN 01679473. doi: 10.1016/j.csda.2021.107419.
- Luo and Song (2020) Lan Luo and Peter X.-K. Song. Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):69–97, February 2020. ISSN 13697412. doi: 10.1111/rssb.12352.
- Lin and Xi (2011) Nan Lin and Ruibin Xi. Aggregated estimating equation estimation. Statistics and Its Interface, 4(1):73–83, 2011.
- Ma and Sun (2015) Ping Ma and Xiaoxiao Sun. Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics, 7(1), 2015.
- Schifano et al. (2016) Elizabeth D. Schifano, Jing Wu, Chun Wang, Jun Yan, and Ming-Hui Chen. Online Updating of Statistical Inference in the Big Data Setting. Technometrics, 58(3):393–403, July 2016. ISSN 0040-1706, 1537-2723. doi: 10.1080/00401706.2016.1142900.
- Sun et al. (2020) Qiang Sun, Wen-Xin Zhou, and Jianqing Fan. Adaptive Huber Regression. Journal of the American Statistical Association, 115(529):254–265, January 2020. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2018.1543124.
- Wang et al. (2016) Chun Wang, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, and Jun Yan. Statistical Methods and Computing for Big Data. Statistics and Its Interface, 9(4):399–414, 2016. ISSN 19387989, 19387997. doi: 10.4310/SII.2016.v9.n4.a1.
- Wang et al. (2018) Chun Wang, Ming-Hui Chen, Jing Wu, Jun Yan, Yuping Zhang, and Elizabeth Schifano. Online updating method with new variables for big data streams. Canadian Journal of Statistics, 46(1):123–146, March 2018. ISSN 03195724. doi: 10.1002/cjs.11330.
- Wang et al. (2022) Kangning Wang, Hongwei Wang, and Shaomin Li. Renewable quantile regression for streaming datasets. Knowledge-Based Systems, 235:107675, January 2022. ISSN 09507051. doi: 10.1016/j.knosys.2021.107675.
- Wang et al. (2021) Lili Wang, Chao Zheng, Wen Zhou, and Wen-Xin Zhou. A New Principle for Tuning-Free Huber Regression. Statistica Sinica, 2021. ISSN 10170405. doi: 10.5705/ss.202019.0045.
- Wu et al. (2021) Jing Wu, Ming-Hui Chen, Elizabeth D. Schifano, and Jun Yan. Online Updating of Survival Analysis. Journal of Computational and Graphical Statistics, 30(4):1209–1223, October 2021. ISSN 1061-8600, 1537-2715. doi: 10.1080/10618600.2020.1870481.
- Xue and Hu (2021) Yishu Xue and Guanyu Hu. Online updating of information based model selection in the big data setting. Communications in Statistics - Simulation and Computation, 50(11):3516–3529, November 2021. ISSN 0361-0918, 1532-4141. doi: 10.1080/03610918.2019.1626886.
- Zhang et al. (2013) Y. Zhang, J. C. Duchi, and M. J. Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 30(1):592–617, 2013.
- Zhang et al. (2012) Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6792–6792, Maui, HI, USA, December 2012. IEEE. ISBN 978-1-4673-2066-5 978-1-4673-2065-8 978-1-4673-2063-4 978-1-4673-2064-1. doi: 10.1109/CDC.2012.6426691.
- Zhao et al. (2016) T. Zhao, G. Cheng, and H. Liu. A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4):1400–1437, 2016.
- Zhu et al. (2021) Xuening Zhu, Feng Li, and Hansheng Wang. Least-Square Approximation for a Distributed System. Journal of Computational and Graphical Statistics, 30(4):1004–1018, October 2021. ISSN 1061-8600, 1537-2715. doi: 10.1080/10618600.2021.1923517.