
Online Updating Huber Robust Regression for Big Data Streams

Chunbai Tao and Shanshan Wang
School of Economics and Management, Beihang University
Abstract

Big data streams are attracting increasing attention with the development of modern science and information technology. Because limited computer memory cannot accommodate the high volume of streaming data, real-time methods that require no storage of historical data are worth investigating. Moreover, outliers may occur as high-velocity data streams are generated, calling for more robust analysis. Motivated by these concerns, a novel Online Updating Huber Robust Regression algorithm is proposed in this paper. By extracting key features from each newly arriving data subset, it yields a computationally efficient online updating estimator without storing historical data. Meanwhile, by integrating Huber regression into the framework, the estimator is robust to contaminated data streams, such as heavy-tailed or heterogeneous ones as well as data containing outliers. Furthermore, the proposed online updating estimator is asymptotically equivalent to the Oracle estimator computed from the entire dataset and has lower computation complexity. Extensive numerical simulations and a real data analysis are conducted to evaluate the estimation and calculation efficiency of the proposed method.

Keywords: Online Updating; Huber Regression; Big Data Streams; Divide-and-conquer

1 Introduction

1.1 Background

In recent years, big data has become a significant area of research and garnered extensive attention across various industries, including finance, manufacturing, and astronomy. Compared with traditional datasets, big data is mainly characterized by the "5V" features, i.e., high Volume, high Velocity, high Variety, low Veracity, and high Value (Jin et al., 2015). For big data streams, namely big data generated sequentially in a stream style, the characteristics of high Volume and high Velocity are particularly prominent, posing several challenges for their analysis. One such challenge concerns computer memory limitations. Since big data streams are generated in such large volumes, it is unrealistic to store them completely in computer memory. Consequently, standard statistical analysis methods that assume adequate computer capacity are no longer applicable in this setting. The second challenge pertains to the presence of outliers in big data streams, which can arise from recording errors in the fast data-generating process. While linear regression-based methods are often used for big data analysis, they are highly sensitive to outliers, thereby limiting their effectiveness in this context. As such, it is imperative to develop computationally efficient, memory-conserving, and robust methods for big data stream analysis that can address these challenges.

1.2 Related Works

To address the storage limitation problem in big data analysis, many scholars have designed different methodologies, which can be loosely divided into three categories: subsampling-based approaches, divide-and-conquer approaches, and online updating approaches (Wang et al., 2016). According to Wang et al. (2016), classic subsampling methods include the bag of little bootstraps (Kleiner et al., 2014), leveraging (Ping and Sun, 2015), mean log-likelihood (Faming et al., 2013), and subsample-based MCMC (Liang et al., 2016). Taking the bag of little bootstraps as an example, Kleiner et al. (2014) combined the bootstrap with subsampling, which addressed the high computation cost of the bootstrap in big data settings. As for divide-and-conquer, its simplest variant, also called naïve divide-and-conquer (naïve-dc) or one-shot, has gained much popularity. Its main idea is to divide the complete dataset into several blocks, obtain a local estimator on each block, and compute the global estimator by simple averaging. Naïve-dc has many applications in regression. For example, to solve large-scale empirical risk minimization problems, Zhang et al. (2012) analyzed a communication-efficient average mixture algorithm based on naïve-dc that is robust to the amount of parallelization. Chen and Xie (2014) designed a split-and-conquer penalized regression covering the $L_{1}$ penalty, SCAD, and MCP; their estimator is asymptotically equivalent to the penalized estimator computed from the entire dataset. Furthermore, some research has applied naïve-dc to high-dimensional big datasets. Lee et al. (2015) devised an averaged debiased LASSO estimator that can be computed in a distributed and communication-efficient manner in high-dimensional settings. Battey et al. (2018) focused on hypothesis testing; they applied divide-and-conquer to sparse high-dimensional Rao and Wald tests, mitigating the increase in computation complexity caused by high-dimensional regularization. Non-linear applications also exist, such as ridge regression (Zhang et al., 2013) and SVM (Lian and Fan, 2018). More naïve-dc related literature can be found in Nan and Xi (2011), Zhao et al. (2016), and Zhu et al. (2021). However, naïve-dc requires the complete dataset before analysis in order to determine the appropriate number of blocks. While this issue might not arise for conventional big datasets, it is a crucial challenge for big data streams, since the latter are incessantly generated with significant volume and velocity, making it infeasible to obtain the complete data.

Different from divide-and-conquer, online updating, which targets big data stream analysis, does not require storing historical data. Specifically, it uses key features extracted from newly arriving data batches to continuously update the historical results, and is thus able to achieve computational efficiency with minimal storage. Schifano et al. (2016) were the first to investigate statistical inference in the online updating scenario. Their online updating framework handles predictive residual tests in linear regression as well as the estimating equation setting. However, this algorithm needs to start from scratch if new predictive variables emerge as the data stream is generated. To overcome this disadvantage, Wang et al. (2018) adjusted the online updating framework via bias correction. Through the use of nested models (Allison, 1995; Clogg et al., 1995) for comparing regression coefficients, their method improved estimation efficiency under a variety of practical situations. Lee et al. (2020) extended the aforementioned work by correcting biased predictive variables in big data streams afterward under the linear model framework. However, their method remains biased if the exact point at which the data are adjusted is unknown. Recently, online updating has also been incorporated into other models for big data analysis, such as an online updating GIC for variable selection in generalized linear models (Xue and Hu, 2021), the Cox proportional hazards model in an online updating framework (Wu et al., 2021), and an online nonparametric method for functional data (Fang Yao, 2021). To conclude, online updating is applicable not only to ordinary big datasets but also to big data streams, thereby providing a wider range of applications than naïve-dc. Nevertheless, most existing online updating frameworks are combined with linear regression, which is known to be sensitive to outliers and heavy-tailed distributions. To address this issue, novel online updating frameworks designed to be robust to outliers are urgently needed.

Regarding the outlier problem, one popular solution is quantile regression (Koenker and Bassett Jr., 1978). It models the conditional quantiles of the response variable, thus overcoming the outlier problem and explaining potential heteroscedasticity in the data. Inspired by its good performance, much research has incorporated quantile regression into big data analysis frameworks, especially divide-and-conquer. For example, Chen et al. (2019) used a kernel function to smooth the quantile loss function and applied divide-and-conquer iteratively to obtain a linear quantile regression estimator, relaxing the constraint on the number of machines in naïve-dc. Moreover, Hu et al. (2021) addressed the non-differentiability of quantile regression by developing the CSL (Jordan et al., 2019) method to speed up the big data algorithm. Other work combines quantile regression with subsampling and online updating, such as Chen and Zhou (2020) and Wang et al. (2022). By using online updating, the method of Wang et al. (2022) can be applied to big data streams. In particular, Wang et al. (2022) analyzed quantile regression from the maximum likelihood perspective and incorporated it into an online updating algorithm. Since it requires no operations on the check loss function, it is more convenient than the work of Chen et al. (2019) and Chen and Zhou (2020). Some variants of quantile regression, such as composite quantile regression and adaptive quantile regression, have also been investigated in big data analysis; see Jiang et al. (2018) and Jiang et al. (2021) for details.

1.3 Motivations

While quantile regression has been investigated thoroughly, other members of the M-estimation family, such as Huber regression (Huber, 1964), have received little attention. Huber regression transforms the loss of large residuals from quadratic form into linear form to reduce their weights, thus achieving good robustness to outliers. Therefore, Huber regression presents a feasible alternative to quantile regression for solving outlier-related problems. Regrettably, there has been little research on the use of Huber regression in big data analysis. To date, the only work using the Huber loss in big data regression is Luo's (2022). They embedded the Huber loss function in Jordan's CSL framework (Jordan et al., 2019), adding two data-driven parameters to Huber regression to achieve a balance between statistical optimization and communication efficiency. Nonetheless, no literature focuses on combining Huber regression with online updating for the analysis of big data streams.

To solve the aforementioned storage limitation and outlier problems in big data stream analysis, we propose a novel Online Updating Huber Regression algorithm, which is inspired by the maximum likelihood idea in Wang et al.'s (2022) online updating algorithm. The resulting online updating estimator is renewed using the current data and summary statistics of the historical data, which is computationally efficient. Specifically, by treating the local estimators as samples from multivariate normal distributions and analyzing them from the maximum likelihood perspective, a weighted least squares type objective function is obtained. From this objective function, two key features of each batch can be extracted, which can be updated and easily handled in big data streams. Meanwhile, we prove that the proposed estimator is asymptotically equivalent to the Oracle estimator based on the entire dataset and has lower computation complexity. Numerical experiments on both simulated and real data are included to verify the theoretical results and illustrate the good performance of our new method.

Our study makes two primary contributions. First, we employ online updating instead of the widely used divide-and-conquer method to address the limited computer memory constraint, so that the method can be applied to the analysis of big data streams. Furthermore, by integrating Huber regression into the online updating framework, we broaden the use of online updating to robust regression beyond pure linear regression. The second contribution concerns Huber regression itself. Building on the state of the art in big data analysis with M-estimation loss functions, we adopt Huber regression as an alternative to quantile regression to increase robustness. This not only fills certain research gaps in Huber regression, but also demonstrates the feasibility of generalizing from quantile regression to other loss functions in M-estimation.

The rest of this paper is organized as follows. Section 2 explains the methodology of our Online Updating Huber Regression and derives the main online updating estimator. Section 3 establishes properties of the Online Updating Huber Regression estimator, including asymptotic equivalence and low computation complexity. Section 4 presents simulation and real data analysis results to verify the good performance of our algorithm. Section 5 concludes and discusses future work.

2 Methodology

2.1 Notation and Oracle Estimator

Suppose there are $N_{b}$ independent and identically distributed samples $\{(y_{n},\textbf{x}_{n}),n=1,2,...,N_{b}\}$, where $y\in\mathbb{R}$ is the response variable and $\textbf{x}=(x^{1},x^{2},...,x^{p})^{T}\in\mathbb{R}^{p\times 1}$ is the $p$-dimensional covariate vector. In the big data streams scenario, suppose the $N_{b}$ samples are generated sequentially in the form of $b$ data subsets $\{D_{1},D_{2},...,D_{b}\}$, where $D_{t}=\{(y_{ti},\textbf{x}_{ti}),i=1,2,...,n_{t}\}$ denotes the $t$-th subset. Our aim is to fit the model $y_{ti}=\textbf{x}_{ti}^{T}\bm{\theta}+\epsilon_{ti}$, where $\bm{\theta}=(\theta_{1},\theta_{2},...,\theta_{p})^{T}$ is a $p$-dimensional unknown parameter with true value $\bm{\theta}_{0}=(\theta_{01},\theta_{02},...,\theta_{0p})^{T}$, and $\epsilon_{ti}$ is a random error which may be heavy-tailed, heteroscedastic, or contaminated by outliers. $f(\cdot)$ denotes the probability density function of $\epsilon_{ti}$.

As we want to prove later that our online updating estimator is asymptotically equivalent to the Oracle estimator based on the entire dataset, we first present the Oracle Huber regression estimator. According to Huber (1964), the Oracle Huber regression estimator of $\bm{\theta}_{0}$ is given by:

$$\widehat{\bm{\theta}}_{N_{b}}=\arg\min_{\bm{\theta}}Q_{N_{b}}(\bm{\theta}),\qquad Q_{N_{b}}(\bm{\theta})=\frac{1}{N_{b}}\sum_{t=1}^{b}\sum_{i=1}^{n_{t}}\rho(y_{ti}-\textbf{x}_{ti}^{T}\bm{\theta})$$ (1)

where $\rho(u)=\frac{u^{2}}{2}I(|u|<k)+\{k|u|-\frac{k^{2}}{2}\}I(|u|\geq k)$.
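For illustration, the following is a minimal R sketch of the Huber loss and of its derivative $\psi(u)=\rho^{\prime}(u)$ used later in Section 3; the helper names and the tuning constant $k=1.345$ (a common default, not a value prescribed in this paper) are our own choices.

```r
# Huber loss rho(u): quadratic for |u| < k, linear beyond k
huber_rho <- function(u, k = 1.345) {
  ifelse(abs(u) < k, u^2 / 2, k * abs(u) - k^2 / 2)
}

# Its derivative psi(u) = rho'(u): u clipped to the interval [-k, k]
huber_psi <- function(u, k = 1.345) {
  pmax(-k, pmin(u, k))
}

# Large residuals contribute linearly rather than quadratically
u <- c(-10, -1, 0, 1, 10)
cbind(residual = u, huber = huber_rho(u), squared = u^2 / 2)
```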

Moreover, under the following regularity conditions (Huber, 1964, 1981; Wang et al., 2022):

(N1) For $\lambda(\bm{\theta})=E(\textbf{x}\psi(y-\textbf{x}^{T}\bm{\theta}))$, where $\psi(\cdot)=\rho^{\prime}(\cdot)$, the true value $\bm{\theta}_{0}$ of $\bm{\theta}$ satisfies $\lambda(\bm{\theta}_{0})=0$.

(N2) $E(\psi^{2})<\infty$, $0<E(\psi^{\prime})<\infty$.

(N3) The covariates $\textbf{x}_{ti}$ satisfy $\max_{t=1,...,b,\,i=1,...,n_{t}}\|\textbf{x}_{ti}\|/\sqrt{N_{b}}\rightarrow 0$, where $\|\textbf{x}_{ti}\|=(\textbf{x}_{ti}^{T}\textbf{x}_{ti})^{\frac{1}{2}}$.

(N4) There exists a positive definite matrix $\bm{\Sigma}$ satisfying $\frac{1}{N_{b}}\sum_{t=1}^{b}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\textbf{x}_{ti}^{T}\xrightarrow{p}\bm{\Sigma}$ as $N_{b}\rightarrow\infty$, where $\xrightarrow{p}$ denotes convergence in probability.

We obtain the asymptotic normal distribution of the Oracle Huber regression estimator $\widehat{\bm{\theta}}_{N_{b}}$ as follows:

$$\sqrt{N_{b}}\Big(\widehat{\bm{\theta}}_{N_{b}}-\bm{\theta}_{0}\Big)\overset{d}{\rightarrow}N\Big(\bm{0},\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}\bm{\Sigma}^{-1}\Big),\quad N_{b}\rightarrow\infty$$ (2)

where $\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}$ is a function of $f(\cdot)$, and $\overset{d}{\rightarrow}$ denotes convergence in distribution.
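As a concrete illustration of this variance factor (under the assumption of standard normal errors and the common tuning constant $k=1.345$, neither of which is prescribed by this paper), $\psi$ is the identity on $[-k,k]$ and $\pm k$ outside, so $E(\psi^{\prime})=P(|\epsilon|\leq k)=2\Phi(k)-1$ and $E(\psi^{2})=2\Phi(k)-1-2k\phi(k)+2k^{2}\{1-\Phi(k)\}$, where $\Phi$ and $\phi$ denote the standard normal distribution function and density. Plugging in $k=1.345$ gives $E(\psi^{2})/[E(\psi^{\prime})]^{2}\approx 1.05$; that is, the Huber estimator loses only about 5% asymptotic efficiency relative to least squares when the errors are in fact Gaussian, while remaining far more stable under the heavy-tailed cases considered later.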

2.2 Online Updating Huber Robust Regression Estimator

In this part, we derive the expression of the Online Updating Huber Robust Regression (UHR) estimator. First, we calculate the local Huber estimator on each subset. Let $\widehat{\bm{\theta}}_{t}$ denote the local Huber estimator on the $t$-th subset. According to Huber (1964), it is given by Eq. (3).

$$\widehat{\bm{\theta}}_{t}=\arg\min_{\bm{\theta}}Q_{t}(\bm{\theta}),\qquad Q_{t}(\bm{\theta})=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\rho(y_{ti}-\textbf{x}_{ti}^{T}\bm{\theta})$$ (3)

As the entire dataset is divided into several subsets, the subset size $n_{t}$ can be much smaller than the total sample size $N_{b}$. Therefore, solving Eq. (3) is more computationally efficient and less storage intensive than solving Eq. (1). Furthermore, under regularity conditions (N1)-(N4), the local estimator $\widehat{\bm{\theta}}_{t}$ follows the asymptotic normal distribution in Eq. (4), where $\bm{\Sigma}_{t}=E(\textbf{x}_{ti}\textbf{x}_{ti}^{T})$.

$$\sqrt{n_{t}}\Big(\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0}\Big)\overset{d}{\rightarrow}N\Big(\textbf{0},\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}\bm{\Sigma}_{t}^{-1}\Big),\quad n_{t}\rightarrow\infty$$ (4)

Taking the idea in Wang et al.'s (2022) work as an inspiration, we consider the local estimators from the perspective of maximum likelihood. Since all samples are independent and identically distributed, the local estimators $\widehat{\bm{\theta}}_{1},...,\widehat{\bm{\theta}}_{b}$ are mutually independent. Thus, we can regard $\{\widehat{\bm{\theta}}_{1},...,\widehat{\bm{\theta}}_{b}\}$ as samples generated from the following multivariate normal distributions.

$$\Big\{N\Big(\bm{\theta}_{0},\frac{1}{n_{1}}\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}\bm{\Sigma}_{1}^{-1}\Big),\cdots,N\Big(\bm{\theta}_{0},\frac{1}{n_{b}}\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}\bm{\Sigma}_{b}^{-1}\Big)\Big\}$$ (5)

Based on Eq. (5), the likelihood function of $\bm{\theta}$ is as follows.

$$L(\bm{\theta})=\prod_{t=1}^{b}\Big(\frac{1}{\sqrt{2\pi}}\Big)^{p}\Big(\frac{1}{n_{t}}\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}|\bm{\Sigma}_{t}^{-1}|\Big)^{-\frac{1}{2}}\exp\Big\{-\frac{n_{t}}{2}\frac{[E(\psi^{\prime})]^{2}}{E(\psi^{2})}(\widehat{\bm{\theta}}_{t}-\bm{\theta})^{T}\bm{\Sigma}_{t}(\widehat{\bm{\theta}}_{t}-\bm{\theta})\Big\}$$ (6)

Taking the logarithm of Eq. (6), we obtain Eq. (7):

$$\log(L(\bm{\theta}))=C-\frac{1}{2}\frac{[E(\psi^{\prime})]^{2}}{E(\psi^{2})}\sum_{t=1}^{b}n_{t}(\widehat{\bm{\theta}}_{t}-\bm{\theta})^{T}\bm{\Sigma}_{t}(\widehat{\bm{\theta}}_{t}-\bm{\theta})$$ (7)

where $C$ is a constant and $\frac{[E(\psi^{\prime})]^{2}}{E(\psi^{2})}$ is a function of $f(\cdot)$, both of which are irrelevant to $\bm{\theta}$. The UHR estimator is obtained by maximizing Eq. (7). Moreover, by the non-negativity of $\frac{[E(\psi^{\prime})]^{2}}{E(\psi^{2})}$, maximizing Eq. (7) is equivalent to minimizing the following loss function.

$$\sum_{t=1}^{b}n_{t}(\widehat{\bm{\theta}}_{t}-\bm{\theta})^{T}\bm{\Sigma}_{t}(\widehat{\bm{\theta}}_{t}-\bm{\theta})$$ (8)

Replacing $\bm{\Sigma}_{t}$, $t=1,...,b$, by its sample analogue $\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\textbf{x}_{ti}^{T}$ (since $\bm{\Sigma}_{t}=E(\textbf{x}_{ti}\textbf{x}_{ti}^{T})$), we obtain the final objective function as follows, where $\textbf{X}_{t}=(\textbf{x}_{t1},...,\textbf{x}_{tn_{t}})^{T}$.

$$Q(\bm{\theta})=\sum_{t=1}^{b}(\widehat{\bm{\theta}}_{t}-\bm{\theta})^{T}\textbf{X}_{t}^{T}\textbf{X}_{t}(\widehat{\bm{\theta}}_{t}-\bm{\theta})$$ (9)

Thus the Online Updating Huber Robust Regression estimator of $\bm{\theta}_{0}$, $\widehat{\bm{\theta}}_{N_{b}}^{uhr}$, can be obtained by minimizing Eq. (9).

$$\widehat{\bm{\theta}}_{N_{b}}^{uhr}=\arg\min_{\bm{\theta}}Q(\bm{\theta})=\Big(\sum_{t=1}^{b}\textbf{X}_{t}^{T}\textbf{X}_{t}\Big)^{-1}\Big(\sum_{t=1}^{b}\textbf{X}_{t}^{T}\textbf{X}_{t}\widehat{\bm{\theta}}_{t}\Big)$$ (10)

According to Eq. (10), we only need to extract two key features from each subset, namely $\textbf{X}_{t}^{T}\textbf{X}_{t}$ and $\widehat{\bm{\theta}}_{t}$, with no need to store the entire dataset. This saves computer memory and speeds up the computation, making the method well suited to big data streams.

The implementation of the UHR algorithm is illustrated in Figure 1. To start the algorithm, we import the first subset $D_{1}=\{(y_{1i},\textbf{x}_{1i}),i=1,...,n_{1}\}$. Using the function rlm of the R package MASS, the local estimator of the first subset, $\widehat{\bm{\theta}}_{1}$, is calculated, and the two features of the first subset, $\textbf{X}_{1}^{T}\textbf{X}_{1}$ and $\textbf{X}_{1}^{T}\textbf{X}_{1}\widehat{\bm{\theta}}_{1}$, are obtained. For the second subset $D_{2}=\{(y_{2i},\textbf{x}_{2i}),i=1,...,n_{2}\}$, the same method is used to calculate $\widehat{\bm{\theta}}_{2}$ and $\textbf{X}_{2}^{T}\textbf{X}_{2}$, which are then added to $\textbf{X}_{1}^{T}\textbf{X}_{1}$ and $\textbf{X}_{1}^{T}\textbf{X}_{1}\widehat{\bm{\theta}}_{1}$. The resulting sums, $\textbf{X}_{1}^{T}\textbf{X}_{1}+\textbf{X}_{2}^{T}\textbf{X}_{2}$ and $\textbf{X}_{1}^{T}\textbf{X}_{1}\widehat{\bm{\theta}}_{1}+\textbf{X}_{2}^{T}\textbf{X}_{2}\widehat{\bm{\theta}}_{2}$, are the summary statistics of subset $D_{1}$ updated by subset $D_{2}$. The remaining subsets are processed in the same manner. After the last subset $D_{b}=\{(y_{bi},\textbf{x}_{bi}),i=1,...,n_{b}\}$ is imported, we obtain the key quantities $\sum_{t=1}^{b}\textbf{X}_{t}^{T}\textbf{X}_{t}$ and $\sum_{t=1}^{b}\textbf{X}_{t}^{T}\textbf{X}_{t}\widehat{\bm{\theta}}_{t}$, and finally the online updating estimator $\widehat{\bm{\theta}}_{N_{b}}^{uhr}$ based on the entire data stream. The pseudo-code is given in Algorithm 1.


Figure 1: Online Updating algorithm for Huber regression estimator
Algorithm 1 Online updating algorithm for Huber regression estimator

Input: data subsets $D_{t}=\{(y_{ti},\textbf{x}_{ti}),i=1,2,...,n_{t}\}$, $t=1,2,...,b$
Output: Online Updating Huber Estimator $\widehat{\bm{\theta}}_{N_{b}}^{uhr}$

1: initialize $\textbf{U}_{0}=\bm{0}_{p\times p}$, $\textbf{V}_{0}=\bm{0}_{p\times 1}$
2: for $t=1,2,...,b$ do
3:     $\widehat{\bm{\theta}}_{t}=\arg\min_{\bm{\theta}}Q_{t}(\bm{\theta})$, where $Q_{t}(\bm{\theta})=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\rho(y_{ti}-\textbf{x}_{ti}^{T}\bm{\theta})$
4:     $\textbf{U}_{t}=\textbf{U}_{t-1}+\textbf{X}_{t}^{T}\textbf{X}_{t}$, $\textbf{V}_{t}=\textbf{V}_{t-1}+\textbf{X}_{t}^{T}\textbf{X}_{t}\widehat{\bm{\theta}}_{t}$
5: return $\widehat{\bm{\theta}}_{N_{b}}^{uhr}=\textbf{U}_{b}^{-1}\textbf{V}_{b}$
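As a concrete illustration of Algorithm 1, the following is a minimal R sketch of the updating loop. The function name uhr_stream, the representation of each batch as a list with components y and X, and the use of rlm with its default settings are our own illustrative choices (X is assumed to already contain an intercept column if one is needed); this is a sketch, not the authors' implementation.

```r
library(MASS)  # provides rlm() for the local Huber fits

# Minimal sketch of Algorithm 1: 'subsets' is a list of batches, each a list
# with components y (response vector) and X (n_t x p design matrix). Only the
# running sums U and V are kept between batches, never the raw data.
uhr_stream <- function(subsets) {
  p <- ncol(subsets[[1]]$X)
  U <- matrix(0, p, p)        # running sum of X_t' X_t
  V <- matrix(0, p, 1)        # running sum of X_t' X_t %*% theta_hat_t
  for (d in subsets) {
    fit     <- rlm(d$X, d$y)              # local Huber estimator on batch t
    theta_t <- matrix(coef(fit), p, 1)
    XtX     <- crossprod(d$X)             # X_t' X_t
    U <- U + XtX
    V <- V + XtX %*% theta_t
  }
  solve(U, V)                             # Eq. (10): (sum X'X)^{-1} (sum X'X theta_t)
}
```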

3 Properties of the Online Updating estimator

3.1 Asymptotic Normality

To prove the asymptotic normality of our UHR estimator, we first add some assumptions to regularity conditions (N1)-(N4) (Wang et al., 2022).

(N5) Let $n=\frac{N_{b}}{b}$ denote the average size of each subset. Suppose $\frac{n}{\sqrt{N_{b}}}\rightarrow\infty$, and all $n_{t}$ diverge at the same order $O(n)$, i.e., $c_{1}\leq\min_{t}n_{t}/n\leq\max_{t}n_{t}/n\leq c_{2}$ for some positive constants $c_{1}$ and $c_{2}$.

(N6) There exist positive definite matrices $\bm{\Sigma}_{1},...,\bm{\Sigma}_{b}$ such that $\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\textbf{x}_{ti}^{T}\xrightarrow{p}\bm{\Sigma}_{t}$ as $n_{t}\rightarrow\infty$, with the same convergence rate for all subsets, i.e., the convergence rate is irrelevant to $t$.

Obviously, $\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\textbf{x}_{ti}^{T}-\bm{\Sigma}_{t}=O_{p}(\frac{1}{\sqrt{n_{t}}})$ since $\bm{\Sigma}_{t}=E(\textbf{x}_{ti}\textbf{x}_{ti}^{T})$. Moreover, according to (N5) and (N6), all subsets share the same convergence rate, which is $O_{p}(\frac{1}{\sqrt{n_{t}}})=O_{p}(\frac{1}{\sqrt{n}})$.

Under regularity conditions (N1)-(N6), the UHR estimator $\widehat{\bm{\theta}}_{N_{b}}^{uhr}$ has the following asymptotic distribution:

$$[\Phi(f)]^{-\frac{1}{2}}\sqrt{N_{b}}(\widehat{\bm{\theta}}_{N_{b}}^{uhr}-\bm{\theta}_{0})\overset{d}{\rightarrow}N(\bm{0},\bm{I}_{p}),$$ (11)

where $\Phi(f)=\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}\Big(\sum_{t=1}^{b}\frac{n_{t}}{N_{b}}\bm{\Sigma}_{t}\Big)^{-1}$. The proof is as follows.

The local Huber estimator $\widehat{\bm{\theta}}_{t}$ satisfies the asymptotic expansion:

$$\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0}=\bm{\Sigma}_{t}^{-1}\frac{1}{E(\psi^{\prime})}\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\psi(\epsilon_{ti})+O_{p}\Big(\frac{1}{n_{t}}\Big)$$ (12)

where $\psi(\epsilon_{ti})=\rho^{\prime}(\epsilon_{ti})=\begin{cases}-k,&\epsilon_{ti}<-k\\ \epsilon_{ti},&|\epsilon_{ti}|\leq k\\ k,&\epsilon_{ti}>k\end{cases}$.

According to Eq. (10), the relationship between $\widehat{\bm{\theta}}_{N_{b}}^{uhr}$ and $\widehat{\bm{\theta}}_{t}$ is

$$\widehat{\bm{\theta}}_{N_{b}}^{uhr}=\Big(\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}\Big)^{-1}\Big(\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}\widehat{\bm{\theta}}_{t}}{n_{t}}\Big)$$ (13)

where $w_{t}=\frac{n_{t}}{N_{b}}$. By direct calculation, we obtain

$$\sqrt{N_{b}}(\widehat{\bm{\theta}}_{N_{b}}^{uhr}-\bm{\theta}_{0})=\Big(\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}\Big)^{-1}\Big(\sqrt{N_{b}}\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}(\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0})\Big)$$ (14)

By $\sqrt{N_{b}}\Big(\sum_{t=1}^{b}w_{t}\Big(\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}-\bm{\Sigma}_{t}\Big)(\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0})\Big)=O_{p}\Big(\frac{b}{\sqrt{N_{b}}}\Big)$ and regularity conditions (N1)-(N6), we obtain

$$\begin{aligned}
&\sqrt{N_{b}}\Big(\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}(\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0})\Big)\\
=\;&\sqrt{N_{b}}\Big(\sum_{t=1}^{b}w_{t}\bm{\Sigma}_{t}(\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0})\Big)+\sqrt{N_{b}}\Big(\sum_{t=1}^{b}w_{t}\Big(\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}-\bm{\Sigma}_{t}\Big)(\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0})\Big)\\
=\;&\sqrt{N_{b}}\Big(\sum_{t=1}^{b}w_{t}\bm{\Sigma}_{t}\Big(\bm{\Sigma}_{t}^{-1}\frac{1}{E(\psi^{\prime})}\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\psi(\epsilon_{ti})+O_{p}\Big(\frac{1}{n_{t}}\Big)\Big)\Big)+O_{p}\Big(\frac{b}{\sqrt{N_{b}}}\Big)\\
=\;&\frac{1}{\sqrt{N_{b}}}\sum_{t=1}^{b}\frac{1}{E(\psi^{\prime})}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\psi(\epsilon_{ti})+O_{p}\Big(\frac{b}{\sqrt{N_{b}}}\Big)
\end{aligned}$$ (15)

Since $\sum_{t=1}^{b}w_{t}=1$, we have

$$\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}-\sum_{t=1}^{b}w_{t}\bm{\Sigma}_{t}=\sum_{t=1}^{b}w_{t}\Big(\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}-\bm{\Sigma}_{t}\Big)=o_{p}(1)$$ (16)

Then we can obtain that

$$\begin{aligned}
&[\Phi(f)]^{-\frac{1}{2}}\sqrt{N_{b}}(\widehat{\bm{\theta}}_{N_{b}}^{uhr}-\bm{\theta}_{0})\\
=\;&[\Phi(f)]^{-\frac{1}{2}}\Big(\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}\Big)^{-1}\Big(\sqrt{N_{b}}\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}(\widehat{\bm{\theta}}_{t}-\bm{\theta}_{0})\Big)\\
=\;&[\Phi(f)]^{-\frac{1}{2}}\Big(\sum_{t=1}^{b}w_{t}\frac{\textbf{X}_{t}^{T}\textbf{X}_{t}}{n_{t}}\Big)^{-1}\Big(\frac{1}{\sqrt{N_{b}}}\sum_{t=1}^{b}\frac{1}{E(\psi^{\prime})}\sum_{i=1}^{n_{t}}\textbf{x}_{ti}\psi(\epsilon_{ti})+O_{p}\Big(\frac{b}{\sqrt{N_{b}}}\Big)\Big)\\
\xrightarrow{d}\;&N(\textbf{0},\bm{I}_{p})
\end{aligned}$$ (17)

If the covariates of each batch are homogeneous, i.e., $\bm{\Sigma}_{1}=\cdots=\bm{\Sigma}_{b}=\bm{\Sigma}$, then $\Phi(f)=\frac{E(\psi^{2})}{[E(\psi^{\prime})]^{2}}\bm{\Sigma}^{-1}$, which coincides with the asymptotic variance of the Oracle estimator in Eq. (2). Hence the updating estimator $\widehat{\bm{\theta}}_{N_{b}}^{uhr}$ is asymptotically equivalent to the Oracle estimator $\widehat{\bm{\theta}}_{N_{b}}$.

3.2 Computation Complexity

In this part, we give a rough calculation of the computation complexity of the Oracle Huber regression and the Online Updating Huber regression, showing that our online updating algorithm has lower computation complexity.

In our work, we use the function rlm in the MASS package of R to calculate the local estimators and the Oracle estimator based on the entire dataset. The function rlm is implemented via Iteratively Reweighted Least Squares. For convenience, suppose rlm's maximum number of iterations is $K$, the total sample size is $N$, and the data are divided evenly into $b$ subsets with $n$ samples each. The computation complexities of solving a local estimator and the Oracle Huber estimator are then about $O(K(pn^{2}+pn+np^{3}))$ and $O(KpN^{2}+(Kp+Kp^{2})N+Kp^{3})$, respectively. As the online updating algorithm contains $b$ loops and calculating the key features of each subset costs about $O(p^{2}n)$, the total computation complexity of the online updating algorithm is about $O(Kb(pn^{2}+pn+n+p^{3})+bp^{2}n)$. Using $N=nb$, this simplifies to $O(KpNn+(Kp^{2}+Kp+1)N+Kbp^{3})$. Although $K$ is set to a large number (usually 1000), convergence is typically reached within 10 iterations in practice, while the total sample size $N$ can reach a million or more, so $K\ll N$. As for $p$, we only consider low-dimensional cases where $p\ll N$. To sum up, the dominant parts of the computation complexity of the Oracle and online updating estimators are $O(KpN^{2})$ and $O(KpNn)$, respectively. The only difference between the two is $N$ versus $n$, where $n\ll N$. In conclusion, theoretically, our Online Updating Huber regression has lower computation complexity than the Oracle Huber regression.
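For a rough sense of scale, plugging in the illustrative values $N=10^{6}$, $n=10^{4}$, $p=4$ and $K=10$ (our own choice for illustration, not figures from the experiments), the dominant terms are $KpN^{2}=4\times 10^{13}$ for the Oracle estimator versus $KpNn=4\times 10^{11}$ for the online updating one, i.e., the leading cost is reduced by a factor of $N/n=100$.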

4 Numerical Studies

In this section, we demonstrate the performance of our UHR algorithm by comparing it with four other algorithms on six different synthetic datasets. We also apply UHR to a real airline dataset released by the American Statistical Association, showing that it can solve real problems as well.

4.1 Simulations

4.1.1 Experiment Settings

In simulations, we generate data using the following model:

$$y_{ti}=\textbf{x}_{ti}^{T}\bm{\theta}+h(\textbf{x}_{ti})\epsilon_{ti}$$

where the coefficients are $\bm{\theta}=(\theta_{1},\theta_{2},\theta_{3},\theta_{4})^{T}=(1,-1,2,-2)^{T}$, and the covariates $\textbf{x}_{ti}$ follow a multivariate normal distribution with zero mean and a covariance matrix of the independence type. The function $h(\textbf{x}_{ti})$ controls whether the random error is homoscedastic or heteroscedastic: if the error is homogeneous, $h(\textbf{x}_{ti})=1$; otherwise $h(\textbf{x}_{ti})=\sum_{j=1}^{4}x_{ti}^{j}$. For the random error $\epsilon_{ti}$, we consider the following 5 cases (an R sketch of this data-generating process is given after the list):

(1) $\epsilon_{ti}\sim N(0,1)$;

(2) $\epsilon_{ti}\sim t(3)$;

(3) $\epsilon_{ti}\sim t(3)$ with $15\%$ of the responses $y_{ti}$ replaced by $y_{ti}+20$;

(4) $\epsilon_{ti}\sim 0.85N(0,1)+0.15N(0,8)$;

(5) $\epsilon_{ti}\sim \mathrm{Cauchy}(0,1)$.
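To make the design concrete, the following is a minimal R sketch of how one batch can be generated under the five error cases. The helper name gen_batch, the use of independent standard normal covariates, and the reading of N(0,8) as a normal with variance 8 are our own assumptions for illustration.

```r
# Sketch of the simulation design for one batch of size n_t.
gen_batch <- function(n_t, case = 1, heteroscedastic = FALSE) {
  p <- 4
  theta <- c(1, -1, 2, -2)
  X <- matrix(rnorm(n_t * p), n_t, p)            # independent N(0,1) covariates (assumed)
  eps <- switch(as.character(case),
    "1" = rnorm(n_t),                            # N(0,1)
    "2" = rt(n_t, df = 3),                       # t(3)
    "3" = rt(n_t, df = 3),                       # t(3); outliers added below
    "4" = ifelse(runif(n_t) < 0.85,              # 0.85 N(0,1) + 0.15 N(0,8)
                 rnorm(n_t), rnorm(n_t, sd = sqrt(8))),
    "5" = rcauchy(n_t))                          # Cauchy(0,1)
  h <- if (heteroscedastic) rowSums(X) else 1    # h(x) = sum_j x^j, or 1
  y <- drop(X %*% theta) + h * eps
  if (case == 3) {                               # replace 15% of responses by y + 20
    idx <- sample(n_t, size = round(0.15 * n_t))
    y[idx] <- y[idx] + 20
  }
  list(y = y, X = X)
}
```

Feeding a list of such batches to the updating loop sketched in Section 2.2 then imitates the streaming setting.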

Besides our Online Updating Huber Regression (UHR) algorithm, we consider four competing algorithms: Ordinary Least Squares regression (OLS), Renewable Least Squares regression (RLS) (Luo and Song, 2020), Oracle Huber Regression (OHR) (Huber, 1964), and Divide-and-Conquer Huber Regression (DC-HR). OLS and RLS serve to demonstrate the robustness of our UHR algorithm. OHR is used to investigate whether the online updating estimator has the same estimation efficiency as the Oracle one. DC-HR is a simple combination of naïve-dc and Huber regression, used to compare the divide-and-conquer and online updating approaches. Moreover, to reduce randomness, we repeat the whole simulation process 500 times.

As for evaluation indexes, we judge the results in terms of estimation and calculation efficiency. There are three indexes: $MSE$, $MAE$, and $time$, whose expressions are given in Eq. (18) to Eq. (20).

$$MSE(\bm{\theta})=\frac{1}{500}\sum_{s=1}^{500}\sum_{j=1}^{4}(\widehat{\theta}_{j}^{(s)}-\theta_{j})^{2}$$ (18)
$$MAE(\bm{\theta})=\frac{1}{500}\sum_{s=1}^{500}\sum_{j=1}^{4}|\widehat{\theta}_{j}^{(s)}-\theta_{j}|$$ (19)
$$time=\frac{\sum_{s=1}^{500}elapsed^{(s)}}{500}$$ (20)

Here, $\widehat{\theta}_{j}^{(s)}$ is the estimate of the $j$-th component of $\bm{\theta}$ in the $s$-th repetition. Thus, $MSE$ and $MAE$ are the averages over 500 repetitions of the summed squared and absolute errors across dimensions. $time$ is the average running time of a given random-error setting over 500 repetitions, recorded as elapsed time with the function system.time() in R.
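A minimal sketch of how Eqs. (18)-(20) can be computed, assuming the 500 repeated estimates are stacked into a 500 x 4 matrix theta_hat (our own data layout, not one specified in the paper):

```r
theta0 <- c(1, -1, 2, -2)

# Eq. (18): average over repetitions of the summed squared errors across dimensions
mse <- function(theta_hat, theta0) mean(rowSums(sweep(theta_hat, 2, theta0)^2))

# Eq. (19): average over repetitions of the summed absolute errors
mae <- function(theta_hat, theta0) mean(rowSums(abs(sweep(theta_hat, 2, theta0))))

# Eq. (20): elapsed time of one repetition, recorded via system.time()
t_elapsed <- system.time({ Sys.sleep(0.01) })["elapsed"]
```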

Our simulations include 4 experiments in total:

Experiment 1: Fix the total sample size $N_{b}=1,000,000$ and vary the number of subsets $b\in\{100,200,500,1000\}$. This experiment compares the estimation and calculation efficiency of the different algorithms and the influence of the subset number $b$ on them.

Experiment 2: Fix the subset size $n_{t}=100$ and vary the number of subsets $b\in\{10,100,1000,10000\}$. Accordingly, the total sample size $N_{b}$ varies from $1,000$ to $1,000,000$. This experiment investigates how the estimation efficiency of each algorithm changes as the subset number $b$ increases.

Experiment 3: Fix the subset size $n_{t}=5000$ and vary the total sample size $N_{b}\in\{2.5\times 10^{5},5\times 10^{5},7.5\times 10^{5},10\times 10^{5}\}$. Accordingly, the subset number $b$ varies from $50$ to $200$. This experiment investigates the influence of $N_{b}$ on each algorithm. We also examine the relationship between $N_{b}$ and running time to verify, from the simulation perspective, the computation complexity obtained above.

Experiment 4: Fix the subset size $n_{t}=1000$ and vary the total sample size $N_{b}\in\{1\times 10^{5},2\times 10^{5},...,8\times 10^{5}\}$. Accordingly, the subset number $b$ varies from $100$ to $800$. This experiment is similar to Experiment 3 but with a smaller $n_{t}$, so as to investigate the effect of the scale of $n_{t}$.

4.1.2 Estimation Efficiency

For estimation efficiency, we focus on Experiments 1 and 2. For Experiment 1, Table 1 to Table 5 present the $MSE$ and $MAE$ under the random error cases considered. From these tables, we can draw the following conclusions:

(1) For heavy-tailed errors, heterogeneous errors, and errors contaminated by outliers, the estimation error of the Huber family (OHR, DC-HR, UHR) is much smaller than that of the linear regression family (OLS, RLS), demonstrating that the Huber algorithms are more robust than the linear regression algorithms.

(2) When the random error follows case 1 to case 4, our UHR has estimation errors very close to those of OHR. This verifies, from the experimental perspective, that the online updating estimator $\widehat{\bm{\theta}}_{N_{b}}^{uhr}$ is asymptotically equivalent to the Oracle Huber estimator.

(3) When the random error follows case 5, i.e., the Cauchy distribution, the linear regression algorithms break down, with very large $MSE$ and $MAE$, while the Huber algorithms still perform well. It is worth mentioning that the distributed algorithms DC-HR and UHR even show a clear estimation improvement over the Oracle Huber estimator. This phenomenon may be related to the fact that the variance of the Cauchy distribution does not exist. Due to the high volatility of the data, using the whole dataset may not yield the best estimation efficiency, while smaller subsets may perform better.

(4) Comparing the two distributed Huber algorithms, we find that in most cases UHR has smaller estimation errors than DC-HR, which suggests that the online updating algorithm performs better than divide-and-conquer. Even though DC-HR performs better than UHR in some cases, we still prefer UHR due to the wider range of applications of online updating in big data streams.

(5) To better display the influence of the subset number $b$ on the estimation error of UHR, we plot the relationship between $b$ and $MSE$ in the heterogeneous cases (Figure 2). In summary, for the normal distribution, the $t$ distribution, the $t$ distribution with outliers, and the mixed normal distribution (case 1 to case 4), the estimation error changes only slightly as $b$ increases; that is, the estimation results of UHR are robust to the subset number. For the Cauchy case (case 5), the estimation error decreases as $b$ increases, since the subset size decreases, which is consistent with conclusion (3).

Table 1: Estimation Error in Experiment 1 for case 1
case 1: N(0,1)
homogeneous
b 100 200 500 1000
MSE OLS 0.04122 0.04122 0.04122 0.04122
RLS 0.04122 0.04122 0.04122 0.04122
OHR 0.04325 0.04325 0.04325 0.04325
DC-HR 0.04335 0.04329 0.04336 0.04342
UHR 0.04334 0.04333 0.04333 0.04336
MAE OLS 0.32565 0.32565 0.32565 0.32565
RLS 0.32565 0.32565 0.32565 0.32565
OHR 0.33423 0.33423 0.33423 0.33423
DC-HR 0.33456 0.33414 0.33414 0.33500
UHR 0.33458 0.33456 0.33454 0.33486
heterogeneous
b 100 200 500 1000
MSE OLS 0.24578 0.24578 0.24578 0.24578
RLS 0.24578 0.24578 0.24578 0.24578
OHR 0.13903 0.13903 0.13903 0.13903
DC-HR 0.13879 0.13911 0.13995 0.13974
UHR 0.13884 0.13929 0.14001 0.13999
MAE OLS 0.79640 0.79640 0.79640 0.79640
RLS 0.79640 0.79640 0.79640 0.79640
OHR 0.60189 0.60189 0.60189 0.60189
DC-HR 0.60066 0.60161 0.60297 0.60211
UHR 0.60094 0.60205 0.60376 0.60315
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.

Table 2: Estimation Error in Experiment 1 for case 2
case 2: t(3)
homogeneous
b 100 200 500 1000
MSE OLS 0.11885 0.11885 0.11885 0.11885
RLS 0.11885 0.11885 0.11885 0.11885
OHR 0.06468 0.06468 0.06468 0.06468
DC-HR 0.06479 0.06491 0.06500 0.06528
UHR 0.06463 0.06464 0.06471 0.06473
MAE OLS 0.54831 0.54831 0.54831 0.54831
RLS 0.54831 0.54831 0.54831 0.54831
OHR 0.40403 0.40403 0.40403 0.40403
DC-HR 0.40427 0.40454 0.40473 0.40503
UHR 0.40381 0.40383 0.40394 0.40364
heterogeneous
b 100 200 500 1000
MSE OLS 0.74695 0.74695 0.74695 0.74695
RLS 0.74695 0.74695 0.74695 0.74695
OHR 0.19960 0.19960 0.19960 0.19960
DC-HR 0.19773 0.19802 0.19812 0.19878
UHR 0.19823 0.19859 0.19902 0.19987
MAE OLS 1.37955 1.37955 1.37955 1.37955
RLS 1.37955 1.37955 1.37955 1.37955
OHR 0.71031 0.71031 0.71031 0.71031
DC-HR 0.70748 0.70796 0.70763 0.70854
UHR 0.70816 0.70844 0.70886 0.71044
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.

Table 3: Estimation Error in Experiment 1 for case 3
case 3: t(3) with outliers
homogeneous
b 100 200 500 1000
MSE OLS 2.49242 2.49242 2.49242 2.49242
RLS 2.49242 2.49242 2.49242 2.49242
OHR 0.12271 0.12271 0.12271 0.12271
DC-HR 0.12207 0.12241 0.12333 0.12442
UHR 0.12185 0.12199 0.12258 0.12310
MAE OLS 2.54406 2.54406 2.54406 2.54406
RLS 2.54406 2.54406 2.54406 2.54406
OHR 0.55517 0.55517 0.55517 0.55517
DC-HR 0.55402 0.55446 0.55679 0.55884
UHR 0.55336 0.55369 0.55487 0.55608
heterogeneous
b 100 200 500 1000
MSE OLS 3.18120 3.18120 3.18120 3.18120
RLS 3.18120 3.18120 3.18120 3.18120
OHR 0.36930 0.36930 0.36930 0.36930
DC-HR 0.36105 0.36105 0.36166 0.36678
UHR 0.36110 0.36135 0.36180 0.36654
MAE OLS 2.87060 2.87060 2.87060 2.87060
RLS 2.87060 2.87060 2.87060 2.87060
OHR 0.96724 0.96724 0.96724 0.96724
DC-HR 0.95639 0.95577 0.95761 0.96439
UHR 0.95605 0.95594 0.95672 0.96274
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.

Table 4: Estimation Error in Experiment 1 for case 4
case 4: 0.85×N(0,1)+0.15×N(0,8)
homogeneous
b 100 200 500 1000
MSE OLS 0.07771 0.07771 0.07771 0.07771
RLS 0.07771 0.07771 0.07771 0.47451
OHR 0.05530 0.05530 0.05530 0.05530
DC-HR 0.05538 0.05543 0.05544 0.05565
UHR 0.05524 0.05524 0.05518 0.05519
MAE OLS 0.44224 0.44224 0.44224 0.44224
RLS 0.44224 0.44224 0.44224 1.11088
OHR 0.37259 0.37259 0.37259 0.37259
DC-HR 0.37309 0.37324 0.37331 0.37356
UHR 0.37238 0.37240 0.37233 0.37205
heterogeneous
b 100 200 500 1000
MSE OLS 0.47451 0.47451 0.47451 0.47451
RLS 0.47451 0.47451 0.47451 0.47451
OHR 0.19964 0.19964 0.19964 0.19964
DC-HR 0.19859 0.19857 0.19953 0.19965
UHR 0.19881 0.19866 0.19967 0.19990
MAE OLS 1.11088 1.11088 1.11088 1.11088
RLS 1.11088 1.11088 1.11088 1.11088
OHR 0.71519 0.71519 0.71519 0.71519
DC-HR 0.71321 0.71335 0.71535 0.71578
UHR 0.71364 0.71377 0.71559 0.71599
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.

Table 5: Estimation Error in Experiment 1 for case 5
case 5: Cauchy(0,1)
homogeneous
b 100 200 500 1000
MSE OLS 11711466.60880 11711466.60880 11711466.60880 11711466.60880
RLS 11711466.60880 11711466.60880 11711466.60880 11711466.60880
OHR 7.16525 7.16525 7.16525 7.16525
DC-HR 0.15638 0.14910 0.14287 0.14240
UHR 0.15626 0.14878 0.14190 0.14109
MAE OLS 1502.46026 1502.46026 1502.46026 1502.46026
RLS 1502.46026 1502.46026 1502.46026 1502.46026
OHR 3.23099 3.23099 3.23099 3.23099
DC-HR 0.63014 0.61556 0.60233 0.60258
UHR 0.63009 0.61492 0.60000 0.60016
heterogeneous
b 100 200 500 1000
MSE OLS 29518873.58901 29518873.58901 29518873.58901 29518873.58901
RLS 29518873.58901 29518873.58901 29518873.58901 29518873.58901
OHR 102.19791 102.19791 102.19791 102.19791
DC-HR 0.53841 0.42683 0.38114 0.37118
UHR 0.53882 0.42715 0.38089 0.37050
MAE OLS 2651.78745 2651.78745 2651.78745 2651.78745
RLS 2651.78745 2651.78745 2651.78745 2651.78745
OHR 13.48746 13.48746 13.48746 13.48746
DC-HR 1.16842 1.04156 0.98940 0.97666
UHR 1.16899 1.04247 0.98857 0.97579
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.


Figure 2: MSE of UHR as the batch number $b$ varies

Table 6 to Table 10 show the results of Experiment 2, where $b$ varies from $10$ to $10000$ while the subset size $n_{t}$ is kept fixed. The following conclusions can be drawn from these tables.

(1) As $b$ increases, all the algorithms perform better. However, when $N_{b}$ is small (for example, $N_{b}=1000$) and the random error is heavily contaminated (such as cases 4 and 5 with heterogeneous error), the estimation error of OLS is much larger than that of the Huber algorithms. UHR also performs rather well, which indicates that our algorithm suits not only big data streams but also datasets of ordinary size.

(2) For case 1 to case 4, OHR has the smallest estimation error. Although UHR does not outperform OHR, it differs from OHR only slightly even when $N_{b}$ is small, indicating that UHR achieves good asymptotic equivalence with OHR.

(3) For case 5, the estimation error of UHR is much smaller than that of OHR. This points to the same conclusion as in Experiment 1: for the Cauchy distribution, smaller subsets may yield better estimation efficiency than the entire dataset.

(4) The UHR algorithm performs better than DC-HR in most cases, demonstrating the superiority of the online updating algorithm.

(5) Finally, the setting of Experiment 2 is close to a big data stream: each subset has a small size of 100 samples, while the subset number $b$ increases from 10 to 10000, mimicking the arrival of new subsets in a big data stream. We find that the estimation error of UHR decreases as the subset number increases and remains similar to that of OHR, which implies that UHR is applicable to big data streams.

Table 6: Estimation Error in Experiment 2 for case 1
case 1: N(0,1)
homogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 41.65896 3.88942 0.39393 0.04122
RLS 41.65896 3.88942 0.39393 0.04122
OHR 44.27814 4.07750 0.41684 0.04325
DC-HR 46.34740 4.35515 0.44800 0.04593
UHR 44.27868 4.09200 0.42152 0.04357
MAE OLS 10.40966 3.18110 1.00481 0.32565
RLS 10.40966 3.18110 1.00481 0.32565
OHR 10.74373 3.25098 1.03314 0.33423
DC-HR 10.94708 3.35412 1.06948 0.34447
UHR 10.7334727* 3.2551126* 1.0428132* 0.3357810*
heterogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 227.80014 23.45107 2.41744 0.24578
RLS 227.80014 23.45107 2.41744 0.24578
OHR 146.58109 14.57030 1.38268 0.13903
DC-HR 153.32625* 15.35639* 1.44728 0.14974
UHR 153.78150 15.44100 1.44311* 0.14645*
MAE OLS 24.24029 7.76669 2.49384 0.79640
RLS 24.24029 7.76669 2.49384 0.79640
OHR 19.38650 6.11912 1.87599 0.60189
DC-HR 19.80401* 6.27307* 1.91918 0.61996
UHR 19.84922 6.31367 1.91440* 0.61727*
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.
3. * marks the better-performing of UHR and DC-HR.

Table 7: Estimation Error in Experiment 2 for case 2
case 2: t(3)
homogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 117.55450 11.13197 1.16896 0.11885
RLS 117.55450 11.13197 1.16896 0.11885
OHR 65.99914 6.21516 0.61343 0.06468
DC-HR 70.83631 6.79250 0.66002 0.07046
UHR 67.51753 6.35419 0.62871 0.06651
MAE OLS 17.16898 5.29056 1.73030 0.54831
RLS 17.16898 5.29056 1.73030 0.54831
OHR 13.04886 3.96196 1.25701 0.40403
DC-HR 13.47303 4.12096 1.30451 0.41925
UHR 13.16859* 4.00056* 1.27431* 0.40905*
heterogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 713.28555 72.67236 7.44372 0.74695
RLS 713.28555 72.67236 7.44372 0.74695
OHR 210.04680 21.71120 1.38268 0.19960
DC-HR 223.54581* 23.37703 2.21219* 0.21934
UHR 225.35099 23.18896* 2.23103 0.21598*
MAE OLS 41.74305 13.58001 4.35484 1.37955
RLS 41.74305 13.58001 4.35484 1.37955
OHR 22.96835 7.49359 1.87599 0.71031
DC-HR 23.71750* 7.79915 2.38534* 0.74458
UHR 23.79698 7.72857* 2.39061 0.73697*
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.
3. * marks the better-performing of UHR and DC-HR.

Table 8: Estimation Error in Experiment 2 for case 3
case 3: t(3) with outliers
homogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 2507.30950 246.33631 25.46565 2.49242
RLS 2507.30950 246.33631 25.46565 2.49242
OHR 122.73570 11.76220 1.12463 0.12271
DC-HR 147.05309 14.42111 1.37263 0.15004
UHR 140.65747 13.72919 1.30735 0.14124
MAE OLS 79.30791 25.08737 8.04774 2.54406
RLS 79.30791 25.08737 8.04774 2.54406
OHR 17.63695 5.45740 1.67765 0.55517
DC-HR 19.35555 6.00961 1.86798 0.61847
UHR 18.83415* 5.84983* 1.81094* 0.59913*
heterogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 3160.12456 304.54568 30.63899 3.18120
RLS 3160.12456 304.54568 30.63899 3.18120
OHR 378.19492 37.93189 3.64422 0.36930
DC-HR 449.14772 45.55376 4.46745 0.44603
UHR 443.33399* 44.43532* 4.36829* 0.43609*
MAE OLS 89.66953 28.08533 8.77479 2.87060
RLS 89.66953 28.08533 8.77479 2.87060
OHR 30.89182 9.82228 3.03822 0.96724
DC-HR 33.87220 10.77172 3.37132 1.06330
UHR 33.60676* 10.62887* 3.34340* 1.05017*
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.
3. * marks the better-performing of UHR and DC-HR.

Table 9: Estimation Error in Experiment 2 for case 4
case 4: 0.85×N(0,1)+0.15×N(0,8)
homogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 83.55944 8.00806 0.77406 0.07771
RLS 83.55944 8.00806 0.77406 0.07771
OHR 56.64867 5.47834 0.55143 0.05530
DC-HR 59.51882 5.80763 0.58022 0.05906
UHR 57.78717* 5.56164* 0.56295* 0.05617*
MAE OLS 14.42020 4.52082 1.38455 0.44224
RLS 14.42020 4.52082 1.38455 0.44224
OHR 11.89712 3.78035 1.17328 0.37259
DC-HR 12.18530 3.89420 1.19697 0.38471
UHR 11.97547* 3.82015* 1.18172* 0.37528*
heterogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 472.41924 51.87671 14382.78626 0.47451
RLS 472.41924 51.87671 14382.78626 0.47451
OHR 194.14143 19.97540 1.94836 0.19964
DC-HR 208.04891 21.31645 2.04540* 0.20936*
UHR 206.90744* 21.03077* 2.06516 0.21083
MAE OLS 34.72670 11.48330 239.77327 1.11088
RLS 34.72670 11.48330 239.77327 1.11088
OHR 22.14798 7.18230 2.24135 0.71519
DC-HR 22.89192 7.35959 2.30214* 0.73328*
UHR 22.88671* 7.30928* 2.30965 0.73398
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.
3. * marks the better-performing of UHR and DC-HR.

Table 10: Estimation Error in Experiment 2 for case 5
case 5: Cauchy(0,1)
homogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 13031861.67103 37026406.50190 6041016.96141 11711466.60880
RLS 13031861.67103 37026406.50190 6041016.96141 11711466.60880
OHR 146.07957 16.91701 3.81870 7.16525
DC-HR 169.68684 17.13059 1.69887 0.16616
UHR 162.31759* 16.29104* 1.61969* 0.15643*
MAE OLS 1431.93975 1881.11733 1147.11445 1502.46026
RLS 1431.93975 1881.11733 1147.11445 1502.46026
OHR 19.26524 6.32966 2.65689 3.23099
DC-HR 20.69384 6.62144 2.08405 0.65167
UHR 20.25285* 6.45609* 2.03275* 0.63216*
heterogeneous
N $10^{3}$ $10^{4}$ $10^{5}$ $10^{6}$
MSE OLS 47087625.58229 123184496.07168 92413723.44581 29518873.58901
RLS 47087625.58229 123184496.07168 92413723.44581 29518873.58901
OHR 362.02026 3340.24291 40.58347 102.19791
DC-HR 418.52258 46.02272 4.17859 0.42415
UHR 413.20117* 45.54626* 4.13358* 0.41610*
MAE OLS 2705.91096 3340.24291 2526.77046 2651.78745
RLS 2705.91096 3340.24291 2526.77046 2651.78745
OHR 30.23063 11.18484 7.40727 13.48746
DC-HR 32.69547 10.84272 3.26736 1.05280
UHR 32.45720* 10.82107* 3.23621* 1.03787*
1. Units for $MSE$ and $MAE$ are $10^{-4}$ and $10^{-2}$, respectively.
2. The values shown in bold are the optimal indexes in each case.
3. * marks the better-performing of UHR and DC-HR.

4.1.3 Calculation Efficiency

For calculation efficiency, we mainly focus on Experiments 1, 3, and 4. Experiment 1 investigates the influence of $b$ on calculation efficiency when the total sample size $N_{b}$ is fixed. Experiment 3 focuses on the influence of $N_{b}$ when $n_{t}$ is fixed and relatively large. Experiment 4 is similar to Experiment 3, but with a smaller fixed $n_{t}$, so as to compare the effect of different scales of $n_{t}$.

For Experiment 1, the average running time for each random error setting is obtained according to Eq. (20). Before analyzing the results, it should be mentioned that OLS and OHR, which use the entire dataset, should in principle show no difference in running time across the settings, since $N_{b}$ is fixed. However, in execution, OLS and OHR are run four times under different values of $b$, and random fluctuations of the computer lead to small differences in their running times. Details are shown in Table 11 to Table 14. Based on these results, we draw the following conclusions:

(1) Our UHR runs faster than OHR, DC-HR, and RLS in most cases. As $b$ increases, the running time of RLS jumps up. Although the running time of UHR increases as well, it grows more slowly than that of RLS, indicating a better property of UHR; see Figure 3 for details. As the results of the six cases are similar, we only visualize case 6.

(2) Overall, the linear regression algorithms have better calculation efficiency than the Huber algorithms. However, according to Experiment 1, they are not robust to contaminated random errors. As a result, we still prefer the UHR algorithm to linear regression.

(3) When the subset number $b$ is rather small (e.g., $b=100,200,500$), UHR has better calculation efficiency than OHR. However, this may no longer hold when $b$ increases to 1000. A possible reason is that as $b$ increases, the number of loop iterations increases as well, and the accumulated running time leads to a longer total time. To mitigate this, we can deliberately set $b$ to a smaller value, since $b$ has little effect on estimation efficiency. In big data streams, however, a rather small subset size $n_{t}$ is usually preferred, so setting a small $b$ for higher calculation efficiency may not be possible. Nevertheless, as the entire dataset cannot be obtained in the big data streams scenario, OHR is no longer applicable and UHR is the only feasible method, so a slightly longer running time of UHR is tolerable.

Table 11: Average Running Time in Experiment 1 for $b=100$
$b=100$
case 1 case 2 case 3 case 4 case 5 case 6
homogeneous
OLS 0.98250 0.98310 0.95790 0.98050 0.97960 0.97840
RLS 1.96290 1.94380 1.95850 1.97450 1.96220 1.95190
OHR 2.02670 2.28570 2.37480 2.22820 2.29830 2.28450
DC-HR 1.77900 1.86630 1.87670 1.83970 1.77620 1.83630
UHR 1.75810 1.84130 1.83750 1.81400 1.85240 1.82630
heterogeneous
OLS 0.97090 0.97520 0.97280 0.97380 0.97270 0.99380
RLS 1.96890 2.04620 1.97240 1.95330 1.96670 1.95050
OHR 2.52180 2.57020 2.45510 2.52730 4.55150 2.32000
DC-HR 2.02770 1.97980 2.00170 2.07250 2.49510 1.96030
UHR 2.01890 2.03020 1.93080 2.04700 2.46830 1.90100
Note: The values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.

Table 12: Average Running Time in Experiment 1 for $b=200$
$b=200$
case 1 case 2 case 3 case 4 case 5 case 6
homogeneous
OLS 0.97720 0.98290 1.01140 0.98460 0.98500 0.97810
RLS 1.67800 1.65770 1.65050 1.65270 1.64270 1.76660
OHR 2.10430 2.40010 2.46390 2.32010 2.42110 2.36800
DC-HR 2.36680 2.43230 2.42160 2.41890 2.42460 2.42040
UHR 2.34550 2.40360 2.38780 2.39220 2.37850 2.40130
heterogeneous
OLS 1.02260 0.98390 0.97880 1.01350 0.97970 0.98880
RLS 1.75910 1.61520 1.77020 1.69830 1.61170 1.67900
OHR 2.61610 2.68210 2.54390 2.64820 4.81750 2.41370
DC-HR 2.62560 2.63130 2.53630 2.64060 2.97750 2.50600
UHR 2.59650 2.57810 2.50440 2.59220 2.91410 2.47980
Note: The values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.

Table 13: Average Running Time in Experiment 1 for $b=500$
$b=500$
case 1 case 2 case 3 case 4 case 5 case 6
homogeneous
OLS 1.00940 0.99870 1.01550 1.02750 1.08760 1.01830
RLS 3.42860 3.29650 3.14050 3.29470 3.16500 3.16210
OHR 2.01170 2.39080 2.30080 2.33080 2.35250 2.26460
DC-HR 2.34690 2.53610 2.38140 2.44700 2.52070 2.41760
UHR 2.11160 2.19880 2.17990 2.13890 2.19090 2.15200
heterogeneous
OLS 1.02390 0.98440 1.09730 1.01810 1.01980 1.01280
RLS 3.54000 3.19630 3.15630 3.24660 3.16610 3.26170
OHR 2.47570 2.59900 2.36000 2.62570 4.64180 2.31450
DC-HR 2.71570 2.92090 2.62800 2.97190 3.29770 2.67450
UHR 2.50870 2.66040 2.35530 2.59340 2.87450 2.27890
Note: The values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.

Table 14: Average Running Time in Experiment 1 for $b=1000$
$b=1000$
case 1 case 2 case 3 case 4 case 5 case 6
homogeneous
OLS 1.00480 0.99670 0.99650 0.99950 1.00350 1.00390
RLS 6.69460 6.77830 6.39620 4.47510 4.64710 4.46410
OHR 2.30300 2.60550 2.77850 2.53540 2.62030 2.61390
DC-HR 2.73700 2.88140 2.83060 2.94160 2.70820 2.89440
UHR 3.06820 3.20160 3.01010 2.72710 2.86340 2.83450
heterogeneous
OLS 1.02390 0.98440 1.09730 1.01810 1.01980 1.01280
RLS 3.54000 3.19630 3.15630 3.24660 3.16610 3.26170
OHR 2.47570 2.59900 2.36000 2.62570 4.64180 2.31450
DC-HR 2.71570 2.92090 2.62800 2.97190 3.29770 2.67450
UHR 2.50870 2.66040 2.35530 2.59340 2.87450 2.27890
Note: The values shown in bold are the optimal model indexes (excluding the OLS algorithm) in each case.

Figure 3: Batch number versus calculation time for case 6

For Experiment 3 and Experiment 4, we distinguish the homogeneous and heterogeneous settings and average the running time over the 6 random-error cases. The relationship between sample size and running time is shown in Figure 4 and Figure 5, from which we can draw the following conclusions.

(1) Overall, the Linear Regression algorithms run faster than the Huber algorithms, which agrees with Experiment 2. Nevertheless, estimation accuracy is the priority, so we still prefer the Huber algorithms even though the Linear Regression algorithms have better calculation efficiency.

(2) Among the Huber algorithms, UHR outperforms OHR in calculation efficiency in most cases. Judging from the line trends, the running time of OHR climbs faster than that of UHR as the total sample size N_b increases. Moreover, the computation complexities of UHR and OHR derived earlier are O(KpNn) and O(KpN^2), respectively. This agrees with Figure 4, where the UHR and OHR curves closely resemble linear and quadratic functions of N_b, confirming that UHR has lower computation complexity than OHR in the simulations.

Unlike Experiment 3 with n_t=5000, Experiment 4 sets n_t=1000, so Figure 5 looks distinctly different from Figure 4. When N_b < 4×10^5, UHR in Figure 5 still runs faster than OHR, in line with Figure 4. However, UHR gradually loses its calculation advantage once N_b > 4×10^5. This result is not attributable to the increase in N_b itself: in Experiment 3, UHR runs faster than OHR even with N_b = 1×10^6 and n_t = 5000. We therefore attribute the poorer performance in Experiment 4 to the other variable, b. This mirrors the observation in Experiment 1 and is likely caused by the growing number of updating iterations. A possible remedy is to parallelize or distribute the computation across multiple machines inside the algorithm to cut down the total running time, as sketched below.
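As a rough illustration of this parallelization idea only (it is not part of the proposed updating rule), the sketch below distributes the per-subset Huber fits across worker processes with Python's concurrent.futures; the solver (scikit-learn's HuberRegressor) and the naive averaging of the subset estimates are stand-in choices made for the example.

from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.linear_model import HuberRegressor


def huber_fit(block):
    # Fit a Huber regression on one data subset and return (intercept, coefficients).
    X_t, y_t = block
    model = HuberRegressor().fit(X_t, y_t)
    return np.concatenate(([model.intercept_], model.coef_))


def parallel_subset_fits(blocks, max_workers=4):
    # Fit all subsets in parallel; the per-subset estimates can then be passed
    # to the online-updating (or divide-and-conquer) combination step.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(huber_fit, blocks))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = [(rng.normal(size=(1000, 4)), rng.normal(size=1000)) for _ in range(8)]
    estimates = parallel_subset_fits(blocks)
    print(np.mean(estimates, axis=0))  # naive average of the subset estimates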

Figure 4: Total sample size versus calculation time with n_t=5000

Figure 5: Total sample size versus calculation time with n_t=1000

4.2 Real Data Analysis

In the real data analysis, we run the regression with the UHR, OLS, RLS, OHR, and DC-HR algorithms from the simulations, together with a DC-AHR algorithm, which combines naïve-dc with Adaptive Huber regression (Sun et al., 2020; Wang et al., 2021). By comparing the results, we aim to show that our algorithm also performs well in real applications.

Our real data is the airline dataset from the 2009 ASA Data Expo (http://statcomputing.org/dataexpo/2009/the-data.html). It records the departure and arrival information of roughly 120 million commercial flights in the United States between 1987 and 2008. We choose the 2007 data as the research object, which contains 7,453,215 records, about 0.654 GB in total. After removing canceled and missing flight records, 7,275,288 samples remain. We use the first 3,000,000 samples as the training set and the rest as the test set, and set the subset number b=100, a rather small value, to better imitate the process of big data streams.
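A minimal preprocessing sketch is given below; the file name 2007.csv and the column names Cancelled and ArrDelay are assumptions based on the public Data Expo files and should be checked against the actual data.

import numpy as np
import pandas as pd

df = pd.read_csv("2007.csv")                                # raw 2007 flight records
df = df[df["Cancelled"] == 0].dropna(subset=["ArrDelay"])   # drop canceled and missing flights

train, test = df.iloc[:3_000_000], df.iloc[3_000_000:]      # first 3,000,000 rows for training

b = 100                                                     # subset number, kept deliberately small
subsets = np.array_split(train, b)                          # feed the subsets sequentially to mimic a stream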

The model under study is the four-covariate linear regression model proposed by Schifano et al. (2016), which is widely used in big data regression studies (Jiang et al., 2021; Wang et al., 2016; Jiang et al., 2018).

AD = \gamma_{0} + \gamma_{1} HD + \gamma_{2} DIS + \gamma_{3} NF + \gamma_{4} WF + \epsilon    (21)

In Eq.(21), AD is a transformation of the raw variable ArrDelay, which records the arrival delay. It is a continuous variable obtained by the transformation \log(ArrDelay - \min(ArrDelay) + 1). HD is the hour of departure, a discrete variable ranging from 1 to 24. DIS is the flight distance in thousands of miles. NF and WF are dummy variables: NF indicates a night take-off, equal to 1 if the flight departs between 8 pm and 5 am of the next day and 0 otherwise; WF equals 1 for departures on weekends and 0 on weekdays.
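As a hedged sketch of how the variables in Eq.(21) could be built from the raw records, assuming the standard Data Expo column names ArrDelay, CRSDepTime, Distance, and DayOfWeek (with 6 and 7 denoting Saturday and Sunday); the exact coding used in the paper may differ.

import numpy as np
import pandas as pd


def build_features(df):
    out = {}
    # AD: log(ArrDelay - min(ArrDelay) + 1); in practice the minimum should be
    # taken over the training data and reused for the test set.
    out["AD"] = np.log(df["ArrDelay"] - df["ArrDelay"].min() + 1)
    # HD: hour of departure, coded 1..24 (0 o'clock mapped to 24).
    hour = (df["CRSDepTime"] // 100).astype(int)
    out["HD"] = hour.replace(0, 24)
    # DIS: flight distance in thousands of miles.
    out["DIS"] = df["Distance"] / 1000.0
    # NF: 1 for take-offs between 8 pm and 5 am, 0 otherwise.
    out["NF"] = ((hour >= 20) | (hour <= 5)).astype(int)
    # WF: 1 for weekend departures, 0 for weekdays.
    out["WF"] = df["DayOfWeek"].isin([6, 7]).astype(int)
    return pd.DataFrame(out)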

We choose three evaluation indexes. The first is the regression coefficient vector \bm{\gamma}=(\gamma_{0},\gamma_{1},\gamma_{2},\gamma_{3},\gamma_{4})^{T}. By comparing the estimated coefficients and their significance, differences among the 6 algorithms can be shown. It should be mentioned that in the significance tests we use the bootstrap to estimate the standard errors and then compute the t-values, since for distributed algorithms the t-values cannot be obtained directly as in the Oracle case. The second index is the regression error, namely the out-of-sample mse and mae given in Eq.(22) and Eq.(23). The last index is the running time, which measures calculation efficiency.

mse = \frac{1}{n}\sum_{i=1}^{n}(\widehat{y}_{i}-y_{i})^{2}    (22)
mae = \frac{1}{n}\sum_{i=1}^{n}|\widehat{y}_{i}-y_{i}|    (23)
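The sketch below computes the out-of-sample mse and mae of Eq.(22) and Eq.(23) together with bootstrap-based t-values; the fitting routine fit_fn (which returns a coefficient vector) and the number of bootstrap replications B are illustrative placeholders, not the exact procedure used in the paper.

import numpy as np


def mse_mae(y_hat, y):
    # Out-of-sample errors, Eq.(22) and Eq.(23).
    resid = y_hat - y
    return np.mean(resid ** 2), np.mean(np.abs(resid))


def bootstrap_t_values(X, y, fit_fn, B=200, seed=0):
    # t-value = point estimate / bootstrap standard error.
    rng = np.random.default_rng(seed)
    gamma_hat = fit_fn(X, y)
    boots = np.empty((B, gamma_hat.size))
    for rep in range(B):
        idx = rng.integers(0, len(y), size=len(y))  # resample rows with replacement
        boots[rep] = fit_fn(X[idx], y[idx])
    se = boots.std(axis=0, ddof=1)
    return gamma_hat / se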

The outcomes are presented in Table 15. Firstly, there is little difference between the coefficients of the algorithms, and the t-values are all significant, showing that every method explains this model well. In terms of estimation efficiency, the estimation error of UHR is the smallest and very close to that of the Oracle Huber regression, which again supports its asymptotic equivalence with the Oracle algorithm. Its estimation error is also smaller than those of DC-HR and DC-AHR, attesting to the better performance of Online Updating over divide-and-conquer. As for calculation efficiency, the Huber algorithms run slower than the Linear Regression algorithms; however, according to mse and mae, the Huber algorithms effectively reduce the estimation error compared with Linear Regression, which is why we prefer them. Moreover, UHR stands out among the Huber algorithms, running about 5 seconds faster than the Oracle Huber regression, which achieves our goal of speeding up the calculation process. Overall, UHR performs impressively in the real data analysis.

Table 15: Results of Real Data Analysis
(Intercept) DepHour Kdis night weekend mse mae time
OLS 5.72482 0.00356 -0.00194 0.02196 -0.00754 0.00905 0.0882 1.62
(30937.344) (283.338) (-19.550) (121.83) (-59.634)
RLS 5.72482 0.00356 -0.00194 0.02196 -0.00754 0.00905 0.0882 0.695
(30455.939) (247.18) (-18.169) (89.745) (-61.248)
OHR 5.72547 0.00226 -0.00147 0.00799 -0.00549 0.00832 0.08129 13.168
(50804.848) (295.477) (-24.236) (72.776) (-71.197)
DC-HR 5.72599 0.00232 -0.00131 0.00844 -0.00557 0.00831 0.08113 8.273
(54921.000) (301.501) (-18.600) (61.269) (-75.822)
DC-AHR 5.72582 0.00228 -0.00118 0.00826 -0.00549 0.00829 0.08096 134.377
UHR 5.7256 0.00231 -0.00146 0.00886 -0.00546 0.00829 0.08092 8.653
(55172.597) (289.168) (-21.160) (67.636) (-70.772)
  • 1 The values shown in bold are the optimal model indexes in each case.
  • 2 The values in parentheses are the t-values.
  • 3 The t-value of DC-AHR is not reported because it takes too long to calculate.

5 Conclusions and Discussions

To perform statistical analysis on big data streams while handling outliers, this paper proposes an Online Updating Huber Regression (UHR) algorithm. By combining the online updating method with Huber regression, UHR achieves regression on big data streams by continuously updating summary statistics without storing historical data, together with robust estimation on contaminated datasets. Theoretical and simulation results show that UHR is asymptotically equivalent to the Oracle estimator that uses the entire dataset and has lower computation complexity. We apply the algorithm to simulations and a real data analysis. In the simulations, UHR performs outstandingly when the random error follows a heavy-tailed distribution or contains a large number of outliers, and it also achieves good calculation efficiency compared with the Oracle method, especially when the subset number b is relatively small. In the real data analysis, UHR delivers good regression results on the airline dataset, with the smallest estimation error, a much faster calculation speed than the Oracle algorithm, and significant regression coefficients, attesting to its feasibility in real applications.

However, several issues deserve further investigation in the future. Firstly, we assume that the true values of the unknown parameters do not change during the data-generating process; this ignores the concept drift issue in big data streams and may not hold in real applications, so more sophisticated streaming-data models need to be investigated. Secondly, we only consider big data with high volume rather than high dimension; in future work, by combining the method with penalty functions, we can study high-dimensional regularized online updating problems. Lastly, the proposed algorithm is designed for independent data, which is a rather simple assumption; solutions for dependent or non-stationary big data streams are worth exploring in the future.

References

  • Allison (1995) P. D. Allison. The impact of random predictors on comparisons of coefficients between models: Comment on clogg, petkova, and haritou. American Journal of Sociology, 100(5):1294–1305, 1995.
  • Bassett and Jr. (1978) Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica, 46(1):33–50, 1978.
  • Battey et al. (2018) Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3), June 2018. ISSN 0090-5364. doi: 10.1214/17-AOS1587.
  • Chen and Zhou (2020) Lanjue Chen and Yong Zhou. Quantile regression in big data: A divide and conquer based strategy. Computational Statistics & Data Analysis, 144:106892, April 2020. ISSN 01679473. doi: 10.1016/j.csda.2019.106892.
  • Chen et al. (2019) Xi Chen, Weidong Liu, and Yichen Zhang. Quantile regression under memory constraint. The Annals of Statistics, 47(6), December 2019. ISSN 0090-5364. doi: 10.1214/18-AOS1777.
  • Chen and Xie (2014) Xueying Chen and Min-ge Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 2014. ISSN 10170405. doi: 10.5705/ss.2013.088.
  • Clogg et al. (1995) Clifford C. Clogg, Eva Petkova, and Adamantios Haritou. Statistical methods for comparing regression coefficients between models. American Journal of Sociology, 1995.
  • Faming et al. (2013) Faming Liang, Yichen Cheng, Qifan Song, Jincheol Park, and Ping Yang. A resampling-based stochastic approximation method for analysis of large geostatistical data. Journal of the American Statistical Association, 2013.
  • Fang Yao (2021) Ying Yang and Fang Yao. Online Estimation for Functional Data. Journal of the American Statistical Association, pages 1–15, November 2021. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2021.2002158.
  • Hu et al. (2021) Aijun Hu, Yuling Jiao, Yanyan Liu, Yueyong Shi, and Yuanshan Wu. Distributed quantile regression for massive heterogeneous data. Neurocomputing, 448:249–262, August 2021. ISSN 09252312. doi: 10.1016/j.neucom.2021.03.041.
  • Huber and Peter (1964) Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  • Huber and Peter (1981) Peter J. Huber. Robust Statistics. Wiley Series in Probability and Statistics. Wiley, New York, 1981.
  • Jiang et al. (2018) Rong Jiang, Xueping Hu, Keming Yu, and Weimin Qian. Composite quantile regression for massive datasets. Statistics, 52(5):980–1004, September 2018. ISSN 0233-1888, 1029-4910. doi: 10.1080/02331888.2018.1500579.
  • Jiang et al. (2021) Rong Jiang, Wei-wei Chen, and Xin Liu. Adaptive quantile regressions for massive datasets. Statistical Papers, 62(4):1981–1995, August 2021. ISSN 0932-5026, 1613-9798. doi: 10.1007/s00362-020-01170-8.
  • Jin et al. (2015) Xiaolong Jin, Benjamin W. Wah, Xueqi Cheng, and Yuanzhuo Wang. Significance and Challenges of Big Data Research. Big Data Research, 2(2):59–64, June 2015. ISSN 22145796. doi: 10.1016/j.bdr.2015.01.006.
  • Jordan et al. (2019) Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-Efficient Distributed Statistical Inference. Journal of the American Statistical Association, 114(526):668–681, April 2019. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2018.1429274.
  • Kleiner et al. (2014) Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816, September 2014. ISSN 13697412. doi: 10.1111/rssb.12050.
  • Lee et al. (2015) J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor. Communication-efficient sparse regression: a one-shot approach. Computer Science, 2015.
  • Lee et al. (2020) JooChul Lee, HaiYing Wang, and Elizabeth D. Schifano. Online updating method to correct for measurement error in big data streams. Computational Statistics & Data Analysis, 149:106976, September 2020. ISSN 01679473. doi: 10.1016/j.csda.2020.106976.
  • Lian and Fan (2018) H. Lian and Z. Fan. Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. Journal of Machine Learning Research, 18:1–26, 2018.
  • Liang et al. (2016) Faming Liang, Jinsu Kim, and Qifan Song. A bootstrap Metropolis–Hastings algorithm for Bayesian analysis of big data. Technometrics, 2016.
  • Luo et al. (2022) Jiyu Luo, Qiang Sun, and Wen-Xin Zhou. Distributed adaptive Huber regression. Computational Statistics & Data Analysis, 169:107419, May 2022. ISSN 01679473. doi: 10.1016/j.csda.2021.107419.
  • Luo and Song (2020) Lan Luo and Peter X.-K. Song. Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):69–97, February 2020. ISSN 13697412. doi: 10.1111/rssb.12352.
  • Nan and Xi (2011) L. Nan and R. Xi. Aggregated estimating equation estimation. Statistics and its interface, 1(1):73–83, 2011.
  • Ping and Sun (2015) Ping Ma and Xiaoxiao Sun. Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics, 7(1), 2015.
  • Schifano et al. (2016) Elizabeth D. Schifano, Jing Wu, Chun Wang, Jun Yan, and Ming-Hui Chen. Online Updating of Statistical Inference in the Big Data Setting. Technometrics, 58(3):393–403, July 2016. ISSN 0040-1706, 1537-2723. doi: 10.1080/00401706.2016.1142900.
  • Sun et al. (2020) Qiang Sun, Wen-Xin Zhou, and Jianqing Fan. Adaptive Huber Regression. Journal of the American Statistical Association, 115(529):254–265, January 2020. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2018.1543124.
  • Wang et al. (2016) Chun Wang, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, and Jun Yan. Statistical Methods and Computing for Big Data. Statistics and Its Interface, 9(4):399–414, 2016. ISSN 19387989, 19387997. doi: 10.4310/SII.2016.v9.n4.a1.
  • Wang et al. (2018) Chun Wang, Ming-Hui Chen, Jing Wu, Jun Yan, Yuping Zhang, and Elizabeth Schifano. Online updating method with new variables for big data streams. Canadian Journal of Statistics, 46(1):123–146, March 2018. ISSN 03195724. doi: 10.1002/cjs.11330.
  • Wang et al. (2022) Kangning Wang, Hongwei Wang, and Shaomin Li. Renewable quantile regression for streaming datasets. Knowledge-Based Systems, 235:107675, January 2022. ISSN 09507051. doi: 10.1016/j.knosys.2021.107675.
  • Wang et al. (2021) Lili Wang, Chao Zheng, Wen Zhou, and Wen-Xin Zhou. A New Principle for Tuning-Free Huber Regression. Statistica Sinica, 2021. ISSN 10170405. doi: 10.5705/ss.202019.0045.
  • Wu et al. (2021) Jing Wu, Ming-Hui Chen, Elizabeth D. Schifano, and Jun Yan. Online Updating of Survival Analysis. Journal of Computational and Graphical Statistics, 30(4):1209–1223, October 2021. ISSN 1061-8600, 1537-2715. doi: 10.1080/10618600.2020.1870481.
  • Xue and Hu (2021) Yishu Xue and Guanyu Hu. Online updating of information based model selection in the big data setting. Communications in Statistics - Simulation and Computation, 50(11):3516–3529, November 2021. ISSN 0361-0918, 1532-4141. doi: 10.1080/03610918.2019.1626886.
  • Zhang et al. (2013) Y. Zhang, J. C. Duchi, and M. J. Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 30(1):592–617, 2013.
  • Zhang et al. (2012) Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6792–6792, Maui, HI, USA, December 2012. IEEE. ISBN 978-1-4673-2066-5 978-1-4673-2065-8 978-1-4673-2063-4 978-1-4673-2064-1. doi: 10.1109/CDC.2012.6426691.
  • Zhao et al. (2016) T. Zhao, G. Cheng, and H. Liu. A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4):1400–1437, 2016.
  • Zhu et al. (2021) Xuening Zhu, Feng Li, and Hansheng Wang. Least-Square Approximation for a Distributed System. Journal of Computational and Graphical Statistics, 30(4):1004–1018, October 2021. ISSN 1061-8600, 1537-2715. doi: 10.1080/10618600.2021.1923517.