
Statistical Inference in High-dimensional Generalized Linear Models with Streaming Data

Lan Luo (Department of Statistics and Actuarial Science, University of Iowa, USA), Ruijian Han (Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China), Yuanyuan Lin (Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China), and Jian Huang (Department of Statistics and Actuarial Science, University of Iowa, USA). Lan Luo and Ruijian Han contributed equally.

Abstract

In this paper, we develop an online approach to real-time estimation and statistical inference in high-dimensional generalized linear models (GLMs) with streaming data. We propose an online debiased lasso (ODL) method to accommodate the special structure of streaming data. ODL differs from the offline debiased lasso in two important respects. First, in computing the estimate at the current stage, it uses only summary statistics of the historical data. Second, in addition to debiasing an online lasso estimator, ODL corrects an approximation error term arising from nonlinear online updating with streaming data. We show that the proposed online debiased estimators for GLMs are consistent and asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of the proposed ODL method; they demonstrate the effectiveness of the algorithm and support the theoretical results. A streaming dataset from the National Automotive Sampling System-Crashworthiness Data System is analyzed to illustrate the application of the proposed method.

Keywords: Adaptive tuning, confidence interval, high-dimensional data, online debiased lasso, online error correction.

1 Introduction

Due to the explosive growth of data from non-traditional sources such as sensor networks, security logs and web applications, streaming data is becoming a core component of big data analysis. By streaming data, we refer to data that are generated continuously over time, typically in high volume and at high velocity, including log files generated by mobile or web applications, e-commerce purchases, information from social networks, and data from financial trading floors. To reduce the demand on computing memory and to achieve real-time processing, the nature of streaming data calls for algorithms that require only one pass over the data. Furthermore, many data streams are high-dimensional in nature, such as genomic data relating the variability in the phenotypes of an organism to its genetic profile (Csala et al., 2017; Wang et al., 2017) and neuroimaging data for predicting neural activity patterns under various stimuli (Wen et al., 2018; Zhang et al., 2020). At the initial stage of data collection, it is routine to encounter a situation where the number of predictors exceeds the number of observations. Moreover, even as the sample size grows along the data stream, the candidate set of predictors may contain a large proportion of redundant features. Therefore, to improve the interpretability of the results, we need to differentiate predictors in terms of their signal strength, and one classical approach is penalized regression. In this paper, we focus on a high-dimensional regression setting in generalized linear models (GLMs) with non-Gaussian outcomes. Our goal is to develop a real-time estimation and inference procedure that is highly scalable with respect to fast-growing data volumes, with no loss of efficiency in statistical inference in the presence of a large number of features.

Streaming data processing falls within the field of online learning. This line of research dates back to Robbins & Monro (1951), who proposed a stochastic approximation algorithm that laid the foundation for the popular stochastic gradient descent (SGD) algorithm (Sakrison, 1965). Since then, the SGD algorithm and its variants have been extensively studied for online estimation and prediction (Toulis & Airoldi, 2017), but online statistical inference remains largely unexplored. A recent paper by Fang (2019) proposed a perturbation-based resampling method to construct confidence intervals for SGD, but it does not achieve desirable statistical efficiency and may produce misleading inference in high-dimensional settings. In addition to SGD-type recursive algorithms, several online updating methods have been proposed specifically for sequential updating of regression coefficient estimators, including the online least squares estimator (OLSE) for the linear model, the cumulative estimating equation (CEE) estimator and the cumulatively updated estimating equation (CUEE) estimator of Schifano et al. (2016), and the renewable estimator of Luo & Song (2020) for nonlinear models.

Most of the aforementioned online algorithms are developed for low-dimensional settings where the number of features is far smaller than the total sample size. However, as pointed out above, a prominent feature of high-dimensional streaming data is that only a subset of the variables have nonzero coefficients. Besides the small sample size at the early stage of data collection, processing such data streams without properly accounting for sparsity in the feature set may introduce substantial bias and invalidate statistical inference. It is worth noting that even when the cumulative sample size eventually exceeds the number of features, traditional estimation methods for low-dimensional settings, such as maximum likelihood estimation (MLE), may still incur large bias, especially in GLMs (Sur & Candès, 2019). Therefore, state-of-the-art online learning algorithms designed for low-dimensional settings may be insufficient for processing high-dimensional data streams.

In traditional offline settings, many methods have been developed for analyzing high-dimensional static data. Most of the work on variable selection in high-dimensional regression follows the line of the lasso (Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001), the adaptive lasso (Zou, 2006), and the minimax concave penalty (MCP) (Zhang, 2010). However, variable selection methods focus on point estimation without quantifying the uncertainty in the estimates. Statistical inference problems in high-dimensional settings, including interval estimation and hypothesis testing, have since attracted much attention following the pioneering works of Zhang & Zhang (2014), van de Geer et al. (2014), Javanmard & Montanari (2014), and Belloni et al. (2015), among others. Other important methods for inference in high-dimensional linear models include Cai & Guo (2017, 2020) and Belloni et al. (2019). van de Geer et al. (2014) extended the de-sparsified lasso to high-dimensional GLMs. Ning & Liu (2017) proposed to construct confidence intervals for high-dimensional M-estimators based on a decorrelated score statistic. Recently, a novel splitting and smoothing inference approach for high-dimensional GLMs was proposed by Fei & Li (2021).

While significant progress has been made on statistical inference for high-dimensional regression in the traditional offline setting, variable selection and statistical inference for high-dimensional models with streaming data are still in their infancy. Sun et al. (2020) introduced a systematic framework for online variable selection based on popular offline methods such as MCP and the elastic net (Zou & Hastie, 2005), but their focus is not on statistical inference. Several existing methods do consider the inference problem. For example, Deshpande et al. (2019) proposed a class of online estimators for high-dimensional auto-regressive models. One of the most relevant works is a novel inference procedure for GLMs based on recursive online-score estimation (Shi et al., 2020). In both works, however, the entire dataset is assumed to be available at an initial stage for computing an initial estimator, e.g., the lasso estimator; thereafter, a forward bias-correction procedure is carried out recursively as data points arrive sequentially. The availability of the entire dataset at an initial stage is not a natural setup in online learning. To address this issue, Han et al. (2021) proposed an online debiased lasso method for statistical inference in high-dimensional linear models with streaming data.

Unlike high-dimensional linear models, where the loss function depends on the data only through sufficient statistics (Han et al., 2021), parameters and data are not linearly separable in GLMs. Motivated by the renewable estimation method in low-dimensional GLMs (Luo & Song, 2020), we start by taking a first-order Taylor expansion of the score function, which amounts to a quadratic approximation of the historical loss, to bypass the need for historical individual-level data. The key idea centers on using “approximate summary statistics” resulting from these Taylor expansions. This is, however, not a trivial extension of the methods developed for low-dimensional settings. In high-dimensional settings where predictors are spuriously correlated, a data-splitting strategy is typically used for decorrelation, with variable selection and estimation conducted on two different sub-datasets (Ning & Liu, 2017; Shi et al., 2020; Fei & Li, 2021). A prominent concern with approximate summary statistics that involve previous estimates is that they induce dependence in the corresponding estimating equation. Theoretically, the dependence among recursively updated estimators poses extra technical challenges in establishing non-asymptotic error bounds. In developing our online method for real-time confidence interval construction, we aim to address the following questions: (i) what types of approximate summary statistics should be stored to carry out an online debiasing procedure? (ii) will errors accumulate along the updating steps if we use the approximate summary statistics? (iii) will the online debiasing procedure maintain oracle properties similar to those of its offline counterpart? and (iv) how can the tuning parameter be chosen adaptively in an online setting where cross-validation, which relies on splitting the entire dataset, is not feasible?

The focus of this paper is to develop an online debiased lasso (ODL) estimator for high-dimensional generalized linear models with streaming datasets for real-time estimation and inference. Our new contributions include: (i) we propose a two-stage online estimation and debiasing framework that aligns with the streaming data collection scheme; (ii) ODL accounts for sparsity in the candidate set of predictors and provides valid statistical inference; and (iii) the ODL estimators for GLMs are shown to be consistent and asymptotically normal, which provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. ODL is inspired by the offline debiased lasso method (Zhang & Zhang, 2014; van de Geer et al., 2014; Javanmard & Montanari, 2014); however, it differs from the offline debiased lasso in two important aspects. First, in computing the estimate at the current stage, it uses only summary statistics of the historical data. Second, in addition to debiasing an online lasso estimator, ODL corrects an approximation error term arising from online updating with streaming data. This correction is crucial for guaranteeing the asymptotic normality of the ODL estimator.

This paper is organized as follows. Section 2 introduces the model formulation, followed by the proposed online two-stage debiasing method for processing high-dimensional streaming data. Section 3 establishes the large-sample properties that provide theoretical guarantees for the proposed method. Simulation experiments are presented in Section 4 to evaluate the performance of the proposed method in comparison with both the MLE and the offline debiased estimator. We apply the proposed ODL method to a real data example in Section 5. Finally, we make some concluding remarks in Section 6. All technical proofs are deferred to the Appendix.

1.1 Notation

For a matrix \bm{X}\in\mathbb{R}^{n\times p}, we let \bm{X}_{i\cdot}, \bm{X}_{\cdot j} and \bm{X}_{ij} denote the i-th row, the j-th column and the (i,j)-th element of \bm{X}, respectively. \bm{X}_{i,-j} is the sub-vector of \bm{X}_{i\cdot} with the j-th element deleted, and \bm{X}_{-i,-j} is the sub-matrix of \bm{X} with the i-th row and the j-th column deleted, the other elements remaining unchanged. For a vector \bm{x}\in\mathbb{R}^{m}, we define its \ell_{q}-norm as \lVert\bm{x}\rVert_{q}=(\sum_{i=1}^{m}|x_{i}|^{q})^{1/q}. For a sequence of random variables \{\xi_{n}\}_{n\in\mathbb{N}} and a corresponding sequence of constants \{a_{n}\}_{n\in\mathbb{N}}, we write \xi_{n}=\mathcal{O}_{p}(a_{n}) if for any \epsilon>0 there exist finite numbers M,N>0 such that P(|\xi_{n}/a_{n}|>M)<\epsilon for all n>N; in other words, \xi_{n}/a_{n} is stochastically bounded. We write \xi_{n}=o_{p}(a_{n}) if \xi_{n}/a_{n} converges to zero in probability. For streaming data, \bm{X}^{(j)} and \bm{y}^{(j)} denote the covariate matrix and the response vector arriving in the j-th batch, and \bm{X}^{(j)}_{\star} and \bm{y}^{(j)}_{\star} (with a star subscript) denote the corresponding cumulative quantities; for example, \bm{X}^{(j)}_{\star}=((\bm{X}^{(1)})^{\top},\ldots,(\bm{X}^{(j)})^{\top})^{\top}.

2 Methodology

In this section, we describe the proposed estimation method with streaming data, including the online lasso estimation and online debiased lasso estimation. We also provide an adaptive tuning method to select the regularization parameter. A rundown of our algorithm is summarized at the end of this section.

Consider a time point b\geq 2 up to which a total of N_{b} samples have arrived in a sequence of b data batches, denoted by \mathcal{D}^{\star}_{b}=\{\mathcal{D}_{1},\dots,\mathcal{D}_{b}\}, where batch \mathcal{D}_{j} contains n_{j}=|\mathcal{D}_{j}| samples, j=1,\dots,b. Assume each observation y_{i} is drawn independently from an exponential dispersion model with density function

f(y_{i}\mid\bm{x}_{i},\bm{\beta}^{0},\phi)=a(y_{i};\phi)\exp\left\{-\frac{1}{2\phi}d(y_{i};\mu_{i})\right\},\quad i\in\mathcal{D}^{\star}_{b},

where d(y_{i};\mu_{i}) is the unit deviance function involving the mean parameter \mu_{i}=\mathbb{E}(y_{i}\mid\bm{x}_{i}), and a(\cdot) is a normalizing factor depending only on the dispersion parameter \phi>0. The systematic component of a GLM takes the form \mu_{i}=\mathbb{E}(y_{i}\mid\bm{x}_{i})=g(\bm{x}_{i}^{\top}\bm{\beta}^{0}), where g(\cdot) is a known link function. The underlying regression coefficient vector \bm{\beta}^{0}\in\mathbb{R}^{p} is of primary interest and is assumed to be sparse with s_{0} nonzero elements. Specifically, we let S_{0}:=\{r:\beta^{0}_{r}\neq 0\} denote the active set of variables, whose cardinality is s_{0}. Our main goal is to conduct pointwise statistical inference for the components \beta_{r}^{0}\ (r=1,\dots,p) of the parameter vector upon the arrival of every new data batch \mathcal{D}_{b},\ b=1,2,\dots. The log-likelihood function for the cumulative dataset \mathcal{D}^{\star}_{b} is

\ell(\bm{\beta},\phi;\mathcal{D}^{\star}_{b})=\frac{1}{N_{b}}\sum_{i\in\mathcal{D}^{\star}_{b}}\log f(y_{i}\mid\bm{x}_{i},\bm{\beta},\phi)=\frac{1}{N_{b}}\sum_{i\in\mathcal{D}^{\star}_{b}}\log a(y_{i};\phi)-\frac{1}{2N_{b}\phi}\sum_{i\in\mathcal{D}^{\star}_{b}}d(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})).
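For instance, in the logistic regression model used in our numerical studies, \phi=1 and the unit deviance is d(y_{i};\mu_{i})=-2\{y_{i}\log\mu_{i}+(1-y_{i})\log(1-\mu_{i})\} with \mu_{i}=g(\bm{x}_{i}^{\top}\bm{\beta}^{0})=\exp(\bm{x}_{i}^{\top}\bm{\beta}^{0})/\{1+\exp(\bm{x}_{i}^{\top}\bm{\beta}^{0})\}.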

Based on \mathcal{D}_{b}^{\star}, the standard offline lasso estimator is defined as

\bar{\bm{\beta}}^{(b)}=\underset{\bm{\beta}\in\mathbb{R}^{p}}{\arg\min}\left\{\frac{1}{2N_{b}}\sum_{i\in\mathcal{D}^{\star}_{b}}d\left(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})\right)+\lambda_{b}\|\bm{\beta}\|_{1}\right\}, (1)

where N_{b}=\sum_{j=1}^{b}n_{j} is the cumulative sample size and \lambda_{b} is the regularization parameter. However, as discussed in Luo & Song (2020) and Han et al. (2021), the optimization in (1) requires the historical raw data \mathcal{D}^{\star}_{b-1}, which is no longer accessible in the streaming data setting, so the standard lasso estimator cannot be computed. Therefore, new online methods for obtaining the lasso estimator and the debiased lasso, tailored to streaming data, are needed; the latter serves the purpose of statistical inference. For clarity, we refer to the lasso estimator in (1) as the offline lasso estimator. Sections 2.1 and 2.2 are devoted to the construction of the online lasso estimator and the online debiasing method, respectively.

2.1 Online lasso estimator

We first consider the online lasso estimator obtained via gradient descent. Define the score function \bm{u}(y_{i};\bm{x}_{i},\bm{\beta}):=-\partial d(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta}))/\partial\bm{\beta}. The negative gradient of \mathcal{L}(\bm{\beta}):=\sum_{i\in\mathcal{D}_{b}^{\star}}d(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})), the first term in (1) up to the factor 1/(2N_{b}), is

\bm{U}^{(b)}(\bm{\beta}):=-\frac{\partial\mathcal{L}(\bm{\beta})}{\partial\bm{\beta}}=\sum_{i\in\mathcal{D}_{b}^{\star}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta}). (2)

Let U^{(j)}(\bm{\beta}):=\sum_{i\in\mathcal{D}_{j}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta}) be the score function pertaining to data batch \mathcal{D}_{j}. Then \bm{U}^{(b)}(\bm{\beta}) in (2) can be rewritten in the linear aggregation form:

\bm{U}^{(b)}(\bm{\beta})=\sum_{j=1}^{b}\sum_{i\in\mathcal{D}_{j}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta})=\sum_{j=1}^{b}U^{(j)}(\bm{\beta}).

To obtain an online estimator upon the arrival of \mathcal{D}_{b}, a key step is to update \bm{U}^{(b-1)}(\bm{\beta}) to \bm{U}^{(b)}(\bm{\beta}) without re-accessing the historical raw data \mathcal{D}_{b-1}^{\star}.

To illustrate the idea, we consider a simple case with two data batches, i.e., \mathcal{D}_{2} arrives subsequent to \mathcal{D}_{1}. The lasso estimator based on the first data batch is denoted by \widehat{\bm{\beta}}^{(1)}, which coincides with the offline estimator \bar{\bm{\beta}}^{(1)} in (1). To avoid using individual-level raw data in \mathcal{D}_{1}, we approximate \bm{U}^{(1)}(\bm{\beta}) through a first-order Taylor expansion at \widehat{\bm{\beta}}^{(1)}. Note that \widehat{\bm{\beta}}^{(1)} satisfies \lVert\bm{U}^{(1)}(\widehat{\bm{\beta}}^{(1)})\rVert_{\infty}=\mathcal{O}_{p}(\lambda_{1}N_{1}), which implies

\begin{split}\bm{U}^{(2)}(\bm{\beta})&=U^{(1)}(\bm{\beta})+U^{(2)}(\bm{\beta})\\ &=U^{(1)}(\widehat{\bm{\beta}}^{(1)})+\bm{J}^{(1)}(\widehat{\bm{\beta}}^{(1)})(\widehat{\bm{\beta}}^{(1)}-\bm{\beta})+U^{(2)}(\bm{\beta})+N_{1}\mathcal{O}_{p}(\|\widehat{\bm{\beta}}^{(1)}-\bm{\beta}\|^{2}),\end{split}

where \bm{J}^{(1)}(\bm{\beta})=-\partial U^{(1)}(\bm{\beta})/\partial\bm{\beta} denotes the information matrix. Next, we propose to approximate \bm{U}^{(2)}(\bm{\beta}) by

\widetilde{\bm{U}}^{(2)}(\bm{\beta}):=\bm{J}^{(1)}(\widehat{\bm{\beta}}^{(1)})(\widehat{\bm{\beta}}^{(1)}-\bm{\beta})+U^{(2)}(\bm{\beta}). (3)

Clearly, calculating \widetilde{\bm{U}}^{(2)}(\bm{\beta}) through (3) only requires access to the summary statistics \{\widehat{\bm{\beta}}^{(1)},\bm{J}^{(1)}(\widehat{\bm{\beta}}^{(1)})\} rather than the full dataset \mathcal{D}_{1}.

We now generalize the above approximation to an arbitrary data batch \mathcal{D}_{b}. Let \widetilde{\bm{J}}^{(b-1)}=\sum_{j=1}^{b-1}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)}) denote the aggregated information matrix, where \bm{J}^{(j)}(\bm{\beta})=-\partial U^{(j)}(\bm{\beta})/\partial\bm{\beta}. The approximation in (3) then becomes

\begin{split}\widetilde{\bm{U}}^{(b)}(\bm{\beta})&=\left\{\sum_{j=1}^{b-1}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\}(\widehat{\bm{\beta}}^{(b-1)}-\bm{\beta})+U^{(b)}(\bm{\beta})\\ &=\widetilde{\bm{J}}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-\bm{\beta})+U^{(b)}(\bm{\beta}).\end{split}

The aggregated gradient \widetilde{\bm{U}}^{(b)}(\bm{\beta}) depends on the historical data only through \{\widehat{\bm{\beta}}^{(b-1)},\widetilde{\bm{J}}^{(b-1)}\}. Hence, \widehat{\bm{\beta}}^{(b)} can be computed through the following two-step algorithm.

  • Step 1: update \widehat{\bm{\beta}}^{(b)} through

    \widehat{\bm{\beta}}^{(b)}\leftarrow\widehat{\bm{\beta}}^{(b)}+\frac{\eta}{2N_{b}}\widetilde{\bm{U}}^{(b)}(\widehat{\bm{\beta}}^{(b)}),

    where \eta is the learning rate of the gradient descent;

  • Step 2: for r=1,\dots,p, apply the soft-thresholding operator \mathcal{S}(\widehat{\beta}_{r}^{(b)};\eta\lambda_{b}) to the r-th component of \widehat{\bm{\beta}}^{(b)} obtained in Step 1:

    \mathcal{S}(\widehat{\beta}^{(b)}_{r};\eta\lambda_{b})=\begin{cases}\widehat{\beta}^{(b)}_{r}+\eta\lambda_{b},&\text{if }\widehat{\beta}^{(b)}_{r}<-\eta\lambda_{b},\\ 0,&\text{if }|\widehat{\beta}^{(b)}_{r}|\leq\eta\lambda_{b},\\ \widehat{\beta}^{(b)}_{r}-\eta\lambda_{b},&\text{if }\widehat{\beta}^{(b)}_{r}>\eta\lambda_{b}.\end{cases}

Note that \lambda_{b} is the regularization parameter chosen adaptively at step b. More details on the adaptive tuning procedure are presented in Section 2.3.

The above two steps are iterated until convergence to obtain the online lasso estimator \widehat{\bm{\beta}}^{(b)}. In our implementation, the stopping criterion is \|\eta\,\widetilde{\bm{U}}^{(b)}(\widehat{\bm{\beta}}^{(b)})/(2N_{b})\|_{2}\leq 10^{-6}. In summary, the proposed online estimator \widehat{\bm{\beta}}^{(b)} can be defined as

\widehat{\bm{\beta}}^{(b)}=\underset{\bm{\beta}\in\mathbb{R}^{p}}{\arg\min}\left[\frac{1}{2N_{b}}\left\{\sum_{i\in\mathcal{D}_{b}}d\left(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})\right)+\frac{1}{2}(\bm{\beta}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}(\bm{\beta}-\widehat{\bm{\beta}}^{(b-1)})\right\}+\lambda_{b}\lVert\bm{\beta}\rVert_{1}\right]. (4)

In contrast to the standard offline lasso estimator in (1), the proposed online estimator in (4) depends on the historical data only through the summary statistics \{\widehat{\bm{\beta}}^{(b-1)},\widetilde{\bm{J}}^{(b-1)}\} instead of \mathcal{D}_{b-1}^{\star}.
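To make the updating rule concrete, the following minimal sketch (in Python) implements one batch update of (4) for the logistic link used in our numerical studies. It is an illustration rather than our exact implementation: the function names are hypothetical, the score is written on the likelihood scale (constant factors of the deviance are absorbed into the learning rate), and the default constants are illustrative.

```python
import numpy as np

def soft_threshold(beta, thresh):
    """Componentwise soft-thresholding operator S(beta; thresh) from Step 2."""
    return np.sign(beta) * np.maximum(np.abs(beta) - thresh, 0.0)

def online_lasso_update(X_b, y_b, beta_prev, J_tilde_prev, N_prev,
                        lam, eta=0.005, tol=1e-6, max_iter=10000):
    """One batch update of the online lasso estimator in (4), assuming a logistic link.

    X_b, y_b     : covariates and responses of the current batch D_b
    beta_prev    : previous lasso estimate beta_hat^{(b-1)}
    J_tilde_prev : aggregated information matrix J_tilde^{(b-1)}
    N_prev       : cumulative sample size N_{b-1}
    lam          : regularization parameter lambda_b
    """
    n_b = X_b.shape[0]
    N_b = N_prev + n_b
    beta = beta_prev.copy()
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X_b @ beta))        # g(x_i' beta) for the logistic link
        U_batch = X_b.T @ (y_b - mu)                   # batch score U^{(b)}(beta), likelihood scale
        U_tilde = J_tilde_prev @ (beta_prev - beta) + U_batch  # approximate aggregated score
        step = (eta / N_b) * U_tilde                   # gradient step (Step 1)
        beta = soft_threshold(beta + step, eta * lam)  # soft-thresholding (Step 2)
        if np.linalg.norm(step) <= tol:
            break
    # update the aggregated information matrix with the new batch, J^{(b)}(beta_hat^{(b)})
    mu = 1.0 / (1.0 + np.exp(-X_b @ beta))
    w = mu * (1.0 - mu)                                # g'(x_i' beta_hat^{(b)})
    J_tilde_new = J_tilde_prev + (X_b * w[:, None]).T @ X_b
    return beta, J_tilde_new, N_b
```

The returned quantities \{\widehat{\bm{\beta}}^{(b)},\widetilde{\bm{J}}^{(b)},N_{b}\} are all that need to be stored for the next batch.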

2.2 Online debiased lasso

We now proceed to online statistical inference and the construction of confidence intervals for the r-th component of the regression parameter vector, r=1,\ldots,p. We first define a low-dimensional projection

\bm{\widehat{\gamma}}_{r}^{(b)}=\underset{\bm{\gamma}\in\mathbb{R}^{p-1}}{\arg\min}\left\{\frac{1}{2N_{b}}\left(\widetilde{\bm{J}}^{(b)}_{r,r}-2\widetilde{\bm{J}}^{(b)}_{r,-r}\bm{\gamma}+\bm{\gamma}^{\top}\widetilde{\bm{J}}_{-r,-r}^{(b)}\bm{\gamma}\right)+\lambda_{b}\|\bm{\gamma}\|_{1}\right\}, (5)

where N_{b} and \lambda_{b} are the same as in (4). Let \widehat{W}^{(j)}\in\mathbb{R}^{n_{j}\times n_{j}} be the diagonal matrix with diagonal elements \sqrt{g^{\prime}(\bm{X}^{(j)}\bm{\widehat{\beta}}^{(j)})} and let \widehat{\bm{X}}^{(j)}=\widehat{W}^{(j)}\bm{X}^{(j)} be the weighted design matrix. Then (5) can be recast as

\bm{\widehat{\gamma}}_{r}^{(b)}=\underset{\bm{\gamma}\in\mathbb{R}^{p-1}}{\arg\min}\left\{\frac{1}{2N_{b}}\sum_{j=1}^{b}\left\lVert\widehat{\bm{X}}^{(j)}_{\cdot r}-\widehat{\bm{X}}^{(j)}_{\cdot,-r}\bm{\gamma}\right\rVert_{2}^{2}+\lambda_{b}\|\bm{\gamma}\|_{1}\right\}. (6)

With this view, \bm{\widehat{\gamma}}_{r}^{(b)} can be computed in the same fashion as the online lasso estimator: (6) has the same form as (1) with the deviance function d(\cdot;\cdot) chosen as the squared error, so the online procedure of Section 2.1 applies directly. The required summary statistics are the sub-blocks (\widetilde{\bm{J}}^{(b)}_{r,r},\widetilde{\bm{J}}^{(b)}_{r,-r},\widetilde{\bm{J}}_{-r,-r}^{(b)}), which are already stored in \widetilde{\bm{J}}^{(b)}, and this yields a solution of (5). After obtaining \widehat{\bm{\gamma}}_{r}^{(b)} in (5), we compute

\widehat{\bm{z}}_{r}^{(b)}:=\bm{x}_{r}^{(b)}-\bm{X}_{-r}^{(b)}\widehat{\bm{\gamma}}_{r}^{(b)},\qquad \widehat{\tau}_{r}^{(b)}:=\widetilde{\bm{J}}_{r,r}^{(b)}-\widetilde{\bm{J}}_{r,-r}^{(b)}\widehat{\bm{\gamma}}_{r}^{(b)}.

Denote \widetilde{\bm{\gamma}}^{(j)}_{r}=(\widehat{\gamma}^{(j)}_{r,1},\ldots,-1,\ldots,\widehat{\gamma}^{(j)}_{r,p-1})^{\top}\in\mathbb{R}^{p}, whose r-th element is -1. Then, upon the arrival of data batch \mathcal{D}_{b}, we define the proposed online debiased lasso estimator as

\widehat{\beta}^{(b)}_{\text{on},r}=\widehat{\beta}^{(b)}_{r}+\frac{1}{\widehat{\tau}_{r}^{(b)}}\left[\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\bm{\widehat{\beta}}^{(j)})\right\}+\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\{\widehat{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(j)}\}\right]
\equiv\widehat{\beta}^{(b)}_{r}+\text{debiasing term}+\text{online error correction term}. (7)

It is worth noting that the debiased lasso estimator involves the initial lasso estimator defined in (4), as well as two additional terms: a debiasing term and an online error correction term. van de Geer et al. (2014) studied the offline version of the debiased lasso for GLMs, and our debiasing term can be viewed as an online generalization of its offline counterpart. However, the two are rather different in the sense that the debiasing term alone does not suffice to ensure asymptotic normality in the online setting. Since \widetilde{\bm{J}}^{(b)} is used to approximate the information matrix \bm{J}(\bm{\beta}^{0}), the approximation error accumulates even though each \widehat{\bm{\beta}}^{(j)}, j=1,\ldots,b, is consistent for \bm{\beta}^{0}. The correction term in (7) eliminates the approximation error arising from the online update.

Meanwhile, the proposed debiased lasso estimator with the online error correction term aligns with the online learning framework, as (7) requires only the following summary statistics rather than the entire dataset \mathcal{D}_{b}^{\star}:

s_{1}^{(b)}=\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\widehat{\bm{\beta}}^{(j)})\right\},\quad s_{2}^{(b)}=\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\widehat{\bm{\beta}}^{(j)},\quad \bm{S}^{(b)}=\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)}), (8)

which keep the same size when new data arrive, and can be easily updated.

The asymptotic normality of the online debiased lasso estimator and the consistency of the two lasso-type estimators in (4) and (5) are established in Section 3. For variance estimation, let

v^{(b)}=\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\widehat{\bm{\beta}}^{(j)})\right\}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\widehat{\bm{\beta}}^{(j)})\right\}^{\top}\bm{\widehat{z}}^{(j)}_{r}. (9)

The estimated standard error \widehat{\sigma}^{(b)}_{r}:=\sqrt{v^{(b)}}/\widehat{\tau}_{r}^{(b)} can also be updated online accordingly.
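The following minimal sketch (in Python, assuming a logistic link) illustrates how the summary statistics in (8)-(9) can be accumulated batch by batch and then combined into the debiased estimate (7) and its standard error. The function names are hypothetical, and \widehat{\tau}_{r}^{(b)} is assumed to be computed from the stored \widetilde{\bm{J}}^{(b)} and \widehat{\bm{\gamma}}_{r}^{(b)} as defined above.

```python
import numpy as np

def update_debias_summaries(X_b, y_b, beta_b, gamma_r_b, r, s1, s2, S, v):
    """Accumulate the summary statistics in (8) and (9) for coordinate r,
    using the current batch D_b and the batch-level estimates (logistic link assumed)."""
    mu = 1.0 / (1.0 + np.exp(-X_b @ beta_b))                  # g(X^{(b)} beta_hat^{(b)})
    resid = y_b - mu
    z_r = X_b[:, r] - np.delete(X_b, r, axis=1) @ gamma_r_b   # low-dimensional projection z_hat_r^{(b)}
    gamma_tilde = np.insert(gamma_r_b, r, -1.0)               # gamma_tilde_r^{(b)} with -1 in position r
    w = mu * (1.0 - mu)
    J_b = (X_b * w[:, None]).T @ X_b                          # batch information J^{(b)}(beta_hat^{(b)})
    s1 += z_r @ resid                                         # contribution to s_1^{(b)}
    s2 += gamma_tilde @ (J_b @ beta_b)                        # contribution to s_2^{(b)}
    S = S + gamma_tilde @ J_b                                 # contribution to S^{(b)} (length-p vector)
    v += (z_r @ resid) ** 2                                   # contribution to v^{(b)} in (9)
    return s1, s2, S, v

def debiased_estimate(beta_b, r, tau_r, s1, s2, S, v):
    """Online debiased lasso estimate (7) and its standard error for coordinate r,
    with tau_r = J_tilde_{r,r}^{(b)} - J_tilde_{r,-r}^{(b)} gamma_hat_r^{(b)}."""
    beta_on_r = beta_b[r] + (s1 + S @ beta_b - s2) / tau_r    # debiasing + error correction terms
    se_r = np.sqrt(v) / tau_r                                 # sigma_hat_r^{(b)}
    return beta_on_r, se_r
```

Here the error correction term is evaluated as \bm{S}^{(b)}\widehat{\bm{\beta}}^{(b)}-s_{2}^{(b)}, which is exactly how the stored statistics in (8) enter (7).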

2.3 Adaptive tuning

In an offline setting, the regularization parameter \lambda is typically determined by cross-validation, where the entire dataset is split into training and test sets multiple times. Since the full dataset is not accessible in an online setting, such a procedure is not feasible. To align with the nature of streaming data, we use the “rolling-original-recalibration” procedure with the mean squared prediction error (MSE) as the cross-validation criterion (Han et al., 2021). Specifically, at time point b, the cumulative dataset up to time point b-1 serves as the training set, while the new data batch \mathcal{D}_{b} is the test set. Instead of re-accessing the raw data \{\mathcal{D}_{1},\dots,\mathcal{D}_{b-1}\}, we plug in \widehat{\bm{\beta}}^{(b-1)} to evaluate the prediction error for each \lambda in a candidate set T_{\lambda}:

PE_{b}(\lambda)=n_{b}^{-1}\|\bm{y}^{(b)}-g(\bm{X}^{(b)}\widehat{\bm{\beta}}^{(b-1)}(\lambda))\|^{2}_{2},\quad \lambda\in T_{\lambda}, (10)

and choose \lambda_{b}:=\underset{\lambda\in T_{\lambda}}{\arg\min}\ PE_{b}(\lambda). The initial \lambda_{1} is selected by classical offline cross-validation.
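A minimal sketch of this tuning rule is given below (in Python, assuming a logistic link); the candidate grid T_{\lambda} and the map from \lambda to the stored estimate \widehat{\bm{\beta}}^{(b-1)}(\lambda) are taken as given, and the function name is illustrative.

```python
import numpy as np

def choose_lambda(X_new, y_new, beta_prev_by_lambda, lambda_grid):
    """Rolling-original-recalibration tuning as in (10): the newly arrived batch D_b
    serves as the test set for the estimates beta_hat^{(b-1)}(lambda) fitted so far."""
    pe = []
    for lam in lambda_grid:
        mu = 1.0 / (1.0 + np.exp(-X_new @ beta_prev_by_lambda[lam]))  # g(X^{(b)} beta_hat^{(b-1)}(lambda))
        pe.append(np.mean((y_new - mu) ** 2))                         # PE_b(lambda)
    return lambda_grid[int(np.argmin(pe))]                            # lambda_b
```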

2.4 Algorithm

We summarize the results of Sections 2.1-2.3 in Figure 1 and Algorithm 1. The algorithm consists of two main blocks: online lasso estimation and online low-dimensional projection. Outputs from both blocks are used to compute the online debiased lasso estimator and to construct confidence intervals in real time. In particular, when a new data batch \mathcal{D}_{b} arrives, it is first sent to the online lasso estimation block, where the summary statistics \{\widehat{\bm{\beta}}^{(b-1)},\widetilde{\bm{J}}^{(b-1)}\} are used to compute \widetilde{\bm{U}}^{(b)}. We then use gradient descent to update the lasso estimator \widehat{\bm{\beta}}^{(b-1)} to \widehat{\bm{\beta}}^{(b)} over a sequence of tuning parameter values without retrieving the whole dataset. At the same time, regarding the cumulative dataset that produced the previous lasso estimate \widehat{\bm{\beta}}^{(b-1)} as the training set and the newly arrived \mathcal{D}_{b} as the test set, we choose the tuning parameter \lambda_{b} that gives the smallest prediction error. The selected \lambda_{b} and sub-matrices of \widetilde{\bm{J}}^{(b)} are then passed to the low-dimensional projection block to compute \widehat{\bm{\gamma}}_{r}^{(b)}(\lambda_{b}). The resulting projection \widehat{\bm{z}}_{r}^{(b)} and \widehat{\tau}_{r}^{(b)}, together with the lasso estimator \widehat{\bm{\beta}}^{(b)}, are used to compute the debiased lasso estimator \widehat{\beta}^{(b)}_{\text{on},r} and its estimated standard error \widehat{\sigma}_{r}^{(b)}.

Figure 1: Flowchart of the online debiasing algorithm. When a new data batch \mathcal{D}_{b} arrives, it is sent to the lasso estimation block for updating \widehat{\bm{\beta}}^{(b-1)} to \widehat{\bm{\beta}}^{(b)}. At the same time, it is also used as the test set for adaptively choosing the tuning parameter \lambda_{b}. In the low-dimensional projection block, we extract sub-matrices from the updated information matrix \widetilde{\bm{J}}^{(b)} to compute \widehat{\bm{\gamma}}_{r}^{(b)}(\lambda_{b}) and the corresponding low-dimensional projection \widehat{\bm{z}}_{r}^{(b)} and \widehat{\tau}_{r}^{(b)}. The outputs \widehat{\beta}_{r}^{(b)}(\lambda_{b}), \widehat{\bm{z}}_{r}^{(b)} and \widehat{\tau}_{r}^{(b)} are further used to compute the debiased lasso estimator \widehat{\beta}_{\text{on},r}^{(b)} and its estimated standard error \widehat{\sigma}_{r}^{(b)}.
Algorithm 1 Online Debiased Lasso Algorithm for GLMs
Input: learning rate \eta; a set T_{\lambda} of candidate values for the penalty parameter.
Output: debiased lasso estimator \widehat{\beta}_{\text{on},r}^{(b)} and its estimated standard error \widehat{\sigma}^{(b)}_{r}.
for b=1,2,\ldots do
     Receive the streaming data batch \mathcal{D}_{b};
     For each \lambda\in T_{\lambda}, update the lasso estimator \widehat{\bm{\beta}}^{(b)}(\lambda) defined in (4) via gradient descent (GD);
     Determine \lambda_{b} by the minimum MSE, \lambda_{b}:=\underset{\lambda\in T_{\lambda}}{\arg\min}\ PE_{b}(\lambda) defined in (10);
     Update and store the summary statistics \{\widehat{\bm{\beta}}^{(b)},\widetilde{\bm{J}}^{(b)}\};
     Given \lambda_{b}, find the low-dimensional projection \widehat{\bm{\gamma}}_{r}^{(b)} defined in (5) via GD;
     Update and store the summary statistics s_{1}^{(b)}, s_{2}^{(b)} and \bm{S}^{(b)} by (8);
     Compute \widehat{\beta}_{\text{on},r}^{(b)} by (7) and \widehat{\sigma}^{(b)}_{r} by (9);
end for
return \widehat{\beta}_{\text{on},r}^{(b)} and its estimated standard error \widehat{\sigma}^{(b)}_{r}.
Remark 1.

Algorithm 1 computes the estimators in chronological order, namely
(\widehat{\bm{\beta}}^{(1)},\widehat{\bm{\gamma}}_{r}^{(1)},\widehat{\beta}_{\text{on},r}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b)},\widehat{\bm{\gamma}}_{r}^{(b)},\widehat{\beta}_{\text{on},r}^{(b)}). Since \widetilde{\bm{J}}^{(b-1)} involves \widehat{\bm{\beta}}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b-1)}, there is dependence among the estimators. For instance, both \widehat{\bm{\gamma}}_{r}^{(b-1)} and \widehat{\bm{\beta}}^{(b)} depend on the previous lasso estimates (\widehat{\bm{\beta}}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b-1)}).

Remark 2.

When p is large, it may be challenging to implement Algorithm 1, since the space complexity of storing the aggregated information matrix \widetilde{\bm{J}}^{(b)} is \mathcal{O}(p^{2}). To reduce memory usage, we can compute the eigenvalue decomposition (EVD) \widetilde{\bm{J}}^{(b)}=Q_{b}\Lambda_{b}Q_{b}^{\top}, where Q_{b} is the p\times N_{b} column-orthogonal matrix of eigenvectors and \Lambda_{b} is the N_{b}\times N_{b} diagonal matrix whose diagonal elements are the eigenvalues of \widetilde{\bm{J}}^{(b)}; we then only need to store Q_{b} and \Lambda_{b}. Since r_{b}=\mathrm{rank}(\Lambda_{b})\leq\min\{N_{b},p\}, we can use an incremental EVD approach (Cardot & Degras, 2018) to update Q_{b} and \Lambda_{b}, so the space complexity reduces to \mathcal{O}(r_{b}p). The space complexity can be further reduced by thresholding, for example, by keeping only the principal components that explain most of the variation in the predictors. However, the incremental EVD increases the computational cost, as it requires an additional \mathcal{O}(r_{b}^{2}p) operations. There is thus a trade-off between space complexity and computational complexity; how to balance this trade-off is an important computational issue that deserves careful analysis but is beyond the scope of this study.
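As a minimal sketch of this idea, assuming the information contribution of a new batch can be written as \widehat{\bm{X}}^{\top}\widehat{\bm{X}} for a weighted batch design \widehat{\bm{X}} (as in Section 2.2), the thin eigendecomposition of \widetilde{\bm{J}}^{(b)} can be refreshed from an SVD of a p\times(r_{b}+n_{b}) factor, so that no p\times p matrix is ever formed; the function name and the truncation rule below are illustrative.

```python
import numpy as np

def incremental_evd(Q, lam, Xw_new, tol=1e-10):
    """Update the thin eigendecomposition J_tilde = Q diag(lam) Q^T after a new batch.

    Q      : p x r matrix of eigenvectors (r = current rank r_b)
    lam    : length-r array of eigenvalues
    Xw_new : n_b x p weighted batch design, whose information contribution is Xw_new^T Xw_new
    """
    # J_new = Q diag(lam) Q^T + Xw^T Xw = B B^T, with B = [Q diag(sqrt(lam)), Xw^T] of size p x (r + n_b)
    B = np.hstack([Q * np.sqrt(lam), Xw_new.T])
    U, s, _ = np.linalg.svd(B, full_matrices=False)   # O((r + n_b)^2 p) cost; no p x p matrix formed
    keep = s**2 > tol                                  # optionally drop negligible eigenvalues
    return U[:, keep], (s**2)[keep]
```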

3 Asymptotic properties

In this section, we state our main theoretical results: the consistency of the lasso estimators \widehat{\bm{\beta}}^{(b)} and \bm{\widehat{\gamma}}_{r}^{(b)} defined in (4) and (5), respectively, as well as the asymptotic normality of the online debiased estimator \widehat{\beta}^{(b)}_{\text{on},r}.

Before stating the main results, we introduce some notation. Recall that \bm{\beta}^{0} is the true coefficient vector. Consider a random design matrix \bm{X} with i.i.d. rows. Let \bm{J} denote the population version of the information matrix in the GLM, i.e., \bm{J}=\mathbb{E}[\bm{J}^{(1)}(\bm{\beta}^{0})]/N_{1}, and let \bm{\Theta}=\bm{J}^{-1} be its inverse. Then, the ground truth of \bm{\widehat{\gamma}}_{r}^{(b)} given in (5) is defined as

\bm{\gamma}_{r}^{0}=\underset{\bm{\gamma}\in\mathbb{R}^{p-1}}{\arg\min}\ \mathbb{E}\left({\bm{J}}_{r,r}-2{\bm{J}}_{r,-r}\bm{\gamma}+\bm{\gamma}^{\top}{\bm{J}}_{-r,-r}\bm{\gamma}\right).

Recall that S_{0}:=\{r:\beta^{0}_{r}\neq 0\} and s_{0}=|S_{0}|. In addition, let S_{r}=\{k\neq r:\bm{\Theta}_{k,r}\neq 0\} and s_{r}=|S_{r}| for r=1,\ldots,p. The following assumptions are needed to establish the consistency of the lasso estimators \widehat{\bm{\beta}}^{(b)} and \widehat{\bm{\gamma}}_{r}^{(b)} defined in (4) and (5), respectively.

Assumption 1.

Suppose that

(A1) The pairs of random variables \{y_{i},\bm{x}_{i}\}_{i\in\mathcal{D}_{b}^{\star}} are i.i.d. The covariates are bounded by some finite constant K>0, i.e., \sup_{i\in\mathcal{D}_{b}^{\star}}\lVert\bm{x}_{i}\rVert_{\infty}\leq K.

(A2) \sup_{i\in\mathcal{D}_{b}^{\star}}|\bm{x}_{i}^{\top}\bm{\beta}^{0}|=\mathcal{O}(1) and \sup_{i\in\mathcal{D}_{b}^{\star}}|(\bm{x}_{i})_{-r}^{\top}\bm{\gamma}_{r}^{0}|=\mathcal{O}(1), where (\bm{x}_{i})_{-r} is the sub-vector of \bm{x}_{i} with the r-th element deleted.

(A3) The derivative g^{\prime}(\cdot) exists. For some \delta-neighborhood (\delta>0), g^{\prime}(\cdot) is locally Lipschitz with constant L, that is,

\sup_{i\in\mathcal{D}_{b}^{\star}}\ \sup_{\bm{\beta},\bm{\beta}^{\prime}\in\{\bm{\beta}:\lVert\bm{\beta}-\bm{\beta}^{0}\rVert_{1}\leq\delta\}}\frac{|g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})-g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta}^{\prime})|}{|\bm{x}_{i}^{\top}\bm{\beta}-\bm{x}_{i}^{\top}\bm{\beta}^{\prime}|}\leq L.

In addition, for all \lVert\bm{\beta}-\bm{\beta}^{0}\rVert_{1}\leq\delta, \sup_{i\in\mathcal{D}_{b}^{\star}}|g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})|=\mathcal{O}(1) and \sup_{i\in\mathcal{D}_{b}^{\star}}|1/g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})|=\mathcal{O}(1).

(A4) The smallest eigenvalue of \bm{J} is bounded away from zero.

The above assumptions are regular conditions in the study of lasso properties (van de Geer, 2008; van de Geer et al., 2014). (A1) assumes the streaming data are homogeneous and the covariates are sampled from a bounded distribution. (A2) is imposed to rule out extreme cases. (A3) requires local Lipschitz continuity of the derivative of the mean function around the true value; it is easily verified that logistic regression, a special case of GLMs, satisfies this condition. (A4) is needed to ensure that the compatibility condition holds.

Theorem 1.

Assume Assumption 1 holds. Suppose that the first batch size satisfies n_{1}\geq cs_{0}\log p for some constant c, that b=o(\log N_{b}), and that the tuning parameters are \lambda_{j}=C\sqrt{\log p/N_{j}}, j=1,\ldots,b, for some constant C. Then, for any j=1,\ldots,b, with probability at least 1-p^{-2}, the proposed online estimator in (4) satisfies

\lVert\widehat{\bm{\beta}}^{(j)}-\bm{\beta}^{0}\rVert_{1}\leq c_{1}^{(j)}s_{0}\lambda_{j},\qquad \lVert\bm{X}^{(j)}_{\star}(\widehat{\bm{\beta}}^{(j)}-\bm{\beta}^{0})\rVert^{2}_{2}\leq c_{2}^{(j)}s_{0}N_{j}\lambda_{j}^{2}. (11)
Remark 3.

Theorem 1 provides upper bounds on the estimation error and the prediction error of the online lasso estimator \widehat{\bm{\beta}}^{(b)}. The constants c_{1}^{(j)} and c_{2}^{(j)}, j=1,\ldots,b, possibly depend on the batch step b. Recall that the lasso estimator \widehat{\bm{\beta}}^{(b)} depends on \widehat{\bm{\beta}}^{(b-1)}, so the estimation error of the previous step is carried onto the updated estimator. As a result, it is inevitable that some constants in the oracle inequality depend on b; nonetheless, they remain well under control as long as b=o(\log N_{b}).

The next corollary shows the consistency of the proposed online lasso estimator in (4).

Corollary 1.

Assume the conditions of Theorem 1 hold. If there exists an \epsilon>0 such that s_{0}^{2}\log(p)N_{b}^{-1+\epsilon}=o(1), then the lasso estimator in (4) satisfies

\lVert\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}\rightarrow_{p}0\ \text{as}\ N_{b}\to\infty,

where \rightarrow_{p} denotes convergence in probability.

Similarly, we present the oracle inequality for \widehat{\bm{\gamma}}_{r}^{(b)} in the next theorem.

Theorem 2.

Assume Assumption 1 holds and that the cumulative batch sizes satisfy N_{j}\geq cc^{(j)}_{1}s_{0}^{2}s_{r}^{2}\log p, j=1,\ldots,b, for some constant c. Then, for any j=1,\ldots,b, with probability at least 1-p^{-2}, the low-dimensional projection defined in (5) with \lambda_{j}=c\sqrt{\log p/N_{j}} satisfies

\lVert\widehat{\bm{\gamma}}_{r}^{(j)}-\bm{\gamma}_{r}^{0}\rVert_{1}\leq c_{3}s_{r}\lambda_{j}. (12)

Combining the results in Theorem 1 and Theorem 2, we are ready to establish the asymptotic normality of the ODL estimator.

Theorem 3.

Assume the conditions of Theorems 1 and 2 hold. If there exists an \epsilon>0 such that

s_{0}^{2}\log(p)\log(N_{b})N_{b}^{-\frac{1}{2}+\epsilon}=o(1),\qquad s_{0}s_{r}\log(p)N_{b}^{-\frac{1}{2}+\epsilon}=o(1),

then for sufficiently large NbN_{b},

\frac{\widehat{\tau}_{r}^{(b)}}{\sqrt{N_{b}}}(\widehat{\beta}^{(b)}_{\text{on},r}-\beta_{r}^{0})=W_{r}+V_{r},\qquad W_{r}=\frac{1}{\sqrt{N_{b}}}\sum_{j=1}^{b}\{\widehat{\bm{z}}^{(j)}_{r}\}^{\top}\left(\bm{y}^{(j)}-g(\bm{X}^{(j)}\bm{\beta}^{0})\right),\qquad V_{r}=o_{p}(1).

According to Theorem 3, the quantity \widehat{\tau}_{r}^{(b)}(\widehat{\beta}^{(b)}_{\text{on},r}-\beta_{r}^{0})/\sqrt{N_{b}} decomposes as the sum of W_{r} and V_{r}, where W_{r} converges in distribution to a normal distribution by the martingale central limit theorem and V_{r} vanishes as N_{b} goes to infinity.
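In particular, combined with the variance estimator v^{(b)} in (9), Theorem 3 suggests a Wald-type (1-\alpha) confidence interval of the form

\widehat{\beta}^{(b)}_{\text{on},r}\ \pm\ z_{1-\alpha/2}\,\widehat{\sigma}^{(b)}_{r},\qquad \widehat{\sigma}^{(b)}_{r}=\frac{\sqrt{v^{(b)}}}{\widehat{\tau}^{(b)}_{r}},

where z_{1-\alpha/2} is the standard normal quantile; intervals of this form with \alpha=0.05 correspond to the 95% confidence intervals evaluated in the numerical studies below.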

Remark 4.

Theorem 3 implies that the total data size N_{b} could be as small as the logarithm of the dimensionality p, which is a common condition for the offline debiased lasso in the literature (Zhang & Zhang, 2014; van de Geer et al., 2014). However, the setting of the offline debiased lasso is quite different from the online framework considered in this paper. Due to the lack of access to the whole dataset, it is much more difficult to derive the asymptotic properties of the online debiased lasso. One major difficulty is that we approximate the information matrix by \widetilde{\bm{J}}^{(b)}, which involves estimators obtained in previous steps; as a result, there is dependence among \widehat{\bm{\beta}}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b-1)},\widehat{\bm{\beta}}^{(b)}. Another difficulty is to handle the approximation error that accumulates in the online updating, especially in high-dimensional settings. In contrast, the classical offline debiased lasso does not face these two problems. Even for the online debiased lasso in the linear model (Han et al., 2021), these two problems can be bypassed by exploiting the special structure of least squares.

4 Simulation studies

4.1 Setup

In this section, we conduct simulation studies to examine the finite-sample performance of the proposed ODL procedure in high-dimensional GLMs. We randomly generate a total of N_{b} samples arriving in a sequence of b data batches, denoted by \{\mathcal{D}_{1},\dots,\mathcal{D}_{b}\}, from the logistic regression model. Specifically,

\mathbb{E}(y^{(j)}_{i}\mid\bm{x}_{i}^{(j)})=\frac{\exp(\{\bm{x}_{i}^{(j)}\}^{\top}\bm{\beta}^{0})}{1+\exp(\{\bm{x}_{i}^{(j)}\}^{\top}\bm{\beta}^{0})},\quad i=1,\dots,n_{j};\ j=1,\dots,b,

where \bm{x}_{i}^{(j)}\sim\mathcal{N}(\bm{0},\bm{\Sigma}) and \bm{\beta}^{0}\in\mathbb{R}^{p} is a p-dimensional sparse parameter vector. Recall that s_{0} is the number of nonzero components of \bm{\beta}^{0}. We set half of the nonzero coefficients to 1 (relatively strong signals) and the other half to 0.01 (weak signals). We consider the following settings: (i) N_{b}=120, b=12, n_{j}=10 for j=1,\ldots,12, p=100 and s_{0}=6; (ii) N_{b}=624, b=12, n_{j}=52 for j=1,\ldots,12, p=600 and s_{0}=10. Under each setting, two types of \bm{\Sigma} are considered: (a) \bm{\Sigma}=\bm{I}_{p}; (b) \bm{\Sigma}=\{0.5^{|i-j|}\}_{i,j=1,\dots,p}. We set the step size in gradient descent to \eta=0.005.
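A minimal sketch of this data-generating scheme is given below (in Python); the placement of the nonzero coefficients in the first s_{0} coordinates is an illustrative choice, and the function name is hypothetical.

```python
import numpy as np

def generate_stream(Nb, b, p, s0, ar_rho=None, seed=0):
    """Generate b logistic-regression data batches as in Section 4.1.

    Half of the s0 nonzero coefficients are 1 (strong) and half are 0.01 (weak);
    covariates are N(0, Sigma) with Sigma = I_p or Sigma_ij = ar_rho^{|i-j|}.
    """
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[: s0 // 2] = 1.0                 # strong signals
    beta0[s0 // 2 : s0] = 0.01             # weak signals
    if ar_rho is None:
        Sigma = np.eye(p)
    else:
        idx = np.arange(p)
        Sigma = ar_rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)
    n_j = Nb // b                          # equal batch sizes as in settings (i) and (ii)
    batches = []
    for _ in range(b):
        X = rng.standard_normal((n_j, p)) @ L.T
        prob = 1.0 / (1.0 + np.exp(-X @ beta0))
        y = rng.binomial(1, prob)
        batches.append((X, y))
    return batches, beta0
```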

The objective is to conduct both estimation and inference along the arrival of the sequence of data batches. The evaluation criteria include: the averaged absolute bias in estimating \bm{\beta}^{0} (A.bias); the averaged estimated standard error (ASE); the empirical standard error (ESE); the coverage probability (CP) of the 95% confidence intervals; and the averaged length of the 95% confidence intervals (ACL). These metrics are evaluated separately for three groups: (i) \beta^{0}_{r}=0, (ii) \beta^{0}_{r}=0.01 and (iii) \beta^{0}_{r}=1. Comparison is made among (i) the maximum likelihood estimator obtained by fitting the conventional GLM at the terminal point b where N_{b}>p, (ii) the offline debiased \ell_{1}-penalized estimator at the terminal point b, which serves as the benchmark, and (iii) the proposed ODL estimator at several intermediate points j=1,\dots,b. The two offline methods are implemented using the R package hdi (Dezeure et al., 2015) and the R function glm, respectively. The results are reported in Tables 1-4.
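For concreteness, a sketch of how these metrics can be computed from the replications is given below; the exact aggregation (averaging within each coefficient group) is our own formalization of the criteria described above, and the function name is illustrative.

```python
import numpy as np

def summarize(est, se, beta0, group_value, z=1.96):
    """Summarize replications for the coordinates whose true value equals `group_value`.

    est, se : arrays of shape (n_rep, p) with point estimates and estimated standard errors
    beta0   : true coefficient vector of length p
    """
    cols = np.where(beta0 == group_value)[0]
    e, s = est[:, cols], se[:, cols]
    lower, upper = e - z * s, e + z * s
    return {
        "A.bias": np.mean(np.abs(e - beta0[cols])),                       # averaged absolute bias
        "ASE": np.mean(s),                                                # averaged estimated standard error
        "ESE": np.mean(np.std(e, axis=0, ddof=1)),                        # empirical standard error
        "CP": np.mean((lower <= beta0[cols]) & (beta0[cols] <= upper)),   # 95% CI coverage probability
        "ACL": np.mean(upper - lower),                                    # averaged CI length
    }
```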

4.2 Bias and coverage probability

It can be seen from Tables 1-4 that the estimation bias of the ODL estimator decreases rapidly as the number of data batches b increases from 2 to 12. Both the estimated standard errors and the averaged lengths of the 95% confidence intervals exhibit a similar decreasing trend over time, and almost coincide with those of the offline benchmark method at the terminal point. Moreover, ODL enjoys great computational efficiency: it is almost 16 times faster than its offline counterpart.

It is worth noting that at the terminal point, where the cumulative sample size N_{b} is only slightly larger than p, we can still fit the conventional GLM to obtain the MLE. However, the MLE fails to provide reliable coverage probabilities due to severely large biases and estimated standard errors. In particular, the estimation bias of the MLE is about 50 times that of the offline or online debiased lasso when p=100, as shown in Tables 1 and 2, and increases to about 80 times that of the debiased estimators when p=600. Furthermore, as clearly indicated by the large empirical standard errors, the MLE in this setting suffers from severe instability. Such invalid estimation and inference results for the MLE further demonstrate the advantage of the proposed online debiased method under the high-dimensional sparse logistic regression setting with streaming data.

Table 1: N_{b}=120, b=12, p=100, s_{0}=6, \bm{\Sigma}=\bm{I}_{p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.05 0.05 0.05 10^{-4} 10^{-4}
A.bias 0 2.290 0.035 0.080 0.056 0.049 0.038 0.036 0.033
0.01 1.150 0.038 0.131 0.036 0.035 0.033 0.020 0.034
1 17.87 0.057 0.176 0.125 0.125 0.125 0.120 0.109
ASE 0 1.78\times 10^{6} 0.595 1.336 0.949 0.777 0.679 0.609 0.559
0.01 1.79\times 10^{6} 0.591 1.323 0.939 0.771 0.674 0.604 0.555
1 1.83\times 10^{6} 0.598 1.341 0.950 0.778 0.680 0.609 0.561
ESE 0 34.90 0.591 1.367 0.962 0.786 0.681 0.611 0.559
0.01 36.25 0.581 1.341 0.996 0.792 0.686 0.597 0.553
1 33.49 0.585 1.327 0.903 0.755 0.664 0.593 0.548
CP 0 1.000 0.953 0.951 0.949 0.948 0.949 0.950 0.950
0.01 1.000 0.960 0.943 0.938 0.943 0.947 0.960 0.953
1 1.000 0.965 0.955 0.957 0.962 0.953 0.942 0.953
ACL 0 6.96\times 10^{6} 2.487 5.239 3.721 3.047 2.663 2.385 2.190
0.01 7.02\times 10^{6} 2.478 5.187 3.681 3.022 2.641 2.366 2.176
1 7.18\times 10^{6} 2.479 5.258 3.725 3.051 2.666 2.387 2.197
C.Time (s) 0.03 29.97 2.026
Table 2: N_{b}=120, b=12, p=100, s_{0}=6, \bm{\Sigma}=\{0.5^{|i-j|}\}_{i,j=1,\dots,p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.05 0.05 0.05 10^{-4} 10^{-4}
A.bias 0 2.467 0.053 0.119 0.093 0.084 0.078 0.073 0.068
0.01 1.546 0.048 0.138 0.070 0.093 0.068 0.065 0.064
1 17.54 0.026 0.122 0.122 0.113 0.087 0.095 0.084
ASE 0 2.45\times 10^{6} 0.674 1.341 0.955 0.781 0.680 0.614 0.567
0.01 2.43\times 10^{6} 0.673 1.332 0.948 0.780 0.679 0.612 0.565
1 2.36\times 10^{6} 0.667 1.344 0.962 0.783 0.682 0.616 0.567
ESE 0 45.93 0.671 1.358 0.953 0.781 0.679 0.608 0.556
0.01 45.47 0.667 1.387 0.935 0.765 0.644 0.595 0.557
1 45.87 0.657 1.334 0.940 0.774 0.659 0.592 0.544
CP 0 1.000 0.951 0.951 0.951 0.948 0.947 0.947 0.949
0.01 1.000 0.950 0.945 0.960 0.953 0.958 0.950 0.947
1 1.000 0.963 0.953 0.960 0.945 0.947 0.960 0.955
ACL 0 9.59\times 10^{6} 2.643 5.258 3.745 3.063 2.667 2.406 2.222
0.01 9.51\times 10^{6} 2.640 5.222 3.717 3.057 2.661 2.398 2.214
1 9.26\times 10^{6} 2.614 5.267 3.773 3.068 2.672 2.414 2.222
C.Time (s) 0.03 29.02 2.166
Table 3: N_{b}=624, b=12, p=600, s_{0}=10, \bm{\Sigma}=\bm{I}_{p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.01 0.01 0.01 10^{-4} 10^{-4}
A.bias 0 1.101 0.014 0.034 0.024 0.020 0.018 0.015 0.014
0.01 1.553 0.011 0.038 0.014 0.012 0.009 0.009 0.010
1 11.99 0.069 0.168 0.152 0.152 0.141 0.131 0.121
ASE 0 1.82\times 10^{6} 0.259 0.573 0.412 0.341 0.297 0.268 0.245
0.01 1.80\times 10^{6} 0.259 0.572 0.411 0.340 0.297 0.268 0.246
1 1.81\times 10^{6} 0.261 0.573 0.411 0.341 0.297 0.268 0.246
ESE 0 19.94 0.254 0.578 0.041 0.034 0.030 0.028 0.025
0.01 19.72 0.257 0.572 0.397 0.325 0.287 0.264 0.251
1 20.37 0.262 0.579 0.407 0.334 0.297 0.262 0.244
CP 0 1.000 0.955 0.947 0.947 0.949 0.948 0.950 0.950
0.01 1.000 0.953 0.945 0.959 0.954 0.950 0.956 0.947
1 1.000 0.943 0.933 0.935 0.933 0.925 0.927 0.928
ACL 0 7.15\times 10^{6} 1.016 2.246 1.613 1.335 1.163 1.050 0.962
0.01 7.07\times 10^{6} 1.016 2.241 1.610 1.334 1.164 1.050 0.962
1 7.09\times 10^{6} 1.022 2.246 1.613 1.335 1.164 1.050 0.963
C.Time (min) 0.05 117.5 7.08
Table 4: N_{b}=624, b=12, p=600, s_{0}=10, \bm{\Sigma}=\{0.5^{|i-j|}\}_{i,j=1,\dots,p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.05 10^{-4} 10^{-4} 10^{-4} 10^{-4}
A.bias 0 1.594 0.018 0.042 0.034 0.031 0.029 0.027 0.025
0.01 1.554 0.016 0.044 0.031 0.026 0.012 0.010 0.010
1 11.85 0.077 0.192 0.168 0.159 0.147 0.137 0.134
ASE 0 2.43\times 10^{6} 0.307 0.576 0.416 0.347 0.304 0.274 0.251
0.01 2.45\times 10^{6} 0.307 0.577 0.416 0.347 0.304 0.274 0.251
1 2.42\times 10^{6} 0.308 0.576 0.416 0.347 0.304 0.275 0.251
ESE 0 27.18 0.300 0.576 0.411 0.341 0.298 0.269 0.246
0.01 27.68 0.296 0.576 0.403 0.330 0.287 0.266 0.240
1 27.49 0.291 0.559 0.395 0.326 0.295 0.257 0.240
CP 0 1.000 0.955 0.947 0.949 0.949 0.949 0.948 0.947
0.01 1.000 0.957 0.949 0.958 0.951 0.958 0.951 0.956
1 1.000 0.951 0.946 0.939 0.929 0.925 0.931 0.926
ACL 0 9.51\times 10^{6} 1.202 2.259 1.631 1.361 1.193 1.076 0.982
0.01 9.59\times 10^{6} 1.203 2.261 1.632 1.361 1.193 1.076 0.983
1 9.48\times 10^{6} 1.206 2.257 1.629 1.361 1.193 1.077 0.983
C.Time (min) 0.06 112.3 6.99

5 Real data analysis

Graduated driver licensing (GDL) programs are designed to restrict the driving privileges of new drivers. They typically include restrictions such as nighttime, expressway, and unsupervised driving. To monitor the effectiveness of such programs in reducing fatal teen crashes, we apply the proposed ODL procedure to analyze streaming data from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS). Our primary interest is to evaluate the effectiveness of GDL, a nationwide legislation on novice drivers of age 21 or younger under various conditions of vehicle operation. In contrast, under the current policy there are no restrictions on vehicle operation for older drivers (say, older than 65). To assess the effect of driver's age on driving safety, we compared age groups with respect to the risk of a fatal crash when an accident happened. We first categorized the “Age” variable into three groups: “Age<21”, representing the young group under restricted graduated driver licensing; “Age≥65”, for the older group with a regular full driver's license; and those in between, treated as the reference group. The extent of injury in a crash is the binary outcome of interest, equal to 1 for a moderate or severe injury and 0 for a minor or no injury. This outcome variable was created from the variable Maximum Known Occupant AIS (MAIS), which indicates the single most severe injury level reported for each occupant. Other potential risk factors were also included in the model, such as seat belt use, alcohol, speed limit, vehicle weight, air bag deployment, and drug involvement in the accident. Some interaction terms were also included, such as driver's age group and alcohol, and age group and sex.

The streaming data consist of monthly accident records over the 7-year period from January 2009 to December 2015, giving b=84b=84 data batches and a total of Nb=2118N_{b}=2118 crashed vehicles with complete records on both the outcome and the predictors. The number of predictors is p=62p=62, including an intercept, 46 main effects and 15 interaction terms. Note that the first batch size n1=38<p=62n_{1}=38<p=62, and the smallest batch contains only 6 samples. We applied the proposed method to sequentially update the parameter estimates and standard errors of the regression coefficients in a penalized logistic regression model.

As shown in Figure 2, the 95% pointwise confidence bands over the 84 batches became narrower for the old group but not for the young one. This phenomenon clearly differentiates the significance levels of the two age groups with respect to the severity of injury. Starting at almost the same estimated risk, the estimated coefficient for the young age group gradually moves up and stays around 0 over the 84-month period. The trace plot ends at -0.06, meaning that the young age group has slightly lower adjusted odds of a fatal crash than the middle age group. This finding is consistent with reported results in the literature that GDL is an effective policy for protecting novice drivers from fatal injuries (Chen et al., 2014). In contrast, the trace plot for the older age group shows an upward trend around year 2013 and stabilizes at a positive value of 0.51 after 2014. This indicates that, as more data accumulate over time, the adjusted odds of fatality in a vehicle crash for the older group become significantly higher than for the middle age group. This may suggest a need to modify the policy on restricting vehicle operation for older drivers.

Figure 3 shows the trends of log10p-\log_{10}p over the 84 months for nine of the risk factors, where the pp-values are obtained from the Wald test constructed with the online debiased lasso estimator and its standard error. The trajectory of each covariate effect shows evidence against the null H0:βj0=0H_{0}:\beta^{0}_{j}=0 as data accumulate over time. The interaction term between “Age” and “Alcohol” turns out to have the strongest association with the odds of fatality in a crash among all covariates shown in this plot. This is an overwhelming confirmation of the finding that driving under the influence of alcohol is significantly associated with fatal crashes, even with graduated driver licensing in place to protect young drivers. It also explains why the main effect of the young age group does not show a significant association with the odds of injury while the interaction term does. Moreover, some spikes appear in the trajectories of most risk factors between 2010 and 2014, but the curves smooth out and stabilize thereafter. This might be related to small batch sizes or unbalanced categorical values in some monthly data batches. In addition, to characterize the overall significance level of each risk factor over the 84-month period, we propose a summary statistic defined as the area under the pp-value curve. These values are reported in the brackets after the name of each covariate in Figure 3. They provide a clear summary of the overall significance level in the presence of those spikes, and most of them are well aligned with the ranking of the pp-values at the end of 2015.
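The quantities plotted in Figure 3 and the proposed summary statistic can be computed from the stored estimates and standard errors alone. The sketch below is a minimal illustration; the two-sided form of the Wald test and the trapezoidal rule with monthly spacing for the area under the p-value curve are our assumptions about the exact conventions.

```python
import numpy as np
from scipy.stats import norm

def wald_pvalue_trace(beta_hat, se_hat):
    """Two-sided Wald p-values for one coefficient across the monthly updates.

    beta_hat, se_hat: online debiased estimates and standard errors
    recorded after each of the 84 data batches.
    """
    z = np.asarray(beta_hat, dtype=float) / np.asarray(se_hat, dtype=float)
    return 2.0 * norm.sf(np.abs(z))

def pvalue_auc(pvals):
    """Area under the p-value curve (trapezoidal rule, unit monthly spacing)."""
    p = np.asarray(pvals, dtype=float)
    return float(0.5 * np.sum(p[1:] + p[:-1]))

# -log10 p-values as displayed in Figure 3:
# neglog10_p = -np.log10(wald_pvalue_trace(beta_hat, se_hat))
```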

Applying the proposed online debiased lasso procedure to the above CDS data analysis enabled us to visualize time-course patterns of evidence accrual as well as the stability and reproducibility of inference. As clearly shown in Figures 2 and 3, at the early stage of the data stream, due to limited sample sizes and possibly sampling bias, both the parameter estimates and the test power may be unstable and even misleading. These potential shortcomings are convincingly overcome when the estimates and inferential quantities are continuously updated along the data stream, eventually reaching stability and yielding reliable conclusions.

Figure 2: Trace plots of the estimated regression coefficients and 95% confidence bands corresponding to the young and old age groups, respectively. Numbers on the two sides denote the estimated regression coefficients after the arrival of the first and last batches. The dashed line is the “0” reference line.
Figure 3: Trace plots of log10p-\log_{10}p over monthly data from January 2009 to December 2015, one for each risk factor. Numbers on the yy-axis are the negative base-10 logarithms of the pp-values obtained by the Wald test, and labels on the xx-axis correspond to the last month of each year. The values in the brackets next to the covariate names denote the areas under the pp-value curves.

6 Discussion

In this paper, we propose an online debiased lasso method for statistical inference in high-dimensional GLMs. The method is suited to streaming data in that only historical summary statistics, not the raw historical data, are needed to update the estimate at the current stage. Under regularity conditions similar to those in the offline setting and mild conditions on the batch sizes, we prove that the online debiased lasso estimator (with an online error correction term) is asymptotically normal. The numerical studies further demonstrate the effectiveness of our algorithm and support the theoretical results.

There are still many open questions in the area of online inference. First, the method we develop focuses on homogeneous data, where the streaming observations are assumed to be i.i.d. It would be interesting to determine whether similar results can be obtained in non-homogeneous settings. Second, the loss function considered in this paper is the negative log-likelihood. It is unclear whether other loss functions, including nonsmooth robust losses such as Huber’s loss (Huber, 1964), could be used for online inference. Third, we did not address the issue of online variable selection. The major difficulty there is how to recover significant variables that may have been dropped at early stages of the stream. We hope to address these interesting questions in the future.

Appendix

The Appendix contains the proofs of Theorems 1-3 and of Lemma 1.

Proof of Theorem 1.

For the first data batch 𝒟1\mathcal{D}_{1}, we have 𝜷^(1)=𝜷¯(1)\widehat{\bm{\beta}}^{(1)}=\bar{\bm{\beta}}^{(1)}, where 𝜷¯(1)\bar{\bm{\beta}}^{(1)} is the offline lasso estimator defined in (1). Since the consistency of 𝜷¯(1)\bar{\bm{\beta}}^{(1)} is well established in van de Geer et al. (2014), (11) holds when b=1.b=1. Now we prove the consistency of 𝜷^(b)\widehat{{\bm{\beta}}}^{(b)} for an arbitrary b2b\geq 2 by mathematical induction.

Suppose that 𝜷^(b1){\widehat{\bm{\beta}}}^{(b-1)} satisfies the statement in (11) with constants c1(b1)c_{1}^{(b-1)} and c2(b1)c_{2}^{(b-1)}. We claim that 𝜷^(b)𝜷01c1(b)s0λb.\lVert{\widehat{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}. To show this, we consider the following linear combination,

𝜷~(b)=t𝜷^(b)+(1t)𝜷0,wheret=c1(b)s0λbc1(b)s0λb+𝜷^(b)𝜷01.\displaystyle\widetilde{\bm{\beta}}^{(b)}=t\widehat{\bm{\beta}}^{(b)}+(1-t)\bm{\beta}^{0},\ \text{where}\ t=\frac{c_{1}^{(b)}s_{0}\lambda_{b}}{c_{1}^{(b)}s_{0}\lambda_{b}+\lVert{\widehat{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}}. (13)

Then 𝜷~(b)𝜷01c1(b)s0λb.\lVert{\widetilde{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}. Note that 𝜷~(b)𝜷01c1(b)s0λb/2\lVert{\widetilde{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}/2 if and only if 𝜷^(b)𝜷01c1(b)s0λb.\lVert{\widehat{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}. Therefore, it suffices to show 𝜷~(b)𝜷01c1(b)s0λb/2\lVert{\widetilde{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}/2.
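For completeness, the bound and the equivalence stated above can be checked directly from the definition of tt in (13): writing a=c_{1}^{(b)}s_{0}\lambda_{b} and x=\lVert\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1} (our shorthand), we have \widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}=t(\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}), so that

```latex
\lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  = t\,\lVert\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  = \frac{ax}{a+x}\leq a,
\qquad
\frac{ax}{a+x}\leq\frac{a}{2}
  \;\Longleftrightarrow\; 2x\leq a+x
  \;\Longleftrightarrow\; x\leq a .
```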

Let (𝜷;𝒟j)=i𝒟jd(yi;g(𝒙i𝜷))\mathcal{L}(\bm{\beta};\mathcal{D}_{j})=\sum_{i\in\mathcal{D}_{j}}d\left(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})\right), j=1,,bj=1,\ldots,b. Since the objective function defined in (4) is convex and 𝜷^(b)\widehat{\bm{\beta}}^{(b)} is its minimizer, the objective value at the convex combination 𝜷~(b)\widetilde{\bm{\beta}}^{(b)} is no larger than that at 𝜷0{\bm{\beta}}^{0}, that is,

12Nb{(𝜷~(b);𝒟b)+12(𝜷~(b)𝜷^(b1))𝑱~(b1)(𝜷~(b)𝜷^(b1))}+λb𝜷~(b)1\displaystyle\frac{1}{2N_{b}}\left\{\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b})+\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})\right\}+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1} (14)
\displaystyle\leq 12Nb{(𝜷0;𝒟b)+12(𝜷0𝜷^(b1))𝑱~(b1)(𝜷0𝜷^(b1))}+λb𝜷01,\displaystyle\frac{1}{2N_{b}}\left\{\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b})+\frac{1}{2}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})\right\}+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1},

where 𝑱~(b1)=j=1b1𝑱(j)(𝜷^(j))\widetilde{\bm{J}}^{(b-1)}=\sum_{j=1}^{b-1}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)}). Recall that 𝒟b1={𝒟1,,𝒟b1}\mathcal{D}_{b-1}^{\star}=\{\mathcal{D}_{1},\ldots,\mathcal{D}_{b-1}\}. A Taylor’s expansion gives that

(𝜷~(b);𝒟b1)(𝜷0;𝒟b1)\displaystyle\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b-1}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b-1}^{\star})
=\displaystyle= 𝑼(b1)(𝜷0)(𝜷~(b)𝜷0)+12(𝜷~(b)𝜷0){𝑱~(b1)(𝝃)}(𝜷~(b)𝜷0),\displaystyle~{}\bm{U}^{(b-1)}({\bm{\beta}}^{0})(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})+\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{\widetilde{\bm{J}}^{(b-1)}(\bm{\xi})\right\}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}),

where 𝑱~(b1)(𝝃)=j=1b1𝑱(j)(𝝃)\widetilde{\bm{J}}^{(b-1)}(\bm{\xi})=\sum_{j=1}^{b-1}\bm{J}^{(j)}(\bm{\xi}) and 𝝃=t2𝜷0+(1t2)𝜷~(b)\bm{\xi}=t_{2}{\bm{\beta}}^{0}+(1-t_{2})\widetilde{\bm{\beta}}^{(b)} for some 0<t2<1.0<t_{2}<1. Then

12(𝜷~(b)𝜷^(b1))𝑱~(b1)(𝜷~(b)𝜷^(b1))12(𝜷0𝜷^(b1))𝑱~(b1)(𝜷0𝜷^(b1))\displaystyle\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})-\frac{1}{2}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})
=\displaystyle= 12(𝜷~(b)𝜷0)𝑱~(b1)(𝜷~(b)𝜷0)(𝜷~(b)𝜷0)𝑱~(b1)(𝜷^(b1)𝜷0)\displaystyle\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})-(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})
=\displaystyle= (𝜷~(b);𝒟b1)(𝜷0;𝒟b1)+12j=1b1(𝜷~(b)𝜷0){𝑱(j)(𝜷^(j))𝑱(j)(𝝃)}(𝜷~(b)𝜷0)\displaystyle\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b-1}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b-1}^{\star})+\frac{1}{2}\sum_{j=1}^{b-1}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{{\bm{J}}^{(j)}(\widehat{\bm{\beta}}^{(j)})-{\bm{J}}^{(j)}(\bm{\xi})\right\}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})
j=1b1(𝜷~(b)𝜷0){𝑱(j)(𝜷^(j))}(𝜷^(b1)𝜷0)𝑼(b1)(𝜷0)(𝜷~(b)𝜷0)\displaystyle-\sum_{j=1}^{b-1}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{{\bm{J}}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})-\bm{U}^{(b-1)}({\bm{\beta}}^{0})(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})
:=\displaystyle:= (𝜷~(b);𝒟b1)(𝜷0;𝒟b1)+Δ1(b)Δ2(b)Δ3(b).\displaystyle\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b-1}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b-1}^{\star})+\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)}.

Substituting the above identity into (14), we obtain

12Nb(𝜷~(b);𝒟b)+λb𝜷~(b)1+12Nb(Δ1(b)Δ2(b)Δ3(b))12Nb(𝜷0;𝒟b)+λb𝜷01.\displaystyle\frac{1}{2N_{b}}\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b}^{\star})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1}+\frac{1}{2N_{b}}(\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)})\leq\frac{1}{2N_{b}}\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b}^{\star})+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}. (15)

We introduce two notations,

(𝜷)=12Nb𝔼{(𝜷;𝒟b)(𝜷0;𝒟b)},v(𝜷;𝒟b)=(𝜷;𝒟b)𝔼{(𝜷;𝒟b)},𝜷p,\displaystyle\mathcal{E}({\bm{\beta}})=\frac{1}{2N_{b}}\mathbb{E}\left\{\mathcal{L}({\bm{\beta}};\mathcal{D}_{b}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b}^{\star})\right\},~{}v(\bm{\beta};\mathcal{D}_{b}^{\star})=\mathcal{L}({\bm{\beta}};\mathcal{D}_{b}^{\star})-\mathbb{E}\left\{\mathcal{L}({\bm{\beta}};\mathcal{D}_{b}^{\star})\right\},\ \bm{\beta}\in\mathbb{R}^{p},

as the excess risk and the empirical process, respectively. Note that (𝜷)\mathcal{E}({\bm{\beta}}) does not depend on the realization of 𝒟b\mathcal{D}_{b}^{\star} since the data are i.i.d. With the above notation, (15) can be further written as

(𝜷~(b))+λb𝜷~(b)1\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1} 12Nb{v(𝜷~(b);𝒟b)v(𝜷0;𝒟b)}+λb𝜷01+12Nb(Δ1(b)Δ2(b)Δ3(b))\displaystyle\leq-\frac{1}{2N_{b}}\left\{v(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b}^{\star})-v({\bm{\beta}}^{0};\mathcal{D}_{b}^{\star})\right\}+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}+\frac{1}{2N_{b}}(\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)})
:=12NbΔ4(b)+λb𝜷01+12Nb(Δ1(b)Δ2(b)Δ3(b))\displaystyle:=-\frac{1}{2N_{b}}\Delta_{4}^{(b)}+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}+\frac{1}{2N_{b}}(\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)})

Recall that 𝑿(b1)=((𝑿(1)),,(𝑿(b1)))Nb1×p\bm{X}^{(b-1)}_{\star}=((\bm{X}^{(1)})^{\top},\ldots,(\bm{X}^{(b-1)})^{\top})^{\top}\in\mathbb{R}^{N_{b-1}\times p}. The next lemma provides upper bounds for |Δi(b)|,i=1,2,3,4|\Delta_{i}^{(b)}|,i=1,2,3,4; its proof is given at the end of the Appendix.

Lemma 1.

Under the conditions of Theorem 1, with probability at least 1p3,1-p^{-3},

|Δ1(b)|2KLc1(1)s0λ1𝑿(b1)(𝜷~(b)𝜷0)22,\displaystyle|\Delta_{1}^{(b)}|\leq 2KLc^{(1)}_{1}s_{0}\lambda_{1}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert^{2}_{2},
|Δ2(b)|K2𝑿(b1)(𝜷~(b)𝜷0)2𝑿(b1)(𝜷^(b1)𝜷0)2,\displaystyle|\Delta_{2}^{(b)}|\leq K_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})\lVert_{2},
|Δ3(b)|λb1Nb1𝜷~(b)𝜷01/8,|Δ4(b)|λbNb𝜷~(b)𝜷01/8,\displaystyle|\Delta_{3}^{(b)}|\leq{\lambda_{b-1}{N_{b-1}}}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}/8,\ |\Delta_{4}^{(b)}|\leq{\lambda_{b}{N_{b}}}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}/8,

where LL is the Lipschitz constant defined in Assumption 1, K=XK=\lVert X\lVert_{\infty}, and K2=supi𝒟b|g(𝐱i𝛃)|K_{2}=\sup_{i\in\mathcal{D}_{b}^{\star}}\lvert{g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})}\lvert.

Note that λb1Nb1=λbNb{\lambda_{b-1}\sqrt{N_{b-1}}}={\lambda_{b}\sqrt{N_{b}}} and that the upper bound of |Δ1||\Delta_{1}| can be absorbed into the upper bound of |Δ2||\Delta_{2}|. According to Lemma 1,

Δ1(b)Δ2(b)Δ3(b)Δ4(b)\displaystyle\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)}-\Delta_{4}^{(b)}
\displaystyle\leq 2K2𝑿(b1)(𝜷~(b)𝜷0)2𝑿(b1)(𝜷^(b1)𝜷0)2+14λbNb𝜷~(b)𝜷01\displaystyle 2K_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{4}{\lambda_{b}{N_{b}}}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}
\displaystyle\leq 2K2Nb1c2(b1)s0λb1𝑿(b)(𝜷~(b)𝜷0)2+14c1(b)Nbs0λb2.\displaystyle 2K_{2}\sqrt{N_{b-1}c_{2}^{(b-1)}s_{0}}\lambda_{b-1}\lVert\bm{X}_{\star}^{(b)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{4}c_{1}^{(b)}{{N_{b}}}s_{0}\lambda_{b}^{2}.
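Here the term \frac{1}{4}\lambda_{b}N_{b}\lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1} in the second line collects the bounds on |\Delta_{3}^{(b)}| and |\Delta_{4}^{(b)}| from Lemma 1: using \lambda_{b-1}\sqrt{N_{b-1}}=\lambda_{b}\sqrt{N_{b}}, which holds when \lambda_{j} is taken proportional to \sqrt{\log p/N_{j}} (our reading of the tuning schedule),

```latex
|\Delta_{3}^{(b)}|+|\Delta_{4}^{(b)}|
  \leq \frac{\lambda_{b-1}N_{b-1}+\lambda_{b}N_{b}}{8}
       \lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  = \frac{\lambda_{b}\sqrt{N_{b}N_{b-1}}+\lambda_{b}N_{b}}{8}
       \lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  \leq \frac{1}{4}\lambda_{b}N_{b}
       \lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}.
```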

Consequently,

(𝜷~(b))+λb𝜷~(b)1λb𝜷01\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1}-\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1} (16)
\displaystyle\leq 12Nb{2K2Nb1c2(b1)s0λb1𝑿(b)(𝜷~(b)𝜷0)2+14c1(b)Nbs0λb2}\displaystyle\frac{1}{2N_{b}}\left\{2K_{2}\sqrt{N_{b-1}c_{2}^{(b-1)}s_{0}}\lambda_{b-1}\lVert\bm{X}_{\star}^{(b)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{4}c_{1}^{(b)}{{N_{b}}}s_{0}\lambda_{b}^{2}\right\}
=\displaystyle= {K2c2(b1)s0Nb𝑿(b)(𝜷~(b)𝜷0)2+18c1(b)s0λb}λb:=Δ(b)λb.\displaystyle\left\{K_{2}\sqrt{\frac{c_{2}^{(b-1)}s_{0}}{N_{b}}}\lVert\bm{X}_{\star}^{(b)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{8}c_{1}^{(b)}s_{0}\lambda_{b}\right\}\lambda_{b}:=\Delta^{(b)}\lambda_{b}.

The remaining part follows the standard argument. Recall that S0={j:βj00}S_{0}=\{j:\beta_{j}^{0}\neq 0\}. For 𝜷p\bm{\beta}\in\mathbb{R}^{p}, we define 𝜷S0:=(βj,S0)j=1p\bm{\beta}_{S_{0}}:=(\beta_{j,S_{0}})_{j=1}^{p} where βj,S0=βj𝟏{jS0}.\beta_{j,S_{0}}=\beta_{j}\mathbf{1}\{j\in S_{0}\}. Then, 𝜷=𝜷S0+𝜷S0c.\bm{\beta}=\bm{\beta}_{S_{0}}+\bm{\beta}_{S^{c}_{0}}. From (16), we have

(𝜷~(b))+λb𝜷~S0c(b)1λb𝜷01λb𝜷~S0(b)1+Δ(b)λbλb𝜷0𝜷~S0(b)1+Δ(b)λb.\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}\leq\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}-\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}+\Delta^{(b)}\lambda_{b}\leq\lambda_{b}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}+\Delta^{(b)}\lambda_{b}.

We now distinguish two cases according to the value of 𝜷0𝜷~S0(b)1\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}.
Case I. Suppose that 𝜷0𝜷~S0(b)1Δ(b)/2\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}\geq\Delta^{(b)}/2. Then,

(𝜷~(b))+λb𝜷~S0c(b)13λb𝜷0𝜷~S0(b)1,\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+{\lambda_{b}}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}\leq{3\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1},

which implies 𝜷~S0c(b)13𝜷0𝜷~S0(b)1\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}\leq 3\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}. Then, we can invoke the empirical compatibility condition, that is,

𝜷0𝜷~S0(b)12s0ϕ02Nb𝑿(b)(𝜷0𝜷~(b))22,\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}^{2}\leq\frac{s_{0}}{\phi_{0}^{2}N_{b}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert^{2}_{2},

where ϕ0>0\phi_{0}>0 is the compatibility constant. Thus,

(𝜷~(b))+λb𝜷~S0c(b)1+λb𝜷0𝜷~S0(b)12λbϕ0s0Nb𝑿(b)(𝜷0𝜷~(b))2+Δ(b)λb\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+{\lambda_{b}}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}+{\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}\leq\frac{2\lambda_{b}}{\phi_{0}}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\Delta^{(b)}\lambda_{b}
=\displaystyle= (2ϕ0+K2c2(b1))λbs0Nb𝑿(b)(𝜷0𝜷~(b))2+18c1(b)s0λb2\displaystyle\left(\frac{2}{\phi_{0}}+K_{2}\sqrt{c_{2}^{(b-1)}}\right)\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\frac{1}{8}c_{1}^{(b)}s_{0}\lambda_{b}^{2}
:=\displaystyle:= Cλbs0Nb𝑿(b)(𝜷0𝜷~(b))2+18c1(b)s0λb2\displaystyle C\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\frac{1}{8}c_{1}^{(b)}s_{0}\lambda_{b}^{2}

Let k2=1/supi𝒟b|4/g(𝒙i𝜷)|.k_{2}=1/\sup_{i\in\mathcal{D}_{b}^{\star}}\lvert{4/g^{\prime}(\bm{x}_{i}\bm{\beta})}\lvert. Based on Lemma 1, we have

k2Nb𝑿(b)(𝜷0𝜷~(b))22(𝜷~(b))+12NbΔ4(b)(𝜷~(b))+116c1(b)s0λb2,\frac{k_{2}}{N_{b}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert^{2}_{2}\leq\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\frac{1}{2N_{b}}\Delta_{4}^{(b)}\leq\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\frac{1}{16}c_{1}^{(b)}s_{0}\lambda_{b}^{2},

which is also known as the margin condition (van de Geer,, 2008). Next we apply the arithmetic mean-geometric mean inequality and obtain

Cλbs0Nb𝑿(b)(𝜷0𝜷~(b))2C22k2s0λb2+(𝜷~(b))2+132c1(b)s0λb2.C\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}\leq\frac{C^{2}}{2k_{2}}s_{0}\lambda_{b}^{2}+\frac{\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})}{2}+\frac{1}{32}c_{1}^{(b)}s_{0}\lambda_{b}^{2}. (17)
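In more detail, (17) follows from the elementary inequality ab\leq a^{2}/2+b^{2}/2 applied with a=C\lambda_{b}\sqrt{s_{0}/k_{2}} and b=\sqrt{k_{2}/N_{b}}\,\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\rVert_{2} (our shorthand), combined with the margin condition in the previous display:

```latex
C\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\,
  \lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\rVert_{2}
= \underbrace{C\lambda_{b}\sqrt{\tfrac{s_{0}}{k_{2}}}}_{a}\;
  \underbrace{\sqrt{\tfrac{k_{2}}{N_{b}}}\,
  \lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\rVert_{2}}_{b}
\;\leq\; \frac{a^{2}}{2}+\frac{b^{2}}{2}
\;\leq\; \frac{C^{2}}{2k_{2}}s_{0}\lambda_{b}^{2}
        +\frac{\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})}{2}
        +\frac{1}{32}c_{1}^{(b)}s_{0}\lambda_{b}^{2}.
```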

Then, it follows that

(𝜷~(b))2+λb𝜷0𝜷~(b)1(C22k2+532c1(b))s0λb2.\frac{\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})}{2}+{\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\left(\frac{C^{2}}{2k_{2}}+\frac{5}{32}c_{1}^{(b)}\right)s_{0}\lambda_{b}^{2}. (18)

On the one hand, since (𝜷~(b))>0\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})>0, we obtain that

𝜷0𝜷~(b)1(C22k2+532c1(b))s0λb.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\left(\frac{C^{2}}{2k_{2}}+\frac{5}{32}c_{1}^{(b)}\right)s_{0}\lambda_{b}.

With a suitable choice of c2(b1)c_{2}^{(b-1)} such that

C22k2+532c1(b)=12k2(2ϕ0+K2c2(b1))2+532c1(b)1332c1(b),\frac{C^{2}}{2k_{2}}+\frac{5}{32}c_{1}^{(b)}=\frac{1}{2k_{2}}\left(\frac{2}{\phi_{0}}+K_{2}\sqrt{c_{2}^{(b-1)}}\right)^{2}+\frac{5}{32}c_{1}^{(b)}\leq\frac{13}{32}c_{1}^{(b)},

we have

𝜷0𝜷~(b)112c1(b)s0λb.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\frac{1}{2}c_{1}^{(b)}s_{0}\lambda_{b}.

Specifically, we require that K2c2(1)2/ϕ0K_{2}\sqrt{c_{2}^{(1)}}\geq 2/\phi_{0} and that 8K22c2(b1)/k2=c1(b).8K_{2}^{2}c_{2}^{(b-1)}/k_{2}=c_{1}^{(b)}. On the other hand, combining (17) and (18), we obtain

𝑿(b)(𝜷0𝜷~(b))25C2k2s0Nbλb.\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}\leq\frac{5C}{2k_{2}}\sqrt{s_{0}N_{b}}\lambda_{b}.

Again, since C2K2c2(b1){C}\leq{2K_{2}\sqrt{c_{2}^{(b-1)}}}, we obtain

𝑿(b)(𝜷0𝜷~(b))22(5K2k2)2c2(b1)s0λb2Nbc2(b)s0λb2Nb.\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert^{2}_{2}\leq\left(\frac{5K_{2}}{k_{2}}\right)^{2}c_{2}^{(b-1)}s_{0}\lambda_{b}^{2}N_{b}\leq c_{2}^{(b)}s_{0}\lambda_{b}^{2}N_{b}. (19)

Case II. Suppose that 𝜷0𝜷~S0(b)1<Δ(b)/2\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}<\Delta^{(b)}/2. Then,

(𝜷~(b))+λb𝜷~S0c(b)1+λb𝜷0𝜷~S0(b)12Δ(b)λb\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+{\lambda_{b}}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}+{\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}\leq 2\Delta^{(b)}\lambda_{b}
=\displaystyle= 2K2λbc2(b1)s0Nb𝑿(b)(𝜷0𝜷~(b))2+14c1(b)s0λb2\displaystyle 2K_{2}\lambda_{b}\sqrt{\frac{c_{2}^{(b-1)}s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\frac{1}{4}c_{1}^{(b)}s_{0}\lambda_{b}^{2}

Then, by the margin condition, it is straightforward to show that

𝜷0𝜷~(b)112c1(b)s0λb,𝑿(𝜷0𝜷~(b))22(6K2k2)2c2(b1)s0λb2Nb=c2(b)s0λb2Nb.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\frac{1}{2}c^{(b)}_{1}s_{0}\lambda_{b},\ \lVert\bm{X}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}^{2}\leq\left(\frac{6K_{2}}{k_{2}}\right)^{2}c_{2}^{(b-1)}s_{0}\lambda_{b}^{2}N_{b}=c_{2}^{(b)}s_{0}\lambda_{b}^{2}N_{b}. (20)

Note that in both cases we obtain 𝜷0𝜷~(b)1c1(b)s0λb/2.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq c^{(b)}_{1}s_{0}\lambda_{b}/2. According to (13), this establishes our claim 𝜷0𝜷^(b)1c1(b)s0λb.\lVert\bm{\beta}^{0}-\widehat{\bm{\beta}}^{(b)}\lVert_{1}\leq c^{(b)}_{1}s_{0}\lambda_{b}. The remaining step is to repeat the above argument, now with 𝜷^(b)\widehat{\bm{\beta}}^{(b)} in place of 𝜷~(b)\widetilde{\bm{\beta}}^{(b)}, to obtain the upper bound of the estimation error in (19) or (20). It is worth pointing out that c1(b)=8K22c2(b1)/k2c_{1}^{(b)}=8K_{2}^{2}c_{2}^{(b-1)}/k_{2} and c2(b)=36K22c2(b1)/k22,c_{2}^{(b)}=36K_{2}^{2}c_{2}^{(b-1)}/k_{2}^{2}, where K2K_{2} and k2k_{2} do not depend on b.b. The proof is completed by taking a union bound over the events considered in Lemma 1. ∎

Before proving Theorem 2, we cite Lemma 3 in Han et al. (2021), which is used to bound the cumulative error terms.

Lemma 2.

(Han et al., 2021) Let njn_{j} and NjN_{j} be the batch size and the cumulative sample size, respectively, when the jj-th data batch arrives, j=1,,b.j=1,\ldots,b. Then,

j=1bnjNj\displaystyle\sum_{j=1}^{b}\frac{n_{j}}{N_{j}} \displaystyle\leq 1+logNbn1,\displaystyle 1+\log\frac{N_{b}}{n_{1}}, (21)
j=1bnjNj\displaystyle\sum_{j=1}^{b}\frac{n_{j}}{\sqrt{N_{j}}} \displaystyle\leq 2Nb.\displaystyle 2\sqrt{N_{b}}. (22)
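As a quick numerical sanity check of Lemma 2 (ours, not part of the cited proof), the bounds (21) and (22) can be verified for arbitrary positive batch-size sequences; the short script below is a minimal sketch with arbitrary batch-size configurations.

```python
import numpy as np

def check_lemma2(batch_sizes) -> bool:
    """Numerically verify (21) and (22) for a sequence of positive batch sizes."""
    n = np.asarray(batch_sizes, dtype=float)
    N = np.cumsum(n)                        # cumulative sample sizes N_j
    lhs21 = np.sum(n / N)
    rhs21 = 1.0 + np.log(N[-1] / n[0])      # 1 + log(N_b / n_1)
    lhs22 = np.sum(n / np.sqrt(N))
    rhs22 = 2.0 * np.sqrt(N[-1])            # 2 * sqrt(N_b)
    return bool(lhs21 <= rhs21 and lhs22 <= rhs22)

# e.g. b = 84 batches with arbitrary sizes between 1 and 99
rng = np.random.default_rng(0)
assert all(check_lemma2(rng.integers(1, 100, size=84)) for _ in range(1000))
```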
Proof of Theorem 2.

Recall that

𝜸^r(j)=argmin𝜸(p1){12Nj(𝑱~r,r(j)2𝑱~r,r(j)𝜸+𝜸𝑱~r,r(j)𝜸)+λj𝜸1}\bm{\widehat{\gamma}}_{r}^{(j)}=\underset{\bm{\gamma}\in\mathbb{R}^{(p-1)}}{\arg\min}\left\{\frac{1}{2N_{j}}\left(\widetilde{\bm{J}}^{(j)}_{r,r}-2\widetilde{\bm{J}}^{(j)}_{r,-r}\bm{\gamma}+\bm{\gamma}^{\top}\widetilde{\bm{J}}_{-r,-r}^{(j)}\bm{\gamma}\right)+\lambda_{j}\|\bm{\gamma}\|_{1}\right\}

is a standard lasso estimator. The proof follows the standard argument as long as the estimated information matrix 𝑱~(j)/Nj\widetilde{\bm{J}}^{(j)}/N_{j} satisfies the compatibility condition. Therefore, it suffices to show that the compatibility condition holds. Recall that 𝑱~(j)/Nj=i=1j𝑱(i)(𝜷^(i))/Nj.\widetilde{\bm{J}}^{(j)}/N_{j}=\sum_{i=1}^{j}\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})/N_{j}. Then we have

𝑱~(j)/Nj𝑱0=1Nji=1j{𝑱(i)(𝜷^(i))𝔼[𝑱(i)(𝜷0)]}\displaystyle\widetilde{\bm{J}}^{(j)}/N_{j}-\bm{J}^{0}=\frac{1}{N_{j}}\sum_{i=1}^{j}\left\{\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})-\mathbb{E}[\bm{J}^{(i)}({\bm{\beta}}^{0})]\right\}
=\displaystyle= 1Nji=1j{𝑱(i)(𝜷^(i))𝑱(i)(𝜷0)}+1Nji=1j[𝑱(i)(𝜷0)𝔼{𝑱(i)(𝜷0)}].\displaystyle\frac{1}{N_{j}}\sum_{i=1}^{j}\left\{\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})-\bm{J}^{(i)}({\bm{\beta}}^{0})\right\}+\frac{1}{N_{j}}\sum_{i=1}^{j}\left[\bm{J}^{(i)}({\bm{\beta}}^{0})-\mathbb{E}\{\bm{J}^{(i)}({\bm{\beta}}^{0})\}\right].

According to the error bound of 𝜷^(i)\widehat{\bm{\beta}}^{(i)} provided in Theorem 1 and (22), we obtain

1Nji=1j{𝑱(i)(𝜷^(i))𝑱(i)(𝜷0)}K2Nji=1jni𝜷^(i)𝜷01=𝒪p(c1(j)s0log(p)Nj),\left\lVert\frac{1}{N_{j}}\sum_{i=1}^{j}\left\{\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})-\bm{J}^{(i)}({\bm{\beta}}^{0})\right\}\right\lVert_{\infty}\leq\frac{K^{2}}{N_{j}}\sum_{i=1}^{j}{n_{i}}\lVert{\widehat{\bm{\beta}}}^{(i)}-{{\bm{\beta}}}^{0}\lVert_{1}=\mathcal{O}_{p}\left(c_{1}^{(j)}s_{0}\sqrt{\frac{\log(p)}{N_{j}}}\right),

where KK is defined in (A1) of Assumption 1. Meanwhile, based on Hoeffding’s inequality, it follows that

1Nji=1j[𝑱(i)(𝜷0)𝔼{𝑱(i)(𝜷0)}]=𝒪p(log(p)Nj).\left\lVert\frac{1}{N_{j}}\sum_{i=1}^{j}\left[\bm{J}^{(i)}({\bm{\beta}}^{0})-\mathbb{E}\{\bm{J}^{(i)}({\bm{\beta}}^{0})\}\right]\right\lVert_{\infty}=\mathcal{O}_{p}\left(\sqrt{\frac{\log(p)}{N_{j}}}\right).

In summary,

𝑱~(j)/Nj𝑱0=𝒪p(c1(j)s0log(p)Nj).\left\lVert\widetilde{\bm{J}}^{(j)}/N_{j}-\bm{J}^{0}\right\lVert_{\infty}=\mathcal{O}_{p}\left(c_{1}^{(j)}s_{0}\sqrt{\frac{\log(p)}{N_{j}}}\right).

Consequently, the compatibility condition holds by Corollary 6.8 in Bühlmann & van de Geer (2011). ∎
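As an aside, the quadratic lasso problem defining \bm{\widehat{\gamma}}_{r}^{(j)} at the beginning of this proof depends on the data only through the aggregated information matrix \widetilde{\bm{J}}^{(j)}, so it can be solved, for example, by a proximal-gradient (ISTA) iteration. The sketch below is our illustration rather than the implementation used in the paper; A and b stand for the blocks \widetilde{\bm{J}}^{(j)}_{-r,-r}/N_{j} and \widetilde{\bm{J}}^{(j)}_{-r,r}/N_{j}.

```python
import numpy as np

def soft_threshold(v: np.ndarray, thr: float) -> np.ndarray:
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def nodewise_lasso(A: np.ndarray, b: np.ndarray, lam: float,
                   n_iter: int = 2000) -> np.ndarray:
    """ISTA for  min_gamma  0.5 * gamma' A gamma - b' gamma + lam * ||gamma||_1.

    Up to an additive constant, this is the objective defining gamma_hat_r
    above when A = J_tilde_{-r,-r}/N_j and b = J_tilde_{-r,r}/N_j.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    gamma = np.zeros_like(b)
    step = 1.0 / np.linalg.eigvalsh(A)[-1]   # 1 / largest eigenvalue of A
    for _ in range(n_iter):
        grad = A @ gamma - b                 # gradient of the smooth part
        gamma = soft_threshold(gamma - step * grad, step * lam)
    return gamma
```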

It remains to establish the asymptotic normality of the online debiased lasso estimator.

Proof of Theorem 3.

We first deal with the debiased term. Recall that 𝜸~(j)p\widetilde{\bm{\gamma}}^{(j)}\in\mathbb{R}^{p} is the extension of 𝜸^(j)\widehat{\bm{\gamma}}^{(j)}. Then

j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷^(j))}\displaystyle\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{\widehat{\bm{\beta}}}^{(j)})\right\}
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}{𝜸~r(j)}{𝑿(j)}{g(𝑿(j)𝜷0)g(𝑿(j)𝜷^(j))}\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}-\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\{\bm{X}^{(j)}\}^{\top}\left\{g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})-g(\bm{X}^{(j)}{\widehat{\bm{\beta}}}^{(j)})\right\}

For 1knj,1\leq k\leq n_{j}, let 𝑿k(j)1×p\bm{X}^{(j)}_{k}\in\mathbb{R}^{1\times p} denote the kk-th row of 𝑿(j)\bm{X}^{(j)}, namely the covariate vector of the kk-th observation in the jj-th batch. According to the mean value theorem,

g(𝑿k(j)𝜷^(j))=g(𝑿k(j)𝜷0)g(ηk(j))𝑿k(j)(𝜷0𝜷^(j)),\displaystyle g(\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)})=g(\bm{X}^{(j)}_{k}{{\bm{\beta}}}^{0})-g^{\prime}(\eta_{k}^{(j)})\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)}),

where ηk(j)[𝑿k(j)𝜷0,𝑿k(j)𝜷^(j)].\eta_{k}^{(j)}\in[\bm{X}^{(j)}_{k}{{\bm{\beta}}^{0}},\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)}]. Let 𝜼(j)=(η1(j),,ηnj(j))\bm{\eta}^{(j)}=(\eta_{1}^{(j)},\ldots,\eta_{n_{j}}^{(j)})^{\top} and let 𝚲(j)nj×nj\bm{\Lambda}^{(j)}\in\mathbb{R}^{n_{j}\times n_{j}} be the diagonal matrix with diagonal elements {g(𝑿k(j)𝜷^(j))g(ηk(j))}k=1nj\{g^{\prime}(\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)})-g^{\prime}({\eta}_{k}^{(j)})\}_{k=1}^{n_{j}}. As a result,

j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷^(j))}\displaystyle\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{\widehat{\bm{\beta}}}^{(j)})\right\}
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷0𝜷^(j))+Π1,\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}}^{0})\right\}-\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})+\Pi_{1},

where Π1=j=1b{𝒛^r(j)}𝚲(j)𝑿(j)(𝜷0𝜷^(j))\Pi_{1}=\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\bm{\Lambda}^{(j)}\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)}). Now, we focus on the online debiased lasso estimator defined in (2.2). According to the above results,

τ^r(b)(β^on,r(b)βr0)\displaystyle\widehat{\tau}_{r}^{(b)}({\widehat{\beta}}^{(b)}_{\text{on},r}-\beta_{r}^{0})
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷^(j))}+j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷^(j))+τ^r(b)(β^r(b)βr0)\displaystyle\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\bm{\widehat{\beta}}^{(j)})\right\}+\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\widehat{\bm{\beta}}}^{(j)})+\widehat{\tau}_{r}^{(b)}(\widehat{{\beta}}^{(b)}_{r}-\beta_{r}^{0})
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}+Π1+j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0)+τ^r(b)(β^r(b)βr0)\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}+\Pi_{1}+\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})+\widehat{\tau}_{r}^{(b)}(\widehat{{\beta}}^{(b)}_{r}-\beta_{r}^{0})
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}+Π1+Π2+Π3,\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}+\Pi_{1}+\Pi_{2}+\Pi_{3},

where

Π2\displaystyle\Pi_{2} =\displaystyle= j=1b{𝜸~r(b)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0)j=1b{𝜸~r(b)}𝑱r(j)(𝜷^(j))(β^r(b)βr0),\displaystyle\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})-\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\bm{J}^{(j)}_{r}(\widehat{\bm{\beta}}^{(j)})(\widehat{{\beta}}^{(b)}_{r}-\beta_{r}^{0}),
Π3\displaystyle\Pi_{3} =\displaystyle= j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0)j=1b{𝜸~r(b)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0).\displaystyle\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})-\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0}).

Therefore, it remains to show that |Πi|=op(Nb),i=1,2,3.|\Pi_{i}|=o_{p}(\sqrt{N_{b}}),i=1,2,3.

First of all, according to the Lipschitz condition of g(),g^{\prime}(\cdot),

|g(ηk(j))𝑿k(j)(𝜷0𝜷^(j))g(𝑿k(j)𝜷^(j))𝑿k(j)(𝜷0𝜷^(j))|{𝑿k(j)(𝜷0𝜷^(j))}2.\displaystyle\left\lvert g^{\prime}({\eta}^{(j)}_{k})\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})-g^{\prime}(\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)})\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lvert\leq\left\{\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\}^{2}.

Then,

𝚲(j)𝑿(j)(𝜷0𝜷^(j))1𝑿(j)(𝜷0𝜷^(j))22K2(c1(j)s0λj)2nj,\displaystyle\left\lVert\bm{\Lambda}^{(j)}\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lVert_{1}\leq\left\lVert\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lVert_{2}^{2}\leq K^{2}(c_{1}^{(j)}s_{0}\lambda_{j})^{2}n_{j},

where the last inequality follows from Theorem 1 and the boundedness assumption on 𝑿.\bm{X}. Meanwhile, the boundedness of 𝒛^r(j)\lVert\widehat{\bm{z}}_{r}^{(j)}\lVert_{\infty} can be shown using (A2)(A2) in Assumption 1 and Theorem 2. Therefore, we obtain an upper bound for |Π1||\Pi_{1}|:

|Π1|\displaystyle|\Pi_{1}| \displaystyle\leq j=1b𝒛^r(j)𝚲(j)𝑿(j)(𝜷0𝜷^(j))1\displaystyle\sum_{j=1}^{b}\lVert\widehat{\bm{z}}_{r}^{(j)}\lVert_{\infty}\left\lVert\bm{\Lambda}^{(j)}\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lVert_{1}
\displaystyle\leq j=1bK2𝒛^r(j)(c1(j)s0λj)2nj=𝒪p(c1(b)s02log(p)log(Nb)),\displaystyle\sum_{j=1}^{b}K^{2}\lVert\widehat{\bm{z}}_{r}^{(j)}\lVert_{\infty}(c_{1}^{(j)}s_{0}\lambda_{j})^{2}n_{j}=\mathcal{O}_{p}\left(c_{1}^{(b)}s_{0}^{2}\log(p)\log(N_{b})\right),

where the last equality follows from (21) in Lemma 2.

Next, we use the Karush-Kuhn-Tucker (KKT) conditions of the lasso problem defining 𝜸^r(b)\bm{\widehat{\gamma}}_{r}^{(b)}. Let 𝒆rp\bm{e}_{r}\in\mathbb{R}^{p} denote the unit vector whose rr-th element is one and whose other elements are zero.

|Π2|\displaystyle|\Pi_{2}| =\displaystyle= |({𝜸~r(b)}𝑱~(b){𝜸~r(b)}𝑱~r(b)𝒆r)(𝜷^(b)𝜷0)|\displaystyle\left\lvert\left(\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}-\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}_{r}\bm{e}_{r}^{\top}\right)({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})\right\lvert
\displaystyle\leq {𝜸~r(b)}𝑱~(b){𝜸~r(b)}𝑱~r(b)𝒆r𝜷^(b)𝜷01\displaystyle\left\lVert\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}-\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}_{r}\bm{e}_{r}^{\top}\right\lVert_{\infty}\left\lVert{\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0}\right\lVert_{1}
\displaystyle\leq Nbλb×c1(b)s0λb=c1(b)s0Nbλb2=𝒪p(c1(b)s0log(p)).\displaystyle N_{b}\lambda_{b}\times c^{(b)}_{1}s_{0}\lambda_{b}=c^{(b)}_{1}s_{0}N_{b}\lambda_{b}^{2}=\mathcal{O}_{p}\left(c^{(b)}_{1}s_{0}\log(p)\right).

Finally, we apply Theorem 2 and (22) in Lemma 2.

|Π3|\displaystyle|\Pi_{3}| =\displaystyle= |{j=1b{𝜸~r(b)𝜸~r(j)}𝑱(j)(𝜷^(j))}(𝜷^(b)𝜷0)|\displaystyle\left\lvert\left\{\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}-\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\}({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})\right\lvert
\displaystyle\leq 𝜷^(b)𝜷01j=1b𝜸~r(b)𝜸~r(j)1𝑱(j)(𝜷^(j))\displaystyle\left\lVert{\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0}\right\lVert_{1}\sum_{j=1}^{b}\left\lVert\widetilde{\bm{\gamma}}^{(b)}_{r}-\widetilde{\bm{\gamma}}^{(j)}_{r}\right\lVert_{1}\left\lVert\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\lVert_{\infty}
\displaystyle\leq c1(b)s0λb(j=1bcsrλjnj)=𝒪p(c1(b)s0srlog(p)).\displaystyle c_{1}^{(b)}s_{0}\lambda_{b}\left(\sum_{j=1}^{b}cs_{r}\lambda_{j}n_{j}\right)=\mathcal{O}_{p}\left(c_{1}^{(b)}s_{0}s_{r}\log(p)\right).

Based on the conditions in Theorem 3, we conclude |Πi|=op(Nb),i=1,2,3.|\Pi_{i}|=o_{p}(\sqrt{N_{b}}),i=1,2,3. In summary,

τ^r(b)(β^on,r(b)βr0)=j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}+op(Nb),\displaystyle\widehat{\tau}_{r}^{(b)}({\widehat{\beta}}^{(b)}_{\text{on},r}-\beta_{r}^{0})=\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}+o_{p}(\sqrt{N_{b}}),

where the terms on the right-hand side correspond to WrW_{r} and VrV_{r} in Theorem 3. ∎

For the sake of completeness, we conclude the Appendix with the proof of Lemma 1.

Proof of Lemma 1.

We start from |Δ1(b)|.|\Delta_{1}^{(b)}|. Recall that

Δ1(b)=12j=1b1(𝜷~(b)𝜷0){𝑱(j)(𝜷^(j))𝑱(j)(𝝃)}(𝜷~(b)𝜷0)\displaystyle\Delta_{1}^{(b)}=\frac{1}{2}\sum_{j=1}^{b-1}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{{\bm{J}}^{(j)}(\widehat{\bm{\beta}}^{(j)})-{\bm{J}}^{(j)}(\bm{\xi})\right\}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})

Let 𝒗(j)=𝑿(j)(𝜷~(b)𝜷0)\bm{v}^{(j)}=\bm{X}^{(j)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}), 𝒘(j)=g(𝑿(j)𝜷^(j))g(𝑿(j)𝝃)\bm{w}^{(j)}=g^{\prime}(\bm{X}^{(j)}\widehat{{\bm{\beta}}}^{(j)})-g^{\prime}(\bm{X}^{(j)}\bm{\xi}) and diag(𝒘(j))\text{diag}(\bm{w}^{(j)}) denote the diagonal matrix with diagonal element 𝒘(j).\bm{w}^{(j)}. Then,

Δ1(b)=12j=1b1(𝒗(j))diag(𝒘(j))𝒗(j)=12𝒗diag(𝒘)𝒗,\displaystyle\Delta_{1}^{(b)}=\frac{1}{2}\sum_{j=1}^{b-1}(\bm{v}^{(j)})^{\top}\text{diag}(\bm{w}^{(j)})\bm{v}^{(j)}=\frac{1}{2}\bm{v}^{\top}\text{diag}(\bm{w})\bm{v},

where 𝒗=((𝒗(1)),,(𝒗(b1)))\bm{v}=((\bm{v}^{(1)})^{\top},\ldots,(\bm{v}^{(b-1)})^{\top})^{\top} and 𝒘=((𝒘(1)),,(𝒘(b1)))\bm{w}=((\bm{w}^{(1)})^{\top},\ldots,(\bm{w}^{(b-1)})^{\top})^{\top}. As a result,

|Δ1(b)|12𝒗22diag(𝒘)2=12𝑿(b1)(𝜷~(b)𝜷0)22𝒘.\displaystyle|\Delta_{1}^{(b)}|\leq\frac{1}{2}\lVert\bm{v}\lVert_{2}^{2}\lVert\text{diag}(\bm{w})\lVert_{2}=\frac{1}{2}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert^{2}_{2}\lVert\bm{w}\lVert_{\infty}.

It remains to find an upper bound for 𝒘\lVert\bm{w}\lVert_{\infty}, that is,

𝒘\displaystyle\lVert\bm{w}\lVert_{\infty} =\displaystyle= max1jb1𝒘(j)=max1jb1g(𝑿(j)𝜷^(j))g(𝑿(j)𝝃)\displaystyle\max_{1\leq j\leq b-1}\lVert\bm{w}^{(j)}\lVert_{\infty}=\max_{1\leq j\leq b-1}\lVert g^{\prime}(\bm{X}^{(j)}\widehat{{\bm{\beta}}}^{(j)})-g^{\prime}(\bm{X}^{(j)}\bm{\xi})\lVert_{\infty}
\displaystyle\leq Lmax1jb1𝑿(j)(𝜷^(j)𝝃)3KLmax1jb1c1(j)s0λj\displaystyle L\max_{1\leq j\leq b-1}\lVert\bm{X}^{(j)}(\widehat{{\bm{\beta}}}^{(j)}-\bm{\xi})\lVert_{\infty}\leq 3KL\max_{1\leq j\leq b-1}c^{(j)}_{1}s_{0}\lambda_{j}

For ease of exposition, we assume that c1(1)s0λ1=max1jb1c1(j)s0λjc^{(1)}_{1}s_{0}\lambda_{1}=\max_{1\leq j\leq b-1}c^{(j)}_{1}s_{0}\lambda_{j}, since the first data batch contains the least information and may therefore lead to the largest estimation error. This assumption does not affect the result: the upper bound of |Δ1||\Delta_{1}| can always be absorbed into the upper bound of |Δ2||\Delta_{2}| because b=o(logNb)b=o(\log N_{b}).

For Δ2\Delta_{2}, the proof is similar to the above and is omitted.

Recall that Δ3(b)=𝑼(b1)(𝜷0)(𝜷~(b)𝜷0)\Delta_{3}^{(b)}=\bm{U}^{(b-1)}({\bm{\beta}}^{0})(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}). Note that 𝑼(b1)(𝜷0)\bm{U}^{(b-1)}({\bm{\beta}}^{0}) can be written as a sum of i.i.d.i.i.d. random variables, that is, 𝑼(b1)(𝜷0)=i𝒟b1𝒖(yi;𝒙i,𝜷0).\bm{U}^{(b-1)}({\bm{\beta}}^{0})=\sum_{i\in\mathcal{D}_{b-1}^{\star}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta}^{0}). Since 𝔼[𝑼(b1)(𝜷0)]=0\mathbb{E}[\bm{U}^{(b-1)}({\bm{\beta}}^{0})]=0 and 𝒙iK\lVert\bm{x}_{i}\lVert_{\infty}\leq K, by Hoeffding’s inequality, with probability at least 1p31-p^{-3}, 𝑼(b1)(𝜷0)λb1Nb1/8.\lVert\bm{U}^{(b-1)}({\bm{\beta}}^{0})\lVert_{\infty}\leq\lambda_{b-1}N_{b-1}/8. The stated bound then follows from Hölder’s inequality: |Δ3(b)|𝑼(b1)(𝜷0)𝜷~(b)𝜷01λb1Nb1𝜷~(b)𝜷01/8.|\Delta_{3}^{(b)}|\leq\lVert\bm{U}^{(b-1)}({\bm{\beta}}^{0})\lVert_{\infty}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}\leq\lambda_{b-1}N_{b-1}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}/8.

The bound on |Δ4(b)||\Delta_{4}^{(b)}| follows directly from Theorem 14.5 in Bühlmann & van de Geer (2011). ∎

References

  • Belloni et al. , (2015) Belloni, A., Chernozhukov, V., & Kato, K. 2015. Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika, 102(1), 77–94.
  • Belloni et al. , (2019) Belloni, A., Chernozhukov, V., & Kato, K. 2019. Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Journal of the American Statistical Association, 114(526), 749–758.
  • Bühlmann & van de Geer, (2011) Bühlmann, P., & van de Geer, S. 2011. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
  • Cai & Guo, (2017) Cai, T. T., & Guo, Z. 2017. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2), 615–646.
  • Cai & Guo, (2020) Cai, T. T., & Guo, Z. 2020. Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B, 82(2), 391–419.
  • Cardot & Degras, (2018) Cardot, H., & Degras, D. 2018. Online Principal Component Analysis in High Dimension: Which Algorithm to Choose? International Statistical Review, 86, 29 – 50.
  • Chen et al. , (2014) Chen, Y., Berrocal, V. J., Bingham, C. R., & Song, P. X.-K. 2014. Analysis of spatial variations in the effectiveness of graduated driver’s licensing (GDL) program in the state of Michigan. Spatial and Spatio-temporal Epidemiology, 8, 11–22.
  • Csala et al. , (2017) Csala, A., Voorbraak, F. P., Zwinderman, A. H., & Hof, M. H. 2017. Sparse redundancy analysis of high-dimensional genetic and genomic data. Bioinformatics, 33(20), 3228 – 3234.
  • Deshpande et al. , (2019) Deshpande, Y., Javanmard, A., & Mehrabi, M. 2019. Online debiasing for adaptively collected high-dimensional data. arXiv preprint arXiv:1911.01040.
  • Dezeure et al. , (2015) Dezeure, R., Bühlmann, P., Meier, L., & Meinshausen, N. 2015. High-dimensional inference: confidence intervals, pp-values and R-software hdi. Statistical Science, 30(4), 533 – 558.
  • Fan & Li, (2001) Fan, J., & Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
  • Fang, (2019) Fang, Y. 2019. Scalable statistical inference for averaged implicit stochastic gradient descent. Scandinavian Journal of Statistics, 1–16.
  • Fei & Li, (2021) Fei, Z., & Li, Y. 2021. Estimation and inference for high dimensional generalized linear models: a splitting and smoothing approach. Journal of Machine Learning Research, 22(58), 1–32.
  • Han et al. , (2021) Han, R., Luo, L., Lin, Y., & Huang, J. 2021. Online debiased lasso. arXiv preprint arXiv:2106.05925.
  • Huber, (1964) Huber, P. J. 1964. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73–101.
  • Javanmard & Montanari, (2014) Javanmard, A., & Montanari, A. 2014. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1), 2869–2909.
  • Luo & Song, (2020) Luo, L., & Song, P. X.-K. 2020. Renewable estimation and incremental inference in generalized linear models with streaming datasets. Journal of the Royal Statistical Society: Series B, 82, 69–97.
  • Ning & Liu, (2017) Ning, Y., & Liu, H. 2017. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1), 158–195.
  • Robbins & Monro, (1951) Robbins, H., & Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
  • Sakrison, (1965) Sakrison, D. J. 1965. Efficient recursive estimation: application to estimating the parameter of a covariance function. International Journal of Engineering Science, 3(4), 461–483.
  • Schifano et al. , (2016) Schifano, E. D., Wu, J., Wang, C., Yan, J., & Chen, M. H. 2016. Online updating of statistical inference in the big data setting. Technometrics, 58(3), 393–403.
  • Shi et al. , (2020) Shi, C., Song, R., Lu, W., & Li, R. 2020. Statistical inference for high-dimensional models via recursive online-score estimation. Journal of the American Statistical Association, 1–12.
  • Sun et al. , (2020) Sun, L., Wang, M., Guo, Y., & Barbu, A. 2020. A novel framework for online supervised learning with feature selection. arXiv preprint arXiv:1803.11521.
  • Sur & Candès, (2019) Sur, P., & Candès, E. J. 2019. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516 – 14525.
  • Tibshirani, (1996) Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288.
  • Toulis & Airoldi, (2017) Toulis, P., & Airoldi, E. M. 2017. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4), 1694–1727.
  • van de Geer, (2008) van de Geer, S. 2008. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614 – 645.
  • van de Geer et al. , (2014) van de Geer, S., Bühlmann, P., Ritov, Y. A., & Dezeure, R. 2014. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.
  • Wang et al. , (2017) Wang, D., Fong, S., Wong, R. K., Mohammed, S., Fiaidhi, J., & Wong, K. K. 2017. Robust high-dimensional bioinformatics data streams mining by ODR-ioVFDT. Scientific Report, 7(43167).
  • Wen et al. , (2018) Wen, H., Shi, J., Chen, W., & Liu, Z. 2018. Transferring and generalizing deep-learning-based neural encoding models across subjects. NeuroImage, 176, 152 – 163.
  • Zhang, (2010) Zhang, C. H. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
  • Zhang & Zhang, (2014) Zhang, C. H., & Zhang, S. S. 2014. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B, 76(1), 217 – 242.
  • Zhang et al. , (2020) Zhang, Y., Han, K., Worth, R., & Liu, Z. 2020. Connecting concepts in the brain by mapping cortical representations of semantic relations. Nature Communications, 11(1), 1877.
  • Zou, (2006) Zou, H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
  • Zou & Hastie, (2005) Zou, H., & Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301–320.