
Statistical Inference in High-dimensional Generalized Linear Models with Streaming Data

Lan Luo (Department of Statistics and Actuarial Science, University of Iowa, USA), Ruijian Han (Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China), Yuanyuan Lin (Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China), and Jian Huang (Department of Statistics and Actuarial Science, University of Iowa, USA). Lan Luo and Ruijian Han contributed equally.

Abstract

In this paper, we develop an online approach to real-time estimation and statistical inference in high-dimensional generalized linear models (GLMs) with streaming data. We propose an online debiased lasso (ODL) method to accommodate the special structure of streaming data. ODL differs from the offline debiased lasso in two important respects. First, in computing the estimate at the current stage, it uses only summary statistics of the historical data. Second, in addition to debiasing an online lasso estimator, ODL corrects an approximation error term arising from nonlinear online updating with streaming data. We show that the proposed online debiased estimators for GLMs are consistent and asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of the proposed ODL method; they demonstrate the effectiveness of the algorithm and support the theoretical results. A streaming dataset from the National Automotive Sampling System-Crashworthiness Data System is analyzed to illustrate the application of the proposed method.

Keywords: Adaptive tuning, confidence interval, high-dimensional data, online debiased lasso, online error correction.

1 Introduction

Due to the explosive growth of data from non-traditional sources such as sensor networks, security logs and web applications, streaming data is becoming a core component of big data analysis. By streaming data, we refer to data that are generated continuously over time, typically in high volume and at high velocity, including log files generated by mobile or web applications, e-commerce purchases, information from social networks, and data from financial trading floors. To reduce the demand on computing memory and to achieve real-time processing, the nature of streaming data calls for algorithms that require only one pass over the data. Furthermore, many data streams are high-dimensional in nature, such as genomic data relating the variability in the phenotypes of an organism to its genetic profile (Csala et al., 2017; Wang et al., 2017) and neuroimaging data for predicting neural activity patterns under various stimuli (Wen et al., 2018; Zhang et al., 2020). At the initial stage of data collection, it is routine to encounter a situation where the number of predictors exceeds the number of observations. Moreover, even as the sample size grows along the data stream, the candidate set of predictors may contain a large proportion of redundant features. Therefore, to improve the interpretability of the results, we need to differentiate predictors in terms of their signal strength, and one classical approach is penalized regression. In this paper, we focus on a high-dimensional regression setting in generalized linear models (GLMs) with non-Gaussian outcomes. Our goal is to develop a real-time estimation and inference procedure that is highly scalable with respect to fast-growing data volumes, with no loss of efficiency in statistical inference in the presence of a large number of features.

Streaming data processing falls within the field of online learning. This line of research dates back to Robbins & Monro (1951), who proposed a stochastic approximation algorithm that laid the foundation for the popular stochastic gradient descent (SGD) algorithm (Sakrison, 1965). Since then, the SGD algorithm and its variants have been extensively studied for online estimation and prediction (Toulis & Airoldi, 2017), but online statistical inference remains largely unexplored. A recent paper by Fang (2019) proposed a perturbation-based resampling method to construct confidence intervals for SGD, but it does not achieve desirable statistical efficiency and may produce misleading inference in high-dimensional settings. In addition to SGD-type recursive algorithms, several online updating methods have been proposed specifically for sequential updating of regression coefficient estimators, including the online least squares estimator (OLSE) for the linear model, the cumulative estimating equation (CEE) estimator and the cumulatively updated estimating equation (CUEE) estimator of Schifano et al. (2016), and the renewable estimator of Luo & Song (2020) for nonlinear models.

Most of the aforementioned online algorithms are developed for low-dimensional settings where the number of features is far smaller than the total sample size. However, as pointed out above, a prominent feature of high-dimensional streaming data is that only a subset of the variables have nonzero coefficients. Besides the small sample size at the early stage of data collection, processing such data streams without properly accounting for sparsity in the feature set may introduce substantial bias and invalidate statistical inference. It is worth noting that even when the cumulative sample size eventually exceeds the number of features, traditional estimation methods for low-dimensional settings, such as maximum likelihood estimation (MLE), may still incur large bias, especially in GLMs (Sur & Candès, 2019). Therefore, state-of-the-art online learning algorithms designed for low-dimensional settings may be insufficient for processing high-dimensional data streams.

In traditional offline settings, many methods have been developed for analyzing high-dimensional static data. Most of the work on variable selection in high-dimensional regression follows the line of the lasso (Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001), the adaptive lasso (Zou, 2006), and the minimax concave penalty (MCP) (Zhang, 2010). However, variable selection methods focus on point estimation without quantifying the uncertainty in the estimates. Statistical inference problems in high-dimensional settings, including interval estimation and hypothesis testing, have since attracted much attention following the pioneering works of Zhang & Zhang (2014), van de Geer et al. (2014), Javanmard & Montanari (2014), and Belloni et al. (2015), among others. Other important methods for inference in high-dimensional linear models include Cai & Guo (2017, 2020) and Belloni et al. (2019). van de Geer et al. (2014) extended the de-sparsified lasso to high-dimensional GLMs. Ning & Liu (2017) proposed to construct confidence intervals for high-dimensional M-estimators based on a decorrelated score statistic. Recently, a novel splitting and smoothing inference approach for high-dimensional GLMs was proposed by Fei & Li (2021).

While significant progress has been made on statistical inference for high-dimensional regression in the traditional offline setting, variable selection and statistical inference for high-dimensional models with streaming data are still in their infancy. Sun et al. (2020) introduced a systematic framework for online variable selection based on popular offline methods such as MCP and the elastic net (Zou & Hastie, 2005), but their focus is not on statistical inference. Several existing methods do consider the inference problem. For example, Deshpande et al. (2019) proposed a class of online estimators for high-dimensional auto-regressive models. One of the most relevant works is a novel inference procedure for GLMs based on recursive online-score estimation (Shi et al., 2020). In both works, however, the entire dataset is assumed to be available at an initial stage for computing an initial estimator, e.g., the lasso estimator; thereafter, a forward bias-correction procedure is carried out recursively as data points arrive sequentially. The availability of the entire dataset at an initial stage is not a natural setup in online learning. To address this issue, Han et al. (2021) proposed an online debiased lasso method for statistical inference in high-dimensional linear models with streaming data.

Unlike high-dimensional linear models, where the loss function depends on the data only through sufficient statistics (Han et al., 2021), parameters and data are not linearly separable in GLMs. Motivated by the renewable estimation method in low-dimensional GLMs (Luo & Song, 2020), we start by taking a first-order Taylor expansion of the score function, which amounts to a quadratic approximation of the historical loss, to bypass the need for historical individual-level data. The key idea centers on using “approximate summary statistics” resulting from these Taylor expansions. This is, however, not a trivial extension of the methods developed for low-dimensional settings. In high-dimensional settings where predictors are spuriously correlated, a data-splitting strategy is typically used for decorrelation, with variable selection and estimation conducted on two different sub-datasets (Ning & Liu, 2017; Shi et al., 2020; Fei & Li, 2021). A prominent concern with approximate summary statistics that involve previous estimates is that they induce dependence in the corresponding estimating equation. Theoretically, the dependence among recursively updated estimators poses extra technical challenges in establishing non-asymptotic error bounds. In developing our online method for real-time confidence interval construction, we aim to address the following questions: (i) what types of approximate summary statistics should be stored to carry out an online debiasing procedure? (ii) will errors accumulate along the updating steps if we use the approximate summary statistics? (iii) will the online debiasing procedure maintain oracle properties similar to those of its offline counterpart? and (iv) how can the tuning parameter be chosen adaptively in an online setting where cross-validation, which relies on splitting the entire dataset, is not feasible?

The focus of this paper is to develop an online debiased lasso (ODL) estimator for high-dimensional generalized linear models with streaming datasets for real-time estimation and inference. Our new contributions include: (i) we propose a two-stage online estimation and debiasing framework that aligns with the streaming data collection scheme; (ii) ODL accounts for sparsity in the candidate set of predictors and provides valid statistical inference; and (iii) the ODL estimators for GLMs are shown to be consistent and asymptotically normal, which provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. ODL is inspired by the offline debiased lasso method (Zhang & Zhang, 2014; van de Geer et al., 2014; Javanmard & Montanari, 2014); however, it differs from the offline debiased lasso in two important aspects. First, in computing the estimate at the current stage, it uses only summary statistics of the historical data. Second, in addition to debiasing an online lasso estimator, ODL corrects an approximation error term arising from online updating with streaming data. This correction is crucial for guaranteeing the asymptotic normality of the ODL estimator.

This paper is organized as follows. Section 2 introduces the model formulation, followed by the proposed online two-stage debiasing method for processing high-dimensional streaming data. Section 3 establishes the large-sample properties that provide theoretical guarantees for the proposed method. Simulation experiments are presented in Section 4 to evaluate the performance of the proposed method in comparison with both the MLE and the offline debiased estimator. We apply the proposed ODL method to a real data example in Section 5. Finally, we make some concluding remarks in Section 6. All technical proofs are deferred to the Appendix.

1.1 Notation

For a matrix \bm{X}\in\mathbb{R}^{n\times p}, we let \bm{X}_{i\cdot}, \bm{X}_{\cdot j} and \bm{X}_{ij} denote the i-th row, the j-th column and the (i,j)-th element of \bm{X}, respectively. \bm{X}_{i,-j} is the sub-vector of \bm{X}_{i\cdot} with the j-th element deleted, and \bm{X}_{-i,-j} is the sub-matrix of \bm{X} with the i-th row and the j-th column deleted, the other elements remaining unchanged. For a vector \bm{x}\in\mathbb{R}^{m}, we define its \ell_{q}-norm as \lVert\bm{x}\rVert_{q}=(\sum_{i=1}^{m}|x_{i}|^{q})^{1/q}. For a sequence of random variables \{\xi_{n}\}_{n\in\mathbb{N}} and a corresponding sequence of constants \{a_{n}\}_{n\in\mathbb{N}}, we write \xi_{n}=\mathcal{O}_{p}(a_{n}) if for any \epsilon>0 there exist finite numbers M,N>0 such that P(|\xi_{n}/a_{n}|>M)<\epsilon for all n>N; in other words, \xi_{n}/a_{n} is stochastically bounded. We write \xi_{n}=o_{p}(a_{n}) if \xi_{n}/a_{n} converges to zero in probability. For streaming data, \bm{X}^{(j)} and \bm{y}^{(j)} denote the covariate matrix and the response vector arriving in the j-th batch, and \bm{X}^{(j)}_{\star} and \bm{y}^{(j)}_{\star} (with a star subscript) denote the corresponding cumulative quantities; for example, \bm{X}^{(j)}_{\star}=((\bm{X}^{(1)})^{\top},\ldots,(\bm{X}^{(j)})^{\top})^{\top}.

2 Methodology

In this section, we describe the proposed estimation method with streaming data, including the online lasso estimation and online debiased lasso estimation. We also provide an adaptive tuning method to select the regularization parameter. A rundown of our algorithm is summarized at the end of this section.

Consider a time point b\geq 2 up to which a total of N_{b} samples have arrived in a sequence of b data batches, denoted by \mathcal{D}^{\star}_{b}=\{\mathcal{D}_{1},\dots,\mathcal{D}_{b}\}, where batch \mathcal{D}_{j} contains n_{j}=|\mathcal{D}_{j}| samples, j=1,\dots,b. Assume each observation y_{i} is drawn independently from an exponential dispersion model with density function

f(y_{i}\mid\bm{x}_{i},\bm{\beta}^{0},\phi)=a(y_{i};\phi)\exp\left\{-\frac{1}{2\phi}d(y_{i};\mu_{i})\right\},\quad i\in\mathcal{D}^{\star}_{b},

where d(y_{i};\mu_{i}) is the unit deviance function involving the mean parameter \mu_{i}=\mathbb{E}(y_{i}\mid\bm{x}_{i}), and a(\cdot) is a normalizing factor depending only on the dispersion parameter \phi>0. The systematic component of a GLM takes the form \mu_{i}=\mathbb{E}(y_{i}\mid\bm{x}_{i})=g(\bm{x}_{i}^{\top}\bm{\beta}^{0}), where g(\cdot) is a known link function. The underlying regression coefficient vector \bm{\beta}^{0}\in\mathbb{R}^{p} is of primary interest and is assumed to be sparse with s_{0} nonzero elements. Specifically, we let S_{0}:=\{r:\beta^{0}_{r}\neq 0\} denote the active set of variables, whose cardinality is s_{0}. Our main goal is to conduct pointwise statistical inference for the components \beta_{r}^{0}\ (r=1,\dots,p) of the parameter vector upon the arrival of every new data batch \mathcal{D}_{b},\ b=1,2,\dots. The log-likelihood function for the cumulative dataset \mathcal{D}^{\star}_{b} is

\ell(\bm{\beta},\phi;\mathcal{D}^{\star}_{b})=\frac{1}{N_{b}}\sum_{i\in\mathcal{D}^{\star}_{b}}\log f(y_{i}\mid\bm{x}_{i},\bm{\beta},\phi)=\frac{1}{N_{b}}\sum_{i\in\mathcal{D}^{\star}_{b}}\log a(y_{i};\phi)-\frac{1}{2N_{b}\phi}\sum_{i\in\mathcal{D}^{\star}_{b}}d(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})).
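For instance, in the logistic regression model used in our numerical studies, \phi=1 and the unit deviance is d(y_{i};\mu_{i})=-2\{y_{i}\log\mu_{i}+(1-y_{i})\log(1-\mu_{i})\} with \mu_{i}=g(\bm{x}_{i}^{\top}\bm{\beta}^{0})=\exp(\bm{x}_{i}^{\top}\bm{\beta}^{0})/\{1+\exp(\bm{x}_{i}^{\top}\bm{\beta}^{0})\}.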

Based on \mathcal{D}_{b}^{\star}, the standard offline lasso estimator is defined as

\bar{\bm{\beta}}^{(b)}=\underset{\bm{\beta}\in\mathbb{R}^{p}}{\arg\min}\left\{\frac{1}{2N_{b}}\sum_{i\in\mathcal{D}^{\star}_{b}}d\left(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})\right)+\lambda_{b}\|\bm{\beta}\|_{1}\right\}, (1)

where N_{b}=\sum_{j=1}^{b}n_{j} is the cumulative sample size and \lambda_{b} is the regularization parameter. However, as discussed in Luo & Song (2020) and Han et al. (2021), the optimization in (1) requires the historical raw data \mathcal{D}^{\star}_{b-1}, which is no longer accessible in the streaming data setting, so the standard lasso estimator cannot be computed. Therefore, new online methods for obtaining the lasso estimator and the debiased lasso, tailored to streaming data, are needed; the latter serves the purpose of statistical inference. For clarity, we refer to the lasso estimator in (1) as the offline lasso estimator. Sections 2.1 and 2.2 are devoted to the construction of the online lasso estimator and the online debiasing method, respectively.

2.1 Online lasso estimator

We first consider the online lasso estimator obtained via gradient descent. Define the score function \bm{u}(y_{i};\bm{x}_{i},\bm{\beta}):=-\partial d(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta}))/\partial\bm{\beta}. The negative gradient of \mathcal{L}(\bm{\beta}):=\sum_{i\in\mathcal{D}_{b}^{\star}}d(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})), the first term in (1) up to the factor 1/(2N_{b}), is

\bm{U}^{(b)}(\bm{\beta}):=-\frac{\partial\mathcal{L}(\bm{\beta})}{\partial\bm{\beta}}=\sum_{i\in\mathcal{D}_{b}^{\star}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta}). (2)

Let U^{(j)}(\bm{\beta}):=\sum_{i\in\mathcal{D}_{j}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta}) be the score function pertaining to data batch \mathcal{D}_{j}. Then \bm{U}^{(b)}(\bm{\beta}) in (2) can be rewritten in the linear aggregation form:

\bm{U}^{(b)}(\bm{\beta})=\sum_{j=1}^{b}\sum_{i\in\mathcal{D}_{j}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta})=\sum_{j=1}^{b}U^{(j)}(\bm{\beta}).

To obtain an online estimator upon the arrival of \mathcal{D}_{b}, a key step is to update \bm{U}^{(b-1)}(\bm{\beta}) to \bm{U}^{(b)}(\bm{\beta}) without re-accessing the historical raw data \mathcal{D}_{b-1}^{\star}.

To illustrate the idea, we consider a simple case with two data batches, i.e., \mathcal{D}_{2} arrives subsequent to \mathcal{D}_{1}. The lasso estimator based on the first data batch is denoted by \widehat{\bm{\beta}}^{(1)}, which coincides with the offline estimator \bar{\bm{\beta}}^{(1)} in (1). To avoid using individual-level raw data in \mathcal{D}_{1}, we approximate \bm{U}^{(1)}(\bm{\beta}) through a first-order Taylor expansion at \widehat{\bm{\beta}}^{(1)}. Note that \widehat{\bm{\beta}}^{(1)} satisfies \lVert\bm{U}^{(1)}(\widehat{\bm{\beta}}^{(1)})\rVert_{\infty}=\mathcal{O}_{p}(\lambda_{1}N_{1}), which implies

\begin{split}\bm{U}^{(2)}(\bm{\beta})&=U^{(1)}(\bm{\beta})+U^{(2)}(\bm{\beta})\\ &=U^{(1)}(\widehat{\bm{\beta}}^{(1)})+\bm{J}^{(1)}(\widehat{\bm{\beta}}^{(1)})(\widehat{\bm{\beta}}^{(1)}-\bm{\beta})+U^{(2)}(\bm{\beta})+N_{1}\mathcal{O}_{p}(\|\widehat{\bm{\beta}}^{(1)}-\bm{\beta}\|^{2}),\end{split}

where \bm{J}^{(1)}(\bm{\beta})=-\partial U^{(1)}(\bm{\beta})/\partial\bm{\beta} denotes the information matrix. Next, we propose to approximate \bm{U}^{(2)}(\bm{\beta}) by

\widetilde{\bm{U}}^{(2)}(\bm{\beta}):=\bm{J}^{(1)}(\widehat{\bm{\beta}}^{(1)})(\widehat{\bm{\beta}}^{(1)}-\bm{\beta})+U^{(2)}(\bm{\beta}). (3)

Clearly, calculating \widetilde{\bm{U}}^{(2)}(\bm{\beta}) through (3) only requires access to the summary statistics \{\widehat{\bm{\beta}}^{(1)},\bm{J}^{(1)}(\widehat{\bm{\beta}}^{(1)})\} rather than the full dataset \mathcal{D}_{1}.

We now generalize the above approximation to an arbitrary data batch \mathcal{D}_{b}. Let \widetilde{\bm{J}}^{(b-1)}=\sum_{j=1}^{b-1}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)}) denote the aggregated information matrix, where \bm{J}^{(j)}(\bm{\beta})=-\partial U^{(j)}(\bm{\beta})/\partial\bm{\beta}. The approximation in (3) then becomes

\begin{split}\widetilde{\bm{U}}^{(b)}(\bm{\beta})&=\left\{\sum_{j=1}^{b-1}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\}(\widehat{\bm{\beta}}^{(b-1)}-\bm{\beta})+U^{(b)}(\bm{\beta})\\ &=\widetilde{\bm{J}}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-\bm{\beta})+U^{(b)}(\bm{\beta}).\end{split}

The aggregated gradient \widetilde{\bm{U}}^{(b)}(\bm{\beta}) depends on the historical data only through \{\widehat{\bm{\beta}}^{(b-1)},\widetilde{\bm{J}}^{(b-1)}\}. Hence, \widehat{\bm{\beta}}^{(b)} can be computed through the following two-step algorithm.

  • Step 1: update \widehat{\bm{\beta}}^{(b)} through

    \widehat{\bm{\beta}}^{(b)}\leftarrow\widehat{\bm{\beta}}^{(b)}+\frac{\eta}{2N_{b}}\widetilde{\bm{U}}^{(b)}(\widehat{\bm{\beta}}^{(b)}),

    where \eta is the learning rate of the gradient descent;

  • Step 2: for r=1,\dots,p, apply the soft-thresholding operator \mathcal{S}(\widehat{\beta}_{r}^{(b)};\eta\lambda_{b}) to the r-th component of \widehat{\bm{\beta}}^{(b)} obtained in Step 1:

    \mathcal{S}(\widehat{\beta}^{(b)}_{r};\eta\lambda_{b})=\begin{cases}\widehat{\beta}^{(b)}_{r}+\eta\lambda_{b},&\text{if }\widehat{\beta}^{(b)}_{r}<-\eta\lambda_{b},\\ 0,&\text{if }|\widehat{\beta}^{(b)}_{r}|\leq\eta\lambda_{b},\\ \widehat{\beta}^{(b)}_{r}-\eta\lambda_{b},&\text{if }\widehat{\beta}^{(b)}_{r}>\eta\lambda_{b}.\end{cases}

Note that \lambda_{b} is the regularization parameter chosen adaptively at step b. More details on the adaptive tuning procedure are presented in Section 2.3.

The above two steps are iterated until convergence to obtain the online lasso estimator \widehat{\bm{\beta}}^{(b)}. In our implementation, the stopping criterion is \|\eta\,\widetilde{\bm{U}}^{(b)}(\widehat{\bm{\beta}}^{(b)})/(2N_{b})\|_{2}\leq 10^{-6}. In summary, the proposed online estimator \widehat{\bm{\beta}}^{(b)} can be defined as

\widehat{\bm{\beta}}^{(b)}=\underset{\bm{\beta}\in\mathbb{R}^{p}}{\arg\min}\left[\frac{1}{2N_{b}}\left\{\sum_{i\in\mathcal{D}_{b}}d\left(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})\right)+\frac{1}{2}(\bm{\beta}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}(\bm{\beta}-\widehat{\bm{\beta}}^{(b-1)})\right\}+\lambda_{b}\lVert\bm{\beta}\rVert_{1}\right]. (4)

In contrast to the standard offline lasso estimator in (1), the proposed online estimator in (4) depends on the historical data only through the summary statistics \{\widehat{\bm{\beta}}^{(b-1)},\widetilde{\bm{J}}^{(b-1)}\} instead of \mathcal{D}_{b-1}^{\star}.
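To make the updating rule concrete, the following minimal sketch (in Python) implements one batch update of (4) for the logistic link used in our numerical studies. It is an illustration rather than our exact implementation: the function names are hypothetical, the score is written on the likelihood scale (constant factors of the deviance are absorbed into the learning rate), and the default constants are illustrative.

```python
import numpy as np

def soft_threshold(beta, thresh):
    """Componentwise soft-thresholding operator S(beta; thresh) from Step 2."""
    return np.sign(beta) * np.maximum(np.abs(beta) - thresh, 0.0)

def online_lasso_update(X_b, y_b, beta_prev, J_tilde_prev, N_prev,
                        lam, eta=0.005, tol=1e-6, max_iter=10000):
    """One batch update of the online lasso estimator in (4), assuming a logistic link.

    X_b, y_b     : covariates and responses of the current batch D_b
    beta_prev    : previous lasso estimate beta_hat^{(b-1)}
    J_tilde_prev : aggregated information matrix J_tilde^{(b-1)}
    N_prev       : cumulative sample size N_{b-1}
    lam          : regularization parameter lambda_b
    """
    n_b = X_b.shape[0]
    N_b = N_prev + n_b
    beta = beta_prev.copy()
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X_b @ beta))        # g(x_i' beta) for the logistic link
        U_batch = X_b.T @ (y_b - mu)                   # batch score U^{(b)}(beta), likelihood scale
        U_tilde = J_tilde_prev @ (beta_prev - beta) + U_batch  # approximate aggregated score
        step = (eta / N_b) * U_tilde                   # gradient step (Step 1)
        beta = soft_threshold(beta + step, eta * lam)  # soft-thresholding (Step 2)
        if np.linalg.norm(step) <= tol:
            break
    # update the aggregated information matrix with the new batch, J^{(b)}(beta_hat^{(b)})
    mu = 1.0 / (1.0 + np.exp(-X_b @ beta))
    w = mu * (1.0 - mu)                                # g'(x_i' beta_hat^{(b)})
    J_tilde_new = J_tilde_prev + (X_b * w[:, None]).T @ X_b
    return beta, J_tilde_new, N_b
```

The returned quantities \{\widehat{\bm{\beta}}^{(b)},\widetilde{\bm{J}}^{(b)},N_{b}\} are all that need to be stored for the next batch.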

2.2 Online debiased lasso

We now proceed to online statistical inference and the construction of confidence intervals for the r-th component of the regression parameter vector, r=1,\ldots,p. We first define a low-dimensional projection

\bm{\widehat{\gamma}}_{r}^{(b)}=\underset{\bm{\gamma}\in\mathbb{R}^{p-1}}{\arg\min}\left\{\frac{1}{2N_{b}}\left(\widetilde{\bm{J}}^{(b)}_{r,r}-2\widetilde{\bm{J}}^{(b)}_{r,-r}\bm{\gamma}+\bm{\gamma}^{\top}\widetilde{\bm{J}}_{-r,-r}^{(b)}\bm{\gamma}\right)+\lambda_{b}\|\bm{\gamma}\|_{1}\right\}, (5)

where N_{b} and \lambda_{b} are the same as in (4). Let \widehat{W}^{(j)}\in\mathbb{R}^{n_{j}\times n_{j}} be the diagonal matrix with diagonal elements \sqrt{g^{\prime}(\bm{X}^{(j)}\bm{\widehat{\beta}}^{(j)})} and let \widehat{\bm{X}}^{(j)}=\widehat{W}^{(j)}\bm{X}^{(j)} be the weighted design matrix. Then (5) can be recast as

\bm{\widehat{\gamma}}_{r}^{(b)}=\underset{\bm{\gamma}\in\mathbb{R}^{p-1}}{\arg\min}\left\{\frac{1}{2N_{b}}\sum_{j=1}^{b}\left\lVert\widehat{\bm{X}}^{(j)}_{\cdot r}-\widehat{\bm{X}}^{(j)}_{\cdot,-r}\bm{\gamma}\right\rVert_{2}^{2}+\lambda_{b}\|\bm{\gamma}\|_{1}\right\}. (6)

With this view, \bm{\widehat{\gamma}}_{r}^{(b)} can be computed in the same fashion as the online lasso estimator: (6) has the same form as (1) with the deviance function d(\cdot;\cdot) chosen as the squared error, so the online procedure of Section 2.1 applies directly. The required summary statistics are the sub-blocks (\widetilde{\bm{J}}^{(b)}_{r,r},\widetilde{\bm{J}}^{(b)}_{r,-r},\widetilde{\bm{J}}_{-r,-r}^{(b)}), which are already stored in \widetilde{\bm{J}}^{(b)}, and this yields a solution of (5). After obtaining \widehat{\bm{\gamma}}_{r}^{(b)} in (5), we compute

\widehat{\bm{z}}_{r}^{(b)}:=\bm{x}_{r}^{(b)}-\bm{X}_{-r}^{(b)}\widehat{\bm{\gamma}}_{r}^{(b)},\qquad \widehat{\tau}_{r}^{(b)}:=\widetilde{\bm{J}}_{r,r}^{(b)}-\widetilde{\bm{J}}_{r,-r}^{(b)}\widehat{\bm{\gamma}}_{r}^{(b)}.

Denote \widetilde{\bm{\gamma}}^{(j)}_{r}=(\widehat{\gamma}^{(j)}_{r,1},\ldots,-1,\ldots,\widehat{\gamma}^{(j)}_{r,p-1})^{\top}\in\mathbb{R}^{p}, whose r-th element is -1. Then, upon the arrival of data batch \mathcal{D}_{b}, we define the proposed online debiased lasso estimator as

\widehat{\beta}^{(b)}_{\text{on},r}=\widehat{\beta}^{(b)}_{r}+\frac{1}{\widehat{\tau}_{r}^{(b)}}\left[\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\bm{\widehat{\beta}}^{(j)})\right\}+\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\{\widehat{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(j)}\}\right]
\equiv\widehat{\beta}^{(b)}_{r}+\text{debiasing term}+\text{online error correction term}. (7)

It is worth noting that the debiased lasso estimator involves the initial lasso estimator defined in (4), as well as two additional terms: a debiasing term and an online error correction term. van de Geer et al. (2014) studied the offline version of the debiased lasso for GLMs, and our debiasing term can be viewed as an online generalization of its offline counterpart. However, the two are rather different in the sense that the debiasing term alone does not suffice to ensure asymptotic normality in the online setting. Since \widetilde{\bm{J}}^{(b)} is used to approximate the information matrix \bm{J}(\bm{\beta}^{0}), the approximation error accumulates even though each \widehat{\bm{\beta}}^{(j)}, j=1,\ldots,b, is consistent for \bm{\beta}^{0}. The correction term in (7) eliminates the approximation error arising from the online update.

Meanwhile, the proposed debiased lasso estimator with the online error correction term aligns with the online learning framework, as (7) requires only the following summary statistics rather than the entire dataset \mathcal{D}_{b}^{\star}:

s_{1}^{(b)}=\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\widehat{\bm{\beta}}^{(j)})\right\},\quad s_{2}^{(b)}=\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\widehat{\bm{\beta}}^{(j)},\quad \bm{S}^{(b)}=\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)}), (8)

which keep the same size when new data arrive, and can be easily updated.

The asymptotic normality of the online debiased lasso estimator and the consistency of the two lasso-type estimators in (4) and (5) are established in Section 3. For variance estimation, let

v^{(b)}=\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\widehat{\bm{\beta}}^{(j)})\right\}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\widehat{\bm{\beta}}^{(j)})\right\}^{\top}\bm{\widehat{z}}^{(j)}_{r}. (9)

The estimated standard error \widehat{\sigma}^{(b)}_{r}:=\sqrt{v^{(b)}}/\widehat{\tau}_{r}^{(b)} can also be updated online accordingly.
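The following minimal sketch (in Python, assuming a logistic link) illustrates how the summary statistics in (8)-(9) can be accumulated batch by batch and then combined into the debiased estimate (7) and its standard error. The function names are hypothetical, and \widehat{\tau}_{r}^{(b)} is assumed to be computed from the stored \widetilde{\bm{J}}^{(b)} and \widehat{\bm{\gamma}}_{r}^{(b)} as defined above.

```python
import numpy as np

def update_debias_summaries(X_b, y_b, beta_b, gamma_r_b, r, s1, s2, S, v):
    """Accumulate the summary statistics in (8) and (9) for coordinate r,
    using the current batch D_b and the batch-level estimates (logistic link assumed)."""
    mu = 1.0 / (1.0 + np.exp(-X_b @ beta_b))                  # g(X^{(b)} beta_hat^{(b)})
    resid = y_b - mu
    z_r = X_b[:, r] - np.delete(X_b, r, axis=1) @ gamma_r_b   # low-dimensional projection z_hat_r^{(b)}
    gamma_tilde = np.insert(gamma_r_b, r, -1.0)               # gamma_tilde_r^{(b)} with -1 in position r
    w = mu * (1.0 - mu)
    J_b = (X_b * w[:, None]).T @ X_b                          # batch information J^{(b)}(beta_hat^{(b)})
    s1 += z_r @ resid                                         # contribution to s_1^{(b)}
    s2 += gamma_tilde @ (J_b @ beta_b)                        # contribution to s_2^{(b)}
    S = S + gamma_tilde @ J_b                                 # contribution to S^{(b)} (length-p vector)
    v += (z_r @ resid) ** 2                                   # contribution to v^{(b)} in (9)
    return s1, s2, S, v

def debiased_estimate(beta_b, r, tau_r, s1, s2, S, v):
    """Online debiased lasso estimate (7) and its standard error for coordinate r,
    with tau_r = J_tilde_{r,r}^{(b)} - J_tilde_{r,-r}^{(b)} gamma_hat_r^{(b)}."""
    beta_on_r = beta_b[r] + (s1 + S @ beta_b - s2) / tau_r    # debiasing + error correction terms
    se_r = np.sqrt(v) / tau_r                                 # sigma_hat_r^{(b)}
    return beta_on_r, se_r
```

Here the error correction term is evaluated as \bm{S}^{(b)}\widehat{\bm{\beta}}^{(b)}-s_{2}^{(b)}, which is exactly how the stored statistics in (8) enter (7).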

2.3 Adaptive tuning

In an offline setting, the regularization parameter \lambda is typically determined by cross-validation, where the entire dataset is split into training and test sets multiple times. Since the full dataset is not accessible in an online setting, such a procedure is not feasible. To align with the nature of streaming data, we use the “rolling-original-recalibration” procedure with the mean squared prediction error (MSE) as the cross-validation criterion (Han et al., 2021). Specifically, at time point b, the cumulative dataset up to time point b-1 serves as the training set, while the new data batch \mathcal{D}_{b} is the test set. Instead of re-accessing the raw data \{\mathcal{D}_{1},\dots,\mathcal{D}_{b-1}\}, we plug in \widehat{\bm{\beta}}^{(b-1)} to evaluate the prediction error for each \lambda in a candidate set T_{\lambda}:

PE_{b}(\lambda)=n_{b}^{-1}\|\bm{y}^{(b)}-g(\bm{X}^{(b)}\widehat{\bm{\beta}}^{(b-1)}(\lambda))\|^{2}_{2},\quad \lambda\in T_{\lambda}, (10)

and choose \lambda_{b}:=\underset{\lambda\in T_{\lambda}}{\arg\min}\ PE_{b}(\lambda). The initial \lambda_{1} is selected by classical offline cross-validation.
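A minimal sketch of this tuning rule is given below (in Python, assuming a logistic link); the candidate grid T_{\lambda} and the map from \lambda to the stored estimate \widehat{\bm{\beta}}^{(b-1)}(\lambda) are taken as given, and the function name is illustrative.

```python
import numpy as np

def choose_lambda(X_new, y_new, beta_prev_by_lambda, lambda_grid):
    """Rolling-original-recalibration tuning as in (10): the newly arrived batch D_b
    serves as the test set for the estimates beta_hat^{(b-1)}(lambda) fitted so far."""
    pe = []
    for lam in lambda_grid:
        mu = 1.0 / (1.0 + np.exp(-X_new @ beta_prev_by_lambda[lam]))  # g(X^{(b)} beta_hat^{(b-1)}(lambda))
        pe.append(np.mean((y_new - mu) ** 2))                         # PE_b(lambda)
    return lambda_grid[int(np.argmin(pe))]                            # lambda_b
```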

2.4 Algorithm

We summarize the results of Sections 2.1-2.3 in Figure 1 and Algorithm 1. The algorithm consists of two main blocks: online lasso estimation and online low-dimensional projection. Outputs from both blocks are used to compute the online debiased lasso estimator and to construct confidence intervals in real time. In particular, when a new data batch \mathcal{D}_{b} arrives, it is first sent to the online lasso estimation block, where the summary statistics \{\widehat{\bm{\beta}}^{(b-1)},\widetilde{\bm{J}}^{(b-1)}\} are used to compute \widetilde{\bm{U}}^{(b)}. We then use gradient descent to update the lasso estimator \widehat{\bm{\beta}}^{(b-1)} to \widehat{\bm{\beta}}^{(b)} over a sequence of tuning parameter values without retrieving the whole dataset. At the same time, regarding the cumulative dataset that produced the previous lasso estimate \widehat{\bm{\beta}}^{(b-1)} as the training set and the newly arrived \mathcal{D}_{b} as the test set, we choose the tuning parameter \lambda_{b} that gives the smallest prediction error. The selected \lambda_{b} and sub-matrices of \widetilde{\bm{J}}^{(b)} are then passed to the low-dimensional projection block to compute \widehat{\bm{\gamma}}_{r}^{(b)}(\lambda_{b}). The resulting projection \widehat{\bm{z}}_{r}^{(b)} and \widehat{\tau}_{r}^{(b)}, together with the lasso estimator \widehat{\bm{\beta}}^{(b)}, are used to compute the debiased lasso estimator \widehat{\beta}^{(b)}_{\text{on},r} and its estimated standard error \widehat{\sigma}_{r}^{(b)}.

Figure 1: Flowchart of the online debiasing algorithm. When a new data batch \mathcal{D}_{b} arrives, it is sent to the lasso estimation block for updating \widehat{\bm{\beta}}^{(b-1)} to \widehat{\bm{\beta}}^{(b)}. At the same time, it is also used as the test set for adaptively choosing the tuning parameter \lambda_{b}. In the low-dimensional projection block, we extract sub-matrices from the updated information matrix \widetilde{\bm{J}}^{(b)} to compute \widehat{\bm{\gamma}}_{r}^{(b)}(\lambda_{b}) and the corresponding low-dimensional projection \widehat{\bm{z}}_{r}^{(b)} and \widehat{\tau}_{r}^{(b)}. The outputs \widehat{\beta}_{r}^{(b)}(\lambda_{b}), \widehat{\bm{z}}_{r}^{(b)} and \widehat{\tau}_{r}^{(b)} are further used to compute the debiased lasso estimator \widehat{\beta}_{\text{on},r}^{(b)} and its estimated standard error \widehat{\sigma}_{r}^{(b)}.
Algorithm 1 Online Debiased Lasso Algorithm for GLMs
Input: learning rate \eta; a set T_{\lambda} of candidate values for the penalty parameter.
Output: debiased lasso estimator \widehat{\beta}_{\text{on},r}^{(b)} and its estimated standard error \widehat{\sigma}^{(b)}_{r}.
for b=1,2,\ldots do
     Receive the streaming data batch \mathcal{D}_{b};
     For each \lambda\in T_{\lambda}, update the lasso estimator \widehat{\bm{\beta}}^{(b)}(\lambda) defined in (4) via gradient descent (GD);
     Determine \lambda_{b} by the minimum MSE, \lambda_{b}:=\underset{\lambda\in T_{\lambda}}{\arg\min}\ PE_{b}(\lambda) defined in (10);
     Update and store the summary statistics \{\widehat{\bm{\beta}}^{(b)},\widetilde{\bm{J}}^{(b)}\};
     Given \lambda_{b}, find the low-dimensional projection \widehat{\bm{\gamma}}_{r}^{(b)} defined in (5) via GD;
     Update and store the summary statistics s_{1}^{(b)}, s_{2}^{(b)} and \bm{S}^{(b)} by (8);
     Compute \widehat{\beta}_{\text{on},r}^{(b)} by (7) and \widehat{\sigma}^{(b)}_{r} by (9);
end for
return \widehat{\beta}_{\text{on},r}^{(b)} and its estimated standard error \widehat{\sigma}^{(b)}_{r}.
Remark 1.

Algorithm 1 computes the estimators in chronological order, namely
(\widehat{\bm{\beta}}^{(1)},\widehat{\bm{\gamma}}_{r}^{(1)},\widehat{\beta}_{\text{on},r}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b)},\widehat{\bm{\gamma}}_{r}^{(b)},\widehat{\beta}_{\text{on},r}^{(b)}). Since \widetilde{\bm{J}}^{(b-1)} involves \widehat{\bm{\beta}}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b-1)}, there is dependence among the estimators. For instance, both \widehat{\bm{\gamma}}_{r}^{(b-1)} and \widehat{\bm{\beta}}^{(b)} depend on the previous lasso estimates (\widehat{\bm{\beta}}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b-1)}).

Remark 2.

When p is large, it may be challenging to implement Algorithm 1, since the space complexity of storing the aggregated information matrix \widetilde{\bm{J}}^{(b)} is \mathcal{O}(p^{2}). To reduce memory usage, we can compute the eigenvalue decomposition (EVD) \widetilde{\bm{J}}^{(b)}=Q_{b}\Lambda_{b}Q_{b}^{\top}, where Q_{b} is the p\times N_{b} column-orthogonal matrix of eigenvectors and \Lambda_{b} is the N_{b}\times N_{b} diagonal matrix whose diagonal elements are the eigenvalues of \widetilde{\bm{J}}^{(b)}; we then only need to store Q_{b} and \Lambda_{b}. Since r_{b}=\mathrm{rank}(\Lambda_{b})\leq\min\{N_{b},p\}, we can use an incremental EVD approach (Cardot & Degras, 2018) to update Q_{b} and \Lambda_{b}, so the space complexity reduces to \mathcal{O}(r_{b}p). The space complexity can be further reduced by thresholding, for example, by keeping only the principal components that explain most of the variation in the predictors. However, the incremental EVD increases the computational cost, as it requires an additional \mathcal{O}(r_{b}^{2}p) operations. There is thus a trade-off between space complexity and computational complexity; how to balance this trade-off is an important computational issue that deserves careful analysis but is beyond the scope of this study.
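As a minimal sketch of this idea, assuming the information contribution of a new batch can be written as \widehat{\bm{X}}^{\top}\widehat{\bm{X}} for a weighted batch design \widehat{\bm{X}} (as in Section 2.2), the thin eigendecomposition of \widetilde{\bm{J}}^{(b)} can be refreshed from an SVD of a p\times(r_{b}+n_{b}) factor, so that no p\times p matrix is ever formed; the function name and the truncation rule below are illustrative.

```python
import numpy as np

def incremental_evd(Q, lam, Xw_new, tol=1e-10):
    """Update the thin eigendecomposition J_tilde = Q diag(lam) Q^T after a new batch.

    Q      : p x r matrix of eigenvectors (r = current rank r_b)
    lam    : length-r array of eigenvalues
    Xw_new : n_b x p weighted batch design, whose information contribution is Xw_new^T Xw_new
    """
    # J_new = Q diag(lam) Q^T + Xw^T Xw = B B^T, with B = [Q diag(sqrt(lam)), Xw^T] of size p x (r + n_b)
    B = np.hstack([Q * np.sqrt(lam), Xw_new.T])
    U, s, _ = np.linalg.svd(B, full_matrices=False)   # O((r + n_b)^2 p) cost; no p x p matrix formed
    keep = s**2 > tol                                  # optionally drop negligible eigenvalues
    return U[:, keep], (s**2)[keep]
```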

3 Asymptotic properties

In this section, we state our main theoretical results: the consistency of the lasso estimators \widehat{\bm{\beta}}^{(b)} and \bm{\widehat{\gamma}}_{r}^{(b)} defined in (4) and (5), respectively, as well as the asymptotic normality of the online debiased estimator \widehat{\beta}^{(b)}_{\text{on},r}.

Before stating the main results, we introduce some notation. Recall that \bm{\beta}^{0} is the true coefficient vector. Consider a random design matrix \bm{X} with i.i.d. rows. Let \bm{J} denote the population version of the information matrix in the GLM, i.e., \bm{J}=\mathbb{E}[\bm{J}^{(1)}(\bm{\beta}^{0})]/N_{1}, and let \bm{\Theta}=\bm{J}^{-1} be its inverse. Then, the ground truth of \bm{\widehat{\gamma}}_{r}^{(b)} given in (5) is defined as

\bm{\gamma}_{r}^{0}=\underset{\bm{\gamma}\in\mathbb{R}^{p-1}}{\arg\min}\ \mathbb{E}\left({\bm{J}}_{r,r}-2{\bm{J}}_{r,-r}\bm{\gamma}+\bm{\gamma}^{\top}{\bm{J}}_{-r,-r}\bm{\gamma}\right).

Recall that S_{0}:=\{r:\beta^{0}_{r}\neq 0\} and s_{0}=|S_{0}|. In addition, let S_{r}=\{k\neq r:\bm{\Theta}_{k,r}\neq 0\} and s_{r}=|S_{r}| for r=1,\ldots,p. The following assumptions are needed to establish the consistency of the lasso estimators \widehat{\bm{\beta}}^{(b)} and \widehat{\bm{\gamma}}_{r}^{(b)} defined in (4) and (5), respectively.

Assumption 1.

Suppose that

(A1) The pairs of random variables \{y_{i},\bm{x}_{i}\}_{i\in\mathcal{D}_{b}^{\star}} are i.i.d. The covariates are bounded by some finite constant K>0, i.e., \sup_{i\in\mathcal{D}_{b}^{\star}}\lVert\bm{x}_{i}\rVert_{\infty}\leq K.

(A2) \sup_{i\in\mathcal{D}_{b}^{\star}}|\bm{x}_{i}^{\top}\bm{\beta}^{0}|=\mathcal{O}(1) and \sup_{i\in\mathcal{D}_{b}^{\star}}|(\bm{x}_{i})_{-r}^{\top}\bm{\gamma}_{r}^{0}|=\mathcal{O}(1), where (\bm{x}_{i})_{-r} is the sub-vector of \bm{x}_{i} with the r-th element deleted.

(A3) The derivative g^{\prime}(\cdot) exists. For some \delta-neighborhood (\delta>0), g^{\prime}(\cdot) is locally Lipschitz with constant L, that is,

\sup_{i\in\mathcal{D}_{b}^{\star}}\ \sup_{\bm{\beta},\bm{\beta}^{\prime}\in\{\bm{\beta}:\lVert\bm{\beta}-\bm{\beta}^{0}\rVert_{1}\leq\delta\}}\frac{|g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})-g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta}^{\prime})|}{|\bm{x}_{i}^{\top}\bm{\beta}-\bm{x}_{i}^{\top}\bm{\beta}^{\prime}|}\leq L.

In addition, for all \lVert\bm{\beta}-\bm{\beta}^{0}\rVert_{1}\leq\delta, \sup_{i\in\mathcal{D}_{b}^{\star}}|g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})|=\mathcal{O}(1) and \sup_{i\in\mathcal{D}_{b}^{\star}}|1/g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})|=\mathcal{O}(1).

(A4) The smallest eigenvalue of \bm{J} is bounded away from zero.

The above assumptions are regular conditions in the study of lasso properties (van de Geer, 2008; van de Geer et al., 2014). (A1) assumes the streaming data are homogeneous and the covariates are sampled from a bounded distribution. (A2) is imposed to rule out extreme cases. (A3) requires local Lipschitz continuity of the derivative of the mean function around the true value; it is easily verified that logistic regression, a special case of GLMs, satisfies this condition. (A4) is needed to ensure that the compatibility condition holds.

Theorem 1.

Assume Assumption 1 holds. Suppose that the first batch size satisfies n_{1}\geq cs_{0}\log p for some constant c, that b=o(\log N_{b}), and that the tuning parameters are \lambda_{j}=C\sqrt{\log p/N_{j}}, j=1,\ldots,b, for some constant C. Then, for any j=1,\ldots,b, with probability at least 1-p^{-2}, the proposed online estimator in (4) satisfies

\lVert\widehat{\bm{\beta}}^{(j)}-\bm{\beta}^{0}\rVert_{1}\leq c_{1}^{(j)}s_{0}\lambda_{j},\qquad \lVert\bm{X}^{(j)}_{\star}(\widehat{\bm{\beta}}^{(j)}-\bm{\beta}^{0})\rVert^{2}_{2}\leq c_{2}^{(j)}s_{0}N_{j}\lambda_{j}^{2}. (11)
Remark 3.

Theorem 1 provides upper bounds on the estimation error and the prediction error of the online lasso estimator \widehat{\bm{\beta}}^{(b)}. The constants c_{1}^{(j)} and c_{2}^{(j)}, j=1,\ldots,b, possibly depend on the batch step b. Recall that the lasso estimator \widehat{\bm{\beta}}^{(b)} depends on \widehat{\bm{\beta}}^{(b-1)}, so the estimation error of the previous step is carried onto the updated estimator. As a result, it is inevitable that some constants in the oracle inequality depend on b; nonetheless, they remain well under control as long as b=o(\log N_{b}).

The next corollary shows the consistency of the proposed online lasso estimator in (4).

Corollary 1.

Assume the conditions of Theorem 1 hold. If there exists an \epsilon>0 such that s_{0}^{2}\log(p)N_{b}^{-1+\epsilon}=o(1), then the lasso estimator in (4) satisfies

\lVert\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}\rightarrow_{p}0\ \text{as}\ N_{b}\to\infty,

where \rightarrow_{p} denotes convergence in probability.

Similarly, we present the oracle inequality for \widehat{\bm{\gamma}}_{r}^{(b)} in the next theorem.

Theorem 2.

Assume Assumption 1 holds and that the cumulative batch sizes satisfy N_{j}\geq cc^{(j)}_{1}s_{0}^{2}s_{r}^{2}\log p, j=1,\ldots,b, for some constant c. Then, for any j=1,\ldots,b, with probability at least 1-p^{-2}, the low-dimensional projection defined in (5) with \lambda_{j}=c\sqrt{\log p/N_{j}} satisfies

\lVert\widehat{\bm{\gamma}}_{r}^{(j)}-\bm{\gamma}_{r}^{0}\rVert_{1}\leq c_{3}s_{r}\lambda_{j}. (12)

Combining the results in Theorem 1 and Theorem 2, we are ready to establish the asymptotic normality of the ODL estimator.

Theorem 3.

Assume the conditions of Theorems 1 and 2 hold. If there exists an \epsilon>0 such that

s_{0}^{2}\log(p)\log(N_{b})N_{b}^{-\frac{1}{2}+\epsilon}=o(1),\qquad s_{0}s_{r}\log(p)N_{b}^{-\frac{1}{2}+\epsilon}=o(1),

then for sufficiently large NbN_{b},

\frac{\widehat{\tau}_{r}^{(b)}}{\sqrt{N_{b}}}(\widehat{\beta}^{(b)}_{\text{on},r}-\beta_{r}^{0})=W_{r}+V_{r},\qquad W_{r}=\frac{1}{\sqrt{N_{b}}}\sum_{j=1}^{b}\{\widehat{\bm{z}}^{(j)}_{r}\}^{\top}\left(\bm{y}^{(j)}-g(\bm{X}^{(j)}\bm{\beta}^{0})\right),\qquad V_{r}=o_{p}(1).

According to Theorem 3, the quantity \widehat{\tau}_{r}^{(b)}(\widehat{\beta}^{(b)}_{\text{on},r}-\beta_{r}^{0})/\sqrt{N_{b}} decomposes as the sum of W_{r} and V_{r}, where W_{r} converges in distribution to a normal distribution by the martingale central limit theorem and V_{r} vanishes as N_{b} goes to infinity.
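In particular, combined with the variance estimator v^{(b)} in (9), Theorem 3 suggests a Wald-type (1-\alpha) confidence interval of the form

\widehat{\beta}^{(b)}_{\text{on},r}\ \pm\ z_{1-\alpha/2}\,\widehat{\sigma}^{(b)}_{r},\qquad \widehat{\sigma}^{(b)}_{r}=\frac{\sqrt{v^{(b)}}}{\widehat{\tau}^{(b)}_{r}},

where z_{1-\alpha/2} is the standard normal quantile; intervals of this form with \alpha=0.05 correspond to the 95% confidence intervals evaluated in the numerical studies below.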

Remark 4.

Theorem 3 implies that the total data size N_{b} could be as small as the logarithm of the dimensionality p, which is a common condition for the offline debiased lasso in the literature (Zhang & Zhang, 2014; van de Geer et al., 2014). However, the setting of the offline debiased lasso is quite different from the online framework considered in this paper. Due to the lack of access to the whole dataset, it is much more difficult to derive the asymptotic properties of the online debiased lasso. One major difficulty is that we approximate the information matrix by \widetilde{\bm{J}}^{(b)}, which involves estimators obtained in previous steps; as a result, there is dependence among \widehat{\bm{\beta}}^{(1)},\ldots,\widehat{\bm{\beta}}^{(b-1)},\widehat{\bm{\beta}}^{(b)}. Another difficulty is to handle the approximation error that accumulates in the online updating, especially in high-dimensional settings. In contrast, the classical offline debiased lasso does not face these two problems. Even for the online debiased lasso in the linear model (Han et al., 2021), these two problems can be bypassed by exploiting the special structure of least squares.

4 Simulation studies

4.1 Setup

In this section, we conduct simulation studies to examine the finite-sample performance of the proposed ODL procedure in high-dimensional GLMs. We randomly generate a total of N_{b} samples arriving in a sequence of b data batches, denoted by \{\mathcal{D}_{1},\dots,\mathcal{D}_{b}\}, from the logistic regression model. Specifically,

\mathbb{E}(y^{(j)}_{i}\mid\bm{x}_{i}^{(j)})=\frac{\exp(\{\bm{x}_{i}^{(j)}\}^{\top}\bm{\beta}^{0})}{1+\exp(\{\bm{x}_{i}^{(j)}\}^{\top}\bm{\beta}^{0})},\quad i=1,\dots,n_{j};\ j=1,\dots,b,

where \bm{x}_{i}^{(j)}\sim\mathcal{N}(\bm{0},\bm{\Sigma}) and \bm{\beta}^{0}\in\mathbb{R}^{p} is a p-dimensional sparse parameter vector. Recall that s_{0} is the number of nonzero components of \bm{\beta}^{0}. We set half of the nonzero coefficients to 1 (relatively strong signals) and the other half to 0.01 (weak signals). We consider the following settings: (i) N_{b}=120, b=12, n_{j}=10 for j=1,\ldots,12, p=100 and s_{0}=6; (ii) N_{b}=624, b=12, n_{j}=52 for j=1,\ldots,12, p=600 and s_{0}=10. Under each setting, two types of \bm{\Sigma} are considered: (a) \bm{\Sigma}=\bm{I}_{p}; (b) \bm{\Sigma}=\{0.5^{|i-j|}\}_{i,j=1,\dots,p}. We set the step size in gradient descent to \eta=0.005.
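A minimal sketch of this data-generating scheme is given below (in Python); the placement of the nonzero coefficients in the first s_{0} coordinates is an illustrative choice, and the function name is hypothetical.

```python
import numpy as np

def generate_stream(Nb, b, p, s0, ar_rho=None, seed=0):
    """Generate b logistic-regression data batches as in Section 4.1.

    Half of the s0 nonzero coefficients are 1 (strong) and half are 0.01 (weak);
    covariates are N(0, Sigma) with Sigma = I_p or Sigma_ij = ar_rho^{|i-j|}.
    """
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[: s0 // 2] = 1.0                 # strong signals
    beta0[s0 // 2 : s0] = 0.01             # weak signals
    if ar_rho is None:
        Sigma = np.eye(p)
    else:
        idx = np.arange(p)
        Sigma = ar_rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)
    n_j = Nb // b                          # equal batch sizes as in settings (i) and (ii)
    batches = []
    for _ in range(b):
        X = rng.standard_normal((n_j, p)) @ L.T
        prob = 1.0 / (1.0 + np.exp(-X @ beta0))
        y = rng.binomial(1, prob)
        batches.append((X, y))
    return batches, beta0
```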

The objective is to conduct both estimation and inference along the arrival of the sequence of data batches. The evaluation criteria include: the averaged absolute bias in estimating \bm{\beta}^{0} (A.bias); the averaged estimated standard error (ASE); the empirical standard error (ESE); the coverage probability (CP) of the 95% confidence intervals; and the averaged length of the 95% confidence intervals (ACL). These metrics are evaluated separately for three groups: (i) \beta^{0}_{r}=0, (ii) \beta^{0}_{r}=0.01 and (iii) \beta^{0}_{r}=1. Comparison is made among (i) the maximum likelihood estimator obtained by fitting the conventional GLM at the terminal point b where N_{b}>p, (ii) the offline debiased \ell_{1}-penalized estimator at the terminal point b, which serves as the benchmark, and (iii) the proposed ODL estimator at several intermediate points j=1,\dots,b. The two offline methods are implemented using the R package hdi (Dezeure et al., 2015) and the R function glm, respectively. The results are reported in Tables 1-4.
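For concreteness, a sketch of how these metrics can be computed from the replications is given below; the exact aggregation (averaging within each coefficient group) is our own formalization of the criteria described above, and the function name is illustrative.

```python
import numpy as np

def summarize(est, se, beta0, group_value, z=1.96):
    """Summarize replications for the coordinates whose true value equals `group_value`.

    est, se : arrays of shape (n_rep, p) with point estimates and estimated standard errors
    beta0   : true coefficient vector of length p
    """
    cols = np.where(beta0 == group_value)[0]
    e, s = est[:, cols], se[:, cols]
    lower, upper = e - z * s, e + z * s
    return {
        "A.bias": np.mean(np.abs(e - beta0[cols])),                       # averaged absolute bias
        "ASE": np.mean(s),                                                # averaged estimated standard error
        "ESE": np.mean(np.std(e, axis=0, ddof=1)),                        # empirical standard error
        "CP": np.mean((lower <= beta0[cols]) & (beta0[cols] <= upper)),   # 95% CI coverage probability
        "ACL": np.mean(upper - lower),                                    # averaged CI length
    }
```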

4.2 Bias and coverage probability

It can be seen from Tables 1-4 that the estimation bias of the ODL estimator decreases rapidly as the number of data batches b increases from 2 to 12. Both the estimated standard errors and the averaged lengths of the 95% confidence intervals exhibit a similar decreasing trend over time, and almost coincide with those of the offline benchmark method at the terminal point. Moreover, ODL enjoys great computational efficiency: it is almost 16 times faster than its offline counterpart.

It is worth noting that at the terminal point, where the cumulative sample size N_{b} is only slightly larger than p, we can still fit the conventional GLM to obtain the MLE. However, the MLE fails to provide reliable coverage probabilities due to severely large biases and estimated standard errors. In particular, the estimation bias of the MLE is about 50 times that of the offline or online debiased lasso when p=100, as shown in Tables 1 and 2, and increases to about 80 times that of the debiased estimators when p=600. Furthermore, as clearly indicated by the large empirical standard errors, the MLE in this setting suffers from severe instability. Such invalid estimation and inference results for the MLE further demonstrate the advantage of the proposed online debiased method under the high-dimensional sparse logistic regression setting with streaming data.

Table 1: N_{b}=120, b=12, p=100, s_{0}=6, \bm{\Sigma}=\bm{I}_{p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.05 0.05 0.05 10^{-4} 10^{-4}
A.bias 0 2.290 0.035 0.080 0.056 0.049 0.038 0.036 0.033
0.01 1.150 0.038 0.131 0.036 0.035 0.033 0.020 0.034
1 17.87 0.057 0.176 0.125 0.125 0.125 0.120 0.109
ASE 0 1.78\times 10^{6} 0.595 1.336 0.949 0.777 0.679 0.609 0.559
0.01 1.79\times 10^{6} 0.591 1.323 0.939 0.771 0.674 0.604 0.555
1 1.83\times 10^{6} 0.598 1.341 0.950 0.778 0.680 0.609 0.561
ESE 0 34.90 0.591 1.367 0.962 0.786 0.681 0.611 0.559
0.01 36.25 0.581 1.341 0.996 0.792 0.686 0.597 0.553
1 33.49 0.585 1.327 0.903 0.755 0.664 0.593 0.548
CP 0 1.000 0.953 0.951 0.949 0.948 0.949 0.950 0.950
0.01 1.000 0.960 0.943 0.938 0.943 0.947 0.960 0.953
1 1.000 0.965 0.955 0.957 0.962 0.953 0.942 0.953
ACL 0 6.96\times 10^{6} 2.487 5.239 3.721 3.047 2.663 2.385 2.190
0.01 7.02\times 10^{6} 2.478 5.187 3.681 3.022 2.641 2.366 2.176
1 7.18\times 10^{6} 2.479 5.258 3.725 3.051 2.666 2.387 2.197
C.Time (s) 0.03 29.97 2.026
Table 2: N_{b}=120, b=12, p=100, s_{0}=6, \bm{\Sigma}=\{0.5^{|i-j|}\}_{i,j=1,\dots,p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.05 0.05 0.05 10^{-4} 10^{-4}
A.bias 0 2.467 0.053 0.119 0.093 0.084 0.078 0.073 0.068
0.01 1.546 0.048 0.138 0.070 0.093 0.068 0.065 0.064
1 17.54 0.026 0.122 0.122 0.113 0.087 0.095 0.084
ASE 0 2.45\times 10^{6} 0.674 1.341 0.955 0.781 0.680 0.614 0.567
0.01 2.43\times 10^{6} 0.673 1.332 0.948 0.780 0.679 0.612 0.565
1 2.36\times 10^{6} 0.667 1.344 0.962 0.783 0.682 0.616 0.567
ESE 0 45.93 0.671 1.358 0.953 0.781 0.679 0.608 0.556
0.01 45.47 0.667 1.387 0.935 0.765 0.644 0.595 0.557
1 45.87 0.657 1.334 0.940 0.774 0.659 0.592 0.544
CP 0 1.000 0.951 0.951 0.951 0.948 0.947 0.947 0.949
0.01 1.000 0.950 0.945 0.960 0.953 0.958 0.950 0.947
1 1.000 0.963 0.953 0.960 0.945 0.947 0.960 0.955
ACL 0 9.59\times 10^{6} 2.643 5.258 3.745 3.063 2.667 2.406 2.222
0.01 9.51\times 10^{6} 2.640 5.222 3.717 3.057 2.661 2.398 2.214
1 9.26\times 10^{6} 2.614 5.267 3.773 3.068 2.672 2.414 2.222
C.Time (s) 0.03 29.02 2.166
Table 3: N_{b}=624, b=12, p=600, s_{0}=10, \bm{\Sigma}=\bm{I}_{p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.01 0.01 0.01 10^{-4} 10^{-4}
A.bias 0 1.101 0.014 0.034 0.024 0.020 0.018 0.015 0.014
0.01 1.553 0.011 0.038 0.014 0.012 0.009 0.009 0.010
1 11.99 0.069 0.168 0.152 0.152 0.141 0.131 0.121
ASE 0 1.82\times 10^{6} 0.259 0.573 0.412 0.341 0.297 0.268 0.245
0.01 1.80\times 10^{6} 0.259 0.572 0.411 0.340 0.297 0.268 0.246
1 1.81\times 10^{6} 0.261 0.573 0.411 0.341 0.297 0.268 0.246
ESE 0 19.94 0.254 0.578 0.041 0.034 0.030 0.028 0.025
0.01 19.72 0.257 0.572 0.397 0.325 0.287 0.264 0.251
1 20.37 0.262 0.579 0.407 0.334 0.297 0.262 0.244
CP 0 1.000 0.955 0.947 0.947 0.949 0.948 0.950 0.950
0.01 1.000 0.953 0.945 0.959 0.954 0.950 0.956 0.947
1 1.000 0.943 0.933 0.935 0.933 0.925 0.927 0.928
ACL 0 7.15\times 10^{6} 1.016 2.246 1.613 1.335 1.163 1.050 0.962
0.01 7.07\times 10^{6} 1.016 2.241 1.610 1.334 1.164 1.050 0.962
1 7.09\times 10^{6} 1.022 2.246 1.613 1.335 1.164 1.050 0.963
C.Time (min) 0.05 117.5 7.08
Table 4: N_{b}=624, b=12, p=600, s_{0}=10, \bm{\Sigma}=\{0.5^{|i-j|}\}_{i,j=1,\dots,p}. Performance on statistical inference. “MLE” is the offline estimator obtained by fitting the traditional GLM, “offline” corresponds to the offline debiased \ell_{1}-norm penalized estimator, and “ODL” represents our proposed online debiased lasso estimator. The tuning parameter \lambda is chosen from T_{\lambda}=\{10^{-4},10^{-3},0.01,0.05\}. Simulation results are summarized over 200 replications; the reported \lambda is the one selected with the highest frequency among the 200 replications.
\beta_{0,r} MLE offline ODL
data batch index 2 4 6 8 10 12
\lambda 0.05 0.05 10^{-4} 10^{-4} 10^{-4} 10^{-4}
A.bias 0 1.594 0.018 0.042 0.034 0.031 0.029 0.027 0.025
0.01 1.554 0.016 0.044 0.031 0.026 0.012 0.010 0.010
1 11.85 0.077 0.192 0.168 0.159 0.147 0.137 0.134
ASE 0 2.43\times 10^{6} 0.307 0.576 0.416 0.347 0.304 0.274 0.251
0.01 2.45\times 10^{6} 0.307 0.577 0.416 0.347 0.304 0.274 0.251
1 2.42\times 10^{6} 0.308 0.576 0.416 0.347 0.304 0.275 0.251
ESE 0 27.18 0.300 0.576 0.411 0.341 0.298 0.269 0.246
0.01 27.68 0.296 0.576 0.403 0.330 0.287 0.266 0.240
1 27.49 0.291 0.559 0.395 0.326 0.295 0.257 0.240
CP 0 1.000 0.955 0.947 0.949 0.949 0.949 0.948 0.947
0.01 1.000 0.957 0.949 0.958 0.951 0.958 0.951 0.956
1 1.000 0.951 0.946 0.939 0.929 0.925 0.931 0.926
ACL 0 9.51\times 10^{6} 1.202 2.259 1.631 1.361 1.193 1.076 0.982
0.01 9.59\times 10^{6} 1.203 2.261 1.632 1.361 1.193 1.076 0.983
1 9.48\times 10^{6} 1.206 2.257 1.629 1.361 1.193 1.077 0.983
C.Time (min) 0.06 112.3 6.99

5 Real data analysis

Graduated driver licensing (GDL) programs are designed to restrict the driving privileges of new drivers. They typically include restrictions such as nighttime, expressway, and unsupervised driving. To monitor the effectiveness of such programs in reducing fatal teen crashes, we apply the proposed ODL procedure to analyze streaming data from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS). Our primary interest is to evaluate the effectiveness of GDL, a nationwide legislation on novice drivers of age 21 or younger under various conditions of vehicle operation. In contrast, under the current policy there are no restrictions on vehicle operation for older drivers (say, older than 65). To assess the effect of driver's age on driving safety, we compared age groups with respect to the risk of a fatal crash when an accident happened. We first categorized the “Age” variable into three groups: “Age<21”, representing the young group under restricted graduated driver licensing; “Age≥65”, for the older group with a regular full driver's license; and those in between, treated as the reference group. The extent of injury in a crash is the binary outcome of interest, equal to 1 for a moderate or severe injury and 0 for a minor or no injury. This outcome variable was created from the variable Maximum Known Occupant AIS (MAIS), which indicates the single most severe injury level reported for each occupant. Other potential risk factors were also included in the model, such as seat belt use, alcohol, speed limit, vehicle weight, air bag deployment, and drug involvement in the accident. Some interaction terms were also included, such as driver's age group and alcohol, and age group and sex.

The streaming data consist of monthly accident records over the 7-year period from January 2009 to December 2015, giving b=84b=84 data batches and a total of Nb=2118N_{b}=2118 crashed vehicles with complete records on both the outcome and the predictors. The number of predictors is p=62p=62, including an intercept, 46 main effects and 15 interaction terms. Note that the first batch size n1=38<p=62n_{1}=38<p=62, and the smallest batch contains only 6 samples. We applied the proposed method to sequentially update the parameter estimates and standard errors of the regression coefficients in a penalized logistic regression model.

As shown in Figure 2, the 95% pointwise confidence bands over the 84 batches became narrower for the old group but not for the young one. This phenomenon clearly differentiates the significance levels of the two age groups with respect to the severity of injury. Starting at almost the same estimated risk, the estimated coefficient for the young age group gradually moves up and stays around 0 over the 84-month period. The trace plot ends at -0.06, meaning that the young age group has slightly lower adjusted odds of a fatal crash than the middle age group. This finding is consistent with reported results in the literature that GDL is an effective policy for protecting novice drivers from fatal injuries (Chen et al., 2014). In contrast, the trace plot for the older age group shows an upward trend around year 2013 and stabilizes at a positive value of 0.51 after 2014. This indicates that, as more data accumulate over time, the adjusted odds of fatality in a vehicle crash for the older group become significantly higher than for the middle age group. This may suggest a need to modify the policy on restricting vehicle operation for older drivers.

Figure 3 shows the trends of log10p-\log_{10}p over the 84 months for nine of the risk factors, where the pp-values are obtained from the Wald test constructed with the online debiased lasso estimator and its standard error. The trajectory of each covariate effect shows evidence against the null H0:βj0=0H_{0}:\beta^{0}_{j}=0 as data accumulate over time. The interaction term between “Age” and “Alcohol” turns out to have the strongest association with the odds of fatality in a crash among all covariates shown in this plot. This is an overwhelming confirmation of the finding that driving under the influence of alcohol is significantly associated with fatal crashes, even with graduated driver licensing in place to protect young drivers. It also explains why the main effect of the young age group does not show a significant association with the odds of injury while the interaction term does. Moreover, some spikes appear in the trajectories of most risk factors between 2010 and 2014, but the curves smooth out and stabilize thereafter. This might be related to small batch sizes or unbalanced categorical values in some monthly data batches. In addition, to characterize the overall significance level of each risk factor over the 84-month period, we propose a summary statistic defined as the area under the pp-value curve. These values are reported in the brackets after the name of each covariate in Figure 3. They provide a clear summary of the overall significance level in the presence of those spikes, and most of them are well aligned with the ranking of the pp-values at the end of 2015.
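The quantities plotted in Figure 3 and the proposed summary statistic can be computed from the stored estimates and standard errors alone. The sketch below is a minimal illustration; the two-sided form of the Wald test and the trapezoidal rule with monthly spacing for the area under the p-value curve are our assumptions about the exact conventions.

```python
import numpy as np
from scipy.stats import norm

def wald_pvalue_trace(beta_hat, se_hat):
    """Two-sided Wald p-values for one coefficient across the monthly updates.

    beta_hat, se_hat: online debiased estimates and standard errors
    recorded after each of the 84 data batches.
    """
    z = np.asarray(beta_hat, dtype=float) / np.asarray(se_hat, dtype=float)
    return 2.0 * norm.sf(np.abs(z))

def pvalue_auc(pvals):
    """Area under the p-value curve (trapezoidal rule, unit monthly spacing)."""
    p = np.asarray(pvals, dtype=float)
    return float(0.5 * np.sum(p[1:] + p[:-1]))

# -log10 p-values as displayed in Figure 3:
# neglog10_p = -np.log10(wald_pvalue_trace(beta_hat, se_hat))
```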

Applying the proposed online debiased lasso procedure to the above CDS data analysis enabled us to visualize time-course patterns of evidence accrual as well as the stability and reproducibility of inference. As clearly shown in Figures 2 and 3, at the early stage of the data stream, due to limited sample sizes and possibly sampling bias, both the parameter estimates and the test power may be unstable and even misleading. These potential shortcomings are convincingly overcome when the estimates and inferential quantities are continuously updated along the data stream, eventually reaching stability and yielding reliable conclusions.

Figure 2: Trace plots of the estimated regression coefficients and 95% confidence bands corresponding to the young and old age groups, respectively. Numbers on the two sides denote the estimated regression coefficients after the arrival of the first and last batches. The dashed line is the “0” reference line.
Figure 3: Trace plots of log10p-\log_{10}p over monthly data from January 2009 to December 2015, one for each risk factor. Numbers on the yy-axis are the negative base-10 logarithms of the pp-values obtained by the Wald test, and labels on the xx-axis correspond to the last month of each year. The values in the brackets next to the covariate names denote the areas under the pp-value curves.

6 Discussion

In this paper, we propose an online debiased lasso method for statistical inference in high-dimensional GLMs. The method is suited to streaming data in that only historical summary statistics, not the raw historical data, are needed to update the estimate at the current stage. Under regularity conditions similar to those in the offline setting and mild conditions on the batch sizes, we prove that the online debiased lasso estimator (with an online error correction term) is asymptotically normal. The numerical studies further demonstrate the effectiveness of our algorithm and support the theoretical results.

There are still many open questions in the area of online inference. First, the method we develop focuses on homogeneous data, where the streaming observations are assumed to be i.i.d. It would be interesting to determine whether similar results can be obtained in non-homogeneous settings. Second, the loss function considered in this paper is the negative log-likelihood. It is unclear whether other loss functions, including nonsmooth robust losses such as Huber’s loss (Huber, 1964), could be used for online inference. Third, we did not address the issue of online variable selection. The major difficulty there is how to recover significant variables that may have been dropped at early stages of the stream. We hope to address these interesting questions in the future.

Appendix

The Appendix contains the proofs of Theorems 1-3 and of Lemma 1.

Proof of Theorem 1.

For the first data batch 𝒟1\mathcal{D}_{1}, we have 𝜷^(1)=𝜷¯(1)\widehat{\bm{\beta}}^{(1)}=\bar{\bm{\beta}}^{(1)}, where 𝜷¯(1)\bar{\bm{\beta}}^{(1)} is the offline lasso estimator defined in (1). Since the consistency of 𝜷¯(1)\bar{\bm{\beta}}^{(1)} is well established in van de Geer et al. (2014), (11) holds when b=1.b=1. Now we prove the consistency of 𝜷^(b)\widehat{{\bm{\beta}}}^{(b)} for an arbitrary b2b\geq 2 by mathematical induction.

Suppose that 𝜷^(b1){\widehat{\bm{\beta}}}^{(b-1)} satisfies the statement in (11) with constants c1(b1)c_{1}^{(b-1)} and c2(b1)c_{2}^{(b-1)}. We claim that 𝜷^(b)𝜷01c1(b)s0λb.\lVert{\widehat{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}. To show this, we consider the following linear combination,

𝜷~(b)=t𝜷^(b)+(1t)𝜷0,wheret=c1(b)s0λbc1(b)s0λb+𝜷^(b)𝜷01.\displaystyle\widetilde{\bm{\beta}}^{(b)}=t\widehat{\bm{\beta}}^{(b)}+(1-t)\bm{\beta}^{0},\ \text{where}\ t=\frac{c_{1}^{(b)}s_{0}\lambda_{b}}{c_{1}^{(b)}s_{0}\lambda_{b}+\lVert{\widehat{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}}. (13)

Then 𝜷~(b)𝜷01c1(b)s0λb.\lVert{\widetilde{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}. Note that 𝜷~(b)𝜷01c1(b)s0λb/2\lVert{\widetilde{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}/2 if and only if 𝜷^(b)𝜷01c1(b)s0λb.\lVert{\widehat{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}. Therefore, it suffices to show 𝜷~(b)𝜷01c1(b)s0λb/2\lVert{\widetilde{\bm{\beta}}}^{(b)}-{{\bm{\beta}}}^{0}\lVert_{1}\leq c_{1}^{(b)}s_{0}\lambda_{b}/2.
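For completeness, the bound and the equivalence stated above can be checked directly from the definition of tt in (13): writing a=c_{1}^{(b)}s_{0}\lambda_{b} and x=\lVert\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1} (our shorthand), we have \widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}=t(\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}), so that

```latex
\lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  = t\,\lVert\widehat{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  = \frac{ax}{a+x}\leq a,
\qquad
\frac{ax}{a+x}\leq\frac{a}{2}
  \;\Longleftrightarrow\; 2x\leq a+x
  \;\Longleftrightarrow\; x\leq a .
```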

Let (𝜷;𝒟j)=i𝒟jd(yi;g(𝒙i𝜷))\mathcal{L}(\bm{\beta};\mathcal{D}_{j})=\sum_{i\in\mathcal{D}_{j}}d\left(y_{i};g(\bm{x}_{i}^{\top}\bm{\beta})\right), j=1,,bj=1,\ldots,b. Since the objective function defined in (4) is convex and 𝜷^(b)\widehat{\bm{\beta}}^{(b)} is its minimizer, the objective value at the convex combination 𝜷~(b)\widetilde{\bm{\beta}}^{(b)} is no larger than that at 𝜷0{\bm{\beta}}^{0}, that is,

12Nb{(𝜷~(b);𝒟b)+12(𝜷~(b)𝜷^(b1))𝑱~(b1)(𝜷~(b)𝜷^(b1))}+λb𝜷~(b)1\displaystyle\frac{1}{2N_{b}}\left\{\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b})+\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})\right\}+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1} (14)
\displaystyle\leq 12Nb{(𝜷0;𝒟b)+12(𝜷0𝜷^(b1))𝑱~(b1)(𝜷0𝜷^(b1))}+λb𝜷01,\displaystyle\frac{1}{2N_{b}}\left\{\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b})+\frac{1}{2}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})\right\}+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1},

where 𝑱~(b1)=j=1b1𝑱(j)(𝜷^(j))\widetilde{\bm{J}}^{(b-1)}=\sum_{j=1}^{b-1}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)}). Recall that 𝒟b1={𝒟1,,𝒟b1}\mathcal{D}_{b-1}^{\star}=\{\mathcal{D}_{1},\ldots,\mathcal{D}_{b-1}\}. A Taylor’s expansion gives that

(𝜷~(b);𝒟b1)(𝜷0;𝒟b1)\displaystyle\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b-1}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b-1}^{\star})
=\displaystyle= 𝑼(b1)(𝜷0)(𝜷~(b)𝜷0)+12(𝜷~(b)𝜷0){𝑱~(b1)(𝝃)}(𝜷~(b)𝜷0),\displaystyle~{}\bm{U}^{(b-1)}({\bm{\beta}}^{0})(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})+\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{\widetilde{\bm{J}}^{(b-1)}(\bm{\xi})\right\}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}),

where 𝑱~(b1)(𝝃)=j=1b1𝑱(j)(𝝃)\widetilde{\bm{J}}^{(b-1)}(\bm{\xi})=\sum_{j=1}^{b-1}\bm{J}^{(j)}(\bm{\xi}) and 𝝃=t2𝜷0+(1t2)𝜷~(b)\bm{\xi}=t_{2}{\bm{\beta}}^{0}+(1-t_{2})\widetilde{\bm{\beta}}^{(b)} for some 0<t2<1.0<t_{2}<1. Then

12(𝜷~(b)𝜷^(b1))𝑱~(b1)(𝜷~(b)𝜷^(b1))12(𝜷0𝜷^(b1))𝑱~(b1)(𝜷0𝜷^(b1))\displaystyle\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-\widehat{\bm{\beta}}^{(b-1)})-\frac{1}{2}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})^{\top}\widetilde{\bm{J}}^{(b-1)}({\bm{\beta}}^{0}-\widehat{\bm{\beta}}^{(b-1)})
=\displaystyle= 12(𝜷~(b)𝜷0)𝑱~(b1)(𝜷~(b)𝜷0)(𝜷~(b)𝜷0)𝑱~(b1)(𝜷^(b1)𝜷0)\displaystyle\frac{1}{2}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})-(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\widetilde{\bm{J}}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})
=\displaystyle= (𝜷~(b);𝒟b1)(𝜷0;𝒟b1)+12j=1b1(𝜷~(b)𝜷0){𝑱(j)(𝜷^(j))𝑱(j)(𝝃)}(𝜷~(b)𝜷0)\displaystyle\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b-1}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b-1}^{\star})+\frac{1}{2}\sum_{j=1}^{b-1}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{{\bm{J}}^{(j)}(\widehat{\bm{\beta}}^{(j)})-{\bm{J}}^{(j)}(\bm{\xi})\right\}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})
j=1b1(𝜷~(b)𝜷0){𝑱(j)(𝜷^(j))}(𝜷^(b1)𝜷0)𝑼(b1)(𝜷0)(𝜷~(b)𝜷0)\displaystyle-\sum_{j=1}^{b-1}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{{\bm{J}}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})-\bm{U}^{(b-1)}({\bm{\beta}}^{0})(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})
:=\displaystyle:= (𝜷~(b);𝒟b1)(𝜷0;𝒟b1)+Δ1(b)Δ2(b)Δ3(b).\displaystyle\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b-1}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b-1}^{\star})+\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)}.

Substituting the above identity into (14), we obtain

12Nb(𝜷~(b);𝒟b)+λb𝜷~(b)1+12Nb(Δ1(b)Δ2(b)Δ3(b))12Nb(𝜷0;𝒟b)+λb𝜷01.\displaystyle\frac{1}{2N_{b}}\mathcal{L}(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b}^{\star})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1}+\frac{1}{2N_{b}}(\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)})\leq\frac{1}{2N_{b}}\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b}^{\star})+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}. (15)

We introduce two notations,

(𝜷)=12Nb𝔼{(𝜷;𝒟b)(𝜷0;𝒟b)},v(𝜷;𝒟b)=(𝜷;𝒟b)𝔼{(𝜷;𝒟b)},𝜷p,\displaystyle\mathcal{E}({\bm{\beta}})=\frac{1}{2N_{b}}\mathbb{E}\left\{\mathcal{L}({\bm{\beta}};\mathcal{D}_{b}^{\star})-\mathcal{L}({\bm{\beta}}^{0};\mathcal{D}_{b}^{\star})\right\},~{}v(\bm{\beta};\mathcal{D}_{b}^{\star})=\mathcal{L}({\bm{\beta}};\mathcal{D}_{b}^{\star})-\mathbb{E}\left\{\mathcal{L}({\bm{\beta}};\mathcal{D}_{b}^{\star})\right\},\ \bm{\beta}\in\mathbb{R}^{p},

as the excess risk and the empirical process, respectively. Note that (𝜷)\mathcal{E}({\bm{\beta}}) does not depend on the realization of 𝒟b\mathcal{D}_{b}^{\star} since the data are i.i.d. With the above notation, (15) can be further written as

(𝜷~(b))+λb𝜷~(b)1\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1} 12Nb{v(𝜷~(b);𝒟b)v(𝜷0;𝒟b)}+λb𝜷01+12Nb(Δ1(b)Δ2(b)Δ3(b))\displaystyle\leq-\frac{1}{2N_{b}}\left\{v(\widetilde{\bm{\beta}}^{(b)};\mathcal{D}_{b}^{\star})-v({\bm{\beta}}^{0};\mathcal{D}_{b}^{\star})\right\}+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}+\frac{1}{2N_{b}}(\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)})
:=12NbΔ4(b)+λb𝜷01+12Nb(Δ1(b)Δ2(b)Δ3(b))\displaystyle:=-\frac{1}{2N_{b}}\Delta_{4}^{(b)}+\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}+\frac{1}{2N_{b}}(\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)})

Recall that 𝑿(b1)=((𝑿(1)),,(𝑿(b1)))Nb1×p\bm{X}^{(b-1)}_{\star}=((\bm{X}^{(1)})^{\top},\ldots,(\bm{X}^{(b-1)})^{\top})^{\top}\in\mathbb{R}^{N_{b-1}\times p}. The next lemma provides upper bounds for |Δi(b)|,i=1,2,3,4|\Delta_{i}^{(b)}|,i=1,2,3,4; its proof is given at the end of the Appendix.

Lemma 1.

Under the conditions of Theorem 1, with probability at least 1p3,1-p^{-3},

|Δ1(b)|2KLc1(1)s0λ1𝑿(b1)(𝜷~(b)𝜷0)22,\displaystyle|\Delta_{1}^{(b)}|\leq 2KLc^{(1)}_{1}s_{0}\lambda_{1}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert^{2}_{2},
|Δ2(b)|K2𝑿(b1)(𝜷~(b)𝜷0)2𝑿(b1)(𝜷^(b1)𝜷0)2,\displaystyle|\Delta_{2}^{(b)}|\leq K_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})\lVert_{2},
|Δ3(b)|λb1Nb1𝜷~(b)𝜷01/8,|Δ4(b)|λbNb𝜷~(b)𝜷01/8,\displaystyle|\Delta_{3}^{(b)}|\leq{\lambda_{b-1}{N_{b-1}}}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}/8,\ |\Delta_{4}^{(b)}|\leq{\lambda_{b}{N_{b}}}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}/8,

where LL is the Lipschitz constant defined in Assumption 1, K=XK=\lVert X\lVert_{\infty}, and K2=supi𝒟b|g(𝐱i𝛃)|K_{2}=\sup_{i\in\mathcal{D}_{b}^{\star}}\lvert{g^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})}\lvert.

Note that λb1Nb1=λbNb{\lambda_{b-1}\sqrt{N_{b-1}}}={\lambda_{b}\sqrt{N_{b}}} and that the upper bound of |Δ1||\Delta_{1}| can be absorbed into the upper bound of |Δ2||\Delta_{2}|. According to Lemma 1,

Δ1(b)Δ2(b)Δ3(b)Δ4(b)\displaystyle\Delta_{1}^{(b)}-\Delta_{2}^{(b)}-\Delta_{3}^{(b)}-\Delta_{4}^{(b)}
\displaystyle\leq 2K2𝑿(b1)(𝜷~(b)𝜷0)2𝑿(b1)(𝜷^(b1)𝜷0)2+14λbNb𝜷~(b)𝜷01\displaystyle 2K_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}\lVert\bm{X}_{\star}^{(b-1)}(\widehat{\bm{\beta}}^{(b-1)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{4}{\lambda_{b}{N_{b}}}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}
\displaystyle\leq 2K2Nb1c2(b1)s0λb1𝑿(b)(𝜷~(b)𝜷0)2+14c1(b)Nbs0λb2.\displaystyle 2K_{2}\sqrt{N_{b-1}c_{2}^{(b-1)}s_{0}}\lambda_{b-1}\lVert\bm{X}_{\star}^{(b)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{4}c_{1}^{(b)}{{N_{b}}}s_{0}\lambda_{b}^{2}.
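Here the term \frac{1}{4}\lambda_{b}N_{b}\lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1} in the second line collects the bounds on |\Delta_{3}^{(b)}| and |\Delta_{4}^{(b)}| from Lemma 1: using \lambda_{b-1}\sqrt{N_{b-1}}=\lambda_{b}\sqrt{N_{b}}, which holds when \lambda_{j} is taken proportional to \sqrt{\log p/N_{j}} (our reading of the tuning schedule),

```latex
|\Delta_{3}^{(b)}|+|\Delta_{4}^{(b)}|
  \leq \frac{\lambda_{b-1}N_{b-1}+\lambda_{b}N_{b}}{8}
       \lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  = \frac{\lambda_{b}\sqrt{N_{b}N_{b-1}}+\lambda_{b}N_{b}}{8}
       \lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}
  \leq \frac{1}{4}\lambda_{b}N_{b}
       \lVert\widetilde{\bm{\beta}}^{(b)}-\bm{\beta}^{0}\rVert_{1}.
```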

Consequently,

(𝜷~(b))+λb𝜷~(b)1λb𝜷01\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}\lVert_{1}-\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1} (16)
\displaystyle\leq 12Nb{2K2Nb1c2(b1)s0λb1𝑿(b)(𝜷~(b)𝜷0)2+14c1(b)Nbs0λb2}\displaystyle\frac{1}{2N_{b}}\left\{2K_{2}\sqrt{N_{b-1}c_{2}^{(b-1)}s_{0}}\lambda_{b-1}\lVert\bm{X}_{\star}^{(b)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{4}c_{1}^{(b)}{{N_{b}}}s_{0}\lambda_{b}^{2}\right\}
=\displaystyle= {K2c2(b1)s0Nb𝑿(b)(𝜷~(b)𝜷0)2+18c1(b)s0λb}λb:=Δ(b)λb.\displaystyle\left\{K_{2}\sqrt{\frac{c_{2}^{(b-1)}s_{0}}{N_{b}}}\lVert\bm{X}_{\star}^{(b)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert_{2}+\frac{1}{8}c_{1}^{(b)}s_{0}\lambda_{b}\right\}\lambda_{b}:=\Delta^{(b)}\lambda_{b}.

The remaining part follows the standard argument. Recall that S0={j:βj00}S_{0}=\{j:\beta_{j}^{0}\neq 0\}. For 𝜷p\bm{\beta}\in\mathbb{R}^{p}, we define 𝜷S0:=(βj,S0)j=1p\bm{\beta}_{S_{0}}:=(\beta_{j,S_{0}})_{j=1}^{p} where βj,S0=βj𝟏{jS0}.\beta_{j,S_{0}}=\beta_{j}\mathbf{1}\{j\in S_{0}\}. Then, 𝜷=𝜷S0+𝜷S0c.\bm{\beta}=\bm{\beta}_{S_{0}}+\bm{\beta}_{S^{c}_{0}}. From (16), we have

(𝜷~(b))+λb𝜷~S0c(b)1λb𝜷01λb𝜷~S0(b)1+Δ(b)λbλb𝜷0𝜷~S0(b)1+Δ(b)λb.\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}\leq\lambda_{b}\lVert\bm{\beta}^{0}\lVert_{1}-\lambda_{b}\lVert\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}+\Delta^{(b)}\lambda_{b}\leq\lambda_{b}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}+\Delta^{(b)}\lambda_{b}.

We now distinguish two cases according to the value of 𝜷0𝜷~S0(b)1\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}.
Case I. Suppose that 𝜷0𝜷~S0(b)1Δ(b)/2\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}\geq\Delta^{(b)}/2. Then,

(𝜷~(b))+λb𝜷~S0c(b)13λb𝜷0𝜷~S0(b)1,\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+{\lambda_{b}}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}\leq{3\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1},

which implies 𝜷~S0c(b)13𝜷0𝜷~S0(b)1\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}\leq 3\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}. Then, we can invoke the empirical compatibility condition, that is,

𝜷0𝜷~S0(b)12s0ϕ02Nb𝑿(b)(𝜷0𝜷~(b))22,\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}^{2}\leq\frac{s_{0}}{\phi_{0}^{2}N_{b}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert^{2}_{2},

where ϕ0>0\phi_{0}>0 is the compatibility constant. Thus,

(𝜷~(b))+λb𝜷~S0c(b)1+λb𝜷0𝜷~S0(b)12λbϕ0s0Nb𝑿(b)(𝜷0𝜷~(b))2+Δ(b)λb\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+{\lambda_{b}}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}+{\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}\leq\frac{2\lambda_{b}}{\phi_{0}}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\Delta^{(b)}\lambda_{b}
=\displaystyle= (2ϕ0+K2c2(b1))λbs0Nb𝑿(b)(𝜷0𝜷~(b))2+18c1(b)s0λb2\displaystyle\left(\frac{2}{\phi_{0}}+K_{2}\sqrt{c_{2}^{(b-1)}}\right)\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\frac{1}{8}c_{1}^{(b)}s_{0}\lambda_{b}^{2}
:=\displaystyle:= Cλbs0Nb𝑿(b)(𝜷0𝜷~(b))2+18c1(b)s0λb2\displaystyle C\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\frac{1}{8}c_{1}^{(b)}s_{0}\lambda_{b}^{2}

Let k2=1/supi𝒟b|4/g(𝒙i𝜷)|.k_{2}=1/\sup_{i\in\mathcal{D}_{b}^{\star}}\lvert{4/g^{\prime}(\bm{x}_{i}\bm{\beta})}\lvert. Based on Lemma 1, we have

k2Nb𝑿(b)(𝜷0𝜷~(b))22(𝜷~(b))+12NbΔ4(b)(𝜷~(b))+116c1(b)s0λb2,\frac{k_{2}}{N_{b}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert^{2}_{2}\leq\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\frac{1}{2N_{b}}\Delta_{4}^{(b)}\leq\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+\frac{1}{16}c_{1}^{(b)}s_{0}\lambda_{b}^{2},

which is also known as the margin condition (van de Geer,, 2008). Next we apply the arithmetic mean-geometric mean inequality and obtain

Cλbs0Nb𝑿(b)(𝜷0𝜷~(b))2C22k2s0λb2+(𝜷~(b))2+132c1(b)s0λb2.C\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}\leq\frac{C^{2}}{2k_{2}}s_{0}\lambda_{b}^{2}+\frac{\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})}{2}+\frac{1}{32}c_{1}^{(b)}s_{0}\lambda_{b}^{2}. (17)
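In more detail, (17) follows from the elementary inequality ab\leq a^{2}/2+b^{2}/2 applied with a=C\lambda_{b}\sqrt{s_{0}/k_{2}} and b=\sqrt{k_{2}/N_{b}}\,\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\rVert_{2} (our shorthand), combined with the margin condition in the previous display:

```latex
C\lambda_{b}\sqrt{\frac{s_{0}}{N_{b}}}\,
  \lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\rVert_{2}
= \underbrace{C\lambda_{b}\sqrt{\tfrac{s_{0}}{k_{2}}}}_{a}\;
  \underbrace{\sqrt{\tfrac{k_{2}}{N_{b}}}\,
  \lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\rVert_{2}}_{b}
\;\leq\; \frac{a^{2}}{2}+\frac{b^{2}}{2}
\;\leq\; \frac{C^{2}}{2k_{2}}s_{0}\lambda_{b}^{2}
        +\frac{\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})}{2}
        +\frac{1}{32}c_{1}^{(b)}s_{0}\lambda_{b}^{2}.
```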

Then, it follows that

(𝜷~(b))2+λb𝜷0𝜷~(b)1(C22k2+532c1(b))s0λb2.\frac{\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})}{2}+{\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\left(\frac{C^{2}}{2k_{2}}+\frac{5}{32}c_{1}^{(b)}\right)s_{0}\lambda_{b}^{2}. (18)

On the one hand, since (𝜷~(b))>0\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})>0, we obtain that

𝜷0𝜷~(b)1(C22k2+532c1(b))s0λb.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\left(\frac{C^{2}}{2k_{2}}+\frac{5}{32}c_{1}^{(b)}\right)s_{0}\lambda_{b}.

With a suitable choice of c2(b1)c_{2}^{(b-1)} such that

C22k2+532c1(b)=12k2(2ϕ0+K2c2(b1))2+532c1(b)1332c1(b),\frac{C^{2}}{2k_{2}}+\frac{5}{32}c_{1}^{(b)}=\frac{1}{2k_{2}}\left(\frac{2}{\phi_{0}}+K_{2}\sqrt{c_{2}^{(b-1)}}\right)^{2}+\frac{5}{32}c_{1}^{(b)}\leq\frac{13}{32}c_{1}^{(b)},

we have

𝜷0𝜷~(b)112c1(b)s0λb.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\frac{1}{2}c_{1}^{(b)}s_{0}\lambda_{b}.

Specifically, we require that K2c2(1)2/ϕ0K_{2}\sqrt{c_{2}^{(1)}}\geq 2/\phi_{0} and that 8K22c2(b1)/k2=c1(b).8K_{2}^{2}c_{2}^{(b-1)}/k_{2}=c_{1}^{(b)}. On the other hand, combining (17) and (18), we obtain

𝑿(b)(𝜷0𝜷~(b))25C2k2s0Nbλb.\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}\leq\frac{5C}{2k_{2}}\sqrt{s_{0}N_{b}}\lambda_{b}.

Again, since C2K2c2(b1){C}\leq{2K_{2}\sqrt{c_{2}^{(b-1)}}}, we obtain

𝑿(b)(𝜷0𝜷~(b))22(5K2k2)2c2(b1)s0λb2Nbc2(b)s0λb2Nb.\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert^{2}_{2}\leq\left(\frac{5K_{2}}{k_{2}}\right)^{2}c_{2}^{(b-1)}s_{0}\lambda_{b}^{2}N_{b}\leq c_{2}^{(b)}s_{0}\lambda_{b}^{2}N_{b}. (19)

Case II. Suppose that 𝜷0𝜷~S0(b)1<Δ(b)/2\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}<\Delta^{(b)}/2. Then,

(𝜷~(b))+λb𝜷~S0c(b)1+λb𝜷0𝜷~S0(b)12Δ(b)λb\displaystyle\mathcal{E}(\widetilde{\bm{\beta}}^{(b)})+{\lambda_{b}}\lVert\widetilde{\bm{\beta}}^{(b)}_{S^{c}_{0}}\lVert_{1}+{\lambda_{b}}\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}_{S_{0}}\lVert_{1}\leq 2\Delta^{(b)}\lambda_{b}
=\displaystyle= 2K2λbc2(b1)s0Nb𝑿(b)(𝜷0𝜷~(b))2+14c1(b)s0λb2\displaystyle 2K_{2}\lambda_{b}\sqrt{\frac{c_{2}^{(b-1)}s_{0}}{N_{b}}}\lVert\bm{X}^{(b)}_{\star}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}+\frac{1}{4}c_{1}^{(b)}s_{0}\lambda_{b}^{2}

Then, by the margin condition, it is straightforward to show that

𝜷0𝜷~(b)112c1(b)s0λb,𝑿(𝜷0𝜷~(b))22(6K2k2)2c2(b1)s0λb2Nb=c2(b)s0λb2Nb.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq\frac{1}{2}c^{(b)}_{1}s_{0}\lambda_{b},\ \lVert\bm{X}(\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)})\lVert_{2}^{2}\leq\left(\frac{6K_{2}}{k_{2}}\right)^{2}c_{2}^{(b-1)}s_{0}\lambda_{b}^{2}N_{b}=c_{2}^{(b)}s_{0}\lambda_{b}^{2}N_{b}. (20)

Note that in both cases we obtain 𝜷0𝜷~(b)1c1(b)s0λb/2.\lVert\bm{\beta}^{0}-\widetilde{\bm{\beta}}^{(b)}\lVert_{1}\leq c^{(b)}_{1}s_{0}\lambda_{b}/2. According to (13), this establishes our claim 𝜷0𝜷^(b)1c1(b)s0λb.\lVert\bm{\beta}^{0}-\widehat{\bm{\beta}}^{(b)}\lVert_{1}\leq c^{(b)}_{1}s_{0}\lambda_{b}. The remaining step is to repeat the above argument, now with 𝜷^(b)\widehat{\bm{\beta}}^{(b)} in place of 𝜷~(b)\widetilde{\bm{\beta}}^{(b)}, to obtain the upper bound of the estimation error in (19) or (20). It is worth pointing out that c1(b)=8K22c2(b1)/k2c_{1}^{(b)}=8K_{2}^{2}c_{2}^{(b-1)}/k_{2} and c2(b)=36K22c2(b1)/k22,c_{2}^{(b)}=36K_{2}^{2}c_{2}^{(b-1)}/k_{2}^{2}, where K2K_{2} and k2k_{2} do not depend on b.b. The proof is completed by taking a union bound over the events considered in Lemma 1. ∎

Before proving Theorem 2, we cite Lemma 3 in Han et al. (2021), which is used to bound the cumulative error terms.

Lemma 2.

(Han et al., 2021) Let njn_{j} and NjN_{j} be the batch size and the cumulative sample size, respectively, when the jj-th data batch arrives, j=1,,b.j=1,\ldots,b. Then,

j=1bnjNj\displaystyle\sum_{j=1}^{b}\frac{n_{j}}{N_{j}} \displaystyle\leq 1+logNbn1,\displaystyle 1+\log\frac{N_{b}}{n_{1}}, (21)
j=1bnjNj\displaystyle\sum_{j=1}^{b}\frac{n_{j}}{\sqrt{N_{j}}} \displaystyle\leq 2Nb.\displaystyle 2\sqrt{N_{b}}. (22)
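As a quick numerical sanity check of Lemma 2 (ours, not part of the cited proof), the bounds (21) and (22) can be verified for arbitrary positive batch-size sequences; the short script below is a minimal sketch with arbitrary batch-size configurations.

```python
import numpy as np

def check_lemma2(batch_sizes) -> bool:
    """Numerically verify (21) and (22) for a sequence of positive batch sizes."""
    n = np.asarray(batch_sizes, dtype=float)
    N = np.cumsum(n)                        # cumulative sample sizes N_j
    lhs21 = np.sum(n / N)
    rhs21 = 1.0 + np.log(N[-1] / n[0])      # 1 + log(N_b / n_1)
    lhs22 = np.sum(n / np.sqrt(N))
    rhs22 = 2.0 * np.sqrt(N[-1])            # 2 * sqrt(N_b)
    return bool(lhs21 <= rhs21 and lhs22 <= rhs22)

# e.g. b = 84 batches with arbitrary sizes between 1 and 99
rng = np.random.default_rng(0)
assert all(check_lemma2(rng.integers(1, 100, size=84)) for _ in range(1000))
```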
Proof of Theorem 2.

Recall that

𝜸^r(j)=argmin𝜸(p1){12Nj(𝑱~r,r(j)2𝑱~r,r(j)𝜸+𝜸𝑱~r,r(j)𝜸)+λj𝜸1}\bm{\widehat{\gamma}}_{r}^{(j)}=\underset{\bm{\gamma}\in\mathbb{R}^{(p-1)}}{\arg\min}\left\{\frac{1}{2N_{j}}\left(\widetilde{\bm{J}}^{(j)}_{r,r}-2\widetilde{\bm{J}}^{(j)}_{r,-r}\bm{\gamma}+\bm{\gamma}^{\top}\widetilde{\bm{J}}_{-r,-r}^{(j)}\bm{\gamma}\right)+\lambda_{j}\|\bm{\gamma}\|_{1}\right\}

is a standard lasso estimator. The proof follows the standard argument as long as the estimated information matrix 𝑱~(j)/Nj\widetilde{\bm{J}}^{(j)}/N_{j} satisfies the compatibility condition. Therefore, it suffices to show that the compatibility condition holds. Recall that 𝑱~(j)/Nj=i=1j𝑱(i)(𝜷^(i))/Nj.\widetilde{\bm{J}}^{(j)}/N_{j}=\sum_{i=1}^{j}\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})/N_{j}. Then we have

𝑱~(j)/Nj𝑱0=1Nji=1j{𝑱(i)(𝜷^(i))𝔼[𝑱(i)(𝜷0)]}\displaystyle\widetilde{\bm{J}}^{(j)}/N_{j}-\bm{J}^{0}=\frac{1}{N_{j}}\sum_{i=1}^{j}\left\{\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})-\mathbb{E}[\bm{J}^{(i)}({\bm{\beta}}^{0})]\right\}
=\displaystyle= 1Nji=1j{𝑱(i)(𝜷^(i))𝑱(i)(𝜷0)}+1Nji=1j[𝑱(i)(𝜷0)𝔼{𝑱(i)(𝜷0)}].\displaystyle\frac{1}{N_{j}}\sum_{i=1}^{j}\left\{\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})-\bm{J}^{(i)}({\bm{\beta}}^{0})\right\}+\frac{1}{N_{j}}\sum_{i=1}^{j}\left[\bm{J}^{(i)}({\bm{\beta}}^{0})-\mathbb{E}\{\bm{J}^{(i)}({\bm{\beta}}^{0})\}\right].

According to the error bound of 𝜷^(i)\widehat{\bm{\beta}}^{(i)} provided in Theorem 1 and (22), we obtain

1Nji=1j{𝑱(i)(𝜷^(i))𝑱(i)(𝜷0)}K2Nji=1jni𝜷^(i)𝜷01=𝒪p(c1(j)s0log(p)Nj),\left\lVert\frac{1}{N_{j}}\sum_{i=1}^{j}\left\{\bm{J}^{(i)}(\widehat{\bm{\beta}}^{(i)})-\bm{J}^{(i)}({\bm{\beta}}^{0})\right\}\right\lVert_{\infty}\leq\frac{K^{2}}{N_{j}}\sum_{i=1}^{j}{n_{i}}\lVert{\widehat{\bm{\beta}}}^{(i)}-{{\bm{\beta}}}^{0}\lVert_{1}=\mathcal{O}_{p}\left(c_{1}^{(j)}s_{0}\sqrt{\frac{\log(p)}{N_{j}}}\right),

where KK is defined in (A1) of Assumption 1. Meanwhile, based on Hoeffding’s inequality, it follows that

1Nji=1j[𝑱(i)(𝜷0)𝔼{𝑱(i)(𝜷0)}]=𝒪p(log(p)Nj).\left\lVert\frac{1}{N_{j}}\sum_{i=1}^{j}\left[\bm{J}^{(i)}({\bm{\beta}}^{0})-\mathbb{E}\{\bm{J}^{(i)}({\bm{\beta}}^{0})\}\right]\right\lVert_{\infty}=\mathcal{O}_{p}\left(\sqrt{\frac{\log(p)}{N_{j}}}\right).

In summary,

𝑱~(j)/Nj𝑱0=𝒪p(c1(j)s0log(p)Nj).\left\lVert\widetilde{\bm{J}}^{(j)}/N_{j}-\bm{J}^{0}\right\lVert_{\infty}=\mathcal{O}_{p}\left(c_{1}^{(j)}s_{0}\sqrt{\frac{\log(p)}{N_{j}}}\right).

Consequently, the compatibility condition holds by Corollary 6.8 in Bühlmann & van de Geer (2011). ∎
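As an aside, the quadratic lasso problem defining \bm{\widehat{\gamma}}_{r}^{(j)} at the beginning of this proof depends on the data only through the aggregated information matrix \widetilde{\bm{J}}^{(j)}, so it can be solved, for example, by a proximal-gradient (ISTA) iteration. The sketch below is our illustration rather than the implementation used in the paper; A and b stand for the blocks \widetilde{\bm{J}}^{(j)}_{-r,-r}/N_{j} and \widetilde{\bm{J}}^{(j)}_{-r,r}/N_{j}.

```python
import numpy as np

def soft_threshold(v: np.ndarray, thr: float) -> np.ndarray:
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def nodewise_lasso(A: np.ndarray, b: np.ndarray, lam: float,
                   n_iter: int = 2000) -> np.ndarray:
    """ISTA for  min_gamma  0.5 * gamma' A gamma - b' gamma + lam * ||gamma||_1.

    Up to an additive constant, this is the objective defining gamma_hat_r
    above when A = J_tilde_{-r,-r}/N_j and b = J_tilde_{-r,r}/N_j.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    gamma = np.zeros_like(b)
    step = 1.0 / np.linalg.eigvalsh(A)[-1]   # 1 / largest eigenvalue of A
    for _ in range(n_iter):
        grad = A @ gamma - b                 # gradient of the smooth part
        gamma = soft_threshold(gamma - step * grad, step * lam)
    return gamma
```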

It remains to establish the asymptotic normality of the online debiased lasso estimator.

Proof of Theorem 3.

We first deal with the debiased term. Recall that 𝜸~(j)p\widetilde{\bm{\gamma}}^{(j)}\in\mathbb{R}^{p} is the extension of 𝜸^(j)\widehat{\bm{\gamma}}^{(j)}. Then

j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷^(j))}\displaystyle\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{\widehat{\bm{\beta}}}^{(j)})\right\}
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}{𝜸~r(j)}{𝑿(j)}{g(𝑿(j)𝜷0)g(𝑿(j)𝜷^(j))}\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}-\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\{\bm{X}^{(j)}\}^{\top}\left\{g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})-g(\bm{X}^{(j)}{\widehat{\bm{\beta}}}^{(j)})\right\}

For 1knj,1\leq k\leq n_{j}, let 𝑿k(j)1×p\bm{X}^{(j)}_{k}\in\mathbb{R}^{1\times p} denote the kk-th row of 𝑿(j)\bm{X}^{(j)}, namely the covariate vector of the kk-th observation in the jj-th batch. According to the mean value theorem,

g(𝑿k(j)𝜷^(j))=g(𝑿k(j)𝜷0)g(ηk(j))𝑿k(j)(𝜷0𝜷^(j)),\displaystyle g(\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)})=g(\bm{X}^{(j)}_{k}{{\bm{\beta}}}^{0})-g^{\prime}(\eta_{k}^{(j)})\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)}),

where ηk(j)[𝑿k(j)𝜷0,𝑿k(j)𝜷^(j)].\eta_{k}^{(j)}\in[\bm{X}^{(j)}_{k}{{\bm{\beta}}^{0}},\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)}]. Let 𝜼(j)=(η1(j),,ηnj(j))\bm{\eta}^{(j)}=(\eta_{1}^{(j)},\ldots,\eta_{n_{j}}^{(j)})^{\top} and let 𝚲(j)nj×nj\bm{\Lambda}^{(j)}\in\mathbb{R}^{n_{j}\times n_{j}} be the diagonal matrix with diagonal elements {g(𝑿k(j)𝜷^(j))g(ηk(j))}k=1nj\{g^{\prime}(\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)})-g^{\prime}({\eta}_{k}^{(j)})\}_{k=1}^{n_{j}}. As a result,

j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷^(j))}\displaystyle\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{\widehat{\bm{\beta}}}^{(j)})\right\}
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷0𝜷^(j))+Π1,\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}}^{0})\right\}-\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})+\Pi_{1},

where Π1=j=1b{𝒛^r(j)}𝚲(j)𝑿(j)(𝜷0𝜷^(j))\Pi_{1}=\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\bm{\Lambda}^{(j)}\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)}). Now, we focus on the online debiased lasso estimator defined in (2.2). According to the above results,

τ^r(b)(β^on,r(b)βr0)\displaystyle\widehat{\tau}_{r}^{(b)}({\widehat{\beta}}^{(b)}_{\text{on},r}-\beta_{r}^{0})
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷^(j))}+j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷^(j))+τ^r(b)(β^r(b)βr0)\displaystyle\sum_{j=1}^{b}\{\bm{\widehat{z}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}\bm{\widehat{\beta}}^{(j)})\right\}+\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\widehat{\bm{\beta}}}^{(j)})+\widehat{\tau}_{r}^{(b)}(\widehat{{\beta}}^{(b)}_{r}-\beta_{r}^{0})
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}+Π1+j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0)+τ^r(b)(β^r(b)βr0)\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}+\Pi_{1}+\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})+\widehat{\tau}_{r}^{(b)}(\widehat{{\beta}}^{(b)}_{r}-\beta_{r}^{0})
=\displaystyle= j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}+Π1+Π2+Π3,\displaystyle\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}+\Pi_{1}+\Pi_{2}+\Pi_{3},

where

Π2\displaystyle\Pi_{2} =\displaystyle= j=1b{𝜸~r(b)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0)j=1b{𝜸~r(b)}𝑱r(j)(𝜷^(j))(β^r(b)βr0),\displaystyle\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})-\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\bm{J}^{(j)}_{r}(\widehat{\bm{\beta}}^{(j)})(\widehat{{\beta}}^{(b)}_{r}-\beta_{r}^{0}),
Π3\displaystyle\Pi_{3} =\displaystyle= j=1b{𝜸~r(j)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0)j=1b{𝜸~r(b)}𝑱(j)(𝜷^(j))(𝜷^(b)𝜷0).\displaystyle\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})-\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0}).

Therefore, it remains to show that |Πi|=op(Nb),i=1,2,3.|\Pi_{i}|=o_{p}(\sqrt{N_{b}}),i=1,2,3.

First of all, according to the Lipschitz condition of g(),g^{\prime}(\cdot),

|g(ηk(j))𝑿k(j)(𝜷0𝜷^(j))g(𝑿k(j)𝜷^(j))𝑿k(j)(𝜷0𝜷^(j))|{𝑿k(j)(𝜷0𝜷^(j))}2.\displaystyle\left\lvert g^{\prime}({\eta}^{(j)}_{k})\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})-g^{\prime}(\bm{X}^{(j)}_{k}{\widehat{\bm{\beta}}}^{(j)})\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lvert\leq\left\{\bm{X}^{(j)}_{k}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\}^{2}.

Then,

𝚲(j)𝑿(j)(𝜷0𝜷^(j))1𝑿(j)(𝜷0𝜷^(j))22K2(c1(j)s0λj)2nj,\displaystyle\left\lVert\bm{\Lambda}^{(j)}\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lVert_{1}\leq\left\lVert\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lVert_{2}^{2}\leq K^{2}(c_{1}^{(j)}s_{0}\lambda_{j})^{2}n_{j},

where the last inequality follows from Theorem 1 and the boundedness assumption on 𝑿.\bm{X}. Meanwhile, the boundedness of 𝒛^r(j)\lVert\widehat{\bm{z}}_{r}^{(j)}\lVert_{\infty} can be shown using (A2)(A2) in Assumption 1 and Theorem 2. Therefore, we obtain an upper bound for |Π1||\Pi_{1}|:

|Π1|\displaystyle|\Pi_{1}| \displaystyle\leq j=1b𝒛^r(j)𝚲(j)𝑿(j)(𝜷0𝜷^(j))1\displaystyle\sum_{j=1}^{b}\lVert\widehat{\bm{z}}_{r}^{(j)}\lVert_{\infty}\left\lVert\bm{\Lambda}^{(j)}\bm{X}^{(j)}({\bm{\beta}}^{0}-{\widehat{\bm{\beta}}}^{(j)})\right\lVert_{1}
\displaystyle\leq j=1bK2𝒛^r(j)(c1(j)s0λj)2nj=𝒪p(c1(b)s02log(p)log(Nb)),\displaystyle\sum_{j=1}^{b}K^{2}\lVert\widehat{\bm{z}}_{r}^{(j)}\lVert_{\infty}(c_{1}^{(j)}s_{0}\lambda_{j})^{2}n_{j}=\mathcal{O}_{p}\left(c_{1}^{(b)}s_{0}^{2}\log(p)\log(N_{b})\right),

where the last equality follows from (21) in Lemma 2.

Next, we use the Karush-Kuhn-Tucker (KKT) conditions of the lasso problem defining 𝜸^r(b)\bm{\widehat{\gamma}}_{r}^{(b)}. Let 𝒆rp\bm{e}_{r}\in\mathbb{R}^{p} denote the unit vector whose rr-th element is one and whose other elements are zero.

|Π2|\displaystyle|\Pi_{2}| =\displaystyle= |({𝜸~r(b)}𝑱~(b){𝜸~r(b)}𝑱~r(b)𝒆r)(𝜷^(b)𝜷0)|\displaystyle\left\lvert\left(\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}-\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}_{r}\bm{e}_{r}^{\top}\right)({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})\right\lvert
\displaystyle\leq {𝜸~r(b)}𝑱~(b){𝜸~r(b)}𝑱~r(b)𝒆r𝜷^(b)𝜷01\displaystyle\left\lVert\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}-\{\widetilde{\bm{\gamma}}^{(b)}_{r}\}^{\top}\widetilde{\bm{J}}^{(b)}_{r}\bm{e}_{r}^{\top}\right\lVert_{\infty}\left\lVert{\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0}\right\lVert_{1}
\displaystyle\leq Nbλb×c1(b)s0λb=c1(b)s0Nbλb2=𝒪p(c1(b)s0log(p)).\displaystyle N_{b}\lambda_{b}\times c^{(b)}_{1}s_{0}\lambda_{b}=c^{(b)}_{1}s_{0}N_{b}\lambda_{b}^{2}=\mathcal{O}_{p}\left(c^{(b)}_{1}s_{0}\log(p)\right).

Finally, we apply Theorem 2 and (22) in Lemma 2.

|Π3|\displaystyle|\Pi_{3}| =\displaystyle= |{j=1b{𝜸~r(b)𝜸~r(j)}𝑱(j)(𝜷^(j))}(𝜷^(b)𝜷0)|\displaystyle\left\lvert\left\{\sum_{j=1}^{b}\{\widetilde{\bm{\gamma}}^{(b)}_{r}-\widetilde{\bm{\gamma}}^{(j)}_{r}\}^{\top}\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\}({\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0})\right\lvert
\displaystyle\leq 𝜷^(b)𝜷01j=1b𝜸~r(b)𝜸~r(j)1𝑱(j)(𝜷^(j))\displaystyle\left\lVert{\widehat{\bm{\beta}}}^{(b)}-{\bm{\beta}}^{0}\right\lVert_{1}\sum_{j=1}^{b}\left\lVert\widetilde{\bm{\gamma}}^{(b)}_{r}-\widetilde{\bm{\gamma}}^{(j)}_{r}\right\lVert_{1}\left\lVert\bm{J}^{(j)}(\widehat{\bm{\beta}}^{(j)})\right\lVert_{\infty}
\displaystyle\leq c1(b)s0λb(j=1bcsrλjnj)=𝒪p(c1(b)s0srlog(p)).\displaystyle c_{1}^{(b)}s_{0}\lambda_{b}\left(\sum_{j=1}^{b}cs_{r}\lambda_{j}n_{j}\right)=\mathcal{O}_{p}\left(c_{1}^{(b)}s_{0}s_{r}\log(p)\right).

Based on the conditions in Theorem 3, we conclude |Πi|=op(Nb),i=1,2,3.|\Pi_{i}|=o_{p}(\sqrt{N_{b}}),i=1,2,3. In summary,

τ^r(b)(β^on,r(b)βr0)=j=1b{𝒛^r(j)}{𝒚(j)g(𝑿(j)𝜷0)}+op(Nb),\displaystyle\widehat{\tau}_{r}^{(b)}({\widehat{\beta}}^{(b)}_{\text{on},r}-\beta_{r}^{0})=\sum_{j=1}^{b}\{{\widehat{\bm{z}}}^{(j)}_{r}\}^{\top}\left\{\bm{y}^{(j)}-g(\bm{X}^{(j)}{{\bm{\beta}}^{0}})\right\}+o_{p}(\sqrt{N_{b}}),

where the terms on the right-hand side correspond to WrW_{r} and VrV_{r} in Theorem 3. ∎

For the sake of completeness, we conclude the Appendix with the proof of Lemma 1.

Proof of Lemma 1.

We start from |Δ1(b)|.|\Delta_{1}^{(b)}|. Recall that

Δ1(b)=12j=1b1(𝜷~(b)𝜷0){𝑱(j)(𝜷^(j))𝑱(j)(𝝃)}(𝜷~(b)𝜷0)\displaystyle\Delta_{1}^{(b)}=\frac{1}{2}\sum_{j=1}^{b-1}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})^{\top}\left\{{\bm{J}}^{(j)}(\widehat{\bm{\beta}}^{(j)})-{\bm{J}}^{(j)}(\bm{\xi})\right\}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})

Let 𝒗(j)=𝑿(j)(𝜷~(b)𝜷0)\bm{v}^{(j)}=\bm{X}^{(j)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}), 𝒘(j)=g(𝑿(j)𝜷^(j))g(𝑿(j)𝝃)\bm{w}^{(j)}=g^{\prime}(\bm{X}^{(j)}\widehat{{\bm{\beta}}}^{(j)})-g^{\prime}(\bm{X}^{(j)}\bm{\xi}) and diag(𝒘(j))\text{diag}(\bm{w}^{(j)}) denote the diagonal matrix with diagonal element 𝒘(j).\bm{w}^{(j)}. Then,

Δ1(b)=12j=1b1(𝒗(j))diag(𝒘(j))𝒗(j)=12𝒗diag(𝒘)𝒗,\displaystyle\Delta_{1}^{(b)}=\frac{1}{2}\sum_{j=1}^{b-1}(\bm{v}^{(j)})^{\top}\text{diag}(\bm{w}^{(j)})\bm{v}^{(j)}=\frac{1}{2}\bm{v}^{\top}\text{diag}(\bm{w})\bm{v},

where 𝒗=((𝒗(1)),,(𝒗(b1)))\bm{v}=((\bm{v}^{(1)})^{\top},\ldots,(\bm{v}^{(b-1)})^{\top})^{\top} and 𝒘=((𝒘(1)),,(𝒘(b1)))\bm{w}=((\bm{w}^{(1)})^{\top},\ldots,(\bm{w}^{(b-1)})^{\top})^{\top}. As a result,

|Δ1(b)|12𝒗22diag(𝒘)2=12𝑿(b1)(𝜷~(b)𝜷0)22𝒘.\displaystyle|\Delta_{1}^{(b)}|\leq\frac{1}{2}\lVert\bm{v}\lVert_{2}^{2}\lVert\text{diag}(\bm{w})\lVert_{2}=\frac{1}{2}\lVert\bm{X}_{\star}^{(b-1)}(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0})\lVert^{2}_{2}\lVert\bm{w}\lVert_{\infty}.

It remains to find an upper bound for 𝒘\lVert\bm{w}\lVert_{\infty}, that is,

𝒘\displaystyle\lVert\bm{w}\lVert_{\infty} =\displaystyle= max1jb1𝒘(j)=max1jb1g(𝑿(j)𝜷^(j))g(𝑿(j)𝝃)\displaystyle\max_{1\leq j\leq b-1}\lVert\bm{w}^{(j)}\lVert_{\infty}=\max_{1\leq j\leq b-1}\lVert g^{\prime}(\bm{X}^{(j)}\widehat{{\bm{\beta}}}^{(j)})-g^{\prime}(\bm{X}^{(j)}\bm{\xi})\lVert_{\infty}
\displaystyle\leq Lmax1jb1𝑿(j)(𝜷^(j)𝝃)3KLmax1jb1c1(j)s0λj\displaystyle L\max_{1\leq j\leq b-1}\lVert\bm{X}^{(j)}(\widehat{{\bm{\beta}}}^{(j)}-\bm{\xi})\lVert_{\infty}\leq 3KL\max_{1\leq j\leq b-1}c^{(j)}_{1}s_{0}\lambda_{j}

For ease of exposition, we assume that c1(1)s0λ1=max1jb1c1(j)s0λjc^{(1)}_{1}s_{0}\lambda_{1}=\max_{1\leq j\leq b-1}c^{(j)}_{1}s_{0}\lambda_{j}, since the first data batch contains the least information and may therefore lead to the largest estimation error. This assumption does not affect the result: the upper bound of |Δ1||\Delta_{1}| can always be absorbed into the upper bound of |Δ2||\Delta_{2}| because b=o(logNb)b=o(\log N_{b}).

For Δ2\Delta_{2}, the proof is similar to the above and is omitted.

Recall that Δ3(b)=𝑼(b1)(𝜷0)(𝜷~(b)𝜷0)\Delta_{3}^{(b)}=\bm{U}^{(b-1)}({\bm{\beta}}^{0})(\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}). Note that 𝑼(b1)(𝜷0)\bm{U}^{(b-1)}({\bm{\beta}}^{0}) can be written as a sum of i.i.d.i.i.d. random variables, that is, 𝑼(b1)(𝜷0)=i𝒟b1𝒖(yi;𝒙i,𝜷0).\bm{U}^{(b-1)}({\bm{\beta}}^{0})=\sum_{i\in\mathcal{D}_{b-1}^{\star}}\bm{u}(y_{i};\bm{x}_{i},\bm{\beta}^{0}). Since 𝔼[𝑼(b1)(𝜷0)]=0\mathbb{E}[\bm{U}^{(b-1)}({\bm{\beta}}^{0})]=0 and 𝒙iK\lVert\bm{x}_{i}\lVert_{\infty}\leq K, by Hoeffding’s inequality, with probability at least 1p31-p^{-3}, 𝑼(b1)(𝜷0)λb1Nb1/8.\lVert\bm{U}^{(b-1)}({\bm{\beta}}^{0})\lVert_{\infty}\leq\lambda_{b-1}N_{b-1}/8. The stated bound then follows from Hölder’s inequality: |Δ3(b)|𝑼(b1)(𝜷0)𝜷~(b)𝜷01λb1Nb1𝜷~(b)𝜷01/8.|\Delta_{3}^{(b)}|\leq\lVert\bm{U}^{(b-1)}({\bm{\beta}}^{0})\lVert_{\infty}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}\leq\lambda_{b-1}N_{b-1}\lVert\widetilde{\bm{\beta}}^{(b)}-{\bm{\beta}}^{0}\lVert_{1}/8.

The bound on |Δ4(b)||\Delta_{4}^{(b)}| follows directly from Theorem 14.5 in Bühlmann & van de Geer (2011). ∎

References

  • Belloni et al. , (2015) Belloni, A., Chernozhukov, V., & Kato, K. 2015. Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika, 102(1), 77–94.
  • Belloni et al. , (2019) Belloni, A., Chernozhukov, V., & Kato, K. 2019. Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Journal of the American Statistical Association, 114(526), 749–758.
  • Bühlmann & van de Geer, (2011) Bühlmann, P., & van de Geer, S. 2011. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
  • Cai & Guo, (2017) Cai, T. T., & Guo, Z. 2017. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2), 615–646.
  • Cai & Guo, (2020) Cai, T. T., & Guo, Z. 2020. Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B, 82(2), 391–419.
  • Cardot & Degras, (2018) Cardot, H., & Degras, D. 2018. Online Principal Component Analysis in High Dimension: Which Algorithm to Choose? International Statistical Review, 86, 29 – 50.
  • Chen et al. , (2014) Chen, Y., Berrocal, V. J., Bingham, C. R., & Song, P. X.-K. 2014. Analysis of spatial variations in the effectiveness of graduated driver’s licensing (GDL) program in the state of Michigan. Spatial and Spatio-temporal Epidemiology, 8, 11–22.
  • Csala et al. , (2017) Csala, A., Voorbraak, F. P., Zwinderman, A. H., & Hof, M. H. 2017. Sparse redundancy analysis of high-dimensional genetic and genomic data. Bioinformatics, 33(20), 3228 – 3234.
  • Deshpande et al. , (2019) Deshpande, Y., Javanmard, A., & Mehrabi, M. 2019. Online debiasing for adaptively collected high-dimensional data. arXiv preprint arXiv:1911.01040.
  • Dezeure et al. , (2015) Dezeure, R., Bühlmann, P., Meier, L., & Meinshausen, N. 2015. High-dimensional inference: confidence intervals, pp-values and R-software hdi. Statistical Science, 30(4), 533 – 558.
  • Fan & Li, (2001) Fan, J., & Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
  • Fang, (2019) Fang, Y. 2019. Scalable statistical inference for averaged implicit stochastic gradient descent. Scandinavian Journal of Statistics, 1–16.
  • Fei & Li, (2021) Fei, Z., & Li, Y. 2021. Estimation and inference for high dimensional generalized linear models: a splitting and smoothing approach. Journal of Machine Learning Research, 22(58), 1–32.
  • Han et al. , (2021) Han, R., Luo, L., Lin, Y., & Huang, J. 2021. Online debiased lasso. arXiv preprint arXiv:2106.05925.
  • Huber, (1964) Huber, P. J. 1964. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73–101.
  • Javanmard & Montanari, (2014) Javanmard, A., & Montanari, A. 2014. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1), 2869–2909.
  • Luo & Song, (2020) Luo, L., & Song, P. X.-K. 2020. Renewable estimation and incremental inference in generalized linear models with streaming datasets. Journal of the Royal Statistical Society: Series B, 82, 69–97.
  • Ning & Liu, (2017) Ning, Y., & Liu, H. 2017. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1), 158–195.
  • Robbins & Monro, (1951) Robbins, H., & Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
  • Sakrison, (1965) Sakrison, D. J. 1965. Efficient recursive estimation: application to estimating the parameter of a covariance function. International Journal of Engineering Science, 3(4), 461–483.
  • Schifano et al. , (2016) Schifano, E. D., Wu, J., Wang, C., Yan, J., & Chen, M. H. 2016. Online updating of statistical inference in the big data setting. Technometrics, 58(3), 393–403.
  • Shi et al. , (2020) Shi, C., Song, R., Lu, W., & Li, R. 2020. Statistical inference for high-dimensional models via recursive online-score estimation. Journal of the American Statistical Association, 1–12.
  • Sun et al. , (2020) Sun, L., Wang, M., Guo, Y., & Barbu, A. 2020. A novel framework for online supervised learning with feature selection. arXiv preprint arXiv:1803.11521.
  • Sur & Candès, (2019) Sur, P., & Candès, E. J. 2019. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516 – 14525.
  • Tibshirani, (1996) Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288.
  • Toulis & Airoldi, (2017) Toulis, P., & Airoldi, E. M. 2017. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4), 1694–1727.
  • van de Geer, (2008) van de Geer, S. 2008. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614 – 645.
  • van de Geer et al. , (2014) van de Geer, S., Bühlmann, P., Ritov, Y. A., & Dezeure, R. 2014. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.
  • Wang et al. , (2017) Wang, D., Fong, S., Wong, R. K., Mohammed, S., Fiaidhi, J., & Wong, K. K. 2017. Robust high-dimensional bioinformatics data streams mining by ODR-ioVFDT. Scientific Report, 7(43167).
  • Wen et al. , (2018) Wen, H., Shi, J., Chen, W., & Liu, Z. 2018. Transferring and generalizing deep-learning-based neural encoding models across subjects. NeuroImage, 176, 152 – 163.
  • Zhang, (2010) Zhang, C. H. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
  • Zhang & Zhang, (2014) Zhang, C. H., & Zhang, S. S. 2014. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B, 76(1), 217 – 242.
  • Zhang et al. , (2020) Zhang, Y., Han, K., Worth, R., & Liu, Z. 2020. Connecting concepts in the brain by mapping cortical representations of semantic relations. Nature Communications, 11(1), 1877.
  • Zou, (2006) Zou, H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
  • Zou & Hastie, (2005) Zou, H., & Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301–320.