
Varying Coefficient Model via Adaptive Spline Fitting

Xufei Wang
Two Sigma Investments, LP
Bo Jiang
Two Sigma Investments, LP and
Jun S. Liu
Department of Statistics, Harvard University
The views expressed herein are the authors' alone and are not necessarily the views of Two Sigma Investments, LP, or any of its affiliates.
Abstract

The varying coefficient model has received broad attention from researchers as it is a powerful dimension reduction tool for non-parametric modeling. Most existing varying coefficient models fitted with polynomial splines assume equidistant knots and treat the number of knots as a hyperparameter. However, imposing equidistant knots appears too rigid, and systematically determining the optimal number of knots is itself a challenge. In this article, we address this challenge by fitting the coefficients in varying coefficient models with polynomial splines whose knots are adaptively selected and predictor-specific. An efficient dynamic programming algorithm is proposed to find the optimal solution. Numerical results show that the new method achieves significantly smaller mean squared errors for the coefficients than the equidistant spline fitting method.


Keywords: Mean squared error; Splines and knots; Varying coefficient model.

1 Introduction

The problem of accurately estimating the relationship between a response variable and multiple predictor variables is fundamental to statistics, machine learning, and many other scientific fields. Among parametric models, the linear regression model is a simple and powerful approach, yet it is limited by its linearity assumption, which is often violated in real applications. In contrast, non-parametric models do not assume any specific form for the relationship between the response and the predictors; they offer the flexibility to capture nonlinear relationships and are powerful alternatives to linear models. However, non-parametric models often need to impose certain local smoothness assumptions, mostly achieved by employing a certain class of kernels or spline basis functions, to overcome over-fitting. They are also plagued by the curse of dimensionality: for high dimensional data sets, these methods are both ineffective in capturing the true relationship and computationally expensive.

Serving as a bridge between linear and non-parametric models, the varying coefficient model (Hastie & Tibshirani, 1993) provides an attractive compromise between simplicity and flexibility. In this class of models, the regression coefficients are not fixed constants but are allowed to depend on other conditioning variables. As a result, the varying coefficient model is more flexible because of the infinite dimensionality of the corresponding parameter space. Compared to standard non-parametric approaches, it is a powerful strategy for coping with the curse of dimensionality, and it inherits the simplicity and interpretability of linear models. A typical setup for the varying coefficient model is as follows. Given response $y$ and predictors $X=(x_{1},\ldots,x_{p})^{T}$, the model assumes that

y=\sum_{j=1}^{p}\beta_{j}(u)x_{j}+\epsilon,

where $u$ is the conditioning random variable, which is usually a scalar. The varying coefficient model has a broad range of applications, including longitudinal data (Huang et al., 2004; Tang & Cheng, 2012), functional data (Zhang & Wang, 2014; Hu et al., 2019), and spatial data (Wang & Sun, 2019; Finley & Banerjee, 2020). Moreover, varying coefficient models extend naturally to time series contexts (Huang & Shen, 2004; Lin et al., 2019).
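
To fix ideas, the following minimal sketch simulates data from this model; the coefficient functions and dimensions are illustrative choices of ours, not ones used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
u = rng.uniform(size=n)                          # univariate conditioner
X = rng.normal(size=(n, 2))                      # predictors x_1, x_2
beta = np.column_stack([np.sin(2 * np.pi * u),   # beta_1(u), illustrative
                        1.0 + u ** 2])           # beta_2(u), illustrative
y = (beta * X).sum(axis=1) + rng.normal(scale=0.5, size=n)
```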

There are three major approaches to estimating the coefficients $\beta_{j}(u)\ (j=1,\ldots,p)$. One widely acknowledged approach is the smoothing spline method proposed by Hastie & Tibshirani (1993), with a recent follow-up using P-splines (Astrid et al., 2009). Fan & Zhang (1999) and Fan & Zhang (2000) proposed a predictor-specific kernel approach for estimating the coefficients: the first step uses local cubic smoothing to model the function $\beta_{j}(u)$, and the second step re-applies local cubic smoothing to the residuals. A recent follow-up of their work is the adaptive estimator of Chen et al. (2015). Approximating the coefficient functions with a basis expansion, e.g. polynomial B-splines, remains popular as it leads to a simple solution for estimation and inference with good theoretical properties. Compared with smoothing spline and kernel methods, the polynomial spline method with a finite number of knots strikes a balance between model flexibility and interpretability of the estimated coefficients. Huang et al. (2002), Huang & Shen (2004), and Huang et al. (2004) utilized polynomial spline estimators. These works assume that the knots are equidistant and that the number of knots is chosen so that the bias terms become asymptotically negligible, which guarantees local asymptotic normality.

Most polynomial spline approaches optimize over a finite-dimensional class of functions, such as a space of polynomial B-splines with $L$ equally spaced knots. If the real turning points of the coefficients are not equidistant yet we still assume equally spaced knots, then $L$ must be large enough that the distances between knots in the model can capture the resolution of the coefficients. In practice, $L$ is chosen by a parameter search alongside the other hyperparameters, since one needs to compare a set of candidate values before determining the optimal fixed number of knots. Too few knots may miss the high-frequency information in $\beta_{j}(u)$, while too many knots may over-fit the regions where the coefficients barely change. Variable selection for varying coefficient models, especially when the number of predictors exceeds the sample size, is also an interesting direction to study.

In this paper, we propose two adaptive algorithms that are enabled by first fitting piece-wise linear functions with automatically selected turning points for a univariate conditioner variable $u$. Compared with the traditional methods above, these algorithms automatically determine the optimal positions of knots, modeled as the turning points of the true coefficients. We prove that, if the coefficients are piece-wise linear in $u$, the knots selected by our methods are almost surely the true change points. We further combine the knot selection algorithms with the adaptive group LASSO for variable selection in high-dimensional problems, following a similar idea as Wei et al. (2011), who apply the adaptive group LASSO to basis expansions of predictors. Our simulation studies demonstrate that the new adaptive method achieves smaller mean squared errors for estimating the coefficients than available methods, and improves variable selection performance. We finally apply the method to study the association between environmental factors and COVID-19 infected cases, observing time-varying effects of the environmental factors, and to the Boston Housing data (Harrison & Rubinfeld, 1978).

2 Adaptive spline fitting methods

2.1 Knot selection for polynomial splines

In varying coefficient models, each coefficient $\beta_{j}(u)$ is a function of the conditional variable $u$, which can be estimated by fitting a polynomial spline on $u$. In this paper we assume that $u$ is univariate. Let $X_{i}=(x_{i,1},\ldots,x_{i,p})^{T}\in\mathbf{R}^{p}$, $u_{i}$, and $y_{i}$ denote the $i$th observations of the predictor vector, the conditional variable, and the response variable, respectively, for $i=1,\ldots,n$. Suppose the knots, common to all coefficients, are located at $d_{1}<\ldots<d_{L}$, and the corresponding B-splines of degree $D$ are $B_{k}(u)\ (k=1,\ldots,D+L+1)$. Each varying coefficient can be represented as $\beta_{j}(u)=\sum_{k=1}^{D+L+1}c_{j,k}B_{k}(u)$, where the coefficients $c_{j,k}$ are estimated by minimizing the following sum of squared errors:

\sum_{i=1}^{n}\left\{y_{i}-\sum_{j=1}^{p}x_{i,j}\sum_{k=1}^{D+L+1}c_{j,k}B_{k}(u_{i})\right\}^{2}. (1)
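
For given knots, minimizing (1) is an ordinary least squares problem in the $c_{j,k}$. The following sketch builds the B-spline basis with scipy and solves for the coefficients; the helper names are ours, and a clamped knot vector is assumed.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, interior_knots, degree):
    # clamped knot vector: boundary knots repeated degree+1 times,
    # giving L + degree + 1 basis functions B_k(u)
    t = np.r_[[u.min()] * (degree + 1), np.sort(interior_knots),
              [u.max()] * (degree + 1)]
    nb = len(t) - degree - 1
    return BSpline(t, np.eye(nb), degree)(u)       # n x (L + degree + 1)

def fit_vc_spline(X, u, y, interior_knots, degree=3):
    # least-squares estimate of c_{j,k} in (1); returns shape (p, L + degree + 1)
    B = bspline_basis(u, interior_knots, degree)
    Z = np.hstack([X[:, [j]] * B for j in range(X.shape[1])])
    c, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return c.reshape(X.shape[1], -1)
```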

In previous work, the knots for polynomial splines were typically chosen as equidistant quantiles of $u$ and were common to all predictors. That approach is computationally straightforward, but the resulting knots cannot properly reflect the varying smoothness between and within the coefficients. We propose an adaptive knot selection approach in which the knots can be interpreted as turning points of the coefficients.

For knots $d_{1}<\ldots<d_{L}$, we define the segmentation scheme $S=\{s_{1},\ldots,s_{L+1}\}$ for the observed samples ordered by $u$, where $s_{k}=\{i\mid d_{k-1}<u_{i}\leq d_{k}\}$, with $d_{0}=-\infty$ and $d_{L+1}=\infty$. If the true coefficient vector $\beta(u)=(\beta_{1}(u),\ldots,\beta_{p}(u))^{T}\in\mathbf{R}^{p}$ is a linear function of $u$ within each segment, i.e., $\beta(u)=a_{s}+b_{s}u$ for $a_{s},b_{s}\in\mathbf{R}^{p}$, then the observed response satisfies

y=a_{s}^{T}X+b_{s}^{T}(uX)+\epsilon,\qquad\epsilon\sim N(0,\sigma_{s}^{2}). (2)

Thus, the coefficients can be estimated by maximizing the log-likelihood function, which is equivalent to minimizing the loss function:

L(S)=\sum_{s\in S}|s|\log\hat{\sigma}^{2}_{s}, (3)

where $\hat{\sigma}^{2}_{s}$ is the residual variance from regressing $y_{i}$ on $(X_{i},u_{i}X_{i})$ for $i\in s$.
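
A minimal sketch of one segment's contribution to (3), with $\hat{\sigma}^{2}_{s}$ computed by least squares on the regressors $(X,uX)$; the function name is ours.

```python
import numpy as np

def seg_loss(X, u, y):
    # |s| * log(sigma_hat_s^2) for one segment: residual variance of
    # regressing y on (X, uX) within the segment
    Z = np.hstack([X, u[:, None] * X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return len(y) * np.log(np.mean((y - Z @ beta) ** 2))
```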

Because any almost-everywhere continuous function can be approximated by piece-wise linear functions, we can employ the estimation framework in (2) and derive (3). To avoid over-fitting when maximizing the log-likelihood over the parameters and segmentation schemes, we constrain each segment size $|s|$ to be at least a lower bound $m_{s}=n^{\alpha}\ (0<\alpha<1)$, and penalize the loss function with the number of segments:

L(S,\lambda_{0})=\sum_{s\in S}|s|\log\hat{\sigma}^{2}_{s}+\lambda_{0}|S|\log(n), (4)

where $\lambda_{0}>0$ is the penalty strength. The resulting knots correspond to the segmentation scheme $S$ that minimizes the penalized loss function (4). When $\lambda_{0}$ is very large, this strategy tends to select no knots, whereas when $\lambda_{0}$ gets close to 0 it can select as many as $n^{1-\alpha}$ knots. We find the optimal $\lambda_{0}$ by minimizing the Bayesian information criterion (Schwarz, 1978) of the fitted model. For a given $\lambda_{0}$, suppose $L(\lambda_{0})$ knots are finally proposed, and the fitted model is $\hat{f}(X,u)$. Then, we have

\textsc{bic}(\lambda_{0})=n\log\left[\frac{1}{n}\sum_{i=1}^{n}\{y_{i}-\hat{f}(X_{i},u_{i})\}^{2}\right]+p\{L(\lambda_{0})+D+1\}\log(n). (5)

The optimal $\lambda_{0}$ is selected by searching over a grid to minimize $\textsc{bic}(\lambda_{0})$. We call this procedure the global adaptive knot selection strategy, as it assumes that all the predictors share the same set of knots. We discuss in Section 2.4 how to allow each predictor to have its own set of knots.
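
The criterion (5) is straightforward to compute from the fitted values; a sketch follows, where the particular grid for $\lambda_{0}$ is an illustrative choice of ours rather than a prescription from the method.

```python
import numpy as np

def bic(y, y_hat, n_knots, degree, p):
    # criterion (5): n log(RSS / n) + p (L + D + 1) log n
    n = len(y)
    return (n * np.log(np.mean((y - y_hat) ** 2))
            + p * (n_knots + degree + 1) * np.log(n))

# e.g., evaluate bic over a grid of lambda_0 values and keep the minimizer
lam_grid = np.logspace(-2, 2, 20)   # illustrative grid
```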

Here we only use the piece-wise linear model (2) and loss function (4) for knots selection, but will fit the varying coefficients with B-splines derived from the resulting knots via minimizing (1). In this way, the fitted varying coefficients are smooth functions and the smoothness is determined by the degree of the splines. This method is referred to as the global adaptive spline fitting throughout the paper.

2.2 Theoretical properties of the selected knots

The proposed method is invariant to the marginal distribution of $u$. Without loss of generality, we assume that $u$ follows the uniform distribution on $[0,1]$. According to Definition 1 below, a turning point, denoted $c_{k}$, is a point at which some coefficient $\beta_{j}(u)\ (j=1,\ldots,p)$ attains a local maximum or minimum. In Theorem 1, we show that the adaptive knot selection approach almost surely detects all the turning points of $\beta(u)$.

Definition 1

We call $0<c_{1}<\ldots<c_{K}<1$, $K<\infty$, the turning points of $\beta(u)$ if for any $c_{k-1}<u_{1}<u_{2}<c_{k}<u_{3}<u_{4}<c_{k+1}\ (k=1,\ldots,K)$,

\{\beta_{j}(u_{1})-\beta_{j}(u_{2})\}\{\beta_{j}(u_{3})-\beta_{j}(u_{4})\}<0,

for some index $j$, with the convention $c_{0}=0$, $c_{K+1}=1$.

Theorem 1

Suppose $y=\beta(u)^{T}X+\epsilon$, $u\sim\mathrm{Unif}(0,1)$, $\epsilon\sim N(0,\sigma^{2})$, where $X$ is bounded and $\beta(u)$ is a bounded continuous function. Moreover, assume $X$, $u$, and $\epsilon$ are independent. Let $0<c_{1}<\ldots<c_{K}<1$ be the turning points in Definition 1 and $d_{1}<\ldots<d_{L}$ be the selected knots with $m_{s}=n^{\alpha}\ (0<\alpha<1)$. Then for $0<\gamma<1-\alpha$ and $\lambda_{0}$ large enough,

\mathrm{pr}\left(L\geq K,\ \max_{k=1}^{K}\min_{l=1}^{L}|d_{l}-c_{k}|<n^{-\gamma}\right)\to 1,\quad n\to\infty.

Although the adaptive spline fitting method is motivated by a piece-wise linear model, Theorem 1 shows that, with probability approaching 1, we can accurately detect all the turning points under general varying coefficient functions. The selected knots can be a superset of the real turning points, especially when $\lambda_{0}$ is small, so we tune $\lambda_{0}$ with the bic (5), which penalizes the number of selected knots, to find the optimal set.

When the varying coefficient functions are piece-wise linear, each coefficient $\beta_{j}(u)\ (j=1,\ldots,p)$ can (almost surely) be written as

\beta_{j}(u)=a_{j,k}+b_{j,k}u,\quad c_{j,k-1}<u\leq c_{j,k},\quad k=1,\ldots,K_{j}+1,

where $c_{j,0}=0$, $c_{j,K_{j}+1}=1$, and $(a_{j,k}-a_{j,k+1})^{2}+(b_{j,k}-b_{j,k+1})^{2}>0\ (k=1,\ldots,K_{j})$ if $K_{j}\geq 1$. When $K_{j}=0$ the varying coefficient $\beta_{j}(u)$ is a simple linear function of $u$; otherwise we call $c_{j,k}\ (k=1,\ldots,K_{j})$ the change points of the coefficient $\beta_{j}(u)$, because the linear relationship differs before and after these points. Then $\bigcup_{K_{j}\geq 1}\{c_{j,1},\ldots,c_{j,K_{j}}\}$ collects all the change points of the entire varying coefficient function. In Theorem 2, we show that the adaptive knot selection method almost surely discovers the change points of $\beta(u)$ without false positive selections.

Theorem 2

Suppose $(X,y,u)$ satisfy the same assumptions as in Theorem 1, and $\beta(u)$ is a piece-wise linear function of $u$. Let $0<c_{1}<\ldots<c_{K}<1$ be the change points defined above and $d_{1}<\ldots<d_{L}$ be the selected knots with $m_{s}=n^{\alpha}\ (0<\alpha<1)$. Then for $0<\gamma<1-\alpha$ and $\lambda_{0}$ large enough,

\mathrm{pr}\left(L=K,\ \max_{k=1}^{K}|d_{k}-c_{k}|<n^{-\gamma}\right)\to 1,\quad n\to\infty.

The theorem shows that if the varying coefficient function is piece-wise linear, the method recovers exactly the change points, with probability tending to one.

2.3 Dynamic programming algorithm for adaptive knot selection

The brute force algorithm for computing the optimal knots has computational complexity $O(2^{n})$ and is impractical. We propose a dynamic programming approach whose computational complexity is of order $O(n^{2})$, summarized in the three steps below. If we further assume that the knots can only be chosen from a predetermined set $\mathcal{M}$, e.g. the $\frac{m}{\sqrt{n}}\ (m=1,\ldots,\lfloor\sqrt{n}\rfloor-1)$ quantiles of the $u_{i}$'s, the computational complexity is further reduced to $O(|\mathcal{M}|^{2})$. Note that the algorithm in Section 2.4 of Wang et al. (2017) is a special case with a constant $x$.

  1. Data preparation: Arrange the data $(X_{i},u_{i},y_{i})\ (i=1,\ldots,n)$ in increasing order of $u_{i}$. Without loss of generality, we assume that $u_{1}<\cdots<u_{n}$.

  2. Obtain the minimum loss by forward recursion: Define $m_{s}=\lceil n^{\alpha}\rceil\ (0<\alpha<1)$ as the smallest segment size, and set $\lambda=\lambda_{0}\log(n)$. Initialize two sequences $(\mathrm{Loss}_{i},\mathrm{Prev}_{i})\ (i=1,\ldots,n)$ with $\mathrm{Loss}_{0}=0$. For $i=m_{s},\ldots,n$, recursively fill in the entries of the tables with

    \mathrm{Loss}_{i}=\min_{i^{\prime}\in I_{i}}(\mathrm{Loss}_{i^{\prime}-1}+l_{i^{\prime}:i}+\lambda),\quad\mathrm{Prev}_{i}=\operatorname*{arg\,min}_{i^{\prime}\in I_{i}}(\mathrm{Loss}_{i^{\prime}-1}+l_{i^{\prime}:i}+\lambda).

    Here $I_{i}=\{1\}\cup\{m_{s}+1,\ldots,i-m_{s}+1\}$ and $l_{i^{\prime}:i}=(i-i^{\prime}+1)\log\hat{\sigma}^{2}_{i^{\prime}:i}$, where $\hat{\sigma}^{2}_{i^{\prime}:i}$ is the residual variance of regressing $y_{k}$ on $(X_{k},u_{k}X_{k})\ (k=i^{\prime},\ldots,i)$.

  3. Determine knots by backward tracing: Let $p=\mathrm{Prev}_{n}$. If $p=1$ no knot is needed; otherwise, we recursively add $0.5(u_{p-1}+u_{p})$ as a new knot and set $p=\mathrm{Prev}_{p-1}$, until $p=1$.

When the algorithm is run over a grid of $\lambda_{0}$ values, we repeat Steps 2 and 3 for each $\lambda_{0}$ and return the final model with the minimum bic. A sketch of the recursion is given below.
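
A minimal sketch of the dynamic program, reusing the seg_loss helper from the sketch in Section 2.1; indices follow the 1-based notation above, and $m_{s}$ should exceed the number of regressors so that each segment fit is well-posed.

```python
import numpy as np

def select_knots(X, u, y, lam0, m_s):
    # data assumed sorted by u; seg_loss as sketched in Section 2.1
    n = len(y)
    lam = lam0 * np.log(n)
    loss = np.full(n + 1, np.inf)      # loss[i] = Loss_i, with loss[0] = 0
    loss[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for i in range(m_s, n + 1):        # last segment ends at sample i
        for ip in [1] + list(range(m_s + 1, i - m_s + 2)):   # candidate starts I_i
            l = loss[ip - 1] + seg_loss(X[ip - 1:i], u[ip - 1:i], y[ip - 1:i]) + lam
            if l < loss[i]:
                loss[i], prev[i] = l, ip
    knots, p = [], prev[n]             # backward tracing
    while p > 1:
        knots.append(0.5 * (u[p - 2] + u[p - 1]))   # midpoint of u_{p-1}, u_p
        p = prev[p - 1]
    return sorted(knots)
```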

2.4 Predictor-specific adaptive knot selection

The global adaptive knot selection method introduced in Section 2.1 assumes that the set of knot locations is the same for all predictors, as in most of the polynomial spline fitting literature. However, different coefficients may have different resolutions of smoothness relative to $u$, in which case a separate set of knots for each predictor is preferable. Here we propose a predictor-specific adaptive spline fitting algorithm on top of the global knot selection. Suppose the fitted model from the global adaptive spline fitting is $\hat{f}(X,u)=\sum_{j=1}^{p}\hat{\beta}_{j}(u)x_{j}$. For predictor $j$, the knots can be updated by performing the same knot selection procedure between it and the residual obtained by removing all other predictors' contributions, that is

r_{i}=y_{i}-\sum_{j^{\prime}\neq j}\hat{\beta}_{j^{\prime}}(u_{i})x_{i,j^{\prime}},\quad i=1,\ldots,n. (6)

If the bic in (5) is smaller for the new polynomial spline model with the new knots, the knots and fitted model are updated. These steps are repeated until the bic cannot be reduced further. The following three steps summarize the algorithm; a code sketch appears at the end of this subsection.

  1. Global adaptive spline: Run the global adaptive spline fitting algorithm on $(X_{i},u_{i},y_{i})\ (i=1,\ldots,n)$. Denote the fitted model by $\hat{f}(X,u)$ and compute the model bic with (5).

  2. Update knots with bic: For $j=1,\ldots,p$, compute the residual variable $r_{i}$ by (6). Run the global adaptive spline fitting algorithm on the residual and predictor $j$, i.e., on $(x_{i,j},u_{i},r_{i})\ (i=1,\ldots,n)$. Denote the new fitted model by $\hat{f}^{j}(X,u)$ and the corresponding bic by $\textsc{bic}_{j}$. Note that when each predictor is allowed its own number of knots, the term $p\{L(\lambda_{0})+D+1\}\log(n)$ in (5) should be replaced by $\left\{\sum_{j=1}^{p}L_{j}(\lambda_{0})+p(D+1)\right\}\log(n)$, where $L_{j}(\lambda_{0})$ is the number of knots for predictor $j$.

  3. Recursively update the model: If $\min_{j}\textsc{bic}_{j}<\textsc{bic}$, let $j^{*}=\operatorname*{arg\,min}_{j}\textsc{bic}_{j}$, update the current model to $\hat{f}(X,u)=\hat{f}^{j^{*}}(X,u)$, and return to Step 2. Otherwise return the fitted model $\hat{f}(X,u)$.

An alternative approach to modeling the heterogeneous smoothness across coefficients is to replace the initial model in Step 1 with $\hat{f}(X,u)=0$ and iterate the remaining two steps. However, starting from the global model is preferred because fitting the residual instead of the original response reduces the mean squared error more efficiently. Simulation studies in Section 3 show that predictor-specific knots further reduce the mean squared error of the fitted coefficients, compared with the global knot selection approach.
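
A sketch of the refinement loop under our naming, assuming the select_knots and bspline_basis helpers sketched earlier are in scope; the greedy bic-descent structure mirrors Steps 1-3.

```python
import numpy as np

def fit_pred_specific(X, u, y, knots_list, degree=3):
    # per-predictor design blocks x_j * B_{j,k}(u); returns each predictor's
    # fitted contribution beta_j(u_i) x_ij and the bic of Section 2.4
    blocks = [X[:, [j]] * bspline_basis(u, knots_list[j], degree)
              for j in range(X.shape[1])]
    Z = np.hstack(blocks)
    c, *_ = np.linalg.lstsq(Z, y, rcond=None)
    splits = np.cumsum([b.shape[1] for b in blocks])[:-1]
    contrib = np.column_stack([part.sum(axis=1)
                               for part in np.split(Z * c, splits, axis=1)])
    n, p = len(y), X.shape[1]
    bic = (n * np.log(np.mean((y - contrib.sum(axis=1)) ** 2))
           + (sum(len(k) for k in knots_list) + p * (degree + 1)) * np.log(n))
    return contrib, bic

def refine_knots(X, u, y, lam0, m_s, degree=3):
    p = X.shape[1]
    global_knots = select_knots(X, u, y, lam0, m_s)        # Step 1
    knots = [list(global_knots) for _ in range(p)]
    contrib, best = fit_pred_specific(X, u, y, knots, degree)
    while True:                                            # Steps 2-3
        trials = []
        for j in range(p):
            r = y - (contrib.sum(axis=1) - contrib[:, j])  # residual (6)
            k_j = select_knots(X[:, [j]], u, r, lam0, m_s)
            trial = [k_j if jj == j else knots[jj] for jj in range(p)]
            trials.append((fit_pred_specific(X, u, y, trial, degree)[1], j, trial))
        b, j_star, trial = min(trials, key=lambda t: t[0])
        if b >= best:
            return knots
        knots, best = trial, b
        contrib, _ = fit_pred_specific(X, u, y, knots, degree)
```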

2.5 Knot selection in sparse high-dimensional problems

When the number of predictors is large and only a few of them have non-zero varying coefficients, we perform variable selection over all the predictors. For predictor $j$, we run the knot selection procedure between it and the response, and construct B-spline functions $\{B_{j,k}(u)\}\ (k=1,\ldots,L_{j}+D+1)$, where $L_{j}$ is the number of knots and $D$ is the degree of the B-splines. Then, we apply the variable selection method for varying coefficient models proposed by Wei et al. (2011), which is a generalization of the group LASSO (Yuan & Lin, 2006) and the adaptive LASSO (Zou, 2006). Wei et al. (2011) show that the group LASSO tends to over-select variables and suggest the adaptive group LASSO for variable selection. In their original algorithm, the knots for each predictor are chosen as equidistant quantiles and are not predictor-specific. The following three steps summarize our proposed algorithm:

  1. Select knots for each predictor: Run the global adaptive spline fitting algorithm between each predictor $x_{i,j}$ and the response $y_{i}$. Denote the corresponding B-splines by $B_{j,k}(u)\ (k=1,\ldots,L_{j}+D+1)$, where $L_{j}$ is the number of knots for predictor $j$ and $D$ is the degree of the splines.

  2. Group LASSO for first-step variable selection: Run the group LASSO algorithm on the loss function

    \frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}-\sum_{j=1}^{p}x_{i,j}\sum_{k=1}^{L_{j}+D+1}c_{j,k}B_{j,k}(u_{i})\right\}^{2}+\lambda_{1}\sum_{j=1}^{p}(c_{j}^{\prime}R_{j}c_{j})^{1/2},\quad\lambda_{1}>0,

    where $c_{j}=(c_{j,1},\ldots,c_{j,L_{j}+D+1})$ and $R_{j}$ is the kernel matrix whose $(k_{1},k_{2})$ element is $E\{B_{j,k_{1}}(u)B_{j,k_{2}}(u)\}$. Denote the fitted coefficients by $\tilde{c}_{j,k}$.

  3. Adaptive group LASSO for second-step variable selection: Run the adaptive group LASSO algorithm on the updated loss function

    \frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}-\sum_{j=1}^{p}x_{i,j}\sum_{k=1}^{L_{j}+D+1}c_{j,k}B_{j,k}(u_{i})\right\}^{2}+\lambda_{2}\sum_{j=1}^{p}w_{j}(c_{j}^{\prime}R_{j}c_{j})^{1/2},\quad\lambda_{2}>0,

    with weights $w_{j}=\infty$ if $\tilde{c}_{j}^{\prime}R_{j}\tilde{c}_{j}=0$ and $w_{j}=(\tilde{c}_{j}^{\prime}R_{j}\tilde{c}_{j})^{-1/2}$ otherwise. Denote the fitted coefficients by $\hat{c}_{j,k}$; the selected variables are those with $\hat{c}_{j}^{\prime}R_{j}\hat{c}_{j}\neq 0$.

For Steps 2 and 3, the parameters $\lambda_{1}$ and $\lambda_{2}$ are chosen by minimizing the bic (5) of the fitted model, where the degrees of freedom are computed with only the selected predictors. Simulation studies in Section 3.2 show that, with the knot selection in Step 1, the algorithm can always choose the correct predictors with a reasonable number of samples.
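
The kernel matrices $R_{j}$ and the adaptive weights are simple to form empirically; a sketch under our naming, with $R_{j}$ estimated by the sample average of $B_{j,k_{1}}(u_{i})B_{j,k_{2}}(u_{i})$:

```python
import numpy as np

def adaptive_weights(B_list, c_tilde_list):
    # B_list[j]: n x (L_j + D + 1) basis matrix for predictor j;
    # c_tilde_list[j]: first-stage group LASSO coefficients c~_j
    w = []
    for B_j, c_j in zip(B_list, c_tilde_list):
        R_j = B_j.T @ B_j / B_j.shape[0]        # empirical kernel matrix
        norm2 = float(c_j @ R_j @ c_j)
        w.append(np.inf if norm2 <= 0 else norm2 ** -0.5)
    return np.array(w)
```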

3 Empirical studies

3.1 Simulation study for adaptive spline fitting

In this section, we compare the proposed global and predictor-specific adaptive spline fitting approaches with the equidistant spline fitting approach, as well as with a commonly used kernel method implemented in the package tvReg (Casas & Fernandez-Casal, 2019), using the simulation examples from Tang & Cheng (2012). The model simulates longitudinal data, which are frequently encountered in biomedical applications. In this example the conditional random variable represents time, so for convenience we write $t$ instead of $u$. We simulate $n$ individuals, each with a scheduled time set $\{0,1,\ldots,19\}$ for generating observations. A scheduled time is skipped with probability 0.6, in which case no observation is generated at that time. For each non-skipped scheduled time, the actual observed time is the scheduled time plus a random disturbance from $\mathrm{Unif}(0,1)$. Consider the time-dependent predictors $X(t)=(x_{1}(t),x_{2}(t),x_{3}(t),x_{4}(t))^{T}$ with

x_{1}(t)=1,\qquad x_{2}(t)\sim\mathrm{Bern}(0.6),
x_{3}(t)\sim\mathrm{Unif}(0.1t,\,2+0.1t),\qquad x_{4}(t)\mid x_{3}(t)\sim N\!\left(0,\frac{1+x_{3}(t)}{2+x_{3}(t)}\right).

The response for individual $i$ at observed time $t_{i,k}$ is $y_{i}(t_{i,k})=\sum_{j=1}^{4}\beta_{j}(t_{i,k})x_{i,j}(t_{i,k})+\epsilon_{i}(t_{i,k})$, with

\beta_{1}(t)=1+3.5\sin(t-3),\qquad\beta_{2}(t)=2-5\cos(0.75t-0.25),
\beta_{3}(t)=4-0.04(t-12)^{2},\qquad\beta_{4}(t)=1+0.125t+4.6(1-0.1t)^{3}.

The random errors $\epsilon_{i}(t_{i,k})$ are independent of the predictors and are generated from two components,

\epsilon_{i}(t_{i,k})=v_{i}(t_{i,k})+e_{i}(t_{i,k}),

where $e_{i}(t_{i,k})\stackrel{\mathrm{i.i.d.}}{\sim}N(0,4)$, and $v_{i}(t_{i,k})\sim N(0,4)$ with correlation structure

\mathrm{cor}\left\{v_{i_{1}}(t_{i_{1},k_{1}}),v_{i_{2}}(t_{i_{2},k_{2}})\right\}=I(i_{1}=i_{2})\exp(-|t_{i_{1},k_{1}}-t_{i_{2},k_{2}}|).
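
For reproducibility, a sketch of drawing one individual's error vector under this covariance (the helper name is ours):

```python
import numpy as np

def draw_errors(t, rng):
    # v_i: N(0, 4) with within-individual correlation exp(-|t_k1 - t_k2|)
    cov = 4.0 * np.exp(-np.abs(t[:, None] - t[None, :]))
    v = rng.multivariate_normal(np.zeros(len(t)), cov)
    e = rng.normal(scale=2.0, size=len(t))   # independent N(0, 4) component
    return v + e
```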

Random errors are positively correlated within the same individual and independent across individuals. Figure 1 shows the true coefficients, the coefficients fitted by the equidistant and predictor-specific spline fitting methods for an example with $n=200$, and the selected knots. Note that for the equidistant fitting approach, the number of knots is also determined by minimizing the model bic (5). We observe that the coefficients fitted by the predictor-specific method are smoother than those fitted by the equidistant method, especially for the less volatile coefficients $\beta_{3}(t)$ and $\beta_{4}(t)$. This is because the predictor-specific method uses only 1 knot for these two coefficients, which better reflects their resolution, while the equidistant method uses 7 knots.

Figure 1: The true coefficients and fitted coefficients with equidistant and predictor-specific knots. Black line: true coefficients; triangle: equidistant knots; dashed lines: fitted coefficients with equidistant knots; stars: predictor-specific knots; red: fitted coefficients with predictor-specific knots.

We compare the methods by the mean squared errors of their estimated coefficients, i.e.,

\textsc{mse}_{j}=\frac{1}{N}\sum_{i=1}^{n}\sum_{k=1}^{n_{i}}\frac{\left\{\hat{\beta}_{j}(t_{i,k})-\beta_{j}(t_{i,k})\right\}^{2}}{\mathrm{range}(\beta_{j})^{2}}, (7)

where $\beta_{j}(t)$ and $\hat{\beta}_{j}(t)$ are the true and estimated coefficients, respectively, $n_{i}$ is the total number of observations for individual $i$, and $N=\sum_{i=1}^{n}n_{i}$. We run the simulation with $n=200$ for 1000 repetitions. For the adaptive spline methods we only consider knots from the $\frac{m}{\sqrt{N}}\ (m=1,\ldots,\lfloor\sqrt{N}\rfloor)$ quantiles of $t_{i,k}$.
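
The metric (7) pools all observations; a one-line sketch, approximating $\mathrm{range}(\beta_{j})$ by the range of the true coefficient over the observed times:

```python
import numpy as np

def mse_j(beta_hat, beta_true):
    # normalized mse (7) for one coefficient, pooled over all N observations
    return np.mean((beta_hat - beta_true) ** 2) / np.ptp(beta_true) ** 2
```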

Method	$\textsc{mse}_{1}\times 10^{2}$	$\textsc{mse}_{2}\times 10^{2}$	$\textsc{mse}_{3}\times 10^{2}$	$\textsc{mse}_{4}\times 10^{2}$
equidistant	2.53 (1.14)	0.26 (0.10)	0.64 (0.29)	0.09 (0.04)
global	2.24 (1.12)	0.32 (0.12)	0.55 (0.26)	0.08 (0.04)
predictor-specific	1.30 (0.83)	0.23 (0.10)	0.28 (0.21)	0.04 (0.03)
kernel	2.80 (0.92)	0.38 (0.11)	0.81 (0.26)	0.14 (0.04)
Table 1: Average $\textsc{mse}_{j}\times 10^{2}$ (standard deviations in parentheses) for the equidistant, global adaptive, predictor-specific adaptive, and kernel methods.

Table 1 shows the average $\textsc{mse}_{j}$ for the proposed global and predictor-specific methods, in comparison with the equidistant spline fitting and the kernel method. We find that the predictor-specific method has the smallest $\textsc{mse}_{j}$ for all four coefficients, significantly outperforming the other methods. Interestingly, the equidistant method selects on average 6.9 knots and the global adaptive method selects on average 6.1 global knots for all predictors, whereas the predictor-specific method selects fewer on average: 5.8 knots for $x_{1}(t)$, 4.2 for $x_{2}(t)$, 1.5 for $x_{3}(t)$, and 0.8 for $x_{4}(t)$. This result is expected since $\beta_{3}(t)$ and $\beta_{4}(t)$ are less volatile than $\beta_{1}(t)$ and $\beta_{2}(t)$. It appears that the predictor-specific procedure can achieve a better fit with a smaller number of knots than the other two methods.

3.2 Simulation study for variable selection

We use the simulation example from Wei et al. (2011) to compare the performance of our method with one using the adaptive group LASSO and equidistant knots. As in the previous subsection, there are $n$ individuals, each having a scheduled time set $\{0,1,\ldots,29\}$ for generating observations and a skipping probability of 0.6. For each non-skipped scheduled time, the actual observed time is the scheduled time plus a random disturbance from $\mathrm{Unif}(0,1)$. We construct $p=500$ time-dependent predictors as follows:

x_{1}(t)\sim\mathrm{Unif}(0.05+0.1t,\,2.05+0.1t),\qquad x_{j}(t)\mid x_{1}(t)\sim N\!\left(0,\frac{1+x_{1}(t)}{2+x_{1}(t)}\right),\quad j=2,\ldots,5,
x_{6}(t)\sim N\left(3\exp\{(t+0.5)/30\},1\right),\qquad x_{j}(t)\stackrel{\mathrm{i.i.d.}}{\sim}N(0,4),\quad j=7,\ldots,500.

Within the same individual, the predictors $x_{j}(t)\ (j=7,\ldots,500)$ are correlated with $\mathrm{cor}\{x_{j}(t),x_{j}(s)\}=\exp(-|t-s|)$. The response for individual $i$ at observed time $t_{i,k}$ is $y_{i}(t_{i,k})=\sum_{j=1}^{6}\beta_{j}(t_{i,k})x_{i,j}(t_{i,k})+\epsilon_{i}(t_{i,k})$. The time-varying coefficients $\beta_{j}(t)\ (j=1,\ldots,6)$ are

\beta_{1}(t)=15+20\sin\{\pi(t+0.5)/15\},\qquad\beta_{2}(t)=15+20\cos\{\pi(t+0.5)/15\},
\beta_{3}(t)=2-3\sin\{\pi(t-24.5)/15\},\qquad\beta_{4}(t)=2-3\cos\{\pi(t-24.5)/15\},
\beta_{5}(t)=6-0.2(t+0.5)^{2},\qquad\beta_{6}(t)=-4+5\times 10^{-4}(19.5-t)^{3}.

The random error $\epsilon_{i}(t_{i,k})$ is independent of the predictors and follows the same distribution as in Section 3.1. We simulate cases with $n=50,100,200$ and replicate each setting 200 times. Three metrics are considered: the average number of selected variables, the percentage of cases with no false negative, and the percentage of cases with no false positive or negative. A comparison of our method with the variable selection method using equidistant knots (Wei et al., 2011) is summarized in Table 2. Our method clearly outperforms the method without predictor-specific knot selection, and always selects the correct predictors when $n$ is at least 100.

equidistant knots
	# selected variables	% no false negative	% no false positive or negative
$n=50$	7.04	72	68
$n=100$	6.21	87	84
$n=200$	6.13	99	93

adaptively selected knots
	# selected variables	% no false negative	% no false positive or negative
$n=50$	5.96	96.5	96.5
$n=100$	6.00	100	100
$n=200$	6.00	100	100
Table 2: Variable selection performance of the adaptive group LASSO with and without predictor-specific knot selection.

4 Applications

4.1 Environmental factors and COVID-19

The dataset we investigate contains daily measurements of meteorological and air quality data in 7 counties of New York State between March 1, 2020, and September 30, 2021. The meteorological data were obtained from the National Oceanic and Atmospheric Administration Regional Climate Centers, Northeast Regional Climate Center at Cornell University: http://www.nrcc.cornell.edu. The daily data are based on the average of hourly measurements from several stations in each county and comprise records of five meteorological components: temperature in Fahrenheit, dew point in Fahrenheit, wind speed in miles per hour, precipitation in inches, and humidity in percentage. The air quality data were from the Environmental Protection Agency: https://www.epa.gov. The data contain daily records of two major air quality components: fine particles with an aerodynamic diameter of $2.5\,\mu$m or less ($\mathrm{PM}_{2.5}$), in $\mu\mathrm{g/m^{3}}$, and ozone, in $\mu\mathrm{g/m^{3}}$. The objective of the study is to understand the association between the meteorological measurements, in conjunction with pollutant levels, and the number of infected cases of COVID-19, a contagious disease caused by severe acute respiratory syndrome coronavirus 2, and to examine whether the association varies over time. The daily infection records were retrieved from the official website of the New York State Department of Health: https://data.ny.gov. Figure 2 shows scatter plots of daily infected cases in New York County and the 7 environmental components over time. To remove the variation of recorded cases between weekdays and weekends, in the following study we consider the weekly averaged infected cases, i.e., the average over each day and the following 6 days. We also find that the temperature and dew point factors are highly correlated, so we remove the dew point factor when fitting the model.

Figure 2: Environmental measurements and COVID-19 infected cases in New York County, NY.

We first take the logarithmic transformation of the weekly averaged infected cases, denoted $y$, and then fit a varying coefficient model with the following predictors: $x_{1}=1$ as the intercept, $x_{2}$ as temperature, $x_{3}$ as wind speed, $x_{4}$ as precipitation, $x_{5}$ as humidity, $x_{6}$ as $\mathrm{PM}_{2.5}$, and $x_{7}$ as ozone. Time $t$ is the conditioner for our model. All predictors except the constant are normalized to make the varying coefficients $\beta_{j}(t)$ comparable. The varying coefficient model has the form:

y_{i}(t_{i,k})=\sum_{j=1}^{7}\beta_{j}(t_{i,k})x_{i,j}(t_{i,k})+\epsilon(t_{i,k}), (8)

where $t_{i,k}$ is the $k$th record time for the $i$th county, $y_{i}(t_{i,k})$ and $x_{i,j}(t_{i,k})$ are the corresponding records for county $i$ at time $t_{i,k}$, and $\epsilon(t_{i,k})\stackrel{\mathrm{i.i.d.}}{\sim}N(0,\sigma^{2})$ denotes the error term.

We apply the equidistant and the proposed predictor-specific adaptive spline fitting methods to the data. Figure 3 shows the fitted $\beta_{j}(t)$ for each predictor by the two methods. The figures show a strong time effect on each coefficient function. For example, there are several peaks in the intercept, corresponding to the initial outbreak and the delta variant outbreak. Moreover, the coefficients change rapidly around March 2020, which may be explained by the fact that at the beginning of the outbreak testing was limited and the number of infected cases was underestimated. The coefficient curves show that the most important predictor is temperature. For most of the period its coefficient is negative, indicating that high temperature is negatively associated with transmission of the virus; this agrees with the study by Notari (2021), which finds that COVID-19 spreads more slowly at high temperatures. Beyond this negative association, our analysis also demonstrates the time-varying nature of the coefficient. Moreover, the coefficients fitted with predictor-specific knots are less volatile than those fitted with equidistant knots, especially for temperature, wind speed, precipitation, humidity, and $\mathrm{PM}_{2.5}$.

Figure 3: Black lines: fitted coefficients with predictor-specific knots; red lines: fitted coefficients with equidistant knots.

We evaluate the predictive performance of the proposed approach with a rolling window scheme, using a training window of at least one year and a rolling step of one week. That is, for each date $t$ after March 1, 2021, we fit the two models (equidistant and predictor-specific knots) on $\left(x_{i,j}(t_{i,k}),y_{i}(t_{i,k})\right)\ (t_{i,k}<t)$ and predict $y_{i}(t_{i,k})\ (t\leq t_{i,k}<t+7)$. The root mean squared error is 0.932 for equidistant knot fitting and 0.901 for predictor-specific knot fitting.
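
A sketch of this evaluation loop, with a hypothetical fit_predict(train, test) callable standing in for either fitting variant and an assumed long-format data frame with 'date' and 'y' columns:

```python
import numpy as np
import pandas as pd

def rolling_rmse(df, start, fit_predict):
    # refit weekly on all records dated before t, predict the following week
    errs = []
    for t in pd.date_range(start, df['date'].max(), freq='7D'):
        train = df[df['date'] < t]
        test = df[(df['date'] >= t) & (df['date'] < t + pd.Timedelta(days=7))]
        if len(test):
            errs.append(test['y'].to_numpy() - fit_predict(train, test))
    return float(np.sqrt(np.mean(np.concatenate(errs) ** 2)))
```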

We note that environmental factors may not have an immediate effect on the number of recorded COVID-19 cases, since there is a 2-14 day incubation period before the onset of symptoms and it takes another 2-7 days before a test result is available. To study whether there are lagging effects between the predictors and the response, for each time lag $\tau$ we fit a varying coefficient model with predictor-specific knots using data $y_{i}(t_{i,k}+\tau)$ and $x_{i,j}(t_{i,k})\ (j=1,\ldots,7)$ in model (8), similarly to Fan & Zhang (1999). Figure 4 shows the residual root mean squared error for each time lag. We find that the root mean squared error is smallest at a lag of 4 days. Note that $y_{i}(t)$ represents the logarithm of the weekly average between days $t$ and $t+6$, so the predictors at day $t$ are most predictive of infection numbers between days $t+4$ and $t+10$, which is consistent with the incubation period and test duration.

Figure 4: Root mean squared error for the logarithm of weekly averaged infected cases, with lag $\tau=0,\ldots,21$ days.

4.2 The Boston housing data

We consider the Boston housing price data from Harrison & Rubinfeld (1978), with $n=506$ observations for the census districts of the Boston metropolitan area. The data are available in the R package mlbench. Following Wang et al. (2009) and Hu & Xia (2012), we use medv (median value of owner-occupied homes in 1,000 USD) as the response and lstat as the conditioner, which is defined as a linear combination of the proportion of adults with high school education or above and the proportion of male workers classified as laborers. The predictors are: int (the intercept), crim (per capita crime rate by town), rm (average number of rooms per dwelling), ptratio (pupil-teacher ratio by town), nox (nitric oxides concentration in parts per 10 million), tax (full-value property-tax rate per 10,000 USD), and age (proportion of owner-occupied units built prior to 1940). We transform the conditioner lstat so that its marginal distribution is $\mathrm{U}(0,1)$, and apply a logarithmic transformation to the response medv. We first transform the predictors (except the intercept) to the standard normal. Since some of the predictors are highly correlated with the conditioner, we further regress each predictor against the transformed lstat, and use the normalized residuals in the follow-up analysis.

We fit a predictor-specific varying coefficient linear model to predict the response from the residualized predictors, using lstat as the conditioner. Figure 5 shows the fitted coefficients as functions of the conditioner, with the red lines showing the 95% confidence intervals and the dashed lines marking the x-axis. The confidence intervals are computed conditioning on the selected knots for each predictor. The results show clear conditioner-varying effects for most predictors. The intercept varies most significantly with lstat, demonstrating that the housing price is negatively and nearly linearly impacted by lstat. The coefficient for rm is in general positive but trends towards insignificance as lstat increases. An interpretation is that houses with more rooms are generally more expensive, but this effect becomes insignificant in areas where lstat is large. The variable crim is highly correlated with lstat. After regressing out lstat, the residualized crim shows an interesting fluctuation: varying from insignificant to positive and then to negative. We suspect that the positive impact of crim in certain areas may be due to a confounding effect with unused variables such as the location's convenience and attractiveness for tourists.

We also compare the predictive power of the simple linear model and the varying coefficient model by 10-fold cross-validation. For the simple linear model, we include all the predictors used in the varying coefficient model as well as the conditioner lstat. The mean squared error (MSE) is 23.52 for the simple linear model and 20.51 for the varying coefficient model; for this comparison, the observed and predicted responses are transformed back to the original scale. The results show that the varying coefficient model better describes the relationship between the housing price and the predictors.

Figure 5: Fitted coefficient for Boston Housing data, with conditioner as lstat.

5 Discussion

In this paper, we introduce two algorithms for fitting varying coefficient models with adaptive polynomial splines. The first algorithm selects knots globally using a dynamic programming recursion, assuming the same set of knots for all the predictors, whereas the second allows each predictor to have its own set of knots, selected by recursively refitting against partial residuals.

Coefficients modeled by polynomial splines with a finite number of adaptively positioned knots are more flexible as well as more interpretable than the standard ones with equidistant knot placements. Simulation studies show that both the global and predictor-specific algorithms outperform the commonly used kernel method, as well as the equidistant spline fitting method, in terms of mean squared error, with the predictor-specific algorithm achieving the best performance. A fast dynamic programming algorithm is introduced, with computational complexity of at most $O(n^{2})$, which reduces to $O(n)$ if the knots are chosen from the $\frac{m}{\sqrt{n}}\ (m=1,\ldots,\lfloor\sqrt{n}\rfloor-1)$ quantiles of the conditional variable $u$.

Throughout the paper we assume that the conditioner variable $u$ is univariate. The proposed predictor-specific spline approach can be easily generalized to the case where each coefficient $\beta_{j}(\cdot)$ has its own univariate conditioner. However, generalizing the proposed method to multi-dimensional conditioners and to models with correlated errors remains a challenging problem.

Acknowledgement

We are grateful to the National Oceanic and Atmospheric Administration Regional Climate Centers, Northeast Regional Climate Center at Cornell University for kindly sharing the meteorological data. We are thankful to the Environmental Protection Agency and the Department of Health, New York State for publishing the air quality data and daily infected records online.

References

  • Astrid et al. (2009) Astrid, J., Philippe, L., Benoit, B., & F, V. (2009). Pharmacokinetic parameters estimation using adaptive Bayesian P-splines models. Pharmaceutical Statistics, 8.
  • Casas & Fernandez-Casal (2019) Casas, I. & Fernandez-Casal, R. (2019). tvReg: Time-varying coefficient linear regression for single and multi-equations in R. SSRN.
  • Chen et al. (2015) Chen, Y., Wang, Q., & Yao, W. (2015). Adaptive estimation for varying coefficient models. Journal of Multivariate Analysis, 137, 17–31.
  • Fan & Zhang (1999) Fan, J. & Zhang, W. (1999). Statistical estimation in varying coefficient models. The Annals of Statistics, 27(5), 1491–1518.
  • Fan & Zhang (2000) Fan, J. & Zhang, W. (2000). Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scandinavian Journal of Statistics, 27(4), 715–731.
  • Finley & Banerjee (2020) Finley, A. O. & Banerjee, S. (2020). Bayesian spatially varying coefficient models in the spBayes R package. Environmental Modelling & Software, 125, 104608.
  • Harrison & Rubinfeld (1978) Harrison, D. & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81–102.
  • Hastie & Tibshirani (1993) Hastie, T. & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B (Methodological), 55(4), 757–796.
  • Hu et al. (2019) Hu, L., Huang, T., & You, J. (2019). Estimation and identification of a varying-coefficient additive model for locally stationary processes. Journal of the American Statistical Association, 114(527), 1191–1204.
  • Hu & Xia (2012) Hu, T. & Xia, Y. (2012). Adaptive semi-varying coefficient model selection. Statistica Sinica, 575–599.
  • Huang & Shen (2004) Huang, J. Z. & Shen, H. (2004). Functional coefficient regression models for non-linear time series: A polynomial spline approach. Scandinavian Journal of Statistics, 31(4), 515–534.
  • Huang et al. (2002) Huang, J. Z., Wu, C. O., & Zhou, L. (2002). Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika, 89(1), 111–128.
  • Huang et al. (2004) Huang, J. Z., Wu, C. O., & Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 763–788.
  • Lin et al. (2019) Lin, H., Yang, B., Zhou, L., Yip, P. S., Chen, Y.-Y., & Liang, H. (2019). Global kernel estimator and test of varying-coefficient autoregressive model. Canadian Journal of Statistics, 47(3), 487–519.
  • Notari (2021) Notari, A. (2021). Temperature dependence of COVID-19 transmission. Science of The Total Environment, 763, 144390.
  • Schwarz (1978) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
  • Tang & Cheng (2012) Tang, Q. & Cheng, L. (2012). Componentwise B-spline estimation for varying coefficient models with longitudinal data. Statistical Papers, 53, 629–652.
  • Wang et al. (2009) Wang, H. J., Zhu, Z., & Zhou, J. (2009). Quantile regression in partially linear varying coefficient models. The Annals of Statistics, 37(6B), 3841–3866.
  • Wang & Sun (2019) Wang, W. & Sun, Y. (2019). Penalized local polynomial regression for spatial data. Biometrics, 75(4), 1179–1190.
  • Wang et al. (2017) Wang, X., Jiang, B., & Liu, J. (2017). Generalized R-squared for detecting dependence. Biometrika, 104(1), 129–139.
  • Wei et al. (2011) Wei, F., Huang, J., & Li, H. (2011). Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica, 21(4), 1515.
  • Yuan & Lin (2006) Yuan, M. & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
  • Zhang & Wang (2014) Zhang, X. & Wang, J.-L. (2014). Varying-coefficient additive models for functional data. Biometrika, 102(1), 15–32.
  • Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.