
A unified consensus-based parallel ADMM algorithm for high-dimensional regression with combined regularizations

Xiaofei Wu, Zhimin Zhang, Zhenyu Cui

College of Mathematics and Statistics, Chongqing University, Chongqing, 401331, P.R. China. Email: [email protected]

Corresponding author. College of Mathematics and Statistics, Chongqing University, Chongqing, 401331, P.R. China. Email: [email protected]

School of Business, Stevens Institute of Technology, Hoboken, NJ 07030, United States. Email: [email protected]
Abstract

The parallel alternating direction method of multipliers (ADMM) algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solving statistical learning models. However, there is currently limited research on parallel algorithms specifically designed for high-dimensional regression with combined (composite) regularization terms. These terms, such as elastic-net, sparse group lasso, sparse fused lasso, and their nonconvex variants, have gained significant attention in various fields due to their ability to incorporate prior information and promote sparsity within specific groups or fused variables. The scarcity of parallel algorithms for combined regularizations can be attributed to the inherent nonsmoothness and complexity of these terms, as well as the absence of closed-form solutions for certain proximal operators associated with them. In this paper, we propose a unified constrained optimization formulation based on the consensus problem for these types of convex and nonconvex regression problems and derive the corresponding parallel ADMM algorithms. Furthermore, we prove that the proposed algorithm not only has global convergence but also exhibits a linear convergence rate. Extensive simulation experiments, along with a financial example, serve to demonstrate the reliability, stability, and scalability of our algorithm. The R package for implementing the proposed algorithms can be obtained at https://github.com/xfwu1016/CPADMM.

Keywords: Combined regularization; Global convergence; High-dimensional regression; Massive data; Parallel ADMM

1 Introduction

The advancement of modern science and technology has made data collection increasingly easy, resulting in the generation of massive amounts of data. However, due to the sheer volume of data and other factors such as privacy concerns, it has become necessary to store these data in a distributed manner. Therefore, it is essential to design parallel algorithms that can adapt to such massive, distributed datasets. Interested readers can refer to Boyd et al. (2010), Parikh and Boyd (2014), Lin et al. (2022), and their references for further information.

In this paper, our focus is on high-dimensional regression problems with massive data. Setting distributed storage aside for the moment, the data required for a regression model typically consist of the following,

\[ \{y_{i},\bm{x}_{i}\}_{i=1}^{n}=\{y_{i},x_{i,1},x_{i,2},\dots,x_{i,p}\}_{i=1}^{n}=\{\bm{y},\bm{X}\}. \tag{1.1} \]

Here, $y_{i}$ denotes the target value for the $i$-th observation $\bm{x}_{i}$, and $\bm{x}_{i}$ represents a $p$-dimensional vector that contains the $i$-th observation of the $p$ features. These features could be the original observations and/or selected functions constructed from them. The objective of regression is to establish a relationship between the features in $\bm{X}$ and the target values in $\bm{y}$ by estimating coefficients that minimize the difference between the predicted and actual values of $\bm{y}$. To avoid overfitting and improve interpretability, some regularization terms are considered to be added to the objective optimization function, that is,

\[ \mathop{\arg\min}\limits_{\bm{\beta}\in\mathbb{R}^{p}}\ \mathcal{L}(\bm{y}-\bm{X}\bm{\beta})+P_{\lambda}(|\bm{\beta}|), \tag{1.2} \]

where $\mathcal{L}(\bm{y}-\bm{X}\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}L(y_{i}-\bm{x}_{i}^{\top}\bm{\beta})$. Here, the loss function $L$ is a generic function that includes the least squares loss, the Huber loss in Huber (1964), and the quantile loss in Koenker and Bassett (1978). We also consider situations where $\mathcal{L}(\bm{y}-\bm{X}\bm{\beta})$ cannot be separated, such as the square root loss in Belloni et al. (2011). The regularization term $P_{\lambda}(|\bm{\beta}|)$ with tuning parameter $\lambda>0$ can be a single regularization term, such as the $\ell_{2}$ (ridge) penalty in Hoerl and Kennard (1970), total variation in Rudin et al. (1992), lasso in Tibshirani (1996), and group lasso in Yuan and Lin (2006); a combination of them, such as sparse fused lasso (lasso plus total variation) in Tibshirani et al. (2005), elastic-net (lasso plus $\ell_{2}$) in Zou and Hastie (2005), and sparse group lasso (lasso plus group lasso) in Wu and Lange (2008); or their nonconvex variants. Due to the lack of unified algorithms for these combined regularization regressions, this paper focuses primarily on addressing this issue.

The alternating direction method of multipliers (ADMM) is an iterative optimization algorithm commonly used to solve convex minimization problems with linear constraints. It decomposes these problems into subproblems and can be applied to various statistical learning models. Boyd et al. (2010) provided a good overview of its development and detailed applications in various fields. Because some of the loss functions $L$ are nonsmooth and nondifferentiable, it is necessary to introduce the linear constraint $\bm{r}=\bm{y}-\bm{X}\bm{\beta}$ in order to use ADMM to solve (1.2) in a unified way. The constrained optimization problem is then formulated as follows,

\[ \min_{\bm{\beta},\bm{r}}\ \mathcal{L}(\bm{r})+P_{\lambda}(|\bm{\beta}|),\quad \text{s.t.}\ \bm{X}\bm{\beta}+\bm{r}=\bm{y}. \tag{1.3} \]

Thus, the augmented Lagrangian form of (1.3) is

\[ L_{\mu}(\bm{\beta},\bm{r},\bm{d})=\mathcal{L}(\bm{r})+P_{\lambda}(|\bm{\beta}|)-\bm{d}^{\top}(\bm{X}\bm{\beta}+\bm{r}-\bm{y})+\frac{\mu}{2}\|\bm{X}\bm{\beta}+\bm{r}-\bm{y}\|_{2}^{2}, \tag{1.4} \]

where $\bm{d}$ is the dual variable corresponding to the linear constraint, and $\mu>0$ is a given penalty parameter. The iterative scheme of ADMM for (1.3) is

\[
\left\{
\begin{aligned}
\bm{\beta}^{k+1} &\leftarrow \mathop{\arg\min}\limits_{\bm{\beta}}\left\{L_{\mu}(\bm{\beta},\bm{r}^{k},\bm{d}^{k})\right\};\\
\bm{r}^{k+1} &\leftarrow \mathop{\arg\min}\limits_{\bm{r}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{r},\bm{d}^{k})\right\};\\
\bm{d}^{k+1} &\leftarrow \bm{d}^{k}-\mu(\bm{X}\bm{\beta}^{k+1}+\bm{r}^{k+1}-\bm{y}).
\end{aligned}
\right. \tag{1.5}
\]

In general, designing algorithms for distributed parallel processing requires a central machine and several local machines. Assume that

\[ \bm{X}=(\bm{X}_{1}^{\top},\bm{X}_{2}^{\top},\dots,\bm{X}_{M}^{\top})^{\top}\quad\text{and}\quad\bm{y}=(\bm{y}_{1}^{\top},\bm{y}_{2}^{\top},\dots,\bm{y}_{M}^{\top})^{\top} \tag{1.6} \]

are stored in a distributed manner across $M$ local machines. The central machine receives information transmitted by the local machines, consolidates the information, and then forwards it to the respective local machines. The decomposition in (1.6) enables the algorithm to be naturally parallelizable across multiple local machines. Each local machine can independently solve its designated subproblem, and the solutions are then communicated and coordinated between the central machine and the local machines through variable updates. This distributed storage allows for parallel processing and efficient handling of large-scale datasets. The traditional ADMM algorithm in (1.5) can only handle problems on a single machine and is not sufficient for data stored in a distributed manner. Recently, many parallel ADMM algorithms have been proposed to solve statistical learning models when data is stored in a distributed manner, such as linear regression models (see Boyd et al. (2010), Yu et al. (2017) and Fan et al. (2020)). The existing parallel ADMM algorithms employ strategies similar to the regularized consensus problem described by Boyd et al. (2010). The consensus problem in distributed computing refers to the need for multiple nodes in a distributed system to agree on a common decision.

These existing parallel algorithms are specifically designed for individual regularization terms which can achieve sparse feature selection, such as lasso, group lasso, scad (Fan and Li (2001)) and mcp (Zhang (2010)). However, practical applications often involve other types of structural information beyond sparse feature structures. These include high correlation, feature grouping, or smooth trends among features. Combined regularization terms can simultaneously utilize different regularization techniques to integrate multiple types of structural information into the model, enhancing its performance and generalization ability. Thus, in such cases, many studies have suggested replacing a single regularization term with a combined regularization term. For example, elastic-net in Zou and Hastie (2005), fused lasso in Tibshirani et al. (2005), and sparse group lasso in Wu and Lange (2008). Moreover, many nonconvex variants of these combined regularization terms have been developed. For example, elastic-net variants such as snet (scad plus $\ell_{2}$) in Wang et al. (2010) and mnet (mcp plus $\ell_{2}$) in Huang et al. (2015); sparse fused lasso variants like sctv (scad plus total variation) and mctv (mcp plus total variation) in Xiu et al. (2019); sparse group lasso variants such as scgl (scad plus group lasso) and mcgl (mcp plus group lasso) in Huang et al. (2012) and Tugnait (2021b).

However, when the individual regularization is replaced by the combined regularization, the parallel algorithms mentioned above cannot be directly applied to these combined regularization regressions due to the complexity of the combination penalty terms, as well as the non-separability of certain composite regularization terms (such as sparse fused lasso and its nonconvex variants), which do not have closed-form solutions for their proximal operators. Currently, there are no parallel algorithms proposed for solving these high-dimensional problems with combined regularizations. The main work of this paper is to fill this gap and develop a unified consensus-based parallel ADMM (CPADMM) algorithm for high-dimensional regression problems with combined regularizations. The proposed algorithm has three main advantages: 1. It can not only handle massive amounts of data, but also adapt to the structure of distributed data storage. 2. Our algorithm is applicable to a wide range of convex loss functions, as long as we can derive the closed-form solutions for the proximal operators corresponding to the losses. 3. It has a guarantee of global convergence and a linear convergence rate.

The rest of the paper is organized as follows. Section 2 reviews some widely used combined regularization terms and the proximal operators that play a crucial role in the algorithm. In Section 3, we first review the existing parallel ADMM algorithms for high-dimensional linear regression, and then propose a unified parallel ADMM algorithm for solving combined regularization regression. Section 4 discusses the specific implementation of the algorithm for various combined regularization regressions. The convergence analysis is provided in Section 5. Section 6 presents numerical results demonstrating the scalability, efficiency and accuracy of the proposed algorithm. Section 7 summarizes the findings and concludes the paper, with a discussion on future research directions. Technical proofs, algorithm extensions and additional experiments are provided in the Appendix. The R package CPADMM for implementing the parallel ADMM algorithms can be obtained at https://github.com/xfwu1016/CPADMM.

Notations: $\bm{0}_{n}$ and $\bm{1}_{n}$ represent $n$-dimensional vectors with all elements being 0 and 1, respectively. $\bm{F}$ is a $(p-1)\times p$ matrix with all elements being 0, except for 1 on the diagonal and $-1$ on the superdiagonal. $\bm{I}_{n}$ represents the $n$-dimensional identity matrix. The Hadamard product is denoted by $\odot$. The sign$(\cdot)$ function is defined component-wise such that sign$(t)=1$ if $t>0$, sign$(t)=0$ if $t=0$, and sign$(t)=-1$ if $t<0$. $(\cdot)_{+}$ signifies the element-wise operation of extracting the positive part, while $|\cdot|$ denotes the element-wise absolute value function. For any vector $\bm{u}$, $\|\bm{u}\|_{1}$, $\|\bm{u}\|_{2}$, and $\|\bm{u}\|_{2,1}$ denote the $\ell_{1}$ norm, the $\ell_{2}$ norm, and the $\ell_{2,1}$ norm of $\bm{u}$, respectively. $\|\bm{u}\|_{\bm{H}}:=\sqrt{\bm{u}^{\top}\bm{H}\bm{u}}$ denotes the norm of $\bm{u}$ induced by a matrix $\bm{H}$.

2 Preliminaries and Literature Review

In this section, we first review some commonly used combined regularization techniques and loss functions. Then we derive their corresponding proximal operators, as they play a crucial role in our algorithm.

2.1 Combined regularizations

The combined regularization can be defined as

\[ P_{\lambda}(|\bm{\beta}|)=P_{\lambda_{1}}(|\bm{\beta}|)+P_{\lambda_{2}}(|\bm{\beta}|), \tag{2.1} \]

where $\lambda_{1},\lambda_{2}>0$ are two tuning parameters, and $P_{\lambda_{1}}(|\bm{\beta}|)$ and $P_{\lambda_{2}}(|\bm{\beta}|)$ are two individual regularization terms. Next, we will discuss in detail the various expressions for combined regularization.

2.1.1 Elastic-net

Elastic-net regularization in Zou and Hastie (2005) is a popular regularization technique used in machine learning and statistics. It is specifically designed for scenarios where there are a large number of variables and multicollinearity exists among them. This method combines the strengths of two other regularization techniques, $\ell_{1}$ (lasso) and $\ell_{2}$ (ridge), that is

\[ \lambda_{1}\|\bm{\beta}\|_{1}+\lambda_{2}\|\bm{\beta}\|_{2}^{2}, \tag{2.2} \]

to achieve a more effective and stable model. Due to these advantages, elastic-net regularization is widely applied in various fields, including high-dimensional data analysis in Zou and Tibshirani (2006), Zou and Zhang (2009) and Zou and Xue (2018); genomics and bioinformatics in Ogutu et al. (2012), Zhang and Hong (2017) and Zou (2019); and feature selection and variable screening in Zou (2020) and Fan et al. (2020). Many algorithms have also been proposed to solve various loss functions with elastic-net regularization, such as the path solution algorithm in Rosset and Zhu (2007); cyclical coordinate descent in Friedman et al. (2007, 2010), Yang and Zou (2013) and Yi and Huang (2017); and the ADMM algorithm in Gu et al. (2017) and Liang et al. (2023). Although these algorithms perform well on high-dimensional data, they are not suitable for handling massive data stored in a distributed manner.

2.1.2 Sparse fused lasso

For problems in which the covariates are sparse and a blocky structure is desired, the sparse fused lasso regularization (Tibshirani et al. (2005)) has proved to be very efficient. The sparse fused lasso regularization term is defined as

\[ \lambda_{1}\|\bm{\beta}\|_{1}+\lambda_{2}\sum_{j=2}^{p}|\bm{\beta}_{j}-\bm{\beta}_{j-1}|=\lambda_{1}\|\bm{\beta}\|_{1}+\lambda_{2}\|\bm{F}\bm{\beta}\|_{1}. \tag{2.3} \]

We refer to Li and Zhu (2007), Tibshirani and Wang (2008), Ahmed and Xing (2009), Sun et al. (2016), Parekh and Selesnick (2016) and Corsaro et al. (2019) for wide applications of the fused lasso regularization. Several algorithms have been proposed to solve various types of high-dimensional fused lasso regression ($\bm{X}$ is not an identity matrix), including the majorization-minimization algorithm (Yu et al. (2013)) and several variants of the ADMM algorithm (Ye and Xie (2011), Li et al. (2014), Jiang et al. (2020) and Wu et al. (2023)). However, these algorithms are also not suitable for massive data stored in a distributed manner.

2.1.3 Sparse group lasso

In high-dimensional supervised learning problems, leveraging problem-specific assumptions often leads to improved accuracy. For problems involving grouped covariates, it is widely believed that the effects are sparsely distributed at both the group level and within the groups. Many studies (see Friedman et al. (2010), Qin and Goldfarb (2011), Simon et al. (2013) and Chinot et al. (2020)) have shown that employing sparse group lasso regularization can accurately identify these grouped covariates. The sparse group lasso regularization is defined as

\[ \lambda_{1}\|\bm{\beta}\|_{1}+\lambda_{2}\|\bm{\beta}\|_{2,1}, \tag{2.4} \]

where the $\ell_{2,1}$-norm is a rotational invariant of the $\ell_{1}$ norm. If a $p$-vector $\bm{\beta}$ is partitioned into $G$ disjoint groups denoted respectively by $\bm{\beta}_{1},\bm{\beta}_{2},\dots,\bm{\beta}_{G}$, then its $\ell_{2,1}$-norm is defined as

\[ \|\bm{\beta}\|_{2,1}=\sum_{g=1}^{G}\|\bm{\beta}_{g}\|_{2}. \tag{2.5} \]
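
To make the grouping concrete, the following is a minimal R sketch of (2.5); the helper name `l21_norm` and the integer vector `group` (one group label per coefficient) are our own illustrative choices, not part of the CPADMM package:

```r
# l2,1-norm of beta given a group membership vector (one integer label per coefficient)
l21_norm <- function(beta, group) {
  # split the coefficients by group and sum the Euclidean norms of the sub-vectors
  sum(sapply(split(beta, group), function(b) sqrt(sum(b^2))))
}

# example: p = 6 coefficients partitioned into G = 3 groups of size 2
beta  <- c(1, -2, 0, 0, 3, 4)
group <- c(1, 1, 2, 2, 3, 3)
l21_norm(beta, group)  # sqrt(1 + 4) + 0 + sqrt(9 + 16) = sqrt(5) + 5
```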

We refer to Huang et al. (2009), Zhao et al. (2015), Chen et al. (2020), Li et al. (2020) and He et al. (2022) for wide applications of the group lasso regularization. So far, several algorithms have been proposed to solve the high-dimensional sparse group lasso regression problem, including but not limited to the coordinate descent algorithm (Wu and Lange (2008), Daniel et al. (2017) and Juan et al. (2019)), the accelerated generalized gradient descent algorithm (Simon et al. (2013)), the linearized ADMM algorithm (Li et al. (2014)), an inexact semismooth Newton based augmented Lagrangian algorithm (Zhang et al. (2020)), and the proximal gradient descent algorithm (Klosa et al. (2020)). However, these algorithms are not suitable for massive data stored in a distributed manner.

2.1.4 Nonconvex extensions for combined regularizations

Although these combined regularizations effectively utilize prior information about the structure of features, they still rely on the lasso to achieve sparsity. Many studies, such as Fan and Li (2001) and Zou (2006), have confirmed that while the lasso has the ability to induce sparsity, it can introduce non-negligible estimation bias due to the use of the same penalty for all terms. Therefore, many nonconvex penalty functions have been proposed to address this problem, and the most famous ones among them are scad (Fan and Li (2001)) and mcp (Zhang (2010)). In this paper, we mainly focus our attention on two popular nonconvex regularizations, $\text{scad}(|\bm{\beta}|)$ and $\text{mcp}(|\bm{\beta}|)$, which replace $\lambda_{1}\|\bm{\beta}\|_{1}$ to form nonconvex combined regularizations. We summarize these combined regularizations in Table 1. These nine combined regularizations are also the main focus of this paper.

Table 1: Nine types of combined regularization.

  $P_{\lambda_{1}}(|\bm{\beta}|)$ \ $P_{\lambda_{2}}(|\bm{\beta}|)$ | $\lambda_{2}\|\bm{\beta}\|_{2}^{2}$ | $\lambda_{2}\|\bm{\beta}\|_{2,1}$ | $\lambda_{2}\|\bm{F}\bm{\beta}\|_{1}$
  $\lambda_{1}\|\bm{\beta}\|_{1}$                                  | elastic-net                         | sgla                              | sfla
  $\text{scad}(|\bm{\beta}|)$                                      | snet                                | scgl                              | sctv
  $\text{mcp}(|\bm{\beta}|)$                                       | mnet                                | mcgl                              | mctv

The local linear approximation (LLA) method proposed by Zou and Li (2008) is an effective algorithm for solving nonconvex regularized regression. The main idea is to transform the problem of solving scad- or mcp-penalized models into solving several iteratively reweighted lasso problems. Fan et al. (2014) gave this method a theoretical guarantee, proving that for many losses, as long as the initial estimator and the oracle estimator behave well, the two-step LLA can find a good estimator with high probability. In this paper, we will use the LLA method to transform these nonconvex combined regularization optimization problems into weighted convex optimization problems.

2.2 Proximal operator

The proximal operator, introduced in Parikh and Boyd (2014), is widely used in convex and numerical optimization, particularly when the objective function involves nondifferentiable and/or nonsmooth terms. Its mathematical expression is as follows,

\[ \operatorname{prox}_{\gamma,f}(\bm{x})=\arg\min_{\bm{z}\in\mathbb{R}^{q}}\left(f(\bm{z})+\frac{\gamma}{2}\|\bm{z}-\bm{x}\|_{2}^{2}\right), \tag{2.6} \]

where $f:\mathbb{R}^{q}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a closed proper convex function, and $\bm{x}$ and $\gamma$ denote a given vector and a given constant, respectively. Generally speaking, $f$ is assumed to be separable, that is, $f(\bm{z})=\sum_{i}f_{i}(z_{i})$. Since $\|\cdot\|_{2}^{2}$ is also a separable function, $\operatorname{prox}_{\gamma,f}(\bm{x})$ can be solved coordinate-wise, that is

\[ \operatorname{prox}_{\gamma,f_{i}}(x_{i})=\arg\min_{z_{i}}\left(f_{i}(z_{i})+\frac{\gamma}{2}(z_{i}-x_{i})^{2}\right). \tag{2.7} \]

2.2.1 Proximal operators for regularizations

Many widely used operators in applications are actually special cases of proximal operators, among which the most famous one is the soft-thresholding operator in Donoho (1995). The soft-thresholding operator is defined as $\text{prox}_{\gamma,\lambda\|\bm{\beta}\|_{1}}(\bm{x})=\arg\min_{\bm{\beta}\in\mathbb{R}^{p}}\left(\lambda\|\bm{\beta}\|_{1}+\frac{\gamma}{2}\|\bm{\beta}-\bm{x}\|_{2}^{2}\right)$. The closed-form solution for $\text{prox}_{\gamma,\lambda\|\bm{\beta}\|_{1}}(\bm{x})$ is given by

\[ \text{prox}_{\gamma,\lambda\|\bm{\beta}\|_{1}}(\bm{x})=\text{sign}(\bm{x})\odot\max\left(|\bm{x}|-\frac{\lambda}{\gamma},0\right). \tag{2.8} \]

In addition, we need to consider the proximal operator commonly used for $\ell_{2,1}$-norm regularization, which is also known as the group soft-thresholding operator. It is defined as $\text{prox}_{\gamma,\lambda\|\bm{\beta}\|_{2,1}}(\bm{x})=\arg\min_{\bm{\beta}\in\mathbb{R}^{p}}\left(\lambda\|\bm{\beta}\|_{2,1}+\frac{\gamma}{2}\|\bm{\beta}-\bm{x}\|_{2}^{2}\right)$. The closed-form solution for $\text{prox}_{\gamma,\lambda\|\bm{\beta}\|_{2,1}}(\bm{x})$, applied group-wise to each block of $\bm{x}$, is given by

\[ \text{prox}_{\gamma,\lambda\|\bm{\beta}\|_{2,1}}(\bm{x})=\frac{\bm{x}}{\|\bm{x}\|_{2}}\cdot\max\left(\|\bm{x}\|_{2}-\frac{\lambda}{\gamma},0\right). \tag{2.9} \]
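
For illustration, a minimal R sketch of (2.8) and of the group-wise application of (2.9) follows; the function names and the `group` index vector are ours and are only meant to mirror the formulas above:

```r
# soft-thresholding operator, eq. (2.8)
soft_threshold <- function(x, lambda, gamma) {
  sign(x) * pmax(abs(x) - lambda / gamma, 0)
}

# group soft-thresholding, eq. (2.9), applied block by block according to `group`
group_soft_threshold <- function(x, lambda, gamma, group) {
  unsplit(lapply(split(x, group), function(xg) {
    nrm <- sqrt(sum(xg^2))
    if (nrm == 0) xg else xg / nrm * max(nrm - lambda / gamma, 0)
  }), group)
}
```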

However, when it comes to the proximal operator of the total variation penalty, the definition is as follows

\[ \text{prox}_{\gamma,\lambda\|\bm{F}\bm{\beta}\|_{1}}(\bm{x})=\arg\min_{\bm{\beta}\in\mathbb{R}^{p}}\left(\lambda\sum_{j=2}^{p}|\bm{\beta}_{j}-\bm{\beta}_{j-1}|+\frac{\gamma}{2}\|\bm{\beta}-\bm{x}\|_{2}^{2}\right). \tag{2.10} \]

Due to the fact that the total variation $\sum_{j=2}^{p}|\bm{\beta}_{j}-\bm{\beta}_{j-1}|$ is an inseparable regularization term, it is not possible to derive a closed-form solution for the proximal operator in (2.10). Although it can be solved using iterative numerical methods, incorporating such a method into the ADMM iterations in (1.5) would result in a double loop, leading to significant computational expense and time consumption.

Note that for combined regularizations such as elastic-net and sparse group lasso, the closed-form solutions for the proximal operators $\text{prox}_{\gamma,\lambda_{1}\|\bm{\beta}\|_{1}+\lambda_{2}\|\bm{\beta}\|_{2}^{2}}(\bm{x})$ and $\text{prox}_{\gamma,\lambda_{1}\|\bm{\beta}\|_{1}+\lambda_{2}\|\bm{\beta}\|_{2,1}}(\bm{x})$ can be found in Gu et al. (2017) and Chartrand (2012), respectively. However, note also that a closed-form solution for the proximal operator $\text{prox}_{\gamma,\lambda_{1}\|\bm{\beta}\|_{1}+\lambda_{2}\|\bm{F}\bm{\beta}\|_{1}}(\bm{x})$ cannot be derived due to the presence of the total variation regularization.

2.2.2 Proximal operators for loss functions

In this paper, we consider several convex loss functions, such as least squares (Tibshirani et al. (2005), Zou and Hastie (2005) and Friedman et al. (2010)), quantile (Kato (2011), Gu et al. (2017) and Wu et al. (2023)), square root (Bunea et al. (2013) and Jiang et al. (2020)), and Huber (Chinot et al. (2020)), that are applied to various high-dimensional elastic-net, sparse fused lasso and sparse group lasso regression models. We use the symbol $\mathcal{L}$ to represent these loss functions, and their proximal operators are defined as follows

\[ \operatorname{prox}_{\mu,\mathcal{L}}(\bm{x})=\arg\min_{\bm{r}}\left(\mathcal{L}(\bm{r})+\frac{\mu}{2}\|\bm{r}-\bm{x}\|_{2}^{2}\right). \tag{2.11} \]

If $\mathcal{L}$ is a separable function, $\operatorname{prox}_{\mu,\mathcal{L}}(\bm{x})$ can also be solved coordinate-wise, that is

\[ \operatorname{prox}_{\mu,\mathcal{L}}(x_{i})=\arg\min_{r_{i}}\left(\frac{1}{n}L(r_{i})+\frac{\mu}{2}(r_{i}-x_{i})^{2}\right). \tag{2.12} \]

We summarize the mathematical expressions of these losses and their corresponding closed-form solutions for the proximal operators in Table 2. These closed-form solutions have already been discussed in some papers, and a simple and easily extensible proof was presented in the appendix of Liang et al. (2023).

Table 2: The mathematical expressions of various loss functions and their closed-form solutions for the proximal operators.

  Loss function  | Mathematical expression $\mathcal{L}(\bm{r})$ | Closed form of the proximal operator $\operatorname{prox}_{\mu,\mathcal{L}}(x_{i})$
  Least squares  | $\frac{1}{2n}\sum_{i=1}^{n}r_{i}^{2}$ | $\frac{n\mu x_{i}}{1+n\mu}$
  Quantile       | $\frac{1}{n}\sum_{i=1}^{n}r_{i}\left(\tau-I_{(r_{i}<0)}\right)$ | $\max\{x_{i}-\frac{\tau}{n\mu},\min(0,x_{i}+\frac{1-\tau}{n\mu})\}$
  Square root    | $\sqrt{\frac{\sum_{i=1}^{n}r_{i}^{2}}{2n}}$ | $\frac{x_{i}}{\|\bm{x}\|_{2}}\cdot\max\{\|\bm{x}\|_{2}-\frac{1}{\sqrt{2n}\mu},0\}$
  Huber          | $\frac{1}{n}\sum_{i=1}^{n}\left[\frac{r_{i}^{2}}{2\delta}I_{(|r_{i}|\leq\delta)}+(|r_{i}|-\frac{\delta}{2})I_{(|r_{i}|>\delta)}\right]$ | $\frac{n\mu\delta x_{i}}{1+n\mu\delta}+\frac{1}{1+n\mu\delta}\cdot\text{sign}(x_{i})\cdot\max\{0,|x_{i}|-\frac{1}{n\mu}-\delta\}$
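
As a reference for Table 2, here is a minimal R sketch of the four proximal operators (the function names and argument order are ours; the least squares, quantile and Huber ones act element-wise on x, while the square root one acts on the whole vector):

```r
# prox of the least squares loss (1/(2n)) * sum(r^2)
prox_ls <- function(x, n, mu) n * mu * x / (1 + n * mu)

# prox of the quantile loss (1/n) * sum(r * (tau - I(r < 0)))
prox_quantile <- function(x, n, mu, tau) {
  pmax(x - tau / (n * mu), pmin(0, x + (1 - tau) / (n * mu)))
}

# prox of the square root loss ||r||_2 / sqrt(2n); block (not element-wise) operator
prox_sqrt <- function(x, n, mu) {
  nrm <- sqrt(sum(x^2))
  if (nrm == 0) x else x / nrm * max(nrm - 1 / (sqrt(2 * n) * mu), 0)
}

# prox of the (rescaled) Huber loss in Table 2 with parameter delta
prox_huber <- function(x, n, mu, delta) {
  n * mu * delta * x / (1 + n * mu * delta) +
    sign(x) * pmax(0, abs(x) - 1 / (n * mu) - delta) / (1 + n * mu * delta)
}
```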
 

3 Parallel ADMM algorithm

Assuming that there are a total of $M$ local machines available, the response vector $\bm{y}\in\mathbb{R}^{n}$ and the sample observation matrix $\bm{X}\in\mathbb{R}^{n\times p}$ can be divided into $M$ blocks as follows,

\[ \bm{y}=(\bm{y}_{1}^{\top},\bm{y}_{2}^{\top},\dots,\bm{y}_{M}^{\top})^{\top},\ \bm{X}=(\bm{X}_{1}^{\top},\bm{X}_{2}^{\top},\dots,\bm{X}_{M}^{\top})^{\top}, \tag{3.1} \]

where $\bm{y}_{m}\in\mathbb{R}^{n_{m}}$, $\bm{X}_{m}\in\mathbb{R}^{n_{m}\times p}$ and $\sum_{m=1}^{M}n_{m}=n$. In this section, we briefly review the existing consensus parallel ADMM algorithms for solving high-dimensional regression models. Based on these existing algorithms, we then propose a unified parallel ADMM algorithm for high-dimensional regression models with combined regularization.

3.1 Existing consensus-based parallel ADMM algorithm

3.1.1 Parallel ADMM for lasso

In Boyd et al. (2010), the consensus ADMM algorithm was first introduced for solving high-dimensional lasso regression in distributed storage settings. By introducing the consensus constraints $\{\bm{\beta}=\bm{\beta}_{m}\}_{m=1}^{M}$, the constrained optimization problem for lasso is as follows

\[ \min_{\bm{\beta},\bm{\beta}_{m}}\ \sum_{m=1}^{M}\frac{1}{2n}\|\bm{y}_{m}-\bm{X}_{m}\bm{\beta}_{m}\|_{2}^{2}+\lambda\|\bm{\beta}\|_{1},\quad \text{s.t.}\ \bm{\beta}=\bm{\beta}_{m},\ m=1,2,\dots,M. \tag{3.2} \]

The augmented Lagrangian of (3.2) is

\[ L_{\mu}(\bm{\beta},\bm{\beta}_{m},\bm{e}_{m})=\sum_{m=1}^{M}\frac{1}{2n}\|\bm{y}_{m}-\bm{X}_{m}\bm{\beta}_{m}\|_{2}^{2}+\lambda\|\bm{\beta}\|_{1}-\sum_{m=1}^{M}\langle\bm{e}_{m},\bm{\beta}_{m}-\bm{\beta}\rangle+\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{\beta}_{m}-\bm{\beta}\|_{2}^{2}, \tag{3.3} \]

where $\bm{e}_{m}$ are the dual variables corresponding to the consensus constraints, and $\mu>0$ is a given penalty parameter. Similar to the iterative steps of the ADMM algorithm, the parallel ADMM can be written in the following form,

\[
\left\{
\begin{aligned}
\bm{\beta}^{k+1} &\leftarrow \mathop{\arg\min}\limits_{\bm{\beta}}\left\{\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{\beta}_{m}^{k}-\bm{\beta}-\bm{e}^{k}_{m}/\mu\|_{2}^{2}+\lambda\|\bm{\beta}\|_{1}\right\};\\
\bm{\beta}_{m}^{k+1} &\leftarrow \mathop{\arg\min}\limits_{\bm{\beta}_{m}}\left\{\frac{1}{2n}\|\bm{y}_{m}-\bm{X}_{m}\bm{\beta}_{m}\|_{2}^{2}+\frac{\mu}{2}\|\bm{\beta}_{m}-\bm{\beta}^{k+1}-\bm{e}_{m}^{k}/\mu\|_{2}^{2}\right\},\ m=1,2,\dots,M;\\
\bm{e}^{k+1}_{m} &\leftarrow \bm{e}^{k}_{m}-\mu(\bm{\beta}_{m}^{k+1}-\bm{\beta}^{k+1}),\ m=1,2,\dots,M.
\end{aligned}
\right. \tag{3.4}
\]

The first step involves the central machine receiving the $\bm{\beta}_{m}^{k}$ passed by each local machine, integrating the information to obtain $\bm{\beta}^{k+1}$, and then passing it on to each local machine. In the second step, the update of $\bm{\beta}_{m}^{k+1}$ depends only on the updates of $\bm{\beta}^{k+1}$ and $\bm{e}_{m}^{k}$. This paves the way for parallel computation of $\bm{\beta}_{m},\ m=1,2,\dots,M$. The third step involves updating the dual variables and loading them onto the respective local machines as well.
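
To make the first step of (3.4) concrete, the $\bm{\beta}$-update reduces to a single soft-thresholding of the consensus average. A minimal R sketch is given below, assuming the local copies beta_m and the duals e_m are collected as p x M matrices; the function name is ours:

```r
# central-machine beta-update in (3.4):
# minimize (mu/2) * sum_m ||beta_m^k - beta - e_m^k/mu||^2 + lambda * ||beta||_1,
# i.e. soft-threshold the average of (beta_m^k - e_m^k/mu) at lambda / (M * mu)
update_beta_lasso <- function(beta_m, e_m, lambda, mu) {
  M   <- ncol(beta_m)
  avg <- rowMeans(beta_m - e_m / mu)
  sign(avg) * pmax(abs(avg) - lambda / (M * mu), 0)
}
```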

Note that if the least squares loss in (3.2) is replaced by the quantile loss, the parallel algorithm provided by Boyd et al. (2010) is no longer applicable because the quantile loss is piecewise linear and nondifferentiable. Inspired by the work of Boyd et al. (2010), Yu et al. (2017) proposed a consensus-based parallel ADMM algorithm for convex and nonconvex regularized quantile regression.

3.1.2 Parallel ADMM for regularized quantile regression

By introducing the consensus constraints $\{\bm{\beta}=\bm{\beta}_{m}\}_{m=1}^{M}$ and $\{\bm{r}_{m}=\bm{y}_{m}-\bm{X}_{m}\bm{\beta}_{m}\}_{m=1}^{M}$, the constrained optimization problem is as follows

\[ \min_{\bm{\beta},\bm{r}_{m},\bm{\beta}_{m}}\ \sum_{m=1}^{M}\rho(\bm{r}_{m})+P_{\lambda}(|\bm{\beta}|),\quad \text{s.t.}\ \bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}=\bm{y}_{m},\ \bm{\beta}=\bm{\beta}_{m},\ m=1,2,\dots,M, \tag{3.5} \]

where $\bm{r}=(\bm{r}_{1}^{\top},\bm{r}_{2}^{\top},\dots,\bm{r}_{M}^{\top})^{\top}$ and $\rho(\bm{r}_{m})=\frac{1}{n}\sum_{i}\rho(r_{mi})$. Here, $\rho(\cdot)$ is the quantile loss given in Table 2, and $P_{\lambda}(|\bm{\beta}|)$ is the lasso, scad or mcp penalty. The augmented Lagrangian of (3.5) is

\[
\begin{aligned}
L_{\mu}(\bm{\beta},\bm{r}_{m},\bm{\beta}_{m},\bm{d}_{m},\bm{e}_{m}) &=\sum_{m=1}^{M}\rho\left(\bm{r}_{m}\right)+P_{\lambda}(|\bm{\beta}|)-\sum_{m=1}^{M}\langle\bm{d}_{m},\bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}-\bm{y}_{m}\rangle-\sum_{m=1}^{M}\langle\bm{e}_{m},\bm{\beta}_{m}-\bm{\beta}\rangle\\
&\quad+\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}-\bm{y}_{m}\|_{2}^{2}+\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{\beta}_{m}-\bm{\beta}\|_{2}^{2},
\end{aligned} \tag{3.6}
\]

where $\bm{d}_{m}$ and $\bm{e}_{m}$ are dual variables corresponding to the linear constraints, and $\mu>0$ is a given penalty parameter. With a given $\{\bm{r}_{m}^{0},\bm{\beta}_{m}^{0},\bm{d}_{m}^{0},\bm{e}_{m}^{0}\}_{m=1}^{M}$, the iterative scheme of parallel ADMM for (3.6) is

\[
\left\{
\begin{aligned}
\bm{\beta}^{k+1} &\leftarrow \mathop{\arg\min}\limits_{\bm{\beta}}\left\{L_{\mu}(\bm{\beta},\bm{r}_{m}^{k},\bm{\beta}_{m}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\};\\
\bm{r}_{m}^{k+1} &\leftarrow \mathop{\arg\min}\limits_{\bm{r}_{m}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{r}_{m},\bm{\beta}_{m}^{k},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\},\ m=1,2,\dots,M;\\
\bm{\beta}_{m}^{k+1} &\leftarrow \mathop{\arg\min}\limits_{\bm{\beta}_{m}}\left\{L_{\mu}(\bm{\beta}^{k+1},\bm{r}_{m}^{k+1},\bm{\beta}_{m},\bm{d}_{m}^{k},\bm{e}_{m}^{k})\right\},\ m=1,2,\dots,M;\\
\bm{d}^{k+1}_{m} &\leftarrow \bm{d}^{k}_{m}-\mu(\bm{X}_{m}\bm{\beta}_{m}^{k+1}+\bm{r}_{m}^{k+1}-\bm{y}_{m}),\ m=1,2,\dots,M;\\
\bm{e}^{k+1}_{m} &\leftarrow \bm{e}^{k}_{m}-\mu(\bm{\beta}_{m}^{k+1}-\bm{\beta}^{k+1}),\ m=1,2,\dots,M.
\end{aligned}
\right. \tag{3.7}
\]

Both the $\bm{\beta}^{k+1}$ and $\bm{r}^{k+1}$ updates can be written as proximal operators with closed-form solutions. The update of $\bm{\beta}_{m}^{k+1}$ involves solving a system of linear equations. The updates of $\bm{d}_{m}^{k+1}$ and $\bm{e}_{m}^{k+1}$ involve matrix-vector multiplications. For more information, please refer to Yu et al. (2017). More recently, Fan et al. (2020) used a slack variable representation to rewrite the quantile loss, and developed a parallel ADMM algorithm for penalized quantile regression. Although that algorithm achieves good results in simulations, its design philosophy is the same as that of Yu et al. (2017). Therefore, we will not provide further description.

3.2 Parallel ADMM for solving the combined regularization regression

The unified optimization form for high-dimensional regression models with combined regularization is defined as

\[ \mathop{\arg\min}\limits_{\bm{\beta}}\left\{\mathcal{L}\left(\bm{y}-\bm{X}\bm{\beta}\right)+P_{\lambda_{1}}(|\bm{\beta}|)+P_{\lambda_{2}}(|\bm{G}\bm{\beta}|)\right\}, \tag{3.8} \]

where $\bm{G}$ is a matrix that changes with the method. For elastic-net and sparse group lasso regression problems, $\bm{G}$ is the identity matrix, and for sparse fused lasso regression problems, $\bm{G}$ is $\bm{F}$.

Setting $\bm{r}_{m}=\bm{y}_{m}-\bm{X}_{m}\bm{\beta}_{m}$, $\bm{\beta}_{m}=\bm{\beta}$ and $\bm{G}\bm{\beta}=\bm{b}$, the combined regularization regression model in (3.8) can be rewritten as

\[
\begin{aligned}
&\mathop{\arg\min}\limits_{\bm{\beta},\bm{b},\{\bm{r}_{m}\}_{m=1}^{M},\{\bm{\beta}_{m}\}_{m=1}^{M}}\left\{\sum_{m=1}^{M}\mathcal{L}\left(\bm{r}_{m}\right)+P_{\lambda_{1}}(|\bm{\beta}|)+P_{\lambda_{2}}(|\bm{b}|)\right\},\\
&\quad\ \text{s.t.}\ \bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}=\bm{y}_{m},\ \bm{\beta}_{m}=\bm{\beta},\ \bm{G}\bm{\beta}=\bm{b},\ m=1,2,\dots,M.
\end{aligned} \tag{3.9}
\]

The augmented Lagrangian of (3.9) is given by

\[
\begin{aligned}
&\sum_{m=1}^{M}\mathcal{L}\left(\bm{r}_{m}\right)+P_{\lambda_{1}}(|\bm{\beta}|)+P_{\lambda_{2}}(|\bm{b}|)-\sum_{m=1}^{M}\langle\bm{d}_{m},\bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}-\bm{y}_{m}\rangle-\sum_{m=1}^{M}\langle\bm{e}_{m},\bm{\beta}_{m}-\bm{\beta}\rangle-\langle\bm{f},\bm{G}\bm{\beta}-\bm{b}\rangle\\
&\quad+\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}-\bm{y}_{m}\|_{2}^{2}+\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{\beta}_{m}-\bm{\beta}\|_{2}^{2}+\frac{\mu}{2}\|\bm{G}\bm{\beta}-\bm{b}\|_{2}^{2},
\end{aligned} \tag{3.10}
\]

where $\bm{d}_{m}$, $\bm{e}_{m}$ and $\bm{f}$ are dual variables corresponding to the linear constraints, and $\mu>0$ is a given penalty parameter.

After rearranging the terms in (3.10) and omitting constant terms, the update of the variables in the iterative process can be written as

\[
\left\{
\begin{aligned}
\bm{\beta}^{k+1}&\leftarrow\arg\min_{\bm{\beta}}\left\{P_{\lambda_{1}}(|\bm{\beta}|)+\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{\beta}_{m}^{k}-\bm{\beta}-\bm{e}_{m}^{k}/\mu\|_{2}^{2}+\frac{\mu}{2}\|\bm{G}\bm{\beta}-\bm{b}^{k}-\bm{f}^{k}/\mu\|_{2}^{2}\right\};\\
\bm{b}^{k+1}&\leftarrow\arg\min_{\bm{b}}\left\{P_{\lambda_{2}}(|\bm{b}|)+\frac{\mu}{2}\|\bm{G}\bm{\beta}^{k+1}-\bm{b}-\bm{f}^{k}/\mu\|_{2}^{2}\right\};\\
\bm{r}_{m}^{k+1}&\leftarrow\arg\min_{\bm{r}_{m}}\left\{\mathcal{L}(\bm{r}_{m})+\frac{\mu}{2}\|\bm{X}_{m}\bm{\beta}_{m}^{k}+\bm{r}_{m}-\bm{y}_{m}-\bm{d}_{m}^{k}/\mu\|_{2}^{2}\right\},\ m=1,2,\dots,M;\\
\bm{\beta}_{m}^{k+1}&\leftarrow\arg\min_{\bm{\beta}_{m}}\left\{\frac{\mu}{2}\|\bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}^{k+1}-\bm{y}_{m}-\bm{d}_{m}^{k}/\mu\|_{2}^{2}+\frac{\mu}{2}\|\bm{\beta}_{m}-\bm{\beta}^{k+1}-\bm{e}_{m}^{k}/\mu\|_{2}^{2}\right\},\ m=1,2,\dots,M;\\
\bm{d}_{m}^{k+1}&\leftarrow\bm{d}_{m}^{k}-\mu(\bm{X}_{m}\bm{\beta}_{m}^{k+1}+\bm{r}_{m}^{k+1}-\bm{y}_{m}),\ m=1,2,\dots,M;\\
\bm{e}_{m}^{k+1}&\leftarrow\bm{e}_{m}^{k}-\mu(\bm{\beta}_{m}^{k+1}-\bm{\beta}^{k+1}),\ m=1,2,\dots,M;\\
\bm{f}^{k+1}&\leftarrow\bm{f}^{k}-\mu(\bm{G}\bm{\beta}^{k+1}-\bm{b}^{k+1}).
\end{aligned}
\right. \tag{3.11}
\]

From (3.11), we know that the central machine does not need to load data; it only needs to update $\bm{\beta}$, $\bm{b}$ and $\bm{f}$. The $m$-th local machine needs to load $\bm{X}_{m}$ and $\bm{y}_{m}$, and update its corresponding $\bm{r}_{m}$, $\bm{\beta}_{m}$, $\bm{d}_{m}$, and $\bm{e}_{m}$. We visualize the operations of the central and local machines in the parallel algorithm in Figure 1, and summarize the algorithm in Algorithm 1.

In certain economic and financial applications, additional linear constraints are required on the coefficient $\bm{\beta}$, such as the sum constraint in investment arbitrage and the non-negative constraint in stock index tracking. In such cases, a minor modification is needed in the update of $\bm{\beta}^{k+1}$, that is,

\[ \bm{\beta}^{k+1}\leftarrow\arg\min_{\bm{\beta}\in\mathcal{C}}\left\{P_{\lambda_{1}}(|\bm{\beta}|)+\frac{\mu}{2}\sum_{m=1}^{M}\|\bm{\beta}_{m}^{k}-\bm{\beta}-\bm{e}_{m}^{k}/\mu\|_{2}^{2}+\frac{\mu}{2}\|\bm{G}\bm{\beta}-\bm{b}^{k}-\bm{f}^{k}/\mu\|_{2}^{2}\right\}, \tag{3.12} \]

which does not change the structure of the algorithm. Here, $\mathcal{C}$ is a space defined by linear constraints, such as $\bm{1}^{\top}\bm{\beta}=1$ and $\beta_{j}\geq 0$ for $j=1,\dots,p$. In order to make our algorithm better suited for these specific applications, we include these linear constraints in the subsequent discussions.

Figure 1: Schematic diagram of the implementation of parallel ADMM algorithm.
Algorithm 1 Parallel ADMM for solving the combined regularization regression in (3.9)
  Input:
    $\bullet$ Central machine: $\mu, M, \lambda_{1}, \lambda_{2}, \bm{b}^{0}, \bm{f}^{0}$.
    $\bullet$ The $m$-th local machine: $\{\bm{X}_{m},\bm{y}_{m}\}_{m=1}^{M}$; $\mu$, $\delta$ (Huber loss), $\tau$ (quantile loss); $\bm{\beta}_{m}^{0}$, $\bm{d}_{m}^{0}$, $\bm{e}_{m}^{0}$.
  Output: the total number of iterations $K$, $\bm{\beta}^{K}$.
  while not converged do
    Central machine:
      1. Receive $\bm{\beta}_{m}^{k}$ and $\bm{e}_{m}^{k}$ transmitted by the $M$ local machines;
      2. Update $\bm{\beta}^{k+1}$ and $\bm{b}^{k+1}$;
      3. Update $\bm{f}^{k+1}\leftarrow\bm{f}^{k}-\mu(\bm{G}\bm{\beta}^{k+1}-\bm{b}^{k+1})$;
      4. Send $\bm{\beta}^{k+1}$ to the local machines.
    Local machines: for $m=1,2,\dots,M$ (in parallel)
      1. Receive $\bm{\beta}^{k+1}$ transmitted by the central machine;
      2. Update $\bm{r}_{m}^{k+1}$ and $\bm{\beta}_{m}^{k+1}$;
      3. Update $\bm{d}_{m}^{k+1}\leftarrow\bm{d}_{m}^{k}-\mu(\bm{X}_{m}\bm{\beta}_{m}^{k+1}+\bm{r}_{m}^{k+1}-\bm{y}_{m})$ and $\bm{e}_{m}^{k+1}\leftarrow\bm{e}_{m}^{k}-\mu(\bm{\beta}_{m}^{k+1}-\bm{\beta}^{k+1})$;
      4. Send $\bm{\beta}^{k+1}_{m}$ and $\bm{e}^{k+1}_{m}$ to the central machine.
  end while
  return solution.

4 The specific implementation of parallel ADMM algorithm

From (3.11) and Algorithm 1, it can be observed that the operations on each local machine do not need to vary with changes in the regularization term. In other words, we only need to modify the operations on the central machine to solve high-dimensional regression models with different combined regularizers. Here, we first discuss the updates of the local machines (in parallel), and then discuss the updates of the central machine separately according to the three types of combined regularization terms. Note that we only need to focus on updating the primal variables, while the updates of the dual variables $\bm{d}_{m}$, $\bm{e}_{m}$ and $\bm{f}$ involve linear algebraic operations and are easy to implement.

For the update of the subproblem $\bm{r}^{k+1}_{m}$ in (3.11), it is evidently an $\mathcal{L}$ proximal operator, $\operatorname{prox}_{\mu,\mathcal{L}}(\bm{y}_{m}+\bm{d}_{m}^{k}/\mu-\bm{X}_{m}\bm{\beta}_{m}^{k})$, and its closed-form solution can be obtained from Table 2,

\[ \bm{r}_{m}^{k+1}\leftarrow\operatorname{prox}_{\mu,\mathcal{L}}(\bm{y}_{m}+\bm{d}_{m}^{k}/\mu-\bm{X}_{m}\bm{\beta}_{m}^{k}),\ m=1,2,\dots,M. \tag{4.1} \]

For the update of the subproblem $\bm{\beta}^{k+1}_{m}$ in (3.11), the minimization problem is quadratic and differentiable, allowing us to solve the subproblem via the following linear equations:

\[ \bm{\beta}_{m}^{k+1}\leftarrow(\bm{X}_{m}^{\top}\bm{X}_{m}+\bm{I}_{p})^{-1}\left[\bm{X}_{m}^{\top}(\bm{y}_{m}+\bm{d}_{m}^{k}/\mu-\bm{r}_{m}^{k+1})+(\bm{\beta}^{k+1}+\bm{e}_{m}^{k}/\mu)\right]. \tag{4.2} \]

When $p$ is larger than $n_{m}$, Yu et al. (2017) suggested using the Woodbury matrix identity $(\bm{X}_{m}^{\top}\bm{X}_{m}+\bm{I}_{p})^{-1}=\bm{I}_{p}-\bm{X}_{m}^{\top}(\bm{I}_{n_{m}}+\bm{X}_{m}\bm{X}_{m}^{\top})^{-1}\bm{X}_{m}$. This approach is very practical when $n_{m}$ is small, because throughout the iterations of the entire algorithm the inverse only needs to be computed once. However, when both $n_{m}$ and $p$ are large, computing the inverse can be time-consuming, and in such cases the conjugate gradient method performs better for solving the equation in (4.2). The conjugate gradient method is an efficient iterative algorithm used to solve symmetric positive definite linear systems, and it has advantages in handling large-scale sparse problems.
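
A minimal R sketch of the local update (4.2) in the Woodbury form is given below; the function name and the idea of caching the Cholesky factor of the small $n_{m}\times n_{m}$ matrix across iterations are ours:

```r
# local-machine update (4.2) via the Woodbury identity:
# (Xm'Xm + I_p)^{-1} v = v - Xm' (I_{nm} + Xm Xm')^{-1} Xm v
update_beta_m <- function(Xm, ym, rm, dm, em, beta, mu, W_chol = NULL) {
  v <- crossprod(Xm, ym + dm / mu - rm) + beta + em / mu
  if (is.null(W_chol)) {
    # the (nm x nm) factor only needs to be computed once and can be reused
    W_chol <- chol(diag(nrow(Xm)) + tcrossprod(Xm))
  }
  drop(v - crossprod(Xm, backsolve(W_chol, forwardsolve(t(W_chol), Xm %*% v))))
}
```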

Next, we will provide a detailed description of the update steps for the central machine for different combined regularization terms.

4.1 Elastic-net

In this section, we describe the updates of each subproblem in the central machine for implementing regression with elastic-net regularization. In this case, in (3.9), $\bm{G}=\bm{I}_{p}$, $P_{\lambda_{1}}(|\bm{\beta}|)=\lambda_{1}\|\bm{\beta}\|_{1}$ and $P_{\lambda_{2}}(|\bm{b}|)=\lambda_{2}\|\bm{b}\|_{2}^{2}$.

For the subproblem of updating $\bm{\beta}^{k+1}$ in (3.11), by rearranging the optimization equation and omitting some constant terms that are not relevant to the optimization target variable $\bm{\beta}$, the following equation can be obtained,

\[ \bm{\beta}^{k+1}\leftarrow\arg\min_{\bm{\beta}}\left\{\lambda_{1}\|\bm{\beta}\|_{1}+\frac{\mu(M+1)}{2}\left\|\bm{\beta}-\frac{M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+(\bm{b}^{k}+\bm{f}^{k}/\mu)}{M+1}\right\|_{2}^{2}\right\}, \tag{4.3} \]

where $\bar{\bm{\beta}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{\beta}_{m}^{k}$ and $\bar{\bm{e}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{e}_{m}^{k}$. As a result, the closed-form solution of the soft-thresholding operator in (4.3) is given as follows:

\[ \bm{\beta}^{k+1}\leftarrow\text{sign}\left(\frac{M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+(\bm{b}^{k}+\bm{f}^{k}/\mu)}{M+1}\right)\odot\left(\left|\frac{M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+(\bm{b}^{k}+\bm{f}^{k}/\mu)}{M+1}\right|-\frac{\lambda_{1}}{\mu(M+1)}\right)_{+}. \tag{4.4} \]

Here, $\odot$ denotes the Hadamard product, $\text{sign}(\cdot)$ represents the sign function, and $(\cdot)_{+}$ denotes the operation of taking the positive part element-wise. Note that when the variable $\bm{\beta}$ needs to satisfy linear constraints ($\bm{\beta}\in\mathcal{C}$), we just need to project $\bm{\beta}^{k+1}$ in (4.4) onto the linear space $\mathcal{C}$. Mathematically, the projection of the vector $\bm{\beta}^{k+1}$ onto the linear constraint space $\mathcal{C}$ can be represented as,

\[ \bm{\beta}_{\mathcal{C}}^{k+1}=\mathrm{Proj}_{\mathcal{C}}(\bm{\beta}^{k+1})=\arg\min_{\bm{\beta}\in\mathcal{C}}\|\bm{\beta}-\bm{\beta}^{k+1}\|_{2}. \tag{4.5} \]

Projection onto convex sets is a well-studied concept. For the applications mentioned in this paper, the projection can be computed analytically; see Section 15.2 of Lange (2013) for several examples. Therefore, when addressing regression problems with linear constraints, we can substitute $\bm{\beta}_{\mathcal{C}}^{k+1}$ for $\bm{\beta}^{k+1}$ in Algorithm 1 to iterate over $\bm{\beta}$.
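
For the two constraint sets mentioned in Section 3.2 (the sum constraint and the non-negativity constraint), the projections in (4.5) have simple closed forms; a minimal R sketch with our own helper names:

```r
# projection onto the affine set { beta : sum(beta) = 1 } (the sum constraint)
proj_sum_to_one <- function(beta) beta - (sum(beta) - 1) / length(beta)

# projection onto the nonnegative orthant { beta : beta_j >= 0 for all j }
proj_nonneg <- function(beta) pmax(beta, 0)

# usage: beta_C <- proj_sum_to_one(beta_new)  # replaces beta^{k+1} in Algorithm 1
```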

For the subproblem of updating $\bm{b}^{k+1}$ in (3.11), the minimization problem is quadratic and differentiable, allowing us to solve the subproblem by

\[ \bm{b}^{k+1}=\frac{\mu(\bm{\beta}^{k+1}-\bm{f}^{k}/\mu)}{2\lambda_{2}+\mu}. \tag{4.6} \]
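
Combining (4.4) and (4.6), a minimal R sketch of the elastic-net central-machine update is given below (beta_m and e_m are collected as p x M matrices; the function name is ours):

```r
# central-machine updates for elastic-net, eqs. (4.4) and (4.6)
central_update_enet <- function(beta_m, e_m, b, f, lambda1, lambda2, mu) {
  M <- ncol(beta_m)
  # argument of the soft-thresholding operator in (4.4)
  u <- (rowSums(beta_m - e_m / mu) + b + f / mu) / (M + 1)
  beta_new <- sign(u) * pmax(abs(u) - lambda1 / (mu * (M + 1)), 0)  # eq. (4.4)
  b_new <- mu * (beta_new - f / mu) / (2 * lambda2 + mu)            # eq. (4.6)
  list(beta = beta_new, b = b_new)
}
```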

4.2 Sparse group lasso

In this section, we describe the updates of each subproblem in the central machine for implementing regression with sparse group lasso regularization. In this case, in (3.9), $\bm{G}=\bm{I}_{p}$, $P_{\lambda_{1}}(|\bm{\beta}|)=\lambda_{1}\|\bm{\beta}\|_{1}$ and $P_{\lambda_{2}}(|\bm{b}|)=\lambda_{2}\|\bm{b}\|_{2,1}$.

For the subproblem of updating $\bm{\beta}^{k+1}$ in (3.11), it follows the same procedure as described in Section 4.1, where it is updated as

\[ \bm{\beta}^{k+1}\leftarrow\text{sign}\left(\frac{M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+(\bm{b}^{k}+\bm{f}^{k}/\mu)}{M+1}\right)\odot\left(\left|\frac{M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+(\bm{b}^{k}+\bm{f}^{k}/\mu)}{M+1}\right|-\frac{\lambda_{1}}{\mu(M+1)}\right)_{+}, \tag{4.7} \]

where $\bar{\bm{\beta}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{\beta}_{m}^{k}$ and $\bar{\bm{e}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{e}_{m}^{k}$. For constrained regression problems, the same strategy as (4.5) can be used.

For the update of the subproblem $\bm{b}^{k+1}$, it is evidently an $\ell_{2,1}$-norm proximal operator, $\operatorname{prox}_{\mu,\lambda_{2}\|\cdot\|_{2,1}}(\bm{\beta}^{k+1}-\bm{f}^{k}/\mu)$, and its closed-form solution is as follows:

\[ \bm{b}_{g}^{k+1}\leftarrow\frac{\bm{\beta}^{k+1}_{g}-\bm{f}^{k}_{g}/\mu}{\|\bm{\beta}^{k+1}_{g}-\bm{f}^{k}_{g}/\mu\|_{2}}\cdot\left(\|\bm{\beta}^{k+1}_{g}-\bm{f}^{k}_{g}/\mu\|_{2}-\lambda_{2}/\mu\right)_{+},\ g=1,2,\dots,G. \tag{4.8} \]

4.3 Sparse fused lasso

In this section, we describe the updates of each subproblem in the central machine for implementing regression with sparse fused lasso regularization. In this case, in (3.9), $\bm{G}=\bm{F}$, $P_{\lambda_{1}}(|\bm{\beta}|)=\lambda_{1}\|\bm{\beta}\|_{1}$ and $P_{\lambda_{2}}(|\bm{b}|)=\lambda_{2}\|\bm{b}\|_{1}$. Due to the presence of the $\bm{F}$ matrix, the update of $\bm{\beta}^{k+1}$ in the central machine differs from that in Section 4.1 and Section 4.2.

For the subproblem of updating $\bm{\beta}^{k+1}$, by rearranging the optimization equation and omitting some constant terms that are not relevant to the optimization target variable $\bm{\beta}$, the following equation can be obtained,

\[ \bm{\beta}^{k+1}\leftarrow\arg\min_{\bm{\beta}}\left\{\lambda_{1}\|\bm{\beta}\|_{1}+\frac{\mu}{2}\bm{\beta}^{\top}(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})\bm{\beta}-\mu\bm{\beta}^{\top}\left[M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+\bm{F}^{\top}(\bm{b}^{k}+\bm{f}^{k}/\mu)\right]\right\}, \tag{4.9} \]

where $\bar{\bm{\beta}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{\beta}_{m}^{k}$ and $\bar{\bm{e}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{e}_{m}^{k}$. Clearly, because the matrix in the quadratic term of $\bm{\beta}$ is not the identity and the objective also contains $\|\bm{\beta}\|_{1}$, equation (4.9) does not have a closed-form solution. Although we could solve (4.9) using numerical methods such as the coordinate descent method, this would significantly increase the computational burden. Here, we suggest using a linearization method to approximate this optimization problem and obtain a closed-form solution for $\bm{\beta}^{k+1}$. Specifically, we propose to add a proximal term to the objective function in (4.9) and update $\bm{\beta}^{k+1}$ using the following formula,

\[ \bm{\beta}^{k+1}\leftarrow\arg\min_{\bm{\beta}}\left\{\lambda_{1}\|\bm{\beta}\|_{1}+\frac{\mu}{2}\bm{\beta}^{\top}(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})\bm{\beta}-\mu\bm{\beta}^{\top}\left[M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+\bm{F}^{\top}(\bm{b}^{k}+\bm{f}^{k}/\mu)\right]+\frac{1}{2}\|\bm{\beta}-\bm{\beta}^{k}\|_{\bm{S}}^{2}\right\}, \tag{4.10} \]

where $\bm{S}=\eta\bm{I}_{p}-(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})$. To ensure that $\bm{S}$ is positive semidefinite, $\eta$ must be at least as large as the largest eigenvalue of $(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})$, which is bounded above by $M+4$. Then, we can set $\eta=M+4$. After rearranging the terms in (4.10) and omitting constant terms, the update of $\bm{\beta}^{k+1}$ can be written as

\[ \bm{\beta}^{k+1}\leftarrow\arg\min_{\bm{\beta}}\left\{\lambda_{1}\|\bm{\beta}\|_{1}+\frac{\eta}{2}\left\|\bm{\beta}-\bm{\beta}^{k}+\frac{\mu}{\eta}\left[(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})\bm{\beta}^{k}-M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)-\bm{F}^{\top}(\bm{b}^{k}+\bm{f}^{k}/\mu)\right]\right\|_{2}^{2}\right\}. \tag{4.11} \]

As a result, the closed-form solution of the soft-thresholding operator in (4.11) is given as follows:

\[
\begin{aligned}
\bm{\beta}^{k+1}\leftarrow\ &\text{sign}\left(\bm{\beta}^{k}-\frac{\mu}{\eta}\left[(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})\bm{\beta}^{k}-M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)-\bm{F}^{\top}(\bm{b}^{k}+\bm{f}^{k}/\mu)\right]\right)\odot\\
&\left(\left|\bm{\beta}^{k}-\frac{\mu}{\eta}\left[(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})\bm{\beta}^{k}-M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)-\bm{F}^{\top}(\bm{b}^{k}+\bm{f}^{k}/\mu)\right]\right|-\frac{\lambda_{1}}{\eta}\right)_{+}.
\end{aligned} \tag{4.12}
\]
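
A minimal R sketch of the linearized update (4.12), applying $\bm{F}$ and $\bm{F}^{\top}$ implicitly through differencing instead of forming the $(p-1)\times p$ matrix (the helper names are ours):

```r
# F %*% beta and t(F) %*% u for the difference matrix F (1 on the diagonal, -1 on the superdiagonal)
F_mul  <- function(beta) beta[-length(beta)] - beta[-1]
Ft_mul <- function(u) c(u, 0) - c(0, u)

# linearized beta-update (4.12) for sparse fused lasso, with eta = M + 4 as in the text
update_beta_sfl <- function(beta_k, beta_m, e_m, b, f, lambda1, mu) {
  M    <- ncol(beta_m)
  eta  <- M + 4
  grad <- (M * beta_k + Ft_mul(F_mul(beta_k))) -
          rowSums(beta_m - e_m / mu) - Ft_mul(b + f / mu)
  v <- beta_k - (mu / eta) * grad
  sign(v) * pmax(abs(v) - lambda1 / eta, 0)
}
```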

For constrained regression problems, we can employ the same strategy as (4.5).

For the update of the subproblem $\bm{b}^{k+1}$, it is also a soft-thresholding operator, $\operatorname{prox}_{\mu,\lambda_{2}\|\cdot\|_{1}}(\bm{F}\bm{\beta}^{k+1}-\bm{f}^{k}/\mu)$, and its closed-form solution is as follows:

\[ \bm{b}^{k+1}\leftarrow\text{sign}\left(\bm{F}\bm{\beta}^{k+1}-\bm{f}^{k}/\mu\right)\odot\left(\left|\bm{F}\bm{\beta}^{k+1}-\bm{f}^{k}/\mu\right|-\lambda_{2}/\mu\right)_{+}. \tag{4.13} \]

4.4 Nonconvex extension

Compared to the combination of convex regularizers, the main difference of the combination with nonconvex regularizers lies in $P_{\lambda_{1}}(|\bm{\beta}|)$. In this paper, we mainly consider two popular nonconvex regularizers, scad and mcp. According to the suggestion in Zou and Li (2008), we can use a unified method named local linear approximation to handle the nonconvex penalty, that is

\[ P_{\lambda_{1}}(|\bm{\beta}|)\approx P_{\lambda_{1}}(|\bm{\beta}^{l}|)+\nabla P_{\lambda_{1}}(|\bm{\beta}^{l}|)^{\top}(|\bm{\beta}|-|\bm{\beta}^{l}|),\ \text{for}\ \bm{\beta}\approx\bm{\beta}^{l}, \tag{4.14} \]

where $\bm{\beta}^{l}$ is the solution from the last iteration, and $\nabla P_{\lambda_{1}}(|\bm{\beta}^{l}|)=(\nabla P_{\lambda}(|\bm{\beta}_{1}^{l}|),\nabla P_{\lambda}(|\bm{\beta}_{2}^{l}|),\dots,\nabla P_{\lambda}(|\bm{\beta}_{p}^{l}|))^{\top}$.

\bullet For scad, we have

\[ \nabla P_{\lambda}(|\bm{\beta}_{j}|)=\begin{cases}\lambda_{1},&\text{if }|\bm{\beta}_{j}|\leq\lambda_{1},\\ \frac{a\lambda_{1}-|\bm{\beta}_{j}|}{a-1},&\text{if }\lambda_{1}<|\bm{\beta}_{j}|<a\lambda_{1},\\ 0,&\text{if }|\bm{\beta}_{j}|\geq a\lambda_{1}.\end{cases} \tag{4.15} \]

\bullet For mcp, we have

\[ \nabla P_{\lambda}(|\bm{\beta}_{j}|)=\begin{cases}\lambda_{1}-\frac{|\bm{\beta}_{j}|}{a},&\text{if }|\bm{\beta}_{j}|\leq a\lambda_{1},\\ 0,&\text{if }|\bm{\beta}_{j}|>a\lambda_{1}.\end{cases} \tag{4.16} \]
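
A minimal R sketch of the LLA weights (4.15)-(4.16), vectorized over the coefficients; the default values a = 3.7 for scad and a = 3 for mcp are conventional choices and are our assumption here:

```r
# derivative of the scad penalty, eq. (4.15), evaluated at |beta_j|
scad_weight <- function(beta, lambda1, a = 3.7) {
  ab <- abs(beta)
  ifelse(ab <= lambda1, lambda1,
         ifelse(ab < a * lambda1, (a * lambda1 - ab) / (a - 1), 0))
}

# derivative of the mcp penalty, eq. (4.16), evaluated at |beta_j|
mcp_weight <- function(beta, lambda1, a = 3) {
  ab <- abs(beta)
  ifelse(ab <= a * lambda1, lambda1 - ab / a, 0)
}
```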

For nonconvex combined regularization, it can be written as

\[ \mathop{\arg\min}\limits_{\bm{\beta}}\left\{\mathcal{L}\left(\bm{y}-\bm{X}\bm{\beta}\right)+P_{\lambda_{1}}(|\bm{\beta}|)+P_{\lambda_{2}}(|\bm{G}\bm{\beta}|)\right\}, \tag{4.17} \]

where $P_{\lambda_{1}}(|\bm{\beta}|)=\text{scad}(|\bm{\beta}|)$ or $\text{mcp}(|\bm{\beta}|)$. By substituting equation (4.14) into equation (4.17), we obtain the following weighted optimization form,

\[ \bm{\beta}^{l+1}\leftarrow\mathop{\arg\min}\limits_{\bm{\beta}}\left\{\mathcal{L}\left(\bm{y}-\bm{X}\bm{\beta}\right)+\sum_{j=1}^{p}\nabla P_{\lambda_{1}}(|\beta_{j}^{l}|)|\beta_{j}|+P_{\lambda_{2}}(|\bm{G}\bm{\beta}|)\right\}. \tag{4.18} \]

Note that we only need to make a small change to solve this weighted combined optimization form using Algorithm 1. This change only requires replacing $\lambda_{1}$ in (4.4), (4.7), and (4.12) with $\nabla P_{\lambda_{1}}(|\bm{\beta}^{l}|)$.

To solve nonconvex regression using the LLA algorithm, it is necessary to find a good initial value. As suggested by Gu2018, we can use the solution of (4.17) with $P_{\lambda_{1}}(|\bm{\beta}|)=\lambda_{1}\|\bm{\beta}\|_{1}$ as the initial value. Then, we obtain the solution of (4.17) by solving a sequence of weighted $\ell_{1}$-penalized problems. The LLA algorithm solves the nonconvex regression via the following iterations:

Algorithm 2 The local linear approximation (LLA) algorithm for combined nonconvex regression
  1. Initialize $\bm{\beta}$ with $\bm{\beta}^{1}$, where $\bm{\beta}^{1}$ is obtained by Algorithm 1.
  2. For $l=1,2,\dots,L$, continue the LLA iteration until convergence is achieved.
    2.1. Compute the weights $\nabla P_{\lambda_{1}}(|\bm{\beta}^{l}|)=(\nabla P_{\lambda}(|\bm{\beta}_{1}^{l}|),\nabla P_{\lambda}(|\bm{\beta}_{2}^{l}|),\dots,\nabla P_{\lambda}(|\bm{\beta}_{p}^{l}|))^{\top}$ by (4.15) or (4.16).
    2.2. Solve the weighted problem in (4.18) by the modified Algorithm 1, with $\lambda_{1}$ in (4.4), (4.7), and (4.12) replaced by $\nabla P_{\lambda_{1}}(|\bm{\beta}^{l}|)$. Denote this solution by $\bm{\beta}^{l+1}$.

It can be seen that the nonconvex combined regularization regression is solved through multiple iterations of convex combined regularization regression. Moreover, Fan et al. (2014) demonstrated that, theoretically, only two or three iterations are sufficient to obtain a solution with high statistical accuracy.
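To make the loop in Algorithm 2 concrete, the following R sketch wires the weight computation to a generic solver of the weighted problem (4.18). The function `weighted_cpadmm()` is a hypothetical placeholder for the modified Algorithm 1 (it is not an exported function of the CPADMM package), and `scad_deriv()` refers to the weight function sketched above.

```r
# A minimal sketch of Algorithm 2 (LLA), assuming weighted_cpadmm(y, X, weights, lambda2)
# returns the minimizer of the weighted problem (4.18) for a given weight vector.
# Fan et al. (2014) suggest that L = 2 or 3 iterations are usually sufficient.
lla_combined <- function(y, X, lambda1, lambda2, deriv = scad_deriv, L = 3) {
  p <- ncol(X)
  # Step 1: initial value from the convex (ell_1-weighted) problem.
  beta <- weighted_cpadmm(y, X, weights = rep(lambda1, p), lambda2)
  for (l in 1:L) {
    w <- deriv(beta, lambda1)                            # Step 2.1: weights via (4.15)/(4.16)
    beta <- weighted_cpadmm(y, X, weights = w, lambda2)  # Step 2.2: solve (4.18)
  }
  beta
}
```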

5 Convergence Analysis and Algorithm Complexity

The traditional ADMM algorithm is commonly used to solve problems with two-block separable objective functions connected by equality constraints; here, a two-block separable objective means that there are two blocks of primal variables to be optimized alternately. For a comprehensive overview of the ADMM algorithm, one may refer to Boyd et al. (2010) and the related literature mentioned therein. It is evident that when there are $M$ local machines, Algorithm 1 involves a total of $2M+2$ primal variables, namely $\left\{\bm{\beta},\bm{b},\{\bm{r}_{m},\bm{\beta}_{m}\}_{m=1}^{M}\right\}$. Thus, the parallel ADMM algorithm belongs to the class of multi-block ADMM algorithms. However, a recent study by Chen et al. (2016) demonstrated that directly extending the ADMM algorithm to convex optimization with three or more separable blocks may not guarantee convergence, and they even provided an example of divergence. Fortunately, they also established a sufficient condition that ensures convergence of the direct extension of the multi-block ADMM algorithm. Their findings suggest that convergence is guaranteed as long as certain constraint coefficient matrices are orthogonal, which allows the iterations of all primal subproblems to be sequentially divided into two independent parts. In the following, we demonstrate that the proposed algorithm can be transformed into a traditional two-block ADMM algorithm.

To simplify notation, here we discuss the convergence of the parallel algorithm in the case of two local machines; the case of multiple local machines ($M>2$) is similar. In this setting, the collected data are divided into two parts,

𝑿=[𝑿1𝑿2]and𝒚=[𝒚1𝒚2].\displaystyle\bm{X}=\begin{bmatrix}\bm{X}_{1}\\ \bm{X}_{2}\end{bmatrix}\ \text{and}\ \bm{y}=\begin{bmatrix}\bm{y}_{1}\\ \bm{y}_{2}\end{bmatrix}. (5.1)

5.1 The Convergence of Algorithm 1

With two local machines ($M=2$), in order to accommodate the parallel structure, we need to introduce the constraints $\{\bm{r}_{m}=\bm{y}_{m}-\bm{X}_{m}\bm{\beta}_{m}\}_{m=1}^{2}$ and $\{\bm{\beta}_{m}=\bm{\beta}\}_{m=1}^{2}$. Revisiting the constrained optimization form in (3.2) and taking into account the linear constraint $\bm{\beta}\in\mathcal{C}$, we obtain the following optimization problem:

argmin𝜷,𝒃,𝒓1,𝒓2,𝜷1,𝜷2{m=12(𝒓m)+Pλ1(|𝜷|)+Pλ2(|𝒃|)+I𝒞(𝜷)},\displaystyle\mathop{\arg\min}\limits_{\bm{\beta},\bm{b},\bm{r}_{1},\bm{r}_{2},\bm{\beta}_{1},\bm{\beta}_{2}}\left\{\sum\limits_{m=1}^{2}\mathcal{L}{{{\left(\bm{r}_{m}\right)}}}+P_{\lambda_{1}}(|\bm{\beta}|)+P_{\lambda_{2}}(|\bm{b}|)+I_{\mathcal{C}}(\bm{\beta})\right\},
s.t.𝑿m𝜷m+𝒓m=𝒚m,𝜷m=𝜷,𝑮𝜷=𝒃,m=1,2,\displaystyle\textbf{s.t.}\ \bm{X}_{m}\bm{\beta}_{m}+\bm{r}_{m}=\bm{y}_{m},\ \bm{\beta}_{m}=\bm{\beta},\ \bm{G}\bm{\beta}=\bm{b},\ m=1,2, (5.2)

where

I𝒞(𝜷)={0,if 𝜷𝒞,,if 𝜷𝒞.I_{\mathcal{C}}(\bm{\beta})=\begin{cases}0,&\text{if }\bm{\beta}\in\mathcal{C},\\ \infty,&\text{if }\bm{\beta}\notin\mathcal{C}.\end{cases}

Let $\theta_{1}(\bm{\beta})=P_{\lambda_{1}}(|\bm{\beta}|)+I_{\mathcal{C}}(\bm{\beta})$, $\theta_{2}(\bm{r}_{1})=\mathcal{L}(\bm{r}_{1})$, $\theta_{3}(\bm{r}_{2})=\mathcal{L}(\bm{r}_{2})$, $\theta_{4}(\bm{\beta}_{1})=0$, $\theta_{5}(\bm{\beta}_{2})=0$, and $\theta_{6}(\bm{b})=P_{\lambda_{2}}(|\bm{b}|)$, where $\mathcal{L}$ only needs to be a convex loss function, including the least squares, quantile, square root, and Huber losses considered in this paper. Note that $P_{\lambda_{1}}(|\bm{\beta}|)$, $P_{\lambda_{2}}(|\bm{b}|)$ and $I_{\mathcal{C}}(\bm{\beta})$ are all convex, even if $P_{\lambda_{1}}(|\bm{\beta}|)$ is the weighted version used for nonconvex regularization. Clearly, the optimization problem (5.2) involves six optimization variables (primal variables in the ADMM algorithm). Therefore, in the case of two local machines, the proposed parallel ADMM algorithm can be viewed as a six-block ADMM algorithm.

Let

𝑰𝑮={𝑰p,if 𝑮=𝑰p,𝑰p1,if 𝑮=𝑭,\displaystyle\bm{I}_{\bm{G}}=\begin{cases}\bm{I}_{p},&\text{{if }}\bm{G}=\bm{I}_{p},\\ \bm{I}_{p-1},&\text{{if }}\bm{G}=\bm{F},\\ \end{cases} (5.3)

and set

𝑨1=[𝑮𝑰p𝑰p𝟎𝟎],𝑨2=[𝟎𝟎𝟎𝑰n𝟎],𝑨3=[𝟎𝟎𝟎𝟎𝑰n],{\bm{A}_{1}}=\begin{bmatrix}\bm{G}\\ -\bm{I}_{p}\\ -\bm{I}_{p}\\ \bm{0}\\ \bm{0}\end{bmatrix},\quad{\bm{A}_{2}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{0}\\ \bm{I}_{n}\\ \bm{0}\end{bmatrix},\quad{\bm{A}_{3}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{0}\\ \bm{0}\\ \bm{I}_{n}\end{bmatrix},
𝑩1=[𝟎𝑰p𝟎𝑿1𝟎],𝑩2=[𝟎𝟎𝑰p𝟎𝑿2],𝑩3=[𝑰𝑮𝟎𝟎𝟎𝟎]and𝒄=[𝟎𝟎𝟎𝒚1𝒚2],\quad{\bm{B}_{1}}=\begin{bmatrix}\bm{0}\\ \bm{I}_{p}\\ \bm{0}\\ \bm{X}_{1}\\ \bm{0}\end{bmatrix},\quad{\bm{B}_{2}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{I}_{p}\\ \bm{0}\\ \bm{X}_{2}\end{bmatrix},\quad{\bm{B}_{3}}=\begin{bmatrix}-\bm{I}_{\bm{G}}\\ \bm{0}\\ \bm{0}\\ \bm{0}\\ \bm{0}\end{bmatrix}\quad\text{and}\quad\bm{c}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{0}\\ \bm{y}_{1}\\ \bm{y}_{2}\end{bmatrix},

and rearrange the constraints in (5.2). Then, the constrained optimization problem can be rewritten as

argmin𝜷,𝒓1,𝒓2,𝜷1,𝜷2,𝒃{θ1(𝜷)+θ2(𝒓1)+θ3(𝒓2)+θ4(𝜷1)+θ5(𝜷2)+θ6(𝒃)},\displaystyle\mathop{\arg\min}\limits_{\bm{\beta},\bm{r}_{1},\bm{r}_{2},\bm{\beta}_{1},\bm{\beta}_{2},\bm{b}}\big{\{}\theta_{1}(\bm{\beta})+\theta_{2}(\bm{r}_{1})+\theta_{3}(\bm{r}_{2})+\theta_{4}(\bm{\beta}_{1})+\theta_{5}(\bm{\beta}_{2})+\theta_{6}(\bm{b})\big{\}},
s.t.𝑨1𝜷+𝑨2𝒓1+𝑨3𝒓2+𝑩1𝜷1+𝑩2𝜷2+𝑩3𝒃=𝒄.\displaystyle\textbf{s.t.}\ \bm{A}_{1}\bm{\beta}+\bm{A}_{2}\bm{r}_{1}+\bm{A}_{3}\bm{r}_{2}+\bm{B}_{1}\bm{\beta}_{1}+\bm{B}_{2}\bm{\beta}_{2}+\bm{B}_{3}\bm{b}=\bm{c}. (5.4)

Note that in (3.11), $\bm{b}^{k+1}$ is updated as the second of all the primal variables, whereas in (5.4) it is listed last. In fact, the update of $\bm{b}$ depends solely on $\bm{\beta}$ and is independent of $\bm{r}_{1}$, $\bm{r}_{2}$, $\bm{\beta}_{1}$ and $\bm{\beta}_{2}$. Therefore, among all the primal-variable updates, $\bm{b}^{k+1}$ only needs to be placed after $\bm{\beta}^{k+1}$; placing it in any other position does not affect the solution of Algorithm 1. This is also reflected in the relationships among the constraint matrices: clearly, $\bm{B}_{3}$ is orthogonal to $\bm{A}_{2},\bm{A}_{3},\bm{B}_{1}$ and $\bm{B}_{2}$.
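The orthogonality relations among the constraint matrices (used here and verified formally below) can also be checked numerically. The following R sketch builds the block matrices for toy dimensions, assuming $M=2$ and $\bm{G}=\bm{F}$; the sizes $p$, $n_1$, $n_2$ are chosen only for illustration.

```r
# Numerically check that A1, A2, A3 are mutually orthogonal, and so are B1, B2, B3.
p <- 4; n1 <- 3; n2 <- 2
Fmat <- diff(diag(p))                    # (p-1) x p first-difference matrix, plays the role of F
G <- Fmat; pG <- nrow(G)
Z <- function(r, c) matrix(0, r, c)      # zero block of the requested size
A1 <- rbind(G, -diag(p), -diag(p), Z(n1, p), Z(n2, p))
A2 <- rbind(Z(pG, n1), Z(p, n1), Z(p, n1), diag(n1), Z(n2, n1))
A3 <- rbind(Z(pG, n2), Z(p, n2), Z(p, n2), Z(n1, n2), diag(n2))
X1 <- matrix(rnorm(n1 * p), n1, p); X2 <- matrix(rnorm(n2 * p), n2, p)
B1 <- rbind(Z(pG, p), diag(p), Z(p, p), X1, Z(n2, p))
B2 <- rbind(Z(pG, p), Z(p, p), diag(p), Z(n1, p), X2)
B3 <- rbind(-diag(pG), Z(p, pG), Z(p, pG), Z(n1, pG), Z(n2, pG))
max(abs(crossprod(A1, A2)), abs(crossprod(A1, A3)), abs(crossprod(A2, A3)))  # 0
max(abs(crossprod(B1, B2)), abs(crossprod(B1, B3)), abs(crossprod(B2, B3)))  # 0
```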

Taking into account the linearization term for $\bm{\beta}$, the augmented Lagrangian form of (5.4) is

θ1(𝜷)\displaystyle\theta_{1}(\bm{\beta}) +θ2(𝒓1)+θ3(𝒓2)+θ4(𝜷1)+θ5(𝜷2)+θ6(𝒃)𝒛(𝑨1𝜷+𝑨2𝒓1+𝑨3𝒓2+𝑩1𝜷1+𝑩2𝜷2+𝑩3𝒃𝒄)\displaystyle+\theta_{2}(\bm{r}_{1})+\theta_{3}(\bm{r}_{2})+\theta_{4}(\bm{\beta}_{1})+\theta_{5}(\bm{\beta}_{2})+\theta_{6}(\bm{b})-\bm{z}^{\top}(\bm{A}_{1}\bm{\beta}+\bm{A}_{2}\bm{r}_{1}+\bm{A}_{3}\bm{r}_{2}+\bm{B}_{1}\bm{\beta}_{1}+\bm{B}_{2}\bm{\beta}_{2}+\bm{B}_{3}\bm{b}-\bm{c})
+μ2𝑨1𝜷+𝑨2𝒓1+𝑨3𝒓2+𝑩1𝜷1+𝑩2𝜷2+𝑩3𝒃𝒄22+12𝜷𝜷k𝑺G2,\displaystyle+\frac{\mu}{2}\|\bm{A}_{1}\bm{\beta}+\bm{A}_{2}\bm{r}_{1}+\bm{A}_{3}\bm{r}_{2}+\bm{B}_{1}\bm{\beta}_{1}+\bm{B}_{2}\bm{\beta}_{2}+\bm{B}_{3}\bm{b}-\bm{c}\|_{2}^{2}+\frac{1}{2}\|\bm{\beta}-\bm{\beta}^{k}\|_{\bm{S}_{G}}^{2}, (5.5)

where 𝒛=(𝒇,𝒅1,𝒅2,𝒆1,𝒆2)\bm{z}=(\bm{f}^{\top},\bm{d}_{1}^{\top},\bm{d}_{2}^{\top},\bm{e}_{1}^{\top},\bm{e}_{2}^{\top})^{\top}, and

𝑺𝑮={𝟎,if 𝑮=𝑰p,𝑺,if 𝑮=𝑭,\displaystyle\bm{S}_{\bm{G}}=\begin{cases}\bm{0},&\text{{if }}\bm{G}=\bm{I}_{p},\\ \bm{S},&\text{{if }}\bm{G}=\bm{F},\\ \end{cases} (5.6)

where $\bm{S}=\eta\bm{I}_{p}-(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})$. According to the first-order optimality conditions of the subproblems obtained by minimizing the augmented Lagrangian (5.5) block by block (that is, the updates in Algorithm 1), we have

{θ1(𝜷)θ1(𝜷k+1)+(𝜷𝜷k+1){𝑨𝟏[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+𝑨3𝒓2k+𝑩1𝜷1k+𝑩2𝜷2k+𝑩3𝒃k𝒄)]+𝑺𝑮(𝜷𝜷k)}0,θ2(𝒓1)θ2(𝒓1k+1)+(𝒓1𝒓1k+1){𝑨𝟐[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+𝑩1𝜷1k+𝑩2𝜷2k+𝑩3𝒃k𝒄)]}0,θ3(𝒓2)θ3(𝒓2k+1)+(𝒓2𝒓2k+1){𝑨𝟑[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩1𝜷1k+𝑩2𝜷2k+𝑩3𝒃k𝒄)]}0,θ4(𝜷1)θ4(𝜷1k+1)+(𝜷1𝜷1k+1){𝑩𝟏[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩1𝜷1k+1+𝑩2𝜷2k+𝑩3𝒃k𝒄)]}0,θ5(𝜷2)θ5(𝜷2k+1)+(𝜷2𝜷2k+1){𝑩𝟐[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩1𝜷1k+1+𝑩2𝜷2k+1+𝑩3𝒃k𝒄)]}0,θ6(𝒃)θ6(𝒃k+1)+(𝒃𝒃2k+1){𝑩𝟑[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩1𝜷1k+1+𝑩2𝜷2k+1+𝑩3𝒃k+1𝒄)]}0.\left\{\begin{array}[]{l}\theta_{1}(\bm{\beta})-\theta_{1}(\bm{\beta}^{k+1})+(\bm{\beta}-\bm{\beta}^{k+1})^{\top}\left\{-\bm{A_{1}}^{\top}[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k}+\bm{A}_{3}\bm{r}_{2}^{k}+\bm{B}_{1}\bm{\beta}_{1}^{k}+\bm{B}_{2}\bm{\beta}_{2}^{k}\right.\\ \left.+\bm{B}_{3}\bm{b}^{k}-\bm{c})]+\bm{S}_{\bm{G}}(\bm{\beta}-\bm{\beta}^{k})\right\}\geq 0,\\ \theta_{2}(\bm{r}_{1})-\theta_{2}(\bm{r}_{1}^{k+1})+(\bm{r}_{1}-\bm{r}_{1}^{k+1})^{\top}\left\{-\bm{A_{2}}^{\top}[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k}+\bm{B}_{1}\bm{\beta}_{1}^{k}+\bm{B}_{2}\bm{\beta}_{2}^{k}\right.\\ \left.+\bm{B}_{3}\bm{b}^{k}-\bm{c})]\right\}\geq 0,\\ \theta_{3}(\bm{r}_{2})-\theta_{3}(\bm{r}_{2}^{k+1})+(\bm{r}_{2}-\bm{r}_{2}^{k+1})^{\top}\left\{-\bm{A_{3}}^{\top}[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k}+\bm{B}_{2}\bm{\beta}_{2}^{k}\right.\\ \left.+\bm{B}_{3}\bm{b}^{k}-\bm{c})]\right\}\geq 0,\\ \theta_{4}(\bm{\beta}_{1})-\theta_{4}(\bm{\beta}_{1}^{k+1})+(\bm{\beta}_{1}-\bm{\beta}_{1}^{k+1})^{\top}\left\{-\bm{B_{1}}^{\top}[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k+1}+\bm{B}_{2}\bm{\beta}_{2}^{k}\right.\\ \left.+\bm{B}_{3}\bm{b}^{k}-\bm{c})]\right\}\geq 0,\\ \theta_{5}(\bm{\beta}_{2})-\theta_{5}(\bm{\beta}_{2}^{k+1})+(\bm{\beta}_{2}-\bm{\beta}_{2}^{k+1})^{\top}\left\{-\bm{B_{2}}^{\top}[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k+1}+\bm{B}_{2}\bm{\beta}_{2}^{k+1}\right.\\ \left.+\bm{B}_{3}\bm{b}^{k}-\bm{c})]\right\}\geq 0,\\ \theta_{6}(\bm{b})-\theta_{6}(\bm{b}^{k+1})+(\bm{b}-\bm{b}_{2}^{k+1})^{\top}\left\{-\bm{B_{3}}^{\top}[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k+1}+\bm{B}_{2}\bm{\beta}_{2}^{k+1}\right.\\ \left.+\bm{B}_{3}\bm{b}^{k+1}-\bm{c})]\right\}\geq 0.\end{array}\right. (5.7)

It is easy to verify that

𝑨𝟏𝑨𝟐=𝟎,𝑨𝟏𝑨𝟑=𝟎and𝑨𝟐𝑨𝟑=𝟎,\displaystyle\bm{A_{1}}^{\top}\bm{A_{2}}=\bm{0},\ \bm{A_{1}}^{\top}\bm{A_{3}}=\bm{0}\ \text{and}\ \bm{A_{2}}^{\top}\bm{A_{3}}=\bm{0},
𝑩𝟏𝑩𝟐=𝟎,𝑩𝟏𝑩𝟑=𝟎and𝑩𝟐𝑩𝟑=𝟎.\displaystyle\bm{B_{1}}^{\top}\bm{B_{2}}=\bm{0},\ \bm{B_{1}}^{\top}\bm{B_{3}}=\bm{0}\ \text{and}\ \bm{B_{2}}^{\top}\bm{B_{3}}=\bm{0}.

Together with (5.7), we have

{θ1(𝜷)θ1(𝜷k+1)+(𝜷𝜷k+1){𝑨𝟏[𝒛kμ(𝑨1𝜷k+1+𝑩1𝜷1k+𝑩2𝜷2k+𝑩3𝒃k𝒄)]+𝑺𝑮(𝜷𝜷k)}0,θ2(𝒓1)θ2(𝒓1k+1)+(𝒓1𝒓1k+1){𝑨𝟐[𝒛kμ(𝑨2𝒓1k+1+𝑩1𝜷1k+𝑩2𝜷2k+𝑩3𝒃k𝒄)]}0,θ3(𝒓2)θ3(𝒓2k+1)+(𝒓2𝒓2k+1){𝑨𝟑[𝒛kμ(𝑨3𝒓2k+1+𝑩1𝜷1k+𝑩2𝜷2k+𝑩3𝒃k𝒄)]}0,θ4(𝜷1)θ4(𝜷1k+1)+(𝜷1𝜷1k+1){𝑩𝟏[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩1𝜷1k+1𝒄)]}0,θ5(𝜷2)θ5(𝜷2k+1)+(𝜷2𝜷2k+1){𝑩𝟐[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩2𝜷2k+1𝒄)]}0,θ6(𝒃)θ6(𝒃k+1)+(𝒃𝒃2k+1){𝑩𝟑[𝒛kμ(𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩3𝒃k+1𝒄)]}0,\left\{\begin{array}[]{l}\theta_{1}(\bm{\beta})-\theta_{1}(\bm{\beta}^{k+1})+(\bm{\beta}-\bm{\beta}^{k+1})^{\top}\left\{-\bm{A_{1}}^{\top}\left[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k}+\bm{B}_{2}\bm{\beta}_{2}^{k}+\bm{B}_{3}\bm{b}^{k}-\bm{c})\right]+\bm{S}_{\bm{G}}(\bm{\beta}-\bm{\beta}^{k})\right\}\geq 0,\\ \theta_{2}(\bm{r}_{1})-\theta_{2}(\bm{r}_{1}^{k+1})+(\bm{r}_{1}-\bm{r}_{1}^{k+1})^{\top}\left\{-\bm{A_{2}}^{\top}\left[\bm{z}^{k}-\mu(\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k}+\bm{B}_{2}\bm{\beta}_{2}^{k}+\bm{B}_{3}\bm{b}^{k}-\bm{c})\right]\right\}\geq 0,\\ \theta_{3}(\bm{r}_{2})-\theta_{3}(\bm{r}_{2}^{k+1})+(\bm{r}_{2}-\bm{r}_{2}^{k+1})^{\top}\left\{-\bm{A_{3}}^{\top}\left[\bm{z}^{k}-\mu(\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k}+\bm{B}_{2}\bm{\beta}_{2}^{k}+\bm{B}_{3}\bm{b}^{k}-\bm{c})\right]\right\}\geq 0,\\ \theta_{4}(\bm{\beta}_{1})-\theta_{4}(\bm{\beta}_{1}^{k+1})+(\bm{\beta}_{1}-\bm{\beta}_{1}^{k+1})^{\top}\left\{-\bm{B_{1}}^{\top}\left[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}^{k+1}-\bm{c})\right]\right\}\geq 0,\\ \theta_{5}(\bm{\beta}_{2})-\theta_{5}(\bm{\beta}_{2}^{k+1})+(\bm{\beta}_{2}-\bm{\beta}_{2}^{k+1})^{\top}\left\{-\bm{B_{2}}^{\top}\left[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{2}\bm{\beta}_{2}^{k+1}-\bm{c})\right]\right\}\geq 0,\\ \theta_{6}(\bm{b})-\theta_{6}(\bm{b}^{k+1})+(\bm{b}-\bm{b}_{2}^{k+1})^{\top}\left\{-\bm{B_{3}}^{\top}\left[\bm{z}^{k}-\mu(\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{3}\bm{b}^{k+1}-\bm{c})\right]\right\}\geq 0,\end{array}\right. (5.8)

which is also the first-order optimality condition of the scheme

{(𝜷k+1,𝒓1k+1,𝒓2k+1)=argmin𝜷,𝒓1,𝒓2{θ1(𝜷)+θ2(𝒓1)+θ3(𝒓2)(𝒛k)(𝑨𝟏𝜷+𝑨2𝒓1+𝑨3𝒓2)+μ2𝑨1𝜷+𝑨2𝒓1+𝑨3𝒓2+𝑩1𝜷1k+𝑩2𝜷2k+𝑩3𝒃k𝒄22+12𝜷𝜷k𝑺𝑮2},(𝜷1k+1,𝜷2k+1,𝒃k+1)=argmin𝜷1,𝜷2,𝒃{θ4(𝜷1)+θ5(𝜷2)+θ6(𝒃)(𝒛k)(𝑨𝟒𝜷1+𝑨𝟓𝜷2+𝑨𝟔𝒃)+μ2𝑨1𝜷k+1+𝑨2𝒓1k+1+𝑨3𝒓2k+1+𝑩1𝜷1+𝑩2𝜷2+𝑩3𝒃𝒄22}.\left\{\begin{array}[]{l}(\bm{\beta}^{k+1},\bm{r}_{1}^{k+1},\bm{r}_{2}^{k+1})=\mathop{\arg\min}\limits_{\bm{\beta},\bm{r}_{1},\bm{r}_{2}}\left\{\theta_{1}(\bm{\beta})+\theta_{2}(\bm{r}_{1})+\theta_{3}(\bm{r}_{2})-(\bm{z}^{k})^{\top}(\bm{A_{1}}\bm{\beta}+\bm{A}_{2}\bm{r}_{1}+\bm{A}_{3}\bm{r}_{2})\right.\\ \left.+\frac{\mu}{2}\|\bm{A}_{1}\bm{\beta}+\bm{A}_{2}\bm{r}_{1}+\bm{A}_{3}\bm{r}_{2}+\bm{B}_{1}\bm{\beta}_{1}^{k}+\bm{B}_{2}\bm{\beta}_{2}^{k}+\bm{B}_{3}\bm{b}^{k}-\bm{c}\|_{2}^{2}+\frac{1}{2}\|\bm{\beta}-\bm{\beta}^{k}\|_{\bm{S}_{\bm{G}}}^{2}\right\},\\ (\bm{\beta}_{1}^{k+1},\bm{\beta}_{2}^{k+1},\bm{b}^{k+1})=\mathop{\arg\min}\limits_{\bm{\beta}_{1},\bm{\beta}_{2},\bm{b}}\left\{\theta_{4}(\bm{\beta}_{1})+\theta_{5}(\bm{\beta}_{2})+\theta_{6}(\bm{b})-(\bm{z}^{k})^{\top}(\bm{A_{4}}\bm{\beta}_{1}+\bm{A_{5}}\bm{\beta}_{2}+\bm{A_{6}}\bm{b})\right.\\ \left.+\frac{\mu}{2}\|\bm{A}_{1}\bm{\beta}^{k+1}+\bm{A}_{2}\bm{r}_{1}^{k+1}+\bm{A}_{3}\bm{r}_{2}^{k+1}+\bm{B}_{1}\bm{\beta}_{1}+\bm{B}_{2}\bm{\beta}_{2}+\bm{B}_{3}\bm{b}-\bm{c}\|_{2}^{2}\right\}.\end{array}\right. (5.9)

Then, we can consider (𝜷,𝒓1,𝒓2)(\bm{\beta}^{\top},\bm{r}_{1}^{\top},\bm{r}_{2}^{\top})^{\top} as a variable 𝒗1\bm{v}_{1}, θ1(𝜷)+θ2(𝒓1)+θ3(𝒓2)\theta_{1}(\bm{\beta})+\theta_{2}(\bm{r}_{1})+\theta_{3}(\bm{r}_{2}) as a function 𝜽1(𝒗1){\bm{\theta}}_{1}(\bm{v}_{1}), [𝑨1,𝑨2,𝑨3][\bm{A}_{1},\bm{A}_{2},\bm{A}_{3}] as a matrix 𝑨\bm{A}; and (𝜷1,𝜷2,𝒃)(\bm{\beta}_{1}^{\top},\bm{\beta}_{2}^{\top},\bm{b}^{\top})^{\top} as the other variable 𝒗2\bm{v}_{2}, θ4(𝜷1)+θ5(𝜷2)+θ6(𝒃)\theta_{4}(\bm{\beta}_{1})+\theta_{5}(\bm{\beta}_{2})+\theta_{6}(\bm{b}) as the other function 𝜽2(𝒗2){\bm{\theta}}_{2}(\bm{v}_{2}), [𝑩1,𝑩2,𝑩3][\bm{B}_{1},\bm{B}_{2},\bm{B}_{3}] as the other matrix 𝑩\bm{B}. Thus, the update of the primal variables in (5.9) can be rewritten as

{𝒗1k+1=argmin𝒗1{𝜽1(𝒗1)(𝒛k)𝑨𝒗1+μ2𝑨𝒗1+𝑩𝒗2k𝒄22+12𝜷𝜷k𝑺𝑮2},𝒗2k+1=argmin𝒗2{𝜽2(𝒗2)(𝒛k)𝑩𝒗2+μ2𝑨𝒗1k+1+𝑩𝒗2𝒄22}.\left\{\begin{array}[]{l}\bm{v}_{1}^{k+1}=\mathop{\arg\min}\limits_{\bm{v}_{1}}\left\{{\bm{\theta}}_{1}(\bm{v}_{1})-(\bm{z}^{k})^{\top}\bm{A}\bm{v}_{1}+\frac{\mu}{2}\|\bm{A}\bm{v}_{1}+\bm{B}\bm{v}_{2}^{k}-\bm{c}\|_{2}^{2}+\frac{1}{2}\|\bm{\beta}-\bm{\beta}^{k}\|_{\bm{S}_{\bm{G}}}^{2}\right\},\\ \bm{v}_{2}^{k+1}=\mathop{\arg\min}\limits_{\bm{v}_{2}}\left\{{\bm{\theta}}_{2}(\bm{v}_{2})-(\bm{z}^{k})^{\top}\bm{B}\bm{v}_{2}+\frac{\mu}{2}\|\bm{A}\bm{v}_{1}^{k+1}+\bm{B}\bm{v}_{2}-\bm{c}\|_{2}^{2}\right\}.\end{array}\right. (5.10)

Obviously, if we incorporate the update of the dual variable $\bm{z}^{k+1}=\bm{z}^{k}-\mu(\bm{A}\bm{v}_{1}^{k+1}+\bm{B}\bm{v}_{2}^{k+1}-\bm{c})$ into (5.10), it becomes a two-block ADMM iteration. Note that when $M>2$, we can still transform Algorithm 1 into such a partially linearized two-block ADMM algorithm by defining $\bm{v}_{1}=(\bm{\beta}^{\top},\bm{r}_{1}^{\top},\bm{r}_{2}^{\top},\dots,\bm{r}_{M}^{\top})^{\top}$ and $\bm{v}_{2}=(\bm{\beta}_{1}^{\top},\bm{\beta}_{2}^{\top},\dots,\bm{\beta}_{M}^{\top},\bm{b}^{\top})^{\top}$, and making the corresponding changes to $\bm{\theta}_{1}(\bm{v}_{1})$, $\bm{\theta}_{2}(\bm{v}_{2})$, $\bm{A}$ and $\bm{B}$; the mathematical derivation of this result is included in Appendix A. However, although the iteration in (5.10) can be viewed as a two-block ADMM algorithm, only a portion of $\bm{v}_{1}$, namely $\bm{\beta}$, is linearized. Therefore, the existing convergence results in He and Yuan (2012, 2015) and Yuan et al. (2020) cannot be directly applied to Algorithm 1.

Nevertheless, we can still follow a framework similar to that of He and Yuan (2012, 2015) and Yuan et al. (2020) for the convergence analysis. The convergence property and convergence rate of Algorithm 1 are given in the following theorem; a detailed proof is provided in Appendix A.

Theorem 1

Let $\tilde{\bm{w}}^{k}=\left(\bm{\beta}^{k},\{\bm{r}_{m}^{k}\}_{m=1}^{M},\{\bm{\beta}_{m}^{k}\}_{m=1}^{M},\bm{b}^{k},\{\bm{d}_{m}^{k}\}_{m=1}^{M},\{\bm{e}_{m}^{k}\}_{m=1}^{M},\bm{f}^{k}\right)$ be the sequence generated by Algorithm 1 from an initial feasible solution $\tilde{\bm{w}}^{0}$. The sequence $\tilde{\bm{v}}^{k}=\left(\bm{\beta}^{k},\{\bm{\beta}_{m}^{k}\}_{m=1}^{M},\bm{b}^{k},\{\bm{d}_{m}^{k}\}_{m=1}^{M},\{\bm{e}_{m}^{k}\}_{m=1}^{M},\bm{f}^{k}\right)$ converges to $\tilde{\bm{v}}^{*}=\left(\bm{\beta}^{*},\{\bm{\beta}_{m}^{*}\}_{m=1}^{M},\bm{b}^{*},\{\bm{d}_{m}^{*}\}_{m=1}^{M},\{\bm{e}_{m}^{*}\}_{m=1}^{M},\bm{f}^{*}\right)$, where $\tilde{\bm{v}}^{*}$ is an optimal (primal-dual) solution point of (5.2). The $O(1/k)$ convergence rate in a non-ergodic sense can also be obtained, i.e.,

𝒗~k𝒗~k+1𝑯21k+1𝒗~0𝒗~𝑯2,\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1}\|_{\bm{H}}^{2}\leq\small{\frac{1}{k+1}}\|\tilde{\bm{v}}^{0}-\tilde{\bm{v}}^{*}\|_{\bm{H}}^{2}, (5.11)

where 𝐇=[𝐒𝐆𝟎𝟎𝟎μ𝐁𝐁𝟎𝟎𝟎1μ𝐈n𝐆+(M+1)p]\bm{H}=\begin{bmatrix}\bm{S}_{\bm{G}}&\bm{0}&\bm{0}\\ \bm{0}&\mu\bm{B}^{\top}\bm{B}&\bm{0}\\ \bm{0}&\bm{0}&\frac{1}{\mu}\bm{I}_{n_{\bm{G}}+(M+1)p}\end{bmatrix} is a symmetric and positive semidefinite matrix, and n𝐆=nn_{\bm{G}}=n if 𝐆=𝐈p\bm{G}=\bm{I}_{p}, and otherwise n𝐆=n1n_{\bm{G}}=n-1.

Remark 1

Note that when 𝐆=𝐈p\bm{G}=\bm{I}_{p}, 𝐒𝐆=𝟎\bm{S}_{\bm{G}}=\bm{0}. The convergent sequence is ({𝛃m}m=1M,𝐛,{𝐝m}m=1M,{𝐞m}m=1M,𝐟).\left(\{\bm{\beta}_{m}\}_{m=1}^{M},\bm{b},\{\bm{d}_{m}\}_{m=1}^{M},\{\bm{e}_{m}\}_{m=1}^{M},\bm{f}\right). Recalling the 𝛃\bm{\beta}-subproblem and 𝐫m\bm{r}_{m}-subproblem in Algorithm 1, it becomes evident that as the sequence({𝛃mk}m=1M,𝐛k,\left(\{\bm{\beta}_{m}^{k}\}_{m=1}^{M},\bm{b}^{k},\right. {𝐝mk}m=1M,{𝐞mk}m=1M,𝐟k)\left.\{\bm{d}_{m}^{k}\}_{m=1}^{M},\{\bm{e}^{k}_{m}\}_{m=1}^{M},\bm{f}^{k}\right) converges, both 𝛃k\bm{\beta}^{k} and 𝐫mk\bm{r}_{m}^{k} converge to their respective optimal values 𝛃\bm{\beta}^{*} and 𝐫m\bm{r}_{m}^{*}. Note also that the convergence of our algorithm does not require specifying the form of the loss function, which opens up possibilities for further extensions of our algorithm.

Remark 2

Note that when $\bm{G}=\bm{F}$, $\bm{S}_{\bm{G}}=\bm{S}$. The convergent sequence is $\tilde{\bm{v}}=\left(\bm{\beta},\{\bm{\beta}_{m}\}_{m=1}^{M},\bm{b},\{\bm{d}_{m}\}_{m=1}^{M},\{\bm{e}_{m}\}_{m=1}^{M},\bm{f}\right)$. Recalling the $\bm{r}_{m}$-subproblem in Algorithm 1, it becomes evident that as the sequence $\tilde{\bm{v}}^{k}$ converges, $\bm{r}_{m}^{k}$ converges to the respective optimal value $\bm{r}_{m}^{*}$. Moreover, $\|\tilde{\bm{v}}^{0}-\tilde{\bm{v}}^{*}\|_{\bm{H}}^{2}$ is of the order $O(1)$, and as a consequence, $\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1}\|_{\bm{H}}^{2}=O(1/k)$. This means that Algorithm 1 has a linear convergence rate.

5.2 Analysis of Algorithm Complexity

We first discuss the computational cost that the central machine needs to carry. Reviewing Algorithm 1, we know that the central machine only needs to update $\bm{\beta}^{k+1}$, $\bm{b}^{k+1}$, and $\bm{f}^{k+1}$ in each ADMM iteration. From the update of $\bm{\beta}^{k+1}$ in Section 4.1, we can observe that the three convex combined regularizations are all handled through the soft-thresholding operator in (2.8). The computational cost of the soft-thresholding in (4.4), (4.7) and (4.12) is $O(p)$ flops. Note that (4.4) and (4.7) are the same, and the overall cost of forming $\left(\frac{M(\bar{\bm{\beta}}^{k}-\bar{\bm{e}}^{k}/\mu)+(\bm{b}^{k}+\bm{f}^{k}/\mu)}{M+1}\right)$ is $O(Mp)$ flops, where $\bar{\bm{\beta}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{\beta}_{m}^{k}$ and $\bar{\bm{e}}^{k}=M^{-1}\sum_{m=1}^{M}\bm{e}_{m}^{k}$. Thus, the computational cost of updating $\bm{\beta}^{k+1}$ for elastic-net and sparse group regression is $O(Mp+p)=O(Mp)$. The soft-thresholding operator in (4.12) needs the vector $\left(\bm{\beta}^{k}-{\mu}/{\eta}\left[(M\bm{I}_{p}+\bm{F}^{\top}\bm{F})\bm{\beta}^{k}-M(\bar{\bm{\beta}}^{k}-{\bar{\bm{e}}^{k}}/{\mu})-\bm{F}^{\top}(\bm{b}^{k}+{\bm{f}^{k}}/{\mu})\right]\right)$. In fact, the matrix $\bm{F}^{\top}\bm{F}$ does not need to be computed explicitly; its entries can be written out directly owing to its special structure. The main computational cost in generating this vector then lies in the matrix-vector multiplications inside it, whose complexity is $O(p^{2})$. Hence, the computational cost of updating $\bm{\beta}^{k+1}$ in (4.12) for sparse fused regression is $O(Mp+p^{2}+p)=O(p^{2})$ $(p\gg M)$ flops.
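As a concrete reminder of why these updates are cheap, the (group) soft-thresholding operations act componentwise or groupwise and therefore cost $O(p)$ flops. A minimal R sketch, assuming the standard elementwise and groupwise soft-thresholding forms (not taken from the package source):

```r
# Elementwise soft-thresholding (proximal operator of kappa * ||.||_1); O(p) flops.
soft_threshold <- function(v, kappa) sign(v) * pmax(abs(v) - kappa, 0)

# Groupwise soft-thresholding (proximal operator of kappa * ||.||_2 for one group);
# applied group by group, the total cost is still O(p).
group_soft_threshold <- function(v, kappa) max(0, 1 - kappa / sqrt(sum(v^2))) * v
```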

For the central machine in each iteration of the ADMM algorithm, the computation required for 𝒃k+1\bm{b}^{k+1} in Section 4.1 consists of a linear algebra operation for the elastic net (O(p)O(p) flops), a group soft-thresholding operator for the sparse group lasso (O(p)O(p) flops), and a soft-thresholding operator for the sparse fused lasso (O(p2p)=O(p2)O(p^{2}-p)=O(p^{2}) flops). The cost of updating the dual variable 𝒇k+1\bm{f}^{k+1} on the central machine is O(np)O(np) (elastic net and sparse group lasso) or O(p2)O(p^{2}) (sparse fused lasso) flops. Therefore, the total computational cost for the central machine in the elastic net and sparse group regression is O(Mp)+O(p)+O(p)=O(Mp)O(Mp)+O(p)+O(p)=O(Mp) flops, while the total computational cost for the central machine in the sparse fused lasso is O(p2)+O(Mp)+O(p2)=O(p2)O(p^{2})+O(Mp)+O(p^{2})=O(p^{2}) flops.

Next, let us discuss the computational cost of each iteration when implementing Algorithm 1 on each local machine. Based on the discussion in Section 3, we can observe that the operations for each local machine remain the same for implementing the three convex combined regularization regressions. To update 𝒓k+1\bm{r}^{k+1} in (4.1), we use the \mathcal{L} proximal operator. The cost of obtaining the closed-form solution of the \mathcal{L} proximal operator is O(nm)O(n_{m}) flops. Then, the cost of updating 𝒓k+1\bm{r}^{k+1} depends on the matrix-vector multiplication (O(nmp)O(n_{m}p)) within it. Therefore, the computational cost of updating 𝒓k+1\bm{r}^{k+1} is O(nm+nmp)=O(nmp)O(n_{m}+n_{m}p)=O(n_{m}p) flops.

Before the ADMM iterations begin, each local machine needs to compute the inverse of the matrix $\bm{X}_{m}^{\top}\bm{X}_{m}+\bm{I}_{p}$ once. The computational cost is $O(n_{m}p^{2})$ flops if $n_{m}>p$. However, when $p>n_{m}$, we can use the Woodbury matrix identity to compute the inverse, which reduces the cost to $O(n_{m}^{2}p)$. Once the inverse of the matrix is obtained, the cost of obtaining $\bm{\beta}_{m}^{k+1}$ in (4.2) is $\max\left(O(n_{m}p),O(p^{2})\right)$ flops, which depends on the matrix-vector multiplications involved in the calculation. The costs of updating the dual variables $\bm{d}_{m}^{k+1}$ and $\bm{e}_{m}^{k+1}$ on each local machine are $O(n_{m}p)$ and $O(p)$ flops, respectively. Thus, the total computational cost for each local machine is

{O(nmp2)+O(nmp)×KIf nm>p,O(nm2p)+O(p2)×Kotherwise,\begin{cases}O(n_{m}p^{2})+O(n_{m}p)\times K&\text{If }n_{m}>p,\\ O(n_{m}^{2}p)+O(p^{2})\times K&\text{otherwise},\end{cases} (5.12)

where KK is the total number of iterations in the ADMM algorithm.
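For the case $p>n_{m}$, the Woodbury step mentioned above can be sketched as follows in R; it rests on the identity $(\bm{X}_{m}^{\top}\bm{X}_{m}+\bm{I}_{p})^{-1}=\bm{I}_{p}-\bm{X}_{m}^{\top}(\bm{I}_{n_{m}}+\bm{X}_{m}\bm{X}_{m}^{\top})^{-1}\bm{X}_{m}$, which replaces a $p\times p$ inversion by an $n_{m}\times n_{m}$ one.

```r
# One-time inverse of X^T X + I_p via the Woodbury identity (useful when p > n_m).
woodbury_inverse <- function(X) {
  nm <- nrow(X); p <- ncol(X)
  small <- solve(diag(nm) + tcrossprod(X))     # (n_m x n_m) system: I + X X^T
  diag(p) - crossprod(X, small %*% X)          # I_p - X^T (I + X X^T)^{-1} X
}
X <- matrix(rnorm(20 * 100), 20, 100)          # toy local block with n_m = 20 < p = 100
max(abs(woodbury_inverse(X) %*% (crossprod(X) + diag(100)) - diag(100)))  # ~ 1e-12
```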

In the practical implementation of the algorithm, it is common to have nm>>Mn_{m}>>M and p>>Mp>>M. As a consequence, the primary factor impacting the overall cost of the algorithm is not the expense of the central machine but rather the cost associated with operating the local machine, which handles the largest quantity of samples. Let nmax=max{nm}m=1Mn_{\max}=\max\{n_{m}\}_{m=1}^{M}, then the total complexity of Algorithm 1 (in parallel) for solving the three types of convex combined regularized regressions is given by

{O(nmaxp2)+O(nmaxp)×KIf nmax>p,O(nmax2p)+O(p2)×Kotherwise.\begin{cases}O(n_{\max}p^{2})+O(n_{\max}p)\times K&\text{If }n_{\max}>p,\\ O(n_{\max}^{2}p)+O(p^{2})\times K&\text{otherwise}.\end{cases} (5.13)

The above complexity analysis indicates that the computational cost of handling the three different combined regularization regressions is similar. This is noteworthy because the computation of the sparse fused lasso is usually significantly more involved than that of the other two combined regularizations; treating it at a comparable cost is in fact one of the main advantages of our parallel algorithm.

Furthermore, this algorithm has a complexity comparable to that of the ADMM algorithm in Wu et al. (2023) when $M=1$, and comparable to the ADMM algorithms proposed in Yu et al. (2017) and Fan et al. (2020) when $M\geq 2$.

6 Numerical results

In this section, we use synthetic data and real data cases to demonstrate the accuracy, stability, and scalability of the proposed ADMM algorithms.

6.1 Synthetic data

In this subsection, we synthetically generate regression datasets suited to the three combined regularizations, and solve them using the proposed ADMM algorithm in both parallel and non-parallel computing frameworks. In addition, we compare the performance of our algorithm with other state-of-the-art methods in terms of estimation accuracy and computational time.

6.1.1 Elastic-net

In the first study, we utilize a well-known simulation model proposed by Friedman et al. (2010) and Gu et al. (2017) to generate data for conducting a comparison of elastic-net regression. We simulate data with nn observations from the linear model

𝒚=j=1p𝒙jβj+σϵ,\displaystyle\bm{y}=\sum_{j=1}^{p}\bm{x}_{j}\beta_{j}^{*}+\sigma\bm{\epsilon}, (6.1)

where $(\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{p})^{\top}\sim N(\bm{0}_{p},\bm{\Sigma})$ with $\bm{\Sigma}=\left(\rho+(1-\rho)\bm{1}_{\{i=j\}}\right)_{p\times p}$; $\beta_{j}^{*}=(-1)^{j}\exp\{-(2j-1)/20\}$; $\bm{\epsilon}\sim N(0,1)$; and the parameter $\sigma$ is selected so that the signal-to-noise ratio of the data equals 1. For our simulation, we focus on two scenarios: high-dimensional data on a single machine (where $n=720,p=2560$), and massive data on multiple machines (where $n=720000,p=1280$), with the correlation set to $\rho=0.5$. A schematic diagram of the true coefficients is shown in Figure 2.

Figure 2: The true coefficient for elastic-net regressions.
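A minimal R sketch of this data-generating process is given below. It uses the equivalent one-factor construction of the equicorrelated design and the convention that the signal-to-noise ratio is $\mathrm{var}(\bm{X}\bm{\beta}^{*})/\sigma^{2}$; both are our assumptions for illustration rather than details stated in the text.

```r
# Generate (y, X) from model (6.1) with rho = 0.5 and SNR = 1 (single-machine scenario).
set.seed(1)
n <- 720; p <- 2560; rho <- 0.5
beta_star <- (-1)^(1:p) * exp(-(2 * (1:p) - 1) / 20)
z <- rnorm(n)                                   # shared factor: cov(x_j, x_k) = rho for j != k
X <- sqrt(rho) * z %o% rep(1, p) + sqrt(1 - rho) * matrix(rnorm(n * p), n, p)
signal <- drop(X %*% beta_star)
sigma <- sd(signal)                             # makes the empirical SNR equal to 1
y <- signal + sigma * rnorm(n)
```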

First, we apply our CPADMM algorithm to solve four elastic-net regressions: LS-Enet (least squares loss plus elastic-net), Q-Enet (quantile loss plus elastic-net), SR-Enet (square root loss plus elastic-net), and H-Enet (Huber loss plus elastic-net). Figure 3 visualizes the coefficient estimates of $\{\beta_{j}^{*}\}_{j=1}^{p}$ obtained from these elastic-net regression models. The displayed result comes from one randomly chosen replication; it indicates that the four elastic-net regression models all perform well under normal errors, and that CPADMM is an effective and accurate algorithm for solving them.

Figure 3: The estimation coefficients by CPADMM for the four types of elastic-net regression.

Next, we compare the CPADMM algorithm with some state-of-the-art algorithms for fitting the regression data in (6.1). The above four elastic-net regression models all have efficient solvers: glmnet in Friedman et al. (2010) for LS-Enet, FHDQR in Gu et al. (2017) for Q-Enet, cyclic coordinate-wise descent (CCD, Raninen and Ollila (2017)) for SR-Enet, and hqreg in Yi and Huang (2017) for H-Enet. Each simulation is repeated 100 times, and the average results for nonparallel computations (single machine) are presented in Table 3. The results in Table 3 indicate that although our algorithm is at a disadvantage compared to existing algorithms in terms of computation time, it has certain advantages in estimation accuracy and prediction performance.

Sometimes it is necessary to set $M>1$, for example when the data are too large to be stored or processed by a single computer, or can only be stored in a distributed manner. Due to the current lack of research on parallel algorithms for massive-data regression with combined regularizations, this paper only demonstrates the performance of our own algorithm in the parallel computing environment. We present the computation time and coefficient estimation accuracy for $M=1,10$, and $100$ in Figure 4. The results in Figure 4 indicate that parallel computation can effectively save time on large-scale data, but an excessive number of local machines can also slow down the convergence of the algorithm.

Table 3: Comparison of CPADMM and four algorithms for calculating four elastic-net regressions
  LS-Enet Q-Enet SR-Enet H-Enet
  Algorithm glmnet CPADMM FHDQR CPADMM CCD CPADMM hqreg CPADMM
AAE(×102\times 10^{-2}) 0.284(0.011) 0.251(0.010) 0.415(0.019) 0.379(0.016) 0.467(0.023) 0.389(0.015) 0.385(0.017) 0.282(0.012)
ASE(×102\times 10^{-2}) 0.875(0.066) 0.753(0.059) 0.999(0.072) 0.951(0.065) 1.236(0.064) 0.962(0.058) 1.010(0.061) 0.843(0.052)
AAP 0.155(0.012) 0.131(0.008) 0.022(0.003) 0.009(0.001) 0.024(0.005) 0.005(0.001) 0.047(0.009) 0.133(0.015)
ASP 0.183(0.013) 0.173(0.011) 0.104(0.007) 0.025(0.003) 0.013(0.002) 0.007(0.001) 0.259(0.023) 0.172(0.018)
Time 0.052(0.005) 8.471(0.076) 2.765(0.023) 8.897(0.079) 10.32(0.121) 8.834(0.072) 3.342(0.039) 8.532(0.080)
 
  • *

    The meanings of the notations used in this table are as follows: AAE: average absolute estimation error; ASE: average square estimation error; AAP: average absolute prediction error; ASP: average square prediction error. Numbers in the parentheses represent the corresponding standard deviations. The optimal solution is represented in bold.

Figure 4: Comparisons of elastic-net regressions for different values of $M$.

6.1.2 Sparse fused lasso

In the second study, we conduct simulations on sparse fused lasso regressions. All model settings remain unchanged from those in Section 6.1.1, except for the coefficient settings. In this case, the coefficients are assumed to exhibit an ordering relationship and to be closely connected to their neighboring coefficients. To this end, we divide the coefficients into 80 groups and randomly select 10 groups whose coefficients are set to a common non-zero value within each group. For convenience, we denote the set of these non-zero groups as $\{\mathcal{A}_{i}\}_{i=1}^{10}$. The coefficients are randomly generated as

𝜷𝒜i={U[3,3]×𝟏𝒜i,ifi=1,2,,10,0,otherwise.\bm{\beta}_{\mathcal{A}_{i}}^{*}=\left\{\begin{array}[]{l}\text{U}\left[{-3,3}\right]\times\bm{1}_{\mathcal{A}_{i}},\ \text{if}\ i=1,2,\dots,10,\\ \textbf{0},\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{otherwise}.\end{array}\right. (6.2)

Here, $\text{U}\left[-3,3\right]$ represents the uniform distribution on the interval $\left[-3,3\right]$. Similar coefficient settings have been used in Li et al. (2014), Xiu et al. (2019) and Wu et al. (2023), and a schematic diagram of one random realization of the coefficients is shown in Figure 5.

Figure 5: The true coefficient for sparse fused lasso regressions.
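One way to generate coefficients of the form (6.2) in R is sketched below, assuming the 80 groups are equally sized blocks of adjacent coefficients (so that each group contains $p/80$ entries).

```r
# Piecewise-constant coefficients as in (6.2): 10 randomly chosen groups share one value each.
set.seed(2)
p <- 2560; n_groups <- 80; gsize <- p / n_groups   # 32 adjacent coefficients per group
beta_star <- rep(0, p)
active <- sample(n_groups, 10)                     # indices of the non-zero groups A_1, ..., A_10
for (g in active) {
  idx <- ((g - 1) * gsize + 1):(g * gsize)
  beta_star[idx] <- runif(1, -3, 3)                # one common U[-3, 3] value per active group
}
```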

We also first use our CPADMM algorithm to compute four sparse fused lasso regression models, that is, LS-Sfla (least squares loss plus sparse fused lasso), Q-Sfla (quantile loss plus sparse fused lasso), SR-Sfla (square root loss plus sparse fused lasso), and H-Sfla (Huber loss plus sparse fused lasso); their estimated coefficients are shown in Figure 6. The results in Figure 6 demonstrate the good performance of the four sparse fused lasso regression models under normal errors. Furthermore, CPADMM is an effective and accurate algorithm for solving these models.

Figure 6: The estimation coefficients by CPADMM for the four types of sparse fused lasso regression.

Next, we use four variations of sparse fused lasso regression to fit the data. Several algorithms have been proposed to solve these four sparse fused lasso models. For instance, the linearized ADMM (LADMM) algorithm is used for LS-Sfla in Li et al. (2014), the Multi-block ADMM (MADMM) algorithm is used for Q-Sfla in Wu et al. (2023), the square root ADMM (SRADMM) algorithm is used for SR-Sfla in Jiang et al. (2020), and the path solution algorithm is used for H-Sfla in Liu et al. (2021). To compare the performance of our unified CPADMM algorithm with the above four algorithms in solving sparse fused lasso regression models, we conduct 100 repetitions of each simulation. The average results for non-parallel computations (single machine) are presented in Table 4. The results in Table 4 indicate that our algorithm demonstrates competitive performance in terms of computation time, estimation accuracy, and prediction performance compared to existing algorithms.

To conclude this subsection, we implement parallel simulations using our CPADMM algorithm. We show the computation time and coefficient estimation accuracy for $M=1,10$, and $100$ in Figure 7. The results demonstrate that parallel computation can significantly reduce the time for large-scale sparse fused lasso regression. However, it should be noted that an excessive number of local machines can slow down the convergence of the algorithm.

Figure 7: Comparisons of sparse fused lasso regressions for different values of $M$.
Table 4: Comparison of CPADMM and four algorithms for calculating four sparse fused lasso regressions.
  LS-Sfla Q-Sfla SR-Sfla H-Sfla
  Algorithm LADMM CPADMM MADMM CPADMM SRADMM CPADMM Path Solution CPADMM
AAE(×102\times 10^{-2}) 0.017(0.008) 0.008(0.001) 0.010(0.004) 0.009(0.004) 0.022(0.029) 0.008(0.001) 0.015(0.014) 0.008(0.001)
ASE(×102\times 10^{-2}) 0.026(0.013) 0.015(0.009) 0.016(0.006) 0.017(0.005) 0.033(0.015) 0.016(0.007) 0.024(0.011) 0.015(0.006)
AAP 0.621(0.028) 0.451(0.019) 0.466(0.017) 0.468(0.018) 0.592(0.025) 0.339(0.013) 0.483(0.018) 0.461(0.012)
ASP 0.583(0.025) 0.571(0.018) 0.657(0.031) 0.659(0.029) 0.738(0.039) 0.427(0.019) 0.602(0.032) 0.589(0.025)
Time 13.59(0.442) 12.48(0.437) 12.62(0.413) 12.55(0.409) 15.40(0.620) 12.22(0.425) 230.6(23.61) 12.52(0.418)
 
  • *

    The meanings of the notations used in this table are as follows: AAE: average absolute estimation error; ASE: average square estimation error; AAP: average absolute prediction error; ASP: average square prediction error. Numbers in the parentheses represent the corresponding standard deviations. The optimal solution is represented in bold.

6.1.3 Sparse group lasso

In the third study, we use a simulation model similar to that of Li et al. (2014) to generate data for conducting a comparison of sparse group lasso regression. All model settings remain the same as those in Section 6.1.1, except for the coefficient settings. Taking into account coefficient grouping, we also divide the coefficients into 80 groups. From these groups, we randomly select 10 groups to set as non-zero coefficient groups. For convenience, we define the set of these non-zero coefficients as SS. Then, the true coefficient vector 𝜷\bm{\beta}^{*} is randomly generated by

βj={ξj(1+|aj|),ifjS,0,otherwise,{\beta}^{*}_{j}=\left\{\begin{array}[]{l}\xi_{j}(1+|a_{j}|),\ \ \ \ \text{if}\ j\in S,\\ \text{0},\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{otherwise},\end{array}\right. (6.3)

where $\xi_{j}$ is randomly chosen from the set $\{+1,-1\}$ and $a_{j}$ follows a $N(0,1)$ distribution. A schematic diagram of one random realization of the true coefficients is shown in Figure 8.

Figure 8: The true coefficient for sparse group lasso regressions.
Figure 9: The estimation coefficients by CPADMM for the four types of sparse group lasso regression.
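For completeness, a short R sketch of the coefficient generation in (6.3), again assuming 80 equally sized groups of which 10 are active (their indices forming the set $S$):

```r
# Sparse group lasso coefficients as in (6.3).
set.seed(3)
p <- 2560; n_groups <- 80; gsize <- p / n_groups
active <- sample(n_groups, 10)                             # the 10 non-zero groups
S <- as.vector(outer(1:gsize, (active - 1) * gsize, "+"))  # indices belonging to active groups
beta_star <- rep(0, p)
xi <- sample(c(-1, 1), length(S), replace = TRUE)          # random signs xi_j
a  <- rnorm(length(S))                                     # a_j ~ N(0, 1)
beta_star[S] <- xi * (1 + abs(a))
```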

We initially employed our CPADMM algorithm to compute four sparse group lasso regression models. These models are LS-Sgla (least squares loss plus sparse group lasso), Q-Sgla (quantile loss plus sparse group lasso), SR-Sgla (square root loss plus sparse group lasso), and H-Sgla (huber loss plus sparse group lasso). The estimated coefficients for these models are displayed in Figure 9. The results in Figure 9 show that the four sparse group lasso regression models all perform well under normal error, and CPADMM is an effective and accurate algorithm for solving them.

Next, we use four types of sparse group lasso regression to fit these data. Li et al. (2014) and Wang et al. (2019) respectively proposed linearized ADMM algorithms to solve LS-Sgla and Q-Sgla. Currently, there is no effective algorithm available to solve SR-Sgla (λ10\lambda_{1}\neq 0 and λ20\lambda_{2}\neq 0) and H-Sgla (λ10\lambda_{1}\neq 0 and λ20\lambda_{2}\neq 0). However, the ADMM algorithm in Boyd et al. (2010) and a scaled thresholding-based iterative selection procedure (S-TISP) in Bunea et al. (2013) have been proposed to solve SR-Sgla and H-Sgla respectively in the special case when λ1=0\lambda_{1}=0. We compare our unified ADMM algorithm with the above four algorithms for solving sparse group lasso or group lasso regression models. Each simulation is repeated 100 times, and the average results for nonparallel computations (single machine) are presented in Table 5. The simulation results demonstrate that our CPADMM algorithm presents significant advantages compared to existing algorithms in terms of computation time, estimation accuracy, and prediction performance.

Table 5: Comparison of CPADMM and four algorithms for calculating four sparse group lasso regressions.
  LS-Sgla Q-Sgla SR-Sgla H-Sgla
  Algorithm LADMM CPADMM LADMM CPADMM ADMM CPADMM S-TISP CPADMM
AAE(10210^{-2}) 0.023(0.003) 0.011(0.001) 0.027(0.004) 0.012(0.001) 0.029(0.004) 0.012(0.001) 0.036(0.005) 0.011(0.001)
ASE(10210^{-2}) 0.035(0.004) 0.027(0.002) 0.042(0.006) 0.026(0.002) 0.039(0.005) 0.025(0.002) 0.044(0.006) 0.027(0.002)
AAP 0.323(0.013) 0.219(0.009) 0.121(0.006) 0.093(0.003) 0.088(0.003) 0.069(0.002) 0.337(0.015) 0.219(0.012)
ASP 0.381(0.021) 0.268(0.019) 0.156(0.013) 0.129(0.008) 0.091(0.002) 0.087(0.001) 0.358(0.029) 0.268(0.018)
Time 7.21(0.351) 5.38(0.227) 7.85(0.412) 5.56(0.239) 9.26(0.482) 5.41(0.253) 7.77(0.382) 5.47(0.246)
 
  • *

    The meanings of the notations used in this table are as follows: AAE: average absolute estimation error; ASE: average square estimation error; AAP: average absolute prediction error; ASP: average square prediction error. Numbers in the parentheses represent the corresponding standard deviations. The optimal solution is represented in bold.

In addition, we conclude this subsection by performing parallel simulations on sparse group lasso regressions through our CPADMM algorithm. The computation time and coefficient estimation accuracy are depicted in Figure 10 for M=1,10M=1,10, and 100100. The results suggest that parallel algorithms can significantly reduce computation time for large-scale sparse group lasso regression data. However, it is worth noting that having too many local machines may actually slow down the convergence speed of the algorithm.

Figure 10: Comparisons of the four types of sparse group lasso regressions for different values of $M$.

6.2 Real datasets

In this subsection, we focus on index tracking using combined regularization regression, with the target indices being two highly representative stock indices in China, the SSE 50 (Shanghai Stock Exchange 50 stock index) and CSI 300 (China Securities Index 300). A stock index is a numerical value calculated by weighting the price or market value of its constituent stocks, which is used to reflect the ups and downs of the stock market and the overall market trend. SSE 50 is an index of 50 major stocks on the Shanghai Stock Exchange, representing China’s A-share market performance and used as a benchmark for investors. CSI 300 tracks the performance of 300 A-share stocks across different sectors on both Shanghai and Shenzhen Stock Exchanges, providing broad coverage of the Chinese stock market.

For further data analysis, we have tabulated the distribution of constituent stocks in various industry sectors for two stock indices in Table 6. We classify the constituent stocks of the stock index into several sectors, such as financials (banking, insurance, and financial services); real estate (property development, real estate investment trusts, and property management); consumer discretionary (retail, automotive, and leisure and entertainment); industrials (manufacturing, construction and engineering); utilities (electric utilities and gas utilities); health care (pharmaceuticals and biotechnology); information technology (software development and telecommunications); and others (advertising, agriculture and related fields). The grouping of these sectors can serve as a prior grouping for sparse group lasso regression.

Table 6: Distribution of constituent stocks in various sectors of stock indices.
Sector Number of Stocks (SSE 50) Number of Stocks (CSI 300)
Financials 9 47
Real Estate 4 20
Consumer Discretionary 10 12
Industrials 17 110
Utilities 3 39
Health care 4 25
Information Technology 3 26
Others 0 21

Index tracking refers to investors using financial products such as index funds and exchange-traded funds (ETFs) to replicate or track the performance of specific stock indices. Through index tracking, investors can participate in the ups and downs of the index without directly purchasing individual stocks by holding financial products that track specific indices. These financial products are typically invested in accordance with the weightings of the constituent stocks in the index to ensure their performance closely aligns with the upward or downward trends of the index. The advantage of index tracking is that investors can achieve diversified risk by investing in multiple stocks in the index through a single investment, while also benefiting from the overall market performance. However, selecting which constituent stocks to include in index tracking has become a hot topic in recent years due to the increasing transaction costs associated with holding a larger number of stocks in the financial market. Since the weights of the constituent stocks in an index are non-negative, regularized regression with non-negativity constraints is a widely used approach, see Wu et al. (2014), Yang & Wu (2016) and Wu et al. (2022).

According to their research, we can define xt,jx_{t,j} and yty_{t} to represent the returns of the jj-th constituent stock and the index, respectively, where j=1,2,,50(300)j=1,2,\dots,50\ (300). We can then use a linear regression model to describe the relationship between xt,jx_{t,j} and yty_{t} as follows:

{y_{t}}=\sum\limits_{j}{{\beta_{j}}}{x_{t,j}}+{\varepsilon_{t}},\qquad t=1,2,\cdots,T, (6.4)

where βj0\beta_{j}\geq 0 represents the weight of the jj-th chosen stock and εt\varepsilon_{t} is the error term. In practical applications, the optimal estimate of βj\beta_{j} represents the proportion of each stock. For example, if β^1=1,β^2=2\hat{\beta}_{1}=1,\hat{\beta}_{2}=2, then when tracking the stock index, for every unit of the 1st labeled stock held, it would be necessary to hold 2 units of the 2nd labeled stock. To measure the bias for tracking, the Annual Tracking Error (ATE) is used and defined by

\mathrm{TrackingError}_{Year}=\sqrt{252}\times\sqrt{\frac{\sum_{t=1}^{T}\left(err_{t}-\mathrm{mean}(err_{t})\right)^{2}}{T-1}}, (6.5)

where $err_{t}=\hat{y}_{t}-y_{t}$ and $\hat{y}_{t}$ is the fitted or predicted value of $y_{t}$, for $t=1,2,\dots,T$. Following Wu et al. (2014), we choose a grid of penalty parameters ranging from 0 to a sufficiently large positive number that shrinks all coefficients to 0, with equal spacing between consecutive parameters. For the given dataset, we fit the model using $\lambda_{2}=\sqrt{\log p/n}$, and $\lambda_{1}=100\sqrt{\log p/n}$ is large enough to ensure that all coefficients are shrunk to 0. In this case, we choose 1000 penalty parameters ranging from 0 to $100\sqrt{\log p/n}$, with the interval between consecutive parameters set to $0.1\sqrt{\log p/n}$.
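A direct R transcription of (6.5) is given below; since the sample standard deviation already uses the $T-1$ denominator, the formula reduces to $\sqrt{252}$ times the standard deviation of the tracking errors.

```r
# Annual tracking error (ATE) as defined in (6.5).
annual_tracking_error <- function(y_hat, y) {
  err <- y_hat - y
  sqrt(252) * sqrt(sum((err - mean(err))^2) / (length(err) - 1))  # = sqrt(252) * sd(err)
}
```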

Next, we compare non-negative lasso regression (nlasso) in Wu et al. (2014), non-negative adaptive lasso regression (nalasso) in Yang & Wu (2016), non-negative ladlasso regression (nladlasso) in Wu et al. (2022), and the non-negative sparse group lasso regression (nsglasso, $G=8$) considered in this paper for tracking the minute-by-minute stock indexes of the SSE 50 and the CSI 300 between August 7, 2023 and November 9, 2023. Note that all the non-negative methods mentioned above can be solved in a unified manner by our proposed CPADMM algorithm. Each stock index has a total of 15000 observations, which means $n=15000$. Note that the last update of the constituent stocks for the SSE 50 and CSI 300 was on June 9, 2023, and during the data collection period there were no trading suspensions for any of the constituent stocks of these two indices; therefore, no data cleaning was required. In addition, we select the first 12000 samples as the training set and the last 3000 samples as the test set. We record the ATE ($\times 10^{4}$) comparison results of the various non-negative regression methods in Table 7. In the tracking of the SSE 50 Index, we select 5, 10, and 20 constituent stocks, while for the CSI 300 Index, we choose 10, 20, and 40 constituent stocks. The results in Table 7 indicate that the sparse group lasso, which utilizes the grouping information, performs significantly better in stock index tracking than the methods with a single regularizer. Moreover, we display the tracking results based on the 5 constituent stocks selected by the nsglasso method for the SSE 50 and the 10 constituent stocks selected for the CSI 300 in Figure 11 and Figure 12, respectively.

Table 7: SSE 50 and CSI 300 index tracking data.
SSE 50 CSI 300
Number Method ATEtrainATE_{train} ATEtestATE_{test} Number Method ATEtrainATE_{train} ATEtestATE_{test}
5 nladlasso 5.411 2.067 10 nladlasso 6.534 3.258
nalasso 5.893 2.191 nalasso 6.436 3.658
nlasso 6.093 2.103 nlasso 6.852 3.803
nsglasso 5.343 2.035 nsglasso 6.298 3.110
10 nladlasso 5.359 1.987 20 nladlasso 6.489 3.165
nalasso 5.762 2.158 nalasso 6.412 3.558
nlasso 5.921 2.058 nlasso 6.782 3.724
nsglasso 5.226 1.958 nsglasso 6.111 3.087
20 nladlasso 5.309 1.943 40 nladlasso 6.359 3.156
nalasso 5.708 1.996 nalasso 6.387 3.470
nlasso 5.856 2.007 nlasso 6.626 3.703
nsglasso 5.197 1.923 nsglasso 6.091 2.990
Figure 11: SSE 50 tracking diagram; the top panel shows the actual stock index and the bottom panel shows the values predicted by nsglasso.
Figure 12: CSI 300 tracking diagram; the top panel shows the actual stock index and the bottom panel shows the values predicted by nsglasso.

7 Conclusion and further research

In this paper, we propose an efficient parallel ADMM algorithm for uniformly solving sparse combined regularized regression models, which fills a gap in parallel algorithms for combined regularization regression. The primary challenges in designing this unified ADMM algorithm lie in dealing with combined regularizations whose proximal operators lack closed-form solutions and in handling distributedly stored data. To overcome these two difficulties, we introduce auxiliary variables that not only allow the subproblems of each iteration to be handled separately with closed-form solutions, but also build a consensus structure adapted to distributedly stored data. Fortunately, despite the introduction of additional auxiliary variables, we can prove that these constrained optimization problems can be viewed as optimization problems with two blocks of primal variables subject to linear constraints. Since our linearization of the first variable block is only partial, existing results on two-block ADMM algorithms cannot directly ensure the convergence of our parallel algorithm. Nevertheless, we can still use the framework of He and Yuan (2012, 2015) to prove the convergence of our parallel ADMM algorithm and provide its linear convergence rate. The algorithm complexity is comparable to that of existing ADMM algorithms in both single-machine and multi-machine scenarios.

Our algorithm has a relatively fast computation speed because every subproblem at each iteration has a closed-form solution. However, we must also acknowledge that our algorithm may become slow when there are many local machines, particularly when $\{n_{m}\}_{m=1}^{M}$ and $p$ are large. This is because our algorithm requires the $m$-th local machine to carry out a one-time matrix inversion whose cost grows with both $n_{m}$ and $p$. In addition, as the number of local machines $M$ increases, the convergence of our algorithm tends to slow down in the simulations. This is actually reflected in Theorem 1 of this paper, where the constant $\|\tilde{\bm{v}}^{0}-\tilde{\bm{v}}^{*}\|_{\bm{H}}^{2}$ in the convergence bound grows with the dimension of the iteration sequence. Developing a more scalable parallel algorithm that addresses these issues is one of our future research goals.

Moreover, our algorithm is flexible in terms of the loss function. In the implementation, one only needs to modify the proximal operator in the $\bm{r}$-subproblem to adapt to different losses, which opens up possibilities for further extensions of our algorithm. For example, similar to Li et al. (2015), we can replace the loss function with the Dantzig selector or $\ell_{q}$ ($q>1$) losses to tackle more regression tasks. Additionally, we can utilize certain classification losses mentioned in Liang et al. (2023) to handle classification tasks on distributedly stored data. The convergence of these extended parallel algorithms may be proven by the same method used in this paper.

Acknowledgements

We express our sincere gratitude to Professor Bingsheng He for engaging in invaluable discussions with us. His insights and expertise have greatly assisted us in effectively utilizing parallel ADMM algorithms to solve regression problems. The research of Zhimin Zhang was supported by the National Natural Science Foundation of China [Grant Numbers 12271066, 12171405, 11871121], and the research of Xiaofei Wu was supported by the Scientific and Technological Research Program of Chongqing Municipal Education Commission [Grant Numbers KJQN202302003].

References

  • Ahmed and Xing (2009) Ahmed, A. & Xing, E. P. (2009). Recovering Time-varying Networks of Dependencies in Social and Biological Studies[J]. Proceedings of the National Academy of Sciences, 106, 11878-11883. https://doi.org/10.1073/pnas.0901910106.
  • Belloni et al. (2011) Belloni, A., Chernozhukov, V. & Wang, L. (2011). Square-root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming. Biometrika, 98, 791–806. https://doi.org/10.1093/biomet/asr043.
  • Boyd et al. (2010) Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2010). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers[J]. Foundations and Trends in Machine Learning, 3(1), 1-122. https://doi.org/10.1561/2200000016.
  • Bunea et al. (2013) Bunea, F. , Lederer, J. & She, Y. Y. (2013). The Group Square-root Lasso: Theoretical Properties and Fast Algorithms[J]. IEEE Transactions on Information Theory, 60(2), 1313-1325.
  • Chartrand (2012) Chartrand, R. (2012). Nonconvex splitting for regularized low-rank + sparse decomposition[J]. IEEE Transactions Signal Process, 60, 5810–5819.
  • Chen et al. (2016) Chen, C. H., He, B. S., Ye, Y. Y. & Yuan, X. M. (2016). The Direct Extension of ADMM for Multi-Block Convex Minimization Problems is not Necessary Convergent[J]. Mathematical Programming, 155, 57-79. https://doi.org/10.1007/s10107-014-0826-5.
  • Chen et al. (2020) Chen, H., He, Y., Ji, J. D. & Shi, Y. F. (2020). The Sparse Group Lasso for High-dimensional Integrative Linear Discriminant Analysis with Application to Alzheimer’s Disease Prediction[J]. Journal of Statistical Computation and Simulation, 90(17), 3218-3231.
  • Chinot et al. (2020) Chinot, G. , Lecué, Guillaume & Lerasle, M. (2020). Robust High Dimensional Learning for Lipschitz and Convex Losses[J]. Journal of Machine Learning Research, 21, 1-47.
  • Corsaro et al. (2019) Corsaro, S., Simone, V. D. & Marino, Z. (2019). Fused Lasso Approach in Portfolio Selection[J]. Annals of Operations Research, 1, 1-13. https://doi.org/10.1007/s10479-019-03289-w.
  • Daniel et al. (2017) Daniel, V., Samarov, David, Allen & Jeeseong, et al. (2017). A Coordinate-descent-based Approach to Solving the Sparse Group Elastic Net[J]. Technometrics, 59(4), 437-445. https://doi.org/10.1080/00401706.2016.1273138.
  • Donoho (1995) Donoho, D. L. (1995). De-noising by Soft-thresholding[J]. IEEE Transactions on Information Theory, 41, 613-627. https://doi.org/10.1109/18.382009.
  • Fan et al. (2020) Fan, Y., Lin, N. & Yin, X. J. (2020). Penalized Quantile Regression for Distributed Big Data Using the Slack Variable Representation[J]. Journal of Computational and Graphical Statistics, 30(3), 557-565. https://doi.org/10.1080/10618600.2020.1840996.
  • Fan and Li (2001) Fan, J. Q., and Li, R. Z. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties[J]. Journal of the American Statistical Association, 96, 1348–1360.
  • Fan et al. (2014) Fan, J. Q., Xue, L. G. & Zou, H. (2014). Strong Oracle Optimality of Folded Concave Penalized Estimation[J]. Annals of Statistics, 42(3), 819–849.
  • Fan et al. (2020) Fan, J. Q., Li, R. Z., Zhang, C. H. & Zou, H. (2020). Statistical Foundations of Data Science (1st ed.). Chapman and Hall/CRC, New York[M]. https://doi.org/10.1201/9780429096280
  • Friedman et al. (2007) Friedman, J., Hastie, T., Hofling, H. & Tibshirani, R. (2007). Pathwise Coordinate Optimization[J]. Annals of Applied Statistics, 1, 302-332. https://doi.org/10.1214/07-AOAS131.
  • Friedman et al. (2010) Friedman, J., Hastie, T. & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent[J]. Journal of Statistical Software, 33(1), 1-22.
  • Friedman et al. (2010) Friedman, J., Hastie, T. & Tibshirani, R. (2010). A Note on the Group Lasso and a Sparse Group Lasso[J]. arXiv preprint arXiv:1001.0736.
  • Gu et al. (2017) Gu, Y. W., Fan, J., Kong, L. C., Ma, S. Q. & Zou, H. (2017). ADMM for High-Dimensional Sparse Penalized Quantile Regression[J]. Technometrics, 60(3), 319-333. https://doi.org/10.1080/00401706.2017.1345703.
  • He and Yuan (2012) He, B. S. & Yuan, X. M. (2012). On the O(1/n) Convergence Rate of the Douglas–Rachford Alternating Direction Method[J]. SIAM Journal on Numerical Analysis, 50, 700–709. https://doi.org/10.1137/110836936.
  • He and Yuan (2015) He, B. S. & Yuan, X. M. (2015). On Non-Ergodic Convergence Rate of Douglas–Rachford Alternating Direction Method of Multipliers[J]. Numerische Mathematik, 130(3), 567-577. https://doi.org/10.1007/s00211-014-0673-6.
  • He et al. (2022) He, H., Guo, X. Y., Yu, J. L., Ai, C. & Shi, S. P. (2022). Retracted: Overcoming the Inadaptability of Sparse Group Lasso for Data with Various Group Structures by Stacking[J]. Bioinformatics, 38(6), 1542–1549.
  • Hoerl and Kennard (1970) Hoerl, A. E. & Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems[J]. Technometrics, 12, 55-67.
  • Huang et al. (2012) Huang, J. , Breheny, P. & Ma, S. (2012). A Selective Review of Group Selection in High-dimensional Models[J]. Statistical Science, 27(4), 481-499.
  • Huang et al. (2015) Huang, J., Breheny, P., Lee, S., Ma, S. and Zhang, C. H. (2016). The Mnet Method for Variable Selection[J]. Statistica Sinica, (3), 903-923.
  • Huang et al. (2009) Huang, J. Z. , Zhang, T. & Metaxas, D. (2009). Learning with Structured Sparsity[J]. Journal of Machine Learning Research, 12.
  • Huang and Zhang (2010) Huang, J. Z. & Zhang, T. (2010). The Benefit of Group Sparsity[J]. The Annals of Statistics, 38(4), 1978-2004. https://doi.org/10.1214/09-AOS778.
  • Huber (1964) Huber, P. J. (1964). Robust Estimation of a Location Parameter[J]. Annals of Mathematical Statistics, 35(1), 73-101.
  • Jiang et al. (2020) Jiang, H., Luo, S. H. & Dong, Y. (2020). Simultaneous Feature Selection and Clustering Based on Square Root Optimization[J]. European Journal of Operational Research, 289(1), 214-231. https://doi.org/10.1016/j.ejor.2020.06.045.
  • Juan et al. (2019) Laria, J. C., Aguilera-Morillo, M. C. & Lillo, R. E. (2019). An Iterative Sparse-Group Lasso[J]. Journal of Computational and Graphical Statistics, 28(3), 722-731.
  • Kato (2011) Kato, K. (2011). Group Lasso for High Dimensional Sparse Quantile Regression Models[J]. arXiv preprint. https://doi.org/10.48550/arXiv.1103.1458.
  • Klosa et al. (2020) Klosa, J., Simon, N., Westermark, P. O., Liebscher, V. & Wittenburg, D. (2020). Seagull: Lasso, Group lasso and Sparse-group Lasso Regularization for Linear Regression Models via Proximal Gradient Descent[J]. BMC Bioinformatics, 21(1), 407.
  • Koenker and Bassett (1978) Koenker, R. & Bassett, G. (1978). Regression Quantiles[J]. Econometrica, 46, 33–50.
  • Lange (2013) Lange, K. (2013). Optimization, 2nd edn[M]. Springer, New York.
  • Li et al. (2014) Li, X. X., Mo, L. L., Yuan, X. M. & Zhang, J. Z. (2014). Linearized Alternating Direction Method of Multipliers for Sparse Group and Fused LASSO Models[J]. Computational Statistics & Data Analysis, 79, 203-221. https://doi.org/10.1016/j.csda.2014.05.017.
  • Li et al. (2015) Li, X. G., Zhao, T. , Yuan, X. M. & Liu, H. (2015) The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R[J]. Journal of Machine Learning Research, 553-557.
  • Liang et al. (2023) Liang, R. M., Wu, X. F. & Zhang, Z. Z. (2023) Linearized Alternating Direction Method of Multipliers for Elastic-net Support Vector Machines[J]. Pattern Recognition, in press.
  • Lin et al. (2022) Lin, Z. C., Fang, C. and Li, H. (2022). Alternating Direction Method of Multipliers for Machine Learning [M]. Springer Singapore.
  • Liu et al. (2020) Liu, Y. X., Zeng, P. & Lin, L. (2020). Generalized L1-penalized Quantile Regression with Linear Constraints[J]. Computational Statistics & Data Analysis, 142. https://doi.org/10.1016/j.csda.2019.106819.
  • Liu et al. (2021) Liu, Y. X., Zeng, P. & Lin, L. (2021). Degrees of Freedom for Regularized Regression with Huber loss and Linear Constraints. Statistical Papers, 62: 2383-2405.
  • Li and Zhu (2007) Li, Y. & Zhu, J. (2007). Analysis of Array CGH Data for Cancer Studies Using Fused Quantile Regression[J]. Bioinformatics, 23, 2470-2476. https://doi.org/10.1093/bioinformatics/btm364.
  • Li et al. (2020) Li, Y., Sun, C., Li, P., Zhao, Y. & Chen, J. (2020). Corrigendum: Hypernetwork Construction and Feature Fusion Analysis Based on Sparse Group Lasso Method on fMRI Dataset[J]. Frontiers in Neuroscience, 14.
  • Mota et al. (2013) Mota, J. F. C., Xavier, J. M. F., Aguiar, P. M. Q. & Püschel, Markus. (2013). D-ADMM: A Communication-Efficient Distributed Algorithm For Separable Optimization[J]. IEEE Transactions on Signal Processing, 61(10), 2718–2723. https://doi.org/10.1109/TSP.2013.2254478.
  • Mendez et al. (2021) Mendez-Civieta, A., Aguilera-Morillo, M.C. & Lillo, R.E. (2021). Adaptive Sparse Group LASSO in Quantile Regression. Advances in Data Analysis and Classification, 15, 547–573 (2021). https://doi.org/10.1007/s11634-020-00413-8
  • Ogutu et al. (2012) Ogutu, J. O., Schulz-Streeck, T. & Piepho, H. P. (2012). Genomic Selection Using Regularized Linear Regression Models: Ridge Regression, Lasso, Elastic-net and Their Extensions[J]. BMC Proceedings, 6, 2-6.
  • Parikh and Boyd (2014) Parikh, N. & Boyd, S. (2014). Proximal Algorithms[M]. Now Foundations and Trends. https://doi.org/10.1561/2400000003.
  • Parekh and Selesnick (2016) Parekh, A. & Selesnick, I. W. (2016). Convex Fused Lasso Denoising with Non-Convex Regularization and Its Use for Pulse Detection[C]. Signal Processing in Medicine & Biology Symposium, 1-6. https://doi.org/10.1109/SPMB.2015.7405474.
  • Piepho and Ogutu (2014) Piepho, H. P. & Ogutu, J. O. (2014). Regularized Group Regression Methods for Genomic Prediction: Bridge, Mcp, Scad, Group Bridge, Group Lasso, Sparse Group Lasso, Group Mcp and Group Scad. BMC Proceedings, 8(5), 1-9.
  • Qin and Goldfarb (2011) Qin, Z. & Goldfarb, D. (2012). Structured Sparsity via Alternating Direction Methods. Journal of Machine Learning Research, 13(1), 1435-1468.
  • Raninen and Ollila (2017) Raninen, E. & Ollila, E. (2017). Scaled and Square-root Elastic net[C]. IEEE International Conference on Acoustics, Speech and Signal Processing, 4336-4340. https://doi.org/10.1109/ICASSP.2017.7952975.
  • Rosset and Zhu (2007) Rosset, S. & Zhu, J. (2007). Piecewise Linear Regularized Solution Paths[J]. The Annals of Statistics, 35, 1012–1030.
  • Rudin et al. (1992) Rudin, L. I., Osher, S. & Fatemi, E. (1992). Nonlinear Total Variation based Noise Removal Algorithms[J]. Physica D Nonlinear Phenomena, 60(1-4), 259-268. https://doi.org/10.1016/0167-2789(92)90242-F.
  • Shen et al. (2016) Shen, X., Chen, L., Gu, Y. & So, H. C. (2016). Square-root Lasso with Nonconvex Regularization: an ADMM Approach[J]. IEEE Signal Processing Letters, 934-938.
  • Sun et al. (2016) Sun, Y., Wang, H. J. & Fuentes, M. (2016). Fused Adaptive Lasso for Spatial and Temporal Quantile Function Estimation[J]. Technometrics, 58(1), 127–137. https://doi.org/10.1080/00401706.2015.1017115.
  • Simon et al. (2013) Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. (2013). A Sparse Group Lasso[J]. Journal of Computational and Graphical Statistics, 22, 231–245.
  • Slawski (2012) Slawski, M. (2012). The Structured Elastic-net for Quantile Regression and Support Vector Classification[J]. Statistics and Computing, 22, 153–168. https://doi.org/10.1007/s11222-010-9214-z
  • Tibshirani and Wang (2008) Tibshirani, R. & Wang, P. (2008). Spatial Smoothing and Hot Spot Detection for CGH Data Using the Fused Lasso[J]. Biostatistics, 9, 18-29. https://doi.org/10.1093/biostatistics/kxm013.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso[J]. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
  • Tibshirani et al. (2005) Tibshirani, R., Saunders, M., Rosset, S. & Zhu, J. (2005). Sparsity and Smoothness via the Fused Lasso[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91-108. https://doi.org/10.1111/j.1467-9868.2005.00490.x.
  • Tugnait (2021a) Tugnait, J. K. (2021a). Sparse-group Lasso for Graph Learning from Multi-attribute Data[J]. IEEE Transactions on Signal Processing, 69, 1771-1786. https://doi.org/10.1109/TSP.2021.3057699.
  • Tugnait (2021b) Tugnait, J. K. (2021b). Sparse-Group Non-convex Penalized Multi-Attribute Graphical Model Selection[C]. European Signal Processing Conference, 1850-1854. https://doi.org/10.23919/EUSIPCO54536.2021.9616255.
  • Wang et al. (2019) Wang, H. F., Kong, L. C. & Tao, J. Y. (2019) The Linearized Alternating Direction Method of Multipliers for Sparse Group LAD Model[J]. Optimization Letters, 13, 505–525. https://doi.org/10.1007/s11590-017-1180-3
  • Wang et al. (2010) Wang, X. M., Park, T. & Carriere, K. C. (2010). Variable Selection via Combined Penalization for High-dimensional Data Analysis[J]. Computational Statistics & Data Analysis, 54(10), 2230-2243. https://doi.org/10.1016/j.csda.2010.03.026.
  • Wu et al. (2014) Wu, L., Yang, Y. H., & Liu, H. Z. (2014). Nonnegative-lasso and Application in Index Tracking. Computational Statistics & Data Analysis, 70(2), 116-126. https://doi.org/10.1016/j.csda.2013.08.012.
  • Wu and Lange (2008) Wu, T. T. & Lange, K. (2008). Coordinate Descent Algorithms for Lasso Penalized Regression[J]. Annals of Applied Statistics, 2(1), 224-244.
  • Wu et al. (2022) Wu, X. F., Liang, R. M. & Yang, H. (2022). Penalized and Constrained LAD Estimation in Fixed and High Dimension[J]. Statistical Papers, 63(1), 53-95. https://doi.org/10.1007/s00362-021-01229-0.
  • Wu et al. (2023) Wu, X. F., Ming, H., Cui, Z. Y. & Zhang, Z. M. (2023). Multi-block Alternating Direction Method of Multipliers for Ultrahigh Dimensional Quantile Fused Regression[J]. Computational Statistics & Data Analysis, in press.
  • Wu and Liu (2009) Wu, Y. C. & Liu, Y. F. (2009). Variable Selection in Quantile Regression[J]. Statistica Sinica, 19(2), 801-817.
  • Xin et al. (2015) Xin, B., Tian, Y., Wang, Y. Z. & Gao, W. (2015). Background Subtraction via Generalized Fused Lasso Foreground Modeling[C]. Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, 4676-4684. https://doi.org/10.1109/CVPR.2015.7299099.
  • Xiu et al. (2019) Xiu, X. C., Liu, W. Q., Li, L. & Kong, L. C. (2019). Alternating Direction Method of Multipliers for Nonconvex Fused Regression Problems[J]. Computational Statistics & Data Analysis. https://doi.org/10.1016/j.csda.2019.01.002.
  • Yang & Zou (2013) Yang, Y. & Zou, H. (2013). An Efficient Algorithm for Computing the HHSVM and Its Generalizations[J]. Journal of Computational and Graphical Statistics, 22(2), 396-415.
  • Yang & Zou (2015) Yang, Y. & Zou, H. (2015) A Fast Unified Algorithm for Solving Group-lasso Penalize Learning Problems[J]. Statistics and Computing, 25, 1129–1141. https://doi.org/10.1007/s11222-014-9498-5.
  • Yang & Wu (2016) Yang, Y. H. & Wu, L. (2016). Nonnegative Adaptive Lasso for Ultra-high Dimensional Regression Models and a Two-stage Method Applied in Financial Modeling[J]. Journal of Statistical Planning & Inference,174, 52-67. https://doi.org/10.1016/j.jspi.2016.01.011.
  • Yi and Huang (2017) Yi, C. R. & Huang, J. (2017). Semismooth Newton Coordinate Descent Algorithm for Elastic-Net Penalized Huber Loss Regression and Quantile Regression[J]. Journal of Computational and Graphical Statistics, 26(3), 547-557. https://doi.org/10.1080/10618600.2016.1256816.
  • Ye and Xie (2011) Ye, G. B. & Xie, X. H. (2011). Split Bregman Method for Large Scale Fused Lasso[J]. Computational Statistics & Data Analysis, 55(4), 1552-1569. https://doi.org/10.1016/j.csda.2010.10.021.
  • Yuan and Lin (2008) Yuan, M. & Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped Variables[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
  • Yuan et al. (2020) Yuan, X. M., Zeng, S. & Zhang, J. (2020). Discerning the Linear Convergence of ADMM for Structured Convex Optimization through the Lens of Variational Analysis[J]. Journal of Machine Learning Research, 21, 1-74.
  • Yu et al. (2013) Yu, D., Won, J. H., Lee, T., Lim, J. & Yoon, S. (2013). High-Dimensional Fused Lasso Regression Using Majorization–Minimization and Parallel Processing[J]. Journal of Computational and Graphical Statistics, 24(1), 121-153. https://doi.org/10.1080/10618600.2013.878662.
  • Yu et al. (2017) Yu, L., Lin, N. & Wang, L. (2017). A Parallel Algorithm for Large-Scale Nonconvex Penalized Quantile Regression[J]. Journal of Computational and Graphical Statistics, 26, 935–939.
  • Zhang (2010) Zhang, C. H. (2010). Nearly Unbiased Variable Selection Under Minimax Concave Penalty[J]. Annals of Statistics, 38, 894–942.
  • Zhang and Hong (2017) Zhang, F. Z. & Hong, D. (2015). Elastic net-based Framework for Imaging Mass Spectrometry Data Biomarker Selection and Classification. Statistics in Medicine, 30(7), 753-768.
  • Zhang et al. (2020) Zhang, Y. J., Zhang, N., Sun, D. F., & Toh, K. C. (2020). An Efficient Hessian Based Algorithm for Solving Large-scale Sparse Group Lasso Problems. Mathematical Programming, 179(1-2), 1-41.
  • Zhao et al, (2015) Zhao, L., Hu, Q. H. & Wang, W. W. (2015). Heterogeneous Feature Selection with Multi-Modal Deep Neural Networks and Sparse Group Lasso[J]. IEEE Transactions on Multimedia, 17(11),1936-1948.
  • Zou and Hastie (2005) Zou, H. and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net[J]. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2), 301-320.
  • Zou (2006) Zou, H. (2006). The Adaptive LASSO and Its Oracle Properties[J]. Journal of the American Statistical Association, 101(476), 1418-1429. https://doi.org/10.1198/016214506000000735.
  • Zou and Tibshirani (2006) Zou, H. & Tibshirani, R. (2006). Sparse Principal Component Analysis[J]. Journal of Computational and Graphical Statistics, 15(2), 1-30. https://doi.org/10.1198/106186006X113430.
  • Zou and Li (2008) Zou, H. & Li, R. Z. (2008). One-step Sparse Estimates in Nonconcave Penalized Likelihood Models[J]. Annals of Statistics, 36(4), 1509-1533.
  • Zou and Zhang (2009) Zou, H. & Zhang, H. H. (2009). On the Adaptive Elastic-net with a Diverging Number of Parameters[J]. Annals of Statistics 37(4), 1733–1751. https://doi.org/10.1214/08-AOS625.
  • Zou and Xue (2018) Zou, H. & Xue, L. Z. (2018). A Selective Overview of Sparse Principal Component Analysis. In Proceedings of the IEEE, 106(8), 1311-1320. https://doi.org/10.1109/JPROC.2018.2846588.
  • Zou (2019) Zou, H. (2019). Classification with High Dimensional Features[J]. Wiley Interdisciplinary Reviews: Computational Statistics, 11. https://doi.org/10.1002/wics.1453.
  • Zou (2020) Zou, H. (2020). Comment: Ridge Regression—Still Inspiring After 50 Years[J]. Technometrics, 62(4), 456-458. https://doi.org/10.1080/00401706.2020.1801257.

Appendix

Appendix A Proofs of Convergence Theorems

In Section 4.1, we showed that solving the regression models with combined regularization in parallel via ADMM can be viewed either as a traditional two-block ADMM algorithm (\bm{G}=\bm{I}_{p}) or as a partially linearized two-block ADMM algorithm (\bm{G}=\bm{F}). Moreover, the optimization objective of this paper can be written as the minimization of a two-block separable function subject to a linear constraint,

\mathop{\arg\min}\limits_{\bm{v}_{1},\bm{v}_{2}}\big\{{\bm{\theta}}_{1}(\bm{v}_{1})+{\bm{\theta}}_{2}(\bm{v}_{2})\big\},\quad\textbf{s.t.}\ \bm{A}\bm{v}_{1}+\bm{B}\bm{v}_{2}=\bm{c},  (A.1)

where \bm{v}_{1}=(\bm{\beta}^{\top},\bm{r}_{1}^{\top},\bm{r}_{2}^{\top},\dots,\bm{r}_{M}^{\top})^{\top}, \bm{v}_{2}=(\bm{\beta}_{1}^{\top},\bm{\beta}_{2}^{\top},\dots,\bm{\beta}_{M}^{\top},\bm{b}^{\top})^{\top}, {\bm{\theta}}_{1}(\bm{v}_{1})=\lambda_{1}\|\bm{\beta}\|_{1}+I_{\mathcal{C}}(\bm{\beta})+\sum_{m=1}^{M}\mathcal{L}(\bm{r}_{m}), and {\bm{\theta}}_{2}(\bm{v}_{2})=\lambda_{2}\|\bm{b}\|_{2,1} (the functions corresponding to the \bm{\beta}_{m} are all zero). Here, \bm{A}=[\bm{A}_{1},\bm{A}_{2},\dots,\bm{A}_{M+1}], where

{\bm{A}_{1}}=\begin{bmatrix}\bm{G}\\ -\bm{I}_{p}\\ -\bm{I}_{p}\\ \vdots\\ -\bm{I}_{p}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{0}\end{bmatrix},\quad{\bm{A}_{2}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{0}\\ \bm{I}_{n_{1}}\\ \bm{0}\\ \vdots\\ \bm{0}\end{bmatrix},\quad{\bm{A}_{3}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{0}\\ \bm{0}\\ \bm{I}_{n_{2}}\\ \vdots\\ \bm{0}\end{bmatrix},\quad\dots,\quad{\bm{A}_{M+1}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{0}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{I}_{n_{M}}\end{bmatrix};

and 𝑩=[𝑩1,𝑩2,,𝑩M+1]\bm{B}=[\bm{B}_{1},\bm{B}_{2},\dots,\bm{B}_{M+1}], where

{\bm{B}_{1}}=\begin{bmatrix}\bm{0}\\ \bm{I}_{p}\\ \bm{0}\\ \vdots\\ \bm{0}\\ \bm{X}_{1}\\ \bm{0}\\ \vdots\\ \bm{0}\end{bmatrix},\quad{\bm{B}_{2}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{I}_{p}\\ \vdots\\ \bm{0}\\ \bm{0}\\ \bm{X}_{2}\\ \vdots\\ \bm{0}\end{bmatrix},\quad\dots,\quad{\bm{B}_{M}}=\begin{bmatrix}\bm{0}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{I}_{p}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{X}_{M}\end{bmatrix},\quad{\bm{B}_{M+1}}=\begin{bmatrix}-\bm{I}_{G}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{0}\\ \bm{0}\\ \bm{0}\\ \vdots\\ \bm{0}\end{bmatrix};

and \bm{c}=(\bm{0}^{\top},\bm{0}^{\top},\bm{0}^{\top},\dots,\bm{0}^{\top},\bm{y}_{1}^{\top},\bm{y}_{2}^{\top},\dots,\bm{y}_{M}^{\top})^{\top}. Note that the blocks \bm{A}_{m} are mutually orthogonal, and the blocks \bm{B}_{m} are also mutually orthogonal. This orthogonality is precisely what allows Algorithm 1 (a multi-block parallel algorithm) to be recast as a two-block ADMM algorithm.
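To make the block structure above concrete, the following R sketch (illustrative only: the toy sizes M = 2, p = 3, n_1 = n_2 = 4 and all variable names are our own choices, not part of the CPADMM package) assembles \bm{A}, \bm{B} and \bm{c} for the case \bm{G}=\bm{I}_{p} and numerically confirms the mutual orthogonality of the blocks.

    # Illustrative construction of A, B and c for G = I_p (toy sizes; not package code)
    set.seed(1)
    M <- 2; p <- 3; n_m <- c(4, 4); n <- sum(n_m)
    G <- diag(p)                      # replace by the difference matrix F for the fused case
    X <- lapply(n_m, function(nm) matrix(rnorm(nm * p), nm, p))
    y <- lapply(n_m, function(nm) rnorm(nm))
    zero <- function(r, c) matrix(0, r, c)

    # A_1 stacks G, M copies of -I_p and zero blocks; A_{m+1} carries I_{n_m}
    A1 <- rbind(G, do.call(rbind, replicate(M, -diag(p), simplify = FALSE)), zero(n, p))
    Am <- lapply(seq_len(M), function(m) rbind(zero(nrow(G) + M * p, n_m[m]),
                                               zero(sum(n_m[seq_len(m - 1)]), n_m[m]),
                                               diag(n_m[m]),
                                               zero(sum(n_m[-seq_len(m)]), n_m[m])))
    A <- cbind(A1, do.call(cbind, Am))

    # B_m carries I_p (the consensus copy beta_m) and X_m; B_{M+1} carries -I for b
    Bm <- lapply(seq_len(M), function(m) rbind(zero(nrow(G), p),
                                               zero((m - 1) * p, p), diag(p), zero((M - m) * p, p),
                                               zero(sum(n_m[seq_len(m - 1)]), p), X[[m]],
                                               zero(sum(n_m[-seq_len(m)]), p)))
    BM1 <- rbind(-diag(nrow(G)), zero(M * p + n, nrow(G)))
    B <- cbind(do.call(cbind, Bm), BM1)
    cvec <- c(rep(0, nrow(G) + M * p), unlist(y))

    max(abs(crossprod(A1, Am[[1]])))       # 0: the A-blocks are mutually orthogonal
    max(abs(crossprod(Bm[[1]], Bm[[2]])))  # 0: the B-blocks are mutually orthogonal

Replacing diag(p) in the line defining G by the difference matrix yields the corresponding blocks for the fused case, since only the number of rows of \bm{G} changes.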

The existing convergence results for the original two-block ADMM, as presented in He and Yuan (2012, 2015) and Yuan et al. (2020), apply directly to Algorithm 1 when \bm{G}=\bm{I}_{p}. When \bm{G}=\bm{F}, however, those results cannot be applied directly to our partially linearized version. Nonetheless, with slight modifications of their arguments, we can still obtain the following conclusions.

Recall that

\bm{S}_{\bm{G}}=\begin{cases}\bm{0},&\text{if }\bm{G}=\bm{I}_{p},\\ \bm{S},&\text{if }\bm{G}=\bm{F},\end{cases}  (A.2)

where \bm{S}=\eta\bm{I}_{p}-(M\bm{I}_{p}+\bm{F}^{\top}\bm{F}).
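For the linearized case \bm{G}=\bm{F}, the matrix \bm{S} should be positive semi-definite so that \|\cdot\|_{\bm{S}_{\bm{G}}}^{2} defines a valid proximal term, which amounts to choosing \eta\geq\lambda_{\max}(M\bm{I}_{p}+\bm{F}^{\top}\bm{F}). The R lines below sketch this check under the assumption that \bm{F} is the (p-1)\times p first-order difference matrix of the fused penalty; the particular choice of \eta is illustrative and not the default of the CPADMM package.

    # Illustrative choice of eta making S = eta*I_p - (M*I_p + t(F) %*% F) positive semi-definite
    p <- 6; M <- 10
    F_mat <- diff(diag(p))             # assumed form of F: (p-1) x p first-order differences
    FtF <- crossprod(F_mat)
    eta <- M + max(eigen(FtF, symmetric = TRUE, only.values = TRUE)$values)
    S <- eta * diag(p) - (M * diag(p) + FtF)
    min(eigen(S, symmetric = TRUE, only.values = TRUE)$values)  # approximately 0, so S is PSD
    # A simple sufficient choice is eta = M + 4, since the eigenvalues of t(F) %*% F never exceed 4.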

Proposition 1

Let \tilde{\bm{w}}^{k}=\left(\bm{\beta}^{k},\{\bm{r}_{m}^{k}\}_{m=1}^{M},\{\bm{\beta}_{m}^{k}\}_{m=1}^{M},\bm{b}^{k},\{\bm{d}_{m}^{k}\}_{m=1}^{M},\{\bm{e}_{m}^{k}\}_{m=1}^{M},\bm{f}^{k}\right) be the sequence generated by Algorithm 1 from an initial feasible solution \tilde{\bm{w}}^{0}. Then the sequence \tilde{\bm{v}}^{k}=\left(\bm{\beta}^{k},\{\bm{\beta}_{m}^{k}\}_{m=1}^{M},\bm{b}^{k},\{\bm{d}_{m}^{k}\}_{m=1}^{M},\{\bm{e}_{m}^{k}\}_{m=1}^{M},\bm{f}^{k}\right) satisfies the contraction inequality

\|\tilde{\bm{v}}^{k+1}-\tilde{\bm{v}}^{*}\|_{\bm{H}}^{2}\leq\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{*}\|_{\bm{H}}^{2}-\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1}\|_{\bm{H}}^{2},  (A.3)

where \tilde{\bm{v}}^{*}=\left(\bm{\beta}^{*},\{\bm{\beta}^{*}\}_{m=1}^{M},\bm{b}^{*},\{\bm{d}_{m}^{*}\}_{m=1}^{M},\{\bm{e}_{m}^{*}\}_{m=1}^{M},\bm{f}^{*}\right) and \bm{H}=\begin{bmatrix}\bm{S}_{\bm{G}}&\bm{0}&\bm{0}\\ \bm{0}&\mu\bm{B}^{\top}\bm{B}&\bm{0}\\ \bm{0}&\bm{0}&\frac{1}{\mu}\bm{I}_{n_{\bm{G}}-1+(M+1)p}\end{bmatrix} is a symmetric and positive definite matrix, with n_{\bm{G}}=n if \bm{G}=\bm{I}_{p} and n_{\bm{G}}=n-1 otherwise.

Proof. For convenience, write \tilde{\bm{r}}_{M}=(\bm{r}_{1}^{\top},\bm{r}_{2}^{\top},\dots,\bm{r}_{M}^{\top})^{\top}, \tilde{\bm{\beta}}_{M}=(\bm{\beta}_{1}^{\top},\bm{\beta}_{2}^{\top},\dots,\bm{\beta}_{M}^{\top})^{\top}, \bm{v}_{1,2}=(\bm{v}_{1}^{\top},\bm{v}_{2}^{\top})^{\top}, \bm{\theta}(\bm{v}_{1,2})=\bm{\theta}_{1}(\bm{v}_{1})+\bm{\theta}_{2}(\bm{v}_{2}), and thus \tilde{\bm{w}}=(\bm{v}_{1,2}^{\top},\bm{z}^{\top})^{\top}. Because of the orthogonality among the blocks of \bm{A} and among the blocks of \bm{B}, the ADMM subproblems for (A.1) can be reformulated as

\left\{\begin{array}{l}
{\bm{\beta}}^{k+1}=\mathop{\arg\min}\limits_{{\bm{\beta}}}\left\{\lambda_{1}\|\bm{\beta}\|_{1}+I_{\mathcal{C}}(\bm{\beta})+{\mu\over 2}\|\bm{A}_{1}{\bm{\beta}}+\sum_{m=1}^{M}{\bm{B}_{m}}\bm{\beta}_{m}^{k}+\bm{B}_{M+1}\bm{b}^{k}-\bm{c}-\bm{z}^{k}/\mu\|_{2}^{2}+\frac{1}{2}\|\bm{\beta}-\bm{\beta}^{k}\|_{\bm{S}_{\bm{G}}}^{2}\right\},\\
{\bm{r}_{1}}^{k+1}=\mathop{\arg\min}\limits_{{\bm{r}_{1}}}\left\{\mathcal{L}(\bm{r}_{1})+{\mu\over 2}\|\bm{A}_{2}\bm{r}_{1}+\bm{B}_{1}\bm{\beta}_{1}^{k}-\bm{c}-\bm{z}^{k}/\mu\|_{2}^{2}\right\},\\
{\bm{r}_{2}}^{k+1}=\mathop{\arg\min}\limits_{{\bm{r}_{2}}}\left\{\mathcal{L}(\bm{r}_{2})+{\mu\over 2}\|\bm{A}_{3}\bm{r}_{2}+\bm{B}_{2}\bm{\beta}_{2}^{k}-\bm{c}-\bm{z}^{k}/\mu\|_{2}^{2}\right\},\\
\qquad\qquad\qquad\qquad\vdots\\
{\bm{r}_{M}}^{k+1}=\mathop{\arg\min}\limits_{{\bm{r}_{M}}}\left\{\mathcal{L}(\bm{r}_{M})+{\mu\over 2}\|\bm{A}_{M+1}\bm{r}_{M}+\bm{B}_{M}\bm{\beta}_{M}^{k}-\bm{c}-\bm{z}^{k}/\mu\|_{2}^{2}\right\},\\
\bm{v}_{2}^{k+1}=\mathop{\arg\min}\limits_{\bm{v}_{2}}\left\{\bm{\theta}_{2}(\bm{v}_{2})+{\mu\over 2}\|\bm{A}{\bm{v}_{1}}^{k+1}+{\bm{B}}\bm{v}_{2}-\bm{c}-\bm{z}^{k}/\mu\|_{2}^{2}\right\},\\
\bm{z}^{k+1}=\bm{z}^{k}-\mu(\bm{A}\bm{v}_{1}^{k+1}+{\bm{B}}\bm{v}_{2}^{k+1}-\bm{c}).
\end{array}\right.  (A.4)
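The recursion (A.4) is a standard two-block sweep: the \bm{v}_{1}-block (the \bm{\beta}-subproblem and the M subproblems for \bm{r}_{m}, which can be solved in parallel across machines) is updated first, then the \bm{v}_{2}-block, and finally the dual variable \bm{z}. The R skeleton below only mirrors this update order; the subproblem solvers are passed in as placeholder callbacks and are not the closed-form updates used by CPADMM.

    # Schematic two-block ADMM sweep mirroring (A.4); solver callbacks are placeholders
    consensus_admm_sketch <- function(update_beta, update_r, update_v2,
                                      A, B, cvec, state, mu = 1, max_iter = 500, tol = 1e-6) {
      for (k in seq_len(max_iter)) {
        old_v2 <- c(unlist(state$beta_m), state$b)
        # v1-block: beta-subproblem, then the M residual subproblems (parallelizable across machines)
        state$beta <- update_beta(state, mu)
        state$r    <- lapply(seq_along(state$r), function(m) update_r(m, state, mu))
        # v2-block: local copies beta_1, ..., beta_M and the group variable b
        state      <- update_v2(state, mu)
        # dual update, as in the last line of (A.4)
        v1 <- c(state$beta, unlist(state$r))
        v2 <- c(unlist(state$beta_m), state$b)
        state$z <- state$z - mu * drop(A %*% v1 + B %*% v2 - cvec)
        # stop once successive iterates stabilize (cf. Corollary A.1)
        if (sqrt(sum((v2 - old_v2)^2)) < tol) break
      }
      state
    }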

For the \bm{\beta}-subproblem of (A.4), it follows from the convexity of \lambda_{1}\|\bm{\beta}\|_{1} and I_{\mathcal{C}}(\bm{\beta}) that

\left(\lambda_{1}\|\bm{\beta}\|_{1}+I_{\mathcal{C}}(\bm{\beta})\right)-\left(\lambda_{1}\|\bm{\beta}^{k+1}\|_{1}+I_{\mathcal{C}}(\bm{\beta}^{k+1})\right)+(\bm{\beta}-\bm{\beta}^{k+1})^{\top}\Big[\mu\bm{A}_{1}^{\top}\Big(\bm{A}_{1}{\bm{\beta}}^{k+1}+\sum_{m=1}^{M}{\bm{B}_{m}}\bm{\beta}_{m}^{k}+\bm{B}_{M+1}\bm{b}^{k}-\bm{c}-\bm{z}^{k}/\mu\Big)+\bm{S}_{\bm{G}}(\bm{\beta}^{k+1}-\bm{\beta}^{k})\Big]\geq 0.  (A.5)

For the \bm{r}_{m}-subproblem (m=1,2,\dots,M) of (A.4), it follows from the convexity of \mathcal{L}(\bm{r}_{m}) that

\mathcal{L}(\bm{r}_{m})-\mathcal{L}(\bm{r}_{m}^{k+1})+(\bm{r}_{m}-\bm{r}_{m}^{k+1})^{\top}\left[\mu\bm{A}_{m+1}^{\top}\left(\bm{A}_{m+1}\bm{r}_{m}^{k+1}+\bm{B}_{m}\bm{\beta}_{m}^{k}-\bm{c}-\bm{z}^{k}/\mu\right)\right]\geq 0.  (A.6)

Combining the inequalities (A.5) and (A.6) and performing some algebraic manipulations, we have

\bm{\theta}_{1}({\bm{v}_{1}})-\bm{\theta}_{1}({\bm{v}_{1}}^{k+1})+({\bm{v}_{1}}-{\bm{v}_{1}}^{k+1})^{\top}\left[\mu\bm{A}^{\top}(\bm{A}{\bm{v}_{1}}^{k+1}+{\bm{B}}\bm{v}_{2}^{k}-\bm{c}-\bm{z}^{k}/\mu)\right]+(\bm{\beta}-\bm{\beta}^{k+1})^{\top}\bm{S}_{\bm{G}}(\bm{\beta}^{k+1}-\bm{\beta}^{k})\geq 0.  (A.7)

The key distinction between the proofs in He and Yuan (2012, 2015) and the present setting lies in the linearization term \frac{1}{2}\|\bm{\beta}-\bm{\beta}^{k}\|_{\bm{S}_{\bm{G}}}^{2}. Treating this term separately, we readily obtain (A.7). Similarly, for the \bm{v}_{2}-subproblem of (A.4), the convexity of \bm{\theta}_{2}(\bm{v}_{2}) yields

\bm{\theta}_{2}(\bm{v}_{2})-\bm{\theta}_{2}(\bm{v}_{2}^{k+1})+(\bm{v}_{2}-\bm{v}_{2}^{k+1})^{\top}\left[\mu{\bm{B}}^{\top}(\bm{A}{\bm{v}_{1}}^{k+1}+{\bm{B}}\bm{v}_{2}^{k+1}-\bm{c}-\bm{z}^{k}/\mu)\right]\geq 0.  (A.8)

For the \bm{z}-update of (A.4), we have

(\bm{z}-\bm{z}^{k+1})^{\top}\left[(\bm{z}^{k+1}-\bm{z}^{k})/\mu+(\bm{A}{\bm{v}_{1}}^{k+1}+{\bm{B}}\bm{v}_{2}^{k+1}-\bm{c})\right]\geq 0.  (A.9)

Summing the above three inequalities (A.7), (A.8) and (A.9) together, we obtain

\begin{split}\bm{\theta}(\bm{v}_{1,2})&-\bm{\theta}(\bm{v}_{1,2}^{k+1})+(\tilde{\bm{w}}-\tilde{\bm{w}}^{k+1})^{\top}F(\tilde{\bm{w}}^{k+1})+({\bm{v}_{1}}-{\bm{v}_{1}}^{k+1})^{\top}\mu\bm{A}^{\top}{\bm{B}}(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1})\\ &+(\bm{\beta}-\bm{\beta}^{k+1})^{\top}\bm{S}_{\bm{G}}(\bm{\beta}^{k+1}-\bm{\beta}^{k})+{1\over\mu}(\bm{z}-\bm{z}^{k+1})^{\top}(\bm{z}^{k+1}-\bm{z}^{k})\geq 0,\end{split}  (A.10)

where F(\tilde{\bm{w}}^{k+1})=\left[\begin{array}{c}-\bm{A}^{\top}\bm{z}^{k+1}\\ -{\bm{B}}^{\top}\bm{z}^{k+1}\\ \bm{A}{\bm{v}_{1}}^{k+1}+{\bm{B}}\bm{v}_{2}^{k+1}-\bm{c}\end{array}\right].

Adding (\bm{v}_{2}-\bm{v}_{2}^{k+1})^{\top}\mu{\bm{B}}^{\top}{\bm{B}}(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1}) to both sides of inequality (A.10) gives

\begin{split}\bm{\theta}(\bm{v}_{1,2})-\bm{\theta}(\bm{v}_{1,2}^{k+1})+(\tilde{\bm{w}}-\tilde{\bm{w}}^{k+1})^{\top}F(\tilde{\bm{w}}^{k+1})+&\mu\left[\begin{array}{c}\bm{v}_{1}-\bm{v}_{1}^{k+1}\\ \bm{v}_{2}-\bm{v}_{2}^{k+1}\end{array}\right]^{\top}\left[\begin{array}{c}\bm{A}^{\top}\\ {\bm{B}}^{\top}\end{array}\right]{\bm{B}}(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1})\\ &\geq(\tilde{\bm{v}}-\tilde{\bm{v}}^{k+1})^{\top}\bm{H}(\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1}),\end{split}  (A.11)

where \tilde{\bm{v}}=\left(\bm{\beta}^{\top},\bm{v}_{2}^{\top},\bm{z}^{\top}\right)^{\top} and \bm{H}=\begin{bmatrix}\bm{S}_{\bm{G}}&\bm{0}&\bm{0}\\ \bm{0}&\mu\bm{B}^{\top}\bm{B}&\bm{0}\\ \bm{0}&\bm{0}&\frac{1}{\mu}\bm{I}_{n_{\bm{G}}-1+(M+1)p}\end{bmatrix}.

From the variational inequality characterization of the optimal solution in Section 2 of He and Yuan (2012), and noting that (\tilde{\bm{w}}^{*}-\tilde{\bm{w}}^{k+1})^{\top}F(\tilde{\bm{w}}^{k+1})=(\tilde{\bm{w}}^{*}-\tilde{\bm{w}}^{k+1})^{\top}F(\tilde{\bm{w}}^{*}) because the linear part of F is skew-symmetric, we observe that

\bm{\theta}(\bm{v}_{1,2}^{*})-\bm{\theta}(\bm{v}_{1,2}^{k+1})+(\tilde{\bm{w}}^{*}-\tilde{\bm{w}}^{k+1})^{\top}F(\tilde{\bm{w}}^{k+1})=\bm{\theta}(\bm{v}_{1,2}^{*})-\bm{\theta}(\bm{v}_{1,2}^{k+1})+(\tilde{\bm{w}}^{*}-\tilde{\bm{w}}^{k+1})^{\top}F(\tilde{\bm{w}}^{*})\leq 0.  (A.12)

Together with \bm{A}\bm{v}_{1}^{*}+\bm{B}\bm{v}_{2}^{*}=\bm{c} and the dual update in (A.4), simple algebraic operations give

\mu\left[\begin{array}{c}\bm{v}_{1}^{*}-\bm{v}_{1}^{k+1}\\ \bm{v}_{2}^{*}-\bm{v}_{2}^{k+1}\end{array}\right]^{\top}\left[\begin{array}{c}\bm{A}^{\top}\\ {\bm{B}}^{\top}\end{array}\right]{\bm{B}}(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1})=(\bm{z}^{k+1}-\bm{z}^{k})^{\top}{\bm{B}}(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1}).  (A.13)

Setting \bm{v}_{1}=\bm{v}_{1}^{*}, \bm{v}_{2}=\bm{v}_{2}^{*}, \bm{v}_{1,2}=\bm{v}_{1,2}^{*}, \tilde{\bm{v}}=\tilde{\bm{v}}^{*} and \tilde{\bm{w}}=\tilde{\bm{w}}^{*} in (A.11), and substituting (A.12) and (A.13), we get

(\tilde{\bm{v}}^{k+1}-\tilde{\bm{v}}^{*})^{\top}\bm{H}(\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1})\geq(\bm{z}^{k}-\bm{z}^{k+1})^{\top}{\bm{B}}(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1}).  (A.14)

Taking \bm{v}_{2}=\bm{v}_{2}^{k} in (A.8), and taking \bm{v}_{2}=\bm{v}_{2}^{k+1} in the analogue of (A.8) at the previous iteration, we obtain the following two inequalities

\bm{\theta}_{2}(\bm{v}_{2}^{k})-\bm{\theta}_{2}(\bm{v}_{2}^{k+1})+(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1})^{\top}(-\bm{B}^{\top}\bm{z}^{k+1})\geq 0,  (A.15)
\bm{\theta}_{2}(\bm{v}_{2}^{k+1})-\bm{\theta}_{2}(\bm{v}_{2}^{k})+(\bm{v}_{2}^{k+1}-\bm{v}_{2}^{k})^{\top}(-\bm{B}^{\top}\bm{z}^{k})\geq 0.  (A.16)

Adding these two inequalities yields

(\bm{z}^{k}-\bm{z}^{k+1})^{\top}\bm{B}(\bm{v}_{2}^{k}-\bm{v}_{2}^{k+1})\geq 0.  (A.17)

Combining this with (A.14), we obtain

(\tilde{\bm{v}}^{k+1}-\tilde{\bm{v}}^{*})^{\top}\bm{H}(\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1})\geq 0.  (A.18)

It is clear that \bm{H} is a symmetric and positive definite matrix. For any \bm{x},\bm{y}, the identity \|\bm{x}\|_{\bm{H}}^{2}=\|\bm{y}\|_{\bm{H}}^{2}+2\bm{y}^{\top}\bm{H}(\bm{x}-\bm{y})+\|\bm{x}-\bm{y}\|_{\bm{H}}^{2} implies that if \bm{y}^{\top}\bm{H}(\bm{x}-\bm{y})\geq 0, then

\|\bm{y}\|_{\bm{H}}^{2}\leq\|\bm{x}\|_{\bm{H}}^{2}-\|\bm{x}-\bm{y}\|_{\bm{H}}^{2}.  (A.19)

Applying (A.19) with \bm{x}=\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{*} and \bm{y}=\tilde{\bm{v}}^{k+1}-\tilde{\bm{v}}^{*}, and using (A.18), we therefore have

\|\tilde{\bm{v}}^{k+1}-\tilde{\bm{v}}^{*}\|_{\bm{H}}^{2}\leq\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{*}\|_{\bm{H}}^{2}-\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1}\|_{\bm{H}}^{2}.

This completes the proof of Proposition 1.
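As a quick numerical sanity check of the quadratic identity behind (A.19) (purely illustrative, not part of the paper's code), one can verify \|\bm{x}\|_{\bm{H}}^{2}=\|\bm{y}\|_{\bm{H}}^{2}+2\bm{y}^{\top}\bm{H}(\bm{x}-\bm{y})+\|\bm{x}-\bm{y}\|_{\bm{H}}^{2} for a random symmetric positive definite matrix in R:

    # Numerical sanity check of the quadratic identity behind (A.19) (illustrative only)
    set.seed(42)
    d <- 5
    Hm <- crossprod(matrix(rnorm(d * d), d, d)) + diag(d)  # random symmetric positive definite matrix
    x <- rnorm(d); y <- rnorm(d)
    norm_H2 <- function(v) drop(t(v) %*% Hm %*% v)
    lhs <- norm_H2(x)
    rhs <- norm_H2(y) + 2 * drop(t(y) %*% Hm %*% (x - y)) + norm_H2(x - y)
    all.equal(lhs, rhs)  # TRUE
    # Hence, whenever t(y) %*% Hm %*% (x - y) >= 0, we get norm_H2(y) <= norm_H2(x) - norm_H2(x - y).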

The contraction inequality (A.3) plays a crucial role in the convergence analysis of the ADMM algorithm; Proposition 1 is the counterpart of Lemma 3 in Li et al. (2014). Based on (A.3), the assertions summarized in the following corollary follow immediately.

Corollary A.1

Let \{\tilde{\bm{v}}^{k}\} be the sequence generated by Algorithm 1. Then we have

  1. \mathop{\lim}\limits_{k\to\infty}\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{k+1}\|_{\bm{H}}=0;

  2. the sequence \{\tilde{\bm{v}}^{k}\} is bounded;

  3. for any optimal solution \tilde{\bm{v}}^{*}, the sequence \{\|\tilde{\bm{v}}^{k}-\tilde{\bm{v}}^{*}\|_{\bm{H}}\} is monotonically non-increasing.

By utilizing Corollary A.1, and arguing as in the proof of Theorem 1 in Li et al. (2014), we can establish the convergence of \tilde{\bm{v}}^{k} to \tilde{\bm{v}}^{*}. In addition, the O(1/k) convergence rate in the non-ergodic sense follows directly from Theorem 5.1 in He and Yuan (2015). Consequently, Theorem 1 holds.
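In practice, Corollary A.1(1) suggests monitoring the \bm{H}-weighted distance between successive iterates as a stopping quantity. A minimal R sketch, assuming the iterates are stored in lists with fields beta, beta_m, b and z (our naming, not that of the CPADMM package), is:

    # Sketch of the H-weighted distance between successive iterates (Corollary A.1(1))
    h_norm_diff <- function(it_old, it_new, S_G, B, mu) {
      d_beta <- it_new$beta - it_old$beta
      d_v2   <- c(unlist(it_new$beta_m), it_new$b) - c(unlist(it_old$beta_m), it_old$b)
      d_z    <- it_new$z - it_old$z
      sqrt(drop(t(d_beta) %*% S_G %*% d_beta) +   # S_G block (zero matrix when G = I_p)
           mu * sum((B %*% d_v2)^2) +             # mu * t(d_v2) %*% t(B) %*% B %*% d_v2
           sum(d_z^2) / mu)                       # (1/mu) * ||d_z||^2
    }
    # Iterations can be stopped once h_norm_diff(...) falls below a user-chosen tolerance.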

Appendix B Supplementary experiments on nonconvex combined regularizations

In this appendix, we present simulation results of CPADMM for solving nonconvex combined regularization regressions. Table 8 reports the performance of the nonconvex extensions (mnet, snet) of the elastic net on the parallel dataset in Section 6.1.1.

Table 8: The results of mnet and snet solved by CPADMM.

  M      Method   AAE (10^{-2})   ASE (10^{-2})   AAP            ASP            Time
  M=1    mnet     0.256(0.009)    0.795(0.011)    0.136(0.005)   0.166(0.006)   206.7(24.3)
         snet     0.241(0.008)    0.787(0.010)    0.143(0.005)   0.171(0.007)   203.8(25.1)
  M=10   mnet     0.289(0.010)    0.813(0.012)    0.152(0.006)   0.184(0.007)   37.9(4.3)
         snet     0.290(0.011)    0.824(0.011)    0.148(0.006)   0.181(0.005)   36.5(4.1)
  M=100  mnet     0.321(0.012)    0.923(0.018)    0.159(0.007)   0.199(0.008)   15.6(1.8)
         snet     0.337(0.013)    0.941(0.019)    0.163(0.006)   0.203(0.008)   14.7(1.7)

  Note: AAE: average absolute estimation error; ASE: average square estimation error; AAP: average absolute prediction error; ASP: average square prediction error. Numbers in parentheses are the corresponding standard deviations. The optimal solution is represented in bold.
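For completeness, the four summary metrics reported in Tables 8–10 can be computed as below. The exact definitions (mean absolute and squared errors of the estimated coefficients and of the test-set predictions) are our reading of the table notes, and the function and argument names are hypothetical, not those of the CPADMM package.

    # Hypothetical helpers for the metrics in Tables 8-10 (definitions assumed, see text above)
    estimation_errors <- function(beta_hat, beta_true) {
      c(AAE = mean(abs(beta_hat - beta_true)),  # average absolute estimation error
        ASE = mean((beta_hat - beta_true)^2))   # average square estimation error
    }
    prediction_errors <- function(beta_hat, X_test, y_test) {
      resid <- y_test - drop(X_test %*% beta_hat)
      c(AAP = mean(abs(resid)),                 # average absolute prediction error
        ASP = mean(resid^2))                    # average square prediction error
    }
    # Averaging these quantities over replications gives the "value(standard deviation)" entries.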

Table 9 presents the performance of the non-convex extensions (mctv, sctv) of the sparse fused lasso on the parallel dataset in Section 6.1.2.

Table 9: The results of mctv and sctv solved by CPADMM.

  M      Method   AAE (10^{-2})   ASE (10^{-2})   AAP            ASP            Time
  M=1    mctv     0.006(0.001)    0.015(0.003)    0.432(0.011)   0.518(0.015)   197.8(22.4)
         sctv     0.007(0.001)    0.014(0.002)    0.451(0.010)   0.493(0.014)   201.1(21.5)
  M=10   mctv     0.009(0.002)    0.018(0.003)    0.478(0.012)   0.553(0.016)   30.2(3.9)
         sctv     0.009(0.002)    0.017(0.003)    0.487(0.012)   0.523(0.015)   32.7(4.1)
  M=100  mctv     0.011(0.003)    0.021(0.003)    0.538(0.017)   0.682(0.021)   12.3(1.5)
         sctv     0.012(0.003)    0.023(0.003)    0.557(0.018)   0.659(0.019)   11.9(1.6)

  Note: AAE: average absolute estimation error; ASE: average square estimation error; AAP: average absolute prediction error; ASP: average square prediction error. Numbers in parentheses are the corresponding standard deviations. The optimal solution is represented in bold.

Table 10 presents the performance of the nonconvex extensions (mcgl, scgl) of the sparse group lasso on the parallel dataset in Section 6.1.3.

Table 10: The results of mcgl and scgl solved by CPADMM.

  M      Method   AAE (10^{-2})   ASE (10^{-2})   AAP            ASP            Time
  M=1    mcgl     0.012(0.002)    0.029(0.005)    0.236(0.008)   0.259(0.010)   210.6(23.5)
         scgl     0.015(0.003)    0.027(0.004)    0.286(0.007)   0.276(0.011)   201.3(22.9)
  M=10   mcgl     0.018(0.004)    0.035(0.006)    0.297(0.008)   0.283(0.012)   41.3(4.2)
         scgl     0.021(0.005)    0.037(0.005)    0.310(0.009)   0.291(0.012)   38.4(3.8)
  M=100  mcgl     0.027(0.006)    0.047(0.008)    0.356(0.008)   0.354(0.013)   15.8(1.9)
         scgl     0.031(0.007)    0.052(0.007)    0.379(0.009)   0.363(0.014)   14.3(1.8)

  Note: AAE: average absolute estimation error; ASE: average square estimation error; AAP: average absolute prediction error; ASP: average square prediction error. Numbers in parentheses are the corresponding standard deviations. The optimal solution is represented in bold.