
Can Wu: School of Mathematics and Statistics, Hainan University, Haikou, 570228, China. This work was conducted during her Postdoctoral Fellowship at the Department of Applied Mathematics, The Hong Kong Polytechnic University. Email: [email protected]
Dong-Hui Li: School of Mathematical Sciences, South China Normal University, Guangzhou, 510631, China. Email: [email protected]
Defeng Sun: Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. Email: [email protected]

Support matrix machine: exploring sample sparsity, low rank, and adaptive sieving in high-performance computing

Can Wu    Dong-Hui Li    Defeng Sun
(Received: date / Accepted: date)
Abstract

Support matrix machine (SMM) is a successful supervised classification model for matrix-type samples. Unlike support vector machines, it employs low-rank regularization on the regression matrix to effectively capture the intrinsic structure embedded in each input matrix. When solving a large-scale SMM, a major challenge arises from the potential increase in sample size, leading to substantial computational and storage burdens. To address these issues, we design a semismooth Newton-CG (SNCG) based augmented Lagrangian method (ALM) for solving the SMM. The ALM exhibits an asymptotic R-superlinear convergence if a strict complementarity condition is satisfied. The SNCG method is employed to solve the ALM subproblems, achieving at least a superlinear convergence rate under the nonemptiness of an index set. Furthermore, the sparsity of samples and the low-rank nature of solutions enable us to reduce the computational cost and storage demands for the Newton linear systems. Additionally, we develop an adaptive sieving strategy that generates a solution path for the SMM by exploiting sample sparsity. The finite convergence of this strategy is also demonstrated. Numerical experiments on both large-scale real and synthetic datasets validate the effectiveness of the proposed methods.

Keywords:
Support matrix machine · Sample sparsity · Low rank · Adaptive sieving · Augmented Lagrangian method · Semismooth Newton method
MSC:
90C06 · 90C25 · 90C90

1 Introduction

Numerous well-known classification methods, such as linear discriminant analysis, logistic regression, support vector machines (SVMs), and AdaBoost, were initially designed for vector or scalar inputs hastie2009elements . On the other hand, matrix-structured data, like digital images with pixel grids yang2004two and EEG signals with multi-channel voltage readings over time sanei2013eeg , are also prevalent in practical applications. Classifying matrix data often involves flattening it into a vector, but this causes several issues qian2019robust : a) the resulting high-dimensional vectors aggravate the curse of dimensionality, b) the matrix structure and the correlations among its entries are lost, and c) structurally different data call for different regularization methods. To circumvent these challenges, the support matrix machine (SMM) was introduced by Luo, Xie, Zhang, and Li luo2015support . Given a set of training samples \{X_{i},y_{i}\}^{n}_{i=1}, where X_{i}\in\mathbb{R}^{p\times q} represents the ith feature matrix and y_{i}\in\{-1,+1\} is its corresponding class label, the optimization problem of the SMM is formulated as

minimize(W,b)p×q×12W2+τW+Ci=1nmax{1yi[tr(WXi)+b],0},\mathop{\mbox{minimize}}\limits_{(W,b)\in\mathbb{R}^{p\times q}\times\mathbb{R}}\ \frac{1}{2}\|W\|^{2}_{{\mathcal{F}}}+\tau\|W\|_{*}+C\sum^{n}_{i=1}\max\left\{1-y_{i}\left[\mbox{tr}(W^{\top}X_{i})+b\right],0\right\}, (1)

where W\in\mathbb{R}^{p\times q} is the matrix of regression coefficients, b\in\mathbb{R} is an offset term, and C and \tau are positive regularization parameters. Here, \|\cdot\|_{{\mathcal{F}}} and \|\cdot\|_{*} denote the Frobenius norm and nuclear norm of a matrix, respectively. The objective function in (1) combines two key components: a) the spectral elastic net penalty (1/2)\|\cdot\|^{2}_{\mathcal{F}}+\tau\|\cdot\|_{*}, enjoying a grouping effect (at a solution pair (\overline{W},\overline{b}) of the model (1), the columns of the regression matrix \overline{W} exhibit a grouping effect when the associated features display strong correlation luo2015support ) and low-rank structures; b) the hinge loss, ensuring sparsity and robustness. This model has been incorporated as a key component in deep stacked networks hang2020deep ; liang2022deep and semi-supervised learning frameworks li2023intelligent . Additionally, the SMM model (1) and its variants are predominantly applied in fault diagnosis (e.g., pan2019fault ; li2020non ; li2022highly ; li2022fusion ; pan2022pinball ; pan2023deep ; geng2023fault ; li2024dynamics ; xu2024intelligent ; pan2024semi ; li2024intelligent ) and EEG classification (e.g., zheng2018robust ; zheng2018multiclass ; razzak2019multiclass ; hang2020deep ; chen2020novel ; liang2022deep ; hang2023deep ; liang2024adaptive ). For a comprehensive overview of SMM applications and extensions, see the recent survey kumari2024support .
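To make the three terms of the objective concrete, the following minimal sketch (in Python with NumPy) evaluates the objective in (1); it assumes the samples are stored as a list of p\times q arrays X with labels y in \{-1,+1\} and given parameters C and \tau, and it is meant only as an illustration, not as part of the proposed solver.

```python
import numpy as np

def smm_objective(W, b, X, y, C, tau):
    """(1/2)||W||_F^2 + tau*||W||_* + C * sum_i max(1 - y_i(<W, X_i> + b), 0)."""
    fro = 0.5 * np.sum(W * W)                              # squared Frobenius norm term
    nuc = tau * np.linalg.norm(W, ord='nuc')               # nuclear norm (sum of singular values)
    margins = np.array([yi * (np.sum(W * Xi) + b) for Xi, yi in zip(X, y)])
    hinge = C * np.sum(np.maximum(1.0 - margins, 0.0))     # hinge loss over all n samples
    return fro + nuc + hinge
```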

The two-block alternating direction methods of multipliers (ADMMs) are the most widely adopted methods for solving the convex SMM model (1) and its extensions, including least squares SMMs li2020non ; li2022highly ; liang2024adaptive ; hang2023deep , multi-class SMMs zheng2018multiclass ; pan2019fault , weighted SMMs li2022fusion ; li2024intelligent ; pan2024semi , transfer SMMs chen2020novel ; pan2022pinball , and pinball SMMs feng2022support ; pan2022pinball ; pan2023deep ; geng2023fault . Compared to classical SVMs, the primary challenge in solving a large-scale SMM model (1) and the above extensions arises from the additional nuclear norm term. Specifically, their dual problems are no longer quadratic programs (QPs) due to the presence of the spectral-norm ball constraint, which precludes the use of off-the-shelf QP solvers like LIBQP (the codes are available at https://cmp.felk.cvut.cz/~xfrancv/pages/libqp.html). To overcome this drawback, Luo et al. luo2015support suggested utilizing the two-block fast ADMM with restart (F-ADMM), proposed in goldstein2014fast . Indeed, it is reasonable to choose two-block ADMM algorithms over three-block ADMM chen2017efficient because the former consistently requires fewer iterations, resulting in lower singular value decomposition costs from the soft thresholding operator. However, F-ADMM solves the dual subproblems by calling the LIBQP solver, which requires producing at least one n\times n matrix. This can cause a memory burden as the sample size n increases significantly. As a remedy, Duan et al. duan2017quantum introduced a quantum algorithm that employs the quantum matrix inversion algorithm harrow2009quantum and a quantum singular value thresholding algorithm to update the subproblems' variables in the F-ADMM framework, respectively. Nevertheless, no numerical experiments were reported in duan2017quantum , leaving the practical effectiveness of the proposed algorithm undetermined. Efficiently solving SMM models with large-scale samples remains a significant challenge.

The first purpose of this paper is to devise an efficient and robust algorithm for solving the SMM model (1) when n is significantly large (e.g., several hundred thousand or up to a million) and pq is large (e.g., tens of thousands). By leveraging the sample sparsity and the low-rank property of the solution to model (1), we develop an efficient and robust semismooth Newton-CG based augmented Lagrangian method. Notably, these two properties depend on the values of the parameters C and \tau. Here, sample sparsity indicates that the solution of (1) relies solely on a subset of samples (feature matrices) known as active samples (support matrices), while disregarding the remaining portion referred to as non-active samples (non-support matrices). At each iteration, a conjugate gradient (CG) algorithm is employed to solve the Newton linear system. By making full use of the sample sparsity and the solution's low-rank property in (1), the primary computational cost in each iteration of the CG method can be reduced to O(pq\max\{|\mathcal{J}_{1}|,|\alpha|\}), instead of O(npq). Here, the index set \mathcal{J}_{1} ultimately corresponds to the active support matrices, and its cardinality |\mathcal{J}_{1}| is typically much smaller than the total sample size n. The cardinality of the index set \alpha eventually corresponds to the rank of the solution matrix W in the model (1). It is worth mentioning that the proposed computational framework can be readily extended to convex variants of the SMM model (1), such as multi-class SMMs zheng2018multiclass ; pan2019fault , weighted SMMs li2022fusion ; li2024intelligent ; pan2024semi , transfer SMMs chen2020novel ; pan2022pinball , and pinball SMMs feng2022support ; pan2022pinball ; pan2023deep ; geng2023fault .

Our second goal in this paper is to develop a strategy that efficiently guesses and adjusts irrelevant samples before starting the aforementioned optimization process. This method is particularly useful for generating a solution path for model (1). Recently, Ghaoui et al. el2012safe introduced a safe feature elimination method for \ell_{1}-regularized convex problems. It inspired researchers to extend these ideas to conventional SVMs and develop safe sample screening rules ogawa2013safe ; wang2014scaling ; zimmert2015safe ; pan2019novel . Those rules aim to first bound the primal SVM solution within a valid region and then exclude non-active samples by relaxing the Karush-Kuhn-Tucker conditions, which reduces the problem size and saves computational costs and storage. However, existing safe screening rules are typically problem-specific and cannot be directly applied to the SMM model (1). The major challenges are twofold: a) the objective function in (1) is not strongly convex due to the intercept term b zimmert2015safe ; b) the nuclear norm term complicates the removal of non-support matrices using first-order optimality conditions for the dual SVM model wang2014scaling . Recently, Yuan, Lin, Sun, and Toh yuan2023adaptive ; lin2020adaptive introduced a highly efficient and flexible adaptive sieving (AS) strategy for convex composite sparse machine learning models. By exploiting the inherent sparsity of the solutions, it significantly reduces the computational load through solving a finite number of reduced subproblems on a smaller scale yuan2022dimension ; li2023mars ; wu2023convex . The effectiveness of this AS strategy crucially hinges on solution sparsity, which dictates the size of these reduced subproblems. Unlike the sparsity that emerges in the solutions, the SMM model (1) inherently exhibits sample sparsity. As a result, the application of the AS strategy to (1) is not straightforward. We will extend the idea of the AS strategy in yuan2023adaptive ; lin2020adaptive to generate a solution path for the SMM model (1).

Our main contributions in this paper are outlined as follows:

  • 1)

    To solve the SMM (1), we propose an augmented Lagrangian method (ALM), which guarantees asymptotic R-superlinear convergence of the KKT residuals under a mild strict complementarity condition, commonly satisfied in the classical soft margin SVM model. For solving each ALM subproblem, we use the semismooth Newton-CG method (SNCG), ensuring superlinear convergence when an index set is nonempty, with its cardinality ultimately corresponding to the number of active support matrices.

  • 2)

The main computational bottleneck in solving semismooth Newton linear systems with the conjugate gradient method is the execution of the generalized Jacobian linear transformation. By leveraging the sample sparsity and the solution's low-rank structure in model (1), we can reduce the cost of this transformation from O(npq) to O(pq\max\{|{\mathcal{J}}_{1}|,|\alpha|\}), where the cardinalities of {\mathcal{J}}_{1} and \alpha are typically much smaller than the sample size n. Numerical experiments demonstrate that ALM-SNCG achieves an average speedup of 422.7 times over F-ADMM on four real datasets, even for low-accuracy solutions of model (1).

  • 3)

To efficiently generate a solution path for the SMM models with a huge sample size, we employ an AS strategy. This strategy iteratively identifies and removes inactive samples to obtain reduced subproblems. Solving these subproblems inexactly ensures convergence to the desired solution path within a finite number of iterations (see Theorem 5.1). Numerical experiments reveal that, for generating high-accuracy solution paths of model (1), the AS strategy paired with ALM-SNCG is, on average, 2.53 times faster than the warm-started ALM-SNCG on a synthetic dataset with one million samples.

The rest of the paper is organized as follows. Section 2 discusses the sample sparsity and the solution’s low-rank structure of the SMM model (1). Building on these properties, Section 3 introduces the computational framework of a semismooth Newton-CG based augmented Lagrangian method for solving this model. Section 4 explores how these properties reduce computational costs in solving the Newton linear system. Section 5 introduces an adaptive sieving strategy for computing a solution path of the SMM models across a fine grid of CC’s. Finally, Section 6 presents extensive numerical experiments on synthetic and real data to demonstrate the effectiveness of the proposed methods.

Notation. Let m×n\mathbb{R}^{m\times n} be the space of all m×nm\times n real matrices and 𝕊n\mathbb{S}^{n} be the space of all n×nn\times n real symmetric matrices. The notation 0m×n0_{m\times n} stands for the zero matrix in m×n\mathbb{R}^{m\times n}, and Inn×nI_{n}\in\mathbb{R}^{n\times n} is the identity matrix. We use 𝒪n\mathcal{O}^{n} to denote the set of all n×nn\times n orthogonal matrices. For X,Ym×nX,Y\in\mathbb{R}^{m\times n}, their inner product is defined by X,Y=tr(XY)\langle X,Y\rangle=\mbox{tr}(X^{\top}Y), where “tr” represents the trace operation. The Frobenius norm of Xm×nX\in\mathbb{R}^{m\times n} is X=X,X\|X\|_{{\mathcal{F}}}=\sqrt{\langle X,X\rangle}. The nn-dimensional vector space n×1\mathbb{R}^{n\times 1} is abbreviated as n\mathbb{R}^{n}. The vector enne_{n}\in\mathbb{R}^{n} is the vector whose elements are all ones. We denote the index set [m]:={1,2,,m}[m]:=\{1,2,\ldots,m\}. For any index subsets α[m]\alpha\subseteq[m] and β[n]\beta\subseteq[n], and any Xm×nX\in\mathbb{R}^{m\times n}, let XαβX_{\alpha\beta} denote the |α|×|β||\alpha|\times|\beta| submatrix of XX obtained by removing all the rows of XX not in α\alpha and all the columns of XX not in β\beta. For any xnx\in\mathbb{R}^{n}, we write xαx_{\alpha} as the subvector of xx obtained by removing all the elements of xx not in α\alpha, and Diag(x)\mbox{Diag}(x) as the diagonal matrix with diagonal entries xix_{i}, i=1,,ni=1,\ldots,n. The notation 2\|\cdot\|_{2} represents the matrix spectral norm, defined as the largest singular value of the matrix. For any τ>0\tau>0, we define 𝔹2τ:={Xm×n|X2τ}\mathbb{B}^{\tau}_{2}:=\{X\in\mathbb{R}^{m\times n}\ |\ \|X\|_{2}\leq\tau\}. The metric projection of XX onto 𝔹2τ\mathbb{B}^{\tau}_{2} is denoted by Π𝔹2τ(X)\Pi_{\mathbb{B}^{\tau}_{2}}(X), with Π𝔹2τ(X)\partial\Pi_{\mathbb{B}^{\tau}_{2}}(X) representing its Clarke generalized Jacobian at XX. Similarly, for any closed convex set KnK\subset\mathbb{R}^{n}, ΠK(ω)\Pi_{K}(\omega) and ΠK(ω)\partial\Pi_{K}(\omega) denote the metric projection of ωn\omega\in\mathbb{R}^{n} onto KK and its Clarke generalized Jacobian at ω\omega, respectively.

2 The structural properties of the SMM model (1)

This section highlights the sample sparsity and low-rank properties at the Karush-Kuhn-Tucker (KKT) points of the model (1), providing the foundation for developing an efficient algorithm.

For notational simplicity, we first reformulate the SMM model (1). Denote

𝒳:=p×q××n×p×qand𝒴:=n×p×q.\mathcal{X}:=\mathbb{R}^{p\times q}\times\mathbb{R}\times\mathbb{R}^{n}\times\mathbb{R}^{p\times q}\quad\mbox{and}\quad\mathcal{Y}:=\mathbb{R}^{n}\times\mathbb{R}^{p\times q}.

Define the linear operator 𝒜:p×qn{\mathcal{A}}:\mathbb{R}^{p\times q}\rightarrow\mathbb{R}^{n} and its corresponding adjoint 𝒜:np×q{\mathcal{A}}^{*}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{p\times q} as follows:

𝒜W=(y1X1,W,,ynXn,W),𝒜z=k=1nzkykXk,(z,W)𝒴.{\mathcal{A}}W=\left(\langle y_{1}X_{1},W\rangle,\ldots,\langle y_{n}X_{n},W\rangle\right)^{\top},\ {\mathcal{A}}^{*}z=\sum^{n}_{k=1}z_{k}y_{k}X_{k},\ \forall\ (z,W)\in{\mathcal{Y}}. (2)
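For illustration, a minimal sketch of the operator {\mathcal{A}} and its adjoint {\mathcal{A}}^{*} in (2) is given below; it assumes (as a storage choice of ours, not part of the model) that the feature matrices are stacked in a NumPy array Xs of shape (n, p, q) with a length-n label vector y.

```python
import numpy as np

def A_op(W, Xs, y):
    """(A W)_i = <y_i X_i, W>, returning a length-n vector."""
    return y * np.einsum('ipq,pq->i', Xs, W)

def A_adj(z, Xs, y):
    """A* z = sum_k z_k y_k X_k, returning a p-by-q matrix."""
    return np.einsum('i,ipq->pq', z * y, Xs)
```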

Let y=(y_{1},y_{2},\ldots,y_{n})^{\top}, v=e_{n}-{\mathcal{A}}W-by, and S=[0,C]^{n}. The support function of S is given by \delta^{*}_{S}(v)=C\sum^{n}_{i=1}\max\{v_{i},0\} for any v\in\mathbb{R}^{n}. The SMM model (1) can be equivalently expressed as

minimize(W,b,v,U)𝒳12W2+τU+δS(v)subject to𝒜W+by+v=en,WU=0p×q.\begin{array}[]{cl}\mathop{\mbox{minimize}}\limits_{(W,b,v,U)\in\mathcal{X}}&\displaystyle\frac{1}{2}\|W\|^{2}_{{\mathcal{F}}}+\tau\|U\|_{*}+\delta^{*}_{S}(v)\\ \mbox{subject to}&{\mathcal{A}}W+by+v=e_{n},\\ {}\hfil&W-U=0_{p\times q}.\end{array} (P)

Its Lagrangian dual problem can be written as

maximize(λ,Λ)𝒴Φ(λ,Λ):=12𝒜λ+Λ2+λ,en+δS×𝔹2τ(λ,Λ)subject toyλ=0.\begin{array}[]{cl}\mathop{\mbox{maximize}}\limits_{(\lambda,\Lambda)\in\mathcal{Y}}&-\Phi(\lambda,\Lambda):=\displaystyle\frac{1}{2}\|{\mathcal{A}}^{*}\lambda+\Lambda\|^{2}_{{\mathcal{F}}}+\langle\lambda,e_{n}\rangle+\delta_{S\times\mathbb{B}^{\tau}_{2}}(-\lambda,\Lambda)\\ \mbox{subject to}&\qquad\qquad\quad\ \ y^{\top}\lambda=0.\\ \end{array} (DD)

It is easy to get the following KKT system of (P) and (DD):

{W+𝒜λ+Λ=0,λy=0, 0δS(v)+λ,0τUΛ,𝒜W+by+v=en,WU=0p×q.\left\{\begin{array}[]{l}W+{\mathcal{A}}^{*}\lambda+\Lambda=0,\ \lambda^{\top}y=0,\ 0\in\partial\delta^{*}_{S}(v)+\lambda,\\[5.69054pt] 0\in\partial\,\tau\|U\|_{*}-\Lambda,\ {\mathcal{A}}W+by+v=e_{n},\ W-U=0_{p\times q}.\end{array}\right. (3)

Without loss of generality, suppose that there exist indices i0i_{0}, j0[n]j_{0}\in[n] such that yi0=+1y_{i_{0}}=+1 and yj0=1y_{j_{0}}=-1 (otherwise, the classification task cannot be executed). Clearly, the solution set of the KKT system (3) is always nonempty. Indeed, the objective function in problem (1) is a real-valued, lower level-bounded convex function. It follows from (rockafellar2009variational, , Theorem 1.9) that the solution set of (1) is nonempty, convex, and compact, with a finite optimal value. Since the equivalent problem (P) of (1) contains linear constraints only, by (rockafellar1970convex, , Corollary 28.3.1), the KKT system (3) has at least one solution. It is known from (rockafellar1970convex, , Theorems 28.3, 30.4, and Corollary 30.5.1) that (W¯,b¯,v¯,U¯,λ¯,Λ¯)(\overline{W},\overline{b},\overline{v},\overline{U},\overline{\lambda},\overline{\Lambda}) is a solution to (3) if and only if (W¯,b¯,v¯,U¯)(\overline{W},\overline{b},\overline{v},\overline{U}) is an optimal solution to (P), and (λ¯,Λ¯)(\overline{\lambda},\overline{\Lambda}) is an optimal solution to (DD).

2.1 Characterizing sample sparsity

For the SMM model (P), a sufficiently large sample size nn introduces significant computational and storage challenges. However, the inherent sample sparsity offers a promising avenue for developing efficient algorithms.

Suppose that (W¯,b¯,v¯,U¯,λ¯,Λ¯)(\overline{W},\overline{b},\overline{v},\overline{U},\overline{\lambda},\overline{\Lambda}) is a solution of the KKT system (3). It follows that

W¯=𝒜λ¯Λ¯=j=1n(λ¯)jyjXjΛ¯.\overline{W}=-{\mathcal{A}}^{*}\overline{\lambda}-\overline{\Lambda}=\sum^{n}_{j=1}(-\overline{\lambda})_{j}y_{j}X_{j}-\overline{\Lambda}.

This implies that the value of W¯\overline{W} depends solely on the samples associated with the non-zero elements of λ¯-\overline{\lambda}, a property referred to as sample sparsity. In addition, by λ¯δS(v¯)-\overline{\lambda}\in\partial\delta^{*}_{S}(\overline{v}) in (3), we deduce that for each j[n]j\in[n],

(λ¯)j{η|η=Cifv¯j>0η[0,C]ifv¯j=0η=0otherwise}.(-\overline{\lambda})_{j}\in\left\{\eta\in\mathbb{R}\ \left|\ \begin{array}[]{ll}\eta=C&\mbox{if}\ \overline{v}_{j}>0\\ \eta\in[0,C]&\mbox{if}\ \overline{v}_{j}=0\\ \eta=0&\mbox{otherwise}\end{array}\right.\right\}. (4)

It implies that λ¯[0,C]n-\overline{\lambda}\in[0,C]^{n}. Analogous to the concept of support vectors cortes1995support , we introduce the definition of support matrices.

Definition 1 (Support matrix)

An example X_{j} that satisfies the inequality 0<(-\overline{\lambda})_{j}\leq C is called a support matrix at (\overline{W},\overline{b}); otherwise, it is a non-support matrix. The set of all support matrices at (\overline{W},\overline{b}) is denoted as

SM:={Xj| 0<(λ¯)jC,j[n]}.SM:=\{X_{j}\ |\ 0<(-\overline{\lambda})_{j}\leq C,j\in[n]\}.

To illustrate the geometric interpretation of support matrices, we define the optimal hyperplane H(W¯,b¯)H(\overline{W},\overline{b}) and the decision boundaries H+(W¯,b¯)H_{+}(\overline{W},\overline{b}) and H(W¯,b¯)H_{-}(\overline{W},\overline{b}) associated with (W¯,b¯)(\overline{W},\overline{b}) as follows:

H(W¯,b¯):={X|W¯,X+b¯=0},H+(W¯,b¯):={X|W¯,X+b¯=1},H(W¯,b¯):={X|W¯,X+b¯=1}.\begin{array}[]{c}H(\overline{W},\overline{b}):=\{X\ |\ \langle\overline{W},X\rangle+\overline{b}=0\},\\ H_{+}(\overline{W},\overline{b}):=\{X\ |\ \langle\overline{W},X\rangle+\overline{b}=1\},\ H_{-}(\overline{W},\overline{b}):=\{X\ |\ \langle\overline{W},X\rangle+\overline{b}=-1\}.\end{array}

Based on (4) and (v¯)j=1yj(W¯,Xj+b¯)(\overline{v})_{j}=1-y_{j}(\langle\overline{W},X_{j}\rangle+\overline{b}), we obtain that

SM{Xj|v¯j0,j[n]}={Xj|yj(W¯,Xj+b¯)1,j[n]}.SM\subseteq\{X_{j}\ |\ \overline{v}_{j}\geq 0,j\in[n]\}=\{X_{j}\ |\ y_{j}(\langle\overline{W},X_{j}\rangle+\overline{b})\leq 1,j\in[n]\}. (5)

Thus, geometrically, support matrices include a subset of examples lying on the decision boundaries, within the region between them, or misclassified.

It can also be observed from (4) that

{Xj|yj(W¯,Xj+b¯)<1,j[n]}={Xj|v¯j>0,j[n]}{Xj|(λ¯)j=C}.\{X_{j}\ |\ y_{j}(\langle\overline{W},X_{j}\rangle+\overline{b})<1,j\in[n]\}=\{X_{j}\ |\ \overline{v}_{j}>0,j\in[n]\}\subseteq\{X_{j}\ |\ (-\overline{\lambda})_{j}=C\}.

This implies that, after excluding support matrices XjX_{j} at (W¯,b¯)(\overline{W},\overline{b}) where (λ¯)j=C(-\overline{\lambda})_{j}=C, only those remaining on the decision boundaries are retained, i.e.,

ASM:={Xj|(λ¯)j(0,C),j[n]}{Xj|yj(W¯,Xj+b¯)=1,j[n]}.\displaystyle ASM:=\{X_{j}\ |\ (-\overline{\lambda})_{j}\in(0,C),j\in[n]\}\subseteq\{X_{j}\ |\ y_{j}(\langle\overline{W},X_{j}\rangle+\overline{b})=1,j\in[n]\}.

Here, we define support matrices satisfying 0<(λ¯)j<C0<(-\overline{\lambda})_{j}<C as active support matrices. Notably, the number of support matrices far exceeds that of active support matrices. As shown in Figure 1, the proportion of support matrices relative to the total sample size ranges from 9% to 86%, whereas the proportion of active support matrices is significantly lower, ranging from 0% to 3%. This raises a natural question: Can we design an algorithm whose main computational cost depends solely on the samples corresponding to the active support matrices?

Figure 1: Comparison of SM, ASM, and NSM under (n,p,q,\tau)=(100, 2, 1, 0.1). (Here, NSM denotes the set of non-support matrices. Red circles represent positive examples, blue diamonds represent negative examples, and solid circles/diamonds indicate support matrices. Active support matrices are highlighted with a green square border. SM/ASM refers to the set of support matrices after excluding the active support matrices. We observe that (a) the cardinality of active support matrices is significantly smaller than that of support matrices; and (b) as the value of C increases, both the number of support matrices and the optimal margin 2/\|\overline{W}\|_{{\mathcal{F}}} decrease.)

2.2 Characterizing low-rank property of W¯\overline{W}

In addition to the sample sparsity of model (P), another key property is the low-rank structure of the solution W¯\overline{W}. Indeed, since (W¯,b¯,v¯,U¯,λ¯,Λ¯)(\overline{W},\overline{b},\overline{v},\overline{U},\overline{\lambda},\overline{\Lambda}) is a solution to the KKT system (3), one has that

W¯=U¯=Proxτ(Λ¯+U¯)=Proxτ(𝒜λ¯).\overline{W}=\overline{U}=\mbox{Prox}_{\tau\|\cdot\|_{*}}(\overline{\Lambda}+\overline{U})=\mbox{Prox}_{\tau\|\cdot\|_{*}}(-{\mathcal{A}}^{*}\overline{\lambda}).

Without loss of generality, we assume that p\leq q, and suppose that -{\mathcal{A}}^{*}\overline{\lambda} has the following singular value decomposition (SVD):

𝒜λ¯=U[Diag(v) 0]V,-{\mathcal{A}}^{*}\overline{\lambda}=U[\mbox{Diag}(v)\ 0]V^{\top},

where U𝒪pU\in\mathcal{O}^{p}, V𝒪qV\in\mathcal{O}^{q}, and v:=(v1,v2,,vp)pv:=(v_{1},v_{2},\ldots,v_{p})\in\mathbb{R}^{p} with v1v2vpv_{1}\geq v_{2}\geq\ldots\geq v_{p}. Furthermore, define

k¯:=max{i[p]|vi>τ}.\bar{k}:=\max\{i\in[p]\ |\ v_{i}>\tau\}.

By applying the proximal mapping Proxτ(𝒜λ¯)\mbox{Prox}_{\tau\|\cdot\|_{*}}(-{\mathcal{A}}^{*}\overline{\lambda}) , as described in liu2012implementable , we obtain

W¯=\displaystyle\overline{W}= Proxτ(𝒜λ¯)=U[Diag(max{v1τ,0},,max{vpτ,0}) 0]V\displaystyle\mbox{Prox}_{\tau\|\cdot\|_{*}}(-{\mathcal{A}}^{*}\overline{\lambda})=U[\mbox{Diag}(\max\{v_{1}-\tau,0\},\ldots,\max\{v_{p}-\tau,0\})\ 0]V^{\top}
=U[ Diag(v1τ,,vk¯τ,0,,0) 0]V.\displaystyle=U[\mbox{ Diag}(v_{1}-\tau,\ldots,v_{\bar{k}}-\tau,0,\ldots,0)\ 0]V^{\top}.

This implies that the rank of W¯\overline{W} is k¯\bar{k}. As illustrated in Figure 2, for sufficiently large τ\tau, k¯\bar{k} is non-increasing, potentially enhancing classification accuracy on the test set (denoted Accuracytest\mbox{Accuracy}_{\mbox{test}} in (48)). Thus, the low-rank property of W¯\overline{W} may correspond to improved performance on the test set. Given this, how can an efficient algorithm be designed to leverage the low-rank structure of W¯\overline{W} when k¯\bar{k} is sufficiently small?
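The computation of \overline{W} above amounts to soft-thresholding the singular values of -{\mathcal{A}}^{*}\overline{\lambda} at level \tau. A minimal sketch (assuming a dense input matrix and a full SVD via NumPy; the function name is ours):

```python
import numpy as np

def prox_nuclear(M, tau):
    """Prox of tau*||.||_*: soft-threshold the singular values of M at tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    k_bar = int(np.sum(s > tau))                 # rank of the output, i.e. k_bar above
    return U @ np.diag(s_shrunk) @ Vt, k_bar
```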

Figure 2: The changes of values of k¯\bar{k} and Accuracytest\mbox{Accuracy}_{\mbox{test}} as the value of τ\tau increases on EEG and CIFAR-10 real datasets

3 A semismooth Newton-CG based augmented Lagrangian method for the SMM model

In this section, we present our algorithmic framework for solving the SMM model (1), which comprises an outer loop using the inexact augmented Lagrangian method (ALM) and inner loops employing the semismooth Newton-CG method for ALM subproblems.

3.1 An inexact augmented Lagrangian method

We are going to develop an inexact augmented Lagrangian method for solving the problem (P). Given a positive penalty parameter σ\sigma, the augmented Lagrangian function Lσ:𝒳×𝒴L_{\sigma}:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R} for (P) takes the form

Lσ(x;z)=\displaystyle L_{\sigma}(x;z)= 12W2+τU+δS(v)+λ,𝒜W+by+ven+Λ,WU\displaystyle\frac{1}{2}\|W\|^{2}_{\mathcal{F}}+\tau\|U\|_{*}+\delta^{*}_{S}(v)+\langle\lambda,{\mathcal{A}}W+by+v-e_{n}\rangle+\langle\Lambda,W-U\rangle (6)
+σ2𝒜W+by+ven2+σ2WU2,\displaystyle+\frac{\sigma}{2}\|{\mathcal{A}}W+by+v-e_{n}\|^{2}+\frac{\sigma}{2}\|W-U\|^{2}_{{\mathcal{F}}},

where x=(W,b,v,U)\in\mathcal{X} and z=(\lambda,\Lambda)\in\mathcal{Y}. Following the general framework in rockafellar1976augmented , the steps of the ALM are summarized in Algorithm 1 below.

Algorithm 1 An inexact augmented Lagrangian method

Initialization: Given are two positive parameters σ0\sigma_{0} and σ\sigma_{\infty}. Choose an initial point z0=(λ0,Λ0)𝒴z^{0}=(\lambda^{0},\Lambda^{0})\in\mathcal{Y}. Set k=0k=0. Perform the following steps in each iteration until a proper stopping criterion is satisfied:

Step 1. Compute an approximate solution
(Wk+1,bk+1,vk+1,Uk+1)argmin{fk(x):=Lσk(x;zk)|x𝒳}.(W^{k+1},b^{k+1},v^{k+1},U^{k+1})\approx\mbox{argmin}\left\{f_{k}(x):=L_{\sigma_{k}}(x;z^{k})\ |\ x\in\mathcal{X}\right\}. (7)
Step 2. Update the multipliers
(λk+1,Λk+1)=(λk+σk(𝒜Wk+1+bk+1y+vk+1en),Λk+σk(Wk+1Uk+1)).(\lambda^{k+1},\Lambda^{k+1})=(\lambda^{k}+\sigma_{k}({\mathcal{A}}W^{k+1}+b^{k+1}y+v^{k+1}-e_{n}),\Lambda^{k}+\sigma_{k}(W^{k+1}-U^{k+1})).
Step 3. Update σk+1σ\sigma_{k+1}\uparrow\sigma_{\infty}\leq\infty.
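A minimal sketch of one outer iteration of Algorithm 1 is given below; it reuses A_op from the sketch after (2) and assumes a hypothetical inner solver solve_subproblem that returns an approximate minimizer of (7) for the current multipliers and penalty parameter.

```python
import numpy as np

def alm_step(lmbda, Lam, sigma, Xs, y, solve_subproblem):
    """One outer iteration of Algorithm 1: inexact Step 1 followed by the Step 2 update."""
    W, b, v, U = solve_subproblem(lmbda, Lam, sigma)          # Step 1 (hypothetical inner solver)
    r_eq = A_op(W, Xs, y) + b * y + v - np.ones(len(y))       # A W + b y + v - e_n
    lmbda_new = lmbda + sigma * r_eq                          # Step 2: multiplier for the equality
    Lam_new = Lam + sigma * (W - U)                           # Step 2: multiplier for W - U = 0
    return (W, b, v, U), (lmbda_new, Lam_new)
```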

In Step 1 of Algorithm 1, the subproblem (7) is solved inexactly. Different inexact rules can be adopted. Here, we give two easily implementable inexact rules. Observe that (W^,b^,v^,U^)(\widehat{W},\widehat{b},\widehat{v},\widehat{U}) is an optimal solution to the ALM subproblem (7) if and only if it satisfies

(W^,b^)argmin{φk(W,b)|(W,b)p×q×},\displaystyle(\widehat{W},\widehat{b})\in\mbox{argmin}\left\{\varphi_{k}(W,b)\ |\ (W,b)\in\mathbb{R}^{p\times q}\times\mathbb{R}\right\}, (8)
(v^,U^)=(σk1ProxδS(ωk(W^,b^)),σk1Proxτ(Xk(W^))),\displaystyle(\widehat{v},\widehat{U})=\left(\sigma^{-1}_{k}\mbox{Prox}_{\delta^{*}_{S}}(\omega_{k}(\widehat{W},\widehat{b})),\sigma^{-1}_{k}\mbox{Prox}_{\tau\|\cdot\|_{*}}(X_{k}(\widehat{W}))\right), (9)

where the function φk:p×q×\varphi_{k}:\mathbb{R}^{p\times q}\times\mathbb{R}\to\mathbb{R} is defined by

φk(W,b):=\displaystyle\varphi_{k}(W,b):= (1/2)W2+(σk)1EδS(ωk(W,b))(2σk)1λk2\displaystyle(1/2)\|W\|^{2}_{{\mathcal{F}}}+(\sigma_{k})^{-1}E_{\delta^{*}_{S}}\left(\omega_{k}(W,b)\right)-(2\sigma_{k})^{-1}\|\lambda^{k}\|^{2} (10)
+(σk)1Eτ(Xk(W))(2σk)1Λk2.\displaystyle+(\sigma_{k})^{-1}E_{\tau\|\cdot\|_{*}}(X_{k}(W))-(2\sigma_{k})^{-1}\|\Lambda^{k}\|^{2}_{{\mathcal{F}}}.

Here, ωk:p×q×n\omega_{k}:\mathbb{R}^{p\times q}\times\mathbb{R}\rightarrow\mathbb{R}^{n} and Xk:p×qp×qX_{k}:\mathbb{R}^{p\times q}\rightarrow\mathbb{R}^{p\times q} are defined as

ωk(W,b):=λkσk(𝒜W+byen)andXk(W):=Λk+σkW.\omega_{k}(W,b):=-\lambda^{k}-\sigma_{k}({\mathcal{A}}W+b\,y-e_{n})\quad\mbox{and}\quad X_{k}(W):=\Lambda^{k}+\sigma_{k}W. (11)

Additionally, E_{g}(\cdot) and \mbox{Prox}_{g}(\cdot) represent the Moreau envelope and proximal mapping of some proper closed convex function g, respectively. Similar to those in cui2019r , we adopt the following two easily implementable inexact rules for (8):

(A)\displaystyle(A)\quad φk(Wk+1,bk+1)εk 2/σk1+xk+1+zk+1min{1/χk,1},\displaystyle\|\nabla\varphi_{k}(W^{k+1},b^{k+1})\|\leq\frac{\varepsilon^{\,2}_{k}/\sigma_{k}}{1+\|x^{k+1}\|+\|z^{k+1}\|}\min\left\{1/\chi_{k},1\right\},
(B)\displaystyle(B)\quad φk(Wk+1,bk+1)(ηk 2/σk)zk+1zk21+xk+1+zk+1min{1/χk,1}\displaystyle\|\nabla\varphi_{k}(W^{k+1},b^{k+1})\|\leq\frac{(\eta^{\,2}_{k}/\sigma_{k})\|z^{k+1}-z^{k}\|^{2}}{1+\|x^{k+1}\|+\|z^{k+1}\|}\min\left\{1/\chi_{k},1\right\}

where {εk}\{\varepsilon_{k}\} and {ηk}\{\eta_{k}\} are two given positive summable sequences, xk=(Wk,bkx^{k}=(W^{k},b^{k}, vk,Uk)v^{k},U^{k}), zk=(λk,Λk)z^{k}=(\lambda^{k},\Lambda^{k}), and χk=Wk+1+zk+1zk/σk+1/σk\chi_{k}=\|W^{k+1}\|_{{\mathcal{F}}}+\|z^{k+1}-z^{k}\|/\sigma_{k}+1/\sigma_{k}.

Rockafellar's original work rockafellar1976augmented demonstrated the asymptotic Q-superlinear convergence rate of the dual sequence produced by the ALM, assuming Lipschitz continuity of the dual solution mapping at the origin. However, this condition is challenging to satisfy for (P) due to the requirement of dual solution uniqueness. Therefore, we consider a weaker quadratic growth condition on the dual problem (DD) in this paper. The quadratic growth condition for (DD) at \overline{z}:=(\overline{\lambda},\overline{\Lambda})\in\Omega_{D} is said to hold if there exist constants \varepsilon>0 and \kappa>0 such that

Φ(z)Φ(z¯)+κdist2(z,ΩD),z:=(λ,Λ)FD𝔹ε(z¯).\Phi(z)\geq\Phi(\overline{z})+\kappa\,\mbox{dist}^{2}(z,\Omega_{D}),\quad\forall\ z:=(\lambda,\Lambda)\in F_{D}\cap\mathbb{B}_{\varepsilon}(\overline{z}). (12)

Here, the function Φ\Phi is defined in (DD), ΩD\Omega_{D} denotes the set of all optimal solutions to (DD), and FD:={(λ,Λ)𝒴|yλ=0,λS,Λ𝔹2τ}F_{D}:=\{(\lambda,\Lambda)\in\mathcal{Y}\ |\ y^{\top}\lambda=0,-\lambda\in S,\Lambda\in\mathbb{B}^{\tau}_{2}\} represents the set of all feasible solutions to (DD).

Let R^{kkt}:\mathcal{X}\times\mathcal{Y}\rightarrow\mathcal{X}\times\mathcal{Y} be the natural residual mapping defined as follows:

Rkkt(x,z):=[W+𝒜λ+ΛyλvProxδS(vλ)UProxτ(U+Λ)𝒜W+by+venWU].R^{kkt}(x,z):=\left[\begin{array}[]{c}W+{\mathcal{A}}^{*}\lambda+\Lambda\\ y^{\top}\lambda\\ v-\mbox{Prox}_{\delta^{*}_{S}}(v-\lambda)\\ U-\mbox{Prox}_{\tau\|\cdot\|_{*}}(U+\Lambda)\\ {\mathcal{A}}W+by+v-e_{n}\\ W-U\end{array}\right].
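A sketch of evaluating the norm of this residual (reusing A_op, A_adj, and prox_nuclear from the earlier sketches; by the Moreau identity, \mbox{Prox}_{\delta^{*}_{S}}(u)=u-\Pi_{S}(u) with S=[0,C]^{n}, so the projection reduces to clipping):

```python
import numpy as np

def kkt_residual_norm(W, b, v, U, lmbda, Lam, Xs, y, C, tau):
    """Norm of R^kkt(x, z) in (3); Prox_{delta_S^*}(u) = u - clip(u, 0, C)."""
    prox_support = lambda u: u - np.clip(u, 0.0, C)
    r1 = W + A_adj(lmbda, Xs, y) + Lam
    r2 = y @ lmbda
    r3 = v - prox_support(v - lmbda)
    r4 = U - prox_nuclear(U + Lam, tau)[0]
    r5 = A_op(W, Xs, y) + b * y + v - np.ones(len(y))
    r6 = W - U
    return np.sqrt(sum(np.sum(np.asarray(r) ** 2) for r in (r1, r2, r3, r4, r5, r6)))
```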

The following theorem establishes the global convergence and asymptotic R-superlinear convergence of the KKT residuals in terms of \|R^{kkt}(x^{k},z^{k})\| under criteria (A) and (B) for Algorithm 1, which can be directly derived from (cui2019r, , Theorem 2).

Theorem 3.1

Let \{(x^{k},z^{k})\} be an infinite sequence generated by Algorithm 1 under criterion (A) with x^{k}:=(W^{k},b^{k},v^{k},U^{k}) and z^{k}:=(\lambda^{k},\Lambda^{k}). Then the sequence \{z^{k}\} converges to some \overline{z}\in\Omega_{D}. Moreover, the sequence \{x^{k}\} is bounded with all of its accumulation points in the solution set of (P).

If, in addition, criterion (B) is executed and the quadratic growth condition (12) holds at \overline{z}, then there exists k^{\prime}\geq 0 such that for all k\geq k^{\prime}, \beta\eta_{k}<1 and

dist(zk+1,ΩD)θkdist(zk,ΩD),Rkkt(xk+1,zk+1)θkdist(zk,ΩD),\displaystyle\mbox{dist}\,(z^{k+1},\Omega_{D})\leq\theta_{k}\mbox{dist}\,(z^{k},\Omega_{D}),\quad\|R^{kkt}(x^{k+1},z^{k+1})\|\leq\theta^{\prime}_{k}\mbox{dist}\,(z^{k},\Omega_{D}),

where

θk:=\displaystyle\theta_{k}:= [βηk+(βηk+1)/1+σk2κ2]/(1βηk)θ:=1/1+σ2κ2,\displaystyle\left[\beta\eta_{k}+(\beta\eta_{k}+1)/\sqrt{1+\sigma^{2}_{k}\kappa^{2}}\right]/(1-\beta\eta_{k})\rightarrow\theta_{\infty}:=1/\sqrt{1+\sigma^{2}_{\infty}\kappa^{2}},
θk:=\displaystyle\theta^{\prime}_{k}:= [1/σk+(ηk2/σk)zk+1zk]/(1βηk)θ:=1/σ,\displaystyle\left[1/\sigma_{k}+(\eta^{2}_{k}/\sigma_{k})\|z^{k+1}-z^{k}\|\right]/(1-\beta\eta_{k})\rightarrow\theta^{\prime}_{\infty}:=1/\sigma_{\infty},
β:=\displaystyle\beta:= 2[1+τp+Cn+γ(n+1)+2γ2]with someγ1.\displaystyle\sqrt{2\left[1+\tau\sqrt{p}+C\sqrt{n}+\gamma(\sqrt{n}+1)+2\gamma^{2}\right]}\quad\mbox{with some}\quad\gamma\geq 1.

Moreover, θ=θ=0\theta_{\infty}=\theta^{\prime}_{\infty}=0 if σ=+\sigma_{\infty}=+\infty.

The following proposition gives a sufficient condition for the quadratic growth condition (12) with respect to the dual problem (DD). Here, we assume without loss of generality that 0<p\leq q. A detailed proof is given in Appendix A.

Proposition 1

Let z¯:=(λ¯,Λ¯)ΩD\overline{z}:=(\overline{\lambda},\overline{\Lambda})\in\Omega_{D}. Assume that there exists (λ~,Λ~)ΩD(\widetilde{\lambda},\widetilde{\Lambda})\in\Omega_{D} such that

rank(Υ¯)+rank(τI¯Λ~)=pwithI¯:=U¯[Ip  0p×(qp)]V¯,\mbox{rank}\,(-\overline{\Upsilon})+\mbox{rank}\,(\tau\overline{I}-\widetilde{\Lambda})=p\ \mbox{with}\ \overline{I}:=\overline{U}[I_{p}\ \,0_{p\times(q-p)}]\overline{V}^{\top}, (14)

where Υ¯:=𝒜λ¯+Λ¯\overline{\Upsilon}:={\mathcal{A}}^{*}\overline{\lambda}+\overline{\Lambda}, and the SVD of Λ~\widetilde{\Lambda} is

Λ~=U¯[Σ¯(Λ~) 0]V¯withU¯𝒪pandV¯𝒪q.\widetilde{\Lambda}=\overline{U}[\overline{\Sigma}(\widetilde{\Lambda})\ 0]\overline{V}^{\top}\ \mbox{with}\ \overline{U}\in\mathcal{O}^{p}\ \mbox{and}\ \overline{V}\in\mathcal{O}^{q}. (15)

Then the quadratic growth condition (12) at z¯\overline{z} holds for the dual problem (DD).

Finally, condition (14) can be interpreted as a strict complementarity condition. Indeed, if (W¯,b¯,v¯,U¯,λ¯,Λ¯)(\overline{W},\overline{b},\overline{v},\overline{U},\overline{\lambda},\overline{\Lambda}) satisfies the KKT system (3), then W¯=(𝒜λ¯+Λ¯)=Υ¯\overline{W}=-({\mathcal{A}}^{*}\overline{\lambda}+\overline{\Lambda})=-\overline{\Upsilon}. Consequently, the condition (14) means that there exists (λ~,Λ~)ΩD(\widetilde{\lambda},\widetilde{\Lambda})\in\Omega_{D} such that

rank(W¯)+rank(τI¯Λ~)=p.\mbox{rank}\,(\overline{W})+\mbox{rank}\,(\tau\overline{I}-\widetilde{\Lambda})=p. (16)

Moreover, by the use of (zhou2017unified, , Proposition 10) and (A.5), we can deduce \langle\overline{W},\tau\overline{I}-\widetilde{\Lambda}\rangle=0. In this sense, equality (16) is regarded as a strict complementarity between the matrices \overline{W} and \tau\overline{I}-\widetilde{\Lambda}. In particular, (16) remains true if \mbox{rank}\,(\overline{W})=p and \tau=0, a condition inherently satisfied in the context of the soft margin support vector machine model cortes1995support .

3.2 A semismooth Newton-CG method to solve the subproblem (8)

In Algorithm 1, the primary computational cost arises from solving the convex subproblem (8). In this subsection, we propose a semismooth Newton-CG method to calculate an inexact solution to this subproblem.

Due to the convexity of φk(,)\varphi_{k}(\cdot,\cdot), solving subproblem (8) is equivalent to solving the following nonlinear equation:

φk(W,b)=[W𝒜ΠS(ωk(W,b))+Π𝔹2τ(Xk(W))yΠS(ωk(W,b))]=0,\nabla\varphi_{k}(W,b)=\left[\begin{array}[]{c}W-{\mathcal{A}}^{*}\Pi_{S}\left(\omega_{k}(W,b)\right)+\Pi_{\mathbb{B}^{\tau}_{2}}(X_{k}(W))\\[5.69054pt] -y^{\top}\Pi_{S}\left(\omega_{k}(W,b)\right)\end{array}\right]=0, (17)

where \omega_{k}(W,b) and X_{k}(W) are defined in (11). Notice that \nabla\varphi_{k}(\cdot,\cdot) is not smooth but strongly semismooth (see, e.g., (FP2007, , Proposition 7.4.4) and (kaifeng2011algorithms, , Theorem 2.3)). It is then desirable to solve it using the semismooth Newton method. The semismooth Newton method has been extensively studied; under some regularity conditions, it can be superlinearly or quadratically convergent, see, e.g., qi1993nonsmooth ; qi2006quadratically ; ZST2010 .
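For concreteness, a sketch of evaluating \nabla\varphi_{k} in (17) is given below (reusing A_op and A_adj from the sketch after (2); \Pi_{\mathbb{B}^{\tau}_{2}} is realized by clipping singular values at \tau, and the multipliers and penalty of the current ALM iteration are passed in explicitly):

```python
import numpy as np

def proj_spectral_ball(M, tau):
    """Projection onto {X : ||X||_2 <= tau}: clip the singular values at tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.minimum(s, tau)) @ Vt

def grad_phi(W, b, Xs, y, lmbda_k, Lam_k, sigma_k, C, tau):
    """Evaluate the gradient (17) at (W, b) for the k-th ALM subproblem."""
    omega = -lmbda_k - sigma_k * (A_op(W, Xs, y) + b * y - np.ones(len(y)))   # omega_k(W, b) in (11)
    Xk = Lam_k + sigma_k * W                                                  # X_k(W) in (11)
    pS = np.clip(omega, 0.0, C)                                               # Pi_S(omega)
    grad_W = W - A_adj(pS, Xs, y) + proj_spectral_ball(Xk, tau)
    grad_b = -y @ pS
    return grad_W, grad_b
```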

In what follows, we construct an alternative set for the Clarke generalized Jacobian 2φk(W,b)\partial^{2}\varphi_{k}(W,b) of φk\nabla\varphi_{k} at (W,b)(W,b), which is more computationally tractable. Define for (W,b)p×q×(W,b)\in\mathbb{R}^{p\times q}\times\mathbb{R},

^ 2φk(W,b):=[+σkΠ𝔹2τ(Xk(W))0]+σk[𝒜y]ΠS(ωk(W,b))[𝒜y].\widehat{\partial}^{\,2}\varphi_{k}(W,b):=\left[\begin{array}[]{ll}{\mathcal{I}}+\sigma_{k}\partial\Pi_{\mathbb{B}^{\tau}_{2}}(X_{k}(W))&{}\\ {}&0\end{array}\right]+\sigma_{k}\left[\begin{array}[]{l}{\mathcal{A}}^{*}\\ y^{\top}\end{array}\right]\partial\Pi_{S}(\omega_{k}(W,b))[{\mathcal{A}}\quad y].

It follows from (clarke1990optimization, , Proposition 2.3.3 and Theorem 2.6.6) that

 2φk(W,b)(dW,dy)^ 2φk(W,b)(dW,dy),(dW,dy)p×q×.{\partial}^{\,2}\varphi_{k}(W,b)(d_{W},d_{y})\subseteq\widehat{\partial}^{\,2}\varphi_{k}(W,b)(d_{W},d_{y}),\quad\forall\ (d_{W},d_{y})\in\mathbb{R}^{p\times q}\times\mathbb{R}.

The element 𝒱^ 2φk(W,b){\mathcal{V}}\in\widehat{\partial}^{\,2}\varphi_{k}(W,b) takes the following form

𝒱=[+σk𝒢0]+σk[𝒜y]M[𝒜y],{\mathcal{V}}=\left[\begin{array}[]{ll}{\mathcal{I}}+\sigma_{k}{\mathcal{G}}&{}\\ {}&0\end{array}\right]+\sigma_{k}\left[\begin{array}[]{l}{\mathcal{A}}^{*}\\ y^{\top}\end{array}\right]M[{\mathcal{A}}\quad y], (18)

where 𝒢Π𝔹2τ(Xk(W)){\mathcal{G}}\in\partial\Pi_{\mathbb{B}^{\tau}_{2}}(X_{k}(W)), MΠS(ωk(W,b))M\in\partial\Pi_{S}(\omega_{k}(W,b)), and {\mathcal{I}} is the identity operator from p×q\mathbb{R}^{p\times q} to p×q\mathbb{R}^{p\times q}.

The steps of the semismooth Newton-CG method for solving (8) are stated in Algorithm 2.

Algorithm 2 A semismooth Newton-CG method for subproblem (8)

Initialization: Choose positive scalars \mu\in(0,1/2), \bar{\eta},\tau_{1},\tau_{2},\delta\in(0,1), \varrho\in(0,1], and an initial point (W^{k,0},b^{k,0})\in\mathbb{R}^{p\times q}\times\mathbb{R}. Set i=0. Execute the following steps until the stopping criteria (A) and/or (B) at (W^{k,i+1},b^{k,i+1}) are satisfied.

Step 1 (finding the semismooth Newton direction). Choose 𝒢Π𝔹2τ(Xk(Wk,i)){\mathcal{G}}\in\partial\Pi_{\mathbb{B}^{\tau}_{2}}(X_{k}(W^{k,i})) and MΠS(ωk(Wk,i,bk,i))M\in\partial\Pi_{S}(\omega_{k}(W^{k,i},b^{k,i})). Let 𝒱i:=𝒱{\mathcal{V}}_{i}:={\mathcal{V}} be given as in (18) with W=Wk,iW=W^{k,i}, b=bk,ib=b^{k,i}, and ρi=τ1min{τ2,φk(Wk,i,bk,i)}\rho_{i}=\tau_{1}\min\{\tau_{2},\|\nabla\varphi_{k}(W^{k,i},b^{k,i})\|\}. Apply the conjugate gradient (CG) algorithm to find an approximate solution (dWi,dbi)p×q×(d^{\,i}_{W},d^{\,i}_{b})\in\mathbb{R}^{p\times q}\times\mathbb{R} of the following linear equation
𝒱i(dW,db)+ρi(0,db)=φk(Wk,i,bk,i){\mathcal{V}}_{i}(d_{W},d_{b})+\rho_{i}(0,d_{b})=-\nabla\varphi_{k}(W^{k,i},b^{k,i}) (19)
such that
𝒱i(dWi,dbi)+ρi(0,dbi)+φk(Wk,i,bk,i)min(η¯,φk(Wk,i,bk,i)1+ϱ).\|{\mathcal{V}}_{i}(d^{\,i}_{W},d^{\,i}_{b})+\rho_{i}(0,d^{\,i}_{b})+\nabla\varphi_{k}(W^{k,i},b^{k,i})\|\leq\min(\bar{\eta},\|\nabla\varphi_{k}(W^{k,i},b^{k,i})\|^{1+\varrho}).
Step 2 (line search). Set αi=δmi\alpha_{i}=\delta^{m_{i}}, where mim_{i} is the first nonnegative integer mm for which
φk(Wk,i+δmdWi,bk,i+δmdbi)φk(Wk,i,bk,i)+μδmφk(Wk,i,bk,i),(dWi,dbi).\varphi_{k}(W^{k,i}+\delta^{m}d^{\,i}_{W},b^{k,i}+\delta^{m}d^{\,i}_{b})\leq\varphi_{k}(W^{k,i},b^{k,i})+\mu\,\delta^{m}\left\langle\nabla\varphi_{k}(W^{k,i},b^{k,i}),(d^{\,i}_{W},d^{\,i}_{b})\right\rangle.
Step 3. Set Wk,i+1=Wk,i+αidWiW^{k,i+1}=W^{k,i}+\alpha_{i}d^{\,i}_{W}, bk,i+1=bk,i+αidbib^{k,i+1}=b^{k,i}+\alpha_{i}d^{\,i}_{b}, and ii+1i\leftarrow i+1.
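The backtracking rule in Step 2 can be sketched as follows (assuming a callable phi evaluating \varphi_{k} and the gradient already computed at the current iterate; the parameter names mirror \mu and \delta in the initialization):

```python
import numpy as np

def armijo_step(phi, W, b, gW, gb, dW, db, mu=0.1, delta=0.5, max_back=50):
    """Return alpha = delta^m satisfying the descent condition in Step 2."""
    slope = np.sum(gW * dW) + gb * db      # <grad phi_k, (dW, db)>, negative for a descent direction
    f0 = phi(W, b)
    alpha = 1.0
    for _ in range(max_back):
        if phi(W + alpha * dW, b + alpha * db) <= f0 + mu * alpha * slope:
            break
        alpha *= delta
    return alpha
```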

In our semismooth Newton method, the positive definiteness of 𝒱i{\mathcal{V}}_{i} in the Newton linear system (19) is crucial for ensuring the superlinear convergence of the method, as discussed in ZST2010 ; li2018qsdpnal . In what follows, we show that the positive definiteness of 𝒱i{\mathcal{V}}_{i} is equivalent to the constraint nondegenerate condition from shapiro2003sensitivity , which simplifies to the nonemptiness of a specific index set. Here, the constraint nondegenerate condition associated with an optimal solution (λ^,Λ^)(\widehat{\lambda},\widehat{\Lambda}) of the dual problem for (7) can be expressed as (see shapiro2003sensitivity )

ylin(TS(λ^))=.y^{\top}\mbox{lin}\left(T_{S}(-\widehat{\lambda})\right)=\mathbb{R}. (20)
Proposition 2

Let (W^,b^)(\widehat{W},\widehat{b}) be an optimal solution for subproblem (8). Denote λ^:=ΠS(ωk(W^,b^))\widehat{\lambda}:=-\Pi_{S}(\omega_{k}(\widehat{W},\widehat{b})) and 𝒥1(λ^):={j[n]| 0<(λ^)j<C}{\mathcal{J}}_{1}(-\widehat{\lambda}):=\{j\in[n]\ |\ 0<(-\widehat{\lambda})_{j}<C\}, where ωk(W^,b^)\omega_{k}(\widehat{W},\widehat{b}) is defined in (11). Then the following conditions are equivalent:
(i) The constraint nondegenerate condition (20) holds at λ^\widehat{\lambda};
(ii) The index set 𝒥1(λ^){\mathcal{J}}_{1}(-\widehat{\lambda}) is nonempty;
(iii) Every element in ^ 2φk(W^,b^)\widehat{\partial}^{\,2}\varphi_{k}(\widehat{W},\widehat{b}) is self-adjoint and positive definite.

Proof

See Appendix B. ∎

We conclude this section by giving the convergence theorem of Algorithm 2. It can be proved in a way similar to those in (qi1993nonsmooth, , Proposition 3.1) and (ZST2010, , Theorem 3.5).

Theorem 3.2

The sequence \{(W^{k,i},b^{k,i})\}_{i\geq 0} generated by Algorithm 2 is bounded, and every accumulation point is an optimal solution to (8). If, at some accumulation point (\overline{W}^{k+1},\overline{b}^{k+1}), the constraint nondegeneracy condition (20) holds with \overline{\lambda}^{k+1}:=-\Pi_{S}(\omega_{k}(\overline{W}^{k+1},\overline{b}^{k+1})), then the whole sequence \{(W^{k,i},b^{k,i})\} converges to (\overline{W}^{k+1},\overline{b}^{k+1}) and

(Wk,i+1,bk,i+1)(W¯k+1,b¯k+1)=O((Wk,i,bk,i)(W¯k+1,b¯k+1)1+ϱ).\|(W^{k,i+1},b^{k,i+1})-(\overline{W}^{k+1},\overline{b}^{k+1})\|=O\left(\|(W^{k,i},b^{k,i})-(\overline{W}^{k+1},\overline{b}^{k+1})\|^{1+\varrho}\right).

4 Implementation of the semismooth Newton-CG method

As previously mentioned, our focus is on situations where the sample size n greatly exceeds the maximum of p and q, particularly when n\gg pq. As a result, the main computational burden in each iteration of Algorithm 2 is solving the following Newton linear system:

𝒱(dW,db)+ρ(0,db)=(Rhs1,rhs2),(dW,db)p×q×,{\mathcal{V}}(d_{W},d_{b})+\rho(0,d_{b})=(\mbox{Rhs1},\mbox{rhs2}),\quad(d_{W},d_{b})\in\mathbb{R}^{p\times q}\times\mathbb{R}, (21)

where 𝒱{\mathcal{V}} is defined in (18) with 𝒢Π𝔹2τ(Xk(W~))\mathcal{G}\in\partial\Pi_{\mathbb{B}^{\tau}_{2}}(X_{k}(\widetilde{W})), MΠS(ωk(W~,b~))M\in\partial\Pi_{S}(\omega_{k}(\widetilde{W},\widetilde{b})), σ:=σk\sigma:=\sigma_{k}, W~:=Wk,i\widetilde{W}:=W^{k,i}, and b~:=bk,i\widetilde{b}:=b^{k,i}. The right-hand sides are

Rhs1:=[W~𝒜ΠS(ωk(W~,b~))+Π𝔹2τ(Xk(W~))]andrhs2:=yΠS(ωk(W~,b~)).\mbox{Rhs1}:=-[\widetilde{W}-{\mathcal{A}}^{*}\Pi_{S}(\omega_{k}(\widetilde{W},\widetilde{b}))+\Pi_{\mathbb{B}^{\tau}_{2}}(X_{k}(\widetilde{W}))]\ \mbox{and}\ \mbox{rhs2}:=y^{\top}\Pi_{S}(\omega_{k}(\widetilde{W},\widetilde{b})).

It can be further rewritten as

{𝒱~dW=Rhs1(σrhs2)(σyMy+ρ)1𝒜My,db=(σyMy+ρ)1(rhs2σyM𝒜dW),\left\{\begin{array}[]{l}\widetilde{{\mathcal{V}}}d_{W}=\mbox{Rhs1}-(\sigma\cdot\mbox{rhs2})(\sigma y^{\top}My+\rho)^{-1}{\mathcal{A}}^{*}My,\\[5.69054pt] d_{b}=(\sigma y^{\top}My+\rho)^{-1}(\mbox{rhs2}-\sigma y^{\top}M{\mathcal{A}}\,d_{W}),\end{array}\right. (22)

where

𝒱~:=+σ𝒢+\widetilde{{\mathcal{V}}}:={\mathcal{I}}+\sigma{\mathcal{G}}+\mathcal{H}

with

:=σ𝒜M𝒜σ2(σyMy+ρ)1𝒜MyyM𝒜.\mathcal{H}:=\sigma{\mathcal{A}}^{*}M{\mathcal{A}}-\sigma^{2}(\sigma y^{\top}My+\rho)^{-1}{\mathcal{A}}^{*}Myy^{\top}M{\mathcal{A}}. (23)

Clearly, the major cost for finding an inexact solution of (22) lies in the computation of

𝒱~dW=dW+σ𝒢dW+dW.\widetilde{{\mathcal{V}}}d_{W}=d_{W}+\sigma{\mathcal{G}}d_{W}+\mathcal{H}d_{W}.

In the next two parts, we discuss how to reduce the computational cost for computing 𝒢dW{\mathcal{G}}d_{W} and dW\mathcal{H}d_{W} by selecting special linear operators 𝒢{\mathcal{G}} and \mathcal{H}, respectively.

4.1 The computation of 𝒢dW{\mathcal{G}}d_{W}

According to (kaifeng2011algorithms, , Theorem 2.5) or (jiang2013solving, , Proposition 2), if Xk(W~)2τ\|X_{k}(\widetilde{W})\|_{2}\leq\tau, then we can select 𝒢{\mathcal{G}} such that 𝒢dW=dW\mathcal{G}d_{W}=d_{W}. In the case Xk(W~)2>τ\|X_{k}(\widetilde{W})\|_{2}>\tau, the computation of 𝒢dW{\mathcal{G}}d_{W} depends on the SVD of Xk(W~)X_{k}(\widetilde{W}). Without loss of generality, we assume pqp\leq q. Let Xk(W~)X_{k}(\widetilde{W}) have the following SVD:

Xk(W~)=U[Diag(ν) 0]V=UDiag(ν)V1,X_{k}(\widetilde{W})=U\left[\mbox{Diag}(\nu)\ 0\right]V^{\top}=U\mbox{Diag}(\nu)V^{\top}_{1},

where U𝒪pU\in\mathcal{O}^{p}, V=[V1V2]𝒪qV=[V_{1}\ V_{2}]\in\mathcal{O}^{q} with V1q×pV_{1}\in\mathbb{R}^{q\times p} and V2q×(qp)V_{2}\in\mathbb{R}^{q\times(q-p)}, and ν=(ν1,,νp)p\nu=(\nu_{1},\ldots,\nu_{p})^{\top}\in\mathbb{R}^{p} is the vector of singular values ordered as ν1νp\nu_{1}\geq\ldots\geq\nu_{p}.

Denote

α:=\displaystyle\alpha:= [k1(ν)],β:=β1β2,γ:={p+1,p+2,,q},\displaystyle[k_{1}(\nu)],\quad\beta:=\beta_{1}\cup\beta_{2},\quad\gamma:=\{p+1,p+2,\ldots,q\}, (24)
β1:=\displaystyle\beta_{1}:= {k1(ν)+1,,k2(ν)},β2:={k2(ν)+1,k2(ν)+2,,p},\displaystyle\{k_{1}(\nu)+1,\ldots,k_{2}(\nu)\},\quad\beta_{2}:=\{k_{2}(\nu)+1,k_{2}(\nu)+2,\ldots,p\},

where k1(ν):=max{i[p]|νi>τ}k_{1}(\nu):=\max\{i\in[p]\ |\ \nu_{i}>\tau\} and k2(ν):=max{i[p]|νiτ}k_{2}(\nu):=\max\{i\in[p]\ |\ \nu_{i}\geq\tau\}. Define three matrices Ξ1p×p\Xi^{1}\in\mathbb{R}^{p\times p}, Ξ2p×p\Xi^{2}\in\mathbb{R}^{p\times p}, and Ξ3p×(qp)\Xi^{3}\in\mathbb{R}^{p\times(q-p)}:

Ξ1=[EααEαβ1Ξαβ21Eβ1α00(Ξαβ21)00],Ξ2=[Ξαα2Ξαβ12Ξαβ22(Ξβ1α2)00(Ξαβ22)00],Ξ3=[Ξαγ300],\displaystyle\Xi^{1}=\left[\begin{array}[]{ccc}E_{\alpha\alpha}&E_{\alpha\beta_{1}}&\Xi^{1}_{\alpha\beta_{2}}\\ E_{\beta_{1}\alpha}&0&0\\ (\Xi^{1}_{\alpha\beta_{2}})^{\top}&0&0\end{array}\right],\ \Xi^{2}=\left[\begin{array}[]{ccc}\Xi^{2}_{\alpha\alpha}&\Xi^{2}_{\alpha\beta_{1}}&\Xi^{2}_{\alpha\beta_{2}}\\ (\Xi^{2}_{\beta_{1}\alpha})^{\top}&0&0\\ (\Xi^{2}_{\alpha\beta_{2}})^{\top}&0&0\end{array}\right],\ \Xi^{3}=\left[\begin{array}[]{c}\Xi^{3}_{\alpha\gamma}\\ 0\\ 0\end{array}\right], (34)

where Ep×pE\in\mathbb{R}^{p\times p} is the matrix whose elements are all ones, and

(Ξαβ21)ij=νiτνiνj,foriα,jβ2;(Ξαα2)ij=12τνi+νj,fori,jα;\displaystyle(\Xi^{1}_{\alpha\beta_{2}})_{ij}=\frac{\nu_{i}-\tau}{\nu_{i}-\nu_{j}},\ \mbox{for}\ i\in\alpha,j\in\beta_{2};\quad(\Xi^{2}_{\alpha\alpha})_{ij}=1-\frac{2\tau}{\nu_{i}+\nu_{j}},\ \mbox{for}\ i,j\in\alpha;
(Ξαβ2)ij=νiτνi+νj,foriα,jβ;(Ξαγ3)ij=1τνi,foriα,jγ.\displaystyle(\Xi^{2}_{\alpha\beta})_{ij}=\frac{\nu_{i}-\tau}{\nu_{i}+\nu_{j}},\ \mbox{for}\ i\in\alpha,j\in\beta;\quad(\Xi^{3}_{\alpha\gamma})_{ij}=1-\frac{\tau}{\nu_{i}},\ \mbox{for}\ i\in\alpha,j\in\gamma.

As a result, if \|X_{k}(\widetilde{W})\|_{2}>\tau, then {\mathcal{G}}d_{W} can be expressed as (see (kaifeng2011algorithms, , Theorem 2.5) or (jiang2013solving, , Proposition 2))

𝒢dW=dWU[Ξ1S(H~1)+Ξ2T(H~1)]V1𝒢1dWU(Ξ3H~2)V2𝒢2dW,{\mathcal{G}}d_{W}=d_{W}-\smash{\underbrace{U\left[\Xi^{1}\circ S(\widetilde{H}_{1})+\Xi^{2}\circ T(\widetilde{H}_{1})\right]V^{\top}_{1}}_{{\mathcal{G}}_{1}d_{W}}}-\smash{\underbrace{U(\Xi^{3}\circ\widetilde{H}_{2})V^{\top}_{2}}_{{\mathcal{G}}_{2}d_{W}}},\vspace{4mm} (35)

where H~1:=UdWV1\widetilde{H}_{1}:=U^{\top}d_{W}V_{1} and H~2:=UdWV2\widetilde{H}_{2}:=U^{\top}d_{W}V_{2}. Here, S:p×p𝕊pS:\mathbb{R}^{p\times p}\rightarrow\mathbb{S}^{p} and T:p×pp×pT:\mathbb{R}^{p\times p}\rightarrow\mathbb{R}^{p\times p} are linear matrix operators defined by

S(X):=(X+X)/2andT(X):=(XX)/2,Xp×p.S(X):=(X+X^{\top})/2\quad\mbox{and}\quad T(X):=(X-X^{\top})/2,\quad\forall X\in\mathbb{R}^{p\times p}.

The primary computational cost of 𝒢dW{\mathcal{G}}d_{W} is approximately 4p2q+4pq24p^{2}q+4pq^{2} flops. However, by leveraging the sparsity of the matrices Ξ1\Xi^{1}, Ξ2\Xi^{2}, and Ξ3\Xi^{3}, the computation of 𝒢1dW{\mathcal{G}}_{1}d_{W} and 𝒢2dW{\mathcal{G}}_{2}d_{W} in (35) can be performed more efficiently.

The term 𝒢1dW{\mathcal{G}}_{1}d_{W}. Based on the definitions of Ξ1\Xi^{1} and Ξ2\Xi^{2} in (34), 𝒢1dW{\mathcal{G}}_{1}d_{W} in (35) can be computed by

𝒢1dW=\displaystyle{\mathcal{G}}_{1}d_{W}= U[Ξ1S(H~1)+Ξ2T(H~1)]V1\displaystyle U\left[\Xi^{1}\circ S(\widetilde{H}_{1})+\Xi^{2}\circ T(\widetilde{H}_{1})\right]V^{\top}_{1}
=\displaystyle= Uα[(S(H~1))αβ1+Ξαβ12(T(H~1))αβ1]V1β1+Uα[Ξαβ21(S(H~1))αβ2\displaystyle U_{\alpha}\left[(S(\widetilde{H}_{1}))_{\alpha\beta_{1}}+\Xi^{2}_{\alpha\beta_{1}}\circ(T(\widetilde{H}_{1}))_{\alpha\beta_{1}}\right]V^{\top}_{1\beta_{1}}+U_{\alpha}\left[\Xi^{1}_{\alpha\beta_{2}}\circ(S(\widetilde{H}_{1}))_{\alpha\beta_{2}}\right.
+Ξαβ22(T(H~1))αβ2]V1β2+{Uα[(S(H~1))αα+Ξαα1(T(H~1))αα]\displaystyle\left.+\Xi^{2}_{\alpha\beta_{2}}\circ(T(\widetilde{H}_{1}))_{\alpha\beta_{2}}\right]V^{\top}_{1\beta_{2}}+\left\{U_{\alpha}\left[(S(\widetilde{H}_{1}))_{\alpha\alpha}+\Xi^{1}_{\alpha\alpha}\circ(T(\widetilde{H}_{1}))_{\alpha\alpha}\right]\right.
+Uβ2[(Ξαβ21)(S(H~1))β2α+(Ξαβ22)(T(H~1))β2α]\displaystyle+\left.U_{\beta_{2}}\left[(\Xi^{1}_{\alpha\beta_{2}})^{\top}\circ(S(\widetilde{H}_{1}))_{\beta_{2}\alpha}+(\Xi^{2}_{\alpha\beta_{2}})^{\top}\circ(T(\widetilde{H}_{1}))_{\beta_{2}\alpha}\right]\right.
+Uβ1[(S(H~1))β1α+(Ξαβ12)(T(H~1))β1α]}V1α.\displaystyle+\left.U_{\beta_{1}}\left[(S(\widetilde{H}_{1}))_{\beta_{1}\alpha}+(\Xi^{2}_{\alpha\beta_{1}})^{\top}\circ(T(\widetilde{H}_{1}))_{\beta_{1}\alpha}\right]\right\}V^{\top}_{1\alpha}.

It shows that the computational cost for computing {\mathcal{G}}_{1}d_{W} can be reduced from 4p^{3}+4p^{2}q flops to 14|\alpha|pq+4|\alpha|p^{2} flops. From the definition of \alpha in (24), if the tuple (W^{k+1},b^{k+1},v^{k+1},U^{k+1},\lambda^{k+1},\Lambda^{k+1}) generated by Algorithm 1 is a KKT solution to (3), then |\alpha|=\mbox{rank}(W^{k+1}) follows immediately. As mentioned in Subsection 2.2, for sufficiently large \tau, the rank of W^{k+1} is generally much less than p, i.e., |\alpha|\ll p. Therefore, the low-rank structure of W^{k+1} can be effectively leveraged within our computational framework.

The term 𝒢2dW{\mathcal{G}}_{2}d_{W}. According to the definition of Ξ3\Xi^{3} in (34), we have

𝒢2dW=\displaystyle{\mathcal{G}}_{2}d_{W}= U(Ξ3H~2)V2=Uα[Ξαγ3(UαdWV2)]V2=UαDiag(d)UαdWV2V2\displaystyle U(\Xi^{3}\circ\widetilde{H}_{2})V^{\top}_{2}=U_{\alpha}[\Xi^{3}_{\alpha\gamma}\circ(U^{\top}_{\alpha}d_{W}V_{2})]V^{\top}_{2}=U_{\alpha}\mbox{Diag}(d)U^{\top}_{\alpha}d_{W}V_{2}V_{2}^{\top}
=\displaystyle= UαDiag(d)UαdW(IqV1V1)=UαDiag(d)[UαdW(UαdWV1)V1]\displaystyle U_{\alpha}\mbox{Diag}(d)U^{\top}_{\alpha}d_{W}(I_{q}-V_{1}V^{\top}_{1})=U_{\alpha}\mbox{Diag}(d)[U^{\top}_{\alpha}d_{W}-(U^{\top}_{\alpha}d_{W}V_{1})V^{\top}_{1}]

with d:=(d1,,d|α|)|α|d:=(d_{1},\ldots,d_{|\alpha|})^{\top}\in\mathbb{R}^{|\alpha|} and d:=1τ/νd_{\ell}:=1-\tau/\nu_{\ell} for =1,,|α|\ell=1,\ldots,|\alpha|. The last equality shows that by the use of the sparsity in Ξ3\Xi^{3}, the cost for computing U(Ξ3H~2)V2U(\Xi^{3}\circ\widetilde{H}_{2})V^{\top}_{2} can also be notably decreased from 4p2(qp)+4pq(qp)4p^{2}(q-p)+4pq(q-p) flops to 4|α|pq4|\alpha|pq flops when |α|p|\alpha|\ll p.
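A sketch of this cheap form of {\mathcal{G}}_{2}d_{W} (assuming U_alpha of size p\times|\alpha|, V_{1} of size q\times p, and the leading singular values \nu_{\ell}>\tau are available from the SVD above; the function name is ours):

```python
import numpy as np

def G2_apply(dW, U_alpha, V1, nu_alpha, tau):
    """G2 dW = U_alpha Diag(d) [U_alpha^T dW - (U_alpha^T dW V1) V1^T], cost O(|alpha| p q)."""
    d = 1.0 - tau / nu_alpha                   # d_l = 1 - tau / nu_l for the leading singular values
    T = U_alpha.T @ dW                         # |alpha| x q
    T = T - (T @ V1) @ V1.T                    # subtract the component in the range of V1
    return (U_alpha * d) @ T                   # scale columns of U_alpha by d and map back
```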

The preceding arguments show that exploiting the sparsity in 𝒢\mathcal{G}, induced by the low rank of Wk+1W^{k+1}, reduces the computational cost of 𝒢dW\mathcal{G}d_{W} from 4p2q+4pq24p^{2}q+4pq^{2} flops to 18|α|pq+4|α|p218|\alpha|pq+4|\alpha|p^{2} flops, with |α||\alpha| typically much smaller than pp.

4.2 The computation of dW\mathcal{H}d_{W}

By the definition of \mathcal{H} in (23), we have

dW=σ𝒜M𝒜dWσ2(σyMy+ρ)1𝒜MyyM𝒜dW.\mathcal{H}d_{W}=\sigma{\mathcal{A}}^{*}M{\mathcal{A}}d_{W}-\sigma^{2}(\sigma y^{\top}My+\rho)^{-1}{\mathcal{A}}^{*}Myy^{\top}M{\mathcal{A}}d_{W}. (36)

Notice that in the above expression, M can be an arbitrary element in \partial\Pi_{S}(\omega(\widetilde{W},\widetilde{b})). To save computational cost, we select a simple M\in\partial\Pi_{S}(\omega(\widetilde{W},\widetilde{b})) as follows:

M=Diag(v)withvj={1,if 0<(ω(W~,b~))j<C,0,otherwise,j=1,2,,n.M=\mbox{Diag}(v)\quad\mbox{with}\quad v_{j}=\left\{\begin{array}[]{ll}1,&\quad\mbox{if}\ 0<(\omega(\widetilde{W},\widetilde{b}))_{j}<C,\\ 0,&\quad\mbox{otherwise},\end{array}\right.\quad j=1,2,\ldots,n.

Let

𝒥1=𝒥1(ω(W~,b~)):={j[n]| 0<(ω(W~,b~))j<C}.{\mathcal{J}}_{1}={\mathcal{J}}_{1}(\omega(\widetilde{W},\widetilde{b})):=\{j\in[n]\ |\ 0<(\omega(\widetilde{W},\widetilde{b}))_{j}<C\}. (37)

It is easy to get yMy=|𝒥1|y^{\top}My=|{\mathcal{J}}_{1}|, and

𝒜My=𝒜𝒥1y𝒥1,yM𝒜dW=y𝒥1𝒜𝒥1dW,𝒜M𝒜dW=𝒜𝒥1𝒜𝒥1dW,\displaystyle{\mathcal{A}}^{*}My={\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}y_{\scriptscriptstyle{\mathcal{J}}_{1}},\quad y^{\top}M{\mathcal{A}}\,d_{W}=y^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}d_{W},\quad{\mathcal{A}}^{*}M{\mathcal{A}}\,d_{W}={\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}d_{W},

where the linear mapping 𝒜{\mathcal{A}}_{\scriptscriptstyle{\mathcal{I}}} and its adjoint mapping 𝒜{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{I}}} with [n]{\mathcal{I}}\subseteq[n] are defined by

𝒜Y:=(𝒜Y)and𝒜z:=jzjyjXj,Yp×q,zn.{\mathcal{A}}_{\scriptscriptstyle{\mathcal{I}}}Y:=({\mathcal{A}}Y)_{\scriptscriptstyle{\mathcal{I}}}\quad\mbox{and}\quad{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{I}}}z:=\sum_{j\in{\mathcal{I}}}z_{j}y_{j}X_{j},\quad\forall\ Y\in\mathbb{R}^{p\times q},\quad\forall z\in\mathbb{R}^{n}. (38)

Therefore, it follows from (36) that

dW=σ𝒜𝒥1𝒜𝒥1dWσ2(σ|𝒥1|+ρ)1𝒜𝒥1y𝒥1y𝒥1𝒜𝒥1dW.\mathcal{H}d_{W}=\sigma{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}d_{W}-\sigma^{2}(\sigma|{\mathcal{J}}_{1}|+\rho)^{-1}{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}y_{\scriptscriptstyle{\mathcal{J}}_{1}}y^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}d_{W}.

It means that the cost for computing dW\mathcal{H}d_{W} can be decreased from O(npq)O(npq) to O(|𝒥1|pq)O(|{\mathcal{J}}_{1}|pq). Observe that if (Wk+1,bk+1,vk+1,Uk+1,λk+1,Λk+1)(W^{k+1},b^{k+1},v^{k+1},U^{k+1},\lambda^{k+1},\Lambda^{k+1}) is generated by Algorithm 1 as a solution of the KKT system (3), then the index set 𝒥1{\mathcal{J}}_{1} corresponds to the active support matrices at (Wk+1,bk+1)(W^{k+1},b^{k+1}). In fact, based on Step 2 in Algorithm 1 and (9), we deduce from the Moreau identity that

λk+1=ωk(Wk+1,bk+1)+ProxδS(ωk(Wk+1,bk+1))=ΠS(ωk(Wk+1,bk+1)),\displaystyle\lambda^{k+1}=-\omega_{k}(W^{k+1},b^{k+1})+\mbox{Prox}_{\delta^{*}_{S}}(\omega_{k}(W^{k+1},b^{k+1}))=-\Pi_{S}(\omega_{k}(W^{k+1},b^{k+1})),

where ωk(,)\omega_{k}(\cdot,\cdot) is defined in (11). It further implies from (37) that

𝒥1={j[n]|(ωk(Wk+1,bk+1))j(0,C)}={j[n]|(λk+1)j(0,C)}.{\mathcal{J}}_{1}=\{j\in[n]\ |\ (\omega_{k}(W^{k+1},b^{k+1}))_{j}\in(0,C)\}=\{j\in[n]\ |\ (-\lambda^{k+1})_{j}\in(0,C)\}. (39)

The analysis in Subsection 2.1 reveals that the cardinality of the index set 𝒥1{\mathcal{J}}_{1} is typically much smaller than nn, thereby providing a positive answer to the question posed earlier. This results in a significant decrease in the computational cost for calculating dW\mathcal{H}d_{W}.
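As a concrete illustration, the following MATLAB sketch applies the reduced operator from the last display using only the samples indexed by the set J1. The p-by-q-by-n array X, the label vector y, and the function name apply_H are assumptions about the data layout (not the implementation used in the paper); the relation that the t-th entry of the restricted map equals y_j times the inner product of X_j with the input follows from the adjoint in (38).

```matlab
% A minimal sketch of computing H*dW via the reduced formula above, touching only
% the samples indexed by J1. Assumed layout: X is p-by-q-by-n, y is n-by-1.
function Hd = apply_H(dW, X, y, J1, sigma, rho)
    m   = numel(J1);
    AdW = zeros(m, 1);                          % (A_{J1} dW)_t = y_j * <X_j, dW>
    for t = 1:m
        j = J1(t);
        AdW(t) = y(j) * sum(sum(X(:,:,j) .* dW));
    end
    coef = sigma*AdW - sigma^2/(sigma*m + rho) * (y(J1)'*AdW) * y(J1);
    Hd = zeros(size(dW));                       % A_{J1}^* coef = sum_t coef(t)*y_j*X_j
    for t = 1:m
        j = J1(t);
        Hd = Hd + coef(t) * y(j) * X(:,:,j);
    end
end
```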

4.3 Solving equation (22)

The discussion above has shown that equation (22) can be reduced to

{𝒱~dW=Rhs1(σrhs2)(σ|𝒥1|+ρ)1𝒜𝒥1y𝒥1,db=(σ|𝒥1|+ρ)1(rhs2σy𝒥1𝒜𝒥1dW)\left\{\begin{array}[]{l}\widetilde{{\mathcal{V}}}d_{W}=\mbox{Rhs1}-(\sigma\cdot\mbox{rhs2})(\sigma|{\mathcal{J}}_{1}|+\rho)^{-1}{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}y_{\scriptscriptstyle{\mathcal{J}}_{1}},\\[5.69054pt] d_{b}=(\sigma|{\mathcal{J}}_{1}|+\rho)^{-1}\left(\mbox{rhs2}-\sigma y^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}d_{W}\right)\end{array}\right. (40)

with

𝒱~=+σ𝒢+σ𝒜𝒥1𝒜𝒥1σ2(σ|𝒥1|+ρ)1𝒜𝒥1y𝒥1y𝒥1𝒜𝒥1.\widetilde{{\mathcal{V}}}={\mathcal{I}}+\sigma{\mathcal{G}}+\sigma{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}-\sigma^{2}(\sigma|{\mathcal{J}}_{1}|+\rho)^{-1}{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}y_{\scriptscriptstyle{\mathcal{J}}_{1}}y^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}. (41)

Accordingly, the inexact rule in Step 1 of Algorithm 2 is written as

𝒱~dW[Rhs1(σrhs2)(σ|𝒥1|+ρ)1𝒜𝒥1y𝒥1]min(η¯,φk(W~,b~)1+ϱ).\left\|\widetilde{{\mathcal{V}}}d_{W}-\left[\mbox{Rhs1}-(\sigma\cdot\mbox{rhs2})(\sigma|{\mathcal{J}}_{1}|+\rho)^{-1}{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}y_{\scriptscriptstyle{\mathcal{J}}_{1}}\right]\right\|\leq\min(\bar{\eta},\|\nabla\varphi_{k}(\widetilde{W},\widetilde{b})\|^{1+\varrho}).

The positive definiteness of the above linear operator 𝒱~\widetilde{{\mathcal{V}}} is proved in the next proposition.

Proposition 3

The linear operator 𝒱~:p×qp×q\widetilde{{\mathcal{V}}}:\mathbb{R}^{p\times q}\rightarrow\mathbb{R}^{p\times q} defined in (41) is a self-adjoint and positive definite operator.

Proof

From (meng2005semismoothness, , Proposition 1), we know that each element 𝒢{\mathcal{G}} in Π𝔹2τ(W~)\partial\Pi_{\mathbb{B}^{\tau}_{2}}(\widetilde{W}) is self-adjoint and positive semidefinite. So 𝒱~\widetilde{{\mathcal{V}}} is self-adjoint. If the index set 𝒥1{\mathcal{J}}_{1} is empty, then 𝒱~=+σ𝒢\widetilde{{\mathcal{V}}}={\mathcal{I}}+\sigma{\mathcal{G}} is positive definite. Otherwise, for any dWp×q{0}d_{W}\in\mathbb{R}^{p\times q}\setminus\{0\}, it follows from (41) that

dW,𝒱~dW=dW,[+σ𝒢+σ𝒜𝒥1(I|𝒥1|σ(σ|𝒥1|+ρ)1y𝒥1y𝒥1)𝒜𝒥1]dW\displaystyle\left\langle d_{W},\widetilde{{\mathcal{V}}}d_{W}\right\rangle=\left\langle d_{W},\left[{\mathcal{I}}+\sigma{\mathcal{G}}+\sigma{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}\left(I_{|\scriptscriptstyle{\mathcal{J}}_{1}|}-\sigma(\sigma|{\mathcal{J}}_{1}|+\rho)^{-1}y_{\scriptscriptstyle{\mathcal{J}}_{1}}y^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}\right){\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}\right]d_{W}\right\rangle
=\displaystyle= dW,(+σ𝒢)dW+σ𝒜𝒥1dW,(I|𝒥1|σ(σ|𝒥1|+ρ)1y𝒥1y𝒥1)(𝒜𝒥1dW)\displaystyle\left\langle d_{W},({\mathcal{I}}+\sigma{\mathcal{G}})d_{W}\right\rangle+\sigma\left\langle{\mathcal{A}}_{{\mathcal{J}}_{1}}d_{W},\left(I_{|{\mathcal{J}}_{1}|}-\sigma(\sigma|{\mathcal{J}}_{1}|+\rho)^{-1}y_{{\mathcal{J}}_{1}}y^{\top}_{{\mathcal{J}}_{1}}\right)({\mathcal{A}}_{{\mathcal{J}}_{1}}d_{W})\right\rangle
\displaystyle\geq dW,(+σ𝒢)dW+σ𝒜𝒥1dW,(I|𝒥1|y𝒥1y𝒥1/|𝒥1|)(𝒜𝒥1dW)\displaystyle\left\langle d_{W},({\mathcal{I}}+\sigma{\mathcal{G}})d_{W}\right\rangle+\sigma\left\langle{\mathcal{A}}_{{\mathcal{J}}_{1}}d_{W},\left(I_{|{\mathcal{J}}_{1}|}-y_{{\mathcal{J}}_{1}}y^{\top}_{{\mathcal{J}}_{1}}/|{\mathcal{J}}_{1}|\right)({\mathcal{A}}_{{\mathcal{J}}_{1}}d_{W})\right\rangle
\displaystyle\geq dW,(+σ𝒢)dW>0,\displaystyle\langle d_{W},({\mathcal{I}}+\sigma{\mathcal{G}})d_{W}\rangle>0,

where the second-to-last inequality follows from the fact that I|𝒥1|y𝒥1y𝒥1/|𝒥1|I_{|\scriptscriptstyle{\mathcal{J}}_{1}|}-y_{\scriptscriptstyle{\mathcal{J}}_{1}}y^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}/|{\mathcal{J}}_{1}| is an orthogonal projection matrix in |𝒥1|×|𝒥1|\mathbb{R}^{\scriptscriptstyle|{\mathcal{J}}_{1}|\times|{\mathcal{J}}_{1}|}. The desired conclusion is achieved. ∎

Table 1 summarizes the main computational costs of 𝒱~H\widetilde{{\mathcal{V}}}H computed in the traditional manner based on (22) and by exploiting the solution’s low rank and sample sparsity as in (40). It indicates that when calculating the term 𝒱~H\widetilde{{\mathcal{V}}}H, the main expense is only O(max{|𝒥1|,|α|}pq)O(\max\{|{\mathcal{J}}_{1}|,|\alpha|\}pq), which is significantly lower than O(npq)O(npq) when nqpn\gg q\geq p.

Table 1: The main computational costs of 𝒱~H\widetilde{{\mathcal{V}}}H for a given HH under two different manners with nqpn\geq q\geq p
Traditional manner in (22) Exploiting the sparsity of MM and 𝒢{\mathcal{G}} in (40)
main terms cost main terms cost
yMyy^{\top}My O(n)O(n) y𝒥1y𝒥1y^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}y_{\scriptscriptstyle{\mathcal{J}}_{1}} O(|𝒥1|)O(|{\mathcal{J}}_{1}|)
𝒜My{\mathcal{A}}^{*}My O(npq)O(npq) 𝒜𝒥1y𝒥1{{\mathcal{A}}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}y_{\scriptscriptstyle{\mathcal{J}}_{1}} O(|𝒥1|pq)O(|{\mathcal{J}}_{1}|pq)
yM𝒜Hy^{\top}M{\mathcal{A}}H O(npq)O(npq) y𝒥1𝒜𝒥1Hy^{\top}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}H O(|𝒥1|pq)O(|{\mathcal{J}}_{1}|pq)
𝒜M𝒜H{\mathcal{A}}^{*}M{\mathcal{A}}H O(npq)O(npq) 𝒜𝒥1𝒜𝒥1H{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{J}}_{1}}{\mathcal{A}}_{\scriptscriptstyle{\mathcal{J}}_{1}}H O(|𝒥1|pq)O(|{\mathcal{J}}_{1}|pq)
𝒢H{\mathcal{G}}H O(pq2)O(pq^{2}) 𝒢H{\mathcal{G}}H O(|α|pq)O(|\alpha|pq)
total cost O(npq)O(npq) total cost O(max{|𝒥1|,|α|}pq)O(\max\{|{\mathcal{J}}_{1}|,|\alpha|\}pq)
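To make the comparison in Table 1 concrete, a matrix-free MATLAB sketch of solving the first equation in (40) by conjugate gradients is given below. Here applyG stands in for the operator 𝒢 from Subsection 4.1, apply_H is the routine sketched at the end of Subsection 4.2, and Rhs1, rhs2, etabar, varrho, and norm_grad are placeholder names for the corresponding quantities in (40) and in the inexact rule; none of these names is taken from the authors' code.

```matlab
% A hedged, matrix-free sketch of solving the reduced Newton system (40) with pcg.
m      = numel(J1);
AJ1y   = sum(X(:,:,J1), 3);                           % A_{J1}^* y_{J1} (note y_j^2 = 1)
rhsW   = Rhs1 - (sigma*rhs2)/(sigma*m + rho) * AJ1y;  % right-hand side of the first equation
applyV = @(D) D + sigma*applyG(D) + apply_H(D, X, y, J1, sigma, rho);
afun   = @(x) reshape(applyV(reshape(x, p, q)), [], 1);
tolabs = min(etabar, norm_grad^(1 + varrho));         % absolute tolerance from the inexact rule
[xw, ~] = pcg(afun, rhsW(:), tolabs / max(norm(rhsW(:)), 1), 200);  % pcg takes a relative tolerance
dW = reshape(xw, p, q);
ad = 0;                                               % y_{J1}^T A_{J1} dW (again y_j^2 = 1)
for t = 1:m, ad = ad + sum(sum(X(:,:,J1(t)) .* dW)); end
db = (rhs2 - sigma*ad) / (sigma*m + rho);             % second equation of (40)
```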

5 An adaptive strategy

The previous section concentrated on solving the SMM model (P) with fixed parameters CC and τ\tau. In many cases, one needs to compute a solution path {W(Ci),b(Ci)}i=1N\{W(C_{i}),b(C_{i})\}^{N}_{i=1} across a specified sequence of grid points 0<C1<C2<<CN0<C_{1}<C_{2}<\ldots<C_{N}. A typical example is to optimize the hyperparameter CC through cross-validation while maintaining τ\tau constant in model (P). This section presents an adaptive sieving (AS) strategy that iteratively reduces the sample size, thereby facilitating the efficient computation of the solution path for CC in the model (P), particularly for extremely large nn.

We develop the AS strategy based on the following two key observations. First, the solution (W¯,b¯)(\overline{W},\overline{b}) of (1) only depends on the samples corresponding to its support matrices (abbreviated as active samples). Identifying these active samples can reduce the scale of the problem and thereby reduce its computational cost. Second, as shown in Figure 1, when Ci1<CiC_{i-1}<C_{i}, the set of support matrices at C=Ci1C=C_{i-1} is always a superset of that at C=CiC=C_{i}. Thus, to solve model (1) at C=CiC=C_{i}, it is reasonable to initially restrict it to the active sample set at (W(Ci1),b(Ci1))(W(C_{i-1}),b(C_{i-1})).

The core idea behind the AS strategy can be viewed as a specialized warm-start approach. It aggressively guesses and adjusts the active samples at (W(Ci),b(Ci))(W(C_{i}),b(C_{i})) for a new grid point CiC_{i}, using the solution (W(Ci1),b(Ci1))(W(C_{i-1}),b(C_{i-1})) obtained from the previous grid point Ci1C_{i-1}. Specifically, we initiate the iith problem within a restricted sample space that includes all active samples at (W(Ci1),b(Ci1))(W(C_{i-1}),b(C_{i-1})). After solving the restricted SMM model, we update active samples at the computed solution. The process is repeated until no additional active samples need to be included.

A similar strategy has been employed to efficiently compute solution paths for convex composite sparse machine learning problems by leveraging the inherent sparsity of solutions yuan2023adaptive ; yuan2022dimension ; li2023mars ; wu2023convex . However, since the SMM model (P) does not guarantee the sparsity of its solution, the methods proposed in yuan2023adaptive ; lin2020adaptive cannot be directly applied here. We instead develop an adaptive sieving strategy that efficiently generates a solution path for the SMM model (P) by exploiting the intrinsic sparsity of the samples. For a specified index set [n]{\mathcal{I}}\subseteq[n], we define

Si:={x||| 0xjCi,j=1,,||}.S^{\,i}_{{\mathcal{I}}}:=\{x\in\mathbb{R}^{|{\mathcal{I}}|}\ |\ 0\leq x_{j}\leq C_{i},j=1,\ldots,|{\mathcal{I}}|\}.

The subscript {\mathcal{I}} is omitted when =[n]{\mathcal{I}}=[n]. The detailed framework of the AS strategy is presented in Algorithm 3.

Algorithm 3 An AS strategy

Choose an initial point (W¯(C0),b¯(C0))p×q×(\overline{W}(C_{0}),\overline{b}(C_{0}))\in\mathbb{R}^{p\times q}\times\mathbb{R} , a sequence of grid points 0<C0<C1<<CN0<C_{0}<C_{1}<\ldots<C_{N}, a tolerance ε0\varepsilon\geq 0, a scalar ε^0\widehat{\varepsilon}\geq 0, and a positive integer dmaxd_{\max}. Compute the initial index set (C0):={j[n]|yj(W¯(C0),Xj+b¯(C0))1+ε^}{\mathcal{I}}^{*}(C_{0}):=\{j\in[n]\ |\ y_{j}(\langle\overline{W}(C_{0}),X_{j}\rangle+\overline{b}(C_{0}))\leq 1+\widehat{\varepsilon}\}. Execute the following steps for i=1,2,,Ni=1,2,\ldots,N with the initialization 0(Ci)=(Ci1){\mathcal{I}}^{0}(C_{i})={\mathcal{I}}^{*}(C_{i-1}) and k=0k=0.

Step 1. Find a KKT tuple (Wk(Ci),bk(Ci),vk(Ci),Uk(Ci),λk(Ci),Λk(Ci))(W^{k}(C_{i}),b^{k}(C_{i}),v^{k}_{\scriptscriptstyle{\mathcal{I}}}(C_{i}),U^{k}(C_{i}),\lambda^{k}_{\scriptscriptstyle{\mathcal{I}}}(C_{i}),\Lambda^{k}(C_{i})) p×q××|k(Ci)|×p×q×|k(Ci)|×p×q\in\mathbb{R}^{p\times q}\times\mathbb{R}\times\mathbb{R}^{|{\mathcal{I}}^{k}(C_{i})|}\times\mathbb{R}^{p\times q}\times\mathbb{R}^{|{\mathcal{I}}^{k}(C_{i})|}\times\mathbb{R}^{p\times q} of the following problem
minimizeW,b,v,U12W2+τU+δSki(v)δW,Wbδbδv,vδU,Usubject to𝒜k(Ci)W+byk(Ci)+v(en)k(Ci)=δλ,WU=δΛ,\begin{array}[]{cc}\mathop{\mbox{minimize}}\limits_{\tiny W,b,v_{\scriptscriptstyle{\mathcal{I}}},U}&{\displaystyle\frac{1}{2}}\|W\|^{2}_{{\mathcal{F}}}+\tau\|U\|_{*}+\delta^{*}_{S^{\,i}_{\scriptscriptstyle{\mathcal{I}}^{k}}}(v_{\scriptscriptstyle{\mathcal{I}}})-\langle\delta_{W},W\rangle-b\,\delta_{b}-\langle\delta_{v_{\scriptscriptstyle{\mathcal{I}}}},v_{\scriptscriptstyle{\mathcal{I}}}\rangle-\langle\delta_{U},U\rangle\\ \mbox{subject to}&{\mathcal{A}}_{{\mathcal{I}}^{k}(C_{i})}W+by_{\scriptscriptstyle{\mathcal{I}}^{k}(C_{i})}+v_{\scriptscriptstyle{\mathcal{I}}}-(e_{n})_{{\mathcal{I}}^{k}(C_{i})}=\delta_{\lambda_{\scriptscriptstyle{\mathcal{I}}}},\ W-U=\delta_{\Lambda},\end{array} (42)
where 𝒜k(Ci){\mathcal{A}}_{{\mathcal{I}}^{k}(C_{i})} is defined in (38) and (δW,δb,δv,δU,δλ,δΛ)p×q××|k(Ci)|×p×q×|k(Ci)|×p×q(\delta_{W},\delta_{b},\delta_{v_{\scriptscriptstyle{\mathcal{I}}}},\delta_{U},\delta_{\lambda_{\scriptscriptstyle{\mathcal{I}}}},\delta_{\Lambda})\in\mathbb{R}^{p\times q}\times\mathbb{R}\times\mathbb{R}^{|{\mathcal{I}}^{k}(C_{i})|}\times\mathbb{R}^{p\times q}\times\mathbb{R}^{|{\mathcal{I}}^{k}(C_{i})|}\times\mathbb{R}^{p\times q} is an error satisfying
max(δW,|δb|,δv,δU,δλ,δΛ)ε.\max(\|\delta_{W}\|,|\delta_{b}|,\|\delta_{v_{\scriptscriptstyle{\mathcal{I}}}}\|,\|\delta_{U}\|,\|\delta_{\lambda_{\scriptscriptstyle{\mathcal{I}}}}\|,\|\delta_{\Lambda}\|)\leq\varepsilon. (43)
Step 2. Compute index set
𝒥k(Ci):={j¯k(Ci)|vjk(Ci)0},{\mathcal{J}}^{k}(C_{i}):=\{j\in\overline{{\mathcal{I}}}^{k}(C_{i})\ |\ v^{k}_{j}(C_{i})\geq 0\}, (44)
where ¯k(Ci):=[n]k(Ci)\overline{{\mathcal{I}}}^{k}(C_{i}):=[n]\setminus{\mathcal{I}}^{k}(C_{i}). Extend vk(Ci)v^{k}_{\scriptscriptstyle{\mathcal{I}}}(C_{i}) and λk(Ci)\lambda^{k}_{{\mathcal{I}}}(C_{i}) to nn-dimensional vectors vk(Ci)v^{k}(C_{i}) and λk(Ci)\lambda^{k}(C_{i}) by setting vjk(Ci)=1yj(Wk(Ci),Xj+bk(Ci))v^{k}_{j}(C_{i})=1-y_{j}(\langle W^{k}(C_{i}),X_{j}\rangle+b^{k}(C_{i})) for j¯k(Ci)j\in\overline{{\mathcal{I}}}^{k}(C_{i}), and λjk(Ci)=C\lambda^{k}_{j}(C_{i})=-C if j𝒥k(Ci)j\in{\mathcal{J}}^{k}(C_{i}), otherwise λjk(Ci)=0\lambda^{k}_{j}(C_{i})=0.
Step 3. If 𝒥k(Ci)={\mathcal{J}}^{k}(C_{i})=\emptyset, stop and set (W¯(Ci),b¯(Ci),v¯(Ci),U¯(Ci),λ¯(Ci)(\overline{W}(C_{i}),\overline{b}(C_{i}),\overline{v}(C_{i}),\overline{U}(C_{i}),\overline{\lambda}(C_{i}), Λ¯(Ci))=(Wk(Ci),bk(Ci),vk(Ci),Uk(Ci),λk(Ci),Λk(Ci))\overline{\Lambda}(C_{i}))=(W^{k}(C_{i}),b^{k}(C_{i}),v^{k}(C_{i}),U^{k}(C_{i}),\lambda^{k}(C_{i}),\Lambda^{k}(C_{i})) with (Ci)={j[n]|yj(W¯(Ci),Xj+b¯(Ci))1+ε^}{\mathcal{I}}^{*}(C_{i})=\{j\in[n]\ |\ y_{j}(\langle\overline{W}(C_{i}),X_{j}\rangle+\overline{b}(C_{i}))\leq 1+\widehat{\varepsilon}\}. Let ii+1i\leftarrow i+1 (unless i=Ni=N already so that the algorithm should be stopped). Otherwise compute the index set
\widehat{{\mathcal{J}}}^{k+1}(C_{i}):=\left\{j\in{\mathcal{J}}^{k}(C_{i})\ \big{|}\ (v^{k}(C_{i}))_{j}\ \mbox{is among the first}\ d\ \mbox{largest values in}\ \{v^{k}_{t}(C_{i})\}_{t\in{\mathcal{J}}^{k}(C_{i})}\right\},
where dd is a given positive integer satisfying dmin{|𝒥k(Ci)|,dmax}d\leq\min\{|{\mathcal{J}}^{k}(C_{i})|,d_{\max}\}. Let k+1(Ci):=k(Ci)𝒥^k+1(Ci){\mathcal{I}}^{k+1}(C_{i}):={\mathcal{I}}^{k}(C_{i})\cup\widehat{{\mathcal{J}}}^{k+1}(C_{i}). Set kk+1k\leftarrow k+1 and go to Step 1.
Remark 1
  • (a)

    In Step 2 of Algorithm 3, the index set 𝒥k(Ci){\mathcal{J}}^{k}(C_{i}) can be expressed as

    𝒥k(Ci)={j¯k(Ci)|yj(Wk(Ci),Xj+bk(Ci))1}.{\mathcal{J}}^{k}(C_{i})=\{j\in\overline{{\mathcal{I}}}^{k}(C_{i})\ |\ y_{j}(\langle W^{k}(C_{i}),X_{j}\rangle+b^{k}(C_{i}))\leq 1\}.

    It aims to identify all indices of the support matrices at (Wk(Ci),bk(Ci))(W^{k}(C_{i}),b^{k}(C_{i})) beyond k(Ci){\mathcal{I}}^{k}(C_{i}), as described by (5).

  • (b)

    In Step 3 of Algorithm 3, the nonnegative scalar ε^\widehat{\varepsilon} provides flexibility in adjusting the size of the index set (Ci){\mathcal{I}}^{*}(C_{i}). Specifically, if ε^=0\widehat{\varepsilon}=0, (Ci){\mathcal{I}}^{*}(C_{i}) includes all indices of active samples at (W¯(Ci),b¯(Ci))(\overline{W}(C_{i}),\overline{b}(C_{i})). In general, by selecting an appropriate positive scalar ε^>0\widehat{\varepsilon}>0, we aim to ensure that (Ci){\mathcal{I}}^{*}(C_{i}) includes as many active indices as possible at the solution for the next grid point Ci+1C_{i+1}.

  • (c)

    The reduced subproblem (42) shares the same mathematical structure as (P) with a reduced subset of training samples {Xj,yj}jk(Ci)\{X_{j},y_{j}\}_{j\in{\mathcal{I}}^{k}(C_{i})}. This similarity allows us to efficiently apply Algorithm 1 to solve (42).
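To summarize the flow of Algorithm 3 at a single grid point, a schematic MATLAB sketch is given below. The routine solve_reduced_smm is a hypothetical placeholder for applying Algorithm 1 to the reduced subproblem (42), and Xmat (a p-by-q-by-n array), ylab, I0, Ci, tau, and dmax are assumed inputs; the error tolerances and the construction of the final index set in Step 3 are omitted for brevity.

```matlab
% A schematic sketch of the AS loop in Algorithm 3 for one grid point C_i;
% solve_reduced_smm is a hypothetical placeholder for ALM-SNCG applied to (42).
Iact = I0;                                        % I^0(C_i) = I^*(C_{i-1})
while true
    sol  = solve_reduced_smm(Xmat, ylab, Iact, Ci, tau);   % Step 1
    Ibar = setdiff(1:n, Iact);
    viol = -inf(n, 1);                            % v_j = 1 - y_j(<W, X_j> + b) on Ibar
    for j = Ibar
        viol(j) = 1 - ylab(j) * (sum(sum(sol.W .* Xmat(:,:,j))) + sol.b);
    end
    J = Ibar(viol(Ibar) >= 0);                    % Step 2: the index set in (44)
    if isempty(J), break; end                     % Step 3: no further active samples
    [~, ord] = sort(viol(J), 'descend');
    dnum = min(numel(J), dmax);
    Iact = union(Iact, J(ord(1:dnum)));           % add the d most violated samples
end
```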

Lastly, we establish the finite convergence of Algorithm 3 in the following theorem. Its proof is given in Appendix C.

Theorem 5.1

For each CiC_{i}, Algorithm 3 terminates within a finite number of iterations. That is, there exists an integer k¯[1,n]\bar{k}\in[1,n] such that 𝒥k¯(Ci)={\mathcal{J}}^{\,\bar{k}}(C_{i})=\emptyset. Furthermore, any output solution (W¯(Ci),b¯(Ci),v¯(Ci),U¯(Ci),λ¯(Ci),Λ¯(Ci))\left(\overline{W}(C_{i}),\overline{b}(C_{i}),\overline{v}(C_{i}),\overline{U}(C_{i}),\overline{\lambda}(C_{i}),\overline{\Lambda}(C_{i})\right) is a KKT tuple for the following problem

minimize(W,b,v,U)𝒳12W2+τU+δSi(v)δW,Wbδbδv,vδU,Usubject to𝒜W+by+ven=δλ,WU=δΛ,\begin{array}[]{cc}\mathop{\mbox{minimize}}\limits_{(W,b,v,U)\in{\mathcal{X}}}&{\displaystyle\frac{1}{2}}\|W\|^{2}_{{\mathcal{F}}}+\tau\|U\|_{*}+\delta^{*}_{S^{i}}(v)-\langle\delta_{W},W\rangle-b\,\delta_{b}-\langle\delta_{v},v\rangle-\langle\delta_{U},U\rangle\\ \mbox{subject to}&{\mathcal{A}}W+by+v-e_{n}=\delta_{\lambda},\ W-U=\delta_{\Lambda},\end{array} (45)

where the error (δW,δb,δv,δU,δλ,δΛ)(\delta_{W},\delta_{b},\delta_{v},\delta_{U},\delta_{\lambda},\delta_{\Lambda}) satisfies max(δW,|δb|,δv,\max(\|\delta_{W}\|,|\delta_{b}|,\|\delta_{v}\|, δU,δλ\|\delta_{U}\|,\|\delta_{\lambda}\|, δΛ)ε\|\delta_{\Lambda}\|)\leq\varepsilon.

6 Numerical experiments

We have executed extensive numerical experiments to demonstrate the efficiency of our proposed algorithm and strategy on both large-scale synthetic and real datasets. All the experiments were implemented in MATLAB on a Windows workstation (8-core Intel(R) Core(TM) i7-10700 @ 2.90GHz, 64 GB RAM).

6.1 Implementation details

For the SMM model (P), we measure the quality of the computed solutions via the following relative KKT residual of the iterates:

ηkkt=max{ηW,ηb,ηv,ηU,ηλ,ηΛ},\eta_{kkt}=\max\{\eta_{W},\eta_{b},\eta_{v},\eta_{U},\eta_{\lambda},\eta_{\Lambda}\}, (46)

where

ηW:=W+𝒜λ+Λ1+W+𝒜λ+Λ,ηb:=|λy|1+n,ηv:=λ+ΠS(vλ)1+λ+v,\displaystyle\eta_{W}:=\frac{\|W+{\mathcal{A}}^{*}\lambda+\Lambda\|_{{\mathcal{F}}}}{1+\|W\|_{{\mathcal{F}}}+\|{\mathcal{A}}^{*}\lambda\|_{{\mathcal{F}}}+\|\Lambda\|_{{\mathcal{F}}}},\ \eta_{b}:=\frac{|\lambda^{\top}y|}{1+\sqrt{n}},\ \eta_{v}:=\frac{\|\lambda+\Pi_{S}(v-\lambda)\|}{1+\|\lambda\|+\|v\|},
ηU=ΛΠ𝔹2τ(U+Λ)1+Λ+U,ηλ=𝒜W+by+ven1+n,ηΛ=WU1+W+U.\displaystyle\eta_{U}=\frac{\|\Lambda-\Pi_{\mathbb{B}^{\tau}_{2}}(U+\Lambda)\|_{{\mathcal{F}}}}{1+\|\Lambda\|_{{\mathcal{F}}}+\|U\|_{{\mathcal{F}}}},\ \eta_{\lambda}=\frac{\|\ {\mathcal{A}}W+by+v-e_{n}\|}{1+\sqrt{n}},\ \eta_{\Lambda}=\frac{\|W-U\|_{{\mathcal{F}}}}{1+\|W\|_{{\mathcal{F}}}+\|U\|_{{\mathcal{F}}}}.
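For illustration, a minimal MATLAB sketch of evaluating these residuals is given below. The data layout (Xmat as a p-by-q-by-n array, ylab as the label vector) and the helper proj_spectral are assumptions made for exposition; the projection onto the spectral-norm ball is realized by clipping the singular values at tau, and this sketch is not the implementation used in the experiments.

```matlab
% A minimal sketch of the relative KKT residual (46) at an iterate (W, b, v, U, lambda, Lam).
AW  = zeros(n, 1);  AsL = zeros(p, q);
for j = 1:n
    AW(j) = ylab(j) * sum(sum(Xmat(:,:,j) .* W));         % (A W)_j
    AsL   = AsL + lambda(j) * ylab(j) * Xmat(:,:,j);      % A^* lambda
end
PiS  = @(x) min(max(x, 0), C);                            % projection onto S = [0, C]^n
etaW = norm(W + AsL + Lam, 'fro') / (1 + norm(W,'fro') + norm(AsL,'fro') + norm(Lam,'fro'));
etab = abs(lambda' * ylab) / (1 + sqrt(n));
etav = norm(lambda + PiS(v - lambda)) / (1 + norm(lambda) + norm(v));
etaU = norm(Lam - proj_spectral(U + Lam, tau), 'fro') / (1 + norm(Lam,'fro') + norm(U,'fro'));
etal = norm(AW + b*ylab + v - ones(n,1)) / (1 + sqrt(n));
etaL = norm(W - U, 'fro') / (1 + norm(W,'fro') + norm(U,'fro'));
eta_kkt = max([etaW, etab, etav, etaU, etal, etaL]);

function Z = proj_spectral(X, tau)                        % clip singular values at tau
    [Us, S, Vs] = svd(X, 'econ');
    Z = Us * min(S, tau) * Vs';
end
```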

By default, the termination criterion for Algorithm 1 (denoted as ALM-SNCG) is either satisfying ηkktε\eta_{kkt}\leq\varepsilon or reaching the maximum number of 500 iterations. When the compared algorithms stop under different criteria, we use the objective value obtained from ALM-SNCG satisfying ηkkt108\eta_{kkt}\leq 10^{-8} (denoted as objopt\mbox{obj}_{\scriptsize\mbox{opt}}) as the benchmark to evaluate the quality of a computed objective value (denoted as objcomputed\mbox{obj}_{\scriptsize\mbox{computed}}):

Relobj:=|objcomputedobjopt|1+|objopt|.\mbox{Relobj}:=\frac{|\mbox{obj}_{\scriptsize\mbox{computed}}-\mbox{obj}_{\scriptsize\mbox{opt}}|}{1+|\mbox{obj}_{\scriptsize\mbox{opt}}|}. (47)

Both synthetic and real datasets were utilized in the subsequent experiments. The sampling data {(Xi,yi)}i=1np×q×{1,1}\{(X_{i},y_{i})\}^{n}_{i=1}\subseteq\mathbb{R}^{p\times q}\times\{-1,1\} was generated following the process in luo2015support . For any k[p]k\in[p] and [q]\ell\in[q], denote ak:=[(X1)k,,(Xn)k]na_{k\ell}:=\left[(X_{1})_{k\ell},\ldots,(X_{n})_{k\ell}\right]^{\top}\in\mathbb{R}^{n}. Each vector aka_{k\ell} is constructed as:

ak=br/q+ϵkwithϵkN(0,δ2In),k=1,,p,=1,,q.a_{k\ell}=b_{\lceil r\ell/q\rceil}+\epsilon_{k\ell}\ \mbox{with}\ \epsilon_{k\ell}\sim N(0,\delta^{2}I_{n}),\qquad k=1,\ldots,p,\quad\ell=1,\ldots,q.

Here, b1,b2,,brb_{1},b_{2},\ldots,b_{r} are rr orthonormal vectors in n\mathbb{R}^{n}, and 0<rmin{p,q}0<r\leq\min\{p,q\}. Given a p×qp\times q matrix WW with rank rr, each sample’s class label is determined by yi=sign(W,Xi)y_{i}=\mbox{sign}(\langle W,X_{i}\rangle) for i[n]i\in[n], where sign(x)\mbox{sign}(x) equals 11 if x0x\geq 0 and 1-1 otherwise. In our experiments, we set r=20r=20 and δ=2×104\delta=2\times 10^{-4}. For each data configuration, the sample sizes nn vary from 1.25×1041.25\times 10^{4} to 1.25×1061.25\times 10^{6}, with 80%80\% used as training data and the remaining 20%20\% used for testing.
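A minimal MATLAB sketch of this data-generation process is given below; the use of orth(randn(n,r)) to obtain the orthonormal vectors b_1,...,b_r and of randn(p,r)*randn(r,q) to obtain a rank-r matrix W are illustrative assumptions, since the text does not prescribe these particular choices.

```matlab
% A hedged sketch of the synthetic data generation described above (n, p, q as in the text).
rng(0);                                      % fixed seed, an assumption for reproducibility
r = 20;  delta = 2e-4;
B = orth(randn(n, r));                       % columns b_1,...,b_r are orthonormal in R^n
Xmat = zeros(p, q, n);
for k = 1:p
    for l = 1:q
        akl = B(:, ceil(r*l/q)) + delta*randn(n, 1);   % a_{kl} = b_{ceil(r*l/q)} + eps_{kl}
        Xmat(k, l, :) = reshape(akl, 1, 1, n);
    end
end
W = randn(p, r) * randn(r, q);               % a p-by-q matrix of rank r (one possible choice)
ylab = zeros(n, 1);
for i = 1:n
    s = sum(sum(W .* Xmat(:,:,i)));          % <W, X_i>
    ylab(i) = sign(s) + (s == 0);            % sign(x) = 1 if x >= 0, and -1 otherwise
end
```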

We also utilized four distinct real datasets: EEG, INRIA, CIFAR-10, and MNIST.

Additional information for these datasets is provided in Table 2, where ntrainn_{\scriptsize\mbox{train}} and ntestn_{\scriptsize\mbox{test}} denote the number of training and test samples, respectively.

Table 2: Summary of four real datasets
datasets p×qp\times q ntrainn_{\scriptsize\mbox{train}} ntestn_{\scriptsize\mbox{test}} datasets p×qp\times q ntrainn_{\scriptsize\mbox{train}} ntestn_{\scriptsize\mbox{test}}
EEG 256×64256\times 64 300 66 CIFAR-10 32×3232\times 32 10000 2000
INRIA 70×13470\times 134 3634 1579 MNIST 28×2828\times 28 60000 10000

In subsequent experiments, we set the parameter τ\tau in the SMM model (1) to 10 and 100 for random data, and to 1 and 10 for real data.

6.2 Solving the SMM model at a fixed parameter value CC

In this subsection, we evaluate four algorithms - ALM-SNCG, the inexact semi-proximal ADMM (isPADMM), the symmetric Gauss-Seidel based isPADMM (sGS-isPADMM), and Fast ADMM with restart (F-ADMM) - for solving the SMM model (1) with fixed parameters τ\tau and CC. Additional details on isPADMM and sGS-isPADMM can be found in Appendices D and E, respectively. The F-ADMM is based on the works of Luo et al. luo2015support (its code is available for download at: http://bcmi.sjtu.edu.cn/~luoluo/code/smm.zip) and Goldstein et al. goldstein2014fast. Each method is subject to a two-hour computational time limit. For consistent comparison, we adopt the termination criterion Relobjε\mbox{Relobj}\leq\varepsilon, where Relobj is defined in (47). We assess their performance at two accuracy levels: ε=104\varepsilon=10^{-4} and ε=106\varepsilon=10^{-6}.

Firstly, we evaluate the performance of two-block ADMM algorithms (F-ADMM and isPADMM) and three-block ADMM (sGS-isPADMM) for solving the SMM model (1) on the EEG training dataset. The “Iteration time” column in Table 3 represents the average time per iteration for each algorithm. Numerical results in Table 3 show that, on average, sGS-isPADMM is 2.6 times faster per iteration than F-ADMM and 52.2 times faster than isPADMM. However, sGS-isPADMM is 7.2 times slower than isPADMM in terms of total time consumed due to having 184.8 times more iteration steps on average, leading to higher SVD costs from the soft thresholding operator. Henceforth, we will focus on comparing the performance of two-block ADMM algorithms and ALM-SNCG on synthetic and real datasets.

Table 3: Results of sGS-isPADMM (sGS), F-ADMM (F), and isPADMM (isP) with Relobjε\mbox{Relobj}\leq\varepsilon on EEG training dataset
ε\varepsilon τ\tau CC Relobj Iteration numbers Time (seconds) Iteration time (seconds)
sGS F isP sGS F isP sGS F isP sGS F isP
1e-4 1 1e-4 1e-4 6e-5 9e-5 3876 477 19 26.1 5.0 2.0 7e-3 1e-2 1e-1
1e-3 1e-4 1e-4 5e-5 4251 285 16 27.5 4.4 3.8 6e-3 2e-2 2e-1
1e-2 1e-4 5e-4 1e-4 4329 30000 22 28.5 616.3 11.3 7e-3 2e-2 5e-1
1e-1 1e-4 5e-3 8e-5 4317 30000 23 27.7 616.2 16.1 6e-3 2e-2 7e-1
10 1e-4 6e-5 6e-5 6e-5 3324 117 20 20.6 1.2 6.0 6e-3 1e-2 3e-1
1e-3 1e-4 1e-4 1e-4 8818 8141 111 55.0 100.2 9.7 6e-3 1e-2 9e-2
1e-2 1e-4 1e-4 9e-5 7986 2841 93 49.9 58.5 22.1 6e-3 2e-2 2e-1
1e-1 1e-4 1e-1 1e-4 5428 30000 63 33.9 621.1 31.1 6e-3 2e-2 5e-1
1e-6 1 1e-4 1e-6 1e-6 9e-7 23861 25759 45 151.5 279.1 3.3 6e-3 1e-2 7e-2
1e-3 1e-6 1e-6 8e-7 7147 501 25 44.2 9.0 5.5 6e-3 2e-2 2e-1
1e-2 1e-6 5e-4 7e-7 7614 30000 48 48.5 621.7 21.7 6e-3 2e-2 5e-1
1e-1 1e-6 5e-3 9e-7 7544 30000 44 47.1 623.7 29.4 6e-3 2e-2 7e-1
10 1e-4 8e-7 7e-7 6e-7 4268 121 28 26.4 1.3 11.6 6e-3 1e-2 4e-1
1e-3 1e-6 1e-6 1e-6 26985 21117 151 168.4 259.1 12.5 6e-3 1e-2 8e-2
1e-2 1e-6 1e-6 7e-7 11853 5363 111 74.2 109.6 25.9 6e-3 2e-2 2e-1
1e-1 1e-6 6e-2 4e-7 9044 30000 89 56.5 617.2 41.1 6e-3 2e-2 5e-1

It is worth mentioning that ALM-SNCG is initialized with a lower-accuracy starting point from isPADMM, using up to four iterations for synthetic datasets and up to ten iterations for real datasets. Otherwise, the origin is used as the default starting point for the algorithms. Tables 4 and 5 present the numerical performance of the above three algorithms, including the following information:

  • |𝒥1||{\mathcal{J}}_{1}|: the cardinality of the index set 𝒥1{\mathcal{J}}_{1} defined in (37);

  • |α||\alpha|: the cardinality of the index set α\alpha defined in (24);

  • Accuracytest: the classification accuracy on test set for solutions from each algorithm, i.e.,

    Accuracytest:=|{i|y^i=(ytest)i,i=1,,ntest}|/ntest.\mbox{Accuracy}_{\scriptsize\mbox{test}}:=|\{i\ |\ \hat{y}_{i}=(y_{\scriptsize\mbox{test}})_{i},\ i=1,\ldots,n_{\scriptsize\mbox{test}}\}|/n_{\scriptsize\mbox{test}}. (48)

    Here, y^\hat{y} represents the predicted class label vector, and ytesty_{\scriptsize\mbox{test}} denotes the true class label vector for the test set.
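A small sketch of the corresponding prediction and accuracy computation follows; Xtest (a p-by-q-by-ntest array) and ytest are assumed names for the test data, and the prediction rule sign(tr(W'X_i)+b) comes from model (1).

```matlab
% A small sketch of the prediction rule and the test accuracy in (48).
yhat = zeros(ntest, 1);
for i = 1:ntest
    s = sum(sum(W .* Xtest(:,:,i))) + b;     % tr(W' X_i) + b
    yhat(i) = sign(s) + (s == 0);            % treat sign(0) as +1
end
acc_test = nnz(yhat == ytest) / ntest;
```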

6.2.1 Numerical results on synthetic data

Table 4 shows the computational results for ALM-SNCG, isPADMM, and F-ADMM on randomly generated data with CC ranging from 0.10.1 to 100100. The results indicate that ALM-SNCG and isPADMM can solve all problems within two hours. In contrast, F-ADMM encountered memory trouble when the sample sizes exceeded 10510^{5}. For (n,p,q,τ,C)=(10000,1000,500,100,100)(n,p,q,\tau,C)=(10000,1000,500,100,100), F-ADMM exhibited lower classification accuracy (Accuracytest\mbox{Accuracy}_{\scriptsize\mbox{test}}) compared to isPADMM and ALM-SNCG due to its inability to achieve the desired solution accuracy. Even for the smallest instance (n,p,q)=(10000,100,100)(n,p,q)=(10000,100,100), F-ADMM is, on average, 101.2 times slower than ALM-SNCG to reach Relobj104\mbox{Relobj}\leq 10^{-4} and 210.2 times slower to reach Relobj106\mbox{Relobj}\leq 10^{-6}. On the other hand, ALM-SNCG is on average 8.7 times faster than isPADMM for ε=104\varepsilon=10^{-4} and 13.0 times faster for ε=106\varepsilon=10^{-6}. This shows that ALM-SNCG is more efficient and stable, particularly for high-accuracy solutions.

Table 4: Results of F-ADMM (F), isPADMM (isP), and ALM-SNCG (A) with Relobjε\mbox{Relobj}\leq\varepsilon on random data
Data ε\varepsilon τ\tau CC |𝒥1||\mathcal{J}_{1}| |α||\alpha| Accuracytest\mbox{Accuracy}_{\scriptsize\mbox{test}} Relobj Time (seconds)
(n,p,q)(n,p,q) F isP A F isP A F isP A
(1e4, 1e2, 1e2) 1e-4 10 0.1 27 1 0.9876 0.9872 0.9872 8e-5 8e-5 3e-5 11.2 18.2 1.6
1 65 1 0.9932 0.9932 0.9936 8e-5 1e-4 8e-5 13.5 13.5 1.7
10 46 1 0.9928 0.9928 0.9928 4e-5 3e-5 1e-4 26.0 20.6 1.6
100 26 1 0.9936 0.9936 0.9936 9e-5 6e-5 4e-5 28.1 21.7 7.9
100 0.1 19 1 0.9844 0.9844 0.9848 8e-5 8e-5 1e-4 14.8 10.4 2.2
1 42 1 0.9880 0.9880 0.9880 3e-5 6e-5 8e-5 10.5 17.1 2.3
10 39 1 0.9932 0.9932 0.9932 1e-4 9e-5 7e-5 26.3 69.1 1.7
100 37 1 0.9932 0.9932 0.9932 1e-4 9e-5 8e-5 1526.2 37.5 2.0
1e-6 10 0.1 21 1 0.9872 0.9872 0.9872 4e-7 8e-7 8e-7 21.7 29.4 2.4
1 20 1 0.9932 0.9932 0.9932 4e-7 8e-7 8e-7 21.2 27.1 5.5
10 23 1 0.9928 0.9928 0.9928 7e-7 3e-7 8e-7 33.1 37.0 4.3
100 22 1 0.9936 0.9936 0.9936 3e-5 7e-7 9e-7 7200.6 42.4 14.0
100 0.1 11 1 0.9848 0.9848 0.9848 9e-7 7e-7 8e-7 29.5 18.9 4.9
1 18 1 0.9880 0.9880 0.9880 8e-7 8e-7 9e-7 15.4 25.6 5.3
10 23 1 0.9932 0.9932 0.9932 7e-7 4e-7 1e-6 38.7 46.4 6.0
100 22 1 0.9932 0.9932 0.9932 5e-4 8e-7 7e-7 7200.0 116.7 6.4
(1e4, 1e3, 5e2) 1e-4 10 0.1 22 1 0.9940 0.9940 0.9940 6e-5 9e-5 8e-5 421.3 2020.6 237.8
1 23 1 0.9924 0.9924 0.9924 3e-6 8e-5 9e-5 536.0 1082.0 295.1
10 23 1 0.9924 0.9924 0.9924 2e-3 8e-5 9e-5 7200.1 1565.1 255.8
100 32 107 0.9888 0.9936 0.9936 2e+0 9e-5 8e-5 7202.5 1249.0 864.5
100 0.1 20 1 0.9880 0.9880 0.9880 9e-5 8e-5 2e-5 556.5 2030.1 243.7
1 24 1 0.9940 0.9940 0.9940 4e-3 7e-5 7e-5 7202.7 1229.2 242.3
10 23 1 0.9936 0.9936 0.9936 3e-2 4e-5 6e-5 7202.5 2050.7 258.6
100 23 1 0.9836 0.9920 0.9920 2e+0 7e-5 8e-5 7200.7 2118.4 510.0
1e-6 10 0.1 20 1 0.9940 0.9940 0.9940 9e-7 1e-6 2e-7 525.8 2353.6 334.3
1 21 1 0.9924 0.9924 0.9924 7e-5 3e-7 4e-7 7203.5 1580.2 517.2
10 21 1 0.9924 0.9924 0.9924 2e-3 4e-7 7e-7 7201.9 3531.0 413.7
100 32 107 0.9916 0.9936 0.9936 2e+0 7e-7 6e-7 7202.8 2177.1 1229.5
100 0.1 20 1 0.9880 0.9880 0.9880 9e-7 8e-7 7e-7 969.2 2685.1 265.5
1 21 1 0.9940 0.9940 0.9940 4e-3 7e-7 4e-7 7201.5 3305.9 411.1
10 22 1 0.9936 0.9936 0.9936 3e-2 9e-7 9e-7 7204.6 6706.1 371.6
100 21 1 0.9792 0.9920 0.9920 3e+0 6e-7 4e-7 7200.6 3331.2 628.9
(1e5, 50, 1e2) 1e-4 10 0.1 197 1 - 0.9681 0.9683 - 7e-5 5e-5 - 27.6 3.6
1 636 1 - 0.9682 0.9682 - 5e-5 2e-5 - 33.1 6.8
10 502 1 - 0.9686 0.9687 - 4e-5 4e-5 - 38.5 15.2
100 430 45 - 0.9692 0.9693 - 7e-5 4e-5 - 93.4 88.8
100 0.1 189 1 - 0.9682 0.9681 - 7e-5 5e-5 - 56.9 3.6
1 568 1 - 0.9685 0.9685 - 9e-5 7e-5 - 64.6 6.7
10 590 1 - 0.9685 0.9685 - 6e-5 7e-5 - 76.9 4.8
100 388 1 - 0.9690 0.9690 - 7e-5 2e-5 - 111.5 38.1
1e-6 10 0.1 64 1 - 0.9683 0.9683 - 1e-6 9e-7 - 365.7 7.0
1 97 1 - 0.9682 0.9682 - 1e-6 9e-7 - 278.2 13.9
10 93 1 - 0.9687 0.9686 - 9e-7 8e-7 - 96.6 22.4
100 117 45 - 0.9693 0.9693 - 1e-6 9e-7 - 282.0 104.6
100 0.1 36 1 - 0.9682 0.9682 - 7e-7 9e-7 - 324.6 10.3
1 48 1 - 0.9684 0.9685 - 9e-7 9e-7 - 654.4 19.8
10 66 1 - 0.9685 0.9685 - 6e-7 8e-7 - 171.1 15.2
100 87 1 - 0.9690 0.9690 - 8e-7 1e-6 - 417.5 45.2
(1e6, 50, 1e2) 1e-4 10 0.1 1935 1 - 0.9018 0.9017 - 5e-5 5e-5 - 194.2 29.0
1 11357 1 - 0.9017 0.9017 - 4e-5 3e-5 - 167.5 60.0
10 9267 30 - 0.9074 0.9078 - 7e-5 4e-6 - 694.4 271.5
100 13027 50 - 0.9445 0.9448 - 7e-5 5e-5 - 2346.5 67.0
100 0.1 1348 1 - 0.9017 0.9017 - 8e-5 4e-5 - 339.2 34.7
1 7545 1 - 0.9018 0.9018 - 8e-5 6e-5 - 370.5 49.4
10 8750 1 - 0.9018 0.9018 - 7e-5 7e-6 - 327.2 154.8
100 12252 29 - 0.9408 0.9410 - 9e-5 3e-5 - 1702.6 1136.2
1e-6 10 0.1 862 1 - 0.9018 0.9017 - 7e-7 7e-7 - 372.7 42.7
1 3072 2 - 0.9017 0.9017 - 1e-6 9e-7 - 1222.0 89.9
10 4294 30 - 0.9077 0.9077 - 1e-6 9e-7 - 2003.1 290.6
100 2409 50 - 0.9450 0.9450 - 7e-7 9e-7 - 5413.7 224.1
100 0.1 312 1 - 0.9017 0.9017 - 9e-7 8e-7 - 1703.4 72.9
1 835 1 - 0.9018 0.9018 - 8e-7 1e-6 - 4299.5 127.2
10 2319 1 - 0.9018 0.9018 - 6e-7 9e-7 - 2320.4 187.4
100 1833 29 - 0.9413 0.9413 - 8e-7 9e-7 - 5371.3 1289.7

6.2.2 Numerical results on real data

Table 5 lists the results for ALM-SNCG, F-ADMM, and isPADMM on the model (1) using four real datasets described in Table 2, with the parameter CC ranging from 10410^{-4} to 100100. The results in the table show that both ALM-SNCG and isPADMM achieved high accuracy (Relobj106\mbox{Relobj}\leq 10^{-6}) in all instances, while F-ADMM only stopped successfully for 56%56\% of the problems with accuracy Relobj104\mbox{Relobj}\leq 10^{-4}, and for 47%47\% of the problems with accuracy Relobj106\mbox{Relobj}\leq 10^{-6}. Table 5 also reveals that ALM-SNCG performed on average 2.6 times (5.9 times) faster than isPADMM and 422.7 times (477.2 times) faster than F-ADMM when the stopping criterion was set to Relobj104\mbox{Relobj}\leq 10^{-4} (Relobj106\mbox{Relobj}\leq 10^{-6}).

Table 5: Results of F-ADMM (F), isPADMM (isP), and ALM-SNCG (A) with Relobjε\mbox{Relobj}\leq\varepsilon on real data
Prob ε\varepsilon τ\tau CC |𝒥1||\mathcal{J}_{1}| |α||\alpha| Accuracytest\mbox{Accuracy}_{\scriptsize\mbox{test}} Relobj Time (seconds)
(n,p,q)(n,p,q) F isP A F isP A F isP A
EEG(300,256,64)\begin{array}[]{c}\mbox{EEG}\\ (300,256,64)\end{array} 1e-4 1 1e-4 6 2 0.7424 0.7424 0.7424 6e-5 9e-5 6e-5 5.0 2.0 2.1
1e-3 60 5 0.8636 0.8636 0.8636 1e-4 5e-5 6e-5 4.4 3.8 4.5
1e-2 123 5 0.9394 0.9545 0.9394 5e-4 1e-4 7e-5 616.3 11.3 6.9
1e-1 123 5 0.9545 0.9545 0.9394 5e-3 8e-5 9e-5 616.2 16.1 12.1
10 1e-4 95 0 0.6667 0.6667 0.6667 6e-5 6e-5 4e-6 1.2 6.0 2.7
1e-3 6 2 0.7424 0.7424 0.7424 1e-4 1e-4 9e-5 100.2 9.7 5.7
1e-2 60 5 0.8636 0.8636 0.8636 1e-4 9e-5 6e-5 58.5 22.1 9.2
1e-1 124 5 0.9394 0.9394 0.9394 1e-1 1e-4 6e-5 621.1 31.1 10.8
1e-6 1 1e-4 6 2 0.7424 0.7424 0.7424 1e-6 9e-7 7e-7 279.1 3.3 2.2
1e-3 60 5 0.8636 0.8636 0.8636 1e-6 8e-7 9e-7 9.0 5.5 4.7
1e-2 123 5 0.9394 0.9394 0.9394 5e-4 7e-7 6e-7 621.7 21.7 8.2
1e-1 123 5 0.9394 0.9394 0.9394 5e-3 9e-7 5e-7 623.7 29.4 12.2
10 1e-4 91 0 0.6667 0.6667 0.6667 7e-7 6e-7 5e-8 1.3 11.6 2.6
1e-3 6 2 0.7424 0.7424 0.7424 1e-6 1e-6 7e-7 259.1 12.5 6.1
1e-2 60 5 0.8636 0.8636 0.8636 1e-6 7e-7 5e-7 109.6 25.9 10.2
1e-1 123 5 0.9394 0.9394 0.9394 6e-2 4e-7 5e-7 617.2 41.1 15.8
INRIA(3634,70,134)\begin{array}[]{c}\mbox{INRIA}\\ (3634,70,134)\end{array} 1e-4 1 1e-3 12 2 0.8892 0.8892 0.8892 1e-4 6e-5 1e-4 5.4 6.7 5.6
1e-2 100 6 0.8898 0.8898 0.8898 8e-5 1e-4 1e-4 6.0 10.6 6.5
1e-1 807 35 0.8638 0.8645 0.8645 5e-3 1e-4 7e-5 7200.1 103.5 37.3
1 1080 44 0.8385 0.8379 0.8379 8e-1 8e-5 7e-5 7200.1 173.8 44.2
10 1e-3 1716 0 0.7131 0.7131 0.7131 9e-5 8e-5 7e-5 392.7 328.7 23.9
1e-2 17 2 0.8904 0.8904 0.8904 1e-4 9e-5 9e-5 21.8 13.5 8.2
1e-1 123 6 0.8879 0.8879 0.8879 1e-4 9e-5 9e-5 43.8 30.0 17.1
1 938 27 0.8493 0.8436 0.8461 3e-1 1e-4 7e-5 7200.2 148.5 123.4
1e-6 1 1e-3 11 2 0.8892 0.8892 0.8892 9e-7 8e-7 8e-7 6.7 16.3 6.2
1e-2 98 6 0.8898 0.8898 0.8898 9e-7 9e-7 7e-7 12.6 26.7 8.2
1e-1 807 35 0.8645 0.8645 0.8645 5e-3 1e-6 9e-7 7200.0 416.1 88.2
1 1080 44 0.8391 0.8379 0.8379 8e-1 1e-6 1e-6 7200.2 502.1 59.1
10 1e-3 2028 0 0.7131 0.7131 0.7131 3e-5 7e-7 8e-7 7200.1 1508.3 26.0
1e-2 17 2 0.8911 0.8911 0.8911 1e-6 8e-7 9e-7 49.1 19.8 9.1
1e-1 123 6 0.8879 0.8879 0.8879 1e-6 9e-7 9e-7 142.8 48.6 19.9
1 937 27 0.8436 0.8461 0.8461 4e-1 1e-6 8e-7 7200.0 362.8 174.4
CIFAR-10(10000,32,32)\begin{array}[]{c}\mbox{CIFAR-10}\\ (10000,32,32)\end{array} 1e-4 1 1e-3 15 2 0.8015 0.8015 0.8020 9e-5 9e-5 1e-4 14.2 2.0 1.3
1e-2 62 5 0.8245 0.8245 0.8245 9e-5 8e-5 1e-4 8.3 1.7 1.1
1e-1 319 19 0.8205 0.8210 0.8190 1e-4 9e-5 7e-5 26.5 4.6 1.5
1 751 30 0.7790 0.7990 0.8020 4e-1 9e-5 7e-5 7200.6 12.5 2.1
10 1e-3 5 1 0.7645 0.7645 0.7645 1e-4 6e-5 8e-5 25.9 2.2 1.5
1e-2 15 2 0.8050 0.8050 0.8050 1e-4 7e-5 1e-4 26.2 2.1 1.6
1e-1 70 5 0.8235 0.8235 0.8235 1e-4 9e-5 8e-5 43.9 2.4 1.9
1 457 20 0.7805 0.8170 0.8180 2e-1 8e-5 9e-5 7200.3 4.2 1.9
1e-6 1 1e-3 11 2 0.8015 0.8020 0.8015 9e-7 6e-7 7e-7 19.4 6.4 1.5
1e-2 55 5 0.8245 0.8245 0.8245 9e-7 5e-7 7e-7 19.7 6.8 1.6
1e-1 304 19 0.8200 0.8200 0.8195 4e-5 9e-7 8e-7 7200.3 17.6 1.6
1 723 30 0.7940 0.8015 0.8015 2e-1 9e-7 9e-7 7200.4 42.0 3.6
10 1e-3 6 1 0.7645 0.7645 0.7645 1e-6 7e-7 1e-6 79.3 4.3 2.0
1e-2 12 2 0.8050 0.8050 0.8050 1e-6 8e-7 7e-7 92.2 7.5 2.1
1e-1 62 5 0.8235 0.8235 0.8235 1e-6 1e-6 8e-7 146.9 7.3 2.2
1 438 20 0.8020 0.8175 0.8170 3e-1 9e-7 8e-7 7200.2 13.5 2.9
MNIST(60000,28,28)\begin{array}[]{c}\mbox{MNIST}\\ (60000,28,28)\end{array} 1e-4 1 1e-1 212 14 0.9928 0.9928 0.9927 9e-5 9e-5 9e-5 225.0 12.8 4.0
1 422 22 0.9930 0.9930 0.9927 2e-3 9e-5 1e-4 7201.4 16.9 5.7
1e+1 498 25 0.9764 0.9922 0.9922 6e+0 8e-5 6e-5 7201.0 36.4 9.6
1e+2 549 26 0.9540 0.9917 0.9915 2e+1 9e-5 9e-5 7201.6 42.2 16.7
10 1e-1 91 5 0.9924 0.9924 0.9924 9e-5 7e-5 9e-5 178.2 7.7 4.4
1 274 14 0.9926 0.9926 0.9926 1e-3 9e-5 1e-4 7200.3 12.4 4.1
1e+1 473 21 0.9769 0.9926 0.9926 5e+0 9e-5 6e-5 7201.0 17.7 8.3
1e+2 541 25 0.9710 0.9916 0.9916 1e+1 1e-4 8e-5 7202.4 34.5 17.1
1e-6 1 1e-1 213 14 0.9928 0.9928 0.9928 4e-6 9e-7 1e-6 7200.7 42.0 4.3
1 414 22 0.9929 0.9929 0.9928 2e-3 1e-6 1e-6 7202.9 50.3 7.7
1e+1 496 25 0.9764 0.9923 0.9923 6e+0 9e-7 7e-7 7201.0 94.7 16.8
1e+2 542 26 0.9540 0.9915 0.9915 2e+1 9e-7 8e-7 7202.2 101.8 48.8
10 1e-1 91 5 0.9924 0.9924 0.9924 9e-7 9e-7 8e-7 483.5 17.1 5.2
1 269 14 0.9926 0.9926 0.9926 2e-3 8e-7 8e-7 7202.8 28.3 5.1
1e+1 470 21 0.9855 0.9926 0.9926 4e+0 1e-6 9e-7 7200.2 64.5 13.3
1e+2 532 25 0.9657 0.9916 0.9916 1e+1 8e-7 9e-7 7201.1 129.6 51.7

6.3 Computing a solution path of the SMM model for {Ci}i=1N\{C_{i}\}^{N}_{i=1}

In this subsection, we assess the numerical performance of the AS strategy on both synthetic and real datasets. In the numerical experiments, we used ALM-SNCG as the inner solver; that is, for given parameters CiC_{i} and τ\tau, problem (1) was solved by ALM-SNCG. The ALM-SNCG with the AS strategy is abbreviated as AS+ALM. We compared the performance of AS+ALM with that of the warm-started ALM-SNCG (abbreviated as Warm+ALM), evaluated at ηkkt104\eta_{kkt}\leq 10^{-4} and ηkkt106\eta_{kkt}\leq 10^{-6}, respectively. In all experiments, we set dmax=500d_{\max}=500 in Algorithm 3. Additionally, ε^\widehat{\varepsilon} is set to 0.05 for n<500,000n<500,000 and 0.1 otherwise for random data, and to 0.4 for real data.

Tables 6 and 7 show the performance of the two methods AS+ALM and Warm+ALM. Each column in the tables has the following meaning:

  • Avg¯nSM\mbox{Avg}\underline{~{}}{\mbox{nSM}}: the average number of the support matrices for the reduced subproblems (42);

  • Avg¯sam\mbox{Avg}\underline{~{}}\mbox{sam}/Max¯sam\mbox{Max}\underline{~{}}\mbox{sam}: the average/maximum sample size of (42);

  • Worst relkkt: the maximum relative KKT residuals as in (46);

  • Avg¯|𝒥1|\mbox{Avg}\underline{~{}}{|{\mathcal{J}}_{1}|}/Avg¯time\mbox{Avg}\underline{~{}}\mbox{time}: the average values of |𝒥1||{\mathcal{J}}_{1}|/computation times;

  • Iteration numbers: the average number of iterations. For AS+ALM, the column shows the average number of AS rounds, followed by the average number of outer augmented Lagrangian iterations, with the average number of inner semismooth Newton-CG iterations in parentheses.

6.3.1 Numerical results on synthetic data

Table 6 displays the numerical results from AS+ALM and Warm+ALM using a given sequence {Ci}i=1N\{C_{i}\}^{N}_{i=1} on random data. The parameter CC ranges from 0.1 to 100, divided into 50 equally spaced grid points on the log10\log_{10} scale. A notable observation from Table 6 is that, on average, the AS strategy requires only one iteration per grid point to achieve the desired solution. This suggests that the initial index sets 0(Ci){\mathcal{I}}^{0}(C_{i}) almost fully cover the indices of the support matrices at the solutions for each CiC_{i}.

Table 6 also reveals that although the average number of inner ALM-SNCG iterations in AS+ALM exceeds that of Warm+ALM across all instances, the AS strategy still accelerates Warm+ALM by an average factor of 2.69 (under ηkkt104\eta_{kkt}\leq 10^{-4}) and 3.07 (under ηkkt106\eta_{kkt}\leq 10^{-6}) in terms of time consumption. This is due to the smaller average sample size of AS’s reduced subproblems compared to that of the original SMM models. Furthermore, Figure 3 shows that the percentages of support matrices in the reduced subproblems (mean(|SM|/n)\mbox{mean}(|\mbox{SM}|/n_{\scriptscriptstyle{\mathcal{I}}})) consistently exceed those in the original problems (mean(|SM|)/n\mbox{mean}(|\mbox{SM}|)/n). Here, |SM||\mbox{SM}| and nn_{\scriptscriptstyle{\mathcal{I}}} represent the number of support matrices and samples in the reduced subproblem (42) during each AS iteration, respectively.

Table 6: Results for generating solution paths using Warm+ALM (Warm) and AS+ALM (AS) with a sequence {Ci}i=1N\{C_{i}\}^{N}_{i=1} and ηkktε\eta_{kkt}\leq\varepsilon on random data
Data ε\varepsilon τ\tau Information of the AS Avg¯|𝒥1|\underline{~{}}|{\mathcal{J}}_{1}| Worst relkkt Iteration numbers Avg¯\underline{~{}}time (seconds)
(n,p,q)(n,p,q) Avg¯\underline{~{}}nSM Avg¯\underline{~{}}sam Max¯\underline{~{}}sam Warm AS Warm AS Warm AS Warm AS
(1e4, 1e2, 1e2) 1e-4 10 1374 1532 4152 28 25 1e-4 5e-5 (2, 20) 1(3, 25) 1.50 0.49
100 2542 2825 8739 27 25 1e-4 1e-4 (3, 20) 1(4, 24) 1.50 0.65
1e-6 10 1375 1532 4153 21 21 1e-6 6e-7 (11, 46) 1(13, 50) 3.93 1.20
100 2542 2825 8736 20 19 1e-6 1e-6 (12, 47) 1(14, 50) 3.80 1.55
(1e4, 1e3, 5e2) 1e-4 10 449 507 1360 24 22 1e-4 3e-5 (5, 27) 1(10, 39) 98.37 40.39
100 962 1086 3548 25 23 1e-4 1e-4 (7, 31) 1(11, 40) 107.76 37.84
1e-6 10 449 507 1360 21 21 1e-6 4e-7 (15, 57) 1(21, 66) 211.74 69.51
100 962 1086 3550 21 21 1e-6 1e-6 (20, 71) 1(24, 73) 244.64 64.56
(1e5, 5e1, 1e2) 1e-4 10 17750 19285 44278 105 76 1e-4 4e-5 (2, 38) 1(3, 39) 12.50 2.71
100 22697 24878 67188 120 84 1e-4 4e-5 (2, 10) 1(3, 14) 3.47 1.48
1e-6 10 17753 19287 44268 32 30 1e-6 6e-7 (14, 70) 1(17, 85) 23.29 6.35
100 22700 24879 67180 27 24 1e-6 8e-7 (15, 51) 1(19, 60) 16.93 5.21
(1e6, 5e1, 1e2) 1e-4 10 276208 303668 485124 1363 984 1e-4 5e-5 (2, 24) 1(3, 28) 76.63 34.20
100 291663 321628 565727 1254 860 1e-4 5e-5 (2, 15) 1(3, 18) 57.21 34.15
1e-6 10 276231 303681 485006 413 395 1e-6 6e-7 (14, 67) 1(17, 76) 209.41 78.54
100 291677 321643 565653 234 214 1e-6 7e-7 (15, 60) 1(18, 67) 197.75 82.87
Figure 3: Comparison of support matrices percentage between reduced subproblems (mean(|SM|/n\mbox{mean}(|\mbox{SM}|/n_{\scriptscriptstyle{\mathcal{I}}})) and original problems (mean(|SM|)/n\mbox{mean}(|\mbox{SM}|)/n) on random data under ηkktε\eta_{kkt}\leq\varepsilon

6.3.2 Numerical results on real data

Lastly, we compare AS+ALM and Warm+ALM in generating the entire solution path for SMM models (P) on large-scale CIFAR-10 and MNIST training datasets. We vary CC from 10310^{-3} to 1 for the CIFAR-10 and from 10110^{-1} to 10210^{2} for the MNIST, using 50 equally spaced grid points on a log10\log_{10} scale. Similarly to the previous subsection, Table 7 and Figure 4 demonstrate that the AS strategy reduces the overall sample size of SMM models, thereby enhancing computational efficiency compared to the warm-starting strategy. Particularly, on the MNIST dataset, AS+ALM achieves average speedup factors of 1.50 and 1.92 over Warm+ALM for ηkkt104\eta_{kkt}\leq 10^{-4} and ηkkt106\eta_{kkt}\leq 10^{-6}, respectively.

Table 7: Results for generating solution paths using Warm+ALM (Warm) and AS+ALM (AS) with a sequence {Ci}i=1N\{C_{i}\}^{N}_{i=1} and ηkktε\eta_{kkt}\leq\varepsilon on real data
Prob ε\varepsilon τ\tau information of the AS Avg¯|𝒥1|\underline{~{}}|{\mathcal{J}}_{1}| Worst relkkt Iteration numbers Avg¯\underline{~{}}time (seconds)
(n,p,q)(n,p,q) Avg¯\underline{~{}}nSM Avg¯\underline{~{}}sam Max¯\underline{~{}}sam Warm AS Warm AS Warm AS Warm AS
CIFAR-10(1e4, 32, 32)\begin{array}[]{c}\mbox{CIFAR-10}\\ \mbox{(1e4, 32, 32)}\end{array} 1e-4 1 4302 5709 7045 242 242 1e-4 1e-4 (11, 52) 1(12, 52) 0.78 0.59
10 5073 6580 9661 86 86 1e-4 1e-4 (15, 110) 1(15, 109) 1.41 1.05
1e-6 1 4300 5709 7045 240 240 1e-6 1e-6 (22, 72) 1(22, 71) 1.20 0.87
10 5073 6580 9661 85 85 1e-6 1e-6 (26, 138) 1(26, 138) 1.73 1.30
MNIST(6e4, 28, 28)\begin{array}[]{c}\mbox{MNIST}\\ \mbox{(6e4, 28, 28)}\end{array} 1e-4 1 1063 1924 2403 443 442 1e-4 1e-4 (5, 73) 1(5, 75) 4.16 2.77
10 1188 2130 3316 352 352 1e-4 1e-4 (9, 96) 1(9, 98) 5.14 3.41
1e-6 1 1064 1924 2404 441 440 1e-6 1e-6 (12, 221) 1(13, 181) 10.37 5.10
10 1188 2130 3318 350 350 1e-6 1e-6 (16, 157) 1(17, 155) 7.36 4.07
Figure 4: Comparison of support matrices percentage between reduced subproblems (mean(|SM|/n)\mbox{mean}(|\mbox{SM}|/n_{\scriptscriptstyle{\mathcal{I}}})) and original problems (mean(|SM|)/n\mbox{mean}(|\mbox{SM}|)/n) on real data under ηkktε\eta_{kkt}\leq\varepsilon

7 Conclusion

In this paper, we have proposed a semismooth Newton-CG based augmented Lagrangian method for solving the large-scale support matrix machine (SMM) model. Our algorithm effectively leverages the sparse and low-rank structures inherent in the second-order information associated with the SMM model. Furthermore, we have developed an adaptive sieving (AS) strategy aimed at iteratively guessing and adjusting active samples, facilitating the rapid generation of solution paths across a fine grid of parameters CC for the SMM models. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm and strategy in solving large-scale SMM models. Nevertheless, the current AS strategy focuses on reducing the sample size of subproblems. A further research direction is to explore methods for simultaneously reducing both sample size and the dimensionality of feature matrices. This may be achieved through the development of effective combinations of the active subspace selection hsieh2014nuclear ; feng2022subspace and the AS strategy.

Appendices

Appendix A Proof of Proposition 1

Proof

To proceed, we characterize the set ΩD\Omega_{D}. It is not difficult to show that the KKT system (3) is equivalent to the following system:

{0=yλ,0by𝒜(𝒜λ+Λ)en+δS(λ),(b,λ,Λ)×𝒴.0𝒜λ+Λ+δ𝔹2τ(Λ),\left\{\begin{array}[]{l}0=y^{\top}\lambda,\\ 0\in by-{\mathcal{A}}({\mathcal{A}}^{*}\lambda+\Lambda)-e_{n}+\partial\delta_{S}(-\lambda),\qquad\qquad(b,\lambda,\Lambda)\in\mathbb{R}\times\mathcal{Y}.\\ 0\in{\mathcal{A}}^{*}\lambda+\Lambda+\partial\delta_{\mathbb{B}^{\tau}_{2}}(\Lambda),\\ \end{array}\right. (A.1)

Indeed, if (W¯,b¯,v¯,U¯,λ¯,Λ¯)𝒳×𝒴(\overline{W},\overline{b},\overline{v},\overline{U},\overline{\lambda},\overline{\Lambda})\in\mathcal{X}\times\mathcal{Y} satisfies (3), then (b¯,λ¯,Λ¯)(\overline{b},\overline{\lambda},\overline{\Lambda}) solves (A.1). Conversely, if (b¯,λ¯,Λ¯)×𝒴(\overline{b},\overline{\lambda},\overline{\Lambda})\in\mathbb{R}\times\mathcal{Y} solves (A.1), then (W¯,b¯,v¯,U¯,λ¯,Λ¯)(\overline{W},\overline{b},\overline{v},\overline{U},\overline{\lambda},\overline{\Lambda}) with W¯=𝒜λ¯Λ¯\overline{W}=-{\mathcal{A}}^{*}\overline{\lambda}-\overline{\Lambda}, U¯=W¯\overline{U}=\overline{W}, and v¯=en𝒜W¯b¯y\overline{v}=e_{n}-{\mathcal{A}}\overline{W}-\overline{b}y satisfies the KKT system (3). For any pair z¯:=(λ¯,Λ¯)ΩD\overline{z}:=(\overline{\lambda},\overline{\Lambda})\in\Omega_{D}, Υ¯=𝒜λ¯+Λ¯\overline{\Upsilon}={\mathcal{A}}^{*}\overline{\lambda}+\overline{\Lambda} is an invariant (see, e.g., mangasarian1988simple ). We then define η¯:=𝒜(Υ¯)+en\overline{\eta}:={\mathcal{A}}(\overline{\Upsilon})+e_{n}. Based on the arguments preceding Subsection 2.1 and the equivalence between (3) and (A.1), we claim that for any b¯D(z¯)\overline{b}\in{\mathcal{M}}_{D}(\overline{z}), the set ΩD\Omega_{D} can be expressed as

ΩD=𝒱D𝒢D1𝒢D2(b¯),\Omega_{D}={\mathcal{V}}_{D}\cap{\mathcal{G}}^{1}_{D}\cap{\mathcal{G}}^{2}_{D}(\overline{b}), (A.2)

where

D(z¯):={b|(b,λ¯,Λ¯)satisfies(A.1)},𝒢D1:={(λ,Λ)𝒴|yλ=0},\displaystyle{\mathcal{M}}_{D}(\overline{z}):=\{b\in\mathbb{R}\ |\ (b,\overline{\lambda},\overline{\Lambda})\ \mbox{satisfies}\ (\ref{eq:KKT_system_blamLAm})\},\ {\mathcal{G}}^{1}_{D}:=\left\{(\lambda,\Lambda)\in\mathcal{Y}\ |\ y^{\top}\lambda=0\right\},
𝒢D2(b¯):=(δS(η¯b¯y))×τΥ¯,𝒱D:={(λ,Λ)𝒴|𝒜λ+Λ=Υ¯}.\displaystyle{\mathcal{G}}^{2}_{D}(\overline{b}):=\left(-\partial\delta^{*}_{S}(\overline{\eta}-\overline{b}y)\right)\times\partial\tau\|-\overline{\Upsilon}\|_{*},\ {\mathcal{V}}_{D}:=\left\{(\lambda,\Lambda)\in\mathcal{Y}\ |\ {\mathcal{A}}^{*}\lambda+\Lambda=\overline{\Upsilon}\right\}.

Building on the above preparation, we proceed to establish the conclusion of Proposition 1. We first show that for any ((λ¯,Λ¯),(ξ¯,Ξ¯))gphδS×𝔹2τ((-\overline{\lambda},\overline{\Lambda}),(\overline{\xi},\overline{\Xi}))\in\mbox{gph}\,\partial\delta_{S\times\mathbb{B}^{\tau}_{2}}, δS×𝔹2τ\partial\delta_{S\times\mathbb{B}^{\tau}_{2}} is metrically subregular at (λ¯,Λ¯)(-\overline{\lambda},\overline{\Lambda}) for (ξ¯,Ξ¯)(\overline{\xi},\overline{\Xi}), i.e., there exist a constant κ>0\kappa>0 and a neighborhood 𝒰𝒴{\mathcal{U}}\subseteq{\mathcal{Y}} of (λ¯,Λ¯)(-\overline{\lambda},\overline{\Lambda}) such that

dist((λ,Λ),(δS×𝔹2τ)1(ξ¯,Ξ¯))κdist((ξ¯,Ξ¯),δS×𝔹2τ(λ,Λ)),(λ,Λ)𝒰.\mbox{dist}\left((\lambda,\Lambda),(\partial\delta_{S\times\mathbb{B}^{\tau}_{2}})^{-1}(\overline{\xi},\overline{\Xi})\right)\leq\kappa\mbox{dist}\left((\overline{\xi},\overline{\Xi}),\partial\delta_{S\times\mathbb{B}^{\tau}_{2}}(\lambda,\Lambda)\right),\quad\forall\,(\lambda,\Lambda)\in{\mathcal{U}}.

Here, dist(x,D):=mindDdx\mbox{dist}(x,D):=\min_{d\in D}\|d-x\| denotes the distance from xx to a closed convex set DD in a finite dimensional real Euclidean space 𝒵{\mathcal{Z}}, and gphF:={(x,y)𝒵×𝒵|yF(x)}\mbox{gph}\,F:=\{(x,y)\in{\mathcal{Z}}\times{\mathcal{Z}}\ |\ y\in F(x)\} represents the graph of a multi-valued mapping FF from 𝒵{\mathcal{Z}} to 𝒵{\mathcal{Z}}. In fact, it follows from the piecewise linearity of δS()\delta^{*}_{S}(\cdot) on n\mathbb{R}^{n} and (rockafellar2009variational, , Proposition 12.30) that δS()\partial\delta^{*}_{S}(\cdot) is piecewise polyhedral. Then, for any (ξ¯,λ¯)gphδS(\overline{\xi},-\overline{\lambda})\in\mbox{gph}\,\partial\delta^{*}_{S}, we obtain from (robinson1981some, , Proposition 1) that δS()\partial\delta^{*}_{S}(\cdot) is locally upper Lipschitz continuous at ξ¯\overline{\xi}, which further implies the metric subregularity of δS\partial\delta_{S} at λ¯-\overline{\lambda} for ξ¯\overline{\xi}. Let 𝔹1τp\mathbb{B}^{\tau}_{1}\subset\mathbb{R}^{p} be the 1\ell_{1}-norm ball centered at 0 with radius τ\tau. Similarly, for any (v,v^)gphδ𝔹1τ(v,\hat{v})\in\mbox{gph}\,\partial\delta_{\mathbb{B}^{\tau}_{1}}, δ𝔹1τ\partial\delta_{\mathbb{B}^{\tau}_{1}} is metrically subregular at vv for v^\hat{v} due to the piecewise linearity of τ\tau\|\cdot\|_{\infty} on p\mathbb{R}^{p}. For any given (Λ¯,Ξ¯)gphδ𝔹2τ(\overline{\Lambda},\overline{\Xi})\in\mbox{gph}\,\partial\delta_{\mathbb{B}^{\tau}_{2}}, it follows from (cui2017quadratic, , Proposition 3.8) that δ𝔹2τ\partial\delta_{\mathbb{B}^{\tau}_{2}} is also metrically subregular at Λ¯\overline{\Lambda} for Ξ¯\overline{\Xi}. Thus, there exist positive constants κ1\kappa_{1} and κ2\kappa_{2} and a neighborhood 𝒰{\mathcal{U}} of (λ¯,Λ¯)(-\overline{\lambda},\overline{\Lambda}) such that for any (λ,Λ)𝒰(\lambda,\Lambda)\in{\mathcal{U}},

dist((λ,Λ),(δS×𝔹2τ)(ξ¯,Ξ¯)1)\displaystyle\mbox{dist}\left((\lambda,\Lambda),(\partial\delta_{S\times\mathbb{B}^{\tau}_{2}}){}^{-1}(\overline{\xi},\overline{\Xi})\right) dist(λ,(δS)1(ξ¯))+dist(Λ,(δ𝔹2τ)1(Ξ¯))\displaystyle\leq\mbox{dist}\left(\lambda,(\partial\delta_{S})^{-1}(\overline{\xi})\right)+\mbox{dist}\left(\Lambda,(\partial\delta_{\mathbb{B}^{\tau}_{2}})^{-1}(\overline{\Xi})\right)
κ1dist(ξ¯,δS(λ))+κ2dist(Ξ¯,δ𝔹2τ(Λ))\displaystyle\leq\kappa_{1}\mbox{dist}\left(\overline{\xi},\partial\delta_{S}(\lambda)\right)+\kappa_{2}\mbox{dist}\left(\overline{\Xi},\partial\delta_{\mathbb{B}^{\tau}_{2}}(\Lambda)\right)
max{κ1,κ2}(dist(ξ¯,δS(λ))+dist(Ξ¯,δ𝔹2τ(Λ)))\displaystyle\leq\max\{\kappa_{1},\kappa_{2}\}\left(\mbox{dist}(\overline{\xi},\partial\delta_{S}(\lambda))+{\mbox{dist}}(\overline{\Xi},\partial\delta_{\mathbb{B}^{\tau}_{2}}(\Lambda))\right)
2max{κ1,κ2}dist((ξ¯,Ξ¯),δS×𝔹2τ(λ,Λ)),\displaystyle\leq\sqrt{2}\max\{\kappa_{1},\kappa_{2}\}\mbox{dist}\left((\overline{\xi},\overline{\Xi}),\partial\delta_{S\times\mathbb{B}^{\tau}_{2}}(\lambda,\Lambda)\right),

where the second inequality follows from the metric subregularity of δS\partial\delta_{S} at λ¯-\overline{\lambda} for ξ¯\overline{\xi} and the metric subregularity of δ𝔹2τ\partial\delta_{\mathbb{B}^{\tau}_{2}} at Λ¯\overline{\Lambda} for Ξ¯\overline{\Xi}.

Next, for any b¯D(z¯)\overline{b}\in{\mathcal{M}}_{D}(\overline{z}), we prove that the collection of sets {𝒱D,𝒢D1,𝒢D2(b¯)}\{{\mathcal{V}}_{D},{\mathcal{G}}^{1}_{D},{\mathcal{G}}^{2}_{D}(\overline{b})\} is boundedly linearly regular, i.e., for every bounded set B𝒴B\subset\mathcal{Y}, there exists a constant κ>0\kappa^{\prime}>0 such that for any (λ,Λ)B(\lambda,\Lambda)\in B,

dist((λ,Λ),𝒱D𝒢D1𝒢D2(b¯))κmax{dist((λ,Λ),𝒱D),dist((λ,Λ),𝒢D1),dist((λ,Λ),𝒢D2(b¯))}.\mbox{dist}\left((\lambda,\Lambda),{\mathcal{V}}_{D}\cap{\mathcal{G}}^{1}_{D}\cap{\mathcal{G}}^{2}_{D}(\overline{b})\right)\leq\kappa^{\prime}\max\left\{\mbox{dist}((\lambda,\Lambda),{\mathcal{V}}_{D}),\mbox{dist}((\lambda,\Lambda),{\mathcal{G}}^{1}_{D}),\mbox{dist}((\lambda,\Lambda),{\mathcal{G}}^{2}_{D}(\overline{b}))\right\}.

In fact, based on the arguments preceding Subsection 2.1 and (A.2), the intersection 𝒱D𝒢D1𝒢D2(b¯){\mathcal{V}}_{D}\cap{\mathcal{G}}^{1}_{D}\cap{\mathcal{G}}^{2}_{D}(\overline{b}) is nonempty. Define 𝒢D21(b¯):=(δS(η¯b¯y))×p×q{\mathcal{G}}^{21}_{D}(\overline{b}):=\left(-\partial\delta^{*}_{S}(\overline{\eta}-\overline{b}y)\right)\times\mathbb{R}^{p\times q} and 𝒢D22:=n×τΥ¯{\mathcal{G}}^{22}_{D}:=\mathbb{R}^{n}\times\partial\tau\|-\overline{\Upsilon}\|_{*}. It follows from the equality (A.2) that

ΩD=𝒱D𝒢D1𝒢D21(b¯)𝒢D22.\Omega_{D}={\mathcal{V}}_{D}\cap{\mathcal{G}}^{1}_{D}\cap{\mathcal{G}}^{21}_{D}(\overline{b})\cap{\mathcal{G}}^{22}_{D}. (A.3)

It can be checked that {𝒱D,𝒢D1,𝒢D2(b¯)}\{{\mathcal{V}}_{D},{\mathcal{G}}^{1}_{D},{\mathcal{G}}^{2}_{D}(\overline{b})\} is boundedly linearly regular if and only if {𝒱D,𝒢D1\{{\mathcal{V}}_{D},{\mathcal{G}}^{1}_{D}, 𝒢D21(b¯),𝒢D22}{\mathcal{G}}^{21}_{D}(\overline{b}),{\mathcal{G}}^{22}_{D}\} is boundedly linearly regular. Furthermore, from (rockafellar1970convex, , Corollary 19.2.1 and Theorem 23.10), δS(η¯b¯y)\partial\delta^{*}_{S}(\overline{\eta}-\overline{b}y) is a polyhedral convex set, implying that 𝒢D21(b¯){\mathcal{G}}^{21}_{D}(\overline{b}) is also polyhedral convex. Since 𝒱D{\mathcal{V}}_{D} and 𝒢D1{\mathcal{G}}^{1}_{D} are polyhedral convex, based on (bauschke1999strong, , Corollary 3) and (A.3), we only need to show that there exists (λ~,Λ~)ΩD(\widetilde{\lambda},\widetilde{\Lambda})\in\Omega_{D} such that (λ~,Λ~)ri(𝒢D22)(\widetilde{\lambda},\widetilde{\Lambda})\in\mbox{ri}\,({\mathcal{G}}^{22}_{D}), i.e., Λ~ri(τΥ¯)\widetilde{\Lambda}\in\mbox{ri}\left(\partial\tau\|-\overline{\Upsilon}\|_{*}\right). Let rank(Λ~)=r¯\mbox{rank}(\widetilde{\Lambda})=\overline{r}. It follows from (watson1992characterization, , Example 2) that Λ~2τ\|\widetilde{\Lambda}\|_{2}\leq\tau. Suppose the SVD of Λ~\widetilde{\Lambda} in (15) has the following form:

Λ~=U¯[Σ¯(Λ~)  0]V¯=[U¯1U¯(0,1)U¯0][τs¯000τΣ¯(0,1)(Λ~)0000][V¯1V¯(0,1)V¯0],\widetilde{\Lambda}=\overline{U}\left[\overline{\Sigma}(\widetilde{\Lambda})\ \,0\right]\overline{V}^{\top}=[\overline{U}_{1}\ \,\overline{U}_{(0,1)}\ \,\overline{U}_{0}]\left[\begin{array}[]{ccc}\tau{\mathcal{I}}_{\overline{s}}&0&0\\[5.69054pt] 0&\tau\overline{\Sigma}_{(0,1)}(\widetilde{\Lambda})&0\\ 0&0&0\end{array}\right]\left[\begin{array}[]{l}\overline{V}^{\top}_{1}\\[2.84526pt] \overline{V}^{\top}_{(0,1)}\\[2.84526pt] \overline{V}^{\top}_{0}\end{array}\right], (A.4)

where Σ¯(Λ~)=Diag(ν1(Λ~),,νp(Λ~))𝒮+p\overline{\Sigma}(\widetilde{\Lambda})=\mbox{Diag}\left(\nu_{1}(\widetilde{\Lambda}),\ldots,\nu_{p}(\widetilde{\Lambda})\right)\in\mathcal{S}^{p}_{+} and Σ¯(0,1)(Λ~)=Diag(νs¯+1(Λ~)\overline{\Sigma}_{(0,1)}(\widetilde{\Lambda})=\mbox{Diag}(\nu_{\,\overline{s}+1}(\widetilde{\Lambda}), ,νr¯(Λ~))𝒮r¯s¯++\ldots,\nu_{\,\overline{r}}(\widetilde{\Lambda}))\in\mathcal{S}^{\,\overline{r}-\overline{s}}_{++} with singular values 1=ν1(Λ~)==νs¯(Λ~)>νs¯+1(Λ~)νr¯(Λ~)>νr¯+1(Λ~)==νp(Λ~)=01=\nu_{1}(\widetilde{\Lambda})=\ldots=\nu_{\,\overline{s}}(\widetilde{\Lambda})>\nu_{\,\overline{s}+1}(\widetilde{\Lambda})\geq\ldots\geq\nu_{\,\overline{r}}(\widetilde{\Lambda})>\nu_{\,\overline{r}+1}(\widetilde{\Lambda})=\ldots=\nu_{p}(\widetilde{\Lambda})=0, and U¯:=[U¯1U¯(0,1)U¯0]𝒪p\overline{U}:=\left[\overline{U}_{1}\ \overline{U}_{(0,1)}\ \overline{U}_{0}\right]\in\mathcal{O}^{p} and V¯:=[V¯1V¯(0,1)V¯0]𝒪q\overline{V}:=\left[\overline{V}_{1}\ \overline{V}_{(0,1)}\ \overline{V}_{0}\right]\in\mathcal{O}^{q} whose columns form a compatible set of orthonormal left and right singular vectors of Λ~\widetilde{\Lambda} with U¯1p×s¯\overline{U}_{1}\in\mathbb{R}^{p\times\overline{s}}, U¯(0,1)p×(r¯s¯)\overline{U}_{(0,1)}\in\mathbb{R}^{p\times(\overline{r}-\overline{s})}, U¯0p×(pr¯)\overline{U}_{0}\in\mathbb{R}^{p\times(p-\overline{r})}, V¯1q×s¯\overline{V}_{1}\in\mathbb{R}^{q\times\overline{s}}, V¯(0,1)q×(r¯s¯)\overline{V}_{(0,1)}\in\mathbb{R}^{q\times(\overline{r}-\overline{s})}, and V¯0q×(qr¯)\overline{V}_{0}\in\mathbb{R}^{q\times(q-\overline{r})}. Based on (watson1992characterization, , Example 2) and (A.4), one has that Λ~ri(τΥ¯)\widetilde{\Lambda}\in\mbox{ri}(\partial\tau\|-\overline{\Upsilon}\|_{*}) if and only if rank(Υ¯)=s¯\mbox{rank}(-\overline{\Upsilon})=\overline{s}. On the other hand, from the definition of the matrix I¯\overline{I} in (14), we obtain that

τI¯Λ~=[U¯1U¯(0,1)U¯0][00000τ(Ir¯s¯Σ¯(0,1)(Λ~))0000Ipr¯0][V¯1V¯(0,1)V¯0],\tau\overline{I}-\widetilde{\Lambda}=\left[\overline{U}_{1}\ \,\overline{U}_{(0,1)}\ \,\overline{U}_{0}\right]\left[\begin{array}[]{cccc}0&0&0&0\\[5.69054pt] 0&\tau(I_{\overline{r}-\overline{s}}-\overline{\Sigma}_{(0,1)}(\widetilde{\Lambda}))&0&0\\[5.69054pt] 0&0&I_{p-\overline{r}}&0\end{array}\right]\left[\begin{array}[]{l}\overline{V}^{\top}_{1}\\[2.84526pt] \overline{V}^{\top}_{(0,1)}\\[2.84526pt] \overline{V}^{\top}_{0}\end{array}\right], (A.5)

which implies that rank(τI¯Λ~)=ps¯\mbox{rank}(\tau\overline{I}-\widetilde{\Lambda})=p-\overline{s}. It means that rank(Υ¯)=s¯\mbox{rank}(-\overline{\Upsilon})=\overline{s} if and only if p=rank(Υ¯)+rank(τI¯Λ~)p=\mbox{rank}(-\overline{\Upsilon})+\mbox{rank}(\tau\overline{I}-\widetilde{\Lambda}). Therefore, if there exists (λ~,Λ~)ΩD(\widetilde{\lambda},\widetilde{\Lambda})\in\Omega_{D} such that p=rank(Υ¯)+rank(τI¯Λ~)p=\mbox{rank}(-\overline{\Upsilon})+\mbox{rank}(\tau\overline{I}-\widetilde{\Lambda}), then {𝒱D,𝒢D1,𝒢D2(b¯)}\{{\mathcal{V}}_{D},{\mathcal{G}}^{1}_{D},{\mathcal{G}}^{2}_{D}(\overline{b})\} is boundedly linearly regular.

Based on all the above analysis, the desired conclusion can be established utilizing (cui2016asymptotic, , Theorem 3.1) and (artacho2008characterization, , Theorem 3.3). ∎

Appendix B Proof of Theorem 2

Proof

Building upon a technique similar to (li2018qsdpnal, , Proposition 3.1), we obtain the following equivalent relation:

𝒱is self-adjoint and positive definite in p×q×yMy>0,{\mathcal{V}}\ \mbox{is self-adjoint and positive definite in }\ \mathbb{R}^{p\times q}\times\mathbb{R}\quad\Longleftrightarrow\quad y^{\top}My>0, (B.1)

where 𝒱{\mathcal{V}} is any element in ^2φk(W^,b^)\widehat{\partial}^{2}\varphi_{k}(\widehat{W},\widehat{b}) defined in (18) with MΠS(ωk(W^,b^))M\in\partial\Pi_{S}(\omega_{k}(\widehat{W},\widehat{b})). Here, for any ωn\omega\in\mathbb{R}^{n}, the Clarke generalized Jacobian ΠS(ω)\partial\Pi_{S}(\omega) can be expressed as

ΠS(ω)={Diag(v)|vj=1,ifωj(0,C);vj[0,1],ifωj=0orωj=C;vj=0,ifωj<0orωj>C}.\partial\Pi_{S}(\omega)=\left\{\mbox{Diag}(v)\ \left|\ \begin{array}[]{l}v_{j}=1,\ \mbox{if}\ \omega_{j}\in(0,C);\\ v_{j}\in[0,1],\ \mbox{if}\ \omega_{j}=0\ \mbox{or}\ \omega_{j}=C;\\ v_{j}=0,\ \mbox{if}\ \omega_{j}<0\ \mbox{or}\ \omega_{j}>C\end{array}\right.\right\}. (B.2)
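To make (B.2) concrete, the following minimal Python sketch (an illustration only, not part of the authors' implementation; it assumes that S is the box [0,C]^n, consistent with (B.2), and uses placeholder data) computes the projection onto S, one element Diag(v) of the generalized Jacobian in (B.2), and the quadratic form appearing in (B.1).

import numpy as np

def proj_box(omega, C):
    # projection of omega onto S = [0, C]^n (assumed form of S)
    return np.clip(omega, 0.0, C)

def clarke_jacobian_element(omega, C):
    # one element Diag(v) of the Clarke generalized Jacobian in (B.2);
    # at the kinks omega_j in {0, C} we pick v_j = 0, which is admissible
    # since any value in [0, 1] is allowed there
    v = np.zeros_like(omega)
    v[(omega > 0.0) & (omega < C)] = 1.0
    return np.diag(v)

rng = np.random.default_rng(0)
n, C = 6, 1.0
y = rng.choice([-1.0, 1.0], size=n)
omega = rng.uniform(-0.5, 1.5, size=n)
M = clarke_jacobian_element(omega, C)
print(proj_box(omega, C))
print(y @ M @ y)  # positive exactly when some omega_j lies in (0, C) for this choice of M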

We next derive an explicit expression for lin(TS(λ^))\mbox{lin}(T_{S}(-\widehat{\lambda})). Define the following two index sets

𝒥2(λ^):={j[n]|λ^j=0}and𝒥3(λ^):={j[n]|λ^j=C}.{\mathcal{J}}_{2}(-\widehat{\lambda}):=\{j\in[n]\ |\ \widehat{\lambda}_{j}=0\}\quad\mbox{and}\quad{\mathcal{J}}_{3}(-\widehat{\lambda}):=\{j\in[n]\ |\ -\widehat{\lambda}_{j}=C\}.

Then we can deduce from (rockafellar2009variational, , Theorem 6.9) that

TS(λ^)={dn|d𝒥2(λ^)0,d𝒥3(λ^)0}.T_{S}(-\widehat{\lambda})=\{d\in\mathbb{R}^{n}\ |\ d_{\tiny{\mathcal{J}}_{2}(-\widehat{\lambda})}\geq 0,\ d_{{\mathcal{J}}_{3}(-\widehat{\lambda})}\leq 0\}.

It follows that

lin(TS(λ^))={dn|d𝒥2(λ^)=0,d𝒥3(λ^)=0}.\mbox{lin}(T_{S}(-\widehat{\lambda}))=\{d\in\mathbb{R}^{n}\ |\ d_{{\mathcal{J}}_{2}(-\widehat{\lambda})}=0,\ d_{{\mathcal{J}}_{3}(-\widehat{\lambda})}=0\}.

“(i)(ii)(i)\Leftrightarrow(ii)”. The equality (20) at λ^\widehat{\lambda} holds if and only if =y𝒥1(λ^)|𝒥1(λ^)|\mathbb{R}=y^{\top}_{{\mathcal{J}}_{1}(-\widehat{\lambda})}\mathbb{R}^{|{\mathcal{J}}_{1}(-\widehat{\lambda})|}. Since every entry of yy equals ±1\pm 1, this surjectivity holds precisely when 𝒥1(λ^){\mathcal{J}}_{1}(-\widehat{\lambda})\neq\emptyset.

“(ii)(iii)(ii)\Rightarrow(iii)”. If 𝒥1(λ^){\mathcal{J}}_{1}(-\widehat{\lambda})\neq\emptyset, we obtain from λ^=ΠS(ωk(W^,b^))\widehat{\lambda}=-\Pi_{S}(\omega_{k}(\widehat{W},\widehat{b})) that there exists an index j0[n]j^{0}\in[n] such that 0<(ωk(W^,b^))j0<C0<(\omega_{k}(\widehat{W},\widehat{b}))_{j^{0}}<C. It then follows from (B.2) that yMy>0y^{\top}My>0 for every MΠS(ωk(W^,b^))M\in\partial\Pi_{S}(\omega_{k}(\widehat{W},\widehat{b})). By the equivalence (B.1), condition (iii)(iii) holds.

“(iii)(ii)(iii)\Rightarrow(ii)”. If condition (iii)(iii) holds, then by the equivalence in (B.1), yMy>0y^{\top}My>0 for any MΠS(ωk(W^,b^))M\in\partial\Pi_{S}(\omega_{k}(\widehat{W},\widehat{b})). By (B.2), this implies that there exists j0[n]j^{0}\in[n] such that 0<(ωk(W^,b^))j0<C0<(\omega_{k}(\widehat{W},\widehat{b}))_{j^{0}}<C. From λ^=ΠS(ωk(W^,b^))\widehat{\lambda}=-\Pi_{S}(\omega_{k}(\widehat{W},\widehat{b})), it follows that 0<(λ^)j0<C0<(-\widehat{\lambda})_{j^{0}}<C, which gives condition (ii)(ii). ∎

Appendix C Proof of Theorem 5.1

Proof

For each given CiC_{i}, the cardinality of ¯k(Ci)\overline{{\mathcal{I}}}^{\,k}(C_{i}) decreases as that of k(Ci){\mathcal{I}}^{k}(C_{i}) increases. Since the sample size nn is finite, the index set 𝒥k(Ci){\mathcal{J}}^{k}(C_{i}), being a subset of ¯k(Ci)\overline{{\mathcal{I}}}^{k}(C_{i}), becomes empty after a finite number of iterations, which ensures finite termination.

Next, we show that the output solution (W¯(Ci),b¯(Ci),v¯(Ci),U¯(Ci),λ¯(Ci),Λ¯(Ci))(\overline{W}(C_{i}),\overline{b}(C_{i}),\overline{v}(C_{i}),\overline{U}(C_{i}),\overline{\lambda}(C_{i}),\overline{\Lambda}(C_{i})) of Algorithm 3 is a KKT tuple of the problem (45). Indeed, by the finite termination of Algorithm 3, we know that there exists an integer k¯[n]\bar{k}\in[n] such that (W¯(Ci),b¯(Ci),vk¯(Ci),U¯(Ci),λk¯(Ci)(\overline{W}(C_{i}),\overline{b}(C_{i}),{v}^{\,\bar{k}}_{\scriptscriptstyle{\mathcal{I}}}(C_{i}),\overline{U}(C_{i}),{\lambda}^{\,\bar{k}}_{\scriptscriptstyle{\mathcal{I}}}(C_{i}), Λ¯(Ci))\overline{\Lambda}(C_{i})) is a KKT solution of the reduced subproblem (42) with the error (δW,δb,δv,δU,δλ(\delta_{W},\delta_{b},\delta_{v_{\scriptscriptstyle{\mathcal{I}}}},\delta_{U},\delta_{\lambda_{\scriptscriptstyle{\mathcal{I}}}}, δΛ)\delta_{\Lambda}) satisfying the condition (43) and 𝒥k¯(Ci)={\mathcal{J}}^{\,\bar{k}}(C_{i})=\emptyset. Moreover, v¯(Ci)\overline{v}(C_{i}) and λ¯(Ci)\overline{\lambda}(C_{i}) are obtained by expanding vk¯(Ci){v}^{\,\bar{k}}_{\scriptscriptstyle{\mathcal{I}}}(C_{i}) and λk¯(Ci){\lambda}^{\,\bar{k}}_{\scriptscriptstyle{\mathcal{I}}}(C_{i}) to nn-dimensional vectors whose remaining entries are 1yj(W¯(Ci),Xj+b¯(Ci))1-y_{j}(\langle\overline{W}(C_{i}),X_{j}\rangle+\overline{b}(C_{i})) and 0, respectively, for any j¯k¯(Ci)j\in\overline{{\mathcal{I}}}^{\,\bar{k}}(C_{i}).

For the sake of convenience in further analysis, we will exclude CiC_{i} from the above KKT solutions and index sets. Additionally, we will eliminate k¯\bar{k} from k¯{\mathcal{I}}^{\,\bar{k}}, ¯k¯\overline{{\mathcal{I}}}^{\,\bar{k}}, λk¯{\lambda}^{\,\bar{k}}_{\scriptscriptstyle{\mathcal{I}}}, and vk¯{v}^{\,\bar{k}}_{\scriptscriptstyle{\mathcal{I}}}. By the KKT conditions of the reduced subproblem (42) and the condition (43), one has that

δW=W¯+𝒜λ+Λ¯,δb=yλ,δΛ=W¯U¯,δvλδSi(v),δU+Λ¯τU¯,δλ=𝒜W¯+b¯y+ve||,max(δW,|δb|,δU,δλ,δv,δΛ)ε,\begin{array}[]{c}\delta_{W}=\overline{W}+{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{I}}}{\lambda}_{\scriptscriptstyle{\mathcal{I}}}+\overline{\Lambda},\ \delta_{b}=y_{\scriptscriptstyle{\mathcal{I}}}^{\top}{\lambda}_{\scriptscriptstyle{\mathcal{I}}},\ \delta_{\Lambda}=\overline{W}-\overline{U},\ \delta_{v_{\scriptscriptstyle{\mathcal{I}}}}-{\lambda}_{\scriptscriptstyle{\mathcal{I}}}\in\partial\delta^{*}_{S^{\,i}_{\scriptscriptstyle{\mathcal{I}}}}({v}_{\scriptscriptstyle{\mathcal{I}}}),\ \delta_{U}+\overline{\Lambda}\in\partial\tau\|\overline{U}\|_{*},\\[5.69054pt] \delta_{\lambda_{\scriptscriptstyle{\mathcal{I}}}}={\mathcal{A}}_{\scriptscriptstyle{\mathcal{I}}}\overline{W}+\overline{b}y_{\scriptscriptstyle{\mathcal{I}}}+{v}_{\scriptscriptstyle{\mathcal{I}}}-e_{|{\mathcal{I}}|},\ \max(\|\delta_{W}\|,|\delta_{b}|,\|\delta_{U}\|,\|\delta_{\lambda_{\scriptscriptstyle{\mathcal{I}}}}\|,\|\delta_{v_{{\mathcal{I}}}}\|,\|\delta_{\Lambda}\|)\leq\varepsilon,\end{array} (C.1)

where 𝒜{\mathcal{A}}_{\scriptscriptstyle{\mathcal{I}}} and 𝒜{\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{I}}} are defined in (38). By extending δλ\delta_{\lambda_{\scriptscriptstyle{\mathcal{I}}}} and δv\delta_{v_{\scriptscriptstyle{\mathcal{I}}}} to the nn-dimensional error vectors δλ\delta_{\lambda} and δv\delta_{v} with the remaining entries set to 0, it suffices to show that

δW=W¯+𝒜λ¯+Λ¯,δb=yλ¯,δλ=𝒜W¯+b¯y+v¯en,δvλ¯δSi(v¯).\delta_{W}=\overline{W}+{\mathcal{A}}^{*}\overline{\lambda}+\overline{\Lambda},\quad\delta_{b}=y^{\top}\overline{\lambda},\quad\delta_{\lambda}={\mathcal{A}}\overline{W}+\overline{b}y+\overline{v}-e_{n},\quad\delta_{v}-\overline{\lambda}\in\partial\delta^{*}_{S^{\,i}}(\overline{v}).

Indeed, since (λ¯)¯=0(\overline{\lambda})_{\scriptscriptstyle\overline{{\mathcal{I}}}}=0, we obtain

𝒜λ¯=𝒜λ+𝒜¯(λ¯)¯=𝒜λandyλ¯=yλ+y¯(λ¯)¯=yλ¯,{\mathcal{A}}^{*}\overline{\lambda}={\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{I}}}{\lambda}_{\scriptscriptstyle{\mathcal{I}}}+{\mathcal{A}}^{*}_{\scriptscriptstyle\overline{{\mathcal{I}}}}(\overline{\lambda})_{\scriptscriptstyle\overline{{\mathcal{I}}}}={\mathcal{A}}^{*}_{\scriptscriptstyle{\mathcal{I}}}{\lambda}_{\scriptscriptstyle{\mathcal{I}}}\quad\mbox{and}\quad y^{\top}\overline{\lambda}=y_{\scriptscriptstyle{\mathcal{I}}}^{\top}{\lambda}_{\scriptscriptstyle{\mathcal{I}}}+y_{\scriptscriptstyle\overline{{\mathcal{I}}}}^{\top}(\overline{\lambda})_{\scriptscriptstyle\overline{{\mathcal{I}}}}=y_{\scriptscriptstyle{\mathcal{I}}}^{\top}\overline{\lambda}_{\scriptscriptstyle{\mathcal{I}}},

which, together with (C.1), yields δW=W¯+𝒜λ¯+Λ¯\delta_{W}=\overline{W}+{\mathcal{A}}^{*}\overline{\lambda}+\overline{\Lambda} and δb=λ¯y\delta_{b}=\overline{\lambda}^{\top}y. From the expansion of v¯\overline{v} and δλ\delta_{\lambda}, we know that v¯=e¯𝒜¯W¯b¯y¯{v}_{\scriptscriptstyle\overline{{\mathcal{I}}}}=e_{\scriptscriptstyle\overline{{\mathcal{I}}}}-{\mathcal{A}}_{\scriptscriptstyle\overline{{\mathcal{I}}}}\overline{W}-\overline{b}y_{\scriptscriptstyle\overline{{\mathcal{I}}}}, which together with (C.1) further implies that δλ=𝒜W¯+b¯y+v¯en\delta_{\lambda}={\mathcal{A}}\overline{W}+\overline{b}y+\overline{v}-e_{n}. Finally, from 𝒥k¯(Ci)={\mathcal{J}}^{\bar{k}}(C_{i})=\emptyset, one has v¯j<0\overline{v}_{j}<0 for all j¯j\in\overline{{\mathcal{I}}}. It follows that (δv)¯(λ¯)¯=0=δS¯i(v¯)(\delta_{v})_{\scriptscriptstyle\overline{{\mathcal{I}}}}-(\overline{\lambda})_{\scriptscriptstyle\overline{{\mathcal{I}}}}=0=\nabla\delta^{*}_{S^{\,i}_{\scriptscriptstyle\overline{{\mathcal{I}}}}}({v}_{\scriptscriptstyle\overline{{\mathcal{I}}}}). Combining this with (C.1), we obtain that δvλ¯δSi(v¯)\delta_{v}-\overline{\lambda}\in\partial\delta^{*}_{S^{i}}(\overline{v}). ∎
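For completeness, the expansion step used in the above proof can be sketched in Python as follows (an illustrative sketch only, not the authors' implementation: the index set, the data, and the reduced solution are placeholders, and the linear map is taken to act as (𝒜W)_j = y_j⟨X_j, W⟩, which is consistent with the expressions for v̄ given above).

import numpy as np

def expand_reduced_solution(W_bar, b_bar, v_I, lam_I, I, X, y):
    # expand the reduced variables (v_I, lam_I) of the subproblem (42) to
    # n-dimensional vectors (v_bar, lam_bar): lam_bar is zero outside I and
    # v_bar_j = 1 - y_j(<W_bar, X_j> + b_bar) outside I
    n = len(y)
    AW = np.array([y[j] * np.sum(W_bar * X[j]) for j in range(n)])  # assumes (A W)_j = y_j <X_j, W>
    v_bar = 1.0 - AW - b_bar * y      # correct outside I ...
    v_bar[I] = v_I                    # ... and overwritten by the reduced solution on I
    lam_bar = np.zeros(n)
    lam_bar[I] = lam_I
    # the residual A W_bar + b_bar y + v_bar - e_n vanishes outside I by construction
    delta_lambda = AW + b_bar * y + v_bar - 1.0
    return v_bar, lam_bar, delta_lambda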

Appendix D An isPADMM for solving SMM model

In this part, we outline the framework of the inexact semi-proximal ADMM (isPADMM) to solve the SMM model (1). Specifically, according to the definition of the linear operator 𝒜{\mathcal{A}} in (2), the SMM model (1) can be equivalently reformulated as follows:

minimize(W,b,U)p×q××p×q12W2+τU+δS(en𝒜Wby)subject toWU=0.\begin{array}[]{cl}\mathop{\mbox{minimize}}\limits_{(W,b,U)\in\mathbb{R}^{p\times q}\times\mathbb{R}\times\mathbb{R}^{p\times q}}&\displaystyle\frac{1}{2}\|W\|^{2}_{{\mathcal{F}}}+\tau\|U\|_{*}+\delta_{S}^{*}(e_{n}-{\mathcal{A}}W-by)\\ \mbox{subject to}&W-U=0.\end{array} (D.1)

Its augmented Lagrangian function is defined for any (W,b,U,Λ)p×q××p×q×p×q(W,b,U,\Lambda)\in\mathbb{R}^{p\times q}\times\mathbb{R}\times\mathbb{R}^{p\times q}\times\mathbb{R}^{p\times q} as follows:

Lγ(W,b,U;Λ)=12W2+τU+δS(en𝒜Wby)+Λ,WU+γ2WU2,L_{\gamma}(W,b,U;\Lambda)=\frac{1}{2}\|W\|^{2}_{{\mathcal{F}}}+\tau\|U\|_{*}+\delta_{S}^{*}(e_{n}-{\mathcal{A}}W-by)+\langle\Lambda,W-U\rangle+\frac{\gamma}{2}\|W-U\|^{2}_{{\mathcal{F}}},

where γ\gamma is a positive penalty parameter. The framework of the isPADMM for solving (D.1) is outlined in Algorithm 4.
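Before stating the algorithm, we note that the reformulation (D.1) recovers the SMM model (1) once the constraint W = U is enforced: assuming that S is the box [0,C]^n (consistent with (B.2)), the support-function term equals C times the sum of hinge losses in (1). The short Python sketch below (illustrative only; it takes (𝒜W)_j = y_j⟨X_j, W⟩ and uses a hypothetical helper name) evaluates the objective of (D.1) in this form.

import numpy as np

def smm_objective_D1(W, b, U, X, y, C, tau):
    # objective of (D.1), assuming S = [0, C]^n so that
    # delta_S^*(z) = C * sum_j max(z_j, 0) and (A W)_j = y_j <X_j, W>
    AW = np.array([y[j] * np.sum(W * X[j]) for j in range(len(y))])
    z = 1.0 - AW - b * y                       # e_n - A W - b y
    hinge = C * np.sum(np.maximum(z, 0.0))     # = C * sum_j max(1 - y_j(<W, X_j> + b), 0)
    nuclear = np.sum(np.linalg.svd(U, compute_uv=False))
    return 0.5 * np.sum(W * W) + tau * nuclear + hinge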

Algorithm 4 An isPADMM for solving SMM model (D.1)

Initialization: Let γ>0\gamma>0 and ζ(0,(1+5)/2)\zeta\in(0,(1+\sqrt{5})/2) be given parameters, {ε¯k}k0\{\overline{\varepsilon}_{k}\}_{k\geq 0} be a nonnegative summable sequence. Choose δ>0\delta>0 and an initial point (U0,Λ0)p×q×p×q(U^{0},\Lambda^{0})\in\mathbb{R}^{p\times q}\times\mathbb{R}^{p\times q}. Set k=0k=0 and perform the following steps in each iteration:

Step 1. Compute
(Wk+1,bk+1)=\displaystyle(W^{\,k+1},{b}^{\,k+1})= argminW,b{Lγ(W,b,Uk;Λk)+δ2(bbk)2dWk,Wbdbk}\displaystyle\mathop{\mbox{argmin}}_{W,b}\left\{L_{\gamma}\big{(}W,b,U^{k};\Lambda^{k}\big{)}+\frac{\delta}{2}(b-b^{k})^{2}-\langle d^{\,k}_{W},W\rangle-bd^{\,k}_{b}\right\}
=\displaystyle= argminW,b{1+γ2W+ΛkγUk1+γ2+δS(en𝒜Wby)\displaystyle\mathop{\mbox{argmin}}_{W,b}\left\{\frac{1+\gamma}{2}\|W+\frac{\Lambda^{k}-\gamma U^{k}}{1+\gamma}\|^{2}_{{\mathcal{F}}}+\delta^{*}_{S}(e_{n}-{\mathcal{A}}W-by)\right.
+δ2(bbk)2dWk,Wbdbk},\displaystyle\qquad\qquad\qquad\left.+\frac{\delta}{2}(b-b^{k})^{2}-\langle d^{\,k}_{W},W\rangle-bd^{\,k}_{b}\right\}, (D.2)
where (dWk,dbk)p×q×(d^{\,k}_{W},d^{\,k}_{b})\in\mathbb{R}^{p\times q}\times\mathbb{R} is an error term such that (1/γ)dWk2+(1/δ)|dbk|2ε¯k(1/\gamma)\|d^{\,k}_{W}\|^{2}_{{\mathcal{F}}}+(1/\delta)|d^{\,k}_{b}|^{2}\leq\overline{\varepsilon}_{k}.
Step 2. Compute
Uk+1\displaystyle U^{k+1} =argminUp×q{Lγ(Wk+1,bk+1,U;Λk)}=1γProxτ(γWk+1+Λk).\displaystyle=\mathop{\mbox{argmin}}\limits_{U\in\mathbb{R}^{p\times q}}\left\{L_{\gamma}(W^{k+1},b^{k+1},U;\Lambda^{k})\right\}=\frac{1}{\gamma}\mbox{Prox}_{\tau\|\cdot\|_{*}}(\gamma W^{k+1}+\Lambda^{k}).
Step 3. Update Λk+1=Λk+ζγ(Wk+1Uk+1).\Lambda^{k+1}=\Lambda^{k}+\zeta\,\gamma\left(W^{k+1}-U^{k+1}\right).

Notably, the subproblem (D.2) closely resembles the original SMM model (1), except that τ\tau is set to zero and an additional proximal term involving bb is included. This similarity allows the direct application of the semismooth Newton-based augmented Lagrangian method to these subproblems, as discussed in Sections 3 and 4. The derivations, not detailed here, follow a rationale similar to that of the aforementioned algorithms. Furthermore, by leveraging the results of (chen2017efficient, , Theorem 5.1), we can directly deduce the global convergence of Algorithm 4. To ensure computational feasibility, the number of isPADMM iterations is capped at 30,000, with δ\delta set to 10610^{-6}.
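While the (W,b)-subproblem (D.2) is solved inexactly as described above, Steps 2 and 3 of Algorithm 4 admit closed forms. A minimal Python sketch of these two steps is given below (for illustration only, with hypothetical function names; it assumes that the proximal mapping of the nuclear norm is the soft-thresholding of singular values, so that Step 2 amounts to shrinking the singular values of W^{k+1} + Λ^k/γ by τ/γ).

import numpy as np

def prox_nuclear(X, t):
    # proximal mapping of t * (nuclear norm): soft-threshold the singular values at t
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def ispadmm_steps_2_3(W_new, Lam, gamma, tau, zeta):
    # Steps 2 and 3 of Algorithm 4 for one iteration (illustrative sketch)
    U_new = prox_nuclear(gamma * W_new + Lam, tau) / gamma   # Step 2
    Lam_new = Lam + zeta * gamma * (W_new - U_new)           # Step 3
    return U_new, Lam_new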

Appendix E A sGS-isPADMM for solving SMM model

In this part, we present the framework of the symmetric Gauss-Seidel based inexact semi-proximal ADMM (sGS-isPADMM) to effectively solve the SMM model (P). Based on the definition of the augmented Lagrangian function in (6), the steps of the sGS-isPADMM for solving (P) are listed as follows.

Algorithm 5 A sGS-isPADMM for solving SMM model (P)

Initialization: Let γ>0\gamma>0 and ζ(0,(1+5)/2)\zeta\in(0,(1+\sqrt{5})/2) be given parameters, {ε¯k}k0\{\overline{\varepsilon}_{k}\}_{k\geq 0} be a nonnegative summable sequence. Select a self-adjoint positive semidefinite linear operator 𝒮1:p×qp×q\mathcal{S}_{1}:\mathbb{R}^{p\times q}\rightarrow\mathbb{R}^{p\times q}. Choose an initial point (W0,b0,v0,U0,λ0,Λ0)p×q××n×p×q×n×p×q(W^{0},b^{0},v^{0},U^{0},\lambda^{0},\Lambda^{0})\in\mathbb{R}^{p\times q}\times\mathbb{R}\times\mathbb{R}^{n}\times\mathbb{R}^{p\times q}\times\mathbb{R}^{n}\times\mathbb{R}^{p\times q}. Set k=0k=0 and perform the following steps in each iteration:

Step 1. Compute
b¯k+1\displaystyle\overline{b}^{\,k+1} =argminb{Lγ(Wk,b,vk,Uk;λk,Λk)}=1ny(𝒜Wk+vken+λkγ),\displaystyle=\mathop{\mbox{argmin}}_{b\in\mathbb{R}}\left\{L_{\gamma}\big{(}W^{k},b,v^{k},U^{k};\lambda^{k},\Lambda^{k}\big{)}\right\}=-\frac{1}{n}y^{\top}\left({\mathcal{A}}W^{k}+v^{k}-e_{n}+\frac{\lambda^{k}}{\gamma}\right),
Wk+1\displaystyle W^{k+1} =argminWp×q{Lγ(W,b¯k+1,vk,Uk;λk,Λk)+12WWk𝒮12δWk,W},\displaystyle=\mathop{\mbox{argmin}}_{W\in\mathbb{R}^{p\times q}}\left\{L_{\gamma}(W,\overline{b}^{k+1},v^{k},U^{k};\lambda^{k},\Lambda^{k})+\frac{1}{2}\|W-W^{k}\|^{2}_{\mathcal{S}_{1}}-\langle\delta^{k}_{W},W\rangle\right\},
bk+1\displaystyle b^{k+1} =argminb{Lγ(Wk+1,b,vk,Uk;λk,Λk)}=1ny(𝒜Wk+1+vken+λkγ),\displaystyle=\mathop{\mbox{argmin}}_{b\in\mathbb{R}}\left\{L_{\gamma}(W^{k+1},b,v^{k},U^{k};\lambda^{k},\Lambda^{k})\right\}=-\frac{1}{n}y^{\top}\left({\mathcal{A}}W^{k+1}+v^{k}-e_{n}+\frac{\lambda^{k}}{\gamma}\right),
where δWkp×q\delta^{k}_{W}\in\mathbb{R}^{p\times q} is an error matrix such that δWkε¯k\|\delta^{k}_{W}\|\leq\overline{\varepsilon}_{k}.
Step 2. Compute
vk+1\displaystyle v^{k+1} =1γProxδS(λkγ(𝒜Wk+1+bk+1yen)),\displaystyle=\frac{1}{\gamma}\mbox{Prox}_{\delta^{*}_{S}}\left(-\lambda^{k}-\gamma({\mathcal{A}}W^{k+1}+b^{k+1}y-e_{n})\right),
Uk+1\displaystyle U^{k+1} =(1/γ)Proxτ(Λk+γWk+1).\displaystyle=(1/\gamma)\mbox{Prox}_{\tau\|\cdot\|_{*}}\left(\Lambda^{k}+\gamma W^{k+1}\right).
Step 3. Update
λk+1\displaystyle\lambda^{k+1} =λk+ζγ(𝒜Wk+1+bk+1y+vk+1en),\displaystyle=\lambda^{k}+\zeta\,\gamma({\mathcal{A}}W^{k+1}+b^{k+1}y+v^{k+1}-e_{n}),
Λk+1\displaystyle\Lambda^{k+1} =Λk+ζγ(Wk+1Uk+1).\displaystyle=\Lambda^{k}+\zeta\,\gamma(W^{k+1}-U^{k+1}).

Building upon the results in (chen2017efficient, , Proposition 4.2 and Theorem 5.1), we can directly obtain the global convergence results for Algorithm 5. Lastly, the maximum number of iterations for the sGS-isPADMM is set to 30,000.
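For readers who wish to prototype Algorithm 5, one sweep can be sketched in Python as follows (an illustrative sketch only, not the authors' implementation: the W-subproblem in Step 1 is left to a user-supplied inexact solver, the operator 𝒜 is passed as a callable, and S is assumed to be the box [0,C]^n so that the proximal mapping of its support function reduces to a clipping operation).

import numpy as np

def sgs_ispadmm_sweep(W, b, v, U, lam, Lam, A, y, C, tau, gamma, zeta, solve_W):
    # one sweep of Algorithm 5; A is a callable returning the n-vector A(W),
    # and solve_W performs the (inexact) minimization over W in Step 1
    n = len(y)
    e = np.ones(n)
    # Step 1: b-bar, W, b in the symmetric Gauss-Seidel order
    b_bar = -(y @ (A(W) + v - e + lam / gamma)) / n
    W = solve_W(W, b_bar, v, U, lam, Lam)
    b = -(y @ (A(W) + v - e + lam / gamma)) / n
    # Step 2: v and U via proximal mappings
    z = -lam - gamma * (A(W) + b * y - e)
    v = (z - np.clip(z, 0.0, C)) / gamma          # (1/gamma) Prox_{delta_S^*}(z) for S = [0, C]^n
    M = Lam + gamma * W
    Us, s, Vt = np.linalg.svd(M, full_matrices=False)
    U = (Us @ np.diag(np.maximum(s - tau, 0.0)) @ Vt) / gamma
    # Step 3: multiplier updates
    lam = lam + zeta * gamma * (A(W) + b * y + v - e)
    Lam = Lam + zeta * gamma * (W - U)
    return W, b, v, U, lam, Lam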

Statements and Declarations

Funding The work of Defeng Sun was supported in part by RGC Senior Research Fellow Scheme No. SRFS2223-5S02 and GRF Project No. 15307822.

Availability of data and materials The references of all datasets are provided in this published article.

Conflict of interest The authors declare that they have no conflict of interest.

References

  • (1) Artacho, F.A., Geoffroy, M.H.: Characterization of metric regularity of subdifferentials. J. Convex Anal. 15(2), 365–380 (2008)
  • (2) Bauschke, H.H., Borwein, J.M., Li, W.: Strong conical hull intersection property, bounded linear regularity, Jameson’s property (G), and error bounds in convex optimization. Math. Program. 86(1), 135–160 (1999)
  • (3) Chen, L., Sun, D.F., Toh, K.C.: An efficient inexact symmetric Gauss-Seidel based majorized ADMM for high-dimensional convex composite conic programming. Math. Program. 161(1), 237–270 (2017)
  • (4) Chen, Y., Hang, W., Liang, S., Liu, X., Li, G., Wang, Q., Qin, J., Choi, K.S.: A novel transfer support matrix machine for motor imagery-based brain computer interface. Front. Neurosci. 14, 606949 (2020)
  • (5) Clarke, F.H.: Optimization and Nonsmooth Analysis. John Wiley and Sons, New York (1983)
  • (6) Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
  • (7) Cui, Y., Ding, C., Zhao, X.: Quadratic growth conditions for convex matrix optimization problems associated with spectral functions. SIAM J. Optim. 27(4), 2332–2355 (2017)
  • (8) Cui, Y., Sun, D.F., Toh, K.C.: On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions. arXiv preprint arXiv:1610.00875 (2016)
  • (9) Cui, Y., Sun, D.F., Toh, K.C.: On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming. Math. Program. 178(1), 381–415 (2019)
  • (10) Duan, B., Yuan, J., Liu, Y., Li, D.: Quantum algorithm for support matrix machines. Phys. Rev. A 96(3), 032301 (2017)
  • (11) Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, New York (2007)
  • (12) Feng, R., Xu, Y.: Support matrix machine with pinball loss for classification. Neural Comput. Appl. 34, 18643–18661 (2022)
  • (13) Feng, R., Zhong, P., Xu, Y.: A subspace elimination strategy for accelerating support matrix machine. Pac. J. Optim. 18(1), 155–176 (2022)
  • (14) Geng, M., Xu, Z., Mei, M.: Fault diagnosis method for railway turnout with pinball loss-based multiclass support matrix machine. Appl. Sci. 13(22), 12375 (2023)
  • (15) Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014)
  • (16) Hang, W., Feng, W., Liang, S., Wang, Q., Liu, X., Choi, K.S.: Deep stacked support matrix machine based representation learning for motor imagery EEG classification. Comput. Meth. Programs Biomed. 193, 105466 (2020)
  • (17) Hang, W., Li, Z., Yin, M., Liang, S., Shen, H., Wang, Q., Qin, J., Choi, K.S.: Deep stacked least square support matrix machine with adaptive multi-layer transfer for EEG classification. Biomed. Signal Process. Control 82, 104579 (2023)
  • (18) Harrow, A.W., Hassidim, A., Lloyd, S.: Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103(15), 150502 (2009)
  • (19) Hastie, T., Tibshirani, R., Friedman, J.H., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2. Springer (2009)
  • (20) Hsieh, C.J., Olsen, P.: Nuclear norm minimization via active subspace selection. In: International Conference on Machine Learning, pp. 575–583. PMLR (2014)
  • (21) Jiang, K.: Algorithms for Large Scale Nuclear Norm Minimization and Convex Quadratic Semidefinite Programming Problems. Ph.D. thesis (2011)
  • (22) Jiang, K., Sun, D.F., Toh, K.C.: Solving nuclear norm regularized and semidefinite matrix least squares problems with linear equality constraints. In: Discrete Geometry and Optimization, pp. 133–162. Springer (2013)
  • (23) Kumari, A., Akhtar, M., Shah, R., Tanveer, M.: Support matrix machine: A review. Neural Networks (2024). https://doi.org/10.1016/j.neunet.2024.106767
  • (24) El Ghaoui, L., Viallon, V., Rabbani, T.: Safe feature elimination in sparse supervised learning. Pac. J. Optim. 8(4), 667–698 (2012)
  • (25) Li, H., Xu, Y.: Support matrix machine with truncated pinball loss for classification. Appl. Soft. Comput. 154, 111311 (2024)
  • (26) Li, Q., Jiang, B., Sun, D.F.: MARS: A second-order reduction algorithm for high-dimensional sparse precision matrices estimation. J. Mach. Learn. Res. 24, 1–44 (2023)
  • (27) Li, X., Cheng, J., Shao, H., Liu, K., Cai, B.: A fusion CWSMM-based framework for rotating machinery fault diagnosis under strong interference and imbalanced case. IEEE Trans. Industr. Inform. 18(8), 5180–5189 (2022)
  • (29) Li, X., Li, S., Wei, D., Si, L., Yu, K., Yan, K.: Dynamics simulation-driven fault diagnosis of rolling bearings using security transfer support matrix machine. Reliab. Eng. Syst. Saf. 243, 109882 (2024)
  • (30) Li, X., Li, Y., Yan, K., Shao, H., Lin, J.J.: Intelligent fault diagnosis of bevel gearboxes using semi-supervised probability support matrix machine and infrared imaging. Reliab. Eng. Syst. Saf. 230, 108921 (2023)
  • (31) Li, X., Li, Y., Yan, K., Si, L., Shao, H.: An intelligent fault detection method of industrial gearboxes with robustness one-class support matrix machine toward multisource nonideal data. IEEE ASME Trans. Mechatron. 29(1), 388–399 (2024)
  • (32) Li, X., Shao, H., Lu, S., Xiang, J., Cai, B.: Highly efficient fault diagnosis of rotating machinery under time-varying speeds using LSISMM and small infrared thermal images. IEEE Trans. Syst. Man Cybern. Syst. 52(12), 7328–7340 (2022)
  • (33) Li, X., Sun, D.F., Toh, K.C.: QSDPNAL: A two-phase augmented Lagrangian method for convex quadratic semidefinite programming. Math. Program. Comput. 10, 703–743 (2018)
  • (34) Li, X., Yang, Y., Pan, H., Cheng, J., Cheng, J.: Non-parallel least squares support matrix machine for rolling bearing fault diagnosis. Mech. Mach. Theory 145, 103676 (2020)
  • (35) Li, Y., Wang, D., Liu, F.: The auto-correlation function aided sparse support matrix machine for EEG-based fatigue detection. IEEE Trans. Circuits Syst. II-Express Briefs 70(2), 836–840 (2023)
  • (36) Liang, S., Hang, W., Lei, B., Wang, J., Qin, J., Choi, K.S., Zhang, Y.: Adaptive multimodel knowledge transfer matrix machine for EEG classification. IEEE Trans. Neural Netw. Learn. Syst. 35(6), 7726–7739 (2024)
  • (37) Liang, S., Hang, W., Yin, M., Shen, H., Wang, Q., Qin, J., Choi, K.S., Zhang, Y.: Deep EEG feature learning via stacking common spatial pattern and support matrix machine. Biomed. Signal Process. Control 74, 103531 (2022)
  • (38) Lin, M., Yuan, Y., Sun, D.F., Toh, K.C.: Adaptive sieving with PPDNA: Generating solution paths of exclusive lasso models. arXiv preprint arXiv:2009.08719 (2020)
  • (39) Liu, Y.J., Sun, D.F., Toh, K.C.: An implementable proximal point algorithmic framework for nuclear norm minimization. Math. Program. 133, 399–436 (2012)
  • (40) Luo, L., Xie, Y., Zhang, Z., Li, W.J.: Support matrix machines. In: International Conference on Machine Learning, pp. 938–947. PMLR (2015)
  • (41) Mangasarian, O.L.: A simple characterization of solution sets of convex programs. Oper. Res. Lett. 7, 21–26 (1988)
  • (42) Meng, F., Sun, D.F., Zhao, G.: Semismoothness of solutions to generalized equations and the Moreau-Yosida regularization. Math. Program. 104, 561–581 (2005)
  • (43) Ogawa, K., Suzuki, Y., Takeuchi, I.: Safe screening of non-support vectors in pathwise SVM computation. In: International Conference on Machine Learning, pp. 1382–1390. PMLR (2013)
  • (44) Pan, H., Sheng, L., Xu, H., Tong, J., Zheng, J., Liu, Q.: Pinball transfer support matrix machine for roller bearing fault diagnosis under limited annotation data. Appl. Soft. Comput. 125, 109209 (2022)
  • (45) Pan, H., Sheng, L., Xu, H., Zheng, J., Tong, J., Niu, L.: Deep stacked pinball transfer matrix machine with its application in roller bearing fault diagnosis. Eng. Appl. Artif. Intell. 121, 105991 (2023)
  • (46) Pan, H., Xu, H., Zheng, J., Shao, H., Tong, J.: A semi-supervised matrixized graph embedding machine for roller bearing fault diagnosis under few-labeled samples. IEEE Trans. Industr. Inform. 20(1), 854–863 (2024)
  • (47) Pan, H., Xu, H., Zheng, J., Su, J., Tong, J.: Multi-class fuzzy support matrix machine for classification in roller bearing fault diagnosis. Adv. Eng. Inform. 51, 101445 (2022)
  • (48) Pan, H., Yang, Y., Zheng, J., Li, X., Cheng, J.: A fault diagnosis approach for roller bearing based on symplectic geometry matrix machine. Mech. Mach. Theory 140, 31–43 (2019)
  • (49) Pan, X., Xu, Y.: A novel and safe two-stage screening method for support vector machine. IEEE Trans. Neural Netw. Learn. Syst. 30(8), 2263–2274 (2019)
  • (50) Qi, H., Sun, D.F.: A quadratically convergent Newton method for computing the nearest correlation matrix. SIAM J. Matrix Anal. Appl. 28(2), 360–385 (2006)
  • (51) Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)
  • (52) Qian, C., Tran-Dinh, Q., Fu, S., Zou, C., Liu, Y.: Robust multicategory support matrix machines. Math. Program. 176, 429–463 (2019)
  • (53) Razzak, I., Blumenstein, M., Xu, G.: Multiclass support matrix machines by maximizing the inter-class margin for single trial EEG classification. IEEE Trans. Neural Syst. Rehabilitation Eng. 27(6), 1117–1127 (2019)
  • (54) Robinson, S.M.: Some continuity properties of polyhedral multifunctions. In: Mathematical Programming at Oberwolfach, Math. Program. Stud., pp. 206–214 (1981)
  • (55) Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
  • (56) Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1(2), 97–116 (1976)
  • (58) Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer Science & Business Media (2009)
  • (59) Sanei, S., Chambers, J.A.: EEG Signal Processing. John Wiley & Sons (2013)
  • (60) Shapiro, A.: Sensitivity analysis of generalized equations. J. Math. Sci. 115(4), 2554–2565 (2003)
  • (61) Wang, J., Wonka, P., Ye, J.: Scaling SVM and least absolute deviations via exact data reduction. In: International Conference on Machine Learning, pp. 523–531. PMLR (2014)
  • (62) Watson, G.A.: Characterization of the subdifferential of some matrix norms. Linear Alg. Appl. 170, 33–45 (1992)
  • (63) Wu, C., Cui, Y., Li, D., Sun, D.F.: Convex and nonconvex risk-based linear regression at scale. INFORMS J. Comput. 35(4), 797–816 (2023)
  • (64) Xu, H., Pan, H., Zheng, J., Tong, J., Zhang, F., Chu, F.: Intelligent fault identification in sample imbalance scenarios using robust low-rank matrix classifier with fuzzy weighting factor. Appl. Soft. Comput. 152, 111229 (2024)
  • (65) Yang, J., Zhang, D., Frangi, A.F., Yang, J.y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004)
  • (66) Yuan, Y., Chang, T.H., Sun, D.F., Toh, K.C.: A dimension reduction technique for large-scale structured sparse optimization problems with application to convex clustering. SIAM J. Optim. 32(3), 2294–2318 (2022)
  • (67) Yuan, Y., Lin, M., Sun, D.F., Toh, K.C.: Adaptive sieving: A dimension reduction technique for sparse optimization problems. arXiv preprint arXiv:2306.17369 (2023)
  • (68) Zhang, W., Liu, Y.: Proximal support matrix machine. J. Appl. Math. Phys. 10(7), 2268–2291 (2022)
  • (69) Zhao, X.Y., Sun, D.F., Toh, K.C.: A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optim. 20(4), 1737–1765 (2010)
  • (70) Zheng, Q., Zhu, F., Heng, P.A.: Robust support matrix machine for single trial EEG classification. IEEE Trans. Neural Syst. Rehabilitation Eng. 26(3), 551–562 (2018)
  • (71) Zheng, Q., Zhu, F., Qin, J., Chen, B., Heng, P.A.: Sparse support matrix machine. Pattern Recognit. 76, 715–726 (2018)
  • (72) Zheng, Q., Zhu, F., Qin, J., Heng, P.A.: Multiclass support matrix machine for single trial EEG classification. Neurocomputing 275, 869–880 (2018)
  • (73) Zhou, Z., So, A.M.C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. 165, 689–728 (2017)
  • (74) Zimmert, J., de Witt, C.S., Kerg, G., Kloft, M.: Safe screening for support vector machines. In: NIPS Workshop on Optimization in Machine Learning (OPT) (2015)