Email: [email protected]
Dong-Hui Li, School of Mathematical Sciences, South China Normal University, Guangzhou, 510631, China. Email: [email protected]
Defeng Sun, Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. Email: [email protected]
Support matrix machine: exploring sample sparsity, low rank, and adaptive sieving in high-performance computing
Abstract
Support matrix machine (SMM) is a successful supervised classification model for matrix-type samples. Unlike support vector machines, it employs low-rank regularization on the regression matrix to effectively capture the intrinsic structure embedded in each input matrix. When solving a large-scale SMM, a major challenge arises from the potential increase in sample size, leading to substantial computational and storage burdens. To address these issues, we design a semismooth Newton-CG (SNCG) based augmented Lagrangian method (ALM) for solving the SMM. The ALM exhibits asymptotic R-superlinear convergence if a strict complementarity condition is satisfied. The SNCG method is employed to solve the ALM subproblems, achieving at least a superlinear convergence rate under the nonemptiness of an index set. Furthermore, the sparsity of samples and the low-rank nature of solutions enable us to reduce the computational cost and storage demands for the Newton linear systems. Additionally, we develop an adaptive sieving strategy that generates a solution path for the SMM by exploiting sample sparsity. The finite convergence of this strategy is also demonstrated. Numerical experiments on both large-scale real and synthetic datasets validate the effectiveness of the proposed methods.
Keywords: Support matrix machine · Sample sparsity · Low rank · Adaptive sieving · Augmented Lagrangian method · Semismooth Newton method
MSC: 90C06 · 90C25 · 90C90
1 Introduction
Numerous well-known classification methods, such as linear discriminant analysis, logistic regression, support vector machines (SVMs), and AdaBoost, were initially designed for vector or scalar inputs hastie2009elements . On the other hand, matrix-structured data, like digital images with pixel grids yang2004two and EEG signals with multi-channel voltage readings over time sanei2013eeg , are also prevalent in practical applications. Classifying matrix data often involves flattening it into a vector, but this causes several issues qian2019robust : a) the resulting high-dimensional vectors aggravate the curse of dimensionality, b) the matrix structure and the correlations among its entries are lost, and c) the structural differences between matrix and vector data call for different regularization methods. To circumvent these challenges, the support matrix machine (SMM) was introduced by Luo, Xie, Zhang, and Li luo2015support . Given a set of training samples $\{(X_i, y_i)\}_{i=1}^{n}$, where $X_i \in \mathbb{R}^{p\times q}$ represents the $i$th feature matrix and $y_i \in \{-1, +1\}$ is its corresponding class label, the optimization problem of the SMM is formulated as
(1)   $\min_{W\in\mathbb{R}^{p\times q},\, b\in\mathbb{R}}\ \ \frac{1}{2}\|W\|_F^2 \;+\; \tau\|W\|_* \;+\; C\sum_{i=1}^{n}\max\bigl\{0,\ 1 - y_i\bigl[\mathrm{tr}(W^\top X_i) + b\bigr]\bigr\},$
where $W\in\mathbb{R}^{p\times q}$ is the matrix of regression coefficients, $b\in\mathbb{R}$ is an offset term, and $\tau$ and $C$ are positive regularization parameters. Here, $\|\cdot\|_F$ and $\|\cdot\|_*$ denote the Frobenius norm and nuclear norm of a matrix, respectively. The objective function in (1) combines two key components: a) the spectral elastic net penalty $\frac{1}{2}\|W\|_F^2 + \tau\|W\|_*$, which enjoys a grouping effect¹ and promotes low-rank structure; b) the hinge loss, which ensures sparsity and robustness. (¹At a solution pair of the model (1), the columns of the regression matrix exhibit a grouping effect when the associated features display strong correlation luo2015support .) This model has been incorporated as a key component in deep stacked networks hang2020deep ; liang2022deep and semi-supervised learning frameworks li2023intelligent . Additionally, the SMM model (1) and its variants are predominantly applied in fault diagnosis (e.g., pan2019fault ; li2020non ; li2022highly ; li2022fusion ; pan2022pinball ; pan2023deep ; geng2023fault ; li2024dynamics ; xu2024intelligent ; pan2024semi ; li2024intelligent ) and EEG classification (e.g., zheng2018robust ; zheng2018multiclass ; razzak2019multiclass ; hang2020deep ; chen2020novel ; liang2022deep ; hang2023deep ; liang2024adaptive ). For a comprehensive overview of SMM applications and extensions, see the recent survey kumari2024support .
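To make the roles of the two penalty terms and of the parameters $\tau$ and $C$ concrete, the following MATLAB sketch simply evaluates the objective in (1) for a given pair $(W, b)$. It is only an illustration; the function name and the storage of the feature matrices as a p-by-q-by-n array are our own assumptions.

```matlab
% Minimal sketch: evaluate the SMM objective in (1).
% Assumed (hypothetical) inputs: W (p-by-q), b (scalar), X (p-by-q-by-n array
% of feature matrices), y (n-by-1 labels in {-1,+1}), tau and C (positive).
function obj = smm_objective(W, b, X, y, tau, C)
    n = size(X, 3);
    margins = zeros(n, 1);
    for i = 1:n
        margins(i) = y(i) * (sum(sum(W .* X(:, :, i))) + b);  % y_i * (<W, X_i> + b)
    end
    obj = 0.5 * norm(W, 'fro')^2 ...         % Frobenius-norm term
        + tau * sum(svd(W)) ...              % nuclear norm = sum of singular values
        + C * sum(max(0, 1 - margins));      % hinge loss
end
```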
The two-block alternating direction methods of multipliers (ADMMs) are the most widely adopted methods for solving the convex SMM model (1) and its extensions, including least squares SMMs li2020non ; li2022highly ; liang2024adaptive ; hang2023deep , multi-class SMMs zheng2018multiclass ; pan2019fault , weighted SMMs li2022fusion ; li2024intelligent ; pan2024semi , transfer SMMs chen2020novel ; pan2022pinball , and pinball SMMs feng2022support ; pan2022pinball ; pan2023deep ; geng2023fault . Compared to classical SVMs, the primary challenge in solving a large-scale SMM model (1) and the above extensions arises from the additional nuclear norm term. Specifically, their dual problems are no longer quadratic programs (QPs) due to the presence of the spectral-norm ball constraint, which precludes the use of off-the-shelf QP solvers like LIBQP (the codes are available at https://cmp.felk.cvut.cz/~xfrancv/pages/libqp.html). To overcome this drawback, Luo et al. luo2015support suggested utilizing the two-block fast ADMM with restart (F-ADMM) proposed in goldstein2014fast . Indeed, it is reasonable to choose two-block ADMM algorithms over the three-block ADMM chen2017efficient because the former consistently requires fewer iterations, resulting in lower singular value decomposition costs from the soft thresholding operator. However, F-ADMM solves the dual subproblems by calling the LIBQP solver, which requires producing at least one matrix whose size grows with the sample size. This can cause a memory burden as the sample size increases significantly. As a remedy, Duan et al. duan2017quantum introduced a quantum algorithm that employs the quantum matrix inversion algorithm harrow2009quantum and a quantum singular value thresholding algorithm to update the subproblems' variables in the F-ADMM framework, respectively. Nevertheless, no numerical experiments were reported in duan2017quantum , leaving the practical effectiveness of the proposed algorithm undetermined. Efficiently solving the SMM models with large-scale samples remains a significant challenge.
The first purpose of this paper is to devise an efficient and robust algorithm for solving the SMM model (1) when the sample size is significantly large (e.g., several hundred thousand or up to a million) and the dimension of the feature matrices is large (e.g., tens of thousands). By leveraging the sample sparsity and the low-rank property of the solution to model (1), we develop an efficient and robust semismooth Newton-CG based augmented Lagrangian method. Notably, these two properties depend on the values of the parameters $\tau$ and $C$. Here, sample sparsity indicates that the solution of (1) relies solely on a subset of samples (feature matrices) known as active samples (support matrices), while disregarding the remaining portion referred to as non-active samples (non-support matrices). At each iteration, a conjugate gradient (CG) algorithm is employed to solve the Newton linear system. By making full use of the sample sparsity and the solution's low-rank property in (1), the primary computational cost in each iteration of the CG method can be reduced substantially. Here, one index set ultimately corresponds to the active support matrices, and its cardinality is typically much smaller than the total sample size; the cardinality of a second index set eventually corresponds to the rank of the solution matrix in the model (1). It is worth mentioning that the proposed computational framework can be readily extended to convex variants of the SMM model (1), such as multi-class SMMs zheng2018multiclass ; pan2019fault , weighted SMMs li2022fusion ; li2024intelligent ; pan2024semi , transfer SMMs chen2020novel ; pan2022pinball , and pinball SMMs feng2022support ; pan2022pinball ; pan2023deep ; geng2023fault .
Our second goal in this paper is to develop a strategy that efficiently guesses and adjusts the irrelevant samples before starting the aforementioned optimization process. This method is particularly useful for generating a solution path for the model (1). Recently, Ghaoui et al. el2012safe introduced a safe feature elimination method for $\ell_1$-regularized convex problems. It inspired researchers to extend these ideas to conventional SVMs and develop safe sample screening rules ogawa2013safe ; wang2014scaling ; zimmert2015safe ; pan2019novel . Those rules aim to first bound the primal SVM solution within a valid region and then exclude non-active samples by relaxing the Karush-Kuhn-Tucker conditions, which reduces the problem size and saves computational costs and storage. However, existing safe screening rules are typically problem-specific and cannot be directly applied to the SMM model (1). The major challenges are twofold: a) the objective function in (1) is not strongly convex due to the intercept term zimmert2015safe ; b) the nuclear norm term complicates the removal of non-support matrices using first-order optimality conditions for the dual SVM model wang2014scaling . Recently, Yuan, Lin, Sun, and Toh yuan2023adaptive ; lin2020adaptive introduced a highly efficient and flexible adaptive sieving (AS) strategy for convex composite sparse machine learning models. By exploiting the inherent sparsity of the solutions, it significantly reduces the computational load by solving a finite number of reduced subproblems of smaller scale yuan2022dimension ; li2023mars ; wu2023convex . The effectiveness of this AS strategy crucially hinges on solution sparsity, which dictates the size of these reduced subproblems. Unlike the sparsity that emerges in the solutions, the SMM model (1) inherently exhibits sample sparsity. As a result, the application of the AS strategy to (1) is not straightforward. We will extend the idea of the AS strategy in yuan2023adaptive ; lin2020adaptive to generate a solution path for the SMM model (1).
Our main contributions in this paper are outlined as follows:
1) To solve the SMM model (1), we propose an augmented Lagrangian method (ALM), which guarantees asymptotic R-superlinear convergence of the KKT residuals under a mild strict complementarity condition, commonly satisfied in the classical soft margin SVM model. For solving each ALM subproblem, we use the semismooth Newton-CG method (SNCG), ensuring superlinear convergence when an index set is nonempty, with its cardinality ultimately corresponding to the number of active support matrices.
2) The main computational bottleneck in solving the semismooth Newton linear systems with the conjugate gradient method is the execution of the generalized Jacobian linear transformation. By leveraging the sample sparsity and the solution's low-rank structure in model (1), we can substantially reduce the cost of this transformation, since the cardinalities of the two associated index sets are typically much smaller than the sample size. Numerical experiments demonstrate that ALM-SNCG achieves an average speedup of 422.7 times over F-ADMM on four real datasets, even for low-accuracy solutions of model (1).
3) To efficiently generate a solution path for the SMM models with a huge sample size, we employ an AS strategy. This strategy iteratively identifies and removes inactive samples to obtain reduced subproblems. Solving these subproblems inexactly ensures convergence to the desired solution path within a finite number of iterations (see Theorem 5.1). Numerical experiments reveal that, for generating high-accuracy solution paths of model (1), the AS strategy paired with ALM-SNCG is, on average, 2.53 times faster than the warm-started ALM-SNCG on a synthetic dataset with one million samples.
The rest of the paper is organized as follows. Section 2 discusses the sample sparsity and the solution's low-rank structure of the SMM model (1). Building on these properties, Section 3 introduces the computational framework of a semismooth Newton-CG based augmented Lagrangian method for solving this model. Section 4 explores how these properties reduce the computational cost of solving the Newton linear system. Section 5 introduces an adaptive sieving strategy for computing a solution path of the SMM models across a fine grid of parameter values. Finally, Section 6 presents extensive numerical experiments on synthetic and real data to demonstrate the effectiveness of the proposed methods.
Notation. Let be the space of all real matrices and be the space of all real symmetric matrices. The notation stands for the zero matrix in , and is the identity matrix. We use to denote the set of all orthogonal matrices. For , their inner product is defined by , where “tr” represents the trace operation. The Frobenius norm of is . The -dimensional vector space is abbreviated as . The vector is the vector whose elements are all ones. We denote the index set . For any index subsets and , and any , let denote the submatrix of obtained by removing all the rows of not in and all the columns of not in . For any , we write as the subvector of obtained by removing all the elements of not in , and as the diagonal matrix with diagonal entries , . The notation represents the matrix spectral norm, defined as the largest singular value of the matrix. For any , we define . The metric projection of onto is denoted by , with representing its Clarke generalized Jacobian at . Similarly, for any closed convex set , and denote the metric projection of onto and its Clarke generalized Jacobian at , respectively.
2 The structural properties of the SMM model (1)
This section highlights the sample sparsity and low-rank properties at the Karush-Kuhn-Tucker (KKT) points of the model (1), providing the foundation for developing an efficient algorithm.
For notational simplicity, we first reformulate the SMM model (1). Denote
Define the linear operator and its corresponding adjoint as follows:
(2) |
Let , , and . The support function of is given by for any . The SMM model (1) can be equivalently expressed as
(P) |
Its Lagrangian dual problem can be written as
() |
It is easy to get the following KKT system of (P) and ():
(3) |
Without loss of generality, suppose that there exist indices $i$ and $j$ such that $y_i = 1$ and $y_j = -1$ (otherwise, the classification task cannot be executed). Clearly, the solution set of the KKT system (3) is always nonempty. Indeed, the objective function in problem (1) is a real-valued, level-bounded convex function. It follows from (rockafellar2009variational, , Theorem 1.9) that the solution set of (1) is nonempty, convex, and compact, with a finite optimal value. Since the equivalent problem (P) of (1) contains linear constraints only, by (rockafellar1970convex, , Corollary 28.3.1), the KKT system (3) has at least one solution. It is known from (rockafellar1970convex, , Theorems 28.3, 30.4, and Corollary 30.5.1) that is a solution to (3) if and only if is an optimal solution to (P), and is an optimal solution to ().
2.1 Characterizing sample sparsity
For the SMM model (P), a sufficiently large sample size introduces significant computational and storage challenges. However, the inherent sample sparsity offers a promising avenue for developing efficient algorithms.
Suppose that is a solution of the KKT system (3). It follows that
This implies that the value of depends solely on the samples associated with the non-zero elements of , a property referred to as sample sparsity. In addition, by in (3), we deduce that for each ,
(4) |
It implies that . Analogous to the concept of support vectors cortes1995support , we introduce the definition of support matrices.
Definition 1 (Support matrix)
An example that satisfies the inequality is called a support matrix at ; otherwise, it is a non-support matrix. The set of all support matrices at is denoted as
To illustrate the geometric interpretation of support matrices, we define the optimal hyperplane and the decision boundaries and associated with as follows:
Based on (4) and , we obtain that
(5) |
Thus, geometrically, support matrices include a subset of examples lying on the decision boundaries, within the region between them, or misclassified.
It can also be observed from (4) that
This implies that, after excluding support matrices at where , only those remaining on the decision boundaries are retained, i.e.,
Here, we define support matrices satisfying as active support matrices. Notably, the number of support matrices far exceeds that of active support matrices. As shown in Figure 1, the proportion of support matrices relative to the total sample size ranges from 9% to 86%, whereas the proportion of active support matrices is significantly lower, ranging from 0% to 3%. This raises a natural question: Can we design an algorithm whose main computational cost depends solely on the samples corresponding to the active support matrices?
[Figure 1: Proportions of support matrices and active support matrices relative to the total sample size.]
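As a small illustration of the two sample classes just defined, the sketch below separates support matrices from active support matrices using a computed primal-dual pair, under the standard soft-margin interpretation that active support matrices carry dual multipliers strictly between 0 and $C$; all variable names and the tolerance are our own assumptions.

```matlab
% Minimal sketch (assumed names): margins(i) = y_i*(<W,X_i> + b) and lam is
% the vector of dual multipliers associated with the hinge-loss constraints.
tol = 1e-8;                                    % numerical tolerance (assumption)
support    = find(margins <= 1 + tol);         % on or inside the margin, or misclassified
active_sup = find(lam > tol & lam < C - tol);  % multipliers strictly between 0 and C
fprintf('support: %.1f%%, active support: %.1f%%\n', ...
    100*numel(support)/numel(margins), 100*numel(active_sup)/numel(margins));
```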
2.2 Characterizing the low-rank property of the solution
In addition to the sample sparsity of model (P), another key property is the low-rank structure of the solution . Indeed, since is a solution to the KKT system (3), one has that
Without loss of generality, we assume that . And suppose has the following singular value decomposition (SVD):
where , , and with . Furthermore, define
By applying the proximal mapping , as described in liu2012implementable , we obtain
This implies that the rank of is . As illustrated in Figure 2, for sufficiently large , the rank is non-increasing, potentially enhancing classification accuracy on the test set (see Accuracytest defined in (48)). Thus, the low-rank property of may correspond to improved performance on the test set. Given this, how can an efficient algorithm be designed to leverage the low-rank structure of when is sufficiently small?
[Figure 2: Rank of the solution and test accuracy as functions of the regularization parameter.]
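The low-rank structure above stems from singular value soft-thresholding, i.e., the proximal mapping of the nuclear norm used in the derivation. A minimal MATLAB sketch of this standard operation (with our own function name) is given below.

```matlab
% Minimal sketch: proximal mapping of t*||.||_* via singular value
% soft-thresholding; the function name and inputs are our own choices.
function Y = prox_nuclear(X, t)
    [U, S, V] = svd(X, 'econ');
    s = max(diag(S) - t, 0);                 % soft-threshold the singular values
    r = nnz(s);                              % rank of the result
    Y = U(:, 1:r) * diag(s(1:r)) * V(:, 1:r)';
end
```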
3 A semismooth Newton-CG based augmented Lagrangian method for the SMM model
In this section, we present our algorithmic framework for solving the SMM model (1), which comprises an outer loop using the inexact augmented Lagrangian method (ALM) and inner loops employing the semismooth Newton-CG method for ALM subproblems.
3.1 An inexact augmented Lagrangian method
We are going to develop an inexact augmented Lagrangian method for solving the problem (P). Given a positive penalty parameter , the augmented Lagrangian function for (P) takes the form
(6) | ||||
where and . Following the general framework in rockafellar1976augmented , the steps of the ALM are summarized in Algorithm 1 below.
Initialization: Given are two positive parameters and . Choose an initial point . Set . Perform the following steps in each iteration until a proper stopping criterion is satisfied:
(7) |
In Step 1 of Algorithm 1, the subproblem (7) is solved inexactly. Different inexact rules can be adopted. Here, we give two easily implementable inexact rules. Observe that is an optimal solution to the ALM subproblem (7) if and only if it satisfies
(8) | |||
(9) |
where the function is defined by
(10) | ||||
Here, and are defined as
(11) |
Additionally, and represent the Moreau envelope and proximal mapping of some proper closed convex function , respectively. Similar to those in cui2019r , we adopt the following two easily implementable inexact rules for (8):
where and are two given positive summable sequences, , , , and .
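For orientation, the following MATLAB-style skeleton records the generic pattern of the inexact ALM in Algorithm 1. It is only a sketch: the helper functions, the exact multiplier update, and the penalty schedule stand in for the precise formulas and the inexactness criteria (A)/(B) given above.

```matlab
% Generic inexact-ALM skeleton corresponding to Algorithm 1; the helper
% functions and parameter names below are placeholders, not the paper's code.
sigma = sigma0;                  % penalty parameter
z = z0; lam = lam0;              % primal iterate and multiplier
for k = 1:maxit
    % Step 1: solve the ALM subproblem (7) inexactly (cf. criteria (A)/(B)).
    z = solve_subproblem_sncg(z, lam, sigma, subtol(k));
    % Step 2: multiplier update along the constraint residual.
    lam = lam + sigma * constraint_residual(z);
    % Stopping test on the relative KKT residual, then update sigma.
    if kkt_residual(z, lam) <= tol, break; end
    sigma = min(rho * sigma, sigma_max);
end
```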
Rockafellar's original work rockafellar1976augmented demonstrated the asymptotic superlinear convergence rate of the dual sequence produced by the ALM, assuming Lipschitz continuity of the dual solution mapping at the origin. However, this condition is challenging to satisfy for (P) due to the requirement of dual solution uniqueness. Therefore, we consider a weaker quadratic growth condition on the dual problem () in this paper. The quadratic growth condition for () at is said to hold if there exist constants and such that
(12) |
Here, the function is defined in (), denotes the set of all optimal solutions to (), and represents the set of all feasible solutions to ().
Denote the natural residual mapping as follows:
The following theorem establishes the global convergence and asymptotic R-superlinear convergence of the KKT residuals in terms of under criteria (A) and (B) for Algorithm 1, which can be directly derived from (cui2019r, , Theorem 2).
Theorem 3.1
Let be an infinite sequence generated by Algorithm 1 under criterion (A) with and . Then the sequence converges to some . Moreover, the sequence is bounded with all of its accumulation points in the solution set of (P).
If in addition, criterion (B) is executed and the quadratic growth condition (12) holds at , then there exists such that for all , and
where
Moreover, if .
The following proposition gives a sufficient condition for the quadratic growth condition (12) with respect to the dual problem (). Here, we assume without loss of generality that . A detailed proof is given in Appendix A.
Proposition 1
Let . Assume that there exists such that
(14) |
where , and the SVD of is
(15) |
Then the quadratic growth condition (12) at holds for the dual problem ().
Finally, condition (14) can be interpreted as a strict complementarity condition. Indeed, if satisfies the KKT system (3), then . Consequently, the condition (14) means that there exists such that
(16) |
Moreover, using (zhou2017unified, , Proposition 10) and (A.5), we can deduce . In this sense, equality (16) is regarded as a strict complementarity between the matrices and . In particular, (16) remains true if and , a condition inherently satisfied in the context of the soft margin support vector machine model cortes1995support .
3.2 A semismooth Newton-CG method to solve the subproblem (8)
In Algorithm 1, the primary computational cost arises from solving the convex subproblem (8). In this subsection, we propose a semismooth Newton-CG method to calculate an inexact solution to this subproblem.
Due to the convexity of , solving subproblem (8) is equivalent to solving the following nonlinear equation:
(17) |
where and are defined in (11). Notice that is not smooth but strongly semismooth (see, e.g., (FP2007, , Proposition 7.4.4) and (kaifeng2011algorithms, , Theorem 2.3)). It is then desirable to solve (17) using the semismooth Newton method. The semismooth Newton method has been extensively studied. Under some regularity conditions, the method can be superlinearly/quadratically convergent; see, e.g., qi1993nonsmooth ; qi2006quadratically ; ZST2010 .
In what follows, we construct an alternative set for the Clarke generalized Jacobian of at , which is more computationally tractable. Define for ,
It follows from (clarke1990optimization, , Proposition 2.3.3 and Theorem 2.6.6) that
The element takes the following form
(18) |
where , , and is the identity operator from to .
Initialization: Choose positive scalars , , and an initial point . Set . Execute the following steps until the stopping criteria () and/or () at are satisfied.
(19) |
In our semismooth Newton method, the positive definiteness of in the Newton linear system (19) is crucial for ensuring the superlinear convergence of the method, as discussed in ZST2010 ; li2018qsdpnal . In what follows, we show that the positive definiteness of is equivalent to the constraint nondegenerate condition from shapiro2003sensitivity , which simplifies to the nonemptiness of a specific index set. Here, the constraint nondegenerate condition associated with an optimal solution of the dual problem for (7) can be expressed as (see shapiro2003sensitivity )
(20) |
Proposition 2
Proof
See Appendix B. ∎
We conclude this section by giving the convergence theorem of Algorithm 2. It can be proved in a way similar to those in (qi1993nonsmooth, , Proposition 3.1) and (ZST2010, , Theorem 3.5).
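To make the structure of Algorithm 2 concrete, the following schematic MATLAB fragment shows one generic semismooth Newton-CG iteration with an Armijo-type line search: the generalized Jacobian element is applied only through matrix-vector products, so MATLAB's pcg can be used with a function handle. The helper names and scalars are placeholders for the quantities defined in the text, and the precise safeguards of Algorithm 2 are those stated above.

```matlab
% One schematic iteration of the semismooth Newton-CG method (Algorithm 2);
% grad_phi, jac_apply, phi and the scalars mu, delta, cgtol, cgmax are
% placeholders for the quantities defined in the text.
g    = grad_phi(u);                      % gradient of the subproblem objective at u
Vmap = @(d) jac_apply(u, d);             % d -> V*d for a generalized Jacobian element V
d    = pcg(Vmap, -g, cgtol, cgmax);      % inexact solution of the Newton system (19)
t    = 1;                                % Armijo backtracking line search
while phi(u + t*d) > phi(u) + mu * t * (g' * d)
    t = delta * t;
end
u = u + t*d;                             % accept the step
```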
4 Implementation of the semismooth Newton-CG method
As previously mentioned, our focus is on situations where the sample size greatly exceeds the maximum of and , particularly when . As a result, the main computational burden in each iteration of Algorithm 2 is solving the following Newton linear system:
(21) |
where is defined in (18) with , , , , and . The right-hand sides are
It can be further rewritten as
(22) |
where
with
(23) |
Clearly, the major cost for finding an inexact solution of (22) lies in the computation of
In the next two parts, we discuss how to reduce the computational cost for computing and by selecting special linear operators and , respectively.
4.1 The computation of
According to (kaifeng2011algorithms, , Theorem 2.5) or (jiang2013solving, , Proposition 2), if , then we can select such that . In the case , the computation of depends on the SVD of . Without loss of generality, we assume . Let have the following SVD:
where , with and , and is the vector of singular values ordered as .
Denote
(24) | ||||
where and . Define three matrices , , and :
(34) |
where is the matrix whose elements are all ones, and
As a result, if , can be expressed as (see (kaifeng2011algorithms, , Theorem 2.5) or (jiang2013solving, , Proposition 2))
(35) |
where and . Here, and are linear matrix operators defined by
The primary computational cost of is approximately flops. However, by leveraging the sparsity of the matrices , , and , the computation of and in (35) can be performed more efficiently.
The term . Based on the definitions of and in (34), in (35) can be computed by
It shows that the computational cost for computing can be reduced from flops to flops. From the definition of in (24), if the tuple generated by Algorithm 1 is a KKT solution to (3), then follows immediately. As mentioned in Subsection 2.2, for sufficiently large , the rank of is generally much less than , i.e., . Therefore, the low-rank structure of can be effectively leveraged within our computational framework.
The term . According to the definition of in (34), we have
with and for . The last equality shows that by the use of the sparsity in , the cost for computing can also be notably decreased from flops to flops when .
The preceding arguments show that exploiting the sparsity in , induced by the low rank of , reduces the computational cost of from flops to flops, with typically much smaller than .
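Although the explicit formulas in (34)-(35) involve several blocks, the flop saving can be illustrated on the leading rank-r block alone. The MATLAB fragment below is only a sketch under that simplification; U1, V1, Omega, and H are our own names for thin singular-vector factors, a weight matrix, and a direction.

```matlab
% Minimal sketch of the low-rank saving in this subsection: only the leading
% rank-r block is shown.  U1 (p-by-r), V1 (q-by-r) are thin singular-vector
% factors, Omega is an r-by-r weight matrix and H is a direction; the full
% formula (35) contains additional blocks, so this is purely illustrative.
T = U1' * (H * V1);                 % r-by-r core at O(r*p*q) flops
Y = U1 * (Omega .* T) * V1';        % rank-r image of H
```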
4.2 The computation of
By the definition of in (23), we have
(36) |
Notice that in the above expression, can be an arbitrary element in . To save computational cost, we select a simple as follows:
Let
(37) |
It is easy to get , and
where the linear mapping and its adjoint mapping with are defined by
(38) |
Therefore, it follows from (36) that
It means that the cost for computing can be decreased from to . Observe that if is generated by Algorithm 1 as a solution of the KKT system (3), then the index set corresponds to the active support matrices at . In fact, based on Step 2 in Algorithm 1 and (9), we deduce from the Moreau identity that
where is defined in (11). It further implies from (37) that
(39) |
The analysis in Subsection 2.1 reveals that the cardinality of the index set is typically much smaller than , thereby providing a positive answer to the question posed earlier. This results in a significant decrease in the computational cost for calculating .
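The saving can also be seen directly in matrix form. In the sketch below, A and J are hypothetical names for a matrix representation of the sampling operator and for the active index set in (37); because the generalized Jacobian of the projection acts as a 0/1 diagonal selector, the product only involves the rows indexed by J.

```matlab
% Minimal sketch of the sample-sparsity saving in this subsection.  A is an
% n-by-d matrix representation of the sampling operator and J the active
% index set in (37) (both names are ours); the 0/1 diagonal selector makes
% the two products below identical, but the second touches only |J| rows.
D  = sparse(J, J, 1, n, n);         % diagonal 0/1 selector of the active samples
v1 = A' * (D * (A * h));            % naive product over all n samples
AJ = A(J, :);                       % rows of the active support matrices only
v2 = AJ' * (AJ * h);                % same vector at roughly |J|/n of the cost
```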
4.3 Solving equation (22)
The discussion above has shown that the equation (22) can be reduced to
(40) |
with
(41) |
Accordingly, the inexact rule in Step 1 of Algorithm 2 is written as
The positive definiteness of the above linear operator is proved in the next proposition.
Proposition 3
The linear operator defined in (41) is a self-adjoint and positive definite operator.
Proof
From (meng2005semismoothness, , Proposition 1), we know that each element in is self-adjoint and positive semidefinite. So is self-adjoint. If the index set is empty, then is positive definite. Otherwise, for any , it follows from (41) that
where the second-to-last inequality follows from the fact that is an orthogonal projection matrix in . The desired conclusion is achieved. ∎
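Putting the two reductions together, the reduced system (40) can be solved by CG through a single function handle; Proposition 3 guarantees that the operator is positive definite, so CG is applicable. The helper names below are placeholders for the terms sketched in Subsections 4.1 and 4.2, and the exact scaling of each term follows (41).

```matlab
% Minimal sketch: assembling the operator in (41) as a function handle from
% the structured pieces of Subsections 4.1 and 4.2 and solving (40) by CG;
% base_term, lowrank_term, active_term, rhs, cgtol, cgmax are placeholders.
Vred = @(h) base_term(h) + lowrank_term(h) + active_term(h);
dh   = pcg(Vred, rhs, cgtol, cgmax);   % inexact solve, justified by Proposition 3
```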
5 An adaptive sieving strategy
The previous section concentrated on solving the SMM model (P) with fixed parameters and . In many cases, one needs to compute a solution path across a specified sequence of grid points . A typical example is to optimize the hyperparameter through cross-validation while maintaining constant in model (P). This section presents an adaptive sieving (AS) strategy that iteratively reduces the sample size, thereby facilitating the efficient computation of the solution path for in models (P), particularly for extremely large .
We develop the AS strategy based on the following two key observations. First, the solution of (1) depends only on the samples corresponding to its support matrices (abbreviated as active samples). Identifying these active samples can reduce the scale of the problem and thereby its computational cost. Second, as shown in Figure 1, when , the set of support matrices at is always a superset of those at . Thus, to solve model (1) at , it is reasonable to initially restrict it to the active sample set at .
The core idea behind the AS strategy can be viewed as a specialized warm-start approach. It aggressively guesses and adjusts the active samples at for a new grid point , using the solution obtained from the previous grid point . Specifically, we initiate the th problem within a restricted sample space that includes all active samples at . After solving the restricted SMM model, we update active samples at the computed solution. The process is repeated until no additional active samples need to be included.
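The following MATLAB-style sketch records this loop for one new grid point. The helper names solve_reduced_smm and find_active_samples, as well as the other variables, are our own placeholders; the precise update of the index set, including the tolerance and the scalar that enlarges it, is given in Algorithm 3 itself.

```matlab
% Schematic sketch of the adaptive sieving loop (Algorithm 3) for one new grid
% point, with hypothetical helper names.  I holds the current guess of active
% sample indices; the reduced problem (42) is solved on I and the active set
% is refreshed until no additional active samples need to be included.
I = I_prev;                                           % active samples at the previous grid point
while true
    sol = solve_reduced_smm(I, C_next, sol);          % warm-started ALM-SNCG on the reduced problem
    Jnew = find_active_samples(sol, C_next, tol_as);  % active samples at the computed solution
    extra = setdiff(Jnew, I);
    if isempty(extra), break; end                     % sieving has converged for this grid point
    I = union(I, extra);
end
```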
A similar strategy has been employed to efficiently compute solution paths for convex composite sparse machine learning problems by leveraging the inherent sparsity of solutions yuan2023adaptive ; yuan2022dimension ; li2023mars ; wu2023convex . However, since the SMM model (P) does not guarantee sparsity of the solution, the method proposed in yuan2023adaptive ; lin2020adaptive cannot be directly applied here. We will develop an adaptive sieving strategy that efficiently generates a solution path for the SMM model (P) by effectively exploiting the intrinsic sparsity of the samples. For a specified index set , we define
The subscript is omitted when . The detailed framework of the AS strategy is presented in Algorithm 3.
Choose an initial point , a sequence of grid points , a tolerance , a scalar , and a positive integer . Compute the initial index set . Execute the following steps for with the initialization and .
(42) |
(43) |
(44) |
Remark 1
- (a)
- (b) In Step 3 of Algorithm 3, the nonnegative scalar provides flexibility in adjusting the size of the index set . Specifically, if , includes all indices of active samples at . In general, by selecting an appropriate positive scalar , we aim to ensure that includes as many active indices as possible at the solution for the next grid point .
- (c)
Lastly, we establish the finite convergence of Algorithm 3 in the following theorem. Its proof is given in Appendix C.
Theorem 5.1
For each , Algorithm 3 terminates within a finite number of iterations. That is, there exists an integer such that . Furthermore, any output solution is a KKT tuple for the following problem
(45) |
where the error satisfies , .
6 Numerical experiments
We have conducted extensive numerical experiments to demonstrate the efficiency of our proposed algorithm and strategy on both large-scale synthetic and real datasets. All the experiments were implemented in MATLAB on a Windows workstation (8-core, Intel(R) Core(TM) i7-10700 @ 2.90GHz, 64 GB RAM).
6.1 Implementation details
For the SMM model (P), we measure the quality of the computed solutions via the following relative KKT residual of the iterates:
(46) |
where
By default, the termination criterion for Algorithm 1 (denoted as ALM-SNCG) is either satisfying or reaching a maximum of 500 iterations. In cases where the compared algorithms stopped under different criteria, we used the objective value obtained from ALM-SNCG satisfying (denoted as ) as a benchmark to evaluate their quality (denoted as ):
(47) |
Both synthetic and real datasets were utilized in the subsequent experiments. The synthetic data were generated following the process in luo2015support . For any and , denote . Each vector is constructed as:
Here, are orthonormal vectors in , and . Given a matrix with rank , each sample’s class label is determined by for , where equals if and otherwise. In our experiments, we set and . For each data configuration, the sample sizes vary from to , with used as training data and the remaining used for testing.
We also utilized four distinct real datasets. They are
• the electroencephalogram alcoholism dataset (EEG, http://kdd.ics.uci.edu/databases/eeg/eeg.html) for classifying subjects based on EEG signals as alcoholics or non-alcoholics;
• the INRIA person dataset (INRIA, ftp://ftp.inrialpes.fr/pub/lear/douze/data/INRIAPerson.tar) for detecting the presence of individuals in an image;
• the CIFAR-10 dataset (CIFAR-10, https://www.cs.toronto.edu/~kriz/cifar.html) for classifying images as depicting either a dog or a truck;
• the MNIST handwritten digit dataset (MNIST, https://yann.lecun.com/exdb/mnist) for classifying handwritten digit images as the number 0 or not.
Additional information for these datasets is provided in Table 2, where and denote the number of training and test samples, respectively.
datasets | n_train | n_test | datasets | n_train | n_test
---|---|---|---|---|---
EEG | 300 | 66 | CIFAR-10 | 10000 | 2000
INRIA | 3634 | 1579 | MNIST | 60000 | 10000
In subsequent experiments, we set the parameter in the SMM model (1) to 10 and 100 for random data, and to 1 and 10 for real data.
6.2 Solving the SMM model at a fixed parameter value
In this subsection, we evaluate four algorithms - ALM-SNCG, the inexact semi-proximal ADMM (isPADMM), the symmetric Gauss-Seidel based isPADMM (sGS-isPADMM), and Fast ADMM with restart (F-ADMM) - for solving the SMM model (1) with fixed parameters and . Additional details on isPADMM and sGS-isPADMM can be found in Appendices D and E, respectively. The F-ADMM is based on the works of Luo et al. luo2015support (its code is available for download at http://bcmi.sjtu.edu.cn/~luoluo/code/smm.zip) and Goldstein et al. goldstein2014fast . Each method is subject to a two-hour computational time limit. For consistent comparison, we adopt the termination criterion , where Relobj is defined in (47). We assess their performance at two accuracy levels: and .
Firstly, we evaluate the performance of two-block ADMM algorithms (F-ADMM and isPADMM) and three-block ADMM (sGS-isPADMM) for solving the SMM model (1) on the EEG training dataset. The “Iteration time” column in Table 3 represents the average time per iteration for each algorithm. Numerical results in Table 3 show that, on average, sGS-isPADMM is 2.6 times faster per iteration than F-ADMM and 52.2 times faster than isPADMM. However, sGS-isPADMM is 7.2 times slower than isPADMM in terms of total time consumed due to having 184.8 times more iteration steps on average, leading to higher SVD costs from the soft thresholding operator. Henceforth, we will focus on comparing the performance of two-block ADMM algorithms and ALM-SNCG on synthetic and real datasets.
Relobj | Iteration numbers | Time (seconds) | Iteration time (seconds) | |||||||||||
sGS | F | isP | sGS | F | isP | sGS | F | isP | sGS | F | isP | |||
1e-4 | 1 | 1e-4 | 1e-4 | 6e-5 | 9e-5 | 3876 | 477 | 19 | 26.1 | 5.0 | 2.0 | 7e-3 | 1e-2 | 1e-1 |
1e-3 | 1e-4 | 1e-4 | 5e-5 | 4251 | 285 | 16 | 27.5 | 4.4 | 3.8 | 6e-3 | 2e-2 | 2e-1 | ||
1e-2 | 1e-4 | 5e-4 | 1e-4 | 4329 | 30000 | 22 | 28.5 | 616.3 | 11.3 | 7e-3 | 2e-2 | 5e-1 | ||
1e-1 | 1e-4 | 5e-3 | 8e-5 | 4317 | 30000 | 23 | 27.7 | 616.2 | 16.1 | 6e-3 | 2e-2 | 7e-1 | ||
10 | 1e-4 | 6e-5 | 6e-5 | 6e-5 | 3324 | 117 | 20 | 20.6 | 1.2 | 6.0 | 6e-3 | 1e-2 | 3e-1 | |
1e-3 | 1e-4 | 1e-4 | 1e-4 | 8818 | 8141 | 111 | 55.0 | 100.2 | 9.7 | 6e-3 | 1e-2 | 9e-2 | ||
1e-2 | 1e-4 | 1e-4 | 9e-5 | 7986 | 2841 | 93 | 49.9 | 58.5 | 22.1 | 6e-3 | 2e-2 | 2e-1 | ||
1e-1 | 1e-4 | 1e-1 | 1e-4 | 5428 | 30000 | 63 | 33.9 | 621.1 | 31.1 | 6e-3 | 2e-2 | 5e-1 | ||
1e-6 | 1 | 1e-4 | 1e-6 | 1e-6 | 9e-7 | 23861 | 25759 | 45 | 151.5 | 279.1 | 3.3 | 6e-3 | 1e-2 | 7e-2 |
1e-3 | 1e-6 | 1e-6 | 8e-7 | 7147 | 501 | 25 | 44.2 | 9.0 | 5.5 | 6e-3 | 2e-2 | 2e-1 | ||
1e-2 | 1e-6 | 5e-4 | 7e-7 | 7614 | 30000 | 48 | 48.5 | 621.7 | 21.7 | 6e-3 | 2e-2 | 5e-1 | ||
1e-1 | 1e-6 | 5e-3 | 9e-7 | 7544 | 30000 | 44 | 47.1 | 623.7 | 29.4 | 6e-3 | 2e-2 | 7e-1 | ||
10 | 1e-4 | 8e-7 | 7e-7 | 6e-7 | 4268 | 121 | 28 | 26.4 | 1.3 | 11.6 | 6e-3 | 1e-2 | 4e-1 | |
1e-3 | 1e-6 | 1e-6 | 1e-6 | 26985 | 21117 | 151 | 168.4 | 259.1 | 12.5 | 6e-3 | 1e-2 | 8e-2 | ||
1e-2 | 1e-6 | 1e-6 | 7e-7 | 11853 | 5363 | 111 | 74.2 | 109.6 | 25.9 | 6e-3 | 2e-2 | 2e-1 | ||
1e-1 | 1e-6 | 6e-2 | 4e-7 | 9044 | 30000 | 89 | 56.5 | 617.2 | 41.1 | 6e-3 | 2e-2 | 5e-1 |
It is worth mentioning that ALM-SNCG is initialized with a lower-accuracy starting point from isPADMM, using up to four iterations for synthetic datasets and up to ten iterations for real datasets. Otherwise, the origin is used as the default starting point for the algorithms. Tables 4 and 5 present the numerical performance of the above three algorithms, including the following information:
• : the cardinality of the index set defined in (37);
• : the cardinality of the index set defined in (24);
• Accuracytest: the classification accuracy on the test set for solutions from each algorithm, i.e.,
(48) Here, represents the predicted class label vector, and denotes the true class label vector for the test set.
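For completeness, (48) amounts to the fraction of correctly predicted test labels. A minimal sketch, with our own variable names, is given below.

```matlab
% Minimal sketch of (48): predicted labels and test accuracy (names ours).
n_test = size(Xtest, 3);
y_pred = zeros(n_test, 1);
for i = 1:n_test
    y_pred(i) = sign(sum(sum(W .* Xtest(:, :, i))) + b);   % sign of the decision value
end
accuracy_test = mean(y_pred == y_test);                     % fraction of correct predictions
```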
6.2.1 Numerical results on synthetic data
Table 4 shows the computational results for ALM-SNCG, isPADMM, and F-ADMM on randomly generated data with ranging from to . The results indicate that ALM-SNCG and isPADMM can solve all problems within two hours. In contrast, F-ADMM encountered memory trouble when the sample sizes exceeded . For , F-ADMM exhibited lower classification accuracy () compared to isPADMM and ALM-SNCG due to its inability to achieve the desired solution accuracy. Even for the smallest instance , F-ADMM is, on average, 101.2 times slower than ALM-SNCG to reach and 210.2 times slower to reach . On the other hand, ALM-SNCG is on average 8.7 times faster than isPADMM for and 13.0 times faster for . This shows that ALM-SNCG is more efficient and stable, particularly for high-accuracy solutions.
Data | Relobj | Time (seconds) | ||||||||||||
F | isP | A | F | isP | A | F | isP | A | ||||||
(1e4, 1e2, 1e2) | 1e-4 | 10 | 0.1 | 27 | 1 | 0.9876 | 0.9872 | 0.9872 | 8e-5 | 8e-5 | 3e-5 | 11.2 | 18.2 | 1.6 |
1 | 65 | 1 | 0.9932 | 0.9932 | 0.9936 | 8e-5 | 1e-4 | 8e-5 | 13.5 | 13.5 | 1.7 | |||
10 | 46 | 1 | 0.9928 | 0.9928 | 0.9928 | 4e-5 | 3e-5 | 1e-4 | 26.0 | 20.6 | 1.6 | |||
100 | 26 | 1 | 0.9936 | 0.9936 | 0.9936 | 9e-5 | 6e-5 | 4e-5 | 28.1 | 21.7 | 7.9 | |||
100 | 0.1 | 19 | 1 | 0.9844 | 0.9844 | 0.9848 | 8e-5 | 8e-5 | 1e-4 | 14.8 | 10.4 | 2.2 | ||
1 | 42 | 1 | 0.9880 | 0.9880 | 0.9880 | 3e-5 | 6e-5 | 8e-5 | 10.5 | 17.1 | 2.3 | |||
10 | 39 | 1 | 0.9932 | 0.9932 | 0.9932 | 1e-4 | 9e-5 | 7e-5 | 26.3 | 69.1 | 1.7 | |||
100 | 37 | 1 | 0.9932 | 0.9932 | 0.9932 | 1e-4 | 9e-5 | 8e-5 | 1526.2 | 37.5 | 2.0 | |||
1e-6 | 10 | 0.1 | 21 | 1 | 0.9872 | 0.9872 | 0.9872 | 4e-7 | 8e-7 | 8e-7 | 21.7 | 29.4 | 2.4 | |
1 | 20 | 1 | 0.9932 | 0.9932 | 0.9932 | 4e-7 | 8e-7 | 8e-7 | 21.2 | 27.1 | 5.5 | |||
10 | 23 | 1 | 0.9928 | 0.9928 | 0.9928 | 7e-7 | 3e-7 | 8e-7 | 33.1 | 37.0 | 4.3 | |||
100 | 22 | 1 | 0.9936 | 0.9936 | 0.9936 | 3e-5 | 7e-7 | 9e-7 | 7200.6 | 42.4 | 14.0 | |||
100 | 0.1 | 11 | 1 | 0.9848 | 0.9848 | 0.9848 | 9e-7 | 7e-7 | 8e-7 | 29.5 | 18.9 | 4.9 | ||
1 | 18 | 1 | 0.9880 | 0.9880 | 0.9880 | 8e-7 | 8e-7 | 9e-7 | 15.4 | 25.6 | 5.3 | |||
10 | 23 | 1 | 0.9932 | 0.9932 | 0.9932 | 7e-7 | 4e-7 | 1e-6 | 38.7 | 46.4 | 6.0 | |||
100 | 22 | 1 | 0.9932 | 0.9932 | 0.9932 | 5e-4 | 8e-7 | 7e-7 | 7200.0 | 116.7 | 6.4 | |||
(1e4, 1e3, 5e2) | 1e-4 | 10 | 0.1 | 22 | 1 | 0.9940 | 0.9940 | 0.9940 | 6e-5 | 9e-5 | 8e-5 | 421.3 | 2020.6 | 237.8 |
1 | 23 | 1 | 0.9924 | 0.9924 | 0.9924 | 3e-6 | 8e-5 | 9e-5 | 536.0 | 1082.0 | 295.1 | |||
10 | 23 | 1 | 0.9924 | 0.9924 | 0.9924 | 2e-3 | 8e-5 | 9e-5 | 7200.1 | 1565.1 | 255.8 | |||
100 | 32 | 107 | 0.9888 | 0.9936 | 0.9936 | 2e+0 | 9e-5 | 8e-5 | 7202.5 | 1249.0 | 864.5 | |||
100 | 0.1 | 20 | 1 | 0.9880 | 0.9880 | 0.9880 | 9e-5 | 8e-5 | 2e-5 | 556.5 | 2030.1 | 243.7 | ||
1 | 24 | 1 | 0.9940 | 0.9940 | 0.9940 | 4e-3 | 7e-5 | 7e-5 | 7202.7 | 1229.2 | 242.3 | |||
10 | 23 | 1 | 0.9936 | 0.9936 | 0.9936 | 3e-2 | 4e-5 | 6e-5 | 7202.5 | 2050.7 | 258.6 | |||
100 | 23 | 1 | 0.9836 | 0.9920 | 0.9920 | 2e+0 | 7e-5 | 8e-5 | 7200.7 | 2118.4 | 510.0 | |||
1e-6 | 10 | 0.1 | 20 | 1 | 0.9940 | 0.9940 | 0.9940 | 9e-7 | 1e-6 | 2e-7 | 525.8 | 2353.6 | 334.3 | |
1 | 21 | 1 | 0.9924 | 0.9924 | 0.9924 | 7e-5 | 3e-7 | 4e-7 | 7203.5 | 1580.2 | 517.2 | |||
10 | 21 | 1 | 0.9924 | 0.9924 | 0.9924 | 2e-3 | 4e-7 | 7e-7 | 7201.9 | 3531.0 | 413.7 | |||
100 | 32 | 107 | 0.9916 | 0.9936 | 0.9936 | 2e+0 | 7e-7 | 6e-7 | 7202.8 | 2177.1 | 1229.5 | |||
100 | 0.1 | 20 | 1 | 0.9880 | 0.9880 | 0.9880 | 9e-7 | 8e-7 | 7e-7 | 969.2 | 2685.1 | 265.5 | ||
1 | 21 | 1 | 0.9940 | 0.9940 | 0.9940 | 4e-3 | 7e-7 | 4e-7 | 7201.5 | 3305.9 | 411.1 | |||
10 | 22 | 1 | 0.9936 | 0.9936 | 0.9936 | 3e-2 | 9e-7 | 9e-7 | 7204.6 | 6706.1 | 371.6 | |||
100 | 21 | 1 | 0.9792 | 0.9920 | 0.9920 | 3e+0 | 6e-7 | 4e-7 | 7200.6 | 3331.2 | 628.9 | |||
(1e5, 50, 1e2) | 1e-4 | 10 | 0.1 | 197 | 1 | - | 0.9681 | 0.9683 | - | 7e-5 | 5e-5 | - | 27.6 | 3.6 |
1 | 636 | 1 | - | 0.9682 | 0.9682 | - | 5e-5 | 2e-5 | - | 33.1 | 6.8 | |||
10 | 502 | 1 | - | 0.9686 | 0.9687 | - | 4e-5 | 4e-5 | - | 38.5 | 15.2 | |||
100 | 430 | 45 | - | 0.9692 | 0.9693 | - | 7e-5 | 4e-5 | - | 93.4 | 88.8 | |||
100 | 0.1 | 189 | 1 | - | 0.9682 | 0.9681 | - | 7e-5 | 5e-5 | - | 56.9 | 3.6 | ||
1 | 568 | 1 | - | 0.9685 | 0.9685 | - | 9e-5 | 7e-5 | - | 64.6 | 6.7 | |||
10 | 590 | 1 | - | 0.9685 | 0.9685 | - | 6e-5 | 7e-5 | - | 76.9 | 4.8 | |||
100 | 388 | 1 | - | 0.9690 | 0.9690 | - | 7e-5 | 2e-5 | - | 111.5 | 38.1 | |||
1e-6 | 10 | 0.1 | 64 | 1 | - | 0.9683 | 0.9683 | - | 1e-6 | 9e-7 | - | 365.7 | 7.0 | |
1 | 97 | 1 | - | 0.9682 | 0.9682 | - | 1e-6 | 9e-7 | - | 278.2 | 13.9 | |||
10 | 93 | 1 | - | 0.9687 | 0.9686 | - | 9e-7 | 8e-7 | - | 96.6 | 22.4 | |||
100 | 117 | 45 | - | 0.9693 | 0.9693 | - | 1e-6 | 9e-7 | - | 282.0 | 104.6 | |||
100 | 0.1 | 36 | 1 | - | 0.9682 | 0.9682 | - | 7e-7 | 9e-7 | - | 324.6 | 10.3 | ||
1 | 48 | 1 | - | 0.9684 | 0.9685 | - | 9e-7 | 9e-7 | - | 654.4 | 19.8 | |||
10 | 66 | 1 | - | 0.9685 | 0.9685 | - | 6e-7 | 8e-7 | - | 171.1 | 15.2 | |||
100 | 87 | 1 | - | 0.9690 | 0.9690 | - | 8e-7 | 1e-6 | - | 417.5 | 45.2 | |||
(1e6, 50, 1e2) | 1e-4 | 10 | 0.1 | 1935 | 1 | - | 0.9018 | 0.9017 | - | 5e-5 | 5e-5 | - | 194.2 | 29.0 |
1 | 11357 | 1 | - | 0.9017 | 0.9017 | - | 4e-5 | 3e-5 | - | 167.5 | 60.0 | |||
10 | 9267 | 30 | - | 0.9074 | 0.9078 | - | 7e-5 | 4e-6 | - | 694.4 | 271.5 | |||
100 | 13027 | 50 | - | 0.9445 | 0.9448 | - | 7e-5 | 5e-5 | - | 2346.5 | 67.0 | |||
100 | 0.1 | 1348 | 1 | - | 0.9017 | 0.9017 | - | 8e-5 | 4e-5 | - | 339.2 | 34.7 | ||
1 | 7545 | 1 | - | 0.9018 | 0.9018 | - | 8e-5 | 6e-5 | - | 370.5 | 49.4 | |||
10 | 8750 | 1 | - | 0.9018 | 0.9018 | - | 7e-5 | 7e-6 | - | 327.2 | 154.8 | |||
100 | 12252 | 29 | - | 0.9408 | 0.9410 | - | 9e-5 | 3e-5 | - | 1702.6 | 1136.2 | |||
1e-6 | 10 | 0.1 | 862 | 1 | - | 0.9018 | 0.9017 | - | 7e-7 | 7e-7 | - | 372.7 | 42.7 | |
1 | 3072 | 2 | - | 0.9017 | 0.9017 | - | 1e-6 | 9e-7 | - | 1222.0 | 89.9 | |||
10 | 4294 | 30 | - | 0.9077 | 0.9077 | - | 1e-6 | 9e-7 | - | 2003.1 | 290.6 | |||
100 | 2409 | 50 | - | 0.9450 | 0.9450 | - | 7e-7 | 9e-7 | - | 5413.7 | 224.1 | |||
100 | 0.1 | 312 | 1 | - | 0.9017 | 0.9017 | - | 9e-7 | 8e-7 | - | 1703.4 | 72.9 | ||
1 | 835 | 1 | - | 0.9018 | 0.9018 | - | 8e-7 | 1e-6 | - | 4299.5 | 127.2 | |||
10 | 2319 | 1 | - | 0.9018 | 0.9018 | - | 6e-7 | 9e-7 | - | 2320.4 | 187.4 | |||
100 | 1833 | 29 | - | 0.9413 | 0.9413 | - | 8e-7 | 9e-7 | - | 5371.3 | 1289.7 |
6.2.2 Numerical results on real data
Table 5 lists the results for ALM-SNCG, F-ADMM, and isPADMM on the model (1) using the four real datasets described in Table 2, with the parameter ranging from to . The results in the table show that both ALM-SNCG and isPADMM achieved high accuracy () in all instances, while F-ADMM only stopped successfully for problems with accuracy , and for problems with accuracy . Table 5 also reveals that ALM-SNCG performed on average 2.6 times (5.9 times) faster than isPADMM and 422.7 times (477.2 times) faster than F-ADMM when the stopping criterion was set to ().
Prob | Relobj | Time (seconds) | ||||||||||||
F | isP | A | F | isP | A | F | isP | A | ||||||
1e-4 | 1 | 1e-4 | 6 | 2 | 0.7424 | 0.7424 | 0.7424 | 6e-5 | 9e-5 | 6e-5 | 5.0 | 2.0 | 2.1 | |
1e-3 | 60 | 5 | 0.8636 | 0.8636 | 0.8636 | 1e-4 | 5e-5 | 6e-5 | 4.4 | 3.8 | 4.5 | |||
1e-2 | 123 | 5 | 0.9394 | 0.9545 | 0.9394 | 5e-4 | 1e-4 | 7e-5 | 616.3 | 11.3 | 6.9 | |||
1e-1 | 123 | 5 | 0.9545 | 0.9545 | 0.9394 | 5e-3 | 8e-5 | 9e-5 | 616.2 | 16.1 | 12.1 | |||
10 | 1e-4 | 95 | 0 | 0.6667 | 0.6667 | 0.6667 | 6e-5 | 6e-5 | 4e-6 | 1.2 | 6.0 | 2.7 | ||
1e-3 | 6 | 2 | 0.7424 | 0.7424 | 0.7424 | 1e-4 | 1e-4 | 9e-5 | 100.2 | 9.7 | 5.7 | |||
1e-2 | 60 | 5 | 0.8636 | 0.8636 | 0.8636 | 1e-4 | 9e-5 | 6e-5 | 58.5 | 22.1 | 9.2 | |||
1e-1 | 124 | 5 | 0.9394 | 0.9394 | 0.9394 | 1e-1 | 1e-4 | 6e-5 | 621.1 | 31.1 | 10.8 | |||
1e-6 | 1 | 1e-4 | 6 | 2 | 0.7424 | 0.7424 | 0.7424 | 1e-6 | 9e-7 | 7e-7 | 279.1 | 3.3 | 2.2 | |
1e-3 | 60 | 5 | 0.8636 | 0.8636 | 0.8636 | 1e-6 | 8e-7 | 9e-7 | 9.0 | 5.5 | 4.7 | |||
1e-2 | 123 | 5 | 0.9394 | 0.9394 | 0.9394 | 5e-4 | 7e-7 | 6e-7 | 621.7 | 21.7 | 8.2 | |||
1e-1 | 123 | 5 | 0.9394 | 0.9394 | 0.9394 | 5e-3 | 9e-7 | 5e-7 | 623.7 | 29.4 | 12.2 | |||
10 | 1e-4 | 91 | 0 | 0.6667 | 0.6667 | 0.6667 | 7e-7 | 6e-7 | 5e-8 | 1.3 | 11.6 | 2.6 | ||
1e-3 | 6 | 2 | 0.7424 | 0.7424 | 0.7424 | 1e-6 | 1e-6 | 7e-7 | 259.1 | 12.5 | 6.1 | |||
1e-2 | 60 | 5 | 0.8636 | 0.8636 | 0.8636 | 1e-6 | 7e-7 | 5e-7 | 109.6 | 25.9 | 10.2 | |||
1e-1 | 123 | 5 | 0.9394 | 0.9394 | 0.9394 | 6e-2 | 4e-7 | 5e-7 | 617.2 | 41.1 | 15.8 | |||
1e-4 | 1 | 1e-3 | 12 | 2 | 0.8892 | 0.8892 | 0.8892 | 1e-4 | 6e-5 | 1e-4 | 5.4 | 6.7 | 5.6 | |
1e-2 | 100 | 6 | 0.8898 | 0.8898 | 0.8898 | 8e-5 | 1e-4 | 1e-4 | 6.0 | 10.6 | 6.5 | |||
1e-1 | 807 | 35 | 0.8638 | 0.8645 | 0.8645 | 5e-3 | 1e-4 | 7e-5 | 7200.1 | 103.5 | 37.3 | |||
1 | 1080 | 44 | 0.8385 | 0.8379 | 0.8379 | 8e-1 | 8e-5 | 7e-5 | 7200.1 | 173.8 | 44.2 | |||
10 | 1e-3 | 1716 | 0 | 0.7131 | 0.7131 | 0.7131 | 9e-5 | 8e-5 | 7e-5 | 392.7 | 328.7 | 23.9 | ||
1e-2 | 17 | 2 | 0.8904 | 0.8904 | 0.8904 | 1e-4 | 9e-5 | 9e-5 | 21.8 | 13.5 | 8.2 | |||
1e-1 | 123 | 6 | 0.8879 | 0.8879 | 0.8879 | 1e-4 | 9e-5 | 9e-5 | 43.8 | 30.0 | 17.1 | |||
1 | 938 | 27 | 0.8493 | 0.8436 | 0.8461 | 3e-1 | 1e-4 | 7e-5 | 7200.2 | 148.5 | 123.4 | |||
1e-6 | 1 | 1e-3 | 11 | 2 | 0.8892 | 0.8892 | 0.8892 | 9e-7 | 8e-7 | 8e-7 | 6.7 | 16.3 | 6.2 | |
1e-2 | 98 | 6 | 0.8898 | 0.8898 | 0.8898 | 9e-7 | 9e-7 | 7e-7 | 12.6 | 26.7 | 8.2 | |||
1e-1 | 807 | 35 | 0.8645 | 0.8645 | 0.8645 | 5e-3 | 1e-6 | 9e-7 | 7200.0 | 416.1 | 88.2 | |||
1 | 1080 | 44 | 0.8391 | 0.8379 | 0.8379 | 8e-1 | 1e-6 | 1e-6 | 7200.2 | 502.1 | 59.1 | |||
10 | 1e-3 | 2028 | 0 | 0.7131 | 0.7131 | 0.7131 | 3e-5 | 7e-7 | 8e-7 | 7200.1 | 1508.3 | 26.0 | ||
1e-2 | 17 | 2 | 0.8911 | 0.8911 | 0.8911 | 1e-6 | 8e-7 | 9e-7 | 49.1 | 19.8 | 9.1 | |||
1e-1 | 123 | 6 | 0.8879 | 0.8879 | 0.8879 | 1e-6 | 9e-7 | 9e-7 | 142.8 | 48.6 | 19.9 | |||
1 | 937 | 27 | 0.8436 | 0.8461 | 0.8461 | 4e-1 | 1e-6 | 8e-7 | 7200.0 | 362.8 | 174.4 | |||
1e-4 | 1 | 1e-3 | 15 | 2 | 0.8015 | 0.8015 | 0.8020 | 9e-5 | 9e-5 | 1e-4 | 14.2 | 2.0 | 1.3 | |
1e-2 | 62 | 5 | 0.8245 | 0.8245 | 0.8245 | 9e-5 | 8e-5 | 1e-4 | 8.3 | 1.7 | 1.1 | |||
1e-1 | 319 | 19 | 0.8205 | 0.8210 | 0.8190 | 1e-4 | 9e-5 | 7e-5 | 26.5 | 4.6 | 1.5 | |||
1 | 751 | 30 | 0.7790 | 0.7990 | 0.8020 | 4e-1 | 9e-5 | 7e-5 | 7200.6 | 12.5 | 2.1 | |||
10 | 1e-3 | 5 | 1 | 0.7645 | 0.7645 | 0.7645 | 1e-4 | 6e-5 | 8e-5 | 25.9 | 2.2 | 1.5 | ||
1e-2 | 15 | 2 | 0.8050 | 0.8050 | 0.8050 | 1e-4 | 7e-5 | 1e-4 | 26.2 | 2.1 | 1.6 | |||
1e-1 | 70 | 5 | 0.8235 | 0.8235 | 0.8235 | 1e-4 | 9e-5 | 8e-5 | 43.9 | 2.4 | 1.9 | |||
1 | 457 | 20 | 0.7805 | 0.8170 | 0.8180 | 2e-1 | 8e-5 | 9e-5 | 7200.3 | 4.2 | 1.9 | |||
1e-6 | 1 | 1e-3 | 11 | 2 | 0.8015 | 0.8020 | 0.8015 | 9e-7 | 6e-7 | 7e-7 | 19.4 | 6.4 | 1.5 | |
1e-2 | 55 | 5 | 0.8245 | 0.8245 | 0.8245 | 9e-7 | 5e-7 | 7e-7 | 19.7 | 6.8 | 1.6 | |||
1e-1 | 304 | 19 | 0.8200 | 0.8200 | 0.8195 | 4e-5 | 9e-7 | 8e-7 | 7200.3 | 17.6 | 1.6 | |||
1 | 723 | 30 | 0.7940 | 0.8015 | 0.8015 | 2e-1 | 9e-7 | 9e-7 | 7200.4 | 42.0 | 3.6 | |||
10 | 1e-3 | 6 | 1 | 0.7645 | 0.7645 | 0.7645 | 1e-6 | 7e-7 | 1e-6 | 79.3 | 4.3 | 2.0 | ||
1e-2 | 12 | 2 | 0.8050 | 0.8050 | 0.8050 | 1e-6 | 8e-7 | 7e-7 | 92.2 | 7.5 | 2.1 | |||
1e-1 | 62 | 5 | 0.8235 | 0.8235 | 0.8235 | 1e-6 | 1e-6 | 8e-7 | 146.9 | 7.3 | 2.2 | |||
1 | 438 | 20 | 0.8020 | 0.8175 | 0.8170 | 3e-1 | 9e-7 | 8e-7 | 7200.2 | 13.5 | 2.9 | |||
1e-4 | 1 | 1e-1 | 212 | 14 | 0.9928 | 0.9928 | 0.9927 | 9e-5 | 9e-5 | 9e-5 | 225.0 | 12.8 | 4.0 | |
1 | 422 | 22 | 0.9930 | 0.9930 | 0.9927 | 2e-3 | 9e-5 | 1e-4 | 7201.4 | 16.9 | 5.7 | |||
1e+1 | 498 | 25 | 0.9764 | 0.9922 | 0.9922 | 6e+0 | 8e-5 | 6e-5 | 7201.0 | 36.4 | 9.6 | |||
1e+2 | 549 | 26 | 0.9540 | 0.9917 | 0.9915 | 2e+1 | 9e-5 | 9e-5 | 7201.6 | 42.2 | 16.7 | |||
10 | 1e-1 | 91 | 5 | 0.9924 | 0.9924 | 0.9924 | 9e-5 | 7e-5 | 9e-5 | 178.2 | 7.7 | 4.4 | ||
1 | 274 | 14 | 0.9926 | 0.9926 | 0.9926 | 1e-3 | 9e-5 | 1e-4 | 7200.3 | 12.4 | 4.1 | |||
1e+1 | 473 | 21 | 0.9769 | 0.9926 | 0.9926 | 5e+0 | 9e-5 | 6e-5 | 7201.0 | 17.7 | 8.3 | |||
1e+2 | 541 | 25 | 0.9710 | 0.9916 | 0.9916 | 1e+1 | 1e-4 | 8e-5 | 7202.4 | 34.5 | 17.1 | |||
1e-6 | 1 | 1e-1 | 213 | 14 | 0.9928 | 0.9928 | 0.9928 | 4e-6 | 9e-7 | 1e-6 | 7200.7 | 42.0 | 4.3 | |
1 | 414 | 22 | 0.9929 | 0.9929 | 0.9928 | 2e-3 | 1e-6 | 1e-6 | 7202.9 | 50.3 | 7.7 | |||
1e+1 | 496 | 25 | 0.9764 | 0.9923 | 0.9923 | 6e+0 | 9e-7 | 7e-7 | 7201.0 | 94.7 | 16.8 | |||
1e+2 | 542 | 26 | 0.9540 | 0.9915 | 0.9915 | 2e+1 | 9e-7 | 8e-7 | 7202.2 | 101.8 | 48.8 | |||
10 | 1e-1 | 91 | 5 | 0.9924 | 0.9924 | 0.9924 | 9e-7 | 9e-7 | 8e-7 | 483.5 | 17.1 | 5.2 | ||
1 | 269 | 14 | 0.9926 | 0.9926 | 0.9926 | 2e-3 | 8e-7 | 8e-7 | 7202.8 | 28.3 | 5.1 | |||
1e+1 | 470 | 21 | 0.9855 | 0.9926 | 0.9926 | 4e+0 | 1e-6 | 9e-7 | 7200.2 | 64.5 | 13.3 | |||
1e+2 | 532 | 25 | 0.9657 | 0.9916 | 0.9916 | 1e+1 | 8e-7 | 9e-7 | 7201.1 | 129.6 | 51.7 |
6.3 Computing a solution path of the SMM model
In this subsection, we assess the numerical performance of the AS strategy on both synthetic and real datasets. In these experiments, we used ALM-SNCG as the inner solver; that is, for given parameters and , we solved problem (1) by ALM-SNCG. The ALM-SNCG with the AS strategy is abbreviated as AS+ALM. We compared the performance of AS+ALM with that of the warm-started ALM-SNCG (abbreviated as Warm+ALM), evaluated at and , respectively. In all experiments, we set in Algorithm 3. Additionally, is set to 0.05 for and 0.1 otherwise for random data, and to 0.4 for real data.
Tables 6 and 7 show the performance of the two methods AS+ALM and Warm+ALM. Each column in the tables has the following meaning:
• : the average number of the support matrices for the reduced subproblems (42);
• /: the average/maximum sample size of (42);
• Worst relkkt: the maximum relative KKT residuals as in (46);
• /: the average values of /computation times;
• Iteration numbers: the average number of iterations. For AS+ALM, the column shows the average number of AS rounds, followed by the average number of outer augmented Lagrangian iterations, with the average number of inner semismooth Newton-CG iterations in parentheses.
6.3.1 Numerical results on synthetic data
Table 6 displays the numerical results from AS+ALM and Warm+ALM using a given sequence on random data. The parameter ranges from 0.1 to 100, divided into 50 equally spaced grid points on the scale. A notable observation from Table 6 is that, on average, the AS strategy requires only one iteration per grid point to achieve the desired solution. This suggests that the initial index sets almost fully cover the indices of the support matrices at the solutions for each .
Table 6 also reveals that, although the average number of inner ALM-SNCG iterations in AS+ALM surpasses that of Warm+ALM across all instances, the AS strategy still accelerates Warm+ALM by an average factor of 2.69 (under ) and 3.07 (under ) in terms of time consumption. This is due to the smaller average sample size of the AS reduced subproblems compared to that of the original SMM models. Furthermore, Figure 3 shows that the percentages of support matrices in the reduced subproblems () consistently exceed those in the original problems (). Here, and represent the number of support matrices and samples in the reduced subproblem (42) during each AS iteration, respectively.
Data | Information of the AS | Avg | Worst relkkt | Iteration numbers | Avgtime (seconds) | ||||||||
AvgnSM | Avgsam | Maxsam | Warm | AS | Warm | AS | Warm | AS | Warm | AS | |||
(1e4, 1e2, 1e2) | 1e-4 | 10 | 1374 | 1532 | 4152 | 28 | 25 | 1e-4 | 5e-5 | (2, 20) | 1(3, 25) | 1.50 | 0.49 |
100 | 2542 | 2825 | 8739 | 27 | 25 | 1e-4 | 1e-4 | (3, 20) | 1(4, 24) | 1.50 | 0.65 | ||
1e-6 | 10 | 1375 | 1532 | 4153 | 21 | 21 | 1e-6 | 6e-7 | (11, 46) | 1(13, 50) | 3.93 | 1.20 | |
100 | 2542 | 2825 | 8736 | 20 | 19 | 1e-6 | 1e-6 | (12, 47) | 1(14, 50) | 3.80 | 1.55 | ||
(1e4, 1e3, 5e2) | 1e-4 | 10 | 449 | 507 | 1360 | 24 | 22 | 1e-4 | 3e-5 | (5, 27) | 1(10, 39) | 98.37 | 40.39 |
100 | 962 | 1086 | 3548 | 25 | 23 | 1e-4 | 1e-4 | (7, 31) | 1(11, 40) | 107.76 | 37.84 | ||
1e-6 | 10 | 449 | 507 | 1360 | 21 | 21 | 1e-6 | 4e-7 | (15, 57) | 1(21, 66) | 211.74 | 69.51 | |
100 | 962 | 1086 | 3550 | 21 | 21 | 1e-6 | 1e-6 | (20, 71) | 1(24, 73) | 244.64 | 64.56 | ||
(1e5, 5e1, 1e2) | 1e-4 | 10 | 17750 | 19285 | 44278 | 105 | 76 | 1e-4 | 4e-5 | (2, 38) | 1(3, 39) | 12.50 | 2.71 |
100 | 22697 | 24878 | 67188 | 120 | 84 | 1e-4 | 4e-5 | (2, 10) | 1(3, 14) | 3.47 | 1.48 | ||
1e-6 | 10 | 17753 | 19287 | 44268 | 32 | 30 | 1e-6 | 6e-7 | (14, 70) | 1(17, 85) | 23.29 | 6.35 | |
100 | 22700 | 24879 | 67180 | 27 | 24 | 1e-6 | 8e-7 | (15, 51) | 1(19, 60) | 16.93 | 5.21 | ||
(1e6, 5e1, 1e2) | 1e-4 | 10 | 276208 | 303668 | 485124 | 1363 | 984 | 1e-4 | 5e-5 | (2, 24) | 1(3, 28) | 76.63 | 34.20 |
100 | 291663 | 321628 | 565727 | 1254 | 860 | 1e-4 | 5e-5 | (2, 15) | 1(3, 18) | 57.21 | 34.15 | ||
1e-6 | 10 | 276231 | 303681 | 485006 | 413 | 395 | 1e-6 | 6e-7 | (14, 67) | 1(17, 76) | 209.41 | 78.54 | |
100 | 291677 | 321643 | 565653 | 234 | 214 | 1e-6 | 7e-7 | (15, 60) | 1(18, 67) | 197.75 | 82.87 |
[Figure 3: Percentages of support matrices in the reduced subproblems versus those in the original problems on synthetic data.]
6.3.2 Numerical results on real data
Lastly, we compare AS+ALM and Warm+ALM in generating the entire solution path for the SMM model (P) on the large-scale CIFAR-10 and MNIST training datasets. We vary from to 1 for CIFAR-10 and from to for MNIST, using 50 equally spaced grid points on a scale. Similar to the previous subsection, Table 7 and Figure 4 demonstrate that the AS strategy reduces the overall sample size of the SMM models, thereby enhancing computational efficiency compared to the warm-starting strategy. In particular, on the MNIST dataset, AS+ALM achieves average speedup factors of 1.50 and 1.92 over Warm+ALM for and , respectively.
Prob | information of the AS | Avg | Worst relkkt | Iteration numbers | Avgtime (seconds) | ||||||||
AvgnSM | Avgsam | Maxsam | Warm | AS | Warm | AS | Warm | AS | Warm | AS | |||
1e-4 | 1 | 4302 | 5709 | 7045 | 242 | 242 | 1e-4 | 1e-4 | (11, 52) | 1(12, 52) | 0.78 | 0.59 | |
10 | 5073 | 6580 | 9661 | 86 | 86 | 1e-4 | 1e-4 | (15, 110) | 1(15, 109) | 1.41 | 1.05 | ||
1e-6 | 1 | 4300 | 5709 | 7045 | 240 | 240 | 1e-6 | 1e-6 | (22, 72) | 1(22, 71) | 1.20 | 0.87 | |
10 | 5073 | 6580 | 9661 | 85 | 85 | 1e-6 | 1e-6 | (26, 138) | 1(26, 138) | 1.73 | 1.30 | ||
1e-4 | 1 | 1063 | 1924 | 2403 | 443 | 442 | 1e-4 | 1e-4 | (5, 73) | 1(5, 75) | 4.16 | 2.77 | |
10 | 1188 | 2130 | 3316 | 352 | 352 | 1e-4 | 1e-4 | (9, 96) | 1(9, 98) | 5.14 | 3.41 | ||
1e-6 | 1 | 1064 | 1924 | 2404 | 441 | 440 | 1e-6 | 1e-6 | (12, 221) | 1(13, 181) | 10.37 | 5.10 | |
10 | 1188 | 2130 | 3318 | 350 | 350 | 1e-6 | 1e-6 | (16, 157) | 1(17, 155) | 7.36 | 4.07 |
[Figure 4: Percentages of support matrices in the reduced subproblems versus those in the original problems on the CIFAR-10 and MNIST datasets.]
7 Conclusion
In this paper, we have proposed a semismooth Newton-CG based augmented Lagrangian method for solving the large-scale support matrix machine (SMM) model. Our algorithm effectively leverages the sparse and low-rank structures inherent in the second-order information associated with the SMM model. Furthermore, we have developed an adaptive sieving (AS) strategy aimed at iteratively guessing and adjusting the active samples, facilitating the rapid generation of solution paths across a fine grid of parameters for the SMM models. Numerical results have convincingly demonstrated the superior efficiency and robustness of our algorithm and strategy in solving large-scale SMM models. Nevertheless, the current AS strategy focuses on reducing the sample size of the subproblems. A further research direction is to explore methods for simultaneously reducing both the sample size and the dimensionality of the feature matrices. This may be achieved through the development of effective combinations of the active subspace selection hsieh2014nuclear ; feng2022subspace and the AS strategy.
Appendices
Appendix A Proof of Proposition 1
Proof
To proceed, we characterize the set . It is not difficult to show that the KKT system (3) is equivalent to the following system:
(A.1)
Indeed, if satisfies (3), then solves (A.1). Conversely, if solves (A.1), then with , , and satisfies the KKT system (3). For any pair , is an invariant (see, e.g., mangasarian1988simple ). We then define . Based on the arguments preceding Subsection 2.1 and the equivalence between (3) and (A.1), we claim that for any , the set can be expressed as
(A.2)
where
Building on the above preparation, we proceed to establish the conclusion of Proposition 1. We first show that for any , is metrically subregular at for , i.e., there exist a constant and a neighborhood of such that
Here, denotes the distance from to a closed convex set in a finite-dimensional real Euclidean space , and represents the graph of a multi-valued mapping from to . In fact, it follows from the piecewise linearity of on and (rockafellar2009variational, Proposition 12.30) that is piecewise polyhedral. Then, for any , we obtain from (robinson1981some, Proposition 1) that is locally upper Lipschitz continuous at , which further implies the metric subregularity of at for . Let be the -norm ball centered at 0 with radius . Similarly, for any , is metrically subregular at for due to the piecewise linearity of on . For any given , it follows from (cui2017quadratic, Proposition 3.8) that is also metrically subregular at for . Thus, there exist positive constants and and a neighborhood of such that for any ,
where the second inequality follows from the metric subregularity of at for and the metric subregularity of at for .
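For reference, since the displayed inequalities of this argument are not reproduced above, we recall the standard definition of metric subregularity in generic notation (the textbook notion, stated with placeholder symbols rather than the paper's specific operators): a set-valued mapping $T:\mathcal{X}\rightrightarrows\mathcal{Y}$ is metrically subregular at $\bar{x}$ for $\bar{y}\in T(\bar{x})$ if
\[
\exists\,\kappa>0 \ \text{and a neighborhood}\ \mathcal{U}\ \text{of}\ \bar{x}\ \text{such that}\quad
\operatorname{dist}\bigl(x,\,T^{-1}(\bar{y})\bigr)\;\le\;\kappa\,\operatorname{dist}\bigl(\bar{y},\,T(x)\bigr)
\quad \forall\, x\in\mathcal{U}.
\]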
Next, for any , we prove that the collection of sets is boundedly linearly regular, i.e., for every bounded set , there exists a constant such that for any ,
In fact, based on the arguments preceding Subsection 2.1 and (A.2), the intersection is nonempty. Define and . It follows from the equality (A.2) that
(A.3)
It can be checked that is boundedly linearly regular if and only if , is boundedly linearly regular. Furthermore, from (rockafellar1970convex, Corollary 19.2.1 and Theorem 23.10), is a polyhedral convex set, implying that is also polyhedral convex. Since and are polyhedral convex, based on (bauschke1999strong, Corollary 3) and (A.3), we only need to show that there exists such that , i.e., . Let . It follows from (watson1992characterization, Example 2) that . Suppose the SVD of in (15) has the following form:
(A.4)
where and , with singular values , and and whose columns form a compatible set of orthonormal left and right singular vectors of with , , , , , and . Based on (watson1992characterization, Example 2) and (A.4), one has that if and only if . On the other hand, from the definition of the matrix in (14), we obtain that
(A.5)
which implies that . It means that if and only if . Therefore, if there exists such that , then is boundedly linearly regular.
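For reference, the generic notion of bounded linear regularity invoked in this argument (stated with placeholder sets, since the paper's specific sets are not reproduced above) reads: a collection of closed convex sets $D_1,\dots,D_m$ with $\bigcap_{i=1}^{m}D_i\neq\emptyset$ is boundedly linearly regular if
\[
\text{for every bounded set } B \text{ there exists } \kappa_B>0 \text{ such that }\quad
\operatorname{dist}\Bigl(x,\ \bigcap_{i=1}^{m}D_i\Bigr)\;\le\;\kappa_B\,\max_{1\le i\le m}\operatorname{dist}(x,D_i)
\quad \text{for all } x\in B.
\]
The argument above verifies this property via the polyhedrality of the sets involved together with (bauschke1999strong, Corollary 3).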
Based on all the above analysis, the desired conclusion can be established by utilizing (cui2016asymptotic, Theorem 3.1) and (artacho2008characterization, Theorem 3.3). ∎
Appendix B Proof of Theorem 2
Proof
Building upon a technique similar to that in (li2018qsdpnal, Proposition 3.1), we obtain the following equivalent relation:
(B.1)
where is any element in defined in (18) with . Here, for any , the Clarke generalized Jacobian can be expressed as
(B.2)
We consider the explicit expression of . Denote the following two index sets
Then we can deduce from (rockafellar2009variational, Theorem 6.9) that
It follows that
The equality (20) at holds if and only if , i.e., .
Appendix C Proof of Theorem 5.1
Proof
For each given , the cardinality of will decrease as that of increases. By the finiteness of the sample size , the index set , as a subset of , will be empty after a finite number of iterations, ensuring finite termination.
Next, we show that the output solution of Algorithm 3 is a KKT tuple of problem (45). Indeed, by the finite termination of Algorithm 3, there exists an integer such that , is a KKT solution of the reduced subproblem (42) with the error , satisfying the condition (43) and . Moreover, and are obtained by expanding and to the -dimensional vectors, with the remaining entries set to and for any , respectively.
For the sake of convenience in further analysis, we will exclude from the above KKT solutions and index sets. Additionally, we will eliminate from , , , and . By the KKT conditions of the reduced subproblem (42) and the condition (43), one has that
(C.1)
where and are defined in (38). By extending and to the -dimensional error vectors and , with the remaining elements being , we only need to show that
Indeed, due to the fact that , we obtain
which, together with (C.1), implies that and . From the way and are expanded, we know that , which further implies from (C.1) that . Finally, from , one has for all . It follows that . Combining this with (C.1), we obtain that . ∎
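The zero-padding step in this proof (extending the reduced KKT solution and its error vector from the active index set back to all samples) can be illustrated with a small, self-contained NumPy helper; the names below are hypothetical and the snippet is not part of Algorithm 3:

```python
import numpy as np

def expand_reduced_solution(alpha_red, err_red, active_idx, n):
    """Extend a reduced multiplier and its error vector from the active index
    set back to the full sample space, padding the removed entries with zeros
    (illustrative helper only)."""
    alpha = np.zeros(n)
    err = np.zeros(n)
    alpha[active_idx] = alpha_red
    err[active_idx] = err_red
    return alpha, err

# toy usage: 3 active samples out of n = 6
alpha_full, err_full = expand_reduced_solution(
    alpha_red=np.array([0.5, 0.0, 1.0]),
    err_red=np.array([1e-7, 0.0, -2e-7]),
    active_idx=np.array([0, 2, 5]),
    n=6,
)
```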
Appendix D An isPADMM for solving SMM model
In this part, we outline the framework of the inexact semi-proximal ADMM (isPADMM) to solve the SMM model (1). Specifically, according to the definition of the linear operator in (2), the SMM model (1) can be equivalently reformulated as follows:
(D.1)
Its augmented Lagrangian function is defined for any as follows:
where is a positive penalty parameter. The framework of isPADMM for solving (D.1) is outlined in Algorithm 5.
Initialization: Let and be given parameters, be a nonnegative summable sequence. Choose and an initial point . Set and perform the following steps in each iteration:
(D.2)
Notably, the subproblem in (D.2) closely resembles the original SMM model (1), except that is set to zero and an additional proximal term involving is included. This similarity allows the direct application of the semismooth Newton-based augmented Lagrangian method to these subproblems, as discussed in Sections 3 and 4. The derivations, not detailed here, follow a rationale similar to that of the aforementioned algorithms. Furthermore, by leveraging the results of (chen2017efficient, Theorem 5.1), we can deduce the global convergence of Algorithm 5 in a direct manner. To ensure computational feasibility, the isPADMM iterations are capped at 30,000, with set to .
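As a structural illustration only, the following generic two-block ADMM loop shows the shape of the iteration in (D.2) under the standard splitting min f(x)+g(z) s.t. Ax+Bz=c; the subproblem solvers, the semi-proximal terms, and the inexactness control of the actual isPADMM are abstracted into user-supplied callbacks, and all names are hypothetical.

```python
import numpy as np

def ispadmm_sketch(solve_x, solve_z, A, B, c, sigma=1.0, max_iter=30000, tol=1e-6):
    """Generic (inexact) two-block ADMM loop for min f(x)+g(z) s.t. Ax+Bz=c.
    `solve_x(z, lam, sigma)` and `solve_z(x, lam, sigma)` are user-supplied,
    possibly inexact, subproblem solvers; this is a structural sketch only,
    not an implementation of Algorithm 5."""
    x = np.zeros(A.shape[1])
    z = np.zeros(B.shape[1])
    lam = np.zeros(c.shape[0])             # multiplier for Ax + Bz = c
    for _ in range(max_iter):
        x = solve_x(z, lam, sigma)         # minimize the augmented Lagrangian over x (inexactly)
        z = solve_z(x, lam, sigma)         # minimize the augmented Lagrangian over z (inexactly)
        r = A @ x + B @ z - c              # primal residual
        lam = lam + 1.2 * sigma * r        # dual update; step length within (0, (1+sqrt(5))/2)
        if np.linalg.norm(r) <= tol * (1.0 + np.linalg.norm(c)):
            break
    return x, z, lam
```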
Appendix E A sGS-isPADMM for solving SMM model
In this part, we present the framework of the symmetric Gauss-Seidel based inexact semi-proximal ADMM (sGS-isPADMM) to effectively solve the SMM model (P). Based on the definition of the augmented Lagrangian function in (6), the steps of the sGS-isPADMM for solving (P) are listed as follows.
Initialization: Let and be given parameters, be a nonnegative summable sequence. Select a self-adjoint positive semidefinite linear operator . Choose an initial point . Set and perform the following steps in each iteration:
Building upon the results in (chen2017efficient, Proposition 4.2 and Theorem 5.1), we can directly obtain the global convergence of the sGS-isPADMM. Lastly, the maximum number of iterations for the sGS-isPADMM is set to 30,000.
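For intuition about the symmetric Gauss-Seidel ingredient: one sGS cycle over a multi-block subproblem updates the trailing blocks in a backward sweep, then the leading block, and then the trailing blocks again in a forward sweep. The schematic loop below (hypothetical interface, not the paper's algorithm) captures only this update order.

```python
def sgs_cycle(blocks, update):
    """One symmetric Gauss-Seidel cycle over a list of block variables.
    `update(i, blocks)` returns a new value for block i given the current
    values of all blocks (schematic only)."""
    s = len(blocks)
    for i in range(s - 1, 0, -1):      # backward sweep over blocks s-1, ..., 1
        blocks[i] = update(i, blocks)
    blocks[0] = update(0, blocks)      # update the leading block
    for i in range(1, s):              # forward sweep over blocks 1, ..., s-1
        blocks[i] = update(i, blocks)
    return blocks
```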
Statements and Declarations
Funding The work of Defeng Sun was supported in part by RGC Senior Research Fellow Scheme No. SRFS2223-5S02 and GRF Project No. 15307822.
Availability of data and materials The references of all datasets are provided in this published article.
Conflict of interest The authors declare that they have no conflict of interest.
References
- (1) Artacho, F.A., Geoffroy, M.H.: Characterization of metric regularity of subdifferentials. J. Convex Anal. 15(2), 365–380 (2008)
- (2) Bauschke, H.H., Borwein, J.M., Li, W.: Strong conical hull intersection property, bounded linear regularity, Jameson’s property (G), and error bounds in convex optimization. Math. Program. 86(1), 135–160 (1999)
- (3) Chen, L., Sun, D.F., Toh, K.C.: An efficient inexact symmetric Gauss-Seidel based majorized ADMM for high-dimensional convex composite conic programming. Math. Program. 161(1), 237–270 (2017)
- (4) Chen, Y., Hang, W., Liang, S., Liu, X., Li, G., Wang, Q., Qin, J., Choi, K.S.: A novel transfer support matrix machine for motor imagery-based brain computer interface. Front. Neurosci. 14, 606949 (2020)
- (5) Clarke, F.H.: Optimization and Nonsmooth Analysis. John Wiley and Sons, New York (1983)
- (6) Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
- (7) Cui, Y., Ding, C., Zhao, X.: Quadratic growth conditions for convex matrix optimization problems associated with spectral functions. SIAM J. Optim. 27(4), 2332–2355 (2017)
- (8) Cui, Y., Sun, D.F., Toh, K.C.: On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions. arXiv preprint arXiv:1610.00875 (2016)
- (9) Cui, Y., Sun, D.F., Toh, K.C.: On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming. Math. Program. 178(1), 381–415 (2019)
- (10) Duan, B., Yuan, J., Liu, Y., Li, D.: Quantum algorithm for support matrix machines. Phys. Rev. A 96(3), 032301 (2017)
- (11) Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, New York (2007)
- (12) Feng, R., Xu, Y.: Support matrix machine with pinball loss for classification. Neural Comput. Appl. 34, 18643–18661 (2022)
- (13) Feng, R., Zhong, P., Xu, Y.: A subspace elimination strategy for accelerating support matrix machine. Pac. J. Optim. 18(1), 155–176 (2022)
- (14) Geng, M., Xu, Z., Mei, M.: Fault diagnosis method for railway turnout with pinball loss-based multiclass support matrix machine. Appl. Sci. 13(22), 12375 (2023)
- (15) Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014)
- (16) Hang, W., Feng, W., Liang, S., Wang, Q., Liu, X., Choi, K.S.: Deep stacked support matrix machine based representation learning for motor imagery EEG classification. Comput. Meth. Programs Biomed. 193, 105466 (2020)
- (17) Hang, W., Li, Z., Yin, M., Liang, S., Shen, H., Wang, Q., Qin, J., Choi, K.S.: Deep stacked least square support matrix machine with adaptive multi-layer transfer for EEG classification. Biomed. Signal Process. Control 82, 104579 (2023)
- (18) Harrow, A.W., Hassidim, A., Lloyd, S.: Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103(15), 150502 (2009)
- (19) Hastie, T., Tibshirani, R., Friedman, J.H., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2. Springer (2009)
- (20) Hsieh, C.J., Olsen, P.: Nuclear norm minimization via active subspace selection. In: International Conference on Machine Learning, pp. 575–583. PMLR (2014)
- (21) Jiang, K.: Algorithms for Large Scale Nuclear Norm Minimization and Convex Quadratic Semidefinite Programming Problems. Ph.D. thesis (2011)
- (22) Jiang, K., Sun, D.F., Toh, K.C.: Solving nuclear norm regularized and semidefinite matrix least squares problems with linear equality constraints. In: Discrete Geometry and Optimization, pp. 133–162. Springer (2013)
- (23) Kumari, A., Akhtar, M., Shah, R., Tanveer, M.: Support matrix machine: A review. Neural Networks (2024). https://doi.org/10.1016/j.neunet.2024.106767
- (24) Laurent, E.G., Vivian, V., Tarek, R.: Safe feature elimination in sparse supervised learning. Pac. J. Optim. 8(4), 667–698 (2012)
- (25) Li, H., Xu, Y.: Support matrix machine with truncated pinball loss for classification. Appl. Soft. Comput. 154, 111311 (2024)
- (26) Li, Q., Jiang, B., Sun, D.F.: MARS: A second-order reduction algorithm for high-dimensional sparse precision matrices estimation. J. Mach. Learn. Res. 24, 1–44 (2023)
- (27) Li, X., Cheng, J., Shao, H., Liu, K., Cai, B.: A fusion CWSMM-based framework for rotating machinery fault diagnosis under strong interference and imbalanced case. IEEE Trans. Industr. Inform. 18(8), 5180–5189 (2022)
- (29) Li, X., Li, S., Wei, D., Si, L., Yu, K., Yan, K.: Dynamics simulation-driven fault diagnosis of rolling bearings using security transfer support matrix machine. Reliab. Eng. Syst. Saf. 243, 109882 (2024)
- (30) Li, X., Li, Y., Yan, K., Shao, H., Lin, J.J.: Intelligent fault diagnosis of bevel gearboxes using semi-supervised probability support matrix machine and infrared imaging. Reliab. Eng. Syst. Saf. 230, 108921 (2023)
- (31) Li, X., Li, Y., Yan, K., Si, L., Shao, H.: An intelligent fault detection method of industrial gearboxes with robustness one-class support matrix machine toward multisource nonideal data. IEEE ASME Trans. Mechatron. 29(1), 388–399 (2024)
- (32) Li, X., Shao, H., Lu, S., Xiang, J., Cai, B.: Highly efficient fault diagnosis of rotating machinery under time-varying speeds using LSISMM and small infrared thermal images. IEEE Trans. Syst. Man Cybern. Syst. 52(12), 7328–7340 (2022)
- (33) Li, X., Sun, D.F., Toh, K.C.: QSDPNAL: A two-phase augmented Lagrangian method for convex quadratic semidefinite programming. Math. Program. Comput. 10, 703–743 (2018)
- (34) Li, X., Yang, Y., Pan, H., Cheng, J., Cheng, J.: Non-parallel least squares support matrix machine for rolling bearing fault diagnosis. Mech. Mach. Theory 145, 103676 (2020)
- (35) Li, Y., Wang, D., Liu, F.: The auto-correlation function aided sparse support matrix machine for EEG-based fatigue detection. IEEE Trans. Circuits Syst. II-Express Briefs 70(2), 836–840 (2023)
- (36) Liang, S., Hang, W., Lei, B., Wang, J., Qin, J., Choi, K.S., Zhang, Y.: Adaptive multimodel knowledge transfer matrix machine for EEG classification. IEEE Trans. Neural Netw. Learn. Syst. 35(6), 7726–7739 (2024)
- (37) Liang, S., Hang, W., Yin, M., Shen, H., Wang, Q., Qin, J., Choi, K.S., Zhang, Y.: Deep EEG feature learning via stacking common spatial pattern and support matrix machine. Biomed. Signal Process. Control 74, 103531 (2022)
- (38) Lin, M., Yuan, Y., Sun, D.F., Toh, K.C.: Adaptive sieving with PPDNA: Generating solution paths of exclusive lasso models. arXiv preprint arXiv:2009.08719 (2020)
- (39) Liu, Y.J., Sun, D., Toh, K.C.: An implementable proximal point algorithmic framework for nuclear norm minimization. Math. Program. 133, 399–436 (2012)
- (40) Luo, L., Xie, Y., Zhang, Z., Li, W.J.: Support matrix machines. In: International Conference on Machine Learning, pp. 938–947. PMLR (2015)
- (41) Mangasarian, O.L.: A simple characterization of solution sets of convex programs. Oper. Res. Lett. 7, 21–26 (1988)
- (42) Meng, F., Sun, D.F., Zhao, G.: Semismoothness of solutions to generalized equations and the Moreau-Yosida regularization. Math. Program. 104, 561–581 (2005)
- (43) Ogawa, K., Suzuki, Y., Takeuchi, I.: Safe screening of non-support vectors in pathwise SVM computation. In: International Conference on Machine Learning, pp. 1382–1390. PMLR (2013)
- (44) Pan, H., Sheng, L., Xu, H., Tong, J., Zheng, J., Liu, Q.: Pinball transfer support matrix machine for roller bearing fault diagnosis under limited annotation data. Appl. Soft. Comput. 125, 109209 (2022)
- (45) Pan, H., Sheng, L., Xu, H., Zheng, J., Tong, J., Niu, L.: Deep stacked pinball transfer matrix machine with its application in roller bearing fault diagnosis. Eng. Appl. Artif. Intell. 121, 105991 (2023)
- (46) Pan, H., Xu, H., Zheng, J., Shao, H., Tong, J.: A semi-supervised matrixized graph embedding machine for roller bearing fault diagnosis under few-labeled samples. IEEE Trans. Industr. Inform. 20(1), 854–863 (2024)
- (47) Pan, H., Xu, H., Zheng, J., Su, J., Tong, J.: Multi-class fuzzy support matrix machine for classification in roller bearing fault diagnosis. Adv. Eng. Inform. 51, 101445 (2022)
- (48) Pan, H., Yang, Y., Zheng, J., Li, X., Cheng, J.: A fault diagnosis approach for roller bearing based on symplectic geometry matrix machine. Mech. Mach. Theory 140, 31–43 (2019)
- (49) Pan, X., Xu, Y.: A novel and safe two-stage screening method for support vector machine. IEEE Trans. Neural Netw. Learn. Syst. 30(8), 2263–2274 (2019)
- (50) Qi, H., Sun, D.F.: A quadratically convergent Newton method for computing the nearest correlation matrix. SIAM J. Matrix Anal. Appl. 28(2), 360–385 (2006)
- (51) Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)
- (52) Qian, C., Tran-Dinh, Q., Fu, S., Zou, C., Liu, Y.: Robust multicategory support matrix machines. Math. Program. 176, 429–463 (2019)
- (53) Razzak, I., Blumenstein, M., Xu, G.: Multiclass support matrix machines by maximizing the inter-class margin for single trial EEG classification. IEEE Trans. Neural Syst. Rehabilitation Eng. 27(6), 1117–1127 (2019)
- (54) Robinson, S.M.: Some continuity properties of polyhedral multifunctions. In: Mathematical Programming at Oberwolfach, Math. Program. Stud., pp. 206–214 (1981)
- (55) Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
- (56) Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1(2), 97–116 (1976)
- (58) Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer Science & Business Media (2009)
- (59) Sanei, S., Chambers, J.A.: EEG Signal Processing. John Wiley & Sons (2013)
- (60) Shapiro, A.: Sensitivity analysis of generalized equations. J. Math. Sci. 115(4), 2554–2565 (2003)
- (61) Wang, J., Wonka, P., Ye, J.: Scaling SVM and least absolute deviations via exact data reduction. In: International Conference on Machine Learning, pp. 523–531. PMLR (2014)
- (62) Watson, G.A.: Characterization of the subdifferential of some matrix norms. Linear Alg. Appl. 170, 33–45 (1992)
- (63) Wu, C., Cui, Y., Li, D., Sun, D.F.: Convex and nonconvex risk-based linear regression at scale. INFORMS J. Comput. 35(4), 797–816 (2023)
- (64) Xu, H., Pan, H., Zheng, J., Tong, J., Zhang, F., Chu, F.: Intelligent fault identification in sample imbalance scenarios using robust low-rank matrix classifier with fuzzy weighting factor. Appl. Soft. Comput. 152, 111229 (2024)
- (65) Yang, J., Zhang, D., Frangi, A.F., Yang, J.y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004)
- (66) Yuan, Y., Chang, T.H., Sun, D.F., Toh, K.C.: A dimension reduction technique for large-scale structured sparse optimization problems with application to convex clustering. SIAM J. Optim. 32(3), 2294–2318 (2022)
- (67) Yuan, Y., Lin, M., Sun, D.F., Toh, K.C.: Adaptive sieving: A dimension reduction technique for sparse optimization problems. arXiv preprint arXiv:2306.17369 (2023)
- (68) Zhang, W., Liu, Y.: Proximal support matrix machine. J. Appl. Math. Phys. 10(7), 2268–2291 (2022)
- (69) Zhao, X.Y., Sun, D.F., Toh, K.C.: A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optim. 20(4), 1737–1765 (2010)
- (70) Zheng, Q., Zhu, F., Heng, P.A.: Robust support matrix machine for single trial EEG classification. IEEE Trans. Neural Syst. Rehabilitation Eng. 26(3), 551–562 (2018)
- (71) Zheng, Q., Zhu, F., Qin, J., Chen, B., Heng, P.A.: Sparse support matrix machine. Pattern Recognit. 76, 715–726 (2018)
- (72) Zheng, Q., Zhu, F., Qin, J., Heng, P.A.: Multiclass support matrix machine for single trial EEG classification. Neurocomputing 275, 869–880 (2018)
- (73) Zhou, Z., So, A.M.C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. 165, 689–728 (2017)
- (74) Zimmert, J., de Witt, C.S., Kerg, G., Kloft, M.: Safe screening for support vector machines. In: NIPS Workshop on Optimization in Machine Learning (OPT) (2015)