Solving Stackelberg Prediction Game with Least Squares Loss via Spherically Constrained Least Squares Reformulation
Abstract
The Stackelberg prediction game (SPG) is popular in characterizing strategic interactions between a learner and an attacker. As an important special case, the SPG with least squares loss (SPG-LS) has recently received much research attention. Although initially formulated as a difficult bi-level optimization problem, the SPG-LS admits tractable reformulations which can be polynomially globally solved by semidefinite programming or second order cone programming. However, none of the available approaches is well suited for handling large-scale datasets, especially those with huge numbers of features. In this paper, we explore an alternative reformulation of the SPG-LS. By a novel nonlinear change of variables, we rewrite the SPG-LS as a spherically constrained least squares (SCLS) problem. Theoretically, we show that an $\epsilon$ optimal solution to the SCLS (and the SPG-LS) can be achieved in $\tilde{O}(N/\sqrt{\epsilon})$ floating-point operations, where $N$ is the number of nonzero entries in the data matrix. Practically, we apply two well-known methods to this new reformulation, namely the Krylov subspace method and the Riemannian trust region method. Both algorithms are factorization free, so they are suitable for solving large-scale problems. Numerical results on both synthetic and real-world datasets indicate that the SPG-LS, equipped with the SCLS reformulation, can be solved orders of magnitude faster than the state of the art.
1 Introduction
The big data era has led to an explosion in the availability of data from which to make decisions. It is thus indispensable to use machine learning techniques to gain deep insights from massive data. In practice, many classic data analytic approaches start by splitting available data into training and test sets. Then, learning algorithms are fed with the training set and are expected to produce results which generalize well to the test set. However, this paradigm only works under the key implicit assumption that the available data in both training and test sets are independently and identically distributed, which, unfortunately, is not always true in practice. For example, in the context of email spam filtering, an attacker often adversarially generates spam emails based on his knowledge of the spam filter implemented by the email service provider (Brückner & Scheffer, 2011; Zhou et al., 2019). In addition to malicious attacks, sometimes the data providers may manipulate data for their own interests. For instance, health insurance policy holders may decide to modify self-reported data to reduce their premiums. On the other hand, the insurers (the “defenders” in this scenario) aim to select a good price model for the true data despite only seeing the modified data.
In fact, these scenarios can be modeled by the Stackelberg prediction game (SPG) (Brückner & Scheffer, 2011; Shokri et al., 2012; Zhou & Kantarcioglu, 2016; Wahab et al., 2016; Zhou et al., 2019; Bishop et al., 2020) which characterizes the interactions between two players, a learner (or, a leader) and a data provider (or, a follower). In this setting, the learner makes the first move by selecting a learning model. Then the data provider, with full knowledge of the learner’s model, is allowed to modify its data. The learner’s goal is to minimize its own loss function under the assumption that the training data has been optimally modified from the data provider’s perspective. From the above description, we see that the SPG model concerns two levels of optimization problems: The follower optimally manipulates its data and the leader makes its optimal decision taking into account the data manipulation. Formally, it is often formulated as a hierarchical mathematical problem or a bi-level optimization problem, which is generally NP-hard even in the simplest case with linear constraints and objectives (Jeroslow, 1985).
To overcome this issue, Bishop et al. (2020) take the first step by focusing on a subclass of SPGs that can be reformulated as fractional programs. Specifically, they assume that all the loss functions of the leader and the follower are least squares, and that a quadratic regularizer is added to the follower’s loss to penalise its manipulation of the data. This assumption eventually turns the bi-level optimization problem into a single-level fractional optimization task which is proven to be polynomially globally solvable. Since no other assumption is made about the learner and data provider, this subclass of SPG, termed the SPG-LS, is general enough to be applied in a wide range of fields. However, the bisection algorithm proposed in Bishop et al. (2020) involves solving several tens of semidefinite programs (SDPs), which is computationally prohibitive in practice. Later, Wang et al. (2021b) improve over Bishop et al. (2020) by showing that the SPG-LS can be globally solved via a single SDP of almost the same size as the ones in Bishop et al. (2020). Furthermore, this single SDP can be reduced to a second order cone program (SOCP). It is shown in Wang et al. (2021b) that the SOCP approach for solving the SPG-LS can be more than 20,000 times faster than the bisection method proposed in Bishop et al. (2020). Yet, the SOCP method is still not well suited for solving large-scale SPG-LS instances. Indeed, the spectral decomposition in the SOCP reformulation process is time-consuming when the feature dimension is high. This inevitably reduces the practical applicability of the SOCP approach for the SPG-LS.
In this paper, we present a novel reformulation to resolve the above mentioned issues of the SOCP method. Specifically, a nonlinear change of variables is proposed to reformulate the SPG-LS as a spherically constrained least squares (SCLS) problem. Then, we prove that an optimal solution to the SPG-LS can be recovered easily from any optimal solution to the SCLS under a mild assumption. The SCLS can be seen as an equality constrained version of the trust region subproblem (TRS) (Conn et al., 2000), which admits a large body of existing research on practical algorithms and theoretical complexity analysis. Based on this, we show that an $\epsilon$ optimal solution (we say $\hat{x}$ is an $\epsilon$ optimal solution of an optimization problem $\min_{x \in \mathcal{X}} f(x)$ if $\hat{x} \in \mathcal{X}$ and $f(\hat{x}) \le \min_{x \in \mathcal{X}} f(x) + \epsilon$) of the SCLS, and thus of the SPG-LS, can be found in $\tilde{O}(N/\sqrt{\epsilon})$ flops, where $N$ denotes the number of nonzero entries of the data matrix and $\tilde{O}$ hides logarithmic factors. This means there exists a linear time algorithm for finding an $\epsilon$ optimal solution of the SPG-LS. Moreover, we demonstrate the empirical efficiency of our SCLS reformulation when matrix factorization free methods like the Krylov subspace method (Gould et al., 1999; Zhang & Shen, 2018) and the Riemannian trust region Newton (RTRNewton) method (Absil et al., 2007) are used as solvers.
We summarise our contributions as follows:
• We derive an SCLS reformulation for the SPG-LS that avoids spectral decomposition steps (which are expensive when the involved data matrices are large). Moreover, we show that an optimal solution to the SPG-LS can be recovered from any optimal solution to the SCLS reformulation under a mild condition.
• Based on the reformulation, we show that an $\epsilon$ optimal solution for the SCLS can be found using $\tilde{O}(1/\sqrt{\epsilon})$ matrix-vector products. In other words, an $\epsilon$ optimal solution can be obtained in running time $\tilde{O}(N/\sqrt{\epsilon})$, where $N$ is the number of nonzeros in the data matrix. Moreover, we show that an approximately optimal solution of the SCLS can be used to recover an approximately optimal solution of the original SPG-LS.
• Two practically efficient algorithms, both factorization free, are adopted to solve the SCLS reformulation. We show that the SCLS approach significantly outperforms the SOCP approach in experiments on both real and synthetic datasets.
2 Preliminaries
In this section, we elaborate on the SPG-LS problem, adopting the same terminology as in Bishop et al. (2020); Wang et al. (2021b). To provide a better understanding of our reformulation, a brief review of the methods in Wang et al. (2021b), currently the fastest existing approaches for solving the SPG-LS, will also be provided.
We assume that the learner has access to $n$ sample tuples $(x_i, y_i, z_i)$, $i = 1, \dots, n$, where $x_i \in \mathbb{R}^d$ is input data with $d$ features, and $y_i, z_i \in \mathbb{R}$ are the true output label of $x_i$ and the label that the data provider would like to achieve, respectively. These samples are assumed to follow some fixed but unknown distribution. The learner aims at training a linear predictor $w \in \mathbb{R}^d$ to best estimate the true output label given the fake data. Meanwhile, the data provider, with full knowledge of the learner’s predictive model $w$, selects its own strategy (i.e., the modified data $\hat{x}_i$) to make the corresponding prediction $w^\top \hat{x}_i$ close to the desired label $z_i$. Note that there is also a regularizer, $\gamma \|\hat{x}_i - x_i\|^2$ with $\gamma > 0$, to control the deviation of the modified data from the original data $x_i$. The hyper-parameter $\gamma$ adjusts the trade-off between data manipulation and closeness to the aimed target.
The problem can be modeled as a Stackelberg prediction game (Brückner & Scheffer, 2011; Bishop et al., 2020). On the one hand, each data provider aims to minimize its own loss function with a regularizer that penalizes the manipulation of the data by solving the following optimization problem:

$$\hat{x}_i \in \operatorname*{argmin}_{\hat{x} \in \mathbb{R}^d} \ (w^\top \hat{x} - z_i)^2 + \gamma \|\hat{x} - x_i\|^2,$$

where $w$ is the learner’s model parameter that is known to the data provider. On the other hand, the learner seeks to minimize the least squares loss with the modified data $\hat{x}_i$:

$$\min_{w \in \mathbb{R}^d} \ \sum_{i=1}^{n} (w^\top \hat{x}_i - y_i)^2.$$

To find the Stackelberg equilibrium of the two players, we focus on the following bi-level optimization problem

$$\min_{w \in \mathbb{R}^d} \ \|\hat{X} w - y\|^2 \quad \text{s.t.} \quad \hat{X} \in \operatorname*{argmin}_{X' \in \mathbb{R}^{n \times d}} \ \|X' w - z\|^2 + \gamma \|X' - X\|_F^2, \qquad (1)$$

where the $i$-th row of $X$ (resp. $\hat{X}$) is the input sample $x_i^\top$ (resp. the modified sample $\hat{x}_i^\top$) and the $i$-th entries of $y$ and $z$ are the labels $y_i$ and $z_i$, respectively.
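For a fixed $w$, the data provider's problem above is an unconstrained strongly convex quadratic in $\hat{x}_i$ and hence has a closed-form solution. The following sketch (in Python/NumPy; the function names are ours, not from the original implementation) computes this best response and the resulting learner loss, assuming the follower's objective $(w^\top \hat{x}_i - z_i)^2 + \gamma\|\hat{x}_i - x_i\|^2$ shown above. Substituting this best response into the learner's loss is what turns (1) into the single-level fractional program studied in Bishop et al. (2020) and Wang et al. (2021b).

```python
import numpy as np

def follower_best_response(X, z, w, gamma):
    """Closed-form best response of the data provider.

    For each row x_i, minimizes (w^T x_hat - z_i)^2 + gamma * ||x_hat - x_i||^2,
    whose solution is x_hat_i = x_i + (z_i - x_i^T w) / (gamma + ||w||^2) * w.
    """
    residual = z - X @ w                  # z_i - x_i^T w, shape (n,)
    scale = residual / (gamma + w @ w)    # per-sample step size, shape (n,)
    return X + np.outer(scale, w)         # modified data X_hat, shape (n, d)

def learner_loss(X, y, z, w, gamma):
    """Learner's least squares loss evaluated at the follower's best response."""
    X_hat = follower_best_response(X, z, w, gamma)
    return np.sum((X_hat @ w - y) ** 2)
```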
In the following subsections, we give a quick review of the single-SDP and SOCP methods in Wang et al. (2021b).
2.1 SDP Reformulation
2.2 SOCP Reformulation
Wang et al. (2021b) further constructed an invertible matrix such that , and are simultaneously congruent to arrow matrices via the change of variables associated to , i.e.,
where , and , and
Therefore, the linear matrix inequality (LMI) constraint in (6) is equivalent to an LMI involving only arrow matrices. Using the generalized Schur complement, we further obtain an SOCP reformulation as follows.
Lemma 2.2 (Theorem 4.1 in Wang et al. (2021b)).
With the same notation in this section, problem (6) is equivalent to the following SOCP problem
(11)
To obtain the SOCP reformulation, we need a spectral decomposition of a matrix whose order is comparable to the feature dimension (Wang et al., 2021b), which is expensive when the dimension is high and may lead to inaccurate solutions when the matrix is ill-conditioned. To remedy this issue, we derive in the next section a factorization free method based on a novel reformulation of problem (2).
3 Main Results
In this section, we show that using a nonlinear change of variables, we can rewrite (3) as a least squares problem over the unit sphere. This is the key observation of our paper.
Before presenting our main results, we first make a blanket assumption on the nonemptiness of the optimal solution set of (2).
Our main result is that under Assumption 3.1, the QFP (3) can be reformulated as a spherically constrained least squares (SCLS) problem

$$\min_{q \in \mathbb{R}^{d+1}} \ \|A q - b\|^2 \quad \text{s.t.} \quad \|q\| = 1, \qquad (12)$$

where the matrix $A$ and the vector $b$ are constructed from the data $X$, $y$, $z$ and the parameter $\gamma$ through a nonlinear change of variables.
A formal statement is deferred to Theorem 3.4.
Before presenting our main results, we first introduce two lemmas that show how, given a feasible solution in (3), we can construct a feasible solution with the same objective value in (12) and vice versa (up to a minor achievability issue).
Proof.
Proof.
We first note that are well-defined as . Next, we check feasibility of in (2):
Here, the second equality follows from the fact that . Finally, we check the objective value of : ∎
Let and be the optimal values of (3) and (12), respectively. Now we are ready to present our main results.
Theorem 3.4.
Proof.
Since the feasible region of (12) is compact and is continuous, it follows from the well-known Weierstrass theorem that there exists at least one optimal solution to (12).
Note that if with is a feasible solution in (12), we must have . Now we claim that cannot be the unique optimal solution to (12). Suppose on the contrary that is the unique optimal solution to (12). Let be any optimal solution to (3). Then, from Lemma 3.2, we have
(15)
where and is feasible to (12). Since is the unique optimal solution to (12), it holds that
(16)
On the other hand, for any and , the pair is clearly feasible to (3) with objective value
(17)
Consequently, for sufficiently large , we must have from (15), (16) and (17) that
which contradicts the optimality of to (3).
We remark that the other direction of the above theorem also holds. That is, under Assumption 3.1, there exists an optimal solution to (3), and, furthermore, the point defined by (13) is optimal to (12). We also remark that, using the relationship (14) and the equivalence of (2) and (3), an optimal solution of the SCLS can be used to recover an optimal solution of the SPG-LS.
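The explicit change of variables is given in the omitted equations (13) and (14); to illustrate the kind of correspondence involved, the sketch below uses one concrete stereographic-type map from $w \in \mathbb{R}^d$ to the unit sphere in $\mathbb{R}^{d+1}$, together with the matrices $A$ and $b$ it induces. This map and the resulting $(A, b)$ are our own derivation from the fractional form of the SPG-LS objective and may differ from the paper's exact parameterization by an equivalent rescaling; `build_scls_data`, `lift_w` and `recover_w` are illustrative names.

```python
import numpy as np

def build_scls_data(X, y, z, gamma):
    """Form (A, b) such that, under the map below, the SPG-LS objective of w
    equals ||A q - b||^2 for the corresponding point q on the unit sphere.

    Derivation sketch: plugging the follower's best response into the learner's
    loss gives per-sample residuals
        (gamma * (x_i^T w - y_i) + (z_i - y_i) * ||w||^2) / (gamma + ||w||^2),
    and the map q = (2*sqrt(gamma)*w, gamma - ||w||^2) / (gamma + ||w||^2)
    sends w to the unit sphere in R^{d+1} and linearizes these residuals.
    """
    A = np.hstack([0.5 * np.sqrt(gamma) * X, -0.5 * z.reshape(-1, 1)])
    b = y - 0.5 * z
    return A, b

def lift_w(w, gamma):
    """Map the learner's model w to the corresponding unit-sphere point q."""
    return np.append(2.0 * np.sqrt(gamma) * w, gamma - w @ w) / (gamma + w @ w)

def recover_w(q, gamma):
    """Recover w from a unit-sphere point q. Valid when q[-1] != -1; the point
    with q[-1] == -1 corresponds to ||w|| -> infinity, the minor achievability
    issue mentioned in Section 3."""
    return np.sqrt(gamma) * q[:-1] / (1.0 + q[-1])
```

A quick numerical check of this correspondence: for random data and any $w$, `np.sum((A @ lift_w(w, gamma) - b)**2)` agrees with `learner_loss(X, y, z, w, gamma)` from the earlier sketch, and `recover_w(lift_w(w, gamma), gamma)` returns `w`.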
In the remainder of this paper, we focus on solving (18). Problem (18) is closely related to the well-known trust region subproblem (TRS), where the sphere constraint is replaced by a unit ball constraint. There exist various methods for solving (18) in the literature on the TRS (Moré & Sorensen, 1983; Gould et al., 1999; Conn et al., 2000; Hazan & Koren, 2016; Ho-Nguyen & Kilinc-Karzan, 2017; Zhang et al., 2017, 2018; Zhang & Shen, 2018; Carmon & Duchi, 2018), or the generalized trust region subproblem (GTRS) (Moré, 1993; Pong & Wolkowicz, 2014; Jiang & Li, 2019, 2020; Wang & Kılınç-Karzan, 2020; Wang et al., 2021a), which minimizes a (possibly nonconvex) quadratic function over a (possibly nonconvex) quadratic inequality or equality constraint. Note that the TRS differs from the SCLS in the constraint, and the GTRS contains the SCLS as a special case.
4 Complexity and Algorithms
In this section, we first show that in theory there exists a linear time algorithm to find an optimal solution for the SPG-LS. After that, we introduce two practically efficient algorithms to solve (18) (and thus recover a solution for the SPG-LS).
We point out that the linear time algorithms for the TRS (Hazan & Koren, 2016; Ho-Nguyen & Kilinc-Karzan, 2017) can be adapted to design a linear time algorithm with complexity $\tilde{O}(N/\sqrt{\epsilon})$ for finding an $\epsilon$ optimal solution of the SCLS, and the linear time algorithms for the GTRS (Jiang & Li, 2020; Wang & Kılınç-Karzan, 2020; Wang et al., 2021a) indicate that the SCLS, as a special case of the GTRS, can also be solved in linear time $\tilde{O}(N/\sqrt{\epsilon})$. Here $N$ denotes the number of nonzero entries in the data matrix, and the logarithm in the runtime comes from the probability of success in Lanczos type methods for finding the smallest eigenvalue of a matrix. Once an $\epsilon$ optimal solution of (18) has been obtained, an approximately optimal solution of (2), and hence of the SPG-LS, can be recovered through the change of variables in Section 3, since the objective values of the two problems coincide under this correspondence. Thus, one can obtain an $\epsilon$ optimal solution to the SPG-LS in runtime $\tilde{O}(N/\sqrt{\epsilon})$, as the main cost is in solving the SCLS (18).
However, in practice the computation of even an approximate minimum eigenvalue may be expensive. Instead, we will introduce two highly efficient algorithms to solve (12) without computing approximate eigenvalues. One is the Krylov subspace method (adapted to the spherically constrained case) proposed in Gould et al. (1999), and the other is the Riemannian trust region Newton (RTRNewton) method proposed in Absil et al. (2007).
4.1 The Krylov Subspace Method
The simplest version of the Krylov subspace method (Section 5 in Gould et al. (1999)) solves a sequence of smaller dimensional problems of the same form as (18). Specifically, writing the SCLS objective as $f(q) = \|Aq - b\|^2 = q^\top H q - 2 g^\top q + b^\top b$ with $H = A^\top A$ and $g = A^\top b$, define the $k$-th Krylov subspace

$$\mathcal{K}_k(H, g) = \operatorname{span}\{g, Hg, H^2 g, \dots, H^{k-1} g\}.$$

Let $Q_k$ be an orthonormal basis of $\mathcal{K}_k(H, g)$ produced by the generalized Lanczos process. Then, assuming the process does not break down, $T_k = Q_k^\top H Q_k$ is a tridiagonal matrix. Each iteration of the Krylov subspace method solves the following subproblem (adapted to the spherically constrained case)
$$\min_{h \in \mathbb{R}^k} \ h^\top T_k h - 2 \|g\| e_1^\top h \quad \text{s.t.} \quad \|h\| = 1, \qquad (20)$$

whose solution $h_k$ yields the approximate solution $Q_k h_k$ of (18).
Gould et al. (1999) proved that the above subproblem can be solved efficiently by a safeguarded Newton's method that exploits the tridiagonal structure of $T_k$, so that the dominant cost of the overall method lies in the matrix-vector products with $H$ required by the Lanczos process, each of which can be computed as $A^\top(Av)$. We remark that though Gould et al. (1999) considered the inequality constrained case $\|h\| \le 1$, the equality and inequality constrained cases are essentially the same if the inequality in the TRS is active, which occurs if $\|H^{\dagger} g\| \ge 1$ (this can be easily derived from the proof of Proposition 4.1). Here $H^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $H$.
To achieve better practical performance, Gould et al. (1999) proposed the generalized Lanczos trust-region (GLTR) method, which is an efficient implementation of the above Krylov subspace method. Based on an efficient nested restarting strategy, Zhang & Shen (2018) further proposed a nested Lanczos method for the TRS (LTRSR), which improves on GLTR.
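To make the generalized Lanczos process concrete, the sketch below builds $T_k$ using only products $Hv = A^\top(Av)$ and then solves the small spherically constrained subproblem (20). For simplicity, the small subproblem is solved here through its secular equation after a dense eigendecomposition of $T_k$ (ignoring the hard case), whereas Gould et al. (1999) use a safeguarded Newton method; function names and tolerances are our own, so this is a sketch rather than a reproduction of GLTR or LTRSR.

```python
import numpy as np
from scipy.optimize import brentq

def lanczos_tridiag(matvec, g, k):
    """Run up to k Lanczos steps on H (given via matvec) started from g.
    Returns an orthonormal basis Q of the Krylov subspace and T = Q^T H Q."""
    n = g.shape[0]
    Q = np.zeros((n, k))
    alphas, betas = [], []
    q, q_prev, beta = g / np.linalg.norm(g), np.zeros(n), 0.0
    for j in range(k):
        Q[:, j] = q
        v = matvec(q) - beta * q_prev
        alpha = q @ v
        v -= alpha * q
        v -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ v)   # full reorthogonalization
        alphas.append(alpha)
        beta = np.linalg.norm(v)
        if beta < 1e-12:            # invariant subspace found; stop early
            Q = Q[:, :j + 1]
            break
        betas.append(beta)
        q_prev, q = q, v / beta
    m = Q.shape[1]
    T = np.diag(alphas[:m]) + np.diag(betas[:m - 1], 1) + np.diag(betas[:m - 1], -1)
    return Q, T

def solve_small_sphere_ls(T, beta0):
    """Solve min_h h^T T h - 2*beta0*e_1^T h s.t. ||h|| = 1 (subproblem (20))
    via the secular equation; the 'hard case' is not handled in this sketch."""
    lam, V = np.linalg.eigh(T)
    c = beta0 * V[0, :]                       # V^T (beta0 * e_1)
    phi = lambda mu: np.sum((c / (lam + mu)) ** 2) - 1.0
    lo = -lam[0] + 1e-12                      # multiplier must keep T + mu*I PSD
    hi = -lam[0] + np.linalg.norm(c) + 1.0    # norm is < 1 for mu this large
    mu = brentq(phi, lo, hi)
    return V @ (c / (lam + mu))

def krylov_scls(A, b, k):
    """Approximately solve min_{||q||=1} ||A q - b||^2 using a k-dimensional
    Krylov subspace of H = A^T A generated from g = A^T b."""
    matvec = lambda v: A.T @ (A @ v)          # never form A^T A explicitly
    g = A.T @ b
    Q, T = lanczos_tridiag(matvec, g, k)
    h = solve_small_sphere_ls(T, np.linalg.norm(g))
    return Q @ h
```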
The convergence behavior of the Krylov subspace method is also well analyzed in the literature. The optimality condition of problem (18) is characterized as follows (adapted from Chapter 7 in Conn et al. (2000))
$$(H + \lambda^* I)\, q^* = g, \qquad H + \lambda^* I \succeq 0, \qquad \|q^*\| = 1, \qquad (21)$$
where $\lambda^*$ is the corresponding Lagrange multiplier. It is shown that there always exists an optimal solution $q^*$ and a unique Lagrange multiplier $\lambda^*$, because different values of $\lambda$ yield different values of $\|(H + \lambda I)^{-1} g\|$, contradicting the constraint $\|q^*\| = 1$.
Define

$$\kappa = \frac{\lambda_{\max}(H) + \lambda^*}{\lambda_{\min}(H) + \lambda^*} \qquad (22)$$

when $\lambda_{\min}(H) + \lambda^* > 0$, and use the convention $\kappa = +\infty$ when $\lambda_{\min}(H) + \lambda^* = 0$. Here $\kappa$ is regarded as the condition number of (18) (Carmon & Duchi, 2018). Zhang & Shen (2018) and Carmon & Duchi (2018) demonstrated that when $\kappa < +\infty$, the objective value gap between $Q_k h_k$, where $h_k$ is an optimal solution of (20), and an optimal solution of (18) decays exponentially in $k$ at a rate governed by $1/\sqrt{\kappa}$. Carmon & Duchi (2018) further proved that in all cases, including $\kappa = +\infty$, a variant of the Krylov subspace method in which the linear term is perturbed by a random vector outputs an $\epsilon$ optimal solution after $\tilde{O}(1/\sqrt{\epsilon})$ iterations for the SCLS (adapted from their analysis for the TRS). This indeed gives another $\tilde{O}(N/\sqrt{\epsilon})$ time algorithm for solving the SPG-LS up to tolerance $\epsilon$; see also Wang et al. (2021a) for extensions of this idea to the GTRS.
Next we relate the existing convergence results with problem (18).
Proposition 4.1.
Proof.
Note that if , then the first equation in (21) implies that and thus the assumption in the proposition implies . However, this violates the constraint . Therefore we must have and thus . ∎
In fact, we checked the data used in the experiments in Section 5 and found that this condition always holds in the real-world datasets and our synthetic datasets.
4.2 The Riemannian Trust Region Newton (RTRNewton) Method
The feasible set of Problem (18) is the unit sphere $\mathbb{S}^{d} = \{q \in \mathbb{R}^{d+1} : \|q\| = 1\}$. When endowed with the Euclidean metric $\langle \eta, \xi \rangle = \eta^\top \xi$, the unit sphere is a Riemannian manifold (Absil et al., 2008). Therefore, the RTRNewton method proposed in Absil et al. (2007) can be used. The RTRNewton method for Problem (12) is summarized in Algorithm 1.
At the current iterate $q_k$ with trust region radius $\Delta_k$, the subproblem solved in Algorithm 1 is

$$\min_{\eta \in T_{q_k}\mathbb{S}^{d},\ \|\eta\| \le \Delta_k} \ f(q_k) + \langle \operatorname{grad} f(q_k), \eta \rangle + \tfrac{1}{2} \langle \operatorname{Hess} f(q_k)[\eta], \eta \rangle. \qquad (23)$$
Algorithm 1 relies on the notion of Riemannian gradient, Riemannian Hessian, and retraction. We refer to Absil et al. (2008) for their rigorous definitions. Here, we give the Riemannian gradient, the Riemannian Hessian for Problem (18) and the used retraction.
The Riemannian gradient of $f$ at $q \in \mathbb{S}^{d}$ is given by

$$\operatorname{grad} f(q) = P_q \nabla f(q), \qquad (24)$$

where $P_q = I - q q^\top$ denotes the orthogonal projection onto the tangent space $T_q\mathbb{S}^{d}$ at $q$, and $\nabla f(q)$ denotes the Euclidean gradient of $f$ at $q$, i.e., $\nabla f(q) = 2 A^\top (A q - b)$. The action of the Riemannian Hessian of $f$ at $q$ along a tangent vector $\eta \in T_q\mathbb{S}^{d}$ is given by

$$\operatorname{Hess} f(q)[\eta] = P_q\big(\nabla^2 f(q)\, \eta\big) - \big(q^\top \nabla f(q)\big)\, \eta, \qquad (25)$$

where $\nabla^2 f(q) = 2 A^\top A$ denotes the Euclidean Hessian of $f$ at $q$. The retraction that we use is given by

$$R_q(\eta) = \frac{q + \eta}{\|q + \eta\|}, \qquad (26)$$

where $q \in \mathbb{S}^{d}$ and $\eta \in T_q\mathbb{S}^{d}$.
The subproblem (23) is approximately solved by the truncated conjugate gradient method. We use the implementations in ROPTLIB (Huang et al., 2018).
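For completeness, the following sketch spells out what the ingredients (24)–(26) look like for the SCLS objective $f(q) = \|Aq - b\|^2$; these are the only problem-specific callbacks a Riemannian trust region implementation such as RTRNewton in ROPTLIB (or any equivalent solver) needs. The code and names are ours, not excerpts from ROPTLIB.

```python
import numpy as np

def make_scls_callbacks(A, b):
    """Riemannian ingredients on the unit sphere for f(q) = ||A q - b||^2."""

    def cost(q):
        r = A @ q - b
        return r @ r

    def egrad(q):                      # Euclidean gradient: 2 A^T (A q - b)
        return 2.0 * (A.T @ (A @ q - b))

    def ehess(q, eta):                 # Euclidean Hessian action: 2 A^T A eta
        return 2.0 * (A.T @ (A @ eta))

    def proj(q, v):                    # projection onto the tangent space at q
        return v - (q @ v) * q

    def rgrad(q):                      # Riemannian gradient, cf. (24)
        return proj(q, egrad(q))

    def rhess(q, eta):                 # Riemannian Hessian action, cf. (25)
        return proj(q, ehess(q, eta)) - (q @ egrad(q)) * eta

    def retract(q, eta):               # retraction, cf. (26)
        v = q + eta
        return v / np.linalg.norm(v)

    return cost, rgrad, rhess, retract
```

Each callback costs only a constant number of products with $A$ and $A^\top$, so, as in the Krylov subspace method, the per-iteration cost is proportional to the number of nonzero entries of $A$.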
The global convergence and local superlinear convergence rate of RTRNewton have been established by Theorem 7.4.4 and Theorem 7.4.11 of Absil et al. (2008). We state the results in the theorem below.
Theorem 4.2.
Let $\{q_k\}$ be a sequence of iterates generated by Algorithm 1. It follows that $\lim_{k \to \infty} \|\operatorname{grad} f(q_k)\| = 0$.
Suppose $q^*$ is a nondegenerate local minimizer of $f$, i.e., $\operatorname{grad} f(q^*) = 0$ and $\operatorname{Hess} f(q^*)$ is positive definite. Then there exists $c > 0$ such that, for every sequence $\{q_k\}$ generated by Algorithm 1 converging to $q^*$, there exists $K > 0$ such that for all $k > K$,

$$\operatorname{dist}(q_{k+1}, q^*) \le c\, \operatorname{dist}(q_k, q^*)^2. \qquad (27)$$
The proof of the theorem is deferred to Appendix A. A global convergence rate has also been established, where the iteration complexity for driving the norm of the Riemannian gradient below $\epsilon$ is $O(1/\epsilon^2)$. We refer interested readers to Theorem 3.9 of Boumal et al. (2019).
4.3 Time complexity comparisons
In this section, we give theoretical worst case time complexities of the different methods for solving the SPG-LS. First, we point out that RTRNewton is not guaranteed to converge to the global minimum of the SPG-LS, and in the worst case it needs to solve many trust region subproblems. This means its worst case time complexity is much worse than that of the Krylov subspace method (Carmon & Duchi, 2018) studied in Section 4.1.
Next we compare the time complexity of the Krylov subspace method and the SOCP method. When the data matrix is sparse, the time complexity of the Krylov subspace method is $\tilde{O}(N/\sqrt{\epsilon})$, where $N$ is the number of nonzero entries in the data matrix. Here, we use the fact that the cost of each matrix-vector product in the Krylov subspace method is $O(N)$, as we can compute $Hv = A^\top(Av)$ for any given vector $v$. If the condition number $\kappa$ in (22) is finite, the complexity can be further improved to $\tilde{O}(N \sqrt{\kappa} \log(1/\epsilon))$. In the dense case, $N = nd$, so the complexity is $\tilde{O}(nd/\sqrt{\epsilon})$ and can be improved similarly when $\kappa$ is finite. Next we consider the time complexity of the SOCP method, which consists of the cost of formulating the matrix to be decomposed, the cost of the spectral decomposition, and the cost of the interior point method (IPM) for solving the SOCP. Since the spectral decomposition and the IPM cannot benefit much from data sparsity, we do not distinguish the sparse and dense cases for the SOCP method. In particular, formulating the matrix already costs time polynomial in $n$ and $d$, the spectral decomposition requires a number of flops that grows superquadratically with the matrix order (Demmel et al., 2007), and the number of IPM iterations grows with the problem size and the target accuracy according to Monteiro & Tsuchiya (2000), with each iteration again costing time polynomial in $n$ and $d$. Therefore the worst case complexity of the SOCP method is the sum of these three costs.
Theoretically, it is hard to compare the Krylov subspace method and the SOCP method, as the result depends on $n$, $d$, $N$ and the target accuracy $\epsilon$. In practice, the data matrix is often sparse, i.e., $N \ll nd$, and the spectral decomposition step is usually the most expensive part of the SOCP method. In fact, our experiments show that the spectral decomposition step often needs more time than the IPM for the SOCP. Thus, the Krylov subspace method, which can effectively exploit the data sparsity, is much faster than the SOCP approach, especially for solving large-scale problems.
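The claim that each matrix-vector product costs $O(N)$ rests on never forming $H = A^\top A$ explicitly: one multiplies by $A$ and then by $A^\top$. A minimal illustration with SciPy sparse matrices follows (the sizes and density below are our own choices):

```python
import numpy as np
import scipy.sparse as sp

n, d, density = 15_000, 30_000, 1e-4
A = sp.random(n, d + 1, density=density, format="csr", data_rvs=np.random.randn)
v = np.random.randn(d + 1)

# Never form A.T @ A (which would be (d+1) x (d+1) and possibly dense);
# two sparse matrix-vector products cost O(nnz(A)) instead.
Hv = A.T @ (A @ v)
print(A.nnz, Hv.shape)
```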
5 Experiment Results
In this section, we present numerical results on both synthetic and real-world datasets to verify the superiority of our proposed reformulation in terms of computational cost. We use the nested Lanczos method for the TRS (LTRSR) (Zhang & Shen, 2018) as our implementation of the Krylov subspace (GLTR-type) method, and use the implementation of the Riemannian trust-region Newton (RTRNewton) method (Absil et al., 2007) from the Riemannian Manifold Optimization Library (ROPTLIB) (Huang et al., 2018). Similar to the setting in Wang et al. (2021b), we compare the running time of the above two methods with the SDP and SOCP approaches in Wang et al. (2021b), averaged over 10 trials, to evaluate the performance of our new reformulation. All four methods solve the SDP, SOCP or SCLS reformulations to their default precision, and the solutions to the SPG-LS are recovered accordingly.
All simulations are implemented using MATLAB R2021b on a PC running Windows 10 with an Intel(R) Xeon(R) E5-2650 v4 CPU (2.2GHz) and 64GB RAM. We report the results on two real datasets and six synthetic datasets and defer other results to the supplementary material. In all the following tests, the parameter $\gamma$ is set to 0.1.
5.1 Real-world Dataset
We first compare the four methods on the red wine dataset (Cortez et al., 2009), which consists of 1599 instances, each with 11 features. The output label is a quality score ranging from 0 to 10, where a higher score means that the corresponding wine sample has better quality. Wine producers would like to manipulate the data to fool the learner into predicting a higher score when the raw label is smaller than some threshold. We consider the case where there are two kinds of providers, a modest one and a severe one, and the details of the manipulation are set the same as in Wang et al. (2021b).




We also compare the four methods on the residential building dataset (https://archive.ics.uci.edu/ml/datasets/Residential+Building+Data+Set) from the UCI data repository (Dua & Graff, 2017). The building dataset includes 372 samples with 107 features. Each feature reflects information of a certain aspect such as project dates and physical and financial elements. The output label is the sale price of single-family residential apartments. We consider a scenario where the seller wants to manipulate the price to a higher level. As a buyer, our task is to predict the fair price from the fake data. We still consider two types of sellers, a modest one and a severe one.
The computational time for both datasets is reported in Figures 1 and 2. They show that the LTRSR method outperforms the others, followed by the SOCP and RTRNewton methods. One can also observe that RTRNewton is even more expensive than the SDP approach in Figure 1. The main reason is that in the red wine dataset the number of features is quite small ($d = 11$), so the spectral decomposition step, as well as the iterations of the interior point method, in the SOCP approach are cheap.
We then report the relative errors of the objective values (MSEs) of the SOCP method and our methods in Table 1 for the red wine and residential building datasets. Indeed, all the methods have very high accuracy, as the relative errors are only up to 3.37e-5. More MSE comparisons can be found in Appendix C.
Table 1: Relative errors of the objective values (MSEs) on the red wine and residential building datasets.
Dataset | AVG | MIN | MAX | AVG | MIN | MAX
---|---|---|---|---|---|---
Wine Modest | 6.17E-10 | 3.94E-12 | 4.23E-09 | -8.26E-10 | -4.66E-09 | 3.62E-09 |
Wine Severe | 1.32E-10 | 3.39E-12 | 1.84E-09 | -8.30E-11 | -4.07E-10 | 1.80E-09 |
Build Modest | 1.49E-07 | 1.93E-09 | 5.91E-07 | -2.19E-05 | -3.37E-05 | -1.28E-05 |
Build Severe | 3.02E-08 | 4.02E-10 | 1.25E-07 | -1.96E-06 | -3.06E-06 | -1.14E-06 |
5.2 Synthetic Dataset
From the previous subsection, we also see that SOCP, LTRSR and RTRNewton are much faster than the SDP approach. To have a comprehensive comparison of wall-clock time among SOCP, LTRSR and RTRNewton, we test these methods on synthetic datasets.
5.2.1 Dense Data
n | d | SOCP (eig time) | RTRNew | LTRSR | Ratio
---|---|---|---|---|---
2000 | 1000 | 0.585 (0.064) | 0.743 | 17 | |
4000 | 2000 | 1.957 (0.317) | 2.459 | 11 | |
8000 | 4000 | 10.693 (2.758) | 9.269 | 11 | |
12000 | 6000 | 29.304 (9.444) | 18.824 | 14 | |
16000 | 8000 | 58.561 (21.634) | 40.711 | 15 | |
20000 | 10000 | 114.376 (49.754) | 59.768 | 19 | |
n | d | SOCP (eig time) | RTRNew | LTRSR | Ratio
1000 | 1000 | 0.454 (0.065) | 0.594 | 27 | |
2000 | 2000 | 2.104 (0.325) | 2.600 | 22 | |
4000 | 4000 | 10.795 (2.698) | 6.958 | 23 | |
6000 | 6000 | 28.391 (9.481) | 17.835 | 26 | |
8000 | 8000 | 55.263 (21.555) | 35.510 | 27 | |
10000 | 10000 | 97.383 (40.091) | 58.009 | 32 | |
n | d | SOCP (eig time) | RTRNew | LTRSR | Ratio
500 | 1000 | 0.536 (0.059) | 0.523 | 24 | |
1000 | 2000 | 1.748 (0.309) | 1.432 | 31 | |
2000 | 4000 | 9.928 (2.526) | 6.932 | 40 | |
3000 | 6000 | 27.230 (8.848) | 19.179 | 51 | |
4000 | 8000 | 54.174 (20.258) | 28.953 | 58 | |
5000 | 10000 | 94.548 (37.771) | 52.756 | 61 |
We first conduct experiments on dense synthetic datasets. To better validate the effectiveness of our proposed reformulation, we use the same artificial dataset as in Wang et al. (2021b), which employs the make_regression function in scikit-learn (Pedregosa et al., 2011) with the noise set to 0.1 and other parameters left at their defaults.
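For concreteness, the dense data can be generated as follows (a sketch: only the noise level 0.1 and the use of make_regression are stated above; the random seed and the construction of the provider's target labels z are our own placeholder choices):

```python
from sklearn.datasets import make_regression

def generate_dense_data(n, d, seed=0):
    # X, y from scikit-learn's make_regression with noise 0.1 and default settings.
    X, y = make_regression(n_samples=n, n_features=d, noise=0.1, random_state=seed)
    # The provider's desired labels z are not specified here; as a placeholder we
    # shift the true labels upward (hypothetical choice, not from the paper).
    z = y + 10.0
    return X, y, z

X, y, z = generate_dense_data(n=2000, d=1000)
```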
Table 2 summarises the comparison of running times at different scales. Here, “SOCP” represents the total time needed for the SOCP approach (including “eig time”), “eig time” represents the spectral decomposition time in the SOCP approach, “RTRNew” represents the RTRNewton method, “LTRSR” represents the LTRSR method, and “Ratio” represents the ratio of the running times of the SOCP method and LTRSR.
From Table 2, we find that in the large-scale setting the two methods RTRNewton and LTRSR are more efficient, and LTRSR is substantially faster than the other two methods. From the “Ratio” values we also see that the time cost of the SOCP approach is several tens of times that of LTRSR. We also observe that the spectral decomposition time in formulating the SOCP is expensive and takes about 40% of the total time. Indeed, the spectral decomposition time becomes much larger as the feature dimension increases, which is also evidenced by our experiments in the sparse data setting in Table 3 below.
5.2.2 Sparse Data
To further show the efficacy of our proposed reformulation, we conduct experiments on synthetic data with high feature dimension and various sparsity levels. We apply the sprandn function in MATLAB to obtain the data matrix $X$, whose $i$-th row is the input vector $x_i^\top$. The noise measurements are drawn i.i.d. from a uniform distribution, and the output labels are generated from a linear model plus this noise. Following Wang et al. (2021b), we generate the fake output labels from the true labels.
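A Python analogue of this sprandn-based generation is sketched below; the ground-truth model, the noise range, and the shift used for the fake labels z are our own assumptions, since the corresponding formulas are not reproduced above:

```python
import numpy as np
import scipy.sparse as sp

def generate_sparse_data(n, d, density, seed=0):
    rng = np.random.default_rng(seed)
    # Analogue of MATLAB's sprandn: sparse matrix with standard normal nonzeros.
    X = sp.random(n, d, density=density, format="csr",
                  random_state=seed, data_rvs=rng.standard_normal)
    w_true = rng.standard_normal(d)            # hypothetical ground-truth model
    noise = rng.uniform(-0.1, 0.1, size=n)     # i.i.d. uniform noise (range assumed)
    y = X @ w_true + noise                     # true labels
    z = y + 10.0                               # fake labels (shift is assumed)
    return X, y, z

X, y, z = generate_sparse_data(n=5000, d=10000, density=1e-3)
```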
sparsity = 0.01 | |||||
---|---|---|---|---|---|
n | d | SOCP (eig time) | RTRNew | LTRSR | Ratio
5000 | 10000 | 71.601 (39.432) | 13.124 | 318 | |
7500 | 15000 | 217.529 (120.456) | 26.551 | 407 | |
10000 | 20000 | 513.751 (288.490) | 47.411 | 490 | |
12500 | 25000 | 941.394 (539.619) | 69.421 | 586 | |
15000 | 30000 | 1539.443 (865.813) | 113.223 | 637 | |
sparsity = 0.001 | |||||
n | d | SOCP (eig time) | RTRNew | LTRSR | Ratio
5000 | 10000 | 61.587 (45.253) | 1.416 | 2200 | |
7500 | 15000 | 153.075 (117.389) | 2.379 | 2888 | |
10000 | 20000 | 335.956 (259.671) | 5.453 | 2973 | |
12500 | 25000 | 638.175 (491.391) | 7.715 | 3799 | |
15000 | 30000 | 1082.261 (832.413) | 12.090 | 4605 | |
sparsity = 0.0001 | |||||
n | d | SOCP (eig time) | RTRNew | LTRSR | Ratio
5000 | 10000 | 49.869 (45.462) | 0.391 | 5541 | |
7500 | 15000 | 141.507 (134.525) | 0.716 | 10108 | |
10000 | 20000 | 310.991 (289.447) | 0.979 | 15550 | |
12500 | 25000 | 587.124 (540.314) | 1.301 | 19571 | |
15000 | 30000 | 991.070 (912.015) | 2.171 | 26786 |
Table 3 summarises the time comparisons on synthetic datasets with different sparsity levels and dimensions. From these tables, we find that LTRSR and RTRNewton perform much better than in the dense case, and their superiority over the SOCP approach becomes larger. This is mainly because both methods are matrix factorization free and require only matrix-vector products in each iteration, while the SOCP does not benefit from sparsity as much as the other two methods. By comparing the “eig time” for different instances with the same dimension but different sparsity, we find that the “eig time” dominates the time of the SOCP approach, as the spectral decomposition cannot utilize the sparsity of the data. From the “Ratio” values, we find that the advantage of LTRSR grows considerably as the data become sparser and the problem size increases. In the case of $d = 30000$ and sparsity = 0.0001, LTRSR is up to 26,000+ times faster than the SOCP approach. Moreover, our LTRSR takes less than 0.05 second for all the instances with sparsity = 0.0001. Relative errors of all methods are reported in Appendix C.
6 Conclusion
We propose an SCLS reformulation for the SPG-LS and show that an optimal solution of the SPG-LS can be recovered from an optimal solution of the SCLS. Moreover, an $\epsilon$ optimal solution can be obtained in runtime $\tilde{O}(N/\sqrt{\epsilon})$. We also introduce two practically efficient methods, LTRSR and RTRNewton, for solving the SCLS. Experiments show that the SCLS approach is much faster than the best existing approach. In particular, LTRSR dominates both the RTRNewton and SOCP methods.
Acknowledgements
Wen Huang is partly supported by the Fundamental Research Funds for the Central Universities 20720190060 and NSFC 12001455. Rujun Jiang is partly supported by NSFC 12171100 and 72161160340, and Natural Science Foundation of Shanghai 22ZR1405100. Xudong Li is partly supported by the “Chenguang Program” by Shanghai Education Development Foundation and Shanghai Municipal Education Commission 19CG02, and the Shanghai Science and Technology Program 21JC1400600.
References
- Absil et al. (2007) Absil, P.-A., Baker, C. G., and Gallivan, K. A. Trust-region methods on riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
- Absil et al. (2008) Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ, 2008. ISBN 9780691132983.
- Bishop et al. (2020) Bishop, N., Tran-Thanh, L., and Gerding, E. Optimal learning from verified training data. In Advances in Neural Information Processing Systems 33, pp. 9520–9529. NeurIPS, 2020.
- Boumal et al. (2019) Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
- Brückner & Scheffer (2011) Brückner, M. and Scheffer, T. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 547–555, 2011.
- Carmon & Duchi (2018) Carmon, Y. and Duchi, J. C. Analysis of krylov subspace solutions of regularized nonconvex quadratic problems. In Advances in Neural Information Processing Systems, pp. 10728–10738, 2018.
- Conn et al. (2000) Conn, A. R., Gould, N. I., and Toint, P. L. Trust region methods. SIAM, 2000.
- Cortez et al. (2009) Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4):547–553, 2009.
- Demmel et al. (2007) Demmel, J., Dumitriu, I., and Holtz, O. Fast linear algebra is stable. Numerische Mathematik, 108(1):59–91, 2007.
- Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Gould et al. (1999) Gould, N. I., Lucidi, S., Roma, M., and Toint, P. L. Solving the trust-region subproblem using the lanczos method. SIAM Journal on Optimization, 9(2):504–525, 1999.
- Hazan & Koren (2016) Hazan, E. and Koren, T. A linear-time algorithm for trust region problems. Mathematical Programming, 158(1):363–381, 2016.
- Ho-Nguyen & Kilinc-Karzan (2017) Ho-Nguyen, N. and Kilinc-Karzan, F. A second-order cone based approach for solving the trust-region subproblem and its variants. SIAM Journal on Optimization, 27(3):1485–1512, 2017.
- Huang et al. (2018) Huang, W., Absil, P.-A., Gallivan, K. A., and Hand, P. Roptlib: an object-oriented c++ library for optimization on riemannian manifolds. ACM Transactions on Mathematical Software (TOMS), 44(4):1–21, 2018.
- Jeroslow (1985) Jeroslow, R. G. The polynomial hierarchy and a simple model for competitive analysis. Mathematical programming, 32(2):146–164, 1985.
- Jiang & Li (2019) Jiang, R. and Li, D. Novel reformulations and efficient algorithms for the generalized trust region subproblem. SIAM Journal on Optimization, 29(2):1603–1633, 2019.
- Jiang & Li (2020) Jiang, R. and Li, D. A linear-time algorithm for generalized trust region subproblems. SIAM Journal on Optimization, 30(1):915–932, 2020.
- Monteiro & Tsuchiya (2000) Monteiro, R. D. and Tsuchiya, T. Polynomial convergence of primal-dual algorithms for the second-order cone program based on the mz-family of directions. Mathematical programming, 88(1):61–83, 2000.
- Moré (1993) Moré, J. J. Generalizations of the trust region problem. Optimization methods and Software, 2(3-4):189–209, 1993.
- Moré & Sorensen (1983) Moré, J. J. and Sorensen, D. C. Computing a trust region step. SIAM Journal on scientific and statistical computing, 4(3):553–572, 1983.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.
- Pong & Wolkowicz (2014) Pong, T. K. and Wolkowicz, H. The generalized trust region subproblem. Computational optimization and applications, 58(2):273–322, 2014.
- Sherman & Morrison (1950) Sherman, J. and Morrison, W. J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
- Shokri et al. (2012) Shokri, R., Theodorakopoulos, G., Troncoso, C., Hubaux, J.-P., and Le Boudec, J.-Y. Protecting location privacy: optimal strategy against localization attacks. In Proceedings of the 2012 ACM conference on Computer and communications security, pp. 617–627, 2012.
- Wahab et al. (2016) Wahab, O. A., Bentahar, J., Otrok, H., and Mourad, A. A stackelberg game for distributed formation of business-driven services communities. Expert Systems with Applications, 45:359–372, 2016.
- Wang & Kılınç-Karzan (2020) Wang, A. L. and Kılınç-Karzan, F. The generalized trust region subproblem: solution complexity and convex hull results. Mathematical Programming, pp. 1–42, 2020.
- Wang et al. (2021a) Wang, A. L., Lu, Y., and Kilinc-Karzan, F. Implicit regularity and linear convergence rates for the generalized trust-region subproblem. arXiv preprint arXiv:2112.13821, 2021a.
- Wang et al. (2021b) Wang, J., Chen, H., Jiang, R., Li, X., and Li, Z. Fast algorithms for stackelberg prediction game with least squares loss. In Proceedings of the 38th International Conference on Machine Learning, pp. 10708–10716. PMLR, 2021b.
- Zhang & Shen (2018) Zhang, L.-H. and Shen, C. A nested lanczos method for the trust-region subproblem. SIAM Journal on Scientific Computing, 40(4):A2005–A2032, 2018.
- Zhang et al. (2017) Zhang, L.-H., Shen, C., and Li, R.-C. On the generalized lanczos trust-region method. SIAM Journal on Optimization, 27(3):2110–2142, 2017.
- Zhang et al. (2018) Zhang, L.-H., Yang, W. H., Shen, C., and Feng, J. Error bounds of the lanczos approach for the trust-region subproblem. Frontiers of Mathematics in China, 13(2):459–481, 2018.
- Zhou & Kantarcioglu (2016) Zhou, Y. and Kantarcioglu, M. Modeling adversarial learning as nested stackelberg games. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 350–362. Springer, 2016.
- Zhou et al. (2019) Zhou, Y., Kantarcioglu, M., and Xi, B. A survey of game theoretic approach for adversarial machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3):e1259, 2019.
Appendix
Appendix A Deferred proof for Theorem 4.2
Proof.
We only need to verify the assumptions of Theorem 7.4.4 and Theorem 7.4.11 of Absil et al. (2008).
By definition, the function $f$ is bounded below. Since the unit sphere is a compact manifold, we have that the norm of the Riemannian Hessian is bounded above for any $q \in \mathbb{S}^{d}$, that $f$ is radially Lipschitz continuously differentiable, and that $f$ is Lipschitz continuously differentiable, by Proposition 7.4.5 and Corollary 7.4.6 of Absil et al. (2008). The implementation in ROPTLIB for the subproblem (23) is Algorithm 11 of Absil et al. (2008). Therefore, the Cauchy decrease inequality is satisfied. It follows that all the assumptions of Theorem 7.4.4 are satisfied.
Since Retraction (26) is second order, the assumption (7.36) in Theorem 7.4.11 of Absil et al. (2008) holds. Since the function $f$ is smooth and the manifold is compact, the assumption (7.37) in Theorem 7.4.11 of Absil et al. (2008) holds. The local quadratic convergence rate in (27) thus follows. ∎
Appendix B Additional Experiments
In this section, we provide additional numerical results to further show the efficiency of our proposed reformulation.
1 Real-world Dataset
We demonstrate the speed of our methods on two other real-world datasets, the insurance dataset (https://www.kaggle.com/mirichoi0218/insurance) and the blogfeedback dataset (https://archive.ics.uci.edu/ml/datasets/BlogFeedback). Similar to the setting in the previous sections, we compare the wall-clock time of the RTRNewton (Absil et al., 2007) and LTRSR (Zhang & Shen, 2018) approaches with the SDP and SOCP methods proposed in Wang et al. (2021b).
We first apply the four methods to the insurance dataset, which has 1,338 instances with 7 features. Each feature records information such as age, sex, bmi and region. The output labels are the individual medical costs billed by health insurance. For model accuracy, we transform categorical features such as sex into one-hot vectors. We assume that the individuals are inclined to modify their self-reported data to reduce their insurance costs; the individuals' desired outcomes are defined as in Wang et al. (2021b). We consider two types of individuals, a modest one and a severe one, and all the hyperparameters are the same as those in Wang et al. (2021b). As an insurer, our goal is to select a good price model to predict the insurance costs as accurately as possible.
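The one-hot preprocessing mentioned above can be done, for instance, as follows (a sketch with hypothetical file and column names; the paper does not specify the exact encoding pipeline):

```python
import pandas as pd

# Hypothetical file and column names for the insurance data; 'charges' is the label.
df = pd.read_csv("insurance.csv")
X = pd.get_dummies(df.drop(columns=["charges"]),
                   columns=["sex", "smoker", "region"], dtype=float)
y = df["charges"].to_numpy()
```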
To further illustrate the effectiveness of our new reformulation, we compare the four methods on the blogfeedback dataset. The blog dataset contains 52,397 samples, each with 281 features processed from raw feedback materials on the Internet. Each feature represents information of a certain aspect, and the output label is the number of comments. As a learner, our task is to predict the future number of comments of a blog post. Similarly, we assume that the true output labels would be modified by the data providers in order to increase the blog comments; for example, public media may manipulate the data to inflate the number of comments and enhance their news popularity. Formally, the altered labels are defined as in Wang et al. (2021b), and we still have two types of data providers, a modest one and a severe one.
The wall-clock time comparison can be found in Figure 3. Similar to the previous cases, LTRSR outperforms the other approaches. However, as the feature dimension in these problems is small, the relative performance of the other three methods does not match their behavior on large-scale problems.




2 Synthetic Dataset
To further demonstrate the efficiency and adaptiveness to high dimension problems of our proposed reformulation, we conduct experiments on more synthetic datasets with different hyperparameters and various dimensions and sparsity.
2.1 Dense Data
We first perform experiments on dense synthetic datasets to show the superiority of our reformulation under different hyperparameter settings. All the other settings are the same as in Section 5.2.1.
Table 4 shows the comparison of wall-clock time at different scales, which is similar to the case reported in Section 5.2.1. We observe that both SCLS approaches are faster than the SOCP approach. Moreover, all LTRSR running times are below 6 seconds, while the SOCP can take up to about 110 seconds.
SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1000 | 0.619 (0.087) | 0.712 | 20 | 0.564 (0.086) | 0.704 | 28 | 0.466 (0.078) | 0.649 | 39 | |||
2000 | 2.244 (0.419) | 2.519 | 16 | 1.900 (0.471) | 1.828 | 19 | 2.438 (0.401) | 2.092 | 41 | |||
4000 | 12.123 (3.448) | 5.179 | 13 | 11.597 (3.539) | 6.789 | 23 | 11.645 (3.212) | 6.778 | 47 | |||
6000 | 33.093 (11.903) | 15.857 | 16 | 31.262 (11.691) | 18.617 | 30 | 32.812 (11.008) | 11.070 | 58 | |||
8000 | 66.816 (27.466) | 38.501 | 18 | 63.983 (27.512) | 34.655 | 32 | 59.725 (25.061) | 27.201 | 62 | |||
10000 | 118.044 (49.477) | 59.551 | 20 | 109.516 (50.018) | 54.060 | 36 | 104.251 (47.261) | 39.611 | 68 |
2.2 Sparse Data
sparsity = 0.01 | sparsity = 0.001 | sparsity = 0.0001 | ||||||||||
SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | |
10000 | 84.547 (40.602) | 21.947 | 174 | 56.101 (40.322) | 1.892 | 920 | 41.854 (38.038) | 0.499 | 3488 | |||
15000 | 254.85 (126.365) | 50.797 | 223 | 160.486 (125.064) | 5.166 | 1225 | 130.071 (118.582) | 0.540 | 7226 | |||
20000 | 550.191 (281.018) | 107.358 | 264 | 355.746 (278.687) | 10.932 | 1603 | 299.496 (265.987) | 1.075 | 10327 | |||
25000 | 1090.231 (581.289) | 159.592 | 311 | 728.118 (560.429) | 15.543 | 2040 | 591.172 (509.912) | 1.285 | 15557 | |||
30000 | 1726.133 (963.081) | 248.325 | 320 | 1240.319 (979.66) | 19.947 | 2349 | 1040.441 (888.321) | 2.323 | 18579 | |||
sparsity = 0.01 | sparsity = 0.001 | sparsity = 0.0001 | ||||||||||
SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | |
10000 | 108.861 (48.901) | 50.508 | 99 | 64.521 (48.604) | 5.765 | 542 | 45.599 (39.977) | 0.326 | 2850 | |||
15000 | 316.424 (151.406) | 117.217 | 121 | 173.231 (132.098) | 12.230 | 674 | 148.608 (127.759) | 0.947 | 5307 | |||
20000 | 643.464 (324.073) | 219.622 | 127 | 379.145 (289.701) | 23.199 | 830 | 338.119 (283.882) | 1.241 | 7044 | |||
25000 | 1149.663 (605.581) | 395.202 | 147 | 720.837 (561.652) | 43.718 | 994 | 684.654 (563.812) | 2.254 | 11224 | |||
30000 | 2026.088 (1080.883) | 598.468 | 169 | 1217.779 (937.93) | 45.074 | 1107 | 1106.028 (899.099) | 3.869 | 13326 | |||
sparsity = 0.01 | sparsity = 0.001 | sparsity = 0.0001 | ||||||||||
SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | SOCP (eig time) | RTRNew | LTRSR | Ratio | |
10000 | 113.978 (51.388) | 76.183 | 72 | 56.434 (41.871) | 8.557 | 330 | 46.739 (39.697) | 0.446 | 2125 | |||
15000 | 298.498 (143.174) | 200.29 | 76 | 171.902 (129.377) | 22.335 | 434 | 150.666 (123.946) | 0.774 | 4072 | |||
20000 | 596.182 (288.772) | 356.849 | 77 | 417.155 (312.681) | 42.438 | 581 | 339.114 (274.737) | 1.713 | 5559 | |||
25000 | 1152.428 (606.097) | 614.192 | 85 | 782.197 (587.333) | 43.646 | 680 | 641.319 (518.51) | 2.678 | 7545 | |||
30000 | 1949.035 (1039.845) | 948.449 | 90 | 1298.734 (971.861) | 93.219 | 765 | 1085.053 (876.409) | 5.526 | 9118 |
We proceed to perform experiments on large-scale sparse datasets of various scales with different sparsity levels. All the other settings are the same as in Section 5.2.2.
From Table 5, we observe the great superiority of our SCLS reformulation, since both the LTRSR and RTRNewton methods are faster than the SOCP method in all instances. LTRSR is several orders of magnitude faster than the SOCP approach, especially when the sparsity level and the dimension grow. The spectral decomposition time in formulating the SOCP becomes quite expensive as the problem size grows: in the largest cases, the decomposition time alone is up to about 1000 seconds, while the LTRSR method only takes about 22 seconds to solve the problem.
Appendix C Relative Error
This section reports the relative errors of the function values and MSEs between the SOCP method and our methods for all our experiments in Tables 6, 7, 8 and 9.
1 Real-world Dataset
In Tables 6 and 7, we show the relative errors of the MSEs on the training sets and test sets, respectively. Here, “Insur” stands for the insurance dataset, “Build” for the building dataset, “f” for the function value of the related method and “M” for its MSE. In Table 6, we observe that LTRSR is more accurate than SOCP, as its relative error remains positive. Indeed, all the methods have very high accuracy, as the relative errors are only up to 3.37e-5. Table 7 shows that all methods have similar test accuracy, as the relative errors of the test MSEs of all the methods are up to 5.27e-4.
Dataset | Dataset | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | ||
Wine Modest | 6.17E-10 | 3.94E-12 | 4.23E-09 | -8.26E-10 | -4.66E-09 | 3.62E-09 | Insur Modest | 1.01E-05 | 1.40E-06 | 3.31E-05 | 1.01E-05 | 1.40E-06 | 3.31E-05 |
Wine Severe | 1.32E-10 | 3.39E-12 | 1.84E-09 | -8.30E-11 | -4.07E-10 | 1.80E-09 | Insur Severe | 2.57E-06 | 4.90E-07 | 9.56E-06 | 2.57E-06 | 4.90E-07 | 9.56E-06 |
Build Modest | 1.49E-07 | 1.93E-09 | 5.91E-07 | -2.19E-05 | -3.37E-05 | -1.28E-05 | Blog Modest | 8.07E-09 | 3.33E-10 | 3.82E-08 | 8.07E-09 | 3.33E-10 | 3.82E-08 |
Build Severe | 3.02E-08 | 4.02E-10 | 1.25E-07 | -1.96E-06 | -3.06E-06 | -1.14E-06 | Blog Severe | 3.55E-08 | 2.80E-10 | 2.24E-07 | 3.55E-08 | 2.80E-10 | 2.24E-07 |
Dataset | Dataset | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | ||
Wine Modest | 1.99E-07 | -9.55E-07 | 4.31E-06 | 1.74E-07 | -9.54E-07 | 3.99E-06 | Insur Modest | 1.05E-05 | 7.69E-07 | 4.21E-05 | 1.02E-05 | -7.09E-07 | 4.28E-05 |
Wine Severe | 2.03E-07 | -1.79E-07 | 2.57E-06 | 2.64E-07 | -4.55E-07 | 2.79E-06 | Insur Severe | 2.43E-06 | -3.36E-07 | 1.01E-05 | 2.45E-06 | -3.27E-07 | 1.01E-05 |
Build Modest | -4.52E-06 | -1.25E-04 | 1.23E-04 | -1.25E-04 | -5.27E-04 | 5.01E-04 | Blog Modest | -2.59E-09 | -5.66E-07 | 2.99E-07 | 1.92E-09 | -5.62E-07 | 3.03E-07 |
Build Severe | 2.65E-07 | -2.63E-05 | 2.54E-05 | 8.27E-06 | -8.02E-05 | 1.47E-04 | Blog Severe | -2.06E-08 | -8.12E-07 | 3.24E-07 | -5.02E-09 | -7.81E-07 | 3.06E-07 |
2 Synthetic Dataset
2.1 Dense Data
Table 8 summarises the relative errors of the objective values on the dense synthetic datasets. As for the real-world datasets, we find that all the methods have high accuracy, as the relative errors of the MSEs are up to 4.60e-5.
AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.01E-09 | 9.58E-11 | 3.41E-09 | 1.01E-09 | 9.58E-11 | 3.41E-09 | 1.51E-05 | 5.39E-08 | 4.60E-05 | 1.51E-05 | 5.39E-08 | 4.60E-05 | ||
2.26E-08 | 3.37E-11 | 1.34E-07 | 2.26E-08 | 3.36E-11 | 1.34E-07 | 2.11E-06 | 3.26E-07 | 7.40E-06 | 2.11E-06 | 3.26E-07 | 7.40E-06 | ||
8.27E-09 | 2.54E-12 | 4.82E-08 | 8.27E-09 | 2.05E-12 | 4.82E-08 | 2.21E-06 | 3.44E-07 | 6.87E-06 | 2.21E-06 | 3.44E-07 | 6.87E-06 |
2.2 Sparse Data
Table 9 summarises the relative errors of the objective values on the synthetic datasets with different sparsity levels. Similar to the previous cases, both of our methods have high accuracy in terms of MSEs. These consistent results further confirm the validity of our SCLS reformulation.
sparsity = 0.0001 | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | ||
1.24E-09 | 1.48E-11 | 5.58E-09 | 1.24E-09 | 1.45E-11 | 5.58E-09 | 3.98E-09 | 5.27E-11 | 1.19E-08 | 3.97E-09 | 3.19E-11 | 1.19E-08 | ||
7.49E-11 | 3.53E-13 | 1.55E-10 | 7.47E-11 | 3.20E-13 | 1.54E-10 | 2.25E-09 | 2.62E-12 | 4.88E-09 | 2.24E-09 | -3.51E-12 | 4.87E-09 | ||
9.10E-10 | 9.93E-12 | 2.01E-09 | 9.10E-10 | 9.73E-12 | 2.01E-09 | 7.95E-09 | 1.83E-10 | 2.34E-08 | 7.94E-09 | 1.16E-10 | 2.34E-08 | ||
3.06E-10 | 3.81E-12 | 6.60E-10 | 3.06E-10 | 3.70E-12 | 6.60E-10 | 3.07E-09 | 7.96E-13 | 6.28E-09 | 3.06E-09 | -4.52E-12 | 6.28E-09 | ||
sparsity = 0.001 | |||||||||||||
AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | ||
1.36E-09 | 5.73E-11 | 2.58E-09 | 1.36E-09 | 5.64E-11 | 2.58E-09 | 1.92E-08 | 5.04E-10 | 8.89E-08 | 1.92E-08 | 5.04E-10 | 8.89E-08 | ||
1.66E-09 | 6.00E-11 | 5.06E-09 | 1.66E-09 | 5.91E-11 | 5.06E-09 | 5.14E-08 | 9.80E-12 | 2.05E-07 | 5.14E-08 | 9.74E-12 | 2.05E-07 | ||
5.76E-10 | 1.53E-09 | 1.13E-12 | 6.19E-09 | 1.53E-09 | -2.03E-13 | 1.96E-08 | 1.14E-10 | 4.83E-08 | 1.96E-08 | 1.14E-10 | 4.83E-08 | ||
1.90E-09 | 1.02E-10 | 6.68E-09 | 1.90E-09 | 1.01E-10 | 6.68E-09 | 4.42E-08 | 4.91E-09 | 1.34E-07 | 4.42E-08 | 4.91E-09 | 1.34E-07 | ||
sparsity = 0.01 | |||||||||||||
AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | AVG | MIN | MAX | ||
2.19E-08 | 2.64E-11 | 1.70E-08 | 2.19E-08 | 2.63E-11 | 1.70E-08 | 1.05E-07 | 2.15E-09 | 2.51E-07 | 1.05E-07 | 2.15E-09 | 2.51E-07 | ||
8.23E-08 | 8.23E-08 | 3.74E-07 | 1.02E-07 | 4.80E-11 | 3.74E-07 | 6.94E-07 | 3.99E-10 | 3.10E-06 | 6.94E-07 | 3.99E-10 | 3.10E-06 | ||
1.27E-08 | 9.14E-10 | 3.37E-08 | 1.27E-08 | 9.14E-10 | 3.37E-08 | 2.21E-06 | 7.78E-09 | 1.00E-05 | 2.21E-06 | 7.78E-09 | 1.00E-05 | ||
1.90E-08 | 3.52E-12 | 4.50E-08 | 1.90E-08 | 3.45E-12 | 4.50E-08 | 6.52E-06 | 6.40E-08 | 2.95E-05 | 6.52E-06 | 6.40E-08 | 2.95E-05 |