
On Non-Negative Quadratic Programming in Geometric Optimization

Siu Wing Cheng (corresponding author, e-mail: [email protected], ORCID: 0000-0002-3557-9935) and Man Ting Wong (e-mail: [email protected], ORCID: 0000-0002-5682-0003). These authors contributed equally to this work.

Department of Computer Science and Engineering,
The Hong Kong University of Science and Technology,
Hong Kong, China.

Abstract

We study a method to solve several non-negative quadratic programming problems in geometric optimization by applying a numerical solver iteratively. We implemented the method by using quadprog of MATLAB as a subroutine. In comparison with a single call of quadprog, we obtain a 10-fold speedup for two proximity graph problems in $\mathbb{R}^d$ on some public data sets, and a 2-fold or more speedup for deblurring some gray-scale space and thermal images via non-negative least squares. In the image deblurring experiments, the iterative method compares favorably with other software that can solve non-negative least squares, including FISTA with backtracking, SBB, FNNLS, and lsqnonneg of MATLAB. We try to explain the observed efficiency by proving that, under a certain assumption, the number of iterations needed to reduce the gap between the current solution value and the optimum by a factor $e$ is roughly bounded by the square root of the number of variables. We checked this assumption in our experiments on proximity graphs and confirmed that it holds.

Keywords: Convex quadratic programming, Geometric optimization, Active set methods, Non-negative solution.

1 Introduction

We study a method to solve several non-negative convex quadratic programming problems (NNQ problems): minimize $\mathtt{x}^t\mathtt{A}^t\mathtt{A}\mathtt{x}+\mathtt{a}^t\mathtt{x}$ subject to $\mathtt{B}\mathtt{x}\geq\mathtt{b}$ and $\mathtt{x}\geq 0_\nu$. It applies a numerical solver on constrained subproblems iteratively until the optimal solution is obtained. Given a vector $\mathtt{x}$, we use $(\mathtt{x})_i$ to denote its $i$-th coordinate. We use $1_\nu$ and $0_\nu$ to denote the $\nu$-dimensional vectors that consist of 1s and 0s, respectively.

The first problem is fitting a proximity graph to data points in $\mathbb{R}^d$, which finds applications in classification, regression, and clustering [1, 2, 3, 4]. Daitch et al. [1] defined a proximity graph via solving an NNQ problem. The unknown is a vector $\mathtt{x}\in\mathbb{R}^{n(n-1)/2}$, specifying the edge weights in the complete graph on the input points $\mathtt{p}_1,\ldots,\mathtt{p}_n$. Let $(\mathtt{x})_{i\vartriangle j}$ denote the coordinate of $\mathtt{x}$ that stores the weight of the edge $\mathtt{p}_i\mathtt{p}_j$. Edge weights must be non-negative, and each point must have a total weight of at least 1 over its incident edges; these constraints can be modelled as $\mathtt{B}\mathtt{x}\geq 1_n$ and $\mathtt{x}\geq 0_{n(n-1)/2}$, where $\mathtt{B}\in\mathbb{R}^{n\times n(n-1)/2}$ is the incidence matrix of the complete graph, and $1_n$ is the $n$-dimensional vector with all coordinates equal to 1. The objective is to minimize $\sum_{i=1}^n\bigl\|\sum_{j\in[n]\setminus\{i\}}(\mathtt{x})_{i\vartriangle j}(\mathtt{p}_i-\mathtt{p}_j)\bigr\|^2$, which can be written as $\mathtt{x}^t\mathtt{A}^t\mathtt{A}\mathtt{x}$ for some matrix $\mathtt{A}\in\mathbb{R}^{dn\times n(n-1)/2}$. We call this the DKSG problem. The one-dimensional case can be solved in $O(n^2)$ time [5].
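As a concrete illustration of this formulation (not the implementation used in our experiments), the following Python sketch assembles dense versions of $\mathtt{A}$ and $\mathtt{B}$ from the input points; the function name and the use of dense NumPy arrays are our own choices for the example.

```python
import numpy as np

def dksg_matrices(P):
    """Assemble A and B of the DKSG problem from an n x d array of points P.

    The column of A for the edge {i, j} holds p_i - p_j in the block of rows
    of point i and p_j - p_i in the block of point j, so that
    ||A x||^2 = sum_i || sum_{j != i} x_{i,j} (p_i - p_j) ||^2.
    B is the n x n(n-1)/2 incidence matrix of the complete graph, used in
    the constraint B x >= 1_n.
    """
    n, d = P.shape
    m = n * (n - 1) // 2
    A = np.zeros((d * n, m))
    B = np.zeros((n, m))
    col = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = P[i] - P[j]
            A[d * i:d * (i + 1), col] = diff
            A[d * j:d * (j + 1), col] = -diff
            B[i, col] = 1.0
            B[j, col] = 1.0
            col += 1
    return A, B
```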

We also study another proximity graph in $\mathbb{R}^d$ defined by Zhang et al. [4] via minimizing $\frac{1}{d}\mathtt{b}^t\mathtt{x}+\frac{\mu}{2}\|\mathtt{U}\mathtt{x}-1_n\|^2+\frac{\rho}{2}\|\mathtt{x}\|^2$ for some constants $\mu,\rho\geq 0$. The vector $\mathtt{x}\in\mathbb{R}^{n(n-1)/2}$ stores the unknown edge weights. For all distinct $i,j\in[n]$, the coordinate $(\mathtt{b})_{i\vartriangle j}$ of $\mathtt{b}$ is equal to $\|\mathtt{p}_i-\mathtt{p}_j\|^2$. The matrix $\mathtt{U}$ is the incidence matrix of the complete graph. The objective function can be written, up to an additive constant, as $\mathtt{x}^t\mathtt{A}^t\mathtt{A}\mathtt{x}+\mathtt{a}^t\mathtt{x}$, where $\mathtt{A}^t=\bigl[\sqrt{\mu/2}\,\mathtt{U}^t\;\;\sqrt{\rho/2}\,\mathtt{I}_{n(n-1)/2}\bigr]$ and $\mathtt{a}=\frac{1}{d}\mathtt{b}-\mu\mathtt{U}^t 1_n$. The only constraints are $\mathtt{x}\geq 0_{n(n-1)/2}$. We call this the ZHLG problem.
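The quadratic and linear parts of the ZHLG objective can be assembled in the same spirit. The sketch below (our own illustration, with dense arrays and the parameter values used later in Section 3) returns a Hessian $\mathtt{H}$ and a linear term $\mathtt{f}$ such that minimizing $\frac{1}{2}\mathtt{x}^t\mathtt{H}\mathtt{x}+\mathtt{f}^t\mathtt{x}$ over $\mathtt{x}\geq 0$ is equivalent to the ZHLG objective up to an additive constant.

```python
import numpy as np

def zhlg_objective(P, mu=16.0, rho=2.0):
    """Return (H, f) with H = mu U^t U + rho I and f = b/d - mu U^t 1_n,
    where U is the incidence matrix of the complete graph on the points in P
    and (b)_{i,j} = ||p_i - p_j||^2.  Minimizing (1/2) x^t H x + f^t x over
    x >= 0 matches (1/d) b^t x + (mu/2)||U x - 1_n||^2 + (rho/2)||x||^2
    up to an additive constant.
    """
    n, d = P.shape
    m = n * (n - 1) // 2
    U = np.zeros((n, m))
    b = np.zeros(m)
    col = 0
    for i in range(n):
        for j in range(i + 1, n):
            U[i, col] = 1.0
            U[j, col] = 1.0
            b[col] = np.sum((P[i] - P[j]) ** 2)
            col += 1
    H = mu * U.T @ U + rho * np.eye(m)   # Hessian of the quadratic part
    f = b / d - mu * U.T @ np.ones(n)    # linear term
    return H, f
```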

The third problem is to deblur some mildly sparse gray-scale images [6, 7]. The pixels in an image form a vector $\mathtt{x}$ of gray-scale values in $\mathbb{R}^d$; the blurring can be modelled by the action of a matrix $\mathtt{A}\in\mathbb{R}^{d\times d}$ that depends on the particular point spread function adopted [7]; and $\mathtt{A}\mathtt{x}$ is the blurred image. Working backward, given the matrix $\mathtt{A}$ and an observed blurred image $\mathtt{b}$, we recover the image by finding the unknown $\mathtt{x}\geq 0_d$ that minimizes $\|\mathtt{A}\mathtt{x}-\mathtt{b}\|^2$, which is a non-negative least squares problem (NNLS) [8, 9]. The problem is often ill-posed, i.e., $\mathtt{A}$ is not invertible.
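For a small self-contained illustration of the NNLS formulation (independent of the blurring matrices used in Section 3.2), one can call an off-the-shelf solver such as scipy.optimize.nnls on a synthetic system; the random matrix below is a stand-in for $\mathtt{A}$, not an actual point spread operator.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
A = rng.random((60, 40))                           # stand-in for the blurring matrix
x_true = np.maximum(rng.standard_normal(40), 0.0)  # sparse non-negative "image"
b = A @ x_true                                     # observed blurred vector

# Recover a non-negative x minimizing ||A x - b||^2.
x_rec, res_norm = nnls(A, b)
print(res_norm, np.all(x_rec >= 0))
```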

1.1 Related work

Random sampling has been applied to select a subset of violated constraints in an iterative scheme for solving linear programming and its generalizations (which include convex quadratic programming) [10, 11, 12, 13, 14]. For linear programming in $\mathbb{R}^d$ with $n$ constraints, Clarkson's Las Vegas method [10] starts with a small working set of constraints and runs in iterations; it draws a random sample of the violated constraints at the beginning of each iteration, adds them to the working set of constraints, and recursively solves the linear programming problem on the expanded working set. This algorithm solves approximately $d$ subproblems in expectation, with $O(d\sqrt{n})$ constraints in each subproblem. In addition, Clarkson [10] provides an iterative reweighting algorithm for linear programming that solves $O(d\log n)$ subproblems in expectation with $O(d^2)$ constraints in each subproblem. A mixed algorithm is proposed in [10] by following the first algorithm but solving every subproblem using the iterative reweighting algorithm. All algorithms run in the RAM model, and the expected running time of the mixed algorithm is $O(d^2 n)+(d^2\log n)O(d)^{d/2+O(1)}+O(d^4\sqrt{n}\log n)$. Subsequent research by Sharir and Welzl [14] and Matoušek et al. [13] improved the expected running time to $O(d^2 n+e^{O(\sqrt{d\ln d})})$ in the RAM model, which is subexponential in $d$. Brise and Gärtner [15] showed that Clarkson's algorithm works for the larger class of violator spaces [16].

Quadratic programming problems are convex when the Hessian is positive (semi-)definite. When the Hessian is indefinite, the problem becomes non-convex and NP-hard. For non-convex quadratic programming, the goal is to find a local minimum instead of the global minimum.

When all input numbers are rational, one can work outside the RAM model to obtain polynomial-time algorithms for convex quadratic programming. Kozlov et al. [17] described a polynomial-time algorithm for convex quadratic programming that is based on the ellipsoid method [18, 19]. Kapoor and Vaidya [20] showed that convex quadratic programming with $n$ constraints and $m$ variables can be solved in $O((n+m)^{3.67}(\log L)(\log(n+m))L)$ arithmetic operations, where $L$ is bounded by the number of bits in the input. Monteiro and Adler [21] showed that the running time can be improved to $O(m^3 L)$, provided that $n=O(m)$.

A simplex-based exact convex quadratic programming solver was developed by Gärtner and Schönherr. It works well when the constraint matrix is dense and there are few variables or constraints [22]. However, it has been noted that this solver is inefficient in high dimensions due to the use of arbitrary-precision arithmetic [23]. Generally, iterative methods are preferred for large problems because they preserve the sparsity of the Hessian matrix [24].

Our method is related to solving large-scale quadratic programming by the active set method. The active set method can be viewed as turning the active inequality constraints into equality constraints, resulting in a simplified subproblem. Large-scale quadratic programming algorithms often assume box constraints, i.e., each variable has an upper bound and lower bound; however, these bounds may not necessarily be finite. In the subsequent discussion, we will provide a brief overview of the existing active-set algorithms for large-scale quadratic programming found in the literature.

Box constraints only

Coleman and Hulbert [25] proposed an active set algorithm that works for both convex and non-convex quadratic programming problems. The algorithm is designed for the case that there are only box constraints with finite bounds. Therefore, having a variable in the active set means requiring that variable to be at its lower or upper bound. As in [26, 27, 28], their method identifies several variables to be added to the active set. Given a search direction, the projected search direction is obtained by setting to zero all coordinates of the search direction that belong to the active set. Consequently, following the projected search direction does not alter the variables in the active set. The descent process follows the projected search direction until a coordinate hits its lower or upper bound. The algorithm updates the active set by adding the variable that just hit its lower or upper bound. This process is repeated until the projected search direction is no longer a descent direction or a local minimum is reached along the projected search direction. If the solution is not optimal, the algorithm uses the gradient to identify a set of variables in the active set such that, by allowing these variables to change their values, the objective function value will decrease. In each iteration, the algorithm releases this set of variables from the active set. This allows for the addition and removal of multiple variables from the active set in each iteration, guaranteeing that the active set does not grow monotonically. This is considered advantageous because active set algorithms often suffer from only being able to add or delete a single constraint from the active set per iteration.

Yang and Tolle [24] proposed an active set algorithm for convex quadratic programming with box constraints that can remove more than one box constraint from the active set in each iteration. The descent in each iteration is performed using the conjugate gradient method [29], which ignores the box constraints and generates a descent path that either stops at the optimum or at the intersection with the boundary of the feasible region. In the latter case, some variables reach their lower or upper bounds. The algorithm maintains a subset $S$ of such variables and the bounds reached by them that were recorded in previous iterations. In the main loop of the algorithm, the first step is to compute the set $B$ of variables that are at their lower or upper bounds such that the objective function value is not smaller at any feasible solution obtained by adjusting the variables in $B$. Then, the algorithm removes one constraint at a time from $S$ until the shrunken $S$ induces a descent direction. If $S$ becomes empty and there is no feasible descent direction, then the solution is optimal. If there exists a descent direction, then $S\cup B$ forms the new active set. The algorithm solves the constrained problem induced by this new active set using the conjugate gradient method [29]. Empirically, the algorithm performs better for problems that have a small set of free variables at optimality, and for problems with only non-negativity constraints.

Moré and Toraldo [30] proposed an algorithm for strictly convex quadratic programs subject to box constraints. The box constraints induce the faces of a hyper-rectangle, which form the boundaries of the feasible region. Furthermore, an active set identifies a face of the hyper-rectangle. The main loop iterates two procedures. The first procedure performs gradient descent until the optimal solution is reached or the gradient descent path is intercepted by a facet $F$ of the feasible region. In the first case, the algorithm terminates. In the latter case, the affine subspace of $F$ is defined by an active set $S$ of variables at their upper or lower bounds, and the second procedure is invoked. It runs the conjugate gradient method on the original problem constrained by the active set $S$; it either finds the optimum of this constrained problem in the interior of $F$, or descends to a boundary face of $F$. In either case, the algorithm proceeds to the next iteration of the main loop.

Friedlander et al. [31] proposed an algorithm for convex quadratic programs subject to box constraints. The structure of the algorithm is similar to [30]. The algorithm performs gradient descent until the descent path is intercepted by a facet $F$ of the feasible region. To solve the original problem constrained on the facet $F$, instead of using the conjugate gradient method, it employs the Barzilai-Borwein method to find the optimum of this constrained problem. The advantage of the Barzilai-Borwein method is that it requires less memory and fewer computations than the conjugate gradient method.

Box constraints and general linear constraints

Gould [32] proposed an active set algorithm for non-convex quadratic programming problems subject to box constraints and linear constraints. The algorithm adds or removes at most one constraint from the active set per iteration. In each iteration, the algorithm constructs a system of linear equations induced by the active set, solves it, and adds this solution to the current feasible solution of the quadratic program. The algorithm maintains an LU factorization of the coefficient matrix to exploit the sparsity of the system of equations. This approach allows for efficient updates to the factorization as the active set changes.

Bartlett and Biegler [33] proposed an algorithm for convex quadratic programming subject to box constraints and linear inequality constraints. The algorithm maintains a dual feasible solution and a primal infeasible solution in the optimization process. In each iteration, the algorithm adds a violated primal constraint to the active set. Then, the Schur complement method [34] is used to solve the optimization problem constrained by the active set. If a dual variable reaches zero in the update, the corresponding constraint is removed from the active set.

Our contributions

The method that we study greedily selects a subset of variables as free variables—the non-free variables are fixed at zero—and calls the solver to solve a constrained NNQ problem in each iteration. At the end of each iteration, the set of free variables is updated in two ways. First, all free variables that are equal to zero are made non-free in the next iteration. Second, a subset of the non-free variables that violate the dual feasibility constraint the most are set free in the next iteration. However, if there are too few variables that violate the dual feasibility constraint or the number of iterations exceeds a predefined threshold, we simply set free all variables that violate the dual feasibility constraint.

A version of the above greedy selection of free variables in an iterative scheme was already used in [1] for solving the DKSG problem because it took too long for a single call of a quadratic programming solver to finish. However, it is only briefly mentioned in [1] that a small number of the most negative gradient coordinates are selected to be freed. No further details are given. We found that the selection of the number of variables to be freed is crucial. The right answer is not some constant number of variables or a constant fraction of them. Also, one cannot always turn free variables that happen to be zero in an iteration into non-free variables in the next iteration as suggested in [1]; otherwise, the algorithm may not terminate in some cases.

Our method also uses the gradient to screen for descent directions, allowing us to release more than one variable in each iteration. However, instead of freeing all active variables that have a non-zero gradient coordinate as in [15, 25, 31, 30], we choose a parameter that enables us to remove just the right number of variables from the active set so that our subproblem does not become too large. Heuristically, this approach is faster than freeing all variables with a non-zero gradient coordinate or freeing only one variable, as in [33, 32]. Similar to the method presented in [24], we remove variables from the free variable set, enabling us to keep our subproblem small. This is in contrast to the non-decreasing size of the free variable set, as in [15, 10]. Thus, our algorithm performs well when solution sparsity is observed.

Clarkson's linear programming algorithm [10] is designed for cases where the number of variables is much smaller than the number of given constraints. However, this is not the case for some other problems. For example, consider the DKSG problem with $n$ input points. Although the ambient space is $\mathbb{R}^d$, there are $\Theta(n^2)$ variables and $\Theta(n^2)$ constraints. By design, the number of constraints in a subproblem of Clarkson's algorithm is proportional to the product of the number of variables and the square root of the total number of constraints. In our context, the subproblem has $\Theta(n^3)$ constraints. Consequently, solving such a subproblem is not easier than solving the whole problem. Therefore, Clarkson's algorithm cannot be directly applied to our proximity graph and NNLS problems, as in these problems, the numbers of constraints and variables are proportional to each other.

We implemented the iterative method by using quadprog of MATLAB as a subroutine. In comparison with a single call of quadprog, we obtain a 10-fold speedup for DKSG and ZHLG on some public data sets; we also obtain a 2-fold or more speedup for deblurring some gray-scale space and thermal images via NNLS. In the image deblurring experiments, the iterative method compares favorably with other software that can solve NNLS, including FISTA with backtracking [35, 36], SBB [8], FNNLS [9], and lsqnonneg of MATLAB. We emphasize that we do not claim a solution for image deblurring as there are many issues that we do not address; we only seek to demonstrate the potential of the iterative scheme.

The iterative method gains efficiency when the solutions for the intermediate constrained subproblems are sparse and the numerical solver runs faster in such cases. Due to the overhead in setting up the problem to be solved in each iteration, a substantial fraction of the variables in an intermediate solution should be zero in order that the saving outweighs the overhead.

We try to offer a theoretical explanation of the observed efficiency. Let $f:\mathbb{R}^\nu\rightarrow\mathbb{R}$ denote the objective function of the NNQ problem at hand. Let $\langle\mathtt{x},\mathtt{y}\rangle$ denote the inner product of two vectors $\mathtt{x}$ and $\mathtt{y}$. A unit direction $\mathtt{n}\in\mathbb{R}^\nu$ is a descent direction from a feasible solution $\mathtt{x}$ if $\langle\nabla f(\mathtt{x}),\mathtt{n}\rangle<0$ and $\mathtt{x}+s\mathtt{n}$ lies in the feasible region for some $s>0$. Let $\mathtt{x}_r$ be the solution produced in the $(r-1)$-th iteration. Let $\mathtt{x}_*$ be the optimal solution. Using elementary vector calculus, one can show that $\frac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_*)}{f(\mathtt{x}_r)-f(\mathtt{x}_*)}\leq 1-\frac{1}{2}\cdot\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}\cdot\frac{\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle}{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_*\rangle}$, where $\mathtt{y}$ is the minimum in direction $\mathtt{n}$ from $\mathtt{x}_r$, and $\mathtt{n}_*$ is the unit vector $(\mathtt{x}_*-\mathtt{x}_r)/\|\mathtt{x}_*-\mathtt{x}_r\|$. We prove that there exists a descent direction $\mathtt{n}_r$ from $\mathtt{x}_r$ for which only a few variables need to be freed and $\frac{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_r\rangle}{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_*\rangle}$ is roughly bounded from below by the reciprocal of the square root of the number of variables. Therefore, if we assume that the distance between $\mathtt{x}_r$ and the minimum $\mathtt{y}_r$ in direction $\mathtt{n}_r$ is at least $\frac{1}{\lambda}\|\mathtt{x}_r-\mathtt{x}_*\|$ for all $r$, where $\lambda$ is some fixed value, then $f(\mathtt{x}_{r+i})-f(\mathtt{x}_*)\leq e^{-1}(f(\mathtt{x}_r)-f(\mathtt{x}_*))$ for some $i$ that is roughly bounded by $\lambda$ times the square root of the number of variables. That is, the gap from $f(\mathtt{x}_*)$ will be reduced by a factor $e$ in at most $i$ iterations. It follows that SolveNNQ will converge quickly to the optimum under our assumption. The geometric-like convergence still holds even if the inequality $\|\mathtt{x}_r-\mathtt{y}_r\|\geq\frac{1}{\lambda}\|\mathtt{x}_r-\mathtt{x}_*\|$ is satisfied in only a constant fraction of any sequence of consecutive iterations (the iterations in that constant fraction need not be consecutive).

We ran experiments on the DKSG and ZHLG problems to check the assumption of our convergence analysis. We checked the ratios $\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}$ using the descent direction $\frac{\mathtt{x}_{r+1}-\mathtt{x}_r}{\|\mathtt{x}_{r+1}-\mathtt{x}_r\|}$ taken by our algorithm as $\mathtt{n}$; therefore, $\mathtt{y}$ is $\mathtt{x}_{r+1}$. Our experiments show that $\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}\geq\frac{1}{100}$ in every iteration, with the exception of the outliers from the DKSG problem. However, the convergence still holds despite the outliers. Most of the runs achieve better ratios, as shown in Figures 3(a) and (b) in Section 3. For the ZHLG problem, we observe that $\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}\geq\frac{1}{2}$ holds in every iteration. For the DKSG problem, the majority of the ratios are greater than or equal to $\frac{1}{2}$, with only a few exceptions that are less than $\frac{1}{2}$.

2 Algorithm

Notation

We use uppercase and lowercase letters in typewriter font to denote matrices and vectors, respectively. The inner product of $\mathtt{x}$ and $\mathtt{y}$ is written as $\langle\mathtt{x},\mathtt{y}\rangle$ or $\mathtt{x}^t\mathtt{y}$. We use $0_m$ to denote the $m$-dimensional zero vector and $1_m$ the $m$-dimensional vector with all coordinates equal to 1.

The span of a set of vectors $W$ is the linear subspace spanned by them; we denote it by $\mathtt{span}(W)$. In $\mathbb{R}^\nu$, for $i\in[\nu]$, define the vector $\mathtt{e}_i$ to have value 1 in the $i$-th coordinate and zero in all other coordinates. Therefore, $\mathtt{e}_1,\mathtt{e}_2,\ldots,\mathtt{e}_\nu$ form an orthonormal basis of $\mathbb{R}^\nu$.

A conical combination of a set of vectors $\{\mathtt{w}_1,\mathtt{w}_2,\ldots,\mathtt{w}_\nu\}$ is $\sum_{i=1}^\nu c_i\mathtt{w}_i$ for some non-negative real values $c_1,c_2,\ldots,c_\nu$. For example, the set of all conical combinations of $\{\mathtt{e}_1,\mathtt{e}_2,\ldots,\mathtt{e}_\nu\}$ forms the positive quadrant of $\mathbb{R}^\nu$.

Given a vector $\mathtt{x}$, we denote the $i$-th coordinate of $\mathtt{x}$ by putting a pair of brackets around $\mathtt{x}$ and attaching a subscript $i$ to the right bracket, i.e., $(\mathtt{x})_i$. This notation should not be confused with a vector that is annotated with a subscript. For example, $(\mathtt{e}_i)_i=1$ and $(\mathtt{e}_i)_j=0$ for all $j\not=i$. As another example, $\mathtt{x}_r$ means the intermediate solution vector that we obtain in the $(r-1)$-th iteration, but $(\mathtt{x}_r)_i$ refers to the $i$-th coordinate of $\mathtt{x}_r$. The vector inside the brackets can be the result of some operations; for example, $(\mathtt{M}\mathtt{x})_i$ refers to the $i$-th coordinate of the vector formed by multiplying the matrix $\mathtt{M}$ with the vector $\mathtt{x}$. We use $\mathtt{supp}(\mathtt{x})$ to denote $\{i:(\mathtt{x})_i\not=0\}$.

(a) Constrained NNQ: minimize $f(\mathtt{x})$ subject to $\mathtt{B}\mathtt{x}\geq\mathtt{b}$, $\mathtt{C}\mathtt{x}=\mathtt{c}$, $(\mathtt{x})_i\geq 0$ for all $i\not\in S_r$, and $(\mathtt{x})_i=0$ for all $i\in S_r$.
(b) Lagrange dual: $\max_{\mathtt{u},\tilde{\mathtt{u}},\mathtt{v}}\,\min_{\mathtt{x}}\,g(\mathtt{u},\tilde{\mathtt{u}},\mathtt{v},\mathtt{x})$ subject to $\mathtt{u}\geq 0_\kappa$ and $(\mathtt{v})_i\geq 0$ for all $i\not\in S_r$.
Figure 1: The constrained NNQ and its Lagrange dual problem.

Lagrange dual

Let the objective function be denoted by $f:\mathbb{R}^\nu\rightarrow\mathbb{R}$. Assume that the constraints of the original NNQ problem are $\mathtt{B}\mathtt{x}\geq\mathtt{b}$, $\mathtt{C}\mathtt{x}=\mathtt{c}$, and $\mathtt{x}\geq 0_\nu$ for some matrices $\mathtt{B}$ and $\mathtt{C}$. Let $\kappa$ be the number of rows of $\mathtt{B}$. The number of rows of $\mathtt{C}$ will not matter in our discussion.

For $r=2,3,\ldots$, the $(r-1)$-th iteration of the method solves a constrained version of the NNQ problem in which a subset of variables are fixed at zero. Figure 1(a) shows this constrained NNQ problem. We use $S_r$ to denote the set of indices of the variables that are fixed at zero, and we call $S_r$ an active set. For convenience, we say that the variables whose indices belong to $S_r$ are the variables in the active set $S_r$. We call the variables outside $S_r$ the free variables.

We are interested in the Lagrange dual problem as it will tell us how to update the active set $S_r$ when proceeding to the next iteration [37]. Let $\mathtt{u}$ and $\tilde{\mathtt{u}}$ be the vectors of Lagrange multipliers for the constraints $\mathtt{B}\mathtt{x}\geq\mathtt{b}$ and $\mathtt{C}\mathtt{x}=\mathtt{c}$, respectively. The dimensions of $\mathtt{u}$ and $\tilde{\mathtt{u}}$ are equal to the numbers of rows of $\mathtt{B}$ and $\mathtt{C}$, respectively. As a result, $\mathtt{u}\in\mathbb{R}^\kappa$. In the dual problem, the coordinates of $\mathtt{u}$ are required to be non-negative as the inequality signs in $\mathtt{B}\mathtt{x}\geq\mathtt{b}$ are greater-than-or-equal-to, whereas the coordinates of $\tilde{\mathtt{u}}$ are unrestricted due to the equality signs in $\mathtt{C}\mathtt{x}=\mathtt{c}$. Similarly, let $\mathtt{v}$ be the vector of Lagrange multipliers that correspond to the constraints $(\mathtt{x})_i\geq 0$ for $i\not\in S_r$ and $(\mathtt{x})_i=0$ for $i\in S_r$. So $(\mathtt{v})_i$ is required to be non-negative for $i\not\in S_r$, whereas $(\mathtt{v})_i$ is unrestricted for $i\in S_r$. The Lagrange dual function is $g(\mathtt{u},\tilde{\mathtt{u}},\mathtt{v},\mathtt{x})=f(\mathtt{x})+\mathtt{u}^t(\mathtt{b}-\mathtt{B}\mathtt{x})+\tilde{\mathtt{u}}^t(\mathtt{c}-\mathtt{C}\mathtt{x})-\mathtt{v}^t\mathtt{x}$. The Lagrange dual problem is shown in Figure 1(b). Note that $\mathtt{x}$ is unrestricted in the dual problem.

KKT conditions and candidate free variable

By duality, the optimal value of the Lagrange dual problem is always a lower bound of the optimal value of the primal problem. Since all constraints in Figure 1(a) are affine, strong duality holds in this case which implies that the optimal value of the Lagrange dual problem is equal to the optimal value of the primal problem [37, Section 5.2.3].

Let $\mathtt{x}_r$ denote the optimal solution of the constrained NNQ problem in Figure 1(a). That is, $\mathtt{x}_r$ is the optimal solution in the $(r-1)$-th iteration. Let $(\mathtt{u}_r,\tilde{\mathtt{u}}_r,\mathtt{v}_r,\mathtt{x}_r)$ be the corresponding solution of the Lagrange dual problem. For convenience, we say that $\mathtt{x}_r$ is the optimal solution of $f$ constrained by $S_r$, and that $(\mathtt{u}_r,\tilde{\mathtt{u}}_r,\mathtt{v}_r,\mathtt{x}_r)$ is the optimal solution of $g$ constrained by $S_r$.

The Karush-Kuhn-Tucker conditions, or KKT conditions for short, are the sufficient and necessary conditions for $(\mathtt{u}_r,\tilde{\mathtt{u}}_r,\mathtt{v}_r,\mathtt{x}_r)$ to be an optimal solution for the Lagrange dual problem [37]. There are four of them:

  • Critical point: $\frac{\partial g(\mathtt{x}_r)}{\partial\mathtt{x}}=\nabla f(\mathtt{x}_r)-\mathtt{B}^t\mathtt{u}_r-\mathtt{C}^t\tilde{\mathtt{u}}_r-\mathtt{v}_r=0_\nu$.

  • Primal feasibility: $\mathtt{B}\mathtt{x}_r\geq\mathtt{b}$, $\mathtt{C}\mathtt{x}_r=\mathtt{c}$, $\mathtt{x}_r\geq 0_\nu$, and $(\mathtt{x}_r)_i=0$ for all $i\in S_r$.

  • Dual feasibility: $\mathtt{u}_r\geq 0_\kappa$ and $(\mathtt{v}_r)_i\geq 0$ for all $i\not\in S_r$.

  • Complementary slackness: $(\mathtt{u}_r)_i\cdot(\mathtt{b}-\mathtt{B}\mathtt{x}_r)_i=0$ for all $i\in[\kappa]$, and $(\mathtt{v}_r)_i\cdot(\mathtt{x}_r)_i=0$ for all $i\in[\nu]$.

If $\mathtt{x}_r$ is not an optimal solution for the original NNQ problem, there must exist some $j\in S_r$ such that if $j$ is removed from $S_r$, the KKT conditions are violated and hence $(\mathtt{u}_r,\tilde{\mathtt{u}}_r,\mathtt{v}_r,\mathtt{x}_r)$ is not an optimal solution for $g$ constrained by $S_r\setminus\{j\}$. Among the KKT conditions, only dual feasibility can be violated by removing $j$ from $S_r$; in case of violation, we must have $(\mathtt{v}_r)_j<0$. Therefore, for all $j\in S_r$, if $(\mathtt{v}_r)_j<0$, then $j$ is a candidate to be removed from the active set. In other words, $(\mathtt{x})_j$ is a candidate variable to be set free in the next iteration.

Algorithm specification

We use a trick that is important for efficiency. For $r=2,3,\ldots$, before calling the solver to compute $\mathtt{x}_r$ in the $(r-1)$-th iteration, we eliminate the variables in the active set $S_r$ because they are set to zero anyway. Since there can be many variables in $S_r$, especially when $r$ is small, this step allows the solver to potentially run a lot faster. Therefore, the solver cannot return any Lagrange multiplier $(\mathtt{v}_r)_j$ that corresponds to a variable $(\mathtt{x})_j$ in the active set $S_r$ because $(\mathtt{x})_j$ is eliminated before the solver is called. This seems to be a problem because our key task at the end of the $(r-1)$-th iteration is to determine an appropriate subset of $j\in S_r$ such that $(\mathtt{v}_r)_j<0$. Fortunately, the first KKT condition tells us that $\mathtt{v}_r=\nabla f(\mathtt{x}_r)-\mathtt{B}^t\mathtt{u}_r-\mathtt{C}^t\tilde{\mathtt{u}}_r$, so we can compute all entries of $\mathtt{v}_r$. In other words, the optimal solution of the Lagrange dual problem constrained by $S_r$ is fully characterized by $(\mathtt{u}_r,\tilde{\mathtt{u}}_r,\mathtt{x}_r)$.
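The following Python sketch shows this recovery of $\mathtt{v}_r$ for the quadratic objective $f(\mathtt{x})=\mathtt{x}^t\mathtt{A}^t\mathtt{A}\mathtt{x}+\mathtt{a}^t\mathtt{x}$; the function name and the representation of $S_r$ as a boolean mask are our own choices for illustration.

```python
import numpy as np

def dual_violations(A, a, B, C, x_r, u_r, u_tilde_r, S_r):
    """Recover v_r = grad f(x_r) - B^t u_r - C^t u~_r (first KKT condition,
    with grad f(x) = 2 A^t A x + a) and return the indices in the active set
    S_r (a boolean mask) that violate dual feasibility, sorted so that the
    most negative (v_r)_i comes first.
    """
    grad = 2.0 * (A.T @ (A @ x_r)) + a
    v_r = grad - B.T @ u_r - C.T @ u_tilde_r
    violated = np.where(S_r & (v_r < 0.0))[0]
    return v_r, violated[np.argsort(v_r[violated])]
```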

Algorithm 1 SolveNNQ
1: $\tau$ is an integer threshold much less than $\nu$
2: $\beta_0$ is a threshold on $|E_r|$ and $\beta_0>\tau$
3: $\beta_1$ is a threshold on the number of iterations
4: compute an initial active set $S_1$ and an optimal solution $(\mathtt{u}_1,\tilde{\mathtt{u}}_1,\mathtt{x}_1)$ of $g$ constrained by $S_1$
5: $r\leftarrow 1$
6: loop
7:     $E_r\leftarrow$ the sequence of indices $i\in S_r$ such that $(\mathtt{v}_r)_i=(\nabla f(\mathtt{x}_r)-\mathtt{B}^t\mathtt{u}_r-\mathtt{C}^t\tilde{\mathtt{u}}_r)_i<0$, sorted in non-decreasing order of $(\mathtt{v}_r)_i$
8:     if $E_r=\emptyset$ then return $\mathtt{x}_r$
9:     else
10:         if $|E_r|<\beta_0$ or $r>\beta_1$ then
11:             $S_{r+1}\leftarrow S_r\setminus E_r$
12:         else
13:             let $E'_r$ be the subset of the first $\tau$ indices in $E_r$
14:             $S_{r+1}\leftarrow[\nu]\setminus\bigl(\mathtt{supp}(\mathtt{x}_r)\cup E'_r\bigr)$
15:         end if
16:     end if
17:     $(\mathtt{u}_{r+1},\tilde{\mathtt{u}}_{r+1},\mathtt{x}_{r+1})\leftarrow$ optimal solution of $g$ constrained by the active set $S_{r+1}$
18:     $r\leftarrow r+1$
19: end loop

SolveNNQ in Algorithm 1 describes the iterative method. In line 14 of SolveNNQ, we kick out all free variables that are zero in the current solution $\mathtt{x}_r$, i.e., if $i\not\in S_r$ and $i\not\in\mathtt{supp}(\mathtt{x}_r)$, move $i$ to $S_{r+1}$. This step keeps the number of free variables small so that the next call of the solver will run fast. When the computation is near its end, moving a free variable wrongly to the active set can be costly; it is preferable to shrink the active set to stay on course for the optimal solution $\mathtt{x}_*$. Therefore, we have a threshold $\beta_0$ on $|E_r|$ and set $S_{r+1}\leftarrow S_r\setminus E_r$ in line 11 if $|E_r|<\beta_0$. The threshold $\beta_0$ gives us control over the number of free variables. In some rare cases when we are near $\mathtt{x}_*$, the algorithm may alternate between excluding a primal variable from the active set in one iteration and moving the same variable into the active set in the next iteration; the algorithm may then fail to terminate. Therefore, we have another threshold $\beta_1$ on the number of iterations, beyond which the active set is shrunk monotonically in line 11, thereby guaranteeing convergence. The threshold $\beta_1$ is rarely exceeded in our experiments; setting $S_{r+1}\leftarrow S_r\setminus E_r$ when $|E_r|<\beta_0$ also helps to keep the algorithm from exceeding the threshold $\beta_1$. In our experiments, we set $\tau=4\ln^2\nu$, $\beta_0=3\tau$, and $\beta_1=15$.
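The sketch below transcribes the loop of Algorithm 1 into Python under the assumption that a callback solve_subproblem(S) is available that returns the optimum $\mathtt{x}$ of the NNQ problem with the variables in the active set $S$ fixed at zero, together with the recovered dual vector $\mathtt{v}$ (as in the previous paragraph); it is an illustration of the control flow, not our MATLAB implementation.

```python
import numpy as np

def solve_nnq(solve_subproblem, nu, S1, tau=None, beta0=None, beta1=15):
    """Skeleton of SolveNNQ (Algorithm 1).

    solve_subproblem(S) -> (x, v): optimum x with the variables in the
    boolean mask S fixed at zero, and the dual vector v recovered via
    v = grad f(x) - B^t u - C^t u~.  Default thresholds follow the paper:
    tau = 4 ln^2(nu), beta0 = 3 tau, beta1 = 15.
    """
    if tau is None:
        tau = int(round(4.0 * np.log(nu) ** 2))
    if beta0 is None:
        beta0 = 3 * tau
    S = S1.copy()
    x, v = solve_subproblem(S)                 # line 4
    r = 1
    while True:                                # lines 6-19
        E = np.where(S & (v < 0.0))[0]         # line 7: dual-infeasible indices
        if E.size == 0:
            return x                           # line 8: KKT conditions hold
        E = E[np.argsort(v[E])]                # most negative (v)_i first
        if E.size < beta0 or r > beta1:
            S[E] = False                       # line 11: shrink the active set
        else:
            S = np.ones(nu, dtype=bool)        # line 14: fix every variable ...
            S[x != 0.0] = False                # ... except the support of x ...
            S[E[:tau]] = False                 # ... and the tau worst violators
        x, v = solve_subproblem(S)             # line 17
        r += 1
```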

Using elementary vector calculus, one can prove that $\frac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_*)}{f(\mathtt{x}_r)-f(\mathtt{x}_*)}\leq 1-\frac{1}{2}\cdot\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}\cdot\frac{\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle}{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_*\rangle}$ as stated in Lemma 1 below, where $\mathtt{n}$ is any unit descent direction from $\mathtt{x}_r$, $\mathtt{y}$ is the minimum in direction $\mathtt{n}$ from $\mathtt{x}_r$, and $\mathtt{n}_*$ is the unit vector $(\mathtt{x}_*-\mathtt{x}_r)/\|\mathtt{x}_*-\mathtt{x}_r\|$.

In Sections 4 and 5, for the DKSG, ZHLG, and NNLS problems, we prove that there exists a descent direction $\mathtt{n}_r$ from $\mathtt{x}_r$ that involves only free variables in $[\nu]\setminus S_{r+1}$ and $\frac{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_r\rangle}{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_*\rangle}$ is roughly bounded from below by the reciprocal of the square root of the number of variables. Therefore, if we assume that the distance between $\mathtt{x}_r$ and the minimum in direction $\mathtt{n}_r$ is at least $\frac{1}{\lambda}\|\mathtt{x}_r-\mathtt{x}_*\|$ for all $r$, where $\lambda$ is some fixed value, then $f(\mathtt{x}_{r+i})-f(\mathtt{x}_*)\leq e^{-1}(f(\mathtt{x}_r)-f(\mathtt{x}_*))$ for some $i$ that is roughly bounded by $\lambda$ times the square root of the number of variables. It follows that if $\beta_1$ is not exceeded, SolveNNQ will converge efficiently to the optimum under our assumption. The geometric-like convergence still holds even if the inequality $\|\mathtt{x}_r-\mathtt{y}_r\|\geq\frac{1}{\lambda}\|\mathtt{x}_r-\mathtt{x}_*\|$ is satisfied in only a constant fraction of any sequence of consecutive iterations (the iterations in that constant fraction need not be consecutive). If $\beta_1$ is exceeded, Lemma 2 below shows that the number of remaining iterations is at most $|\mathtt{supp}(\mathtt{x}_*)|$; however, we have no control over the number of free variables.

In the following, we derive Lemma 1 for convex quadratic functions. Property (i) of Lemma 1 is closely related to the first-order condition [37] for convex functions. For completeness, we include an adaptation of it to convex quadratic functions. For simplicity of presentation, we assume that the optimal solution is unique; the correctness of our proofs does not require this assumption.

Lemma 1.

Let $\mathtt{n}\in\mathbb{R}^\nu$ be any unit descent direction from $\mathtt{x}_r$. Let $\mathtt{y}=\mathtt{x}_r+\alpha\mathtt{n}$ for some $\alpha>0$ be the feasible point that minimizes $f$ on the ray from $\mathtt{x}_r$ in direction $\mathtt{n}$. Let $S$ be an active set that is disjoint from $\mathtt{supp}(\mathtt{x}_r)\cup\mathtt{supp}(\mathtt{n})$. Let $\mathtt{x}_{r+1}$ be the optimal solution for $f$ constrained by $S$. Then, $\frac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_*)}{f(\mathtt{x}_r)-f(\mathtt{x}_*)}\leq 1-\frac{1}{2}\cdot\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}\cdot\frac{\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle}{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_*\rangle}$, where $\mathtt{n}_*$ is the unit vector $\frac{\mathtt{x}_*-\mathtt{x}_r}{\|\mathtt{x}_*-\mathtt{x}_r\|}$.

Proof.

We prove two properties: (i) $f(\mathtt{x}_r)-f(\mathtt{y})\leq-\|\mathtt{x}_r-\mathtt{y}\|\cdot\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle$, and (ii) $f(\mathtt{x}_r)-f(\mathtt{y})\geq-\frac{1}{2}\|\mathtt{x}_r-\mathtt{y}\|\cdot\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle$.

Consider (i). For all $s\in[0,1]$, define $\mathtt{y}_s=\mathtt{x}_r+s(\mathtt{y}-\mathtt{x}_r)$. By the chain rule, we have $\frac{\partial f}{\partial s}=\bigl\langle\frac{\partial f}{\partial\mathtt{y}_s},\frac{\partial\mathtt{y}_s}{\partial s}\bigr\rangle=\bigl\langle\nabla f(\mathtt{y}_s),\mathtt{y}-\mathtt{x}_r\bigr\rangle$. We integrate along a linear movement from $\mathtt{x}_r$ to $\mathtt{y}$. Using the fact that $\nabla f(\mathtt{y}_s)=2\mathtt{A}^t\mathtt{A}(\mathtt{x}_r+s(\mathtt{y}-\mathtt{x}_r))+\mathtt{a}=\nabla f(\mathtt{x}_r)+2s\mathtt{A}^t\mathtt{A}(\mathtt{y}-\mathtt{x}_r)$, we obtain

\begin{align}
f(\mathtt{y}) &= f(\mathtt{x}_r)+\int_0^1\langle\nabla f(\mathtt{y}_s),\mathtt{y}-\mathtt{x}_r\rangle\,\mathtt{d}s \nonumber\\
 &= f(\mathtt{x}_r)+\int_0^1\langle\nabla f(\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\rangle\,\mathtt{d}s+\int_0^1 2s\bigl\langle\mathtt{A}^t\mathtt{A}(\mathtt{y}-\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\bigr\rangle\,\mathtt{d}s \nonumber\\
 &= f(\mathtt{x}_r)+\Bigl[\langle\nabla f(\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\rangle\cdot s\Bigr]_0^1+\Bigl[\|\mathtt{A}(\mathtt{y}-\mathtt{x}_r)\|^2\cdot s^2\Bigr]_0^1 \nonumber\\
 &= f(\mathtt{x}_r)+\langle\nabla f(\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\rangle+\|\mathtt{A}(\mathtt{y}-\mathtt{x}_r)\|^2. \tag{1}
\end{align}

It follows immediately that

\begin{align*}
f(\mathtt{x}_r)-f(\mathtt{y}) &= -\langle\nabla f(\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\rangle-\|\mathtt{A}(\mathtt{y}-\mathtt{x}_r)\|^2\\
 &\leq -\langle\nabla f(\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\rangle\;=\;-\|\mathtt{x}_r-\mathtt{y}\|\cdot\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle.
\end{align*}

This completes the proof of (i).

Consider (ii). Using $\mathtt{y}_s=\mathtt{x}_r+s(\mathtt{y}-\mathtt{x}_r)=\mathtt{y}+(1-s)(\mathtt{x}_r-\mathtt{y})$ and the fact that $\nabla f(\mathtt{y}_s)=2\mathtt{A}^t\mathtt{A}(\mathtt{y}+(1-s)(\mathtt{x}_r-\mathtt{y}))+\mathtt{a}=\nabla f(\mathtt{y})+2(1-s)\mathtt{A}^t\mathtt{A}(\mathtt{x}_r-\mathtt{y})$, we carry out a symmetric derivation:

\begin{align}
f(\mathtt{x}_r) &= f(\mathtt{y})+\int_1^0\langle\nabla f(\mathtt{y}_s),\mathtt{y}-\mathtt{x}_r\rangle\,\mathtt{d}s \nonumber\\
 &= f(\mathtt{y})+\int_1^0\langle\nabla f(\mathtt{y}),\mathtt{y}-\mathtt{x}_r\rangle\,\mathtt{d}s+\int_1^0 2(1-s)\langle\mathtt{A}^t\mathtt{A}(\mathtt{x}_r-\mathtt{y}),\mathtt{y}-\mathtt{x}_r\rangle\,\mathtt{d}s \nonumber\\
 &= f(\mathtt{y})+\Bigl[\langle\nabla f(\mathtt{y}),\mathtt{y}-\mathtt{x}_r\rangle\cdot s\Bigr]_1^0+\Bigl[\|\mathtt{A}(\mathtt{y}-\mathtt{x}_r)\|^2\cdot(1-s)^2\Bigr]_1^0 \nonumber\\
 &= f(\mathtt{y})-\langle\nabla f(\mathtt{y}),\mathtt{y}-\mathtt{x}_r\rangle+\|\mathtt{A}(\mathtt{y}-\mathtt{x}_r)\|^2. \tag{2}
\end{align}

Combining (1) and (2) gives $f(\mathtt{x}_r)-f(\mathtt{y})=-\frac{1}{2}\langle\nabla f(\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\rangle-\frac{1}{2}\langle\nabla f(\mathtt{y}),\mathtt{y}-\mathtt{x}_r\rangle$. As $\mathtt{y}$ is the feasible point that minimizes the value of $f$ on the ray from $\mathtt{x}_r$ in direction $\mathtt{n}$, we have $\langle\nabla f(\mathtt{y}),\mathtt{y}-\mathtt{x}_r\rangle\leq 0$. Therefore, $f(\mathtt{x}_r)-f(\mathtt{y})\geq-\frac{1}{2}\langle\nabla f(\mathtt{x}_r),\mathtt{y}-\mathtt{x}_r\rangle=-\frac{1}{2}\|\mathtt{x}_r-\mathtt{y}\|\cdot\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle$, completing the proof of (ii).

By applying (i) to the descent direction $\mathtt{x}_*-\mathtt{x}_r$, we get $f(\mathtt{x}_r)-f(\mathtt{x}_*)\leq-\langle\nabla f(\mathtt{x}_r),\mathtt{x}_*-\mathtt{x}_r\rangle$. Since $S$ is disjoint from $\mathtt{supp}(\mathtt{x}_r)\cup\mathtt{supp}(\mathtt{n})$, the feasible points on the ray from $\mathtt{x}_r$ in direction $\mathtt{n}$ are also feasible with respect to $S$. Therefore, $f(\mathtt{x}_r)-f(\mathtt{x}_{r+1})\geq f(\mathtt{x}_r)-f(\mathtt{y})$. By applying (ii) to the descent direction $\mathtt{n}$, we get $f(\mathtt{x}_r)-f(\mathtt{x}_{r+1})\geq f(\mathtt{x}_r)-f(\mathtt{y})\geq-\frac{1}{2}\|\mathtt{x}_r-\mathtt{y}\|\cdot\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle$. Dividing the lower bound for $f(\mathtt{x}_r)-f(\mathtt{x}_{r+1})$ by the upper bound for $f(\mathtt{x}_r)-f(\mathtt{x}_*)$ proves that $\frac{f(\mathtt{x}_r)-f(\mathtt{x}_{r+1})}{f(\mathtt{x}_r)-f(\mathtt{x}_*)}\geq\frac{1}{2}\cdot\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}\cdot\frac{\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle}{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_*\rangle}$. Hence, $\frac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_*)}{f(\mathtt{x}_r)-f(\mathtt{x}_*)}=1-\frac{f(\mathtt{x}_r)-f(\mathtt{x}_{r+1})}{f(\mathtt{x}_r)-f(\mathtt{x}_*)}\leq 1-\frac{1}{2}\cdot\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}\cdot\frac{\langle\nabla f(\mathtt{x}_r),\mathtt{n}\rangle}{\langle\nabla f(\mathtt{x}_r),\mathtt{n}_*\rangle}$. ∎

Lemma 2.

If $\beta_1$ is exceeded, lines 7–18 of SolveNNQ will be iterated at most $|\mathtt{supp}(\mathtt{x}_*)|$ more times before $\mathtt{x}_*$ is found.

Proof.

By the KKT conditions, $\nabla f(\mathtt{x}_r)=\mathtt{B}^t\mathtt{u}_r+\mathtt{C}^t\tilde{\mathtt{u}}_r+\mathtt{v}_r$, so $\langle\nabla f(\mathtt{x}_r),\mathtt{x}_*-\mathtt{x}_r\rangle=\langle\mathtt{B}^t\mathtt{u}_r,\mathtt{x}_*-\mathtt{x}_r\rangle+\langle\mathtt{C}^t\tilde{\mathtt{u}}_r,\mathtt{x}_*-\mathtt{x}_r\rangle+\langle\mathtt{v}_r,\mathtt{x}_*-\mathtt{x}_r\rangle$. By complementary slackness, we have $\mathtt{u}_r^t(\mathtt{B}\mathtt{x}_r-\mathtt{b})=0$ and $\mathtt{v}_r^t\mathtt{x}_r=0$. The former equation implies that $\mathtt{u}_r^t\mathtt{B}\mathtt{x}_r=\mathtt{u}_r^t\mathtt{b}$. By primal feasibility, $\mathtt{C}\mathtt{x}_*=\mathtt{C}\mathtt{x}_r=\mathtt{c}$, which implies that $\langle\mathtt{C}^t\tilde{\mathtt{u}}_r,\mathtt{x}_*-\mathtt{x}_r\rangle=\tilde{\mathtt{u}}_r^t(\mathtt{C}\mathtt{x}_*-\mathtt{C}\mathtt{x}_r)=0$. As a result,

$$\langle\nabla f(\mathtt{x}_r),\mathtt{x}_*-\mathtt{x}_r\rangle=\mathtt{u}_r^t(\mathtt{B}\mathtt{x}_*-\mathtt{b})+\mathtt{v}_r^t\mathtt{x}_*.\qquad(3)$$

By primal and dual feasibility, $\mathtt{B}\mathtt{x}_*\geq\mathtt{b}$ and $\mathtt{u}_r\geq 0_\kappa$. So $\mathtt{u}_r^t(\mathtt{B}\mathtt{x}_*-\mathtt{b})\geq 0$.

We claim that $(\mathtt{v}_r)_i<0$ for some $i\in\mathtt{supp}(\mathtt{x}_*)\cap S_r$. Assume to the contrary that $(\mathtt{v}_r)_i\geq 0$ for all $i\in\mathtt{supp}(\mathtt{x}_*)\cap S_r$. Then, $(\mathtt{v}_r)_i\geq 0$ for all $i\in\mathtt{supp}(\mathtt{x}_*)$, which implies that $\mathtt{v}_r^t\mathtt{x}_*\geq 0$. Substituting $\mathtt{u}_r^t(\mathtt{B}\mathtt{x}_*-\mathtt{b})\geq 0$ and $\mathtt{v}_r^t\mathtt{x}_*\geq 0$ into (3) gives $\langle\nabla f(\mathtt{x}_r),\mathtt{x}_*-\mathtt{x}_r\rangle\geq 0$. However, since $\mathtt{x}_r$ is not optimal, by the convexity of $f$ and of the feasible region of the NNQ problem, $\mathtt{x}_*-\mathtt{x}_r$ is a descent direction from $\mathtt{x}_r$, contradicting the relation $\langle\nabla f(\mathtt{x}_r),\mathtt{x}_*-\mathtt{x}_r\rangle\geq 0$. This proves our claim.

By our claim, there must exist some index $i\in\mathtt{supp}(\mathtt{x}_*)\cap S_r$ such that $(\mathtt{v}_r)_i<0$. This index $i$ must be included in $E_r$, which means that $i\not\in S_{r+1}$ because $S_{r+1}=S_r\setminus E_r$. As a result, each iteration of lines 10 and 11 sets free at least one more element of $\mathtt{supp}(\mathtt{x}_*)$. After at most $|\mathtt{supp}(\mathtt{x}_*)|$ more iterations, all elements of $\mathtt{supp}(\mathtt{x}_*)$ are free variables, which means that the solver will find $\mathtt{x}_*$ in at most $|\mathtt{supp}(\mathtt{x}_*)|$ more iterations. ∎

3 Experimental results

In this section, we present our experimental results on the DKSG, ZHLG, and image deblurring problems. Our machine configuration is: Intel Core i7-9700K 3.6 GHz CPU, 3600 MHz RAM, 8 cores, 8 logical processors. We use MATLAB version R2020b. We use the option of quadprog that runs the interior-point-convex algorithm.

As explained in Section 2, we eliminate the variables in the active set in each iteration in order to speed up the solver. One implementation is to extract the remaining rows and columns of the constraint matrix; however, doing so incurs a large amount of data movements. Instead, we zero out the rows and columns corresponding to the variables in the active set.
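A minimal sketch of the two options for a dense Hessian $\mathtt{H}$ and linear term $\mathtt{f}$ is given below; in both cases the variables in the active set must still be pinned at zero (e.g., through their bounds) so that the restricted subproblem is unchanged. The function names are ours.

```python
import numpy as np

def restrict_by_extraction(H, f, free):
    """Copy out the rows/columns of the free variables (boolean mask);
    this moves a large dense block of data."""
    return H[np.ix_(free, free)], f[free]

def restrict_by_zeroing(H, f, free):
    """Zero the rows and columns of the fixed variables in place instead of
    extracting a submatrix, avoiding the data movement of a large copy."""
    H[~free, :] = 0.0
    H[:, ~free] = 0.0
    f[~free] = 0.0
    return H, f
```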

We determine that $\tau=4\ln^2\nu$ is a good setting for the DKSG and ZHLG data sets. If we increase $\ln^2\nu$ to $\ln^3\nu$, the number of iterations is substantially reduced, but the running time of each iteration is larger; there is little difference in the overall efficiency. If we decrease $\ln^2\nu$ to $\ln\nu$, the number of iterations increases so much that the overall efficiency decreases. We also set $\tau=4\ln^2\nu$ in the experiments with image deblurring without checking whether this is the best for them. It helps to verify the robustness of this choice of $\tau$.

We set $\beta_0$ to be $3\tau$ and $\beta_1$ to be 15. Since we often observe a geometric reduction in the gap between the current objective function value and the optimum from one iteration to the next, we rarely encounter a case in our experiments in which the threshold $\beta_1$ is exceeded.

3.1 DKSG and ZHLG

For both the DKSG and ZHLG problems, we ran experiments to compare the performance of SolveNNQ with a single call of quadprog on the UCI Machine Learning Repository data sets Iris, HCV, Ionosphere, and AAL [38]. The data sets IRIS and Ionosphere were also used in the experiments in the paper of Daitch et al. [1] that proposed the DKSG problem. For each data set, we sampled uniformly at random a number of rows that represent the number of points $n$ and a number of attributes that represent the number of dimensions $d$. There are two parameters $\mu$ and $\rho$ in the objective function of the ZHLG problem. According to the results in [4], we set $\mu=16$ and $\rho=2$.

Ionosphere
DKSG ZHLG
n d quad ours nnz quad ours nnz
70 2 4.57s 1.84s 44% 2.51s 1.06s 47%
4 5.78s 2.25s 47% 2.20s 0.97s 49%
10 4.65s 1.55s 47% 1.86s 1.67s 59%
100 2 26.67s 5.41s 33% 16.31s 3.35s 36%
4 43.60s 7.06s 35% 14.87s 2.77s 36%
10 28.53s 5.46s 35% 13.34s 5.42s 44%
130 2 192.41s 11.56s 27% 144.27s 5.53s 28%
4 261.54s 13.13s 27% 141.70s 6.23s 29%
10 208.24s 12.02s 27% 134.95s 12.43s 36%
160 2 660.25s 25.39s 23% 614.71s 9.14s 24%
4 888.06s 27.20s 23% 622.58s 10.74s 24%
10 979.92s 22.03s 22% 598.55s 29.97s 31%
Table 1: Running time comparison of the Ionosphere data set. The data in each row is the average of the corresponding data from five runs; nnz is the percentage of the average number of non-zeros in the final solution. We use quad to denote quadprog.
AAL
DKSG ZHLG
n d quad ours nnz quad ours nnz
70 2 3.99s 1.72s 44% 2.36s 1.19s 52%
4 4.64s 1.94s 46% 2.32s 1.27s 50%
10 4.89s 1.80s 48% 1.93s 1.69s 58%
100 2 22.91s 5.30s 34% 15.25s 3.24s 37%
4 38.63s 7.81s 37% 13.38s 2.68s 35%
10 36.58s 6.00s 35% 13.34s 4.59s 43%
130 2 175.43s 15.81s 27% 145.83s 6.58s 28%
4 221.80s 15.89s 29% 142.01s 5.88s 30%
10 257.41s 14.60s 28% 135.08s 10.03s 35%
160 2 550.44s 26.06s 24% 613.45s 15.01s 26%
4 864.51s 31.71s 24% 610.71s 14.62s 26%
10 984.85s 20.86s 22% 601.50s 22.42s 30%
Table 2: Running time comparison of the AAL data set. The data in each row is the average of the corresponding data from five runs; nnz is the percentage of the average number of non-zeros in the final solution. We use quad to denote quadprog.
IRIS
DKSG ZHLG
n d quad ours nnz quad ours nnz
70 2 5.61s 2.00s 43% 2.76s 1.21s 47%
3 5.66s 2.09s 44% 2.61s 1.15s 48%
4 4.88s 2.10s 45% 2.23s 1.18s 50%
100 2 43.87s 6.55s 33% 17.63s 2.13s 34%
3 37.56s 5.92s 33% 15.92s 2.64s 36%
4 42.69s 5.95s 33% 15.24s 2.95s 37%
130 2 241.59s 13.29s 26% 148.34s 5.62s 29%
3 199.22s 12.20s 27% 150.54s 5.34s 28%
4 256.28s 13.96s 27% 151.87s 6.20s 30%
150 2 504.21s 20.15s 23% 414.79s 8.38s 25%
3 510.55s 19.81s 24% 398.06s 7.80s 26%
4 565.89s 21.04s 24% 390.47s 7.36s 26%
Table 3: Running time comparison of the IRIS data set. The data in each row is the average of the corresponding data from five runs; nnz is the percentage of the average number of non-zeros in the final solution. We use quad to denote quadprog.
HCV
DKSG ZHLG
n d quad ours nnz quad ours nnz
70 2 3.68s 1.16s 43% 2.65s 0.84s 45%
4 5.22s 1.71s 44% 2.13s 0.88s 45%
10 4.55s 1.35s 43% 1.88s 2.06s 62%
100 2 13.02s 3.50s 32% 15.20s 2.00s 32%
4 39.50s 4.84s 33% 13.03s 2.35s 32%
10 29.79s 3.65s 31% 12.78s 8.32s 48%
130 2 108.10s 6.90s 28% 141.20s 3.07s 24%
4 180.42s 12.05s 27% 138.05s 3.78s 24%
10 184.51s 8.16s 24% 133.91s 21.17s 38%
160 2 551.43s 17.32s 24% 613.29s 5.22s 20%
4 868.60s 23.93s 22% 613.04s 8.10s 21%
10 633.97s 14.48s 20% 588.64s 45.33s 32%
Table 4: Running time comparison of the HCV data set. The data in each row is the average of the corresponding data from five runs; nnz is the percentage of the average number of non-zeros in the final solution. We use quad to denote quadprog.

Tables 1, 2, 3, and 4 show some average running times of SolveNNQ and quadprog. When $n\geq 130$ and $n\geq 13d$, SolveNNQ is often 10 times or more faster. For the same dimension $d$, the column nnz shows that the percentage of non-zero solution coordinates decreases as $n$ increases, that is, the sparsity increases as $n$ increases. Correspondingly, the speedup of SolveNNQ over a single call of quadprog also increases, as reflected by the running times.

Figure 2: We plot the natural logarithm of the average speedup of SolveNNQ over a single call of quadprog for DKSG and ZHLG. The horizontal axis is $\ln n$. A distinct color is used for data points of the same dimension $d$.

Figure 2 shows four plots, one for each of the four data sets, that show how the natural logarithm of the average speedup of SolveNNQ over a single call of quadprog increases as $\ln n$ increases. In each plot, data points for distinct dimensions are shown in distinct colors, and the legends give the color coding of the dimensions. For both DKSG and ZHLG and for every fixed $d$, the average speedup as a function of $n$ hovers around $\Theta(n^3)$. The average speedup data points for HCV show a greater diversity, but the smallest speedup is still roughly proportional to $n^{2.7}$. As a result, SolveNNQ gives a much superior performance to a single call of quadprog even for moderate values of $n$.

To verify our assumption that the distance between $\mathtt{x}_r$ and the minimum $\mathtt{y}_r$ in direction $\mathtt{n}_r$ is at least $\frac{1}{\lambda}\|\mathtt{x}_r-\mathtt{x}_*\|$, we used box-and-whisker plots for visualization. A box-and-whisker plot displays the distribution of a data set: it shows the middle 50% of the data as a box, with the median shown as a horizontal segment inside the box; whiskers show the spread of the remaining data. We exclude two outliers of the ratios for the DKSG problem. These outliers occur when $r=1$, so the ratios $\frac{\|\mathtt{x}_1-\mathtt{x}_2\|}{\|\mathtt{x}_1-\mathtt{x}_*\|}$ describe the behavior between the first and second iterations of the algorithm. These outliers can be ignored, as the initialization does not induce any part of the active set from the gradient; the active set in the first iteration is arbitrary and distinct from the rest of the algorithm. As a result, any abnormality involving the first iteration does not affect the convergence of the algorithm, since it is just one of many iterations. As shown in Figures 3(a) and (b), the ratio $\frac{\|\mathtt{x}_r-\mathtt{x}_{r+1}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}$ is greater than $\frac{1}{100}$. Thus, our algorithm takes a descent direction that meets our assumption that $\frac{\|\mathtt{x}_r-\mathtt{y}\|}{\|\mathtt{x}_r-\mathtt{x}_*\|}$ is at least the not-so-small constant $\frac{1}{100}$. The observed convergence rate is even better. As illustrated in Figure 4, the gap $f(\mathtt{x}_r)-f(\mathtt{x}_*)$ is reduced by a factor in the range $[1.5,10]$ in every iteration on average; in fact, most runs converge more rapidly than a geometric convergence.

(a) (b)
Figure 3: (a) Plot of all ratios 𝚡r+1𝚡r𝚡𝚡r\frac{\|\mathtt{x}_{r+1}-\mathtt{x}_{r}\|}{\|\mathtt{x}_{*}-\mathtt{x}_{r}\|} for DKSG. The ratios 𝚡r+1𝚡r𝚡𝚡r\frac{\|\mathtt{x}_{r+1}-\mathtt{x}_{r}\|}{\|\mathtt{x}_{*}-\mathtt{x}_{r}\|} are grouped into disjoint ranges, and these disjoint ranges are listed along the x-axis. The (median, cardinality) tuple of each group is shown above each box. (b) Plot of all ratios 𝚡r+1𝚡r𝚡𝚡r\frac{\|\mathtt{x}_{r+1}-\mathtt{x}_{r}\|}{\|\mathtt{x}_{*}-\mathtt{x}_{r}\|} for ZHLG.
Figure 4: Plot of the median \frac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_*)}{f(\mathtt{x}_{r})-f(\mathtt{x}_*)} against r. Each data point is the median of ten runs.
nph sts hst
Figure 5: Three space images.

3.2 Image deblurring

We experimented with three space images nph, sts, and hst from [39, 40]. We down-sample and sparsify by setting nearly black pixels (with intensities below a small threshold such as 10 or 20) to black. Figure 5 shows the three resulting space images: nph is 149\times 103, and both sts and hst are 128\times 128. These images are sparse (i.e., they have many black pixels), so they are well suited to SolveNNQ.

For space images, it is most relevant to use blurring due to atmospheric turbulence. In the following, we describe a point spread function in the literature that produces such blurring effects [41]. Let II denote a two-dimensional r0×c0r_{0}\times c_{0} pixel array (an image). We call the pixel at the aa-th row and bb-th column the (a,b)(a,b)-pixel. Let Ia,bI_{a,b}^{*} denote the intensity of the (a,b)(a,b)-pixel in the original image. Let Ia,bI^{\prime}_{a,b} denote its intensity after blurring.

The atmospheric turbulence point spread function determines how to distribute the intensity Ia,bI^{*}_{a,b} to other pixels. At the same time, the (a,b)(a,b)-pixel receives contributions from other pixels. The sum of these contributions and the remaining intensity at the (a,b)(a,b)-pixel gives Ia,bI^{\prime}_{a,b}. The weight factor with respect to an (a,b)(a,b)-pixel and a (c,d)(c,d)-pixel is Kexp((ac)2+(bd)22σ2)K\cdot\mathrm{exp}\bigl{(}-\frac{(a-c)^{2}+(b-d)^{2}}{2\sigma^{2}}\bigr{)} for some positive constant KK. It means that the contribution of the (a,b)(a,b)-pixel to Ic,dI^{\prime}_{c,d} is Kexp((ac)2+(bd)22σ2)Ia,bK\cdot\mathrm{exp}\bigl{(}-\frac{(a-c)^{2}+(b-d)^{2}}{2\sigma^{2}}\bigr{)}\cdot I^{*}_{a,b}. We assume the point spread function is spatially invariant which makes the relation symmetric; therefore, the contribution of the (c,d)(c,d)-pixel to Ia,bI^{\prime}_{a,b} is Kexp((ac)2+(bd)22σ2)Ic,dK\cdot\mathrm{exp}\bigl{(}-\frac{(a-c)^{2}+(b-d)^{2}}{2\sigma^{2}}\bigr{)}\cdot I^{*}_{c,d}.

The atmospheric turbulence point spread function needs to be truncated as its support is infinite. Let BB be a (2σ+1)×(2σ+1)(2\sigma+1)\times(2\sigma+1) square centered at the (a,b)(a,b)-pixel. The truncation is done so that the (a,b)(a,b)-pixel has no contribution outside BB, and no pixel outside BB contributes to the (a,b)(a,b)-pixel. For any s,t[σ,σ]s,t\in[-\sigma,\sigma], the weight of the entry of BB at the (a+s)(a+s)-th row and the (b+t)(b+t)-th column is Kexp(s2+t22σ2)K\cdot\mathrm{exp}\bigl{(}-\frac{s^{2}+t^{2}}{2\sigma^{2}}\bigr{)}. We divide every entry of BB by the total weight of all entries in BB because the contributions of the (a,b)(a,b)-pixel to other pixels should sum to 1. The coefficient KK no longer appears in the normalized weights in BB, so we do not need to worry about how to set KK.
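As an illustration, here is a minimal sketch of the normalized, truncated kernel B described above; the helper name is ours, the truncation radius is taken to be the ceiling of \sigma when \sigma is fractional (an assumption on our part), and the constant K cancels in the normalization.

```python
import numpy as np

def turbulence_kernel(sigma):
    """Truncated and normalized atmospheric-turbulence (Gaussian) kernel."""
    r = int(np.ceil(sigma))                      # truncation radius (assumed ceil(sigma))
    s, t = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    B = np.exp(-(s**2 + t**2) / (2.0 * sigma**2))
    return B / B.sum()                           # normalized weights sum to 1
```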

How do we produce the blurring effects as a multiplication of a matrix and a vector? We first transpose the rows of II to columns. Then, we stack these columns in the row order in II to form a long vector 𝚡\mathtt{x}. That is, the first row of II becomes the top c0c_{0} entries of 𝚡\mathtt{x} and so on. Let Ma,bM_{a,b} be an array with the same row and column dimensions as II. Assume for now that the square BB centered at the (a,b)(a,b)-entry of Ma,bM_{a,b} is fully contained in Ma,bM_{a,b}. The entries of Ma,bM_{a,b} are zero if they are outside BB; otherwise, the entries of Ma,bM_{a,b} are equal to the normalized weights at the corresponding entries in BB. Then, we concatenate the rows of Ma,bM_{a,b} in the row order to form a long row vector; this long row vector is the ((a1)c0+b)((a-1)c_{0}+b)-th row of a matrix 𝙰\mathtt{A}. Repeating this for all Ma,bM_{a,b} defines the matrix 𝙰\mathtt{A}. The product of the ((a1)c0+b)((a-1)c_{0}+b)-th row of 𝙰\mathtt{A} and 𝚡\mathtt{x} is a scalar, which is exactly the sum of the contributions of all pixels at the (a,b)(a,b)-pixel, i.e., the intensity Ia,bI^{\prime}_{a,b}. Consider the case that some entries of BB lie outside Ma,bM_{a,b}. If an entry of BB is outside Ma,bM_{a,b}, we find the vertically/horizontally nearest boundary entry of Ma,bM_{a,b} and add to it the normalized weight at that “outside” entry of BB. This ensures that the total weight within Ma,bM_{a,b} still sums to 1, thus making sure that no pixel intensity is lost. Afterwards, we linearize Ma,bM_{a,b} to form a row of 𝙰\mathtt{A} as before. In all, 𝙰𝚡\mathtt{Ax} is the blurred image by the atmospheric turbulence point spread function.
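The following is a minimal sketch of this construction (0-indexed, using scipy sparse matrices; it is not the implementation used in our experiments). Row a*c0 + b of \mathtt{A} is the row-major flattening of M_{a,b}, and kernel weights that fall outside the image are folded onto the nearest boundary pixel so that every row still sums to 1.

```python
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix

def blur_matrix(r0, c0, kernel):
    """Blur matrix A for an r0 x c0 image and a normalized (2k+1) x (2k+1) kernel."""
    k = kernel.shape[0] // 2
    A = lil_matrix((r0 * c0, r0 * c0))
    for a in range(r0):
        for b in range(c0):
            M = np.zeros((r0, c0))
            for s in range(-k, k + 1):
                for t in range(-k, k + 1):
                    # fold weights outside the image onto the nearest boundary pixel
                    c = min(max(a + s, 0), r0 - 1)
                    d = min(max(b + t, 0), c0 - 1)
                    M[c, d] += kernel[s + k, t + k]
            A[a * c0 + b, :] = M.reshape(-1)   # row-major flattening of M_{a,b}
    return csr_matrix(A)

# x is the row-major flattening of the image I; the blurred image is A @ x, e.g.
#   b = blur_matrix(r0, c0, turbulence_kernel(sigma)) @ I.reshape(-1)
# where turbulence_kernel is the kernel sketch given earlier.
```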

Working backward, we solve an NNLS problem to deblur the image: find the non-negative 𝚡\mathtt{x} that minimizes 𝙰𝚡𝚋2\|\mathtt{Ax}-\mathtt{b}\|^{2}, where 𝚋\mathtt{b} is the vector obtained by transposing the rows of the blurred image into columns, followed by stacking these columns in the row order in the blurred image.
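A minimal sketch of this deblurring step with scipy's non-negative least squares solver (the function name is ours; in our experiments we use the solvers compared below, not scipy):

```python
from scipy.optimize import nnls

def deblur_nnls(A, b, shape):
    """Solve min ||Ax - b||^2 subject to x >= 0 and reshape the solution to an image.
    A is a dense ndarray blur matrix; b is the row-major flattening of the blurred image."""
    x, _ = nnls(A, b)
    return x.reshape(shape)
```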

In running SolveNNQ on the space images, we compute the gradient vector at the zero vector and select the 20τ20\tau most negative coordinates. (Recall that τ=4ln2ν\tau=4\ln^{2}\nu and ν\nu is the total number of pixels in an image in this case). We call quadprog with these 20τ20\tau most negative coordinates as the only free variables and obtain the initial solution 𝚡1\mathtt{x}_{1}. Afterwards, SolveNNQ iterates as before.
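A sketch of this initialization for the objective \|\mathtt{Ax}-\mathtt{b}\|^{2}, whose gradient at the zero vector is -2\mathtt{A}^{t}\mathtt{b}; scipy's nnls stands in for quadprog on the restricted subproblem, \mathtt{A} is assumed to be a dense ndarray, and the helper name is ours.

```python
import numpy as np
from scipy.optimize import nnls

def initial_solution(A, b):
    """Free the 20*tau most negative gradient coordinates at 0 and solve the
    restricted subproblem; all remaining coordinates stay in the active set."""
    nu = A.shape[1]                               # number of variables (pixels)
    tau = 4.0 * np.log(nu) ** 2
    grad0 = -2.0 * (A.T @ b)                      # gradient of ||Ax - b||^2 at x = 0
    free = np.argsort(grad0)[: int(20 * tau)]     # indices of the most negative coordinates
    x_free, _ = nnls(A[:, free], b)
    x1 = np.zeros(nu)
    x1[free] = x_free
    return x1, free
```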

hst, original hst, σ=2.5\sigma=2.5 SolveNNQ, 0.002, 21.6s FISTABT, 0.1, 11.5s
SBB, 0.08, 600s FNNLS, 2×1072\times 10^{-7}, 694s lsqnonneg, 7×10127\times 10^{-12}, 1610s quadprog, 0.002, 72.1s
Figure 6: The upper left two images are the original and the blurred image. Under the output of each software, the first number is the relative mean square error of the intensities of the corresponding pixels between the output and the original, and the second number is the running time.
SolveNNQ FISTABT SBB
Data σ\sigma rel. err time rel. err time rel. err time
nph 1 10510^{-5} 3s 0.009 2s 3×1053\times 10^{-5} 171s
1.5 4×1044\times 10^{-4} 5.6s 0.04 4.3s 0.009 600s
2 7×1047\times 10^{-4} 8.2s 0.06 6.8s 0.02 600s
sts 1 2×1052\times 10^{-5} 2.1s 0.01 2.3s 4×1054\times 10^{-5} 206s
1.5 4×1044\times 10^{-4} 4.9s 0.06 4.7s 0.01 600s
2 5×1045\times 10^{-4} 9.7s 0.1 7.1s 0.04 600s
hst 1 2×1052\times 10^{-5} 2.5s 0.02 2.3s 4×1054\times 10^{-5} 275s
1.5 0.001 6.5s 0.06 4.7s 0.02 600s
2 0.001 11.1s 0.1 7.4s 0.05 600s

FNNLS lsqnonneg quadprog
Data σ\sigma rel. err time rel. err time rel. err time
nph 1 7×10117\times 10^{-11} 15.4s 101310^{-13} 25s 3×1053\times 10^{-5} 8.8s
1.5 3×1083\times 10^{-8} 60s 101210^{-12} 105s 5×1045\times 10^{-4} 17.9s
2 6×1086\times 10^{-8} 94s 2×10122\times 10^{-12} 181s 8×1048\times 10^{-4} 33.3s
sts 1 8×10118\times 10^{-11} 14s 9×10149\times 10^{-14} 22.7s 3×1053\times 10^{-5} 9.4s
1.5 10810^{-8} 48.9s 2×10122\times 10^{-12} 87s 4×1044\times 10^{-4} 20.4s
2 3×1083\times 10^{-8} 80s 2×10122\times 10^{-12} 156s 0.001 38.6s
hst 1 101010^{-10} 79.4s 2×10132\times 10^{-13} 182s 3×1053\times 10^{-5} 9.4s
1.5 10710^{-7} 272s 3×10123\times 10^{-12} 651s 0.001 20.9s
2 3×1073\times 10^{-7} 459s 4×10124\times 10^{-12} 1060s 0.003 47.2s
Table 5: Experimental results for some space images. In the column for each software, the number on the left is the relative mean square error, and the number on the right is the running time.
(a) (b) (c) (d)
Figure 7: Blurring and deblurring nph: (a) σ=1\sigma=1 and relative mean square error = 10510^{-5}, (b) σ=1.5\sigma=1.5 and relative mean square error = 4×1044\times 10^{-4}, (c) σ=2\sigma=2 and relative mean square error = 7×1047\times 10^{-4}, (d) σ=2.5\sigma=2.5 and relative mean square error = 4×1044\times 10^{-4}.
(a) (b) (c) (d)
Figure 8: Blurring and deblurring sts: (a) σ=1\sigma=1 and relative mean square error = 2×1052\times 10^{-5}, (b) σ=1.5\sigma=1.5 and relative mean square error = 4×1044\times 10^{-4}, (c) σ=2\sigma=2 and relative mean square error = 5×1045\times 10^{-4}, (d) σ=2.5\sigma=2.5 and relative mean square error = 0.002.
(a) (b) (c) (d)
Figure 9: Blurring and deblurring hst: (a) σ=1\sigma=1 and relative mean square error = 2×1052\times 10^{-5}, (b) σ=1.5\sigma=1.5 and relative mean square error = 0.001, (c) σ=2\sigma=2 and relative mean square error = 0.001, (d) σ=2.5\sigma=2.5 and relative mean square error = 0.002.

We compared SolveNNQ with several software that can solve NNLS, including FISTA with backtracking (FISTABT) [35, 36], SBB [8], FNNLS [9], lsqnonneg of MATLAB, and a single call of quadprog. Table 5 shows the statistics of the relative mean square errors and the running times. Figure 6 shows the deblurred images of hst produced by the different software. When the relative mean square error of an image is well below 0.01, the image is visually indistinguishable from the original. In our experiments, SolveNNQ is efficient and produces outputs with small relative errors. Figures 7-9 show some vertical pairs of blurred images and the deblurred images recovered by SolveNNQ. We emphasize that we do not claim a complete solution for image deblurring, as there are many issues that we do not address; we only seek to demonstrate the potential of the iterative scheme.

SolveNNQ FISTABT SBB
RR rel. err time rel. err time rel. err time
walk 2 6×1066\times 10^{-6} 9.6s 2×10112\times 10^{-11} 350s 10710^{-7} 5.4s
3 6×1066\times 10^{-6} 19.1s 2×10112\times 10^{-11} 1340s 2×1072\times 10^{-7} 16.4s
4 10510^{-5} 33.1s 3×10113\times 10^{-11} 2630s 2×1072\times 10^{-7} 43.5s
heli 2 3×1063\times 10^{-6} 1.9s 2×10102\times 10^{-10} 800s 3×1073\times 10^{-7} 4.9s
3 2×1062\times 10^{-6} 4.6s 2×10102\times 10^{-10} 800s 2×1072\times 10^{-7} 8.1s
4 3×1063\times 10^{-6} 8.2s 9×1089\times 10^{-8} 800s 5×1075\times 10^{-7} 36.8s
lion 2 10610^{-6} 0.9s 6×10116\times 10^{-11} 104s 6×1086\times 10^{-8} 0.9s
3 4×1064\times 10^{-6} 1.6s 6×10116\times 10^{-11} 185s 6×1086\times 10^{-8} 1s
4 9×1069\times 10^{-6} 3.2s 5×10115\times 10^{-11} 800s 2×1072\times 10^{-7} 3.7s

FNNLS lsqnonneg quadprog
RR rel. err time rel. err time rel. err time
walk 2 101310^{-13} 144s 4×10154\times 10^{-15} 220s 6×1066\times 10^{-6} 26.9s
3 2×10132\times 10^{-13} 270s 7×10157\times 10^{-15} 447s 2×1052\times 10^{-5} 76.3s
4 6×10136\times 10^{-13} 414s 101410^{-14} 748s 10510^{-5} 163s
heli 2 2×10132\times 10^{-13} 57.2s 5×10155\times 10^{-15} 93.8s 4×1064\times 10^{-6} 12.3s
3 5×10135\times 10^{-13} 103s 101410^{-14} 187s 6×1066\times 10^{-6} 37.2s
4 101210^{-12} 157s 2×10142\times 10^{-14} 289s 8×1068\times 10^{-6} 54.7s
lion 2 5×10145\times 10^{-14} 4.8s 3×10153\times 10^{-15} 6.8s 4×1064\times 10^{-6} 5.4s
3 101310^{-13} 8.9s 6×10156\times 10^{-15} 13s 9×1069\times 10^{-6} 11.7s
4 4×10134\times 10^{-13} 14.1s 101410^{-14} 21.2s 7×1067\times 10^{-6} 20.5s
Table 6: Results for some thermal images. In the column for each software, the number on the left is the relative mean square error, and the number on the right is the running time.
walk, thermal walk, extracted walk, R=3R=3 SolveNNQ, rel. err = 6×1066\times 10^{-6}
heli, thermal heli, extracted heli, R=3R=3 SolveNNQ, rel. err = 2×1062\times 10^{-6}
lion, thermal lion, extracted lion, R=3R=3 SolveNNQ, rel. err = 7×1067\times 10^{-6}
Figure 10: Thermal images.
(a) Thermal, original. (b) Red, thresholded. (c) Green, thresholded.
(d) Blue, thresholded. (e) Recombined.
Figure 11: Thermal image of a dog, the three thresholded color panes, and their recombinations.
SolveNNQ FISTABT SBB
R red green blue total red green blue total red green blue total
2 0.5s 0.6s 2.6s 3.7s 362s 48.1s 160s 570.1s 2.1s 0.7s 1.5s 4.3s
3 1s 1.3s 5.1s 7.4s 663s 176s 301s 1140s 2.4s 1.3s 2.8s 6.5s
4 2s 1.9s 7.7s 11.6s 800s 583s 800s 2183s 11.5s 5.5s 9.8s 26.8s

FNNLS lsqnonneg quadprog
R red green blue total red green blue total red green blue total
2 4.7s 4.8s 30.6s 40.1s 6.5s 6.6s 42.5s 55.6s 8.7s 9.3s 10.6s 28.6s
3 8.4s 8.1s 54.1s 70.6s 12.2s 11.3s 79.2s 102.7s 23.1s 26.6s 30.6s 80.3s
4 14s 13.2s 78.1s 105.3s 19.4s 18.6s 130s 168s 38.6s 41.7s 42.3s 122.6s
Table 7: Running times on the thresholded, blurred color panes of the dog image.
Red, R=2R=2 Green, R=2R=2 Blue, R=2R=2 Recombined, R=2R=2 SolveNNQ
Red, R=3R=3 Green, R=3R=3 Blue, R=3R=3 Recombined, R=3R=3 SolveNNQ
Red, R=4R=4 Green, R=4R=4 Blue, R=4R=4 Recombined, R=4R=4 SolveNNQ
Figure 12: Blurred color panes, their recombinations, and the recombinations of the deblurred output of SolveNNQ.

We also experimented with some thermal images walk, heli, lion, and dog [42, 43, 44]; such thermal images are typically encountered in surveillance. In each case, we extract the dominant color, down-sample, and apply thresholding to obtain a gray-scale image that omits most of the irrelevant parts. Afterwards, we apply the uniform out-of-focus point spread function for blurring [7]; the atmospheric turbulence point spread function is not relevant in these cases. The out-of-focus point spread function has a positive parameter R: given an (a,b)-pixel and a (c,d)-pixel, if (a-c)^{2}+(b-d)^{2}\leq R^{2}, the weight factor for them is 1/(\pi R^{2}); otherwise, the weight factor is zero. The larger R is, the blurrier the image. The construction of the matrix \mathtt{A} works as in the case of the atmospheric turbulence point spread function.
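A minimal sketch of this out-of-focus kernel (the helper name is ours; as with the turbulence kernel, we renormalize the truncated weights so that they sum to 1, which is our assumption):

```python
import numpy as np

def out_of_focus_kernel(R):
    """Uniform out-of-focus kernel: weight 1/(pi*R^2) within distance R, zero outside."""
    r = int(np.ceil(R))
    s, t = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    B = np.where(s**2 + t**2 <= R**2, 1.0 / (np.pi * R**2), 0.0)
    return B / B.sum()
```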

We use a certain number of the most negative coordinates of the gradient at the zero vector, and then we call quadprog with these coordinates as the only free variables to obtain the initial solution 𝚡1\mathtt{x}_{1}. Afterwards, SolveNNQ iterates as described previously. Figure 10 shows the thermal images, the extracted images, the blurred images, and the deblurred images produced by SolveNNQ.

We did another experiment on deblurring the red, green, and blue color panes of a dog thermal image separately, followed by recombining the deblurred color panes. Figure 11(a) shows the original thermal image of a dog. Figures 11(b)–(d) show the gray-scale versions of the red, green, and blue panes, which have been thresholded for sparsification. Figure 11(e) shows the image obtained by combining Figures 11(b)–(d); the result is in essence an extraction of the dog in the foreground. We blur the three color panes; Figure 12 shows these blurred color panes and their recombinations. We then deblur the blurred color panes and recombine the deblurred results. The rightmost column in Figure 12 shows the recombinations of the deblurred output of SolveNNQ. As in the case of the images walk, heli, and lion, the outputs of SolveNNQ, FISTABT, SBB, FNNLS, lsqnonneg, and a single call of quadprog are visually indistinguishable from each other. Table 7 shows that the total running times of SolveNNQ over the three color panes are very competitive when compared with the other software.

4 Analysis for NNLS and ZHLG

In this section, we prove that there exists a unit descent direction 𝚗r\mathtt{n}_{r} in the (r1)(r-1)-th iteration of SolveNNQ such that f(𝚡r),𝚗rf(𝚡r),𝚗12νlnν\frac{\langle\nabla f(\mathtt{x}_{r}),\mathtt{n}_{r}\rangle}{\langle\nabla f(\mathtt{x}_{r}),\mathtt{n}_{*}\rangle}\geq\frac{1}{\sqrt{2\nu\ln\nu}} for NNLS and ZHLG, where 𝚗=𝚡𝚡r𝚡𝚡r\mathtt{n}_{*}=\frac{\mathtt{x}_{*}-\mathtt{x}_{r}}{\|\mathtt{x}_{*}-\mathtt{x}_{r}\|}, and ν=n\nu=n for NNLS and ν=n(n1)/2\nu=n(n-1)/2 for ZHLG. Combined with the assumption that the distance between 𝚡r\mathtt{x}_{r} and the minimum in direction 𝚗r\mathtt{n}_{r} is at least 1λ𝚡r𝚡\frac{1}{\lambda}\|\mathtt{x}_{r}-\mathtt{x}_{*}\|, we will show that the gap f(𝚡r)f(𝚡)f(\mathtt{x}_{r})-f(\mathtt{x}_{*}) is decreased by a factor ee after every O(λνlnν)O(\lambda\sqrt{\nu\ln\nu}) iterations.

Let LL be an affine subspace in ν\mathbb{R}^{\nu}. For every 𝚡ν\mathtt{x}\in\mathbb{R}^{\nu}, define 𝚡L\mathtt{x}\!\downarrow\!L to be the projection of 𝚡\mathtt{x} to the linear subspace parallel to LL—the translate of LL that contains the origin. Figure 13 gives an illustration. Note that 𝚡L\mathtt{x}\!\downarrow\!L may not belong to LL; it does if LL is a linear subspace. The projection of 𝚡\mathtt{x} to LL can be implemented by multiplying 𝚡\mathtt{x} with an appropriate ν×ν\nu\times\nu matrix 𝙼\mathtt{M}, i.e., 𝚡L=𝙼𝚡\mathtt{x}\!\downarrow\!L=\mathtt{Mx}. Thus, (𝚡)L=𝙼(𝚡)=𝙼𝚡=(𝚡L)(-\mathtt{x})\!\downarrow\!L=\mathtt{M}\cdot(-\mathtt{x})=-\mathtt{Mx}=-(\mathtt{x}\!\downarrow\!L), so we will write 𝚡L-\mathtt{x}\!\downarrow\!L without any bracket.
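A small illustration of this projection (a sketch only; the spanning vectors of the linear subspace parallel to L are assumed to be linearly independent, and the function name is ours):

```python
import numpy as np

def project_parallel(x, spanning_vectors):
    """Project x onto the linear subspace spanned by the columns of `spanning_vectors`,
    i.e., the subspace parallel to L; the projection matrix is M = Q Q^t."""
    Q, _ = np.linalg.qr(spanning_vectors)   # orthonormal basis of the subspace
    return (Q @ Q.T) @ x

# Linearity of the matrix M gives (-x) projected = -(x projected), as noted above.
```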

Since there are only the non-negativity constraints for NNLS and ZHLG, we do not have the Lagrange multipliers 𝚞\mathtt{u} and 𝚞~\tilde{\mathtt{u}}. So (𝚟r,𝚡r)(\mathtt{v}_{r},\mathtt{x}_{r}) is the optimal solution of max𝚟min𝚡g(𝚟,𝚡)\max_{\mathtt{v}}\min_{\mathtt{x}}g(\mathtt{v},\mathtt{x}) that SolveNNQ solves in the (r1)(r-1)-th iteration under the constraints that (𝚟)i0(\mathtt{v})_{i}\geq 0 for all iSri\not\in S_{r}.


Figure 13: In 2\mathbb{R}^{2}, a point 𝚡\mathtt{x}, a line LL, and the projection 𝚡L\mathtt{x}\!\downarrow\!L.

Recall that, in the pseudocode of SolveNNQ, ErE_{r} is a sequence of indices iSri\in S_{r} such that (𝚟r)i<0(\mathtt{v}_{r})_{i}<0, and ErE_{r} is sorted in non-decreasing order of (𝚟r)i(\mathtt{v}_{r})_{i}. Define Jr=𝚜𝚙𝚊𝚗({𝚎i:iEr})J_{r}=\mathtt{span}(\{\mathtt{e}_{i}:i\in E_{r}\}). The following result follows from the definitions of ErE_{r} and JrJ_{r}.

Lemma 3.

For all i[ν]i\in[\nu], if iEri\in E_{r}, then (𝚟rJr)i=(𝚟r)i(\mathtt{v}_{r}\!\downarrow\!J_{r})_{i}=(\mathtt{v}_{r})_{i}; otherwise, (𝚟rJr)i=0(\mathtt{v}_{r}\!\downarrow\!J_{r})_{i}=0. Moreover, 𝚟rJr-\mathtt{v}_{r}\!\downarrow\!J_{r} is a conical combination of {𝚎i:iEr}\{\mathtt{e}_{i}:i\in E_{r}\}.

We show that there exists a subset of indices iEri\in E_{r} such that 𝚟rJr,𝚎i\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{e}_{i}\rangle is bounded away from zero.

Lemma 4.

For any α(0,1]\alpha\in(0,1], there exists jj in ErE_{r} with rank at most α|Er|\alpha|E_{r}| such that for every iEri\in E_{r}, if i=ji=j or ii precedes jj in ErE_{r}, then 𝚟rJr,𝚎i2𝚟rJr2α/(jlnν)\langle\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{e}_{i}\rangle^{2}\geq\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|^{2}\cdot\alpha/\bigl{(}j\ln\nu\bigr{)}.

Proof.

Consider a histogram H1H_{1} of α/(ilnν)\alpha/(i\ln\nu) against iEri\in E_{r}. The total length of the vertical bars in H1H_{1} is equal to iErα/(ilnν)(α/lnν)i=1ν1/iα\sum_{i\in E_{r}}\alpha/(i\ln\nu)\,\,\leq\,\,(\alpha/\ln\nu)\cdot\sum_{i=1}^{\nu}1/i\,\,\leq\,\,\alpha. Consider another histogram H2H_{2} of 𝚟rJr,𝚎i2/𝚟rJr2\langle\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{e}_{i}\rangle^{2}/\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|^{2} against iEri\in E_{r}. The total length of the vertical bars in H2H_{2} is equal to iEr𝚟rJr,𝚎i2/𝚟rJr2\sum_{i\in E_{r}}\langle\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{e}_{i}\rangle^{2}/\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|^{2}, which is equal to 1 because 𝚟rJrJr\mathtt{v}_{r}\!\downarrow\!J_{r}\in J_{r}. Recall that the indices of ErE_{r} are sorted in non-decreasing order of (𝚟r)i({\mathtt{v}}_{r})_{i}. As (𝚟rJr)i=(𝚟r)i<0(\mathtt{v}_{r}\!\downarrow\!J_{r})_{i}=(\mathtt{v}_{r})_{i}<0 for all iEri\in E_{r}, the ordering in ErE_{r} is the same as the non-increasing order of 𝚟rJr,𝚎i2/𝚟rJr2=(𝚟rJr)i2/𝚟rJr2\langle\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{e}_{i}\rangle^{2}/\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|^{2}=(\mathtt{v}_{r}\!\downarrow\!J_{r})_{i}^{2}/\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|^{2}. Therefore, the total length of the vertical bars in H2H_{2} for the first α|Er|\alpha|E_{r}| indices is at least α\alpha. It implies that when we scan the indices in ErE_{r} from left to right, we must encounter an index jj among the first α|Er|\alpha|E_{r}| indices such that the vertical bar in H2H_{2} at jj is not shorter than the vertical bar in H1H_{1} at jj, i.e., 𝚟rJr,𝚎i2/𝚟rJr2α/(jlnν)\langle\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{e}_{i}\rangle^{2}/\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|^{2}\geq\alpha/(j\ln\nu) for every ii that precedes jj in ErE_{r}. This index jj satisfies the lemma. ∎

Next, we boost the angle bound implied by Lemma 4 by showing that 𝚟rJr-\mathtt{v}_{r}\!\downarrow\!J_{r} makes a much smaller angle with some conical combination of {𝚎i:iEr}\{\mathtt{e}_{i}:i\in E_{r}\}.


Figure 14: The vector 𝚗1,2\mathtt{n}_{1,2} bisects the right angle (𝚗1,𝚗2)\angle(\mathtt{n}_{1},\mathtt{n}_{2}). The angle η\eta is at least π/2\pi/2.
Lemma 5.

Take any c1/2c\leq 1/\sqrt{2}. Let 𝚣\mathtt{z} be a vector in D\mathbb{R}^{D} for some D2D\geq 2. Suppose that there is a set VV of unit vectors in D\mathbb{R}^{D} such that the vectors in VV are mutually orthogonal, and for every 𝚗V\mathtt{n}\in V, cos(𝚗,𝚣)c|V|1/2\cos\angle(\mathtt{n},\mathtt{z})\geq c|V|^{-1/2}. There exists a conical combination 𝚢\mathtt{y} of the vectors in VV such that cos(𝚢,𝚣)c/2\cos\angle(\mathtt{y},\mathtt{z})\geq c/\sqrt{2}.

Proof.

Let θ=arccos(c|V|1/2)\theta=\arccos\left(c|V|^{-1/2}\right). If θπ/3\theta\leq\pi/3, we can pick any vector 𝚗V\mathtt{n}\in V as 𝚢\mathtt{y} because π/3arccos(c/2)\pi/3\leq\arccos(c/\sqrt{2}) for c1/2c\leq 1/\sqrt{2}. Suppose not. Let WW be a maximal subset of VV whose size is a power of 2. Arbitrarily label the vectors in WW as 𝚗1,𝚗2,\mathtt{n}_{1},\mathtt{n}_{2},\ldots. Consider the unit vector 𝚗1,2=12𝚗1+12𝚗2\mathtt{n}_{1,2}=\frac{1}{\sqrt{2}}\mathtt{n}_{1}+\frac{1}{\sqrt{2}}\mathtt{n}_{2}. Let ϕ=(𝚗1,2,𝚣)\phi=\angle(\mathtt{n}_{1,2},\mathtt{z}). Refer to Figure 14. By assumption, 𝚗1𝚗2\mathtt{n}_{1}\perp\mathtt{n}_{2}. Let η\eta be the non-acute angle between the plane span({𝚗1,𝚗2})\mathrm{span}(\{\mathtt{n}_{1},\mathtt{n}_{2}\}) and the plane span({𝚗1,2,𝚣})\mathrm{span}(\{\mathtt{n}_{1,2},\mathtt{z}\}). By the spherical law of cosines, cosθcos(𝚗2,𝚣)=cosϕcos(π/4)+sinϕsin(π/4)cosη\cos\theta\leq\cos\angle(\mathtt{n}_{2},\mathtt{z})=\cos\phi\cos(\pi/4)+\sin\phi\sin(\pi/4)\cos\eta. Note that cosη0\cos\eta\leq 0 as ηπ/2\eta\geq\pi/2. It implies that cosϕsec(π/4)cosθ=2cosθ\cos\phi\geq\sec(\pi/4)\cos\theta=\sqrt{2}\cos\theta. The same analysis holds between 𝚣\mathtt{z} and the unit vector 𝚗3,4=12𝚗3+12𝚗4\mathtt{n}_{3,4}=\frac{1}{\sqrt{2}}\mathtt{n}_{3}+\frac{1}{\sqrt{2}}\mathtt{n}_{4}, and so on. In the end, we obtain |W|/2|W|/2 vectors 𝚗2i1,2i\mathtt{n}_{2i-1,2i} for i=1,2,,|W|/2i=1,2,\ldots,|W|/2 such that (𝚗2i1,2i,𝚣)arccos(2cosθ)\angle(\mathtt{n}_{2i-1,2i},\mathtt{z})\leq\arccos\bigl{(}\sqrt{2}\cos\theta\bigr{)}. Call this the first stage. Repeat the above with the |W|/2|W|/2 unit vectors 𝚗1,2,𝚗3,4,\mathtt{n}_{1,2},\mathtt{n}_{3,4},\ldots in the second stage and so on. We end up with only one vector in log2|W|\log_{2}|W| stages. If we ever produce a vector that makes an angle at most π/3\pi/3 with 𝚣\mathtt{z} before going through all log2|W|\log_{2}|W| stages, the lemma is true. Otherwise, we produce a vector 𝚢\mathtt{y} in the end such that cos(𝚢,𝚣)(2)log2|W|cosθ(2)log2|V|1cosθ|V|/2cosθ=c/2\cos\angle(\mathtt{y},\mathtt{z})\geq\bigl{(}\sqrt{2}\bigr{)}^{\log_{2}|W|}\cos\theta\geq\bigl{(}\sqrt{2}\bigr{)}^{\log_{2}|V|-1}\cos\theta\geq\sqrt{|V|/2}\cdot\cos\theta=c/\sqrt{2}. ∎
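The pairwise combination in this proof is constructive, and the following sketch mirrors it numerically (assuming the vectors in V are mutually orthogonal unit numpy vectors, each at an acute angle to \mathtt{z}; the function name is ours):

```python
import numpy as np

def conical_boost(V, z):
    """Repeatedly merge pairs n_{2i-1}, n_{2i} into (n_{2i-1} + n_{2i})/sqrt(2);
    stop early if some vector is already within angle pi/3 of z."""
    W = list(V[: 2 ** int(np.log2(len(V)))])       # maximal power-of-2 subset
    z_unit = z / np.linalg.norm(z)
    while len(W) > 1 and max(w @ z_unit for w in W) < 0.5:
        W = [(W[2 * i] + W[2 * i + 1]) / np.sqrt(2) for i in range(len(W) // 2)]
    return max(W, key=lambda w: w @ z_unit)        # unit conical combination of V
```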

Lemma 6.

The vectors in {𝚎i:i among the first τ indices in Er}\bigl{\{}\mathtt{e}_{i}:\text{$i$ among the first $\tau$ indices in $E_{r}$}\bigr{\}} have a unit conical combination 𝚗r\mathtt{n}_{r} such that 𝚗r\mathtt{n}_{r} is a descent direction from 𝚡r\mathtt{x}_{r} and 𝚟rJr,𝚗r𝚟rJr/2νlnν\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},{\mathtt{n}}_{r}\rangle\geq\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|/\sqrt{2\nu\ln\nu}.

Proof.

Since 𝚟rJr-\mathtt{v}_{r}\!\downarrow\!J_{r} is a conical combination of {𝚎i:iEr}\bigl{\{}\mathtt{e}_{i}:i\in E_{r}\bigr{\}}, 𝚟rJr-\mathtt{v}_{r}\!\downarrow\!J_{r} makes a non-obtuse angle with any vector in {𝚎i:iEr}\{\mathtt{e}_{i}:i\in E_{r}\}. Then, Lemmas 4 and 5 imply that the vectors in {𝚎i:i among the first τ indices in Er}\bigl{\{}\mathtt{e}_{i}:\text{$i$ among the first $\tau$ indices in $E_{r}$}\bigr{\}} have a unit conical combination 𝚗r\mathtt{n}_{r} such that 𝚟rJr,𝚗r𝚟rJrα/(2lnν)\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},{\mathtt{n}}_{r}\rangle\geq\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|\cdot\sqrt{\alpha/(2\ln\nu)}, where α=τ/|Er|\alpha=\tau/|E_{r}|. As |Er|ν|E_{r}|\leq\nu, we have α1/ν\alpha\geq 1/\nu which implies that 𝚟rJr,𝚗r𝚟rJr/2νlnν\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},{\mathtt{n}}_{r}\rangle\geq\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|/\sqrt{2\nu\ln\nu}. For any feasible solution 𝚡\mathtt{x} and any non-negative values c1,,cνc_{1},\ldots,c_{\nu}, 𝚡+i=1νci𝚎i\mathtt{x}+\sum_{i=1}^{\nu}c_{i}\mathtt{e}_{i} is also a feasible solution. It follows that 𝚡r+s𝚗r\mathtt{x}_{r}+s\mathtt{n}_{r} is a feasible solution for all s0s\geq 0. For NNLS and ZHLG, the primal problem has no constraint other than the non-negativity constraints 𝚡0ν\mathtt{x}\geq 0_{\nu}. By the KKT conditions, we have g(𝚡r)/𝚡=f(𝚡r)𝚟r=0ν\partial g(\mathtt{x}_{r})/\partial\mathtt{x}=\nabla f(\mathtt{x}_{r})-\mathtt{v}_{r}=0_{\nu}, which implies that f(𝚡r)=𝚟r\nabla f(\mathtt{x}_{r})=\mathtt{v}_{r}. Moreover, 𝚟rJr,𝚗r=𝚟r,𝚗r\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{n}_{r}\rangle=\langle-\mathtt{v}_{r},\mathtt{n}_{r}\rangle because 𝚗rJr\mathtt{n}_{r}\in J_{r} and so the component of 𝚟r-\mathtt{v}_{r} orthogonal to JrJ_{r} vanishes in the inner product. Then, the acuteness of (𝚟rJr,𝚗r)\angle(-\mathtt{v}_{r}\!\downarrow\!J_{r},{\mathtt{n}}_{r}) implies that f(𝚡r),𝚗r<0\langle\nabla f(\mathtt{x}_{r}),\mathtt{n}_{r}\rangle<0. In all, 𝚗r\mathtt{n}_{r} is a descent direction. ∎

We use Lemma 6 to establish a convergence rate under an assumption.

Lemma 7.

Let 𝚗r{\mathtt{n}}_{r} be a unit descent direction from 𝚡r\mathtt{x}_{r} that satisfies Lemma 6. Let 𝚢r\mathtt{y}_{r} be the feasible point that minimizes ff on the ray from 𝚡r\mathtt{x}_{r} in direction 𝚗r{\mathtt{n}}_{r}.

  1. (i)

    f(𝚡r),𝚗rf(𝚡r),𝚗12νlnν\frac{\langle\nabla f(\mathtt{x}_{r}),\,\mathtt{n}_{r}\rangle}{\langle\nabla f(\mathtt{x}_{r}),\,\mathtt{n}_{*}\rangle}\geq\frac{1}{\sqrt{2\nu\ln\nu}}, where 𝚗=𝚡𝚡r𝚡𝚡r\mathtt{n}_{*}=\frac{\mathtt{x}_{*}-\mathtt{x}_{r}}{\|\mathtt{x}_{*}-\mathtt{x}_{r}\|}.

  2. (ii)

    If there exists λ\lambda such that 𝚡r𝚢r1λ𝚡r𝚡\|\mathtt{x}_{r}-\mathtt{y}_{r}\|\geq\frac{1}{\lambda}\|\mathtt{x}_{r}-\mathtt{x}_{*}\|, then f(𝚡r+1)f(𝚡)f(𝚡r)f(𝚡)112λ2νlnν\frac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_{*})}{f(\mathtt{x}_{r})-f(\mathtt{x}_{*})}\leq 1-\frac{1}{2\lambda\sqrt{2\nu\ln\nu}}.

Proof.

Irrespective of whether Sr+1S_{r+1} is computed in line 11 or 14 of SolveNNQ, Sr+1S_{r+1} is disjoint from 𝚜𝚞𝚙𝚙(𝚡r)𝚜𝚞𝚙𝚙(𝚗r)\mathtt{supp}(\mathtt{x}_{r})\cup\mathtt{supp}(\mathtt{n}_{r}), which makes Lemma 1 applicable.

Since \nabla f(\mathtt{x}_{r})=\mathtt{v}_{r} for an NNLS or ZHLG problem, we have \langle-\nabla f(\mathtt{x}_{r}),\mathtt{n}_{r}\rangle=\langle-\mathtt{v}_{r},\mathtt{n}_{r}\rangle=\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{n}_{r}\rangle. The last equality comes from the fact that \mathtt{n}_{r}\in J_{r}, so the component of -\mathtt{v}_{r} orthogonal to J_{r} vanishes in the inner product. By Lemma 6, \langle-\nabla f(\mathtt{x}_{r}),\mathtt{n}_{r}\rangle=\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{n}_{r}\rangle\geq\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|/\sqrt{2\nu\ln\nu}.

The inequality above takes care of the numerator in the bound in Lemma 1 multiplied by -1. The denominator in the bound in Lemma 1 multiplied by -\|\mathtt{x}_{*}-\mathtt{x}_{r}\| is equal to \langle-\nabla f(\mathtt{x}_{r}),\mathtt{x}_{*}-\mathtt{x}_{r}\rangle=\langle-\mathtt{v}_{r},\mathtt{x}_{*}-\mathtt{x}_{r}\rangle=\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{x}_{*}-\mathtt{x}_{r}\rangle+\langle-\mathtt{v}_{r}+\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{x}_{*}-\mathtt{x}_{r}\rangle.

Recall that (\mathtt{v}_{r}\!\downarrow\!J_{r})_{i} is (\mathtt{v}_{r})_{i} for i\in E_{r} and zero otherwise. Therefore, (-\mathtt{v}_{r}+\mathtt{v}_{r}\!\downarrow\!J_{r})_{i} is zero for i\in E_{r} and -(\mathtt{v}_{r})_{i} otherwise. If i\not\in E_{r}, then (\mathtt{v}_{r})_{i}\geq 0, which implies that (-\mathtt{v}_{r}+\mathtt{v}_{r}\!\downarrow\!J_{r})_{i}=-(\mathtt{v}_{r})_{i}\leq 0. By complementary slackness, if (\mathtt{v}_{r})_{i}>0, then (\mathtt{x}_{r})_{i}=0, which implies that (\mathtt{x}_{*}-\mathtt{x}_{r})_{i}\geq 0 as \mathtt{x}_{*} is non-negative. Altogether, we conclude that for every i\in[\nu], if i\in E_{r} or \bigl(i\not\in E_{r}\wedge(\mathtt{v}_{r})_{i}=0\bigr), then (-\mathtt{v}_{r}+\mathtt{v}_{r}\!\downarrow\!J_{r})_{i}\cdot(\mathtt{x}_{*}-\mathtt{x}_{r})_{i}=0; otherwise, (-\mathtt{v}_{r}+\mathtt{v}_{r}\!\downarrow\!J_{r})_{i}\cdot(\mathtt{x}_{*}-\mathtt{x}_{r})_{i}\leq 0. Therefore, \langle-\mathtt{v}_{r}+\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{x}_{*}-\mathtt{x}_{r}\rangle\leq 0. As a result, \langle-\nabla f(\mathtt{x}_{r}),\mathtt{x}_{*}-\mathtt{x}_{r}\rangle\leq\langle-\mathtt{v}_{r}\!\downarrow\!J_{r},\mathtt{x}_{*}-\mathtt{x}_{r}\rangle\leq\|\mathtt{x}_{*}-\mathtt{x}_{r}\|\cdot\|\mathtt{v}_{r}\!\downarrow\!J_{r}\|.

Hence, f(𝚡r),𝚗rf(𝚡r),𝚗12νlnν\frac{\langle\nabla f(\mathtt{x}_{r}),\mathtt{n}_{r}\rangle}{\langle\nabla f(\mathtt{x}_{r}),\mathtt{n}_{*}\rangle}\geq\frac{1}{\sqrt{2\nu\ln\nu}}, where 𝚗=𝚡𝚡r𝚡𝚡r\mathtt{n}_{*}=\frac{\mathtt{x}_{*}-\mathtt{x}_{r}}{\|\mathtt{x}_{*}-\mathtt{x}_{r}\|}. This completes the proof of (i).

Substituting (i) into Lemma 1 gives f(𝚡r)f(𝚡r+1)f(𝚡r)f(𝚡)𝚡r𝚢r2𝚡𝚡r12νlnν\frac{f(\mathtt{x}_{r})-f(\mathtt{x}_{r+1})}{f(\mathtt{x}_{r})-f(\mathtt{x}_{*})}\geq\frac{\|\mathtt{x}_{r}-\mathtt{y}_{r}\|}{2\|\mathtt{x}_{*}-\mathtt{x}_{r}\|}\cdot\frac{1}{\sqrt{2\nu\ln\nu}}. It then follows from the assumption of 𝚡r𝚢r𝚡r𝚡/λ\|\mathtt{x}_{r}-\mathtt{y}_{r}\|\geq\|\mathtt{x}_{r}-\mathtt{x}_{*}\|/\lambda in the lemma that f(𝚡r+1)f(𝚡)=(f(𝚡r)f(𝚡))(f(𝚡r)f(𝚡r+1))(f(𝚡r)f(𝚡))12λ2νlnν(f(𝚡r)f(𝚡))f(\mathtt{x}_{r+1})-f(\mathtt{x}_{*})=\bigl{(}f(\mathtt{x}_{r})-f(\mathtt{x}_{*})\bigr{)}-\bigl{(}f(\mathtt{x}_{r})-f(\mathtt{x}_{r+1})\bigr{)}\leq\bigl{(}f(\mathtt{x}_{r})-f(\mathtt{x}_{*})\bigr{)}-\frac{1}{2\lambda\sqrt{2\nu\ln\nu}}\bigl{(}f(\mathtt{x}_{r})-f(\mathtt{x}_{*})\bigr{)}. ∎

Let T(n)T(n) denote the time complexity of solving a convex quadratic program with nn variables and O(n)O(n) constraints. It is known that T(n)=O(n3L)T(n)=O(n^{3}L), where LL is bounded by the number of bits in the input [21]. Theorem 1 below gives the performance of SolveNNQ on NNLS and ZHLG.

Theorem 1.

Consider the application of SolveNNQ on an NNLS problem with nn constraints in d\mathbb{R}^{d} or a ZHLG problem for nn points in d\mathbb{R}^{d}. Let ν\nu be nn for NNLS or n(n1)/2n(n-1)/2 for ZHLG. The initialization of SolveNNQ can be done in T(β0)+O(dn2)T(\beta_{0})+O(dn^{2}) time for NNLS or T(β0)+O(n4)T(\beta_{0})+O(n^{4}) time for ZHLG. Suppose that the following assumptions hold.

  • Assume that the threshold β1\beta_{1} on the total number of iterations is not exceeded.

  • For all r\geq 1, let \mathtt{n}_{r} be a unit descent direction from \mathtt{x}_{r} that satisfies Lemma 6, and let \mathtt{y}_{r} be the feasible point that minimizes f on the ray from \mathtt{x}_{r} in the direction \mathtt{n}_{r}. Assume that \|\mathtt{x}_{r}-\mathtt{y}_{r}\|\geq\frac{1}{\lambda}\|\mathtt{x}_{r}-\mathtt{x}_{*}\| for some fixed value \lambda.

Then, for all r1r\geq 1, f(𝚡r+i)f(𝚡)e1(f(𝚡r)f(𝚡))f(\mathtt{x}_{r+i})-f(\mathtt{x}_{*})\leq e^{-1}(f(\mathtt{x}_{r})-f(\mathtt{x}_{*})) for some i=O(λνlogν)i=O\bigl{(}\lambda\sqrt{\nu\log\nu}\bigr{)}, and each iteration in SolveNNQ takes T(k+β0)+O(kν)T(k+\beta_{0})+O(k\nu) time, where k=maxr1|𝚜𝚞𝚙𝚙(𝚡r)|k=\max_{r\geq 1}|\mathtt{supp}(\mathtt{x}_{r})|.

Proof.

Since the threshold β1\beta_{1} is not exceeded by assumption, the set ErE_{r} contains at least τ\tau indices for all r1r\geq 1. Therefore, Lemma 7 implies that the gap f(𝚡r)f(𝚡)f(\mathtt{x}_{r})-f(\mathtt{x}_{*}) decreases by a factor ee in O(λνlnν)O(\lambda\sqrt{\nu\ln\nu}) iterations. It remains to analyze the running time.

For NNLS, 𝙰\mathtt{A} is n×dn\times d and we compute 𝙰t𝙰\mathtt{A}^{t}\mathtt{A} in O(dn2)O(dn^{2}) time. For ZHLG, 𝙰\mathtt{A} is n(n+1)/2×n(n1)/2n(n+1)/2\times n(n-1)/2 and 𝚊\mathtt{a} is a vector of dimension n(n1)/2n(n-1)/2. Recall that 𝙰t=[μ2𝚄tρ2𝙸n(n1)/2]\mathtt{A}^{t}=\bigl{[}\frac{\mu}{2}\mathtt{U}^{t}\,\,\,\frac{\rho}{2}\mathtt{I}_{n(n-1)/2}\bigr{]}, where 𝚄n×n(n1)/2\mathtt{U}\in\mathbb{R}^{n\times n(n-1)/2} is the incidence matrix for the complete graph. Therefore, each row of 𝙰t\mathtt{A}^{t} has at most three non-zero entries, meaning that 𝙰t𝙰\mathtt{A}^{t}\mathtt{A} can be computed in O(n4)O(n^{4}) time.

The initial solution is obtained as follows. Sample β0\beta_{0} indices from [ν][\nu]. The initial active set contains all indices in the range [ν][\nu] except for these sampled indices. We extract the rows and columns of 𝙰t𝙰\mathtt{A}^{t}\mathtt{A} and coordinates of 𝚊\mathtt{a} that correspond to these β0\beta_{0} sampled indices. Then, we call the solver in T(β0)T(\beta_{0}) time to obtain the initial solution 𝚡1\mathtt{x}_{1}.

At the beginning of every iteration, we compute \nabla f(\mathtt{x}_{r}) and update the active set. This step can be done in O(k\nu) time because \mathtt{x}_{r} has at most k non-zero entries. At most k+\beta_{0} indices are absent from the updated active set because the threshold \beta_{1} is not exceeded. We select the rows and columns of \mathtt{A}^{t}\mathtt{A} and the coordinates of \mathtt{a} that correspond to the indices absent from the active set in O(k^{2}+\beta_{0}^{2}) time. The subsequent call of the solver takes T(k+\beta_{0}) time. So each iteration runs in T(k+\beta_{0})+O(k\nu) time. ∎
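The per-iteration work described above amounts to extracting a principal submatrix of \mathtt{A}^{t}\mathtt{A} and calling the solver on the freed variables. Below is a schematic sketch of that restricted solve; the active-set update itself follows the SolveNNQ pseudocode and is abstracted here as the index set free, and qp_solver is a stand-in for quadprog.

```python
import numpy as np

def restricted_solve(Q, a, free, qp_solver):
    """Minimize x^t Q x + a^t x over x >= 0 with only the coordinates in `free`
    allowed to be non-zero (Q = A^t A)."""
    Q_free = Q[np.ix_(free, free)]      # rows/columns of A^t A for the freed indices
    a_free = a[free]
    x_free = qp_solver(Q_free, a_free)  # any non-negative QP solver, e.g., quadprog
    x = np.zeros(Q.shape[0])
    x[free] = x_free
    return x
```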

We assume in Theorem 1 that 𝚡r𝚢r1λ𝚡r𝚡\|\mathtt{x}_{r}-\mathtt{y}_{r}\|\geq\frac{1}{\lambda}\|\mathtt{x}_{r}-\mathtt{x}_{*}\| for all r1r\geq 1 for ease of presentation. The geometric-like convergence still holds even if the inequality 𝚡r𝚢r1λ𝚡r𝚡\|\mathtt{x}_{r}-\mathtt{y}_{r}\|\geq\frac{1}{\lambda}\|\mathtt{x}_{r}-\mathtt{x}_{*}\| is satisfied in only a constant fraction of any sequence of consecutive iterations (the iterations in that constant fraction need not be consecutive).

5 Analysis for DKSG

Let 𝚙1,,𝚙n\mathtt{p}_{1},\ldots,\mathtt{p}_{n} be the input points. So there are n(n1)/2n(n-1)/2 edges connecting them, and the problem is to determine the weights of these edges. Suppose that we have used SolveNNQ to obtain the current solution (𝚞r,𝚟r,𝚡r)(\mathtt{u}_{r},\mathtt{v}_{r},\mathtt{x}_{r}) with respect to the active set SrS_{r}. We are to analyze the convergence rate of SolveNNQ when computing the next solution (𝚞r+1,𝚟r+1,𝚡r+1)(\mathtt{u}_{r+1},\mathtt{v}_{r+1},\mathtt{x}_{r+1}) with respect to the next active set Sr+1S_{r+1}.

In our analysis, we assume that the next active set Sr+1S_{r+1} is computed as [n(n1)/2](𝚜𝚞𝚙𝚙(𝚡r)Er)[n(n-1)/2]\setminus(\mathtt{supp}(\mathtt{x}_{r})\cup E^{\prime}_{r}), which is the same assumption that we made in the case of NNLS and ZHLG. Recall that ErE^{\prime}_{r} is the subset of ErE_{r} that SolveNNQ extracted so that variables in ErE^{\prime}_{r} will be freed in the next iteration.

To analyze the convergence rate achieved by (\mathtt{u}_{r+1},\mathtt{v}_{r+1},\mathtt{x}_{r+1}), we transform the original system to an extended system and analyze an imaginary run of SolveNNQ on this extended system. First, we convert the inequality constraints to equality constraints by introducing n slack variables. Second, we need another large entry in each row of the constraint matrix for a technical reason, so we introduce another n dummy variables. We give the details of the transformation in the next section.

5.1 Extended system

The original variables are (𝚡)j(\mathtt{x})_{j} for j[n(n1)/2]j\in[n(n-1)/2]. We add one slack variable per constraint in 𝙱𝚡1n\mathtt{Bx}\geq 1_{n}. This gives a vector 𝚜n\mathtt{s}\in\mathbb{R}^{n} of slack variables. We also add a vector 𝚍n\mathtt{d}\in\mathbb{R}^{n} of dummy variables, one for each constraint in 𝙱𝚡1n\mathtt{Bx}\geq 1_{n}. As a result, the ambient space expands from n(n1)/2\mathbb{R}^{n(n-1)/2} to n(n+3)/2\mathbb{R}^{n(n+3)/2}. We use 𝚡\mathtt{x} to denote a point in n(n1)/2\mathbb{R}^{n(n-1)/2}. The idea is to map 𝚡\mathtt{x} to a point 𝚡^\hat{\mathtt{x}} in n(n+3)/2\mathbb{R}^{n(n+3)/2} by appending slack and dummy variables:

𝚡^=[𝚡𝚜𝚍].\hat{\mathtt{x}}=\begin{bmatrix}\mathtt{x}\\ \mathtt{s}\\ \mathtt{d}\end{bmatrix}.

Let MM be an arbitrarily large number. Define the following matrix 𝙱^n×n(n+3)/2\hat{\mathtt{B}}\in\mathbb{R}^{n\times n(n+3)/2}:

𝙱^=[𝙱𝙸nM𝙸n].\hat{\mathtt{B}}=\begin{bmatrix}\,\mathtt{B}&&-\mathtt{I}_{n}&&M\mathtt{I}_{n}\,\end{bmatrix}.

The constraints for the extended system in n(n+3)/2\mathbb{R}^{n(n+3)/2} are:

𝙱^𝚡^=(1+M) 1n,𝚡^0n(n+3)/2.\hat{\mathtt{B}}\hat{\mathtt{x}}=(1+M)\,1_{n},\quad\hat{\mathtt{x}}\geq 0_{n(n+3)/2}.

That is, for each point 𝚙i\mathtt{p}_{i}, the sum of the weights of edges incident to 𝚙i\mathtt{p}_{i}, minus (𝚜)i(\mathtt{s})_{i}, and plus M(𝚍)iM(\mathtt{d})_{i} must be equal to 1+M1+M. If we can force 𝚍\mathtt{d} to be 1n1_{n}, we can recover the original requirement of the sum of the weights of edges incident to 𝚙i\mathtt{p}_{i} being at least 1 because the slack variable (𝚜)i(\mathtt{s})_{i} can cancel any excess weight.

We can make 𝚍\mathtt{d} nearly 1n1_{n} by changing the objective function from 𝚡t𝙰t𝙰𝚡\mathtt{x}^{t}\mathtt{A}^{t}\mathtt{Ax} to 𝚡t𝙰t𝙰𝚡+12M3𝚍1n2\mathtt{x}^{t}\mathtt{A}^{t}\mathtt{Ax}+\frac{1}{2}M^{3}\|\mathtt{d}-1_{n}\|^{2}. Although 𝚍\mathtt{d} may not be exactly 1n1_{n} at an optimum, we will argue that it suffices for our purpose. Let f^:n(n+3)/2\hat{f}:\mathbb{R}^{n(n+3)/2}\rightarrow\mathbb{R} denote the new objective function 𝚡t𝙰t𝙰𝚡+12M3𝚍1n2\mathtt{x}^{t}\mathtt{A}^{t}\mathtt{Ax}+\frac{1}{2}M^{3}\|\mathtt{d}-1_{n}\|^{2}. Note that f^\hat{f} is quadratic. We have

f^(𝚡^)=[f(𝚡)0nM3(𝚍1n)]=[2𝙰t𝙰𝚡0nM3(𝚍1n)].\nabla\hat{f}(\hat{\mathtt{x}})=\begin{bmatrix}\nabla f(\mathtt{x})\\ 0_{n}\\ M^{3}(\mathtt{d}-1_{n})\end{bmatrix}=\begin{bmatrix}2\mathtt{A}^{t}\mathtt{Ax}\\ 0_{n}\\ M^{3}(\mathtt{d}-1_{n})\end{bmatrix}.

Let 𝚞^\hat{\mathtt{u}} denote the nn-dimensional vector of multipliers corresponding to the constraints 𝙱^𝚡^=(1+M)1n\hat{\mathtt{B}}\hat{\mathtt{x}}=(1+M)1_{n}. Let 𝚟^\hat{\mathtt{v}} denote the (n(n+3)/2)(n(n+3)/2)-dimensional vector of multipliers corresponding to the constraints 𝚡^0n(n+3)/2\hat{\mathtt{x}}\geq 0_{n(n+3)/2}.
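A small sketch of this extended system (a sketch only; the function name is ours, M is the arbitrarily large constant, \mathtt{B} the original constraint matrix, and \mathtt{A} the matrix from the DKSG objective):

```python
import numpy as np

def extended_system(A, B, M):
    """Constraint matrix B_hat = [B  -I_n  M*I_n] together with f_hat and its gradient."""
    n, m = B.shape                      # m = n(n-1)/2 original variables
    B_hat = np.hstack([B, -np.eye(n), M * np.eye(n)])

    def f_hat(x_hat):
        x, d = x_hat[:m], x_hat[m + n:]
        return x @ (A.T @ A) @ x + 0.5 * M**3 * np.sum((d - 1.0) ** 2)

    def grad_f_hat(x_hat):
        x, d = x_hat[:m], x_hat[m + n:]
        return np.concatenate([2.0 * (A.T @ (A @ x)), np.zeros(n), M**3 * (d - 1.0)])

    return B_hat, f_hat, grad_f_hat
```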

Lemma 8.

Let (𝚞r,𝚟r,𝚡r)(\mathtt{u}_{r},\mathtt{v}_{r},\mathtt{x}_{r}) be an optimal solution for the original system with respect to the active set SrS_{r}. There is a feasible solution (𝚞^r,𝚟^r,𝚡^r)(\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}) for the extended system with respect to the active set SrS_{r} such that the following properties hold:

  1. (i)

    𝚞^r=𝚞r\hat{\mathtt{u}}_{r}=\mathtt{u}_{r}.

  2. (ii)

    𝚟^r=[𝚟r𝚞r0n]\hat{\mathtt{v}}_{r}=\begin{bmatrix}\mathtt{v}_{r}\\ \mathtt{u}_{r}\\ 0_{n}\end{bmatrix}.

  3. (iii)

    𝚡^r=[𝚡r𝚜𝚍]\hat{\mathtt{x}}_{r}=\begin{bmatrix}\mathtt{x}_{r}\\ \mathtt{s}\\ \mathtt{d}\end{bmatrix}, where 𝚜=𝙱𝚡r1n+𝚞r/M\mathtt{s}=\mathtt{B}\mathtt{x}_{r}-1_{n}+\mathtt{u}_{r}/M and 𝚍=1n+𝚞r/M2\mathtt{d}=1_{n}+\mathtt{u}_{r}/M^{2}.

  4. (iv)

    f^(𝚡^r)𝙱^t𝚞^r=𝚟^r\nabla\hat{f}(\hat{\mathtt{x}}_{r})-\hat{\mathtt{B}}^{t}\hat{\mathtt{u}}_{r}=\hat{\mathtt{v}}_{r}.

  5. (v)

    f^(𝚡^r)f(𝚡r)\hat{f}(\hat{\mathtt{x}}_{r})\geq f(\mathtt{x}_{r}) and f^(𝚡^r)\hat{f}(\hat{\mathtt{x}}_{r}) approaches f(𝚡r)f(\mathtt{x}_{r}) as MM increases.

Proof.

With respect to the active set S_{r}, we claim that a feasible solution (\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}) of the extended system can be obtained by solving the following linear system. We use \hat{\mathtt{w}}_{1} to denote the first n(n-1)/2 coordinates of \hat{\mathtt{v}} and \hat{\mathtt{w}}_{2} the next n coordinates of \hat{\mathtt{v}}.

[2𝙰t𝙰𝚡0nM3(𝚍1n)][𝙱t𝚞^𝚞^M𝚞^]=[𝚠^1𝚠^20n],\begin{bmatrix}2\mathtt{A}^{t}\mathtt{Ax}\\ 0_{n}\\ M^{3}(\mathtt{d}-1_{n})\end{bmatrix}-\begin{bmatrix}\mathtt{B}^{t}\hat{\mathtt{u}}\\ -\hat{\mathtt{u}}\\ M\hat{\mathtt{u}}\end{bmatrix}=\begin{bmatrix}\hat{\mathtt{w}}_{1}\\ \hat{\mathtt{w}}_{2}\\ 0_{n}\end{bmatrix}, (4)
𝙱𝚡𝚜+M𝚍=(1+M)1n.\mathtt{B}\mathtt{x}-\mathtt{s}+M\mathtt{d}=(1+M)1_{n}. (5)

Equation (4) models the requirement of f^(𝚡^)𝙱^t𝚞^=𝚟^\nabla\hat{f}(\hat{\mathtt{x}})-\hat{\mathtt{B}}^{t}\hat{\mathtt{u}}=\hat{\mathtt{v}} for some particular settings of 𝚡^\hat{\mathtt{x}}, 𝚞^\hat{\mathtt{u}}, and 𝚟^\hat{\mathtt{v}}. Equation (5) models one of the primal feasibility constraints: 𝙱^𝚡^=(1+M)1n\hat{\mathtt{B}}\hat{\mathtt{x}}=(1+M)1_{n}.

First, we argue that (4) and (5) can be satisfied simultaneously. Observe that 2\mathtt{A}^{t}\mathtt{Ax}-\mathtt{B}^{t}\hat{\mathtt{u}}=\hat{\mathtt{w}}_{1} is a KKT condition of the original system, so it can be satisfied by setting \mathtt{x}=\mathtt{x}_{r}, \hat{\mathtt{u}}=\mathtt{u}_{r}, and \hat{\mathtt{w}}_{1}=\mathtt{v}_{r}. We set \hat{\mathtt{w}}_{2}=\mathtt{u}_{r} and \mathtt{d}=1_{n}+\mathtt{u}_{r}/M^{2}. These settings of \hat{\mathtt{x}}, \hat{\mathtt{u}}, \hat{\mathtt{w}}_{1}, \hat{\mathtt{w}}_{2}, and \mathtt{d} satisfy (4). To satisfy (5), we use \mathtt{x}_{r} and the setting of \mathtt{d} to set \mathtt{s}=\mathtt{B}\mathtt{x}_{r}-1_{n}+\mathtt{u}_{r}/M. The satisfiability of (4) and (5) is thus verified. These settings yield the following candidate solution for the extended system and establish the correctness of (i)–(iv):

𝚞^r=𝚞r,𝚟^r=[𝚟r𝚞r0n],𝚡^r=[𝚡r𝚜𝚍].\hat{\mathtt{u}}_{r}=\mathtt{u}_{r},\quad\hat{\mathtt{v}}_{r}=\begin{bmatrix}\mathtt{v}_{r}\\ \mathtt{u}_{r}\\ 0_{n}\end{bmatrix},\quad\hat{\mathtt{x}}_{r}=\begin{bmatrix}\mathtt{x}_{r}\\ \mathtt{s}\\ \mathtt{d}\end{bmatrix}.

We do not know yet whether \hat{\mathtt{x}}_{r} is a feasible solution for the extended system. We argue that (\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}) satisfies the primal and dual feasibilities of the extended system. Equation (5) explicitly models one of the primal feasibility conditions: \hat{\mathtt{B}}\hat{\mathtt{x}}=(1+M)1_{n}. The other two feasibility conditions that need to be checked are \hat{\mathtt{x}}_{r}\geq 0_{n(n+3)/2} and (\hat{\mathtt{v}}_{r})_{j}\geq 0 for j\not\in S_{r}. By the dual feasibility of the original system, we have \mathtt{u}_{r}\geq 0_{n} and (\mathtt{v}_{r})_{j}\geq 0 for j\not\in S_{r}, which implies that (\hat{\mathtt{v}}_{r})_{j}\geq 0 for j\not\in S_{r}. By the primal feasibility of the original system, we have \mathtt{x}_{r}\geq 0_{n(n-1)/2}. Since \mathtt{u}_{r}\geq 0_{n}, we have \mathtt{d}=1_{n}+\mathtt{u}_{r}/M^{2}\geq 1_{n}. As \mathtt{B}\mathtt{x}_{r}\geq 1_{n} by the primal feasibility of the original system, we have \mathtt{s}=\mathtt{B}\mathtt{x}_{r}-1_{n}+\mathtt{u}_{r}/M\geq 0_{n}. Hence, \hat{\mathtt{x}}_{r}\geq 0_{n(n+3)/2}.

Finally, \hat{f}(\hat{\mathtt{x}}_{r})=f(\mathtt{x}_{r})+\frac{1}{2}M^{3}\|\mathtt{u}_{r}/M^{2}\|^{2}=f(\mathtt{x}_{r})+\|\mathtt{u}_{r}\|^{2}/(2M), which approaches f(\mathtt{x}_{r}) as M increases. This proves (v). ∎

We remark that (\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}) may not be an optimal solution for the extended system. Indeed, one can check that (\mathtt{s})_{i}\cdot(\mathtt{u}_{r})_{i} may be positive for some i\in[n], implying that (\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}) may not satisfy complementary slackness.

We will analyze an imaginary run of SolveNNQ on this extended system starting from (𝚞^r,𝚟^r,𝚡^r)(\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}). Since no index in the range [n(n1)/2+1,n(n+3)/2][n(n-1)/2+1,n(n+3)/2] is put into an active set by SolveNNQ, the run on the extended system mimics what SolveNNQ does on the original system.

5.2 A descent direction for the extended system

Let (𝚞^r,𝚟^r,𝚡^r)(\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}) be a feasible solution of the extended solution corresponding to (𝚞r,𝚟r,𝚡r)(\mathtt{u}_{r},\mathtt{v}_{r},\mathtt{x}_{r}) as promised by Lemma 8. Recall that, in the pseudocode of SolveNNQ, ErE_{r} is a sequence of indices iSri\in S_{r} such that (𝚟r)i<0(\mathtt{v}_{r})_{i}<0. Without loss of generality, we assume that (𝚟r)1=miniEr(𝚟r)i(\mathtt{v}_{r})_{1}=\min_{i\in E_{r}}(\mathtt{v}_{r})_{i} and that the edge with index 1 is 𝚙1𝚙2\mathtt{p}_{1}\mathtt{p}_{2}.

In an imaginary run of SolveNNQ on the extended system starting from (𝚞^r,𝚟^r,𝚡^r)(\hat{\mathtt{u}}_{r},\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}_{r}), we will free the variable corresponding to the first coordinate of 𝚡^r\hat{\mathtt{x}}_{r}. So we will increase the variable (𝚡^r)1(\hat{\mathtt{x}}_{r})_{1}, which means that the descent direction is some appropriate component of the vector 𝚎1\mathtt{e}_{1} in n(n+3)/2\mathbb{R}^{n(n+3)/2}. We introduce some definitions in order to characterize the descent direction.

Definition 1.

Define the following set of indices

𝚌𝚘𝚛𝚎(𝚡^r)=𝚜𝚞𝚙𝚙(𝚡r)[n(n+1)2+1,n(n+3)2]{a+n(n1)2:(𝙱𝚡r)a>1}.\mathtt{core}(\hat{\mathtt{x}}_{r})=\mathtt{supp}(\mathtt{x}_{r})\cup\left[\frac{n(n+1)}{2}+1,\frac{n(n+3)}{2}\right]\cup\left\{a+\frac{n(n-1)}{2}:(\mathtt{B}\mathtt{x}_{r})_{a}>1\right\}.

The role of \mathtt{core}(\hat{\mathtt{x}}_{r}) for the extended system is analogous to the role of \mathtt{supp}(\mathtt{x}_{r}) for the original system. All dummy variables belong to \mathtt{core}(\hat{\mathtt{x}}_{r}). A slack variable is in \mathtt{core}(\hat{\mathtt{x}}_{r}) if and only if the corresponding constraint in \mathtt{B}\mathtt{x}_{r}\geq 1_{n} is not tight.

Definition 2.

For all a[n]a\in[n], let Ia={i𝚌𝚘𝚛𝚎(𝚡^r):(𝙱^)a,i0}I_{a}=\bigl{\{}i\in\mathtt{core}(\hat{\mathtt{x}}_{r}):(\hat{\mathtt{B}})_{a,i}\not=0\bigr{\}}. So Ia𝚌𝚘𝚛𝚎(𝚡^r)I_{a}\subseteq\mathtt{core}(\hat{\mathtt{x}}_{r}) and it tells us the entries in the aa-th constraint in 𝙱^\hat{\mathtt{B}} that are affected if (𝚡r)1(\mathtt{x}_{r})_{1} is increased. Note that IaI_{a} always contains n(n+1)/2+an(n+1)/2+a.

Definition 3.

For all a[n]a\in[n], define the following quantities:

ca=\displaystyle c_{a}= {1+M,if (𝙱𝚡r)a>1, i.e., a+n(n1)/2𝚌𝚘𝚛𝚎(x^r);1+M(𝚞r)a/M,if (𝙱𝚡r)a=1, i.e., a+n(n1)/2𝚌𝚘𝚛𝚎(𝚡^r).\displaystyle\left\{\begin{array}[]{ccl}1+M,&&\text{if $(\mathtt{B}\mathtt{x}_{r})_{a}>1,$ {\em i.e., $a+n(n-1)/2\in\mathtt{core}(\hat{x}_{r})$;}}\\ 1+M-(\mathtt{u}_{r})_{a}/M,&&\text{if $(\mathtt{B}\mathtt{x}_{r})_{a}=1,$ {\em i.e., $a+n(n-1)/2\not\in\mathtt{core}(\hat{\mathtt{x}}_{r})$}}.\end{array}\right.
Ha=\displaystyle H_{a}= {𝚡^n(n+3)/2:(𝙱^𝚡^)a=ca}span({𝚎i:i{1}𝚌𝚘𝚛𝚎(𝚡^r)}).\displaystyle\left\{\hat{\mathtt{x}}\in\mathbb{R}^{n(n+3)/2}:(\hat{\mathtt{B}}\hat{\mathtt{x}})_{a}=c_{a}\right\}\,\bigcap\,\mathrm{span}\Bigl{(}\bigl{\{}\mathtt{e}_{i}:i\in\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r})\bigr{\}}\Bigr{)}.

In the definition of HaH_{a}, since we set all variables not in {1}𝚌𝚘𝚛𝚎(𝚡^r)\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}) to zero, the slack variables that are excluded must be subtracted from the vector (1+M)1n(1+M)1_{n} on the right hand side of the constraints 𝙱^𝚡^=(1+M)1n\hat{\mathtt{B}}\hat{\mathtt{x}}=(1+M)1_{n}. By Lemma 8, for any a[n]a\in[n], if a+n(n1)/2𝚌𝚘𝚛𝚎(𝚡^r)a+n(n-1)/2\not\in\mathtt{core}(\hat{\mathtt{x}}_{r}), the value of the corresponding slack variable is (𝚞r)a/M(\mathtt{u}_{r})_{a}/M. This consideration justifies the definition of cac_{a}.

Definition 4.

Let h=\bigcap_{a\in[n]}H_{a}. It is a non-empty affine subspace of \mathrm{span}\bigl(\{\mathtt{e}_{i}:i\in\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r})\}\bigr) because it contains the projection of \hat{\mathtt{x}}_{r} to \mathrm{span}\bigl(\{\mathtt{e}_{i}:i\in\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r})\}\bigr).

Definition 5.

For all a[n]a\in[n], define a vector 𝚔^a\hat{\mathtt{k}}_{a} in n(n+3)/2\mathbb{R}^{n(n+3)/2} as follows:

(𝚔^a)i={(𝙱^)a,i,if i{1}Ia;0,otherwise.(\hat{\mathtt{k}}_{a})_{i}=\left\{\begin{array}[]{ccl}(\hat{\mathtt{B}})_{a,i},&&\text{{\em if $i\in\{1\}\cup I_{a}$;}}\\[2.5pt] 0,&&\text{{\em otherwise}}.\end{array}\right.

Observe that \hat{\mathtt{k}}_{a}\perp H_{a}. Also, if a\in\{1,2\}, then (\hat{\mathtt{k}}_{a})_{1}=1; otherwise, (\hat{\mathtt{k}}_{a})_{1}=0. This is because the column index 1 is the index of the edge \mathtt{p}_{1}\mathtt{p}_{2} by assumption.

Note that every 𝚔^a\hat{\mathtt{k}}_{a} is orthogonal to hh. The vectors {𝚔^a:a[n]}\{\hat{\mathtt{k}}_{a}:a\in[n]\} and the vectors that are parallel to hh span the linear subspace span({𝚎i:i{1}𝚌𝚘𝚛𝚎(𝚡^r)})\mathrm{span}\bigl{(}\bigl{\{}\mathtt{e}_{i}:i\in\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r})\bigr{\}}\bigr{)}.

We want to analyze the effect of using \mathtt{e}_{1}\!\downarrow\!h as a descent direction. The direction \mathtt{e}_{1}\!\downarrow\!h is equal to \mathtt{e}_{1}-\mathtt{e}_{1}\!\downarrow\!\mathrm{span}(\{\hat{\mathtt{k}}_{a}:a\in[n]\}). However, the vectors \{\hat{\mathtt{k}}_{a}:a\in[n]\} are not orthonormal, so we first orthonormalize \{\hat{\mathtt{k}}_{a}:a\in[n]\} using the Gram-Schmidt method [45] in order to express \mathtt{e}_{1}\!\downarrow\!\mathrm{span}(\{\hat{\mathtt{k}}_{a}:a\in[n]\}). The method first generates n mutually orthogonal vectors so that we can normalize each of them individually later.

First, define 𝚔ˇn=𝚔^n\check{\mathtt{k}}_{n}=\hat{\mathtt{k}}_{n}. For a=n1,n2,,3,2,1a=n-1,n-2,\ldots,3,2,1 in this order, define

𝚔ˇa=𝚔^ab=a+1n𝚔^a,𝚔ˇb𝚔ˇb,𝚔ˇb𝚔ˇb.\check{\mathtt{k}}_{a}=\hat{\mathtt{k}}_{a}-\sum_{b=a+1}^{n}\frac{\langle\hat{\mathtt{k}}_{a},\check{\mathtt{k}}_{b}\rangle}{\langle\check{\mathtt{k}}_{b},\check{\mathtt{k}}_{b}\rangle}\check{\mathtt{k}}_{b}.

This process generates \check{\mathtt{k}}_{a} by subtracting from \hat{\mathtt{k}}_{a} its component that lies in \mathrm{span}(\{\hat{\mathtt{k}}_{b}:b\in[a+1,n]\}). The vectors \{\check{\mathtt{k}}_{a}:a\in[n]\} are mutually orthogonal. Then, we define the following set of orthonormal vectors:

{𝚔a=𝚔ˇa𝚔ˇa:a[n]}.\left\{\mathtt{k}_{a}=\frac{\check{\mathtt{k}}_{a}}{\|\check{\mathtt{k}}_{a}\|}:a\in[n]\right\}.
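A direct transcription of this reverse-order Gram-Schmidt step (a sketch; the rows of k_hat are the vectors \hat{\mathtt{k}}_{1},\ldots,\hat{\mathtt{k}}_{n} and are assumed to be linearly independent, and the function name is ours):

```python
import numpy as np

def reverse_gram_schmidt(k_hat):
    """Orthogonalize k_hat[n-1], ..., k_hat[0] in that order (classical Gram-Schmidt
    run in reverse index order) and then normalize each resulting vector."""
    n = len(k_hat)
    k_check = [None] * n
    for a in range(n - 1, -1, -1):
        v = np.array(k_hat[a], dtype=float)
        for b in range(a + 1, n):
            v -= (k_hat[a] @ k_check[b]) / (k_check[b] @ k_check[b]) * k_check[b]
        k_check[a] = v
    k_unit = [v / np.linalg.norm(v) for v in k_check]
    return k_check, k_unit
```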
Lemma 9.

The following properties are satisfied.

  1. (i)

    For all a[n]a\in[n], (𝚔ˇa)n(n+1)/2+a=M(\check{\mathtt{k}}_{a})_{n(n+1)/2+a}=M.

  2. (ii)

    (𝚔ˇ2)1=1(\check{\mathtt{k}}_{2})_{1}=1, and for a[3,n]a\in[3,n], (𝚔ˇa)1=0(\check{\mathtt{k}}_{a})_{1}=0.

  3. (iii)

    For all a[n]a\in[n] and i[n(n+1)/2]i\in[n(n+1)/2], 11M(𝚔ˇa)i1+1M-1-\frac{1}{M}\leq(\check{\mathtt{k}}_{a})_{i}\leq 1+\frac{1}{M} for Mn2(n1)M\geq n^{2}(n-1).

Proof.

Observe that for a[n]a\in[n], (𝚔^b)n(n+1)/2+a=0(\hat{\mathtt{k}}_{b})_{n(n+1)/2+a}=0 for all b>ab>a. This implies that (𝚔ˇb)n(n+1)/2+a=0(\check{\mathtt{k}}_{b})_{n(n+1)/2+a}=0 for all b>ab>a and hence (𝚔ˇa)n(n+1)/2+a=M(\check{\mathtt{k}}_{a})_{n(n+1)/2+a}=M. This proves (i).

Similarly, for a\in[3,n], we have (\hat{\mathtt{k}}_{a})_{1}=0, which implies that (\check{\mathtt{k}}_{a})_{1}=0. Then, coupled with the fact that (\hat{\mathtt{k}}_{2})_{1}=1, we get (\check{\mathtt{k}}_{2})_{1}=1. This proves (ii).

We show (iii) inductively. The base case is trivial as 𝚔ˇn=𝚔^n\check{\mathtt{k}}_{n}=\hat{\mathtt{k}}_{n}. Consider the computation of 𝚔ˇa\check{\mathtt{k}}_{a}. We can argue as in proving (i) that (𝚔ˇb)n(n+1)/2+a=0(\check{\mathtt{k}}_{b})_{n(n+1)/2+a}=0 for all b>ab>a. Also, (𝚔^a)n(n+1)/2+b=0(\hat{\mathtt{k}}_{a})_{n(n+1)/2+b}=0 for all bab\not=a. Therefore, the inner product 𝚔^a,𝚔ˇb\langle\hat{\mathtt{k}}_{a},\check{\mathtt{k}}_{b}\rangle for any b>ab>a does not receive any contributions from the entries with indices in the range [n(n+1)/2+1,n(n+3)/2][n(n+1)/2+1,n(n+3)/2]. By a similar reasoning, there are no contributions from the entries with indices in the range [n(n1)/2+1,n(n+1)/2][n(n-1)/2+1,n(n+1)/2]. It follows that, for any b>ab>a, |𝚔^a,𝚔ˇb/𝚔ˇb,𝚔ˇb|i=1n(n1)/2|(𝚔ˇb)i|/𝚔ˇb,𝚔ˇbn(n1)/𝚔ˇb,𝚔ˇb\bigl{|}\langle\hat{\mathtt{k}}_{a},\check{\mathtt{k}}_{b}\rangle/\langle\check{\mathtt{k}}_{b},\check{\mathtt{k}}_{b}\rangle\bigr{|}\leq\sum_{i=1}^{n(n-1)/2}|(\check{\mathtt{k}}_{b})_{i}|/\langle\check{\mathtt{k}}_{b},\check{\mathtt{k}}_{b}\rangle\leq n(n-1)/\langle\check{\mathtt{k}}_{b},\check{\mathtt{k}}_{b}\rangle as the largest magnitude of the first n(n1)/2n(n-1)/2 entries of 𝚔^a\hat{\mathtt{k}}_{a} is 1, and the largest magnitude of the first n(n1)/2n(n-1)/2 entries of 𝚔ˇb\check{\mathtt{k}}_{b} is at most 1+1M21+\frac{1}{M}\leq 2 by induction assumption. By (i), 𝚔ˇb,𝚔ˇbM2\langle\check{\mathtt{k}}_{b},\check{\mathtt{k}}_{b}\rangle\geq M^{2}. As a result, the largest magnitude of the first n(n+1)/2n(n+1)/2 entries of b=a+1n𝚔ˇb𝚔^a,𝚔ˇb/𝚔ˇb,𝚔ˇb\sum_{b=a+1}^{n}\check{\mathtt{k}}_{b}\cdot\langle\hat{\mathtt{k}}_{a},\check{\mathtt{k}}_{b}\rangle/\langle\check{\mathtt{k}}_{b},\check{\mathtt{k}}_{b}\rangle is at most n(1+1/M)n(n1)/M21/Mn\cdot(1+1/M)\cdot n(n-1)/M^{2}\leq 1/M by induction assumption and the assumption that Mn2(n1)M\geq n^{2}(n-1). Therefore, among the first n(n+1)/2n(n+1)/2 entries of 𝚔ˇa\check{\mathtt{k}}_{a}, the smallest entry is at least 11/M-1-1/M and the largest entry is at most 1+1/M1+1/M. This proves (iii). ∎

Let 𝚗^r=𝚎1h/𝚎1h\hat{\mathtt{n}}_{r}=\mathtt{e}_{1}\!\downarrow\!h/\|\mathtt{e}_{1}\!\downarrow\!h\|. We analyze the value of f^(𝚡^r),𝚗^r\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle and show that 𝚗^r\hat{\mathtt{n}}_{r} is a descent direction.

Lemma 10.

For a large enough MM, f^(𝚡^r),𝚗^r23(𝚟r)1<0\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle\leq\frac{2}{3}(\mathtt{v}_{r})_{1}<0.

Proof.

We first obtain a simple expression for 𝚎1h\mathtt{e}_{1}\!\downarrow\!h. Since the vectors {𝚔a:a[n]}\{\mathtt{k}_{a}:a\in[n]\} are orthonormal, we have 𝚎1h=𝚎1a=1n𝚎1,𝚔a𝚔a\mathtt{e}_{1}\!\downarrow\!h=\mathtt{e}_{1}-\sum_{a=1}^{n}\langle\mathtt{e}_{1},\mathtt{k}_{a}\rangle\mathtt{k}_{a}. The only non-zero entry of 𝚎1\mathtt{e}_{1} is a 1 at position 1. By Lemma 9, (𝚔a)1=0(\mathtt{k}_{a})_{1}=0 for a[3,n]a\in[3,n]. Note that 𝚎1,𝚔1=(𝚔ˇ1)1/𝚔ˇ1\langle\mathtt{e}_{1},\mathtt{k}_{1}\rangle=(\check{\mathtt{k}}_{1})_{1}/\|\check{\mathtt{k}}_{1}\| and by Lemma 9(ii), 𝚎1,𝚔2=1/𝚔ˇ2\langle\mathtt{e}_{1},\mathtt{k}_{2}\rangle=1/\|\check{\mathtt{k}}_{2}\|. Therefore,

𝚎1h=𝚎1(𝚔ˇ1)1𝚔ˇ1𝚔11𝚔ˇ2𝚔2.\mathtt{e}_{1}\!\downarrow\!h=\mathtt{e}_{1}-\frac{(\check{\mathtt{k}}_{1})_{1}}{\|\check{\mathtt{k}}_{1}\|}\mathtt{k}_{1}-\frac{1}{\|\check{\mathtt{k}}_{2}\|}\mathtt{k}_{2}. (6)

Let Kr=span({𝚎i:iEr𝚌𝚘𝚛𝚎(𝚡^r)})K_{r}=\mathrm{span}\bigl{(}\bigl{\{}\mathtt{e}_{i}:i\in E_{r}\cup\mathtt{core}(\hat{\mathtt{x}}_{r})\bigr{\}}\bigr{)}. Therefore, for all 𝚢Kr\mathtt{y}\in K_{r}, if iEr𝚌𝚘𝚛𝚎(𝚡^r)i\not\in E_{r}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}), then (𝚢)i=0(\mathtt{y})_{i}=0. We claim that for a sufficiently large MM,

𝚟^rKr,𝚎1h23(𝚟r)1.\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{e}_{1}\!\downarrow\!h\rangle\leq\frac{2}{3}(\mathtt{v}_{r})_{1}. (7)

The proof goes as follows. By (6),

𝚟^rKr,𝚎1h=𝚟^rKr,𝚎1(𝚔ˇ1)1𝚔ˇ1𝚟^rKr,𝚔11𝚔ˇ2𝚟^rKr,𝚔2.\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{e}_{1}\!\downarrow\!h\rangle=\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{e}_{1}\rangle-\frac{(\check{\mathtt{k}}_{1})_{1}}{\|\check{\mathtt{k}}_{1}\|}\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{k}_{1}\rangle-\frac{1}{\|\check{\mathtt{k}}_{2}\|}\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{k}_{2}\rangle.

Clearly, 𝚟^rKr,𝚎1=(𝚟^r)1\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{e}_{1}\rangle=(\hat{\mathtt{v}}_{r})_{1} which is equal to (𝚟r)1(\mathtt{v}_{r})_{1} by Lemma 8. Recall that (𝚟r)1(\mathtt{v}_{r})_{1} is the most negative entry of 𝚟r\mathtt{v}_{r} by assumption. For i>n(n+1)/2i>n(n+1)/2, if iEr𝚌𝚘𝚛𝚎(𝚡^r)i\not\in E_{r}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}), then (𝚟^rKr)i=0(\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}=0 by the definition of KrK_{r}; otherwise, (𝚟^rKr)i=(𝚟^r)i(\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}=(\hat{\mathtt{v}}_{r})_{i} which is zero by Lemma 8. Take any in(n+1)/2i\leq n(n+1)/2. By Lemma 9(i) and (iii), |(𝚔1)i||(\mathtt{k}_{1})_{i}| and |(𝚔2)i||(\mathtt{k}_{2})_{i}| are at most 1/M+1/M21/M+1/M^{2}. Therefore, (𝚟^rKr)i(𝚔1)i(\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}\cdot(\mathtt{k}_{1})_{i} and (𝚟^rKr)i(𝚔2)i(\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}\cdot(\mathtt{k}_{2})_{i} have negligible magnitude for a large enough MM. It follows that 𝚟^rKr,𝚎1h\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{e}_{1}\!\downarrow\!h\rangle is no more than (𝚟r)1({\mathtt{v}}_{r})_{1} plus a non-negative number of negligible magnitude in the worst case. The result is thus less than 23(𝚟r)1\frac{2}{3}({\mathtt{v}}_{r})_{1}. This completes the proof of (7).

For all a[n]a\in[n], the aa-th constraint in the extended system is a hyperplane in n(n+3)/2\mathbb{R}^{n(n+3)/2} for which the aa-th row 𝙱^a,\hat{\mathtt{B}}_{a,*} of 𝙱^\hat{\mathtt{B}} is a normal vector. As a result, for any direction 𝚠\mathtt{w} from the feasible solution 𝚡^r\hat{\mathtt{x}}_{r} that is parallel to the hyperplanes encoded by 𝙱^𝚡^=(1+M)1n\hat{\mathtt{B}}\hat{\mathtt{x}}=(1+M)1_{n}, the vector 𝙱^t𝚞^r\hat{\mathtt{B}}^{t}\hat{\mathtt{u}}_{r} is orthogonal to 𝚠\mathtt{w}, which gives

f^(𝚡^r),𝚠=f^(𝚡^r)𝙱^t𝚞^r,𝚠=𝚟^r,𝚠.\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\mathtt{w}\rangle=\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r})-\hat{\mathtt{B}}^{t}\hat{\mathtt{u}}_{r},\mathtt{w}\rangle=\langle\hat{\mathtt{v}}_{r},\mathtt{w}\rangle. (8)

The last step uses the relation f^(𝚡^r)𝙱^t𝚞^r=𝚟^r\nabla\hat{f}(\hat{\mathtt{x}}_{r})-\hat{\mathtt{B}}^{t}\hat{\mathtt{u}}_{r}=\hat{\mathtt{v}}_{r} from Lemma 8(iv).

As we move from 𝚡^r\hat{\mathtt{x}}_{r} in direction 𝚎1h\mathtt{e}_{1}\!\downarrow\!h, our projection in 𝚎1\mathtt{e}_{1} moves in the positive direction, and so (𝚡^)1(\hat{\mathtt{x}})_{1} increases. During the movement, for any i{1}𝚌𝚘𝚛𝚎(𝚡^r)i\not\in\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}), the coordinate (𝚡^)i(\hat{\mathtt{x}})_{i} remains unchanged by the definition of hh. Specifically, if in(n1)/2i\leq n(n-1)/2, then (𝚡^r)i(\hat{\mathtt{x}}_{r})_{i} remains at zero; if i[n(n1)/2+1,n(n+1)/2]i\in[n(n-1)/2+1,n(n+1)/2], (𝚡^r)i(\hat{\mathtt{x}}_{r})_{i} is fixed at (𝚞^r)a/M(\hat{\mathtt{u}}_{r})_{a}/M, where a=in(n1)/2a=i-n(n-1)/2. Some coordinates among {(𝚡^)i:i𝚌𝚘𝚛𝚎(𝚡^r)}\bigl{\{}(\hat{\mathtt{x}})_{i}:i\in\mathtt{core}(\hat{\mathtt{x}}_{r})\bigr{\}}, which are positive, may decrease. The feasibility constraints are thus preserved for a sufficiently small extent of the movement. By (8), f^(𝚡^r),𝚎1h=𝚟^r,𝚎1h\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\mathtt{e}_{1}\!\downarrow\!h\rangle=\langle\hat{\mathtt{v}}_{r},\mathtt{e}_{1}\!\downarrow\!h\rangle, which is equal to 𝚟^rKr,𝚎1h\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\mathtt{e}_{1}\!\downarrow\!h\rangle as 𝚎1hKr\mathtt{e}_{1}\!\downarrow\!h\in K_{r}. By (7), f^(𝚡^r),𝚎1h23(𝚟r)1<0\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\mathtt{e}_{1}\!\downarrow\!h\rangle\leq\frac{2}{3}(\mathtt{v}_{r})_{1}<0. Clearly, 𝚎1h1\|\mathtt{e}_{1}\!\downarrow\!h\|\leq 1. So f^(𝚡^r),𝚗^rf^(𝚡^r),𝚎1h23(𝚟r)1<0\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle\leq\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\mathtt{e}_{1}\!\downarrow\!h\rangle\leq\frac{2}{3}(\mathtt{v}_{r})_{1}<0. ∎
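As a small numerical illustration of the projection in (6), the following is a minimal MATLAB sketch with toy data, not the construction in the paper; the matrix K below merely stands in for a matrix whose columns are the orthonormal vectors \mathtt{k}_{1},\ldots,\mathtt{k}_{n}.

m = 20; n = 4;                    % toy dimensions only
K = orth(randn(m, n));            % columns play the role of k_1, ..., k_n
e1 = zeros(m, 1); e1(1) = 1;
proj = e1 - K*(K'*e1);            % e_1 projected onto h, as in (6)
n_r = proj / norm(proj);          % the unit direction n_r
disp(norm(K'*n_r));               % sanity check: n_r is orthogonal to every column of K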

5.3 Convergence

Our analysis takes a detour via the extended system. Since the extended system is also a non-negative quadratic programming problem, Lemma 1 is applicable. The natural goal is thus to use the descent direction 𝚗^r\hat{\mathtt{n}}_{r} from 𝚡^r\hat{\mathtt{x}}_{r} and prove lower bounds for 𝚢^𝚡^r/𝚡^r𝚡^\|\hat{\mathtt{y}}-\hat{\mathtt{x}}_{r}\|/{\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\|} and f^(𝚡^r),𝚗^r/f^(𝚡^r),𝚗^\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle/{{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{*}\rangle}}, where 𝚢^\hat{\mathtt{y}} is the feasible point that minimizes f^\hat{f} in direction 𝚗^r\hat{\mathtt{n}}_{r} from 𝚡^r\hat{\mathtt{x}}_{r}, 𝚡^\hat{\mathtt{x}}_{*} is the optimal solution of the extended system, and 𝚗^=(𝚡^𝚡^r)/𝚡^𝚡^r\hat{\mathtt{n}}_{*}=(\hat{\mathtt{x}}_{*}-\hat{\mathtt{x}}_{r})/\|\hat{\mathtt{x}}_{*}-\hat{\mathtt{x}}_{r}\|.

We prove a lower bound for f^(𝚡^r),𝚗^r/f^(𝚡^r),𝚗^\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle/{{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{*}\rangle}}. As in the case of NNLS and ZHLG, we do not know a lower bound for 𝚢^𝚡^r/𝚡^r𝚡^\|\hat{\mathtt{y}}-\hat{\mathtt{x}}_{r}\|/{\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\|}. Therefore, we derive a convergence result based on the assumption that 𝚢^𝚡^r/𝚡^r𝚡^\|\hat{\mathtt{y}}-\hat{\mathtt{x}}_{r}\|/{\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\|} is bounded from below.

Let SS be an active set that is a subset of [n(n1)/2][n(n-1)/2] disjoint from {1}𝚌𝚘𝚛𝚎(𝚡^r)\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}). So 𝚗^r\hat{\mathtt{n}}_{r} is a descent direction from 𝚡^r\hat{\mathtt{x}}_{r} with respect to SS. Let 𝚡^\hat{\mathtt{x}} be any feasible solution of the extended system with respect to SS such that f^(𝚡^)<f^(𝚡^r)\hat{f}(\hat{\mathtt{x}})<\hat{f}(\hat{\mathtt{x}}_{r}). Let 𝚗^\hat{\mathtt{n}} be the unit vector from 𝚡^r\hat{\mathtt{x}}_{r} towards 𝚡^\hat{\mathtt{x}}. We bound the ratio f^(𝚡^r),𝚗^r/f^(𝚡^r),𝚗^\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle/\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}\rangle from below. Note that 𝚗^\hat{\mathtt{n}} is a descent direction from 𝚡^r\hat{\mathtt{x}}_{r} with respect to SS. Therefore, both f^(𝚡^r),𝚗^r\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle and f^(𝚡^r),𝚗^\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}\rangle are negative, and the ratio is thus positive.

Lemma 11.

Let 𝚡^\hat{\mathtt{x}} be any feasible solution of the extended system with respect to an active set SS that is a subset of [n(n1)/2][n(n-1)/2] disjoint from {1}𝚌𝚘𝚛𝚎(𝚡^r)\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}). Assume that f^(𝚡^)<f^(𝚡^r)\hat{f}(\hat{\mathtt{x}})<\hat{f}(\hat{\mathtt{x}}_{r}). Let 𝚗^=(𝚡^𝚡^r)/𝚡^𝚡^r\hat{\mathtt{n}}=(\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r})/\|\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\|. For a sufficiently large MM,

f^(𝚡^r),𝚗^rf^(𝚡^r),𝚗^12n(n1).\frac{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}_{r}\rangle}{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\hat{\mathtt{n}}\rangle}\geq\frac{1}{\sqrt{2n(n-1)}}.
Proof.

As in the proof of Lemma 10, let Kr=span({𝚎i:iEr𝚌𝚘𝚛𝚎(𝚡^r)})K_{r}=\mathrm{span}\bigl{(}\bigl{\{}\mathtt{e}_{i}:i\in E_{r}\cup\mathtt{core}(\hat{\mathtt{x}}_{r})\bigr{\}}\bigr{)}. Therefore, for all 𝚢Kr\mathtt{y}\in K_{r}, if iEr𝚌𝚘𝚛𝚎(𝚡^r)i\not\in E_{r}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}), then (𝚢)i=0(\mathtt{y})_{i}=0. We claim that there exists δ0\delta\geq 0 that approaches zero as MM increases such that

𝚟^r,𝚡^𝚡^r𝚟^rKr,𝚡^𝚡^r+δ.\langle-\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\rangle\leq\langle-\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\rangle+\delta. (9)

Since 𝚟^r,𝚡^𝚡^r=𝚟^rKr,𝚡^𝚡^r+𝚟^r+𝚟^rKr,𝚡^𝚡^r\langle-\hat{\mathtt{v}}_{r},\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\rangle=\langle-\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\rangle+\langle-\hat{\mathtt{v}}_{r}+\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\rangle, it suffices to prove that 𝚟^r+𝚟^rKr,𝚡^𝚡^rδ\langle-\hat{\mathtt{v}}_{r}+\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\rangle\leq\delta.

If i𝚌𝚘𝚛𝚎(𝚡^r)Eri\in\mathtt{core}(\hat{\mathtt{x}}_{r})\cup E_{r}, then (𝚟^rKr)i=(𝚟^r)i(\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}=(\hat{\mathtt{v}}_{r})_{i}; otherwise, (𝚟^rKr)i=0(\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}=0. Therefore, if i𝚌𝚘𝚛𝚎(𝚡^r)Eri\in\mathtt{core}(\hat{\mathtt{x}}_{r})\cup E_{r}, then (𝚟^r+𝚟^rKr)i=0(-\hat{\mathtt{v}}_{r}+\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}=0; otherwise, by Lemma 8, (𝚟^r+𝚟^rKr)i=(𝚟^r)i0(-\hat{\mathtt{v}}_{r}+\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}=-(\hat{\mathtt{v}}_{r})_{i}\leq 0.

Take any i𝚌𝚘𝚛𝚎(𝚡^r)Eri\not\in\mathtt{core}(\hat{\mathtt{x}}_{r})\cup E_{r}. If i[n(n1)/2]i\in[n(n-1)/2], then i𝚜𝚞𝚙𝚙(𝚡r)i\not\in\mathtt{supp}(\mathtt{x}_{r}) and (𝚡^r)i=(𝚡r)i=0(\hat{\mathtt{x}}_{r})_{i}=(\mathtt{x}_{r})_{i}=0 by Lemma 8, which implies that (𝚡^𝚡^r)i0(\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r})_{i}\geq 0 because 𝚡^0n(n+3)/2\hat{\mathtt{x}}\geq 0_{n(n+3)/2} by the primal feasibility constraint. If i[n(n1)/2]i\not\in[n(n-1)/2], then i[n(n1)/2+1,n(n+1)/2]i\in[n(n-1)/2+1,n(n+1)/2], so (𝚡^r)i=(𝚞r)a/M(\hat{\mathtt{x}}_{r})_{i}=(\mathtt{u}_{r})_{a}/M, where a=in(n1)/2a=i-n(n-1)/2, which implies that (𝚡^𝚡^r)i(𝚞r)a/M(\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r})_{i}\geq-(\mathtt{u}_{r})_{a}/M.

We conclude that either (𝚟^r+𝚟^rKr)i(𝚡^𝚡^r)i(-\hat{\mathtt{v}}_{r}+\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r})_{i}\cdot(\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r})_{i} is negative, or it is at most some negligible positive value for a large enough MM. As a result, 𝚟^r+𝚟^rKr,𝚡^𝚡^rδ\langle-\hat{\mathtt{v}}_{r}+\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\rangle\leq\delta. This completes the proof of (9).

By (8), Lemma 10, and (9), we have

f^(𝚡^r),𝚗^rf^(𝚡^r),𝚗^=f^(𝚡^r),𝚗^r𝚟^r,𝚗^2(𝚟r)13𝚟^r,𝚗^2(𝚟r)13𝚟^rKr,𝚗^+3δ/𝚡^𝚡^r.\frac{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\,\hat{\mathtt{n}}_{r}\rangle}{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\,\hat{\mathtt{n}}\rangle}=\frac{-\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\,\hat{\mathtt{n}}_{r}\rangle}{-\langle\hat{\mathtt{v}}_{r},\hat{\mathtt{n}}\rangle}\geq\frac{-2(\mathtt{v}_{r})_{1}}{-3\langle\hat{\mathtt{v}}_{r},\hat{\mathtt{n}}\rangle}\geq\frac{-2(\mathtt{v}_{r})_{1}}{-3\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{n}}\rangle+3\delta/\|\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\|}.

Observe that 𝚟^rKr,𝚗^-\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{n}}\rangle is positive because, by (8) and (9), 𝚟^rKr,𝚗^𝚟^r,𝚗^δ/𝚡^𝚡^r=f^(𝚡^r),𝚗^δ/𝚡^𝚡^r-\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{n}}\rangle\geq-\langle\hat{\mathtt{v}}_{r},\hat{\mathtt{n}}\rangle-\delta/\|\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\|=-\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\,\hat{\mathtt{n}}\rangle-\delta/\|\hat{\mathtt{x}}-\hat{\mathtt{x}}_{r}\| which is positive for a large enough MM. As 𝚟^rKr,𝚗^=𝚟^rKr,𝚗^-\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},\hat{\mathtt{n}}\rangle=\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},-\hat{\mathtt{n}}\rangle and 𝚗^-\hat{\mathtt{n}} is a unit vector, we have 𝚟^rKr,𝚗^𝚟^rKr\langle\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r},-\hat{\mathtt{n}}\rangle\leq\|\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r}\|. We conclude that for a large enough MM, we have

f^(𝚡^r),𝚗^rf^(𝚡^r),𝚗^(𝚟r)12𝚟^rKr.\frac{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\,\hat{\mathtt{n}}_{r}\rangle}{\langle\nabla\hat{f}(\hat{\mathtt{x}}_{r}),\,\hat{\mathtt{n}}\rangle}\geq\frac{-(\mathtt{v}_{r})_{1}}{2\|\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r}\|}. (10)

We analyze 𝚟^rKr\|\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r}\|. Take any i{1}𝚌𝚘𝚛𝚎(𝚡^r)i\in\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}). There are three cases.

  • i[n(n1)/2]i\in[n(n-1)/2]. Then iEr𝚜𝚞𝚙𝚙(𝚡r)i\in E_{r}\cup\mathtt{supp}(\mathtt{x}_{r}) and (𝚟^r)i=(𝚟r)i(\hat{\mathtt{v}}_{r})_{i}=(\mathtt{v}_{r})_{i} by Lemma 8, which means that (𝚟^r)i=(𝚟r)i(\hat{\mathtt{v}}_{r})_{i}=(\mathtt{v}_{r})_{i} contributes to 𝚟^rKr\|\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r}\| if and only if iEri\in E_{r}.

  • i[n(n1)/2+1,n(n+1)/2]i\in[n(n-1)/2+1,n(n+1)/2]. Then, ii must belong to 𝚌𝚘𝚛𝚎(𝚡^r)\mathtt{core}(\hat{\mathtt{x}}_{r}), which implies that (𝙱𝚡r)a>1(\mathtt{B}\mathtt{x}_{r})_{a}>1, where a=in(n1)/2a=i-n(n-1)/2. By the complementary slackness of the original system, (𝚞r)a=0(\mathtt{u}_{r})_{a}=0 which implies that (𝚟^r)i=(𝚞r)a=0(\hat{\mathtt{v}}_{r})_{i}=(\mathtt{u}_{r})_{a}=0 by Lemma 8.

  • i>n(n+1)/2i>n(n+1)/2. Then, (𝚟^r)i=0(\hat{\mathtt{v}}_{r})_{i}=0 by Lemma 8.

We conclude that 𝚟^rKr2=iEr(𝚟^r)i2n(n1)/2(𝚟^r)12\|\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r}\|^{2}=\sum_{i\in E_{r}}(\hat{\mathtt{v}}_{r})_{i}^{2}\leq n(n-1)/2\cdot(\hat{\mathtt{v}}_{r})_{1}^{2}. Substituting into (10) gives the lemma. ∎
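Spelled out, the final substitution into (10) uses (\hat{\mathtt{v}}_{r})_{1}=(\mathtt{v}_{r})_{1} and the bound just derived:

\frac{-(\mathtt{v}_{r})_{1}}{2\|\hat{\mathtt{v}}_{r}\!\downarrow\!K_{r}\|}\;\geq\;\frac{-(\mathtt{v}_{r})_{1}}{2\sqrt{n(n-1)/2}\,\bigl|(\mathtt{v}_{r})_{1}\bigr|}\;=\;\frac{1}{2\sqrt{n(n-1)/2}}\;=\;\frac{1}{\sqrt{2n(n-1)}}.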

Let Sr+1S_{r+1} be the active set used by SolveNNQ for the next iteration on the original system. Note that Sr+1S_{r+1} is a subset of [n(n1)/2][n(n-1)/2] disjoint from {1}𝚜𝚞𝚙𝚙(𝚡r)\{1\}\cup\mathtt{supp}(\mathtt{x}_{r}). It follows that Sr+1S_{r+1} is disjoint from {1}𝚌𝚘𝚛𝚎(𝚡^r)\{1\}\cup\mathtt{core}(\hat{\mathtt{x}}_{r}) as well. Let 𝚢^r\hat{\mathtt{y}}_{r} be the point that minimizes f^\hat{f} in the direction 𝚗^r\hat{\mathtt{n}}_{r} from 𝚡^r\hat{\mathtt{x}}_{r}. Let 𝚡^\hat{\mathtt{x}}_{*} be the optimal solution of the extended system (with an empty active set). Recall 𝚡r+1\mathtt{x}_{r+1} is the solution of the original system with respect to Sr+1S_{r+1} computed by SolveNNQ. Let 𝚡\mathtt{x}_{*} be the optimal DKSG solution. We prove that if 𝚢^r𝚡^r1λ𝚡^r𝚡^\|\hat{\mathtt{y}}_{r}-\hat{\mathtt{x}}_{r}\|\geq\frac{1}{\lambda}\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\|, then f(𝚡r+1)f(\mathtt{x}_{r+1}) is closer to f(𝚡)f(\mathtt{x}_{*}) than f(𝚡r)f(\mathtt{x}_{r}) by a factor of 1Θ(1λn(n1))1-\Theta(\frac{1}{\lambda\sqrt{n(n-1)}}).

Lemma 12.

If 𝚢^r𝚡^r1λ𝚡^r𝚡^\|\hat{\mathtt{y}}_{r}-\hat{\mathtt{x}}_{r}\|\geq\frac{1}{\lambda}\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\| where λ\lambda is a fixed value independent of MM, then for a large enough MM,

f(𝚡r+1)f(𝚡)f(𝚡r)f(𝚡)113λ2n(n1).\dfrac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_{*})}{f(\mathtt{x}_{r})-f(\mathtt{x}_{*})}\leq 1-\dfrac{1}{3\lambda\sqrt{2n(n-1)}}.
Proof.

Let 𝚣^r+1\hat{\mathtt{z}}_{r+1} be the optimal solution of the extended system with respect to the active set Sr+1S_{r+1}. By Lemmas 1 and 11 and the assumption that 𝚢^r𝚡^r1λ𝚡^r𝚡^\|\hat{\mathtt{y}}_{r}-\hat{\mathtt{x}}_{r}\|\geq\frac{1}{\lambda}\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\|, we have

f^(𝚡^r)f^(𝚣^r+1)f^(𝚡^r)f^(𝚡^)12λ2n(n1).\frac{\hat{f}(\hat{\mathtt{x}}_{r})-\hat{f}(\hat{\mathtt{z}}_{r+1})}{\hat{f}(\hat{\mathtt{x}}_{r})-\hat{f}(\hat{\mathtt{x}}_{*})}\geq\frac{1}{2\lambda\sqrt{2n(n-1)}}.

We obtain a feasible solution 𝚣r+1n(n1)/2\mathtt{z}_{r+1}\in\mathbb{R}^{n(n-1)/2} of the original system from 𝚣^r+1\hat{\mathtt{z}}_{r+1} as follows. Let 𝚣~r+1\tilde{\mathtt{z}}_{r+1} be the projection of 𝚣^r+1\hat{\mathtt{z}}_{r+1} to the first n(n1)/2n(n-1)/2 coordinates. First, we set 𝚣r+1=𝚣~r+1\mathtt{z}_{r+1}=\tilde{\mathtt{z}}_{r+1}. Second, for each j>n(n+1)/2j>n(n+1)/2, if (𝚣^r+1)j>1(\hat{\mathtt{z}}_{r+1})_{j}>1, then let a=jn(n+1)/2a=j-n(n+1)/2, we pick an arbitrary i𝚜𝚞𝚙𝚙(𝚣r+1)i\in\mathtt{supp}(\mathtt{z}_{r+1}) such that (𝙱)a,i=1(\mathtt{B})_{a,i}=1, and we increase (𝚣r+1)i(\mathtt{z}_{r+1})_{i} by M(𝚣^r+1)jMM(\hat{\mathtt{z}}_{r+1})_{j}-M. This completes the determination of 𝚣r+1\mathtt{z}_{r+1}. Since we move the excess of the dummy variables over 1 to the positive coordinates of 𝚣r+1\mathtt{z}_{r+1}, one can verify that 𝙱𝚣r+11n\mathtt{B}\mathtt{z}_{r+1}\geq 1_{n}. So 𝚣r+1\mathtt{z}_{r+1} is a feasible solution of the original system with respect to Sr+1S_{r+1}. Note that for any j>n(n+1)/2j>n(n+1)/2, M(𝚣^r+1)jMM(\hat{\mathtt{z}}_{r+1})_{j}-M must approach zero as MM increases. Otherwise, as f^(𝚣^r+1)\hat{f}(\hat{\mathtt{z}}_{r+1}) consists of the term 12M3((𝚣^r+1)j1)2=12M(M(𝚣^r+1)jM)2\frac{1}{2}M^{3}\cdot((\hat{\mathtt{z}}_{r+1})_{j}-1)^{2}=\frac{1}{2}M\cdot(M(\hat{\mathtt{z}}_{r+1})_{j}-M)^{2}, f^(𝚣^r+1)\hat{f}(\hat{\mathtt{z}}_{r+1}) would tend to \infty as MM increases, contradicting the fact that f^(𝚣^r+1)f^(𝚡^r)\hat{f}(\hat{\mathtt{z}}_{r+1})\leq\hat{f}(\hat{\mathtt{x}}_{r}) which is bounded as MM increases. Therefore, as MM increases, f(𝚣r+1)=𝚣r+1t𝙰t𝙰𝚣r+1f(\mathtt{z}_{r+1})=\mathtt{z}_{r+1}^{t}\mathtt{A}^{t}\mathtt{A}\mathtt{z}_{r+1} tends to 𝚣~r+1t𝙰t𝙰𝚣~r+1\tilde{\mathtt{z}}_{r+1}^{t}\mathtt{A}^{t}\mathtt{A}\tilde{\mathtt{z}}_{r+1} which is the first term in f^(𝚣^r+1)\hat{f}(\hat{\mathtt{z}}_{r+1}). The second term of f^(𝚣^r+1)\hat{f}(\hat{\mathtt{z}}_{r+1}) is non-negative, which means that f(𝚣r+1)f(𝚡r+1)f^(𝚣^r+1)𝚣~r+1t𝙰t𝙰𝚣~r+1f(\mathtt{z}_{r+1})\geq f(\mathtt{x}_{r+1})\geq\hat{f}(\hat{\mathtt{z}}_{r+1})\geq\tilde{\mathtt{z}}_{r+1}^{t}\mathtt{A}^{t}\mathtt{A}\tilde{\mathtt{z}}_{r+1}. We conclude that f^(𝚣^r+1)\hat{f}(\hat{\mathtt{z}}_{r+1}) tends to f(𝚡r+1)f(\mathtt{x}_{r+1}) as MM increases.
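The construction of \mathtt{z}_{r+1} just described can be summarized by the following MATLAB sketch (assumed inputs: z_hat is \hat{\mathtt{z}}_{r+1}, B is the incidence matrix of the complete graph, M is the large constant, and n is the number of points; as in the text, the choice of the index i is arbitrary).

function z = build_z(z_hat, B, M, n)
% Sketch of the construction of z_{r+1} from the extended solution z_hat.
m = n*(n-1)/2;
z = z_hat(1:m);                                % keep the first n(n-1)/2 coordinates
for a = 1:n
    j = n*(n+1)/2 + a;                         % dummy coordinate of the a-th constraint
    if z_hat(j) > 1
        i = find(B(a,:) == 1 & z(:)' > 0, 1);  % some i in supp(z) with (B)_{a,i} = 1
        z(i) = z(i) + M*z_hat(j) - M;          % move the excess M((z_hat)_j - 1) over
    end
end
end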

By a similar reasoning, we can also show that f^(𝚡^)\hat{f}(\hat{\mathtt{x}}_{*}) tends to f(𝚡)f(\mathtt{x}_{*}) as MM increases. By Lemma 8(v), f^(𝚡^r)\hat{f}(\hat{\mathtt{x}}_{r}) tends to f(𝚡r)f(\mathtt{x}_{r}) as MM increases. Hence, for a large enough MM,

f(𝚡r)f(𝚡r+1)f(𝚡r)f(𝚡)23f^(𝚡^r)f^(𝚣^r+1)f^(𝚡^r)f^(𝚡^)13λ2n(n1).\frac{f(\mathtt{x}_{r})-f(\mathtt{x}_{r+1})}{f(\mathtt{x}_{r})-f(\mathtt{x}_{*})}\geq\frac{2}{3}\cdot\frac{\hat{f}(\hat{\mathtt{x}}_{r})-\hat{f}(\hat{\mathtt{z}}_{r+1})}{\hat{f}(\hat{\mathtt{x}}_{r})-\hat{f}(\hat{\mathtt{x}}_{*})}\geq\frac{1}{3\lambda\sqrt{2n(n-1)}}.

Hence, f(𝚡r+1)f(𝚡)f(𝚡r)f(𝚡)113λ2n(n1)\dfrac{f(\mathtt{x}_{r+1})-f(\mathtt{x}_{*})}{f(\mathtt{x}_{r})-f(\mathtt{x}_{*})}\leq 1-\dfrac{1}{3\lambda\sqrt{2n(n-1)}}. ∎
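Iterating Lemma 12 over i consecutive iterations (assuming its hypothesis holds in each of them) and using 1-x\leq e^{-x} makes the factor-e statement of Theorem 2 below explicit:

\frac{f(\mathtt{x}_{r+i})-f(\mathtt{x}_{*})}{f(\mathtt{x}_{r})-f(\mathtt{x}_{*})}\leq\Bigl(1-\frac{1}{3\lambda\sqrt{2n(n-1)}}\Bigr)^{i}\leq\exp\Bigl(-\frac{i}{3\lambda\sqrt{2n(n-1)}}\Bigr)\leq e^{-1}\quad\text{for }i\geq 3\lambda\sqrt{2n(n-1)}.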

By the definition of 𝚡^r\hat{\mathtt{x}}_{r} and as argued in the proof of Lemma 12, the contributions of MM to the coordinates of 𝚡^r\hat{\mathtt{x}}_{r}, 𝚢^r\hat{\mathtt{y}}_{r}, and 𝚡^\hat{\mathtt{x}}_{*} become negligible as MM increases. Therefore, whether there exists a fixed λ\lambda in Lemma 12 such that 𝚢^r𝚡^r1λ𝚡^r𝚡^\|\hat{\mathtt{y}}_{r}-\hat{\mathtt{x}}_{r}\|\geq\frac{1}{\lambda}\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\| is unaffected by MM when MM is large enough. Still, we have to leave the condition 𝚢^r𝚡^r1λ𝚡^r𝚡^\|\hat{\mathtt{y}}_{r}-\hat{\mathtt{x}}_{r}\|\geq\frac{1}{\lambda}\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\| as an assumption to obtain the convergence result.

Theorem 2.

Consider the application of SolveNNQ on the DKSG problem with nn points in d\mathbb{R}^{d}. The initialization of SolveNNQ can be done in O(dn4)+T(n+β0)O(dn^{4})+T(n+\beta_{0}) time. Suppose that the following assumptions hold.

  • The threshold β1\beta_{1} on the total number of iterations is not exceeded.

  • For all r1r\geq 1, 𝚡^r𝚢^r1λ𝚡^r𝚡^\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{y}}_{r}\|\geq\frac{1}{\lambda}\|\hat{\mathtt{x}}_{r}-\hat{\mathtt{x}}_{*}\|, where 𝚡^r\hat{\mathtt{x}}_{r} is defined as in Lemma 8, 𝚗^r\hat{\mathtt{n}}_{r} is a unit descent direction from 𝚡^r\hat{\mathtt{x}}_{r} that satisfies Lemma 10, 𝚢^r\hat{\mathtt{y}}_{r} is the feasible point of the extended system that minimizes f^\hat{f} in the direction 𝚗^r\hat{\mathtt{n}}_{r} from 𝚡^r\hat{\mathtt{x}}_{r}, and 𝚡^\hat{\mathtt{x}}_{*} is the optimal solution of the extended system.

Then, for all r1r\geq 1, f(𝚡r+i)f(𝚡)e1(f(𝚡r)f(𝚡))f(\mathtt{x}_{r+i})-f(\mathtt{x}_{*})\leq e^{-1}(f(\mathtt{x}_{r})-f(\mathtt{x}_{*})) for some i=O(λn)i=O\bigl{(}\lambda n\bigr{)}, and each iteration in SolveNNQ takes T(k+β0)T(k+\beta_{0}) time, where k=maxr1|𝚜𝚞𝚙𝚙(𝚡r)|k=\max_{r\geq 1}|\mathtt{supp}(\mathtt{x}_{r})|.

Proof.

By Lemma 12, the gap f(𝚡r)f(𝚡)f(\mathtt{x}_{r})-f(\mathtt{x}_{*}) decreases by a factor ee in O(λn)O(\lambda n) iterations. Thus, f(𝚡r+i)f(𝚡)e1(f(𝚡r)f(𝚡))f(\mathtt{x}_{r+i})-f(\mathtt{x}_{*})\leq e^{-1}(f(\mathtt{x}_{r})-f(\mathtt{x}_{*})) for some i=O(λn)i=O\bigl{(}\lambda n\bigr{)}. It remains to analyze the running times.

We first compute 𝙰t𝙰\mathtt{A}^{t}\mathtt{A}. Recall that 𝙰\mathtt{A} is dn×n(n1)/2dn\times n(n-1)/2, and 𝙰\mathtt{A} resembles the incidence matrix for the complete graph in the sense that every non-zero entry of the incidence matrix gives rise to dd non-zero entries in the same column in 𝙰\mathtt{A}. There are at most two non-zero entries in each column of the incidence matrix. Therefore, computing 𝙰t𝙰\mathtt{A}^{t}\mathtt{A} takes O(dn4)O(dn^{4}) time. We form an initial graph as follows. Randomly include β0\beta_{0} edges by sampling indices from [n(n1)/2]\bigl{[}n(n-1)/2\bigr{]}. Also, pick a node and connect it to all other n1n-1 nodes in order to guarantee that the initial graph admits a feasible solution for the DKSG problem. There are at most n+β01n+\beta_{0}-1 edges. The initial active set contains all indices in the range [n(n1)/2]\bigl{[}n(n-1)/2\bigr{]} except for the indices of the edges of the initial graph. We extract in O(n2+β02)O(n^{2}+\beta_{0}^{2}) time the rows and columns of 𝙰t𝙰\mathtt{A}^{t}\mathtt{A} that correspond to the edges of the initial graph. Then, we call the solver in T(n+β0)T(n+\beta_{0}) time to obtain the initial solution (𝚞1,𝚡1)(\mathtt{u}_{1},\mathtt{x}_{1}). Therefore, the initialization time is O(dn4)+T(n+β0)O(dn^{4})+T(n+\beta_{0}).
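A minimal MATLAB sketch of this initialization follows; it assumes that AtA stores \mathtt{A}^{t}\mathtt{A} and B stores the incidence matrix, edge_of(i,j) is a hypothetical helper that maps a pair of points to the column index of the edge \mathtt{p}_{i}\mathtt{p}_{j}, and quadprog serves as the solver (its lambda output would supply the dual multipliers \mathtt{u}_{1}).

m    = n*(n-1)/2;
star = arrayfun(@(j) edge_of(1, j), 2:n);   % connect point 1 to all others: guarantees feasibility
E1   = unique([star, randperm(m, beta0)]);  % edges of the initial graph
S1   = setdiff(1:m, E1);                    % initial active set: all remaining edge indices
k1   = numel(E1);
x1E  = quadprog(2*AtA(E1,E1), zeros(k1,1), ...  % minimize x'*(A'A)*x on the kept edges
                -B(:,E1), -ones(n,1), ...       % subject to B*x >= 1_n
                [], [], zeros(k1,1), []);       % and x >= 0
x1     = zeros(m,1);
x1(E1) = x1E;                               % lift back to the full coordinate space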

At the beginning of every iteration, we compute f(𝚡r)\nabla f(\mathtt{x}_{r}) and update the active set. This can be done in O(kn2)O(kn^{2}) time because 𝚡r\mathtt{x}_{r} has at most kk non-zero entries. Since the threshold β1\beta_{1} is not exceeded, at most k+β0k+\beta_{0} indices are absent from the updated active set. We extract the corresponding rows and columns of 𝙰t𝙰\mathtt{A}^{t}\mathtt{A} in O(k2+β02)O(k^{2}+\beta_{0}^{2}) time. The subsequent call of the solver takes T(k+β0)T(k+\beta_{0}) time, which dominates the other processing steps that take O(kn2+k2+β02)O(kn^{2}+k^{2}+\beta_{0}^{2}) time. So each iteration runs in T(k+β0)T(k+\beta_{0}) time. ∎
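For illustration, one iteration could look like the following sketch, which continues the assumed setup above; the rule shown for releasing tau indices (the most negative gradient entries outside the support) is a simplified stand-in for SolveNNQ's actual active-set update, not a faithful reproduction of it.

g        = 2*AtA(:, x_r > 0) * x_r(x_r > 0);  % gradient of f at x_r, exploiting sparsity
free     = find(x_r > 0)';                    % support of x_r: never placed in the active set
[~, ord] = sort(g);                           % indices ordered by increasing gradient entry
rel      = setdiff(ord', free, 'stable');     % candidates currently held in the active set
E        = union(free, rel(1:tau));           % release tau of them for the next subproblem
S        = setdiff(1:m, E);                   % updated active set
k        = numel(E);
xE       = quadprog(2*AtA(E,E), zeros(k,1), -B(:,E), -ones(n,1), ...
                    [], [], zeros(k,1), []);
x_next    = zeros(m,1);
x_next(E) = xE;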

6 Conclusion

We presented a method that solves non-negative convex quadratic programming problems by calling a numerical solver iteratively. This method is very efficient when the intermediate and final solutions have many zero coordinates. We experimented with two proximity graph problems and the non-negative least square problem in image deblurring. The method is often several times faster than a single call of quadprog of MATLAB. From our computational experience, we obtain a speedup when the percentage of positive coordinates in the final solution drops below 50%, and the speedup increases significantly as this percentage decreases. We proved that the iterative method converges efficiently under the stated assumption. We checked the assumption in our experiments on proximity graphs and confirmed that it holds. A possible research problem is to prove a better convergence result under weaker assumptions. The runtime of our method is sensitive to solution sparsity. As far as we know, we are the first to introduce a parameter τ\tau explicitly to control the size of the intermediate problems. Another possible line of research is to explore the choice of τ\tau given some prior knowledge of the solution sparsity.
