
Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization

Derek Fox Computer Science
UNC Greensboro, USA
defox@uncg.edu
   Samuel Hernandez Mathematics
Texas A&M University, USA
samuelhq11@tamu.edu
   Qianqian Tong Computer Science
UNC Greensboro, USA
q_tong@uncg.edu
Abstract

Stochastic optimization algorithms are widely used for large-scale data analysis due to their low per-iteration costs, but they often suffer from slow asymptotic convergence caused by inherent variance. Variance-reduced techniques have therefore been used to address this issue in structured sparse models utilizing sparsity-inducing norms or $\ell_{0}$-norms. However, these techniques are not directly applicable to complex (non-convex) graph sparsity models, which are essential in applications like disease outbreak monitoring and social network analysis. In this paper, we introduce two stochastic variance-reduced gradient-based methods to solve graph sparsity optimization: GraphSVRG-IHT and GraphSCSG-IHT. We provide a general framework for theoretical analysis, demonstrating that our methods enjoy a linear convergence speed. Extensive experiments validate the efficiency and effectiveness of our proposed algorithms.

I Introduction

Graph structures enable the imposition of intricate sparsity constraints on a model, allowing it to better reflect relationships present in the data. For instance, graph-structured sparsity models are well-suited for predicting the spread of diseases or identifying social groups in networks. The search for connected subgraphs or clusters has a significant impact on identifying disease-related genes [2, 22, 1, 23]. Graph sparsification, which aims to reduce the complexity of large-scale graphs while preserving their essential structural properties, has garnered increasing attention as a crucial technique in modern data analysis and machine learning [6, 11, 12, 13, 30].

Graph sparsification can be formulated as the following optimization problem:

\min_{x\in\mathbb{R}^{p}}F(x),\quad F(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), (1)

which is known as the empirical risk minimization problem. Each $f_{i}(x)$ $(i\in[n])$ is convex and differentiable, and the graph-structured sparsity is imposed as a constraint on the support of $x\in\mathbb{R}^{p}$. The input vector $x$ denotes the parameters of the model, and the output $f_{i}(x)$ is defined as the loss associated with sample $i$. By minimizing the loss $F(x)$, we guide the model towards the optimal solution. Typically, sparsity can be encoded by adding sparsity-inducing norms or penalties such as the $\ell_{0}$ norm, the $\ell_{1}$ norm, and mixed norms [7, 10, 29, 16, 24, 26, 8, 25, 28]. These models often involve convex penalties and can be solved using convex optimization algorithms [4, 3, 5]. However, dealing with more complex sparse settings, such as graph-structured models, is more challenging.

In stochastic optimization, iterative hard thresholding (IHT) methods include gradient descent IHT (GD-IHT) [16], stochastic gradient descent IHT (SGD-IHT) [21], hybrid stochastic gradient IHT (HSG-IHT) [31], and variance-reduced methods such as stochastic variance reduced gradient IHT (SVRG-IHT) [19] and stochastically controlled stochastic gradient IHT (SCSG-IHT) [20]. These methods update the parameter iterate $x$ via gradient descent or its variants, and then apply a hard thresholding (HT) operator to enforce sparsity of $x$, preserving the top $s$ elements of $x$ while setting the others to zero. In the context of graph-structured sparse optimization, the stochastic gradient descent based IHT method, named GraphSto-IHT, achieves HT through the use of Head and Tail Projections, first described by [30]. Head and Tail Projections map arbitrary vectors from the data onto the graph while simultaneously enforcing model sparsity [12, 13, 14]. Specifically, Head Projection identifies and preserves the largest entries in $x$, while Tail Projection identifies the smallest entries and sets them to zero. By ignoring the small-magnitude entries in the vector, these projections help prevent overfitting and ensure sparse solutions. Meanwhile, stochastic sampling of the data is used to speed up gradient calculations: a single data point or small batch is selected and a gradient is computed only with respect to that batch, which greatly decreases the computational cost of the gradient calculations. However, if the selected batch does not represent the whole dataset, the gradient may not accurately point towards a local minimum of the function, introducing variance into the gradient descent process.
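For concreteness, the plain (non-graph) hard thresholding operator used by IHT-type methods can be sketched in a few lines of Python; the function name hard_threshold and this standalone snippet are ours, for illustration only, and are not part of any released implementation.

import numpy as np

def hard_threshold(x, s):
    # Keep the s largest-magnitude entries of x and zero out the rest.
    # This is the plain HT step of IHT/StoIHT; the graph-structured methods
    # below replace it with approximate Head and Tail Projections.
    z = np.zeros_like(x)
    if s <= 0:
        return z
    top = np.argpartition(np.abs(x), -s)[-s:]  # indices of the top-s entries
    z[top] = x[top]
    return z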

To reduce the randomness inherent in SGD, variance reduction techniques may be used such as SVRG [17], SAGA [9], or SCSG [18]. During each iteration, the history of the stochastic process is considered to regulate the newly calculated gradient and minimize large changes in direction. This improvement on SGD is our main interest; both proposed algorithms utilize this technique to leverage fast gradient calculations while still enjoying quick convergence. In this paper we leverage the recent success of stochastic variance-reduced algorithms for non-convex problems and propose a series of efficient stochastic optimization algorithms for graph-structured sparsity constraint problems. Specifically, we introduce two new stochastic variance-reduced gradient based methods, GraphSVRG-IHT and GraphSCSG-IHT, designed to solve graph sparsity optimization problems. By incorporating stochastic variance-reduced techniques and graph approximated projections (head and tail), our algorithms are specifically tailored for non-convex graph-structured sparsity constraints, leading to faster convergence and improved performance. We provide a comprehensive theoretical framework and conduct extensive experiments to validate the efficiency and effectiveness of our proposed methods.
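To make the variance-reduction step concrete, recall the SVRG-style estimator that both of our algorithms build on: for a snapshot point $\tilde{x}$ with full gradient $\nabla F(\tilde{x})$ and a uniformly sampled index $i$, the estimator $v=\nabla f_{i}(x)-\nabla f_{i}(\tilde{x})+\nabla F(\tilde{x})$ satisfies

\mathbb{E}_{i}[v]=\nabla F(x)-\nabla F(\tilde{x})+\nabla F(\tilde{x})=\nabla F(x),

so it remains an unbiased estimate of the full gradient while its variance shrinks as $x$ and $\tilde{x}$ approach the same solution.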

Our main contributions are summarized as follows.

  • This work is the first to explore the application of stochastic variance-reduced methods to graph-structured sparsity problems. By using batch gradients to approximate computationally expensive full gradients, we enhance the efficiency of variance reduction. GraphSCSG-IHT builds on GraphSVRG-IHT by parameterizing the variance-reduction technique for greater control: it employs two different batch sizes for gradient calculations and randomly selects the number of inner updates from a geometric distribution.

  • We theoretically prove that GraphSCSG-IHT enjoys a linear convergence rate with a constant learning rate. This analysis provides a robust theoretical framework for analyzing graph sparsification optimization.

  • We conduct extensive experiments to validate our proposed algorithms. In addition to simulation tests, we also evaluate our methods on a real-world breast cancer dataset. Our experiments empirically demonstrate the efficiency and effectiveness of our methods compared to GraphSto-IHT and other deterministic methods.

II Preliminaries

Notations. We use lowercase letters, e.g., $x$, to denote a vector and use $\|\cdot\|$ to denote the $\ell_{2}$-norm of a vector. The operator $\mathbb{E}$ represents taking expectation over all random variables, and $[n]$ denotes the integer set $\{1,\ldots,n\}$. The notation $\mathrm{supp}(x)$ means the support of $x$, i.e., the index set of non-zero elements in $x$. Other important parameters are listed in Appendix A.

Definition 1.

(Subspace model) [14] Given the space $\mathbb{R}^{p}$, a subspace model $\mathbb{M}$ is defined as a family of linear subspaces of $\mathbb{R}^{p}$:

\mathbb{M}=\{S_{1},S_{2},\ldots,S_{k},\ldots\},

where each $S_{k}$ is a subspace of $\mathbb{R}^{p}$. The set of corresponding vectors in these subspaces is denoted as

\mathcal{M}(\mathbb{M})=\{x:x\in V\text{ for some }V\in\mathbb{M}\}.
Definition 2.

(Weighted graph model) [13] Given an underlying graph $G=(V,E)$ defined on the coefficients of the unknown vector $x$, where $V=[p]$ and $E\subseteq V\times V$, the weighted graph model $(G,s,g,C)$-WGM can be defined as the following set of supports

\mathbb{M}=\{S:|S|\leq s,\text{ there is an }F\subseteq V\text{ with }V_{F}=S,\ \gamma(F)=g,\text{ and }w(F)\leq C\},

where $C$ is the budget on the weight $w$ of edges, $g$ is the number of connected components of $F$, and $s$ is the sparsity.
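As an illustrative instance (ours, not taken from [13]): consider a path graph $G$ with nodes $V=\{1,2,3,4\}$, edges $1\text{--}2$, $2\text{--}3$, $3\text{--}4$, and unit edge weights. The support $S=\{1,2,3\}$ belongs to the $(G,3,1,2)$-WGM, witnessed by $F=\{1,2,3\}$ with a single connected component ($\gamma(F)=1$) and total edge weight $w(F)=2\leq C$. In contrast, $S=\{1,3\}$ only fits a model with $g=2$, since nodes 1 and 3 are not adjacent in $G$.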

Definition 3.

(Projection Operator) [14] We define a projection operator onto $\mathcal{M}(\mathbb{M})$, i.e., $P(\cdot,\mathcal{M}(\mathbb{M})):\mathbb{R}^{p}\to\mathbb{R}^{p}$, defined as

P(x,\mathcal{M}(\mathbb{M}))=\arg\min_{y\in\mathcal{M}(\mathbb{M})}\|x-y\|^{2}.

With this projection definition, we borrow two important projections: Head Projection and Tail Projection from previous literature to help us with theoretical analysis.

Assumption 1.

(Head Projection) [14] Let $\mathbb{M}$ and $\mathbb{M}_{\mathcal{H}}$ be predefined subspace models. Given any vector $x$, there exists a $(c_{H},\mathbb{M},\mathbb{M}_{\mathcal{H}})$-Head-Projection, which finds a subspace $H\in\mathbb{M}_{\mathcal{H}}$ such that

\|P(x,H)\|_{2}\geq c_{H}\cdot\max_{S\in\mathbb{M}}\|P(x,S)\|_{2},

where $0<c_{H}\leq 1$. We denote $P(x,H)$ as $P(x,\mathbb{M},\mathbb{M}_{\mathcal{H}})$.

Assumption 2.

(Tail Projection) [14] Let $\mathbb{M}$ and $\mathbb{M}_{\mathcal{T}}$ be predefined subspace models. Given any vector $x$, there exists a $(c_{T},\mathbb{M},\mathbb{M}_{\mathcal{T}})$-Tail-Projection, which finds a subspace $T\in\mathbb{M}_{\mathcal{T}}$ such that

\|P(x,T)-x\|_{2}\leq c_{T}\cdot\min_{S\in\mathbb{M}}\|x-P(x,S)\|_{2},

where $c_{T}\geq 1$. We denote $P(x,T)$ as $P(x,\mathbb{M},\mathbb{M}_{\mathcal{T}})$.

We can see that the head projection keeps large-magnitude entries, whereas the tail projection sets small-magnitude entries to zero.
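As a toy illustration (ours): in the special case of the plain $s$-sparse model, both projections can be computed exactly with $c_{H}=c_{T}=1$, and for $x=(3,-0.5,2,0.1)^{\top}$ with $s=2$ both return $(3,0,2,0)^{\top}$; the head projection because this support captures the most energy of $x$, and the tail projection because it is the closest 2-sparse vector to $x$. Under a weighted graph model the two generally differ, since connectivity and weight-budget constraints restrict which supports are admissible and only approximate projections are available.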

Definition 4.

($(\alpha,\beta,\mathcal{M}(\mathbb{M}))$-RSC/RSS Properties) [14] We say a differentiable function $f(\cdot)$ satisfies the $(\alpha,\beta,\mathcal{M}(\mathbb{M}))$-Restricted Strong Convexity (RSC)/Smoothness (RSS) property if there exist positive constants $\alpha$ and $\beta$ such that

\frac{\alpha}{2}\|x-y\|_{2}^{2}\leq B_{f}(x,y)\leq\frac{\beta}{2}\|x-y\|_{2}^{2},

for all $x,y\in\mathcal{M}(\mathbb{M})$, where $B_{f}(x,y)$ is the Bregman divergence of $f$, i.e.,

B_{f}(x,y)=f(x)-f(y)-\langle\nabla f(y),x-y\rangle.

Here $\alpha$ and $\beta$ are the restricted strong convexity parameter and the restricted smoothness parameter, respectively.
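As a standard example (not specific to this paper), for the least-squares loss $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ with data matrix $A$ and observations $b$, the Bregman divergence reduces to

B_{f}(x,y)=\frac{1}{2}\|A(x-y)\|_{2}^{2},

so the RSC/RSS constants $\alpha$ and $\beta$ over $\mathcal{M}(\mathbb{M})$ correspond to restricted eigenvalue bounds of $A^{\top}A$ along sparse directions $x-y$.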

The RSC/RSS definition is widely used in sparsity optimization. Together with the following assumption, we characterize the properties of the objective function.

Assumption 3.

[14] Given the objective function $F(x)$ in (1), we assume that $F(x)$ satisfies $\alpha$-RSC in the subspace model $\mathcal{M}(\mathbb{M}\oplus\mathbb{M}_{\mathcal{H}}\oplus\mathbb{M}_{\mathcal{T}})$, and that each function $f_{i}(x)$ satisfies $\beta$-RSS in $\mathcal{M}(\mathbb{M}\oplus\mathbb{M}_{\mathcal{H}}\oplus\mathbb{M}_{\mathcal{T}})$, where the sum $\oplus$ of two models $\mathbb{M}_{1}$ and $\mathbb{M}_{2}$ is defined as $\mathbb{M}_{1}\oplus\mathbb{M}_{2}:=\{S_{1}\cup S_{2}:S_{1}\in\mathbb{M}_{1},S_{2}\in\mathbb{M}_{2}\}$.
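For instance, if $\mathbb{M}_{1}=\{\{1,2\}\}$ and $\mathbb{M}_{2}=\{\{2,3\},\{4\}\}$, then $\mathbb{M}_{1}\oplus\mathbb{M}_{2}=\{\{1,2,3\},\{1,2,4\}\}$. Taking unions enlarges the admissible supports, which is why the RSC/RSS conditions are required on the enlarged model $\mathbb{M}\oplus\mathbb{M}_{\mathcal{H}}\oplus\mathbb{M}_{\mathcal{T}}$ in which the iterates, projected gradients, and their differences actually live.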

III Methods

In this section, we introduce two proposed algorithms: GraphSVRG-IHT and GraphSCSG-IHT. Both algorithms employ the variance-reduced techniques derived from SVRG [17] and SCSG [18], respectively, while also utilizing the graph projection operators found in GraphSto-IHT [30]. This results in methods that are applicable to graph-structured sparsity problems and effectively reduce the variance inherent to stochastic gradient descent. Therefore, our algorithms converge faster and more accurately than their predecessors.

III-A GraphSVRG-IHT

Our proposed GraphSVRG-IHT algorithm (Algorithm 1) utilizes variance reduction by periodically computing the full gradient, significantly reducing the inherent variance in stochastic gradient methods. By incorporating graph projection operators, our GraphSVRG-IHT adapts to non-convex graph sparsity constraints, enhancing its applicability and efficiency. The key steps are outlined below:

  1. Calculate the full gradient $\tilde{v}^{j}$ at the position at the start of each outer loop, $\tilde{x}^{j}$ (Line 4).

  2. In the inner loop, compute two gradients from a single sampled data point: one at the current inner iterate $x^{j}_{k-1}$ and the other at $\tilde{x}^{j}$. Then calculate the stochastic variance-reduced gradient $v_{k}^{j}$ (Lines 7-8).

  3. Pass $v_{k}^{j}$ through the Head Projection operator (Line 9), and use the resulting gradient to update the next iterate $x^{j}_{k}$ through the Tail Projection operator (Line 10).

  4. After a fixed number $\mathcal{K}$ of inner-loop iterations, update the outer-loop position $\tilde{x}^{j}$ and re-calculate the full gradient (Line 12).

Algorithm 1 GraphSVRG-IHT
1: Input: $\eta$, $f_{i}$, $\mathbb{M}$, $\mathbb{M}_{\mathcal{H}}$, $\mathbb{M}_{\mathcal{T}}$, $\mathcal{J}$, $\mathcal{K}$
2: Initialize: $\tilde{x}^{1}$ such that $\mathrm{supp}(\tilde{x}^{1})\in\mathbb{M}$
3: for $j=1$ to $\mathcal{J}$ do
4:     $\tilde{v}^{j}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\tilde{x}^{j})$
5:     $x^{j}_{0}=\tilde{x}^{j}$
6:     for $k=1$ to $\mathcal{K}$ do
7:         Randomly pick $i_{k}\in[n]$
8:         $v^{j}_{k}=\nabla f_{i_{k}}(x^{j}_{k-1})-\nabla f_{i_{k}}(\tilde{x}^{j})+\tilde{v}^{j}$
9:         $\tau_{k}=P(v^{j}_{k},\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})$
10:        $x^{j}_{k}=P(x^{j}_{k-1}-\eta\tau_{k},\mathbb{M},\mathbb{M}_{\mathcal{T}})$
11:     end for
12:     $\tilde{x}^{j+1}=x^{j}_{\mathcal{K}}$
13: end for

Our proposed GraphSVRG-IHT algorithm differs from GraphSto-IHT in that GraphSVRG-IHT uses nested loops, allowing it to account for the history of the stochastic process, whereas GraphSto-IHT only considers a single stochastic gradient. Both methods implement Head and Tail Projections for hard thresholding, making them applicable to graph-structured sparsity optimization problems. The inclusion of the stochastic variance-reduced gradient in GraphSVRG-IHT makes the theoretical analysis more complex and challenging.
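A minimal Python sketch of Algorithm 1 is given below; it assumes user-supplied callables grad_fi(i, x) for the per-sample gradients and head_proj / tail_proj implementing the approximate projections of Assumptions 1-2. All names here are ours and are meant only to mirror the pseudocode, not to document a released API.

import numpy as np

def graph_svrg_iht(grad_fi, n, x0, eta, J, K, head_proj, tail_proj, rng=None):
    # Sketch of GraphSVRG-IHT (Algorithm 1).
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = x0.copy()
    for _ in range(J):                               # outer loop
        # Line 4: full gradient at the snapshot x_tilde
        v_tilde = np.mean([grad_fi(i, x_tilde) for i in range(n)], axis=0)
        x = x_tilde.copy()                           # Line 5
        for _ in range(K):                           # inner loop
            i = rng.integers(n)                      # Line 7: sample one index
            # Line 8: stochastic variance-reduced gradient
            v = grad_fi(i, x) - grad_fi(i, x_tilde) + v_tilde
            tau = head_proj(v)                       # Line 9: head projection of the gradient
            x = tail_proj(x - eta * tau)             # Line 10: gradient step + tail projection
        x_tilde = x                                  # Line 12: update the snapshot
    return x_tilde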

III-B GraphSCSG-IHT

To better understand the calculation of variance-reduced gradients and stochastically control the outer batch size, we propose the GraphSCSG-IHT algorithm (Algorithm 2). While similar to Algorithm 1 in its use of variance-reduced gradients, GraphSCSG-IHT has the following key characteristics:

  1. In the outer loop, the gradient is calculated using a batch of data of size $B$, whereas Algorithm 1 calculates a full gradient at this step (Lines 4-5).

  2. In the inner loop, when calculating the stochastic variance-reduced gradient, a mini-batch of size $b$ is used instead of a single data point (Lines 10-11).

  3. The number of inner loops, $\mathcal{K}^{j}$, is not fixed. Instead, $\mathcal{K}^{j}$ is drawn from a geometric distribution (Line 7) or can be set as $\frac{B}{b}$ (Line 8).

Algorithm 2 GraphSCSG-IHT
1: Input: $\eta$, $f_{i}$, $\mathbb{M}$, $\mathbb{M}_{\mathcal{H}}$, $\mathbb{M}_{\mathcal{T}}$, $B$, $b$, $\mathcal{J}$
2: Initialize: $\tilde{x}^{1}$ such that $\mathrm{supp}(\tilde{x}^{1})\in\mathbb{M}$
3: for $j=1$ to $\mathcal{J}$ do
4:     Uniformly sample a batch $I^{j}\subset[n]$ s.t. $|I^{j}|=B$
5:     $\tilde{\mu}^{j}=\nabla f_{I^{j}}(\tilde{x}^{j})$
6:     $x^{j}_{0}=\tilde{x}^{j}$
7:     (Option I) Generate $\mathcal{K}^{j}\sim\mathrm{Geom}\left(\frac{B}{B+b}\right)$
8:     (Option II) $\mathcal{K}^{j}=\frac{B}{b}$
9:     for $k=1$ to $\mathcal{K}^{j}$ do
10:        Uniformly sample a mini-batch $I^{j}_{k}\subset[n]$ s.t. $|I^{j}_{k}|=b$
11:        $\mu^{j}_{k}=\nabla f_{I^{j}_{k}}(x^{j}_{k-1})-\nabla f_{I^{j}_{k}}(\tilde{x}^{j})+\tilde{\mu}^{j}$
12:        $\tau_{k}=P(\mu^{j}_{k},\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})$
13:        $x^{j}_{k}=P(x^{j}_{k-1}-\eta\tau_{k},\mathbb{M},\mathbb{M}_{\mathcal{T}})$
14:     end for
15:     $\tilde{x}^{j+1}=x^{j}_{\mathcal{K}^{j}}$
16: end for

In contrast to GraphSto-IHT, the inner loop of GraphSCSG-IHT computes the gradient over a random set of functions rather than a single randomly selected function. GraphSCSG-IHT also employs two loops with different batch sizes for gradient calculation, making it more flexible than GraphSto-IHT and GraphSVRG-IHT. This flexibility allows GraphSCSG-IHT to serve as a general framework for graph-constrained optimization.

In summary, compared with traditional IHT and StoIHT methods, the graph-structured hard thresholding steps differ mainly in their use of Head and Tail Projections. Both proposed algorithms are more complex than GraphSto-IHT, introducing distinct variance-reduction techniques while maintaining the same sparsification enforcement. It is also important to note that GraphSVRG-IHT is a special case of GraphSCSG-IHT; hence, we provide a theoretical analysis of GraphSCSG-IHT, which can be easily extended to the GraphSVRG-IHT case.
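A corresponding sketch of Algorithm 2 is shown below, again with hypothetical names (grad_fi, head_proj, tail_proj) standing in for the per-sample gradients and the approximate projections; the geometric sampling of the inner-loop length follows our reading of Option I, where the expected number of inner steps is roughly B/b.

import numpy as np

def graph_scsg_iht(grad_fi, n, x0, eta, J, B, b, head_proj, tail_proj,
                   geometric_K=True, rng=None):
    # Sketch of GraphSCSG-IHT (Algorithm 2).
    rng = np.random.default_rng() if rng is None else rng

    def batch_grad(idx, x):
        # Average gradient over a sampled index set.
        return np.mean([grad_fi(i, x) for i in idx], axis=0)

    x_tilde = x0.copy()
    for _ in range(J):
        I = rng.choice(n, size=B, replace=False)     # Line 4: outer batch
        mu_tilde = batch_grad(I, x_tilde)            # Line 5: batch gradient instead of a full gradient
        x = x_tilde.copy()                           # Line 6
        if geometric_K:
            # Option I: inner-loop length drawn so that its mean is about B/b
            # (numpy's geometric counts trials until the first success).
            K = rng.geometric(b / (B + b))
        else:
            K = max(1, B // b)                       # Option II
        for _ in range(K):
            I_k = rng.choice(n, size=b, replace=False)                     # Line 10: inner mini-batch
            mu = batch_grad(I_k, x) - batch_grad(I_k, x_tilde) + mu_tilde  # Line 11
            tau = head_proj(mu)                      # Line 12: head projection
            x = tail_proj(x - eta * tau)             # Line 13: step + tail projection
        x_tilde = x                                  # Line 15
    return x_tilde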

IV Theoretical Analysis

In this section, we present our main theoretical results characterizing the estimation error of the parameter $x$. The proof provides a general framework based on the gradients from GraphSCSG-IHT. Consequently, the main theorem is applicable to both of our proposed algorithms, GraphSVRG-IHT and GraphSCSG-IHT. We demonstrate the convergence of these algorithms by bounding the final error in terms of the $\ell_{2}$ distance between the initial point and the optimum. Additionally, we denote the history of the stochastic process up to iteration $j\cdot\mathcal{K}$ by $I^{j}_{\mathcal{K}}$.

Before delving into the main theorem, we present a key lemma that is crucial for the proof.

Lemma 1.

[30] If each $f_{\xi_{t}}(\cdot)$ and $F(x)$ satisfy Assumption 3, and given the head projection model $(c_{H},\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})$ and the tail projection model $(c_{T},\mathbb{M},\mathbb{M}_{\mathcal{T}})$, then we have the following inequality

\mathbb{E}_{\xi_{t}}\|(x^{t}-x^{*})_{H}\|\leq\sqrt{1-\alpha_{0}}\,\mathbb{E}_{\xi_{t}}\|x^{t}-x^{*}\|+\sigma_{1},

where

\sigma_{1}=\left(\frac{\beta_{0}}{\alpha_{0}}+\sqrt{\frac{\alpha_{0}\beta_{0}}{1-\alpha_{0}}}\right)\mathbb{E}_{\xi_{t}}\|\nabla_{I}f_{\xi_{t}}(x^{*})\|,

H=\mathrm{supp}(P(\nabla f_{\xi_{t}}(x^{t}),\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})),

\alpha_{0}=c_{H}\alpha\tau-\sqrt{\alpha\beta\tau^{2}-2\alpha\tau+1},\quad\beta_{0}=(1+c_{H})\tau,

I=\arg\max_{S\in\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}}\oplus\mathbb{M}_{\mathcal{H}}}\mathbb{E}_{\xi_{t}}\|\nabla_{S}f_{\xi_{t}}(x^{*})\|,

and $\tau\in(0,2/\beta)$.

With the prepared lemmas and appropriate assumptions in place, we can now present our main theorem. This theorem establishes the convergence properties of our proposed algorithms, under specific conditions. The detailed statement of our main theorem is as follows:

Theorem 2.

(Main Theorem) Assume that Definition 4 and Assumption 3 hold. Under the same setting as Lemma 1, let $x^{0}$ be the starting point. If we choose a constant learning rate $\eta$ within

\eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right),

then the solution $\tilde{x}^{j+1}$ of GraphSCSG-IHT satisfies

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left[\left(\frac{\delta}{1-\lambda}\right)^{S}+\lambda^{j}\left(\frac{1-\lambda}{1-\lambda-\delta}\right)\right]\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda-\delta}\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|,

where

\delta=(1+c_{\mathcal{T}})\left(\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}+\sqrt{1-\alpha_{0}}\right),

\lambda=(1+c_{\mathcal{T}})\left(2\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\right),

\gamma=(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right).

Theorem 2 demonstrates that our new algorithms achieve linear convergence with stochastic variance-reduced gradients, even in the more stochastic setting with batch and mini-batch sampling. The quantities $\delta$ and $\lambda$ defined in the theorem are strictly less than 1, ensuring that the error decreases as the number of iterations increases. This result aligns with our experimental findings, where more iterations consistently lead to smaller errors.

From Theorem 2, we derive the following corollary, which further justifies the convergence of our algorithm and specifies the appropriate range for the learning rate η\eta.

Corollary 2.1.

To ensure convergence of our algorithm, the constant learning rate $\eta$ should be chosen within the range $\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$. For this range to be valid, the following inequality must hold:

\frac{\delta}{1-\lambda}<1.

Corollary 2.1 is the cornerstone of Theorem 2. It ensures that the upper bound on the estimation error does not blow up to infinity and provides a constant value for the finite geometric series. It also ensures that the upper bound decays further as more iterations are performed. In addition, Corollary 2.1 provides a range for $\eta$, which is narrower than the one given by GraphSto-IHT; within this range the algorithm is guaranteed to converge. All proofs are provided in the appendix of the paper.
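For a purely illustrative numerical instance of this range (values chosen by us): with $\alpha=1$ and $\beta=1.05$, we have $4\alpha^{2}-3.75\alpha\beta=0.0625$ and $\sqrt{0.0625}=0.25$, so the admissible learning rates are

\eta\in\left(\frac{2-0.25}{2.1},\frac{2+0.25}{2.1}\right)\approx(0.83,1.07).

Note that the interval is nonempty only when $4\alpha^{2}\geq 3.75\alpha\beta$, i.e., $\beta\leq\frac{16}{15}\alpha$, so the restricted condition number $\beta/\alpha$ must be close to one.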

V Experiments

V-A Experimental setup

We perform multiple experiments to compare our proposed algorithms with baseline methods. For our experiments, we track the residual norm of the loss function, $\|\mathbf{A}x^{t+1}-y\|$, as the number of epochs increases. Due to the non-convex nature of the problem, there are several local minima and the algorithm may not approach the global minimum $x^{*}$. Additionally, in real-world applications, $x^{*}$ is often unknown. Therefore, we use the residual norm as a measure of convergence, as opposed to the distance from the final iterate to the target vector, $\|x^{t+1}-x^{*}\|$. All experiments are run on an Ubuntu 22.04.4 server with 256 AMD EPYC 9554 64-core processors and 1.6 TB of RAM. All code is written in Python and is available at https://github.com/Derek-Fox/graph-scsg-iht.

V-B Synthetic Dataset

We first tested our methods on synthetic datasets to determine the optimal parameters. For a fair comparison, we followed the exact settings used in GraphSto-IHT, conducting multiple experiments using a grid graph with a dimension of 256 and unit-weight edges.

Figure 1: Comparison of methods with different learning rates.
Figure 2: Comparison of methods with different sparsities.
Figure 3: Number of data points vs. residual loss value with different batch sizes.

Choice of $\eta$. To study the effect of the learning rate on the performance of our algorithms, we varied $\eta$ across {0.1, 0.01, 0.001} and tested these rates in various sparsity cases, with sparsity values chosen from {256, 128, 64, 32}. By varying the learning rate, we aimed to understand the convergence behavior of our algorithms. As shown in Figure 1, our experiments reveal that, as expected, with a larger learning rate all methods converge quickly, while with a smaller learning rate the iterates take longer to achieve zero residual loss. Additionally, both our proposed methods and the baseline method converge stably, and regardless of the learning rate setting, GraphSVRG-IHT consistently performs the best.

Choice of sparsity. Studying the sparsity setting $s$ is crucial for understanding the performance of our algorithms in graph sparsification optimization. To examine the effect of sparsity, we compared our methods, GraphSVRG-IHT and GraphSCSG-IHT, against the baseline algorithm GraphSto-IHT. Using the experimental settings from GraphSto-IHT, we employed a grid graph of dimension 256, fixed the learning rate $\eta=0.01$, and set the batch size equal to the sparsity parameter, $B=s$. We varied the sparsity parameter $s$ to observe the behavior of all algorithms, as shown in Figure 2. Figure 2 demonstrates that as $s$ decreases, GraphSVRG-IHT outperforms the other algorithms in minimizing the residual norm $\|\mathbf{A}x^{t+1}-y\|$ over epochs. Another interesting finding is that GraphSto-IHT and GraphSCSG-IHT display almost identical behavior in this parameter setting.

Table I: Identified genes from breast cancer dataset
Method Number of genes Identified genes
IHT 2 NAT1 TOP2A
StoIHT 2 NAT1 TOP2A
GraphSto-IHT 6 AR ATM BRCA2 CCND2 CDKN1A TOP2A
GraphSVRG-IHT 6 AR ATM BRCA2 CCND2 CDKN1A TOP2A
GraphSCSG-IHT 10 AR ATM BRCA1 BRCA2 CCND2 CCND3 CDKN1A CHEK2 FBP1 TOP2A

Choice of batch size. After exploring the choice of $\eta$ and $s$, we varied the batch size $B$ on a grid graph with a dimension of 256 to demonstrate the advantages of GraphSCSG-IHT. Here we fixed $\eta=0.01$ and $s=32$. For a fair comparison, we considered the number of data points used, instead of the number of epochs, to estimate the run time of the algorithms, which is a common practice in optimization.

Since gradient calculations are computationally expensive, using fewer data points results in faster run times. Figure 3 shows three different scenarios with varying batch sizes $B$. When $B$ equals the dimension, GraphSCSG-IHT degrades to GraphSVRG-IHT, resulting in similar performance. With smaller $B$ (meaning fewer data points are used in the outer gradient calculation), GraphSCSG-IHT consistently outperforms GraphSVRG-IHT. However, as $B$ continues to decrease, GraphSCSG-IHT initially outperforms GraphSVRG-IHT but is eventually overtaken by it slightly. This occurs because smaller batch sizes increase gradient variance, making it harder for GraphSCSG-IHT to maintain its initial advantage. Furthermore, the interaction between batch size and mini-batch size becomes more complex, ultimately affecting the final performance.

Therefore, we also conducted numerous experiments to study the effects of mini-batch size in conjunction with batch size. Additionally, to better understand graph-structure patterns, we explored how varying the number of connected components in subgraphs impacts final convergence performance. Additional results are provided in the Appendix.

V-C Real-world Dataset

To test our algorithms on a real-world dataset, we use a large breast cancer dataset [27]. This dataset contains 295 training samples with 8,141 gene dimensions, including 78 positive (metastatic) and 217 negative (non-metastatic) samples. Following the experimental setting in [30], we use a Protein-Protein Interaction network with 637 pathways from [15]; the dataset is split into 5 folds, and 20 trials are conducted.

Table I shows the results of the gene identification task on the breast cancer dataset. GraphSCSG-IHT identifies 40% of the 25 genes highly correlated with breast cancer, as gathered by [30]. GraphSto-IHT and GraphSVRG-IHT both identify 24% of these genes, consistent with the findings in [30]. All the graph-structured methods greatly outperform the IHT and StoIHT methods, which only identify 8% of the cancer-related genes. These results demonstrate the promise of variance-reduction techniques for stochastic gradient descent in the setting of graph sparsity optimization.

In summary, GraphSVRG-IHT is more stable, while GraphSCSG-IHT shows excellent performance under appropriate settings due to the introduction of two additional batch-size parameters, which allow greater flexibility and control in the gradient estimation process. Overall, our two new algorithms demonstrate both efficiency and effectiveness by incorporating variance-reduction techniques into graph sparsity problems, making them suitable for a wide range of applications.

VI Conclusion

We have proposed two algorithms that utilize variance-reduction techniques in the setting of graph sparsity optimization. The proposed algorithms can significantly improve stability and efficiency in machine learning and other applications. Theoretically, we provide a general proof framework showing linear convergence. Empirically, our algorithms are competitive in minimizing the objective loss function compared to their predecessors in various experimental settings on a synthetic dataset. Additionally, testing on a large-scale medical dataset demonstrated superior performance in identifying cancer-related genes. Future work includes testing our algorithms on larger real-world datasets. By employing graph-structured hard thresholding, we can uncover more underlying subgraphs and related patterns, with significant implications in fields such as medical research and social network analysis. This approach can enhance our understanding of complex data structures and lead to more effective solutions.

VII Acknowledgement

This work was completed at the University of North Carolina at Greensboro and funded by the NSF REU program (award 2349369).

References

  • [1] Cem Aksoylar, Lorenzo Orecchia, and Venkatesh Saligrama. Connected subgraph detection with mirror descent on sdps. In International Conference on Machine Learning, pages 51–59. PMLR, 2017.
  • [2] Ery Arias-Castro, Emmanuel J Candes, and Arnaud Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, pages 278–304, 2011.
  • [3] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.
  • [4] Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
  • [5] Sohail Bahmani, Bhiksha Raj, and Petros T Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14(Mar):807–841, 2013.
  • [6] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on information theory, 56(4):1982–2001, 2010.
  • [7] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
  • [8] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
  • [9] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
  • [10] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.
  • [11] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A fast approximation algorithm for tree-sparse recovery. In 2014 IEEE International Symposium on Information Theory, pages 1842–1846. IEEE, 2014.
  • [12] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. IEEE Transactions on Information Theory, 61(9):5129–5147, 2015.
  • [13] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A nearly-linear time framework for graph-structured sparsity. In International Conference on Machine Learning, pages 928–937. PMLR, 2015.
  • [14] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Fast recovery from a union of subspaces. Advances in Neural Information Processing Systems, 29, 2016.
  • [15] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, pages 433–440, 2009.
  • [16] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
  • [17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
  • [18] Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.
  • [19] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning. In International Conference on Machine Learning, pages 917–925, 2016.
  • [20] Guannan Liang, Qianqian Tong, Chunjiang Zhu, and Jinbo Bi. An effective hard thresholding method based on stochastic variance reduction for nonconvex sparse learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34(02), pages 1585–1592, 2020.
  • [21] Nam Nguyen, Deanna Needell, and Tina Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869–6895, 2017.
  • [22] Jing Qian, Venkatesh Saligrama, and Yuting Chen. Connected sub-graph detection. In Artificial Intelligence and Statistics, pages 796–804. PMLR, 2014.
  • [23] Polina Rozenshtein, Aris Anagnostopoulos, Aristides Gionis, and Nikolaj Tatti. Event detection in activity networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1176–1185, 2014.
  • [24] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [25] Berwin A Turlach, William N Venables, and Stephen J Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
  • [26] Sara A Van de Geer et al. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008.
  • [27] Marc J Van De Vijver, Yudong D He, Laura J Van’t Veer, Hongyue Dai, Augustinus AM Hart, Dorien W Voskuil, George J Schreiber, Johannes L Peterse, Chris Roberts, Matthew J Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.
  • [28] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006.
  • [29] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning, pages 127–135, 2014.
  • [30] Baojian Zhou, Feng Chen, and Yiming Ying. Stochastic iterative hard thresholding for graph-structured sparsity optimization. In International Conference on Machine Learning, pages 7563–7573. PMLR, 2019.
  • [31] Pan Zhou, Xiaotong Yuan, and Jiashi Feng. Efficient stochastic gradient hard thresholding. In Advances in Neural Information Processing Systems, pages 1988–1997, 2018.

Appendix A Parameters Table

Table: Constraints for Parameters

Notation Name Constraint
$\eta$ Step size Fixed
$F(\cdot)$ Objective function Minimized
$\mathbb{M}$ Set of supports
$\mathbb{M}_{\mathcal{H}}$ Set of supports (head)
$\mathbb{M}_{\mathcal{T}}$ Set of supports (tail)
$B$ Big batch size
$I^{j}$ Big batch sample $I^{j}\subset[n]$ and $|I^{j}|=B$
$b$ Small batch size $b\leq B$
$I^{j}_{k}$ Small batch sample $I^{j}_{k}\subset[n]$ and $|I^{j}_{k}|=b$
$\mathcal{J}$ Number of outer loops
$j$ Outer loop index
$\mathcal{K}$ Number of inner loops
$k$ Inner loop index
$\tilde{x}^{1}$ Starting position $\mathrm{supp}(\tilde{x}^{1})\in\mathbb{M}$
$s$ Number of non-zero entries Sparsity constraint
$P(x,\mathbb{M},\mathbb{M}_{\mathcal{H}})$ Head projection
$P(x,\mathbb{M},\mathbb{M}_{\mathcal{T}})$ Tail projection
$\tilde{v}^{j}$ SVRG full gradient
$\tilde{\mu}^{j}$ SCSG batch gradient
$\tilde{x}^{j}$ Current outer-loop position
$v^{j}_{k}$ Variance-reduced gradient (GraphSVRG-IHT)
$\mu^{j}_{k}$ Variance-reduced gradient (GraphSCSG-IHT)
$\tau_{k}$ Sparsified gradient
$g$ Number of connected components

Appendix B Proof

Lemma 4 [21]. Let $\xi_{t}$ be a discrete random variable defined on $[n]$ with probability mass function $\Pr(\xi_{t}=i):=\frac{1}{n}$, i.e., the probability of $\xi_{t}$ selecting the $i$-th block at time $t$. For any fixed sparse vectors $x$, $y$ and $0<\tau<\frac{2}{\beta}$, we have

\mathbb{E}_{\xi_{t}}\|x-y-\tau(\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y))\|\leq\sqrt{\alpha\beta\tau^{2}-2\alpha\tau+1}\,\|x-y\|,

where $\Omega$ is such that $\mathrm{supp}(x)\cup\mathrm{supp}(y)\subseteq\Omega$ and $\Omega\in\mathbb{M}$.

Proof.

We first derive an upper bound for $\mathbb{E}_{\xi_{t}}\|x-y-\tau(\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y))\|^{2}$ as follows:

\mathbb{E}_{\xi_{t}}\|x-y-\tau(\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y))\|^{2}

=\|x-y\|^{2}-2\tau\,\mathbb{E}_{\xi_{t}}\langle x-y,\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y)\rangle+\tau^{2}\,\mathbb{E}_{\xi_{t}}\|\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y)\|^{2}

=\|x-y\|^{2}-2\tau\langle x-y,\mathbb{E}_{\xi_{t}}[\nabla f_{\xi_{t}}(x)-\nabla f_{\xi_{t}}(y)]\rangle+\tau^{2}\,\mathbb{E}_{\xi_{t}}\|\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y)\|^{2}

\leq\|x-y\|^{2}-2\tau\langle x-y,\nabla F(x)-\nabla F(y)\rangle+\tau^{2}\beta\langle x-y,\nabla F(x)-\nabla F(y)\rangle

=\|x-y\|^{2}+(\tau^{2}\beta-2\tau)\langle x-y,\nabla F(x)-\nabla F(y)\rangle

\leq(\alpha\beta\tau^{2}-2\alpha\tau+1)\|x-y\|^{2},

where the second equality uses the fact that $\mathrm{supp}(x)\cup\mathrm{supp}(y)\subseteq\Omega$, the first inequality follows from Lemma 2, the third equality is obtained by using the fact that $\mathbb{E}_{\xi_{t}}[\nabla f_{\xi_{t}}(x)-\nabla f_{\xi_{t}}(y)]=\nabla F(x)-\nabla F(y)$, and the last inequality is due to the restricted strong convexity of $F(x)$ on $\mathcal{M}(\mathbb{M})$. We complete the proof by taking the square root of both sides and using the fact that for any random variable $X$, $(\mathbb{E}[X])^{2}\leq\mathbb{E}[X^{2}]$. ∎

Proof of Theorem 2.

Proof.
\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|=\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|P(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}},\mathbb{M},\mathbb{M}_{\mathcal{T}})-x^{*}\| (1)

\leq\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|P(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}},\mathbb{M},\mathbb{M}_{\mathcal{T}})-(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}})\|+\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}})-x^{*}\| (2)

\leq(1+c_{\mathcal{T}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}}-x^{*}\| (3)

=(1+c_{\mathcal{T}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}-\eta\upsilon_{\mathcal{K}}\| (4)

=(1+c_{\mathcal{T}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}-\eta(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H}f_{I^{j}}(\tilde{x}^{j}))\| (5)

By the definition of $\tilde{x}^{j+1}$, Eq. (1) holds. Eq. (2) holds by the triangle inequality. Eq. (3) holds by Assumption 2. Eq. (4) holds by defining $r^{\mathcal{K}}=x^{j}_{\mathcal{K}-1}-x^{*}$. Eq. (5) holds by the definition of $\upsilon_{\mathcal{K}}$. Now we focus on the expectation only. Also, define $m^{\mathcal{K}}=\tilde{x}^{j}-x^{*}$.

\Rightarrow\ \mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}-\eta(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H}f_{I^{j}}(\tilde{x}^{j}))\|

\leq\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\big[\|r^{\mathcal{K}}_{H}-\eta(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H}f_{I^{j}}(\tilde{x}^{j}))-(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{*})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{*})+\nabla_{H}f_{I^{j}}(x^{*}))\|+\eta\|\nabla_{H}f_{I^{j}}(x^{*})\|+\|r^{\mathcal{K}}_{H^{c}}\|\big] (6)

\leq\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\big[\|r^{\mathcal{K}}_{H\cup\Omega}-\eta(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H\cup\Omega}f_{I^{j}}(\tilde{x}^{j}))-(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})+\nabla_{H\cup\Omega}f_{I^{j}}(x^{*}))\|+\eta\|\nabla_{H}f_{I^{j}}(x^{*})\|+\|r^{\mathcal{K}}_{H^{c}}\|\big] (7)

=\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\big[\|r^{\mathcal{K}}_{H\cup\Omega}-m^{\mathcal{K}}_{H\cup\Omega}+m^{\mathcal{K}}_{H\cup\Omega}-\eta(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H\cup\Omega}f_{I^{j}}(\tilde{x}^{j}))-(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})+\nabla_{H\cup\Omega}f_{I^{j}}(x^{*}))\|+\eta\|\nabla_{H}f_{I^{j}}(x^{*})\|+\|r^{\mathcal{K}}_{H^{c}}\|\big] (8)

\leq\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|r_{\mathcal{K}}\|+\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|-m_{\mathcal{K}}\|+\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|m_{\mathcal{K}}\|+\eta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|+\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}_{H^{c}}\| (9)

=\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|r_{\mathcal{K}}\|+2\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|m_{\mathcal{K}}\|+\eta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|+\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}_{H^{c}}\| (10)

\leq(\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}+\sqrt{1-\alpha_{0}^{2}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+2\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|\tilde{x}^{j}-x^{*}\|+\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (11)

Eq. (6) and Eq. (7) hold by the triangle inequality. Eq. (8) holds by adding and subtracting $m_{\mathcal{K}}$. Eq. (9) holds by the triangle inequality and Lemma 4. Eq. (10) holds by the fact that $\|m_{\mathcal{K}}\|=\|-m_{\mathcal{K}}\|$. Eq. (11) holds by Lemma 1. Multiplying through by the factor $(1+c_{\mathcal{T}})$ from Eq. (3), we therefore obtain the following inequality, which we will use to derive the final bound.

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|\tilde{x}^{j}-x^{*}\|+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (12)

where $\delta=(1+c_{\mathcal{T}})\left(\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}+\sqrt{1-\alpha_{0}^{2}}\right)$ and $\lambda=2(1+c_{\mathcal{T}})\left(\frac{\delta}{1+c_{\mathcal{T}}}-\sqrt{1-\alpha_{0}^{2}}\right)$. Since we have two different vectors measured from the target vector $x^{*}$, we will apply the recursion in Eq. (12) twice to derive the final bound.

First, we apply the recursion with respect to the outer loop index $j$ by bounding $\mathbb{E}\|\tilde{x}^{j}-x^{*}\|$ using the above inequality; we then plug the result back in and repeat this process $j$ times.

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j}-x^{*}\|\leq\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|\tilde{x}^{j-1}-x^{*}\|+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (13)

Plugging Eq. (13) into Eq. (12), we get:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\left(\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\,\mathbb{E}_{I^{j}_{\mathcal{K}-2}}\|\tilde{x}^{j-1}-x^{*}\|+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|\right)+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (14)

Applying this recursion $j$ times, we get:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\sum_{i=0}^{j}\lambda^{i}\right)\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\sum_{i=0}^{j}\lambda^{i}\right)(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (15)

By the geometric series bound (since $\lambda<1$), and by defining $\gamma=(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)$, we get:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\frac{1}{1-\lambda}\right)\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{1}{1-\lambda}\right)\gamma\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (16)

Having performed the recursion with respect to the outer loop only, we now apply the recursion with respect to the inner iterates as well, in order to obtain the final bound. This means we repeat the process $S$ times, where $S=\frac{\mathcal{B}\cdot\mathcal{T}}{b}$.

\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|\leq\left(\frac{1}{1-\lambda}\right)\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}-2}}\|x^{j}_{\mathcal{K}^{j}-2}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{1}{1-\lambda}\right)\gamma\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (17)

We then plug Eq. (17) into Eq. (16), obtaining Eq. (18):

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\frac{\delta}{1-\lambda}\right)\left[\left(\frac{\delta}{1-\lambda}\right)\mathbb{E}_{I^{j}_{\mathcal{K}-2}}\|x^{j}_{\mathcal{K}^{j}-2}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{\gamma}{1-\lambda}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|\right]+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{\gamma}{1-\lambda}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (18)

After repeating this step $S$ times, we get the following inequality:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\frac{\delta}{1-\lambda}\right)^{S}\|x^{0}-x^{*}\|+\lambda^{j}\left(\sum_{n=0}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda}\left(\sum_{n=1}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (19)

We finally arrive at the following bound, which guarantees the convergence of the algorithm since $\lambda<1$ and, by Corollary 2.1, $\frac{\delta}{1-\lambda}<1$.

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left[\left(\frac{\delta}{1-\lambda}\right)^{S}+\lambda^{j}\left(\sum_{n=0}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\right]\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda}\left(\sum_{n=0}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|

which we can rewrite as

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left[\left(\frac{\delta}{1-\lambda}\right)^{S}+\lambda^{j}\left(\frac{1-\lambda}{1-\lambda-\delta}\right)\right]\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda-\delta}\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|. ∎

Corollary 2.0. Let $\delta<1$, which holds since $\eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$. Then, by the definition of $\lambda$, namely

\lambda=2\left(\frac{\delta}{1+c_{\mathcal{T}}}-\sqrt{1-\alpha_{0}^{2}}\right),

we have $\lambda<1$.
Proof.

By contradiction, assume λ1\lambda\geq 1. Therefore,

\frac{2\delta}{1+c_{\mathcal{T}}}-2\sqrt{1-\alpha_{0}^{2}}\geq 1

\frac{2\delta}{1+c_{\mathcal{T}}}\geq 1+2\sqrt{1-\alpha_{0}^{2}}\geq 1

\frac{2\delta}{1+c_{\mathcal{T}}}\geq 1

2\delta\geq 1+c_{\mathcal{T}}\geq 2

\delta\geq 1

which is a contradiction, since δ<1\delta<1 by definition. Therefore, λ<1\lambda<1. Note that λ>0\lambda>0 by its definition. ∎

Corollary 2.1. To ensure convergence of our algorithm, the constant learning rate $\eta$ should be chosen within the range $\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$. For this range to be valid, the following inequality must hold:

\frac{\delta}{1-\lambda}<1.
Proof.

Assume by contradiction that δ1λ\delta\geq 1-\lambda. Then, by definition of λ\lambda, we get

\delta\geq 1-2\left(\frac{\delta}{1+c_{\mathcal{T}}}-\sqrt{1-\alpha_{0}^{2}}\right)\geq 1-\frac{2\delta}{1+c_{\mathcal{T}}} (1)

0\geq 1-\frac{2\delta}{1+c_{\mathcal{T}}}-\delta=1-\frac{\delta(3+c_{\mathcal{T}})}{1+c_{\mathcal{T}}}\geq 1-2\delta (2)

where Eq. (2) holds true by the definition of c𝒯c_{\mathcal{T}}. Now, we have the following:

1-2\delta\leq 0

1\leq 2\delta\ \Rightarrow\ \delta\geq\frac{1}{2}

Since $\eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$, we have $\delta<\frac{1}{2}$, which is a contradiction. Therefore,

\frac{\delta}{1-\lambda}<1,\quad\forall\ \eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right). ∎

Appendix C More Results

C-A Simulation

Figure 4: Various choices for $B$ and $b$.
Figure 5: Different numbers of connected components.

As pointed out in the main paper, we also explored the effects of mini-batch size in conjunction with batch size. Figure 4 demonstrates the generality of GraphSCSG-IHT: with different settings of $B$ and $b$, GraphSCSG-IHT can range in behavior between GraphSto-IHT and GraphSVRG-IHT. Not included in the figure is the case when $B$ equals the dimension of the data and $b=1$, in which case GraphSCSG-IHT and GraphSVRG-IHT are only distinguished by the number of inner loops (which could itself be parameterized to make the two algorithms identical).

We vary the number of connected components, controlled by the parameter $g$, in Figure 5. We observe that when $g$ is low (in this case $g=1$), the loss curve does not decrease smoothly but frequently levels off before stepping downward. This is due to the algorithm intermittently finding a more suitable support set, which allows it to decrease the objective loss more rapidly. Restricting the algorithm to only one connected component makes this phenomenon much more apparent.

C-B Breast cancer dataset

All parameters are tuned using 5-fold cross validation, following the experimental settings of [30]. The sparsity $s$ is tuned over $\{10,20,\ldots,100\}$ and the block size is tuned over $\{n,n/2\}$, where $n$ is the number of samples. The task is to find a single connected component, so $g$ is set to 1 for the head and tail projections. The learning rate is tuned using backtracking line search. We record the Balanced Classification Error and Area Under Curve (AUC) scores, as well as the number of nonzeros for each method over the 20 trials.

Table II: Balanced Classification Error ± Std. Dev. on Breast Cancer Dataset
IHT StoIHT GraphSto-IHT GraphSVRG-IHT GraphSCSG-IHT
Trial 1 0.352±0.079 0.355±0.083 0.367±0.091 0.335±0.048 0.474±0.048
Trial 2 0.350±0.047 0.346±0.045 0.395±0.074 0.340±0.077 0.496±0.015
Trial 3 0.354±0.060 0.370±0.061 0.365±0.107 0.318±0.061 0.478±0.061
Trial 4 0.351±0.046 0.350±0.066 0.365±0.032 0.365±0.032 0.459±0.050
Trial 5 0.357±0.064 0.365±0.063 0.366±0.065 0.368±0.082 0.418±0.076
Trial 6 0.334±0.050 0.327±0.046 0.330±0.078 0.352±0.090 0.447±0.071
Trial 7 0.343±0.050 0.402±0.085 0.321±0.027 0.321±0.027 0.499±0.004
Trial 8 0.352±0.059 0.402±0.103 0.338±0.021 0.349±0.030 0.489±0.013
Trial 9 0.340±0.082 0.357±0.058 0.335±0.110 0.343±0.040 0.452±0.053
Trial 10 0.344±0.068 0.351±0.065 0.349±0.050 0.331±0.061 0.502±0.005
Trial 11 0.339±0.056 0.318±0.061 0.353±0.076 0.357±0.054 0.473±0.054
Trial 12 0.355±0.061 0.394±0.053 0.349±0.088 0.329±0.082 0.463±0.072
Trial 13 0.327±0.053 0.317±0.040 0.312±0.065 0.323±0.060 0.457±0.051
Trial 14 0.362±0.058 0.340±0.071 0.348±0.019 0.330±0.054 0.453±0.045
Trial 15 0.381±0.048 0.346±0.051 0.297±0.059 0.297±0.059 0.416±0.069
Trial 16 0.330±0.094 0.320±0.089 0.330±0.074 0.339±0.064 0.488±0.055
Trial 17 0.349±0.074 0.332±0.067 0.353±0.078 0.330±0.056 0.437±0.055
Trial 18 0.331±0.096 0.331±0.096 0.292±0.067 0.290±0.069 0.462±0.053
Trial 19 0.327±0.052 0.333±0.054 0.311±0.021 0.306±0.023 0.476±0.017
Trial 20 0.333±0.038 0.338±0.026 0.352±0.064 0.376±0.036 0.427±0.052
Averaged 0.345±0.065 0.350±0.072 0.341±0.073 0.335±0.062 0.463±0.057
Table III: AUC score ± Std. Dev. on Breast Cancer Dataset
IHT StoIHT GraphSto-IHT GraphSVRG-IHT GraphSCSG-IHT
Trial 1 0.726±0.067 0.736±0.054 0.720±0.030 0.729±0.039 0.713±0.023
Trial 2 0.683±0.067 0.692±0.064 0.697±0.044 0.712±0.062 0.696±0.101
Trial 3 0.716±0.062 0.726±0.077 0.726±0.064 0.715±0.065 0.703±0.027
Trial 4 0.695±0.059 0.701±0.061 0.687±0.041 0.687±0.041 0.693±0.033
Trial 5 0.699±0.042 0.700±0.067 0.707±0.040 0.701±0.028 0.701±0.045
Trial 6 0.677±0.071 0.673±0.074 0.704±0.051 0.669±0.071 0.640±0.101
Trial 7 0.712±0.069 0.719±0.066 0.741±0.055 0.741±0.055 0.701±0.069
Trial 8 0.711±0.081 0.707±0.055 0.721±0.062 0.717±0.055 0.657±0.075
Trial 9 0.717±0.066 0.718±0.068 0.723±0.044 0.718±0.026 0.719±0.058
Trial 10 0.710±0.060 0.711±0.060 0.691±0.052 0.702±0.062 0.655±0.103
Trial 11 0.708±0.082 0.719±0.086 0.722±0.085 0.711±0.074 0.693±0.142
Trial 12 0.713±0.025 0.711±0.045 0.736±0.055 0.730±0.060 0.664±0.057
Trial 13 0.714±0.058 0.719±0.058 0.718±0.073 0.703±0.071 0.704±0.074
Trial 14 0.700±0.058 0.705±0.067 0.711±0.064 0.712±0.046 0.665±0.061
Trial 15 0.696±0.033 0.687±0.040 0.721±0.058 0.721±0.058 0.715±0.040
Trial 16 0.729±0.081 0.725±0.083 0.723±0.072 0.725±0.073 0.734±0.049
Trial 17 0.720±0.055 0.722±0.062 0.718±0.037 0.717±0.037 0.681±0.061
Trial 18 0.712±0.039 0.712±0.039 0.733±0.036 0.730±0.034 0.679±0.056
Trial 19 0.721±0.067 0.724±0.068 0.736±0.056 0.738±0.054 0.701±0.057
Trial 20 0.720±0.031 0.706±0.027 0.717±0.045 0.705±0.033 0.703±0.056
Averaged 0.709±0.062 0.711±0.064 0.717±0.057 0.714±0.057 0.691±0.074
Table IV: Num. nonzeros ± Std. Dev. on Breast Cancer Dataset
IHT StoIHT GraphSto-IHT GraphSVRG-IHT GraphSCSG-IHT
Trial 1 082.0±36.00 066.0±28.00 033.4±10.38 033.8±17.31 064.0±17.72
Trial 2 044.0±08.00 038.0±13.27 061.2±41.81 079.0±32.00 087.2±04.12
Trial 3 094.0±12.00 068.0±31.87 030.4±12.97 040.2±17.03 086.2±13.33
Trial 4 074.0±12.00 062.0±31.87 051.6±08.33 051.6±08.33 075.0±18.17
Trial 5 048.0±04.00 036.0±13.56 054.2±14.37 057.6±15.99 078.2±12.69
Trial 6 090.0±00.00 080.0±20.00 041.0±12.44 047.4±16.21 094.6±09.46
Trial 7 010.0±00.00 048.0±39.19 079.8±09.60 079.8±09.60 091.4±06.77
Trial 8 040.0±00.00 048.0±29.26 083.0±29.26 072.0±36.55 063.0±19.13
Trial 9 092.0±04.00 076.0±33.23 035.2±13.50 019.8±07.49 078.0±18.06
Trial 10 050.0±00.00 052.0±04.00 053.0±13.64 052.0±14.00 088.2±11.53
Trial 11 074.0±08.00 054.0±20.59 059.0±23.39 073.0±22.49 078.6±20.34
Trial 12 056.0±08.00 022.0±09.80 095.2±10.80 080.8±31.52 083.4±15.42
Trial 13 046.0±12.00 052.0±21.35 067.0±22.27 067.0±23.58 087.4±09.73
Trial 14 078.0±24.00 058.0±23.15 028.8±06.76 030.4±06.65 074.2±16.27
Trial 15 082.0±04.00 052.2±27.63 045.0±00.00 045.0±00.00 088.0±14.35
Trial 16 052.0±16.00 056.0±19.60 047.0±16.91 041.2±24.55 073.6±23.69
Trial 17 092.0±04.00 064.0±23.32 024.2±07.52 027.0±06.51 084.4±11.55
Trial 18 094.0±12.00 094.0±12.00 044.2±17.90 034.2±02.14 082.0±14.35
Trial 19 100.0±00.00 030.0±20.98 043.6±08.28 038.6±03.83 087.2±05.42
Trial 20 020.0±00.00 050.0±26.83 042.2±20.43 049.8±25.20 088.2±12.92
Averaged 065.9±28.36 055.3±29.26 051.0±25.39 051.0±26.44 081.6±16.80