
Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization

Derek Fox Computer Science
UNC Greensboro, USA
defox@uncg.edu
   Samuel Hernandez Mathematics
Texas A&M University, USA
samuelhq11@tamu.edu
   Qianqian Tong Computer Science
UNC Greensboro, USA
q_tong@uncg.edu
Abstract

Stochastic optimization algorithms are widely used for large-scale data analysis due to their low per-iteration costs, but they often suffer from slow asymptotic convergence caused by inherent variance. Variance-reduced techniques have therefore been used to address this issue in structured sparse models utilizing sparsity-inducing norms or $\ell_{0}$-norms. However, these techniques are not directly applicable to complex (non-convex) graph sparsity models, which are essential in applications like disease outbreak monitoring and social network analysis. In this paper, we introduce two stochastic variance-reduced gradient-based methods to solve graph sparsity optimization: GraphSVRG-IHT and GraphSCSG-IHT. We provide a general framework for theoretical analysis, demonstrating that our methods enjoy a linear convergence speed. Extensive experiments validate the efficiency and effectiveness of our proposed algorithms.

I Introduction

Graph structures enable the imposition of intricate sparsity constraints on a model, allowing it to better reflect relationships present in the data. For instance, graph-structured sparsity models are well-suited for predicting the spread of diseases or identifying social groups in networks. The search for connected subgraphs or clusters has a significant impact on identifying disease-related genes [2, 22, 1, 23]. Graph sparsification, which aims to reduce the complexity of large-scale graphs while preserving their essential structural properties, has garnered increasing attention as a crucial technique in modern data analysis and machine learning [6, 11, 12, 13, 30].

Graph sparsification can be formulated as the following optimization problem:

\min_{x\in\mathbb{R}^{p}}F(x),\quad F(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), (1)

which is known as the empirical risk minimization problem. Each $f_{i}(x)$ $(i\in[n])$ is convex and differentiable, and the graph-structured sparsity is imposed as a constraint on the support of $x\in\mathbb{R}^{p}$. The input vector $x$ denotes the parameters of the model, and the output $f_{i}(x)$ is defined as the loss associated with sample $i$. By minimizing the loss $F(x)$, we guide the model towards the optimal solution. Typically, sparsity can be encoded by adding sparsity-inducing norms or penalties such as the $\ell_{0}$ norm, the $\ell_{1}$ norm, and mixed norms [7, 10, 29, 16, 24, 26, 8, 25, 28]. These models often involve convex penalties and can be solved using convex optimization algorithms [4, 3, 5]. However, dealing with more complex sparse settings, such as graph-structured models, is more challenging.

In stochastic optimization, iterative hard thresholding (IHT) methods include gradient descent IHT (GD-IHT) [16], stochastic gradient descent IHT (SGD-IHT) [21], hybrid stochastic gradient IHT (HSG-IHT) [31], and variance-reduced methods such as stochastic variance reduced gradient IHT (SVRG-IHT) [19] and stochastically controlled stochastic gradient IHT (SCSG-IHT) [20]. These methods update the parameter iterate $x$ via gradient descent or its variants, and then apply a hard thresholding (HT) operator to enforce sparsity of $x$, preserving the top $s$ elements of $x$ while setting the others to zero. In the context of graph-structured sparse optimization, the stochastic gradient descent based IHT method, named GraphSto-IHT, achieves HT through the use of Head and Tail Projections, first described by [30]. Head and Tail Projections map arbitrary vectors from the data onto the graph while simultaneously enforcing model sparsity [12, 13, 14]. Specifically, Head Projection identifies and preserves the largest entries in $x$, while Tail Projection identifies the smallest entries and sets them to zero. By ignoring the small-magnitude entries in the vector, these projections help prevent overfitting and ensure sparse solutions. Meanwhile, stochastic sampling of the data is used to speed up gradient calculations: a single data point or small batch is selected and a gradient is computed only with respect to that batch, which greatly decreases the computational cost of the gradient calculations. However, if the selected batch does not represent the whole dataset, the gradient may not accurately point towards a local minimum of the function, introducing variance into the gradient descent process.
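For concreteness, the plain (non-graph) hard thresholding operator used by IHT-type methods can be sketched in a few lines of Python; the function name hard_threshold and this standalone snippet are ours, for illustration only, and are not part of any released implementation.

import numpy as np

def hard_threshold(x, s):
    # Keep the s largest-magnitude entries of x and zero out the rest.
    # This is the plain HT step of IHT/StoIHT; the graph-structured methods
    # below replace it with approximate Head and Tail Projections.
    z = np.zeros_like(x)
    if s <= 0:
        return z
    top = np.argpartition(np.abs(x), -s)[-s:]  # indices of the top-s entries
    z[top] = x[top]
    return z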

To reduce the randomness inherent in SGD, variance reduction techniques may be used such as SVRG [17], SAGA [9], or SCSG [18]. During each iteration, the history of the stochastic process is considered to regulate the newly calculated gradient and minimize large changes in direction. This improvement on SGD is our main interest; both proposed algorithms utilize this technique to leverage fast gradient calculations while still enjoying quick convergence. In this paper we leverage the recent success of stochastic variance-reduced algorithms for non-convex problems and propose a series of efficient stochastic optimization algorithms for graph-structured sparsity constraint problems. Specifically, we introduce two new stochastic variance-reduced gradient based methods, GraphSVRG-IHT and GraphSCSG-IHT, designed to solve graph sparsity optimization problems. By incorporating stochastic variance-reduced techniques and graph approximated projections (head and tail), our algorithms are specifically tailored for non-convex graph-structured sparsity constraints, leading to faster convergence and improved performance. We provide a comprehensive theoretical framework and conduct extensive experiments to validate the efficiency and effectiveness of our proposed methods.
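To make the variance-reduction step concrete, recall the SVRG-style estimator that both of our algorithms build on: for a snapshot point $\tilde{x}$ with full gradient $\nabla F(\tilde{x})$ and a uniformly sampled index $i$, the estimator $v=\nabla f_{i}(x)-\nabla f_{i}(\tilde{x})+\nabla F(\tilde{x})$ satisfies

\mathbb{E}_{i}[v]=\nabla F(x)-\nabla F(\tilde{x})+\nabla F(\tilde{x})=\nabla F(x),

so it remains an unbiased estimate of the full gradient while its variance shrinks as $x$ and $\tilde{x}$ approach the same solution.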

Our main contributions are summarized as follows.

  • This work is the first to explore the application of stochastic variance-reduced methods to graph-structured sparsity problems. By using batch gradients to approximate computationally expensive full gradients, we enhance the efficiency of variance reduction. GraphSCSG-IHT builds on GraphSVRG-IHT by parameterizing the variance-reduction technique for greater control: it employs two different batch sizes for gradient calculations and randomly selects the number of inner updates from a geometric distribution.

  • We theoretically prove that GraphSCSG-IHT enjoys a linear convergence rate with a constant learning rate. This analysis provides a robust theoretical framework for analyzing graph sparsification optimization.

  • We conduct extensive experiments to validate our proposed algorithms. In addition to simulation tests, we also evaluate our methods on a real-world breast cancer dataset. Our experiments empirically demonstrate the efficiency and effectiveness of our methods compared to GraphSto-IHT and other deterministic methods.

II Preliminaries

Notations. We use lowercase letters, e.g., $x$, to denote a vector and use $\|\cdot\|$ to denote the $\ell_{2}$-norm of a vector. The operator $\mathbb{E}$ represents taking expectation over all random variables, and $[n]$ denotes the integer set $\{1,\ldots,n\}$. The notation $\mathrm{supp}(x)$ means the support of $x$, i.e., the index set of non-zero elements in $x$. Other important parameters are listed in Appendix A.

Definition 1.

(Subspace model) [14] Given the space $\mathbb{R}^{p}$, a subspace model $\mathbb{M}$ is defined as a family of linear subspaces of $\mathbb{R}^{p}$:

\mathbb{M}=\{S_{1},S_{2},\ldots,S_{k},\ldots\},

where each $S_{k}$ is a subspace of $\mathbb{R}^{p}$. The set of corresponding vectors in these subspaces is denoted as

\mathcal{M}(\mathbb{M})=\{x:x\in V\text{ for some }V\in\mathbb{M}\}.
Definition 2.

(Weighted graph model) [13] Given an underlying graph $G=(V,E)$ defined on the coefficients of the unknown vector $x$, where $V=[p]$ and $E\subseteq V\times V$, the weighted graph model $(G,s,g,C)$-WGM can be defined as the following set of supports

\mathbb{M}=\{S:|S|\leq s,\text{ there is an }F\subseteq V\text{ with }V_{F}=S,\ \gamma(F)=g,\text{ and }w(F)\leq C\},

where $C$ is the budget on the weight $w$ of edges, $g$ is the number of connected components of $F$, and $s$ is the sparsity.
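As an illustrative instance (ours, not taken from [13]): consider a path graph $G$ with nodes $V=\{1,2,3,4\}$, edges $1\text{--}2$, $2\text{--}3$, $3\text{--}4$, and unit edge weights. The support $S=\{1,2,3\}$ belongs to the $(G,3,1,2)$-WGM, witnessed by $F=\{1,2,3\}$ with a single connected component ($\gamma(F)=1$) and total edge weight $w(F)=2\leq C$. In contrast, $S=\{1,3\}$ only fits a model with $g=2$, since nodes 1 and 3 are not adjacent in $G$.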

Definition 3.

(Projection Operator) [14] We define a projection operator onto $\mathcal{M}(\mathbb{M})$, i.e., $P(\cdot,\mathcal{M}(\mathbb{M})):\mathbb{R}^{p}\to\mathbb{R}^{p}$, defined as

P(x,\mathcal{M}(\mathbb{M}))=\arg\min_{y\in\mathcal{M}(\mathbb{M})}\|x-y\|^{2}.

With this projection definition, we borrow two important projections: Head Projection and Tail Projection from previous literature to help us with theoretical analysis.

Assumption 1.

(Head Projection) [14] Let $\mathbb{M}$ and $\mathbb{M}_{\mathcal{H}}$ be predefined subspace models. Given any vector $x$, there exists a $(c_{H},\mathbb{M},\mathbb{M}_{\mathcal{H}})$-Head-Projection, which finds a subspace $H\in\mathbb{M}_{\mathcal{H}}$ such that

\|P(x,H)\|_{2}\geq c_{H}\cdot\max_{S\in\mathbb{M}}\|P(x,S)\|_{2},

where $0<c_{H}\leq 1$. We denote $P(x,H)$ as $P(x,\mathbb{M},\mathbb{M}_{\mathcal{H}})$.

Assumption 2.

(Tail Projection) [14] Let $\mathbb{M}$ and $\mathbb{M}_{\mathcal{T}}$ be predefined subspace models. Given any vector $x$, there exists a $(c_{T},\mathbb{M},\mathbb{M}_{\mathcal{T}})$-Tail-Projection, which finds a subspace $T\in\mathbb{M}_{\mathcal{T}}$ such that

\|P(x,T)-x\|_{2}\leq c_{T}\cdot\min_{S\in\mathbb{M}}\|x-P(x,S)\|_{2},

where $c_{T}\geq 1$. We denote $P(x,T)$ as $P(x,\mathbb{M},\mathbb{M}_{\mathcal{T}})$.

We can see that the head projection keeps large-magnitude entries, whereas the tail projection sets small-magnitude entries to zero.
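As a toy illustration (ours): in the special case of the plain $s$-sparse model, both projections can be computed exactly with $c_{H}=c_{T}=1$, and for $x=(3,-0.5,2,0.1)^{\top}$ with $s=2$ both return $(3,0,2,0)^{\top}$; the head projection because this support captures the most energy of $x$, and the tail projection because it is the closest 2-sparse vector to $x$. Under a weighted graph model the two generally differ, since connectivity and weight-budget constraints restrict which supports are admissible and only approximate projections are available.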

Definition 4.

($(\alpha,\beta,\mathcal{M}(\mathbb{M}))$-RSC/RSS Properties) [14] We say a differentiable function $f(\cdot)$ satisfies the $(\alpha,\beta,\mathcal{M}(\mathbb{M}))$-Restricted Strong Convexity (RSC)/Smoothness (RSS) property if there exist positive constants $\alpha$ and $\beta$ such that

\frac{\alpha}{2}\|x-y\|_{2}^{2}\leq B_{f}(x,y)\leq\frac{\beta}{2}\|x-y\|_{2}^{2},

for all $x,y\in\mathcal{M}(\mathbb{M})$, where $B_{f}(x,y)$ is the Bregman divergence of $f$, i.e.,

B_{f}(x,y)=f(x)-f(y)-\langle\nabla f(y),x-y\rangle.

Here $\alpha$ and $\beta$ are the restricted strong convexity parameter and the restricted smoothness parameter, respectively.
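As a standard example (not specific to this paper), for the least-squares loss $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ with data matrix $A$ and observations $b$, the Bregman divergence reduces to

B_{f}(x,y)=\frac{1}{2}\|A(x-y)\|_{2}^{2},

so the RSC/RSS constants $\alpha$ and $\beta$ over $\mathcal{M}(\mathbb{M})$ correspond to restricted eigenvalue bounds of $A^{\top}A$ along sparse directions $x-y$.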

The RSC/RSS definition is widely used in sparsity optimization. Together with the following assumption, we characterize the properties of the objective function.

Assumption 3.

[14] Given the objective function $F(x)$ in (1), we assume that $F(x)$ satisfies $\alpha$-RSC in the subspace model $\mathcal{M}(\mathbb{M}\oplus\mathbb{M}_{\mathcal{H}}\oplus\mathbb{M}_{\mathcal{T}})$, and that each function $f_{i}(x)$ satisfies $\beta$-RSS in $\mathcal{M}(\mathbb{M}\oplus\mathbb{M}_{\mathcal{H}}\oplus\mathbb{M}_{\mathcal{T}})$, where the sum $\oplus$ of two models $\mathbb{M}_{1}$ and $\mathbb{M}_{2}$ is defined as $\mathbb{M}_{1}\oplus\mathbb{M}_{2}:=\{S_{1}\cup S_{2}:S_{1}\in\mathbb{M}_{1},S_{2}\in\mathbb{M}_{2}\}$.
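For instance, if $\mathbb{M}_{1}=\{\{1,2\}\}$ and $\mathbb{M}_{2}=\{\{2,3\},\{4\}\}$, then $\mathbb{M}_{1}\oplus\mathbb{M}_{2}=\{\{1,2,3\},\{1,2,4\}\}$. Taking unions enlarges the admissible supports, which is why the RSC/RSS conditions are required on the enlarged model $\mathbb{M}\oplus\mathbb{M}_{\mathcal{H}}\oplus\mathbb{M}_{\mathcal{T}}$ in which the iterates, projected gradients, and their differences actually live.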

III Methods

In this section, we introduce two proposed algorithms: GraphSVRG-IHT and GraphSCSG-IHT. Both algorithms employ the variance-reduced techniques derived from SVRG [17] and SCSG [18], respectively, while also utilizing the graph projection operators found in GraphSto-IHT [30]. This results in methods that are applicable to graph-structured sparsity problems and effectively reduce the variance inherent to stochastic gradient descent. Therefore, our algorithms converge faster and more accurately than their predecessors.

III-A GraphSVRG-IHT

Our proposed GraphSVRG-IHT algorithm (Algorithm 1) utilizes variance reduction by periodically computing the full gradient, significantly reducing the inherent variance in stochastic gradient methods. By incorporating graph projection operators, our GraphSVRG-IHT adapts to non-convex graph sparsity constraints, enhancing its applicability and efficiency. The key steps are outlined below:

  1. Calculate the full gradient $\tilde{v}^{j}$ at the position at the start of each outer loop, $\tilde{x}^{j}$ (Line 4).

  2. In the inner loop, compute two gradients from a single sampled data point: one at the current inner iterate $x^{j}_{k-1}$ and the other at $\tilde{x}^{j}$. Then calculate the stochastic variance-reduced gradient $v_{k}^{j}$ (Lines 7-8).

  3. Pass $v_{k}^{j}$ through the Head Projection operator (Line 9), and use the resulting gradient to update the next iterate $x^{j}_{k}$ through the Tail Projection operator (Line 10).

  4. After a fixed number $\mathcal{K}$ of inner-loop iterations, update the outer-loop position $\tilde{x}^{j}$ and re-calculate the full gradient (Line 12).

Algorithm 1 GraphSVRG-IHT
1: Input: $\eta$, $f_{i}$, $\mathbb{M}$, $\mathbb{M}_{\mathcal{H}}$, $\mathbb{M}_{\mathcal{T}}$, $\mathcal{J}$, $\mathcal{K}$
2: Initialize: $\tilde{x}^{1}$ such that $\mathrm{supp}(\tilde{x}^{1})\in\mathbb{M}$
3: for $j=1$ to $\mathcal{J}$ do
4:     $\tilde{v}^{j}=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\tilde{x}^{j})$
5:     $x^{j}_{0}=\tilde{x}^{j}$
6:     for $k=1$ to $\mathcal{K}$ do
7:         Randomly pick $i_{k}\in[n]$
8:         $v^{j}_{k}=\nabla f_{i_{k}}(x^{j}_{k-1})-\nabla f_{i_{k}}(\tilde{x}^{j})+\tilde{v}^{j}$
9:         $\tau_{k}=P(v^{j}_{k},\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})$
10:        $x^{j}_{k}=P(x^{j}_{k-1}-\eta\tau_{k},\mathbb{M},\mathbb{M}_{\mathcal{T}})$
11:     end for
12:     $\tilde{x}^{j+1}=x^{j}_{\mathcal{K}}$
13: end for

Our proposed GraphSVRG-IHT algorithm differs from GraphSto-IHT in that GraphSVRG-IHT uses nested loops, allowing it to account for the history of the stochastic process, whereas GraphSto-IHT only considers a single stochastic gradient. Both methods implement Head and Tail Projections for hard thresholding, making them applicable to graph-structured sparsity optimization problems. The inclusion of the stochastic variance-reduced gradient in GraphSVRG-IHT makes the theoretical analysis more complex and challenging.
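A minimal Python sketch of Algorithm 1 is given below; it assumes user-supplied callables grad_fi(i, x) for the per-sample gradients and head_proj / tail_proj implementing the approximate projections of Assumptions 1-2. All names here are ours and are meant only to mirror the pseudocode, not to document a released API.

import numpy as np

def graph_svrg_iht(grad_fi, n, x0, eta, J, K, head_proj, tail_proj, rng=None):
    # Sketch of GraphSVRG-IHT (Algorithm 1).
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = x0.copy()
    for _ in range(J):                               # outer loop
        # Line 4: full gradient at the snapshot x_tilde
        v_tilde = np.mean([grad_fi(i, x_tilde) for i in range(n)], axis=0)
        x = x_tilde.copy()                           # Line 5
        for _ in range(K):                           # inner loop
            i = rng.integers(n)                      # Line 7: sample one index
            # Line 8: stochastic variance-reduced gradient
            v = grad_fi(i, x) - grad_fi(i, x_tilde) + v_tilde
            tau = head_proj(v)                       # Line 9: head projection of the gradient
            x = tail_proj(x - eta * tau)             # Line 10: gradient step + tail projection
        x_tilde = x                                  # Line 12: update the snapshot
    return x_tilde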

III-B GraphSCSG-IHT

To better understand the calculation of variance-reduced gradients and stochastically control the outer batch size, we propose the GraphSCSG-IHT algorithm (Algorithm 2). While similar to Algorithm 1 in its use of variance-reduced gradients, GraphSCSG-IHT has the following key characteristics:

  1. In the outer loop, the gradient is calculated using a batch of data of size $B$, whereas Algorithm 1 calculates a full gradient at this step (Lines 4-5).

  2. In the inner loop, when calculating the stochastic variance-reduced gradient, a mini-batch of size $b$ is used instead of a single data point (Lines 10-11).

  3. The number of inner loops, $\mathcal{K}^{j}$, is not fixed. Instead, $\mathcal{K}^{j}$ is drawn from a geometric distribution (Line 7) or can be set as $\frac{B}{b}$ (Line 8).

Algorithm 2 GraphSCSG-IHT
1: Input: $\eta$, $f_{i}$, $\mathbb{M}$, $\mathbb{M}_{\mathcal{H}}$, $\mathbb{M}_{\mathcal{T}}$, $B$, $b$, $\mathcal{J}$
2: Initialize: $\tilde{x}^{1}$ such that $\mathrm{supp}(\tilde{x}^{1})\in\mathbb{M}$
3: for $j=1$ to $\mathcal{J}$ do
4:     Uniformly sample a batch $I^{j}\subset[n]$ s.t. $|I^{j}|=B$
5:     $\tilde{\mu}^{j}=\nabla f_{I^{j}}(\tilde{x}^{j})$
6:     $x^{j}_{0}=\tilde{x}^{j}$
7:     (Option I) Generate $\mathcal{K}^{j}\sim\mathrm{Geom}\left(\frac{B}{B+b}\right)$
8:     (Option II) $\mathcal{K}^{j}=\frac{B}{b}$
9:     for $k=1$ to $\mathcal{K}^{j}$ do
10:        Uniformly sample a mini-batch $I^{j}_{k}\subset[n]$ s.t. $|I^{j}_{k}|=b$
11:        $\mu^{j}_{k}=\nabla f_{I^{j}_{k}}(x^{j}_{k-1})-\nabla f_{I^{j}_{k}}(\tilde{x}^{j})+\tilde{\mu}^{j}$
12:        $\tau_{k}=P(\mu^{j}_{k},\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})$
13:        $x^{j}_{k}=P(x^{j}_{k-1}-\eta\tau_{k},\mathbb{M},\mathbb{M}_{\mathcal{T}})$
14:     end for
15:     $\tilde{x}^{j+1}=x^{j}_{\mathcal{K}^{j}}$
16: end for

In contrast to GraphSto-IHT, the inner loop of GraphSCSG-IHT computes the gradient over a random set of functions rather than a single randomly selected function. GraphSCSG-IHT also employs two loops with different batch sizes for gradient calculation, making it more flexible than GraphSto-IHT and GraphSVRG-IHT. This flexibility allows GraphSCSG-IHT to serve as a general framework for graph-constrained optimization.

In summary, compared with traditional IHT and StoIHT methods, the graph-structured hard thresholding steps differ mainly in their use of Head and Tail Projections. Both proposed algorithms are more complex than GraphSto-IHT, introducing distinct variance-reduction techniques while maintaining the same sparsification enforcement. It is also important to note that GraphSVRG-IHT is a special case of GraphSCSG-IHT; hence, we provide a theoretical analysis of GraphSCSG-IHT, which can be easily extended to the GraphSVRG-IHT case.
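A corresponding sketch of Algorithm 2 is shown below, again with hypothetical names (grad_fi, head_proj, tail_proj) standing in for the per-sample gradients and the approximate projections; the geometric sampling of the inner-loop length follows our reading of Option I, where the expected number of inner steps is roughly B/b.

import numpy as np

def graph_scsg_iht(grad_fi, n, x0, eta, J, B, b, head_proj, tail_proj,
                   geometric_K=True, rng=None):
    # Sketch of GraphSCSG-IHT (Algorithm 2).
    rng = np.random.default_rng() if rng is None else rng

    def batch_grad(idx, x):
        # Average gradient over a sampled index set.
        return np.mean([grad_fi(i, x) for i in idx], axis=0)

    x_tilde = x0.copy()
    for _ in range(J):
        I = rng.choice(n, size=B, replace=False)     # Line 4: outer batch
        mu_tilde = batch_grad(I, x_tilde)            # Line 5: batch gradient instead of a full gradient
        x = x_tilde.copy()                           # Line 6
        if geometric_K:
            # Option I: inner-loop length drawn so that its mean is about B/b
            # (numpy's geometric counts trials until the first success).
            K = rng.geometric(b / (B + b))
        else:
            K = max(1, B // b)                       # Option II
        for _ in range(K):
            I_k = rng.choice(n, size=b, replace=False)                     # Line 10: inner mini-batch
            mu = batch_grad(I_k, x) - batch_grad(I_k, x_tilde) + mu_tilde  # Line 11
            tau = head_proj(mu)                      # Line 12: head projection
            x = tail_proj(x - eta * tau)             # Line 13: step + tail projection
        x_tilde = x                                  # Line 15
    return x_tilde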

IV Theoretical Analysis

In this section, we present our main theoretical results characterizing the estimation error of the parameter $x$. The proof provides a general framework based on the gradients from GraphSCSG-IHT. Consequently, the main theorem is applicable to both of our proposed algorithms, GraphSVRG-IHT and GraphSCSG-IHT. We demonstrate the convergence of these algorithms by bounding the final error in terms of the $\ell_{2}$ distance between the initial point and the optimum. Additionally, we denote the history of the stochastic process up to iteration $j\cdot\mathcal{K}$ by $I^{j}_{\mathcal{K}}$.

Before delving into the main theorem, we present a key lemma that is crucial for the proof.

Lemma 1.

[30] If each $f_{\xi_{t}}(\cdot)$ and $F(x)$ satisfy Assumption 3, and given the head projection model $(c_{H},\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})$ and the tail projection model $(c_{T},\mathbb{M},\mathbb{M}_{\mathcal{T}})$, then we have the following inequality

\mathbb{E}_{\xi_{t}}\|(x^{t}-x^{*})_{H}\|\leq\sqrt{1-\alpha_{0}}\,\mathbb{E}_{\xi_{t}}\|x^{t}-x^{*}\|+\sigma_{1},

where

\sigma_{1}=\left(\frac{\beta_{0}}{\alpha_{0}}+\sqrt{\frac{\alpha_{0}\beta_{0}}{1-\alpha_{0}}}\right)\mathbb{E}_{\xi_{t}}\|\nabla_{I}f_{\xi_{t}}(x^{*})\|,

H=\mathrm{supp}(P(\nabla f_{\xi_{t}}(x^{t}),\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}},\mathbb{M}_{\mathcal{H}})),

\alpha_{0}=c_{H}\alpha\tau-\sqrt{\alpha\beta\tau^{2}-2\alpha\tau+1},\quad\beta_{0}=(1+c_{H})\tau,

I=\arg\max_{S\in\mathbb{M}\oplus\mathbb{M}_{\mathcal{T}}\oplus\mathbb{M}_{\mathcal{H}}}\mathbb{E}_{\xi_{t}}\|\nabla_{S}f_{\xi_{t}}(x^{*})\|,

and $\tau\in(0,2/\beta)$.

With the prepared lemmas and appropriate assumptions in place, we can now present our main theorem. This theorem establishes the convergence properties of our proposed algorithms, under specific conditions. The detailed statement of our main theorem is as follows:

Theorem 2.

(Main Theorem) Assume that Definition 4 and Assumption 3 hold. Under the same setting as Lemma 1, let $x^{0}$ be the starting point. If we choose a constant learning rate $\eta$ within

\eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right),

then the solution $\tilde{x}^{j+1}$ of GraphSCSG-IHT satisfies

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left[\left(\frac{\delta}{1-\lambda}\right)^{S}+\lambda^{j}\left(\frac{1-\lambda}{1-\lambda-\delta}\right)\right]\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda-\delta}\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|,

where

\delta=(1+c_{\mathcal{T}})\left(\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}+\sqrt{1-\alpha_{0}}\right),

\lambda=(1+c_{\mathcal{T}})\left(2\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\right),

\gamma=(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right).

Theorem 2 demonstrates that our new algorithms achieve linear convergence with stochastic variance-reduced gradients, even in the more stochastic setting with batch and mini-batch sampling. The quantities $\delta$ and $\lambda$ defined in the theorem are strictly less than 1, ensuring that the error decreases as the number of iterations increases. This result aligns with our experimental findings, where more iterations consistently lead to smaller errors.

From Theorem 2, we derive the following corollary, which further justifies the convergence of our algorithm and specifies the appropriate range for the learning rate η\eta.

Corollary 2.1.

To ensure convergence of our algorithm, the constant learning rate $\eta$ should be chosen within the range $\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$. For this range to be valid, the following inequality must hold:

\frac{\delta}{1-\lambda}<1.

Corollary 2.1 is the cornerstone of Theorem 2. It ensures that the upper bound on the estimation error does not blow up to infinity and provides a constant value for the finite geometric series. It also ensures that the upper bound decays further as more iterations are performed. In addition, Corollary 2.1 provides a range for $\eta$, which is narrower than the one given by GraphSto-IHT; within this range the algorithm is guaranteed to converge. All proofs are provided in the appendix of the paper.
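For a purely illustrative numerical instance of this range (values chosen by us): with $\alpha=1$ and $\beta=1.05$, we have $4\alpha^{2}-3.75\alpha\beta=0.0625$ and $\sqrt{0.0625}=0.25$, so the admissible learning rates are

\eta\in\left(\frac{2-0.25}{2.1},\frac{2+0.25}{2.1}\right)\approx(0.83,1.07).

Note that the interval is nonempty only when $4\alpha^{2}\geq 3.75\alpha\beta$, i.e., $\beta\leq\frac{16}{15}\alpha$, so the restricted condition number $\beta/\alpha$ must be close to one.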

V Experiments

V-A Experimental setup

We perform multiple experiments to compare our proposed algorithms with baseline methods. For our experiments, we track the residual norm of the loss function, $\|\mathbf{A}x^{t+1}-y\|$, as the number of epochs increases. Due to the non-convex nature of the problem, there are several local minima and the algorithm may not approach the global minimum $x^{*}$. Additionally, in real-world applications, $x^{*}$ is often unknown. Therefore, we use the residual norm as a measure of convergence, as opposed to the distance from the final iterate to the target vector, $\|x^{t+1}-x^{*}\|$. All experiments are run on an Ubuntu 22.04.4 server with 256 AMD EPYC 9554 64-core processors and 1.6 TB of RAM. All code is written in Python and is available at https://github.com/Derek-Fox/graph-scsg-iht.

V-B Synthetic Dataset

We first tested our methods on synthetic datasets to determine the optimal parameters. For a fair comparison, we followed the exact settings used in GraphSto-IHT, conducting multiple experiments using a grid graph with a dimension of 256 and unit-weight edges.

Figure 1: Comparison of methods with different learning rates.
Figure 2: Comparison of methods with different sparsities.
Figure 3: Number of data points vs. residual loss value with different batch sizes.

Choice of $\eta$. To study the effect of the learning rate on the performance of our algorithms, we varied $\eta$ across {0.1, 0.01, 0.001} and tested these rates in various sparsity cases, with sparsity values chosen from {256, 128, 64, 32}. By varying the learning rate, we aimed to understand the convergence behavior of our algorithms. As shown in Figure 1, our experiments reveal that, as expected, with a larger learning rate all methods converge quickly, while with a smaller learning rate the iterates take longer to achieve zero residual loss. Additionally, both our proposed methods and the baseline method converge stably, and regardless of the learning rate setting, GraphSVRG-IHT consistently performs the best.

Choice of sparsity. Studying the sparsity setting $s$ is crucial for understanding the performance of our algorithms in graph sparsification optimization. To examine the effect of sparsity, we compared our methods, GraphSVRG-IHT and GraphSCSG-IHT, against the baseline algorithm GraphSto-IHT. Using the experimental settings from GraphSto-IHT, we employed a grid graph of dimension 256, fixed the learning rate $\eta=0.01$, and set the batch size equal to the sparsity parameter, $B=s$. We varied the sparsity parameter $s$ to observe the behavior of all algorithms, as shown in Figure 2. Figure 2 demonstrates that as $s$ decreases, GraphSVRG-IHT outperforms the other algorithms in minimizing the residual norm $\|\mathbf{A}x^{t+1}-y\|$ over epochs. Another interesting finding is that GraphSto-IHT and GraphSCSG-IHT display almost identical behavior in this parameter setting.

Table I: Identified genes from breast cancer dataset
Method Number of genes Identified genes
IHT 2 NAT1 TOP2A
StoIHT 2 NAT1 TOP2A
GraphSto-IHT 6 AR ATM BRCA2 CCND2 CDKN1A TOP2A
GraphSVRG-IHT 6 AR ATM BRCA2 CCND2 CDKN1A TOP2A
GraphSCSG-IHT 10 AR ATM BRCA1 BRCA2 CCND2 CCND3 CDKN1A CHEK2 FBP1 TOP2A

Choice of batch size. After exploring the choice of $\eta$ and $s$, we varied the batch size $B$ on a grid graph with a dimension of 256 to demonstrate the advantages of GraphSCSG-IHT. Here we fixed $\eta=0.01$ and $s=32$. For a fair comparison, we considered the number of data points used, instead of the number of epochs, to estimate the run time of the algorithms, which is a common practice in optimization.

Since gradient calculations are computationally expensive, using fewer data points results in faster run times. Figure 3 shows three different scenarios with varying batch sizes $B$. When $B$ equals the dimension, GraphSCSG-IHT degrades to GraphSVRG-IHT, resulting in similar performance. With smaller $B$ (meaning fewer data points are used in the outer gradient calculation), GraphSCSG-IHT consistently outperforms GraphSVRG-IHT. However, as $B$ continues to decrease, GraphSCSG-IHT initially outperforms GraphSVRG-IHT but is eventually overtaken by it slightly. This occurs because smaller batch sizes increase gradient variance, making it harder for GraphSCSG-IHT to maintain its initial advantage. Furthermore, the interaction between batch size and mini-batch size becomes more complex, ultimately affecting the final performance.

Therefore, we also conducted numerous experiments to study the effects of mini-batch size in conjunction with batch size. Additionally, to better understand graph-structure patterns, we explored how varying the number of connected components in subgraphs impacts final convergence performance. Additional results are provided in the Appendix.

V-C Real-world Dataset

To test our algorithms on a real-world dataset, we use a large breast cancer dataset [27]. This dataset contains 295 training samples with 8,141 gene dimensions, including 78 positive (metastatic) and 217 negative (non-metastatic) samples. Following the experimental setting in [30], we use a Protein-Protein Interaction network with 637 pathways from [15]; the dataset is split into 5 folds, and 20 trials are conducted.

Table I shows the results of the gene identification task on the breast cancer dataset. GraphSCSG-IHT identifies 40% of the 25 genes highly correlated with breast cancer, as gathered by [30]. GraphSto-IHT and GraphSVRG-IHT both identify 24% of these genes, consistent with the findings in [30]. All the graph-structured methods greatly outperform the IHT and StoIHT methods, which only identify 8% of the cancer-related genes. These results demonstrate the promise of variance-reduction techniques for stochastic gradient descent in the setting of graph sparsity optimization.

In summary, GraphSVRG-IHT is more stable, while GraphSCSG-IHT shows excellent performance under appropriate settings due to the introduction of two additional batch-size parameters, which allow greater flexibility and control in the gradient estimation process. Overall, our two new algorithms demonstrate both efficiency and effectiveness by incorporating variance-reduction techniques into graph sparsity problems, making them suitable for a wide range of applications.

VI Conclusion

We have proposed two algorithms that utilize variance-reduction techniques in the setting of graph sparsity optimization. The proposed algorithms can significantly improve stability and efficiency in machine learning and other applications. Theoretically, we provide a general proof framework showing linear convergence. Empirically, our algorithms are competitive in minimizing the objective loss function compared to their predecessors in various experimental settings on a synthetic dataset. Additionally, testing on a large-scale medical dataset demonstrated superior performance in identifying cancer-related genes. Future work includes testing our algorithms on larger real-world datasets. By employing graph-structured hard thresholding, we can uncover more underlying subgraphs and related patterns, with significant implications in fields such as medical research and social network analysis. This approach can enhance our understanding of complex data structures and lead to more effective solutions.

VII Acknowledgement

This work was completed at the University of North Carolina at Greensboro and funded by the NSF REU program (award 2349369).

References

  • [1] Cem Aksoylar, Lorenzo Orecchia, and Venkatesh Saligrama. Connected subgraph detection with mirror descent on sdps. In International Conference on Machine Learning, pages 51–59. PMLR, 2017.
  • [2] Ery Arias-Castro, Emmanuel J Candes, and Arnaud Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, pages 278–304, 2011.
  • [3] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.
  • [4] Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
  • [5] Sohail Bahmani, Bhiksha Raj, and Petros T Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14(Mar):807–841, 2013.
  • [6] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on information theory, 56(4):1982–2001, 2010.
  • [7] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
  • [8] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
  • [9] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
  • [10] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.
  • [11] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A fast approximation algorithm for tree-sparse recovery. In 2014 IEEE International Symposium on Information Theory, pages 1842–1846. IEEE, 2014.
  • [12] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. IEEE Transactions on Information Theory, 61(9):5129–5147, 2015.
  • [13] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A nearly-linear time framework for graph-structured sparsity. In International Conference on Machine Learning, pages 928–937. PMLR, 2015.
  • [14] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Fast recovery from a union of subspaces. Advances in Neural Information Processing Systems, 29, 2016.
  • [15] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, pages 433–440, 2009.
  • [16] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
  • [17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
  • [18] Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.
  • [19] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning. In International Conference on Machine Learning, pages 917–925, 2016.
  • [20] Guannan Liang, Qianqian Tong, Chunjiang Zhu, and Jinbo Bi. An effective hard thresholding method based on stochastic variance reduction for nonconvex sparse learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34(02), pages 1585–1592, 2020.
  • [21] Nam Nguyen, Deanna Needell, and Tina Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869–6895, 2017.
  • [22] Jing Qian, Venkatesh Saligrama, and Yuting Chen. Connected sub-graph detection. In Artificial Intelligence and Statistics, pages 796–804. PMLR, 2014.
  • [23] Polina Rozenshtein, Aris Anagnostopoulos, Aristides Gionis, and Nikolaj Tatti. Event detection in activity networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1176–1185, 2014.
  • [24] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [25] Berwin A Turlach, William N Venables, and Stephen J Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
  • [26] Sara A Van de Geer et al. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008.
  • [27] Marc J Van De Vijver, Yudong D He, Laura J Van’t Veer, Hongyue Dai, Augustinus AM Hart, Dorien W Voskuil, George J Schreiber, Johannes L Peterse, Chris Roberts, Matthew J Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.
  • [28] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006.
  • [29] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning, pages 127–135, 2014.
  • [30] Baojian Zhou, Feng Chen, and Yiming Ying. Stochastic iterative hard thresholding for graph-structured sparsity optimization. In International Conference on Machine Learning, pages 7563–7573. PMLR, 2019.
  • [31] Pan Zhou, Xiaotong Yuan, and Jiashi Feng. Efficient stochastic gradient hard thresholding. In Advances in Neural Information Processing Systems, pages 1988–1997, 2018.

Appendix A Parameters Table

Table: Constraints for Parameters

Notation Name Constraint
$\eta$ Step size Fixed
$F(\cdot)$ Objective function Minimized
$\mathbb{M}$ Set of supports
$\mathbb{M}_{\mathcal{H}}$ Set of supports (head)
$\mathbb{M}_{\mathcal{T}}$ Set of supports (tail)
$B$ Big batch size
$I^{j}$ Big batch sample $I^{j}\subset[n]$ and $|I^{j}|=B$
$b$ Small batch size $b\leq B$
$I^{j}_{k}$ Small batch sample $I^{j}_{k}\subset[n]$ and $|I^{j}_{k}|=b$
$\mathcal{J}$ Number of outer loops
$j$ Outer loop index
$\mathcal{K}$ Number of inner loops
$k$ Inner loop index
$\tilde{x}^{1}$ Starting position $\mathrm{supp}(\tilde{x}^{1})\in\mathbb{M}$
$s$ Number of non-zero entries Sparsity constraint
$P(x,\mathbb{M},\mathbb{M}_{\mathcal{H}})$ Head projection
$P(x,\mathbb{M},\mathbb{M}_{\mathcal{T}})$ Tail projection
$\tilde{v}^{j}$ SVRG full gradient
$\tilde{\mu}^{j}$ SCSG batch gradient
$\tilde{x}^{j}$ Current outer-loop position
$v^{j}_{k}$ Variance-reduced gradient (GraphSVRG-IHT)
$\mu^{j}_{k}$ Variance-reduced gradient (GraphSCSG-IHT)
$\tau_{k}$ Sparsified gradient
$g$ Number of connected components

Appendix B Proof

Lemma 4 [21]. Let $\xi_{t}$ be a discrete random variable defined on $[n]$ with probability mass function $\Pr(\xi_{t}=i):=\frac{1}{n}$, i.e., the probability of $\xi_{t}$ selecting the $i$-th block at time $t$. For any fixed sparse vectors $x$, $y$ and $0<\tau<\frac{2}{\beta}$, we have

\mathbb{E}_{\xi_{t}}\|x-y-\tau(\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y))\|\leq\sqrt{\alpha\beta\tau^{2}-2\alpha\tau+1}\,\|x-y\|,

where $\Omega$ is such that $\mathrm{supp}(x)\cup\mathrm{supp}(y)\subseteq\Omega$ and $\Omega\in\mathbb{M}$.

Proof.

We first derive an upper bound for $\mathbb{E}_{\xi_{t}}\|x-y-\tau(\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y))\|^{2}$ as follows:

\mathbb{E}_{\xi_{t}}\|x-y-\tau(\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y))\|^{2}

=\|x-y\|^{2}-2\tau\,\mathbb{E}_{\xi_{t}}\langle x-y,\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y)\rangle+\tau^{2}\,\mathbb{E}_{\xi_{t}}\|\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y)\|^{2}

=\|x-y\|^{2}-2\tau\langle x-y,\mathbb{E}_{\xi_{t}}[\nabla f_{\xi_{t}}(x)-\nabla f_{\xi_{t}}(y)]\rangle+\tau^{2}\,\mathbb{E}_{\xi_{t}}\|\nabla_{\Omega}f_{\xi_{t}}(x)-\nabla_{\Omega}f_{\xi_{t}}(y)\|^{2}

\leq\|x-y\|^{2}-2\tau\langle x-y,\nabla F(x)-\nabla F(y)\rangle+\tau^{2}\beta\langle x-y,\nabla F(x)-\nabla F(y)\rangle

=\|x-y\|^{2}+(\tau^{2}\beta-2\tau)\langle x-y,\nabla F(x)-\nabla F(y)\rangle

\leq(\alpha\beta\tau^{2}-2\alpha\tau+1)\|x-y\|^{2},

where the second equality uses the fact that $\mathrm{supp}(x)\cup\mathrm{supp}(y)\subseteq\Omega$, the first inequality follows from Lemma 2, the third equality is obtained by using the fact that $\mathbb{E}_{\xi_{t}}[\nabla f_{\xi_{t}}(x)-\nabla f_{\xi_{t}}(y)]=\nabla F(x)-\nabla F(y)$, and the last inequality is due to the restricted strong convexity of $F(x)$ on $\mathcal{M}(\mathbb{M})$. We complete the proof by taking the square root of both sides and using the fact that for any random variable $X$, $(\mathbb{E}[X])^{2}\leq\mathbb{E}[X^{2}]$. ∎

Proof of Theorem 2.

Proof.
\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|=\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|P(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}},\mathbb{M},\mathbb{M}_{\mathcal{T}})-x^{*}\| (1)

\leq\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|P(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}},\mathbb{M},\mathbb{M}_{\mathcal{T}})-(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}})\|+\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|(x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}})-x^{*}\| (2)

\leq(1+c_{\mathcal{T}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}-1}-\eta\upsilon_{\mathcal{K}}-x^{*}\| (3)

=(1+c_{\mathcal{T}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}-\eta\upsilon_{\mathcal{K}}\| (4)

=(1+c_{\mathcal{T}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}-\eta(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H}f_{I^{j}}(\tilde{x}^{j}))\| (5)

By the definition of $\tilde{x}^{j+1}$, Eq. (1) holds. Eq. (2) holds by the triangle inequality. Eq. (3) holds by Assumption 2. Eq. (4) holds by defining $r^{\mathcal{K}}=x^{j}_{\mathcal{K}-1}-x^{*}$. Eq. (5) holds by the definition of $\upsilon_{\mathcal{K}}$. Now we focus on the expectation only. Also, define $m^{\mathcal{K}}=\tilde{x}^{j}-x^{*}$.

\Rightarrow\ \mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}-\eta(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H}f_{I^{j}}(\tilde{x}^{j}))\|

\leq\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\big[\|r^{\mathcal{K}}_{H}-\eta(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H}f_{I^{j}}(\tilde{x}^{j}))-(\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{*})-\nabla_{H}f_{I^{j}_{\mathcal{K}}}(x^{*})+\nabla_{H}f_{I^{j}}(x^{*}))\|+\eta\|\nabla_{H}f_{I^{j}}(x^{*})\|+\|r^{\mathcal{K}}_{H^{c}}\|\big] (6)

\leq\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\big[\|r^{\mathcal{K}}_{H\cup\Omega}-\eta(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H\cup\Omega}f_{I^{j}}(\tilde{x}^{j}))-(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})+\nabla_{H\cup\Omega}f_{I^{j}}(x^{*}))\|+\eta\|\nabla_{H}f_{I^{j}}(x^{*})\|+\|r^{\mathcal{K}}_{H^{c}}\|\big] (7)

=\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\big[\|r^{\mathcal{K}}_{H\cup\Omega}-m^{\mathcal{K}}_{H\cup\Omega}+m^{\mathcal{K}}_{H\cup\Omega}-\eta(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{j}_{\mathcal{K}-1})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(\tilde{x}^{j})+\nabla_{H\cup\Omega}f_{I^{j}}(\tilde{x}^{j}))-(\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})-\nabla_{H\cup\Omega}f_{I^{j}_{\mathcal{K}}}(x^{*})+\nabla_{H\cup\Omega}f_{I^{j}}(x^{*}))\|+\eta\|\nabla_{H}f_{I^{j}}(x^{*})\|+\|r^{\mathcal{K}}_{H^{c}}\|\big] (8)

\leq\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|r_{\mathcal{K}}\|+\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|-m_{\mathcal{K}}\|+\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|m_{\mathcal{K}}\|+\eta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|+\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}_{H^{c}}\| (9)

=\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|r_{\mathcal{K}}\|+2\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|m_{\mathcal{K}}\|+\eta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|+\mathbb{E}_{I^{j}_{\mathcal{K}}|I^{j}_{\mathcal{K}-1}}\|r^{\mathcal{K}}_{H^{c}}\| (10)

\leq(\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}+\sqrt{1-\alpha_{0}^{2}})\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+2\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|\tilde{x}^{j}-x^{*}\|+\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (11)

Eq. (6) and Eq. (7) hold by the triangle inequality. Eq. (8) holds by adding and subtracting $m_{\mathcal{K}}$. Eq. (9) holds by the triangle inequality and Lemma 4. Eq. (10) holds by the fact that $\|m_{\mathcal{K}}\|=\|-m_{\mathcal{K}}\|$. Eq. (11) holds by Lemma 1. Multiplying through by the factor $(1+c_{\mathcal{T}})$ from Eq. (3), we therefore obtain the following inequality, which we will use to derive the final bound.

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|\tilde{x}^{j}-x^{*}\|+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (12)

where $\delta=(1+c_{\mathcal{T}})\left(\sqrt{\alpha\beta\eta^{2}-2\alpha\eta+1}+\sqrt{1-\alpha_{0}^{2}}\right)$ and $\lambda=2(1+c_{\mathcal{T}})\left(\frac{\delta}{1+c_{\mathcal{T}}}-\sqrt{1-\alpha_{0}^{2}}\right)$. Since we have two different vectors measured from the target vector $x^{*}$, we will apply the recursion in Eq. (12) twice to derive the final bound.

First, we apply the recursion with respect to the outer loop index $j$ by bounding $\mathbb{E}\|\tilde{x}^{j}-x^{*}\|$ using the above inequality; we then plug the result back in and repeat this process $j$ times.

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j}-x^{*}\|\leq\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|\tilde{x}^{j-1}-x^{*}\|+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (13)

Plugging Eq. (13) into Eq. (12), we get:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\left(\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda\,\mathbb{E}_{I^{j}_{\mathcal{K}-2}}\|\tilde{x}^{j-1}-x^{*}\|+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|\right)+(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (14)

Applying this recursion $j$ times, we get:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\sum_{i=0}^{j}\lambda^{i}\right)\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\sum_{i=0}^{j}\lambda^{i}\right)(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (15)

By the geometric series bound (since $\lambda<1$), and by defining $\gamma=(1+c_{\mathcal{T}})\left(\frac{\beta_{0}}{\alpha_{0}}+\frac{\alpha_{0}\beta_{0}}{\sqrt{1-\alpha_{0}^{2}}}+\eta\right)$, we get:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\frac{1}{1-\lambda}\right)\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{1}{1-\lambda}\right)\gamma\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (16)

Having performed the recursion with respect to the outer loop only, we now apply the recursion with respect to the inner iterates as well, in order to obtain the final bound. This means we repeat the process $S$ times, where $S=\frac{\mathcal{B}\cdot\mathcal{T}}{b}$.

\mathbb{E}_{I^{j}_{\mathcal{K}-1}}\|x^{j}_{\mathcal{K}^{j}-1}-x^{*}\|\leq\left(\frac{1}{1-\lambda}\right)\delta\,\mathbb{E}_{I^{j}_{\mathcal{K}-2}}\|x^{j}_{\mathcal{K}^{j}-2}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{1}{1-\lambda}\right)\gamma\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (17)

We then plug Eq. (17) into Eq. (16), obtaining Eq. (18):

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\frac{\delta}{1-\lambda}\right)\left[\left(\frac{\delta}{1-\lambda}\right)\mathbb{E}_{I^{j}_{\mathcal{K}-2}}\|x^{j}_{\mathcal{K}^{j}-2}-x^{*}\|+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{\gamma}{1-\lambda}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|\right]+\lambda^{j}\|x^{0}-x^{*}\|+\left(\frac{\gamma}{1-\lambda}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (18)

After repeating this step $S$ times, we get the following inequality:

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left(\frac{\delta}{1-\lambda}\right)^{S}\|x^{0}-x^{*}\|+\lambda^{j}\left(\sum_{n=0}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda}\left(\sum_{n=1}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\| (19)

We finally arrive at the following bound, which guarantees the convergence of the algorithm since $\lambda<1$ and, by Corollary 2.1, $\frac{\delta}{1-\lambda}<1$.

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left[\left(\frac{\delta}{1-\lambda}\right)^{S}+\lambda^{j}\left(\sum_{n=0}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\right]\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda}\left(\sum_{n=0}^{S}\left(\frac{\delta}{1-\lambda}\right)^{n}\right)\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|

which we can rewrite as

\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\tilde{x}^{j+1}-x^{*}\|\leq\left[\left(\frac{\delta}{1-\lambda}\right)^{S}+\lambda^{j}\left(\frac{1-\lambda}{1-\lambda-\delta}\right)\right]\|x^{0}-x^{*}\|+\frac{\gamma}{1-\lambda-\delta}\,\mathbb{E}_{I^{j}_{\mathcal{K}}}\|\nabla_{I}f_{I^{j}}(x^{*})\|. ∎

Corollary 2.0. Let $\delta<1$, which holds since $\eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$. Then, by the definition of $\lambda$, namely

\lambda=2\left(\frac{\delta}{1+c_{\mathcal{T}}}-\sqrt{1-\alpha_{0}^{2}}\right),

we have $\lambda<1$.
Proof.

By contradiction, assume λ1\lambda\geq 1. Therefore,

\frac{2\delta}{1+c_{\mathcal{T}}}-2\sqrt{1-\alpha_{0}^{2}}\geq 1

\frac{2\delta}{1+c_{\mathcal{T}}}\geq 1+2\sqrt{1-\alpha_{0}^{2}}\geq 1

\frac{2\delta}{1+c_{\mathcal{T}}}\geq 1

2\delta\geq 1+c_{\mathcal{T}}\geq 2

\delta\geq 1

which is a contradiction, since δ<1\delta<1 by definition. Therefore, λ<1\lambda<1. Note that λ>0\lambda>0 by its definition. ∎

Corollary 2.1. To ensure convergence of our algorithm, the constant learning rate $\eta$ should be chosen within the range $\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$. For this range to be valid, the following inequality must hold:

\frac{\delta}{1-\lambda}<1.
Proof.

Assume by contradiction that δ1λ\delta\geq 1-\lambda. Then, by definition of λ\lambda, we get

\delta\geq 1-2\left(\frac{\delta}{1+c_{\mathcal{T}}}-\sqrt{1-\alpha_{0}^{2}}\right)\geq 1-\frac{2\delta}{1+c_{\mathcal{T}}} (1)

0\geq 1-\frac{2\delta}{1+c_{\mathcal{T}}}-\delta=1-\frac{\delta(3+c_{\mathcal{T}})}{1+c_{\mathcal{T}}}\geq 1-2\delta (2)

where Eq. (2) holds true by the definition of c𝒯c_{\mathcal{T}}. Now, we have the following:

1-2\delta\leq 0

1\leq 2\delta\ \Rightarrow\ \delta\geq\frac{1}{2}

Since $\eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right)$, we have $\delta<\frac{1}{2}$, which is a contradiction. Therefore,

\frac{\delta}{1-\lambda}<1,\quad\forall\ \eta\in\left(\frac{2\alpha-\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta},\frac{2\alpha+\sqrt{4\alpha^{2}-3.75\alpha\beta}}{2\alpha\beta}\right). ∎

Appendix C More Results

C-A Simulation

Figure 4: Various choices for $B$ and $b$.
Figure 5: Different numbers of connected components.

As pointed out in the main paper, we also explored the effects of mini-batch size in conjunction with batch size. Figure 4 demonstrates the generality of GraphSCSG-IHT: with different settings of $B$ and $b$, GraphSCSG-IHT can range in behavior between GraphSto-IHT and GraphSVRG-IHT. Not included in the figure is the case when $B$ equals the dimension of the data and $b=1$, in which case GraphSCSG-IHT and GraphSVRG-IHT are only distinguished by the number of inner loops (which could itself be parameterized to make the two algorithms identical).

We vary the number of connected components, controlled by the parameter $g$, in Figure 5. We observe that when $g$ is low (in this case $g=1$), the loss curve does not decrease smoothly but frequently levels off before stepping downward. This is due to the algorithm intermittently finding a more suitable support set, which allows it to decrease the objective loss more rapidly. Restricting the algorithm to only one connected component makes this phenomenon much more apparent.

C-B Breast cancer dataset

All parameters are tuned using 5-fold cross validation, following the experimental settings of [30]. The sparsity $s$ is tuned over $\{10,20,\ldots,100\}$ and the block size is tuned over $\{n,n/2\}$, where $n$ is the number of samples. The task is to find a single connected component, so $g$ is set to 1 for the head and tail projections. The learning rate is tuned using backtracking line search. We record the Balanced Classification Error and Area Under Curve (AUC) scores, as well as the number of nonzeros for each method over the 20 trials.

Table II: Balanced Classification Error ± Std. Dev. on Breast Cancer Dataset
IHT StoIHT GraphSto-IHT GraphSVRG-IHT GraphSCSG-IHT
Trial 1 0.352±0.079 0.355±0.083 0.367±0.091 0.335±0.048 0.474±0.048
Trial 2 0.350±0.047 0.346±0.045 0.395±0.074 0.340±0.077 0.496±0.015
Trial 3 0.354±0.060 0.370±0.061 0.365±0.107 0.318±0.061 0.478±0.061
Trial 4 0.351±0.046 0.350±0.066 0.365±0.032 0.365±0.032 0.459±0.050
Trial 5 0.357±0.064 0.365±0.063 0.366±0.065 0.368±0.082 0.418±0.076
Trial 6 0.334±0.050 0.327±0.046 0.330±0.078 0.352±0.090 0.447±0.071
Trial 7 0.343±0.050 0.402±0.085 0.321±0.027 0.321±0.027 0.499±0.004
Trial 8 0.352±0.059 0.402±0.103 0.338±0.021 0.349±0.030 0.489±0.013
Trial 9 0.340±0.082 0.357±0.058 0.335±0.110 0.343±0.040 0.452±0.053
Trial 10 0.344±0.068 0.351±0.065 0.349±0.050 0.331±0.061 0.502±0.005
Trial 11 0.339±0.056 0.318±0.061 0.353±0.076 0.357±0.054 0.473±0.054
Trial 12 0.355±0.061 0.394±0.053 0.349±0.088 0.329±0.082 0.463±0.072
Trial 13 0.327±0.053 0.317±0.040 0.312±0.065 0.323±0.060 0.457±0.051
Trial 14 0.362±0.058 0.340±0.071 0.348±0.019 0.330±0.054 0.453±0.045
Trial 15 0.381±0.048 0.346±0.051 0.297±0.059 0.297±0.059 0.416±0.069
Trial 16 0.330±0.094 0.320±0.089 0.330±0.074 0.339±0.064 0.488±0.055
Trial 17 0.349±0.074 0.332±0.067 0.353±0.078 0.330±0.056 0.437±0.055
Trial 18 0.331±0.096 0.331±0.096 0.292±0.067 0.290±0.069 0.462±0.053
Trial 19 0.327±0.052 0.333±0.054 0.311±0.021 0.306±0.023 0.476±0.017
Trial 20 0.333±0.038 0.338±0.026 0.352±0.064 0.376±0.036 0.427±0.052
Averaged 0.345±0.065 0.350±0.072 0.341±0.073 0.335±0.062 0.463±0.057
Table III: AUC score ± Std. Dev. on Breast Cancer Dataset
IHT StoIHT GraphSto-IHT GraphSVRG-IHT GraphSCSG-IHT
Trial 1 0.726±0.067 0.736±0.054 0.720±0.030 0.729±0.039 0.713±0.023
Trial 2 0.683±0.067 0.692±0.064 0.697±0.044 0.712±0.062 0.696±0.101
Trial 3 0.716±0.062 0.726±0.077 0.726±0.064 0.715±0.065 0.703±0.027
Trial 4 0.695±0.059 0.701±0.061 0.687±0.041 0.687±0.041 0.693±0.033
Trial 5 0.699±0.042 0.700±0.067 0.707±0.040 0.701±0.028 0.701±0.045
Trial 6 0.677±0.071 0.673±0.074 0.704±0.051 0.669±0.071 0.640±0.101
Trial 7 0.712±0.069 0.719±0.066 0.741±0.055 0.741±0.055 0.701±0.069
Trial 8 0.711±0.081 0.707±0.055 0.721±0.062 0.717±0.055 0.657±0.075
Trial 9 0.717±0.066 0.718±0.068 0.723±0.044 0.718±0.026 0.719±0.058
Trial 10 0.710±0.060 0.711±0.060 0.691±0.052 0.702±0.062 0.655±0.103
Trial 11 0.708±0.082 0.719±0.086 0.722±0.085 0.711±0.074 0.693±0.142
Trial 12 0.713±0.025 0.711±0.045 0.736±0.055 0.730±0.060 0.664±0.057
Trial 13 0.714±0.058 0.719±0.058 0.718±0.073 0.703±0.071 0.704±0.074
Trial 14 0.700±0.058 0.705±0.067 0.711±0.064 0.712±0.046 0.665±0.061
Trial 15 0.696±0.033 0.687±0.040 0.721±0.058 0.721±0.058 0.715±0.040
Trial 16 0.729±0.081 0.725±0.083 0.723±0.072 0.725±0.073 0.734±0.049
Trial 17 0.720±0.055 0.722±0.062 0.718±0.037 0.717±0.037 0.681±0.061
Trial 18 0.712±0.039 0.712±0.039 0.733±0.036 0.730±0.034 0.679±0.056
Trial 19 0.721±0.067 0.724±0.068 0.736±0.056 0.738±0.054 0.701±0.057
Trial 20 0.720±0.031 0.706±0.027 0.717±0.045 0.705±0.033 0.703±0.056
Averaged 0.709±0.062 0.711±0.064 0.717±0.057 0.714±0.057 0.691±0.074
Table IV: Num. nonzeros ± Std. Dev. on Breast Cancer Dataset
IHT StoIHT GraphSto-IHT GraphSVRG-IHT GraphSCSG-IHT
Trial 1 082.0±36.00 066.0±28.00 033.4±10.38 033.8±17.31 064.0±17.72
Trial 2 044.0±08.00 038.0±13.27 061.2±41.81 079.0±32.00 087.2±04.12
Trial 3 094.0±12.00 068.0±31.87 030.4±12.97 040.2±17.03 086.2±13.33
Trial 4 074.0±12.00 062.0±31.87 051.6±08.33 051.6±08.33 075.0±18.17
Trial 5 048.0±04.00 036.0±13.56 054.2±14.37 057.6±15.99 078.2±12.69
Trial 6 090.0±00.00 080.0±20.00 041.0±12.44 047.4±16.21 094.6±09.46
Trial 7 010.0±00.00 048.0±39.19 079.8±09.60 079.8±09.60 091.4±06.77
Trial 8 040.0±00.00 048.0±29.26 083.0±29.26 072.0±36.55 063.0±19.13
Trial 9 092.0±04.00 076.0±33.23 035.2±13.50 019.8±07.49 078.0±18.06
Trial 10 050.0±00.00 052.0±04.00 053.0±13.64 052.0±14.00 088.2±11.53
Trial 11 074.0±08.00 054.0±20.59 059.0±23.39 073.0±22.49 078.6±20.34
Trial 12 056.0±08.00 022.0±09.80 095.2±10.80 080.8±31.52 083.4±15.42
Trial 13 046.0±12.00 052.0±21.35 067.0±22.27 067.0±23.58 087.4±09.73
Trial 14 078.0±24.00 058.0±23.15 028.8±06.76 030.4±06.65 074.2±16.27
Trial 15 082.0±04.00 052.2±27.63 045.0±00.00 045.0±00.00 088.0±14.35
Trial 16 052.0±16.00 056.0±19.60 047.0±16.91 041.2±24.55 073.6±23.69
Trial 17 092.0±04.00 064.0±23.32 024.2±07.52 027.0±06.51 084.4±11.55
Trial 18 094.0±12.00 094.0±12.00 044.2±17.90 034.2±02.14 082.0±14.35
Trial 19 100.0±00.00 030.0±20.98 043.6±08.28 038.6±03.83 087.2±05.42
Trial 20 020.0±00.00 050.0±26.83 042.2±20.43 049.8±25.20 088.2±12.92
Averaged 065.9±28.36 055.3±29.26 051.0±25.39 051.0±26.44 081.6±16.80