Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization
Abstract
Stochastic optimization algorithms are widely used for large-scale data analysis due to their low per-iteration costs, but they often suffer from slow asymptotic convergence caused by inherent variance. Variance-reduction techniques have therefore been used to address this issue in structured sparse models based on sparsity-inducing norms. However, these techniques are not directly applicable to complex (non-convex) graph sparsity models, which are essential in applications such as disease outbreak monitoring and social network analysis. In this paper, we introduce two stochastic variance-reduced gradient-based methods for graph sparsity optimization: GraphSVRG-IHT and GraphSCSG-IHT. We provide a general framework for theoretical analysis, demonstrating that our methods enjoy linear convergence. Extensive experiments validate the efficiency and effectiveness of our proposed algorithms.
I Introduction
Graph structures enable the imposition of intricate sparsity constraints on a model, allowing it to better reflect relationships present in the data. For instance, graph-structured sparsity models are well suited to predicting the spread of diseases or identifying social groups in networks. The search for connected subgraphs or clusters has a significant impact on identifying disease-related genes [2, 22, 1, 23]. Graph sparsification, which aims to reduce the complexity of large-scale graphs while preserving their essential structural properties, has attracted increasing attention as a crucial technique in modern data analysis and machine learning [6, 11, 12, 13, 30].
Graph sparsification can be formulated as the following optimization problem:
$$\min_{x \in \mathbb{R}^d} \; F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) \quad \text{subject to} \quad \operatorname{supp}(x) \in \mathbb{M}, \tag{1}$$
which is known as the empirical risk minimization problem. Each $f_i$ is convex and differentiable, and the graph-structured sparsity is reflected by the constraint set $\mathbb{M}$ on $x$. The vector $x \in \mathbb{R}^d$ denotes the parameters of the model, and $f_i(x)$ is the loss associated with sample $i$. By minimizing the loss $F(x)$, we guide the model towards the optimal solution. Typically, sparsity is encoded by adding sparsity-inducing norms or penalties such as the $\ell_1$ norm, $\ell_0$ norm, and mixed norms [7, 10, 29, 16, 24, 26, 8, 25, 28]. These models often involve convex penalties and can be solved using convex optimization algorithms [4, 3, 5]. However, dealing with more complex sparse settings, such as graph-structured models, is more challenging.
In stochastic optimization, iterative hard thresholding (IHT) methods include gradient descent IHT (GD-IHT) [16], stochastic gradient descent IHT (SGD-IHT) [21], hybrid stochastic gradient IHT (HSG-IHT) [31], and variance-reduced methods such as stochastic variance reduced gradient IHT (SVRG-IHT) [19] and stochastically controlled stochastic gradient IHT (SCSG-IHT) [20]. These methods update the parameter iterate via gradient descent or one of its variants, and then apply a hard thresholding (HT) operator to enforce sparsity, preserving the largest-magnitude entries of the iterate while setting the rest to zero. In the context of graph-structured sparse optimization, the stochastic gradient descent-based IHT method GraphSto-IHT, first described in [30], achieves HT through the use of Head and Tail Projections. Head and Tail Projections map arbitrary vectors onto the graph model while simultaneously enforcing sparsity [12, 13, 14]: the Head Projection identifies and preserves the large-magnitude entries supported on the graph model, while the Tail Projection discards the small-magnitude entries. By ignoring the small-magnitude entries of the vector, these projections help prevent overfitting and ensure sparse solutions. Meanwhile, stochastic sampling of the data is used to speed up gradient calculations: a single data point or a small batch is selected, and the gradient is computed only with respect to that batch. This greatly decreases the computational cost of gradient calculations. However, if the selected batch does not represent the whole dataset, the gradient may not accurately point towards a local minimum, introducing variance into the gradient descent process.
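For intuition, a plain (non-graph) hard-thresholding step can be sketched as follows; this minimal example illustrates the top-$s$ operator used by standard IHT methods rather than the graph projections discussed above, and the function names are ours.

```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x; zero out the rest."""
    if s <= 0:
        return np.zeros_like(x)
    if s >= x.size:
        return x.copy()
    out = np.zeros_like(x)
    top = np.argpartition(np.abs(x), -s)[-s:]  # indices of the s largest |x_i|
    out[top] = x[top]
    return out

def sgd_iht_step(x, grad_fi, eta, s):
    """One SGD-IHT update: stochastic gradient step followed by hard thresholding."""
    return hard_threshold(x - eta * grad_fi(x), s)
```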
To reduce the randomness inherent in SGD, variance-reduction techniques such as SVRG [17], SAGA [9], or SCSG [18] may be used. During each iteration, the history of the stochastic process is used to regulate the newly computed gradient and limit large changes in direction. This improvement over SGD is our main interest; both proposed algorithms use it to combine fast gradient calculations with quick convergence. In this paper, we leverage the recent success of stochastic variance-reduced algorithms for non-convex problems and propose efficient stochastic optimization algorithms for graph-structured sparsity-constrained problems. Specifically, we introduce two new stochastic variance-reduced gradient-based methods, GraphSVRG-IHT and GraphSCSG-IHT, designed to solve graph sparsity optimization problems. By incorporating stochastic variance-reduction techniques and approximate graph projections (head and tail), our algorithms are specifically tailored to non-convex graph-structured sparsity constraints, leading to faster convergence and improved performance. We provide a comprehensive theoretical framework and conduct extensive experiments to validate the efficiency and effectiveness of our proposed methods.
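For reference, the SVRG estimator of [17], which both of our methods build on, corrects the stochastic gradient using a snapshot point $\tilde{x}$ and the full gradient computed there; with $i_k$ sampled uniformly at random, it takes the standard form
$$v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla F(\tilde{x}), \qquad \mathbb{E}_{i_k}[v_k] = \nabla F(x_k),$$
so the estimator remains unbiased while its variance shrinks as $x_k$ and $\tilde{x}$ approach the same solution.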
Our main contributions are summarized as follows.
• This work is the first to explore the application of stochastic variance-reduced methods to graph-structured sparsity problems. By using batch gradients to approximate computationally expensive full gradients, we enhance the efficiency of variance reduction. GraphSCSG-IHT builds on GraphSVRG-IHT by parameterizing the variance-reduction technique for greater control: it employs two different batch sizes for gradient calculations and draws the number of inner updates per outer iteration from a geometric distribution.
• We theoretically prove that GraphSCSG-IHT enjoys a linear convergence rate with a constant learning rate. This analysis provides a robust theoretical framework for analyzing graph sparsification optimization.
• We conduct extensive experiments to validate our proposed algorithms. In addition to simulation tests, we also evaluate our methods on a real-world breast cancer dataset. Our experiments empirically demonstrate the efficiency and effectiveness of our methods compared to GraphSto-IHT and other deterministic methods.
II Preliminaries
Notations. We use lowercase letters, e.g., $x$, to denote vectors and $\|x\|$ to denote the $\ell_2$-norm of a vector. The operator $\mathbb{E}[\cdot]$ denotes expectation over all random variables, and $[n]$ denotes the integer set $\{1, 2, \ldots, n\}$. The notation $\operatorname{supp}(x)$ denotes the support of $x$, i.e., the index set of its non-zero elements. Other important parameters are listed in Appendix A.
Definition 1.
(Subspace model) [14] Given the space $\mathbb{R}^d$, a subspace model $\mathbb{U}$ is defined as a family of linear subspaces of $\mathbb{R}^d$: $\mathbb{U} = \{U_1, U_2, \ldots\}$,
where each $U_i$ is a subspace of $\mathbb{R}^d$. The set of corresponding vectors in these subspaces is denoted as $\mathcal{M}(\mathbb{U}) = \{x \in \mathbb{R}^d : x \in U \text{ for some } U \in \mathbb{U}\}$.
Definition 2.
(Weighted graph model) [13] Given an underlying graph $G = (V, E)$ defined on the coefficients of the unknown vector $x$, where $V = [d]$ and $E \subseteq V \times V$, the weighted graph model $(G, s, g, B)$-WGM can be defined as the following set of supports,
where $B$ is the budget on the weight of edges, $g$ is the number of connected components, and $s$ is the sparsity.
Definition 3.
(Projection Operator) [14] We define a projection operator onto a given subspace model as follows.
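In the standard formulation of [14] (whose notation we adopt here; the paper's own symbols are listed in Appendix A), this is the Euclidean projection onto the set of vectors lying in the model,
$$P(x, \mathbb{U}) = \arg\min_{u \in \mathcal{M}(\mathbb{U})} \|x - u\|_2,$$
i.e., the closest point to $x$ among all vectors contained in some subspace of $\mathbb{U}$.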
With this projection definition, we borrow two important projections: Head Projection and Tail Projection from previous literature to help us with theoretical analysis.
Assumption 1.
(Head Projection) [14] Let the predefined subspace models be given. For any vector, there exists an approximate Head Projection which finds a subspace in the head model such that the projection of the vector onto that subspace retains at least a constant fraction $c_H \leq 1$ of the largest projection norm achievable within the original model. We denote the result as the head projection of the vector.
Assumption 2.
(Tail Projection) [14] Let the predefined subspace models be given. For any vector, there exists an approximate Tail Projection which finds a subspace in the tail model such that the residual after projecting the vector onto that subspace is within a constant factor $c_T \geq 1$ of the smallest residual achievable within the original model. We denote the result as the tail projection of the vector.
We can see that head projection keeps large magnitudes whereas tail projection sets small magnitudes to zero.
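For reference, in the notation of [13, 14] (writing $P(x, U)$ for the orthogonal projection of $x$ onto a subspace $U$, and $P_H(x)$, $P_T(x)$ for the outputs of the two oracles), these guarantees are usually stated as
$$\|P_H(x)\|_2 \ge c_H \cdot \max_{U \in \mathbb{U}} \|P(x, U)\|_2, \qquad \|x - P_T(x)\|_2 \le c_T \cdot \min_{U \in \mathbb{U}} \|x - P(x, U)\|_2,$$
with $c_H \le 1$ and $c_T \ge 1$; exact projections correspond to $c_H = c_T = 1$. The paper's own symbols for these operators are listed in Appendix A.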
Definition 4.
(RSC/RSS Properties) [14] We say a differentiable function satisfies the Restricted Strong Convexity (RSC) / Restricted Strong Smoothness (RSS) property if there exist positive constants, the restricted strong convexity parameter and the restricted strong smoothness parameter, that bound its Bregman divergence from below and from above by the corresponding quadratic terms, for all pairs of points whose difference lies in the given subspace model.
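In the notation commonly used for these properties (we write the parameters as $\rho^-$ and $\rho^+$; the paper's own symbols are listed in Appendix A), the definition reads
$$\frac{\rho^-}{2}\|x - y\|_2^2 \;\le\; D_f(x, y) \;\le\; \frac{\rho^+}{2}\|x - y\|_2^2, \qquad D_f(x, y) := f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle,$$
for all $x, y$ whose difference lies in the subspace model, where $D_f$ is the Bregman divergence of $f$; $\rho^-$ and $\rho^+$ are the restricted strong convexity and smoothness parameters, respectively.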
The RSC/RSS definition is widely used in sparsity optimization. Together with the following assumption, we characterize the properties of the objective function.
Assumption 3.
[14] Given the objective function in (1), we assume that the objective $F$ satisfies the RSC property in the given subspace model, and that each component function $f_i$ satisfies the RSS property in an enlarged model, where the sum of two subspace models is defined as the set of all sums of a subspace from one model and a subspace from the other.
III Methods
In this section, we introduce our two proposed algorithms: GraphSVRG-IHT and GraphSCSG-IHT. They employ the variance-reduction techniques derived from SVRG [17] and SCSG [18], respectively, while also utilizing the graph projection operators of GraphSto-IHT [30]. This results in methods that are applicable to graph-structured sparsity problems and effectively reduce the variance inherent in stochastic gradient descent. Therefore, our algorithms converge faster and more accurately than their predecessors.
III-A GraphSVRG-IHT
Our proposed GraphSVRG-IHT algorithm (Algorithm 1) utilizes variance reduction by periodically computing the full gradient, significantly reducing the inherent variance of stochastic gradient methods. By incorporating graph projection operators, GraphSVRG-IHT adapts to non-convex graph sparsity constraints, enhancing its applicability and efficiency. The key steps are outlined below, followed by a minimal sketch of the loop structure:
1. At the start of each outer loop, calculate the full gradient at the current snapshot position (Line 4).
2. In the inner loop, compute two gradients from a single sampled data point: one at the snapshot position and one at the current iterate, and combine them with the full gradient to form the stochastic variance-reduced gradient (Lines 7-8).
3. Pass the variance-reduced gradient through the Head Projection operator (Line 9), and use the resulting gradient to update the next iterate through the Tail Projection operator (Line 10).
4. After a fixed number of inner-loop iterations, update the outer-loop position and recalculate the full gradient.
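To make the loop structure concrete, the following is a minimal sketch of the update described above, assuming user-supplied callables `grad_i` (gradient of the i-th component), `head_proj`, and `tail_proj` (the graph model projection oracles); it is a schematic illustration rather than the exact Algorithm 1.

```python
import numpy as np

def graph_svrg_iht(grad_i, n, x0, eta, m, num_outer, head_proj, tail_proj):
    """Schematic GraphSVRG-IHT loop: SVRG-style gradient correction plus
    head/tail graph projections. grad_i(x, i) returns the gradient of f_i at x;
    head_proj and tail_proj are assumed model-projection oracles."""
    x_tilde = x0.copy()
    for _ in range(num_outer):
        # Full gradient at the snapshot position (outer loop)
        mu = np.mean([grad_i(x_tilde, i) for i in range(n)], axis=0)
        x = x_tilde.copy()
        for _ in range(m):
            i = np.random.randint(n)
            # Variance-reduced gradient estimate from a single data point
            v = grad_i(x, i) - grad_i(x_tilde, i) + mu
            # Head projection of the gradient, then tail projection of the iterate
            x = tail_proj(x - eta * head_proj(v))
        x_tilde = x  # update the snapshot for the next outer loop
    return x_tilde
```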
Our proposed GraphSVRG-IHT algorithm differs from GraphSto-IHT in that GraphSVRG-IHT uses nested loops, allowing it to account for the history of the stochastic process, whereas GraphSto-IHT considers only a single stochastic gradient at each step. Both methods implement Head and Tail Projections for hard thresholding, making them applicable to graph-structured sparsity optimization problems. The inclusion of the variance-reduced stochastic gradient in GraphSVRG-IHT makes the theoretical analysis more complex and challenging.
III-B GraphSCSG-IHT
To provide finer control over the calculation of variance-reduced gradients and to stochastically control the outer batch size, we propose the GraphSCSG-IHT algorithm (Algorithm 2). While similar to Algorithm 1 in its use of variance-reduced gradients, GraphSCSG-IHT has the following key characteristics:
1. In the outer loop, the gradient is calculated over a batch of data of size B, whereas Algorithm 1 computes a full gradient at this step (Lines 4-5).
2. In the inner loop, the stochastic variance-reduced gradient is computed over a mini-batch rather than a single data point (Lines 10-11).
3. The number of inner-loop iterations is not fixed; it is drawn from a geometric distribution (Line 7) or can be set to a fixed value (Line 8).
In contrast to GraphSto-IHT, the inner loop of GraphSCSG-IHT computes the gradient over a random set of functions rather than a single randomly selected function. GraphSCSG-IHT also employs two loops with different batch sizes for gradient calculation, making it more flexible than GraphSto-IHT and GraphSVRG-IHT. This flexibility allows GraphSCSG-IHT to serve as a general framework for graph-constrained optimization, as illustrated in the sketch below.
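The following is a minimal sketch of this scheme under the same assumptions as before (`grad_i`, `head_proj`, and `tail_proj` supplied by the user); the geometric parameterization of the inner-loop length (mean roughly B/b) follows SCSG [18] and is our assumption rather than the exact rule of Algorithm 2.

```python
import numpy as np

def graph_scsg_iht(grad_i, n, x0, eta, B, b, num_outer, head_proj, tail_proj,
                   fixed_inner=None):
    """Schematic GraphSCSG-IHT: batch snapshot gradient of size B, mini-batch
    variance-reduced gradients of size b, and a geometrically distributed
    number of inner iterations."""
    x_tilde = x0.copy()
    for _ in range(num_outer):
        # Snapshot gradient over a random batch of size B (instead of all n points)
        batch = np.random.choice(n, size=B, replace=False)
        mu = np.mean([grad_i(x_tilde, i) for i in batch], axis=0)
        # Number of inner iterations: geometric with mean ~ B/b, or a fixed value
        m = fixed_inner if fixed_inner is not None else np.random.geometric(b / B)
        x = x_tilde.copy()
        for _ in range(m):
            mini = np.random.choice(batch, size=b, replace=False)  # subset of B
            g_x = np.mean([grad_i(x, i) for i in mini], axis=0)
            g_snap = np.mean([grad_i(x_tilde, i) for i in mini], axis=0)
            v = g_x - g_snap + mu  # mini-batch variance-reduced gradient
            x = tail_proj(x - eta * head_proj(v))
        x_tilde = x
    return x_tilde
```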
In summary, compared with traditional IHT and StoIHT methods, the graph-structured hard thresholding steps differ mainly in their use of Head and Tail Projections. Both proposed algorithms are more complex than GraphSto-IHT, introducing distinct variance-reduction techniques while maintaining the same sparsification enforcement. It is also important to note that GraphSVRG-IHT is a special case of GraphSCSG-IHT; hence, we provide the theoretical analysis for GraphSCSG-IHT, which extends directly to the GraphSVRG-IHT case.
IV Theoretical Analysis
In this section, we present our main theoretical results characterizing the estimation error of the parameter iterates. The proof provides a general framework based on the gradients used in GraphSCSG-IHT; consequently, the main theorem applies to both of our proposed algorithms, GraphSVRG-IHT and GraphSCSG-IHT. We demonstrate the convergence of these algorithms by bounding the final error in terms of the distance between the initial point and the optimum. Additionally, we condition on the history of the stochastic process up to the current iteration.
Before delving into the main theorem, we present a key lemma that is crucial for the proof.
Lemma 1.
With the prepared lemmas and appropriate assumptions in place, we can now present our main theorem, which establishes the convergence properties of our proposed algorithms under specific conditions. The detailed statement is as follows:
Theorem 2.
Theorem 2 demonstrates that our new algorithm achieves linear convergence with stochastic variance-reduced gradients, even under the more stochastic settings of batch and mini-batch sampling. Each contraction factor defined in the theorem is strictly less than 1, ensuring that the error decreases as the number of iterations increases. This result aligns with our experimental findings, where more iterations consistently lead to smaller errors.
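Since the formal statement involves several problem-dependent constants, it helps to keep in mind the generic shape of such a bound: with a contraction factor $\kappa < 1$ determined by the learning rate, the RSC/RSS parameters, and the head/tail approximation constants, the expected estimation error after $t$ outer iterations behaves like
$$\mathbb{E}\big[\|x^{t} - x^{*}\|_2\big] \;\le\; \kappa^{t}\,\|x^{0} - x^{*}\|_2 + C,$$
where $C$ collects the residual terms of the finite geometric series appearing in the proof; this is only the schematic form, not the exact constants of Theorem 2.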
From Theorem 2, we derive the following corollary, which further justifies the convergence of our algorithm and specifies the appropriate range for the learning rate.
Corollary 2.1.
To ensure convergence of our algorithm, the constant learning rate should be chosen within a specific range. For this range to be valid, the following inequality must hold:
Corollary 2.1 underpins Theorem 2: it ensures that the upper bound on the estimation error does not blow up to infinity and provides a constant value for the finite geometric series. It also ensures that the upper bound decays further as more iterations are performed. Moreover, Corollary 2.1 provides a range for the learning rate that is narrower than the one given by GraphSto-IHT, so we can always find a learning rate for which the algorithm converges. All proofs are provided in the appendix of the paper.
V Experiments
V-A Experimental setup
We perform multiple experiments to compare our proposed algorithms with baseline methods. For our experiments, we track the residual norm of the loss function as the number of epochs increases. Due to the non-convex nature of the problem, there are several local minima and the algorithm may not approach the global minimum. Additionally, in real-world applications the true target vector is often unknown. Therefore, we use the residual norm as a measure of convergence rather than the distance from the final iterate to the target vector. All experiments are run on an Ubuntu 22.04.4 server with 256 AMD EPYC 9554 64-core processors and 1.6 TB of RAM. All code is written in Python and is available at https://github.com/Derek-Fox/graph-scsg-iht.
V-B Synthetic Dataset
We first tested our methods on synthetic datasets to determine the optimal parameters. For a fair comparison, we followed the exact settings used in GraphSto-IHT, conducting multiple experiments using a grid graph with a dimension of 256 and unit-weight edges.
Choice of learning rate. To study the effect of the learning rate on the performance of our algorithms, we varied it across {0.1, 0.01, 0.001} and tested these rates under various sparsity settings, with sparsity values chosen from {256, 128, 64, 32}. By varying the learning rate, we aimed to understand the convergence behaviour of our algorithms. As shown in Figure 1, our experiments reveal that, as expected, all methods converge quickly with a larger learning rate, while with a smaller learning rate more steps are needed to reach zero residual loss. Additionally, both our proposed methods and the baseline method converge stably, and regardless of the learning rate setting, GraphSVRG-IHT consistently performs best.
Choice of sparsity. Studying the sparsity setting is crucial for understanding the performance of our algorithms in graph sparsification optimization. To examine the effect of sparsity, we compared our methods, GraphSVRG-IHT and GraphSCSG-IHT, against the baseline algorithm GraphSto-IHT. Using the experimental settings from GraphSto-IHT, we employed a grid graph of dimension 256, fixed the learning rate, and set the batch size equal to the sparsity parameter. We varied the sparsity parameter to observe the behavior of all algorithms, as shown in Figure 2, which demonstrates that as the sparsity parameter decreases, GraphSVRG-IHT outperforms the other algorithms in minimizing the residual norm over epochs. Another interesting finding is that GraphSto-IHT and GraphSCSG-IHT display almost identical behavior in this parameter setting.
Table I: Genes identified on the breast cancer dataset
Method | Number of identified genes | Genes
---|---|---|
IHT | 2 | NAT1 TOP2A |
StoIHT | 2 | NAT1 TOP2A |
GraphSto-IHT | 6 | AR ATM BRCA2 CCND2 CDKN1A TOP2A |
GraphSVRG-IHT | 6 | AR ATM BRCA2 CCND2 CDKN1A TOP2A |
GraphSCSG-IHT | 10 | AR ATM BRCA1 BRCA2 CCND2 CCND3 CDKN1A CHEK2 FBP1 TOP2A |
Choice of batch size. After exploring the choices of learning rate and sparsity, we varied the batch size on a grid graph of dimension 256 to demonstrate the advantages of GraphSCSG-IHT; here we fixed the learning rate and the sparsity. For a fair comparison, we counted the number of data points processed instead of the number of epochs to estimate the running time of the algorithms, which is a common practice in optimization.
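As a rough accounting under the loop structure sketched in Section III (with $m$ inner iterations per outer loop), one outer iteration of GraphSVRG-IHT touches about $n + 2m$ component gradients, versus about $B + 2bm$ for GraphSCSG-IHT,
$$\text{cost}_{\mathrm{SVRG}} \approx n + 2m, \qquad \text{cost}_{\mathrm{SCSG}} \approx B + 2bm,$$
which is why convergence is reported against the number of data points processed rather than against epochs.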
Since gradient calculations are computationally expensive, using fewer data points results in faster run times. Figure 3 shows three scenarios with varying batch sizes. When the batch size equals the dimension of the data, GraphSCSG-IHT degrades to GraphSVRG-IHT, resulting in similar performance. With a smaller batch size (meaning fewer data points are used in the snapshot gradient calculation), GraphSCSG-IHT consistently outperforms GraphSVRG-IHT. However, as the batch size continues to decrease, GraphSCSG-IHT initially outperforms GraphSVRG-IHT but is eventually slightly surpassed by it. This occurs because smaller batch sizes increase gradient variance, making it harder for GraphSCSG-IHT to maintain its initial advantage. Furthermore, the interaction between batch size and mini-batch size becomes more complex, ultimately affecting the final performance.
Therefore, we also conducted numerous experiments to study the effects of mini-batch size in conjunction with batch size. Additionally, to better understand graph-structured patterns, we explored how varying the number of connected components in the subgraphs affects final convergence performance. Additional results are provided in the Appendix.
V-C Real-world Dataset
To test our algorithms on a real-world dataset, we use a large breast cancer dataset [27]. This dataset contains 295 training samples of dimension 8,141 (genes), including 78 positive (metastatic) and 217 negative (non-metastatic) samples. Following the experimental setting in [30], we use a protein-protein interaction network with 637 pathways from [15]; the dataset is split into 5 folds, and 20 trials are conducted.
Table I shows the results from the gene identification task on the breast cancer dataset. GraphSCSG-IHT identifies 40% of the 25 genes highly correlated with breast cancer, as gathered by [30]. GraphSto-IHT and GraphSVRG-IHT both identify 24% of these genes, consistent with the findings in [30]. All the graph-structured methods greatly outperform the IHT and StoIHT methods which only identify 8% of the cancer-related genes. These results demonstrate the promise of variance-reduction techniques for stochastic gradient descent in the setting of graph sparsity optimization.
In summary, GraphSVRG-IHT is more stable, while GraphSCSG-IHT shows excellent performance under appropriate settings due to the introduction of two additional batch-size parameters. This allows greater flexibility and control in the gradient estimation process. Overall, our two new algorithms demonstrate both efficiency and effectiveness by incorporating variance-reduction techniques into graph sparsity problems, making them suitable for a wide range of applications.
VI Conclusion
We have proposed two algorithms that bring variance-reduction techniques to the setting of graph-sparse optimization. The proposed algorithms can significantly improve stability and efficiency in machine learning and other applications. Theoretically, we provide a general proof framework showing linear convergence. Empirically, our algorithms are competitive in minimizing the objective loss function compared to their predecessors across various experimental settings on a synthetic dataset. Additionally, testing on a large-scale medical dataset demonstrated superior performance in identifying cancer-related genes. Future work includes testing our algorithms on larger real-world datasets. By employing graph-structured hard thresholding, we can uncover more underlying subgraphs and related patterns, with significant implications for fields such as medical research and social network analysis. This approach can enhance our understanding of complex data structures and lead to more effective solutions.
VII Acknowledgement
This work was completed at the University of North Carolina at Greensboro, funded by NSF REU program 2349369.
References
- [1] Cem Aksoylar, Lorenzo Orecchia, and Venkatesh Saligrama. Connected subgraph detection with mirror descent on sdps. In International Conference on Machine Learning, pages 51–59. PMLR, 2017.
- [2] Ery Arias-Castro, Emmanuel J Candes, and Arnaud Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, pages 278–304, 2011.
- [3] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.
- [4] Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
- [5] Sohail Bahmani, Bhiksha Raj, and Petros T Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14(Mar):807–841, 2013.
- [6] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on information theory, 56(4):1982–2001, 2010.
- [7] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
- [8] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
- [9] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
- [10] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.
- [11] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A fast approximation algorithm for tree-sparse recovery. In 2014 IEEE International Symposium on Information Theory, pages 1842–1846. IEEE, 2014.
- [12] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. IEEE Transactions on Information Theory, 61(9):5129–5147, 2015.
- [13] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. A nearly-linear time framework for graph-structured sparsity. In International Conference on Machine Learning, pages 928–937. PMLR, 2015.
- [14] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Fast recovery from a union of subspaces. Advances in Neural Information Processing Systems, 29, 2016.
- [15] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, pages 433–440, 2009.
- [16] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
- [17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
- [18] Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.
- [19] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning. In International Conference on Machine Learning, pages 917–925, 2016.
- [20] Guannan Liang, Qianqian Tong, Chunjiang Zhu, and Jinbo Bi. An effective hard thresholding method based on stochastic variance reduction for nonconvex sparse learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34(02), pages 1585–1592, 2020.
- [21] Nam Nguyen, Deanna Needell, and Tina Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869–6895, 2017.
- [22] Jing Qian, Venkatesh Saligrama, and Yuting Chen. Connected sub-graph detection. In Artificial Intelligence and Statistics, pages 796–804. PMLR, 2014.
- [23] Polina Rozenshtein, Aris Anagnostopoulos, Aristides Gionis, and Nikolaj Tatti. Event detection in activity networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1176–1185, 2014.
- [24] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- [25] Berwin A Turlach, William N Venables, and Stephen J Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
- [26] Sara A Van de Geer et al. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008.
- [27] Marc J Van De Vijver, Yudong D He, Laura J Van’t Veer, Hongyue Dai, Augustinus AM Hart, Dorien W Voskuil, George J Schreiber, Johannes L Peterse, Chris Roberts, Matthew J Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.
- [28] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2006.
- [29] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning, pages 127–135, 2014.
- [30] Baojian Zhou, Feng Chen, and Yiming Ying. Stochastic iterative hard thresholding for graph-structured sparsity optimization. In International Conference on Machine Learning, pages 7563–7573. PMLR, 2019.
- [31] Pan Zhou, Xiaotong Yuan, and Jiashi Feng. Efficient stochastic gradient hard thresholding. In Advances in Neural Information Processing Systems, pages 1988–1997, 2018.
Appendix A Parameters Table
Table: Constraints for Parameters
Notation | Name | Constraint
---|---|---
 | Step size | Fixed
F() | Function | Minimized
 | Set of Supports |
 | Set of Supports |
 | Set of Supports |
B | Big batch |
 | Big batch sample |
b | Small batch | Subset of B
 | Small batch sample |
 | Number of outer loops |
j | Outer loop index |
 | Number of inner loops |
k | Inner loop index |
 | Starting Position |
s | Number of non-zero entries | Sparsity Constraint
 | Head Projection |
 | Tail Projection |
 | SVRG Gradient |
 | SCSG Gradient |
 | Current Position |
 | Reduced Variance Gradient |
 | Reduced Variance Gradient |
 | Sparsified Gradient |
 | Number of Connected Components |
Appendix B Proof
Lemma 4 [21]. Let be a discrete random variable defined on and its probability mass function is defined as , which means the probability of selecting the -th block at time . For any fixed sparse vectors , and , we have
where is such that and .
Proof.
We first try to obtain an upper bound for as follows:
where the second equality uses the fact that , the first inequality follows from Lemma 2, the third equality is obtained by using the fact that , and the last inequality is due to the restricted strong convexity of on . We complete the proof by taking the square root of both sides and using the fact that, for any non-negative random variable $X$, we have $\mathbb{E}[X] \le \sqrt{\mathbb{E}[X^2]}$. ∎
Proof of Theorem 2.
Proof.
(1)
(2)
(3)
(4)
(5)
By the definition of , Eq. (1) holds. Eq. (2) holds by the Triangle Inequality. Eq. (3) holds by Assumption 2. Eq. (4) holds by defining . Eq. (5) holds by the definition of . Now, we will focus on the expectation only. Also, define .
(6)
(7)
(8)
(9)
(10)
(11)
Eq. (6) and Eq. (7) hold by the Triangle Inequality. Eq. (8) holds by adding and subtracting . Eq. (9) also holds by the Triangle Inequality. Eq. (10) holds by B. Eq. (11) holds by the fact that . Eq. (12) holds by Lemma 1. We therefore get the following inequality, which we will use to derive our final result.
(12) |
where the two coefficients are defined accordingly. Since we have two different vectors measured from the target vector, we will use recursion twice, in Eq. (13), to find the formula.
First, we apply the recursion with respect to the outer loop index j by expanding the above inequality; once we have the expansion, we plug it back in to understand the behavior and repeat this process j times.
(13) |
Plugging Eq. (14) in Eq. (13), we get:
(14) |
After doing it j times, we get:
(15) |
By the geometric series formula (since the ratio is less than 1), and by defining the corresponding constant, we get:
(16) |
Having applied the recursion with respect to the outer loop only, we now apply the recursion with respect to both the inner and outer loops to find the final formula. This means that we will repeat this process S times.
(17) |
We then plug Eq. (18) into Eq. (17), so we get Eq. (19):
(18) |
After doing this step S times, we get the following inequality:
(19) |
We finally obtain the following formula, which guarantees the convergence of the algorithm since both contraction coefficients are less than 1,
which we can rewrite as
∎
Corollary 2.0. Let , since . Therefore, by definition of , we have:
Proof.
By contradiction, assume . Therefore,
which is a contradiction, since by definition. Therefore, . Note that by its definition. ∎
Corollary 2.1. To ensure convergence of our algorithm, the constant learning rate should be chosen within a specific range. For this range to be valid, the following inequality must hold:
Proof.
Assume by contradiction that . Then, by definition of , we get
(1)
(2)
where Eq. (2) holds true by the definition of . Now, we have the following:
Since , then . We then have a contradiction. Therefore,
∎
Appendix C More Results
C-A Simulation
As pointed out in the main paper, we also explored the effects of mini-batch size in conjunction with batch size. Figure 4 demonstrates the generality of GraphSCSG-IHT: with different settings of the batch and mini-batch sizes, GraphSCSG-IHT can range in behavior between GraphSto-IHT and GraphSVRG-IHT. Not included in the figure is the case where the batch size equals the dimension of the data and the mini-batch is a single data point, in which case GraphSCSG-IHT and GraphSVRG-IHT are distinguished only by the number of inner loops (which could itself be parameterized to make the two algorithms identical).
We vary the number of connected components, controlled by the corresponding model parameter, in Figure 5. We observe that when the number of connected components is low (in this case, a single connected component), the loss curve does not decrease smoothly but frequently levels off before stepping downward. This is because the algorithm intermittently finds a more suitable support set, which allows it to decrease the objective loss more rapidly. Restricting the algorithm to only one connected component makes this phenomenon much more apparent.
C-B Breast cancer dataset
All parameters are tuned using 5-fold cross-validation, following the experimental settings of [30]. The sparsity and the block size are tuned over grids of candidate values, with the candidate block sizes defined in terms of the number of samples. The task is to find a single connected component, so the number of connected components is set to 1 for the head and tail projections. The learning rate is tuned using backtracking line search. We record the Balanced Classification Error and Area Under Curve (AUC) scores, as well as the number of nonzeros, for each method over the 20 trials.
Table: Balanced Classification Error
Trial | IHT | StoIHT | GraphSto-IHT | GraphSVRG-IHT | GraphSCSG-IHT
---|---|---|---|---|---
Trial 1 | 0.352±0.079 | 0.355±0.083 | 0.367±0.091 | 0.335±0.048 | 0.474±0.048 |
Trial 2 | 0.350±0.047 | 0.346±0.045 | 0.395±0.074 | 0.340±0.077 | 0.496±0.015 |
Trial 3 | 0.354±0.060 | 0.370±0.061 | 0.365±0.107 | 0.318±0.061 | 0.478±0.061 |
Trial 4 | 0.351±0.046 | 0.350±0.066 | 0.365±0.032 | 0.365±0.032 | 0.459±0.050 |
Trial 5 | 0.357±0.064 | 0.365±0.063 | 0.366±0.065 | 0.368±0.082 | 0.418±0.076 |
Trial 6 | 0.334±0.050 | 0.327±0.046 | 0.330±0.078 | 0.352±0.090 | 0.447±0.071 |
Trial 7 | 0.343±0.050 | 0.402±0.085 | 0.321±0.027 | 0.321±0.027 | 0.499±0.004 |
Trial 8 | 0.352±0.059 | 0.402±0.103 | 0.338±0.021 | 0.349±0.030 | 0.489±0.013 |
Trial 9 | 0.340±0.082 | 0.357±0.058 | 0.335±0.110 | 0.343±0.040 | 0.452±0.053 |
Trial 10 | 0.344±0.068 | 0.351±0.065 | 0.349±0.050 | 0.331±0.061 | 0.502±0.005 |
Trial 11 | 0.339±0.056 | 0.318±0.061 | 0.353±0.076 | 0.357±0.054 | 0.473±0.054 |
Trial 12 | 0.355±0.061 | 0.394±0.053 | 0.349±0.088 | 0.329±0.082 | 0.463±0.072 |
Trial 13 | 0.327±0.053 | 0.317±0.040 | 0.312±0.065 | 0.323±0.060 | 0.457±0.051 |
Trial 14 | 0.362±0.058 | 0.340±0.071 | 0.348±0.019 | 0.330±0.054 | 0.453±0.045 |
Trial 15 | 0.381±0.048 | 0.346±0.051 | 0.297±0.059 | 0.297±0.059 | 0.416±0.069 |
Trial 16 | 0.330±0.094 | 0.320±0.089 | 0.330±0.074 | 0.339±0.064 | 0.488±0.055 |
Trial 17 | 0.349±0.074 | 0.332±0.067 | 0.353±0.078 | 0.330±0.056 | 0.437±0.055 |
Trial 18 | 0.331±0.096 | 0.331±0.096 | 0.292±0.067 | 0.290±0.069 | 0.462±0.053 |
Trial 19 | 0.327±0.052 | 0.333±0.054 | 0.311±0.021 | 0.306±0.023 | 0.476±0.017 |
Trial 20 | 0.333±0.038 | 0.338±0.026 | 0.352±0.064 | 0.376±0.036 | 0.427±0.052 |
Averaged | 0.345±0.065 | 0.350±0.072 | 0.341±0.073 | 0.335±0.062 | 0.463±0.057 |
Table: Area Under Curve (AUC)
Trial | IHT | StoIHT | GraphSto-IHT | GraphSVRG-IHT | GraphSCSG-IHT
---|---|---|---|---|---
Trial 1 | 0.726±0.067 | 0.736±0.054 | 0.720±0.030 | 0.729±0.039 | 0.713±0.023 |
Trial 2 | 0.683±0.067 | 0.692±0.064 | 0.697±0.044 | 0.712±0.062 | 0.696±0.101 |
Trial 3 | 0.716±0.062 | 0.726±0.077 | 0.726±0.064 | 0.715±0.065 | 0.703±0.027 |
Trial 4 | 0.695±0.059 | 0.701±0.061 | 0.687±0.041 | 0.687±0.041 | 0.693±0.033 |
Trial 5 | 0.699±0.042 | 0.700±0.067 | 0.707±0.040 | 0.701±0.028 | 0.701±0.045 |
Trial 6 | 0.677±0.071 | 0.673±0.074 | 0.704±0.051 | 0.669±0.071 | 0.640±0.101 |
Trial 7 | 0.712±0.069 | 0.719±0.066 | 0.741±0.055 | 0.741±0.055 | 0.701±0.069 |
Trial 8 | 0.711±0.081 | 0.707±0.055 | 0.721±0.062 | 0.717±0.055 | 0.657±0.075 |
Trial 9 | 0.717±0.066 | 0.718±0.068 | 0.723±0.044 | 0.718±0.026 | 0.719±0.058 |
Trial 10 | 0.710±0.060 | 0.711±0.060 | 0.691±0.052 | 0.702±0.062 | 0.655±0.103 |
Trial 11 | 0.708±0.082 | 0.719±0.086 | 0.722±0.085 | 0.711±0.074 | 0.693±0.142 |
Trial 12 | 0.713±0.025 | 0.711±0.045 | 0.736±0.055 | 0.730±0.060 | 0.664±0.057 |
Trial 13 | 0.714±0.058 | 0.719±0.058 | 0.718±0.073 | 0.703±0.071 | 0.704±0.074 |
Trial 14 | 0.700±0.058 | 0.705±0.067 | 0.711±0.064 | 0.712±0.046 | 0.665±0.061 |
Trial 15 | 0.696±0.033 | 0.687±0.040 | 0.721±0.058 | 0.721±0.058 | 0.715±0.040 |
Trial 16 | 0.729±0.081 | 0.725±0.083 | 0.723±0.072 | 0.725±0.073 | 0.734±0.049 |
Trial 17 | 0.720±0.055 | 0.722±0.062 | 0.718±0.037 | 0.717±0.037 | 0.681±0.061 |
Trial 18 | 0.712±0.039 | 0.712±0.039 | 0.733±0.036 | 0.730±0.034 | 0.679±0.056 |
Trial 19 | 0.721±0.067 | 0.724±0.068 | 0.736±0.056 | 0.738±0.054 | 0.701±0.057 |
Trial 20 | 0.720±0.031 | 0.706±0.027 | 0.717±0.045 | 0.705±0.033 | 0.703±0.056 |
Averaged | 0.709±0.062 | 0.711±0.064 | 0.717±0.057 | 0.714±0.057 | 0.691±0.074 |
Table: Number of nonzeros
Trial | IHT | StoIHT | GraphSto-IHT | GraphSVRG-IHT | GraphSCSG-IHT
---|---|---|---|---|---
Trial 1 | 082.0±36.00 | 066.0±28.00 | 033.4±10.38 | 033.8±17.31 | 064.0±17.72 |
Trial 2 | 044.0±08.00 | 038.0±13.27 | 061.2±41.81 | 079.0±32.00 | 087.2±04.12 |
Trial 3 | 094.0±12.00 | 068.0±31.87 | 030.4±12.97 | 040.2±17.03 | 086.2±13.33 |
Trial 4 | 074.0±12.00 | 062.0±31.87 | 051.6±08.33 | 051.6±08.33 | 075.0±18.17 |
Trial 5 | 048.0±04.00 | 036.0±13.56 | 054.2±14.37 | 057.6±15.99 | 078.2±12.69 |
Trial 6 | 090.0±00.00 | 080.0±20.00 | 041.0±12.44 | 047.4±16.21 | 094.6±09.46 |
Trial 7 | 010.0±00.00 | 048.0±39.19 | 079.8±09.60 | 079.8±09.60 | 091.4±06.77 |
Trial 8 | 040.0±00.00 | 048.0±29.26 | 083.0±29.26 | 072.0±36.55 | 063.0±19.13 |
Trial 9 | 092.0±04.00 | 076.0±33.23 | 035.2±13.50 | 019.8±07.49 | 078.0±18.06 |
Trial 10 | 050.0±00.00 | 052.0±04.00 | 053.0±13.64 | 052.0±14.00 | 088.2±11.53 |
Trial 11 | 074.0±08.00 | 054.0±20.59 | 059.0±23.39 | 073.0±22.49 | 078.6±20.34 |
Trial 12 | 056.0±08.00 | 022.0±09.80 | 095.2±10.80 | 080.8±31.52 | 083.4±15.42 |
Trial 13 | 046.0±12.00 | 052.0±21.35 | 067.0±22.27 | 067.0±23.58 | 087.4±09.73 |
Trial 14 | 078.0±24.00 | 058.0±23.15 | 028.8±06.76 | 030.4±06.65 | 074.2±16.27 |
Trial 15 | 082.0±04.00 | 052.2±27.63 | 045.0±00.00 | 045.0±00.00 | 088.0±14.35 |
Trial 16 | 052.0±16.00 | 056.0±19.60 | 047.0±16.91 | 041.2±24.55 | 073.6±23.69 |
Trial 17 | 092.0±04.00 | 064.0±23.32 | 024.2±07.52 | 027.0±06.51 | 084.4±11.55 |
Trial 18 | 094.0±12.00 | 094.0±12.00 | 044.2±17.90 | 034.2±02.14 | 082.0±14.35 |
Trial 19 | 100.0±00.00 | 030.0±20.98 | 043.6±08.28 | 038.6±03.83 | 087.2±05.42 |
Trial 20 | 020.0±00.00 | 050.0±26.83 | 042.2±20.43 | 049.8±25.20 | 088.2±12.92 |
Averaged | 065.9±28.36 | 055.3±29.26 | 051.0±25.39 | 051.0±26.44 | 081.6±16.80 |