
Nearly Optimal Linear Convergence of Stochastic Primal-Dual Methods for Linear Programming

Haihao Lu The University of Chicago, Booth School of Business ([email protected]).    Jinwen Yang The University of Chicago, Department of Statistics ([email protected]).
(December 2023)
Abstract

There has been recent interest in first-order methods for linear programming (LP). In this paper, we propose a stochastic algorithm using variance reduction and restarts for solving sharp primal-dual problems such as LP. We show that the proposed stochastic method exhibits a linear convergence rate for solving sharp instances with high probability. In addition, we propose an efficient coordinate-based stochastic oracle, which has \mathcal{O}(1) per-iteration cost and improves the complexity of existing deterministic and stochastic algorithms. Finally, we show that the obtained linear convergence rate is nearly optimal (up to \log terms) for a wide class of stochastic primal-dual methods. Numerical experiments verify the theoretical guarantees of the proposed algorithms.

1 Introduction

Linear programming (LP), as one of the most fundamental tools in operations research and computer science, has been extensively studied in both academia and industry since the 1940s. The applications of LP span various fields, including pricing and revenue management, transportation, network flow, scheduling, and many others [14, 27, 9, 42, 4, 40].

The dominant solvers for LP are essentially based on either the simplex method or the interior-point method, both of which are very mature nowadays and can usually provide reliable solutions to LP. However, the success of both methods relies heavily on efficiently solving linear systems using factorization, which has the following major drawbacks: (i) the factorization may run out of memory even though the original LP fits in memory; (ii) it is highly challenging to take advantage of modern computing resources, such as distributed systems and GPUs, when solving linear systems. These drawbacks make it highly challenging to further scale up LP.

Recently, it was shown that first-order methods (FOMs) with proper enhancements can also identify high-quality solutions to LP problems quickly [5], which provides an alternative to the traditional simplex and barrier methods for LP. FOMs utilize only gradient information about the objective function to update the iterates (in contrast to second-order methods, where the Hessian is used). The basic operation in FOMs is matrix-vector multiplication, which avoids the two major drawbacks of traditional methods mentioned above, making FOMs suitable for solving larger LPs.

In this work, we further push this line of research and study stochastic FOMs for LP. In contrast to deterministic first-order methods [5], the basic operation per iteration in stochastic FOMs is a vector operation. Such operations are very cheap in general and are common practice in modern machine learning applications. As a drawback, the number of iterations stochastic algorithms need to obtain an approximate solution is typically higher than that of their deterministic counterparts.

Due to this fundamental distinction, Nesterov [48] formally defines methods that require matrix factorization (or matrix inversion) as handling medium-scale problems, methods that require matrix-vector multiplication as handling large-scale problems, and methods that require only vector operations as handling huge-scale problems. Under this definition, in the context of LP, the simplex method and the interior-point method are classified as handling medium-scale problems (this classification perhaps belies the practical efficiency of the two methods, but it is a fact that they are challenging to scale up further, as mentioned above), deterministic FOMs [5, 39] are classified as handling large-scale problems, and in this paper we instead look at huge-scale problems.

While our motivation is LP, we here consider a more general class of primal-dual problems of the form

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}\mathcal{L}(x,y):=\Phi(x,y)+g_{1}(x)-g_{2}(y), (1)

where \mathcal{L}(x,y) is convex in x and concave in y, g_{1}(x) is a simple convex function in x, and g_{2}(y) is a simple convex function in y. In particular, the primal-dual formulation of standard-form LP is

\min_{x\geq 0}\max_{y\in\mathbb{R}^{m}}y^{T}Ax+c^{T}x-b^{T}y\ , (2)

and a highly related problem is the unconstrained bilinear problem

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}y^{T}Ax+c^{T}x-b^{T}y\ . (3)

For notational convenience, we define z=(x,y), F(z)=[\nabla_{x}\Phi(x,y),-\nabla_{y}\Phi(x,y)] and g(z)=g_{1}(x)+g_{2}(y). We here assume there is an unbiased stochastic oracle F_{\xi}(z) such that \mathbb{E}[F_{\xi}(z)]=F(z) (see Section 5 for examples of how to construct such stochastic oracles). To efficiently solve (2), our key ideas are variance reduction and restarts:
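As a concrete illustration (a sketch under our own sampling choices, not the paper's oracles from Section 5), for the bilinear problem (3) we have F(z)=(A^{T}y+c,\,b-Ax), and an unbiased oracle can sample a single row of A to estimate A^{T}y=\sum_{i}y_{i}A_{i\cdot} and a single column to estimate Ax=\sum_{j}x_{j}A_{\cdot j}:

```python
import numpy as np

def make_row_col_oracle(A, b, c, rng):
    """Unbiased estimator of F(z) = (A^T y + c, b - A x) for the bilinear
    saddle point min_x max_y y^T A x + c^T x - b^T y (illustrative sketch).

    A row i, sampled with probability ||A_i.||^2 / ||A||_F^2, gives an
    unbiased estimate of A^T y = sum_i y_i A_i.; a column j, sampled with
    probability ||A_.j||^2 / ||A||_F^2, estimates A x = sum_j x_j A_.j.
    """
    fro2 = np.linalg.norm(A) ** 2
    row_p = np.linalg.norm(A, axis=1) ** 2 / fro2
    col_p = np.linalg.norm(A, axis=0) ** 2 / fro2

    def F_xi(x, y):
        i = rng.choice(A.shape[0], p=row_p)            # sampled row index
        j = rng.choice(A.shape[1], p=col_p)            # sampled column index
        grad_x = (y[i] / row_p[i]) * A[i, :] + c       # E[.] = A^T y + c
        neg_grad_y = b - (x[j] / col_p[j]) * A[:, j]   # E[.] = b - A x
        return np.concatenate([grad_x, neg_grad_y])

    return F_xi
```

Averaging many draws of F_xi recovers F(z), which is exactly the unbiasedness required of the oracle.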

Variance reduction is a successful technique for finite-sum stochastic minimization problems [31, 19, 55]. The basic idea is to reduce the variance in the stochastic gradient estimate by comparing it with snapshots of the true gradient. Variance reduction can usually improve the convergence properties of stochastic minimization algorithms [31, 19, 55]. Recently, [3] extends the variance reduction scheme to EGM for solving monotone variational inequalities (with some ambiguity, we call this algorithm sEGM), and shows that it has an \mathcal{O}(1/\epsilon) convergence rate.

Restart is another standard technique for minimization problems, which is used to speed up the convergence of multiple deterministic and stochastic algorithms [49, 24, 59, 51, 60]. Recently, [7] introduces the sharpness of primal-dual problems, and presents a simple restart scheme to accelerate the linear convergence rate of primal-dual algorithms for solving sharp problems.

This paper extends the restarted algorithm in [7] to stochastic algorithms (in particular sEGM [3]) for solving sharp primal-dual problems, such as LP. We further show that the obtained linear convergence rate is nearly optimal (up to \log terms) for a wide class of stochastic algorithms. It turns out that while LP is sharp on any bounded region [7], LP is not globally sharp (see Appendix A.1 and Appendix A.2 for a counterexample). A fundamental difficulty for restarted stochastic algorithms is that, unlike for deterministic algorithms, the iterates may escape from any bounded region, so there is no guarantee that local sharpness helps the convergence of stochastic algorithms. We overcome this issue by presenting a high-probability argument and show the linear convergence of the proposed restarted algorithms with high probability.

The performance of randomized algorithms depends on the realization of the stochasticity. Traditional convergence analyses of these algorithms usually measure the expected performance, namely, running the algorithm multiple times and looking at the average performance. In contrast, we present high-probability results, namely, running the algorithm multiple times and studying the performance that all trajectories attain with a certain probability. In particular, when choosing the probability as 1/2, we obtain a convergence guarantee for the median trajectory, which can be viewed as an alternative to the expected performance studied in the related literature.

Algorithm Per-iteration Cost Number of Iterations to find an \epsilon-solution Total Cost (the total cost is sometimes more than the product of the per-iteration cost and the number of iterations due to the snapshot step in variance reduction)
Deterministic [33] \mathcal{O}(\text{nnz}(A)) \mathcal{O}\left(\frac{\|A\|_{2}}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)\frac{\|A\|_{2}}{\epsilon}\right)
Deterministic Restart [7] \mathcal{O}(\text{nnz}(A)) \mathcal{O}\left(\kappa_{2}\log\frac{1}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)\kappa_{2}\log\frac{1}{\epsilon}\right)
SPDHG [12] \mathcal{O}(m+n) \mathcal{O}\left(\frac{\sum_{i}\|A_{i\cdot}\|_{2}}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)+(m+n)\frac{\sum_{i}\|A_{i\cdot}\|_{2}}{\epsilon}\right)
SPDHG [2] (the complexity of SPDHG [2] involves complicated terms; the table presents a lower bound on the number of iterations and the total cost, i.e., at least this number of iterations (or total cost) is needed to find an \epsilon-accuracy solution based on their analysis; see Appendix A.4 for more details) \mathcal{O}(m+n) \mathcal{O}\left(\kappa_{F}^{2}\log\frac{1}{\epsilon}\right) \mathcal{O}\left((m+n)\kappa_{F}^{2}\log\frac{1}{\epsilon}\right)
Conceptual Proximal [10] \mathcal{O}(m+n) \mathcal{O}\left(\sqrt{\frac{\text{nnz}(A)}{m+n}}\frac{\|A\|_{F}}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)+\frac{\sqrt{\text{nnz}(A)(m+n)}\|A\|_{F}}{\epsilon}\right)
sEGM [3] \mathcal{O}(m+n) \mathcal{O}\left(\sqrt{\frac{\text{nnz}(A)}{m+n}}\frac{\|A\|_{F}}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)+\frac{\sqrt{\text{nnz}(A)(m+n)}\|A\|_{F}}{\epsilon}\right)
RsEGM Oracle II (This Paper) \mathcal{O}(m+n) \widetilde{\mathcal{O}}\left(\sqrt{\frac{\text{nnz}(A)}{m+n}}\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right) \widetilde{\mathcal{O}}\left(\text{nnz}(A)\log\frac{1}{\epsilon}+\sqrt{\text{nnz}(A)(m+n)}\,\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right)
RsEGM Oracle IV (This Paper) \mathcal{O}(1) \widetilde{\mathcal{O}}\left(\sqrt{\text{nnz}(A)}\,\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right) \widetilde{\mathcal{O}}\left(\text{nnz}(A)\log\frac{1}{\epsilon}+\sqrt{\text{nnz}(A)}\,\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right)
Table 1: Comparison on the unconstrained bilinear problem (3), where \kappa_{2}=\frac{\|A\|_{2}}{\alpha} and \kappa_{F}=\frac{\|A\|_{F}}{\alpha}.
Algorithm Per-iteration Cost Number of Iterations Total Cost
Deterministic [33] \mathcal{O}(\text{nnz}(A)) \mathcal{O}\left(\frac{\|A\|_{2}}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)\frac{\|A\|_{2}}{\epsilon}\right)
Deterministic Restart [7] \mathcal{O}(\text{nnz}(A)) \mathcal{O}\left(\kappa_{2}\log\frac{1}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)\kappa_{2}\log\frac{1}{\epsilon}\right)
sEGM [3] \mathcal{O}(m+n) \mathcal{O}\left(\sqrt{\frac{\text{nnz}(A)}{m+n}}\frac{\|A\|_{F}}{\epsilon}\right) \mathcal{O}\left(\text{nnz}(A)+\frac{\sqrt{\text{nnz}(A)(m+n)}\|A\|_{F}}{\epsilon}\right)
RsEGM Oracle II (This Paper) \mathcal{O}(m+n) \widetilde{\mathcal{O}}\left(\sqrt{\frac{\text{nnz}(A)}{m+n}}\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right) \widetilde{\mathcal{O}}\left(\text{nnz}(A)\log\frac{1}{\epsilon}+\sqrt{\text{nnz}(A)(m+n)}\,\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right)
RsEGM Oracle IV (This Paper) \mathcal{O}(1) \widetilde{\mathcal{O}}\left(\sqrt{\text{nnz}(A)}\,\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right) \widetilde{\mathcal{O}}\left(\text{nnz}(A)\log\frac{1}{\epsilon}+\sqrt{\text{nnz}(A)}\,\kappa_{F}\,\text{polylog}(\frac{1}{\epsilon})\right)
Table 2: Comparison on standard LP (2), where \kappa_{2}=\frac{\|A\|_{2}}{1/[H(1+\|z^{0,0}\|+R_{0})]}, \kappa_{F}=\frac{\|A\|_{F}}{1/[H(1+\|z^{0,0}\|+R_{0})]}, and H is the Hoffman constant of the KKT system of the LP (see Example 2.3 for details).

Table 1 and Table 2 present the per-iteration cost, the iteration complexity, and the total flop count of our proposed algorithm, Restarted sEGM (RsEGM), and compare it with multiple deterministic and stochastic algorithms for solving the unconstrained bilinear problem (3) and LP (2), respectively. For unconstrained bilinear problems (3), deterministic algorithms need to compute the gradient of the objective, so the per-iteration cost is O(\text{nnz}(A)). Standard stochastic algorithms are usually based on row/column sampling, and the iteration cost is O(m+n). We also present a stochastic coordinate scheme (Oracle IV) whose per-iteration cost is O(1). While stochastic algorithms have low per-iteration cost, they usually require more iterations to identify an \epsilon-close solution than their deterministic counterparts. As we see in Table 1, compared with the optimal deterministic algorithm [7], the total flop count of RsEGM with stochastic Oracle II is better when the matrix A is dense and of low rank. With the coordinate gradient estimator Oracle IV, the total cost of RsEGM is even lower and improves upon optimal deterministic algorithms by at least a factor of \sqrt{n} when A is dense. For standard LP, as seen in Table 2, stochastic algorithms require more iterations to achieve \epsilon-accuracy than for the unconstrained bilinear problem due to the presence of inequality constraints. Similar to the unconstrained bilinear setting, when the matrix A is low-rank and dense, the total flop count of RsEGM improves upon the optimal deterministic algorithm. On the other hand, most previous works on stochastic algorithms obtain sublinear rates. The only exception is [2], where the authors show the linear convergence of SPDHG for solving problems satisfying global metric sub-regularity. Indeed, unconstrained bilinear problems satisfy global metric sub-regularity, while LP does not satisfy it globally. The complexity of SPDHG involves more complicated notation, and we present a more detailed comparison in Appendix A.2 and Appendix A.4, but our proposed algorithms are better than SPDHG by at least a factor of the condition number \kappa_{F}.

1.1 Summary of Contributions

The contributions of the paper can be summarized as follows:

(i) We propose a restarted stochastic extragradient method with variance reduction for solving sharp primal-dual problems. We show that the proposed algorithm exhibits a linear convergence rate with high probability. In particular,

  • For linear programming and unconstrained bilinear problems, our restarted scheme improves the complexity of the existing linearly convergent stochastic algorithm [2] by a factor of the condition number. The improvement comes from restarts.

  • To the best of our knowledge, this is the first stochastic algorithm with a linear rate for general standard-form LP problems (2). To prove this result, we introduce a high-probability analysis to upper bound the distance between the iterates and the optimal solution set.

(ii) We present a complexity lower bound for a class of stochastic first-order methods for solving sharp primal-dual problems, which matches the upper bound we obtain for RsEGM (up to \log terms). This shows that RsEGM achieves a nearly optimal linear convergence rate for sharp primal-dual problems.

1.2 Assumptions

Throughout the paper, we have two assumptions on the problem and the stochastic oracle. The first one is on the primal-dual problem:

Assumption 1.

The problem (1) satisfies:
(i) \mathcal{L}(x,y) is convex in x and concave in y.
(ii) g:\mathcal{Z}\rightarrow\mathbb{R}\cup\{+\infty\} is proper, convex, and lower semi-continuous.
(iii) The stationary solution set \mathcal{Z}^{*}=\{z\,|\,0\in F(z)+\partial g(z)\} is nonempty.

As a stochastic first-order method, we assume there exists a stochastic gradient oracle:

Assumption 2.

We assume there exists a stochastic oracle F_{\xi}:\mathbb{R}^{m+n}\rightarrow\mathbb{R}^{m+n} such that

(i) it is unbiased: \mathbb{E}[F_{\xi}(z)]=F(z);

(ii) it is L-Lipschitz (in expectation): \mathbb{E}[\|F_{\xi}(u)-F_{\xi}(v)\|^{2}]\leq L^{2}\|u-v\|^{2}.
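To build intuition for Assumption 2(ii), consider the bilinear term with row/column importance sampling proportional to squared row/column norms (our own illustrative construction, not necessarily the paper's oracles). The expectation in (ii) can then be evaluated in closed form, and the condition holds with equality for L=\|A\|_{F}:

```python
import numpy as np

# Closed-form check (our illustration) that row/column importance sampling
# with probabilities proportional to squared norms satisfies Assumption 2(ii)
# with L = ||A||_F, in fact with equality:
#   E ||F_xi(u) - F_xi(v)||^2 = ||A||_F^2 ||u - v||^2.
rng = np.random.default_rng(1)
m, n = 3, 5
A = rng.standard_normal((m, n))
fro2 = np.linalg.norm(A) ** 2
row_p = np.linalg.norm(A, axis=1) ** 2 / fro2   # P(sample row i)
col_p = np.linalg.norm(A, axis=0) ** 2 / fro2   # P(sample column j)

dx = rng.standard_normal(n)   # u_x - v_x (the constants b, c cancel)
dy = rng.standard_normal(m)   # u_y - v_y

# Expectation taken exactly, by summing over every possible sampled index:
expect = sum(row_p[i] * np.linalg.norm((dy[i] / row_p[i]) * A[i, :]) ** 2
             for i in range(m)) \
       + sum(col_p[j] * np.linalg.norm((dx[j] / col_p[j]) * A[:, j]) ** 2
             for j in range(n))
bound = fro2 * (np.linalg.norm(dx) ** 2 + np.linalg.norm(dy) ** 2)
```

Each row term collapses to dy_i^2 \|A\|_F^2 (and each column term to dx_j^2 \|A\|_F^2), which is why the two quantities agree up to floating-point error.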

1.3 Related Literature

Convex-concave primal-dual problems. There is a long history of work on convex-concave primal-dual problems, and many of the early works study a more general problem, monotone variational inequalities. Rockafellar proposed the proximal point method (PPM) [53] for solving monotone variational inequalities. Around the same time, Korpelevich proposed the extragradient method (EGM) [33] for convex-concave primal-dual problems. Since then, there have been numerous results on the convergence analysis of these methods. In particular, Tseng [58] shows that PPM and EGM converge linearly for strongly-convex-strongly-concave primal-dual problems and for unconstrained bilinear problems. Nemirovski proposes the Mirror Prox algorithm in the seminal work [45], a more general form of EGM, and shows that EGM has an \mathcal{O}(\frac{1}{\epsilon}) sublinear convergence rate for solving general convex-concave primal-dual problems over a compact set. [45] also builds up the connection between EGM and PPM: EGM is an approximation of PPM.

Another line of research studies the special case of (1) where \Phi(x,y)=y^{T}Ax is a bilinear term. Two well-known algorithms are Douglas-Rachford splitting [20, 21] (with the Alternating Direction Method of Multipliers (ADMM) as a special case) and the Primal-Dual Hybrid Gradient method (PDHG) [13].

Very recently, there has been renewed interest in primal-dual methods, motivated by machine learning applications. For bilinear problems \mathcal{L}(x,y)=y^{T}Ax with a full-rank matrix A, [17] shows that Optimistic Gradient Descent Ascent (OGDA) converges linearly, and later [43] shows that OGDA, EGM and PPM all enjoy a linear convergence rate. [43] also presents an interesting observation that OGDA approximates PPM on bilinear problems. Lu [41] analyzes the dynamics of unconstrained primal-dual algorithms under an ODE framework and obtains tight conditions under which different algorithms exhibit linear convergence. However, an important caveat is that not all linear convergence rates are equal. [7] shows that a simple restarted variant of these algorithms can improve the dependence of the complexity on the condition number in their linear convergence rates, as well as the empirical performance of the algorithms.

Linear programming. Linear programming is a fundamental tool in operations research. The two dominating methods for solving LP problems are the simplex method [16] and the interior-point method [32], and commercial LP solvers based on these methods can provide reliable solutions even for fairly large instances. While the two methods are quite different, both require solving linear systems using factorization. As a result, it is very challenging to further scale up these two methods, in particular to take advantage of distributed computing. Recently, there has been a trend of developing first-order methods for LP that only utilize matrix-vector multiplication [8, 25, 39, 5, 6]. In general, these methods are easy to parallelize and do not need to store a factorization in memory.

Traditional results on first-order methods for LP usually have sublinear rates, due to the lack of strong convexity, which prevents them from identifying high-accuracy solutions. To deal with this issue, [22] presents a variant of ADMM and shows the linear convergence of the proposed method for LP. More recently, [36, 37] show that many primal-dual algorithms have eventual linear convergence under a mild non-degeneracy condition, but it may take a long time before reaching the linear convergence regime. [7] proposes a restarted scheme for LP in the primal-dual formulation. They introduce a sharpness condition for primal-dual problems based on the normalized duality gap and show that the primal-dual formulation of LP is sharp on any bounded region. They then provide restarted schemes for sharp primal-dual problems and show that the proposed algorithms have the optimal linear convergence rate (within a class of deterministic first-order methods) when solving sharp problems. A concurrent work [56] studies stochastic algorithms for generalized linear programming, but its focus is on sublinear rates, while our focus is on linear rates.

Sharpness conditions and restart schemes. The concept of sharpness was first proposed by Polyak [52] for minimization problems. Recently, there has been a line of work on developing first-order methods with faster convergence rates using sharpness. For example, [59] shows linear convergence of restarted subgradient descent on sharp non-smooth functions, and there are other works on sharp non-convex minimization [18]. Sharpness can also be viewed as a certain error bound condition [54]. Recently, [7] introduces a sharpness condition for primal-dual problems. A highly related concept is metric subregularity for variational inequalities, which is a weaker condition than the sharpness condition proposed in [7] (see Appendix A.1 for a discussion). Under such conditions, [2, 23] present linear convergence for stochastic PDHG and deterministic PDHG, respectively.

Restarting is a powerful technique in optimization. It can improve the practical and theoretical convergence of a base algorithm without modifying the base algorithm [51]. Recently, there have been extensive works on this technique in smooth convex optimization [49, 54], non-smooth convex optimization [24, 59] and stochastic convex optimization [31, 38, 57]. For sharp primal-dual problems, [7] proposes fixed-frequency and adaptive restarts for a large class of base primal-dual algorithms, including PDHG, ADMM and EGM.

Variance reduction and primal-dual problems. The variance reduction technique [19, 31] was developed to improve the convergence rate of stochastic algorithms over pure SGD for minimization problems. There are extensive works on variants of stochastic variance reduction for minimization problems under various settings (see [26] for a recent overview).

Compared with the extensive works on minimization problems, research on variance-reduced methods for primal-dual problems is fairly limited. [50] studies a stochastic forward-backward algorithm with variance reduction for primal-dual problems and, more generally, monotone inclusions. Under strong monotonicity, they prove a linear convergence rate and improve upon the complexity of deterministic methods for bilinear problems. [10] proposes a randomized variant of the Mirror Prox algorithm. They focus on matrix games and improve the complexity over deterministic methods in several settings. However, beyond matrix games, their method requires extra assumptions such as a bounded domain and involves a three-loop algorithm. More recently, [3] proposes a stochastic extragradient method with variance reduction for solving variational inequalities. In the Euclidean setting, their method is based on a loopless variant of variance-reduced methods [29, 34]. Their algorithm offers a convergence guarantee similar to [10] but does not require assumptions such as a bounded region.

After the first version of this paper, there have been several works on stochastic primal-dual methods that can be applied to LP. For example, [30, 44] propose variants of variance-reduced extragradient methods for finite-sum variational inequalities. In particular, [30] derives a sublinear rate for the proposed method in the finite-sum convex-concave setting, while [44] derives a linear convergence rate for its Algorithm 1 under a projection-type error-bound condition. [1] studies complexity bounds for the primal-dual algorithm with random extrapolation and coordinate descent and obtains a sublinear rate for general convex-concave problems with bilinear coupling.

1.4 Notations

Let \log(\cdot) and \exp(\cdot) refer to the natural logarithm and exponential function, respectively. Let \sigma_{min}(A) and \sigma_{min}^{+}(A) denote the minimum singular value and minimum nonzero singular value of a matrix A. Let \|z\|_{2}=\sqrt{\sum_{j}z_{j}^{2}} be the l_{2} norm of a vector z. Let \|A\|_{2} and \|A\|_{F} denote the 2-norm and Frobenius norm of a matrix A. Let e_{i} be the i-th standard unit vector. Let f(x)=\mathcal{O}(g(x)) denote that, for sufficiently large x, there exists a constant C such that f(x)\leq Cg(x), and let f(x)=\Omega(g(x)) denote that, for sufficiently large x, there exists a constant c such that f(x)\geq cg(x). We write f(x)=\Theta(g(x)) if f(x)=\mathcal{O}(g(x)) and f(x)=\Omega(g(x)). Denote \widetilde{\mathcal{O}}(g(x))=\mathcal{O}(g(x)\,\text{polylog}(g(x))), where \text{polylog}(g(x)) refers to a polynomial in \log(g(x)). For a set \mathcal{Z}, denote by W_{r}(z)=\{\hat{z}\in\mathcal{Z}\,|\,\|z-\hat{z}\|\leq r\} the ball centered at z with radius r\in(0,\infty) intersected with the set \mathcal{Z}. \iota_{S} is the indicator function of a convex set S.

2 Preliminaries

In this section, we present two results from the recent literature: (i) the sharpness condition for primal-dual problems based on the normalized duality gap introduced in [7], and (ii) the stochastic EGM with variance reduction introduced in [3]. We will utilize these results to develop our main theory in later sections.

2.1 Sharpness Condition for Primal-Dual Problems

Sharpness is a central property of the objective in minimization problems that can speed up the convergence of optimization methods. Recently, [7] introduces a new sharpness condition of primal-dual problems based on a normalized duality gap. They show that linear programming (among other examples) in the primal-dual formulation is sharp. Furthermore, [7] proposes a restarted algorithm that can accelerate the convergence of primal-dual algorithms such as EGM, PDHG and ADMM, and achieve the optimal linear convergence rate on sharp primal-dual problems.

More formally, the sharpness of a primal-dual problem is defined as:

Definition 1.

We say a primal-dual problem is \alpha-sharp on the set S\subset\mathcal{Z} if \rho_{r}(z) is \alpha-sharp on S for all r, i.e., it holds for all z\in S that

\alpha\,\text{dist}(z,\mathcal{Z}^{*})\leq\rho_{r}(z)=\max_{\hat{z}\in W_{r}(z)}\frac{\mathcal{L}(x,\hat{y})-\mathcal{L}(\hat{x},y)}{r}\ ,

where W_{r}(z)=\{\hat{z}\in\mathcal{Z}\,|\,\|z-\hat{z}\|\leq r\} is the ball centered at z with radius r intersected with the set \mathcal{Z}.

In particular, the following examples are shown to be sharp instances [7]:

Example 2.1.

Consider a primal-dual problem with a bounded feasible region, i.e., g_{1} and g_{2} encode bounded feasible regions for x and y, respectively. Suppose P(x)=\max_{y}\mathcal{L}(x,y) is \alpha_{1}-sharp in x and D(y)=\min_{x\in\mathcal{X}}\mathcal{L}(x,y) is \alpha_{2}-sharp in y. Then problem (1) is \alpha-sharp in the feasible region, where \alpha=\frac{\min\{\alpha_{1},\alpha_{2}\}}{R} and R is the diameter of the region.

Example 2.2.

Consider an unconstrained bilinear problem with \mathcal{L}(x,y)=y^{T}Ax+c^{T}x-b^{T}y. Then problem (1) is \alpha-sharp on \mathbb{R}^{m+n}, where \alpha=\sigma^{+}_{min}(A).
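This sharpness constant can be checked numerically. For the unconstrained bilinear problem, the duality gap \mathcal{L}(x,\hat{y})-\mathcal{L}(\hat{x},y) is linear in \hat{z} and vanishes at \hat{z}=z, so the maximum over the ball W_{r}(z) equals r\|F(z)\|, i.e., \rho_{r}(z)=\|F(z)\| for every r. The following sketch (our own illustration, with b and c chosen so that \mathcal{Z}^{*} is nonempty) verifies \sigma^{+}_{min}(A)\,\text{dist}(z,\mathcal{Z}^{*})\leq\rho_{r}(z):

```python
import numpy as np

# Numerical check of Example 2.2 (our illustration). For the unconstrained
# bilinear problem, F(z) = M z + q with M = [[0, A^T], [-A, 0]] and
# q = (c, b); the normalized duality gap is rho_r(z) = ||F(z)||, and
# sharpness holds with alpha = sigma_min^+(A).
rng = np.random.default_rng(2)
m, n = 4, 6
A = rng.standard_normal((m, n))
x_bar, y_bar = rng.standard_normal(n), rng.standard_normal(m)
b, c = A @ x_bar, -A.T @ y_bar        # chosen so that Z* is nonempty

M = np.block([[np.zeros((n, n)), A.T], [-A, np.zeros((m, m))]])
q = np.concatenate([c, b])

z = rng.standard_normal(m + n)
Fz = M @ z + q
rho = np.linalg.norm(Fz)                           # rho_r(z), same for every r
dist = np.linalg.norm(np.linalg.pinv(M) @ Fz)      # dist(z, Z*); Z* is affine
alpha = np.linalg.svd(A, compute_uv=False).min()   # sigma_min^+ (A full rank here)
```

The inequality holds because the nonzero singular values of M are exactly those of A (each repeated twice), so \|M^{+}F(z)\|\leq\|F(z)\|/\sigma^{+}_{min}(A).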

Example 2.3.

Consider the primal-dual form of standard LP:

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}\mathcal{L}(x,y)=y^{T}Ax+c^{T}x+\iota_{x\geq 0}-b^{T}y\ .

Then for any z, the normalized duality gap \rho_{r}(z)=\max_{\hat{z}\in W_{r}(z)}\frac{\mathcal{L}(x,\hat{y})-\mathcal{L}(\hat{x},y)}{r} satisfies

\rho_{r}(z)\geq\frac{1}{H\sqrt{1+4\|z\|^{2}}}\,\text{dist}(z,\mathcal{Z}^{*})\ ,

where HH is the Hoffman constant of the KKT system of LP [7, 28].

2.2 Stochastic EGM with Variance Reduction

In this subsection, we present the stochastic EGM with variance reduction introduced in [3], and restate their two major convergence results.

Algorithm 1 presents the stochastic EGM with variance reduction (sEGM). The algorithm keeps track of two sequences, \{z^{k}\} and \{w^{k}\}. The iterate w^{k} can be viewed as a snapshot, which is updated only rarely and at which we evaluate the full gradient F(w^{k}). We calculate \bar{z}^{k} as a weighted average of z^{k} and w^{k}. Similar to deterministic EGM, the algorithm computes an intermediate solution z^{k+1/2}, using the full gradient at the snapshot F(w^{k}); it then uses the variance-reduced stochastic gradient F(w^{k})+F_{\xi_{k}}(z^{k+1/2})-F_{\xi_{k}}(w^{k}) to compute z^{k+1}.

Algorithm 1 Stochastic EGM with variance reduction: z^{K}=\text{sEGM}(p,Q,\tau,z^{0},K)
0:  Probability p\in(0,1), sample oracle Q, step size \tau, number of iterations K. Let w^{0}=z^{0}.
0:  \widetilde{z}^{K}=\frac{1}{K}\sum_{l=0}^{K-1}z^{l+1/2}.
1:  for k=0,1,\ldots,K-1 do
2:     \bar{z}^{k}=(1-p)z^{k}+pw^{k}
3:     z^{k+1/2}=\text{prox}_{\tau g}(\bar{z}^{k}-\tau F(w^{k}))
4:     Draw an index \xi_{k} according to Q
5:     z^{k+1}=\text{prox}_{\tau g}(\bar{z}^{k}-\tau[F(w^{k})+F_{\xi_{k}}(z^{k+1/2})-F_{\xi_{k}}(w^{k})])
6:     w^{k+1}=\begin{cases}z^{k+1},&\text{with prob. }p\\ w^{k},&\text{with prob. }1-p\end{cases}
7:  end for
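A minimal runnable sketch of Algorithm 1 (our own illustration, not the paper's implementation) can be written as follows; here `F_sample()` draws one index \xi\sim Q and returns the map z\mapsto F_{\xi}(z), so the two stochastic evaluations in step 5 reuse the same \xi:

```python
import numpy as np

def segm(F_full, F_sample, prox, p, tau, z0, K, rng):
    """Algorithm 1 (sketch). F_full(z) is the full operator F(z);
    F_sample() draws one xi ~ Q and returns the map z -> F_xi(z).
    Returns the averaged output z~^K and the last iterate z^K."""
    z = w = z0
    avg = np.zeros_like(z0)
    for _ in range(K):
        z_bar = (1 - p) * z + p * w                 # step 2
        Fw = F_full(w)                              # full gradient at snapshot
        z_half = prox(z_bar - tau * Fw)             # step 3
        F_xi = F_sample()                           # step 4
        z = prox(z_bar - tau * (Fw + F_xi(z_half) - F_xi(w)))  # step 5
        if rng.random() < p:                        # step 6: rare snapshot update
            w = z
        avg += z_half
    return avg / K, z
```

For an unconstrained problem, `prox` is the identity map. With p = 1 and `F_sample()` returning `F_full` itself, the loop reduces to deterministic EGM, which makes a convenient correctness check.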

We here restate the descent lemma of Algorithm 1 ([3, Lemma 2.2]), which is the key component in the sublinear convergence rate proof of sEGM:

Lemma 1.

Let \phi_{k}(z)=(1-p)\|z^{k}-z\|^{2}+\|w^{k}-z\|^{2} and let \tau=\frac{\sqrt{p}}{2L}. Then for (z^{k}) generated by Algorithm 1 and any z_{*}\in\mathcal{Z}^{*}, it holds that

\mathbb{E}_{k}[\phi_{k+1}(z_{*})]\leq\phi_{k}(z_{*})-\frac{1}{2}\left(p\|z^{k+1/2}-w^{k}\|^{2}+\mathbb{E}_{k}[\|z^{k+1/2}-z^{k+1}\|^{2}]\right)\ .

Moreover, it holds that

\sum_{k=0}^{\infty}\left(p\,\mathbb{E}[\|z^{k+1/2}-w^{k}\|^{2}]+\mathbb{E}[\|z^{k+1/2}-z^{k+1}\|^{2}]\right)\leq 2\phi_{0}(z_{*})=2(2-p)\|z^{0}-z_{*}\|^{2}\ .

Next, we restate the sublinear convergence of Algorithm 1 in terms of the duality gap. The theorem and its proof can be obtained by a simple modification of [3, Theorem 2.5].

Theorem 1.

Suppose Assumptions 1 and 2 hold, p\in(0,1) and \tau=\frac{\sqrt{p}}{2L}. Let \Theta_{k+1/2}(z)=\langle F(z^{k+1/2}),z^{k+1/2}-z\rangle+g(z^{k+1/2})-g(z). Then for any compact set \mathcal{C},

2\tau\,\mathbb{E}\left[\max_{z\in\mathcal{C}}\sum_{l=0}^{k-1}\Theta_{l+1/2}(z)\right]\leq\frac{7}{2}\max_{z\in\mathcal{C}}\|z^{0}-z\|^{2}+14\|z^{0}-z_{*}\|^{2}\ .

3 Restarted Stochastic EGM with Variance Reduction

In this section, we first present our main algorithm, the nested-loop restarted stochastic EGM algorithm (RsEGM, Algorithm 2), in Section 3.1. We then show that, with high probability, the restarted algorithm exhibits linear convergence to an optimal solution for sharp primal-dual problems, both with and without a bounded region. This result improves upon the linear convergence of stochastic algorithms without restarts.

3.1 Algorithm

Algorithm 2 presents the nested-loop restarted stochastic EGM algorithm (RsEGM). We initialize the algorithm with a probability parameter p(0,1]p\in(0,1], a stochastic sample oracle QQ, a step size τ\tau for sEGM, the inner loop length KK, and the number of outer iterations TT. In each outer loop, we run sEGM for KK iterations and restart the next outer loop from the output of sEGM in the previous outer loop.

Algorithm 2 Restarted stochastic EGM: zT,0=RsEGM(p,Q,τ,z0,0,T,K)z^{T,0}=RsEGM(p,Q,\tau,z^{0,0},T,K)
0:  Probability p(0,1]p\in(0,1], probability distribution QQ, step size τ\tau, number of restarts TT, number of iterations for sEGM KK.
0:  zT,0z^{T,0}.
1:  for t=0,1,,T1t=0,1,...,T-1 do
2:     Call subroutine sEGM to obtain zt+1,0=sEGM(p,Q,τ,zt,0,K)z^{t+1,0}=sEGM(p,Q,\tau,z^{t,0},K)
3:  end for
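The restart scheme itself is a thin wrapper around the inner solver; a minimal sketch (the inner solver is passed in as a function, and the names are illustrative, not the paper's implementation):

```python
def RsEGM_sketch(sEGM, p, Q, tau, z0, T, K):
    # Sketch of Algorithm 2: T restarts, each warm-starting a
    # K-iteration sEGM run from the previous output z^{t,0}.
    z = z0
    for _ in range(T):
        z = sEGM(p, Q, tau, z, K)   # z^{t+1,0} = sEGM(p, Q, tau, z^{t,0}, K)
    return z
```

If each inner run halves the distance to the optimal set, as the analysis below guarantees with high probability, then T restarts shrink the initial distance by a factor of 2^T, which is the source of the linear rate.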

3.2 Convergence Guarantees for Problems with Global Sharpness

Theorem 2 presents the high-probability linear convergence rate of Algorithm 2 for solving the primal-dual problem (1) under global sharpness conditions. Recall that global sharpness holds for unconstrained bilinear problems and for bilinear problems over a bounded region (an economic example is the two-player matrix game); this implies the linear convergence of Algorithm 2 in these two cases.

Theorem 2.

Consider RsEGM (Algorithm 2) for solving a sharp primal-dual problem (1). Suppose Assumption 1 and Assumption 2 hold. Let α\alpha be the global sharpness constant of the primal-dual problem and LL be the Lipschitz parameter of the stochastic oracle. Denote the initial distance to optimal set as R0=dist(z0,0,𝒵)R_{0}=\text{dist}(z^{0,0},\mathcal{Z}^{*}), and choose step-size τ=p2L\tau=\frac{\sqrt{p}}{2L}. For any δ(0,1)\delta\in(0,1) and ϵ>0\epsilon>0, if the outer loop count TT and the inner loop count KK of RsEGM satisfy

Tmax{log2R0ϵ,0},K𝒪~(1pLα1δ2T2),T\geq\max{\left\{\log_{2}\frac{R_{0}}{\epsilon},0\right\}},\;K\geq\widetilde{\mathcal{O}}\left({\frac{1}{\sqrt{p}}\frac{L}{\alpha}\frac{1}{\delta^{2}}T^{2}}\right)\ ,

then it holds with probability at least 1δ1-\delta that

dist(zT,0,𝒵)ϵ.\text{dist}(z^{T,0},\mathcal{Z}^{*})\leq\epsilon\ .

Furthermore, the total number of iterations to reach ϵ\epsilon accuracy with probability at least 1δ1-\delta is of order

𝒪~(1pLα1δ2(logR0ϵ)3).\widetilde{\mathcal{O}}\left(\frac{1}{\sqrt{p}}\frac{L}{\alpha}\frac{1}{\delta^{2}}\left({\log\frac{R_{0}}{\epsilon}}\right)^{3}\right)\ .
Remark 1.

The performance of stochastic algorithms depends on the realization of stochasticity. The traditional convergence analysis of these algorithms usually measures the expected performance, namely, running the algorithm multiple times and looking at the average performance. In contrast, when choosing δ=12\delta=\frac{1}{2}, Theorem 2 provides the complexity of the median performance of the algorithm.

Remark 2.

In this remark, we consider the finite-sum model where (x,y)=1Ni=1Ni(x,y)\mathcal{L}(x,y)=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{i}(x,y). Suppose the stochastic oracle computes Fξ(z)=[xi(x,y),yi(x,y)]F_{\xi}(z)=[\nabla_{x}\mathcal{L}_{i}(x,y),-\nabla_{y}\mathcal{L}_{i}(x,y)] for ii drawn uniformly at random from [1,,N][1,...,N]. Then evaluating Fξ(z)F_{\xi}(z) requires one stochastic oracle call, while evaluating the true gradient F(z)F(z) requires NN stochastic oracle calls. With a careful calculation, we obtain that one outer iteration of RsEGM requires (N+2)+𝒪~((Np+2)1pLα1δ2(logR0ϵ)2)\left({N+2}\right)+\widetilde{\mathcal{O}}\left({\left({Np+2}\right)\frac{1}{\sqrt{p}}\frac{L}{\alpha}\frac{1}{\delta^{2}}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{2}}\right) stochastic oracle calls. Thus, the average total oracle cost of RsEGM to reach ϵ\epsilon-accuracy with probability at least 1δ1-\delta becomes

𝒪~((N+2)logR0ϵ+(Np+2)1pLα1δ2(logR0ϵ)3).\widetilde{\mathcal{O}}\left({\left({N+2}\right)\log{\frac{R_{0}}{\epsilon}}+\left({Np+2}\right)\frac{1}{\sqrt{p}}\frac{L}{\alpha}\frac{1}{\delta^{2}}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{3}}\right)\ . (4)

Choosing p=Θ(1N)p=\Theta{\left({\frac{1}{N}}\right)} and δ=0.5\delta=0.5, the number of stochastic oracle calls for the median trajectory to achieve ϵ\epsilon accuracy becomes

𝒪~((N+2)logR0ϵ+NLα(logR0ϵ)3).\widetilde{\mathcal{O}}\left({\left({N+2}\right)\log{\frac{R_{0}}{\epsilon}}+\sqrt{N}\frac{L}{\alpha}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{3}}\right)\ .

In contrast, the total oracle cost of deterministic restarted methods for solving primal-dual problems with global sharpness [7] (which is the optimal deterministic algorithm) is

𝒪(NLFαlogR0ϵ),\mathcal{O}\left({N\frac{L_{F}}{\alpha}\log{\frac{R_{0}}{\epsilon}}}\right)\ ,

where LFL_{F} is the Lipschitz constant of (x,y)\mathcal{L}(x,y). In the context of the finite-sum model, LFL_{F} is generally of the same order as LL (up to a variance term that is independent of NN). This shows that, ignoring log\log terms, RsEGM in general needs a factor of O(N)O(\sqrt{N}) fewer stochastic oracle calls than its deterministic counterpart to find an ϵ\epsilon-solution.
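As a purely illustrative back-of-the-envelope comparison (treating the 𝒪~\widetilde{\mathcal{O}} bounds as exact with unit constants and collapsing all logarithmic factors into a single parameter, which they are not), one can tabulate the two oracle counts:

```python
from math import sqrt

def oracle_costs(N, L_over_alpha, log_term):
    # Unit-constant caricature of the bound above with p = Theta(1/N),
    # versus the deterministic bound N * (L_F / alpha) * log(R0 / eps).
    stochastic = (N + 2) * log_term + sqrt(N) * L_over_alpha * log_term ** 3
    deterministic = N * L_over_alpha * log_term
    return stochastic, deterministic

s, d = oracle_costs(N=10**6, L_over_alpha=10**3, log_term=10)
# the ratio d / s behaves like sqrt(N) / log_term**2 once the
# sqrt(N) * (L / alpha) term dominates
```

With these (arbitrary) parameter values the deterministic count exceeds the stochastic one by roughly an order of magnitude, consistent with the √N-versus-log³ trade-off described above.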

We know from Example 2.1 and Example 2.2 that bilinear problems with a bounded region and unconstrained bilinear problems are both globally sharp. A direct application of Theorem 2 gives:

Corollary 1.

For bilinear problems with a bounded region and for unconstrained bilinear problems, the total number of iterations to reach ϵ\epsilon accuracy with probability at least 1δ1-\delta is of order

𝒪~(1pLα1δ2(logR0ϵ)3).\widetilde{\mathcal{O}}\left({\frac{1}{\sqrt{p}}\frac{L}{\alpha}\frac{1}{\delta^{2}}\left({\log\frac{R_{0}}{\epsilon}}\right)^{3}}\right)\ .

To prove Theorem 2, we first introduce two properties of sEGM. The next lemma bounds the expected distance between the iterates of sEGM and the initial point by the distance between the initial point and the optimal solution.

Lemma 2.

Consider sEGM (Algorithm 1) for solving (1). Suppose Assumption 1 and Assumption 2 hold. Then for any z𝒵z_{*}\in\mathcal{Z}^{*} and k0k\geq 0,

𝔼[zk+1/2z0](3+21p)z0z.\mathbb{E}[\|z^{k+1/2}-z^{0}\|]\leq\left({3+\sqrt{\frac{2}{1-p}}}\right)\|z^{0}-z_{*}\|\ . (5)
Proof.

It follows from Lemma 1 that

𝔼[zk+1/2zk+12]2ϕ0(z)4z0z2,\mathbb{E}[\|z^{k+1/2}-z^{k+1}\|^{2}]\leq 2\phi_{0}(z_{*})\leq 4\|z^{0}-z_{*}\|^{2}\ ,

thus

𝔼[zk+1/2zk+1](𝔼[zk+1/2zk+12])1/22z0z.\mathbb{E}[\|z^{k+1/2}-z^{k+1}\|]\leq\left({\mathbb{E}[\|z^{k+1/2}-z^{k+1}\|^{2}]}\right)^{1/2}\leq 2\|z^{0}-z_{*}\|\ .

We can also derive from the monotonicity of 𝔼[ϕk(z)]\mathbb{E}[\phi_{k}(z_{*})] that

𝔼[zk+1z2]11p𝔼[ϕk+1(z)]11p𝔼[ϕ0(z)]=11p(2p)z0z221pz0z2\mathbb{E}[\|z^{k+1}-z_{*}\|^{2}]\leq\frac{1}{1-p}\mathbb{E}[\phi_{k+1}(z_{*})]\leq\cdots\leq\frac{1}{1-p}\mathbb{E}[\phi_{0}(z_{*})]=\frac{1}{1-p}(2-p)\|z^{0}-z_{*}\|^{2}\leq\frac{2}{1-p}\|z^{0}-z_{*}\|^{2}

thus

𝔼[zk+1z](𝔼[zk+1z2])1/221pz0z.\mathbb{E}[\|z^{k+1}-z_{*}\|]\leq\left({\mathbb{E}[\|z^{k+1}-z_{*}\|^{2}]}\right)^{1/2}\leq\sqrt{\frac{2}{1-p}}\|z^{0}-z_{*}\|\ .

By triangle inequality we have

𝔼[zk+1/2z]𝔼[zk+1/2zk+1]+𝔼[zk+1z](2+21p)z0z,\mathbb{E}[\|z^{k+1/2}-z_{*}\|]\leq\mathbb{E}[\|z^{k+1/2}-z^{k+1}\|]+\mathbb{E}[\|z^{k+1}-z_{*}\|]\leq\left({2+\sqrt{\frac{2}{1-p}}}\right)\|z^{0}-z_{*}\|\ ,
𝔼[zk+1/2z0]𝔼[zk+1/2z]+zz0(3+21p)z0z.\mathbb{E}[\|z^{k+1/2}-z^{0}\|]\leq\mathbb{E}[\|z^{k+1/2}-z_{*}\|]+\|z_{*}-z^{0}\|\leq\left({3+\sqrt{\frac{2}{1-p}}}\right)\|z^{0}-z_{*}\|\ . ∎

Next, we present the sublinear rate of sEGM on the localized duality gap:

Lemma 3.

Consider sEGM (Algorithm 1) for solving (1). Suppose Assumption 1 and Assumption 2 hold, then

𝔼[maxz𝒞𝒵(x~K,y)(x,y~K)]12τK(72maxz𝒞z0z2+14z0z2).\mathbb{E}\left[\max_{z\in\mathcal{C}\cap\mathcal{Z}}\mathcal{L}(\widetilde{x}^{K},y)-\mathcal{L}(x,\widetilde{y}^{K})\right]\leq\frac{1}{2\tau K}\left({\frac{7}{2}\max_{z\in\mathcal{C}}\|z^{0}-z\|^{2}+14\|z^{0}-z_{*}\|^{2}}\right)\ .
Proof.

Notice that

𝔼[maxz𝒞𝒵(x~K,y)(x,y~K)]𝔼[maxz𝒞𝒵1Kk=0K1((xk+1/2,y)(x,yk+1/2))]1K𝔼[maxz𝒞𝒵k=0K1x(xk+1/2,yk+1/2)(xk+1/2x)y(xk+1/2,yk+1/2)(yk+1/2y)]1K𝔼[maxz𝒞k=0K1Θk+1/2(z)]12τK(72maxz𝒞z0z2+14z0z2).\displaystyle\begin{split}&\ \mathbb{E}\left[\max_{z\in\mathcal{C}\cap\mathcal{Z}}\mathcal{L}(\widetilde{x}^{K},y)-\mathcal{L}(x,\widetilde{y}^{K})\right]\leq\mathbb{E}\left[\max_{z\in\mathcal{C}\cap\mathcal{Z}}\frac{1}{K}\sum_{k=0}^{K-1}\left({\mathcal{L}(x^{k+1/2},y)-\mathcal{L}(x,y^{k+1/2})}\right)\right]\\ &\ \leq\frac{1}{K}\mathbb{E}\left[\max_{z\in\mathcal{C}\cap\mathcal{Z}}\sum_{k=0}^{K-1}\nabla_{x}\mathcal{L}(x^{k+1/2},y^{k+1/2})(x^{k+1/2}-x)-\nabla_{y}\mathcal{L}(x^{k+1/2},y^{k+1/2})(y^{k+1/2}-y)\right]\\ &\ \leq\frac{1}{K}\mathbb{E}\left[\max_{z\in\mathcal{C}}\sum_{k=0}^{K-1}\Theta_{k+1/2}(z)\right]\leq\frac{1}{2\tau K}\left({\frac{7}{2}\max_{z\in\mathcal{C}}\|z^{0}-z\|^{2}+14\|z^{0}-z_{*}\|^{2}}\right)\ .\end{split}

where the first two inequalities are from the convexity-concavity of (x,y)\mathcal{L}(x,y) and the last inequality is due to Theorem 1. ∎

Now we are ready to prove Theorem 2.

Proof of Theorem 2.

Consider the tt-th outer iteration, and define Rt=dist(zt,0,𝒵)R_{t}=\text{dist}(z^{t,0},\mathcal{Z}^{*}). To prove the theorem, we will show that the event {RTR02T}\bigg{\{}R_{T}\leq\frac{R_{0}}{2^{T}}\bigg{\}} occurs with high probability; the key is to use the global sharpness on a carefully controlled region.

Suppose K2000exp(e2)K\geq 2000\geq\exp(e^{2}). It follows from Markov's inequality that for any inner iteration kk

(zt,kzt,0r0)=1(zt,kzt,0>r0)1𝔼[zt,kzt,0]r0.\mathbb{P}\left({\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)=1-\mathbb{P}\left({\|z^{t,k}-z^{t,0}\|>r_{0}}\right)\geq 1-\frac{\mathbb{E}\left[\|z^{t,k}-z^{t,0}\|\right]}{r_{0}}\ .

Furthermore, it follows from Lemma 2 that

(zt,kzt,0r0)1𝔼[zt,kzt,0]r01(3+21p)zt,0zr0.\mathbb{P}\left({\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)\geq 1-\frac{\mathbb{E}[\|z^{t,k}-z^{t,0}\|]}{r_{0}}\geq 1-\left({3+\sqrt{\frac{2}{1-p}}}\right)\frac{\|z^{t,0}-z_{*}\|}{r_{0}}\ . (6)

Thus, if r0r_{0} is chosen to be sufficiently large, the probability (zt,kzt,0r0)\mathbb{P}\left({\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right) can be arbitrarily close to 11. Given the high probability event {zt,kzt,0r0}\left\{\|z^{t,k}-z^{t,0}\|\leq r_{0}\right\}, we can upper bound the normalized duality gap as

maxzWr(zt,k){(xt,k,y)(x,yt,k)}maxzWr+r0(zt,0){(xt,k,y)(x,yt,k)}.\max_{z\in W_{r}(z^{t,k})}{\left\{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})\right\}}\leq\max_{z\in W_{r+r_{0}}(z^{t,0})}{\left\{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})\right\}}\ . (7)
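Inequality (7) is simply the triangle inequality, assuming (as throughout this section) that Wr(z)W_{r}(z) denotes the radius-rr ball centered at zz:

```latex
z \in W_{r}(z^{t,k}),\ \|z^{t,k}-z^{t,0}\|\leq r_{0}
\;\Longrightarrow\;
\|z-z^{t,0}\| \leq \|z-z^{t,k}\| + \|z^{t,k}-z^{t,0}\| \leq r+r_{0},
\quad\text{i.e.,}\quad W_{r}(z^{t,k}) \subseteq W_{r+r_{0}}(z^{t,0}).
```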

By α\alpha-sharpness and the fact that Wr+r0(zt,0)W_{r+r_{0}}(z^{t,0}) is deterministic given zt,0z^{t,0}, we have for any constant C>0C>0 and function g(k)>0g(k)>0 that

(αRt+1CRtg(k))(1rmaxzWr(zt+1,0)(xt+1,0,y)(x,yt+1,0)CRtg(k))=(maxzWr(zt,k)(xt,k,y)(x,yt,k)CrRtg(k)).\displaystyle\begin{split}\mathbb{P}\left({\alpha R_{t+1}\leq\frac{CR_{t}}{g(k)}}\right)\geq&\ \mathbb{P}\left({\frac{1}{r}\max_{z\in W_{r}(z^{t+1,0})}{\mathcal{L}(x^{t+1,0},y)-\mathcal{L}(x,y^{t+1,0})}\leq\frac{CR_{t}}{g(k)}}\right)\\ =&\ \mathbb{P}\left({\max_{z\in W_{r}(z^{t,k})}{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})}\leq\frac{CrR_{t}}{g(k)}}\right)\ .\end{split}

Thus, it holds for any r00r_{0}\geq 0 that

(αRt+1CRtg(k))(maxzWr(zt,k)(xt,k,y)(x,yt,k)CrRtg(k),zt,kzt,0r0)(maxzWr+r0(zt,0)(xt,k,y)(x,yt,k)CrRtg(k),zt,kzt,0r0)=(zt,kzt,0r0)(maxzWr+r0(zt,0)(xt,k,y)(x,yt,k)>CrRtg(k),zt,kzt,0r0)(zt,kzt,0r0)(maxzWr+r0(zt,0)(xt,k,y)(x,yt,k)>CrRtg(k))\displaystyle\begin{split}\mathbb{P}\left({\alpha R_{t+1}\leq\frac{CR_{t}}{g(k)}}\right)\geq&\ \mathbb{P}\left({\max_{z\in W_{r}(z^{t,k})}{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})}\leq\frac{CrR_{t}}{g(k)}\;,\;\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)\\ \geq&\ \mathbb{P}\left({\max_{z\in W_{r+r_{0}}(z^{t,0})}{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})}\leq\frac{CrR_{t}}{g(k)}\;,\;\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)\\ =&\ \mathbb{P}\left({\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)\\ &\ -\mathbb{P}\left({\max_{z\in W_{r+r_{0}}(z^{t,0})}{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})}>\frac{CrR_{t}}{g(k)}\;,\;\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)\\ \geq&\ \mathbb{P}\left({\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)-\mathbb{P}\left({\max_{z\in W_{r+r_{0}}(z^{t,0})}{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})}>\frac{CrR_{t}}{g(k)}}\right)\\ \end{split} (8)

where the second inequality utilizes Equation (7).

The next step is to upper bound the last term in the right hand side of (8). Denote C1=3+21pC_{1}=3+\sqrt{\frac{2}{1-p}} and choose r0=r=2C1RtTδr_{0}=r=2C_{1}R_{t}\frac{T}{\delta}, then we have zWr+r0(zt,0)z_{*}\in W_{r+r_{0}}(z^{t,0}) because r+r0Rtr+r_{0}\geq R_{t}, thus

(maxzWr+r0(zt,0)(xt,k,y)(x,yt,k)>CrRtg(k))g(k)CrRt𝔼[maxzWr+r0(zt,0)(xt,k,y)(x,yt,k)]g(k)CrRt12τk(72+14)maxzWr+r0(zt,0)zt,0z2354τCg(k)k(r+r0)2rRt,\displaystyle\begin{split}\mathbb{P}&\left({\max_{z\in W_{r+r_{0}}(z^{t,0})}{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})}>\frac{CrR_{t}}{g(k)}}\right)\leq\frac{g(k)}{CrR_{t}}\mathbb{E}\left[\max_{z\in W_{r+r_{0}}(z^{t,0})}{\mathcal{L}(x^{t,k},y)-\mathcal{L}(x,y^{t,k})}\right]\\ &\ \leq\frac{g(k)}{CrR_{t}}\frac{1}{2\tau k}\left({\frac{7}{2}+14}\right)\max_{z\in W_{r+r_{0}}(z^{t,0})}\|z^{t,0}-z\|^{2}\ \leq\frac{35}{4\tau C}\frac{g(k)}{k}\frac{(r+r_{0})^{2}}{rR_{t}}\ ,\end{split} (9)

where the first inequality follows from Markov's inequality, and the second utilizes Lemma 3 and zWr+r0(zt,0)z_{*}\in W_{r+r_{0}}(z^{t,0}). Combining Equation (8) with (9), we obtain

(αRt+1CRtg(k))(zt,kzt,0r0)354τCg(k)k(r+r0)2rRt1(3+21p)Rtr0354τCg(k)k(r+r0)2rRt,\displaystyle\begin{split}\mathbb{P}\left({\alpha R_{t+1}\leq\frac{CR_{t}}{g(k)}}\right)&\ \geq\mathbb{P}\left({\|z^{t,k}-z^{t,0}\|\leq r_{0}}\right)-\frac{35}{4\tau C}\frac{g(k)}{k}\frac{(r+r_{0})^{2}}{rR_{t}}\\ &\ \geq 1-\left({3+\sqrt{\frac{2}{1-p}}}\right)\frac{R_{t}}{r_{0}}-\frac{35}{4\tau C}\frac{g(k)}{k}\frac{(r+r_{0})^{2}}{rR_{t}}\ ,\end{split}

where the last inequality uses Equation (6). Thus, we obtain by substituting the value of r0r_{0}, rr and τ\tau that

(αRt+1CRtg(k))1δ2T354τCg(k)k8C1Tδ=1δ2TLCg(k)kTδ140C1p.\displaystyle\begin{split}\mathbb{P}\left({\alpha R_{t+1}\leq\frac{CR_{t}}{g(k)}}\right)\geq&1-\frac{\delta}{2T}-\frac{35}{4\tau C}\frac{g(k)}{k}\frac{8C_{1}T}{\delta}=1-\frac{\delta}{2T}-\frac{L}{C}\frac{g(k)}{k}\frac{T}{\delta}\frac{140C_{1}}{\sqrt{p}}\ .\end{split}

Let C2=140C1C_{2}=140C_{1}, C=αC=\alpha and function g(k)=loglogkg(k)=\log{\log{k}}, then we have

(Rt+1Rtloglogk)1δ2TC2pLαTδloglogkk.\begin{split}\mathbb{P}\left({R_{t+1}\leq\frac{R_{t}}{\log{\log{k}}}}\right)\geq 1-\frac{\delta}{2T}-\frac{C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T}{\delta}\frac{\log{\log{k}}}{k}\ .\end{split} (10)

When k2000>exp(e2)k\geq 2000>\exp{(e^{2})}, we have {RtRt1loglogk}{RtRt12}\left\{R_{t}\leq\frac{R_{t-1}}{\log{\log{k}}}\right\}\subseteq\left\{R_{t}\leq\frac{R_{t-1}}{2}\right\}. Let t={RtRt12}\mathcal{E}_{t}=\left\{R_{t}\leq\frac{R_{t-1}}{2}\right\} and (|¯t)\mathbb{P}\left({\cdot|\bar{\mathcal{E}}_{t}}\right) be the probability conditioned on {1,,t}\left\{\mathcal{E}_{1},...,\mathcal{E}_{t}\right\}, then

(RTR02T)(t=1Tt)=(1)t=2T(t|¯t1)(1δ2TC2pLαTδloglogkk)T 1δ2C2pLαT2δloglogkk,\displaystyle\begin{split}\mathbb{P}\left({R_{T}\leq\frac{R_{0}}{2^{T}}}\right)\geq&\ \mathbb{P}\left({\cap_{t=1}^{T}\mathcal{E}_{t}}\right)=\mathbb{P}(\mathcal{E}_{1})\prod_{t=2}^{T}\mathbb{P}(\mathcal{E}_{t}|\bar{\mathcal{E}}_{t-1})\\ \geq&\ \bigg{(}1-\frac{\delta}{2T}-\frac{C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T}{\delta}\frac{\log{\log{k}}}{k}\bigg{)}^{T}\geq\ 1-\frac{\delta}{2}-\frac{C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta}\frac{\log{\log{k}}}{k}\ ,\end{split} (11)

where the second inequality is from Equation (10) and the third one uses Bernoulli inequality. Thus, if k2C2pLαT2δ2log(2C2pLαT2δ2)k\geq\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}}\right)} we obtain

C2pLαT2δloglogkkδ2loglog(2C2pLαT2δ2log(2C2pLαT2δ2))log(2C2pLαT2δ2)δ2log(log(2C2pLαT2δ2)+loglog(2C2pLαT2δ2))log(2C2pLαT2δ2)δ2,\displaystyle\begin{split}\frac{C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta}\frac{\log{\log{k}}}{k}\leq&\frac{\delta}{2}\cdot\frac{\log{\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}}\right)}}\right)}}}{\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}}\right)}}\\ \leq&\frac{\delta}{2}\cdot\frac{\log{\left({\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}}\right)+\log{\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}}\right)}}}}\right)}}{\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}}\right)}}\leq\frac{\delta}{2}\ ,\end{split} (12)

where the last inequality is from log(x+logx)x1\frac{\log{(x+\log{x})}}{x}\leq 1 for x1x\geq 1. Combining (11) and (12), we arrive at

(RTR02T)1δ2C2pLαT2δloglogkk1δ.\mathbb{P}\left({R_{T}\leq\frac{R_{0}}{2^{T}}}\right)\geq 1-\frac{\delta}{2}-\frac{C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta}\frac{\log{\log{k}}}{k}\geq 1-\delta\ .

Finally, by choosing T=log2R0ϵT=\left\lceil\log_{2}\frac{R_{0}}{\epsilon}\right\rceil and Kmax{2C2pLαT2δ2log(2C2pLαT2δ2),2000}K\geq\max\left\{\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}\log{\left({\frac{2C_{2}}{\sqrt{p}}\frac{L}{\alpha}\frac{T^{2}}{\delta^{2}}}\right)},2000\right\}, the total number of iterations to reach ϵ\epsilon accuracy with probability at least 1δ1-\delta is TKTK, which is of order

𝒪~(1pLα1δ2(logR0ϵ)3).\widetilde{\mathcal{O}}\left({\frac{1}{\sqrt{p}}\frac{L}{\alpha}\frac{1}{\delta^{2}}\left({\log\frac{R_{0}}{\epsilon}}\right)^{3}}\right)\ . ∎
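The numeric constants used in the proof can be checked directly; the following small script (a sanity check, not part of the argument) verifies that K2000K\geq 2000 guarantees loglogK2\log\log K\geq 2, and spot-checks the inequality behind (12):

```python
from math import e, exp, log

# K >= 2000 >= exp(e^2) guarantees loglog K >= 2, so the event
# {R_t <= R_{t-1} / loglog k} implies {R_t <= R_{t-1} / 2}
assert exp(e ** 2) < 2000
assert log(log(2000)) > 2

# the bound log(x + log x) / x <= 1 used in the last step of (12),
# spot-checked on a grid of x >= 1
for x in [1.0, 2.0, 5.0, 50.0, 1e6]:
    assert log(x + log(x)) / x <= 1
```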

3.3 Convergence Guarantees for Standard-Form LP

In this subsection, we establish the high-probability convergence rate of RsEGM for LP (2). Unfortunately, LP is not globally sharp (see Appendix A.2), thus the result in Section 3.2 cannot be directly applied to LP. Instead, [7] shows that LP is sharp on any bounded region. Unlike the deterministic methods studied in [7], where the iterates of a primal-dual algorithm always stay in a bounded region, the iterates of stochastic methods may escape from any bounded region. Theorem 3 overcomes this difficulty via a high-probability argument.

Theorem 3.

Consider RsEGM (Algorithm 2) for solving LP (2). Suppose Assumption 1 and Assumption 2 hold. Let LL be the Lipschitz parameter of the stochastic oracle and HH be the Hoffman constant of the KKT system of LP (see Example 2.3). Denote the initial distance to optimal set as R0=dist(z0,0,𝒵)R_{0}=\text{dist}(z^{0,0},\mathcal{Z}^{*}), and choose step-size τ=p2L\tau=\frac{\sqrt{p}}{2L}. For any δ(0,1)\delta\in(0,1) and ϵ>0\epsilon>0, if the outer loop count TT and the inner loop count KK of RsEGM satisfy

Tmax{log2R0ϵ,0},K𝒪~(1pL1/[H(1+z0,0+R0)]1δ3T4),T\geq\max{\left\{\log_{2}\frac{R_{0}}{\epsilon},0\right\}},\;K\geq\widetilde{\mathcal{O}}\left({\frac{1}{\sqrt{p}}\frac{L}{1/\left[H(1+\|z^{0,0}\|+R_{0})\right]}\frac{1}{\delta^{3}}T^{4}}\right),

then it holds with probability at least 1δ1-\delta that

dist(zT,0,𝒵)ϵ.\text{dist}(z^{T,0},\mathcal{Z}^{*})\leq\epsilon.

Furthermore, the total number of iterations to reach ϵ\epsilon accuracy with probability at least 1δ1-\delta is of order

𝒪~(1pL1/[H(1+z0,0+R0)]1δ3(logR0ϵ)5).\widetilde{\mathcal{O}}\left({\frac{1}{\sqrt{p}}\frac{L}{1/[H(1+\|z^{0,0}\|+R_{0})]}\frac{1}{\delta^{3}}\left({\log\frac{R_{0}}{\epsilon}}\right)^{5}}\right)\ .
Remark 3.

Similar to Remark 1, we can obtain the complexity of the median trajectory of the algorithm by choosing δ=12\delta=\frac{1}{2}.

Remark 4.

As shown in [7, Lemma 5], LP is sharp with α=1H1+4R2\alpha=\frac{1}{H\sqrt{1+4R^{2}}} in the ball {z|z2R}\{z|\|z\|_{2}\leq R\} for any radius R>0R>0. Notice that for deterministic primal-dual methods, the iterates usually stay in a ball centered at 0 with radius R=z0,0+R0R=\|z^{0,0}\|+R_{0} (see Section 2 in [7]), thus the effective sharpness constant for deterministic algorithms on LP is given by α=1H(1+z0,0+R0)\alpha=\frac{1}{H(1+\|z^{0,0}\|+R_{0})}. In terms of this α\alpha, Theorem 3 shows that the complexity of RsEGM is 𝒪~(1pLα1δ3(logR0ϵ)5).\widetilde{\mathcal{O}}\left({\frac{1}{\sqrt{p}}\frac{L}{\alpha}\frac{1}{\delta^{3}}\left({\log\frac{R_{0}}{\epsilon}}\right)^{5}}\right)\ .

In the following we prove Theorem 3. The key is to show that the iterates stay in a bounded region with a high probability, thus we can utilize the sharpness condition of LP.

Proof of Theorem 3.

Suppose K2000exp(e2)K\geq 2000\geq\exp(e^{2}). Denote Rt=dist(zt,0,𝒵)R_{t}=\text{dist}(z^{t,0},\mathcal{Z}^{*}) and consider event t={Rt1loglogKRt1andzt,0zt1,0Rt1B}\mathcal{E}_{t}=\left\{R_{t}\leq\frac{1}{\log{\log{K}}}R_{t-1}\;\text{and}\;\|z^{t,0}-z^{t-1,0}\|\leq R_{t-1}B\right\}, where we determine the value of BB later. By definition:

l=1tl{Rt<Rt1<<R0andzt,0tR0B+z0,0}\cap_{l=1}^{t}\mathcal{E}_{l}\subseteq\{R_{t}<R_{t-1}<...<R_{0}\;\text{and}\;\|z^{t,0}\|\leq tR_{0}B+\|z^{0,0}\|\}\ .

Let ¯t={l=1tl}\bar{\mathcal{E}}_{t}=\{\cap_{l=1}^{t}\mathcal{E}_{l}\}, and let (|¯t)\mathbb{P}(\cdot|\bar{\mathcal{E}}_{t}) be the probability conditioned on {1,,t}\{\mathcal{E}_{1},...,\mathcal{E}_{t}\}. For convenience, we denote (|¯0)=()\mathbb{P}(\cdot|\bar{\mathcal{E}}_{0})=\mathbb{P}(\cdot). It follows from the sharpness of LP on bounded balls ([7, Lemma 5], see Remark 4) that

(t+1|¯t)=(1H1+4zt+1,02Rt+11H1+4zt+1,02RtloglogK,zt+1,0zt,0RtB|¯t)(maxz^Wr(zt+1,0)(xt+1,0,y^)(x^,yt+1,0)rRtHloglogK1+4zt+1,02,zt+1,0zt,0RtB|¯t).\displaystyle\begin{split}&\ \mathbb{P}\left({\mathcal{E}_{t+1}|\bar{\mathcal{E}}_{t}}\right)\\ =&\ \mathbb{P}\left({\frac{1}{H\sqrt{1+4\|z^{t+1,0}\|^{2}}}R_{t+1}\leq\frac{1}{H\sqrt{1+4\|z^{t+1,0}\|^{2}}}\frac{R_{t}}{\log{\log{K}}},\;\|z^{t+1,0}-z^{t,0}\|\leq R_{t}B\big{|}\bar{\mathcal{E}}_{t}}\right)\\ \geq&\ \mathbb{P}\left({\max_{\hat{z}\in W_{r}(z^{t+1,0})}\frac{\mathcal{L}(x^{t+1,0},\hat{y})-\mathcal{L}(\hat{x},y^{t+1,0})}{r}\leq\frac{R_{t}}{H\log{\log{K}}\sqrt{1+4\|z^{t+1,0}\|^{2}}},\;\|z^{t+1,0}-z^{t,0}\|\leq R_{t}B\big{|}\bar{\mathcal{E}}_{t}}\right)\ .\\ \end{split} (13)

Note that given the event {zt+1,0zt,0RtB}\{\|z^{t+1,0}-z^{t,0}\|\leq R_{t}B\}, we have for any r>0r>0 that

maxz^Wr(zt+1,0)(xt+1,0,y^)(x^,yt+1,0)rmaxz^Wr+RtB(zt,0)(xt+1,0,y^)(x^,yt+1,0)r.\max_{\hat{z}\in W_{r}(z^{t+1,0})}\frac{\mathcal{L}(x^{t+1,0},\hat{y})-\mathcal{L}(\hat{x},y^{t+1,0})}{r}\leq\max_{\hat{z}\in W_{r+R_{t}B}(z^{t,0})}\frac{\mathcal{L}(x^{t+1,0},\hat{y})-\mathcal{L}(\hat{x},y^{t+1,0})}{r}\ . (14)

Furthermore, we have conditioned on ¯t\bar{\mathcal{E}}_{t} that zt,0tR0B+z0,0\|z^{t,0}\|\leq tR_{0}B+\|z^{0,0}\|, thus

RtHloglogK1+4(zt,0+RtB)2RtHloglogK1+4(z0,0+tR0B+R0B)2.\frac{R_{t}}{H\log{\log{K}}\sqrt{1+4(\|z^{t,0}\|+R_{t}B)^{2}}}\geq\frac{R_{t}}{H\log{\log{K}}\sqrt{1+4(\|z^{0,0}\|+tR_{0}B+R_{0}B)^{2}}}\ . (15)

Combining Equation (13), (14) and (15), we arrive at

(t+1|¯t)(maxz^Wr+RtB(zt,0)(xt,K,y^)(x^,yt,K)rRt/loglogKH1+4(z0,0+(t+1)R0B)2,zt,Kzt,0RtB|¯t)=(zt,Kzt,0RtB|¯t)(maxz^Wr+RtB(zt,0)(xt,K,y^)(x^,yt,K)r>Rt/loglogKH1+4(z0,0+(t+1)R0B)2,zt,Kzt,0RtB|¯t)(zt,Kzt,0RtB)(maxz^Wr+RtB(zt,0)(xt,K,y^)(x^,yt,K)r>Rt/loglogKH1+4(z0,0+(t+1)R0B)2|¯t).\displaystyle\begin{split}&\mathbb{P}\left({\mathcal{E}_{t+1}|\bar{\mathcal{E}}_{t}}\right)\\ \geq&\mathbb{P}\left({\max_{\hat{z}\in W_{r+R_{t}B}(z^{t,0})}\frac{\mathcal{L}(x^{t,K},\hat{y})-\mathcal{L}(\hat{x},y^{t,K})}{r}\leq\frac{R_{t}/\log{\log{K}}}{H\sqrt{1+4(\|z^{0,0}\|+(t+1)R_{0}B)^{2}}},\|z^{t,K}-z^{t,0}\|\leq R_{t}B\big{|}\bar{\mathcal{E}}_{t}}\right)\\ =&\mathbb{P}\big{(}\|z^{t,K}-z^{t,0}\|\leq R_{t}B\big{|}\bar{\mathcal{E}}_{t}\big{)}\\ -&\mathbb{P}\left({\max_{\hat{z}\in W_{r+R_{t}B}(z^{t,0})}\frac{\mathcal{L}(x^{t,K},\hat{y})-\mathcal{L}(\hat{x},y^{t,K})}{r}>\frac{R_{t}/\log{\log{K}}}{H\sqrt{1+4(\|z^{0,0}\|+(t+1)R_{0}B)^{2}}},\|z^{t,K}-z^{t,0}\|\leq R_{t}B\big{|}\bar{\mathcal{E}}_{t}}\right)\\ \geq&\mathbb{P}\left({\|z^{t,K}-z^{t,0}\|\leq R_{t}B}\right)-\mathbb{P}\left({\max_{\hat{z}\in W_{r+R_{t}B}(z^{t,0})}\frac{\mathcal{L}(x^{t,K},\hat{y})-\mathcal{L}(\hat{x},y^{t,K})}{r}>\frac{R_{t}/\log{\log{K}}}{H\sqrt{1+4(\|z^{0,0}\|+(t+1)R_{0}B)^{2}}}\big{|}\bar{\mathcal{E}}_{t}}\right)\ .\end{split} (16)

Next, we set C1=3+21pC_{1}=3+\sqrt{\frac{2}{1-p}}, B=2TC1δB=\frac{2TC_{1}}{\delta} and r=RtBr=R_{t}B. Then there exists an optimal solution zz_{*} such that zWr+RtB(zt,0)z_{*}\in W_{r+R_{t}B}(z^{t,0}), because r+RtB=2RtB>Rtr+R_{t}B=2R_{t}B>R_{t}. We can then upper bound the second term on the right-hand side of (16) by Markov's inequality:

(maxz^Wr+RtB(zt,0)(xt,K,y^)(x^,yt,K)r>Rt/loglogKH1+4(z0,0+(t+1)R0B)2|¯t)HloglogK1+4(z0,0+(t+1)R0B)2rRt𝔼[maxz^Wr+RtB(zt,0)(xt,K,y^)(x^,yt,K)]HloglogK1+4(z0,0+(t+1)R0B)2rRt354τKmaxzWr+RtB(zt,0)zt,0z2HloglogK1+4(z0,0+(t+1)R0B)2rRt354τK(r+RtB)2HloglogK(1+2(z0,0+(t+1)R0B))rRt354τK(r+RtB)2,\displaystyle\begin{split}&\ \mathbb{P}\left({\max_{\hat{z}\in W_{r+R_{t}B}(z^{t,0})}\frac{\mathcal{L}(x^{t,K},\hat{y})-\mathcal{L}(\hat{x},y^{t,K})}{r}>\frac{R_{t}/\log{\log{K}}}{H\sqrt{1+4(\|z^{0,0}\|+(t+1)R_{0}B)^{2}}}\big{|}\bar{\mathcal{E}}_{t}}\right)\\ &\ \leq\frac{H\log{\log{K}}\sqrt{1+4(\|z^{0,0}\|+(t+1)R_{0}B)^{2}}}{rR_{t}}\mathbb{E}\left[\max_{\hat{z}\in W_{r+R_{t}B}(z^{t,0})}\mathcal{L}(x^{t,K},\hat{y})-\mathcal{L}(\hat{x},y^{t,K})\right]\\ &\ \leq\frac{H\log{\log{K}}\sqrt{1+4(\|z^{0,0}\|+(t+1)R_{0}B)^{2}}}{rR_{t}}\frac{35}{4\tau K}\max_{z\in W_{r+R_{t}B}(z^{t,0})}\|z^{t,0}-z\|^{2}\\ &\ \leq\frac{H\log{\log{K}}\sqrt{1+4(\|z^{0,0}\|+(t+1)R_{0}B)^{2}}}{rR_{t}}\frac{35}{4\tau K}(r+R_{t}B)^{2}\\ &\ \leq\frac{H\log{\log{K}}(1+2(\|z^{0,0}\|+(t+1)R_{0}B))}{rR_{t}}\frac{35}{4\tau K}(r+R_{t}B)^{2}\ ,\end{split} (17)

where the second inequality follows from Lemma 3, by noticing zWr+RtB(zt,0)z_{*}\in W_{r+R_{t}B}(z^{t,0}), and the last inequality uses 1+x1+x\sqrt{1+x}\leq 1+\sqrt{x}. Combining Equation (6), (16) and (17), we obtain

(t+1|¯t)1(3+21p)1B35H4τloglogKK(r+RtB)2rRt(1+2(z0,0+(t+1)R0B))\mathbb{P}(\mathcal{E}_{t+1}|\bar{\mathcal{E}}_{t})\geq 1-\left({3+\sqrt{\frac{2}{1-p}}}\right)\frac{1}{B}-\frac{35H}{4\tau}\frac{\log{\log{K}}}{K}\frac{(r+R_{t}B)^{2}}{rR_{t}}\left({1+2(\|z^{0,0}\|+(t+1)R_{0}B)}\right)

Furthermore, by recalling r=RtBr=R_{t}B, we obtain

(RTR02T)(t=1Tt)=(1)t=2T(t|¯t1) 1(3+21p)TB35H4τloglogKK(B+B)2B(T+2Tz0,0+T(T+1)R0B)\displaystyle\begin{split}&\mathbb{P}\left({R_{T}\leq\frac{R_{0}}{2^{T}}}\right)\\ \geq&\ \mathbb{P}(\cap_{t=1}^{T}\mathcal{E}_{t})=\mathbb{P}(\mathcal{E}_{1})\prod_{t=2}^{T}\mathbb{P}(\mathcal{E}_{t}|\bar{\mathcal{E}}_{t-1})\\ \geq&\ 1-\left({3+\sqrt{\frac{2}{1-p}}}\right)\frac{T}{B}\ -\frac{35H}{4\tau}\frac{\log{\log{K}}}{K}\frac{(B+B)^{2}}{B}\left({T+2T\|z^{0,0}\|+T(T+1)R_{0}B}\right)\\ \end{split}

where the last inequality comes from i(1xi)1ixi\prod_{i}(1-x_{i})\geq 1-\sum_{i}x_{i} for xi[0,1]x_{i}\in[0,1]. Thus we have

(RTR02T)1(3+21p)TB35HBτloglogKK(T+2Tz0,0+T(T+1)R0B)\mathbb{P}\left({R_{T}\leq\frac{R_{0}}{2^{T}}}\right)\geq 1-\left({3+\sqrt{\frac{2}{1-p}}}\right)\frac{T}{B}-\frac{35HB}{\tau}\frac{\log{\log{K}}}{K}\left({T+2T\|z^{0,0}\|+T(T+1)R_{0}B}\right)

Recall τ=p2L\tau=\frac{\sqrt{p}}{2L} and B=2TC1δB=\frac{2TC_{1}}{\delta}, thus

(RTR02T) 1δ2140HLploglogKKTδ(T+2Tz0,0+T(T+1)R02TC1δ) 1δ2140HLploglogKKTδT3δ(1+2z0,0+4C1R0) 1δ2HLloglogKKT4δ2140p(1+2z0,0+4C1R0).\displaystyle\begin{split}\mathbb{P}\left({R_{T}\leq\frac{R_{0}}{2^{T}}}\right)\geq&\ 1-\frac{\delta}{2}-\frac{140HL}{\sqrt{p}}\frac{\log{\log{K}}}{K}\frac{T}{\delta}\left({T+2T\|z^{0,0}\|+T(T+1)R_{0}\frac{2TC_{1}}{\delta}}\right)\\ \geq&\ 1-\frac{\delta}{2}-\frac{140HL}{\sqrt{p}}\frac{\log{\log{K}}}{K}\frac{T}{\delta}\frac{T^{3}}{\delta}(1+2\|z^{0,0}\|+4C_{1}R_{0})\\ \geq&\ 1-\frac{\delta}{2}-HL\frac{\log{\log{K}}}{K}\frac{T^{4}}{\delta^{2}}\frac{140}{\sqrt{p}}(1+2\|z^{0,0}\|+4C_{1}R_{0})\ .\end{split}

Denote C2=C2(z0,0)=140(1+2z0,0+4C1R0)C_{2}=C_{2}(z^{0,0})=140(1+2\|z^{0,0}\|+4C_{1}R_{0}). Then if

K2C2pHLT4δ3log(2C2pHLT4δ3),K\geq\frac{2C_{2}}{\sqrt{p}}HL\frac{T^{4}}{\delta^{3}}\log{\left({\frac{2C_{2}}{\sqrt{p}}HL\frac{T^{4}}{\delta^{3}}}\right)},

we have

HLloglogKKT4δ2140p(1+2z0,0+4C1R0)δ2,HL\frac{\log{\log{K}}}{K}\frac{T^{4}}{\delta^{2}}\frac{140}{\sqrt{p}}(1+2\|z^{0,0}\|+4C_{1}R_{0})\leq\frac{\delta}{2}\ ,

for the same reason as Equation (12). Thus, we obtain

(RTR02T)1δ2HLloglogKKT4δ2140p(1+2z0,0+4C1R0)1δ.\mathbb{P}\left({R_{T}\leq\frac{R_{0}}{2^{T}}}\right)\geq 1-\frac{\delta}{2}-HL\frac{\log{\log{K}}}{K}\frac{T^{4}}{\delta^{2}}\frac{140}{\sqrt{p}}(1+2\|z^{0,0}\|+4C_{1}R_{0})\geq 1-\delta\ .

In summary, choosing T=log2R0ϵT=\left\lceil\log_{2}\frac{R_{0}}{\epsilon}\right\rceil and Kmax{2C2pHLT4δ3log(2C2pHLT4δ3),2000}K\geq\max\left\{\frac{2C_{2}}{\sqrt{p}}HL\frac{T^{4}}{\delta^{3}}\log{\left({\frac{2C_{2}}{\sqrt{p}}HL\frac{T^{4}}{\delta^{3}}}\right)},2000\right\}, the total number of iterations to reach ϵ\epsilon accuracy with probability at least 1δ1-\delta is TKTK, which is of order

𝒪~(1pL1/[H(1+z0,0+R0)]1δ3(logR0ϵ)5).\widetilde{\mathcal{O}}\left({\frac{1}{\sqrt{p}}\frac{L}{1/[H(1+\|z^{0,0}\|+R_{0})]}\frac{1}{\delta^{3}}\left({\log\frac{R_{0}}{\epsilon}}\right)^{5}}\right)\ . ∎

4 Lower Bound

In this section, we establish a complexity lower bound for stochastic first-order algorithms (with variance reduction) solving sharp primal-dual problems. Together with the results shown in Section 3, this demonstrates that RsEGM has a nearly optimal convergence rate (up to log terms) among a large class of stochastic primal-dual methods.

Since we study the lower bound in this section, we can assume \mathcal{L} is differentiable everywhere, so \nabla\mathcal{L} is well defined. Recall that deterministic first-order primal-dual methods utilize gradient information to update the iterates. Thus, the iterates of first-order methods always stay in the subspace spanned by the past gradients, and we call this behavior first-order span-respecting (see, for example, [7]), defined as below:

Definition 2.

A deterministic primal-dual algorithm is called first-order span-respecting if its iterates satisfy

xtx0+Span{x(xi,yj):i,j{1,,t1}}yty0+Span{y(xi,yj):i{1,,t},j{1,,t1}}\displaystyle\begin{split}&\ x^{t}\in x^{0}+\text{Span}\left\{\nabla_{x}\mathcal{L}(x^{i},y^{j}):\;\forall i,j\in\{1,...,t-1\}\right\}\\ &\ y^{t}\in y^{0}+\text{Span}\left\{\nabla_{y}\mathcal{L}(x^{i},y^{j}):\;\forall i\in\{1,...,t\},\forall j\in\{1,...,t-1\}\right\}\end{split}

Many classic primal-dual first-order methods, such as GDA, AGDA and EGM, are first-order span-respecting when applied to unconstrained problems.

In this paper, we study stochastic algorithms with variance reduction. As a result, we consider a general class of stochastic algorithms where the span includes both the deterministic gradient (for variance reduction sake) and the stochastic oracle FξF_{\xi}. More formally, we introduce stochastic span-respecting algorithms, defined as below:

Definition 3.

A randomized primal-dual algorithm is called first-order stochastic span-respecting with respect to a stochastic oracle ξ\xi if the iterates in m\mathbb{R}^{m} satisfy:

xtx0+Span{x(xi,yj),xξi,j(xi,yj):i,j{1,,t1}}yty0+Span{y(xi,yj),yξi,j(xi,yj):i{1,,t},j{1,,t1}}\displaystyle\begin{split}&\ x^{t}\in x^{0}+\text{Span}\left\{\nabla_{x}\mathcal{L}(x^{i},y^{j}),\nabla_{x}^{\xi^{i,j}}\mathcal{L}(x^{i},y^{j}):\;\forall i,j\in\{1,...,t-1\}\right\}\\ &\ y^{t}\in y^{0}+\text{Span}\left\{\nabla_{y}\mathcal{L}(x^{i},y^{j}),\nabla_{y}^{\xi^{i,j}}\mathcal{L}(x^{i},y^{j}):\;\forall i\in\{1,...,t\},\forall j\in\{1,...,t-1\}\right\}\end{split}

where ξi,j\xi^{i,j} are random variables.

Remark 5.

The above definition consists of the deterministic gradient together with the stochastic gradient estimator; the reason is to include variance reduction methods in the algorithm class. This algorithm class is clearly larger than one with access to only deterministic or only stochastic gradients. As a result, the lower bound is generally higher than that for purely stochastic algorithms.

Remark 6.

If \mathcal{L} is bilinear, then with appropriate indexing of iterates, sEGM and RsEGM are first-order stochastic span-respecting.

Theorem 4.

Consider any iteration tt\in\mathbb{N} and parameter values L>2α>0L>\sqrt{2}\alpha>0. There exists an α\alpha-sharp primal-dual problem that satisfies Assumption 1 and an LL-Lipschitz (in expectation) stochastic oracle FξF_{\xi} that satisfies Assumption 2, such that the iterates ztz^{t} of any stochastic span-respecting algorithm satisfy

dist(zt,𝒵)12(1α2L)tdist(z0,𝒵).\text{dist}(z^{t},\mathcal{Z}^{*})\geq\frac{1}{\sqrt{2}}\left({1-\frac{\alpha\sqrt{2}}{L}}\right)^{t}\text{dist}(z^{0},\mathcal{Z}^{*})\ .
Remark 7.

Theorem 4 implies the following lower complexity bound for a stochastic first-order method to achieve an ϵ\epsilon-accuracy solution:

Ω(LαlogR0ϵ).\Omega\left({\frac{L}{\alpha}\log{\frac{R_{0}}{\epsilon}}}\right)\ .
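To see how the iteration bound follows from the geometric rate in Theorem 4, one can invert the rate; the short derivation below is ours, assuming \sqrt{2}\alpha/L\leq 1/2 and R_{0}\geq\sqrt{2}\epsilon:

```latex
\frac{1}{\sqrt{2}}\left(1-\frac{\sqrt{2}\alpha}{L}\right)^{t}R_{0}\leq\epsilon
\;\Longleftrightarrow\;
t\geq\frac{\log\frac{R_{0}}{\sqrt{2}\,\epsilon}}{-\log\left(1-\frac{\sqrt{2}\alpha}{L}\right)}
\;\geq\;\frac{L}{2\sqrt{2}\,\alpha}\log\frac{R_{0}}{\sqrt{2}\,\epsilon}
=\Omega\left(\frac{L}{\alpha}\log\frac{R_{0}}{\epsilon}\right),
```

where the last inequality uses -\log(1-x)\leq 2x for x=\sqrt{2}\alpha/L\in(0,1/2].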
Proof of Theorem 4.

Set m=2tm=2t. We consider a primal-dual problem of the form

minx2mmaxy2m(x,y)=yTAxbTy,\min_{x\in\mathbb{R}^{2m}}\max_{y\in\mathbb{R}^{2m}}\mathcal{L}(x,y)=y^{T}Ax-b^{T}y\ , (18)

The matrix A2m×2mA\in\mathbb{R}^{2m\times 2m} and vector b2mb\in\mathbb{R}^{2m} have the structure

A=(A0A0),b=(b0b0),\displaystyle A=\begin{pmatrix}A_{0}&\ \\ \ &A_{0}\end{pmatrix},\;b=\begin{pmatrix}b_{0}\\ b_{0}\end{pmatrix}\ ,

where A0m×mA_{0}\in\mathbb{R}^{m\times m} and b0mb_{0}\in\mathbb{R}^{m} are determined later. Rewriting the problem using x=(x1,x2)x=(x_{1},x_{2}) and y=(y1,y2)y=(y_{1},y_{2}), we obtain:

minx2mmaxy2m(x,y)=1(x,y)+2(x,y),\min_{x\in\mathbb{R}^{2m}}\max_{y\in\mathbb{R}^{2m}}\mathcal{L}(x,y)=\mathcal{L}_{1}(x,y)+\mathcal{L}_{2}(x,y)\ ,

where 1(x,y)=y1TA0x1b0Ty1\mathcal{L}_{1}(x,y)=y_{1}^{T}A_{0}x_{1}-b_{0}^{T}y_{1}, 2(x,y)=y2TA0x2b0Ty2\mathcal{L}_{2}(x,y)=y_{2}^{T}A_{0}x_{2}-b_{0}^{T}y_{2} .

For any given Lipschitz parameter LL and sharpness parameter α\alpha such that L>2αL>\sqrt{2}\alpha, we denote Q=L22α2Q=\frac{L^{2}}{2\alpha^{2}}, κ=Q+3Q+1\kappa=\frac{\sqrt{Q}+3}{\sqrt{Q}+1}, and

G0=(211211211211κ).\displaystyle G_{0}=\begin{pmatrix}2&-1&\ &\ &\ &\ \\ -1&2&-1&\ &\ &\ \\ \ &-1&2&-1&\ &\ \\ \ &\ &\ddots&\ddots&\ddots&\ \\ \ &\ &\ &-1&2&-1\\ \ &\ &\ &\ &-1&\kappa\end{pmatrix}\ .

Note that it holds for any κ3\kappa\leq 3 that 0G04I.0\preceq G_{0}\preceq 4I. Denote

H0=L2/2α24G0+α2I,h0=L2/2α24e1,H_{0}=\frac{L^{2}/2-\alpha^{2}}{4}G_{0}+\alpha^{2}I,\;h_{0}=\frac{L^{2}/2-\alpha^{2}}{4}e_{1}\ ,

where e1e_{1} is the first standard unit vector. One can check that the solution of the linear system H0u=h0H_{0}u=h_{0} is unique and given by

umsuch thatui=qi,i=1,2,,m,u^{*}\in\mathbb{R}^{m}\;\text{such that}\;u_{i}^{*}=q^{i},\;i=1,2,...,m\ ,

with q=Q1Q+1q=\frac{\sqrt{Q}-1}{\sqrt{Q}+1}. Furthermore, it holds that

α2IH0L22I.\alpha^{2}I\preceq H_{0}\preceq\frac{L^{2}}{2}I\ .
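The claim that u^{*}_{i}=q^{i} solves H_{0}u=h_{0} can be checked numerically. The pure-Python sketch below (the parameter values L=4, \alpha=1, m=8 and all identifiers are ours, chosen only for illustration) applies the tridiagonal H_{0} to the claimed solution and evaluates the residual H_{0}u^{*}-h_{0}:

```python
import math

# Illustrative parameters (our choice, not from the paper): L > sqrt(2)*alpha.
L, alpha, m = 4.0, 1.0, 8
Q = L**2 / (2 * alpha**2)
kappa = (math.sqrt(Q) + 3) / (math.sqrt(Q) + 1)
q = (math.sqrt(Q) - 1) / (math.sqrt(Q) + 1)
c = (L**2 / 2 - alpha**2) / 4  # the scalar multiplying G0 in H0

# H0 = c*G0 + alpha^2*I, with G0 tridiagonal: 2 on the diagonal (kappa in
# the last entry) and -1 off the diagonal; h0 = c*e1.
def H0_times(u):
    out = []
    for i in range(m):
        diag = kappa if i == m - 1 else 2.0
        v = c * diag * u[i] + alpha**2 * u[i]
        if i > 0:
            v -= c * u[i - 1]
        if i < m - 1:
            v -= c * u[i + 1]
        out.append(v)
    return out

u_star = [q ** (i + 1) for i in range(m)]  # claimed solution u*_i = q^i
h0 = [c] + [0.0] * (m - 1)                 # h0 = c*e1
residual = max(abs(a - b) for a, b in zip(H0_times(u_star), h0))
```

The residual vanishes up to floating-point error, and one can also confirm 0<q<1 and \kappa\leq 3 for any L>\sqrt{2}\alpha.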

Choose A_{0}=H_{0}^{1/2} and b_{0}=H_{0}^{-1/2}h_{0}, and consider (18). The optimal solution to (18) is characterized by Ax=b,\;A^{T}y=0\ , which is equivalent to

A0x1=b0,A0Ty1=0,A0x2=b0,A0Ty2=0,A_{0}x_{1}=b_{0},\;A_{0}^{T}y_{1}=0,\;A_{0}x_{2}=b_{0},\;A_{0}^{T}y_{2}=0\ ,

and thus

H0x1=h0,y1=0,H0x2=h0,y2=0.H_{0}x_{1}=h_{0},\;y_{1}=0,\;H_{0}x_{2}=h_{0},\;y_{2}=0\ .

The optimal solution to (18) is given by z=(x1,x2,y1,y2)=(u,u,0,0)z^{*}=(x_{1}^{*},x_{2}^{*},y_{1}^{*},y_{2}^{*})=(u^{*},u^{*},0,0). Moreover, the primal-dual problem (18) is α\alpha-sharp, because

αImA0=H01/2,αI2mA=(A0A0).\alpha I_{m}\preceq A_{0}=H_{0}^{1/2}\ ,\;\;\alpha I_{2m}\preceq A=\begin{pmatrix}A_{0}&\ \\ \ &A_{0}\end{pmatrix}\ .

Without loss of generality we assume x0=y0=0x^{0}=y^{0}=0. Consider the stochastic oracle

\displaystyle\begin{split}F_{\xi}(z)&\ =2(\nabla_{x}\mathcal{L}_{\xi}(x,y),-\nabla_{y}\mathcal{L}_{\xi}(x,y)),\\ &\ =\begin{cases}2(A_{0}^{T}y_{1},0,A_{0}x_{1}-b_{0},0),\;\text{with probability }\frac{1}{2}\\ 2(0,A_{0}^{T}y_{2},0,A_{0}x_{2}-b_{0}),\;\text{with probability }\frac{1}{2}\end{cases}.\end{split} (19)

Then we know that F_{\xi}(z) is unbiased and \mathbb{E}[\|F_{\xi}(u)-F_{\xi}(v)\|^{2}]\leq L^{2}\|u-v\|^{2}; thus the stochastic gradient oracle satisfies Assumption 2.

By the definition of span-respecting randomized algorithms with the above oracle, we have:

x1tx10+Span{A0Ty10,,A0Ty1t1}x2tx20+Span{A0Ty20,,A0Ty2t1}y1ty10+Span{A0x10b0,,A0x1t1b0}y2ty20+Span{A0x20b0,,A0x2t1b0}\displaystyle\begin{split}&\ x_{1}^{t}\in x_{1}^{0}+\text{Span}\left\{A_{0}^{T}y_{1}^{0},...,A_{0}^{T}y_{1}^{t-1}\right\}\\ &\ x_{2}^{t}\in x_{2}^{0}+\text{Span}\left\{A_{0}^{T}y_{2}^{0},...,A_{0}^{T}y_{2}^{t-1}\right\}\\ &\ y_{1}^{t}\in y_{1}^{0}+\text{Span}\left\{A_{0}x_{1}^{0}-b_{0},...,A_{0}x_{1}^{t-1}-b_{0}\right\}\\ &\ y_{2}^{t}\in y_{2}^{0}+\text{Span}\left\{A_{0}x_{2}^{0}-b_{0},...,A_{0}x_{2}^{t-1}-b_{0}\right\}\end{split}

thus it holds that

\displaystyle\begin{split}x_{1}^{t}&\ \in\text{Span}\left\{A_{0}^{T}(A_{0}x_{1}^{0}-b_{0}),...,A_{0}^{T}(A_{0}x_{1}^{t-1}-b_{0})\right\}=\text{Span}\left\{H_{0}x_{1}^{0}-h_{0},...,H_{0}x_{1}^{t-1}-h_{0}\right\}\subseteq\text{Span}\{e_{1},...,e_{t}\}\\ x_{2}^{t}&\ \in\text{Span}\left\{A_{0}^{T}(A_{0}x_{2}^{0}-b_{0}),...,A_{0}^{T}(A_{0}x_{2}^{t-1}-b_{0})\right\}=\text{Span}\left\{H_{0}x_{2}^{0}-h_{0},...,H_{0}x_{2}^{t-1}-h_{0}\right\}\subseteq\text{Span}\{e_{1},...,e_{t}\}\end{split}

Therefore, since m=2t, we have

x1tu2x10u2k=t+1mq2kk=1mq2k=q2t1q2m2t1q2m12q2t,\displaystyle\begin{split}\frac{\|x_{1}^{t}-u^{*}\|^{2}}{\|x_{1}^{0}-u^{*}\|^{2}}\geq\frac{\sum_{k=t+1}^{m}q^{2k}}{\sum_{k=1}^{m}q^{2k}}=q^{2t}\frac{1-q^{2m-2t}}{1-q^{2m}}\geq\frac{1}{2}q^{2t}\ ,\end{split}

where the last inequality uses m=2t and 0<q<1, so that \frac{1-q^{2m-2t}}{1-q^{2m}}=\frac{1-q^{2t}}{1-q^{4t}}=\frac{1}{1+q^{2t}}\geq\frac{1}{2}.

A similar bound holds for x_{2}:

\frac{\|x_{2}^{t}-u^{*}\|^{2}}{\|x_{2}^{0}-u^{*}\|^{2}}\geq\frac{1}{2}q^{2t}\ . (20)

To sum up, for any t\in\mathbb{N}, the instance above with m=2t satisfies

dist(zt,𝒵)2dist(x1t,u)2+dist(x2t,u)212q2tdist(x10,u)2+12q2tdist(x20,u)2=12q2tdist(x0,(u,u))2=12(1α2L)2tdist(z0,𝒵)2,\displaystyle\begin{split}\text{dist}(z^{t},\mathcal{Z}^{*})^{2}&\ \geq\text{dist}(x_{1}^{t},u^{*})^{2}+\text{dist}(x_{2}^{t},u^{*})^{2}\\ &\ \geq\frac{1}{2}q^{2t}\text{dist}(x_{1}^{0},u^{*})^{2}+\frac{1}{2}q^{2t}\text{dist}(x_{2}^{0},u^{*})^{2}\\ &\ =\frac{1}{2}q^{2t}\text{dist}\left({x^{0},(u^{*},u^{*})}\right)^{2}\\ &\ =\frac{1}{2}\left({1-\frac{\alpha\sqrt{2}}{L}}\right)^{2t}\text{dist}(z^{0},\mathcal{Z}^{*})^{2}\ ,\end{split} (21)

where the second inequality follows from Equation (20) and the analogous bound for x_{1}, and the equality in the third line uses y_{1}^{*}=y_{2}^{*}=y_{1}^{0}=y_{2}^{0}=0. Taking the square root, we finally reach

dist(zt,𝒵)12(1α2L)tdist(z0,𝒵).\text{dist}(z^{t},\mathcal{Z}^{*})\geq\frac{1}{\sqrt{2}}\left({1-\frac{\alpha\sqrt{2}}{L}}\right)^{t}\text{dist}(z^{0},\mathcal{Z}^{*})\ . (22)

Remark 8.

RsEGM with the stochastic gradient oracle F_{\xi} given in the proof (see (19)) has an upper bound of \widetilde{\mathcal{O}}\left({\frac{\|A\|_{2}}{\sigma^{+}_{min}(A)}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{3}}\right). Compared with the above lower bound for unconstrained bilinear problems, \Omega\left({\frac{\|A\|_{2}}{\sigma^{+}_{min}(A)}\log{\frac{R_{0}}{\epsilon}}}\right), we conclude that RsEGM is tight up to logarithmic factors.

5 Stochastic Oracles

In this section, we propose four stochastic oracles and apply the main results to compute their total flop counts for obtaining an approximate solution to unconstrained bilinear problems and standard-form LP.

5.1 Unconstrained Bilinear Problems

We consider unconstrained bilinear problems

minxnmaxymyTAx+cTxbTy.\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}y^{T}Ax+c^{T}x-b^{T}y\ .

Without loss of generality, we drop the linear terms c^{T}x-b^{T}y, since they can be removed by shifting the origin. We thus consider unconstrained bilinear problems of the form

minxnmaxymyTAx,\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}y^{T}Ax\ ,

where A\in\mathbb{R}^{m\times n}. For the sake of comparison, we assume \text{nnz}(A)\geq m+n.

Converting to (1) and calculating F(z)F(z), we have

F(z)=F(x,y)=(ATyAx),g1(x)=g2(y)=0.F(z)=F(x,y)=\left(\begin{array}[]{c}A^{T}y\\ -Ax\end{array}\right),\;g_{1}(x)=g_{2}(y)=0\ .

First, we restate the total flop count of the optimal deterministic first-order primal-dual method (achieved by restarted primal-dual algorithms [7]). To obtain an \epsilon-accurate solution, the number of primal-dual iterations is \mathcal{O}\left({\frac{\|A\|_{2}}{\alpha}\log{\frac{R_{0}}{\epsilon}}}\right); since one primal-dual iteration costs O(\text{nnz}(A)) flops, the total flop count is

𝒪(nnz(A)A2αlogR0ϵ).\mathcal{O}\left({\text{nnz}(A)\frac{\|A\|_{2}}{\alpha}\log{\frac{R_{0}}{\epsilon}}}\right)\ .

To ease the comparison of the flop counts, we state the following simple fact ([50]):

max{Ai2,Aj2}A2AFmax{m,n}max{Ai2,Aj2}max{m,n}A2\max{\{\|A_{i\cdot}\|_{2},\|A_{\cdot j}\|_{2}\}}\leq\|A\|_{2}\leq\|A\|_{F}\leq\sqrt{\max{\{m,n\}}}\cdot\max{\{\|A_{i\cdot}\|_{2},\|A_{\cdot j}\|_{2}\}}\leq\sqrt{\max{\{m,n\}}}\cdot\|A\|_{2} (23)
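The chain (23) can be checked on a matrix whose spectral norm is known exactly. The sketch below (pure Python; the example matrix and all names are ours, for illustration) uses a diagonal-like matrix so that \|A\|_{2} equals its largest entry:

```python
import math

# A is diagonal-like, so ||A||_2 = 3 (its largest singular value) exactly.
A = [[3.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
m, n = len(A), len(A[0])

row_norms = [math.sqrt(sum(v * v for v in row)) for row in A]
col_norms = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(n)]
max_rc = max(max(row_norms), max(col_norms))     # max row/column 2-norm
frob = math.sqrt(sum(v * v for row in A for v in row))  # ||A||_F
spec = 3.0                                       # ||A||_2 for this matrix

# max_rc <= ||A||_2 <= ||A||_F <= sqrt(max(m, n)) * max_rc
chain_holds = (max_rc <= spec <= frob <= math.sqrt(max(m, n)) * max_rc)
```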

Next, we describe four different stochastic oracles that satisfy Assumption 2, discuss their flop counts, and show how they improve over deterministic restarted methods in different regimes.

The first two oracles are based on row/column sampling, i.e., the stochastic oracle is constructed from one row and one column of A:

Fξ(z)=(1riAiyi1cjAjxj),(ξ=(i,j))=ricj,F_{\xi}(z)=\left(\begin{array}[]{c}\frac{1}{r_{i}}A_{i\cdot}y_{i}\\ -\frac{1}{c_{j}}A_{\cdot j}x_{j}\end{array}\right),\;\mathbb{P}(\xi=(i,j))=r_{i}c_{j}\ ,

where c_{j}\geq 0 for j=1,...,n and r_{i}\geq 0 for i=1,...,m are parameters of the sampling scheme with \sum_{j=1}^{n}c_{j}=\sum_{i=1}^{m}r_{i}=1. Notice that

𝔼[Fξ(z)2]=i=1mri1riAiyi2+j=1ncj1cjAjxj2=i=1m1riyi2Ai2+j=1n1cjxj2Aj2maxi,j{Ai2ri,Aj2cj}z2.\displaystyle\begin{split}\mathbb{E}\left[\|F_{\xi}(z)\|^{2}\right]&\ =\sum_{i=1}^{m}r_{i}\|\frac{1}{r_{i}}A_{i\cdot}y_{i}\|^{2}+\sum_{j=1}^{n}c_{j}\|-\frac{1}{c_{j}}A_{\cdot j}x_{j}\|^{2}\\ &\ =\sum_{i=1}^{m}\frac{1}{r_{i}}y_{i}^{2}\|A_{i\cdot}\|^{2}+\sum_{j=1}^{n}\frac{1}{c_{j}}x_{j}^{2}\|A_{\cdot j}\|^{2}\\ &\ \leq\max_{i,j}{\left\{\frac{\|A_{i\cdot}\|^{2}}{r_{i}},\frac{\|A_{\cdot j}\|^{2}}{c_{j}}\right\}}\|z\|^{2}\ .\end{split}

Thus the Lipschitz constant of the stochastic oracle F_{\xi} is bounded by

Lmaxi,j{Ai2ri,Aj2cj}.L\leq\sqrt{\max_{i,j}{\left\{\frac{\|A_{i\cdot}\|^{2}}{r_{i}},\frac{\|A_{\cdot j}\|^{2}}{c_{j}}\right\}}}\ . (24)
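A quick way to see that this row/column oracle is unbiased is to take the exact expectation by enumerating all (i,j) pairs. The pure-Python sketch below (the tiny matrix, vectors, and all identifiers are ours, chosen only for illustration; the weights are uniform here, while Oracle II would use squared row/column norms) verifies \mathbb{E}[F_{\xi}(z)]=(A^{T}y,-Ax):

```python
# Tiny example data (ours, for illustration).
A = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]   # m x n with m = 2, n = 3
m, n = len(A), len(A[0])
x = [0.3, -0.7, 1.2]
y = [0.5, -1.5]

r = [1.0 / m] * m         # row sampling weights, sum to 1
c = [1.0 / n] * n         # column sampling weights, sum to 1

def F_xi(i, j):
    """One draw of the oracle: ((1/r_i) A_i. y_i, -(1/c_j) A_.j x_j)."""
    gx = [A[i][k] * y[i] / r[i] for k in range(n)]
    gy = [-A[k][j] * x[j] / c[j] for k in range(m)]
    return gx, gy

# Exact expectation, enumerating P(xi = (i, j)) = r_i * c_j.
Ex = [0.0] * n
Ey = [0.0] * m
for i in range(m):
    for j in range(n):
        gx, gy = F_xi(i, j)
        for k in range(n):
            Ex[k] += r[i] * c[j] * gx[k]
        for k in range(m):
            Ey[k] += r[i] * c[j] * gy[k]

ATy = [sum(A[i][k] * y[i] for i in range(m)) for k in range(n)]
minusAx = [-sum(A[k][j] * x[j] for j in range(n)) for k in range(m)]
err = max(max(abs(a - b) for a, b in zip(Ex, ATy)),
          max(abs(a - b) for a, b in zip(Ey, minusAx)))
```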

Oracle I: Uniform row-column sampling. In the first oracle, we choose rows and columns uniformly at random, i.e.,

Fξ(z)=(1riAiyi1cjAjxj),(ξ=(i,j))=ricj,ri=1m,cj=1n.F_{\xi}(z)=\left(\begin{array}[]{c}\frac{1}{r_{i}}A_{i\cdot}y_{i}\\ -\frac{1}{c_{j}}A_{\cdot j}x_{j}\end{array}\right),\;\mathbb{P}(\xi=(i,j))=r_{i}c_{j},\;r_{i}=\frac{1}{m},\;c_{j}=\frac{1}{n}\ .

Plugging in the choice of rir_{i} and cjc_{j} into Equation (24), the Lipschitz constant of FξF_{\xi} is upper bounded by

L1max{mAi22,nAj22}max{m,n}max{Ai2,Aj2}.L_{1}\leq\sqrt{\max{\{m\|A_{i\cdot}\|_{2}^{2}},n\|A_{\cdot j}\|_{2}^{2}\}}\leq\sqrt{\max{\{m,n\}}}\cdot\max{\left\{\|A_{i\cdot}\|_{2},\|A_{\cdot j}\|_{2}\right\}}\ . (25)

Combining Equation (4) with (25) and setting p=\frac{m+n}{\text{nnz}(A)}, the total flop count of RsEGM to find an \epsilon-optimal solution using Oracle I is upper bounded by

𝒪~(nnz(A)logR0ϵ+nnz(A)(m+n)max{m,n}max{Ai2,Aj2}α(logR0ϵ)3)\widetilde{\mathcal{O}}\left({\text{nnz}(A)\log{\frac{R_{0}}{\epsilon}}+\sqrt{\text{nnz}(A)(m+n)}\frac{\sqrt{\max{\{m,n\}}}\max{\{\|A_{i\cdot}\|_{2},\|A_{\cdot j}\|_{2}}\}}{\alpha}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{3}}\right)

Up to logarithmic factors, RsEGM with the uniformly sampled stochastic oracle is no worse than the deterministic method when A is dense, according to Equation (23). It further improves on the deterministic method when A is dense and \max{\{\|A_{i\cdot}\|_{2},\|A_{\cdot j}\|_{2}\}}\ll\|A\|_{2}.

Oracle II: Importance row-column sampling. In the second oracle, we use importance sampling to set the probability to select rows and columns:

Fξ(z)=(1riAiyi1cjAjxj),(ξ=(i,j))=ricj,ri=Ai22AF2,cj=Aj22AF2.F_{\xi}(z)=\left(\begin{array}[]{c}\frac{1}{r_{i}}A_{i\cdot}y_{i}\\ -\frac{1}{c_{j}}A_{\cdot j}x_{j}\end{array}\right),\;\mathbb{P}(\xi=(i,j))=r_{i}c_{j},\;r_{i}=\frac{\|A_{i\cdot}\|_{2}^{2}}{\|A\|_{F}^{2}},\;c_{j}=\frac{\|A_{\cdot j}\|_{2}^{2}}{\|A\|_{F}^{2}}\ .

Plugging in the choice of rir_{i} and cjc_{j} into Equation (24), the Lipschitz constant of FξF_{\xi} is upper bounded by

L2AF.L_{2}\leq\|A\|_{F}\ . (26)

Combining equation (4) with (26) and setting p=m+nnnz(A)p=\frac{m+n}{\text{nnz}(A)}, the total flop count of RsEGM with Oracle II to find an ϵ\epsilon-optimal solution is

𝒪~(nnz(A)logR0ϵ+nnz(A)(m+n)AFα(logR0ϵ)3).\widetilde{\mathcal{O}}\left({\text{nnz}(A)\log{\frac{R_{0}}{\epsilon}}+\sqrt{\text{nnz}(A)(m+n)}\frac{\|A\|_{F}}{\alpha}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{3}}\right)\ .

It follows from (23) that if A is dense, RsEGM with the importance-sampled stochastic oracle is no worse than the optimal deterministic method up to logarithmic factors. Furthermore, it improves on the deterministic method when the stable rank satisfies \frac{\|A\|_{F}^{2}}{\|A\|_{2}^{2}}\ll\frac{\text{nnz}(A)}{m+n}, which is the case for a dense low-rank matrix A.

In the above two oracles, the flop cost of computing the stochastic oracle is \mathcal{O}(m+n). This can be further improved with coordinate-based sampling, where the flop cost per iteration is \mathcal{O}(1) (see Appendix A.3 for an efficient implementation of a more generic coordinate oracle).

Oracle III: Coordinate gradient estimator [11]. The third oracle updates only one primal coordinate and one dual coordinate, so the cost per iteration can be as low as O(1). The idea follows [11]. More formally, we set

Fξ(z)=(Aixjxyixpixjxejx,Aiyjyxjyqiyjyeiy)T,F_{\xi}(z)=\left(\frac{A_{i^{x}j^{x}}y_{i^{x}}}{p_{i^{x}j^{x}}}e_{j^{x}},-\frac{A_{i^{y}j^{y}}x_{j^{y}}}{q_{i^{y}j^{y}}}e_{i^{y}}\right)^{T}\ , (27)

where i^{x},i^{y}\in\{1,...,m\}, j^{x},j^{y}\in\{1,...,n\}, and

\mathbb{P}((i^{x},j^{x})=(i,j))=p_{ij}=\frac{\|A_{i\cdot}\|_{1}^{2}}{\sum_{i}\|A_{i\cdot}\|_{1}^{2}}\frac{|A_{ij}|}{\|A_{i\cdot}\|_{1}},\;\mathbb{P}((i^{y},j^{y})=(i,j))=q_{ij}=\frac{\|A_{\cdot j}\|_{1}^{2}}{\sum_{j}\|A_{\cdot j}\|_{1}^{2}}\frac{|A_{ij}|}{\|A_{\cdot j}\|_{1}}\ .

It is easy to check that \frac{A_{i^{x}j^{x}}y_{i^{x}}}{p_{i^{x}j^{x}}}e_{j^{x}} and -\frac{A_{i^{y}j^{y}}x_{j^{y}}}{q_{i^{y}j^{y}}}e_{i^{y}} are unbiased estimators of A^{T}y and -Ax, respectively. Furthermore, one can show that the Lipschitz constant of Oracle III (in expectation) satisfies (see [11] for more details):

L3=max{iAi12,jAj12}m+n|A|2,L_{3}=\max\left\{\sqrt{\sum_{i}\|A_{i\cdot}\|_{1}^{2}},\sqrt{\sum_{j}\|A_{\cdot j}\|_{1}^{2}}\right\}\leq\sqrt{m+n}\left\|\big{|}A\big{|}\right\|_{2}\ , (28)

where |A|\big{|}A\big{|} is the matrix with entry-wise absolute value of AA. Thus Oracle III satisfies Assumption 2 with Lipschitz constant L3L_{3}.

Combining Equation (4) with (28) and setting p=\frac{1}{\text{nnz}(A)}, the total cost of RsEGM with Oracle III equals

𝒪~(nnz(A)logR0ϵ+nnz(A)max{iAi12,jAj12}α(logR0ϵ)3)\widetilde{\mathcal{O}}\left({\text{nnz}(A)\log{\frac{R_{0}}{\epsilon}}+\sqrt{\text{nnz}(A)}\frac{\max{\{\sqrt{\sum_{i}\|A_{i\cdot}\|_{1}^{2}},\sqrt{\sum_{j}\|A_{\cdot j}\|_{1}^{2}}}\}}{\alpha}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{3}}\right)

For an entry-wise non-negative dense matrix A, RsEGM with Oracle III improves on deterministic methods by a factor of \sqrt{\max{\{m,n\}}}. Unfortunately, there is no guaranteed improvement for a general matrix A with negative entries, due to the definition of the Lipschitz constant L_{3}.

Oracle IV: Coordinate gradient estimator with a new probability distribution. Inspired by Oracle III, we propose Oracle IV, which uses the same gradient estimator (27) but with different probabilities p_{ij} and q_{ij}:

((ix,jx)=(i,j))=pij=Aij2AF2,((iy,jy)=(i,j))=qij=Aij2AF2.\mathbb{P}((i_{x},j_{x})=(i,j))=p_{ij}=\frac{A_{ij}^{2}}{\|A\|_{F}^{2}},\;\mathbb{P}((i_{y},j_{y})=(i,j))=q_{ij}=\frac{A_{ij}^{2}}{\|A\|_{F}^{2}}\ .
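The unbiasedness of this coordinate estimator can again be checked by exact enumeration over all entries. The pure-Python sketch below (example data and all identifiers are ours, for illustration only) verifies that the expectation recovers (A^{T}y,-Ax):

```python
# Coordinate oracle with p_ij = q_ij = A_ij^2 / ||A||_F^2, as in Oracle IV.
A = [[1.0, -2.0],
     [0.5, 3.0]]
m, n = len(A), len(A[0])
x = [0.4, -1.1]
y = [0.7, 0.2]

frob2 = sum(A[i][j] ** 2 for i in range(m) for j in range(n))

# Exact expectation of the estimator by enumerating all (i, j).
Ex = [0.0] * n   # should recover A^T y
Ey = [0.0] * m   # should recover -A x
for i in range(m):
    for j in range(n):
        p = A[i][j] ** 2 / frob2
        if p == 0.0:
            continue  # zero entries are never sampled
        Ex[j] += p * (A[i][j] * y[i] / p)    # (A_ij y_i / p_ij) e_j
        Ey[i] += p * (-A[i][j] * x[j] / p)   # -(A_ij x_j / q_ij) e_i

ATy = [sum(A[i][k] * y[i] for i in range(m)) for k in range(n)]
minusAx = [-sum(A[k][j] * x[j] for j in range(n)) for k in range(m)]
err = max(max(abs(a - b) for a, b in zip(Ex, ATy)),
          max(abs(a - b) for a, b in zip(Ey, minusAx)))
```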

Similarly, \frac{A_{i^{x}j^{x}}y_{i^{x}}}{p_{i^{x}j^{x}}}e_{j^{x}} and -\frac{A_{i^{y}j^{y}}x_{j^{y}}}{q_{i^{y}j^{y}}}e_{i^{y}} are unbiased estimators of A^{T}y and -Ax, respectively. Furthermore, the Lipschitz constant of Oracle IV can be computed as follows:

𝔼[Fξ(z)2]=ix,jxpixjx(Aixjxyixpixjx)2+iy,jyqiyjy(Aiyjyxjyqiyjy)2=ix,jx(Aixjx)2pixjx(yix)2+iy,jy(Aiyjy)2qiyjy(xjy)2=AF2z2.\displaystyle\begin{split}\mathbb{E}\left[\|F_{\xi}(z)\|^{2}\right]&\ =\sum_{i^{x},j^{x}}p_{i^{x}j^{x}}(\frac{A_{i^{x}j^{x}}y_{i^{x}}}{p_{i^{x}j^{x}}})^{2}+\sum_{i^{y},j^{y}}q_{i^{y}j^{y}}(-\frac{A_{i^{y}j^{y}}x_{j^{y}}}{q_{i^{y}j^{y}}})^{2}\\ &\ =\sum_{i^{x},j^{x}}\frac{(A_{i^{x}j^{x}})^{2}}{p_{i^{x}j^{x}}}(y_{i^{x}})^{2}+\sum_{i^{y},j^{y}}\frac{(A_{i^{y}j^{y}})^{2}}{q_{i^{y}j^{y}}}(x_{j^{y}})^{2}\\ &\ =\|A\|_{F}^{2}\|z\|^{2}\ .\end{split}

Thus, L_{4}=\|A\|_{F}. Combining Equation (4) with L_{4}=\|A\|_{F} and setting p=\frac{1}{\text{nnz}(A)}, the total flop cost of RsEGM with Oracle IV to find an \epsilon-optimal solution becomes

𝒪~(nnz(A)logR0ϵ+nnz(A)AFα(logR0ϵ)3).\widetilde{\mathcal{O}}\left({\text{nnz}(A)\log{\frac{R_{0}}{\epsilon}}+\sqrt{\text{nnz}(A)}\frac{\|A\|_{F}}{\alpha}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{3}}\right)\ .

Note that when the matrix A in the unconstrained bilinear problem is dense, RsEGM improves the total flop count of the deterministic primal-dual method (up to log terms) by at least a factor of \sqrt{\max{\{m,n\}}}. The improvement requires neither extra assumptions on the spectral properties of A (as Oracle II does) nor non-negativity of the entries of A (as Oracle III does).

5.2 Linear programming

Consider the primal-dual formulation of standard-form LP (2). For ease of comparison, we assume \text{nnz}(A)\geq m+n. Furthermore, we set

F(z)=(ATyAx),g1(x)=cTx+ι{x0},g2(y)=bTy.F(z)=\left(\begin{array}[]{c}A^{T}y\\ -Ax\end{array}\right),\;g_{1}(x)=c^{T}x+\iota_{\{x\geq 0\}},\;g_{2}(y)=b^{T}y. (29)

Recall that R0=dist(z0,0,𝒵)R_{0}=\text{dist}(z^{0,0},\mathcal{Z}^{*}). The total flop count of restarted primal-dual algorithms is (see [7] for details)

𝒪(nnz(A)A21/[H(1+z0,0+R0)]logR0ϵ)\mathcal{O}\left({\text{nnz}(A)\frac{\|A\|_{2}}{1/[H(1+\|z^{0,0}\|+R_{0})]}\log{\frac{R_{0}}{\epsilon}}}\right)

(i) Consider RsEGM with stochastic Oracle II (importance sampling):

Fξ(z)=(1riAiyi1cjAjxj),(ξ=(i,j))=ricj,ri=Ai22AF2,cj=Aj22AF2.F_{\xi}(z)=\left(\begin{array}[]{c}\frac{1}{r_{i}}A_{i\cdot}y_{i}\\ -\frac{1}{c_{j}}A_{\cdot j}x_{j}\end{array}\right),\;\mathbb{P}(\xi=(i,j))=r_{i}c_{j},\;r_{i}=\frac{\|A_{i\cdot}\|_{2}^{2}}{\|A\|_{F}^{2}},\;c_{j}=\frac{\|A_{\cdot j}\|_{2}^{2}}{\|A\|_{F}^{2}}\ .

Similar to the unconstrained bilinear case (Section 5.1), we can obtain the flop count

𝒪~(nnz(A)logR0ϵ+nnz(A)(m+n)AF1/[H(1+z0,0+R0)](logR0ϵ)5)\widetilde{\mathcal{O}}\left({\text{nnz}(A)\log{\frac{R_{0}}{\epsilon}}+\sqrt{\text{nnz}(A)(m+n)}\frac{\|A\|_{F}}{1/[H(1+\|z^{0,0}\|+R_{0})]}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{5}}\right)

(ii) Consider RsEGM with Oracle IV, i.e.,

Fξ(z)=(Aixjxyixpixjxejx,Aiyjyxjyqiyjyeiy)T,pij=Aij2AF2,qij=Aij2AF2.F_{\xi}(z)=\left(\frac{A_{i^{x}j^{x}}y_{i^{x}}}{p_{i^{x}j^{x}}}e_{j^{x}},-\frac{A_{i^{y}j^{y}}x_{j^{y}}}{q_{i^{y}j^{y}}}e_{i^{y}}\right)^{T},\;p_{ij}=\frac{A_{ij}^{2}}{\|A\|_{F}^{2}},\;q_{ij}=\frac{A_{ij}^{2}}{\|A\|_{F}^{2}}\ .

With a similar calculation (Section 5.1), we can obtain the flop count:

𝒪~(nnz(A)logR0ϵ+nnz(A)AF1/[H(1+z0,0+R0)](logR0ϵ)5).\widetilde{\mathcal{O}}\left({\text{nnz}(A)\log{\frac{R_{0}}{\epsilon}}+\sqrt{\text{nnz}(A)}\frac{\|A\|_{F}}{1/[H(1+\|z^{0,0}\|+R_{0})]}\left({\log{\frac{R_{0}}{\epsilon}}}\right)^{5}}\right)\ .

Note that \|A\|_{F}\leq\sqrt{\text{nnz}(A)}\|A\|_{2}, so RsEGM always has a complexity no worse than the deterministic primal-dual methods (up to log terms). When the constraint matrix A is dense, the improvement in the total flop count over the deterministic counterpart is by at least a factor of \sqrt{\max{\{m,n\}}}.

6 Numerical Experiment

In this section, we present numerical experiments on two classes of problems: matrix games and linear programming.

6.1 Matrix games

Problem instances. Matrix games can be formulated as a primal-dual optimization problem with the following form:

minxΔnmaxyΔmyTAx,\min_{x\in\Delta_{n}}\max_{y\in\Delta_{m}}y^{T}Ax\ ,

where \Delta_{n} and \Delta_{m} are the n-dimensional and m-dimensional simplices, respectively, and A\in\mathbb{R}^{m\times n}. In the experiments, we generate three matrix game instances of size m=n=1000 using the instance-generation code from [3], which generates matrix game instances from the classic literature [47, 46].

Progress metric. We use the duality gap, i.e., \max_{\hat{x}\in\Delta_{n},\hat{y}\in\Delta_{m}}\hat{y}^{T}Ax-y^{T}A\hat{x}, as the performance measure for the different algorithms. The duality gap is easy to evaluate, since it equals \max_{i}(Ax)_{i}-\min_{j}(A^{T}y)_{j}.
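The reduction of the gap to \max_{i}(Ax)_{i}-\min_{j}(A^{T}y)_{j} holds because a linear function on a simplex is extremized at a vertex. A minimal pure-Python sketch (the rock-paper-scissors payoff matrix and all names are ours, for illustration):

```python
# Duality gap for a matrix game on simplices:
# gap(x, y) = max_i (Ax)_i - min_j (A^T y)_j.
A = [[0.0, 1.0, -1.0],
     [-1.0, 0.0, 1.0],
     [1.0, -1.0, 0.0]]   # rock-paper-scissors payoff matrix
m, n = len(A), len(A[0])

def duality_gap(A, x, y):
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    ATy = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    return max(Ax) - min(ATy)

# The uniform strategy is the equilibrium of this game, so the gap is 0;
# at a vertex strategy pair the gap is strictly positive.
gap = duality_gap(A, [1.0 / n] * n, [1.0 / m] * m)
gap_vertex = duality_gap(A, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```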

Projection and stochastic oracle. We use the projection oracle developed in [15] for the projection onto the standard simplex, and implement RsEGM with importance row-column sampling (Oracle II in Section 5.1).

Results. We compare RsEGM with several notable methods in the literature for solving matrix games with simplex constraints, namely, the stochastic extragradient method with variance reduction (sEGM) [3], the variance-reduced method for matrix games (EG-Car+19) [10], and the deterministic restarted extragradient method (REGM) [7]. Figure 1 presents the numerical performance of the different methods on the three matrix game instances. The x-axis is the number of iterations measured in deterministic gradient steps (a stochastic gradient step is counted proportionally), and the y-axis is the duality gap. The proposed RsEGM exhibits the fastest convergence to a high-accuracy solution and a clear linear convergence rate, which verifies our theoretical guarantees. Both sEGM and EG-Car+19 converge slowly compared to RsEGM. Moreover, although REGM also enjoys linear convergence theoretically, the linear rate only shows up on instance nemirovski2 and is not triggered on the other two instances within 10000 iterations.

Figure 1: Convergence behaviour on matrix games with simplex constraint.

6.2 Linear programming

Problem instances. In the experiments, we use instances from MIPLIB 2017 to compare the performance of different algorithms. We randomly select six instances from MIPLIB 2017 and consider their root-node LP relaxations. We first convert each instance to the following form:

minxcTxs.t.AEx=bEAIxbIx0,\displaystyle\begin{split}&\min_{x}\;c^{T}x\\ &\ \mathrm{s.t.}\;A_{E}x=b_{E}\\ &\ \quad\;\;A_{I}x\leq b_{I}\\ &\ \quad\;\;x\geq 0\ ,\end{split} (30)

and consider its primal-dual formulation

minx0maxyI0cTxyTAx+bTy,\displaystyle\begin{split}\min_{x\geq 0}\max_{y_{I}\leq 0}\;c^{T}x-y^{T}Ax+b^{T}y\ ,\end{split} (31)

where AEmE×nA_{E}\in\mathbb{R}^{m_{E}\times n}, AImI×nA_{I}\in\mathbb{R}^{m_{I}\times n}, A=(AEAI)(mE+mI)×nA=\begin{pmatrix}A_{E}\\ A_{I}\end{pmatrix}\in\mathbb{R}^{(m_{E}+m_{I})\times n} and b=(bEbI)(mE+mI)b=\begin{pmatrix}b_{E}\\ b_{I}\end{pmatrix}\in\mathbb{R}^{(m_{E}+m_{I})}. We then compare the numerical performance of different primal-dual algorithms for solving (31). This is the problem format supported by the PDHG-based LP solver PDLP [5].

Progress metric. We use the KKT residual of (31), i.e., a combination of primal infeasibility, dual infeasibility and the primal-dual gap, to measure the progress of the current iterate. More formally, the KKT residual is given by

KKT(x,y)=(AExbE[AIxbI]+[x]+[ATyc]+[yI]+[cTxbTy]+)2.\mathrm{KKT}(x,y)=\left\|\begin{pmatrix}A_{E}x-b_{E}\\ [A_{I}x-b_{I}]^{+}\\ [-x]^{+}\\ [A^{T}y-c]^{+}\\ [y_{I}]^{+}\\ [c^{T}x-b^{T}y]^{+}\end{pmatrix}\right\|_{2}\ .
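The stacked residual above can be sketched in a few lines of pure Python. In the sketch below (the data layout and the helper name are ours; y stacks the equality multipliers before the inequality ones, and the trivial one-constraint LP used as an example is our own):

```python
import math

def kkt_residual(AE, bE, AI, bI, c, x, y):
    """l2 norm of stacked primal infeasibility, dual infeasibility,
    and duality-gap terms for the primal-dual LP pair."""
    mE, mI = len(AE), len(AI)
    n = len(c)
    A = AE + AI                       # stacked constraint matrix
    b = bE + bI
    pos = lambda v: max(v, 0.0)       # [.]^+ , entrywise positive part
    Ax = [sum(row[j] * x[j] for j in range(n)) for row in A]
    ATy = [sum(A[i][j] * y[i] for i in range(mE + mI)) for j in range(n)]
    parts = []
    parts += [Ax[i] - bE[i] for i in range(mE)]            # A_E x = b_E
    parts += [pos(Ax[mE + i] - bI[i]) for i in range(mI)]  # A_I x <= b_I
    parts += [pos(-x[j]) for j in range(n)]                # x >= 0
    parts += [pos(ATy[j] - c[j]) for j in range(n)]        # A^T y <= c
    parts += [pos(y[mE + i]) for i in range(mI)]           # y_I <= 0
    parts.append(pos(sum(c[j] * x[j] for j in range(n))
                     - sum(b[i] * y[i] for i in range(mE + mI))))  # gap
    return math.sqrt(sum(v * v for v in parts))

# At the optimum of  min x1 s.t. x1 = 1  (with y = 1), the residual is 0.
res_opt = kkt_residual([[1.0]], [1.0], [], [], [1.0], [1.0], [1.0])
```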

Stochastic oracles. Similar to matrix games, we utilize the importance row-column sampling scheme to implement RsEGM, namely, stochastic Oracle II presented in Section 5.2.

Results. We compare RsEGM with two methods for solving linear programming: the stochastic extragradient method with variance reduction without restarts (sEGM) [3] and the deterministic restarted extragradient method (REGM) [7]. Note that the variance-reduced method for matrix games (EG-Car+19) [10] compared in Section 6.1 is not considered for linear programming, since it requires a bounded domain, which is in general not satisfied by real-world LP instances such as those in the MIPLIB 2017 dataset.

Figure 2 presents the convergence behaviors of the different methods on the six instances. The x-axis is the number of iterations measured in deterministic gradient steps (a stochastic gradient step is counted proportionally), and the y-axis is the KKT residual. Again, RsEGM exhibits the fastest convergence to the desired accuracy 10^{-5}, while sEGM shows slow sublinear convergence. REGM is competitive with linear convergence, but RsEGM triggers linear convergence earlier and generally attains a faster eventual rate.

Figure 2: Convergence behaviour on LP instances from MIPLIB.

7 Conclusions

In this work, we introduce a stochastic algorithm for sharp primal-dual problems, such as linear programming and bilinear games, using variance reduction and restarts. We show that the proposed stochastic method enjoys a linear convergence rate, improving the complexity of existing algorithms in the literature. We also show that the proposed algorithm achieves the optimal convergence rate (up to a log term) within a wide class of stochastic algorithms.

Acknowledgement

The authors would like to thank Warren Schudy for proposing and discussing the efficient update of the coordinate stochastic oracle (Oracle IV) for LP.

References

  • [1] Ahmet Alacaoglu, Volkan Cevher, and Stephen J Wright, On the complexity of a practical primal-dual coordinate method, arXiv preprint arXiv:2201.07684 (2022).
  • [2] Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher, On the convergence of stochastic primal-dual hybrid gradient, arXiv preprint arXiv:1911.00799 (2019).
  • [3] Ahmet Alacaoglu and Yura Malitsky, Stochastic variance reduction for variational inequality methods, arXiv preprint arXiv:2102.08352 (2021).
  • [4] Randy I Anderson, Robert Fok, and John Scott, Hotel industry efficiency: An advanced linear programming examination, American Business Review 18 (2000), no. 1, 40.
  • [5] David Applegate, Mateo Díaz, Oliver Hinder, Haihao Lu, Miles Lubin, Brendan O’Donoghue, and Warren Schudy, Practical large-scale linear programming using primal-dual hybrid gradient, arXiv preprint arXiv:2106.04756 (2021).
  • [6] David Applegate, Mateo Díaz, Haihao Lu, and Miles Lubin, Infeasibility detection with primal-dual hybrid gradient for large-scale linear programming, arXiv preprint arXiv:2102.04592 (2021).
  • [7] David Applegate, Oliver Hinder, Haihao Lu, and Miles Lubin, Faster first-order primal-dual methods for linear programming using restarts and sharpness, arXiv preprint arXiv:2105.12715 (2021).
  • [8] Kinjal Basu, Amol Ghoting, Rahul Mazumder, and Yao Pan, ECLIPSE: An extreme-scale linear program solver for web-applications, Proceedings of the 37th International Conference on Machine Learning (Virtual) (Hal Daumé III and Aarti Singh, eds.), Proceedings of Machine Learning Research, vol. 119, PMLR, 13–18 Jul 2020, pp. 704–714.
  • [9] Edward H Bowman, Production scheduling by the transportation method of linear programming, Operations Research 4 (1956), no. 1, 100–103.
  • [10] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian, Variance reduction for matrix games, arXiv preprint arXiv:1907.02056 (2019).
  • [11]  , Coordinate methods for matrix games, 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2020, pp. 283–293.
  • [12] Antonin Chambolle, Matthias J Ehrhardt, Peter Richtárik, and Carola-Bibiane Schonlieb, Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications, SIAM Journal on Optimization 28 (2018), no. 4, 2783–2808.
  • [13] Antonin Chambolle and Thomas Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of mathematical imaging and vision 40 (2011), no. 1, 120–145.
  • [14] Abraham Charnes and William W Cooper, The stepping stone method of explaining linear programming calculations in transportation problems, Management science 1 (1954), no. 1, 49–69.
  • [15] Laurent Condat, Fast projection onto the simplex and the l 1 ball, Mathematical Programming 158 (2016), no. 1-2, 575–585.
  • [16] George Bernard Dantzig, Linear programming and extensions, vol. 48, Princeton university press, 1998.
  • [17] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng, Training GANs with optimism, International Conference on Learning Representations, 2018.
  • [18] Damek Davis, Dmitriy Drusvyatskiy, Kellie J MacPhee, and Courtney Paquette, Subgradient methods for sharp weakly convex functions, Journal of Optimization Theory and Applications 179 (2018), no. 3, 962–982.
  • [19] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in neural information processing systems, 2014, pp. 1646–1654.
  • [20] Jim Douglas and Henry H Rachford, On the numerical solution of heat conduction problems in two and three space variables, Transactions of the American mathematical Society 82 (1956), no. 2, 421–439.
  • [21] Jonathan Eckstein and Dimitri P Bertsekas, On the Douglas—Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming 55 (1992), no. 1-3, 293–318.
  • [22] Jonathan Eckstein, Dimitri P Bertsekas, et al., An alternating direction method for linear programming, (1990).
  • [23] Olivier Fercoq, Quadratic error bound of the smoothed gap and the restarted averaged primal-dual hybrid gradient, (2021).
  • [24] Robert M Freund and Haihao Lu, New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure, Mathematical Programming 170 (2018), no. 2, 445–477.
  • [25] Jacek Gondzio, Interior point methods 25 years later, European Journal of Operational Research 218 (2012), no. 3, 587–601.
  • [26] Robert M Gower, Mark Schmidt, Francis Bach, and Peter Richtárik, Variance-reduced methods for machine learning, Proceedings of the IEEE 108 (2020), no. 11, 1968–1983.
  • [27] Fred Hanssmann and Sidney W Hess, A linear programming approach to production and employment scheduling, Management science (1960), no. 1, 46–51.
  • [28] Alan J Hoffman, On approximate solutions of systems of linear inequalities, Journal of Research of the National Bureau of Standards 49 (1952), 263–265.
  • [29] Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams, Variance reduced stochastic gradient descent with neighbors, arXiv preprint arXiv:1506.03662 (2015).
  • [30] Kevin Huang, Nuozhou Wang, and Shuzhong Zhang, An accelerated variance reduced extra-point approach to finite-sum vi and optimization, arXiv preprint arXiv:2211.03269 (2022).
  • [31] Rie Johnson and Tong Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in neural information processing systems 26 (2013), 315–323.
  • [32] Narendra Karmarkar, A new polynomial-time algorithm for linear programming, Proceedings of the sixteenth annual ACM symposium on Theory of computing, 1984, pp. 302–311.
  • [33] Galina M Korpelevich, The extragradient method for finding saddle points and other problems, Matecon 12 (1976), 747–756.
  • [34] Dmitry Kovalev, Samuel Horváth, and Peter Richtárik, Don’t jump through hoops and remove those loops: Svrg and katyusha are better without the outer loop, Algorithmic Learning Theory, PMLR, 2020, pp. 451–467.
  • [35] Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos, A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization, IEEE Transactions on Automatic Control 64 (2019), no. 10, 4050–4065.
  • [36] Adrian S Lewis and Jingwei Liang, Partial smoothness and constant rank, arXiv preprint arXiv:1807.03134 (2018).
  • [37] Jingwei Liang, Jalal Fadili, and Gabriel Peyré, Local linear convergence analysis of primal–dual splitting methods, Optimization 67 (2018), no. 6, 821–853.
  • [38] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui, A universal catalyst for first-order optimization, Advances in neural information processing systems, 2015, pp. 3384–3392.
  • [39] Tianyi Lin, Shiqian Ma, Yinyu Ye, and Shuzhong Zhang, An admm-based interior-point method for large-scale linear programming, Optimization Methods and Software 36 (2021), no. 2-3, 389–424.
  • [40] Qian Liu and Garrett Van Ryzin, On the choice-based linear programming model for network revenue management, Manufacturing & Service Operations Management 10 (2008), no. 2, 288–310.
  • [41] Haihao Lu, An O(sr){O}(s^{r})-resolution ODE framework for discrete-time optimization algorithms and applications to convex-concave saddle-point problems, arXiv preprint arXiv:2001.08826 (2020).
  • [42] Alan S Manne, Linear programming and sequential decisions, Management Science 6 (1960), no. 3, 259–267.
  • [43] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil, A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach, International Conference on Artificial Intelligence and Statistics, 2020.
  • [44] Tianlong Nan, Yuan Gao, and Christian Kroer, Extragradient svrg for variational inequalities: Error bounds and increasing iterate averaging, arXiv preprint arXiv:2306.01796 (2023).
  • [45] Arkadi Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization 15 (2004), no. 1, 229–251.
  • [46]  , Mini-course on convex programming algorithms, Lecture notes (2013).
  • [47] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization 19 (2009), no. 4, 1574–1609.
  • [48] Yurii Nesterov, Gradient methods for minimizing composite functions, Mathematical Programming 140 (2013), no. 1, 125–161.
  • [49] Brendan O’Donoghue and Emmanuel Candes, Adaptive restart for accelerated gradient schemes, Foundations of computational mathematics 15 (2015), no. 3, 715–732.
  • [50] Balamurugan Palaniappan and Francis Bach, Stochastic variance reduction methods for saddle-point problems, Advances in Neural Information Processing Systems, 2016, pp. 1416–1424.
  • [51] Sebastian Pokutta, Restarting algorithms: Sometimes there is free lunch, International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, Springer, 2020, pp. 22–38.
  • [52] Boris Polyak, Sharp minima, Proceedings of the IIASA Workshop on Generalized Lagrangians and Their Applications, Laxenburg, Austria. Institute of Control Sciences Lecture Notes, Moscow, 1979.
  • [53] R. Tyrrell Rockafellar, Monotone operators and the proximal point algorithm, SIAM Journal on Control and Optimization 14 (1976), no. 5, 877–898.
  • [54] Vincent Roulet and Alexandre d’Aspremont, Sharpness, restart, and acceleration, SIAM Journal on Optimization 30 (2020), no. 1, 262–289.
  • [55] Mark Schmidt, Nicolas Le Roux, and Francis Bach, Minimizing finite sums with the stochastic average gradient, Mathematical Programming 162 (2017), no. 1-2, 83–112.
  • [56] Chaobing Song, Cheuk Yin Lin, Stephen Wright, and Jelena Diakonikolas, Coordinate linear variance reduction for generalized linear programming, Advances in Neural Information Processing Systems 35 (2022), 22049–22063.
  • [57] Junqi Tang, Mohammad Golbabaee, Francis Bach, et al., Rest-katyusha: exploiting the solution’s structure via scheduled restart schemes, Advances in Neural Information Processing Systems, 2018, pp. 429–440.
  • [58] Paul Tseng, On linear convergence of iterative methods for the variational inequality problem, Journal of Computational and Applied Mathematics 60 (1995), no. 1-2, 237–252.
  • [59] Tianbao Yang and Qihang Lin, RSG: Beating subgradient method without smoothness and strong convexity, The Journal of Machine Learning Research 19 (2018), no. 1, 236–268.
  • [60] Renbo Zhao, Optimal stochastic algorithms for convex-concave saddle-point problems, arXiv preprint arXiv:1903.01687 (2020).

Appendix A Appendix

A.1 Sharpness and metric sub-regularity

In this subsection, we establish a connection between sharpness, defined via the normalized duality gap [7], and metric sub-regularity [35]. The result is summarized in the following proposition.

Proposition 1.

Consider the primal-dual problem $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\mathcal{L}(x,y)$. If the problem is sharp with constant $\alpha$ on $\mathcal{U}(z^{*})$, i.e., $\alpha\,\mathrm{dist}(z,\mathcal{Z}^{*})\leq\rho_{r}(z)$ for all $r>0$ and $z\in\mathcal{U}(z^{*})$, then it satisfies metric sub-regularity at $z^{*}$ for $0$, i.e.,

\alpha\,\mathrm{dist}(z,\mathcal{Z}^{*})\leq\mathrm{dist}(0,\partial\mathcal{L}(z))=\inf_{(u,-v)\in\partial\mathcal{L}(z)}\|(u,-v)\|,\quad\forall z\in\mathcal{U}(z^{*})\ .

Proof. For any $(u,-v)\in\partial\mathcal{L}(z)$,

\begin{split}
\alpha\,\mathrm{dist}(z,\mathcal{Z}^{*})&\leq\rho_{r}(z)=\frac{1}{r}\max_{\hat{z}\in W_{r}(z)}\left\{\mathcal{L}(x,\hat{y})-\mathcal{L}(\hat{x},y)\right\}\\
&\leq\frac{1}{r}\max_{\hat{z}\in B_{r}(z)}\left\{\mathcal{L}(x,\hat{y})-\mathcal{L}(\hat{x},y)\right\}\\
&=\frac{1}{r}\max_{\hat{z}\in B_{r}(z)}\left\{\mathcal{L}(x,\hat{y})-\mathcal{L}(x,y)+\mathcal{L}(x,y)-\mathcal{L}(\hat{x},y)\right\}\\
&\leq\frac{1}{r}\max_{\hat{z}\in B_{r}(z)}\left\{v^{T}(\hat{y}-y)+(-u)^{T}(\hat{x}-x)\right\}\\
&=\frac{1}{r}\max_{\hat{z}\in B_{r}(z)}\left\{-(u,-v)^{T}(\hat{z}-z)\right\}\\
&=\frac{1}{r}\,r\|(u,-v)\|=\|(u,-v)\|\ ,
\end{split}

where the first inequality uses the definition of sharpness, the second holds since $W_{r}(z)\subseteq B_{r}(z)$, and the third follows from the convexity-concavity of $\mathcal{L}(x,y)$.

Taking the infimum over $(u,-v)\in\partial\mathcal{L}(z)$, we obtain

\alpha\,\mathrm{dist}(z,\mathcal{Z}^{*})\leq\inf_{(u,-v)\in\partial\mathcal{L}(z)}\|(u,-v)\|=\mathrm{dist}(0,\partial\mathcal{L}(z))\ .
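As a concrete instance (our illustration, not stated in the proposition), for an unconstrained bilinear problem the generalized gradient is a singleton, and the conclusion specializes to a classical error bound on the KKT residual:

```latex
% For \mathcal{L}(x,y) = c^T x + y^T A x - b^T y with
% \mathcal{X} = \mathbb{R}^m and \mathcal{Y} = \mathbb{R}^n,
% the generalized gradient is the singleton
%   \partial\mathcal{L}(z) = \{ (c + A^T y,\; b - Ax) \},
% so metric sub-regularity at z^* reads
\alpha\,\mathrm{dist}(z,\mathcal{Z}^{*})
  \leq \left\| \begin{pmatrix} c + A^{T}y \\ b - Ax \end{pmatrix} \right\| ,
\qquad \forall z \in \mathcal{U}(z^{*}) ,
% i.e., the distance to the optimal set is controlled by the gradient norm.
```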

A.2 LP does not satisfy metric sub-regularity globally

In this subsection, we present a counterexample showing that the generalized gradient of the LP Lagrangian does not satisfy metric sub-regularity globally.

Consider the LP

\min_{x\in\mathbb{R}^{2}}\ \mathbf{1}^{T}x\ ,\quad\text{s.t.}\ Ax\leq b,\ x\geq 0\ ,

where $A=(0\;\;1)\in\mathbb{R}^{1\times 2}$ and $b=1$. The unique optimal solution is $x^{*}=(0,0)$. The primal-dual form is

\mathcal{L}(x,y)=\mathbf{1}^{T}x+y^{T}(Ax-b)+\iota_{\{x\geq 0\}}(x)-\iota_{\{y\geq 0\}}(y)\ ,

and the optimal primal-dual solution is $z^{*}=(\mathbf{0}_{2},0)$. Notice that

\partial\mathcal{L}(z)=\left\{(p,q)\in\mathbb{R}^{2+1}\,\middle|\,p\in\mathbf{1}+A^{T}y+\partial\iota_{\{x\geq 0\}}(x),\ q\in-Ax+b+\partial\iota_{\{y\geq 0\}}(y)\right\}\ ,

thus

\mathrm{dist}(0,\partial\mathcal{L}(z))^{2}=\inf_{(p,q)\in\partial\mathcal{L}(z)}\left\{\|p\|^{2}+q^{2}\right\}\leq 1+(1+y)^{2}+(1-x_{2})^{2}\ .

If the generalized gradient of the LP Lagrangian satisfied global metric sub-regularity, i.e., if there existed a constant $\eta>0$ such that $\eta\,\mathrm{dist}(z,\mathcal{Z}^{*})\leq\mathrm{dist}(0,\partial\mathcal{L}(z))$ for all $z\in\mathcal{Z}$, then

x_{1}^{2}+x_{2}^{2}+y^{2}\leq\frac{1}{\eta^{2}}\left(1+(1+y)^{2}+(1-x_{2})^{2}\right),\quad\forall(x_{1},x_{2},y)\in\mathbb{R}^{3}_{+}\ .

However, this is impossible: fixing $x_{2}$ and $y$ and letting $x_{1}\rightarrow\infty$ drives the left-hand side to infinity while the right-hand side stays bounded, a contradiction.

Therefore, LP does not satisfy global metric sub-regularity. Although we use the Euclidean norm for illustration, by the equivalence of norms on $\mathbb{R}^{n}$, LP does not satisfy global metric sub-regularity under any norm. Notice that metric sub-regularity is a weaker condition than sharpness (see Appendix A.1); a direct consequence is that LP is not globally sharp.
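The failure can also be checked numerically. The sketch below (our hypothetical illustration, not part of the paper's experiments) evaluates the minimum-norm element of $\partial\mathcal{L}(z)$ in closed form along the ray $z=(t,0,0)$, where $p_{1}=1$ while $p_{2}$ and $q$ can both be taken to be $0$, so the minimum norm stays at $1$ while $\mathrm{dist}(z,\mathcal{Z}^{*})=t$ grows without bound:

```python
# Hypothetical check of the counterexample: along z = (t, 0, 0),
# dist(0, dL(z)) stays at 1 while dist(z, Z*) = t is unbounded,
# so no eta > 0 satisfies eta * dist(z, Z*) <= dist(0, dL(z)).

def min_norm_subgradient(t):
    # p in 1 + A^T y + d iota_{x>=0}(x), with A = [[0, 1]] and y = 0:
    #   p1 = 1 exactly, since x1 = t > 0 forces d iota = {0} there;
    #   p2 in 1 + (-inf, 0] (x2 = 0 is on the boundary), so p2 can be 0.
    # q in -x2 + 1 + d iota_{y>=0}(0) = (-inf, 1], so q can be 0.
    return 1.0  # norm of the minimum-norm element (1, 0, 0)

def dist_to_optimum(t):
    return t  # Z* = {(0, 0, 0)}

ratios = [min_norm_subgradient(t) / dist_to_optimum(t) for t in (1.0, 1e3, 1e6)]
# The best possible eta along this ray shrinks to zero.
assert ratios == [1.0, 1e-3, 1e-6]
```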

A.3 Efficient update for coordinate sampling

In Section 5.1, we propose two coordinate gradient estimators, Oracle III and Oracle IV. Here, we present an efficient computation scheme that requires $\mathcal{O}(\mathrm{nnz}(A))$ flops for a snapshot step (i.e., a step that updates the snapshot $w$) and $\mathcal{O}(1)$ flops for a non-snapshot step (i.e., a step between two snapshot steps). With these efficient update rules, we obtain the total computational cost reported in the Oracle IV rows of Table 1 and Table 2.

Consider a generic coordinate stochastic oracle

F_{\xi}(z)=(c_{x}e_{x},d_{y}e_{y})\ ,

where $e_{x}$ and $e_{y}$ are standard unit vectors in $\mathbb{R}^{m}$ and $\mathbb{R}^{n}$, respectively, and $c_{x},d_{y}$ are two arbitrary scalars. Oracle III and Oracle IV are special cases of this generic coordinate stochastic oracle.

Let $w^{k^{j}}$ denote the $j$-th snapshot, i.e., the $j$-th update of $w$. Consider a non-snapshot step $k$ with $k^{j}\leq k<k^{j+1}-1$, i.e., a step between two consecutive snapshot steps. Then $w^{k}\equiv w^{k^{j}}$ and $u=p\,w^{k^{j}}-\tau F(w^{k^{j}})$ do not change for any $k^{j}<k<k^{j+1}-1$. Furthermore, it is easy to check that whenever $F_{\xi_{k}}(z^{k+1/2})_{i}=F_{\xi_{k}}(w^{k})_{i}$, coordinate $i$ has the update

z^{k+1}_{i}=\begin{cases}\max\left\{0,\ u_{i}/p+(1-p)\left(z^{k}_{i}-u_{i}/p\right)\right\},&\text{if coordinate }i\text{ is constrained with }z_{i}\geq 0\ ,\\ u_{i}/p+(1-p)\left(z^{k}_{i}-u_{i}/p\right),&\text{if coordinate }i\text{ is unconstrained}\ .\end{cases}

Generalizing this simple observation, a closed-form update rule holds for iterates $z^{k^{\prime}}$ with $k^{j}\leq k<k^{\prime}\leq k^{j+1}$:

z^{k^{\prime}}_{i}=\begin{cases}\max\left\{0,\ u_{i}/p+(1-p)^{k^{\prime}-k}\left(z^{k}_{i}-u_{i}/p\right)\right\},&\text{if coordinate }i\text{ is constrained with }z_{i}\geq 0\ ,\\ u_{i}/p+(1-p)^{k^{\prime}-k}\left(z^{k}_{i}-u_{i}/p\right),&\text{if coordinate }i\text{ is unconstrained}\ .\end{cases} \quad (32)

Thus motivated, we can run Algorithm 1 with an efficient update rule using the following procedure:

Snapshot steps. In a snapshot step (i.e., $k=k^{j}$), we compute $z^{k^{j}}$ using (32), set the iteration of the last update of each coordinate $i$ (denoted $k_{i}$) to $k^{j}$, namely $k_{i}=k^{j}$ for $1\leq i\leq m+n$, record $z^{k^{j}}$ in memory, and set $w^{k^{j}}=z^{k^{j}}$. The computational cost is $\mathcal{O}(m+n)$. Then we compute $u=p\,w^{k^{j}}-\tau F(w^{k^{j}})$ and store $u$ in memory, at computational cost $\mathcal{O}(\mathrm{nnz}(A))$. Finally, we compute the running average of iterates $\widetilde{z}^{k^{j}}$; the cost is $\mathcal{O}(m+n)$ by taking advantage of the sum of a geometric series.

Non-snapshot steps. In a non-snapshot step (i.e., $k^{j}<k<k^{j+1}-1$), if $F_{\xi_{k}}(z^{k+1/2})$ updates coordinate $i$ and $F_{\xi_{k}}(z^{k+1/2})_{i}\neq F_{\xi_{k}}(w^{k})_{i}$, we compute $z^{k}_{i}$ using (32), update $z^{k+1/2}_{i}$ and $z^{k+1}_{i}$, set $k_{i}=k+1$, and store $z^{k_{i}}_{i}$ in memory. The cost is $\mathcal{O}(1)$.

As such, for Oracle III and Oracle IV discussed in Section 5, the cost of a snapshot step is $\mathcal{O}(\mathrm{nnz}(A))$ and the cost of a non-snapshot step is $\mathcal{O}(1)$.
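The lazy evaluation behind (32) can be sketched in a few lines. The helper below is a hypothetical illustration (not the paper's implementation) for a single coordinate, verifying that the $\mathcal{O}(1)$ closed form reproduces the step-by-step recursion for an unconstrained coordinate:

```python
def lazy_value(z_ki, u_i, p, elapsed, nonnegative=False):
    # Closed-form value of coordinate i after `elapsed` implicit updates
    # since its last materialized value z_ki (cf. (32)): a geometric
    # contraction toward u_i / p, projected onto z_i >= 0 if constrained.
    v = u_i / p + (1 - p) ** elapsed * (z_ki - u_i / p)
    return max(0.0, v) if nonnegative else v

# The O(1) closed form matches applying the one-step update `elapsed` times.
z_i, u_i, p = 3.0, 0.4, 0.25
step = z_i
for _ in range(7):
    step = u_i / p + (1 - p) * (step - u_i / p)
assert abs(step - lazy_value(z_i, u_i, p, 7)) < 1e-12
```

With this helper, only the coordinates actually touched by the stochastic oracle need to be materialized at each non-snapshot step; all others are recovered on demand from their last recorded value and the elapsed step count.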

A.4 Comparison with SPDHG

[2, Theorem 4.6] shows linear convergence of SPDHG for problems with global metric sub-regularity, such as the unconstrained bilinear problem. However, the obtained linear convergence rate involves parameters that are not easy to interpret. Here we apply their result to unconstrained bilinear problems and express the rate in more transparent parameters, so that we can compare it with our Theorem 2.

Using the notation in [2], they obtained

\mathbb{E}\left[\frac{C_{1}}{2}\left\|x^{k}-x_{*}^{k}\right\|_{\tau^{-1}}^{2}+\frac{1}{2}\left\|y^{k+1}-y_{*}^{k+1}\right\|_{D(\sigma)^{-1}P^{-1}}^{2}\right]\leq(1-\rho)^{k}\,2\Phi^{0}\ ,

where $\rho=\frac{C_{1}\underline{p}}{2\zeta}$, $\underline{p}=\min_{i}p_{i}$, $C_{1}=1-\gamma$, and $\zeta=2+\frac{2}{\alpha^{2}}\|H-N\|^{2}$. For simplicity, we consider the uniform sampling scheme with $p_{i}=\frac{1}{n}$, and thus $\underline{p}=\min_{i}p_{i}=\frac{1}{n}$ (the other schemes follow with similar arguments). The total number of iterations for SPDHG to achieve $\epsilon$-accuracy is of order

\mathcal{O}\left(\frac{1}{\rho}\log\frac{1}{\epsilon}\right)=\mathcal{O}\left(\frac{\zeta}{\underline{p}}\log\frac{1}{\epsilon}\right)=\mathcal{O}\left(n\zeta\log\frac{1}{\epsilon}\right)=\mathcal{O}\left(n\,\frac{\|H-N\|^{2}}{\alpha^{2}}\log\frac{1}{\epsilon}\right)\ ,

where $H$ and $N$ are defined in [2, Lemma 4.4]. The operators $H$ and $N$ are rather involved, so a matching upper bound on $\|H-N\|^{2}$ is hard to derive. We can, however, lower bound the term $\|H-N\|^{2}$:

\|H-N\|^{2}\geq\frac{\|(H-N)q\|_{2}^{2}}{\|q\|_{2}^{2}}\geq\frac{\|\tau^{-1}x(1)\|_{2}^{2}+\|Ax(1)\|_{2}^{2}}{\|x(1)\|_{2}^{2}}\geq\frac{\|Ax(1)\|_{2}^{2}}{\|x(1)\|_{2}^{2}}=\|A\|^{2}\ ,

where the first inequality is due to the definition of the operator norm, the second follows from the choice $q=(x(1),0,\ldots,0)$, and the last equality is from the choice of $x(1)$ such that $x(1)=\operatorname{argmax}_{u}\frac{\|Au\|_{2}^{2}}{\|u\|_{2}^{2}}$.

Thus the complexity of SPDHG is at least

\mathcal{O}\left(\frac{1}{\rho}\log\frac{1}{\epsilon}\right)=\mathcal{O}\left(n\,\frac{\|H-N\|^{2}}{\alpha^{2}}\log\frac{1}{\epsilon}\right)\geq\mathcal{O}\left(n\,\frac{\|A\|^{2}}{\alpha^{2}}\log\frac{1}{\epsilon}\right)\geq\mathcal{O}\left(\frac{\|A\|_{F}^{2}}{\alpha^{2}}\log\frac{1}{\epsilon}\right)=\mathcal{O}\left(\kappa_{F}^{2}\log\frac{1}{\epsilon}\right)\ .
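The last inequality uses $\|A\|_{F}^{2}\leq n\|A\|^{2}$, which holds because $\|A\|_{F}^{2}$ is the sum of at most $n$ squared singular values, each bounded by $\|A\|^{2}$. A quick numeric sanity check of this norm comparison (our hypothetical illustration, taking $n$ as the number of rows of $A$):

```python
import numpy as np

# Sanity check of ||A||_F^2 <= n * ||A||_2^2 for A with n rows: the
# Frobenius norm sums at most n squared singular values, each no
# larger than the largest one (the squared operator norm).
rng = np.random.default_rng(0)
for _ in range(50):
    n, m = rng.integers(1, 8, size=2)
    A = rng.standard_normal((n, m))
    fro2 = np.linalg.norm(A, "fro") ** 2
    op2 = np.linalg.norm(A, 2) ** 2  # largest singular value, squared
    assert fro2 <= n * op2 + 1e-9
```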

In contrast, the complexity of RsEGM is $\widetilde{\mathcal{O}}\left(\kappa_{F}\left(\log\frac{1}{\epsilon}\right)^{3}\right)$. This showcases that RsEGM has an improved linear convergence rate, in terms of the condition number, for solving unconstrained bilinear problems compared to SPDHG (after ignoring the $\log$ terms). The improvement comes from the restart scheme, similar to the deterministic case [7]. Lastly, we would like to mention that SPDHG is a semi-stochastic algorithm, where the primal takes a full gradient update and the dual takes a stochastic gradient update, while RsEGM takes stochastic gradient steps in both the primal and the dual space (except at snapshot steps).