
Dissipative Gradient Descent Ascent Method: A Control Theory Inspired Algorithm for Min-max Optimization

Tianqi Zheng1, Nicolas Loizou1, Pengcheng You2 and Enrique Mallada1

1T. Zheng and E. Mallada are with the Department of Electrical and Computer Engineering, and N. Loizou is with the Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA. {tzheng8,nloizou,mallada}@jhu.edu
2P. You is with the Department of Industrial Engineering and Management, Peking University, Beijing, China. [email protected]
Abstract

Gradient Descent Ascent (GDA) methods for min-max optimization problems typically produce oscillatory behavior that can lead to instability, e.g., in bilinear settings. To address this problem, we introduce a dissipation term into the GDA updates to dampen these oscillations. The proposed Dissipative GDA (DGDA) method can be seen as performing standard GDA on a state-augmented and regularized saddle function that does not introduce additional convexity/concavity. We theoretically show the linear convergence of DGDA in the bilinear and strongly convex-strongly concave settings and assess its performance by comparing DGDA with other methods such as GDA, Extra-Gradient (EG), and Optimistic GDA. Our findings demonstrate that DGDA surpasses these methods, achieving superior convergence rates. We support our claims with two numerical examples that showcase DGDA's effectiveness in solving saddle point problems.

I Introduction

In recent years, there has been a significant focus on solving saddle point problems, namely min-max optimization problems [1, 2, 3, 4, 5, 6]. These problems have garnered considerable attention, particularly in fields such as Generative Adversarial Networks (GANs) [7, 6, 8], Reinforcement Learning (RL) [9], and Constrained RL (C-RL) [10, 11]. However, a major challenge that persists in these approaches is the instability of the training process. That is, solving the min-max optimization problem via running the standard Gradient Descent Ascent (GDA) algorithm often leads to unstable oscillatory behavior rather than convergence to the optimal solution. This is particularly illustrated in bilinear min-max problems, such as the training of Wasserstein GANs [12] or solving C-RL problems in the occupancy measure space [13], for which the standard GDA fails to converge [1, 2].

In order to understand the instability of the GDA method and further tackle its limitations, we draw inspiration from the control-theoretic notion of dissipativity [14], which enables the design of stabilizing controllers using dynamic (state-augmented) components that seek to dissipate the energy generated by the unstable process. This aligns with recent work that leverages control theory tools in the analysis and design of optimization algorithms [15, 16, 17, 18, 19]. From a dynamical systems point of view, dissipativity theory characterizes the manner in which energy dissipates within a system and drives it towards equilibrium. It provides a direct way to construct a Lyapunov function, which further relates the rate of decrease of this internal energy to the rate of convergence of the algorithm.

We motivate our developments by looking first at a simple scalar bilinear problem wherein the energy of the system, expressed as the squared 2-norm distance to the saddle point, strictly increases on every iteration, leading to oscillations of increasing amplitude. To tackle this unstable oscillating behavior, we propose the Dissipative GDA (DGDA) method, which, as the name suggests, incorporates a simple friction term into the GDA updates to dissipate the internal energy and stabilize the system. Our algorithm can be seen as a discrete-time version of [20], which has been applied to solve C-RL problems [10]. In this work, we build on this literature, making the following contributions:

1. Novel control theory inspired algorithm: We illustrate how to use control theoretic concepts of dissipativity theory to design an algorithm that can stabilize the unstable behavior of GDA. Particularly, we show that by introducing a friction term, the proposed DGDA algorithm dissipates the stored internal energy and converges toward equilibrium.

2. Theoretical analysis with better rates: We establish the linear convergence of the DGDA method for bilinear and strongly convex-strongly concave saddle point problems. In both settings, we show that the DGDA method outperforms other state-of-the-art first-order explicit methods, surpassing the standard known linear convergence rate (see Table I and II).

3. Numerical Validation: We corroborate our theoretical results with numerical experiments comparing the performance of the DGDA method with the GDA, EG, and OGDA methods. When applied to solve bilinear and strongly convex-strongly concave saddle point problems, the DGDA method systematically outperforms the other methods in terms of convergence rate.

Outline: The rest of the paper is organized as follows. In Section II, we provide some preliminary definitions and background. In Section III, we leverage tools from dissipativity theory and propose the Dissipative GDA (DGDA) algorithm to tackle the unstable oscillatory behavior of GDA methods. In Section IV, we establish its linear convergence rate for bilinear and strongly convex-strongly concave problems, which outperforms state-of-the-art first-order explicit algorithms, including GDA, EG, and OGDA methods. In Section V, we support our claims with two numerical examples. We close the paper with concluding remarks and future research directions in Section VI.

Bil.   Mokhtari, 2020       Azizian, 2020        This Work
EG     \kappa^{-1}/20       \kappa^{-1}/64       -
OG     \kappa^{-1}/800      \kappa^{-1}/128      -
DG     -                    -                    \kappa^{-1}/4

TABLE I: Summary of the global convergence results for the EG, OGDA, and DGDA methods with bilinear objective functions. If a result shows that the iterates converge as \mathcal{O}((1-r)^{t}), the quantity r is reported (the larger, the better). \kappa represents the condition number.

II Problem Formulation

In this paper, we study the problem of finding saddle points in the min-max optimization problem:

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}f(x,y),  (1)

where the function f:\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R} is convex-concave. Precisely, f(\cdot,y) is convex for all y\in\mathbb{R}^{m} and f(x,\cdot) is concave for all x\in\mathbb{R}^{n}. We seek to develop a novel optimization algorithm that converges to some saddle point (x^{*},y^{*}) of Problem (1).

Definition 1 (Saddle Point)

A point (x^{*},y^{*})\in\mathbb{R}^{n}\times\mathbb{R}^{m} is a saddle point of the convex-concave function in (1) if and only if it satisfies f(x^{*},y)\leq f(x^{*},y^{*})\leq f(x,y^{*}) for all x\in\mathbb{R}^{n}, y\in\mathbb{R}^{m}.

Throughout this paper, we consider two specific instances of Problem 1 commonly studied in related literature: strongly convex-strongly concave and bilinear functions. Herein, we briefly present some definitions and properties.

Definition 2 (Strongly Convex)

A differentiable function f:\mathbb{R}^{n}\to\mathbb{R} is said to be \mu-strongly convex if f(w)\geq f(w^{\prime})+\nabla f(w^{\prime})^{T}(w-w^{\prime})+\frac{\mu}{2}\|w-w^{\prime}\|^{2} for all w,w^{\prime}\in\mathbb{R}^{n}.

Notice that if \mu=0, we recover the definition of convexity for a continuously differentiable function, and f(w) is \mu-strongly concave if -f(w) is \mu-strongly convex. Another important property commonly used in the convergence analysis of optimization algorithms is Lipschitz continuity of the gradient \nabla f(w).

Definition 3 (LL-Lipschitz)

A function F:\mathbb{R}^{n}\to\mathbb{R}^{m} is L-Lipschitz if \|F(w)-F(w^{\prime})\|\leq L\|w-w^{\prime}\| for all w,w^{\prime}\in\mathbb{R}^{n}.

Combining the above two properties leads to the first important class of problems that has been extensively studied [21, 22, 2, 1].

Assumption 1

(Strongly convex-strongly concave functions with L-Lipschitz gradient) The function f:\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R} is continuously differentiable, \mu-strongly convex in x, and \mu-strongly concave in y. Further, the gradient vector (\nabla_{x}f(x,y);-\nabla_{y}f(x,y)) is L-Lipschitz.

It is also crucial to consider situations where the objective function is bilinear. Such bilinear min-max problems often appear when solving constrained reinforcement learning problems [10, 23] and when training WGANs [12].

Assumption 2 (Bilinear function)

The function f:\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R} is bilinear if it can be written in the form f(x,y)=x^{T}Ay. For simplicity, we further assume that the matrix A\in\mathbb{R}^{m\times n} is full rank, with m\leq n.

As seen in Tables I and II, as well as in Section IV, the linear convergence rates of existing algorithms are frequently characterized by the condition number \kappa. Specifically, when the objective function is bilinear, the condition number is defined as \kappa:=\sigma_{\max}^{2}(A)/\sigma_{\min}^{2}(A), where \sigma_{\max}(M) and \sigma_{\min}(M) denote the largest and smallest singular values of a matrix M, respectively. When the objective function is strongly convex-strongly concave with L-Lipschitz gradient, the condition number of the problem is defined as \kappa:=L/\mu.

III Dissipative Gradient Descent Ascent Algorithm

S.C.   Zhang, 2021    Mokhtari, 2020       Azizian, 2020              This Work
GD     \kappa^{-2}    -                    -                          -
EG     -              \kappa^{-1}/4        \kappa^{-1}/4+\epsilon     -
OG     -              \kappa^{-1}/4        \kappa^{-1}/4+\epsilon     -
DG     -              -                    -                          \kappa^{-1}-\mathcal{O}(\kappa^{-2})

TABLE II: Summary of the global convergence results for the GDA, EG, OGDA, and DGDA methods with strongly convex-strongly concave and L-Lipschitz objective functions. The table reports the term r of a (1-r) linear rate. The constant \epsilon>0 depends on the problem.

This section introduces the proposed first-order method for solving the min-max optimization problem (1). The algorithm can be seen as a discretization of the algorithm proposed by [20], where the authors introduce a regularization framework for continuous saddle flow dynamics that guarantees asymptotic convergence to a saddle point under mild assumptions. However, the continuous-time analysis presented in [20] does not generally extend to discrete time. In this work, we show the linear convergence of the discrete-time version of this algorithm. Moreover, we illustrate a novel control-theoretic design methodology for optimization algorithms that broadly uses dissipativity tools to study the convergence of the resulting algorithm.

Our results build on gaining an intuitive understanding of the problems that one encounters when applying the vanilla Gradient Descent Ascent method to solve saddle point problems (1):
Gradient Descent Ascent (GDA)

x_{k+1}=x_{k}-\eta\nabla_{x}f(x_{k},y_{k}),  (2)
y_{k+1}=y_{k}+\eta\nabla_{y}f(x_{k},y_{k}).

When (1) is strongly convex-strongly concave and has L-Lipschitz gradients, the GDA method converges linearly, with step size \eta=\mu/L^{2} and a known rate estimate of 1-1/\kappa^{2} [24]. However, when the problem is bilinear, the standard GDA method fails to converge.

III-A Control Theory-Based Motivation

Before we dive into our algorithm, we would like to illustrate our key insight from control theory about how to characterize the energy stored in the system and how to introduce friction to dissipate energy and drive the system toward equilibrium. To get started, consider the following motivating example with a simple bilinear objective function:

\min_{x}\max_{y}f(x,y):=xy  (3)

We seek to analyze the behavior of a (possibly controlled) vanilla GDA algorithm applied to the above simple bilinear objective function, that is:

z_{k+1}=(I-\eta M)z_{k}+u_{k}  (4)

where z_{k}=(x_{k},y_{k})\in\mathbb{R}^{2} represents the system state, u_{k}\in\mathbb{R}^{2} a control input, and M:=\begin{bmatrix}0&1\\-1&0\end{bmatrix}. In the absence of control (u_{k}=0 for all k), the algorithm diverges, as illustrated in the upper subplot of Figure 1. To understand this unstable process, we keep track of the system's energy, expressed as the squared 2-norm distance to the unique saddle point (0,0): V(z_{k})=\|z_{k}\|^{2}=x_{k}^{2}+y_{k}^{2}. A simple algebraic manipulation shows that

V(z_{k+1})=(1+\eta^{2})V(z_{k})+2u_{k}^{T}(I-\eta M)z_{k}+\|u_{k}\|^{2}.  (5)

Notably, (5) shows how, without control, the energy grows exponentially with each iteration.

Figure 1: Trajectories of the states of GDA and DGDA for the simple bilinear objective function f(x,y):=xy.

We would like to choose u_{k} so that the right-hand side of (5) is upper-bounded by \alpha^{2}V(z_{k}) for some |\alpha|<1. However, this is usually difficult to achieve without knowledge of the equilibrium.

The proposed approach uses state augmentation to incorporate a dissipation component that receives some external input u^k\hat{u}_{k} and filters its fluctuations, i.e.,

\hat{z}_{k+1}=(1-\rho)\hat{z}_{k}+\rho\hat{u}_{k}.  (6)

The dissipation interpretation of (6) can be readily seen, again, by looking at the evolution of its energy

V(\hat{z}_{k+1})=(1-\rho)^{2}V(\hat{z}_{k})+2\rho(1-\rho)\hat{z}_{k}^{T}\hat{u}_{k}+\rho^{2}\|\hat{u}_{k}\|^{2}  (7)

which has a natural tendency to dissipate energy when |1-\rho|<1 and \hat{u}_{k}=0 for all k.

The key innovation of the proposed approach is to combine (4) and (6) with a proper choice of u_{k} and \hat{u}_{k} that satisfies:

  • The combined energy V(\xi_{k})=\|z_{k}\|^{2}+\|\hat{z}_{k}\|^{2} decreases, i.e., there is |\alpha|<1 such that:

    V(\xi_{k+1})\leq\alpha^{2}V(\xi_{k}).  (8)
  • The iterates u_{k}\rightarrow u^{*}=0, so that the equilibrium of (4), that is, the saddle point, does not change.

To achieve these goals we select

u_{k}=-\rho(z_{k}-\hat{z}_{k}),\quad\text{and}\quad\hat{u}_{k}=z_{k},  (9)

thus leading to the combined algorithm

z_{k+1}=z_{k}-\eta Mz_{k}-\rho(z_{k}-\hat{z}_{k})  (10)
\hat{z}_{k+1}=\hat{z}_{k}-\rho(\hat{z}_{k}-z_{k})  (11)

The choice of \hat{u}_{k} ensures that \hat{z}_{k} acts as a filtered version of z_{k}. Further, since \hat{z}_{k}\rightarrow\bar{z} whenever z_{k}=\bar{z} is constant, the choice of u_{k} ensures that the equilibrium of (4) remains unchanged. We will show in Section IV that one can choose \eta and \rho so that (8) holds. We refer to the bottom of Figure 1 for an illustration of its effectiveness, and to the sketch below for a numerical illustration.
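A minimal sketch, assuming nothing beyond the updates above: it runs the uncontrolled iteration (4) and the combined updates (10)-(11) on f(x,y):=xy and prints the resulting energies, reproducing the qualitative behavior shown in Figure 1.

```python
import numpy as np

eta, rho = 0.1, 0.5
M = np.array([[0.0, 1.0], [-1.0, 0.0]])

z_gda = np.array([1.0, 1.0])                            # GDA state (x, y)
z = np.array([1.0, 1.0])                                # DGDA state z_k
z_hat = np.array([1.0, 1.0])                            # dissipation state

for k in range(200):
    z_gda = z_gda - eta * M @ z_gda                     # update (4) with u_k = 0
    z_new = z - eta * M @ z - rho * (z - z_hat)         # update (10)
    z_hat = z_hat - rho * (z_hat - z)                   # update (11)
    z = z_new

print("GDA energy  V(z_200) :", z_gda @ z_gda)          # grows like (1 + eta^2)^k
print("DGDA energy V(xi_200):", z @ z + z_hat @ z_hat)  # decays geometrically
```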

III-B Dissipative GDA Algorithm

In the previous section, we showed how to design a dissipative component (6) to stabilize the GDA updates on a simple bilinear objective. Since the design approach is independent of the particular saddle function f(x,y)=xy, it can be readily applied in more general settings. The algorithm presented below incorporates a similar friction component into a standard GDA update. This modification addresses the inherent limitation of the GDA update when applied to general bilinear problems.

Dissipative gradient descent ascent (DGDA):

\begin{bmatrix}x_{k+1}\\ \hat{x}_{k+1}\\ y_{k+1}\\ \hat{y}_{k+1}\end{bmatrix}=\begin{bmatrix}x_{k}-\eta\nabla_{x}f(x_{k},y_{k})-\rho(x_{k}-\hat{x}_{k})\\ \hat{x}_{k}-\rho(\hat{x}_{k}-x_{k})\\ y_{k}+\eta\nabla_{y}f(x_{k},y_{k})-\rho(y_{k}-\hat{y}_{k})\\ \hat{y}_{k}-\rho(\hat{y}_{k}-y_{k})\end{bmatrix}  (12)

Particularly, for f as in (1), in (12) we introduce two new sets of variables \hat{x}\in\mathbb{R}^{n} and \hat{y}\in\mathbb{R}^{m} and a damping parameter \rho>0. One important observation is that once the system reaches equilibrium, i.e., x_{k+1}=x_{k}, y_{k+1}=y_{k}, \hat{x}_{k+1}=\hat{x}_{k}, \hat{y}_{k+1}=\hat{y}_{k}, one necessarily has \hat{x}_{k}=x_{k} and \hat{y}_{k}=y_{k}, which ensures that any fixed point is a critical point of the saddle function. A minimal implementation sketch of (12) is given below.
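The sketch below implements one synchronized DGDA step (12), assuming access to gradient oracles; the helper names and the diagonal test matrix are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dgda_step(x, x_hat, y, y_hat, grad_x, grad_y, eta, rho):
    """One DGDA update (12); all four states move in parallel."""
    return (x - eta * grad_x(x, y) - rho * (x - x_hat),
            x_hat - rho * (x_hat - x),
            y + eta * grad_y(x, y) - rho * (y - y_hat),
            y_hat - rho * (y_hat - y))

# Example: bilinear f(x, y) = x^T A y, whose unique saddle point is the origin.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
gx = lambda x, y: A @ y                      # \nabla_x f
gy = lambda x, y: A.T @ x                    # \nabla_y f

x, x_hat = np.ones(2), np.ones(2)
y, y_hat = np.ones(2), np.ones(2)
eta, rho = 1.0 / 2.0, 0.5                    # eta = 1/sigma_max(A), rho = 1/2
for _ in range(300):
    x, x_hat, y, y_hat = dgda_step(x, x_hat, y, y_hat, gx, gy, eta, rho)
print(np.linalg.norm(x), np.linalg.norm(y))  # both near 0; GDA would diverge here
```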

In the remaining part of this section, we will first provide several key observations and properties of the proposed DGDA updates. Then, in the next section, we will formally prove that the proposed DGDA algorithm provides a linear convergence guarantee for both bilinear and strongly convex-strongly concave functions. Mainly, we will show that by simply adding a friction or damping term, the DGDA updates succeed in dissipating the energy of the system even in cases where GDA does not.

III-C Key Properties and Related Algorithms

The first important observation is that the DGDA update can be viewed as applying a vanilla GDA update to the following regularized surrogate of f(x,y):

f(x,y,\hat{x},\hat{y}):=f(x,y)+\frac{\rho}{2}\|x-\hat{x}\|^{2}-\frac{\rho}{2}\|y-\hat{y}\|^{2}.  (13)

Note that this is different from the Proximal Point Method [21, 25] and from introducing an L_{2} regularization [26, 27].

Differences with L_{2} regularization  A commonly used method to ensure convergence is to introduce an L_{2} regularization term in x and y [26, 27]:

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}f(x,y)+\frac{\rho}{2}\|x\|^{2}-\frac{\rho}{2}\|y\|^{2}.  (14)

Although the augmented objective function becomes strongly convex-strongly concave and the vanilla GDA updates then converge, this regularization changes the saddle points. While our algorithm also introduces two regularizing terms, the following lemma verifies that the saddle points of f(x,y) and f(x,y,\hat{x},\hat{y}) coincide, with the virtual variables aligned with the original ones.

Lemma 1 (Saddle Point Invariance)

[20, Lemma 6] For Problem (1), a point (x^{*},y^{*}) is a saddle point of f(x,y) if and only if (x^{*},y^{*},\hat{x}^{*},\hat{y}^{*}) is a saddle point of f(x,y,\hat{x},\hat{y}), with \hat{x}^{*}=x^{*} and \hat{y}^{*}=y^{*}.

More interestingly, the regularization terms \frac{\rho}{2}\|x-\hat{x}\|^{2} and \frac{\rho}{2}\|y-\hat{y}\|^{2} do not introduce extra strong convexity-strong concavity to the original problem. Precisely, the augmented problem f(x,y,\hat{x},\hat{y}) is neither strongly convex in (x,\hat{x}) nor strongly concave in (y,\hat{y}). Indeed, on the hyperplane x=\hat{x}, y=\hat{y}, the augmented problem recovers the original one: f(x,y,\hat{x},\hat{y})=f(x,y). Additionally, the regularization terms introduced by the DGDA method are separable and local, thus preserving any distributed structure the original problem may have. Consequently, the method can be seamlessly integrated into a fully distributed approach.

Differences with Proximal Point Method  The Proximal Point Method for saddle point problems [21] shares a similar structure with the DGDA algorithm. In the proximal method, the next iterate (x_{k+1},y_{k+1}) is the unique solution to the saddle point problem

(x_{k+1},y_{k+1})=\mathrm{prox}_{\eta}(x_{k},y_{k})  (15)
:=\arg\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}f(x,y)+\frac{\eta}{2}\|x-x_{k}\|^{2}-\frac{\eta}{2}\|y-y_{k}\|^{2}

Using the optimality conditions of (15), the update of the Proximal Point method can be written as:

x_{k+1}=x_{k}-\eta\nabla_{x}f(x_{k+1},y_{k+1}),  (16)
y_{k+1}=y_{k}+\eta\nabla_{y}f(x_{k+1},y_{k+1}).

This expression shows that the Proximal Point method is an implicit method. Although implicit methods are known to be more stable and to benefit from better convergence properties [25, 6], implementing the above updates requires computing the operators (I+\eta\nabla_{x}f)^{-1} and (I+\eta\nabla_{y}f)^{-1}, which may be computationally intractable. In contrast, the DGDA algorithm is an explicit algorithm that applies a vanilla GDA update to the augmented objective function (13). Thus, as shown in Tables I and II, and in the next section, we compare the convergence of DGDA only with other explicit algorithms for saddle point problems, such as the GDA, Extra-Gradient (EG), and Optimistic GDA (OGDA) methods, which have comparable per-iteration computational requirements; although the DGDA algorithm has twice as many state variables, it only requires a single gradient computation per update. Moreover, there is no need to retain and reuse an extrapolated gradient, which also sets it apart from the OGDA method.

We close this section by comparing DGDA with recent efforts that leverage Moreau-Yosida smoothing techniques to solve nonconvex-concave [28, 29, 30] and nonconvex-nonconcave [31] min-max optimization problems.

Smoothed-GDA method  The Smoothed-GDA method consists of the following updates:

x_{k+1}=x_{k}-c\nabla_{x}K(x_{k},z_{k};y_{k}),  (17)
y_{k+1}=y_{k}+\alpha\nabla_{y}K(x_{k+1},z_{k};y_{k})
z_{k+1}=z_{k}+\beta(x_{k+1}-z_{k})

where K(x,z;y)=f(x,y)+\frac{p}{2}\|x-z\|^{2}. Smoothed-GDA was originally motivated as an ADMM algorithm for solving linearly constrained nonconvex differentiable minimization problems [32], and has been further extended to nonconvex-concave min-max optimization problems [28]. Despite its similarities with DGDA, the iterates of Smoothed-GDA are sequential, first updating x and subsequently (y,z), whereas the DGDA updates are fully parallel or synchronized. Unfortunately, the settings in which this algorithm has been studied differ from the one considered in this paper, which makes a direct comparison with the present work difficult. A thorough comparison of DGDA with (17) is the subject of current research.

IV Convergence Analysis

In this section, we provide a theoretical analysis of the proposed algorithm. For the purpose of our analysis, we consider a quadratic Lyapunov function to track the energy dissipation of the DGDA updates

V_{k}:=\|x_{k}-x^{*}\|^{2}+\|y_{k}-y^{*}\|^{2}+\|\hat{x}_{k}-\hat{x}^{*}\|^{2}+\|\hat{y}_{k}-\hat{y}^{*}\|^{2},

which denotes the squared 2-norm distance to the saddle point at the k-th iteration. The goal is, therefore, to find some 0\leq\alpha<1 such that V_{k+1}\leq\alpha V_{k}, where \alpha denotes the linear convergence rate.

IV-A Convergence Analysis for Bilinear Functions

When applied to the bilinear min-max optimization problem f(x,y)=x^{T}Ay, the DGDA update (12) is equivalent to a linear dynamical system. Specifically, denoting z=[x,y]^{T}, \hat{z}=[\hat{x},\hat{y}]^{T} yields:

\begin{bmatrix}z_{k+1}-z^{*}\\ \hat{z}_{k+1}-\hat{z}^{*}\end{bmatrix}=\begin{bmatrix}(1-\rho)I-\eta M&\rho I\\ \rho I&(1-\rho)I\end{bmatrix}\begin{bmatrix}z_{k}-z^{*}\\ \hat{z}_{k}-\hat{z}^{*}\end{bmatrix},  (18)

where M=\begin{bmatrix}\mathbf{0}&A\\ -A^{T}&\mathbf{0}\end{bmatrix}. As a result, the linear convergence of DGDA, as well as its convergence rate, can be derived from the spectrum of the matrix that defines the DGDA update in (18). This yields the following theorem.

Theorem 2

(Linear convergence of DGDA, bilinear case) Let Assumption 2 hold. Then the DGDA updates (12) with 0<\eta\leq\frac{2\rho}{\sigma_{\max}(A)} and \rho>0 provide linearly converging iterates. That is, there is a constant \beta>0, independent of V_{0}, such that

V_{k}\leq\mathcal{O}\left(\left(1-2\rho+2\rho^{2}+(1-\rho)\sqrt{4\rho^{2}-\eta^{2}\sigma_{\min}^{2}(A)}\right)^{k}\right)V_{0},

Particularly, setting \rho=1/2 and \eta=1/\sigma_{\max}(A), we have

V_{k}\leq\mathcal{O}\left(\left(1-\frac{1}{4\kappa}\right)^{k}\right)V_{0}.  (19)

We remark that linear convergence requires \rho>0. This is not surprising, since GDA, which is known to diverge for bilinear functions, can be interpreted as the DGDA method with \rho=0. More importantly, by choosing the optimal step sizes \rho=1/2, \eta=1/\sigma_{\max}(A), the DGDA method achieves a better linear convergence rate than the EG and OGDA methods (see Table I).
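The rate in (19) can be checked numerically. The sketch below, with an arbitrary random test matrix A (our assumption, for illustration only), builds the iteration matrix in (18) and confirms that its squared spectral radius is bounded by 1-1/(4\kappa).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
smax, smin = np.linalg.svd(A, compute_uv=False)[[0, -1]]
kappa = smax**2 / smin**2

rho, eta = 0.5, 1.0 / smax                    # step sizes from Theorem 2
M = np.block([[np.zeros((n, n)), A], [-A.T, np.zeros((n, n))]])
I = np.eye(2 * n)
T = np.block([[(1 - rho) * I - eta * M, rho * I],
              [rho * I, (1 - rho) * I]])      # iteration matrix in (18)

radius_sq = max(abs(np.linalg.eigvals(T)))**2 # contraction factor for V_k
print(radius_sq, "<=", 1 - 1 / (4 * kappa))   # the bound of Theorem 2 holds
```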

IV-B Convergence Analysis for Strongly Convex-Strongly Concave Functions

We now consider the case of strongly convex-strongly concave min-max problems. Let F(z_{k}):=(\nabla_{x}f(x_{k},y_{k}),-\nabla_{y}f(x_{k},y_{k})). The DGDA updates can be written as follows:

\begin{bmatrix}z_{k+1}\\ \hat{z}_{k+1}\end{bmatrix}=\begin{bmatrix}z_{k}-\eta F(z_{k})-\rho(z_{k}-\hat{z}_{k})\\ \hat{z}_{k}-\rho(\hat{z}_{k}-z_{k})\end{bmatrix}  (20)

Because of the nonlinear term F(z_{k}), we cannot analyze the spectrum as in the previous bilinear case. This is indeed a common challenge in analyzing most optimization algorithms beyond a neighborhood of the fixed point. We circumvent this problem by leveraging recent results on the analysis of variational mappings such as F(\cdot) via integral quadratic constraints [18, 16, 17]. More details can be found in the Appendix, where we prove the following theorem.

Theorem 3

(Linear convergence of DGDA, strongly convex-strongly concave case) Let Assumption 1 hold. Then the updates (12) with \rho=1/2 and \eta=1/(L+\mu) of the DGDA algorithm provide linearly converging iterates:

V_{k}\leq\left(1-\kappa^{-1}+\mathcal{O}(\kappa^{-2})\right)^{k}V_{0}  (21)

Similarly to the bilinear case, we remark on the importance of the dissipation component. When \rho=0, an analysis similar to the proof of the theorem recovers the lower bound on the convergence rate of GDA, (1-\kappa^{-2}), as shown in [19, Sec. 3.1]. Thus, our DGDA method provides a better convergence rate estimate than GDA, since \kappa\in[1,\infty) and therefore \kappa^{-2}\leq\kappa^{-1}.

We remark that while the rate obtained in Theorem 3 is clearly better than those of the EG and OGDA methods for large condition numbers \kappa (see Table II), the theorem does not quantify the comparative performance of DGDA for small values of \kappa. The following corollary shows that the rate of DGDA is, in fact, provably better for all \kappa\geq 2.

Corollary 4 (SCSC, comparison with known rates)

Let Assumption 1 hold, and suppose that L\geq 2\mu, i.e., \kappa\geq 2. Then, the linear convergence rate estimate of DGDA (21) is smaller (better) than that of EG and OGDA, i.e., 1-\kappa^{-1}/4 (Theorems 6 and 7 of [2] and Theorems 4 and 7 of [1]).

V Numerical Experiments

Figure 2: Convergence of GDA, EG, OGDA, and DGDA in terms of the number of gradient evaluations for the bilinear problem. GDA diverges and its error is not shown. The other three algorithms converge linearly, with the DGDA method providing the best performance.

In this section, we compare the performance of the proposed Dissipative GDA (DGDA) method with the Extra-Gradient (EG), Gradient Descent Ascent (GDA), and Optimistic GDA (OGDA) methods.

V-A Bilinear problem

We first consider the following bilinear min-max optimization problem: \min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}x^{T}Ay, where A\in\mathbb{R}^{m\times n} is full rank. The simulation results are illustrated in Figures 2 and 3. In this experiment, we set the dimension of the problem to m=n=10, and the iterates are initialized at x_{0},y_{0}, drawn randomly from the uniform distribution on the open interval (0,1).

We plot the errors (distance to the saddle point) of DGDA, EG, and OGDA versus the number of gradient evaluations for this problem in Figure 2. The solid lines and grey-shaded error bars represent the average trajectories and standard deviations over 20 trials, where in each trial the randomly generated matrix A has a fixed condition number, i.e., \kappa=\sigma^{2}_{\max}(A)/\sigma^{2}_{\min}(A)=25. The key motivation is that the convergence rates of all three algorithms depend critically on \kappa^{-1}; by fixing the condition number, we provide an explicit comparison of their convergence speed.

We pick the step sizes for the different methods according to the theoretical findings. That is, we select \rho=1/2 and \eta=1/\sigma_{\max}(A) for DGDA (Theorem 2), and \eta=1/4L=1/4\sigma_{\max}(A) for EG and OGDA (Theorems 6 and 7 of [2] and Theorems 4 and 7 of [1]). In Figure 2, we do not show the error of GDA since it diverges for this bilinear saddle point problem. The other three algorithms converge linearly, with the DGDA method providing the best performance. A sketch of how test matrices with a fixed condition number can be generated is given below.
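A minimal sketch, under our own assumption about the sampling scheme (the paper does not specify it), of generating a random A with a prescribed condition number via A=U\,\mathrm{diag}(s)\,V^{T}:

```python
import numpy as np

rng = np.random.default_rng(1)
m = n = 10
s = np.linspace(5.0, 1.0, n)                       # sigma_max = 5, sigma_min = 1
U, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal factors
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(s) @ V.T                           # prescribed singular values

sv = np.linalg.svd(A, compute_uv=False)
print((sv[0] / sv[-1])**2)                         # kappa = 25.0 in every trial
```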

Finally, to provide a qualitative demonstration of how DGDA compares with existing algorithms, we further plot sample trajectories of GDA, EG, OGDA, and DGDA on a simple 2D bilinear min-max problem, with m=n=1. In Figure 3, we observe that while GDA diverges, the trajectories of the other three algorithms converge linearly to the saddle point (x^{*},y^{*})=(0,0). Interestingly, our proposed algorithm (DGDA), despite taking larger steps, exhibits faster linear convergence.

Figure 3: Trajectories of GDA, EG, OGDA, and DGDA for a 2D bilinear problem. GDA diverges, while the other three algorithms converge linearly, with the DGDA method providing the best performance.

V-B Strongly convex-strongly concave problem

In the second numerical example, we focus on a strongly convex-strongly concave quadratic problem of the following form:

\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}\frac{1}{2}x^{T}Ax-\frac{1}{2}y^{T}By+x^{T}Cy,  (22)

where the matrices satisfy \mu_{A}I\preceq A\preceq L_{A}I, \mu_{B}I\preceq B\preceq L_{B}I, and \mu_{C}^{2}I\preceq C^{T}C\preceq L_{C}^{2}I. As a result, problem (22) satisfies Assumption 1. In this experiment, we set the dimension of the problem to n=50, m=10, and the iterates are initialized at x_{0},y_{0}, drawn randomly from the uniform distribution on the open interval (0,1). We plot the errors (distance to the saddle point) of GDA, DGDA, EG, and OGDA versus the number of gradient evaluations for this problem in Figure 4. Again, the solid lines and grey-shaded error bars represent the average trajectories and standard deviations over 20 trials, where in each trial the randomly generated matrix \begin{bmatrix}A&C\\ -C^{T}&B\end{bmatrix} is chosen such that the condition number of (22) remains constant, i.e., \kappa=L/\mu=31. As in the bilinear problem of Section V-A, we pick the step sizes for the DGDA method according to our theoretical findings in Theorem 3. The step size of the GDA method is selected as \eta=\mu/L^{2} (Theorem 5 of [33]), and the step sizes for the EG and OGDA methods are selected as \eta=1/4L (Theorems 6 and 7 of [2] and Theorems 4 and 7 of [1]). According to the plots, all algorithms converge linearly, and the DGDA method has the best performance; a minimal sketch of this setup follows.
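A minimal sketch of DGDA on the quadratic saddle problem (22); the specific matrices, dimensions, and iteration count below are illustrative assumptions rather than the paper's test instances.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4
A = np.diag(rng.uniform(1.0, 3.0, n))        # mu_A I <= A <= L_A I
B = np.diag(rng.uniform(1.0, 3.0, m))        # mu_B I <= B <= L_B I
C = rng.standard_normal((n, m))

gx = lambda x, y: A @ x + C @ y              # \nabla_x f
gy = lambda x, y: -B @ y + C.T @ x           # \nabla_y f

J = np.block([[A, C], [-C.T, B]])            # Jacobian of the saddle operator F
mu = min(A.diagonal().min(), B.diagonal().min())  # strong monotonicity constant
L_ = np.linalg.svd(J, compute_uv=False)[0]   # Lipschitz constant of F
eta, rho = 1.0 / (L_ + mu), 0.5              # step sizes from Theorem 3

x, x_hat, y, y_hat = np.ones(n), np.ones(n), np.ones(m), np.ones(m)
for _ in range(500):
    x, x_hat, y, y_hat = (x - eta * gx(x, y) - rho * (x - x_hat),
                          x_hat - rho * (x_hat - x),
                          y + eta * gy(x, y) - rho * (y - y_hat),
                          y_hat - rho * (y_hat - y))
print(np.linalg.norm(x), np.linalg.norm(y))  # both decay to 0 linearly
```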

Figure 4: Convergence of GDA, EG, OGDA, and DGDA in terms of the number of gradient evaluations for problem (22). All algorithms converge linearly, and the DGDA method has the best performance.

VI Conclusion and Future Work

In this work, we present the Dissipative GDA (DGDA) algorithm, a novel method for solving min-max optimization problems. Drawing inspiration from dissipativity theory and control theory, we address the challenge of diverging oscillations in bilinear min-max optimization problems when using the Gradient Descent Ascent (GDA) method. Particularly, we introduce a friction term into the GDA updates that aims to dissipate the internal energy and drive the system towards equilibrium. By incorporating a state-augmented regularization, the proposed DGDA method can be seen as performing standard GDA on an extended saddle function without introducing additional convexity. We further establish the superior convergence rate of the proposed DGDA method compared with other established methods, including GDA, Extra-Gradient (EG), and Optimistic GDA. The analysis is supported by two numerical examples demonstrating its effectiveness in solving saddle point problems. Our future work includes studying the DGDA method in a stochastic setting and its application to solving Constrained Reinforcement Learning problems in the policy space.

References

  • [1] A. Mokhtari, A. Ozdaglar, and S. Pattathil, “A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2020, pp. 1497–1507.
  • [2] W. Azizian, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel, “A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games,” in International conference on artificial intelligence and statistics.   PMLR, 2020, pp. 2863–2873.
  • [3] E. Gorbunov, H. Berard, G. Gidel, and N. Loizou, “Stochastic extragradient: General analysis and improved rates,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 7865–7901.
  • [4] N. Loizou, H. Berard, G. Gidel, I. Mitliagkas, and S. Lacoste-Julien, “Stochastic gradient descent-ascent and consensus optimization for smooth games: Convergence analysis under expected co-coercivity,” Advances in Neural Information Processing Systems, vol. 34, pp. 19 095–19 108, 2021.
  • [5] A. Beznosikov, E. Gorbunov, H. Berard, and N. Loizou, “Stochastic gradient descent-ascent: Unified theory and new efficient methods,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2023, pp. 172–235.
  • [6] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien, “A variational inequality perspective on generative adversarial networks,” arXiv preprint arXiv:1802.10551, 2018.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [8] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, “Training gans with optimism,” arXiv preprint arXiv:1711.00141, 2017.
  • [9] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” arXiv preprint arXiv:1610.01945, 2016.
  • [10] T. Zheng, P. You, and E. Mallada, “Constrained reinforcement learning via dissipative saddle flow dynamics,” in 2022 56th Asilomar Conference on Signals, Systems, and Computers.   IEEE, 2022, pp. 1362–1366.
  • [11] D. Ding, C.-Y. Wei, K. Zhang, and A. Ribeiro, “Last-iterate convergent policy gradient primal-dual methods for constrained mdps,” arXiv preprint arXiv:2306.11700, 2023.
  • [12] J. Adler and S. Lunz, “Banach wasserstein gan,” Advances in neural information processing systems, vol. 31, 2018.
  • [13] E. Altman, Constrained Markov decision processes: stochastic modeling.   Routledge, 1999.
  • [14] S. Sastry, Nonlinear systems: analysis, stability, and control.   Springer Science & Business Media, 2013, vol. 10.
  • [15] Z. E. Nelson and E. Mallada, “An integral quadratic constraint framework for real-time steady-state optimization of linear time-invariant systems,” in 2018 annual American control conference (ACC).   IEEE, 2018, pp. 597–603.
  • [16] B. Hu and L. Lessard, “Dissipativity theory for nesterov’s accelerated method,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1549–1557.
  • [17] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado, “Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems,” SIAM Journal on Optimization, vol. 28, no. 3, pp. 2654–2689, 2018.
  • [18] L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 57–95, 2016.
  • [19] G. Zhang, X. Bao, L. Lessard, and R. Grosse, “A unified analysis of first-order methods for smooth games via integral quadratic constraints,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 4648–4686, 2021.
  • [20] P. You and E. Mallada, “Saddle flow dynamics: Observable certificates and separable regularization,” in 2021 American Control Conference (ACC).   IEEE, 2021, pp. 4817–4823.
  • [21] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM journal on control and optimization, vol. 14, no. 5, pp. 877–898, 1976.
  • [22] P. Tseng, “On linear convergence of iterative methods for the variational inequality problem,” Journal of Computational and Applied Mathematics, vol. 60, no. 1-2, pp. 237–252, 1995.
  • [23] M. Wang, “Randomized linear programming solves the markov decision problem in nearly linear (sometimes sublinear) time,” Mathematics of Operations Research, vol. 45, no. 2, pp. 517–546, 2020.
  • [24] B. Grimmer, H. Lu, P. Worah, and V. Mirrokni, “The landscape of the proximal point method for nonconvex–nonconcave minimax optimization,” Mathematical Programming, vol. 201, no. 1-2, pp. 373–407, 2023.
  • [25] N. Parikh, S. Boyd et al., “Proximal algorithms,” Foundations and trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
  • [26] A. Krogh and J. Hertz, “A simple weight decay can improve generalization,” Advances in neural information processing systems, vol. 4, 1991.
  • [27] T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction.   Springer, 2009, vol. 2.
  • [28] J. Zhang, P. Xiao, R. Sun, and Z. Luo, “A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems,” Advances in neural information processing systems, vol. 33, pp. 7377–7389, 2020.
  • [29] Z. Xu, H. Zhang, Y. Xu, and G. Lan, “A unified single-loop alternating gradient projection algorithm for nonconvex–concave and convex–nonconcave minimax problems,” Mathematical Programming, pp. 1–72, 2023.
  • [30] J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algorithms for minimax optimization without strong concavity,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 5485–5517.
  • [31] T. Zheng, L. Zhu, A. M.-C. So, J. Blanchet, and J. Li, “Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [32] J. Zhang and Z.-Q. Luo, “A proximal alternating direction method of multiplier for linearly constrained nonconvex minimization,” SIAM Journal on Optimization, vol. 30, no. 3, pp. 2272–2302, 2020.
  • [33] A. Beznosikov, B. Polyak, E. Gorbunov, D. Kovalev, and A. Gasnikov, “Smooth monotone stochastic variational inequalities and saddle point problems: A survey,” European Mathematical Society Magazine, no. 127, pp. 15–28, 2023.
  • [34] G. M. Korpelevich, “The extragradient method for finding saddle points and other problems,” Matecon, vol. 12, pp. 747–756, 1976.
  • [35] R. T. Rockafellar, “Augmented lagrangians and applications of the proximal point algorithm in convex programming,” Mathematics of operations research, vol. 1, no. 2, pp. 97–116, 1976.
  • [36] G. Zhang and Y. Yu, “Convergence of gradient methods on bilinear zero-sum games,” in International Conference on Learning Representations, 2019.
  • [37] J. P. Hespanha, Linear systems theory.   Princeton university press, 2018.

-A First Order Algorithms for Saddle Point Problems

In this section, we introduce several popular first-order methods for solving the min-max problem (1) in the machine learning community. Precisely, we focus on the Gradient Descent Ascent (GDA), Extra-Gradient (EG), Optimistic Gradient Descent Ascent (OGDA), Proximal Point (PP), and Smoothed-GDA methods.

Gradient descent ascent (GDA)

x_{k+1}=x_{k}-\eta\nabla_{x}f(x_{k},y_{k}),  (23)
y_{k+1}=y_{k}+\eta\nabla_{y}f(x_{k},y_{k}).

When the problem is strongly convex-strongly concave and L-Lipschitz, the GDA method converges linearly, with step size \eta=\mu/L^{2} and a known rate estimate of 1-1/\kappa^{2} [24, 19]. However, when the problem is bilinear, the standard GDA method fails to converge. Therefore, variants of the gradient method such as the Extra-Gradient and Optimistic Gradient Descent Ascent methods have attracted much attention in the recent literature because of their superior empirical performance in solving min-max optimization problems such as training GANs and solving C-RL problems.

Extra-gradient (EG)

x_{k+1/2}=x_{k}-\eta\nabla_{x}f(x_{k},y_{k}),  (24)
y_{k+1/2}=y_{k}+\eta\nabla_{y}f(x_{k},y_{k}),
x_{k+1}=x_{k}-\eta\nabla_{x}f(x_{k+1/2},y_{k+1/2}),
y_{k+1}=y_{k}+\eta\nabla_{y}f(x_{k+1/2},y_{k+1/2}).

Extra-gradient is a classical method introduced in [34], whose linear rate of convergence for bilinear functions and smooth strongly convex-strongly concave functions has been established in many recent works (see Tables I and II). The Extra-gradient method first computes an extrapolated point (x_{k+1/2},y_{k+1/2}) by performing a GDA update. Then the gradients evaluated at the extrapolated point are used to compute the new iterates (x_{k+1},y_{k+1}).

The linear convergence rate of EG for strongly convex-strongly concave problems has been established, with a standard known rate of 1-1/4\kappa; see, e.g., [2, 1]. One issue with the Extra-gradient method is that, as the name suggests, each update requires evaluating extra gradients at the extrapolated point (x_{k+1/2},y_{k+1/2}), which doubles the computational complexity of the EG method compared to the vanilla GDA method.

Optimistic gradient descent ascent (OGDA)

x_{k+1}=x_{k}-2\eta\nabla_{x}f(x_{k},y_{k})+\eta\nabla_{x}f(x_{k-1},y_{k-1}),  (25)
y_{k+1}=y_{k}+2\eta\nabla_{y}f(x_{k},y_{k})-\eta\nabla_{y}f(x_{k-1},y_{k-1}).

The Optimistic Gradient Descent Ascent (OGDA) method adds a "negative momentum" term to each of the updates, which differentiates it from the vanilla GDA method. Meanwhile, the OGDA method stores and reuses the previously computed gradient, so it only requires a single gradient computation per update.

The convergence properties of OGDA were also recently investigated (refer to Tables I and II), demonstrating linear convergence rates for bilinear functions as well as smooth strongly convex-strongly concave functions. A compact sketch of these baseline updates is given below.
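A compact sketch of the three explicit baselines (GDA, EG, OGDA) on the bilinear toy problem f(x,y):=xy; the step size is an illustrative assumption.

```python
import numpy as np

eta = 0.1
F = lambda z: np.array([z[1], -z[0]])        # (grad_x f, -grad_y f) for f = x*y

def gda_step(z):                             # update (23)
    return z - eta * F(z)

def eg_step(z):                              # update (24): extrapolate, then step
    z_half = z - eta * F(z)
    return z - eta * F(z_half)

def ogda_step(z, z_prev):                    # update (25): reuse past gradient
    return z - 2 * eta * F(z) + eta * F(z_prev)

z_g = z_e = z_o = z_prev = np.array([1.0, 1.0])
for _ in range(200):
    z_g = gda_step(z_g)
    z_e = eg_step(z_e)
    z_o, z_prev = ogda_step(z_o, z_prev), z_o
print(np.linalg.norm(z_g), np.linalg.norm(z_e), np.linalg.norm(z_o))
# GDA's distance to the saddle grows; EG's and OGDA's shrink to zero.
```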

Proximal Point (PP)

The proximal point method for convex minimization has been extensively studied [35, 25] and extended to solve saddle point problems in [21]. Define the iterates {xk+1,yk+1}\{x_{k+1},y_{k+1}\} as the unique solution to the saddle point problem

(x_{k+1},y_{k+1})=\mathrm{prox}_{\eta}(x_{k},y_{k})  (26)
:=\arg\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}f(x,y)+\frac{\eta}{2}\|x-x_{k}\|^{2}-\frac{\eta}{2}\|y-y_{k}\|^{2}

Using the optimality conditions, the update of the Proximal Point method can be written as:

x_{k+1}=x_{k}-\eta\nabla_{x}f(x_{k+1},y_{k+1}),  (27)
y_{k+1}=y_{k}+\eta\nabla_{y}f(x_{k+1},y_{k+1}).

This expression shows that, in contrast to explicit methods such as the GDA, EG, and OGDA methods, the Proximal Point method is an implicit method. Although implicit methods are known to be more stable and to benefit from better convergence properties [25, 6], implementing the above updates requires computing the operators (I+\eta\nabla_{x}f)^{-1} and (I+\eta\nabla_{y}f)^{-1}, which may be computationally intractable. Notably, in [1], the authors show that the EG and OGDA methods can be interpreted as approximations of the PP method and therefore exhibit similar convergence behavior. For bilinear objectives, however, the implicit update admits a closed form, as sketched below.
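In the bilinear special case f(x,y)=x^{T}Ay, the implicit update (27) reduces to solving the linear system (I+\eta M)z_{k+1}=z_{k} with M=\begin{bmatrix}\mathbf{0}&A\\ -A^{T}&\mathbf{0}\end{bmatrix}; the sketch below illustrates this reduction (our own, stated only under that assumption, since PP is intractable in general).

```python
import numpy as np

eta = 0.5
A = np.array([[2.0, 0.0], [0.0, 1.0]])
n, m = A.shape
M = np.block([[np.zeros((n, n)), A], [-A.T, np.zeros((m, m))]])

z = np.ones(n + m)
for _ in range(50):
    z = np.linalg.solve(np.eye(n + m) + eta * M, z)  # implicit PP step (27)
print(np.linalg.norm(z))                             # converges to the saddle 0
```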

Smoothed-GDA method

Finally, we compare DGDA with recent efforts that leverage Moreau-Yosida smoothing techniques to solve nonconvex-concave [28, 29, 30] and nonconvex-nonconcave [31] min-max optimization problems. The Smoothed-GDA method consists of the following updates:

x_{k+1}=x_{k}-c\nabla_{x}K(x_{k},z_{k};y_{k}),  (28)
y_{k+1}=y_{k}+\alpha\nabla_{y}K(x_{k+1},z_{k};y_{k})
z_{k+1}=z_{k}+\beta(x_{k+1}-z_{k})

where K(x,z;y)=f(x,y)+\frac{p}{2}\|x-z\|^{2}.

The Smoothed-GDA method was originally introduced by Zhang and Luo [32] and later extended in [28]. It was motivated by ADMM for solving linearly constrained nonconvex differentiable minimization problems [32], where an extra quadratic proximal term for the equality constraints and an extra sequence \{z_{k}\} are introduced. The authors argue that this smoothing, or exponential averaging, scheme is necessary for the convergence of the proximal ADMM when the objective is nonconvex. The scheme was later further extended to solve nonconvex-concave min-max optimization problems [28].

-B Proof of Theorem 2

We consider, for ease of presentation, the case where A\in\mathbb{R}^{m\times m} is a square matrix. The extension to non-square matrices is straightforward and has been covered in the literature [36, Appendix G]. Applying the updates (12) to f(x,y)=x^{T}Ay and denoting z=[x,y]^{T}, \hat{z}=[\hat{x},\hat{y}]^{T} yields:

\begin{bmatrix}z_{k+1}-z^{*}\\ \hat{z}_{k+1}-\hat{z}^{*}\end{bmatrix}=\begin{bmatrix}z_{k}-\eta Mz_{k}-\rho(z_{k}-\hat{z}_{k})\\ \hat{z}_{k}-\rho(\hat{z}_{k}-z_{k})\end{bmatrix}  (29)
=\begin{bmatrix}(1-\rho)I-\eta M&\rho I\\ \rho I&(1-\rho)I\end{bmatrix}\begin{bmatrix}z_{k}-z^{*}\\ \hat{z}_{k}-\hat{z}^{*}\end{bmatrix},

where

M=\begin{bmatrix}\mathbf{0}&A\\ -A^{T}&\mathbf{0}\end{bmatrix}

According to [2, Lemma 7]

\mathrm{Sp}(M)=\{\pm i\sigma\;|\;\sigma^{2}\in\mathrm{Sp}(AA^{T})\}.

We use \sigma_{\max} and \sigma_{\min} to denote the largest and smallest singular values of the matrix A, respectively; according to Assumption 2, \sigma_{\min}>0. Since M is a normal matrix and hence diagonalizable, we can compute the eigenvalues of the linear system (29) using the following similarity transformation:

\begin{bmatrix}(1-\rho)I-\eta M&\rho I\\ \rho I&(1-\rho)I\end{bmatrix}=  (30)
\begin{bmatrix}U^{-1}&\mathbf{0}\\ \mathbf{0}&U^{-1}\end{bmatrix}\begin{bmatrix}(1-\rho)I-\eta\boldsymbol{\Lambda}&\rho I\\ \rho I&(1-\rho)I\end{bmatrix}\begin{bmatrix}U&\mathbf{0}\\ \mathbf{0}&U\end{bmatrix},

where \boldsymbol{\Lambda}=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{2m}), \lambda_{2j-1}=+i\sigma_{j}, \lambda_{2j}=-i\sigma_{j}, with \pm i\sigma_{j}\in\mathrm{Sp}(M), j\in\{1,\ldots,m\}. In order to show linear convergence, we want to show that \max_{j\in[m]}|\mu_{j}|^{2}<1, where the \mu_{j} are the eigenvalues of the matrix in (30) (i.e., we bound its spectral radius). A straightforward computation leads to

\mu_{j}=\frac{1}{2}\left(2-2\rho-\eta\lambda_{j}\pm\sqrt{\eta^{2}\lambda_{j}^{2}+4\rho^{2}}\right)
=1-\rho\pm i\left(\frac{1}{2}\eta\sigma_{j}\right)\pm\frac{1}{2}\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}  (31)

Since, for a complex number c, |c|^{2}=c\bar{c}, the magnitudes |\mu_{j}|^{2} are given by

|\mu_{j}|^{2}=\left(1-\rho+i\frac{1}{2}\eta\sigma_{j}\pm\frac{1}{2}\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}\right)\times\left(1-\rho-i\frac{1}{2}\eta\sigma_{j}\pm\frac{1}{2}\overline{\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}}\right)
=(1-\rho)^{2}+\frac{1}{4}\eta^{2}\sigma_{j}^{2}+\frac{1}{4}\left|4\rho^{2}-\eta^{2}\sigma_{j}^{2}\right|\pm(1-\rho)\Re\left(\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}\right)\pm\frac{i}{4}\eta\sigma_{j}\overline{\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}}\mp\frac{i}{4}\eta\sigma_{j}\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}
=\begin{cases}1-2\rho+2\rho^{2}\pm(1-\rho)\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}},&\text{if }4\rho^{2}-\eta^{2}\sigma_{j}^{2}\geq 0,\\ 1-2\rho+\frac{1}{2}\eta^{2}\sigma_{j}^{2}\pm\frac{1}{2}\eta\sigma_{j}\sqrt{\eta^{2}\sigma_{j}^{2}-4\rho^{2}},&\text{if }\eta^{2}\sigma_{j}^{2}-4\rho^{2}\geq 0.\end{cases}

Suppose that, for all j\in[m], we choose 0<\eta\leq\frac{2\rho}{\sigma_{\max}}\leq\frac{2\rho}{\sigma_{j}} and \rho>0, which implies 4\rho^{2}-\eta^{2}\sigma_{j}^{2}\geq 0. From the first case above, for all j\in[m] we have

|\mu_{j}|^{2}=1-2\rho+2\rho^{2}\pm(1-\rho)\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}
\leq 1-2\rho+2\rho^{2}+(1-\rho)\sqrt{4\rho^{2}-\eta^{2}\sigma_{j}^{2}}
\leq 1-2\rho+2\rho^{2}+(1-\rho)\sqrt{4\rho^{2}-\eta^{2}\sigma_{\min}^{2}}
<1-2\rho+2\rho^{2}+(1-\rho)\sqrt{4\rho^{2}}
=1.

According to classical linear systems theory, e.g., [37, Theorem 8.3], the above spectral radius analysis of the linear system (29) results in the following linear convergence rate estimate:

V_{k}\leq\mathcal{O}\left(\left(1-2\rho+2\rho^{2}+(1-\rho)\sqrt{4\rho^{2}-\eta^{2}\sigma_{\min}^{2}}\right)^{k}\right)V_{0},

where V_{k}:=\|x_{k}-x^{*}\|^{2}+\|y_{k}-y^{*}\|^{2}+\|\hat{x}_{k}-\hat{x}^{*}\|^{2}+\|\hat{y}_{k}-\hat{y}^{*}\|^{2}.

Furthermore, we want to select the optimal step sizes \rho,\eta. The immediate step is to substitute the optimal \eta=\frac{2\rho}{\sigma_{\max}}, which yields the following inequality:

|\mu_{j}|^{2}\leq 1-2\rho+2\rho^{2}+(1-\rho)\sqrt{4\rho^{2}-\frac{4\rho^{2}}{\sigma_{\max}^{2}}\sigma_{j}^{2}},\quad\forall j\in[m].  (32)

The spectral radius is therefore obtained by choosing \sigma_{j}=\sigma_{\min} above, i.e.,

\max_{j\in[m]}|\mu_{j}|^{2}=2\rho^{2}-2\rho+1+(1-\rho)2\rho\sqrt{1-\frac{\sigma_{\min}^{2}}{\sigma_{\max}^{2}}}
=\rho^{2}\left(2-2\sqrt{1-\frac{\sigma_{\min}^{2}}{\sigma_{\max}^{2}}}\right)-\rho\left(2-2\sqrt{1-\frac{\sigma_{\min}^{2}}{\sigma_{\max}^{2}}}\right)+1
\leq 1-\frac{1}{2}\left(1-\sqrt{1-\frac{\sigma_{\min}^{2}}{\sigma_{\max}^{2}}}\right)
=\frac{1}{2}+\frac{1}{2}\sqrt{1-\frac{\sigma_{\min}^{2}}{\sigma_{\max}^{2}}},

where the last inequality comes from minimizing the quadratic polynomial in \rho, attained at the optimal \rho=\frac{1}{2}. Using the fact that \sqrt{1-x}\leq 1-x/2, we have

\max_{j\in[m]}|\mu_{j}|^{2}\leq 1-\frac{1}{4}\frac{\sigma_{\min}^{2}}{\sigma_{\max}^{2}}  (33)

Again, this results in the following linear convergence rate estimate:

V_{k}\leq\mathcal{O}\left(\left(1-\frac{1}{4\kappa}\right)^{k}\right)V_{0}.  (34)
Remark 1

We could also choose \eta=\frac{2\rho}{\sigma_{\min}} so that \eta^{2}\sigma_{j}^{2}-4\rho^{2}\geq 0, and construct a similar linear convergence rate by repeating the above process. However, in practice we found that the step sizes \eta=\frac{2\rho}{\sigma_{\max}}, \rho=1/2 always perform better in numerical experiments. Therefore, we choose this pair of step sizes by default.

Remark 2

Since the GDA method can be interpreted as the special case of the DGDA method with \rho=0, the above analysis also proves that, for any \eta>0, the GDA method diverges for a bilinear objective function. Specifically, when \rho=0 and \eta>0, we have \eta^{2}\sigma_{j}^{2}-4\rho^{2}>0 and

|\mu_{j}|^{2}=1+\frac{1}{2}\eta^{2}\sigma_{j}^{2}\pm\frac{1}{2}\eta^{2}\sigma_{j}^{2},  (35)

so that the largest eigenvalue magnitude, 1+\eta^{2}\sigma_{j}^{2}, exceeds 1.

-C Proof of Theorem 3

The proof relies on the application of dissipativity theory to construct Lyapunov functions and establish linear convergence. For more detailed information, refer to [16].

Following [16], consider a linear dynamical system of the form:

\xi_{k+1}=A\xi_{k}+Bw_{k}  (36)

Here, \xi_{k}\in\mathbb{R}^{n_{\xi}} is the state, w_{k}\in\mathbb{R}^{n_{w}} is the input, A is the state transition matrix, and B is the input matrix. Suppose that there exist a (Lyapunov) function V satisfying V(\xi)\geq 0 for all \xi\in\mathbb{R}^{n_{\xi}}, some 0\leq\alpha<1, and a supply rate function S(\xi_{k},w_{k})\leq 0 for all k, such that

V(\xi_{k+1})-\alpha^{2}V(\xi_{k})\leq S(\xi_{k},w_{k}).  (37)

This dissipation inequality (37) implies that V(\xi_{k+1})\leq\alpha^{2}V(\xi_{k}), i.e., the state approaches its minimum value at equilibrium no slower than the linear rate \alpha^{2}. The following theorem states how to construct the dissipation inequality (37) by solving a semidefinite program.

Theorem 5

[16, Theorem 2] Consider the following quadratic supply rate with X\in\mathbb{R}^{(n_{\xi}+n_{w})\times(n_{\xi}+n_{w})} and X^{T}=X:

S(\xi,w):=\begin{bmatrix}\xi\\ w\end{bmatrix}^{T}X\begin{bmatrix}\xi\\ w\end{bmatrix}.  (38)

If there exists a matrix P\in\mathbb{R}^{n_{\xi}\times n_{\xi}} with P\succeq 0 such that

\begin{bmatrix}A^{T}PA-\alpha^{2}P&A^{T}PB\\ B^{T}PA&B^{T}PB\end{bmatrix}-X\preceq 0,  (39)

then the dissipation inequality (37) holds for all trajectories of (36) with V(\xi)=\xi^{T}P\xi.

A major benefit of the proposed constructive dissipation approach is that it replaces the troublesome component of a dynamical system (e.g., the gradient term w=\nabla_{\xi}f(\xi)) with a quadratic constraint on its inputs and outputs that is always satisfied, namely the supply rate constraint S(\xi,w)\leq 0. This leads to a novel two-step approach to the convergence analysis of optimization algorithms.

  1. Choose a proper quadratic supply rate function S such that S(\xi_{k},w_{k})\leq 0 for all k, which depends on the specific nonlinear term.

  2. Solve the Linear Matrix Inequality (39) to obtain a storage function V and find the linear convergence rate \alpha.

We now apply this methodology to analyze the DGDA update (12). Let z=[x,y]^{T}, \hat{z}=[\hat{x},\hat{y}]^{T} and F(z_{k})=(\nabla_{x}f(x_{k},y_{k});-\nabla_{y}f(x_{k},y_{k})), and rewrite (12) in the form of (36):

\begin{bmatrix}z_{k+1}\\ \hat{z}_{k+1}\end{bmatrix}=\begin{bmatrix}z_{k}-\eta F(z_{k})-\rho(z_{k}-\hat{z}_{k})\\ \hat{z}_{k}-\rho(\hat{z}_{k}-z_{k})\end{bmatrix}  (40)
=\begin{bmatrix}1-\rho&\rho\\ \rho&1-\rho\end{bmatrix}\begin{bmatrix}z_{k}\\ \hat{z}_{k}\end{bmatrix}+\begin{bmatrix}-\eta\\ 0\end{bmatrix}w_{k}

where w_{k}=F(z_{k}).

According to the previous discussion, the first step is to choose a proper quadratic supply rate function S such that S(\xi_{k},w_{k})\leq 0 for all k, where \xi_{k}=(z_{k};\hat{z}_{k}), which depends on the specific nonlinear term w_{k}=F(z_{k}). According to equation (7) of Hu and Lessard [16] and Lemma 6 of Lessard et al. [18], the following holds for the nonlinear operator F(z_{k}) satisfying Assumption 1 (with z_{k} understood as measured relative to the saddle point z^{*}, so that F(z^{*})=0):

S(z_{k},w_{k})=\begin{bmatrix}z_{k}\\ w_{k}\end{bmatrix}^{T}\begin{bmatrix}2\mu LI&-(\mu+L)I\\ -(\mu+L)I&2I\end{bmatrix}\begin{bmatrix}z_{k}\\ w_{k}\end{bmatrix}\leq 0  (41)

The conditions in Assumption 1 are also commonly referred to as F being L-smooth and \mu-strongly monotone, as in the related literature on variational inequality problems, e.g., [19, 6]. Therefore, we can readily extend the above constraint into the following supply rate function for the DGDA updates by augmenting the state \xi_{k}=(z_{k};\hat{z}_{k}):

S(\xi_{k},w_{k})=  (42)
\begin{bmatrix}z_{k}\\ \hat{z}_{k}\\ w_{k}\end{bmatrix}^{T}\begin{bmatrix}2\mu LI&0&-(\mu+L)I\\ 0&0&0\\ -(\mu+L)I&0&2I\end{bmatrix}\begin{bmatrix}z_{k}\\ \hat{z}_{k}\\ w_{k}\end{bmatrix}\leq 0

which is a proper quadratic supply rate function satisfying S(\xi_{k},w_{k})\leq 0 whenever w_{k}=F(z_{k}).

Finally, according to Theorem 5 and the above discussion, proving linear convergence reduces to finding a positive semidefinite matrix P\in\mathbb{R}^{2(n+m)\times 2(n+m)} and \alpha\in[0,1) such that (39) is satisfied, where the problem parameters are given by

A=\begin{bmatrix}1-\rho&\rho\\ \rho&1-\rho\end{bmatrix}\otimes I,\quad B=\begin{bmatrix}-\eta\\ 0\end{bmatrix}\otimes I,  (43)
X=\begin{bmatrix}2\mu L&0&-(\mu+L)\\ 0&0&0\\ -(\mu+L)&0&2\end{bmatrix}\otimes I,

where \otimes is the Kronecker product. Due to the Kronecker structure of the problem, this is equivalent to solving an LMI of dimension 3-by-3, with design parameters P=\bar{P}\otimes I, \bar{P}\in\mathbb{R}^{2\times 2}, \alpha^{2}\in[0,1), \rho, and \eta. Because this Linear Matrix Inequality is small (3 by 3), it can be solved analytically, which yields the following feasible solution:

\rho=\frac{1}{2},\qquad\eta=\frac{1}{L+\mu},  (44)
\alpha^{2}=\frac{3L^{2}+2L\mu+3\mu^{2}+\sqrt{(L+\mu)^{4}+16L^{2}\mu^{2}}}{4(L+\mu)^{2}},
P=\begin{bmatrix}(L+\mu)^{2}&0\\ 0&(L+\mu)^{2}\end{bmatrix}\otimes I.

After substituting the definition of the condition number \kappa:=L/\mu, the convergence rate \alpha^{2} simplifies to:

\alpha^{2}=1-\kappa^{-1}+\mathcal{O}\left((\mu/L)^{2}\right)  (45)
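Feasibility of (39) at the solution (44) can also be checked numerically; the sketch below (our own check, not from the paper) builds the reduced 3-by-3 LMI matrix and verifies that its largest eigenvalue is numerically zero, consistent with \alpha^{2} being the optimal rate for this storage function.

```python
import numpy as np

L, mu = 2.0, 1.0                              # any L > mu > 0 works here
rho, eta = 0.5, 1.0 / (L + mu)                # solution (44)
alpha2 = (3*L**2 + 2*L*mu + 3*mu**2
          + np.sqrt((L + mu)**4 + 16 * L**2 * mu**2)) / (4 * (L + mu)**2)

A = np.array([[1 - rho, rho], [rho, 1 - rho]])
B = np.array([[-eta], [0.0]])
P = (L + mu)**2 * np.eye(2)                   # storage function V = xi' P xi
X = np.array([[2*mu*L, 0.0, -(mu + L)],
              [0.0, 0.0, 0.0],
              [-(mu + L), 0.0, 2.0]])         # supply rate matrix from (43)

lmi = np.block([[A.T @ P @ A - alpha2 * P, A.T @ P @ B],
                [B.T @ P @ A, B.T @ P @ B]]) - X
print(np.linalg.eigvalsh(lmi).max())          # ~0 up to rounding, so (39) holds
```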

-D Proof of Corollary 4

According to Theorem 3, the linear convergence rate estimate of DGDA is

\frac{3L^{2}+2L\mu+3\mu^{2}+\sqrt{(L+\mu)^{4}+16L^{2}\mu^{2}}}{4(L+\mu)^{2}}  (46)
=\frac{3\kappa^{2}+2\kappa+3+\sqrt{(\kappa+1)^{4}+16\kappa^{2}}}{4(\kappa+1)^{2}},  (47)

where \kappa=L/\mu.

According to Theorems 6 and 7 of [2] and Theorems 4 and 7 of [1], the standard known linear convergence rate estimate of EG and OGDA is

1-\frac{\mu}{4L}=1-\frac{1}{4\kappa}.  (48)

By simple algebraic calculation, it can be shown that, as a function of \kappa, when \kappa\geq 2, the following expression is always nonnegative:

1-\frac{1}{4\kappa}-\frac{3\kappa^{2}+2\kappa+3+\sqrt{(\kappa+1)^{4}+16\kappa^{2}}}{4(\kappa+1)^{2}}\geq 0.  (49)
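As a quick sanity check (ours, not in the paper), inequality (49) can be verified numerically on a dense grid of condition numbers:

```python
import numpy as np

kappa = np.linspace(2.0, 1e4, 100_000)        # all sampled kappa >= 2
dgda = (3*kappa**2 + 2*kappa + 3
        + np.sqrt((kappa + 1)**4 + 16*kappa**2)) / (4 * (kappa + 1)**2)
eg_ogda = 1 - 1 / (4 * kappa)
print(bool((eg_ogda - dgda >= 0).all()))      # True: DGDA's rate is smaller
```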