
Global Optimality in Bivariate Gradient-based DAG Learning

Chang Deng (correspondence to [email protected]), Kevin Bello, Bryon Aragam, Pradeep Ravikumar
Abstract

Recently, a new class of non-convex optimization problems motivated by the statistical problem of learning an acyclic directed graphical model from data has attracted significant interest. While existing work uses standard first-order optimization schemes to solve this problem, proving the global optimality of such approaches has proven elusive. The difficulty lies in the fact that unlike other non-convex problems in the literature, this problem is not “benign”, and possesses multiple spurious solutions that standard approaches can easily get trapped in. In this paper, we prove that a simple path-following optimization scheme globally converges to the global minimum of the population loss in the bivariate setting.

1 Introduction

Over the past decade, non-convex optimization has become a major topic of research within the machine learning community, in part due to the successes of training large-scale models with simple first-order methods such as gradient descent—along with their stochastic and accelerated variants—in spite of the non-convexity of the loss function. A large part of this research has focused on characterizing which problems have benign loss landscapes that are amenable to the use of gradient-based methods, i.e., there are no spurious local minima, or they can be easily avoided. By now, several theoretical results have shown this property for different non-convex problems such as: learning a two hidden unit ReLU network (Wu et al., 2018), learning (deep) over-parameterized quadratic neural networks (Soltanolkotabi et al., 2018; Kazemipour et al., 2019), low-rank matrix recovery (Ge et al., 2017; De Sa et al., 2015; Bhojanapalli et al., 2016), learning a two-layer ReLU network with a single non-overlapping convolutional filter (Brutzkus and Globerson, 2017), semidefinite matrix completion (Boumal et al., 2016; Ge et al., 2016), learning neural networks for binary classification with the addition of a single special neuron (Liang et al., 2018), and learning deep networks with independent ReLU activations (Kawaguchi, 2016; Choromanska et al., 2015), to name a few.

Recently, a new class of non-convex optimization problems due to Zheng et al. (2018) has emerged in the context of learning the underlying structure of a structural equation model (SEM) or Bayesian network. This underlying structure is typically represented by a directed acyclic graph (DAG), which makes the learning task highly complex due to its combinatorial nature. In general, learning DAGs is well-known to be NP-complete (Chickering, 1996; Chickering et al., 2004). The key innovation in Zheng et al. (2018) was the introduction of a differentiable function $h$, whose level set at zero exactly characterizes DAGs, thus replacing the challenges of combinatorial optimization with those of non-convex optimization. Mathematically, this class of non-convex problems takes the following general form:

\min_{\Theta} f(\Theta) \quad \text{subject to} \quad h(W(\Theta)) = 0, \tag{1}

where $\Theta\in\mathbb{R}^{l}$ represents the model parameters, $f\colon\mathbb{R}^{l}\to\mathbb{R}$ is a (possibly non-convex) smooth loss function (sometimes called a score function) that measures the fitness of $\Theta$, and $h\colon\mathbb{R}^{d\times d}\to[0,\infty)$ is a smooth non-convex function that takes the value of zero if and only if the induced weighted adjacency matrix of $d$ nodes, $W(\Theta)$, corresponds to a DAG.

Given the smoothness of $f$ and $h$, problem (1) can be solved using off-the-shelf nonlinear solvers, which has driven a series of remarkable developments in structure learning for DAGs. Multiple empirical studies have demonstrated that global or near-global minimizers of (1) can often be found in a variety of settings, such as linear models with Gaussian and non-Gaussian noise (e.g., Zheng et al., 2018; Ng et al., 2020; Bello et al., 2022), and non-linear models, represented by neural networks, with additive Gaussian noise (e.g., Lachapelle et al., 2020; Zheng et al., 2020; Yu et al., 2019; Bello et al., 2022). The empirical success of learning DAGs via (1), which started with the Notears method of Zheng et al. (2018), bears a resemblance to the success of training deep models, which started with AlexNet for image classification.

Importantly, the reader should note that the majority of applications in ML consist of solving a single unconstrained non-convex problem. In contrast, the class of problems (1) contains a non-convex constraint. Thus, researchers have considered penalty methods such as the augmented Lagrangian (Zheng et al., 2018, 2020), the quadratic penalty (Ng et al., 2022), and a log-barrier (Bello et al., 2022). In all cases, the penalty approach consists of solving a sequence of unconstrained non-convex problems in which the constraint is enforced progressively (see e.g. Bertsekas, 1997, for background). In this work, we will consider the following form of penalty:

\min_{\Theta} g_{\mu_{k}}(\Theta) \coloneqq \mu_{k} f(\Theta) + h(W(\Theta)). \tag{2}

It was shown by Bello et al. (2022) that, due to the invexity property of $h$ (an invex function is any function whose stationary points are all global minima; it is worth noting that the composite objective in (2) is not necessarily invex, even when $f$ is convex), solutions to (2) will converge to a DAG as $\mu_{k}\to 0$. However, no guarantees on local/global optimality were given.

With the above considerations in hand, one is inevitably led to ask the following questions:

  • (i) Are the loss landscapes $g_{\mu_{k}}(\Theta)$ benign for different $\mu_{k}$?

  • (ii) Is there a (tractable) solution path $\{\Theta_{k}\}$ that converges to a global minimum of (1)?

Due to the NP-completeness of learning DAGs, one would expect the answer to (i) to be negative in its most general form. Moreover, it is known from the classical theory of constrained optimization (e.g. Bertsekas, 1997) that if we can exactly and globally optimize (2) for each $\mu_{k}$, then the answer to (ii) is affirmative. This is not a practical algorithm, however, since each subproblem (2) is nonconvex. Thus we seek a solution path that can be tractably computed in practice, e.g. by gradient descent.

In this work, we focus on perhaps the simplest setting where interesting phenomena take place: a linear SEM with two nodes (i.e., $d=2$), $f$ being the population least squares loss (i.e., $f$ is convex), and $\Theta_{k}$ defined via gradient flow with warm starts. More specifically, we consider the case where $\Theta_{k}$ is obtained by following the gradient flow of $g_{\mu_{k}}$ with initial condition $\Theta_{k-1}$.

Under this setting, to answer (i), it is easy to see that for a large enough $\mu_{k}$ the convex function $f$ dominates and we can expect a benign, i.e., (almost) convex, landscape. Similarly, when $\mu_{k}$ approaches zero, the invexity of $h$ kicks in and we could expect all stationary points to be (near) global minimizers. (This transition or path, from an optimizer of a simple function to an optimizer of a function that closely resembles the original constrained formulation, is also known as a homotopy.) That is, at the extremes $\mu_{k}\to\infty$ and $\mu_{k}\to 0$, the landscapes seem well-behaved, and the reader might wonder whether it follows that the landscape is well-behaved for any $\mu_{k}\in[0,\infty)$. We answer the latter in the negative and show that there always exists a $\tau>0$ such that the landscape of $g_{\mu_{k}}$ is non-benign for any $\mu_{k}<\tau$; namely, there exist three stationary points: i) a saddle point, ii) a spurious local minimum, and iii) the global minimum. In addition, each of these stationary points has a wide basin of attraction, thus making the initialization of the gradient flow for $g_{\mu_{k}}$ crucial. Finally, we answer (ii) in the affirmative and provide an explicit scheduling for $\mu_{k}$ that guarantees the asymptotic convergence of $\Theta_{k}$ to the global minimum of (1). Moreover, we show that this scheduling cannot be arbitrary, as there exists a sequence $\{\mu_{k}\}$ that leads $\{\Theta_{k}\}$ to a spurious local minimum.

Overall, we establish the first set of results that study the optimization landscape and global optimality for the class of problems (1). We believe that this comprehensive analysis in the bivariate case provides a valuable starting point for future research in more complex settings.

Remark 1.

We emphasize that solving (1) in the bivariate case is not an inherently difficult problem. Indeed, when there are only two nodes, there are only two DAGs to distinguish, and one can simply fit $f$ under each of the two possible DAGs and select the model with the lowest value of $f$. However, evaluating $f$ for each possible DAG structure clearly cannot scale beyond 10 or 20 nodes, and is not a standard algorithm for solving (1). Instead, our focus here is on studying how (1) is actually solved in practice, namely, by solving unconstrained non-convex problems of the form (2). Previous work suggests that such gradient-based approaches indeed scale well to hundreds and even thousands of nodes (e.g. Zheng et al., 2018; Ng et al., 2020; Bello et al., 2022).

Figure 1: Visualizing the nonconvex landscape. (a) Contour plot: a contour plot of $g_{\mu}$ for $a=0.5$ and $\mu=0.005$ (see Section 2 for definitions); we only show a section of the landscape for better visualization; the solid lines represent the contours, while the dashed lines represent the vector field $-\nabla g_{\mu}$. (b) Stationary points: stationary points of $g_{\mu}$, $r(y;\mu)=0$ and $t(x;\mu)=0$ (see Section 4 for definitions).

1.1 Our Contributions

More specifically, we make the following contributions:

  1. We present a homotopy-based optimization scheme (Algorithm 2) to find global minimizers of the program (1) by iteratively decreasing the penalty coefficient according to a given schedule. Gradient flow is used to find the stationary points of (2) at each step, starting from the previous solution.

  2. We prove that Algorithm 2 converges globally (i.e., regardless of the initialization for $W$) to the global minimum (Theorem 1).

  3. We show that the non-convex program (1) is indeed non-benign, and that naïve implementations of black-box solvers are likely to get trapped in a bad local minimum. See Figure 1(a).

  4. Experimental results verify our theory, consistently recovering the global minimum of (1), regardless of initialization or initial penalty value. We show that our algorithm converges to the global minimum while naïve approaches can get stuck.

The analysis consists of three main parts: First, we explicitly characterize the trajectory of the stationary points of (2). Second, we classify the number and type of all stationary points (Lemma 1) and use this to isolate the desired global minimum. Finally, we apply Lyapunov analysis to identify the basin of attraction for each stationary point, which suggests a schedule for the penalty coefficient that ensures that the gradient flow is initialized within that basin at the previous solution.

1.2 Related Work

The class of problems (1) falls under the umbrella of score-based methods, where given a score function $f$, the goal is to identify the DAG structure with the lowest possible score (Chickering, 2003; Heckerman et al., 1995). We note that DAGs are a very popular structural model in a wide range of domains such as biology (Sachs et al., 2005), genetics (Zhang et al., 2013), and causal inference (Spirtes et al., 2000; Pearl, 2009), to name a few.

Score-based methods that consider the combinatorial constraint.

Given the ample set of score-based methods in the literature, we briefly mention some classical works that attempt to optimize $f$ subject to the combinatorial DAG constraint. In particular, there are approximate algorithms such as the greedy search method of Chickering et al. (2004), order search methods (Teyssier and Koller, 2005; Scanagatta et al., 2015; Park and Klabjan, 2017), the LP-relaxation method of Jaakkola et al. (2010), and the dynamic programming approach of Loh and Bühlmann (2014). There are also exact methods such as GOBNILP (Cussens, 2012) and Bene (Silander and Myllymaki, 2006); however, these algorithms only scale up to $\approx 30$ nodes.

Score-based methods that consider the continuous non-convex constraint $h$.

The following works are the closest to ours, since they attempt to solve a problem of the form (1). Most of these developments either consider optimizing different score functions $f$, such as ordinary least squares (Zheng et al., 2018, 2020), the log-likelihood (Lachapelle et al., 2020; Ng et al., 2020), the evidence lower bound (Yu et al., 2019), or a regret function (Zhu et al., 2020); or consider different differentiable characterizations of acyclicity $h$ (Yu et al., 2019; Bello et al., 2022). However, none of the aforementioned works provide any type of optimality guarantee. Few studies have examined the optimization intricacies of problem (1). Wei et al. (2020) investigated the optimality issues and provided local optimality guarantees under the assumption of convexity of the score $f$ and linear models. On the other hand, Ng et al. (2022) analyzed the convergence to (local) DAGs of generic methods for solving nonlinear constrained problems, such as the augmented Lagrangian and quadratic penalty methods. In contrast to both, our work is the first to study global optimality and the loss landscapes of the methods actually used in practice for solving (1).

Bivariate causal discovery.

Even though in a two-node model the discrete DAG constraint does not pose a major challenge, the bivariate setting has been subject to major research in the area of causal discovery. See for instance (Ni, 2022; Duong and Nguyen, 2022; Mooij et al., 2014; Jiao et al., 2018) and references therein.

Penalty and homotopy methods.

There exist classical global optimality guarantees for the penalty method when $f$ and $h$ are convex functions; see for instance (Bertsekas, 1997; Boyd et al., 2004; Nocedal and Wright, 1999). However, to our knowledge, there are no global optimality guarantees for general classes of non-convex constrained problems, let alone for the specific type of non-convex functions $h$ considered in this work. On the other hand, homotopy methods (also referred to as continuation or embedding methods) are in many cases capable of finding better solutions than standard first-order methods for non-convex problems, albeit they typically do not come with global optimality guarantees either. When homotopy methods do come with global optimality guarantees, they are commonly more computationally intensive, as they involve discarding solutions, thus closely resembling simulated annealing methods; see for instance (Dunlavy and O'Leary, 2005). Hazan et al. (2016) characterize a family of non-convex functions for which a homotopy algorithm provably converges to a global optimum. However, the conditions defining such a family of non-convex functions are difficult to verify and very restrictive; moreover, their homotopy algorithm involves Gaussian smoothing, making it also more computationally intensive than the procedure we study here. Other examples of homotopy methods in machine learning include (Chen et al., 2019; Garrigues and Ghaoui, 2008; Wauthier and Donnelly, 2015; Gargiani et al., 2020; Iwakiri et al., 2022); in all these cases, no global optimality guarantees are given.

2 Preliminaries

The objective $f$ we consider can be easily written down as follows:

f(W) = \frac{1}{2}\,\mathbb{E}_{X}\left[\|X - W^{\top}X\|_{2}^{2}\right], \tag{3}

where $X\in\mathbb{R}^{2}$ is a random vector and $W\in\mathbb{R}^{2\times 2}$. Although not strictly necessary for the developments that follow, we begin by introducing the necessary background on linear SEMs that leads to this objective and the resulting optimization problem of interest.

The bivariate model.

Let $X=(X_{1},X_{2})\in\mathbb{R}^{2}$ denote the random variables in the model, and let $N=(N_{1},N_{2})\in\mathbb{R}^{2}$ denote a vector of independent errors. Then a linear SEM over $X$ is defined as $X = W_{*}^{\top}X + N$, where $W_{*}\in\mathbb{R}^{2\times 2}$ is a weighted adjacency matrix encoding the coefficients in the linear model. In order to represent a valid Bayesian network for $X$ (see e.g. Pearl, 2009; Spirtes et al., 2000, for details), the matrix $W_{*}$ must be acyclic: more formally, the weighted graph induced by the adjacency matrix $W_{*}$ must be a DAG. This (non-convex) acyclicity constraint represents the major computational hurdle that must be overcome in practice (cf. Remark 1).

The goal is to recover the matrix $W_{*}$ from the random vector $X$. Since $W_{*}$ is acyclic, we can assume the diagonal of $W_{*}$ is zero (i.e., no self-loops). Thus, under the bivariate linear model, it then suffices to consider two parameters $x$ and $y$ that define the matrix of parameters (following the notation in (1), for the bivariate model we simply have $\Theta\equiv(x,y)$ and $W(\Theta)\equiv\left(\begin{smallmatrix}0&x\\ y&0\end{smallmatrix}\right)$):

W = W(x,y) = \begin{pmatrix}0 & x\\ y & 0\end{pmatrix}. \tag{4}

For notational simplicity, we will use $f(W)$ and $f(x,y)$ interchangeably, and similarly for $h(W)$ and $h(x,y)$. Without loss of generality, we write the underlying parameter as

W_{*} = \begin{pmatrix}0 & a\\ 0 & 0\end{pmatrix}, \tag{5}

which implies

X = W_{*}^{\top}X + N \implies \begin{cases} X_{1} = N_{1},\\ X_{2} = aX_{1} + N_{2}.\end{cases}

In general, we only require $N_{i}$ to have finite mean and variance; hence we do not assume Gaussianity. We assume that $\mathrm{Var}[N_{1}] = \mathrm{Var}[N_{2}]$, and for simplicity, we consider $\mathbb{E}[N] = 0$ and $\mathrm{Cov}[N] = I$, where $I$ denotes the identity matrix. Finally, in the sequel we assume w.l.o.g. that $a > 0$.
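For concreteness, the following minimal Python sketch simulates this bivariate model; the value of $a$, the sample size, and the Gaussian noise are illustrative choices (only finite mean and variance are required by the model):

import numpy as np

def sample_bivariate_sem(a, n, rng):
    """Draw n samples from X1 = N1, X2 = a*X1 + N2 with E[N] = 0 and Cov[N] = I."""
    N = rng.normal(size=(n, 2))          # Gaussian noise for convenience only
    X1 = N[:, 0]
    X2 = a * X1 + N[:, 1]
    return np.column_stack([X1, X2])

rng = np.random.default_rng(0)
X = sample_bivariate_sem(a=0.5, n=100_000, rng=rng)
print(np.cov(X.T))                       # approximately [[1, a], [a, a^2 + 1]]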

The population least squares.

In this work, we consider the population squared loss defined by (3). If we equivalently write $f$ in terms of $x$ and $y$, then we have $f(W) = ((1-ay)^{2} + y^{2} + (a-x)^{2} + 1)/2$. In fact, the population loss can be replaced by an empirical loss; in that case, our algorithm still attains the global minimum, $W_{\mathsf{G}}$, of problem (6), but the output $W_{\mathsf{G}}$ then serves as an empirical estimate of $W_{*}$. An in-depth discussion of this topic can be found in Appendix B.

The non-convex function hh.

We use the continuous acyclicity characterization of Yu et al. (2019), i.e., $h(W) = \mathrm{Tr}((I + \frac{1}{d}W\circ W)^{d}) - d$, where $\circ$ denotes the Hadamard product. Then, for the bivariate case, we have $h(W) = x^{2}y^{2}/2$. We note that the analysis presented in this work is not tailored to this version of $h$; that is, we can use the same techniques used throughout this work for other existing formulations of $h$, such as the trace of the matrix exponential (Zheng et al., 2018) and the log-det formulation (Bello et al., 2022). Nonetheless, we consider the polynomial formulation of Yu et al. (2019) to be more amenable to the analysis.
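As a quick check of the bivariate formula, the sketch below evaluates the polynomial characterization of Yu et al. (2019) directly; the particular values of $x$ and $y$ are illustrative:

import numpy as np

def h_poly(W):
    """h(W) = Tr((I + W∘W / d)^d) - d, the acyclicity function of Yu et al. (2019)."""
    d = W.shape[0]
    M = np.eye(d) + (W * W) / d          # W∘W is the elementwise (Hadamard) square
    return np.trace(np.linalg.matrix_power(M, d)) - d

x, y = 0.7, 0.3
print(h_poly(np.array([[0.0, x], [y, 0.0]])), x**2 * y**2 / 2)  # equal: x^2 y^2 / 2
print(h_poly(np.array([[0.0, x], [0.0, 0.0]])))                 # 0, since this W is a DAG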

Remark 2.

Our restriction to the bivariate case highlights the simplest setting in which this problem exhibits nontrivial behaviour. Extending our analysis to higher dimensions remains a challenging future direction; however, we emphasize that even in two dimensions this problem is nontrivial. Our approach is similar to that taken in other parts of the literature that started with simple cases (e.g. single-neuron models in deep learning).

Remark 3.

It is worth noting that our choice of the population least squares loss is not arbitrary. Indeed, for linear models with identity error covariance, such as the model considered in this work, it is known that the global minimizer of the population squared loss is unique and corresponds to the underlying matrix $W_{*}$. See Theorem 7 in (Loh and Bühlmann, 2014).

Gluing all the pieces together, we arrive at the following version of (1) for the bivariate case:

\min_{x,y} f(x,y) \coloneqq \frac{1}{2}\left((1-ay)^{2} + y^{2} + (a-x)^{2} + 1\right) \quad \text{subject to} \quad h(x,y) \coloneqq \frac{x^{2}y^{2}}{2} = 0. \tag{6}

Moreover, for any $\mu \geq 0$, we have the corresponding version of (2) expressed as:

\min_{x,y} g_{\mu}(x,y) \coloneqq \mu f(x,y) + h(x,y) = \frac{\mu}{2}\left((1-ay)^{2} + y^{2} + (a-x)^{2} + 1\right) + \frac{x^{2}y^{2}}{2}. \tag{7}

To conclude this section, we present a visualization of the landscape of $g_{\mu}(x,y)$ in Figure 1(a), for $a=0.5$ and $\mu=0.005$. We can clearly observe the non-benign landscape of $g_{\mu}$, i.e., there exists a spurious local minimum, a saddle point, and the global minimum. In particular, we can see that the basin of attraction of the spurious local minimum is comparable to that of the global minimum, which is problematic for a local algorithm such as gradient flow (or gradient descent), as it can easily get trapped in a local minimum if initialized in the wrong basin.
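For reference, a minimal sketch that evaluates the bivariate objective (7) and its gradient, together with a finite-difference sanity check; the test point, $a$, and $\mu$ are illustrative values:

import numpy as np

def g_mu(x, y, mu, a):
    """Penalized objective (7): mu * f(x, y) + h(x, y)."""
    f = 0.5 * ((1 - a * y) ** 2 + y ** 2 + (a - x) ** 2 + 1)
    h = 0.5 * x ** 2 * y ** 2
    return mu * f + h

def grad_g_mu(x, y, mu, a):
    """Gradient of (7): (mu*(x - a) + x*y^2, mu*((a^2 + 1)*y - a) + y*x^2)."""
    return np.array([mu * (x - a) + x * y ** 2,
                     mu * ((a ** 2 + 1) * y - a) + y * x ** 2])

x0, y0, mu, a, eps = 0.3, 0.2, 0.005, 0.5, 1e-6
num = np.array([
    (g_mu(x0 + eps, y0, mu, a) - g_mu(x0 - eps, y0, mu, a)) / (2 * eps),
    (g_mu(x0, y0 + eps, mu, a) - g_mu(x0, y0 - eps, mu, a)) / (2 * eps),
])
print(np.allclose(num, grad_g_mu(x0, y0, mu, a), atol=1e-8))   # True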

3 A Homotopy-Based Approach and Its Convergence to the Global Optimum

To fix notation, let us write $W_{k} := W_{\mu_{k}} \coloneqq \left(\begin{smallmatrix}0 & x_{\mu_{k}}\\ y_{\mu_{k}} & 0\end{smallmatrix}\right)$, and let $W_{\mathsf{G}}$ denote the global minimizer of (6). In this section, we present our main result, which provides conditions under which solving a series of unconstrained problems (7) with first-order methods converges to the global optimum $W_{\mathsf{G}}$ of (6), in spite of facing non-benign landscapes. Recall from Remark 3 that $W_{\mathsf{G}} = \left(\begin{smallmatrix}0 & a\\ 0 & 0\end{smallmatrix}\right)$. Since we use a gradient flow path to connect $W_{\mu_{k}}$ and $W_{\mu_{k+1}}$, we specify this path in Algorithm 1 for clarity. Although the theory here assumes continuous-time gradient flow with $t\to\infty$, see Section 5 for an iteration complexity analysis for (discrete-time) gradient descent, which is a straightforward consequence of the continuous-time theory.

1: set $z(0) = z_{0}$
2: $\frac{d}{dt}z(t) = -\nabla f(z(t))$
3: return $\lim_{t\rightarrow\infty} z(t)$
Algorithm 1: GradientFlow($f, z_{0}$)
Input: Initial $W_{0} = W(x_{0}, y_{0})$, $\mu_{0} \in \left[\frac{a^{2}}{4(a^{2}+1)^{3}}, \frac{a^{2}}{4}\right)$
Output: $\{W_{\mu_{k}}\}_{k=0}^{\infty}$
1: $W_{\mu_{0}} \leftarrow$ GradientFlow($g_{\mu_{0}}, W_{0}$)
2: for $k = 1, 2, \ldots$ do
3:   Let $\mu_{k} = \left(2/a\right)^{2/3}\mu_{k-1}^{4/3}$
4:   $W_{\mu_{k}} \leftarrow$ GradientFlow($g_{\mu_{k}}, W_{\mu_{k-1}}$)
5: end for
Algorithm 2: Homotopy algorithm for solving (1).
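To make the procedure concrete, here is a minimal Python sketch of Algorithm 2 in which the GradientFlow subroutine is approximated by plain gradient descent run to a small gradient norm (as in the discrete-time variant of Section 5.2); the gradient and the step size $1/(\mu(a^{2}+1)+3a^{2})$ follow Section 5.2, and $a$, $\mu_{0}$, the tolerance, and the number of outer iterations are illustrative choices:

import numpy as np

def grad_g(w, mu, a):
    # Gradient of g_mu from (7): (mu*(x-a) + x*y^2, mu*((a^2+1)*y - a) + y*x^2).
    x, y = w
    return np.array([mu * (x - a) + x * y**2,
                     mu * ((a**2 + 1) * y - a) + y * x**2])

def gradient_descent(w, mu, a, tol=1e-8, max_iter=100_000):
    # Surrogate for GradientFlow: descend g_mu until the gradient norm is below tol.
    eta = 1.0 / (mu * (a**2 + 1) + 3 * a**2)   # step size 1/L, cf. Section 5.2
    for _ in range(max_iter):
        g = grad_g(w, mu, a)
        if np.linalg.norm(g) <= tol:
            break
        w = w - eta * g
    return w

a = 0.5
mu = 0.05                            # inside [a^2/(4(a^2+1)^3), a^2/4) for a = 0.5
w = np.array([0.0, 0.0])             # arbitrary initialization W_0
for k in range(15):
    w = gradient_descent(w, mu, a)              # warm start at the previous solution
    mu = (2 / a) ** (2 / 3) * mu ** (4 / 3)     # Line 3 of Algorithm 2
print(w)                             # approaches (x, y) = (a, 0), i.e. W_G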

In Algorithm 2, we provide an explicit regime of initialization for the homotopy parameter $\mu_{0}$ and a specific scheduling for $\mu_{k}$ such that the solution path found by Algorithm 2 converges to the global optimum of (6). This is formally stated in Theorem 1, whose proof is given in Section 5.

Theorem 1.

For any initialization $W_{0}$ and $a\in\mathbb{R}$, the solution path provided in Algorithm 2 converges to the global optimum of (6), i.e.,

\lim_{k\rightarrow\infty} W_{\mu_{k}} = W_{\mathsf{G}}.

A few observations regarding Algorithm 2 are in order. First, observe that when the underlying model parameter $a \gg 0$, the regime of initialization for $\mu_{0}$ is wider; on the other hand, if $a$ is closer to zero, then the interval for $\mu_{0}$ is much narrower. As a concrete example, if $a=2$ then it suffices to have $\mu_{0}\in[0.008,1)$; whereas if $a=0.2$ then the regime is about $\mu_{0}\in[0.0089,0.01)$. This matches the intuition that for a "stronger" value of $a$ it should be easier to detect the right direction of the underlying model. Second, although in Line 3 we set $\mu_{k}$ in a specific manner, it actually suffices to have

\mu_{k} \in \Big[\big(\tfrac{\mu_{k-1}}{2}\big)^{2/3}\big(a^{1/3} - \sqrt{a^{2/3} - (4\mu_{k-1})^{1/3}}\big)^{2},\; \mu_{k-1}\Big).

We simply chose a particular expression from this interval for clarity of presentation; see the proof in Section 5 for details.
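For instance, a quick numeric check (with $a$ and $\mu_{k-1}$ as illustrative values) that the particular choice in Line 3 falls inside this interval:

a, mu_prev = 1.0, 0.01                              # illustrative; needs 4*mu_prev <= a^2
gap = a ** (1 / 3) - (a ** (2 / 3) - (4 * mu_prev) ** (1 / 3)) ** 0.5
lower = (mu_prev / 2) ** (2 / 3) * gap ** 2         # left endpoint of the interval
rule = (2 / a) ** (2 / 3) * mu_prev ** (4 / 3)      # the choice in Line 3 of Algorithm 2
assert lower <= rule < mu_prev                      # the rule is admissible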

As presented, Algorithm 2 is of a theoretical nature in the sense that the initialization for $\mu_{0}$ and the decay rate for $\mu_{k}$ in Line 3 depend on the underlying parameter $a$, which in practice is unknown. In Algorithm 3, we present a modification that is independent of $a$ and $W_{*}$. By instead assuming a lower bound on $a$, which is a standard assumption in the literature, we can prove that Algorithm 3 also converges to the global minimum:

Input: Initial $W_{0} = W(x_{0}, y_{0})$
Output: $\{W_{\mu_{k}}\}_{k=0}^{\infty}$
1: $\mu_{0} \leftarrow 1/27$
2: $W_{\mu_{0}} \leftarrow$ GradientFlow($g_{\mu_{0}}, W_{0}$)
3: for $k = 1, 2, \ldots$ do
4:   Let $\mu_{k} = \left(2/\sqrt{5\mu_{0}}\right)^{2/3}\mu_{k-1}^{4/3}$
5:   $W_{\mu_{k}} \leftarrow$ GradientFlow($g_{\mu_{k}}, W_{\mu_{k-1}}$)
6: end for
Algorithm 3: Practical (i.e., independent of $a$ and $W_{*}$) homotopy algorithm for solving (1).
Corollary 1.

Initialize $\mu_{0} = \frac{1}{27}$. If $a > \sqrt{5/27}$, then for any initialization $W_{0}$, Algorithm 3 outputs the globally optimal solution to (6), i.e.,

\lim_{k\rightarrow\infty} W_{\mu_{k}} = W_{\mathsf{G}}.

For more details on this modification, see Appendix A.

4 A Detailed Analysis of the Evolution of the Stationary Points

The homotopy approach in Algorithm 2 relies heavily on how the stationary points of (7) behave with respect to $\mu_{k}$. In this section, we dive deep into the properties of these critical points.

By analyzing the first-order conditions for $g_{\mu}$, we first narrow our attention to the region $A = \{0 \leq x \leq a,\ 0 \leq y \leq \frac{a}{a^{2}+1}\}$. By solving the resulting equations, we obtain an equation that only involves the variable $y$:

r(y;\mu) = \frac{a}{y} - \frac{\mu a^{2}}{(y^{2}+\mu)^{2}} - (a^{2}+1). \tag{8}

Likewise, we can find an equation only involving the variable $x$:

t(x;\mu) = \frac{a}{x} - \frac{\mu a^{2}}{(\mu(a^{2}+1)+x^{2})^{2}} - 1. \tag{9}

To understand the behavior of the stationary points of $g_{\mu}(W)$, we can examine the characteristics of $t(x;\mu)$ on the range $x\in[0,a]$ and the properties of $r(y;\mu)$ on the interval $y\in[0,\frac{a}{a^{2}+1}]$.
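Since $r$ is a scalar equation in $y$, its roots (and hence the stationary points of $g_{\mu}$) can be located numerically. A minimal sketch, with $a$ and $\mu$ as illustrative values and simple sign-change bracketing as the root-finding strategy; $x$ is recovered from the first-order condition $x = \mu a/(\mu + y^{2})$:

import numpy as np
from scipy.optimize import brentq

def r(y, mu, a):
    # r(y; mu) = a/y - mu*a^2/(y^2 + mu)^2 - (a^2 + 1), Eq. (8)
    return a / y - mu * a**2 / (y**2 + mu)**2 - (a**2 + 1)

def stationary_points(mu, a, n=20000):
    """Roots of r(.; mu) on (0, a/(a^2+1)]; x is recovered via x = mu*a/(mu + y^2)."""
    ys = np.linspace(1e-8, a / (a**2 + 1), n)
    v = r(ys, mu, a)
    roots = [brentq(r, ys[i], ys[i + 1], args=(mu, a))
             for i in range(n - 1) if v[i] * v[i + 1] < 0]
    return [(mu * a / (mu + y**2), y) for y in roots]   # ordered by increasing y

# For mu below the threshold tau there are three stationary points:
# (x*, y*) near (a, 0), the saddle (x**, y**), and (x***, y***) near (0, a/(a^2+1)).
for pt in stationary_points(mu=0.01, a=1.0):
    print(pt)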

Figure 2: The behavior of $r(y;\mu)$ for different $\mu$: (a) $\mu>\tau$, (b) $\mu=\tau$, (c) $\mu<\tau$.

Figure 3: The behavior of $t(x;\mu)$ for different $\mu$: (a) $\mu>\tau$, (b) $\mu=\tau$, (c) $\mu<\tau$.

In Figures 2 and 3, we show the behavior of $r(y;\mu)$ and $t(x;\mu)$ for $a=1$. Theorems 5 and 6 in the appendix establish the existence of a $\tau>0$ with the following useful property:

Corollary 2.

For any $\mu < \tau$, the equation $\nabla g_{\mu}(W) = 0$ has three distinct solutions, denoted $W_{\mu}^{*}, W_{\mu}^{**}, W_{\mu}^{***}$. Moreover,

\lim_{\mu\rightarrow 0} W_{\mu}^{*} = \begin{bmatrix}0 & a\\ 0 & 0\end{bmatrix}, \quad \lim_{\mu\rightarrow 0} W_{\mu}^{**} = \begin{bmatrix}0 & 0\\ 0 & 0\end{bmatrix}, \quad \lim_{\mu\rightarrow 0} W_{\mu}^{***} = \begin{bmatrix}0 & 0\\ \frac{a}{a^{2}+1} & 0\end{bmatrix}.

Note that the interesting regime takes place when $\mu < \tau$. We then characterize each stationary point as either a local minimum or a saddle point:

Lemma 1.

Let $\mu < \tau$. Then $g_{\mu}(W)$ has two local minima, at $W_{\mu}^{*}$ and $W_{\mu}^{***}$, and a saddle point at $W_{\mu}^{**}$.

With the above results, we have established that $W_{\mu}^{*}$ converges to the global minimum $W_{\mathsf{G}}$ as $\mu\rightarrow 0$. In the following section, in the proof of Theorem 1, we perform a thorough analysis of how to track $W^{*}_{\mu}$ and avoid the spurious local minimum at $W^{***}_{\mu}$ by carefully designing the scheduling for $\mu_{k}$.

5 Convergence Analysis: From continuous to discrete

We now discuss the iteration complexity of our method when gradient descent is used in place of gradient flow. We begin with some preliminaries regarding the continuous-time analysis.

5.1 Continuous case: Gradient flow

The key to ensuring the convergence of the gradient flow to $W_{\mu}^{*}$ is to accurately identify the basin of attraction of $W_{\mu}^{*}$. The following lemma provides a region that lies within this basin of attraction.

Lemma 2.

Define $B_{\mu} = \{(x,y) \mid x_{\mu}^{**} < x \leq a,\ 0 \leq y < y_{\mu}^{**}\}$. Run Algorithm 1 with input $f = g_{\mu}(x,y)$ and $z_{0} = (x(0), y(0))$, where $(x(0), y(0)) \in B_{\mu}$. Then, for all $t \geq 0$, we have $(x(t), y(t)) \in B_{\mu}$ and $\lim_{t\rightarrow\infty}(x(t), y(t)) = (x_{\mu}^{*}, y_{\mu}^{*})$.

In Figure 1(b), the lower-right rectangle corresponds to $B_{\mu}$. Lemma 2 implies that the gradient flow with any initialization inside $B_{\mu_{k+1}}$ will converge to $W_{\mu_{k+1}}^{*}$. Hence, by using the previous solution $W^{*}_{\mu_{k}}$ as the initial point, as long as it lies within the region $B_{\mu_{k+1}}$, the gradient flow converges to $W_{\mu_{k+1}}^{*}$, thereby achieving the goal of tracking $W_{\mu_{k+1}}^{*}$. The scheduling for $\mu_{k}$ prescribed in Algorithm 2 provides a sufficient condition to ensure that this happens.
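To make the basin condition tangible, the following self-contained numeric check ($a$ and $\mu_{k}$ are illustrative values, and the roots of $r(\cdot;\mu)$ from Eq. (8) are located by simple bracketing) verifies that the previous solution $(x^{*}_{\mu_{k}}, y^{*}_{\mu_{k}})$ indeed lies inside $B_{\mu_{k+1}}$ under the schedule of Line 3 of Algorithm 2, recovering $x$ from the first-order condition $x = \mu a/(\mu + y^{2})$:

import numpy as np
from scipy.optimize import brentq

def r(y, mu, a):                        # Eq. (8)
    return a / y - mu * a**2 / (y**2 + mu)**2 - (a**2 + 1)

def y_roots(mu, a, n=20000):
    # Bracket sign changes of r(.; mu) on (0, a/(a^2+1)] and refine with brentq.
    ys = np.linspace(1e-8, a / (a**2 + 1), n)
    v = r(ys, mu, a)
    return [brentq(r, ys[i], ys[i + 1], args=(mu, a))
            for i in range(n - 1) if v[i] * v[i + 1] < 0]

a, mu_k = 1.0, 0.01
mu_next = (2 / a) ** (2 / 3) * mu_k ** (4 / 3)      # Line 3 of Algorithm 2

y_star_k = y_roots(mu_k, a)[0]                      # y*_{mu_k}: smallest root
y_saddle_next = y_roots(mu_next, a)[1]              # y**_{mu_{k+1}}: middle root

x_of = lambda y, mu: mu * a / (mu + y**2)           # first-order condition in x
x_star_k, x_saddle_next = x_of(y_star_k, mu_k), x_of(y_saddle_next, mu_next)

# Basin condition of Lemma 2: the warm start (x*_{mu_k}, y*_{mu_k}) lies in B_{mu_{k+1}}
assert x_saddle_next < x_star_k <= a and 0 <= y_star_k < y_saddle_next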

The following lemma, whose proof is in the appendix, is used in the proof of Theorem 1. It provides a lower bound for $y_{\mu}^{**}$ and an upper bound for $y_{\mu}^{*}$.

Lemma 3.

If $\mu < \tau$, then $y_{\mu}^{**} > \sqrt{\mu}$ and $\frac{(4\mu)^{1/3}}{2}\left(a^{1/3} - \sqrt{a^{2/3} - (4\mu)^{1/3}}\right) > y_{\mu}^{*}$.

Proof of Theorem 1.

Consider iteration $k+1$ of Algorithm 2, so that $\mu_{k+1} < \mu_{k}$. If $\mu_{k} > \tau$ and $\mu_{k+1} > \tau$, then $g_{\mu_{k}}(x,y)$ and $g_{\mu_{k+1}}(x,y)$ each have a single stationary point, and thus Algorithm 1 converges to that stationary point. Hence, let us assume $\mu_{k+1} \leq \tau$. From Theorem 6 in the appendix, we know that $x_{\mu_{k+1}}^{**} < x_{\mu_{k}}^{*}$. Then, the following relations hold:

y_{\mu_{k+1}}^{**} \overset{(1)}{>} \sqrt{\mu_{k+1}} \geq 2\left(\frac{\mu_{k}^{2}}{4a}\right)^{1/3} \overset{(2)}{\geq} \frac{(4\mu_{k})^{1/3}}{2}\left(a^{1/3} - \sqrt{a^{2/3} - (4\mu_{k})^{1/3}}\right) \overset{(3)}{>} y_{\mu_{k}}^{*}.

Here (1) and (3) are due to Lemma 3, and (2) follows from $\sqrt{1-x} \geq 1-x$ for $0 \leq x \leq 1$. This implies that $(x_{\mu_{k}}^{*}, y_{\mu_{k}}^{*})$ lies in the region $\{(x,y) \mid x_{\mu_{k+1}}^{**} < x \leq a,\ 0 \leq y < y_{\mu_{k+1}}^{**}\}$. By Lemma 2, Algorithm 1 will then converge to $(x_{\mu_{k+1}}^{*}, y_{\mu_{k+1}}^{*})$. Finally, from Theorems 5 and 6, if $\lim_{k\rightarrow\infty}\mu_{k} = 0$, then $\lim_{k\rightarrow\infty} x_{\mu_{k}}^{*} = a$ and $\lim_{k\rightarrow\infty} y_{\mu_{k}}^{*} = 0$, thus converging to the global optimum, i.e.,

\lim_{k\rightarrow\infty} W_{\mu_{k}} = W_{\mathsf{G}}.

5.2 Discrete case: Gradient Descent

In Algorithms 2 and 4, gradient flow is employed to locate the next stationary point, which is not practically feasible. A viable alternative is to execute Algorithm 2 with the gradient flow replaced by gradient descent. At every iteration $k$, Algorithm 6 uses gradient descent, initialized at $W_{\mu_{k-1},\epsilon_{k-1}}$ and with step size $\eta_{k} = 1/(\mu_{k}(a^{2}+1)+3a^{2})$, to output $W_{\mu_{k},\epsilon_{k}}$, an $\epsilon_{k}$-stationary point of $g_{\mu_{k}}$. The tolerance parameter $\epsilon_{k}$ can significantly influence the behavior of the algorithm and must be controlled across iterations. A convergence guarantee is established via the simplified theorem presented here; a more formal version of the theorem and a comprehensive description of the algorithm (i.e., Algorithm 6) can be found in Appendix C.

Theorem 2 (Informal).

For any $\varepsilon_{\mathrm{dist}} > 0$, set $\mu_{0}$ to satisfy a mild condition, use the update rules $\epsilon_{k} = \min\{\beta a \mu_{k},\ \mu_{k}^{3/2}\}$ and $\mu_{k+1} = (2\mu_{k}^{2})^{2/3}\frac{(a+\epsilon_{k}/\mu_{k})^{2/3}}{(a-\epsilon_{k}/\mu_{k})^{4/3}}$, and let $K \equiv K(\mu_{0}, a, \varepsilon_{\mathrm{dist}}) \in O\left(\ln\frac{\mu_{0}}{a\,\varepsilon_{\mathrm{dist}}}\right)$. Then, for any initialization $W_{0}$, following the update procedure above for $k = 0, \ldots, K$, we have

\|W_{\mu_{k},\epsilon_{k}} - W_{\mathsf{G}}\|_{2} \leq \varepsilon_{\mathrm{dist}},

that is, $W_{\mu_{k},\epsilon_{k}}$ is $\varepsilon_{\mathrm{dist}}$-close in Frobenius norm to the global optimum $W_{\mathsf{G}}$. Moreover, the total number of gradient descent steps is upper bounded by $O\left(\left(\mu_{0}a^{2} + a^{2} + \mu_{0}\right)\left(\frac{1}{a^{6}} + \frac{1}{\varepsilon_{\mathrm{dist}}^{6}}\right)\right)$.
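As a rough, self-contained illustration of these update rules, the sketch below wraps a plain gradient-descent inner loop (the gradient and the step size $\eta_{k}$ follow Section 5.2 and Appendix C) with the $(\mu_{k}, \epsilon_{k})$ schedule above; $a$, $\beta$, $\mu_{0}$, and the number of outer iterations are illustrative choices satisfying the stated condition on $\mu_{0}$:

import numpy as np

def grad_g(w, mu, a):                     # gradient of g_mu, cf. Appendix C
    x, y = w
    return np.array([mu * (x - a) + x * y**2,
                     mu * ((a**2 + 1) * y - a) + y * x**2])

def gd(w, mu, a, eps, max_iter=100_000):  # Algorithm 5 with eta_k = 1/L from Lemma 5
    eta = 1.0 / (mu * (a**2 + 1) + 3 * a**2)
    for _ in range(max_iter):
        g = grad_g(w, mu, a)
        if np.linalg.norm(g) <= eps:
            break
        w = w - eta * g
    return w

a, beta = 0.5, 0.02
mu, w = 0.045, np.array([0.0, 0.0])       # illustrative mu_0 for these a, beta
for k in range(15):
    eps = min(beta * a * mu, mu ** 1.5)   # epsilon_k
    w = gd(w, mu, a, eps)                 # warm start at the previous eps-stationary point
    mu = (2 * mu**2) ** (2 / 3) * (a + eps / mu) ** (2 / 3) / (a - eps / mu) ** (4 / 3)
print(w)                                  # an eps_k-stationary point close to W_G = (a, 0)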

6 Experiments

We conducted experiments to verify that Algorithms 2 and 4 both converge to the global minimum of (6). Our purpose is to illustrate two main points. First, we compare our updating scheme, as given in Line 3 of Algorithm 2, against a faster-decreasing updating scheme for $\mu_{k}$: in Figure 4 we illustrate how a naive, faster decrease of $\mu$ can lead to a spurious local minimum. Second, in Figure 5, we show that regardless of the initialization, Algorithms 2 and 4 always return the global minimum. In the supplementary material, we provide additional experiments where the gradient flow is replaced with gradient descent. For more details, please refer to Appendix F.

Figure 4: Trajectory of the gradient flow path for two different update rules for $\mu_{k}$ with the same initialization and $\mu_{0}$. Here, "good scheduling" uses Line 3 of Algorithm 2, while "bad scheduling" uses a faster-decreasing scheme for $\mu_{k}$, which leads the path to a spurious local minimum.

Figure 5: Trajectory of the gradient flow path for three different random initializations (panels (a)-(c)). We observe that under a proper scheduling for $\mu_{k}$, they all converge to the global minimum.

References

  • Bello et al. (2022) Bello, K., Aragam, B. and Ravikumar, P. (2022), DAGMA: Learning dags via M-matrices and a log-determinant acyclicity characterization, in ‘Advances in Neural Information Processing Systems’.
  • Bertsekas (1997) Bertsekas, D. P. (1997), ‘Nonlinear programming’, Journal of the Operational Research Society 48(3), 334–334.
  • Bhojanapalli et al. (2016) Bhojanapalli, S., Neyshabur, B. and Srebro, N. (2016), ‘Global optimality of local search for low rank matrix recovery’, Advances in Neural Information Processing Systems 29.
  • Boumal et al. (2016) Boumal, N., Voroninski, V. and Bandeira, A. (2016), ‘The non-convex burer-monteiro approach works on smooth semidefinite programs’, Advances in Neural Information Processing Systems 29.
  • Boyd et al. (2004) Boyd, S., Boyd, S. P. and Vandenberghe, L. (2004), Convex optimization, Cambridge university press.
  • Brutzkus and Globerson (2017) Brutzkus, A. and Globerson, A. (2017), Globally optimal gradient descent for a convnet with gaussian inputs, in ‘International conference on machine learning’, PMLR, pp. 605–614.
  • Chen et al. (2019) Chen, W., Drton, M. and Wang, Y. S. (2019), ‘On causal discovery with an equal-variance assumption’, Biometrika 106(4), 973–980.
  • Chickering (1996) Chickering, D. M. (1996), Learning Bayesian networks is NP-complete, in ‘Learning from data’, Springer, pp. 121–130.
  • Chickering (2003) Chickering, D. M. (2003), ‘Optimal structure identification with greedy search’, JMLR 3, 507–554.
  • Chickering et al. (2004) Chickering, D. M., Heckerman, D. and Meek, C. (2004), ‘Large-sample learning of Bayesian networks is NP-hard’, Journal of Machine Learning Research 5, 1287–1330.
  • Choromanska et al. (2015) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and LeCun, Y. (2015), The loss surfaces of multilayer networks, in ‘Artificial intelligence and statistics’, PMLR, pp. 192–204.
  • Cussens (2012) Cussens, J. (2012), ‘Bayesian network learning with cutting planes’, arXiv preprint arXiv:1202.3713 .
  • De Sa et al. (2015) De Sa, C., Re, C. and Olukotun, K. (2015), Global convergence of stochastic gradient descent for some non-convex matrix problems, in ‘International conference on machine learning’, PMLR, pp. 2332–2341.
  • Dontchev et al. (2009) Dontchev, A. L., Rockafellar, R. T. and Rockafellar, R. T. (2009), Implicit functions and solution mappings: A view from variational analysis, Vol. 11, Springer.
  • Dunlavy and O’Leary (2005) Dunlavy, D. M. and O’Leary, D. P. (2005), Homotopy optimization methods for global optimization, Technical report, Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA ….
  • Duong and Nguyen (2022) Duong, B. and Nguyen, T. (2022), Bivariate causal discovery via conditional divergence, in ‘Conference on Causal Learning and Reasoning’.
  • Gargiani et al. (2020) Gargiani, M., Zanelli, A., Tran-Dinh, Q., Diehl, M. and Hutter, F. (2020), ‘Convergence analysis of homotopy-sgd for non-convex optimization’, arXiv preprint arXiv:2011.10298 .
  • Garrigues and Ghaoui (2008) Garrigues, P. and Ghaoui, L. (2008), ‘An homotopy algorithm for the lasso with online observations’, Advances in neural information processing systems 21.
  • Ge et al. (2017) Ge, R., Jin, C. and Zheng, Y. (2017), No spurious local minima in nonconvex low rank problems: A unified geometric analysis, in ‘International Conference on Machine Learning’, PMLR, pp. 1233–1242.
  • Ge et al. (2016) Ge, R., Lee, J. D. and Ma, T. (2016), ‘Matrix completion has no spurious local minimum’, Advances in neural information processing systems 29.
  • Hazan et al. (2016) Hazan, E., Levy, K. Y. and Shalev-Shwartz, S. (2016), On graduated optimization for stochastic non-convex problems, in ‘International conference on machine learning’, PMLR, pp. 1833–1841.
  • Heckerman et al. (1995) Heckerman, D., Geiger, D. and Chickering, D. M. (1995), ‘Learning bayesian networks: The combination of knowledge and statistical data’, Machine learning 20(3), 197–243.
  • Iwakiri et al. (2022) Iwakiri, H., Wang, Y., Ito, S. and Takeda, A. (2022), ‘Single loop gaussian homotopy method for non-convex optimization’, arXiv preprint arXiv:2203.05717 .
  • Jaakkola et al. (2010) Jaakkola, T., Sontag, D., Globerson, A. and Meila, M. (2010), Learning bayesian network structure using lp relaxations, in ‘Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics’, Vol. 9 of Proceedings of Machine Learning Research, pp. 358–365.
  • Jiao et al. (2018) Jiao, R., Lin, N., Hu, Z., Bennett, D. A., Jin, L. and Xiong, M. (2018), ‘Bivariate causal discovery and its applications to gene expression and imaging data analysis’, Frontiers Genetics 9, 347.
  • Kawaguchi (2016) Kawaguchi, K. (2016), ‘Deep learning without poor local minima’, Advances in neural information processing systems 29.
  • Kazemipour et al. (2019) Kazemipour, A., Larsen, B. W. and Druckmann, S. (2019), ‘Avoiding spurious local minima in deep quadratic networks’, arXiv preprint arXiv:2001.00098 .
  • Khalil (2002) Khalil, H. K. (2002), ‘Nonlinear systems third edition’, Patience Hall 115.
  • Lachapelle et al. (2020) Lachapelle, S., Brouillard, P., Deleu, T. and Lacoste-Julien, S. (2020), Gradient-based neural dag learning, in ‘International Conference on Learning Representations’.
  • Liang et al. (2018) Liang, S., Sun, R., Lee, J. D. and Srikant, R. (2018), ‘Adding one neuron can eliminate all bad local minima’, Advances in Neural Information Processing Systems 31.
  • Loh and Bühlmann (2014) Loh, P.-L. and Bühlmann, P. (2014), ‘High-dimensional learning of linear causal networks via inverse covariance estimation’, The Journal of Machine Learning Research 15(1), 3065–3105.
  • Mooij et al. (2014) Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J. and Schölkopf, B. (2014), ‘Distinguishing cause from effect using observational data: methods and benchmarks’, Arxiv .
  • Nesterov et al. (2018) Nesterov, Y. et al. (2018), Lectures on convex optimization, Vol. 137, Springer.
  • Ng et al. (2020) Ng, I., Ghassami, A. and Zhang, K. (2020), On the role of sparsity and dag constraints for learning linear DAGs, in ‘Advances in Neural Information Processing Systems’.
  • Ng et al. (2022) Ng, I., Lachapelle, S., Ke, N. R., Lacoste-Julien, S. and Zhang, K. (2022), On the convergence of continuous constrained optimization for structure learning, in ‘International Conference on Artificial Intelligence and Statistics’, PMLR, pp. 8176–8198.
  • Ni (2022) Ni, Y. (2022), ‘Bivariate causal discovery for categorical data via classification with optimal label permutation’, Arxiv .
  • Nocedal and Wright (1999) Nocedal, J. and Wright, S. J. (1999), Numerical optimization, Springer.
  • Park and Klabjan (2017) Park, Y. W. and Klabjan, D. (2017), ‘Bayesian network learning via topological order’, The Journal of Machine Learning Research 18(1), 3451–3482.
  • Pearl (2009) Pearl, J. (2009), Causality: Models, Reasoning, and Inference, 2nd edn, Cambridge University Press.
  • Sachs et al. (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A. and Nolan, G. P. (2005), ‘Causal protein-signaling networks derived from multiparameter single-cell data’, Science 308(5721), 523–529.
  • Scanagatta et al. (2015) Scanagatta, M., de Campos, C. P., Corani, G. and Zaffalon, M. (2015), Learning bayesian networks with thousands of variables., in ‘NIPS’, pp. 1864–1872.
  • Silander and Myllymaki (2006) Silander, T. and Myllymaki, P. (2006), A simple approach for finding the globally optimal bayesian network structure, in ‘Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence’.
  • Soltanolkotabi et al. (2018) Soltanolkotabi, M., Javanmard, A. and Lee, J. D. (2018), ‘Theoretical insights into the optimization landscape of over-parameterized shallow neural networks’, IEEE Transactions on Information Theory 65(2), 742–769.
  • Spirtes et al. (2000) Spirtes, P., Glymour, C. N., Scheines, R. and Heckerman, D. (2000), Causation, prediction, and search, MIT press.
  • Teyssier and Koller (2005) Teyssier, M. and Koller, D. (2005), Ordering-based search: A simple and effective algorithm for learning bayesian networks, in ‘Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence’.
  • Wauthier and Donnelly (2015) Wauthier, F. and Donnelly, P. (2015), A greedy homotopy method for regression with nonconvex constraints, in ‘Artificial Intelligence and Statistics’, PMLR, pp. 1051–1060.
  • Wei et al. (2020) Wei, D., Gao, T. and Yu, Y. (2020), DAGs with no fears: A closer look at continuous optimization for learning bayesian networks, in ‘Advances in Neural Information Processing Systems’.
  • Wu et al. (2018) Wu, C., Luo, J. and Lee, J. D. (2018), No spurious local minima in a two hidden unit relu network, in ‘International Conference on Learning Representations’.
  • Yu et al. (2019) Yu, Y., Chen, J., Gao, T. and Yu, M. (2019), Dag-gnn: Dag structure learning with graph neural networks, in ‘International Conference on Machine Learning’, PMLR, pp. 7154–7163.
  • Zhang et al. (2013) Zhang, B., Gaiteri, C., Bodea, L.-G., Wang, Z., McElwee, J., Podtelezhnikov, A. A., Zhang, C., Xie, T., Tran, L., Dobrin, R. et al. (2013), ‘Integrated systems approach identifies genetic nodes and networks in late-onset alzheimer’s disease’, Cell 153(3), 707–720.
  • Zheng et al. (2018) Zheng, X., Aragam, B., Ravikumar, P. K. and Xing, E. P. (2018), DAGs with NO TEARS: Continuous optimization for structure learning, in ‘Advances in Neural Information Processing Systems’.
  • Zheng et al. (2020) Zheng, X., Dan, C., Aragam, B., Ravikumar, P. and Xing, E. (2020), Learning sparse nonparametric DAGs, in ‘International Conference on Artificial Intelligence and Statistics’, PMLR, pp. 3414–3425.
  • Zhu et al. (2020) Zhu, S., Ng, I. and Chen, Z. (2020), Causal discovery with reinforcement learning, in ‘International Conference on Learning Representations’.

Appendix A Practical Implementation of Algorithm 2

We present a practical implementation of our homotopy algorithm in Algorithm 4. The updating scheme for $\mu_{k}$ is now independent of the parameter $a$, but, as presented, the initialization for $\mu_{0}$ still depends on $a$. This is for the following reason: it is possible to make the updating scheme independent of $a$ without imposing any additional assumptions on $a$, as evidenced by Lemma 4 below. The initialization for $\mu_{0}$, however, is trickier, and we must consider two separate cases:

  1. No assumptions on $a$. In this case, if $a$ is too small, then the problem becomes harder and the initial choice of $\mu_{0}$ matters.

  2. Lower bound on $a$. If we are willing to accept a lower bound on $a$, then there is an initialization for $\mu_{0}$ that does not depend on $a$.

In Corollary 1, we illustrate this last point with the additional condition that $a > \sqrt{5/27}$. This essentially amounts to an assumption on the minimum signal, and is quite standard in the literature on learning SEMs.

Input: $\mu_{0} \in \left[\frac{a^{2}}{4(a^{2}+1)^{3}}, \frac{a^{2}}{4}\right)$, $\varepsilon > 0$
Output: $\{W_{\mu_{k}}\}_{k=0}^{\infty}$
1: $\widehat{a} \leftarrow \sqrt{4(\mu_{0}+\varepsilon)}$   // for any $\varepsilon \geq 0$ such that $\widehat{a} < a$
2: $W_{\mu_{0}} \leftarrow$ GradientFlow($g_{\mu_{0}}, \bm{0}$)
3: for $k = 1, 2, \ldots$ do
4:   Let $\mu_{k+1} \in \left[\left(2/\widehat{a}\right)^{2/3}\mu_{k}^{4/3}, \mu_{k}\right)$
5:   $W_{\mu_{k+1}} \leftarrow$ GradientFlow($g_{\mu_{k+1}}, W_{\mu_{k}}$)
6: end for
7: return $\{W_{\mu_{k}}\}_{k=0}^{\infty}$
Algorithm 4: Find a path $\{W_{\mu_{k}}\}$ via a particular scheduling for $\mu_{k}$ when $a$ is unknown.
Lemma 4.

Under the assumption $\frac{a^{2}}{4(a^{2}+1)^{3}} \leq \mu_{0} < \frac{a^{2}}{4}$, Algorithm 4 outputs the globally optimal solution to (6), i.e.,

\lim_{k\rightarrow\infty} W_{\mu_{k}} = W_{\mathsf{G}}.

It turns out that the assumption in Lemma 4 is not overly restrictive, as there exist pre-determined sequences $\{\mu_{k}\}_{k=0}^{\infty}$ that ensure the effectiveness of Algorithm 4 for any value of $a$ greater than a certain threshold.

Appendix B From Population Loss to Empirical Loss

The move from the population loss to the empirical loss can be understood in terms of two components. First, for a given empirical loss, Algorithms 2 and 3 still attain the global minimum, $W_{\mathsf{G}}$, of problem (6); the output of the algorithm is now an empirical estimator $\hat{a}$ rather than the ground truth $a$, and Theorem 1 and Corollary 1 continue to hold. Second, the global optimum, $W_{\mathsf{G}}$, of the empirical loss possesses the same DAG structure as the underlying $W_{*}$: the finite-sample results in Section 5 (specifically, Lemmas 18 and 19) of Loh and Bühlmann (2014) offer sufficient conditions on the sample size to ensure that the DAG structures of $W_{\mathsf{G}}$ and $W_{*}$ are identical.
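A minimal sketch of this substitution (the data-generating values and sample size are illustrative): only the sample second-moment matrix enters the empirical score, so the two forms below coincide.

import numpy as np

def f_empirical(W, X):
    """Empirical least squares: (1/2n) * sum_i ||x_i - W^T x_i||_2^2 for rows x_i of X."""
    R = X - X @ W
    return 0.5 * np.mean(np.sum(R**2, axis=1))

def f_via_moments(W, S):
    """Same score written through S = X^T X / n: 0.5 * tr((I - W)^T S (I - W))."""
    M = np.eye(W.shape[0]) - W
    return 0.5 * np.trace(M.T @ S @ M)

rng = np.random.default_rng(0)
a, n = 0.5, 50_000
X1 = rng.normal(size=n)
X2 = a * X1 + rng.normal(size=n)
X = np.column_stack([X1, X2])

W = np.array([[0.0, 0.3], [0.2, 0.0]])
S = X.T @ X / n                                  # uncentered second moment (E[N] = 0)
print(f_empirical(W, X), f_via_moments(W, S))    # agree up to floating-point error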

Appendix C From Continuous to Discrete: Gradient Descent

Previously, gradient flow was employed to solve the intermediate problem (7), which poses implementation challenges in a computational setting. In this section, we introduce Algorithm 6, which uses gradient descent to solve (7) in each iteration. This adjustment serves practical considerations.

Input: function $f$, step size $\eta$, initial point $W_{0}$, tolerance $\epsilon$
Output: $W_{t}$
1: $t \leftarrow 0$
2: while $\|\nabla f(W_{t})\|_{2} > \epsilon$ do
3:   $W_{t+1} \leftarrow W_{t} - \eta\nabla f(W_{t})$
4:   $t \leftarrow t + 1$
5: end while
Algorithm 5: GradientDescent($f, \eta, W_{0}, \epsilon$)
Input: Initial $W_{-1} = W(x_{-1}, y_{-1})$, $\mu_{0} \in \left[\frac{a^{2}}{4(a^{2}+1)^{3}}\frac{(1+\beta)^{4}}{(1-\beta)^{2}}, \frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}\right)$, $\eta_{0} = \frac{1}{\mu_{0}(a^{2}+1)+3a^{2}}$, $\epsilon_{0} = \min\{\beta a\mu_{0}, \mu_{0}^{3/2}\}$
Output: $\{W_{\mu_{k},\epsilon_{k}}\}_{k=0}^{\infty}$
1: $W_{\mu_{0},\epsilon_{0}} \leftarrow$ GradientDescent($g_{\mu_{0}}, \eta_{0}, W_{-1}, \epsilon_{0}$)
2: for $k = 1, 2, \ldots$ do
3:   Let $\mu_{k} = (2\mu_{k-1}^{2})^{2/3}\frac{(a+\epsilon_{k-1}/\mu_{k-1})^{2/3}}{(a-\epsilon_{k-1}/\mu_{k-1})^{4/3}}$
4:   Let $\eta_{k} = \frac{1}{\mu_{k}(a^{2}+1)+3a^{2}}$
5:   Let $\epsilon_{k} = \min\{\beta a\mu_{k}, \mu_{k}^{3/2}\}$
6:   $W_{\mu_{k},\epsilon_{k}} \leftarrow$ GradientDescent($g_{\mu_{k}}, \eta_{k}, W_{\mu_{k-1},\epsilon_{k-1}}, \epsilon_{k}$)
7: end for
Algorithm 6: Homotopy algorithm using gradient descent for solving (1).

We start with the convergence results of Gradient Descent.

Definition 1.

$f$ is $L$-smooth if $f$ is differentiable and $\|\nabla f(x) - \nabla f(y)\|_{2} \leq L\|x-y\|_{2}$ for all $x, y \in \mathrm{dom}(f)$.

Theorem 3 (Nesterov et al. 2018).

If the function $f$ is $L$-smooth, then gradient descent (Algorithm 5) with step size $\eta = 1/L$ finds an $\epsilon$-first-order stationary point (i.e., $\|\nabla f(x)\|_{2} \leq \epsilon$) in $2L(f(x^{0}) - f^{*})/\epsilon^{2}$ iterations.

One of the pivotal factors influencing the convergence of gradient descent is the selection of the step size. Theorem 3 selects a step size $\eta = \frac{1}{L}$. Therefore, our initial step is to determine the smoothness of $g_{\mu}(W)$ within our region of interest, $A = \{0 \leq x \leq a,\ 0 \leq y \leq \frac{a}{a^{2}+1}\}$.

Lemma 5.

Consider the function $g_{\mu}(W)$ defined in Equation (7) on the region $A = \{0 \leq x \leq a,\ 0 \leq y \leq \frac{a}{a^{2}+1}\}$. Then, for all $\mu \geq 0$, the function $g_{\mu}(W)$ is $(\mu(a^{2}+1) + 3a^{2})$-smooth on $A$.

Since gradient descent can only identify an $\epsilon$-stationary point of the function, we study the gradient of $g_{\mu}(W) = \mu f(W) + h(W)$, i.e., $\nabla g_{\mu}(W)$, which has the following form:

\nabla g_{\mu}(W) = \begin{pmatrix}\mu(x-a) + y^{2}x\\ \mu(a^{2}+1)y - a\mu + yx^{2}\end{pmatrix}.

We therefore focus on the region where $\|\nabla g_{\mu}(W)\|_{2} \leq \epsilon$, which can be expressed in the following manner:

\|\nabla g_{\mu}(W)\|_{2} \leq \epsilon \;\Rightarrow\; -\epsilon \leq \mu(x-a) + y^{2}x \leq \epsilon \quad\text{and}\quad -\epsilon \leq \mu(a^{2}+1)y - a\mu + yx^{2} \leq \epsilon.

As a result,

\{(x,y) \mid \|\nabla g_{\mu}(W)\|_{2} \leq \epsilon\} \subseteq \Big\{(x,y) \;\Big|\; \frac{\mu a-\epsilon}{\mu+y^{2}} \leq x \leq \frac{\mu a+\epsilon}{\mu+y^{2}},\; \frac{\mu a-\epsilon}{x^{2}+\mu(a^{2}+1)} \leq y \leq \frac{\mu a+\epsilon}{x^{2}+\mu(a^{2}+1)}\Big\}.

We denote this region by $A_{\mu,\epsilon}$:

A_{\mu,\epsilon} = \Big\{(x,y) \;\Big|\; \frac{\mu a-\epsilon}{\mu+y^{2}} \leq x \leq \frac{\mu a+\epsilon}{\mu+y^{2}},\; \frac{\mu a-\epsilon}{x^{2}+\mu(a^{2}+1)} \leq y \leq \frac{\mu a+\epsilon}{x^{2}+\mu(a^{2}+1)}\Big\}. \tag{10}

Figures 6 and 7 illustrate the region $A_{\mu,\epsilon}$.

Figure 6: An example of $A_{\mu,\epsilon}$ for $a=0.6$, $\mu=0.009$, and $\epsilon=0.00055$. The yellow region signifies the $\epsilon$-stationary points, denoted $A_{\mu,\epsilon}$ and defined by Equation (10). $A_{\mu,\epsilon}$ is the disjoint union of $A^{1}_{\mu,\epsilon}$ and $A^{2}_{\mu,\epsilon}$, which are defined by Equations (21) and (22), respectively.

Figure 7: A localized illustration of $A_{\mu,\epsilon}$ that includes the point $(x_{\mu}^{*}, y_{\mu}^{*})$. This region, referred to as $A^{1}_{\mu,\epsilon}$, is defined in Equation (21).

Given that gradient descent can only locate $\epsilon$-stationary points within the region $A_{\mu,\epsilon}$ at each iteration, the boundary of $A_{\mu,\epsilon}$ becomes a critical component of our analysis. To facilitate a clear presentation, we establish some notation.

  • \begin{align} x &= \frac{\mu a}{\mu+y^{2}}, \tag{11a}\\ y &= \frac{\mu a}{\mu(a^{2}+1)+x^{2}}. \tag{11b} \end{align}

    If this system of equations has a single solution, we denote it by $(x_{\mu}^{*}, y_{\mu}^{*})$. If it has two solutions, we denote them by $(x_{\mu}^{*}, y_{\mu}^{*})$ and $(x_{\mu}^{**}, y_{\mu}^{**})$, with $x_{\mu}^{**} < x_{\mu}^{*}$. If it has three distinct solutions, we denote them by $(x_{\mu}^{*}, y_{\mu}^{*})$, $(x_{\mu}^{**}, y_{\mu}^{**})$, and $(x_{\mu}^{***}, y_{\mu}^{***})$, with $x_{\mu}^{***} < x_{\mu}^{**} < x_{\mu}^{*}$.

  • \begin{align} x &= \frac{\mu a-\epsilon}{\mu+y^{2}}, \tag{12a}\\ y &= \frac{\mu a+\epsilon}{\mu(a^{2}+1)+x^{2}}. \tag{12b} \end{align}

    The solutions of this system are denoted analogously by $(x_{\mu,\epsilon}^{*}, y_{\mu,\epsilon}^{*})$, $(x_{\mu,\epsilon}^{**}, y_{\mu,\epsilon}^{**})$, and $(x_{\mu,\epsilon}^{***}, y_{\mu,\epsilon}^{***})$, depending on whether there are one, two, or three of them, ordered so that $x_{\mu,\epsilon}^{***} < x_{\mu,\epsilon}^{**} < x_{\mu,\epsilon}^{*}$.

  • \begin{align} x &= \frac{\mu a+\epsilon}{\mu+y^{2}}, \tag{13a}\\ y &= \frac{\mu a-\epsilon}{\mu(a^{2}+1)+x^{2}}. \tag{13b} \end{align}

    The solutions of this system are likewise denoted by $(x_{\mu,\epsilon\_}^{*}, y_{\mu,\epsilon\_}^{*})$, $(x_{\mu,\epsilon\_}^{**}, y_{\mu,\epsilon\_}^{**})$, and $(x_{\mu,\epsilon\_}^{***}, y_{\mu,\epsilon\_}^{***})$, ordered so that $x_{\mu,\epsilon\_}^{***} < x_{\mu,\epsilon\_}^{**} < x_{\mu,\epsilon\_}^{*}$.

Remark 4.

There always exists at least one solution to the above system of equations. When μ\mu is sufficiently small, the above system of equations always yields three solutions, as demonstrated in Theorems 5 and 9.
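To make these definitions concrete, the following is a minimal numerical sketch (Python; a=0.6 is an illustrative value matching Figure 6) that locates the solutions of the system (11a)–(11b) by finding the roots of r(y;μ) — whose closed form r(y;μ) = a/y − (a²+1) − μa²/(y²+μ)² is used in the proof of Theorem 5 — and then recovering x from (11a). For a small μ it reports three solutions, while for a larger μ it reports only one.

import numpy as np
from scipy.optimize import brentq

a = 0.6  # illustrative value, matching Figure 6

def r(y, mu):
    # r(y; mu) from (8), obtained by substituting (11a) into (11b)
    return a / y - (a**2 + 1) - mu * a**2 / (y**2 + mu)**2

def solve_system_11(mu, n_grid=20000):
    # bracket sign changes of r on (0, a] and refine each bracket with brentq
    ys = np.linspace(1e-8, a, n_grid)
    vals = r(ys, mu)
    sols = []
    for i in range(n_grid - 1):
        if vals[i] * vals[i + 1] < 0:
            y = brentq(r, ys[i], ys[i + 1], args=(mu,))
            sols.append((mu * a / (mu + y**2), y))   # recover x from (11a)
    return sols

for mu in [0.2, 0.009]:
    sols = solve_system_11(mu)
    print("mu =", mu, "->", len(sols), "solution(s):", sols)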

The parameter ϵ\epsilon can substantially influence the behavior of the systems of equations (12a),(12b) and (13a),(13b), so it is crucial to keep ϵ\epsilon adequately small. To this end, we introduce a new parameter β\beta, whose specific value will be determined later; at this stage, we only require that β\beta lie within the interval (0,1)(0,1). We further constrain ϵ\epsilon to satisfy the following inequality:

ϵβaμ\displaystyle\epsilon\leq\beta a\mu (14)

Following the same procedure as in the case ϵ=0\epsilon=0, we substitute (12a) into (12b) and obtain an equation that involves only the variable yy:

rϵ(y;μ)=\displaystyle r_{\epsilon}(y;\mu)= a+ϵ/μy(a2+1)(μaϵ)2/μ(y2+μ)2\displaystyle\frac{a+\epsilon/\mu}{y}-(a^{2}+1)-\frac{(\mu a-\epsilon)^{2}/\mu}{(y^{2}+\mu)^{2}} (15)

Substituting (12b) into (12a), we obtain an equation that involves only the variable xx:

tϵ(x;μ)=\displaystyle t_{\epsilon}(x;\mu)= aϵ/μx1(μa+ϵ)2/μ(μ(a2+1)+x2)2\displaystyle\frac{a-\epsilon/\mu}{x}-1-\frac{(\mu a+\epsilon)^{2}/\mu}{(\mu(a^{2}+1)+x^{2})^{2}} (16)

Proceeding similarly for equations (13a) and (13b), we obtain

rϵ_(y;μ)=\displaystyle r_{\epsilon\_}(y;\mu)= aϵ/μy(a2+1)(μa+ϵ)2/μ(y2+μ)2\displaystyle\frac{a-\epsilon/\mu}{y}-(a^{2}+1)-\frac{(\mu a+\epsilon)^{2}/\mu}{(y^{2}+\mu)^{2}} (17)
tϵ_(x;μ)=\displaystyle t_{\epsilon\_}(x;\mu)= a+ϵ/μx1(μaϵ)2/μ(μ(a2+1)+x2)2\displaystyle\frac{a+\epsilon/\mu}{x}-1-\frac{(\mu a-\epsilon)^{2}/\mu}{(\mu(a^{2}+1)+x^{2})^{2}} (18)

Given the substantial role that the system of equations (12a) and (12b) plays in our analysis, the presence of ϵ\epsilon in these equations complicates matters. This can be avoided by considering the worst-case scenario, i.e., ϵ=βaμ\epsilon=\beta a\mu. With this particular choice of ϵ\epsilon, we can reformulate (15) and (16) as follows, denoting them as rβ(y;μ)r_{\beta}(y;\mu) and tβ(x;μ)t_{\beta}(x;\mu), respectively.

rβ(y;μ)=\displaystyle r_{\beta}(y;\mu)= a(1+β)y(a2+1)μa2(1β)2(y2+μ)2\displaystyle\frac{a(1+\beta)}{y}-(a^{2}+1)-\frac{\mu a^{2}(1-\beta)^{2}}{(y^{2}+\mu)^{2}} (19)
tβ(x;μ)=\displaystyle t_{\beta}(x;\mu)= a(1β)x1μa2(1+β)2(μ(a2+1)+x2)2\displaystyle\frac{a(1-\beta)}{x}-1-\frac{\mu a^{2}(1+\beta)^{2}}{(\mu(a^{2}+1)+x^{2})^{2}} (20)

The functions rϵ(y;μ)r_{\epsilon}(y;\mu), rϵ_(y;μ)r_{\epsilon\_}(y;\mu), and rβ(y;μ)r_{\beta}(y;\mu) possess properties similar to those of r(y;μ)r(y;\mu) defined in Equation (8); more details are available in Theorems 7 and 8. Additionally, the functions tϵ(x;μ)t_{\epsilon}(x;\mu), tϵ_(x;μ)t_{\epsilon\_}(x;\mu), and tβ(x;μ)t_{\beta}(x;\mu) share similar characteristics with t(x;μ)t(x;\mu) defined in Equation (9), with more details provided in Theorem 9.
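As a quick sanity check of this reformulation, the following sketch verifies numerically that (19) and (20) coincide with (15) and (16) when ε = βaμ; the values of a and β below are arbitrary illustrative choices.

import numpy as np

a, beta = 0.6, 0.2   # illustrative values
rng = np.random.default_rng(0)

def r_eps(y, mu, eps):   # equation (15)
    return (a + eps / mu) / y - (a**2 + 1) - (mu * a - eps)**2 / mu / (y**2 + mu)**2

def t_eps(x, mu, eps):   # equation (16)
    return (a - eps / mu) / x - 1 - (mu * a + eps)**2 / mu / (mu * (a**2 + 1) + x**2)**2

def r_beta(y, mu):       # equation (19)
    return a * (1 + beta) / y - (a**2 + 1) - mu * a**2 * (1 - beta)**2 / (y**2 + mu)**2

def t_beta(x, mu):       # equation (20)
    return a * (1 - beta) / x - 1 - mu * a**2 * (1 + beta)**2 / (mu * (a**2 + 1) + x**2)**2

for _ in range(5):
    mu = rng.uniform(1e-4, 0.05)
    y, x = rng.uniform(1e-3, 0.5, size=2)
    eps = beta * a * mu                     # worst case allowed by (14)
    assert abs(r_eps(y, mu, eps) - r_beta(y, mu)) < 1e-8
    assert abs(t_eps(x, mu, eps) - t_beta(x, mu)) < 1e-8
print("(19) and (20) match (15) and (16) at eps = beta*a*mu")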

As illustrated in Figure 6, the ϵ\epsilon-stationary point region Aμ,ϵA_{\mu,\epsilon} can be partitioned into two distinct areas, of which only the lower-right one contains (xμ,yμ)(x_{\mu}^{*},y_{\mu}^{*}) and is of interest to our analysis. Moreover, (xμ,ϵ,yμ,ϵ)(x^{*}_{\mu,\epsilon},y^{*}_{\mu,\epsilon}) and (xμ,ϵ,yμ,ϵ)(x^{**}_{\mu,\epsilon},y^{**}_{\mu,\epsilon}) are extremal points of the two regions. The upcoming corollary substantiates this intuition.

Corollary 3.

Suppose μ<τ\mu<\tau (τ\tau is defined in Theorem 5(v)), ϵ\epsilon satisfies (14), and β\beta satisfies (1+β1β)2a2+1\left(\frac{1+\beta}{1-\beta}\right)^{2}\leq a^{2}+1. Then the system of equations (12a),(12b) has at least two solutions. Moreover, Aμ,ϵ=Aμ,ϵ1Aμ,ϵ2A_{\mu,\epsilon}=A^{1}_{\mu,\epsilon}\cup A^{2}_{\mu,\epsilon}, where

Aμ,ϵ1\displaystyle A^{1}_{\mu,\epsilon} =Aμ,ϵ{(x,y)xxμ,ϵ,yyμ,ϵ}\displaystyle=A_{\mu,\epsilon}\cap\{(x,y)\mid x\geq x^{*}_{\mu,\epsilon},y\leq y^{*}_{\mu,\epsilon}\} (21)
Aμ,ϵ2\displaystyle A^{2}_{\mu,\epsilon} =Aμ,ϵ{(x,y)xxμ,ϵ,yyμ,ϵ}\displaystyle=A_{\mu,\epsilon}\cap\{(x,y)\mid x\leq x^{**}_{\mu,\epsilon},y\geq y^{**}_{\mu,\epsilon}\} (22)

Corollary 3 states that Aμ,ϵA_{\mu,\epsilon} can be partitioned into two distinct regions, namely Aμ,ϵ1A^{1}_{\mu,\epsilon} and Aμ,ϵ2A^{2}_{\mu,\epsilon}. For every (x,y)(x,y) belonging to Aμ,ϵ1A^{1}_{\mu,\epsilon}, it follows that xxμ,ϵx\geq x^{*}_{\mu,\epsilon} and yyμ,ϵy\leq y^{*}_{\mu,\epsilon}; similarly, for every (x,y)(x,y) that lies within Aμ,ϵ2A^{2}_{\mu,\epsilon}, we have xxμ,ϵx\leq x^{**}_{\mu,\epsilon} and yyμ,ϵy\geq y^{**}_{\mu,\epsilon}. The region Aμ,ϵ1A^{1}_{\mu,\epsilon} is the “correct” region that gradient descent should identify, and identifying the region amounts to pinpointing its extremal points. As a result, our focus is on the extremal points of Aμ,ϵ1A^{1}_{\mu,\epsilon} and Aμ,ϵ2A^{2}_{\mu,\epsilon}, namely (xμ,ϵ,yμ,ϵ)(x^{*}_{\mu,\epsilon},y^{*}_{\mu,\epsilon}) and (xμ,ϵ,yμ,ϵ)(x^{**}_{\mu,\epsilon},y^{**}_{\mu,\epsilon}); a numerical sketch of this decomposition is given below. Furthermore, the key to ensuring that gradient descent converges to Aμ,ϵ1A^{1}_{\mu,\epsilon} is to identify a “basin of attraction” of this region. Lemma 6 below provides a region within which, regardless of the initialization point, gradient descent converges inside Aμ,ϵ1A^{1}_{\mu,\epsilon}.
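To make the decomposition in Corollary 3 concrete, here is a minimal sketch that scans a grid for ε-stationary points and reports the bounding boxes of the two clusters, using the values of Figure 6. Two assumptions are made purely for illustration: the stationarity measure in (10) is taken to be the Euclidean norm of ∇g_μ, and the gradient is taken as ∇g_μ(x,y) = (x(μ+y²) − μa, y(μ(a²+1)+x²) − μa), which is the form consistent with the stationarity system (11a)–(11b) and with the Hessian computed in the proof of Lemma 1; the crude split at x = a/2 separates the two clusters for these particular values.

import numpy as np

a, mu, eps = 0.6, 0.009, 0.00055   # values from Figure 6

X, Y = np.meshgrid(np.linspace(1e-3, 0.7, 600), np.linspace(1e-3, 0.5, 600))
GX = X * (mu + Y**2) - mu * a                  # assumed d g_mu / dx
GY = Y * (mu * (a**2 + 1) + X**2) - mu * a     # assumed d g_mu / dy
stationary = np.sqrt(GX**2 + GY**2) <= eps     # epsilon-stationary points (Euclidean norm)

for name, mask in [("A^1 (lower right, contains (x_mu^*, y_mu^*))", stationary & (X > a / 2)),
                   ("A^2 (upper left)", stationary & (X <= a / 2))]:
    print(name, ": x in [%.3f, %.3f], y in [%.3f, %.3f]"
          % (X[mask].min(), X[mask].max(), Y[mask].min(), Y[mask].max()))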

Lemma 6.

Assume μ<τ\mu<\tau (τ\tau is defined in Theorem 5(v)), (1+β1β)2a2+1\left(\frac{1+\beta}{1-\beta}\right)^{2}\leq a^{2}+1. Define Bμ,ϵ={(x,y)xμ,ϵ<xa,0y<yμ,ϵ}B_{\mu,\epsilon}=\{(x,y)\mid x_{\mu,\epsilon}^{**}<x\leq a,0\leq y<y_{\mu,\epsilon}^{**}\}. Run Algorithm 5 with input f=gμ(x,y),η=1μ(a2+1)+3a2,W0=(x(0),y(0)),f=g_{\mu}(x,y),\eta=\frac{1}{\mu(a^{2}+1)+3a^{2}},W_{0}=(x(0),y(0)), where (x(0),y(0))Bμ,ϵ(x(0),y(0))\in B_{\mu,\epsilon}, then after at most 2(μ(a2+1)+3a2)(gμ(x(0),y(0))gμ(xμ,yμ))ϵ2\frac{2(\mu(a^{2}+1)+3a^{2})(g_{\mu}(x(0),y(0))-g_{\mu}(x_{\mu}^{*},y_{\mu}^{*}))}{\epsilon^{2}} iterations, (xt,yt)Aμ,ϵ1(x_{t},y_{t})\in A^{1}_{\mu,\epsilon}.

Lemma 6 can be considered the gradient descent analogue of Lemma 2 and plays a pivotal role in the proof of Theorem 4. In Figure 6, the lower-right rectangle corresponds to Bμ,ϵB_{\mu,\epsilon}. Lemma 6 implies that gradient descent with any initialization inside Bμk+1,ϵk+1B_{\mu_{k+1},\epsilon_{k+1}} eventually converges to Aμk+1,ϵk+11A^{1}_{\mu_{k+1},\epsilon_{k+1}}. Then, by using the previous solution Wμk,ϵkW_{\mu_{k},\epsilon_{k}} as the initial point, as long as it lies within the region Bμk+1,ϵk+1B_{\mu_{k+1},\epsilon_{k+1}}, gradient descent converges to Aμk+1,ϵk+11A^{1}_{\mu_{k+1},\epsilon_{k+1}}, the ϵ\epsilon-stationary region that contains Wμk+1W_{\mu_{k+1}}^{*}, thereby achieving the goal of tracking Wμk+1W_{\mu_{k+1}}^{*}. The scheduling for μk\mu_{k} prescribed in Algorithm 6 provides a sufficient condition to ensure that this happens; a minimal sketch of the inner gradient descent loop is given below.
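The inner loop of the scheme is plain gradient descent with the step size of Lemma 6, run until an ε-stationary point is reached. The sketch below uses the same assumed gradient of g_μ as above and starts from (a, 0), which lies in B_{μ,ε} for these illustrative values of a, β, μ; by Lemma 6 the iterates land in A^1_{μ,ε}, the ε-stationary region around (x_μ^*, y_μ^*).

import numpy as np

a, beta, mu = 0.6, 0.05, 0.009                 # illustrative values
eps = min(beta * a * mu, mu**1.5)              # tolerance epsilon_k as in Theorem 4

def grad_g(w):
    # gradient of g_mu consistent with (11a)-(11b) and the Hessian in Lemma 1 (an assumption)
    x, y = w
    return np.array([x * (mu + y**2) - mu * a,
                     y * (mu * (a**2 + 1) + x**2) - mu * a])

eta = 1.0 / (mu * (a**2 + 1) + 3 * a**2)       # step size from Lemma 6
w = np.array([a, 0.0])                         # starting point inside B_{mu,eps}
steps = 0
while np.linalg.norm(grad_g(w)) > eps:
    w = w - eta * grad_g(w)
    steps += 1
print("steps:", steps, "epsilon-stationary point:", w)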

We now proceed to present the theorem which guarantees the global convergence of Algorithm 6.

Theorem 4.

If δ(0,1)\delta\in(0,1), β(0,1)\beta\in(0,1), (1+β1β)2(1δ)(a2+1)\left(\frac{1+\beta}{1-\beta}\right)^{2}\leq(1-\delta)(a^{2}+1), and μ0\mu_{0} satisfies

a24(a2+1)3a24(a2+1)3(1+β)4(1β)2μ0a24(1δ)3(1β)4(1+β)2a24\displaystyle\frac{a^{2}}{4(a^{2}+1)^{3}}\leq\frac{a^{2}}{4(a^{2}+1)^{3}}\frac{(1+\beta)^{4}}{(1-\beta)^{2}}\leq\mu_{0}\leq\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}\leq\frac{a^{2}}{4}

Set the updating rules

ϵk=\displaystyle\epsilon_{k}= min{βaμk,μk3/2}\displaystyle\min\{\beta a\mu_{k},\mu_{k}^{3/2}\}
μk+1=\displaystyle\mu_{k+1}= (2μk2)2/3(a+ϵk/μk)2/3(aϵk/μk)4/3\displaystyle(2\mu_{k}^{2})^{2/3}\frac{(a+\epsilon_{k}/\mu_{k})^{2/3}}{(a-\epsilon_{k}/\mu_{k})^{4/3}}

Then μk+1(1δ)μk\mu_{k+1}\leq(1-\delta)\mu_{k}. Moreover, for any εdist>0{\varepsilon_{\mathrm{dist}}}>0, running Algorithm 6 for K(μ0,a,δ,εdist)K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}}) outer iterations yields

Wμk,ϵkW𝖦2εdist\displaystyle\|W_{\mu_{k},\epsilon_{k}}-W_{\mathsf{G}}\|_{2}\leq{\varepsilon_{\mathrm{dist}}} (23)

where

K(μ0,a,δ,εdist)\displaystyle K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})\geq 1ln(1/(1δ))max{lnμ0β2a2,ln72μ0a2(1(1/2)1/4),ln(3(4δ)μ0εdist2),12ln(46656μ02a2εdist2),13ln(46656μ03a4εdist2)}\displaystyle\frac{1}{\ln(1/(1-\delta))}\max\left\{\ln{\frac{\mu_{0}}{\beta^{2}a^{2}}},\ln\frac{72\mu_{0}}{a^{2}(1-(1/2)^{1/4})},\ln(\frac{3(4-\delta)\mu_{0}}{{\varepsilon_{\mathrm{dist}}}^{2}}),\frac{1}{2}\ln(\frac{46656\mu_{0}^{2}}{a^{2}{\varepsilon_{\mathrm{dist}}}^{2}}),\frac{1}{3}\ln(\frac{46656\mu_{0}^{3}}{a^{4}{\varepsilon_{\mathrm{dist}}}^{2}})\right\}

The total number of gradient descent steps is bounded as

k=0K(μ0,a,δ,εdist)2(μk(a2+1)+3a2)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))ϵk2\displaystyle\sum_{k=0}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{2(\mu_{k}(a^{2}+1)+3a^{2})(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\epsilon_{k}^{2}}
\displaystyle\leq 2(μ0(a2+1)+3a2)(1β6a6+(max{3(4δ)εdist2,216aεdist,(216aεdist)2/3,1β2a2,72(1(1/2)1/4)a2})3)gμ0(Wμ0ϵ0)\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\frac{1}{\beta^{6}a^{6}}+\left(\max\{\frac{3(4-\delta)}{{\varepsilon_{\mathrm{dist}}}^{2}},\frac{216}{a{\varepsilon_{\mathrm{dist}}}},\left(\frac{216}{a{\varepsilon_{\mathrm{dist}}}}\right)^{2/3},\frac{1}{\beta^{2}a^{2}},\frac{72}{(1-(1/2)^{1/4})a^{2}}\}\right)^{3}\right)g_{\mu_{0}}(W_{\mu_{0}}^{\epsilon_{0}})
\displaystyle\lesssim O(μ0a2+a2+μ0)(1β6a6+1εdist6+1a3εdist3+1a2εdist2+1a6)\displaystyle O\left(\mu_{0}a^{2}+a^{2}+\mu_{0}\right)\left(\frac{1}{\beta^{6}a^{6}}+\frac{1}{{\varepsilon_{\mathrm{dist}}}^{6}}+\frac{1}{a^{3}{\varepsilon_{\mathrm{dist}}}^{3}}+\frac{1}{a^{2}{\varepsilon_{\mathrm{dist}}}^{2}}+\frac{1}{a^{6}}\right)
Proof.

When gradient flow is replaced by gradient descent, we can only locate an ϵ\epsilon-stationary point of gμ(W)g_{\mu}(W). This modification requires specifying the stepsize η\eta for gradient descent as well as an updating rule for μ\mu, and the adjustment procedure used can substantially influence the result of Algorithm 6. In this proof, we impose conditions on the update scheme for μk\mu_{k}, the stepsize ηk\eta_{k}, and the tolerance ϵk\epsilon_{k} to ensure their effective operation within Algorithm 6. The approach closely mirrors the proof of Theorem 1, albeit with more careful scrutiny. We first work out all the requirements for μ,ϵ,η\mu,\epsilon,\eta, and subsequently verify that the choices in Theorem 4 conform to these requirements.

In the proof, we occasionally use μ,ϵ\mu,\epsilon or μk,ϵk\mu_{k},\epsilon_{k}. When we employ μ,ϵ\mu,\epsilon, it signifies that the given inequality or equality holds for any μ,ϵ\mu,\epsilon. Conversely, when we use μk,ϵk\mu_{k},\epsilon_{k}, it indicates we are examining how to set these parameters for distinct iterations.

Establish the Bound yμ,ϵμy_{\mu,\epsilon}^{**}\geq\sqrt{\mu}

First, let us consider rϵ(μ;μ)0r_{\epsilon}(\sqrt{\mu};\mu)\leq 0, i.e.

rϵ(μ;μ)=a+ϵ/μμ(a2+1)μ(aϵ/μ)24μ20\displaystyle r_{\epsilon}(\sqrt{\mu};\mu)=\frac{a+\epsilon/\mu}{\sqrt{\mu}}-(a^{2}+1)-\frac{\mu(a-\epsilon/\mu)^{2}}{4\mu^{2}}\leq 0

This always holds when μ>4/a2\mu>4/a^{2} (since ϵ/μ<a\epsilon/\mu<a); when μ4/a2\mu\leq 4/a^{2}, we require

ϵ2μ3/2+aμ22aμ5/2μ3a2 when μ4a2\displaystyle\epsilon\leq 2\mu^{3/2}+a\mu-2\sqrt{2a\mu^{5/2}-\mu^{3}a^{2}}\quad\text{ when }\mu\leq\frac{4}{a^{2}}

We record this as Condition 1.

Condition 1.
ϵ2μ3/2+aμ22aμ5/2μ3a2 when μ4a2\displaystyle\epsilon\leq 2\mu^{3/2}+a\mu-2\sqrt{2a\mu^{5/2}-\mu^{3}a^{2}}\quad\text{ when }\mu\leq\frac{4}{a^{2}}

Assume that Condition 1 is satisfied. The function rϵ(y;μ)r_{\epsilon}(y;\mu) is increasing on the interval y[ylb,ϵ,yub,ϵ]y\in[y_{\mathrm{lb},\epsilon},y_{\mathrm{ub},\epsilon}], we know ylb,ϵμyub,ϵy_{\mathrm{lb},\epsilon}\leq\sqrt{\mu}\leq y_{\mathrm{ub},\epsilon}, and based on Theorem 7(ii) we also have ylb,ϵyμ,ϵyub,ϵy_{\mathrm{lb},\epsilon}\leq y_{\mu,\epsilon}^{**}\leq y_{\mathrm{ub},\epsilon}. Hence rϵ(μ;μ)0=rϵ(yμ,ϵ;μ)r_{\epsilon}(\sqrt{\mu};\mu)\leq 0=r_{\epsilon}(y_{\mu,\epsilon}^{**};\mu), and by monotonicity, yμ,ϵμy_{\mu,\epsilon}^{**}\geq\sqrt{\mu}.

Ensuring the Correct Solution Path via Gradient Descent

Following the argument in the proof of Theorem 1, we strive to ensure that gradient descent, when initialized at (xμk,ϵk,yμk,ϵk)(x_{\mu_{k},\epsilon_{k}},y_{\mu_{k},\epsilon_{k}}), converges within the “correct” ϵk+1\epsilon_{k+1}-stationary point region (namely, gμk+1(W)2<ϵk+1\|\nabla g_{\mu_{k+1}}(W)\|_{2}<\epsilon_{k+1}) which includes (xμk+1,yμk+1)(x_{\mu_{k+1}}^{*},y_{\mu_{k+1}}^{*}). For this to occur, we require that:

yμk+1,ϵk+1>(1)yμk+1,ϵk+1>(2)μk+1(3)(2μk2)1/3(a+ϵk/μk)1/3(aϵk/μk)2/3>(4)yμk,ϵk>(5)yμk,ϵk\displaystyle y_{\mu_{k+1},\epsilon_{k+1}}\overset{(1)}{>}{y_{\mu_{k+1},\epsilon_{k+1}}}^{**}\overset{(2)}{>}\sqrt{\mu_{k+1}}\overset{(3)}{\geq}(2\mu_{k}^{2})^{1/3}\frac{(a+\epsilon_{k}/\mu_{k})^{1/3}}{(a-\epsilon_{k}/\mu_{k})^{2/3}}\overset{(4)}{>}{y_{\mu_{k},\epsilon_{k}}}^{*}\overset{(5)}{>}{y_{\mu_{k},\epsilon_{k}}} (24)

Here (1) and (5) are due to Corollary 3; (2) comes from the bound we established above; (3) is based on the constraints placed on μk\mu_{k} and μk+1\mu_{k+1}, which we present as Condition 2 below; (4) follows from Theorem 7(ii) and the relationship yμk,ϵk<ylb,μk,ϵky_{\mu_{k},\epsilon_{k}}^{*}<y_{\mathrm{lb},\mu_{k},\epsilon_{k}}. Also, from Lemma 9, maxμτxμ,ϵminμ>0xμ,ϵ\max_{\mu\leq\tau}x_{\mu,\epsilon}^{**}\leq\min_{\mu>0}x_{\mu,\epsilon}^{*}. Hence, by invoking Lemma 6, we can affirm that our gradient descent consistently tracks the correct stationary point. We now state the conditions needed for this to happen.

Condition 2.
(1δ)μkμk+1(2μk2)2/3(a+ϵk/μk)2/3(aϵk/μk)4/3(1-\delta)\mu_{k}\geq{\mu_{k+1}}\geq(2\mu_{k}^{2})^{2/3}\frac{(a+\epsilon_{k}/\mu_{k})^{2/3}}{(a-\epsilon_{k}/\mu_{k})^{4/3}}

In this context, our requirement extends beyond merely ensuring that μk\mu_{k} decreases. We further stipulate that it should decrease by a factor of 1δ1-\delta. Next, we impose another important constraint

Condition 3.
ϵkμk3/2\epsilon_{k}\leq\mu^{3/2}_{k}

Updating Rules

Now we are ready to check that our updating rules satisfy the conditions above.

ϵk=\displaystyle\epsilon_{k}= min{βaμk,μk3/2}\displaystyle\min\{\beta a\mu_{k},\mu_{k}^{3/2}\}
μk+1=\displaystyle\mu_{k+1}= (2μk2)2/3(a+ϵk/μk)2/3(aϵk/μk)4/3\displaystyle(2\mu_{k}^{2})^{2/3}\frac{(a+\epsilon_{k}/\mu_{k})^{2/3}}{(a-\epsilon_{k}/\mu_{k})^{4/3}}

Check for Conditions

First, we check Condition 2, which requires

(1δ)μk(2μk2)2/3(a+ϵk/μk)2/3(aϵk/μk)4/3μk(a+ϵk/μk)2(aϵk/μk)4(1δ)34\displaystyle(1-\delta)\mu_{k}\geq(2\mu_{k}^{2})^{2/3}\frac{(a+\epsilon_{k}/\mu_{k})^{2/3}}{(a-\epsilon_{k}/\mu_{k})^{4/3}}\Rightarrow\mu_{k}\frac{(a+\epsilon_{k}/\mu_{k})^{2}}{(a-\epsilon_{k}/\mu_{k})^{4}}\leq\frac{(1-\delta)^{3}}{4}

Note that ϵkβaμk<aμk\epsilon_{k}\leq\beta a\mu_{k}<a\mu_{k}

μk(a+ϵk/μk)2(aϵk/μk)4μk(1+β)2(1β)41a2\displaystyle\mu_{k}\frac{(a+\epsilon_{k}/\mu_{k})^{2}}{(a-\epsilon_{k}/\mu_{k})^{4}}\leq\mu_{k}\frac{(1+\beta)^{2}}{(1-\beta)^{4}}\frac{1}{a^{2}}

Therefore, once the following inequality is true, Condition 2 is satisfied.

μk(1+β)2(1β)41a2(1δ)34μka24(1δ)3(1β)4(1+β)2\displaystyle\mu_{k}\frac{(1+\beta)^{2}}{(1-\beta)^{4}}\frac{1}{a^{2}}\leq\frac{(1-\delta)^{3}}{4}\Rightarrow\mu_{k}\leq\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}

Because μkμ0a24(1δ)3(1β)4(1+β)2\mu_{k}\leq\mu_{0}\leq\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}} by the condition imposed on μ0\mu_{0}, Condition 2 is satisfied under our choice of ϵk\epsilon_{k} and μk+1\mu_{k+1}.

Now we focus on Condition 1. Because ϵkaβμk\epsilon_{k}\leq a\beta\mu_{k}, if we can ensure that aβμk2μk3/2+aμk22aμk5/2μk3a2a\beta\mu_{k}\leq 2\mu_{k}^{3/2}+a\mu_{k}-2\sqrt{2a\mu_{k}^{5/2}-\mu_{k}^{3}a^{2}} holds, then Condition 1 is always satisfied.

aβμk\displaystyle a\beta\mu_{k}\leq 2μk3/2+aμk22aμk5/2μk3a2\displaystyle 2\mu_{k}^{3/2}+a\mu_{k}-2\sqrt{2a\mu_{k}^{5/2}-\mu_{k}^{3}a^{2}}
22aμk5/2μk3a2\displaystyle 2\sqrt{2a\mu_{k}^{5/2}-\mu_{k}^{3}a^{2}}\leq 2μk3/2+(1β)aμk\displaystyle 2\mu_{k}^{3/2}+(1-\beta)a\mu_{k}
4(2aμk5/2μk3a2)\displaystyle 4(2a\mu_{k}^{5/2}-\mu_{k}^{3}a^{2})\leq 4μk3+(1β)2a2μk2+4(1β)aμk5/2\displaystyle 4\mu_{k}^{3}+(1-\beta)^{2}a^{2}\mu_{k}^{2}+4(1-\beta)a\mu_{k}^{5/2}
0\displaystyle 0\leq 4(a2+1)μk3+(1β)2a2μk24(1+β)aμk5/2\displaystyle 4(a^{2}+1)\mu_{k}^{3}+(1-\beta)^{2}a^{2}\mu_{k}^{2}-4(1+\beta)a\mu_{k}^{5/2}
0\displaystyle 0\leq 4(a2+1)μk4(1+β)aμk1/2+(1β)2a2when 0μk4/a2\displaystyle 4(a^{2}+1)\mu_{k}-4(1+\beta)a\mu_{k}^{1/2}+(1-\beta)^{2}a^{2}\qquad\text{when }\quad 0\leq\mu_{k}\leq 4/a^{2}
0\displaystyle 0\leq μk(1+β)a(a2+1)μk1/2+(1β)2a24(a2+1)\displaystyle\mu_{k}-\frac{(1+\beta)a}{(a^{2}+1)}\mu_{k}^{1/2}+\frac{(1-\beta)^{2}a^{2}}{4(a^{2}+1)}

We also notice that

(1+β)2a2(a2+1)24(1β)2a24(a2+1)0(1+β1β)2a2+1\displaystyle\frac{(1+\beta)^{2}a^{2}}{(a^{2}+1)^{2}}-4\frac{(1-\beta)^{2}a^{2}}{4(a^{2}+1)}\leq 0\Leftrightarrow\left(\frac{1+\beta}{1-\beta}\right)^{2}\leq a^{2}+1

Because (1+β1β)2(1δ)(a2+1)\left(\frac{1+\beta}{1-\beta}\right)^{2}\leq(1-\delta)(a^{2}+1), the inequality above always holds and this inequality implies that for any μk0\mu_{k}\geq 0

0μk(1+β)a(a2+1)μk1/2+(1β)2a24(a2+1)0\leq\mu_{k}-\frac{(1+\beta)a}{(a^{2}+1)}\mu_{k}^{1/2}+\frac{(1-\beta)^{2}a^{2}}{4(a^{2}+1)}

Therefore, Condition 1 holds. Condition 3 also holds by the choice of ϵk\epsilon_{k}.

Bound the Distance

Let c=72/a2c=72/a^{2}, and assume that μ\mu satisfies the following

μ\displaystyle\mu\leq min{1c(1(1/2)1/4),β2a2}\displaystyle\min\{\frac{1}{c}\left(1-(1/2)^{1/4}\right),\beta^{2}a^{2}\} (25)

Note that when μ\mu satisfies (25), then μ3/2βaμ\mu^{3/2}\leq\beta a\mu, so ϵ=μ3/2\epsilon=\mu^{3/2}.

μ1c(1(1/2)1/4)=a272(1(1/2)1/4)a24\mu\leq\frac{1}{c}\left(1-(1/2)^{1/4}\right)=\frac{a^{2}}{72}\left(1-(1/2)^{1/4}\right)\leq\frac{a^{2}}{4}
ϵ/μ=μa2\displaystyle\epsilon/\mu=\sqrt{\mu}\leq\frac{a}{2} (26)

Then

tϵ((aϵ/μ)(1cμ);μ)=\displaystyle t_{\epsilon}((a-\epsilon/\mu)(1-c\mu);\mu)= 11cμ1μ(a+ϵ/μ)2(μ(a2+1)+(aϵ/μ)2(1cμ)2)2\displaystyle\frac{1}{1-c\mu}-1-\frac{\mu(a+\epsilon/\mu)^{2}}{(\mu(a^{2}+1)+(a-\epsilon/\mu)^{2}(1-c\mu)^{2})^{2}}
=\displaystyle= cμ1cμμ(a+ϵ/μ)2(μ(a2+1)+(aϵ/μ)2(1cμ)2)2\displaystyle\frac{c\mu}{1-c\mu}-\frac{\mu(a+\epsilon/\mu)^{2}}{(\mu(a^{2}+1)+(a-\epsilon/\mu)^{2}(1-c\mu)^{2})^{2}}
\displaystyle\geq cμμ(a+ϵ/μ)2(aϵ/μ)4(1cμ)4\displaystyle c\mu-\mu\frac{(a+\epsilon/\mu)^{2}}{(a-\epsilon/\mu)^{4}(1-c\mu)^{4}}
\displaystyle\geq cμμ(a+a/2)2(aa/2)4(1cμ)4\displaystyle c\mu-\mu\frac{(a+a/2)^{2}}{(a-a/2)^{4}(1-c\mu)^{4}}
=\displaystyle= μ(c36a2(1cμ)4)\displaystyle\mu\left(c-\frac{36}{a^{2}(1-c\mu)^{4}}\right)
=\displaystyle= μ(72a236a2(1cμ)4)>0\displaystyle\mu\left(\frac{72}{a^{2}}-\frac{36}{a^{2}(1-c\mu)^{4}}\right)>0

Then we know (aϵ/μ)(1cμ)<xμ,ϵ(a-\epsilon/\mu)(1-c\mu)<x_{\mu,\epsilon}^{*}. Now we can bound the distance Wμk,ϵkW𝖦\lVert W_{\mu_{k},\epsilon_{k}}-W_{\mathsf{G}}\rVert; it is important to note that

Wμk,ϵkW𝖦=\displaystyle\lVert W_{\mu_{k},\epsilon_{k}}-W_{\mathsf{G}}\rVert= (xμk,ϵka)2+(yμk,ϵk)2\displaystyle\sqrt{(x_{\mu_{k},\epsilon_{k}}-a)^{2}+(y_{\mu_{k},\epsilon_{k}})^{2}}
\displaystyle\leq max{(xμk,ϵka)2+(yμk,ϵk)2,(xμk,ϵk_a)2+(yμk,ϵk)2}\displaystyle\max\left\{\sqrt{(x^{*}_{\mu_{k},\epsilon_{k}}-a)^{2}+(y^{*}_{\mu_{k},\epsilon_{k}})^{2}},\sqrt{(x^{*}_{\mu_{k},\epsilon_{k}\_}-a)^{2}+(y^{*}_{\mu_{k},\epsilon_{k}})^{2}}\right\}

We use the fact that xμk,ϵk<xμk,ϵk<ax^{*}_{\mu_{k},\epsilon_{k}}<x_{\mu_{k},\epsilon_{k}}<a, xμk,ϵk<xμk,ϵk_x_{\mu_{k},\epsilon_{k}}<x^{*}_{\mu_{k},\epsilon_{k}\_} and yμk,ϵk<yμk,ϵky_{\mu_{k},\epsilon_{k}}<y^{*}_{\mu_{k},\epsilon_{k}}. Next, we can separately establish bounds for these two terms. Due to (24), yμk,ϵk<(2μk2)1/3(a+ϵk/μk)1/3(aϵk/μk)2/3=μk+1y^{*}_{\mu_{k},\epsilon_{k}}<(2\mu_{k}^{2})^{1/3}\frac{(a+\epsilon_{k}/\mu_{k})^{1/3}}{(a-\epsilon_{k}/\mu_{k})^{2/3}}=\sqrt{\mu_{k+1}} and (aϵk/μk)(1cμk)<xμk,ϵk(a-\epsilon_{k}/\mu_{k})(1-c\mu_{k})<x^{*}_{\mu_{k},\epsilon_{k}}

(xμk,ϵka)2+(yμk,ϵk)2μk+1+(a(aϵk/μk)(1cμk))2\displaystyle\sqrt{(x^{*}_{\mu_{k},\epsilon_{k}}-a)^{2}+(y^{*}_{\mu_{k},\epsilon_{k}})^{2}}\leq\sqrt{\mu_{k+1}+(a-(a-\epsilon_{k}/\mu_{k})(1-c\mu_{k}))^{2}}

If xμk,ϵk_ax^{*}_{\mu_{k},\epsilon_{k}\_}\leq a, then (xμk,ϵka)2+(yμk,ϵk)2(xμk,ϵk_a)2+(yμk,ϵk)2\sqrt{(x^{*}_{\mu_{k},\epsilon_{k}}-a)^{2}+(y^{*}_{\mu_{k},\epsilon_{k}})^{2}}\geq\sqrt{(x^{*}_{\mu_{k},\epsilon_{k}\_}-a)^{2}+(y^{*}_{\mu_{k},\epsilon_{k}})^{2}}, so the first term dominates. If instead xμk,ϵk_ax^{*}_{\mu_{k},\epsilon_{k}\_}\geq a, we can use the fact that xμk,ϵk_a+ϵkμkx^{*}_{\mu_{k},\epsilon_{k}\_}\leq a+\frac{\epsilon_{k}}{\mu_{k}}. In this case,

(xμk,ϵk_a)2+(yμk,ϵk)2μk+1+(ϵk/μk)2=μk+1+μk(2δ)μk\displaystyle\sqrt{(x^{*}_{\mu_{k},\epsilon_{k}\_}-a)^{2}+(y^{*}_{\mu_{k},\epsilon_{k}})^{2}}\leq\sqrt{\mu_{k+1}+(\epsilon_{k}/\mu_{k})^{2}}=\sqrt{\mu_{k+1}+\mu_{k}}\leq\sqrt{(2-\delta)\mu_{k}}

As a result, we have

Wμk,ϵkW𝖦max{μk+1+(a(aϵk/μk)(1cμk))2,(2δ)μk}\displaystyle\lVert W_{\mu_{k},\epsilon_{k}}-W_{\mathsf{G}}\rVert\leq\max\{\sqrt{\mu_{k+1}+(a-(a-\epsilon_{k}/\mu_{k})(1-c\mu_{k}))^{2}},\sqrt{(2-\delta)\mu_{k}}\}
μk+1+(a(aϵk/μk)(1cμk))2\displaystyle\mu_{k+1}+(a-(a-\epsilon_{k}/\mu_{k})(1-c\mu_{k}))^{2}\leq (1δ)μk+(acμk+μkcμk3/2)2\displaystyle(1-\delta)\mu_{k}+(ac\mu_{k}+\sqrt{\mu_{k}}-c\mu_{k}^{3/2})^{2}
\displaystyle\leq (1δ)μk+3(a2c2μk2+μk+c2μk3)\displaystyle(1-\delta)\mu_{k}+3(a^{2}c^{2}\mu_{k}^{2}+\mu_{k}+c^{2}\mu_{k}^{3})
=\displaystyle= (4δ)μk+3a2c2μk2+3c2μk3\displaystyle(4-\delta)\mu_{k}+3a^{2}c^{2}\mu_{k}^{2}+3c^{2}\mu_{k}^{3}
Wμk,ϵkW𝖦\displaystyle\lVert W_{\mu_{k},\epsilon_{k}}-W_{\mathsf{G}}\rVert\leq max{μk+1+(a(aϵk/μk)(1cμk))2,(2δ)μk}\displaystyle\max\{\sqrt{\mu_{k+1}+(a-(a-\epsilon_{k}/\mu_{k})(1-c\mu_{k}))^{2}},\sqrt{(2-\delta)\mu_{k}}\}
\displaystyle\leq max{(4δ)μk+3a2c2μk2+3c2μk3,(2δ)μk}\displaystyle\max\{\sqrt{(4-\delta)\mu_{k}+3a^{2}c^{2}\mu_{k}^{2}+3c^{2}\mu_{k}^{3}},\sqrt{(2-\delta)\mu_{k}}\}
=\displaystyle= (4δ)μk+3a2c2μk2+3c2μk3\displaystyle\sqrt{(4-\delta)\mu_{k}+3a^{2}c^{2}\mu_{k}^{2}+3c^{2}\mu_{k}^{3}}

It suffices to choose kk such that

(4δ)μk(4δ)(1δ)kμ0εdist23\displaystyle(4-\delta)\mu_{k}\leq(4-\delta)(1-\delta)^{k}\mu_{0}\leq\frac{{\varepsilon_{\mathrm{dist}}}^{2}}{3} kln(3(4δ)μ0/εdist2)ln(1/(1δ))\displaystyle\Rightarrow k\geq\frac{\ln(3(4-\delta)\mu_{0}/{\varepsilon_{\mathrm{dist}}}^{2})}{\ln(1/(1-\delta))} (27)
3a2c2μk23a2c2(1δ)2kμ02εdist23\displaystyle 3a^{2}c^{2}\mu_{k}^{2}\leq 3a^{2}c^{2}(1-\delta)^{2k}\mu^{2}_{0}\leq\frac{{\varepsilon_{\mathrm{dist}}}^{2}}{3} kln(46656μ02/(a2εdist2))2ln(1/(1δ))\displaystyle\Rightarrow k\geq\frac{\ln(46656\mu_{0}^{2}/(a^{2}{\varepsilon_{\mathrm{dist}}}^{2}))}{2\ln(1/(1-\delta))} (28)
3c2μk33c2(1δ)3kμ03εdist23\displaystyle 3c^{2}\mu_{k}^{3}\leq 3c^{2}(1-\delta)^{3k}\mu^{3}_{0}\leq\frac{{\varepsilon_{\mathrm{dist}}}^{2}}{3} kln(46656μ03/(a4εdist2))3ln(1/(1δ))\displaystyle\Rightarrow k\geq\frac{\ln(46656\mu_{0}^{3}/(a^{4}{\varepsilon_{\mathrm{dist}}}^{2}))}{3\ln(1/(1-\delta))} (29)

Here we use the fact that μk(1δ)kμ0\mu_{k}\leq(1-\delta)^{k}\mu_{0}. In order to satisfy (25), we also need

μkμ0(1δ)ka272(1(1/2)1/4)kln72μ0a2(1(1/2)1/4)ln11δ\displaystyle\mu_{k}\leq\mu_{0}(1-\delta)^{k}\leq\frac{a^{2}}{72}(1-(1/2)^{1/4})\Rightarrow k\geq\frac{\ln\frac{72\mu_{0}}{a^{2}(1-(1/2)^{1/4})}}{\ln\frac{1}{1-\delta}} (30)
μkμ0(1δ)kβ2a2kln(μ0/(β2a2))ln11δ\displaystyle\mu_{k}\leq\mu_{0}(1-\delta)^{k}\leq\beta^{2}a^{2}\Rightarrow k\geq\frac{\ln{(\mu_{0}/(\beta^{2}a^{2}))}}{\ln\frac{1}{1-\delta}} (31)

Consequently, running Algorithm 6 for K(μ0,a,δ,εdist)K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}}) outer iterations yields

Wμk,ϵkW𝖦2εdist\displaystyle\|W_{\mu_{k},\epsilon_{k}}-W_{\mathsf{G}}\|_{2}\leq{\varepsilon_{\mathrm{dist}}}

where

K(μ0,a,δ,εdist)\displaystyle K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})\geq 1ln(1/(1δ))max{lnμ0β2a2,ln72μ0a2(1(1/2)1/4),ln(3(4δ)μ0εdist2),12ln(46656μ02a2εdist2),13ln(46656μ03a4εdist2)}\frac{1}{\ln(1/(1-\delta))}\max\left\{\ln{\frac{\mu_{0}}{\beta^{2}a^{2}}},\ln\frac{72\mu_{0}}{a^{2}(1-(1/2)^{1/4})},\ln(\frac{3(4-\delta)\mu_{0}}{{\varepsilon_{\mathrm{dist}}}^{2}}),\frac{1}{2}\ln(\frac{46656\mu_{0}^{2}}{a^{2}{\varepsilon_{\mathrm{dist}}}^{2}}),\frac{1}{3}\ln(\frac{46656\mu_{0}^{3}}{a^{4}{\varepsilon_{\mathrm{dist}}}^{2}})\right\}
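For concreteness, the outer-iteration count K(μ_0, a, δ, ε_dist) can be evaluated directly from the display above. The following small sketch does so for illustrative values (the same a, β, δ as in the schedule check, ε_dist = 10^{-3}, and μ_0 just below its upper bound).

import math

a, beta, delta, eps_dist = 0.6, 0.05, 0.05, 1e-3     # illustrative values
mu0 = 0.99 * a**2 / 4 * (1 - delta)**3 * (1 - beta)**4 / (1 + beta)**2

terms = [
    math.log(mu0 / (beta**2 * a**2)),
    math.log(72 * mu0 / (a**2 * (1 - 0.5**0.25))),
    math.log(3 * (4 - delta) * mu0 / eps_dist**2),
    0.5 * math.log(46656 * mu0**2 / (a**2 * eps_dist**2)),
    (1 / 3) * math.log(46656 * mu0**3 / (a**4 * eps_dist**2)),
]
K = max(terms) / math.log(1 / (1 - delta))
print("sufficient number of outer iterations:", math.ceil(K))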

By Lemma 6, the kk-th iteration of Algorithm 6 requires the following number of gradient descent steps:

2(μk(a2+1)+3a2)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))ϵk2\displaystyle\frac{2(\mu_{k}(a^{2}+1)+3a^{2})(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\epsilon_{k}^{2}}

Let K^(μ0,a,δ,εdist)\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}}) satisfy μK^(μ0,a,δ,εdist)β2a2<μK^(μ0,a,δ,εdist)1\mu_{\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\leq\beta^{2}a^{2}<\mu_{\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})-1}. Hence, the total number of gradient steps required by Algorithm 6 can be bounded as follows:

k=0K(μ0,a,δ,εdist)2(μk(a2+1)+3a2)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))ϵk2\displaystyle\sum_{k=0}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{2(\mu_{k}(a^{2}+1)+3a^{2})(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\epsilon_{k}^{2}}
\displaystyle\leq 2(μ0(a2+1)+3a2)(k=0K^(μ0,a,δ,εdist)1(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))ϵk2+k=K^(μ0,a,δ,εdist)K(μ0,a,δ,εdist)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))ϵk2)\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\sum_{k=0}^{\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})-1}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\epsilon_{k}^{2}}+\sum_{k=\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\epsilon_{k}^{2}}\right)
=\displaystyle= 2(μ0(a2+1)+3a2)(k=0K^(μ0,a,δ,εdist)1(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))β2a2μk2+k=K^(μ0,a,δ,εdist)K(μ0,a,δ,εdist)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))μk3)\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\sum_{k=0}^{\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})-1}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\beta^{2}a^{2}\mu_{k}^{2}}+\sum_{k=\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\mu_{k}^{3}}\right)
\displaystyle\leq 2(μ0(a2+1)+3a2)(k=0K^(μ0,a,δ,εdist)1(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))β6a6+k=K^(μ0,a,δ,εdist)K(μ0,a,δ,εdist)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))μk3)\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\sum_{k=0}^{\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})-1}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\beta^{6}a^{6}}+\sum_{k=\widehat{K}(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\mu_{k}^{3}}\right)
\displaystyle\leq 2(μ0(a2+1)+3a2)(k=0K(μ0,a,δ,εdist)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))β6a6+k=0K(μ0,a,δ,εdist)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))μK(μ0,a,δ,εdist)3)\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\sum_{k=0}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\beta^{6}a^{6}}+\sum_{k=0}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{3}}\right)
=\displaystyle= 2(μ0(a2+1)+3a2)(1β6a6+1μK(μ0,a,δ,εdist)3)(k=0K(μ0,a,δ,εdist)((gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))))\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\frac{1}{\beta^{6}a^{6}}+\frac{1}{\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{3}}\right)\left(\sum_{k=0}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\left((g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))\right)\right)
\displaystyle\leq 2(μ0(a2+1)+3a2)(1β6a6+1μK(μ0,a,δ,εdist)3)(k=0K(μ0,a,δ,εdist)((gμk(Wμkϵk)gμk+1(Wμk+1ϵk+1))))\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\frac{1}{\beta^{6}a^{6}}+\frac{1}{\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{3}}\right)\left(\sum_{k=0}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\left((g_{\mu_{k}}(W_{\mu_{k}}^{\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1}}^{\epsilon_{k+1}}))\right)\right)
=\displaystyle= 2(μ0(a2+1)+3a2)(1β6a6+1μK(μ0,a,δ,εdist)3)((gμ0(Wμ0,ϵ0)gμK(μ0,a,δ,εdist)+1(WμK(μ0,a,δ,εdist)+1ϵK(μ0,a,δ,εdist)+1))\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\frac{1}{\beta^{6}a^{6}}+\frac{1}{\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{3}}\right)\left((g_{\mu_{0}}(W_{\mu_{0},\epsilon_{0}})-g_{\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})+1}}(W_{\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})+1}}^{\epsilon_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})+1}})\right)
\displaystyle\leq 2(μ0(a2+1)+3a2)(1β6a6+1μK(μ0,a,δ,εdist)3)gμ0(Wμ0,ϵ0)\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\frac{1}{\beta^{6}a^{6}}+\frac{1}{\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}^{3}}\right)g_{\mu_{0}}(W_{\mu_{0},\epsilon_{0}})

Note that, from (27)–(31), the following should hold

μK(μ0,a,δ,εdist)=min{εdist23(4δ),aεdist216,(aεdist216)2/3,β2a2,a272(1(1/2)1/4)}\displaystyle\mu_{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}=\min\{\frac{{\varepsilon_{\mathrm{dist}}}^{2}}{3(4-\delta)},\frac{a{\varepsilon_{\mathrm{dist}}}}{216},\left(\frac{a{\varepsilon_{\mathrm{dist}}}}{216}\right)^{2/3},\beta^{2}a^{2},\frac{a^{2}}{72}(1-(1/2)^{1/4})\}

Therefore,

k=0K(μ0,a,δ,εdist)2(μk(a2+1)+3a2)(gμk+1(Wμk,ϵk)gμk+1(Wμk+1,ϵk+1))ϵk2\displaystyle\sum_{k=0}^{K(\mu_{0},a,\delta,{\varepsilon_{\mathrm{dist}}})}\frac{2(\mu_{k}(a^{2}+1)+3a^{2})(g_{\mu_{k+1}}(W_{\mu_{k},\epsilon_{k}})-g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}))}{\epsilon_{k}^{2}}
\displaystyle\leq 2(μ0(a2+1)+3a2)(1β6a6+(max{3(4δ)εdist2,216aεdist,(216aεdist)2/3,1β2a2,72(1(1/2)1/4)a2})3)gμ0(Wμ0ϵ0).\displaystyle 2(\mu_{0}(a^{2}+1)+3a^{2})\left(\frac{1}{\beta^{6}a^{6}}+\left(\max\{\frac{3(4-\delta)}{{\varepsilon_{\mathrm{dist}}}^{2}},\frac{216}{a{\varepsilon_{\mathrm{dist}}}},\left(\frac{216}{a{\varepsilon_{\mathrm{dist}}}}\right)^{2/3},\frac{1}{\beta^{2}a^{2}},\frac{72}{(1-(1/2)^{1/4})a^{2}}\}\right)^{3}\right)g_{\mu_{0}}(W_{\mu_{0}}^{\epsilon_{0}}).

Appendix D Additional Theorems and Lemmas

Theorem 5 (Detailed Properties of r(y;μ)r(y;\mu)).

For r(y;μ)r(y;\mu) defined in (8), the following hold:

  1. (i)

    For μ>0\mu>0, limy0+r(y;μ)=\lim_{y\rightarrow 0^{+}}r(y;\mu)=\infty, r(aa2+1,μ)<0r(\frac{a}{a^{2}+1},\mu)<0

  2. (ii)

    For μ>0\mu>0, r(μ,μ)<0r(\sqrt{\mu},\mu)<0.

  3. (iii)

    For μ>a24\mu>\frac{a^{2}}{4}

    dr(y;μ)dy<0\frac{dr(y;\mu)}{dy}<0

    For 0<μa240<\mu\leq\frac{a^{2}}{4}

    dr(y;μ)dy>0\displaystyle\frac{dr(y;\mu)}{dy}>0 ylb<y<yuby_{\mathrm{lb}}<y<y_{\mathrm{ub}} (32a)
    dr(y;μ)dy0\displaystyle\frac{dr(y;\mu)}{dy}\leq 0 Otherwise (32b)

    where

    ylb=\displaystyle y_{\mathrm{lb}}= (4μ)1/32(a1/3a2/3(4μ)1/3)yub=\displaystyle\frac{(4\mu)^{1/3}}{2}(a^{1/3}-\sqrt{a^{2/3}-(4\mu)^{1/3}})\quad y_{\mathrm{ub}}= (4μ)1/32(a1/3+a2/3(4μ)1/3)\displaystyle\frac{(4\mu)^{1/3}}{2}(a^{1/3}+\sqrt{a^{2/3}-(4\mu)^{1/3}})

    Moreover,

    ylbμyub\displaystyle y_{\mathrm{lb}}\leq\sqrt{\mu}\leq y_{\mathrm{ub}}
  4. (iv)

For 0<μ<a240<\mu<\frac{a^{2}}{4}, let p(μ)=r(yub,μ)p(\mu)=r(y_{\mathrm{ub}},\mu); then p(μ)<0p^{\prime}(\mu)<0 and there exists a unique solution to p(μ)=0p(\mu)=0, denoted as τ\tau. Additionally, τ<a24\tau<\frac{a^{2}}{4}.

  5. (v)

There exists a τ>0\tau>0 such that, μ>τ\forall\mu>\tau, the equation r(y;μ)=0r(y;\mu)=0 has only one solution. At μ=τ\mu=\tau, the equation r(y;μ)=0r(y;\mu)=0 has two solutions, and μ<τ\forall\mu<\tau, the equation r(y;μ)=0r(y;\mu)=0 has three solutions. Moreover, τ<a24\tau<\frac{a^{2}}{4}.

  6. (vi)

μ<τ\forall\mu<\tau, the equation r(y;μ)=0r(y;\mu)=0 has three solutions, i.e. yμ<yμ<yμy_{\mu}^{*}<y_{\mu}^{**}<y_{\mu}^{***}.

    dyμdμ>0dyμdμ>0dyμdμ<0 and limμ0yμ=0,limμ0yμ=0,limμ0yμ=aa2+1\displaystyle\frac{dy_{\mu}^{*}}{d\mu}>0\quad\frac{dy_{\mu}^{**}}{d\mu}>0\quad\frac{dy_{\mu}^{***}}{d\mu}<0\text{ and }\lim_{\mu\rightarrow 0}y_{\mu}^{*}=0,\lim_{\mu\rightarrow 0}y_{\mu}^{**}=0,\lim_{\mu\rightarrow 0}y_{\mu}^{***}=\frac{a}{a^{2}+1}

    Moreover,

    yμ<ylb<μ<yμ<yub<yμy_{\mu}^{*}<y_{\mathrm{lb}}<\sqrt{\mu}<y_{\mu}^{**}<y_{\mathrm{ub}}<y_{\mu}^{***}
Theorem 6 (Detailed Properties of t(x;μ)t(x;\mu)).

For t(x;μ)t(x;\mu) defined in (9), the following hold:

  1. (i)

    For μ>0\mu>0, limx0+t(x;μ)=\lim_{x\rightarrow 0^{+}}t(x;\mu)=\infty, t(a,μ)<0t(a,\mu)<0

  2. (ii)

    If μ<(a(a2+1a)2(a2+1))2\mu<\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2} or μ>(a(a2+1+a)2(a2+1))2\mu>\left(\frac{a(\sqrt{a^{2}+1}+a)}{2(a^{2}+1)}\right)^{2}, then t(μ(a2+1),μ)<0t(\sqrt{\mu(a^{2}+1)},\mu)<0.

  3. (iii)

    For μ>a24(a2+1)3\mu>\frac{a^{2}}{4(a^{2}+1)^{3}}

    dt(x;μ)dx<0\frac{dt(x;\mu)}{dx}<0

    For 0<μa24(a2+1)30<\mu\leq\frac{a^{2}}{4(a^{2}+1)^{3}}

    dt(x;μ)dx>0\displaystyle\frac{dt(x;\mu)}{dx}>0 xlb<x<xubx_{\mathrm{lb}}<x<x_{\mathrm{ub}} (33a)
    dt(x;μ)dx0\displaystyle\frac{dt(x;\mu)}{dx}\leq 0 Otherwise (33b)

    where

    xlb=\displaystyle x_{\mathrm{lb}}= (4μa)1/3(11(4μ)1/3(a2+1)a2/3)2xub=\displaystyle\frac{(4\mu a)^{1/3}(1-\sqrt{1-\frac{(4\mu)^{1/3}(a^{2}+1)}{a^{2/3}}})}{2}\quad x_{\mathrm{ub}}= (4μa)1/3(1+1(4μ)1/3(a2+1)a2/3)2\displaystyle\frac{(4\mu a)^{1/3}(1+\sqrt{1-\frac{(4\mu)^{1/3}(a^{2}+1)}{a^{2/3}}})}{2}

    Moreover,

    xlbμ(a2+1)xubx_{\mathrm{lb}}\leq\sqrt{\mu(a^{2}+1)}\leq x_{\mathrm{ub}}
  4. (iv)

For 0<μ<a24(a2+1)30<\mu<\frac{a^{2}}{4(a^{2}+1)^{3}}, let q(μ)=t(xlb,μ)q(\mu)=t(x_{\mathrm{lb}},\mu); then q(μ)>0q^{\prime}(\mu)>0 and there exists a unique solution to q(μ)=0q(\mu)=0, denoted as τ\tau, and τ<a24(a2+1)3127\tau<\frac{a^{2}}{4(a^{2}+1)^{3}}\leq\frac{1}{27}.

  5. (v)

    There exists a τ>0\tau>0 such that, μ>τ\forall\mu>\tau, the equation t(x;μ)=0t(x;\mu)=0 has only one solution. At μ=τ\mu=\tau, the equation t(x;μ)=0t(x;\mu)=0 has two solutions, and μ<τ\forall\mu<\tau, the equation t(x;μ)=0t(x;\mu)=0 has three solutions. Moreover, τ<a24(a2+1)3127\tau<\frac{a^{2}}{4(a^{2}+1)^{3}}\leq\frac{1}{27}

  6. (vi)

μ<τ\forall\mu<\tau, t(x;μ)=0t(x;\mu)=0 has three solutions, i.e. xμ<xμ<xμx_{\mu}^{***}<x_{\mu}^{**}<x_{\mu}^{*}.

    dxμdμ<0dxμdμ>0 and limμ0xμ=a,limμ0xμ=0,limμ0xμ=0\displaystyle\frac{dx_{\mu}^{*}}{d\mu}<0\quad\frac{dx_{\mu}^{***}}{d\mu}>0\text{ and }\lim_{\mu\rightarrow 0}x_{\mu}^{*}=a,\lim_{\mu\rightarrow 0}x_{\mu}^{**}=0,\lim_{\mu\rightarrow 0}x_{\mu}^{***}=0

    Besides,

    maxμτxμa(a2+1a)2a2+1anda(a2+1+a)2a2+1minμ>0xμ\max_{\mu\leq\tau}x_{\mu}^{**}\leq\frac{a(\sqrt{a^{2}+1}-a)}{2\sqrt{a^{2}+1}}\quad\text{and}\quad\frac{a(\sqrt{a^{2}+1}+a)}{2\sqrt{a^{2}+1}}\leq\min_{\mu>0}x_{\mu}^{*}

It also implies that t(a(a2+1a)2a2+1;μ)0t(\frac{a(\sqrt{a^{2}+1}-a)}{2\sqrt{a^{2}+1}};\mu)\geq 0 and maxμτxμ<minμ>0xμ\max_{\mu\leq\tau}x_{\mu}^{**}<\min_{\mu>0}x_{\mu}^{*}
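The two bounds in Theorem 6(vi) can be checked numerically. The sketch below finds the roots of t(x;μ) — using the closed form t(x;μ) = a/x − 1 − μa²/(μ(a²+1)+x²)² from the proof of Theorem 6 — over a grid of μ, and verifies that the middle root never exceeds a(√(a²+1)−a)/(2√(a²+1)) while the largest root never drops below a(√(a²+1)+a)/(2√(a²+1)); a = 0.6 is an illustrative choice.

import numpy as np
from scipy.optimize import brentq

a = 0.6   # illustrative value

def t(x, mu):
    # t(x; mu) from (9), as used in the proof of Theorem 6
    return a / x - 1 - mu * a**2 / (mu * (a**2 + 1) + x**2)**2

x_mid_bound = a * (np.sqrt(a**2 + 1) - a) / (2 * np.sqrt(a**2 + 1))   # bound on x_mu^**
x_top_bound = a * (np.sqrt(a**2 + 1) + a) / (2 * np.sqrt(a**2 + 1))   # bound on x_mu^*

for mu in np.geomspace(1e-6, a**2 / (4 * (a**2 + 1)**3), 30):
    xs = np.geomspace(1e-6, a, 4000)
    vals = t(xs, mu)
    roots = [brentq(t, xs[i], xs[i + 1], args=(mu,))
             for i in range(len(xs) - 1) if vals[i] * vals[i + 1] < 0]
    if len(roots) == 3:                        # mu lies below the threshold tau
        x3, x2, x1 = roots                     # x_mu^*** < x_mu^** < x_mu^*
        assert x2 <= x_mid_bound + 1e-9 and x1 >= x_top_bound - 1e-9
print("Theorem 6(vi) bounds hold on the sampled grid of mu")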

Lemma 7.

Run Algorithm 1 with input f=gμ(x,y),𝐳0=(x(0),y(0))f=g_{\mu}(x,y),\bm{z}_{0}=(x(0),y(0)), where (x(0),y(0))Cμ3(x(0),y(0))\in C_{\mu 3} defined in (41); then t0\forall t\geq 0, (x(t),y(t))Cμ3(x(t),y(t))\in C_{\mu 3}. Moreover, limt(x(t),y(t))=(xμ,yμ)\lim_{t\rightarrow\infty}(x(t),y(t))=(x_{\mu}^{*},y_{\mu}^{*})

Lemma 8.

For any (x,y)Cμ3(x,y)\in C_{\mu 3} in (41), and (x,y)(xμ,yμ)(x,y)\neq(x_{\mu}^{*},y_{\mu}^{*})

gμ(x,y)>gμ(xμ,yμ)g_{\mu}(x,y)>g_{\mu}(x_{\mu}^{*},y_{\mu}^{*})
Theorem 7 (Detailed Properties of rϵ(y;μ)r_{\epsilon}(y;\mu)).

For rϵ(y;μ)r_{\epsilon}(y;\mu) defined in (15), the following hold:

  1. (i)

For μ>0,ϵ>0\mu>0,\epsilon>0, limy0+rϵ(y;μ)=\lim_{y\rightarrow 0^{+}}r_{\epsilon}(y;\mu)=\infty, rϵ(aa2+1,μ)<0r_{\epsilon}(\frac{a}{a^{2}+1},\mu)<0

  2. (ii)

    For μ>(aϵ/μ)44(a+ϵ/μ)2\mu>\frac{(a-\epsilon/\mu)^{4}}{4(a+\epsilon/\mu)^{2}}, then drϵ(y;μ)dy<0\frac{dr_{\epsilon}(y;\mu)}{dy}<0. For 0<μ(aϵ/μ)44(a+ϵ/μ)20<\mu\leq\frac{(a-\epsilon/\mu)^{4}}{4(a+\epsilon/\mu)^{2}}

    drϵ(y;μ)dy>0\displaystyle\frac{dr_{\epsilon}(y;\mu)}{dy}>0 ylb,μ,ϵ<y<yub,μ,ϵy_{\mathrm{lb},\mu,\epsilon}<y<y_{\mathrm{ub},\mu,\epsilon} (34a)
    drϵ(y;μ)dy0\displaystyle\frac{dr_{\epsilon}(y;\mu)}{dy}\leq 0 Otherwise (34b)

    where

ylb,μ,ϵ=\displaystyle y_{\mathrm{lb},\mu,\epsilon}= (4μ)1/32(((aϵ/μ)2a+ϵ/μ)1/3((aϵ/μ)2a+ϵ/μ)2/3(4μ)1/3)\frac{(4\mu)^{1/3}}{2}\left(\left(\frac{(a-\epsilon/\mu)^{2}}{a+\epsilon/\mu}\right)^{1/3}-\sqrt{\left(\frac{(a-\epsilon/\mu)^{2}}{a+\epsilon/\mu}\right)^{2/3}-(4\mu)^{1/3}}\right)
    yub,μ,ϵ=\displaystyle y_{\mathrm{ub},\mu,\epsilon}= (4μ)1/32(((aϵ/μ)2a+ϵ/μ)1/3+((aϵ/μ)2a+ϵ/μ)2/3(4μ)1/3)\frac{(4\mu)^{1/3}}{2}\left(\left(\frac{(a-\epsilon/\mu)^{2}}{a+\epsilon/\mu}\right)^{1/3}+\sqrt{\left(\frac{(a-\epsilon/\mu)^{2}}{a+\epsilon/\mu}\right)^{2/3}-(4\mu)^{1/3}}\right)

    Also,

    ylb,μ,ϵ(2μ2)1/3(a+ϵ/μ)1/3(aϵ/μ)2/3\displaystyle y_{\mathrm{lb},\mu,\epsilon}\leq(2\mu^{2})^{1/3}\frac{(a+\epsilon/\mu)^{1/3}}{(a-\epsilon/\mu)^{2/3}}
    ylb,μ,ϵμyub,μ,ϵy_{\mathrm{lb},\mu,\epsilon}\leq\sqrt{\mu}\leq y_{\mathrm{ub},\mu,\epsilon}
Theorem 8 (Detailed Properties of rβ(y;μ)r_{\beta}(y;\mu)).

For rβ(y;μ)r_{\beta}(y;\mu) defined in (19), the following hold:

  1. (i)

    For μ>0,ϵ>0\mu>0,\epsilon>0, limy0+rβ(y;μ)=\lim_{y\rightarrow 0^{+}}r_{\beta}(y;\mu)=\infty

  2. (ii)

    For μ>a2(1β)44(1+β)2\mu>\frac{a^{2}(1-\beta)^{4}}{4(1+\beta)^{2}}, then drβ(y;μ)dy<0\frac{dr_{\beta}(y;\mu)}{dy}<0. For 0<μa2(1β)44(1+β)20<\mu\leq\frac{a^{2}(1-\beta)^{4}}{4(1+\beta)^{2}}

    drβ(y;μ)dy>0\displaystyle\frac{dr_{\beta}(y;\mu)}{dy}>0 ylb,μ,β<y<yub,μ,βy_{\mathrm{lb},\mu,\beta}<y<y_{\mathrm{ub},\mu,\beta} (35a)
    drβ(y;μ)dy0\displaystyle\frac{dr_{\beta}(y;\mu)}{dy}\leq 0 Otherwise (35b)

    where

    ylb,μ,β=\displaystyle y_{\mathrm{lb},\mu,\beta}= (4μ)1/32(a(1β)21+β)1/3(11(4μ)1/3a2/3(1+β(1β)2)2/3)\displaystyle\frac{(4\mu)^{1/3}}{2}\left(\frac{a(1-\beta)^{2}}{1+\beta}\right)^{1/3}\left(1-\sqrt{1-\frac{(4\mu)^{1/3}}{a^{2/3}}\left(\frac{1+\beta}{(1-\beta)^{2}}\right)^{2/3}}\right)
    yub,μ,β=\displaystyle y_{\mathrm{ub},\mu,\beta}= (4μ)1/32(a(1β)21+β)1/3(1+1(4μ)1/3a2/3(1+β(1β)2)2/3)\displaystyle\frac{(4\mu)^{1/3}}{2}\left(\frac{a(1-\beta)^{2}}{1+\beta}\right)^{1/3}\left(1+\sqrt{1-\frac{(4\mu)^{1/3}}{a^{2/3}}\left(\frac{1+\beta}{(1-\beta)^{2}}\right)^{2/3}}\right)

    Also,

    ylb,μ,β(4μ)2/32a1/3(1+β)1/3(1β)2/3\displaystyle y_{\mathrm{lb},\mu,\beta}\leq\frac{(4\mu)^{2/3}}{2a^{1/3}}\frac{(1+\beta)^{1/3}}{(1-\beta)^{2/3}}
    ylb,μ,βμyub,μ,βy_{\mathrm{lb},\mu,\beta}\leq\sqrt{\mu}\leq y_{\mathrm{ub},\mu,\beta}
Theorem 9 (Detailed Properties of tβ(x;μ)t_{\beta}(x;\mu)).

For tβ(x;μ)t_{\beta}(x;\mu) defined in (20), the following hold:

  1. (i)

    For μ>0\mu>0, limx0+tβ(x;μ)=\lim_{x\rightarrow 0^{+}}t_{\beta}(x;\mu)=\infty, tβ(a;μ)<0t_{\beta}(a;\mu)<0

  2. (ii)

    For μ>a24(a2+1)3(β+1)4(β1)2\mu>\frac{a^{2}}{4(a^{2}+1)^{3}}\frac{(\beta+1)^{4}}{(\beta-1)^{2}}

    dtβ(x;μ)dx<0\frac{dt_{\beta}(x;\mu)}{dx}<0

    For 0<μa24(a2+1)3(β+1)4(β1)20<\mu\leq\frac{a^{2}}{4(a^{2}+1)^{3}}\frac{(\beta+1)^{4}}{(\beta-1)^{2}}

    dtβ(x;μ)dx>0\displaystyle\frac{dt_{\beta}(x;\mu)}{dx}>0 xlb,μ,β<x<xub,μ,βx_{\mathrm{lb},\mu,\beta}<x<x_{\mathrm{ub},\mu,\beta} (36a)
    dtβ(x;μ)dx0\displaystyle\frac{dt_{\beta}(x;\mu)}{dx}\leq 0 Otherwise (36b)

    where

    xlb,μ,β=\displaystyle x_{\mathrm{lb},\mu,\beta}= 12(4aμ(1+β)21β)1/3(11(4μ)1/3(a2+1)a2/3(1β(1+β)2)2/3)\displaystyle\frac{1}{2}\left(\frac{4a\mu(1+\beta)^{2}}{1-\beta}\right)^{1/3}\left(1-\sqrt{1-\frac{(4\mu)^{1/3}(a^{2}+1)}{a^{2/3}}\left(\frac{1-\beta}{(1+\beta)^{2}}\right)^{2/3}}\right)
    xub,μ,β=\displaystyle x_{\mathrm{ub},\mu,\beta}= 12(4aμ(1+β)21β)1/3(1+1(4μ)1/3(a2+1)a2/3(1β(1+β)2)2/3)\displaystyle\frac{1}{2}\left(\frac{4a\mu(1+\beta)^{2}}{1-\beta}\right)^{1/3}\left(1+\sqrt{1-\frac{(4\mu)^{1/3}(a^{2}+1)}{a^{2/3}}\left(\frac{1-\beta}{(1+\beta)^{2}}\right)^{2/3}}\right)
  3. (iii)

If 0<β<(a2+1)1(a2+1)+10<\beta<\frac{\sqrt{(a^{2}+1)}-1}{\sqrt{(a^{2}+1)}+1}, then there exists a τβ>0\tau_{\beta}>0 such that, μ>τβ\forall\mu>\tau_{\beta}, the equation tβ(x;μ)=0t_{\beta}(x;\mu)=0 has only one solution. At μ=τβ\mu=\tau_{\beta}, the equation tβ(x;μ)=0t_{\beta}(x;\mu)=0 has two solutions, and μ<τβ\forall\mu<\tau_{\beta}, the equation tβ(x;μ)=0t_{\beta}(x;\mu)=0 has three solutions. Moreover, τβ<a24(a2+1)3(β+1)4(β1)2\tau_{\beta}<\frac{a^{2}}{4(a^{2}+1)^{3}}\frac{(\beta+1)^{4}}{(\beta-1)^{2}}.

  4. (iv)

If 0<β<(a2+1)1(a2+1)+10<\beta<\frac{\sqrt{(a^{2}+1)}-1}{\sqrt{(a^{2}+1)}+1}, then μ<τβ\forall\mu<\tau_{\beta}, tβ(x;μ)=0t_{\beta}(x;\mu)=0 has three solutions, i.e. xμ,β<xμ,β<xμ,βx_{\mu,\beta}^{***}<x_{\mu,\beta}^{**}<x_{\mu,\beta}^{*}. Besides,

    maxμτβxμ,βa((1β)a2+1(1β)2(a2+1)(β+1)2)2a2+1\displaystyle\max_{\mu\leq\tau_{\beta}}x_{\mu,\beta}^{**}\leq\frac{a((1-\beta)\sqrt{a^{2}+1}-\sqrt{(1-\beta)^{2}(a^{2}+1)-(\beta+1)^{2}})}{2\sqrt{a^{2}+1}}
    a((1β)a2+1+(1β)2(a2+1)(β+1)2)2a2+1minμ>0xμ,β\displaystyle\frac{a((1-\beta)\sqrt{a^{2}+1}+\sqrt{(1-\beta)^{2}(a^{2}+1)-(\beta+1)^{2}})}{2\sqrt{a^{2}+1}}\leq\min_{\mu>0}x_{\mu,\beta}^{*}

    It implies that

    maxμτβxμ,β<minμ>0xμ,β\displaystyle\max_{\mu\leq\tau_{\beta}}x_{\mu,\beta}^{**}<\min_{\mu>0}x_{\mu,\beta}^{*}
Lemma 9.

Under the same setting as Corollary 3,

maxμτxμ,ϵ<minμ>0xμ,ϵ\displaystyle\max_{\mu\leq\tau}x_{\mu,\epsilon}^{**}<\min_{\mu>0}x_{\mu,\epsilon}^{*}

Appendix E Technical Proofs

E.1 Proof of Theorem 3

Proof.

For the sake of completeness, we include the proof here; it can also be found in Nesterov et al. (2018). We use the fact that ff is an LL-smooth function if and only if for any W,Ydom(f)W,Y\in\text{dom}(f)

f(W)f(Y)+f(Y),WY+L2WY22\displaystyle f(W)\leq f(Y)+\langle\nabla f(Y),W-Y\rangle+\frac{L}{2}\|W-Y\|_{2}^{2}

Let W=Wt+1W=W^{t+1} and Y=WtY=W^{t}, then using the updating rule Wt+1=Wt1Lf(Wt)W^{t+1}=W^{t}-\frac{1}{L}\nabla f(W^{t})

f(Wt+1)\displaystyle f(W^{t+1})\leq f(Wt)+f(Wt),Wt+1Wt+L2Wt+1Wt22\displaystyle f(W^{t})+\langle\nabla f(W^{t}),W^{t+1}-W^{t}\rangle+\frac{L}{2}\|W^{t+1}-W^{t}\|_{2}^{2}
=\displaystyle= f(Wt)1Lf(Wt)22+12Lf(Wt)22\displaystyle f(W^{t})-\frac{1}{L}\|\nabla f(W^{t})\|_{2}^{2}+\frac{1}{2L}\|\nabla f(W^{t})\|_{2}^{2}
=\displaystyle= f(Wt)12Lf(Wt)22\displaystyle f(W^{t})-\frac{1}{2L}\|\nabla f(W^{t})\|_{2}^{2}

Therefore,

min0tn1f(Wt)221nt=0n1f(Wt)222L(f(W0)f(Wn))n2L(f(W0)f(W))n\displaystyle\min_{0\leq t\leq n-1}\|\nabla f(W^{t})\|_{2}^{2}\leq\frac{1}{n}\sum_{t=0}^{n-1}\|\nabla f(W^{t})\|_{2}^{2}\leq\frac{2L(f(W^{0})-f(W^{n}))}{n}\leq\frac{2L(f(W^{0})-f(W^{*}))}{n}
min0tn1f(Wt)222L(f(W0)f(W))nϵ2n2L(f(W0)f(W))ϵ2\min_{0\leq t\leq n-1}\|\nabla f(W^{t})\|_{2}^{2}\leq\frac{2L(f(W^{0})-f(W^{*}))}{n}\leq\epsilon^{2}\Rightarrow n\geq\frac{2L(f(W^{0})-f(W^{*}))}{\epsilon^{2}}
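As a small illustration of the argument above, the following sketch runs gradient descent with step size 1/L on an arbitrary L-smooth quadratic test function (not the paper's objective) and checks the per-step decrease f(W^{t+1}) ≤ f(W^t) − ‖∇f(W^t)‖²/(2L) established in the proof.

import numpy as np

# an arbitrary L-smooth test function: f(W) = 0.5 * W^T A W, with L = lambda_max(A)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
L = np.linalg.eigvalsh(A).max()

def f(W): return 0.5 * W @ A @ W
def grad(W): return A @ W

W = np.array([5.0, -3.0])
for _ in range(200):
    W_next = W - grad(W) / L
    # the per-step decrease used in the proof of Theorem 3
    assert f(W_next) <= f(W) - np.linalg.norm(grad(W))**2 / (2 * L) + 1e-12
    W = W_next
print("descent inequality verified; final gradient norm:", np.linalg.norm(grad(W)))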

E.2 Proof of Theorem 5

Proof.
  1. (i)

    For any μ>0\mu>0,

    limy0+r(y;μ)=limy0+aya2μ(a2+1)=\displaystyle\lim_{y\rightarrow 0^{+}}r(y;\mu)=\lim_{y\rightarrow 0^{+}}\frac{a}{y}-\frac{a^{2}}{\mu}-(a^{2}+1)=\infty
r(aa2+1;μ)=μa2((aa2+1)2+μ)2<0.\displaystyle r(\frac{a}{a^{2}+1};\mu)=-\frac{\mu a^{2}}{((\frac{a}{a^{2}+1})^{2}+\mu)^{2}}<0.
  2. (ii)
    r(μ,μ)=\displaystyle r(\sqrt{\mu},\mu)= aμa24μ(a2+1)\displaystyle\frac{a}{\sqrt{\mu}}-\frac{a^{2}}{4\mu}-(a^{2}+1)
    =\displaystyle= a24(1μ2a)2a2<0\displaystyle-\frac{a^{2}}{4}(\frac{1}{\sqrt{\mu}}-\frac{2}{a})^{2}-a^{2}<0
  3. (iii)
    dr(y;μ)dy=\displaystyle\frac{dr(y;\mu)}{dy}= ay2+4a2μy(y2+μ)3\displaystyle-\frac{a}{y^{2}}+\frac{4a^{2}\mu y}{(y^{2}+\mu)^{3}}
    =\displaystyle= 4a2μy3a(y2+μ)3y2(y2+μ)3\displaystyle\frac{4a^{2}\mu y^{3}-a(y^{2}+\mu)^{3}}{y^{2}(y^{2}+\mu)^{3}}
    =\displaystyle= a((4aμ)2/3y2+(4aμ)1/3y(y2+μ)+(y2+μ)2)((4aμ)1/3yy2μ)y2(y2+μ)3\displaystyle\frac{a((4a\mu)^{2/3}y^{2}+(4a\mu)^{1/3}y(y^{2}+\mu)+(y^{2}+\mu)^{2})((4a\mu)^{1/3}y-y^{2}-\mu)}{y^{2}(y^{2}+\mu)^{3}}

    For μa24\mu\geq\frac{a^{2}}{4}, ((4aμ)1/3yy2μ)<0dr(y;μ)dy<0((4a\mu)^{1/3}y-y^{2}-\mu)<0\Leftrightarrow\frac{dr(y;\mu)}{dy}<0.
    For μ<a24\mu<\frac{a^{2}}{4}, ylb<y<yub,((4aμ)1/3yy2μ)>0dr(y;μ)dy>0y_{\mathrm{lb}}<y<y_{\mathrm{ub}},((4a\mu)^{1/3}y-y^{2}-\mu)>0\Leftrightarrow\frac{dr(y;\mu)}{dy}>0. For μ<a24\mu<\frac{a^{2}}{4}, y<ylby<y_{\mathrm{lb}} or yub<y,((4aμ)1/3yy2μ)0dr(y;μ)dy0y_{\mathrm{ub}}<y,((4a\mu)^{1/3}y-y^{2}-\mu)\leq 0\Leftrightarrow\frac{dr(y;\mu)}{dy}\leq 0.

    Note that

dr(y;μ)dy=0((4aμ)1/3yy2μ)=0(4aμ)1/3=y+μy\frac{dr(y;\mu)}{dy}=0\Leftrightarrow((4a\mu)^{1/3}y-y^{2}-\mu)=0\Leftrightarrow(4a\mu)^{1/3}=y+\frac{\mu}{y}

    The intersections of the horizontal line (4aμ)1/3(4a\mu)^{1/3} with the function y+μyy+\frac{\mu}{y} are exactly ylby_{\mathrm{lb}} and yuby_{\mathrm{ub}}; since y+μyy+\frac{\mu}{y} attains its minimum 2μ2\sqrt{\mu} at y=μy=\sqrt{\mu}, we have ylb<μ<yuby_{\mathrm{lb}}<\sqrt{\mu}<y_{\mathrm{ub}}.

  4. (iv)

    Note that for 0<μ<a240<\mu<\frac{a^{2}}{4},

    rμ=a2y2μ(μ+y2)3 and ylb<μ<yub\frac{\partial r}{\partial\mu}=-a^{2}\frac{y^{2}-\mu}{(\mu+y^{2})^{3}}\quad\text{ and }\quad y_{\mathrm{lb}}<\sqrt{\mu}<y_{\mathrm{ub}}

    then rμ|y=yub<0\left.\frac{\partial r}{\partial\mu}\right|_{y=y_{\mathrm{ub}}}<0. Let p(μ)=r(yub,μ)p(\mu)=r(y_{\mathrm{ub}},\mu), because ry|y=yub=0\frac{\partial r}{\partial y}|_{y=y_{\mathrm{ub}}}=0, then

    dp(μ)dμ=\displaystyle\frac{dp(\mu)}{d\mu}= dr(yub,μ)dμ=ry|y=yubdyubdμ+rμ|y=yub=rμ|y=yub<0\displaystyle\frac{dr(y_{\mathrm{ub}},\mu)}{d\mu}=\left.\frac{\partial r}{\partial y}\right|_{y=y_{\mathrm{ub}}}\frac{dy_{\mathrm{ub}}}{d\mu}+\left.\frac{\partial r}{\partial\mu}\right|_{y=y_{\mathrm{ub}}}=\left.\frac{\partial r}{\partial\mu}\right|_{y=y_{\mathrm{ub}}}<0

    Also note that when μ=a24\mu=\frac{a^{2}}{4}, yub=μy_{\mathrm{ub}}=\sqrt{\mu}, p(μ)=r(yub,μ)=r(μ,μ)<0p(\mu)=r({y_{\mathrm{ub}}},\mu)=r(\sqrt{\mu},\mu)<0, and also if μ<a24\mu<\frac{a^{2}}{4}, then

    yub<(4μ)1/322a1/3=(4μa)1/3\displaystyle y_{\mathrm{ub}}<\frac{(4\mu)^{1/3}}{2}2a^{1/3}=(4\mu a)^{1/3}

    Thus,

    r((4μa)1/3,μ)=\displaystyle r((4\mu a)^{1/3},\mu)= a(4μa)1/3μa2((4μa)2/3+μ)2(a2+1)\displaystyle\frac{a}{(4\mu a)^{1/3}}-\frac{\mu a^{2}}{((4\mu a)^{2/3}+\mu)^{2}}-(a^{2}+1)
    =\displaystyle= a(4μa)1/3a2(μ)1/3((4a)2/3+μ1/3)2(a2+1)\displaystyle\frac{a}{(4\mu a)^{1/3}}-\frac{a^{2}}{(\mu)^{1/3}((4a)^{2/3}+\mu^{1/3})^{2}}-(a^{2}+1)
    >\displaystyle> 1μ1/3(a(4a)1/3a2(4a)4/3)(a2+1)\displaystyle\frac{1}{\mu^{1/3}}(\frac{a}{(4a)^{1/3}}-\frac{a^{2}}{(4a)^{4/3}})-(a^{2}+1)

    Because a(4a)1/3>a2(4a)4/3\frac{a}{(4a)^{1/3}}>\frac{a^{2}}{(4a)^{4/3}}, it is easy to see when μ0\mu\rightarrow 0, r((4μa)1/3,μ)r((4\mu a)^{1/3},\mu)\rightarrow\infty. We know r(yub,μ)>r((4μa)1/3,μ)r(y_{\mathrm{ub}},\mu)>r((4\mu a)^{1/3},\mu)\rightarrow\infty as μ0\mu\rightarrow 0 because of the monotonicity of r(y;μ)r(y;\mu) in Theorem 5(iii). Combining all of these, i.e.

    dp(μ)dμ<0,limμ0+p(μ)=,p(a24)<0\displaystyle\frac{dp(\mu)}{d\mu}<0,\quad\lim_{\mu\rightarrow 0^{+}}p(\mu)=\infty,\quad p(\frac{a^{2}}{4})<0

    There exists a τ<a24\tau<\frac{a^{2}}{4} such that p(τ)=0p(\tau)=0

  5. (v)

    From Theorem 5(iv), for μ>τ\mu>\tau, then p(μ)=r(yub,μ)>0p(\mu)=r(y_{\mathrm{ub}},\mu)>0, and for μ=τ\mu=\tau, then p(μ)=r(yub,μ)=0p(\mu)=r(y_{\mathrm{ub}},\mu)=0. For μ<τ\mu<\tau, then p(μ)=r(yub,μ)<0p(\mu)=r(y_{\mathrm{ub}},\mu)<0, combining Theorem 5(i),5(iii), we get the conclusions.

  6. (vi)

By Theorem 5(v), μ<τ\forall\mu<\tau, there exist three solutions such that 0<yμ<ylb<μ<yμ<yub<yμ0<y_{\mu}^{*}<y_{\mathrm{lb}}<\sqrt{\mu}<y_{\mu}^{**}<y_{\mathrm{ub}}<y_{\mu}^{***}. Because dr(y;μ)dy|y=ylb=dr(y;μ)dy|y=yub=0\left.\frac{dr(y;\mu)}{dy}\right|_{y=y_{\mathrm{lb}}}=\left.\frac{dr(y;\mu)}{dy}\right|_{y=y_{\mathrm{ub}}}=0, we have

    dr(y;μ)dy|y=yμ0,dr(y;μ)dy|y=yμ0,dr(y;μ)dy|y=yμ0\displaystyle\left.\frac{dr(y;\mu)}{dy}\right|_{y=y_{\mu}^{*}}\neq 0,\quad\left.\frac{dr(y;\mu)}{dy}\right|_{y=y_{\mu}^{**}}\neq 0,\quad\left.\frac{dr(y;\mu)}{dy}\right|_{y=y_{\mu}^{***}}\neq 0

By the implicit function theorem (Dontchev et al., 2009), for each solution of the equation r(y;μ)=0r(y;\mu)=0 there exists a unique continuously differentiable function y=y(μ)y=y(\mu) satisfying r(y(μ),μ)=0r(y(\mu),\mu)=0. Therefore,

    rμ=\displaystyle\frac{\partial r}{\partial\mu}= a2y2μ(μ+y2)3,ry=ay2+4a2μy(y2+μ)3,dy(μ)dμ=r/μr/y\displaystyle-a^{2}\frac{y^{2}-\mu}{(\mu+y^{2})^{3}},\quad\frac{\partial r}{\partial y}=-\frac{a}{y^{2}}+\frac{4a^{2}\mu y}{(y^{2}+\mu)^{3}},\quad\frac{dy(\mu)}{d\mu}=-\frac{\partial r/\partial\mu}{\partial r/\partial y}

    Therefore by Theorem 5(iii),

    dydμ|y=yμ>0dydμ|y=yμ>0dydμ|y=yμ<0\left.\frac{dy}{d\mu}\right|_{y=y_{\mu}^{*}}>0\quad\left.\frac{dy}{d\mu}\right|_{y=y_{\mu}^{**}}>0\quad\left.\frac{dy}{d\mu}\right|_{y=y_{\mu}^{***}}<0

    Because limμ0+ylb=limμ0+yub=0\lim_{\mu\rightarrow 0^{+}}y_{\mathrm{lb}}=\lim_{\mu\rightarrow 0^{+}}y_{\mathrm{ub}}=0, then limμ0+yμ=limμ0+yμ=0\lim_{\mu\rightarrow 0^{+}}y_{\mu}^{*}=\lim_{\mu\rightarrow 0^{+}}y_{\mu}^{**}=0. Let us consider r(aa2+1(1cμ),μ)r(\frac{a}{a^{2}+1}(1-c\mu),\mu) where c=32(a2+1)3a2c=32\frac{(a^{2}+1)^{3}}{a^{2}} and μ<12c\mu<\frac{1}{2c}

    r(aa2+1(1cμ),μ)\displaystyle r(\frac{a}{a^{2}+1}(1-c\mu),\mu)
    =\displaystyle= aaa2+1(1cμ)μa2(a2(a2+1)2(1cμ)2+μ)2(a2+1)\displaystyle\frac{a}{\frac{a}{a^{2}+1}(1-c\mu)}-\frac{\mu a^{2}}{(\frac{a^{2}}{(a^{2}+1)^{2}}(1-c\mu)^{2}+\mu)^{2}}-(a^{2}+1)
    =\displaystyle= (a2+1)(cμ1cμ)μa2(a2(a2+1)2(1cμ)2+μ)2\displaystyle(a^{2}+1)(\frac{c\mu}{1-c\mu})-\frac{\mu a^{2}}{(\frac{a^{2}}{(a^{2}+1)^{2}}(1-c\mu)^{2}+\mu)^{2}}
    \displaystyle\geq c(a2+1)μμa2(a2(a2+1)2(1cμ)2)2\displaystyle c(a^{2}+1)\mu-\frac{\mu a^{2}}{(\frac{a^{2}}{(a^{2}+1)^{2}}(1-c\mu)^{2})^{2}}
\displaystyle\geq c(a2+1)μ16(a2+1)4a2μ\displaystyle c(a^{2}+1)\mu-\frac{16(a^{2}+1)^{4}}{a^{2}}\mu
    =\displaystyle= 16(a2+1)4a2μ>0\displaystyle\frac{16(a^{2}+1)^{4}}{a^{2}}\mu>0

By Theorem 5(iii), it follows that aa2+1(1cμ)<yμ\frac{a}{a^{2}+1}(1-c\mu)<y_{\mu}^{***}. Hence

    aa2+1=limμ0+aa2+1(1cμ)limμ0+yμaa2+1\frac{a}{a^{2}+1}=\lim_{\mu\rightarrow 0^{+}}\frac{a}{a^{2}+1}(1-c\mu)\leq\lim_{\mu\rightarrow 0^{+}}y_{\mu}^{***}\leq\frac{a}{a^{2}+1}

    Consequently,

    limμ0+yμ=aa2+1\lim_{\mu\rightarrow 0^{+}}y_{\mu}^{***}=\frac{a}{a^{2}+1}
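The monotonicity claims in Theorem 5(vi) can also be checked by finite differences: the sketch below recomputes the three roots of r(y;μ) at μ ± h and inspects the signs of the difference quotients; a = 0.6 and μ = 0.009 are illustrative choices with μ < τ.

import numpy as np
from scipy.optimize import brentq

a = 0.6   # illustrative value

def r(y, mu):
    # r(y; mu) from (8)
    return a / y - (a**2 + 1) - mu * a**2 / (y**2 + mu)**2

def three_roots(mu):
    ys = np.geomspace(1e-6, a / (a**2 + 1), 4000)
    vals = r(ys, mu)
    return [brentq(r, ys[i], ys[i + 1], args=(mu,))
            for i in range(len(ys) - 1) if vals[i] * vals[i + 1] < 0]

mu, h = 0.009, 1e-5
lo, hi = three_roots(mu - h), three_roots(mu + h)
slopes = [(hi[i] - lo[i]) / (2 * h) for i in range(3)]   # roots ordered y^* < y^** < y^***
print("dy*/dmu   > 0:", slopes[0] > 0)
print("dy**/dmu  > 0:", slopes[1] > 0)
print("dy***/dmu < 0:", slopes[2] < 0)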

E.3 Proof of Theorem 6

Proof.
  1. (i)

    For μ>0\mu>0,

    limx0+t(x;μ)=limx0+axa2μ(a2+1)21=\displaystyle\lim_{x\rightarrow 0^{+}}t(x;\mu)=\lim_{x\rightarrow 0^{+}}\frac{a}{x}-\frac{a^{2}}{\mu(a^{2}+1)^{2}}-1=\infty
    t(a,μ)=μa2(μ(a2+1)+a2)2<0\displaystyle t(a,\mu)=-\frac{\mu a^{2}}{(\mu(a^{2}+1)+a^{2})^{2}}<0
  2. (ii)
    t(μ(a2+1),μ)=\displaystyle t(\sqrt{\mu(a^{2}+1)},\mu)= aa2+11μa24μ(a2+1)21\displaystyle\frac{a}{\sqrt{a^{2}+1}}\frac{1}{\sqrt{\mu}}-\frac{a^{2}}{4\mu(a^{2}+1)^{2}}-1

    If t(μ(a2+1),μ)=0t(\sqrt{\mu(a^{2}+1)},\mu)=0, then

    1μ=2(a2+1)3/2a±2(a2+1)μ=(a(a2+1a)2(a2+1))2\frac{1}{\sqrt{\mu}}=2\frac{(a^{2}+1)^{3/2}}{a}\pm 2(a^{2}+1)\Rightarrow\mu=\left(\frac{a(\sqrt{a^{2}+1}\mp a)}{2(a^{2}+1)}\right)^{2}

    so when μ<(a(a2+1a)2(a2+1))2\mu<\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2} or μ>(a(a2+1+a)2(a2+1))2\mu>\left(\frac{a(\sqrt{a^{2}+1}+a)}{2(a^{2}+1)}\right)^{2}, then t(μ(a2+1),μ)<0t(\sqrt{\mu(a^{2}+1)},\mu)<0

  3. (iii)
    dt(x,μ)dx\displaystyle\frac{dt(x,\mu)}{dx}
    =\displaystyle= ax2+4μa2x(μ(a2+1)+x2)3\displaystyle-\frac{a}{x^{2}}+\frac{4\mu a^{2}x}{(\mu(a^{2}+1)+x^{2})^{3}}
    =\displaystyle= 4μa2x3a(μ(a2+1)+x2)3x2(μ(a2+1)+x2)3\displaystyle\frac{4\mu a^{2}x^{3}-a(\mu(a^{2}+1)+x^{2})^{3}}{x^{2}(\mu(a^{2}+1)+x^{2})^{3}}
    =\displaystyle= a((μ(a2+1)+x2)2+(μ(a2+1)+x2)(4μa)1/3x+(4μa)2/3x2)((4μa)1/3xμ(a2+1)x2)x2(μ(a2+1)+x2)3\displaystyle\frac{a((\mu(a^{2}+1)+x^{2})^{2}+(\mu(a^{2}+1)+x^{2})(4\mu a)^{1/3}x+(4\mu a)^{2/3}x^{2})((4\mu a)^{1/3}x-\mu(a^{2}+1)-x^{2})}{x^{2}(\mu(a^{2}+1)+x^{2})^{3}}

    For μ>a24(a2+1)3\mu>\frac{a^{2}}{4(a^{2}+1)^{3}}, then (4μa)1/3xμ(a2+1)x2<0dt(x,μ)dx<0(4\mu a)^{1/3}x-\mu(a^{2}+1)-x^{2}<0\Leftrightarrow\frac{dt(x,\mu)}{dx}<0. For μ<a24(a2+1)3\mu<\frac{a^{2}}{4(a^{2}+1)^{3}}, and xlb<x<xubx_{\mathrm{lb}}<x<x_{\mathrm{ub}}, then (4μa)1/3xμ(a2+1)x2>0dt(x,μ)dx>0(4\mu a)^{1/3}x-\mu(a^{2}+1)-x^{2}>0\Leftrightarrow\frac{dt(x,\mu)}{dx}>0, For μ<a24(a2+1)3\mu<\frac{a^{2}}{4(a^{2}+1)^{3}}, x<xlbx<x_{\mathrm{lb}} or x>xubx>x_{\mathrm{ub}}, (4μa)1/3xμ(a2+1)x2<0dt(x,μ)dx<0(4\mu a)^{1/3}x-\mu(a^{2}+1)-x^{2}<0\Leftrightarrow\frac{dt(x,\mu)}{dx}<0.

    We use the same argument as before to show that

    xlb<μ(a2+1)<xubx_{\mathrm{lb}}<\sqrt{\mu(a^{2}+1)}<x_{\mathrm{ub}}
  4. (iv)

    Note that for 0<μ<a24(a2+1)30<\mu<\frac{a^{2}}{4(a^{2}+1)^{3}}

    tμ=a2x2μ(a2+1)(μ(a2+1)+x2)3andxlb<μ(a2+1)<xub\frac{\partial t}{\partial\mu}=-a^{2}\frac{x^{2}-\mu(a^{2}+1)}{(\mu(a^{2}+1)+x^{2})^{3}}\quad\text{and}\quad x_{\mathrm{lb}}<\sqrt{\mu(a^{2}+1)}<x_{\mathrm{ub}}

    then tμ|x=xlb>0\left.\frac{\partial t}{\partial\mu}\right|_{x=x_{\mathrm{lb}}}>0. Let q(μ)=t(xlb,μ)q(\mu)=t(x_{\mathrm{lb}},\mu), because tx|x=xlb=0\left.\frac{\partial t}{\partial x}\right|_{x=x_{\mathrm{lb}}}=0, then

    dq(μ)dμ=\displaystyle\frac{dq(\mu)}{d\mu}= dt(xlb,μ)dμ=tx|x=xlbdxlbdμ+tμ|x=xlb=tμ|x=xlb>0\displaystyle\frac{dt(x_{\mathrm{lb}},\mu)}{d\mu}=\left.\frac{\partial t}{\partial x}\right|_{x=x_{\mathrm{lb}}}\frac{dx_{\mathrm{lb}}}{d\mu}+\left.\frac{\partial t}{\partial\mu}\right|_{x=x_{\mathrm{lb}}}=\left.\frac{\partial t}{\partial\mu}\right|_{x=x_{\mathrm{lb}}}>0

Note that when μ=a24(a2+1)3\mu=\frac{a^{2}}{4(a^{2}+1)^{3}}, xub=xlb=(4μa)1/32x_{\mathrm{ub}}=x_{\mathrm{lb}}=\frac{(4\mu a)^{1/3}}{2} and t((4μa)1/32,a24(a2+1)3)=a(4μa)1/31>0t(\frac{(4\mu a)^{1/3}}{2},\frac{a^{2}}{4(a^{2}+1)^{3}})=\frac{a}{(4\mu a)^{1/3}}-1>0. When μ<(a(a2+1a)2(a2+1))2\mu<\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2}, then t(μ(a2+1),μ)<0t(\sqrt{\mu(a^{2}+1)},\mu)<0 by Theorem 6(ii), and by the monotonicity in Theorem 6(iii), q(μ)=t(xlb,μ)<t(μ(a2+1),μ)<0q(\mu)=t(x_{\mathrm{lb}},\mu)<t(\sqrt{\mu(a^{2}+1)},\mu)<0; hence q(μ)<0q(\mu)<0 as μ0+\mu\rightarrow 0^{+}. Combining all of these, i.e.

    dq(μ)dμ>0,limμ0+q(μ)<0,q(a24(a2+1)3)>0\displaystyle\frac{dq(\mu)}{d\mu}>0,\quad\lim_{\mu\rightarrow 0^{+}}q(\mu)<0,\quad q(\frac{a^{2}}{4(a^{2}+1)^{3}})>0

    There exists a τ<a24(a2+1)3\tau<\frac{a^{2}}{4(a^{2}+1)^{3}}, q(τ)=0q(\tau)=0. Such τ\tau is the same as in Theorem 5(iv).

  5. (v)

    We follow the same proof from the proof of Theorem 5(v).

  6. (vi)

By Theorem 6(v), μ<τ\forall\mu<\tau, there exist three solutions such that 0<xμ<xlb<xμ<xub<xμ<a0<x_{\mu}^{***}<x_{\mathrm{lb}}<x_{\mu}^{**}<x_{\mathrm{ub}}<x_{\mu}^{*}<a. Because dt(x;μ)dx|x=xlb=dt(x;μ)dx|x=xub=0\left.\frac{dt(x;\mu)}{dx}\right|_{x=x_{\mathrm{lb}}}=\left.\frac{dt(x;\mu)}{dx}\right|_{x=x_{\mathrm{ub}}}=0, we have

    dt(x;μ)dx|x=xμ0,dt(x;μ)dx|x=xμ0,dt(x;μ)dx|x=xμ0\displaystyle\left.\frac{dt(x;\mu)}{dx}\right|_{x=x_{\mu}^{*}}\neq 0,\quad\left.\frac{dt(x;\mu)}{dx}\right|_{x=x_{\mu}^{**}}\neq 0,\quad\left.\frac{dt(x;\mu)}{dx}\right|_{x=x_{\mu}^{***}}\neq 0

By the implicit function theorem (Dontchev et al., 2009), for each solution of the equation t(x;μ)=0t(x;\mu)=0 there exists a unique continuously differentiable function x=x(μ)x=x(\mu) satisfying t(x(μ),μ)=0t(x(\mu),\mu)=0. Therefore,

    dxdμ=t/μt/x=a2x2μ(a2+1)(μ(a2+1)+x2)3ax2+4μa2x(μ(a2+1)+x2)3\frac{dx}{d\mu}=-\frac{\partial t/\partial\mu}{\partial t/\partial x}=a^{2}\frac{\frac{x^{2}-\mu(a^{2}+1)}{(\mu(a^{2}+1)+x^{2})^{3}}}{-\frac{a}{x^{2}}+\frac{4\mu a^{2}x}{(\mu(a^{2}+1)+x^{2})^{3}}}

    Therefore, by Theorem 6(iii)

    dxdμ|x=xμ<0dxdμ|x=xμ>0\left.\frac{dx}{d\mu}\right|_{x=x_{\mu}^{*}}<0\quad\left.\frac{dx}{d\mu}\right|_{x=x_{\mu}^{***}}>0

    Because 0<xμ<xlb<xμ<xub0<x_{\mu}^{***}<x_{\mathrm{lb}}<x_{\mu}^{**}<x_{\mathrm{ub}} and limμ0+xlb=limμ0+xub=0\lim_{\mu\rightarrow 0^{+}}x_{\mathrm{lb}}=\lim_{\mu\rightarrow 0^{+}}x_{\mathrm{ub}}=0.

    limμ0xμ=limμ0xμ=0\lim_{\mu\rightarrow 0}x_{\mu}^{**}=\lim_{\mu\rightarrow 0}x_{\mu}^{***}=0

    Let us consider t(a(1cμ),μ)t(a(1-c\mu),\mu) where c=32a2c=\frac{32}{a^{2}} and μ<12c\mu<\frac{1}{2c}

    t(a(1cμ);μ)\displaystyle t(a(1-c\mu);\mu)
    =\displaystyle= aa(1cμ)μa2(μ(a2+1)+a2(1cμ)2)21\displaystyle\frac{a}{a(1-c\mu)}-\frac{\mu a^{2}}{(\mu(a^{2}+1)+a^{2}(1-c\mu)^{2})^{2}}-1
    =\displaystyle= cμ1cμμa2(μ(a2+1)+a2(1cμ)2)2\displaystyle\frac{c\mu}{1-c\mu}-\frac{\mu a^{2}}{(\mu(a^{2}+1)+a^{2}(1-c\mu)^{2})^{2}}
    \displaystyle\geq cμμa2(a2(1cμ)2)2\displaystyle c\mu-\frac{\mu a^{2}}{(a^{2}(1-c\mu)^{2})^{2}}
    \displaystyle\geq cμ16a2μ>0\displaystyle c\mu-\frac{16}{a^{2}}\mu>0

By Theorem 6(iii), this implies

    a(1cμ)xμa(1-c\mu)\leq x_{\mu}^{*}

Taking μ0+\mu\rightarrow 0^{+} on both sides,

    a=limμ0+a(1cμ)limμ0+xμaa=\lim_{\mu\rightarrow 0^{+}}a(1-c\mu)\leq\lim_{\mu\rightarrow 0^{+}}x_{\mu}^{*}\leq a

    Hence, limμ0xμ=a\lim_{\mu\rightarrow 0}x_{\mu}^{*}=a.

When μ=τ\mu=\tau, because t(xlb;μ)=0t(x_{\mathrm{lb}};\mu)=0, xub>μ(a2+1)>xlbx_{\mathrm{ub}}>\sqrt{\mu(a^{2}+1)}>x_{\mathrm{lb}}, and t(x;μ)t(x;\mu) is an increasing function on [xlb,xub][x_{\mathrm{lb}},x_{\mathrm{ub}}], we have t(μ(a2+1);μ)>t(xlb;μ)=0t(\sqrt{\mu(a^{2}+1)};\mu)>t(x_{\mathrm{lb}};\mu)=0. Moreover, t(μ(a2+1),μ)t(\sqrt{\mu(a^{2}+1)},\mu), xlbx_{\mathrm{lb}} and xμx_{\mu}^{**} are continuous functions of μ\mu, so there exists a small δ>0\delta>0 such that for μ=τδ\mu=\tau-\delta we have t(μ(a2+1),μ)>0t(\sqrt{\mu(a^{2}+1)},\mu)>0, t(xlb,μ)<0t(x_{\mathrm{lb}},\mu)<0 (by Theorem 6(iv)) and xμ>xlbx_{\mu}^{**}>x_{\mathrm{lb}}; it follows that xμ<μ(a2+1)x_{\mu}^{**}<\sqrt{\mu(a^{2}+1)}, hence dxdμ|x=xμ<0\left.\frac{dx}{d\mu}\right|_{x=x_{\mu}^{**}}<0. This implies that as μ\mu decreases, xμx_{\mu}^{**} increases. This relation holds until xμ=μ(a2+1)x_{\mu}^{**}=\sqrt{\mu(a^{2}+1)}

    t(xμ,μ)=t(μ(a2+1),μ)=0\displaystyle t(x_{\mu}^{**},\mu)=t(\sqrt{\mu(a^{2}+1)},\mu)=0
    \displaystyle\Rightarrow μ=(a(a2+1a)2(a2+1))2\displaystyle\mu=\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2}

and, at this value of \mu, \sqrt{\mu(a^{2}+1)}=\frac{a(\sqrt{a^{2}+1}-a)}{2\sqrt{a^{2}+1}}. Note that when \mu<\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2}, we have t(\sqrt{\mu(a^{2}+1)},\mu)<0, which implies x_{\mu}^{**}>\sqrt{\mu(a^{2}+1)} and \left.\frac{dx}{d\mu}\right|_{x=x_{\mu}^{**}}>0; thus decreasing \mu decreases x_{\mu}^{**}. We conclude that

    maxμτxμa(a2+1a)2a2+1\max_{\mu\leq\tau}x_{\mu}^{**}\leq\frac{a(\sqrt{a^{2}+1}-a)}{2\sqrt{a^{2}+1}}

    Note that μ s.t.(a(a2+1a)2(a2+1))2<μ<τ\forall\mu\text{ s.t.}\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2}<\mu<\tau, xμ<(a(a2+1a)2(a2+1))2x_{\mu}^{**}<\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2}, so t((a(a2+1a)2(a2+1))2,μ)0.t(\left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2},\mu)\geq 0.

Note that when (x_{\mu}^{*})^{2}\geq\mu(a^{2}+1) (which requires \mu\leq\frac{a^{2}}{a^{2}+1}, since x_{\mu}^{*}\leq a), then

    dxdμ|x=xμ>0\displaystyle\left.\frac{dx}{d\mu}\right|_{x=x_{\mu}^{*}}>0

This implies that as \mu decreases, x_{\mu}^{*} also decreases. This holds until x^{*}_{\mu}=\sqrt{\mu(a^{2}+1)}. Applying the same analysis as above to x_{\mu}^{*}, we conclude that

\min_{\mu>0}x_{\mu}^{*}=\frac{a(\sqrt{a^{2}+1}+a)}{2\sqrt{a^{2}+1}}

    Hence

    maxμτxμa(a2+1a)2a2+1<a(a2+1+a)2a2+1minμ>0xμ\max_{\mu\leq\tau}x_{\mu}^{**}\leq\frac{a(\sqrt{a^{2}+1}-a)}{2\sqrt{a^{2}+1}}<\frac{a(\sqrt{a^{2}+1}+a)}{2\sqrt{a^{2}+1}}\leq\min_{\mu>0}x_{\mu}^{*}
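The separation just established can also be checked numerically. The following Python sketch (not part of the proof) sweeps \mu below \left(\frac{a(\sqrt{a^{2}+1}-a)}{2(a^{2}+1)}\right)^{2}, which lies below \tau, locates the stationary x-coordinates as the roots of t(x;\mu)=\frac{a}{x}-\frac{\mu a^{2}}{(\mu(a^{2}+1)+x^{2})^{2}}-1, and compares them with the two constants above; the value a=2 is an arbitrary illustration.

import numpy as np

a = 2.0

def t(x, mu):
    return a / x - mu * a**2 / (mu * (a**2 + 1) + x**2)**2 - 1.0

def stationary_x(mu, n=200_000):
    """All roots of t(.; mu) on (0, a], located via sign changes plus bisection."""
    xs = np.geomspace(1e-8, a, n)
    sgn = np.sign(t(xs, mu))
    roots = []
    for i in np.where(sgn[:-1] * sgn[1:] < 0)[0]:
        lo, hi = xs[i], xs[i + 1]
        for _ in range(60):                  # plain bisection
            mid = 0.5 * (lo + hi)
            if t(lo, mu) * t(mid, mu) <= 0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots                             # increasing: [x***, x**, x*] when three exist

lb = a * (np.sqrt(a**2 + 1) - a) / (2 * np.sqrt(a**2 + 1))   # bound on x_mu^{**}
ub = a * (np.sqrt(a**2 + 1) + a) / (2 * np.sqrt(a**2 + 1))   # bound on x_mu^{*}
mu_c = (a * (np.sqrt(a**2 + 1) - a) / (2 * (a**2 + 1)))**2   # sweep stays below tau

for mu in np.geomspace(1e-4 * mu_c, mu_c, 15):
    r = stationary_x(mu)
    if len(r) == 3:
        assert r[1] <= lb + 1e-6 and r[2] >= ub - 1e-6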

E.4 Proof of Theorems 7, 8 and 9

Proof.

The proofs are analogous to those of Theorems 5 and 6. ∎

E.5 Proof of Lemma 1

Proof.
2gμ(x,y)=(μ+y22xy2xyμ(a2+1)+x2)\displaystyle\nabla^{2}g_{\mu}(x,y)=\begin{pmatrix}\mu+y^{2}&2xy\\ 2xy&\mu(a^{2}+1)+x^{2}\end{pmatrix}

Let \lambda_{1}(\nabla^{2}g_{\mu}(x,y)),\lambda_{2}(\nabla^{2}g_{\mu}(x,y)) be the eigenvalues of the matrix \nabla^{2}g_{\mu}(x,y); then

λ1(2gμ(x,y))+λ2(2gμ(x,y))\displaystyle\lambda_{1}(\nabla^{2}g_{\mu}(x,y))+\lambda_{2}(\nabla^{2}g_{\mu}(x,y))
=\displaystyle= Tr(2gμ(x,y))=μ+y2+μ(a2+1)+x2>0\displaystyle\operatorname{Tr}(\nabla^{2}g_{\mu}(x,y))=\mu+y^{2}+\mu(a^{2}+1)+x^{2}>0

Now we calculate the product of the eigenvalues:

λ1(2gμ(x,y))λ2(2gμ(W))\displaystyle\lambda_{1}(\nabla^{2}g_{\mu}(x,y))\cdot\lambda_{2}(\nabla^{2}g_{\mu}(W))
=\displaystyle= det(2gμ(W))\displaystyle\operatorname{det}(\nabla^{2}g_{\mu}(W))
=\displaystyle= (μ+y2)(μ(a2+1)+x2)4x2y2\displaystyle(\mu+y^{2})(\mu(a^{2}+1)+x^{2})-4x^{2}y^{2}
=\displaystyle= μaxμay4x2y2>0\displaystyle\frac{\mu a}{x}\frac{\mu a}{y}-4x^{2}y^{2}>0
\displaystyle\Leftrightarrow (aμ2)2/3>xy\displaystyle(\frac{a\mu}{2})^{2/3}>xy
\displaystyle\Leftrightarrow (aμ2)2/3>aμy2+μy\displaystyle(\frac{a\mu}{2})^{2/3}>\frac{a\mu}{y^{2}+\mu}y
\displaystyle\Leftrightarrow y+μy>(4aμ)1/3\displaystyle y+\frac{\mu}{y}>(4a\mu)^{1/3}

Note that (x_{\mu}^{*},y_{\mu}^{*}) and (x_{\mu}^{***},y_{\mu}^{***}) satisfy (11a) and (11b); this fact is used in the third equality and the second “\Leftrightarrow” above. By (32b), we know \lambda_{1}(\nabla^{2}g_{\mu}(x,y))\cdot\lambda_{2}(\nabla^{2}g_{\mu}(x,y))>0 for (x_{\mu}^{*},y_{\mu}^{*}) and (x_{\mu}^{***},y_{\mu}^{***}), and \lambda_{1}(\nabla^{2}g_{\mu}(x,y))\cdot\lambda_{2}(\nabla^{2}g_{\mu}(x,y))<0 for (x_{\mu}^{**},y_{\mu}^{**}); then

λ1(2gμ(x,y))>0,λ2(2gμ(x,y))>0 for (xμ,yμ),(xμ,yμ)\lambda_{1}(\nabla^{2}g_{\mu}(x,y))>0,\lambda_{2}(\nabla^{2}g_{\mu}(x,y))>0\qquad\text{ for }(x_{\mu}^{*},y_{\mu}^{*}),(x_{\mu}^{***},y_{\mu}^{***})
λ1(2gμ(x,y))<0 or λ2(2gμ(x,y))<0 for (xμ,yμ)\lambda_{1}(\nabla^{2}g_{\mu}(x,y))<0\text{ or }\lambda_{2}(\nabla^{2}g_{\mu}(x,y))<0\qquad\text{ for }(x_{\mu}^{**},y_{\mu}^{**})

and

gμ(x,y)=0\nabla g_{\mu}(x,y)=0

Then (x_{\mu}^{*},y_{\mu}^{*}) and (x_{\mu}^{***},y_{\mu}^{***}) are local minima, and (x_{\mu}^{**},y_{\mu}^{**}) is a saddle point of g_{\mu}(W). ∎
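As a numerical illustration of this classification (outside the proof), one can eliminate y from (11a)–(11b), which gives the quintic (a-x)(x^{2}+\mu(a^{2}+1))^{2}-\mu a^{2}x=0 for the stationary x-coordinates, and then inspect the eigenvalues of the Hessian above at each stationary point; the values a=2 and \mu=10^{-3} below are arbitrary illustrations.

import numpy as np

a, mu = 2.0, 1e-3
c = mu * (a**2 + 1)

# Coefficients of -x^5 + a x^4 - 2c x^3 + 2ac x^2 - (c^2 + mu a^2) x + a c^2,
# i.e. the expansion of (a - x)(x^2 + c)^2 - mu a^2 x.
coeffs = [-1.0, a, -2 * c, 2 * a * c, -(c**2 + mu * a**2), a * c**2]
xs = sorted(r.real for r in np.roots(coeffs) if abs(r.imag) < 1e-8 and 0 < r.real < a)
assert len(xs) == 3                               # x*** < x** < x* (here mu < tau)

labels = []
for x in xs:
    y = mu * a / (x**2 + c)                       # from (11b)
    hess = np.array([[mu + y**2, 2 * x * y],
                     [2 * x * y, mu * (a**2 + 1) + x**2]])
    lam = np.linalg.eigvalsh(hess)                # eigenvalues in ascending order
    labels.append('min' if lam[0] > 0 else 'saddle')

# (x***, y***) and (x*, y*) are local minima; (x**, y**) is a saddle.
assert labels == ['min', 'saddle', 'min']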

E.6 Proof of Lemma 2

Figure 8: Stationary points when \mu<\tau
Proof.

Let us define the following functions:

yμ1(x)=μ(axx)\displaystyle y_{\mu 1}(x)=\sqrt{\mu(\frac{a-x}{x})} 0<xa0<x\leq a (37a)
yμ2(x)=μaμ(a2+1)+x2\displaystyle y_{\mu 2}(x)=\frac{\mu a}{\mu(a^{2}+1)+x^{2}} 0<xa0<x\leq a (37b)
xμ1(y)=μay2+μ\displaystyle x_{\mu 1}(y)=\frac{\mu a}{y^{2}+\mu} 0<y<aa2+10<y<\frac{a}{a^{2}+1} (38a)
xμ2(y)=μ(ay(a2+1))\displaystyle x_{\mu 2}(y)=\sqrt{\mu(\frac{a}{y}-(a^{2}+1))} 0<y<aa2+10<y<\frac{a}{a^{2}+1} (38b)

A direct calculation shows that

yμ1yμ2t(x;μ)0x(0,xμ][xμ,xμ]y_{\mu 1}\geq y_{\mu 2}\Leftrightarrow t(x;\mu)\geq 0\Leftrightarrow x\in(0,x_{\mu}^{***}]\cup[x_{\mu}^{**},x_{\mu}^{*}]

and

xμ1xμ2r(y;μ)0y[yμ,yμ][yμ,aa2+1)x_{\mu 1}\geq x_{\mu 2}\Leftrightarrow r(y;\mu)\leq 0\Leftrightarrow y\in[y_{\mu}^{*},y_{\mu}^{**}]\cup[y_{\mu}^{***},\frac{a}{a^{2}+1})

We divide B_{\mu} into three parts, C_{\mu 1},C_{\mu 2},C_{\mu 3}:

Cμ1=\displaystyle C_{\mu 1}= {(x,y)|xμ<xxμ,yμ1<y<yμ}{(x,y)|xμ<xa,yμ2<y<yμ}\displaystyle\{(x,y)|x_{\mu}^{**}<x\leq x_{\mu}^{*},y_{\mu 1}<y<y_{\mu}^{**}\}\cup\{(x,y)|x_{\mu}^{*}<x\leq a,y_{\mu 2}<y<y_{\mu}^{**}\} (39)
Cμ2=\displaystyle C_{\mu 2}= {(x,y)|xμ<xxμ,0y<yμ2}{(x,y)|xμ<xa,0y<yμ1}\displaystyle\{(x,y)|x_{\mu}^{**}<x\leq x_{\mu}^{*},0\leq y<y_{\mu 2}\}\cup\{(x,y)|x_{\mu}^{*}<x\leq a,0\leq y<y_{\mu 1}\} (40)
Cμ3=\displaystyle C_{\mu 3}= {(x,y)|xμ<xxμ,yμ2yyμ1}{(x,y)|xμ<xa,yμ1yyμ2}\displaystyle\{(x,y)|x_{\mu}^{**}<x\leq x_{\mu}^{*},y_{\mu 2}\leq y\leq y_{\mu 1}\}\cup\{(x,y)|x_{\mu}^{*}<x\leq a,y_{\mu 1}\leq y\leq y_{\mu 2}\} (41)

Also note that

(x,y)Cμ1gμ(x,y)x>0,gμ(x,y)y>0\displaystyle\forall(x,y)\in C_{\mu 1}\Rightarrow\frac{\partial g_{\mu}(x,y)}{\partial x}>0,\frac{\partial g_{\mu}(x,y)}{\partial y}>0
(x,y)Cμ2gμ(x,y)x<0,gμ(x,y)y<0\displaystyle\forall(x,y)\in C_{\mu 2}\Rightarrow\frac{\partial g_{\mu}(x,y)}{\partial x}<0,\frac{\partial g_{\mu}(x,y)}{\partial y}<0

The gradient flow follows

(x(t)y(t))=(gμ(x(t),y(t))xgμ(x(t),y(t))y)=gμ(x(t),y(t))\begin{pmatrix}x^{\prime}(t)\\ y^{\prime}(t)\end{pmatrix}=-\begin{pmatrix}\frac{\partial g_{\mu}(x(t),y(t))}{\partial x}\\ \frac{\partial g_{\mu}(x(t),y(t))}{\partial y}\end{pmatrix}=-\nabla g_{\mu}(x(t),y(t))

then

(x,y)Cμ1(x(t)y(t))<0,gμ>0\displaystyle\forall(x,y)\in C_{\mu 1}\Rightarrow\begin{pmatrix}x^{\prime}(t)\\ y^{\prime}(t)\end{pmatrix}<0,\quad\lVert\nabla g_{\mu}\rVert>0 (42)
(x,y)Cμ2(x(t)y(t))>0,gμ>0\displaystyle\forall(x,y)\in C_{\mu 2}\Rightarrow\begin{pmatrix}x^{\prime}(t)\\ y^{\prime}(t)\end{pmatrix}>0,\quad\lVert\nabla g_{\mu}\rVert>0 (43)

Note that \lVert\nabla g_{\mu}\rVert is bounded away from 0 on C_{\mu 1} and C_{\mu 2}. Consider (x(0),y(0))\in C_{\mu 1}. Since \nabla g_{\mu}(x,y)\neq 0, -\nabla g_{\mu}(x,y)<0 componentwise as in (42), and C_{\mu 1} is bounded, there exists a finite t_{0}>0 such that

(x(t0),y(t0))Cμ1,(x(t),y(t))Cμ1 for 0t<t0\displaystyle(x(t_{0}),y(t_{0}))\in\partial C_{\mu 1},(x(t),y(t))\in C_{\mu 1}\text{ for }0\leq t<t_{0}

where Cμ1\partial C_{\mu 1} is defined as

Cμ1={(x,y)|xμ<xxμ,y=yμ1}{(x,y)|xμ<xa,y=yμ2}Cμ3\partial C_{\mu 1}=\{(x,y)|x_{\mu}^{**}<x\leq x_{\mu}^{*},y=y_{\mu 1}\}\cup\{(x,y)|x_{\mu}^{*}<x\leq a,y=y_{\mu 2}\}\subseteq C_{\mu 3}

For the same reason, if (x(0),y(0))\in C_{\mu 2}, there exists a finite time t_{1}>0 such that

(x(t_{1}),y(t_{1}))\in\partial C_{\mu 2},(x(t),y(t))\in C_{\mu 2}\text{ for }0\leq t<t_{1}

where Cμ2\partial C_{\mu 2} is defined as

Cμ2={(x,y)|xμ<xxμ,y=yμ2}{(x,y)|xμ<xa,y=yμ1}Cμ3\partial C_{\mu 2}=\{(x,y)|x_{\mu}^{**}<x\leq x_{\mu}^{*},y=y_{\mu 2}\}\cup\{(x,y)|x_{\mu}^{*}<x\leq a,y=y_{\mu 1}\}\subseteq C_{\mu 3}

Then, by Lemma 7, \lim_{t\rightarrow\infty}(x(t),y(t))=(x_{\mu}^{*},y_{\mu}^{*}). ∎
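A small numerical experiment (outside the proof) illustrates Lemma 2: forward-Euler integration of the gradient flow from representative starting points converges to (x_{\mu}^{*},y_{\mu}^{*}). The sketch below assumes the gradient \nabla g_{\mu}(x,y)=\big((\mu+y^{2})x-\mu a,\;(\mu(a^{2}+1)+x^{2})y-\mu a\big), which is consistent with the stationarity conditions (11a)–(11b) and with the Hessian used in Lemma 1; the values a=2, \mu=10^{-3}, the step size and the starting points are arbitrary illustrations.

import numpy as np

a, mu = 2.0, 1e-3
c = mu * (a**2 + 1)

def grad(z):
    x, y = z
    return np.array([(mu + y**2) * x - mu * a,
                     (c + x**2) * y - mu * a])

# Reference point (x*, y*): the largest stationary x-coordinate, obtained from the
# quintic that results from eliminating y in (11a)-(11b) (cf. the check for Lemma 1).
coeffs = [-1.0, a, -2 * c, 2 * a * c, -(c**2 + mu * a**2), a * c**2]
x_star = max(r.real for r in np.roots(coeffs) if abs(r.imag) < 1e-8 and 0 < r.real < a)
y_star = mu * a / (x_star**2 + c)

h = 0.1                                   # step size below 2 / (3a^2 + mu(a^2+1))
for z0 in ([1.0, 1e-3], [1.9, 0.3]):      # representative starting points
    z = np.array(z0)
    for _ in range(200_000):
        z = z - h * grad(z)               # forward Euler on z' = -grad g_mu(z)
    assert np.allclose(z, [x_star, y_star], atol=1e-2)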

E.7 Proof of Lemma 3

Proof.

This follows directly from Theorem 5. ∎

E.8 Proof of Lemma 5

Proof.

Note that

2gμ(W)=(μ+y22xy2xyμ(a2+1)+x2)=(μ00μ(a2+1))+(y22xy2xyx2)\displaystyle\nabla^{2}g_{\mu}(W)=\begin{pmatrix}\mu+y^{2}&2xy\\ 2xy&\mu(a^{2}+1)+x^{2}\end{pmatrix}=\begin{pmatrix}\mu&0\\ 0&\mu(a^{2}+1)\end{pmatrix}+\begin{pmatrix}y^{2}&2xy\\ 2xy&x^{2}\end{pmatrix}

Let \|\cdot\|_{\text{op}} denote the spectral norm, which satisfies the triangle inequality:

2gμ(W)op\displaystyle\left\lVert\nabla^{2}g_{\mu}(W)\right\rVert_{\text{op}}\leq (μ00μ(a2+1))op+(y22xy2xyx2)op\displaystyle\left\lVert\begin{pmatrix}\mu&0\\ 0&\mu(a^{2}+1)\end{pmatrix}\right\rVert_{\text{op}}+\left\lVert\begin{pmatrix}y^{2}&2xy\\ 2xy&x^{2}\end{pmatrix}\right\rVert_{\text{op}}
=\displaystyle= μ(a2+1)+(y22xy2xyx2)op\displaystyle\mu(a^{2}+1)+\left\lVert\begin{pmatrix}y^{2}&2xy\\ 2xy&x^{2}\end{pmatrix}\right\rVert_{\text{op}}

The spectral norm of the second term is bounded on the region A by

max(x,y)A(x2+y2)+(x2+y2)2+12x2y222a2+4a4+12a42=3a2\displaystyle\max_{(x,y)\in A}\frac{(x^{2}+y^{2})+\sqrt{(x^{2}+y^{2})^{2}+12x^{2}y^{2}}}{2}\leq\frac{2a^{2}+\sqrt{4a^{4}+12a^{4}}}{2}=3a^{2}

The inequality uses x^{2}\leq a^{2} and y^{2}\leq a^{2}. Therefore,

2gμ(W)op3a2+μ(a2+1)\displaystyle\left\lVert\nabla^{2}g_{\mu}(W)\right\rVert_{\text{op}}\leq 3a^{2}+\mu(a^{2}+1)

Also, according to Boyd et al. (2004); Nesterov et al. (2018), for any f such that \nabla^{2}f exists, f is L-smooth if and only if \lVert\nabla^{2}f\rVert_{\text{op}}\leq L. With this, we conclude the proof. ∎
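As a quick numerical check outside the proof, the operator-norm bound 3a^{2}+\mu(a^{2}+1) can be verified on a grid; the only property of the region A used above is x^{2}\leq a^{2}, y^{2}\leq a^{2}, so the sketch below simply grids [0,a]\times[0,a], with arbitrary illustrative values of a and \mu.

import numpy as np

a, mu = 2.0, 0.05
bound = 3 * a**2 + mu * (a**2 + 1)

worst = 0.0
for x in np.linspace(0.0, a, 101):
    for y in np.linspace(0.0, a, 101):
        hess = np.array([[mu + y**2, 2 * x * y],
                         [2 * x * y, mu * (a**2 + 1) + x**2]])
        worst = max(worst, np.abs(np.linalg.eigvalsh(hess)).max())   # spectral norm

assert worst <= bound + 1e-9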

E.9 Proof of Lemma 7

Proof.

First we prove that (x(t),y(t))\in C_{\mu 3} for all t\geq 0. Suppose instead that the trajectory leaves C_{\mu 3}; then there exists a finite t such that

(x(t),y(t))Cμ3(x(t),y(t))\in\partial C_{\mu 3}

where Cμ3\partial C_{\mu 3} is the boundary of Cμ3C_{\mu 3}, defined as

Cμ3={(x,y)|y=yμ1(x) or y=yμ2(x),xμ<xa}\partial C_{\mu 3}=\{(x,y)|y=y_{\mu 1}(x)\text{ or }y=y_{\mu 2}(x),x_{\mu}^{**}<x\leq a\}

Without loss of generality, assume (x(0),y(0))\in\partial C_{\mu 3} and (x(0),y(0))\neq(x_{\mu}^{*},y_{\mu}^{*}). There are four cases:

\nabla g_{\mu}(x(0),y(0))=\left\{\begin{array}[]{ll}\begin{pmatrix}=0\\ >0\end{pmatrix}&\text{if }y(0)=y_{\mu 1}(x(0)),x_{\mu}^{**}<x(0)<x_{\mu}^{*}\\ \begin{pmatrix}=0\\ <0\end{pmatrix}&\text{if }y(0)=y_{\mu 1}(x(0)),x_{\mu}^{*}<x(0)\leq a\\ \begin{pmatrix}<0\\ =0\end{pmatrix}&\text{if }y(0)=y_{\mu 2}(x(0)),x_{\mu}^{**}<x(0)<x_{\mu}^{*}\\ \begin{pmatrix}>0\\ =0\end{pmatrix}&\text{if }y(0)=y_{\mu 2}(x(0)),x_{\mu}^{*}<x(0)\leq a\end{array}\right.

This indicates that -\nabla g_{\mu}(x(0),y(0)) points into the interior of C_{\mu 3}, so (x(t),y(t)) cannot escape C_{\mu 3}. We may therefore restrict attention to C_{\mu 3}, since (x(t),y(t))\in C_{\mu 3} for all t\geq 0. For Algorithm 1,

\frac{df(\bm{z}_{t})}{dt}=\nabla f(\bm{z}_{t})^{\top}\dot{\bm{z}_{t}}=-\lVert\nabla f(\bm{z}_{t})\rVert_{2}^{2}

In our setting, (x,y)Cμ3\forall(x,y)\in C_{\mu 3}

{gμ(x,y)0(x,y)(xμ,yμ)gμ(x,y)=0(x,y)=(xμ,yμ)\left\{\begin{array}[]{ll}\nabla g_{\mu}(x,y)\neq 0&(x,y)\neq(x_{\mu}^{*},y_{\mu}^{*})\\ \nabla g_{\mu}(x,y)=0&(x,y)=(x_{\mu}^{*},y_{\mu}^{*})\end{array}\right.

so

dgμ(x(t),y(t))dt={gμ22<0(x,y)(xμ,yμ)gμ22=0(x,y)=(xμ,yμ)\frac{dg_{\mu}(x(t),y(t))}{dt}=\left\{\begin{array}[]{rl}-\lVert\nabla g_{\mu}\rVert^{2}_{2}<0&(x,y)\neq(x_{\mu}^{*},y_{\mu}^{*})\\ -\lVert\nabla g_{\mu}\rVert_{2}^{2}=0&(x,y)=(x_{\mu}^{*},y_{\mu}^{*})\end{array}\right.

Moreover, (x_{\mu}^{*},y_{\mu}^{*}) is the unique stationary point of g_{\mu}(W) in C_{\mu 3}. By Lemma 8,

gμ(x,y)>gμ(xμ,yμ)(x,y)(xμ,yμ)g_{\mu}(x,y)>g_{\mu}(x_{\mu}^{*},y_{\mu}^{*})\quad(x,y)\neq(x_{\mu}^{*},y_{\mu}^{*})

Applying the Lyapunov asymptotic stability theorem (Khalil, 2002) to the gradient flow of g_{\mu}(x,y) on C_{\mu 3}, we conclude that \lim_{t\rightarrow\infty}(x(t),y(t))=(x_{\mu}^{*},y_{\mu}^{*}). ∎

E.10 Proof of Lemma 8

Proof.

Fix any (x,y)\in C_{\mu 3} as in (41) with (x,y)\neq(x_{\mu}^{*},y_{\mu}^{*}), and run Algorithm 7 from it. Without loss of generality, assume x\in(x_{\mu}^{**},x_{\mu}^{*}); the same analysis applies to x\in(x_{\mu}^{*},a). It is clear that \tilde{x}_{j}<\tilde{x}_{j+1} and \tilde{y}_{j+1}<\tilde{y}_{j}. Moreover, \lim_{j\rightarrow\infty}(\tilde{x}_{j},\tilde{y}_{j})=(x_{\mu}^{*},y_{\mu}^{*}): the iterates are monotone and bounded, hence convergent, and their limit must satisfy x=x_{\mu 1}(y) and y=y_{\mu 2}(x), whose only solution in C_{\mu 3} is (x_{\mu}^{*},y_{\mu}^{*}).

Moreover, note that for any j=1,2,\ldots

gμ(x~j1,y~j1)>gμ(x~j1,y~j)>gμ(x~j,y~j)g_{\mu}(\tilde{x}_{j-1},\tilde{y}_{j-1})>g_{\mu}(\tilde{x}_{j-1},\tilde{y}_{j})>g_{\mu}(\tilde{x}_{j},\tilde{y}_{j})

Indeed, by the mean value theorem,

gμ(x~j1,y~j1)gμ(x~j1,y~j)=gμ(x~j1,y~)y(y~j1y~j)where y~(y~j,y~j1)g_{\mu}(\tilde{x}_{j-1},\tilde{y}_{j-1})-g_{\mu}(\tilde{x}_{j-1},\tilde{y}_{j})=\frac{\partial g_{\mu}(\tilde{x}_{j-1},\tilde{y})}{\partial y}(\tilde{y}_{j-1}-\tilde{y}_{j})\quad\text{where }\tilde{y}\in(\tilde{y}_{j},\tilde{y}_{j-1})

Note that

gμ(x~j1,y~)y>0gμ(x~j1,y~j1)>gμ(x~j1,y~j)\frac{\partial g_{\mu}(\tilde{x}_{j-1},\tilde{y})}{\partial y}>0\Rightarrow g_{\mu}(\tilde{x}_{j-1},\tilde{y}_{j-1})>g_{\mu}(\tilde{x}_{j-1},\tilde{y}_{j})

By the same reason,

gμ(x~j1,y~j)>gμ(x~j,y~j)g_{\mu}(\tilde{x}_{j-1},\tilde{y}_{j})>g_{\mu}(\tilde{x}_{j},\tilde{y}_{j})

By Lemma 1, (x_{\mu}^{*},y_{\mu}^{*}) is a local minimum, so there exists r_{\mu}>0 such that g_{\mu}(x,y)>g_{\mu}(x_{\mu}^{*},y_{\mu}^{*}) for every (x,y)\neq(x_{\mu}^{*},y_{\mu}^{*}) with \|(x,y)-(x_{\mu}^{*},y_{\mu}^{*})\|_{2}\leq r_{\mu}. Since \lim_{j\rightarrow\infty}(\tilde{x}_{j},\tilde{y}_{j})=(x_{\mu}^{*},y_{\mu}^{*}), there exists J>0 such that \|(\tilde{x}_{j},\tilde{y}_{j})-(x_{\mu}^{*},y_{\mu}^{*})\|_{2}\leq r_{\mu} for all j>J. Combining the above,

gμ(x,y)>gμ(x~j,y~j)>gμ(xμ,yμ)g_{\mu}(x,y)>g_{\mu}(\tilde{x}_{j},\tilde{y}_{j})>g_{\mu}(x_{\mu}^{*},y_{\mu}^{*})
Input: (x,y)Cμ3,xμ1(y),yμ2(x)(x,y)\in C_{\mu 3},x_{\mu 1}(y),y_{\mu 2}(x) as (38a),(37b)
Output: {(x~j,y~j)}j=0\{(\tilde{x}_{j},\tilde{y}_{j})\}_{j=0}^{\infty}
1 (x~0,y~0)(x,y)(\tilde{x}_{0},\tilde{y}_{0})\leftarrow(x,y)
2 for j=1,2,j=1,2,\ldots do
3       y~jyμ2(x~j1)\tilde{y}_{j}\leftarrow y_{\mu 2}(\tilde{x}_{j-1})
4       x~jxμ1(y~j1)\tilde{x}_{j}\leftarrow x_{\mu 1}(\tilde{y}_{j-1})
5 end for
Algorithm 7 Path goes to (xμ,yμ)(x_{\mu}^{*},y_{\mu}^{*})
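For concreteness, the following is a direct Python transcription of Algorithm 7 (a sketch, not part of the formal argument), using x_{\mu 1} from (38a) and y_{\mu 2} from (37b); the values of a, \mu and the starting point are arbitrary illustrations. As in the proof of Lemma 8, the iterates increase in x, decrease in y, and approach (x_{\mu}^{*},y_{\mu}^{*}).

a, mu = 2.0, 1e-3

def y_mu2(x):                      # (37b)
    return mu * a / (mu * (a**2 + 1) + x**2)

def x_mu1(y):                      # (38a)
    return mu * a / (y**2 + mu)

def algorithm7(x0, y0, n_iter=200):
    xs, ys = [x0], [y0]
    for _ in range(n_iter):
        ys.append(y_mu2(xs[-1]))   # line 3: y_j <- y_mu2(x_{j-1})
        xs.append(x_mu1(ys[-2]))   # line 4: x_j <- x_mu1(y_{j-1})
    return xs, ys

xs, ys = algorithm7(1.0, 0.02)     # a point in C_mu3 for this (a, mu)
assert all(x1 <= x2 for x1, x2 in zip(xs, xs[1:]))   # x_j increasing
assert all(y1 >= y2 for y1, y2 in zip(ys, ys[1:]))   # y_j decreasing
# The limit is the fixed point of the two maps, i.e. (x_mu^*, y_mu^*):
assert abs(xs[-1] - x_mu1(ys[-1])) < 1e-9 and abs(ys[-1] - y_mu2(xs[-1])) < 1e-9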

E.11 Proof of Lemma 4

Proof.

From the proof of Theorem 1, any schedule for \mu_{k} satisfying the following will do the job:

(2/a)2/3μk14/3μk<μk1\displaystyle(2/a)^{2/3}\mu_{k-1}^{4/3}\leq\mu_{k}<\mu_{k-1}

Note that in Algorithm 4 we have \widehat{a}=\sqrt{4(\mu_{0}+\varepsilon)}<a; it is then clear that

(2/a)2/3μk14/3<(2/a^)2/3μk14/3(2/a)^{2/3}\mu_{k-1}^{4/3}<(2/\widehat{a})^{2/3}\mu_{k-1}^{4/3}

The same analysis as in the proof of Theorem 1 applies here. ∎
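To make the scheduling concrete, here is a small sketch (under the notation above) of one admissible choice: take the fastest decrease allowed by the displayed condition, with a replaced by the estimate \widehat{a}=\sqrt{4(\mu_{0}+\varepsilon)}<a from Algorithm 4. The numerical values of a, \mu_{0} and \varepsilon are arbitrary illustrations.

a, mu0, eps = 2.0, 0.25, 0.01
a_hat = (4 * (mu0 + eps)) ** 0.5          # a_hat = sqrt(4(mu0 + eps)) < a
assert a_hat < a

def schedule(mu0, n_steps):
    """mu_k = (2/a_hat)^{2/3} mu_{k-1}^{4/3}: the aggressive end of the admissible range."""
    mus = [mu0]
    for _ in range(n_steps):
        mus.append((2 / a_hat) ** (2 / 3) * mus[-1] ** (4 / 3))
    return mus

mus = schedule(mu0, 10)
# Each step is admissible for the true a as well, since (2/a)^{2/3} < (2/a_hat)^{2/3}.
assert all((2 / a) ** (2 / 3) * prev ** (4 / 3) <= nxt < prev
           for prev, nxt in zip(mus, mus[1:]))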

E.12 Proof of Lemma 6

Proof.

By Theorem 3, Lemma 5, and the fact that A_{\mu,\epsilon}^{1} is the region of \mu-stationary points, the same argument as in the proof of Lemma 7 shows that gradient descent never enters A_{\mu,\epsilon}^{2}. ∎

E.13 Proof of Lemma 9

Proof.

By Theorem 9(iv)

maxμτβxμ,βminμ>0xμ,β\max_{\mu\leq\tau_{\beta}}x_{\mu,\beta}^{**}\leq\min_{\mu>0}x_{\mu,\beta}^{*}

We also know from the proof of Corollary 3 that x_{\mu,\epsilon}^{**}<x_{\mu,\beta}^{**} and x_{\mu,\beta}^{*}<x_{\mu,\epsilon}^{*}. Consequently,

maxμτβxμ,ϵminμ>0xμ,ϵ\max_{\mu\leq\tau_{\beta}}x_{\mu,\epsilon}^{**}\leq\min_{\mu>0}x_{\mu,\epsilon}^{*}

Because \tau_{\beta}>\tau, we have

maxμτxμ,ϵmaxμτβxμ,ϵminμ>0xμ,ϵ\max_{\mu\leq\tau}x_{\mu,\epsilon}^{**}\leq\max_{\mu\leq\tau_{\beta}}x_{\mu,\epsilon}^{**}\leq\min_{\mu>0}x_{\mu,\epsilon}^{*}

E.14 Proof of Corollary 1

Proof.

Note that

\frac{a^{2}}{4(a^{2}+1)^{3}}\leq\frac{1}{27}\quad\text{for all }a>0

When a>\sqrt{\frac{5}{27}}, we have \frac{a^{2}}{4}>\mu_{0}=\frac{1}{27}\geq\frac{a^{2}}{4(a^{2}+1)^{3}}, so the condition in Lemma 4 is satisfied and we obtain the same result. ∎
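The elementary bound used above can be confirmed symbolically; the short sketch below (not part of the proof) checks that a^{2}/(4(a^{2}+1)^{3}) attains its maximum 1/27 at a^{2}=1/2 and vanishes at both ends of (0,\infty).

import sympy as sp

a = sp.symbols('a', positive=True)
f = a**2 / (4 * (a**2 + 1)**3)

crit = sp.solve(sp.Eq(sp.diff(f, a), 0), a)            # unique critical point on a > 0
assert len(crit) == 1 and sp.simplify(crit[0]**2 - sp.Rational(1, 2)) == 0
assert sp.simplify(f.subs(a, crit[0]) - sp.Rational(1, 27)) == 0
assert sp.limit(f, a, 0) == 0 and sp.limit(f, a, sp.oo) == 0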

E.15 Proof of Corollary 2

Proof.

Use Theorem 5(vi) and Theorem 6(vi). ∎

E.16 Proof of Corollary 3

Proof.

It is straightforward to verify that

rβ(y;μ)>rϵ(y;μ)>r(y;μ)r_{\beta}(y;\mu)>r_{\epsilon}(y;\mu)>r(y;\mu)

and

tβ(x;μ)<tϵ(x;μ)<t(x;μ)t_{\beta}(x;\mu)<t_{\epsilon}(x;\mu)<t(x;\mu)

and, when \mu<\tau, there are three solutions to r(y;\mu)=0 by Theorem 5. Also, we know from Theorems 7 and 8 that

limy0+rϵ(y;μ)=limy0+rβ(y;μ)=\lim_{y\rightarrow 0^{+}}r_{\epsilon}(y;\mu)=\infty\quad\lim_{y\rightarrow 0^{+}}r_{\beta}(y;\mu)=\infty

Note that when \left(\frac{1+\beta}{1-\beta}\right)^{2}\leq a^{2}+1, viewing r_{\beta}(\sqrt{\mu};\mu) as a quadratic in 1/\sqrt{\mu} shows that

rβ(μ;μ)=a(1+β)μ(a2+1)a2(1β)24μ0μ>0\displaystyle r_{\beta}(\sqrt{\mu};\mu)=\frac{a(1+\beta)}{\sqrt{\mu}}-(a^{2}+1)-\frac{a^{2}(1-\beta)^{2}}{4\mu}\leq 0\quad\forall\mu>0
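This can be verified symbolically: writing s=1/\sqrt{\mu}, the display above is a concave quadratic in s whose maximum over \mu>0 equals \left(\frac{1+\beta}{1-\beta}\right)^{2}-(a^{2}+1)\leq 0 under the stated condition. A short check (not part of the proof):

import sympy as sp

a, b, s = sp.symbols('a beta s', positive=True)
# r_beta(sqrt(mu); mu) with s = 1/sqrt(mu)
r = a * (1 + b) * s - (a**2 + 1) - a**2 * (1 - b)**2 / 4 * s**2

s_star = sp.solve(sp.Eq(sp.diff(r, s), 0), s)[0]       # maximizer of the concave quadratic
max_val = sp.simplify(r.subs(s, s_star))
assert sp.simplify(max_val - (((1 + b) / (1 - b))**2 - (a**2 + 1))) == 0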

Therefore,

0rβ(μ;μ)>rϵ(μ;μ)>r(μ;μ)0\geq r_{\beta}(\sqrt{\mu};\mu)>r_{\epsilon}(\sqrt{\mu};\mu)>r(\sqrt{\mu};\mu)

Also, for y_{\mathrm{ub}} defined in Theorem 5(iii), we know r(y_{\mathrm{ub}};\mu)>0 from Theorem 5(iv). Therefore,

rβ(yub;μ)>rϵ(yub;μ)>r(yub;μ)>0r_{\beta}(y_{\mathrm{ub}};\mu)>r_{\epsilon}(y_{\mathrm{ub}};\mu)>r(y_{\mathrm{ub}};\mu)>0

Besides, \sqrt{\mu}<y_{\mathrm{ub}}. By the monotonicity of r_{\beta}(y;\mu) and r_{\epsilon}(y;\mu) from Theorem 7(ii) and Theorem 8(ii), there are at least two solutions to r_{\beta}(y;\mu)=0 and to r_{\epsilon}(y;\mu)=0. From the geometry of r_{\beta}(y;\mu),r_{\epsilon}(y;\mu),r(y;\mu) and t_{\beta}(x;\mu),t_{\epsilon}(x;\mu),t(x;\mu), it follows that x_{\mu,\epsilon}^{*}\leq x_{\mu}^{*}, y_{\mu,\epsilon}^{*}\geq y_{\mu}^{*}, x_{\mu,\epsilon}^{**}\geq x_{\mu}^{**}, and y_{\mu,\epsilon}^{**}\leq y_{\mu}^{**}.

Finally, for every point (x,y)\in A_{\mu,\epsilon}^{1}, there exists a pair \epsilon_{1},\epsilon_{2} with |\epsilon_{1}|\leq\epsilon and |\epsilon_{2}|\leq\epsilon such that (x,y) is a solution to

x=μa+ϵ1μ+y2y=μa+ϵ2x2+μ(a2+1)\displaystyle x=\frac{\mu a+\epsilon_{1}}{\mu+y^{2}}\qquad y=\frac{\mu a+\epsilon_{2}}{x^{2}+\mu(a^{2}+1)}

We can repeat the same analysis as above to show that x_{\mu,\epsilon}^{*}\leq x and y_{\mu,\epsilon}^{*}\geq y. Applying the same logic to every (x,y)\in A_{\mu,\epsilon}^{2}, we find x_{\mu,\epsilon}^{**}\geq x and y_{\mu,\epsilon}^{**}\leq y. Thus, (x_{\mu}^{*},y_{\mu}^{*}) is the extreme point of A_{\mu,\epsilon}^{1} and (x_{\mu}^{**},y_{\mu}^{**}) is the extreme point of A_{\mu,\epsilon}^{2}, which gives the result. ∎

Appendix F Experiment Details

In this section, we present experiments to validate the global convergence of Algorithm 6. Our goal is twofold. First, we demonstrate that, irrespective of the starting point, Algorithm 6 with gradient descent consistently returns the global minimum. Second, we contrast the updating scheme for \mu_{k},\epsilon_{k} prescribed in Algorithm 6 with an arbitrary updating scheme for \mu_{k},\epsilon_{k}. This comparison illustrates how an inappropriate setting of the parameters in gradient descent can lead to incorrect solutions.
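To give a flavor of these experiments, the following is a minimal sketch of the setup behind Figures 9–10, not the actual implementation of Algorithm 6: gradient descent on g_{\mu} from a random initialization, with \mu shrunk according to the schedule discussed in Lemma 4. The gradient formula, the initial \mu_{0}=\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}} from the figure captions, the fixed step size and the inner iteration budget are simplifying assumptions, and the \epsilon_{k}-adaptation of Algorithm 6 is omitted.

import numpy as np

rng = np.random.default_rng(0)
a, delta, beta, eps = 2.0, 0.1, 0.01, 0.05
mu = a**2 / 4 * (1 - delta)**3 * (1 - beta)**4 / (1 + beta)**2
a_hat = np.sqrt(4 * (mu + eps))                            # stand-in estimate of a

def grad(z, mu):
    # Gradient consistent with the stationarity conditions (11a)-(11b).
    x, y = z
    return np.array([(mu + y**2) * x - mu * a,
                     (mu * (a**2 + 1) + x**2) * y - mu * a])

z = rng.uniform(0.0, a, size=2)                            # random initialization
for _ in range(30):                                        # outer loop: shrink mu
    step = 1.0 / (3 * a**2 + mu * (a**2 + 1))              # 1/L, cf. Lemma 5
    for _ in range(10_000):                                # inner gradient descent
        z = z - step * grad(z, mu)
    mu = (2 / a_hat) ** (2 / 3) * mu ** (4 / 3)            # schedule as in Lemma 4

# The iterates approach (a, 0), the limit of (x_mu^*, y_mu^*) as mu -> 0
# (cf. Theorem 6(vi)); this is the global minimizer recovered in Figures 9-10.
assert np.allclose(z, [a, 0.0], atol=2e-2)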

F.1 Random Initialization Converges to Global Optimum

(a) Random Initialization 1
(b) Random Initialization 2
(c) Random Initialization 3
(d) Random Initialization 4
Figure 9: Trajectory of the gradient descent path with the different initializations for a=2. We observe that regardless of the initialization, Algorithm 6 always converges to the global minimum. Initial \mu_{0}=\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}
(a) Random Initialization 1
(b) Random Initialization 2
(c) Random Initialization 3
(d) Random Initialization 4
Figure 10: Trajectory of the gradient descent path with the different initializations for a=0.5. We observe that regardless of the initialization, Algorithm 6 always converges to the global minimum. Initial \mu_{0}=\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}

F.2 Wrong Specification of \delta Leads to a Spurious Local Optimum

(a) \delta=0.4
(b) \delta=0.1
Figure 11: Trajectory of the gradient descent path for two different values of \delta. Left: the requirement \left(\frac{1+\beta}{1-\beta}\right)^{2}\leq(1-\delta)(a^{2}+1) in Theorem 4 is violated, leading to a spurious local minimum. Right: the requirement \left(\frac{1+\beta}{1-\beta}\right)^{2}\leq(1-\delta)(a^{2}+1) in Theorem 4 is satisfied, leading to the global minimum. Initial \mu_{0}=\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}

F.3 Wrong Specification of \beta Leads to an Incorrect Solution

(a) \beta=0.6
(b) \beta=0.01
Figure 12: Trajectory of the gradient descent path for two different values of \beta. Left: \beta violates the requirement \left(\frac{1+\beta}{1-\beta}\right)^{2}\leq(1-\delta)(a^{2}+1) in Theorem 4, leading to an incorrect solution. Right: \beta satisfies the requirement \left(\frac{1+\beta}{1-\beta}\right)^{2}\leq(1-\delta)(a^{2}+1) in Theorem 4, leading to the global minimum. Initial \mu_{0}=\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}

F.4 Faster Decrease of \mu_{k} Leads to an Incorrect Solution

(a) Bad scheduling
(b) Good scheduling
Figure 13: Trajectory of the gradient descent path for two different update rules for \mu_{k} with the same initialization. Left: “Bad scheduling” uses a faster-decreasing scheme for \mu_{k}, leading to an incorrect solution that is not even a local optimum. Right: “Good scheduling” follows the updating rule for \mu_{k} in Algorithm 6, leading to the global minimum. Initial \mu_{0}=\frac{a^{2}}{4}\frac{(1-\delta)^{3}(1-\beta)^{4}}{(1+\beta)^{2}}