This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Trust-Region Method for Nonsmooth Nonconvex Optimization

Ziang Chen
Department of Mathematics, Duke University, USA
and
Andre Milzarek
School of Data Science (SDS), The Chinese University of Hong Kong, Shenzhen,
Shenzhen Research Institute for Big Data (SRIBD), CHINA
and
Zaiwen wen
Beijing International Center for Mathematical Research, Peking University, CHINA
Email: [email protected]Email: [email protected]. A. Milzarek is partly supported by the Fundamental Research Fund – Shenzhen Research Institute for Big Data (SRIBD) Startup Fund JCYJ-AM20190601.Email: [email protected]. Z. Wen is partly supported by the NSFC grant 11831002, and the Beijing Academy of Artificial Intelligence.
Abstract

We propose a trust-region type method for a class of nonsmooth nonconvex optimization problems where the objective function is a summation of a (probably nonconvex) smooth function and a (probably nonsmooth) convex function. The model function of our trust-region subproblem is always quadratic and the linear term of the model is generated using abstract descent directions. Therefore, the trust-region subproblems can be easily constructed as well as efficiently solved by cheap and standard methods. When the accuracy of the model function at the solution of the subproblem is not sufficient, we add a safeguard on the stepsizes for improving the accuracy. For a class of functions that can be “truncated”, an additional truncation step is defined and a stepsize modification strategy is designed. The overall scheme converges globally and we establish fast local convergence under suitable assumptions. In particular, using a connection with a smooth Riemannian trust-region method, we prove local quadratic convergence for partly smooth functions under a strict complementary condition. Preliminary numerical results on a family of 1\ell_{1}-optimization problems are reported and demonstrate the efficiency of our approach.

Keywords: trust-region method, nonsmooth composite programs, quadratic model function, global and local convergence.

1 Introduction

We consider the unconstrained nonsmooth nonconvex optimization problems of the composite form:

minxnψ(x):=f(x)+φ(x),\min_{x\in\mathbb{R}^{n}}\psi(x):=f(x)+\varphi(x), (1.1)

where f:nf:\mathbb{R}^{n}\rightarrow\mathbb{R} is a continuously differentiable but probably nonconvex function and φ:n\varphi:\mathbb{R}^{n}\rightarrow\mathbb{R} is real-valued and convex. The composite program (1.1) is a special form of the general nonsmooth nonconvex optimization problems

minxnψ(x),\min_{x\in\mathbb{R}^{n}}\psi(x), (1.2)

where the objective function ψ:n\psi:\mathbb{R}^{n}\to\mathbb{R} is locally Lipschitz continuous, and has numerous applications, such as 1\ell_{1}-regularized problems [71, 68, 27, 41, 12], group sparse problems [22, 77, 54, 70], penalty approaches [5], dictionary learning [50, 31], and matrix completion [19, 18, 37].

1.1 Related Work

Different types of nonsmooth trust-region methods have already been proposed and analyzed for the general optimization problem (1.2) throughout the last two decades. Several of these nonsmooth trust-region methods utilize abstract model functions on a theoretical level which means that the model function is typically not specified. In [26], a nonsmooth trust-region method is proposed for (1.2) under the assumption that ψ\psi is regular. A nonsmooth trust-region algorithm for general problems is investigated in [64]. In this work, an abstract first-order model is considered that is not necessarily based on subgradient information or directional derivatives. Extending the results in [64], Grohs and Hosseini propose a Riemannian trust-region algorithm, see [34]. Here, the objective function is defined on a complete Riemannian manifold. All mentioned methods derive global convergence under an assumption similar to the concept of a “strict model” stated in [59]. Using this concept, a nonsmooth bundle trust-region algorithm with global convergence is constructed in [4].

In [20], a hybrid approach is presented using simpler and more tractable quadratic model functions. The method switches to a complicated second model if the quadratic model is not accurate enough and if it is strictly necessary. In [3], a quadratic model function is analyzed where the first-order term is derived from a suitable approximation of the steepest descent direction and the second-order term is updated utilizing a BFGS scheme. The authors apply an algorithmic approach proposed in [49] to compute the approximation of the ϵ\epsilon-subdifferential and steepest descent direction. Another class of methods employs smoothing techniques. In [32], the authors first present a smooth trust-region method without using derivatives, and then, in the nonsmooth case, use this methodology after smoothing the objective function. Furthermore, trust-region algorithms for nonsmooth problems can be developed based on smooth merit functions for the problem. In [66], a nonsmooth convex optimization is investigated and the Moreau envelope is considered as a smooth merit function. A smooth trust-region method is performed on the smooth merit function, where the second-order term of the model function is again updated by the BFGS formula.

Bundle methods are an important and related class of methods for nonsmooth problems [44, 52, 51, 39, 45, 42, 40]. The ubiquitous cutting-plane model in bundle methods is polyhedral, i.e., the supremum of a finite affine family. This model builds approximations of convex functions based on the subgradient inequality. In [67], an efficient bundle technique for convex optimization has been proposed; in [24], a convex bundle method is derived to deal with additional noise, i.e., the case when the objective function and the subgradient can not be evaluated exactly. Different modifications of the bundle ideas for nonconvex problems have been established in [59, 67]. In [35] and [60], the authors consider bundle methods for nonsmooth nonconvex optimization when the function values and the subgradients of ψ\psi can only be evaluated inexactly.

Local convergence properties and rates for nonsmooth problems are typically studied utilizing additional and more subtle structures. In this regard, some fundamental and helpful concepts are the idea of an “active manifold” and the family of “partly smooth” functions introduced by [46]. In particular, the problem (1.1) has been investigated when the nonsmooth term φ\varphi is partly smooth relative to a smooth active manifold. The so-called finite activity identification is established for forward-backward splitting methods by [48] and, more recently, for SAGA/Prox-SVRG by [63]. After the identification, those algorithms enter a locally linear convergence regime. In [36], the authors use partial smoothness and prox-regularity to identify the active constraints after finitely many iterations, which is an extension of other works on finite constraint identification, see [17, 15, 74]. After identifying the active manifold, the nonsmooth problem may become a smooth optimization on a Riemannian manifold. Some algorithms and analysis for Riemannian trust-region methods were studied in the literature; see e.g., [2, 1, 38, 8, 6]. There are numerous applications of Riemannian trust-region methods, such as eigenproblems [6, 7], low-rank matrix completion [13], and tensor problems [14].

We note that for composite programs, nonsmooth trust-region methods have also been studied in the literature, such as [28, 29, 79, 76, 33] for ming(x)+h(f(x))\min\ g(x)+h(f(x)) where ff and gg are smooth and hh is convex, [78, 9] for minh(f(x))\min\ h(f(x)) where ff is smooth and hh is convex, [25] for minh(f(x))\min\ h(f(x)) where ff is locally Lipschitz and hh is smooth and convex and [16] for ming(x)+h(f(x))\min\ g(x)+h(f(x)) where ff is smooth and gg and hh are convex. For the problem (1.1), there are also other efficient methods, such as gradient-type methods [30, 57], semismooth Newton methods, [55, 47, 75], proximal Newton methods [61, 43, 72], or forward-backward envelope-based (quasi-)Newton methods [62, 69].

More developments about trust-region methods may be found in review papers such as [80].

1.2 Our Contribution

In this work, we propose and investigate a trust-region method for nonsmooth composite programs. The approach utilizes quadratic model functions to approximate the underlying nonsmooth objective function. This methodology leads to classical and tractable trust-region subproblems that can be solved efficiently by standard optimization methods if the second-order information is symmetric. We also discuss an efficient subproblem solver in the case that the second-order information does not stem from a symmetric matrix. The linear part of our proposed quadratic model can be based on the steepest descent direction or other directions such as proximal gradient-type descent directions. Our algorithm contains the following steps: computation of a model function and (approximate) solution of the associated trust-region subproblem; acceptance test of the calculated step; determination of a suitable stepsize by a cheap method followed by some stepsize safeguards and a second acceptance test (if the first test is not successful); update of the trust-region radius and a modification step via a novel truncation mechanism.

In order to control the approximation error between the quadratic model function and the nonsmooth objective function, we define a stepsize safeguard strategy that tries to avoid points along a specific direction at which the directional derivative is not continuous. Specifically, this strategy tries to guarantee that the objective function is directionally differentiable along a specific direction. Since a direct implementation of such a strategy can yield arbitrarily small stepsizes, we consider functions which can be truncated and propose an additional truncation step that allows to enlarge the stepsize. This modification is an essential and new part in our global convergence theory. We verify that the family of functions that can be truncated is rich and contains many important examples, such as the 1\ell_{1}-norm, \ell_{\infty}-norm or group sparse-type penalty terms. Moreover, we provide a detailed global convergence analysis of the proposed trust-region framework. In particular, we show that every accumulation point of a sequence generated by our algorithm is a stationary point. Global convergence of nonsmooth trust-region methods typically requires a certain uniform accuracy assumption on the model which coincides with the concept of the already mentioned strict model proposed by [59]. Our assumptions are similar to these standard requirements and can be verified for a large family containing polyhedral problems and group lasso. Furthermore, we also show how a strict model – aside from utilizing the original objective function – can be constructed.

We analyze the local properties of the nonsmooth trust-region method for (1.1) when φ\varphi is a partly smooth function. In particular, it is possible to establish quadratic convergence of our approach in this case. We assume that the underlying manifold is an affine subspace and that a strict complementary condition holds. After the finite activity identification, we transfer our problem to a smooth problem in the affine subspace by proving that an appropriate choice of the first-order and the second-order model coincides with the Riemannian framework. Results from Riemannian trust-region theory can then be applied to derive local quadratic convergence. Additionally, if the nonsmooth term is polyhedral, it can be shown that the Riemannian Hessian can be computed without knowing the underlying manifold.

1.3 Organization

The rest of this paper is organized as follows. In Section 2, descent directions and several properties of ψ\psi and φ\varphi used in our algorithm are discussed. In Section 3, we present the nonsmooth trust-region framework. In Section 4, the global convergence of our method is established. In Section 5, we show fast local convergence by studying the nonsmooth composite program for partly smooth φ\varphi. Some preliminary numerical experiments are presented in Section 6.

2 Descent Directions and Truncation Operators

In classical trust-region methods, global convergence is established under fairly mild assumptions on the second-order term of the model while the first-order model typically needs to capture the whole gradient information of the objective function, see, e.g., [58] and the references therein. This underlines the importance of the first-order information in the trust-region method. In this section, we analyze properties of ψ\psi and φ\varphi as preparation for the construction of suitable linear first-order models.

2.1 Preliminaries

In this work, the expression \|\cdot\| denotes the 2\ell_{2}-norm and the Frobenius norm for vectors and matrices, respectively. For xnx\in\mathbb{R}^{n} and r>0r>0, Br(x):={yn:yx<r}B_{r}(x):=\{y\in\mathbb{R}^{n}:\|y-x\|<r\} denotes the open ball with radius rr around xx. Let Λ\Lambda be a given symmetric positive definite matrix. The proximal operator is defined via

proxφΛ(z)=argminynφ(y)+12yzΛ2,\mathrm{prox}_{\varphi}^{\Lambda}(z)=\mathop{\operatorname*{argmin}}_{y\in\mathbb{R}^{n}}\varphi(y)+\frac{1}{2}\|y-z\|_{\Lambda}^{2},

where xΛ2:=xTΛx\|x\|_{\Lambda}^{2}:=x^{T}\Lambda x. We also slightly abuse the notation and write proxφλ=proxφΛ\mathrm{prox}_{\varphi}^{\lambda}=\mathrm{prox}_{\varphi}^{\Lambda} for Λ=λI\Lambda=\lambda I.

The directional derivative of a function h:nh:\mathbb{R}^{n}\rightarrow\mathbb{R} at xx along dd is denoted by

h(x;d):=limt0+h(x+td)h(x)t,h^{\prime}(x;d):=\lim_{t\rightarrow 0^{+}}\frac{h(x+td)-h(x)}{t},

if it exists. In the composite case ψ=f+φ\psi=f+\varphi with smooth ff and real-valued convex φ\varphi, the directional derivative ψ(x;d)\psi^{\prime}(x;d) is well-defined for all x,dnx,d\in\mathbb{R}^{n} and the subdifferential of ψ\psi is defined via

ψ(x)={f(x)}+φ(x)={ana,dψ(x;d),dn},\partial\psi(x)=\{\nabla f(x)\}+\partial\varphi(x)=\{a\in\mathbb{R}^{n}\mid\langle a,d\rangle\leq\psi^{\prime}(x;d),\ \forall\ d\in\mathbb{R}^{n}\},

where φ(x)\partial\varphi(x) is the usual subdifferential of a convex function. The steepest descent direction of ψ\psi is defined as

ds(x):={argmind1ψ(x;d)if 0ψ(x),0if 0ψ(x).d_{s}(x):=\begin{cases}\mathop{\operatorname*{argmin}}_{\|d\|\leq 1}\psi^{\prime}(x;d)&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x).\end{cases}

In this paper, we will repeatedly work with the following normalization condition for a direction d(x)d(x):

d(x)={1if 0ψ(x),0if 0ψ(x).\|d(x)\|=\begin{cases}1&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x).\end{cases} (2.1)

We can see that ds(x)d_{s}(x) satisfies the property (2.1). We say that a point xx^{*} is a stationary point of problem (1.2) if ψ(x;ds(x))=0\psi^{\prime}(x^{*};d_{s}(x^{*}))=0, i.e., if and only if, 0ψ(x)0\in\partial\psi(x^{*}).

2.2 Descent Directions

If the objective function ψ\psi is smooth, then the first-order information is carried in the gradient ψ\nabla\psi. We notice that the gradient is directly connected to the steepest descent direction ds(x)=ψ(x)/ψ(x)d_{s}(x)=-\nabla\psi(x)/\|\nabla\psi(x)\| and thus, we can use the following representation ψ(x)=ψ(x;ds(x))ds(x)-\nabla\psi(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x). Motivated by this observation, a natural choice of the first-order information and extension in the nonsmooth case is g(x)=ψ(x;ds(x))ds(x)g(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x). Since in some cases it might be hard or expensive to directly compute g(x)=ψ(x;ds(x))ds(x)g(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x), we can also utilize a general descent direction d(x)d(x) satisfying (2.1) instead. In our model function, we will work with directions of the form g(x)=u(x)d(x)g(x)=u(x)d(x) where d(x)d(x) is a descent direction, ψ(x;d(x))<0\psi^{\prime}(x;d(x))<0, satisfying (2.1) and u(x)u(x) is an upper bound of ψ(x,d(x))\psi^{\prime}(x,d(x)) with

u(x){[ψ(x,d(x)),0)if 0ψ(x),=0if 0ψ(x).u(x)\begin{cases}\in[\psi^{\prime}(x,d(x)),0)&\text{if }0\notin\partial\psi(x),\\ =0&\text{if }0\in\partial\psi(x).\end{cases} (2.2)

This implies g(x)=0g(x)=0 if and only if 0ψ(x)0\in\partial\psi(x), i.e., if xx is a stationary point of (1.2). The direction g(x)g(x) plays a similar role as the gradient in the smooth case. We would call g(x)g(x) as pseudo-gradient. Our aim in the rest of this subsection is to propose several strategies in the settings of composite programs (1.1) for computing and choosing the functions u(x)u(x) and d(x)d(x).

2.2.1 Steepest Descent Direction

We first compute and express

g(x)=ψ(x;ds(x))ds(x),g(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x), (2.3)

via the so-called normal map [65]:

·FnorΛ(z):=f(proxφΛ(z))+Λ(zproxφΛ(z)),\textperiodcentered F_{\mathrm{nor}}^{\Lambda}(z):=\nabla f(\mathrm{prox}_{\varphi}^{\Lambda}(z))+\Lambda(z-\mathrm{prox}_{\varphi}^{\Lambda}(z)), (2.4)

where Λ\Lambda denotes a symmetric and positive semidefinite matrix. We also use the notation Fnorλ=FnorΛF_{\mathrm{nor}}^{\lambda}=F_{\mathrm{nor}}^{\Lambda} in the case Λ=λI\Lambda=\lambda I. The next lemma establishes a relation between ds(x)d_{s}(x), FnorΛ(z)F_{\mathrm{nor}}^{\Lambda}(z), and ψ(x)\partial\psi(x).

Lemma 2.1.

Let xnx\in\mathbb{R}^{n} be given. It holds that

  • (i)

    The direction ds(x)d_{s}(x) and the derivative ψ(x;ds(x))\psi^{\prime}(x;d_{s}(x)) can be represented as follows: ψ(x;ds(x))=dist(0,ψ(x)):=minvψ(x)v\psi^{\prime}(x;d_{s}(x))=-\mathrm{dist}(0,\partial\psi(x)):=-\min_{v\in\partial\psi(x)}\|v\| and

    ds(x)={𝐏ψ(x)(0)𝐏ψ(x)(0)if 0ψ(x),0if 0ψ(x),\begin{split}d_{s}(x)=\begin{cases}-\frac{\mathbf{P}_{\partial\psi(x)}(0)}{\|\mathbf{P}_{\partial\psi(x)}(0)\|}&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x),\end{cases}\end{split}

    where 𝐏ψ(x)\mathbf{P}_{\partial\psi(x)} denotes the orthogonal projection onto the convex, closed set ψ(x)\partial\psi(x).

  • (ii)

    We have ψ(x;ds(x))ds(x)ψ(x)\psi^{\prime}(x;d_{s}(x))d_{s}(x)\in\partial\psi(x).

  • (iii)

    ψ(x)={FnorΛ(z):proxφΛ(z)=x}\partial\psi(x)=\{F_{\mathrm{nor}}^{\Lambda}(z):\mathrm{prox}_{\varphi}^{\Lambda}(z)=x\}.

Proof. (i) Using Fenchel-Rockafellar duality, see [10, Theorem 15.23], and the conjugation result (ιB(0,1))(d)=σB(0,1)(d)=d(\iota_{B_{\|\cdot\|}(0,1)})^{*}(d)=\sigma_{B_{\|\cdot\|}(0,1)}(d)=\|d\|, we obtain

ψ(x;ds(x))=mindσψ(x)(d)+ιB(0,1)(d)=minvιψ(x)(v)+v=dist(0,ψ(x)).\begin{split}\psi^{\prime}(x;d_{s}(x))&=\min_{d}\sigma_{\partial\psi(x)}(d)+\iota_{B_{\|\cdot\|}(0,1)}(d)\\ &=-\min_{v}\iota_{\partial\psi(x)}(v)+\|v\|=-\text{dist}(0,\partial\psi(x)).\end{split}

The unique solution of the dual problem is given by v=𝐏ψ(x)(0)v=\mathbf{P}_{\partial\psi(x)}(0). By [10, Corollary 19.2], the set of primal solutions can be characterized via ds(x)Nψ(x)(v)(v)d_{s}(x)\in N_{\partial\psi(x)}(v)\cap\partial\|\cdot\|(-v). Here, the set Nψ(x)(v):={h:h,yv0,yψ(x)}N_{\partial\psi(x)}(v):=\{h:\langle h,y-v\rangle\leq 0,\ \forall\ y\in\partial\psi(x)\} is the associated normal cone of ψ(x)\partial\psi(x) at vv. In the case 0ψ(x)0\notin\partial\psi(x), we have v0\|v\|\neq 0 and hence, (v)={v/v}\partial\|\cdot\|(-v)=\{-v/\|v\|\}. Moreover, since vv is a solution of the problem minyψ(x)12y2\min_{y\in\partial\psi(x)}\frac{1}{2}\|y\|^{2}, it satisfies the optimality condition v,yv0,yψ(x)\langle v,y-v\rangle\geq 0,\ \forall\ y\in\partial\psi(x). This implies {v/v}Nψ(x)(v)\{-v/\|v\|\}\in N_{\partial\psi(x)}(v) and ds(x)=𝐏ψ(x)(0)/𝐏ψ(x)(0)d_{s}(x)=-\mathbf{P}_{\partial\psi(x)}(0)/{\|\mathbf{P}_{\partial\psi(x)}(0)\|}. In the case 0ψ(x)0\in\partial\psi(x), we have ds(x)=0d_{s}(x)=0 by definition.

(ii) Noticing ψ(x;ds(x))=𝐏ψ(x)(0)\psi^{\prime}(x;d_{s}(x))=-\|\mathbf{P}_{\partial\psi(x)}(0)\|, it follows ψ(x;ds(x))ds(x)=𝐏ψ(x)(0)ψ(x).\psi^{\prime}(x;d_{s}(x))d_{s}(x)=\mathbf{P}_{\partial\psi(x)}(0)\in\partial\psi(x).

(iii) By the definition of proxφΛ(z)\mathrm{prox}_{\varphi}^{\Lambda}(z), it can be shown that

x=proxφΛ(z)Λ(zx)φ(x).x=\mathrm{prox}_{\varphi}^{\Lambda}(z)\quad\Longleftrightarrow\quad\Lambda(z-x)\in\partial\varphi(x). (2.5)

If proxφΛ(z)=x\mathrm{prox}_{\varphi}^{\Lambda}(z)=x, by (2.5), we have FnorΛ(z)=f(x)+Λ(zx)ψ(x)F_{\mathrm{nor}}^{\Lambda}(z)=\nabla f(x)+\Lambda(z-x)\in\partial\psi(x). If vψ(x)v\in\partial\psi(x), set z=x+Λ1(vf(x))z=x+\Lambda^{-1}(v-\nabla f(x)), then we have Λ(zx)=vf(x)φ(x)\Lambda(z-x)=v-\nabla f(x)\in\partial\varphi(x). According to (2.5), it holds x=proxφΛ(z)x=\mathrm{prox}_{\varphi}^{\Lambda}(z). Thus, we obtain v=FnorΛ(z)v=F_{\mathrm{nor}}^{\Lambda}(z). \square

From Lemma 2.1 we can immediately derive the following corollary which uses FnorΛ(z)F_{\mathrm{nor}}^{\Lambda}(z) to describe the first-order optimality conditions.

Corollary 2.2.

A point xnx^{*}\in\mathbb{R}^{n} is a stationary point of problem (1.1), if and only if there exists znz^{*}\in\mathbb{R}^{n} satisfying x=proxφΛ(z)x^{*}=\mathrm{prox}_{\varphi}^{\Lambda}(z^{*}) and zz^{*} is a solution of the nonsmooth equation FnorΛ(z)=0F_{\mathrm{nor}}^{\Lambda}(z)=0.

By Lemma 2.1, the calculation of ψ(x;ds(x))ds(x)\psi^{\prime}(x;d_{s}(x))d_{s}(x) is equivalent to solving an optimization problem ψ(x;ds(x))ds(x)=𝐏ψ(x)(0)=argminvψ(x)v\psi^{\prime}(x;d_{s}(x))d_{s}(x)=\mathbf{P}_{\partial\psi(x)}(0)=\mathop{\operatorname*{argmin}}_{v\in\partial\psi(x)}\|v\|. Alternatively, we can first solve

τ(x)=argminznFnorΛ(z)s.t.proxφΛ(z)=x\tau(x)=\mathop{\operatorname*{argmin}}_{z\in\mathbb{R}^{n}}\|F_{\mathrm{nor}}^{\Lambda}(z)\|\quad\text{s.t.}\quad\mathrm{prox}_{\varphi}^{\Lambda}(z)=x (2.6)

and then compute ψ(x;ds(x))ds(x)=FnorΛ(τ(x))\psi^{\prime}(x;d_{s}(x))d_{s}(x)=F_{\mathrm{nor}}^{\Lambda}(\tau(x)). By the definition of FnorΛ(z)F_{\mathrm{nor}}^{\Lambda}(z), solving (2.6) is equivalent to

τ(x)=argminznf(x)+Λ(zx)s.t.proxφΛ(z)=x,\tau(x)=\mathop{\operatorname*{argmin}}_{z\in\mathbb{R}^{n}}\|\nabla f(x)+\Lambda(z-x)\|\quad\text{s.t.}\quad\mathrm{prox}_{\varphi}^{\Lambda}(z)=x,

which combined with (2.5) leads to

τ(x)=x+Λ1𝐏φ(x)(f(x)).\tau(x)=x+\Lambda^{-1}\mathbf{P}_{\partial\varphi(x)}(-\nabla f(x)). (2.7)

and

ψ(x;ds(x))ds(x)=FnorΛ(τ(x))=f(x)+𝐏φ(x)(f(x)).\psi^{\prime}(x;d_{s}(x))d_{s}(x)=F_{\mathrm{nor}}^{\Lambda}(\tau(x))=\nabla f(x)+\mathbf{P}_{\partial\varphi(x)}(-\nabla f(x)). (2.8)

A closed form representation of the mapping FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) can be derived for 1\ell_{1}-optimization, group lasso, and \ell_{\infty}-optimization. We present FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) for an 1\ell_{1}-problem in Example 2.3; other examples are summarized in the appendix in Example A.1.

Example 2.3 (FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) for 1\ell_{1}-optimization).

Suppose that φ(x)=x1\varphi(x)=\|x\|_{1} and Λ=diag(λ1,λ2,,λn)\Lambda=\mathrm{diag}(\lambda_{1},\lambda_{2},\cdots,\lambda_{n}), then by (2.8), we can compute

FnorΛ(τ(x))i={f(x)i𝐏[1,1](f(x)i),xi=0,f(x)i+sgn(xi),xi0,i=1,2,,n.F_{\mathrm{nor}}^{\Lambda}(\tau(x))_{i}=\begin{cases}\nabla f(x)_{i}-\mathbf{P}_{[-1,1]}(\nabla f(x)_{i}),&x_{i}=0,\\ \nabla f(x)_{i}+\mathrm{sgn}(x_{i}),&x_{i}\neq 0,\\ \end{cases}\quad\forall~{}i=1,2,\cdots,n.

2.2.2 Natural Residual

Another possible choice for g(x)=u(x)d(x)g(x)=u(x)d(x) can be based on the so-called natural residual,

FnatΛ(x):=xproxφΛ(xΛ1f(x)).F_{\mathrm{nat}}^{\Lambda}(x):=x-\mathrm{prox}_{\varphi}^{\Lambda}(x-\Lambda^{-1}\nabla f(x)). (2.9)

Similar to the normal map, FnatΛF_{\mathrm{nat}}^{\Lambda} can be used as a criticality measure.

Lemma 2.4.

A point xx^{*} is a stationary point of problem (1.1) if and only if xx^{*} is a solution of the nonsmooth equation FnatΛ(x)=0.F_{\mathrm{nat}}^{\Lambda}(x)=0.

Following [30, Proposition 4.2], the directional derivative at xx along FnatΛ(x)-F_{\mathrm{nat}}^{\Lambda}(x) satisfies ψ(x;FnatΛ(x))FnatΛ(x)Λ2\psi^{\prime}(x;-F_{\mathrm{nat}}^{\Lambda}(x))\leq-\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}^{2}. Thus, the direction

d(x)={FnatΛ(x)FnatΛ(x)Λif 0ψ(x),0if 0ψ(x),d(x)=\begin{cases}-\frac{F_{\mathrm{nat}}^{\Lambda}(x)}{\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}}&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x),\end{cases}

is a descent direction with the directional derivative

ψ(x;d(x))FnatΛ(x)Λ2FnatΛ(x)Λ=FnatΛ(x)Λ.\psi^{\prime}(x;d(x))\leq-\frac{\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}^{2}}{\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}}=-\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}.

We can choose u(x)=FnatΛ(x)Λu(x)=-\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}, which implies that

g(x)=FnatΛ(x).g(x)=F_{\mathrm{nat}}^{\Lambda}(x). (2.10)

2.3 Stepsize Safeguard

If we utilize a smooth model mkm_{k}, once we have selected the descent direction dd, (directionally) noncontinuous points of ψ(,d)\psi^{\prime}(\cdot,d) will contribute to the inaccuracy of mkm_{k}. Hence, we should keep the stepsize relatively small to avoid those points. For any x,dnx,d\in\mathbb{R}^{n} with d=1\|d\|=1, we set the stepsize safeguard Γ(x,d)\Gamma(x,d) to guarantee that tψ(x+td;d)t\mapsto\psi^{\prime}(x+td;d) is continuous on (0,Γ(x,d))(0,\Gamma(x,d)), which is equivalent to saying that tψ(x+td)t\mapsto\psi\left(x+td\right) is continuously differentiable on t(0,Γ(x,d))t\in(0,\Gamma(x,d)).

We prefer to choose the largest possible value of Γ(x,d)\Gamma(x,d):

Γ(x,d)=Γmax(x,d):=sup{T>0:ψ~x,d(t):=ψ(x+td;d)C1(0,T)},\Gamma(x,d)=\Gamma_{\max}(x,d):=\sup\left\{T>0:\tilde{\psi}^{\prime}_{x,d}(t):=\psi^{\prime}(x+td;d)\in C^{1}(0,T)\right\}, (2.11)

since it can intuitively lead to faster convergence. We will see that this choice works well for polyhedral problems, where φ\varphi is the supremum of several affine functions, such as in 1\ell_{1}- and \ell_{\infty}-optimization.

However, in some other cases, we may need to set Γ(x,d)\Gamma(x,d) more carefully. For example, for the group lasso problem minXn1×n2f(X)+φ(X)\min_{X\in\mathbb{R}^{n_{1}\times n_{2}}}f(X)+\varphi(X), where ff is smooth and φ\varphi is given by φ(X)=i=1n2Xi\varphi(X)=\sum_{i=1}^{n_{2}}\left\|X_{i}\right\|, an appropriate choice for X=(X1,X2,,Xn2)n1×n2X=(X_{1},X_{2},\cdots,X_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}} and D=(D1,D2,,Dn2)n1×n2D=(D_{1},D_{2},\cdots,D_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}} with D=1\|D\|=1 is:

Γ(X,D)=min{Γmax(X,D),minXi0Xi1+σ1θi2,minXi0Ximax{2θi,0}},σ>0.\Gamma(X,D)=\min\left\{\Gamma_{\max}(X,D),\min_{X_{i}\neq 0}\frac{\|X_{i}\|^{1+\sigma}}{1-\theta_{i}^{2}},\min_{X_{i}\neq 0}\frac{\|X_{i}\|}{\max\{-2\theta_{i},0\}}\right\},\quad\sigma>0. (2.12)

Here, θi\theta_{i} is given by θi:=Xi,Di/(XiDi)\theta_{i}:=\langle X_{i},D_{i}\rangle/(\|X_{i}\|\cdot\|D_{i}\|) and we use c/0:=+c/0:=+\infty if c>0c>0. The term Γmax(X,D)\Gamma_{\max}(X,D) is defined as in (2.11). This Γ(X,D)\Gamma(X,D) is specifically designed to overcome some technical difficulties; see, e.g., Lemma 4.4.

Next, let us define the function Γ:n+\Gamma:\mathbb{R}^{n}\rightarrow\mathbb{R}^{+} via

Γ(x):=infdn,d=1Γmax(x,d).\Gamma(x):=\inf_{d\in\mathbb{R}^{n},\ \|d\|=1}\Gamma_{\max}(x,d). (2.13)

The scalar Γ(x)\Gamma(x) is important in our convergence analysis as it provides a lower bound for the stepsize safeguard. For the composite program (1.1) with φ(x)=x1\varphi(x)=\|x\|_{1}, Γ\Gamma can be simply calculated as follows Γ(x)=min{|xi|:xi0}\Gamma(x)=\min\{|x_{i}|:x_{i}\neq 0\} where min:=+\min\emptyset:=+\infty. Further examples for Γ\Gamma can be found in Appendix A.2.

2.4 Truncation Operators

Since in our algorithmic design we utilize simple, linear-quadratic models to approximate the nonsmooth function ψ\psi, we need to introduce stepsize safeguards that allow to intrinsically control the accuracy of the model. However, if the “safeguard” Γ(x)\Gamma(x) is very small, the resulting step might be close to the old iterate and the algorithm can start to stagnate. In order to prevent such an undesirable behavior, we discuss an additional modification step that allows to increase Γ(x)\Gamma(x).

Specifically, given a point xx, first we want to find a point xx^{\prime} near xx such that Γ(x)\Gamma(x^{\prime}) is relatively large. Let us consider the simplest case where ψ=f+φ\psi=f+\varphi and φ(x)=x1\varphi(x)=\|x\|_{1}. If xx has a nonzero component with small absolute values, then Γ(x)\Gamma(x) is also small. So we can replace those components with 0 and get a new point xx^{\prime} satisfying Γ(x)>Γ(x)\Gamma(x^{\prime})>\Gamma(x). Since only some components with small absolute values are truncated to 0, the point xx^{\prime} is close to xx. In more general cases, we define a class of functions that allow similar operations:

Definition 2.5.

Suppose that there exist a finite sequence {Si}i=0m\{S_{i}\}_{i=0}^{m} satisfying n=S0S1Sm\mathbb{R}^{n}=S_{0}\supset S_{1}\cdots\supset S_{m}, δ(0,+]\delta\in(0,+\infty], κ>0\kappa>0, and a function T:n×(0,δ]nT:\mathbb{R}^{n}\times(0,\delta]\rightarrow\mathbb{R}^{n} with following properties:

  • (i)

    Γ(x)δ,xSm\Gamma(x)\geq\delta,\ \forall\ x\in S_{m};

  • (ii)

    For any a(0,δ]a\in(0,\delta] and xSi\Si+1x\in S_{i}\backslash S_{i+1}, i{0,1,,m1}i\in\{0,1,\cdots,m-1\}, if Γ(x)a\Gamma(x)\geq a, it holds that T(x,a)=xT(x,a)=x, otherwise we have T(x,a)Si+1T(x,a)\in S_{i+1}, Γ(T(x,a))a\Gamma(T(x,a))\geq a, and T(x,a)xκa\|T(x,a)-x\|\leq\kappa a.

Then we say that ψ\psi can be truncated and that TT is a truncation operator.

In Definition 2.5, Γ(T(x,a))a\Gamma(T(x,a))\geq a means that we can make the value of Γ()\Gamma(\cdot) larger by performing truncation and T(x,a)xκa\|T(x,a)-x\|\leq\kappa a implies that the change caused by T(,a)T(\cdot,a) can be controlled. Example 2.6 shows that φ(x)=x1\varphi(x)=\|x\|_{1} can be truncated and we present more examples (\ell_{\infty}-optimization and group lasso) in the appendix.

Example 2.6 (φ(x)=x1\varphi(x)=\|x\|_{1}).

For i=0,1,,ni=0,1,\cdots,n, we set Si={xncard{j=1,2,,nxj=0}i}S_{i}=\{x\in\mathbb{R}^{n}\mid\mathrm{card}\{j=1,2,\cdots,n\mid x_{j}=0\}\geq i\}, m=nm=n, δ=+\delta=+\infty, κ=n\kappa=\sqrt{n}, and

T(x,a)j=𝟙||a(xj)xj,j=1,2,,n,T(x,a)_{j}=\mathbbm{1}_{|\cdot|\geq a}(x_{j})x_{j},\ j=1,2,\cdots,n,

where 𝟙A()\mathbbm{1}_{A}(\cdot) is the indicator function. Figure 1 shows the truncation operator for φ(x)=x1\varphi(x)=\|x\|_{1} and n=2n=2 explicitly, where S1={(x1,x2)x1x2=0}S_{1}=\{(x_{1},x_{2})\mid x_{1}x_{2}=0\} and S2={(0,0)}S_{2}=\{(0,0)\}.

1-101122331-10112233x=(1,2)Tx=(1,2)^{T}T(x,12)=xT(x,\frac{1}{2})=x 1-101122331-10112233x=(1,2)Tx=(1,2)^{T}T(x,32)T(x,\frac{3}{2})
1-101122331-10112233x=(1,2)Tx=(1,2)^{T}T(x,52)T(x,\frac{5}{2})
Figure 1: Illustration of the truncation operator in Example 2.6 for φ(x)=x1\varphi(x)=\|x\|_{1} and n=2n=2.

Let us mention that, for a smooth regularizer φ\varphi, all properties discussed above are satisfied, since the stepsize safeguard can be chosen as ++\infty and no truncation is needed.

3 A Nonsmooth Trust-region Method

In this section, we present the algorithmic framework of our trust-region type method. The traditional trust-region framework using the pseudo-gradient g(x)g(x) is employed in our algorithm with potential refinement using a stepsize safeguard. The traditional framework is standard but it might not be accurate enough to pass the acceptance test. In each iteration, we first perform a classical trust-region step. A stepsize safeguard for refinement is used if the traditional step fails. In order to promote large stepsizes, a novel truncation step is proposed.

We first introduce the model function and trust-region subproblem. Then, we propose several modification steps including the choice of the stepsize and the novel truncation step. The final algorithm is presented at the end of this section. We also present some methods for solving the corresponding trust-region subproblem in Appendix B.

3.1 Model Function and Trust-region Subproblem

Recall that in the classical trust-region method for a smooth optimization problem, minxnψ(x)\min_{x\in\mathbb{R}^{n}}\psi(x), the model function is mk(s)=ψ(xk)+ψ(xk),s+12s,Bksm_{k}(s)=\psi(x^{k})+\langle\nabla\psi(x^{k}),s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle. As mentioned at the beginning of Section 2.2, a natural extension is

mk(s)=ψ(xk)+ψ(xk;ds(xk))ds(xk),s+12s,Bks.m_{k}(s)=\psi(x^{k})+\langle\psi^{\prime}(x^{k};d_{s}(x^{k}))d_{s}(x^{k}),s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle. (3.1)

This model function is still quadratic and fits the objective function well along the steepest descent direction, which means that they have the same directional derivative in this direction. Though approximating the nonsmooth function ψ\psi with a quadratic function might not lead to good trust-region models in general, we can design specific quadratic models that fit ψ\psi well along certain directions.

One may wish to use different descent directions. Or it might be expensive to compute the steepest descent direction ds(x)d_{s}(x) and its Clarke’s generalized directional derivative ψ(x;ds(x))\psi^{\prime}(x;d_{s}(x)). Therefore we use a descent direction d(x)d(x) satisfying (2.1) and u(x)u(x) satisfying (2.2) instead. We can now define our model function

mk(s)=ψk+gk,s+12s,Bks,m_{k}(s)=\psi_{k}+\langle g^{k},s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle, (3.2)

where ψk=ψ(xk)\psi_{k}=\psi(x^{k}) and gk=g(xk)=u(xk)d(xk)g^{k}=g(x^{k})=u(x^{k})d(x^{k}), and the associated trust-region subproblem is given by

minsmk(s)=ψk+gk,s+12s,Bkss.t.sΔk.\min_{s}\ m_{k}(s)=\psi_{k}+\langle g^{k},s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle\quad\text{s.t.}\quad\|s\|\leq\Delta_{k}. (3.3)

This subproblem is quadratic and coincides with the classical approaches if BkB^{k} is symmetric. The matrices BkB^{k} in the model (3.2) are typically chosen to capture or approximate the second-order information of the objective function ψ\psi. However, such a careful choice is only required when certain local convergence properties should be guaranteed. Similar to classical trust-region methods, global convergence of the method will generally not be affected by {Bk}\{B^{k}\} and flexible choices of BkB^{k} are possible under some mild boundedness conditions that will be specified later.

We remark that the major difference between our algorithm and other nonsmooth trust-region methods in the literature is the first-order information in the model function. The methods in [64, 20] employ first-order terms that tend to be more complicated than a simple linear function, in order to approximate the nonsmooth objective function well and to satisfy certain accuracy assumptions. Although the model function in [3] is quadratic, their first-order term needs to be built using a steepest descent direction based on the ϵ\epsilon-subdifferential of ψ\psi.

An important concept for solving (3.3) is the so-called Cauchy point, which is defined via

sCk:=αkCgkandαkC:=argmin0tΔk/gkmk(tgk).s^{k}_{C}:=-\alpha_{k}^{C}g^{k}\ \text{and}\ \alpha_{k}^{C}:=\mathop{\operatorname*{argmin}}_{0\leq t\leq\Delta_{k}/\|g^{k}\|}m_{k}(-tg^{k}).

The Cauchy point is computational inexpensive [58, Algorithm 4.2] and it leads to sufficient reduction of the model function (Cauchy decrease condition):

mk(0)mk(sCk)12gkmin{Δk,gkBk},m_{k}(0)-m_{k}(s^{k}_{C})\geq\frac{1}{2}\|g^{k}\|\min\left\{\Delta_{k},\frac{\|g^{k}\|}{\|B^{k}\|}\right\}, (3.4)

see, e.g., in [58, Lemma 4.3]. Furthermore, it can be shown that

mk(0)mk(αks¯Ck)αksCk12gkmin{Δk,gkBk},m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k}_{C})\geq\frac{\alpha_{k}}{\|s^{k}_{C}\|}\cdot\frac{1}{2}\|g^{k}\|\min\left\{\Delta_{k},\frac{\|g^{k}\|}{\|B^{k}\|}\right\}, (3.5)

where s¯Ck=sCk/sCk\bar{s}^{k}_{C}=s^{k}_{C}/\|s^{k}_{C}\| and 0<αksCk0<\alpha_{k}\leq\|s^{k}_{C}\|.

In our algorithm, we need to generate an approximate solution of (3.3) that achieves a similar model descent compared to the Cauchy descent condition (3.4) in some sense. More precisely, we need to recover a solution sks^{k} satisfying

mk(0)mk(sk)γ12gkmin{Δk,γ2gk},m_{k}(0)-m_{k}(s^{k})\geq\frac{\gamma_{1}}{2}\|g^{k}\|\min\left\{\Delta_{k},\gamma_{2}\|g^{k}\|\right\}, (3.6)

where γ1,γ2>0\gamma_{1},\gamma_{2}>0 are constants which do not depend on kk, and

mk(0)mk(sk)(1(sk))[mk(0)mk(sCk)],m_{k}(0)-m_{k}(s^{k})\geq(1-\ell(\|s^{k}\|))[m_{k}(0)-m_{k}(s^{k}_{C})], (3.7)

where :+[0,1]\ell:\mathbb{R}^{+}\rightarrow[0,1] is chosen as a monotonically decreasing function with limΔ0+(Δ)=0\lim_{\Delta\rightarrow 0^{+}}\ell(\Delta)=0. The classical choice (Δ)0\ell(\Delta)\equiv 0 is also allowed.

3.2 Suitable Stepsizes

In the trust-region framework, we will work with the parameters 0<ηη1<η2<10<\eta\leq\eta_{1}<\eta_{2}<1, 0<r1<1<r20<r_{1}<1<r_{2}, and Δmax>0\Delta_{\max}>0. Let sks^{k} denote the generated solution of (3.3). Similar to the classical trust-region method, we define the ratio between actual reduction and predicted reduction as

ρk1=ψ(xk)ψ(xk+sk)mk(0)mk(sk).\rho^{1}_{k}=\frac{\psi(x^{k})-\psi(x^{k}+s^{k})}{m_{k}(0)-m_{k}(s^{k})}. (3.8)

If the proposed step xk+skx^{k}+s^{k} is “successful”, i.e., ρk1η1\rho^{1}_{k}\geq\eta_{1}, we accept the step, i.e., x~k=xk+sk\tilde{x}^{k}=x^{k}+s^{k}, and update the trust-region radius Δk\Delta_{k} as

Δk+1={min{Δmax,r2Δk}if ρk1>η2,Δkotherwise.\Delta_{k+1}=\begin{cases}\min\{{\Delta}_{\max},r_{2}\Delta_{k}\}&\text{if }\rho^{1}_{k}>\eta_{2},\\ \Delta_{k}&\text{otherwise}.\end{cases} (3.9)

If ρk1<η1\rho^{1}_{k}<\eta_{1}, we can introduce an additional stepsize strategy to refine the step. Specifically, we now consider the normalized descent direction s¯k:=sk/sk\bar{s}^{k}:={s^{k}}/{\|s^{k}\|}. In the following, we will always use the notation x¯:=x/x\bar{x}:={x}/{\|x\|}. Instead of setting the stepsize as sk\|s^{k}\| and working with sks^{k} directly, we calculate αk\alpha_{k} via

αk=min{Γ(xk,s¯k),sk},\alpha_{k}=\min\left\{\Gamma(x^{k},\bar{s}^{k}),\|s^{k}\|\right\}, (3.10)

where Γ(xk,s¯k)\Gamma(x^{k},\bar{s}^{k}) is the stepsize safeguard. If

mk(0)mk(αks¯k)αk2sk(mk(0)mk(sk)),m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})\geq\frac{\alpha_{k}}{2\|s^{k}\|}(m_{k}(0)-m_{k}(s^{k})), (3.11)

which means that the modified step yields sufficient descent, we use the direction s¯k\bar{s}^{k} and the stepsize αk\alpha_{k}; otherwise we set

sk=sCk,s¯k=s¯Ck, and αk=min{Γ(xk,s¯Ck),sCk}.s^{k}=s^{k}_{C},\ \bar{s}^{k}=\bar{s}^{k}_{C},\text{ and }\alpha_{k}=\min\left\{\Gamma(x^{k},\bar{s}^{k}_{C}),\|s^{k}_{C}\|\right\}. (3.12)

If mkm_{k} is convex (which, e.g., can be ensured when the matrix BkB^{k} is chosen to be positive semidefinite) then (3.11) holds automatically and the latter case will not occur. In (3.12) we utilize the Cauchy point and the corresponding stepsize as a simpler gradient-based step. As we have seen such a step can always guarantee certain descent properties. Next, we perform a second ratio test

ρk2=ψ(xk)ψ(xk+αks¯k)mk(0)mk(αks¯k).\rho^{2}_{k}=\frac{\psi(x^{k})-\psi(x^{k}+\alpha_{k}\bar{s}^{k})}{m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})}. (3.13)

According to this ratio, we update the trust-region radius as

Δk+1={r1Δkif ρk2<η1,min{Δmax,r2Δk}if ρk2>η2,Δkotherwise,\Delta_{k+1}=\begin{cases}r_{1}\Delta_{k}&\text{if }\rho^{2}_{k}<\eta_{1},\\ \min\{{\Delta}_{\max},r_{2}\Delta_{k}\}&\text{if }\rho^{2}_{k}>\eta_{2},\\ \Delta_{k}&\text{otherwise},\end{cases} (3.14)

and decide whether to accept the proposed step

x~k={xk+αks¯kif ρk2η,xkif ρk2<η.\tilde{x}^{k}=\begin{cases}x^{k}+\alpha_{k}\bar{s}^{k}&\text{if }\rho_{k}^{2}\geq\eta,\\ x^{k}&\text{if }\rho_{k}^{2}<\eta.\end{cases} (3.15)

We declare the step as “subsuccessful” if ρk1<η1\rho^{1}_{k}<\eta_{1} while ρk2η\rho^{2}_{k}\geq\eta, i.e., even if the original step is unsuccessful, the refined version can still provide some descent which is essential to guarantee convergence.

3.3 Truncation Step

It might not be suitable to simply set xk+1=x~kx^{k+1}=\tilde{x}^{k}, since Γ(x~k)\Gamma(\tilde{x}^{k}) can be very small and larger Γ\Gamma-values increase the stepsize and improve the fitness of the model. Our idea is to allow a small modification of x~k\tilde{x}^{k} and to get a new point xk+1x^{k+1} with relatively large Γ(xk+1)\Gamma(x_{k+1}) although such modification may cause an increase of the objective function. In the following, we describe an algorithmic procedure for increasing the safeguard Γ(xk+1)\Gamma(x^{k+1}) for functions which can be truncated.

Suppose that φ\varphi can be truncated and let S0,S1,SmS_{0},S_{1}\cdots,S_{m}, δ\delta, and TT be the corresponding truncation parameters and operators, respectively. Let {ϵs}s=01+\{\epsilon_{s}\}_{s=0}^{\infty}\in\ell_{1}^{+} be a positive and strictly decreasing sequence that is upper-bounded by δ\delta as well as summable. Since the sets {Sj:j=0,1,,m}\{S_{j}:j=0,1,\cdots,m\} are nested and cover the whole n\mathbb{R}^{n}, we know that there exists a unique index i{0,1,,m}i\in\{0,1,\cdots,m\} with x~kSi\Si+1\tilde{x}^{k}\in S_{i}\backslash S_{i+1}, where Sm+1=S_{m+1}=\emptyset. In the following, we define Nj:=Sj\Sj+1N_{j}:=S_{j}\backslash S_{j+1} and introduce a global counter cjc_{j} that is associated with each set NjN_{j} and that counts the total number of truncations performed on points in the set NjN_{j} for j=0,1,,mj=0,1,\cdots,m. Depending on the safeguard Γ(x~k)\Gamma(\tilde{x}^{k}) we then decide whether x~k\tilde{x}^{k} should be truncated via applying the truncation operator or not. The whole process is given as follows: find i{0,1,,m}i\in\{0,1,\cdots,m\} such that x~kNi\tilde{x}^{k}\in N_{i}; if Γ(x~k)<ϵci\Gamma(\tilde{x}^{k})<\epsilon_{c_{i}}, set x~kT(x~k,ϵci)\tilde{x}^{k}\leftarrow T(\tilde{x}^{k},\epsilon_{c_{i}}), otherwise we keep x~k\tilde{x}^{k} unchanged; and update ci=ci+1c_{i}=c_{i}+1 if x~k\tilde{x}^{k} is updated.

This procedure is repeated until Γ(x~k)ϵci\Gamma(\tilde{x}^{k})\geq\epsilon_{c_{i}}. Lemma 3.1 implies that this algorithm is well-defined and terminates within a finite number of steps. We call the whole procedure a truncation step which is presented in Algorithm 1.

Algorithm 1 Truncation step

Input: x~k\tilde{x}^{k} and cj,j=0,1,,m.c_{j},\ j=0,1,\cdots,m.

1:  while true do
2:     Compute the unique ii such that x~kSi\Si+1\tilde{x}^{k}\in S_{i}\backslash S_{i+1}.
3:     if Γ(x~k)<ϵci\Gamma(\tilde{x}^{k})<\epsilon_{c_{i}} then
4:        x~kT(x~k,ϵci)\tilde{x}^{k}\leftarrow T(\tilde{x}^{k},\epsilon_{c_{i}}).
5:        cici+1c_{i}\leftarrow c_{i}+1.
6:     else
7:        break.
8:     end if
9:  end while

Output: xk+1=x~kx^{k+1}=\tilde{x}^{k} and cj,j=0,1,,m.c_{j},\ j=0,1,\cdots,m.

Lemma 3.1.

Algorithm 1 will terminate in at most mm steps.

Proof. Since for any xSmx\in S_{m} and ss\in\mathbb{N}, we have Γ(x)δϵs\Gamma(x)\geq\delta\geq\epsilon_{s} and the operator TT moves points in Si\Si+1S_{i}\backslash S_{i+1} into Si+1S_{i+1}, TT is performed on x~k\tilde{x}_{k} at most mm times before Algorithm 1 terminates. \square

The iterate x~k\tilde{x}^{k} will be changed when performing Algorithm 1. For simplicity, in the rest of this paper, when we mention x~k\tilde{x}^{k}, we always mean the input of Algorithm 1.

3.4 Algorithmic Framework

We now present a nonsmooth trust-region framework with quadratic model functions that combines the mentioned strategies. One of the main advantages is that the corresponding subproblem can be cheaply formulated and solved. Specifically, the first-order term of our model can be constructed using any kind of descent direction and the second-order term BkB^{k} is only required to satisfy a classical boundedness condition to guarantee global convergence. Moreover, the resulting trust-region subproblem coincides with the classical one and can be solved using classical methods. Let us further note that in order to obtain fast local convergence, gkg^{k} and BkB^{k} need to be chosen and coupled in a more careful way. This is explored in more detail in Section 5 and 6.

The full algorithm is shown in Algorithm 2. We require the following parameters: 0<η<η1<η2<10<\eta<\eta_{1}<\eta_{2}<1, 0<r1<1<r20<r_{1}<1<r_{2}, Δmax>0{\Delta}_{\max}>0, γ1>0\gamma_{1}>0, γ2>0\gamma_{2}>0, and a positive and strictly decreasing sequence {ϵs}s=01+\{\epsilon_{s}\}_{s=0}^{\infty}\in\ell_{1}^{+} which is upper-bounded by δ\delta in Definition 2.5. We also assume that there is a monotonically decreasing function :+[0,12]\ell:\mathbb{R}^{+}\rightarrow[0,\frac{1}{2}] with limΔ0+(Δ)=0\lim_{\Delta\rightarrow 0^{+}}\ell(\Delta)=0. Some additional discussions of how to solve the trust-region subproblems (3.3) can be found in the appendix.

Algorithm 2 A trust-region method for nonsmooth nonconvex optimization

Initialization: initial point x0nx^{0}\in\mathbb{R}^{n}, initial trust-region radius Δ0\Delta^{0}, iteration k:=0k:=0, global counters cj=0,j=0,1,,mc_{j}=0,\ j=0,1,\cdots,m.

1:  while gk0\|g^{k}\|\neq 0 do
2:     Compute d(xk)d(x^{k}), u(xk)u(x^{k}) and gk=u(xk)d(xk)g^{k}=u(x^{k})d(x^{k}) and choose Bkn×nB_{k}\in\mathbb{R}^{n\times n}.
3:     Solve the trust-region subproblem (3.3) and obtain sks^{k} that satisfies (3.6) and (3.7).
4:     Compute ρk1\rho^{1}_{k} according to (3.8).
5:     if ρk1η1\rho^{1}_{k}\geq\eta_{1} then
6:        x~k:=xk+sk\tilde{x}^{k}:=x^{k}+s^{k}.
7:        Compute Δk+1\Delta_{k+1} according to (3.9).
8:     else
9:        Compute sks^{k}, s¯k\bar{s}^{k}, and αk\alpha_{k} according to (3.10), (3.11), and (3.12).
10:        Compute ρk2\rho^{2}_{k} according to (3.13).
11:        Compute Δk+1\Delta_{k+1} according to (3.14).
12:        Compute x~k\tilde{x}^{k} according to (3.15).
13:     end if
14:     Perform Algorithm 1, get xk+1x^{k+1} and update cj,j=0,1,,mc_{j},\ j=0,1,\cdots,m.
15:     kk+1k\leftarrow k+1.
16:  end while

We want to mention here that, for iteration k1k\geq 1, if xk+αks¯kx^{k}+\alpha_{k}\bar{s}^{k} is not accepted, i.e., x~k=xk\tilde{x}^{k}=x^{k}, we have xk+1=x~k=xkx^{k+1}=\tilde{x}^{k}=x^{k}, which means that no truncation is performed on x~k\tilde{x}^{k}. This is because xkx^{k} satisfies the stopping criteria of Algorithm 1 since it was the output of Algorithm 1 in the last iteration.

4 Global Convergence

In this section, we show the global convergence of Algorithm 2. Specifically, we will prove that every accumulation point of the sequence generated by Algorithm 2 is a stationary point under some suitable assumptions and that the natural residual converges to zero along the generated iterates.

4.1 Assumptions

In this subsection, we state the assumptions required for the convergence. Assumption 4.1 summarizes the conditions on the objective function ψ\psi and its pseudo-gradient gg.

Assumption 4.1.

We assume that ψ\psi and gg have the following properties:

  • (A.1)

    ψ\psi is bounded from below by LbL_{b}.

  • (A.2)

    If g(x)0g(x)\neq 0, then there exists r,ϵ>0r,\epsilon>0 such that g(y)ϵ,yBr(x)\|g(y)\|\geq\epsilon,\ \forall\ y\in B_{r}(x).

Assumption (A.1) is a standard assumption. Assumption (A.2) means that the first-order model will not vanish sharply. Condition (A.2) holds automatically if xg(x)x\mapsto\|g(x)\| is lower semicontinuous.

Lemma 4.2.

Assumption (A.2) is satisfied for the choices in (2.3) and (2.10).

Proof. (i) For (2.3), set ϵ=12FnorΛ(τ(x))>0\epsilon=\frac{1}{2}\|F_{\mathrm{nor}}^{\Lambda}(\tau(x))\|>0. Suppose that there exist a sequence {ym}m\{y^{m}\}_{m} satisfying ymxy^{m}\rightarrow x and FnorΛ(τ(ym))<ϵ\|F_{\mathrm{nor}}^{\Lambda}(\tau(y^{m}))\|<\epsilon for all mm\in\mathbb{N}. By the local boundedness of φ\partial\varphi, we can infer that {𝐏φ(ym)(f(ym))}m\{\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\}_{m} is bounded and thus, {𝐏φ(ym)(f(ym))}m=0\{\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\}_{m=0}^{\infty} has a convergent subsequence. Without loss of generality, we assume that the whole sequence {𝐏φ(ym)(f(ym))}m=0\{\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\}_{m=0}^{\infty} converges. Let us set w=limm𝐏φ(ym)(f(ym))w=\lim_{m\rightarrow\infty}\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m})). Using the upper semicontinuity or the closedness of φ\partial\varphi, ymxy^{m}\rightarrow x, and φ(ym)𝐏φ(ym)(f(ym))w\partial\varphi(y^{m})\ni\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\rightarrow w, it follows wφ(x)w\in\partial\varphi(x). Therefore, we obtain f(x)+wψ(x)\nabla f(x)+w\in\partial\psi(x) with f(x)+wϵ<FnorΛ(τ(x))\|\nabla f(x)+w\|\leq\epsilon<\|F_{\mathrm{nor}}^{\Lambda}(\tau(x))\|, which contradicts the optimality of FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)). We can conclude that for some r>0r>0, it holds g(y)=FnorΛ(τ(y))=ψ(y,ds(y))ϵ\|g(y)\|=\|F_{\mathrm{nor}}^{\Lambda}(\tau(y))\|=-\psi^{\prime}(y,d_{s}(y))\geq\epsilon for all yBr(x)y\in B_{r}(x). Hence, condition (A.2) is satisfied.

(ii) For (2.10), assumption (A.2) is a simple consequence of the continuity of FnatΛF^{\Lambda}_{\mathrm{nat}}. \square

Next, we present several assumptions on the iterates and sequences generated by Algorithm 2.

Assumption 4.3.

Let {xk}\{x^{k}\} and {Bk}\{B^{k}\} be generated by Algorithm 2. We assume:

  • (B.1)

    {xk}\{x^{k}\} is bounded, i.e., there exist R>0R>0 with {xk}BR(0)\{x^{k}\}\subseteq B_{R}(0).

  • (B.2)

    There exists κB>0\kappa_{B}>0 with supkBkκB<\sup_{k\in\mathbb{N}}\|B^{k}\|\leq\kappa_{B}<\infty.

  • (B.3)

    For any subsequence {k}=0\{k_{\ell}\}_{\ell=0}^{\infty}\subseteq\mathbb{N}, if {xk}\{x^{k_{\ell}}\} converges and we have αk0\alpha_{k_{\ell}}\rightarrow 0, then it holds that

ψ(xk+αks¯k)ψ(xk)αkψ(xk;s¯k)=o(αk).\psi(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-\psi(x^{k_{\ell}})-\alpha_{k_{\ell}}\psi^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})=o(\alpha_{k_{\ell}})\quad\ell\to\infty. (4.1)
  • (B.4)

    For every ϵ>0\epsilon>0 there is ϵ>0\epsilon^{\prime}>0 such that for all xkx^{k} with Γ(xk)ϵ\Gamma(x^{k})\geq\epsilon it follows Γ(xk,s¯k)ϵ\Gamma\left(x^{k},\bar{s}^{k}\right)\geq\epsilon^{\prime}.

The conditions (B.1)–(B.3) are standard assumptions. Condition (B.2) is frequently used in classical trust-region theory, see, e.g., [58, Theorem 4.5]. Assumption (B.3) is required to ensure uniform accuracy of the model function. Similar assumptions also appear in other convergence analyses of nonsmooth trust-region methods. For instance, condition A.2 in [64] and assumption 2.4(2c) in [20] have similar formats. However, as far as we can tell, a simple modification of the convergence analysis in previous works (such as [64, 21]) may not prove the convergence of our method. This is partly because assumption (B.3) is only required when the stepsize αk\alpha_{k} is smaller than or equal to some safeguard, see also (3.10). Moreover, even when the trust-region radius is relatively large and we pass the acceptance, the potential descent could be small. This is also further motivates our truncation step which allows to enlarge the stepsize.

Let us notice that condition (B.3) is similar to but weaker than the concept of “strict models” introduced by Noll in [59]. In particular, assumption (B.3) only needs to hold at accumulation points of {xk}\{x^{k}\} and along the specific and associated directions {s¯k}\{\bar{s}^{k}\} while typical strict model assumptions are formulated uniformly for all points and directions in n\mathbb{R}^{n}. In the following lemmas, we verify the conditions (B.3) and (B.4) for two exemplary cases.

Lemma 4.4.

Suppose that ψ\psi is given by ψ=f+φ\psi=f+\varphi with f\nabla f being locally Lipschitz continuous and assume that condition (B.1) holds. Then, (B.3) is satisfied in the following cases:

  • (i)

    The mapping φ\varphi is polyhedral and we have Γ(x,d)Γmax(x,d)\Gamma(x,d)\leq\Gamma_{\max}(x,d).

  • (ii)

    The problem is in the group lasso format and we set Γ(x,d)\Gamma(x,d) as in (2.12).

Proof. The local Lipschitz continuity of f\nabla f and the boundedness of {xk}\{x^{k}\} imply that for any subsequence {k}=0\{k_{\ell}\}_{\ell=0}^{\infty}\subseteq\mathbb{N} with xkxx^{k_{\ell}}\rightarrow x and αk0\alpha_{k_{\ell}}\rightarrow 0, it holds that f(xk+αks¯k)f(xk)αkf(xk;s¯k)=o(αk)f(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-f(x^{k_{\ell}})-\alpha_{k_{\ell}}f^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})=o(\alpha_{k_{\ell}}). Thus, it suffices to prove

φ(xk+αks¯k)φ(xk)αkφ(xk;s¯k)=o(αk).\varphi(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-\varphi(x^{k_{\ell}})-\alpha_{k_{\ell}}\varphi^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})=o(\alpha_{k_{\ell}}). (4.2)

If φ\varphi is polyhedral, the function φ~x,d(t):=φ(x+td)\tilde{\varphi}_{x,d}(t):=\varphi\left(x+td\right) is linear on (0,Γmax(x,d))(0,\Gamma_{\max}(x,d)). Thus, (4.2) holds with the right side of the equality taken as zero.

In case of the group lasso problem and using the definition (2.12), we can see that Xik+αkθik0.5Xik\|X_{i}^{k_{\ell}}\|+\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}}\geq 0.5\|X_{i}^{k_{\ell}}\|, where θik:=Xik,Sik/(XikSik)\theta_{i}^{k_{\ell}}:=\langle X_{i}^{k_{\ell}},S_{i}^{k_{\ell}}\rangle/(\|X_{i}^{k_{\ell}}\|\cdot\|S_{i}^{k_{\ell}}\|). Thus, we have

φ(Xk+αkS¯k)\displaystyle\varphi(X^{k_{\ell}}+\alpha_{k_{\ell}}\bar{S}^{k_{\ell}}) φ(Xk)αkφ(Xk;S¯k)\displaystyle-\varphi(X^{k_{\ell}})-\alpha_{k_{\ell}}\varphi^{\prime}(X^{k_{\ell}};\bar{S}^{k_{\ell}})
=\displaystyle= Xik0Xik+αkS¯ikXikαkX¯ik,S¯ik\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}\|X_{i}^{k_{\ell}}+\alpha_{k_{\ell}}\bar{S}^{k_{\ell}}_{i}\|-\|X_{i}^{k_{\ell}}\|-\alpha_{k_{\ell}}\langle\bar{X}_{i}^{k_{\ell}},\bar{S}^{k_{\ell}}_{i}\rangle
=\displaystyle= Xik0(Xik2+αk2+2αkXikθik)1/2Xikαkθik\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}(\|X_{i}^{k_{\ell}}\|^{2}+\alpha_{k_{\ell}}^{2}+2\alpha_{k_{\ell}}\|X_{i}^{k_{\ell}}\|\theta_{i}^{k_{\ell}})^{1/2}-\|X_{i}^{k_{\ell}}\|-\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}}
=\displaystyle= Xik0((Xik+αkθik)2+αk2(1(θik)2))1/2Xikαkθik\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}((\|X_{i}^{k_{\ell}}\|+\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}})^{2}+\alpha_{k_{\ell}}^{2}(1-(\theta_{i}^{k_{\ell}})^{2}))^{1/2}-\|X_{i}^{k_{\ell}}\|-\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}}
\displaystyle\leq Xik0αk2(1(θik)2)2(Xik+αkθik)Xik0αk21(θik)2Xik,\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}\frac{\alpha_{k_{\ell}}^{2}(1-(\theta_{i}^{k_{\ell}})^{2})}{2(\|X_{i}^{k_{\ell}}\|+\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}})}\leq\sum_{X_{i}^{k_{\ell}}\neq 0}\alpha_{k_{\ell}}^{2}\cdot\frac{1-(\theta_{i}^{k_{\ell}})^{2}}{\|X_{i}^{k_{\ell}}\|},

where the penultimate inequality follows from (a2+b)1/2ab/(2a)(a^{2}+b)^{1/2}-a\leq{b}/{(2a)} for a>0a>0, b>0b>0. For all ii, if limXik0\lim_{\ell\rightarrow\infty}X_{i}^{k_{\ell}}\neq 0, we have αk2/Xik=o(αk)\alpha_{k_{\ell}}^{2}/{\|X_{i}^{k_{\ell}}\|}=o(\alpha_{k_{\ell}}). If 0Xik00\neq X_{i}^{k_{\ell}}\rightarrow 0, definition (2.12) implies αk2(1(θik)2)/XikαkXikσ=o(αk)\alpha_{k_{\ell}}^{2}(1-(\theta_{i}^{k_{\ell}})^{2})/{\|X_{i}^{k_{\ell}}\|}\leq\alpha_{k_{\ell}}\|X_{i}^{k_{\ell}}\|^{\sigma}=o(\alpha_{k_{\ell}}). Therefore, condition (4.2) is satisfied. \square

Lemma 4.5.

Condition (B.4) is satisfied for the choices (2.11) and (2.12) (for group lasso problems).

Proof. Using (2.11), we immediately obtain Γ(x,d)Γ(x)\Gamma(x,d)\geq\Gamma(x) for all xx and dd with d=1\|d\|=1. Hence, in this case, we can set ϵ=ϵ\epsilon^{\prime}=\epsilon.

For the group lasso problem and (2.12), Example A.2 establishes Γ(X)=min{Xi:Xi0}\Gamma(X)=\min\left\{\|X_{i}\|:X_{i}\neq 0\right\}. Consequently, from Γ(X)ϵ\Gamma(X)\geq\epsilon, we can deduce

Γ(X,D)=min{Γmax(X,D),minXi0Xi1+σ1θi2,minXi0Ximax{2θi,0}}min{ϵ1+σ,ϵ/2},\Gamma(X,D)=\min\left\{\Gamma_{\max}(X,D),\min_{X_{i}\neq 0}\frac{\|X_{i}\|^{1+\sigma}}{1-\theta_{i}^{2}},\min_{X_{i}\neq 0}\frac{\|X_{i}\|}{\max\{-2\theta_{i},0\}}\right\}\geq\min\{\epsilon^{1+\sigma},\epsilon/2\},

where θi:=Xi,Di/(XiDi)\theta_{i}:={\langle X_{i},D_{i}\rangle}/{(\|X_{i}\|\cdot\|D_{i}\|)} for any DD with D=1\|D\|=1. \square

4.2 Convergence Analysis

In this subsection, we will prove that every accumulation point of {xk}k\{x^{k}\}_{k} is a stationary point of (1.1). First, in Lemma 4.6, we derive a global version of assumption (B.3) over the ball BR(0)¯\overline{B_{R}(0)}.

Lemma 4.6.

Suppose that (B.1) and (B.3) are satisfied and that Algorithm 2 does not terminate within finitely many steps. Let {xk}k\{x^{k}\}_{k} be a sequence generated by Algorithm 2. Then, there exists a function h:(0,)[0,]h:(0,\infty)\rightarrow[0,\infty] with limΔ0+h(Δ)=0\lim_{\Delta\rightarrow 0^{+}}h(\Delta)=0 and

ψ(xk+αks¯k)ψ(xk)αkψ(xk;s¯k)h(Δ)αk,\psi(x^{k}+\alpha_{k}\bar{s}^{k})-\psi(x^{k})-\alpha_{k}\psi^{\prime}(x^{k};\bar{s}^{k})\leq h(\Delta)\alpha_{k}, (4.3)

for all kk with ΔkΔ\Delta_{k}\leq\Delta.

Proof. We set

h(Δ)=max{supj:ΔjΔψ(xj+αjs¯j)ψ(xj)αjψ(xj;s¯j)αj,0}.h(\Delta)=\max\left\{\sup_{j:\ \Delta_{j}\leq\Delta}\frac{\psi(x^{j}+\alpha_{j}\bar{s}^{j})-\psi(x^{j})-\alpha_{j}\psi^{\prime}(x^{j};\bar{s}^{j})}{\alpha_{j}},0\right\}.

From the definition, it directly follows h(Δ1)h(Δ2)h(\Delta^{1})\leq h(\Delta^{2}) for all 0<Δ1<Δ2<0<\Delta^{1}<\Delta^{2}<\infty. Thus, it suffices to show that for every ϵ>0\epsilon>0 there exists Δ>0\Delta>0 such that h(Δ)ϵh(\Delta)\leq\epsilon. Let us assume that for some ϵ>0\epsilon>0 we have h(Δ)>ϵh(\Delta)>\epsilon for all Δ>0\Delta>0. Then, there exists a subsequence {k}=0\{k_{\ell}\}_{\ell=0}^{\infty}\subseteq\mathbb{N}, such that Δk0\Delta_{k_{\ell}}\rightarrow 0 and

ψ(xk+αks¯k)ψ(xk)αkψ(xk;s¯k)αkϵ.\frac{\psi(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-\psi(x^{k_{\ell}})-\alpha_{k_{\ell}}\psi^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})}{\alpha_{k_{\ell}}}\geq\epsilon\quad\forall~{}\ell\in\mathbb{N}. (4.4)

Since {xk}\{x^{k_{\ell}}\}_{\ell} is bounded, it has a convergent subsequence {xkm}m\{x^{k_{\ell_{m}}}\}_{m}. Due to Δk0\Delta_{k_{\ell}}\rightarrow 0 it follows αkm0\alpha_{k_{\ell_{m}}}\rightarrow 0. Therefore, (4.1) and (4.4) yield a contradiction. \square

Recall that the iterates xk+1x^{k+1} result from a possible truncation of the trust-region steps x~k\tilde{x}^{k}. We now prove that these truncation steps and the potential increase of the objective function ψ\psi can be controlled.

Lemma 4.7.

Suppose that ψ\psi can be truncated and that assumption (B.1) holds. Then, we have k=0xk+1x~kmκi=0ϵi<\sum_{k=0}^{\infty}\|x^{k+1}-\tilde{x}^{k}\|\leq m\kappa\sum_{i=0}^{\infty}\epsilon_{i}<\infty and k=0|ψ(xk+1)ψ(x~k)|<\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty.

Proof. The estimate k=0xk+1x~kmκi=0ϵi<\sum_{k=0}^{\infty}\|x^{k+1}-\tilde{x}^{k}\|\leq m\kappa\sum_{i=0}^{\infty}\epsilon_{i}<\infty follows directly from Definition 2.5 and the settings in the truncation step. Hence, due to the local Lipschitz continuity of ψ\psi and the boundedness of {xk}k\{x^{k}\}_{k}, there exists a constant Lψ>0L_{\psi}>0 such that k=0|ψ(xk+1)ψ(x~k)|Lψmκi=0ϵi<\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|\leq L_{\psi}m\kappa\sum_{i=0}^{\infty}\epsilon_{i}<\infty. \square

The next theorem is a weak global convergence result for Algorithm 2. In the proof, we combine our specific stepsize strategy and the truncation step to guarantee accuracy of the model and sufficient descent in \psi. More specifically, on the one hand, under the assumption that \|g^{k}\| has a positive lower bound, the stepsize strategy ensures that \Delta_{k} cannot become arbitrarily small; on the other hand, the truncation step guarantees that the stepsize safeguard has a positive lower bound on an infinite set of iterations. Combining these two observations, the total descent would be infinite, which is a contradiction. We point out that even though Algorithm 2 may not compute \rho^{k}_{2} in every iteration, we still use it in our analysis.

Theorem 4.8.

Suppose that the conditions (A.1) and (B.1)-(B.4) are satisfied and that ψ\psi can be truncated. Furthermore, let us assume that Algorithm 2 does not terminate in finitely many steps and let {xk}k\{x^{k}\}_{k} be the generated sequence of iterates. Then, it holds that

lim infkgk=0.\liminf_{k\rightarrow\infty}\|g^{k}\|=0.

Proof. Since Algorithm 2 does not terminate in finitely many steps, we have

ψ(xk;d(xk))<0andd(xk)=1k.\psi^{\prime}(x^{k};d(x^{k}))<0\quad\text{and}\quad\|d(x^{k})\|=1\quad\forall~{}k\in\mathbb{N}.

Suppose there exist ϵ>0\epsilon>0 and K+K\in\mathbb{N}_{+} such that gk>ϵ\|g^{k}\|>\epsilon for all kKk\geq K. By the definition of the Cauchy point, we know that mk(sCk)mk(skd(xk))m_{k}(s^{k}_{C})\leq m_{k}(\|s^{k}\|d(x^{k})). Using (3.7), we obtain

mk(0)mk(sk)(1(sk))[mk(0)mk(sCk)](1(sk))[mk(0)mk(skd(xk))],\begin{split}m_{k}(0)-m_{k}(s^{k})\geq(1-\ell(\|s^{k}\|))[m_{k}(0)-m_{k}(s^{k}_{C})]\geq(1-\ell(\|s^{k}\|))[m_{k}(0)-m_{k}(\|s^{k}\|d(x^{k}))],\end{split}

and we have

gk,sk(1(sk))skd(xk)12sk,Bksk+1(sk)2skd(xk),Bkskd(xk)κBsk2.\begin{split}\langle g^{k},s^{k}-(1-\ell(\|s^{k}\|))\|s^{k}\|d(x^{k})\rangle\leq-\frac{1}{2}\langle s^{k},B^{k}s^{k}\rangle+\frac{1-\ell(\|s^{k}\|)}{2}\langle\|s^{k}\|d(x^{k}),B^{k}\|s^{k}\|d(x^{k})\rangle\leq\kappa_{B}\|s^{k}\|^{2}.\end{split}

Due to g¯k=d(xk)\bar{g}^{k}=-d(x^{k}), it holds that

d(xk),s¯k(1(sk))d(xk)κBgkskκBϵsk\langle-d(x^{k}),\bar{s}^{k}-(1-\ell(\|s^{k}\|))d(x^{k})\rangle\leq\frac{\kappa_{B}}{\|g^{k}\|}\|s^{k}\|\leq\frac{\kappa_{B}}{\epsilon}\|s^{k}\|

for every kKk\geq K which implies d(xk),s¯k<1+(sk)+κBϵsk-\langle d(x^{k}),\bar{s}^{k}\rangle<-1+\ell(\|s^{k}\|)+\frac{\kappa_{B}}{\epsilon}\|s^{k}\|. Hence, we have

s¯kd(xk)=22d(xk),s¯k2(sk)+2κBϵsk.\|\bar{s}^{k}-d(x^{k})\|=\sqrt{2-2\langle d(x^{k}),\bar{s}^{k}\rangle}\leq\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}. (4.5)

Using the boundedness of \{x^{k}\}_{k}, the fact that \psi^{\prime}(x;d) is Lipschitz continuous in d locally uniformly in x (see, e.g., [21]), and (4.5), we can derive that there exists L_{\psi}>0 such that

|ψ(xk;s¯k)ψ(xk;d(xk))|Lψ2(sk)+2κBϵskkK,|\psi^{\prime}(x^{k};\bar{s}^{k})-\psi^{\prime}(x^{k};d(x^{k}))|\leq L_{\psi}\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}\quad\forall~{}k\geq K,

which combined with gk=u(xk)d(xk)g^{k}=u(x^{k})d(x^{k}) and (2.2) yields

αkψ(xk;s¯k)gk,αkd(xk)=αk(ψ(xk;s¯k)u(xk))αk(ψ(xk;d(xk))u(xk))+αkLψ2(sk)+2κBϵskαkLψ2(sk)+2κBϵsk.\begin{split}\alpha_{k}\psi^{\prime}(x^{k};\bar{s}^{k})-\langle g^{k},\alpha_{k}d(x^{k})\rangle=&\alpha_{k}(\psi^{\prime}(x^{k};\bar{s}^{k})-u(x^{k}))\\ \leq&\alpha_{k}(\psi^{\prime}(x^{k};d(x^{k}))-u(x^{k}))+\alpha_{k}L_{\psi}\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}\\ \leq&\alpha_{k}L_{\psi}\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}.\end{split} (4.6)

Together with (4.5), we obtain

gk,d(xk)gk,s¯kgkd(xk)s¯kgk2(sk)+2κBϵsk.\begin{split}\langle g^{k},d(x^{k})\rangle-\langle g^{k},\bar{s}^{k}\rangle\leq\|g^{k}\|\|d(x^{k})-\bar{s}^{k}\|\leq\|g^{k}\|\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}.\end{split} (4.7)

Using the definition of the function hh in (4.3) and combining (4.6) and (4.7), it follows

ψ(xk+αks¯k)ψ(xk)gk,αks¯k\displaystyle\psi(x^{k}+\alpha_{k}\bar{s}^{k})-\psi(x^{k})-\langle g^{k},\alpha_{k}\bar{s}^{k}\rangle\leq αk[(Lψ+gk)2(sk)+2κBϵsk+h(Δk)]\displaystyle\alpha_{k}\left[(L_{\psi}+\|g^{k}\|)\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}+h(\Delta_{k})\right]
\displaystyle\leq αk[(Lψ+gk)2(Δk)+2κBϵΔk+h(Δk)].\displaystyle\alpha_{k}\left[(L_{\psi}+\|g^{k}\|)\sqrt{2\ell(\Delta_{k})+\frac{2\kappa_{B}}{\epsilon}\Delta_{k}}+h(\Delta_{k})\right].

For all kKk\geq K, setting ν(Δk):=[2(Δk)+2κBΔk/ϵ]12\nu(\Delta_{k}):=[2\ell(\Delta_{k})+2\kappa_{B}\Delta_{k}/\epsilon]^{\frac{1}{2}}, we now get

mk(0)mk(αks¯k)\displaystyle m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k}) =αk[gk,s¯kαk2s¯k,Bks¯k]\displaystyle=\alpha_{k}\left[\langle-g^{k},\bar{s}^{k}\rangle-\frac{\alpha_{k}}{2}\langle\bar{s}^{k},B^{k}\bar{s}^{k}\rangle\right]
αk[gk,d(xk)gk2(sk)+2κBϵskαkκB2]\displaystyle\geq\alpha_{k}\left[\langle-g^{k},d(x^{k})\rangle-\|g^{k}\|\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}-\frac{\alpha_{k}\kappa_{B}}{2}\right]
αkgk[1ν(Δk)κB2ϵΔk],\displaystyle\geq\alpha_{k}\|g^{k}\|\left[1-\nu(\Delta_{k})-\frac{\kappa_{B}}{2\epsilon}\Delta_{k}\right],

and

\displaystyle 1-\rho^{2}_{k}=1-\frac{\psi(x^{k})-\psi(x^{k}+\alpha_{k}\bar{s}^{k})}{m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})}=\frac{\psi(x^{k}+\alpha_{k}\bar{s}^{k})-\psi(x^{k})-\alpha_{k}\langle g^{k},\bar{s}^{k}\rangle-\frac{\alpha_{k}^{2}}{2}\langle\bar{s}^{k},B^{k}\bar{s}^{k}\rangle}{m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})}
\displaystyle\leq\frac{(L_{\psi}+\|g^{k}\|)\nu(\Delta_{k})+h(\Delta_{k})+\frac{\kappa_{B}}{2}\Delta_{k}}{\|g^{k}\|\left[1-\nu(\Delta_{k})-\frac{\kappa_{B}}{2\epsilon}\Delta_{k}\right]}\leq\frac{(\frac{L_{\psi}}{\epsilon}+1)\nu(\Delta_{k})+\frac{1}{\epsilon}h(\Delta_{k})+\frac{\kappa_{B}}{2\epsilon}\Delta_{k}}{1-\nu(\Delta_{k})-\frac{\kappa_{B}}{2\epsilon}\Delta_{k}}.

Thus, there exists \sigma\in(0,\epsilon/\kappa_{B}) such that 1-\rho^{2}_{k}<1-\eta_{1} for every k\geq K with \Delta_{k}\leq\sigma. This implies \rho^{2}_{k}>\eta_{1} for all k\geq K satisfying \Delta_{k}\leq\sigma, which means that those steps are at least “subsuccessful”. Hence, we can infer

Δkmin{ΔK,r1σ}kK.\Delta_{k}\geq\min\{\Delta_{K},r_{1}{\sigma}\}\quad\forall~{}k\geq K. (4.8)

Next, let us set 𝒦={kKρk1η1orρk2η}\mathcal{K}=\{k\geq K\mid\rho^{1}_{k}\geq\eta_{1}\ \text{or}\ \rho^{2}_{k}\geq\eta\} and

𝒦i={k𝒦xkNi=Si\Si+1}i=0,1,2,,m.\mathcal{K}_{i}=\{k\in\mathcal{K}\mid x^{k}\in N_{i}=S_{i}\backslash S_{i+1}\}\quad i=0,1,2,\cdots,m.

Due to (4.8), we have |𝒦|=|\mathcal{K}|=\infty and applying Lemma 4.7, it follows

k𝒦[ψ(xk)ψ(x~k)]k𝒦[ψ(xk)ψ(xk+1)]+k𝒦|ψ(xk+1)ψ(x~k)|ψ(x0)Lb+k=0|ψ(xk+1)ψ(x~k)|<,\begin{split}\sum_{k\in\mathcal{K}}[\psi(x^{k})-\psi(\tilde{x}^{k})]\leq&\sum_{k\in\mathcal{K}}[\psi(x^{k})-\psi(x^{k+1})]+\sum_{k\in\mathcal{K}}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|\\ \leq&\psi(x^{0})-L_{b}+\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty,\end{split}

where we used xk+1=x~k=xkx^{k+1}=\tilde{x}^{k}=x^{k} for all k𝒦k\notin\mathcal{K}. Hence, we have

k𝒦[mk(0)mk(x~kxk)]1ηk𝒦[ψ(xk)ψ(x~k)]<.\sum_{k\in\mathcal{K}}[m_{k}(0)-m_{k}(\tilde{x}^{k}-x^{k})]\leq\frac{1}{\eta}\sum_{k\in\mathcal{K}}[\psi(x^{k})-\psi(\tilde{x}^{k})]<\infty. (4.9)

We now define the index i_{0}=\max\{i\in\{0,1,2,\cdots,m\}:|\mathcal{K}_{i}|=\infty\}. By the maximality of i_{0}, we can conclude that only finitely many elements of the sequence \{x^{k}\}_{k} belong to S_{i_{0}+1}. This implies that the truncation operator T is only applied a finite number of times to points in N_{i_{0}}=S_{i_{0}}\backslash S_{i_{0}+1}. In particular, T moves points from S_{i_{0}}\backslash S_{i_{0}+1} to the set S_{i_{0}+1} and, after a certain number of iterations K^{\prime}, the counter c_{i_{0}} is not updated anymore, i.e., we have c_{i_{0}}\equiv c for some c. Then, it follows that \Gamma(x^{k})\geq\epsilon_{c} for all x^{k}\in N_{i_{0}} and k\geq K^{\prime} and, by (B.4), there exists \epsilon^{\prime}>0 such that \Gamma(x^{k},\bar{s}^{k})\geq\epsilon^{\prime} for all x^{k}\in N_{i_{0}} and k\geq K^{\prime}. Combining (3.5), (3.6), and (3.11), we can always guarantee descent in the model. In particular, we obtain

mk(0)mk(x~kxk)x~kxk4skδ1gkmin{Δk,δ2gk},m_{k}(0)-m_{k}(\tilde{x}^{k}-x^{k})\geq\frac{\|\tilde{x}^{k}-x^{k}\|}{4\|s^{k}\|}\cdot{\delta}_{1}\|g^{k}\|\min\left\{\Delta_{k},{\delta}_{2}\|g^{k}\|\right\}, (4.10)

where δ1=min{γ1,1}{\delta}_{1}=\min\{\gamma_{1},1\} and δ2=min{γ2,1/κB}{\delta_{2}}=\min\{\gamma_{2},1/\kappa_{B}\}. Thus, we can conclude that

>k𝒦[mk(0)mk(x~kxk)]k𝒦i0,kKδ1x~kxk4skgkmin{Δk,δ2gk}k𝒦i0,kKδ1ϵ4min{Γ(xk,s¯k),sk}skmin{ΔK,r1σ,δ2ϵ}k𝒦i0,kKδ1ϵ4min{ϵΔmax,1}min{ΔK,r1σ,δ2ϵ}=,\begin{split}\infty&>\sum_{k\in\mathcal{K}}[m_{k}(0)-m_{k}(\tilde{x}^{k}-x^{k})]\geq\sum_{k\in\mathcal{K}_{i_{0}},k\geq K^{\prime}}\frac{{\delta}_{1}\|\tilde{x}^{k}-x^{k}\|}{4\|s^{k}\|}\|g^{k}\|\min\left\{\Delta_{k},{\delta}_{2}\|g^{k}\|\right\}\\ &\geq\sum_{k\in\mathcal{K}_{i_{0}},k\geq K^{\prime}}\frac{{\delta}_{1}\epsilon}{4}\frac{\min\left\{\Gamma(x^{k},\bar{s}^{k}),\|s^{k}\|\right\}}{\|s^{k}\|}\min\left\{\Delta_{K},r_{1}\sigma,{\delta}_{2}\epsilon\right\}\\ &\geq\sum_{k\in\mathcal{K}_{i_{0}},k\geq K^{\prime}}\frac{{\delta}_{1}\epsilon}{4}\min\left\{\frac{\epsilon^{\prime}}{{\Delta}_{\max}},1\right\}\min\left\{\Delta_{K},r_{1}{\sigma},{\delta}_{2}\epsilon\right\}=\infty,\end{split}

which is a contradiction. \square

Next, we prove a stronger version of our global result under the additional assumption (A.2). Specifically, we show that every accumulation point of Algorithm 2 is a stationary point of (1.2). This is a standard global convergence result, see, e.g., [64].

Theorem 4.9.

Let the conditions (A.1)–(A.2) and (B.1)–(B.4) be satisfied and suppose that ψ\psi can be truncated. Assume that Algorithm 2 does not terminate after finitely many steps and that it generates a sequence {xk}k\{x^{k}\}_{k} with an accumulation point xx^{*}. Then, xx^{*} is a stationary point of (1.2).

Proof. We assume that xx^{*} is not a stationary point of (1.2). By (A.2) there exist r,ϵ>0r,\epsilon>0 such that g(y)ϵ\|g(y)\|\geq\epsilon for all yBr(x)y\in B_{r}(x^{*}). Let us set Ak=max{ψ(xk+1)ψ(x~k),0}A^{k}=\max\{\psi(x^{k+1})-\psi(\tilde{x}^{k}),0\} for kk\in\mathbb{N}. Applying Lemma 4.7, we know that k=0Ak<\sum_{k=0}^{\infty}A^{k}<\infty. For any k>kk^{\prime}>k, we have

ψ(xk)=ψ(xk)+t=kk1(ψ(xt+1)ψ(x~t))+t=kk1(ψ(x~t)ψ(xt))ψ(xk)+t=kk1Atψ(xk)+t=kAt,\begin{split}\psi(x^{k^{\prime}})&=\psi(x^{k})+\sum_{t=k}^{k^{\prime}-1}(\psi(x^{t+1})-\psi(\tilde{x}^{t}))+\sum_{t=k}^{k^{\prime}-1}(\psi(\tilde{x}^{t})-\psi(x^{t}))\\ &\leq\psi(x^{k})+\sum_{t=k}^{k^{\prime}-1}A^{t}\leq\psi(x^{k})+\sum_{t=k}^{\infty}A^{t},\end{split}

where we used the descent property ψ(x~t)ψ(xt)0\psi(\tilde{x}^{t})-\psi(x^{t})\leq 0. Consequently, we can infer

lim supkψ(xk)lim infkψ(xk)+limkt=kAtlim infkψ(xk),\limsup_{k^{\prime}\rightarrow\infty}\psi(x^{k^{\prime}})\leq\liminf_{k\rightarrow\infty}\psi(x^{k})+\lim_{k\rightarrow\infty}\sum_{t=k}^{\infty}A^{t}\leq\liminf_{k\rightarrow\infty}\psi(x^{k}),

which implies that {ψ(xk)}k\{\psi(x^{k})\}_{k} converges. Next, Lemma 4.7 implies that there exists a constant KK\in\mathbb{N} such that k=Kxk+1x~kr4\sum_{k=K}^{\infty}\|x^{k+1}-\tilde{x}^{k}\|\leq\frac{r}{4}. There is a subsequence {xk}k𝒦{xk}k=K\{x^{k}\}_{k\in\mathcal{K}}\subseteq\{x^{k}\}_{k=K}^{\infty} satisfying {xk}k𝒦Br/4(x)\{x^{k}\}_{k\in\mathcal{K}}\subseteq B_{r/4}(x^{*}) and xkx,k,k𝒦x^{k}\rightarrow x^{*},\ k\rightarrow\infty,k\in\mathcal{K}. For any k𝒦k\in\mathcal{K}, since g(y)ϵ,yBr(x)\|g(y)\|\geq\epsilon,\ \forall~{}y\in B_{r}(x^{*}) and lim infkgk=0\liminf_{k^{\prime}}\|g^{k^{\prime}}\|=0 by Theorem 4.8, there must be some kkk^{\prime}\geq k such that xkBr(x)x^{k^{\prime}}\notin B_{r}(x^{*}). Set l(k)=sup{kk:xtBr(x),ktk}l(k)=\sup\{k^{\prime}\geq k:x^{t}\in B_{r}(x^{*}),\ \forall~{}k\leq t\leq k^{\prime}\}. Thus, it holds that

t=kl(k)x~txt+r4t=kl(k)(xt+1x~t+x~txt)t=kl(k)xt+1xtxl(k)+1xk3r4\begin{split}\sum_{t=k}^{l(k)}\|\tilde{x}^{t}-x^{t}\|+\frac{r}{4}&\geq\sum_{t=k}^{l(k)}(\|x^{t+1}-\tilde{x}^{t}\|+\|\tilde{x}^{t}-x^{t}\|)\\ &\geq\sum_{t=k}^{l(k)}\|x^{t+1}-x^{t}\|\geq\|x^{l(k)+1}-x^{k}\|\geq\frac{3r}{4}\end{split}

and it follows t=kl(k)x~txtr2\sum_{t=k}^{l(k)}\|\tilde{x}^{t}-x^{t}\|\geq\frac{r}{2}. Mimicking the last steps in the proof of Theorem 4.8, we get

ψ(xk)ψ(xl(k)+1)\displaystyle\psi(x^{k})-\psi(x^{l(k)+1})\geq t=kl(k)(ψ(xt)ψ(x~t)At)\displaystyle\sum_{t=k}^{l(k)}(\psi(x^{t})-\psi(\tilde{x}^{t})-A^{t})
\displaystyle\geq ktl(k),ρt1η1 or ρt2ηη[mt(0)mt(x~txt)]t=kAt\displaystyle\sum_{k\leq t\leq l(k),\ \rho^{1}_{t}\geq\eta_{1}\text{ or }\rho^{2}_{t}\geq\eta}\eta[m_{t}(0)-m_{t}(\tilde{x}^{t}-x^{t})]-\sum_{t=k}^{\infty}A^{t}
\displaystyle\geq ktl(k),ρt1η1 or ρt2ηηδ1x~txt4stgtmin{Δt,δ2gt}t=kAt\displaystyle\sum_{k\leq t\leq l(k),\ \rho^{1}_{t}\geq\eta_{1}\text{ or }\rho^{2}_{t}\geq\eta}\frac{\eta{\delta}_{1}\|\tilde{x}^{t}-x^{t}\|}{4\|s^{t}\|}\|g^{t}\|\min\left\{\Delta_{t},{\delta}_{2}\|g^{t}\|\right\}-\sum_{t=k}^{\infty}A^{t}
\displaystyle\geq ktl(k),ρt1η1 or ρt2ηηδ1ϵx~txt4min{1,δ2ϵΔmax}t=kAt\displaystyle\sum_{k\leq t\leq l(k),\ \rho^{1}_{t}\geq\eta_{1}\text{ or }\rho^{2}_{t}\geq\eta}\frac{\eta{\delta}_{1}\epsilon\|\tilde{x}^{t}-x^{t}\|}{4}\min\left\{1,\frac{{\delta}_{2}\epsilon}{{\Delta}_{\max}}\right\}-\sum_{t=k}^{\infty}A^{t}
\displaystyle\geq ηδ1ϵr8min{1,δ2ϵΔmax}t=kAt.\displaystyle\frac{\eta{\delta}_{1}\epsilon r}{8}\min\left\{1,\frac{{\delta}_{2}\epsilon}{{\Delta}_{\max}}\right\}-\sum_{t=k}^{\infty}A^{t}.

Taking the limit 𝒦k\mathcal{K}\ni k\rightarrow\infty we obtain the contradiction 0ηδ1ϵr8min{1,δ2ϵΔmax1}0\geq\frac{\eta{\delta}_{1}\epsilon r}{8}\min\{1,{\delta}_{2}\epsilon{\Delta}_{\max}^{-1}\}. \square

Remark 4.10.

Theorem 4.9 essentially establishes a similar result as in [64, Theorem 3.4]. We notice that instead of boundedness of the level set {xnψ(x)ψ(x0)}\{x\in\mathbb{R}^{n}\mid\psi(x)\leq\psi(x^{0})\} (which was used in [64]), we need to work with the slightly stronger assumption (B.1) here since the truncation step can increase the objective function value ψ\psi. However, if the function ψ\psi satisfies a Lipschitz-type assumption, i.e., if there are ϵ>0\epsilon>0 and L0L\geq 0 such that |ψ(x)ψ(y)|Lxy|\psi(x)-\psi(y)|\leq L\|x-y\| for all x,yx,y with xyϵ\|x-y\|\leq\epsilon, then the proof of Lemma 4.7 implies k=0|ψ(xk+1)ψ(x~k)|<\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty. This can be combined with the descent property ψ(x~k)ψ(xk)\psi(\tilde{x}^{k})\leq\psi(x^{k}) of the trust-region step to show that the iterates {xk}\{x^{k}\} will stay in the level set {xn:ψ(x)ζ}\{x\in\mathbb{R}^{n}:\psi(x)\leq\zeta\} where ζ=ψ(x0)+k=0|ψ(xk+1)ψ(x~k)|<\zeta=\psi(x^{0})+\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty. Therefore, (B.1) can be substituted by a more classical level set condition in such a situation.

Finally, utilizing the natural residual, it is possible to obtain a strong lim-type convergence result for Algorithm 2. In contrast to Theorem 4.8, which only states that g^{k} converges to zero along a subsequence, the next theorem shows that a nonsmooth residual of x^{k} converges to zero along the whole sequence.

Theorem 4.11.

Suppose that the same assumptions stated in Theorem 4.9 are satisfied. Then, it holds that limkFnatΛ(xk)=0\lim_{k\rightarrow\infty}\|F_{\mathrm{nat}}^{\Lambda}(x^{k})\|=0.

Proof. Suppose that there exists ϵ>0\epsilon>0 and an infinite subsequence {xk}k𝒦\{x^{k}\}_{k\in\mathcal{K}} of {xk}k=0\{x^{k}\}_{k=0}^{\infty} satisfying

FnatΛ(xk)ϵk𝒦.\displaystyle\|F_{\mathrm{nat}}^{\Lambda}(x^{k})\|\geq\epsilon\quad\forall~{}k\in\mathcal{K}. (4.11)

By (B.1), {xk}k𝒦\{x^{k}\}_{k\in\mathcal{K}} has another subsequence {xk}k𝒦1\{x^{k}\}_{k\in\mathcal{K}_{1}} with limit x=lim𝒦1kxkx^{*}=\lim_{\mathcal{K}_{1}\ni k\rightarrow\infty}x^{k}. By Theorem 4.9, xx^{*} is a stationary point of (1.2) with FnatΛ(x)=0F_{\mathrm{nat}}^{\Lambda}(x^{*})=0. Using the continuity of FnatΛF_{\mathrm{nat}}^{\Lambda}, this contradicts (4.11). \square

5 Fast Local Convergence

To the best of our knowledge, only limited local convergence results are available for nonsmooth trust-region type methods and most of the existing work focuses on the global convergence analysis, see, e.g., [64, 34, 3]. In this section, we investigate local properties of our algorithm. Specifically, we establish fast local convergence for the composite program \psi=f+\varphi when f is a smooth mapping and \varphi is real-valued, convex, and partly smooth relative to an affine subspace. Our local results require that the first- and second-order information, i.e., g^{k} and B^{k}, is chosen as the Riemannian gradient and the Riemannian Hessian with respect to some active manifold. We also show that this information can be obtained without knowing the active manifold under suitable assumptions.

5.1 Definitions and Assumptions

In this subsection, we state some elementary definitions and assumptions. The family of partly smooth functions was originally introduced in [46] and plays a fundamental role in nonsmooth optimization. In particular, the concept of partial smoothness is utilized in the convergence analysis of nonsmooth optimization algorithms and to derive activity identification properties, see, e.g., [48, 63]. Since the mapping \varphi in (1.1) is real-valued and convex, we use the definition of partly smooth functions given in [48]. For a more general version and further details, we refer to [46].

Definition 5.1.

[48, Definition 3.1] A proper convex and lower semicontinuous function φ\varphi is said to be partly smooth at xx relative to a set \mathcal{M} containing xx if φ(x)\partial\varphi(x)\neq\emptyset and we have:

  • (i)

    Smoothness: \mathcal{M} is a C2C^{2}-manifold around xx and φ\varphi restricted to \mathcal{M} is C2C^{2} around xx;

  • (ii)

    Sharpness: The tangent space T(x)T_{\mathcal{M}}(x) coincides with Tx:=par(φ(x))T_{x}:=\text{par}(\partial\varphi(x))^{\perp}, where par(A)=span(AA)\text{par}(A)=\text{span}(A-A) for a convex set AnA\subset\mathbb{R}^{n}.

  • (iii)

    Continuity: The set-valued mapping φ\partial\varphi is continuous at xx relative to \mathcal{M}.

Let {Si}i=0m\{S_{i}\}_{i=0}^{m} be the sequence of sets associated with the truncation operator of ψ\psi and let {xk}k\{x^{k}\}_{k} be generated by Algorithm 2. We further consider an accumulation point xx^{*} of {xk}k\{x^{k}\}_{k} with xSi\Si+1x^{*}\in S_{i^{*}}\backslash S_{i^{*}+1} and i{0,1,,m}i^{*}\in\{0,1,\cdots,m\} and we make the following assumptions.

Assumption 5.2.

We consider the following conditions:

  • (C.1)

    The mapping φ\varphi is partly smooth at xx^{*} relative to an affine subspace \mathcal{M} and it holds that Br(x)Si=Br(x)B_{r}(x^{*})\cap S_{i^{*}}=B_{r}(x^{*})\cap\mathcal{M} for all r(0,Γ(x))r\in(0,\Gamma(x^{*})).

  • (C.2)

    The Riemannian Hessian 2ψ(x)\nabla^{2}_{\mathcal{M}}\psi(x) is locally Lipschitz continuous around xx^{*} restricted to \mathcal{M} and the second-order sufficient condition is satisfied at xx^{*}, i.e., we have 2ψ(x)[ξ,ξ]cξ2\nabla^{2}_{\mathcal{M}}\psi(x^{*})[\xi,\xi]\geq c\|\xi\|^{2} for some positive constant cc and all ξT(x)\xi\in T_{\mathcal{M}}(x^{*}).

  • (C.3)

    The strict complementary condition f(x)riφ(x)-\nabla f(x^{*})\in\text{ri}\ \partial\varphi(x^{*}) is satisfied.

  • (C.4)

    For all xSix\in S_{i}, ySjy\in S_{j} with i<ji<j, it holds that Γmax(x,yx¯)yx\Gamma_{\max}\left(x,\overline{y-x}\right)\leq\|y-x\| where yx¯=yxyx\overline{y-x}=\frac{y-x}{\|y-x\|}. For every r(0,Γ(x))r\in(0,\Gamma(x^{*})), there exists ϵ(r)>0\epsilon(r)>0 such that Γ(x)ϵ(r)\Gamma(x)\geq\epsilon(r) for all xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M}.

  • (C.5)

    The sequence {xk}k\{x^{k}\}_{k} converges with limit limkxk=x\lim_{k\rightarrow\infty}x_{k}=x^{*}.

Besides partial smoothness, assumption (C.1) requires the local structure of S_{i^{*}} around x^{*} to be affine. The conditions (C.2), (C.3), and (C.5) are standard assumptions for finite activity identification and for establishing local convergence rates; for instance, they appear in [48, 63].

In order to illustrate assumption (C.4), we consider the example \varphi(x)=\|x\|_{1}. Suppose that x\in S_{i} and y\in S_{j} are two given points with i<j. Since y has more zero components than x, there exists a point on the line segment connecting x and y at which \varphi is not differentiable. This immediately leads to \Gamma_{\max}(x,\overline{y-x})\leq\|y-x\|. The second part of (C.4) requires that \Gamma does not decay sharply around x^{*} restricted to \mathcal{M}. We will use condition (C.4) in the analysis of the truncation step.

5.2 Riemannian Gradient and Riemannian Hessian

We now choose gk=FnorΛ(τ(xk))g^{k}=F_{\mathrm{nor}}^{\Lambda}(\tau(x^{k})). The next lemma shows that this choice actually coincides with the Riemannian gradient of ψ\psi when xkx^{k} lies in the manifold \mathcal{M} and is close to xx^{*}.

Lemma 5.3.

Suppose that the conditions (C.1) and (C.3) hold and that xx^{*} is a stationary point. Let xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M} be given for r(0,Γ(x))r\in(0,\Gamma(x^{*})) sufficiently small. Then, we have FnorΛ(τ(x))=ψ(x;ds(x))ds(x)=ψ(x)F_{\mathrm{nor}}^{\Lambda}(\tau(x))=\psi^{\prime}(x;d_{s}(x))d_{s}(x)=\nabla_{\mathcal{M}}\psi(x), where ψ(x)\nabla_{\mathcal{M}}\psi(x) denotes the Riemannian gradient of ψ\psi.

Proof. By the stationarity of xx^{*} and (C.3), it is easy to see that ψ(x)=0f(x)+riφ(x)=riψ(x).\nabla_{\mathcal{M}}\psi(x^{*})=0\in\nabla f(x^{*})+\text{ri}\ \partial\varphi(x^{*})=\text{ri}\ \partial\psi(x^{*}). Thus, applying [23, Corollary 21] for xx\in\mathcal{M} close to xx^{*} and Lemma 2.1, we can conclude that ψ(x)=𝐏ψ(x)(0)=ψ(x;ds(x))ds(x)=FnorΛ(τ(x))\nabla_{\mathcal{M}}\psi(x)=\mathbf{P}_{\partial\psi(x)}(0)=\psi^{\prime}(x;d_{s}(x))d_{s}(x)=F_{\mathrm{nor}}^{\Lambda}(\tau(x)). \square

Since gkg^{k} coincides with the Riemannian gradient, we naturally would like to choose BkB^{k} as the associated Riemannian Hessian of ψ\psi. We now show that this Hessian can be derived without knowing the underlying manifold \mathcal{M} if we additionally assume that φ\varphi is polyhedral. Specifically, the following lemma establishes a connection between the derivative of FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) and the Riemannian Hessian.

Lemma 5.4.

Suppose that the assumptions stated in Lemma 5.3 hold and that φ\varphi is a polyhedral function. For r(0,Γ(x))r\in(0,\Gamma(x^{*})) sufficiently small and all xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M}, it follows

V𝒟FnorΛ(z)=2ψ(x),Λ=λI,λ>0,V\mathcal{D}F_{\mathrm{nor}}^{\Lambda}(z)=\nabla^{2}_{\mathcal{M}}\psi(x),\quad\Lambda=\lambda I,\;\lambda>0,

where 𝒟\mathcal{D} is the differential operator, z=τ(x)z=\tau(x), V=𝒟proxφΛ(z)V=\mathcal{D}\mathrm{prox}_{\varphi}^{\Lambda}(z) is the derivative of proxφΛ\mathrm{prox}_{\varphi}^{\Lambda}, and 2ψ(x)\nabla^{2}_{\mathcal{M}}\psi(x) is the Riemannian Hessian.

Proof. For xx\in\mathcal{M} and near xx^{*}, by Definition 5.1 and [48, Fact 3.3], we can decompose φ(x)\partial\varphi(x) as φ(x)={φ(x)}+φ(x)\partial\varphi(x)=\left\{\nabla_{\mathcal{M}}\varphi(x)\right\}+\partial_{\mathcal{M}}^{\perp}\varphi(x), where φ(x)T(x)\partial_{\mathcal{M}}^{\perp}\varphi(x)\subseteq T_{\mathcal{M}}(x)^{\perp}. We can see that both φ(x)\nabla_{\mathcal{M}}\varphi(x) and φ(x)\partial_{\mathcal{M}}^{\perp}\varphi(x) restricted to \mathcal{M} are continuous around xx^{*}. Moreover, we have the decomposition f(x)=f(x)+f(x)\nabla f(x)=\nabla_{\mathcal{M}}f(x)+\nabla_{\mathcal{M}}^{\perp}f(x) where f(x)T(x)\nabla_{\mathcal{M}}^{\perp}f(x)\in T_{\mathcal{M}}(x)^{\perp}.

Condition (C.3) implies f(x)riφ(x)-\nabla_{\mathcal{M}}^{\perp}f(x^{*})\in\text{ri}\ \partial_{\mathcal{M}}^{\perp}\varphi(x^{*}), which combined with part (ii) in Definition 5.1 and the continuity of f|\nabla_{\mathcal{M}}^{\perp}f\big{|}_{\mathcal{M}} and φ|\partial_{\mathcal{M}}^{\perp}\varphi\big{|}_{\mathcal{M}} leads to f(x)riφ(x)-\nabla_{\mathcal{M}}^{\perp}f(x)\in\text{ri}\ \partial_{\mathcal{M}}^{\perp}\varphi(x), i.e., 0{f(x)}+riφ(x)0\in\left\{\nabla_{\mathcal{M}}^{\perp}f(x)\right\}+\text{ri}\ \partial_{\mathcal{M}}^{\perp}\varphi(x) for all xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M} where r>0r>0 is sufficiently small. Thus, it follows

f(x)+Λ(zx)=FnorΛ(z)=f(x)+φ(x)f(x)+riφ(x),\nabla f(x)+\Lambda(z-x)=F_{\mathrm{nor}}^{\Lambda}(z)=\nabla_{\mathcal{M}}f(x)+\nabla_{\mathcal{M}}\varphi(x)\in\nabla f(x)+\text{ri}\ \partial\varphi(x),

which implies Λ(zx)riφ(x)\Lambda(z-x)\in\text{ri}\ \partial\varphi(x). Since φ\varphi is polyhedral, the subdifferential φ(x)\partial\varphi(x) is locally constant around xx^{*} on \mathcal{M}. For any d1Td_{1}\in T_{\mathcal{M}}, d2Td_{2}\in T_{\mathcal{M}}^{\perp} with d1,d2\|d_{1}\|,\|d_{2}\| sufficiently small, it holds that Λ((z+d1+d2)(x+d1))=Λ(zx)+Λd2φ(x)=φ(x+d1)\Lambda((z+d_{1}+d_{2})-(x+d_{1}))=\Lambda(z-x)+\Lambda d_{2}\in\partial\varphi(x)=\partial\varphi(x+d_{1}), which implies that proxφΛ(z+d1+d2)=x+d1=proxφΛ(z)+d1\mathrm{prox}_{\varphi}^{\Lambda}(z+d_{1}+d_{2})=x+d_{1}=\mathrm{prox}_{\varphi}^{\Lambda}(z)+d_{1}. Consequently, we have 𝒟proxφΛ(z)=V=𝐏\mathcal{D}\mathrm{prox}_{\varphi}^{\Lambda}(z)=V=\mathbf{P}, where 𝐏\mathbf{P} is the orthogonal projection operator onto TT_{\mathcal{M}}. The derivative of the normal map is 𝒟FnorΛ(z)=2f(x)𝐏+Λ(I𝐏)\mathcal{D}F_{\mathrm{nor}}^{\Lambda}(z)=\nabla^{2}f(x)\mathbf{P}+\Lambda(I-\mathbf{P}), which combined with the local linearity of φ\varphi yields 𝐏𝒟FnorΛ(z)=𝐏2f(x)𝐏=2ψ(x)\mathbf{P}\mathcal{D}F_{\mathrm{nor}}^{\Lambda}(z)=\mathbf{P}\nabla^{2}f(x)\mathbf{P}=\nabla^{2}_{\mathcal{M}}\psi(x). \square

5.3 Convergence Analysis

We have the following finite active identification result.

Lemma 5.5.

Suppose that the assumptions in Theorem 4.9 are satisfied and that the conditions (C.4)–(C.5) hold. Then for every r(0,Γ(x))r\in(0,\Gamma(x^{*})) there exist infinitely many kk\in\mathbb{N} with xkBr(x)Six^{k}\in B_{r}(x^{*})\cap S_{i^{*}}.

Proof. Without loss of generality, we can assume {xk}kBr(x)\{x^{k}\}_{k}\subseteq B_{r}(x^{*}). Let us set

𝒦i={kxkSi\Si+1},i=0,1,,m.\mathcal{K}_{i}=\{k\in\mathbb{N}\mid x^{k}\in S_{i}\backslash S_{i+1}\},\quad i=0,1,\cdots,m.

By assumption (C.4), for every yBr(x)y\in B_{r}(x^{*}) with ySi+1y\in S_{i^{*}+1} we have

yxΓmax(x,yx¯)Γ(x)>r>yx,\|y-x^{*}\|\geq\Gamma_{\max}\left(x^{*},\overline{y-x^{*}}\right)\geq\Gamma(x^{*})>r>\|y-x^{*}\|,

which is a contradiction. Hence, it follows |𝒦i|=0|\mathcal{K}_{i}|=0 for every i>ii>i^{*}, i.e., Br(x)Si+1=B_{r}(x^{*})\cap S_{i^{*}+1}=\emptyset.

Set i_{0}=\max\{i=0,1,\cdots,i^{*}\mid|\mathcal{K}_{i}|=\infty\}. If i_{0}\leq i^{*}-1, then, since truncations of points in S_{i_{0}}\backslash S_{i_{0}+1} only happen a finite number of times, the set \{\Gamma(x^{k})\mid k\in\mathcal{K}_{i_{0}}\} has a positive lower bound, i.e., \beta:=\inf_{k\in\mathcal{K}_{i_{0}}}\Gamma(x^{k})>0. For k\in\mathcal{K}_{i_{0}}, condition (C.4) allows us to conclude that \Gamma(x^{k})\leq\Gamma_{\max}(x^{k},\overline{x^{*}-x^{k}})\leq\|x^{k}-x^{*}\|. Using x^{k}\rightarrow x^{*} (\mathcal{K}_{i_{0}}\ni k\rightarrow\infty), we can infer

0<βlim infk𝒦i0,kΓ(xk)limk𝒦i0,kxkx=0,0<\beta\leq\liminf_{k\in\mathcal{K}_{i_{0}},\ k\rightarrow\infty}\Gamma(x^{k})\leq\lim_{k\in\mathcal{K}_{i_{0}},\ k\rightarrow\infty}\|x^{k}-x^{*}\|=0,

which is a contradiction. Thus, we have i0=ii_{0}=i^{*}, which finishes the proof. \square

At the end of this subsection, we establish the local convergence rate by connecting our algorithm with a Riemannian trust-region method.

Theorem 5.6.

Suppose that the assumptions in Theorem 4.9 hold and that the conditions (C.1)–(C.5) are satisfied. Furthermore, if for some sufficiently small r(0,Γ(x))r\in(0,\Gamma(x^{*})) and every kk with xkBr(x)x^{k}\in B_{r}(x^{*})\cap\mathcal{M}, we choose gk=ψ(xk)g^{k}=\nabla_{\mathcal{M}}\psi(x^{k}), Bk=2ψ(xk)B^{k}=\nabla^{2}_{\mathcal{M}}\psi(x^{k}), and solve the trust-region subproblem exactly with solution skT(xk)s^{k}\in T_{\mathcal{M}}(x^{k}), then {xk}k\{x^{k}\}_{k} converges to xx^{*} q-quadratically.

Proof. We can assume \{x^{k}\}_{k}\subseteq B_{r}(x^{*})\cap\mathcal{M} for some sufficiently small r>0, which combined with (C.1) and the fact B_{r}(x^{*})\cap S_{i^{*}+1}=\emptyset implies that no truncation occurs. Since the trust-region subproblem is solved exactly in T_{\mathcal{M}}(x^{k}) and \mathcal{M} is affine, condition (C.2) can be utilized to show that the first acceptance test is always locally successful and hence the algorithm always skips the second acceptance mechanism. A detailed proof of this observation, which is also applicable in our situation, can be found in [58, Theorem 4.9]. We can then infer that our algorithm locally coincides with a Riemannian trust-region method or a classical trust-region method in the tangent space T_{\mathcal{M}}(x^{*}) and that the trust-region radius eventually becomes inactive. Thus, the local quadratic convergence rate follows from [2, Chapter 7] or [58, Theorem 4.9]. \square

Remark 5.7.

Lemmas 5.3 and 5.4 guarantee that we can set gk=ψ(xk)g^{k}=\nabla_{\mathcal{M}}\psi(x^{k}) and Bk=2ψ(xk)B^{k}=\nabla^{2}_{\mathcal{M}}\psi(x^{k}) without explicitly knowing \mathcal{M} and that there is a (globally optimal) solution of the trust-region subproblem located in T(xk)T_{\mathcal{M}}(x^{k}). This solution actually has the minimal 2\ell_{2}-norm among all solutions. Since gkg^{k} and BkB^{k} operate on the tangent space of the active manifold, some practical algorithms, such as the CG-Steihaug method, can indeed recover sks^{k} in T(xk)T_{\mathcal{M}}(x^{k}), which leads to xk+skx^{k}+s^{k}\in\mathcal{M}. Therefore, given xkBr(x)x^{k}\in B_{r}(x^{*})\cap\mathcal{M} for kk sufficiently large, we would have xk+1Br(x)x^{k+1}\in B_{r}(x^{*})\cap\mathcal{M}.
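The following MATLAB sketch of a Steihaug-type truncated CG method for the subproblem \min_{s}\langle g,s\rangle+\frac{1}{2}\langle s,Bs\rangle subject to \|s\|\leq\Delta is one way to see this claim: started at s=0, all iterates lie in the Krylov space generated by g^{k} and B^{k}, which is contained in T_{\mathcal{M}}(x^{k}) whenever g^{k}\in T_{\mathcal{M}}(x^{k}) and B^{k} maps T_{\mathcal{M}}(x^{k}) into itself. This is a generic sketch and not the exact implementation used in our experiments.

function s = cg_steihaug(g, Bfun, Delta, tol, maxit)
% Steihaug-type truncated CG sketch for min_s <g,s> + 0.5*<s,B*s> s.t. ||s|| <= Delta.
% Bfun(v) returns B*v; the iteration starts at s = 0.
    s = zeros(size(g));  r = g;  d = -r;
    if norm(r) <= tol, return; end
    for j = 1:maxit
        Bd  = Bfun(d);
        dBd = d'*Bd;
        if dBd <= 0                                   % negative curvature: move to the boundary
            s = s + tau_to_boundary(s, d, Delta)*d;  return;
        end
        alpha = (r'*r)/dBd;
        if norm(s + alpha*d) >= Delta                 % step would leave the trust region
            s = s + tau_to_boundary(s, d, Delta)*d;  return;
        end
        s    = s + alpha*d;
        rnew = r + alpha*Bd;
        if norm(rnew) <= tol, return; end
        d = -rnew + ((rnew'*rnew)/(r'*r))*d;
        r = rnew;
    end
end

function tau = tau_to_boundary(s, d, Delta)
% Positive root of ||s + tau*d||^2 = Delta^2.
    a = d'*d;  b = 2*(s'*d);  c = s'*s - Delta^2;
    tau = (-b + sqrt(b^2 - 4*a*c))/(2*a);
end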

6 Preliminary Numerical Results

In this section, we test the efficiency of our proposed nonsmooth trust-region method by applying it to convex and nonconvex 1\ell_{1}-minimization problems. All numerical experiments are performed in MATLAB R2020a on a laptop with Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz and 16GB memory.

We apply our framework to the 1\ell_{1}-minimization problem

minxnf(x)+μx1,\min_{x\in\mathbb{R}^{n}}~{}f(x)+\mu{\|x\|_{1}}, (6.1)

where f:nf:\mathbb{R}^{n}\rightarrow\mathbb{R} is a smooth function. Setting φ(x)=μx1\varphi(x)=\mu{\|x\|_{1}} and using (2.10), we choose

g(x)=λFnatλ(x):=λ[xproxφλ(xλ1f(x))],λ>0.g(x)=\lambda F_{\mathrm{nat}}^{\lambda}(x):=\lambda[x-\mathrm{prox}_{\varphi}^{\lambda}(x-\lambda^{-1}\nabla f(x))],\quad\lambda>0. (6.2)

Here, the proximity operator is given explicitly by (\mathrm{prox}_{\varphi}^{\lambda}(x))_{i}=\mathrm{sign}(x_{i})\max(|x_{i}|-\lambda^{-1}\mu,0). We now construct an element M(x)\in\partial\mathrm{prox}_{\varphi}^{\lambda}(x-\lambda^{-1}\nabla f(x)) as follows: M(x)\in\mathbb{R}^{n\times n} is a diagonal matrix with diagonal entries

(M(x))ii={1,if |(xλ1f(x))i|>λ1μ,0,otherwise.(M(x))_{ii}=\begin{cases}1,&\text{if }|(x-\lambda^{-1}\nabla f(x))_{i}|>\lambda^{-1}\mu,\\ 0,&\text{otherwise}.\end{cases}

Thus, J(x)=IM(x)(Iλ12f(x))J(x)=I-M(x)(I-\lambda^{-1}\nabla^{2}f(x)) is a possible generalized Jacobian of Fnatλ(x)F_{\mathrm{nat}}^{\lambda}(x). Let us define the index sets

(x):={i{1,2,,n}:|(xλ1f(x))i|>λ1μ},\mathcal{I}(x):=\{i\in\{1,2,\dots,n\}:|(x-\lambda^{-1}\nabla f(x))_{i}|>\lambda^{-1}\mu\},

and

𝒪(x):={i{1,2,,n}:|(xλ1f(x))i|λ1μ}.\mathcal{O}(x):=\{i\in\{1,2,\dots,n\}:|(x-\lambda^{-1}\nabla f(x))_{i}|\leq\lambda^{-1}\mu\}.

Then, J(x)J(x) can be written in an alternative format:

J(x)=(λ1(2f(x))(x)(x)λ1(2f(x))(x)𝒪(x)0I).J(x)=\begin{pmatrix}\lambda^{-1}(\nabla^{2}f(x))_{\mathcal{I}(x)\mathcal{I}(x)}&\lambda^{-1}(\nabla^{2}f(x))_{\mathcal{I}(x)\mathcal{O}(x)}\\ 0&I\end{pmatrix}. (6.3)

In the following, we choose Bk=λJ(xk)B^{k}=\lambda J(x^{k}). For simplicity, we do not check the condition (3.11).
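As an illustration, the following MATLAB sketch assembles g(x)=\lambda F_{\mathrm{nat}}^{\lambda}(x) from (6.2) and evaluates the matrix-vector product J(x)v via the index sets \mathcal{I}(x) and \mathcal{O}(x) from (6.3). The handles gradf and hessf_vec, returning \nabla f(x) and \nabla^{2}f(x)v, are assumptions of this sketch and have to be supplied by the user.

function [g, Jv, I] = l1_model_pieces(x, v, mu, lambda, gradf, hessf_vec)
% Sketch for the l1-setup: natural residual (6.2), index set I(x), and the
% action of J(x) from (6.3) on a vector v. gradf(x) and hessf_vec(x,v) are
% user-supplied handles for grad f(x) and (Hess f(x))*v.
    z  = x - gradf(x)/lambda;                         % forward point x - lambda^{-1} grad f(x)
    px = sign(z).*max(abs(z) - mu/lambda, 0);         % prox of mu*||.||_1 with parameter lambda
    g  = lambda*(x - px);                             % g(x) = lambda * F_nat^lambda(x)

    I  = abs(z) > mu/lambda;                          % index set I(x); O(x) is its complement
    Hv = hessf_vec(x, v);
    Jv = v;                                           % rows in O(x): identity block
    Jv(I) = Hv(I)/lambda;                             % rows in I(x): lambda^{-1} (Hess f(x) * v)
end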

6.1 The Lasso Problem

We first consider the Lasso problem where ff is a convex quadratic function

f(x)=12Axb2,f(x)=\frac{1}{2}\|Ax-b\|^{2},

and bmb\in\mathbb{R}^{m} and A=m×nA=\mathbb{R}^{m\times n} are given.

It can be shown that J(x) is positive semidefinite if \lambda is sufficiently large [75]. To solve the trust-region subproblem (3.3), and similarly to the method presented in Appendix B, we first choose a suitable regularization parameter t_{k}\geq 0 and solve the linear system

(Jk+tkI)pk=Fnatλ(xk),Jk=J(xk),(J^{k}+t_{k}I)p^{k}=-F_{\mathrm{nat}}^{\lambda}(x^{k}),\quad J^{k}=J(x^{k}), (6.4)

and then project pkp^{k} onto the trust region, i.e., sk=min{Δk,pk}p¯ks^{k}=\min\{\Delta_{k},\|p^{k}\|\}{\bar{p}^{k}}. Setting gk=g(xk)g^{k}=g(x^{k}), k=(xk)\mathcal{I}^{k}=\mathcal{I}(x^{k}), and 𝒪k=𝒪(xk)\mathcal{O}^{k}=\mathcal{O}(x^{k}), the linear system (6.4) is equivalent to

(1+tk)p𝒪kk=g𝒪kk,(λ1(ATA)kk+tkI)pkk+λ1(ATA)k𝒪kp𝒪kk=gkk,(1+t_{k})p^{k}_{\mathcal{O}^{k}}=-g^{k}_{\mathcal{O}^{k}},\quad(\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{I}^{k}}+t^{k}I)p^{k}_{\mathcal{I}^{k}}+\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{O}^{k}}p^{k}_{\mathcal{O}^{k}}=-g^{k}_{\mathcal{I}^{k}},

which leads to

p𝒪kk=1(1+tk)g𝒪kk,(λ1(ATA)kk+tkI)pkk=gkkλ1(ATA)k𝒪kp𝒪kk.p^{k}_{\mathcal{O}^{k}}=-\frac{1}{(1+t_{k})}g^{k}_{\mathcal{O}^{k}},\quad(\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{I}^{k}}+t^{k}I)p^{k}_{\mathcal{I}^{k}}=-g^{k}_{\mathcal{I}^{k}}-\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{O}^{k}}p^{k}_{\mathcal{O}^{k}}.

The second system is symmetric and can be much smaller than the original problem (6.4). It can be solved efficiently by applying the CG method.
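A possible MATLAB realization of this step is sketched below. For simplicity, A is passed as an explicit matrix (the experiments use the implicit dct-operator instead), I is a logical index vector for \mathcal{I}^{k}, and the CG tolerance and iteration limit are illustrative values rather than the ones used in our tests.

function s = lasso_tr_step(A, g, I, lambda, t, Delta)
% Sketch: solve (6.4) for the Lasso case by eliminating the O^k-block,
% applying CG to the reduced symmetric system on I^k, and projecting the
% result onto the trust region. Assumes the computed p is nonzero.
    O    = ~I;
    p    = zeros(size(g));
    p(O) = -g(O)/(1 + t);                             % (1 + t_k) p_O = -g_O

    AI  = A(:, I);  AO = A(:, O);                     % column blocks of A
    rhs = -g(I) - (AI'*(AO*p(O)))/lambda;             % right-hand side of the reduced system
    Hop = @(u) (AI'*(AI*u))/lambda + t*u;             % (lambda^{-1}(A'A)_{II} + t_k I) u
    p(I) = pcg(Hop, rhs, 1e-8, 200);                  % CG on the reduced (smaller) system

    s = min(Delta, norm(p))*(p/norm(p));              % s^k = min{Delta_k, ||p^k||} * pbar^k
end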

Our test framework follows [12, 55]:

  • A sparse solution \hat{x}\in\mathbb{R}^{n} with n=512^{2}=262144 is generated randomly with k=[n/40] nonzero entries. The positions of the nonzero components are chosen uniformly from \{1,2,\cdots,n\} and their values are given by \hat{x}_{i}=\eta_{1}(i)10^{d\eta_{2}(i)/20}. Here, \eta_{1}(i) and \eta_{2}(i) are distributed uniformly in \{-1,1\} and [0,1], respectively, and d denotes the dynamic range (in dB).

  • We randomly choose 𝒥{1,2,,n}\mathcal{J}\subseteq\{1,2,\cdots,n\} with |𝒥|=m=n/8=32768|\mathcal{J}|=m=n/8=32768. The linear operator A:nmA:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is then defined via Ax=(dct(x))𝒥Ax=(\texttt{dct}(x))_{\mathcal{J}} where dct denotes the discrete cosine transform.

  • We set b=Ax^+ϵb=A\hat{x}+\epsilon where ϵm\epsilon\in\mathbb{R}^{m} is Gaussian noise with covariance matrix σ^Im×m\hat{\sigma}I_{m\times m}, σ^=0.1\hat{\sigma}=0.1.
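A possible MATLAB realization of this test setup for one fixed dynamic range d is sketched below; dct and idct are taken from the Signal Processing Toolbox, and the rounding of n/40, the noise scaling, and the operator handles reflect our reading of the description above.

% Sketch of one Lasso test instance; d is the dynamic range in dB.
n = 512^2;  m = n/8;  k = round(n/40);  d = 20;  sig = 0.1;

xhat = zeros(n,1);
idx  = randperm(n, k);                                % support drawn uniformly at random
xhat(idx) = sign(randn(k,1)).*10.^(d*rand(k,1)/20);   % eta1(i) * 10^(d*eta2(i)/20)

J    = sort(randperm(n, m));                          % index set J defining the measurements
sel  = @(v, ind) v(ind);
Aop  = @(x) sel(dct(x), J);                           % A x   = (dct(x))_J
Atop = @(y) idct(accumarray(J(:), y, [n 1]));         % A^T y (dct is orthonormal)
b    = Aop(xhat) + sqrt(sig)*randn(m,1);              % Gaussian noise with covariance sig*I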

Given a tolerance \epsilon, we terminate whenever the condition \lambda\|F_{\mathrm{nat}}^{\lambda}(x)\|\leq\epsilon is satisfied. In the algorithm, \lambda is chosen adaptively to estimate the local Lipschitz constant of \nabla f, i.e., we set \lambda=\lambda_{k}=\max\{10^{-3},\min\{\|x^{k+1}-x^{k}\|/\|\nabla f(x^{k+1})-\nabla f(x^{k})\|,10^{3}\}\} whenever the step was successful. We compare our nonsmooth trust-region method (NTR) with the adaptive semi-smooth Newton (ASSN) method in [75] and the fast iterative shrinkage-thresholding algorithm (FISTA) [11] for different tolerances \epsilon\in\{10^{0},10^{-1},10^{-2},10^{-4},10^{-6}\} and dynamic ranges d\in\{20,40,60,80\}. We report the average CPU time (in seconds) as well as the average number of A- and A^{T}-calls N_{A} over 10 independent trials.

The numerical comparisons are shown in Tables 1-4. From these results we can see that the nonsmooth trust-region method outperforms the first-order method FISTA and is quite competitive with the second-order method ASSN. Even though the second acceptance test and the stepsize safeguard are required to guarantee theoretical convergence, we observe in the numerical experiments that, with suitably chosen parameters, our algorithm rarely or never fails the first acceptance test and hence rarely needs to invoke the stepsize safeguard mechanism, which prevents additional costs. The similar behavior of NTR and ASSN may stem from the fact that we utilize strategies similar to those in ASSN [75] for certain parameters. Our results on N_{A} are comparable with ASSN’s results and are sometimes better. Because each of our iterations involves potential acceptance tests and truncation steps, our method overall requires slightly more CPU time to converge than ASSN.

Table 1: Numerical results with dynamic range 20 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 0.8275 86.8 1.2226 132.8 1.5549 172 2.0399 227.4 2.4704 280.6
ASSN 0.7368 89.8 1.1409 145 1.3583 173 1.9094 246.4 2.2844 298.2
FISTA 0.5337 59 1.3959 153 3.2304 353.2 13.9021 1490.4 33.7451 3581.2
Table 2: Numerical results with dynamic range 40 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 1.7293 176.4 2.6661 280.4 3.1330 330.8 3.7463 402.2 4.1825 464.2
ASSN 1.5227 182.2 2.3414 285.4 2.7751 338.6 3.3216 407 3.6687 459.2
FISTA 2.2007 234.8 3.9217 418.4 7.7210 817 26.6733 2804.6 57.0018 5991.6
Table 3: Numerical results with dynamic range 60 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 2.9135 303.4 3.7794 398.4 4.4177 471.2 5.1089 562.4 5.6559 632
ASSN 2.4508 295.4 3.4378 416.4 4.0051 492 4.6492 582.4 5.1121 642.4
FISTA 5.9349 630.6 9.0296 951.6 14.7183 1548 39.5980 4164.2 79.6442 8355.4
Table 4: Numerical results with dynamic range 80 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 3.6748 411 5.3121 614 5.9883 702.2 6.5791 803.6 7.2349 868.8
ASSN 3.6290 482.8 4.5779 601 4.9879 690.6 5.7141 780.6 6.3010 865.4
FISTA 20.9653 2222.4 25.6927 2673.2 33.2373 3527 — — — —

6.2 Nonconvex Binary Classification

We consider a second, nonconvex binary classification problem [53, 73], where f is given as follows:

f(x)=1Ni=1N(1tanh(biaiTx)).f(x)=\frac{1}{N}\sum_{i=1}^{N}\left(1-\tanh(b_{i}\cdot a_{i}^{T}x)\right).

Here, the data points a_{i}\in\mathbb{R}^{n} and labels b_{i}\in\{\pm 1\} are taken from the datasets CINA (N=16033, n=132) and gisette (N=6000, n=5000). We set \mu=0.01.
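A possible MATLAB implementation of this loss, its gradient, and a corresponding Hessian-vector product (which could serve as the handle hessf_vec in the sketch of Section 6) is given below. Here, A denotes the N-by-n matrix whose rows are the a_{i}^{T} and b the label vector; this arrangement of the data and the derived Hessian-vector product are our own reading and not taken from the referenced implementations.

function [fval, grad, hessvec] = tanh_loss(x, A, b)
% f(x) = (1/N) * sum_i (1 - tanh(b_i * a_i'*x)), its gradient, and a
% Hessian-vector product handle; A is N-by-n with rows a_i', b in {-1,+1}^N.
    N    = numel(b);
    t    = tanh(b.*(A*x));                            % tanh(b_i * a_i' * x)
    fval = mean(1 - t);
    grad = -(A'*(b.*(1 - t.^2)))/N;                   % -(1/N) sum_i (1 - tanh^2) b_i a_i
    hessvec = @(v) (2/N)*(A'*((t.*(1 - t.^2)).*(A*v)));   % uses b_i^2 = 1
end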

Although positive semidefiniteness of JkJ^{k} is not guaranteed in the nonconvex case, we reuse the method described in Section 6.1 to solve the trust-region subproblem with tk=0t_{k}=0. The parameter λ\lambda is updated adaptively as before. We also use the same stopping criterion as in Section 6.1. We compare our nonsmooth trust-region method (NTR) with the stochastic semismooth Newton method with variance reduction (S4N-VR) [56] for different tolerances ϵ{100,101,102,104,106}\epsilon\in\{10^{0},10^{-1},10^{-2},10^{-4},10^{-6}\}.

The CPU time (in seconds) and the number of AA- and ATA^{T}-calls NAN_{A} are reported in Table 5 and Table 6, respectively. The results for S4N-VR are averaged over 10 independent trials while the results for NTR are based on one (deterministic) trial. While S4N-VR achieves slightly better results on CINA, NTR outperforms S4N-VR on the second dataset gisette.

Although the performance of NTR is still not perfect, our preliminary results underline that the proposed class of nonsmooth trust-region methods is promising and allows us to handle nonsmooth nonconvex optimization problems from a different perspective.

Table 5: Numerical results for CINA
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 0.03567 2.5995 0.08788 9.6904 0.2794 55.9480 0.2806 63.3571 0.3048 67.3268
S4N-VR 0.02843 2.0983 0.08124 6.0051 0.2090 25.8886 0.2422 28.3287 0.3722 45.6214
Table 6: Numerical results for gisette
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 0.3136 14.1908 0.5044 21.3416 0.6588 34.7592 0.9333 57.2328 1.0780 73.7344
S4N-VR 0.9177 6.8913 1.6120 20.4010 2.3202 32.6835 5.2769 84.0567 6.7880 121.4267

7 Conclusion

In this paper, we investigate a trust-region method for nonsmooth nonconvex optimization problems. In the proposed framework, the model functions are quadratic and are built from cheap, abstract descent directions, which allows us to construct the models at low cost and to apply standard algorithms for solving the resulting trust-region subproblems. We propose a novel combination of a stepsize safeguard, which ensures the accuracy of the model, and an additional truncation step, which enlarges the stepsize safeguard and accelerates convergence. We present a detailed discussion of the global convergence properties under suitable and mild assumptions. For composite-type problems, we also show that our method converges locally with a quadratic rate after finite identification of the active manifold when the nonsmooth part of the objective function is partly smooth. These results are established using a strict complementary condition and a connection between our algorithm and the standard Riemannian trust-region method. Preliminary numerical results demonstrate that the approach performs promisingly on a class of \ell_{1}-optimization problems.

References

  • [1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Found. Comput. Math., 7(3):303–330, 2007.
  • [2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ, 2008.
  • [3] Z. Akbari, R. Yousefpour, and M. Reza Peyghami. A new nonsmooth trust region algorithm for locally Lipschitz unconstrained optimization problems. J. Optim. Theory Appl., 164(3):733–754, 2015.
  • [4] P. Apkarian, D. Noll, and L. Ravanbod. Nonsmooth bundle trust-region algorithm with applications to robust stability. Set-Valued Var. Anal., 24(1):115–148, 2016.
  • [5] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Found. and Trends® in Mach. Learn., 4(1):1–106, 2011.
  • [6] C. G. Baker. Riemannian manifold trust-region methods with applications to eigenproblems. PhD thesis, Florida State University, 2008.
  • [7] C. G. Baker, P.-A. Absil, and K. A. Gallivan. An implicit Riemannian trust-region method for the symmetric generalized eigenproblem. pages 210–217. Springer, International Conference on Computational Science, 2006.
  • [8] C. G. Baker, P.-A. Absil, and K. A. Gallivan. An implicit trust-region method on Riemannian manifolds. IMA J. Numer. Anal., 28(4):665–689, 2008.
  • [9] T. Bannert. A trust region algorithm for nonsmooth optimization. Math. program., 67(1):247–264, 1994.
  • [10] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.
  • [11] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
  • [12] S. Becker, J. Bobin, and E. J. Candès. NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci., 4(1):1–39, 2011.
  • [13] N. Boumal and P.-A. Absil. RTRMC: A Riemannian trust-region method for low-rank matrix completion. pages 406–414. Advances in neural information processing systems, 2011.
  • [14] P. Breiding and N. Vannieuwenhoven. A Riemannian trust region method for the canonical tensor rank approximation problem. SIAM J. Optim., 28(3):2435–2465, 2018.
  • [15] J. Burke. On the identification of active constraints. II. The nonconvex case. SIAM J. Numer. Anal., 27(4):1081–1103, 1990.
  • [16] J. V. Burke and A. Engle. Line search and trust-region methods for convex-composite optimization. ArXiv:1806.05218, 2018.
  • [17] J. V. Burke and J. J. Moré. On the identification of active constraints. SIAM J. Numer. Anal., 25(5):1197–1211, 1988.
  • [18] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.
  • [19] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.
  • [20] C. Christof, J. C. De los Reyes, and C. Meyer. A nonsmooth trust-region method for locally Lipschitz functions with application to optimization problems constrained by variational inequalities. SIAM J. Optim., 30(3):2163–2196, 2020.
  • [21] C. Clason. Nonsmooth analysis and optimization. ArXiv:1708.04180, 2018.
  • [22] S. Cotter, B. Rao, K. Engan, and K. Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Trans. Signal Process., 53(7):2477–2488, 2005.
  • [23] A. Daniilidis, W. Hare, and J. Malick. Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization, 55(5-6):481–503, 2006.
  • [24] W. de Oliveira, C. Sagastizábal, and C. Lemaréchal. Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Math. Program., 148(1):241–277, 2014.
  • [25] R. De Sampaio, J.-Y. Yuan, and W.-Y. Sun. Trust region algorithm for nonsmooth optimization. Appl. Math. Comput., 85(2-3):109–116, 1997.
  • [26] J. E. Dennis, Jr., S.-B. B. Li, and R. A. Tapia. A unified approach to global convergence of trust region methods for nonsmooth optimization. Math. Program., 68(1):319–346, 1995.
  • [27] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
  • [28] R. Fletcher. A model algorithm for composite nondifferentiable optimization problems. In Nondifferential and Variational Techniques in Optimization, pages 67–76. Springer, 1982.
  • [29] R. Fletcher. Second order corrections for non-differentiable optimization. In Numerical analysis, pages 85–114. Springer, 1982.
  • [30] M. Fukushima and H. Mine. A generalized proximal point algorithm for certain nonconvex minimization problems. Internat. J. Systems Sci., 12(8):989–1000, 1981.
  • [31] M. Gangeh, A. Farahat, A. Ghodsi, and M. Kamel. Supervised dictionary learning and sparse representation – A review. ArXiv:1502.05928, 2015.
  • [32] R. Garmanjani, D. Júdice, and L. N. Vicente. Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. SIAM J. Optim., 26(4):1987–2011, 2016.
  • [33] G. N. Grapiglia, J. Yuan, and Y.-x. Yuan. A derivative-free trust-region algorithm for composite nonsmooth optimization. Comp. Appl. Math., 35(2):475–499, 2016.
  • [34] P. Grohs and S. Hosseini. Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds. IMA J. Numer. Anal., 36(3):1167–1192, 2016.
  • [35] W. Hare, C. Sagastizábal, and M. Solodov. A proximal bundle method for nonsmooth nonconvex functions with inexact information. Comput. Optim. Appl., 63(1):1–28, 2016.
  • [36] W. L. Hare and A. S. Lewis. Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal., 11(2):251–266, 2004.
  • [37] T. Hastie, R. Mazumder, J. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res., 16:3367–3402, 2015.
  • [38] W. Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian symmetric rank-one trust-region method. Math. Program., Ser. A, 150(2):179–216, 2015.
  • [39] N. Karmitsa, A. Bagirov, and M. M. Mäkelä. Comparing different nonsmooth minimization methods and software. Optim. Methods Softw., 27(1):131–153, 2012.
  • [40] K. C. Kiwiel. An ellipsoid trust region bundle method for nonsmooth convex minimization. SIAM J. Control Optim., 27(4):737–757, 1989.
  • [41] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale 1\ell_{1}-regularized logistic regression. J. Mach. Learn. Res., 8:1519–1555, 2007.
  • [42] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Math. Program., 149(1):1–45, 2015.
  • [43] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim., 24(3):1420–1443, 2014.
  • [44] C. Lemaréchal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Math. Program., 69(1, Ser. B):111–147, 1995. Nondifferentiable and large-scale optimization (Geneva, 1992).
  • [45] C. Lemaréchal and J. Zowe. A condensed introduction to bundle methods in nonsmooth optimization. In Algorithms for continuous optimization (Il Ciocco, 1993), volume 434 of NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., pages 357–382. Kluwer Acad. Publ., Dordrecht, 1994.
  • [46] A. S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (2003), 2002.
  • [47] X. Li, D. Sun, and K.-C. Toh. A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM J. Optim., 28(1):433–458, 2018.
  • [48] J. Liang, J. Fadili, and G. Peyré. Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim., 27(1):408–437, 2017.
  • [49] N. Mahdavi-Amiri and R. Yousefpour. An effective nonsmooth optimization algorithm for locally Lipschitz functions. J. Optim. Theory Appl., 155(1):180–195, 2012.
  • [50] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. Proceedings of the 26th Int. Conf. on Mach. Learn., pages 689–696, 2009.
  • [51] M. M. Mäkelä. Survey of bundle methods for nonsmooth optimization. Optim. Methods Softw., 17(1):1–29, 2002.
  • [52] M. M. Mäkelä and P. Neittaanmäki. A survey of bundle methods. In Nonsmooth Optimization, pages 97–111. World Scientific Publishing Co. Pte. Ltd., 1992.
  • [53] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. In Proc. NIPS, volume 12, pages 512–518, 1999.
  • [54] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(1):53–71, 2008.
  • [55] A. Milzarek and M. Ulbrich. A semismooth Newton method with multidimensional filter globalization for l1l_{1}-optimization. SIAM J. Optim., 24(1):298–333, 2014.
  • [56] A. Milzarek, X. Xiao, S. Cen, Z. Wen, and M. Ulbrich. A stochastic semismooth Newton method for nonsmooth nonconvex optimization. SIAM J. Optim., 29(4):2916–2948, 2019.
  • [57] Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140(1):125–161, 2013.
  • [58] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, second edition, 2006.
  • [59] D. Noll. Cutting plane oracles to minimize non-smooth non-convex functions. Set-Valued Var. Anal., 18(3-4):531–568, 2010.
  • [60] D. Noll. Bundle method for non-convex minimization with inexact subgradients and function values. In Computational and analytical mathematics, volume 50 of Springer Proc. Math. Stat., pages 555–592. Springer, New York, 2013.
  • [61] P. Patrinos and A. Bemporad. Proximal Newton methods for convex composite optimization. Proceedings of the 52nd IEEE Conf. on Decision and Control, pages 2358–2363, 2013.
  • [62] P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton methods for convex composite optimization. ArXiv:1402.6655, 2014.
  • [63] C. Poon, J. Liang, and C. Schoenlieb. Local convergence properties of SAGA/Prox-SVRG and acceleration. Proceedings of the 35th Int. Conf. on Mach. Learn., 80:4124–4132, 2018.
  • [64] L. Q. Qi and J. Sun. A trust region algorithm for minimization of locally Lipschitzian functions. Math. Program., 66(1):25–43, 1994.
  • [65] S. M. Robinson. Normal maps induced by linear transformations. Math. Oper. Res., 17(3):691–714, 1992.
  • [66] N. Sagara and M. Fukushima. A trust region method for nonsmooth convex optimization. J. Ind. Manag. Optim., 1(2):171–180, 2005.
  • [67] H. Schramm and J. Zowe. A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM J. Optim., 2(1):121–152, 1992.
  • [68] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
  • [69] L. Stella, A. Themelis, and P. Patrinos. Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl., 67(3):443–487, 2017.
  • [70] L. Sun, J. Liu, J. Chen, and J. Ye. Efficient recovery of jointly sparse vectors. Advances in Neural Information Processing Systems, (NIPS), 23, 2009.
  • [71] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.
  • [72] J. Wang and T. Zhang. Utilizing second order information in minibatch stochastic variance reduced proximal iterations. J. Mach. Learn. Res., 20(42):1–56, 2019.
  • [73] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim., 27(2):927–956, 2017.
  • [74] S. J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993.
  • [75] X. Xiao, Y. Li, Z. Wen, and L. Zhang. A regularized semi-smooth Newton method with projection steps for composite convex programs. J. Sci. Comput., 76(1):364–389, 2018.
  • [76] E. Yamakawa, M. Fukushima, and T. Ibaraki. An efficient trust region algorithm for minimizing nondifferentiable composite functions. SIAM J. Sci. and Stat. Comput., 10(3):562–580, 1989.
  • [77] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49–67, 2006.
  • [78] Y.-x. Yuan. Conditions for convergence of trust region algorithms for nonsmooth optimization. Math. Program., 31(2):220–228, 1985.
  • [79] Y.-x. Yuan. On the superlinear convergence of a trust region algorithm for nonsmooth optimization. Math. Program., 31(3):269–285, 1985.
  • [80] Y.-x. Yuan. Recent advances in trust region algorithms. Math. Program., 151(1):249–281, 2015.

Appendix A Some supplementary examples

Example A.1 (Expressing F_{\mathrm{nor}}^{\Lambda}(\tau(x))).

We consider the following examples:

  • (i)

    Group lasso: In this setting, X=(X_{1},\cdots,X_{n_{2}}) is a matrix in \mathbb{R}^{n_{1}\times n_{2}}, \psi=f+\varphi, \varphi(X)=\sum_{i=1}^{n_{2}}\left\|X_{i}\right\|_{2}, and \Lambda=\mathrm{diag}(\lambda_{1}I_{n_{1}},\lambda_{2}I_{n_{1}},\cdots,\lambda_{n_{2}}I_{n_{1}}), where \Lambda is understood by identifying X with the vector X=(X_{1}^{T},\cdots,X_{n_{2}}^{T})^{T}; a code sketch of the formulas below is given after this example. In this case, we obtain

    \partial\varphi(X)=\left\{W=(W_{1},\cdots,W_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}}:W_{i}\begin{cases}\in B_{1}(0)&\text{if }X_{i}=0,\\ =X_{i}/\|X_{i}\|&\text{if }X_{i}\neq 0,\end{cases}\quad\forall~i=1,2,\cdots,n_{2}\right\},

    and

    F_{\mathrm{nor}}^{\Lambda}(\tau(X))_{i}=\begin{cases}\nabla f(X)_{i}+X_{i}/\left\|X_{i}\right\|&\text{if }X_{i}\neq 0,\\ \nabla f(X)_{i}-\mathbf{P}_{B_{1}(0)}\left(\nabla f(X)_{i}\right)&\text{if }X_{i}=0,\end{cases}\quad\forall~i=1,2,\cdots,n_{2}.
  • (ii)

    \ell_{\infty}-optimization: \psi=f+\varphi, \varphi(x)=\|x\|_{\infty} and \Lambda=\lambda I. Using the dual characterization \|x\|_{\infty}=\max_{\|y\|_{1}\leq 1}x^{T}y, we have

    \partial\varphi(x)=\{w\in\mathbb{R}^{n}:\|w\|_{1}\leq 1,\ w^{T}x=\|x\|_{\infty}\},

    and F_{\mathrm{nor}}^{\Lambda}(\tau(x)) can be computed using (2.8).
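
To make item (i) above concrete, the following is a minimal NumPy sketch of the column-wise group-lasso formula; the function name F_nor_group_lasso and the argument grad_f (the gradient \nabla f(X), supplied by the caller as an n_{1}\times n_{2} array) are our own illustrative choices and not part of the paper's notation.

    import numpy as np

    def F_nor_group_lasso(X, grad_f):
        """Column-wise evaluation of the group-lasso formula from item (i);
        grad_f is assumed to hold the gradient of f at X (same shape as X)."""
        F = np.zeros(X.shape)
        for i in range(X.shape[1]):
            Xi, Gi = X[:, i], grad_f[:, i]
            ni = np.linalg.norm(Xi)
            if ni > 0:
                # nonzero group: add the (unique) subgradient X_i / ||X_i||
                F[:, i] = Gi + Xi / ni
            else:
                # zero group: subtract the Euclidean projection of grad f(X)_i onto B_1(0)
                F[:, i] = Gi - Gi / max(1.0, np.linalg.norm(Gi))
        return F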

Example A.2 (Calculating \Gamma(x)).

We have (see also the code sketch following this example):

  • (i)

    Group lasso: For X=(X_{1},\cdots,X_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}}, \psi=f+\varphi, and \varphi(X)=\sum_{i=1}^{n_{2}}\|X_{i}\|_{2}, it holds that \Gamma(X)=\min\{\|X_{i}\|:X_{i}\neq 0\}.

  • (ii)

    \ell_{\infty}-optimization: Let us set \psi=f+\varphi, \varphi(x)=\|x\|_{\infty}, and S:=\{i\in\{1,2,\cdots,n\}\mid|x_{i}|\neq\|x\|_{\infty}\}. Then, it holds that

    \Gamma(x)=\begin{cases}\|x\|_{\infty}-\max_{i\in S}|x_{i}|&\text{if }S\neq\emptyset,\\ 2\|x\|_{\infty}&\text{if }S=\emptyset.\end{cases}
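
Both formulas translate directly into code. The following minimal NumPy sketch is ours; the return value np.inf for a matrix without nonzero columns is merely a convention for this illustration, since \Gamma is only evaluated at nonzero points.

    import numpy as np

    def gamma_group_lasso(X):
        """Gamma(X) = min{ ||X_i|| : X_i != 0 } from item (i)."""
        norms = np.linalg.norm(X, axis=0)          # column norms ||X_i||
        nonzero = norms[norms > 0]
        return nonzero.min() if nonzero.size else np.inf   # convention if all columns vanish

    def gamma_linf(x):
        """Gamma(x) for phi(x) = ||x||_inf from item (ii)."""
        xmax = np.linalg.norm(x, np.inf)
        below = np.abs(x)[np.abs(x) != xmax]       # values |x_i| over the index set S
        return xmax - below.max() if below.size else 2.0 * xmax
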
Example A.3 (Truncation).

We consider some additional functions that can be truncated (a code sketch of the corresponding truncation operators is given after this example):

  • (i)

    Group lasso: For X=(X_{1},\cdots,X_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}}, \psi=f+\varphi, and \varphi(X)=\sum_{i=1}^{n_{2}}\|X_{i}\|_{2}, we can set S_{i}=\{X\in\mathbb{R}^{n_{1}\times n_{2}}:\mathrm{card}\{j=1,2,\cdots,n_{2}\mid X_{j}=0\}\geq i\} for i=1,2,\cdots,n_{2}, m=n_{1}, \delta=+\infty, \kappa=\sqrt{n_{2}}, and T(X,a)\in\mathbb{R}^{n_{1}\times n_{2}} is defined column-wise

    T(X,a)_{j}={\mathbbm{1}}_{\|\cdot\|\geq a}(X_{j})\cdot X_{j}\quad j=1,2,\cdots,n_{2}.
  • (ii)

    \ell_{\infty}-optimization: For \psi=f+\varphi and \varphi(x)=\|x\|_{\infty}, we can set S_{i}=\{x\in\mathbb{R}^{n}\mid\mathrm{card}\{j=1,2,\cdots,n\mid x_{j}=\|x\|_{\infty}\}\geq i+1\} for i=0,1,\cdots,n-1, S_{n}=\{0\}, m=n, \delta=+\infty, and \kappa=\sqrt{n}. As for T(x,a)\in\mathbb{R}^{n}, if x\in S_{n-1}, it is defined via

    T(x,a)={\mathbbm{1}}_{\|\cdot\|_{\infty}\geq\frac{a}{2}}(x)\cdot x;

    otherwise it is defined component by component via

    T(x,a)_{j}=x_{j}+{\mathbbm{1}}_{|\cdot|>\|x\|_{\infty}-a}(x_{j})\,\mathrm{sgn}(x_{j})(\|x\|_{\infty}-|x_{j}|),\quad j=1,2,\cdots,n.
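
The two truncation operators can be sketched as follows (NumPy; the function names are ours). The group-lasso map zeroes out every column whose norm falls below a, while the \ell_{\infty} map either truncates x as a whole (when x\in S_{n-1}) or lifts each entry within distance a of the maximal modulus to \pm\|x\|_{\infty}.

    import numpy as np

    def truncate_group_lasso(X, a):
        """T(X, a) from item (i): zero out all columns with ||X_j|| < a."""
        keep = np.linalg.norm(X, axis=0) >= a
        return X * keep                            # broadcast the 0/1 column mask

    def truncate_linf(x, a):
        """T(x, a) from item (ii)."""
        x = np.asarray(x, dtype=float)
        xmax = np.linalg.norm(x, np.inf)
        if np.all(x == xmax):                      # x in S_{n-1}: every entry equals ||x||_inf
            return x.copy() if xmax >= a / 2 else np.zeros_like(x)
        out = x.copy()
        idx = np.abs(x) > xmax - a                 # entries with |x_j| > ||x||_inf - a
        out[idx] = np.sign(x[idx]) * xmax          # equals x_j + sgn(x_j)(||x||_inf - |x_j|)
        return out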

Appendix B The Solution of the Trust-region Subproblem

In this section, we briefly discuss how to recover a solution s^{k} satisfying (3.6) and (3.7). If the second-order information B^{k} is symmetric, the subproblem (3.3) coincides with the classical trust-region subproblem and can be solved by standard methods, such as the CG-Steihaug method [58, Algorithm 7.2]. However, due to the nonsmoothness, the second-order information B^{k} might be asymmetric; for example, the Jacobian of g(x)=F_{\mathrm{nat}}^{\Lambda}(x) can be asymmetric. In this case, we can simply replace B^{k} with its symmetrized version \frac{1}{2}[B^{k}+(B^{k})^{T}] and then employ the CG-Steihaug method.

If the matrix B^{k} is positive semidefinite (but possibly nonsymmetric), i.e., \langle h,B^{k}h\rangle\geq 0 for all h\in\mathbb{R}^{n}, we can still solve (3.3) without symmetrization. We first choose a suitable regularization parameter t_{k}\geq 0 such that

\frac{1}{2}h^{T}B^{k}h+t_{k}\|h\|^{2}\geq\lambda_{1}\|h\|^{2}\quad\forall\ h\in\mathbb{R}^{n}\quad\text{ and }\quad\|B^{k}+t_{k}I\|\leq\lambda_{2}, (B.1)

where \lambda_{1},\lambda_{2}>0 are constants chosen independently of k. We then consider the linear system

(B^{k}+t_{k}I)p=-g^{k} (B.2)

and solve it to get an approximate solution p^{k} satisfying

(B^{k}+t_{k}I)p^{k}=-g^{k}+r^{k}\quad\text{and}\quad\|r^{k}\|\leq\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\|g^{k}\|, (B.3)

where r^{k} denotes the residual. Finally, we project p^{k} onto the trust region, i.e., with \bar{p}^{k}:=p^{k}/\|p^{k}\|, we set

s^{k}=\min\{\Delta_{k},\|p^{k}\|\}\,\bar{p}^{k}. (B.4)
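
For illustration, the step (B.2)–(B.4) can be sketched as follows (NumPy; the function name trust_region_step is ours). For simplicity, the regularized system is solved exactly, so that r^{k}=0 and (B.3) holds trivially; in practice, an inexact iterative solve meeting the residual bound in (B.3) suffices.

    import numpy as np

    def trust_region_step(B, g, t, Delta):
        """Compute s^k via (B.2)-(B.4).

        B     : the (possibly nonsymmetric) positive semidefinite matrix B^k
        g     : the model gradient g^k
        t     : the regularization parameter t_k from (B.1)
        Delta : the trust-region radius Delta_k"""
        n = g.shape[0]
        p = np.linalg.solve(B + t * np.eye(n), -g)   # exact solve of (B.2); residual r^k = 0
        norm_p = np.linalg.norm(p)
        if norm_p == 0.0:                            # only happens if g^k = 0
            return p
        return min(Delta, norm_p) * (p / norm_p)     # projection step (B.4)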

The next lemma shows that s^{k} generated by (B.2)–(B.4) satisfies condition (3.6) for some \gamma_{1},\gamma_{2}>0.

Lemma B.1.

Suppose that B^{k} is positive semidefinite and that (B.1) holds for some constants \lambda_{1},\lambda_{2}>0. Then condition (3.6) holds with \gamma_{1}=\frac{\lambda_{1}}{2\lambda_{2}} and \gamma_{2}=\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})} if s^{k} is given by (B.2)–(B.4).

Proof. Using the estimate

\begin{split}\|p^{k}\|&=\|(B^{k}+t_{k}I)^{-1}(g^{k}-r^{k})\|\geq\frac{\|g^{k}-r^{k}\|}{\|B^{k}+t_{k}I\|}\geq\frac{1}{\lambda_{2}}\|g^{k}-r^{k}\|\\&\geq\frac{1}{\lambda_{2}}\left(\|g^{k}\|-\|r^{k}\|\right)\geq\frac{1}{\lambda_{2}}\left(1-\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\right)\|g^{k}\|=\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})}\|g^{k}\|\end{split}

and the positive semidefiniteness of B^{k}, we obtain:

\begin{split}m(0)-m(s^{k})&\geq\frac{\min\{\Delta_{k},\|p^{k}\|\}}{\|p^{k}\|}\left[m(0)-m(p^{k})\right]\\
&=\frac{\min\{\Delta_{k},\|p^{k}\|\}}{\|p^{k}\|}\left[(p^{k})^{T}(B^{k}+t_{k}I)p^{k}-(r^{k})^{T}p^{k}-\frac{1}{2}(p^{k})^{T}B^{k}p^{k}\right]\\
&\geq\frac{\min\{\Delta_{k},\|p^{k}\|\}}{\|p^{k}\|}\left[\frac{1}{2}(p^{k})^{T}B^{k}p^{k}+t_{k}\|p^{k}\|^{2}-\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\|g^{k}\|\|p^{k}\|\right]\\
&\geq\lambda_{1}\|p^{k}\|\min\{\Delta_{k},\|p^{k}\|\}-\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\|g^{k}\|\min\{\Delta_{k},\|p^{k}\|\}\\
&\geq\frac{\lambda_{1}}{2\lambda_{2}}\|g^{k}\|\min\left\{\Delta_{k},\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})}\|g^{k}\|\right\}.\end{split}

Thus, (3.6) is satisfied for \gamma_{1}=\frac{\lambda_{1}}{2\lambda_{2}} and \gamma_{2}=\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})}. \square

As for the other condition (3.7), we can simply set

s^{k}=\begin{cases}\text{the solution given by (B.2)–(B.4)}&\text{if }\Delta_{k}\geq\zeta,\\ s^{k}_{C}&\text{if }\Delta_{k}<\zeta\end{cases}\quad\text{and}\quad\ell(\Delta)=\begin{cases}0&\text{if }\Delta<\zeta,\\ 1&\text{if }\Delta\geq\zeta,\end{cases} (B.5)

where \zeta>0 is a constant. We immediately obtain (3.7).
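
Schematically, the rule (B.5) can be implemented as follows, reusing trust_region_step from the sketch after (B.4); the Cauchy step s^{k}_{C} is assumed to be provided by a separate routine cauchy_step (a placeholder name and signature of our choosing), and the pair (s^{k},\ell(\Delta_{k})) is returned.

    def step_with_safeguard(B, g, t, Delta, zeta, cauchy_step):
        """Rule (B.5): use the step (B.2)-(B.4) when Delta >= zeta and fall back
        to the Cauchy step s^k_C otherwise; the second return value is l(Delta)."""
        if Delta >= zeta:
            return trust_region_step(B, g, t, Delta), 1
        return cauchy_step(B, g, Delta), 0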