This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Trust-Region Method for Nonsmooth Nonconvex Optimization

Ziang Chen
Department of Mathematics, Duke University, USA
and
Andre Milzarek
School of Data Science (SDS), The Chinese University of Hong Kong, Shenzhen,
Shenzhen Research Institute for Big Data (SRIBD), CHINA
and
Zaiwen wen
Beijing International Center for Mathematical Research, Peking University, CHINA
Email: [email protected]Email: [email protected]. A. Milzarek is partly supported by the Fundamental Research Fund – Shenzhen Research Institute for Big Data (SRIBD) Startup Fund JCYJ-AM20190601.Email: [email protected]. Z. Wen is partly supported by the NSFC grant 11831002, and the Beijing Academy of Artificial Intelligence.
Abstract

We propose a trust-region type method for a class of nonsmooth nonconvex optimization problems where the objective function is a summation of a (probably nonconvex) smooth function and a (probably nonsmooth) convex function. The model function of our trust-region subproblem is always quadratic and the linear term of the model is generated using abstract descent directions. Therefore, the trust-region subproblems can be easily constructed as well as efficiently solved by cheap and standard methods. When the accuracy of the model function at the solution of the subproblem is not sufficient, we add a safeguard on the stepsizes for improving the accuracy. For a class of functions that can be “truncated”, an additional truncation step is defined and a stepsize modification strategy is designed. The overall scheme converges globally and we establish fast local convergence under suitable assumptions. In particular, using a connection with a smooth Riemannian trust-region method, we prove local quadratic convergence for partly smooth functions under a strict complementary condition. Preliminary numerical results on a family of 1\ell_{1}-optimization problems are reported and demonstrate the efficiency of our approach.

Keywords: trust-region method, nonsmooth composite programs, quadratic model function, global and local convergence.

1 Introduction

We consider the unconstrained nonsmooth nonconvex optimization problems of the composite form:

minxnψ(x):=f(x)+φ(x),\min_{x\in\mathbb{R}^{n}}\psi(x):=f(x)+\varphi(x), (1.1)

where f:nf:\mathbb{R}^{n}\rightarrow\mathbb{R} is a continuously differentiable but probably nonconvex function and φ:n\varphi:\mathbb{R}^{n}\rightarrow\mathbb{R} is real-valued and convex. The composite program (1.1) is a special form of the general nonsmooth nonconvex optimization problems

minxnψ(x),\min_{x\in\mathbb{R}^{n}}\psi(x), (1.2)

where the objective function ψ:n\psi:\mathbb{R}^{n}\to\mathbb{R} is locally Lipschitz continuous, and has numerous applications, such as 1\ell_{1}-regularized problems [71, 68, 27, 41, 12], group sparse problems [22, 77, 54, 70], penalty approaches [5], dictionary learning [50, 31], and matrix completion [19, 18, 37].

1.1 Related Work

Different types of nonsmooth trust-region methods have already been proposed and analyzed for the general optimization problem (1.2) throughout the last two decades. Several of these nonsmooth trust-region methods utilize abstract model functions on a theoretical level which means that the model function is typically not specified. In [26], a nonsmooth trust-region method is proposed for (1.2) under the assumption that ψ\psi is regular. A nonsmooth trust-region algorithm for general problems is investigated in [64]. In this work, an abstract first-order model is considered that is not necessarily based on subgradient information or directional derivatives. Extending the results in [64], Grohs and Hosseini propose a Riemannian trust-region algorithm, see [34]. Here, the objective function is defined on a complete Riemannian manifold. All mentioned methods derive global convergence under an assumption similar to the concept of a “strict model” stated in [59]. Using this concept, a nonsmooth bundle trust-region algorithm with global convergence is constructed in [4].

In [20], a hybrid approach is presented using simpler and more tractable quadratic model functions. The method switches to a complicated second model if the quadratic model is not accurate enough and if it is strictly necessary. In [3], a quadratic model function is analyzed where the first-order term is derived from a suitable approximation of the steepest descent direction and the second-order term is updated utilizing a BFGS scheme. The authors apply an algorithmic approach proposed in [49] to compute the approximation of the ϵ\epsilon-subdifferential and steepest descent direction. Another class of methods employs smoothing techniques. In [32], the authors first present a smooth trust-region method without using derivatives, and then, in the nonsmooth case, use this methodology after smoothing the objective function. Furthermore, trust-region algorithms for nonsmooth problems can be developed based on smooth merit functions for the problem. In [66], a nonsmooth convex optimization is investigated and the Moreau envelope is considered as a smooth merit function. A smooth trust-region method is performed on the smooth merit function, where the second-order term of the model function is again updated by the BFGS formula.

Bundle methods are an important and related class of methods for nonsmooth problems [44, 52, 51, 39, 45, 42, 40]. The ubiquitous cutting-plane model in bundle methods is polyhedral, i.e., the supremum of a finite affine family. This model builds approximations of convex functions based on the subgradient inequality. In [67], an efficient bundle technique for convex optimization has been proposed; in [24], a convex bundle method is derived to deal with additional noise, i.e., the case when the objective function and the subgradient can not be evaluated exactly. Different modifications of the bundle ideas for nonconvex problems have been established in [59, 67]. In [35] and [60], the authors consider bundle methods for nonsmooth nonconvex optimization when the function values and the subgradients of ψ\psi can only be evaluated inexactly.

Local convergence properties and rates for nonsmooth problems are typically studied utilizing additional and more subtle structures. In this regard, some fundamental and helpful concepts are the idea of an “active manifold” and the family of “partly smooth” functions introduced by [46]. In particular, the problem (1.1) has been investigated when the nonsmooth term φ\varphi is partly smooth relative to a smooth active manifold. The so-called finite activity identification is established for forward-backward splitting methods by [48] and, more recently, for SAGA/Prox-SVRG by [63]. After the identification, those algorithms enter a locally linear convergence regime. In [36], the authors use partial smoothness and prox-regularity to identify the active constraints after finitely many iterations, which is an extension of other works on finite constraint identification, see [17, 15, 74]. After identifying the active manifold, the nonsmooth problem may become a smooth optimization on a Riemannian manifold. Some algorithms and analysis for Riemannian trust-region methods were studied in the literature; see e.g., [2, 1, 38, 8, 6]. There are numerous applications of Riemannian trust-region methods, such as eigenproblems [6, 7], low-rank matrix completion [13], and tensor problems [14].

We note that for composite programs, nonsmooth trust-region methods have also been studied in the literature, such as [28, 29, 79, 76, 33] for ming(x)+h(f(x))\min\ g(x)+h(f(x)) where ff and gg are smooth and hh is convex, [78, 9] for minh(f(x))\min\ h(f(x)) where ff is smooth and hh is convex, [25] for minh(f(x))\min\ h(f(x)) where ff is locally Lipschitz and hh is smooth and convex and [16] for ming(x)+h(f(x))\min\ g(x)+h(f(x)) where ff is smooth and gg and hh are convex. For the problem (1.1), there are also other efficient methods, such as gradient-type methods [30, 57], semismooth Newton methods, [55, 47, 75], proximal Newton methods [61, 43, 72], or forward-backward envelope-based (quasi-)Newton methods [62, 69].

More developments about trust-region methods may be found in review papers such as [80].

1.2 Our Contribution

In this work, we propose and investigate a trust-region method for nonsmooth composite programs. The approach utilizes quadratic model functions to approximate the underlying nonsmooth objective function. This methodology leads to classical and tractable trust-region subproblems that can be solved efficiently by standard optimization methods if the second-order information is symmetric. We also discuss an efficient subproblem solver in the case that the second-order information does not stem from a symmetric matrix. The linear part of our proposed quadratic model can be based on the steepest descent direction or other directions such as proximal gradient-type descent directions. Our algorithm contains the following steps: computation of a model function and (approximate) solution of the associated trust-region subproblem; acceptance test of the calculated step; determination of a suitable stepsize by a cheap method followed by some stepsize safeguards and a second acceptance test (if the first test is not successful); update of the trust-region radius and a modification step via a novel truncation mechanism.

In order to control the approximation error between the quadratic model function and the nonsmooth objective function, we define a stepsize safeguard strategy that tries to avoid points along a specific direction at which the directional derivative is not continuous. Specifically, this strategy tries to guarantee that the objective function is directionally differentiable along a specific direction. Since a direct implementation of such a strategy can yield arbitrarily small stepsizes, we consider functions which can be truncated and propose an additional truncation step that allows to enlarge the stepsize. This modification is an essential and new part in our global convergence theory. We verify that the family of functions that can be truncated is rich and contains many important examples, such as the 1\ell_{1}-norm, \ell_{\infty}-norm or group sparse-type penalty terms. Moreover, we provide a detailed global convergence analysis of the proposed trust-region framework. In particular, we show that every accumulation point of a sequence generated by our algorithm is a stationary point. Global convergence of nonsmooth trust-region methods typically requires a certain uniform accuracy assumption on the model which coincides with the concept of the already mentioned strict model proposed by [59]. Our assumptions are similar to these standard requirements and can be verified for a large family containing polyhedral problems and group lasso. Furthermore, we also show how a strict model – aside from utilizing the original objective function – can be constructed.

We analyze the local properties of the nonsmooth trust-region method for (1.1) when φ\varphi is a partly smooth function. In particular, it is possible to establish quadratic convergence of our approach in this case. We assume that the underlying manifold is an affine subspace and that a strict complementary condition holds. After the finite activity identification, we transfer our problem to a smooth problem in the affine subspace by proving that an appropriate choice of the first-order and the second-order model coincides with the Riemannian framework. Results from Riemannian trust-region theory can then be applied to derive local quadratic convergence. Additionally, if the nonsmooth term is polyhedral, it can be shown that the Riemannian Hessian can be computed without knowing the underlying manifold.

1.3 Organization

The rest of this paper is organized as follows. In Section 2, descent directions and several properties of ψ\psi and φ\varphi used in our algorithm are discussed. In Section 3, we present the nonsmooth trust-region framework. In Section 4, the global convergence of our method is established. In Section 5, we show fast local convergence by studying the nonsmooth composite program for partly smooth φ\varphi. Some preliminary numerical experiments are presented in Section 6.

2 Descent Directions and Truncation Operators

In classical trust-region methods, global convergence is established under fairly mild assumptions on the second-order term of the model while the first-order model typically needs to capture the whole gradient information of the objective function, see, e.g., [58] and the references therein. This underlines the importance of the first-order information in the trust-region method. In this section, we analyze properties of ψ\psi and φ\varphi as preparation for the construction of suitable linear first-order models.

2.1 Preliminaries

In this work, the expression \|\cdot\| denotes the 2\ell_{2}-norm and the Frobenius norm for vectors and matrices, respectively. For xnx\in\mathbb{R}^{n} and r>0r>0, Br(x):={yn:yx<r}B_{r}(x):=\{y\in\mathbb{R}^{n}:\|y-x\|<r\} denotes the open ball with radius rr around xx. Let Λ\Lambda be a given symmetric positive definite matrix. The proximal operator is defined via

proxφΛ(z)=argminynφ(y)+12yzΛ2,\mathrm{prox}_{\varphi}^{\Lambda}(z)=\mathop{\operatorname*{argmin}}_{y\in\mathbb{R}^{n}}\varphi(y)+\frac{1}{2}\|y-z\|_{\Lambda}^{2},

where xΛ2:=xTΛx\|x\|_{\Lambda}^{2}:=x^{T}\Lambda x. We also slightly abuse the notation and write proxφλ=proxφΛ\mathrm{prox}_{\varphi}^{\lambda}=\mathrm{prox}_{\varphi}^{\Lambda} for Λ=λI\Lambda=\lambda I.

The directional derivative of a function h:nh:\mathbb{R}^{n}\rightarrow\mathbb{R} at xx along dd is denoted by

h(x;d):=limt0+h(x+td)h(x)t,h^{\prime}(x;d):=\lim_{t\rightarrow 0^{+}}\frac{h(x+td)-h(x)}{t},

if it exists. In the composite case ψ=f+φ\psi=f+\varphi with smooth ff and real-valued convex φ\varphi, the directional derivative ψ(x;d)\psi^{\prime}(x;d) is well-defined for all x,dnx,d\in\mathbb{R}^{n} and the subdifferential of ψ\psi is defined via

ψ(x)={f(x)}+φ(x)={ana,dψ(x;d),dn},\partial\psi(x)=\{\nabla f(x)\}+\partial\varphi(x)=\{a\in\mathbb{R}^{n}\mid\langle a,d\rangle\leq\psi^{\prime}(x;d),\ \forall\ d\in\mathbb{R}^{n}\},

where φ(x)\partial\varphi(x) is the usual subdifferential of a convex function. The steepest descent direction of ψ\psi is defined as

ds(x):={argmind1ψ(x;d)if 0ψ(x),0if 0ψ(x).d_{s}(x):=\begin{cases}\mathop{\operatorname*{argmin}}_{\|d\|\leq 1}\psi^{\prime}(x;d)&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x).\end{cases}

In this paper, we will repeatedly work with the following normalization condition for a direction d(x)d(x):

d(x)={1if 0ψ(x),0if 0ψ(x).\|d(x)\|=\begin{cases}1&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x).\end{cases} (2.1)

We can see that ds(x)d_{s}(x) satisfies the property (2.1). We say that a point xx^{*} is a stationary point of problem (1.2) if ψ(x;ds(x))=0\psi^{\prime}(x^{*};d_{s}(x^{*}))=0, i.e., if and only if, 0ψ(x)0\in\partial\psi(x^{*}).

2.2 Descent Directions

If the objective function ψ\psi is smooth, then the first-order information is carried in the gradient ψ\nabla\psi. We notice that the gradient is directly connected to the steepest descent direction ds(x)=ψ(x)/ψ(x)d_{s}(x)=-\nabla\psi(x)/\|\nabla\psi(x)\| and thus, we can use the following representation ψ(x)=ψ(x;ds(x))ds(x)-\nabla\psi(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x). Motivated by this observation, a natural choice of the first-order information and extension in the nonsmooth case is g(x)=ψ(x;ds(x))ds(x)g(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x). Since in some cases it might be hard or expensive to directly compute g(x)=ψ(x;ds(x))ds(x)g(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x), we can also utilize a general descent direction d(x)d(x) satisfying (2.1) instead. In our model function, we will work with directions of the form g(x)=u(x)d(x)g(x)=u(x)d(x) where d(x)d(x) is a descent direction, ψ(x;d(x))<0\psi^{\prime}(x;d(x))<0, satisfying (2.1) and u(x)u(x) is an upper bound of ψ(x,d(x))\psi^{\prime}(x,d(x)) with

u(x){[ψ(x,d(x)),0)if 0ψ(x),=0if 0ψ(x).u(x)\begin{cases}\in[\psi^{\prime}(x,d(x)),0)&\text{if }0\notin\partial\psi(x),\\ =0&\text{if }0\in\partial\psi(x).\end{cases} (2.2)

This implies g(x)=0g(x)=0 if and only if 0ψ(x)0\in\partial\psi(x), i.e., if xx is a stationary point of (1.2). The direction g(x)g(x) plays a similar role as the gradient in the smooth case. We would call g(x)g(x) as pseudo-gradient. Our aim in the rest of this subsection is to propose several strategies in the settings of composite programs (1.1) for computing and choosing the functions u(x)u(x) and d(x)d(x).

2.2.1 Steepest Descent Direction

We first compute and express

g(x)=ψ(x;ds(x))ds(x),g(x)=\psi^{\prime}(x;d_{s}(x))d_{s}(x), (2.3)

via the so-called normal map [65]:

·FnorΛ(z):=f(proxφΛ(z))+Λ(zproxφΛ(z)),\textperiodcentered F_{\mathrm{nor}}^{\Lambda}(z):=\nabla f(\mathrm{prox}_{\varphi}^{\Lambda}(z))+\Lambda(z-\mathrm{prox}_{\varphi}^{\Lambda}(z)), (2.4)

where Λ\Lambda denotes a symmetric and positive semidefinite matrix. We also use the notation Fnorλ=FnorΛF_{\mathrm{nor}}^{\lambda}=F_{\mathrm{nor}}^{\Lambda} in the case Λ=λI\Lambda=\lambda I. The next lemma establishes a relation between ds(x)d_{s}(x), FnorΛ(z)F_{\mathrm{nor}}^{\Lambda}(z), and ψ(x)\partial\psi(x).

Lemma 2.1.

Let xnx\in\mathbb{R}^{n} be given. It holds that

  • (i)

    The direction ds(x)d_{s}(x) and the derivative ψ(x;ds(x))\psi^{\prime}(x;d_{s}(x)) can be represented as follows: ψ(x;ds(x))=dist(0,ψ(x)):=minvψ(x)v\psi^{\prime}(x;d_{s}(x))=-\mathrm{dist}(0,\partial\psi(x)):=-\min_{v\in\partial\psi(x)}\|v\| and

    ds(x)={𝐏ψ(x)(0)𝐏ψ(x)(0)if 0ψ(x),0if 0ψ(x),\begin{split}d_{s}(x)=\begin{cases}-\frac{\mathbf{P}_{\partial\psi(x)}(0)}{\|\mathbf{P}_{\partial\psi(x)}(0)\|}&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x),\end{cases}\end{split}

    where 𝐏ψ(x)\mathbf{P}_{\partial\psi(x)} denotes the orthogonal projection onto the convex, closed set ψ(x)\partial\psi(x).

  • (ii)

    We have ψ(x;ds(x))ds(x)ψ(x)\psi^{\prime}(x;d_{s}(x))d_{s}(x)\in\partial\psi(x).

  • (iii)

    ψ(x)={FnorΛ(z):proxφΛ(z)=x}\partial\psi(x)=\{F_{\mathrm{nor}}^{\Lambda}(z):\mathrm{prox}_{\varphi}^{\Lambda}(z)=x\}.

Proof. (i) Using Fenchel-Rockafellar duality, see [10, Theorem 15.23], and the conjugation result (ιB(0,1))(d)=σB(0,1)(d)=d(\iota_{B_{\|\cdot\|}(0,1)})^{*}(d)=\sigma_{B_{\|\cdot\|}(0,1)}(d)=\|d\|, we obtain

ψ(x;ds(x))=mindσψ(x)(d)+ιB(0,1)(d)=minvιψ(x)(v)+v=dist(0,ψ(x)).\begin{split}\psi^{\prime}(x;d_{s}(x))&=\min_{d}\sigma_{\partial\psi(x)}(d)+\iota_{B_{\|\cdot\|}(0,1)}(d)\\ &=-\min_{v}\iota_{\partial\psi(x)}(v)+\|v\|=-\text{dist}(0,\partial\psi(x)).\end{split}

The unique solution of the dual problem is given by v=𝐏ψ(x)(0)v=\mathbf{P}_{\partial\psi(x)}(0). By [10, Corollary 19.2], the set of primal solutions can be characterized via ds(x)Nψ(x)(v)(v)d_{s}(x)\in N_{\partial\psi(x)}(v)\cap\partial\|\cdot\|(-v). Here, the set Nψ(x)(v):={h:h,yv0,yψ(x)}N_{\partial\psi(x)}(v):=\{h:\langle h,y-v\rangle\leq 0,\ \forall\ y\in\partial\psi(x)\} is the associated normal cone of ψ(x)\partial\psi(x) at vv. In the case 0ψ(x)0\notin\partial\psi(x), we have v0\|v\|\neq 0 and hence, (v)={v/v}\partial\|\cdot\|(-v)=\{-v/\|v\|\}. Moreover, since vv is a solution of the problem minyψ(x)12y2\min_{y\in\partial\psi(x)}\frac{1}{2}\|y\|^{2}, it satisfies the optimality condition v,yv0,yψ(x)\langle v,y-v\rangle\geq 0,\ \forall\ y\in\partial\psi(x). This implies {v/v}Nψ(x)(v)\{-v/\|v\|\}\in N_{\partial\psi(x)}(v) and ds(x)=𝐏ψ(x)(0)/𝐏ψ(x)(0)d_{s}(x)=-\mathbf{P}_{\partial\psi(x)}(0)/{\|\mathbf{P}_{\partial\psi(x)}(0)\|}. In the case 0ψ(x)0\in\partial\psi(x), we have ds(x)=0d_{s}(x)=0 by definition.

(ii) Noticing ψ(x;ds(x))=𝐏ψ(x)(0)\psi^{\prime}(x;d_{s}(x))=-\|\mathbf{P}_{\partial\psi(x)}(0)\|, it follows ψ(x;ds(x))ds(x)=𝐏ψ(x)(0)ψ(x).\psi^{\prime}(x;d_{s}(x))d_{s}(x)=\mathbf{P}_{\partial\psi(x)}(0)\in\partial\psi(x).

(iii) By the definition of proxφΛ(z)\mathrm{prox}_{\varphi}^{\Lambda}(z), it can be shown that

x=proxφΛ(z)Λ(zx)φ(x).x=\mathrm{prox}_{\varphi}^{\Lambda}(z)\quad\Longleftrightarrow\quad\Lambda(z-x)\in\partial\varphi(x). (2.5)

If proxφΛ(z)=x\mathrm{prox}_{\varphi}^{\Lambda}(z)=x, by (2.5), we have FnorΛ(z)=f(x)+Λ(zx)ψ(x)F_{\mathrm{nor}}^{\Lambda}(z)=\nabla f(x)+\Lambda(z-x)\in\partial\psi(x). If vψ(x)v\in\partial\psi(x), set z=x+Λ1(vf(x))z=x+\Lambda^{-1}(v-\nabla f(x)), then we have Λ(zx)=vf(x)φ(x)\Lambda(z-x)=v-\nabla f(x)\in\partial\varphi(x). According to (2.5), it holds x=proxφΛ(z)x=\mathrm{prox}_{\varphi}^{\Lambda}(z). Thus, we obtain v=FnorΛ(z)v=F_{\mathrm{nor}}^{\Lambda}(z). \square

From Lemma 2.1 we can immediately derive the following corollary which uses FnorΛ(z)F_{\mathrm{nor}}^{\Lambda}(z) to describe the first-order optimality conditions.

Corollary 2.2.

A point xnx^{*}\in\mathbb{R}^{n} is a stationary point of problem (1.1), if and only if there exists znz^{*}\in\mathbb{R}^{n} satisfying x=proxφΛ(z)x^{*}=\mathrm{prox}_{\varphi}^{\Lambda}(z^{*}) and zz^{*} is a solution of the nonsmooth equation FnorΛ(z)=0F_{\mathrm{nor}}^{\Lambda}(z)=0.

By Lemma 2.1, the calculation of ψ(x;ds(x))ds(x)\psi^{\prime}(x;d_{s}(x))d_{s}(x) is equivalent to solving an optimization problem ψ(x;ds(x))ds(x)=𝐏ψ(x)(0)=argminvψ(x)v\psi^{\prime}(x;d_{s}(x))d_{s}(x)=\mathbf{P}_{\partial\psi(x)}(0)=\mathop{\operatorname*{argmin}}_{v\in\partial\psi(x)}\|v\|. Alternatively, we can first solve

τ(x)=argminznFnorΛ(z)s.t.proxφΛ(z)=x\tau(x)=\mathop{\operatorname*{argmin}}_{z\in\mathbb{R}^{n}}\|F_{\mathrm{nor}}^{\Lambda}(z)\|\quad\text{s.t.}\quad\mathrm{prox}_{\varphi}^{\Lambda}(z)=x (2.6)

and then compute ψ(x;ds(x))ds(x)=FnorΛ(τ(x))\psi^{\prime}(x;d_{s}(x))d_{s}(x)=F_{\mathrm{nor}}^{\Lambda}(\tau(x)). By the definition of FnorΛ(z)F_{\mathrm{nor}}^{\Lambda}(z), solving (2.6) is equivalent to

τ(x)=argminznf(x)+Λ(zx)s.t.proxφΛ(z)=x,\tau(x)=\mathop{\operatorname*{argmin}}_{z\in\mathbb{R}^{n}}\|\nabla f(x)+\Lambda(z-x)\|\quad\text{s.t.}\quad\mathrm{prox}_{\varphi}^{\Lambda}(z)=x,

which combined with (2.5) leads to

τ(x)=x+Λ1𝐏φ(x)(f(x)).\tau(x)=x+\Lambda^{-1}\mathbf{P}_{\partial\varphi(x)}(-\nabla f(x)). (2.7)

and

ψ(x;ds(x))ds(x)=FnorΛ(τ(x))=f(x)+𝐏φ(x)(f(x)).\psi^{\prime}(x;d_{s}(x))d_{s}(x)=F_{\mathrm{nor}}^{\Lambda}(\tau(x))=\nabla f(x)+\mathbf{P}_{\partial\varphi(x)}(-\nabla f(x)). (2.8)

A closed form representation of the mapping FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) can be derived for 1\ell_{1}-optimization, group lasso, and \ell_{\infty}-optimization. We present FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) for an 1\ell_{1}-problem in Example 2.3; other examples are summarized in the appendix in Example A.1.

Example 2.3 (FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) for 1\ell_{1}-optimization).

Suppose that φ(x)=x1\varphi(x)=\|x\|_{1} and Λ=diag(λ1,λ2,,λn)\Lambda=\mathrm{diag}(\lambda_{1},\lambda_{2},\cdots,\lambda_{n}), then by (2.8), we can compute

FnorΛ(τ(x))i={f(x)i𝐏[1,1](f(x)i),xi=0,f(x)i+sgn(xi),xi0,i=1,2,,n.F_{\mathrm{nor}}^{\Lambda}(\tau(x))_{i}=\begin{cases}\nabla f(x)_{i}-\mathbf{P}_{[-1,1]}(\nabla f(x)_{i}),&x_{i}=0,\\ \nabla f(x)_{i}+\mathrm{sgn}(x_{i}),&x_{i}\neq 0,\\ \end{cases}\quad\forall~{}i=1,2,\cdots,n.

2.2.2 Natural Residual

Another possible choice for g(x)=u(x)d(x)g(x)=u(x)d(x) can be based on the so-called natural residual,

FnatΛ(x):=xproxφΛ(xΛ1f(x)).F_{\mathrm{nat}}^{\Lambda}(x):=x-\mathrm{prox}_{\varphi}^{\Lambda}(x-\Lambda^{-1}\nabla f(x)). (2.9)

Similar to the normal map, FnatΛF_{\mathrm{nat}}^{\Lambda} can be used as a criticality measure.

Lemma 2.4.

A point xx^{*} is a stationary point of problem (1.1) if and only if xx^{*} is a solution of the nonsmooth equation FnatΛ(x)=0.F_{\mathrm{nat}}^{\Lambda}(x)=0.

Following [30, Proposition 4.2], the directional derivative at xx along FnatΛ(x)-F_{\mathrm{nat}}^{\Lambda}(x) satisfies ψ(x;FnatΛ(x))FnatΛ(x)Λ2\psi^{\prime}(x;-F_{\mathrm{nat}}^{\Lambda}(x))\leq-\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}^{2}. Thus, the direction

d(x)={FnatΛ(x)FnatΛ(x)Λif 0ψ(x),0if 0ψ(x),d(x)=\begin{cases}-\frac{F_{\mathrm{nat}}^{\Lambda}(x)}{\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}}&\text{if }0\notin\partial\psi(x),\\ 0&\text{if }0\in\partial\psi(x),\end{cases}

is a descent direction with the directional derivative

ψ(x;d(x))FnatΛ(x)Λ2FnatΛ(x)Λ=FnatΛ(x)Λ.\psi^{\prime}(x;d(x))\leq-\frac{\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}^{2}}{\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}}=-\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}.

We can choose u(x)=FnatΛ(x)Λu(x)=-\|F_{\mathrm{nat}}^{\Lambda}(x)\|_{\Lambda}, which implies that

g(x)=FnatΛ(x).g(x)=F_{\mathrm{nat}}^{\Lambda}(x). (2.10)

2.3 Stepsize Safeguard

If we utilize a smooth model mkm_{k}, once we have selected the descent direction dd, (directionally) noncontinuous points of ψ(,d)\psi^{\prime}(\cdot,d) will contribute to the inaccuracy of mkm_{k}. Hence, we should keep the stepsize relatively small to avoid those points. For any x,dnx,d\in\mathbb{R}^{n} with d=1\|d\|=1, we set the stepsize safeguard Γ(x,d)\Gamma(x,d) to guarantee that tψ(x+td;d)t\mapsto\psi^{\prime}(x+td;d) is continuous on (0,Γ(x,d))(0,\Gamma(x,d)), which is equivalent to saying that tψ(x+td)t\mapsto\psi\left(x+td\right) is continuously differentiable on t(0,Γ(x,d))t\in(0,\Gamma(x,d)).

We prefer to choose the largest possible value of Γ(x,d)\Gamma(x,d):

Γ(x,d)=Γmax(x,d):=sup{T>0:ψ~x,d(t):=ψ(x+td;d)C1(0,T)},\Gamma(x,d)=\Gamma_{\max}(x,d):=\sup\left\{T>0:\tilde{\psi}^{\prime}_{x,d}(t):=\psi^{\prime}(x+td;d)\in C^{1}(0,T)\right\}, (2.11)

since it can intuitively lead to faster convergence. We will see that this choice works well for polyhedral problems, where φ\varphi is the supremum of several affine functions, such as in 1\ell_{1}- and \ell_{\infty}-optimization.

However, in some other cases, we may need to set Γ(x,d)\Gamma(x,d) more carefully. For example, for the group lasso problem minXn1×n2f(X)+φ(X)\min_{X\in\mathbb{R}^{n_{1}\times n_{2}}}f(X)+\varphi(X), where ff is smooth and φ\varphi is given by φ(X)=i=1n2Xi\varphi(X)=\sum_{i=1}^{n_{2}}\left\|X_{i}\right\|, an appropriate choice for X=(X1,X2,,Xn2)n1×n2X=(X_{1},X_{2},\cdots,X_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}} and D=(D1,D2,,Dn2)n1×n2D=(D_{1},D_{2},\cdots,D_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}} with D=1\|D\|=1 is:

Γ(X,D)=min{Γmax(X,D),minXi0Xi1+σ1θi2,minXi0Ximax{2θi,0}},σ>0.\Gamma(X,D)=\min\left\{\Gamma_{\max}(X,D),\min_{X_{i}\neq 0}\frac{\|X_{i}\|^{1+\sigma}}{1-\theta_{i}^{2}},\min_{X_{i}\neq 0}\frac{\|X_{i}\|}{\max\{-2\theta_{i},0\}}\right\},\quad\sigma>0. (2.12)

Here, θi\theta_{i} is given by θi:=Xi,Di/(XiDi)\theta_{i}:=\langle X_{i},D_{i}\rangle/(\|X_{i}\|\cdot\|D_{i}\|) and we use c/0:=+c/0:=+\infty if c>0c>0. The term Γmax(X,D)\Gamma_{\max}(X,D) is defined as in (2.11). This Γ(X,D)\Gamma(X,D) is specifically designed to overcome some technical difficulties; see, e.g., Lemma 4.4.

Next, let us define the function Γ:n+\Gamma:\mathbb{R}^{n}\rightarrow\mathbb{R}^{+} via

Γ(x):=infdn,d=1Γmax(x,d).\Gamma(x):=\inf_{d\in\mathbb{R}^{n},\ \|d\|=1}\Gamma_{\max}(x,d). (2.13)

The scalar Γ(x)\Gamma(x) is important in our convergence analysis as it provides a lower bound for the stepsize safeguard. For the composite program (1.1) with φ(x)=x1\varphi(x)=\|x\|_{1}, Γ\Gamma can be simply calculated as follows Γ(x)=min{|xi|:xi0}\Gamma(x)=\min\{|x_{i}|:x_{i}\neq 0\} where min:=+\min\emptyset:=+\infty. Further examples for Γ\Gamma can be found in Appendix A.2.

2.4 Truncation Operators

Since in our algorithmic design we utilize simple, linear-quadratic models to approximate the nonsmooth function ψ\psi, we need to introduce stepsize safeguards that allow to intrinsically control the accuracy of the model. However, if the “safeguard” Γ(x)\Gamma(x) is very small, the resulting step might be close to the old iterate and the algorithm can start to stagnate. In order to prevent such an undesirable behavior, we discuss an additional modification step that allows to increase Γ(x)\Gamma(x).

Specifically, given a point xx, first we want to find a point xx^{\prime} near xx such that Γ(x)\Gamma(x^{\prime}) is relatively large. Let us consider the simplest case where ψ=f+φ\psi=f+\varphi and φ(x)=x1\varphi(x)=\|x\|_{1}. If xx has a nonzero component with small absolute values, then Γ(x)\Gamma(x) is also small. So we can replace those components with 0 and get a new point xx^{\prime} satisfying Γ(x)>Γ(x)\Gamma(x^{\prime})>\Gamma(x). Since only some components with small absolute values are truncated to 0, the point xx^{\prime} is close to xx. In more general cases, we define a class of functions that allow similar operations:

Definition 2.5.

Suppose that there exist a finite sequence {Si}i=0m\{S_{i}\}_{i=0}^{m} satisfying n=S0S1Sm\mathbb{R}^{n}=S_{0}\supset S_{1}\cdots\supset S_{m}, δ(0,+]\delta\in(0,+\infty], κ>0\kappa>0, and a function T:n×(0,δ]nT:\mathbb{R}^{n}\times(0,\delta]\rightarrow\mathbb{R}^{n} with following properties:

  • (i)

    Γ(x)δ,xSm\Gamma(x)\geq\delta,\ \forall\ x\in S_{m};

  • (ii)

    For any a(0,δ]a\in(0,\delta] and xSi\Si+1x\in S_{i}\backslash S_{i+1}, i{0,1,,m1}i\in\{0,1,\cdots,m-1\}, if Γ(x)a\Gamma(x)\geq a, it holds that T(x,a)=xT(x,a)=x, otherwise we have T(x,a)Si+1T(x,a)\in S_{i+1}, Γ(T(x,a))a\Gamma(T(x,a))\geq a, and T(x,a)xκa\|T(x,a)-x\|\leq\kappa a.

Then we say that ψ\psi can be truncated and that TT is a truncation operator.

In Definition 2.5, Γ(T(x,a))a\Gamma(T(x,a))\geq a means that we can make the value of Γ()\Gamma(\cdot) larger by performing truncation and T(x,a)xκa\|T(x,a)-x\|\leq\kappa a implies that the change caused by T(,a)T(\cdot,a) can be controlled. Example 2.6 shows that φ(x)=x1\varphi(x)=\|x\|_{1} can be truncated and we present more examples (\ell_{\infty}-optimization and group lasso) in the appendix.

Example 2.6 (φ(x)=x1\varphi(x)=\|x\|_{1}).

For i=0,1,,ni=0,1,\cdots,n, we set Si={xncard{j=1,2,,nxj=0}i}S_{i}=\{x\in\mathbb{R}^{n}\mid\mathrm{card}\{j=1,2,\cdots,n\mid x_{j}=0\}\geq i\}, m=nm=n, δ=+\delta=+\infty, κ=n\kappa=\sqrt{n}, and

T(x,a)j=𝟙||a(xj)xj,j=1,2,,n,T(x,a)_{j}=\mathbbm{1}_{|\cdot|\geq a}(x_{j})x_{j},\ j=1,2,\cdots,n,

where 𝟙A()\mathbbm{1}_{A}(\cdot) is the indicator function. Figure 1 shows the truncation operator for φ(x)=x1\varphi(x)=\|x\|_{1} and n=2n=2 explicitly, where S1={(x1,x2)x1x2=0}S_{1}=\{(x_{1},x_{2})\mid x_{1}x_{2}=0\} and S2={(0,0)}S_{2}=\{(0,0)\}.

1-101122331-10112233x=(1,2)Tx=(1,2)^{T}T(x,12)=xT(x,\frac{1}{2})=x 1-101122331-10112233x=(1,2)Tx=(1,2)^{T}T(x,32)T(x,\frac{3}{2})
1-101122331-10112233x=(1,2)Tx=(1,2)^{T}T(x,52)T(x,\frac{5}{2})
Figure 1: Illustration of the truncation operator in Example 2.6 for φ(x)=x1\varphi(x)=\|x\|_{1} and n=2n=2.

Let us mention that, for a smooth regularizer φ\varphi, all properties discussed above are satisfied, since the stepsize safeguard can be chosen as ++\infty and no truncation is needed.

3 A Nonsmooth Trust-region Method

In this section, we present the algorithmic framework of our trust-region type method. The traditional trust-region framework using the pseudo-gradient g(x)g(x) is employed in our algorithm with potential refinement using a stepsize safeguard. The traditional framework is standard but it might not be accurate enough to pass the acceptance test. In each iteration, we first perform a classical trust-region step. A stepsize safeguard for refinement is used if the traditional step fails. In order to promote large stepsizes, a novel truncation step is proposed.

We first introduce the model function and trust-region subproblem. Then, we propose several modification steps including the choice of the stepsize and the novel truncation step. The final algorithm is presented at the end of this section. We also present some methods for solving the corresponding trust-region subproblem in Appendix B.

3.1 Model Function and Trust-region Subproblem

Recall that in the classical trust-region method for a smooth optimization problem, minxnψ(x)\min_{x\in\mathbb{R}^{n}}\psi(x), the model function is mk(s)=ψ(xk)+ψ(xk),s+12s,Bksm_{k}(s)=\psi(x^{k})+\langle\nabla\psi(x^{k}),s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle. As mentioned at the beginning of Section 2.2, a natural extension is

mk(s)=ψ(xk)+ψ(xk;ds(xk))ds(xk),s+12s,Bks.m_{k}(s)=\psi(x^{k})+\langle\psi^{\prime}(x^{k};d_{s}(x^{k}))d_{s}(x^{k}),s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle. (3.1)

This model function is still quadratic and fits the objective function well along the steepest descent direction, which means that they have the same directional derivative in this direction. Though approximating the nonsmooth function ψ\psi with a quadratic function might not lead to good trust-region models in general, we can design specific quadratic models that fit ψ\psi well along certain directions.

One may wish to use different descent directions. Or it might be expensive to compute the steepest descent direction ds(x)d_{s}(x) and its Clarke’s generalized directional derivative ψ(x;ds(x))\psi^{\prime}(x;d_{s}(x)). Therefore we use a descent direction d(x)d(x) satisfying (2.1) and u(x)u(x) satisfying (2.2) instead. We can now define our model function

mk(s)=ψk+gk,s+12s,Bks,m_{k}(s)=\psi_{k}+\langle g^{k},s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle, (3.2)

where ψk=ψ(xk)\psi_{k}=\psi(x^{k}) and gk=g(xk)=u(xk)d(xk)g^{k}=g(x^{k})=u(x^{k})d(x^{k}), and the associated trust-region subproblem is given by

minsmk(s)=ψk+gk,s+12s,Bkss.t.sΔk.\min_{s}\ m_{k}(s)=\psi_{k}+\langle g^{k},s\rangle+\frac{1}{2}\langle s,B^{k}s\rangle\quad\text{s.t.}\quad\|s\|\leq\Delta_{k}. (3.3)

This subproblem is quadratic and coincides with the classical approaches if BkB^{k} is symmetric. The matrices BkB^{k} in the model (3.2) are typically chosen to capture or approximate the second-order information of the objective function ψ\psi. However, such a careful choice is only required when certain local convergence properties should be guaranteed. Similar to classical trust-region methods, global convergence of the method will generally not be affected by {Bk}\{B^{k}\} and flexible choices of BkB^{k} are possible under some mild boundedness conditions that will be specified later.

We remark that the major difference between our algorithm and other nonsmooth trust-region methods in the literature is the first-order information in the model function. The methods in [64, 20] employ first-order terms that tend to be more complicated than a simple linear function, in order to approximate the nonsmooth objective function well and to satisfy certain accuracy assumptions. Although the model function in [3] is quadratic, their first-order term needs to be built using a steepest descent direction based on the ϵ\epsilon-subdifferential of ψ\psi.

An important concept for solving (3.3) is the so-called Cauchy point, which is defined via

sCk:=αkCgkandαkC:=argmin0tΔk/gkmk(tgk).s^{k}_{C}:=-\alpha_{k}^{C}g^{k}\ \text{and}\ \alpha_{k}^{C}:=\mathop{\operatorname*{argmin}}_{0\leq t\leq\Delta_{k}/\|g^{k}\|}m_{k}(-tg^{k}).

The Cauchy point is computational inexpensive [58, Algorithm 4.2] and it leads to sufficient reduction of the model function (Cauchy decrease condition):

mk(0)mk(sCk)12gkmin{Δk,gkBk},m_{k}(0)-m_{k}(s^{k}_{C})\geq\frac{1}{2}\|g^{k}\|\min\left\{\Delta_{k},\frac{\|g^{k}\|}{\|B^{k}\|}\right\}, (3.4)

see, e.g., in [58, Lemma 4.3]. Furthermore, it can be shown that

mk(0)mk(αks¯Ck)αksCk12gkmin{Δk,gkBk},m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k}_{C})\geq\frac{\alpha_{k}}{\|s^{k}_{C}\|}\cdot\frac{1}{2}\|g^{k}\|\min\left\{\Delta_{k},\frac{\|g^{k}\|}{\|B^{k}\|}\right\}, (3.5)

where s¯Ck=sCk/sCk\bar{s}^{k}_{C}=s^{k}_{C}/\|s^{k}_{C}\| and 0<αksCk0<\alpha_{k}\leq\|s^{k}_{C}\|.

In our algorithm, we need to generate an approximate solution of (3.3) that achieves a similar model descent compared to the Cauchy descent condition (3.4) in some sense. More precisely, we need to recover a solution sks^{k} satisfying

mk(0)mk(sk)γ12gkmin{Δk,γ2gk},m_{k}(0)-m_{k}(s^{k})\geq\frac{\gamma_{1}}{2}\|g^{k}\|\min\left\{\Delta_{k},\gamma_{2}\|g^{k}\|\right\}, (3.6)

where γ1,γ2>0\gamma_{1},\gamma_{2}>0 are constants which do not depend on kk, and

mk(0)mk(sk)(1(sk))[mk(0)mk(sCk)],m_{k}(0)-m_{k}(s^{k})\geq(1-\ell(\|s^{k}\|))[m_{k}(0)-m_{k}(s^{k}_{C})], (3.7)

where :+[0,1]\ell:\mathbb{R}^{+}\rightarrow[0,1] is chosen as a monotonically decreasing function with limΔ0+(Δ)=0\lim_{\Delta\rightarrow 0^{+}}\ell(\Delta)=0. The classical choice (Δ)0\ell(\Delta)\equiv 0 is also allowed.

3.2 Suitable Stepsizes

In the trust-region framework, we will work with the parameters 0<ηη1<η2<10<\eta\leq\eta_{1}<\eta_{2}<1, 0<r1<1<r20<r_{1}<1<r_{2}, and Δmax>0\Delta_{\max}>0. Let sks^{k} denote the generated solution of (3.3). Similar to the classical trust-region method, we define the ratio between actual reduction and predicted reduction as

ρk1=ψ(xk)ψ(xk+sk)mk(0)mk(sk).\rho^{1}_{k}=\frac{\psi(x^{k})-\psi(x^{k}+s^{k})}{m_{k}(0)-m_{k}(s^{k})}. (3.8)

If the proposed step xk+skx^{k}+s^{k} is “successful”, i.e., ρk1η1\rho^{1}_{k}\geq\eta_{1}, we accept the step, i.e., x~k=xk+sk\tilde{x}^{k}=x^{k}+s^{k}, and update the trust-region radius Δk\Delta_{k} as

Δk+1={min{Δmax,r2Δk}if ρk1>η2,Δkotherwise.\Delta_{k+1}=\begin{cases}\min\{{\Delta}_{\max},r_{2}\Delta_{k}\}&\text{if }\rho^{1}_{k}>\eta_{2},\\ \Delta_{k}&\text{otherwise}.\end{cases} (3.9)

If ρk1<η1\rho^{1}_{k}<\eta_{1}, we can introduce an additional stepsize strategy to refine the step. Specifically, we now consider the normalized descent direction s¯k:=sk/sk\bar{s}^{k}:={s^{k}}/{\|s^{k}\|}. In the following, we will always use the notation x¯:=x/x\bar{x}:={x}/{\|x\|}. Instead of setting the stepsize as sk\|s^{k}\| and working with sks^{k} directly, we calculate αk\alpha_{k} via

αk=min{Γ(xk,s¯k),sk},\alpha_{k}=\min\left\{\Gamma(x^{k},\bar{s}^{k}),\|s^{k}\|\right\}, (3.10)

where Γ(xk,s¯k)\Gamma(x^{k},\bar{s}^{k}) is the stepsize safeguard. If

mk(0)mk(αks¯k)αk2sk(mk(0)mk(sk)),m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})\geq\frac{\alpha_{k}}{2\|s^{k}\|}(m_{k}(0)-m_{k}(s^{k})), (3.11)

which means that the modified step yields sufficient descent, we use the direction s¯k\bar{s}^{k} and the stepsize αk\alpha_{k}; otherwise we set

sk=sCk,s¯k=s¯Ck, and αk=min{Γ(xk,s¯Ck),sCk}.s^{k}=s^{k}_{C},\ \bar{s}^{k}=\bar{s}^{k}_{C},\text{ and }\alpha_{k}=\min\left\{\Gamma(x^{k},\bar{s}^{k}_{C}),\|s^{k}_{C}\|\right\}. (3.12)

If mkm_{k} is convex (which, e.g., can be ensured when the matrix BkB^{k} is chosen to be positive semidefinite) then (3.11) holds automatically and the latter case will not occur. In (3.12) we utilize the Cauchy point and the corresponding stepsize as a simpler gradient-based step. As we have seen such a step can always guarantee certain descent properties. Next, we perform a second ratio test

ρk2=ψ(xk)ψ(xk+αks¯k)mk(0)mk(αks¯k).\rho^{2}_{k}=\frac{\psi(x^{k})-\psi(x^{k}+\alpha_{k}\bar{s}^{k})}{m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})}. (3.13)

According to this ratio, we update the trust-region radius as

Δk+1={r1Δkif ρk2<η1,min{Δmax,r2Δk}if ρk2>η2,Δkotherwise,\Delta_{k+1}=\begin{cases}r_{1}\Delta_{k}&\text{if }\rho^{2}_{k}<\eta_{1},\\ \min\{{\Delta}_{\max},r_{2}\Delta_{k}\}&\text{if }\rho^{2}_{k}>\eta_{2},\\ \Delta_{k}&\text{otherwise},\end{cases} (3.14)

and decide whether to accept the proposed step

x~k={xk+αks¯kif ρk2η,xkif ρk2<η.\tilde{x}^{k}=\begin{cases}x^{k}+\alpha_{k}\bar{s}^{k}&\text{if }\rho_{k}^{2}\geq\eta,\\ x^{k}&\text{if }\rho_{k}^{2}<\eta.\end{cases} (3.15)

We declare the step as “subsuccessful” if ρk1<η1\rho^{1}_{k}<\eta_{1} while ρk2η\rho^{2}_{k}\geq\eta, i.e., even if the original step is unsuccessful, the refined version can still provide some descent which is essential to guarantee convergence.

3.3 Truncation Step

It might not be suitable to simply set xk+1=x~kx^{k+1}=\tilde{x}^{k}, since Γ(x~k)\Gamma(\tilde{x}^{k}) can be very small and larger Γ\Gamma-values increase the stepsize and improve the fitness of the model. Our idea is to allow a small modification of x~k\tilde{x}^{k} and to get a new point xk+1x^{k+1} with relatively large Γ(xk+1)\Gamma(x_{k+1}) although such modification may cause an increase of the objective function. In the following, we describe an algorithmic procedure for increasing the safeguard Γ(xk+1)\Gamma(x^{k+1}) for functions which can be truncated.

Suppose that φ\varphi can be truncated and let S0,S1,SmS_{0},S_{1}\cdots,S_{m}, δ\delta, and TT be the corresponding truncation parameters and operators, respectively. Let {ϵs}s=01+\{\epsilon_{s}\}_{s=0}^{\infty}\in\ell_{1}^{+} be a positive and strictly decreasing sequence that is upper-bounded by δ\delta as well as summable. Since the sets {Sj:j=0,1,,m}\{S_{j}:j=0,1,\cdots,m\} are nested and cover the whole n\mathbb{R}^{n}, we know that there exists a unique index i{0,1,,m}i\in\{0,1,\cdots,m\} with x~kSi\Si+1\tilde{x}^{k}\in S_{i}\backslash S_{i+1}, where Sm+1=S_{m+1}=\emptyset. In the following, we define Nj:=Sj\Sj+1N_{j}:=S_{j}\backslash S_{j+1} and introduce a global counter cjc_{j} that is associated with each set NjN_{j} and that counts the total number of truncations performed on points in the set NjN_{j} for j=0,1,,mj=0,1,\cdots,m. Depending on the safeguard Γ(x~k)\Gamma(\tilde{x}^{k}) we then decide whether x~k\tilde{x}^{k} should be truncated via applying the truncation operator or not. The whole process is given as follows: find i{0,1,,m}i\in\{0,1,\cdots,m\} such that x~kNi\tilde{x}^{k}\in N_{i}; if Γ(x~k)<ϵci\Gamma(\tilde{x}^{k})<\epsilon_{c_{i}}, set x~kT(x~k,ϵci)\tilde{x}^{k}\leftarrow T(\tilde{x}^{k},\epsilon_{c_{i}}), otherwise we keep x~k\tilde{x}^{k} unchanged; and update ci=ci+1c_{i}=c_{i}+1 if x~k\tilde{x}^{k} is updated.

This procedure is repeated until Γ(x~k)ϵci\Gamma(\tilde{x}^{k})\geq\epsilon_{c_{i}}. Lemma 3.1 implies that this algorithm is well-defined and terminates within a finite number of steps. We call the whole procedure a truncation step which is presented in Algorithm 1.

Algorithm 1 Truncation step

Input: x~k\tilde{x}^{k} and cj,j=0,1,,m.c_{j},\ j=0,1,\cdots,m.

1:  while true do
2:     Compute the unique ii such that x~kSi\Si+1\tilde{x}^{k}\in S_{i}\backslash S_{i+1}.
3:     if Γ(x~k)<ϵci\Gamma(\tilde{x}^{k})<\epsilon_{c_{i}} then
4:        x~kT(x~k,ϵci)\tilde{x}^{k}\leftarrow T(\tilde{x}^{k},\epsilon_{c_{i}}).
5:        cici+1c_{i}\leftarrow c_{i}+1.
6:     else
7:        break.
8:     end if
9:  end while

Output: xk+1=x~kx^{k+1}=\tilde{x}^{k} and cj,j=0,1,,m.c_{j},\ j=0,1,\cdots,m.

Lemma 3.1.

Algorithm 1 will terminate in at most mm steps.

Proof. Since for any xSmx\in S_{m} and ss\in\mathbb{N}, we have Γ(x)δϵs\Gamma(x)\geq\delta\geq\epsilon_{s} and the operator TT moves points in Si\Si+1S_{i}\backslash S_{i+1} into Si+1S_{i+1}, TT is performed on x~k\tilde{x}_{k} at most mm times before Algorithm 1 terminates. \square

The iterate x~k\tilde{x}^{k} will be changed when performing Algorithm 1. For simplicity, in the rest of this paper, when we mention x~k\tilde{x}^{k}, we always mean the input of Algorithm 1.

3.4 Algorithmic Framework

We now present a nonsmooth trust-region framework with quadratic model functions that combines the mentioned strategies. One of the main advantages is that the corresponding subproblem can be cheaply formulated and solved. Specifically, the first-order term of our model can be constructed using any kind of descent direction and the second-order term BkB^{k} is only required to satisfy a classical boundedness condition to guarantee global convergence. Moreover, the resulting trust-region subproblem coincides with the classical one and can be solved using classical methods. Let us further note that in order to obtain fast local convergence, gkg^{k} and BkB^{k} need to be chosen and coupled in a more careful way. This is explored in more detail in Section 5 and 6.

The full algorithm is shown in Algorithm 2. We require the following parameters: 0<η<η1<η2<10<\eta<\eta_{1}<\eta_{2}<1, 0<r1<1<r20<r_{1}<1<r_{2}, Δmax>0{\Delta}_{\max}>0, γ1>0\gamma_{1}>0, γ2>0\gamma_{2}>0, and a positive and strictly decreasing sequence {ϵs}s=01+\{\epsilon_{s}\}_{s=0}^{\infty}\in\ell_{1}^{+} which is upper-bounded by δ\delta in Definition 2.5. We also assume that there is a monotonically decreasing function :+[0,12]\ell:\mathbb{R}^{+}\rightarrow[0,\frac{1}{2}] with limΔ0+(Δ)=0\lim_{\Delta\rightarrow 0^{+}}\ell(\Delta)=0. Some additional discussions of how to solve the trust-region subproblems (3.3) can be found in the appendix.

Algorithm 2 A trust-region method for nonsmooth nonconvex optimization

Initialization: initial point x0nx^{0}\in\mathbb{R}^{n}, initial trust-region radius Δ0\Delta^{0}, iteration k:=0k:=0, global counters cj=0,j=0,1,,mc_{j}=0,\ j=0,1,\cdots,m.

1:  while gk0\|g^{k}\|\neq 0 do
2:     Compute d(xk)d(x^{k}), u(xk)u(x^{k}) and gk=u(xk)d(xk)g^{k}=u(x^{k})d(x^{k}) and choose Bkn×nB_{k}\in\mathbb{R}^{n\times n}.
3:     Solve the trust-region subproblem (3.3) and obtain sks^{k} that satisfies (3.6) and (3.7).
4:     Compute ρk1\rho^{1}_{k} according to (3.8).
5:     if ρk1η1\rho^{1}_{k}\geq\eta_{1} then
6:        x~k:=xk+sk\tilde{x}^{k}:=x^{k}+s^{k}.
7:        Compute Δk+1\Delta_{k+1} according to (3.9).
8:     else
9:        Compute sks^{k}, s¯k\bar{s}^{k}, and αk\alpha_{k} according to (3.10), (3.11), and (3.12).
10:        Compute ρk2\rho^{2}_{k} according to (3.13).
11:        Compute Δk+1\Delta_{k+1} according to (3.14).
12:        Compute x~k\tilde{x}^{k} according to (3.15).
13:     end if
14:     Perform Algorithm 1, get xk+1x^{k+1} and update cj,j=0,1,,mc_{j},\ j=0,1,\cdots,m.
15:     kk+1k\leftarrow k+1.
16:  end while

We want to mention here that, for iteration k1k\geq 1, if xk+αks¯kx^{k}+\alpha_{k}\bar{s}^{k} is not accepted, i.e., x~k=xk\tilde{x}^{k}=x^{k}, we have xk+1=x~k=xkx^{k+1}=\tilde{x}^{k}=x^{k}, which means that no truncation is performed on x~k\tilde{x}^{k}. This is because xkx^{k} satisfies the stopping criteria of Algorithm 1 since it was the output of Algorithm 1 in the last iteration.

4 Global Convergence

In this section, we show the global convergence of Algorithm 2. Specifically, we will prove that every accumulation point of the sequence generated by Algorithm 2 is a stationary point under some suitable assumptions and that the natural residual converges to zero along the generated iterates.

4.1 Assumptions

In this subsection, we state the assumptions required for the convergence. Assumption 4.1 summarizes the conditions on the objective function ψ\psi and its pseudo-gradient gg.

Assumption 4.1.

We assume that ψ\psi and gg have the following properties:

  • (A.1)

    ψ\psi is bounded from below by LbL_{b}.

  • (A.2)

    If g(x)0g(x)\neq 0, then there exists r,ϵ>0r,\epsilon>0 such that g(y)ϵ,yBr(x)\|g(y)\|\geq\epsilon,\ \forall\ y\in B_{r}(x).

Assumption (A.1) is a standard assumption. Assumption (A.2) means that the first-order model will not vanish sharply. Condition (A.2) holds automatically if xg(x)x\mapsto\|g(x)\| is lower semicontinuous.

Lemma 4.2.

Assumption (A.2) is satisfied for the choices in (2.3) and (2.10).

Proof. (i) For (2.3), set ϵ=12FnorΛ(τ(x))>0\epsilon=\frac{1}{2}\|F_{\mathrm{nor}}^{\Lambda}(\tau(x))\|>0. Suppose that there exist a sequence {ym}m\{y^{m}\}_{m} satisfying ymxy^{m}\rightarrow x and FnorΛ(τ(ym))<ϵ\|F_{\mathrm{nor}}^{\Lambda}(\tau(y^{m}))\|<\epsilon for all mm\in\mathbb{N}. By the local boundedness of φ\partial\varphi, we can infer that {𝐏φ(ym)(f(ym))}m\{\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\}_{m} is bounded and thus, {𝐏φ(ym)(f(ym))}m=0\{\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\}_{m=0}^{\infty} has a convergent subsequence. Without loss of generality, we assume that the whole sequence {𝐏φ(ym)(f(ym))}m=0\{\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\}_{m=0}^{\infty} converges. Let us set w=limm𝐏φ(ym)(f(ym))w=\lim_{m\rightarrow\infty}\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m})). Using the upper semicontinuity or the closedness of φ\partial\varphi, ymxy^{m}\rightarrow x, and φ(ym)𝐏φ(ym)(f(ym))w\partial\varphi(y^{m})\ni\mathbf{P}_{\partial\varphi(y^{m})}(-\nabla f(y^{m}))\rightarrow w, it follows wφ(x)w\in\partial\varphi(x). Therefore, we obtain f(x)+wψ(x)\nabla f(x)+w\in\partial\psi(x) with f(x)+wϵ<FnorΛ(τ(x))\|\nabla f(x)+w\|\leq\epsilon<\|F_{\mathrm{nor}}^{\Lambda}(\tau(x))\|, which contradicts the optimality of FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)). We can conclude that for some r>0r>0, it holds g(y)=FnorΛ(τ(y))=ψ(y,ds(y))ϵ\|g(y)\|=\|F_{\mathrm{nor}}^{\Lambda}(\tau(y))\|=-\psi^{\prime}(y,d_{s}(y))\geq\epsilon for all yBr(x)y\in B_{r}(x). Hence, condition (A.2) is satisfied.

(ii) For (2.10), assumption (A.2) is a simple consequence of the continuity of FnatΛF^{\Lambda}_{\mathrm{nat}}. \square

Next, we present several assumptions on the iterates and sequences generated by Algorithm 2.

Assumption 4.3.

Let {xk}\{x^{k}\} and {Bk}\{B^{k}\} be generated by Algorithm 2. We assume:

  • (B.1)

    {xk}\{x^{k}\} is bounded, i.e., there exist R>0R>0 with {xk}BR(0)\{x^{k}\}\subseteq B_{R}(0).

  • (B.2)

    There exists κB>0\kappa_{B}>0 with supkBkκB<\sup_{k\in\mathbb{N}}\|B^{k}\|\leq\kappa_{B}<\infty.

  • (B.3)

    For any subsequence {k}=0\{k_{\ell}\}_{\ell=0}^{\infty}\subseteq\mathbb{N}, if {xk}\{x^{k_{\ell}}\} converges and we have αk0\alpha_{k_{\ell}}\rightarrow 0, then it holds that

ψ(xk+αks¯k)ψ(xk)αkψ(xk;s¯k)=o(αk).\psi(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-\psi(x^{k_{\ell}})-\alpha_{k_{\ell}}\psi^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})=o(\alpha_{k_{\ell}})\quad\ell\to\infty. (4.1)
  • (B.4)

    For every ϵ>0\epsilon>0 there is ϵ>0\epsilon^{\prime}>0 such that for all xkx^{k} with Γ(xk)ϵ\Gamma(x^{k})\geq\epsilon it follows Γ(xk,s¯k)ϵ\Gamma\left(x^{k},\bar{s}^{k}\right)\geq\epsilon^{\prime}.

The conditions (B.1)–(B.3) are standard assumptions. Condition (B.2) is frequently used in classical trust-region theory, see, e.g., [58, Theorem 4.5]. Assumption (B.3) is required to ensure uniform accuracy of the model function. Similar assumptions also appear in other convergence analyses of nonsmooth trust-region methods. For instance, condition A.2 in [64] and assumption 2.4(2c) in [20] have similar formats. However, as far as we can tell, a simple modification of the convergence analysis in previous works (such as [64, 21]) may not prove the convergence of our method. This is partly because assumption (B.3) is only required when the stepsize αk\alpha_{k} is smaller than or equal to some safeguard, see also (3.10). Moreover, even when the trust-region radius is relatively large and we pass the acceptance, the potential descent could be small. This is also further motivates our truncation step which allows to enlarge the stepsize.

Let us notice that condition (B.3) is similar to but weaker than the concept of “strict models” introduced by Noll in [59]. In particular, assumption (B.3) only needs to hold at accumulation points of {xk}\{x^{k}\} and along the specific and associated directions {s¯k}\{\bar{s}^{k}\} while typical strict model assumptions are formulated uniformly for all points and directions in n\mathbb{R}^{n}. In the following lemmas, we verify the conditions (B.3) and (B.4) for two exemplary cases.

Lemma 4.4.

Suppose that ψ\psi is given by ψ=f+φ\psi=f+\varphi with f\nabla f being locally Lipschitz continuous and assume that condition (B.1) holds. Then, (B.3) is satisfied in the following cases:

  • (i)

    The mapping φ\varphi is polyhedral and we have Γ(x,d)Γmax(x,d)\Gamma(x,d)\leq\Gamma_{\max}(x,d).

  • (ii)

    The problem is in the group lasso format and we set Γ(x,d)\Gamma(x,d) as in (2.12).

Proof. The local Lipschitz continuity of f\nabla f and the boundedness of {xk}\{x^{k}\} imply that for any subsequence {k}=0\{k_{\ell}\}_{\ell=0}^{\infty}\subseteq\mathbb{N} with xkxx^{k_{\ell}}\rightarrow x and αk0\alpha_{k_{\ell}}\rightarrow 0, it holds that f(xk+αks¯k)f(xk)αkf(xk;s¯k)=o(αk)f(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-f(x^{k_{\ell}})-\alpha_{k_{\ell}}f^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})=o(\alpha_{k_{\ell}}). Thus, it suffices to prove

φ(xk+αks¯k)φ(xk)αkφ(xk;s¯k)=o(αk).\varphi(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-\varphi(x^{k_{\ell}})-\alpha_{k_{\ell}}\varphi^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})=o(\alpha_{k_{\ell}}). (4.2)

If φ\varphi is polyhedral, the function φ~x,d(t):=φ(x+td)\tilde{\varphi}_{x,d}(t):=\varphi\left(x+td\right) is linear on (0,Γmax(x,d))(0,\Gamma_{\max}(x,d)). Thus, (4.2) holds with the right side of the equality taken as zero.

In case of the group lasso problem and using the definition (2.12), we can see that Xik+αkθik0.5Xik\|X_{i}^{k_{\ell}}\|+\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}}\geq 0.5\|X_{i}^{k_{\ell}}\|, where θik:=Xik,Sik/(XikSik)\theta_{i}^{k_{\ell}}:=\langle X_{i}^{k_{\ell}},S_{i}^{k_{\ell}}\rangle/(\|X_{i}^{k_{\ell}}\|\cdot\|S_{i}^{k_{\ell}}\|). Thus, we have

φ(Xk+αkS¯k)\displaystyle\varphi(X^{k_{\ell}}+\alpha_{k_{\ell}}\bar{S}^{k_{\ell}}) φ(Xk)αkφ(Xk;S¯k)\displaystyle-\varphi(X^{k_{\ell}})-\alpha_{k_{\ell}}\varphi^{\prime}(X^{k_{\ell}};\bar{S}^{k_{\ell}})
=\displaystyle= Xik0Xik+αkS¯ikXikαkX¯ik,S¯ik\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}\|X_{i}^{k_{\ell}}+\alpha_{k_{\ell}}\bar{S}^{k_{\ell}}_{i}\|-\|X_{i}^{k_{\ell}}\|-\alpha_{k_{\ell}}\langle\bar{X}_{i}^{k_{\ell}},\bar{S}^{k_{\ell}}_{i}\rangle
=\displaystyle= Xik0(Xik2+αk2+2αkXikθik)1/2Xikαkθik\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}(\|X_{i}^{k_{\ell}}\|^{2}+\alpha_{k_{\ell}}^{2}+2\alpha_{k_{\ell}}\|X_{i}^{k_{\ell}}\|\theta_{i}^{k_{\ell}})^{1/2}-\|X_{i}^{k_{\ell}}\|-\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}}
=\displaystyle= Xik0((Xik+αkθik)2+αk2(1(θik)2))1/2Xikαkθik\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}((\|X_{i}^{k_{\ell}}\|+\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}})^{2}+\alpha_{k_{\ell}}^{2}(1-(\theta_{i}^{k_{\ell}})^{2}))^{1/2}-\|X_{i}^{k_{\ell}}\|-\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}}
\displaystyle\leq Xik0αk2(1(θik)2)2(Xik+αkθik)Xik0αk21(θik)2Xik,\displaystyle\sum_{X_{i}^{k_{\ell}}\neq 0}\frac{\alpha_{k_{\ell}}^{2}(1-(\theta_{i}^{k_{\ell}})^{2})}{2(\|X_{i}^{k_{\ell}}\|+\alpha_{k_{\ell}}\theta_{i}^{k_{\ell}})}\leq\sum_{X_{i}^{k_{\ell}}\neq 0}\alpha_{k_{\ell}}^{2}\cdot\frac{1-(\theta_{i}^{k_{\ell}})^{2}}{\|X_{i}^{k_{\ell}}\|},

where the penultimate inequality follows from (a2+b)1/2ab/(2a)(a^{2}+b)^{1/2}-a\leq{b}/{(2a)} for a>0a>0, b>0b>0. For all ii, if limXik0\lim_{\ell\rightarrow\infty}X_{i}^{k_{\ell}}\neq 0, we have αk2/Xik=o(αk)\alpha_{k_{\ell}}^{2}/{\|X_{i}^{k_{\ell}}\|}=o(\alpha_{k_{\ell}}). If 0Xik00\neq X_{i}^{k_{\ell}}\rightarrow 0, definition (2.12) implies αk2(1(θik)2)/XikαkXikσ=o(αk)\alpha_{k_{\ell}}^{2}(1-(\theta_{i}^{k_{\ell}})^{2})/{\|X_{i}^{k_{\ell}}\|}\leq\alpha_{k_{\ell}}\|X_{i}^{k_{\ell}}\|^{\sigma}=o(\alpha_{k_{\ell}}). Therefore, condition (4.2) is satisfied. \square

Lemma 4.5.

Condition (B.4) is satisfied for the choices (2.11) and (2.12) (for group lasso problems).

Proof. Using (2.11), we immediately obtain Γ(x,d)Γ(x)\Gamma(x,d)\geq\Gamma(x) for all xx and dd with d=1\|d\|=1. Hence, in this case, we can set ϵ=ϵ\epsilon^{\prime}=\epsilon.

For the group lasso problem and (2.12), Example A.2 establishes Γ(X)=min{Xi:Xi0}\Gamma(X)=\min\left\{\|X_{i}\|:X_{i}\neq 0\right\}. Consequently, from Γ(X)ϵ\Gamma(X)\geq\epsilon, we can deduce

Γ(X,D)=min{Γmax(X,D),minXi0Xi1+σ1θi2,minXi0Ximax{2θi,0}}min{ϵ1+σ,ϵ/2},\Gamma(X,D)=\min\left\{\Gamma_{\max}(X,D),\min_{X_{i}\neq 0}\frac{\|X_{i}\|^{1+\sigma}}{1-\theta_{i}^{2}},\min_{X_{i}\neq 0}\frac{\|X_{i}\|}{\max\{-2\theta_{i},0\}}\right\}\geq\min\{\epsilon^{1+\sigma},\epsilon/2\},

where θi:=Xi,Di/(XiDi)\theta_{i}:={\langle X_{i},D_{i}\rangle}/{(\|X_{i}\|\cdot\|D_{i}\|)} for any DD with D=1\|D\|=1. \square

4.2 Convergence Analysis

In this subsection, we will prove that every accumulation point of {xk}k\{x^{k}\}_{k} is a stationary point of (1.1). First, in Lemma 4.6, we derive a global version of assumption (B.3) over the ball BR(0)¯\overline{B_{R}(0)}.

Lemma 4.6.

Suppose that (B.1) and (B.3) are satisfied and that Algorithm 2 does not terminate within finitely many steps. Let {xk}k\{x^{k}\}_{k} be a sequence generated by Algorithm 2. Then, there exists a function h:(0,)[0,]h:(0,\infty)\rightarrow[0,\infty] with limΔ0+h(Δ)=0\lim_{\Delta\rightarrow 0^{+}}h(\Delta)=0 and

ψ(xk+αks¯k)ψ(xk)αkψ(xk;s¯k)h(Δ)αk,\psi(x^{k}+\alpha_{k}\bar{s}^{k})-\psi(x^{k})-\alpha_{k}\psi^{\prime}(x^{k};\bar{s}^{k})\leq h(\Delta)\alpha_{k}, (4.3)

for all kk with ΔkΔ\Delta_{k}\leq\Delta.

Proof. We set

h(Δ)=max{supj:ΔjΔψ(xj+αjs¯j)ψ(xj)αjψ(xj;s¯j)αj,0}.h(\Delta)=\max\left\{\sup_{j:\ \Delta_{j}\leq\Delta}\frac{\psi(x^{j}+\alpha_{j}\bar{s}^{j})-\psi(x^{j})-\alpha_{j}\psi^{\prime}(x^{j};\bar{s}^{j})}{\alpha_{j}},0\right\}.

From the definition, it directly follows h(Δ1)h(Δ2)h(\Delta^{1})\leq h(\Delta^{2}) for all 0<Δ1<Δ2<0<\Delta^{1}<\Delta^{2}<\infty. Thus, it suffices to show that for every ϵ>0\epsilon>0 there exists Δ>0\Delta>0 such that h(Δ)ϵh(\Delta)\leq\epsilon. Let us assume that for some ϵ>0\epsilon>0 we have h(Δ)>ϵh(\Delta)>\epsilon for all Δ>0\Delta>0. Then, there exists a subsequence {k}=0\{k_{\ell}\}_{\ell=0}^{\infty}\subseteq\mathbb{N}, such that Δk0\Delta_{k_{\ell}}\rightarrow 0 and

ψ(xk+αks¯k)ψ(xk)αkψ(xk;s¯k)αkϵ.\frac{\psi(x^{k_{\ell}}+\alpha_{k_{\ell}}\bar{s}^{k_{\ell}})-\psi(x^{k_{\ell}})-\alpha_{k_{\ell}}\psi^{\prime}(x^{k_{\ell}};\bar{s}^{k_{\ell}})}{\alpha_{k_{\ell}}}\geq\epsilon\quad\forall~{}\ell\in\mathbb{N}. (4.4)

Since {xk}\{x^{k_{\ell}}\}_{\ell} is bounded, it has a convergent subsequence {xkm}m\{x^{k_{\ell_{m}}}\}_{m}. Due to Δk0\Delta_{k_{\ell}}\rightarrow 0 it follows αkm0\alpha_{k_{\ell_{m}}}\rightarrow 0. Therefore, (4.1) and (4.4) yield a contradiction. \square

Recall that the iterates xk+1x^{k+1} result from a possible truncation of the trust-region steps x~k\tilde{x}^{k}. We now prove that these truncation steps and the potential increase of the objective function ψ\psi can be controlled.

Lemma 4.7.

Suppose that ψ\psi can be truncated and that assumption (B.1) holds. Then, we have k=0xk+1x~kmκi=0ϵi<\sum_{k=0}^{\infty}\|x^{k+1}-\tilde{x}^{k}\|\leq m\kappa\sum_{i=0}^{\infty}\epsilon_{i}<\infty and k=0|ψ(xk+1)ψ(x~k)|<\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty.

Proof. The estimate k=0xk+1x~kmκi=0ϵi<\sum_{k=0}^{\infty}\|x^{k+1}-\tilde{x}^{k}\|\leq m\kappa\sum_{i=0}^{\infty}\epsilon_{i}<\infty follows directly from Definition 2.5 and the settings in the truncation step. Hence, due to the local Lipschitz continuity of ψ\psi and the boundedness of {xk}k\{x^{k}\}_{k}, there exists a constant Lψ>0L_{\psi}>0 such that k=0|ψ(xk+1)ψ(x~k)|Lψmκi=0ϵi<\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|\leq L_{\psi}m\kappa\sum_{i=0}^{\infty}\epsilon_{i}<\infty. \square

The next theorem is a weak global convergence result for Algorithm 2. In the proof, we combine our specific stepsize strategy and the truncation step to guarantee accuracy of the model and sufficient descent in \psi. More specifically, on the one hand, under the assumption that \|g^{k}\| has a positive lower bound, the stepsize strategy ensures that \Delta_{k} cannot become arbitrarily small; on the other hand, the truncation step guarantees that the stepsize safeguard has a positive lower bound on an infinite set of iterations. Combining these two observations, the total descent would be infinite, which is a contradiction. We point out that even though Algorithm 2 may not compute \rho^{k}_{2} in every iteration, we still use it in our analysis.

Theorem 4.8.

Suppose that the conditions (A.1) and (B.1)-(B.4) are satisfied and that ψ\psi can be truncated. Furthermore, let us assume that Algorithm 2 does not terminate in finitely many steps and let {xk}k\{x^{k}\}_{k} be the generated sequence of iterates. Then, it holds that

lim infkgk=0.\liminf_{k\rightarrow\infty}\|g^{k}\|=0.

Proof. Since Algorithm 2 does not terminate in finitely many steps, we have

ψ(xk;d(xk))<0andd(xk)=1k.\psi^{\prime}(x^{k};d(x^{k}))<0\quad\text{and}\quad\|d(x^{k})\|=1\quad\forall~{}k\in\mathbb{N}.

Suppose there exist ϵ>0\epsilon>0 and K+K\in\mathbb{N}_{+} such that gk>ϵ\|g^{k}\|>\epsilon for all kKk\geq K. By the definition of the Cauchy point, we know that mk(sCk)mk(skd(xk))m_{k}(s^{k}_{C})\leq m_{k}(\|s^{k}\|d(x^{k})). Using (3.7), we obtain

mk(0)mk(sk)(1(sk))[mk(0)mk(sCk)](1(sk))[mk(0)mk(skd(xk))],\begin{split}m_{k}(0)-m_{k}(s^{k})\geq(1-\ell(\|s^{k}\|))[m_{k}(0)-m_{k}(s^{k}_{C})]\geq(1-\ell(\|s^{k}\|))[m_{k}(0)-m_{k}(\|s^{k}\|d(x^{k}))],\end{split}

and we have

gk,sk(1(sk))skd(xk)12sk,Bksk+1(sk)2skd(xk),Bkskd(xk)κBsk2.\begin{split}\langle g^{k},s^{k}-(1-\ell(\|s^{k}\|))\|s^{k}\|d(x^{k})\rangle\leq-\frac{1}{2}\langle s^{k},B^{k}s^{k}\rangle+\frac{1-\ell(\|s^{k}\|)}{2}\langle\|s^{k}\|d(x^{k}),B^{k}\|s^{k}\|d(x^{k})\rangle\leq\kappa_{B}\|s^{k}\|^{2}.\end{split}

Due to g¯k=d(xk)\bar{g}^{k}=-d(x^{k}), it holds that

d(xk),s¯k(1(sk))d(xk)κBgkskκBϵsk\langle-d(x^{k}),\bar{s}^{k}-(1-\ell(\|s^{k}\|))d(x^{k})\rangle\leq\frac{\kappa_{B}}{\|g^{k}\|}\|s^{k}\|\leq\frac{\kappa_{B}}{\epsilon}\|s^{k}\|

for every kKk\geq K which implies d(xk),s¯k<1+(sk)+κBϵsk-\langle d(x^{k}),\bar{s}^{k}\rangle<-1+\ell(\|s^{k}\|)+\frac{\kappa_{B}}{\epsilon}\|s^{k}\|. Hence, we have

s¯kd(xk)=22d(xk),s¯k2(sk)+2κBϵsk.\|\bar{s}^{k}-d(x^{k})\|=\sqrt{2-2\langle d(x^{k}),\bar{s}^{k}\rangle}\leq\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}. (4.5)

Using the boundedness of \{x^{k}\}_{k}, the fact that \psi^{\prime}(x;d) is Lipschitz continuous in d locally uniformly in x (see, e.g., [21]), and (4.5), we can derive that there exists L_{\psi}>0 such that

|ψ(xk;s¯k)ψ(xk;d(xk))|Lψ2(sk)+2κBϵskkK,|\psi^{\prime}(x^{k};\bar{s}^{k})-\psi^{\prime}(x^{k};d(x^{k}))|\leq L_{\psi}\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}\quad\forall~{}k\geq K,

which combined with gk=u(xk)d(xk)g^{k}=u(x^{k})d(x^{k}) and (2.2) yields

αkψ(xk;s¯k)gk,αkd(xk)=αk(ψ(xk;s¯k)u(xk))αk(ψ(xk;d(xk))u(xk))+αkLψ2(sk)+2κBϵskαkLψ2(sk)+2κBϵsk.\begin{split}\alpha_{k}\psi^{\prime}(x^{k};\bar{s}^{k})-\langle g^{k},\alpha_{k}d(x^{k})\rangle=&\alpha_{k}(\psi^{\prime}(x^{k};\bar{s}^{k})-u(x^{k}))\\ \leq&\alpha_{k}(\psi^{\prime}(x^{k};d(x^{k}))-u(x^{k}))+\alpha_{k}L_{\psi}\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}\\ \leq&\alpha_{k}L_{\psi}\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}.\end{split} (4.6)

Together with (4.5), we obtain

gk,d(xk)gk,s¯kgkd(xk)s¯kgk2(sk)+2κBϵsk.\begin{split}\langle g^{k},d(x^{k})\rangle-\langle g^{k},\bar{s}^{k}\rangle\leq\|g^{k}\|\|d(x^{k})-\bar{s}^{k}\|\leq\|g^{k}\|\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}.\end{split} (4.7)

Using the definition of the function hh in (4.3) and combining (4.6) and (4.7), it follows

ψ(xk+αks¯k)ψ(xk)gk,αks¯k\displaystyle\psi(x^{k}+\alpha_{k}\bar{s}^{k})-\psi(x^{k})-\langle g^{k},\alpha_{k}\bar{s}^{k}\rangle\leq αk[(Lψ+gk)2(sk)+2κBϵsk+h(Δk)]\displaystyle\alpha_{k}\left[(L_{\psi}+\|g^{k}\|)\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}+h(\Delta_{k})\right]
\displaystyle\leq αk[(Lψ+gk)2(Δk)+2κBϵΔk+h(Δk)].\displaystyle\alpha_{k}\left[(L_{\psi}+\|g^{k}\|)\sqrt{2\ell(\Delta_{k})+\frac{2\kappa_{B}}{\epsilon}\Delta_{k}}+h(\Delta_{k})\right].

For all kKk\geq K, setting ν(Δk):=[2(Δk)+2κBΔk/ϵ]12\nu(\Delta_{k}):=[2\ell(\Delta_{k})+2\kappa_{B}\Delta_{k}/\epsilon]^{\frac{1}{2}}, we now get

mk(0)mk(αks¯k)\displaystyle m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k}) =αk[gk,s¯kαk2s¯k,Bks¯k]\displaystyle=\alpha_{k}\left[\langle-g^{k},\bar{s}^{k}\rangle-\frac{\alpha_{k}}{2}\langle\bar{s}^{k},B^{k}\bar{s}^{k}\rangle\right]
αk[gk,d(xk)gk2(sk)+2κBϵskαkκB2]\displaystyle\geq\alpha_{k}\left[\langle-g^{k},d(x^{k})\rangle-\|g^{k}\|\sqrt{2\ell(\|s^{k}\|)+\frac{2\kappa_{B}}{\epsilon}\|s^{k}\|}-\frac{\alpha_{k}\kappa_{B}}{2}\right]
αkgk[1ν(Δk)κB2ϵΔk],\displaystyle\geq\alpha_{k}\|g^{k}\|\left[1-\nu(\Delta_{k})-\frac{\kappa_{B}}{2\epsilon}\Delta_{k}\right],

and

\displaystyle 1-\rho^{2}_{k}=1-\frac{\psi(x^{k})-\psi(x^{k}+\alpha_{k}\bar{s}^{k})}{m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})}=\frac{\psi(x^{k}+\alpha_{k}\bar{s}^{k})-\psi(x^{k})-\alpha_{k}\langle g^{k},\bar{s}^{k}\rangle-\frac{\alpha_{k}^{2}}{2}\langle\bar{s}^{k},B^{k}\bar{s}^{k}\rangle}{m_{k}(0)-m_{k}(\alpha_{k}\bar{s}^{k})}
\displaystyle\leq\frac{(L_{\psi}+\|g^{k}\|)\nu(\Delta_{k})+h(\Delta_{k})+\frac{\kappa_{B}}{2}\Delta_{k}}{\|g^{k}\|\left[1-\nu(\Delta_{k})-\frac{\kappa_{B}}{2\epsilon}\Delta_{k}\right]}\leq\frac{(\frac{L_{\psi}}{\epsilon}+1)\nu(\Delta_{k})+\frac{1}{\epsilon}h(\Delta_{k})+\frac{\kappa_{B}}{2\epsilon}\Delta_{k}}{1-\nu(\Delta_{k})-\frac{\kappa_{B}}{2\epsilon}\Delta_{k}}.

Thus, there exists \sigma\in(0,\epsilon/\kappa_{B}) such that 1-\rho^{2}_{k}<1-\eta_{1} for every k\geq K with \Delta_{k}\leq\sigma. This implies \rho^{2}_{k}>\eta_{1} for all k\geq K satisfying \Delta_{k}\leq\sigma, which means that those steps are at least “subsuccessful”. Hence, we can infer

Δkmin{ΔK,r1σ}kK.\Delta_{k}\geq\min\{\Delta_{K},r_{1}{\sigma}\}\quad\forall~{}k\geq K. (4.8)

Next, let us set 𝒦={kKρk1η1orρk2η}\mathcal{K}=\{k\geq K\mid\rho^{1}_{k}\geq\eta_{1}\ \text{or}\ \rho^{2}_{k}\geq\eta\} and

𝒦i={k𝒦xkNi=Si\Si+1}i=0,1,2,,m.\mathcal{K}_{i}=\{k\in\mathcal{K}\mid x^{k}\in N_{i}=S_{i}\backslash S_{i+1}\}\quad i=0,1,2,\cdots,m.

Due to (4.8), we have |𝒦|=|\mathcal{K}|=\infty and applying Lemma 4.7, it follows

k𝒦[ψ(xk)ψ(x~k)]k𝒦[ψ(xk)ψ(xk+1)]+k𝒦|ψ(xk+1)ψ(x~k)|ψ(x0)Lb+k=0|ψ(xk+1)ψ(x~k)|<,\begin{split}\sum_{k\in\mathcal{K}}[\psi(x^{k})-\psi(\tilde{x}^{k})]\leq&\sum_{k\in\mathcal{K}}[\psi(x^{k})-\psi(x^{k+1})]+\sum_{k\in\mathcal{K}}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|\\ \leq&\psi(x^{0})-L_{b}+\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty,\end{split}

where we used xk+1=x~k=xkx^{k+1}=\tilde{x}^{k}=x^{k} for all k𝒦k\notin\mathcal{K}. Hence, we have

k𝒦[mk(0)mk(x~kxk)]1ηk𝒦[ψ(xk)ψ(x~k)]<.\sum_{k\in\mathcal{K}}[m_{k}(0)-m_{k}(\tilde{x}^{k}-x^{k})]\leq\frac{1}{\eta}\sum_{k\in\mathcal{K}}[\psi(x^{k})-\psi(\tilde{x}^{k})]<\infty. (4.9)

We now define the index i_{0}=\max\{i\in\{0,1,2,\cdots,m\}:|\mathcal{K}_{i}|=\infty\}. By the maximality of i_{0}, we can conclude that only finitely many elements of the sequence \{x^{k}\}_{k} belong to S_{i_{0}+1}. This implies that the truncation operator T is only applied a finite number of times to points in N_{i_{0}}=S_{i_{0}}\backslash S_{i_{0}+1}. In particular, T moves points from S_{i_{0}}\backslash S_{i_{0}+1} to the set S_{i_{0}+1} and, after a certain number of iterations K^{\prime}, the counter c_{i_{0}} is not updated anymore, i.e., we have c_{i_{0}}\equiv c for some c. Then, it follows that \Gamma(x^{k})\geq\epsilon_{c} for all x^{k}\in N_{i_{0}} and k\geq K^{\prime} and, by (B.4), there exists \epsilon^{\prime}>0 such that \Gamma(x^{k},\bar{s}^{k})\geq\epsilon^{\prime} for all x^{k}\in N_{i_{0}} and k\geq K^{\prime}. Combining (3.5), (3.6), and (3.11), we can always guarantee descent in the model. In particular, we obtain

mk(0)mk(x~kxk)x~kxk4skδ1gkmin{Δk,δ2gk},m_{k}(0)-m_{k}(\tilde{x}^{k}-x^{k})\geq\frac{\|\tilde{x}^{k}-x^{k}\|}{4\|s^{k}\|}\cdot{\delta}_{1}\|g^{k}\|\min\left\{\Delta_{k},{\delta}_{2}\|g^{k}\|\right\}, (4.10)

where δ1=min{γ1,1}{\delta}_{1}=\min\{\gamma_{1},1\} and δ2=min{γ2,1/κB}{\delta_{2}}=\min\{\gamma_{2},1/\kappa_{B}\}. Thus, we can conclude that

>k𝒦[mk(0)mk(x~kxk)]k𝒦i0,kKδ1x~kxk4skgkmin{Δk,δ2gk}k𝒦i0,kKδ1ϵ4min{Γ(xk,s¯k),sk}skmin{ΔK,r1σ,δ2ϵ}k𝒦i0,kKδ1ϵ4min{ϵΔmax,1}min{ΔK,r1σ,δ2ϵ}=,\begin{split}\infty&>\sum_{k\in\mathcal{K}}[m_{k}(0)-m_{k}(\tilde{x}^{k}-x^{k})]\geq\sum_{k\in\mathcal{K}_{i_{0}},k\geq K^{\prime}}\frac{{\delta}_{1}\|\tilde{x}^{k}-x^{k}\|}{4\|s^{k}\|}\|g^{k}\|\min\left\{\Delta_{k},{\delta}_{2}\|g^{k}\|\right\}\\ &\geq\sum_{k\in\mathcal{K}_{i_{0}},k\geq K^{\prime}}\frac{{\delta}_{1}\epsilon}{4}\frac{\min\left\{\Gamma(x^{k},\bar{s}^{k}),\|s^{k}\|\right\}}{\|s^{k}\|}\min\left\{\Delta_{K},r_{1}\sigma,{\delta}_{2}\epsilon\right\}\\ &\geq\sum_{k\in\mathcal{K}_{i_{0}},k\geq K^{\prime}}\frac{{\delta}_{1}\epsilon}{4}\min\left\{\frac{\epsilon^{\prime}}{{\Delta}_{\max}},1\right\}\min\left\{\Delta_{K},r_{1}{\sigma},{\delta}_{2}\epsilon\right\}=\infty,\end{split}

which is a contradiction. \square

Next, we prove a stronger version of our global result under the additional assumption (A.2). Specifically, we show that every accumulation point of Algorithm 2 is a stationary point of (1.2). This is a standard global convergence result, see, e.g., [64].

Theorem 4.9.

Let the conditions (A.1)–(A.2) and (B.1)–(B.4) be satisfied and suppose that ψ\psi can be truncated. Assume that Algorithm 2 does not terminate after finitely many steps and that it generates a sequence {xk}k\{x^{k}\}_{k} with an accumulation point xx^{*}. Then, xx^{*} is a stationary point of (1.2).

Proof. We assume that xx^{*} is not a stationary point of (1.2). By (A.2) there exist r,ϵ>0r,\epsilon>0 such that g(y)ϵ\|g(y)\|\geq\epsilon for all yBr(x)y\in B_{r}(x^{*}). Let us set Ak=max{ψ(xk+1)ψ(x~k),0}A^{k}=\max\{\psi(x^{k+1})-\psi(\tilde{x}^{k}),0\} for kk\in\mathbb{N}. Applying Lemma 4.7, we know that k=0Ak<\sum_{k=0}^{\infty}A^{k}<\infty. For any k>kk^{\prime}>k, we have

ψ(xk)=ψ(xk)+t=kk1(ψ(xt+1)ψ(x~t))+t=kk1(ψ(x~t)ψ(xt))ψ(xk)+t=kk1Atψ(xk)+t=kAt,\begin{split}\psi(x^{k^{\prime}})&=\psi(x^{k})+\sum_{t=k}^{k^{\prime}-1}(\psi(x^{t+1})-\psi(\tilde{x}^{t}))+\sum_{t=k}^{k^{\prime}-1}(\psi(\tilde{x}^{t})-\psi(x^{t}))\\ &\leq\psi(x^{k})+\sum_{t=k}^{k^{\prime}-1}A^{t}\leq\psi(x^{k})+\sum_{t=k}^{\infty}A^{t},\end{split}

where we used the descent property ψ(x~t)ψ(xt)0\psi(\tilde{x}^{t})-\psi(x^{t})\leq 0. Consequently, we can infer

lim supkψ(xk)lim infkψ(xk)+limkt=kAtlim infkψ(xk),\limsup_{k^{\prime}\rightarrow\infty}\psi(x^{k^{\prime}})\leq\liminf_{k\rightarrow\infty}\psi(x^{k})+\lim_{k\rightarrow\infty}\sum_{t=k}^{\infty}A^{t}\leq\liminf_{k\rightarrow\infty}\psi(x^{k}),

which implies that {ψ(xk)}k\{\psi(x^{k})\}_{k} converges. Next, Lemma 4.7 implies that there exists a constant KK\in\mathbb{N} such that k=Kxk+1x~kr4\sum_{k=K}^{\infty}\|x^{k+1}-\tilde{x}^{k}\|\leq\frac{r}{4}. There is a subsequence {xk}k𝒦{xk}k=K\{x^{k}\}_{k\in\mathcal{K}}\subseteq\{x^{k}\}_{k=K}^{\infty} satisfying {xk}k𝒦Br/4(x)\{x^{k}\}_{k\in\mathcal{K}}\subseteq B_{r/4}(x^{*}) and xkx,k,k𝒦x^{k}\rightarrow x^{*},\ k\rightarrow\infty,k\in\mathcal{K}. For any k𝒦k\in\mathcal{K}, since g(y)ϵ,yBr(x)\|g(y)\|\geq\epsilon,\ \forall~{}y\in B_{r}(x^{*}) and lim infkgk=0\liminf_{k^{\prime}}\|g^{k^{\prime}}\|=0 by Theorem 4.8, there must be some kkk^{\prime}\geq k such that xkBr(x)x^{k^{\prime}}\notin B_{r}(x^{*}). Set l(k)=sup{kk:xtBr(x),ktk}l(k)=\sup\{k^{\prime}\geq k:x^{t}\in B_{r}(x^{*}),\ \forall~{}k\leq t\leq k^{\prime}\}. Thus, it holds that

t=kl(k)x~txt+r4t=kl(k)(xt+1x~t+x~txt)t=kl(k)xt+1xtxl(k)+1xk3r4\begin{split}\sum_{t=k}^{l(k)}\|\tilde{x}^{t}-x^{t}\|+\frac{r}{4}&\geq\sum_{t=k}^{l(k)}(\|x^{t+1}-\tilde{x}^{t}\|+\|\tilde{x}^{t}-x^{t}\|)\\ &\geq\sum_{t=k}^{l(k)}\|x^{t+1}-x^{t}\|\geq\|x^{l(k)+1}-x^{k}\|\geq\frac{3r}{4}\end{split}

and it follows t=kl(k)x~txtr2\sum_{t=k}^{l(k)}\|\tilde{x}^{t}-x^{t}\|\geq\frac{r}{2}. Mimicking the last steps in the proof of Theorem 4.8, we get

ψ(xk)ψ(xl(k)+1)\displaystyle\psi(x^{k})-\psi(x^{l(k)+1})\geq t=kl(k)(ψ(xt)ψ(x~t)At)\displaystyle\sum_{t=k}^{l(k)}(\psi(x^{t})-\psi(\tilde{x}^{t})-A^{t})
\displaystyle\geq ktl(k),ρt1η1 or ρt2ηη[mt(0)mt(x~txt)]t=kAt\displaystyle\sum_{k\leq t\leq l(k),\ \rho^{1}_{t}\geq\eta_{1}\text{ or }\rho^{2}_{t}\geq\eta}\eta[m_{t}(0)-m_{t}(\tilde{x}^{t}-x^{t})]-\sum_{t=k}^{\infty}A^{t}
\displaystyle\geq ktl(k),ρt1η1 or ρt2ηηδ1x~txt4stgtmin{Δt,δ2gt}t=kAt\displaystyle\sum_{k\leq t\leq l(k),\ \rho^{1}_{t}\geq\eta_{1}\text{ or }\rho^{2}_{t}\geq\eta}\frac{\eta{\delta}_{1}\|\tilde{x}^{t}-x^{t}\|}{4\|s^{t}\|}\|g^{t}\|\min\left\{\Delta_{t},{\delta}_{2}\|g^{t}\|\right\}-\sum_{t=k}^{\infty}A^{t}
\displaystyle\geq ktl(k),ρt1η1 or ρt2ηηδ1ϵx~txt4min{1,δ2ϵΔmax}t=kAt\displaystyle\sum_{k\leq t\leq l(k),\ \rho^{1}_{t}\geq\eta_{1}\text{ or }\rho^{2}_{t}\geq\eta}\frac{\eta{\delta}_{1}\epsilon\|\tilde{x}^{t}-x^{t}\|}{4}\min\left\{1,\frac{{\delta}_{2}\epsilon}{{\Delta}_{\max}}\right\}-\sum_{t=k}^{\infty}A^{t}
\displaystyle\geq ηδ1ϵr8min{1,δ2ϵΔmax}t=kAt.\displaystyle\frac{\eta{\delta}_{1}\epsilon r}{8}\min\left\{1,\frac{{\delta}_{2}\epsilon}{{\Delta}_{\max}}\right\}-\sum_{t=k}^{\infty}A^{t}.

Taking the limit 𝒦k\mathcal{K}\ni k\rightarrow\infty we obtain the contradiction 0ηδ1ϵr8min{1,δ2ϵΔmax1}0\geq\frac{\eta{\delta}_{1}\epsilon r}{8}\min\{1,{\delta}_{2}\epsilon{\Delta}_{\max}^{-1}\}. \square

Remark 4.10.

Theorem 4.9 essentially establishes a similar result as in [64, Theorem 3.4]. We notice that instead of boundedness of the level set {xnψ(x)ψ(x0)}\{x\in\mathbb{R}^{n}\mid\psi(x)\leq\psi(x^{0})\} (which was used in [64]), we need to work with the slightly stronger assumption (B.1) here since the truncation step can increase the objective function value ψ\psi. However, if the function ψ\psi satisfies a Lipschitz-type assumption, i.e., if there are ϵ>0\epsilon>0 and L0L\geq 0 such that |ψ(x)ψ(y)|Lxy|\psi(x)-\psi(y)|\leq L\|x-y\| for all x,yx,y with xyϵ\|x-y\|\leq\epsilon, then the proof of Lemma 4.7 implies k=0|ψ(xk+1)ψ(x~k)|<\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty. This can be combined with the descent property ψ(x~k)ψ(xk)\psi(\tilde{x}^{k})\leq\psi(x^{k}) of the trust-region step to show that the iterates {xk}\{x^{k}\} will stay in the level set {xn:ψ(x)ζ}\{x\in\mathbb{R}^{n}:\psi(x)\leq\zeta\} where ζ=ψ(x0)+k=0|ψ(xk+1)ψ(x~k)|<\zeta=\psi(x^{0})+\sum_{k=0}^{\infty}|\psi(x^{k+1})-\psi(\tilde{x}^{k})|<\infty. Therefore, (B.1) can be substituted by a more classical level set condition in such a situation.

Finally, utilizing the natural residual, it is possible to obtain a strong lim-type convergence result for Algorithm 2. In contrast to Theorem 4.8, which only states that g^{k} converges to zero along a subsequence, the next theorem shows that a nonsmooth residual of x^{k} converges to zero along the whole sequence.

Theorem 4.11.

Suppose that the same assumptions stated in Theorem 4.9 are satisfied. Then, it holds that limkFnatΛ(xk)=0\lim_{k\rightarrow\infty}\|F_{\mathrm{nat}}^{\Lambda}(x^{k})\|=0.

Proof. Suppose that there exists ϵ>0\epsilon>0 and an infinite subsequence {xk}k𝒦\{x^{k}\}_{k\in\mathcal{K}} of {xk}k=0\{x^{k}\}_{k=0}^{\infty} satisfying

FnatΛ(xk)ϵk𝒦.\displaystyle\|F_{\mathrm{nat}}^{\Lambda}(x^{k})\|\geq\epsilon\quad\forall~{}k\in\mathcal{K}. (4.11)

By (B.1), {xk}k𝒦\{x^{k}\}_{k\in\mathcal{K}} has another subsequence {xk}k𝒦1\{x^{k}\}_{k\in\mathcal{K}_{1}} with limit x=lim𝒦1kxkx^{*}=\lim_{\mathcal{K}_{1}\ni k\rightarrow\infty}x^{k}. By Theorem 4.9, xx^{*} is a stationary point of (1.2) with FnatΛ(x)=0F_{\mathrm{nat}}^{\Lambda}(x^{*})=0. Using the continuity of FnatΛF_{\mathrm{nat}}^{\Lambda}, this contradicts (4.11). \square

5 Fast Local Convergence

To the best of our knowledge, only limited local convergence results are available for nonsmooth trust-region type methods and most of the existing work focuses on the global convergence analysis, see, e.g., [64, 34, 3]. In this section, we investigate local properties of our algorithm. Specifically, we establish fast local convergence for the composite program \psi=f+\varphi when f is a smooth mapping and \varphi is real-valued, convex, and partly smooth relative to an affine subspace. Our local results require that the first- and second-order information, i.e., g^{k} and B^{k}, is chosen as the Riemannian gradient and the Riemannian Hessian with respect to some active manifold. We also show that this information can be obtained without knowing the active manifold under suitable assumptions.

5.1 Definitions and Assumptions

In this subsection, we state some elementary definitions and assumptions. The family of partly smooth functions was originally introduced in [46] and plays a fundamental role in nonsmooth optimization. In particular, the concept of partial smoothness is utilized in the convergence analysis of nonsmooth optimization algorithms and to derive activity identification properties, see, e.g., [48, 63]. Since the mapping \varphi in (1.1) is real-valued and convex, we use the definition of partly smooth functions given in [48]. For a more general version and further details, we refer to [46].

Definition 5.1.

[48, Definition 3.1] A proper convex and lower semicontinuous function φ\varphi is said to be partly smooth at xx relative to a set \mathcal{M} containing xx if φ(x)\partial\varphi(x)\neq\emptyset and we have:

  • (i)

    Smoothness: \mathcal{M} is a C2C^{2}-manifold around xx and φ\varphi restricted to \mathcal{M} is C2C^{2} around xx;

  • (ii)

    Sharpness: The tangent space T(x)T_{\mathcal{M}}(x) coincides with Tx:=par(φ(x))T_{x}:=\text{par}(\partial\varphi(x))^{\perp}, where par(A)=span(AA)\text{par}(A)=\text{span}(A-A) for a convex set AnA\subset\mathbb{R}^{n}.

  • (iii)

    Continuity: The set-valued mapping φ\partial\varphi is continuous at xx relative to \mathcal{M}.

Let {Si}i=0m\{S_{i}\}_{i=0}^{m} be the sequence of sets associated with the truncation operator of ψ\psi and let {xk}k\{x^{k}\}_{k} be generated by Algorithm 2. We further consider an accumulation point xx^{*} of {xk}k\{x^{k}\}_{k} with xSi\Si+1x^{*}\in S_{i^{*}}\backslash S_{i^{*}+1} and i{0,1,,m}i^{*}\in\{0,1,\cdots,m\} and we make the following assumptions.

Assumption 5.2.

We consider the following conditions:

  • (C.1)

    The mapping φ\varphi is partly smooth at xx^{*} relative to an affine subspace \mathcal{M} and it holds that Br(x)Si=Br(x)B_{r}(x^{*})\cap S_{i^{*}}=B_{r}(x^{*})\cap\mathcal{M} for all r(0,Γ(x))r\in(0,\Gamma(x^{*})).

  • (C.2)

    The Riemannian Hessian 2ψ(x)\nabla^{2}_{\mathcal{M}}\psi(x) is locally Lipschitz continuous around xx^{*} restricted to \mathcal{M} and the second-order sufficient condition is satisfied at xx^{*}, i.e., we have 2ψ(x)[ξ,ξ]cξ2\nabla^{2}_{\mathcal{M}}\psi(x^{*})[\xi,\xi]\geq c\|\xi\|^{2} for some positive constant cc and all ξT(x)\xi\in T_{\mathcal{M}}(x^{*}).

  • (C.3)

    The strict complementary condition f(x)riφ(x)-\nabla f(x^{*})\in\text{ri}\ \partial\varphi(x^{*}) is satisfied.

  • (C.4)

    For all xSix\in S_{i}, ySjy\in S_{j} with i<ji<j, it holds that Γmax(x,yx¯)yx\Gamma_{\max}\left(x,\overline{y-x}\right)\leq\|y-x\| where yx¯=yxyx\overline{y-x}=\frac{y-x}{\|y-x\|}. For every r(0,Γ(x))r\in(0,\Gamma(x^{*})), there exists ϵ(r)>0\epsilon(r)>0 such that Γ(x)ϵ(r)\Gamma(x)\geq\epsilon(r) for all xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M}.

  • (C.5)

    The sequence {xk}k\{x^{k}\}_{k} converges with limit limkxk=x\lim_{k\rightarrow\infty}x_{k}=x^{*}.

Besides partial smoothness, assumption (C.1) requires the local structure of S_{i^{*}} around x^{*} to be affine. The conditions (C.2), (C.3), and (C.5) are standard assumptions for finite activity identification and for establishing local convergence rates; for instance, they appear in [48, 63].

In order to illustrate assumption (C.4), we consider the example \varphi(x)=\|x\|_{1}. Suppose that x\in S_{i} and y\in S_{j} are two given points with i<j. Since y has more zero components than x, there exists a point on the line segment connecting x and y at which \varphi is not differentiable. This immediately leads to \Gamma_{\max}(x,\overline{y-x})\leq\|y-x\|. The second part of (C.4) requires that \Gamma does not decay sharply around x^{*} restricted to \mathcal{M}. We will use condition (C.4) in the analysis of the truncation step.

5.2 Riemannian Gradient and Riemannian Hessian

We now choose gk=FnorΛ(τ(xk))g^{k}=F_{\mathrm{nor}}^{\Lambda}(\tau(x^{k})). The next lemma shows that this choice actually coincides with the Riemannian gradient of ψ\psi when xkx^{k} lies in the manifold \mathcal{M} and is close to xx^{*}.

Lemma 5.3.

Suppose that the conditions (C.1) and (C.3) hold and that xx^{*} is a stationary point. Let xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M} be given for r(0,Γ(x))r\in(0,\Gamma(x^{*})) sufficiently small. Then, we have FnorΛ(τ(x))=ψ(x;ds(x))ds(x)=ψ(x)F_{\mathrm{nor}}^{\Lambda}(\tau(x))=\psi^{\prime}(x;d_{s}(x))d_{s}(x)=\nabla_{\mathcal{M}}\psi(x), where ψ(x)\nabla_{\mathcal{M}}\psi(x) denotes the Riemannian gradient of ψ\psi.

Proof. By the stationarity of xx^{*} and (C.3), it is easy to see that ψ(x)=0f(x)+riφ(x)=riψ(x).\nabla_{\mathcal{M}}\psi(x^{*})=0\in\nabla f(x^{*})+\text{ri}\ \partial\varphi(x^{*})=\text{ri}\ \partial\psi(x^{*}). Thus, applying [23, Corollary 21] for xx\in\mathcal{M} close to xx^{*} and Lemma 2.1, we can conclude that ψ(x)=𝐏ψ(x)(0)=ψ(x;ds(x))ds(x)=FnorΛ(τ(x))\nabla_{\mathcal{M}}\psi(x)=\mathbf{P}_{\partial\psi(x)}(0)=\psi^{\prime}(x;d_{s}(x))d_{s}(x)=F_{\mathrm{nor}}^{\Lambda}(\tau(x)). \square

Since gkg^{k} coincides with the Riemannian gradient, we naturally would like to choose BkB^{k} as the associated Riemannian Hessian of ψ\psi. We now show that this Hessian can be derived without knowing the underlying manifold \mathcal{M} if we additionally assume that φ\varphi is polyhedral. Specifically, the following lemma establishes a connection between the derivative of FnorΛ(τ(x))F_{\mathrm{nor}}^{\Lambda}(\tau(x)) and the Riemannian Hessian.

Lemma 5.4.

Suppose that the assumptions stated in Lemma 5.3 hold and that φ\varphi is a polyhedral function. For r(0,Γ(x))r\in(0,\Gamma(x^{*})) sufficiently small and all xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M}, it follows

V𝒟FnorΛ(z)=2ψ(x),Λ=λI,λ>0,V\mathcal{D}F_{\mathrm{nor}}^{\Lambda}(z)=\nabla^{2}_{\mathcal{M}}\psi(x),\quad\Lambda=\lambda I,\;\lambda>0,

where 𝒟\mathcal{D} is the differential operator, z=τ(x)z=\tau(x), V=𝒟proxφΛ(z)V=\mathcal{D}\mathrm{prox}_{\varphi}^{\Lambda}(z) is the derivative of proxφΛ\mathrm{prox}_{\varphi}^{\Lambda}, and 2ψ(x)\nabla^{2}_{\mathcal{M}}\psi(x) is the Riemannian Hessian.

Proof. For xx\in\mathcal{M} and near xx^{*}, by Definition 5.1 and [48, Fact 3.3], we can decompose φ(x)\partial\varphi(x) as φ(x)={φ(x)}+φ(x)\partial\varphi(x)=\left\{\nabla_{\mathcal{M}}\varphi(x)\right\}+\partial_{\mathcal{M}}^{\perp}\varphi(x), where φ(x)T(x)\partial_{\mathcal{M}}^{\perp}\varphi(x)\subseteq T_{\mathcal{M}}(x)^{\perp}. We can see that both φ(x)\nabla_{\mathcal{M}}\varphi(x) and φ(x)\partial_{\mathcal{M}}^{\perp}\varphi(x) restricted to \mathcal{M} are continuous around xx^{*}. Moreover, we have the decomposition f(x)=f(x)+f(x)\nabla f(x)=\nabla_{\mathcal{M}}f(x)+\nabla_{\mathcal{M}}^{\perp}f(x) where f(x)T(x)\nabla_{\mathcal{M}}^{\perp}f(x)\in T_{\mathcal{M}}(x)^{\perp}.

Condition (C.3) implies f(x)riφ(x)-\nabla_{\mathcal{M}}^{\perp}f(x^{*})\in\text{ri}\ \partial_{\mathcal{M}}^{\perp}\varphi(x^{*}), which combined with part (ii) in Definition 5.1 and the continuity of f|\nabla_{\mathcal{M}}^{\perp}f\big{|}_{\mathcal{M}} and φ|\partial_{\mathcal{M}}^{\perp}\varphi\big{|}_{\mathcal{M}} leads to f(x)riφ(x)-\nabla_{\mathcal{M}}^{\perp}f(x)\in\text{ri}\ \partial_{\mathcal{M}}^{\perp}\varphi(x), i.e., 0{f(x)}+riφ(x)0\in\left\{\nabla_{\mathcal{M}}^{\perp}f(x)\right\}+\text{ri}\ \partial_{\mathcal{M}}^{\perp}\varphi(x) for all xBr(x)x\in B_{r}(x^{*})\cap\mathcal{M} where r>0r>0 is sufficiently small. Thus, it follows

f(x)+Λ(zx)=FnorΛ(z)=f(x)+φ(x)f(x)+riφ(x),\nabla f(x)+\Lambda(z-x)=F_{\mathrm{nor}}^{\Lambda}(z)=\nabla_{\mathcal{M}}f(x)+\nabla_{\mathcal{M}}\varphi(x)\in\nabla f(x)+\text{ri}\ \partial\varphi(x),

which implies Λ(zx)riφ(x)\Lambda(z-x)\in\text{ri}\ \partial\varphi(x). Since φ\varphi is polyhedral, the subdifferential φ(x)\partial\varphi(x) is locally constant around xx^{*} on \mathcal{M}. For any d1Td_{1}\in T_{\mathcal{M}}, d2Td_{2}\in T_{\mathcal{M}}^{\perp} with d1,d2\|d_{1}\|,\|d_{2}\| sufficiently small, it holds that Λ((z+d1+d2)(x+d1))=Λ(zx)+Λd2φ(x)=φ(x+d1)\Lambda((z+d_{1}+d_{2})-(x+d_{1}))=\Lambda(z-x)+\Lambda d_{2}\in\partial\varphi(x)=\partial\varphi(x+d_{1}), which implies that proxφΛ(z+d1+d2)=x+d1=proxφΛ(z)+d1\mathrm{prox}_{\varphi}^{\Lambda}(z+d_{1}+d_{2})=x+d_{1}=\mathrm{prox}_{\varphi}^{\Lambda}(z)+d_{1}. Consequently, we have 𝒟proxφΛ(z)=V=𝐏\mathcal{D}\mathrm{prox}_{\varphi}^{\Lambda}(z)=V=\mathbf{P}, where 𝐏\mathbf{P} is the orthogonal projection operator onto TT_{\mathcal{M}}. The derivative of the normal map is 𝒟FnorΛ(z)=2f(x)𝐏+Λ(I𝐏)\mathcal{D}F_{\mathrm{nor}}^{\Lambda}(z)=\nabla^{2}f(x)\mathbf{P}+\Lambda(I-\mathbf{P}), which combined with the local linearity of φ\varphi yields 𝐏𝒟FnorΛ(z)=𝐏2f(x)𝐏=2ψ(x)\mathbf{P}\mathcal{D}F_{\mathrm{nor}}^{\Lambda}(z)=\mathbf{P}\nabla^{2}f(x)\mathbf{P}=\nabla^{2}_{\mathcal{M}}\psi(x). \square

5.3 Convergence Analysis

We have the following finite active identification result.

Lemma 5.5.

Suppose that the assumptions in Theorem 4.9 are satisfied and that the conditions (C.4)–(C.5) hold. Then for every r(0,Γ(x))r\in(0,\Gamma(x^{*})) there exist infinitely many kk\in\mathbb{N} with xkBr(x)Six^{k}\in B_{r}(x^{*})\cap S_{i^{*}}.

Proof. Without loss of generality, we can assume {xk}kBr(x)\{x^{k}\}_{k}\subseteq B_{r}(x^{*}). Let us set

𝒦i={kxkSi\Si+1},i=0,1,,m.\mathcal{K}_{i}=\{k\in\mathbb{N}\mid x^{k}\in S_{i}\backslash S_{i+1}\},\quad i=0,1,\cdots,m.

By assumption (C.4), for every yBr(x)y\in B_{r}(x^{*}) with ySi+1y\in S_{i^{*}+1} we have

yxΓmax(x,yx¯)Γ(x)>r>yx,\|y-x^{*}\|\geq\Gamma_{\max}\left(x^{*},\overline{y-x^{*}}\right)\geq\Gamma(x^{*})>r>\|y-x^{*}\|,

which is a contradiction. Hence, it follows |𝒦i|=0|\mathcal{K}_{i}|=0 for every i>ii>i^{*}, i.e., Br(x)Si+1=B_{r}(x^{*})\cap S_{i^{*}+1}=\emptyset.

Set i_{0}=\max\{i=0,1,\cdots,i^{*}\mid|\mathcal{K}_{i}|=\infty\}. If i_{0}\leq i^{*}-1, then, since truncations of points in S_{i_{0}}\backslash S_{i_{0}+1} only happen a finite number of times, the set \{\Gamma(x^{k})\mid k\in\mathcal{K}_{i_{0}}\} has a positive lower bound, i.e., \beta:=\inf_{k\in\mathcal{K}_{i_{0}}}\Gamma(x^{k})>0. For k\in\mathcal{K}_{i_{0}}, condition (C.4) allows us to conclude that \Gamma(x^{k})\leq\Gamma_{\max}(x^{k},\overline{x^{*}-x^{k}})\leq\|x^{k}-x^{*}\|. Using x^{k}\rightarrow x^{*} (\mathcal{K}_{i_{0}}\ni k\rightarrow\infty), we can infer

0<βlim infk𝒦i0,kΓ(xk)limk𝒦i0,kxkx=0,0<\beta\leq\liminf_{k\in\mathcal{K}_{i_{0}},\ k\rightarrow\infty}\Gamma(x^{k})\leq\lim_{k\in\mathcal{K}_{i_{0}},\ k\rightarrow\infty}\|x^{k}-x^{*}\|=0,

which is a contradiction. Thus, we have i0=ii_{0}=i^{*}, which finishes the proof. \square

At the end of this subsection, we establish the local convergence rate by connecting our algorithm with a Riemannian trust-region method.

Theorem 5.6.

Suppose that the assumptions in Theorem 4.9 hold and that the conditions (C.1)–(C.5) are satisfied. Furthermore, if for some sufficiently small r(0,Γ(x))r\in(0,\Gamma(x^{*})) and every kk with xkBr(x)x^{k}\in B_{r}(x^{*})\cap\mathcal{M}, we choose gk=ψ(xk)g^{k}=\nabla_{\mathcal{M}}\psi(x^{k}), Bk=2ψ(xk)B^{k}=\nabla^{2}_{\mathcal{M}}\psi(x^{k}), and solve the trust-region subproblem exactly with solution skT(xk)s^{k}\in T_{\mathcal{M}}(x^{k}), then {xk}k\{x^{k}\}_{k} converges to xx^{*} q-quadratically.

Proof. We can assume \{x^{k}\}_{k}\subseteq B_{r}(x^{*})\cap\mathcal{M} for some sufficiently small r>0, which combined with (C.1) and the fact B_{r}(x^{*})\cap S_{i^{*}+1}=\emptyset implies that no truncation occurs. Since the trust-region subproblem is solved exactly in T_{\mathcal{M}}(x^{k}) and \mathcal{M} is affine, condition (C.2) can be utilized to show that the first acceptance test is always locally successful and hence the algorithm always skips the second acceptance mechanism. A detailed proof of this observation, which is also applicable in our situation, can be found in [58, Theorem 4.9]. We can then infer that our algorithm locally coincides with a Riemannian trust-region method or a classical trust-region method in the tangent space T_{\mathcal{M}}(x^{*}) and that the trust-region radius eventually becomes inactive. Thus, the local quadratic convergence rate follows from [2, Chapter 7] or [58, Theorem 4.9]. \square

Remark 5.7.

Lemmas 5.3 and 5.4 guarantee that we can set gk=ψ(xk)g^{k}=\nabla_{\mathcal{M}}\psi(x^{k}) and Bk=2ψ(xk)B^{k}=\nabla^{2}_{\mathcal{M}}\psi(x^{k}) without explicitly knowing \mathcal{M} and that there is a (globally optimal) solution of the trust-region subproblem located in T(xk)T_{\mathcal{M}}(x^{k}). This solution actually has the minimal 2\ell_{2}-norm among all solutions. Since gkg^{k} and BkB^{k} operate on the tangent space of the active manifold, some practical algorithms, such as the CG-Steihaug method, can indeed recover sks^{k} in T(xk)T_{\mathcal{M}}(x^{k}), which leads to xk+skx^{k}+s^{k}\in\mathcal{M}. Therefore, given xkBr(x)x^{k}\in B_{r}(x^{*})\cap\mathcal{M} for kk sufficiently large, we would have xk+1Br(x)x^{k+1}\in B_{r}(x^{*})\cap\mathcal{M}.
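The following MATLAB sketch of a Steihaug-type truncated CG method for the subproblem \min_{s}\langle g,s\rangle+\frac{1}{2}\langle s,Bs\rangle subject to \|s\|\leq\Delta is one way to see this claim: started at s=0, all iterates lie in the Krylov space generated by g^{k} and B^{k}, which is contained in T_{\mathcal{M}}(x^{k}) whenever g^{k}\in T_{\mathcal{M}}(x^{k}) and B^{k} maps T_{\mathcal{M}}(x^{k}) into itself. This is a generic sketch and not the exact implementation used in our experiments.

function s = cg_steihaug(g, Bfun, Delta, tol, maxit)
% Steihaug-type truncated CG sketch for min_s <g,s> + 0.5*<s,B*s> s.t. ||s|| <= Delta.
% Bfun(v) returns B*v; the iteration starts at s = 0.
    s = zeros(size(g));  r = g;  d = -r;
    if norm(r) <= tol, return; end
    for j = 1:maxit
        Bd  = Bfun(d);
        dBd = d'*Bd;
        if dBd <= 0                                   % negative curvature: move to the boundary
            s = s + tau_to_boundary(s, d, Delta)*d;  return;
        end
        alpha = (r'*r)/dBd;
        if norm(s + alpha*d) >= Delta                 % step would leave the trust region
            s = s + tau_to_boundary(s, d, Delta)*d;  return;
        end
        s    = s + alpha*d;
        rnew = r + alpha*Bd;
        if norm(rnew) <= tol, return; end
        d = -rnew + ((rnew'*rnew)/(r'*r))*d;
        r = rnew;
    end
end

function tau = tau_to_boundary(s, d, Delta)
% Positive root of ||s + tau*d||^2 = Delta^2.
    a = d'*d;  b = 2*(s'*d);  c = s'*s - Delta^2;
    tau = (-b + sqrt(b^2 - 4*a*c))/(2*a);
end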

6 Preliminary Numerical Results

In this section, we test the efficiency of our proposed nonsmooth trust-region method by applying it to convex and nonconvex 1\ell_{1}-minimization problems. All numerical experiments are performed in MATLAB R2020a on a laptop with Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz and 16GB memory.

We apply our framework to the 1\ell_{1}-minimization problem

minxnf(x)+μx1,\min_{x\in\mathbb{R}^{n}}~{}f(x)+\mu{\|x\|_{1}}, (6.1)

where f:nf:\mathbb{R}^{n}\rightarrow\mathbb{R} is a smooth function. Setting φ(x)=μx1\varphi(x)=\mu{\|x\|_{1}} and using (2.10), we choose

g(x)=λFnatλ(x):=λ[xproxφλ(xλ1f(x))],λ>0.g(x)=\lambda F_{\mathrm{nat}}^{\lambda}(x):=\lambda[x-\mathrm{prox}_{\varphi}^{\lambda}(x-\lambda^{-1}\nabla f(x))],\quad\lambda>0. (6.2)

Here, the proximity operator is given explicitly by (\mathrm{prox}_{\varphi}^{\lambda}(x))_{i}=\mathrm{sign}(x_{i})\max(|x_{i}|-\lambda^{-1}\mu,0). We now construct an element M(x)\in\partial\mathrm{prox}_{\varphi}^{\lambda}(x-\lambda^{-1}\nabla f(x)) as follows: M(x)\in\mathbb{R}^{n\times n} is a diagonal matrix with diagonal entries

(M(x))ii={1,if |(xλ1f(x))i|>λ1μ,0,otherwise.(M(x))_{ii}=\begin{cases}1,&\text{if }|(x-\lambda^{-1}\nabla f(x))_{i}|>\lambda^{-1}\mu,\\ 0,&\text{otherwise}.\end{cases}

Thus, J(x)=IM(x)(Iλ12f(x))J(x)=I-M(x)(I-\lambda^{-1}\nabla^{2}f(x)) is a possible generalized Jacobian of Fnatλ(x)F_{\mathrm{nat}}^{\lambda}(x). Let us define the index sets

(x):={i{1,2,,n}:|(xλ1f(x))i|>λ1μ},\mathcal{I}(x):=\{i\in\{1,2,\dots,n\}:|(x-\lambda^{-1}\nabla f(x))_{i}|>\lambda^{-1}\mu\},

and

𝒪(x):={i{1,2,,n}:|(xλ1f(x))i|λ1μ}.\mathcal{O}(x):=\{i\in\{1,2,\dots,n\}:|(x-\lambda^{-1}\nabla f(x))_{i}|\leq\lambda^{-1}\mu\}.

Then, J(x)J(x) can be written in an alternative format:

J(x)=(λ1(2f(x))(x)(x)λ1(2f(x))(x)𝒪(x)0I).J(x)=\begin{pmatrix}\lambda^{-1}(\nabla^{2}f(x))_{\mathcal{I}(x)\mathcal{I}(x)}&\lambda^{-1}(\nabla^{2}f(x))_{\mathcal{I}(x)\mathcal{O}(x)}\\ 0&I\end{pmatrix}. (6.3)

In the following, we choose Bk=λJ(xk)B^{k}=\lambda J(x^{k}). For simplicity, we do not check the condition (3.11).
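As an illustration, the following MATLAB sketch assembles g(x)=\lambda F_{\mathrm{nat}}^{\lambda}(x) from (6.2) and evaluates the matrix-vector product J(x)v via the index sets \mathcal{I}(x) and \mathcal{O}(x) from (6.3). The handles gradf and hessf_vec, returning \nabla f(x) and \nabla^{2}f(x)v, are assumptions of this sketch and have to be supplied by the user.

function [g, Jv, I] = l1_model_pieces(x, v, mu, lambda, gradf, hessf_vec)
% Sketch for the l1-setup: natural residual (6.2), index set I(x), and the
% action of J(x) from (6.3) on a vector v. gradf(x) and hessf_vec(x,v) are
% user-supplied handles for grad f(x) and (Hess f(x))*v.
    z  = x - gradf(x)/lambda;                         % forward point x - lambda^{-1} grad f(x)
    px = sign(z).*max(abs(z) - mu/lambda, 0);         % prox of mu*||.||_1 with parameter lambda
    g  = lambda*(x - px);                             % g(x) = lambda * F_nat^lambda(x)

    I  = abs(z) > mu/lambda;                          % index set I(x); O(x) is its complement
    Hv = hessf_vec(x, v);
    Jv = v;                                           % rows in O(x): identity block
    Jv(I) = Hv(I)/lambda;                             % rows in I(x): lambda^{-1} (Hess f(x) * v)
end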

6.1 The Lasso Problem

We first consider the Lasso problem where ff is a convex quadratic function

f(x)=12Axb2,f(x)=\frac{1}{2}\|Ax-b\|^{2},

and bmb\in\mathbb{R}^{m} and A=m×nA=\mathbb{R}^{m\times n} are given.

It can be shown that J(x) is positive semidefinite if \lambda is sufficiently large [75]. To solve the trust-region subproblem (3.3), and similarly to the method presented in Appendix B, we first choose a suitable regularization parameter t_{k}\geq 0 and solve the linear system

(Jk+tkI)pk=Fnatλ(xk),Jk=J(xk),(J^{k}+t_{k}I)p^{k}=-F_{\mathrm{nat}}^{\lambda}(x^{k}),\quad J^{k}=J(x^{k}), (6.4)

and then project pkp^{k} onto the trust region, i.e., sk=min{Δk,pk}p¯ks^{k}=\min\{\Delta_{k},\|p^{k}\|\}{\bar{p}^{k}}. Setting gk=g(xk)g^{k}=g(x^{k}), k=(xk)\mathcal{I}^{k}=\mathcal{I}(x^{k}), and 𝒪k=𝒪(xk)\mathcal{O}^{k}=\mathcal{O}(x^{k}), the linear system (6.4) is equivalent to

(1+tk)p𝒪kk=g𝒪kk,(λ1(ATA)kk+tkI)pkk+λ1(ATA)k𝒪kp𝒪kk=gkk,(1+t_{k})p^{k}_{\mathcal{O}^{k}}=-g^{k}_{\mathcal{O}^{k}},\quad(\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{I}^{k}}+t^{k}I)p^{k}_{\mathcal{I}^{k}}+\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{O}^{k}}p^{k}_{\mathcal{O}^{k}}=-g^{k}_{\mathcal{I}^{k}},

which leads to

p𝒪kk=1(1+tk)g𝒪kk,(λ1(ATA)kk+tkI)pkk=gkkλ1(ATA)k𝒪kp𝒪kk.p^{k}_{\mathcal{O}^{k}}=-\frac{1}{(1+t_{k})}g^{k}_{\mathcal{O}^{k}},\quad(\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{I}^{k}}+t^{k}I)p^{k}_{\mathcal{I}^{k}}=-g^{k}_{\mathcal{I}^{k}}-\lambda^{-1}(A^{T}A)_{\mathcal{I}^{k}\mathcal{O}^{k}}p^{k}_{\mathcal{O}^{k}}.

The second system is symmetric and can be much smaller than the original problem (6.4). It can be solved efficiently by applying the CG method.
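A possible MATLAB realization of this step is sketched below. For simplicity, A is passed as an explicit matrix (the experiments use the implicit dct-operator instead), I is a logical index vector for \mathcal{I}^{k}, and the CG tolerance and iteration limit are illustrative values rather than the ones used in our tests.

function s = lasso_tr_step(A, g, I, lambda, t, Delta)
% Sketch: solve (6.4) for the Lasso case by eliminating the O^k-block,
% applying CG to the reduced symmetric system on I^k, and projecting the
% result onto the trust region. Assumes the computed p is nonzero.
    O    = ~I;
    p    = zeros(size(g));
    p(O) = -g(O)/(1 + t);                             % (1 + t_k) p_O = -g_O

    AI  = A(:, I);  AO = A(:, O);                     % column blocks of A
    rhs = -g(I) - (AI'*(AO*p(O)))/lambda;             % right-hand side of the reduced system
    Hop = @(u) (AI'*(AI*u))/lambda + t*u;             % (lambda^{-1}(A'A)_{II} + t_k I) u
    p(I) = pcg(Hop, rhs, 1e-8, 200);                  % CG on the reduced (smaller) system

    s = min(Delta, norm(p))*(p/norm(p));              % s^k = min{Delta_k, ||p^k||} * pbar^k
end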

Our test framework follows [12, 55]:

  • A sparse solution \hat{x}\in\mathbb{R}^{n} with n=512^{2}=262144 is generated randomly with k=[n/40] nonzero entries. The positions of the nonzero components are chosen uniformly from \{1,2,\cdots,n\} and their values are given by \hat{x}_{i}=\eta_{1}(i)10^{d\eta_{2}(i)/20}. Here, \eta_{1}(i) and \eta_{2}(i) are distributed uniformly in \{-1,1\} and [0,1], respectively, and d denotes the dynamic range (in dB).

  • We randomly choose 𝒥{1,2,,n}\mathcal{J}\subseteq\{1,2,\cdots,n\} with |𝒥|=m=n/8=32768|\mathcal{J}|=m=n/8=32768. The linear operator A:nmA:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is then defined via Ax=(dct(x))𝒥Ax=(\texttt{dct}(x))_{\mathcal{J}} where dct denotes the discrete cosine transform.

  • We set b=Ax^+ϵb=A\hat{x}+\epsilon where ϵm\epsilon\in\mathbb{R}^{m} is Gaussian noise with covariance matrix σ^Im×m\hat{\sigma}I_{m\times m}, σ^=0.1\hat{\sigma}=0.1.
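A possible MATLAB realization of this test setup for one fixed dynamic range d is sketched below; dct and idct are taken from the Signal Processing Toolbox, and the rounding of n/40, the noise scaling, and the operator handles reflect our reading of the description above.

% Sketch of one Lasso test instance; d is the dynamic range in dB.
n = 512^2;  m = n/8;  k = round(n/40);  d = 20;  sig = 0.1;

xhat = zeros(n,1);
idx  = randperm(n, k);                                % support drawn uniformly at random
xhat(idx) = sign(randn(k,1)).*10.^(d*rand(k,1)/20);   % eta1(i) * 10^(d*eta2(i)/20)

J    = sort(randperm(n, m));                          % index set J defining the measurements
sel  = @(v, ind) v(ind);
Aop  = @(x) sel(dct(x), J);                           % A x   = (dct(x))_J
Atop = @(y) idct(accumarray(J(:), y, [n 1]));         % A^T y (dct is orthonormal)
b    = Aop(xhat) + sqrt(sig)*randn(m,1);              % Gaussian noise with covariance sig*I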

Given a tolerance \epsilon, we terminate whenever the condition \lambda\|F_{\mathrm{nat}}^{\lambda}(x)\|\leq\epsilon is satisfied. In the algorithm, \lambda is chosen adaptively to estimate the local Lipschitz constant of \nabla f, i.e., we set \lambda=\lambda_{k}=\max\{10^{-3},\min\{\|x^{k+1}-x^{k}\|/\|\nabla f(x^{k+1})-\nabla f(x^{k})\|,10^{3}\}\} whenever the step was successful. We compare our nonsmooth trust-region method (NTR) with the adaptive semi-smooth Newton (ASSN) method in [75] and the fast iterative shrinkage-thresholding algorithm (FISTA) [11] for different tolerances \epsilon\in\{10^{0},10^{-1},10^{-2},10^{-4},10^{-6}\} and dynamic ranges d\in\{20,40,60,80\}. We report the average CPU time (in seconds) as well as the average number of A- and A^{T}-calls N_{A} over 10 independent trials.

The numerical comparisons are shown in Tables 1-4. From these results we can see that the nonsmooth trust-region method outperforms the first-order method FISTA and is quite competitive with the second-order method ASSN. Even though the second acceptance test and the stepsize safeguard are required to guarantee theoretical convergence, we observe in the numerical experiments that, with suitably chosen parameters, our algorithm rarely or never fails the first acceptance test and hence rarely needs to invoke the stepsize safeguard mechanism, which prevents additional costs. The similar behavior of NTR and ASSN may stem from the fact that we utilize strategies similar to those in ASSN [75] for certain parameters. Our results on N_{A} are comparable with ASSN’s results and are sometimes better. Because each of our iterations involves potential acceptance tests and truncation steps, our method overall requires slightly more CPU time to converge than ASSN.

Table 1: Numerical results with dynamic range 20 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 0.8275 86.8 1.2226 132.8 1.5549 172 2.0399 227.4 2.4704 280.6
ASSN 0.7368 89.8 1.1409 145 1.3583 173 1.9094 246.4 2.2844 298.2
FISTA 0.5337 59 1.3959 153 3.2304 353.2 13.9021 1490.4 33.7451 3581.2
Table 2: Numerical results with dynamic range 40 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 1.7293 176.4 2.6661 280.4 3.1330 330.8 3.7463 402.2 4.1825 464.2
ASSN 1.5227 182.2 2.3414 285.4 2.7751 338.6 3.3216 407 3.6687 459.2
FISTA 2.2007 234.8 3.9217 418.4 7.7210 817 26.6733 2804.6 57.0018 5991.6
Table 3: Numerical results with dynamic range 60 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 2.9135 303.4 3.7794 398.4 4.4177 471.2 5.1089 562.4 5.6559 632
ASSN 2.4508 295.4 3.4378 416.4 4.0051 492 4.6492 582.4 5.1121 642.4
FISTA 5.9349 630.6 9.0296 951.6 14.7183 1548 39.5980 4164.2 79.6442 8355.4
Table 4: Numerical results with dynamic range 80 dB
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 3.6748 411 5.3121 614 5.9883 702.2 6.5791 803.6 7.2349 868.8
ASSN 3.6290 482.8 4.5779 601 4.9879 690.6 5.7141 780.6 6.3010 865.4
FISTA 20.9653 2222.4 25.6927 2673.2 33.2373 3527 — — — —

6.2 Nonconvex Binary Classification

We consider a second, nonconvex binary classification problem [53, 73], where f is given as follows:

f(x)=1Ni=1N(1tanh(biaiTx)).f(x)=\frac{1}{N}\sum_{i=1}^{N}\left(1-\tanh(b_{i}\cdot a_{i}^{T}x)\right).

Here, the data points a_{i}\in\mathbb{R}^{n} and labels b_{i}\in\{\pm 1\} are taken from the datasets CINA (N=16033, n=132) and gisette (N=6000, n=5000). We set \mu=0.01.
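A possible MATLAB implementation of this loss, its gradient, and a corresponding Hessian-vector product (which could serve as the handle hessf_vec in the sketch of Section 6) is given below. Here, A denotes the N-by-n matrix whose rows are the a_{i}^{T} and b the label vector; this arrangement of the data and the derived Hessian-vector product are our own reading and not taken from the referenced implementations.

function [fval, grad, hessvec] = tanh_loss(x, A, b)
% f(x) = (1/N) * sum_i (1 - tanh(b_i * a_i'*x)), its gradient, and a
% Hessian-vector product handle; A is N-by-n with rows a_i', b in {-1,+1}^N.
    N    = numel(b);
    t    = tanh(b.*(A*x));                            % tanh(b_i * a_i' * x)
    fval = mean(1 - t);
    grad = -(A'*(b.*(1 - t.^2)))/N;                   % -(1/N) sum_i (1 - tanh^2) b_i a_i
    hessvec = @(v) (2/N)*(A'*((t.*(1 - t.^2)).*(A*v)));   % uses b_i^2 = 1
end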

Although positive semidefiniteness of JkJ^{k} is not guaranteed in the nonconvex case, we reuse the method described in Section 6.1 to solve the trust-region subproblem with tk=0t_{k}=0. The parameter λ\lambda is updated adaptively as before. We also use the same stopping criterion as in Section 6.1. We compare our nonsmooth trust-region method (NTR) with the stochastic semismooth Newton method with variance reduction (S4N-VR) [56] for different tolerances ϵ{100,101,102,104,106}\epsilon\in\{10^{0},10^{-1},10^{-2},10^{-4},10^{-6}\}.

The CPU time (in seconds) and the number of AA- and ATA^{T}-calls NAN_{A} are reported in Table 5 and Table 6, respectively. The results for S4N-VR are averaged over 10 independent trials while the results for NTR are based on one (deterministic) trial. While S4N-VR achieves slightly better results on CINA, NTR outperforms S4N-VR on the second dataset gisette.

Although the performance of NTR is still not perfect, our preliminary results underline that the proposed class of nonsmooth trust-region methods is promising and allows us to handle nonsmooth nonconvex optimization problems from a different perspective.

Table 5: Numerical results for CINA
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 0.03567 2.5995 0.08788 9.6904 0.2794 55.9480 0.2806 63.3571 0.3048 67.3268
S4N-VR 0.02843 2.0983 0.08124 6.0051 0.2090 25.8886 0.2422 28.3287 0.3722 45.6214
Table 6: Numerical results for gisette
ϵ:100\epsilon:10^{0} ϵ:101\epsilon:10^{-1} ϵ:102\epsilon:10^{-2} ϵ:104\epsilon:10^{-4} ϵ:106\epsilon:10^{-6}
time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A} time NAN_{A}
NTR 0.3136 14.1908 0.5044 21.3416 0.6588 34.7592 0.9333 57.2328 1.0780 73.7344
S4N-VR 0.9177 6.8913 1.6120 20.4010 2.3202 32.6835 5.2769 84.0567 6.7880 121.4267

7 Conclusion

In this paper, we investigate a trust-region method for nonsmooth nonconvex optimization problems. In the proposed framework, the model functions are quadratic and are built from cheap, abstract descent directions, which allows us to construct the models at low cost and to apply standard algorithms for solving the resulting trust-region subproblems. We propose a novel combination of a stepsize safeguard, which ensures the accuracy of the model, and an additional truncation step, which enlarges the stepsize safeguard and accelerates convergence. We present a detailed discussion of the global convergence properties under suitable and mild assumptions. For composite-type problems, we also show that our method converges locally with a quadratic rate after finite identification of the active manifold when the nonsmooth part of the objective function is partly smooth. These results are established using a strict complementary condition and a connection between our algorithm and the standard Riemannian trust-region method. Preliminary numerical results demonstrate that the approach performs promisingly on a class of \ell_{1}-optimization problems.

References

  • [1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Found. Comput. Math., 7(3):303–330, 2007.
  • [2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ, 2008.
  • [3] Z. Akbari, R. Yousefpour, and M. Reza Peyghami. A new nonsmooth trust region algorithm for locally Lipschitz unconstrained optimization problems. J. Optim. Theory Appl., 164(3):733–754, 2015.
  • [4] P. Apkarian, D. Noll, and L. Ravanbod. Nonsmooth bundle trust-region algorithm with applications to robust stability. Set-Valued Var. Anal., 24(1):115–148, 2016.
  • [5] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Found. and Trends® in Mach. Learn., 4(1):1–106, 2011.
  • [6] C. G. Baker. Riemannian manifold trust-region methods with applications to eigenproblems. PhD thesis, Florida State University, 2008.
  • [7] C. G. Baker, P.-A. Absil, and K. A. Gallivan. An implicit Riemannian trust-region method for the symmetric generalized eigenproblem. pages 210–217. Springer, International Conference on Computational Science, 2006.
  • [8] C. G. Baker, P.-A. Absil, and K. A. Gallivan. An implicit trust-region method on Riemannian manifolds. IMA J. Numer. Anal., 28(4):665–689, 2008.
  • [9] T. Bannert. A trust region algorithm for nonsmooth optimization. Math. program., 67(1):247–264, 1994.
  • [10] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.
  • [11] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
  • [12] S. Becker, J. Bobin, and E. J. Candès. NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci., 4(1):1–39, 2011.
  • [13] N. Boumal and P.-A. Absil. RTRMC: A Riemannian trust-region method for low-rank matrix completion. pages 406–414. Advances in neural information processing systems, 2011.
  • [14] P. Breiding and N. Vannieuwenhoven. A Riemannian trust region method for the canonical tensor rank approximation problem. SIAM J. Optim., 28(3):2435–2465, 2018.
  • [15] J. Burke. On the identification of active constraints. II. The nonconvex case. SIAM J. Numer. Anal., 27(4):1081–1103, 1990.
  • [16] J. V. Burke and A. Engle. Line search and trust-region methods for convex-composite optimization. ArXiv:1806.05218, 2018.
  • [17] J. V. Burke and J. J. Moré. On the identification of active constraints. SIAM J. Numer. Anal., 25(5):1197–1211, 1988.
  • [18] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.
  • [19] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.
  • [20] C. Christof, J. C. De los Reyes, and C. Meyer. A nonsmooth trust-region method for locally Lipschitz functions with application to optimization problems constrained by variational inequalities. SIAM J. Optim., 30(3):2163–2196, 2020.
  • [21] C. Clason. Nonsmooth analysis and optimization. ArXiv:1708.04180, 2018.
  • [22] S. Cotter, B. Rao, K. Engan, and K. Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Trans. Signal Process., 53(7):2477–2488, 2005.
  • [23] A. Daniilidis, W. Hare, and J. Malick. Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization, 55(5-6):481–503, 2006.
  • [24] W. de Oliveira, C. Sagastizábal, and C. Lemaréchal. Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Math. Program., 148(1):241–277, 2014.
  • [25] R. De Sampaio, J.-Y. Yuan, and W.-Y. Sun. Trust region algorithm for nonsmooth optimization. Appl. Math. Comput., 85(2-3):109–116, 1997.
  • [26] J. E. Dennis, Jr., S.-B. B. Li, and R. A. Tapia. A unified approach to global convergence of trust region methods for nonsmooth optimization. Math. Program., 68(1):319–346, 1995.
  • [27] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
  • [28] R. Fletcher. A model algorithm for composite nondifferentiable optimization problems. In Nondifferential and Variational Techniques in Optimization, pages 67–76. Springer, 1982.
  • [29] R. Fletcher. Second order corrections for non-differentiable optimization. In Numerical analysis, pages 85–114. Springer, 1982.
  • [30] M. Fukushima and H. Mine. A generalized proximal point algorithm for certain nonconvex minimization problems. Internat. J. Systems Sci., 12(8):989–1000, 1981.
  • [31] M. Gangeh, A. Farahat, A. Ghodsi, and M. Kamel. Supervised dictionary learning and sparse representation – A review. ArXiv:1502.05928, 2015.
  • [32] R. Garmanjani, D. Júdice, and L. N. Vicente. Trust-region methods without using derivatives: worst case complexity and the nonsmooth case. SIAM J. Optim., 26(4):1987–2011, 2016.
  • [33] G. N. Grapiglia, J. Yuan, and Y.-x. Yuan. A derivative-free trust-region algorithm for composite nonsmooth optimization. Comp. Appl. Math., 35(2):475–499, 2016.
  • [34] P. Grohs and S. Hosseini. Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds. IMA J. Numer. Anal., 36(3):1167–1192, 2016.
  • [35] W. Hare, C. Sagastizábal, and M. Solodov. A proximal bundle method for nonsmooth nonconvex functions with inexact information. Comput. Optim. Appl., 63(1):1–28, 2016.
  • [36] W. L. Hare and A. S. Lewis. Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal., 11(2):251–266, 2004.
  • [37] T. Hastie, R. Mazumder, J. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res., 16:3367–3402, 2015.
  • [38] W. Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian symmetric rank-one trust-region method. Math. Program., Ser. A, 150(2):179–216, 2015.
  • [39] N. Karmitsa, A. Bagirov, and M. M. Mäkelä. Comparing different nonsmooth minimization methods and software. Optim. Methods Softw., 27(1):131–153, 2012.
  • [40] K. C. Kiwiel. An ellipsoid trust region bundle method for nonsmooth convex minimization. SIAM J. Control Optim., 27(4):737–757, 1989.
  • [41] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale 1\ell_{1}-regularized logistic regression. J. Mach. Learn. Res., 8:1519–1555, 2007.
  • [42] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Math. Program., 149(1):1–45, 2015.
  • [43] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim., 24(3):1420–1443, 2014.
  • [44] C. Lemaréchal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Math. Program., 69(1, Ser. B):111–147, 1995. Nondifferentiable and large-scale optimization (Geneva, 1992).
  • [45] C. Lemaréchal and J. Zowe. A condensed introduction to bundle methods in nonsmooth optimization. In Algorithms for continuous optimization (Il Ciocco, 1993), volume 434 of NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., pages 357–382. Kluwer Acad. Publ., Dordrecht, 1994.
  • [46] A. S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (2003), 2002.
  • [47] X. Li, D. Sun, and K.-C. Toh. A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM J. Optim., 28(1):433–458, 2018.
  • [48] J. Liang, J. Fadili, and G. Peyré. Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim., 27(1):408–437, 2017.
  • [49] N. Mahdavi-Amiri and R. Yousefpour. An effective nonsmooth optimization algorithm for locally Lipschitz functions. J. Optim. Theory Appl., 155(1):180–195, 2012.
  • [50] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. Proceedings of the 26th Int. Conf. on Mach. Learn., pages 689–696, 2009.
  • [51] M. M. Mäkelä. Survey of bundle methods for nonsmooth optimization. Optim. Methods Softw., 17(1):1–29, 2002.
  • [52] M. M. Mäkelä and P. Neittaanmäki. A survey of bundle methods. In Nonsmooth Optimization, pages 97–111. World Scientific Publishing Co. Pte. Ltd., 1992.
  • [53] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. In Proc. NIPS, volume 12, pages 512–518, 1999.
  • [54] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(1):53–71, 2008.
  • [55] A. Milzarek and M. Ulbrich. A semismooth Newton method with multidimensional filter globalization for l1l_{1}-optimization. SIAM J. Optim., 24(1):298–333, 2014.
  • [56] A. Milzarek, X. Xiao, S. Cen, Z. Wen, and M. Ulbrich. A stochastic semismooth Newton method for nonsmooth nonconvex optimization. SIAM J. Optim., 29(4):2916–2948, 2019.
  • [57] Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140(1):125–161, 2013.
  • [58] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, second edition, 2006.
  • [59] D. Noll. Cutting plane oracles to minimize non-smooth non-convex functions. Set-Valued Var. Anal., 18(3-4):531–568, 2010.
  • [60] D. Noll. Bundle method for non-convex minimization with inexact subgradients and function values. In Computational and analytical mathematics, volume 50 of Springer Proc. Math. Stat., pages 555–592. Springer, New York, 2013.
  • [61] P. Patrinos and A. Bemporad. Proximal Newton methods for convex composite optimization. Proceedings of the 52nd IEEE Conf. on Decision and Control, pages 2358–2363, 2013.
  • [62] P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton methods for convex composite optimization. ArXiv:1402.6655, 2014.
  • [63] C. Poon, J. Liang, and C. Schoenlieb. Local convergence properties of SAGA/Prox-SVRG and acceleration. Proceedings of the 35th Int. Conf. on Mach. Learn., 80:4124–4132, 2018.
  • [64] L. Q. Qi and J. Sun. A trust region algorithm for minimization of locally Lipschitzian functions. Math. Program., 66(1):25–43, 1994.
  • [65] S. M. Robinson. Normal maps induced by linear transformations. Math. Oper. Res., 17(3):691–714, 1992.
  • [66] N. Sagara and M. Fukushima. A trust region method for nonsmooth convex optimization. J. Ind. Manag. Optim., 1(2):171–180, 2005.
  • [67] H. Schramm and J. Zowe. A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM J. Optim., 2(1):121–152, 1992.
  • [68] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
  • [69] L. Stella, A. Themelis, and P. Patrinos. Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl., 67(3):443–487, 2017.
  • [70] L. Sun, J. Liu, J. Chen, and J. Ye. Efficient recovery of jointly sparse vectors. Advances in Neural Information Processing Systems, (NIPS), 23, 2009.
  • [71] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.
  • [72] J. Wang and T. Zhang. Utilizing second order information in minibatch stochastic variance reduced proximal iterations. J. Mach. Learn. Res., 20(42):1–56, 2019.
  • [73] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim., 27(2):927–956, 2017.
  • [74] S. J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993.
  • [75] X. Xiao, Y. Li, Z. Wen, and L. Zhang. A regularized semi-smooth Newton method with projection steps for composite convex programs. J. Sci. Comput., 76(1):364–389, 2018.
  • [76] E. Yamakawa, M. Fukushima, and T. Ibaraki. An efficient trust region algorithm for minimizing nondifferentiable composite functions. SIAM J. Sci. and Stat. Comput., 10(3):562–580, 1989.
  • [77] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49–67, 2006.
  • [78] Y.-x. Yuan. Conditions for convergence of trust region algorithms for nonsmooth optimization. Math. Program., 31(2):220–228, 1985.
  • [79] Y.-x. Yuan. On the superlinear convergence of a trust region algorithm for nonsmooth optimization. Math. Program., 31(3):269–285, 1985.
  • [80] Y.-x. Yuan. Recent advances in trust region algorithms. Math. Program., 151(1):249–281, 2015.

Appendix A Some supplementary examples

Example A.1 (Expressing F_{\mathrm{nor}}^{\Lambda}(\tau(x))).

We consider the following examples:

  • (i)

    Group lasso: In this setting, X=(X_{1},\cdots,X_{n_{2}}) is a matrix in \mathbb{R}^{n_{1}\times n_{2}}, \psi=f+\varphi, \varphi(X)=\sum_{i=1}^{n_{2}}\left\|X_{i}\right\|_{2}, and \Lambda=\mathrm{diag}(\lambda_{1}I_{n_{1}},\lambda_{2}I_{n_{1}},\cdots,\lambda_{n_{2}}I_{n_{1}}), where \Lambda is understood by identifying X with the vector X=(X_{1}^{T},\cdots,X_{n_{2}}^{T})^{T}; a code sketch of the formulas below is given after this example. In this case, we obtain

    \partial\varphi(X)=\left\{W=(W_{1},\cdots,W_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}}:W_{i}\begin{cases}\in B_{1}(0)&\text{if }X_{i}=0,\\ =X_{i}/\|X_{i}\|&\text{if }X_{i}\neq 0,\end{cases}\quad\forall~i=1,2,\cdots,n_{2}\right\},

    and

    F_{\mathrm{nor}}^{\Lambda}(\tau(X))_{i}=\begin{cases}\nabla f(X)_{i}+X_{i}/\left\|X_{i}\right\|&\text{if }X_{i}\neq 0,\\ \nabla f(X)_{i}-\mathbf{P}_{B_{1}(0)}\left(\nabla f(X)_{i}\right)&\text{if }X_{i}=0,\end{cases}\quad\forall~i=1,2,\cdots,n_{2}.
  • (ii)

    \ell_{\infty}-optimization: \psi=f+\varphi, \varphi(x)=\|x\|_{\infty} and \Lambda=\lambda I. Using the dual characterization \|x\|_{\infty}=\max_{\|y\|_{1}\leq 1}x^{T}y, we have

    \partial\varphi(x)=\{w\in\mathbb{R}^{n}:\|w\|_{1}\leq 1,\ w^{T}x=\|x\|_{\infty}\},

    and F_{\mathrm{nor}}^{\Lambda}(\tau(x)) can be computed using (2.8).
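
To make item (i) above concrete, the following is a minimal NumPy sketch of the column-wise group-lasso formula; the function name F_nor_group_lasso and the argument grad_f (the gradient \nabla f(X), supplied by the caller as an n_{1}\times n_{2} array) are our own illustrative choices and not part of the paper's notation.

    import numpy as np

    def F_nor_group_lasso(X, grad_f):
        """Column-wise evaluation of the group-lasso formula from item (i);
        grad_f is assumed to hold the gradient of f at X (same shape as X)."""
        F = np.zeros(X.shape)
        for i in range(X.shape[1]):
            Xi, Gi = X[:, i], grad_f[:, i]
            ni = np.linalg.norm(Xi)
            if ni > 0:
                # nonzero group: add the (unique) subgradient X_i / ||X_i||
                F[:, i] = Gi + Xi / ni
            else:
                # zero group: subtract the Euclidean projection of grad f(X)_i onto B_1(0)
                F[:, i] = Gi - Gi / max(1.0, np.linalg.norm(Gi))
        return F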

Example A.2 (Calculating \Gamma(x)).

We have (see also the code sketch following this example):

  • (i)

    Group lasso: For X=(X_{1},\cdots,X_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}}, \psi=f+\varphi, and \varphi(X)=\sum_{i=1}^{n_{2}}\|X_{i}\|_{2}, it holds that \Gamma(X)=\min\{\|X_{i}\|:X_{i}\neq 0\}.

  • (ii)

    \ell_{\infty}-optimization: Let us set \psi=f+\varphi, \varphi(x)=\|x\|_{\infty}, and S:=\{i\in\{1,2,\cdots,n\}\mid|x_{i}|\neq\|x\|_{\infty}\}. Then, it holds that

    \Gamma(x)=\begin{cases}\|x\|_{\infty}-\max_{i\in S}|x_{i}|&\text{if }S\neq\emptyset,\\ 2\|x\|_{\infty}&\text{if }S=\emptyset.\end{cases}
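
Both formulas translate directly into code. The following minimal NumPy sketch is ours; the return value np.inf for a matrix without nonzero columns is merely a convention for this illustration, since \Gamma is only evaluated at nonzero points.

    import numpy as np

    def gamma_group_lasso(X):
        """Gamma(X) = min{ ||X_i|| : X_i != 0 } from item (i)."""
        norms = np.linalg.norm(X, axis=0)          # column norms ||X_i||
        nonzero = norms[norms > 0]
        return nonzero.min() if nonzero.size else np.inf   # convention if all columns vanish

    def gamma_linf(x):
        """Gamma(x) for phi(x) = ||x||_inf from item (ii)."""
        xmax = np.linalg.norm(x, np.inf)
        below = np.abs(x)[np.abs(x) != xmax]       # values |x_i| over the index set S
        return xmax - below.max() if below.size else 2.0 * xmax
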
Example A.3 (Truncation).

We consider some additional functions that can be truncated (a code sketch of the corresponding truncation operators is given after this example):

  • (i)

    Group lasso: For X=(X_{1},\cdots,X_{n_{2}})\in\mathbb{R}^{n_{1}\times n_{2}}, \psi=f+\varphi, and \varphi(X)=\sum_{i=1}^{n_{2}}\|X_{i}\|_{2}, we can set S_{i}=\{X\in\mathbb{R}^{n_{1}\times n_{2}}:\mathrm{card}\{j=1,2,\cdots,n_{2}\mid X_{j}=0\}\geq i\} for i=1,2,\cdots,n_{2}, m=n_{1}, \delta=+\infty, \kappa=\sqrt{n_{2}}, and T(X,a)\in\mathbb{R}^{n_{1}\times n_{2}} is defined column-wise

    T(X,a)_{j}={\mathbbm{1}}_{\|\cdot\|\geq a}(X_{j})\cdot X_{j}\quad j=1,2,\cdots,n_{2}.
  • (ii)

    \ell_{\infty}-optimization: For \psi=f+\varphi and \varphi(x)=\|x\|_{\infty}, we can set S_{i}=\{x\in\mathbb{R}^{n}\mid\mathrm{card}\{j=1,2,\cdots,n\mid x_{j}=\|x\|_{\infty}\}\geq i+1\} for i=0,1,\cdots,n-1, S_{n}=\{0\}, m=n, \delta=+\infty, and \kappa=\sqrt{n}. As for T(x,a)\in\mathbb{R}^{n}, if x\in S_{n-1}, it is defined via

    T(x,a)={\mathbbm{1}}_{\|\cdot\|_{\infty}\geq\frac{a}{2}}(x)\cdot x;

    otherwise it is defined component by component via

    T(x,a)_{j}=x_{j}+{\mathbbm{1}}_{|\cdot|>\|x\|_{\infty}-a}(x_{j})\,\mathrm{sgn}(x_{j})(\|x\|_{\infty}-|x_{j}|),\quad j=1,2,\cdots,n.
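
The two truncation operators can be sketched as follows (NumPy; the function names are ours). The group-lasso map zeroes out every column whose norm falls below a, while the \ell_{\infty} map either truncates x as a whole (when x\in S_{n-1}) or lifts each entry within distance a of the maximal modulus to \pm\|x\|_{\infty}.

    import numpy as np

    def truncate_group_lasso(X, a):
        """T(X, a) from item (i): zero out all columns with ||X_j|| < a."""
        keep = np.linalg.norm(X, axis=0) >= a
        return X * keep                            # broadcast the 0/1 column mask

    def truncate_linf(x, a):
        """T(x, a) from item (ii)."""
        x = np.asarray(x, dtype=float)
        xmax = np.linalg.norm(x, np.inf)
        if np.all(x == xmax):                      # x in S_{n-1}: every entry equals ||x||_inf
            return x.copy() if xmax >= a / 2 else np.zeros_like(x)
        out = x.copy()
        idx = np.abs(x) > xmax - a                 # entries with |x_j| > ||x||_inf - a
        out[idx] = np.sign(x[idx]) * xmax          # equals x_j + sgn(x_j)(||x||_inf - |x_j|)
        return out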

Appendix B The Solution of the Trust-region Subproblem

In this section, we briefly discuss how to recover a solution s^{k} satisfying (3.6) and (3.7). If the second-order information B^{k} is symmetric, the subproblem (3.3) coincides with the classical trust-region subproblem and can be solved by standard methods, such as the CG-Steihaug method [58, Algorithm 7.2]. However, due to the nonsmoothness, the second-order information B^{k} might be asymmetric; for example, the Jacobian of g(x)=F_{\mathrm{nat}}^{\Lambda}(x) can be asymmetric. In this case, we can simply replace B^{k} with its symmetrized version \frac{1}{2}[B^{k}+(B^{k})^{T}] and then employ the CG-Steihaug method.

If the matrix B^{k} is positive semidefinite (but possibly nonsymmetric), i.e., \langle h,B^{k}h\rangle\geq 0 for all h\in\mathbb{R}^{n}, we can still solve (3.3) without symmetrization. We first choose a suitable regularization parameter t_{k}\geq 0 such that

\frac{1}{2}h^{T}B^{k}h+t_{k}\|h\|^{2}\geq\lambda_{1}\|h\|^{2}\quad\forall\ h\in\mathbb{R}^{n}\quad\text{ and }\quad\|B^{k}+t_{k}I\|\leq\lambda_{2}, (B.1)

where \lambda_{1},\lambda_{2}>0 are constants chosen independently of k. We then consider the linear system

(B^{k}+t_{k}I)p=-g^{k} (B.2)

and solve it to get an approximate solution p^{k} satisfying

(B^{k}+t_{k}I)p^{k}=-g^{k}+r^{k}\quad\text{and}\quad\|r^{k}\|\leq\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\|g^{k}\|, (B.3)

where r^{k} denotes the residual. Finally, we project p^{k} onto the trust region, i.e., with \bar{p}^{k}:=p^{k}/\|p^{k}\|, we set

s^{k}=\min\{\Delta_{k},\|p^{k}\|\}\,\bar{p}^{k}. (B.4)
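
For illustration, the step (B.2)–(B.4) can be sketched as follows (NumPy; the function name trust_region_step is ours). For simplicity, the regularized system is solved exactly, so that r^{k}=0 and (B.3) holds trivially; in practice, an inexact iterative solve meeting the residual bound in (B.3) suffices.

    import numpy as np

    def trust_region_step(B, g, t, Delta):
        """Compute s^k via (B.2)-(B.4).

        B     : the (possibly nonsymmetric) positive semidefinite matrix B^k
        g     : the model gradient g^k
        t     : the regularization parameter t_k from (B.1)
        Delta : the trust-region radius Delta_k"""
        n = g.shape[0]
        p = np.linalg.solve(B + t * np.eye(n), -g)   # exact solve of (B.2); residual r^k = 0
        norm_p = np.linalg.norm(p)
        if norm_p == 0.0:                            # only happens if g^k = 0
            return p
        return min(Delta, norm_p) * (p / norm_p)     # projection step (B.4)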

The next lemma shows that s^{k} generated by (B.2)–(B.4) satisfies condition (3.6) for some \gamma_{1},\gamma_{2}>0.

Lemma B.1.

Suppose that B^{k} is positive semidefinite and that (B.1) holds for some constants \lambda_{1},\lambda_{2}>0. Then condition (3.6) holds with \gamma_{1}=\frac{\lambda_{1}}{2\lambda_{2}} and \gamma_{2}=\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})} if s^{k} is given by (B.2)–(B.4).

Proof. Using the estimate

\begin{split}\|p^{k}\|&=\|(B^{k}+t_{k}I)^{-1}(g^{k}-r^{k})\|\geq\frac{\|g^{k}-r^{k}\|}{\|B^{k}+t_{k}I\|}\geq\frac{1}{\lambda_{2}}\|g^{k}-r^{k}\|\\&\geq\frac{1}{\lambda_{2}}\left(\|g^{k}\|-\|r^{k}\|\right)\geq\frac{1}{\lambda_{2}}\left(1-\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\right)\|g^{k}\|=\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})}\|g^{k}\|\end{split}

and the positive semidefiniteness of B^{k}, we obtain:

\begin{split}m(0)-m(s^{k})&\geq\frac{\min\{\Delta_{k},\|p^{k}\|\}}{\|p^{k}\|}\left[m(0)-m(p^{k})\right]\\
&=\frac{\min\{\Delta_{k},\|p^{k}\|\}}{\|p^{k}\|}\left[(p^{k})^{T}(B^{k}+t_{k}I)p^{k}-(r^{k})^{T}p^{k}-\frac{1}{2}(p^{k})^{T}B^{k}p^{k}\right]\\
&\geq\frac{\min\{\Delta_{k},\|p^{k}\|\}}{\|p^{k}\|}\left[\frac{1}{2}(p^{k})^{T}B^{k}p^{k}+t_{k}\|p^{k}\|^{2}-\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\|g^{k}\|\|p^{k}\|\right]\\
&\geq\lambda_{1}\|p^{k}\|\min\{\Delta_{k},\|p^{k}\|\}-\frac{\lambda_{1}}{2(\lambda_{1}+\lambda_{2})}\|g^{k}\|\min\{\Delta_{k},\|p^{k}\|\}\\
&\geq\frac{\lambda_{1}}{2\lambda_{2}}\|g^{k}\|\min\left\{\Delta_{k},\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})}\|g^{k}\|\right\}.\end{split}

Thus, (3.6) is satisfied for \gamma_{1}=\frac{\lambda_{1}}{2\lambda_{2}} and \gamma_{2}=\frac{\lambda_{1}+2\lambda_{2}}{2\lambda_{2}(\lambda_{1}+\lambda_{2})}. \square

As for the other condition (3.7), we can simply set

s^{k}=\begin{cases}\text{the solution given by (B.2)–(B.4)}&\text{if }\Delta_{k}\geq\zeta,\\ s^{k}_{C}&\text{if }\Delta_{k}<\zeta\end{cases}\quad\text{and}\quad\ell(\Delta)=\begin{cases}0&\text{if }\Delta<\zeta,\\ 1&\text{if }\Delta\geq\zeta,\end{cases} (B.5)

where \zeta>0 is a constant. We immediately obtain (3.7).
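
Schematically, the rule (B.5) can be implemented as follows, reusing trust_region_step from the sketch after (B.4); the Cauchy step s^{k}_{C} is assumed to be provided by a separate routine cauchy_step (a placeholder name and signature of our choosing), and the pair (s^{k},\ell(\Delta_{k})) is returned.

    def step_with_safeguard(B, g, t, Delta, zeta, cauchy_step):
        """Rule (B.5): use the step (B.2)-(B.4) when Delta >= zeta and fall back
        to the Cauchy step s^k_C otherwise; the second return value is l(Delta)."""
        if Delta >= zeta:
            return trust_region_step(B, g, t, Delta), 1
        return cauchy_step(B, g, Delta), 0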