
An efficient descent method for locally Lipschitz multiobjective optimization problems

Bennet Gebken Department of Mathematics, Paderborn University, Germany Sebastian Peitz Department of Mathematics, Paderborn University, Germany
Abstract

In this article, we present an efficient descent method for locally Lipschitz continuous multiobjective optimization problems (MOPs). The method is realized by combining a theoretical result regarding the computation of descent directions for nonsmooth MOPs with a practical method to approximate the subdifferentials of the objective functions. We show convergence to points which satisfy a necessary condition for Pareto optimality. Using a set of test problems, we compare our method to the multiobjective proximal bundle method by Mäkelä. The results indicate that our method is competitive while being easier to implement. While the number of objective function evaluations is larger, the overall number of subgradient evaluations is lower. Finally, we show that our method can be combined with a subdivision algorithm to compute entire Pareto sets of nonsmooth MOPs.

1 Introduction

In many real-life scenarios, the problem of optimizing multiple objectives at the same time arises. In engineering, for example, one often wants to steer a physical system as close as possible to a desired state while minimizing the required energy cost. These problems are called multiobjective optimization problems (MOPs) and generally do not possess a single optimal solution. Instead, the solution is the set of all optimal compromises, the so-called Pareto set, containing all Pareto optimal points. Due to this, the numerical computation of solutions to MOPs is more challenging than for single-objective problems. On top of that, there are numerous applications where the objectives are nonsmooth, for example contact problems in mechanics, which adds to the difficulty. In this article, we will address both difficulties combined by considering nonsmooth MOPs.

When addressing the above-mentioned difficulties, i.e., multiple objectives and nonsmoothness, separately, there exists a large number of solution methods. For smooth MOPs, the most popular methods include evolutionary [9, 10] and scalarization methods [27]. Additionally, some methods from single-objective optimization have been generalized, like gradient descent methods [13, 30, 14] and Newton's method [12]. For the nonsmooth single-objective case, commonly used methods include subgradient methods [31], bundle methods [20] and gradient sampling methods [3]. More recently, in [22], a generalization of the steepest descent method to the nonsmooth case was proposed, which is based on an efficient approximation of the subdifferential of the objective function. For nonsmooth multiobjective optimization, the literature is a lot more scarce. Since classical scalarization approaches do not require the existence of gradients, they can still be used. In [1], a generalization of the steepest descent method was proposed for the case when the full subdifferentials of the objectives are known, which is rarely the case in practice. In [2, 7], the subgradient method was generalized to the multiobjective case, but both articles report that their method is not suitable for real-life applications due to inefficiency. In [25] (see also [18, 23]), a multiobjective version of the proximal bundle method was proposed, which currently appears to be the most efficient solver.

In this article, we develop a new descent method for locally Lipschitz continuous MOPs by combining the descent direction from [1] with the approximation of the subdifferentials from [22]. In [1] it was shown that the element with the smallest norm in the negative convex hull of the subdifferentials of the objective functions is a common descent direction for all objectives. In [22], the subdifferential of the objective function was approximated by starting with a single subgradient and then systematically computing new subgradients until the element with the smallest norm in the convex hull of all subgradients is a direction of (sufficient) descent. Combining both approaches yields a descent direction for locally Lipschitz MOPs and together with an Armijo step length, we obtain a descent method. We show convergence to points which satisfy a necessary condition for Pareto optimality. Using a set of test problems, we compare the performance of our method to the multiobjective proximal bundle method from [25]. The results indicate that our method is inferior in terms of function evaluations, but superior in terms of subgradient evaluations.

The structure of this article is as follows: we start with a short introduction to nonsmooth and multiobjective optimization in Section 2. In Section 3, we derive our descent method by replacing the Clarke subdifferential for the computation of the descent direction by the Goldstein $\varepsilon$-subdifferential and then showing how the latter can be efficiently approximated. In Section 4, we apply our descent method to numerical examples. We first visualize and discuss the typical behavior of our method before comparing it to the multiobjective proximal bundle method from [25] using a set of test problems. Afterwards, we show how our method can be combined with a subdivision algorithm to approximate entire Pareto sets. Finally, in Section 5, we draw a conclusion and discuss possible future work.

2 Nonsmooth multiobjective optimization

We consider the nonsmooth multiobjective optimization problem

\min_{x\in\mathbb{R}^{n}} f(x) = \min_{x\in\mathbb{R}^{n}} \begin{pmatrix} f_{1}(x)\\ \vdots\\ f_{k}(x) \end{pmatrix}, \qquad \text{(MOP)}

where $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{k}$ is the objective vector with components $f_{i}:\mathbb{R}^{n}\rightarrow\mathbb{R}$, $i\in\{1,...,k\}$, called objective functions. We assume the objective functions to be locally Lipschitz continuous, i.e., for each $i\in\{1,...,k\}$ and $x\in\mathbb{R}^{n}$, there are $L_{i}>0$ and $\varepsilon>0$ with

|f_{i}(y)-f_{i}(z)| \leq L_{i}\|y-z\| \quad \forall y,z\in\{y\in\mathbb{R}^{n}:\|x-y\|<\varepsilon\},

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^{n}$. Since (MOP) is an optimization problem with a vector-valued objective function, the classical concept of optimality from the scalar case does not directly carry over. Instead, we are looking for the Pareto set, which is defined in the following way:

Definition 2.1.

A point $x\in\mathbb{R}^{n}$ is called Pareto optimal if there is no $y\in\mathbb{R}^{n}$ such that

f_{i}(y) \leq f_{i}(x) \quad \forall i\in\{1,...,k\},
f_{j}(y) < f_{j}(x) \quad \text{for some } j\in\{1,...,k\}.

The set of all Pareto optimal points is the Pareto set.

In practice, to check if a given point is Pareto optimal, we need optimality conditions. In the smooth case, there are the well-known KKT conditions (cf. [27]), which are based on the gradients of the objective functions. In case the objective functions are merely locally Lipschitz, the KKT conditions can be generalized using the concept of subdifferentials. In the following, we will recall the required definitions and results from nonsmooth analysis. For a more detailed introduction, we refer to [6].

Definition 2.2.

Let $\Omega_{i}\subseteq\mathbb{R}^{n}$ be the set of points where $f_{i}$ is not differentiable. Then

\partial f_{i}(x) = \operatorname{conv}(\{\xi\in\mathbb{R}^{n} : \exists (x_{j})_{j}\in\mathbb{R}^{n}\setminus\Omega_{i} \text{ with } x_{j}\rightarrow x \text{ and } \nabla f_{i}(x_{j})\rightarrow\xi \text{ for } j\rightarrow\infty\})

is the (Clarke) subdifferential of $f_{i}$ in $x$. $\xi\in\partial f_{i}(x)$ is a subgradient.

It is easy to see that if $f_{i}$ is continuously differentiable, then the Clarke subdifferential is the set containing only the gradient of $f_{i}$. We will later use the following technical result on some properties of the Clarke subdifferential (cf. [6], Prop. 2.1.2).

Lemma 2.3.

$\partial f_{i}(x)$ is nonempty, convex and compact.

Using the subdifferential, we can state a necessary optimality condition for locally Lipschitz MOPs (cf. [24], Thm. 12).

Theorem 2.4.

Let $x\in\mathbb{R}^{n}$ be Pareto optimal. Then

0 \in \operatorname{conv}\left(\bigcup_{i=1}^{k}\partial f_{i}(x)\right). \qquad (1)

In the smooth case, (1) reduces to the classical multiobjective KKT conditions. Note that in contrast to the smooth case, the optimality condition (1) is numerically challenging to work with, as subdifferentials are difficult to compute. Thus, in numerical methods, (1) is only used implicitly.

The method we are presenting in this paper is a descent method, which means that, starting from a point $x_{1}\in\mathbb{R}^{n}$, we want to generate a sequence $(x_{j})_{j}\in\mathbb{R}^{n}$ in which each point is an improvement over the previous point. This is done by computing directions $v_{j}\in\mathbb{R}^{n}$ and step lengths $t_{j}\in\mathbb{R}^{>0}$ such that $x_{j+1}=x_{j}+t_{j}v_{j}$ and

f_{i}(x_{j+1}) < f_{i}(x_{j}) \quad \forall j\in\mathbb{N},\ i\in\{1,...,k\}.

For the computation of $v_{j}$, we recall the following basic result from convex analysis that forms the theoretical foundation for descent methods in the presence of multiple (sub)gradients. Recall that $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^{n}$.

Theorem 2.5.

Let $W\subseteq\mathbb{R}^{n}$ be convex and compact and

\bar{v} := \operatorname{arg\,min}_{\xi\in -W}\|\xi\|^{2}. \qquad (2)

Then either $\bar{v}\neq 0$ and

\langle\bar{v},\xi\rangle \leq -\|\bar{v}\|^{2} < 0 \quad \forall\xi\in W, \qquad (3)

or $\bar{v}=0$ and there is no $v\in\mathbb{R}^{n}$ with $\langle v,\xi\rangle<0$ for all $\xi\in W$.

Proof.

Since $\bar{v}$ is the projection of the origin onto the closed and convex set $-W$, we have

0 \leq \langle -\xi-\bar{v},\, \bar{v}-0\rangle = -\langle\bar{v},\xi\rangle - \|\bar{v}\|^{2}
\Leftrightarrow\quad \langle\bar{v},\xi\rangle \leq -\|\bar{v}\|^{2}

for all $\xi\in W$ (cf. [4], Lem.). In particular, if $\bar{v}\neq 0$ then $\langle\bar{v},\xi\rangle\leq-\|\bar{v}\|^{2}<0$.

Conversely, $\bar{v}=0$ implies $0\in W$, so in this case there can not be any $v\in\mathbb{R}^{n}$ with $\langle v,\xi\rangle<0$ for all $\xi\in W$. ∎

Roughly speaking, Theorem 2.5 states that the element of minimal norm in the convex and compact set $-W$ is directionally opposed to all elements of $W$. To be more precise, $\bar{v}$ is contained in the intersection of all half-spaces induced by elements of $-W$. In the context of optimization, this result has several applications:

  • (i) In the smooth, single-objective case, $W=\{\nabla f(x)\}$ trivially yields the classical steepest descent method.

  • (ii) In the smooth, multiobjective case, $W=\operatorname{conv}(\{\nabla f_{1}(x),...,\nabla f_{k}(x)\})$ yields the descent direction from [13] (after dualization) and [30].

  • (iii) In the nonsmooth, single-objective case, $W=\partial f(x)$ yields the descent direction from [6], Prop. 6.2.4.

  • (iv) In the nonsmooth, multiobjective case, $W=\operatorname{conv}\left(\bigcup_{i=1}^{k}\partial f_{i}(x)\right)$ yields the descent direction from [1].

In (i) and (ii), the solution of problem (2) is straightforward, since $W$ is a convex polytope with the gradients as vertices. In (iii), the solution of (2) is non-trivial due to the difficulty of computing the subdifferential. In subgradient methods [31], the solution is approximated by using a single subgradient instead of the entire subdifferential. As a result, it can not be guaranteed that the solution is a descent direction and in particular, (2) can not be used as a stopping criterion. In gradient sampling methods [3], the subdifferential is approximated by the convex hull of gradients of the objective function in randomly sampled points around the current point. Due to the randomness, it can not be guaranteed that the resulting direction yields sufficient descent. Additionally, a check for differentiability of the objective is required, which can pose a problem [17]. In (iv), the solution of (2) gets even more complicated due to the presence of multiple subdifferentials. So far, the only methods that deal with (2) in this case are multiobjective versions of the subgradient method [2, 7], which were reported unsuitable for real life applications.
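To make this concrete, the following sketch (our own illustration, not part of the original methods) shows how (2) can be solved numerically once $W$ is the convex hull of finitely many (sub)gradients $\xi_{1},...,\xi_{m}$: writing the elements of $W$ as convex combinations turns (2) into a quadratic program over the unit simplex. The helper name min_norm_element and the use of scipy's SLSQP solver are our choices.

# Sketch (not from the paper): solving (2) when W = conv({xi_1, ..., xi_m}).
# Writing v = -sum_i lam_i xi_i with lam in the unit simplex turns (2) into
# a quadratic program in lam, solved here with scipy's SLSQP method.
import numpy as np
from scipy.optimize import minimize

def min_norm_element(W):
    """W: (m, n) array whose rows are the (sub)gradients xi_1, ..., xi_m.
    Returns v = argmin_{v in -conv(W)} ||v||^2."""
    m = W.shape[0]
    G = W @ W.T                                  # Gram matrix <xi_i, xi_j>
    obj = lambda lam: lam @ G @ lam              # ||sum_i lam_i xi_i||^2
    res = minimize(obj, np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=({'type': 'eq',
                                 'fun': lambda lam: lam.sum() - 1.0},),
                   method='SLSQP')
    return -(res.x @ W)                          # minimal-norm element of -conv(W)

# Two gradients pointing apart: v opposes both of them, cf. (3).
W = np.array([[1.0, 0.0], [0.0, 1.0]])
v = min_norm_element(W)                          # approx. (-0.5, -0.5)
print(v, W @ v)                                  # <v, xi_i> <= -||v||^2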

In the following section, we will describe a new way to compute descent directions for nonsmooth MOPs by systematically computing an approximation of $\operatorname{conv}\left(\bigcup_{i=1}^{k}\partial f_{i}(x)\right)$ that is sufficient to obtain a "good enough" descent direction from (2).

3 Descent method for nonsmooth MOPs

In this section, we will present a method to compute descent directions of nonsmooth MOPs that generalizes the method from [22] to the multiobjective case. As described in the previous section, when computing descent directions via Theorem 2.5, one has the problem of having to compute subdifferentials. Since these are difficult to come by in practice, we will instead replace $W$ in Theorem 2.5 by an approximation of $\operatorname{conv}\left(\bigcup_{i=1}^{k}\partial f_{i}(x)\right)$ such that the resulting direction is guaranteed to have sufficient descent. To this end, we will first replace the Clarke subdifferential by the so-called $\varepsilon$-subdifferential, and then take a finite approximation of the latter.

3.1 The $\varepsilon$-subdifferential

By definition, $\partial f_{i}(x)$ is the convex hull of the limits of the gradient of $f_{i}$ in all sequences near $x$ that converge to $x$. Thus, if we evaluate $\nabla f_{i}$ in a number of points close to $x$ (where it is defined) and take the convex hull, we expect the resulting set to be an approximation of $\partial f_{i}(x)$. To formalize this, we introduce the following definition [16, 21].

Definition 3.1.

Let $\varepsilon\geq 0$, $x\in\mathbb{R}^{n}$ and $B_{\varepsilon}(x):=\{y\in\mathbb{R}^{n}:\|x-y\|\leq\varepsilon\}$. Then

\partial_{\varepsilon}f_{i}(x) := \operatorname{conv}\left(\bigcup_{y\in B_{\varepsilon}(x)}\partial f_{i}(y)\right)

is the (Goldstein) $\varepsilon$-subdifferential of $f_{i}$ in $x$. $\xi\in\partial_{\varepsilon}f_{i}(x)$ is an $\varepsilon$-subgradient.

Note that $\partial_{0}f_{i}(x)=\partial f_{i}(x)$ and $\partial f_{i}(x)\subseteq\partial_{\varepsilon}f_{i}(x)$. For $\varepsilon\geq 0$, we define the multiobjective analogue

F_{\varepsilon}(x) := \operatorname{conv}\left(\bigcup_{i=1}^{k}\partial_{\varepsilon}f_{i}(x)\right).

To be able to choose $W=F_{\varepsilon}(x)$ in Theorem 2.5, we first need to establish some properties of $F_{\varepsilon}(x)$.

Lemma 3.2.

$\partial_{\varepsilon}f_{i}(x)$ is nonempty, convex and compact. In particular, the same holds for $F_{\varepsilon}(x)$.

Proof.

For $\partial_{\varepsilon}f_{i}(x)$, this was shown in [16], Prop. 2.3. For $F_{\varepsilon}(x)$, it then follows directly from the definition. ∎

We immediately get the following corollary from Theorems 2.4 and 2.5.

Corollary 3.3.

Let $\varepsilon\geq 0$.

  • a) If $x$ is Pareto optimal, then

    0 \in F_{\varepsilon}(x). \qquad (4)

  • b) Let $x\in\mathbb{R}^{n}$ and

    \bar{v} := \operatorname{arg\,min}_{\xi\in -F_{\varepsilon}(x)}\|\xi\|^{2}. \qquad (5)

    Then either $\bar{v}\neq 0$ and

    \langle\bar{v},\xi\rangle \leq -\|\bar{v}\|^{2} < 0 \quad \forall\xi\in F_{\varepsilon}(x), \qquad (6)

    or $\bar{v}=0$ and there is no $v\in\mathbb{R}^{n}$ with $\langle v,\xi\rangle<0$ for all $\xi\in F_{\varepsilon}(x)$.

The previous corollary states that when working with the $\varepsilon$-subdifferential instead of the Clarke subdifferential, we still have a necessary optimality condition and a way to compute descent directions, although the optimality condition is weaker and the resulting direction yields a weaker descent. This is illustrated in the following example.

Example 3.4.

Consider the locally Lipschitz function

f:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}, \quad x\mapsto \begin{pmatrix}(x_{1}-1)^{2}+(x_{2}-1)^{2}\\ x_{1}^{2}+|x_{2}|\end{pmatrix}.

The set of nondifferentiable points of $f$ is $\mathbb{R}\times\{0\}$. For $\varepsilon>0$ and $x\in\mathbb{R}^{2}$ we have

\nabla f_{1}(x) = \begin{pmatrix}2(x_{1}-1)\\ 2(x_{2}-1)\end{pmatrix} \quad\text{and}\quad \partial_{\varepsilon}f_{1}(x) = 2B_{\varepsilon}(x) - \begin{pmatrix}2\\ 2\end{pmatrix}.

For $x\in\mathbb{R}\times\{0\}$ we have

\partial f_{2}(x) = \{2x_{1}\}\times[-1,1] \quad\text{and}\quad \partial_{\varepsilon}f_{2}(x) = (2x_{1}+[-2\varepsilon,2\varepsilon])\times[-1,1].

Figure 1 shows the Clarke subdifferential (a), the $\varepsilon$-subdifferential (b) for $\varepsilon=0.2$ and the corresponding sets $F_{\varepsilon}(x)$ for $x=(1.5,0)^{\top}$.

Figure 1: Clarke subdifferentials (a), $\varepsilon$-subdifferentials (b) for $\varepsilon=0.2$ and the corresponding sets $F_{\varepsilon}(x)$ for $x=(1.5,0)^{\top}$ in Example 3.4.

Additionally, the corresponding solutions of (5) are shown. In this case, the predicted descent $-\|\bar{v}\|^{2}$ (cf. (3)) is approximately $-3.7692$ in (a) and $-2.4433$ in (b).

Figure 2 shows the same scenario for $x=(0.5,0)^{\top}$.

Figure 2: Clarke subdifferentials (a), $\varepsilon$-subdifferentials (b) for $\varepsilon=0.2$ and the corresponding sets $F_{\varepsilon}(x)$ for $x=(0.5,0)^{\top}$ in Example 3.4.

Here, the Clarke subdifferential still yields a descent direction, while $\bar{v}=0$ for the $\varepsilon$-subdifferential. In other words, $x$ satisfies the necessary optimality condition (4) but not (1).

The following lemma shows that for the direction from (5), there is a lower bound on the step size up to which we have guaranteed descent in each objective function $f_{i}$.

Lemma 3.5.

Let $\varepsilon\geq 0$ and $\bar{v}$ be the solution of (5). Then

f_{i}(x+t\bar{v}) \leq f_{i}(x) - t\|\bar{v}\|^{2} \quad \forall t\leq\frac{\varepsilon}{\|\bar{v}\|}.
Proof.

Let $t\leq\frac{\varepsilon}{\|\bar{v}\|}$. Since $f_{i}$ is locally Lipschitz continuous on $\mathbb{R}^{n}$, it is in particular Lipschitz continuous on an open set containing $x+[0,t]\bar{v}$. By applying the mean value theorem (cf. [6], Thm. 2.3.7), we obtain

f_{i}(x+t\bar{v}) - f_{i}(x) \in \langle\partial f_{i}(x+r\bar{v}),\, t\bar{v}\rangle

for some $r\in(0,t)$. Since $\|x-(x+r\bar{v})\| = r\|\bar{v}\| < \varepsilon$, we have $\partial f_{i}(x+r\bar{v})\subseteq\partial_{\varepsilon}f_{i}(x)$. This means that there is some $\xi\in\partial_{\varepsilon}f_{i}(x)$ such that

f_{i}(x+t\bar{v}) - f_{i}(x) = t\langle\xi,\bar{v}\rangle.

Combined with (6) we obtain

f_{i}(x+t\bar{v}) - f_{i}(x) \leq -t\|\bar{v}\|^{2}
\Leftrightarrow\quad f_{i}(x+t\bar{v}) \leq f_{i}(x) - t\|\bar{v}\|^{2}. \qquad ∎

In the following, we will describe how we can obtain a good approximation of (5) without requiring full knowledge of the ε\varepsilon-subdifferentials.

3.2 Efficient computation of descent directions

In this part, we will describe how the solution of (5) can be approximated when only a single subgradient can be computed at every $x\in\mathbb{R}^{n}$. Similar to the gradient sampling approach, the idea behind our method is to replace $F_{\varepsilon}(x)$ in (5) by the convex hull of a finite number of $\varepsilon$-subgradients $\xi_{1},...,\xi_{m}\in F_{\varepsilon}(x)$, $m\in\mathbb{N}$. Since it is impossible to know a priori how many and which $\varepsilon$-subgradients are required to obtain a good descent direction, we solve (5) multiple times in an iterative approach to enrich our approximation until a satisfying direction has been found. To this end, we have to specify how to enrich our current approximation $\operatorname{conv}(\{\xi_{1},...,\xi_{m}\})$ and how to characterize an acceptable descent direction.

Let $W=\{\xi_{1},...,\xi_{m}\}\subseteq F_{\varepsilon}(x)$ and

\tilde{v} := \operatorname{arg\,min}_{v\in -\operatorname{conv}(W)}\|v\|^{2}. \qquad (7)

Let $c\in(0,1)$. Motivated by Lemma 3.5, we regard $\tilde{v}$ as an acceptable descent direction if

f_{i}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) \leq f_{i}(x) - c\varepsilon\|\tilde{v}\| \quad \forall i\in\{1,...,k\}. \qquad (8)

If the set $I\subseteq\{1,...,k\}$ for which (8) is violated is non-empty, then we have to find a new $\varepsilon$-subgradient $\xi^{\prime}\in F_{\varepsilon}(x)$ such that $W\cup\{\xi^{\prime}\}$ yields a better descent direction. Intuitively, (8) being violated means that the local behavior of $f_{i}$, $i\in I$, in $x$ in the direction $\tilde{v}$ is not sufficiently captured in $W$. Thus, for each $i\in I$, we expect that there exists some $t^{\prime}\in(0,\frac{\varepsilon}{\|\tilde{v}\|}]$ such that $\xi^{\prime}\in\partial f_{i}(x+t^{\prime}\tilde{v})$ improves the approximation of $F_{\varepsilon}(x)$. This is proven in the following lemma.

Lemma 3.6.

Let $c\in(0,1)$, $W=\{\xi_{1},...,\xi_{m}\}\subseteq F_{\varepsilon}(x)$ and $\tilde{v}$ be the solution of (7). If

f_{i}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) > f_{i}(x) - c\varepsilon\|\tilde{v}\|,

then there is some $t^{\prime}\in(0,\frac{\varepsilon}{\|\tilde{v}\|}]$ and $\xi^{\prime}\in\partial f_{i}(x+t^{\prime}\tilde{v})$ such that

\langle\tilde{v},\xi^{\prime}\rangle > -c\|\tilde{v}\|^{2}. \qquad (9)

In particular, $\xi^{\prime}\in F_{\varepsilon}(x)\setminus\operatorname{conv}(W)$.

Proof.

Assume that for all $t^{\prime}\in(0,\frac{\varepsilon}{\|\tilde{v}\|}]$ and $\xi^{\prime}\in\partial f_{i}(x+t^{\prime}\tilde{v})$ we have

\langle\tilde{v},\xi^{\prime}\rangle \leq -c\|\tilde{v}\|^{2}. \qquad (10)

By applying the mean value theorem as in Lemma 3.5, we obtain

f_{i}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) - f_{i}(x) \in \left\langle\partial f_{i}(x+r\tilde{v}),\, \frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right\rangle

for some $r\in(0,\frac{\varepsilon}{\|\tilde{v}\|})$. This means that there is some $\xi^{\prime}\in\partial f_{i}(x+r\tilde{v})$ such that

f_{i}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) - f_{i}(x) = \left\langle\xi^{\prime},\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right\rangle = \frac{\varepsilon}{\|\tilde{v}\|}\langle\xi^{\prime},\tilde{v}\rangle.

By (10) it follows that

f_{i}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) - f_{i}(x) \leq -c\varepsilon\|\tilde{v}\|
\Leftrightarrow\quad f_{i}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) \leq f_{i}(x) - c\varepsilon\|\tilde{v}\|,

which is a contradiction. In particular, (3) yields $\xi^{\prime}\in F_{\varepsilon}(x)\setminus\operatorname{conv}(W)$. ∎

The following example visualizes the previous lemma.

Example 3.7.

Consider $f$ as in Example 3.4, $\varepsilon=0.2$ and $x=(0.75,0)^{\top}$. The dashed lines in Figure 3 show the $\varepsilon$-subdifferentials, $F_{\varepsilon}(x)$ and the resulting descent direction (cf. Figures 1 and 2).

Figure 3: Approximations of $F_{\varepsilon}(x)$ for $\varepsilon=0.2$ and $x=(0.75,0)^{\top}$ in Example 3.7. $F_{\varepsilon}(x)$ is approximated by $\operatorname{conv}(\{\xi_{1},\xi_{2}\})$ in (a) and by $\operatorname{conv}(\{\xi_{1},\xi_{2},\xi^{\prime}\})$ in (b).

Let $y=(0.94,-0.02)^{\top}$. Then $\|x-y\|\approx 0.191\leq\varepsilon$, so $y\in B_{\varepsilon}(x)$ and

\partial_{\varepsilon}f_{1}(x) \supseteq \partial f_{1}(y) = \left\{\begin{pmatrix}-0.12\\ -2.04\end{pmatrix}\right\} =: \{\xi_{1}\},
\partial_{\varepsilon}f_{2}(x) \supseteq \partial f_{2}(y) = \left\{\begin{pmatrix}1.88\\ -1\end{pmatrix}\right\} =: \{\xi_{2}\}.

Let $W:=\{\xi_{1},\xi_{2}\}$ and let $\operatorname{conv}(W)$ be the approximation of $F_{\varepsilon}(x)$, shown as the solid line in Figure 3(a). Let $\tilde{v}$ be the solution of (7) for this $W$ and choose $c=0.25$. Checking (8), we have

f_{2}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) \approx 0.6101 > 0.4748 \approx f_{2}(x) - c\varepsilon\|\tilde{v}\|.

By Lemma 3.6, this means that there is some $t^{\prime}\in(0,\frac{\varepsilon}{\|\tilde{v}\|}]$ and $\xi^{\prime}\in\partial f_{2}(x+t^{\prime}\tilde{v})$ such that

\langle\tilde{v},\xi^{\prime}\rangle > -c\|\tilde{v}\|^{2}.

In this case, we can take for example $t^{\prime}=\frac{1}{2}\frac{\varepsilon}{\|\tilde{v}\|}$, resulting in

\partial f_{2}(x+t^{\prime}\tilde{v}) \approx \left\{\begin{pmatrix}1.4077\\ 1\end{pmatrix}\right\} =: \{\xi^{\prime}\},
\langle\tilde{v},\xi^{\prime}\rangle \approx 0.4172 > -0.7696 \approx -c\|\tilde{v}\|^{2}.

Figure 3(b) shows the improved approximation $W\cup\{\xi^{\prime}\}$ and the resulting descent direction $\tilde{v}$. By checking (8) for this new descent direction, we see that $\tilde{v}$ is acceptable. (Note that in general, a single improvement step like this will not be sufficient to reach an acceptable direction.)

Note that Lemma 3.6 only shows the existence of $t^{\prime}$ and $\xi^{\prime}$ without stating how to actually compute them. To this end, let $i$ be the index of an objective function for which (8) is not satisfied, define

h_{i}:\mathbb{R}\rightarrow\mathbb{R}, \quad t\mapsto f_{i}(x+t\tilde{v}) - f_{i}(x) + ct\|\tilde{v}\|^{2} \qquad (11)

(cf. [22]) and consider Algorithm 1. If $f_{i}$ is continuously differentiable around $x$, then (9) is equivalent to $h_{i}^{\prime}(t^{\prime})>0$, i.e., $h_{i}$ being monotonically increasing around $t^{\prime}$. Thus, the idea of Algorithm 1 is to find some $t$ such that $h_{i}$ is monotonically increasing around $t$, while checking whether (9) is satisfied for a subgradient $\xi\in\partial f_{i}(x+t\tilde{v})$.

Algorithm 1 Compute new subgradient
0:  Current point $x\in\mathbb{R}^{n}$, direction $\tilde{v}$, tolerance $\varepsilon$, Armijo parameter $c\in(0,1)$.
1:  Set $a=0$, $b=\frac{\varepsilon}{\|\tilde{v}\|}$ and $t=\frac{a+b}{2}$.
2:  Compute $\xi^{\prime}\in\partial f_{i}(x+t\tilde{v})$.
3:  If $\langle\tilde{v},\xi^{\prime}\rangle > -c\|\tilde{v}\|^{2}$ then stop.
4:  If $h_{i}(b) > h_{i}(t)$ then set $a=t$. Otherwise set $b=t$.
5:  Set $t=\frac{a+b}{2}$ and go to step 2.
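
Below is a minimal Python sketch of Algorithm 1 (our own illustration; all names are hypothetical). It assumes an oracle subgrad_i that returns a single element of $\partial f_{i}(y)$, and it caps the number of bisection steps, which stands in for the finite-termination assumption made after Lemma 3.8.

# Hedged sketch of Algorithm 1 (names are ours).  `f_i` evaluates the
# objective, `subgrad_i(y)` is an assumed oracle returning one element
# of its Clarke subdifferential at y, and `h` is the function from (11).
import numpy as np

def new_subgradient(f_i, subgrad_i, x, v, eps, c, max_iter=50):
    nv = np.linalg.norm(v)
    h = lambda t: f_i(x + t * v) - f_i(x) + c * t * nv**2   # cf. (11)
    a, b = 0.0, eps / nv                                    # step 1
    t = 0.5 * (a + b)
    for _ in range(max_iter):
        xi = subgrad_i(x + t * v)                           # step 2
        if v @ xi > -c * nv**2:                             # step 3, cf. (9)
            return xi
        if h(b) > h(t):                                     # step 4
            a = t
        else:
            b = t
        t = 0.5 * (a + b)                                   # step 5
    raise RuntimeError("no subgradient satisfying (9) found")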

Although in the general setting we can not guarantee that Algorithm 1 yields a subgradient satisfying (9), we can at least show that after finitely many iterations, a value $t$ is found such that $\partial f_{i}(x+t\tilde{v})$ contains a subgradient that satisfies (9).

Lemma 3.8.

Let $(t_{k})_{k}$ be the sequence generated in Algorithm 1. If $(t_{k})_{k}$ is finite, then some $\xi^{\prime}$ was found such that (9) is satisfied. If $(t_{k})_{k}$ is infinite, then it converges to some $\bar{t}\in[0,\frac{\varepsilon}{\|\tilde{v}\|}]$ such that there is some $\xi^{\prime}\in\partial f_{i}(x+\bar{t}\tilde{v})$ which satisfies (9). Additionally, there is some $N\in\mathbb{N}$ such that for all $k>N$ there is some $\xi^{\prime}\in\partial f_{i}(x+t_{k}\tilde{v})$ satisfying (9).

Proof.

Let $(t_{k})_{k}$ be finite with last element $\bar{t}\in(0,\frac{\varepsilon}{\|\tilde{v}\|})$. Then Algorithm 1 must have stopped in step 3, i.e., some $\xi^{\prime}\in\partial f_{i}(x+\bar{t}\tilde{v})$ satisfying (9) was found.
Now let $(t_{k})_{k}$ be infinite. By construction, $(t_{k})_{k}$ is a Cauchy sequence in the compact set $[0,\frac{\varepsilon}{\|\tilde{v}\|}]$, so it has to converge to some $\bar{t}\in[0,\frac{\varepsilon}{\|\tilde{v}\|}]$. Additionally, since (8) is violated for the index $i$ by assumption, we have

h_{i}(0) = 0 \quad\text{and}\quad h_{i}\left(\frac{\varepsilon}{\|\tilde{v}\|}\right) > 0.

Let $(a_{k})_{k}$ and $(b_{k})_{k}$ be the sequences corresponding to $a$ and $b$ in Algorithm 1 (at the start of each iteration). Then $h_{i}(a_{k}) < h_{i}(b_{k})$ for all $k\in\mathbb{N}$. Thus, by the mean value theorem, there has to be some $r_{k}\in(a_{k},b_{k})$ such that

0 < h_{i}(b_{k}) - h_{i}(a_{k}) \in \langle\partial h_{i}(r_{k}),\, b_{k}-a_{k}\rangle = \partial h_{i}(r_{k})(b_{k}-a_{k}).

In particular, $\lim_{k\rightarrow\infty}r_{k}=\bar{t}$ and since $a_{k}<b_{k}$, $\partial h_{i}(r_{k})\cap\mathbb{R}^{>0}\neq\emptyset$ for all $k\in\mathbb{N}$. By upper semicontinuity of $\partial h_{i}$ there must be some $\theta\in\partial h_{i}(\bar{t})$ with $\theta>0$. By the chain rule, we have

0 < \theta \in \partial h_{i}(\bar{t}) \subseteq \langle\tilde{v},\partial f_{i}(x+\bar{t}\tilde{v})\rangle + c\|\tilde{v}\|^{2}. \qquad (12)

Thus, there must be some $\xi^{\prime}\in\partial f_{i}(x+\bar{t}\tilde{v})$ with

0 < \langle\tilde{v},\xi^{\prime}\rangle + c\|\tilde{v}\|^{2}
\Leftrightarrow\quad \langle\tilde{v},\xi^{\prime}\rangle > -c\|\tilde{v}\|^{2}.

By upper semicontinuity of $\partial h_{i}$ it also follows that there is some $N\in\mathbb{N}$ such that $\partial h_{i}(t_{k})\cap\mathbb{R}^{>0}\neq\emptyset$ for all $k>N$. Applying the same argument as above completes the proof. ∎

In the following remark, we will briefly discuss the implications of Lemma 3.8 for the practical use of Algorithm 1.

Remark 3.9.

Let $N\in\mathbb{N}$ be as in Lemma 3.8.

  • a) Note that if $k>N$ and $h_{i}$ is differentiable in $t_{k}$, then we have

    0 < \nabla h_{i}(t_{k}) = \langle\tilde{v},\nabla f_{i}(x+t_{k}\tilde{v})\rangle + c\|\tilde{v}\|^{2},

    i.e., the stopping criterion in step 3 is satisfied. Thus, if Algorithm 1 generates an infinite sequence, $h_{i}$ must be nonsmooth in $t_{k}$ for all $k>N$. In particular, $f_{i}$ must be nonsmooth in $x+t_{k}\tilde{v}$ for all $k>N$.

  • b) If $f$ is regular (cf. [6], Def. 2.3.4), then equality holds when applying the chain rule to $h_{i}$ (cf. [6], Thm. 2.3.10), i.e.,

    \partial h_{i}(t_{k}) = \langle\tilde{v},\partial f_{i}(x+t_{k}\tilde{v})\rangle + c\|\tilde{v}\|^{2}.

    Thus, if Algorithm 1 generates an infinite sequence, then for all $k>N$ there must be both positive and negative elements in $\partial h_{i}(t_{k})$. By convexity of $\partial h_{i}(t_{k})$, this implies that $0\in\partial h_{i}(t_{k})$ for all $k>N$, i.e., $h_{i}$ must have infinitely many (nonsmooth) stationary points in $[0,\frac{\varepsilon}{\|\tilde{v}\|}]$.

Motivated by the previous remark, we will from now on assume that Algorithm 1 stops after finitely many iterations and thus yields a new subgradient satisfying (9). We can use this method of finding new subgradients to construct an algorithm that computes descent directions of nonsmooth MOPs, namely Algorithm 2.

Algorithm 2 Compute descent direction
0:  Current point $x\in\mathbb{R}^{n}$, tolerances $\varepsilon,\delta>0$, Armijo parameter $c\in(0,1)$.
1:  Compute $\xi^{i}_{1}\in\partial_{\varepsilon}f_{i}(x)$ for all $i\in\{1,...,k\}$. Set $W_{1}=\{\xi^{1}_{1},...,\xi^{k}_{1}\}$ and $l=1$.
2:  Compute $v_{l}=\operatorname{arg\,min}_{v\in-\operatorname{conv}(W_{l})}\|v\|^{2}$.
3:  If $\|v_{l}\|\leq\delta$ then stop.
4:  Find all objective functions for which there is insufficient descent:
    I_{l} = \left\{j\in\{1,...,k\} : f_{j}\left(x+\frac{\varepsilon}{\|v_{l}\|}v_{l}\right) > f_{j}(x) - c\varepsilon\|v_{l}\|\right\}.
    If $I_{l}=\emptyset$ then stop.
5:  For each $j\in I_{l}$, compute $t\in(0,\frac{\varepsilon}{\|v_{l}\|}]$ and $\xi^{j}_{l}\in\partial f_{j}(x+tv_{l})$ such that
    \langle v_{l},\xi^{j}_{l}\rangle > -c\|v_{l}\|^{2}
    via Algorithm 1.
6:  Set $W_{l+1}=W_{l}\cup\{\xi^{j}_{l}:j\in I_{l}\}$, $l=l+1$ and go to step 2.
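
A compact sketch of Algorithm 2 (again our own illustration), reusing min_norm_element and new_subgradient from the sketches above; for step 1 we simply take subgradients at $x$ itself, which lie in $\partial_{\varepsilon}f_{i}(x)$:

# Hedged sketch of Algorithm 2, reusing the helpers sketched above.
def descent_direction(fs, subgrads, x, eps, delta, c):
    """fs / subgrads: lists of objectives f_i and subgradient oracles."""
    W = [sg(x) for sg in subgrads]              # step 1: one xi per objective
    while True:
        v = min_norm_element(np.array(W))       # step 2, cf. (7)
        nv = np.linalg.norm(v)
        if nv <= delta:                         # step 3
            return v
        I = [i for i, f in enumerate(fs)        # step 4, cf. (8)
             if f(x + (eps / nv) * v) > f(x) - c * eps * nv]
        if not I:
            return v                            # acceptable descent direction
        for i in I:                             # step 5: enrich W (Algorithm 1)
            W.append(new_subgradient(fs[i], subgrads[i], x, v, eps, c))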

The following theorem shows that Algorithm 2 stops after a finite number of iterations and produces an acceptable descent direction (cf. (8)).

Theorem 3.10.

Algorithm 2 terminates. In particular, if $\tilde{v}$ is the last element of $(v_{l})_{l}$, then either $\|\tilde{v}\|\leq\delta$ or $\tilde{v}$ is an acceptable descent direction, i.e.,

f_{i}\left(x+\frac{\varepsilon}{\|\tilde{v}\|}\tilde{v}\right) \leq f_{i}(x) - c\varepsilon\|\tilde{v}\| \quad \forall i\in\{1,...,k\}.
Proof.

Assume that Algorithm 2 does not terminate, i.e., $(v_{l})_{l\in\mathbb{N}}$ is an infinite sequence. Let $l>1$ and $j\in I_{l-1}$. Since $\xi^{j}_{l-1}\in W_{l}$ and $-v_{l-1}\in\operatorname{conv}(W_{l-1})\subseteq\operatorname{conv}(W_{l})$ we have

\|v_{l}\|^{2} \leq \|-v_{l-1} + s(\xi^{j}_{l-1}+v_{l-1})\|^{2}
            = \|v_{l-1}\|^{2} - 2s\langle v_{l-1},\xi^{j}_{l-1}+v_{l-1}\rangle + s^{2}\|\xi^{j}_{l-1}+v_{l-1}\|^{2}
            = \|v_{l-1}\|^{2} - 2s\langle v_{l-1},\xi^{j}_{l-1}\rangle - 2s\|v_{l-1}\|^{2} + s^{2}\|\xi^{j}_{l-1}+v_{l-1}\|^{2} \qquad (13)

for all $s\in[0,1]$. Since $j\in I_{l-1}$, we must have

\langle v_{l-1},\xi^{j}_{l-1}\rangle > -c\|v_{l-1}\|^{2} \qquad (14)

by step 5. Let $L$ be a common Lipschitz constant of all $f_{i}$, $i\in\{1,...,k\}$, on the closed $\varepsilon$-ball $B_{\varepsilon}(x)$ around $x$. Then by [6], Prop. 2.1.2, and the definition of the $\varepsilon$-subdifferential, we must have $\|\xi\|\leq L$ for all $\xi\in F_{\varepsilon}(x)$. So in particular,

\|\xi^{j}_{l-1}+v_{l-1}\| \leq 2L. \qquad (15)

Combining (13) with (14) and (15) yields

\|v_{l}\|^{2} < \|v_{l-1}\|^{2} + 2sc\|v_{l-1}\|^{2} - 2s\|v_{l-1}\|^{2} + 4s^{2}L^{2}
            = \|v_{l-1}\|^{2} - 2s(1-c)\|v_{l-1}\|^{2} + 4s^{2}L^{2}.

Let $s:=\frac{1-c}{4L^{2}}\|v_{l-1}\|^{2}$. Since $1-c\in(0,1)$ and $\|v_{l-1}\|\leq L$, we have $s\in(0,1)$. We obtain

\|v_{l}\|^{2} < \|v_{l-1}\|^{2} - 2\frac{(1-c)^{2}}{4L^{2}}\|v_{l-1}\|^{4} + \frac{(1-c)^{2}}{4L^{2}}\|v_{l-1}\|^{4}
            = \left(1-\frac{(1-c)^{2}}{4L^{2}}\|v_{l-1}\|^{2}\right)\|v_{l-1}\|^{2}.

Since Algorithm 2 did not terminate, it holds that $\|v_{l-1}\|>\delta$. It follows that

\|v_{l}\|^{2} < \left(1-\left(\frac{(1-c)\delta}{2L}\right)^{2}\right)\|v_{l-1}\|^{2}.

Let $r=1-\left(\frac{(1-c)\delta}{2L}\right)^{2}$. Note that we have $\delta<\|v_{l}\|\leq L$ for all $l\in\mathbb{N}$, so $r\in(0,1)$. Additionally, $r$ does not depend on $l$, so we have

\|v_{l}\|^{2} < r\|v_{l-1}\|^{2} < r^{2}\|v_{l-2}\|^{2} < ... < r^{l-1}\|v_{1}\|^{2} \leq r^{l-1}L^{2}.

In particular, there is some $l$ such that $\|v_{l}\|\leq\delta$, which is a contradiction. ∎

Remark 3.11.

The proof of Theorem 3.10 shows that for convergence of Algorithm 2, it would be sufficient to consider only a single $j\in I_{l}$ in step 5. Similarly, for the initial approximation $W_{1}$, a single element of $\partial_{\varepsilon}f_{i}(x)$ for any $i\in\{1,...,k\}$ would be enough. A modification of either step can potentially reduce the number of executions of step 5 (i.e., Algorithm 1) in Algorithm 2 in case the $\varepsilon$-subdifferentials of multiple objective functions are similar. Nonetheless, we will restrict the discussion in this article to Algorithm 2 as it is, since both modifications also introduce a bias towards certain objective functions, which we want to avoid.

To highlight the strengths of Algorithm 2, we will consider an example where standard gradient sampling approaches can fail to obtain a useful descent direction.

Example 3.12.

For $a,b\in\mathbb{R}\setminus\{0\}$, consider the locally Lipschitz function

f:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}, \quad x\mapsto \begin{pmatrix}(x_{1}-1)^{2}+(x_{2}-1)^{2}\\ |x_{2}-a|x_{1}||+bx_{2}\end{pmatrix}.

The set of nondifferentiable points is

\Omega_{f} = (\{0\}\times\mathbb{R}) \cup \{(\lambda,a|\lambda|)^{\top} : \lambda\in\mathbb{R}\},

separating $\mathbb{R}^{2}$ into four smooth areas (cf. Figure 4(a)). For large $a>0$, the two areas above the graph of $\lambda\mapsto a|\lambda|$ become small, making it difficult to compute the subdifferential close to $(0,0)^{\top}$.

Let $a=10$, $b=0.5$, $\varepsilon=10^{-3}$ and $x=(10^{-4},10^{-4})^{\top}$. In this case, $(0,0)^{\top}$ is the minimal point of $f_{2}$ and

\partial_{\varepsilon}f_{2}(x) = \operatorname{conv}\left\{\begin{pmatrix}-a\\ b-1\end{pmatrix},\begin{pmatrix}a\\ b+1\end{pmatrix},\begin{pmatrix}a\\ b-1\end{pmatrix},\begin{pmatrix}-a\\ b+1\end{pmatrix}\right\}
= \operatorname{conv}\left\{\begin{pmatrix}-10\\ -0.5\end{pmatrix},\begin{pmatrix}10\\ 1.5\end{pmatrix},\begin{pmatrix}10\\ -0.5\end{pmatrix},\begin{pmatrix}-10\\ 1.5\end{pmatrix}\right\}.

In particular, $0\in\partial_{\varepsilon}f_{2}(x)$, so the descent direction from (5) with the exact $\varepsilon$-subdifferentials is zero. When applying Algorithm 2 in $x$, after two iterations we obtain

\tilde{v} = v_{2} \approx (0.118, 1.185)\cdot 10^{-9},

i.e., $\|\tilde{v}\|\approx 1.191\cdot 10^{-9}$. Thus, $x$ is correctly identified as (potentially) Pareto optimal. The final approximation $W_{2}$ of $F_{\varepsilon}(x)$ is

W_{2} = \left\{\xi^{1}_{1},\xi^{2}_{1},\xi^{2}_{2}\right\} = \left\{\begin{pmatrix}10\\ -0.5\end{pmatrix},\begin{pmatrix}-1.9998\\ -1.9998\end{pmatrix},\begin{pmatrix}-10\\ 1.5\end{pmatrix}\right\}.

The first two elements of $W_{2}$ are the gradients of $f_{1}$ and $f_{2}$ in $x$ from the first iteration of Algorithm 2, and the last element is the gradient of $f_{2}$ in $x^{\prime} = x+tv_{1} = (0.038, 0.596)^{\top}\cdot 10^{-3} \in B_{\varepsilon}(x)$ from the second iteration. The result is visualized in Figure 4.

Figure 4: (a) The set of nondifferentiable points $\Omega_{f}$ of $f$, the ball $B_{\varepsilon}(x)$ for the $\varepsilon$-subdifferential and the points in which subgradients were computed for Algorithm 2 in Example 3.12. (b) The approximation of $F_{\varepsilon}(x)$ in Algorithm 2.

Building on Algorithm 2, it is now straightforward to construct the descent method for locally Lipschitz continuous MOPs given in Algorithm 3. In step 4, the classical Armijo backtracking line search was used (cf. [13]) for the sake of simplicity. Note that it is well defined due to step 4 in Algorithm 2.

Algorithm 3 Nonsmooth descent method
0:  Initial point $x_{1}\in\mathbb{R}^{n}$, tolerances $\varepsilon,\delta>0$, Armijo parameters $c\in(0,1)$, $t_{0}>0$.
1:  Set $j=1$.
2:  Compute a descent direction $v_{j}$ via Algorithm 2.
3:  If $\|v_{j}\|\leq\delta$ then stop.
4:  Compute
    \bar{s} = \inf\left\{s\in\mathbb{N}\cup\{0\} : f_{i}(x_{j}+2^{-s}t_{0}v_{j}) \leq f_{i}(x_{j}) - 2^{-s}t_{0}c\|v_{j}\|^{2}\ \forall i\in\{1,...,k\}\right\}
    and set $\bar{t}=\max\{2^{-\bar{s}}t_{0},\frac{\varepsilon}{\|v_{j}\|}\}$.
5:  Set $x_{j+1}=x_{j}+\bar{t}v_{j}$, $j=j+1$ and go to step 2.
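
The complete method can then be sketched as follows (our illustration, with the parameter values used in Section 4 as default arguments). The inner loop is the Armijo backtracking of step 4; it terminates since, by Lemma 3.5, the Armijo condition holds at the latest for $t\leq\frac{\varepsilon}{\|v_{j}\|}$.

# Hedged sketch of Algorithm 3, built on descent_direction from above.
def nonsmooth_descent(fs, subgrads, x0, eps=1e-3, delta=1e-3, c=0.25, t0=1.0):
    x = np.asarray(x0, dtype=float)
    while True:
        v = descent_direction(fs, subgrads, x, eps, delta, c)  # step 2
        nv = np.linalg.norm(v)
        if nv <= delta:                         # step 3: (eps, delta)-critical
            return x
        t = t0                                  # step 4: Armijo backtracking
        while not all(f(x + t * v) <= f(x) - t * c * nv**2 for f in fs):
            t *= 0.5
        x = x + max(t, eps / nv) * v            # step 5, with t_bar from step 4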

Since we introduced the two tolerances $\varepsilon$ (for the $\varepsilon$-subdifferential) and $\delta$ (as a threshold for when we consider $\varepsilon$-subgradients to be zero), we can not expect that Algorithm 3 computes points which satisfy the optimality condition (1). This is why we introduce the following definition, similar to the definition of $\varepsilon$-stationarity from [3].

Definition 3.13.

Let $x\in\mathbb{R}^{n}$, $\varepsilon>0$ and $\delta>0$. Then $x$ is called $(\varepsilon,\delta)$-critical if

\min_{v\in -F_{\varepsilon}(x)}\|v\| \leq \delta.

It is easy to see that $(\varepsilon,\delta)$-criticality is a necessary optimality condition for Pareto optimality, but a weaker one than (1). The following theorem shows that convergence in the sense of $(\varepsilon,\delta)$-criticality is what we can expect from our descent method.

Theorem 3.14.

Let $(x_{j})_{j}$ be the sequence generated by Algorithm 3. Then either $(f_{i}(x_{j}))_{j}$ is unbounded below for each $i\in\{1,...,k\}$, or $(x_{j})_{j}$ is finite with the last element being $(\varepsilon,\delta)$-critical.

Proof.

Assume that $(x_{j})_{j}$ is infinite. Then $\|v_{j}\|>\delta$ for all $j\in\mathbb{N}$. By step 4 and Lemma 3.5 we have

f_{i}(x_{j}+\bar{t}v_{j}) - f_{i}(x_{j}) \leq -\bar{t}\|v_{j}\|^{2} \leq -\varepsilon\|v_{j}\| < -\varepsilon\delta < 0

for all $i\in\{1,...,k\}$. This implies that $(f_{i}(x_{j}))_{j}$ is unbounded below for each $i\in\{1,...,k\}$.
Now assume that $(x_{j})_{j}$ is finite, with $\bar{x}$ and $\bar{v}$ being the last elements of $(x_{j})_{j}$ and $(v_{j})_{j}$, respectively. Since the algorithm stopped, we must have $\|\bar{v}\|\leq\delta$. From the application of Algorithm 2 in step 2, we know that there must be some $\overline{W}\subseteq F_{\varepsilon}(\bar{x})$ such that $\bar{v}=\operatorname{arg\,min}_{v\in-\operatorname{conv}(\overline{W})}\|v\|^{2}$. This implies

\min_{v\in -F_{\varepsilon}(\bar{x})}\|v\| \leq \min_{v\in -\operatorname{conv}(\overline{W})}\|v\| = \|\bar{v}\| \leq \delta. \qquad ∎

4 Numerical examples

In this section, we will apply our nonsmooth descent method (Algorithm 3) to several examples. We will begin by discussing its typical behavior before comparing its performance to the multiobjective proximal bundle method [25]. Finally, we will combine our method with the subdivision algorithm from [11] in order to approximate the entire Pareto set of nonsmooth MOPs.

4.1 Typical behavior

In smooth areas, the behavior of Algorithm 3 is almost identical to the behavior of the multiobjective steepest descent method [13]. The only difference stems from the fact that, unlike the Clarke subdifferential, the $\varepsilon$-subdifferential does not reduce to the gradient when $f$ is continuously differentiable. As a result, Algorithm 3 may behave differently in points $x\in\mathbb{R}^{n}$ where

\max\{\|\nabla f_{i}(x)-\nabla f_{i}(y)\| : y\in B_{\varepsilon}(x),\ i\in\{1,...,k\}\}

is large. (If $f$ is twice differentiable, this can obviously be characterized in terms of second-order derivatives.) Nevertheless, if $\varepsilon$ is chosen small enough, this difference can be neglected. Thus, in the following, we will focus on the behavior with respect to the nonsmoothness of $f$.

To show the typical behavior of Algorithm 3, we consider the objective function

f:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}, \quad x\mapsto \begin{pmatrix}\max\{x_{1}^{2}+(x_{2}-1)^{2}+x_{2}-1,\ -x_{1}^{2}-(x_{2}-1)^{2}+x_{2}+1\}\\ -x_{1}+2(x_{1}^{2}+x_{2}^{2}-1)+1.75|x_{1}^{2}+x_{2}^{2}-1|\end{pmatrix} \qquad (16)

from [25] (combining Crescent from [19] and Mifflin 2 from [26]). The set of nondifferentiable points is $\Omega_{f}=S^{1}\cup(S^{1}+(0,1)^{\top})$. We consider the starting points

x^{1}=(0,-0.3)^{\top}, \quad x^{2}=(0.6,1.0)^{\top}, \quad x^{3}=(-1,-0.2)^{\top},

the tolerances $\varepsilon=10^{-3}$, $\delta=10^{-3}$ and the Armijo parameters $c=0.25$, $t_{0}=1$. The results are shown in Figure 5.

Figure 5: Result of Algorithm 3 for three different starting points for the MOP (16). The Pareto set is shown in red, the dashed lines show the set of nondifferentiable points $\Omega_{f}$.

We will briefly go over the result for each starting point:

  • For $x^{1}$, the sequence moves through the smooth area like the steepest descent method until a point is found with a distance less than or equal to $\varepsilon$ to the set of nondifferentiable points $\Omega_{f}$. In that point, more than one $\varepsilon$-subgradient is required to obtain a sufficient approximation of the $\varepsilon$-subdifferentials. Since this part of $\Omega_{f}$ is Pareto optimal, no acceptable descent direction (cf. (8)) is found and the algorithm stops (in an $(\varepsilon,\delta)$-critical point).

  • For $x^{2}$, the sequence starts zig-zagging around the non-optimal part of $\Omega_{f}$, since the points are too far away from $\Omega_{f}$ for the algorithm to notice the nondifferentiability. When a point with distance less than or equal to $\varepsilon$ to $\Omega_{f}$ is found, a better descent direction is found, breaking the zig-zagging motion.

  • For $x^{3}$, the sequence shows a similar zig-zagging motion to the previous case. The difference is that this time, the sequence moves along $\Omega_{f}$ until a Pareto optimal point in $\Omega_{f}$ is found.

As described above, the zig-zagging behavior when starting in $x^{2}$ is caused by the fact that $\varepsilon$ was too small for the method to notice the nondifferentiability. To circumvent problems like this and quickly move through problematic areas, it is possible to apply Algorithm 3 consecutively with decreasing values of $\varepsilon$. The result is Algorithm 4. (A similar idea was implemented in [22].)

Algorithm 4 $\varepsilon$-decreasing nonsmooth descent method
0:  Initial point $x_{1}\in\mathbb{R}^{n}$, tolerances $\delta,\varepsilon_{1},...,\varepsilon_{K}>0$, Armijo parameters $c\in(0,1)$, $t_{0}>0$.
1:  Set $y_{1}=x_{1}$.
2:  for $i=1,...,K$ do
3:     Apply Algorithm 3 with initial point $y_{i}$ and tolerance $\varepsilon=\varepsilon_{i}$. Let $y_{i+1}$ be the final element of the generated sequence.
4:  end for
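
In code, Algorithm 4 is a thin wrapper around the previous sketch (hedged as before; the $\varepsilon$-schedule below mirrors the one used in Section 4.2):

# Hedged sketch of Algorithm 4: run Algorithm 3 with decreasing eps,
# warm-starting each run at the result of the previous one.
def eps_decreasing_descent(fs, subgrads, x0, eps_list=(1e-1, 1e-2, 1e-3),
                           delta=1e-3, c=0.25, t0=1.0):
    y = np.asarray(x0, dtype=float)
    for eps in eps_list:
        y = nonsmooth_descent(fs, subgrads, y, eps, delta, c, t0)
    return y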

4.2 Comparison to the multiobjective proximal bundle method

We will now compare Algorithms 3 and 4 to the multiobjective proximal bundle method (MPB) by Mäkelä, Karmitsa and Wilppu from [25] (see also [23]). As test problems, we consider the 18 MOPs in Table 1, which are created on the basis of the scalar problems from [25]. Problems 1 to 15 are convex (and were also considered in [28]) and problems 16 to 18 are nonconvex. Due to their simplicity, we are able to differentiate all test problems by hand to obtain explicit formulas for the subgradients. For each test problem, we choose 100 starting points on a $10\times 10$ grid in the corresponding area given in Table 1.

Table 1: Test problems (using objectives from [25])
Nr.  $f_{i}$               Area                         Nr.  $f_{i}$                Area
1.   CB3, DEM              $[-3,3]^{2}$                 10.  QL, LQ                 $[-3,3]^{2}$
2.   CB3, QL               $[-3,3]^{2}$                 11.  QL, Mifflin 1          $[-3,3]^{2}$
3.   CB3, LQ               $[0.5,1.5]^{2}$              12.  QL, Wolfe              $[-3,3]^{2}$
4.   CB3, Mifflin 1        $[-3,3]^{2}$                 13.  LQ, Mifflin 1          $[0.5,1.5]\times[-0.5,1]$
5.   CB3, Wolfe            $[-3,3]^{2}$                 14.  LQ, Wolfe              $[-3,3]^{2}$
6.   DEM, QL               $[-3,3]^{2}$                 15.  Mifflin 1, Wolfe       $[-3,3]^{2}$
7.   DEM, LQ               $[-3,3]^{2}$                 16.  Crescent, Mifflin 2    $[-0.5,1.5]^{2}$
8.   DEM, Mifflin 1        $[-3,3]^{2}$                 17.  Mifflin 2, WF          $[-3,3]^{2}$
9.   DEM, Wolfe            $[-3,3]^{2}$                 18.  Mifflin 2, SPIRAL      $[-3,3]^{2}$

For the MPB, we use the Fortran implementation from [23] with the default parameters. For Algorithm 3, we use $\varepsilon=10^{-3}$, $\delta=10^{-3}$, $c=0.25$ and $t_{0}=\max\{\|v_{j}\|^{-1},1\}$ (i.e., the initial step size $t_{0}$ is chosen depending on the norm of the descent direction $v_{j}$ in the current iteration). For Algorithm 4, we additionally use $\varepsilon_{1}=10^{-1}$, $\varepsilon_{2}=10^{-2}$, $\varepsilon_{3}=10^{-3}$. By this choice of parameters, all three methods produce results of similar approximation quality.

To compare the performance of the three methods, we count the number of evaluations of the objectives $f_{i}$, of their subgradients $\xi\in\partial f_{i}$, and the number of iterations (i.e., descent steps) needed. (This means that one call of $f$ will account for $k$ evaluations of objectives.) Since the MPB always evaluates all objectives and subgradients in a point, the values for the objectives and the subgradients are the same here. The results are shown in Table 2 and are discussed in the following.

Table 2: Performance of MPB, Algorithm 3 and Algorithm 4 for the test problems in Table 1 for 100 starting points
        #$f_{i}$                       #$\partial f_{i}$                  # Iter
Nr.     MPB     Alg. 3   Alg. 4        MPB     Alg. 3   Alg. 4            MPB    Alg. 3   Alg. 4
1.     1780 6924 7801     1780 1102 1751     761 492 695
2.     2522 14688 12263     2522 1906 2351     1151 842 914
3.     880 5625 6447     880 921 1534     340 448 662
4.     4416 103826 17664     4416 11774 3415     1832 4644 1242
5.     2956 30457 16877     2956 3479 3037     1377 1616 1161
6.     1640 8357 8684     1640 1209 1802     706 552 736
7.     1702 8736 8483     1702 1307 1832     723 595 739
8.     4226 8283 8620     4226 1318 1914     1204 582 759
9.     1828 8201 8794     1828 1194 1805     793 536 732
10.     1782 6799 7201     1782 1101 1722     684 543 733
11.     4426 52096 17594     4426 6311 3189     1964 2442 1206
12.     2482 15146 12446     2482 1992 2401     1140 967 1010
13.     2662 36570 9513     2662 4958 2247     1221 1692 787
14.     4264 95303 12227     4264 9524 2571     1774 4379 921
15.     3594 85936 15669     3594 9329 3124     1444 3963 1125
16.     2206 20372 11094     2206 2596 2400     884 1194 947
17.     2388 7920 5852     2388 1272 1556     868 626 706
18.     11430 166707 31528     11430 16676 6902     2789 8291 2412
Avg.     3176.9 37885.9 12153.2     3176.9 4331.6 2530.7     1203.1 1911.3 971.5
    100% 1192.5% 382.5%     100% 136.3% 79.7%     100% 158.9% 80.8%
  • Function evaluations: When considering the number of function evaluations, it is clear that the MPB requires far fewer evaluations than both of our algorithms. In our methods, these evaluations are used to check whether a descent direction is acceptable (cf. (8)) and for the computation of the Armijo step length. One reason for the larger total amount is the fact that, unlike the MPB, our methods are autonomous in the sense that they do not reuse information from previous iterations, so some information is potentially gathered multiple times. Additionally, the step length we use is fairly simple, so it might be possible to lower the number of evaluations by using a more sophisticated step length. When comparing our methods to each other, we see that Algorithm 4 is a lot more efficient than Algorithm 3 when the number of evaluations is high and slightly less efficient when the number of evaluations is low. The reason for this is that for simple problems (i.e., where the number of evaluations is low), some of the iterations of Algorithm 4 are redundant, because the $(\varepsilon_{i-1},\delta)$-critical point of the previous iteration is already $(\varepsilon_{i},\delta)$-critical.

  • Subgradient evaluations: For the subgradient evaluations, we see that the MPB is slightly superior to our methods on problems 3, 5 and 16, but inferior on the rest. Regarding the comparison of Algorithms 3 and 4, we observe the same pattern as for the function evaluations: Algorithm 3 is superior for simple and Algorithm 4 for complex problems.

  • Iterations: For the number of iterations, besides problem 5, we see the exact same pattern as for the number of subgradient evaluations. Note that the MPB can perform null steps, which are iterations where only the bundle is enriched, while the current point in the descent sequence stays the same.

For our set of test problems, this leads us to the overall conclusion that in terms of function evaluations, the MPB seems to be superior to our methods, while in terms of subgradient evaluations, our methods seem to be (almost always) more efficient. Furthermore, we remark that the implementation of the MPB is somewhat challenging, whereas our method can be implemented relatively quickly.

4.3 Combination with the subdivision algorithm

Note that so far, we have a method which takes some initial point from $\mathbb{R}^{n}$ and produces a single $(\varepsilon,\delta)$-critical point (close to an actual Pareto optimal point) as a result. But ultimately, we are interested not in one, but in all Pareto optimal points. The intuitive and straightforward approach to extend our method would be to take a large set of well-spread initial points and apply our method to each of them. The problem with this is that we have no guarantee that this results in a good approximation of the Pareto set. To solve this issue, we combine our method with the subdivision algorithm, which was developed for smooth problems in [11]. Since we only have to make minor adjustments for the nonsmooth case, we will only sketch the method here and refer to [11] for the details.

The idea is to interpret the nonsmooth descent method as a discrete dynamical system

x_{j+1} = g(x_{j}), \quad j=0,1,2,..., \quad x_{0}\in\mathbb{R}^{n}, \qquad (17)

where $g:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is the map that applies one iteration of Algorithm 3 to a point in $\mathbb{R}^{n}$. (For the sake of brevity, we have omitted the remaining input of the algorithm here.) Since no information is carried over between iterations of the algorithm, the trajectory (i.e., the sequence) generated by the system (17) is the same as the one generated by Algorithm 3. In particular, this means that the Pareto set of the MOP is contained in the set of fixed points of the system (17). Thus, the subdivision algorithm (which was originally designed to compute attractors of dynamical systems) can be used to compute (a superset of) the Pareto set.

The subdivision algorithm starts with a large hypercube (or box) in $\mathbb{R}^{n}$ that contains the Pareto set and mainly consists of two steps:

  1. Subdivision: Divide each box in the current set of boxes into smaller boxes.

  2. Selection: Compute the image of the union of the current set of boxes under $g$ and remove all boxes that have an empty intersection with this image. Go to step 1.

In practice, we realize step 1 by evenly dividing each box into $2^{n}$ smaller boxes and step 2 by using the image of a set of sample points. The algorithm is visualized in Figure 6.

Figure 6: Subdivision algorithm. (a) Applying $g$ to a set of sample points. (b) Selection step, where boxes with an empty intersection with the image of $g$ are removed.

Unfortunately, the convergence results for the subdivision algorithm only apply if $g$ is a diffeomorphism. If the objective function $f$ is smooth, then the descent direction is at least continuous (cf. [13]) and the resulting dynamical system $g$, while not being a diffeomorphism, still behaves well enough for the subdivision algorithm to work. If $f$ is nonsmooth, then our descent direction is inherently discontinuous close to the nonsmooth points, so the subdivision algorithm applied to (17) will usually fail to work. In practice, we were able to resolve this issue by applying multiple iterations of Algorithm 3 in $g$ at once instead of just one. Roughly speaking, this smooths $g$ by pushing the influence of the discontinuity further away from the Pareto set, and was sufficient for convergence in our tests.
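In code, this modification simply amounts to composing several descent iterations inside $g$, as in the following hedged sketch; `descent_step` is a hypothetical callable standing in for one iteration of Algorithm 3.

```python
def g_smoothed(x, descent_step, m=15):
    # Apply m descent iterations at once instead of a single one; this
    # pushes the influence of the discontinuity away from the Pareto set.
    for _ in range(m):
        x = descent_step(x)
    return x

# With the subdivision sketch above, the experiments below would then
# correspond to (given some implementation of descent_step):
#   boxes = subdivision(lambda x: g_smoothed(x, descent_step, m=15),
#                       lower=[-3.1, -3.1], upper=[3.0, 3.0], iterations=9)
```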

Figures 7 to 9 show the results of the subdivision algorithm for some of the problems from Table 1. For each problem, we used 15 iterations of Algorithm 3 in $g$, $[-3.1,3]^2$ as the starting box, and applied 9 iterations of the subdivision algorithm. For the approximation of the Pareto front (i.e., the image of the Pareto set), we evaluated $f$ at all points of the image of $g$ from the last selection step of the subdivision algorithm. In all of these examples, the algorithm produced a tight approximation of the Pareto set.

Figure 7: (a) Result of the subdivision algorithm applied to problem 6 from Table 1. (b) Corresponding approximation of the Pareto front (red) and a pointwise discretization of the image of $f$ (black).
Figure 8: (a) Result of the subdivision algorithm applied to problem 12 from Table 1. (b) Corresponding approximation of the Pareto front (red) and a pointwise discretization of the image of $f$ (black).
Figure 9: (a) Result of the subdivision algorithm applied to problem 16 from Table 1. (b) Corresponding approximation of the Pareto front (red) and a pointwise discretization of the image of $f$ (black).

5 Conclusion and outlook

In this article, we have developed a new descent method for locally Lipschitz continuous multiobjective optimization problems, based on the efficient approximation of the Clarke subdifferentials of the objective functions from [22]. In [1], it was shown that the element of smallest norm in the negative convex hull of the union of the subdifferentials is a common descent direction for all objectives. In practice, the entire subdifferentials required to compute this direction are rarely known, and only single subgradients can be evaluated. To solve this issue, we presented a method that approximates the subdifferentials well enough to obtain a descent direction: starting from a rough approximation by only a few subgradients, the approximation is systematically enriched with new subgradients until a direction of sufficient descent is found. By combining this descent direction with an Armijo step length, we obtained a descent method for nonsmooth MOPs and showed convergence to points which satisfy a necessary condition for Pareto optimality. We then compared the performance to the multiobjective proximal bundle method from [25]. For the 18 test problems we considered, the MPB was superior in terms of objective function evaluations, but our method required fewer subgradient evaluations and iterations. Finally, we showed that our descent method can be combined with the subdivision algorithm from [11] to compute approximations of entire Pareto sets.

For future work, we believe that it is straightforward to extend our method to constrained MOPs by adding constraints to problem (7) which ensure that the descent direction maintains the feasibility of the descent sequence (similar to [14] for smooth problems). Additionally, in [8], the classical gradient sampling method for scalar nonsmooth optimization was generalized by allowing variable norms in the direction-finding problem, which increased its efficiency. We expect that a similar generalization can be performed for problem (7), potentially yielding a similar gain in efficiency. Additional potential for increased performance lies in more advanced step length schemes as well as in descent directions with memory (for instance, conjugate-gradient-like directions). Furthermore, it might be possible to extend our method to infinite-dimensional nonsmooth MOPs [29, 5]. Finally, in the context of nonsmooth many-objective optimization, we believe that considering subsets of objectives is a promising and efficient approach (cf. [15] for smooth problems), although theoretical advances are required for locally Lipschitz continuous problems.

Acknowledgement

This research was funded by the DFG Priority Programme 1962 “Non-smooth and Complementarity-based Distributed Parameter Systems”.

References

  • [1] H. Attouch, G. Garrigos, and X. Goudou. A dynamic gradient approach to Pareto optimization with nonsmooth convex objective functions. Journal of Mathematical Analysis and Applications, 422(1):741–771, 2015.
  • [2] Y. Bello-Cruz. A Subgradient Method for Vector Optimization Problems. SIAM Journal on Optimization, Nov 2013.
  • [3] J. Burke, A. Lewis, and M. Overton. A Robust Gradient Sampling Algorithm for Nonsmooth, Nonconvex Optimization. SIAM Journal on Optimization, 15:751–779, Jan 2005.
  • [4] W. Cheney and A. A. Goldstein. Proximity maps for convex sets. Proceedings of the American Mathematical Society, 10(3):448–450, 1959.
  • [5] C. Christof and G. Müller. Multiobjective Optimal Control of a Non-Smooth Semilinear Elliptic Partial Differential Equation. European Series in Applied and Industrial Mathematics (ESAIM): Control, Optimisation and Calculus of Variations, 2020.
  • [6] F. Clarke. Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, 1983.
  • [7] J. Cruz Neto, G. Silva, O. Ferreira, and J. Lopes. A subgradient method for multiobjective optimization. Computational Optimization and Applications, 54:461–472, Apr 2013.
  • [8] F. E. Curtis and X. Que. An adaptive gradient sampling algorithm for non-smooth optimization. Optimization Methods and Software, 28(6):1302–1324, 2013.
  • [9] K. Deb. Multi-objective optimization using evolutionary algorithms, volume 16. John Wiley & Sons, 2001.
  • [10] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, Apr 2002.
  • [11] M. Dellnitz, O. Schütze, and T. Hestermeyer. Covering Pareto Sets by Multilevel Subdivision Techniques. Journal of Optimization Theory and Applications, 124(1):113–136, 2005.
  • [12] J. Fliege, L. M. Graña Drummond, and B. F. Svaiter. Newton’s Method for Multiobjective Optimization. SIAM Journal on Optimization, 20, Mar 2008.
  • [13] J. Fliege and B. F. Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 51(3):479–494, Aug 2000.
  • [14] B. Gebken, S. Peitz, and M. Dellnitz. A Descent Method for Equality and Inequality Constrained Multiobjective Optimization Problems. In L. Trujillo, O. Schütze, Y. Maldonado, and P. Valle, editors, Numerical and Evolutionary Optimization – NEO 2017, pages 29–61. Springer, Cham, 2019.
  • [15] B. Gebken, S. Peitz, and M. Dellnitz. On the hierarchical structure of Pareto critical sets. Journal of Global Optimization, 73(4):891–913, 2019.
  • [16] A. Goldstein. Optimization of Lipschitz continuous functions. Mathematical Programming, 13:14–22, Dec 1977.
  • [17] E. S. Helou, S. A. Santos, and L. E. A. Simões. On the differentiability check in gradient sampling methods. Optimization Methods and Software, 31(5):983–1007, 2016.
  • [18] K. C. Kiwiel. A descent method for nonsmooth convex multiobjective minimization. Large scale systems, 8(2):119–129, 1985.
  • [19] K. C. Kiwiel. Methods of Descent for Nondifferentiable Optimization. Springer-Verlag Berlin Heidelberg, 1985.
  • [20] K. C. Kiwiel. Proximity Control in Bundle Methods for Convex Nondifferentiable Minimization. Mathematical Programming, 46:105–122, Jan 1990.
  • [21] K. C. Kiwiel. A Nonderivative Version of the Gradient Sampling Algorithm for Nonsmooth Nonconvex Optimization. SIAM Journal on Optimization, 20(4):1983–1994, 2010.
  • [22] N. Mahdavi-Amiri and R. Yousefpour. An Effective Nonsmooth Optimization Algorithm for Locally Lipschitz Functions. Journal of Optimization Theory and Applications, 155(1):180–195, Oct 2012.
  • [23] M. M. Mäkelä. Multiobjective proximal bundle method for nonconvex nonsmooth optimization: Fortran subroutine mpbngc 2.0. Reports of the Department of Mathematical Information Technology, Series B. Scientific Computing, B, 13:2003, 2003.
  • [24] M. M. Mäkelä, V.-P. Eronen, and N. Karmitsa. On Nonsmooth Multiobjective Optimality Conditions with Generalized Convexities, pages 333–357. Springer New York, New York, NY, 2014.
  • [25] M. M. Mäkelä, N. Karmitsa, and O. Wilppu. Multiobjective proximal bundle method for nonsmooth optimization. TUCS technical report No 1120, Turku Centre for Computer Science, Turku, 2014.
  • [26] M. M. Mäkelä and P. Neittaanmäki. Nonsmooth Optimization: Analysis and Algorithms with Applications to Optimal Control. World Scientific, 1992.
  • [27] K. Miettinen. Nonlinear Multiobjective Optimization. Springer US, 1998.
  • [28] O. Montonen, N. Karmitsa, and M. M. Mäkelä. Multiple subgradient descent bundle method for convex nonsmooth multiobjective optimization. Optimization, 67(1):139–158, 2018.
  • [29] B. Mordukhovich. Multiobjective optimization problems with equilibrium constraints. Mathematical Programming, 117:331–354, Jul 2008.
  • [30] S. Schäffler, R. Schultz, and K. Weinzierl. Stochastic Method for the Solution of Unconstrained Vector Optimization Problems. J. Optim. Theory Appl., 114(1):209–222, 2002.
  • [31] N. Shor. Minimization Methods for Non-Differentiable Functions. Springer-Verlag Berlin Heidelberg, 1985.