Approximation Guarantees for the Non-Dominated Sorting Genetic Algorithm II (NSGA-II)
Abstract
Recent theoretical works have shown that the NSGA-II efficiently computes the full Pareto front when the population size is large enough. In this work, we study how well it approximates the Pareto front when the population size is smaller.
For the OneMinMax benchmark, we point out situations in which the parents and offspring cover the Pareto front well, but the next population has large gaps on the Pareto front. Our mathematical proofs identify as the reason for this undesirable behavior that the NSGA-II, in the selection stage, computes the crowding distance once and then removes the individuals with smallest crowding distance without taking into account that each removal increases the crowding distance of some remaining individuals.
We then analyze two variants not prone to this problem. For the NSGA-II that updates the crowding distance after each removal (Kukkonen and Deb (2006)) and the steady-state NSGA-II (Nebro and Durillo (2009)), we prove that the gaps in the Pareto front are never more than a small constant factor larger than the theoretical minimum. This is the first mathematical work on the approximation ability of the NSGA-II and the first runtime analysis for the steady-state NSGA-II. Experiments also show the superior approximation ability of the two NSGA-II variants.
1 Introduction
While the theory of evolutionary algorithms (EAs), in particular, the mathematical runtime analysis, has made substantial progress in the last 25 years [NW10, AD11, Jan13, ZYQ19, DN20], the rigorous understanding of multi-objective EAs (MOEAs), started 20 years ago [LTZ+02, Gie03, LTZ04], is less developed and lags massively behind their success in practice. However, in recent years significant progress has been made, for example [BQT18, RNNF19, QYT+19, QBF20, BFQY20, ZD23c, Cra21, DDHW23]. In particular, the first analyses of MOEAs that are massively used in practice have now appeared, namely for the MOEA/D [LZZZ16], the SMS-EMOA [BZLQ23], and most notably the NSGA-II [ZLD22] (see [ZD23a] for the journal version), the by far dominant algorithm in practice [ZQL+11].
The analysis of the NSGA-II [ZLD22] proved that several variants of this algorithm can compute the full Pareto front of the OneMinMax and LOTZ benchmarks efficiently when the population size is chosen by a constant factor larger than the size of the Pareto front (which is $n+1$ for these two problems). However, it was also proven (for the OneMinMax problem with problem size $n$) that a population size strictly larger than the size of the Pareto front is necessary: if the two sizes are equal, then with high probability, for an exponential number of iterations the population of the NSGA-II does not cover a constant fraction of the Pareto front. Experiments show that this fraction is roughly 20% for the OneMinMax benchmark and roughly 40% for the LOTZ benchmark. Several runtime analyses of the NSGA-II quickly followed this first work; see the literature review in Section 2.2. However, all these works only discuss the efficiency of covering the full Pareto front.
Since we cannot always assume that the NSGA-II is run with a population size larger than the Pareto front size by a constant factor, both because the algorithm user does not know the size of the Pareto front and because some problems have such a large Pareto front that using a comparably large population size is not possible, a deeper understanding of the approximation performance of the NSGA-II is highly desirable. This is the target of this work.
There is some reason to be optimistic: The experiments conducted in [ZLD22] for the case that the population size equals the size of the Pareto front not only gave the negative result that 20% or 40% of the Pareto front was not covered, but they also showed that the missing points are relatively evenly distributed over the Pareto front: the largest empty interval ever seen in all experiments was small. Hence the population evolved by the NSGA-II in these experiments was a very good approximation of the Pareto front.
Our results: Unfortunately, we observe that these positive findings do not extend to smaller population sizes. However, our negative results let us detect in the selection mechanism of the NSGA-II a reason for the lower-than-expected approximation capability. This suggests a natural modification of the algorithm, for which we prove that it computes populations that approximate the Pareto fronts of the OneMinMax and LOTZ benchmarks very well.
In detail, when we ran experiments for OneMinMax with problem size $n$ and several population sizes well below the size of the Pareto front, empty intervals clearly larger than optimal regularly occurred (see Table 1). Unfortunately, due to the complicated population dynamics of the NSGA-II, we were not able to prove a meaningful lower bound. Nevertheless, the experimental results show that even on a simple benchmark like OneMinMax, the largest empty interval the population leaves on the Pareto front is significantly larger than the optimal value that would result from a perfect distribution of the population on the Pareto front.
To better understand how this discrepancy can arise, we regard two synthetic examples. We show that when the combined parent and offspring population is such that each point on the Pareto front is covered exactly once (this implies that the population size is essentially half the size of the Pareto front), then with high probability the next parent population leaves an interval of considerable length on the Pareto front uncovered (whereas simply removing every second point would give a population such that each point on the Pareto front has a neighbor that is covered by the population). We further construct a more artificial example where the combined parent and offspring population covers the Pareto front apart from isolated points, but the next parent population leaves a long interval of the Pareto front uncovered.
The reason why we were able to construct such examples is the following property of the selection scheme of the NSGA-II. To select the new parent population, the NSGA-II uses as first criterion the non-dominated sorting and then the crowding distance. The crowding distance, however, is not updated during the selection process. That is, while removing individuals with smallest crowding distance, the changing crowding distance of the remaining individuals is not taken into account; instead, the algorithm proceeds with the initial crowding distance values. We assume that this design choice was made for reasons of efficiency: by not updating the crowding distance, it suffices to sort the combined parent and offspring population once and then remove the desired number of individuals. We note that this shortcoming of the traditional selection method of the NSGA-II was, with intuitive arguments, already detected by Kukkonen and Deb [KD06, Figure 2], a paper which unfortunately is not too well known in the community (we overlooked it when preparing the conference version [ZD22] and none of the reviewers detected this oversight; our deepest thanks to Hisao Ishibuchi for pointing us to the work of Kukkonen and Deb).
There is a natural remedy to this shortcoming, proposed also by Kukkonen and Deb [KD06], and this is to sequentially remove individuals always based on the current crowding distance. This procedure can be implemented very efficiently: The removal of one individual changes the crowding distance of only a constant number of other individuals (in a bi-objective problem), so only constantly many crowding distance values need to be updated per removal. There is no need for a new sorting from scratch when we use a priority queue as data structure. With this implementation, the selection based on the current crowding distance takes $O(N \log N)$ operations, which is the same asymptotic complexity as sorting the individuals by their initial crowding distance in the original NSGA-II. We note that both operations are fast compared to the non-dominated sorting step with its quadratic time complexity.
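Concretely, this updated-crowding-distance selection can be realized with a lazy-deletion priority queue over a doubly linked list of the front sorted by the first objective. The following sketch is our own illustration for a single front of a bi-objective maximization problem (function names and details are ours, not the code of [KD06]):

```python
import heapq

def select_with_updated_cd(points, N):
    """Reduce a single front of mutually non-dominated bi-objective points
    (maximization) to N points, always removing a point of currently
    smallest crowding distance. Boundary points have infinite crowding
    distance and are never removed. Stale heap entries are recognized via
    a per-point version counter and skipped (lazy deletion)."""
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    left = {order[i]: (order[i - 1] if i > 0 else None) for i in range(len(order))}
    right = {order[i]: (order[i + 1] if i < len(order) - 1 else None) for i in range(len(order))}
    f1 = [p[0] for p in points]
    f2 = [p[1] for p in points]
    r1 = (max(f1) - min(f1)) or 1  # normalization by the objective ranges
    r2 = (max(f2) - min(f2)) or 1

    def cd(i):  # current crowding distance of point i
        l, r = left[i], right[i]
        if l is None or r is None:
            return float('inf')
        # f1 increases and f2 decreases along the sorted front
        return (f1[r] - f1[l]) / r1 + (f2[l] - f2[r]) / r2

    version = {i: 0 for i in left}
    heap = [(cd(i), version[i], i) for i in left]
    heapq.heapify(heap)
    alive = set(left)
    while len(alive) > N:
        _, v, i = heapq.heappop(heap)
        if i not in alive or v != version[i]:
            continue  # stale entry
        alive.remove(i)
        l, r = left[i], right[i]
        if l is not None:
            right[l] = r
        if r is not None:
            left[r] = l
        for j in (l, r):  # only the two neighbors change their crowding distance
            if j is not None:
                version[j] += 1
                heapq.heappush(heap, (cd(j), version[j], j))
    return [points[i] for i in sorted(alive)]
```

Each removal triggers a constant number of heap operations of logarithmic cost, so the whole selection stays within the sorting complexity discussed above.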
For this modified NSGA-II, the problems shown above for the traditional NSGA-II cannot occur. In the experimental setting described above, the modified algorithm never or only rarely created an empty interval on the Pareto front larger than the optimal value; in particular, the median length of the maximum empty interval (MEI) was optimal or close to optimal for all population sizes regarded. Hence the modified algorithm distributes the population on the Pareto front in a significantly more balanced manner.
For this algorithm, we can also prove a guarantee on the approximation quality. After a time comparable to the time needed to find the two extremal solutions of the front, for all future generations with probability one the largest empty interval on the Pareto front has a length at most a small constant factor larger than the theoretical minimum. Consequently, even with a population size not large enough to cover the full Pareto front, this algorithm computes very good approximations of the Pareto front.
There is a second variant of the NSGA-II proposed in the past that, by definition, is not prone to the problem of working with initial crowding distance values, namely the steady-state NSGA-II [DNLA09], which generates a single offspring per iteration. There are no theoretical results on this algorithm yet, but the empirical results in [DNLA09] showed its very competitive approximation strength. For this reason, we also conduct a mathematical runtime analysis of this algorithm on the OneMinMax benchmark. We prove the same good approximation guarantees as for the NSGA-II with current crowding distance. Our experiments in Section 7 verify this similarity and also indicate a more steady behavior of the steady-state NSGA-II.
This work extends our conference paper [ZD22] in the following major ways. This version obtains tighter estimates of the relation between the $\varepsilon$-dominance and the maximal empty interval length. We further discuss the relation between the hypervolume indicator and the maximum empty interval length, which was not contained in the conference version. This version also improves the approximation guarantee for the NSGA-II with current crowding distance. We further added a new section discussing the approximation guarantee of the steady-state NSGA-II and the corresponding experiments. Besides, this version contains all mathematical proofs that were omitted in the conference version for reasons of space.
This work is organized as follows. Section 2 briefly introduces bi-objective optimization and the NSGA-II. Section 3 discusses the approximation measures that will be used in this work. The approximation difficulties of the traditional NSGA-II are theoretically shown via two synthetic examples in Section 4. Section 5 introduces our modified variant of the NSGA-II and conducts the theoretical analysis of its approximation ability. Section 6 analyzes the steady-state NSGA-II. Our experiments are discussed in Section 7. Section 8 concludes this work.
2 Preliminaries
2.1 Bi-objective Optimization and the OneMinMax Benchmark
In this paper, we regard bi-objective optimization problems $f = (f_1, f_2)$, with each objective to be maximized. For $u, v \in \mathbb{R}^2$, we say that $u$ strictly dominates $v$, denoted by $u \succ v$, if $u_1 \ge v_1$ and $u_2 \ge v_2$ and at least one of the inequalities is strict. If $f(x)$ cannot be strictly dominated by $f(y)$ for any solution $y$ in the search space, we say that $x$ is Pareto optimal and that $f(x)$ is a Pareto front point. The set of all Pareto front points is called the Pareto front. The typical aim for a multi-objective optimizer is to compute the Pareto front, that is, to compute a set $S$ of solutions such that $f(S)$ is the Pareto front, or to approximate it well.
We shall work with the popular bi-objective benchmark OneMinMax. OneMinMax was first proposed in [GL10] and is similar to the COCZ benchmark defined in [LTZ04]. The first objective of OneMinMax counts the number of zeros in the bit-string, and the second objective counts the number of ones. More specifically, for any $x = (x_1, \dots, x_n) \in \{0,1\}^n$, the OneMinMax function is defined by
$$\mathrm{OneMinMax}(x) = \left(n - \sum_{i=1}^{n} x_i,\; \sum_{i=1}^{n} x_i\right).$$
It is not difficult to see that any solution is Pareto optimal and that the Pareto front is $F^* = \{(i, n-i) \mid i \in [0..n]\}$.
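For concreteness, a minimal implementation of this benchmark:

```python
def one_min_max(x):
    """OneMinMax: (number of zeros, number of ones); both objectives are
    maximized. Since the two objective values always sum to n, every
    search point is Pareto optimal."""
    ones = sum(x)
    return (len(x) - ones, ones)
```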
2.2 The Non-Dominated Sorting Genetic Algorithm II (NSGA-II)
We now give a brief introduction to the NSGA-II, which was first proposed in [DPAM02] and now is the by far dominant MOEA in practice [ZQL+11]. It works with a fixed population size $N$. Consequently, each time $N$ new individuals are generated, the NSGA-II needs to remove $N$ individuals to maintain this fixed population size. To this aim, the NSGA-II computes a complete order on the combined parent and offspring population that uses the dominance as the first criterion and a diversity measure (the crowding distance) as the second criterion, and removes the $N$ worst individuals.
In more detail, after the random initialization, in each generation $t$, an offspring population $Q_t$ of size $N$ is generated from the parent population $P_t$. The NSGA-II now needs to remove $N$ individuals from the combined population $R_t = P_t \cup Q_t$. To this aim, it divides $R_t$ into fronts $F_1, F_2, \dots$, where $F_1$ is the set of the non-dominated solutions in $R_t$, and $F_i$ is the set of the non-dominated solutions in $R_t \setminus (F_1 \cup \dots \cup F_{i-1})$. For the first index $i^*$ such that the size of $F_1 \cup \dots \cup F_{i^*}$ is at least $N$, the NSGA-II calculates the crowding distance of the individuals in $F_{i^*}$ as follows. (If the crowding distance is also used for the parent selection, like in tournament selection, then now (Algorithm 2, step 6) also the crowding distance of the individuals in the earlier fronts is computed.) For each objective, the individuals are sorted according to their objective values. The crowding distance value of the first and last individual in the sorted list is infinite. For each other individual, the crowding distance contribution with respect to the current objective is the normalized distance of the objective values of its two neighbors in the list. The complete crowding distance of an individual is the sum of its contributions over all objectives. See Algorithm 1 for a complete description of the computation of the crowding distance. Then the individuals in $F_{i^*}$ with smallest crowding distance value in excess of the target size (ties broken randomly), as well as all individuals in $F_{i^*+1}, F_{i^*+2}, \dots$, are removed. The complete NSGA-II framework is shown in Algorithm 2.
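The crowding distance computation just described can be sketched as follows; this is our own illustrative re-implementation for a list of objective vectors, not the original Algorithm 1 code:

```python
def crowding_distance(front):
    """Crowding distance of each point in a list of objective vectors,
    following the description above: per objective, sort the points,
    give the two extreme points infinite distance, and add to every
    other point the normalized distance of its two neighbors."""
    n = len(front)
    m = len(front[0])
    cd = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        cd[order[0]] = cd[order[-1]] = float('inf')
        denom = front[order[-1]][k] - front[order[0]][k]
        if denom == 0:
            continue  # all values equal in this objective
        for j in range(1, n - 1):
            if cd[order[j]] != float('inf'):
                cd[order[j]] += (front[order[j + 1]][k] - front[order[j - 1]][k]) / denom
    return cd
```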
We note that, for reasons of generality, in Algorithm 1 we make no assumption on how individuals with equal objective values are sorted. As pointed out in [BQ22], for bi-objective problems the sorting w.r.t. the second objective can be taken as the inverse of the first sorting. This has mild algorithmic advantages (but note that the quadratic time complexity of the non-dominated sorting procedure dominates the complexity of the selection stage) and can lower the minimum required population size to compute the whole Pareto front by a factor of two. It is clear that this idea can only make a difference when two individuals with identical value in one objective are present, as only then is the sorting not unique. In our setting with the population size smaller than the size of the Pareto front, we do not expect this to happen too often, so we do not expect significantly better results under this assumption.
Since it is not the most central aspect of our work, we have not yet discussed how the offspring population is computed. As in [ZLD22], we shall regard a mutation-only version of the NSGA-II. For a problem like OneMinMax, composed of two very simple unimodal objectives, we do not expect that the use of crossover gives significant advantages. Also, we are convinced that our proofs can easily be extended to variants that use crossover with some constant probability (less than one); we note in particular that the central result of this work regarding the selection of the next population (Lemma 13) does not rely on any assumption about how the offspring are created. As mutation operators, we shall regard the two classic ones of one-bit mutation, which flips a single random bit of the argument, and standard bit-wise mutation, which flips each bit independently with probability $1/n$. We consider three ways to select the parents for the mutation (mating selection). In fair selection, we let each of the $N$ parents generate exactly one offspring. In random selection, we $N$ times independently and uniformly at random (hence "with replacement") select a parent and let it create an offspring. In binary tournament selection, we $N$ times independently and uniformly at random pick two parents and let the better one (according to non-dominated sorting and crowding distance, ties broken randomly) create an offspring.
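The two mutation operators can be sketched as follows (our own illustrative code, not taken from the cited works):

```python
import random

def one_bit_mutation(x):
    """Flip one uniformly chosen bit of the bit-string x."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] = 1 - y[i]
    return y

def standard_bit_mutation(x):
    """Flip each bit independently with probability 1/n."""
    n = len(x)
    return [b ^ (random.random() < 1.0 / n) for b in x]
```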
We briefly review the state of the art in runtime analysis for the NSGA-II. (As usual, we call a mathematical runtime analysis a theoretical work estimating the number of iterations or fitness evaluations it takes to reach a certain goal. This is different from the implementational complexity of the operations in each iteration, as discussed, e.g., already in the original NSGA-II paper [DPAM02].) In the first mathematical runtime analysis of the NSGA-II [ZLD22], it was shown that the NSGA-II with several mating selection and mutation strategies and a sufficiently large population size efficiently computes the Pareto fronts of the OneMinMax and LOTZ benchmarks, namely in expected $O(Nn\log n)$ and $O(Nn^2)$ fitness evaluations, respectively. For $N = \Theta(n)$, these are the same asymptotic complexities as those known for the basic global simple evolutionary multi-objective optimizer (GSEMO) [Gie03], namely $O(n^2 \log n)$ for OneMinMax and $O(n^3)$ for LOTZ. However, the work [ZLD22] also showed that with a population size of $n+1$, that is, equal to the size of the Pareto front, it takes at least an exponential time to compute a population covering the Pareto front better than with a constant-factor loss.
This work has quickly led to several follow-up works beyond the conference version [ZD22] of this work. In [BQ22], it was observed that using the same sorting for the two objectives allows one to lower the minimum required population size to twice the size of the Pareto front (this is proven for the LOTZ benchmark, but the argument can easily be extended to OneMinMax). Also, in this work, for the first time an NSGA-II with crossover is regarded and runtime guarantees analogous to those in [ZLD22] are proven. The most profound result of this work is that significant runtime improvements when optimizing LOTZ can be obtained from a newly proposed tournament selection operator, which chooses the tournament size uniformly at random from a suitable range. This is particularly interesting in that here a uniform random choice was successful, whereas most previous works using random parameter choices employ heavy-tailed distributions, see, e.g., [DLMN17, ABD21, ZD23c, DELQ22]. The first mathematical runtime analysis on a multimodal problem, the OneJumpZeroJump benchmark from [ZD23c], was conducted in [DQ23a]. It shows that when the population size is at least a constant factor larger than the Pareto front size, then the NSGA-II is as effective as the GSEMO on this benchmark. The first lower bounds for the NSGA-II [DQ23b] show that the previous results for OneMinMax [ZLD22] and OneJumpZeroJump [DQ23a] are asymptotically tight, even for larger population sizes. This implies that increasing the population size above the minimum required size immediately leads to asymptotic performance losses. This is very different from single-objective optimization, where often a larger range of population sizes gives the same asymptotic performance; see, e.g., [JJW05, DK15] for such results for the OneMax benchmark. Two results showing that crossover can give asymptotic performance gains appeared in parallel [DOSS23b, DQ23c].
The first runtime analysis for a combinatorial optimization problem, the bi-objective minimum spanning tree problem previously analyzed for the GSEMO [Neu07], appeared in [CDH+23]. A runtime analysis in the presence of noise (together with the independent parallel work [DDHW23], the first mathematical runtime analysis of a MOEA in a noisy environment) was conducted in [DOSS23a]. That the NSGA-II can have difficulties with more than two objectives was shown for OneMinMax in [ZD23b], whereas the NSGA-III was proven to be efficient on the 3-objective OneMinMax problem in [WD23].
As pointed out in Section 1, all these theoretical works consider the time taken to cover the full Pareto front, so no other work on approximating the Pareto front exists.
3 Approximation Measures
For reasons of efficiency, instead of computing the whole Pareto front, one often resorts to approximating it, that is, to computing a set of solutions that is a reasonable representation of the whole front. This raises the question of how to measure the approximation quality of such a set of solutions. Several different approximation measures have been proposed in the literature, such as multiplicative $\varepsilon$-dominance [LTDZ02], various kinds of generational distances [VL98, BT03, CCRS04], or the hypervolume [ZT98]. For the OneMinMax problem regarded in this work, the particular structure of the problem suggests using a more elementary measure, namely the maximum length of an empty interval on the Pareto front. We define this measure in Section 3.1. In Sections 3.2 and 3.3, we then compare the new measure with two classic measures, namely $\varepsilon$-dominance, which has frequently been regarded in theoretical works, and the hypervolume, which is the most widely used measure in general [SIHP21]. We are optimistic that our new measure can equally easily be compared to other measures.
3.1 Maximal Empty Interval Size
For OneMinMax with problem size $n$, which this paper will analyze, each possible objective value lies on the Pareto front, and the first objective values of the full Pareto front are exactly $0, 1, \dots, n$. Hence any missing Pareto front point can be seen directly from the first objective values, and we simply measure the size of the maximal empty interval, denoted by MEI, that the first objective values of the solutions reached by the MOEA leave in $[0..n]$. The smaller the maximal empty interval size, the better the MOEA approximates the Pareto front. The formal definition of the MEI of a set in the objective space is as follows.
Definition 1.
Let $S$ be a subset of the Pareto front of OneMinMax. Let $v_1 < v_2 < \dots < v_m$ be the sorted list of the first objective values of $S$. We define the maximal empty interval size of $S$, denoted by $\mathrm{MEI}(S)$, as
$$\mathrm{MEI}(S) = \max\{v_{i+1} - v_i \mid i \in [1..m-1]\}.$$
For $N \ge 2$, we further define
$$\mathrm{MEI}_{\mathrm{opt}} := \min\{\mathrm{MEI}(S) \mid S \subseteq F^*,\; |S| \le N,\; (0,n) \in S,\; (n,0) \in S\}.$$
Obviously, this is the smallest MEI that an MOEA with a fixed population size $N$ can obtain when the extremal points $(0,n)$ and $(n,0)$ are covered.
It is not difficult to see that $\mathrm{MEI}_{\mathrm{opt}}$ is witnessed by a set as evenly distributed as possible. We explicitly formulate this observation in the following lemma.
Lemma 2.
For all $N \ge 2$, we have $\mathrm{MEI}_{\mathrm{opt}} = \lceil n/(N-1) \rceil$.
Proof.
Consider any $S \subseteq F^*$ with $|S| \le N$ and $(0,n), (n,0) \in S$. Let $V$ be the set of the first objective values of $S$, and let $V = \{v_1, \dots, v_m\}$ with $v_i < v_{i+1}$ for $i \in [1..m-1]$. Then we have $v_1 = 0$ and $v_m = n$, and thus
$$\sum_{i=1}^{m-1} (v_{i+1} - v_i) = n.$$
Hence, there exists $i \in [1..m-1]$ such that $v_{i+1} - v_i \ge n/(m-1)$. Since $v_i$ and $v_{i+1}$ are integers, we then have $v_{i+1} - v_i \ge \lceil n/(m-1) \rceil$. As $\mathrm{MEI}(S) \ge v_{i+1} - v_i$, we know that $\mathrm{MEI}(S) \ge \lceil n/(m-1) \rceil \ge \lceil n/(N-1) \rceil$, where the last inequality uses $m \le N$. Hence, $\mathrm{MEI}_{\mathrm{opt}} \ge \lceil n/(N-1) \rceil$.
Now we prove that $\mathrm{MEI}_{\mathrm{opt}} \le \lceil n/(N-1) \rceil$. Let $v_i = \min\{(i-1)\lceil n/(N-1)\rceil, n\}$ for $i \in [1..N]$. Then $v_1 = 0$ and $v_N = n$. Consider the set $S = \{(v_i, n - v_i) \mid i \in [1..N]\}$. It is not difficult to see that
$$v_{i+1} - v_i \le \lceil n/(N-1) \rceil$$
for all $i \in [1..N-1]$. Hence $\mathrm{MEI}(S) \le \lceil n/(N-1) \rceil$. By the definition of $\mathrm{MEI}_{\mathrm{opt}}$, we know that $\mathrm{MEI}_{\mathrm{opt}} \le \lceil n/(N-1) \rceil$. ∎
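Both the MEI and the even-spacing construction are straightforward to compute; the following sketch is our own illustration, where the closed form in `mei_opt` is the value that the construction in the proof above yields:

```python
import math

def mei(values):
    """Maximal empty interval size of a set of first objective values."""
    v = sorted(values)
    return max(b - a for a, b in zip(v, v[1:]))

def mei_opt(n, N):
    """Optimal MEI for at most N points covering 0 and n: ceil(n / (N - 1)),
    attained by the evenly spaced construction v_i = min((i-1)*ceil(n/(N-1)), n)."""
    return math.ceil(n / (N - 1))
```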
3.2 $\varepsilon$-Dominance
This subsection discusses the optimal approximation quality w.r.t. the classic $\varepsilon$-dominance measure for OneMinMax and compares this measure with the MEI.
3.2.1 Background and Definition
One way to measure the quality of solution sets is via $\varepsilon$-dominance, which is a relaxed notion of dominance first defined in [LTDZ02]. A set $S$ of objective vectors is called an $\varepsilon$-approximation of the Pareto front if each point of the Pareto front is $\varepsilon$-dominated by a point from $S$.
In this subsection, we only discuss multiplicative $\varepsilon$-dominance and simply call it $\varepsilon$-dominance, as it is the main variant discussed in [LTDZ02] and the variant most used in follow-up works. Here is the formal definition.
Definition 3 ((Multiplicative) $\varepsilon$-dominance [LTDZ02]).
Let $\varepsilon > 0$ and let $m$ be the number of objectives. For $u, v \in \mathbb{R}_{\ge 0}^m$, we say $u$ $\varepsilon$-dominates $v$, denoted by $u \succeq_\varepsilon v$, if and only if $(1+\varepsilon) u \succeq v$, that is, $(1+\varepsilon) u_i \ge v_i$ for all $i \in [1..m]$.
Let $O$ be the whole objective vector set for a given problem. We say a subset $S \subseteq O$ is an $\varepsilon$-approximation for this problem if and only if for each $v \in O$, there exists $u \in S$ such that $u \succeq_\varepsilon v$.
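Definition 3 translates directly into code; a minimal sketch for maximization with non-negative objective values:

```python
def eps_dominates(u, v, eps):
    """Multiplicative eps-dominance for maximization: (1+eps)*u_i >= v_i for all i."""
    return all((1 + eps) * ui >= vi for ui, vi in zip(u, v))

def is_eps_approximation(S, front, eps):
    """S is an eps-approximation if every front point is eps-dominated
    by some point of S."""
    return all(any(eps_dominates(u, v, eps) for u in S) for v in front)
```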
Here we briefly review the theoretical works utilizing the $\varepsilon$-dominance relation as a measure to evaluate the approximation performance of MOEAs or as the basis to design MOEAs with a more diversity-aware survival selection. Horoba and Neumann [HN08] defined the LargeFrontε function and proved that to obtain an $\varepsilon$-approximation, the GSEMO needs an exponential runtime with high probability, while the GSEMO with a diversity strategy based on $\varepsilon$-dominance only requires an expected polynomial runtime. A similar work with respect to the additive $\varepsilon$-dominance relation was conducted in [HN09]. To reach an $\varepsilon$-approximation for LargeFrontε, Brockhoff, Friedrich, and Neumann [BFN08] proved a polynomial runtime bound for the $(\mu+1)$-SIBEA.
For the GSEMO with the $\varepsilon$-dominance diversity strategy, Neumann and Reichel [NR08] proved that it can achieve good approximations for the minimum cut problem in expected polynomial time without restrictions on the graph weights, while the original GSEMO requires bounded weights to ensure an efficient approximation. The efficiency of the GSEMO with the $\varepsilon$-dominance diversity strategy can also be witnessed in [NRS11, PSN19].
Gutjahr [Gut12] replaced the survival selection of the SEMO based on the common dominance relation by additive $\varepsilon$-dominance, inserted the modified SEMO into the adaptive Pareto sampling (APS) framework, and proved that for a stochastic multi-objective combinatorial optimization problem, the expected runtime of such an APS can be bounded from above, with the bound depending on the expected runtime of the original SEMO on the deterministic counterpart.
3.2.2 Relation of Maximal Empty Intervals and $\varepsilon$-Approximations
The two approximation measures, MEI and $\varepsilon$-approximation, are almost equivalent. In this subsection, we show that the smallest $\varepsilon$ rendering a population of size $N$ (covering the two extremal points) an $\varepsilon$-approximation satisfies $(\mathrm{MEI}_{\mathrm{opt}} - 1)/(n - \mathrm{MEI}_{\mathrm{opt}}) \le \varepsilon_{\mathrm{opt}} \le \mathrm{MEI}_{\mathrm{opt}}/(n - \mathrm{MEI}_{\mathrm{opt}})$.
We first prove the following relation between the MEI and $\varepsilon$-dominance.
Lemma 4.
Let $S$ be a subset of the Pareto front of OneMinMax with $(0,n), (n,0) \in S$. Let $\varepsilon \ge \mathrm{MEI}(S)/(n - \mathrm{MEI}(S))$. Then $S$ is an $\varepsilon$-approximation for the Pareto front (and thus also for the whole objective space of OneMinMax).
Proof.
Let $V$ be the set of the first objective values of $S$. More specifically, let $V = \{v_1, \dots, v_m\}$ with $v_i < v_{i+1}$ for $i \in [1..m-1]$. From the assumption, we know $v_1 = 0$ and $v_m = n$. Consider any $v \in [0..n]$ and let $i$ be such that $v_i \le v \le v_{i+1}$, and write $d = v_{i+1} - v_i \le \mathrm{MEI}(S)$. For any $v$ with $v \le (1+\varepsilon)v_i$, the point $(v_i, n - v_i)$ $\varepsilon$-dominates $(v, n-v)$; for any $v$ with $n - v \le (1+\varepsilon)(n - v_{i+1})$, the point $(v_{i+1}, n - v_{i+1})$ does. These two ranges together cover $[v_i, v_{i+1}]$, since
$$\varepsilon v_i + \varepsilon(n - v_{i+1}) = \varepsilon(n - d) \ge \frac{d}{n-d}\,(n - d) = d,$$
where the inequality uses $\varepsilon \ge \mathrm{MEI}(S)/(n - \mathrm{MEI}(S)) \ge d/(n-d)$, the latter since $x \mapsto x/(n-x)$ is increasing. Hence we know that $(1+\varepsilon)v_i \ge v$ or $(1+\varepsilon)(n - v_{i+1}) \ge n - v$. Thus $(v, n-v)$ is $\varepsilon$-dominated by one of $(v_i, n - v_i)$ and $(v_{i+1}, n - v_{i+1})$. ∎
Similar to $\mathrm{MEI}_{\mathrm{opt}}$, we define an optimal value for $\varepsilon$. From Definition 3, it is not difficult to see that the smaller $\varepsilon$ is, the better $S$ approximates the set $O$. Let $\mathcal{A}_\varepsilon$ be the collection of all $S \subseteq F^*$ that are an $\varepsilon$-approximation of $F^*$. Then, for $N \ge 2$, let
$$\varepsilon_{\mathrm{opt}} := \min\{\varepsilon \mid \exists S \in \mathcal{A}_\varepsilon :\; |S| \le N,\; (0,n), (n,0) \in S\}.$$
Obviously, this is the smallest $\varepsilon$ such that a population of size $N$ covering the two extremal points can be an $\varepsilon$-approximation of $F^*$.
From Lemma 4, we easily obtain the following upper bound on the optimal $\varepsilon$.
Corollary 5.
For all $N \ge 2$, we have $\varepsilon_{\mathrm{opt}} \le \mathrm{MEI}_{\mathrm{opt}}/(n - \mathrm{MEI}_{\mathrm{opt}})$.
Now we establish a bound on the MEI of an $\varepsilon$-approximation, and from it a lower bound on the optimal $\varepsilon$ as its corollary.
Lemma 6.
Let $S$ with $(0,n), (n,0) \in S$ be an $\varepsilon$-approximation for $F^*$. Then $\mathrm{MEI}(S) \le (\varepsilon n + 1)/(1 + \varepsilon)$.
Proof.
Let $V = \{v_1, \dots, v_m\}$ with $v_i < v_{i+1}$ be the set of the first objective values of $S$, and consider any two neighboring values $v_i$ and $v_{i+1}$. Let $v \in [v_i..v_{i+1}]$. Since $S$ is an $\varepsilon$-approximation, $(v, n-v)$ is $\varepsilon$-dominated by some $(w, n-w)$ with $w \in V$. If $w \le v_i$, then $(1+\varepsilon)v_i \ge (1+\varepsilon)w \ge v$, and together with $n - v_i \ge n - v$, also $(v_i, n - v_i)$ $\varepsilon$-dominates $(v, n-v)$. Analogously, if $w \ge v_{i+1}$, then $(v_{i+1}, n - v_{i+1})$ $\varepsilon$-dominates $(v, n-v)$. Hence one of $(v_i, n - v_i)$ and $(v_{i+1}, n - v_{i+1})$ $\varepsilon$-dominates $(v, n-v)$.
If $(1+\varepsilon)v_i \ge v$, then $v \le v_i + \varepsilon v_i$. If $(1+\varepsilon)(n - v_{i+1}) \ge n - v$, then $v \ge v_{i+1} - \varepsilon(n - v_{i+1})$. Hence every integer $v \in [v_i..v_{i+1}]$ satisfies
$$v \le (1+\varepsilon)v_i \quad\text{or}\quad v \ge v_{i+1} - \varepsilon(n - v_{i+1}).$$   (1)
The open interval $\big((1+\varepsilon)v_i,\; v_{i+1} - \varepsilon(n - v_{i+1})\big)$ is contained in $[v_i, v_{i+1}]$ and, by (1), contains no integer; hence its length is at most $1$. Thus $v_{i+1} - v_i \le 1 + \varepsilon v_i + \varepsilon(n - v_{i+1}) = 1 + \varepsilon(n - (v_{i+1} - v_i))$, and solving for $v_{i+1} - v_i$ gives
$$v_{i+1} - v_i \le \frac{\varepsilon n + 1}{1 + \varepsilon}.$$   (2)
Since this holds for all $i \in [1..m-1]$, we obtain $\mathrm{MEI}(S) \le (\varepsilon n + 1)/(1 + \varepsilon)$. ∎
Corollary 7.
For all $N \ge 2$, we have $\varepsilon_{\mathrm{opt}} \ge (\mathrm{MEI}_{\mathrm{opt}} - 1)/(n - \mathrm{MEI}_{\mathrm{opt}})$.
3.3 Hypervolume
In this subsection, we discuss the classic hypervolume indicator, the most common measure for how well a set of solutions approximates the Pareto front, and its relation to the MEI.
3.3.1 Background and Definition
The hypervolume indicator, denoted by HV in the remainder, was first proposed by Zitzler and Thiele [ZT98]. It is the most intensively used measure of approximation quality in evolutionary multi-objective optimization [SIHP21]. Here "hypervolume" refers to the size of the space covered by the solution set (with respect to a reference point). Different from the MEI, $\varepsilon$-dominance, and generational distance measures [VL98, BT03, CCRS04], the HV does not require prior knowledge or an estimate of the Pareto front, which is a huge advantage in practical applications.
Definition 8 (Hypervolume (HV)).
Let $m$ be the number of objectives, and let $O$ be the whole objective vector set for a given problem. Let $r \in \mathbb{R}^m$ be such that $r_k \le u_k$ for all $u \in O$ and $k \in [1..m]$. Let $\lambda$ denote the Lebesgue measure on $\mathbb{R}^m$. Then the hypervolume of a subset $S \subseteq O$ is defined as
$$HV_r(S) = \lambda\Big(\bigcup_{u \in S} [r, u]\Big),$$
where $[r, u]$ is the hyper-rectangle defined by the corners $r$ and $u$.
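For the bi-objective maximization setting of this work, the Lebesgue measure of this union of rectangles reduces to a sum of vertical strips after sorting by the first objective. A small sketch (our own illustration, assuming the points are mutually non-dominated and all dominate the reference point):

```python
def hypervolume_2d(S, r):
    """Hypervolume of mutually non-dominated bi-objective points S
    (maximization) w.r.t. reference point r. Sorted by the first
    objective, each point adds the strip between its predecessor's
    first coordinate and its own."""
    pts = sorted(S)  # first objective ascending; second is then descending
    hv, prev_x = 0.0, r[0]
    for x, y in pts:
        hv += (x - prev_x) * (y - r[1])
        prev_x = x
    return hv
```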
In addition to its wide usage in practice, there are also some theoretical works discussing the HV. Brockhoff, Friedrich, and Neumann [BFN08] proposed the $(\mu+1)$-simple indicator-based evolutionary algorithm, the $(\mu+1)$-SIBEA, and proved a polynomial expected runtime for reaching an $\varepsilon$-approximation of the Pareto front of LargeFrontε. In contrast, the GSEMO with high probability needs at least exponential time to achieve such an $\varepsilon$-approximation [HN08]. Besides this result, [BFN08] also proved that the $(\mu+1)$-SIBEA with a sufficiently large $\mu$ computes the full Pareto front of the LOTZ function in a polynomial expected runtime.
Nguyen, Sutton, and Neumann [NSN15] analyzed the expected runtime with which the $(\mu+1)$-SIBEA computes the full Pareto front (and hence reaches the maximum HV value possible with this population size) for the OneMinMax problem when the population size is at least the size of the Pareto front. They also proved that, starting from some initial population and using a smaller population size, the $(\mu+1)$-SIBEA with high probability takes at least exponential time to reach the maximum HV value (attainable with this population size). For the LOTZ problem, they showed that with a sufficiently large population size, the algorithm can reach the maximum HV value (and thus compute the full Pareto front) in polynomial expected time, and they bounded the expected time to reach the maximum hypervolume (and thus approximate the Pareto front best in this measure) for smaller population sizes.
While the work just discussed has shown that the $(\mu+1)$-SIBEA with small population sizes has difficulties to compute the optimal approximation of the Pareto front of OneMinMax, this algorithm can easily compute good approximations. Doerr, Gao, and Neumann [DGN16] proved that the $(\mu+1)$-SIBEA in polynomial expected time reaches a population whose HV is below the value of the full Pareto front by only a small additive term.
We note that in addition to the above runtime analyses, other theoretical works on structural aspects of the HV exist, e.g., [ABB10, ABBZ12, BF13, FNT15]. We will not discuss them here in full detail as our main focus in this work is the analysis of runtimes of evolutionary algorithms.
However, we note that for discrete problems, the optimal HV value for a population smaller than the finite Pareto front is not well understood. Emmerich, Deutz, and Beume [EDB07, Section 3] proved that the HV value for approximating a continuous Pareto front with a fixed number of points plus the two extremal points is maximized if the points are evenly spaced between the two extremal points. Beume et al. [BFLI+09, Lemma 1] proved this result for a linear front of a particular slope, and Auger et al. [ABBZ12] further extended it to general linear fronts. They also considered the case that the reference point is dominated by at least one Pareto optimal point, but not necessarily all. Nguyen, Sutton, and Neumann [NSN15] discussed two special cases, namely for OneMinMax and for LOTZ, both implicitly assuming a certain integrality of the point spacing. Doerr, Gao, and Neumann [DGN16] compared the HV of the computed approximations with the HV of the full Pareto front and thus did not require an analysis of the best HV achievable with fixed small population sizes. In summary, for the discrete OneMinMax problem, the optimal HV for a given number of points is not well analyzed. Hence, we will also bound the optimal HV value when we discuss the relation of the MEI and the HV in Section 3.3.2.
3.3.2 Relation of Maximal Empty Intervals and Hypervolumes
We now analyze the relation between the MEI and the . As we shall see, the fact that the hypervolume depends on the sum of the squares of the lengths of the empty intervals disallows a simple relation between the two measures. However, we shall also see that the optimal approximations for the two measures are very similar, in particular, an optimal approximation with respect to the has the minimum possible empty interval size.
We first give an exact formula for the hypervolume. Let with be the reference point so that it is dominated by any w.r.t. OneMinMax. Let be a subset of the Pareto front of OneMinMax with , and be the set of the first objective values of . More specifically, let with for . Then we know and . Since each Pareto front point satisfies , we have
(3)
where we note that is the HV of the continuous front . We point out that the -part was wrongly computed as in [DGN16, Lemma 2]. This renders the calculations of the expressions and there incorrect, but this does not influence the correctness of Lemma 2 in [DGN16] as the extra term cancels out in .
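The two-dimensional hypervolume underlying these formulas can be computed by a standard sweep over the points sorted by the first objective. The following Python sketch assumes maximization of both objectives and a reference point dominated by all points, as for OneMinMax; the function name and the concrete reference point in the example are ours and do not reproduce the elided constants of the text.

```python
def hypervolume_2d(points, ref):
    """Hypervolume of a set of 2-D points w.r.t. a reference point 'ref',
    for maximization: the area dominated by the points and dominating ref.
    Sweep in order of decreasing first objective, adding for each point the
    rectangular slab it contributes beyond the best second-objective value
    seen so far.  Dominated points contribute nothing."""
    hv, best_y = 0, ref[1]
    for x, y in sorted(set(points), reverse=True):
        if x > ref[0] and y > best_y:
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

# the full Pareto front of OneMinMax for n = 3, reference point (-1, -1)
front = [(0, 3), (1, 2), (2, 1), (3, 0)]
print(hypervolume_2d(front, (-1, -1)))  # -> 10
```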
Now we estimate .
Lemma 9.
Let be a subset of the Pareto front of OneMinMax with , and let with be the reference point. Let . Then
Proof.
We use the same as defined above. Then we have . By the definition of , we have
(4)
From the Cauchy–Bunyakovsky–Schwarz inequality, we have
(5)
where the last equality uses and .
Similar to , we define an optimal value for the hypervolume. From Definition 8, it is not difficult to see that the larger , the better approximates the Pareto front. For , we further define
Obviously, this is the largest that a population of size containing the two extremal points can obtain. Now we estimate .
Lemma 10.
Let with be the reference point, and let . For all , we have
Proof.
Different from Corollaries 5 and 7, we do not see a simple relation between and here. However, we note that the construction of here is the same as the one in the proof of Lemma 2. Hence, the lower bound of the is reached when is reached.
Since our MEI measure is more intuitive and crucial in our proofs, and since it can easily be transferred to the -approximation and HV measures that are utilized in the community, we will use MEI as the approximation measure to judge how well an MOEA approximates the Pareto front in the remainder of this work.
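As a concrete reading of the MEI measure, the following sketch computes, for a population on the OneMinMax Pareto front, the length of the largest empty interval as the largest gap between consecutive covered first-objective values (assuming, as in the results of this work, that the two extremal values are covered); the function name is ours.

```python
def max_empty_interval(values):
    """Largest gap between consecutive covered first-objective values.
    'values' is the multiset of first-objective values of the population;
    when the two extremal values 0 and n are present, this is the MEI."""
    covered = sorted(set(values))
    return max(b - a for a, b in zip(covered, covered[1:]))

print(max_empty_interval([0, 2, 5, 8]))  # -> 3
```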
4 Difficulties for the NSGA-II to Approximate the Pareto Front
In this section, we show that the traditional way in which the NSGA-II selects the next population, namely by relying on the initial crowding distance, can lead to suboptimal approximations of the Pareto front. To this aim, we analyze by mathematical means the results of the selection from two different combined parent and offspring populations. These examples demonstrate quite clearly that the traditional selection can lead to unwanted results. We note that these results do not fully prove that the NSGA-II has difficulties finding good approximations, since we do not know how often the NSGA-II enters exactly these situations. Unfortunately, the population dynamics of the NSGA-II are too complicated for a full proof. Our experimental results in Section 7, however, suggest that the phenomena we observe in these synthetic situations (in particular, the first one) do show up.
We start by regarding the optimal-looking situation in which the combined parent and offspring population contains exactly one individual for each point on the Pareto front. Hence by removing the individuals corresponding to (essentially) every second point on the Pareto front, one could obtain a very good approximation of the front. Surprisingly, the NSGA-II does much worse. Since all solutions apart from the two extremal ones have the same crowding distance, the NSGA-II removes a random set of out of these inner solutions. As we show now, with high probability this creates an empty interval on the Pareto front of length .
Lemma 11.
Let be odd. Consider using the NSGA-II to optimize the OneMinMax function with problem size . Let the population size be . Suppose that in a certain generation , the combined parent and offspring population fully covers the Pareto front, that is, . Then with probability at least , we have . On the positive side, and for any constant .
We note that in the above result, we did not try to optimize the implicit constants. The proof of the upper bound on the length of the longest empty interval is a simple union bound argument. The proof of the lower bound needs some care to deal with the stochastic dependencies among the intervals. We overcome these difficulties by regarding only a set of disjoint intervals and by using a two-stage sampling process such that the first stage treats the regarded intervals in an independent manner.
Proof.
We note that has size at least , since it covers the Pareto front, which has a size of . From our assumption , we thus conclude that every point in the Pareto front has exactly one corresponding individual in . Hence, except for the individuals and that have infinite crowding distance, all other individuals have the equal crowding distance of . Consequently, the original NSGA-II survival selection will randomly select individuals from the inner individuals to be removed.
To prove the logarithmic lower bound on the expectation, we argue as follows. Let . Let such that and for any two , we have or (note that such a set exists since by definition of and ). Consequently, for different , the intervals are disjoint subsets of .
To cope with the stochastic dependencies, we regard a particular way to sample the individuals to be removed. In a first phase, we select each individual with with probability . This defines a random set of individuals with expected cardinality . We remove these individuals from the combined parent and offspring population. In a second phase, if , which is very likely, then we repeat removing random individuals with until we have removed a total of individuals. If , then we repeat adding random previously removed individuals until we have removed only individuals.
We now prove that with high probability, the first phase ends with (i) an interval of length on the Pareto front which is not covered by the population, and with (ii) . Apparently, then also in the final population is not covered. Denote by the probability that all individuals with are removed in the first phase. We have
By construction, the events , , are independent. Hence, using the estimate valid for all , we estimate
Finally, we estimate the probability that . We note that can be written as a sum of independent binary random variables. Hence by the multiplicative Chernoff bound (Theorem 1.10.1 in [Doe20]), we have
This proves that with probability at least , after the selection phase there is an interval of length of the Pareto front such that none of its points is covered by the new population . This also implies .
We now turn to the upper bounds. Let and , and let be the event that all individuals with value in are selected for removal. Then
It is not difficult to see that if happens for some , then . Hence, by a union bound over all possible , we obtain
where the last inequality uses and thus . Hence for any constant , we know that
For the expectation, we compute
The example above shows that even in a perfectly symmetric situation, the NSGA-II with high probability selects a new parent population with high irregularities and relatively large areas on the Pareto front that are not covered by the population.
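The effect behind Lemma 11 is easy to reproduce numerically. The following sketch mimics the situation of the lemma: the combined population covers all values 0..n, the two extremes are kept, and half of the inner values (which all share the same crowding distance) are removed uniformly at random; it reports the average length of the longest resulting empty interval. The function name and trial count are ours.

```python
import random

def avg_max_gap_after_random_removal(n, trials=300, seed=1):
    """Cover 0..n, keep the extremes, remove (n - 1) // 2 of the n - 1
    inner values uniformly at random (mimicking the removal of a random
    set of inner individuals with equal crowding distance), and return
    the average length of the longest empty interval over many trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        removed = set(rng.sample(range(1, n), (n - 1) // 2))
        kept = [v for v in range(n + 1) if v not in removed]
        total += max(b - a for a, b in zip(kept, kept[1:]))
    return total / trials
```

For n around 100, the observed average longest gap is already logarithmic in n, far above the roughly constant gap that an evenly spaced selection of the same size would achieve.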
We now show that even more extreme examples can be constructed. This example builds on the same idea as the -point example in Figure 2 in [KD06], but is much more general (working for arbitrarily large population sizes) and gives a more extreme result. We do not expect such situations to come up often in a regular run of the NSGA-II, but they underline that the drawbacks of working with the initial crowding distance can be tremendous.
Lemma 12.
For all , there is a combined parent and offspring population with and the following two properties.
•
The population selected by the traditional NSGA-II satisfies .
•
There is a population with and such that .
Proof.
Let be such that and . Then for any two different , we have . By construction, . We note that exactly those with have the smallest occurring crowding distance of . Since these are elements of and since elements have to be removed, they will all be removed in the selection step, leaving all points with uncovered by the selected population.
Now let be obtained from by keeping , removing the individuals with values and , keeping the individual with value , and then, in order of increasing value, alternately deleting and keeping the individuals. This removes exactly half the individuals from , but leaves and in . Apart from the interval , the intervals in are all the union of two intervals in . Since , this shows . ∎
5 Working With the Current Crowding Distance
As already noted in [KD06] and then further quantified in the preceding section, the traditional way the NSGA-II selects the next parent population can lead to unevenly distributed populations. The natural reason, discussed also in [KD06] and made again very visible in the proofs of Lemmas 11 and 12, is that the crowding distance is computed only once and then used for all removals of individuals. Consequently, it may happen that individuals are removed which, at the time of their removal, have a much larger crowding distance than at the start of the selection phase.
This phenomenon is heavily exploited in the construction of the very negative example in the proof of Lemma 12. In this example, the individuals with all have the smallest crowding distance of , and thus are all removed in some arbitrary order. When the last individual is removed, its neighbors on the front have objective values and . Consequently, this individual at the moment of its removal has a crowding distance of , which is (for large ) much larger than its initial value of , but also much larger than the crowding distance of , which most of the remaining individuals still have. This example shows very clearly the downside of working with the initial crowding distance and at the same time suggests to work with the current crowding distance instead.
This is the road we will follow in this section. We first argue that there is an efficient implementation of the NSGA-II that repeatedly selects an individual with smallest crowding distance. We then show that this selection mechanism leads to much more balanced selections for the OneMinMax benchmark. We prove that the modified NSGA-II achieves an MEI value of at most , which is only by roughly a factor of larger than the optimal MEI of . Reaching such a balanced distribution is very efficient – once the two extremal points are found (in time ), it only takes additional time to find a population satisfying the MEI guarantee above. From this point on, the MEI never increases above .
5.1 Implementation of the NSGA-II Using the Current Crowding Distance
It is immediately obvious that the computation of the crowding distance in the original NSGA-II can be implemented in a straightforward manner that takes time, where is the number of objectives and is the number of individuals for which the crowding distance is used as tie-breaker.
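This straightforward computation can be sketched as follows (a sketch in our own naming; one sorting pass per objective, boundary individuals receiving infinite distance, and every other individual accumulating the normalized gap between its two neighbors in each sorting).

```python
def crowding_distances(objs):
    """Classic crowding distance for a list of objective vectors 'objs'
    (one front).  For each objective, sort the indices, give the two
    boundary individuals infinite distance, and add to every other
    individual the normalized gap between its two neighbors in this
    sorting."""
    m = len(objs)
    dist = [0.0] * m
    for k in range(len(objs[0])):          # one pass per objective
        order = sorted(range(m), key=lambda i: objs[i][k])
        span = (objs[order[-1]][k] - objs[order[0]][k]) or 1
        dist[order[0]] = dist[order[-1]] = float('inf')
        for r in range(1, m - 1):
            dist[order[r]] += (objs[order[r + 1]][k] - objs[order[r - 1]][k]) / span
    return dist
```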
For the NSGA-II using the current crowding distance, some more thought is necessary. A naïve computation of the crowding distance for each removal could lead to a total effort of . Since the removal of one individual can change the crowding distance of the at most individuals which are its neighbors in one of the sortings, one would hope that more efficient implementations exist, and this is indeed true.
In fact, already in [KD06] this problem was discussed and a solution via a priority queue and an additional data structure storing the neighbor relation was presented. Since the presentation in [KD06] is very compact (and did not discuss how to implement a random tie-breaking among individuals with the same crowding distance), we now describe this approach in more detail.
For the following presentation, let us assume that we have only objectives. Let us assume that we have a set of individuals which pairwise are not comparable (none dominates the other) or have identical objective values. When optimizing OneMinMax, this set will be the combined parent and offspring population; in the general case it will be the front . Let us call the size of this set. Let us assume that we want to remove some number of individuals from , sequentially by repeatedly removing an individual with the smallest current crowding distance, breaking ties randomly.
Besides keeping the crowding distance updated (which can be done in constant time per removal, as we just saw), we also need to be able to efficiently find and remove an individual with the smallest (current) crowding distance, and moreover, a random one in case of ties. The detection and removal of an element with the smallest key calls for a priority queue. Let us ignore for the moment the random tie-breaking and only discuss how to use a priority queue for the detection and removal of an individual with the smallest crowding distance. We recall that a priority queue is a data structure that stores items together with a key, a numerical value assigned to each item that describes the priority of the item. Standard priority queues support at least the following three operations: Adding new items to the queue, removing from the queue an item with the smallest key, and decreasing the key of an item in the queue. They do so with a time complexity that is only logarithmic in the current length of the queue.
For our problem, at the start of the selection phase, we add all individuals with their crowding distance as key to the priority queue. We repeatedly remove individuals according to the following scheme: (i) We find and remove from the priority queue an individual with smallest key. We also remove from . (ii) We find the up to four neighbors of in the two sortings of according to the two objectives, compute their crowding distance, and update their keys in the priority queue accordingly. This is not a decrease-key operation (but an increase of the key), but such an increase-key can be simulated by first decreasing the key to an artificially small value (smaller than all real values that can occur, here for example ), then removing the item with smallest key (which is just this item), and then adding it with the new key to the queue.
These two steps can be implemented in logarithmic time, since all operations of the priority queue take at most logarithmic time, except that we still need to provide an efficient way to find the neighbors of an element in the sortings. This is necessary to determine the up to four neighbors of , but also to compute their crowding distance (which needs knowing their neighbors). To enable efficient computations of such neighbors, we use an additional data structure, namely for each objective a doubly-linked list that stores the current set sorted according to this objective. This list data structure must enable finding predecessors and successors (provided they exist) as well as the deletion of elements. Standard doubly-linked lists support these operations in constant time. We use this list in step (ii) above to find the desired neighbors. We also need to delete the removed individual in step (i) from this list.
To allow finding individuals in the priority queue or the doubly-linked list, we need a helper data structure with pointers to the individuals in these data structures. This can be a simple array indexed by the initial set . When is not suitable as index, we can initially build an array storing and then use the indices of the elements of in as identifiers. With this approach, regardless of the structure of , we can access individuals in any of our data structures in constant time.
So far we have described how to repeatedly remove an element with the currently smallest crowding distance each in logarithmic time. To add the random tie-breaking, it suffices to give each individual a random second-priority key, e.g., a random number . The key of an individual used in the priority queue now is composed of the current crowding distance and this number , where a key is smaller than another when its crowding distance is smaller or when the crowding distances are equal and the number is smaller (lexicographic order of the two parts of the key). With these extended keys, the individuals with the currently smallest crowding distance have the highest priority, and in the case of several such individuals, the number serves as a random tie-breaker.
With this tie-breaking mechanism, we have now implemented a way to repeatedly remove an individual with the smallest current crowding distance, breaking ties randomly, in time logarithmic in , the size of the set . Overall, our selection using the current crowding distance takes the same time of as the selection based on the initial crowding distance. Note that both complexities are small compared to the non-dominated sorting step, which takes time quadratic in .
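The data structures just described can be sketched compactly in Python. One deviation: since Python's heapq offers neither decrease-key nor increase-key, this sketch uses the equally standard lazy-deletion alternative, in which outdated queue entries are simply skipped when popped; the increase-key simulation described above achieves the same asymptotics. The names are ours, and the crowding distance is the usual normalized one for two objectives.

```python
import heapq
import random

def remove_by_current_cd(front, k, rng=random):
    """Remove k individuals from a biobjective, mutually non-dominated
    'front' (a list of (f1, f2) tuples), repeatedly deleting one with the
    smallest *current* crowding distance, ties broken randomly.  Sorting
    by f1 ascending is simultaneously a sort by f2 descending, so a single
    doubly-linked list (here: prev/next dictionaries) suffices."""
    m = len(front)
    order = sorted(range(m), key=lambda i: front[i])
    nxt = {order[r]: (order[r + 1] if r + 1 < m else None) for r in range(m)}
    prv = {order[r]: (order[r - 1] if r > 0 else None) for r in range(m)}
    span1 = (front[order[-1]][0] - front[order[0]][0]) or 1
    span2 = (front[order[0]][1] - front[order[-1]][1]) or 1

    def cd(i):  # current crowding distance of individual i
        if prv[i] is None or nxt[i] is None:
            return float('inf')   # boundary individuals are never removed
        return ((front[nxt[i]][0] - front[prv[i]][0]) / span1
                + (front[prv[i]][1] - front[nxt[i]][1]) / span2)

    # key = (crowding distance, random tie-breaker); lazy deletion: an
    # entry is stale if its stored distance no longer matches cd(i)
    heap = [(cd(i), rng.random(), i) for i in range(m)]
    heapq.heapify(heap)
    alive = set(range(m))
    for _ in range(k):
        while True:
            d, _, i = heapq.heappop(heap)
            if i in alive and d == cd(i):
                break
        alive.remove(i)
        p, q = prv[i], nxt[i]          # unlink i, then re-queue neighbors
        if p is not None:
            nxt[p] = q
        if q is not None:
            prv[q] = p
        for j in (p, q):
            if j is not None:
                heapq.heappush(heap, (cd(j), rng.random(), j))
    return [front[i] for i in sorted(alive)]
```

On a fully covered OneMinMax front this removal scheme keeps the two extremal points and produces only small empty intervals, in line with the guarantees proven below.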
Without going into details, we note that when the possible crowding distance values are known and they are not too numerous, say there is an upper bound of for their number, then using a bucket queue instead of a standard priority queue would give a complexity of order . This is slightly superior to the above runtime, e.g., for OneMinMax when .
5.2 Runtime Analysis and Approximation Quality of the NSGA-II with Current Crowding Distance
We now conduct a mathematical analysis of the NSGA-II with survival selection based on the current crowding distance. The following lemma shows that individuals with at least a certain crowding distance will certainly enter the next generation. The key argument in this proof is an averaging argument based on the observation that the sum of the crowding distances of all individuals other than the ones with infinite crowding distance is at most .
Lemma 13.
Let . Consider using the NSGA-II with survival selection based on the current crowding distance to optimize the OneMinMax function with problem size . Let be the first generation such that the two extreme points and are in . Let . Consider the selection of the next population from , which consists of times removing an individual with smallest current crowding distance, breaking ties in an arbitrary manner. Assume that at some stage of this removal process the individual has a crowding distance of or higher. Then . Also, surely contains the two extreme points.
Proof.
We consider first a single removal of an individual at some moment of the selection phase. Let be the set of individuals from the combined parent and offspring population that have not yet been removed. Note that . Note also that, by definition of the crowding distance, there can be at most individuals with infinite crowding distance. Since , such individuals are never removed, and consequently, surely contains and . Let be the sorting of by increasing value and be the sorting of by increasing value that are used for the computation of the crowding distance. For , let such that (we assume here that individuals have unique identifiers, so they are distinguishable also when having the same genotype; consequently, these and are uniquely defined). From the definition of , we know the following. If , then . Otherwise, .
We compute
the latter by our insight that contains the two extremal individuals. An analogous estimate holds for and the . This allows the estimate
where we write for the individuals in having a finite crowding distance. Since , a simple averaging argument shows that there is an with . Hence also the individual that is removed has a crowding distance of at most .
Now looking at all removals in this selection step, we see that no individual with a crowding distance above is ever removed. ∎
The result just shown easily implies that no large empty interval is created by the removal of a solution from in the selection phase. Slightly more involved arguments are necessary to argue that the MEI-value decreases relatively fast. It is not too difficult to see that if the set of values in contains a large empty interval (the same will then be true for ), then this interval can be reduced in length by at least one via the event that one of the individuals corresponding to the boundaries of the empty interval creates an offspring “inside the interval”. What needs more care is that we require such arguments for all large empty intervals in parallel. For this, we regard how an empty interval shrinks over a longer time. Since this shrinking is composed of independent small steps, we can use strong concentration arguments to show that an interval shrinks to the desired value in the desired time with very high probability. This admits a union bound argument to extend this result to all empty intervals.
A slight technical challenge is to make precise what “one interval” means over the course of time (recall that in one iteration, we generate offspring and then remove individuals from ). We overcome this by regarding a half-integral point and the corresponding interval
This definition is unambiguous. It nevertheless leaves room for some strange effects, which seem hard to avoid. Assume that for some sufficiently larger than . Assume that there is a single individual with in . Assume that has an offspring with . If both and survive into the next generation, then has suddenly shrunk a lot to . If survives but does not, then has changed considerably to , where . Since in each iteration individuals are newly generated and then are removed, we do not see a way to define the empty intervals in a more stable manner. Nevertheless, our proof below will be such that it covers all these cases in a relatively uniform manner.
We recall from Section 2 that fair selection means that each individual of the parent population generates exactly one offspring and random selection means that times an individual is chosen uniformly at random from the parent population to generate one offspring.
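These two parent-selection schemes can be stated in a few lines (a sketch; the function names are ours):

```python
import random

def fair_selection(population):
    """Fair selection: every individual of the parent population
    generates exactly one offspring."""
    return list(population)

def random_selection(population, count, rng=random):
    """Random selection: 'count' times, choose a parent uniformly at
    random (with replacement) to generate one offspring."""
    return [rng.choice(population) for _ in range(count)]
```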
Lemma 14.
Let . Consider using the NSGA-II with fair or random parent selection, with one-bit or bit-wise mutation, and with survival selection using the current crowding distance, to optimize the OneMinMax function with problem size . Let be the first generation such that the two extreme points and are contained in . Let be the first generation such that . Then , both in expectation and with probability . Also, for all , we have with probability one.
Proof.
Let . For all , let be the length of the empty interval in containing , that is,
and let analogously be the length of the empty interval in that contains .
We first prove that if for some and we have , then we have for all with probability one. It apparently suffices to regard one iteration, so assume by way of contradiction that . Note that is obtained from by times removing an individual with smallest current crowding distance. Consider the removal step which extends the empty interval containing to a length larger than . Let be the empty interval containing before this removal. To increase the size of this empty interval, the individual removed in this step must have and it must be the only individual with this value. Assume, without loss of generality, that . Removing , by definition of the crowding distance, creates an empty interval of length exactly times the current crowding distance of (here the factor of stems from the fact that for an individual with unique and value, the contributions of the two objectives to the crowding distance are equal). By Lemma 13, this current crowding distance of is at most , contradicting our assumption that the removal of creates an empty interval of length larger than . This proves our first claim.
Since for all , this claim implies that for , and therefore that once , we have for all . Consequently, for all , we have with probability one.
It remains to show that is actually decreasing to at most over time. To this aim, we now consider the situation that for some . Let and with be the two ends of the interval in that contains . Let respectively be individuals with values and . The probability that mutation applied to creates an offspring with is for one-bit mutation and for bit-wise mutation. Likewise, the probability that mutates into a with is for one-bit mutation and at least for bit-wise mutation. Since , we have , that is, for at least one of and from which a with is mutated, such probability is at least . The probability that this individual is chosen as parent at least once in this iteration is one for fair parent selection and for random parent selection. Consequently, the probability that at least one individual with value in is created in this iteration is at least .
Let us assume that this positive event has happened. Then the interval in containing has length (note that, depending on the value of and the outcome of the offspring generation, this interval can be any of , , , , and , but all arguments below hold for any of these cases). By our first claim, we now have with probability one. This argument shows that in any iteration starting with , regardless of what happened in previous iterations, we have a probability of at least of reducing by at least one.
Now we analyze the probability of the event that drops to or less in generations. We consider a random variable , where are independent Bernoulli random variables with success probability of . Then stochastically dominates . Since , applying the multiplicative Chernoff bound (see, e.g., [Doe20, Theorem 1.10.7]), we have
Hence
Note that the above discussion corresponds to one given . A union bound over all gives that for any generation after the -th generation, with probability at least
all empty intervals in the population will have the length at most , that is, we have for all . This proves the claimed bound with high probability.
For the claimed bound on the expectation, we note that we can repeat our argument above since we did not make any particular assumptions on the initial state. Consequently, the probability that is at most . This immediately implies that the expected time to have all empty intervals of length at most is at most , see, e.g., Corollary 1.6.2 in [Doe20]. ∎
It remains to analyze the time it takes to have the two extreme points and in the population. This analysis is very similar to the analysis of the original NSGA-II with population size large enough that the full Pareto front is found in [ZLD22, Theorems 2 and 6]. Since the proof below also applies to the original NSGA-II, we formulate the following result for both algorithms.
Lemma 15.
Consider using the NSGA-II in the classic version or using the current crowding distance (Algorithm 2 or 3) with one of the following six ways to generate the offspring, namely, applying fair selection, random selection, or binary tournament selection and applying one-bit mutation or standard bit-wise mutation. Then after an expected number of iterations, that is, an expected number of fitness evaluations, the two extreme points and are contained in the population.
Proof.
We first consider the time to generate , that is, the unique search point with largest value. Let . Let be an individual in the parent population with largest value and with infinite crowding distance. Let . Let denote the probability that appears at least once in the individuals selected to generate offspring and let denote the probability of generating an individual with larger value from via the mutation operator. Note that if there exist individuals with value larger than , then one such individual will survive to , and will have an individual with fitness at least . The expected number of iterations until this happens is at most . It is not difficult to see that the largest value in cannot decrease. Since , the expected number of iterations to reach is at most .
It is not difficult to see that for the fair selection and at least for random selection. For binary tournament selection, there are at most two individuals with infinite crowding distance but value different from . Hence in this case. It is also not difficult to see that for the one-bit mutation, and . Hence the expected number of iterations to find is at most
Similarly, we can show that the expected number of iterations to have in the population is also . Hence, the expected number of iterations to have the two extreme points in the population is . Since each iteration uses fitness evaluations, the expected number of fitness evaluations is . ∎
From Lemmas 14 and 15, we easily obtain the following theorem on the approximation ability of the NSGA-II using the current crowding distance.
Theorem 16.
Let . Consider using the NSGA-II with fair or random parent selection, with survival selection using the current crowding distance, and one-bit mutation to optimize the OneMinMax function with problem size . Then after an expected number of fitness evaluations, a population containing the two extreme points and and with is reached and kept for all future time.
Recalling from Lemma 2 that the optimal maximal empty interval size is , Theorem 16 shows that the gaps in the Pareto front are at most by around a factor of larger than this theoretical minimum. Also, comparing the runtimes in Lemma 14 with those in Lemma 15 or Theorem 16, we see that the cost for reaching a good approximation is asymptotically negligible compared to the one proven for reaching the two extreme points.
6 Steady-State Approaches
In order to overcome the shortcoming of the original NSGA-II, we discussed in the previous section the approach of updating the crowding distance after each removal. A different way to resolve the problems arising from removing several individuals in parallel based on the crowding distance is a steady-state approach, that is, a version of the NSGA-II which in each iteration generates only one new solution and selects the new population from the old one plus the new individual. Building on first ideas in this direction in [SR06], such a steady-state NSGA-II was proposed in [DNLA09] and a good empirical performance was shown.
In this section, we will theoretically prove that the steady-state NSGA-II can approximate well the Pareto front. This is also the first mathematical runtime analysis of the steady-state NSGA-II.
6.1 The Steady-State NSGA-II
We now make precise the steady-state NSGA-II we regard in this work. Apart from the way the offspring is generated, it is identical to the steady-state NSGA-II proposed in [DNLA09]. In this NSGA-II, with population size , a single offspring is generated per iteration. Note that when generating a single offspring, fair selection is not well-defined, so we only regard random parent selection. Hence the single offspring will be generated from mutating a randomly selected parent, either via one-bit mutation or via bit-wise mutation. From the combined parent and offspring population of size , the next parent population of size is selected in the same way as in the classic NSGA-II (apart from the different population sizes), that is, non-dominated sorting is applied to and then all individuals are taken into the next generation apart from one individual in the last front with smallest crowding distance (ties broken randomly).
We note that the above description of the steady-state NSGA-II requires running the non-dominated sorting and crowding-distance computation procedures once for each newly generated individual, whereas in the classic NSGA-II, these procedures are called only once per newly generated offspring. A more efficient implementation of the steady-state NSGA-II was proposed in [NBS16]. Readers can refer to this paper for more details.
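For OneMinMax, one iteration of this steady-state NSGA-II is particularly simple, because every individual satisfies f1 + f2 = n and hence lies on a single non-dominated front, so the survival selection reduces to removing one individual with the smallest crowding distance. The following sketch (our names; one-bit mutation and random parent selection as described above; a simple handling of duplicate objective values) illustrates the iteration:

```python
import random

def one_min_max(x):
    """OneMinMax: maximize the number of zeros and the number of ones."""
    ones = sum(x)
    return (len(x) - ones, ones)

def steady_state_iteration(pop, rng=random):
    """One iteration: mutate a random parent with one-bit mutation, then
    remove from the combined population one individual with the smallest
    crowding distance (all individuals are on one front for OneMinMax),
    ties broken randomly."""
    n = len(pop[0])
    child = list(rng.choice(pop))
    j = rng.randrange(n)
    child[j] = 1 - child[j]                  # one-bit mutation
    combined = pop + [child]
    order = sorted(range(len(combined)), key=lambda i: sum(combined[i]))
    cd = {}
    for r, i in enumerate(order):
        if r == 0 or r == len(order) - 1:
            cd[i] = float('inf')             # sorting-boundary individuals
        else:
            gap = sum(combined[order[r + 1]]) - sum(combined[order[r - 1]])
            cd[i] = 2 * gap / n              # both objectives contribute equally
    worst = min(cd.values())
    out = rng.choice([i for i in cd if cd[i] == worst])
    return [ind for i, ind in enumerate(combined) if i != out]
```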
6.2 Runtime Analysis and Approximation Quality of the Steady-State NSGA-II
We now prove runtime guarantees and approximation guarantees for the steady-state NSGA-II of the same quality as those proven for the NSGA-II using the current crowding distance in Section 5.2.
We first note that the removal of a least favorable individual from $N+1$ individuals, which is the selection step of the steady-state NSGA-II, was already analyzed as a special case of Lemma 13, namely as the last of the removal steps in the NSGA-II with current crowding distance. Since that proof assumed no particular structure of the individuals, it immediately implies the following result.
Lemma 17.
Let $N$ denote the population size. Consider using the steady-state NSGA-II to optimize the OneMinMax function with problem size $n$. Let $t_0$ be the first generation such that the two extreme points $0^n$ and $1^n$ are in $P_{t_0}$. Let $t \ge t_0$. Consider the selection of the next population $P_{t+1}$ from the combined population $R_t$ of size $N+1$. Let $\tilde R_t$ be the set of individuals in $R_t$ with crowding distance at least the threshold regarded in Lemma 13. Then $\tilde R_t \subseteq P_{t+1}$.
Also, $P_{t+1}$ surely contains the two extreme points.
Fix some point $i$ on the Pareto front. We reuse from Section 5.2 the notation $L_t(i)$ for the length of the empty interval in $P_t$ that contains $i$, and $\tilde L_t(i)$ for the corresponding length with respect to $R_t$.
From Lemma 17, in a fashion analogous to the proof of Lemma 14, we obtain first that if for some $t$ and $\ell$ we have $L_t(i) \le \ell$, then $L_{t'}(i) \le \ell$ for all $t' \ge t$. Consequently, (i) once $L_t(i)$ has reached the target length of Lemma 14, it keeps this guarantee for all future generations, and (ii) if $L_t(i)$ is still above the target length, then $L_{t+1}(i) \le \tilde L_t(i)$.
Analogous to the second part of the proof of Lemma 14, we have for the steady-state NSGA-II that with positive probability the empty-interval length reduces in one iteration, that is, that $L_{t+1}(i) < L_t(i)$ when $L_t(i)$ is still above the target length. In Lemma 14, this probability was bounded via the event that a desired individual is contained in the selected parents. Now, where only one parent is selected, it is smaller by a factor of $N$. These arguments, as in the proof of Lemma 14, give the following lemma.
Lemma 18.
Let $N$ denote the population size. Consider using the steady-state NSGA-II with random parent selection and with one-bit or bit-wise mutation to optimize the OneMinMax function with problem size $n$. Let $t_0$ be the first generation such that the two extreme points $0^n$ and $1^n$ are contained in $P_{t_0}$. Let $t_1$ be the first generation such that all empty intervals have reached the target length of Lemma 14. Then $t_1 - t_0$ exceeds the corresponding bound of Lemma 14 by at most a factor of order $N$, both in expectation and with high probability. Also, for all $t \ge t_1$, the empty intervals keep at most this target length with probability one.
It remains to analyze the time until the two extreme points $0^n$ and $1^n$ are in the population. This analysis is completely identical to the one in Lemma 15 apart from the calculation of the success probability, which now is the probability that the unique parent selected in one iteration is an individual with maximal value in the objective regarded and infinite crowding distance.
For random selection, this probability is at least $\frac{1}{N}$. For binary tournament selection, a bound of the same asymptotic order holds, again using the argument that there are at most two individuals with infinite crowding distance whose value in the objective regarded differs from the maximal one. Naturally, these probabilities are by a factor of $N$ smaller than the values computed in Lemma 15, where in each iteration $N$ offspring were generated. Consequently, in terms of iterations, our bound for the time to find the two extremal points is by a factor of $N$ larger.
Lemma 19.
Consider using the steady-state NSGA-II with one of the following four ways to generate the offspring, namely, applying random selection or binary tournament selection together with one-bit mutation or standard bit-wise mutation. Then after an expected number of iterations (fitness evaluations) that is at most a factor of $N$ larger than the bound of Lemma 15, the two extreme points $0^n$ and $1^n$ are contained in the population.
From Lemmas 18 and 19, we have the following theorem for the approximation ability of the steady-state NSGA-II.
Theorem 20.
Let $N$ denote the population size. Consider using the steady-state NSGA-II with random selection and with one-bit or standard bit-wise mutation to optimize the OneMinMax function with problem size $n$. Then after an expected number of fitness evaluations of the same order as in Theorem 16, a population containing the two extreme points $0^n$ and $1^n$ and leaving gaps on the Pareto front at most a constant factor larger than in an optimal approximation is reached and kept for all future time.
Note that apart from Lemma 19, which also covers binary tournament selection, all other results in this section regard only random parent selection. It is not clear how to extend them to tournament selection, and we leave this as an interesting open question for future research.
7 Experiments
In Section 4, we conducted a theoretical analysis of two synthetic situations to show that the traditional NSGA-II can have difficulties in approximating the Pareto front. The complicated population dynamics prevented us from analyzing how often such situations arise. In Sections 5 and 6, we proved that the NSGA-II using the current crowding distance and the steady-state NSGA-II leave gaps on the Pareto front that are asymptotically at most a constant factor larger than those of a perfect approximation of the Pareto front. To complete the picture, we present in this section some experimental results.
7.1 Settings
We conduct experiments with the following settings.
- Problem size $n$: 601. Given that the OneMinMax problem is an easy multi-objective problem, this is a moderate problem size. Such a choice is sensible, since with a too small size we would not gain reliable insights, whereas insights obtained on a very large problem size raise the question whether they still apply to practically relevant problem sizes. We use the odd number 601 to include the setting discussed in Lemma 11.
- Algorithms: We regard the classic NSGA-II using the initial crowding distance for the selection, the NSGA-II using the current crowding distance, and the steady-state NSGA-II. As in our theoretical analysis, we do not use crossover, that is, mutation is the only operator to generate the offspring population.
- Mating selection and mutation strategy: We select the parents for the mutation operations via fair selection for the two generational NSGA-II variants. For the steady-state NSGA-II, we select a random individual as parent. As mutation operator, we use one-bit mutation. These settings are covered by our mathematical runtime analysis. We have no reason to believe that other settings, e.g., random selection and bit-wise mutation (as also covered in our mathematical analyses), lead to substantially different results.
- Several independent runs for each setting.
7.2 Results
Our focus is the maximal length of an empty interval on the Pareto front, that is, the MEI value defined earlier. As long as the population has not fully spread out on the Pareto front, that is, the extremal solutions $0^n$ and $1^n$ are not yet part of the population, enlarging the spread of the population is more critical than a balanced distribution in the known part of the front. For this reason, we only regard times after both extremal solutions have entered the population.
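To make the measured quantity concrete: for OneMinMax, the Pareto front corresponds to all possible numbers of ones $0, \dots, n$, so once $0^n$ and $1^n$ are in the population, the MEI can be read off the covered one-counts. The following sketch is our own formalization, under the assumption that the MEI is the largest difference between consecutive covered values; function names are ours.

```python
def one_min_max(x):
    """OneMinMax: simultaneously maximize the number of zeros and of ones."""
    ones = sum(x)
    return (len(x) - ones, ones)

def mei(population):
    """Maximal empty interval: largest gap between consecutive one-counts
    covered by the population (assumes 0^n and 1^n are already present)."""
    covered = sorted({sum(x) for x in population})
    return max(b - a for a, b in zip(covered, covered[1:]))
```

A fully covering population has MEI 1, while a population leaving a block of values uncovered has an MEI equal to the width of that block.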
To see whether the approximation quality changes over time, we regard separately two intervals of generations, one starting right after finding the two extremal solutions and one considerably later. We collect statistical data on the MEI values in these intervals in Table 1 and display exemplary runs in Figure 1.
Table 1: The (smallest, median, largest) observed MEI values in the two regarded intervals of generations; each block of three rows corresponds to one population size.

| | Generations (first interval) | Generations (later interval) |
|---|---|---|
| Initial CD | (7, 8, 9) | (7, 8, 9) |
| Current CD | (3, 3, 3) | (3, 3, 3) |
| Steady-state | (3, 3, 3) | (3, 3, 3) |
| Initial CD | (14, 15, 17) | (14, 15, 17) |
| Current CD | (5, 5, 6) | (5, 5, 6) |
| Steady-state | (5, 5, 5) | (5, 5, 5) |
| Initial CD | (23, 26, 29) | (24, 27, 30) |
| Current CD | (11, 12, 12) | (11, 12, 12) |
| Steady-state | (11, 12, 12) | (11, 11, 11) |



The optimal MEI value for OneMinMax with $n = 601$ depends on the population size, namely it is the smallest maximal gap that a population of the respective size can leave on the Pareto front. From Table 1, we see that the NSGA-II using the current crowding distance and the steady-state NSGA-II often reach the optimal MEI for the two larger population sizes and show only slightly sub-optimal values for the smallest one. In contrast, the traditional NSGA-II shows median MEI values of 8, 15, and 26 in the better time interval. This is more than twice the optimal MEI and more than twice the median MEI of the two NSGA-II variants that overcome the problem with the initial crowding distance.
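The notion of an optimal MEI can be made precise by a simple pigeonhole argument; we sketch it here under our formalization of the MEI as the largest gap between consecutive covered front values, which may differ in details from the paper's exact definition.

```latex
% N covered values v_1 < v_2 < \dots < v_N on the front \{0,\dots,n\},
% with v_1 = 0 and v_N = n, leave N-1 gaps whose lengths sum to n:
\sum_{k=1}^{N-1} \left( v_{k+1} - v_k \right) = v_N - v_1 = n .
% Hence the largest gap, i.e., the MEI, satisfies
\max_{1 \le k \le N-1} \left( v_{k+1} - v_k \right)
  \;\ge\; \left\lceil \frac{n}{N-1} \right\rceil ,
% and equidistant (or near-equidistant) values attain this bound, so the
% optimal MEI equals \lceil n/(N-1) \rceil.
```

This is the "theoretical minimum" against which the constant-factor guarantees of Theorems 16 and 20 are measured.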
We observe no larger differences between the generations right after finding the two extremal points and the generations considerably later. For the smallest population size, the NSGA-II with initial crowding distance displays slightly larger MEI values in the later interval, whereas the steady-state NSGA-II shows slightly smaller values. We do not have an explanation for this, but the differences are so small that we did not find it interesting to investigate this observation further.
In Figure 1, we see that the MEI value oscillates considerably for the classic NSGA-II, whereas it is relatively stable for the NSGA-II using the current crowding distance and constant for the steady-state NSGA-II. This appears to be a second advantage of the two variants of the NSGA-II.
Our experimental data is not sufficient to answer the question whether the traditional NSGA-II suffers from super-constant MEI values. Our theoretical result in Lemma 11 could be seen as an indication that logarithmic MEI values can show up (or, for general problem sizes, MEI values a logarithmic factor above the optimal value). To answer this question, significantly more experiments with truly large problem sizes would be necessary (due to the slow growth of logarithmic functions). For our purposes, however, our results are fully sufficient. They clearly show that the two variants of the NSGA-II not building on the initial crowding distance yield much smaller and more stable MEI values. Not surprisingly, the experimentally observed MEI values for these two variants are better than the mathematical guarantees given in Theorems 16 and 20.
8 Conclusion
None of the existing runtime analyses for the NSGA-II (including those published after our conference version [ZD22]) regards the performance of the NSGA-II when the population size is smaller than the size of the Pareto front. In this work, we regard this situation and discuss how well the population evolved by the NSGA-II approximates the Pareto front. Our theoretical analysis of two artificial cases and our experiments give a mixed picture. However, they also suggest that the reason for the not fully satisfying approximation behavior is that the selection of the next parent population is based on the initial crowding distance of the individuals in the combined parent and offspring population, which can be very different from the crowding distance at the moment an individual is removed, an effect previously observed, with less mathematical arguments, in [KD06].
For the NSGA-II building on the current crowding distance, we proved very positive approximation results for the OneMinMax benchmark. After an expected time comparable to the runtime of the classic NSGA-II on the OneMinMax benchmark, a population is reached that covers the Pareto front apart from gaps that are only a constant factor larger than in an optimal approximation. This state of the population is maintained for the remaining runtime, with probability one. We further discussed the steady-state NSGA-II and proved the same good approximation ability. Our experiments confirm the superiority of these two selection strategies.
From our proofs, we conjecture that similar results can be obtained for other classic benchmark problems such as LOTZ or the large front problem [HN08]. Overall, this work gives additional motivation to prefer two less common variants of the NSGA-II, the NSGA-II working with the current crowding distance [KD06] (around 180 citations on Google Scholar) and the steady-state NSGA-II [DNLA09] (76 citations), over the classic NSGA-II (over 50,000 citations).
We want to mention two open problems arising from this work. As already discussed, due to the complex population dynamics, we were not able to conduct a mathematical analysis of the approximation ability of the classic NSGA-II. So we could neither prove a super-constant lower bound on the typical MEI value, nor any interesting upper bound on the MEI value, say, a logarithmic factor above the ideal value. This is clearly an interesting problem area to explore further. On the more technical side, we note that in this first work on the approximation strength of the NSGA-II, we only regarded fair and random parent selection. Our proofs do not apply to tournament selection, and we feel that substantially different arguments will be necessary to prove approximation guarantees in this case.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62306086), the Science, Technology and Innovation Commission of Shenzhen Municipality (Grant No. GXWD20220818191018001), and the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2019A1515110177).
This work was also supported by a public grant as part of the Investissement d’avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.
We thank Hisao Ishibuchi for pointing out Kukkonen and Deb’s work [KD06]. We also thank Ke Shang for his question about the behavior of the steady-state NSGA-II at GECCO 2022.
References
- [ABB10] Anne Auger, Johannes Bader, and Dimo Brockhoff. Theoretically investigating optimal μ-distributions for the hypervolume indicator: First results for three objectives. In Parallel Problem Solving from Nature, PPSN 2010, pages 586–596. Springer, 2010.
- [ABBZ12] Anne Auger, Johannes Bader, Dimo Brockhoff, and Eckart Zitzler. Hypervolume-based multiobjective optimization: Theoretical foundations and practical implications. Theoretical Computer Science, 425:75–103, 2012.
- [ABD21] Denis Antipov, Maxim Buzdalov, and Benjamin Doerr. Lazy parameter tuning and control: choosing all parameters randomly from a power-law distribution. In Genetic and Evolutionary Computation Conference, GECCO 2021, pages 1115–1123. ACM, 2021.
- [AD11] Anne Auger and Benjamin Doerr, editors. Theory of Randomized Search Heuristics. World Scientific Publishing, 2011.
- [BF13] Karl Bringmann and Tobias Friedrich. Approximation quality of the hypervolume indicator. Artificial Intelligence, 195:265–290, 2013.
- [BFLI+09] Nicola Beume, Carlos M. Fonseca, Manuel Lopez-Ibanez, Luis Paquete, and Jan Vahrenhold. On the complexity of computing the hypervolume indicator. IEEE Transactions on Evolutionary Computation, 13:1075–1082, 2009.
- [BFN08] Dimo Brockhoff, Tobias Friedrich, and Frank Neumann. Analyzing hypervolume indicator based algorithms. In Parallel Problem Solving from Nature, PPSN 2008, pages 651–660. Springer, 2008.
- [BFQY20] Chao Bian, Chao Feng, Chao Qian, and Yang Yu. An efficient evolutionary algorithm for subset selection with general cost constraints. In Conference on Artificial Intelligence, AAAI 2020, pages 3267–3274. AAAI Press, 2020.
- [BQ22] Chao Bian and Chao Qian. Better running time of the non-dominated sorting genetic algorithm II (NSGA-II) by using stochastic tournament selection. In Parallel Problem Solving From Nature, PPSN 2022, pages 428–441. Springer, 2022.
- [BQT18] Chao Bian, Chao Qian, and Ke Tang. A general approach to running time analysis of multi-objective evolutionary algorithms. In International Joint Conference on Artificial Intelligence, IJCAI 2018, pages 1405–1411. IJCAI, 2018.
- [BT03] Peter A.N. Bosman and Dirk Thierens. The balance between proximity and diversity in multiobjective evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7:174–188, 2003.
- [BZLQ23] Chao Bian, Yawen Zhou, Miqing Li, and Chao Qian. Stochastic population update can provably be helpful in multi-objective evolutionary algorithms. In International Joint Conference on Artificial Intelligence, IJCAI 2023, pages 5513–5521. ijcai.org, 2023.
- [CCRS04] Carlos A. Coello Coello and Margarita Reyes Sierra. A study of the parallelization of a coevolutionary multi-objective evolutionary algorithm. In Mexican International Conference on Artificial Intelligence, MICAI 2004, pages 688–697. Springer, 2004.
- [CDH+23] Sacha Cerf, Benjamin Doerr, Benjamin Hebras, Jakob Kahane, and Simon Wietheger. The first proven performance guarantees for the Non-Dominated Sorting Genetic Algorithm II (NSGA-II) on a combinatorial optimization problem. In International Joint Conference on Artificial Intelligence, IJCAI 2023, pages 5522–5530. ijcai.org, 2023.
- [Cra21] Victoria G. Crawford. Faster guarantees of evolutionary algorithms for maximization of monotone submodular functions. In International Joint Conference on Artificial Intelligence, IJCAI 2021, pages 1661–1667. ijcai.org, 2021.
- [DDHW23] Matthieu Dinot, Benjamin Doerr, Ulysse Hennebelle, and Sebastian Will. Runtime analyses of multi-objective evolutionary algorithms in the presence of noise. In International Joint Conference on Artificial Intelligence, IJCAI 2023, pages 5549–5557. ijcai.org, 2023.
- [DELQ22] Duc-Cuong Dang, Anton V. Eremeev, Per Kristian Lehre, and Xiaoyu Qin. Fast non-elitist evolutionary algorithms with power-law ranking selection. In Genetic and Evolutionary Computation Conference, GECCO 2022, pages 1372–1380. ACM, 2022.
- [DGN16] Benjamin Doerr, Wanru Gao, and Frank Neumann. Runtime analysis of evolutionary diversity maximization for OneMinMax. In Genetic and Evolutionary Computation Conference, GECCO 2016, pages 557–564. ACM, 2016.
- [DK15] Benjamin Doerr and Marvin Künnemann. Optimizing linear functions with the (1+λ) evolutionary algorithm—different asymptotic runtimes for different instances. Theoretical Computer Science, 561:3–23, 2015.
- [DLMN17] Benjamin Doerr, Huu Phuoc Le, Régis Makhmara, and Ta Duy Nguyen. Fast genetic algorithms. In Genetic and Evolutionary Computation Conference, GECCO 2017, pages 777–784. ACM, 2017.
- [DN20] Benjamin Doerr and Frank Neumann, editors. Theory of Evolutionary Computation—Recent Developments in Discrete Optimization. Springer, 2020. Also available at http://www.lix.polytechnique.fr/Labo/Benjamin.Doerr/doerr_neumann_book.html.
- [DNLA09] Juan J. Durillo, Antonio J. Nebro, Francisco Luna, and Enrique Alba. On the effect of the steady-state selection scheme in multi-objective genetic algorithms. In International Conference on Evolutionary Multi-Criterion Optimization, EMO 2009, pages 183–197. Springer Berlin Heidelberg, 2009.
- [Doe20] Benjamin Doerr. Probabilistic tools for the analysis of randomized optimization heuristics. In Benjamin Doerr and Frank Neumann, editors, Theory of Evolutionary Computation: Recent Developments in Discrete Optimization, pages 1–87. Springer, 2020. Also available at https://arxiv.org/abs/1801.06733.
- [DOSS23a] Duc-Cuong Dang, Andre Opris, Bahare Salehi, and Dirk Sudholt. Analysing the robustness of NSGA-II under noise. In Genetic and Evolutionary Computation Conference, GECCO 2023, pages 642–651. ACM, 2023.
- [DOSS23b] Duc-Cuong Dang, Andre Opris, Bahare Salehi, and Dirk Sudholt. A proof that using crossover can guarantee exponential speed-ups in evolutionary multi-objective optimisation. In Conference on Artificial Intelligence, AAAI 2023, pages 12390–12398. AAAI Press, 2023.
- [DPAM02] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197, 2002.
- [DQ23a] Benjamin Doerr and Zhongdi Qu. A first runtime analysis of the NSGA-II on a multimodal problem. IEEE Transactions on Evolutionary Computation, 2023. https://doi.org/10.1109/TEVC.2023.3250552.
- [DQ23b] Benjamin Doerr and Zhongdi Qu. From understanding the population dynamics of the NSGA-II to the first proven lower bounds. In Conference on Artificial Intelligence, AAAI 2023, pages 12408–12416. AAAI Press, 2023.
- [DQ23c] Benjamin Doerr and Zhongdi Qu. Runtime analysis for the NSGA-II: provable speed-ups from crossover. In Conference on Artificial Intelligence, AAAI 2023, pages 12399–12407. AAAI Press, 2023.
- [EDB07] Michael Emmerich, André Deutz, and Nicola Beume. Gradient-based/evolutionary relay hybrid for computing pareto front approximations maximizing the S-metric. In International Workshop on Hybrid Metaheuristics, HM 2007, pages 140–156. Springer, 2007.
- [FNT15] Tobias Friedrich, Frank Neumann, and Christian Thyssen. Multiplicative approximations, optimal hypervolume distributions, and the choice of the reference point. Evolutionary Computation, 23:131–159, 2015.
- [Gie03] Oliver Giel. Expected runtimes of a simple multi-objective evolutionary algorithm. In Congress on Evolutionary Computation, CEC 2003, pages 1918–1925. IEEE, 2003.
- [GL10] Oliver Giel and Per Kristian Lehre. On the effect of populations in evolutionary multi-objective optimisation. Evolutionary Computation, 18:335–356, 2010.
- [Gut12] Walter J. Gutjahr. Runtime analysis of an evolutionary algorithm for stochastic multi-objective combinatorial optimization. Evolutionary Computation, 20:395–421, 2012.
- [HN08] Christian Horoba and Frank Neumann. Benefits and drawbacks for the use of epsilon-dominance in evolutionary multi-objective optimization. In Genetic and Evolutionary Computation Conference, GECCO 2008, pages 641–648. ACM, 2008.
- [HN09] Christian Horoba and Frank Neumann. Additive approximations of pareto-optimal sets by evolutionary multi-objective algorithms. In Foundations of Genetic Algorithms, FOGA 2009, pages 79–86. ACM, 2009.
- [Jan13] Thomas Jansen. Analyzing Evolutionary Algorithms – The Computer Science Perspective. Springer, 2013.
- [JJW05] Thomas Jansen, Kenneth A. De Jong, and Ingo Wegener. On the choice of the offspring population size in evolutionary algorithms. Evolutionary Computation, 13:413–440, 2005.
- [KD06] Saku Kukkonen and Kalyanmoy Deb. Improved pruning of non-dominated solutions based on crowding distance for bi-objective optimization problems. In Conference on Evolutionary Computation, CEC 2006, pages 1179–1186. IEEE, 2006.
- [LTDZ02] Marco Laumanns, Lothar Thiele, Kalyanmoy Deb, and Eckart Zitzler. Combining convergence and diversity in evolutionary multiobjective optimization. Evolutionary Computation, 10:263–282, 2002.
- [LTZ+02] Marco Laumanns, Lothar Thiele, Eckart Zitzler, Emo Welzl, and Kalyanmoy Deb. Running time analysis of multi-objective evolutionary algorithms on a simple discrete optimization problem. In Parallel Problem Solving from Nature, PPSN 2002, pages 44–53. Springer, 2002.
- [LTZ04] Marco Laumanns, Lothar Thiele, and Eckart Zitzler. Running time analysis of multiobjective evolutionary algorithms on pseudo-Boolean functions. IEEE Transactions on Evolutionary Computation, 8:170–182, 2004.
- [LZZZ16] Yuan-Long Li, Yu-Ren Zhou, Zhi-Hui Zhan, and Jun Zhang. A primary theoretical study on decomposition-based multiobjective evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 20:563–576, 2016.
- [NBS16] Niyaz Nigmatullin, Maxim Buzdalov, and Andrey Stankevich. Efficient removal of points with smallest crowding distance in two-dimensional incremental non-dominated sorting. In Genetic and Evolutionary Computation Conference, GECCO 2016, Companion Material, pages 1121–1128. ACM, 2016.
- [Neu07] Frank Neumann. Expected runtimes of a simple evolutionary algorithm for the multi-objective minimum spanning tree problem. European Journal of Operational Research, 181:1620–1629, 2007.
- [NR08] Frank Neumann and Joachim Reichel. Approximating minimum multicuts by evolutionary multi-objective algorithms. In Parallel Problem Solving from Nature, PPSN 2008, pages 72–81. Springer, 2008.
- [NRS11] Frank Neumann, Joachim Reichel, and Martin Skutella. Computing minimum cuts by randomized search heuristics. Algorithmica, 59:323–342, 2011.
- [NSN15] Anh Quang Nguyen, Andrew M. Sutton, and Frank Neumann. Population size matters: rigorous runtime results for maximizing the hypervolume indicator. Theoretical Computer Science, 561:24–36, 2015.
- [NW10] Frank Neumann and Carsten Witt. Bioinspired Computation in Combinatorial Optimization – Algorithms and Their Computational Complexity. Springer, 2010.
- [PSN19] Mojgan Pourhassan, Feng Shi, and Frank Neumann. Parameterized analysis of multiobjective evolutionary algorithms and the weighted vertex cover problem. Evolutionary Computation, 27:559–575, 2019.
- [QBF20] Chao Qian, Chao Bian, and Chao Feng. Subset selection by Pareto optimization with recombination. In Conference on Artificial Intelligence, AAAI 2020, pages 2408–2415. AAAI Press, 2020.
- [QYT+19] Chao Qian, Yang Yu, Ke Tang, Xin Yao, and Zhi-Hua Zhou. Maximizing submodular or monotone approximately submodular functions by multi-objective evolutionary algorithms. Artificial Intelligence, 275:279–294, 2019.
- [RNNF19] Vahid Roostapour, Aneta Neumann, Frank Neumann, and Tobias Friedrich. Pareto optimization for subset selection with dynamic cost constraints. In Conference on Artificial Intelligence, AAAI 2019, pages 2354–2361. AAAI Press, 2019.
- [SIHP21] Ke Shang, Hisao Ishibuchi, Linjun He, and Lie Meng Pang. A survey on the hypervolume indicator in evolutionary multiobjective optimization. IEEE Transactions on Evolutionary Computation, 25:1–20, 2021.
- [SR06] Dipti Srinivasan and Lily Rachmawati. An efficient multi-objective evolutionary algorithm with steady-state replacement model. In Genetic and Evolutionary Computation Conference, GECCO 2006, pages 715–722. ACM, 2006.
- [VL98] David A. Van Veldhuizen and Gary B. Lamont. Evolutionary computation and convergence to a Pareto front. In John R. Koza, editor, Late Breaking Papers at the Genetic Programming Conference 1998, pages 221–228. Stanford University Bookstore, 1998.
- [WD23] Simon Wietheger and Benjamin Doerr. A mathematical runtime analysis of the Non-dominated Sorting Genetic Algorithm III (NSGA-III). In International Joint Conference on Artificial Intelligence, IJCAI 2023, pages 5657–5665. ijcai.org, 2023.
- [ZD22] Weijie Zheng and Benjamin Doerr. Better approximation guarantees for the NSGA-II by using the current crowding distance. In Genetic and Evolutionary Computation Conference, GECCO 2022, pages 611–619. ACM, 2022.
- [ZD23a] Weijie Zheng and Benjamin Doerr. Mathematical runtime analysis for the non-dominated sorting genetic algorithm II (NSGA-II). Artificial Intelligence, 2023. In Press, https://doi.org/10.1016/j.artint.2023.104016.
- [ZD23b] Weijie Zheng and Benjamin Doerr. Runtime analysis for the NSGA-II: proving, quantifying, and explaining the inefficiency for three or more objectives. IEEE Transactions on Evolutionary Computation, 2023. In Press, https://doi.org/10.1109/TEVC.2023.3320278.
- [ZD23c] Weijie Zheng and Benjamin Doerr. Theoretical analyses of multiobjective evolutionary algorithms on multimodal objectives. Evolutionary Computation, 2023. In Press, https://doi.org/10.1162/evco_a_00328.
- [ZLD22] Weijie Zheng, Yufei Liu, and Benjamin Doerr. A first mathematical runtime analysis of the Non-Dominated Sorting Genetic Algorithm II (NSGA-II). In Conference on Artificial Intelligence, AAAI 2022, pages 10408–10416. AAAI Press, 2022.
- [ZQL+11] Aimin Zhou, Bo-Yang Qu, Hui Li, Shi-Zheng Zhao, Ponnuthurai Nagaratnam Suganthan, and Qingfu Zhang. Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm and Evolutionary Computation, 1:32–49, 2011.
- [ZT98] Eckart Zitzler and Lothar Thiele. Multiobjective optimization using evolutionary algorithms—a comparative case study. In Parallel Problem Solving from Nature, PPSN 1998, pages 292–301. Springer, 1998.
- [ZYQ19] Zhi-Hua Zhou, Yang Yu, and Chao Qian. Evolutionary Learning: Advances in Theories and Algorithms. Springer, 2019.