Submodular + Concave
Abstract
It has been well established that first order optimization methods can converge to the maximal objective value of concave functions and provide constant factor approximation guarantees for (non-convex/non-concave) continuous submodular functions. In this work, we initiate the study of the maximization of functions of the form $F(x) = G(x) + C(x)$ over a solvable convex body $P$, where $G$ is a smooth DR-submodular function and $C$ is a smooth concave function. This class of functions is a strict extension of both concave and continuous DR-submodular functions, and no theoretical guarantee was previously known for it. We provide a suite of Frank-Wolfe style algorithms, which, depending on the nature of the objective function (i.e., whether $G$ and $C$ are monotone or not, and non-negative or not) and on the nature of the set $P$ (i.e., whether it is downward closed or not), provide $1-1/e$, $1/e$, or $1/2$ approximation guarantees. We then use our algorithms to obtain a framework for smoothly interpolating between choosing a diverse set of elements from a given ground set (corresponding to the mode of a determinantal point process) and choosing a clustered set of elements (corresponding to the maxima of a suitable concave function). Additionally, we apply our algorithms to various functions in the above class (DR-submodular + concave) in both constrained and unconstrained settings, and show that our algorithms consistently outperform natural baselines.
1 Introduction
Despite their simplicity, first-order optimization methods (such as gradient descent and its variants, Frank-Wolfe, momentum-based methods, and others) have shown great success in many machine learning applications. A large body of research in the operations research and machine learning communities has fully demystified the convergence rate of such methods in minimizing well behaved convex objectives (Bubeck, 2015; Nesterov, 2018). More recently, a new surge of rigorous results has also shown that gradient descent methods can find the global minima of specific non-convex objective functions arising from non-negative matrix factorization (Arora et al., 2012), robust PCA (Netrapalli et al., 2014), phase retrieval (Chen et al., 2019b), matrix completion (Sun and Luo, 2016), and the training of wide neural networks (Du et al., 2019; Jacot et al., 2018; Allen-Zhu et al., 2019), to name a few. It is also very well known that finding the global minimum of a general non-convex function is computationally intractable (Murty and Kabadi, 1987). To avoid such impossibility results, simpler goals have been pursued by the community: either developing algorithms that can escape saddle points and reach local minima (Ge et al., 2015) or describing structural properties which guarantee that reaching a local minimizer ensures optimality (Sun et al., 2016; Bian et al., 2017b; Hazan et al., 2016). In the same spirit, this paper quantifies a large class of non-convex functions for which first-order optimization methods provably achieve near optimal solutions.
More specifically, we consider objective functions that are formed by the sum of a continuous DR-submodular function $G$ and a concave function $C$. Recent research in non-convex optimization has shown that first-order optimization methods provide constant factor approximation guarantees for maximizing continuous submodular functions (Bian et al., 2017b; Hassani et al., 2017; Bian et al., 2017a; Niazadeh et al., 2018; Hassani et al., 2020a; Mokhtari et al., 2018a). Similarly, such methods find the global maximizer of concave functions. However, the class of functions of the form $F = G + C$ is strictly larger than both concave and continuous DR-submodular functions. More specifically, $F$ is in general neither concave nor continuous DR-submodular (Figure 1 illustrates an example). In this paper, we show that first-order methods provably provide strong theoretical guarantees for maximizing such functions.
The combinations of continuous submodular and concave functions naturally arise in many ML applications, such as maximizing a regularized submodular function (Kazemi et al., 2020a) or finding the mode of distributions (Kazemi et al., 2020a; Robinson et al., 2019). For instance, it is common to add a regularizer to a D-optimal design objective function to increase the stability of the final solution against perturbations (He, 2010; Derezinski et al., 2020; Lattimore and Szepesvari, 2020). Similarly, many instances of log-submodular distributions, such as determinantal point processes (DPPs), have been studied in depth in order to sample a diverse set of items from a ground set (Kulesza, 2012; Rebeschini and Karbasi, 2015; Anari et al., 2016). In order to control the level of diversity, one natural recipe is to consider the combination of a log-concave distribution (e.g., the normal, exponential, and Laplace distributions) (Dwivedi et al., 2019; Robinson et al., 2019) and a log-submodular distribution (Djolonga and Krause, 2014; Bresler et al., 2019), i.e., a distribution whose log-density is the sum of a submodular part and a concave part. In this way, we obtain a class of distributions that contains log-concave and log-submodular distributions as special cases. However, this class of distributions is strictly more expressive than both log-concave and log-submodular models; e.g., in contrast to log-concave distributions, its members are not uni-modal in general. Finding the mode of such distributions amounts to maximizing a combination of a continuous DR-submodular function and a concave function. The contributions of this paper are as follows.
- Assuming first-order oracle access to the objective $F = G + C$, we develop the algorithms Greedy Frank-Wolfe (Algorithm 1) and Measured Greedy Frank-Wolfe (Algorithm 2), which achieve constant factor approximation guarantees between $1/e$ and $1 - 1/e$ depending on the setting, i.e., depending on the monotonicity and non-negativity of $G$ and $C$, and on whether the constraint set is down-closed or not.
- Furthermore, if we have access to the individual gradients of $G$ and $C$, then we are able to make the guarantee with respect to $C$ exact using the algorithms Gradient Combining Frank-Wolfe (Algorithm 3) and Non-oblivious Frank-Wolfe (Algorithm 4). These results are summarized and made more precise in Table 1 and Section 3.
- We then present experiments designed to use our algorithms to smoothly interpolate between contrasting objectives such as picking a diverse set of elements and picking a clustered set of elements. This smooth interpolation provides a way to control the amount of diversity in the final solution. We also demonstrate the use of our algorithms to maximize a large class of (non-convex/non-concave) quadratic programming problems.

Related Work.
The study of discrete submodular maximization has flourished in the last decade through far reaching applications in machine learning and artificial intelligence including viral marketing (Kempe et al., 2003), dictionary learning (Krause and Cevher, 2010), sparse regression (Elenberg et al., 2016), neural network interpretability (Elenberg et al., 2017), crowd teaching (Singla et al., 2014), sequential decision making (Alieva et al., 2021), active learning (Wei et al., 2015), and data summarization (Mirzasoleiman et al., 2013). We refer the interested reader to a recent survey by Tohidi et al. (2020) and the references therein.
Recently, Bian et al. (2017b) proposed an extension of discrete submodular functions to continuous domains that can be of use in machine learning applications. Notably, this class of (non-convex/non-concave) functions, the so-called continuous DR-submodular functions, contains the multilinear extension of discrete submodular functions (Călinescu et al., 2011) as a special case. Continuous DR-submodular functions can reliably model revenue maximization (Bian et al., 2017b), robust budget allocation (Staib and Jegelka, 2017), experimental design (Chen et al., 2018), MAP inference for DPPs (Gillenwater et al., 2012; Hassani et al., 2020b), energy allocation (Wilder, 2018a), classes of zero-sum games (Wilder, 2018b), online welfare maximization and online task assignment (Sadeghi et al., 2020), as well as many other settings of interest.
The research on maximizing continuous DR-submodular functions in the last few years has established strong theoretical results in different optimization settings including unconstrained (Niazadeh et al., 2018; Bian et al., 2019), stochastic (Mokhtari et al., 2018a; Hassani et al., 2017), online (Chen et al., 2018; Zhang et al., 2019; Sadeghi and Fazel, 2019; Raut et al., 2021), and parallel models of computation (Chen et al., 2019a; Mokhtari et al., 2018b; Xie et al., 2019; Ene and Nguyen, 2019).
A different line of work studies the maximization of discrete functions that can be represented as the sum of a non-negative monotone submodular function and a linear function. The ability to do so is useful in practice since the linear function can be viewed as a soft constraint, and it also has theoretical applications, as is argued by the first work in this line (Sviridenko et al., 2017) (for example, the problem of maximizing a monotone submodular function with bounded curvature can be reduced to the maximization of the sum of a monotone submodular function and a linear function). In terms of the approximation guarantee, the algorithms suggested by Sviridenko et al. (2017) were optimal. However, more recent works improve over the time complexities of these algorithms (Feldman, 2021; Harshaw et al., 2019; Ene et al., 2020), generalize them to weakly-submodular functions (Harshaw et al., 2019), and adapt them to other computational models such as the data stream and MapReduce models (Kazemi et al., 2020b; Ene et al., 2020).
2 Setting and Notation
Let us now formally define the setting we consider. Fix a subset $\mathcal{X}$ of $\mathbb{R}^n$ of the form $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$, where each $\mathcal{X}_i$ is a closed range in $\mathbb{R}$. Intuitively, a function $F \colon \mathcal{X} \to \mathbb{R}$ is called (continuous) DR-submodular if it exhibits diminishing returns in the sense that, given a vector $x \in \mathcal{X}$, the increase in $F$ obtained by increasing the coordinate $x_i$ (for any $i \in [n]$) by $\delta > 0$ does not grow as the coordinates of $x$ increase. This intuition is captured by the following definition. In this definition, $e_i$ denotes the standard basis vector in the direction of coordinate $i$.
Definition 2.1 (DR-submodular function).
A function $F \colon \mathcal{X} \to \mathbb{R}$ is DR-submodular if for every two vectors $x, y \in \mathcal{X}$, positive value $\delta$ and coordinate $i \in [n]$ we have
\[ F(x + \delta e_i) - F(x) \;\ge\; F(y + \delta e_i) - F(y) \]
whenever $x \le y$ and $y + \delta e_i \in \mathcal{X}$.¹
¹ Throughout the paper, inequalities between vectors should be understood as holding coordinate-wise.
It is well known that when $F$ is continuously differentiable, it is DR-submodular if and only if $\nabla F(x) \ge \nabla F(y)$ for every two vectors $x, y \in \mathcal{X}$ that obey $x \le y$. And when $F$ is a twice differentiable function, it is DR-submodular if and only if all the entries of its Hessian are non-positive at every vector $x \in \mathcal{X}$. Furthermore, for the sake of simplicity we assume in this work that $\mathcal{X} = [0, 1]^n$. This assumption is without loss of generality because the natural mapping from $\mathcal{X}$ to $[0, 1]^n$ preserves all our results.
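To make the definition concrete, here is a small numerical check of the diminishing-returns inequality for a quadratic function with an entry-wise non-positive Hessian; this is an illustrative script (the function, ranges, and tolerance are our own choices), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
H = -rng.uniform(0.0, 1.0, (n, n))   # entry-wise non-positive Hessian
H = (H + H.T) / 2                    # make it symmetric (stays entry-wise non-positive)
h = rng.uniform(0.0, 1.0, n)

def G(x):
    # Quadratic whose Hessian H is entry-wise non-positive, hence DR-submodular on [0, 1]^n.
    return 0.5 * x @ H @ x + h @ x

delta = 0.05
for _ in range(1000):
    x = rng.uniform(0.0, 0.9, n)
    y = x + rng.uniform(0.0, 0.05, n)     # x <= y, and y + delta * e_i stays inside [0, 1]^n
    i = rng.integers(n)
    e_i = np.eye(n)[i]
    lhs = G(x + delta * e_i) - G(x)       # marginal gain of the same step at the smaller point
    rhs = G(y + delta * e_i) - G(y)       # marginal gain at the larger point
    assert lhs >= rhs - 1e-12             # diminishing returns holds
print("diminishing-returns inequality held on all sampled pairs")
```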
We are interested in the problem of finding the point in some convex body $P \subseteq [0, 1]^n$ that maximizes a given function $F$ that can be expressed as the sum $F = G + C$ of a DR-submodular function $G$ and a concave function $C$. To get meaningful results for this problem, we need to make some assumptions. Here we describe three basic assumptions that we make throughout the paper. The quality of the results that we obtain improves if additional assumptions are made, as is described in Section 3.
Our first basic assumption is that $F$ is non-negative. This assumption is necessary since we obtain multiplicative approximation guarantees with respect to $F$, and such guarantees do not make sense when $F$ is allowed to take negative values.² Our second basic assumption is that $P$ is solvable, i.e., that one can efficiently optimize linear functions subject to it. Intuitively, this assumption makes sense because one should not expect to be able to optimize a complex function such as $F$ subject to $P$ if one cannot even optimize linear functions subject to it (nevertheless, it is possible to adapt our algorithms to obtain some guarantee even when linear functions can only be approximately optimized subject to $P$). Our final basic assumption is that both functions $G$ and $C$ are $L$-smooth, which means that they are differentiable, and moreover, their gradients are $L$-Lipschitz, i.e., $\|\nabla G(x) - \nabla G(y)\|_2 \le L \|x - y\|_2$ and $\|\nabla C(x) - \nabla C(y)\|_2 \le L \|x - y\|_2$ for every two vectors $x, y \in [0, 1]^n$.
² We note that almost all the literature on submodular maximization of both discrete and continuous functions assumes non-negativity for the same reason.
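For later reference, $L$-smoothness yields the standard quadratic lower bound that the analyses below use to control the loss incurred by each discrete step (a standard fact, recorded here for convenience):
\[ f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x \rangle - \frac{L}{2}\, \|y - x\|_2^2 \qquad \text{for every } x, y \in [0, 1]^n \text{ and } f \in \{G, C\}. \]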
We conclude this section by introducing some additional notation that we need. We denote by $o$ an arbitrary optimal solution for the problem described above. Additionally, given two vectors $x, y \in [0, 1]^n$, we denote by $x \odot y$ their coordinate-wise multiplication, and by $\langle x, y \rangle$ their standard Euclidean inner product.
3 Main Algorithms and Results
In this section we present our (first-order) algorithms for solving the problem described in Section 2. In general, these algorithms are all Frank-Wolfe type algorithms, but they differ in the exact linear function which is maximized in each iteration (step 1 of the while/for loop), and in the formula used to update the solution (step 2 of the while/for loop). As mentioned previously, we assume everywhere that $G$ is a non-negative $L$-smooth DR-submodular function, $C$ is an $L$-smooth concave function, and $P$ is a solvable convex body. Some of our algorithms require additional non-negativity and/or monotonicity assumptions on the functions $G$ and $C$, and occasionally they also require a down-closedness assumption on $P$. A summary of the settings to which each algorithm is applicable can be found in Table 1. Each algorithm listed in the table outputs a point $y \in P$ which is guaranteed to obey $F(y) \ge \alpha \cdot G(o) + \beta \cdot C(o) - \varepsilon$ for the constants $\alpha$ and $\beta$ given in Table 1 and some error term $\varepsilon > 0$.
| Algorithm (Section) | $G$ | $C$ | $P$ | $\alpha$ | $\beta$ |
|---|---|---|---|---|---|
| Greedy Frank-Wolfe (3.1) | monotone & non-neg. | monotone & non-neg. | general | $1 - 1/e$ | $1 - 1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | monotone & non-neg. | non-neg. | down-closed | $1 - 1/e$ | $1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | non-neg. | monotone & non-neg. | down-closed | $1/e$ | $1 - 1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | monotone & non-neg. | monotone & non-neg. | down-closed | $1 - 1/e$ | $1 - 1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | non-neg. | non-neg. | down-closed | $1/e$ | $1/e$ |
| Gradient Combining Frank-Wolfe (3.3) | monotone & non-neg. | general | general | $1/2$ | $1$ |
| Non-oblivious Frank-Wolfe (3.4) | monotone & non-neg. | non-neg. | general | $1 - 1/e$ | $1$ |
3.1 Greedy Frank-Wolfe Algorithm
In this section we assume that both $G$ and $C$ are monotone and non-negative functions (in addition to their other properties). Given this assumption, we analyze the guarantee of the greedy Frank-Wolfe variant appearing as Algorithm 1. This algorithm is related to the Continuous Greedy algorithm for discrete objective functions due to Călinescu et al. (2011), and it gets a quality control parameter $\varepsilon \in (0, 1)$. We assume in the algorithm that $\varepsilon^{-1}$ is an integer. This assumption is without loss of generality because, if $\varepsilon$ violates the assumption, then it can be replaced with a value from the range $[\varepsilon/2, \varepsilon]$ that obeys it.
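For intuition, the following sketch shows the overall structure of such a greedy Frank-Wolfe pass; `grad_F` and `linear_maximizer` are placeholder oracles, and the plain continuous-greedy update shown here is our reading of the two steps described above rather than a verbatim transcription of Algorithm 1.

```python
import numpy as np

def greedy_frank_wolfe(grad_F, linear_maximizer, n, eps):
    """Continuous-greedy style Frank-Wolfe sketch for maximizing F = G + C over P.

    grad_F(y): gradient oracle for F at y (length-n array).
    linear_maximizer(w): returns a point s in P maximizing <s, w> (the solvability assumption).
    eps: quality-control parameter; 1/eps is assumed to be an integer.
    """
    y = np.zeros(n)
    for _ in range(int(round(1.0 / eps))):
        s = linear_maximizer(grad_F(y))   # step 1: maximize a linear function over P
        y = y + eps * s                   # step 2: move an eps fraction towards the chosen point
    return y                              # an average of 1/eps points of P, hence a point of P
```

Since the returned point is an average of $1/\varepsilon$ points of $P$, it is a convex combination of points of $P$ and hence feasible, which is exactly the observation made next.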
One can observe that the output of Algorithm 1 is within the convex body $P$ because it is a convex combination of the vectors $s^{(1)}, s^{(2)}, \ldots, s^{(1/\varepsilon)}$ chosen by the algorithm, which are all vectors in $P$. Let us now analyze the value of the output of Algorithm 1. The next lemma is the first step towards this goal. It provides a lower bound on the increase in the value of the solution maintained by the algorithm as a function of the iteration counter.
Lemma 3.1.
For every , .
Proof.
Observe that
where the last inequality follows since $s^{(t)}$ is the maximizer found by Algorithm 1 in this iteration. Therefore, to prove the lemma it only remains to show that $\langle \nabla F(y^{(t)}), o \rangle \ge F(o) - F(y^{(t)})$.
Since $C$ is concave and monotone, $\langle \nabla C(y^{(t)}), o \rangle \ge \langle \nabla C(y^{(t)}), o - y^{(t)} \rangle \ge C(o) - C(y^{(t)})$.
Similarly, since $G$ is DR-submodular and monotone, $\langle \nabla G(y^{(t)}), o \rangle \ge \langle \nabla G(y^{(t)}), (o \vee y^{(t)}) - y^{(t)} \rangle \ge G(o \vee y^{(t)}) - G(y^{(t)}) \ge G(o) - G(y^{(t)})$.
Adding the last two inequalities and rearranging gives the inequality that we wanted to prove. ∎
The corollary below follows by showing (via induction) that Inequality (1) below holds for every integer $0 \le \ell \le \varepsilon^{-1}$, and then plugging in $\ell = \varepsilon^{-1}$. The inductive step is proven using Lemma 3.1.
Corollary 3.2.
.
Proof.
We prove by induction that for every integer we have
(1)
One can note that the corollary follows from this inequality by plugging in $\ell = \varepsilon^{-1}$, since $(1 - \varepsilon)^{1/\varepsilon} \le e^{-1}$.
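To see the shape of this induction, note that under the natural reading of Lemma 3.1 (a per-step gain of at least $\varepsilon[F(o) - F(y^{(\ell)})]$ minus an $O(\varepsilon^2)$ smoothness error, whose exact constant we do not track here), one gets
\[ F(y^{(\ell+1)}) \;\ge\; (1 - \varepsilon)\, F(y^{(\ell)}) + \varepsilon\, F(o) - O(\varepsilon^2) \;\ge\; \bigl[1 - (1 - \varepsilon)^{\ell + 1}\bigr] F(o) - (\ell + 1) \cdot O(\varepsilon^2), \]
so that plugging $\ell + 1 = \varepsilon^{-1}$ and using $(1 - \varepsilon)^{1/\varepsilon} \le e^{-1}$ yields $F(y^{(1/\varepsilon)}) \ge (1 - 1/e)\, F(o) - O(\varepsilon)$.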
We are now ready to summarize the properties of Algorithm 1 in a theorem.
Theorem 3.3.
Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function. Then Algorithm 1 performs $\varepsilon^{-1}$ iterations, each of which requires one such linear maximization (time $T_P$) and one call to the gradient oracle of $F$, and it outputs a vector $y \in P$ such that $F(y) \ge (1 - 1/e)\bigl(G(o) + C(o)\bigr) - O(\varepsilon)$.
3.2 Measured Greedy Frank-Wolfe Algorithm
In this section we assume that $P$ is a down-closed body (in addition to being convex) and that $G$ and $C$ are both non-negative. Given these assumptions, we analyze the guarantee of the variant of Frank-Wolfe appearing as Algorithm 2, which is motivated by the Measured Continuous Greedy algorithm for discrete objective functions due to Feldman et al. (2011). We again have a quality control parameter $\varepsilon \in (0, 1)$, and assume (without loss of generality) that $\varepsilon^{-1}$ is an integer.
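A minimal sketch of the measured variant, assuming the standard measured update $y \leftarrow y + \varepsilon\, s \odot (\mathbf{1} - y)$ of Feldman et al. (2011); the exact linear objective maximized in step 1 of Algorithm 2 may differ from the gradient weighting shown here.

```python
import numpy as np

def measured_greedy_frank_wolfe(grad_F, linear_maximizer, n, eps):
    """Measured-greedy Frank-Wolfe sketch for F = G + C over a down-closed convex body P."""
    y = np.zeros(n)
    for _ in range(int(round(1.0 / eps))):
        # Bias the linear objective away from coordinates that are already close to 1.
        s = linear_maximizer(grad_F(y) * (1.0 - y))
        y = y + eps * s * (1.0 - y)   # measured step: only an eps fraction of the remaining slack
    return y                          # dominated by an average of points of P, hence in P by down-closedness
```

The measured step adds only an $\varepsilon$ fraction of the remaining slack $1 - y$ in each coordinate, which is what keeps the iterates inside $[0, 1]^n$ and, combined with down-closedness, inside $P$.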
We begin the analysis of Algorithm 2 by bounding the range of the entries of the vectors it maintains; the following lemma states this bound, which is proven by induction.
Lemma 3.4.
For every two integers and , .
Proof.
We prove the lemma by induction on . For the lemma is trivial since by the initialization of . Assume now that the lemma holds for , and let us prove it for . Observe that
where the first inequality holds since by the induction hypothesis and since . Similarly,
where the first inequality holds because since , and the second inequality holds by the induction hypothesis. ∎
Using the last lemma, we can see why the output of Algorithm 2 must be in $P$. Recall that the output of Algorithm 2 is its final iterate, which, by Lemma 3.4 and the fact that every $s^{(i)}$ (as a vector in $[0, 1]^n$) is non-negative, is coordinate-wise dominated by $\varepsilon \sum_i s^{(i)}$. Since $\varepsilon \sum_i s^{(i)}$ is a convex combination of points in $P$, it also belongs to $P$. Since the output vector is coordinate-wise dominated by this combination, the down-closedness of $P$ implies that it belongs to $P$ as well.
Our next objective is to prove an approximation guarantee for Algorithm 2. Towards this goal, let us define, for every function $f \in \{G, C\}$, an expression whose value depends on whether $f$ is monotone or non-monotone; in the final guarantee this expression evaluates to $1 - 1/e$ in the monotone case and to $1/e$ in the non-monotone case. Using this definition, we can state the following lemma and corollary, which are counterparts of Lemma 3.1 and Corollary 3.2 from Section 3.1.
Lemma 3.5.
For every integer , .
Proof.
Observe that
Furthermore, we also have
where the inequality holds since is the maximizer found by Algorithm 2. Combining the last two inequalities, we get
(2)
Let us now find a lower bound on the expression in the above inequality. Since is concave,
Similarly, since is DR-submodular,
Adding the last two inequalities provides the promised lower bound on . Plugging this lower bound into Inequality (2) yields
Given the last inequality, to prove the lemma it remains to show that and . Since , these inequalities follow immediately when and are monotone, respectively. Therefore, we concentrate on proving these inequalities when and are non-monotone. Since is concave,
where the second inequality follows from the non-negativity of . We also note that
since by Lemma 3.4. Similarly, since is DR-submodular,
where the second inequality follows from the non-negativity of , and the first inequality holds since
because by Lemma 3.4. ∎
Corollary 3.6.
Proof.
For every function , let be if is monotone, and if is non-monotone. We prove by induction that for every integer we have
(3)
One can note that the corollary follows from this inequality by plugging since when is monotone and when is non-monotone.
For , Inequality (3) follows from the non-negativity of since both when is monotone and non-monotone. Next, let us prove Inequality (3) for assuming it holds for . We note that . For a monotone this is true since
and for a non-monotone this is true since
Using Lemma 3.5, we now get
where the second inequality holds due to the induction hypothesis. ∎
We are now ready to state and prove our main theorem for Algorithm 2.
Theorem (Measured Greedy Frank-Wolfe). Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function. Then Algorithm 2 performs $\varepsilon^{-1}$ iterations, each requiring one such linear maximization (time $T_P$) and one call to the gradient oracle of $F$, and it outputs a vector $y \in P$ such that $F(y) \ge \alpha \cdot G(o) + \beta \cdot C(o) - O(\varepsilon)$, where $\alpha$ and $\beta$ are the coefficients listed for Algorithm 2 in Table 1 (according to the monotonicity of $G$ and $C$).
Proof.
3.3 Gradient Combining Frank-Wolfe Algorithm
Up to this point, the guarantees of the algorithms that we have seen had both $\alpha$ and $\beta$ strictly smaller than $1$. However, since concave functions can be maximized exactly, it is reasonable to also expect algorithms for which the coefficient $\beta$ associated with $C$ is equal to $1$. In Sections 3.3 and 3.4, we describe such algorithms.
In this section, we assume that $G$ is a monotone and non-negative function (in addition to its other properties). The algorithm we study in this section is Algorithm 3, and it again takes a quality control parameter $\varepsilon \in (0, 1)$ as input. This time, however, the algorithm assumes that a different quantity derived from $\varepsilon$ is an integer. As usual, if that is not the case, then $\varepsilon$ can be replaced with a value from the range $[\varepsilon/2, \varepsilon]$ that has this property.
Firstly, note that every iterate $y^{(i)}$ of Algorithm 3 belongs to $P$; and therefore, the output of Algorithm 3 also belongs to $P$. For $i = 0$ this holds by the initialization of $y^{(0)}$. For larger values of $i$, this follows by induction because $y^{(i)}$ is a convex combination of $y^{(i-1)}$ and the point $s^{(i)}$ chosen in iteration $i$ ($y^{(i-1)}$ belongs to $P$ by the induction hypothesis, and $s^{(i)}$ belongs to $P$ by definition).
Our next objective is to lower bound the value of the output point of Algorithm 3. For that purpose, it will be useful to define a potential function $\Phi(i)$ associated with the iterate $y^{(i)}$. To get a bound on the value of the output of Algorithm 3, we first show that $\Phi(i)$ is small for at least some value of $i$. We do that using the next lemma, which shows that $\Phi(i)$ decreases as a function of $i$ as long as it is not already small. Then, Corollary 3.8 guarantees the existence of a good iteration $i$.
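A rough sketch of the gradient-combining idea: each Frank-Wolfe step maximizes a linear function built from a weighted combination of $\nabla G$ and $\nabla C$, and the best iterate (by value of $F$) is returned. The weights `w_g` and `w_c`, the warm start `y0`, and the iteration count are illustrative assumptions, not the exact choices made by Algorithm 3.

```python
import numpy as np

def gradient_combining_frank_wolfe(F, grad_G, grad_C, linear_maximizer, y0, eps, w_g=1.0, w_c=1.0):
    """Frank-Wolfe sketch whose linear step combines the gradients of G and C.

    F: value oracle for G + C (used only to return the best iterate seen).
    y0: starting point in P, e.g. an approximate maximizer of C over P.
    w_g, w_c: illustrative weights on the two gradients.
    """
    y = np.asarray(y0, dtype=float)
    best = y.copy()
    for _ in range(int(round(1.0 / eps))):
        s = linear_maximizer(w_g * grad_G(y) + w_c * grad_C(y))  # step 1: combined linear objective over P
        y = (1.0 - eps) * y + eps * s                            # step 2: convex combination keeps y inside P
        if F(y) > F(best):
            best = y.copy()
    return best                                                  # the best iterate by objective value
```

Returning the best iterate matches the remark in Section 4.2.1 that Algorithms 3 and 4 output the best among the results of all their iterations.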
Lemma 3.7.
For every integer , .
Proof.
Observe that
To make the expression on the rightmost side useful, we need to lower bound the inner product .
where the inequality holds since is the maximizer found by Algorithm 3. Combining the last two inequalities yields
Therefore, to complete the proof of the lemma it remains to prove that and .
The inequality follows immediately from the concavity of . To prove the other inequality, observe that the DR-submodularity of implies for every two vectors and that obey , and is either non-negative or non-positive. This observation implies
where the last inequality follows from the monotonicity and non-negativity of . ∎
Corollary 3.8.
There is an integer obeying .
Proof.
Assume towards a contradiction that the corollary does not hold. By Lemma 3.7, this implies
for every integer . Adding up this inequality for all these values of , we get
However, by the choice of , is not too large. Specifically,
where the inequality holds since is one possible candidate to be and is non-negative. Therefore, we get
which by the non-negativity of and implies
(which contradicts our assumption). ∎
We are now ready to summarize the properties of Algorithm 3 in a theorem.
Theorem 3.9.
Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function and $T_C$ be the time it takes to find a point in $P$ maximizing $C$ up to an error of $\varepsilon$. Then Algorithm 3 performs one (approximate) maximization of $C$ over $P$ (time $T_C$) followed by a sequence of Frank-Wolfe iterations, each requiring one linear maximization over $P$ (time $T_P$) and $O(1)$ gradient oracle calls, and it outputs a vector $y \in P$ such that $F(y) \ge \tfrac{1}{2}\, G(o) + C(o) - O(\varepsilon)$.
Proof.
We begin the proof by analyzing the time and oracle complexities of Algorithm 3. Every iteration of the loop of Algorithm 3 takes time. As there are such iterations, the entire algorithm runs in time. Also note that each iteration of the loop requires 2 calls to the gradient oracles (a single call to the oracle corresponding to , and a single call to the oracle corresponding to ), so the overall oracle complexity of the algorithm is .
3.4 Non-oblivious Frank-Wolfe Algorithm
As mentioned in the beginning of Section 3.3, our objective in this section is to present another algorithm that has $\beta = 1$ (i.e., it maximizes $C$ “exactly” in some sense). In Section 3.3, we presented Algorithm 3, which achieves this goal with $\alpha = 1/2$. The algorithm we present in the current section achieves the same goal with an improved value of $1 - 1/e$ for $\alpha$. However, the improvement is obtained at the cost of requiring the function $C$ to be non-negative (which was not required in Section 3.3). Additionally, like in the previous section, we assume here that $G$ is a monotone and non-negative function (in addition to its other properties).
The algorithm we study in this section is a non-oblivious variant of the Frank-Wolfe algorithm, appearing as Algorithm 4, which takes a quality control parameter $\varepsilon \in (0, 1)$ as input. As usual, we assume without loss of generality that $\varepsilon^{-1}$ is an integer. Algorithm 4 also employs a non-negative auxiliary function constructed from $G$.
This function is inspired by the non-oblivious objective function used by Filmus and Ward (2012).
Note that any call to the gradient oracle of the auxiliary function can be simulated using multiple calls to the gradient oracle of $G$. The properties of Algorithm 4 are stated in the theorem at the end of this section. We begin the analysis of Algorithm 4 by observing that every iterate $y^{(i)}$ of the algorithm belongs to $P$; and therefore, the output of Algorithm 4 also belongs to $P$. The proof for this is identical to the corresponding proof in Section 3.3.
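For intuition only: Filmus–Ward-style non-oblivious potentials replace the gradient of $G$ by a reweighted average of gradients of $G$ along the segment from the origin to the current point. A common continuous form of such a boosted gradient (stated here as an illustration, since the exact weights used by Algorithm 4 may differ) is
\[ \int_{0}^{1} e^{z - 1}\, \nabla G(z \cdot y)\, dz , \]
and approximating this integral by a finite weighted sum is what allows one call to the auxiliary gradient oracle to be simulated by several calls to the gradient oracle of $G$.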
Let us now analyze the auxiliary function. Specifically, we need to show that it is almost as smooth as the original function $G$, and that, for every given vector, its value is comparable to the value of $G$ at that vector. We state these properties as observations below.
Observation 3.10.
The auxiliary function is -smooth.
Proof.
For every two vectors ,
Observation 3.11.
For every vector , .
Proof.
Note that, by the monotonicity of ,
where the last inequality holds since
We now define a potential $\Phi(i)$ associated with the iterate $y^{(i)}$ of Algorithm 4. To get a bound on the value of the output of Algorithm 4, we need to show that there exists at least one value of $i$ for which $\Phi(i)$ is small, which intuitively means that $y^{(i)}$ is roughly a local maximum with respect to the auxiliary function. The following lemma proves that such a value indeed exists.
Lemma 3.12.
There is an integer such that .
Proof.
Assume towards a contradiction that the lemma does not hold. This implies
Furthermore, using the non-negativity of , the last inequality implies
Let us now upper bound the two terms on the leftmost side of the above inequality. By Observation 3.11, we can upper bound by . Additionally, since because , we can upper bound by . Plugging both these upper bounds into the previous inequalities and dividing by gives
which contradicts the definition of , and thus, completes the proof of the lemma. ∎
The preceding lemma shows that some iterate of Algorithm 4 is roughly a local optimum in terms of the auxiliary function. We now show an upper bound that this property of the iterate implies.
Lemma 3.13.
.
Proof.
Observe that
To make the expression in the rightmost side useful, we need to lower bound the expression .
where the inequality holds since is the maximizer found by Algorithm 4. Combining the last two inequalities and the guarantee on from Lemma 3.12, we now get
where the last inequality follows from the concavity of . The lemma now follows by rearranging the last inequality (and dividing it by ). ∎
We now wish to lower bound and we do so by separately analyzing and in the following two lemmas.
Lemma 3.14.
.
Proof.
For brevity, we use in this proof the shorthand . Then, for every integer , we get (in these calculations, the expression represents the gradient of the function at the point ).
where the first and last inequalities follow from the monotonicity of and the second inequality follows from the DR-submodularity of . Using the last inequality, we now get
where the second inequality holds since the fact that is a geometric sum implies
Lemma 3.15.
.
Proof.
In the following calculation, the expression again represents the gradient of the function at the point ). Thus,
(4)
where the inequality holds because
We now observe also that the DR-submodularity of implies, for every integer , that
where the last inequality follows from the non-negativity of . Plugging the last inequality now into Inequality (4) yields
(5)
where the last inequality follows from the monotonicity of (since ) and the second inequality follows from Chebyshev’s sum inequality because the monotonicity of implies that both and are non-decreasing functions of . Additionally, the penultimate inequality follows due to the following calculation, which holds for every .
To complete the proof of the lemma, it remains to show that the coefficient of in the rightmost side of Inequality (5) is upper bounded by . To do that, we note that this coefficient is
where the first inequality and inequalities hold because and for , and the last inequality holds for . ∎
Combining the above lemmas then leads to the following corollary.
Corollary 3.16.
.
Proof.
We are now ready to state and prove the main theorem for Algorithm 4.
Theorem (Non-oblivious Frank-Wolfe). Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function. Then Algorithm 4 runs in time polynomial in $n$, $\varepsilon^{-1}$ and $T_P$, makes $\mathrm{poly}(\varepsilon^{-1})$ gradient oracle calls, and outputs a vector $y \in P$ such that $F(y) \ge (1 - 1/e)\, G(o) + C(o) - O(\varepsilon)$.
Proof.
Once more, we begin by analyzing the time and oracle complexities of the algorithm. Every iteration of the loop of Algorithm 4 takes time. As there are such iterations, the entire algorithm runs in time. Furthermore, as each iteration of the loop requires oracle calls (a single call to the oracle corresponding to , and calls to the oracle corresponding to ), the overall oracle complexity is .
4 Experiments
In this section we describe some experiments pertaining to our algorithms for maximizing DR-submodular + concave functions. All experiments are done on a 2015 Apple MacBook Pro with a quad-core 2.2 GHz i7 processor and 16 GB of RAM.
4.1 Interpolating Between Contrasting Objectives
We use our algorithms for maximizing the sum of a DR-submodular function and a concave function to provide a way to achieve a trade-off between different objectives. For example, given a ground set and a DPP supported on the power set , the maximum a posteriori (MAP) of the DPP corresponds to picking the most likely (diverse) set of elements according to the DPP. On the other hand, concave functions can be used to encourage points being closer together and clustered.
Finding the MAP of a DPP is an NP-hard problem. However, continuous approaches employing the multilinear extension (Călinescu et al., 2011) or the softmax extension (Bian et al., 2017a; Gillenwater et al., 2012) provide strong approximation guarantees for it. The softmax approach is usually preferred as it has a closed form which is easier to work with. Now, suppose that the ground set has $n$ elements, and let $\mathbf{L}$ be the kernel of the DPP and $I$ be the $n \times n$ identity matrix; then $G(x) = \log\det\bigl(\operatorname{diag}(x)\,(\mathbf{L} - I) + I\bigr)$ is the softmax extension for $x \in [0, 1]^n$. Here, $\operatorname{diag}(x)$ corresponds to a diagonal matrix with the entries of $x$ along its diagonal.
Observe now that, given a vector , can be thought of as the likelihood of picking element . Moreover, captures the similarity between elements and . Hence, our choice for a concave function which promotes similarity among elements is . The rationale behind this is as follows. For a particular pair of elements and , if is large, that means that and are similar, so we would want to be larger when is high, provided that we are indeed picking both and (i.e., provided that is small). One can verify that the function is indeed concave as its Hessian is negative semidefinite.
In our first experiment we fix the ground set to be a set of 400 points evenly spaced on a $20 \times 20$ grid in the plane. We also choose $\mathbf{L}$ to be the Gaussian kernel $\mathbf{L}_{ij} = \exp\bigl(-d_{ij}^{2} / (2\sigma^{2})\bigr)$, where $d_{ij}$ is the Euclidean distance between points $i$ and $j$, and $\sigma$ is a bandwidth parameter. But to define points $i$ and $j$, we need some ordering of the 400 points on the plane. So, we call point 0 the point at the origin, and the index increases along the first row of the grid until we reach point 19. Then we have point 20 at the start of the second row, and so on, until we finally reach point 399 at the opposite corner of the grid. Given the functions $G$ and $C$ defined above, we optimize in this experiment a combined objective formally specified by $F_\lambda(x) = \lambda\, G(x) + (1 - \lambda)\, C(x)$, where $\lambda \in [0, 1]$ is a control parameter that can be used to balance the contrasting objectives represented by $G$ and $C$. For example, setting $\lambda = 1$ produces the (spread out) pure DPP MAP solution, setting $\lambda = 0$ produces the (clustered) pure concave solution, and an intermediate value of $\lambda$ produces a solution that takes both objectives into consideration to some extent. It is important to note, however, that the effect of changing $\lambda$ on the importance that each objective gets is not necessarily linear, although it becomes linear when the ranges of $G$ and $C$ happen to be similar.
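A small sketch of the ingredients of this experiment (the grid, a Gaussian kernel, and the softmax-extension value), assuming the generic bandwidth parameter `sigma` and the log-det form given above; the specific parameter values used in the experiment are not reproduced here.

```python
import numpy as np

def gaussian_kernel(points, sigma):
    """L[i, j] = exp(-||p_i - p_j||^2 / (2 sigma^2)) for an array of 2-D points."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def softmax_extension(x, L):
    """Softmax extension of a DPP with kernel L: log det(diag(x) (L - I) + I)."""
    n = len(x)
    _, logdet = np.linalg.slogdet(np.diag(x) @ (L - np.eye(n)) + np.eye(n))
    return logdet                       # the determinant is non-negative for x in [0, 1]^n and PSD L

# A 20 x 20 grid of 400 points in the plane, ordered row by row with point 0 at the origin.
grid = np.array([(i % 20, i // 20) for i in range(400)], dtype=float)
L = gaussian_kernel(grid, sigma=1.0)    # illustrative bandwidth
x = np.full(400, 25.0 / 400.0)          # a feasible point of the cardinality-25 polytope
print(softmax_extension(x, L))
```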

In Figure 2, we can see how changing $\lambda$ changes the solution. The plots in the figure show the final output of Algorithm 3 when run on just the submodular objective (left plot), just the concave objective (right plot), and a combination of both (middle plot). The algorithm is run with the same cardinality constraint of 25 in all plots, which corresponds to imposing that the sum of the coordinates of each iterate must be at most 25. It is important to note that we present the exact (continuous) output of the algorithm here; to get a discrete solution, a rounding method should be applied. Also, all the runs of the algorithm begin from the same fixed starting point inside the cardinality constrained polytope. The step sizes used by the different runs are all constant, and were chosen empirically.
4.2 Other Submodular + Concave Objectives
In this section we compare our algorithms with two baselines: the Frank-Wolfe algorithm (Frank and Wolfe, 1956) and the projected gradient ascent algorithm. In all the experiments done in this section, the objective function is again of the form $F = G + C$, where $G$ and $C$ are a DR-submodular function and a concave function, respectively. For simplicity, we use the same setting of the parameters throughout the section.
4.2.1 Quadratic Programming
In this section, we define $G(x) = \tfrac{1}{2} x^{\top} H x + h^{\top} x + c$. Note that by choosing an appropriate matrix $H$ and vector $h$, this objective can be made to be a monotone or non-monotone DR-submodular function. We also define the down-closed constraint set to be $P = \{x \in [0, \bar{u}] \mid Ax \le b\}$ for a non-negative matrix $A$ and a non-negative vector $b$. Following Bian et al. (2017a), we choose the matrix $H$ to be a randomly generated symmetric matrix with non-positive entries drawn uniformly at random, and the matrix $A$ to be a random matrix with entries drawn uniformly at random and shifted by a small positive constant (the addition here is used to ensure that the entries are all positive). The vector $b$ is chosen as the all ones vector, and the vector $\bar{u}$ is a tight coordinate-wise upper bound on $P$ whose $j$-th coordinate is defined as $\bar{u}_j = \min_i b_i / A_{ij}$. We choose the vector $h$ in a way that makes $G$ non-monotone. Finally, although this does not affect the results of our experiments, we take $c$ to be a large enough additive constant (in this case 10) to make $G$ non-negative.
It is known that when the Hessian of a quadratic program is negative semidefinite, the resulting objective is concave. Accordingly, we let $C(x) = \tfrac{1}{2} x^{\top} \tilde{H} x$, where $\tilde{H}$ is a negative semidefinite matrix defined as the negation of the product of a random matrix (with entries drawn uniformly at random) with its transpose. As one can observe, the generality of DR-submodular + concave objectives allows us to consider quadratic programming with very different Hessians. We hope that our ability to do this will motivate future work on quadratic programming for a broader class of Hessian matrices.
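The following sketch builds one instance of this quadratic DR-submodular + concave objective; the dimensions, entry ranges, the shift added to $A$, and the specific linear term $h$ are illustrative choices rather than the exact values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 15, 8                                  # illustrative problem dimensions

# DR-submodular quadratic G(x) = 0.5 x^T H x + h^T x + c with entry-wise non-positive H.
H = -rng.uniform(0.0, 1.0, (n, n)); H = (H + H.T) / 2
A = rng.uniform(0.0, 1.0, (m, n)) + 0.01      # strictly positive constraint matrix
b = np.ones(m)
u_bar = (b[:, None] / A).min(axis=0)          # tight coordinate-wise upper bound on {x >= 0 : Ax <= b}
h = -0.1 * (H.T @ u_bar)                      # illustrative linear term that makes G non-monotone
c = 10.0                                      # additive constant keeping G non-negative on the box

def G(x):
    return 0.5 * x @ H @ x + h @ x + c

# Concave quadratic C(x) = 0.5 x^T Htil x with a negative semidefinite Htil.
M = rng.uniform(0.0, 1.0, (n, n))
Htil = -(M @ M.T)

def C(x):
    return 0.5 * x @ Htil @ x

def F(x):
    return G(x) + C(x)

print(F(np.zeros(n)), F(0.5 * u_bar))
```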
In the current experiment, we consider a few combinations of the number of variables $n$ and the number of constraints $m$, and run each algorithm for 50 iterations. Note that having fixed the number of iterations, the maximum step size for Algorithms 1 and 2 is upper bounded by one over the number of iterations to guarantee that these algorithms remain within the polytope. To ensure consistency, we set the step sizes of the other algorithms to this value as well, except for Algorithm 4, for which we choose the step size (equivalently, the value of $\varepsilon$) by solving for the value consistent with this iteration budget; this also ensures that the gradient computation in Algorithm 4 is not too time consuming. We start Algorithms 1 and 2 from the starting point their pseudocodes specify, and the other algorithms from the same arbitrary point. We show the results for the aforementioned values of $n$ and $m$ in Figure 4. The appropriate values of $n$ and $m$ are mentioned below each plot, and each plot graphs the average of several runs of the experiment. We also note that since Algorithms 3 and 4 output the best among the results of all their iterations, we just plot the final output of these algorithms instead of their entire trajectory.
4.2.2 D-optimal Experimental Design
Following Chen et al. (2018), the DR-submodular objective function for the D-optimal experimental design problem is $G(x) = \log\det\bigl(\sum_{i=1}^{n} x_i\, a_i^{\top} a_i\bigr)$, where $a_i$ is a $d$-dimensional row-vector in which each entry is drawn independently from the standard Gaussian distribution. The choice of concave function is a smooth concave regularizer defined on the same box. In this experiment there is no combinatorial constraint. Instead, we are interested in maximization over a box constraint, i.e., over a box that is shifted compared to the standard $[0, 1]^n$ to ensure that the objective is defined everywhere (the log-determinant is undefined at $x = 0$). The final outputs of all the algorithms appear in Figure 4(h). Like in Section 4.2.1, each algorithm was run for 50 iterations, and each plot is the average of several runs. The step sizes and starting points used by the algorithms are set exactly like in Section 4.2.1.
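A sketch of this objective; the concave term shown (a sum of logarithms, which is indeed undefined at the origin) is an illustrative stand-in rather than the exact choice of $C$ used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.standard_normal((n, d))               # row a_i is the measurement vector of design point i

def G(x):
    """D-optimal design objective: log det(sum_i x_i a_i a_i^T)."""
    _, logdet = np.linalg.slogdet(A.T @ (x[:, None] * A))
    return logdet

def C(x):
    """Illustrative concave term (undefined at 0, hence the shifted box constraint)."""
    return np.log(x).sum()

def F(x):
    return G(x) + C(x)

x = np.full(n, 0.5)                            # a point inside the shifted box
print(F(x))
```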
Takeaways.
Based on our experiments, we can observe that Algorithms 1 and 4 consistently outperform the other algorithms. We can also see (especially in the D-optimal experimental design problem, where they almost superimpose) that the differences between Algorithm 3 and the standard Frank-Wolfe algorithm are minimal, but we believe that the difference between the two algorithms can be made more pronounced by considering settings in which the gradient of one of the two functions dominates the gradient of the other. Finally, one can note that the output values in the plots corresponding to the quadratic programming problem discussed in Section 4.2.1 tend to decrease when the number of constraints increases, which matches our intuitive expectation.
5 Conclusion
In this paper, we have considered the maximization of a class of objective functions that is strictly larger than both DR-submodular functions and concave functions. The ability to optimize this class of functions using first-order information is interesting from both theoretical and practical points of view. Our results provide a step towards the goal of efficiently analyzing structured non-convex functions—a goal that is becoming increasingly relevant.
References
- Alieva et al. [2021] Ayya Alieva, Aiden Aceves, Jialin Song, Stephen Mayo, Yisong Yue, and Yuxin Chen. Learning to make decisions via submodular regularization. In International Conference on Learning Representations, 2021.
- Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, 2019.
- Anari et al. [2016] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte carlo markov chain algorithms for sampling strongly rayleigh distributions and determinantal point processes. In 29th Annual Conference on Learning Theory, 2016.
- Arora et al. [2012] Sanjeev Arora, Rong Ge, Ravi Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, 2012.
- Bian et al. [2017a] An Bian, Kfir Y. Levy, Andreas Krause, and Joachim M. Buhmann. Continuous dr-submodular maximization: Structure and algorithms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017a.
- Bian et al. [2019] An Bian, Joachim M. Buhmann, and Andreas Krause. Optimal dr-submodular maximization and applications to provable mean field inference. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Bian et al. [2017b] Andrew An Bian, Baharan Mirzasoleiman, Joachim M. Buhmann, and Andreas Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In AISTATS, 2017b.
- Bresler et al. [2019] Guy Bresler, Frederic Koehler, Ankur Moitra, and Elchanan Mossel. Learning restricted boltzmann machines via influence maximization. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.
- Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 2015.
- Călinescu et al. [2011] Gruia Călinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Comput., 2011.
- Chen et al. [2018] Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.
- Chen et al. [2019a] Lin Chen, Moran Feldman, and Amin Karbasi. Unconstrained submodular maximization with constant adaptive complexity. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, 2019a.
- Chen et al. [2019b] Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Mathematical Programming, 2019b.
- Derezinski et al. [2020] Michal Derezinski, Feynman Liang, and Michael Mahoney. Bayesian experimental design using regularized determinantal point processes. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 2020.
- Djolonga and Krause [2014] Josip Djolonga and Andreas Krause. From map to marginals: Variational inference in bayesian submodular models. In Advances in Neural Information Processing Systems, 2014.
- Du et al. [2019] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
- Dwivedi et al. [2019] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast. Journal of Machine Learning Research, 2019.
- Elenberg et al. [2016] Ethan R. Elenberg, Rajiv Khanna, Alexandros G. Dimakis, and Sahand Negahban. Restricted strong convexity implies weak submodularity. In Proceedings of the 30th Conference on Neural Information Processing Systems, 2016.
- Elenberg et al. [2017] Ethan R. Elenberg, Alexandros G. Dimakis, Moran Feldman, and Amin Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- Ene and Nguyen [2019] Alina Ene and Huy L. Nguyen. Parallel algorithm for non-monotone dr-submodular maximization. In Proceedings of the 37th International Conference on Machine Learning, 2019.
- Ene et al. [2020] Alina Ene, Sofia Maria Nikolakaki, and Evimaria Terzi. Team formation: Striking a balance between coverage and cost. CoRR, 2020.
- Feldman [2021] Moran Feldman. Guess free maximization of submodular and linear sums. Algorithmica, 2021.
- Feldman et al. [2011] Moran Feldman, Joseph Naor, and Roy Schwartz. A unified continuous greedy algorithm for submodular maximization. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, 2011.
- Filmus and Ward [2012] Yuval Filmus and Justin Ward. A tight combinatorial algorithm for submodular maximization subject to a matroid constraint. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, 2012.
- Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
- Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points — online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, 2015.
- Gillenwater et al. [2012] J. Gillenwater, A. Kulesza, and B. Taskar. Near-Optimal MAP Inference for Determinantal Point Processes. In Proc. Neural Information Processing Systems (NIPS), 2012.
- Harshaw et al. [2019] Chris Harshaw, Moran Feldman, Justin Ward, and Amin Karbasi. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. In ICML, 2019.
- Hassani et al. [2017] Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient methods for submodular maximization. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- Hassani et al. [2020a] Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++: (non)convex minimization and continuous submodular maximization. SIAM Journal on Optimization, 2020a.
- Hassani et al. [2020b] Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++, 2020b.
- Hazan et al. [2016] Elad Hazan, Kfir Y. Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
- He [2010] Xiaofei He. Laplacian regularized d-optimal design for active learning and its application to image retrieval. IEEE Transactions on Image Processing, 2010.
- Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
- Kazemi et al. [2020a] Ehsan Kazemi, Shervin Minaee, Moran Feldman, and Amin Karbasi. Regularized submodular maximization at scale. CoRR, 2020a.
- Kazemi et al. [2020b] Ehsan Kazemi, Shervin Minaee, Moran Feldman, and Amin Karbasi. Regularized submodular maximization at scale. CoRR, 2020b.
- Kempe et al. [2003] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03. Association for Computing Machinery, 2003.
- Krause and Cevher [2010] Andreas Krause and Volkan Cevher. Submodular dictionary selection for sparse representation. In Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010.
- Kulesza [2012] Alex Kulesza. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 2012.
- Lattimore and Szepesvari [2020] Tor Lattimore and Csaba Szepesvari. Bandit algorithms. Cambridge University Press, 2020.
- Mirzasoleiman et al. [2013] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, 2013.
- Mokhtari et al. [2018a] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018a.
- Mokhtari et al. [2018b] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Decentralized submodular maximization: Bridging discrete and continuous settings. In Proceedings of the 35th International Conference on Machine Learning, 2018b.
- Murty and Kabadi [1987] K. G. Murty and S. N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical Programming, 1987.
- Nesterov [2018] Yurii Nesterov. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2018.
- Netrapalli et al. [2014] Praneeth Netrapalli, U N Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-convex robust pca. In Advances in Neural Information Processing Systems, 2014.
- Niazadeh et al. [2018] Rad Niazadeh, Tim Roughgarden, and Joshua R. Wang. Optimal algorithms for continuous non-monotone submodular and dr-submodular maximization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
- Raut et al. [2021] Prasanna Sanjay Raut, Omid Sadeghi, and Maryam Fazel. Online dr-submodular maximization with stochastic cumulative constraints, 2021.
- Rebeschini and Karbasi [2015] Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. In Proceedings of The 28th Conference on Learning Theory, 2015.
- Robinson et al. [2019] Joshua Robinson, Suvrit Sra, and Stefanie Jegelka. Flexible modeling of diversity with strongly log-concave distributions. In Advances in Neural Information Processing Systems, 2019.
- Sadeghi and Fazel [2019] Omid Sadeghi and Maryam Fazel. Online continuous DR-submodular maximization with long-term budget constraints. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 2019.
- Sadeghi et al. [2020] Omid Sadeghi, Prasanna Raut, and Maryam Fazel. A single recipe for online submodular maximization with adversarial or stochastic constraints. In Advances in Neural Information Processing Systems, 2020.
- Singla et al. [2014] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In Proceedings of the 31st International Conference on International Conference on Machine Learning, 2014.
- Staib and Jegelka [2017] Matthew Staib and Stefanie Jegelka. Robust budget allocation via continuous submodular functions. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- Sun et al. [2016] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary?, 2016.
- Sun and Luo [2016] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 2016.
- Sviridenko et al. [2017] Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Math. Oper. Res., 2017.
- Tohidi et al. [2020] Ehsan Tohidi, Rouhollah Amiri, Mario Coutino, David Gesbert, Geert Leus, and Amin Karbasi. Submodularity in action: From machine learning to signal processing applications. IEEE Signal Processing Magazine, 2020.
- Wei et al. [2015] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, 2015.
- Wilder [2018a] Bryan Wilder. Risk-sensitive submodular optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 2018a.
- Wilder [2018b] Bryan Wilder. Equilibrium computation and robust optimization in zero sum games with submodular structure. Proceedings of the AAAI Conference on Artificial Intelligence, 2018b.
- Xie et al. [2019] Jiahao Xie, Chao Zhang, Zebang Shen, Chao Mi, and Hui Qian. Decentralized gradient tracking for continuous dr-submodular maximization. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 2019.
- Zhang et al. [2019] Mingrui Zhang, Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization: From full-information to bandit feedback. In Advances in Neural Information Processing Systems, 2019.