Submodular + Concave
Abstract
It has been well established that first order optimization methods can converge to the maximal objective value of concave functions and provide constant factor approximation guarantees for (non-convex/non-concave) continuous submodular functions. In this work, we initiate the study of the maximization of functions of the form $F(x) = G(x) + C(x)$ over a solvable convex body $P$, where $G$ is a smooth DR-submodular function and $C$ is a smooth concave function. This class of functions is a strict extension of both concave and continuous DR-submodular functions, and no theoretical guarantee was previously known for it. We provide a suite of Frank-Wolfe style algorithms, which, depending on the nature of the objective function (i.e., whether $G$ and $C$ are monotone or not, and non-negative or not) and on the nature of the set $P$ (i.e., whether it is downward closed or not), provide $1-1/e$, $1/e$, or $1/2$ approximation guarantees. We then use our algorithms to obtain a framework for smoothly interpolating between choosing a diverse set of elements from a given ground set (corresponding to the mode of a determinantal point process) and choosing a clustered set of elements (corresponding to the maxima of a suitable concave function). Additionally, we apply our algorithms to various functions in the above class (DR-submodular + concave) in both constrained and unconstrained settings, and show that our algorithms consistently outperform natural baselines.
1 Introduction
Despite their simplicity, first-order optimization methods (such as gradient descent and its variants, Frank-Wolfe, momentum-based methods, and others) have shown great success in many machine learning applications. A large body of research in the operations research and machine learning communities has fully demystified the convergence rate of such methods in minimizing well behaved convex objectives (Bubeck, 2015; Nesterov, 2018). More recently, a new surge of rigorous results has also shown that gradient descent methods can find the global minima of specific non-convex objective functions arising from non-negative matrix factorization (Arora et al., 2012), robust PCA (Netrapalli et al., 2014), phase retrieval (Chen et al., 2019b), matrix completion (Sun and Luo, 2016), and the training of wide neural networks (Du et al., 2019; Jacot et al., 2018; Allen-Zhu et al., 2019), to name a few. It is also very well known that finding the global minimum of a general non-convex function is computationally intractable (Murty and Kabadi, 1987). To avoid such impossibility results, simpler goals have been pursued by the community: either developing algorithms that can escape saddle points and reach local minima (Ge et al., 2015) or describing structural properties which guarantee that reaching a local minimizer ensures optimality (Sun et al., 2016; Bian et al., 2017b; Hazan et al., 2016). In the same spirit, this paper quantifies a large class of non-convex functions for which first-order optimization methods provably achieve near optimal solutions.
More specifically, we consider objective functions that are formed by the sum of a continuous DR-submodular function $G$ and a concave function $C$. Recent research in non-convex optimization has shown that first-order optimization methods provide constant factor approximation guarantees for maximizing continuous submodular functions (Bian et al., 2017b; Hassani et al., 2017; Bian et al., 2017a; Niazadeh et al., 2018; Hassani et al., 2020a; Mokhtari et al., 2018a). Similarly, such methods find the global maximizer of concave functions. However, the class of functions of the form $F = G + C$ is strictly larger than both concave and continuous DR-submodular functions. More specifically, $F$ is in general neither concave nor continuous DR-submodular (Figure 1 illustrates an example). In this paper, we show that first-order methods provably provide strong theoretical guarantees for maximizing such functions.
The combinations of continuous submodular and concave functions naturally arise in many ML applications, such as maximizing a regularized submodular function (Kazemi et al., 2020a) or finding the mode of distributions (Kazemi et al., 2020a; Robinson et al., 2019). For instance, it is common to add a regularizer to a D-optimal design objective function to increase the stability of the final solution against perturbations (He, 2010; Derezinski et al., 2020; Lattimore and Szepesvari, 2020). Similarly, many instances of log-submodular distributions, such as determinantal point processes (DPPs), have been studied in depth in order to sample a diverse set of items from a ground set (Kulesza, 2012; Rebeschini and Karbasi, 2015; Anari et al., 2016). In order to control the level of diversity, one natural recipe is to consider the combination of a log-concave distribution (e.g., the normal, exponential, and Laplace distributions) (Dwivedi et al., 2019; Robinson et al., 2019) and a log-submodular distribution (Djolonga and Krause, 2014; Bresler et al., 2019), i.e., a distribution whose log-density is the sum of a submodular part and a concave part. In this way, we obtain a class of distributions that contains log-concave and log-submodular distributions as special cases. However, this class of distributions is strictly more expressive than both log-concave and log-submodular models; e.g., in contrast to log-concave distributions, its members are not uni-modal in general. Finding the mode of such distributions amounts to maximizing a combination of a continuous DR-submodular function and a concave function. The contributions of this paper are as follows.
- Assuming first-order oracle access to the objective $F = G + C$, we develop the algorithms Greedy Frank-Wolfe (Algorithm 1) and Measured Greedy Frank-Wolfe (Algorithm 2), which achieve constant factor approximation guarantees between $1/e$ and $1 - 1/e$ depending on the setting, i.e., depending on the monotonicity and non-negativity of $G$ and $C$, and on whether the constraint set is down-closed or not.
- Furthermore, if we have access to the individual gradients of $G$ and $C$, then we are able to make the guarantee with respect to $C$ exact using the algorithms Gradient Combining Frank-Wolfe (Algorithm 3) and Non-oblivious Frank-Wolfe (Algorithm 4). These results are summarized and made more precise in Table 1 and Section 3.
- We then present experiments designed to use our algorithms to smoothly interpolate between contrasting objectives such as picking a diverse set of elements and picking a clustered set of elements. This smooth interpolation provides a way to control the amount of diversity in the final solution. We also demonstrate the use of our algorithms to maximize a large class of (non-convex/non-concave) quadratic programming problems.

Related Work.
The study of discrete submodular maximization has flourished in the last decade through far reaching applications in machine learning and artificial intelligence including viral marketing (Kempe et al., 2003), dictionary learning (Krause and Cevher, 2010), sparse regression (Elenberg et al., 2016), neural network interpretability (Elenberg et al., 2017), crowd teaching (Singla et al., 2014), sequential decision making (Alieva et al., 2021), active learning (Wei et al., 2015), and data summarization (Mirzasoleiman et al., 2013). We refer the interested reader to a recent survey by Tohidi et al. (2020) and the references therein.
Recently, Bian et al. (2017b) proposed an extension of discrete submodular functions to continuous domains that can be of use in machine learning applications. Notably, this class of (non-convex/non-concave) functions, the so-called continuous DR-submodular functions, contains the multilinear extension of discrete submodular functions (Călinescu et al., 2011) as a special case. Continuous DR-submodular functions can reliably model revenue maximization (Bian et al., 2017b), robust budget allocation (Staib and Jegelka, 2017), experimental design (Chen et al., 2018), MAP inference for DPPs (Gillenwater et al., 2012; Hassani et al., 2020b), energy allocation (Wilder, 2018a), classes of zero-sum games (Wilder, 2018b), online welfare maximization and online task assignment (Sadeghi et al., 2020), as well as many other settings of interest.
The research on maximizing continuous DR-submodular functions in the last few years has established strong theoretical results in different optimization settings including unconstrained (Niazadeh et al., 2018; Bian et al., 2019), stochastic (Mokhtari et al., 2018a; Hassani et al., 2017), online (Chen et al., 2018; Zhang et al., 2019; Sadeghi and Fazel, 2019; Raut et al., 2021), and parallel models of computation (Chen et al., 2019a; Mokhtari et al., 2018b; Xie et al., 2019; Ene and Nguyen, 2019).
A different line of work studies the maximization of discrete functions that can be represented as the sum of a non-negative monotone submodular function and a linear function. The ability to do so is useful in practice since the linear function can be viewed as a soft constraint, and it also has theoretical applications, as is argued by the first work in this line (Sviridenko et al., 2017) (for example, the problem of maximizing a monotone submodular function with bounded curvature can be reduced to the maximization of the sum of a monotone submodular function and a linear function). In terms of the approximation guarantee, the algorithms suggested by Sviridenko et al. (2017) were optimal. However, more recent works improve over the time complexities of these algorithms (Feldman, 2021; Harshaw et al., 2019; Ene et al., 2020), generalize them to weakly-submodular functions (Harshaw et al., 2019), and adapt them to other computational models such as the data stream and MapReduce models (Kazemi et al., 2020b; Ene et al., 2020).
2 Setting and Notation
Let us now formally define the setting we consider. Fix a subset $\mathcal{X}$ of $\mathbb{R}^n$ of the form $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$, where each $\mathcal{X}_i$ is a closed range in $\mathbb{R}$. Intuitively, a function $F \colon \mathcal{X} \to \mathbb{R}$ is called (continuous) DR-submodular if it exhibits diminishing returns in the sense that, given a vector $x \in \mathcal{X}$, the increase in $F$ obtained by increasing the coordinate $x_i$ (for any $i \in [n]$) by $\delta > 0$ does not grow as the coordinates of $x$ increase. This intuition is captured by the following definition. In this definition, $e_i$ denotes the standard basis vector in the direction of coordinate $i$.
Definition 2.1 (DR-submodular function).
A function $F \colon \mathcal{X} \to \mathbb{R}$ is DR-submodular if for every two vectors $x, y \in \mathcal{X}$, positive value $\delta$ and coordinate $i \in [n]$ we have
\[ F(x + \delta e_i) - F(x) \;\ge\; F(y + \delta e_i) - F(y) \]
whenever $x \le y$ and $y + \delta e_i \in \mathcal{X}$.¹
¹ Throughout the paper, inequalities between vectors should be understood as holding coordinate-wise.
It is well known that when $F$ is continuously differentiable, it is DR-submodular if and only if $\nabla F(x) \ge \nabla F(y)$ for every two vectors $x, y \in \mathcal{X}$ that obey $x \le y$. And when $F$ is a twice differentiable function, it is DR-submodular if and only if all the entries of its Hessian are non-positive at every vector $x \in \mathcal{X}$. Furthermore, for the sake of simplicity we assume in this work that $\mathcal{X} = [0, 1]^n$. This assumption is without loss of generality because the natural mapping from $\mathcal{X}$ to $[0, 1]^n$ preserves all our results.
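To make the definition concrete, here is a small numerical check of the diminishing-returns inequality for a quadratic function with an entry-wise non-positive Hessian; this is an illustrative script (the function, ranges, and tolerance are our own choices), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
H = -rng.uniform(0.0, 1.0, (n, n))   # entry-wise non-positive Hessian
H = (H + H.T) / 2                    # make it symmetric (stays entry-wise non-positive)
h = rng.uniform(0.0, 1.0, n)

def G(x):
    # Quadratic whose Hessian H is entry-wise non-positive, hence DR-submodular on [0, 1]^n.
    return 0.5 * x @ H @ x + h @ x

delta = 0.05
for _ in range(1000):
    x = rng.uniform(0.0, 0.9, n)
    y = x + rng.uniform(0.0, 0.05, n)     # x <= y, and y + delta * e_i stays inside [0, 1]^n
    i = rng.integers(n)
    e_i = np.eye(n)[i]
    lhs = G(x + delta * e_i) - G(x)       # marginal gain of the same step at the smaller point
    rhs = G(y + delta * e_i) - G(y)       # marginal gain at the larger point
    assert lhs >= rhs - 1e-12             # diminishing returns holds
print("diminishing-returns inequality held on all sampled pairs")
```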
We are interested in the problem of finding the point in some convex body $P \subseteq [0, 1]^n$ that maximizes a given function $F$ that can be expressed as the sum $F = G + C$ of a DR-submodular function $G$ and a concave function $C$. To get meaningful results for this problem, we need to make some assumptions. Here we describe three basic assumptions that we make throughout the paper. The quality of the results that we obtain improves if additional assumptions are made, as is described in Section 3.
Our first basic assumption is that $F$ is non-negative. This assumption is necessary since we obtain multiplicative approximation guarantees with respect to $F$, and such guarantees do not make sense when $F$ is allowed to take negative values.² Our second basic assumption is that $P$ is solvable, i.e., that one can efficiently optimize linear functions subject to it. Intuitively, this assumption makes sense because one should not expect to be able to optimize a complex function such as $F$ subject to $P$ if one cannot even optimize linear functions subject to it (nevertheless, it is possible to adapt our algorithms to obtain some guarantee even when linear functions can only be approximately optimized subject to $P$). Our final basic assumption is that both functions $G$ and $C$ are $L$-smooth, which means that they are differentiable, and moreover, their gradients are $L$-Lipschitz, i.e., $\|\nabla G(x) - \nabla G(y)\|_2 \le L \|x - y\|_2$ and $\|\nabla C(x) - \nabla C(y)\|_2 \le L \|x - y\|_2$ for every two vectors $x, y \in [0, 1]^n$.
² We note that almost all the literature on submodular maximization of both discrete and continuous functions assumes non-negativity for the same reason.
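For later reference, $L$-smoothness yields the standard quadratic lower bound that the analyses below use to control the loss incurred by each discrete step (a standard fact, recorded here for convenience):
\[ f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x \rangle - \frac{L}{2}\, \|y - x\|_2^2 \qquad \text{for every } x, y \in [0, 1]^n \text{ and } f \in \{G, C\}. \]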
We conclude this section by introducing some additional notation that we need. We denote by $o$ an arbitrary optimal solution for the problem described above. Additionally, given two vectors $x, y \in [0, 1]^n$, we denote by $x \odot y$ their coordinate-wise multiplication, and by $\langle x, y \rangle$ their standard Euclidean inner product.
3 Main Algorithms and Results
In this section we present our (first-order) algorithms for solving the problem described in Section 2. In general, these algorithms are all Frank-Wolfe type algorithms, but they differ in the exact linear function which is maximized in each iteration (step 1 of the while/for loop), and in the formula used to update the solution (step 2 of the while/for loop). As mentioned previously, we assume everywhere that $G$ is a non-negative $L$-smooth DR-submodular function, $C$ is an $L$-smooth concave function, and $P$ is a solvable convex body. Some of our algorithms require additional non-negativity and/or monotonicity assumptions on the functions $G$ and $C$, and occasionally they also require a down-closedness assumption on $P$. A summary of the settings to which each algorithm is applicable can be found in Table 1. Each algorithm listed in the table outputs a point $y \in P$ which is guaranteed to obey $F(y) \ge \alpha \cdot G(o) + \beta \cdot C(o) - \varepsilon$ for the constants $\alpha$ and $\beta$ given in Table 1 and some error term $\varepsilon > 0$.
| Algorithm (Section) | $G$ | $C$ | $P$ | $\alpha$ | $\beta$ |
|---|---|---|---|---|---|
| Greedy Frank-Wolfe (3.1) | monotone & non-neg. | monotone & non-neg. | general | $1 - 1/e$ | $1 - 1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | monotone & non-neg. | non-neg. | down-closed | $1 - 1/e$ | $1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | non-neg. | monotone & non-neg. | down-closed | $1/e$ | $1 - 1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | monotone & non-neg. | monotone & non-neg. | down-closed | $1 - 1/e$ | $1 - 1/e$ |
| Measured Greedy Frank-Wolfe (3.2) | non-neg. | non-neg. | down-closed | $1/e$ | $1/e$ |
| Gradient Combining Frank-Wolfe (3.3) | monotone & non-neg. | general | general | $1/2$ | $1$ |
| Non-oblivious Frank-Wolfe (3.4) | monotone & non-neg. | non-neg. | general | $1 - 1/e$ | $1$ |
3.1 Greedy Frank-Wolfe Algorithm
In this section we assume that both $G$ and $C$ are monotone and non-negative functions (in addition to their other properties). Given this assumption, we analyze the guarantee of the greedy Frank-Wolfe variant appearing as Algorithm 1. This algorithm is related to the Continuous Greedy algorithm for discrete objective functions due to Călinescu et al. (2011), and it gets a quality control parameter $\varepsilon \in (0, 1)$. We assume in the algorithm that $\varepsilon^{-1}$ is an integer. This assumption is without loss of generality because, if $\varepsilon$ violates the assumption, then it can be replaced with a value from the range $[\varepsilon/2, \varepsilon]$ that obeys it.
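For intuition, the following sketch shows the overall structure of such a greedy Frank-Wolfe pass; `grad_F` and `linear_maximizer` are placeholder oracles, and the plain continuous-greedy update shown here is our reading of the two steps described above rather than a verbatim transcription of Algorithm 1.

```python
import numpy as np

def greedy_frank_wolfe(grad_F, linear_maximizer, n, eps):
    """Continuous-greedy style Frank-Wolfe sketch for maximizing F = G + C over P.

    grad_F(y): gradient oracle for F at y (length-n array).
    linear_maximizer(w): returns a point s in P maximizing <s, w> (the solvability assumption).
    eps: quality-control parameter; 1/eps is assumed to be an integer.
    """
    y = np.zeros(n)
    for _ in range(int(round(1.0 / eps))):
        s = linear_maximizer(grad_F(y))   # step 1: maximize a linear function over P
        y = y + eps * s                   # step 2: move an eps fraction towards the chosen point
    return y                              # an average of 1/eps points of P, hence a point of P
```

Since the returned point is an average of $1/\varepsilon$ points of $P$, it is a convex combination of points of $P$ and hence feasible, which is exactly the observation made next.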
One can observe that the output of Algorithm 1 is within the convex body $P$ because it is a convex combination of the vectors $s^{(1)}, s^{(2)}, \ldots, s^{(1/\varepsilon)}$ chosen by the algorithm, which are all vectors in $P$. Let us now analyze the value of the output of Algorithm 1. The next lemma is the first step towards this goal. It provides a lower bound on the increase in the value of the solution maintained by the algorithm as a function of the iteration counter.
Lemma 3.1.
For every , .
Proof.
Observe that
where the last inequality follows since $s^{(t)}$ is the maximizer found by Algorithm 1 in this iteration. Therefore, to prove the lemma it only remains to show that $\langle \nabla F(y^{(t)}), o \rangle \ge F(o) - F(y^{(t)})$.
Since $C$ is concave and monotone, $\langle \nabla C(y^{(t)}), o \rangle \ge \langle \nabla C(y^{(t)}), o - y^{(t)} \rangle \ge C(o) - C(y^{(t)})$.
Similarly, since $G$ is DR-submodular and monotone, $\langle \nabla G(y^{(t)}), o \rangle \ge \langle \nabla G(y^{(t)}), (o \vee y^{(t)}) - y^{(t)} \rangle \ge G(o \vee y^{(t)}) - G(y^{(t)}) \ge G(o) - G(y^{(t)})$.
Adding the last two inequalities and rearranging gives the inequality that we wanted to prove. ∎
The corollary below follows by showing (via induction) that Inequality (1) below holds for every integer $0 \le \ell \le \varepsilon^{-1}$, and then plugging in $\ell = \varepsilon^{-1}$. The inductive step is proven using Lemma 3.1.
Corollary 3.2.
.
Proof.
We prove by induction that for every integer we have
(1)
One can note that the corollary follows from this inequality by plugging in $\ell = \varepsilon^{-1}$, since $(1 - \varepsilon)^{1/\varepsilon} \le e^{-1}$.
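To see the shape of this induction, note that under the natural reading of Lemma 3.1 (a per-step gain of at least $\varepsilon[F(o) - F(y^{(\ell)})]$ minus an $O(\varepsilon^2)$ smoothness error, whose exact constant we do not track here), one gets
\[ F(y^{(\ell+1)}) \;\ge\; (1 - \varepsilon)\, F(y^{(\ell)}) + \varepsilon\, F(o) - O(\varepsilon^2) \;\ge\; \bigl[1 - (1 - \varepsilon)^{\ell + 1}\bigr] F(o) - (\ell + 1) \cdot O(\varepsilon^2), \]
so that plugging $\ell + 1 = \varepsilon^{-1}$ and using $(1 - \varepsilon)^{1/\varepsilon} \le e^{-1}$ yields $F(y^{(1/\varepsilon)}) \ge (1 - 1/e)\, F(o) - O(\varepsilon)$.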
We are now ready to summarize the properties of Algorithm 1 in a theorem.
Theorem 3.3.
Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function. Then Algorithm 1 performs $\varepsilon^{-1}$ iterations, each of which requires one such linear maximization (time $T_P$) and one call to the gradient oracle of $F$, and it outputs a vector $y \in P$ such that $F(y) \ge (1 - 1/e)\bigl(G(o) + C(o)\bigr) - O(\varepsilon)$.
3.2 Measured Greedy Frank-Wolfe Algorithm
In this section we assume that $P$ is a down-closed body (in addition to being convex) and that $G$ and $C$ are both non-negative. Given these assumptions, we analyze the guarantee of the variant of Frank-Wolfe appearing as Algorithm 2, which is motivated by the Measured Continuous Greedy algorithm for discrete objective functions due to Feldman et al. (2011). We again have a quality control parameter $\varepsilon \in (0, 1)$, and assume (without loss of generality) that $\varepsilon^{-1}$ is an integer.
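A minimal sketch of the measured variant, assuming the standard measured update $y \leftarrow y + \varepsilon\, s \odot (\mathbf{1} - y)$ of Feldman et al. (2011); the exact linear objective maximized in step 1 of Algorithm 2 may differ from the gradient weighting shown here.

```python
import numpy as np

def measured_greedy_frank_wolfe(grad_F, linear_maximizer, n, eps):
    """Measured-greedy Frank-Wolfe sketch for F = G + C over a down-closed convex body P."""
    y = np.zeros(n)
    for _ in range(int(round(1.0 / eps))):
        # Bias the linear objective away from coordinates that are already close to 1.
        s = linear_maximizer(grad_F(y) * (1.0 - y))
        y = y + eps * s * (1.0 - y)   # measured step: only an eps fraction of the remaining slack
    return y                          # dominated by an average of points of P, hence in P by down-closedness
```

The measured step adds only an $\varepsilon$ fraction of the remaining slack $1 - y$ in each coordinate, which is what keeps the iterates inside $[0, 1]^n$ and, combined with down-closedness, inside $P$.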
We begin the analysis of Algorithm 2 by bounding the range of the entries of the vectors it maintains; the following lemma states this bound, which is proven by induction.
Lemma 3.4.
For every two integers and , .
Proof.
We prove the lemma by induction on . For the lemma is trivial since by the initialization of . Assume now that the lemma holds for , and let us prove it for . Observe that
where the first inequality holds since by the induction hypothesis and since . Similarly,
where the first inequality holds because since , and the second inequality holds by the induction hypothesis. ∎
Using the last lemma, we can see why the output of Algorithm 2 must be in $P$. Recall that the output of Algorithm 2 is its final iterate, which, by Lemma 3.4 and the fact that every $s^{(i)}$ (as a vector in $[0, 1]^n$) is non-negative, is coordinate-wise dominated by $\varepsilon \sum_i s^{(i)}$. Since $\varepsilon \sum_i s^{(i)}$ is a convex combination of points in $P$, it also belongs to $P$. Since the output vector is coordinate-wise dominated by this combination, the down-closedness of $P$ implies that it belongs to $P$ as well.
Our next objective is to prove an approximation guarantee for Algorithm 2. Towards this goal, let us define, for every function $f \in \{G, C\}$, an expression whose value depends on whether $f$ is monotone or non-monotone; in the final guarantee this expression evaluates to $1 - 1/e$ in the monotone case and to $1/e$ in the non-monotone case. Using this definition, we can state the following lemma and corollary, which are counterparts of Lemma 3.1 and Corollary 3.2 from Section 3.1.
Lemma 3.5.
For every integer , .
Proof.
Observe that
Furthermore, we also have
where the inequality holds since is the maximizer found by Algorithm 2. Combining the last two inequalities, we get
(2)
Let us now find a lower bound on the expression in the above inequality. Since is concave,
Similarly, since is DR-submodular,
Adding the last two inequalities provides the promised lower bound on . Plugging this lower bound into Inequality (2) yields
Given the last inequality, to prove the lemma it remains to show that and . Since , these inequalities follow immediately when and are monotone, respectively. Therefore, we concentrate on proving these inequalities when and are non-monotone. Since is concave,
where the second inequality follows from the non-negativity of . We also note that
since by Lemma 3.4. Similarly, since is DR-submodular,
where the second inequality follows from the non-negativity of , and the first inequality holds since
because by Lemma 3.4. ∎
Corollary 3.6.
Proof.
For every function , let be if is monotone, and if is non-monotone. We prove by induction that for every integer we have
(3)
One can note that the corollary follows from this inequality by plugging since when is monotone and when is non-monotone.
For , Inequality (3) follows from the non-negativity of since both when is monotone and non-monotone. Next, let us prove Inequality (3) for assuming it holds for . We note that . For a monotone this is true since
and for a non-monotone this is true since
Using Lemma 3.5, we now get
where the second inequality holds due to the induction hypothesis. ∎
We are now ready to state and prove our main theorem for Algorithm 2.
Theorem (Measured Greedy Frank-Wolfe). Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function. Then Algorithm 2 performs $\varepsilon^{-1}$ iterations, each requiring one such linear maximization (time $T_P$) and one call to the gradient oracle of $F$, and it outputs a vector $y \in P$ such that $F(y) \ge \alpha \cdot G(o) + \beta \cdot C(o) - O(\varepsilon)$, where $\alpha$ and $\beta$ are the coefficients listed for Algorithm 2 in Table 1 (according to the monotonicity of $G$ and $C$).
Proof.
3.3 Gradient Combining Frank-Wolfe Algorithm
Up to this point, the guarantees of the algorithms that we have seen had both $\alpha$ and $\beta$ strictly smaller than $1$. However, since concave functions can be maximized exactly, it is reasonable to also expect algorithms for which the coefficient $\beta$ associated with $C$ is equal to $1$. In Sections 3.3 and 3.4, we describe such algorithms.
In this section, we assume that $G$ is a monotone and non-negative function (in addition to its other properties). The algorithm we study in this section is Algorithm 3, and it again takes a quality control parameter $\varepsilon \in (0, 1)$ as input. This time, however, the algorithm assumes that a different quantity derived from $\varepsilon$ is an integer. As usual, if that is not the case, then $\varepsilon$ can be replaced with a value from the range $[\varepsilon/2, \varepsilon]$ that has this property.
Firstly, note that every iterate $y^{(i)}$ of Algorithm 3 belongs to $P$; and therefore, the output of Algorithm 3 also belongs to $P$. For $i = 0$ this holds by the initialization of $y^{(0)}$. For larger values of $i$, this follows by induction because $y^{(i)}$ is a convex combination of $y^{(i-1)}$ and the point $s^{(i)}$ chosen in iteration $i$ ($y^{(i-1)}$ belongs to $P$ by the induction hypothesis, and $s^{(i)}$ belongs to $P$ by definition).
Our next objective is to lower bound the value of the output point of Algorithm 3. For that purpose, it will be useful to define a potential function $\Phi(i)$ associated with the iterate $y^{(i)}$. To get a bound on the value of the output of Algorithm 3, we first show that $\Phi(i)$ is small for at least some value of $i$. We do that using the next lemma, which shows that $\Phi(i)$ decreases as a function of $i$ as long as it is not already small. Then, Corollary 3.8 guarantees the existence of a good iteration $i$.
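A rough sketch of the gradient-combining idea: each Frank-Wolfe step maximizes a linear function built from a weighted combination of $\nabla G$ and $\nabla C$, and the best iterate (by value of $F$) is returned. The weights `w_g` and `w_c`, the warm start `y0`, and the iteration count are illustrative assumptions, not the exact choices made by Algorithm 3.

```python
import numpy as np

def gradient_combining_frank_wolfe(F, grad_G, grad_C, linear_maximizer, y0, eps, w_g=1.0, w_c=1.0):
    """Frank-Wolfe sketch whose linear step combines the gradients of G and C.

    F: value oracle for G + C (used only to return the best iterate seen).
    y0: starting point in P, e.g. an approximate maximizer of C over P.
    w_g, w_c: illustrative weights on the two gradients.
    """
    y = np.asarray(y0, dtype=float)
    best = y.copy()
    for _ in range(int(round(1.0 / eps))):
        s = linear_maximizer(w_g * grad_G(y) + w_c * grad_C(y))  # step 1: combined linear objective over P
        y = (1.0 - eps) * y + eps * s                            # step 2: convex combination keeps y inside P
        if F(y) > F(best):
            best = y.copy()
    return best                                                  # the best iterate by objective value
```

Returning the best iterate matches the remark in Section 4.2.1 that Algorithms 3 and 4 output the best among the results of all their iterations.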
Lemma 3.7.
For every integer , .
Proof.
Observe that
To make the expression on the rightmost side useful, we need to lower bound the inner product .
where the inequality holds since is the maximizer found by Algorithm 3. Combining the last two inequalities yields
Therefore, to complete the proof of the lemma it remains to prove that and .
The inequality follows immediately from the concavity of . To prove the other inequality, observe that the DR-submodularity of implies for every two vectors and that obey , and is either non-negative or non-positive. This observation implies
where the last inequality follows from the monotonicity and non-negativity of . ∎
Corollary 3.8.
There is an integer obeying .
Proof.
Assume towards a contradiction that the corollary does not hold. By Lemma 3.7, this implies
for every integer . Adding up this inequality for all these values of , we get
However, by the choice of , is not too large. Specifically,
where the inequality holds since is one possible candidate to be and is non-negative. Therefore, we get
which by the non-negativity of and implies
(which contradicts our assumption). ∎
We are now ready to summarize the properties of Algorithm 3 in a theorem.
Theorem 3.9.
Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function and $T_C$ be the time it takes to find a point in $P$ maximizing $C$ up to an error of $\varepsilon$. Then Algorithm 3 performs one (approximate) maximization of $C$ over $P$ (time $T_C$) followed by a sequence of Frank-Wolfe iterations, each requiring one linear maximization over $P$ (time $T_P$) and $O(1)$ gradient oracle calls, and it outputs a vector $y \in P$ such that $F(y) \ge \tfrac{1}{2}\, G(o) + C(o) - O(\varepsilon)$.
Proof.
We begin the proof by analyzing the time and oracle complexities of Algorithm 3. Every iteration of the loop of Algorithm 3 takes time. As there are such iterations, the entire algorithm runs in time. Also note that each iteration of the loop requires 2 calls to the gradient oracles (a single call to the oracle corresponding to , and a single call to the oracle corresponding to ), so the overall oracle complexity of the algorithm is .
3.4 Non-oblivious Frank-Wolfe Algorithm
As mentioned in the beginning of Section 3.3, our objective in this section is to present another algorithm that has $\beta = 1$ (i.e., it maximizes $C$ “exactly” in some sense). In Section 3.3, we presented Algorithm 3, which achieves this goal with $\alpha = 1/2$. The algorithm we present in the current section achieves the same goal with an improved value of $1 - 1/e$ for $\alpha$. However, the improvement is obtained at the cost of requiring the function $C$ to be non-negative (which was not required in Section 3.3). Additionally, like in the previous section, we assume here that $G$ is a monotone and non-negative function (in addition to its other properties).
The algorithm we study in this section is a non-oblivious variant of the Frank-Wolfe algorithm, appearing as Algorithm 4, which takes a quality control parameter $\varepsilon \in (0, 1)$ as input. As usual, we assume without loss of generality that $\varepsilon^{-1}$ is an integer. Algorithm 4 also employs a non-negative auxiliary function constructed from $G$.
This function is inspired by the non-oblivious objective function used by Filmus and Ward (2012).
Note that any call to the gradient oracle of the auxiliary function can be simulated using multiple calls to the gradient oracle of $G$. The properties of Algorithm 4 are stated in the theorem at the end of this section. We begin the analysis of Algorithm 4 by observing that every iterate $y^{(i)}$ of the algorithm belongs to $P$; and therefore, the output of Algorithm 4 also belongs to $P$. The proof for this is identical to the corresponding proof in Section 3.3.
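For intuition only: Filmus–Ward-style non-oblivious potentials replace the gradient of $G$ by a reweighted average of gradients of $G$ along the segment from the origin to the current point. A common continuous form of such a boosted gradient (stated here as an illustration, since the exact weights used by Algorithm 4 may differ) is
\[ \int_{0}^{1} e^{z - 1}\, \nabla G(z \cdot y)\, dz , \]
and approximating this integral by a finite weighted sum is what allows one call to the auxiliary gradient oracle to be simulated by several calls to the gradient oracle of $G$.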
Let us now analyze the auxiliary function. Specifically, we need to show that it is almost as smooth as the original function $G$, and that, for every given vector, its value is comparable to the value of $G$ at that vector. We state these properties as observations below.
Observation 3.10.
The auxiliary function is -smooth.
Proof.
For every two vectors ,
Observation 3.11.
For every vector , .
Proof.
Note that, by the monotonicity of ,
where the last inequality holds since
We now define a potential $\Phi(i)$ associated with the iterate $y^{(i)}$ of Algorithm 4. To get a bound on the value of the output of Algorithm 4, we need to show that there exists at least one value of $i$ for which $\Phi(i)$ is small, which intuitively means that $y^{(i)}$ is roughly a local maximum with respect to the auxiliary function. The following lemma proves that such a value indeed exists.
Lemma 3.12.
There is an integer such that .
Proof.
Assume towards a contradiction that the lemma does not hold. This implies
Furthermore, using the non-negativity of , the last inequality implies
Let us now upper bound the two terms on the leftmost side of the above inequality. By Observation 3.11, we can upper bound by . Additionally, since because , we can upper bound by . Plugging both these upper bounds into the previous inequalities and dividing by gives
which contradicts the definition of , and thus, completes the proof of the lemma. ∎
The preceding lemma shows that some iterate of Algorithm 4 is roughly a local optimum in terms of the auxiliary function. We now show an upper bound that this property of the iterate implies.
Lemma 3.13.
.
Proof.
Observe that
To make the expression in the rightmost side useful, we need to lower bound the expression .
where the inequality holds since is the maximizer found by Algorithm 4. Combining the last two inequalities and the guarantee on from Lemma 3.12, we now get
where the last inequality follows from the concavity of . The lemma now follows by rearranging the last inequality (and dividing it by ). ∎
We now wish to lower bound and we do so by separately analyzing and in the following two lemmas.
Lemma 3.14.
.
Proof.
For brevity, we use in this proof the shorthand . Then, for every integer , we get (in these calculations, the expression represents the gradient of the function at the point ).
where the first and last inequalities follow from the monotonicity of and the second inequality follows from the DR-submodularity of . Using the last inequality, we now get
where the second inequality holds since the fact that is a geometric sum implies
Lemma 3.15.
.
Proof.
In the following calculation, the expression again represents the gradient of the function at the point ). Thus,
(4)
where the inequality holds because
We now observe also that the DR-submodularity of implies, for every integer , that
where the last inequality follows from the non-negativity of . Plugging the last inequality now into Inequality (4) yields
(5)
where the last inequality follows from the monotonicity of (since ) and the second inequality follows from Chebyshev’s sum inequality because the monotonicity of implies that both and are non-decreasing functions of . Additionally, the penultimate inequality follows due to the following calculation, which holds for every .
To complete the proof of the lemma, it remains to show that the coefficient of in the rightmost side of Inequality (5) is upper bounded by . To do that, we note that this coefficient is
where the first inequality and inequalities hold because and for , and the last inequality holds for . ∎
Combining the above lemmas then leads to the following corollary.
Corollary 3.16.
.
Proof.
We are now ready to state and prove the main theorem for Algorithm 4.
Theorem (Non-oblivious Frank-Wolfe). Let $T_P$ be the time it takes to find a point in $P$ maximizing a given linear function. Then Algorithm 4 runs in time polynomial in $n$, $\varepsilon^{-1}$ and $T_P$, makes $\mathrm{poly}(\varepsilon^{-1})$ gradient oracle calls, and outputs a vector $y \in P$ such that $F(y) \ge (1 - 1/e)\, G(o) + C(o) - O(\varepsilon)$.
Proof.
Once more, we begin by analyzing the time and oracle complexities of the algorithm. Every iteration of the loop of Algorithm 4 takes time. As there are such iterations, the entire algorithm runs in time. Furthermore, as each iteration of the loop requires oracle calls (a single call to the oracle corresponding to , and calls to the oracle corresponding to ), the overall oracle complexity is .
4 Experiments
In this section we describe some experiments pertaining to our algorithms for maximizing DR-submodular + concave functions. All experiments are done on a 2015 Apple MacBook Pro with a quad-core 2.2 GHz i7 processor and 16 GB of RAM.
4.1 Interpolating Between Contrasting Objectives
We use our algorithms for maximizing the sum of a DR-submodular function and a concave function to provide a way to achieve a trade-off between different objectives. For example, given a ground set and a DPP supported on the power set , the maximum a posteriori (MAP) of the DPP corresponds to picking the most likely (diverse) set of elements according to the DPP. On the other hand, concave functions can be used to encourage points being closer together and clustered.
Finding the MAP of a DPP is an NP-hard problem. However, continuous approaches employing the multilinear extension (Călinescu et al., 2011) or the softmax extension (Bian et al., 2017a; Gillenwater et al., 2012) provide strong approximation guarantees for it. The softmax approach is usually preferred as it has a closed form which is easier to work with. Now, suppose that the ground set has $n$ elements, and let $\mathbf{L}$ be the kernel of the DPP and $I$ be the $n \times n$ identity matrix; then $G(x) = \log\det\bigl(\operatorname{diag}(x)\,(\mathbf{L} - I) + I\bigr)$ is the softmax extension for $x \in [0, 1]^n$. Here, $\operatorname{diag}(x)$ corresponds to a diagonal matrix with the entries of $x$ along its diagonal.
Observe now that, given a vector , can be thought of as the likelihood of picking element . Moreover, captures the similarity between elements and . Hence, our choice for a concave function which promotes similarity among elements is . The rationale behind this is as follows. For a particular pair of elements and , if is large, that means that and are similar, so we would want to be larger when is high, provided that we are indeed picking both and (i.e., provided that is small). One can verify that the function is indeed concave as its Hessian is negative semidefinite.
In our first experiment we fix the ground set to be a set of 400 points evenly spaced on a $20 \times 20$ grid in the plane. We also choose $\mathbf{L}$ to be the Gaussian kernel $\mathbf{L}_{ij} = \exp\bigl(-d_{ij}^{2} / (2\sigma^{2})\bigr)$, where $d_{ij}$ is the Euclidean distance between points $i$ and $j$, and $\sigma$ is a bandwidth parameter. But to define points $i$ and $j$, we need some ordering of the 400 points on the plane. So, we call point 0 the point at the origin, and the index increases along the first row of the grid until we reach point 19. Then we have point 20 at the start of the second row, and so on, until we finally reach point 399 at the opposite corner of the grid. Given the functions $G$ and $C$ defined above, we optimize in this experiment a combined objective formally specified by $F_\lambda(x) = \lambda\, G(x) + (1 - \lambda)\, C(x)$, where $\lambda \in [0, 1]$ is a control parameter that can be used to balance the contrasting objectives represented by $G$ and $C$. For example, setting $\lambda = 1$ produces the (spread out) pure DPP MAP solution, setting $\lambda = 0$ produces the (clustered) pure concave solution, and an intermediate value of $\lambda$ produces a solution that takes both objectives into consideration to some extent. It is important to note, however, that the effect of changing $\lambda$ on the importance that each objective gets is not necessarily linear, although it becomes linear when the ranges of $G$ and $C$ happen to be similar.
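A small sketch of the ingredients of this experiment (the grid, a Gaussian kernel, and the softmax-extension value), assuming the generic bandwidth parameter `sigma` and the log-det form given above; the specific parameter values used in the experiment are not reproduced here.

```python
import numpy as np

def gaussian_kernel(points, sigma):
    """L[i, j] = exp(-||p_i - p_j||^2 / (2 sigma^2)) for an array of 2-D points."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def softmax_extension(x, L):
    """Softmax extension of a DPP with kernel L: log det(diag(x) (L - I) + I)."""
    n = len(x)
    _, logdet = np.linalg.slogdet(np.diag(x) @ (L - np.eye(n)) + np.eye(n))
    return logdet                       # the determinant is non-negative for x in [0, 1]^n and PSD L

# A 20 x 20 grid of 400 points in the plane, ordered row by row with point 0 at the origin.
grid = np.array([(i % 20, i // 20) for i in range(400)], dtype=float)
L = gaussian_kernel(grid, sigma=1.0)    # illustrative bandwidth
x = np.full(400, 25.0 / 400.0)          # a feasible point of the cardinality-25 polytope
print(softmax_extension(x, L))
```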

In Figure 2, we can see how changing $\lambda$ changes the solution. The plots in the figure show the final output of Algorithm 3 when run on just the submodular objective (left plot), just the concave objective (right plot), and a combination of both (middle plot). The algorithm is run with the same cardinality constraint of 25 in all plots, which corresponds to imposing that the sum of the coordinates of each iterate must be at most 25. It is important to note that we present the exact (continuous) output of the algorithm here; to get a discrete solution, a rounding method should be applied. Also, all the runs of the algorithm begin from the same fixed starting point inside the cardinality constrained polytope. The step sizes used by the different runs are all constant, and were chosen empirically.
4.2 Other Submodular + Concave Objectives
In this section we compare our algorithms with two baselines: the Frank-Wolfe algorithm (Frank and Wolfe, 1956) and the projected gradient ascent algorithm. In all the experiments done in this section, the objective function is again of the form $F = G + C$, where $G$ and $C$ are a DR-submodular function and a concave function, respectively. For simplicity, we use the same setting of the parameters throughout the section.
4.2.1 Quadratic Programming
In this section, we define $G(x) = \tfrac{1}{2} x^{\top} H x + h^{\top} x + c$. Note that by choosing an appropriate matrix $H$ and vector $h$, this objective can be made to be a monotone or non-monotone DR-submodular function. We also define the down-closed constraint set to be $P = \{x \in [0, \bar{u}] \mid Ax \le b\}$ for a non-negative matrix $A$ and a non-negative vector $b$. Following Bian et al. (2017a), we choose the matrix $H$ to be a randomly generated symmetric matrix with non-positive entries drawn uniformly at random, and the matrix $A$ to be a random matrix with entries drawn uniformly at random and shifted by a small positive constant (the addition here is used to ensure that the entries are all positive). The vector $b$ is chosen as the all ones vector, and the vector $\bar{u}$ is a tight coordinate-wise upper bound on $P$ whose $j$-th coordinate is defined as $\bar{u}_j = \min_i b_i / A_{ij}$. We choose the vector $h$ in a way that makes $G$ non-monotone. Finally, although this does not affect the results of our experiments, we take $c$ to be a large enough additive constant (in this case 10) to make $G$ non-negative.
It is known that when the Hessian of a quadratic program is negative semidefinite, the resulting objective is concave. Accordingly, we let $C(x) = \tfrac{1}{2} x^{\top} \tilde{H} x$, where $\tilde{H}$ is a negative semidefinite matrix defined as the negation of the product of a random matrix (with entries drawn uniformly at random) with its transpose. As one can observe, the generality of DR-submodular + concave objectives allows us to consider quadratic programming with very different Hessians. We hope that our ability to do this will motivate future work on quadratic programming for a broader class of Hessian matrices.
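The following sketch builds one instance of this quadratic DR-submodular + concave objective; the dimensions, entry ranges, the shift added to $A$, and the specific linear term $h$ are illustrative choices rather than the exact values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 15, 8                                  # illustrative problem dimensions

# DR-submodular quadratic G(x) = 0.5 x^T H x + h^T x + c with entry-wise non-positive H.
H = -rng.uniform(0.0, 1.0, (n, n)); H = (H + H.T) / 2
A = rng.uniform(0.0, 1.0, (m, n)) + 0.01      # strictly positive constraint matrix
b = np.ones(m)
u_bar = (b[:, None] / A).min(axis=0)          # tight coordinate-wise upper bound on {x >= 0 : Ax <= b}
h = -0.1 * (H.T @ u_bar)                      # illustrative linear term that makes G non-monotone
c = 10.0                                      # additive constant keeping G non-negative on the box

def G(x):
    return 0.5 * x @ H @ x + h @ x + c

# Concave quadratic C(x) = 0.5 x^T Htil x with a negative semidefinite Htil.
M = rng.uniform(0.0, 1.0, (n, n))
Htil = -(M @ M.T)

def C(x):
    return 0.5 * x @ Htil @ x

def F(x):
    return G(x) + C(x)

print(F(np.zeros(n)), F(0.5 * u_bar))
```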
In the current experiment, we consider a few combinations of the number of variables $n$ and the number of constraints $m$, and run each algorithm for 50 iterations. Note that having fixed the number of iterations, the maximum step size for Algorithms 1 and 2 is upper bounded by one over the number of iterations to guarantee that these algorithms remain within the polytope. To ensure consistency, we set the step sizes of the other algorithms to this value as well, except for Algorithm 4, for which we choose the step size (equivalently, the value of $\varepsilon$) by solving for the value consistent with this iteration budget; this also ensures that the gradient computation in Algorithm 4 is not too time consuming. We start Algorithms 1 and 2 from the starting point their pseudocodes specify, and the other algorithms from the same arbitrary point. We show the results for the aforementioned values of $n$ and $m$ in Figure 4. The appropriate values of $n$ and $m$ are mentioned below each plot, and each plot graphs the average of several runs of the experiment. We also note that since Algorithms 3 and 4 output the best among the results of all their iterations, we just plot the final output of these algorithms instead of their entire trajectory.
4.2.2 D-optimal Experimental Design
Following Chen et al. (2018), the DR-submodular objective function for the D-optimal experimental design problem is $G(x) = \log\det\bigl(\sum_{i=1}^{n} x_i\, a_i^{\top} a_i\bigr)$, where $a_i$ is a $d$-dimensional row-vector in which each entry is drawn independently from the standard Gaussian distribution. The choice of concave function is a smooth concave regularizer defined on the same box. In this experiment there is no combinatorial constraint. Instead, we are interested in maximization over a box constraint, i.e., over a box that is shifted compared to the standard $[0, 1]^n$ to ensure that the objective is defined everywhere (the log-determinant is undefined at $x = 0$). The final outputs of all the algorithms appear in Figure 4(h). Like in Section 4.2.1, each algorithm was run for 50 iterations, and each plot is the average of several runs. The step sizes and starting points used by the algorithms are set exactly like in Section 4.2.1.
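A sketch of this objective; the concave term shown (a sum of logarithms, which is indeed undefined at the origin) is an illustrative stand-in rather than the exact choice of $C$ used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.standard_normal((n, d))               # row a_i is the measurement vector of design point i

def G(x):
    """D-optimal design objective: log det(sum_i x_i a_i a_i^T)."""
    _, logdet = np.linalg.slogdet(A.T @ (x[:, None] * A))
    return logdet

def C(x):
    """Illustrative concave term (undefined at 0, hence the shifted box constraint)."""
    return np.log(x).sum()

def F(x):
    return G(x) + C(x)

x = np.full(n, 0.5)                            # a point inside the shifted box
print(F(x))
```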
Takeaways.
Based on our experiments, we can observe that Algorithms 1 and 4 consistently outperform the other algorithms. We can also see (especially in the D-optimal experimental design problem, where they almost superimpose) that the differences between Algorithm 3 and the standard Frank-Wolfe algorithm are minimal, but we believe that the difference between the two algorithms can be made more pronounced by considering settings in which the gradient of one of the two functions dominates the gradient of the other. Finally, one can note that the output values in the plots corresponding to the quadratic programming problem discussed in Section 4.2.1 tend to decrease when the number of constraints increases, which matches our intuitive expectation.
5 Conclusion
In this paper, we have considered the maximization of a class of objective functions that is strictly larger than both DR-submodular functions and concave functions. The ability to optimize this class of functions using first-order information is interesting from both theoretical and practical points of view. Our results provide a step towards the goal of efficiently analyzing structured non-convex functions—a goal that is becoming increasingly relevant.
References
- Alieva et al. [2021] Ayya Alieva, Aiden Aceves, Jialin Song, Stephen Mayo, Yisong Yue, and Yuxin Chen. Learning to make decisions via submodular regularization. In International Conference on Learning Representations, 2021.
- Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, 2019.
- Anari et al. [2016] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte carlo markov chain algorithms for sampling strongly rayleigh distributions and determinantal point processes. In 29th Annual Conference on Learning Theory, 2016.
- Arora et al. [2012] Sanjeev Arora, Rong Ge, Ravi Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, 2012.
- Bian et al. [2017a] An Bian, Kfir Y. Levy, Andreas Krause, and Joachim M. Buhmann. Continuous dr-submodular maximization: Structure and algorithms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017a.
- Bian et al. [2019] An Bian, Joachim M. Buhmann, and Andreas Krause. Optimal dr-submodular maximization and applications to provable mean field inference. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Bian et al. [2017b] Andrew An Bian, Baharan Mirzasoleiman, Joachim M. Buhmann, and Andreas Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In AISTATS, 2017b.
- Bresler et al. [2019] Guy Bresler, Frederic Koehler, Ankur Moitra, and Elchanan Mossel. Learning restricted boltzmann machines via influence maximization. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.
- Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 2015.
- Călinescu et al. [2011] Gruia Călinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Comput., 2011.
- Chen et al. [2018] Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.
- Chen et al. [2019a] Lin Chen, Moran Feldman, and Amin Karbasi. Unconstrained submodular maximization with constant adaptive complexity. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, 2019a.
- Chen et al. [2019b] Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Mathematical Programming, 2019b.
- Derezinski et al. [2020] Michal Derezinski, Feynman Liang, and Michael Mahoney. Bayesian experimental design using regularized determinantal point processes. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 2020.
- Djolonga and Krause [2014] Josip Djolonga and Andreas Krause. From map to marginals: Variational inference in bayesian submodular models. In Advances in Neural Information Processing Systems, 2014.
- Du et al. [2019] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
- Dwivedi et al. [2019] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast. Journal of Machine Learning Research, 2019.
- Elenberg et al. [2016] Ethan R. Elenberg, Rajiv Khanna, Alexandros G. Dimakis, and Sahand Negahban. Restricted strong convexity implies weak submodularity. In Proceedings of the 30th Conference on Neural Information Processing Systems, 2016.
- Elenberg et al. [2017] Ethan R. Elenberg, Alexandros G. Dimakis, Moran Feldman, and Amin Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- Ene and Nguyen [2019] Alina Ene and Huy L. Nguyen. Parallel algorithm for non-monotone dr-submodular maximization. In Proceedings of the 37th International Conference on Machine Learning, 2019.
- Ene et al. [2020] Alina Ene, Sofia Maria Nikolakaki, and Evimaria Terzi. Team formation: Striking a balance between coverage and cost. CoRR, 2020.
- Feldman [2021] Moran Feldman. Guess free maximization of submodular and linear sums. Algorithmica, 2021.
- Feldman et al. [2011] Moran Feldman, Joseph Naor, and Roy Schwartz. A unified continuous greedy algorithm for submodular maximization. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, 2011.
- Filmus and Ward [2012] Yuval Filmus and Justin Ward. A tight combinatorial algorithm for submodular maximization subject to a matroid constraint. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, 2012.
- Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
- Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points — online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, 2015.
- Gillenwater et al. [2012] J. Gillenwater, A. Kulesza, and B. Taskar. Near-Optimal MAP Inference for Determinantal Point Processes. In Proc. Neural Information Processing Systems (NIPS), 2012.
- Harshaw et al. [2019] Chris Harshaw, Moran Feldman, Justin Ward, and Amin Karbasi. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. In ICML, 2019.
- Hassani et al. [2017] Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient methods for submodular maximization. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- Hassani et al. [2020a] Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++: (non)convex minimization and continuous submodular maximization. SIAM Journal on Optimization, 2020a.
- Hassani et al. [2020b] Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++, 2020b.
- Hazan et al. [2016] Elad Hazan, Kfir Y. Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
- He [2010] Xiaofei He. Laplacian regularized d-optimal design for active learning and its application to image retrieval. IEEE Transactions on Image Processing, 2010.
- Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
- Kazemi et al. [2020a] Ehsan Kazemi, Shervin Minaee, Moran Feldman, and Amin Karbasi. Regularized submodular maximization at scale. CoRR, 2020a.
- Kazemi et al. [2020b] Ehsan Kazemi, Shervin Minaee, Moran Feldman, and Amin Karbasi. Regularized submodular maximization at scale. CoRR, 2020b.
- Kempe et al. [2003] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03. Association for Computing Machinery, 2003.
- Krause and Cevher [2010] Andreas Krause and Volkan Cevher. Submodular dictionary selection for sparse representation. In Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010.
- Kulesza [2012] Alex Kulesza. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 2012.
- Lattimore and Szepesvari [2020] Tor Lattimore and Csaba Szepesvari. Bandit algorithms. Cambridge University Press, 2020.
- Mirzasoleiman et al. [2013] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, 2013.
- Mokhtari et al. [2018a] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018a.
- Mokhtari et al. [2018b] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Decentralized submodular maximization: Bridging discrete and continuous settings. In Proceedings of the 35th International Conference on Machine Learning, 2018b.
- Murty and Kabadi [1987] K. G. Murty and S. N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical Programming, 1987.
- Nesterov [2018] Yurii Nesterov. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2018.
- Netrapalli et al. [2014] Praneeth Netrapalli, U N Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-convex robust pca. In Advances in Neural Information Processing Systems, 2014.
- Niazadeh et al. [2018] Rad Niazadeh, Tim Roughgarden, and Joshua R. Wang. Optimal algorithms for continuous non-monotone submodular and dr-submodular maximization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
- Raut et al. [2021] Prasanna Sanjay Raut, Omid Sadeghi, and Maryam Fazel. Online dr-submodular maximization with stochastic cumulative constraints, 2021.
- Rebeschini and Karbasi [2015] Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. In Proceedings of The 28th Conference on Learning Theory, 2015.
- Robinson et al. [2019] Joshua Robinson, Suvrit Sra, and Stefanie Jegelka. Flexible modeling of diversity with strongly log-concave distributions. In Advances in Neural Information Processing Systems, 2019.
- Sadeghi and Fazel [2019] Omid Sadeghi and Maryam Fazel. Online continuous DR-submodular maximization with long-term budget constraints. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 2019.
- Sadeghi et al. [2020] Omid Sadeghi, Prasanna Raut, and Maryam Fazel. A single recipe for online submodular maximization with adversarial or stochastic constraints. In Advances in Neural Information Processing Systems, 2020.
- Singla et al. [2014] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In Proceedings of the 31st International Conference on International Conference on Machine Learning, 2014.
- Staib and Jegelka [2017] Matthew Staib and Stefanie Jegelka. Robust budget allocation via continuous submodular functions. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- Sun et al. [2016] Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary?, 2016.
- Sun and Luo [2016] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 2016.
- Sviridenko et al. [2017] Maxim Sviridenko, Jan Vondrák, and Justin Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Math. Oper. Res., 2017.
- Tohidi et al. [2020] Ehsan Tohidi, Rouhollah Amiri, Mario Coutino, David Gesbert, Geert Leus, and Amin Karbasi. Submodularity in action: From machine learning to signal processing applications. IEEE Signal Processing Magazine, 2020.
- Wei et al. [2015] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, 2015.
- Wilder [2018a] Bryan Wilder. Risk-sensitive submodular optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 2018a.
- Wilder [2018b] Bryan Wilder. Equilibrium computation and robust optimization in zero sum games with submodular structure. Proceedings of the AAAI Conference on Artificial Intelligence, 2018b.
- Xie et al. [2019] Jiahao Xie, Chao Zhang, Zebang Shen, Chao Mi, and Hui Qian. Decentralized gradient tracking for continuous dr-submodular maximization. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 2019.
- Zhang et al. [2019] Mingrui Zhang, Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization: From full-information to bandit feedback. In Advances in Neural Information Processing Systems, 2019.