
Scalable Distributed Algorithms for Size-Constrained Submodular Maximization in the MapReduce and Adaptive Complexity Models

Yixin Chen [email protected]
Department of Computer Science & Engineering
Texas A&M University
College Station, TX
   Tonmoy Dey [email protected]
Department of Computer Science
Florida State University
Tallahassee, FL, USA
   Alan Kuhnle [email protected]
Department of Computer Science & Engineering
Texas A&M University
College Station, TX
Abstract

Distributed maximization of a submodular function in the MapReduce (MR) model has received much attention, culminating in two frameworks that allow a centralized algorithm to be run in the MR setting without loss of approximation, as long as the centralized algorithm satisfies a certain consistency property – which had previously only been known to be satisfied by the standard greedy and continuous greedy algorithms. A separate line of work has studied parallelizability of submodular maximization in the adaptive complexity model, where each thread may have access to the entire ground set. For the size-constrained maximization of a monotone and submodular function, we show that several sublinearly adaptive (highly parallelizable) algorithms satisfy the consistency property required to work in the MR setting, which yields practical, parallelizable and distributed algorithms. Separately, we develop the first distributed algorithm with linear query complexity for this problem. Finally, we provide a method to increase the maximum cardinality constraint for MR algorithms at the cost of additional MR rounds.

1 Introduction

Submodular maximization has become an important problem in data mining and machine learning with real world applications ranging from video summarization (Mirzasoleiman et al., 2018) and mini-batch selection (Joseph et al., 2019) to more complex tasks such as active learning (Rangwani et al., 2021) and federated learning (Balakrishnan et al., 2022). In this work, we study the size-constrained maximization of a monotone, submodular function (SMCC), formally defined in Section 1.3. Because of the ubiquity of problems requiring the optimization of a submodular function, a vast literature on submodular optimization exists; we refer the reader to the surveys (Liu et al., 2020; Liu, 2020).

Reference | Ratio | MR Rounds | Adaptivity (Θ) | Queries (Ξ) | k_max
RG | (1/2)(1 - 1/e) | 2 | O(k) | O(nk + k^2 ℓ) | O(n/ℓ^2)
PAlg | 1 - 1/e - ε | O(1/ε^2) | O(k/ε^2) | O(nk/ε^2 + k^2 ℓ^2/ε^4) | O(nε^2/ℓ^2)
DDist | 1 - 1/e - ε | O(1/ε) | O(k/ε) | O(nk/ε + k^2 ℓ^2/ε^2) | O(nε/ℓ^2)
BCG | 1 - ε | r | O((rk/ε^{2/r}) log^2(1/ε^{1/r})) | O(nkr/ε^{1/r} + k^2 ℓ r m^2(ε^{1/r})/ε^{3/r}) | O(nε^{1/r}/ℓ^2)
PG | 0.545 - ε | 2 | O(k^2) | O(nk + k^3 ℓ) | O(n/ℓ^2)
TG | 1/2 | 2 | O(n log(k)) | O(n log(k)) | O(n/(ℓ^2 log(k)))
R-DASH (Alg. 4) | (1/2)(1 - 1/e - ε) | 2 | O(log(k) log(n)/ε^4) | O(n log(k)/ε^4) | O(n/ℓ^2)
G-DASH (Alg. 5) | 1 - 1/e - ε | O(1/ε) | O(log(k) log(n)/ε^5) | O(n log(k)/ε^5) | O(nε/ℓ^2)
T-DASH (Alg. 10) | 3/8 - ε | 2 | O(log(n)/ε^3) | O(n log(k)/ε^4) | O(nε/(ℓ^2 log(k)))
L-Dist (Alg. 7) | 1/8 | 2 | O(n/ℓ) | O(n) | O(n/(ℓ^2 log(n)))
Table 1: Theoretical comparison of our algorithms to MapReduce model algorithms: RG of Barbosa et al. (2015), PAlg of Barbosa et al. (2016), DDist of Kazemi et al. (2021), BCG of Epasto et al. (2017), PG of Mirrokni and Zadimoghaddam (2015), and TG of Liu and Vondrak (2019). The column k_max provides the maximum cardinality constraint that the algorithm can compute.

A foundational result of Nemhauser et al. (1978) shows that a simple greedy algorithm (Greedy, pseudocode in Appendix H) achieves the optimal approximation ratio for SMCC of 1 - 1/e ≈ 0.63 in the value query model, in which the submodular function f is available to the algorithm as an oracle that returns f(S) when queried with set S. However, because of the modern revolution in data (Mao et al., 2021; Ettinger et al., 2021), there is a need for more scalable algorithms. Specifically, the standard greedy algorithm is a centralized algorithm that needs access to the whole dataset, makes many adaptive rounds (defined below) of queries to the submodular function, and has quadratic total query complexity in the worst case.

Distributed Algorithms for SMCC. Massive data sets are often too large to fit on a single machine and are distributed over a cluster of many machines. In this context, there has been a line of work developing algorithms for SMCC in the MapReduce model, defined formally in Section 1.3 (see Table 1 and the Related Work section below). Of these, PAlg (Barbosa et al., 2016) is a general framework that allows a centralized algorithm Alg to be converted to the MapReduce distributed setting with nearly the same theoretical approximation ratio, as long as Alg satisfies a technical, randomized consistency property (RCP), defined in Section 1.3.

In addition, DistributedDistorted (Kazemi et al., 2019) is a general framework that can adapt any Alg satisfying the RCP, as we show in Section 4.2 (in Kazemi et al. (2019), an instantiated version of DistributedDistorted with a distorted greedy algorithm is analyzed). However, to the best of our knowledge, the only centralized algorithms known to satisfy this property are the Greedy of Nemhauser et al. (1978) and the continuous greedy algorithm of Calinescu et al. (2007).

Parallelizable Algorithms for SMCC. A separate line of work, initiated by Balkanski and Singer (2018), has taken an orthogonal approach to scaling up the standard greedy algorithm: parallelizable algorithms for SMCC, as measured by the adaptive complexity of the algorithm, formally defined in Section 1.3. In this model, each thread may have access to the entire ground set, and hence these algorithms do not apply in a distributed setting. Altogether, these algorithms have exponentially improved the adaptivity of standard greedy from O(n) to O(log(n)) while obtaining nearly the same 1 - 1/e approximation ratio (Fahrbach et al., 2019a; Balkanski et al., 2019; Ene and Nguyen, 2019; Chekuri and Quanrud, 2019). Recently, highly practical sublinearly adaptive algorithms have been developed without compromising theoretical guarantees (Breuer et al., 2020; Chen et al., 2021). However, none of these algorithms have been shown to satisfy the RCP.

Combining Parallelizability and Distributed Settings. In this work, we study parallelizable, distributed algorithms for SMCC; each node of a distributed cluster typically has many processors, so we seek to take advantage of this by parallelizing the algorithm within each machine in the cluster. In the existing MapReduce algorithms, the number of machines ℓ could simply be set to the total number of processors available, treating each processor as a separate machine in the cluster, to ensure that all available resources are used. However, this approach has a number of disadvantages: 1) A large number of machines severely restricts the size of the cardinality constraint k, since in all of the MapReduce algorithms a set of size kℓ must be stored on a single machine; therefore, we must have k ≤ O(Ψ/ℓ) = O(n/ℓ^2). For example, with n = 10^6, setting ℓ = 256 implies that k ≤ 15·C, for some constant C. However, if each machine has 8 processor cores, the number of machines ℓ can be set to 32 (with the algorithm parallelized on each machine), which still uses 256 processors while allowing k ≤ 976·C. As the number of cores and processors per physical machine continues to grow, large practical benefits can be obtained by a parallelized and distributed algorithm. 2) Modern CPU architectures have many cores that all share a common memory, and it is inefficient to treat communication between these cores as being as expensive as communication between separate machines in a cluster.
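As a quick sanity check of this arithmetic (illustrative only; the constant C hidden in the O(·) bound is taken to be 1), the following Python snippet reproduces the two k_max values quoted above.

    # Back-of-the-envelope check of the k <= O(n / l^2) bound discussed above.
    # The constant hidden in the O(.) is taken to be 1 (i.e., C = 1).
    n = 10**6

    def k_max(num_machines):
        # Largest k when a set of size k * num_machines must fit on one machine
        # holding Psi = n / num_machines elements.
        return n // num_machines**2

    print(k_max(256))  # 15  -- every processor core treated as its own machine
    print(k_max(32))   # 976 -- 32 machines, each parallelizing over its 8 cores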

In addition to the MapReduce and adaptive complexity models, the total query complexity of the algorithm to the submodular oracle is also relevant to the scalability of the algorithm. Frequently, evaluation of the submodular function is expensive, so the time spent answering oracle queries dominates the other parts of the computation. Therefore, the following questions are posed: Q1: Is it possible to design constant-factor approximation distributed algorithms with 1) constant number of MapReduce rounds; 2) sublinear adaptive complexity (highly-parallelizable); and 3) nearly linear total query complexity? Q2: Can we design practical distributed algorithms that also meet these three criteria?

1.1 Contributions

In overview, the contributions of the paper are: 1) an exponential improvement in adaptivity (from Ω(n) to O(log n)) for an algorithm with a constant number of MR rounds; 2) the first linear-time, constant-factor algorithm with constant MR rounds; 3) a way to increase the size-constraint limitation at the cost of additional MR rounds; and 4) an empirical evaluation on a cluster of 64 machines demonstrating that an order-of-magnitude improvement in runtime is obtained by combining the MR and adaptive complexity models.

To develop MR algorithms with sublinear adaptivity, we first modify and analyze the low-adaptive algorithm ThreshSeqMod from Chen et al. (2021) and show that our modification satisfies the RCP (Section 2). Next, ThreshSeqMod is used to create a low-adaptive greedy procedure LAG; LAG then satisfies the RCP, so it can be used within PAlg (Barbosa et al., 2016) to yield a 1 - 1/e - ε approximation in O(1/ε^2) MR rounds, with adaptivity O(log n). We show in Section 4.2 that DistributedDistorted (Kazemi et al., 2021) also works with any Alg satisfying the RCP, which yields an improvement in the number of MR rounds needed to achieve the same ratio. The resulting algorithm we term G-DASH (Section 4), which achieves ratio 1 - 1/e - ε with O(log^2 n) adaptivity and O(1/ε) MR rounds.

Although G-DASH achieves nearly the optimal ratio in a constant number of MR rounds and sublinear adaptivity, the number of MR rounds is high. To obtain truly practical algorithms, we develop constant-factor algorithms with 2 MR rounds: R-DASH and T-DASH, with O(log(k) log(n)) and O(log n) adaptivity, respectively, and with approximation ratios of (1/2)(1 - 1/e - ε) (≈ 0.316) and 3/8 - ε. R-DASH is our most practical algorithm and may be regarded as a parallelized version of RandGreeDI (Barbosa et al., 2015), the first MapReduce algorithm for SMCC and the state-of-the-art MR algorithm in terms of empirical performance. T-DASH is a novel algorithm that improves the theoretical properties of R-DASH at the cost of empirical performance (see Table 1 and Section 8). Notably, the adaptivity of T-DASH is O(log(n)), which is close to the best known adaptivity for a constant-factor algorithm, O(log(n/k)), achieved by Chen et al. (2021).

Although our MR algorithms are nearly linear time (within a polylogarithmic factor of linear), all of the existing MR algorithms are superlinear time, which raises the question of whether an algorithm exists with linear query complexity, a constant number of MR rounds, and a constant approximation factor. We answer this question affirmatively by adapting a linear-time algorithm of Kuhnle (2021); Chen et al. (2021) and showing that our adaptation satisfies the RCP (Section 3). Subsequently, we develop the first MR algorithm (L-Dist) with an overall linear query complexity.

Our next contribution is MED, a general plug-in framework for distributed algorithms, which increases the size of the maximum cardinality constraint at the cost of more MR rounds. As discussed above, the maximum k value of any prior MR algorithm for SMCC is O(n/ℓ^2), where ℓ is the number of machines; MED increases this to O(n/ℓ). We also show that under certain assumptions on the objective function (which are satisfied by all of the empirical applications evaluated in Section 8), MED can be run with k_max = n, which removes any cardinality restriction. When used in conjunction with a γ-approximation MR algorithm, MED provides a (1 - e^{-γ})-approximate solution.
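As a rough illustration (ignoring the ε terms), plugging in γ ≈ 0.316, the ratio of R-DASH, gives 1 - e^{-0.316} ≈ 0.27, while plugging in γ ≈ 1 - 1/e ≈ 0.632, the ratio of G-DASH, gives 1 - e^{-0.632} ≈ 0.47.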

Finally, an extensive empirical evaluation of our algorithms and the current state-of-the-art on a 64-machine cluster with 32 cores each, on data instances ranging up to 5 million nodes, shows that R-DASH is orders of magnitude faster than state-of-the-art MR algorithms and demonstrates an exponential improvement in scaling with larger k. Moreover, we show that MR algorithms slow down with an increasing number of machines past a certain point, even if enough memory is available, which further motivates distributing a parallelizable algorithm over a smaller number of machines. This observation also serves as motivation for the development of the MED+Alg framework. In our evaluation, we found that the MED+Alg framework delivers solutions significantly faster than Alg alone for large constraints, showcasing its superior performance and efficiency.

A previous version of this work was published in a conference (Dey et al., 2023). In that version, the RCP was claimed to be shown for the ThresholdSeq algorithm of Chen et al. (2021). Unfortunately, the RCP does not hold for that algorithm. In this version, we provide a modified version of the ThresholdSeq algorithm of Chen et al. (2021), for which we show the RCP. Additionally, in this version, we introduce the first linear-time MR algorithm, LinearTime-Distributed (L-Dist). Finally, in our empirical evaluation, we expand upon the conference version by conducting two additional experiments using a larger cluster comprising 64 machines. These additional experiments evaluate the performance of L-Dist and investigate the effectiveness of MED in handling increasing cardinality constraints, respectively.

1.2 Organization

The rest of this paper is organized as follows. In Section 2, we present the low-adaptive procedures and show they satisfy the RCP. In Section 3, we analyze an algorithm with linear query complexity and show it satisfies the RCP. In Section 4, we show how to use algorithms satisfying the RCP to obtain MR algorithms. In Section 5, we improve the theoretical properties of our 2-round MR algorithm. In Section 6, we detail the linear-time MR algorithm. In Section 7, we show how to increase the supported constraint size by adding additional MR rounds. In Section 8, we conduct our empirical evaluation.

1.3 Preliminaries

A submodular set function captures the diminishing-returns property: the gain of adding an element to a set decreases as the set grows. Formally, let 𝒩 be a finite set of size n. A non-negative set function f: 2^𝒩 → ℝ+ is submodular iff for all sets A ⊆ B ⊆ 𝒩 and x ∈ 𝒩 \ B, f(A ∪ {x}) - f(A) ≥ f(B ∪ {x}) - f(B). The submodular function is monotone if f(A) ≤ f(B) for all A ⊆ B. We use Δ(x | S) to denote the marginal gain of element x with respect to set S: Δ(x | S) = f(S ∪ {x}) - f(S). In this paper, we study the following optimization problem of submodular maximization under a cardinality constraint (SMCC):

O ← arg max_{S ⊆ 𝒩} f(S),  subject to |S| ≤ k.        (1)
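To make the problem concrete, the following self-contained Python sketch (illustrative only; the toy instance and function names are ours, not the paper's) implements a monotone submodular coverage objective and the standard greedy algorithm discussed above.

    # Toy illustration (not from the paper): a monotone submodular coverage
    # objective and the standard greedy algorithm.
    def make_coverage_oracle(sets_by_element):
        """f(S) = number of items covered by the union of the chosen elements."""
        def f(S):
            covered = set()
            for e in S:
                covered |= sets_by_element[e]
            return len(covered)
        return f

    def greedy(f, ground_set, k):
        """Standard greedy: k adaptive rounds, O(nk) queries, (1 - 1/e) ratio."""
        S = set()
        for _ in range(k):
            gains = {x: f(S | {x}) - f(S) for x in ground_set - S}
            if not gains:
                break
            best = max(gains, key=gains.get)
            if gains[best] <= 0:
                break
            S.add(best)
        return S

    sets_by_element = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 5, 6}}
    f = make_coverage_oracle(sets_by_element)
    print(greedy(f, set(sets_by_element), k=2))       # {'a', 'd'}: covers 5 items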

MapReduce Model. The MapReduce (MR) model is a formalization of distributed computation into MapReduce rounds of communication between machines. A dataset of size n is distributed over ℓ machines, each with memory to hold at most Ψ elements of the ground set. The total memory of the machines is constrained to be Ψ·ℓ = O(n). After each round of computation, a machine may send O(Ψ) amount of data to other machines. We assume ℓ ≤ n^{1-c} for some constant c ≥ 1/2.

Next, we formally define the RCP needed to use the two frameworks to convert a centralized algorithm to the MR setting, as discussed previously.

Property 1 (Randomized Consistency Property of Barbosa et al. (2016)).

Let q be a fixed sequence of random bits, and let Alg be a randomized algorithm with randomness determined by q. Suppose Alg(𝒩, q) returns a pair of sets (AlgSol(𝒩, q), AlgRel(𝒩, q)), where AlgSol(𝒩, q) is the feasible solution and AlgRel(𝒩, q) is a set providing additional information. Let A and B be two disjoint subsets of 𝒩, and suppose that for each b ∈ B, AlgRel(A ∪ {b}, q) = AlgRel(A, q). Also, suppose that Alg(A, q) terminates successfully. Then Alg(A ∪ B, q) terminates successfully and AlgSol(A ∪ B, q) = AlgSol(A, q).
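To make Property 1 concrete, here is a small illustrative check (a toy construction of ours, not part of the paper's algorithms): for the standard greedy algorithm with ties broken by a fixed priority order playing the role of q, we take both AlgSol and AlgRel to be the greedy solution and verify the consistency condition on a coverage instance.

    # Illustrative check of Property 1 (a toy construction, not from the paper):
    # standard greedy with ties broken by a fixed priority order q; AlgSol and
    # AlgRel are both taken to be the greedy solution.
    def greedy_pair(f, X, k, q):
        """Return (AlgSol, AlgRel) of deterministic greedy on X with tie-order q."""
        S = []
        for _ in range(k):
            cands = sorted((x for x in X if x not in S), key=q.index)
            if not cands:
                break
            best = max(cands, key=lambda x: f(S + [x]) - f(S))
            if f(S + [best]) - f(S) <= 0:
                break
            S.append(best)
        return set(S), set(S)

    cover = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1}}
    f = lambda S: len(set().union(*(cover[x] for x in S))) if S else 0
    q = ["a", "b", "c", "d"]                  # fixed "randomness" / tie-breaking order
    A = {"a", "b"}
    sol_A, rel_A = greedy_pair(f, A, 2, q)
    # B: elements whose addition to A leaves AlgRel unchanged (hypothesis of Property 1).
    B = {x for x in cover if x not in A and greedy_pair(f, A | {x}, 2, q)[1] == rel_A}
    sol_AB, _ = greedy_pair(f, A | B, 2, q)
    assert sol_AB == sol_A                    # conclusion of Property 1: Sol(A U B) = Sol(A)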

Adaptive Complexity. The adaptive complexity of an algorithm is the minimum number of sequential rounds into which its queries to the submodular function can be arranged such that each round makes at most polynomially many queries and the queries in each round depend only on the results of queries in previous rounds.

1.4 Related Work

MapReduce Algorithms. Addressing the coverage maximization problem (a special case of SMCC), Chierichetti et al. (2010) proposed an approximation algorithm achieving a ratio of 1 - 1/e, with polylogarithmic MapReduce (MR) rounds. This was later enhanced by Blelloch et al. (2012), who reduced the number of rounds to log^2 n. Kumar et al. (2013) contributed further advancements with a 1 - 1/e algorithm, significantly reducing the number of MR rounds to logarithmic levels. Additionally, they introduced a 1/2 - ε approximation algorithm that operates within O(1/δ) MR rounds, albeit with a logarithmic increase in communication complexity.

For SMCC, Mirzasoleiman et al. (2013) introduced the two-round distributed greedy algorithm (GreeDI), demonstrating its efficacy through empirical studies across various machine learning applications. However, its worst-case approximation guarantee is 1/Θ(min{√k, ℓ}). Subsequent advancements led to the development of constant-factor algorithms by Barbosa et al. (2015), who notably introduced randomization into the distributed greedy algorithm GreeDI, resulting in the RandGreeDI algorithm. This algorithm achieves a (1/2)(1 - 1/e) approximation ratio with two MR rounds and O(nk) queries. Mirrokni and Zadimoghaddam (2015) also introduced an algorithm with two MR rounds, where the first round finds randomized composable core-sets and the second round applies a tie-breaking rule; this algorithm improves the approximation ratio from (1/2)(1 - 1/e) to 0.545 - ε. Later on, Barbosa et al. (2016) proposed an O(1/ε^2)-round algorithm with the optimal (1 - 1/e - ε) approximation ratio without data duplication. Then, Kazemi et al. (2021) improved its space and communication complexity by a factor of O(1/ε) with the same approximation ratio. Another two-round algorithm, proposed by Liu and Vondrak (2019), achieved a (1/2 - ε) approximation ratio but requires data duplication, with four times more elements distributed to each machine in the first round. As a result, distributed setups have a more rigid memory restriction when running this algorithm.

Parallelizable Algorithms. A separate line of work considers the parallelizability of the algorithm, as measured by its adaptive complexity. Balkanski and Singer (2018) introduced the first O(log(n))-adaptive algorithm, achieving a (1/3 - ε) approximation ratio with O(nk^2 log^3(n)) query complexity for SMCC. Then, Balkanski et al. (2019) enhanced the approximation ratio to 1 - 1/e - ε with O(nk^2 log^2(n)) query complexity while maintaining the same sublinear adaptivity. In the meantime, Ene and Nguyen (2019), Chekuri and Quanrud (2019), and Fahrbach et al. (2019a) achieved the same approximation ratio and adaptivity but with improved query complexities: O(n poly(log(n))) for Ene and Nguyen (2019), O(n log(n)) for Chekuri and Quanrud (2019), and O(n) for Fahrbach et al. (2019a). Recently, highly practical sublinearly adaptive algorithms have been developed without compromising on the approximation ratio. FAST, introduced by Breuer et al. (2020), runs in O(log(n) log^2(log(k))) adaptive rounds and O(n log(log(k))) queries, while LS+PGB, proposed by Chen et al. (2021), runs in O(log(n)) adaptive rounds with O(n) queries. However, none of these algorithms have been shown to satisfy the RCP.

Fast (Low Query Complexity) Algorithms. Since a query to the oracle for the submodular function is typically an expensive operation, the total query complexity of an algorithm is highly relevant to its scalability. The work of Badanidiyuru and Vondrak (2014) reduced the query complexity of the standard greedy algorithm from O(kn) to O(n log(n)), while nearly maintaining the ratio of 1 - 1/e. Subsequently, Mirzasoleiman et al. (2015) utilized random sampling on the ground set to achieve the same approximation ratio in expectation with optimal O(n) query complexity. Afterwards, Kuhnle (2021) obtained a deterministic algorithm that achieves ratio 1/4 in exactly n queries, and the first deterministic algorithm with O(n) query complexity and (1 - 1/e - ε) approximation ratio. Since our goal in this work is to develop practical algorithms, we develop parallelizable and distributed algorithms with nearly linear query complexity (i.e., within a polylogarithmic factor of linear). Further, we develop the first linear-time algorithm with a constant number of MR rounds, although it does not parallelize well.

Non-monotone Algorithms. If the submodular function is no longer assumed to be monotone, the SMCC problem becomes more difficult. Here, we highlight a few works in each of the categories of distributed, parallelizable, and query-efficient algorithms. Lee et al. (2009) provided the first constant-factor approximation algorithm, with an approximation ratio of (1/4 - ε) and a query complexity of O(nk^5 log(k)). Building upon that, Gupta et al. (2010) reduced the query complexity to O(nk) by replacing the local search procedure with a greedy approach, at the cost of a slightly worse approximation ratio. Subsequently, Buchbinder et al. (2014) incorporated randomness to achieve an expected 1/e approximation ratio with O(nk) query complexity. Furthermore, Buchbinder et al. (2017) improved this to O(n) query complexity while maintaining the same approximation ratio in expectation. For parallelizable algorithms in the adaptive complexity model, Ene and Nguyen (2020) achieved the current best (1/e - ε) approximation ratio with O(log(n)) adaptivity and O(nk^2 log^2(n)) queries to the continuous oracle (a continuous relaxation of the original value oracle). Later, Chen and Kuhnle (2024) achieved the current best sublinear adaptivity and nearly linear query complexity with a (1/6 - ε) approximation ratio. In terms of distributed algorithms in the MapReduce model, RandGreeDI, proposed by Barbosa et al. (2015), achieved a 0.1 approximation ratio within 2 MR rounds and O(nk) query complexity. Simultaneously, Mirrokni and Zadimoghaddam (2015) also presented a 2-round algorithm with a 0.14 approximation ratio and O(nk) query complexity. Afterwards, Barbosa et al. (2016) proposed a 2-round continuous algorithm with a 0.232(1 - 1/ℓ) approximation ratio.

2 Low-Adaptive Algorithms That Satisfy the RCP

In this section, we analyze two low-adaptive procedures, ThreshSeqMod (Alg. 1) and LAG (Alg. 2), which are variants of the low-adaptive procedures proposed in Chen et al. (2021). This analysis enables their use in the distributed, MapReduce setting. For convenience, we regard the randomness of the algorithms as being determined by a sequence of random bits q, which is an input to each algorithm.

Algorithm 1 Low-Adaptive Threshold Algorithm (ThreshSeqMod)
1: procedure ThreshSeqMod(f, X, k, δ, ε, τ, q)
2:     Input: evaluation oracle f: 2^𝒩 → ℝ+, subset X ⊆ 𝒩, constraint k, confidence δ, error ε, threshold τ, a finite set of sequences of random bits q ← {σ_1, ..., σ_{M+1}}
3:     Initialize S_0 ← ∅, R_0 ← ∅, V_0 ← X, M ← ⌈4(1 + 1/(βε)) log(n/δ)⌉, β ← ε(16 log(4/(1 - e^{-ε/2})))^{-1}
4:     for j ← 1 to M + 1 do
5:         Update V_j ← {x ∈ V_{j-1} : Δ(x | S_{j-1}) ≥ τ}
6:         if |V_j| = 0 or |S_{j-1}| = k then
7:             return S_{j-1}, R_{j-1}, success
8:         V_j ← random-permutation(V_j, σ_j)
9:         s ← min{k - |S_{j-1}|, |V_j|}
10:        Λ ← {1, 2, ..., min{s, ⌈1/ε⌉}} ∪ {⌊(1+ε)^u⌋ : 1 ≤ ⌊(1+ε)^u⌋ ≤ s, u ∈ ℕ} ∪ {s}
11:        B[λ] ← False, ∀λ ∈ Λ
12:        for λ ∈ Λ in parallel do
13:            T_λ ← {v_1, v_2, ..., v_λ}
14:            if Δ(T_λ | S_{j-1})/|T_λ| ≥ (1 - ε)τ then
15:                B[λ] ← True
16:        λ'_j ← min{λ ∈ Λ : B[λ] = False OR λ = s}
17:        R_j ← R_{j-1} ∪ T_{λ'_j}
18:        if λ'_j ≤ ⌈1/ε⌉ then λ*_j ← λ'_j - 1 else λ*_j ← λ'_j
19:        S_j ← S_{j-1} ∪ T_{λ*_j}
20:    return S_{M+1}, R_{M+1}, failure

Observe that the randomness of both ThreshSeqMod and LAG comes only from the random permutations of V_j on Line 8 of ThreshSeqMod, since LAG employs ThreshSeqMod as a subroutine. Consider an equivalent version of these algorithms in which the entire ground set 𝒩 is permuted randomly, and the permutation of V_j is extracted from it. That is, if σ is the permutation of 𝒩, the permutation of V is given by v < w iff σ(v) < σ(w), for v, w ∈ V. Then, the random vector q specifies a sequence of permutations of 𝒩, (σ_1, σ_2, ...), which completely determines the behavior of both algorithms.

2.1 Low-Adaptive Threshold (ThreshSeqMod) Algorithm

This section presents the analysis of the low-adaptive threshold algorithm ThreshSeqMod (Alg. 1), a variant of ThresholdSeq from Chen et al. (2021). In ThreshSeqMod, the randomness depends explicitly on a random vector q. This modification ensures the consistency property in the distributed setting. Besides the addition of the randomness q, ThreshSeqMod employs an alternative strategy of prefix selection within the for loop. Instead of identifying the failure point (where the average marginal gain is under the threshold) after the final success point (where the average marginal gain is above the threshold), as in Chen et al. (2021), ThreshSeqMod determines the first failure point. This adjustment not only addresses the consistency problem but also preserves the theoretical guarantees below.
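To complement the pseudocode, the following is a minimal sequential Python sketch of the control flow of Alg. 1 (ours, for illustration). It assumes q is a non-empty list of integer seeds, and it evaluates every prefix length in a plain loop rather than checking the sparser index set Λ in parallel, so it illustrates the selection rule but not the adaptivity bound.

    import math
    import random

    # Sequential sketch of ThreshSeqMod (Alg. 1), for illustration only.
    # q is assumed to be a non-empty list of integer seeds (one per outer
    # iteration, reused cyclically if short).  Prefix checks are done one by one
    # over every length instead of in parallel over the sparser index set Lambda.
    def thresh_seq_mod_sketch(f, X, k, delta, eps, tau, q):
        if not X:
            return [], [], "success"
        n = len(X)
        beta = eps / (16 * math.log(4 / (1 - math.exp(-eps / 2))))
        M = math.ceil(4 * (1 + 1 / (beta * eps)) * math.log(n / delta))
        S, R, V = [], [], list(X)
        for j in range(M + 1):
            V = [x for x in V if f(S + [x]) - f(S) >= tau]       # Line 5: filter
            if not V or len(S) == k:
                return S, R, "success"
            random.Random(q[j % len(q)]).shuffle(V)              # Line 8: permutation
            s = min(k - len(S), len(V))
            lam = s                                              # Line 16: first prefix whose
            for l in range(1, s + 1):                            # average gain fails, else s
                if (f(S + V[:l]) - f(S)) / l < (1 - eps) * tau:
                    lam = l
                    break
            R = R + V[:lam]                                      # Line 17
            lam_star = lam - 1 if lam <= math.ceil(1 / eps) else lam   # Line 18
            S = S + V[:lam_star]                                 # Line 19
        return S, R, "failure"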

Theorem 1.

Suppose ThreshSeqMod is run with input (f, X, k, δ, ε, τ, q). Then, the algorithm has adaptive complexity O(log(n/δ)/ε^3) and outputs S, R ⊆ 𝒩, where S is the solution set with |S| ≤ k and R provides additional information with |R| = O(k). The following properties hold: 1) The algorithm succeeds with probability at least 1 - δ/n. 2) There are O(n/ε^3) oracle queries in expectation. 3) It holds that f(S)/|S| ≥ (1 - 2ε)τ/(1 + ε). 4) If |S| < k, then Δ(x | S) < τ for all x ∈ 𝒩.

2.1.1 Analysis of Consistency

Lemma 1.

ThreshSeqMod satisfies the randomized consistency property (Property 1).

Proof.

Consider that the algorithm runs for all M + 1 iterations of the outer for loop; if the algorithm would have returned at iteration j, the values S_i and V_i, for i > j, keep their values from when the algorithm would have returned. The proof relies on the fact that every call to ThreshSeqMod(·, q) uses the same sequence of permutations of 𝒩: {σ_1, σ_2, ..., σ_{M+1}}. We refer to iterations of the outer for loop on Line 4 of Alg. 1 simply as iterations. Since the only randomness in Alg. 1 comes from the random permutation of 𝒩 at each iteration, the behavior of ThreshSeqMod(·, q) is determined by q, which satisfies the hypothesis of Property 1.

For the two sets returned by ThreshSeqMod(𝒩, q), S_{M+1} = ThreshSeqModSol(𝒩, q) represents the feasible solution, and R_{M+1} = ThreshSeqModRel(𝒩, q) is the set that provides additional information. We consider the runs of (1) ThreshSeqMod(A, q), (2) ThreshSeqMod(A ∪ {b}, q), and (3) ThreshSeqMod(A ∪ B, q) together. Variables of (1) are given the notation defined in the pseudocode; variables of (2) are given the superscript b; and variables of (3) are given the superscript ′. Suppose for each b ∈ B, R^b_{M+1} = R_{M+1}. The lemma holds if S'_{M+1} = S_{M+1} and ThreshSeqMod(A ∪ B, q) terminates successfully.

Let P(i) be the statement, for iteration i of the outer for loop on Line 4, that

  • (i) S_i = S'_i, and

  • (ii) for all b ∈ B, S_i = S^b_i and R_i = R^b_i, and

  • (iii) V_i = V'_i \ B, and

  • (iv) for all b ∈ B, V_i = V^b_i \ {b}.

If P(M+1) is true, and ThreshSeqMod(A, q) and ThreshSeqMod(A ∪ B, q) terminate at the same iteration (which implies that ThreshSeqMod(A ∪ B, q) also terminates successfully), then the lemma holds immediately. In the following, we prove these two statements by induction.

For i = 0, it is clear that S_0 = S'_0 = R_0 = R^b_0 = ∅, and V_0 = V'_0 \ B = V^b_0 \ {b} = 𝒩 \ B for all b ∈ B. Therefore, P(0) is true. Next, suppose that P(i-1) is true. We show that P(i) is also true, and that if ThreshSeqMod(A, q) terminates at iteration i, then ThreshSeqMod(A ∪ B, q) also terminates at iteration i.

Firstly, we show that (iii) and (iv) of P(i) hold. Since P(i-1) holds, S_{i-1} = S'_{i-1} = S^b_{i-1} for any b ∈ B by (i) and (ii) of P(i-1). So, (iii) and (iv) of P(i) clearly hold, since the sets S_{i-1}, S'_{i-1}, S^b_{i-1} involved in updating V_i, V'_i, V^b_i on Line 5 are equal (after B is removed).

Secondly, we prove that (i) and (ii) of P(i) hold, and that if ThreshSeqMod(A, q) terminates at iteration i, then ThreshSeqMod(A ∪ B, q) also terminates at iteration i.

If ThreshSeqMod(A, q) terminates at iteration i because |S_{i-1}| = k, then the other two runs of ThreshSeqMod also terminate at iteration i, since |S'_{i-1}| = |S^b_{i-1}| = |S_{i-1}| = k by (i) and (ii) of P(i-1). Moreover, (i) and (ii) of P(i) are true since S'_i, S^b_i, S_i, R^b_i, and R_i do not update.

If ThreshSeqMod(A, q) terminates at iteration i because V_i = ∅, then by (iii) and (iv) of P(i), it holds that V^b_i \ {b} = V'_i \ B = ∅ for all b ∈ B. Thus, V^b_i must be empty; otherwise, V^b_i = {b} and b would be added to R^b_i at iteration i, which contradicts the fact that b ∉ R_{M+1} = R^b_{M+1}. Therefore, b has been filtered out before iteration i or will be filtered out at iteration i in both ThreshSeqMod(A ∪ {b}, q) and ThreshSeqMod(A ∪ B, q), since the sets S^b_{i-1} and S'_{i-1} involved in updating V^b_i and V'_i are the same. Consequently, V'_i is also an empty set and ThreshSeqMod(A ∪ B, q) terminates at iteration i. Furthermore, (i) and (ii) of P(i) are true since S'_i, S^b_i, S_i, R^b_i, and R_i do not update.

Figure 1: This figure depicts the relationship between V_i, V^{b_1}_i and V'_i in the circumstance that |S_{i-1}| < k and V_i ≠ ∅.

Finally, consider the case that ThreshSeqMod(A, q) does not terminate at iteration i, where V_i ≠ ∅ and |S_{i-1}| < k. By (iii) and (iv), it holds that V_i = V'_i \ B = V^b_i \ {b}, for any b ∈ B. Let b_1 ∈ B be the first element of B appearing in the sequence V'_i, and let λ^# be the index of b_1 in the order of the permutation of V'_i. If such a b_1 does not exist (i.e., V_i = V'_i = V^b_i, for any b ∈ B), let λ^# = |V_i| + 1. Fig. 1 depicts the relationship between the three sequences V_i, V^{b_1}_i and V'_i. Let T_{λ'_i}, T^{b_1}_{λ'_i}, T'_{λ'_i} be the sets added to R_{i-1}, R^{b_1}_{i-1}, R'_{i-1} at iteration i, respectively. Since V^{b_1}_i and V'_i are subsets of the same random permutation of 𝒩, and V'_i \ B = V^{b_1}_i \ {b_1}, λ^# is also the index of b_1 in V^{b_1}_i. Also, it holds that

T_x = T^{b_1}_x = T'_x,  ∀x < λ^#.        (2)

Since b_1 ∉ R^{b_1}_{M+1}, it holds that b_1 ∉ T^{b_1}_{λ'_i}, which indicates that λ'_i < λ^#. By the definition of λ'_i on Line 16,

B^{b_1}[x] = True, ∀x < λ'_i,   and   B^{b_1}[λ'_i] = False.

It follows from Equation (2) and λ'_i < λ^# that

B[x] = B'[x] = True, ∀x < λ'_i,   and   B[λ'_i] = B'[λ'_i] = False.

Therefore, λ'_i is the same across the three runs, so T_{λ'_i} = T^b_{λ'_i} = T'_{λ'_i}, and further T_{λ*_i} = T^b_{λ*_i} = T'_{λ*_i}. So, Properties (i) and (ii) hold. ∎

2.1.2 Analysis of Guarantees

Altogether, there are three adjustments in ThreshSeqMod compared with ThresholdSeq in Chen et al. (2021). First, we make the randomness q explicit, which allows us to still treat each permutation on Line 8 as a random step. Therefore, the analysis of the theoretical guarantees is not affected by adopting the randomness q. Second, the prefix selection step is changed from identifying the failure point after the final success point to identifying the first failure point. Note that this change is necessary for the algorithm to satisfy the RCP (Property 1). However, with this change, fewer elements may be filtered out at the next iteration. Fortunately, we are still able to filter out a constant fraction of the candidate set. Third, the algorithm returns one more set that provides extra information. Since the additional set does not affect the functioning of the algorithm, it does not influence the theoretical guarantees. In the following, we provide the detailed analysis of the theoretical guarantees.

As demonstrated by Lemma 2 below, an increase in the number of selected elements results in a corresponding increase in the number of filtered elements. Consequently, there exists a point, say t, such that a given constant fraction of V_j can be filtered out if we add more than t elements. The proof of this lemma is quite similar to the proof of Lemma 12 in Chen et al. (2021) and can be found in Appendix C.

Lemma 2.

At an iteration j, let A_i = {x ∈ V_j : Δ(x | T_i ∪ S_{j-1}) < τ} after Line 5. It holds that |A_0| = 0, |A_{|V_j|}| = |V_j| and |A_i| ≤ |A_{i+1}|.

Although we use a smaller prefix based on Lines 16 and 18 compared to ThresholdSeq, it still holds that, with probability at least 1/2, a constant fraction of V_j will be filtered out at the beginning of iteration j+1, or the algorithm terminates with |S_j| = k; this is given as Lemma 3 below. Intuitively, if there are enough such iterations, ThreshSeqMod succeeds. Also, the calculation of the query complexity follows directly from Lemma 3. Since the analyses of success probability, adaptivity and query complexity simply follow the analysis in Chen et al. (2021), we provide these analyses in Appendix C. Next, we prove Lemma 3.

Lemma 3.

At any iteration j + 1, with probability at least 1/2, either |V_{j+1}| ≤ (1 - βε)|V_j| after Line 5, or the algorithm terminates with |S_j| = k on Line 7.

Proof.

By the definition of A_i in Lemma 2, A_{λ*_j} will be filtered out from V_j at the next iteration. Equivalently, A_{λ*_j} = V_j \ V_{j+1}. After Line 8 at iteration j+1, by Lemma 2, there exists a t such that t = min{i ∈ ℕ : |A_i| ≥ βε|V_j|}. We say an iteration j succeeds if λ*_j ≥ min{s, t}. In this case, it holds that either |V_j \ V_{j+1}| ≥ βε|V_j|, or |S_j| = k. In other words, iteration j succeeds if there does not exist λ ≤ min{s, t} such that B[λ] = False, by the selection of λ*_j on Line 18. Instead of directly calculating the success probability, we consider the failure probability. Define an element v_i to be bad if Δ(v_i | S_{j-1} ∪ T_{i-1}) < τ, and good otherwise. Consider the random permutation of V_j as a sequence of dependent Bernoulli trials, with success if the element is bad and failure otherwise. When i ≤ min{s, t}, the probability that v_i is bad is less than βε. If B[λ] = False, there are at least ελ bad elements in T_λ. Let {Y_i}_{i=1}^∞ be a sequence of independent and identically distributed Bernoulli trials, each with success probability βε. Next, we consider the following sequence of dependent Bernoulli trials {X_i}_{i=1}^∞, where X_i = 1_{v_i is bad} when i ≤ t, and X_i = Y_i otherwise. Define B'[λ] = False if there are at least ελ bad elements in {X_i}_{i=1}^λ. Then, for any λ ≤ min{s, t}, it holds that Pr(B[λ] = False) = Pr(B'[λ] = False). In the following, we bound the probability that there are more than an ε-fraction of 1s in {X_i}_{i=1}^λ for any λ ≤ s,

Pr(B'[λ] = False) ≤ Pr(Σ_{i=1}^λ X_i ≥ ελ)
                  ≤ Pr(Σ_{i=1}^λ Y_i ≥ ελ)                          (Lemma 9)
                  ≤ min{β, e^{-((1-β)^2/(1+β)) ελ}}                  (Markov's Inequality, Lemma 10)
                  ≤ min{β, e^{-ελ/2}}                                (β < 1/16)

Subsequently, the probability of failure at iteration jj is calculated below,

Pr(iteration j fails) = Pr(∃ λ ≤ min{s, t} s.t. B[λ] = False)
                      ≤ Pr(∃ λ ≤ s s.t. B'[λ] = False)
                      ≤ Σ_{λ ≤ s} Pr(B'[λ] = False)
                      ≤ Σ_{λ=1}^∞ min{β, e^{-ελ/2}}
                      ≤ aβ + Σ_{λ=a+1}^∞ e^{-ελ/2},  where a = ⌊1/(4β)⌋ = ⌊(2/ε) log(4/(1 - e^{-ε/2}))⌋
                      ≤ aβ + e^{-ε(a+1)/2}/(1 - e^{-ε/2}) ≤ 1/4 + 1/4 = 1/2

As for Properties (3) and (4) in Theorem 1, since the structure of the solution set is the same as that of the solution of ThresholdSeq, the analysis is similar to that in Chen et al. (2021). We provide these proofs in Appendix C.

2.2 Low-Adaptive Greedy (LAG) Algorithm

Algorithm 2 Low-Adaptive Greedy (LAG)
1: Input: Evaluation oracle f: 2^𝒩 → ℝ+, subset C, constraint k, fixed sequence of random bits (q_1, q_2, ...), accuracy parameter ε, constant α, value Γ such that Γ ≤ f(O) ≤ Γ/α
2: if Γ = ∅ or α = ∅ then
3:     Γ ← max_{x ∈ 𝒩} f(x)
4:     α ← 1/k
5: Initialize τ ← Γ/(αk), δ ← 1/(log_{1-ε}(α/3) + 1), S ← ∅, R ← ∅, i ← 0
6: for i ← 0 to log_{1-ε}(α/3) do
7:     τ ← τ(1 - ε)^i, g(·) ← f(S ∪ ·)
8:     T_1, T_2 ← ThreshSeqMod(g, C, k - |S|, δ, ε/3, τ, q_i)
9:     S ← S ∪ T_1, R ← R ∪ T_2
10:    if |S| = k then
11:        return S, R
12: return S, R

Another building block for our distributed algorithms is a simple, low-adaptive greedy algorithm LAG (Alg. 2). This algorithm is an instantiation of the ParallelGreedyBoost framework of Chen et al. (2021), and it relies heavily on the low-adaptive procedure ThreshSeqMod (Alg. 1).

Guarantees.  Following the analysis of Theorem 3 in Chen et al. (2021), LAG achieves ratio 1 - 1/e - ε in O(log(α^{-1}) log(n log(α^{-1})/ε)/ε^4) adaptive rounds and O(n log(α^{-1})/ε^4) total queries with probability at least 1 - 1/n. Setting Γ = max_{x∈𝒩} f({x}) and α = 1/k, LAG has an adaptive complexity of O(log n log k / ε^4) and a query complexity of O(n log k / ε^4).
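For illustration, here is a minimal sketch of LAG built on top of the thresh_seq_mod_sketch above (ours, not the paper's implementation); q is assumed to be a non-empty list of per-call seed lists, and the threshold schedule is written non-cumulatively as τ_i = Γ/(αk)·(1 - ε)^i.

    import math

    # Sketch of LAG (Alg. 2) built on thresh_seq_mod_sketch above; Gamma is the
    # maximum singleton value and alpha = 1/k, as in the default setting.  q is
    # assumed to be a non-empty list of per-call seed lists, one per threshold.
    def lag_sketch(f, C, k, eps, q):
        gamma = max(f([x]) for x in C)
        alpha = 1.0 / k
        tau0 = gamma / (alpha * k)                    # initial threshold (= gamma here)
        n_iters = int(math.log(alpha / 3) / math.log(1 - eps)) + 1
        delta = 1.0 / (n_iters + 1)
        S, R = [], []
        for i in range(n_iters):
            tau = tau0 * (1 - eps) ** i               # Line 7: geometrically decreasing threshold
            g = lambda T, base=list(S): f(base + T)   # Line 7: residual oracle g(.) = f(S U .)
            T1, T2, _ = thresh_seq_mod_sketch(
                g, [x for x in C if x not in S], k - len(S),
                delta, eps / 3, tau, q[i % len(q)])
            S, R = S + T1, R + T2                     # Line 9
            if len(S) == k:                           # Lines 10-11
                break
        return S, R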

Lemma 4.

LAG satisfies the randomized consistency property (Property 1).

Proof of Lemma 4.

Observe that the only randomness in LAG is from the calls to ThreshSeqMod. Since LAG(A, q) succeeds, every call to ThreshSeqMod must succeed as well. Moreover, since q is used to permute the underlying ground set 𝒩, changing the set argument A of LAG does not change the sequence received by each call to ThreshSeqMod. Order the calls to ThreshSeqMod: ThreshSeqMod(·, q_1), ThreshSeqMod(·, q_2), ..., ThreshSeqMod(·, q_m). Since LAGRel(A ∪ {b}, q) = LAGRel(A, q), it holds that b ∉ ThreshSeqModRel(A ∪ {b}, q_i), for each b ∈ B and each i. Because b ∉ ThreshSeqModRel(A ∪ {b}, q_i) is equivalent to ThreshSeqModRel(A ∪ {b}, q_i) = ThreshSeqModRel(A, q_i) (as shown in Claim 1 below), by application of Lemma 1, ThreshSeqMod(A ∪ B, q_i) terminates successfully and ThreshSeqModSol(A ∪ B, q_i) = ThreshSeqModSol(A, q_i) for each i. Therefore, LAGSol(A ∪ B, q) = LAGSol(A, q) and the former call terminates successfully. ∎

Next, we provide Claim 1 and its analysis below.

Claim 1.

Let q be a fixed sequence of random bits, A ⊆ 𝒩, and b ∈ 𝒩 \ A. Then, b ∉ ThreshSeqModRel(A ∪ {b}, q) if and only if ThreshSeqModRel(A ∪ {b}, q) = ThreshSeqModRel(A, q).

Proof.

It is obvious that if ThreshSeqModRel(A ∪ {b}, q) = ThreshSeqModRel(A, q), then b ∉ ThreshSeqModRel(A ∪ {b}, q). In the following, we prove the reverse direction.

Following the analysis of the RCP for ThreshSeqMod in Section 2.1.1, we consider the runs of (1) ThreshSeqMod(A, q) and (2) ThreshSeqMod(A ∪ {b}, q). Variables of (1) are given the notation defined in the pseudocode; variables of (2) are given the superscript b. We analyze the following statement P(i) for each iteration i of the outer for loop on Line 4 in Alg. 1.

  • (i) S_i = S^b_i and R_i = R^b_i, and

  • (ii) V_i = V^b_i \ {b}.

Since b ∉ ThreshSeqModRel(A ∪ {b}, q) = R^b_{M+1}, it holds that b ∉ R^b_i for each i. Thus, the analysis in Section 2.1.1 also holds in this case. By induction, P(i) is true for each i; in particular, P(M+1) is true, which indicates that ThreshSeqModRel(A ∪ {b}, q) = ThreshSeqModRel(A, q). ∎

3 Consistent Linear Time (Linear-TimeCardinality) Algorithm

In this section, we present the analysis of Linear-TimeCardinality (Alg. 3, LTC), a consistent linear-time algorithm that extends the highly adaptive linear-time algorithm (Alg. 3 in Chen et al. (2021)). Notably, to ensure consistency in a distributed setting and to bound the solution size, LTC incorporates the randomness q and initializes the solution set with the maximum singleton. The algorithm facilitates the creation of linear-time MapReduce algorithms, enhancing overall computational efficiency beyond the capabilities of current state-of-the-art superlinear algorithms.

Algorithm 3 Linear-TimeCardinality (LTC)
1: procedure LTC(f, 𝒩, k, q)
2:     Input: evaluation oracle f, constraint k, fixed sequence of random bits q ← (σ)
3:     𝒩 ← random-permutation(𝒩, σ)
4:     Initialize a ← first element in arg max_{x ∈ 𝒩} f(x), S ← {a}
5:     for x ∈ 𝒩 do
6:         if Δ(x | S) ≥ f(S)/k then S ← S ∪ {x}
7:     return S
Theorem 2.

Let (f, k) be an instance of SM. The algorithm Linear-TimeCardinality outputs S ⊆ 𝒩 such that the following properties hold: 1) There are O(n) oracle queries and O(n) adaptive rounds. 2) Let S' be the last k elements in S. It holds that f(S') ≥ (1/2) f(S) ≥ (1/4) f(O), where O is an optimal solution to the instance (f, k). 3) The size of S is limited to O(k log(n)).
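A minimal Python sketch of LTC (ours, for illustration), with an integer seed standing in for the random-bit sequence q:

    import random

    # Sketch of Linear-TimeCardinality (Alg. 3): one pass over a single fixed
    # random permutation of the ground set; an element is added when its marginal
    # gain is at least f(S)/k.  The integer seed stands in for q.
    def ltc_sketch(f, ground_set, k, seed):
        N = sorted(ground_set)                        # fix a base order, then permute
        random.Random(seed).shuffle(N)                # Line 3: permutation from q
        a = max(N, key=lambda x: f([x]))              # Line 4: first maximizing singleton
        S = [a]
        for x in N:                                   # Lines 5-6: single linear pass
            if x != a and f(S + [x]) - f(S) >= f(S) / k:
                S.append(x)
        return S                                      # |S| = O(k log n); its last k
                                                      # elements carry the 1/4 guarantee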

3.1 Analysis of Consistency

The highly adaptive linear-time algorithm (Alg. 3) outlined in Chen et al. (2021) commences with an empty set and adds an element iff Δ(x | S) ≥ f(S)/k. Without introducing any randomness, this algorithm can be deterministic only if the order of the ground set is fixed. Additionally, to limit the solution size, we initialize the solution set with the maximum singleton, a choice that also impacts the algorithm's consistency. However, by selecting the first element that maximizes the objective value, the algorithm maintains its consistency. In the following, we provide the analysis of the randomized consistency property of LTC.

Lemma 5.

LTC satisfies the randomized consistency property (Property 1).

Proof.

Consider the runs of (1) LTC(A, q), (2) LTC(A ∪ {b}, q), and (3) LTC(A ∪ B, q). With the same sequence of random bits q, after the random permutation, A, A ∪ {b} and A ∪ B are subsets of the same sequence. For any x ∈ 𝒩, let i_x be the index of x. Then, 𝒩_{i_x} denotes the elements before x (including x). Define S_{i_x} to be the intermediate solution of (1) after we consider element x. If x ∉ A, define S_{i_x} = S_{i_x-1}. Similarly, define S^b_{i_x} and S'_{i_x} to be the intermediate solutions of (2) and (3), respectively. If S_{i_x} = S^b_{i_x} = S'_{i_x}, for any x ∈ 𝒩 and b ∈ B, then LTC(A ∪ B, q) = LTC(A, q). Next, we prove that this statement holds.

For any b ∈ B, since LTC(A, q) = LTC(A ∪ {b}, q), it holds that either f(b) < max_{x∈A} f(x), or f(b) = max_{x∈A} f(x) and i_a < i_b, where a is the first element in arg max_{x∈A} f(x). Therefore, a is also the first element in arg max_{x∈A∪B} f(x). Furthermore, it holds that S_0 = S^b_0 = S'_0 = {a}.

Suppose that S_{i_x-1} = S^b_{i_x-1} = S'_{i_x-1}. If x ∈ A or x ∉ A ∪ B, naturally S_{i_x} = S^b_{i_x} = S'_{i_x}. If x ∈ B, then S_{i_x} = S_{i_x-1} immediately. Since LTC(A, q) = LTC(A ∪ {b}, q), it holds that Δ(x | S^b_{i_x-1}) < f(S^b_{i_x-1})/k and S^b_{i_x} = S^b_{i_x-1}. So, Δ(x | S'_{i_x-1}) < f(S'_{i_x-1})/k. Then, S'_{i_x} = S'_{i_x-1}. Thus, S_{i_x} = S^b_{i_x} = S'_{i_x}. ∎

3.2 Analysis of Guarantees

Query Complexity and Adaptivity. Alg. 3 makes queries on Lines 4 and 6: there are n oracle queries on Line 4 and one oracle query for each element processed on Line 6. Therefore, the query complexity and the adaptivity are both O(n).

Solution Size. Given that f is a monotone function, it holds that arg max_{S⊆𝒩} f(S) = 𝒩. Furthermore, since f({a}) = max_{x∈𝒩} f({x}), submodularity implies that f({a}) ≥ (1/n) Σ_{x∈𝒩} f({x}) ≥ (1/n) f(𝒩). Each element added to S increases the solution value by a factor of at least (1 + 1/k). Therefore,

f(𝒩) ≥ f(S) ≥ (1 + 1/k)^{|S|} f(a) ≥ (1 + 1/k)^{|S|} f(𝒩)/n
⟹ |S| ≤ log_{1+1/k}(n) ≤ (k + 1) log(n)        (log(x) ≥ 1 - 1/x)

Objective Value. For any x𝒩x\in\mathcal{N}, let SxS_{x} be the intermediate solution before we process xx. It holds that Δ(o|So)<f(So)/k\Delta\left(o\,|\,S_{o}\right)<f(S_{o})/k for any oOSo\in O\setminus S. Then,

f(O)f(S)(1)oOSΔ(o|S)(2)oOSΔ(o|So)<oOSf(So)/kf(S),f(O)-f(S)\overset{(1)}{\leq}\sum_{o\in O\setminus S}\Delta\left(o\,|\,S\right)\overset{(2)}{\leq}\sum_{o\in O\setminus S}\Delta\left(o\,|\,S_{o}\right)<\sum_{o\in O\setminus S}f(S_{o})/k\leq f(S),

where Inequality (1) follows from monotonicity and submodularity, and Inequality (2) follows from submodularity.

For any xSx\in S^{\prime}, it holds that Δ(x|Sx)f(Sx)/k\Delta\left(x\,|\,S_{x}\right)\geq f(S_{x})/k. Then,

f(S)f(SS)=xSΔ(x|Sx)xSf(Sx)/k(1)xSf(SS)/k=f(SS)\displaystyle f(S)-f(S\setminus S^{\prime})=\sum_{x\in S^{\prime}}\Delta\left(x\,|\,S_{x}\right)\geq\sum_{x\in S^{\prime}}f(S_{x})/k\overset{(1)}{\geq}\sum_{x\in S^{\prime}}f(S\setminus S^{\prime})/k=f(S\setminus S^{\prime})
\displaystyle\Rightarrow f(S)(2)f(S)f(SS)12f(S)\displaystyle\hskip 10.00002ptf(S^{\prime})\overset{(2)}{\geq}f(S)-f(S\setminus S^{\prime})\geq\frac{1}{2}f(S)

where Inequalities (1) and (2) follow from submodularity.

Algorithm 4 Randomized-DASH (R-DASH)
1:Input: Evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk, error ε\varepsilon, available machines M{1,2,,}M\leftarrow\{1,2,...,\ell\}
2:for e𝒩e\in\mathcal{N}  do
3:     Assign ee to machine chosen uniformly at random
4:for iMi\in M do
5:     \triangleright On machine ii
6:     Let 𝒩i\mathcal{N}_{i} be the elements assigned to machine ii
7:     Si,RiLAG(f,𝒩i,k,ε)S_{i},R_{i}\leftarrow\textsc{LAG}(f,\mathcal{N}_{i},k,\varepsilon)
8:     Send Si,RiS_{i},R_{i} to primary machine
9:\triangleright On primary machine
10:Gather Ri=1RiR\leftarrow\bigcup_{i=1}^{\ell}R_{i}
11:TLAG(f,R,k,ε)T\leftarrow\textsc{LAG}(f,R,k,\varepsilon) on machine m1m_{1}
12:return Vargmax{f(T),f(S1)}V\leftarrow\operatorname*{arg\,max}{\{f(T),f(S_{1})}\}

4 Low-Adaptive Algorithms with Constant MR Rounds (Greedy-DASH and Randomized-DASH)

Once we have the randomized consistency property of LAG, we can almost immediately use it to obtain parallelizable MapReduce algorithms.

4.1 Randomized-DASH

R-DASH (Alg. 4) is a two-MR-round algorithm obtained by plugging LAG into the RandGreeDI algorithm of Mirzasoleiman et al. (2013); Barbosa et al. (2015). It runs in two MapReduce rounds and O(log(k)log(n))O(\log{}(k)\log{}(n)) adaptive rounds, and guarantees the ratio 12(11/eε)\frac{1}{2}(1-1/e-\varepsilon) (\simeq 0.316).

Description.  R-DASH first distributes the ground set uniformly at random across all machines MM. In its first MR round, it runs LAG on every machine to obtain Si,RiS_{i},R_{i} in O(log(k)log(|𝒩i|))O\left(\log(k)\log(|\mathcal{N}_{i}|)\right) adaptive rounds. The pairs from every machine are then sent to the primary machine, where LAG is run once more on the union of the relevant sets; the better of this solution and S1S_{1} guarantees the 12(11/eε)\frac{1}{2}(1-1/e-\varepsilon) approximation in O(log(k)log(|R|))O\left(\log(k)\log(|R|)\right) adaptive rounds, as stated in Corollary 1. We first prove Theorem 3 by plugging any randomized algorithm Alg that satisfies RCP (Property 1) into the framework of RandGreeDI; the proof is a minor modification of the proof of Barbosa et al. (2015) to incorporate randomized consistency. Corollary 1 for R-DASH then follows immediately.
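For illustration only, the two MR rounds can be simulated serially as in the following Python sketch (ours); the helper lag stands in for the LAG subroutine and is assumed to return a (solution, relevant set) pair as in Alg. 4.

import random

def r_dash(f, ground, k, eps, ell, lag, seed=0):
    rng = random.Random(seed)
    # MR round 1: partition the ground set uniformly at random over ell machines.
    parts = [[] for _ in range(ell)]
    for e in ground:
        parts[rng.randrange(ell)].append(e)
    results = [lag(f, part, k, eps) for part in parts]   # run in parallel on a real cluster
    S1 = results[0][0]
    # MR round 2: the primary machine runs LAG on the union of the relevant sets.
    R = [e for (_, R_i) in results for e in R_i]
    T, _ = lag(f, R, k, eps)
    # Return the better of T and the first machine's solution S1 (Line 12 of Alg. 4).
    return max([S1, T], key=lambda A: f(set(A)))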

Theorem 3.

Let (f,k)(f,k) be an instance of SM where k=O(ψ/)k=O\left(\psi/\ell\right), and let Alg be a randomized algorithm that satisfies the Randomized Consistency Property with approximation ratio α\alpha, query complexity Φ(n)\Phi(n), and adaptivity Ψ(n)\Psi(n). With LAG replaced by Alg, R-DASH returns a set VV within two MR rounds, 2Ψ(n/)2\Psi(n/\ell) adaptive rounds, (+1)Φ(n/)(\ell+1)\Phi(n/\ell) total queries, and O(n)O(n) communication complexity, such that

𝔼[f(V)]α2OPT.\mathbb{E}[f(V)]\geq\frac{\alpha}{2}\textsc{OPT}.
Proof.

Query Complexity and Adaptivity. R-DASH runs with two MR rounds. In the first MR round, Alg is invoked \ell times in parallel, each with a ground set size of n/n/\ell. So, during the first MR round, the number of queries is Φ(n/)\ell\Phi(n/\ell) and the number of adaptive rounds is Ψ(n/)\Psi(n/\ell). Then, the second MR round calls Alg once, handling at most n/n/\ell elements. Consequently, the total number of queries of R-DASH is (+1)Φ(n/)(\ell+1)\Phi(n/\ell), and the total number of adaptive rounds is 2Ψ(n/)2\Psi(n/\ell).

Approximation Ratio. Let R-DASH be run with input (f,k,ε,M)(f,k,\varepsilon,M). Since n1c\ell\leq n^{1-c}, 𝔼[|𝒩i|]=n/nc\mathbb{E}\left[|\mathcal{N}_{i}|\right]=n/\ell\geq n^{c}. An application of Chernoff's bound (Lemma 10) shows that the size |𝒩i||\mathcal{N}_{i}| is concentrated around this expectation.

Let 𝒩(1/)\mathcal{N}(1/\ell) denote the random distribution over subsets of 𝒩\mathcal{N} where each element is included independently with probability 1/1/\ell. For x𝒩x\in\mathcal{N}, let

px={PrX𝒩(1/),𝐪[AlgRel(X{x},𝐪)=AlgRel(X,𝐪)], if xO0, otherwise .p_{x}=\begin{cases}Pr_{X\sim\mathcal{N}(1/\ell),\mathbf{q}}\left[\textsc{AlgRel}(X\cup\{x\},\mathbf{q})=\textsc{AlgRel}(X,\mathbf{q})\right]\text{, if }x\in O\\ 0\hskip 130.0002pt\text{, otherwise }\end{cases}.

Consider S1=AlgSol(𝒩1,𝐪)S_{1}=\textsc{AlgSol}(\mathcal{N}_{1},\mathbf{q}) and R1=AlgRel(𝒩1,𝐪)R_{1}=\textsc{AlgRel}(\mathcal{N}_{1},\mathbf{q}) on machine m1m_{1}. Let O1={oO:AlgRel(𝒩1{o},𝐪)=AlgRel(𝒩1,𝐪)}O_{1}=\{o\in O:\textsc{AlgRel}(\mathcal{N}_{1}\cup\{o\},\mathbf{q})=\textsc{AlgRel}(\mathcal{N}_{1},\mathbf{q})\}. Since Alg follows the Randomized Consistency Property, AlgSol(𝒩1O1,𝐪)=AlgSol(𝒩1,𝐪)=S1\textsc{AlgSol}(\mathcal{N}_{1}\cup O_{1},\mathbf{q})=\textsc{AlgSol}(\mathcal{N}_{1},\mathbf{q})=S_{1}. Therefore, f(S1)αf(O1)f(S_{1})\geq\alpha f(O_{1}). Next, let O2=ORO_{2}=O\cap R, where R=i=1RiR=\bigcup_{i=1}^{\ell}R_{i}. It holds that f(T)αf(O2)f(T)\geq\alpha f(O_{2}). Let oOo\in O; oo is assigned to 𝒩c\mathcal{N}_{c} on some machine mcm_{c}. It holds that,

Pr[oO2]\displaystyle\Pr[o\in O_{2}] =Pr[oAlgRel(𝒩c)|o𝒩c]\displaystyle=\Pr[o\in\textsc{AlgRel}(\mathcal{N}_{c})|o\in\mathcal{N}_{c}]
=Pr[oAlgRel(𝒩c{o})]\displaystyle=\Pr[o\in\textsc{AlgRel}(\mathcal{N}_{c}\cup\{o\})]
=1po.\displaystyle=1-p_{o}.

Therefore,

𝔼[f(V)]\displaystyle\mathbb{E}\left[f(V)\right] 12(𝔼[f(S1)]+𝔼[f(T)])\displaystyle\geq\frac{1}{2}\left(\mathbb{E}[f(S_{1})]+\mathbb{E}[f(T)]\right)
α2(𝔼[f(O1)]+𝔼[f(O2)])\displaystyle\geq\frac{\alpha}{2}\left(\mathbb{E}[f(O_{1})]+\mathbb{E}[f(O_{2})]\right)
α2(F(𝐩)+F(𝟏O𝐩))\displaystyle\geq\frac{\alpha}{2}\left(F(\mathbf{p})+F(\mathbf{1}_{O}-\mathbf{p})\right)
α2F(𝟏O),\displaystyle\geq\frac{\alpha}{2}F(\mathbf{1}_{O}), (3)

where Inequality 3 follows from Lemma 8 and the convexity of FF. ∎

Corollary 1.

Let (f,k)(f,k) be an instance of SM where k=O(ψ/)k=O\left(\psi/\ell\right). R-DASH returns set VV with two MR rounds, O(1ε4log(k)log(n))O\left(\frac{1}{\varepsilon^{4}}\log(k)\log\left(n\right)\right) adaptive rounds, O(nlog(k)ε4)O\left(\frac{n\log(k)}{\varepsilon^{4}}\right) total queries, O(n)O(n) communication complexity, and probability at least 1n12c1-n^{1-2c} such that

𝔼[f(V)]11/eε2OPT.\mathbb{E}[f(V)]\geq\frac{1-1/e-\varepsilon}{2}\textsc{OPT}.

4.2 Greedy-DASH

Algorithm 5 Greedy-DASH(G-DASH)\textsc{Greedy-DASH}(\textsc{G-DASH})
1:Input: evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk, error ε\varepsilon, available machines M{1,2,,}M\leftarrow\{1,2,...,\ell\}
2:S,C0S\leftarrow\emptyset,C_{0}\leftarrow\emptyset
3:for r1r\leftarrow 1 to 1ε\lceil\frac{1}{\varepsilon}\rceil  do
4:     Xr,iX_{r,i}\leftarrow Elements assigned to machine ii chosen uniformly at random in round rr
5:     𝒩r,iXr,iCr1\mathcal{N}_{r,i}\leftarrow X_{r,i}\cup C_{r-1}
6:     for iMi\in M in parallel  do
7:         Sr,i,Rr,iLAG(f,𝒩r,i,k,ε)S_{r,i},R_{r,i}\leftarrow\textsc{LAG}(f,\mathcal{N}_{r,i},k,\varepsilon)
8:         Send Sr,i,Rr,iS_{r,i},R_{r,i} to each machine
9:     Sargmax{f(S),f(Sr,1),,f(Sr,)}S\leftarrow\operatorname*{arg\,max}{\{f(S),f(S_{r,1}),\cdots,f(S_{r,\ell})}\}
10:     Cri=1Rr,iCr1C_{r}\leftarrow\bigcup_{i=1}^{\ell}R_{r,i}\cup C_{r-1}
11:return S

Next, we obtain a nearly optimal ratio by applying LAG and randomized consistency to the DistributedDistorted algorithm proposed by Kazemi et al. (2021). DistributedDistorted is a distributed algorithm for regularized submodular maximization that relies upon (a distorted version of) the standard greedy algorithm; for SMCC, the distorted greedy algorithm reduces to the standard greedy algorithm Greedy. In Theorem 4, we show that any algorithm satisfying randomized consistency can be used in place of Greedy. In particular, by introducing LAG into DistributedDistorted, G-DASH achieves the nearly optimal (11/eε1-1/e-\varepsilon) expected ratio in 1ε\lceil\frac{1}{\varepsilon}\rceil MapReduce rounds, O(log(n)log(k))O(\log(n)\log(k)) adaptive rounds, and O(nlog(k))O(n\log(k)) total queries.

The Framework of Kazemi et al. (2021).  DistributedDistorted runs for 1/ε\lceil 1/\varepsilon\rceil MR rounds, where each round rr works as follows. First, the ground set is distributed across the machines uniformly at random. Then, each machine ii runs Greedy (with the modular term set to zero) on the data 𝒩r,i\mathcal{N}_{r,i}, which combines the elements Xr,iX_{r,i} distributed at the start of the round with the elements Cr1C_{r-1} forwarded from the previous rounds, to obtain the solution Sr,iS_{r,i}. At the end, the algorithm returns the best solution found across all machines and all rounds. To improve the adaptive rounds of DistributedDistorted, we replace the standard greedy with LAG to obtain G-DASH.

Theorem 4.

Let (f,k)(f,k) be an instance of SM where k=O(εψ/)k=O\left(\varepsilon\psi/\ell\right), and let Alg be a randomized algorithm that satisfies the Randomized Consistency Property with approximation ratio α\alpha, query complexity Φ(n)\Phi(n), and adaptivity Ψ(n)\Psi(n). With LAG replaced by Alg, G-DASH returns a set SS within O(1/ε)O\left(1/\varepsilon\right) MR rounds, 1εΨ(n)\frac{1}{\varepsilon}\Psi\left(\frac{n}{\ell}\right) adaptive rounds, εΦ(n)\frac{\ell}{\varepsilon}\Phi\left(\frac{n}{\ell}\right) total queries, and O(nε)O\left(\frac{n}{\varepsilon}\right) communication complexity, such that,

𝔼[f(S)](αε)OPT.\mathbb{E}\left[f(S)\right]\geq(\alpha-\varepsilon)\textsc{OPT}.
Proof.

Query Complexity and Adaptivity. G-DASH operates with 1ε\lceil\frac{1}{\varepsilon}\rceil MR rounds, where each MR round calls Alg \ell times in parallel. As each call of Alg works on the ground set with a size of at most n\frac{n}{\ell}, the total number of queries for G-DASH is εΦ(n)\frac{\ell}{\varepsilon}\Phi\left(\frac{n}{\ell}\right), and the total number of adaptive rounds is 1εΨ(n)\frac{1}{\varepsilon}\Psi\left(\frac{n}{\ell}\right).

Approximation ratio. Let G-DASH be run with input (f,k,ε,M)(f,k,\varepsilon,M). As in the analysis of Theorem 3, the size of 𝒩i\mathcal{N}_{i} is concentrated. Let OO be the optimal solution. For any x𝒩x\in\mathcal{N}, define

pxr={PrX𝒩(1/),𝐪[xCr1 and xAlgRel(XCr1{x},𝐪)], if xO0, otherwise .p_{x}^{r}=\begin{cases}Pr_{X\sim\mathcal{N}(1/\ell),\mathbf{q}}\left[x\not\in C_{r-1}\text{ and }x\in\textsc{AlgRel}(X\cup C_{r-1}\cup\{x\},\mathbf{q})\right]\text{, if }x\in O\\ 0\text{, otherwise}\end{cases}.

Then we provide the following lemma.

Lemma 6.

For any xOx\in O and 1r1/ε1\leq r\leq 1/\varepsilon, Pr(xCr)=r=1rpxrPr\left(x\in C_{r}\right)=\sum_{r^{\prime}=1}^{r}p_{x}^{r^{\prime}}.

Proof.
Pr(xCr)\displaystyle Pr\left(x\in C_{r}\right) =r=1rPr(xCr\Cr1)\displaystyle=\sum_{r^{\prime}=1}^{r}Pr\left(x\in C_{r^{\prime}}\backslash C_{r^{\prime}-1}\right)
=r=1rPr(xi=1Rr,i\Cr1)\displaystyle=\sum_{r^{\prime}=1}^{r}Pr\left(x\in\cup_{i=1}^{\ell}R_{r^{\prime},i}\backslash C_{r^{\prime}-1}\right)
=r=1ri=11Pr(xRr,i\Cr1|x𝒩r,i)\displaystyle=\sum_{r^{\prime}=1}^{r}\sum_{i=1}^{\ell}\frac{1}{\ell}Pr\left(x\in R_{r^{\prime},i}\backslash C_{r^{\prime}-1}\,|\,x\in\mathcal{N}_{r^{\prime},i}\right)
=r=1ri=11Pr(xCr1 and xAlgRel(Xr,iCr1{x},𝐪))\displaystyle=\sum_{r^{\prime}=1}^{r}\sum_{i=1}^{\ell}\frac{1}{\ell}Pr\left(x\not\in C_{r^{\prime}-1}\text{ and }x\in\textsc{AlgRel}(X_{r^{\prime},i}\cup C_{r^{\prime}-1}\cup\{x\},\mathbf{q})\right)
=r=1rpxr\displaystyle=\sum_{r^{\prime}=1}^{r}p_{x}^{r^{\prime}} ∎

The rest of the proof bounds f(Sr,1)f(S_{r,1}) in the following two ways.

First, let Or,1={oO:oAlgRel(Xr,1Cr1{o},𝐪)}O_{r,1}=\{o\in O:o\not\in\textsc{AlgRel}\left(X_{r,1}\cup C_{r-1}\cup\{o\},\mathbf{q}\right)\}, and Or,2=(Cr1O)Or,1O_{r,2}=\left(C_{r-1}\cap O\right)\cup O_{r,1}. Since Alg follows the Randomized Consistency Property, it holds that AlgSol(Xr,1Cr1,𝐪)=AlgSol(Xr,1Cr1Or,1,𝐪)=Sr,1\textsc{AlgSol}(X_{r,1}\cup C_{r-1},\mathbf{q})=\textsc{AlgSol}(X_{r,1}\cup C_{r-1}\cup O_{r,1},\mathbf{q})=S_{r,1}, and f(Sr,1)αf(Or,2)f\left(S_{r,1}\right)\geq\alpha f\left(O_{r,2}\right). So, for any oOo\in O,

Pr(oOr,2)=Pr(oCr1 or oAlgRel(Xr,1Cr1{o},𝐪))=1por.\displaystyle Pr\left(o\in O_{r,2}\right)=Pr\left(o\in C_{r-1}\text{ or }o\not\in\textsc{AlgRel}\left(X_{r,1}\cup C_{r-1}\cup\{o\},\mathbf{q}\right)\right)=1-p_{o}^{r}.

Therefore,

𝔼[f(Sr,1)]α𝔼[f(Or,2)]αF(𝟏O𝐩r).\displaystyle\mathbb{E}\left[f\left(S_{r,1}\right)\right]\geq\alpha\mathbb{E}\left[f\left(O_{r,2}\right)\right]\geq\alpha F\left(\mathbf{1}_{O}-\mathbf{p}^{r}\right). (4)

Second, let Or,3=Cr1OO_{r,3}=C_{r-1}\cap O. Similarly, it holds that f(Sr,1)αf(Or,3)f\left(S_{r,1}\right)\geq\alpha f\left(O_{r,3}\right). For any oOo\in O, by Lemma 6, it holds that

Pr(oOr,3)=Pr(oCr1)=r=1r1por,Pr\left(o\in O_{r,3}\right)=Pr\left(o\in C_{r-1}\right)=\sum_{r^{\prime}=1}^{r-1}p_{o}^{r^{\prime}},

Therefore,

𝔼[f(Sr,1)]α𝔼[f(Or,3)]αF(r=1r1𝐩r).\displaystyle\mathbb{E}\left[f\left(S_{r,1}\right)\right]\geq\alpha\mathbb{E}\left[f\left(O_{r,3}\right)\right]\geq\alpha F\left(\sum_{r^{\prime}=1}^{r-1}\mathbf{p}^{r^{\prime}}\right). (5)

By Inequalities 4 and 5, we bound the approximation ratio of G-DASH by the following,

𝔼[f(S)]\displaystyle\mathbb{E}\left[f(S)\right] εr=11/ε𝔼[f(Sr,1)]\displaystyle\geq\varepsilon\sum_{r=1}^{1/\varepsilon}\mathbb{E}\left[f\left(S_{r,1}\right)\right]
εα(F(r=11/ε1𝐩r)+r=11/ε1F(𝟏O𝐩r))\displaystyle\geq\varepsilon\alpha\cdot\left(F\left(\sum_{r^{\prime}=1}^{1/\varepsilon-1}\mathbf{p}^{r^{\prime}}\right)+\sum_{r=1}^{1/\varepsilon-1}F\left(\mathbf{1}_{O}-\mathbf{p}^{r}\right)\right)
(a)α(1ε)F(𝟏O)\displaystyle\overset{(a)}{\geq}\alpha(1-\varepsilon)F\left(\mathbf{1}_{O}\right)
(αε)OPT,\displaystyle\geq(\alpha-\varepsilon)\textsc{OPT},

where Inequality (a) follows from Lemma 8 and the convexity of FF. ∎

Corollary 2.

Let (f,k)(f,k) be an instance of SM where k=O(εψ/)k=O\left(\varepsilon\psi/\ell\right). G-DASH returns set SS with O(1/ε)O\left(1/\varepsilon\right) MR rounds, O(1ε5log(k)log(n))O\left(\frac{1}{\varepsilon^{5}}\log(k)\log(n)\right) adaptive rounds, O(nlog(k)ε5)O\left(\frac{n\log(k)}{\varepsilon^{5}}\right) total queries, O(nε)O\left(\frac{n}{\varepsilon}\right) communication complexity, and probability at least 1ε1n12c1-\varepsilon^{-1}n^{1-2c} such that,

𝔼[f(S)](11/eε)OPT.\mathbb{E}\left[f(S)\right]\geq(1-1/e-\varepsilon)\textsc{OPT}.

5 Threshold-DASH: Two MR-Rounds Algorithm with Improved Theoretical Properties

In this section, we present Threshold-DASH (T-DASH), a novel (0.375ε(0.375-\varepsilon)-approximate, two-MR-round algorithm that achieves nearly optimal O(log(n))O(\log(n)) adaptivity with O(nlog(k))O(n\log(k)) total time complexity.

Description.  The T-DASH algorithm (Alg. 6) is a two-MR-round algorithm that runs ThreshSeqMod concurrently on every machine with a specified threshold value of αOPT/k\alpha\textsc{OPT}/k. The primary machine then extends its solution S1S_{1} by adding elements with ThreshSeqMod from the pool of relevant elements returned by the other machines. Notice that there is a small amount of data duplication, as elements of the ground set are not randomly partitioned in the same way as in the other algorithms. This version of the algorithm requires knowledge of the value OPT; in Appendix E we show how to remove this assumption. In Appendix I, we further discuss the similar two-MR-round algorithm of Liu and Vondrak (2019), which provides an improved 1/21/2-approximation but requires four times the data duplication of T-DASH.

Algorithm 6 Threshold-DASH Knowing OPT
1:Input: Evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk, error ε\varepsilon, available machines M{1,2,,}M\leftarrow\{1,2,...,\ell\}, and OPT
2:Initialize δ1/(+1)\delta\leftarrow 1/(\ell+1), 𝐪\mathbf{q}\leftarrow a fixed sequence of random bits.
3:Set α38\alpha\leftarrow\frac{3}{8}, (𝐪1,,𝐪+1)𝐪(\mathbf{q}_{1},\ldots,\mathbf{q}_{\ell+1})\leftarrow\mathbf{q}
4:for e𝒩e\in\mathcal{N} do
5:     Assign ee to each machine independently with probability 1/1/\ell
6:for iMi\in M do
7:     \triangleright On machine ii
8:     Let 𝒩i\mathcal{N}_{i} be the elements assigned to machine ii
9:     Si,RiThreshSeqMod(f,𝒩i,k,δ,ε,(α+ε)OPT/k,𝐪i)S_{i},R_{i}\leftarrow\textsc{ThreshSeqMod}(f,\mathcal{N}_{i},k,\delta,\varepsilon,(\alpha+\varepsilon)\textsc{OPT}/k,\mathbf{q}_{i})
10:     if |Si|=k|S_{i}|=k then return TSiT^{\prime}\leftarrow S_{i}
11:     Send Si,RiS_{i},R_{i} to primary machine
12:\triangleright On primary machine
13:Ri=1RiR\leftarrow\bigcup_{i=1}^{\ell}R_{i}
14:Let g()f(S1)f(S1)g(\cdot)\leftarrow f(S_{1}\cup\cdot)-f(S_{1})
15:TThreshSeqMod(g,R,k|S1|,δ,ε,(α+ε)OPT/k,𝐪+1)T\leftarrow\textsc{ThreshSeqMod}(g,R,k-|S_{1}|,\delta,\varepsilon,(\alpha+\varepsilon)\textsc{OPT}/k,\mathbf{q}_{\ell+1})
16:TS1TT^{\prime}\leftarrow S_{1}\cup T
17:return TT^{\prime}
Theorem 5.

Let (f,k)(f,k) be an instance of SM where k=O(ψ)k=O\left(\frac{\psi}{\ell}\right). T-DASH knowing OPT returns set TT^{\prime} with two MR rounds, O(1ε3log(n))O\left(\frac{1}{\varepsilon^{3}}\log\left(n\right)\right) adaptive rounds, O(nε3)O\left(\frac{n}{\varepsilon^{3}}\right) total queries, O(n)O(n) communication complexity, and probability at least 1nc1-n^{-c} such that

𝔼[f(T)](38ε)OPT.\mathbb{E}\left[f(T^{\prime})\right]\geq\left(\frac{3}{8}-\varepsilon\right)\textsc{OPT}.
Proof.

Query Complexity and Adaptivity. T-DASH runs with two MR rounds, where the first MR round invokes ThreshSeqMod \ell times in parallel, and the second MR round invokes ThreshSeqMod once. By Theorem 1, each call of ThreshSeqMod with a ground set of size at most n\frac{n}{\ell} queries O(nε3)O\left(\frac{n}{\ell\varepsilon^{3}}\right) within O(log(n/)ε3)O\left(\frac{\log(n/\ell)}{\varepsilon^{3}}\right) adaptive rounds. The total number of queries is O(nε3)O\left(\frac{n}{\varepsilon^{3}}\right), and the total number of adaptive rounds is O(log(n)ε3)O\left(\frac{\log(n)}{\varepsilon^{3}}\right).

Approximation Ratio. In Algorithm 6, there are +1\ell+1 independent calls of ThreshSeqMod. With |𝒩i|nc|\mathcal{N}_{i}|\geq n^{c}, the success probability of each call of ThreshSeqMod is larger than 11nc(+1)1-\frac{1}{n^{c}(\ell+1)}. Thus, Algorithm 6 succeeds with probability larger than 1nc1-n^{-c}. For the remainder of the analysis, we condition on the event that all calls to ThreshSeqMod succeed.

In the case that |T|=k|T^{\prime}|=k, by Theorem 1 in Section 2.1 and τ=(38+ε)OPTk\tau=\left(\frac{3}{8}+\varepsilon\right)\frac{\textsc{OPT}}{k}, it holds that f(T)12ε1+ετk(38ε)OPT.f(T^{\prime})\geq\frac{1-2\varepsilon}{1+\varepsilon}\tau\cdot k\geq\left(\frac{3}{8}-\varepsilon\right)\textsc{OPT}. Otherwise, we consider the case that |T|<k|T^{\prime}|<k in the following. Let (TSMSol(𝒩,𝐪),TSMRel(𝒩,𝐪))(\textsc{TSMSol}(\mathcal{N},\mathbf{q}),\textsc{TSMRel}(\mathcal{N},\mathbf{q})) be the pair of sets returned by ThreshSeqMod(𝒩,𝐪)\textsc{ThreshSeqMod}(\mathcal{N},\mathbf{q}). For any x𝒩x\in\mathcal{N}, let

px={PrX𝒩(1/),𝐪[TSMRel(X{x},𝐪)=TSMRel(X,𝐪)], if xO0, otherwise .p_{x}=\begin{cases}Pr_{X\sim\mathcal{N}(1/\ell),\mathbf{q}}\left[\textsc{TSMRel}(X\cup\{x\},\mathbf{q})\right.\\ \hskip 60.00009pt\left.=\textsc{TSMRel}(X,\mathbf{q})\right]\hskip 10.00002pt\text{, if }x\in O\\ 0\hskip 130.0002pt\text{, otherwise }\end{cases}.

Let O1={oO:oTSMRel(N1{o},q)}O_{1}=\{o\in O:o\not\in\textsc{TSMRel}(N_{1}\cup\{o\},q)\}, and let O2=ROO_{2}=R\cap O. By the Randomized Consistency Property, it holds that S1=TSMSol(N1,q)=TSMSol(N1O1,q)S_{1}=\textsc{TSMSol}(N_{1},q)=\textsc{TSMSol}(N_{1}\cup O_{1},q). For any oO1o\in O_{1}, oo is not selected in S1S_{1}. Since |S1|<k|S_{1}|<k, by Property (4) in Theorem 1, Δ(o|T)<Δ(o|S1)<τ.\Delta\left(o\,|\,T^{\prime}\right)<\Delta\left(o\,|\,S_{1}\right)<\tau. Also, for any oO2\To\in O_{2}\backslash T, oo is not selected in TT. Similarly, Δ(o|T)<τ.\Delta\left(o\,|\,T^{\prime}\right)<\tau. Then, we can get,

f(O1O2)f(T)f(O1O2T)f(T)\displaystyle f(O_{1}\cup O_{2})-f(T^{\prime})\leq f(O_{1}\cup O_{2}\cup T^{\prime})-f(T^{\prime})
oO1O2\TΔ(o|T)kτ=(38+ε)OPT.\displaystyle\leq\sum_{o\in O_{1}\cup O_{2}\backslash T^{\prime}}\Delta\left(o\,|\,T^{\prime}\right)\leq k\cdot\tau=\left(\frac{3}{8}+\varepsilon\right)\textsc{OPT}. (6)

Next, we provide the following lemma to complete the rest of the proof.

Lemma 7.

For any oOo\in O, it holds that Pr(oO1O2)3/4Pr\left(o\in O_{1}\cup O_{2}\right)\geq 3/4.

Then, by Lemma 8 in Appendix A, 𝔼[f(O1O2)]3/4F(𝟏O)=3/4OPT.\mathbb{E}\left[f(O_{1}\cup O_{2})\right]\geq 3/4\cdot F(\mathbf{1}_{O})=3/4\cdot\textsc{OPT}. From this and Inequality 6, we can bound the approximation ratio of T-DASH knowing OPT as follows: 𝔼[f(T)](38ε)OPT.\mathbb{E}\left[f(T^{\prime})\right]\geq\left(\frac{3}{8}-\varepsilon\right)\textsc{OPT}. ∎

6 LinearTime-Distributed (L-Dist)

Algorithm 7 LinearTime-Distributed
1:Input: Evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk, error ε\varepsilon, available machines M{1,2,,}M\leftarrow\{1,2,...,\ell\}
2:for e𝒩e\in\mathcal{N}  do
3:     Assign ee to machine chosen uniformly at random
4:for iMi\in M do
5:     \triangleright On machine ii
6:     Let 𝒩i\mathcal{N}_{i} be the elements assigned to machine ii
7:     SiLTC(f,𝒩i,k,𝐪)S_{i}\leftarrow\textsc{LTC}(f,\mathcal{N}_{i},k,\mathbf{q})
8:     Send SiS_{i} to primary machine
9:\triangleright On primary machine
10:Gather Si=1SiS\leftarrow\bigcup_{i=1}^{\ell}S_{i}
11:T1LTC(f,S,k,𝐪)T_{1}\leftarrow\textsc{LTC}(f,S,k,\mathbf{q}) on machine m1m_{1}
12:T1T_{1}^{\prime}\leftarrow last kk elements in T1T_{1}
13:T2ThresholdGreedy(f,T1,k,ε,f(T1),1/2)T_{2}\leftarrow\textsc{ThresholdGreedy}\left(f,T_{1},k,\varepsilon,f(T_{1}^{\prime}),1/2\right) \triangleright Post-Processing
14:S1S_{1}^{\prime}\leftarrow last kk elements added to S1S_{1}
15:return Vargmax{f(S1),f(T1),f(T2)}V\leftarrow\operatorname*{arg\,max}{\{f(S_{1}^{\prime}),f(T_{1}^{\prime}),f(T_{2})}\}

In this section, we introduce L-Dist, the first linear-time MapReduce algorithm with a constant approximation ratio. We derive L-Dist from R-DASH by incorporating LTC (Alg. 3), an algorithm with O(n)O(n) query complexity, into the distributed setting. L-Dist operates within two MapReduce rounds, demonstrating an adaptive complexity of O(n)O(\frac{n}{\ell}) and a query complexity of O(n)O(n); as stated in Theorem 6, it guarantees an approximation ratio of 18\frac{1}{8}. Additionally, we improve the objective value in practice by running ThresholdGreedy (Alg. 9 in Appendix D) on the set returned in the second MapReduce round.

Description: L-Dist first distributes the ground set uniformly at random across all machines MM. In the first MR round, it applies LTC on each machine to obtain SiS_{i} within O(n/)O\left(n/\ell\right) queries. The solutions from all machines are then sent to the primary machine, where LTC computes the solution T1T_{1}. By selecting the best among the last kk elements of these solutions, L-Dist ensures a 18\frac{1}{8}-approximation. Besides returning the last kk elements of T1T_{1}, which are a 1/21/2-approximation of T1T_{1}, we employ ThresholdGreedy, a (11/eε)(1-1/e-\varepsilon)-approximation algorithm, to boost the objective value. We provide the theoretical guarantees of L-Dist as Theorem 6; the proof involves a minor modification of the proof provided in Theorem 3.

Theorem 6.

Let (f,k)(f,k) be an instance of SM where k<ψlog(ψ)k<\frac{\psi}{\ell\log(\psi)}. L-Dist returns a set VV with two MR rounds, O(n)O\left(\frac{n}{\ell}\right) adaptive rounds, O(n)O\left(n\right) total queries, O(n)O(n) communication complexity such that

𝔼[f(V)]18OPT.\mathbb{E}[f(V)]\geq\frac{1}{8}\textsc{OPT}.

6.1 Analysis of Query Complexity and Adaptivity

L-Dist runs with two MR rounds. In the first MR round, LTC is invoked \ell times in parallel, each with O(n/)O\left(n/\ell\right) queries and O(n/)O\left(n/\ell\right) adaptive rounds by Theorem 2. So, during the first MR round, the number of queries is O(n)O\left(n\right) and the number of adaptive rounds is O(n/)O\left(n/\ell\right). Then, the second MR round calls LTC and ThresholdGreedy once each, handling at most n/n/\ell elements. By Theorems 2 and 8, the number of queries is O(n/)O\left(n/\ell\right) and the number of adaptive rounds is O(n/)O\left(n/\ell\right). Consequently, the total number of queries of L-Dist is O(n)O\left(n\right), and the total number of adaptive rounds is O(n/)O\left(n/\ell\right).

6.2 Analysis of Approximation Ratio

Let L-Dist be executed with input (f,k,ε,M)(f,k,\varepsilon,M). Since n1c\ell\leq n^{1-c}, it follows that the expected size of each subset |𝒩i||\mathcal{N}_{i}| satisfies 𝔼[|𝒩i|]=n/nc\mathbb{E}\left[|\mathcal{N}_{i}|\right]=n/\ell\geq n^{c}. To ensure that the size |𝒩i||\mathcal{N}_{i}| is concentrated, we apply Chernoff’s bound. Let 𝒩(1/)\mathcal{N}(1/\ell) denote the random distribution over subsets of 𝒩\mathcal{N} where each element is included independently with probability 1/1/\ell. Let 𝐩[0,1]n\mathbf{p}\in[0,1]^{n} be the following vector. For x𝒩x\in\mathcal{N}, let

px={PrX𝒩(1/),𝐪[LTC(X{x},𝐪)=LTC(X,𝐪)], if xO0, otherwise .p_{x}=\begin{cases}Pr_{X\sim\mathcal{N}(1/\ell),\mathbf{q}}\left[\textsc{LTC}(X\cup\{x\},\mathbf{q})=\textsc{LTC}(X,\mathbf{q})\right]\text{, if }x\in O\\ 0\hskip 130.0002pt\text{, otherwise }\end{cases}.

Consider S1=LTC(𝒩1,𝐪)S_{1}=\textsc{LTC}(\mathcal{N}_{1},\mathbf{q}) on machine m1m_{1}. Let O1={oO:LTC(𝒩1{o},𝐪)=LTC(𝒩1,𝐪)}O_{1}=\{o\in O:\textsc{LTC}(\mathcal{N}_{1}\cup\{o\},\mathbf{q})=\textsc{LTC}(\mathcal{N}_{1},\mathbf{q})\}. By Lemma 5, LTC(𝒩1O1,𝐪)=S1\textsc{LTC}(\mathcal{N}_{1}\cup O_{1},\mathbf{q})=S_{1}. Therefore, by Theorem 2, it holds that f(S1)f(O1)/4f(S_{1}^{\prime})\geq f(O_{1})/4. Next, let O2=OSO_{2}=O\cap S, where S=i=1SiS=\bigcup_{i=1}^{\ell}S_{i}. Similarly, by Theorem 2, it holds that f(T1)f(O2)/4f(T_{1}^{\prime})\geq f(O_{2})/4. Let oOo\in O; oo is assigned to 𝒩c\mathcal{N}_{c} on some machine mcm_{c}. It holds that,

Pr[oO2]\displaystyle\Pr[o\in O_{2}] =Pr[oLTC(𝒩c)|o𝒩c]\displaystyle=\Pr[o\in\textsc{LTC}(\mathcal{N}_{c})|o\in\mathcal{N}_{c}]
=Pr[oLTC(𝒩c{o})]\displaystyle=\Pr[o\in\textsc{LTC}(\mathcal{N}_{c}\cup\{o\})]
=1po.\displaystyle=1-p_{o}.

Therefore,

𝔼[f(V)]\displaystyle\mathbb{E}\left[f(V)\right] 12(𝔼[f(S1)]+𝔼[f(T1)])\displaystyle\geq\frac{1}{2}\left(\mathbb{E}[f(S_{1}^{\prime})]+\mathbb{E}[f(T_{1}^{\prime})]\right)
12(𝔼[f(O1)/4]+𝔼[f(O2)/4])\displaystyle\geq\frac{1}{2}\left(\mathbb{E}[f(O_{1})/4]+\mathbb{E}[f(O_{2})/4]\right)
18(F(𝐩)+F(𝟏O𝐩))\displaystyle\geq\frac{1}{8}\left(F(\mathbf{p})+F(\mathbf{1}_{O}-\mathbf{p})\right)
18F(𝟏O)=18OPT,\displaystyle\geq\frac{1}{8}F(\mathbf{1}_{O})=\frac{1}{8}\textsc{OPT}, (7)

where Inequality 7 follows from Lemma 8 and the convexity of FF.

6.3 Post-Processing

In this algorithm, we employ a simple post-processing procedure. Since T1T_{1}^{\prime} is a 1/21/2-approximation of T1T_{1} with size kk, T1T_{1}^{\prime} is also a 1/21/2-approximation of the instance (f,k)(f,k) restricted to the ground set T1T_{1}. By running any linear-time algorithm with a better approximation ratio on T1T_{1}, we can boost the objective value returned by the algorithm while keeping the same theoretical guarantees. By Theorem 8, ThresholdGreedy achieves a (11/eε)(1-1/e-\varepsilon)-approximation in linear time given a guess of the optimal value. Therefore, with input α=1/2\alpha=1/2 and Γ=f(T1)\Gamma=f(T_{1}^{\prime}), ThresholdGreedy can effectively enhance the objective value.
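For concreteness, the post-processing step can be sketched as follows (ours), assuming a threshold_greedy(f, ground, k, eps, gamma, alpha) helper that implements Alg. 9 and that T1 is the ordered list of elements selected by the second LTC call.

def post_process(f, T1, k, eps, threshold_greedy):
    T1_last_k = T1[-k:]                  # last k elements: 1/2-approximation on ground set T1
    gamma = f(set(T1_last_k))            # Gamma = f(T1'), so Gamma <= OPT_{T1} <= Gamma / (1/2)
    T2 = threshold_greedy(f, T1, k, eps, gamma, 0.5)
    # Return the better of the last-k solution and the boosted solution.
    return max([T1_last_k, T2], key=lambda A: f(set(A)))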

7 Towards Larger kk: A Memory-Efficient, Distributed Framework (MED)

In this section, we propose a general-purpose plug-in framework for distributed algorithms, MemoryEfficientDistributed (MED, Alg. 8). MED increases the largest possible constraint value from O(n/2)O(n/\ell^{2}) to O(n/)O(n/\ell) in the value oracle model. Under some additional assumptions, we remove all restrictions on the constraint value.

Algorithm 8 MED(f,k,M,Alg,𝐪)\textsc{MED}(f,k,M,\textsc{Alg},\mathbf{q})
1:Input: evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk, available machines M{1,2,,}M\leftarrow\{1,2,...,\ell\}, MR algorithm Alg, random bits 𝐪\mathbf{q}
2:n|𝒩|n\leftarrow|\mathcal{N}|, Ψ\Psi\leftarrow Memory capacity (# elements) of primary machine
3:Choose kmax{k:kΨ}k^{\prime}\leftarrow\max\{k^{\prime}\in\mathbb{N}:k^{\prime}\leq\frac{\Psi}{\ell}\}
4:mk/km\leftarrow\lceil k/k^{\prime}\rceil, (𝐪1,,𝐪m)𝐪(\mathbf{q}_{1},\ldots,\mathbf{q}_{m})\leftarrow\mathbf{q}
5:for i1tomi\leftarrow 1\ to\ m  do
6:     Let g()f(Si)f(Si)g(\cdot)\leftarrow f(\cdot\cup S_{i})-f(S_{i})
7:     AiAlg(g,M,min{k,k|Si|},𝐪i)A_{i}\leftarrow\textsc{Alg}(g,M,\min\{k^{\prime},k-|S_{i}|\},\mathbf{q}_{i})
8:     Si+1SiAiS_{i+1}\leftarrow S_{i}\cup A_{i}
9:return SmS_{m}

As discussed in Section 1, the kk value is limited to a fraction of the machine memory for MapReduce algorithms: kO(n/2)k\leq O(n/\ell^{2}), since those algorithms need to merge a group of solutions and pass it to a single machine. MED works around this limitation as follows: MED can be thought of as a greedy algorithm that uses an approximate greedy selection through Alg. One machine manages the partial solutions {Si}\{S_{i}\}, built up over mm iterations. In this way, each call to Alg is within the constraint restriction of Alg, i.e. O(n/2)O(n/\ell^{2}), but a larger solution of up to size O(n/)O(n/\ell) can be constructed.
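A minimal sketch of this wrapper (ours) follows; alg stands for any MR subroutine Alg that accepts a marginal-gain oracle, a ground set, and a constraint, and k_prime is the per-call constraint permitted by machine memory.

import math

def med(f, ground, k, k_prime, alg):
    S = set()
    m = math.ceil(k / k_prime)
    for _ in range(m):
        fS = f(S)
        # Marginal-gain oracle g(X) = f(S ∪ X) - f(S) for the current partial solution S.
        g = lambda X, S=S, fS=fS: f(S | set(X)) - fS
        budget = min(k_prime, k - len(S))
        if budget == 0:
            break
        A_i = alg(g, [e for e in ground if e not in S], budget)
        S = S | set(A_i)
    return S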

The restriction on kk of MED of O(n/)O(n/\ell) comes from passing the data of the current solution to the next round. Intuitively, if we can send some auxiliary information about the function instead of the elements selected, the kk value can be unrestricted.

Assumption 1.

Let ff be a set function with ground set 𝒩\mathcal{N} of size nn. If for all S𝒩S\subseteq\mathcal{N}, there exists a bitvector 𝐯S\mathbf{v}_{S}, such that the function g(X)=f(SX)f(S)g(X)=f(S\cup X)-f(S) can be computed from XX and 𝐯S\mathbf{v}_{S}, then ff satisfies Assumption 1.

We show in Appendix F that all four applications evaluated in Section 8 satisfy Assumption 1. As an example, consider MaxCover, which can be expressed as f(S)=i𝒩fi(S)f(S)=\sum_{i\in\mathcal{N}}f_{i}(S), where fi(S)=𝟏{i is covered by S}f_{i}(S)=\mathbf{1}_{\{i\text{ is covered by }S\}}. Let gi(X)=fi(SX)fi(S)g_{i}(X)=f_{i}(S\cup X)-f_{i}(S), and 𝐯S=(f1(S),,fn(S))\mathbf{v}_{S}=(f_{1}(S),\ldots,f_{n}(S)). Then, fi(SX)=𝟏{i is covered by S}𝟏{i is covered by X}f_{i}(S\cup X)=\mathbf{1}_{\{i\text{ is covered by }S\}}\lor\mathbf{1}_{\{i\text{ is covered by }X\}}, where the first term is given by 𝐯S\mathbf{v}_{S} and the second term is calculated by XX. Therefore, since g(X)=i𝒩gi(X)g(X)=\sum_{i\in\mathcal{N}}g_{i}(X), g(X)g(X) can be computed from XX and 𝐯S\mathbf{v}_{S}.
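For concreteness, a minimal Python sketch of this construction (ours, with illustrative names) is given below; adjacency[u] is the set of universe elements covered by node u, and v_S is the coverage bitvector of the current solution S, from which g(X) is computed without access to S itself.

def maxcover_marginal_oracle(adjacency, v_S):
    # f(S) is recovered from the bitvector alone: the number of covered universe elements.
    f_S = sum(v_S)
    def g(X):
        covered_by_X = set()
        for u in X:
            covered_by_X |= adjacency[u]
        # f(S ∪ X) = sum_i ( v_S[i]  OR  1{ i is covered by X } )
        f_SX = sum(1 for i, bit in enumerate(v_S) if bit or i in covered_by_X)
        return f_SX - f_S            # g(X) = f(S ∪ X) - f(S)
    return g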

Theorem 7.

Let (f,kf,k) be an instance of SMCC distributed over \ell machines. For generic objectives, where the data of the current solution needs to be passed to the next round, suppose kmin{ψn1+1,ψ+1}k\leq\min\{\frac{\ell\psi-n}{\ell-1}+1,\psi-\ell+1\}; for special objectives, where only one piece of compressed data needs to be passed to the next round, suppose knk\leq n. Let SmS_{m} be the set returned by MED. Then 𝔼[f(Sm)](1eγ)OPT\mathbb{E}\left[f(S_{m})\right]\geq(1-e^{-\gamma})\textsc{OPT}, where γ\gamma is the expected approximation ratio of Alg.

Proof.

Let OO be an optimal solution and let O1,O2,,OmO_{1},O_{2},...,O_{m} be a partition of OO into mm pieces, each of size k\leq k^{\prime}. For each iteration ii, let f^i()=f(Si)f(Si)\hat{f}_{i}(\cdot)=f(\cdot\cup S_{i})-f(S_{i}) denote the marginal-gain function gg used in that iteration; each f^i\hat{f}_{i} is monotone and submodular since ff is.

𝔼[f(Si+1)f(Si)|Si]\displaystyle\mathbb{E}\left[f(S_{i+1})-f(S_{i})|S_{i}\right] =𝔼[f(SiAi)f(Si)|Si]\displaystyle=\mathbb{E}\left[f(S_{i}\cup A_{i})-f(S_{i})|S_{i}\right]
=𝔼[f^i(Ai)|Si]\displaystyle=\mathbb{E}\left[\hat{f}_{i}(A_{i})|S_{i}\right]
(a)1mj=1mγf^i(Oj)\displaystyle\overset{(a)}{\geq}\frac{1}{m}\sum_{j=1}^{m}\gamma\hat{f}_{i}(O_{j})
γmf^i(O)\displaystyle\geq\frac{\gamma}{m}\hat{f}_{i}(O)
=γm(f(SiO)f(Si))\displaystyle=\frac{\gamma}{m}(f(S_{i}\cup O)-f(S_{i}))
γm(OPTf(Si)),\displaystyle\geq\frac{\gamma}{m}(OPT-f(S_{i})),

where Inequality (a) follows from the submodularity of f^i\hat{f}_{i} and the fact that Alg is a γ\gamma-approximation (each OjO_{j} has size at most kk^{\prime} and is therefore feasible for the call to Alg). Taking the expectation over SiS_{i}, it holds that

OPT𝔼[f(Si+1)]\displaystyle OPT-\mathbb{E}\left[f(S_{i+1})\right] (1γm)[OPT𝔼[f(Si)]]\displaystyle\leq\left(1-\frac{\gamma}{m}\right)[OPT-\mathbb{E}\left[f(S_{i})\right]]
(1γm)i[OPTf()]\displaystyle\leq\left(1-\frac{\gamma}{m}\right)^{i}[OPT-f(\emptyset)]
=(1γm)iOPT\displaystyle=\left(1-\frac{\gamma}{m}\right)^{i}\cdot OPT
𝔼[f(Sm)]\displaystyle\therefore\mathbb{E}\left[f(S_{m})\right] [1(1γm)m]OPT\displaystyle\geq\left[1-(1-\frac{\gamma}{m})^{m}\right]\cdot OPT
(1eγ)OPT\displaystyle\geq(1-e^{-\gamma})\cdot OPT

To analyze the memory requirements of MED, consider the following. To compute g()f(Si)f(Si)g(\cdot)\leftarrow f(\cdot\cup S_{i})-f(S_{i}), the current solution SiS_{i} needs to be passed to each machine in MM for the next call of Alg. Suppose |Si|=kx|S_{i}|=k-x where 1xk1\leq x\leq k. The size of the data stored on any non-primary machine in cluster MM, as well as on the primary machine of MM, can be bounded as follows:

kx+(nk+x)/Ψkx+(k)/mΨk(Ψn)/(1)+1kΨ+1.\begin{aligned} &k-x+(n-k+x)/\ell\leq\Psi\\ &k-x+(k\ell)/m\leq\Psi\end{aligned}\,\Rightarrow\,\begin{aligned} &k\leq(\ell\Psi-n)/(\ell-1)+1\\ &k\leq\Psi-\ell+1\end{aligned}.

Therefore, if Ψ2n/\Psi\geq 2n/\ell, MED can run since n/\ell\leq n/\ell in our MapReduce model. Under the alternative assumption, it is clear that MED can run for all knk\leq n. ∎

8 Empirical Evaluation

This section presents an empirical comparison of R-DASH, T-DASH, G-DASH, and L-Dist with state-of-the-art distributed algorithms: RandGreeDI (RG) of Barbosa et al. (2015) and DistributedDistorted (DDist) of Kazemi et al. (2021).

Table 2: Environment Setup
Experiment Number of Machines (\ell)* Number of Threads Per Machine (Υ\varUpsilon) Dataset Size (n)
Experiment 1 (Fig. 2) 8 4 10K - 100K
Experiment 2 (Fig. 3) 64 32 1M - 5M
Experiment 3 (Fig. 4) 8 4 3M
Experiment 3 (Fig. 4) 8 4 100K
Experiment 4 (Fig. 5) 32 1 100K
  • *

    Prior MR (greedy) algorithms were assessed using Υ\ell\cdot\varUpsilon machines, utilizing each thread as an individual core. Thus, all algorithms were configured to use all available cores.

  • {\ddagger}

    We kept a variant of R-DASH evaluated on 8 machines (with 4 cores each) as a baseline. All other algorithms use 2 machines (with 4 cores each).

Environment. Our experiments utilized diverse computational setups. Experiments 1 and 3 employed an eight-machine cluster, each machine with four CPU cores, totaling 32 cores. Experiment 2 was conducted on a larger 64-machine cluster, each machine featuring 32 CPU cores. Experiment 4 operated on a cluster of 32 single-core machines. Notably, in Experiment 1, our algorithms used =8\ell=8 machines, while prior MapReduce (MR) algorithms employed =32\ell=32, fully utilizing the 32 cores. MPICH version 3.3a2 was installed on every machine, and we used the Python library mpi4py for implementing and parallelizing all algorithms with the Message Passing Interface (MPI). The algorithms were executed using the mpirun command, with runtime tracked by calling mpi4py.MPI.Wtime() at the algorithm's start and completion.

Datasets. Table 2 presents the dataset sizes for our experiments. In Experiments 1 and 3 (Fig. 4), we assess algorithm performance on small datasets, varying from nn = 10,000 to 100,000. These sizes enable evaluation of the more computationally intensive algorithms, such as DDist, T-DASH, and G-DASH. Experiments 2, 3 (Fig. 4), and 4 focus on larger datasets, ranging from nn = 50,000 to 5 million.

Applications. We evaluated image summarization (ImageSumm), influence maximization (InfMax), revenue maximization (RevMax) and maximum coverage (MaxCover). Details are provided in Appendix F.1.

Experiment Objectives. The primary objective of our experiment set is to comprehensively evaluate the practicality of the distributed algorithms across varying cluster and dataset sizes. The specific objectives of each experiment are outlined below:

  • Experiment 1 (results published in the Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023): Baseline experiment aimed at assessing the performance of the algorithms using small datasets and a small cluster setup.

  • Experiment 2: Assess the performance of the algorithms on large datasets and cluster setup.

  • Experiment 3: Investigate the influence of the number of nodes in a cluster on algorithm performance.

  • Experiment 4: Examine the impact of increasing cardinality constraints on the performance of MED.

8.1 Experiment 1 - Comparative Analysis on Small Datasets

The results of Experiment 1 (Fig. 2) show that all algorithms provide similar solution values (with T-DASH being a little worse than the others). However, there is a large difference in runtime, with R-DASH the fastest by orders of magnitude. The availability of only 4 threads per machine severely limits the parallelization of T-DASH, resulting in longer runtime; access to log(k)1+ε\log{}_{1+\varepsilon}(k) threads per machine should result in faster runtime than R-DASH.

Figure 2: Performance comparison of distributed algorithms on ImageSumm, InfluenceMax, RevenueMax and MaxCover; RandGreeDI (RG) is run with Greedy as the algorithm Alg (Barbosa et al., 2015) to ensure the 12(11/e)\frac{1}{2}(1-1/e) ratio. All Greedy implementations used lazy greedy to improve the runtime. Timeout for each application: 6 hours per algorithm.

8.2 Experiment 2 - Performance Analysis of R-DASH, RandGreeDI and L-Dist on a Large Cluster

Figure 3: Empirical comparison of R-DASH and L-Dist. The plotted metrics are solution value and runtime.

This section presents a comparative analysis of R-DASH, L-Dist, and RandGreeDI on a large 64-node cluster, each node equipped with 32 cores. We assess two versions of L-Dist and RandGreeDI, one with =64\ell=64 and the other with =2048\ell=2048, with a predetermined time limit of 3 hours for each application. The plotted results depict the instances completed by each algorithm within this time constraint.

In terms of solution quality, as depicted in Figure 3, R-DASH and RandGreeDI produce statistically indistinguishable results, while L-Dist exhibits an average drop of 9% in solution quality across completed instances. Examining runtime (Figure 3), R-DASH consistently outperforms L-Dist and RandGreeDI by orders of magnitude. As discussed in Experiment 3, for both L-Dist and RandGreeDI, instances with fewer distributions (=64\ell=64) display faster runtimes compared to instances using a larger setup (=2048\ell=2048), attributed to kk values surpassing n2\frac{n}{\ell^{2}} (0.27\simeq 0.27 for InfMax). The subsequent experiment demonstrates how MED effectively addresses this challenge. Comparing the fastest variant of each algorithm, we observe that for the smallest kk values, RandGreeDI outperforms L-Dist. However, as kk increases, RandGreeDI's runtime grows linearly, while L-Dist's runtime remains unaffected due to its linear time complexity, enabling L-Dist to outperform RandGreeDI for larger constraints.

8.3 Experiment 3 - Scalability Assessment

Figure 4(a) illustrates a linear speedup for R-DASH as the number of machines \ell increases. Figure 4(b) highlights an intriguing observation: despite having sufficient available memory, increasing \ell can result in inferior performance when k>n2k>\frac{n}{\ell^{2}}. Specifically, as depicted in Figure 4(b), we initially witness the expected faster execution of RandGreeDI with =32\ell=32 compared to RandGreeDI with =8\ell=8. However, once k>n322k>\frac{n}{32^{2}}, the relative performance of RandGreeDI with =32\ell=32 rapidly deteriorates. This decline can be attributed to the degradation of RandGreeDI's performance beyond k=n2k=\frac{n}{\ell^{2}}.

When running RandGreeDI with a single thread on each machine, the total running time on \ell machines can be computed based on two components. First, the running time for one machine in the first MR round, which is proportional to (n/(k1)/2)k(n/\ell-(k-1)/2)k. Second, the running time for the primary machine in the second MR round (post-processing step), which is proportional to (k(k1)/2)k(k\ell-(k-1)/2)k. Consequently, the total running time is proportional to nk/+k2k(k1)nk/\ell+\ell k^{2}-k(k-1). Optimal performance is achieved when =n/k\ell=\sqrt{n/k}, which justifies the preference for parallelization within a machine to maintain a lower \ell rather than distributing the data across separate processors.
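Under the stated single-thread assumption, this two-component query model can be checked numerically; the short snippet below (ours) confirms that the total count, proportional to nk/ℓ + ℓk² − k(k−1), is minimized near ℓ = sqrt(n/k).

import math

def randgreedi_queries(n, k, ell):
    # Round 1 on each machine: (n/ell - (k-1)/2) * k queries (lazy greedy bound).
    # Round 2 on the primary machine: (k*ell - (k-1)/2) * k queries.
    return (n / ell - (k - 1) / 2) * k + (k * ell - (k - 1) / 2) * k

n, k = 1_000_000, 100
best_ell = min(range(1, 201), key=lambda ell: randgreedi_queries(n, k, ell))
print(best_ell, round(math.sqrt(n / k)))   # both values are 100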

Furthermore, in Experiment 4 (Section 8.4), we demonstrate that utilizing MED enables MR algorithms to produce solutions much faster with no compromise in solution value, particularly when solving for k>n2k>\frac{n}{\ell^{2}}. These results provide further support for the advantage of incorporating MED in achieving efficient and effective parallelization in MR algorithms.

Figure 4: (a) Scalability of R-DASH vs. \ell. (b) RandGreeDI with =8\ell=8 vs. =32\ell=32.

8.4 Experiment 4 - Performance Analysis of MED

Figure 5: Empirical comparison of MED+RG to RandGreeDI. The plotted metrics are solution value and runtime.

This section presents experimental results comparing the performance of MED+RG and the vanilla RandGreeDI algorithm. In terms of solution value, MED+RG consistently provides nearly identical solutions to the vanilla RandGreeDI algorithm across all instances for the three applications studied. Regarding runtime, the following observations are made: Initially, both algorithms exhibit similar execution times up to a threshold of k=n2k=\frac{n}{\ell^{2}}. However, beyond this threshold, the performance gap between MED+RG and RandGreeDI widens linearly. MED+RG achieves notable average speedup factors of 1.8, 2.2, and 2.3 over RandGreeDI for the respective applications. Moreover, beyond k=n2k=\frac{n}{\ell^{2}}, MED+RG outperforms RandGreeDI in terms of completing instances within a 12-hour timeout, completing 77% more instances of kk across all three applications. These findings highlight the promising performance of MED+RG, demonstrating comparable solution values while significantly improving runtime efficiency compared to the vanilla RandGreeDI algorithm.

The empirical findings from this experiment provide insights into the capabilities of the MED+Alg framework. Our results indicate that MED+Alg achieves solution quality that is practically indistinguishable from the vanilla Alg algorithm, even for significantly larger values of kk. Additionally, even with sufficient available memory to run Alg beyond k=n2k=\frac{n}{\ell^{2}}, MED+Alg demonstrates noteworthy computational efficiency, surpassing that of the Alg algorithm. This combination of comparable solution quality and improved computational efficiency positions MED+Alg as a highly promising framework. These findings have significant implications and underscore the potential of MED+Alg to address complex problems in a distributed setting with large-scale values of kk.

9 Discussion and Conclusion

Prior to this work, no MR algorithms for SMCC could parallelize within a machine; these algorithms require many sequential queries. Moreover, increasing the number of machines to the number of threads available in the cluster may actually harm the performance, as we showed empirically; intuitively, this is because the size of the set on the primary machine scales linearly with the number of machines \ell. In this paper, we have addressed this limitation by introducing a suite of algorithms that are both parallelizable and distributed. Specifically, we have presented R-DASH, T-DASH, and G-DASH, which are the first MR algorithms with sublinear adaptive complexity (highly parallelizable). Moreover, our algorithms have nearly linear query complexity over the entire computation. We also provide the first distributed algorithm with O(n)O(n) total query complexity, improving on the O(npolylog(n))O(n\text{polylog}(n)) of our other algorithms and the algorithm of Liu and Vondrak (2019).

When RandGreeDI was introduced by Mirzasoleiman et al. (2013), the empirical performance of the algorithm was emphasized, with theoretical guarantees unproven until the work of Barbosa et al. (2015). Since that time, RandGreeDI has remained the most practical algorithm for distributed SMCC. Our R-DASH algorithm may be regarded as a version of RandGreeDI that is 1) parallelized; 2) nearly linear time in total (RandGreeDI is quadratic); and 3) empirically orders of magnitude faster in parallel wall time in our evaluation.

R-DASH achieves approximation ratio of (11/e)/2(1-1/e)/2 in two MR rounds; the first round merely being to distribute the data. We provide G-DASH to close the gap to (11/e)(1-1/e) ratio in a constant number of rounds. However, MR rounds are expensive (as shown by our Experiment 1). The current best ratio achieved in two rounds is the 0.5450.545-approximation algorithm of Mirrokni and Zadimoghaddam (2015). Therefore, a natural question for future research is what is the best ratio achievable in two MR rounds. Moreover, non-monotone or partially monotone objective functions and more sophisticated constraint systems are directions for future research.

Acknowledgements

Yixin Chen and Tonmoy Dey contributed equally to this work. The work of Tonmoy Dey was partially supported by Florida State University; the work of Yixin Chen and Alan Kuhnle was partially supported by Texas A & M University. The authors have received no third-party funding in direct support of this work. The authors have no additional revenues from other sources related to this work.

Appendix A Lovász Extension of Submodular Function.

Given submodular function ff, the Lovász extension FF of ff is defined as follows: For 𝐳[0,1]|𝒩|\mathbf{z}\in[0,1]^{|\mathcal{N}|},

F(𝐳=(zi)i𝒩)=𝔼λ𝒰[0,1][f({i:ziλ})].F(\mathbf{z}=(z_{i})_{i\in\mathcal{N}})=\mathbb{E}_{\lambda\sim\mathcal{U}[0,1]}[f(\{i:z_{i}\geq\lambda\})].

The Lovász extension satisfies the following properties: (1) FF is convex; (2) F(c𝐳)cF(𝐳)F(c\mathbf{z})\geq cF(\mathbf{z}) for any c(0,1)c\in(0,1). Moreover, we will require the following simple lemma:

Lemma 8 (Barbosa et al. (2015)).

Let SS be a random set, and suppose 𝔼[𝟏S]=c𝐳\mathbb{E}[\mathbf{1}_{S}]=c\cdot\mathbf{z}, for c(0,1)c\in(0,1). Then 𝔼[f(S)]cF(𝐳)\mathbb{E}[f(S)]\geq c\cdot F(\mathbf{z}).
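As a quick numerical illustration of the definition above, F(z) can be estimated by sampling the uniform threshold λ; the sketch below (ours) assumes the oracle f is callable on Python sets and that z is a dictionary over the ground set.

import random

def lovasz_extension_estimate(f, z, num_samples=1000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        lam = rng.random()
        # Level set {i : z_i >= lambda} for a uniformly drawn threshold lambda.
        level_set = {i for i, zi in z.items() if zi >= lam}
        total += f(level_set)
    return total / num_samples   # Monte Carlo estimate of F(z) = E_lambda[ f({i : z_i >= lambda}) ]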

Appendix B Probability Lemma and Concentration Bounds

Lemma 9.

(Chen et al. (2021)) Suppose there is a sequence of nn Bernoulli trials: X1,X2,,Xn,X_{1},X_{2},\ldots,X_{n}, where the success probability of XiX_{i} depends on the results of the preceding trials X1,,Xi1X_{1},\ldots,X_{i-1}. Suppose it holds that

Pr(Xi=1|X1=x1,X2=x2,,Xi1=xi1)η,Pr\left(X_{i}=1|X_{1}=x_{1},X_{2}=x_{2},\ldots,X_{i-1}=x_{i-1}\right)\geq\eta,

where η>0\eta>0 is a constant and x1,,xi1x_{1},\ldots,x_{i-1} are arbitrary.

Then, if Y1,,YnY_{1},\ldots,Y_{n} are independent Bernoulli trials, each with probability η\eta of success, then

Pr(i=1nXib)Pr(i=1nYib),Pr\left(\sum_{i=1}^{n}X_{i}\leq b\right)\leq Pr\left(\sum_{i=1}^{n}Y_{i}\leq b\right),

where bb is an arbitrary integer.

Moreover, let AA be the first occurrence of success in sequence XiX_{i}. Then,

𝔼[A]1/η.\mathbb{E}\left[A\right]\leq 1/\eta.
Lemma 10 (Chernoff bounds Mitzenmacher and Upfal (2017)).

Suppose X1X_{1}, … , XnX_{n} are independent binary random variables such that Pr(Xi=1)=piPr\left(X_{i}=1\right)=p_{i}. Let μ=i=1npi\mu=\sum_{i=1}^{n}p_{i}, and X=i=1nXiX=\sum_{i=1}^{n}X_{i}. Then for any δ0\delta\geq 0, we have

Pr(X(1+δ)μ)eδ2μ2+δ.\displaystyle Pr\left(X\geq(1+\delta)\mu\right)\leq e^{-\frac{\delta^{2}\mu}{2+\delta}}. (8)

Moreover, for any 0δ10\leq\delta\leq 1, we have

Pr(X(1δ)μ)eδ2μ2.\displaystyle Pr\left(X\leq(1-\delta)\mu\right)\leq e^{-\frac{\delta^{2}\mu}{2}}. (9)

Appendix C Omitted Proof of ThreshSeqMod

See 2

Proof.

After filtration on Line 5, it holds that, for any xVjx\in V_{j}, Δ(x|Sj1)τ\Delta\left(x\,|\,S_{j-1}\right)\geq\tau and Δ(x|VjSj1)=0\Delta\left(x\,|\,V_{j}\cup S_{j-1}\right)=0. Therefore,

|A0|=|{xVj:Δ(x|Sj1)<τ}|=0,\displaystyle|A_{0}|=|\{x\in V_{j}:\Delta\left(x\,|\,S_{j-1}\right)<\tau\}|=0,
|A|Vj||=|{xVj:Δ(x|VjSj1)<τ}|=|Vj|.\displaystyle|A_{|V_{j}|}|=|\{x\in V_{j}:\Delta\left(x\,|\,V_{j}\cup S_{j-1}\right)<\tau\}|=|V_{j}|.

For any xAix\in A_{i}, it holds that Δ(x|TiSj1)<τ\Delta\left(x\,|\,T_{i}\cup S_{j-1}\right)<\tau. Due to submodularity, Δ(x|Ti+1Sj1)Δ(x|TiSj1)<τ\Delta\left(x\,|\,T_{i+1}\cup S_{j-1}\right)\leq\Delta\left(x\,|\,T_{i}\cup S_{j-1}\right)<\tau. Therefore, AiA_{i} is a subset of Ai+1A_{i+1}, which indicates that |Ai||Ai+1||A_{i}|\leq|A_{i+1}|. ∎

Proof of Success Probability.

The algorithm successfully terminates if, at some point, |Vj|=0|V_{j}|=0 or |Sj|=k|S_{j}|=k. To analyze the success probability, we consider a variant of the algorithm in which it does not terminate once |Vj|=0|V_{j}|=0 or |Sj|=k|S_{j}|=k. In this case, the algorithm keeps running in the subsequent iterations with s=0s=0, selecting the empty set after the inner for loop. Thus, with probability 11, either |Vj+1|=0(1βε)|Vj|=0|V_{j+1}|=0\leq(1-\beta\varepsilon)|V_{j}|=0, or |Sj|=k|S_{j}|=k. Hence, Lemma 3 holds for all M+1M+1 iterations of the outer for loop.

If the algorithm fails, there should be no more than m=log1βε(1/n)m=\lceil\log_{1-\beta\varepsilon}(1/n)\rceil successful iterations where λjmin{s,t}\lambda_{j}^{*}\geq\min\{s,t\}. Otherwise, by Lemma 3, there exists an iteration jj such that |Sj|=k|S_{j}|=k, or |VM+1|<n(1βε)m1|V_{M+1}|<n\left(1-\beta\varepsilon\right)^{m}\leq 1. Let XX be the number of successful iterations. Then, XX is the sum of M+1M+1 dependent Bernoulli random variables, where each variable has a success probability of more than 1/21/2. Let YY be the sum of M+1M+1 independent Bernoulli random variables with success probability 1/21/2. By probability lemmata, the failure probability of Alg. 1 is calculated as follows,

Pr(Alg. 1 fails)\displaystyle Pr\left(\text{Alg.~{}\ref{alg:threshold} fails}\right) Pr(Xm)\displaystyle\leq Pr\left(X\leq m\right)
Pr(Ym)\displaystyle\leq Pr\left(Y\leq m\right) (Lemma 9)
Pr(Ylog(nδ)/(βε))\displaystyle\leq Pr\left(Y\leq\log\left(\frac{n}{\delta}\right)/(\beta\varepsilon)\right) (log(x)11x,x>0\log(x)\geq 1-\frac{1}{x},\forall x>0)
e12(112(1+βε))22(1+1/βε)log(nδ)\displaystyle\leq e^{-\frac{1}{2}\left(1-\frac{1}{2(1+\beta\varepsilon)}\right)^{2}\cdot 2(1+1/\beta\varepsilon)\log\left(\frac{n}{\delta}\right)} (Lemma 10)
=(δn)(2βε+1)24βε(βε+1)δn\displaystyle=\left(\frac{\delta}{n}\right)^{\frac{(2\beta\varepsilon+1)^{2}}{4\beta\varepsilon(\beta\varepsilon+1)}}\leq\frac{\delta}{n}\qed
Proof of Adaptivity and Query Complexity.

In Alg. 1, oracle queries are incurred on Lines 5 and 14, and they can be issued in parallel. Thus, there is a constant number of adaptive rounds within one iteration of the outer for loop. The adaptivity is O(M)=O(log(n/δ)/ε3)O\left(M\right)=O\left(\log(n/\delta)/\varepsilon^{3}\right).

Consider an iteration jj; there are no more than |Vj1|+1|V_{j-1}|+1 queries on Line 5 and log1+ε(|Vj|)+2\log_{1+\varepsilon}(|V_{j}|)+2 queries on Line 14. Let YiY_{i} be the ii-th successful iteration. Then, for any Yi1<jYiY_{i-1}<j\leq Y_{i}, it holds that |Vj|n(1βε)i1|V_{j}|\leq n(1-\beta\varepsilon)^{i-1}. By Lemma 9, for any i1i\geq 1, 𝔼[YiYi1]2\mathbb{E}\left[Y_{i}-Y_{i-1}\right]\leq 2. Then, the expected number of oracle queries is as follows,

𝔼[#queries]j=1M+1𝔼[|Vj1|+log1+ε(|Vj|)+3]\displaystyle\mathbb{E}\left[\#\text{queries}\right]\leq\sum_{j=1}^{M+1}\mathbb{E}\left[|V_{j-1}|+\log_{1+\varepsilon}(|V_{j}|)+3\right]
n+i:n(1βε)i11𝔼[YiYi1](n(1βε)i1+log1+ε(n(1βε)i1))+3(M+1)\displaystyle\leq n+\sum_{i:n(1-\beta\varepsilon)^{i-1}\geq 1}\mathbb{E}\left[Y_{i}-Y_{i-1}\right]\left(n(1-\beta\varepsilon)^{i-1}+\log_{1+\varepsilon}\left(n(1-\beta\varepsilon)^{i-1}\right)\right)+3(M+1)
n+2ni1(1βε)i1+2log1+ε(n)log1βε(1/n)+3log(n/δ)/ε3\displaystyle\leq n+2n\sum_{i\geq 1}(1-\beta\varepsilon)^{i-1}+2\log_{1+\varepsilon}(n)\cdot\log_{1-\beta\varepsilon}(1/n)+3\log(n/\delta)/\varepsilon^{3}
(1+2βε)n+2log1+ε(n)log1βε(1/n)+3log(n/δ)/ε3\displaystyle\leq\left(1+\frac{2}{\beta\varepsilon}\right)n+2\log_{1+\varepsilon}(n)\cdot\log_{1-\beta\varepsilon}(1/n)+3\log(n/\delta)/\varepsilon^{3}
O(n/ε3)\displaystyle\leq O(n/\varepsilon^{3})\qed
Proof of Property (3).

For each iteration jj, consider two cases of λj\lambda^{*}_{j} by the choice of λj\lambda^{*}_{j} on Line 18. Firstly, if λj1ε\lambda^{*}_{j}\leq\left\lceil\frac{1}{\varepsilon}\right\rceil, clearly,

Δ(Tλj|Sj1)(1ε)τλj.\Delta\left(T_{\lambda^{*}_{j}}\,|\,S_{j-1}\right)\geq(1-\varepsilon)\tau\lambda^{*}_{j}.

Secondly, if λj>1ε\lambda^{*}_{j}>\left\lceil\frac{1}{\varepsilon}\right\rceil, let λj=max{λΛ:λ<λj}\lambda_{j}=\max\left\{\lambda\in\Lambda:\lambda<\lambda_{j}^{*}\right\}. Then, B[λj]=TrueB[\lambda_{j}]=\textbf{True}, and

Δ(Tλj|Sj1)(1ε)τλj.\Delta\left(T_{\lambda_{j}}\,|\,S_{j-1}\right)\geq(1-\varepsilon)\tau\lambda_{j}.

Let λj=(1+ε)u\lambda_{j}=\left\lfloor(1+\varepsilon)^{u}\right\rfloor, and λj=(1+ε)u+1\lambda^{*}_{j}=\left\lfloor(1+\varepsilon)^{u+1}\right\rfloor. Then,

|Tj,λj||Tj,λj|=λjλj=(1+ε)u(1+ε)u+1(1+ε)u1(1+ε)u+111+εε,\frac{|T_{j,\lambda_{j}}|}{|T_{j,\lambda_{j}^{*}}|}=\frac{\lambda_{j}}{\lambda_{j}^{*}}=\frac{\left\lfloor(1+\varepsilon)^{u}\right\rfloor}{\left\lfloor(1+\varepsilon)^{u+1}\right\rfloor}\geq\frac{(1+\varepsilon)^{u}-1}{(1+\varepsilon)^{u+1}}\geq\frac{1}{1+\varepsilon}-\varepsilon,

where the last inequality follows from (1+ε)u+1λj>1ε1ε(1+\varepsilon)^{u+1}\geq\lambda_{j}^{*}>\left\lceil\frac{1}{\varepsilon}\right\rceil\geq\frac{1}{\varepsilon}. Therefore, by above two inequalities and monotonicity of ff,

Δ(Tj,λj|Sj1)Δ(Tj,λj|Sj1)(1ε)(11+εε)τλj(12ε)τλj/(1+ε).\Delta\left(T_{j,\lambda_{j}^{*}}\,|\,S_{j-1}\right)\geq\Delta\left(T_{j,\lambda_{j}}\,|\,S_{j-1}\right)\geq(1-\varepsilon)\left(\frac{1}{1+\varepsilon}-\varepsilon\right)\tau\lambda_{j}^{*}\geq(1-2\varepsilon)\tau\lambda_{j}^{*}/(1+\varepsilon).

The objective value of the returned solution can be bounded as follows,

f(S)=jΔ(Tλj|Sj1)j(12ε)τλj/(1+ε)=(12ε)τ|S|/(1+ε).\displaystyle f(S)=\sum_{j}\Delta\left(T_{\lambda_{j}^{*}}\,|\,S_{j-1}\right)\geq\sum_{j}(1-2\varepsilon)\tau\lambda_{j}^{*}/(1+\varepsilon)=(1-2\varepsilon)\tau|S|/(1+\varepsilon).\qed
Proof of Property (4).

If the output set SS satisfies |S|<k|S|<k, then any x𝒩x\in\mathcal{N} that is not selected is filtered out at some point. Let xx be discarded at iteration jxj_{x}. Then,

Δ(x|S)Δ(x|Sjx1)<τ.\Delta\left(x\,|\,S\right)\leq\Delta\left(x\,|\,S_{j_{x}-1}\right)<\tau.\qed

Appendix D ThresholdGreedy

Algorithm 9 ThresholdGreedy
1:Input: Evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk, constant α\alpha, value Γ\Gamma s.t. Γf(O)Γ/α\Gamma\leq f(O)\leq\Gamma/\alpha, error ε\varepsilon
2:Initialize τΓ/(αk)\tau\leftarrow\Gamma/(\alpha k), SS\leftarrow\emptyset
3:while τεΓ/k\tau\geq\varepsilon\Gamma/k do
4:     for e𝒩e\in\mathcal{N} do
5:         if Δ(e|S)τ\Delta\left(e\,|\,S\right)\geq\tau then SS{e}S\leftarrow S\cup\{e\}
6:         if |S|=k|S|=k then return SS      
7:     ττ(1ε)\tau\leftarrow\tau(1-\varepsilon)
8:return SS

In this section, we introduce a variant of the near-linear time algorithm (Alg. 1) in Badanidiyuru and Vondrak (2014), ThresholdGreedy (Alg. 9), which requires two additional parameters: a constant α\alpha and a value Γ\Gamma. Under the assumption that Γf(O)Γ/α\Gamma\leq f(O)\leq\Gamma/\alpha, i.e., that Γ\Gamma is an α\alpha-approximation of the optimal value, ThresholdGreedy ensures a (11/eε)(1-1/e-\varepsilon)-approximation with O(nε1log((αε)1))O(n\varepsilon^{-1}\log((\alpha\varepsilon)^{-1})) queries.
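
As an illustration, the following is a minimal Python sketch of Alg. 9; the oracle f is assumed to accept a Python set, and the names threshold_greedy, ground_set, gamma, and eps are ours rather than part of the released code.

def threshold_greedy(f, ground_set, k, alpha, gamma, eps):
    # Sketch of ThresholdGreedy (Alg. 9); gamma satisfies gamma <= f(O) <= gamma / alpha.
    S = set()
    tau = gamma / (alpha * k)            # initial threshold Gamma / (alpha k)
    while tau >= eps * gamma / k:        # stop once the threshold is small enough
        for e in ground_set:
            if e in S:
                continue
            if f(S | {e}) - f(S) >= tau: # marginal gain Delta(e | S)
                S.add(e)
                if len(S) == k:
                    return S
        tau *= (1 - eps)                 # geometrically decrease the threshold
    return S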

Theorem 8.

Let (f,k)(f,k) be an instance of SM, and let OO be an optimal solution to (f,k)(f,k). With input α\alpha, Γ\Gamma and ε\varepsilon, where Γf(O)Γ/α\Gamma\leq f(O)\leq\Gamma/\alpha, ThresholdGreedy outputs a solution SS with O(nε1log((αε)1))O(n\varepsilon^{-1}\log((\alpha\varepsilon)^{-1})) queries such that f(S)(11/eε)f(O)f(S)\geq(1-1/e-\varepsilon)f(O).

Proof.

Query Complexity. The while loop in Alg. 9 has at most log1ε(αε)ε1log((αε)1)\log_{1-\varepsilon}(\alpha\varepsilon)\leq\varepsilon^{-1}\log((\alpha\varepsilon)^{-1}) iterations. Each iteration of the while loop makes O(n)O(n) queries. Therefore, the query complexity of ThresholdGreedy is O(nε1log((αε)1))O(n\varepsilon^{-1}\log((\alpha\varepsilon)^{-1})).

Approximation Ratio. If |S|<k|S|<k at termination, for any e𝒩Se\in\mathcal{N}\setminus S, it holds that Δ(e|S)<εΓ/kεf(O)/k\Delta\left(e\,|\,S\right)<\varepsilon\Gamma/k\leq\varepsilon f(O)/k. Then,

f(O)f(S)(1)oOSΔ(o|S)εf(O)\displaystyle f(O)-f(S)\overset{(1)}{\leq}\sum_{o\in O\setminus S}\Delta\left(o\,|\,S\right)\leq\varepsilon f(O)
\displaystyle\Rightarrow f(S)(1ε)f(O)\displaystyle\hskip 10.00002ptf(S)\geq(1-\varepsilon)f(O)

where Inequality (1) follows from monotonicity and submodularity.

If |S|=k|S|=k at termination, let SjS_{j} be the intermediate solution after iteration jj of the while loop, and let τj\tau_{j} be the corresponding threshold. Suppose that SjSj1S_{j}\setminus S_{j-1}\neq\emptyset. Then, each element added during iteration jj provides a marginal gain of at least τj\tau_{j},

f(Sj)f(Sj1)|SjSj1|τj.f(S_{j})-f(S_{j-1})\geq|S_{j}\setminus S_{j-1}|\tau_{j}.

Next, we consider iteration j1j-1. If j>1j>1, since the algorithm does not terminate during iteration j1j-1, each element in 𝒩\mathcal{N} is either added to the solution or omitted due to small marginal gain with respect to the current solution. Therefore, for any oOSj1o\in O\setminus S_{j-1}, it holds that Δ(o|Sj1)<τj1\Delta\left(o\,|\,S_{j-1}\right)<\tau_{j-1} by submodularity. Then,

f(O)f(Sj1)oOSj1Δ(o|Sj1)<kτj1=kτj/(1ε).f(O)-f(S_{j-1})\leq\sum_{o\in O\setminus S_{j-1}}\Delta\left(o\,|\,S_{j-1}\right)<k\tau_{j-1}=k\tau_{j}/(1-\varepsilon).

If j=1j=1, τj=Γ/(αk)\tau_{j}=\Gamma/(\alpha k) and Sj1=S_{j-1}=\emptyset.

f(O)f(Sj1)=f(O)Γ/α=kτj<kτj/(1ε).f(O)-f(S_{j-1})=f(O)\leq\Gamma/\alpha=k\tau_{j}<k\tau_{j}/(1-\varepsilon).

Therefore, for any iteration jj that has elements being added to the solution, it holds that

f(O)f(Sj)\displaystyle f(O)-f(S_{j}) (11εk|SjSj1|)(f(O)f(Sj1))\displaystyle\leq\left(1-\frac{1-\varepsilon}{k}|S_{j}\setminus S_{j-1}|\right)\left(f(O)-f(S_{j-1})\right)
f(O)f(S)\displaystyle\Rightarrow\hskip 10.00002ptf(O)-f(S) j(11εk|SjSj1|)f(O)\displaystyle\leq\prod_{j}\left(1-\frac{1-\varepsilon}{k}|S_{j}\setminus S_{j-1}|\right)f(O)
je1εk|SjSj1|f(O)\displaystyle\leq\prod_{j}e^{-\frac{1-\varepsilon}{k}|S_{j}\setminus S_{j-1}|}f(O) (x+1exx+1\leq e^{x})
=e1+εf(O)(1/e+ε)f(O)\displaystyle=e^{-1+\varepsilon}f(O)\leq(1/e+\varepsilon)f(O) (0<ε<10<\varepsilon<1)
f(S)(11/eε)f(O)\Rightarrow\hskip 30.00005ptf(S)\geq(1-1/e-\varepsilon)f(O)\hskip 30.00005pt\qed
Table 3: Small and Large Data

Application     Small Data (nn / Edges\ast)                Large Data (nn / Edges\ast)
ImageSumm       10,000 / 1.0×108\simeq 1.0\times 10^{8}     50,000 / 2.5×109\simeq 2.5\times 10^{9}
InfluenceMax    26,588 / 1.0×105\simeq 1.0\times 10^{5}     1,134,890 / 6.0×106\simeq 6.0\times 10^{6}
RevenueMax      17,432 / 1.8×105\simeq 1.8\times 10^{5}     3,072,441 / 2.3×108\simeq 2.3\times 10^{8}
MaxCover (BA)   100,000 / 5.0×105\simeq 5.0\times 10^{5}    1,000,000 / 1.0×109\simeq 1.0\times 10^{9}

Appendix E Threshold-DASH with no Knowledge of OPT

See Lemma 7.

Proof.

By the definition of pxp_{x} in Section 5, it holds that Pr(oO1)=poPr\left(o\in O_{1}\right)=p_{o}. Since oo is assigned to each machine independently with probability 1/1/\ell,

Pr(oO2)\displaystyle Pr\left(o\in O_{2}\right) =i=1Pr(oSi)\displaystyle=\sum_{i=1}^{\ell}Pr\left(o\in S_{i}\right)
=Pr(oS1|o𝒩1)Pr(o𝒩1)\displaystyle=\ell\cdot Pr\left(o\in S_{1}|o\in\mathcal{N}_{1}\right)\cdot Pr\left(o\in\mathcal{N}_{1}\right)
=Pr(oTSMRel(𝒩1)|o𝒩1)\displaystyle=Pr\left(o\in\textsc{TSMRel}(\mathcal{N}_{1})|o\in\mathcal{N}_{1}\right)
=Pr(oTSMRel(𝒩1{o}))\displaystyle=Pr\left(o\in\textsc{TSMRel}(\mathcal{N}_{1}\cup\{o\})\right)
=1po.\displaystyle=1-p_{o}.

Moreover, any two machines select elements independently. So,

Pr(oO2|oO1)\displaystyle Pr\left(o\in O_{2}|o\in O_{1}\right) =Pr(oO2|oTSMRel(𝒩1{o},q))\displaystyle=Pr\left(o\in O_{2}|o\not\in\textsc{TSMRel}(\mathcal{N}_{1}\cup\{o\},q)\right)
=1Pr(oO2|oTSMRel(𝒩1{o},q))\displaystyle=1-Pr\left(o\not\in O_{2}|o\not\in\textsc{TSMRel}(\mathcal{N}_{1}\cup\{o\},q)\right)
=1i=1Pr(oSi|oTSMRel(𝒩1{o},q))\displaystyle=1-\prod_{i=1}^{\ell}Pr\left(o\not\in S_{i}|o\not\in\textsc{TSMRel}(\mathcal{N}_{1}\cup\{o\},q)\right)
=(a)1i=2Pr(oSi)\displaystyle\overset{(a)}{=}1-\prod_{i=2}^{\ell}Pr\left(o\not\in S_{i}\right)
1Pr(oO2)\displaystyle\leq 1-Pr\left(o\not\in O_{2}\right)
=Pr(oO2)=1po,\displaystyle=Pr\left(o\in O_{2}\right)=1-p_{o},

where Equality (a) follows from Pr(oTSMRel(𝒩1)|oTSMRel(𝒩1{o}))=1Pr\left(o\not\in\textsc{TSMRel}(\mathcal{N}_{1})|o\not\in\textsc{TSMRel}(\mathcal{N}_{1}\cup\{o\})\right)=1.

Thus, we can bound the probability by the following,

Pr(oO1O2)=\displaystyle Pr\left(o\in O_{1}\cup O_{2}\right)= Pr(oO1)+Pr(oO2)\displaystyle Pr\left(o\in O_{1}\right)+Pr\left(o\in O_{2}\right)
Pr(oO1O2)\displaystyle-Pr\left(o\in O_{1}\cap O_{2}\right)
\displaystyle\geq po+1popo(1po)\displaystyle p_{o}+1-p_{o}-p_{o}(1-p_{o})
\displaystyle\geq 3/4,\displaystyle 3/4,

where the first inequality follows from Pr(oO1O2)=Pr(oO2|oO1)Pr(oO1)po(1po)Pr\left(o\in O_{1}\cap O_{2}\right)=Pr\left(o\in O_{2}|o\in O_{1}\right)Pr\left(o\in O_{1}\right)\leq p_{o}(1-p_{o}), and the second from po(1po)1/4p_{o}(1-p_{o})\leq 1/4 for any po[0,1]p_{o}\in[0,1].

Description.  The T-DASH algorithm described in Alg. 10 is a two-MR-round algorithm using the AFD approach that runs ThreshSeqMod concurrently on every machine for log1+ε(k)\log_{1+\varepsilon}(k) different guesses of the threshold τi,j\tau_{i,j} in the range [αΔik,αΔi][\frac{\alpha\Delta_{i}^{*}}{k},\alpha\Delta_{i}^{*}], where α\alpha is the approximation ratio of T-DASH and Δi\Delta_{i}^{*} is the maximum singleton value in 𝒩i\mathcal{N}_{i}. Every solution returned to the primary machine is placed into a bin based on its corresponding threshold guess τi,j\tau_{i,j}, such that Δk(1+ε)xτi,jΔk(1+ε)x+1\frac{\Delta^{*}}{k}(1+\varepsilon)^{x}\leq\tau_{i,j}\leq\frac{\Delta^{*}}{k}(1+\varepsilon)^{x+1}, where Δ\Delta^{*} is the maximum singleton value in 𝒩\mathcal{N}. Since there must exist a threshold τ\tau^{*} that is close enough to αOPT/k\alpha\textsc{OPT}/k, running ThreshSeqMod (Line 21) on every bin in the range [αΔk,αΔ][\frac{\alpha\Delta^{*}}{k},\alpha\Delta^{*}] and selecting the best solution guarantees the α\alpha-approximation of T-DASH in O(log(n))O(\log{}(n)) adaptive rounds.
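
To make the binning step concrete, the following sketch shows how a returned threshold guess could be mapped to its bin index xx on the primary machine; the helper bin_index is a hypothetical name and not part of the released code.

import math

def bin_index(tau_ij, delta_star, k, eps):
    # Returns x with (delta_star / k) * (1 + eps)^x <= tau_ij <= (delta_star / k) * (1 + eps)^(x + 1),
    # where delta_star is the maximum singleton value over the whole ground set.
    base = delta_star / k
    return int(math.floor(math.log(tau_ij / base, 1 + eps)))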

Algorithm 10 Threshold-DASH with no knowledge of OPT (T-DASH)
1:Input: Evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk, error ε\varepsilon, available machines M{1,2,,}M\leftarrow\{1,2,...,\ell\}
2:Initialize δ1/(+1)\delta\leftarrow 1/(\ell+1), 𝐪\mathbf{q}\leftarrow a fixed sequence of random bits.
3:Set α38\alpha\leftarrow\frac{3}{8}, (𝐪i,j)i[+1],j[log1+ε(k)]𝐪(\mathbf{q}_{i,j})_{i\in[\ell+1],j\in[\log_{1+\varepsilon}(k)]}\leftarrow\mathbf{q}
4:for e𝒩e\in\mathcal{N} do
5:     Assign ee to each machine independently with probability 1/1/\ell
6:for iMi\in M do
7:     \triangleright On machine ii
8:     Let 𝒩i\mathcal{N}_{i} be the elements assigned to machine ii
9:     Set Δimax{f(e):e𝒩i}\Delta_{i}^{*}\leftarrow\max\{f(e):e\in\mathcal{N}_{i}\}
10:     for j0j\leftarrow 0 to log1+ε(k)\log_{1+\varepsilon}(k) in parallel do
11:         τi,j(α+ε)Δik(1+ε)j\tau_{i,j}\leftarrow\frac{(\alpha+\varepsilon)\Delta_{i}^{*}}{k}(1+\varepsilon)^{j}
12:         Si,j,Ri,jThreshSeqMod(f,𝒩i,k,δ,ε,τi,j,𝐪i,j)S_{i,j},R_{i,j}\leftarrow\textsc{ThreshSeqMod}(f,\mathcal{N}_{i},k,\delta,\varepsilon,\tau_{i,j},\mathbf{q}_{i,j})      
13:     Send Δi\Delta_{i}^{*} and all (τi,j,Si,j,Ri,j)(\tau_{i,j},S_{i,j},R_{i,j}) to primary machine
14:\triangleright On primary machine
15:Set Δmax{Δi:1i}\Delta^{*}\leftarrow\max\{\Delta_{i}^{*}:1\leq i\leq\ell\}
16:for x0x\leftarrow 0 to log1+ε(k)+1\lceil\log_{1+\varepsilon}(k)\rceil+1 in parallel do
17:     τx(α+ε)Δk(1+ε)x\tau_{x}\leftarrow\frac{(\alpha+\varepsilon)\Delta^{*}}{k}(1+\varepsilon)^{x}
18:     Let Rx{Ri,j:τxτi,jτx+1}R_{x}\leftarrow\{\bigcup R_{i,j}:\tau_{x}\leq\tau_{i,j}\leq\tau_{x+1}\}
19:     Let Ax{Sample a solution Si,j:τxτi,jτx+1}A_{x}\leftarrow\{\text{Sample a solution }S_{i,j}:\tau_{x}\leq\tau_{i,j}\leq\tau_{x+1}\}
20:     Let gx()f(Ax)f(Ax)g_{x}(\cdot)\leftarrow f(A_{x}\cup\cdot)-f(A_{x})
21:     TxThreshSeqMod(gx,Rx,k|Ax|,δ,ε,τx,𝐪+1,x)T_{x}\leftarrow\textsc{ThreshSeqMod}(g_{x},R_{x},k-|A_{x}|,\delta,\varepsilon,\tau_{x},\mathbf{q}_{\ell+1,x})
22:     TxAxTxT_{x}^{\prime}\leftarrow A_{x}\cup T_{x}
23:Targmax{f(Tx):0xlog1+ε(k)+1}T\leftarrow\operatorname*{arg\,max}\{f(T_{x}^{\prime}):0\leq x\leq\lceil\log_{1+\varepsilon}(k)\rceil+1\}
24:return TT
Theorem 9.

Let (f,k)(f,k) be an instance of SM where klog(k)<εψk\log(k)<\frac{\varepsilon\psi}{\ell}. T-DASH with no knowledge of OPT returns a set TT^{\prime} within two MR rounds, O(1ε3log(n))O\left(\frac{1}{\varepsilon^{3}}\log(n)\right) adaptive rounds, O(nlog(k)ε4)O\left(\frac{n\log(k)}{\varepsilon^{4}}\right) total queries, and O(n)O(n) communication complexity, such that, with probability at least 1nc1-n^{-c},

𝔼[f(T)](38ε)OPT.\mathbb{E}\left[f(T^{\prime})\right]\geq\left(\frac{3}{8}-\varepsilon\right)\textsc{OPT}.

Overview of Proof. Alg. 10 is inspired by Alg. 6 in Section 5, which is a version of the algorithm that knows the optimal solution value. With Δ=max{f(e):e𝒩}\Delta^{*}=\max\{f(e):e\in\mathcal{N}\}, there exists an x0x_{0} such that τx0αOPT(1+ε)/kτx0+1\tau_{x_{0}}\leq\alpha\textsc{OPT}(1+\varepsilon)/k\leq\tau_{x_{0}+1} with O(log(k)/ε)O\left(\log(k)/\varepsilon\right) guesses. Then, on each machine ii, we only consider the sets Si,jS_{i,j} and Ri,jR_{i,j} such that τx0τi,jτx0+1\tau_{x_{0}}\leq\tau_{i,j}\leq\tau_{x_{0}+1}. If such a τi,j\tau_{i,j} exists, (Si,j,Ri,j)(S_{i,j},R_{i,j}) plays the role of (Si,Ri)(S_{i},R_{i}) in Alg. 6. If no such τi,j\tau_{i,j} exists, then for any e𝒩ie\in\mathcal{N}_{i}, it holds that f(e)<αOPT/kf(e)<\alpha\textsc{OPT}/k, which means ThreshSeqMod(𝒩i)\textsc{ThreshSeqMod}(\mathcal{N}_{i}) with τ=αOPT/k\tau=\alpha\textsc{OPT}/k will return an empty set. Since each call of ThreshSeqMod with a different guess of τ\tau is executed in parallel, the adaptivity remains the same and the query complexity increases by a factor of log(k)/ε\log(k)/\varepsilon.

Proof.

First, for x=0x=0, τx=(α+ε)Δ/k(α+ε)OPT/k\tau_{x}=(\alpha+\varepsilon)\Delta^{*}/k\leq(\alpha+\varepsilon)\textsc{OPT}/k; and for x=log1+ε(k)+1x=\lceil\log_{1+\varepsilon}(k)\rceil+1,

τx(α+ε)(1+ε)Δ(α+ε)(1+ε)oOf(o)/k(α+ε)(1+ε)OPT/k.\tau_{x}\geq(\alpha+\varepsilon)(1+\varepsilon)\Delta^{*}\geq(\alpha+\varepsilon)(1+\varepsilon)\sum_{o\in O}f(o)/k\geq(\alpha+\varepsilon)(1+\varepsilon)\textsc{OPT}/k.

Therefore, there exists an x0x_{0} such that τx0(α+ε)(1+ε)OPT/kτx0+1\tau_{x_{0}}\leq(\alpha+\varepsilon)(1+\varepsilon)\textsc{OPT}/k\leq\tau_{x_{0}+1}. Since τx0+1=τx0(1+ε)\tau_{x_{0}+1}=\tau_{x_{0}}(1+\varepsilon), it holds that τx0(α+ε)OPT/k\tau_{x_{0}}\geq(\alpha+\varepsilon)\textsc{OPT}/k and τx0+1(α+ε)(1+ε)2OPT/k\tau_{x_{0}+1}\leq(\alpha+\varepsilon)(1+\varepsilon)^{2}\textsc{OPT}/k.

Then, we only consider Tx0T_{x_{0}}^{\prime}. If |Tx0|=k|T_{x_{0}}^{\prime}|=k, by Theorem 1, it holds that,

f(T)f(Tx0)12ε1+ετx0k(38ε)OPT.\displaystyle f(T)\geq f(T_{x_{0}}^{\prime})\geq\frac{1-2\varepsilon}{1+\varepsilon}\tau_{x_{0}}k\geq\left(\frac{3}{8}-\varepsilon\right)\textsc{OPT}.

Otherwise, in the case that |Tx0|<k|T_{x_{0}}^{\prime}|<k, let Ax0=Si0,j0A_{x_{0}}=S_{i_{0},j_{0}} and O1={oO:oTSMRel(𝒩i0{o},q)}O_{1}=\{o\in O:o\not\in\textsc{TSMRel}(\mathcal{N}_{i_{0}}\cup\{o\},q)\}. Also, let τi,(j0)\tau_{i,(j_{0})} be the threshold returned by machine ii such that τx0τi,(j0)τx0+1\tau_{x_{0}}\leq\tau_{i,(j_{0})}\leq\tau_{x_{0}+1}, and define

(Si,(j0),Ri,(j0))={(Si,(j0),Ri,(j0)), if machine i returned a τi,(j0)(,), otherwise .(S_{i,(j_{0})}^{\prime},R_{i,(j_{0})}^{\prime})=\begin{cases}(S_{i,(j_{0})},R_{i,(j_{0})})&\text{, if machine $i$ returned a }\tau_{i,(j_{0})}\\ (\emptyset,\emptyset)&\text{, otherwise }\end{cases}.

For any machine that does not return a τi,(j0)\tau_{i,(j_{0})}, we may treat it as if it ran ThreshSeqMod(𝒩i,τx0+1)\textsc{ThreshSeqMod}(\mathcal{N}_{i},\tau_{x_{0}+1}) and returned two empty sets; hence max{f(e):e𝒩i}<τx0\max\{f(e):e\in\mathcal{N}_{i}\}<\tau_{x_{0}} on such a machine. Let Rx0=iMRi,(j0)R_{x_{0}}^{\prime}=\cup_{i\in M}R_{i,(j_{0})}^{\prime} and O2=Rx0OO_{2}=R_{x_{0}}^{\prime}\cap O. Then, Lemma 7 still holds in this case. With ε2/3\varepsilon\leq 2/3, we can bound the approximation ratio as follows,

𝔼[f(T)]𝔼[f(Tx0)]\displaystyle\mathbb{E}\left[f(T)\right]\geq\mathbb{E}\left[f(T_{x_{0}}^{\prime})\right] 𝔼[f(O1O2)]kτx0+1\displaystyle\geq\mathbb{E}\left[f(O_{1}\cup O_{2})\right]-k\cdot\tau_{x_{0}+1}
(38ε)OPT.\displaystyle\geq\left(\frac{3}{8}-\varepsilon\right)\textsc{OPT}.

Appendix F Experiment Setup

F.1 Applications

Given a constraint kk, the objectives of the applications are defined as follows:

F.1.1 Max Cover

Maximize the number of nodes covered by choosing a set SS of maximum size kk; that is, maximize the number of nodes having at least one neighbour in the set SS. The application is run on synthetic random BA graphs of ground set sizes 100,000, 1,000,000, and 5,000,000, generated using the Barabási–Albert (BA) model, for the centralized and distributed experiments respectively.
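
A minimal sketch of this coverage objective, assuming the graph is stored as a networkx graph and counting a node as covered if it lies in SS or has a neighbor in SS (the function name is ours):

import networkx as nx

def max_cover_value(G, S):
    # Number of nodes covered by S: the nodes of S plus all of their neighbors.
    covered = set(S)
    for v in S:
        covered.update(G.neighbors(v))
    return len(covered)

# Example on a small Barabasi-Albert graph
G = nx.barabasi_albert_graph(n=1000, m=2, seed=0)
print(max_cover_value(G, {0, 1, 2}))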

F.1.2 Image Summarization on CIFAR-10 Data

Given a large collection of images, find a subset of maximum size kk which is representative of the entire collection. The objective used for the experiments is a monotone variant of the image summarization objective from Fahrbach et al. (2019b). For a ground set of NN images, it is defined as follows:

f(S)=iNmaxjSsi,j\displaystyle f(S)=\sum_{i\in N}\max_{j\in S}s_{i,j}

where si,js_{i,j} is the cosine similarity of the pixel values between image ii and image jj. The data for the image summarization experiments contains 10,000 and 50,000 CIFAR-10 Krizhevsky et al. (2009) color images respectively for the centralized and distributed experiments.
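
A small sketch of this objective, assuming sim is a precomputed NN-by-NN cosine-similarity matrix (e.g., loaded from images_10K_mat.csv); the function name is ours:

import numpy as np

def image_summarization_value(sim, S):
    # f(S) = sum_i max_{j in S} s_{i,j}; returns 0 for the empty set.
    S = list(S)
    if not S:
        return 0.0
    return float(np.sum(np.max(sim[:, S], axis=1)))

# Example with a random cosine-similarity matrix
rng = np.random.default_rng(0)
X = rng.random((100, 8))
norms = np.linalg.norm(X, axis=1)
sim = (X @ X.T) / (norms[:, None] * norms[None, :])
print(image_summarization_value(sim, {3, 17, 42}))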

F.1.3 Influence Maximization on a Social Network.

Maximize the aggregate influence to promote a topic by selecting a set SS of social network influencers of maximum size kk. The probability that a random user ii will be influenced by the set of influencers SS is given by:

fi(S)\displaystyle f_{i}(S) =1, for iS\displaystyle=1\quad\textrm{, for $i\in S$}
fi(S)\displaystyle f_{i}(S) =1(1p)|NS(i)|, for iS\displaystyle=1-(1-p)^{|N_{S}(i)|}\quad\textrm{, for $i\notin S$ }

where |NS(i)||N_{S}(i)| is the number of neighbors of node ii in SS. We use the Epinions data set consisting of 27,000 users from Rossi and Ahmed (2015) for the centralized data experiments and the YouTube online social network data of Yang and Leskovec (2012), consisting of more than 1 million users, for the distributed data experiments. The value of pp is set to 0.01.
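
A sketch of the resulting aggregate objective f(S)=ifi(S)f(S)=\sum_{i}f_{i}(S), with the social network stored as a networkx graph; the function name is ours, and summing the fif_{i} is our assumption about how the per-user probabilities are aggregated.

import networkx as nx

def influence_value(G, S, p=0.01):
    # f(S) = sum_i f_i(S) under the simplified influence model above.
    S = set(S)
    total = 0.0
    for i in G.nodes():
        if i in S:
            total += 1.0
        else:
            n_S = sum(1 for u in G.neighbors(i) if u in S)  # |N_S(i)|
            total += 1.0 - (1.0 - p) ** n_S
    return total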

F.1.4 Revenue Maximization on YouTube.

Maximize the revenue of a product by selecting a set of users SS of maximum size kk, whose network neighbors will be advertised a different product by the set of users SS. It is based on the objective function from Mirzasoleiman et al. (2016). For a given set of users XX and wi,jw_{i,j} the influence between users ii and jj, the objective function is defined by:

f(S)\displaystyle f(S) =iXV(jSwi,j)\displaystyle=\sum_{i\in X}V\left(\sum_{j\in S}w_{i,j}\right)
V(y)\displaystyle V(y) =yα\displaystyle=y^{\alpha}

where V(y)V(y), the expected revenue from a user, is a function of the sum of influences from neighbors who are in SS, and α:0<α<1\alpha:0<\alpha<1 is a rate-of-diminishing-returns parameter for increased coverage.

We use the YouTube data set from Mirzasoleiman et al. (2016), consisting of 18,000 users, for the centralized data experiments. For the distributed data experiments, we perform an empirical evaluation on the Orkut online social network data from Yang and Leskovec (2012), consisting of more than 3 million users. The value of α\alpha is set to 0.3.
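
A sketch of this objective with the influences stored in a dictionary w keyed by user pairs (i, j); the function name and passing XX as an explicit argument are our choices for this sketch.

def revenue_value(w, X, S, alpha=0.3):
    # f(S) = sum_{i in X} ( sum_{j in S} w[(i, j)] ) ** alpha
    total = 0.0
    for i in X:
        influence = sum(w.get((i, j), 0.0) for j in S)
        total += influence ** alpha
    return total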

Appendix G Replicating the Experiment Results

Our experiments can be replicated by following the steps below:

  • Install MPICH version 3.3a2 (DO NOT install OpenMPI, and ensure mpirun utilizes MPICH using the command mpirun --version (Ubuntu))

  • Install pandas, mpi4py, scipy, networkx

  • Set up an MPI cluster using the following tutorial:
    https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/

  • Create and update the host file ../nodesFileIPnew to store the IP addresses of all the connected MPI machines before running any experiments (the first machine being the primary machine)

    • NOTE: Please place nodesFileIPnew inside the MPI shared repository; "cloud/" in this case (at the same level as the code base directory). DO NOT place it inside the code base "DASH-Distributed_SMCC-python/" directory.

  • Clone the DASH-Distributed_SMCC-python repository inside the MPI shared repository (/cloud in the case using the given tutorial)

    • NOTE: Please clone the "DASH-Distributed_SMCC-python" repository and execute the following commands on a machine with sufficient memory (RAM) that is capable of generating the large datasets. This repository NEED NOT be the primary repository ("/cloud/DASH-Distributed_SMCC-python/") on the shared memory of the cluster that will be used for the experiments.

  • Additional Datasets for Experiment 1: Please download the Image Similarity Matrix file "images_10K_mat.csv" (https://drive.google.com/file/d/1s9PzUhV-C5dW8iL4tZPVjSRX4PBhrsiJ/view?usp=sharing) and place it in the data/data_exp1/ directory.

  • To generate the decentralized data for Experiments 2 and 3, please follow the steps below:

    • Execute bash GenerateDistributedData.bash nThreads nNodes

    • The previous command should generate nNodes directories in the loading_data/ directory (with names machine<nodeNo>Data)

    • Copy the data_exp2_split/ and data_exp3_split/ directories within each machine<i>Data directory to the corresponding machine MiM_{i} and place the directories outside /cloud (the directory created after setting up an MPI cluster using the given tutorial).

To run all experiments in the paper, please read the README.md file in the "DASH-Distributed_SMCC-python" repository (Code/Data Appendix) for detailed information.
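
For orientation only, the following is a minimal mpi4py sketch of the two-round pattern used by the distributed algorithms (randomly partition the ground set, run a local routine on each machine, and gather the candidate sets on the primary machine). It is a simplified illustration, not the released code; local_algorithm and combine are placeholders.

from mpi4py import MPI
import random

def local_algorithm(elements):
    # Placeholder for the per-machine routine (e.g., a threshold or greedy algorithm).
    return elements[:10]

def combine(candidate_sets):
    # Placeholder for the post-processing on the primary machine.
    return max(candidate_sets, key=len)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    ground_set = list(range(100000))
    random.seed(0)
    parts = [[] for _ in range(size)]
    for e in ground_set:                          # random partition of the ground set
        parts[random.randrange(size)].append(e)
else:
    parts = None

local_elements = comm.scatter(parts, root=0)      # round 1: distribute the partitions
local_solution = local_algorithm(local_elements)
gathered = comm.gather(local_solution, root=0)    # candidates sent to the primary machine

if rank == 0:
    final_solution = combine(gathered)            # round 2: post-process on the primary machine
    print(len(final_solution))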

Appendix H The Greedy Algorithm (Nemhauser and Wolsey (1978))

The standard Greedy algorithm starts with an empty set and then proceeds by adding elements to the set over kk iterations. In each iteration, the algorithm selects the element with the maximum marginal gain Δ(e|Ai1)\Delta(e|A_{i-1}), where Δ(e|A)=f(A{e})f(A)\Delta(e|A)=f(A\cup\{e\})-f(A). Algorithm 11 is a formal statement of the standard Greedy algorithm. The intermediate solution set AiA_{i} represents the solution after iteration ii. The (11/e1-1/e\approx 0.632)-approximation ratio of the Greedy algorithm is the best achievable for monotone objectives with polynomially many oracle queries (Nemhauser and Wolsey, 1978).

Algorithm 11 Greedy(f,𝒩,k)\textsc{Greedy}(f,\mathcal{N},k)
1:Input: evaluation oracle f:2𝒩f:2^{\mathcal{N}}\to\mathbb{R}, constraint kk
2:Let A0A_{0}\leftarrow\emptyset
3:for i=1i=1 to kk  do
4:     uiargmax{Δ(e|Ai1):e𝒩Ai1}u_{i}\leftarrow\operatorname*{arg\,max}\{\Delta\left(e\,|\,A_{i-1}\right):e\in\mathcal{N}\setminus A_{i-1}\}
5:     AiAi1{ui}A_{i}\leftarrow A_{i-1}\cup\{u_{i}\}
6:return AkA_{k}
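
A minimal Python sketch of Algorithm 11, where the oracle f is assumed to accept a Python set (the function name is ours):

def greedy(f, ground_set, k):
    # Standard greedy: repeatedly add the element with the largest marginal gain.
    A = set()
    for _ in range(k):
        best_elem, best_gain = None, float("-inf")
        for e in ground_set:
            if e in A:
                continue
            gain = f(A | {e}) - f(A)   # Delta(e | A)
            if gain > best_gain:
                best_elem, best_gain = e, gain
        if best_elem is None:
            break
        A.add(best_elem)
    return A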

Appendix I Discussion of the 1/21/2-approximate Algorithm in Liu and Vondrak (2019)

Algorithm 12 ThresholdGreedy(S,G,τ)\textsc{ThresholdGreedy}(S,G,\tau)
1:Input: An input set SS, a partial greedy solution GG with |G|k|G|\leq k, and a threshold τ\tau
2:GGG^{\prime}\leftarrow G
3:for eSe\in S do
4:     if Δ(e|G)τ\Delta\left(e\,|\,G^{\prime}\right)\geq\tau and |G|<k|G^{\prime}|<k then
5:         GG{e}G^{\prime}\leftarrow G^{\prime}\cup\{e\}      
6:return GG^{\prime}
Algorithm 13 ThresholdFilter(S,G,τ)\textsc{ThresholdFilter}(S,G,\tau)
1:Input: An input set SS, a partial greedy solution GG with |G|k|G|\leq k, and a threshold τ\tau
2:SSS^{\prime}\leftarrow S
3:for eSe\in S do
4:     if Δ(e|G)<τ\Delta\left(e\,|\,G\right)<\tau then SS{e}S^{\prime}\leftarrow S^{\prime}\setminus\{e\}
5:return SS^{\prime}
Algorithm 14 A simple 2-round 1/21/2 approximation, assuming OPT is known
1:round 1:
2:SS\leftarrow sample each e𝒩e\in\mathcal{N} with probability p=4k/np=4\sqrt{k/n}
3:send SS to each machine and the central machine CC
4:partition 𝒩\mathcal{N} randomly into sets V1,V2,,VmV_{1},V_{2},\ldots,V_{m}
5:send ViV_{i} to machine ii for each i[m]i\in[m]
6:for each machine MiM_{i} (in parallel) do
7:     τOPT2k\tau\leftarrow\frac{\textsc{OPT}}{2k}
8:     G0ThresholdGreedy(S,,τ)G_{0}\leftarrow\textsc{ThresholdGreedy}(S,\emptyset,\tau)
9:     if |G0|<k|G_{0}|<k then
10:         RiThresholdFilter(Vi,G0,τ)R_{i}\leftarrow\textsc{ThresholdFilter}(V_{i},G_{0},\tau)
11:     else
12:         Ri=R_{i}=\emptyset      
13:     send RiR_{i} to the central machine CC
14:
15:round 2 (only on CC):
16:compute G0G_{0} from SS as in the first round, and set GG0G\leftarrow G_{0}
17:for eiRie\in\cup_{i}R_{i} do
18:     if Δ(e|G)τ\Delta\left(e\,|\,G\right)\geq\tau and |G|<k|G|<k then GG{e}G\leftarrow G\cup\{e\}
19:return GG

The simple two-round distributed algorithm proposed by Liu and Vondrak (2019) achieves a deterministic 1/2ε1/2-\varepsilon approximation ratio. The pseudocode of this algorithm, assuming that OPT is known, is provided in Alg. 14. The version that guesses OPT runs O(log(k))O\left(\log(k)\right) copies of Alg. 14 on each machine, which results in O(k)O\left(k\right) adaptivity and O(nklog(k))O\left(nk\log(k)\right) query complexity.

Intuitively, the result of Alg. 14 is equivalent to a single run of ThresholdGreedy on the whole ground set. In the first round, it calls ThresholdGreedy on SS to get an intermediate solution G0G_{0}. After filtering out the elements of ViV_{i} with gains less than τ\tau, the remaining elements of ViV_{i} are sent to the primary machine. A careful analysis shows that, with high probability, the total number of elements returned in the first round is less than nk\sqrt{nk}.
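
The following single-process sketch simulates Alg. 14 to mirror this intuition: ThresholdGreedy is run on the sample, each partition is filtered against the resulting partial solution, and the threshold greedy is then continued over the returned elements. All names are ours, and the oracle f is assumed to accept a Python set.

import random

def threshold_add(f, elements, G, k, tau):
    # Adds elements with marginal gain at least tau to the partial solution G (as in Alg. 12).
    G = set(G)
    for e in elements:
        if e not in G and len(G) < k and f(G | {e}) - f(G) >= tau:
            G.add(e)
    return G

def two_round_half_approx(f, ground_set, k, opt, num_machines=4, seed=0):
    random.seed(seed)
    tau = opt / (2 * k)
    p = 4 * (k / len(ground_set)) ** 0.5
    S = [e for e in ground_set if random.random() < p]     # shared sample sent to every machine
    G0 = threshold_add(f, S, set(), k, tau)
    parts = [[] for _ in range(num_machines)]               # random partition of the ground set
    for e in ground_set:
        parts[random.randrange(num_machines)].append(e)
    R = []
    for V_i in parts:                                        # round 1: filter each partition against G0
        if len(G0) < k:
            R.extend(e for e in V_i if f(G0 | {e}) - f(G0) >= tau)
    return threshold_add(f, R, G0, k, tau)                   # round 2: continue on the primary machine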

Compared to the 2-round algorithms in Table 1, Alg. 14 is the only algorithm that requires data duplication: the random sample SS (of expected size 4nk4\sqrt{nk}) is sent to every machine in addition to its partition in the first round. As a result, distributed setups face a more rigid memory restriction when running this algorithm. For example, when the dataset exactly fits into the distributed setup, so that Ψ=n/\Psi=n/\ell, there is no space left on each machine to store the random subset SS, whereas the other 2-round algorithms can still run with kn/2k\leq n/\ell^{2}.

Additionally, since the approximation analysis of Alg. 14 does not take randomization into consideration, we may be able to substitute its ThresholdGreedy subroutine with any threshold algorithm (such as ThresholdSeq in Chen et al. (2021)) that has optimal logarithmic adaptivity and linear query complexity. However, since its analysis of the success probability treats SS as 3k3k blocks, where each block has random access to the whole ground set, it may not be easy to fit parallelizable threshold algorithms into this framework.

References

  • Badanidiyuru and Vondrak (2014) Ashwinkumar Badanidiyuru and Jan Vondrak. Fast algorithms for maximizing submodular functions. In Chandra Chekuri, editor, Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1497–1514. SIAM, 2014.
  • Balakrishnan et al. (2022) Ravikumar Balakrishnan, Tian Li, Tianyi Zhou, Nageen Himayat, Virginia Smith, and Jeff A. Bilmes. Diverse client selection for federated learning via submodular maximization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
  • Balkanski and Singer (2018) Eric Balkanski and Yaron Singer. The adaptive complexity of maximizing a submodular function. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 1138–1151. ACM, 2018.
  • Balkanski et al. (2019) Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An exponential speedup in parallel running time for submodular maximization without loss in approximation. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 283–302. SIAM, 2019.
  • Barbosa et al. (2015) Rafael da Ponte Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. The power of randomization: Distributed submodular maximization on massive datasets. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 1236–1244. JMLR.org, 2015.
  • Barbosa et al. (2016) Rafael da Ponte Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. A new framework for distributed submodular maximization. In Irit Dinur, editor, IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 645–654. IEEE Computer Society, 2016.
  • Blelloch et al. (2012) Guy E. Blelloch, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Parallel and I/O efficient set covering algorithms. In Guy E. Blelloch and Maurice Herlihy, editors, 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, Pittsburgh, PA, USA, June 25-27, 2012, pages 82–90. ACM, 2012.
  • Breuer et al. (2020) Adam Breuer, Eric Balkanski, and Yaron Singer. The FAST algorithm for submodular maximization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1134–1143. PMLR, 2020.
  • Buchbinder et al. (2014) Niv Buchbinder, Moran Feldman, Joseph Naor, and Roy Schwartz. Submodular maximization with cardinality constraints. In Chandra Chekuri, editor, Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1433–1452. SIAM, 2014.
  • Buchbinder et al. (2017) Niv Buchbinder, Moran Feldman, and Roy Schwartz. Comparing apples and oranges: Query trade-off in submodular maximization. Math. Oper. Res., 42(2):308–329, 2017.
  • Calinescu et al. (2007) Gruia Calinescu, Chandra Chekuri, Martin Pal, and Jan Vondrak. Maximizing a submodular set function subject to a matroid constraint (extended abstract). In Matteo Fischetti and David P. Williamson, editors, Integer Programming and Combinatorial Optimization, 12th International IPCO Conference, Ithaca, NY, USA, June 25-27, 2007, Proceedings, volume 4513 of Lecture Notes in Computer Science, pages 182–196. Springer, 2007.
  • Chekuri and Quanrud (2019) Chandra Chekuri and Kent Quanrud. Submodular function maximization in parallel via the multilinear relaxation. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 303–322. SIAM, 2019.
  • Chen and Kuhnle (2024) Yixin Chen and Alan Kuhnle. Practical and parallelizable algorithms for non-monotone submodular maximization with size constraint. J. Artif. Intell. Res., 79:599–637, 2024.
  • Chen et al. (2021) Yixin Chen, Tonmoy Dey, and Alan Kuhnle. Best of both worlds: Practical and theoretically optimal submodular maximization in parallel. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 25528–25539, 2021.
  • Chierichetti et al. (2010) Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. Max-cover in map-reduce. In Michael Rappa, Paul Jones, Juliana Freire, and Soumen Chakrabarti, editors, Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pages 231–240. ACM, 2010.
  • Dey et al. (2023) Tonmoy Dey, Yixin Chen, and Alan Kuhnle. DASH: A distributed and parallelizable algorithm for size-constrained submodular maximization. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 3941–3948. AAAI Press, 2023.
  • Ene and Nguyen (2019) Alina Ene and Huy L. Nguyen. Submodular maximization with nearly-optimal approximation and adaptivity in nearly-linear time. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 274–282. SIAM, 2019.
  • Ene and Nguyen (2020) Alina Ene and Huy L. Nguyen. Parallel algorithm for non-monotone dr-submodular maximization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 2902–2911. PMLR, 2020.
  • Epasto et al. (2017) Alessandro Epasto, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Bicriteria distributed submodular maximization in a few rounds. In Christian Scheideler and Mohammad Taghi Hajiaghayi, editors, Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 25–33. ACM, 2017.
  • Ettinger et al. (2021) Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurelien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving : The waymo open motion dataset. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9690–9699. IEEE, 2021.
  • Fahrbach et al. (2019a) Matthew Fahrbach, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Submodular maximization with nearly optimal approximation, adaptivity and query complexity. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 255–273. SIAM, 2019a.
  • Fahrbach et al. (2019b) Matthew Fahrbach, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Non-monotone submodular maximization with nearly optimal adaptivity and query complexity. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1833–1842. PMLR, 2019b.
  • Gupta et al. (2010) Anupam Gupta, Aaron Roth, Grant Schoenebeck, and Kunal Talwar. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In Amin Saberi, editor, Internet and Network Economics - 6th International Workshop, WINE 2010, Stanford, CA, USA, December 13-17, 2010. Proceedings, volume 6484 of Lecture Notes in Computer Science, pages 246–257. Springer, 2010.
  • Joseph et al. (2019) K. J. Joseph, Vamshi Teja R, Krishnakant Singh, and Vineeth N. Balasubramanian. Submodular batch selection for training deep neural networks. In Sarit Kraus, editor, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 2677–2683. ijcai.org, 2019.
  • Kazemi et al. (2019) Ehsan Kazemi, Marko Mitrovic, Morteza Zadimoghaddam, Silvio Lattanzi, and Amin Karbasi. Submodular streaming in all its glory: Tight approximation, minimum memory and low adaptive complexity. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 3311–3320. PMLR, 2019.
  • Kazemi et al. (2021) Ehsan Kazemi, Shervin Minaee, Moran Feldman, and Amin Karbasi. Regularized submodular maximization at scale. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5356–5366. PMLR, 2021.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. In University of Toronto, Canada. Citeseer, 2009.
  • Kuhnle (2021) Alan Kuhnle. Quick streaming algorithms for maximization of monotone submodular functions in linear time. In Arindam Banerjee and Kenji Fukumizu, editors, The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pages 1360–1368. PMLR, 2021.
  • Kumar et al. (2013) Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in mapreduce and streaming. In Guy E. Blelloch and Berthold Vöcking, editors, 25th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’13, Montreal, QC, Canada - July 23 - 25, 2013, pages 1–10. ACM, 2013.
  • Lee et al. (2009) Jon Lee, Vahab S. Mirrokni, Viswanath Nagarajan, and Maxim Sviridenko. Non-monotone submodular maximization under matroid and knapsack constraints. In Michael Mitzenmacher, editor, Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 323–332. ACM, 2009.
  • Liu and Vondrak (2019) Paul Liu and Jan Vondrak. Submodular optimization in the mapreduce model. In Jeremy T. Fineman and Michael Mitzenmacher, editors, 2nd Symposium on Simplicity in Algorithms, SOSA 2019, January 8-9, 2019, San Diego, CA, USA, volume 69 of OASIcs, pages 18:1–18:10. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • Liu (2020) Siwen Liu. A review for submodular optimization on machine scheduling problems. In Ding-Zhu Du and Jie Wang, editors, Complexity and Approximation - In Memory of Ker-I Ko, volume 12000 of Lecture Notes in Computer Science, pages 252–267. Springer, 2020.
  • Liu et al. (2020) Yajing Liu, Edwin K. P. Chong, Ali Pezeshki, and Zhenliang Zhang. Submodular optimization problems and greedy strategies: A survey. Discret. Event Dyn. Syst., 30(3):381–412, 2020.
  • Mao et al. (2021) Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Chunjing Xu, and Hang Xu. One million scenes for autonomous driving: ONCE dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
  • Mirrokni and Zadimoghaddam (2015) Vahab S. Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 153–162. ACM, 2015.
  • Mirzasoleiman et al. (2013) Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2049–2057, 2013.
  • Mirzasoleiman et al. (2015) Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrak, and Andreas Krause. Lazier than lazy greedy. In Blai Bonet and Sven Koenig, editors, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 1812–1818. AAAI Press, 2015.
  • Mirzasoleiman et al. (2016) Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi. Fast constrained submodular maximization: Personalized data summarization. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1358–1367. JMLR.org, 2016.
  • Mirzasoleiman et al. (2018) Baharan Mirzasoleiman, Stefanie Jegelka, and Andreas Krause. Streaming non-monotone submodular maximization: Personalized video summarization on the fly. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 1379–1386. AAAI Press, 2018.
  • Mitzenmacher and Upfal (2017) Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017.
  • Nemhauser and Wolsey (1978) George L. Nemhauser and Laurence A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Math. Oper. Res., 3(3):177–188, 1978.
  • Nemhauser et al. (1978) George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Math. Program., 14(1):265–294, 1978.
  • Rangwani et al. (2021) Harsh Rangwani, Arihant Jain, Sumukh K. Aithal, and R. Venkatesh Babu. S3{}^{\mbox{3}}vaada: Submodular subset selection for virtual adversarial active domain adaptation. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 7496–7505. IEEE, 2021.
  • Rossi and Ahmed (2015) Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Blai Bonet and Sven Koenig, editors, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 4292–4293. AAAI Press, 2015.
  • Yang and Leskovec (2012) Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Mohammed Javeed Zaki, Arno Siebes, Jeffrey Xu Yu, Bart Goethals, Geoffrey I. Webb, and Xindong Wu, editors, 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10-13, 2012, pages 745–754. IEEE Computer Society, 2012.