
Improved Algorithms for Maximum Coverage
in Dynamic and Random Order Streams

Amit Chakrabarti Department of Computer Science, Dartmouth College, Hanover, NH, USA. Supported in part by NSF under award 2006589.    Andrew McGregor College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA    Anthony Wirth School of Computing and Information Systems, The University of Melbourne, Victoria, Australia. Supported in part by the Faculty of Engineering and Information Technology.
Abstract

The maximum coverage problem is to select $k$ sets from a collection of sets such that the cardinality of the union of the selected sets is maximized. We consider $(1-1/e-\varepsilon)$-approximation algorithms for this NP-hard problem in three standard data stream models.

  1. Dynamic Model. The stream consists of a sequence of sets being inserted and deleted. Our multi-pass algorithm uses $\varepsilon^{-2}k\cdot\operatorname{polylog}(n,m)$ space. The best previous result (Assadi and Khanna, SODA 2018) used $(n+\varepsilon^{-4}k)\operatorname{polylog}(n,m)$ space. While both algorithms use $O(\varepsilon^{-1}\log n)$ passes, our analysis shows that when $\varepsilon$ is a constant, it is possible to reduce the number of passes by a $1/\log\log n$ factor without incurring additional space.

  2. Random Order Model. In this model, there are no deletions and the sets forming the instance are uniformly randomly permuted to form the input stream. We show that a single pass and $k\operatorname{polylog}(n,m)$ space suffice for arbitrarily small constant $\varepsilon$. The best previous result, by Warneke et al. (ESA 2023), used $k^{2}\operatorname{polylog}(n,m)$ space.

  3. Insert-Only Model. Lastly, our results, along with numerous previous results, use a sub-sampling technique introduced by McGregor and Vu (ICDT 2017) to sparsify the input instance. We explain how this technique and others used in the paper can be implemented such that the amortized update time of our algorithm is polylogarithmic. This also implies an improvement over the state-of-the-art insert-only algorithms in terms of update time: $\operatorname{polylog}(m,n)$ update time suffices, whereas the best previous result, by Jaud et al. (SEA 2023), required update time linear in $k$.

1 Introduction

Background.

The input to the maximum coverage problem is an integer $k$ and a collection of $m$ sets $S_{1},\ldots,S_{m}$, each a subset of the universe $[n]:=\{1,\ldots,n\}$. The goal is to find $k$ sets $\{S_{i}\}$ whose union has maximum cardinality. This longstanding problem has several applications, including facility and sensor allocation [KG07], circuit layout and job scheduling [HP98], information retrieval [ABB+15], influence maximization in marketing strategy design [KKT15], and content recommendation [SG09]. In terms of theoretical importance, it is perhaps the simplest example of a submodular maximization problem, a rich family of problems in machine learning [KG14]. A natural variant, the set cover problem, was one of Karp's original 21 NP-complete problems [Kar72]. A greedy approach that iteratively chooses the set with the greatest contribution yields asymptotically optimal approximation algorithms for set cover and maximum coverage [Fei98]. Indeed, the greedy algorithm typically outperforms its proven approximation factor on practical problem instances [LMW14].

Data Stream Computation.

Over the last decade, there has been a growing body of work that addresses the challenges of solving the maximum coverage problem (henceforth, MaxCov), the related set cover problem (SetCov), and the general problem of submodular optimization, in computational models, such as the data stream model, that are relevant when the underlying data sets are massive [SG09, MV19a, BEM17, IV19, AK18, WCW23, ABG+12, YY13, LRVZ21a, ASS19, NTM+18, CKW10, JWC23, HIMV16, CW16, ER14, AKL16]. Assadi, Khanna, and Li [AKL16] provide a comprehensive summary and discussion of results. In many real-world applications, even the conceptually simple greedy approach is not practical. This can occur if the underlying data is too large to be stored in a single location and is therefore distributed across multiple machines, or if the data changes over time, e.g., sets are added to or removed from the input collection.

This work focuses on the set streaming model [SG09], where the input stream is a sequence of sets (in the dynamic case, a sequence of set insertions and deletions, arbitrarily intermixed). Each set is described contiguously. The goal is to solve MaxCov in space that is sublinear in the size of the input. We note in passing that there has also been interest in an alternative “edge streaming” model [KKA23, BEM17, IV19], but it is not the focus of this paper.

Along with the set-streaming model itself, Saha and Getoor [SG09] introduced the first algorithm for MaxCov in this model. Their algorithm, sops (the algorithm received this name in later work by Yu and Yuan [YY13]), achieves an approximation factor of $1/4$ using $O(nk)$ space. The input could be as large as $\Omega(mn)$, and there is, e.g., a logarithmic-factor approximation for set cover that uses $\tilde{O}(mn^{\delta})$ space in $O(1/\delta)$ passes [HIMV16]. To accord with practice, early approaches for set cover took space linear in $n$ [CKW10, ER14, CW16]. Though it is unclear whether there are sensible $o(n)$-space solutions for SetCov, McGregor and Vu [MV19a] pioneered the quest for $\tilde{O}(k)$-space methods for MaxCov. Storing even a single set from the stream may, in general, require $\Theta(n)$ space. To avoid this, various papers consider sub-sampling the elements in the sets [JWC23, DIMV14, BEM17, HIMV16]. In particular, McGregor and Vu [MV19a] observed that subsampling the instance down to a universe of size $\tilde{O}(k)$, using a hash function of high independence, preserves coverage very accurately.

We will focus on three specific set streaming models. In the insert-only set streaming model, the data stream comprises $m$ sets, in arbitrary order, each described once. In the random order set streaming model, we assume that these sets, each described once and contiguously, are ordered uniformly at random. This model captures a natural form of average-case analysis: the sets are chosen adversarially, but they arrive in a uniformly random order [AK18, WCW23]. In the dynamic set streaming model, each set may be added and deleted several times: the resulting MaxCov instance involves only those sets that remain added at the end of the stream.

Submodular maximization.

Our problem MaxCov is a special case of submodular maximization, specifically, of the problem of maximizing a monotone submodular function under a cardinality constraint. In the streaming setting [BMKK14, FNFSZ20], we have oracle access to a submodular function $f\colon 2^{U}\to\mathbb{R}$, for some universe $U$, and the input stream is a sequence of elements of $U$. These elements correspond to the sets $S_{i}$ in MaxCov. An important point here is that in MaxCov (unlike in submodular maximization), storing an element of $U$ (i.e., a set $S_{i}$) takes non-constant space, as does storing the $f$-value (coverage) of a putative solution.

The sieve-streaming algorithm [BMKK14] for submodular maximization gives a $(1/2-\varepsilon)$-approximation for any $\varepsilon>0$. Adapted to maximum coverage, it yields an $\tilde{O}(n\varepsilon^{-1})$-space algorithm achieving the same approximation factor [MV19b]. Subsequently, Chekuri et al. [CGQ15] extended this work to non-monotone submodular function maximization under constraints beyond cardinality. Observe that $\Omega(mk^{-3})$ space is required to achieve an approximation factor greater than $1/2+o(1)$ in the arbitrary-arrival model [FNFSZ20]. There has also been progress on submodular maximization in dynamic streams [Mon20], though comparatively little concerning MaxCov.

Random order.

Streaming algorithms for submodular maximization in the random-arrival model have been studied [NFTM+18, ASS19, LRVZ21b]. Interestingly, approximation factors better than $1/2$ have been achieved in this model using low space. Norouzi-Fard et al. [NFTM+18] developed salsa, an $\tilde{O}(k)$-space algorithm that achieves an approximation factor of $1/2+c_{0}$ in expectation, where $c_{0}>0$ is a small absolute constant. Subsequently, Agrawal et al. [ASS19] achieved an approximation factor of $1-1/e-\varepsilon-o(1)$ in expectation using $\tilde{O}_{\varepsilon}(k)$ space. The $o(1)$ term in the approximation factor is a function of $k$; for small $k$, this term is very large, in which case salsa gives a better approximation guarantee. Nonetheless, the algorithm of Agrawal et al. is a focus of this paper. Warneke et al. [WCW23] introduced low-space methods for maximum coverage in random-order streams. Their approaches consume space proportional to $k^{2}$; we improve this to linear in $k$ (see Section 1.1 for details). The corresponding hardness result is that $\Omega(m)$ space is required to achieve an approximation factor of $1-1/e+\varepsilon+o(1)$ in the random-arrival model [LRVZ21b]. (This improves upon the $\Omega(mk^{-3})$ space lower bound given by McGregor and Vu [MV19b] for random-arrival set-streaming maximum coverage, which applies to random-arrival submodular maximization, due to an important connection between these two problems, explained shortly.)

1.1 Our Results, Approach, and Related Work

To establish our new results, we revisit recent work on the maximum coverage problem in the relevant data stream models and show how to refine and extend the approaches to substantially decrease the amount of space needed to solve the problem. In the process, we also establish secondary results on speeding up the existing algorithms and, in one case, developing a slightly more pass-efficient algorithm.

In the dynamic stream model, our main result is the following.

Result 1 (Algorithm for Dynamic Model).

There exists an $O((1+\varepsilon^{-1}/\log\log m)\log m)$-pass algorithm in the dynamic data stream model that uses $\varepsilon^{-2}\cdot k\cdot\operatorname{polylog}(n,m)$ space and returns a $(1-1/e-\varepsilon)$-approximation.

This result reappears as Theorem 4.2. It improves over the best previous algorithm, by Assadi and Khanna [AK18], which used $(n+\varepsilon^{-4}k)\operatorname{polylog}(n,m)$ space. The high-level approach is similar and is based on the existing $\ell_{0}$ sampling primitive that allows uniform sampling amongst the sets that are inserted but not deleted. In each pass, we pick additional sets and maintain the union $C$ of the selected sets. We pick the additional sets by sampling only among the sets that include a "meaningful" number of elements that are not currently in $C$. Unfortunately, when we simultaneously sample multiple sets in this way, once we start adding sampled sets to our solution, some of the sampled sets may no longer make a meaningful increment to $|C|$. However, by repeating the process over multiple passes, with a slowly decreasing threshold for what constitutes a meaningful improvement, we ensure that either we exhaust the collection of sets that could potentially be worth adding, or we add enough sets that together yield a good approximate solution. The main point of departure from Assadi and Khanna [AK18] is the way we handle the decreasing threshold. The largest threshold considered by Assadi and Khanna is about $\operatorname{OPT}/k$; this suffices for achieving a $(1-1/e-\varepsilon)$-approximation factor, but has the downside that there can be many overlapping sets that would contribute more than $\operatorname{OPT}/k$ new elements to $C$; storing these sets during the sampling process is expensive. Instead, we consider thresholds decreasing from $\operatorname{OPT}$, handling the sets that make a contribution between $\operatorname{OPT}/2^{i}$ and $\operatorname{OPT}/2^{i+1}$, for $i=0,\ldots,\log k$, separately but in parallel. We analyze a stochastic process, which we call the cascading urns process, to show that our new algorithm uses a similar number of passes to the algorithm of Assadi and Khanna [AK18] but significantly less space. We present the details in Section 4. In fact, a more careful analysis of the process shows that it is possible to reduce the number of passes to $O(\log m+\varepsilon^{-1}\log m/\log\log m)$, in contrast to the $O(\varepsilon^{-1}\log m)$ passes used by the previous algorithm. We believe that a similar pass saving can be achieved in Assadi and Khanna [AK18], but it would require an extra log factor in the space.

In the random order model, our main result is the following.

Result 2 (Algorithm for Random Order Model).

There exists a single-pass algorithm in the random order data stream model that uses $O_{\varepsilon}(k\cdot\operatorname{polylog}(n,m))$ space and returns a $(1-1/e-\varepsilon)$-approximation to maximum coverage in expectation.

The above result is established in Section 5.2. Note that the dependence on $\varepsilon$ in the result is exponential, which makes the algorithm less practical. The main significance of the result is removing a factor of $k$ from the space required by the best previous result, by Warneke et al. [WCW23], which used $\varepsilon^{-2}k^{2}\operatorname{polylog}(n,m)$ space. Both their approach and ours are based on modifying existing algorithms for the cardinality-constrained monotone submodular optimization problem. This is a more general problem than maximum coverage, but it is assumed in the general problem that the algorithm has oracle access to the function being optimized. For maximum coverage, we need to store enough information to be able to simulate this oracle. An $O_{\varepsilon}(k^{2}\log m)$-space algorithm for maximum coverage follows immediately from any $O_{\varepsilon}(k)$-space algorithm for monotone submodular optimization, because the universe subsampling technique discussed in Section 3 allows us to focus on the case where every set has size $O(\varepsilon^{-2}k\log m)$. To reduce the dependence on $k$ to linear, rather than quadratic, we need a more careful adaptation of an algorithm by Agrawal et al. [ASS19]. We maintain a set of size $O(\varepsilon^{-2}k\log m)$, corresponding to the items covered by the collection of sets chosen for the solution so far, along with $O_{\varepsilon}(1)$ sets of size $O_{\varepsilon}(\log m)$. We mention in passing that, being a thresholding algorithm, salsa is easy to adapt to our $O_{\varepsilon}(k\cdot\operatorname{polylog}(n,m))$ regime, yielding an approximation factor of $1/2+c_{0}$. We present the details in Section 5.

Lastly, our results, along with numerous previous results, use the aforementioned universe sub-sampling technique introduced by McGregor and Vu [MV19a] to sparsify the input instance. We explain how this technique and others used in the paper can be implemented such that the amortized update time of our algorithm is polylogarithmic. This also implies an improvement over the state-of-the-art insert-only algorithms in terms of update time: $\operatorname{polylog}(m,n)$ update time suffices, whereas the best previous result, by Jaud et al. [JWC23], required update time $k\cdot\operatorname{polylog}(m,n)$. We present the details in Section 6.

2 Further Related Work

Here we provide some more detail on the background materials.

Random arrivals.

Guha and McGregor [GM09] §1.1 further justify the assumptions of the random-arrival model. This model has been studied for many important problems in other contexts, including quantile estimation [GM09], submodular maximization [NTM+18, ASS19], and maximum matching [KMM12, AB21], often revealing improved space-accuracy trade-offs.

Coverage in streams.

Saha and Getoor [SG09] gave a swap-based $1/4$-approximation algorithm, sops, that uses a single pass and $\tilde{O}(kn)$ space. Their algorithm stores $k$ sets explicitly in memory as the current solution and selectively replaces sets in memory as new sets arrive in the stream. Subsequently, Ausiello et al. [ABG+12] gave a slightly different swap-based algorithm that also finds a $1/4$-approximation using one pass and the same space.

Yu and Yuan [YY13] improved upon sops under a more relaxed output specification, in which only the IDs of the chosen sets must be returned. Their algorithm, gops, achieves an approximation factor of approximately $0.3$ using just $\tilde{O}(n)$ space.

McGregor and Vu [MV19a] introduced several methods for MaxCov under set streaming. They presented four near-optimal algorithms for the problem, with the following specifications:

  1. An approximation factor of $1-1/e-\varepsilon$ using $\tilde{O}(m\varepsilon^{-2})$ space.

  2. An approximation factor of $1-1/e-\varepsilon$ using $\tilde{O}(k\varepsilon^{-2})$ space, requiring $\tilde{O}(\varepsilon^{-1})$ passes over the stream. This method is a key launchpad for our work.

  3. An approximation factor of $1-\varepsilon$ using $\tilde{O}(m\varepsilon^{-3})$ space (and exponential time).

  4. An approximation factor of $1/2-\varepsilon$ using just $\tilde{O}(k\varepsilon^{-3})$ space.

Subsequently, McGregor et al. [MTV21] showed how to solve maximum coverage exactly in one pass and in $\tilde{O}(d^{d+1}k^{d})$ space when every set has size at most $d$.

Dynamic streams.

In the dynamic graph stream model [McG14], the stream consists of insertions and deletions of edges of the underlying graph [ACG+15, AGM12a, AGM12b, AGM13, KLM+14, KW14, GMT15, BHNT15, CCE+16, AKLY16, Kon15, MVV16, MTVV15], relevant to the maximum vertex coverage problem. An algorithm for the dynamic graph stream model can also be used in the streaming-set model; the streaming-set model is simply a special case in which edges are grouped by endpoint.

Lower bounds.

McGregor and Vu [MV19a] introduced the first nontrivial space lower bound for set-streaming maximum coverage. For all $\varepsilon>0$, achieving an approximation factor better than $1-1/e+o(1)$ requires $\Omega(mk^{-2})$ space in the arbitrary-arrival case, and $\Omega(mk^{-3})$ space in the random-arrival case, where the $o(1)$ term is a function of $k$. This also represented the first result explicitly regarding random-arrival maximum coverage. Assadi [Ass17] showed that achieving a $(1-\varepsilon)$-approximation requires $\Omega(m\varepsilon^{-2})$ space, even in the random-arrival model, almost matching the $\tilde{O}(m\varepsilon^{-3})$ upper bound achieved by McGregor and Vu [MV19a]. More recently, Feldman et al. [FNFSZ20] showed that $\Omega(mk^{-3})$ space is required to beat $1/2+o(1)$ in the arbitrary-arrival model.

3 Preliminaries

Without loss of generality, we assume each set is tagged with a unique ID in the range $[m^{3}n]$; if not, we could generate suitable IDs via random hashing of the contents of the sets themselves. (E.g., hash each set $S\subseteq[n]$ to $f_{S}(r)$, where $f_{S}(X)=\prod_{u\in S}(X-u)$ is a polynomial over a prime field $\mathbb{F}_{p}$ of cardinality $\Theta(m^{3}n)$ and $r$ is uniformly random in $\mathbb{F}_{p}$. By elementary analysis, the map $S\mapsto f_{S}(r)$ is non-injective with probability at most $\binom{m}{2}n/p\leq 1/m$.) In the dynamic setting, we will want to sample uniformly from the sets that are inserted but not deleted (sometimes with additional requirements on the relevant sets). To do this, we will use the standard technique of $\ell_{0}$ sampling.
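To make the ID scheme above concrete, here is a minimal Python sketch of the polynomial fingerprinting, with a fixed Mersenne prime standing in for a prime of cardinality $\Theta(m^{3}n)$; the function name make_id_hasher and the choice of prime are illustrative, not part of the paper.

```python
import random

def make_id_hasher(m, n, seed=0):
    """Sketch of the ID scheme: hash a set S (a subset of [n]) to
    f_S(r) = prod_{u in S} (r - u) mod p, for a prime p >= m^3 * n and a
    uniformly random r in F_p.  Two distinct sets collide with probability
    at most n/p, so all m sets receive distinct IDs with high probability."""
    rng = random.Random(seed)
    p = (1 << 61) - 1            # a Mersenne prime; assume p >= m^3 * n
    r = rng.randrange(p)

    def set_id(S):
        h = 1
        for u in S:
            h = h * ((r - u) % p) % p
        return h

    return set_id

# Example: the ID is a function of the set's contents, not the element order.
set_id = make_id_hasher(m=1000, n=10**6)
assert set_id({3, 1, 4}) == set_id({4, 3, 1})
```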

Theorem 3.1 ($\ell_{0}$ sampling [JST11]).

There is a one-pass algorithm that processes a stream of tokens $\langle x_{1},\Delta_{1}\rangle,\langle x_{2},\Delta_{2}\rangle,\ldots$, where each $x_{i}\in\{1,\ldots,M\}$ and $\Delta_{i}\in\{-1,1\}$, using $O(\log^{2}(M)\log(1/\delta))$ bits of space, and that, with probability $1-\delta$, returns an element chosen uniformly at random from the set $\{x\in[M]:\sum_{i:x_{i}=x}\Delta_{i}\neq 0\}$.

Specifically, we will use the above technique when $M=m^{3}n$, with each $x_{i}$ being the ID of a set and the corresponding $\Delta_{i}$ specifying whether the set is being inserted or deleted.
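For illustration, the following is a simplified Python sketch of an $\ell_{0}$ sampler with the interface we need (update with $\langle x_{i},\Delta_{i}\rangle$ tokens, then sample). It follows the standard recipe of geometric subsampling levels plus a 1-sparse test with a fingerprint; a production implementation along the lines of [JST11] would use limited-independence hashing and sparse recovery, both of which are elided here, so this toy is for exposition only.

```python
import random

class L0Sampler:
    """Simplified l0-sampler sketch.  At level j we retain updates to items
    x whose (pseudo-random) priority h(x) is below M / 2^j, keeping a net
    count, a weighted sum, and a fingerprint per level.  A level whose
    surviving support is exactly one item passes the 1-sparse test and
    yields that item.  Assumes each net count ends up in {0, 1}."""

    def __init__(self, M, seed=0):
        rng = random.Random(seed)
        self.M, self.levels = M, M.bit_length() + 1
        self.p = (1 << 61) - 1                # prime for the fingerprint test
        self.z = rng.randrange(1, self.p)
        self.salt = rng.randrange(1 << 30)    # makes h(x) consistent per item
        self.stats = [[0, 0, 0] for _ in range(self.levels)]  # (count, sum, fp)

    def _level(self, x):
        h = hash((self.salt, x)) % self.M     # stand-in for a random priority
        lvl = 0
        while lvl + 1 < self.levels and h < self.M >> (lvl + 1):
            lvl += 1
        return lvl

    def update(self, x, delta):               # delta in {-1, +1}
        for lvl in range(self._level(x) + 1): # x survives levels 0..level(x)
            c = self.stats[lvl]
            c[0] += delta
            c[1] += delta * x
            c[2] = (c[2] + delta * pow(self.z, x, self.p)) % self.p

    def sample(self):
        """Return a surviving item, or None if no level is 1-sparse."""
        for count, total, fp in reversed(self.stats):   # sparsest level first
            if count == 1 and fp == pow(self.z, total, self.p):
                return total                  # the support is exactly {total}
        return None
```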

As mentioned in the introduction, the natural (offline) greedy approach achieves a $(1-1/e)$-approximation for maximum coverage. In streaming settings, a "quantized" version of the algorithm achieves similar results, as summarized below.

Theorem 3.2 (Quantized Greedy Algorithm, e.g., [MV19a]).

Consider the algorithm that processes a stream $\langle S_{1},S_{2},\ldots\rangle$ of sets as follows. Starting with an empty collection $Y$, it makes $p$ passes: in the $i$th pass, it adds to $Y$ each encountered set that covers $\geq\tau_{i}$ uncovered elements. If, at some point, $|Y|=k$, then the algorithm stops and returns $Y$. If these threshold parameters $\tau_{1}>\cdots>\tau_{p}$ satisfy

  1. $\tau_{1}\geq\operatorname{OPT}/k$,

  2. $\tau_{p}<\operatorname{OPT}/(4ek)$, and

  3. $\tau_{i}/\tau_{i+1}\leq 1+\varepsilon$, for all $i\in[p-1]$,

then the algorithm achieves a $(1-1/e-\varepsilon)$-approximation for the maximum coverage problem.
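A minimal sketch of the quantized greedy algorithm follows, simulated offline by repeatedly iterating over an in-memory list of sets; the threshold schedule is chosen so that conditions 1-3 above hold whenever the guess $v$ satisfies $\operatorname{OPT}/2\leq v\leq\operatorname{OPT}$ (the guessing technique is described next). The function name and calling convention are ours.

```python
import math

def quantized_greedy(stream_pass, k, v, eps):
    """Offline simulation of the quantized greedy algorithm of Theorem 3.2.
    stream_pass() yields the sets of the stream, once per call (one pass).
    Thresholds start at 2v/k >= OPT/k and decrease geometrically by (1+eps)
    until they fall below v/(4ek) <= OPT/(4ek)."""
    Y, C = [], set()
    tau = 2 * v / k                        # tau_1 = 2v/k >= OPT/k
    while True:
        for S in stream_pass():            # one pass at threshold tau
            if len(S - C) >= tau:
                Y.append(S)
                C |= S
                if len(Y) == k:
                    return Y, C
        if tau < v / (4 * math.e * k):     # tau_p < OPT/(4ek): done
            return Y, C
        tau /= 1 + eps                     # tau_i / tau_{i+1} = 1 + eps

# Toy usage with k = 2; here OPT = 8 and v = 6 lies in [OPT/2, OPT].
sets = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5}), frozenset({6, 7, 8, 9})]
print(len(quantized_greedy(lambda: sets, k=2, v=6, eps=0.1)[1]))  # prints 8
```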

Both of our main algorithmic results appeal to the following technique that, given a guess $v$ of the value $\operatorname{OPT}$ (we run the final algorithms with guesses $v=1,2,4,8,\ldots,2^{\log n}$ and use the fact that one of these guesses is within a factor $2$ of $\operatorname{OPT}$), transforms the MaxCov instance into one in which the optimum solution covers $O(\varepsilon^{-2}k\log m)$ elements. Note that this immediately implies that every set in the input has cardinality $O(\varepsilon^{-2}k\log m)$. In what follows, we will tacitly assume that the given instance is filtered through this technique before being fed into the algorithms we design. In particular, we will appeal to the following theorem [MV19a, JWC23].

Theorem 3.3 (Universe Subsampling).

Let the function $h:[n]\rightarrow\{0,1\}$ be drawn uniformly at random from a family of $O(k\log m)$-wise independent hash functions, where

$$p:=\Pr[h(e)=1]=\lambda/v,\quad\text{for }\lambda=10\varepsilon^{-2}k\,,$$

and $v$ satisfies $\operatorname{OPT}/2\leq v\leq\operatorname{OPT}$. An $h$-sparsification of an instance $\mathcal{I}$ of MaxCov is formed by replacing every set $S$ in $\mathcal{I}$ by $\{e\in S:h(e)=1\}$. With high probability, any $\alpha$-approximation for the $h$-sparsification of $\mathcal{I}$ yields an $(\alpha-\varepsilon)$-approximation for $\mathcal{I}$.

In Section 6, we will discuss how to implement the universe subsampling such that the update time of the resulting algorithm is logarithmic for each element of a set in the stream.
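A sketch of the sparsification step, with Python's random bits standing in for the $O(k\log m)$-wise independent family the theorem requires (the fast implementation of that family is the subject of Section 6.1):

```python
import random

def sparsify_instance(sets, n, k, eps, v, seed=0):
    """Sketch of Theorem 3.3's h-sparsification.  Each element e of [n]
    survives iff h(e) = 1, where Pr[h(e) = 1] = min(1, 10 * eps^-2 * k / v)
    and v is the current guess of OPT (run with guesses v = 1, 2, 4, ...)."""
    rng = random.Random(seed)
    p = min(1.0, 10 * k / (eps ** 2 * v))
    h = [rng.random() < p for _ in range(n + 1)]   # h[e] for e in [n]
    return [frozenset(e for e in S if h[e]) for S in sets]

# Usage: solve MaxCov on the sparsified sets; w.h.p. an alpha-approximation
# there is an (alpha - eps)-approximation for the original instance.
```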

4 Dynamic Streams

Our main algorithm for dynamic streams is based on the following variant (Algorithm 1) of the quantized greedy algorithm. It requires an estimate $v$ satisfying $\operatorname{OPT}/2\leq v<\operatorname{OPT}$. To describe it compactly, we set up some notation. Let $\mathcal{S}$ denote the collection of sets yielded by the input stream. For a given $C\subset[n]$, define the subcollections

$$\mathcal{F}_{i}^{C}=\{S\in\mathcal{S}:\theta_{i}\leq|S\setminus C|<\theta_{i-1}\}\,,\quad\text{for }i\in[\ell]\,,$$
$$\mathcal{G}_{i}^{C}=\{S\in\mathcal{S}:\tau_{i}\leq|S\setminus C|<\tau_{i-1}\}\,,\quad\text{for }i\in[1+\lceil\log_{1+\varepsilon}(16e)\rceil]\,,$$

where

$$\ell=\left\lfloor\log k\right\rfloor\,,\quad\theta_{i}:=\frac{2v}{2^{i}}\,,\quad\text{and}\quad\tau_{i}:=\frac{2v/2^{\ell}}{(1+\varepsilon)^{i-1}}\,.$$

In the pseudocode below, the computed solution $Y$ is a set of IDs and the set $C$ is maintained to be the subset of the universe covered by the sets represented in $Y$. The macro grow-solution$(Y,S,C,\theta)$ (being a macro, it can cause the invoking algorithm to return) implements the following logic: if $|S\setminus C|\geq\theta$, then update $Y\leftarrow Y\cup\{\textsc{id}(S)\}$ (i.e., add the set $S$ to the solution) and $C\leftarrow C\cup S$; if this makes $|Y|=k$, then stop and return $Y$.

Algorithm 1 Quantized greedy adapted for dynamic set streams

1: start with an empty solution ($Y\leftarrow\varnothing$) and let $C\leftarrow\varnothing$
2: while $\mathcal{F}_{1}^{C}\cup\cdots\cup\mathcal{F}_{\ell}^{C}\neq\varnothing$ do
3:    for $i\in[\ell]$: sample $2^{i}$ sets from $\mathcal{F}_{i}^{C}$ with replacement
4:    for $i\in[\ell]$ and each set $S$ sampled from $\mathcal{F}_{i}^{C}$: grow-solution$(Y,S,C,\theta_{i})$
5: for $i\leftarrow 1$ to $1+\lceil\log_{1+\varepsilon}(16e)\rceil$ do
6:    while $\mathcal{G}_{i}^{C}\neq\varnothing$ do
7:       sample $k$ sets from $\mathcal{G}_{i}^{C}$ with replacement
8:       for each sampled set $S$: grow-solution$(Y,S,C,\tau_{i})$

We refer to the loop at line 2 as the first while loop, and to the loop at line 6 as the second while loop.
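The following is a minimal offline simulation of Algorithm 1, assuming explicit access to the collection of sets; in the streaming implementation, each sampling step is instead realized with $\ell_{0}$-sampling passes over set IDs, as in the proof of Theorem 4.2 below. The helper names are ours.

```python
import math
import random

def dynamic_quantized_greedy(sets_by_id, k, v, eps, rng=random.Random(0)):
    """Offline simulation of Algorithm 1.  sets_by_id maps set IDs to sets
    (as frozensets); v is an estimate with OPT/2 <= v < OPT."""
    ell = max(1, int(math.floor(math.log2(k))))
    theta = lambda i: 2 * v / 2 ** i
    tau = lambda i: (2 * v / 2 ** ell) / (1 + eps) ** (i - 1)
    Y, C = [], set()

    def grow(S, thresh):                    # the grow-solution macro
        if len(S - C) >= thresh:
            Y.append(S)
            C.update(S)
        return len(Y) == k

    def members(lo, hi):                    # the collections F_i^C / G_i^C
        return [S for S in sets_by_id.values() if lo <= len(S - C) < hi]

    while True:                             # first while loop
        buckets = {i: members(theta(i), theta(i - 1)) for i in range(1, ell + 1)}
        if not any(buckets.values()):
            break
        for i, pool in buckets.items():
            for S in (rng.choices(pool, k=2 ** i) if pool else []):
                if grow(S, theta(i)):
                    return Y, C
    for i in range(1, 2 + math.ceil(math.log(16 * math.e, 1 + eps))):
        while True:                         # second while loop
            pool = members(tau(i), tau(i - 1))
            if not pool:
                break
            for S in rng.choices(pool, k=k):
                if grow(S, tau(i)):
                    return Y, C
    return Y, C
```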

The most involved part of the analysis of Algorithm 1 is the proof of the following lemma, which we shall defer slightly.

Lemma 4.1.

With high probability, the first while loop (line 2 of Algorithm 1) makes $O(\log m)$ iterations, and each invocation of the second while loop (line 6) makes $O(\log m/\log\log m)$ iterations.

Theorem 4.2.

There is a $(1-1/e-\varepsilon)$-approximation algorithm for maximum coverage in the dynamic stream model that uses $\tilde{O}(k/\varepsilon^{2})$ space and $O((1+\varepsilon^{-1}/\log\log m)\log m)$ passes.

Proof.

The idea is to implement Algorithm 1 in a small number of space-efficient streaming passes. To sample from $\mathcal{F}_{i}^{C}$ and check the non-emptiness condition of the first while loop, we use an $\ell_{0}$-sampling pass on the IDs of the sets where $|S\setminus C|$ is in the relevant range. Using an additional pass, we store $\{S\setminus C\}$ for all the sampled sets, so that we can implement the logic of grow-solution. These sets contain at most

$$\sum_{i=1}^{\ell}2^{i}\cdot 2v/2^{i-1}=4v\ell=O(\varepsilon^{-2}k\log k)$$

elements. Similarly, we can sample sets from each $\mathcal{G}_{i}^{C}$ while only storing $O(\varepsilon^{-2}k\log k)$ elements.

Since each iteration of each while loop corresponds to two streaming passes, Lemma 4.1 shows that w.h.p. the algorithm uses at most the claimed number of passes. The approximation guarantee follows from Theorem 3.2, since

$$\tau_{1}=2v/2^{\ell}>\operatorname{OPT}/k$$

and

$$\tau_{1+\left\lceil\log_{1+\varepsilon}(16e)\right\rceil}=\frac{2v/2^{\ell}}{(1+\varepsilon)^{\left\lceil\log_{1+\varepsilon}(16e)\right\rceil}}<\frac{4\operatorname{OPT}}{16ek}=\frac{\operatorname{OPT}}{4ek}\,.\qquad\qed$$

It remains to prove Lemma 4.1. We need to understand how the collections $\mathcal{F}_{i}^{C}$ and $\mathcal{G}_{i}^{C}$ evolve as we grow our solution $Y$, thus changing $C$. To this end, we introduce two stochastic urn processes: the first (and simpler) process models the evolution of the collection $\mathcal{G}_{i}^{C}$, for a particular $i$; the second process models the evolution of the ensemble $\{\mathcal{F}_{i}^{C}\}$. How exactly the urn processes model these evolutions is explained in Section 4.3.

4.1 The Single Urn Process and its Analysis

An urn contains $m$ balls, each initially gold. Balls are drawn from the urn in phases, where a phase consists of drawing $d$ balls uniformly at random, with replacement. When a ball is drawn, if it is gold, then we earn one point and some arbitrary subset of the balls in the urn turn into lead, including the ball that was just drawn (and put back). At the end of the phase, all lead balls are removed from the urn and the next phase begins. The process ends when either $d$ points have been earned or the urn is empty.
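A small simulation sketch may help fix intuition; the adversary argument below is a stand-in for the arbitrary gold-to-lead conversions (it decides how many further gold balls turn to lead after each gold draw).

```python
import random

def single_urn(m, d, adversary, rng=random.Random(0)):
    """Simulation sketch of the single urn process.  adversary(g) says how
    many of the g remaining gold balls also turn to lead after a gold draw
    (the drawn ball always turns).  Returns the number of phases used."""
    gold, lead, points, phases = m, 0, 0, 0
    while gold + lead > 0 and points < d:
        phases += 1
        for _ in range(d):
            if rng.randrange(gold + lead) < gold:       # drew a gold ball
                points += 1
                turned = 1 + min(adversary(gold - 1), gold - 1)
                gold -= turned
                lead += turned
                if points == d:
                    break
        lead = 0                                        # remove all lead balls
    return phases

# Example: an adversary that turns half the remaining gold balls to lead.
print(single_urn(m=10**5, d=64, adversary=lambda g: g // 2))
```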

We will prove the following result about this process.

Theorem 4.3.

With probability $\geq 9/10$, the urn process ends within $O(\log m/\log\log m)$ phases.

The intuition behind the result is as follows. If the fraction of gold balls in the urn remains above $d^{\prime}/d$, then each draw is gold with probability at least $d^{\prime}/d$, so it would be reasonable to expect at least $(d^{\prime}/d)\times d=d^{\prime}$ of the draws to be gold. Define $m(i)$ to be the number of gold balls left in the urn after $i-1$ rounds and let $d(i)$ be the number of gold balls observed in the $i$th round. For the sake of intuition (the formal proof follows shortly), suppose this observation holds for all rounds even though $d(i)$ and $m(i+1)$ are dependent, i.e., $m(i+1)/m(i)\leq d(i)/d$ for all $i$; then

$$\frac{m(i+1)}{m}=\prod_{j=1}^{i+1}\frac{m(j)}{m(j-1)}\leq\prod_{j=1}^{i}\frac{d(j)}{d}\leq\left(\frac{\sum_{j=1}^{i}d(j)/d}{i}\right)^{i}\leq 1/i^{i}\,,$$

by the AM-GM inequality and the fact that every drawn gold ball is turned into lead, so it contributes to at most one term $d(j)$; hence $\sum_{j}d(j)/d\leq 1$. Consequently, $i$ can be at most $O(\log m/\log\log m)$ before $m(i+1)$ becomes zero.

Our result is stronger than analogous results in previous work in two specific ways. Previous analyses essentially (a) need a factor $O(\log m)$ more draws in each phase, and (b) only establish that $O(\log m)$ phases suffice, rather than $O(\log m/\log\log m)$ phases. The improvements in our approach stem from avoiding the need to guarantee that each phase makes progress with high probability, and from considering both the progress made towards observing $d$ gold balls and the progress made in terms of turning balls into lead.

Proof of Theorem 4.3.

Let $m(j)$ be the number of balls in the urn after the $j$th phase ends and all lead balls are removed. Also, let $m(0)=m$. Put

$$\gamma=\log\log m/\log m\qquad(1)$$

and assume $d\geq 12/\gamma$; otherwise the result follows trivially, since each phase earns at least one point. The $j$th phase is deemed successful if, during it, either $\gamma d/2$ points are earned or the fraction of gold balls drops below $\gamma$ (causing $m(j)/m(j-1)<\gamma$). In Lemmas 4.4 and 4.5, we will establish that each phase is successful with probability at least $1/2$, even when conditioned on the outcomes of previous phases. Thus, by a Chernoff bound, after $10/\gamma$ phases, the probability that there were at least $4/\gamma$ successful phases is at least $1-\exp(-O(1/\gamma))$. If $4/\gamma$ phases are successful, either we have earned $(2/\gamma)(\gamma d/2)=d$ points or the number of balls in the urn has dropped from $m(0)=m$ to at most $\gamma^{2/\gamma}m$. By eq. (1), this number is less than $1$, implying that the urn is empty. ∎

We turn to lower bounding the success probability of a phase, as claimed in the above proof. Fix a particular phase and a particular realization of all previous phases of the stochastic process. Let $G_{i}$ be the event that the $i$th draw of this phase reveals a gold ball (and thus earns a point); let $L_{i}$ be the event that after $i-1$ draws, the fraction of gold balls in the urn is less than $\gamma$. Define indicator variables $X_{i}=\mathbbm{1}_{G_{i}\cup L_{i}}$ and put $X=\sum_{i=1}^{d}X_{i}$. Then,

$$\mathbb{E}[X_{i}]=\Pr[L_{i}]+\Pr[G_{i}\cap\overline{L}_{i}]\geq\Pr[G_{i}\mid\overline{L}_{i}]\geq\gamma\,.\qquad(2)$$
Lemma 4.4.

If $X\geq\gamma d/2$, then the phase is successful.

Proof.

If $X\geq\gamma d/2$, either some event $L_{i}$ occurs or we collect $\gamma d/2$ gold balls (and points). In the latter case, the phase is successful by definition. In the former case, the fraction of gold balls drops below $\gamma$ during the phase. Since lead never turns into gold, this fraction remains below $\gamma$. Upon removing lead balls at the end of the phase, the number of balls in the urn drops to at most $\gamma$ times the number at the start of the phase, i.e., the phase is successful. ∎

Finally, we show that $X$ stochastically dominates a binomial random variable, which then lower-bounds $\Pr[X\geq\gamma d/2]$. Denote Bernoulli and binomial distributions as "Bern" and "Bin."

Lemma 4.5.

$\Pr[X\geq\gamma d/2]\geq 1-\exp(-\gamma d/12)\geq 1/2$, where the latter inequality uses our assumption that $d\geq 12/\gamma$.

Proof.

Let $Y_{1},\ldots,Y_{d}$ be independent draws from $\operatorname{Bern}(\gamma)$. Put $X^{\leq j}=\sum_{i=1}^{j}X_{i}$ and $Y^{\leq j}=\sum_{i=1}^{j}Y_{i}$. We first show by induction on $j$ that $X^{\leq j}$ stochastically dominates $Y^{\leq j}$ for all $j$. Inequality (2) with $i=1$ establishes the base case of $j=1$. Assuming the result for $j$, for an arbitrary integer $z$, we have

$$\begin{aligned}
\Pr[X^{\leq j+1}\geq z]&=\Pr[X^{\leq j}\geq z]+\Pr[X^{\leq j}=z-1]\Pr[X_{j+1}=1\mid X^{\leq j}=z-1]\\
&\geq\Pr[X^{\leq j}\geq z]+\Pr[X^{\leq j}=z-1]\,\gamma\\
&=\Pr[X^{\leq j}\geq z](1-\gamma)+\Pr[X^{\leq j}\geq z-1]\,\gamma\\
&\geq\Pr[Y^{\leq j}\geq z](1-\gamma)+\Pr[Y^{\leq j}\geq z-1]\,\gamma\\
&=\Pr[Y^{\leq j+1}\geq z]\,.
\end{aligned}$$

Thus, $\Pr[X\geq\gamma d/2]\geq\Pr[Y\geq\gamma d/2]$. The lemma now follows by applying a Chernoff bound to $Y\sim\operatorname{Bin}(d,\gamma)$. ∎

4.2 Cascading Urns

To prove the first part of Lemma 4.1, we will also need to analyze a more complex version of the above process; we call it the cascading urn process. Now we have $t$ urns, numbered $1$ through $t$, with the $r$th urn starting out with $m_{r}$ balls; let $m=\sum_{r\in[t]}m_{r}$. Drawing a gold ball from urn $r$ is worth $d/2^{r}$ points. Initially, all balls are gold. The process works as follows.

  • Balls are drawn in phases: a phase consists of drawing $2^{r}$ balls from urn $r$, in order of increasing $r$.

  • When a ball is drawn from urn $r$, if it is gold, the following things happen: (i) it turns into lead and is returned to urn $r$; (ii) some arbitrary subset of the balls in the urns (all urns, not just urn $r$) turn into lead; and (iii) we earn $d/2^{r}$ points.

  • At the end of each phase, each lead ball in an urn is removed from its current urn and either destroyed or transformed back into gold and placed in a higher-numbered urn. We then start the next phase with all balls being gold again.

The process ends when either $d$ points have been earned or all urns have become empty.

Theorem 4.6.

With probability at least $9/10$, the cascading urn process ends within $O(t+\log m)$ phases.

Proof.

Fix a particular phase. For each $r\in[t]$, the analysis of one phase of the single urn process applied to urn $r$ establishes that, with probability at least

$$\Pr\left[\operatorname{Bin}\left(2^{r},\tfrac{1}{2}\right)\geq\tfrac{2^{r}}{4}\right]\geq 1-\exp\left(-\tfrac{2^{r-1}}{12}\right)\geq 1-e^{-1/12}>\tfrac{1}{13}=:4\lambda\,,$$

either at least $d/2$ points are earned from this urn (call this event $G_{r}$) or the fraction of gold balls in this urn drops below $1/2$ (call this event $L_{r}$). Urn $r$ is deemed successful in this phase if $G_{r}\cup L_{r}$ occurs. Put $Z_{r}=\mathbbm{1}_{G_{r}\cup L_{r}}$ and note $\mathbb{E}[Z_{r}]>4\lambda$.

Let $\hat{m}_{r}$ be the number of balls in urn $r$ at the start of the phase we are considering (recall that these are all gold balls) and let $\hat{m}=\sum_{r\in[t]}\hat{m}_{r}$. If $\bigcup_{r}G_{r}$ does not occur, then at the end of the phase, the number of gold balls across all urns is at most

$$Z:=\sum_{r\in[t]}\left(Z_{r}\hat{m}_{r}/2+(1-Z_{r})\hat{m}_{r}\right)=\sum_{r\in[t]}(1-Z_{r}/2)\hat{m}_{r}\,.$$

Using the lower bound on $\mathbb{E}[Z_{r}]$, we get

$$\mathbb{E}[Z]\leq\sum_{r\in[t]}(1-2\lambda)\hat{m}_{r}=(1-2\lambda)\hat{m}\,.$$

Applying a Markov bound gives

$$\Pr[Z\leq(1-\lambda)\hat{m}]\geq 1-\mathbb{E}[Z]/((1-\lambda)\hat{m})\geq\lambda/(1-\lambda)\,.$$

Hence, with probability at least $\lambda/(1-\lambda)$, either $\bigcup_{r}G_{r}$ occurs (in which case we collect $d/2$ points) or the fraction of balls that remain gold until the end of the phase is at most $1-\lambda$.

Let $m_{r,p}$ denote the number of balls in urn $r$ after the $p$th phase ends (and all lead balls are either reconverted to gold or destroyed). Also, let $m_{r,0}=m_{r}$. Consider the quantity

$$Q_{p}:=\sum_{r\in[t]}2^{t-r}m_{r,p}\,.$$

Note that $Q_{0}\leq 2^{t-1}m$ and if $Q_{p}<1$ for some $p$, then the urns must be empty after phase $p$. We claim that if the fraction of balls that remain gold in phase $p$ is at most $1-\lambda$, then $Q_{p}\leq(1-\lambda/2)Q_{p-1}$. This claim holds because (a) after moving a ball to a higher-numbered urn or destroying it, the contribution of that ball to $Q_{p}$ goes down by a factor of $2$ or more; and (b) at least a $\lambda$ fraction of the balls turn to lead and are therefore either moved or destroyed. Thus, at least a $\lambda$ fraction of the quantity $Q_{p-1}$ is at least halved by the end of the phase, proving the claim.

Hence, with probability at least $\lambda/(1-\lambda)$, in each phase, we either collect $d/2$ points or $Q_{p}$ reduces by a factor $\mu:=1-\lambda/2$. By a Chernoff bound, after $p=O(\log Q_{0})=O(t+\log m)$ phases, either we have collected more than $2\cdot d/2=d$ points or $Q_{p}<1$, implying that all the urns are empty. ∎

4.3 Applications to Our Algorithm

We now circle back to Algorithm 1. To finish its analysis, we need to prove Lemma 4.1, which we can now do, as an application of what we have established about these urn processes.

To analyze the first while loop, apply Theorem 4.6 with $t=\left\lfloor\log k\right\rfloor$ as follows. The balls in the $i$th urn correspond to the sets in $\mathcal{F}_{i}^{C}$. A ball is gold if the corresponding set $S$ satisfies $|S\setminus C|\geq\theta_{i}$; otherwise, it is lead. Adding a set $S$ to the candidate solution $Y$ grows the set $C$ of covered elements, thereby turning gold balls to lead in some complicated way that the algorithm cannot easily track. The bound on the number of phases until the urn process terminates translates to a bound of $O(\log k+\log m)$ on the number of iterations of that while loop. Noting that $k\leq m$ gives Lemma 4.1.

Similarly, Theorem 4.3 implies that the number of iterations of each invocation of the second while loop is $O(\log m/\log\log m)$. This completes the analysis.

5 Random Order Model

In this section, we consider the stream to be a random permutation of the collection of sets. We will show that it is possible to approximate the maximum coverage problem up to a factor $1-1/e-\varepsilon$ using $O_{\varepsilon}(k\log m)$ space in a single pass. As mentioned in the introduction, our approach is based on modifying an algorithm by Agrawal et al. [ASS19] for cardinality-constrained monotone submodular maximization. This is a more general problem than maximum coverage, but their algorithm assumes oracle access to the function being optimized. For maximum coverage, we need to store enough information to be able to simulate this oracle. An $O_{\varepsilon}(k^{2}\log m)$-space algorithm for maximum coverage follows immediately via the universe subsampling technique discussed in Section 3, since we may assume that every set has size $O_{\varepsilon}(k\log m)$. To reduce the dependence on $k$ to linear rather than quadratic, we need a more careful adaptation of the algorithm of Agrawal et al. [ASS19], ensuring that the algorithm can be implemented by maintaining a set of size $O_{\varepsilon}(k\log m)$, corresponding to the items covered by the collection of sets chosen for the solution so far, along with $O_{\varepsilon}(1)$ sets of size $O_{\varepsilon}(\log m)$.

5.1 Warm-Up

To motivate the approach, consider a variant of the problem in which an algorithm is presented with a random permutation of the input sets, but we make the following two changes:

  1. For a fixed optimal solution consisting of sets $O_{1},\ldots,O_{k}$, replace each occurrence of these sets in the stream by a set drawn uniformly at random from $\{O_{1},\ldots,O_{k}\}$.

  2. We give the algorithm foreknowledge of a way to segment the stream into $k$ contiguous sub-streams $\mathcal{C}_{1},\ldots,\mathcal{C}_{k}$ such that each contains exactly one element of $\{O_{1},\ldots,O_{k}\}$.

Then consider the greedy algorithm that outputs the sequence $\langle S_{1},\ldots,S_{k}\rangle$ where

$$S_{j}=\operatorname{argmax}_{S\in\mathcal{C}_{j}}|S\setminus(S_{1}\cup\cdots\cup S_{j-1})|\,.$$

To analyze this algorithm, define

$$u_{j}=\operatorname{OPT}-|S_{1}\cup\cdots\cup S_{j}|\quad\text{and}\quad F_{j}=\frac{|S_{j}\setminus(S_{1}\cup\cdots\cup S_{j-1})|}{u_{j-1}}\,,$$

and note that

$$|S_{1}\cup\cdots\cup S_{k}|=\operatorname{OPT}-u_{k}=\operatorname{OPT}-\operatorname{OPT}\cdot\prod_{j=1}^{k}(1-F_{j})\,,$$

since for all $j\in[k]$,

$$u_{j}=\operatorname{OPT}-|S_{1}\cup\cdots\cup S_{j-1}|-|S_{j}\setminus(S_{1}\cup\cdots\cup S_{j-1})|=(1-F_{j})u_{j-1}\,.$$

Note also that

$$\mathbb{E}\left[F_{j}\mid S_{1},\ldots,S_{j-1}\right]\geq\sum_{i=1}^{k}\frac{|O_{i}\setminus(S_{1}\cup\cdots\cup S_{j-1})|}{u_{j-1}}\cdot\Pr[O_{i}\in\mathcal{C}_{j}]=\frac{1}{k}\sum_{i=1}^{k}\frac{|O_{i}\setminus(S_{1}\cup\cdots\cup S_{j-1})|}{u_{j-1}}\geq 1/k\,.$$

Hence,

$$\mathbb{E}[|S_{1}\cup\cdots\cup S_{k}|]=\operatorname{OPT}\left(1-\mathbb{E}\Big[\prod_{j=1}^{k}(1-F_{j})\Big]\right)\geq\operatorname{OPT}\left(1-(1-1/k)^{k}\right)\geq(1-1/e)\operatorname{OPT}\,.$$

The space required to implement this algorithm is $O(k\varepsilon^{-2}\log m)$ since it suffices to maintain the union of the sets chosen thus far, rather than the sets themselves.
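A sketch of this warm-up algorithm, assuming the promised segmentation is handed to us as explicit lists of sets:

```python
def segmented_greedy(segments):
    """Sketch of the warm-up algorithm of Section 5.1: given the promised
    segmentation C_1, ..., C_k of the stream (each segment a list of sets),
    pick from each segment the set with the largest marginal coverage.
    Only the running union C is stored, never the chosen sets themselves,
    which is what brings the space down to O(k eps^-2 log m) after
    universe subsampling."""
    C = set()
    for segment in segments:
        best = max(segment, key=lambda S: len(S - C))
        C |= best
    return C

# Toy usage with k = 2 segments.
segments = [[{1, 2, 3}, {3, 4}], [{2, 3}, {5, 6, 7}]]
print(len(segmented_greedy(segments)))  # 6: picks {1,2,3} then {5,6,7}
```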

To handle the original random order setting, we obviously need to deal with the fact that the algorithm does not have foreknowledge of a segmentation of the stream such that each segment contains exactly one set from the optimum solution. A naive approach would be to segment the stream into $\beta k$ contiguous segments, for some large constant $\beta>0$, with the hope that few pairs of optimum sets appear in the same segment. We could then consider all $\binom{\beta k}{k}=e^{O(k)}$ subsequences of these segments, but this is clearly inefficient. A better approach is to use a limited number of guesses in parallel, following Agrawal et al. [ASS19]. More complicated to analyze, especially when combined with the goal of using limited space, is the fact that, because the $O_{i}$-sets appear randomly permuted rather than independently sampled, there are now various dependencies to contend with.

5.2 General Setting

We will randomly partition the input collection $\mathcal{C}$ of sets into $k\beta$ groups $\mathcal{C}_{1},\ldots,\mathcal{C}_{k\beta}$. These groups will ultimately correspond to segments of the stream, i.e., the first $|\mathcal{C}_{1}|\sim\operatorname{Bin}(m,1/(k\beta))$ sets define $\mathcal{C}_{1}$, etc. For any

$$\mathcal{I}=\{i_{1},i_{2},\ldots\}\subseteq\{1,\ldots,k\beta\}\,,$$

we define a sequence of groups

$$\Sigma_{\mathcal{I}}=\langle\mathcal{C}_{i_{1}},\mathcal{C}_{i_{2}},\ldots,\mathcal{C}_{i_{|\mathcal{I}|}}\rangle\,,\quad\text{where }i_{1}<i_{2}<\cdots<i_{|\mathcal{I}|}\,.$$

At the heart of the algorithm is a greedy process run on such sequences. For $\mathcal{I}=\{i_{1},i_{2},\ldots\}\subseteq\{1,\ldots,k\beta\}$, $C\subset[n]$, and a "reserve" collection of sets $\mathcal{R}$ (the name will become clear later), define the greedy sequence

$$\sigma_{\mathcal{I}}(C,\mathcal{R})=\langle S_{1},S_{2},\ldots,S_{|\mathcal{I}|}\rangle\,,\qquad(3)$$

where

$$S_{j}=\operatorname{argmax}_{S\in\mathcal{R}\cup\mathcal{C}_{i_{j}}}|S\setminus(C\cup S_{1}\cup\cdots\cup S_{j-1})|\,.$$
Algorithm 2 Max Coverage for Random Order Streams

1: initialize $\mathcal{R}\leftarrow\varnothing$, $C\leftarrow\varnothing$, $k^{\prime}\leftarrow 0$
2: for $i\leftarrow 1$ to $k/\alpha$ do
3:    Compute $\sigma_{\mathcal{I}}(C,\mathcal{R})$ for all $\mathcal{I}\in P_{i}$, where $P_{i}$ consists of all size-$k^{+}$ subsets, $k^{+}\equiv\min\{\alpha,k-k^{\prime}\}$, of
$$W_{i}=\{\alpha\beta(i-1)+1,\ldots,\alpha\beta i\}\,.$$
We subsequently refer to $W_{i}$ as the $i$th window.
4:    Let
$$\mathcal{I}^{*}=\operatorname{argmax}_{\mathcal{I}\in P_{i}}\left|\,C\cup\Big(\bigcup_{S\in\sigma_{\mathcal{I}}(C,\mathcal{R})}S\Big)\,\right|$$
and update
$$C\leftarrow C\cup\Big(\bigcup_{S\in\sigma_{\mathcal{I}^{*}}(C,\mathcal{R})}S\Big)\quad\text{and}\quad k^{\prime}\leftarrow k^{\prime}+k^{+}\,.$$
5:    For all $\mathcal{I}\in P_{i}$, add all sets in $\sigma_{\mathcal{I}}(C,\mathcal{R})$ to $\mathcal{R}$. If there exists a set $S$ in $\mathcal{R}$ such that $|S\setminus C|\geq\operatorname{OPT}/k$, select such a set uniformly at random and update
$$C\leftarrow C\cup S\quad\text{and}\quad k^{\prime}\leftarrow k^{\prime}+1\,.$$
Repeat until either $k^{\prime}=k$ or no more such sets exist in $\mathcal{R}$.
6: end for

Algorithm 2 contains most details, but to fully specify the algorithm, we need to define the original groups $\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{k\beta}$. To do this, for each set $S$ define a random variable $Y_{S}$ that is uniform over $\{1,2,\ldots,k\beta\}$. Then, we define $\mathcal{C}_{i}=\{S:Y_{S}=i\}$. Equivalently, for a random permutation of the stream, $\mathcal{C}_{1}$ is the first $|\{S:Y_{S}=1\}|$ sets in the stream, $\mathcal{C}_{2}$ is the next $|\{S:Y_{S}=2\}|$ sets in the stream, etc. Given this definition of the stream, Algorithm 2 maintains sets whose total size is

$$|\operatorname{OPT}|+\alpha\binom{\alpha\beta}{\alpha}|\operatorname{OPT}|+(k/\alpha)\binom{\alpha\beta}{\alpha}\operatorname{OPT}/k=O_{\alpha,\beta}(\operatorname{OPT})\,,\qquad(4)$$

where the first term corresponds to maintaining $C$, the next term counts all sets appearing in greedy subsequences during the processing of each window (defined in line 3 of Algorithm 2), and the last term corresponds to storing sets in $\mathcal{R}$. For sets $S$ in $\mathcal{R}$, it suffices to store $S\setminus C$ rather than $S$ itself. Hence, the total number of elements in sets in $\mathcal{R}$ is at most $|\mathcal{R}|\operatorname{OPT}/k$.

The motivation for how we set $\alpha$ and $\beta$ is as follows. Let $\mathcal{O}=\{O_{1},\ldots,O_{k}\}$ be an optimum collection of $k$ sets. We say a group $\mathcal{C}_{i}$ is active if it contains a set from $\mathcal{O}$; note that if $\beta$ is large enough, the number of active groups $\kappa$ is close to $k$. Furthermore, in each window the expected number of active groups is $\kappa\cdot(\alpha\beta)/(k\beta)=\alpha\kappa/k\approx\alpha$. We will later set $\alpha$ to be large enough that the number of active groups in each window is typically close to $\alpha$, i.e., there is an $\alpha$-length subsequence of the window that mainly consists of active groups.
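For concreteness, here is a sketch of the work done in one window (lines 3 and 4 of Algorithm 2), enumerating the size-$k^{+}$ index subsets of the window and running the greedy sequence for each; the function name and calling convention are ours, and the caller is responsible for adding the returned greedy sets to $\mathcal{R}$ and applying the reserve step of line 5.

```python
from itertools import combinations

def process_window(window_groups, C, R, k_plus):
    """Sketch of one window of Algorithm 2.  window_groups is the list of
    alpha*beta groups C_j falling in this window, each a list of sets; C is
    the currently covered set and R the reserve collection.  Returns the
    covered set for the best greedy subsequence, plus all greedy sets seen."""
    best_C, seen = set(C), []
    for I in combinations(range(len(window_groups)), k_plus):
        cov = set(C)
        for j in I:                       # greedy choice over R and group C_j
            S = max(list(R) + window_groups[j],
                    key=lambda S: len(S - cov), default=frozenset())
            cov |= S
            seen.append(S)
        if len(cov) > len(best_C):        # keep the best subsequence's coverage
            best_C = cov
    return best_C, seen
```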

5.2.1 Analysis

The analysis in Agrawal et al. [ASS19] establishes that, during the $i$th window, the algorithm finds a collection of $\alpha$ sets $S_{1},\ldots,S_{\alpha}$ (via one of the greedy subsequences considered) that, as explained below, covers a significant number of new elements when added to the solution. For any subset $A\subset[n]$, define

$$u(A)=(1-\alpha/k)\operatorname{OPT}-|A|\,.$$

Lemma 15 of Agrawal et al. [ASS19] establishes that if $C_{i-1}$ is the set of elements covered by the sets chosen during the processing of the first $i-1$ windows, and $T_{i-1}$ is the entire set of greedy subsequences constructed during the first $i-1$ windows, then (the statement of [ASS19, Lemma 15] has the RHS of Eq. (5) as $(1-\delta)e^{-\alpha/k}u(C_{i-1})$, rather than the expression in Eq. (5), where the $(1-\delta)$ appears in the exponent; their proof actually only implies the weaker statement in Eq. (5), which fortunately is still sufficient for their purposes and ours)

$$\mathbb{E}[u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})\mid T_{i-1}]\leq e^{-\alpha(1-\delta)/k}u(C_{i-1})\,,\qquad(5)$$

where $\delta=\sqrt{8/\beta}$, assuming $k\geq\alpha\beta$ and $\alpha\geq 4\beta^{2}\log(\beta/8)$. The crucial observation is that their proof of Eq. (5) holds true regardless of how $C_{i-1}$ is constructed from the greedy subsequences in $T_{i-1}$. In particular, Eq. (5) holds true given our modification of the Agrawal et al. algorithm, i.e., the possibility of adding additional sets from $\mathcal{R}$ at the end of previous windows. At the end of the $i$th window, we potentially add additional sets $S_{\alpha+1},\ldots,S_{\alpha^{\prime}_{i}}$, where

$$\begin{aligned}
u(S_{1}\cup\cdots\cup S_{\alpha^{\prime}_{i}}\cup C_{i-1})&\leq u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})-(\alpha^{\prime}_{i}-\alpha)\operatorname{OPT}/k\\
&\leq(1-(\alpha^{\prime}_{i}-\alpha)/k)\cdot u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})\\
&\leq e^{-(\alpha^{\prime}_{i}-\alpha)/k}\cdot u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})\,.
\end{aligned}$$

Hence, in the $i$th window (except potentially in the window where we add the $k$th set to our solution), we add $\alpha^{\prime}_{i}\geq\alpha$ sets $S_{1},\ldots,S_{\alpha^{\prime}_{i}}$ such that, setting $C_{i}=S_{1}\cup\cdots\cup S_{\alpha^{\prime}_{i}}\cup C_{i-1}$,

$$\mathbb{E}[u(C_{i})\mid T_{i-1}]\leq e^{-\alpha^{\prime}_{i}(1-\delta)/k}\cdot u(C_{i-1})\,.\qquad(6)$$

Since there are $k/\alpha$ windows, and in each window we find a sequence of $\alpha^{\prime}_{i}\geq\alpha$ sets (until $k$ sets in total have been chosen) such that Eq. (6) holds, we have, for some $j\leq k/\alpha$ and with $\delta<\varepsilon/2<1$,

$$\mathbb{E}[(1-\alpha/k)\operatorname{OPT}-|C_{j}|]\leq(1-\alpha/k)\cdot\operatorname{OPT}\cdot\prod_{i\geq 1}e^{-\alpha^{\prime}_{i}(1-\delta)/k}=(1-\alpha/k)e^{-1+\delta}\operatorname{OPT}<(1-\alpha/k)(1/e+\varepsilon/2)\operatorname{OPT}\,,$$

where we used the fact that $\sum_{i}\alpha^{\prime}_{i}=k$ and $\exp(-1+x)\leq 1/e+x$ for $0<x<1$. Hence, $\mathbb{E}[|C_{j}|]\geq(1-1/e-\varepsilon)\operatorname{OPT}$, assuming $\alpha/k\leq\varepsilon/2$. Setting the constants to $\alpha=4096\varepsilon^{-4}\log(2/\varepsilon)$ and $\beta=32\varepsilon^{-2}$ ensures all the necessary conditions are met, assuming $k=\omega(1)$. With these values of $\alpha$ and $\beta$, we have $\binom{\alpha\beta}{\alpha}\leq\exp(O(\varepsilon^{-4}\ln^{2}\varepsilon^{-1}))$. Hence, the space dependence on $\varepsilon$ in Eq. (4) becomes $O(k\log m\cdot\exp(O(\varepsilon^{-4}\ln^{2}\varepsilon^{-1})))$, since we may assume $\operatorname{OPT}=O(k\varepsilon^{-2}\log m)$ and the exponential dependencies on $\varepsilon$ dominate the polynomial dependencies.

6 Fast Algorithms

6.1 Fast Universe Sub-Sampling

The universe sub-sampling approach described in Section 3 requires the use of a $\gamma$-wise independent hash function where $\gamma=O(k\log m)$. This bound on the independence was an improvement by Jaud et al. [JWC23] over the bound of $O(\varepsilon^{-2}k\log m)$ originally given by McGregor and Vu [MV19a]. Jaud et al. were motivated to improve the bound not for the sake of reducing the space of any streaming algorithm, but rather to reduce the update time from $O(\varepsilon^{-2}k\log m)$ to $O(k\log m)$. However, we note that the update time, at least in an amortized sense, can actually be reduced to polylogarithmic. Specifically, we can use the hash family given by degree-$\gamma$ polynomials over a suitably large finite field [WC79]. While evaluating such a polynomial at a single point (which corresponds to hashing a single element) requires $\Omega(\gamma)$ time, evaluating the polynomial at $\gamma$ points can be done in $O(\gamma\log^{2}\gamma\log\log\gamma)$ time [vzGG99, Chapter 10]. Hence, by buffering sets of $O(k\log m)$ elements, we may ensure that the hash function can be applied with only $\operatorname{poly}(\log k,\log\log m)$ amortized update time. We note that a similar approach was used by Kane et al. [KNPW11] in the context of frequency moment estimation.
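A sketch of the buffering idea follows. The batch evaluation shown is the naive Horner loop, so it does not by itself achieve the stated amortized bound; the point is the structure into which the fast multipoint evaluation of [vzGG99] would be substituted. The class name and interface are ours.

```python
class BufferedHasher:
    """Sketch of the buffering idea from Section 6.1: the degree-gamma
    polynomial hash is applied to elements in batches of gamma, so a
    batch evaluation running in O(gamma polylog gamma) time amortizes
    to polylogarithmic time per element."""

    def __init__(self, coeffs, p, on_hash):
        self.coeffs, self.p = coeffs, p      # polynomial over F_p, degree gamma
        self.buffer, self.on_hash = [], on_hash
        self.gamma = len(coeffs)

    def _multipoint_eval(self, points):      # placeholder for the fast version
        out = []
        for x in points:
            y = 0
            for c in self.coeffs:            # Horner's rule, O(gamma) per point
                y = (y * x + c) % self.p
            out.append(y)
        return out

    def feed(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.gamma:   # flush a full batch
            for e, h in zip(self.buffer, self._multipoint_eval(self.buffer)):
                self.on_hash(e, h)           # deliver (element, hash) downstream
            self.buffer.clear()
```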

6.2 Fast $\ell_{0}$ Sampling

In the dynamic algorithm, we needed to sample $r$ sets with replacement from a stream in the presence of insertions and deletions. This can be done via $\ell_{0}$ sampling, as discussed in the preliminaries. However, a naive implementation of this process requires $\Omega(r)$ update time, since $r$ different $\ell_{0}$ samplers need to be updated with every set insertion or deletion. To avoid this, a technique of McGregor et al. [MTVV15] can be used. Specifically, we use an $O(\log r)$-wise independent hash function to partition $[M]$ into $t:=r/\log r$ groups $P_{1},P_{2},\ldots,P_{t}$. During the stream, we compute

$$\rho_{i}=|\{x\in P_{i}:\textstyle\sum_{j:x_{j}=x}\Delta_{j}\neq 0\}|\,,$$

i.e., the number of values in $P_{i}$ that are inserted a different number of times than they are deleted. Note that it is simple to compute $\rho_{1},\ldots,\rho_{t}$ under the assumption that $\sum_{j:x_{j}=x}\Delta_{j}\in\{0,1\}$, i.e., every set is inserted either the same number of times it is deleted or exactly one extra time. We also compute $2\log r$ independent $\ell_{0}$ samplers for each $P_{i}$; note that each set insertion or deletion requires updating at most $2\log r$ independent $\ell_{0}$ samplers. Then, at the end of the stream, to generate a sequence of samples with replacement, we first sample $i$ with probability $\rho_{i}/\sum_{j}\rho_{j}$ and then use the next unused $\ell_{0}$ sampler from group $P_{i}$. Assuming $r\ll|\{x\in[M]:\sum_{j:x_{j}=x}\Delta_{j}\neq 0\}|$ (if this is not the case, we can just use sparse recovery), then with high probability each $\rho_{i}/\sum_{j}\rho_{j}\leq 2(\log r)/r$, and the process will use at most $2\log r$ samplers from each group in expectation, and fewer than $4\log r$ with probability at least $1-1/\operatorname{poly}(r)$.
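The following sketch puts the pieces together, reusing the toy L0Sampler interface sketched in Section 3 as a black box; Python's built-in hash stands in for the $O(\log r)$-wise independent hash function, and we assume the stream leaves each net count in $\{0,1\}$.

```python
import math
import random

def sample_with_replacement(M, stream, r, seed=0):
    """Sketch of the grouping trick of Section 6.2.  Items of [M] are hashed
    into t = r / log r groups; each update touches only its own group's
    2*log(r) samplers, and per-group counters rho_i drive the resampling."""
    rng = random.Random(seed)
    L = max(1, math.ceil(math.log2(r)))
    t = max(1, r // L)
    group = lambda x: hash((seed, x)) % t
    rho = [0] * t
    samplers = [[L0Sampler(M, seed=rng.randrange(1 << 30)) for _ in range(2 * L)]
                for _ in range(t)]
    for x, delta in stream:              # assumes net count of each x is in {0,1}
        rho[group(x)] += delta
        for s in samplers[group(x)]:     # only this group's samplers are touched
            s.update(x, delta)
    out, used = [], [0] * t
    if not any(rho):
        return out
    for _ in range(r):                   # sample group ~ rho, then next sampler
        i = rng.choices(range(t), weights=rho, k=1)[0]
        if used[i] < 2 * L:              # w.h.p. no group is exhausted
            out.append(samplers[i][used[i]].sample())
            used[i] += 1
    return out
```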

7 Conclusion

We presented new algorithms for $(1-1/e-\varepsilon)$-approximation of the maximum coverage problem in the data stream model. These algorithms improve upon the state-of-the-art results by (a) reducing the space required in the dynamic model from $(n+\varepsilon^{-4}k)\operatorname{polylog}(m,n)$ to $\varepsilon^{-2}k\operatorname{polylog}(m,n)$ when given $O(\varepsilon^{-1}\log m)$ passes, (b) reducing the space required in the single-pass random order model from $\varepsilon^{-2}k^{2}\operatorname{polylog}(m,n)$ to $O_{\varepsilon}(k\operatorname{polylog}(m,n))$, although we emphasize that the $O_{\varepsilon}$ hides an exponential dependence on $\varepsilon^{-1}$, and (c) reducing the amortized update time to polylogarithmic in $m$ and $n$. We conjecture that $\varepsilon^{-2}k\operatorname{polylog}(m,n)$ space should be sufficient in the random order model. In fact, there is a monotone submodular optimization algorithm in the random order model that, when applied to the maximum coverage problem, would maintain only $O(\varepsilon^{-1}k)$ sets [LRVZ21a]. Given that, via universe sub-sampling, we may assume that the sets have size $O(\varepsilon^{-2}k\log m)$; this yielded the Warneke et al. [WCW23] result. However, it seems unlikely that the algorithm of Liu et al. [LRVZ21a] can be modified to ensure that the total number of elements across all sets maintained is $O(\varepsilon^{-2}k\log m)$, so a new approach is likely to be necessary.

References

  • [AB21] Sepehr Assadi and Soheil Behnezhad. Beating two-thirds for random-order streaming matching. In 48th International Colloquium on Automata, Languages and Programming (ICALP), page 19, 2021.
  • [ABB+15] Aris Anagnostopoulos, Luca Becchetti, Ilaria Bordino, Stefano Leonardi, Ida Mele, and Piotr Sankowski. Stochastic query covering for fast approximate document retrieval. ACM Transactions on Information Systems (TOIS), 33(3):1–35, 2015.
  • [ABG+12] Giorgio Ausiello, Nicolas Boria, Aristotelis Giannakos, Giorgio Lucarelli, and Vangelis Th. Paschos. Online maximum k-coverage. Discrete Applied Mathematics, 160(13-14):1901–1913, 2012.
  • [ACG+15] Kook Jin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, and Anthony Wirth. Correlation clustering in data streams. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2237–2246. JMLR.org, 2015.
  • [AGM12a] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements. In SODA, pages 459–467. SIAM, 2012.
  • [AGM12b] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Graph sketches: sparsification, spanners, and subgraphs. In PODS, pages 5–14. ACM, 2012.
  • [AGM13] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Spectral sparsification in dynamic graph streams. In APPROX-RANDOM, volume 8096 of Lecture Notes in Computer Science, pages 1–10. Springer, 2013.
  • [AK18] Sepehr Assadi and Sanjeev Khanna. Tight bounds on the round complexity of the distributed maximum coverage problem. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2412–2431. SIAM, 2018.
  • [AKL16] Sepehr Assadi, Sanjeev Khanna, and Yang Li. Tight bounds for single-pass streaming complexity of the set cover problem. In STOC, pages 698–711. ACM, 2016.
  • [AKLY16] Sepehr Assadi, Sanjeev Khanna, Yang Li, and Grigory Yaroslavtsev. Maximum matchings in dynamic graph streams and the simultaneous communication model. In SODA, pages 1345–1364. SIAM, 2016.
  • [Ass17] Sepehr Assadi. Tight space-approximation tradeoff for the multi-pass streaming set cover problem. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 321–335, 2017.
  • [ASS19] Shipra Agrawal, Mohammad Shadravan, and Cliff Stein. Submodular secretary problem with shortlists. In Avrim Blum, editor, 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, volume 124 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [BEM17] MohammadHossein Bateni, Hossein Esfandiari, and Vahab S. Mirrokni. Almost optimal streaming algorithms for coverage problems. In Christian Scheideler and Mohammad Taghi Hajiaghayi, editors, Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 13–23. ACM, 2017.
  • [BHNT15] Sayan Bhattacharya, Monika Henzinger, Danupon Nanongkai, and Charalampos E. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In STOC, pages 173–182. ACM, 2015.
  • [BMKK14] Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 671–680, 2014.
  • [CCE+16] Rajesh Chitnis, Graham Cormode, Hossein Esfandiari, MohammadTaghi Hajiaghayi, Andrew McGregor, Morteza Monemizadeh, and Sofya Vorotnikova. Kernelization via sampling with applications to finding matchings and related problems in dynamic graph streams. In SODA, pages 1326–1344. SIAM, 2016.
  • [CGQ15] Chandra Chekuri, Shalmoli Gupta, and Kent Quanrud. Streaming algorithms for submodular function maximization. In ICALP (1), volume 9134 of Lecture Notes in Computer Science, pages 318–330. Springer, 2015.
  • [CKW10] Graham Cormode, Howard J. Karloff, and Anthony Wirth. Set cover algorithms for very large datasets. In CIKM, pages 479–488. ACM, 2010.
  • [CW16] Amit Chakrabarti and Anthony Wirth. Incidence geometries and the pass complexity of semi-streaming set cover. In SODA, pages 1365–1373. SIAM, 2016.
  • [DIMV14] Erik D. Demaine, Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. On streaming and communication complexity of the set cover problem. In Fabian Kuhn, editor, Distributed Computing - 28th International Symposium, DISC 2014, Austin, TX, USA, October 12-15, 2014. Proceedings, volume 8784 of Lecture Notes in Computer Science, pages 484–498. Springer, 2014.
  • [ER14] Yuval Emek and Adi Rosén. Semi-streaming set cover - (extended abstract). In ICALP (1), volume 8572 of Lecture Notes in Computer Science, pages 453–464. Springer, 2014.
  • [Fei98] Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, 1998.
  • [FNFSZ20] Moran Feldman, Ashkan Norouzi-Fard, Ola Svensson, and Rico Zenklusen. The one-way communication complexity of submodular maximization with applications to streaming and robustness. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 1363–1374, 2020.
  • [GM09] Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044–2059, 2009.
  • [GMT15] Sudipto Guha, Andrew McGregor, and David Tench. Vertex and hyperedge connectivity in dynamic graph streams. In PODS, pages 241–247. ACM, 2015.
  • [HIMV16] Sariel Har-Peled, Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. Towards tight bounds for the streaming set cover problem. In PODS, pages 371–383. ACM, 2016.
  • [HP98] Dorit S Hochbaum and Anu Pathria. Analysis of the greedy approach in problems of maximum k-coverage. Naval Research Logistics (NRL), 45(6):615–627, 1998.
  • [IV19] Piotr Indyk and Ali Vakilian. Tight trade-offs for the maximum $k$-coverage problem in the general streaming model. In 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 200–217, 2019.
  • [JST11] Hossein Jowhari, Mert Sağlam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Maurizio Lenzerini and Thomas Schwentick, editors, Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 49–58. ACM, 2011.
  • [JWC23] Stephen Jaud, Anthony Wirth, and Farhana Choudhury. Maximum Coverage in Sublinear Space, Faster. In 21st International Symposium on Experimental Algorithms (SEA 2023), volume 265, pages 21:1–21:20, Dagstuhl, Germany, 2023.
  • [Kar72] Richard M Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Springer, 1972.
  • [KG07] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, pages 1650–1654. AAAI Press, 2007.
  • [KG14] Andreas Krause and Daniel Golovin. Submodular function maximization. In Lucas Bordeaux, Youssef Hamadi, and Pushmeet Kohli, editors, Tractability: Practical Approaches to Hard Problems, pages 71–104. Cambridge University Press, 2014.
  • [KKA23] Sanjeev Khanna, Christian Konrad, and Cezar-Mihail Alexandru. Set cover in the one-pass edge-arrival streaming model. In Floris Geerts, Hung Q. Ngo, and Stavros Sintos, editors, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023, pages 127–139. ACM, 2023.
  • [KKT15] David Kempe, Jon M. Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. Theory of Computing, 11:105–147, 2015.
  • [KLM+14] Michael Kapralov, Yin Tat Lee, Cameron Musco, Christopher Musco, and Aaron Sidford. Single pass spectral sparsification in dynamic streams. In FOCS, pages 561–570. IEEE Computer Society, 2014.
  • [KMM12] Christian Konrad, Frédéric Magniez, and Claire Mathieu. Maximum matching in semi-streaming with few passes. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (APPROX/RANDOM), pages 231–242, 2012.
  • [KNPW11] Daniel M Kane, Jelani Nelson, Ely Porat, and David P Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 745–754, 2011.
  • [Kon15] Christian Konrad. Maximum matching in turnstile streams. In ESA, volume 9294 of Lecture Notes in Computer Science, pages 840–852. Springer, 2015.
  • [KW14] Michael Kapralov and David P. Woodruff. Spanners and sparsifiers in dynamic streams. In PODC, pages 272–281. ACM, 2014.
  • [LMW14] Ching Lih Lim, Alistair Moffat, and Anthony Wirth. Lazy and eager approaches for the set cover problem. In Proceedings of the 37th Australasian Computer Science Conference, pages 19–27, 2014.
  • [LRVZ21a] Paul Liu, Aviad Rubinstein, Jan Vondrák, and Junyao Zhao. Cardinality constrained submodular maximization for random streams. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 6491–6502, 2021.
  • [LRVZ21b] Paul Liu, Aviad Rubinstein, Jan Vondrák, and Junyao Zhao. Cardinality constrained submodular maximization for random streams. In Advances in Neural Information Processing Systems 34, pages 6491–6502, 2021.
  • [McG14] Andrew McGregor. Graph stream algorithms: a survey. SIGMOD Record, 43(1):9–20, 2014.
  • [Mon20] Morteza Monemizadeh. Dynamic submodular maximization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9806–9817. Curran Associates, Inc., 2020.
  • [MTV21] Andrew McGregor, David Tench, and Hoa T. Vu. Maximum coverage in the data stream model: Parameterized and generalized. In Ke Yi and Zhewei Wei, editors, 24th International Conference on Database Theory, ICDT 2021, March 23-26, 2021, Nicosia, Cyprus, volume 186 of LIPIcs, pages 12:1–12:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [MTVV15] Andrew McGregor, David Tench, Sofya Vorotnikova, and Hoa T. Vu. Densest subgraph in dynamic graph streams. In MFCS (2), volume 9235 of Lecture Notes in Computer Science, pages 472–482. Springer, 2015.
  • [MV19a] Andrew McGregor and Hoa T. Vu. Better streaming algorithms for the maximum coverage problem. Theory Comput. Syst., 63(7):1595–1619, 2019.
  • [MV19b] Andrew McGregor and Hoa T Vu. Better streaming algorithms for the maximum coverage problem. Theory of Computing Systems, 63:1595–1619, 2019.
  • [MVV16] Andrew McGregor, Sofya Vorotnikova, and Hoa T. Vu. Better algorithms for counting triangles in data streams. In PODS, pages 401–411. ACM, 2016.
  • [NFTM+18] Ashkan Norouzi-Fard, Jakub Tarnawski, Slobodan Mitrovic, Amir Zandieh, Aidasadat Mousavifar, and Ola Svensson. Beyond 1/2-approximation for submodular maximization on massive data streams. In Proceedings of the 35th International Conference on Machine Learning, pages 3829–3838, 2018.
  • [NTM+18] Ashkan Norouzi-Fard, Jakub Tarnawski, Slobodan Mitrovic, Amir Zandieh, Aidasadat Mousavifar, and Ola Svensson. Beyond 1/2-approximation for submodular maximization on massive data streams. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3826–3835. PMLR, 2018.
  • [SG09] Barna Saha and Lise Getoor. On maximum coverage in the streaming model & application to multi-topic blog-watch. In SDM, pages 697–708. SIAM, 2009.
  • [vzGG99] Joachim von zur Gathen and Jürgen Gerhard. Modern Computer Algebra. Cambridge University Press, 1999.
  • [WC79] Mark N. Wegman and Larry Carter. New classes and applications of hash functions. In Proc. 20th Annual IEEE Symposium on Foundations of Computer Science, pages 175–182, 1979.
  • [WCW23] Rowan Warneke, Farhana Murtaza Choudhury, and Anthony Wirth. Maximum coverage in random-arrival streams. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms, ESA 2023, September 4-6, 2023, Amsterdam, The Netherlands, volume 274 of LIPIcs, pages 102:1–102:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
  • [YY13] Huiwen Yu and Dayu Yuan. Set coverage problems in a one-pass data stream. In SDM, pages 758–766. SIAM, 2013.