
Improved Algorithms for Maximum Coverage
in Dynamic and Random Order Streams

Amit Chakrabarti Department of Computer Science, Dartmouth College, Hanover, NH, USA. Supported in part by NSF under award 2006589.    Andrew McGregor College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA    Anthony Wirth School of Computing and Information Systems, The University of Melbourne, Victoria, Australia. Supported in part by the Faculty of Engineering and Information Technology.
Abstract

The maximum coverage problem is to select $k$ sets from a collection of sets such that the cardinality of the union of the selected sets is maximized. We consider $(1-1/e-\varepsilon)$-approximation algorithms for this NP-hard problem in three standard data stream models.

  1. Dynamic Model. The stream consists of a sequence of sets being inserted and deleted. Our multi-pass algorithm uses $\varepsilon^{-2}k\cdot\operatorname{polylog}(n,m)$ space. The best previous result (Assadi and Khanna, SODA 2018) used $(n+\varepsilon^{-4}k)\operatorname{polylog}(n,m)$ space. While both algorithms use $O(\varepsilon^{-1}\log n)$ passes, our analysis shows that when $\varepsilon$ is a constant, it is possible to reduce the number of passes by a $1/\log\log n$ factor without incurring additional space.

  2. Random Order Model. In this model, there are no deletions and the sets forming the instance are uniformly randomly permuted to form the input stream. We show that a single pass and $k\operatorname{polylog}(n,m)$ space suffice for arbitrarily small constant $\varepsilon$. The best previous result, by Warneke et al. (ESA 2023), used $k^{2}\operatorname{polylog}(n,m)$ space.

  3. Insert-Only Model. Lastly, our results, along with numerous previous results, use a sub-sampling technique introduced by McGregor and Vu (ICDT 2017) to sparsify the input instance. We explain how this technique and others used in the paper can be implemented such that the amortized update time of our algorithm is polylogarithmic. This also implies an improvement over the state-of-the-art insert-only algorithms in terms of update time: $\operatorname{polylog}(m,n)$ update time suffices, whereas the best previous result, by Jaud et al. (SEA 2023), required update time linear in $k$.

1 Introduction

Background.

The input to the maximum coverage problem is an integer $k$ and a collection of $m$ sets $S_{1},\ldots,S_{m}$, each a subset of the universe $[n]:=\{1,\ldots,n\}$. The goal is to find $k$ sets $\{S_{i}\}$ whose union has maximum cardinality. This longstanding problem has several applications, including facility and sensor allocation [KG07], circuit layout and job scheduling [HP98], information retrieval [ABB+15], influence maximization in marketing strategy design [KKT15], and content recommendation [SG09]. In terms of theoretical importance, it is perhaps the simplest example of a submodular maximization problem, a rich family of problems in machine learning [KG14]. A natural variant, the set cover problem, was one of Karp's original 21 NP-complete problems [Kar72]. A greedy approach that iteratively chooses the set with the greatest contribution yields asymptotically optimal approximation algorithms for set cover and maximum coverage [Fei98]. Indeed, the greedy algorithm typically outperforms its proven approximation factor on practical problem instances [LMW14].

Data Stream Computation.

Over the last decade, there has been a growing body of work that addresses the challenges of solving the maximum coverage problem (henceforth, MaxCov), the related set cover problem (SetCov), and the general problem of submodular optimization, in computational models, such as the data stream model, that are relevant when the underlying data sets are massive [SG09, MV19a, BEM17, IV19, AK18, WCW23, ABG+12, YY13, LRVZ21a, ASS19, NTM+18, CKW10, JWC23, HIMV16, CW16, ER14, AKL16]. Assadi, Khanna, and Li [AKL16] provide a comprehensive summary and discussion of results. In many real-world applications, even the conceptually simple greedy approach is not practical. This can occur if the underlying data is too large to be stored in a single location and is therefore distributed across multiple machines, or if the data changes over time, e.g., sets are added to or removed from the input collection.

This work focuses on the set streaming model [SG09], where the input stream is a sequence of sets (in the dynamic case, a sequence of set insertions and deletions, arbitrarily intermixed). Each set is described contiguously. The goal is to solve MaxCov in space that is sublinear in the size of the input. We note in passing that there has also been interest in an alternative “edge streaming” model [KKA23, BEM17, IV19], but it is not the focus of this paper.

Along with the set-streaming model itself, Saha and Getoor [SG09] introduced the first algorithm for MaxCov in this model. Their algorithm, sops (the algorithm received this name in later work by Yu and Yuan [YY13]), achieves an approximation factor of $1/4$ using $O(nk)$ space. The input could be as large as $\Omega(mn)$, and there is, e.g., a logarithmic-factor approximation for set cover that uses $\tilde{O}(mn^{\delta})$ space in $O(1/\delta)$ passes [HIMV16]. To accord with practice, early approaches for set cover took space linear in $n$ [CKW10, ER14, CW16]. Though it is unclear whether there are sensible $o(n)$-space solutions for SetCov, McGregor and Vu [MV19a] pioneered the quest for $\tilde{O}(k)$-space methods for MaxCov. Storing even a single set from the stream may, in general, require $\Theta(n)$ space. To avoid this, various papers consider sub-sampling the elements in the sets [JWC23, DIMV14, BEM17, HIMV16]. In particular, McGregor and Vu [MV19a] observed that subsampling the instance down to a universe of size $\tilde{O}(k)$, using a hash function of high independence, preserves coverage very accurately.

We will focus on three specific set streaming models. In the insert-only set streaming model, the data stream comprises $m$ sets, in arbitrary order, each described once. In the random order set streaming model, we assume that these sets, each described once and contiguously, are ordered uniformly at random. This model captures a natural form of average-case analysis: the sets are chosen adversarially, but they arrive in a uniformly random order [AK18, WCW23]. In the dynamic set streaming model, each set may be added and deleted several times: the resulting MaxCov instance involves only those sets that remain added at the end of the stream.

Submodular maximization.

Our problem MaxCov is a special case of submodular maximization, specifically, of the problem of maximizing a monotone submodular function under a cardinality constraint. In the streaming setting [BMKK14, FNFSZ20], we have oracle access to a submodular function $f\colon 2^{U}\to\mathbb{R}$, for some universe $U$, and the input stream is a sequence of elements of $U$. These elements correspond to the sets $S_{i}$ in MaxCov. An important point here is that in MaxCov (unlike in submodular maximization), storing an element of $U$ (i.e., a set $S_{i}$) takes non-constant space, as does storing the $f$-value (coverage) of a putative solution.

The sieve-streaming algorithm [BMKK14] for submodular maximization gives a $(1/2-\varepsilon)$-approximation for any $\varepsilon>0$. Adapted to maximum coverage, it yields an $\tilde{O}(n\varepsilon^{-1})$-space algorithm achieving the same approximation factor [MV19b]. Subsequently, Chekuri et al. [CGQ15] extended this work to non-monotone submodular function maximization under constraints beyond cardinality. Observe that $\Omega(mk^{-3})$ space is required to achieve an approximation factor greater than $1/2+o(1)$ in the arbitrary-arrival model [FNFSZ20]. There has also been progress on submodular maximization in dynamic streams [Mon20], though comparatively little concerning MaxCov.

Random order.

Streaming algorithms for submodular maximization in the random-arrival model have been studied [NFTM+18, ASS19, LRVZ21b]. Interestingly, approximation factors better than $1/2$ have been achieved in this model using low space. Norouzi-Fard et al. [NFTM+18] developed salsa, an $\tilde{O}(k)$-space algorithm that achieves an approximation factor of $1/2+c_{0}$ in expectation, where $c_{0}>0$ is a small absolute constant. Subsequently, Agrawal et al. [ASS19] achieved an approximation factor of $1-1/e-\varepsilon-o(1)$ in expectation using $\tilde{O}_{\varepsilon}(k)$ space. The $o(1)$ term in the approximation factor is a function of $k$; for small $k$, this term is very large, in which case salsa gives a better approximation guarantee. Nonetheless, the algorithm of Agrawal et al. is a focus of this paper. Warneke et al. [WCW23] introduced low-space methods for maximum coverage in random-order streams. Their approaches consume space proportional to $k^{2}$; we improve this to linear in $k$ (see Section 1.1 for details). The corresponding hardness result is that $\Omega(m)$ space is required to achieve an approximation factor of $1-1/e+\varepsilon+o(1)$ in the random-arrival model [LRVZ21b]. (This improves upon the $\Omega(mk^{-3})$ space lower bound given by McGregor and Vu [MV19b] for random-arrival set-streaming maximum coverage, which applies to random-arrival submodular maximization, due to an important connection between these two problems, explained shortly.)

1.1 Our Results, Approach, and Related Work

To establish our new results, we revisit recent work on the maximum coverage problem in the relevant data stream models and show how to refine and extend the approaches to substantially decrease the amount of space needed to solve the problem. In the process, we also establish secondary results on speeding up the existing algorithms and, in one case, developing a slightly more pass-efficient algorithm.

In the dynamic stream model, our main result is the following.

Result 1 (Algorithm for Dynamic Model).

There exists an $O((1+\varepsilon^{-1}/\log\log m)\log m)$-pass algorithm in the dynamic data stream model that uses $\varepsilon^{-2}\cdot k\cdot\operatorname{polylog}(n,m)$ space and returns a $(1-1/e-\varepsilon)$-approximation.

This result reappears as Theorem 4.2. It improves over the best previous algorithm, by Assadi and Khanna [AK18], which used $(n+\varepsilon^{-4}k)\operatorname{polylog}(n,m)$ space. The high-level approach is similar and is based on the existing $\ell_{0}$ sampling primitive that allows uniform sampling amongst the sets that are inserted but not deleted. In each pass, we pick additional sets and maintain the union $C$ of the selected sets. We pick the additional sets by sampling only among the sets that include a "meaningful" number of elements that are not currently in $C$. Unfortunately, when we simultaneously sample multiple sets in this way, once we start adding sampled sets to our solution, some of the sampled sets may no longer make a meaningful increment to $|C|$. However, by repeating the process over multiple passes, with a slowly decreasing threshold for what constitutes a meaningful improvement, we ensure that either we exhaust the collection of sets that could potentially be worth adding, or we add enough sets that together yield a good approximate solution. The main point of departure from Assadi and Khanna [AK18] is the way we handle the decreasing threshold. The largest threshold considered by Assadi and Khanna is about $\operatorname{OPT}/k$; this suffices for achieving a $(1-1/e-\varepsilon)$-approximation factor, but has the downside that there can be many overlapping sets that would contribute more than $\operatorname{OPT}/k$ new elements to $C$; storing these sets during the sampling process is expensive. Instead, we consider thresholds decreasing from $\operatorname{OPT}$, handling the sets that make a contribution between $\operatorname{OPT}/2^{i}$ and $\operatorname{OPT}/2^{i+1}$, for $i=0,\ldots,\log k$, separately but in parallel. We analyze a stochastic process, which we call the cascading urns process, to show that our new algorithm uses a similar number of passes to the algorithm of Assadi and Khanna [AK18] but significantly less space. We present the details in Section 4. In fact, a more careful analysis of the process shows that it is possible to reduce the number of passes to $O(\log m+\varepsilon^{-1}\log m/\log\log m)$, in contrast to the $O(\varepsilon^{-1}\log m)$ passes used by the previous algorithm. We believe that a similar pass saving can be achieved in Assadi and Khanna [AK18], but it would require an extra log factor in the space.

In the random order model, our main result is the following.

Result 2 (Algorithm for Random Order Model).

There exists a single-pass algorithm in the random order data stream model that uses $O_{\varepsilon}(k\cdot\operatorname{polylog}(n,m))$ space and returns a $(1-1/e-\varepsilon)$-approximation to maximum coverage in expectation.

The above result is established in Section 5.2. Note that the dependence on $\varepsilon$ in the result is exponential, which makes the algorithm less practical. The main significance of the result is removing a factor of $k$ from the space required by the best previous result, by Warneke et al. [WCW23], which used $\varepsilon^{-2}k^{2}\operatorname{polylog}(n,m)$ space. Both their approach and ours are based on modifying existing algorithms for the cardinality-constrained monotone submodular optimization problem. This is a more general problem than maximum coverage, but it is assumed in the general problem that the algorithm has oracle access to the function being optimized. For maximum coverage, we need to store enough information to be able to simulate this oracle. An $O_{\varepsilon}(k^{2}\log m)$-space algorithm for maximum coverage follows immediately from any $O_{\varepsilon}(k)$-space algorithm for monotone submodular optimization, because the universe subsampling technique discussed in Section 3 allows us to focus on the case where every set has size $O(\varepsilon^{-2}k\log m)$. To reduce the dependence on $k$ to linear, rather than quadratic, we need a more careful adaptation of an algorithm by Agrawal et al. [ASS19]. We maintain a set of size $O(\varepsilon^{-2}k\log m)$, corresponding to the items covered by the collection of sets chosen for the solution so far, along with $O_{\varepsilon}(1)$ sets of size $O_{\varepsilon}(\log m)$. We mention in passing that, being a thresholding algorithm, salsa is easy to adapt to our $O_{\varepsilon}(k\cdot\operatorname{polylog}(n,m))$ regime, yielding an approximation factor of $1/2+c_{0}$. We present the details in Section 5.

Lastly, our results, along with numerous previous results, use the aforementioned universe sub-sampling technique introduced by McGregor and Vu [MV19a] to sparsify the input instance. We explain how this technique and others used in the paper can be implemented such that the amortized update time of our algorithm is polylogarithmic. This also implies an improvement over the state-of-the-art insert-only algorithms in terms of update time: $\operatorname{polylog}(m,n)$ update time suffices, whereas the best previous result, by Jaud et al. [JWC23], required update time $k\cdot\operatorname{polylog}(m,n)$. We present the details in Section 6.

2 Further Related Work

Here we provide some more detail on the background materials.

Random arrivals.

Guha and McGregor [GM09] §1.1 further justify the assumptions of the random-arrival model. This model has been studied for many important problems in other contexts, including quantile estimation [GM09], submodular maximization [NTM+18, ASS19], and maximum matching [KMM12, AB21], often revealing improved space-accuracy trade-offs.

Coverage in streams.

Saha and Getoor [SG09] gave a swap-based $1/4$-approximation algorithm, sops, that uses a single pass and $\tilde{O}(kn)$ space. Their algorithm stores $k$ sets explicitly in memory as the current solution and selectively replaces sets in memory as new sets arrive in the stream. Subsequently, Ausiello et al. [ABG+12] gave a slightly different swap-based algorithm that also finds a $1/4$-approximation using one pass and the same space.

Yu and Yuan [YY13] improved upon sops under a more relaxed output specification, in which only the IDs of the chosen sets must be returned. Their algorithm, gops, achieves an approximation factor of approximately $0.3$ using just $\tilde{O}(n)$ space.

McGregor and Vu [MV19a] introduced several methods for MaxCov under set streaming. They presented four near-optimal algorithms for the problem, with the following specifications:

  1. An approximation factor of $1-1/e-\varepsilon$ using $\tilde{O}(m\varepsilon^{-2})$ space.

  2. An approximation factor of $1-1/e-\varepsilon$ using $\tilde{O}(k\varepsilon^{-2})$ space, requiring $\tilde{O}(\varepsilon^{-1})$ passes over the stream. This method is a key launchpad for our work.

  3. An approximation factor of $1-\varepsilon$ using $\tilde{O}(m\varepsilon^{-3})$ space (and exponential time).

  4. An approximation factor of $1/2-\varepsilon$ using just $\tilde{O}(k\varepsilon^{-3})$ space.

Subsequently, McGregor et al. [MTV21] showed how to solve maximum coverage exactly in one pass and in $\tilde{O}(d^{d+1}k^{d})$ space when every set has size at most $d$.

Dynamic streams.

In the dynamic graph stream model [McG14], the stream consists of insertions and deletions of edges of the underlying graph [ACG+15, AGM12a, AGM12b, AGM13, KLM+14, KW14, GMT15, BHNT15, CCE+16, AKLY16, Kon15, MVV16, MTVV15], relevant to the maximum vertex coverage problem. An algorithm for the dynamic graph stream model can also be used in the streaming-set model; the streaming-set model is simply a special case in which edges are grouped by endpoint.

Lower bounds.

McGregor and Vu [MV19a] introduced the first nontrivial space lower bound for set-streaming maximum coverage. For all $\varepsilon>0$, achieving an approximation factor better than $1-1/e+o(1)$ requires $\Omega(mk^{-2})$ space in the arbitrary-arrival case, and $\Omega(mk^{-3})$ space in the random-arrival case, where the $o(1)$ term is a function of $k$. This also represented the first result explicitly regarding random-arrival maximum coverage. Assadi [Ass17] showed that achieving a $(1-\varepsilon)$-approximation requires $\Omega(m\varepsilon^{-2})$ space, even in the random-arrival model, almost matching the $\tilde{O}(m\varepsilon^{-3})$ upper bound achieved by McGregor and Vu [MV19a]. More recently, Feldman et al. [FNFSZ20] showed that $\Omega(mk^{-3})$ space is required to beat $1/2+o(1)$ in the arbitrary-arrival model.

3 Preliminaries

Without loss of generality, we assume each set is tagged with a unique ID in the range $[m^{3}n]$; if not, we could generate suitable IDs via random hashing of the contents of the sets themselves. (E.g., hash each set $S\subseteq[n]$ to $f_{S}(r)$, where $f_{S}(X)=\prod_{u\in S}(X-u)$ is a polynomial over a prime field $\mathbb{F}_{p}$ of cardinality $\Theta(m^{3}n)$ and $r$ is uniformly random in $\mathbb{F}_{p}$. By elementary analysis, the map $S\mapsto f_{S}(r)$ is non-injective with probability at most $\binom{m}{2}n/p\leq 1/m$.) In the dynamic setting, we will want to sample uniformly from the sets that are inserted but not deleted (sometimes with additional requirements on the relevant sets). To do this, we will use the standard technique of $\ell_{0}$ sampling.
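To make the ID scheme above concrete, here is a minimal Python sketch of the polynomial fingerprinting, with a fixed Mersenne prime standing in for a prime of cardinality $\Theta(m^{3}n)$; the function name make_id_hasher and the choice of prime are illustrative, not part of the paper.

```python
import random

def make_id_hasher(m, n, seed=0):
    """Sketch of the ID scheme: hash a set S (a subset of [n]) to
    f_S(r) = prod_{u in S} (r - u) mod p, for a prime p >= m^3 * n and a
    uniformly random r in F_p.  Two distinct sets collide with probability
    at most n/p, so all m sets receive distinct IDs with high probability."""
    rng = random.Random(seed)
    p = (1 << 61) - 1            # a Mersenne prime; assume p >= m^3 * n
    r = rng.randrange(p)

    def set_id(S):
        h = 1
        for u in S:
            h = h * ((r - u) % p) % p
        return h

    return set_id

# Example: the ID is a function of the set's contents, not the element order.
set_id = make_id_hasher(m=1000, n=10**6)
assert set_id({3, 1, 4}) == set_id({4, 3, 1})
```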

Theorem 3.1 ($\ell_{0}$ sampling [JST11]).

There is a one-pass algorithm that processes a stream of tokens $\langle x_{1},\Delta_{1}\rangle,\langle x_{2},\Delta_{2}\rangle,\ldots$, where each $x_{i}\in\{1,\ldots,M\}$ and $\Delta_{i}\in\{-1,1\}$, using $O(\log^{2}(M)\log(1/\delta))$ bits of space, and that, with probability $1-\delta$, returns an element chosen uniformly at random from the set $\{x\in[M]:\sum_{i:x_{i}=x}\Delta_{i}\neq 0\}$.

Specifically, we will use the above technique when $M=m^{3}n$, with each $x_{i}$ being the ID of a set and the corresponding $\Delta_{i}$ specifying whether the set is being inserted or deleted.
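For illustration, the following is a simplified Python sketch of an $\ell_{0}$ sampler with the interface we need (update with $\langle x_{i},\Delta_{i}\rangle$ tokens, then sample). It follows the standard recipe of geometric subsampling levels plus a 1-sparse test with a fingerprint; a production implementation along the lines of [JST11] would use limited-independence hashing and sparse recovery, both of which are elided here, so this toy is for exposition only.

```python
import random

class L0Sampler:
    """Simplified l0-sampler sketch.  At level j we retain updates to items
    x whose (pseudo-random) priority h(x) is below M / 2^j, keeping a net
    count, a weighted sum, and a fingerprint per level.  A level whose
    surviving support is exactly one item passes the 1-sparse test and
    yields that item.  Assumes each net count ends up in {0, 1}."""

    def __init__(self, M, seed=0):
        rng = random.Random(seed)
        self.M, self.levels = M, M.bit_length() + 1
        self.p = (1 << 61) - 1                # prime for the fingerprint test
        self.z = rng.randrange(1, self.p)
        self.salt = rng.randrange(1 << 30)    # makes h(x) consistent per item
        self.stats = [[0, 0, 0] for _ in range(self.levels)]  # (count, sum, fp)

    def _level(self, x):
        h = hash((self.salt, x)) % self.M     # stand-in for a random priority
        lvl = 0
        while lvl + 1 < self.levels and h < self.M >> (lvl + 1):
            lvl += 1
        return lvl

    def update(self, x, delta):               # delta in {-1, +1}
        for lvl in range(self._level(x) + 1): # x survives levels 0..level(x)
            c = self.stats[lvl]
            c[0] += delta
            c[1] += delta * x
            c[2] = (c[2] + delta * pow(self.z, x, self.p)) % self.p

    def sample(self):
        """Return a surviving item, or None if no level is 1-sparse."""
        for count, total, fp in reversed(self.stats):   # sparsest level first
            if count == 1 and fp == pow(self.z, total, self.p):
                return total                  # the support is exactly {total}
        return None
```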

As mentioned in the introduction, the natural (offline) greedy approach achieves a $(1-1/e)$-approximation for maximum coverage. In streaming settings, a "quantized" version of the algorithm achieves similar results, as summarized below.

Theorem 3.2 (Quantized Greedy Algorithm, e.g., [MV19a]).

Consider the algorithm that processes a stream $\langle S_{1},S_{2},\ldots\rangle$ of sets as follows. Starting with an empty collection $Y$, it makes $p$ passes: in the $i$th pass, it adds to $Y$ each encountered set that covers $\geq\tau_{i}$ uncovered elements. If, at some point, $|Y|=k$, then the algorithm stops and returns $Y$. If these threshold parameters $\tau_{1}>\cdots>\tau_{p}$ satisfy

  1. $\tau_{1}\geq\operatorname{OPT}/k$,

  2. $\tau_{p}<\operatorname{OPT}/(4ek)$, and

  3. $\tau_{i}/\tau_{i+1}\leq 1+\varepsilon$, for all $i\in[p-1]$,

then the algorithm achieves a $(1-1/e-\varepsilon)$-approximation for the maximum coverage problem.
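A minimal sketch of the quantized greedy algorithm follows, simulated offline by repeatedly iterating over an in-memory list of sets; the threshold schedule is chosen so that conditions 1-3 above hold whenever the guess $v$ satisfies $\operatorname{OPT}/2\leq v\leq\operatorname{OPT}$ (the guessing technique is described next). The function name and calling convention are ours.

```python
import math

def quantized_greedy(stream_pass, k, v, eps):
    """Offline simulation of the quantized greedy algorithm of Theorem 3.2.
    stream_pass() yields the sets of the stream, once per call (one pass).
    Thresholds start at 2v/k >= OPT/k and decrease geometrically by (1+eps)
    until they fall below v/(4ek) <= OPT/(4ek)."""
    Y, C = [], set()
    tau = 2 * v / k                        # tau_1 = 2v/k >= OPT/k
    while True:
        for S in stream_pass():            # one pass at threshold tau
            if len(S - C) >= tau:
                Y.append(S)
                C |= S
                if len(Y) == k:
                    return Y, C
        if tau < v / (4 * math.e * k):     # tau_p < OPT/(4ek): done
            return Y, C
        tau /= 1 + eps                     # tau_i / tau_{i+1} = 1 + eps

# Toy usage with k = 2; here OPT = 8 and v = 6 lies in [OPT/2, OPT].
sets = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5}), frozenset({6, 7, 8, 9})]
print(len(quantized_greedy(lambda: sets, k=2, v=6, eps=0.1)[1]))  # prints 8
```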

Both of our main algorithmic results appeal to the following technique that, given a guess $v$ of the value $\operatorname{OPT}$ (we run the final algorithms with guesses $v=1,2,4,8,\ldots,2^{\log n}$ and use the fact that one of these guesses is within a factor $2$ of $\operatorname{OPT}$), transforms the MaxCov instance into one in which the optimum solution covers $O(\varepsilon^{-2}k\log m)$ elements. Note that this immediately implies that every set in the input has cardinality $O(\varepsilon^{-2}k\log m)$. In what follows, we will tacitly assume that the given instance is filtered through this technique before being fed into the algorithms we design. In particular, we will appeal to the following theorem [MV19a, JWC23].

Theorem 3.3 (Universe Subsampling).

Let the function $h:[n]\rightarrow\{0,1\}$ be drawn uniformly at random from a family of $O(k\log m)$-wise independent hash functions, where

$$p:=\Pr[h(e)=1]=\lambda/v,\quad\text{for }\lambda=10\varepsilon^{-2}k\,,$$

and $v$ satisfies $\operatorname{OPT}/2\leq v\leq\operatorname{OPT}$. An $h$-sparsification of an instance $\mathcal{I}$ of MaxCov is formed by replacing every set $S$ in $\mathcal{I}$ by $\{e\in S:h(e)=1\}$. With high probability, any $\alpha$-approximation for the $h$-sparsification of $\mathcal{I}$ yields an $(\alpha-\varepsilon)$-approximation for $\mathcal{I}$.

In Section 6, we will discuss how to implement the universe subsampling such that the update time of the resulting algorithm is logarithmic for each element of a set in the stream.
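A sketch of the sparsification step, with Python's random bits standing in for the $O(k\log m)$-wise independent family the theorem requires (the fast implementation of that family is the subject of Section 6.1):

```python
import random

def sparsify_instance(sets, n, k, eps, v, seed=0):
    """Sketch of Theorem 3.3's h-sparsification.  Each element e of [n]
    survives iff h(e) = 1, where Pr[h(e) = 1] = min(1, 10 * eps^-2 * k / v)
    and v is the current guess of OPT (run with guesses v = 1, 2, 4, ...)."""
    rng = random.Random(seed)
    p = min(1.0, 10 * k / (eps ** 2 * v))
    h = [rng.random() < p for _ in range(n + 1)]   # h[e] for e in [n]
    return [frozenset(e for e in S if h[e]) for S in sets]

# Usage: solve MaxCov on the sparsified sets; w.h.p. an alpha-approximation
# there is an (alpha - eps)-approximation for the original instance.
```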

4 Dynamic Streams

Our main algorithm for dynamic streams is based on the following variant (Algorithm 1) of the quantized greedy algorithm. It requires an estimate $v$ satisfying $\operatorname{OPT}/2\leq v<\operatorname{OPT}$. To describe it compactly, we set up some notation. Let $\mathcal{S}$ denote the collection of sets yielded by the input stream. For a given $C\subset[n]$, define the subcollections

$$\mathcal{F}_{i}^{C}=\{S\in\mathcal{S}:\theta_{i}\leq|S\setminus C|<\theta_{i-1}\}\,,\quad\text{for }i\in[\ell]\,,$$
$$\mathcal{G}_{i}^{C}=\{S\in\mathcal{S}:\tau_{i}\leq|S\setminus C|<\tau_{i-1}\}\,,\quad\text{for }i\in[1+\lceil\log_{1+\varepsilon}(16e)\rceil]\,,$$

where

$$\ell=\left\lfloor\log k\right\rfloor\,,\quad\theta_{i}:=\frac{2v}{2^{i}}\,,\quad\text{and}\quad\tau_{i}:=\frac{2v/2^{\ell}}{(1+\varepsilon)^{i-1}}\,.$$

In the pseudocode below, the computed solution $Y$ is a set of IDs and the set $C$ is maintained to be the subset of the universe covered by the sets represented in $Y$. The macro grow-solution$(Y,S,C,\theta)$ (being a macro, it can cause the invoking algorithm to return) implements the following logic: if $|S\setminus C|\geq\theta$, then update $Y\leftarrow Y\cup\{\textsc{id}(S)\}$ (i.e., add the set $S$ to the solution) and $C\leftarrow C\cup S$; if this makes $|Y|=k$, then stop and return $Y$.

Algorithm 1 Quantized greedy adapted for dynamic set streams

1: start with an empty solution ($Y\leftarrow\varnothing$) and let $C\leftarrow\varnothing$
2: while $\mathcal{F}_{1}^{C}\cup\cdots\cup\mathcal{F}_{\ell}^{C}\neq\varnothing$ do
3:    for $i\in[\ell]$: sample $2^{i}$ sets from $\mathcal{F}_{i}^{C}$ with replacement
4:    for $i\in[\ell]$ and each set $S$ sampled from $\mathcal{F}_{i}^{C}$: grow-solution$(Y,S,C,\theta_{i})$
5: for $i\leftarrow 1$ to $1+\lceil\log_{1+\varepsilon}(16e)\rceil$ do
6:    while $\mathcal{G}_{i}^{C}\neq\varnothing$ do
7:       sample $k$ sets from $\mathcal{G}_{i}^{C}$ with replacement
8:       for each sampled set $S$: grow-solution$(Y,S,C,\tau_{i})$

We refer to the loop at line 2 as the first while loop, and to the loop at line 6 as the second while loop.
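The following is a minimal offline simulation of Algorithm 1, assuming explicit access to the collection of sets; in the streaming implementation, each sampling step is instead realized with $\ell_{0}$-sampling passes over set IDs, as in the proof of Theorem 4.2 below. The helper names are ours.

```python
import math
import random

def dynamic_quantized_greedy(sets_by_id, k, v, eps, rng=random.Random(0)):
    """Offline simulation of Algorithm 1.  sets_by_id maps set IDs to sets
    (as frozensets); v is an estimate with OPT/2 <= v < OPT."""
    ell = max(1, int(math.floor(math.log2(k))))
    theta = lambda i: 2 * v / 2 ** i
    tau = lambda i: (2 * v / 2 ** ell) / (1 + eps) ** (i - 1)
    Y, C = [], set()

    def grow(S, thresh):                    # the grow-solution macro
        if len(S - C) >= thresh:
            Y.append(S)
            C.update(S)
        return len(Y) == k

    def members(lo, hi):                    # the collections F_i^C / G_i^C
        return [S for S in sets_by_id.values() if lo <= len(S - C) < hi]

    while True:                             # first while loop
        buckets = {i: members(theta(i), theta(i - 1)) for i in range(1, ell + 1)}
        if not any(buckets.values()):
            break
        for i, pool in buckets.items():
            for S in (rng.choices(pool, k=2 ** i) if pool else []):
                if grow(S, theta(i)):
                    return Y, C
    for i in range(1, 2 + math.ceil(math.log(16 * math.e, 1 + eps))):
        while True:                         # second while loop
            pool = members(tau(i), tau(i - 1))
            if not pool:
                break
            for S in rng.choices(pool, k=k):
                if grow(S, tau(i)):
                    return Y, C
    return Y, C
```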

The most involved part of the analysis of Algorithm 1 is the proof of the following lemma, which we shall defer slightly.

Lemma 4.1.

With high probability, the first while loop (line 2 of Algorithm 1) makes $O(\log m)$ iterations, and each invocation of the second while loop (line 6) makes $O(\log m/\log\log m)$ iterations.

Theorem 4.2.

There is a $(1-1/e-\varepsilon)$-approximation algorithm for maximum coverage in the dynamic stream model that uses $\tilde{O}(k/\varepsilon^{2})$ space and $O((1+\varepsilon^{-1}/\log\log m)\log m)$ passes.

Proof.

The idea is to implement Algorithm 1 in a small number of space-efficient streaming passes. To sample from $\mathcal{F}_{i}^{C}$ and check the non-emptiness condition of the first while loop, we use an $\ell_{0}$-sampling pass on the IDs of the sets where $|S\setminus C|$ is in the relevant range. Using an additional pass, we store $\{S\setminus C\}$ for all the sampled sets, so that we can implement the logic of grow-solution. These sets contain at most

$$\sum_{i=1}^{\ell}2^{i}\cdot 2v/2^{i-1}=4v\ell=O(\varepsilon^{-2}k\log k)$$

elements. Similarly, we can sample sets from each $\mathcal{G}_{i}^{C}$ while only storing $O(\varepsilon^{-2}k\log k)$ elements.

Since each iteration of each while loop corresponds to two streaming passes, Lemma 4.1 shows that w.h.p. the algorithm uses at most the claimed number of passes. The approximation guarantee follows from Theorem 3.2, since

$$\tau_{1}=2v/2^{\ell}>\operatorname{OPT}/k$$

and

$$\tau_{1+\left\lceil\log_{1+\varepsilon}(16e)\right\rceil}=\frac{2v/2^{\ell}}{(1+\varepsilon)^{\left\lceil\log_{1+\varepsilon}(16e)\right\rceil}}<\frac{4\operatorname{OPT}}{16ek}=\frac{\operatorname{OPT}}{4ek}\,.\qquad\qed$$

It remains to prove Lemma 4.1. We need to understand how the collections $\mathcal{F}_{i}^{C}$ and $\mathcal{G}_{i}^{C}$ evolve as we grow our solution $Y$, thus changing $C$. To this end, we introduce two stochastic urn processes: the first (and simpler) process models the evolution of the collection $\mathcal{G}_{i}^{C}$, for a particular $i$; the second process models the evolution of the ensemble $\{\mathcal{F}_{i}^{C}\}$. How exactly the urn processes model these evolutions is explained in Section 4.3.

4.1 The Single Urn Process and its Analysis

An urn contains $m$ balls, each initially gold. Balls are drawn from the urn in phases, where a phase consists of drawing $d$ balls uniformly at random, with replacement. When a ball is drawn, if it is gold, then we earn one point and some arbitrary subset of the balls in the urn turn into lead, including the ball that was just drawn (and put back). At the end of the phase, all lead balls are removed from the urn and the next phase begins. The process ends when either $d$ points have been earned or the urn is empty.
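A small simulation sketch may help fix intuition; the adversary argument below is a stand-in for the arbitrary gold-to-lead conversions (it decides how many further gold balls turn to lead after each gold draw).

```python
import random

def single_urn(m, d, adversary, rng=random.Random(0)):
    """Simulation sketch of the single urn process.  adversary(g) says how
    many of the g remaining gold balls also turn to lead after a gold draw
    (the drawn ball always turns).  Returns the number of phases used."""
    gold, lead, points, phases = m, 0, 0, 0
    while gold + lead > 0 and points < d:
        phases += 1
        for _ in range(d):
            if rng.randrange(gold + lead) < gold:       # drew a gold ball
                points += 1
                turned = 1 + min(adversary(gold - 1), gold - 1)
                gold -= turned
                lead += turned
                if points == d:
                    break
        lead = 0                                        # remove all lead balls
    return phases

# Example: an adversary that turns half the remaining gold balls to lead.
print(single_urn(m=10**5, d=64, adversary=lambda g: g // 2))
```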

We will prove the following result about this process.

Theorem 4.3.

With probability $\geq 9/10$, the urn process ends within $O(\log m/\log\log m)$ phases.

The intuition behind the result is as follows. If the fraction of gold balls in the urn remains above $d^{\prime}/d$, then each draw is gold with probability at least $d^{\prime}/d$, so it would be reasonable to expect at least $(d^{\prime}/d)\times d=d^{\prime}$ of the draws to be gold. Define $m(i)$ to be the number of gold balls left in the urn after $i-1$ rounds and let $d(i)$ be the number of gold balls observed in the $i$th round. For the sake of intuition (the formal proof follows shortly), suppose this observation holds for all rounds even though $d(i)$ and $m(i+1)$ are dependent, i.e., $m(i+1)/m(i)\leq d(i)/d$ for all $i$; then

$$\frac{m(i+1)}{m}=\prod_{j=1}^{i+1}\frac{m(j)}{m(j-1)}\leq\prod_{j=1}^{i}\frac{d(j)}{d}\leq\left(\frac{\sum_{j=1}^{i}d(j)/d}{i}\right)^{i}\leq 1/i^{i}\,,$$

by the AM-GM inequality and the fact that every drawn gold ball is turned into lead, so it contributes to at most one term $d(j)$; hence $\sum_{j}d(j)/d\leq 1$. Consequently, $i$ can be at most $O(\log m/\log\log m)$ before $m(i+1)$ becomes zero.

Our result is stronger than analogous results in previous work in two specific ways. Previous analyses essentially (a) need a factor $O(\log m)$ more draws in each phase, and (b) only establish that $O(\log m)$ phases suffice, rather than $O(\log m/\log\log m)$ phases. The improvements in our approach stem from avoiding the need to guarantee that each phase makes progress with high probability, and from considering both the progress made towards observing $d$ gold balls and the progress made in terms of turning balls into lead.

Proof of Theorem 4.3.

Let $m(j)$ be the number of balls in the urn after the $j$th phase ends and all lead balls are removed. Also, let $m(0)=m$. Put

$$\gamma=\log\log m/\log m\qquad(1)$$

and assume $d\geq 12/\gamma$; otherwise the result follows trivially, since each phase earns at least one point. The $j$th phase is deemed successful if, during it, either $\gamma d/2$ points are earned or the fraction of gold balls drops below $\gamma$ (causing $m(j)/m(j-1)<\gamma$). In Lemmas 4.4 and 4.5, we will establish that each phase is successful with probability at least $1/2$, even when conditioned on the outcomes of previous phases. Thus, by a Chernoff bound, after $10/\gamma$ phases, the probability that there were at least $4/\gamma$ successful phases is at least $1-\exp(-O(1/\gamma))$. If $4/\gamma$ phases are successful, either we have earned $(2/\gamma)(\gamma d/2)=d$ points or the number of balls in the urn has dropped from $m(0)=m$ to at most $\gamma^{2/\gamma}m$. By eq. (1), this number is less than $1$, implying that the urn is empty. ∎

We turn to lower bounding the success probability of a phase, as claimed in the above proof. Fix a particular phase and a particular realization of all previous phases of the stochastic process. Let $G_{i}$ be the event that the $i$th draw of this phase reveals a gold ball (and thus earns a point); let $L_{i}$ be the event that after $i-1$ draws, the fraction of gold balls in the urn is less than $\gamma$. Define indicator variables $X_{i}=\mathbbm{1}_{G_{i}\cup L_{i}}$ and put $X=\sum_{i=1}^{d}X_{i}$. Then,

$$\mathbb{E}[X_{i}]=\Pr[L_{i}]+\Pr[G_{i}\cap\overline{L}_{i}]\geq\Pr[G_{i}\mid\overline{L}_{i}]\geq\gamma\,.\qquad(2)$$
Lemma 4.4.

If $X\geq\gamma d/2$, then the phase is successful.

Proof.

If $X\geq\gamma d/2$, either some event $L_{i}$ occurs or we collect $\gamma d/2$ gold balls (and points). In the latter case, the phase is successful by definition. In the former case, the fraction of gold balls drops below $\gamma$ during the phase. Since lead never turns into gold, this fraction remains below $\gamma$. Upon removing lead balls at the end of the phase, the number of balls in the urn drops to at most $\gamma$ times the number at the start of the phase, i.e., the phase is successful. ∎

Finally, we show that $X$ stochastically dominates a binomial random variable, which then lower-bounds $\Pr[X\geq\gamma d/2]$. Denote Bernoulli and binomial distributions as "Bern" and "Bin."

Lemma 4.5.

$\Pr[X\geq\gamma d/2]\geq 1-\exp(-\gamma d/12)\geq 1/2$, where the latter inequality uses our assumption that $d\geq 12/\gamma$.

Proof.

Let $Y_{1},\ldots,Y_{d}$ be independent draws from $\operatorname{Bern}(\gamma)$. Put $X^{\leq j}=\sum_{i=1}^{j}X_{i}$ and $Y^{\leq j}=\sum_{i=1}^{j}Y_{i}$. We first show by induction on $j$ that $X^{\leq j}$ stochastically dominates $Y^{\leq j}$ for all $j$. Inequality (2) with $i=1$ establishes the base case of $j=1$. Assuming the result for $j$, for an arbitrary integer $z$, we have

$$\begin{aligned}
\Pr[X^{\leq j+1}\geq z]&=\Pr[X^{\leq j}\geq z]+\Pr[X^{\leq j}=z-1]\Pr[X_{j+1}=1\mid X^{\leq j}=z-1]\\
&\geq\Pr[X^{\leq j}\geq z]+\Pr[X^{\leq j}=z-1]\,\gamma\\
&=\Pr[X^{\leq j}\geq z](1-\gamma)+\Pr[X^{\leq j}\geq z-1]\,\gamma\\
&\geq\Pr[Y^{\leq j}\geq z](1-\gamma)+\Pr[Y^{\leq j}\geq z-1]\,\gamma\\
&=\Pr[Y^{\leq j+1}\geq z]\,.
\end{aligned}$$

Thus, $\Pr[X\geq\gamma d/2]\geq\Pr[Y\geq\gamma d/2]$. The lemma now follows by applying a Chernoff bound to $Y\sim\operatorname{Bin}(d,\gamma)$. ∎

4.2 Cascading Urns

To prove the first part of Lemma 4.1, we will also need to analyze a more complex version of the above process; we call it the cascading urn process. Now we have $t$ urns, numbered $1$ through $t$, with the $r$th urn starting out with $m_{r}$ balls; let $m=\sum_{r\in[t]}m_{r}$. Drawing a gold ball from urn $r$ is worth $d/2^{r}$ points. Initially, all balls are gold. The process works as follows.

  • Balls are drawn in phases: a phase consists of drawing $2^{r}$ balls from urn $r$, in order of increasing $r$.

  • When a ball is drawn from urn $r$, if it is gold, the following things happen: (i) it turns into lead and is returned to urn $r$; (ii) some arbitrary subset of the balls in the urns (all urns, not just urn $r$) turn into lead; and (iii) we earn $d/2^{r}$ points.

  • At the end of each phase, each lead ball in an urn is removed from its current urn and either destroyed or transformed back into gold and placed in a higher-numbered urn. We then start the next phase with all balls being gold again.

The process ends when either $d$ points have been earned or all urns have become empty.

Theorem 4.6.

With probability at least $9/10$, the cascading urn process ends within $O(t+\log m)$ phases.

Proof.

Fix a particular phase. For each $r\in[t]$, the analysis of one phase of the single urn process applied to urn $r$ establishes that, with probability at least

$$\Pr\left[\operatorname{Bin}\left(2^{r},\tfrac{1}{2}\right)\geq\tfrac{2^{r}}{4}\right]\geq 1-\exp\left(-\tfrac{2^{r-1}}{12}\right)\geq 1-e^{-1/12}>\tfrac{1}{13}=:4\lambda\,,$$

either at least $d/2$ points are earned from this urn (call this event $G_{r}$) or the fraction of gold balls in this urn drops below $1/2$ (call this event $L_{r}$). Urn $r$ is deemed successful in this phase if $G_{r}\cup L_{r}$ occurs. Put $Z_{r}=\mathbbm{1}_{G_{r}\cup L_{r}}$ and note $\mathbb{E}[Z_{r}]>4\lambda$.

Let $\hat{m}_{r}$ be the number of balls in urn $r$ at the start of the phase we are considering (recall that these are all gold balls) and let $\hat{m}=\sum_{r\in[t]}\hat{m}_{r}$. If $\bigcup_{r}G_{r}$ does not occur, then at the end of the phase, the number of gold balls across all urns is at most

$$Z:=\sum_{r\in[t]}\left(Z_{r}\hat{m}_{r}/2+(1-Z_{r})\hat{m}_{r}\right)=\sum_{r\in[t]}(1-Z_{r}/2)\hat{m}_{r}\,.$$

Using the lower bound on $\mathbb{E}[Z_{r}]$, we get

$$\mathbb{E}[Z]\leq\sum_{r\in[t]}(1-2\lambda)\hat{m}_{r}=(1-2\lambda)\hat{m}\,.$$

Applying a Markov bound gives

$$\Pr[Z\leq(1-\lambda)\hat{m}]\geq 1-\mathbb{E}[Z]/((1-\lambda)\hat{m})\geq\lambda/(1-\lambda)\,.$$

Hence, with probability at least $\lambda/(1-\lambda)$, either $\bigcup_{r}G_{r}$ occurs (in which case we collect $d/2$ points) or the fraction of balls that remain gold until the end of the phase is at most $1-\lambda$.

Let $m_{r,p}$ denote the number of balls in urn $r$ after the $p$th phase ends (and all lead balls are either reconverted to gold or destroyed). Also, let $m_{r,0}=m_{r}$. Consider the quantity

$$Q_{p}:=\sum_{r\in[t]}2^{t-r}m_{r,p}\,.$$

Note that $Q_{0}\leq 2^{t-1}m$ and if $Q_{p}<1$ for some $p$, then the urns must be empty after phase $p$. We claim that if the fraction of balls that remain gold in phase $p$ is at most $1-\lambda$, then $Q_{p}\leq(1-\lambda/2)Q_{p-1}$. This claim holds because (a) after moving a ball to a higher-numbered urn or destroying it, the contribution of that ball to $Q_{p}$ goes down by a factor of $2$ or more; and (b) at least a $\lambda$ fraction of the balls turn to lead and are therefore either moved or destroyed. Thus, at least a $\lambda$ fraction of the quantity $Q_{p-1}$ is at least halved by the end of the phase, proving the claim.

Hence, with probability at least $\lambda/(1-\lambda)$, in each phase, we either collect $d/2$ points or $Q_{p}$ reduces by a factor $\mu:=1-\lambda/2$. By a Chernoff bound, after $p=O(\log Q_{0})=O(t+\log m)$ phases, either we have collected more than $2\cdot d/2=d$ points or $Q_{p}<1$, implying that all the urns are empty. ∎

4.3 Applications to Our Algorithm

We now circle back to Algorithm 1. To finish its analysis, we need to prove Lemma 4.1, which we can now do, as an application of what we have established about these urn processes.

To analyze the first while loop, apply Theorem 4.6 with $t=\left\lfloor\log k\right\rfloor$ as follows. The balls in the $i$th urn correspond to the sets in $\mathcal{F}_{i}^{C}$. A ball is gold if the corresponding set $S$ satisfies $|S\setminus C|\geq\theta_{i}$; otherwise, it is lead. Adding a set $S$ to the candidate solution $Y$ grows the set $C$ of covered elements, thereby turning gold balls to lead in some complicated way that the algorithm cannot easily track. The bound on the number of phases until the urn process terminates translates to a bound of $O(\log k+\log m)$ on the number of iterations of that while loop. Noting that $k\leq m$ gives Lemma 4.1.

Similarly, Theorem 4.3 implies that the number of iterations of each invocation of the second while loop is $O(\log m/\log\log m)$. This completes the analysis.

5 Random Order Model

In this section, we consider the stream to be a random permutation of the collection of sets. We will show that it is possible to approximate the maximum coverage problem up to a factor $1-1/e-\varepsilon$ using $O_{\varepsilon}(k\log m)$ space in a single pass. As mentioned in the introduction, our approach is based on modifying an algorithm by Agrawal et al. [ASS19] for cardinality-constrained monotone submodular maximization. This is a more general problem than maximum coverage, but their algorithm assumes oracle access to the function being optimized. For maximum coverage, we need to store enough information to be able to simulate this oracle. An $O_{\varepsilon}(k^{2}\log m)$-space algorithm for maximum coverage follows immediately via the universe subsampling technique discussed in Section 3, since we may assume that every set has size $O_{\varepsilon}(k\log m)$. To reduce the dependence on $k$ to linear rather than quadratic, we need a more careful adaptation of the algorithm of Agrawal et al. [ASS19], ensuring that the algorithm can be implemented by maintaining a set of size $O_{\varepsilon}(k\log m)$, corresponding to the items covered by the collection of sets chosen for the solution so far, along with $O_{\varepsilon}(1)$ sets of size $O_{\varepsilon}(\log m)$.

5.1 Warm-Up

To motivate the approach, consider a variant of the problem in which an algorithm is presented with a random permutation of the input sets, but we make the following two changes:

  1. For a fixed optimal solution consisting of sets $O_{1},\ldots,O_{k}$, replace each occurrence of these sets in the stream by a set drawn uniformly at random from $\{O_{1},\ldots,O_{k}\}$.

  2. We give the algorithm foreknowledge of a way to segment the stream into $k$ contiguous sub-streams $\mathcal{C}_{1},\ldots,\mathcal{C}_{k}$ such that each contains exactly one element of $\{O_{1},\ldots,O_{k}\}$.

Then consider the greedy algorithm that outputs the sequence $\langle S_{1},\ldots,S_{k}\rangle$ where

$$S_{j}=\operatorname{argmax}_{S\in\mathcal{C}_{j}}|S\setminus(S_{1}\cup\cdots\cup S_{j-1})|\,.$$

To analyze this algorithm, define

$$u_{j}=\operatorname{OPT}-|S_{1}\cup\cdots\cup S_{j}|\quad\text{and}\quad F_{j}=\frac{|S_{j}\setminus(S_{1}\cup\cdots\cup S_{j-1})|}{u_{j-1}}\,,$$

and note that

$$|S_{1}\cup\cdots\cup S_{k}|=\operatorname{OPT}-u_{k}=\operatorname{OPT}-\operatorname{OPT}\cdot\prod_{j=1}^{k}(1-F_{j})\,,$$

since for all $j\in[k]$,

$$u_{j}=\operatorname{OPT}-|S_{1}\cup\cdots\cup S_{j-1}|-|S_{j}\setminus(S_{1}\cup\cdots\cup S_{j-1})|=(1-F_{j})u_{j-1}\,.$$

Note also that

$$\mathbb{E}\left[F_{j}\mid S_{1},\ldots,S_{j-1}\right]\geq\sum_{i=1}^{k}\frac{|O_{i}\setminus(S_{1}\cup\cdots\cup S_{j-1})|}{u_{j-1}}\cdot\Pr[O_{i}\in\mathcal{C}_{j}]=\frac{1}{k}\sum_{i=1}^{k}\frac{|O_{i}\setminus(S_{1}\cup\cdots\cup S_{j-1})|}{u_{j-1}}\geq 1/k\,.$$

Hence,

$$\mathbb{E}[|S_{1}\cup\cdots\cup S_{k}|]=\operatorname{OPT}\left(1-\mathbb{E}\Big[\prod_{j=1}^{k}(1-F_{j})\Big]\right)\geq\operatorname{OPT}\left(1-(1-1/k)^{k}\right)\geq(1-1/e)\operatorname{OPT}\,.$$

The space required to implement this algorithm is $O(k\varepsilon^{-2}\log m)$ since it suffices to maintain the union of the sets chosen thus far, rather than the sets themselves.
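A sketch of this warm-up algorithm, assuming the promised segmentation is handed to us as explicit lists of sets:

```python
def segmented_greedy(segments):
    """Sketch of the warm-up algorithm of Section 5.1: given the promised
    segmentation C_1, ..., C_k of the stream (each segment a list of sets),
    pick from each segment the set with the largest marginal coverage.
    Only the running union C is stored, never the chosen sets themselves,
    which is what brings the space down to O(k eps^-2 log m) after
    universe subsampling."""
    C = set()
    for segment in segments:
        best = max(segment, key=lambda S: len(S - C))
        C |= best
    return C

# Toy usage with k = 2 segments.
segments = [[{1, 2, 3}, {3, 4}], [{2, 3}, {5, 6, 7}]]
print(len(segmented_greedy(segments)))  # 6: picks {1,2,3} then {5,6,7}
```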

To handle the original random order setting, we obviously need to deal with the fact that the algorithm does not have foreknowledge of a segmentation of the stream such that each segment contains exactly one set from the optimum solution. A naive approach would be to segment the stream into $\beta k$ contiguous segments, for some large constant $\beta>0$, with the hope that few pairs of optimum sets appear in the same segment. We could then consider all $\binom{\beta k}{k}=e^{O(k)}$ subsequences of these segments, but this is clearly inefficient. A better approach is to use a limited number of guesses in parallel, following Agrawal et al. [ASS19]. More complicated to analyze, especially when combined with the goal of using limited space, is the fact that, because the $O_{i}$-sets appear randomly permuted rather than independently sampled, there are now various dependencies to contend with.

5.2 General Setting

We will randomly partition the input collection $\mathcal{C}$ of sets into $k\beta$ groups $\mathcal{C}_{1},\ldots,\mathcal{C}_{k\beta}$. These groups will ultimately correspond to segments of the stream, i.e., the first $|\mathcal{C}_{1}|\sim\operatorname{Bin}(m,1/(k\beta))$ sets define $\mathcal{C}_{1}$, etc. For any

$$\mathcal{I}=\{i_{1},i_{2},\ldots\}\subseteq\{1,\ldots,k\beta\}\,,$$

we define a sequence of groups

$$\Sigma_{\mathcal{I}}=\langle\mathcal{C}_{i_{1}},\mathcal{C}_{i_{2}},\ldots,\mathcal{C}_{i_{|\mathcal{I}|}}\rangle\,,\quad\text{where }i_{1}<i_{2}<\cdots<i_{|\mathcal{I}|}\,.$$

At the heart of the algorithm is a greedy process run on such sequences. For $\mathcal{I}=\{i_{1},i_{2},\ldots\}\subseteq\{1,\ldots,k\beta\}$, $C\subset[n]$, and a "reserve" collection of sets $\mathcal{R}$ (the name will become clear later), define the greedy sequence

$$\sigma_{\mathcal{I}}(C,\mathcal{R})=\langle S_{1},S_{2},\ldots,S_{|\mathcal{I}|}\rangle\,,\qquad(3)$$

where

$$S_{j}=\operatorname{argmax}_{S\in\mathcal{R}\cup\mathcal{C}_{i_{j}}}|S\setminus(C\cup S_{1}\cup\cdots\cup S_{j-1})|\,.$$
Algorithm 2 Max Coverage for Random Order Streams

1: initialize $\mathcal{R}\leftarrow\varnothing$, $C\leftarrow\varnothing$, $k^{\prime}\leftarrow 0$
2: for $i\leftarrow 1$ to $k/\alpha$ do
3:    Compute $\sigma_{\mathcal{I}}(C,\mathcal{R})$ for all $\mathcal{I}\in P_{i}$, where $P_{i}$ consists of all size-$k^{+}$ subsets, $k^{+}\equiv\min\{\alpha,k-k^{\prime}\}$, of
$$W_{i}=\{\alpha\beta(i-1)+1,\ldots,\alpha\beta i\}\,.$$
We subsequently refer to $W_{i}$ as the $i$th window.
4:    Let
$$\mathcal{I}^{*}=\operatorname{argmax}_{\mathcal{I}\in P_{i}}\left|\,C\cup\Big(\bigcup_{S\in\sigma_{\mathcal{I}}(C,\mathcal{R})}S\Big)\,\right|$$
and update
$$C\leftarrow C\cup\Big(\bigcup_{S\in\sigma_{\mathcal{I}^{*}}(C,\mathcal{R})}S\Big)\quad\text{and}\quad k^{\prime}\leftarrow k^{\prime}+k^{+}\,.$$
5:    For all $\mathcal{I}\in P_{i}$, add all sets in $\sigma_{\mathcal{I}}(C,\mathcal{R})$ to $\mathcal{R}$. If there exists a set $S$ in $\mathcal{R}$ such that $|S\setminus C|\geq\operatorname{OPT}/k$, select such a set uniformly at random and update
$$C\leftarrow C\cup S\quad\text{and}\quad k^{\prime}\leftarrow k^{\prime}+1\,.$$
Repeat until either $k^{\prime}=k$ or no more such sets exist in $\mathcal{R}$.
6: end for

Algorithm 2 contains most details, but to fully specify the algorithm, we need to define the original groups $\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{k\beta}$. To do this, for each set $S$ define a random variable $Y_{S}$ that is uniform over $\{1,2,\ldots,k\beta\}$. Then, we define $\mathcal{C}_{i}=\{S:Y_{S}=i\}$. Equivalently, for a random permutation of the stream, $\mathcal{C}_{1}$ is the first $|\{S:Y_{S}=1\}|$ sets in the stream, $\mathcal{C}_{2}$ is the next $|\{S:Y_{S}=2\}|$ sets in the stream, etc. Given this definition of the stream, Algorithm 2 maintains sets whose total size is

$$|\operatorname{OPT}|+\alpha\binom{\alpha\beta}{\alpha}|\operatorname{OPT}|+(k/\alpha)\binom{\alpha\beta}{\alpha}\operatorname{OPT}/k=O_{\alpha,\beta}(\operatorname{OPT})\,,\qquad(4)$$

where the first term corresponds to maintaining $C$, the next term counts all sets appearing in greedy subsequences during the processing of each window (defined in line 3 of Algorithm 2), and the last term corresponds to storing sets in $\mathcal{R}$. For sets $S$ in $\mathcal{R}$, it suffices to store $S\setminus C$ rather than $S$ itself. Hence, the total number of elements in sets in $\mathcal{R}$ is at most $|\mathcal{R}|\operatorname{OPT}/k$.

The motivation for how we set $\alpha$ and $\beta$ is as follows. Let $\mathcal{O}=\{O_{1},\ldots,O_{k}\}$ be an optimum collection of $k$ sets. We say a group $\mathcal{C}_{i}$ is active if it contains a set from $\mathcal{O}$; note that if $\beta$ is large enough, the number of active groups $\kappa$ is close to $k$. Furthermore, in each window the expected number of active groups is $\kappa\cdot(\alpha\beta)/(k\beta)=\alpha\kappa/k\approx\alpha$. We will later set $\alpha$ to be large enough that the number of active groups in each window is typically close to $\alpha$, i.e., there is an $\alpha$-length subsequence of the window that mainly consists of active groups.
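For concreteness, here is a sketch of the work done in one window (lines 3 and 4 of Algorithm 2), enumerating the size-$k^{+}$ index subsets of the window and running the greedy sequence for each; the function name and calling convention are ours, and the caller is responsible for adding the returned greedy sets to $\mathcal{R}$ and applying the reserve step of line 5.

```python
from itertools import combinations

def process_window(window_groups, C, R, k_plus):
    """Sketch of one window of Algorithm 2.  window_groups is the list of
    alpha*beta groups C_j falling in this window, each a list of sets; C is
    the currently covered set and R the reserve collection.  Returns the
    covered set for the best greedy subsequence, plus all greedy sets seen."""
    best_C, seen = set(C), []
    for I in combinations(range(len(window_groups)), k_plus):
        cov = set(C)
        for j in I:                       # greedy choice over R and group C_j
            S = max(list(R) + window_groups[j],
                    key=lambda S: len(S - cov), default=frozenset())
            cov |= S
            seen.append(S)
        if len(cov) > len(best_C):        # keep the best subsequence's coverage
            best_C = cov
    return best_C, seen
```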

5.2.1 Analysis

The analysis in Agrawal et al. [ASS19] establishes that, during the $i$th window, the algorithm finds a collection of $\alpha$ sets $S_{1},\ldots,S_{\alpha}$ (via one of the greedy subsequences considered) that, as explained below, covers a significant number of new elements when added to the solution. For any subset $A\subset[n]$, define

$$u(A)=(1-\alpha/k)\operatorname{OPT}-|A|\,.$$

Lemma 15 of Agrawal et al. [ASS19] establishes that if $C_{i-1}$ is the set of elements covered by the sets chosen during the processing of the first $i-1$ windows, and $T_{i-1}$ is the entire set of greedy subsequences constructed during the first $i-1$ windows, then (the statement of [ASS19, Lemma 15] has the RHS of Eq. (5) as $(1-\delta)e^{-\alpha/k}u(C_{i-1})$, rather than the expression in Eq. (5), where the $(1-\delta)$ appears in the exponent; their proof actually only implies the weaker statement in Eq. (5), which fortunately is still sufficient for their purposes and ours)

$$\mathbb{E}[u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})\mid T_{i-1}]\leq e^{-\alpha(1-\delta)/k}u(C_{i-1})\,,\qquad(5)$$

where $\delta=\sqrt{8/\beta}$, assuming $k\geq\alpha\beta$ and $\alpha\geq 4\beta^{2}\log(\beta/8)$. The crucial observation is that their proof of Eq. (5) holds true regardless of how $C_{i-1}$ is constructed from the greedy subsequences in $T_{i-1}$. In particular, Eq. (5) holds true given our modification of the Agrawal et al. algorithm, i.e., the possibility of adding additional sets from $\mathcal{R}$ at the end of previous windows. At the end of the $i$th window, we potentially add additional sets $S_{\alpha+1},\ldots,S_{\alpha^{\prime}_{i}}$, where

$$\begin{aligned}
u(S_{1}\cup\cdots\cup S_{\alpha^{\prime}_{i}}\cup C_{i-1})&\leq u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})-(\alpha^{\prime}_{i}-\alpha)\operatorname{OPT}/k\\
&\leq(1-(\alpha^{\prime}_{i}-\alpha)/k)\cdot u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})\\
&\leq e^{-(\alpha^{\prime}_{i}-\alpha)/k}\cdot u(S_{1}\cup\cdots\cup S_{\alpha}\cup C_{i-1})\,.
\end{aligned}$$

Hence, in the $i$th window (except potentially in the window where we add the $k$th set to our solution), we add $\alpha^{\prime}_{i}\geq\alpha$ sets $S_{1},\ldots,S_{\alpha^{\prime}_{i}}$ such that, setting $C_{i}=S_{1}\cup\cdots\cup S_{\alpha^{\prime}_{i}}\cup C_{i-1}$,

$$\mathbb{E}[u(C_{i})\mid T_{i-1}]\leq e^{-\alpha^{\prime}_{i}(1-\delta)/k}\cdot u(C_{i-1})\,.\qquad(6)$$

Since there are $k/\alpha$ windows, and in each window we find a sequence of $\alpha^{\prime}_{i}\geq\alpha$ sets (until $k$ sets in total have been chosen) such that Eq. (6) holds, we have, for some $j\leq k/\alpha$ and with $\delta<\varepsilon/2<1$,

$$\mathbb{E}[(1-\alpha/k)\operatorname{OPT}-|C_{j}|]\leq(1-\alpha/k)\cdot\operatorname{OPT}\cdot\prod_{i\geq 1}e^{-\alpha^{\prime}_{i}(1-\delta)/k}=(1-\alpha/k)e^{-1+\delta}\operatorname{OPT}<(1-\alpha/k)(1/e+\varepsilon/2)\operatorname{OPT}\,,$$

where we used the fact that $\sum_{i}\alpha^{\prime}_{i}=k$ and $\exp(-1+x)\leq 1/e+x$ for $0<x<1$. Hence, $\mathbb{E}[|C_{j}|]\geq(1-1/e-\varepsilon)\operatorname{OPT}$, assuming $\alpha/k\leq\varepsilon/2$. Setting the constants to $\alpha=4096\varepsilon^{-4}\log(2/\varepsilon)$ and $\beta=32\varepsilon^{-2}$ ensures all the necessary conditions are met, assuming $k=\omega(1)$. With these values of $\alpha$ and $\beta$, we have $\binom{\alpha\beta}{\alpha}\leq\exp(O(\varepsilon^{-4}\ln^{2}\varepsilon^{-1}))$. Hence, the space dependence on $\varepsilon$ in Eq. (4) becomes $O(k\log m\cdot\exp(O(\varepsilon^{-4}\ln^{2}\varepsilon^{-1})))$, since we may assume $\operatorname{OPT}=O(k\varepsilon^{-2}\log m)$ and the exponential dependencies on $\varepsilon$ dominate the polynomial dependencies.

6 Fast Algorithms

6.1 Fast Universe Sub-Sampling

The universe sub-sampling approach described in Section 3 requires the use of a $\gamma$-wise independent hash function where $\gamma=O(k\log m)$. This bound on the independence was an improvement by Jaud et al. [JWC23] over the bound of $O(\varepsilon^{-2}k\log m)$ originally given by McGregor and Vu [MV19a]. Jaud et al. were motivated to improve the bound not for the sake of reducing the space of any streaming algorithm, but rather to reduce the update time from $O(\varepsilon^{-2}k\log m)$ to $O(k\log m)$. However, we note that the update time, at least in an amortized sense, can actually be reduced to polylogarithmic. Specifically, we can use the hash family given by degree-$\gamma$ polynomials over a suitably large finite field [WC79]. While evaluating such a polynomial at a single point (which corresponds to hashing a single element) requires $\Omega(\gamma)$ time, evaluating the polynomial at $\gamma$ points can be done in $O(\gamma\log^{2}\gamma\log\log\gamma)$ time [vzGG99, Chapter 10]. Hence, by buffering sets of $O(k\log m)$ elements, we may ensure that the hash function can be applied with only $\operatorname{poly}(\log k,\log\log m)$ amortized update time. We note that a similar approach was used by Kane et al. [KNPW11] in the context of frequency moment estimation.
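A sketch of the buffering idea follows. The batch evaluation shown is the naive Horner loop, so it does not by itself achieve the stated amortized bound; the point is the structure into which the fast multipoint evaluation of [vzGG99] would be substituted. The class name and interface are ours.

```python
class BufferedHasher:
    """Sketch of the buffering idea from Section 6.1: the degree-gamma
    polynomial hash is applied to elements in batches of gamma, so a
    batch evaluation running in O(gamma polylog gamma) time amortizes
    to polylogarithmic time per element."""

    def __init__(self, coeffs, p, on_hash):
        self.coeffs, self.p = coeffs, p      # polynomial over F_p, degree gamma
        self.buffer, self.on_hash = [], on_hash
        self.gamma = len(coeffs)

    def _multipoint_eval(self, points):      # placeholder for the fast version
        out = []
        for x in points:
            y = 0
            for c in self.coeffs:            # Horner's rule, O(gamma) per point
                y = (y * x + c) % self.p
            out.append(y)
        return out

    def feed(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.gamma:   # flush a full batch
            for e, h in zip(self.buffer, self._multipoint_eval(self.buffer)):
                self.on_hash(e, h)           # deliver (element, hash) downstream
            self.buffer.clear()
```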

6.2 Fast $\ell_{0}$ Sampling

In the dynamic algorithm, we needed to sample $r$ sets with replacement from a stream in the presence of insertions and deletions. This can be done via $\ell_{0}$ sampling, as discussed in the preliminaries. However, a naive implementation of this process requires $\Omega(r)$ update time, since $r$ different $\ell_{0}$ samplers need to be updated with every set insertion or deletion. To avoid this, a technique of McGregor et al. [MTVV15] can be used. Specifically, we use an $O(\log r)$-wise independent hash function to partition $[M]$ into $t:=r/\log r$ groups $P_{1},P_{2},\ldots,P_{t}$. During the stream, we compute

$$\rho_{i}=|\{x\in P_{i}:\textstyle\sum_{j:x_{j}=x}\Delta_{j}\neq 0\}|\,,$$

i.e., the number of values in $P_{i}$ that are inserted a different number of times than they are deleted. Note that it is simple to compute $\rho_{1},\ldots,\rho_{t}$ under the assumption that $\sum_{j:x_{j}=x}\Delta_{j}\in\{0,1\}$, i.e., every set is inserted either the same number of times it is deleted or exactly one extra time. We also compute $2\log r$ independent $\ell_{0}$ samplers for each $P_{i}$; note that each set insertion or deletion requires updating at most $2\log r$ independent $\ell_{0}$ samplers. Then, at the end of the stream, to generate a sequence of samples with replacement, we first sample $i$ with probability $\rho_{i}/\sum_{j}\rho_{j}$ and then use the next unused $\ell_{0}$ sampler from group $P_{i}$. Assuming $r\ll|\{x\in[M]:\sum_{j:x_{j}=x}\Delta_{j}\neq 0\}|$ (if this is not the case, we can just use sparse recovery), then with high probability each $\rho_{i}/\sum_{j}\rho_{j}\leq 2(\log r)/r$, and the process will use at most $2\log r$ samplers from each group in expectation, and fewer than $4\log r$ with probability at least $1-1/\operatorname{poly}(r)$.
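The following sketch puts the pieces together, reusing the toy L0Sampler interface sketched in Section 3 as a black box; Python's built-in hash stands in for the $O(\log r)$-wise independent hash function, and we assume the stream leaves each net count in $\{0,1\}$.

```python
import math
import random

def sample_with_replacement(M, stream, r, seed=0):
    """Sketch of the grouping trick of Section 6.2.  Items of [M] are hashed
    into t = r / log r groups; each update touches only its own group's
    2*log(r) samplers, and per-group counters rho_i drive the resampling."""
    rng = random.Random(seed)
    L = max(1, math.ceil(math.log2(r)))
    t = max(1, r // L)
    group = lambda x: hash((seed, x)) % t
    rho = [0] * t
    samplers = [[L0Sampler(M, seed=rng.randrange(1 << 30)) for _ in range(2 * L)]
                for _ in range(t)]
    for x, delta in stream:              # assumes net count of each x is in {0,1}
        rho[group(x)] += delta
        for s in samplers[group(x)]:     # only this group's samplers are touched
            s.update(x, delta)
    out, used = [], [0] * t
    if not any(rho):
        return out
    for _ in range(r):                   # sample group ~ rho, then next sampler
        i = rng.choices(range(t), weights=rho, k=1)[0]
        if used[i] < 2 * L:              # w.h.p. no group is exhausted
            out.append(samplers[i][used[i]].sample())
            used[i] += 1
    return out
```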

7 Conclusion

We presented new algorithms for $(1-1/e-\varepsilon)$-approximation of the maximum coverage problem in the data stream model. These algorithms improve upon the state-of-the-art results by (a) reducing the space required in the dynamic model from $(n+\varepsilon^{-4}k)\operatorname{polylog}(m,n)$ to $\varepsilon^{-2}k\operatorname{polylog}(m,n)$ when given $O(\varepsilon^{-1}\log m)$ passes, (b) reducing the space required in the single-pass random order model from $\varepsilon^{-2}k^{2}\operatorname{polylog}(m,n)$ to $O_{\varepsilon}(k\operatorname{polylog}(m,n))$, although we emphasize that the $O_{\varepsilon}$ hides an exponential dependence on $\varepsilon^{-1}$, and (c) reducing the amortized update time to polylogarithmic in $m$ and $n$. We conjecture that $\varepsilon^{-2}k\operatorname{polylog}(m,n)$ space should be sufficient in the random order model. In fact, there is a monotone submodular optimization algorithm in the random order model that, when applied to the maximum coverage problem, would maintain only $O(\varepsilon^{-1}k)$ sets [LRVZ21a]. Given that, via universe sub-sampling, we may assume that the sets have size $O(\varepsilon^{-2}k\log m)$; this yielded the Warneke et al. [WCW23] result. However, it seems unlikely that the algorithm of Liu et al. [LRVZ21a] can be modified to ensure that the total number of elements across all sets maintained is $O(\varepsilon^{-2}k\log m)$, so a new approach is likely to be necessary.

References

  • [AB21] Sepehr Assadi and Soheil Behnezhad. Beating two-thirds for random-order streaming matching. In 48th International Colloquium on Automata, Languages and Programming (ICALP), page 19, 2021.
  • [ABB+15] Aris Anagnostopoulos, Luca Becchetti, Ilaria Bordino, Stefano Leonardi, Ida Mele, and Piotr Sankowski. Stochastic query covering for fast approximate document retrieval. ACM Transactions on Information Systems (TOIS), 33(3):1–35, 2015.
  • [ABG+12] Giorgio Ausiello, Nicolas Boria, Aristotelis Giannakos, Giorgio Lucarelli, and Vangelis Th. Paschos. Online maximum k-coverage. Discrete Applied Mathematics, 160(13-14):1901–1913, 2012.
  • [ACG+15] Kook Jin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, and Anthony Wirth. Correlation clustering in data streams. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2237–2246. JMLR.org, 2015.
  • [AGM12a] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements. In SODA, pages 459–467. SIAM, 2012.
  • [AGM12b] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Graph sketches: sparsification, spanners, and subgraphs. In PODS, pages 5–14. ACM, 2012.
  • [AGM13] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Spectral sparsification in dynamic graph streams. In APPROX-RANDOM, volume 8096 of Lecture Notes in Computer Science, pages 1–10. Springer, 2013.
  • [AK18] Sepehr Assadi and Sanjeev Khanna. Tight bounds on the round complexity of the distributed maximum coverage problem. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2412–2431. SIAM, 2018.
  • [AKL16] Sepehr Assadi, Sanjeev Khanna, and Yang Li. Tight bounds for single-pass streaming complexity of the set cover problem. In STOC, pages 698–711. ACM, 2016.
  • [AKLY16] Sepehr Assadi, Sanjeev Khanna, Yang Li, and Grigory Yaroslavtsev. Maximum matchings in dynamic graph streams and the simultaneous communication model. In SODA, pages 1345–1364. SIAM, 2016.
  • [Ass17] Sepehr Assadi. Tight space-approximation tradeoff for the multi-pass streaming set cover problem. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 321–335, 2017.
  • [ASS19] Shipra Agrawal, Mohammad Shadravan, and Cliff Stein. Submodular secretary problem with shortlists. In Avrim Blum, editor, 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, volume 124 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [BEM17] MohammadHossein Bateni, Hossein Esfandiari, and Vahab S. Mirrokni. Almost optimal streaming algorithms for coverage problems. In Christian Scheideler and Mohammad Taghi Hajiaghayi, editors, Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, pages 13–23. ACM, 2017.
  • [BHNT15] Sayan Bhattacharya, Monika Henzinger, Danupon Nanongkai, and Charalampos E. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In STOC, pages 173–182. ACM, 2015.
  • [BMKK14] Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 671–680, 2014.
  • [CCE+16] Rajesh Chitnis, Graham Cormode, Hossein Esfandiari, MohammadTaghi Hajiaghayi, Andrew McGregor, Morteza Monemizadeh, and Sofya Vorotnikova. Kernelization via sampling with applications to finding matchings and related problems in dynamic graph streams. In SODA, pages 1326–1344. SIAM, 2016.
  • [CGQ15] Chandra Chekuri, Shalmoli Gupta, and Kent Quanrud. Streaming algorithms for submodular function maximization. In ICALP (1), volume 9134 of Lecture Notes in Computer Science, pages 318–330. Springer, 2015.
  • [CKW10] Graham Cormode, Howard J. Karloff, and Anthony Wirth. Set cover algorithms for very large datasets. In CIKM, pages 479–488. ACM, 2010.
  • [CW16] Amit Chakrabarti and Anthony Wirth. Incidence geometries and the pass complexity of semi-streaming set cover. In SODA, pages 1365–1373. SIAM, 2016.
  • [DIMV14] Erik D. Demaine, Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. On streaming and communication complexity of the set cover problem. In Fabian Kuhn, editor, Distributed Computing - 28th International Symposium, DISC 2014, Austin, TX, USA, October 12-15, 2014. Proceedings, volume 8784 of Lecture Notes in Computer Science, pages 484–498. Springer, 2014.
  • [ER14] Yuval Emek and Adi Rosén. Semi-streaming set cover - (extended abstract). In ICALP (1), volume 8572 of Lecture Notes in Computer Science, pages 453–464. Springer, 2014.
  • [Fei98] Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, 1998.
  • [FNFSZ20] Moran Feldman, Ashkan Norouzi-Fard, Ola Svensson, and Rico Zenklusen. The one-way communication complexity of submodular maximization with applications to streaming and robustness. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 1363–1374, 2020.
  • [GM09] Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044–2059, 2009.
  • [GMT15] Sudipto Guha, Andrew McGregor, and David Tench. Vertex and hyperedge connectivity in dynamic graph streams. In PODS, pages 241–247. ACM, 2015.
  • [HIMV16] Sariel Har-Peled, Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. Towards tight bounds for the streaming set cover problem. In PODS, pages 371–383. ACM, 2016.
  • [HP98] Dorit S Hochbaum and Anu Pathria. Analysis of the greedy approach in problems of maximum k-coverage. Naval Research Logistics (NRL), 45(6):615–627, 1998.
  • [IV19] Piotr Indyk and Ali Vakilian. Tight trade-offs for the maximum $k$-coverage problem in the general streaming model. In 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 200–217, 2019.
  • [JST11] Hossein Jowhari, Mert Sağlam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Maurizio Lenzerini and Thomas Schwentick, editors, Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 49–58. ACM, 2011.
  • [JWC23] Stephen Jaud, Anthony Wirth, and Farhana Choudhury. Maximum Coverage in Sublinear Space, Faster. In 21st International Symposium on Experimental Algorithms (SEA 2023), volume 265, pages 21:1–21:20, Dagstuhl, Germany, 2023.
  • [Kar72] Richard M Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Springer, 1972.
  • [KG07] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, pages 1650–1654. AAAI Press, 2007.
  • [KG14] Andreas Krause and Daniel Golovin. Submodular function maximization. In Lucas Bordeaux, Youssef Hamadi, and Pushmeet Kohli, editors, Tractability: Practical Approaches to Hard Problems, pages 71–104. Cambridge University Press, 2014.
  • [KKA23] Sanjeev Khanna, Christian Konrad, and Cezar-Mihail Alexandru. Set cover in the one-pass edge-arrival streaming model. In Floris Geerts, Hung Q. Ngo, and Stavros Sintos, editors, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023, pages 127–139. ACM, 2023.
  • [KKT15] David Kempe, Jon M. Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. Theory of Computing, 11:105–147, 2015.
  • [KLM+14] Michael Kapralov, Yin Tat Lee, Cameron Musco, Christopher Musco, and Aaron Sidford. Single pass spectral sparsification in dynamic streams. In FOCS, pages 561–570. IEEE Computer Society, 2014.
  • [KMM12] Christian Konrad, Frédéric Magniez, and Claire Mathieu. Maximum matching in semi-streaming with few passes. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques (APPROX/RANDOM), pages 231–242, 2012.
  • [KNPW11] Daniel M Kane, Jelani Nelson, Ely Porat, and David P Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 745–754, 2011.
  • [Kon15] Christian Konrad. Maximum matching in turnstile streams. In ESA, volume 9294 of Lecture Notes in Computer Science, pages 840–852. Springer, 2015.
  • [KW14] Michael Kapralov and David P. Woodruff. Spanners and sparsifiers in dynamic streams. In PODC, pages 272–281. ACM, 2014.
  • [LMW14] Ching Lih Lim, Alistair Moffat, and Anthony Wirth. Lazy and eager approaches for the set cover problem. In Proceedings of the 37th Australasian Computer Science Conference, pages 19–27, 2014.
  • [LRVZ21a] Paul Liu, Aviad Rubinstein, Jan Vondrák, and Junyao Zhao. Cardinality constrained submodular maximization for random streams. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 6491–6502, 2021.
  • [LRVZ21b] Paul Liu, Aviad Rubinstein, Jan Vondrák, and Junyao Zhao. Cardinality constrained submodular maximization for random streams. In Advances in Neural Information Processing Systems 34, pages 6491–6502, 2021.
  • [McG14] Andrew McGregor. Graph stream algorithms: a survey. SIGMOD Record, 43(1):9–20, 2014.
  • [Mon20] Morteza Monemizadeh. Dynamic submodular maximization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9806–9817. Curran Associates, Inc., 2020.
  • [MTV21] Andrew McGregor, David Tench, and Hoa T. Vu. Maximum coverage in the data stream model: Parameterized and generalized. In Ke Yi and Zhewei Wei, editors, 24th International Conference on Database Theory, ICDT 2021, March 23-26, 2021, Nicosia, Cyprus, volume 186 of LIPIcs, pages 12:1–12:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [MTVV15] Andrew McGregor, David Tench, Sofya Vorotnikova, and Hoa T. Vu. Densest subgraph in dynamic graph streams. In MFCS (2), volume 9235 of Lecture Notes in Computer Science, pages 472–482. Springer, 2015.
  • [MV19a] Andrew McGregor and Hoa T. Vu. Better streaming algorithms for the maximum coverage problem. Theory Comput. Syst., 63(7):1595–1619, 2019.
  • [MV19b] Andrew McGregor and Hoa T Vu. Better streaming algorithms for the maximum coverage problem. Theory of Computing Systems, 63:1595–1619, 2019.
  • [MVV16] Andrew McGregor, Sofya Vorotnikova, and Hoa T. Vu. Better algorithms for counting triangles in data streams. In PODS, pages 401–411. ACM, 2016.
  • [NFTM+18] Ashkan Norouzi-Fard, Jakub Tarnawski, Slobodan Mitrovic, Amir Zandieh, Aidasadat Mousavifar, and Ola Svensson. Beyond 1/2-approximation for submodular maximization on massive data streams. In Proceedings of the 35th International Conference on Machine Learning, pages 3829–3838, 2018.
  • [NTM+18] Ashkan Norouzi-Fard, Jakub Tarnawski, Slobodan Mitrovic, Amir Zandieh, Aidasadat Mousavifar, and Ola Svensson. Beyond 1/2-approximation for submodular maximization on massive data streams. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3826–3835. PMLR, 2018.
  • [SG09] Barna Saha and Lise Getoor. On maximum coverage in the streaming model & application to multi-topic blog-watch. In SDM, pages 697–708. SIAM, 2009.
  • [vzGG99] Joachim von zur Gathen and Jürgen Gerhard. Modern Computer Algebra. Cambridge University Press, 1999.
  • [WC79] Mark N. Wegman and Larry Carter. New classes and applications of hash functions. In Proc. 20th Annual IEEE Symposium on Foundations of Computer Science, pages 175–182, 1979.
  • [WCW23] Rowan Warneke, Farhana Murtaza Choudhury, and Anthony Wirth. Maximum coverage in random-arrival streams. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms, ESA 2023, September 4-6, 2023, Amsterdam, The Netherlands, volume 274 of LIPIcs, pages 102:1–102:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
  • [YY13] Huiwen Yu and Dayu Yuan. Set coverage problems in a one-pass data stream. In SDM, pages 758–766. SIAM, 2013.