
Robust Streaming, Sampling, and a Perspective on Online Learning

Evan Dogariu  Jiatong Yu
Department of Computer Science, Princeton University
{edogariu, jiatongy}@princeton.edu
Abstract

A young and rapidly growing field of theoretical computer science is that of robust streaming. Streaming arises in many practical settings, including network traffic analysis and routing, reinforcement learning, database monitoring, server query response, and distributed computing. A nascent subfield concerns streaming algorithms that are robust to adversarially prepared streams, a setting with substantial practical grounding. For example, an adversary could submit a small amount of carefully chosen traffic to mount a denial-of-service attack on a network routing system; a routing algorithm robust to such input would have immense practical use. We investigate this new field of robust streaming and, in particular, the formalization of robust sampling, which concerns sampling from an adversarially prepared stream so as to recover a representative sample.

Throughout this survey, we also highlight, explore, and deepen the connection between the field of robust streaming and that of statistical online learning. On the surface, these fields can appear distinct and are often researched independently; however, there is a deep interrelatedness that can be used to generate new results and intuitions in both places.

In this work we present an overview of statistical learning, followed by a survey of robust streaming techniques and challenges, culminating in several rigorous results proving the relationship that we motivate and hint at throughout the journey. Furthermore, we unify often disjoint theorems in a shared framework and notation to clarify the deep connections that are discovered. We hope that by approaching these results from a shared perspective, already aware of the technical connections that exist, we can enlighten the study of both fields and perhaps motivate new and previously unconsidered directions of research.

1 Preliminaries

1.1 Statistical Learning

The field of statistical learning concerns itself with the formal study of questions such as which tasks are learnable from data, how much training data is required to fit a general distribution, and other topics of this flavor. We present formal frameworks for research in this field in the offline and online settings below.

1.1.1 Offline Setting

In the offline setting, a learner attempts to learn from a dataset that is available all at once (in contrast with the online setting, where data arrives sequentially in a stream). In order to formalize these ideas, we will supply some notation for the setting of learning from data. We focus on binary classification below and represent our hypotheses as subsets of the domain, but general statistical learning theory follows the same principles and can reduce to this framework.

We consider learning over a set system (\mathcal{U},\mathcal{R}), where \mathcal{U} is a universe of possible elements and \mathcal{R}\subseteq 2^{\mathcal{U}} is a set of subsets of the universe. \mathcal{R} is called the hypothesis class in which a learner attempts to learn, for reasons we will see below. Let \mathcal{S}=((x_{1},y_{1}),...,(x_{m},y_{m}))\in(\mathcal{U}\times\{0,1\})^{m} be a finite training sample of size m, where the x_{i}'s are sampled and labeled either 0 or 1 according to some scheme. To be specific about the data generation: we work in the realizable setting if \mathcal{U} is equipped with a distribution \mathcal{D} over its elements and there exists some hypothesis f\in\mathcal{R} that generates the labels (i.e., y_{i}=1 if x_{i}\in f and y_{i}=0 otherwise). If not, then we are in the agnostic setting, in which the space \mathcal{U}\times\{0,1\} is equipped with a joint distribution \mathcal{D}. The realizability assumption posits a perfect ground-truth hypothesis in \mathcal{R} that we would like the learner to recover; otherwise, we would like to learn the best hypothesis we can. A learning algorithm is any algorithm that, given a training sample \mathcal{S}, outputs a hypothesis h\in\mathcal{R} about the training data. Hopefully, the learned hypothesis h not only fits the training sample \mathcal{S} well, but also generalizes well. We define the generalization error (or risk) of a hypothesis h\in\mathcal{R} in the realizable setting to be

L_{\mathcal{D},f}(h)=\underset{x\sim\mathcal{D}}{\Pr}[\mathbbm{1}_{x\in h}\neq\mathbbm{1}_{x\in f}]

and in the agnostic setting to be

L_{\mathcal{D}}(h)=\underset{(x,y)\sim\mathcal{D}}{\Pr}[\mathbbm{1}_{x\in h}\neq y]
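The empirical analogue of this risk on a finite sample is easy to compute directly. The following minimal Python sketch treats a hypothesis as a set of domain elements, per the set-system view above; all concrete values are hypothetical.

```python
def empirical_risk(h, sample):
    """Fraction of labeled points (x, y) that the set-hypothesis h
    misclassifies (h labels x as 1 exactly when x is in h)."""
    return sum(1 for x, y in sample if (x in h) != (y == 1)) / len(sample)

# Hypothetical example: universe {0..9}, hypothesis h = "x is even".
h = {0, 2, 4, 6, 8}
sample = [(1, 0), (2, 1), (3, 1), (4, 1)]  # only (3, 1) is misclassified
print(empirical_risk(h, sample))  # 0.25
```

As the sample grows, this empirical quantity concentrates around the generalization error above; making that statement uniform over the class is exactly the subject of Definition 1.3.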

With all of the above notation out of the way, we can now make headway into the statistical learning theory. The first step in formalization is to define what is meant by a hypothesis class being learnable; to this end, we introduce the following definitions.

Definition 1.1 (PAC-Learnable).

A hypothesis class \mathcal{R} is PAC-learnable in the realizable setting if there exist a function m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} and a learning algorithm with the following property: for all accuracy and probability tolerances \epsilon,\delta\in(0,1), all distributions \mathcal{D} over \mathcal{U}, and all labeling sets f\in\mathcal{R}, when the learning algorithm is run on m\geq m_{\mathcal{R}}(\epsilon,\delta) i.i.d. samples drawn from \mathcal{D} and labeled by f, it returns a hypothesis h\in\mathcal{R} for which

\Pr[L_{\mathcal{D},f}(h)\leq\epsilon]\geq 1-\delta
Definition 1.2 (Agnostic PAC-Learnable).

A hypothesis class \mathcal{R} is PAC-learnable in the agnostic setting if there exist a function m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} and a learning algorithm with the following property: for all accuracy and probability tolerances \epsilon,\delta\in(0,1) and all distributions \mathcal{D} over \mathcal{U}\times\{0,1\}, when the learning algorithm is run on m\geq m_{\mathcal{R}}(\epsilon,\delta) i.i.d. samples drawn from \mathcal{D}, it returns a hypothesis h\in\mathcal{R} for which

\Pr[L_{\mathcal{D}}(h)\leq\epsilon]\geq 1-\delta

Here, PAC stands for Probably-Approximately-Correct. Intuitively, these definitions mean that the correct hypothesis can be learned to within an arbitrary risk tolerance (the Approximately Correct) with high probability (the Probably), using a number of data points determined by these tolerances. We can characterize hypothesis classes with two concepts defined below (for intuition and examples, consult [15]).

Definition 1.3 (Uniform Law of Large Numbers, Sample Complexity).

Let (\mathcal{U},\mathcal{R}) be a set system. A hypothesis class \mathcal{R} admits a Uniform Law of Large Numbers if there exists a function m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} such that for any distribution \mathcal{D} over \mathcal{U} and any tolerances \epsilon,\delta\in(0,1), the empirical error L_{\mathcal{S}}(h) over a sample \mathcal{S} of size m\geq m_{\mathcal{R}}(\epsilon,\delta) satisfies

\forall h\in\mathcal{R}\quad|L_{\mathcal{S}}(h)-L_{\mathcal{D}}(h)|\leq\epsilon

with probability at least 1-\delta. This property is often referred to as uniform convergence. We call m_{\mathcal{R}}(\epsilon,\delta) the sample complexity of \mathcal{R}.

Definition 1.4 (VC-Dimension).

The VC-dimension of a hypothesis class \mathcal{R}, denoted VCdim(\mathcal{R}), is the maximal size of a set C\subseteq\mathcal{U} that can be shattered by \mathcal{R}. If \mathcal{R} can shatter sets of arbitrarily large size, we say that \mathcal{R} has infinite VC-dimension. Here, shattering has the usual definition: \mathcal{R} shatters C\subseteq\mathcal{U} if for every labeling of C there exists an element of \mathcal{R} that perfectly realizes it.
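Shattering can be checked by brute force for small finite classes. The sketch below uses a one-dimensional threshold class as an illustrative assumption (not from the text) and confirms that thresholds shatter single points but not pairs, i.e., have VC-dimension 1.

```python
def shatters(hypotheses, C):
    """True iff the class (a list of sets) realizes every 0/1 labeling of C."""
    C = list(C)
    labelings = {tuple(x in h for x in C) for h in hypotheses}
    return len(labelings) == 2 ** len(C)

# Hypothetical example: threshold hypotheses {x : x < t} over universe {0..4}.
universe = range(5)
thresholds = [set(x for x in universe if x < t) for t in range(6)]
print(shatters(thresholds, [2]))     # True: both labelings of one point occur
print(shatters(thresholds, [1, 3]))  # False: no threshold labels 1 -> 0, 3 -> 1
```

The VC-dimension is then the size of the largest C on which `shatters` returns True; for thresholds that is 1.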

Equipped with the above tooling, we can now discuss the two fundamental theorems of VC/PAC theory. Proofs of both can be found in Chapter 5 of [15].

Theorem 1.1 (Fundamental Theorem of PAC Learning).

Let (\mathcal{U},\mathcal{R}) be a set system. The following are equivalent:

  1. \mathcal{R} is PAC learnable in the realizable and agnostic settings.

  2. \mathcal{R} admits a Uniform Law of Large Numbers.

  3. \mathcal{R} has a finite VC-dimension.

Theorem 1.2 (Fundamental Theorem of PAC Learning - Quantitative).

Let (\mathcal{U},\mathcal{R}) be a set system with finite VC-dimension d. Then there exist constants C_{1},C_{2} such that:

  1. \mathcal{R}'s Uniform Law of Large Numbers has sample complexity

     C_{1}\frac{d+\log(1/\delta)}{\epsilon^{2}}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d+\log(1/\delta)}{\epsilon^{2}}

  2. \mathcal{R} is agnostic PAC learnable with sample complexity

     C_{1}\frac{d+\log(1/\delta)}{\epsilon^{2}}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d+\log(1/\delta)}{\epsilon^{2}}

  3. \mathcal{R} is PAC learnable with sample complexity

     C_{1}\frac{d+\log(1/\delta)}{\epsilon}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d\log(1/\epsilon)+\log(1/\delta)}{\epsilon}

These two theorems characterize what types of hypothesis classes \mathcal{R} are learnable, and how much data is required to learn them. We will see that the online setting has well-established results that mirror these.

1.1.2 Online Setting

We now consider the online setting, in which samples of data are given to a learner sequentially in a stream. The learner makes a prediction on each new data point, after which the true label of the point is revealed and the online learner can update accordingly. Because previous training examples may not be informative about future ones, online learnability must be defined differently, in both the realizable and agnostic settings, than in the PAC framework.

Definition 1.5 (Mistake Bounds, Realizable Online Learnability).

Let \mathcal{R} be a hypothesis class and let A be an online learning algorithm. Given any sequence \mathcal{S}=((x_{1},f(x_{1})),...,(x_{T},f(x_{T}))), where T is any integer and f\in\mathcal{R} generates the labels, let M_{A}(\mathcal{S}) be the number of mistakes A makes on the sequence \mathcal{S}. We denote by M_{A}(\mathcal{R}) the supremum of M_{A}(\mathcal{S}) over all sequences of the above form. A bound of the form M_{A}(\mathcal{R})\leq B<\infty is called a mistake bound. We say that a hypothesis class \mathcal{R} is online learnable in the realizable setting if there exists an algorithm A for which M_{A}(\mathcal{R})\leq B<\infty.

Definition 1.6 (Regret, Agnostic Online Learnability).

Let \mathcal{R} be a hypothesis class and let A be an online learning algorithm. Given any sequence \mathcal{S}=((x_{1},y_{1}),...,(x_{T},y_{T})), where T is any integer, let A(x_{i}) be the prediction the algorithm makes at step i before the true label y_{i} is revealed. We define the regret to be the difference between the number of mistakes made by the learner and the number made by the best fixed hypothesis:

R_{\mathcal{S}}(T)=\sum_{i=1}^{T}\mathbbm{1}_{A(x_{i})\neq y_{i}}-\min_{h\in\mathcal{R}}\sum_{i=1}^{T}\mathbbm{1}_{h(x_{i})\neq y_{i}}

We say that \mathcal{R} is online learnable in the agnostic setting if the expected regret \mathbb{E}_{\mathcal{S}}[R_{\mathcal{S}}(T)] is o(T), or equivalently if \lim_{T\rightarrow\infty}\mathbb{E}_{\mathcal{S}}[R_{\mathcal{S}}(T)]/T=0 (i.e., the amortized expected regret vanishes). We call R_{T}(\mathcal{R}) the optimal regret: the regret of the best learner against the worst adversarial sequence.
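The regret of a finished run is straightforward to evaluate against a small finite class. A minimal sketch, with hypotheses again represented as sets and all concrete values hypothetical:

```python
def mistakes(h, xs, ys):
    """Mistakes the set-hypothesis h makes on labeled points (xs, ys)."""
    return sum((x in h) != (y == 1) for x, y in zip(xs, ys))

def regret(predictions, xs, ys, hypotheses):
    """Learner's mistakes minus the best fixed hypothesis's mistakes."""
    learner = sum(p != y for p, y in zip(predictions, ys))
    return learner - min(mistakes(h, xs, ys) for h in hypotheses)

# Hypothetical run: the learner errs once, while {1, 2} is a perfect hypothesis.
xs, ys = [1, 2, 3], [1, 1, 0]
preds = [0, 1, 0]
print(regret(preds, xs, ys, [{1, 2}, {1, 2, 3}]))  # 1
```

Agnostic online learnability asks that this quantity, in expectation and against the worst-case sequence, grow sublinearly in T.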

There is a combinatorial measure, due to Littlestone [11], that characterizes online learnability much in the same way that the VC-dimension characterizes offline learnability.

Definition 1.7 (Littlestone Dimension).

Let (\mathcal{U},\mathcal{R}) be a set system. The Littlestone dimension of \mathcal{R}, denoted Ldim(\mathcal{R}), is defined using mistake trees: binary decision trees whose internal nodes are labeled by elements of \mathcal{U}. Any root-to-leaf path corresponds to a sequence of pairs (x_{1},y_{1}),...,(x_{d},y_{d}), where x_{i} is the label of the i'th internal node in the path, and y_{i}=1 if the (i+1)'th node in the path is the right child of the i'th node and y_{i}=0 otherwise. We say that a tree T is shattered by \mathcal{R} if for every root-to-leaf path (x_{1},y_{1}),...,(x_{d},y_{d}) in T there is an h\in\mathcal{R} such that x_{i}\in h\iff y_{i}=1 for all i\leq d. Ldim(\mathcal{R}) is the depth of the largest complete tree shattered by \mathcal{R}, with the convention that Ldim(\emptyset)=-1.

The above definition is the online analogue of the VC-dimension: intuitively, the tree T is the largest structure for which every path has a perfect predictor in \mathcal{R}, much like how the set C in Definition 1.4 is the largest set with a perfect predictor in \mathcal{R} (but the mistake-tree framework captures the sequential nature of the online learning problem). An immediate corollary is that VCdim(\mathcal{R})\leq Ldim(\mathcal{R}). The following theorem characterizes online learning; it is due to Littlestone [11] in the realizable setting and to [4] in the agnostic setting.
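For small finite set systems, Ldim can be computed by brute-force recursion on the mistake-tree definition: a depth-d shattered tree exists iff some root label x splits the class into two nonempty branch classes, each shattering a depth-(d-1) tree. A sketch (exponential time; the threshold example is an illustrative assumption):

```python
def ldim(hypotheses, universe):
    """Brute-force Littlestone dimension of a finite class of sets over a
    finite universe, following the mistake-tree recursion."""
    hypotheses = [frozenset(h) for h in hypotheses]
    if not hypotheses:
        return -1          # convention: Ldim of the empty class is -1
    best = 0               # a single hypothesis already gives Ldim >= 0
    for x in universe:
        pos = [h for h in hypotheses if x in h]      # branch y = 1
        neg = [h for h in hypotheses if x not in h]  # branch y = 0
        if pos and neg:    # both subtrees must be shatterable
            best = max(best, 1 + min(ldim(pos, universe), ldim(neg, universe)))
    return best

# Hypothetical example: 4 threshold hypotheses over universe {0, 1, 2}.
universe = [0, 1, 2]
thresholds = [set(x for x in universe if x < t) for t in range(4)]
print(ldim(thresholds, universe))  # 2, i.e. log2 of the class size
```

Note the contrast with the VC computation above: thresholds have VC-dimension 1 but Littlestone dimension logarithmic in the class size, consistent with VCdim(\mathcal{R})\leq Ldim(\mathcal{R}).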

Theorem 1.3.

Let (\mathcal{U},\mathcal{R}) be a set system. Then \mathcal{R} is online learnable in both the realizable and agnostic settings if and only if it has finite Littlestone dimension.

Previous work in online learnability theory failed to give an adequate description of a uniform-law-of-large-numbers condition for online learnability à la Definition 1.3, or a tight quantitative bound on sample complexities à la Theorem 1.2.

1.2 Streaming

In the streaming setting, data of large volume arrives sequentially (and rapidly, demanding computational efficiency), and it is not realistic to store the entire data stream. Instead, algorithms make one or a few passes over the stream and answer queries about the data approximately.

1.2.1 Data Stream Models

An input stream of length m in the format (a_{1},\Delta_{1}),(a_{2},\Delta_{2}),\dots,(a_{m},\Delta_{m}) arrives sequentially and describes an underlying frequency vector f\in\mathbb{R}^{n}, where f_{i}=\sum_{t:a_{t}=i}\Delta_{t}. Namely, a_{t} is the index of the update and \Delta_{t} is the value of the update to the frequency vector. This generic model is called the turnstile model of streaming. A common assumption bounds the maximum coordinate of the frequency vector at every time step t: letting f^{(t)}_{i}=\sum_{j\leq t:a_{j}=i}\Delta_{j}, we require ||f^{(t)}||_{\infty}\leq M for some M>0.

Another commonly studied model assumes \Delta_{t}\geq 0 and is called the insertion-only model. The above setup then simplifies as follows: given an input stream a_{1},a_{2},\dots,a_{m}, the frequency vector f\in\mathbb{R}^{n} is given by f_{i}=|\{j\in[m]:a_{j}=i\}|.
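The two conventions can be sketched in a few lines of Python; the index range [1, n] is an assumption for concreteness.

```python
from collections import Counter

def turnstile_frequency(updates, n):
    """Frequency vector from turnstile updates (a_t, Delta_t), a_t in [1, n]."""
    f = [0] * n
    for a, delta in updates:
        f[a - 1] += delta
    return f

def insertion_only_frequency(stream, n):
    """Insertion-only special case: f_i = |{j : a_j = i}|."""
    counts = Counter(stream)
    return [counts[i] for i in range(1, n + 1)]

print(turnstile_frequency([(1, 2), (3, 1), (1, -1)], 3))  # [1, 0, 1]
print(insertion_only_frequency([1, 3, 1, 2, 1], 3))       # [3, 1, 1]
```

Of course, materializing f explicitly defeats the purpose of streaming; the sketches of Section 1.2.2 exist precisely to answer queries about f in sublinear space.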

The streaming task is to answer queries about the frequency vector f at any time step t to within a close approximation. One measure of success is known as strong tracking.

Definition 1.8 (Strong Tracking).

Let f^{(1)},f^{(2)},\dots,f^{(m)} be the frequency vectors induced by the input stream, and let g:\mathbb{R}^{n}\rightarrow\mathbb{R} be the query function over frequency vectors. A streaming algorithm is said to provide \epsilon-strong g-tracking if, at each time step t, its output R_{t} satisfies

|R_{t}-g(f^{(t)})|\leq\epsilon|g(f^{(t)})|

1.2.2 Linear Sketches

Linear sketching is a crucial idea in many streaming algorithm designs. At a very high level, sketches achieve dimensionality reduction by projecting the input onto pseudo-random vectors built from random variables with limited independence. Let \Delta_{t}\in\mathbb{R}^{n}. Given a distribution \mathcal{M} over r\times n matrices and an evaluation function \mathcal{F}:\mathbb{R}^{r\times n}\times\mathbb{R}^{r}\rightarrow R, where R is the output space, a linear sketching algorithm draws a (randomized) matrix A\sim\mathcal{M} and uses the evaluation function to respond to queries with \mathcal{F}(A,A\Delta_{t}).

Consider the second moment estimation problem in the turnstile model, where the second moment is defined as F_{2}=\sum_{i}f_{i}^{2}. Alon et al. [2] developed a linear sketch approach as follows. For each i,j, define a random vector x_{ij} of length n whose entries are \pm 1-valued, 4-wise independent random variables, and maintain the counter X_{ij}=\langle f,x_{ij}\rangle. Then E[X_{ij}^{2}]=F_{2} and Var[X_{ij}^{2}]\leq 2F_{2}^{2}. Define Y_{i} to be the average of X_{i1},X_{i2},\dots,X_{i,O(1/\epsilon^{2})}; by Chebyshev's inequality, P(|Y_{i}-F_{2}|>\epsilon F_{2}) is bounded by a small constant. Taking the median of means, i.e., letting Z be the median of O(\log(1/\delta)) of the Y_{i}'s, we have P(|Z-F_{2}|>\epsilon F_{2})<\delta. This algorithm uses O(\frac{1}{\epsilon^{2}}\log(\frac{1}{\delta})) space to achieve an \epsilon-approximation of F_{2} with probability 1-\delta.
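A minimal sketch of this median-of-means estimator follows, substituting fully random ±1 signs for the 4-wise independent variables of [2]; it illustrates the shape of the estimator, not a space-efficient implementation (which would store hash seeds rather than full sign vectors).

```python
import random
from statistics import median

def ams_f2_estimate(stream_updates, n, rows=5, cols=20, seed=0):
    """Median-of-means AMS-style estimator for F2 = sum_i f_i^2.
    Uses fully random +-1 signs for simplicity; 4-wise independence
    suffices for the same variance bound in the original construction."""
    rng = random.Random(seed)
    # One +-1 sign vector per basic estimator X_ij (row r, column c).
    signs = [[[rng.choice((-1, 1)) for _ in range(n)] for _ in range(cols)]
             for _ in range(rows)]
    sketch = [[0.0] * cols for _ in range(rows)]
    for a, delta in stream_updates:          # turnstile updates (index, value)
        for r in range(rows):
            for c in range(cols):
                sketch[r][c] += delta * signs[r][c][a]
    # Each X^2 is unbiased for F2; average within rows, take median across rows.
    means = [sum(x * x for x in row) / cols for row in sketch]
    return median(means)

updates = [(0, 3), (1, 1), (2, 2), (0, -1)]   # f = (2, 1, 2), so F2 = 9
print(ams_f2_estimate(updates, n=3))
```

Averaging `cols` copies drives the variance down (Chebyshev), and the median across `rows` boosts the constant success probability (Chernoff), mirroring the analysis above.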

In fact, the best-known algorithms for any problem in the turnstile model involve linear sketching [9]. Applications include frequency moments, heavy hitters, entropy estimation, and more. Muthukrishnan [13] provides detailed documentation of these problems for the interested reader.

1.3 Sampling

Sampling in the streaming setting aims to select a small subset S of a sequentially arriving input stream X=\{x_{1},x_{2},\dots,x_{n}\} that is representative of the stream. Deterministic sampling algorithms either fail to achieve desirably small space or involve complicated designs tailored to particular problems [5]. This section therefore primarily discusses random sampling.

The formal notion of being representative of the input stream is defined through \epsilon-approximation [12]. Intuitively, S is a good representation of X if every subset R\in\mathcal{R} has similar density in X and in S, as formulated in Definition 1.9.

Definition 1.9 (ϵ\epsilon-Approximation).

Let X be a stream and let (\mathcal{U},\mathcal{R}) be a set system. A sample S is an \epsilon-approximation of X with respect to \mathcal{R} if, for every R\in\mathcal{R}, it holds that \left|\frac{|R\cap X|}{|X|}-\frac{|R\cap S|}{|S|}\right|\leq\epsilon.
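Definition 1.9 can be verified directly for small examples. The following sketch checks every range's density gap; the prefix-range class and the concrete sample are purely illustrative assumptions.

```python
def is_eps_approximation(X, S, ranges, eps):
    """Check Definition 1.9 directly: each range's density in the sample S
    must be within eps of its density in the stream X."""
    return all(abs(sum(x in R for x in X) / len(X)
                   - sum(s in R for s in S) / len(S)) <= eps
               for R in ranges)

# Hypothetical example: ranges are the prefixes of {0..9}.
X = list(range(10))
S = [1, 4, 8]
ranges = [set(range(t)) for t in range(11)]
print(is_eps_approximation(X, S, ranges, eps=0.25))  # True
```

Here the worst density gap is 1/6 (e.g., the prefix {0..4} has density 0.5 in X but 2/3 in S), so the same sample fails for eps = 0.1.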

An oblivious sampler is a sampling algorithm that accepts or rejects an element based on its index, independently of the values in the stream. Oblivious samplers are well studied for a variety of sampling tasks, and are directly applied in the adversarial setting in Section 2. Below, we enumerate some classic oblivious samplers.

  • Bernoulli Sampler Ber(n,p) samples each element i\in[n] independently with probability p.

  • Uniform Sampler Uni(n,k) randomly draws k indices i_{1},i_{2},\dots,i_{k} with i_{j}\in[1,n]. Stream elements with matching indices are sampled.

  • Reservoir Sampler Res(n,k) keeps a sample of size k. The first k arrivals of the stream are accepted. An element with index i>k is accepted into the sample with probability \frac{k}{i}; if accepted, an element currently in the sample is drawn uniformly at random and replaced by the new element.
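Of the three, the reservoir sampler is the only one that requires no advance knowledge of the stream length. A standard sketch of Res(n, k), with a fixed seed for reproducibility:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Res(n, k): maintain a uniform sample of size k from a stream of
    unknown length, replacing a uniformly chosen current element with
    probability k/i upon the i'th arrival."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)                 # first k arrivals are accepted
        elif rng.random() < k / i:
            sample[rng.randrange(k)] = x     # replace a uniform victim
    return sample

print(reservoir_sample(range(1000), k=5))
```

A short induction shows every element ends up in the final sample with probability exactly k/n, which is what makes the sampler uniform.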

Several results connect sample complexity in this setting with VC theory [18] [16].

Theorem 1.4.

Let (\mathcal{U},\mathcal{R}) be a set system with VC-dimension d, and let X be a stream drawn from \mathcal{U}. Let S\subseteq X be a subset sampled uniformly at random with size O\left(\frac{d+\log(1/\delta)}{\epsilon^{2}}\right). Then S is an \epsilon-approximation of X with probability at least 1-\delta.

Note that this is a restatement, in the language of sampling theory, of the sample complexity bound for a Uniform Law of Large Numbers (Definition 1.3) in VC theory (Theorem 1.2, item 1).

2 Adversarial Robustness

Recall from Sections 1.2 and 1.3 the key assumption that the input stream X is chosen non-adaptively in advance. This assumption does not hold in many modern applications, where streaming outputs may implicitly or directly affect the future input stream. An important application area for streaming algorithms is routing [1] [10], where adversarial robustness has begun to play a key role in evaluation metrics [10]. This section discusses recent developments in adversarially robust streaming and sampling algorithms. The framework of analysis also contributes to Section 3, where Alon et al. develop the connection between adversarial robustness in sampling and online learnability.

2.1 Adversarial Setting

We can model the adversarial streaming/sampling problem as a two-player game between Adversary and Algorithm [5] [6]. In the most generic setting, Adversary is allowed unbounded computational power and can adaptively choose the next element of the stream based on Algorithm's output history.

For a streaming task, Algorithm aims to answer queries about the frequency vector f with close approximation; for sampling, Algorithm aims to maintain a sample representative of the stream (i.e., an \epsilon-approximation of Adversary's input stream). At a high level, the goal of Adversary is to prevent Algorithm from obtaining good results, and the goal of Algorithm is to be robust against this. Each round of the two-player game proceeds as follows.

  1. Adversary submits an element of the stream to Algorithm. The choice of element may (probabilistically) depend on all elements submitted in previous rounds and on observations of Algorithm's output history.

  2. Algorithm updates its internal state (its estimate of f or its accepted sample S) based on the newly submitted element, and outputs its response to queries regarding its internal state.

  3. Adversary observes Algorithm's responses and proceeds to the next round.

2.2 Robust Streaming

Hardt and Woodruff [9] show that linear sketching is inherently non-robust to adversarial inputs, which necessitates new robust streaming algorithms. This section covers recent developments that robustify non-robust algorithms through sketch switching.

2.2.1 Vulnerability of Linear Sketching

Hardt and Woodruff [9] show, in particular, that for problems at least as hard as l_{p}-norm estimation, linear sketching algorithms suffer under adversarial inputs. (Please refer to the original paper for formal proofs of the statements made in this algorithm-design summary; they are omitted here for simplicity.)

Definition 2.1 (GapNorm).

Given an input stream X:\{x_{1},x_{2},\dots\} with each x_{t}\in\mathbb{R}^{n}, the GapNorm(B) problem requires an algorithm to return 0 if ||x_{t}||_{2}\leq 1 and to return 1 if ||x_{t}||_{2}\geq B, for some parameter B. If x_{t} satisfies neither condition, the output may be either 0 or 1.

The GapNorm problem requires a slight modification of the data stream model of Section 1.2: instead of updates \Delta_{t}\in\mathbb{Z}, each input from the data stream is a vector x_{t}\in\mathbb{R}^{n}.

Theorem 2.1.

There exists a randomized adversary that, with high probability, finds a distribution over queries on which linear sketching fails to solve GapNorm(B), with constant probability, for B\geq 2.

Proof.

Consider the following two-player strategy between Adversary and Algorithm. Algorithm samples a matrix A from a distribution \mathcal{M}. At each round, Adversary submits x_{t}\in\mathbb{R}^{n}, and Algorithm responds with the evaluation \mathcal{F}(A,Ax_{t}). Adversary aims to learn the row space R(A) of Algorithm's sketch. If Adversary successfully learns R(A), the following strategy forces Algorithm to make mistakes on sufficiently many queries:

With probability 1/2, Adversary submits the zero vector in \mathbb{R}^{n}; with probability 1/2, Adversary submits a vector x_{t} in the kernel of A.

To learn the row space R(A), Adversary draws the initial element x_{1} from N(0,\tau I_{n}), a multivariate Gaussian distribution with covariance matrix \tau I_{n}. By the properties of Gaussian variables, the projection of x_{1} onto Algorithm's sketch matrix A is spherically symmetric, so the output depends only on the norm of the projection P_{A}(x_{1}). If x_{1} happens to have high correlation with R(A), Algorithm is tricked into computing a larger estimate of ||P_{A}(x_{1})||_{2}. [9] proves the existence of a \tau that makes this difference sufficiently large.

Therefore, Adversary can first submit many elements x_{t} drawn from N(0,\tau I_{n}) and aggregate them into a vector v_{1}\in\mathbb{R}^{n} that is almost entirely contained in R(A). This is accomplished by collecting m=poly(n) positively labeled vectors x_{t} from the repeated trials into a matrix G\in\mathbb{R}^{m\times n} and taking v_{1} to be the top right singular vector of G. Then ||P_{A}(v_{1})||_{2}\geq 1-\frac{1}{poly(n)}, and for sufficiently large m we may say that v_{1} is almost entirely contained in R(A).

At each iteration of the above scheme, after finding v_{1},v_{2},\dots, Adversary draws x_{t} from N(0,\tau I_{n}) restricted to the subspace orthogonal to \{v_{1},v_{2},\dots\} and runs the attack again. Each new v effectively removes one "unknown" dimension of R(A). The iterations repeat until only a constant number of unknown dimensions remain, at which point the aforementioned strategy of switching between zero vectors and kernel-space vectors tricks Algorithm into making a sufficiently large number of mistakes.

The adversarial algorithm above leads to Theorem 2.1. ∎

2.2.2 Sketch Switching

In light of the adversarial vulnerability of linear sketches, Ben-Eliezer and Yogev [6] developed generic robustification schemes for linear sketching. Two techniques were proposed to transform static streaming algorithms into adversarially robust ones. This section focuses on the sketch switching technique, owing to its widespread usage in the static setting and its success in robustifying popular problems like Distinct Elements, F_{p} Estimation, Heavy Hitters, and Entropy Estimation.

The Algorithm

Sketch switching achieves robustness by keeping multiple copies of strong tracking (Definition 1.8) algorithms. The high-level goal of sketch switching is to change the current output as few times as possible and, when an update to the output is required, to move to a different copy so that Adversary does not have enough information to attack. The detailed procedure is given in Algorithm 1.

As long as the previous output remains a good multiplicative approximation of the current copy's estimate after the update (a_{t},\Delta_{t}), Algorithm repeats the previous output while internally updating its state with the new estimate. When the previous output falls out of range of the internal estimate, Algorithm outputs the current estimate after (a_{t},\Delta_{t}), deactivates the current copy, and activates another copy of the static strong tracking algorithm.

1: \lambda\leftarrow\lambda_{\epsilon/8,m}(g)
2: Initialize independent copies A_{1},A_{2},\dots,A_{\lambda} of (\frac{\epsilon}{8},\frac{\delta}{\lambda})-strong g-tracking static algorithms
3: \rho\leftarrow 1
4: \hat{g}\leftarrow g(\textbf{0})
5: while new stream update (a_{t},\Delta_{t}) do
6:     Insert update (a_{t},\Delta_{t}) into all copies A_{1},A_{2},\dots,A_{\lambda}
7:     y\leftarrow current output of A_{\rho}
8:     if \hat{g}\notin(1\pm\epsilon/2)y then
9:         \hat{g}\leftarrow y
10:        \rho\leftarrow\rho+1
11:    end if
12:    Output estimate \hat{g}
13: end while
Algorithm 1 Sketch Switching
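A minimal Python rendering of Algorithm 1 follows. The copies' interface (a callable fed one update, returning its current estimate) and the exact-counting usage example are illustrative assumptions; real copies would be independent strong-tracking sketches.

```python
def sketch_switch(stream, copies, eps):
    """Sketch-switching loop: repeat the published estimate while it stays
    inside the (1 +- eps/2) band of the active copy's estimate; otherwise
    publish the fresh estimate and retire the (now exposed) copy.
    Assumes len(copies) is at least the flip number of the stream."""
    rho = 0
    g_hat = 0.0
    outputs = []
    for update in stream:
        estimates = [c(update) for c in copies]   # line 6: feed every copy
        y = estimates[rho]
        if not ((1 - eps / 2) * y <= g_hat <= (1 + eps / 2) * y):
            g_hat = y          # line 9: reveal the fresh estimate
            rho += 1           # line 10: switch to an unexposed copy
        outputs.append(g_hat)
    return outputs

def make_counter():
    """Toy 'copy': exactly counts updates (a perfect tracker of g = count)."""
    state = {"n": 0}
    def step(update):
        state["n"] += 1
        return float(state["n"])
    return step

copies = [make_counter() for _ in range(10)]
print(sketch_switch([None] * 8, copies, eps=0.9))
# [1.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 8.0]
```

Note how the published output changes only four times over eight updates, which is the point: each change "spends" one copy, and the flip number bounds how many changes an adversary can force.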

A central definition in the sketch switching analysis is the flip number (Definition 2.2). Lemma 2.2 will come in handy when proving the main theorem, Theorem 2.3.

Definition 2.2 (Flip Number).

Let Y=\{y_{0},y_{1},\dots,y_{m}\} be any sequence of real numbers. For \epsilon\geq 0, the \epsilon-flip number \lambda_{\epsilon}(Y) is the maximum k for which there exist indices 0\leq i_{1}<i_{2}<\dots<i_{k}\leq m such that y_{i_{j-1}}\notin(1\pm\epsilon)y_{i_{j}} for all j\in[2,k].
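For a concrete sequence, the flip number can be computed with a longest-subsequence recursion; a small sketch (example values hypothetical):

```python
from functools import lru_cache

def flip_number(Y, eps):
    """eps-flip number of a real sequence: the longest subsequence in which
    each term's predecessor lies outside the (1 +- eps) band around it."""
    n = len(Y)
    @lru_cache(maxsize=None)
    def longest(j):
        # Longest valid flip subsequence ending at index j.
        best = 1
        for i in range(j):
            inside = (1 - eps) * Y[j] <= Y[i] <= (1 + eps) * Y[j]
            if not inside:
                best = max(best, longest(i) + 1)
        return best
    return max(longest(j) for j in range(n)) if n else 0

print(flip_number([1, 1.05, 2.5, 2.6, 8], eps=0.5))  # 3, e.g. 1 -> 2.5 -> 8
```

Intuitively, the flip number measures how many times an adversary can force the true query value to move by a (1 + eps) factor, which is exactly how many sketch copies Algorithm 1 must budget for.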

Definition 2.3 ((ϵ,m)(\epsilon,m)-Flip Number).

Let \mathcal{C}\subseteq([n]\times\mathbb{Z})^{m} be the class of possible input streams, and let g be the query function from Definition 1.8. Each stream in \mathcal{C} induces a sequence Y=\{y_{0},y_{1},\dots,y_{m}\} where y_{t}=g(f^{(t)}). The (\epsilon,m)-flip number \lambda_{\epsilon,m}(g) of g over \mathcal{C} is the maximum \epsilon-flip number over the sequences Y generated by all possible input streams of length m in \mathcal{C}.

Lemma 2.2.

Fix \epsilon\in(0,1). Let U=\{u_{0},u_{1},\dots,u_{m}\}, V=\{v_{0},v_{1},\dots,v_{m}\}, and W=\{w_{0},w_{1},\dots,w_{m}\} be three sequences of real numbers satisfying the following:

  1. For every 0\leq i\leq m, v_{i}=(1\pm\epsilon/8)u_{i}.

  2. w_{0}=v_{0}.

  3. For i>0: if w_{i-1}=(1\pm\epsilon/2)v_{i}, then w_{i}=w_{i-1}; otherwise w_{i}=v_{i}.

Then w_{i}=(1\pm\epsilon)u_{i} for all i\in[m], and \lambda_{0}(W)\leq\lambda_{\epsilon/8}(U); in particular, when u_{t}=g(f^{(t)}), this gives \lambda_{0}(W)\leq\lambda_{\epsilon/8,m}(g).

The Proof

Now we have enough tools to prove the main theorem of sketch switching, following the skeleton of the proof in the original paper [6].

Theorem 2.3 (Sketch Switching).

Fix a query function g:\mathbb{R}^{n}\rightarrow\mathbb{R} over frequency vectors f, and let A be a static streaming algorithm that, for any \epsilon\in[0,1] and \delta>0, uses L(\epsilon,\delta) space and satisfies (\epsilon/8,\delta/\lambda)-strong g-tracking over the frequency vectors f^{(1)},f^{(2)},\dots,f^{(m)}, where \lambda=\lambda_{\epsilon/8,m}(g) is the flip number of g over the input stream class \mathcal{C}. Then Algorithm 1 is an adversarially robust algorithm that returns a (1+\epsilon)-approximation of g(f^{(t)}) at each time step t with probability at least 1-\delta, and uses O(\lambda\cdot L(\epsilon/8,\delta/\lambda)) space.

Proof.

Assume a fixed randomized algorithm following the robustification framework of Algorithm 1, with static streaming copies A_{1},A_{2},\dots,A_{\lambda}, each with its own randomness. For simplicity, assume a deterministic adversary; by Yao's minimax principle [20], this assumption is easily relaxed to apply to randomized adversaries. Under this assumption, Adversary's choice of (a_{t},\Delta_{t}) is a deterministic function of ((a_{1},\Delta_{1}),\dots,(a_{t-1},\Delta_{t-1})) and the responses R_{1},R_{2},\dots,R_{t-1}.

The skeleton of the proof is as follows: we first show that, until Algorithm switches to a new copy of the sketch, its output sequence Y is a (1+\epsilon)-approximation of g(f) at each time step, and we inductively apply the same reasoning to each switch to A_{\rho}, \rho\in[\lambda]. A union bound over the \lambda copies of (\epsilon/8,\delta/\lambda)-strong g-tracking algorithms gives the 1-\delta probability of success. The remainder of the proof shows that \lambda copies indeed suffice to handle adversarial input streams of length m.

Robustness of the first sketch. First fix the randomness of A_{1}. Let u_{1}^{1},u_{2}^{1},\dots,u_{k}^{1} be the (a,\Delta) updates the adversary would make if the algorithm were to output y_{0}=g(\textbf{0}) at every step; that is, u_{1}^{1},\dots,u_{k}^{1} is the sequence under which the algorithm does not switch copies. Let k+1 be the first time step at which y_{0}\notin(1\pm\epsilon/2)A_{1}(u_{k+1}) (i.e., where y_{0} falls out of range for the first time). The algorithm thus returns R_{1}=R_{2}=\dots=R_{k}=y_{0} and R_{k+1}=y_{1}. By the definition of (\epsilon/8,\delta/\lambda)-strong g-tracking, we know that A_{1}(u_{t})\in(1\pm\epsilon/8)g(f^{(t)}) for t\leq k, and by the design of Algorithm 1, R_{t}=y_{0}\in(1\pm\epsilon/2)A_{1}(u_{t}). By Lemma 2.2, we have

R1,R2Rk=y0(1±ϵ)g(f(t)),tkR_{1},R_{2}\dots R_{k}=y_{0}\in(1\pm\epsilon)g(f^{(t)}),\;\;\forall t\leq k

Induction on A_{2}. By the definition above, the algorithm switches to copy A_{2} at time t=k+1. Here R_{k+1}, the output of the algorithm, is updated to y_{1}=A_{1}(u_{k+1}). By strong tracking (Definition 1.8), the switching-point output satisfies R_{k+1}=y_{1}\in(1\pm\epsilon/8)g(f^{(k+1)}). Let \hat{X}=\{u_{1}^{1},\dots,u_{k}^{1},u_{1}^{2},\dots,u_{k_{2}}^{2}\} be the concatenation of the aforementioned inputs with the stream of inputs under which the algorithm keeps outputting y_{1}, where k_{2} is the first index at which y_{1} falls out of range. Since the (\epsilon/8,\delta/\lambda)-strong g-tracking guarantee holds on this fixed input sequence, reasoning as above we have

R_{k+1}=R_{k+2}=\dots=R_{k_{2}}=y_{1}\in(1\pm\epsilon)g(f^{(t)})\;\;\forall t\in[k+1,k_{2}]

Note that this line of reasoning extends to any A_{\rho} for \rho\in[\lambda]. Thus at any time t, the algorithm outputs y_{\rho}\in(1\pm\epsilon)g(f^{(t)}) except with probability \frac{\delta}{\lambda}. Taking a union bound over all copies A_{\rho} gives the desired 1-\delta success probability in Theorem 2.3.

Bounding the number of copies required. The last step of the proof shows that \lambda copies of a static streaming algorithm with the strong tracking guarantee suffice to handle an input stream of length m. Define U=\{g(f^{(0)}),g(f^{(1)}),\dots,g(f^{(m)})\}, V=\{g(f^{(0)}),A_{1}(u_{1}),\dots,A_{1}(u_{k}),A_{2}(u_{k+1}),\dots,A_{2}(u_{k_{2}}),\dots\}, and W=\{y_{0},y_{0},\dots,y_{1},y_{1},\dots\} (i.e., the algorithm's outputs R_{1},R_{2},\dots,R_{m}). These three sequences satisfy the conditions of Lemma 2.2, and thus we have \lambda_{0}(W)\leq\lambda_{\epsilon/8,m}(g). ∎

An Example

Below is a demonstration of how to apply the sketch switching technique to a common streaming problem, Distinct Elements. Also known as F_{0} estimation, the problem is defined by the query function g(f)=|\{i:f[i]\neq 0\}|,\;\;f\in\mathbb{R}^{n}. Although its proof is omitted, the following Corollary 2.1 plays a central role in applying sketch switching to a variety of tasks; the insertion-only restriction in Corollary 2.1 naturally constrains the technique itself to insertion-only models.

Corollary 2.1.

Let p\geq 0, and assume the input stream has length m=poly(n). The (\epsilon,m)-flip number of ||f||_{p}^{p} in the insertion-only model is \lambda_{\epsilon,m}(||f||_{p}^{p})=O(\frac{\log n}{\epsilon}) for p\leq 2, and O(\frac{p\log n}{\epsilon}) for p>2. For p=0, it is O(\frac{\log m}{\epsilon}).
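As a quick sanity check on these rates, the flip number of a monotone quantity such as ||f||_0 on an insertion-only stream can be counted greedily. The snippet below is an illustrative approximation written for this survey (not from [6]): it counts the ϵ-flips of the distinct-elements count when m distinct items are inserted one at a time, landing near log_{1+ϵ}(m) as the corollary predicts.

```python
import math

def eps_flips(values, eps):
    """Greedily count epsilon-flips: restart the anchor each time the
    sequence drifts out of the (1 +/- eps) window around it. For a
    monotone sequence this greedy count equals the flip number."""
    flips, anchor = 0, values[0]
    for v in values[1:]:
        if anchor == 0:
            moved = v != 0
        else:
            moved = v > (1 + eps) * anchor or v < (1 - eps) * anchor
        if moved:
            flips += 1
            anchor = v
    return flips

# ||f^(t)||_0 = t when m = 10^4 distinct elements arrive one at a time
m, eps = 10_000, 0.5
flips = eps_flips(list(range(m + 1)), eps)
bound = math.log(m) / math.log(1 + eps) + 1   # ~ log_{1+eps}(m)
```

For m = 10^4 and ϵ = 1/2 the greedy count is 21, comfortably under the log_{1.5}(m) ≈ 23 ceiling.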

Theorem 2.4 (Robust DistinctElements).

There exists an adversarially robust algorithm that outputs Rt(1±ϵ)f(t)0R_{t}\in(1\pm\epsilon)||f^{(t)}||_{0} at each time step tt with probability at least 1δ1-\delta. The space usage is given below.

O(log1/ϵϵ(log1/ϵ+log1/δ+log(logn)ϵ2+logn))O\left(\frac{\log 1/\epsilon}{\epsilon}\left(\frac{\log 1/\epsilon+\log 1/\delta+\log(\log n)}{\epsilon^{2}}+\log n\right)\right)
Proof.

First observe that DistinctElements inherently describes an insertion-only model, so by Corollary 2.1 the algorithm keeps \lambda_{\epsilon,m}=O(\frac{\log n}{\epsilon}) copies. The copies A_{1},A_{2},\dots,A_{\lambda} use the static streaming algorithm proposed by Błasiok [7], which guarantees \epsilon-strong g-tracking with probability 1-\delta_{0}. Each copy A_{\rho} takes O\left(\frac{\log 1/\epsilon+\log 1/\delta+\log(\log n)}{\epsilon^{2}}+\log n\right) bits of space. To avoid the complexity blow-up from the multiplicative term \frac{\log n}{\epsilon}, the following trick is used: instead of discarding a copy permanently, the algorithm keeps fewer copies than prescribed by \lambda_{\epsilon,m} and achieves the same result in a cyclic manner, re-activating a previously deactivated copy after running out of new ones. This reduces the copy complexity from O(\frac{\log n}{\epsilon}) to O(\log(1/\epsilon)/\epsilon). ∎

2.3 Robust Sampling

The oblivious random samplers discussed in Section 1.3 are used extensively in modern data-intensive systems, such as database monitoring and distributed machine learning. This naturally raises the question of whether the random samplers are vulnerable to potential attacks by adversarial streams.

Recall Theorem 1.4: for a set system (\mathcal{U},\mathcal{R}) with VC-dimension d, a random sample of size O(\frac{d+\log 1/\delta}{\epsilon}) is an \epsilon-approximation of X drawn from \mathcal{U} with probability 1-\delta. This well-established connection between sample size and learnability is extensively used in theoretical machine learning and is a consequence of the quantitative version of the Fundamental Theorem of PAC Learning (Theorem 1.2).

Ben-Eliezer and Yogev [5] showed that the sample size in Theorem 1.4 is not robust to adversarial streams, and derived the minimum overhead required for an adversarially robust sample. Section 3 shows that this overhead is implicitly connected to online learnability and builds further on this result.

Consider the adversarial setting described in Section 2.1. A formal definition of an adversarially robust algorithm is as follows.

Definition 2.4 ((ϵ,δ)(\epsilon,\delta)-Robust).

A randomized sampling algorithm is (ϵ,δ)(\epsilon,\delta)-robust if, for any strategy played by the (computationally unbounded) Adversary, the sample SS achieves ϵ\epsilon-approximation of input stream XX with respect to the set system (𝒰,)(\mathcal{U},\mathcal{R}) with probability 1δ1-\delta.

2.3.1 Adversarial Attack on Sampling

Theorem 2.5.

There exists a set system (\mathcal{U},\mathcal{R}) with VC-dimension 1 such that, for any \epsilon>0 and \delta<1/2, a sampler using the sample size that suffices in the static setting (Theorem 1.4) is not (\epsilon,\delta)-robust.

Proof.

Consider the set system (\mathcal{U},\mathcal{R}) where \mathcal{U} is the well-ordered set \{1,2,\dots,N\} with N\in[n^{6\ln n},2^{n/2}], and n is the length of the adversarial input stream. Define \mathcal{R}=\{[1,b]:\,b\in\mathcal{U}\} to be the set of inclusive intervals from 1 to each element of \mathcal{U}.

Observe that (𝒰,)(\mathcal{U},\mathcal{R}) has VC-dimension 11 (recall Definition 1.4 for definitions of VC-dimension and shattering). To see this, note that any subset C𝒰C\subset\mathcal{U} of size 1 can clearly be shattered (if C={c}C=\{c\}, [1,c][1,c] describes CC for a positive label and [1,c1][1,c-1] describes CC for a negative label of cc). So, the VC-dimension is certainly at least 1. However, consider any arbitrary set C={c1,c2}𝒰C=\{c_{1},c_{2}\}\subset\mathcal{U} of size 2, and suppose without loss of generality that c1<c2c_{1}<c_{2}. There exists a labeling of CC (namely, c1c_{1} has a negative label but c2c_{2} has a positive label) that no interval in \mathcal{R} can describe (any interval not containing c1c_{1} and satisfying its negative label cannot contain c2c_{2}). Similar logic shows that \mathcal{R} cannot shatter any subset of 𝒰\mathcal{U} of size at least 2. Therefore, we must have VCdim()<2VCdim()=1VCdim(\mathcal{R})<2\implies VCdim(\mathcal{R})=1.
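This shattering argument is easy to verify by brute force on a small universe. The helpers below are illustrative code written for this survey; they enumerate candidate sets and check every labeling directly:

```python
from itertools import combinations

def shatters(family, C):
    """True iff every subset of C arises as C ∩ R for some R in family."""
    patterns = {frozenset(C) & R for R in family}
    return len(patterns) == 2 ** len(C)

def vc_dim(universe, family):
    """Largest k such that some size-k subset of the universe is shattered."""
    d = 0
    for k in range(1, len(universe) + 1):
        if any(shatters(family, C) for C in combinations(universe, k)):
            d = k
    return d

U = list(range(1, 9))
prefixes = [frozenset(range(1, b + 1)) for b in U]   # R = {[1, b] : b in U}
```

Here `vc_dim(U, prefixes)` returns 1: a singleton like {2} is shattered, but for any pair {c_1, c_2} the labeling that rejects c_1 and accepts c_2 is never realized by a prefix interval.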

Here we consider a Bernoulli sampler Ber(n,p) under the following adversarial scheme; the attack extends easily to the other oblivious random samplers discussed earlier (e.g., the Reservoir sampler). Assume p\leq\frac{1}{4}, a reasonable assumption if the sampler is to maintain a sample of size sublinear in a large input stream X.

1: a_{1}\leftarrow 1
2: b_{1}\leftarrow N
3: p^{\prime}\leftarrow\max\{p,\,\frac{\ln n}{n}\}
4: for i=1,2,\dots,n do
5:     x_{i}=\lfloor a_{i}+(1-p^{\prime})(b_{i}-a_{i})\rfloor
6:     if Sampler accepts x_{i} then
7:         a_{i+1}\leftarrow x_{i}
8:         b_{i+1}\leftarrow b_{i}
9:     else
10:         a_{i+1}\leftarrow a_{i}
11:         b_{i+1}\leftarrow x_{i}
12:     end if
13: end for
Algorithm 2 Adversarial Strategy for a Bernoulli Sampler

The goal of the adversarial algorithm above is to maintain the following invariant. At round ii of the two-player game,

  • Elements that have been accepted by Sampler are at most aia_{i}

  • Elements that have been rejected by Sampler are at least bib_{i}

  • Elements submitted by Adversary at round ii are between aia_{i} and bib_{i}.

This invariant ensures that all elements in S are smaller than the rejected elements in X^{\prime}=X\setminus S. It then naturally follows that S cannot be an \epsilon-approximation of X. Formally, let s be the maximum element in S and consider R=[1,s]\in\mathcal{R}. Observe that \frac{|R\cap S|}{|S|}=1, and hence

||RS||S||RX||X||1|S|n12p1/2ϵ\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap X|}{|X|}\right|\geq 1-\frac{|S|}{n}\geq 1-2p^{\prime}\geq 1/2\geq\epsilon

Note that the above attack scheme is guaranteed to succeed as long as the Adversary does not run out of elements to draw from, i.e., as long as a_{i}<b_{i} at every round i. The expected sample size drawn by the Bernoulli sampler is np\leq np^{\prime}, so by Markov's inequality Pr[|S|\geq 2np^{\prime}]<\frac{1}{2}. When |S|<2np^{\prime}, the following lemma from [5] completes the claim that the Adversary has enough elements to draw from.

Lemma 2.6.

If |S|<2np|S|<2np^{\prime}, then biainb_{i}-a_{i}\geq n for any i[n]i\in[n].
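The attack and its invariant are simple to simulate. The code below is a toy instantiation written for this survey (the integer split arithmetic and fixed seed are our choices, not from [5]); it plays Algorithm 2 against a Bernoulli(p) sampler and measures the approximation error on R = [1, max(S)]:

```python
import math
import random

def bernoulli_attack(n, p, seed=0):
    """Play Algorithm 2 against a Bernoulli(p) sampler over {1, ..., N}.
    Returns the accepted sample S and the full stream X."""
    rng = random.Random(seed)
    N = 2 ** (n // 2)                       # huge universe, as in Theorem 2.5
    a, b = 1, N
    pp = max(p, math.log(n) / n)            # p' = max{p, ln(n)/n}
    frac = int((1 - pp) * 10 ** 6)          # split fraction, as an integer
    S, X = [], []
    for _ in range(n):
        x = a + (b - a) * frac // 10 ** 6   # ~ floor(a + (1 - p')(b - a))
        X.append(x)
        if rng.random() < p:                # oblivious sampling decision
            S.append(x)
            a = x                           # accepted: push future items up
        else:
            b = x                           # rejected: push future items down
    return S, X

S, X = bernoulli_attack(n=1000, p=0.05)
s = max(S)
# fraction of the stream landing in R = [1, s]
in_R = sum(1 for x in X if x <= s) / len(X)
err = abs(1 - in_R)   # |R ∩ S|/|S| = 1, so this is the approximation error
```

Every sampled element ends up below every rejected one, so the error is roughly 1 − |S|/n ≥ 1 − 2p′, far above any reasonable ϵ.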

2.3.2 Adversarially Robust Sample Size

Ben-Eliezer and Yogev [5] developed an upper bound on the sample size needed for (\epsilon,\delta)-robustness (Definition 2.4) of Bernoulli and Reservoir samplers. As with the adversarial attack in the previous section, we focus on the Bernoulli sampler; a very similar formulation applies to the Reservoir sampler.

Theorem 2.7 (Robust Sample Size).

For any set system (𝒰,)(\mathcal{U},\mathcal{R}), ϵ,δ(0,1)\epsilon,\delta\in(0,1), and given an input stream of size nn, a Bernoulli sampler Ber(n,p)Ber(n,p) is (ϵ,δ)(\epsilon,\delta)-robust if the following holds.

p\geq 10\cdot\frac{\ln|\mathcal{R}|+\ln(4/\delta)}{\epsilon^{2}n}
Corollary 2.2.

The sample size drawn by the Bernoulli sampler is well-concentrated around its expectation np. The theorem above then implies that a sample of size \Theta\left(\frac{\ln|\mathcal{R}|+\ln(1/\delta)}{\epsilon^{2}}\right) is an \epsilon-approximation with probability 1-\delta.
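To get a feel for the gap, one can compare this robust requirement against the static VC bound of Theorem 1.4 on the set system from Theorem 2.5, where d = 1 but |R| = N is enormous. The snippet below is a back-of-the-envelope illustration with the hidden constants set to 1 (an assumption for demonstration only):

```python
import math

def static_size(d, eps, delta):
    """Static epsilon-approximation size O((d + log 1/delta)/eps) of
    Theorem 1.4, with the hidden constant set to 1."""
    return (d + math.log(1 / delta)) / eps

def robust_size(num_sets, eps, delta):
    """Robust size Theta((ln|R| + ln 1/delta)/eps^2) of Corollary 2.2,
    with the hidden constant set to 1."""
    return (math.log(num_sets) + math.log(1 / delta)) / eps ** 2

n = 1000                  # stream length in the attack of Theorem 2.5
N = 2 ** (n // 2)         # |R| = |U| = N for the prefix intervals
eps, delta = 0.1, 0.01
static = static_size(1, eps, delta)    # VC-dimension d = 1
robust = robust_size(N, eps, delta)    # ln|R| = (n/2) ln 2 ~ 347
```

The adversary forces both a quadratically worse dependence on ϵ and a swap of d for ln|R|; Section 3 shows that the right replacement for ln|R| is in fact the Littlestone dimension.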

Lemma 2.8.

Let \mathcal{U} be a universe, R\subseteq\mathcal{U} a subset, and X an adversarial input stream of length n. A Bernoulli sampler with parameter p\geq 10\cdot\frac{\ln(4/\delta)}{\epsilon^{2}n} satisfies

Pr[||RS||S||RX||X||ϵ]δPr\left[\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap X|}{|X|}\right|\geq\epsilon\right]\leq\delta

Note that the main Theorem 2.7 follows from Lemma 2.8 by a union bound. Let (\mathcal{U},\mathcal{R}), \epsilon, \delta be as in Theorem 2.7, and take a Bernoulli sampler with p=10\cdot\frac{\ln|\mathcal{R}|+\ln(4/\delta)}{\epsilon^{2}n}, which satisfies the condition of the theorem. For each R\in\mathcal{R}, apply the lemma with parameters \epsilon,\delta/|\mathcal{R}| to obtain the following:

Pr[||RS||S||RX||X||ϵ]δ||Pr\left[\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap X|}{|X|}\right|\geq\epsilon\right]\leq\frac{\delta}{|\mathcal{R}|}

If the above event holds for every R\in\mathcal{R}, it follows that the sample S is an \epsilon-approximation of X. A union bound over all R\in\mathcal{R} then shows that the Bernoulli sampler with p as above is (\epsilon,\delta)-robust. It remains to prove Lemma 2.8 to complete the main theorem.

Proof.

A major difference between the analysis in the static setting and the adversarial setting is the assumption of independence between elements of the input stream. Since the adversary submits elements based on the history of submitted elements and the maintained sample, concentration inequalities like the Chernoff bound cannot be applied directly in the adversarial setting. This motivates formulating the random variables as a martingale. Given R\subseteq\mathcal{U}, for each round i define X_{i} and S_{i} to be the input stream and the sample maintained up to round i, and define the following random variables:

AiR=in|RXi||Xi|=|RXi|nA_{i}^{R}=\frac{i}{n}\cdot\frac{|R\cap X_{i}|}{|X_{i}|}=\frac{|R\cap X_{i}|}{n}
BiR=|RSi|npB_{i}^{R}=\frac{|R\cap S_{i}|}{np}
ZiR=BiRAiRZ_{i}^{R}=B_{i}^{R}-A_{i}^{R}

Observe that the sequence of Z_{i}^{R} is a martingale (Lemma 2.10). Note that A_{n}^{R}=\frac{|R\cap X|}{|X|}, and therefore \left|A_{n}^{R}-\frac{|R\cap S|}{|S|}\right| is the quantity of interest in Lemma 2.8. It thus suffices to prove the following two inequalities and apply the triangle inequality:

  1. Pr\left[|A_{n}^{R}-B_{n}^{R}|\geq\frac{\epsilon}{2}\right]\leq\frac{\delta}{2}

  2. Pr\left[\left|B_{n}^{R}-\frac{|R\cap S_{n}|}{|S_{n}|}\right|\geq\frac{\epsilon}{2}\right]\leq\frac{\delta}{2}

The following two lemmas from [8] and [5] respectively explore useful properties of martingales to conclude the proof.

Lemma 2.9.

Let X=(X_{0},X_{1},X_{2},\dots,X_{n}) be a martingale. Assume the variance of X_{i} conditioned on the previous elements X_{0},X_{1},\dots,X_{i-1} is bounded by \sigma_{i}^{2} for some \sigma_{0}^{2},\sigma_{1}^{2},\dots,\sigma_{n}^{2}\geq 0, and that there exists M such that |X_{i}-X_{i-1}|\leq M for all i\in[n]. Then for any \lambda\geq 0, the following inequality holds.

Pr[|XnX0|λ]exp(λ22i=1nσi2+(Mλ/3))Pr\left[|X_{n}-X_{0}|\geq\lambda\right]\leq\exp\left(-\frac{\lambda^{2}}{2\sum_{i=1}^{n}\sigma_{i}^{2}+(M\lambda/3)}\right)
Lemma 2.10.

The sequence Z0R,Z1RZnRZ_{0}^{R},Z_{1}^{R}\dots Z_{n}^{R} forms a martingale. The variance of ZiRZ_{i}^{R} conditioned on the previous elements Z0RZi1RZ_{0}^{R}\dots Z_{i-1}^{R} is bounded by 1/n2p1/n^{2}p, and |ZiRZi1R|1/np|Z_{i}^{R}-Z_{i-1}^{R}|\leq 1/np.

Combining the two lemmas above, and assuming np\geq\frac{9}{\epsilon^{2}}\ln(4/\delta), we have

Pr[|AnRBnR|ϵ2]2exp((ϵ/2)22n1n2p+ϵ6np)<2exp(ϵ2np9)δ2Pr\left[\left|A_{n}^{R}-B_{n}^{R}\right|\geq\frac{\epsilon}{2}\right]\leq 2\exp\left(-\frac{(\epsilon/2)^{2}}{2n\cdot\frac{1}{n^{2}p}+\frac{\epsilon}{6np}}\right)<2\exp\left(-\frac{\epsilon^{2}np}{9}\right)\leq\frac{\delta}{2}

which yields the first of the two inequalities. For the second, observe that B_{n}^{R}=\frac{|R\cap S|}{np}=\frac{|R\cap S|}{|S|}\cdot\frac{|S|}{np}. Here we use the defining property of an oblivious sampler: regardless of the adversary's attack scheme, an oblivious sampler accepts or rejects an element based only on its index. For a Bernoulli sampler, the size of S therefore follows a binomial distribution. By a Chernoff bound with deviation \epsilon/2, we have

Pr[||S|np|>ϵnp/2]2exp(ϵ2np10)Pr\left[||S|-np|>\epsilon np/2\right]\leq 2\exp\left(-\frac{\epsilon^{2}np}{10}\right)

The above inequality says that with high probability the event ||S|-np|>\epsilon np/2 does not happen. Conditioning on ||S|-np|\leq\epsilon np/2, the second inequality follows from

\left|\frac{|R\cap S|}{|S|}-B_{n}^{R}\right|=\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap S|}{|S|}\cdot\frac{|S|}{np}\right|\leq\left|1-\frac{|S|}{np}\right|\leq\frac{\epsilon}{2} ∎
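The martingale structure underlying this proof can be checked empirically. The simulation below is a toy written for this survey, with an arbitrary (hypothetical) adaptive adversary; it evolves Z_i^R for a Bernoulli sampler and confirms that increments are bounded by 1/(np) and average out to zero, as Lemma 2.10 asserts:

```python
import random

def z_trajectory(n, p, R, adversary, seed):
    """One run of Z_i^R = |R ∩ S_i|/(np) - |R ∩ X_i|/n, where the
    adversary picks the next element as a function of the sample so far."""
    rng = random.Random(seed)
    S, Z = [], [0.0]
    in_R_X = in_R_S = 0
    for _ in range(n):
        x = adversary(S)
        in_R_X += x in R
        if rng.random() < p:        # oblivious acceptance, independent of x
            S.append(x)
            in_R_S += x in R
        Z.append(in_R_S / (n * p) - in_R_X / n)
    return Z

n, p = 200, 0.25
R = set(range(100))
adv = lambda S: (7 * len(S)) % 200    # a toy adaptive strategy
Z = z_trajectory(n, p, R, adv, seed=0)
max_step = max(abs(Z[i + 1] - Z[i]) for i in range(n))
mean_Zn = sum(z_trajectory(n, p, R, adv, seed)[-1]
              for seed in range(2000)) / 2000
```

Adaptivity changes which elements arrive, but not the zero-mean, bounded-increment structure of Z, which is exactly what Lemma 2.9 exploits.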

3 From Adversarial Sampling to Online Learnability

Throughout the above results, we have seen an interesting interplay between streaming and statistical learning theory. In particular, we note that the learnability of certain hypothesis classes plays a nontrivial role in sampling theory. Recall from Definition 1.3 and Theorem 1.1 that a hypothesis class \mathcal{R} is PAC learnable if and only if the empirical error over large enough samples looks like the generalization error. This is equivalent to the statement that a large enough sample is an ϵ\epsilon-approximation of the distribution 𝒟\mathcal{D} with respect to \mathcal{R} with high probability (here, we imagine a stream that perfectly encapsulates 𝒟\mathcal{D} that we wish to approximate with a sample). In the offline VC-theory there is an established equivalence between admittance of a Uniform Law of Large Numbers, ϵ\epsilon-approximability, and PAC learnability. It is therefore reasonable to inquire about a similar result in the setting of online learning, in which some aspect of sampling approximability relates to online learnability.

On the flip side, we have also seen situations in which adversarial streaming research reveals shadows of online learning theory. For example, consider the adversarial attack on sampling that was devised in section 2.3.1. At a high level, that attack constructed a large-depth mistake tree against the hypothesis class ={[1,b]:b𝒰}\mathcal{R}=\{[1,b]:b\in\mathcal{U}\}. The result was that a Bernoulli sampler would need to create a much larger sample to yield an ϵ\epsilon-approximation under this attack than it otherwise would. In the statistical learning theory language, however, this is akin to demonstrating that \mathcal{R}, with a VC-dimension of 1, has a much higher Littlestone dimension to go along with its larger sample complexity when up against an adversary. This hints at a relationship between Littlestone dimension and sample complexity in adversarial settings. Even the functional form of Corollary 2.2, a result about sample size necessary for ϵ\epsilon-approximability in adversarial settings, feels very familiar in the context of statistical learning (c.f. Theorem 1.2).

Throughout our above investigations, all signs point to a fascinating and deeply-knit relationship between the study of adversarially robust sampling (see Section 2.3) and measures of online learnability (Definitions 1.5, 1.6). This relationship is precisely what is formalized, proven, and explored in the recent work "Adversarial Laws of Large Numbers and Optimal Regret in Online Classification" by Alon et al. [3]. Throughout the rest of this section, we present the results of that work and describe how they beautifully tie up gaps in the previous theory and set the stage for a fresh new area of research.

3.1 Adversarial Uniform Law of Large Numbers

For the results below, we still work in the adversarial sampling model introduced in Section 2.1. Consider the following definition, which extends the concept of \epsilon-approximate sampling to the regime of adversarially prepared streams.

Definition 3.1 (Adversarial Uniform Law of Large Numbers).

Let (𝒰,)(\mathcal{U,R}) be a set system. A hypothesis class \mathcal{R} admits an Adversarial Uniform Law of Large Numbers if there exists a function m:(0,1)2m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} and a sampler such that for any adversarially-prepared stream X over 𝒰\mathcal{U} and any tolerances ϵ,δ(0,1)\epsilon,\delta\in(0,1), the sampler chooses at most m(ϵ,δ)m_{\mathcal{R}}(\epsilon,\delta) samples from XX which form an ϵ\epsilon-approximation of XX with probability at least 1δ1-\delta.

Compare the above definition with our previous understanding about Uniform Laws of Large Numbers and forming ϵ\epsilon-approximations over regularly drawn (not adversarially-prepared) streams. This feels like a very natural mechanism with which to fill in the gaps and completely characterize online learning in the same way that we characterize offline learning with Theorems 1.1, 1.2. The two main theorems provided by Alon et al. do precisely that; we restate them below.

Theorem 3.1 (Fundamental Theorem of Online Learning).

Let (𝒰,)(\mathcal{U},\mathcal{R}) be a set system. The following are equivalent:

  1. \mathcal{R} is online learnable in the realizable and agnostic settings.

  2. \mathcal{R} admits an Adversarial Uniform Law of Large Numbers.

  3. \mathcal{R} has a finite Littlestone dimension.

Theorem 3.2 (Fundamental Theorem of Online Learning - Quantitative).

Let (𝒰,)(\mathcal{U},\mathcal{R}) be a set system with finite Littlestone dimension dd. Then, there exist constants C1,C2C_{1},C_{2} such that:

  1. \mathcal{R}'s Adversarial Uniform Law of Large Numbers has sample complexity

     C_{1}\frac{d}{\epsilon^{2}}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d+\log(1/\delta)}{\epsilon^{2}}

  2. \mathcal{R} is online learnable in the agnostic setting with optimal regret at time T of

     C_{1}\sqrt{d\cdot T}\leq R_{T}(\mathcal{R})\leq C_{2}\sqrt{d\cdot T}

Observe that Theorem 3.1 completely characterizes online learnability in precisely the same way that Theorem 1.1 did for offline learnability. In particular, the Adversarial Uniform Law of Large Numbers plays the same role in online learnability as the Uniform Law of Large Numbers did offline, matching our intuitions at the beginning of this section.

There are some interesting things to note about the quantitative results in Theorem 3.2 as well. In the first item, the upper bound on the adversarial sample size precisely matches the upper bound for regular sampling, simply with the VC-dimension replaced by the Littlestone dimension; this once again deepens the symmetry developed throughout this survey. The lack of a tight lower bound, however, is interesting. While there is a tighter lower bound for \epsilon-nets instead of \epsilon-approximations, which correspond to realizable rather than agnostic online learning (interested readers should see Section 7 of Alon et al.), neither bound matches the tightness we see in Theorem 1.2. Perhaps further research can improve the lower bound on adversarial sampling complexity, or perhaps there is simply much more variability in how difficult it is to achieve an Adversarial Uniform Law of Large Numbers for non-oblivious samplers (for oblivious samplers like Bernoulli and Uniform, regular VC theory makes the bound tight). The second item of Theorem 3.2 settles a previously open inequality for the optimal regret of agnostic online learners, definitively saying that optimal learners face regret \Theta(\sqrt{d\cdot T}). In any case, these results fill in crucial gaps in online learnability theory and strengthen the connections between sampling theory and statistical learning. In particular, the symmetry between oblivious sampling/offline learning and adversarial sampling/online learning is now made clear in a way that can hopefully inspire and enlighten future research.

3.2 Overview of Proof Techniques

In this final section, we highlight some of the key ideas and techniques used by Alon et al.[3] to prove the results of the previous section (Theorems 3.1, 3.2). The actual proof mechanics are wild and would require immense exposition, but there is still much to be learned about the connections between streaming and statistical learning from studying their high level approach.

The first step of the proof of Theorem 3.2 uses an online variant of a technique called double sampling, a classic argument method credited to [17]. In this application of double sampling (Section 8 in [3]), the authors replace the approximation error with respect to the entire stream with an approximation error with respect to a test set of the same size as the gathered sample. This has the effective result of reducing the size of the problem from gathering a sample of size kk from a stream of size nn to gathering a sample of size kk from a stream of size 2k2k. Importantly, the adversary sees the 2k2k elements within the stream that are selected, but still has no more of an idea about what the final kk sampled elements will be. By doing so, the minimization of the ϵ\epsilon-approximation error reduces to minimizing the discrepancy in a well known online combinatorial game, defined below.

Definition 3.2 (Online Combinatorial Discrepancy).

The online discrepancy game with respect to \mathcal{R} is a sequential game played between a painter and an adversary which proceeds as follows: at each round t=1,,2kt=1,...,2k the adversary places an item xtx_{t} on the board, and the painter colors xtx_{t} in either red or blue. The goal of the painter is that each set in \mathcal{R} will be colored in a balanced fashion; i.e., if we denote by II the set of indices of items from the stream XX that are colored red, the painter’s goal is to minimize the discrepancy

Disc2k(,X,I)=maxR||XIR||X[2k]IR||Disc_{2k}(\mathcal{R},X,I)=\max_{R\in\mathcal{R}}\left||X_{I}\cap R|-|X_{[2k]\setminus I}\cap R|\right|

It is clear from the above definition that this is an equivalent formulation to the double sampling-reduced ϵ\epsilon-approximation optimization, since painting the item red or blue corresponds to sampling or not sampling the element, and coloring each set in \mathcal{R} in a balanced fashion corresponds to preserving density in our sample. This reduction allows us to connect performance in combinatorial discrepancy (which in turn is connected to ϵ\epsilon-approximation) to a measure of complexity known as Sequential Rademacher Complexity [14].
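As a small illustration (written for this survey, instantiated on the prefix-interval class from Section 2.3.1), even a painter who colors uniformly at random keeps every prefix interval balanced to roughly O(√(k log k)), far below the trivial 2k, which is what makes discrepancy a viable route to ϵ-approximation:

```python
import random

def prefix_discrepancy(stream, red_indices):
    """Disc(R, X, I) for R = {[1, b]}: the worst red/blue imbalance
    over any prefix interval of the universe."""
    red = set(red_indices)
    worst = 0
    for b in sorted(set(stream)):
        balance = sum(1 if i in red else -1
                      for i, x in enumerate(stream) if x <= b)
        worst = max(worst, abs(balance))
    return worst

rng = random.Random(1)
k = 500
stream = [rng.randrange(10 ** 6) for _ in range(2 * k)]   # adversary's items
I = [i for i in range(2 * k) if rng.random() < 0.5]       # random painter
disc = prefix_discrepancy(stream, I)
# taking the red half as the sample then approximates the stream
# with error on the order of disc / k
```

For harder classes a random painter no longer suffices, and the painter's achievable discrepancy is governed by the Littlestone dimension, which is precisely the bridge the proof builds.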

Definition 3.3.

The Sequential Rademacher Complexity of a hypothesis class \mathcal{R} is given by the expected discrepancy

Radn()=𝔼[Discn(,X,I)]Rad_{n}(\mathcal{R})=\mathbb{E}\left[Disc_{n}(\mathcal{R},X,I)\right]

This is a useful direction to maneuver toward because of an already-established result that the optimal regret of agnostic online learning satisfies R_{T}(\mathcal{R})\leq 2Rad_{T}(\mathcal{R}) (for more details, see Section 12 of [3]). In addition, it ties the sample complexity of \epsilon-approximation in the adversarial setting directly to the Sequential Rademacher Complexity, yielding the first item of the quantitative theorem. So, to arrive at Theorem 3.2, all that is left to do is bound Rad_{T}(\mathcal{R}) by O(\sqrt{Ldim(\mathcal{R})\cdot T}). The actual mechanics involve fractional (\epsilon,\gamma)-covers of the hypothesis class \mathcal{R} (probability measures over dynamic sets that agree with most of \mathcal{R}) and a chaining argument, but the tooling for that part of the proof is beyond the scope of this survey (the details can be found in Section 9 of [3] for those interested). All in all, this high-level overview of the proof is presented to once again emphasize the depth and richness of the connection between adversarial sampling and online learning theory, this time through the lens of the Sequential Rademacher Complexity.

4 Remarks and Future Research

We believe that the relationship explored in this paper between two until-recently disjoint fields of study, online statistical learning and adversarially robust streaming, opens a new area of research with much promise and direction. Some particular directions of further research pertain to the study of non-oblivious samplers created using the robustification methods of [6], [19], and others. These methods for mechanically turning an arbitrary oblivious streaming algorithm into an adversarially robust one could yield new sampling schemes with better performance on adversarial \epsilon-approximation, different bounds on Sequential Rademacher Complexity, and, in general, different results and perspectives than those found in Alon et al. using oblivious samplers. In the other direction, perhaps the bounds and connections to Rademacher complexity theory could enlighten designers of sampling algorithms to devise techniques that are adversarially robust and have standalone practical value as sampling methods. There seem to be many promising directions both to build on and to make use of the new connections explored throughout this research.

5 Acknowledgements

We would like to thank Professors Matthew Weinberg and Huacheng Yu for their kind feedback, reliable support, and incredible teaching. Thank you!

References

  • [1] Charu Aggarwal and Karthik Subbian “Evolutionary network analysis” In ACM Comput. Surv. 47.1 Association for Computing Machinery (ACM), 2014, pp. 1–36
  • [2] N. Alon, Y. Matias and M. Szegedy “The space complexity of approximating the frequency moments” In Proc. ACM STOC, 1996, pp. 20–29
  • [3] Noga Alon et al. “Adversarial laws of large numbers and optimal regret in online classification”, 2021 arXiv:2101.09054 [cs.LG]
  • [4] Shai Ben-David, Dávid Pál and Shai Shalev-Shwartz “Agnostic Online Learning” In Annual Conference Computational Learning Theory, 2009
  • [5] Omri Ben-Eliezer and Eylon Yogev “The Adversarial Robustness of Sampling”, 2019 arXiv:1906.11327 [cs.DS]
  • [6] Omri Ben-Eliezer, Rajesh Jayaram, David P Woodruff and Eylon Yogev “A framework for adversarially robust streaming algorithms”, 2020 arXiv:2003.14265 [cs.DS]
  • [7] Jarosław Błasiok “Optimal streaming and tracking distinct elements with high probability”, 2018 arXiv:1804.01642 [cs.DS]
  • [8] Fan Chung and Linyuan Lu “Concentration inequalities and martingale inequalities: A survey” In Internet Math. 3.1 Internet Mathematics, 2006, pp. 79–127
  • [9] Moritz Hardt and David P Woodruff “How robust are linear sketches to adaptive inputs?”, 2012 arXiv:1211.1056 [cs.DS]
  • [10] Ashwin Lall et al. “Data streaming algorithms for estimating entropy of network traffic” In Perform. Eval. Rev. 34.1 Association for Computing Machinery (ACM), 2006, pp. 145–156
  • [11] Nick Littlestone “Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm” In Mach. Learn. 2.4 USA: Kluwer Academic Publishers, 1988, pp. 285–318 DOI: 10.1023/A:1022869011914
  • [12] Nabil H. Mustafa and Kasturi R. Varadarajan “Handbook of Discrete and Computational Geometry”, 2017
  • [13] S Muthukrishnan “Data streams: Algorithms and applications” In Found. Trends Theor. Comput. Sci. 1.2 Now Publishers, 2005, pp. 117–236
  • [14] Alexander Rakhlin, Karthik Sridharan and Ambuj Tewari “Online Learning via Sequential Complexities” arXiv, 2010 DOI: 10.48550/ARXIV.1006.1138
  • [15] Shai Shalev-Shwartz and Shai Ben-David “Understanding Machine Learning: From Theory to Algorithms” USA: Cambridge University Press, 2014
  • [16] M. Talagrand “Sharper bounds for Gaussian and empirical processes” In Ann. Prob., 1994, pp. 22:28–76
  • [17] V N Vapnik and A Ya Chervonenkis “On the uniform convergence of relative frequencies of events to their probabilities” In Measures of Complexity Cham: Springer International Publishing, 2015, pp. 11–30
  • [18] V.N. Vapnik and A.Y. Chervonenkis "On the uniform convergence of relative frequencies of events to their probabilities" In Theory Probab. Appl., 1971, pp. 16:264–280
  • [19] David P. Woodruff and Samson Zhou “Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators” arXiv, 2020 DOI: 10.48550/ARXIV.2011.07471
  • [20] Andrew Chi-Chin Yao “Probabilistic computations: Toward a unified measure of complexity” In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977) Providence, RI, USA: IEEE, 1977