
Robust Streaming, Sampling, and a Perspective on Online Learning

Evan Dogariu  Jiatong Yu
Department of Computer Science, Princeton University
{edogariu, jiatongy}@princeton.edu
Abstract

A young and rapidly growing field of theoretical computer science is that of robust streaming. Streaming arises in many practical settings, including network traffic analysis and routing, reinforcement learning, database monitoring, server query response, and distributed computing. A nascent subfield concerns streaming algorithms that are robust to adversarially prepared streams, a setting with substantial practical grounding. For example, an adversary could submit a small amount of carefully chosen traffic to mount a denial-of-service attack on a network routing system; a routing algorithm robust to such input would have immense practical use. We investigate this new field of robust streaming and, in particular, the formalization of robust sampling, which concerns sampling from an adversarially prepared stream so as to recover a representative sample.

Throughout this survey, we also highlight, explore, and deepen the connection between the field of robust streaming and that of statistical online learning. On the surface, these fields can appear distinct and are often researched independently; however, there is a deep interrelatedness that can be used to generate new results and intuitions in both places.

In this work we present an overview of statistical learning, followed by a survey of robust streaming techniques and challenges, culminating in several rigorous results proving the relationship that we motivate and hint at throughout the journey. Furthermore, we unify often disjoint theorems in a shared framework and notation to clarify the deep connections that are discovered. We hope that by approaching these results from a shared perspective, already aware of the technical connections that exist, we can enlighten the study of both fields and perhaps motivate new and previously unconsidered directions of research.

1 Preliminaries

1.1 Statistical Learning

The field of statistical learning concerns itself with the formal study of questions such as which tasks are learnable from data, how much training data is required to fit a general distribution, and other topics of this flavor. We present formal frameworks for research in this field in the offline and online settings below.

1.1.1 Offline Setting

In the offline setting, a learner attempts to learn from a dataset that is available all at once (in contrast with the online setting, where data arrives sequentially in a stream). In order to formalize these ideas, we will supply some notation for the setting of learning from data. We focus on binary classification below and represent our hypotheses as subsets of the domain, but general statistical learning theory follows the same principles and can reduce to this framework.

We consider learning over a set system (\mathcal{U},\mathcal{R}), where \mathcal{U} is a universe of possible elements and \mathcal{R}\subseteq 2^{\mathcal{U}} is a set of subsets of the universe. \mathcal{R} is called the hypothesis class in which a learner attempts to learn, for reasons we will see below. Let \mathcal{S}=((x_{1},y_{1}),...,(x_{m},y_{m}))\in(\mathcal{U}\times\{0,1\})^{m} be a finite training sample of size m, where the x_{i}'s are sampled and labeled either 0 or 1 according to some scheme. To be specific about the data generation: we work in the realizable setting if \mathcal{U} is equipped with a distribution \mathcal{D} over its elements and there exists some hypothesis f\in\mathcal{R} that generates the labels (i.e., y_{i}=1 if x_{i}\in f and y_{i}=0 otherwise). If not, then we are in the agnostic setting, in which the space \mathcal{U}\times\{0,1\} is equipped with a joint distribution \mathcal{D}. The realizability assumption posits a perfect ground-truth hypothesis in \mathcal{R} that we would like the learner to recover; otherwise, we would like to learn the best hypothesis we can. A learning algorithm is any algorithm that, given a training sample \mathcal{S}, outputs a hypothesis h\in\mathcal{R} about the training data. Hopefully, the learned hypothesis h not only fits the training sample \mathcal{S} well, but also generalizes well. We define the generalization error (or risk) of a hypothesis h\in\mathcal{R} in the realizable setting to be

L_{\mathcal{D},f}(h)=\underset{x\sim\mathcal{D}}{\Pr}[\mathbbm{1}_{x\in h}\neq\mathbbm{1}_{x\in f}]

and in the agnostic setting to be

L_{\mathcal{D}}(h)=\underset{(x,y)\sim\mathcal{D}}{\Pr}[\mathbbm{1}_{x\in h}\neq y]
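The empirical analogue of this risk on a finite sample is easy to compute directly. The following minimal Python sketch treats a hypothesis as a set of domain elements, per the set-system view above; all concrete values are hypothetical.

```python
def empirical_risk(h, sample):
    """Fraction of labeled points (x, y) that the set-hypothesis h
    misclassifies (h labels x as 1 exactly when x is in h)."""
    return sum(1 for x, y in sample if (x in h) != (y == 1)) / len(sample)

# Hypothetical example: universe {0..9}, hypothesis h = "x is even".
h = {0, 2, 4, 6, 8}
sample = [(1, 0), (2, 1), (3, 1), (4, 1)]  # only (3, 1) is misclassified
print(empirical_risk(h, sample))  # 0.25
```

As the sample grows, this empirical quantity concentrates around the generalization error above; making that statement uniform over the class is exactly the subject of Definition 1.3.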

With all of the above notation out of the way, we can now make headway into the statistical learning theory. The first step in formalization is to define what is meant by a hypothesis class being learnable; to this end, we introduce the following definitions.

Definition 1.1 (PAC-Learnable).

A hypothesis class \mathcal{R} is PAC-learnable in the realizable setting if there exist a function m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} and a learning algorithm with the following property: for all accuracy and probability tolerances \epsilon,\delta\in(0,1), all distributions \mathcal{D} over \mathcal{U}, and all labeling sets f\in\mathcal{R}, when the learning algorithm is run on m\geq m_{\mathcal{R}}(\epsilon,\delta) i.i.d. samples drawn from \mathcal{D} and labeled by f, it returns a hypothesis h\in\mathcal{R} for which

\Pr[L_{\mathcal{D},f}(h)\leq\epsilon]\geq 1-\delta
Definition 1.2 (Agnostic PAC-Learnable).

A hypothesis class \mathcal{R} is PAC-learnable in the agnostic setting if there exist a function m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} and a learning algorithm with the following property: for all accuracy and probability tolerances \epsilon,\delta\in(0,1) and all distributions \mathcal{D} over \mathcal{U}\times\{0,1\}, when the learning algorithm is run on m\geq m_{\mathcal{R}}(\epsilon,\delta) i.i.d. samples drawn from \mathcal{D}, it returns a hypothesis h\in\mathcal{R} for which

\Pr[L_{\mathcal{D}}(h)\leq\epsilon]\geq 1-\delta

Here, PAC stands for Probably-Approximately-Correct. Intuitively, these definitions mean that the correct hypothesis can be learned to within an arbitrary risk tolerance (the Approximately Correct) with high probability (the Probably), using a number of data points determined by these tolerances. We can characterize hypothesis classes with two concepts defined below (for intuition and examples, consult [15]).

Definition 1.3 (Uniform Law of Large Numbers, Sample Complexity).

Let (\mathcal{U},\mathcal{R}) be a set system. A hypothesis class \mathcal{R} admits a Uniform Law of Large Numbers if there exists a function m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} such that for any distribution \mathcal{D} over \mathcal{U} and any tolerances \epsilon,\delta\in(0,1), the empirical error L_{\mathcal{S}}(h) over a sample \mathcal{S} of size m\geq m_{\mathcal{R}}(\epsilon,\delta) satisfies

\forall h\in\mathcal{R}\quad|L_{\mathcal{S}}(h)-L_{\mathcal{D}}(h)|\leq\epsilon

with probability at least 1-\delta. This property is often referred to as uniform convergence. We call m_{\mathcal{R}}(\epsilon,\delta) the sample complexity of \mathcal{R}.

Definition 1.4 (VC-Dimension).

The VC-dimension of a hypothesis class \mathcal{R}, denoted VCdim(\mathcal{R}), is the maximal size of a set C\subseteq\mathcal{U} that can be shattered by \mathcal{R}. If \mathcal{R} can shatter sets of arbitrarily large size, we say that \mathcal{R} has infinite VC-dimension. Here, shattering has the usual definition: \mathcal{R} shatters C\subseteq\mathcal{U} if for every labeling of C there exists an element of \mathcal{R} that perfectly realizes it.
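Shattering can be checked by brute force for small finite classes. The sketch below uses a one-dimensional threshold class as an illustrative assumption (not from the text) and confirms that thresholds shatter single points but not pairs, i.e., have VC-dimension 1.

```python
def shatters(hypotheses, C):
    """True iff the class (a list of sets) realizes every 0/1 labeling of C."""
    C = list(C)
    labelings = {tuple(x in h for x in C) for h in hypotheses}
    return len(labelings) == 2 ** len(C)

# Hypothetical example: threshold hypotheses {x : x < t} over universe {0..4}.
universe = range(5)
thresholds = [set(x for x in universe if x < t) for t in range(6)]
print(shatters(thresholds, [2]))     # True: both labelings of one point occur
print(shatters(thresholds, [1, 3]))  # False: no threshold labels 1 -> 0, 3 -> 1
```

The VC-dimension is then the size of the largest C on which `shatters` returns True; for thresholds that is 1.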

Equipped with the above tooling, we can now discuss the two fundamental theorems of VC/PAC theory. Proofs of both can be found in Chapter 5 of [15].

Theorem 1.1 (Fundamental Theorem of PAC Learning).

Let (\mathcal{U},\mathcal{R}) be a set system. The following are equivalent:

  1. \mathcal{R} is PAC learnable in the realizable and agnostic settings.

  2. \mathcal{R} admits a Uniform Law of Large Numbers.

  3. \mathcal{R} has a finite VC-dimension.

Theorem 1.2 (Fundamental Theorem of PAC Learning - Quantitative).

Let (\mathcal{U},\mathcal{R}) be a set system with finite VC-dimension d. Then there exist constants C_{1},C_{2} such that:

  1. \mathcal{R}'s Uniform Law of Large Numbers has sample complexity

     C_{1}\frac{d+\log(1/\delta)}{\epsilon^{2}}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d+\log(1/\delta)}{\epsilon^{2}}

  2. \mathcal{R} is agnostic PAC learnable with sample complexity

     C_{1}\frac{d+\log(1/\delta)}{\epsilon^{2}}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d+\log(1/\delta)}{\epsilon^{2}}

  3. \mathcal{R} is PAC learnable with sample complexity

     C_{1}\frac{d+\log(1/\delta)}{\epsilon}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d\log(1/\epsilon)+\log(1/\delta)}{\epsilon}

These two theorems characterize what types of hypothesis classes \mathcal{R} are learnable, and how much data is required to learn them. We will see that the online setting has well-established results that mirror these.

1.1.2 Online Setting

We now consider the online setting, in which samples of data are given to a learner sequentially in a stream. The learner makes a prediction on each new data point, after which the true label of the point is revealed and the online learner can update accordingly. Because previous training examples may not be informative about future ones, online learnability must be defined differently, in both the realizable and agnostic settings, than in the PAC framework.

Definition 1.5 (Mistake Bounds, Realizable Online Learnability).

Let \mathcal{R} be a hypothesis class and let A be an online learning algorithm. Given any sequence \mathcal{S}=((x_{1},f(x_{1})),...,(x_{T},f(x_{T}))), where T is any integer and f\in\mathcal{R} generates the labels, let M_{A}(\mathcal{S}) be the number of mistakes A makes on the sequence \mathcal{S}. We denote by M_{A}(\mathcal{R}) the supremum of M_{A}(\mathcal{S}) over all sequences of the above form. A bound of the form M_{A}(\mathcal{R})\leq B<\infty is called a mistake bound. We say that a hypothesis class \mathcal{R} is online learnable in the realizable setting if there exists an algorithm A for which M_{A}(\mathcal{R})\leq B<\infty.

Definition 1.6 (Regret, Agnostic Online Learnability).

Let \mathcal{R} be a hypothesis class and let A be an online learning algorithm. Given any sequence \mathcal{S}=((x_{1},y_{1}),...,(x_{T},y_{T})), where T is any integer, let A(x_{i}) be the prediction the algorithm makes at step i before the true label y_{i} is revealed. We define the regret to be the difference between the number of mistakes made by the learner and the number made by the best fixed hypothesis:

R_{\mathcal{S}}(T)=\sum_{i=1}^{T}\mathbbm{1}_{A(x_{i})\neq y_{i}}-\min_{h\in\mathcal{R}}\sum_{i=1}^{T}\mathbbm{1}_{h(x_{i})\neq y_{i}}

We say that \mathcal{R} is online learnable in the agnostic setting if the expected regret \mathbb{E}_{\mathcal{S}}[R_{\mathcal{S}}(T)] is o(T), or equivalently if \lim_{T\rightarrow\infty}\mathbb{E}_{\mathcal{S}}[R_{\mathcal{S}}(T)]/T=0 (i.e., the amortized expected regret vanishes). We call R_{T}(\mathcal{R}) the optimal regret: the regret of the best learner against the worst adversarial sequence.
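The regret of a finished run is straightforward to evaluate against a small finite class. A minimal sketch, with hypotheses again represented as sets and all concrete values hypothetical:

```python
def mistakes(h, xs, ys):
    """Mistakes the set-hypothesis h makes on labeled points (xs, ys)."""
    return sum((x in h) != (y == 1) for x, y in zip(xs, ys))

def regret(predictions, xs, ys, hypotheses):
    """Learner's mistakes minus the best fixed hypothesis's mistakes."""
    learner = sum(p != y for p, y in zip(predictions, ys))
    return learner - min(mistakes(h, xs, ys) for h in hypotheses)

# Hypothetical run: the learner errs once, while {1, 2} is a perfect hypothesis.
xs, ys = [1, 2, 3], [1, 1, 0]
preds = [0, 1, 0]
print(regret(preds, xs, ys, [{1, 2}, {1, 2, 3}]))  # 1
```

Agnostic online learnability asks that this quantity, in expectation and against the worst-case sequence, grow sublinearly in T.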

There is a combinatorial measure, due to Littlestone [11], that characterizes online learnability much in the same way that the VC-dimension characterizes offline learnability.

Definition 1.7 (Littlestone Dimension).

Let (\mathcal{U},\mathcal{R}) be a set system. The Littlestone dimension of \mathcal{R}, denoted Ldim(\mathcal{R}), is defined using mistake trees: binary decision trees whose internal nodes are labeled by elements of \mathcal{U}. Any root-to-leaf path corresponds to a sequence of pairs (x_{1},y_{1}),...,(x_{d},y_{d}), where x_{i} is the label of the i'th internal node in the path, and y_{i}=1 if the (i+1)'th node in the path is the right child of the i'th node and y_{i}=0 otherwise. We say that a tree T is shattered by \mathcal{R} if for every root-to-leaf path (x_{1},y_{1}),...,(x_{d},y_{d}) in T there is an h\in\mathcal{R} such that x_{i}\in h\iff y_{i}=1 for all i\leq d. Ldim(\mathcal{R}) is the depth of the largest complete tree shattered by \mathcal{R}, with the convention that Ldim(\emptyset)=-1.

The above definition is the online analogue of the VC-dimension: intuitively, the tree T is the largest structure for which every path has a perfect predictor in \mathcal{R}, much like how the set C in Definition 1.4 is the largest set with a perfect predictor in \mathcal{R} (but the mistake-tree framework captures the sequential nature of the online learning problem). An immediate corollary is that VCdim(\mathcal{R})\leq Ldim(\mathcal{R}). The following theorem characterizes online learning; it is due to Littlestone [11] in the realizable setting and to [4] in the agnostic setting.
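For small finite set systems, Ldim can be computed by brute-force recursion on the mistake-tree definition: a depth-d shattered tree exists iff some root label x splits the class into two nonempty branch classes, each shattering a depth-(d-1) tree. A sketch (exponential time; the threshold example is an illustrative assumption):

```python
def ldim(hypotheses, universe):
    """Brute-force Littlestone dimension of a finite class of sets over a
    finite universe, following the mistake-tree recursion."""
    hypotheses = [frozenset(h) for h in hypotheses]
    if not hypotheses:
        return -1          # convention: Ldim of the empty class is -1
    best = 0               # a single hypothesis already gives Ldim >= 0
    for x in universe:
        pos = [h for h in hypotheses if x in h]      # branch y = 1
        neg = [h for h in hypotheses if x not in h]  # branch y = 0
        if pos and neg:    # both subtrees must be shatterable
            best = max(best, 1 + min(ldim(pos, universe), ldim(neg, universe)))
    return best

# Hypothetical example: 4 threshold hypotheses over universe {0, 1, 2}.
universe = [0, 1, 2]
thresholds = [set(x for x in universe if x < t) for t in range(4)]
print(ldim(thresholds, universe))  # 2, i.e. log2 of the class size
```

Note the contrast with the VC computation above: thresholds have VC-dimension 1 but Littlestone dimension logarithmic in the class size, consistent with VCdim(\mathcal{R})\leq Ldim(\mathcal{R}).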

Theorem 1.3.

Let (\mathcal{U},\mathcal{R}) be a set system. Then \mathcal{R} is online learnable in both the realizable and agnostic settings if and only if it has finite Littlestone dimension.

Previous work in online learnability theory failed to give an adequate description of a uniform-law-of-large-numbers condition for online learnability à la Definition 1.3, or a tight quantitative bound on sample complexities à la Theorem 1.2.

1.2 Streaming

In the streaming setting, data of large volume arrives sequentially (and rapidly, demanding computational efficiency), and it is not realistic to store the entire data stream. Instead, algorithms make one or a few passes over the stream and answer queries about the data approximately.

1.2.1 Data Stream Models

An input stream of length m in the format (a_{1},\Delta_{1}),(a_{2},\Delta_{2}),\dots,(a_{m},\Delta_{m}) arrives sequentially and describes an underlying frequency vector f\in\mathbb{R}^{n}, where f_{i}=\sum_{t:a_{t}=i}\Delta_{t}. Namely, a_{t} is the index of the update and \Delta_{t} is the value of the update to the frequency vector. This generic model is called the turnstile model of streaming. A common assumption bounds the maximum coordinate of the frequency vector at every time step t: letting f^{(t)}_{i}=\sum_{j\leq t:a_{j}=i}\Delta_{j}, we require ||f^{(t)}||_{\infty}\leq M for some M>0.

Another commonly studied model assumes \Delta_{t}\geq 0 and is called the insertion-only model. The above setup then simplifies as follows: given an input stream a_{1},a_{2},\dots,a_{m}, the frequency vector f\in\mathbb{R}^{n} is given by f_{i}=|\{j\in[m]:a_{j}=i\}|.
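The two conventions can be sketched in a few lines of Python; the index range [1, n] is an assumption for concreteness.

```python
from collections import Counter

def turnstile_frequency(updates, n):
    """Frequency vector from turnstile updates (a_t, Delta_t), a_t in [1, n]."""
    f = [0] * n
    for a, delta in updates:
        f[a - 1] += delta
    return f

def insertion_only_frequency(stream, n):
    """Insertion-only special case: f_i = |{j : a_j = i}|."""
    counts = Counter(stream)
    return [counts[i] for i in range(1, n + 1)]

print(turnstile_frequency([(1, 2), (3, 1), (1, -1)], 3))  # [1, 0, 1]
print(insertion_only_frequency([1, 3, 1, 2, 1], 3))       # [3, 1, 1]
```

Of course, materializing f explicitly defeats the purpose of streaming; the sketches of Section 1.2.2 exist precisely to answer queries about f in sublinear space.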

The streaming task is to answer queries about the frequency vector f at any time step t to within a close approximation. One measure of success is known as strong tracking.

Definition 1.8 (Strong Tracking).

Let f^{(1)},f^{(2)},\dots,f^{(m)} be the frequency vectors induced by the input stream, and let g:\mathbb{R}^{n}\rightarrow\mathbb{R} be the query function over frequency vectors. A streaming algorithm is said to provide \epsilon-strong g-tracking if, at each time step t, its output R_{t} satisfies

|R_{t}-g(f^{(t)})|\leq\epsilon|g(f^{(t)})|

1.2.2 Linear Sketches

Linear sketching is a crucial idea in many streaming algorithm designs. At a very high level, sketches achieve dimensionality reduction by projecting the input onto pseudo-random vectors built from random variables with limited independence. Let \Delta_{t}\in\mathbb{R}^{n}. Given a distribution \mathcal{M} over r\times n matrices and an evaluation function \mathcal{F}:\mathbb{R}^{r\times n}\times\mathbb{R}^{r}\rightarrow R, where R is the output space, a linear sketching algorithm draws a (randomized) matrix A\sim\mathcal{M} and uses the evaluation function to respond to queries with \mathcal{F}(A,A\Delta_{t}).

Consider the second moment estimation problem in the turnstile model, where the second moment is defined as F_{2}=\sum_{i}f_{i}^{2}. Alon et al. [2] developed a linear sketch approach as follows. For each i,j, define a random vector x_{ij} of length n whose entries are \pm 1-valued, 4-wise independent random variables, and maintain the counter X_{ij}=\langle f,x_{ij}\rangle. Then E[X_{ij}^{2}]=F_{2} and Var[X_{ij}^{2}]\leq 2F_{2}^{2}. Define Y_{i} to be the average of X_{i1},X_{i2},\dots,X_{i,O(1/\epsilon^{2})}; by Chebyshev's inequality, P(|Y_{i}-F_{2}|>\epsilon F_{2}) is bounded by a small constant. Taking the median of means, i.e., letting Z be the median of O(\log(1/\delta)) of the Y_{i}'s, we have P(|Z-F_{2}|>\epsilon F_{2})<\delta. This algorithm uses O(\frac{1}{\epsilon^{2}}\log(\frac{1}{\delta})) space to achieve an \epsilon-approximation of F_{2} with probability 1-\delta.
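A minimal sketch of this median-of-means estimator follows, substituting fully random ±1 signs for the 4-wise independent variables of [2]; it illustrates the shape of the estimator, not a space-efficient implementation (which would store hash seeds rather than full sign vectors).

```python
import random
from statistics import median

def ams_f2_estimate(stream_updates, n, rows=5, cols=20, seed=0):
    """Median-of-means AMS-style estimator for F2 = sum_i f_i^2.
    Uses fully random +-1 signs for simplicity; 4-wise independence
    suffices for the same variance bound in the original construction."""
    rng = random.Random(seed)
    # One +-1 sign vector per basic estimator X_ij (row r, column c).
    signs = [[[rng.choice((-1, 1)) for _ in range(n)] for _ in range(cols)]
             for _ in range(rows)]
    sketch = [[0.0] * cols for _ in range(rows)]
    for a, delta in stream_updates:          # turnstile updates (index, value)
        for r in range(rows):
            for c in range(cols):
                sketch[r][c] += delta * signs[r][c][a]
    # Each X^2 is unbiased for F2; average within rows, take median across rows.
    means = [sum(x * x for x in row) / cols for row in sketch]
    return median(means)

updates = [(0, 3), (1, 1), (2, 2), (0, -1)]   # f = (2, 1, 2), so F2 = 9
print(ams_f2_estimate(updates, n=3))
```

Averaging `cols` copies drives the variance down (Chebyshev), and the median across `rows` boosts the constant success probability (Chernoff), mirroring the analysis above.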

In fact, the best-known algorithms for any problem in the turnstile model involve linear sketching [9]. Applications include frequency moments, heavy hitters, entropy estimation, and more. Muthukrishnan [13] provides detailed documentation of these problems for the interested reader.

1.3 Sampling

Sampling in the streaming setting aims to select a small subset S of a sequentially arriving input stream X=\{x_{1},x_{2},\dots,x_{n}\} that is representative of the stream. Deterministic sampling algorithms either fail to achieve desirably small space or involve complicated designs tailored to particular problems [5]. This section therefore primarily discusses random sampling.

The formal notion of being representative of the input stream is defined through \epsilon-approximation [12]. Intuitively, S is a good representation of X if every subset R\in\mathcal{R} has similar density in X and in S, as formulated in Definition 1.9.

Definition 1.9 (ϵ\epsilon-Approximation).

Let X be a stream and let (\mathcal{U},\mathcal{R}) be a set system. A sample S is an \epsilon-approximation of X with respect to \mathcal{R} if, for every R\in\mathcal{R}, it holds that \left|\frac{|R\cap X|}{|X|}-\frac{|R\cap S|}{|S|}\right|\leq\epsilon.
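Definition 1.9 can be verified directly for small examples. The following sketch checks every range's density gap; the prefix-range class and the concrete sample are purely illustrative assumptions.

```python
def is_eps_approximation(X, S, ranges, eps):
    """Check Definition 1.9 directly: each range's density in the sample S
    must be within eps of its density in the stream X."""
    return all(abs(sum(x in R for x in X) / len(X)
                   - sum(s in R for s in S) / len(S)) <= eps
               for R in ranges)

# Hypothetical example: ranges are the prefixes of {0..9}.
X = list(range(10))
S = [1, 4, 8]
ranges = [set(range(t)) for t in range(11)]
print(is_eps_approximation(X, S, ranges, eps=0.25))  # True
```

Here the worst density gap is 1/6 (e.g., the prefix {0..4} has density 0.5 in X but 2/3 in S), so the same sample fails for eps = 0.1.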

An oblivious sampler is a sampling algorithm that accepts or rejects an element based on its index, independently of the values in the stream. Oblivious samplers are well studied for a variety of sampling tasks, and are directly applied in the adversarial setting in Section 2. Below, we enumerate some classic oblivious samplers.

  • Bernoulli Sampler Ber(n,p) samples each element i\in[n] independently with probability p.

  • Uniform Sampler Uni(n,k) randomly draws k indices i_{1},i_{2},\dots,i_{k} with i_{j}\in[1,n]. Stream elements with matching indices are sampled.

  • Reservoir Sampler Res(n,k) keeps a sample of size k. The first k arrivals of the stream are accepted. An element with index i>k is accepted into the sample with probability \frac{k}{i}; if accepted, an element currently in the sample is drawn uniformly at random and replaced by the new element.
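Of the three, the reservoir sampler is the only one that requires no advance knowledge of the stream length. A standard sketch of Res(n, k), with a fixed seed for reproducibility:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Res(n, k): maintain a uniform sample of size k from a stream of
    unknown length, replacing a uniformly chosen current element with
    probability k/i upon the i'th arrival."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)                 # first k arrivals are accepted
        elif rng.random() < k / i:
            sample[rng.randrange(k)] = x     # replace a uniform victim
    return sample

print(reservoir_sample(range(1000), k=5))
```

A short induction shows every element ends up in the final sample with probability exactly k/n, which is what makes the sampler uniform.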

Several results connect sample complexity in this setting with VC theory [18] [16].

Theorem 1.4.

Let (\mathcal{U},\mathcal{R}) be a set system with VC-dimension d, and let X be a stream drawn from \mathcal{U}. Let S\subseteq X be a subset sampled uniformly at random with size O\left(\frac{d+\log(1/\delta)}{\epsilon^{2}}\right). Then S is an \epsilon-approximation of X with probability at least 1-\delta.

Note that this is a restatement, in the language of sampling theory, of the sample complexity bound for a Uniform Law of Large Numbers (Definition 1.3) in VC theory (Theorem 1.2, item 1).

2 Adversarial Robustness

Recall from Sections 1.2 and 1.3 the key assumption that the input stream X is chosen non-adaptively in advance. This assumption does not hold in many modern applications, where streaming outputs may implicitly or directly affect the future input stream. An important application area for streaming algorithms is routing [1] [10], where adversarial robustness has begun to play a key role in evaluation metrics [10]. This section discusses recent developments in adversarially robust streaming and sampling algorithms. The framework of analysis also contributes to Section 3, where Alon et al. develop the connection between adversarial robustness in sampling and online learnability.

2.1 Adversarial Setting

We can model the adversarial streaming/sampling problem as a two-player game between Adversary and Algorithm [5] [6]. In the most generic setting, Adversary is allowed unbounded computational power and can adaptively choose the next element of the stream based on Algorithm's output history.

For a streaming task, Algorithm aims to answer queries about the frequency vector f with close approximation; for sampling, Algorithm aims to maintain a sample representative of the stream (i.e., an \epsilon-approximation of Adversary's input stream). At a high level, the goal of Adversary is to prevent Algorithm from obtaining good results, and the goal of Algorithm is to be robust against this. Each round of the two-player game proceeds as follows.

  1. Adversary submits an element of the stream to Algorithm. The choice of element may (probabilistically) depend on all elements submitted in previous rounds and on observations of Algorithm's output history.

  2. Algorithm updates its internal state (its estimate of f or its accepted sample S) based on the newly submitted element, and outputs its response to queries regarding its internal state.

  3. Adversary observes Algorithm's responses and proceeds to the next round.

2.2 Robust Streaming

Hardt and Woodruff [9] show that linear sketching is inherently non-robust to adversarial inputs, which necessitates new robust streaming algorithms. This section covers recent developments that robustify non-robust algorithms through sketch switching.

2.2.1 Vulnerability of Linear Sketching

Hardt and Woodruff [9] show, in particular, that for problems at least as hard as l_{p}-norm estimation, linear sketching algorithms suffer under adversarial inputs. (Please refer to the original paper for formal proofs of the statements made in this algorithm-design summary; they are omitted here for simplicity.)

Definition 2.1 (GapNorm).

Given an input stream X:\{x_{1},x_{2},\dots\} with each x_{t}\in\mathbb{R}^{n}, the GapNorm(B) problem requires an algorithm to return 0 if ||x_{t}||_{2}\leq 1 and to return 1 if ||x_{t}||_{2}\geq B, for some parameter B. If x_{t} satisfies neither condition, the output may be either 0 or 1.

The GapNorm problem requires a slight modification of the data stream model of Section 1.2: instead of updates \Delta_{t}\in\mathbb{Z}, each input from the data stream is a vector x_{t}\in\mathbb{R}^{n}.

Theorem 2.1.

There exists a randomized adversary that, with high probability, finds a distribution over queries on which linear sketching fails to solve GapNorm(B), with constant probability, for B\geq 2.

Proof.

Consider the following two-player strategy between Adversary and Algorithm. Algorithm samples a matrix A from a distribution \mathcal{M}. At each round, Adversary submits x_{t}\in\mathbb{R}^{n}, and Algorithm responds with the evaluation \mathcal{F}(A,Ax_{t}). Adversary aims to learn the row space R(A) of Algorithm's sketch. If Adversary successfully learns R(A), the following strategy forces Algorithm to make mistakes on sufficiently many queries:

With probability 1/2, Adversary submits the zero vector in \mathbb{R}^{n}; with probability 1/2, Adversary submits a vector x_{t} in the kernel of A.

To learn the row space R(A), Adversary draws the initial element x_{1} from N(0,\tau I_{n}), a multivariate Gaussian distribution with covariance matrix \tau I_{n}. By the properties of Gaussian variables, the projection of x_{1} onto Algorithm's sketch matrix A is spherically symmetric, so the output depends only on the norm of the projection P_{A}(x_{1}). If x_{1} happens to have high correlation with R(A), Algorithm is tricked into computing a larger estimate of ||P_{A}(x_{1})||_{2}. [9] proves the existence of a \tau that makes this difference sufficiently large.

Therefore, Adversary can first submit many elements x_{t} drawn from N(0,\tau I_{n}) and aggregate them into a vector v_{1}\in\mathbb{R}^{n} that is almost entirely contained in R(A). This is accomplished by collecting m=poly(n) positively labeled vectors x_{t} from the repeated trials into a matrix G\in\mathbb{R}^{m\times n} and taking v_{1} to be the top right singular vector of G. Then ||P_{A}(v_{1})||_{2}\geq 1-\frac{1}{poly(n)}, and for sufficiently large m we may say that v_{1} is almost entirely contained in R(A).

At each iteration of the above scheme, after finding v_{1},v_{2},\dots, Adversary draws x_{t} from N(0,\tau I_{n}) restricted to the subspace orthogonal to \{v_{1},v_{2},\dots\} and runs the attack again. Each new v effectively removes one "unknown" dimension of R(A). The iterations repeat until only a constant number of unknown dimensions remain, at which point the aforementioned strategy of switching between zero vectors and kernel-space vectors tricks Algorithm into making a sufficiently large number of mistakes.

The adversarial algorithm above leads to Theorem 2.1. ∎

2.2.2 Sketch Switching

In light of the adversarial vulnerability of linear sketches, Ben-Eliezer and Yogev [6] developed generic robustification schemes for linear sketching. Two techniques were proposed to transform static streaming algorithms into adversarially robust ones. This section focuses on the sketch switching technique, owing to its widespread usage in the static setting and its success in robustifying popular problems like Distinct Elements, F_{p} Estimation, Heavy Hitters, and Entropy Estimation.

The Algorithm

Sketch switching achieves robustness by keeping multiple copies of strong tracking (Definition 1.8) algorithms. The high-level goal of sketch switching is to change the current output as few times as possible and, when an update to the output is required, to move to a different copy so that Adversary does not have enough information to attack. The detailed procedure is given in Algorithm 1.

As long as the previous output remains a good multiplicative approximation of the current copy's estimate after the update (a_{t},\Delta_{t}), Algorithm repeats the previous output while internally updating its state with the new estimate. When the previous output falls out of range of the internal estimate, Algorithm outputs the current estimate after (a_{t},\Delta_{t}), deactivates the current copy, and activates another copy of the static strong tracking algorithm.

1: \lambda\leftarrow\lambda_{\epsilon/8,m}(g)
2: Initialize independent copies A_{1},A_{2},\dots,A_{\lambda} of (\frac{\epsilon}{8},\frac{\delta}{\lambda})-strong g-tracking static algorithms
3: \rho\leftarrow 1
4: \hat{g}\leftarrow g(\textbf{0})
5: while new stream update (a_{t},\Delta_{t}) do
6:     Insert update (a_{t},\Delta_{t}) into all copies A_{1},A_{2},\dots,A_{\lambda}
7:     y\leftarrow current output of A_{\rho}
8:     if \hat{g}\notin(1\pm\epsilon/2)y then
9:         \hat{g}\leftarrow y
10:        \rho\leftarrow\rho+1
11:    end if
12:    Output estimate \hat{g}
13: end while
Algorithm 1 Sketch Switching
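A minimal Python rendering of Algorithm 1 follows. The copies' interface (a callable fed one update, returning its current estimate) and the exact-counting usage example are illustrative assumptions; real copies would be independent strong-tracking sketches.

```python
def sketch_switch(stream, copies, eps):
    """Sketch-switching loop: repeat the published estimate while it stays
    inside the (1 +- eps/2) band of the active copy's estimate; otherwise
    publish the fresh estimate and retire the (now exposed) copy.
    Assumes len(copies) is at least the flip number of the stream."""
    rho = 0
    g_hat = 0.0
    outputs = []
    for update in stream:
        estimates = [c(update) for c in copies]   # line 6: feed every copy
        y = estimates[rho]
        if not ((1 - eps / 2) * y <= g_hat <= (1 + eps / 2) * y):
            g_hat = y          # line 9: reveal the fresh estimate
            rho += 1           # line 10: switch to an unexposed copy
        outputs.append(g_hat)
    return outputs

def make_counter():
    """Toy 'copy': exactly counts updates (a perfect tracker of g = count)."""
    state = {"n": 0}
    def step(update):
        state["n"] += 1
        return float(state["n"])
    return step

copies = [make_counter() for _ in range(10)]
print(sketch_switch([None] * 8, copies, eps=0.9))
# [1.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 8.0]
```

Note how the published output changes only four times over eight updates, which is the point: each change "spends" one copy, and the flip number bounds how many changes an adversary can force.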

A central definition in the sketch switching analysis is the flip number (Definition 2.2). Lemma 2.2 will come in handy when proving the main theorem, Theorem 2.3.

Definition 2.2 (Flip Number).

Let Y=\{y_{0},y_{1},\dots,y_{m}\} be any sequence of real numbers. For \epsilon\geq 0, the \epsilon-flip number \lambda_{\epsilon}(Y) is the maximum k for which there exist indices 0\leq i_{1}<i_{2}<\dots<i_{k}\leq m such that y_{i_{j-1}}\notin(1\pm\epsilon)y_{i_{j}} for all j\in[2,k].
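For a concrete sequence, the flip number can be computed with a longest-subsequence recursion; a small sketch (example values hypothetical):

```python
from functools import lru_cache

def flip_number(Y, eps):
    """eps-flip number of a real sequence: the longest subsequence in which
    each term's predecessor lies outside the (1 +- eps) band around it."""
    n = len(Y)
    @lru_cache(maxsize=None)
    def longest(j):
        # Longest valid flip subsequence ending at index j.
        best = 1
        for i in range(j):
            inside = (1 - eps) * Y[j] <= Y[i] <= (1 + eps) * Y[j]
            if not inside:
                best = max(best, longest(i) + 1)
        return best
    return max(longest(j) for j in range(n)) if n else 0

print(flip_number([1, 1.05, 2.5, 2.6, 8], eps=0.5))  # 3, e.g. 1 -> 2.5 -> 8
```

Intuitively, the flip number measures how many times an adversary can force the true query value to move by a (1 + eps) factor, which is exactly how many sketch copies Algorithm 1 must budget for.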

Definition 2.3 ((ϵ,m)(\epsilon,m)-Flip Number).

Let \mathcal{C}\subseteq([n]\times\mathbb{Z})^{m} be the class of possible input streams, and let g be the query function from Definition 1.8. Each stream in \mathcal{C} induces a sequence Y=\{y_{0},y_{1},\dots,y_{m}\} where y_{t}=g(f^{(t)}). The (\epsilon,m)-flip number \lambda_{\epsilon,m}(g) of g over \mathcal{C} is the maximum \epsilon-flip number over the sequences Y generated by all possible input streams of length m in \mathcal{C}.

Lemma 2.2.

Fix \epsilon\in(0,1). Let U=\{u_{0},u_{1},\dots,u_{m}\}, V=\{v_{0},v_{1},\dots,v_{m}\}, and W=\{w_{0},w_{1},\dots,w_{m}\} be three sequences of real numbers satisfying the following:

  1. For every 0\leq i\leq m, v_{i}=(1\pm\epsilon/8)u_{i}.

  2. w_{0}=v_{0}.

  3. For i>0: if w_{i-1}=(1\pm\epsilon/2)v_{i}, then w_{i}=w_{i-1}; otherwise w_{i}=v_{i}.

Then w_{i}=(1\pm\epsilon)u_{i} for all i\in[m], and \lambda_{0}(W)\leq\lambda_{\epsilon/8}(U); in particular, when u_{t}=g(f^{(t)}), this gives \lambda_{0}(W)\leq\lambda_{\epsilon/8,m}(g).

The Proof

Now we have enough tools to prove the main theorem of sketch switching, following the skeleton of the proof in the original paper [6].

Theorem 2.3 (Sketch Switching).

Fix a query function g:\mathbb{R}^{n}\rightarrow\mathbb{R} over frequency vectors f, and let A be a static streaming algorithm that, for any \epsilon\in[0,1] and \delta>0, uses L(\epsilon,\delta) space and satisfies (\epsilon/8,\delta/\lambda)-strong g-tracking over the frequency vectors f^{(1)},f^{(2)},\dots,f^{(m)}, where \lambda=\lambda_{\epsilon/8,m}(g) is the flip number of g over the input stream class \mathcal{C}. Then Algorithm 1 is an adversarially robust algorithm that returns a (1+\epsilon)-approximation of g(f^{(t)}) at each time step t with probability at least 1-\delta, and uses O(\lambda\cdot L(\epsilon/8,\delta/\lambda)) space.

Proof.

Assume a fixed randomized algorithm following the robustification framework of Algorithm 1, with static streaming copies A_{1},A_{2},\dots,A_{\lambda}, each with its own randomness. For simplicity, assume a deterministic adversary; by Yao's minimax principle [20], this assumption is easily relaxed to apply to randomized adversaries. Under this assumption, Adversary's choice of (a_{t},\Delta_{t}) is a deterministic function of ((a_{1},\Delta_{1}),\dots,(a_{t-1},\Delta_{t-1})) and the responses R_{1},R_{2},\dots,R_{t-1}.

The skeleton of the proof is as follows: we first show that, until Algorithm switches to a new copy of the sketch, its output sequence Y is a (1+\epsilon)-approximation of g(f) at each time step, and we inductively apply the same reasoning to each switch to A_{\rho}, \rho\in[\lambda]. A union bound over the \lambda copies of (\epsilon/8,\delta/\lambda)-strong g-tracking algorithms gives the 1-\delta probability of success. The remainder of the proof shows that \lambda copies indeed suffice to handle adversarial input streams of length m.

Robustness of the first sketch. First fix the randomness of A_{1}. Let u_{1}^{1},u_{2}^{1},\dots,u_{k}^{1} be the (a,\Delta) updates the adversary would make if the algorithm were to output y_{0}=g(\textbf{0}) at every step; that is, u_{1}^{1},\dots,u_{k}^{1} is the sequence under which the algorithm does not switch copies. Let k+1 be the first time step at which y_{0}\notin(1\pm\epsilon/2)A_{1}(u_{k+1}) (i.e., where y_{0} falls out of range for the first time). The algorithm thus returns R_{1}=R_{2}=\dots=R_{k}=y_{0} and R_{k+1}=y_{1}. By the definition of (\epsilon/8,\delta/\lambda)-strong g-tracking, we know that A_{1}(u_{t})\in(1\pm\epsilon/8)g(f^{(t)}) for t\leq k, and by the design of Algorithm 1, R_{t}=y_{0}\in(1\pm\epsilon/2)A_{1}(u_{t}). By Lemma 2.2, we have

R1,R2Rk=y0(1±ϵ)g(f(t)),tkR_{1},R_{2}\dots R_{k}=y_{0}\in(1\pm\epsilon)g(f^{(t)}),\;\;\forall t\leq k

Induction on A_{2}. By the definition above, the algorithm switches to copy A_{2} at time t=k+1. Here R_{k+1}, the output of the algorithm, is updated to y_{1}=A_{1}(u_{k+1}). By strong tracking (Definition 1.8), the switching-point output satisfies R_{k+1}=y_{1}\in(1\pm\epsilon/8)g(f^{(k+1)}). Let \hat{X}=\{u_{1}^{1},\dots,u_{k}^{1},u_{1}^{2},\dots,u_{k_{2}}^{2}\} be the concatenation of the aforementioned inputs with the stream of inputs under which the algorithm keeps outputting y_{1}, where k_{2} is the first index at which y_{1} falls out of range. Since the (\epsilon/8,\delta/\lambda)-strong g-tracking guarantee holds on this fixed input sequence, reasoning as above we have

R_{k+1}=R_{k+2}=\dots=R_{k_{2}}=y_{1}\in(1\pm\epsilon)g(f^{(t)})\;\;\forall t\in[k+1,k_{2}]

Note that this line of reasoning extends to any A_{\rho} for \rho\in[\lambda]. Thus at any time t, the algorithm outputs y_{\rho}\in(1\pm\epsilon)g(f^{(t)}) except with probability \frac{\delta}{\lambda}. Taking a union bound over all copies A_{\rho} gives the desired 1-\delta success probability in Theorem 2.3.

Bounding the number of copies required. The last step of the proof shows that \lambda copies of a static streaming algorithm with the strong tracking guarantee suffice to handle an input stream of length m. Define U=\{g(f^{(0)}),g(f^{(1)}),\dots,g(f^{(m)})\}, V=\{g(f^{(0)}),A_{1}(u_{1}),\dots,A_{1}(u_{k}),A_{2}(u_{k+1}),\dots,A_{2}(u_{k_{2}}),\dots\}, and W=\{y_{0},y_{0},\dots,y_{1},y_{1},\dots\} (i.e., the algorithm's outputs R_{1},R_{2},\dots,R_{m}). These three sequences satisfy the conditions of Lemma 2.2, and thus we have \lambda_{0}(W)\leq\lambda_{\epsilon/8,m}(g). ∎

An Example

Below is a demonstration of how to apply the sketch switching technique to a common streaming problem, Distinct Elements. Also known as F_{0} estimation, the problem is defined by the query function g(f)=|\{i:f[i]\neq 0\}|,\;\;f\in\mathbb{R}^{n}. Although its proof is omitted, the following Corollary 2.1 plays a central role in applying sketch switching to a variety of tasks; the insertion-only restriction in Corollary 2.1 naturally constrains the technique itself to insertion-only models.

Corollary 2.1.

Let p\geq 0, and assume the input stream has length m=poly(n). The (\epsilon,m)-flip number of ||f||_{p}^{p} in the insertion-only model is \lambda_{\epsilon,m}(||f||_{p}^{p})=O(\frac{\log n}{\epsilon}) for p\leq 2, and O(\frac{p\log n}{\epsilon}) for p>2. For p=0, it is O(\frac{\log m}{\epsilon}).
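As a quick sanity check on these rates, the flip number of a monotone quantity such as ||f||_0 on an insertion-only stream can be counted greedily. The snippet below is an illustrative approximation written for this survey (not from [6]): it counts the ϵ-flips of the distinct-elements count when m distinct items are inserted one at a time, landing near log_{1+ϵ}(m) as the corollary predicts.

```python
import math

def eps_flips(values, eps):
    """Greedily count epsilon-flips: restart the anchor each time the
    sequence drifts out of the (1 +/- eps) window around it. For a
    monotone sequence this greedy count equals the flip number."""
    flips, anchor = 0, values[0]
    for v in values[1:]:
        if anchor == 0:
            moved = v != 0
        else:
            moved = v > (1 + eps) * anchor or v < (1 - eps) * anchor
        if moved:
            flips += 1
            anchor = v
    return flips

# ||f^(t)||_0 = t when m = 10^4 distinct elements arrive one at a time
m, eps = 10_000, 0.5
flips = eps_flips(list(range(m + 1)), eps)
bound = math.log(m) / math.log(1 + eps) + 1   # ~ log_{1+eps}(m)
```

For m = 10^4 and ϵ = 1/2 the greedy count is 21, comfortably under the log_{1.5}(m) ≈ 23 ceiling.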

Theorem 2.4 (Robust DistinctElements).

There exists an adversarially robust algorithm that outputs Rt(1±ϵ)f(t)0R_{t}\in(1\pm\epsilon)||f^{(t)}||_{0} at each time step tt with probability at least 1δ1-\delta. The space usage is given below.

O(log1/ϵϵ(log1/ϵ+log1/δ+log(logn)ϵ2+logn))O\left(\frac{\log 1/\epsilon}{\epsilon}\left(\frac{\log 1/\epsilon+\log 1/\delta+\log(\log n)}{\epsilon^{2}}+\log n\right)\right)
Proof.

First observe that DistinctElements inherently describes an insertion-only model, so by Corollary 2.1 the algorithm keeps \lambda_{\epsilon,m}=O(\frac{\log n}{\epsilon}) copies. The copies A_{1},A_{2},\dots,A_{\lambda} use the static streaming algorithm proposed by Błasiok [7], which guarantees \epsilon-strong g-tracking with probability 1-\delta_{0}. Each copy A_{\rho} takes O\left(\frac{\log 1/\epsilon+\log 1/\delta+\log(\log n)}{\epsilon^{2}}+\log n\right) bits of space. To avoid the complexity blow-up from the multiplicative term \frac{\log n}{\epsilon}, the following trick is used: instead of discarding a copy permanently, the algorithm keeps fewer copies than prescribed by \lambda_{\epsilon,m} and achieves the same result in a cyclic manner, re-activating a previously deactivated copy after running out of new ones. This reduces the copy complexity from O(\frac{\log n}{\epsilon}) to O(\log(1/\epsilon)/\epsilon). ∎

2.3 Robust Sampling

The oblivious random samplers discussed in Section 1.3 are used extensively in modern data-intensive systems, such as database monitoring and distributed machine learning. This naturally raises the question of whether the random samplers are vulnerable to potential attacks by adversarial streams.

Recall Theorem 1.4: for a set system (\mathcal{U},\mathcal{R}) with VC-dimension d, a random sample of size O(\frac{d+\log 1/\delta}{\epsilon}) is an \epsilon-approximation of X drawn from \mathcal{U} with probability 1-\delta. This well-established connection between sample size and learnability is extensively used in theoretical machine learning and is a consequence of the quantitative version of the Fundamental Theorem of PAC Learning (Theorem 1.2).

Ben-Eliezer and Yogev [5] showed that the sample size in Theorem 1.4 is not robust to adversarial streams, and derived the minimum overhead required for an adversarially robust sample. Section 3 shows that this overhead is implicitly connected to online learnability and builds further on this result.

Consider the adversarial setting described in Section 2.1. A formal definition of an adversarially robust algorithm is as follows.

Definition 2.4 ((ϵ,δ)(\epsilon,\delta)-Robust).

A randomized sampling algorithm is (ϵ,δ)(\epsilon,\delta)-robust if, for any strategy played by the (computationally unbounded) Adversary, the sample SS achieves ϵ\epsilon-approximation of input stream XX with respect to the set system (𝒰,)(\mathcal{U},\mathcal{R}) with probability 1δ1-\delta.

2.3.1 Adversarial Attack on Sampling

Theorem 2.5.

There exists a set system (\mathcal{U},\mathcal{R}) with VC-dimension 1 such that, for any \epsilon>0 and \delta<1/2, a sampler using the sample size that suffices in the static setting (Theorem 1.4) is not (\epsilon,\delta)-robust.

Proof.

Consider the set system (\mathcal{U},\mathcal{R}) where \mathcal{U} is the well-ordered set \{1,2,\dots,N\} with N\in[n^{6\ln n},2^{n/2}], and n is the length of the adversarial input stream. Define \mathcal{R}=\{[1,b]:\,b\in\mathcal{U}\} to be the set of inclusive intervals from 1 to each element of \mathcal{U}.

Observe that (𝒰,)(\mathcal{U},\mathcal{R}) has VC-dimension 11 (recall Definition 1.4 for definitions of VC-dimension and shattering). To see this, note that any subset C𝒰C\subset\mathcal{U} of size 1 can clearly be shattered (if C={c}C=\{c\}, [1,c][1,c] describes CC for a positive label and [1,c1][1,c-1] describes CC for a negative label of cc). So, the VC-dimension is certainly at least 1. However, consider any arbitrary set C={c1,c2}𝒰C=\{c_{1},c_{2}\}\subset\mathcal{U} of size 2, and suppose without loss of generality that c1<c2c_{1}<c_{2}. There exists a labeling of CC (namely, c1c_{1} has a negative label but c2c_{2} has a positive label) that no interval in \mathcal{R} can describe (any interval not containing c1c_{1} and satisfying its negative label cannot contain c2c_{2}). Similar logic shows that \mathcal{R} cannot shatter any subset of 𝒰\mathcal{U} of size at least 2. Therefore, we must have VCdim()<2VCdim()=1VCdim(\mathcal{R})<2\implies VCdim(\mathcal{R})=1.
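This shattering argument is easy to verify by brute force on a small universe. The helpers below are illustrative code written for this survey; they enumerate candidate sets and check every labeling directly:

```python
from itertools import combinations

def shatters(family, C):
    """True iff every subset of C arises as C ∩ R for some R in family."""
    patterns = {frozenset(C) & R for R in family}
    return len(patterns) == 2 ** len(C)

def vc_dim(universe, family):
    """Largest k such that some size-k subset of the universe is shattered."""
    d = 0
    for k in range(1, len(universe) + 1):
        if any(shatters(family, C) for C in combinations(universe, k)):
            d = k
    return d

U = list(range(1, 9))
prefixes = [frozenset(range(1, b + 1)) for b in U]   # R = {[1, b] : b in U}
```

Here `vc_dim(U, prefixes)` returns 1: a singleton like {2} is shattered, but for any pair {c_1, c_2} the labeling that rejects c_1 and accepts c_2 is never realized by a prefix interval.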

Here we consider a Bernoulli sampler Ber(n,p) under the following adversarial scheme; the attack extends easily to the other oblivious random samplers discussed earlier (e.g., the Reservoir sampler). Assume p\leq\frac{1}{4}, a reasonable assumption if the sampler is to maintain a sample of size sublinear in a large input stream X.

1: a_{1}\leftarrow 1
2: b_{1}\leftarrow N
3: p^{\prime}\leftarrow\max\{p,\,\frac{\ln n}{n}\}
4: for i=1,2,\dots,n do
5:     x_{i}=\lfloor a_{i}+(1-p^{\prime})(b_{i}-a_{i})\rfloor
6:     if Sampler accepts x_{i} then
7:         a_{i+1}\leftarrow x_{i}
8:         b_{i+1}\leftarrow b_{i}
9:     else
10:         a_{i+1}\leftarrow a_{i}
11:         b_{i+1}\leftarrow x_{i}
12:     end if
13: end for
Algorithm 2 Adversarial Strategy for a Bernoulli Sampler

The goal of the adversarial algorithm above is to maintain the following invariant. At round ii of the two-player game,

  • Elements that have been accepted by Sampler are at most aia_{i}

  • Elements that have been rejected by Sampler are at least bib_{i}

  • Elements submitted by Adversary at round ii are between aia_{i} and bib_{i}.

This invariant ensures that all elements in S are smaller than the rejected elements in X^{\prime}=X\setminus S. It then naturally follows that S cannot be an \epsilon-approximation of X. Formally, let s be the maximum element in S and consider R=[1,s]\in\mathcal{R}. Observe that \frac{|R\cap S|}{|S|}=1, and hence

||RS||S||RX||X||1|S|n12p1/2ϵ\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap X|}{|X|}\right|\geq 1-\frac{|S|}{n}\geq 1-2p^{\prime}\geq 1/2\geq\epsilon

Note that the above attack scheme is guaranteed to succeed as long as the Adversary does not run out of elements to draw from, i.e., as long as a_{i}<b_{i} at every round i. The expected sample size drawn by the Bernoulli sampler is np\leq np^{\prime}, so by Markov's inequality Pr[|S|\geq 2np^{\prime}]<\frac{1}{2}. When |S|<2np^{\prime}, the following lemma from [5] completes the claim that the Adversary has enough elements to draw from.

Lemma 2.6.

If |S|<2np|S|<2np^{\prime}, then biainb_{i}-a_{i}\geq n for any i[n]i\in[n].
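The attack and its invariant are simple to simulate. The code below is a toy instantiation written for this survey (the integer split arithmetic and fixed seed are our choices, not from [5]); it plays Algorithm 2 against a Bernoulli(p) sampler and measures the approximation error on R = [1, max(S)]:

```python
import math
import random

def bernoulli_attack(n, p, seed=0):
    """Play Algorithm 2 against a Bernoulli(p) sampler over {1, ..., N}.
    Returns the accepted sample S and the full stream X."""
    rng = random.Random(seed)
    N = 2 ** (n // 2)                       # huge universe, as in Theorem 2.5
    a, b = 1, N
    pp = max(p, math.log(n) / n)            # p' = max{p, ln(n)/n}
    frac = int((1 - pp) * 10 ** 6)          # split fraction, as an integer
    S, X = [], []
    for _ in range(n):
        x = a + (b - a) * frac // 10 ** 6   # ~ floor(a + (1 - p')(b - a))
        X.append(x)
        if rng.random() < p:                # oblivious sampling decision
            S.append(x)
            a = x                           # accepted: push future items up
        else:
            b = x                           # rejected: push future items down
    return S, X

S, X = bernoulli_attack(n=1000, p=0.05)
s = max(S)
# fraction of the stream landing in R = [1, s]
in_R = sum(1 for x in X if x <= s) / len(X)
err = abs(1 - in_R)   # |R ∩ S|/|S| = 1, so this is the approximation error
```

Every sampled element ends up below every rejected one, so the error is roughly 1 − |S|/n ≥ 1 − 2p′, far above any reasonable ϵ.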

2.3.2 Adversarially Robust Sample Size

Ben-Eliezer and Yogev [5] developed an upper bound on the sample size needed for (\epsilon,\delta)-robustness (Definition 2.4) of Bernoulli and Reservoir samplers. As with the adversarial attack in the previous section, we focus on the Bernoulli sampler; a very similar formulation applies to the Reservoir sampler.

Theorem 2.7 (Robust Sample Size).

For any set system (𝒰,)(\mathcal{U},\mathcal{R}), ϵ,δ(0,1)\epsilon,\delta\in(0,1), and given an input stream of size nn, a Bernoulli sampler Ber(n,p)Ber(n,p) is (ϵ,δ)(\epsilon,\delta)-robust if the following holds.

p\geq 10\cdot\frac{\ln|\mathcal{R}|+\ln(4/\delta)}{\epsilon^{2}n}
Corollary 2.2.

The sample size drawn by the Bernoulli sampler is well-concentrated around its expectation np. The theorem above then implies that a sample of size \Theta\left(\frac{\ln|\mathcal{R}|+\ln(1/\delta)}{\epsilon^{2}}\right) is an \epsilon-approximation with probability 1-\delta.
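To get a feel for the gap, one can compare this robust requirement against the static VC bound of Theorem 1.4 on the set system from Theorem 2.5, where d = 1 but |R| = N is enormous. The snippet below is a back-of-the-envelope illustration with the hidden constants set to 1 (an assumption for demonstration only):

```python
import math

def static_size(d, eps, delta):
    """Static epsilon-approximation size O((d + log 1/delta)/eps) of
    Theorem 1.4, with the hidden constant set to 1."""
    return (d + math.log(1 / delta)) / eps

def robust_size(num_sets, eps, delta):
    """Robust size Theta((ln|R| + ln 1/delta)/eps^2) of Corollary 2.2,
    with the hidden constant set to 1."""
    return (math.log(num_sets) + math.log(1 / delta)) / eps ** 2

n = 1000                  # stream length in the attack of Theorem 2.5
N = 2 ** (n // 2)         # |R| = |U| = N for the prefix intervals
eps, delta = 0.1, 0.01
static = static_size(1, eps, delta)    # VC-dimension d = 1
robust = robust_size(N, eps, delta)    # ln|R| = (n/2) ln 2 ~ 347
```

The adversary forces both a quadratically worse dependence on ϵ and a swap of d for ln|R|; Section 3 shows that the right replacement for ln|R| is in fact the Littlestone dimension.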

Lemma 2.8.

Let \mathcal{U} be a universe, R\subseteq\mathcal{U} a subset, and X an adversarial input stream of length n. A Bernoulli sampler with parameter p\geq 10\cdot\frac{\ln(4/\delta)}{\epsilon^{2}n} satisfies

Pr[||RS||S||RX||X||ϵ]δPr\left[\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap X|}{|X|}\right|\geq\epsilon\right]\leq\delta

Note that the main Theorem 2.7 follows from Lemma 2.8 by a union bound. Let (\mathcal{U},\mathcal{R}), \epsilon, \delta be as in Theorem 2.7, and take a Bernoulli sampler with p=10\cdot\frac{\ln|\mathcal{R}|+\ln(4/\delta)}{\epsilon^{2}n}, which satisfies the condition of the theorem. For each R\in\mathcal{R}, apply the lemma with parameters \epsilon,\delta/|\mathcal{R}| to obtain the following:

Pr[||RS||S||RX||X||ϵ]δ||Pr\left[\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap X|}{|X|}\right|\geq\epsilon\right]\leq\frac{\delta}{|\mathcal{R}|}

If the above event holds for every R\in\mathcal{R}, it follows that the sample S is an \epsilon-approximation of X. A union bound over all R\in\mathcal{R} then shows that the Bernoulli sampler with p as above is (\epsilon,\delta)-robust. It remains to prove Lemma 2.8 to complete the main theorem.

Proof.

A major difference between the analysis in the static setting and the adversarial setting is the assumption of independence between elements of the input stream. Since the adversary submits elements based on the history of submitted elements and the maintained sample, concentration inequalities like the Chernoff bound cannot be applied directly in the adversarial setting. This motivates formulating the random variables as a martingale. Given R\subseteq\mathcal{U}, for each round i define X_{i} and S_{i} to be the input stream and the sample maintained up to round i, and define the following random variables:

AiR=in|RXi||Xi|=|RXi|nA_{i}^{R}=\frac{i}{n}\cdot\frac{|R\cap X_{i}|}{|X_{i}|}=\frac{|R\cap X_{i}|}{n}
BiR=|RSi|npB_{i}^{R}=\frac{|R\cap S_{i}|}{np}
ZiR=BiRAiRZ_{i}^{R}=B_{i}^{R}-A_{i}^{R}

Observe that the sequence of Z_{i}^{R} is a martingale (Lemma 2.10). Note that A_{n}^{R}=\frac{|R\cap X|}{|X|}, and therefore \left|A_{n}^{R}-\frac{|R\cap S|}{|S|}\right| is the quantity of interest in Lemma 2.8. It thus suffices to prove the following two inequalities and apply the triangle inequality:

  1. Pr\left[|A_{n}^{R}-B_{n}^{R}|\geq\frac{\epsilon}{2}\right]\leq\frac{\delta}{2}

  2. Pr\left[\left|B_{n}^{R}-\frac{|R\cap S_{n}|}{|S_{n}|}\right|\geq\frac{\epsilon}{2}\right]\leq\frac{\delta}{2}

The following two lemmas from [8] and [5] respectively explore useful properties of martingales to conclude the proof.

Lemma 2.9.

Let X=(X_{0},X_{1},X_{2},\dots,X_{n}) be a martingale. Assume the variance of X_{i} conditioned on the previous elements X_{0},X_{1},\dots,X_{i-1} is bounded by \sigma_{i}^{2} for some \sigma_{0}^{2},\sigma_{1}^{2},\dots,\sigma_{n}^{2}\geq 0, and that there exists M such that |X_{i}-X_{i-1}|\leq M for all i\in[n]. Then for any \lambda\geq 0, the following inequality holds.

Pr[|XnX0|λ]exp(λ22i=1nσi2+(Mλ/3))Pr\left[|X_{n}-X_{0}|\geq\lambda\right]\leq\exp\left(-\frac{\lambda^{2}}{2\sum_{i=1}^{n}\sigma_{i}^{2}+(M\lambda/3)}\right)
Lemma 2.10.

The sequence Z0R,Z1RZnRZ_{0}^{R},Z_{1}^{R}\dots Z_{n}^{R} forms a martingale. The variance of ZiRZ_{i}^{R} conditioned on the previous elements Z0RZi1RZ_{0}^{R}\dots Z_{i-1}^{R} is bounded by 1/n2p1/n^{2}p, and |ZiRZi1R|1/np|Z_{i}^{R}-Z_{i-1}^{R}|\leq 1/np.

Combining the two lemmas above, and assuming np\geq\frac{9}{\epsilon^{2}}\ln(4/\delta), we have

Pr[|AnRBnR|ϵ2]2exp((ϵ/2)22n1n2p+ϵ6np)<2exp(ϵ2np9)δ2Pr\left[\left|A_{n}^{R}-B_{n}^{R}\right|\geq\frac{\epsilon}{2}\right]\leq 2\exp\left(-\frac{(\epsilon/2)^{2}}{2n\cdot\frac{1}{n^{2}p}+\frac{\epsilon}{6np}}\right)<2\exp\left(-\frac{\epsilon^{2}np}{9}\right)\leq\frac{\delta}{2}

which yields the first of the two inequalities. For the second, observe that B_{n}^{R}=\frac{|R\cap S|}{np}=\frac{|R\cap S|}{|S|}\cdot\frac{|S|}{np}. Here we use the defining property of an oblivious sampler: regardless of the adversary's attack scheme, an oblivious sampler accepts or rejects an element based only on its index. For a Bernoulli sampler, the size of S therefore follows a binomial distribution. By a Chernoff bound with deviation \epsilon/2, we have

Pr[||S|np|>ϵnp/2]2exp(ϵ2np10)Pr\left[||S|-np|>\epsilon np/2\right]\leq 2\exp\left(-\frac{\epsilon^{2}np}{10}\right)

The above inequality says that with high probability the event ||S|-np|>\epsilon np/2 does not happen. Conditioning on ||S|-np|\leq\epsilon np/2, the second inequality follows from

\left|\frac{|R\cap S|}{|S|}-B_{n}^{R}\right|=\left|\frac{|R\cap S|}{|S|}-\frac{|R\cap S|}{|S|}\cdot\frac{|S|}{np}\right|\leq\left|1-\frac{|S|}{np}\right|\leq\frac{\epsilon}{2} ∎
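The martingale structure underlying this proof can be checked empirically. The simulation below is a toy written for this survey, with an arbitrary (hypothetical) adaptive adversary; it evolves Z_i^R for a Bernoulli sampler and confirms that increments are bounded by 1/(np) and average out to zero, as Lemma 2.10 asserts:

```python
import random

def z_trajectory(n, p, R, adversary, seed):
    """One run of Z_i^R = |R ∩ S_i|/(np) - |R ∩ X_i|/n, where the
    adversary picks the next element as a function of the sample so far."""
    rng = random.Random(seed)
    S, Z = [], [0.0]
    in_R_X = in_R_S = 0
    for _ in range(n):
        x = adversary(S)
        in_R_X += x in R
        if rng.random() < p:        # oblivious acceptance, independent of x
            S.append(x)
            in_R_S += x in R
        Z.append(in_R_S / (n * p) - in_R_X / n)
    return Z

n, p = 200, 0.25
R = set(range(100))
adv = lambda S: (7 * len(S)) % 200    # a toy adaptive strategy
Z = z_trajectory(n, p, R, adv, seed=0)
max_step = max(abs(Z[i + 1] - Z[i]) for i in range(n))
mean_Zn = sum(z_trajectory(n, p, R, adv, seed)[-1]
              for seed in range(2000)) / 2000
```

Adaptivity changes which elements arrive, but not the zero-mean, bounded-increment structure of Z, which is exactly what Lemma 2.9 exploits.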

3 From Adversarial Sampling to Online Learnability

Throughout the above results, we have seen an interesting interplay between streaming and statistical learning theory. In particular, we note that the learnability of certain hypothesis classes plays a nontrivial role in sampling theory. Recall from Definition 1.3 and Theorem 1.1 that a hypothesis class \mathcal{R} is PAC learnable if and only if the empirical error over large enough samples looks like the generalization error. This is equivalent to the statement that a large enough sample is an ϵ\epsilon-approximation of the distribution 𝒟\mathcal{D} with respect to \mathcal{R} with high probability (here, we imagine a stream that perfectly encapsulates 𝒟\mathcal{D} that we wish to approximate with a sample). In the offline VC-theory there is an established equivalence between admittance of a Uniform Law of Large Numbers, ϵ\epsilon-approximability, and PAC learnability. It is therefore reasonable to inquire about a similar result in the setting of online learning, in which some aspect of sampling approximability relates to online learnability.

On the flip side, we have also seen situations in which adversarial streaming research reveals shadows of online learning theory. For example, consider the adversarial attack on sampling that was devised in section 2.3.1. At a high level, that attack constructed a large-depth mistake tree against the hypothesis class ={[1,b]:b𝒰}\mathcal{R}=\{[1,b]:b\in\mathcal{U}\}. The result was that a Bernoulli sampler would need to create a much larger sample to yield an ϵ\epsilon-approximation under this attack than it otherwise would. In the statistical learning theory language, however, this is akin to demonstrating that \mathcal{R}, with a VC-dimension of 1, has a much higher Littlestone dimension to go along with its larger sample complexity when up against an adversary. This hints at a relationship between Littlestone dimension and sample complexity in adversarial settings. Even the functional form of Corollary 2.2, a result about sample size necessary for ϵ\epsilon-approximability in adversarial settings, feels very familiar in the context of statistical learning (c.f. Theorem 1.2).

Throughout our above investigations, all signs point to a fascinating and deeply-knit relationship between the study of adversarially robust sampling (see Section 2.3) and measures of online learnability (Definitions 1.5, 1.6). This relationship is precisely what is formalized, proven, and explored in the recent work "Adversarial Laws of Large Numbers and Optimal Regret in Online Classification" by Alon et al. [3]. Throughout the rest of this section, we present the results of that work and describe how they beautifully tie up gaps in the previous theory and set the stage for a fresh new area of research.

3.1 Adversarial Uniform Law of Large Numbers

For the results below, we still work in the adversarial sampling model introduced in Section 2.1. Consider the following definition, which extends the concept of \epsilon-approximate sampling to the regime of adversarially prepared streams.

Definition 3.1 (Adversarial Uniform Law of Large Numbers).

Let (𝒰,)(\mathcal{U,R}) be a set system. A hypothesis class \mathcal{R} admits an Adversarial Uniform Law of Large Numbers if there exists a function m:(0,1)2m_{\mathcal{R}}:(0,1)^{2}\rightarrow\mathbb{N} and a sampler such that for any adversarially-prepared stream X over 𝒰\mathcal{U} and any tolerances ϵ,δ(0,1)\epsilon,\delta\in(0,1), the sampler chooses at most m(ϵ,δ)m_{\mathcal{R}}(\epsilon,\delta) samples from XX which form an ϵ\epsilon-approximation of XX with probability at least 1δ1-\delta.

Compare the above definition with our previous understanding about Uniform Laws of Large Numbers and forming ϵ\epsilon-approximations over regularly drawn (not adversarially-prepared) streams. This feels like a very natural mechanism with which to fill in the gaps and completely characterize online learning in the same way that we characterize offline learning with Theorems 1.1, 1.2. The two main theorems provided by Alon et al. do precisely that; we restate them below.

Theorem 3.1 (Fundamental Theorem of Online Learning).

Let (𝒰,)(\mathcal{U},\mathcal{R}) be a set system. The following are equivalent:

  1. \mathcal{R} is online learnable in the realizable and agnostic settings.

  2. \mathcal{R} admits an Adversarial Uniform Law of Large Numbers.

  3. \mathcal{R} has a finite Littlestone dimension.

Theorem 3.2 (Fundamental Theorem of Online Learning - Quantitative).

Let (𝒰,)(\mathcal{U},\mathcal{R}) be a set system with finite Littlestone dimension dd. Then, there exist constants C1,C2C_{1},C_{2} such that:

  1. \mathcal{R}'s Adversarial Uniform Law of Large Numbers has sample complexity

     C_{1}\frac{d}{\epsilon^{2}}\leq m_{\mathcal{R}}(\epsilon,\delta)\leq C_{2}\frac{d+\log(1/\delta)}{\epsilon^{2}}

  2. \mathcal{R} is online learnable in the agnostic setting with optimal regret at time T of

     C_{1}\sqrt{d\cdot T}\leq R_{T}(\mathcal{R})\leq C_{2}\sqrt{d\cdot T}

Observe that Theorem 3.1 completely characterizes online learnability in precisely the same way that Theorem 1.1 did for offline learnability. In particular, the Adversarial Uniform Law of Large Numbers plays the same role in online learnability as the Uniform Law of Large Numbers did offline, matching our intuitions at the beginning of this section.

There are some interesting things to note about the quantitative results in Theorem 3.2 as well. In the first item, the upper bound on the adversarial sample size precisely matches the upper bound for regular sampling, simply with the VC-dimension replaced by the Littlestone dimension; this once again deepens the symmetry developed throughout this survey. The lack of a tight lower bound, however, is interesting. While there is a tighter lower bound for \epsilon-nets instead of \epsilon-approximations, which correspond to realizable rather than agnostic online learning (interested readers should see Section 7 of Alon et al.), neither bound matches the tightness we see in Theorem 1.2. Perhaps further research can improve the lower bound on adversarial sampling complexity, or perhaps there is simply much more variability in how difficult it is to achieve an Adversarial Uniform Law of Large Numbers for non-oblivious samplers (for oblivious samplers like Bernoulli and Uniform, regular VC theory makes the bound tight). The second item of Theorem 3.2 settles a previously open inequality for the optimal regret of agnostic online learners, definitively saying that optimal learners face regret \Theta(\sqrt{d\cdot T}). In any case, these results fill in crucial gaps in online learnability theory and strengthen the connections between sampling theory and statistical learning. In particular, the symmetry between oblivious sampling/offline learning and adversarial sampling/online learning is now made clear in a way that can hopefully inspire and enlighten future research.

3.2 Overview of Proof Techniques

In this final section, we highlight some of the key ideas and techniques used by Alon et al.[3] to prove the results of the previous section (Theorems 3.1, 3.2). The actual proof mechanics are wild and would require immense exposition, but there is still much to be learned about the connections between streaming and statistical learning from studying their high level approach.

The first step of the proof of Theorem 3.2 uses an online variant of a technique called double sampling, a classic argument method credited to [17]. In this application of double sampling (Section 8 in [3]), the authors replace the approximation error with respect to the entire stream with an approximation error with respect to a test set of the same size as the gathered sample. This has the effective result of reducing the size of the problem from gathering a sample of size kk from a stream of size nn to gathering a sample of size kk from a stream of size 2k2k. Importantly, the adversary sees the 2k2k elements within the stream that are selected, but still has no more of an idea about what the final kk sampled elements will be. By doing so, the minimization of the ϵ\epsilon-approximation error reduces to minimizing the discrepancy in a well known online combinatorial game, defined below.

Definition 3.2 (Online Combinatorial Discrepancy).

The online discrepancy game with respect to \mathcal{R} is a sequential game played between a painter and an adversary which proceeds as follows: at each round t=1,,2kt=1,...,2k the adversary places an item xtx_{t} on the board, and the painter colors xtx_{t} in either red or blue. The goal of the painter is that each set in \mathcal{R} will be colored in a balanced fashion; i.e., if we denote by II the set of indices of items from the stream XX that are colored red, the painter’s goal is to minimize the discrepancy

Disc2k(,X,I)=maxR||XIR||X[2k]IR||Disc_{2k}(\mathcal{R},X,I)=\max_{R\in\mathcal{R}}\left||X_{I}\cap R|-|X_{[2k]\setminus I}\cap R|\right|

It is clear from the above definition that this is an equivalent formulation to the double sampling-reduced ϵ\epsilon-approximation optimization, since painting the item red or blue corresponds to sampling or not sampling the element, and coloring each set in \mathcal{R} in a balanced fashion corresponds to preserving density in our sample. This reduction allows us to connect performance in combinatorial discrepancy (which in turn is connected to ϵ\epsilon-approximation) to a measure of complexity known as Sequential Rademacher Complexity [14].
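As a small illustration (written for this survey, instantiated on the prefix-interval class from Section 2.3.1), even a painter who colors uniformly at random keeps every prefix interval balanced to roughly O(√(k log k)), far below the trivial 2k, which is what makes discrepancy a viable route to ϵ-approximation:

```python
import random

def prefix_discrepancy(stream, red_indices):
    """Disc(R, X, I) for R = {[1, b]}: the worst red/blue imbalance
    over any prefix interval of the universe."""
    red = set(red_indices)
    worst = 0
    for b in sorted(set(stream)):
        balance = sum(1 if i in red else -1
                      for i, x in enumerate(stream) if x <= b)
        worst = max(worst, abs(balance))
    return worst

rng = random.Random(1)
k = 500
stream = [rng.randrange(10 ** 6) for _ in range(2 * k)]   # adversary's items
I = [i for i in range(2 * k) if rng.random() < 0.5]       # random painter
disc = prefix_discrepancy(stream, I)
# taking the red half as the sample then approximates the stream
# with error on the order of disc / k
```

For harder classes a random painter no longer suffices, and the painter's achievable discrepancy is governed by the Littlestone dimension, which is precisely the bridge the proof builds.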

Definition 3.3.

The Sequential Rademacher Complexity of a hypothesis class \mathcal{R} is given by the expected discrepancy

Radn()=𝔼[Discn(,X,I)]Rad_{n}(\mathcal{R})=\mathbb{E}\left[Disc_{n}(\mathcal{R},X,I)\right]

This is a useful direction to maneuver toward because of an already-established result that the optimal regret of agnostic online learning satisfies R_{T}(\mathcal{R})\leq 2Rad_{T}(\mathcal{R}) (for more details, see Section 12 of [3]). In addition, it ties the sample complexity of \epsilon-approximation in the adversarial setting directly to the Sequential Rademacher Complexity, yielding the first item of the quantitative theorem. So, to arrive at Theorem 3.2, all that is left to do is bound Rad_{T}(\mathcal{R}) by O(\sqrt{Ldim(\mathcal{R})\cdot T}). The actual mechanics involve fractional (\epsilon,\gamma)-covers of the hypothesis class \mathcal{R} (probability measures over dynamic sets that agree with most of \mathcal{R}) and a chaining argument, but the tooling for that part of the proof is beyond the scope of this survey (the details can be found in Section 9 of [3] for those interested). All in all, this high-level overview of the proof is presented to once again emphasize the depth and richness of the connection between adversarial sampling and online learning theory, this time through the lens of the Sequential Rademacher Complexity.

4 Remarks and Future Research

We believe that the relationship explored in this paper between two until-recently disjoint fields of study, online statistical learning and adversarially robust streaming, opens a new area of research with much promise and direction. Some particular directions of further research pertain to the study of non-oblivious samplers created using the robustification methods of [6], [19], and others. These methods for mechanically turning an arbitrary oblivious streaming algorithm into an adversarially robust one could yield new sampling schemes with better performance on adversarial \epsilon-approximation, different bounds on Sequential Rademacher Complexity, and, in general, different results and perspectives than those found in Alon et al. using oblivious samplers. In the other direction, perhaps the bounds and connections to Rademacher complexity theory could enlighten designers of sampling algorithms to devise techniques that are adversarially robust and have standalone practical value as sampling methods. There seem to be many promising directions both to build on and to make use of the new connections explored throughout this research.

5 Acknowledgements

We would like to thank Professors Matthew Weinberg and Huacheng Yu for their kind feedback, reliable support, and incredible teaching. Thank you!

References

  • [1] Charu Aggarwal and Karthik Subbian “Evolutionary network analysis” In ACM Comput. Surv. 47.1 Association for Computing Machinery (ACM), 2014, pp. 1–36
  • [2] N. Alon, Y. Matias and M. Szegedy “The space complexity of approximating the frequency moments” In Proc. ACM STOC, 1996, pp. 20–29
  • [3] Noga Alon et al. “Adversarial laws of large numbers and optimal regret in online classification”, 2021 arXiv:2101.09054 [cs.LG]
  • [4] Shai Ben-David, Dávid Pál and Shai Shalev-Shwartz “Agnostic Online Learning” In Annual Conference Computational Learning Theory, 2009
  • [5] Omri Ben-Eliezer and Eylon Yogev “The Adversarial Robustness of Sampling”, 2019 arXiv:1906.11327 [cs.DS]
  • [6] Omri Ben-Eliezer, Rajesh Jayaram, David P Woodruff and Eylon Yogev “A framework for adversarially robust streaming algorithms”, 2020 arXiv:2003.14265 [cs.DS]
  • [7] Jarosław Błasiok “Optimal streaming and tracking distinct elements with high probability”, 2018 arXiv:1804.01642 [cs.DS]
  • [8] Fan Chung and Linyuan Lu “Concentration inequalities and martingale inequalities: A survey” In Internet Math. 3.1 Internet Mathematics, 2006, pp. 79–127
  • [9] Moritz Hardt and David P Woodruff “How robust are linear sketches to adaptive inputs?”, 2012 arXiv:1211.1056 [cs.DS]
  • [10] Ashwin Lall et al. “Data streaming algorithms for estimating entropy of network traffic” In Perform. Eval. Rev. 34.1 Association for Computing Machinery (ACM), 2006, pp. 145–156
  • [11] Nick Littlestone “Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm” In Mach. Learn. 2.4 USA: Kluwer Academic Publishers, 1988, pp. 285–318 DOI: 10.1023/A:1022869011914
  • [12] Nabil H. Mustafa and Kasturi R. Varadarajan “Handbook of Discrete and Computational Geometry”, 2017
  • [13] S Muthukrishnan “Data streams: Algorithms and applications” In Found. Trends Theor. Comput. Sci. 1.2 Now Publishers, 2005, pp. 117–236
  • [14] Alexander Rakhlin, Karthik Sridharan and Ambuj Tewari “Online Learning via Sequential Complexities” arXiv, 2010 DOI: 10.48550/ARXIV.1006.1138
  • [15] Shai Shalev-Shwartz and Shai Ben-David “Understanding Machine Learning: From Theory to Algorithms” USA: Cambridge University Press, 2014
  • [16] M. Talagrand “Sharper bounds for Gaussian and empirical processes” In Ann. Prob., 1994, pp. 22:28–76
  • [17] V N Vapnik and A Ya Chervonenkis “On the uniform convergence of relative frequencies of events to their probabilities” In Measures of Complexity Cham: Springer International Publishing, 2015, pp. 11–30
  • [18] V.N. Vapnik and A.Y. Chervonenkis "On the uniform convergence of relative frequencies of events to their probabilities" In Theory Probab. Appl., 1971, pp. 16:264–280
  • [19] David P. Woodruff and Samson Zhou “Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators” arXiv, 2020 DOI: 10.48550/ARXIV.2011.07471
  • [20] Andrew Chi-Chin Yao “Probabilistic computations: Toward a unified measure of complexity” In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977) Providence, RI, USA: IEEE, 1977