Robust Streaming, Sampling, and a Perspective on Online Learning
Abstract
A young and rapidly growing field of theoretical computer science is that of robust streaming. The general subject of streaming has many use cases in practice, arising in problems like network traffic analysis and routing, reinforcement learning, database monitoring, server query response, distributed computing, etc. A nascent subfield of streaming concerns streaming algorithms that are robust to adversarially prepared streams, which has substantial practical grounding. For example, an adversary could submit a small amount of carefully chosen traffic to produce a denial-of-service attack in a network routing system; a robust routing algorithm in this setting would have immense practical use. We investigate this new field of robust streaming and in particular the formalization of robust sampling, which concerns sampling from an adversarially prepared stream to recover a representative sample.
Throughout this survey, we also highlight, explore, and deepen the connection between the field of robust streaming and that of statistical online learning. On the surface, these fields can appear distinct and are often researched independently; however, there is a deep interrelatedness that can be used to generate new results and intuitions in both places.
In this work we present an overview of statistical learning, followed by a survey of robust streaming techniques and challenges, culminating in several rigorous results proving the relationship that we motivate and hint at throughout the journey. Furthermore, we unify often disjoint theorems in a shared framework and notation to clarify the deep connections that are discovered. We hope that by approaching these results from a shared perspective, already aware of the technical connections that exist, we can enlighten the study of both fields and perhaps motivate new and previously unconsidered directions of research.
1 Preliminaries
1.1 Statistical Learning
The field of statistical learning concerns itself with the formal study of questions such as which tasks are learnable from data, how much training data is required to fit a general distribution, and other topics of this flavor. We present formal frameworks for research in this field in the offline and online settings below.
1.1.1 Offline Setting
In the offline setting, a learner attempts to learn from a dataset that is available all at once (in contrast with the online setting, where data arrives sequentially in a stream). In order to formalize these ideas, we will supply some notation to the setting of learning from data. We focus on binary classification below and represent our hypothesis as subsets of the domain, but general statistical learning theory follows the same principles and can reduce to this framework.
We consider learning over a set system $(X, \mathcal{H})$, where $X$ is a universe of possible elements and $\mathcal{H} \subseteq 2^{X}$ is a set of subsets of the universe. $\mathcal{H}$ is called the hypothesis class in which a learner attempts to learn, for reasons we will see below. Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a finite training sample of size $m$ where the $x_i$'s are sampled and labeled either 0 or 1 according to some scheme. To be specific about the data generation, we work in the realizable setting if $X$ is equipped with a distribution $\mathcal{D}$ over its elements and there exists some hypothesis $h^{*} \in \mathcal{H}$ that generates the labels (i.e. each $y_i = 1$ if $x_i \in h^{*}$ and 0 otherwise). If not, then we are in the agnostic setting, in which the space $X \times \{0, 1\}$ is equipped with a joint distribution $\mathcal{D}$. The realizability assumption posits that there is some perfect ground truth hypothesis in $\mathcal{H}$ that we would like the learner to learn; otherwise, we would like to learn the best hypothesis we can. A learning algorithm is any algorithm that, given a training sample $S$, outputs a hypothesis $h$ about the training data. Hopefully, the learned hypothesis not only fits the training sample well, but also generalizes well. We define the generalization error (or risk) of a hypothesis $h$ in the realizable setting to be
$$L_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}\left[\mathbb{1}[x \in h] \neq \mathbb{1}[x \in h^{*}]\right],$$
and in the agnostic setting to be
$$L_{\mathcal{D}}(h) = \Pr_{(x, y) \sim \mathcal{D}}\left[\mathbb{1}[x \in h] \neq y\right].$$
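As a concrete illustration of these definitions (a hypothetical setup of our own, not taken from the works surveyed here), consider the universe $X = [0, 1]$ with threshold hypotheses $h_a = \{x : x \le a\}$; in the realizable setting, a simple ERM learner returns the largest positively-labeled point.

```python
import random

def sample_realizable(m, a_star, rng):
    """Draw m points from the uniform distribution on [0, 1],
    labeled by the ground-truth threshold h* = {x : x <= a_star}."""
    return [(x, int(x <= a_star)) for x in (rng.random() for _ in range(m))]

def erm_threshold(sample):
    """An ERM learner for thresholds: the largest positively-labeled point."""
    pos = [x for x, y in sample if y == 1]
    return max(pos) if pos else 0.0

def risk(a_hat, a_star):
    """Generalization error under the uniform distribution: the mass
    of the region where the learned and true thresholds disagree."""
    return abs(a_hat - a_star)

rng = random.Random(0)
a_hat = erm_threshold(sample_realizable(2000, 0.6, rng))
# a_hat sits just below a_star = 0.6, so the risk is small
```

With 2000 samples, the learned threshold lands within a small gap of the true one, matching the PAC intuition below that more data shrinks the achievable risk tolerance.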
With all of the above notation out of the way, we can now make headway into the statistical learning theory. The first step in formalization is to define what is meant by a hypothesis class being learnable; to this end, we introduce the following definitions.
Definition 1.1 (PAC-Learnable).
A hypothesis class $\mathcal{H}$ is PAC-learnable in the realizable setting if there exists a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for all accuracy and probability tolerances $\epsilon, \delta \in (0,1)$, for all distributions $\mathcal{D}$ over $X$ and all labeling hypotheses $h^{*} \in \mathcal{H}$, when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. samples drawn from $\mathcal{D}$ and labeled by $h^{*}$, the algorithm returns a hypothesis $h$ for which
$$\Pr\left[L_{\mathcal{D}}(h) \le \epsilon\right] \ge 1 - \delta.$$
Definition 1.2 (Agnostic PAC-Learnable).
A hypothesis class $\mathcal{H}$ is PAC-learnable in the agnostic setting if there exists a function $m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for all accuracy and probability tolerances $\epsilon, \delta \in (0,1)$, and for all distributions $\mathcal{D}$ over $X \times \{0,1\}$, when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. samples drawn from $\mathcal{D}$, the algorithm returns a hypothesis $h$ for which
$$\Pr\left[L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon\right] \ge 1 - \delta.$$
Here, PAC stands for Probably-Approximately-Correct. Intuitively, these definitions mean that the correct hypothesis can be learned within an arbitrary risk tolerance (the Approximately Correct) with high probability (the Probably) using a number of data points that is determined by these tolerances. We can characterize hypothesis classes with two concepts that are defined below (for intuition or examples, consult [15]).
Definition 1.3 (Uniform Law of Large Numbers, Sample Complexity).
Let $(X, \mathcal{H})$ be a set system. The hypothesis class $\mathcal{H}$ admits a Uniform Law of Large Numbers if there exists a function $m^{\mathrm{UC}}_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}$ such that for any distribution $\mathcal{D}$ over $X$ and any tolerances $\epsilon, \delta \in (0,1)$, the empirical error over a sample $S$ of size $m \ge m^{\mathrm{UC}}_{\mathcal{H}}(\epsilon, \delta)$ satisfies
$$\sup_{h \in \mathcal{H}} \left| L_{S}(h) - L_{\mathcal{D}}(h) \right| \le \epsilon$$
with probability at least $1 - \delta$. This property is often referred to as uniform convergence. We call $m^{\mathrm{UC}}_{\mathcal{H}}$ the sample complexity of $\mathcal{H}$.
Definition 1.4 (VC-Dimension).
The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VC}(\mathcal{H})$, is the maximal size of a set $C \subseteq X$ that can be shattered by $\mathcal{H}$. If $\mathcal{H}$ can shatter sets of arbitrarily large size, we say that $\mathcal{H}$ has infinite VC-dimension. Here, shattering has the usual definition: $\mathcal{H}$ shatters $C$ if for every labeling of $C$ there exists an element of $\mathcal{H}$ that perfectly describes it.
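For small finite set systems, the VC-dimension can be computed directly from the definition by brute force. The following sketch is our own illustration; it checks shattering over all subsets of each size:

```python
from itertools import combinations

def shatters(hclass, subset):
    """H shatters C if every one of the 2^|C| labelings of C
    is realized by some hypothesis h in H."""
    patterns = {tuple(x in h for x in subset) for h in hclass}
    return len(patterns) == 2 ** len(subset)

def vc_dimension(universe, hclass):
    """Brute-force VC-dimension of a finite set system."""
    d = 0
    for k in range(1, len(universe) + 1):
        if any(shatters(hclass, c) for c in combinations(universe, k)):
            d = k
    return d

# Initial-segment intervals {1,...,a} over {1,...,5} have VC-dimension 1,
# since no pair a < b can receive the labeling (a negative, b positive).
universe = list(range(1, 6))
intervals = [frozenset(range(1, a + 1)) for a in universe]
```

The interval class here is the same one that reappears in the adversarial sampling attack of Section 2.3.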
Equipped with the above tooling, we can now discuss the two fundamental theorems of VC/PAC theory. Proofs of both can be found in Chapter 5 of [15].
Theorem 1.1 (Fundamental Theorem of PAC Learning).
Let $(X, \mathcal{H})$ be a set system. The following are equivalent:
1. $\mathcal{H}$ is PAC learnable in the realizable and agnostic settings.
2. $\mathcal{H}$ admits a Uniform Law of Large Numbers.
3. $\mathcal{H}$ has a finite VC-dimension.
Theorem 1.2 (Fundamental Theorem of PAC Learning - Quantitative).
Let $(X, \mathcal{H})$ be a set system with finite VC-dimension $d$. Then, there exist constants $C_1, C_2 > 0$ such that:
1. $\mathcal{H}$'s Uniform Law of Large Numbers has sample complexity $C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m^{\mathrm{UC}}_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$.
2. $\mathcal{H}$ is agnostic PAC learnable with sample complexity $C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$.
3. $\mathcal{H}$ is PAC learnable with sample complexity $C_1 \frac{d + \log(1/\delta)}{\epsilon} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}$.
These two theorems characterize what types of hypothesis classes are learnable, and how much data is required to learn them. We will see that the online setting has well-established results that mirror these.
1.1.2 Online Setting
We now consider the online setting, in which samples of data are given to a learner sequentially in a stream. The learner makes a prediction on each new data point, after which the true label of the point is revealed and the online learner can learn from it. Because previous training examples may carry no information about future ones, online learnability must be defined differently, in both the realizable and agnostic settings, than it was in the PAC framework.
Definition 1.5 (Mistake Bounds, Realizable Online Learnability).
Let $\mathcal{H}$ be a hypothesis class and let $A$ be an online learning algorithm. Given any sequence $S = ((x_1, h^{*}(x_1)), \ldots, (x_T, h^{*}(x_T)))$, where $T$ is any integer and $h^{*} \in \mathcal{H}$ generates the labels, let $M_{A}(S)$ be the number of mistakes $A$ makes on the sequence $S$. We denote by $M_{A}(\mathcal{H})$ the supremum of $M_{A}(S)$ over all sequences of the above form. A bound of the form $M_{A}(\mathcal{H}) \le B < \infty$ is called a mistake bound. We say that a hypothesis class $\mathcal{H}$ is online learnable in the realizable setting if there exists an algorithm $A$ for which $M_{A}(\mathcal{H}) < \infty$.
Definition 1.6 (Regret, Agnostic Online Learnability).
Let $\mathcal{H}$ be a hypothesis class and let $A$ be an online learning algorithm. Given any sequence $S = ((x_1, y_1), \ldots, (x_T, y_T))$, where $T$ is any integer, let $\hat{y}_t$ be the prediction that the algorithm made at step $t$ before being revealed the actual label $y_t$. We define the regret to be the difference between the number of mistakes made by the learner and the number of mistakes made by the best hypothesis:
$$\mathrm{Regret}_{A}(\mathcal{H}, T) = \sup_{S} \left[ \sum_{t=1}^{T} \mathbb{1}[\hat{y}_t \neq y_t] - \min_{h \in \mathcal{H}} \sum_{t=1}^{T} \mathbb{1}\big[\mathbb{1}[x_t \in h] \neq y_t\big] \right].$$
We say that $\mathcal{H}$ is online learnable in the agnostic setting if the expected regret is $o(T)$, or equivalently if $\lim_{T \to \infty} \mathbb{E}[\mathrm{Regret}_{A}(\mathcal{H}, T)] / T = 0$ (i.e. the amortized expected regret vanishes). We call $R(\mathcal{H}, T) = \inf_{A} \mathrm{Regret}_{A}(\mathcal{H}, T)$ the optimal regret: the regret of the best learner against the worst adversarial sequence.
There is a combinatorial measure, due to Littlestone [11], that characterizes online learnability much in the same way that the VC-dimension characterizes offline learnability.
Definition 1.7 (Littlestone Dimension).
Let $(X, \mathcal{H})$ be a set system. The Littlestone Dimension of $\mathcal{H}$, denoted $\mathrm{Ldim}(\mathcal{H})$, is defined using mistake-trees: these are binary decision trees whose internal nodes are labeled by elements of $X$. Any root-to-leaf path corresponds to a sequence of pairs $(x_1, y_1), \ldots, (x_d, y_d)$, where $x_i$ is the label of the $i$'th internal node in the path, and $y_i = 1$ if the $(i+1)$'th node in the path is the right child of the $i$'th node, and otherwise $y_i = 0$. We say that a tree $T$ is shattered by $\mathcal{H}$ if for any root-to-leaf path $(x_1, y_1), \ldots, (x_d, y_d)$ in $T$ there is an $h \in \mathcal{H}$ such that $\mathbb{1}[x_i \in h] = y_i$ for all $i$. $\mathrm{Ldim}(\mathcal{H})$ is the depth of the largest complete tree shattered by $\mathcal{H}$, with the convention that $\mathrm{Ldim}(\mathcal{H}) = \infty$ if $\mathcal{H}$ shatters complete trees of arbitrary depth.
The above definition is the online equivalent of the VC-dimension: intuitively, a shattered tree is the largest adversarial object for which every path has a perfect predictor in $\mathcal{H}$, very much like how the shattered set $C$ in Definition 1.4 is the largest set for which every labeling has a perfect predictor in $\mathcal{H}$ (however, the mistake-tree framework encapsulates the sequential nature of the online learning problem). An immediate corollary of this is that $\mathrm{VC}(\mathcal{H}) \le \mathrm{Ldim}(\mathcal{H})$. The following theorem characterizes online learning, given by Littlestone [11] in the realizable setting and by [4] in the agnostic setting.
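The recursive structure of mistake trees makes the Littlestone dimension of a small finite class directly computable: whenever a point $x$ splits $\mathcal{H}$ into nonempty subclasses $\mathcal{H}_x^0, \mathcal{H}_x^1$ of hypotheses labeling $x$ by $0$ and $1$, we have $\mathrm{Ldim}(\mathcal{H}) \ge 1 + \min(\mathrm{Ldim}(\mathcal{H}_x^0), \mathrm{Ldim}(\mathcal{H}_x^1))$. A brute-force sketch (our own illustration, with hypotheses encoded as frozensets):

```python
def ldim(hclass):
    """Littlestone dimension of a finite hypothesis class via the
    mistake-tree recursion: the adversary picks the splitting point x
    maximizing 1 + min(Ldim(H_x^0), Ldim(H_x^1))."""
    hclass = frozenset(hclass)
    if len(hclass) <= 1:
        return 0  # a single hypothesis never forces a mistake
    universe = set().union(*hclass)
    best = 0
    for x in universe:
        h1 = frozenset(h for h in hclass if x in h)  # hypotheses labeling x as 1
        h0 = hclass - h1                             # hypotheses labeling x as 0
        if h0 and h1:
            best = max(best, 1 + min(ldim(h0), ldim(h1)))
    return best

# Eight threshold hypotheses {1,...,a} over {1,...,8}: VC-dimension 1,
# but an adversary can binary-search, so Ldim = log2(8) = 3.
thresholds = [frozenset(range(1, a + 1)) for a in range(1, 9)]
```

This also makes the inequality above concrete: thresholds have VC-dimension $1$ yet Littlestone dimension logarithmic in the number of hypotheses.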
Theorem 1.3.
Let $(X, \mathcal{H})$ be a set system. Then, $\mathcal{H}$ is online learnable in both the realizable and agnostic settings if and only if it has finite Littlestone dimension.
1.2 Streaming
In the streaming setting, a large volume of data arrives sequentially (and rapidly, which demands computational efficiency), and it is not realistic to store the entire data stream. Instead, algorithms make one or several scans through the stream and approximately answer queries about the data.
1.2.1 Data Stream Models
An input stream of length $m$ arrives sequentially in the format $(a_1, \Delta_1), (a_2, \Delta_2), \ldots, (a_m, \Delta_m)$, describing an underlying frequency vector $f \in \mathbb{R}^{n}$ where $f_j = \sum_{i : a_i = j} \Delta_i$. Namely, $a_i \in [n]$ describes the index of update $i$ and $\Delta_i$ describes the value of update $i$ to the frequency vector. This generic model is also called the turnstile model of streaming. A common assumption bounds the maximum coordinate of the frequency vector at any time step $t$: letting $f^{(t)}$ denote the frequency vector after the first $t$ updates, we require $\|f^{(t)}\|_{\infty} \le M$ for some $M = \mathrm{poly}(n)$.
Another commonly studied model assumes $\Delta_i = 1$ for all $i$, and is called the insertion-only model. The above setup can then be simplified as follows. Given an input stream $a_1, \ldots, a_m \in [n]$, the frequency vector is given by $f_j = |\{i \in [m] : a_i = j\}|$.
The streaming task is to respond to queries regarding the frequency vector at any time step with a close approximation. One measure of success is known as strong tracking.
Definition 1.8 (Strong Tracking).
Let $f^{(1)}, \ldots, f^{(m)}$ be the frequency vectors induced by the input stream, and let $g$ be the query function over the frequency vectors. A streaming algorithm with outputs $R_1, \ldots, R_m$ is said to provide $(\epsilon, \delta)$-strong $g$-tracking if, with probability at least $1 - \delta$, the approximation outputs satisfy $|R_t - g(f^{(t)})| \le \epsilon \, |g(f^{(t)})|$ simultaneously at each time step $t \in [m]$.
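The all-time-steps requirement is what distinguishes strong tracking from a per-query guarantee. A small checker for a single run of an algorithm (our own illustration) makes the condition concrete:

```python
def tracking_holds(true_vals, outputs, eps):
    """Check the per-run event of (eps, delta)-strong g-tracking:
    |R_t - g(f_t)| <= eps * |g(f_t)| simultaneously at every step t.
    The delta in the definition is the probability, over the
    algorithm's randomness, that this event fails."""
    return all(abs(r - g) <= eps * abs(g)
               for g, r in zip(true_vals, outputs))
```

For instance, a run with true values $(10, 20)$ and outputs $(10.5, 19)$ satisfies the $\epsilon = 0.1$ condition, while a single step reporting $12$ for a true value of $10$ breaks it.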
1.2.2 Linear Sketches
Linear sketching is a crucial idea in many streaming algorithm designs. At a very high level, sketches achieve dimensionality reduction by generating pseudo-random vectors with (limitedly) independent random variables. Let $f \in \mathbb{R}^{n}$ be the frequency vector. Given a distribution $\mathcal{M}$ over the space of $k \times n$ matrices (with $k \ll n$) and an evaluation function $\phi : \mathbb{R}^{k} \to \mathcal{O}$, where $\mathcal{O}$ is the output space, a linear sketching algorithm draws a (randomized) matrix $A \sim \mathcal{M}$ and maintains the sketch $Af$. The evaluation function is used to respond to queries by computing $\phi(Af)$.
Consider the second moment estimation problem in the turnstile model. Alon et al. [2] developed a linear sketch approach as follows. The second moment is defined as $F_2 = \sum_{j=1}^{n} f_j^2$. For each $i$, define $r^{(i)}$ to be a random vector of length $n$ with $\pm 1$-valued, $4$-wise independent random variables as elements. Maintain $Z_i = \langle r^{(i)}, f \rangle$ under the stream updates; then $\mathbb{E}[Z_i^2] = F_2$ and $\mathrm{Var}[Z_i^2] \le 2 F_2^2$. Define $Y$ to be the average of $O(1/\epsilon^2)$ independent copies $Z_i^2$; by Chebyshev's inequality, $\Pr[|Y - F_2| > \epsilon F_2] \le 1/3$. Taking the median of means, let $\hat{F}_2$ be the median of $O(\log(1/\delta))$ independent copies of $Y$; then $\Pr[|\hat{F}_2 - F_2| > \epsilon F_2] \le \delta$. This algorithm uses $O(\epsilon^{-2} \log(1/\delta) (\log n + \log m))$ bits of space to achieve a $(1 \pm \epsilon)$-approximation of $F_2$ with probability $1 - \delta$.
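A compact sketch of the AMS estimator follows (our own illustration; for simplicity it draws fully independent random signs, whereas the original algorithm uses 4-wise independent families to keep the space small):

```python
import random

def ams_f2(stream, n, k=200, reps=9, seed=1):
    """Estimate F2 = sum_j f_j^2 of an insertion-only stream over {0..n-1}
    by the AMS median-of-means scheme."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):                       # median over reps copies ...
        total = 0.0
        for _ in range(k):                      # ... of means of k estimators
            sign = [rng.choice((-1, 1)) for _ in range(n)]
            z = sum(sign[a] for a in stream)    # z = <r, f>, maintainable per update
            total += z * z                      # E[z^2] = F2, Var[z^2] <= 2*F2^2
        means.append(total / k)
    means.sort()
    return means[len(means) // 2]               # the median of the means

stream = [0] * 3 + [1] * 4   # f = (3, 4), so F2 = 25
```

In a true streaming implementation each $z$ is maintained incrementally as updates arrive, so only the counters and the sign generators are stored, not the stream itself.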
1.3 Sampling
Sampling in the streaming setting aims to select a small subset $S$ of a sequentially arriving input stream $X$ that is representative of the input stream. Deterministic sampling algorithms either fail to use desirably small space or involve complicated designs tailored toward particular problems [5]. Therefore this section primarily discusses random sampling.
The formal notion of being representative of the input stream is defined through $\epsilon$-approximation [12]. Intuitively, $S$ is a good representation of $X$ if any subset in the set system has similar density in $S$ and in $X$, as formulated by Definition 1.9.
Definition 1.9 (-Approximation).
Let $X$ be a stream over a universe $U$ and let $\mathcal{R} \subseteq 2^{U}$ be a set system. A sample $S \subseteq X$ is an $\epsilon$-approximation of $X$ with respect to $\mathcal{R}$ if, for any subset $R \in \mathcal{R}$, it holds that
$$\left| \frac{|S \cap R|}{|S|} - \frac{|X \cap R|}{|X|} \right| \le \epsilon.$$
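The definition can be checked directly for small examples. The sketch below is our own illustration, over the initial-segment intervals that reappear in Section 2.3:

```python
def is_eps_approximation(sample, stream, sets, eps):
    """Check Definition 1.9: for every set R in the system, the density
    of R in the sample is within eps of its density in the stream."""
    for r in sets:
        d_sample = sum(1 for x in sample if x in r) / len(sample)
        d_stream = sum(1 for x in stream if x in r) / len(stream)
        if abs(d_sample - d_stream) > eps:
            return False
    return True

stream = [1, 2, 3, 4, 5, 6, 7, 8]
sets = [set(range(1, a + 1)) for a in range(1, 9)]   # intervals [1, a]
# The evenly spread sample {2,4,6,8} is a 0.2-approximation of the
# stream, while the skewed sample {1,2,3,4} is not.
```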
An oblivious sampler is a sampling algorithm that accepts or rejects an element based only on its index, independent of the values of the stream. Oblivious samplers are well-studied for a variety of sampling tasks, and are directly applied in the adversarial setting in Section 2. Below, we enumerate some classic oblivious samplers.
• Bernoulli Sampler: samples each element independently with probability $p$.
• Uniform Sampler: draws $k$ indexes $i_1, \ldots, i_k \in [m]$ uniformly at random in advance; stream elements with matching indexes are sampled.
• Reservoir Sampler: keeps a sample of size $k$. The first $k$ arrivals of the stream are accepted. An element with index $t > k$ is accepted into the sample with probability $k/t$; if accepted, an element currently in the sample is drawn uniformly at random and replaced by the new element.
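The reservoir sampler is short enough to state in full (a standard implementation, included here for concreteness):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Reservoir sampler: accept the first k arrivals; accept element
    number t > k with probability k/t, evicting a uniformly random
    member of the current sample. Every stream element ends up in the
    final sample with probability exactly k/m."""
    rng = rng or random.Random()
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            sample.append(x)
        elif rng.random() < k / t:
            sample[rng.randrange(k)] = x
    return sample
```

Note that the acceptance decision depends only on the element's index $t$ and the sampler's own coins, which is exactly the obliviousness property exploited by the adversary in Section 2.3.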
Theorem 1.4.
Let $(U, \mathcal{R})$ be a set system with VC-dimension $d$, and let $X$ be a stream of elements drawn from $U$. Let $S$ be a subset sampled uniformly at random with size $O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$. Then $S$ is an $\epsilon$-approximation of $X$ with respect to $\mathcal{R}$ with probability at least $1 - \delta$.
2 Adversarial Robustness
Recall from Sections 1.2 and 1.3 the key assumption that the input stream is chosen non-adaptively in advance. This assumption does not hold in many modern applications, where the streaming outputs may implicitly or directly affect the future input stream. An important area of application for streaming algorithms is routing [1] [10], where adversarial robustness has started to play a key role in evaluation metrics [10]. This section discusses some recent developments in adversarially robust streaming and sampling algorithms. The framework of analysis also contributes to Section 3, where Alon et al. develop the connection between adversarial robustness in sampling and online learnability.
2.1 Adversarial Setting
We can model the adversarial streaming/sampling problem as a two-player game between Adversary and Algorithm [5] [6]. In the most generic setting, Adversary is allowed unbounded computational power and can adaptively choose the next element in the stream based on Algorithm's output history.
For a streaming task, Algorithm aims to respond to queries about its frequency vector with a close approximation; for sampling, Algorithm aims to admit a sample representative of the stream (i.e. an $\epsilon$-approximation of Adversary's input stream). At a high level, the goal of Adversary is to prevent Algorithm from obtaining good results, and the goal of Algorithm is to be robust against this. Each round of the two-player game proceeds as follows.
1. Adversary submits an element of the stream to Algorithm. The choice of element may (probabilistically) depend on all elements submitted in previous rounds and on observations from Algorithm's output history.
2. Algorithm updates its internal state (its estimation of the query value or its accepted sample) based on the new element submitted, and outputs its response to queries regarding its internal state.
3. Adversary observes Algorithm's responses and proceeds to the next round.
2.2 Robust Streaming
Hardt and Woodruff [9] show that linear sketching is inherently non-robust to adversarial inputs. This necessitates new robust streaming algorithms. This section covers recent developments to robustify non-robust algorithms through sketch-switching.
2.2.1 Vulnerability of Linear Sketching
Hardt and Woodruff [9] in particular show that for problems that are at least as hard as $\ell_2$-norm estimation, linear sketching algorithms suffer from adversarial inputs. (Please refer to the original paper for formal proofs of the statements made in the algorithm design summary below, which are omitted in this paper for simplicity.)
Definition 2.1 (GapNorm).
Given an input stream of queries $x_1, x_2, \ldots$ with each $x_i \in \mathbb{R}^{n}$, the GapNorm problem requires an algorithm to return $0$ if $\|x_i\|_2 \le 1$ and to return $1$ if $\|x_i\|_2 \ge B$, for some gap parameter $B$. If $x_i$ satisfies neither of the conditions, the output can arbitrarily be $0$ or $1$.
The GapNorm problem requires a slight modification to the data stream model of Section 1.2: instead of scalar updates, each input from the data stream is a vector $x_i \in \mathbb{R}^{n}$.
Theorem 2.1.
There exists a randomized adversary that, with high probability, finds a distribution over queries on which a linear sketching algorithm fails to solve GapNorm with constant probability.
Proof.
Consider the following two-player game strategy between Adversary and Algorithm. Algorithm samples a sketch matrix $A$ from a distribution $\mathcal{M}$. At each round, Adversary submits a query vector, and Algorithm responds using its evaluation function. Adversary aims to learn the row space $R(A)$ of Algorithm's sketch. If Adversary can successfully learn $R(A)$, the following strategy will force Algorithm to make mistakes on sufficiently many queries: with probability $1/2$, Adversary submits the zero vector, and with probability $1/2$, Adversary submits a vector of large norm lying in the kernel of $A$. Both inputs produce the same (zero) sketch, so Algorithm cannot distinguish them.
To learn the row space $R(A)$, Adversary's initial elements are drawn from a multivariate Gaussian distribution $N(0, I_n)$. From the properties of Gaussian variables, the projection of such a query onto Algorithm's sketch matrix is spherically symmetric, and therefore the output will only depend on the norm of the projection onto $R(A)$. If a query happens to have a high correlation with $R(A)$, Algorithm will be tricked into computing a larger estimate of the norm. [9] proved the existence of a gap parameter that makes the aforementioned difference sufficiently large.
Therefore, Adversary can first submit multiple elements drawn from $N(0, I_n)$ and aggregate them into a vector that is almost entirely contained in $R(A)$. This is accomplished by collecting the positively-labeled vectors from the repeated trials into a matrix $G$ and taking $v$ to be the top right singular vector of $G$. With sufficiently many trials, $v$ is almost entirely contained in $R(A)$.
At each iteration of the above scheme, after finding $v$, Adversary draws its queries from the subspace orthogonal to $v$ and runs the attack again. This effectively removes one "unknown" dimension of $R(A)$ every time a new vector $v$ is found. Repeating the iterations until only a constant number of such unknown dimensions remain, the remaining attacks, following the aforementioned strategy of switching between zero and kernel-space vectors, will trick Algorithm into making a sufficiently large number of mistakes.
The adversarial algorithm above leads to Theorem 2.1. ∎
2.2.2 Sketch Switching
In light of the adversarial vulnerability of linear sketches, Ben-Eliezer and Yogev [6] developed generic robustification schemes for linear sketching. Two techniques were proposed to transform static streaming algorithms into adversarially robust ones. This section will focus on the sketch switching technique, for its widespread usage in the static setting and its success in robustifying popular problems like Distinct Elements, $F_p$ Estimation, Heavy Hitters, and Entropy Estimation.
The Algorithm
Sketch switching achieves robustness by keeping multiple copies of strong tracking (Definition 1.8) algorithms. The high-level goal of sketch switching is to change the current output as few times as possible and, when required to update the output, to move to a different copy so that Adversary does not have enough information to attack. The detailed algorithm is illustrated in Algorithm 1.
As long as the previous output is a good multiplicative approximation of the estimate produced by the copy currently in use, Algorithm repeats the previous output while internally updating its state to the current estimation. When the previous output falls out of range of the internal estimation, Algorithm outputs the current estimation given by the active copy, deactivates that copy, and activates another copy of the static strong tracking algorithm.
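The control flow can be sketched as follows (our own simplified rendering; the `update`/`estimate` interface and the toy exact-$F_0$ sketch are our assumptions, and the real algorithm instantiates the copies with sublinear-space strong-tracking sketches):

```python
class SketchSwitcher:
    """Keep several independent copies of a static sketch; repeat the last
    output while it remains a good multiplicative approximation of the
    active copy's estimate, and switch to a fresh copy when it must change."""

    def __init__(self, make_sketch, num_copies, eps):
        self.copies = [make_sketch() for _ in range(num_copies)]
        self.active = 0
        self.eps = eps
        self.output = None

    def process(self, item):
        for c in self.copies[self.active:]:   # unretired copies ingest everything
            c.update(item)
        est = self.copies[self.active].estimate()
        # Reveal a fresh estimate (and retire its copy) only when the old
        # output leaves the (1 +/- eps/2) window around the estimate.
        if self.output is None or abs(self.output - est) > (self.eps / 2) * abs(est):
            self.output = est
            if self.active + 1 < len(self.copies):
                self.active += 1
        return self.output

class ExactF0:
    """Toy 'sketch' that tracks distinct elements exactly (demo only)."""
    def __init__(self):
        self.seen = set()
    def update(self, x):
        self.seen.add(x)
    def estimate(self):
        return len(self.seen)
```

Because the output changes only when the true value has drifted by a constant factor, the number of retired copies stays small, and Adversary learns nothing about a copy's randomness until the round in which its estimate is revealed.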
A central definition in the sketch switching algorithm is the flip number (Definition 2.2). Lemma 2.2 will come in handy in proving the main theorem, Theorem 2.3.
Definition 2.2 (Flip Number).
Let $y_0, y_1, \ldots, y_m$ be any sequence of real numbers. For $\alpha > 0$, the $\alpha$-flip number is the maximum $k$ for which there exist indices $0 \le i_1 < i_2 < \cdots < i_k \le m$ such that $y_{i_{j-1}} \notin (1 \pm \alpha) \, y_{i_j}$ for all $j \in \{2, \ldots, k\}$.
Definition 2.3 ($(\epsilon, m)$-Flip Number).
Let $\mathcal{C}$ be the space of possible input streams, and let $g$ be the query function defined in Definition 1.8. An input stream of length $m$ induces the sequence $y_t = g(f^{(t)})$, where $f^{(t)}$ is the frequency vector after the first $t$ updates. The $(\epsilon, m)$-flip number of $g$ over $\mathcal{C}$ is the maximum $\epsilon$-flip number of the sequences $(y_0, \ldots, y_m)$ generated by all possible input streams of length $m$ in $\mathcal{C}$.
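For monotone sequences, such as the running value of a distinct-elements count, the flip number can be computed by a greedy scan. This little sketch (our own illustration) shows the logarithmic behavior that Corollary 2.1 formalizes for insertion-only streams:

```python
def flip_number(y, alpha):
    """Greedy alpha-flip count of a sequence: advance an anchor whenever
    the current value y_i has moved out of the (1 +/- alpha) window,
    i.e. |y_i - anchor| > alpha * |y_i|. For monotone sequences the
    greedy scan attains the maximum in Definition 2.2."""
    k, anchor = 1, y[0]
    for v in y[1:]:
        if abs(v - anchor) > alpha * abs(v):
            k += 1
            anchor = v
    return k
```

A constant sequence has flip number $1$, and a counter running from $1$ to $m$ has flip number $O(\log m / \alpha)$, which is what makes sketch switching affordable for insertion-only problems.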
Lemma 2.2.
Fix $0 < \epsilon < 1$. Let $(u_i)_{i=0}^{m}$, $(v_i)_{i=0}^{m}$, and $(w_i)_{i=0}^{m}$ be three sequences of real numbers that satisfy the following:
1. For all $i$, $|v_i - u_i| \le \frac{\epsilon}{8} |u_i|$.
2. $w_0 = v_0$.
3. For $i \ge 1$, if $|w_{i-1} - v_i| \le \frac{\epsilon}{2} |v_i|$, then $w_i = w_{i-1}$. Else $w_i = v_i$.
Then $|w_i - u_i| \le \epsilon |u_i|$ for all $i$, and the number of indices $i$ with $w_i \neq w_{i-1}$ is at most the $\frac{\epsilon}{8}$-flip number of $(u_i)_{i=0}^{m}$.
The Proof
Now we have enough tools to move on to the main theorem of sketch switching, with a skeleton of the proof from the original paper [6].
Theorem 2.3 (Sketch Switching).
Fix $0 < \epsilon, \delta < 1$ and a query function $g$ over frequency vectors, and let $A$ be a static streaming algorithm that, for any $\epsilon'$ and $\delta'$, uses space $L(\epsilon', \delta')$ and satisfies the $(\epsilon', \delta')$-strong $g$-tracking property over frequency vectors at each time step. Let $\lambda$ be the $(\frac{\epsilon}{8}, m)$-flip number of $g$ over the input stream class. Then Algorithm 1 is an adversarially robust algorithm that returns an $\epsilon$-approximation of $g(f^{(t)})$ at each time step $t$ with probability at least $1 - \delta$, and uses space $\lambda \cdot L\!\left(\frac{\epsilon}{8}, \frac{\delta}{\lambda}\right)$.
Proof.
Assume a fixed, randomized algorithm following the robustification framework in Algorithm 1 with static streaming copies $A_1, \ldots, A_\lambda$, each with its own randomness. Assume a deterministic Adversary for simplicity; by Yao's minimax principle [20], this assumption can easily be relaxed to apply to a randomized Adversary. With this assumption, Adversary's choice of each stream element is deterministically defined by the previous elements and Algorithm's responses.
The skeleton of the proof goes as follows: we first show that, up until Algorithm switches away from its first copy of the sketch, its output sequence is an $\epsilon$-approximation of $g(f^{(t)})$ at each time step, and we inductively apply the same line of reasoning to each subsequent switch. Union bounding over the copies of strong tracking algorithms gives us the probability of success. The remaining proof shows that $\lambda$ copies are indeed sufficient to handle an adversarial input stream of length $m$.
Robustness of the first sketch. First fix the randomness of $A_1$, and consider the updates Adversary would make if Algorithm were to repeat its first output at every step; that is, consider the stream on which Algorithm never switches copies. Let $t_1$ be the first time step at which the estimate of $A_1$ falls out of range of the previous output; here Algorithm switches and reports the fresh estimate. By the definition of strong $g$-tracking, the estimates of $A_1$ closely track $g(f^{(t)})$ for all $t \le t_1$, and by the design of Algorithm 1 the reported output stays within the prescribed multiplicative range of those estimates. By Lemma 2.2, the outputs are an $\epsilon$-approximation of $g(f^{(t)})$ for all $t \le t_1$.
Induction on subsequent copies. Following the definition above, Algorithm switches to copy $A_2$ at time $t_1$, and the output of Algorithm is updated to the estimate reported at the switching point, which by strong tracking approximates $g(f^{(t_1)})$ in the sense of Definition 1.8. Consider the concatenation of the aforementioned inputs with the stream of inputs under which Algorithm keeps outputting this new value, and let $t_2$ be the time step at which the output starts to fall out of range. The strong $g$-tracking guarantee holds on this fixed input stream, so reasoning similar to the above gives the $\epsilon$-approximation guarantee for $t_1 < t \le t_2$.
Note that this line of reasoning extends to any copy $A_j$. Then at any time $t$, Algorithm outputs an $\epsilon$-approximation of $g(f^{(t)})$ except with the small failure probability of the active copy. Taking a union bound over all copies of $A$ gives us the desired probability in Theorem 2.3.
Bounding the number of copies required. The last step in the proof shows that $\lambda$ copies of the static streaming algorithm with the strong tracking guarantee suffice to handle an input stream of length $m$. Take $(u_t)$ to be the true query values $g(f^{(t)})$, $(v_t)$ the internal estimates, and $(w_t)$ Algorithm's outputs. These three sequences satisfy the conditions of Lemma 2.2, and thus the number of output changes, and hence the number of copies consumed, is at most $\lambda$. ∎
An Example
Below is a demonstration of how to apply the sketch switching technique to a common streaming problem, Distinct Elements. Also known as $F_0$ estimation, the problem is defined through the query function $g(f) = |\{j \in [n] : f_j \neq 0\}|$. Although the proof will be omitted, the following Corollary 2.1 plays a central role in applying sketch switching to a variety of tasks. Note that the insertion-only constraint in Corollary 2.1 naturally restricts the technique itself to insertion-only models.
Corollary 2.1.
Let $g = F_0$ and assume the input stream has length $m$. The $(\epsilon, m)$-flip number of $g$ in the insertion-only model is $O\!\left(\frac{\log m}{\epsilon}\right)$.
Theorem 2.4 (Robust DistinctElements).
There exists an adversarially robust algorithm that outputs a $(1 \pm \epsilon)$-approximation of the number of distinct elements at each time step with probability at least $1 - \delta$. The space usage is given below.
Proof.
First observe that DistinctElements inherently describes an insertion-only model, so Corollary 2.1 applies. The copies use the static streaming algorithm proposed by Błasiok [7], which provides strong $F_0$-tracking with high probability. To avoid the complexity blow-up from the multiplicative number of copies, use the following trick: instead of discarding a copy permanently, Algorithm keeps fewer copies than prescribed by the flip number and achieves the same result in a cyclic manner. After running out of new copies, it re-activates a previously deactivated copy, and so on, which reduces the required number of copies. ∎
2.3 Robust Sampling
The oblivious random samplers discussed in Section 1.3 are used extensively in modern data-intensive systems, such as database monitoring and distributed machine learning. This naturally raises the question of whether the random samplers are vulnerable to potential attacks by adversarial streams.
Recall Theorem 1.4: for a set system with VC-dimension $d$, a random sample of size $O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ is an $\epsilon$-approximation of a stream drawn from the universe with probability $1 - \delta$. This well-established connection between sample size and learnability is extensively used in theoretical machine learning and is a consequence of the quantitative version of the Fundamental Theorem of PAC Learning (Theorem 1.2).
Ben-Eliezer and Yogev [5] showed that the sample size in Theorem 1.4 is not robust to adversarial streams, and derived the minimum overhead required for an adversarially robust sample. Section 3 shows that this overhead is implicitly connected to online learnability and builds further on this result.
Consider the adversarial setting described in Section 2.1. A formal definition of an adversarially robust algorithm is as follows.
Definition 2.4 ($(\epsilon, \delta)$-Robust).
A randomized sampling algorithm is $(\epsilon, \delta)$-robust if, for any strategy played by the (computationally unbounded) Adversary, the sample achieves an $\epsilon$-approximation of the input stream with respect to the set system with probability at least $1 - \delta$.
2.3.1 Adversarial Attack on Sampling
Theorem 2.5.
There exists a set system with VC-dimension $1$ where, for constant tolerances $\epsilon$ and $\delta$ and some constant $c$, a sample of the size sufficient in the static setting (Theorem 1.4) is not $(\epsilon, \delta)$-robust.
Proof.
Consider the set system $(U, \mathcal{R})$ where $U = \{1, \ldots, u\}$ is a well-ordered set and $u$ is sufficiently large relative to the size $n$ of the adversarial input stream. Define $\mathcal{R} = \{\{1, \ldots, a\} : a \in U\}$ to be the set of inclusive intervals from $1$ to each element of $U$.
Observe that $\mathcal{R}$ has VC-dimension $1$ (recall Definition 1.4 for the definitions of VC-dimension and shattering). To see this, note that any subset $\{a\}$ of size $1$ with $a \ge 2$ can clearly be shattered ($\{1, \ldots, a\}$ realizes a positive label of $a$, and $\{1, \ldots, a-1\}$ realizes a negative label of $a$). So, the VC-dimension is certainly at least $1$. However, consider any arbitrary set $\{a, b\}$ of size $2$, and suppose without loss of generality that $a < b$. There exists a labeling (namely, $a$ has a negative label but $b$ has a positive label) that no interval in $\mathcal{R}$ can describe (any interval not containing $a$, as required by its negative label, cannot contain $b$). Similar logic shows that $\mathcal{R}$ cannot shatter any subset of $U$ of size at least $2$. Therefore, we must have $\mathrm{VC}(\mathcal{R}) = 1$.
Here we consider a Bernoulli sampler with parameter $p$ under the following adversarial scheme; the scheme extends easily to the other aforementioned types of oblivious random sampling algorithms (e.g. the Reservoir sampler). Assume $p$ is small, a reasonable assumption if the sampler is to obtain a sample of size sublinear in the length of a large input stream.
The goal of the adversarial algorithm is to maintain the following invariant at every round $t$ of the two-player game:
• every element that has been accepted by Sampler lies below a current pivot value;
• every element that has been rejected by Sampler lies above the pivot;
• the element submitted by Adversary at round $t$ lies strictly between the largest accepted element and the smallest rejected element.
This invariant ensures that all elements in the sample $S$ are smaller than all of the rejected elements of the stream $X$. It then naturally follows that $S$ cannot be an $\epsilon$-approximation of $X$: formally, let $a$ be the maximum element in $S$ and consider $R = \{1, \ldots, a\} \in \mathcal{R}$. Observe that $\frac{|S \cap R|}{|S|} = 1$, while $\frac{|X \cap R|}{|X|}$ is small because every rejected element of $X$ lies outside $R$.
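The attack is easy to simulate (our own illustration; we use exact rationals so that an element strictly between the accepted and rejected pivots always exists, which in a finite universe is exactly where the requirement that $U$ be large enough enters):

```python
import random
from fractions import Fraction

def bernoulli_attack(n, p, seed=0):
    """Adversary against a Bernoulli sampler over the interval system
    {[0, a]}: always submit a value strictly between the largest
    accepted and the smallest rejected element seen so far."""
    rng = random.Random(seed)
    lo, hi = Fraction(0), Fraction(1)   # (largest accepted, smallest rejected)
    stream, sample, rejected = [], [], []
    for _ in range(n):
        x = (lo + hi) / 2               # between all accepted and all rejected
        stream.append(x)
        if rng.random() < p:            # the sampler's oblivious Bernoulli coin
            sample.append(x)
            lo = x
        else:
            rejected.append(x)
            hi = x
    return stream, sample, rejected

stream, sample, rejected = bernoulli_attack(400, 0.1)
```

Every accepted element ends up below every rejected one, so for the interval up to the maximum sampled value, the density in the sample is $1$ while the density in the stream is only about $p$: the sample is not even a $1/2$-approximation, despite being large enough for the static guarantee.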
Note that the above attack scheme is guaranteed to succeed as long as Adversary does not run out of elements to draw from at any round. The expected sample size drawn by the Bernoulli sampler is $pn$, and by Markov's Inequality the sample is, with constant probability, not much larger than its expectation. When the sample stays this small, the following lemma from [5] completes the claim that Adversary will have enough elements to draw from.
Lemma 2.6.
If , then for any .
∎
2.3.2 Adversarially Robust Sample Size
Ben-Eliezer and Yogev [5] developed an upper bound on the sample size needed for $(\epsilon, \delta)$-robustness (Definition 2.4) of Bernoulli and Reservoir samplers. As with the adversarial attack in the previous section, we will focus on the Bernoulli sampler; a very similar formulation applies to the Reservoir sampler.
Theorem 2.7 (Robust Sample Size).
For any set system $(U, \mathcal{R})$, tolerances $\epsilon, \delta > 0$, and an input stream of size $n$, a Bernoulli sampler with parameter $p$ is $(\epsilon, \delta)$-robust if the following holds.
Corollary 2.2.
The sample size drawn by the Bernoulli sampler is well-concentrated near its expectation $pn$. The above theorem then implies that a sample of size $O\!\left(\frac{\log |\mathcal{R}| + \log(1/\delta)}{\epsilon^2}\right)$ is an $\epsilon$-approximation with probability $1 - \delta$.
Lemma 2.8.
Given a universe $U$ and a subset $R \in \mathcal{R}$, let $X$ be the adversarial input stream of length $n$. A Bernoulli sampler with parameter $p$ satisfies
Note that the main theorem, Theorem 2.7, naturally follows from Lemma 2.8 by taking a union bound. Let $\epsilon$, $\delta$, and $p$ be as defined in Theorem 2.7, and take a Bernoulli sampler with a parameter $p$ that satisfies the condition in the main theorem. For each $R \in \mathcal{R}$, plug these parameters into the lemma to obtain the following result:
If the above event succeeds for all $R \in \mathcal{R}$, it naturally follows that the sample is an $\epsilon$-approximation of $X$. A union bound over all $R \in \mathcal{R}$ concludes that the Bernoulli sampler with $p$ defined above is $(\epsilon, \delta)$-robust. It remains to prove Lemma 2.8 to complete the main theorem.
Proof.
A major difference between the analysis of the static setting and that of the adversarial setting is the assumption of independence between elements of the input stream. Since Adversary submits elements based on the history of elements submitted and the sample maintained, concentration inequalities like the Chernoff bound cannot be applied directly in the adversarial setting. This motivates formulating the relevant random variables as a martingale. Given $R \in \mathcal{R}$, for each round $t$ define $X_t$ and $S_t$ to be the input stream and the sample maintained up until round $t$, and define the following random variables:
Observe that the resulting sequence is a martingale (Lemma 2.10), and that its final term is the subject of interest in Lemma 2.8. Therefore it suffices to prove the following two inequalities and apply the triangle inequality to them:
1. a concentration bound on the deviation of the martingale (via Lemma 2.9 below);
2. a concentration bound on the size of the sample (via a Chernoff bound).
The following two lemmas, from [8] and [5] respectively, explore useful properties of martingales to conclude the proof.
Lemma 2.9.
Let $B_0, B_1, \ldots, B_n$ be a martingale. Assume the variance of $B_t$ conditioned on the previous elements is bounded by $\sigma_t^2$ for each $t$, and that there exists some $M$ for which $|B_t - B_{t-1}| \le M$ holds for all $t$. Then for any $\lambda > 0$, the following inequality holds:
$$\Pr\left[ |B_n - B_0| \ge \lambda \right] \le 2 \exp\left( - \frac{\lambda^2}{2 \left( \sum_{t=1}^{n} \sigma_t^2 + M \lambda / 3 \right)} \right).$$
Lemma 2.10.
The sequence forms a martingale. The variance of conditioned on the previous elements is bounded by , and .
Combining the two lemmas above, and assuming , we have
which yields the first of the two inequalities we set out to prove. For the second inequality, we use the defining property of an oblivious sampler: regardless of the adversary's attack scheme, an oblivious sampler accepts or rejects an element based only on its index. In the case of a Bernoulli sampler, the size of the sample is therefore described by a binomial distribution. By a Chernoff bound, we have
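The binomial concentration used here is easy to check empirically. The following quick simulation (parameters chosen arbitrarily for illustration) compares the multiplicative Chernoff bound against the observed deviation rate of the sample size:

```python
import math
import random

rng = random.Random(42)
n, p = 10_000, 0.05             # stream length and sampling rate
mean = n * p                    # expected sample size: 500
delta = 0.2                     # relative deviation being tested

# Multiplicative Chernoff bound for a Binomial(n, p) sample size:
# Pr[|size - np| >= delta * np] <= 2 * exp(-delta^2 * np / 3)
chernoff_bound = 2 * math.exp(-delta ** 2 * mean / 3)

trials = 200
deviations = sum(
    abs(sum(rng.random() < p for _ in range(n)) - mean) >= delta * mean
    for _ in range(trials)
)
empirical_rate = deviations / trials
```

Because the accept/reject coins are independent of the stream contents, this concentration holds even when the stream itself is adversarially chosen.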
The above inequality shows that with high probability the bad event does not occur. Conditioning on this, the proof of the second inequality is completed with
∎
3 From Adversarial Sampling to Online Learnability
Throughout the above results, we have seen an interesting interplay between streaming and statistical learning theory. In particular, the learnability of certain hypothesis classes plays a nontrivial role in sampling theory. Recall from Definition 1.3 and Theorem 1.1 that a hypothesis class is PAC learnable if and only if the empirical error over large enough samples tracks the generalization error. This is equivalent to the statement that a large enough sample is an ε-approximation of the underlying distribution with respect to the class with high probability (here, we imagine a stream that perfectly encapsulates the distribution we wish to approximate with a sample). In offline VC theory there is an established equivalence between admitting a Uniform Law of Large Numbers, ε-approximability, and PAC learnability. It is therefore reasonable to ask whether a similar result holds in the setting of online learning, in which some notion of sampling approximability relates to online learnability.
On the flip side, we have also seen situations in which adversarial streaming research reveals shadows of online learning theory. For example, consider the adversarial attack on sampling devised in Section 2.3.1. At a high level, that attack constructed a large-depth mistake tree against the hypothesis class in question. The result was that a Bernoulli sampler needed a much larger sample to yield an ε-approximation under this attack than it otherwise would. In the language of statistical learning theory, this is akin to demonstrating that the class, despite having a VC-dimension of 1, has a much higher Littlestone dimension to go along with its larger sample complexity against an adversary. This hints at a relationship between Littlestone dimension and sample complexity in adversarial settings. Even the functional form of Corollary 2.2, a result about the sample size necessary for ε-approximability in adversarial settings, feels very familiar from statistical learning (cf. Theorem 1.2).
Throughout our above investigations, all signs point to a fascinating and tightly knit relationship between the study of adversarially robust sampling (see Section 2.3) and measures of online learnability (Definitions 1.5, 1.6). This relationship is precisely what is formalized, proven, and explored in the recent work "Adversarial Laws of Large Numbers and Optimal Regret in Online Classification" by Alon et al. [3]. Throughout the rest of this section, we present the results of that work and describe how they beautifully tie up gaps in the previous theory and set the stage for a fresh new area of research.
3.1 Adversarial Uniform Law of Large Numbers
For the results below, we still work in the adversarial sampling model introduced in Section 2.1. Consider the following definition, meant to extend the concept of ε-approximate sampling to the regime of adversarially prepared streams.
Definition 3.1 (Adversarial Uniform Law of Large Numbers).
Let there be given a set system over a universe. A hypothesis class admits an Adversarial Uniform Law of Large Numbers if there exist a bound function m and a sampler such that, for any adversarially prepared stream X over the universe and any tolerances ε, δ, the sampler chooses at most m(ε, δ) samples from X, which form an ε-approximation of X with probability at least 1 − δ.
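In symbols, the definition above can be sketched as follows (our own reconstruction and notation; the precise formulation in [3] may differ in details):

```latex
% Adversarial ULLN for a class \mathcal{H} over universe X: there exist
% m : (0,1)^2 \to \mathbb{N} and a sampler that, on any adversarially
% prepared stream x_1, \dots, x_n, selects a subsample E with
% |E| \le m(\epsilon, \delta) such that, with probability at least
% 1 - \delta,
\[
  \sup_{h \in \mathcal{H}}
  \left|
    \frac{|E \cap h|}{|E|}
    \;-\;
    \frac{|\{\, i \le n : x_i \in h \,\}|}{n}
  \right|
  \;\le\; \epsilon .
\]
```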
Compare the above definition with our previous understanding about Uniform Laws of Large Numbers and forming -approximations over regularly drawn (not adversarially-prepared) streams. This feels like a very natural mechanism with which to fill in the gaps and completely characterize online learning in the same way that we characterize offline learning with Theorems 1.1, 1.2. The two main theorems provided by Alon et al. do precisely that; we restate them below.
Theorem 3.1 (Fundamental Theorem of Online Learning).
Let be a set system. The following are equivalent:
1. The class is online learnable in the realizable and agnostic settings.
2. The class admits an Adversarial Uniform Law of Large Numbers.
3. The class has a finite Littlestone dimension.
Theorem 3.2 (Fundamental Theorem of Online Learning - Quantitative).
Let be a set system with finite Littlestone dimension . Then, there exist constants such that:
1. The class's Adversarial Uniform Law of Large Numbers has sample complexity
2. The class is online learnable in the agnostic setting, with optimal regret as a function of the time horizon of
Observe that Theorem 3.1 completely characterizes online learnability in precisely the same way that Theorem 1.1 did for offline learnability. In particular, the Adversarial Uniform Law of Large Numbers plays the same role in online learnability as the Uniform Law of Large Numbers did offline, matching our intuitions at the beginning of this section.
There are some interesting things to note about the quantitative results in Theorem 3.2 as well. In the first item there is an upper bound on the adversarial sample size that precisely matches the upper bound for regular sampling, simply with the VC-dimension replaced by the Littlestone dimension; this once again deepens the symmetry developed throughout this survey. The lack of a tight lower bound, however, is interesting. While there is a tighter lower bound for ε-nets instead of ε-approximations, which correspond to realizable rather than agnostic online learning (interested readers should see Section 7 of Alon et al.), neither bound matches the tightness we see in Theorem 1.2. Perhaps further research can improve the lower bound on adversarial sampling complexity, or perhaps there is much more variability in how difficult it is to achieve an Adversarial Uniform Law of Large Numbers for non-oblivious samplers (for oblivious samplers like the Bernoulli and uniform samplers, regular VC theory already makes the bound tight). The second item in Theorem 3.2 tightens an until-now unresolved inequality for the optimal regret of agnostic online learners, definitively pinning down the regret that optimal learners face. In any case, these results fill in crucial gaps in online learnability theory and strengthen the connections between sampling theory and statistical learning. In particular, the symmetry between oblivious sampling and offline learning on the one hand, and adversarial sampling and online learning on the other, is now made clear in a way that can hopefully inspire and enlighten future research.
3.2 Overview of Proof Techniques
In this final section, we highlight some of the key ideas and techniques used by Alon et al. [3] to prove the results of the previous section (Theorems 3.1, 3.2). The full proof mechanics are intricate and would require immense exposition, but there is still much to be learned about the connections between streaming and statistical learning from studying the high-level approach.
The first step of the proof of Theorem 3.2 uses an online variant of a technique called double sampling, a classic argument credited to [17]. In this application of double sampling (Section 8 in [3]), the authors replace the approximation error with respect to the entire stream by the approximation error with respect to a test set of the same size as the gathered sample. This effectively reduces the problem from gathering a sample from a stream of the full length to gathering a sample from a much shorter stream. Importantly, although the adversary sees which elements of the stream are selected, it gains no further information about which elements will end up in the final sample. With this reduction, minimizing the ε-approximation error reduces to minimizing the discrepancy in a well-known online combinatorial game, defined below.
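For intuition, the classic offline double-sampling (symmetrization) step of Vapnik and Chervonenkis [17] replaces the unknown population error by the error on an independent "ghost" sample of equal size. One standard offline form, stated here for intuition rather than as the exact online variant used in [3], is:

```latex
% Classic symmetrization: for a sample S and an independent ghost
% sample S' of the same size m (for m large enough relative to
% 1/\epsilon^2),
\[
  \Pr\Bigl[\, \sup_{h \in \mathcal{H}}
      \bigl| \mathrm{err}_S(h) - \mathrm{err}_{\mathcal{D}}(h) \bigr|
      > \epsilon \,\Bigr]
  \;\le\; 2 \,
  \Pr\Bigl[\, \sup_{h \in \mathcal{H}}
      \bigl| \mathrm{err}_S(h) - \mathrm{err}_{S'}(h) \bigr|
      > \epsilon/2 \,\Bigr].
\]
```

The right-hand side involves only finitely many points, which is what makes the subsequent combinatorial (discrepancy) analysis possible.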
Definition 3.2 (Online Combinatorial Discrepancy).
The online discrepancy game with respect to a set system is a sequential game played between a painter and an adversary, proceeding as follows: at each round the adversary places an item on the board, and the painter colors it either red or blue. The goal of the painter is that each set in the system be colored in a balanced fashion; i.e., if we consider the set of indices of items from the stream that are colored red, the painter's goal is to minimize the discrepancy
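The game is easy to make concrete. The sketch below plays it on a toy set system with a simple greedy painter that colors each arriving item so as to reduce the total signed imbalance of the sets containing it; this greedy rule is our own illustrative heuristic, not the strategy analyzed in [3]:

```python
# Toy set system over the universe {0, ..., 9}.
family = [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}, {0, 2, 4, 6, 8}]

def greedy_painter(stream, family):
    """Colors each arriving item (+1 = red, -1 = blue) to keep every
    set's red/blue counts balanced. Returns the colors and the final
    discrepancy max_S |#red(S) - #blue(S)|."""
    imbalance = [0] * len(family)   # (#red - #blue) per set
    colors = []
    for item in stream:
        touched = [i for i, s in enumerate(family) if item in s]
        # Pick the color that pushes the touched sets toward balance.
        score = sum(imbalance[i] for i in touched)
        c = -1 if score > 0 else +1
        for i in touched:
            imbalance[i] += c
        colors.append(c)
    return colors, max(abs(v) for v in imbalance)

# The adversary's stream: items arrive one at a time.
stream = [0, 5, 2, 7, 4, 9, 6, 1, 8, 3]
colors, disc = greedy_painter(stream, family)
```

In the sampling correspondence, red items are the ones kept in the sample, and a small discrepancy means every set's density is preserved.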
It is clear from the above definition that this is an equivalent formulation of the double-sampling-reduced ε-approximation problem, since painting an item red or blue corresponds to sampling or not sampling that element, and coloring each set in a balanced fashion corresponds to preserving density in our sample. This reduction allows us to connect performance in combinatorial discrepancy (which in turn is connected to ε-approximation) to a measure of complexity known as Sequential Rademacher Complexity [14].
Definition 3.3.
The Sequential Rademacher Complexity of a hypothesis class is given by the expected discrepancy
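The defining formula can be written out in the standard notation of Rakhlin, Sridharan, and Tewari [14] (our reconstruction; normalization conventions vary across sources):

```latex
% Sequential Rademacher complexity: the supremum runs over all
% X-valued binary trees x of depth T, and the Rademacher signs
% \epsilon_t are i.i.d. uniform on {-1, +1}. The tree structure
% captures the adaptive, sequential nature of the online setting.
\[
  \mathfrak{R}_T(\mathcal{H})
  \;=\;
  \sup_{\mathbf{x}} \;
  \mathbb{E}_{\epsilon}\!\left[
    \sup_{h \in \mathcal{H}}
    \sum_{t=1}^{T} \epsilon_t \,
      h\bigl(\mathbf{x}_t(\epsilon_{1:t-1})\bigr)
  \right].
\]
```

Replacing the i.i.d. sample of the classical Rademacher complexity with a tree of possible futures is exactly what lets this quantity control adaptive adversaries.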
This is a useful direction to maneuver toward because of an already established result that the optimal regret of agnostic online learning is governed by the Sequential Rademacher Complexity (for more details, see Section 12 of [3]). In addition, it ties the sample complexity of ε-approximation in the adversarial setting directly to Sequential Rademacher Complexity, leading us to the first item of the quantitative theorem. So, in order to arrive at Theorem 3.2, all that is left to do is bound the Sequential Rademacher Complexity in terms of the Littlestone dimension. The actual mechanics behind this involve fractional ε-covers of hypothesis classes (probability measures over dynamic sets that agree with most of the class) and a chaining argument, but the tooling for that part of the proof is beyond the scope of this survey (the details can be found in Section 9 of [3] for those interested). All in all, this high-level overview of the proof is presented to once again emphasize the depth and richness of the connection between adversarial sampling and online learning theory, this time through the lens of Sequential Rademacher Complexity.
4 Remarks and Future Research
We believe that the relationship between two until-recently disjoint fields of study, online statistical learning and adversarially robust streaming, that we have explored in this paper opens a new area of research with much promise and direction. One particular direction of further research concerns non-oblivious samplers created using the robustification methods of [2], [19], and others. These methods for mechanically turning an arbitrary oblivious streaming algorithm into an adversarially robust one could yield new sampling schemes with better performance on adversarial ε-approximation, different bounds on Sequential Rademacher Complexity, and, in general, results and perspectives different from those found by Alon et al. using oblivious samplers. In the other direction, the bounds and connections to Rademacher complexity theory could help designers of sampling algorithms devise techniques that are adversarially robust and have standalone practical value as sampling methods. There seem to be many promising directions in which to both build on and make use of the new connections explored throughout this research.
5 Acknowledgements
We would like to thank Professors Matthew Weinberg and Huacheng Yu for their kind feedback, reliable support, and incredible teaching. Thank you!
References
- [1] Charu Aggarwal and Karthik Subbian “Evolutionary network analysis” In ACM Comput. Surv. 47.1 Association for Computing Machinery (ACM), 2014, pp. 1–36
- [2] N. Alon, Y. Matias and M. Szegedy “The space complexity of approximating the frequency moments” In Proc. ACM STOC, 1996, pp. 20–29
- [3] Noga Alon et al. “Adversarial laws of large numbers and optimal regret in online classification”, 2021 arXiv:2101.09054 [cs.LG]
- [4] Shai Ben-David, Dávid Pál and Shai Shalev-Shwartz “Agnostic Online Learning” In Annual Conference Computational Learning Theory, 2009
- [5] Omri Ben-Eliezer and Eylon Yogev “The Adversarial Robustness of Sampling”, 2019 arXiv:1906.11327 [cs.DS]
- [6] Omri Ben-Eliezer, Rajesh Jayaram, David P Woodruff and Eylon Yogev “A framework for adversarially robust streaming algorithms”, 2020 arXiv:2003.14265 [cs.DS]
- [7] Jarosław Błasiok “Optimal streaming and tracking distinct elements with high probability”, 2018 arXiv:1804.01642 [cs.DS]
- [8] Fan Chung and Linyuan Lu “Concentration inequalities and martingale inequalities: A survey” In Internet Math. 3.1 Internet Mathematics, 2006, pp. 79–127
- [9] Moritz Hardt and David P Woodruff “How robust are linear sketches to adaptive inputs?”, 2012 arXiv:1211.1056 [cs.DS]
- [10] Ashwin Lall et al. “Data streaming algorithms for estimating entropy of network traffic” In Perform. Eval. Rev. 34.1 Association for Computing Machinery (ACM), 2006, pp. 145–156
- [11] Nick Littlestone “Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm” In Mach. Learn. 2.4 USA: Kluwer Academic Publishers, 1988, pp. 285–318 DOI: 10.1023/A:1022869011914
- [12] Nabil H. Mustafa and Kasturi R. Varadarajan “Handbook of Discrete and Computational Geometry”, 2017
- [13] S Muthukrishnan “Data streams: Algorithms and applications” In Found. Trends Theor. Comput. Sci. 1.2 Now Publishers, 2005, pp. 117–236
- [14] Alexander Rakhlin, Karthik Sridharan and Ambuj Tewari “Online Learning via Sequential Complexities” arXiv, 2010 DOI: 10.48550/ARXIV.1006.1138
- [15] Shai Shalev-Shwartz and Shai Ben-David “Understanding Machine Learning: From Theory to Algorithms” USA: Cambridge University Press, 2014
- [16] M. Talagrand "Sharper bounds for Gaussian and empirical processes" In Ann. Prob. 22, 1994, pp. 28–76
- [17] V N Vapnik and A Ya Chervonenkis “On the uniform convergence of relative frequencies of events to their probabilities” In Measures of Complexity Cham: Springer International Publishing, 2015, pp. 11–30
- [18] V.N. Vapnik and A.Y. Chervonenkis "On the uniform convergence of relative frequencies of events to their probabilities" In Theory Probab. Appl. 16, 1971, pp. 264–280
- [19] David P. Woodruff and Samson Zhou “Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators” arXiv, 2020 DOI: 10.48550/ARXIV.2011.07471
- [20] Andrew Chi-Chin Yao “Probabilistic computations: Toward a unified measure of complexity” In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977) Providence, RI, USA: IEEE, 1977