A Theory of Optimistically Universal Online Learnability for General Concept Classes
Abstract
We provide a full characterization of the concept classes that are optimistically universally online learnable with $\{0,1\}$ labels. The notion of optimistically universal online learning was defined in (Hanneke, 2021) in order to understand learnability under minimal assumptions. In this paper, following the philosophy behind that work, we investigate two questions for every concept class: (1) What are the minimal assumptions on the data process admitting online learnability? (2) Is there a learning algorithm which succeeds under every data process satisfying the minimal assumptions? Such an algorithm is said to be optimistically universal for the given concept class. We resolve both questions for all concept classes, and as part of our solution we design general learning algorithms for each case. Finally, we extend these algorithms and results to the agnostic case, showing an equivalence between the minimal assumptions on the data process for learnability in the agnostic and realizable cases, for every concept class, as well as the equivalence of optimistically universal learnability.
1 Introduction
Just as computability is a core question in computation theory, learnability is now a core question in learning theory. Intuitively, learnability asks whether we can predict the future correctly with high probability by observing enough examples. In order to describe this intuition formally, we need to define learning models, such as Probably Approximately Correct (PAC) learning (Vapnik and Chervonenkis (1974) and Valiant (1984)), online learning (Littlestone and Warmuth (1986)) and query learning (Angluin (1988)). In this paper, we focus on a variant of online learning: online learning under data processes. In this setting, the learning is sequential: in round $t$, an instance $x_t$ is given to the algorithm, and then the algorithm makes the prediction $\hat{y}_t$ based on the history and the input $x_t$, i.e., $\hat{y}_t = f_t(x_{\leq t-1}, y_{\leq t-1}, x_t)$. Next, the target value $y_t$ will be revealed to the learner so that it can be used to inform future predictions. We model the sequence $\mathbb{X} = (x_t)_{t \in \mathbb{N}}$ as a general stochastic process (possibly with dependencies across times $t$). We say that the algorithm is strongly consistent under $\mathbb{X}$ if the long-run average error is guaranteed to be low, i.e., $\frac{1}{n}\sum_{t=1}^{n} \mathbb{1}[\hat{y}_t \neq y_t] \to 0$ almost surely as $n \to \infty$.
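The protocol just described can be sketched as a simple loop. The learner and the data stream below are toy stand-ins of our own (not any algorithm from the paper): a memorizing learner on a recurring realizable sequence.

```python
def run_online(predict, update, stream):
    """Run the online protocol: see x_t, predict, then observe y_t.
    Returns the running average 0-1 loss after each round."""
    mistakes, avg_losses = 0, []
    for t, (x, y) in enumerate(stream, start=1):
        y_hat = predict(x)        # prediction from history and current x_t
        mistakes += int(y_hat != y)
        update(x, y)              # true label revealed to the learner
        avg_losses.append(mistakes / t)
    return avg_losses

# Toy memorizing learner: remembers the last label seen for each point.
memory = {}
predict = lambda x: memory.get(x, 0)
update = lambda x, y: memory.__setitem__(x, y)

# A recurring sequence over three points, realizable by f(x) = x % 2.
stream = [(t % 3, (t % 3) % 2) for t in range(30)]
avg_losses = run_online(predict, update, stream)
```

On this stream the learner errs only the first time it sees the point labeled 1, so the average loss decays to 0, illustrating strong consistency on this (trivially friendly) process.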
In our setting, any theory of learning must be expressed based on the properties of, or restrictions on, the data process, as the mistake bound depends on the data process. Thus, following an approach found in much of the learning theory literature, such as the PAC model of Valiant (1984) and Vapnik and Chervonenkis (1971) and the online learning model of Littlestone (1988), we introduce the restriction by an additional component, a concept class $\mathcal{C}$. The role of the concept class is to restrict the processes we need to face, such that they all are realizable under that concept class. If there is a target function $f^* \in \mathcal{C}$ such that $y_t = f^*(x_t)$ for every $t$, we say the data process is realizable (though our formal definition below is slightly more general). For a given $\mathcal{C}$, if a learning rule is strongly consistent under every $\mathbb{X}$ with realizable labels, we say it is universally consistent under $\mathcal{C}$ in the realizable case.
It is known that we cannot obtain the low-average-error guarantee simultaneously for all concept classes and data processes (Hanneke, 2021). Thus, we must place restrictions on the data process, the concept class, or both. All three types of restrictions have been investigated: Littlestone (1988); Ben-David et al. (2009) studied online learning with unrestricted data processes but restricted concept classes. Haussler et al. (1994); Ryabko (2006) studied online learning with a mix of both restrictions. There is also a substantial body of work investigating online learnability with all measurable functions but restricted data processes. Most of these specifically consider the case of i.i.d. processes, such as Stone (1977); Devroye et al. (1996), though this has also been generalized to general stationary ergodic processes (Morvai et al., 1996; Gyorfi et al., 1999) or processes with certain convergence properties enabling laws of large numbers (Morvai et al., 1999; Steinwart et al., 2009).
More recently, a general question has been studied: in the case where $\mathcal{C}$ equals the set of all measurable functions, is there a learning rule guaranteed to be universally consistent given only the assumption on $\mathbb{X}$ that universally consistent learning is possible under $\mathbb{X}$? The assumption that universal consistency is possible under $\mathbb{X}$ is referred to as the “optimist’s assumption” (Hanneke, 2021), and for this reason, learning rules which are universally consistent for all $\mathbb{X}$ satisfying the optimist’s assumption are said to be optimistically universally consistent. There is a series of works focusing on this question, starting from Hanneke (2021) and continuing with Blanchard and Cosson (2022); Blanchard (2022); Blanchard et al. (2022a). They tackle this question by first characterizing the minimal assumptions on the data process admitting universally consistent learning and then proposing an online learning algorithm that is universally consistent for all data processes satisfying that assumption. However, these works all focus on the situation with no restrictions on the concepts in the concept class (i.e., $\mathcal{C}$ as all measurable functions). Thus, a natural question arises: For which concept classes do there exist optimistically universal learning rules?
In this paper, we investigate the question mentioned above when the output is binary, i.e., $\mathcal{Y} = \{0,1\}$. We handle this problem by first identifying the minimal assumptions on the data process admitting consistent online learning. Our results answer that question in the following respects:
• For which concept classes $\mathcal{C}$ is optimistically universally consistent learning possible?

• What are the sufficient and necessary conditions for processes to admit universally consistent online learning for a given concept class $\mathcal{C}$?

• For which concept classes is it the case that all processes admit universally consistent online learning?
We first answer these questions in the realizable case. Surprisingly, the answers turn out to be intimately related to combinatorial structures arising in the work of Bousquet et al. (2021) on universal learning rates. This suggests a potential connection between the universal consistency of online learning and universal learning rates, which is of independent interest. We also extend our learning algorithms for the realizable case to the agnostic setting, where the requirement of low average loss is replaced by that of having sublinear regret. Interestingly, our answers to the above questions remain unchanged in the agnostic case, establishing an equivalence between agnostic and realizable learning in this setting.
In this paper, we first provide some illustrative examples in Section 3. Then Section 4 investigates the third and first questions for concept classes under which all processes admit universal online learning, and Section 5 answers the second and first questions for the remaining classes. Finally, in Section 6, we extend our algorithm to the agnostic case.
1.1 More Related Work
Starting from Littlestone’s ground-breaking work (Littlestone, 1988), online learning has become increasingly important. In that paper, Littlestone (1988) introduced a combinatorial parameter of the concept class, known as the Littlestone dimension, to characterize the online learnable concept classes in the realizable case. Later, Ben-David et al. (2009) showed that the Littlestone dimension still characterizes online learnability in the agnostic setting. They extend an online learning algorithm for the realizable case to the agnostic case using the weighted majority algorithm from Littlestone and Warmuth (1994). This line of work makes no assumption on the data process and investigates how restrictions on the concept class affect the learnability of the problem. Two other categories of assumptions have also been investigated: one makes assumptions on both the data process and the concept class, and the other makes assumptions on the data process but not the concept class. These two categories are discussed in detail subsequently.
First, the works investigating learnability with restrictions on both the data process and the concept class make a variety of assumptions. For example, Haussler et al. (1994) investigate how restrictions on the concept class affect learnability given that the data process is i.i.d. This setting resembles a sequential version of PAC learning, and they show a logarithmic mistake bound under the assumptions that the data process is i.i.d. and the concept class has finite VC dimension. Adams and Nobel (2010) show that uniform convergence holds over concept classes of finite VC dimension under all stationary ergodic sequences. However, their result does not provide a convergence rate. Many other works focus on establishing rates under slightly stronger assumptions on the sequences, such as (Yu, 1994; Karandikar and Vidyasagar, 2002).
Another line of work focuses on learnability with restrictions on the process instead of the concept class, starting from the theory of universally consistent prediction under i.i.d. sequences. In particular, there exists an online learning rule such that, for any i.i.d. sequence $(x_t)_{t \in \mathbb{N}}$ and every measurable target function, the long-run average loss converges to $0$ almost surely (e.g., Stone, 1977; Devroye et al., 1996; Hanneke et al., 2021). Meanwhile, consistency under non-i.i.d. processes has also drawn interest: Gyorfi et al. (1999); Morvai et al. (1996) show that there are universally consistent online learning rules under general stationary ergodic processes. The papers of Morvai et al. (1999) and Steinwart et al. (2009) show universal consistency under some classes of processes satisfying laws of large numbers.
More recently, the work of Hanneke (2021) investigates whether there is a consistent learning rule given only the assumption on $\mathbb{X}$ that universally consistent learning is possible. This generalizes the assumptions on $\mathbb{X}$ made by the previous works on universal consistency. The assumption that universally consistent learning is possible is known as the “optimist’s assumption”, so consistency under that assumption is called optimistically universal consistency. Hanneke (2021) studies three different learning models: inductive, self-adaptive, and online, and proves that there is an optimistically universal self-adaptive learning rule and no optimistically universal inductive learning rule. Following this, the works of Blanchard and Cosson (2022); Blanchard et al. (2022a); Blanchard (2022) show that optimistically universal online learning is possible and that the processes admitting strongly universal online learning are exactly those satisfying the condition recalled below as Condition C. This problem has also been investigated under different models, such as the contextual bandit setting (Blanchard et al., 2022b, 2023) and general noisy labels (Blanchard and Jaillet, 2023).
2 Preliminaries and Main Results
In this section, we provide formal definitions and model settings and briefly list the main results of this paper without proof. For brevity, we provide the high-level proof sketch in the subsequent sections and proof details are in the appendices.
Model Setting
We formally provide the learning model here. Let $(\mathcal{X}, \mathcal{B})$ be a measurable space, in which $\mathcal{X}$ is assumed to be non-empty and $\mathcal{B}$ is the Borel $\sigma$-algebra generated by a separable metrizable topology. We also define a label space $\mathcal{Y} = \{0,1\}$. Here we focus on learning under the 0-1 loss: that is, $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$ defined on $\mathcal{Y} \times \mathcal{Y}$, where $\mathbb{1}$ is the indicator function. A stochastic process $\mathbb{X} = (x_t)_{t \in \mathbb{N}}$ is a sequence of $\mathcal{X}$-valued random variables. A label process $\mathbb{Y} = (y_t)_{t \in \mathbb{N}}$ is a sequence of $\mathcal{Y}$-valued random variables. The concept class $\mathcal{C}$ is a non-empty set of measurable functions $f : \mathcal{X} \to \mathcal{Y}$.111We additionally make standard restrictions on $\mathcal{C}$ to ensure certain estimators are well-behaved; we omit the details for brevity, but refer the reader to Bousquet et al. (2021) for a thorough treatment of measurability of these estimators.
An online learning rule is a sequence of measurable functions $f_t : \mathcal{X}^{t-1} \times \mathcal{Y}^{t-1} \times \mathcal{X} \to \mathcal{Y}$, where $t$ is a positive integer. For convenience, we also write $\hat{y}_t = f_t(x_{\leq t-1}, y_{\leq t-1}, x_t)$, where $(x_{\leq t-1}, y_{\leq t-1})$ is the history before round $t$.
There are two ways to define the realizable case. The most common one is that there exists $f^* \in \mathcal{C}$ such that $y_t = f^*(x_t)$ for all $t$. The other is Definition 1 of realizable data processes, which comes from the universal learning setting. These two definitions are equivalent in uniform PAC learning with i.i.d. samples. However, they differ in the setting of universal learning. Thus, we follow the definition from the universal learning setting.
Definition 1.
For every concept class $\mathcal{C}$, we define the set of realizable processes: the pairs $(\mathbb{X}, \mathbb{Y})$ such that, for every $n \in \mathbb{N}$, there exists $f \in \mathcal{C}$ with $y_t = f(x_t)$ for all $t \leq n$ (almost surely). In the same way, we define the set of realizable label processes for a given data process.
Definition 2.
For every concept class $\mathcal{C}$ and data process $\mathbb{X}$, define the set of label processes realizable under $\mathbb{X}$: the processes $\mathbb{Y}$ such that, for every $n \in \mathbb{N}$, there exists $f \in \mathcal{C}$ with $y_t = f(x_t)$ for all $t \leq n$.
In other words, these are the label processes whose every finite prefix is consistent with some concept in $\mathcal{C}$. Importantly, while every $\mathbb{Y}$ generated by some $f^* \in \mathcal{C}$ (i.e., $y_t = f^*(x_t)$ for all $t$) belongs to this set, there can exist $\mathbb{Y}$ not generated by any single concept in $\mathcal{C}$ for which this is also true, due to only requiring realizable prefixes (thus, in a sense, the set represents label sequences by functions in a closure of $\mathcal{C}$ defined by $\mathbb{X}$).222For instance, for $\mathcal{X} = \mathbb{N}$, for the process $x_t = t$, and for $\mathcal{C}$ the class of singleton indicators (singletons), the all-$0$ sequence is in this set though the all-$0$ function is not in $\mathcal{C}$.
First, we define universal consistency under $\mathbb{X}$ and $\mathcal{C}$ in the realizable case. An online learning rule is universally consistent under $\mathbb{X}$ and $\mathcal{C}$ if its long-run average loss approaches $0$ almost surely as the number of rounds goes to infinity, for all realizable label processes. Formally, we have the following definition:
Definition 3.
An online learning rule is strongly universally consistent under $\mathbb{X}$ and $\mathcal{C}$ for the realizable case if, for every realizable label process $\mathbb{Y}$, $\frac{1}{n}\sum_{t=1}^{n} \mathbb{1}[\hat{y}_t \neq y_t] \to 0$ a.s.
We also define universal consistency under $\mathbb{X}$ and $\mathcal{C}$ for the agnostic case. In that definition, we relax the restriction that $\mathbb{Y}$ be realizable: instead, the label process can be set in any possible way, even dependent on the history of the algorithm’s predictions. Thus, the average loss may be linear and inappropriate for defining consistency. Therefore, we compare the performance of our algorithm with the performance of the best realizable label process, a comparison usually referred to as regret. We say an online algorithm is universally consistent under $\mathbb{X}$ and $\mathcal{C}$ for the agnostic case if its long-run average regret is low for every label process. Formally,
Definition 4.
An online learning rule is strongly universally consistent under $\mathbb{X}$ and $\mathcal{C}$ for the agnostic case if, for every label process $\mathbb{Y}$ and every realizable label process $\mathbb{Y}^*$, $\limsup_{n \to \infty} \frac{1}{n}\left(\sum_{t=1}^{n} \mathbb{1}[\hat{y}_t \neq y_t] - \sum_{t=1}^{n} \mathbb{1}[y^*_t \neq y_t]\right) \leq 0$ a.s.
To describe the assumption that universal consistency is possible under the data process $\mathbb{X}$ and the concept class $\mathcal{C}$, we define processes admitting universal online learning as follows:
Definition 5.
We say a process $\mathbb{X}$ admits strongly universal online learning (or just universal online learning for convenience) if there exists an online learning rule that is strongly universally consistent under $\mathbb{X}$ and $\mathcal{C}$.
If an online learning rule is universally consistent under every process that admits universal online learning, we call it optimistically universal under the concept class. If there is an optimistically universal learning algorithm under a concept class, we say that concept class is optimistically universally online learnable. The formal definition is provided below:
Definition 6.
An online learning rule is optimistically universal under concept class $\mathcal{C}$ if it is strongly universally consistent under every process that admits strongly universally consistent online learning under $\mathcal{C}$.
If there is an online learning rule that is optimistically universal under concept class $\mathcal{C}$, we say $\mathcal{C}$ is optimistically universally online learnable.
Next, we define the combinatorial structures we use to characterize, on the one hand, the concept classes under which all processes admit universal online learning, and on the other hand, those that are optimistically universally online learnable when all processes admit strongly universally consistent online learning:
Definition 7 (Littlestone tree Bousquet et al. (2021)).
A Littlestone tree for $\mathcal{C}$ is a complete binary tree of depth $d \leq \infty$ whose internal nodes are labeled by points of $\mathcal{X}$, and whose two edges connecting a node to its children are labeled $0$ and $1$, such that every finite path emanating from the root is consistent with a concept $f \in \mathcal{C}$. We say that $\mathcal{C}$ has an infinite Littlestone tree if it has a Littlestone tree of depth $\infty$.
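For a finite class over a finite domain, the depth of the largest Littlestone tree can be computed by a direct recursion. The sketch below is illustrative and uses our own representation of concepts as point-to-label dictionaries; it is not an algorithm from the paper.

```python
def littlestone_dim(concepts, points):
    """Largest depth of a Littlestone tree for a finite concept class.
    A tree of depth d+1 rooted at x exists iff both label-restricted
    subclasses {c : c(x)=0} and {c : c(x)=1} admit trees of depth d."""
    def ldim(cs):
        if not cs:
            return -1           # empty class: no tree at all
        best = 0
        for x in points:
            sub0 = [c for c in cs if c[x] == 0]
            sub1 = [c for c in cs if c[x] == 1]
            if sub0 and sub1:   # x can label a root with both children alive
                best = max(best, 1 + min(ldim(sub0), ldim(sub1)))
        return best
    return ldim(list(concepts))

# Thresholds over 4 points: c_k(x) = 1 iff x >= k, for k = 0..4.
points = list(range(4))
thresholds = [{x: int(x >= k) for x in points} for k in range(5)]
d = littlestone_dim(thresholds, points)
```

On these 5 threshold concepts the recursion returns depth 2: any root point splits the class into pieces of sizes at most 2 and 3, each of which only supports a depth-1 tree.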
Definition 8 (VCL Tree Bousquet et al. (2021)).
A VCL tree for $\mathcal{C}$ of depth $d \leq \infty$ is a collection
$\{x_u \in \mathcal{X}^{k+1} : 0 \leq k < d,\ u \in \{0,1\}^1 \times \{0,1\}^2 \times \cdots \times \{0,1\}^k\}$
such that for every $0 \leq n < d$ and $y = (y^1, \ldots, y^{n+1}) \in \{0,1\}^1 \times \cdots \times \{0,1\}^{n+1}$, there exists a concept $f \in \mathcal{C}$ so that $f(x^i_{y^{\leq k}}) = y^{k+1}_i$ for all $0 \leq k \leq n$ and $1 \leq i \leq k+1$, where we denote $x_{y^{\leq k}} = (x^1_{y^{\leq k}}, \ldots, x^{k+1}_{y^{\leq k}})$ and $y^{\leq k} = (y^1, \ldots, y^k)$.
We say that $\mathcal{C}$ has an infinite VCL tree if it has a VCL tree of depth $d = \infty$.
The characterization is formally stated in the following two theorems:
Theorem 9.
Every process admits strongly universally consistent online learning under a concept class $\mathcal{C}$ if and only if $\mathcal{C}$ has no infinite VCL tree.
Theorem 10.
A concept class $\mathcal{C}$ has no infinite Littlestone tree if and only if every process admits strongly universally consistent online learning under $\mathcal{C}$ and $\mathcal{C}$ is optimistically universally online learnable.
According to Theorem 9, for concept classes with an infinite VCL tree, there exist processes that do not admit universal online learning. Thus, we need to determine the sufficient and necessary conditions for a process to admit universal online learning.
First, we define experts as algorithms that generate predictions based only on the inputs $x_{\leq t}$. Then we define the following condition (which is a property of a data process) and state the main theorem formally:
Condition A.
For a given concept class $\mathcal{C}$ and data process $\mathbb{X}$, there exists a countable set of experts $\{E_i\}_{i \in \mathbb{N}}$ such that, for every realizable label process $\mathbb{Y}$ and every $\varepsilon > 0$, there exists $i \in \mathbb{N}$ such that:

$\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}[E_i(x_{\leq t}) \neq y_t] \leq \varepsilon \quad \text{a.s.}$  (1)
Theorem 11.
A process $\mathbb{X}$ admits strongly universally consistent online learning under a concept class $\mathcal{C}$ with an infinite VCL tree if and only if it satisfies Condition A.
Next, the sufficient and necessary conditions (on the concept class) for optimistically universal online learning:
Condition B.
There exists a countable set of experts $\{E_i\}_{i \in \mathbb{N}}$ such that, for any process $\mathbb{X}$ that admits universal online learning, any realizable label process $\mathbb{Y}$, and any $\varepsilon > 0$, there exists $i \in \mathbb{N}$ such that:

$\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}[E_i(x_{\leq t}) \neq y_t] \leq \varepsilon \quad \text{a.s.}$  (2)
Notice that these two conditions (Conditions A and B) have only one major difference: whether the countable set of experts may depend on the process $\mathbb{X}$.
Theorem 12.
A concept class $\mathcal{C}$ with an infinite VCL tree is optimistically universally online learnable if and only if it satisfies Condition B.
We also extend the algorithms for the realizable case to an algorithm for the agnostic case and show that the same characterizations hold in the agnostic case.
3 Examples
In this section, we provide some illustrative examples to help the reader get a sense of these conditions. We first provide an example of a concept class that is universally online learnable under all processes but not optimistically universally online learnable.
Example 1.
We take the instance space $\mathcal{X} = \mathbb{R}$ and the binary output space $\mathcal{Y} = \{0,1\}$. The concept class $\mathcal{C}$ is the set of all threshold functions, i.e., $\mathcal{C} = \{x \mapsto \mathbb{1}[x \geq c] : c \in \mathbb{R}\}$. This concept class has no infinite VCL tree: its VC dimension is $1$, so no node at level $k \geq 1$ of a VCL tree can be shattered as required. Thus, all processes admit strongly universally consistent online learning under $\mathcal{C}$. However, it has an infinite Littlestone tree. Thus, for any learning algorithm, there exists a process that is not learnable by that algorithm, so $\mathcal{C}$ is not optimistically universally online learnable.
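The infinite Littlestone tree behind this example is just a binary search. A hypothetical adversary (our own construction, restricted to thresholds on $[0,1]$) can answer opposite to any learner while staying consistent with some threshold, forcing a mistake every round:

```python
def force_mistakes(predict, rounds):
    """Adversary for thresholds f_c(x) = 1[x >= c] on [0, 1]: query the
    midpoint of the interval of still-consistent thresholds, answer the
    opposite of the prediction, and shrink the interval accordingly."""
    lo, hi = 0.0, 1.0            # some threshold c in (lo, hi] stays consistent
    history, mistakes = [], 0
    for _ in range(rounds):
        x = (lo + hi) / 2
        y = 1 - predict(history, x)   # contradict the learner
        mistakes += 1
        if y == 1:
            hi = x               # label 1 at x means c <= x
        else:
            lo = x               # label 0 at x means c > x
        history.append((x, y))
    return mistakes

# Any learner (here: always predict 0) errs in every single round.
m = force_mistakes(lambda history, x: 0, 100)
```

The bisection is exactly a path down a Littlestone tree of unbounded depth, which is why no single algorithm can be consistent against every realizable process for this class.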
From the line of work on optimistically universal online learning, we know that the concept class of all measurable functions is optimistically universally online learnable. The sufficient and necessary condition for a process to admit universal online learning under all measurable functions is Condition C below. Meanwhile, our Conditions A and B reduce to Condition C when the concept class becomes the class of all measurable functions.
Condition C (from Hanneke (2021)).
For every sequence $\{A_k\}_{k=1}^{\infty}$ of disjoint measurable subsets of $\mathcal{X}$, $\mathbb{E}\big[\big|\{k : A_k \cap \{x_1, \ldots, x_T\} \neq \emptyset\}\big|\big] = o(T)$.
The following example shows that whether a concept class is optimistically universally online learnable is neither sufficient nor necessary to determine whether a subclass of it is optimistically universally online learnable. The learnability of a concept class determines that of a subclass if and only if the processes that admit universal online learning are the same under the two concept classes.
Example 2.
We have data sampled from an input space $\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2$, where $\mathcal{X}_1$ and $\mathcal{X}_2$ are disjoint. Then we can define the concept classes: $\mathcal{C}_1$ is the set of all threshold functions on $\mathcal{X}_1$ which are $0$ on $\mathcal{X}_2$, and $\mathcal{C}_2$ is the set of all functions on $\mathcal{X}_2$ which are constant on $\mathcal{X}_1$. Then we can consider the following scenarios:

1. $\mathcal{C}_1$: It is optimistically universally online learnable. The processes that admit universal online learning satisfy Condition C once we treat all points of $\mathcal{X}_2$ as dummy points.

2.

3. The class of all measurable functions on $\mathcal{X}$: this is also optimistically universally online learnable.
4 Sufficient and Necessary Condition that ALL Processes Admit Universal Online Learning
In this section, we answer the question: what restrictions on concept classes make ALL processes admit universal online learning under $\mathcal{C}$? The answer is formally stated as Theorem 9. We show sufficiency by providing a universal online learning rule (depending on $\mathbb{X}$) that is consistent under every process $\mathbb{X}$ and every $\mathcal{C}$ with no infinite VCL tree.
First, we define the VCL game associated with the VCL tree. In this game, there are two players: the learner $P_L$ and the adversary $P_A$. In each round $t$:

• $P_A$ chooses a tuple of points $\xi_t \in \mathcal{X}^t$.

• $P_L$ chooses a prediction pattern $\eta_t \in \{0,1\}^t$.

• Update $V_t = \{f \in V_{t-1} : f(\xi_t) = \eta_t \text{ coordinate-wise}\}$.

• $P_L$ wins the game in round $t$ if $V_t = \emptyset$.

Here $V_t$ is the subset of $\mathcal{C}$ that is consistent with $(\xi_s, \eta_s)$ for all $s \leq t$.
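The game loop above can be sketched concretely for a finite class. The adversary and learner strategies below are toy choices of our own, not the paper's winning strategy; the point is only to show the version set shrinking until the learner names an unrealized pattern.

```python
from itertools import product

def play_vcl_game(concepts, pick_points, pick_pattern, max_rounds):
    """Finite sketch of the game loop: the version set V shrinks as the
    learner's guessed patterns exclude concepts; the learner wins once V
    is empty (the adversary can no longer stay consistent)."""
    V = list(concepts)
    for t in range(1, max_rounds + 1):
        xs = pick_points(t, V)              # adversary: t points this round
        ys = pick_pattern(t, xs, V)         # learner: a pattern in {0,1}^t
        V = [c for c in V if all(c[x] == y for x, y in zip(xs, ys))]
        if not V:
            return t                        # learner wins in round t
    return None                             # game still advancing

# All 8 binary concepts over 3 points; the adversary replays a prefix of the
# domain, and the learner guesses a pattern NOT realized by V when one exists.
concepts = [dict(zip(range(3), bits)) for bits in product([0, 1], repeat=3)]
pick_points = lambda t, V: list(range(min(t, 3)))

def pick_pattern(t, xs, V):
    realized = {tuple(c[x] for x in xs) for c in V}
    for ys in product([0, 1], repeat=len(xs)):
        if ys not in realized:
            return list(ys)     # unrealized pattern: V becomes empty
    return list(next(iter(realized)))

won_round = play_vcl_game(concepts, pick_points, pick_pattern, max_rounds=10)
```

Here the full class realizes every length-1 pattern, so round 1 only halves the version set; by round 2 some length-2 pattern is unrealized and the learner wins. When the class shatters patterns at every level (an infinite VCL tree), the adversary can keep the game advancing forever.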
A strategy is a way of playing that is fully determined by the preceding plays, and a winning strategy is a strategy that guarantees the player wins no matter what actions the opponent takes. We have the following lemma from Bousquet et al. (2021).
Lemma 13 (Bousquet et al. (2021) lemma 5.2).
If $\mathcal{C}$ has no infinite VCL tree, then there is a universally measurable winning strategy for $P_L$ in the game.
Notice that the winning strategy is completely determined by the current version set; we refer to the winning strategy induced by that set accordingly. We may use this winning strategy to design an online learning algorithm, Algorithm 1. This algorithm combines the algorithm in the work of Bousquet et al. (2021) with an algorithm inspired by the learning algorithm for partial concepts in the work of Alon et al. (2021).
In order to describe the algorithm, we first provide the definitions for partial concepts. A partial concept class $\mathcal{H}$ is a set of partial functions defined on $\mathcal{X}$, taking values in $\{0, 1, \star\}$, where $h(x) = \star$ if and only if $h$ is undefined on $x$. For a set $S \subseteq \mathcal{X}$, $S$ is shattered if every binary pattern on $S$ is realized by some $h \in \mathcal{H}$. In this algorithm, we track the number of subsequences of the observed sequence that can be shattered by the partial concept class induced by the current state of the game, where the induced class contains the concepts that are inconsistent with the observed data at no more than a bounded number of data points.
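The shattering notion for partial concepts can be made concrete for finite classes. The representation below (dictionaries with missing keys standing for "undefined") is our own sketch, not the paper's formalism:

```python
def is_shattered(partial_concepts, S):
    """S is shattered by a partial concept class iff every binary pattern
    on S is realized by some concept that is defined on all of S."""
    realized = set()
    for h in partial_concepts:
        vals = tuple(h.get(x) for x in S)   # None marks "undefined"
        if None not in vals:
            realized.add(vals)
    return len(realized) == 2 ** len(S)

# Two partial concepts defined only on point 0: they shatter {0} but cannot
# shatter {0, 1}, since neither is defined on point 1.
H = [{0: 0}, {0: 1}]
shatters_0 = is_shattered(H, [0])
shatters_01 = is_shattered(H, [0, 1])
```

Undefined points never contribute to shattering, which is what keeps the VC dimension of the induced partial class finite once the game stops advancing.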
The following lemma from the work of Bousquet et al. (2021) holds:
Lemma 14 (Bousquet et al. (2021)).
For any process $\mathbb{X}$, there exists $t_0 \in \mathbb{N}$ such that for all $t \geq t_0$, Algorithm 1 does not update its game state, and the winning strategy's version set remains unchanged.
Proof.
By the definition of the winning strategy, following it leads to $P_L$'s winning condition. By the definition of $P_L$'s winning condition, there exists a round $t_0$ after which the game stops advancing, which means that for all $t \geq t_0$, the state no longer updates. That finishes the proof. ∎
This lemma shows that if the concept class has no infinite VCL tree, then for a sufficiently large $t$, the VCL game stops advancing after round $t$. Once the game stops advancing, we are effectively just bounding the number of mistakes of a predictor based on a partial concept class of finite VC dimension. Since this result is interesting in its own right, we extract this portion of the algorithm into a separate subroutine, stated as Algorithm 2 in Appendix A.1, for which we prove the following result.
Lemma 15.
For brevity, we defer the proof of this lemma to the appendix. The intuition behind the proof is that every mistake decreases the total weight by at least half with probability more than one half, so the number of mistakes is sublinear.
Combining the lemmas above, for a concept class with no infinite VCL tree and any realizable sequence, Algorithm 1 achieves a long-run average loss of $0$ almost surely: because the winning strategy only updates finitely many times, the long-run average number of mistakes is dominated by the number of mistakes made by the subroutine, which is sublinear.
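The weight-halving intuition can be illustrated with the classical halving algorithm over a finite expert set (a simplification of our own, not the weighted subroutine itself): with a perfect expert present, each mistake removes at least half of the surviving experts, so the mistake count is logarithmic in the number of experts.

```python
import math
from itertools import product

def halving(experts, stream):
    """Halving algorithm over a finite expert set: predict the majority vote
    of surviving experts, then discard every expert that erred. If a perfect
    expert exists, each mistake halves the survivors, so mistakes <= log2(N)."""
    alive = list(experts)
    mistakes = 0
    for x, y in stream:
        votes = sum(e(x) for e in alive)
        y_hat = int(2 * votes >= len(alive))     # majority vote (ties -> 1)
        mistakes += int(y_hat != y)
        alive = [e for e in alive if e(x) == y]  # drop inconsistent experts
    return mistakes

# Experts: all 16 binary functions on 4 points; labels come from one of them.
points = range(4)
tables = [dict(zip(points, bits)) for bits in product([0, 1], repeat=4)]
experts = [lambda x, t=t: t[x] for t in tables]
truth = tables[5]
stream = [(t % 4, truth[t % 4]) for t in range(40)]
m = halving(experts, stream)
```

Here $m \leq \log_2(16) = 4$ regardless of the ordering of the stream: every majority-vote mistake eliminates at least half of the surviving experts, while the true table always survives.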
To prove the necessity, we show that for every concept class with an infinite VCL tree, there exists at least one process that does not admit universal online learning under $\mathcal{C}$. Formally,
Theorem 16.
For every concept class $\mathcal{C}$ with an infinite VCL tree, there exists a process $\mathbb{X}$ such that $\mathbb{X}$ does not admit universally consistent online learning under $\mathcal{C}$.
We need the following definitions and results from Bousquet et al. (2023) to define the sequence.
Notation 17 (Bousquet et al. (2023)).
For any node of a tree, we refer to its index in the lexicographic ordering of the nodes.
Definition 18 (Bousquet et al. (2023)).
Let $\mathcal{X}$ be a set and $\mathcal{C}$ be a hypothesis class, and let $T$ be an infinite VCL tree that is shattered by $\mathcal{C}$. This implies the existence of a collection of consistent functions, one for each node of the tree.
We say such a collection is indifferent if, whenever a point appears in a node $u$ and $v$ is a descendant of $u$ in the tree $T$ appearing after $u$, the functions associated with such descendants all take the same value on that point. In other words, the functions for all the descendants of a node that appears after it agree on its points.
We say that $T$ is indifferent if it has a collection of consistent functions that is indifferent.
Theorem 19 (Bousquet et al. (2023)).
Let $\mathcal{X}$ be a set and $\mathcal{C}$ be a hypothesis class, and let $T$ be an infinite VCL tree that is shattered by $\mathcal{C}$. Then there exists an infinite VCL tree shattered by $\mathcal{C}$ that is indifferent.
Here is the proof sketch of Theorem 16.
Proof Sketch.
First, we modify the indifferent infinite VCL tree so that the number of elements contained in the $i$-th node in Breadth-First-Search (BFS) order grows appropriately with $i$. The data process presents the points in lexicographic order within each node and in BFS order across nodes. Then we take a random walk on the tree to choose the true label for each instance. The instances in the nodes visited by the random walk are labeled by the labels on the edges adjacent to them along the path. The instances in off-branch nodes are labeled by the label decided by their descendants (we can do this as the tree is indifferent). Thus, when reaching a node on the path, no matter what the algorithm predicts, it makes a mistake with probability $1/2$, so it errs on a quarter of the instances in expectation. Then, by Fatou’s lemma, for each learning algorithm we obtain a realizable process on which the algorithm does not achieve sublinear loss almost surely. ∎
This completes the proof of Theorem 9. We now focus on the existence of an optimistically universal online learner when all processes admit universal online learning.
4.1 Optimistically Universal Online Learning Rule
In this part, we show that whether $\mathcal{C}$ has an infinite Littlestone tree is the sufficient and necessary condition for the existence of an optimistically universal online learning rule when all processes admit universal online learning. This is formally stated as Theorem 10. The sufficiency part of Theorem 10 is proved in Bousquet et al. (2021) as the following lemma:
Lemma 20 (Bousquet et al. (2021) Theorem 3.1, the first bullet).
If $\mathcal{C}$ does not have an infinite Littlestone tree, then there is a strategy for the learner that makes only finitely many mistakes against any adversary.
Notice that the online learning algorithm derived from the learner's winning strategy makes only finitely many mistakes against any adversary, so for every realizable data process, this online learning algorithm also makes only finitely many mistakes, which means the long-run average loss goes to $0$. Thus, this is an optimistically universal online learning rule, and any concept class without an infinite Littlestone tree is optimistically universally online learnable.
The necessity is restated as the following theorem:
Theorem 21.
For any concept class $\mathcal{C}$ with an infinite Littlestone tree and any online learning algorithm, there exists a process on which the algorithm has an average loss greater than one half with non-zero probability.
Proof Sketch.
We can take a random walk on the infinite Littlestone tree to generate the target function. Then, no matter what the algorithm predicts, it makes a mistake with probability at least one half on each instance along the walk. We then use Fatou’s lemma to lower bound the expected average loss of the learning algorithm over the random processes, which implies that for every algorithm there exists a process making its average loss greater than one half with probability greater than zero. ∎
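The core of the random-walk argument is that a label chosen by a fair coin, independently of the prediction, is missed with probability exactly $1/2$. A tiny simulation of our own (a uniform coin standing in for the walk's branch choices along one path) illustrates this:

```python
import random

def average_loss_vs_random_labels(predict, rounds, seed=0):
    """Labels chosen by fair coin flips after seeing the prediction are
    distributed exactly as flips made beforehand: the learner errs with
    probability 1/2 each round, whatever it predicts."""
    rng = random.Random(seed)
    history, mistakes = [], 0
    for t in range(rounds):
        x = t                       # instances along one branch of the tree
        y_hat = predict(history, x)
        y = rng.randint(0, 1)       # random-walk step picks the true label
        mistakes += int(y_hat != y)
        history.append((x, y))
    return mistakes / rounds

loss = average_loss_vs_random_labels(lambda history, x: 0, 10000)
```

Whatever fixed rule replaces `lambda history, x: 0`, the empirical average loss concentrates near $1/2$; in the actual proof the coin flips trace a path down the Littlestone tree, so every labeling produced this way remains realizable by some concept in the class.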
5 Concept Classes with an Infinite VCL Tree
In this section, we discuss the necessary and sufficient conditions for a process to admit universal online learning under a concept class with an infinite VCL tree. Theorem 11 states the answer formally. To prove this theorem, we first prove sufficiency (Lemma 22) and then necessity (Lemma 23).
Lemma 22.
If a process satisfies Condition A, it admits universally consistent online learning under the concept class $\mathcal{C}$.
Proof Sketch.
Lemma 23.
If a process admits universally consistent online learning under the concept class $\mathcal{C}$, it satisfies Condition A.
Proof Sketch.
In order to prove this lemma, we need to show the following statement holds:
For a given concept class $\mathcal{C}$ and a data process $\mathbb{X}$, if there is a learning algorithm that is strongly universally consistent under $\mathbb{X}$ and $\mathcal{C}$, then there is a countable set of experts such that, for any realizable label sequence and any $\varepsilon > 0$, there is an expert in the set whose long-run average number of mistakes is at most $\varepsilon$.
We modify the construction from the work of Ben-David et al. (2009) to build the experts. The experts are based on sets of indices of the rounds in which the algorithm makes mistakes, so there is a one-to-one map from sets of mistake indices to experts. We can then index the experts by these sets to show the existence of such a countable family. ∎
6 Agnostic Case
In this section, we extend the online learning algorithm for the realizable case to an online learning algorithm for the agnostic case. The basic idea follows Ben-David et al. (2009): we build an expert for each realizable label process, then run a learning-with-expert-advice algorithm on those experts to obtain a low-regret learning algorithm.
Theorem 24.
The following two statements are equivalent:
• There is an online learning rule that is strongly universally consistent under $\mathbb{X}$ and $\mathcal{C}$ for the realizable case.

• There is an online learning rule that is strongly universally consistent under $\mathbb{X}$ and $\mathcal{C}$ for the agnostic case.
Proof Sketch.
To prove this theorem, we first build the experts from the learning algorithm for the realizable case using the construction from Lemma 23. We then use the learning-with-expert-advice algorithm Squint from Koolen and van Erven (2015), with non-uniform initial weights over the experts, to obtain sublinear regret. Thus, we can extend any learning algorithm for the realizable case to a learning algorithm for the agnostic case, no matter how the algorithm operates.
Conversely, an online learning algorithm for the agnostic case is also an online learning algorithm for the realizable case: taking the comparator to be the true labels, the regret becomes the number of mistakes. Thus, the two statements are equivalent. ∎
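As a simplified stand-in for the Squint-based reduction (Squint itself and the non-uniform prior are not reproduced here), the classical exponentially weighted forecaster over a finite expert set already achieves sublinear regret. The experts, stream, and parameters below are our own toy choices:

```python
import math
import random

def exp_weights(experts, stream, eta, seed=1):
    """Exponentially weighted forecaster: maintain one weight per expert,
    predict 1 with probability equal to the weighted vote for 1, then
    decay the weight of every expert that erred."""
    w = [1.0] * len(experts)
    rng = random.Random(seed)
    alg_loss = 0
    expert_loss = [0] * len(experts)
    for x, y in stream:
        preds = [e(x) for e in experts]
        p1 = sum(wi for wi, p in zip(w, preds) if p == 1) / sum(w)
        y_hat = 1 if rng.random() < p1 else 0
        alg_loss += int(y_hat != y)
        for i, p in enumerate(preds):
            if p != y:
                expert_loss[i] += 1
                w[i] *= math.exp(-eta)
    return alg_loss, min(expert_loss)

# Three toy experts; the third matches the data source exactly.
experts = [lambda x: 0, lambda x: 1, lambda x: x % 2]
stream = [(t, t % 2) for t in range(2000)]
eta = math.sqrt(8 * math.log(len(experts)) / len(stream))
alg_loss, best_loss = exp_weights(experts, stream, eta)
regret = alg_loss - best_loss
```

With this tuning the expected regret is $O(\sqrt{T \log N})$, far below the $T = 2000$ rounds played; the reduction in the paper replaces this finite pool with the countable expert family from Lemma 23 and Squint's non-uniform prior.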
Theorem 24 implies that all the characterizations for the realizable case are also characterizations for the agnostic case. We formally state the following theorems:
Theorem 25.
For the agnostic case and any concept class with no infinite VCL tree, every process admits strongly universal online learning under $\mathcal{C}$. However, only concept classes with no infinite Littlestone tree are optimistically universally online learnable.
For the concept class with infinite VCL tree:
Theorem 26.
References
- Adams and Nobel (2010) T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik–Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4):1345 – 1367, 2010. doi: 10.1214/09-AOP511. URL https://doi.org/10.1214/09-AOP511.
- Alon et al. (2021) N. Alon, S. Hanneke, R. Holzman, and S. Moran. A theory of PAC learnability of partial concept classes. In Proceedings of the Annual Symposium on Foundations of Computer Science, 2021.
- Angluin (1988) D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
- Ben-David et al. (2009) S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In Annual Conference Computational Learning Theory, 2009.
- Blanchard (2022) M. Blanchard. Universal online learning: an optimistically universal learning rule. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1077–1125. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/blanchard22b.html.
- Blanchard and Cosson (2022) M. Blanchard and R. Cosson. Universal online learning with bounded loss: Reduction to binary classification. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 479–495. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/blanchard22a.html.
- Blanchard and Jaillet (2023) M. Blanchard and P. Jaillet. Universal regression with adversarial responses. The Annals of Statistics, 51(3):1401 – 1426, 2023. doi: 10.1214/23-AOS2299. URL https://doi.org/10.1214/23-AOS2299.
- Blanchard et al. (2022a) M. Blanchard, R. Cosson, and S. Hanneke. Universal online learning with unbounded losses: Memory is all you need. In Sanjoy Dasgupta and Nika Haghtalab, editors, Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pages 107–127. PMLR, 29 Mar–01 Apr 2022a. URL https://proceedings.mlr.press/v167/blanchard22a.html.
- Blanchard et al. (2022b) M. Blanchard, S. Hanneke, and P. Jaillet. Contextual bandits and optimistically universal learning, 2022b.
- Blanchard et al. (2023) M. Blanchard, S. Hanneke, and P. Jaillet. Adversarial rewards in universal learning for contextual bandits, 2023.
- Bousquet et al. (2021) O. Bousquet, S. Hanneke, S. Moran, R. van Handel, and A. Yehudayoff. A theory of universal learning. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, page 532–541, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380539. doi: 10.1145/3406325.3451087. URL https://doi.org/10.1145/3406325.3451087.
- Bousquet et al. (2023) O. Bousquet, S. Hanneke, S. Moran, J. Shafer, and I. Tolstikhin. Fine-grained distribution-dependent learning curves. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5890–5924. PMLR, 12–15 Jul 2023.
- Devroye et al. (1996) L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.
- Györfi et al. (1999) L. Györfi, G. Lugosi, and G. Morvai. A simple randomized algorithm for sequential prediction of ergodic time series. IEEE Transactions on Information Theory, 45(7):2642–2650, 1999. doi: 10.1109/18.796420.
- Hanneke (2021) S. Hanneke. Learning whenever learning is possible: Universal learning under general stochastic processes. Journal of Machine Learning Research, 22(130), 2021.
- Hanneke et al. (2021) S. Hanneke, A. Kontorovich, S. Sabato, and R. Weiss. Universal Bayes consistency in metric spaces. The Annals of Statistics, 49(4):2129 – 2150, 2021. doi: 10.1214/20-AOS2029. URL https://doi.org/10.1214/20-AOS2029.
- Haussler et al. (1994) D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.
- Karandikar and Vidyasagar (2002) R. L. Karandikar and M. Vidyasagar. Rates of uniform convergence of empirical means with mixing processes. Statistics & Probability Letters, 58(3):297–307, 2002. ISSN 0167-7152. doi: https://doi.org/10.1016/S0167-7152(02)00124-4. URL https://www.sciencedirect.com/science/article/pii/S0167715202001244.
- Koolen and van Erven (2015) W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, volume 40 of JMLR Workshop and Conference Proceedings, pages 1155–1175. JMLR.org, 2015. URL http://proceedings.mlr.press/v40/Koolen15a.html.
- Littlestone (1988) N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
- Littlestone and Warmuth (1986) N. Littlestone and M. Warmuth. Relating data compression and learnability. Unpublished manuscript, 1986.
- Littlestone and Warmuth (1994) N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994. ISSN 0890-5401. doi: https://doi.org/10.1006/inco.1994.1009. URL https://www.sciencedirect.com/science/article/pii/S0890540184710091.
- Morvai et al. (1996) G. Morvai, S. Yakowitz, and L. Györfi. Nonparametric inference for ergodic, stationary time series. The Annals of Statistics, 24(1):370–379, 1996. ISSN 00905364. URL http://www.jstor.org/stable/2242624.
- Morvai et al. (1999) G. Morvai, S. R. Kulkarni, and A. B. Nobel. Regression estimation from an individual stable sequence. Statistics: A Journal of Theoretical and Applied Statistics, 33(2):99–118, 1999.
- Ryabko (2006) D. Ryabko. Pattern recognition for conditionally independent data. Journal of Machine Learning Research, 7(23):645–664, 2006. URL http://jmlr.org/papers/v7/ryabko06a.html.
- Steinwart et al. (2009) I. Steinwart, D. Hush, and C. Scovel. Learning from dependent observations. Journal of Multivariate Analysis, 100(1):175–194, 2009.
- Stone (1977) C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, pages 595–620, 1977.
- Valiant (1984) L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.
- Vapnik and Chervonenkis (1971) V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
- Vapnik and Chervonenkis (1974) V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.
- Yu (1994) B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
Appendix A Omitted Proofs
A.1 Proof of lemma 15
For the reader's convenience, we restate the subroutine here for reference.
Proof.
In this proof, we assume that for the partial concept class with VC dimension , is realizable. As defined above, the weight function, , is the number of subsequences of the sequence that can be shattered by the partial concept class .
Consider the -th batch, consisting of . Let
and
Notice that
Thus, the sequence is a martingale difference sequence with respect to the block sequence, . By the definition of , we also have . Then by Azuma’s Inequality, with probability , we have
Next, we derive an upper bound for . According to the prediction rule, every time we make a mistake, we have
(3) |
Due to the linearity of expectation, for every ,
Here , where and the algorithm makes a mistake at round .
Notice that the first part is the expected number of mistakes, each of which halves the weight. For every realization of , , let
By the definition of the weight function and the fact that , . Considering the last round in which , we have , since the set must be shattered. Thus, we have . Therefore, , for every realization . Thus,
(4) |
Now consider the second part; we have
This is because and only depend on . By equation (3), we have
Thus,
(5) | ||||
Combining inequalities (4) and (5), we have
(6) |
Thus, for any , we have
(7) |
By inequality (7), for every , . Thus, with probability at least ,
By the definition of , with probability at least ,
(8) |
Let be the number of instances in the sequence ; then, with probability at least
(9) |
Define the event as the event that in the sequence , . Then we know . Notice that for any , . By the Borel–Cantelli lemma, for any large enough, occurs with probability .
Then, for any large enough , we have . Thus, with probability ,
(10) | ||||
(11) |
Therefore, for any large enough and a universal constant , with probability ,
(12) |
Notice that is . This completes the proof.
∎
A.2 Proof of Theorem 16
Proof.
In this proof, we first modify the chosen indifferent VCL tree so that the number of elements in each node grows exponentially. In other words, we want the -th node in Breadth-First-Search (BFS) order to contain elements. We can achieve this by recursively modifying the chosen VCL tree: starting from the root, for each node that does not satisfy our requirement, we promote one of its descendants to replace that node, so that the number of elements in that node is large enough.
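As a quick sanity check on the counting, suppose for illustration that the k-th node in BFS order holds 2^k elements (the paper's exact growth rate is elided above). Then the number of elements preceding node k is 2^k - 2, which is within a constant factor of that node's own size, so a constant error rate on the node keeps the overall mistake ratio bounded away from zero.

```python
def node_size(k):
    """Illustrative assumption: BFS node k holds 2**k elements."""
    return 2 ** k

def elements_before(k):
    """Total elements in BFS nodes 1..k-1 under the same assumption."""
    return sum(2 ** j for j in range(1, k))  # geometric sum = 2**k - 2

# Node k carries at least half of all elements seen up to and
# including it, for every k >= 2.
for k in range(2, 20):
    assert elements_before(k) == 2 ** k - 2
    ratio = node_size(k) / (node_size(k) + elements_before(k))
    assert 0.5 < ratio <= 2 / 3
```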
Then we define the data process as follows: For the modified VCL tree, we define the sequence as , where is the -th element in the -th node in the BFS order.
Next, we define the target function; in other words, we choose the label for each . First, we take a random walk in the modified VCL tree. For the elements in the nodes on the path we visit (in-branch nodes), we let be the label given by the edge adjacent to that node. We then need to decide the labels of the elements in the nodes not on the visited path (off-branch nodes). For any off-branch node, we can pick an in-branch node after it in BFS order; since the tree is indifferent, all descendants of that in-branch node agree on the labels of the elements in the off-branch node. Thus, we can let the labels of the elements in the off-branch node be those decided by the descendants of that in-branch node. Hence, on every element in a node visited by the random walk, the algorithm may still err with probability when it sees that element for the first time. Also, the number of elements that come before the -th node in BFS order, , is roughly the same as the number of elements in the -th node, . Thus, at the -th layer of the modified tree, if the random walk reaches the node , we have for that node
(13) |
Here is the number of elements in the process when we reach the -th node. This inequality holds for all .
Notice that by taking an expectation of the expected number of mistakes over every deterministic sequence, we obtain the expected number of mistakes for this random process. We can then pick the subsequence containing only the elements , which decreases the ratio of mistakes (the third line in the following computation), because mistakes can only be made on elements in the -th node, and any other choice yields a smaller mistake ratio than . Since the ratio of mistakes is always at most , we can apply the reverse Fatou lemma together with inequality 13 to obtain the final result (the fourth line and the last line in the following computation).
Thus, there exists a deterministic sequence on which the algorithm does not achieve a sublinear expected number of mistakes. ∎
A.3 Proof of Theorem 21
Proof.
Since has an infinite Littlestone tree, we can take a random walk on this tree; for every step , we take the label of the node as , and no matter what the learning algorithm predicts, we choose uniformly at random. Thus, we have for every . We get
(14) |
According to Fatou’s lemma, notice the ratio of mistakes is smaller than or equal to , so we have the following inequality,
(15) |
Thus, for each learning algorithm, there exists a data sequence such that equation 15 holds. That finishes the proof. ∎
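The core of the argument above is that against uniformly random labels, every predictor errs with probability 1/2 at each round. A minimal numerical illustration (the function name and setup are hypothetical, not from the paper):

```python
import random

def mistake_rate(predict, T, seed=0):
    """Empirical mistake ratio of an arbitrary predictor on T rounds of
    i.i.d. fair-coin labels, mimicking the labels produced by the random
    walk on the Littlestone tree.  `predict` maps the label history to a
    {0, 1} guess; no strategy can beat a rate of 1/2 in expectation.
    """
    rng = random.Random(seed)
    history, mistakes = [], 0
    for _ in range(T):
        y_hat = predict(history)
        y = rng.randint(0, 1)  # adversary draws the label uniformly
        mistakes += int(y_hat != y)
        history.append(y)
    return mistakes / T
```

For large T the rate concentrates near 1/2 regardless of the strategy passed in, matching the lower bound in equation 15.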
A.4 Proof of Lemma 22
For completeness, we restate the weighted majority algorithm here:
Using this algorithm, we prove Lemma 22.
Proof.
To prove this lemma, we use the weighted majority algorithm (Algorithm 3) with initial weight for each expert . We set to be the number of mistakes the algorithm makes during rounds, and we set . Next, we compute the total weight of all experts after rounds of the algorithm, . Notice that if the algorithm makes a mistake at round , a weighted majority of the experts must make a mistake at round , so , which means . Thus, we have . Notice that for all , so it also holds for . We have the following inequality for all :
(16) | ||||
(17) |
Here follows from the fact that . Therefore, for a fixed process , target function , and fixed sequence , we have
(18) |
Then, by Condition A, we know the right-hand side of the inequality is .
Notice that is a non-negative random variable, so we have
(19) |
Therefore, almost surely. ∎
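The weight-halving scheme analyzed in the proof above can be sketched as follows. This is a simplified illustration with uniform initial weights (the proof uses non-uniform weights), and the function name is hypothetical; the standard halving analysis gives the mistake bound M ≤ (m + log2 N)/log2(4/3) when the best of N experts makes m mistakes.

```python
def weighted_majority(expert_preds, labels):
    """Deterministic weighted majority with weight-halving.

    `expert_preds[i][t]` is expert i's {0, 1} prediction at round t and
    `labels[t]` is the true label.  Each round, predict the weighted
    majority vote, then halve the weight of every expert that erred.
    """
    n = len(expert_preds)
    weights = [1.0] * n
    mistakes = 0
    for t, y in enumerate(labels):
        vote = sum(w for i, w in enumerate(weights) if expert_preds[i][t] == 1)
        y_hat = 1 if vote >= sum(weights) / 2 else 0
        mistakes += int(y_hat != y)
        # Halve the weight of every erring expert (as in the proof,
        # each mistake removes at least a quarter of the total weight).
        weights = [w / 2 if expert_preds[i][t] != y else w
                   for i, w in enumerate(weights)]
    return mistakes
```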
A.5 Proof of Lemma 23
Proof.
In this proof, we show how to define a sequence of experts such that Condition A is satisfied, provided admits universal online learning. Let be an online learning rule that learns every realizable label process . We can then build the experts via Algorithm 4 and represent each expert by the set of indices of its mistake rounds. We define this set as . For example, if the algorithm makes mistakes at rounds , then the set for that expert is . Thus, we have a one-to-one map from to an expert.
First, we show that for every realizable label process and any , there is an expert such that for all . This part of the proof is similar to the proof of Lemma 12 in Ben-David et al. (2009). Consider a subsequence of a realizable label process : if and only if for all . Thus, the histories for all are the same as , which implies for all . Therefore, for any , there is a set containing all such that . Then Algorithm 4 with input creates an expert such that for all .
Next, we only need to build the index for the sets to show that . The indexing is as follows: for all , order them by . (If two sets have the same value, we use as a tie-breaker.) Here is the number of elements in and is the maximal element of . We then index the 's from following this order.
Finally, we show that the method above constructs a set of experts satisfying Condition A. We have a set of experts and a sequence with such that, for any realizable sequence and any , there is an expert with such that for every . It remains to compute as follows. Assume ; we have
Notice that , we have
(20) |
Thus, we get the set of experts and the corresponding sequence satisfying condition A. ∎
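The exact ordering key in the proof above is elided by the extraction; purely as an illustration, here is one concrete enumeration of the finite subsets of the naturals, ordered by maximum element, then size, then lexicographically. Any such order with finite initial segments yields the required one-to-one map from sets of mistake indices to expert indices.

```python
from itertools import combinations

def enumerate_finite_sets(limit):
    """Enumerate finite subsets of {0, 1, ...} in a well-founded order.

    Sets are grouped by their maximum element m < limit; within a
    group, ordered by size, then lexicographically.  This is an
    illustrative stand-in for the paper's (elided) ordering key.
    """
    out = [frozenset()]          # the empty set gets index 0
    for m in range(limit):
        group = []
        for r in range(m + 1):   # choose the other r elements below m
            for rest in combinations(range(m), r):
                group.append(frozenset(rest) | {m})
        group.sort(key=lambda s: (len(s), sorted(s)))
        out.extend(group)
    return out

# Every finite set receives a distinct finite index.
sets = enumerate_finite_sets(4)
index = {s: i for i, s in enumerate(sets)}
assert index[frozenset()] == 0 and index[frozenset({0})] == 1
```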
A.6 Proof of Theorem 24
Proof.
To prove this theorem, we use a procedure based on learning with experts' advice. First, we build and index the experts, , using the same method as in the proof of Lemma 23 (Appendix A.5), which satisfies Condition A. Then, to use the Squint algorithm of Koolen and van Erven (2015), we need the initial weights of the experts to form a distribution. We can set the initial weight for each , and this forms a distribution, as the sum of reaches when goes to infinity.
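The paper's exact weights are elided above; one standard choice that forms a distribution over countably many experts is the telescoping prior 1/(i(i+1)), used here purely as an illustration. Its partial sums are 1 - 1/(n+1), so the total mass is 1, while log(1/weight) grows only logarithmically in i, which keeps the resulting regret bound sublinear.

```python
def prior(i):
    """Illustrative telescoping prior over experts i = 1, 2, ...

    1/(i(i+1)) = 1/i - 1/(i+1), so the weights sum to 1; this is an
    assumed stand-in for the paper's (elided) non-uniform weights.
    """
    return 1.0 / (i * (i + 1))

# Partial sum up to n telescopes to 1 - 1/(n+1).
partial = sum(prior(i) for i in range(1, 1001))
assert abs(partial - (1 - 1 / 1001)) < 1e-9
```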
According to Theorem 3 in the work of Koolen and van Erven (2015), we have the following upper bound for the regret
(21) |
Here is the sum over rounds of the squared difference between the algorithm's loss and expert 's loss. In other words, we have
(22) |
Since is either or , we have . The regret of this algorithm is therefore upper bounded by:
(23) |
By Condition A, we also know
(24) |
Thus, the regret of this algorithm is
Because and , we have . Thus, the regret above is . Therefore, we obtain an algorithm that extends a universally consistent online learning algorithm for the realizable case to a universally consistent online learning algorithm for the agnostic case.
For the necessity, note that a universally consistent online learning algorithm for the agnostic case can also be used in the realizable case, where the regret equals the number of mistakes. ∎