
A Theory of Optimistically Universal Online Learnability for General Concept Classes

Steve Hanneke
Department of Computer Science
Purdue University
West Lafayette, IN 47907
[email protected]

Hongao Wang
Department of Computer Science
Purdue University
West Lafayette, IN 47907
[email protected]
Abstract

We provide a full characterization of the concept classes that are optimistically universally online learnable with $\{0,1\}$ labels. The notion of optimistically universal online learning was defined in (Hanneke, 2021) in order to understand learnability under minimal assumptions. In this paper, following the philosophy behind that work, we investigate two questions, namely, for every concept class: (1) What are the minimal assumptions on the data process admitting online learnability? (2) Is there a learning algorithm which succeeds under every data process satisfying the minimal assumptions? Such an algorithm is said to be optimistically universal for the given concept class. We resolve both of these questions for all concept classes, and moreover, as part of our solution we design general learning algorithms for each case. Finally, we extend these algorithms and results to the agnostic case, showing an equivalence between the minimal assumptions on the data process for learnability in the agnostic and realizable cases, for every concept class, as well as the equivalence of optimistically universal learnability.

1 Introduction

Just as computability is a core question in computation theory, learnability is now a core question in learning theory. Intuitively, learnability asks whether we can predict the future correctly with high probability after observing enough examples. To describe this intuition formally, we need to define learning models, such as Probably Approximately Correct (PAC) learning (Vapnik and Chervonenkis (1974); Valiant (1984)), online learning (Littlestone and Warmuth (1986)), and query learning (Angluin (1988)). In this paper, we focus on a variant of online learning: online learning under data processes. In this setting, learning is sequential: in round $t$, an instance $X_{t}$ is given to the algorithm, and then the algorithm makes a prediction $\hat{Y}_{t}$ based on the history $(X_{\leq t-1},Y_{\leq t-1})$ and the input $X_{t}$, i.e., $\hat{Y}_{t}=f_{t}(X_{\leq t-1},Y_{\leq t-1},X_{t})$. Next, the target value $Y_{t}$ is revealed to the learner so that it can be used to inform future predictions. We model this sequence as a general stochastic process $(\mathbb{X},\mathbb{Y})=\{(X_{t},Y_{t})\}_{t\in\mathbb{N}}$ (possibly with dependencies across times $t$). We say that the algorithm is strongly consistent under $(\mathbb{X},\mathbb{Y})$ if the long-run average error is guaranteed to be low, i.e., $\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\to 0$ almost surely as $n\to\infty$.
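The prediction loop described above can be sketched as a short simulation. This is a minimal illustration of the protocol only: the learner used here is a hypothetical placeholder (predict the majority label seen so far), not any algorithm from this paper.

```python
# Minimal sketch of the online protocol: at each round t the learner sees X_t,
# predicts Y_hat_t from the history, and only then is the true label Y_t revealed.

def run_online_protocol(process, learner, n_rounds):
    """process yields (x_t, y_t); learner maps (history, x_t) -> prediction."""
    history, mistakes = [], 0
    for _, (x, y) in zip(range(n_rounds), process):
        y_hat = learner(history, x)
        mistakes += int(y_hat != y)
        history.append((x, y))          # Y_t is revealed after the prediction
    return mistakes / n_rounds          # long-run average error

def majority_learner(history, x):
    """Toy placeholder learner: predict the majority label seen so far."""
    ones = sum(y for _, y in history)
    return int(ones * 2 > len(history))

# A trivially realizable stream: the target is the constant-1 function.
stream = ((t, 1) for t in range(10**4))
avg_err = run_online_protocol(stream, majority_learner, 10**4)
```

On this stream the learner errs only in the first round, so the average error vanishes as the horizon grows, matching the notion of strong consistency above.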

In our setting, any theory of learning must be expressed in terms of properties of, or restrictions on, the data process, as the mistake bound depends on the data process. Thus, following an approach found in much of the learning theory literature, such as the PAC model of Valiant (1984) and Vapnik and Chervonenkis (1971) and the online learning model of Littlestone (1988), we introduce the restriction via an additional component, a concept class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$. The role of the concept class is to restrict the processes we need to face, so that they are all realizable under that concept class. If there is a target function $h^{*}\in\mathcal{H}$ such that $Y_{t}=h^{*}(X_{t})$ for every $t$, we say the data process $(\mathbb{X},\mathbb{Y})$ is realizable (though our formal definition below is slightly more general). For a given $\mathbb{X}$, if a learning rule is strongly consistent under $(\mathbb{X},\mathbb{Y})$ for every $\mathbb{Y}$ such that $(\mathbb{X},\mathbb{Y})$ is realizable, we say it is universally consistent under $\mathbb{X}$ in the realizable case.

It is known that we cannot obtain the low-average-error guarantee for all concept classes and data processes simultaneously (Hanneke, 2021). Thus, we must place restrictions on the data process, the concept class, or a mixture of both. All three types of restrictions have been investigated: Littlestone (1988); Ben-David et al. (2009) studied online learning with unrestricted data processes but restricted concept classes. Haussler et al. (1994); Ryabko (2006) studied the online learning problem with a mix of both restrictions. There is also a substantial body of work investigating online learnability with the class of all measurable functions but restricted data processes. Most of these works specifically consider the case of i.i.d. processes, such as Stone (1977); Devroye et al. (1996), though this has also been generalized to general stationary ergodic processes (Morvai et al., 1996; Gyorfi et al., 1999) and to processes with certain convergence properties enabling laws of large numbers (Morvai et al., 1999; Steinwart et al., 2009).

More recently, a general question has been studied: in the case where $\mathcal{H}$ is the set of all measurable functions, is there a learning rule guaranteed to be universally consistent given only the assumption on $\mathbb{X}$ that universally consistent learning is possible under $\mathbb{X}$? The assumption that universal consistency is possible under $\mathbb{X}$ is referred to as the “optimist’s assumption” (Hanneke, 2021), and for this reason, learning rules which are universally consistent for all $\mathbb{X}$ satisfying the optimist’s assumption are said to be optimistically universally consistent. A series of works focuses on this question, starting from Hanneke (2021) and continuing with Blanchard and Cosson (2022); Blanchard (2022); Blanchard et al. (2022a). They tackle this question by first characterizing the minimal assumptions on the data process admitting universally consistent learning and then proposing an online learning algorithm that is universally consistent for all data processes satisfying that assumption. However, these works all focus on the situation with no restrictions on the concepts in the concept class (i.e., $\mathcal{H}$ is the class of all measurable functions). Thus, a natural question arises: For which concept classes do there exist optimistically universal learning rules?

In this paper, we investigate the question mentioned above when the output is binary, i.e., $\{0,1\}$. We handle this problem by first determining the minimal assumptions on the data process admitting consistent online learning. Our results answer that question in the following respects:

  • For which concept classes is optimistically universally consistent learning possible?

  • What are the sufficient and necessary conditions for processes to admit universally consistent online learning for a given concept class $\mathcal{H}$?

  • For which concept classes is it the case that all processes $\mathbb{X}$ admit universally consistent online learning?

We first answer these questions in the realizable case. Surprisingly, the answers turn out to be intimately related to combinatorial structures arising in the work of Bousquet et al. (2021) on universal learning rates. This suggests a potential connection between the universal consistency of online learning and universal learning rates, which is of independent interest. We also extend our learning algorithms for the realizable case to the agnostic setting, where the requirement of low average loss is replaced by that of having sublinear regret. Interestingly, our answers to the above questions remain unchanged in the agnostic case, establishing an equivalence between agnostic and realizable learning in this setting.

In this paper, we first provide some illustrative examples in section 3. Then section 4 investigates questions three and one for the corresponding classes, and section 5 answers questions two and one for the remaining classes. Finally, in section 6, we extend our algorithm to the agnostic case.

1.1 More Related Work

Starting from Littlestone’s ground-breaking work (Littlestone, 1988), online learning has become increasingly important. In that paper, Littlestone (1988) introduces a combinatorial parameter of the concept class, known as the Littlestone dimension, to characterize the online learnable concept classes in the realizable case. Later, Ben-David et al. (2009) showed that the Littlestone dimension still characterizes online learnability in the agnostic setting. They extend an online learning algorithm for the realizable case to one for the agnostic case using the weighted majority algorithm of Littlestone and Warmuth (1994). This line of work makes no assumption on the data process and investigates how restrictions on the concept class affect the learnability of the problem. Two other categories of assumptions have also been investigated: one makes assumptions on both the data process and the concept class, and the other makes assumptions on the data process but not the concept class. These two categories are discussed in detail subsequently.

First, the works investigating learnability with restrictions on both the data process and the concept class make a variety of assumptions. For example, Haussler et al. (1994) investigate how restrictions on the concept class affect learnability when the data process is i.i.d. This problem resembles a streamlined version of PAC learning, and they show a logarithmic mistake bound under the assumptions that the data process is i.i.d. and the concept class has finite VC dimension. Adams and Nobel (2010) show that all stationary ergodic sequences satisfy uniform convergence for concept classes of finite VC dimension. However, they do not establish the convergence rate of the corresponding learning algorithm. Many other works focus on establishing rates under slightly stronger assumptions on the sequences, such as (Yu, 1994; Karandikar and Vidyasagar, 2002).

Another line of work focuses on learnability with restrictions on the process instead of the concept class, starting from the theory of universally consistent predictors under i.i.d. sequences. In particular, there exists an online learning rule such that, for any i.i.d. sequence $(\mathbb{X},\mathbb{Y})$ and every measurable function $f^{*}$, the long-run average loss is 0 almost surely; see, e.g., (Stone, 1977; Devroye et al., 1996; Hanneke et al., 2021). Meanwhile, consistency under non-i.i.d. processes has also been of interest: Gyorfi et al. (1999); Morvai et al. (1996) show that there are universally consistent online learning rules under general stationary ergodic processes. The papers of Morvai et al. (1999) and Steinwart et al. (2009) show universal consistency under certain classes of processes satisfying laws of large numbers.

More recently, the work of Hanneke (2021) investigates whether there is a consistent learning rule given only the assumption on $\mathbb{X}$ that universally consistent learning is possible. This work generalizes the assumptions on $\mathbb{X}$ made by previous works on universal consistency. The assumption that universally consistent learning is possible is known as the “optimist’s assumption”, so consistency under that assumption is called optimistically universal consistency. Hanneke (2021) studies three different learning models: inductive, self-adaptive, and online, and proves that there is an optimistically universal self-adaptive learning rule but no optimistically universal inductive learning rule. Following this initial work, Blanchard and Cosson (2022); Blanchard et al. (2022a); Blanchard (2022) show that optimistically universal online learning is possible and that the processes admitting strongly universal online learning are exactly those satisfying the condition called $\mathcal{C}_{2}$ (see condition C for reference). This problem has also been investigated under different models, such as the contextual bandit setting (Blanchard et al., 2022b, 2023) and general noisy labels (Blanchard and Jaillet, 2023).

2 Preliminaries and Main Results

In this section, we provide formal definitions and the model setting, and briefly list the main results of this paper without proof. For brevity, we provide high-level proof sketches in the subsequent sections; proof details are in the appendices.

Model Setting

We formally describe the learning model here. Let $(\mathcal{X},\mathcal{B})$ be a measurable space, in which $\mathcal{X}$ is assumed to be non-empty and $\mathcal{B}$ is the Borel $\sigma$-algebra generated by a separable metrizable topology $\mathcal{T}$. We also define a space $\mathcal{Y}=\{0,1\}$ called the label space. Here we focus on learning under the 0-1 loss: that is, $(y,y^{\prime})\mapsto\mathbb{I}\left[y\neq y^{\prime}\right]$ defined on $\mathcal{Y}\times\mathcal{Y}$, where $\mathbb{I}\left[\cdot\right]$ is the indicator function. A stochastic process $\mathbb{X}=\{X_{t}\}_{t\in\mathbb{N}}$ is a sequence of $\mathcal{X}$-valued random variables. A stochastic process $\mathbb{Y}=\{Y_{t}\}_{t\in\mathbb{N}}$ is a sequence of $\{0,1\}$-valued random variables. The concept class $\mathcal{H}$ is a non-empty set of measurable functions $h:\mathcal{X}\rightarrow\mathcal{Y}$. (We additionally make standard restrictions on $\mathcal{H}$ to ensure certain estimators are well-behaved; we omit the details for brevity, but refer the reader to Bousquet et al. (2021) for a thorough treatment of the measurability of these estimators.)

An online learning rule is a sequence of measurable functions $f_{t}:\mathcal{X}^{t-1}\times\mathcal{Y}^{t-1}\times\mathcal{X}\rightarrow\mathcal{Y}$, where $t$ is a positive integer. For convenience, we also define $\hat{h}_{t-1}=f_{t}(X_{<t},Y_{<t},\cdot)$, where $(X_{<t},Y_{<t})=\{(X_{i},Y_{i})\}_{i<t}$ is the history before round $t$.

There are two ways to define the realizable case. The most common one requires that there exists $h^{*}\in\mathcal{H}$ such that $Y_{t}=h^{*}(X_{t})$ for every $t$. The other is Definition 1 below, on realizable data processes, which comes from the universal learning setting. These two definitions are equivalent in uniform PAC learning with i.i.d. samples; however, they differ in the universal learning setting. Thus, we follow the definition from the universal learning setting.

Definition 1.

For every concept class $\mathcal{H}$, we can define the following set of processes $\text{R}(\mathcal{H})$:

$$\text{R}(\mathcal{H}):=\left\{(\mathbb{X},\mathbb{Y})=\left\{(X_{i},Y_{i})\right\}_{i\in\mathbb{N}}:\text{with probability }1,\ \forall n<\infty,\ \left\{(X_{i},Y_{i})\right\}_{i\leq n}\text{ realizable by }\mathcal{H}\right\}.$$

In the same way, we define the set of realizable label processes:

Definition 2.

For every concept class $\mathcal{H}$ and data process $\mathbb{X}$, define a set $\text{R}(\mathcal{H},\mathbb{X})$ of label processes:

$$\text{R}(\mathcal{H},\mathbb{X}):=\left\{\mathbb{Y}=\left\{Y_{i}\right\}_{i\in\mathbb{N}}:(\mathbb{X},\mathbb{Y})\in\text{R}(\mathcal{H})\text{ and }\exists\text{ a non-random function }f\text{ s.t. }Y_{i}=f(X_{i})\right\}.$$

In other words, $\text{R}(\mathcal{H},\mathbb{X})$ consists of the label processes $\mathbb{Y}=f(\mathbb{X})$ such that $(\mathbb{X},f(\mathbb{X}))\in\text{R}(\mathcal{H})$. Importantly, while every $f\in\mathcal{H}$ satisfies $f(\mathbb{X})\in\text{R}(\mathcal{H},\mathbb{X})$, there can exist $f\notin\mathcal{H}$ for which this is also true, due to $\text{R}(\mathcal{H})$ only requiring realizable prefixes (thus, in a sense, $\text{R}(\mathcal{H},\mathbb{X})$ represents label sequences by functions in a closure of $\mathcal{H}$ defined by $\mathbb{X}$). (For instance, for $\mathcal{X}=\mathbb{N}$, for the process $X_{i}=i$, and for $\mathcal{H}=\{\mathbb{1}_{\{i\}}:i\in\mathcal{X}\}$ (singletons), the all-0 sequence is in $\text{R}(\mathcal{H},\mathbb{X})$ though the all-0 function is not in $\mathcal{H}$.)
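The singleton example in the footnote can be checked mechanically: every finite prefix of the all-zero label sequence is realizable by some singleton indicator whose support lies outside the prefix. This sketch uses our own helper names:

```python
# Footnote example: X = N with X_i = i, and H = {indicator of {j} : j in N}.
# Every finite all-zero prefix is consistent with 1_{j} for any j outside the
# prefix, so the all-zero label sequence lies in R(H, X), even though the
# all-zero function itself is not a member of H.

def singleton(j):
    """The concept 1_{j}: label 1 exactly on the point j."""
    return lambda x: int(x == j)

def prefix_realizable(xs, ys, candidate_js):
    """Is some singleton concept consistent with the labeled prefix (xs, ys)?"""
    return any(all(singleton(j)(x) == y for x, y in zip(xs, ys))
               for j in candidate_js)

n = 50
xs = list(range(n))
ys = [0] * n                      # the all-zero label prefix
# j = n (a point outside the prefix) witnesses realizability
ok = prefix_realizable(xs, ys, candidate_js=range(n + 1))
```

Since this holds for every finite `n`, the all-zero sequence belongs to the closure described above.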

First, we define universal consistency under $\mathbb{X}$ and $\mathcal{H}$ in the realizable case. An online learning rule is universally consistent under $\mathbb{X}$ and $\mathcal{H}$ if its long-run average loss approaches 0 almost surely as the number of rounds $n$ goes to infinity, for every realizable label process. Formally, we have the following definition:

Definition 3.

An online learning rule is strongly universally consistent under $\mathbb{X}$ and $\mathcal{H}$ for the realizable case if, for every $\mathbb{Y}\in\text{R}(\mathcal{H},\mathbb{X})$, $\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]=0$ a.s.

We also define universal consistency under $\mathbb{X}$ and $\mathcal{H}$ for the agnostic case. In that definition, we relax the restriction that $\mathbb{Y}\in\text{R}(\mathcal{H},\mathbb{X})$; instead, the label process $\mathbb{Y}$ can be generated in any possible way, even dependent on the history of the algorithm’s predictions. Thus, the average loss may be linear, and is inappropriate for defining consistency. Therefore, we compare the performance of our algorithm with the performance of the best possible $\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$; this comparison is usually referred to as regret. We say an online algorithm is universally consistent under $\mathbb{X}$ and $\mathcal{H}$ for the agnostic case if its long-run average regret is asymptotically non-positive for every label process. Formally:

Definition 4.

An online learning rule is strongly universally consistent under $\mathbb{X}$ and $\mathcal{H}$ for the agnostic case if, for every $\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$ and every $\mathbb{Y}$, $\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\left(\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]-\mathbb{I}\left[Y_{t}\neq Y^{*}_{t}\right]\right)\leq 0$ a.s.

To describe the assumption that universal consistency is possible under the data process $\mathbb{X}$ and the concept class $\mathcal{H}$, we define processes admitting universal online learning as follows:

Definition 5.

We say a process $\mathbb{X}$ admits strongly universal online learning (or just universal online learning for convenience) if there exists an online learning rule that is strongly universally consistent under $\mathbb{X}$ and $\mathcal{H}$.

If the online learning rule is universally consistent under every process that admits universal online learning, we call it optimistically universal under the concept class. If there is an optimistically universal learning algorithm under that concept class, we say that concept class is optimistically universally online learnable. The formal definition is provided below:

Definition 6.

An online learning rule is optimistically universal under concept class $\mathcal{H}$ if it is strongly universally consistent under every process $\mathbb{X}$ that admits strongly universally consistent online learning under concept class $\mathcal{H}$.

If there is an online learning rule that is optimistically universal under concept class $\mathcal{H}$, we say $\mathcal{H}$ is optimistically universally online learnable.

Next, we define the combinatorial structures used to characterize (i) the concept classes under which all processes admit universal online learning, and (ii) the concept classes that are optimistically universally online learnable when all processes admit strongly universally consistent online learning:

Definition 7 (Littlestone tree; Bousquet et al. (2021)).

A Littlestone tree for $\mathcal{H}$ is a complete binary tree of depth $d\leq\infty$ whose internal nodes are labeled by points of $\mathcal{X}$, and whose two edges connecting a node to its children are labeled $0$ and $1$, such that every finite path emanating from the root is consistent with a concept $h\in\mathcal{H}$. We say that $\mathcal{H}$ has an infinite Littlestone tree if it has a Littlestone tree of depth $d=\infty$.
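As a concrete instance, the class of threshold functions on $[0,1]$ (which appears again in Example 1 below) has Littlestone trees of every finite depth: put the midpoint of an interval at each node and bisect the interval along the 0- and 1-edges. The sketch below, with helper names of our own, verifies that every labeled root-to-leaf path is consistent with some threshold.

```python
import itertools

# Thresholds h_a(x) = 1[x >= a] admit arbitrarily deep Littlestone trees:
# each node holds the midpoint of a nested interval, bisected along the edges.

def node_point(path):
    """Instance placed at the node reached by following `path` (tuple of bits)."""
    lo, hi = 0.0, 1.0
    for b in path:
        mid = (lo + hi) / 2
        if b == 1:
            hi = mid      # label 1 forces the threshold a <= mid
        else:
            lo = mid      # label 0 forces the threshold a > mid
    return (lo + hi) / 2

def consistent_threshold(path):
    """A threshold a with 1[x >= a] matching every labeled node on `path`."""
    examples = [(node_point(path[:i]), b) for i, b in enumerate(path)]
    hi = min((x for x, y in examples if y == 1), default=1.0)
    lo = max((x for x, y in examples if y == 0), default=0.0)
    a = (lo + hi) / 2
    return a if all(int(x >= a) == y for x, y in examples) else None

# every labeled path of every depth up to 8 is realized by some threshold
deep_ok = all(consistent_threshold(p) is not None
              for d in range(1, 9)
              for p in itertools.product([0, 1], repeat=d))
```

The interval nesting guarantees that every 0-labeled point lies below every 1-labeled point on a path, so a consistent threshold always exists between them.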

Definition 8 (VCL tree; Bousquet et al. (2021)).

A VCL tree for $\mathcal{H}$ of depth $d\leq\infty$ is a collection

$$\{x_{u}\in\mathcal{X}^{k+1}:0\leq k<d,\ u\in\{0,1\}^{1}\times\{0,1\}^{2}\times\cdots\times\{0,1\}^{k}\}$$

such that for every $n<d$ and $y\in\{0,1\}^{1}\times\{0,1\}^{2}\times\cdots\times\{0,1\}^{n+1}$, there exists a concept $h\in\mathcal{H}$ so that $h(x_{y_{\leq k}}^{i})=y^{i}_{k+1}$ for all $0\leq i\leq k$ and $0\leq k\leq n$, where we denote

$$y_{\leq k}=(y_{1}^{0},(y_{2}^{0},y_{2}^{1}),\dots,(y_{k}^{0},\dots,y_{k}^{k-1})),\quad x_{y_{\leq k}}=(x_{y_{\leq k}}^{0},\dots,x_{y_{\leq k}}^{k}).$$

We say that $\mathcal{H}$ has an infinite VCL tree if it has a VCL tree of depth $d=\infty$.

The characterization is formally stated in the following two theorems:

Theorem 9.

Every process admits strongly universally consistent online learning under a concept class $\mathcal{H}$ if and only if $\mathcal{H}$ has no infinite VCL tree.

Theorem 10.

A concept class $\mathcal{H}$ has no infinite Littlestone tree if and only if every process admits strongly universally consistent online learning under $\mathcal{H}$ and $\mathcal{H}$ is optimistically universally online learnable.

According to theorem 9, for concept classes with an infinite VCL tree, there exist processes that do not admit universal online learning. Thus, we need to determine the sufficient and necessary conditions for a process to admit universal online learning.

First, we define experts as algorithms that generate predictions based only on the input $X_{t}$. Then we define the following condition (which is a property of a data process) and state the main theorem formally:

Condition A.

For a given concept class $\mathcal{H}$, there exists a countable set of experts $E=\{e_{1},e_{2},\dots\}$ such that for every $\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$, there exists a sequence $i_{n}\rightarrow\infty$ with $\log i_{n}=o(n)$, such that:

$$\mathbb{E}\left[\limsup_{n\rightarrow\infty}\min_{e_{i}:i\leq i_{n}}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[e_{i}(X_{t})\neq Y^{*}_{t}\right]\right]=0 \qquad (1)$$
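A countable expert set as in Condition A is typically exploited by running a multiplicative-weights aggregator over (a prefix of) the experts. The sketch below illustrates only the aggregation mechanism; the experts and the stream are toy stand-ins of our own, not the construction from this paper.

```python
import math

# Weighted-majority aggregation over a finite prefix of a countable expert set:
# predict by weighted vote, then exponentially down-weight experts that erred.

def weighted_majority(experts, stream, eta=0.5):
    """Return the number of mistakes of the weighted-vote aggregator."""
    w = [1.0] * len(experts)
    mistakes = 0
    for x, y in stream:
        votes = [e(x) for e in experts]
        s1 = sum(wi for wi, v in zip(w, votes) if v == 1)
        s0 = sum(wi for wi, v in zip(w, votes) if v == 0)
        y_hat = int(s1 >= s0)
        mistakes += int(y_hat != y)
        # multiplicative update: penalize experts that predicted incorrectly
        w = [wi * math.exp(-eta * int(v != y)) for wi, v in zip(w, votes)]
    return mistakes

experts = [lambda x: 0, lambda x: 1, lambda x: x % 2]  # third expert is perfect
stream = [(t, t % 2) for t in range(2000)]             # realizable toy process
m = weighted_majority(experts, stream)                 # m stays far below 2000
```

When some expert in the prefix has vanishing average error, as Condition A guarantees, the aggregator's average error vanishes as well.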
Theorem 11.

A process $\mathbb{X}$ admits strongly universally consistent online learning under a concept class $\mathcal{H}$ with an infinite VCL tree if and only if it satisfies condition A.

Next, we state the sufficient and necessary condition (on the concept class) for optimistically universal online learning:

Condition B.

There exists a countable set of experts $E=\{e_{1},e_{2},\dots\}$ such that for every $\mathbb{X}$ that admits universal online learning and every $\mathbb{Y}^{*}\in\text{R}(\mathcal{H},\mathbb{X})$, there exists $i_{n}\rightarrow\infty$ with $\log i_{n}=o(n)$, such that:

$$\mathbb{E}\left[\limsup_{n\rightarrow\infty}\min_{e_{i}:i\leq i_{n}}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[e_{i}(X_{t})\neq Y^{*}_{t}\right]\right]=0 \qquad (2)$$

Notice that conditions A and B have only one major difference: whether the countable set of experts may depend on the process $\mathbb{X}$.

Theorem 12.

A concept class $\mathcal{H}$ with an infinite VCL tree is optimistically universally online learnable if and only if it satisfies condition B.

We also extend the algorithms for the realizable case to an algorithm for the agnostic case and show that the same characterization holds in the agnostic case.

3 Examples

In this section, we provide some illustrative examples to help the reader get a sense of these conditions. We first provide an example of a concept class that is universally online learnable under all processes but not optimistically universally online learnable.

Example 1.

Let the instance space be $\mathcal{X}=\mathbb{R}$ and $\mathcal{Y}=\{0,1\}$, a binary output. The concept class $\mathcal{H}$ is the class of all threshold functions; in other words, $\mathcal{H}_{\text{threshold}}=\left\{h_{a}:h_{a}(x)=\mathbb{I}\left[x\geq a\right],\ a\in\mathbb{R}\right\}$. This concept class has no infinite VCL tree, as there is no pair $(x_{1},x_{2})$ on which $\mathcal{H}_{\text{threshold}}$ realizes all possible label patterns. Thus, all processes admit strongly universally consistent online learning under $\mathcal{H}_{\text{threshold}}$. However, it has an infinite Littlestone tree. Thus, for any learning algorithm, there exists a process that is not learnable by that algorithm, so $\mathcal{H}_{\text{threshold}}$ is not optimistically universally online learnable.
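The claim that no pair is shattered can be verified directly: for $x_{1}<x_{2}$, the pattern $(1,0)$ would require a threshold $a$ with $a\leq x_{1}$ and $a>x_{2}$, which is impossible. A small check (the grid of candidate thresholds is our own illustrative choice):

```python
# Thresholds realize only 3 of the 4 label patterns on any pair x1 < x2:
# (1, 0) would need a threshold a with a <= x1 and a > x2, impossible.

def patterns(points, thresholds):
    """Set of label patterns realized on `points` by the given thresholds."""
    return {tuple(int(x >= a) for x in points) for a in thresholds}

pts = (0.3, 0.7)
realized = patterns(pts, [a / 100 for a in range(101)])  # a in {0, 0.01, ..., 1}
# `realized` lacks (1, 0): no pair is shattered, so no VCL tree reaches depth 2
```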

From the line of work on optimistically universal online learning, we know that the concept class of all measurable functions is optimistically universally online learnable. The sufficient and necessary condition for processes to admit universal online learning under all measurable functions is condition $\mathcal{C}_{2}$ (see below). Moreover, our conditions A and B reduce to $\mathcal{C}_{2}$ when the concept class $\mathcal{H}$ is the class of all measurable functions.

Condition C ($\mathcal{C}_{2}$ in Hanneke (2021)).

For every sequence $\{A_{k}\}_{k=1}^{\infty}$ of disjoint elements of $\mathcal{B}$,

$$\left|\left\{k\in\mathbb{N}:X_{1:T}\cap A_{k}\neq\emptyset\right\}\right|=o(T)\quad a.s.$$
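To make $\mathcal{C}_{2}$ concrete, take $A_{k}=\{k\}$: an i.i.d. process on a finite set visits a bounded number of the disjoint sets, so the count is trivially $o(T)$, while the deterministic enumeration $X_{t}=t$ visits $T$ of them and violates the condition. A toy check (the specific processes are our own illustration):

```python
import random

# Condition C_2 with A_k = {k}: count how many disjoint sets X_{1:T} visits.

random.seed(0)
T = 10_000
xs_iid = [random.randrange(10) for _ in range(T)]   # i.i.d. on {0, ..., 9}
visited_iid = len(set(xs_iid))                      # at most 10, regardless of T

xs_enum = list(range(T))                            # the process X_t = t
visited_enum = len(set(xs_enum))                    # equals T: linear, not o(T)
```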

The following example shows that whether a concept class is optimistically universally online learnable is neither sufficient nor necessary to determine whether a subclass of it is optimistically universally online learnable. The two properties coincide if and only if the processes that admit universal online learning are the same under the two concept classes.

Example 2.

Consider data sampled from an input space $\mathcal{X}=\mathcal{X}_{1}\cup\mathcal{X}_{2}$, where $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$ are disjoint; for example, $\mathcal{X}_{1}=\mathbb{R}^{+}$ and $\mathcal{X}_{2}=\mathbb{R}\backslash\mathbb{R}^{+}$. Then we can define the concept classes: $\mathcal{H}_{1}$ is the set of all threshold functions on $\mathcal{X}_{1}$ which are 0 on $\mathcal{X}_{2}$, and $\mathcal{H}_{2}$ is the set of all functions on $\mathcal{X}_{2}$ which are constant on $\mathcal{X}_{1}$. Then we can consider the following scenarios:

  1. $\mathcal{H}=\mathcal{H}_{2}$: This class is optimistically universally online learnable. The processes that admit universal online learning satisfy $\mathcal{C}_{2}$ once we replace all $X_{t}\in\mathcal{X}_{1}$ with dummy points.

  2. $\mathcal{H}=\mathcal{H}_{1}\cup\mathcal{H}_{2}$: This class is not optimistically universally online learnable, as all processes supported on $\mathcal{X}_{1}$ admit universal online learning under $\mathcal{H}$, yet for every learning strategy there exists at least one process on $\mathcal{X}_{1}$ forcing that strategy to make linearly many mistakes (due to theorem 9 and theorem 10).

  3. $\mathcal{H}$ is the class of all measurable functions on $\mathcal{X}$: This class is also optimistically universally online learnable.

4 Sufficient and Necessary Condition that ALL Processes Admit Universal Online Learning

In this section, we answer the question: What restrictions on concept classes make ALL processes admit universal online learning under $\mathcal{H}$? The answer is formally stated as Theorem 9. We show sufficiency by providing a universal online learning rule (depending on $\mathbb{X}$) under every process $\mathbb{X}$ and every $\mathcal{H}$ with no infinite VCL tree.

First, we define the VCL game along with the VCL tree. In this game, there are two players: the learner, $P_{L}$, and the adversary, $P_{A}$; initially $U_{0}=\emptyset$. Then in each round $k$:

  • $P_{A}$ chooses the point $X_{(k)}=(X_{k,1},\dots,X_{k,k})\in\mathcal{X}^{k}$.

  • $P_{L}$ chooses the prediction $g_{U_{k-1}}((X_{k,1},\dots,X_{k,k}))\in\{0,1\}^{k}$.

  • Update $U_{k}=U_{k-1}\cup\{(X_{(k)},g_{U_{k-1}})\}$.

  • $P_{L}$ wins the game in round $k$ if $\mathcal{H}_{U_{k}}=\emptyset$.

Here $\mathcal{H}_{U_{k}}=\{h\in\mathcal{H}:\forall i,\ h(X_{(i)})=g_{i}(X_{(i)})\}$, the subset of $\mathcal{H}$ consistent with $(X_{(i)},g_{i}(X_{(i)}))$ for all $i\leq k$.

A strategy is a way of playing that is fully determined by the preceding plays, and a winning strategy is a strategy that necessarily causes the player to win no matter what actions the opponent takes. We have the following lemma from Bousquet et al. (2021):

Lemma 13 (Bousquet et al. (2021) lemma 5.2).

If $\mathcal{H}$ has no infinite VCL tree, then there is a universally measurable winning strategy $g$ for $P_{L}$ in the game.

Since the winning strategy $g$ is completely determined by $U$, we write $g_{U}$ for the winning strategy induced by the set $U$. We may use this winning strategy $g_{U}$ to design an online learning algorithm (Algorithm 1). This algorithm is a combination of the algorithm in the work of Bousquet et al. (2021) and an algorithm inspired by the learning algorithm for partial concepts in the work of Alon et al. (2021).

To describe the algorithm, we first provide the definitions for partial concepts. A partial concept class $\mathcal{H}\subseteq\{0,1,*\}^{\mathcal{X}}$ is a set of partial functions defined on $\mathcal{X}$, where $h(x)=*$ if and only if $h$ is undefined on $x$. For a set $X^{\prime}\subseteq\mathcal{X}$, we say $X^{\prime}$ is shattered if every binary pattern $y\in\{0,1\}^{X^{\prime}}$ is realized by some $h\in\mathcal{H}$. In this algorithm, we use $w(\mathcal{H}^{\prime},X_{\leq T})=|\{S:S\subseteq\{X_{i}\}_{i\leq T}\text{ such that }S\text{ shattered by }\mathcal{H}^{\prime}\}|$, the number of subsequences of the sequence $X_{\leq T}$ that can be shattered by the partial concept class $\mathcal{H}^{\prime}$. Further, $\mathcal{H}^{g_{U}}=\{h:\forall X_{1},X_{2},\dots,X_{k}\in\mathcal{X},\ (h(X_{1}),h(X_{2}),\dots,h(X_{k}))\neq g_{U}(X_{1},X_{2},\dots,X_{k})\}$ is the partial concept class induced by $g_{U}$; if $U=\{(X_{(i)},g_{i}(X_{(i)}))\}_{i\leq k-1}$, it contains the concepts that are not consistent with $g_{U}$ on any $k$-tuple of data points. We define $\mathcal{H}^{g_{U}}_{\{(X_{i},Y_{i})\}_{i\leq t}}=\{h\in\mathcal{H}^{g_{U}}:\forall i\leq t,\ h(X_{i})=Y_{i}\}$. We also define $X_{[t,t^{\prime}]}=\{X_{i}\}_{t\leq i\leq t^{\prime}}$ and $t(m)=\frac{m(m+1)}{2}$.

k=1k=1, U={}U=\{\}, t0t^{\prime}\leftarrow 0.
for t=1,2,3,t=1,2,3,\dots do
       if j1,j2,,jk<t\exists j_{1},j_{2},\dots,j_{k}<t such that gU(Xj1,,Xjk)=(Yj1,,Yjk)g_{U}(X_{j_{1}},\dots,X_{j_{k}})=(Y_{j_{1}},\dots,Y_{j_{k}}) then
             Advance the game:
             UU{((Xj1,,Xjk),(Yj1,,Yjk))}U\leftarrow U\cup\{((X_{j_{1}},\dots,X_{j_{k}}),(Y_{j_{1}},\dots,Y_{j_{k}}))\}.
             kk+1k\leftarrow k+1.
             LL\leftarrow\emptyset.
             m1m\leftarrow 1.
             tt1t^{\prime}\leftarrow t-1.
       end if
      Predict
Yt^=argmaxyPr[w(L(Xt,1y)gU,X[t,t(m)+t])12w(LgU,X[t,t(m)+t])|Xt]\hat{Y_{t}}=\mathop{\rm argmax}_{y}\Pr\left[\left.w(\mathcal{H}^{g_{U}}_{L\cup{(X_{t},1-y)}},X_{[t,t(m)+t^{\prime}]})\leq\frac{1}{2}w(\mathcal{H}^{g_{U}}_{L},X_{[t,t(m)+t^{\prime}]})\right|X_{\leq t}\right]
if YtY^tY_{t}\neq\hat{Y}_{t} then
            LL{(Xt,Yt)}L\leftarrow L\cup\{(X_{t},Y_{t})\}.
       end if
if t\geq\frac{m(m+1)}{2}+t^{\prime} then
            mm+1m\leftarrow m+1.
       end if
      
end for
Algorithm 1 Learning algorithm from winning strategy

The following lemma from the work of Bousquet et al. (2021) holds:

Lemma 14 (Bousquet et al. (2021)).

For any process \{(X_{i},Y_{i})\}_{i\in\mathbb{N}}\in\text{R}(\mathcal{H}), there exists t_{0} such that for all t\geq t_{0}, Algorithm 1 no longer updates k and U, and for all j_{1},j_{2},\dots,j_{k}, the winning strategy g_{U} satisfies

gU(Xj1,,Xjk)(Yj1,,Yjk).g_{U}(X_{j_{1}},\dots,X_{j_{k}})\neq(Y_{j_{1}},\dots,Y_{j_{k}}).
Proof.

By the definition of the winning strategy, the game eventually reaches the winning condition for the player P_{L}. By the definition of P_{L}’s winning condition, there exists a k such that \mathcal{H}_{X_{1},g_{1},\dots,X_{k},g_{k}}=\emptyset, which means that for all j_{1},j_{2},\dots,j_{k}, g_{U}(X_{j_{1}},\dots,X_{j_{k}})\neq(Y_{j_{1}},\dots,Y_{j_{k}}). That finishes the proof. ∎

This lemma shows that if the concept class \mathcal{H} has no infinite VCL tree, then for a sufficiently large t_{0}, the VCL game stops advancing after round t_{0}. Once the game stops advancing, we are effectively bounding the number of mistakes made by a predictor based on a partial concept class of finite VC dimension. Since this result is interesting in its own right, we extract this portion of the algorithm into a separate subroutine, stated as Algorithm 2 in Appendix A.1, for which we prove the following result.

Lemma 15.

Let \mathbb{X} be any process and \mathcal{H} a partial concept class on \mathbb{X} with {\rm VC}(\mathcal{H})=d<\infty. The subroutine (Algorithm 2 in Appendix A.1) makes only o(T) mistakes almost surely as T\to\infty.

For brevity, we defer the proof of this lemma to the appendix. The intuition behind the proof is that, with probability at least one half, every mistake at least halves the weight, so the number of mistakes is o(T).
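To make the halving intuition concrete, here is the arithmetic in miniature (a sketch under the simplifying assumptions that the block has length m, the class has VC dimension d, and u mistakes each halve the weight, matching the treatment of u(k) in the appendix):

```latex
% Initial weight is at most m^d by a Sauer-type bound for VC dimension d;
% the final weight is at least 1 since a single shattered point remains.
2^{\,u-1}
\;\le\; 2^{\,u-1}\, w(\mathcal{H}_{L_{t-1}},X_{[t,m]})
\;\le\; w(\mathcal{H},X_{\le m})
\;\le\; m^{d}
\quad\Longrightarrow\quad
u \;\le\; d\log_{2} m + 1 .
```

Summed over the blocks of lengths 1, 2, 3, \dots, this logarithmic per-block bound is what drives the o(T) total mistake count.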

Combining the lemmas above, for a concept class \mathcal{H} with no infinite VCL tree and any realizable sequence, Algorithm 1 satisfies \limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]=0 a.s. Because the winning strategy is updated only finitely many times, the long-run number of mistakes is dominated by the number of mistakes made by the subroutine, which is o(n).

To prove the necessity, we show that for every concept class \mathcal{H} with an infinite VCL tree, there exists at least one process that does not admit universal online learning under \mathcal{H}. Formally,

Theorem 16.

For every concept class \mathcal{H} with an infinite VCL tree, there exists a process \mathbb{X} that does not admit universally consistent online learning.

We need the following definition and results from Bousquet et al. (2023) to define the sequence.

Notation 17 (Bousquet et al. (2023)).

For any 𝐮({0,1})\mathbf{u}\in(\{0,1\})^{*}, let index(𝐮)\textsf{index}(\mathbf{u})\in\mathbb{N} denote the index of 𝐮\mathbf{u} in the lexicographic ordering of ({0,1})(\{0,1\})^{*}.

Definition 18 (Bousquet et al. (2023)).

Let 𝒳\mathcal{X} be a set and {0,1}𝒳\mathcal{H}\subseteq\{0,1\}^{\mathcal{X}} be a hypothesis class, and let

T={x𝐮𝒳:𝐮({0,1})}T=\left\{x_{\mathbf{u}}\in\mathcal{X}:\mathbf{u}\in(\{0,1\})^{*}\right\}

be an infinite VCL tree that is shattered by \mathcal{H}. This implies the existence of a collection

T={h𝐮:𝐮({0,1})}\mathcal{H}_{T}=\left\{h_{\mathbf{u}}\in\mathcal{H}:\mathbf{u}\in(\{0,1\})^{*}\right\}

of consistent functions.

We say such a collection is indifferent if for every \mathbf{v},\mathbf{u},\mathbf{w}\in(\{0,1\})^{*} with \textsf{index}(\mathbf{v})<\textsf{index}(\mathbf{u}) and \mathbf{w} a descendant of \mathbf{u} in the tree T, we have h_{\mathbf{u}}(x_{\mathbf{v}})=h_{\mathbf{w}}(x_{\mathbf{v}}). In other words, the functions associated with a node that appears after \mathbf{v}, and with all of its descendants, agree on x_{\mathbf{v}}.

We say that TT is indifferent if it has a set T\mathcal{H}_{T} of consistent functions that are indifferent.

Theorem 19 (Bousquet et al. (2023)).

Let \mathcal{X} be a set and \mathcal{H}\subseteq\{0,1\}^{\mathcal{X}} be a hypothesis class, and let T be an infinite VCL tree that is shattered by \mathcal{H}. Then there exists an indifferent infinite VCL tree T^{\prime} that is shattered by \mathcal{H}.

Here is the proof sketch of Theorem 16.

Proof Sketch.

First, we modify the indifferent infinite VCL tree so that the k-th node in Breadth-First-Search (BFS) order contains exactly 2^{k-1} elements. The data process presents the elements in lexicographic order within each node and in BFS order across nodes. We then take a random walk on this tree to choose the true label for each instance. Instances in nodes visited by the random walk (in-branch nodes) are labeled by the label on the edge of the path adjacent to them. Instances in off-branch nodes are labeled by the label decided by their descendants, which is possible because the tree is indifferent. Thus, when the process reaches a node on the path, no matter what the algorithm predicts, it errs with probability \frac{1}{2}, so in expectation it errs on at least a quarter of the rounds. Then, by Fatou’s lemma, for each learning algorithm we obtain a realizable process on which the algorithm does not achieve sublinear loss almost surely. ∎

This completes the proof of Theorem 9. We now turn to the existence of an optimistically universal online learner when all processes admit universal online learning.

4.1 Optimistically Universal Online Learning Rule

In this part, we show that, when all processes admit universal online learning, the absence of an infinite Littlestone tree for \mathcal{H} is the necessary and sufficient condition for the existence of an optimistically universal online learning rule. This is formally stated as Theorem 10. The sufficiency part of Theorem 10 is proved in Bousquet et al. (2021) as the following lemma:

Lemma 20 (Bousquet et al. (2021) Theorem 3.1, the first bullet).

If \mathcal{H} does not have an infinite Littlestone tree, then there is a strategy for the learner that makes only finitely many mistakes against any adversary.

Notice that the online learning algorithm derived from the learner’s winning strategy makes only finitely many mistakes against any adversary, so for every realizable data process (\mathbb{X},\mathbb{Y}) this algorithm also makes only finitely many mistakes, which means the long-run average number of mistakes goes to 0. Thus, it is an optimistically universal online learning rule, and every concept class \mathcal{H} without an infinite Littlestone tree is optimistically universally online learnable.

The necessity is restated as the following theorem:

Theorem 21.

For any concept class \mathcal{H} with an infinite Littlestone tree and any online learning algorithm \mathcal{A}, there exists a process \mathbb{X} on which \mathcal{A} incurs an average loss greater than one half with nonzero probability.

Proof Sketch.

We take a random walk on the infinite Littlestone tree to generate the target function. Thus, no matter what the algorithm predicts, it makes a mistake with probability at least one half on each round. Fatou’s lemma then lower-bounds the expected average loss of the learning algorithm over the random process, which implies that for every algorithm there exists a process on which its average loss exceeds one half with nonzero probability. ∎
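The following toy simulation (our own, and an idealization of the construction: instances are identified with tree paths, and the random-walk labels are fair coin flips) illustrates why any fixed learner errs on about half of the rounds against a random-walk adversary:

```python
import random

def random_walk_rounds(learner, depth, rng):
    """Walk down an implicit complete binary tree: at step t the instance
    is the current path (a tuple of bits), and the true label is a fair
    coin flip that also selects the next child. Returns mistake indicators."""
    mistakes = []
    path = ()
    for _ in range(depth):
        y_hat = learner(path)        # learner sees the current tree node
        y = rng.randint(0, 1)        # random-walk label, independent of y_hat
        mistakes.append(int(y_hat != y))
        path = path + (y,)           # descend along the chosen edge
    return mistakes

rng = random.Random(0)
# Any fixed learner (here: always predict 0) errs w.p. 1/2 each round.
trials = [random_walk_rounds(lambda x: 0, 200, rng) for _ in range(200)]
avg = sum(sum(m) for m in trials) / (200 * 200)
```

The empirical mistake rate `avg` concentrates around 1/2, so the average loss cannot vanish along this process.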

5 Concept Classes with an Infinite VCL Tree

In this section, we discuss the necessary and sufficient conditions for a process to admit universal online learning under the concept class \mathcal{H} with an infinite VCL tree. Theorem 11 states the answer formally. To prove this theorem, we first prove sufficiency (Lemma 22) and then necessity (Lemma 23).

Lemma 22.

If a process 𝕏\mathbb{X} satisfies condition A, it admits universally consistent online learning under concept class \mathcal{H}.

Proof Sketch.

To prove this lemma, we run the weighted majority algorithm with non-uniform initial weights over the experts defined in condition A: expert i receives initial weight \frac{1}{i(i+1)}, where i is the index from condition A. ∎

Lemma 23.

If a process 𝕏\mathbb{X} admits universally consistent online learning under concept class \mathcal{H}, it satisfies condition A.

Proof Sketch.

In order to prove this lemma, we need to show the following statement holds:

For a given concept class \mathcal{H} and a data process \mathbb{X}, if there is a learning algorithm \mathcal{A} that is strongly universally consistent under \mathbb{X} and \mathcal{H}, then there exist a set of experts E=\{e_{1},e_{2},\dots\} and a sequence \{i_{n}\} with \log i_{n}=o(n), such that for any realizable sequence (\mathbb{X},\mathbb{Y}) and any n\in\mathbb{N}, there is an expert e_{i} with i\leq i_{n} such that Y_{t}=e_{i}(X_{t}) for every t\leq n.

We modify the construction from the work of Ben-David et al. (2009) to build the experts. Each expert is determined by a set of indices of rounds at which the algorithm \mathcal{A} makes mistakes, so there is a one-to-one map from sets of mistake indices to experts. We can then index the experts by these sets to establish the existence of such a sequence. ∎
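A hypothetical sketch of this construction (our own simplification; the actual construction indexes experts by finite sets of mistake rounds and is more careful about labels): each expert simulates \mathcal{A}, flips its prediction exactly at a chosen finite set of rounds, and feeds its own outputs back to the simulation as labels.

```python
def make_expert(algorithm, flip_rounds):
    """Build one expert from a finite set of (presumed) mistake rounds.
    `algorithm(history, x)` returns a {0,1} prediction given the list of
    (x, y) pairs seen so far. The expert flips that prediction at the
    rounds in `flip_rounds` and treats its outputs as the true labels."""
    def expert(xs):
        history, preds = [], []
        for t, x in enumerate(xs, start=1):
            y_hat = algorithm(history, x)
            y = 1 - y_hat if t in flip_rounds else y_hat
            history.append((x, y))   # feed the expert's label back in
            preds.append(y)
        return preds
    return expert
```

Distinct finite sets of flip rounds give distinct experts, which is the one-to-one indexing used in the proof.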

This yields the theorem on optimistically universal online learnability, Theorem 12. Because the proofs of Lemmas 22 and 23 work for any process, we can prove Theorem 12 by reusing the proof of Theorem 11.

6 Agnostic Case

In this section, we extend the online learning algorithm for the realizable case to the agnostic case. The basic idea follows Ben-David et al. (2009): we build an expert for each realizable process (\mathbb{X},\mathbb{Y}), then run a learning-with-expert-advice algorithm on those experts to obtain a low-regret learning algorithm.

Theorem 24.

The following two statements are equivalent:

  • There is an online learning rule that is strongly universally consistent under 𝕏\mathbb{X} and \mathcal{H} for the realizable case.

  • There is an online learning rule that is strongly universally consistent under 𝕏\mathbb{X} and \mathcal{H} for the agnostic case.

Proof Sketch.

To prove this theorem, we first build the experts e_{i} from the realizable-case learning algorithm using the construction in Lemma 23. We then run the learning-with-expert-advice algorithm Squint from Koolen and van Erven (2015), with non-uniform initial weight \frac{1}{i(i+1)} on each e_{i}, to obtain sublinear regret. Thus, we can extend any learning algorithm for the realizable case to one for the agnostic case, regardless of how that algorithm operates.

Conversely, an online learning algorithm for the agnostic case is also an online learning algorithm for the realizable case: taking \mathbb{Y}^{*}=\mathbb{Y}, the regret becomes the number of mistakes. Thus, the two statements are equivalent. ∎
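The paper's reduction uses Squint; as a simpler illustrative stand-in (our own sketch, substituting exponential weights for Squint, with an arbitrary learning rate eta), here is the aggregation step over finitely many experts with the same non-uniform prior:

```python
import math
import random

def exp_weights_over_experts(experts, stream, eta=0.5, seed=0):
    """Stand-in for the Squint-based reduction: exponential weights with
    prior w_i = 1/(i*(i+1)) over the realizable-case experts.
    Predicts by sampling an expert proportionally to its current weight.
    Returns the total 0-1 loss on `stream` of (x, y) pairs."""
    rng = random.Random(seed)
    w = [1.0 / (i * (i + 1)) for i in range(1, len(experts) + 1)]
    total_loss = 0
    for x, y in stream:
        preds = [e(x) for e in experts]
        r = rng.random() * sum(w)
        y_hat, acc = preds[-1], 0.0
        for wi, p in zip(w, preds):   # sample an expert from the weights
            acc += wi
            if acc >= r:
                y_hat = p
                break
        total_loss += int(y_hat != y)
        # Exponential-weights update on each expert's 0-1 loss.
        w = [wi * math.exp(-eta * int(p != y)) for wi, p in zip(w, preds)]
    return total_loss
```

The prior 1/(i(i+1)) charges only O(log i) extra regret against expert i, mirroring the role of the non-uniform weights in the Squint-based proof; Squint additionally adapts to the variance of the losses.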

Theorem 24 implies that all the characterizations for the realizable case are also characterizations for the agnostic case. We formally state the following theorems:

Theorem 25.

For the agnostic case and any concept class \mathcal{H} with no infinite VCL tree, every process \mathbb{X} admits strongly universal online learning under \mathcal{H}. However, only concept classes with no infinite Littlestone tree are optimistically universally online learnable.

For concept classes \mathcal{H} with an infinite VCL tree:

Theorem 26.

For the agnostic case, a data process \mathbb{X} admits strongly universal online learning under a concept class \mathcal{H} with an infinite VCL tree if and only if it satisfies condition A. However, a concept class \mathcal{H} with an infinite VCL tree is optimistically universally online learnable if and only if it satisfies condition B.

References

  • Adams and Nobel (2010) T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik–Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4):1345 – 1367, 2010. doi: 10.1214/09-AOP511. URL https://doi.org/10.1214/09-AOP511.
  • Alon et al. (2021) N. Alon, S. Hanneke, R. Holzman, and S. Moran. A theory of PAC learnability of partial concept classes. In Proceedings of the 62nd62^{\mathrm{nd}} Annual Symposium on Foundations of Computer Science, 2021.
  • Angluin (1988) D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
  • Ben-David et al. (2009) S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In Annual Conference Computational Learning Theory, 2009.
  • Blanchard (2022) M. Blanchard. Universal online learning: an optimistically universal learning rule. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1077–1125. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/blanchard22b.html.
  • Blanchard and Cosson (2022) M. Blanchard and R. Cosson. Universal online learning with bounded loss: Reduction to binary classification. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 479–495. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/blanchard22a.html.
  • Blanchard and Jaillet (2023) M. Blanchard and P. Jaillet. Universal regression with adversarial responses. The Annals of Statistics, 51(3):1401 – 1426, 2023. doi: 10.1214/23-AOS2299. URL https://doi.org/10.1214/23-AOS2299.
  • Blanchard et al. (2022a) M. Blanchard, R. Cosson, and S. Hanneke. Universal online learning with unbounded losses: Memory is all you need. In Sanjoy Dasgupta and Nika Haghtalab, editors, Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pages 107–127. PMLR, 29 Mar–01 Apr 2022a. URL https://proceedings.mlr.press/v167/blanchard22a.html.
  • Blanchard et al. (2022b) M. Blanchard, S. Hanneke, and P. Jaillet. Contextual bandits and optimistically universal learning, 2022b.
  • Blanchard et al. (2023) M. Blanchard, S. Hanneke, and P. Jaillet. Adversarial rewards in universal learning for contextual bandits, 2023.
  • Bousquet et al. (2021) O. Bousquet, S. Hanneke, S. Moran, R. van Handel, and A. Yehudayoff. A theory of universal learning. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, page 532–541, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380539. doi: 10.1145/3406325.3451087. URL https://doi.org/10.1145/3406325.3451087.
  • Bousquet et al. (2023) O. Bousquet, S. Hanneke, S. Moran, J. Shafer, and I. Tolstikhin. Fine-grained distribution-dependent learning curves. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5890–5924. PMLR, 12–15 Jul 2023.
  • Devroye et al. (1996) L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.
  • Gyorfi et al. (1999) L. Gyorfi, G. Lugosi, and G. Morvai. A simple randomized algorithm for sequential prediction of ergodic time series. IEEE Transactions on Information Theory, 45(7):2642–2650, 1999. doi: 10.1109/18.796420.
  • Hanneke (2021) S. Hanneke. Learning whenever learning is possible: Universal learning under general stochastic processes. Journal of Machine Learning Research, 22(130), 2021.
  • Hanneke et al. (2021) S. Hanneke, A. Kontorovich, S. Sabato, and R. Weiss. Universal Bayes consistency in metric spaces. The Annals of Statistics, 49(4):2129 – 2150, 2021. doi: 10.1214/20-AOS2029. URL https://doi.org/10.1214/20-AOS2029.
  • Haussler et al. (1994) D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}\{0,1\}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.
  • Karandikar and Vidyasagar (2002) R. L. Karandikar and M. Vidyasagar. Rates of uniform convergence of empirical means with mixing processes. Statistics & Probability Letters, 58(3):297–307, 2002. ISSN 0167-7152. doi: https://doi.org/10.1016/S0167-7152(02)00124-4. URL https://www.sciencedirect.com/science/article/pii/S0167715202001244.
  • Koolen and van Erven (2015) W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, volume 40 of JMLR Workshop and Conference Proceedings, pages 1155–1175. JMLR.org, 2015. URL http://proceedings.mlr.press/v40/Koolen15a.html.
  • Littlestone (1988) N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
  • Littlestone and Warmuth (1986) N. Littlestone and M. Warmuth. Relating data compression and learnability. Unpublished manuscript, 1986.
  • Littlestone and Warmuth (1994) N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994. ISSN 0890-5401. doi: https://doi.org/10.1006/inco.1994.1009. URL https://www.sciencedirect.com/science/article/pii/S0890540184710091.
  • Morvai et al. (1996) G. Morvai, S. Yakowitz, and L. Gyorfi. Nonparametric inference for ergodic, stationary time series. The Annals of Statistics, 24(1):370–379, 1996. ISSN 00905364. URL http://www.jstor.org/stable/2242624.
  • Morvai et al. (1999) G. Morvai, S. R. Kulkarni, and A. B. Nobel. Regression estimation from an individual stable sequence. Statistics: A Journal of Theoretical and Applied Statistics, 33(2):99–118, 1999.
  • Ryabko (2006) D. Ryabko. Pattern recognition for conditionally independent data. Journal of Machine Learning Research, 7(23):645–664, 2006. URL http://jmlr.org/papers/v7/ryabko06a.html.
  • Steinwart et al. (2009) I. Steinwart, D. Hush, and C. Scovel. Learning from dependent observations. Journal of Multivariate Analysis, 100(1):175–194, 2009.
  • Stone (1977) C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, pages 595–620, 1977.
  • Valiant (1984) L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.
  • Vapnik and Chervonenkis (1971) V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
  • Vapnik and Chervonenkis (1974) V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.
  • Yu (1994) B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.

Appendix A Omitted Proofs

A.1 Proof of Lemma 15

For the reader’s convenience, we restate the subroutine here.

LL\leftarrow\emptyset.
m1m\leftarrow 1.
for t=1,2,3,t=1,2,3,\dots do
       Predict
\hat{Y}_{t}=\mathop{\rm argmax}_{y}\Pr\left[\left.w(\mathcal{H}^{g_{U}}_{L\cup\{(X_{t},1-y)\}},X_{[t,t(m)]})\leq\frac{1}{2}w(\mathcal{H}^{g_{U}}_{L},X_{[t,t(m)]})\right|X_{\leq t}\right]
if YtY^tY_{t}\neq\hat{Y}_{t} then
            LL{(Xt,Yt)}L\leftarrow L\cup\{(X_{t},Y_{t})\}.
       end if
if t\geq\frac{m(m+1)}{2} then
            mm+1m\leftarrow m+1.
       end if
      
end for
Algorithm 2 Subroutine for learning a partial concept class \mathcal{H} with VC dimension dd on the data process 𝕏\mathbb{X}.
Proof.

In this proof, we assume that \{(X_{i},Y_{i})\}_{i\in\mathbb{N}} is realizable with respect to the partial concept class \mathcal{H} with VC dimension at most d. Recall the weight function w(\mathcal{H}^{\prime},X_{\leq T})=|\{S:S\subseteq\{X_{i}\}_{i\leq T}\text{ such that }S\text{ is shattered by }\mathcal{H}^{\prime}\}|, the number of subsequences of the sequence X_{\leq T} that are shattered by the partial concept class \mathcal{H}^{\prime}.

Consider the kk-th batch, consisting of Wk={Xk(k1)2+1,,Xk(k+1)2}W_{k}=\{X_{\frac{k(k-1)}{2}+1},\cdots,X_{\frac{k(k+1)}{2}}\}. Let

Zk=t=k(k1)2+1k(k+1)2𝕀[Y^tYt],Z_{k}=\sum_{t=\frac{k(k-1)}{2}+1}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right],

and

Vk=Zk𝔼[Zk|Xk(k1)2].V_{k}=Z_{k}-\mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right].

Notice that

𝔼[Vk|Xk(k1)2]\displaystyle\mathbb{E}\left[V_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]
=𝔼[Zk𝔼[Zk|Xk(k1)2]|Xk(k1)2]\displaystyle=\mathbb{E}\left[Z_{k}-\mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]
\displaystyle=\mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]-\mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]=0\quad(\text{a.s.})

Thus, the sequence VkV_{k} is a martingale difference sequence with respect to the block sequence, W1,W2,W_{1},W_{2},\cdots. By the definition of VkV_{k}, we also have kVkk-k\leq V_{k}\leq k. Then by Azuma’s Inequality, with probability 11K21-\frac{1}{K^{2}}, we have

k=1KZk\displaystyle\sum_{k=1}^{K}Z_{k} k=1K𝔼[Zk|Xk(k1)2]+log(1K2)2(k=1Kk2)\displaystyle\leq\sum_{k=1}^{K}\mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]+\sqrt{-\log\left(\frac{1}{K^{2}}\right)\cdot 2\cdot\left(\sum_{k=1}^{K}k^{2}\right)}
k=1K𝔼[Zk|Xk(k1)2]+4K3logK.\displaystyle\leq\sum_{k=1}^{K}\mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]+\sqrt{4K^{3}\log K}.
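For completeness, the concentration step instantiates the Azuma–Hoeffding inequality for martingale difference sequences with |V_{k}|\leq k (a standard statement, reproduced here for the reader):

```latex
\Pr\left[\sum_{k=1}^{K}V_{k}\ge\varepsilon\right]
\le\exp\left(-\frac{\varepsilon^{2}}{2\sum_{k=1}^{K}k^{2}}\right),
\qquad
\varepsilon=\sqrt{2\log(K^{2})\sum_{k=1}^{K}k^{2}}
\;\Longrightarrow\;
\Pr\le\frac{1}{K^{2}},
```

and \varepsilon\le\sqrt{4K^{3}\log K} because \sum_{k=1}^{K}k^{2}\le K^{3}, which is exactly the deviation term in the display above.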

Then we need to get an upper bound for 𝔼[Zk|Xk(k1)2]\mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]. According to the prediction rule, every time we make a mistake, we have

Pr[w(L(Xt,Yt)gU,X[t,k(k+1)2])12w(LgU,X[t,k(k+1)2])|Xt]12.\Pr\left[\left.w(\mathcal{H}^{g_{U}}_{L\cup{(X_{t},Y_{t})}},X_{[t,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}^{g_{U}}_{L},X_{[t,\frac{k(k+1)}{2}]})\right|X_{\leq t}\right]\geq\frac{1}{2}. (3)

Due to the linearity of expectation, for every kk,

𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]|Xk(k1)2]\displaystyle\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]
=𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]𝕀[w(Lt,X[t+1,k(k+1)2])12w(Lt1,X[t,k(k+1)2])]|Xk(k1)2]\displaystyle=\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]
+𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]𝕀[w(Lt,X[t+1,k(k+1)2])>12w(Lt1,X[t,k(k+1)2])]|Xk(k1)2].\displaystyle+\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq\frac{k(k-1)}{2}}\right].

Here L_{t}=\{(X_{i},Y_{i}):i\leq t\text{ and the algorithm makes a mistake at round }i\}.

Notice that the first part is the expected number of mistakes each of which at least halves the weight. For every realization x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]} of X_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]}, let

u(k)=i=k(k1)2k(k+1)2𝕀[Y^tYt]𝕀[w(Lt,x[t+1,k(k+1)2])12w(Lt1,x[t,k(k+1)2])].u(k)=\sum_{i=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},x_{[t+1,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}_{L_{t-1}},x_{[t,\frac{k(k+1)}{2}]})\right].

By the definition of the weight function and the fact that {\rm VC}(\mathcal{H})=d, we have w(\mathcal{H}_{L_{\frac{k(k-1)}{2}}},x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]})\leq k^{d}. Consider the last round t\leq\frac{k(k+1)}{2} at which \hat{Y}_{t}\neq Y_{t}; we have w(\mathcal{H}_{L_{t-1}},x_{[t,\frac{k(k+1)}{2}]})\geq 1, as the set \{x_{t}\} must be shattered. Thus, we have 2^{u(k)-1}w(\mathcal{H}_{L_{t-1}},x_{[t,\frac{k(k+1)}{2}]})\leq w(\mathcal{H},x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]}). Therefore, u(k)\leq d\log k+1 for every realization x_{[\frac{k(k-1)}{2},\frac{k(k+1)}{2}]}. Thus,

𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]𝕀[w(Lt,X[t+1,k(k+1)2])12w(Lt1,X[t,k(k+1)2])]|Xk(k1)2]2dlogk.\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})\leq\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]\leq 2d\log k. (4)

Then consider the second part, we have

𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]𝕀[w(Lt,X[t+1,k(k+1)2])>12w(Lt1,X[t,k(k+1)2])]|Xk(k1)2]\displaystyle\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]
=𝔼[𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]𝕀[w(Lt,X[t+1,k(k+1)2])>12w(Lt1,X[t,k(k+1)2])]|Xt]|Xk(k1)2]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\left.\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]
=𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]𝔼[𝕀[w(Lt,X[t+1,k(k+1)2])>12w(Lt1,X[t,k(k+1)2])]|Xt]|Xk(k1)2]\displaystyle=\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{E}\left[\left.\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]

This is because \hat{Y}_{t} and Y_{t} depend only on X_{\leq t}. By inequality 3, we have

𝕀[Y^tYt]𝔼[𝕀[w(Lt,X[t+1,k(k+1)2])>12w(Lt1,X[t,k(k+1)2])]|Xt]\displaystyle\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{E}\left[\left.\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq t}\right]
=𝕀[Y^tYt]Pr[w(Lt,X[t+1,k(k+1)2])>12w(Lt1,X[t,k(k+1)2])|Xt]\displaystyle=\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\Pr\left[\left.w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right|X_{\leq t}\right]
12𝕀[Y^tYt].\displaystyle\leq\frac{1}{2}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right].

Thus,

𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]𝕀[w(Lt,X[t+1,k(k+1)2])>12w(Lt1,X[t,k(k+1)2])]|Xk(k1)2]\displaystyle\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\mathbb{I}\left[w(\mathcal{H}_{L_{t}},X_{[t+1,\frac{k(k+1)}{2}]})>\frac{1}{2}w(\mathcal{H}_{L_{t-1}},X_{[t,\frac{k(k+1)}{2}]})\right]\right|X_{\leq\frac{k(k-1)}{2}}\right] (5)
12𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]|Xk(k1)2]\displaystyle\leq\frac{1}{2}\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]

Combining these two inequalities (4 and 5), we have

𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]|Xk(k1)2]2dlogk+12𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]|Xk(k1)2].\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]\leq 2d\log k+\frac{1}{2}\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]. (6)

Thus, for any kk, we have

𝔼[t=k(k1)2k(k+1)2𝕀[Y^tYt]|Xk(k1)2]4dlogk.\mathbb{E}\left[\left.\sum_{t=\frac{k(k-1)}{2}}^{\frac{k(k+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\right|X_{\leq\frac{k(k-1)}{2}}\right]\leq 4d\log k. (7)

According to inequality 7, for every k, \mathbb{E}\left[Z_{k}\left|X_{\leq\frac{k(k-1)}{2}}\right.\right]\leq 4d\log k. Thus, with probability at least 1-\frac{1}{K^{2}},

k=1KZkk=1K4dlogk+4K3logK4dKlogK+4K3logK.\sum_{k=1}^{K}Z_{k}\leq\sum_{k=1}^{K}4d\log k+\sqrt{4K^{3}\log K}\leq 4dK\log K+\sqrt{4K^{3}\log K}.

By the definition of ZkZ_{k}, with probability at least 11K21-\frac{1}{K^{2}},

t=1K(K+1)2𝕀[Y^tYt]4dKlogK+4K3logK(4d+2)K3logK.\sum_{t=1}^{\frac{K(K+1)}{2}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq 4dK\log K+\sqrt{4K^{3}\log K}\leq(4d+2)\sqrt{K^{3}\log K}. (8)

Let T_{K}=\frac{K(K+1)}{2} be the number of instances in the sequence. Since K^{2}\leq 2T_{K}, with probability at least 1-\frac{1}{K^{2}},

t=1TK𝕀[Y^tYt](4d+2)TK3412logTK.\sum_{t=1}^{T_{K}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq(4d+2)T_{K}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K}}. (9)

Define the event E_{K} as the event that, in the sequence X_{\leq T_{K}}, \sum_{t=1}^{T_{K}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]>(4d+2)T_{K}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K}}. Then \Pr[E_{K}]\leq\frac{1}{K^{2}}. Since \sum_{k=1}^{\infty}\frac{1}{k^{2}}=\frac{\pi^{2}}{6}<\infty, the Borel–Cantelli lemma implies that, with probability 1, only finitely many of the events E_{K} occur; that is, for all sufficiently large K, \sum_{t=1}^{T_{K}}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq(4d+2)T_{K}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K}}.

Then, for any large enough T, choosing K such that T_{K}\leq T\leq T_{K+1}\leq 2T, we have, with probability 1,

t=1T𝕀[Y^tYt]\displaystyle\sum_{t=1}^{T}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right] (4d+2)TK+13412logTK+1\displaystyle\leq(4d+2)T_{K+1}^{\frac{3}{4}}\sqrt{\frac{1}{2}\log T_{K+1}} (10)
(4d+2)(2T)3412log2T.\displaystyle\leq(4d+2)(2T)^{\frac{3}{4}}\sqrt{\frac{1}{2}\log 2T}. (11)

Therefore, for any large enough TT and a universal constant cc, with probability 11,

t=1T𝕀[Y^tYt]cT34logT.\sum_{t=1}^{T}\mathbb{I}\left[\hat{Y}_{t}\neq Y_{t}\right]\leq cT^{\frac{3}{4}}\sqrt{\log T}. (12)

Notice that cT^{\frac{3}{4}}\sqrt{\log T} is o(T). That finishes the proof. ∎

A.2 Proof of Theorem 16

Proof.

In this proof, we first modify the chosen indifferent VCL tree so that the number of elements per node grows exponentially: the k-th node in Breadth-First-Search (BFS) order contains 2^{k-1} elements. We achieve this by recursively modifying the chosen VCL tree: starting from the root, for each node that does not satisfy the requirement, we promote one of its descendants containing sufficiently many elements to replace that node.

Then we define the data process as follows: for the modified VCL tree, we define the sequence \{X_{i}\}_{i\in\mathbb{N}} by X_{2^{k-1}+j}=x_{j,k}, where x_{j,k} is the j-th element of the k-th node in BFS order.
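As a sanity check of this indexing (our own sketch; the 0-indexing of j within a node is our assumption), every position t \geq 1 decomposes uniquely as t = 2^{k-1} + j with 0 \leq j < 2^{k-1}:

```python
def node_and_offset(t):
    """Decode position t >= 1 of the process into (k, j): t = 2**(k-1) + j,
    so X_t is the j-th element (0-indexed, our assumption) of the k-th node
    in BFS order. Node k holds 2**(k-1) elements, occupying positions
    2**(k-1) through 2**k - 1, and is preceded by 2**(k-1) - 1 elements."""
    k = t.bit_length()        # equals floor(log2(t)) + 1
    j = t - 2 ** (k - 1)
    return k, j
```

This confirms the count used in the proof: the elements preceding node k number 2^{k-1}-1, roughly matching the 2^{k-1} elements inside node k.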

Next, we define the target function; in other words, we choose the label Y_{t} for each X_{t}. First, we take a random walk in the modified VCL tree. For the elements in the nodes on the visited path (in-branch nodes), we let Y_{t} be the label given by the edge of the path adjacent to that node. We then need to decide the labels of the elements in nodes not on the visited path (off-branch nodes). For any off-branch node, we can pick an in-branch node after it in BFS order; since the tree is indifferent, all descendants of that in-branch node agree on the labels of the elements in the off-branch node, and we use these agreed labels. Hence every element in a node visited by the random walk is mispredicted with probability \frac{1}{2} when the algorithm sees it for the first time. Moreover, the number of elements that come before the k-th node in BFS order, \sum_{i=0}^{k-2}2^{i}=2^{k-1}-1, is roughly the same as the number of elements in the k-th node, 2^{k-1}. Thus, at the d-th layer of the modified tree, if the random walk reaches the node K_{d}, for that node we have

\mathbb{E}\left[\left.\frac{1}{n_{K_{d}}}\sum_{t=1}^{n_{K_{d}}}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right|K_{d}\right]\geq\frac{1}{4}. (13)

Here $n_{K_{d}}$ is the number of elements in the process up to the point where we reach node $K_d$. This inequality holds for all $d$.

Notice that by taking the expectation of the expected number of mistakes over the deterministic sequences, we obtain the expected number of mistakes for the random process. We then pass to the subsequence containing only $n_{K_d}=2^{K_d}-1$ elements, which can only decrease the ratio of mistakes (the third line in the computation below): mistakes can only be forced on elements of the $K_d$-th node, and any other $n$ yields a smaller ratio of mistakes than $n_{K_d}$. Finally, since the ratio of mistakes is always at most $1$, we can apply the reverse Fatou lemma together with Inequality (13) to obtain the final result (the fourth and last lines below).

\mathbb{E}\left[\mathbb{E}\left[\left.\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right|(\mathbb{X},\mathbb{Y})\right]\right]
=\mathbb{E}\left[\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right]
\geq\mathbb{E}\left[\limsup_{d\rightarrow\infty}\frac{1}{n_{K_{d}}}\sum_{t=1}^{n_{K_{d}}}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right]
\geq\mathbb{E}\left[\limsup_{d\rightarrow\infty}\mathbb{E}\left[\left.\frac{1}{n_{K_{d}}}\sum_{t=1}^{n_{K_{d}}}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right|n_{K_{d}}\right]\right]
\geq\frac{1}{4}.

Thus, there exists a deterministic sequence $(\mathbb{X},\mathbb{Y})$ on which the algorithm does not achieve a sublinear expected number of mistakes. ∎

A.3 Proof of Theorem 21

Proof.

Since $\mathcal{H}$ has an infinite Littlestone tree, we can take a random walk on this tree: at every step $t$, take the instance at the current node as $X_t$, and, no matter what the learning algorithm predicts, choose $Y_t$ uniformly at random. Thus, we have $\mathbb{E}\left[\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right]\geq\frac{1}{2}$ for every $t$. We get

\limsup_{n\rightarrow\infty}\mathbb{E}_{(\mathbb{X},\mathbb{Y})}\left[\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right]\geq\frac{1}{2}. (14)

By the reverse Fatou lemma (note that the ratio of mistakes is at most $1$), we have the following inequality,

\mathbb{E}\left[\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right]\geq\frac{1}{2}. (15)

Thus, for each learning algorithm there exists a data sequence $\{(X_t,Y_t)\}_{t\in\mathbb{N}}$ such that Inequality (15) holds, which finishes the proof. ∎
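The heart of the argument, that no prediction rule can beat a fair coin, can be illustrated numerically (our sketch; the simulation and its function names are not part of the proof):

```python
import random

def average_mistakes(predict, n, seed=0):
    """Average error of an arbitrary online predictor against labels
    drawn uniformly at random.  Since Y_t is independent of the
    prediction, each round is a mistake with probability exactly 1/2."""
    rng = random.Random(seed)
    history, mistakes = [], 0
    for _ in range(n):
        y_hat = predict(history)      # any rule based on past labels
        y = rng.randint(0, 1)         # adversary: a fair coin
        mistakes += (y_hat != y)
        history.append(y)
    return mistakes / n

# even a rule that tracks the majority of past labels hovers around 1/2
rate = average_mistakes(lambda h: int(sum(h) > len(h) / 2), 10_000)
```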

A.4 Proof of Lemma 22

For completeness, we state the weighted majority algorithm here:

For expert $e_i$, assign $w_i^0=\frac{1}{i(i+1)}$ as its initial weight.
for t = 1,2,… do
      Predict $y_t=\mathbb{I}\left[\sum_i\frac{w_i^{t-1}}{\sum_j w_j^{t-1}}e_i(X_t)\geq\frac{1}{2}\right]$.
      Update $w_i^t=\left(\frac{1}{2}\right)^{\mathbb{I}\left[e_i(X_t)\neq Y_t\right]}w_i^{t-1}$.
end for
Algorithm 3 The Weighted Majority Algorithm with Non-uniform Initial Weights
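A runnable rendering of Algorithm 3 for a finite pool of experts (our sketch; the paper's version runs over a countable pool, and the function name is ours):

```python
def weighted_majority(expert_preds, labels):
    """Weighted majority vote with initial weights w_i = 1/(i(i+1)).
    expert_preds[i][t] is expert (i+1)'s 0/1 prediction at round t.
    Returns the number of mistakes made by the aggregated predictor."""
    w = [1.0 / (i * (i + 1)) for i in range(1, len(expert_preds) + 1)]
    mistakes = 0
    for t, y in enumerate(labels):
        total = sum(w)
        vote_for_1 = sum(wi for wi, e in zip(w, expert_preds) if e[t] == 1)
        y_hat = 1 if vote_for_1 / total >= 0.5 else 0
        mistakes += (y_hat != y)
        # halve the weight of every expert that erred this round
        w = [wi * (0.5 if e[t] != y else 1.0) for wi, e in zip(w, expert_preds)]
    return mistakes
```

Whenever the aggregate errs, experts holding at least half of the total weight are halved, so the total weight shrinks by a factor of $\frac{3}{4}$, which is exactly the step used in the proof of Lemma 22 below.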

Using this algorithm, we provide the proof of Lemma 22.

Proof.

To prove this lemma, we use the weighted majority algorithm (Algorithm 3) with initial weight $w_i^0=\frac{1}{i(i+1)}$ for each expert $i$. Let $\text{MB}=\sum_{t=1}^n\mathbb{I}\left[h_{t-1}(X_t)\neq Y_t\right]$ be the number of mistakes the algorithm makes during $n$ rounds, and let $m_i=\sum_{t=1}^n\mathbb{I}\left[e_i(X_t)\neq Y_t\right]$. Next, consider the total weight of all experts after $n$ rounds, $W^n$. If the algorithm makes a mistake at round $t$, then experts holding at least half of the total weight err at that round, so $W^{t-1}-W^t\geq\frac{1}{2}\cdot\frac{1}{2}W^{t-1}$, which means $W^t\leq\frac{3}{4}W^{t-1}$. Thus, we have $W^n\leq\left(\frac{3}{4}\right)^{\text{MB}}W^0\leq\left(\frac{3}{4}\right)^{\text{MB}}$. Since $W^n\geq w_i^n$ for every $i$, this holds in particular for $i=\mathop{\rm argmin}_{i\leq i_n}\sum_{t=1}^n\mathbb{I}\left[e_i(X_t)\neq Y_t\right]$. We obtain the following inequality for all $n$,

\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\leq 3\min_{i\leq i_{n}}\sum_{t=1}^{n}\mathbb{I}\left[e_{i}(X_{t})\neq Y_{t}\right]+\log\left(\frac{1}{w_{i}^{0}}\right) (16)
\leq 3\min_{i\leq i_{n}}\sum_{t=1}^{n}\mathbb{I}\left[e_{i}(X_{t})\neq Y_{t}\right]+2\log i_{n}. (17)

Here the factor $3$ comes from the fact that $\frac{\log 2}{\log(4/3)}\leq 3$. Therefore, for a fixed process $\mathbb{X}$, target function $h^*$, and fixed sequence $\{i_n\}$, we have

\mathbb{E}\left[\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right]\leq\mathbb{E}\left[\limsup_{n\rightarrow\infty}\frac{1}{n}\left(3\min_{i\leq i_{n}}\sum_{t=1}^{n}\mathbb{I}\left[e_{i}(X_{t})\neq Y_{t}\right]+2\log i_{n}\right)\right]. (18)

Then, by Condition A, the right-hand side of the inequality equals $0$.

Notice that $\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]$ is a non-negative random variable, so we have

\mathbb{E}\left[\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]\right]=0. (19)

Therefore, $\limsup_{n\rightarrow\infty}\frac{1}{n}\sum_{t=1}^{n}\mathbb{I}\left[h_{t-1}(X_{t})\neq Y_{t}\right]=0$ almost surely. ∎

A.5 Proof of Lemma 23

Input: set $J$
for t = 1,2,… do
      receive $X_t$.
      compute $\tilde{y}_t=f_t^{\mathcal{A}}(X_{<t},\hat{Y}_{<t},X_t)$, where $\hat{Y}_{<t}=\{\hat{y}_i\}_{i<t}$.
      if $t\in J$ then
            predict $\hat{y}_t=\neg\tilde{y}_t$
      else
            predict $\hat{y}_t=\tilde{y}_t$
      end if
end for
Algorithm 4 Expert $J$
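In Python, Algorithm 4 can be sketched as follows (our rendering; `base_predict` stands for the black-box learning rule $f_t^{\mathcal{A}}$):

```python
def expert_J(J, base_predict, xs):
    """Expert J: run the base online learner, negating its prediction at
    every round t in J.  The expert feeds its *own* predictions back to
    the base learner as labels (the Y-hat history in the paper), so it
    never needs the true labels."""
    history_x, history_y, preds = [], [], []
    for t, x in enumerate(xs, start=1):
        y_tilde = base_predict(history_x, history_y, x)
        y_hat = 1 - y_tilde if t in J else y_tilde
        preds.append(y_hat)
        history_x.append(x)
        history_y.append(y_hat)   # feed the flipped prediction back
    return preds
```

For example, with a base learner that always predicts $0$, the expert with $J=\{1,4,7\}$ outputs $1$ exactly at rounds $1$, $4$, and $7$.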
Proof.

In this proof, we show how to define a sequence of experts $\{e_1,e_2,\dots\}$ satisfying Condition A, provided $\mathbb{X}$ admits universal online learning. Let $f_t^{\mathcal{A}}$ be an online learning rule that learns every realizable label process $\mathbb{Y}\in\text{R}(\mathcal{H},\mathbb{X})$. We build the experts via Algorithm 4 and represent each expert by its set $J$ of round indices at which it negates the algorithm's prediction. For example, the expert with $J=\{1,4,7\}$ flips the prediction at rounds $1$, $4$, and $7$. This gives a one-to-one map from sets $J$ to experts.

First, we show that for every realizable label process $\mathbb{Y}\in\text{R}(\mathcal{H},\mathbb{X})$ and every $n\in\mathbb{N}$, there is an expert $e_i$ such that $e_i(X_t)=Y_t$ for all $t\leq n$. This part of the proof is similar to the proof of Lemma 12 in the work of Ben-David et al. (2009). Consider the prefix $Y_{\leq n}$ of a realizable label process $\mathbb{Y}$, and take the expert whose set $J$ contains exactly those $j\leq n$ with $f_j^{\mathcal{A}}(X_{<j},\hat{Y}_{<j},X_j)\neq Y_j$. By induction, the history $(X_{<t},\hat{Y}_{<t})$ coincides with $(X_{<t},Y_{<t})$ for all $t\leq n$, which implies $e_i(X_t)=\hat{Y}_t=Y_t$ for all $t\leq n$. Therefore, for any $n\in\mathbb{N}$ there is a set $J_{i_n}$ containing exactly those $j\leq n$ with $f_j^{\mathcal{A}}(X_{<j},\hat{Y}_{<j},X_j)\neq Y_j$, and Algorithm 4 with input $J_{i_n}$ yields an expert $e_{i_n}$ such that $e_{i_n}(X_t)=Y_t$ for all $t\leq n$.

Next, we need to index the sets $J$ so that $\log i_n=o(n)$. The indexing is as follows: order all finite $J\subseteq\mathbb{N}$ by the value $|J|\cdot\max J$, breaking ties by $|J|$. Here $|J|$ is the number of elements in $J$ and $\max J$ is its maximal element. Then index the $J$'s from $1$ following this order.
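The ordering and the resulting index can be checked on small examples (our sketch; the enumeration helper and the lexicographic tie-break beyond $|J|$ are our choices, which the proof leaves open):

```python
from itertools import combinations
from math import exp, sqrt

def subsets_in_index_order(max_element):
    """All nonempty subsets J of {1, ..., max_element}, sorted by
    |J| * max(J), then by |J|, then lexicographically."""
    subsets = []
    for m in range(1, max_element + 1):              # m = max(J)
        for size in range(1, m + 1):
            for rest in combinations(range(1, m), size - 1):
                subsets.append(tuple(sorted(rest + (m,))))
    subsets.sort(key=lambda J: (len(J) * max(J), len(J), J))
    return subsets

order = subsets_in_index_order(12)
# sanity check of the counting bound used in this proof: the number of
# sets J with |J| * max(J) <= k is at most (k + 1) * e^sqrt(k)
for k in range(1, 13):
    count = sum(1 for J in order if len(J) * max(J) <= k)
    assert count <= (k + 1) * exp(sqrt(k))
```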

Finally, we show that the construction above yields a set of experts satisfying Condition A: we have a set of experts $E=\{e_1,e_2,\dots\}$ and a sequence $\{i_n\}$ with $\log i_n=o(n)$ such that, for any realizable sequence $(\mathbb{X},\mathbb{Y})$ and any $n\in\mathbb{N}$, there is an expert $e_i$ with $i\leq i_n$ satisfying $Y_t=e_i(X_t)$ for every $t\leq n$. It remains to bound $i_n$. Writing $k=|J_{i_n}|\cdot\max J_{i_n}$, we have

i_{n}\leq|\{J:|J|\max J\leq k\}|=1+\sum_{m=1}^{k}|\{J:|J|\leq\frac{k}{m},\max J=m\}|
=1+\sum_{m=1}^{\sqrt{k}}2^{m-1}+\sum_{m=\sqrt{k}}^{k}\binom{m-1}{\leq(\frac{k}{m}-1)}\leq 2^{\sqrt{k}}+\sum_{m=\sqrt{k}}^{k}\left(\frac{em^{2}}{k}\right)^{\frac{k}{m}}\leq(k+1)e^{\sqrt{k}}.

Notice that $k=|J_{i_n}|\cdot\max J_{i_n}\leq|J_{i_n}|\,n$, and $|J_{i_n}|=o(n)$ since $f_t^{\mathcal{A}}$ makes sublinearly many mistakes on realizable processes. We thus have

\lim_{n\rightarrow\infty}\frac{1}{n}\log i_{n}\leq\lim_{n\rightarrow\infty}\frac{2\sqrt{|J_{i_{n}}|n}}{n}=0. (20)

Thus, we obtain a set of experts and a corresponding sequence $\{i_n\}$ satisfying Condition A. ∎

A.6 Proof of Theorem 24

Proof.

To prove this theorem, we use a procedure based on learning with expert advice. First, we build and index the experts $\{e_1,e_2,\dots\}$ using the method from the proof of Lemma 23 (Appendix A.5), which satisfies Condition A. Then, in order to apply the Squint algorithm of Koolen and van Erven (2015), the initial weights of the experts must form a probability distribution. We set the initial weight $\pi_i=\frac{1}{i(i+1)}$ for each $e_i$; since $\pi_i=\frac{1}{i}-\frac{1}{i+1}$, the series telescopes and sums to $1$.
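The telescoping behind the normalization $\sum_i\pi_i=1$ is easy to verify numerically (our sketch; the function name is ours):

```python
def prior_mass(n):
    """Partial sum of the prior pi_i = 1/(i(i+1)) = 1/i - 1/(i+1);
    the sum telescopes to 1 - 1/(n+1), hence converges to 1."""
    return sum(1.0 / (i * (i + 1)) for i in range(1, n + 1))
```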

According to Theorem 3 in the work of Koolen and van Erven (2015), we have the following upper bound on the regret:

\sum_{t=1}^{T}\mathbb{I}\left[\hat{h}_{t-1}(X_{t})\neq Y_{t}\right]-\sum_{t=1}^{T}\mathbb{I}\left[e_{k}(X_{t})\neq Y_{t}\right]\leq O\left(\sqrt{V_{k}\log\frac{\log V_{k}}{\pi_{k}}+\log\frac{1}{\pi_{k}}}\right). (21)

Here $V_k$ is the sum over rounds of the squared difference between the algorithm's loss and expert $k$'s loss. In other words, we have

V_{k}=\sum_{t=1}^{T}\left(\mathbb{I}\left[\hat{h}_{t-1}(X_{t})\neq Y_{t}\right]-\mathbb{I}\left[e_{k}(X_{t})\neq Y_{t}\right]\right)^{2}. (22)

Notice that $\left(\mathbb{I}\left[\hat{h}_{t-1}(X_{t})\neq Y_{t}\right]-\mathbb{I}\left[e_{k}(X_{t})\neq Y_{t}\right]\right)^{2}$ is either $0$ or $1$, so we have $V_k\leq T$. The regret of this algorithm is thus upper bounded by:

O\left(\sqrt{V_{k}\log\frac{\log V_{k}}{\pi_{k}}+\log\frac{1}{\pi_{k}}}\right)=O\left(\sqrt{T\log\log T+T\log k+\log k}\right). (23)

By Condition A, we also know

\mathbb{E}\left[\limsup_{T\rightarrow\infty}\min_{e_{i}:i\leq i_{T}}\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\left[e_{i}(X_{t})\neq Y^{*}_{t}\right]\right]=0. (24)

Thus, the regret of this algorithm satisfies

\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]-\mathbb{I}\left[Y_{t}\neq Y^{*}_{t}\right]\right)
=\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]-\left|\mathbb{I}\left[Y_{t}\neq e_{k}(X_{t})\right]-\mathbb{I}\left[e_{k}(X_{t})\neq Y^{*}_{t}\right]\right|\right)
\leq\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]-\mathbb{I}\left[Y_{t}\neq e_{k}(X_{t})\right]+\mathbb{I}\left[e_{k}(X_{t})\neq Y^{*}_{t}\right]\right)
\leq\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{I}\left[Y_{t}\neq\hat{h}_{t-1}(X_{t})\right]-\mathbb{I}\left[Y_{t}\neq e_{k}(X_{t})\right]\right)+\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\left[e_{k}(X_{t})\neq Y^{*}_{t}\right]

Because $\log i_T=o(T)$ and $k\leq i_T$, we have $\log k=o(T)$, so the first term above vanishes by Inequality (23), and the second term has expectation $0$ by Equation (24). Thus, the regret above is $o(T)$. Therefore, we have an algorithm that extends a universally consistent online learning algorithm for the realizable case to a universally consistent online algorithm for the agnostic case.

For necessity, notice that a universally consistent online algorithm for the agnostic case can be applied to the realizable case, where the regret equals the number of mistakes. ∎