
Can Machines Learn the True Probabilities?

Jinsook Kim
Abstract

When there exists uncertainty, AI machines are designed to make decisions so as to reach the best expected outcomes. Expectations are based on true facts about the objective environment the machines interact with, and those facts can be encoded into AI models in the form of true objective probability functions. Accordingly, AI models involve probabilistic machine learning in which the probabilities should be objectively interpreted. Under some basic assumptions, we prove when machines can learn the true objective probabilities, if any, and when they cannot.

Foundations of Artificial Intelligence, Probabilistic Machine Learning, Non-parametric Estimation, Effective Calculation by Turing Machine, True Guarantee of Well-Calibration

1 Introduction

In the standard AI model under uncertainty, how to measure the degree of uncertainty matters. This paper is about treating such measures in the form of probabilities. In particular, we focus on the true objective probabilities, if any. There are various probabilistic contexts in which the true objective probabilities matter. For example, causal relations of physical events are widely regarded as objective features of the world. Therefore, when causal relations are to be understood in terms of probabilities mainly due to various regularity issues, a probabilistic causal model should include an objective probability function that measures the true objective values about our world.

This paper addresses the question of whether machines can learn the true objective probabilities from the data to perform such probabilistic reasoning. Under some basic assumptions, we prove that machines can learn the true objective probabilities if and only if the probabilities are directly observable by them. Roughly speaking, a true probability is directly observable by a machine when it can calculate the probability by the empirical frequency of a true population given to it.

The outline of the proof is as follows. After defining some main concepts, we identify the Success Criterion and the necessary condition for any machine to learn the true objective probabilities. From these conditions, we derive the theorem that learning implies the true guarantee of well-calibration. Roughly speaking, “truly guaranteed well-calibration” means the following: when a machine collects data according to its subjective forecast along a stochastic path in which the associated events occur, the empirical frequency of the collected data matches the machine’s very probabilistic forecast with true probability $P$-one. Since the machine’s forecasts must indeed be true when the machine learns the true probabilities, this calibration property can then be understood as a calibration version of the strong law of large numbers without the independence assumption.

Note that there exist connections here among machine forecasting, well-calibration, and machine learning. While proving our theorems, therefore, we establish connections between the true guarantee of well-calibration and various settings of the real forecasting games between Nature and a machine. In this game, what Nature forecasts are the true objective probabilities, while what the machine forecasts are its own subjective probabilities. The machine loses when Nature deviates from the probabilistic forecasts of the machine. Bridged by the property of truly guaranteed well-calibration, we then prove whether the machine learns the true probabilities or not under various settings of forecasting games.

With this proof, we establish the fundamental scope and limits of learning the true probabilities by AI machines. One important implication is that machines can relax the independence assumption among data to learn the true probabilities but cannot relax the assumption of identical distribution, such as stationarity or ergodicity, along a stochastic path where the associated events occur. Another implication is that the problem of computability is directly connected to the problem of complexity in the case of learning the true probabilities.

2 Notations and Definitions

In this section, we define some main concepts, including “machine learning” and “true objective probability”. Adopting terminologies from (Nilsson, 2011) and (Boolos et al., 2002), let us first define a machine as an artifact or device that can effectively calculate or compute any target function if there exist definite and explicit instructions to do so in principle. Since we focus on probability functions in this paper, we particularly mean by “an effectively calculating or computing device” a machine that can in principle assign a probability measure (a value of a probability function) to each state (an argument of the probability function) in a given domain, an event space of a sigma-field.

Definition 2.1.

A function is effectively calculable or computable when there are definite and explicit instructions, following which its functional value can be calculated in principle for any given argument. (Boolos et al. (2002))

Two things merit attention in Definition 2.1. First, this notion of effective calculation or computation is an ideal one, with no practical limits on the time, expense, etc., necessary to calculate. Therefore, a proof of a limitation on the effective calculation or computation of any function implies a fundamental limit on computability that cannot be overcome by any practical real machine. Second, as (Kozen, 1997) points out, this notion is an informal one, something that is supposed to be captured in common by all formalisms, such as computation by Turing machines, by the $\lambda$-calculus, and by the $\mu$-recursive method. Accordingly, once we adopt this notion of effective calculation or computation to define “learning”, we can be flexible about which formalism is encoded as instructions to complete a given learning task.

Now, whatever such a formalism is, machines can learn only if there exist some instructions they can follow to complete their tasks. So we can prove that it is impossible for machines to learn a target function under certain conditions in the following way: we first suppose that there exist some successful instructions to be encoded into machine programming to learn the given function under those conditions. We then show that this supposition leads to a conclusion that is impossible to satisfy. We thereby conclude that there cannot exist such instructions for the given function and, accordingly, that machines cannot learn it. This is a simple but clear way of proving the impossibility of learning without being committed to any complex procedure of constructing a formalism such as a Turing machine or the $\lambda$-calculus.

Definition 2.2.

A machine learns when it succeeds in effectively calculating or computing a target function, if any, after processing possibly infinite amounts of data.

The phenomenon of learning must be at least computational in its essence when acquired by a machine. We thus adopt the notion of computation to define what learning is in Definition 2.2. Inspired by the ideas of (Turing, 1936) and (Church, 1936), we require that a machine be able to effectively calculate or compute a target function when the machine can learn the function.

In addition, we add the notion of success to Definition 2.2, which aims to capture the role of “learning” as an epistemic notion, not just a computational one. The epistemic notion of machine learning requires two components: if a machine learns, then (i) it must indeed be correct most of the time and (ii) it must be self-assured that it is correct most of the time.

Learning is the phenomenon of knowledge acquisition. Once something is learned, knowledge about it is acquired. Now, knowledge must be a true representation, and it must be so not just by luck. We thus require that (i) what is effectively calculated or computed by a machine be true and further that it be true most of the time out of infinite opportunities to learn. In addition, if the machine admits errors too many times, say infinitely often, it cannot be said to learn. We thus require also that (ii) the machine be self-assured that what it calculates is correct most of the time. In sum, we provide the following Success Criterion:

(1) If a machine achieves computational success by learning, what it acquires in the end must be true to our world most of the time, which must be assured to the machine itself.

If what the machine computes turns out to be wrong, or it admits errors repeatedly too often out of infinite opportunities to learn, then its computation cannot be considered successful. Later, we prove by Corollary 4.37 that the Success Criterion (1) is sufficient for learning in the case of computing true probabilities. We also clarify there what we mean by “most of the time.”

Definition 2.3.

A true probability is what collectively constitutes a probability space, a triple $(\Omega,\mathcal{F},P)$, of random variables $S_{t}$’s with a joint true probability $P$ of the stochastic process according to which Nature generates a sequence of actual data $s_{t}$’s, each of which is realized as such with the very true probability $P$.

Consider an enumerable set $\Omega_{t}$ of states $\omega_{i}$ at time $t$ with $t\in\mathbb{N}$. For example, $\Omega_{t}$ may be the set $\{\omega_{s},\omega_{c},\omega_{r}\}$, where $\omega_{s}$ denotes the state of a sunny day, $\omega_{c}$ the state of a cloudy day, and $\omega_{r}$ the state of a rainy day at date $t$. Also, consider the set $\Omega$ that consists of all the infinite sequences with a representative sequence $\omega=(S_{0}^{-1}(s_{0}),S_{1}^{-1}(s_{1}),S_{2}^{-1}(s_{2}),\ldots)$. Here, $S_{t}(\omega_{i})$ is a random variable that takes some numerical value $s_{t}\in\Re$ according as which $\omega_{i}$’s are realized at time $t$ in our world. Now, $S_{t}$ comes before $S_{t+1}$ in time, and thus the sequence of $S_{t}$’s represents a discrete-time stochastic process. Nature then generates the actual data set $\{s_{0},s_{1},s_{2},\ldots\}$ with the true probabilities $P$. So the probability function $P$, if any, becomes true to our world when it corresponds to whatever amounts to the rules according to which the actual data are realized in our world. Broadly speaking, this is in line with the correspondence theory of truth, similarly to (Tarski, 1944).
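To make Definition 2.3 concrete, the following is a minimal simulation sketch of Nature generating a sequence of actual data under a hypothetical true probability function. The state space, the probability values, and the numerical coding of $S_t$ are our own illustrative assumptions, not part of the paper’s formal setup.

```python
import numpy as np

# Sketch of Definition 2.3 (illustrative numbers): Nature draws states from
# Omega_t = {sunny, cloudy, rainy} according to a fixed true probability P,
# and the random variable S_t records the realized numerical value s_t.
rng = np.random.default_rng(0)

states = ["sunny", "cloudy", "rainy"]                 # omega_s, omega_c, omega_r
true_P = [0.5, 0.2, 0.3]                              # hypothetical true probabilities
values = {"sunny": 0.0, "cloudy": 1.0, "rainy": 2.0}  # S_t(omega_i) = s_t

T = 100_000
omega = rng.choice(states, size=T, p=true_P)          # Nature's realized states
s = np.array([values[w] for w in omega])              # actual data s_0, s_1, ...
print("first data points:", s[:10])

# The empirical frequency of each state approaches its true probability P.
for state, p in zip(states, true_P):
    print(state, round((omega == state).mean(), 4), "vs true", p)
```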

Remark 2.4.

More detailed discussions on Definition 2.3, including examples, are provided in Appendix D.

Now that we have defined learning and true probability, let us discuss under what conditions machines can or cannot learn the true probabilities. Before we move on, however, let us briefly mention how we can provide formal conditions for learning even though Definition 2.2 contains informal notions.

Recall from the second comment on Definition 2.1 that the general notion of computation has not been mathematically defined. This is why the Church-Turing thesis remains a thesis, not a theorem, given that it uses the general notion of computation. But the computability of a target function in each specific case can be formally specified by giving some definite and explicit instructions to derive the target function in each case, say by a Turing machine. Likewise, our general notion of machine learning cannot be mathematically defined, because Definition 2.2 uses the general notion of computation and the informal notion of success. But this does not prevent us from mathematically analyzing the notion of machine learning of the true probabilities by proving what the necessary and sufficient conditions are for learning them. We can do so by giving some definite and explicit instructions to statistically derive the true probability function by a machine while satisfying the Success Criterion (1).

3 Kinds of Probabilities and Learning

3.1 Subjective vs. Objective Probabilities

Broadly speaking, probabilities can be divided into two kinds, subjective and objective ones. A subjective probability, say $\Pi(A_{t+1}|\text{\ss}_{t})$, depends on each person’s beliefs and thus possibly varies from person to person, while an objective one, say $P(A_{t+1}|\text{\ss}_{t})$, does not.

The standard theory of subjective probability was first developed by Ramsey and then further by De Finetti and Savage. Subjective probability is designed to represent a degree of belief possessed by a subject, say some person or, if possible, a machine. Hence subjective probability represents whatever is in anyone’s mind about anything, as long as his or her belief system is coherent, and so can be assigned even to what is merely imagined. For example, while arguing for cogito, ergo sum, (Descartes, 2008) imagined an evil spirit that has devoted all its efforts to deceiving him. Descartes can assign some value of subjective probability to his imagined evil spirit in accordance with how likely it is, to him, that the imagination can be realized in this world, as long as his belief system remains coherent.

In contrast, objective probability, if any, is what must be determined by objective features of our world that do not vary from person to person. The best way to understand objective probability is to consider examples. Following (Maher, 2010), for example, suppose that a coin has the same face on both sides, that is, two-headed or two-tailed. When this coin is tossed infinitely often, its relative frequency surely converges to 1 or 0. Hence the limiting relative frequency here is either 1 or 0, depending on how our world turns out to be, which is an objective matter, and not on whatever we believe.

It should be noted that subjective and objective probabilities are conceptually bifurcated in two important ways. First, recall that subjective probability represents an aspect of someone’s subjective belief, while objective probability does not. Hence the subjective probability of Descartes’ demon is positive as long as it is believed to some degree that it could exist in our world. However, this does not necessarily imply that the true objective probability of Descartes’ demon is positive, since it might be the case that such a demon is possible only in one’s imagination but impossible in our real world. We will return to this potential bifurcation between subjective and objective probability in Section 4.1.

Second, there exists an asymmetric relation between subjective and objective probability: although the subjective probability of Descartes’ demon does not necessarily bind its objective probability, the converse holds (e.g. (Lewis, 1980)). That is, once it is proven/assumed by an agent that the true objective probability of Descartes’ demon is, say, zero, then the subjective probability of that same agent is bound to this proven/assumed result on the objective probability and thus must be zero as well. From this asymmetric relationship, we derive Lemma 4.23 in Section 4.2.

Remark 3.1.

More detailed discussions on various kinds of probabilities are provided in Appendix D.

3.2 What is Implied by Learning the True Objective Probabilities?

As we pointed out in Section 2, learning is the phenomenon of knowledge acquisition, and knowledge must be at least a true representation. In the case of human beings, the requirement of true representation is expressed as the requirement that (propositional) knowledge be at least a true belief (e.g. (Hintikka, 1962), (Moore, 1985)). What then is the counterpart of such a requirement for machines?

In general, if a machine achieves computational success at $t$ by learning, what the machine represents by learning must be at least true at that time. We then denote the true representation of the machine about what is learned by the “true belief” of the machine, a legitimate analogue to the true belief of human beings. It is a belief analogue, for we have not yet shown that machines have minds or that they have the same kinds of mental representations as human beings. It is nevertheless a legitimate belief analogue, since the computational models of machine intelligence are based on understanding human intelligence. (e.g. (Pearl, 2018), (Russell, 1998), (Valiant, 1984, 2008))

That said, let us discuss the relation between belief and learning on the machine side: the knowledge acquired by machine learning must be at least a true belief. In (Hintikka, 1962), the knowledge of a person $i$ refers to the knowledge of that person $i$ of some proposition $A$. Likewise, a machine’s learning of the true objective probability $P$ here refers to the knowledge acquired by the machine of the probabilistic proposition $A_{p}$. If a machine learns the true probability as $\alpha$, then the probabilistic proposition $A_{p}$ amounts to this: the true objective probability $P$, if any, is what the very machine calculates as $\alpha$. Here, we convert non-propositional learning into propositional learning.

Now, just as person $i$’s knowledge of proposition $A$ must satisfy the necessary condition that person $i$’s belief in $A$ is true, machine learning of the true probability $P$ must also satisfy the condition that the machine’s belief in $A_{p}$ is true. Note here that such a belief in $A_{p}$ is true when what has been calculated by the machine is indeed equal to the true probability $P$. Now, this probability function calculated by a machine is nothing more than the subjective probability of the machine. Therefore, the necessary condition for machine learning of the true probability $P$ requires a machine to hold a true belief whose truth condition is satisfied when its subjective probability is, in fact, in congruence with the true objective probability $P$. In short, if a machine learns the true objective probability $P$, then the subjective probability $\Pi$ of the machine is actually equal to the true probability $P$.

Remark 3.2.

There is a large literature in logic and economics whose discussion bears on when a machine holds a true belief in the probabilistic proposition $A_{p}$. We provide some of this literature in Appendix B.

Therefore, we obtain the following condition:

The Necessary Condition for any Machine to Learn the True Probability

(2) If a machine learns the true objective probability $P(A_{t+1}|\text{\ss}_{t})$, then $\Pi(A_{t+1}|\text{\ss}_{t})=P(A_{t+1}|\text{\ss}_{t})$,

where $\Pi(A_{t+1}|\text{\ss}_{t})$ denotes the subjective probability of the machine at time $t$.

We assume, without loss of generality and for simplicity, that the event $A_{t+1}$ is an elementary event. So the event $A_{t+1}$ is a singleton, i.e. $\{\omega_{t+1}\}$.

Two things should be noted from (2). First, learning/knowledge is not necessarily equivalent to obtaining the true fact that $\Pi(A_{t+1}|\text{\ss}_{t})=P(A_{t+1}|\text{\ss}_{t})$, as the converse of condition (2) does not necessarily hold. Second, if a machine is wrong in calculating the true probability at time $t$, so that $\Pi(A_{t+1}|\text{\ss}_{t})\neq P(A_{t+1}|\text{\ss}_{t})$, then by modus tollens we can derive from (2) that the machine does not learn it at that time. However, this does not preclude the machine from learning it at any other time. What, then, can be said about learnability in general? According to the Success Criterion (1), a machine cannot learn a target function if it is wrong most of the time, except for a few finite cases out of infinite opportunities to learn. But can a machine be said to learn if it is correct infinitely often but also wrong that often? We give a negative answer to this question by proving theorems in Section 4.2.

4 Can Machines Learn the True Probabilities?

4.1 Learning the True Probabilities and Calibration

Let us start with a simple example in which a machine is trying to learn the true probability that it will rain tomorrow. A forecasting system is said to be well-calibrated if it assigns some probability, say 30%, to rainy events in a test set in which the long-run proportion of days that actually rain is 30%. According to (Dawid, 1982), a forecasting machine is self-assured that any fairly arbitrary test set of its forecasts is well-calibrated. This is Theorem 4.1. In addition, we prove in Theorem 4.6 that if the machine learns the true probability, then the machine’s forecasting is truly guaranteed to be well-calibrated.

Now, let us assume that a machine has its own (not necessarily true in our context) probability distribution $\Pi$ defined over $\text{\ss}_{\infty}={\textstyle\bigvee\limits_{t=0}^{\infty}}\text{\ss}_{t}$, where $\text{\ss}_{t}$ denotes the totality of the true facts up to day $t$. The probability forecasts $\Pi(A_{t+1}|\text{\ss}_{t})$ it makes on day $t$ are for events $A_{t+1}$ in $\text{\ss}_{t+1}$ and are $\text{\ss}_{t}$-measurable. For each day $t$ we have an arbitrary associated event $A_{t}\in\text{\ss}_{t}$, say the event of rain on day $t$. We denote the indicator of $A_{t+1}$ by $Y_{t+1}=1_{\{A_{t+1}\}}$ and introduce $\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{\ss}_{t})$, the probabilistic forecast of the machine for day $t+1$. In addition, we introduce indicator variables $\xi_{1},\xi_{2},\ldots$, chosen at will, to denote the inclusion of any particular day $t$ in the test set: $\xi_{t}=1$ if day $t$ is included in the test set and $\xi_{t}=0$ otherwise. Now, if we set the selection criterion for including any day in the test set to be the assessed probability $\alpha$ on day $t$, then we have the following theorem.

Theorem 4.1.

Suppose that $\xi_{t}$ is $\text{\ss}_{t-1}$-measurable. Then $\Pi(p_{k}\rightarrow\alpha)=1$ as $k\to\infty$,

where $k$ is the number of days in the test set,

$p_{k}=({\textstyle\sum\limits_{t=1}^{k}}\xi_{t})^{-1}\cdot({\textstyle\sum\limits_{t=1}^{k}}\xi_{t}\cdot 1_{\{A_{t+1}\}})$, and

$\xi_{t}:=\begin{cases}1&\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{\ss}_{t})=\alpha\\ 0&\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{\ss}_{t})\neq\alpha\end{cases}$

Here, let us use the terms as follows: machine forecasts are self-assured to be well-calibrated when $\Pi(p_{k}\rightarrow\alpha)=1$, while they are truly guaranteed to be so when $P(p_{k}\rightarrow\alpha)=1$. It should be noted, then, that even if the forecasting machine is self-assured to be well-calibrated, this does not necessarily imply that its forecasts are truly guaranteed to be well-calibrated. Recall from Section 3.1 that there is a conceptual bifurcation between subjective and objective probability.

Now, suppose that a machine tries to learn the true probability of a particular event $A_{t+1}$. If this machine indeed learns the true probability of the event as $\alpha$, then the machine should correctly calculate the true probability of the same events repeatedly as $\alpha$ most of the time. Hence, the machine can construct a test set of those associated events $A_{t+1}$’s whose sequentially correct probabilities are $\alpha$. Then we can show further from Theorem 4.1 that the test set will be well-calibrated with true probability $P$-one. This is Theorem 4.6. In short, here “being correct as $\alpha$” itself serves as what (Dawid, 1982) calls a selection criterion.
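The following is a small simulation sketch of this situation (all numerical settings are our own assumptions): when the machine’s forecasts coincide with the true probabilities, the test set selected by the criterion “forecast $=\alpha$” is empirically well-calibrated, illustrating both Theorem 4.1 and the true guarantee in Theorem 4.6.

```python
import numpy as np

# Sketch of Theorems 4.1/4.6 under the assumption Pi = P (truly learned
# forecasts); the probability values and path length are illustrative.
rng = np.random.default_rng(0)
alpha, T = 0.3, 500_000

true_p = rng.choice([0.3, 0.6, 0.9], size=T)   # hypothetical P(A_{t+1}|ß_t)
forecast = true_p.copy()                        # Pi = P: the machine forecasts truly
y = rng.random(T) < true_p                      # realized indicators 1_{A_{t+1}}

xi = (forecast == alpha)                        # selection indicators xi_t
p_k = y[xi].mean()                              # empirical frequency on the test set
print("p_k =", round(p_k, 4), "vs alpha =", alpha)   # p_k is close to alpha
```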

However, note that if the size of $\text{\ss}_{t}$ continues to grow as $t$ goes to infinity, then the $\text{\ss}_{t}$’s might be different for each $t$. Then $P(A_{t+1}|\text{\ss}_{t})$ might not stay the same as $\alpha$ even for the same events $A_{t+1}$’s across infinitely many $t$’s. Now, in order for the correct probability $\alpha$ to work as a selection criterion, $P(A_{t+1}|\text{\ss}_{t})$ should stay the same as $\alpha$ at least for infinitely many $t$’s, even though $\text{\ss}_{t}$ may vary as time passes. Therefore, we prove Lemma 4.5 from the following three assumptions. The justifications for the three assumptions are provided in Appendix C.

Assumption 4.2.

The $\text{\ss}_{t}$’s in $P(A_{t+1}|\text{\ss}_{t})$ are the sets of all the true facts up to time $t$.

Assumption 4.3.

No further knowledge requirement is imposed on the condition $\text{\ss}_{t}$.

Assumption 4.4.

Once a probability of an event type $E$ is established, its associated event tokens $E_{t_{k}}$’s occur at some infinite subsequence of times $t_{k}$’s, so that $P(E_{t_{k}})$ does not vanish to zero as $t_{k}\rightarrow\infty$.

It should be noted from 4.2 and 4.3 that if $\text{\ss}_{t}$ were the set of known facts, the information on the associated events $E_{t}$’s in the $\text{\ss}_{t}$’s might not be independent over time. Once $E_{t}$ has become known in the past at some time $t_{0}$, the same events $E_{t}$’s are more likely to be known afterwards: repeatedly accumulated knowledge of the same events reinforces the probability that the very event will be known again in the future. However, this is not necessarily the case with the set of true facts. It will be clear in Lemma 4.5 why this independence condition matters.

Lemma 4.5.

For any $\alpha\in\Re[0,1]$, let $E_{t}$ denote the event token at time $t\in\mathbb{N}$ whose event type $E$ almost surely determines the true probability of an event type $A$ as $\alpha$. If, for some subsequence $t_{k}$’s, the $E_{t_{k}}$’s are independent across $t_{k}$’s and $P(E_{t_{k}})\neq 0$ for any $t_{k}$, then $P(E_{t}\ i.o.)=1$.

Now that Lemma 4.5 has been established, $P(A_{t+1}|\text{\ss}_{t})$ is truly guaranteed to stay at $\alpha$ infinitely often, and thus the machine has infinite opportunities to learn $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Theorem 4.6.

Let us consider any arbitrary $\alpha\in\Re[0,1]$. If a machine learns the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$, then $P(p_{k}\rightarrow\alpha)=1$.

It should be noted that the notion of learning in Theorem 4.6 is flexible enough to allow for finitely many potential errors, so that there can exist some $t^{\ast}<\infty$ such that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ $\forall t<t^{\ast}$ while the machine processes the data to learn.

Remark 4.7.

More detailed discussions on Theorem 4.6 are provided in Appendix D.

4.2 Can Machines Learn the True Probabilities?

Theorem 4.8.

It is impossible to obtain a joint distribution for an infinite sequence of events that could have the well-calibration property with subjective probability $1$.

The basic idea in the proof of Theorem 4.8 is to construct a counterexample in which the true probability function $P$ deviates infinitely often from the subjective probability function $\Pi$ in such a way that the well-calibration property no longer holds.

Counterexample 1 Following (Oakes, 1985), let $P$ be such that $P(A_{t}|\text{\ss}_{t-1})=f(\Pi(A_{t}|\text{\ss}_{t-1}))$ for any event $A_{t}$, with the function $f:[0,1]\rightarrow[0,1]$ defined by $f(x)=x+\frac{1}{2}$ $(0\leq x\leq\frac{1}{2})$ and $f(x)=1-x$ $(\frac{1}{2}<x\leq 1)$. Then, under $P$ with $P(Y_{I_{k}}=1)=f(\alpha)$, where $\hat{Y}_{t}=\alpha$ for a subsequence $\{t:t=I_{1},I_{2},\ldots\}$ and the $Y_{I_{k}}$’s form a Bernoulli sequence, the well-calibration property does not hold.

Due to this counterexample from (Oakes, 1985), the machine forecaster cannot exclude the possibility that its test set may be mis-calibrated, and thus the machine cannot hold its subjective probability of being well-calibrated at $\Pi$-one. Furthermore, if this artificially imagined possibility of mis-calibration is a real possibility, then it follows that no test set, however large, can be guaranteed to be well-calibrated with true probability $P$-one. Later in this section, we prove that if such an imagined possibility is a real one, then machines cannot learn. Meanwhile, we also prove mathematically how the (Oakes, 1985) counterexample paralyzes Dawid’s Theorem 4.1, which amounts to the proof of Theorem 4.8.
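The following sketch simulates Counterexample 1 under our own illustrative settings (the fixed forecast $\alpha=0.3$ and the path length are assumptions): Nature perversely sets the true probability to $f(\alpha)$, so the relative frequency on the machine’s test set converges to $f(\alpha)$ rather than $\alpha$.

```python
import numpy as np

# Sketch of Counterexample 1 (Oakes, 1985): Nature responds to the machine's
# forecast x with true probability f(x), which never equals x.
def f(x):
    # f: [0,1] -> [0,1], f(x) = x + 1/2 on [0, 1/2], f(x) = 1 - x on (1/2, 1]
    return x + 0.5 if x <= 0.5 else 1.0 - x

rng = np.random.default_rng(1)
alpha, T = 0.3, 500_000          # machine's fixed forecast and path length

true_p = f(alpha)                # Nature's perverse response: f(0.3) = 0.8
y = rng.random(T) < true_p       # Bernoulli sequence Y_{I_k}
print("p_k =", round(y.mean(), 4), "but alpha =", alpha)  # p_k -> 0.8 != 0.3
```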

Remark 4.9.

More detailed discussion on Counterexample 1 is provided in Appendix D.

Lemma 4.10.

Suppose that a machine constructs a test set by the assessed probability $\alpha$. Then $E|p_{\infty}-\alpha|=0$ if and only if $P(p_{k}\rightarrow\alpha)=1$, where the expectation is taken with respect to the true probability $P$. Here, $p_{\infty}=\lim\limits_{k\rightarrow\infty}p_{k}$.

Lemma 4.11.

Let us fix $\alpha\in\Re[0,1]$ and suppose that $p_{\infty}$ exists. Then $E[p_{\infty}-\alpha]=0$ if and only if $E[\lim\limits_{k\rightarrow\infty}\frac{1}{k}{\textstyle\sum\limits_{j=0}^{k-1}}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha]=0$. In general, $E|p_{\infty}-\alpha|\geq E|\lim\limits_{k\rightarrow\infty}\frac{1}{k}{\textstyle\sum\limits_{j=0}^{k-1}}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha|$.

Remark 4.12.

By Lemma 4.10 and Lemma 4.11, we establish a connection between the true guarantee of well-calibration and the real forecasting game between a machine and Nature. Further discussion of this connection is provided in Appendix D.

Definition 4.13.

Nature is perverse when, for any fixed machine forecast $\alpha$, $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least for infinitely many $t$’s along the stochastic path of the test set.

By “at least i.o.” in Definition 4.13, we mean that Nature deviates from $\alpha$ either (i) infinitely often or (ii) all but finitely often along the stochastic path of the test set, and we clearly distinguish (i) from (ii). From now on, by “infinitely often” we mean that Nature not only deviates infinitely often but also conforms infinitely often, while “all but finitely often” has its usual meaning. Then, if the true probability of Nature’s perversity is zero, we denote it by $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.$ along the path of the test set$)=0$, which amounts to $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at most for $t<\infty$ along the path of the test set$)=1$. Furthermore, if there is no confusion, we will abbreviate Nature’s perversity as “$P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.$” while omitting “along the path of the test set.”
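The distinction between the two deviation patterns can be illustrated by a small simulation; all numbers below are our own assumptions. In pattern (i) Nature deviates on a subsequence (here, the even rounds) while conforming infinitely often on the rest; in pattern (ii) she deviates at every round after a finite time. In both patterns the relative frequency fails to converge to the forecast $\alpha$.

```python
import numpy as np

# Illustrative sketch of the two perversity patterns of Definition 4.13,
# with a fixed machine forecast alpha = 0.3 and deviation value 0.8.
rng = np.random.default_rng(2)
alpha, T = 0.3, 200_000

# (i) "infinitely often": deviate on even rounds, conform on odd rounds,
#     so Nature also conforms infinitely often.
p_io = np.where(np.arange(T) % 2 == 0, 0.8, alpha)

# (ii) "all but finitely often": conform only on the first 100 rounds,
#      then deviate forever after.
p_abfo = np.where(np.arange(T) < 100, alpha, 0.8)

for name, p in [("i.o.", p_io), ("all but f.o.", p_abfo)]:
    y = rng.random(T) < p                  # realized indicators 1_{A_{t+1}}
    print(name, "relative frequency:", round(y.mean(), 4), "vs alpha:", alpha)
```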

Now, according to the Success Criterion (1), a machine fails to learn the true probability in case (ii), because the machine then makes wrong forecasts along the path except for a finite few of the infinite opportunities to learn. However, it seems unclear whether the machine can learn or not in case (i). On the one hand, the machine seems not to be able to learn because it makes too many errors, say infinitely many errors. On the other hand, it seems that the machine should be able to learn because it makes astronomically many correct forecasts, say infinitely often. Therefore, while adopting this definition, we clearly prove by Theorem 4.20 and Corollary 4.30 that a machine cannot learn the true probability even when it is correct infinitely often, if it is wrong that often.

Observation Provided that the machine forecast $\Pi(A_{t+1}|\text{\ss}_{t})$ is fixed at some value $\alpha\in\Re[0,1]$, $P(\Delta_{t})$ becomes the true second-order probability on the true first-order probability of the event $A_{t+1}$, that is, $P(\Delta_{t})=P\left(\,P(A_{t+1}|\text{\ss}_{t})=\alpha\,\right)$, where $\Delta_{t}$ denotes the event that the machine makes a correct forecast at $t$.

It should be noted here that the numbers computable by a machine are countably many (e.g. (Turing, 1936)). Thus, the true second-order probability $P$ here is a probability mass function on a countable space and therefore satisfies the Kolmogorov axioms, although $\alpha$ may potentially be any real number in $\Re[0,1]$.

Remark 4.14.

More detailed discussions on the connection between true second-order probability and the forecasting game are provided in Appendix D.

Lemma 4.15.

Let us consider the forecasting game between Nature and a machine, and let us further suppose that the structure of this game at any given time $t$, i.e. whether it is simultaneous or not, is certain to Nature. Now, by 4.2 and 4.3, let us suppose that $\text{\ss}_{t}$ consists of the true facts, not necessarily knowledge. Then there exists a true second-order probability $P$ such that $0<P\left(P(A_{t+1}|\text{\ss}_{t})=\alpha\right)<1$ if and only if the real forecasting game is a simultaneous-move game at time $t$. In particular, $P\left(P(A_{t+1}|\text{\ss}_{t})=\alpha\right)=0$ if and only if the machine moves first and Nature moves later, after observing what move the machine takes in the forecasting game at time $t$.

There are various theories of learning in games (e.g. (Nisan et al., 2007)). What matters, therefore, is what one aims to learn through the games and who competes with whom in them. In the standard model, a machine aims to learn what the optimal actions are to minimize the expected (total) loss or payoff, which is determined in a given environment, say a financial market. In this case, a machine usually competes with other machines in the game. For example, in some online learning settings, a machine aims to learn a sequence of estimates that achieves sub-linear regret, given that the loss functions are convex. It gets a possibly different amount of payoff/loss at each round of the game along the stochastic path where the given sequence of games is played.

In our forecasting games, on the other hand, a machine aims to learn the true objective probability, if any, through the games, and so the machine competes with Nature. Also, whoever wins a round, the winner/loser gets a uniform payoff at every round along the path, for what counts is how many times the machine wins/loses along the path, not how much payoff it gets at each round once it wins/loses.

Theorem 4.16.

In the forecasting game between a machine and Nature, the machine does not necessarily learn that it wins at each round of the game even though it indeed wins.

Thus, a winning strategy is not equivalent to a learning strategy. Now, in the case where a machine does not learn that it wins/loses a game even though it indeed does so, it does not matter what payoff it gets when it wins/loses, because it cannot learn how much it gets at each round. What matters, on the contrary, is how many times it wins along the path, and this is why our game setting in Lemma 4.15 adopts a uniform payoff at each round.

Theorem 4.17.

Let us consider any arbitrary $\alpha\in\Re[0,1]$ for any machine forecast. If $P(p_{k}\rightarrow\alpha)=1$, then the true probability that Nature is perverse is zero with any of these forecasts $\alpha$ (Case 3).

(Case 1) Suppose that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at most finitely often along the stochastic path where the associated events $A_{t+1}$ occur. Then $P(p_{k}\rightarrow\alpha)=1$, where $p_{k}$ denotes the relative frequency along the path.

(Case 2) Suppose that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ just as in (Oakes, 1985)$)\neq 0$. Then $P(p_{k}\rightarrow\alpha)\neq 1$, where $p_{k}$ denotes the relative frequency along the stochastic path of the test set.

(Case 3) Suppose that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least i.o. along the test set$)\neq 0$. Then $P(p_{k}\rightarrow\alpha)\neq 1$, where $p_{k}$ denotes the relative frequency along the path of the test set.

Regarding Theorem 4.17, three things are worth noting: (i) (Case 1) is equivalent to the strong law of large numbers under a weaker assumption than i.i.d.: if the true probability $P(A_{t+1}|\text{\ss}_{t})$ exists and is identically distributed as $\alpha$ all but finitely often along the path, then the relative frequency converges to the same $\alpha$ with true probability $P$-one. (ii) (Case 2) shows that if (Oakes, 1985) holds with $\Pi$-subjective probability $>0$, then (Dawid, 1982) does not hold, which amounts to the proof of Theorem 4.8. (iii) (Case 3) shows, combined with Theorem 4.6, that if $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at most f.o. along the test set$)\neq 1$, then a machine cannot learn the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$. Thus, the third result (iii) has an important implication for time-series analysis: a machine cannot relax the assumption that the true probability $P(A_{t+1}|\text{\ss}_{t})$ is identically distributed along the stochastic path if it aims to learn that probability. To learn, the machine needs some identical-distribution assumption such as stationarity or ergodicity.
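As an illustrative check of implication (i), the following sketch (all settings are our own assumptions) generates data whose true conditional probability equals $\alpha$ at all but finitely many rounds, with the exceptional early rounds depending on past outcomes so that no independence is assumed; the relative frequency still approaches $\alpha$.

```python
import numpy as np

# Illustrative check of (Case 1) of Theorem 4.17: P(A_{t+1}|ß_t) = alpha for
# all but the first 1,000 rounds; the early rounds are history-dependent,
# so the data are not independent, yet the relative frequency -> alpha.
rng = np.random.default_rng(3)
alpha, T, t_star = 0.3, 200_000, 1_000

y = np.zeros(T, dtype=bool)
for t in range(T):
    if t < t_star:
        # history-dependent true probability during the exceptional rounds
        p_t = 0.9 if (t > 0 and y[t - 1]) else 0.1
    else:
        p_t = alpha               # identically distributed from t_star onward
    y[t] = rng.random() < p_t

print("relative frequency:", round(y.mean(), 4), "-> alpha =", alpha)
```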

Definition 4.18.

Suppose that, with true probability $P>0$, Nature is perverse with some forecast $\alpha^{\ast}$. Then Nature is uniformly perverse when, for any forecast $\alpha\in\Re[0,1]$, there exists no $\alpha\neq\alpha^{\ast}$ such that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.)=0$ for any event $A_{t+1}$.

In other words, when Nature deviates from forecasters for any event $A_{t+1}$, she does not discriminate against some forecasters in favor of others whose forecasts $\alpha$ Nature decides to conform to all but finitely often for sure.

Theorem 4.19.

Suppose that, for any $\alpha$, there exists a true second-order probability $P$ such that $P(P(A_{t+1}|\text{\ss}_{t})=\alpha)<1$ at least for infinitely many $t$’s. Then Nature is uniformly perverse.

Theorem 4.20.

Suppose that, for any $\alpha$, there exists a true second-order probability $P$ such that $P(P(A_{t+1}|\text{\ss}_{t})=\alpha)<1$ at least for infinitely many $t$’s. The machine then cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Now, let us discuss what the condition that the true second-order probability is strictly less than 1 means in Theorem 4.20. Note from Lemma 4.15 that $P\left(P(A_{t+1}|\text{\ss}_{t})=\alpha\right)=1$ if and only if Nature moves first and the machine moves later, after observing what move Nature takes in the forecasting game at time $t$. Thus, it is clear from the condition of Theorem 4.20 why and when the machine fails to learn the true probability if Nature is uniformly perverse: when the machine cannot move later after observing the true move of Nature infinitely often, there always exists a real possibility that the machine may not be able to match Nature’s move that often. Hence the machine cannot be truly guaranteed to be well-calibrated, which in turn implies the impossibility of machine learning. Since the machine cannot observe the true move of Nature in those forecasting games, the true probability is unobservable by the machine.

So far we have shown that it is a real possibility that Nature is perverse, and thus that no machine can learn the true objective probability. Now, someone might argue that this proof holds only under the condition that Nature is uniformly perverse. Nature may not be uniformly perverse, however, but only selectively perverse, so that, for some forecast $\alpha_{0}$, Nature may decide to be benevolent enough to conform to that $\alpha_{0}$. Then it may be the case that the true probability of Nature being perverse is zero for this $\alpha_{0}$, and accordingly that machines may be given an opportunity to learn the true objective probability for that $\alpha_{0}$.

Note, however, that it is entirely Nature’s decision when she will be benevolent to a machine and when she will not. Therefore, it is still a random event to the machine whether Nature is perverse or not. If so, we will show further that, even if the true probability of Nature’s being perverse is zero with some $\alpha_{0}$, a machine still cannot learn the true probability if it cannot learn which forecast is the right $\alpha_{0}$ for any event $A_{t+1}$.

Definition 4.21.

A machine tolerates error at $t$ while pursuing its goal of learning the true probability $P(A_{t+1}|\text{\ss}_{t})$, when $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ but $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})>0$ for some $\alpha\in\Re[0,1]$.

Remark 4.22.

In relation to Lemma 4.23, a more detailed interpretation of Definition 4.21 is provided in Appendix D.

Lemma 4.23.

Suppose that a machine aims to learn the true probability $P(A_{t+1}|\text{\ss}_{t})$ and thus performs an effective calculation that returns $\Pi(A_{t+1}|\text{\ss}_{t})=0$ as its result for the true probability $P(A_{t+1}|\text{\ss}_{t})$. Then $\Pi(A_{t+1}|\text{\ss}_{t})=0$ if and only if $\Pi(\{P(A_{t+1}|\text{\ss}_{t})=0\})=1$, for all but finitely many $t$’s.

Remark 4.24.

In relation to (Savage, 1972), more discussions on Lemma 4.23 are provided in Appendix D.

Definition 4.25.

Nature is selectively perverse when $\exists\alpha$ and $\alpha_{0}\neq\alpha$ such that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.)=0$, while $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.)>0$ for any other $\alpha\neq\alpha_{0}$.

Now, let us define Nature’s decision to be selectively perverse at $t$, to show by Lemma 4.28 that once Nature decides so at $t$, our real world remains as such.

Definition 4.26.

Nature decides to be selectively perverse at $t$ when there exist forecasts $\alpha$ and $\alpha_{0}\neq\alpha$ such that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$, while $P(A_{\alpha\neq\alpha_{0}}(t+1)|\text{\ss}_{t})\neq 0$, where $A_{\alpha}(t+1)$ denotes the event that, from $t+1$ onward, Nature is perverse with the associated events $A_{t}$’s whose assessed forecasts are $\alpha$.

Definition 4.27.

Suppose that Nature is selectively perverse, so that she freely decides at any time whether to be perverse at any rate or not. Then $t_{s}<\infty$ denotes a stopping time if $t_{s}$ is the last time that Nature changes her mind into non-perversity, so that, for any $\alpha_{0}$ with which Nature is not perverse with true probability $P$-one, $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$, $\forall t>t_{s}$.

Note that $t_{s}$ is $\text{\ss}_{t}$-measurable, because $\text{\ss}_{t}$ includes all the true facts up to $t$, and so whatever Nature decides at $t$, say the event $\{P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0\}$, belongs to the set of true facts $\text{\ss}_{t}$.

Lemma 4.28.

Nature is selectively perverse if and only if there exists a stopping time $t_{s}$ for every forecast $\alpha_{0}$ with which Nature is not perverse with true probability $P$-one, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$, while there is no stopping time $t_{s}$ for any other $\alpha\neq\alpha_{0}$.

Lemma 4.29.

Let us suppose that Nature is selectively perverse and that a machine learns which forecast is the right forecast $\alpha_{0}$ for any associated $A_{t}$’s with which Nature is not perverse with true probability $P$-one. The machine is then self-assured that the stopping time $t_{s}$ arrives for that $\alpha_{0}$.

Corollary 4.30.

Suppose that Nature is selectively perverse, so that, with true probability $P$-one, she is not perverse with some machine forecasts $\alpha_{0}$. Furthermore, suppose that the machine is not self-assured that the stopping time $t_{s}$ arrives for each of those $\alpha_{0}$’s. The machine then cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Note that along the stochastic path considered in Corollary 4.30, $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.)=0$ $\forall t>t_{s}$. Now, for this $\alpha_{0}$,

(3)     $\limsup\limits_{t\rightarrow\infty}P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0})\leq P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.)=0$

Therefore, without loss of generality, letting $t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$,

(4)     $P(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0})=1$, $\forall t>t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$.

Now, (4) means, by Lemma 4.15, that the true probability is observable at any time $t>t^{\ast}$ along this path. Then why is the machine still unable to learn the true probability, even though it can move after observing what move Nature takes in the forecasting games all along that path after $t^{\ast}$? According to Corollary 4.30, this is because the machine cannot be self-assured that the true probability will remain observable from time $t^{\ast}+1$ onward, even if it observes Nature’s true move at time $t^{\ast}+1$. Let us show this by the following Lemma 4.31.

Lemma 4.31.

Suppose that a machine is not self-assured of the stopping time $t_{s}$ for $\alpha_{0}$. The machine then cannot be self-assured that the true probability will remain observable from time $t^{\ast}+1$ onward, even if it observes Nature’s true move at time $t^{\ast}+1$.

From Theorem 4.20 and Corollary 4.30, we conclude that the impossibility of learning is derived under the assumption either that Nature is uniformly perverse or that Nature is selectively perverse but the machine is not self-assured of whether the stopping time arrives. What, then, would happen in the case where Nature is selectively perverse and a machine is self-assured of the stopping time $t_{s}$ when $t_{s}$ indeed exists? We show in the following that a machine can learn the true probability in this case, and further that this is the only case in which a machine can learn it.

Theorem 4.32.

Suppose that a machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$. The machine is then self-assured that the stopping time $t_{s}$ arrives for $\alpha$, while the machine is not self-assured that the stopping time $t_{s}$ arrives for any $\alpha$ for which such a $t_{s}$ does not exist.

Let us now define when the true probability is directly observable, based on the notion of population. The concept of population in Definition 4.33 is mainly indebted to (von Mises, 1957, 1967). Since the true probability is defined as the empirical distribution of this population available to a machine, the probability is said to be directly observable by the machine.

Definition 4.33.

Let us consider a set $S$ that consists of a sequence of events $A_{t+1}$’s, $\{A_{t+1}\}_{t=0}^{k-1}$, with $k$ potentially infinite. The set $S$ is defined to be a population with $k$ elements when it is assumed to have a certain attribute of interest, so that, once the set $S$ is collected, an indicator variable $1_{\{A_{t+1}\}}$ is assigned to each event $A_{t+1}$, where $1_{\{A_{t+1}\}}$ has value 1 or 0 depending on whether the event $A_{t+1}$ satisfies the attribute or not. Then the empirical distribution of the population $S$ with respect to the given attribute is defined to be $\frac{1}{k}\sum\limits_{t=0}^{k-1}1_{\{A_{t+1}\}}$.

Definition 4.34.

A machine directly observes $P(A_{t+1}|\text{\ss}_{t})$ from the population $S$ at $t^{\ast}$ if the following two conditions are satisfied: (i) the population $S$ is in principle available to the machine; (ii) the machine calculates the empirical distribution of the population with respect to the given attribute, and this is the true probability distribution of the event $A_{t+1}$.

Now, in the case where the sequence $\{A_{t+1}\}_{t=0}^{k-1}$ is a time series, Definition 4.34 means that $\Pi(A_{t^{\ast}+1}|\text{\ss}_{t^{\ast}})=\frac{1}{k}\sum\limits_{t=0}^{k-1}1_{\{A_{t+1}\}}=P(A_{t^{\ast}+1}|\text{\ss}_{t^{\ast}})$ with $k=t^{\ast}$. Thus, as $t^{\ast}$ goes to infinity, the directly observable true probability becomes the limiting relative frequency, the representative objective true probability.
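The following is a minimal sketch of Definitions 4.33 and 4.34 under our own illustrative assumptions: direct observation amounts to computing the empirical distribution of a given population $S$ with respect to an attribute, which approaches the limiting relative frequency as the population grows.

```python
import numpy as np

# Sketch of Definitions 4.33-4.34: the population size k and the true
# attribute frequency (0.3) are hypothetical choices for illustration.
rng = np.random.default_rng(4)
k = 100_000

# Population S, summarized by the indicators 1_{A_{t+1}} of the attribute,
# realized by Nature with true probability 0.3.
indicators = rng.random(k) < 0.3

# Empirical distribution (1/k) * sum_t 1_{A_{t+1}}: the machine's directly
# observed probability, approaching the limiting relative frequency 0.3.
empirical = indicators.mean()
print("empirical distribution:", round(empirical, 4))
```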

Theorem 4.35.

Suppose that a machine is self-assured of the stopping time $t_{s}$ when $t_{s}$ exists, but is not self-assured of the stopping time $t_{s}$ when no $t_{s}$ exists. The machine then directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$.

Theorem 4.36.

A machine directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ if and only if the machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Two things should be noted from Theorem 4.36. First, whenever the true probability is not directly observable, a machine cannot learn it. Now recall from Definition 2.1 that the machine here is an ideal one, with no practical limits on computational resources such as time or storage space. Therefore, no real machine, hindered by the many practical limits of our world, can overcome this impossibility of learning either, whenever the true probability is not directly observable. Second, Theorem 4.36 also says that the true probability is directly observable by a machine whenever the machine can learn it. Once a machine learns the true probability, so that the probability is successfully computable, the next question is how complex it is to compute. Now that the true probability is directly observable, this makes it easier to deal with the complexity problem (e.g. sorting algorithms). Thus, Theorem 4.36 directly connects the problem of computational solvability to the problem of complexity.

Now, let us finish this section by adding one more claim that the Success Criterion (1) to compute the true probability is sufficient for learning it.

Corollary 4.37.

If a machine calculates the true probability $P(A_{t+1}|\text{\ss}_{t})$ correctly most of the time, and this is self-assured to the machine, then the machine can learn the true probability.

5 Conclusion

We have discussed so far when machines can learn the true probabilities and when they cannot. In summary:

  • $\exists\alpha^{\ast}$ such that $P($Nature is perverse with $\alpha^{\ast})>0$, by Theorem 4.19.

Now that Nature is perverse with at least one forecast $\alpha^{\ast}$,

  • (i) Nature is uniformly perverse: machines cannot learn, by Theorem 4.20.

  • (ii) Nature is selectively perverse: $\exists t_{s}$ for each $\alpha_{0}$ such that $P($Nature is perverse with $\alpha_{0})=0$, by Lemma 4.28.

Then under (ii),

  • (ii-1) Machines are not self-assured of the $t_{s}$: machines cannot learn, by Corollary 4.30.

  • (ii-2) Machines are self-assured of the $t_{s}$:

Then under (ii-2),

  • (ii-2-1) $t_{s}$ actually does not arrive: machines cannot learn, by Theorem 4.32.

  • (ii-2-2) $t_{s}$ indeed arrives: machines can learn, and this is the only case in which machines can learn, by Theorem 4.35 and Theorem 4.36.

Before we close this section, let us add a few remarks. First, we emphasize that in this paper we have focused on a notion of “machine learning” that is not just a technical term, understood as the identification of a target function, but also an epistemic one, a counterpart to “human learning.” We focus on this epistemic notion of machine learning because we particularly mean by “machines” those artifacts that perform human-level intelligent behaviors.

Second, note that we do not need to specify how machines learn the true objective probabilities to prove the impossibility of machine learning on the true probabilities. Instead, we only need the necessary condition for any machine to learn the true objective probabilities if it learns them in any way. Thanks to this flexibility about how to learn, we come to have a powerful and robust result: no matter what kind of learning method a machine uses, it cannot learn the true objective probabilities that are not directly observable.

Lastly, let us emphasize again that our learning machine is an ideal device with no practical limits on time and storage space, etc. Therefore, the scope and limit of machine learning on true probabilities discussed in this paper are more fundamental than practical ones.

Acknowledgements

The author is grateful to Tyler Burge, Michael Christ, Philip Dawid, Joseph Halpern, Jinho Kang, Steven Matthews, Thomas Sargent, and Byeong-uk Yi for discussions that were helpful in various ways to develop this paper. In particular, Tyler helped me pay attention to the idea of converting a non-propositional structure to a propositional one while learning, and Joe helped me open my eyes to the possibility of machine learning on the true probabilities. I discussed every detail of this paper with Jinho so that I insisted that he should be listed as a co-author. Jinho refused on the ground that he did not make direct contributions to mathematical proofs, with which I disagree. But Jinho has been right most of the time when we disagreed, so I decided to agree. Lastly, the author is grateful to three anonymous reviewers and a meta-reviewer. Their reviews were helpful in improving this paper.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Blume & Easley (2006) Blume, L. and Easley, D. If you’re so smart, why aren’t you rich? belief selection in complete and incomplete markets. Econometrica, Vol. 74:929–966, 2006.
  • Blume & Easley (2008) Blume, L. and Easley, D. Market selection and asset pricing. The Handbook of Financial Markets: Dynamics and Evolution, by T. Hens IV and K. Schenk-Hoppe (ed.), North-Holland:403–438, 2008.
  • Boolos et al. (2002) Boolos, G., Burgess, J., and Jeffrey, R. Computability and Logic. Cambridge University Press, Cambridge, 2002.
  • Carnap (1963) Carnap, R. Logical Foundations of Probability. The University of Chicago Press, 1963.
  • Church (1936) Church, A. An unsolvable problem of elementary number theory. American Journal of Mathematics, Vol. 58(2):345–363, 1936.
  • Cogley & Sargent (2008) Cogley, T. and Sargent, T. The market price of risk and the equity premium: A legacy of the great depression. Journal of Monetary Economics, Vol. 55:454–476, 2008.
  • Cogley & Sargent (2009) Cogley, T. and Sargent, T. Diverse belief, survival and the market price of risk. Economic Journal, vol. 119:354–376, 2009.
  • Dawid (1982) Dawid, P. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):604–613, 1982.
  • Descartes (2008) Descartes, R. Meditations on First Philosophy. Translated by Moriarty. M. Oxford University Press, 2008.
  • Foster & Vohra (1993) Foster, D. and Vohra, R. Asymptotic calibration. Biometrika, 85(2):379–390, 1993.
  • Gaifman (1986) Gaifman, H. A theory of higher order probabilities. In TARK, 1986.
  • Halpern (2016) Halpern, J. Actual Causality. The MIT Press, Cambridge, MA, 2016.
  • Halpern & Fagin (1994) Halpern, J. and Fagin, R. Reasoning about knowledge and probability. Journal of the Association for Computing Machinery, Vol 41(2):340–367, 1994.
  • Hintikka (1962) Hintikka, J. Knowledge and Belief. Cornell University Press, Ithaca, 1962.
  • Kozen (1997) Kozen, D. Automata and Computability. Springer, New York, 1997.
  • Lewis (1980) Lewis, D. A subjectivist’s guide to objective chance. Studies in Inductive Logic and Probability, Volume II, by R. Jeffrey (ed.):263–293, 1980.
  • Maher (2010) Maher, P. Explication of inductive probability. Journal of Philosophical Logic, Vol. 39:593–616, 2010.
  • Moore (1985) Moore, R. C. A formal theory of knowledge and action. Formal Theories of the Commonsense World, by J. Hobbs and R. C. Moore, (ed.). Ablex Publishing Corp:319–358, 1985.
  • Nagel (1939) Nagel, E. Principles of the theory of probability. Int. Encycl. Unif. Sc., Vol. I(No. 6), 1939.
  • Nilsson (1986) Nilsson, N. Probabilistic logic. Artificial Intelligence, 28:71–87, 1986.
  • Nilsson (2011) Nilsson, N. Artificial Intelligence: A New Synthesis. Morgan Kaufmann, California, 2011.
  • Nisan et al. (2007) Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V. Algorithmic Game Theory. Cambridge University Press, New York, 2007.
  • Oakes (1985) Oakes, D. Self-calibrating priors do not exist. Journal of the American Statistical Association, 80(390):339, 1985.
  • Pearl (2018) Pearl, J. A personal journey into bayesian networks. UCLA Cognitive Systems Laboratory, Technical Report (R-476), 2018.
  • Ramsey (1931) Ramsey, F. Truth and probability. Studies in Subjective Probability, by Henry Kyburg and Howard Smokler (ed.). Krieger:25–52, 1931.
  • Russell (1998) Russell, S. Learning agents for uncertain environments. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp.  101–103, 1998.
  • Sandroni (2000) Sandroni, A. Do markets favor agents able to make accurate predictions? Econometrica, Vol. 68(6):1303–1341, 2000.
  • Savage (1972) Savage, L. J. The Foundations of Statistics. Dover Publications, New York, 1972.
  • Tarski (1944) Tarski, A. The semantic conception of truth: and the foundations of semantics. Philosophy and Phenomenological Research, Vol. 4(3):341–376, 1944.
  • Turing (1936) Turing, A. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42:230–265, 1936.
  • Valiant (1984) Valiant, L. A theory of the learnable. Communications of the ACM, 27, Nov.:1134–1142, 1984.
  • Valiant (2008) Valiant, L. Knowledge infusion: In pursuit of robustness in artificial intelligence. Proc 28th Conference on Foundations of Software Technology and Theoretical Computer Science, pp.  415–422, 2008.
  • von Mises (1957) von Mises, R. Probability, Statistics and Truth. revised English edition, Macmillan, New York, 1957.
  • von Mises (1967) von Mises, R. Mathematical Theory of Probability and Statistics. 2nd edition, Academic Press Inc., New York, 1967.

Appendix A Proofs for Lemmas, Theorems and Corollaries


Proof of Theorem 4.1 A proof of Theorem 4.1 is suggested in (Dawid, 1982). A simpler one is as follows: let $X_{t}=(\sum_{j=1}^{t}\xi_{j})^{-1}\xi_{t}(Y_{t}-\hat{Y}_{t})$. Since $(\sum_{j=1}^{t}\xi_{j})^{-1}$, $\xi_{t}$ and $\hat{Y}_{t}$ are $\text{\ss}_{t-1}$-measurable, it follows that $E(X_{t}\,|\,\text{\ss}_{t-1})=0$, where $E$ is taken with respect to $\Pi(\cdot\,|\,\text{\ss}_{t-1})$, and hence that $\sum_{t=1}^{k}X_{t}$ is a martingale adapted to $\text{\ss}_{k-1}$. Also, $E\big((\sum_{t=1}^{k}X_{t})^{2}\big)=\sum_{t=1}^{k}E(X_{t}^{2})\leq\lambda\,E\big\{\sum_{t=1}^{k}\big((\sum_{j=1}^{t}\xi_{j})^{-1}\xi_{t}\big)^{2}\big\}\leq\frac{\lambda\pi^{2}}{6}$, because $Y_{t}$ is an indicator variable and so $\mathrm{var}(Y_{t}\,|\,\text{\ss}_{t-1})$ is uniformly bounded above by some $\lambda$ with $0\leq\lambda<\infty$. Then, by the martingale convergence theorem, $\sum_{t=1}^{k}X_{t}$ converges with $\Pi$-probability one, which implies by Kronecker’s lemma that, with $\Pi$-probability one, $p_{k}-\alpha=(\sum_{t=1}^{k}\xi_{t})^{-1}\sum_{t=1}^{k}\xi_{t}(Y_{t}-\hat{Y}_{t})\rightarrow 0$, where $\hat{Y}_{t}=\alpha$ $\forall t$. $Q.E.D.$
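
For intuition, the following minimal simulation (our own illustration, not part of the proof) exhibits the statement of Theorem 4.1 numerically: along rounds picked out by a selection variable $\xi_{t}$ that does not look at the current outcome, the empirical frequency of an indicator with conditional mean $\alpha$ approaches the constant forecast $\alpha$. The value $\alpha=0.4$ and the coin-flip selection rule are arbitrary assumptions.

# Minimal sketch (our illustration) of Theorem 4.1: on rounds selected by
# xi_t = 1, the empirical frequency p_k of the indicators Y_t approaches
# the constant forecast alpha. alpha = 0.4 and the selection rule are
# arbitrary assumptions, not taken from the paper.
import random

random.seed(1)
alpha = 0.4
selected = []
for t in range(200_000):
    xi_t = random.random() < 0.5               # selection decided before seeing Y_t
    y_t = 1 if random.random() < alpha else 0  # E[Y_t | past] = alpha
    if xi_t:
        selected.append(y_t)
p_k = sum(selected) / len(selected)
print(p_k, abs(p_k - alpha))                   # p_k is close to alpha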

Proof of Lemma 4.5 Let $A_{t}$ be an event token at time $t$ and let $P(A|E)=\alpha$ be the true probability of event type $A$ conditional on event type $E$, whose event tokens are denoted by $A_{t}$ and $E_{t}$, respectively. Then, by the definition of $E$ with respect to $A$, $P(A_{t+1}\,|\,E_{t}\in\text{\ss}_{t})=\alpha$ with true probability $P$-one. Now, once $P(A_{t+1}\,|\,E_{t}\in\text{\ss}_{t})$ is learned as such at some $t_{0}$, then $E_{t_{0}}$ must have happened at that time, and so $P(E_{t_{0}})\neq 0$. Also, by 4.4, consider a subsequence of $E_{t_{k}}$’s where $P(E_{t_{k}})\neq 0$ for any $t_{k}>t_{0}$. Then, for this subsequence, $P(E_{t_{0}}\,\&\,E_{t_{k}})\neq 0$ for any $t_{k}>t_{0}$, because the $E_{t_{k}}$’s are independent of one another.

Here, the $E_{t_{k}}$’s are independent for the following reason: recall that, by definition, $P(A_{t_{k}+1}\,|\,E_{t_{k}}\in\text{\ss}_{t_{k}})=\alpha$ with true probability $P$-one. Then, note that $\text{\ss}_{t_{k}}$ includes the fact that $P(A_{t_{k-i}+1}\,|\,E_{t_{k-i}}\in\text{\ss}_{t_{k-i}})=\alpha$ for some $i\geq 1$. Now, without loss of generality, let $i=1$. Thus, we obtain


(1)       $P\big(\,P\big(A_{t_{k}+1}\,\big|\,\{P(A_{t_{k-1}+1}|E_{t_{k-1}})=\alpha\}\in\text{\ss}_{t_{k}}\big)=\alpha\big)=1$

Now that $E_{t_{k}}$ and $E_{t_{k-1}}$ are both included in $\text{\ss}_{t_{k}}$ by (1), to show that the $E_{t_{k}}$’s are independent we need to prove that


(2)       $P\big(\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha\}\,\big|\,\{P(A_{t_{k-1}+1}|\text{\ss}_{t_{k-1}})=\alpha\}\big)=P\big(\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha\}\big)$

But (2) is satisfied because $P(\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha\})=1=P(\{P(A_{t_{k-1}+1}|\text{\ss}_{t_{k-1}})=\alpha\})$.

Now that $P(E_{t_{0}}\,\&\,E_{t_{k}})\neq 0$ for any $t_{k}>t_{0}$ in this subsequence, we can always find some small enough $\epsilon>0$ such that $P(E_{t_{k}})>\epsilon$. Therefore, the probability of the elements of this subsequence does not vanish to zero, which implies that $\lim_{s\rightarrow\infty}P(E_{t_{0}}\,\&\,E_{t_{s}})\neq 0$. Since $\lim_{s\rightarrow\infty}P(E_{t_{0}}\,\&\,E_{t_{s}})\neq 0$, $\sum_{s=1}^{\infty}P(E_{t_{0}}\,\&\,E_{t_{s}})=\infty$. Then, by the second Borel–Cantelli lemma, $P(E_{t_{0}}\,\&\,E_{t_{s}}\ i.o.)=1$ for $s>0$, which means $P(E_{t_{0}}\in\text{\ss}_{t_{0}}\ \&\ E_{t_{k}}\in\text{\ss}_{t_{k}}\ i.o.)=1$ for $t_{k}>t_{0}$, the desired result. $Q.E.D.$

Proof of Theorem 4.6 Suppose that, for infinitely many $t$’s when $P(A_{t+1}|\text{\ss}_{t})$ stays the same as $\alpha$, machines learn this $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ at time $t$. Then, by the Success Criterion (1), $\Pi(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha=P(A_{t_{k}+1}|\text{\ss}_{t_{k}})$ at least infinitely often out of those infinite opportunities at $t$’s to learn. (We prove in Corollary 4.37 what we mean exactly by “most of the time.” Here we tentatively mean “at least $i.o.$” by it, because machines are otherwise wrong too often to learn given the Success Criterion (1).) Thus we can construct a test set consisting of the subsequence of $\Pi(A_{t_{k}+1}|\text{\ss}_{t_{k}})$ that is equal to $P(A_{t_{k}+1}|\text{\ss}_{t_{k}})$ for those infinitely many $t_{k}$’s. Let $\xi_{t_{k}+1}=1$ if and only if $\Pi(A_{t_{k}+1}|\text{\ss}_{t_{k}})=P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha$. Note that $\xi_{t_{k}+1}$ is $\text{\ss}_{t_{k}}$-measurable, because the machine’s forecasting of $\alpha$ occurs at time $t_{k}$. Then, by Theorem 4.1, with true probability $P$-one, $p_{k}-\alpha=(\sum_{j=0}^{k-1}\xi_{t_{j}+1})^{-1}\sum_{j=0}^{k-1}\xi_{t_{j}+1}(Y_{t_{j}+1}-\alpha)\rightarrow 0$ as $k\rightarrow\infty$, where $P$ is defined over $\text{\ss}_{\infty}=\bigvee_{k=0}^{\infty}\text{\ss}_{t_{k}}$ and $\text{\ss}_{t_{k}}$ denotes the totality of true facts up to day $t_{k}$. $Q.E.D.$

Proof of Lemma 4.10 Clearly, if $p_{k}\rightarrow\alpha$ with $P$-probability one, then $E[p_{\infty}-\alpha]=0$, where the expectation is taken with respect to the true probability $P$; but not vice versa. The converse does not necessarily hold because, even when $P(p_{k}\rightarrow\alpha)<1$, we can have $E[p_{\infty}-\alpha]=0$ if $[p_{k}-\alpha]$ converges to $\pm\beta\neq 0$ with equal probability $\frac{1}{2}(1-P(p_{k}\rightarrow\alpha))>0$ each. However, $p_{k}\rightarrow\alpha$ with $P$-probability one if and only if $E|p_{\infty}-\alpha|=0$, for the following reason: letting $\Lambda_{\infty}$ denote the event that $p_{k}\rightarrow\alpha$ as $k$ goes to infinity, $E|p_{\infty}-\alpha|=P(\Lambda_{\infty})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{+}}+(1-P(\Lambda_{\infty}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{-}}=0$ if and only if $P(p_{k}\rightarrow\alpha)=1$, where $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{+}}$ denotes the value of $|p_{\infty}-\alpha|$ when $\Lambda_{\infty}$ occurs, while $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{-}}$ denotes that value when $\Lambda_{\infty}$ does not occur. Here, the “if” part is clear. For the “only if” part, if $P(p_{k}\rightarrow\alpha)<1$, then $(1-P(\Lambda_{\infty}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{-}}>0$ while $P(\Lambda_{\infty})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{+}}=0$, which implies that $E|p_{\infty}-\alpha|\neq 0$. $Q.E.D.$
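
To make the failure of the converse concrete, here is a minimal worked instance (our own numbers, chosen only for illustration): take $P(\Lambda_{\infty})=\frac{1}{2}$ and let $p_{k}-\alpha$ converge to $+\beta$ or $-\beta$, each with probability $\frac{1}{4}$, when $\Lambda_{\infty}$ fails. Then

\[
E[p_{\infty}-\alpha]=\tfrac{1}{2}\cdot 0+\tfrac{1}{4}\beta+\tfrac{1}{4}(-\beta)=0,
\qquad
E|p_{\infty}-\alpha|=\tfrac{1}{2}\cdot 0+\tfrac{1}{4}\beta+\tfrac{1}{4}\beta=\tfrac{\beta}{2}\neq 0,
\]

so the signed expectation vanishes even though $p_{k}\rightarrow\alpha$ fails with probability $\frac{1}{2}$, while the absolute expectation detects the failure.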

Proof of Lemma 4.11 By Fatou’s lemma, $E[\liminf_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]\leq\liminf_{k\rightarrow\infty}E[\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]=\liminf_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})\leq\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})\leq E[\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]$. Now, since $p_{\infty}$ exists by assumption, $\liminf_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}=\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}$. Then, by the squeeze theorem, $\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})$ also exists, and thus $E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]=\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})$. Now, by the law of iterated expectations, $E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}]-\alpha=E\big[E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]-\alpha\big]=E\big[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big]$. Therefore, $E[p_{\infty}-\alpha]=0$ if and only if $E\big[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big]=0$. Also, $E|p_{\infty}-\alpha|=E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\big|=E\big[E\big[\,\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\big|\;\big|\;\text{\ss}_{t_{j}}\big]\big]$. But note that $E\big[E\big[\,\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\big|\;\big|\;\text{\ss}_{t_{j}}\big]\big]\geq E\,\big|E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\,|\,\text{\ss}_{t_{j}}]\big|=E\,\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big|$ by Jensen’s inequality. Therefore, $E|p_{\infty}-\alpha|\geq E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big|$. $Q.E.D.$

Proof of Lemma 4.15 Consider a simple two-player game $(I,S_{i},u_{i}(s))$ between Nature (player $i$) and a representative machine (player $-i$), where $I$ is the set of players $\{i,-i\}$, $S_{i}$ is the set of pure strategies $s_{i}$ for each player $i$, and $u_{i}(s)$ is the usual payoff function for player $i$. Since this is a probabilistic forecasting game, the pure strategy of each player can be any number in $\Re[0,1]$. But since the numbers computable by player $-i$ are countably many, we restrict $S_{i}$ to be countable. For simplicity, let $u_{i}:S_{i}\times S_{-i}\rightarrow\{-1,1\}$. In other words, for each profile $s=(s_{i},s_{-i})$, player $i$ obtains $1$ if she wins, and $-1$ otherwise. When Nature (player $i$) succeeds in deviating from the machine forecast, Nature wins; otherwise, the machine (player $-i$) wins. Thus, this is a kind of matching game with countably infinite state space.
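
To see concretely why such a matching game admits no pure-strategy Nash equilibrium, the following sketch (our own illustration; the two forecast values are arbitrary assumptions) restricts both players to two strategies and checks every pure profile:

# Minimal sketch (our illustration): a 2x2 restriction of the forecasting
# matching game. Nature gets +1 by deviating from the machine's forecast;
# the machine gets +1 by matching Nature. No pure profile is stable.
from itertools import product

forecasts = [0.3, 0.7]   # arbitrary restriction of the countable strategy sets

def payoffs(nature, machine):
    """Return (Nature's payoff, machine's payoff) for a pure profile."""
    return (-1, 1) if nature == machine else (1, -1)

def is_pure_nash(nature, machine):
    u_n, u_m = payoffs(nature, machine)
    nature_ok = all(payoffs(n, machine)[0] <= u_n for n in forecasts)
    machine_ok = all(payoffs(nature, m)[1] <= u_m for m in forecasts)
    return nature_ok and machine_ok

print([p for p in product(forecasts, forecasts) if is_pure_nash(*p)])  # -> []

The unique equilibrium mixes both strategies with probability $\frac{1}{2}$ each, which is consistent with the strictly interior second-order probability $0<P(\cdot)<1$ derived in part (ii) below.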

First, let us note that the structure of the forecasting game is given to Nature, because the structure itself is something objective about the world and thus belongs to the realm of Nature herself. In other words, it is certain to Nature whether she and the machine move simultaneously in the game, as follows. If the machine moves before Nature has moved, then it is certain to Nature that the machine moves first, and thus that the game is not simultaneous. If neither the machine nor Nature has moved yet, then it is certain to Nature that the machine does not move first, and whether the game is simultaneous or not depends on Nature herself. If Nature reveals herself to the machine before the machine moves, so that the machine can move after observing Nature’s move, then it is certain to Nature that the game is not simultaneous. Otherwise, it is certain to Nature that it is a simultaneous game.

$(i)$ Proof of the “only if” part: first, fix the machine forecast $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ and consider the relevant test set. Now, suppose that the forecasting game along the stochastic path of this test set is not a simultaneous-move game at time $t$. Then either Nature or the machine moves first, and the other moves later, after observing what move the opponent takes. Thus, the one who can observe the opponent’s move can control her own forecast so as to win the game, and so $\Delta_{t}$ occurs or does not occur at time $t$, which is certain to Nature because the structure of the game is given to her. Then, since $\text{\ss}_{t}$ includes $\Delta_{t}$ or $\lnot\Delta_{t}$ as part of the true facts by 4.2, $P(\Delta_{t}\in\text{\ss}_{t})=1$ or $P(\lnot\Delta_{t}\in\text{\ss}_{t})=1$. Thus, either $P\big(P(A_{t+1}\,|\,\Delta_{t}\in\text{\ss}_{t})=\alpha\big)=1$ or $P\big(P(A_{t+1}\,|\,\lnot\Delta_{t}\in\text{\ss}_{t})=\alpha\big)=0$, according as Nature moves first or the machine moves first. Therefore, the true second-order probability $P$ is neither strictly less than $1$ nor strictly greater than $0$.

$(ii)$ Proof of the “if” part: again, fix the machine forecast $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ and consider the relevant test set. Now, suppose that the forecasting game is a simultaneous-move game at time $t$. Then, for any fixed value $\alpha\in\Re[0,1]$, it is not certain to Nature herself whether $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ or not, because there exists no pure-strategy Nash equilibrium in this simultaneous matching game. Thus, Nature cannot with certainty control $P(A_{t+1}|\text{\ss}_{t})$ so as to make it deviate from $\Pi(A_{t+1}|\text{\ss}_{t})$, and so we obtain


(3)      $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)\neq 0$.


(3) holds even though the $\text{\ss}_{t}$ in $P(A_{t+1}\,|\,\text{\ss}_{t})$ in (3) includes $\Delta_{t}$ or $\lnot\Delta_{t}$ as part of the true facts by 4.2, if either of them indeed occurs at $t$. By the same logic, it is not certain to Nature that the machine can control $\Pi(A_{t+1}|\text{\ss}_{t})$ so as to make it coincide with $P(A_{t+1}|\text{\ss}_{t})$, and so we obtain


(4)       $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)\neq 1$.


Clearly, any mixed-strategy Nash equilibrium, if any, will lead to $0<P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$. Therefore, there exists the true second-order probability $P$ such that $0<P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$.

Furthermore, if Nature moves first, then $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)=1$, as we proved in $(i)$. Therefore, if the machine does not move first, which amounts to either Nature moving first or the machine moving simultaneously with Nature, then clearly $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)\neq 0$. $Q.E.D.$

Proof of Theorem 4.16 Consider the necessary condition (2): if a machine learns the true objective probability $P(A_{t+1}|\text{\ss}_{t})$, then $\Pi(A_{t+1}|\text{\ss}_{t})=P(A_{t+1}|\text{\ss}_{t})$. Since this is just a necessary but not a sufficient condition, the converse of (2) does not necessarily hold. Now, for any machine forecast $\alpha\in\Re[0,1]$, suppose that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ for infinitely many $t$’s along the stochastic path where the associated $A_{t+1}$’s occur, but that $P(A_{t+1}|\text{\ss}_{t})=\alpha$ for infinitely many $t^{\ast}$’s. Then, by Theorem 4.19, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\ i.o.\big)>0$ for some event $A_{t+1}$. Thus, by (Case 3) of Theorem 4.17 and Theorem 4.6, the machine cannot learn the true probability $P(A_{t+1}|\text{\ss}_{t})$, even though $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha=P(A_{t+1}|\text{\ss}_{t})$ at infinitely many $t^{\ast}$’s. Thus, the machine does not learn that it wins even though it indeed wins at the $t^{\ast}$’s. Clearly, the machine does not learn whether it wins at the other $t$’s, when it loses. Now, since the machine does not learn whether it wins at each round of the game, it does not learn what its payoff is at each round. Furthermore, the machine is truly guaranteed to be well-calibrated along the path of $t^{\ast}$’s, and so this is a winning strategy in the forecasting game between Nature and the machine (e.g., (Foster & Vohra, 1993)), but the machine still cannot learn the true probability $P(A_{t+1}|\text{\ss}_{t})$. Thus, in this case, a winning strategy is not equivalent to a learning strategy. $Q.E.D.$

Proof of Theorem 4.17 First, let us recall the following: by Nature’s perversity with true probability $0$, we mean that $P(M_{t}\text{ at least }i.o.)=0$ for any fixed $\alpha\in\Re[0,1]$. Here, $M_{t}$ denotes the meta-event $\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ for any event $A_{t+1}$ at time $t\}$ for such a fixed forecast $\alpha$. Given this, let us consider the following three cases, according as how $P(A_{t+1}|\text{\ss}_{t})$ actually varies with respect to $\alpha$ along the path of the test set. (Case 3) amounts to Theorem 4.17.


(Case 1) Suppose that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ for only finitely many $t$’s along the stochastic path. Now, as in Theorem 4.1, let $X_{t}=(\sum_{j=1}^{t}\xi_{j})^{-1}\xi_{t}(Y_{t}-\alpha)$. But, unlike in Theorem 4.1, here $\xi_{j}=1$ if $P(A_{j+1}|\text{\ss}_{j})=\alpha$, for all $j$ along the stochastic path, not necessarily restricted to the test set. Now, consider those finitely many $t$’s when $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$, and denote the largest among them by $t_{m}$. Then $P(A_{t+1}|\text{\ss}_{t})-\alpha=E[Y_{t}|\text{\ss}_{t-1}]-\alpha=0$, $\forall t>t_{m}$ along the stochastic path. Thus, $E(X_{t}|\text{\ss}_{t-1})=0$, where the expectation $E$ is taken with respect to the true probability $P(\cdot|\text{\ss}_{t-1})$, and so $\sum_{t=t_{m}+1}^{k}X_{t}$ is a martingale adapted to $\text{\ss}_{k-1}$ at $t>t_{m}$ along the path. Then, by the martingale convergence theorem and Kronecker’s lemma, $(\sum_{j=0}^{k-1}\xi_{t_{j}+1})^{-1}\sum_{j=0}^{k-1}\xi_{t_{j}+1}(Y_{t_{j}+1}-\alpha)\rightarrow 0$ with true probability $P$-one.


(Case 2) Consider the case where, with true probability $P>0$, $P(A_{t+1}|\text{\ss}_{t})$ deviates from $\alpha$ in the manner of Oakes (1985) along the test set. Then $E|p_{\infty}-\alpha|\neq 0$, and so the calibration property is not truly guaranteed, for the following reason. Let $\Lambda_{\infty}^{o}$ be the event that $P(A_{t+1}|\text{\ss}_{t})$ deviates from $\alpha$ in the manner of Oakes (1985) along the test set. Then, since some subsequence of the $Y_{t}$’s along the test set forms a Bernoulli sequence whose relative frequency converges to $f(\alpha)\neq\alpha$, $p_{k}$ does not converge to $\alpha$ when $\Lambda_{\infty}^{o}$ occurs. Now, let $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{+}$ be the value of $|p_{\infty}-\alpha|$ when $\Lambda_{\infty}^{o}$ occurs, and $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{-}$ its value when $\Lambda_{\infty}^{o}$ does not occur along the test set. Then, by the same logic as in Lemma 4.11, we obtain $E|p_{\infty}-\alpha|=P(\Lambda_{\infty}^{o})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{+}+(1-P(\Lambda_{\infty}^{o}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{-}\neq 0$. Thus, $P(p_{k}\rightarrow\alpha)\neq 1$. However, the converse does not hold, for there can be many other ways in which $p_{k}$ fails to converge to $\alpha$ than in Oakes (1985). Hence it does not follow that $P(\Lambda_{\infty}^{o})>0$, even if $E|p_{\infty}-\alpha|\neq 0$.

Now, suppose that, with $\Pi$-subjective probability $>0$, $P(A_{t+1}|\text{\ss}_{t})$ behaves in the manner of Oakes (1985). Then, again by the same logic as in Lemma 4.11, we obtain $E|p_{\infty}-\alpha|=\Pi(\Lambda_{\infty}^{o})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{+}+(1-\Pi(\Lambda_{\infty}^{o}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{-}\neq 0$, where the expectation is now taken with respect to $\Pi$. Hence $\Pi(p_{k}\rightarrow\alpha)\neq 1$. Therefore, we conclude that if Oakes (1985) holds with $\Pi$-subjective probability $>0$, then Dawid (1982) does not hold, which amounts to a proof of Theorem 4.8.


(Case 3) In general, suppose that the true probability of Nature’s being perverse is not zero for any fixed forecast $\alpha$ on any associated events $A_{t}$’s. In other words, suppose that $P(M_{t}\text{ at least }i.o.\text{ along the test set})>0$, where $M_{t}$ is the meta-event that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$. Then we claim that this implies $E|p_{\infty}-\alpha|\neq 0$, where $E$ is taken with respect to $P$.

First, suppose that $p_{\infty}$ exists. Also, suppose that $\alpha\neq 0$, because (Case 3) trivially holds if $\alpha=0$. Now consider an infinite subsequence of the $A_{t_{k}}$’s, $\{A_{t_{k_{j}}}\}_{j=0}^{\infty}$, which is conditionally identically distributed along the test set where $M_{t}$ occurs at least infinitely often. We can do this by Kolmogorov axioms 1 and 2 and Lemma 4.5, for the following reason: note that by Kolmogorov axioms 1 and 2 there always exists some $\beta\in\Re[0,1]$ such that $P(A|E)=\beta$ for any type events $A$ and $E$, given that the probability of the type event exists at all. Then, for this $\beta$, $P\big(P(A_{t+1}|\text{\ss}_{t})=\beta\ i.o.\big)=1$ according to Lemma 4.5. Thus we have found one subsequence of $\{A_{t_{k}}\}_{k=0}^{\infty}$ that is conditionally identically distributed, in the sense that $\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\beta\}_{k=0}^{\infty}$. Now, fix $\alpha$. Also, without loss of generality, suppose that $\beta\neq\alpha$. Since $\beta\neq\alpha$ is arbitrary, from this subsequence we can consider another subsequence $E_{A}$ of $\{A_{t_{k_{j}}}\}_{j=0}^{\infty}$ with true probability $P>0$ such that $E_{A}=\{P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta\}_{j=0}^{\infty}$ along the stochastic path of the test set in which $M_{t}$ occurs at least infinitely often.

For reductio, suppose that Nature deviates from $\alpha$ by picking numbers from uncountably many values of $\beta$, such that every value of $\beta$ equals $P(A_{t+1}|\text{\ss}_{t})$ at only at most finitely many $t$’s along the test set, with true probability $P$-one. In other words,

(5) For $\beta\in\Re[0,1]$ with $\beta\neq\alpha$, $P(A_{t+1}|\text{\ss}_{t})=\beta$ for at most finitely many $t$’s along the path of the test set where $M_{t}$ occurs at least infinitely often, with true probability $P$-one.

Note that there must be a countably infinite number of different $\beta$’s in (5). Let us denote each different $\beta$ at each time along the path by $\beta_{t_{k_{j}}}$, letting $\beta_{t_{k_{i}}}\neq\beta_{t_{k_{j}}}$ for $i\neq j$ without loss of generality. Now, recall that $p_{\infty}$ is assumed to exist along the stochastic path of the test set. Thus, inspired by this assumption, let us further assume that $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})$ exists, where $P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}$ or $P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\alpha$ along the path of the test set. Then, letting

\[
\xi_{t_{k_{j}}}:=\begin{cases}1 & P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\alpha\\ 0 & P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}\end{cases}
\]

(6) $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\big[\xi_{t_{k_{j}}}\,P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})+(1-\xi_{t_{k_{j}}})\,P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})\big]=\alpha\cdot\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}+\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}$.

Thus,

(7) $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\alpha$ if and only if $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}=\alpha\cdot\big(1-\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}\big)$.

In other words, if Nature deviates from the machine forecasts by the $\beta_{t_{k_{j}}}$’s so that her deviating forecasts on average satisfy (7) under (5), then $E|p_{\infty}-\alpha|=0$, and thus the test set is truly guaranteed to be well-calibrated. But Nature then loses the repeated forecasting games along the path in the long run. So Nature has no reason to behave in this way with true probability $P$-one. Let us then consider the following three cases:

(Case i) $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)=0$ at least $i.o.$

In this case, by Lemma 4.15, Nature observes the machine forecast $\alpha$ at each time $t_{k_{j}}$, whenever the machine predicts $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Now that $1=\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)$,

Nature would choose the deviating values $\beta_{t_{k_{j}}}$ in such a way that she would not allow (7) to hold with true probability $P$-one. Thus,

(8) $P\Big(\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}=\alpha\cdot\big(1-\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}\big)\Big)\neq 1$.

In other words, since Nature observes the machine forecast $\alpha$ at every time, she would deviate from each forecast $\alpha$ at $t_{k_{j}}$ in such a way that (8) holds in the end. Otherwise, $E|p_{\infty}-\alpha|=0$, and Nature would lose in the long run. Therefore, we conclude from (8) that $E|p_{\infty}-\alpha|\neq 0$ in case (i).

(Case ii) $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)=1$ at least $i.o.$

In this case, by Lemma 4.15, Nature moves first, so the machine cannot fail to match $P(A_{t+1}|\text{\ss}_{t})$. But then,

$1=\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\text{ at least }i.o.\big)=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at most }f.o.\big)$, which contradicts (5). Therefore, we exclude case (ii) under (5).

(Case iii) $0<P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$ at least $i.o.$

In this case, by Lemma 4.15, Nature moves simultaneously with the machine, so Nature has no reason to pick any particular $\beta_{t_{k_{j}}}\in\Re[0,1]$ at each $t_{k_{j}}$, for there exists no pure-strategy Nash equilibrium. Hence any combination of $\{\beta_{t_{k_{j}}}\}_{j=0}^{\infty}$ is equally likely. Now, without loss of generality, fix $\alpha$ and $\xi_{t_{k_{j}}}$ for each $t_{k_{j}}$. Then we claim that

(9) $P\big(\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}\to c\alpha\big)<P\big(\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}\to c\alpha^{-}\big)\leq 1$

where $c=1-\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}$ for some fixed $c$, and $c\alpha\in C$ for some fixed $\alpha$ and some set $C$ such that every $x\in C$ satisfies $x\in\Re[0,1]$ but $C$ is countably infinite, and $c\alpha^{-}$ is any real number in $C\setminus\{c\alpha\}$, the set $C$ without $c\alpha$.

First, recall that $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}$ exists. Then, by definition,

$\forall\epsilon>0$, $\exists N_{1}<\infty$ such that $\big|\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}-c\alpha\big|<\epsilon$, $\forall h>N_{1}$,

$\forall\epsilon>0$, $\exists N_{i}<\infty$ such that $\big|\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}-c_{i}\alpha^{-}\big|<\epsilon$, $\forall h>N_{i}$ $(1\neq i\in\mathbb{N})$.

Now, letting $N=\max(N_{1},N_{i})$, $\forall\epsilon>0$,

(10) $P\big(\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:\big|\frac{1}{h}\sum_{j=0}^{h-1}[P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}]-c\alpha\big|>\epsilon,\ \forall h>N\}\big)<P\big(\bigcup_{i=0}^{\infty}\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:\big|\frac{1}{h}\sum_{j=0}^{h-1}[P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}]-c_{i}\alpha^{-}\big|>\epsilon,\ \forall h>N\}\big)\leq 1$.

Therefore, we again obtain (8) by (10). We have now considered all possible cases under (5), all of which lead to $E|p_{\infty}-\alpha|\neq 0$. But this result is what we are trying to show in this proof anyway. Therefore, to continue the proof, let us accept that there exists such a set $E_{A}$ with true probability $P>0$.

Now, note that $E_{A}=\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:1_{\{\omega\}}=1$ when $P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta\neq\alpha$ for all $t_{k_{j}}$’s along the test set$\}\subset\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:1_{\{\omega\}}=1$ when $\big|\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})-\alpha\big|\neq 0$ for all $t_{k_{j}}$’s along the test set$\}$. Then, since $P(E_{A})>0$, $P\big(\big|\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})-\alpha\big|\neq 0$ for all $t_{k_{j}}$’s along the test set$\big)>0$. Thus, since we have found one such subsequence of $\{\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})\}_{h=1}^{\infty}$ along the test set with true probability $P>0$, and $p_{\infty}$ exists, $P\big(\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{t=0}^{k-1}P(A_{t+1}|\text{\ss}_{t})-\alpha\big|\neq 0$ along the test set$\big)>0$ for $\alpha\neq 0$. Then, by the same reasoning as in Lemma 4.10, $E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{t=0}^{k-1}P(A_{t+1}|\text{\ss}_{t})-\alpha\big|\neq 0$. Now, by Lemma 4.11, we obtain $E|p_{\infty}-\alpha|\geq E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{t=0}^{k-1}P(A_{t+1}|\text{\ss}_{t})-\alpha\big|\neq 0$ when $p_{\infty}$ exists. Clearly, when $p_{\infty}$ does not exist, $E|p_{\infty}-\alpha|\neq 0$.

Therefore, we conclude that if $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)>0$, then $E|p_{\infty}-\alpha|\neq 0$. $Q.E.D.$

Proof of Theorem 4.19 First, note that, with $P$-probability $>0$, $P(A_{t+1}|\text{\ss}_{t})\neq 1$ at least infinitely often for some event $A_{t+1}$. Otherwise, beyond the near future, all events $A_{t+1}$ would certainly continue to occur with $P$-probability one, and thus there would be no uncertainty about any of the $A_{t+1}$’s. Now, if this were the case, then we would have to stop here and simply conclude that no machine can learn the true probability of any $A_{t+1}$, simply because there is no uncertainty for any machine to measure by the true probability in our world. Therefore, to continue to prove our main claim, we accept that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq 1\text{ at least }i.o.\big)>0$ for some event $A_{t+1}$. Now, consider the test set where $\alpha^{\ast}=1$. Then, along the stochastic path of this test set, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha^{\ast}\text{ at least }i.o.\big)>0$. Therefore, we have found some $\alpha^{\ast}$ for which Nature is perverse with true probability $P>0$.

Now, suppose that, for any $\alpha$, $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$ at least for infinitely many $t$’s. In other words, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\big)>0$ at least $i.o.$ Then $0<\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)$. Thus, by Definition 4.18, Nature is uniformly perverse, which again means, by Definition 4.13, that $P(\text{Nature is perverse})>0$ for any $\alpha\in\Re[0,1]$. $Q.E.D.$

Proof of Theorem 4.20 Suppose that, for any $\alpha$, $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$ at least for infinitely many $t$’s. Then, by Theorem 4.17 and Theorem 4.19, $E|p_{\infty}-\alpha|\neq 0$, and so $P(p_{k}\rightarrow\alpha)\neq 1$ for any $\alpha\in\Re[0,1]$, where $P$ is the true objective probability defined over $\text{\ss}_{\infty}=\bigvee_{t=0}^{\infty}\text{\ss}_{t}$ and the expectation $E$ is taken with respect to this true probability $P$. Then, by Theorem 4.6, the machine cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$. $Q.E.D.$

Proof of Lemma 4.23 Suppose that the machine effectively calculates $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ with the goal of learning the true value of $P(A_{t+1}|\text{\ss}_{t})$. Then, by the necessary condition for learning, the machine must return a $\Pi(A_{t+1}|\text{\ss}_{t})$ that is congruent to $P(A_{t+1}|\text{\ss}_{t})=\alpha$ in order to achieve this goal. Now, suppose further that the machine at the same time calculates $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})\neq 0$. Then the machine tolerates error, by Definition 4.21.

However, by Theorem 4.6, the machine cannot tolerate errors infinitely often while achieving this goal of learning, for the following reason: for any $\alpha\in\Re[0,1]$, suppose that $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ but $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})>0$ infinitely often. Now, since it must be that $P(A_{t+1}|\text{\ss}_{t})=\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ in order to learn the true probability, it must also be, by Theorem 4.6, that $P(p_{k}\rightarrow\alpha)=\Pi(p_{k}\rightarrow\alpha)=1$. But now, by assumption, $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})>0$ infinitely often, which leads to $0<\limsup_{t\rightarrow\infty}\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})\leq\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\}\text{ at least }i.o.)$. But this contradicts $\Pi(p_{k}\rightarrow\alpha)=1$, by the same reasoning as in the proof of (Case 3) in Theorem 4.17 with $P$ replaced by $\Pi$, and so the machine cannot learn the true probability, by Theorem 4.6. Therefore, the machine cannot tolerate errors infinitely often if it aims to learn the true probability. Since $\alpha$ was arbitrary in $\Re[0,1]$, let $\alpha=0$, which gives the desired result. $Q.E.D.$

Proof of Lemma 4.28 (i) Proof of the “if” part: suppose that there exists a stopping time $t_{s}<\infty$ for some forecast $\alpha_{0}$ such that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$, $\forall t>t_{s}$, while there exists no stopping time for any other $\alpha\neq\alpha_{0}$, so that $P(A_{\alpha\neq\alpha_{0}}(t+1)|\text{\ss}_{t})>0$ at least infinitely often. Then, by the definition of $A_{\alpha_{0}}(t+1)$ and the law of iterated expectations,

(11)     $P(A_{\alpha_{0}}(t+1))\searrow P\big(\lim_{t\to\infty}A_{\alpha_{0}}(t+1)\big)$, because $A_{\alpha_{0}}(t+1)\searrow\lim_{t\to\infty}A_{\alpha_{0}}(t+1)$.

Now that $\lim_{t\to\infty}A_{\alpha_{0}}(t+1)$ is the event that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.$, so that the limit exists,

(12)     $0=\lim_{t\to\infty}P(A_{\alpha_{0}}(t+1))=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)$ for $\alpha_{0}$.

Also, by the same logic as for $\alpha_{0}$,

(13)     $0<\lim_{t\to\infty}P(A_{\alpha}(t+1))=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)$ for any $\alpha\neq\alpha_{0}$.

Thus, by Definition 4.25, Nature is selectively perverse.

(ii) Proof of the “only if” part: suppose that Nature is selectively perverse. Then, by Definition 4.25, there must exist some $\alpha_{0}$ such that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$. Now, for reductio, suppose that for any such $\alpha_{0}$ there exists no stopping time $t_{s}$, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})>0$ at least infinitely often. In other words, Nature keeps changing her mind infinitely often between perversity and non-perversity, or Nature keeps being perverse all along. Then, by the law of iterated expectations, $P(A_{\alpha_{0}}(t+1))>0$ at least infinitely often, which contradicts the selective perversity of Nature by the same reasoning as in (13). $Q.E.D.$

Proof of Lemma 4.29 For any given $\alpha_{0}$ with which Nature is not perverse with true probability $P$-one, there exists $t_{s}<\infty$ for this $\alpha_{0}$ by Lemma 4.28. Now, by assumption, machines learn that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$. Thus, $\Pi(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$ by the necessary condition for learning. Then, by Lemma 4.23 and the same reasoning as for (11) in the proof of Lemma 4.28, $\Pi\big(P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0,\ \forall t>t_{s}\big)=1$. $Q.E.D.$

Proof of Corollary 4.30 (i) Suppose that Nature is selectively perverse, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$ for some $\alpha_{0}$ by Lemma 4.28. However, since the machine is assumed not to be self-assured that the stopping time $t_{s}$ arrives for that $\alpha_{0}$, the machine cannot learn that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$, by Lemma 4.29.

(ii) Now, note that if the machine learns $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$, the machine also learns that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$, in the following way: first, by Theorem 4.6 and (Case 3) in Theorem 4.17, machine learning of the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$ mathematically implies that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$. Thus, once the machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$, it cannot fail to effectively calculate the true probability $P(A_{\alpha_{0}}(t+1))$ as $0$, following Theorem 4.6 and (Case 3) in Theorem 4.17 as instructions. Then, by Definition 2.2, the machine learns that $P(A_{\alpha_{0}}(t+1))=0$, in particular $\forall t>t_{s}$, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$, while following the law of iterated expectations as an instruction. However, as we proved in (i), the machine cannot learn that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$. Hence we conclude that the machine cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$ either. $Q.E.D.$

Proof of Lemma 4.31 Suppose that the machine is not self-assured of the stopping time $t_{s}$ for $\alpha_{0}$. Then,

(14)     $\Pi\big(P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0,\ \forall t>t_{s}\big)\neq 1$.

Now that $\lim_{t\to\infty}P(A_{\alpha_{0}}(t+1))=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)$ for this $\alpha_{0}$,

(15)     $\Pi\big(P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0\big)\neq 1$.

Then, since $\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$,

(16)     $\Pi\big(P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1\ \forall t>t^{\ast}\big)\neq 1$, for some $t^{\ast}<\infty$.

Now, note that along the stochastic path considered in Corollary 4.30, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$ $\forall t>t_{s}$. Now, for this $\alpha_{0}$,

(17)     $\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$

Therefore, without loss of generality, letting $t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$,

(18)     $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1$, $\forall t>t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$.

Then, without loss of generality, let $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1$ at $t^{\ast}+1$ by (18). Thus, (16) and (18) lead to the desired result by Lemma 4.15. $Q.E.D.$

Proof of Theorem 4.32 Suppose that the machine learns the true probability. Since the machine cannot learn if Nature is uniformly perverse, Nature must then be selectively perverse, so that the stopping time $t_{s}$ exists by Lemma 4.28. Then, by part (ii) of Corollary 4.30 and Lemma 4.29, the machine is self-assured of the stopping time $t_{s}$ when $t_{s}$ exists. We now finish the proof of Theorem 4.32 by showing that if the machine learns the true probability, the machine is not self-assured of the stopping time $t_{s}$ when such a $t_{s}$ does not exist.

Suppose that the machine is self-assured of the stopping time $t_{s}$ even though such a $t_{s}$ does not exist. The machine is then wrong about $t_{s}$, so it cannot learn the true probability along the path where $P(A_{\alpha}(t+1)|\text{\ss}_{t})>0$ at least $i.o.$, for the following reason: first, by Lemma 4.28, with true probability $P>0$, Nature is perverse towards the forecast $\alpha$ along the path where there is no stopping time $t_{s}$. Thus, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)>0$ for such a forecast $\alpha$. Then, by (Case 3) of Theorem 4.17 and then Theorem 4.6, the machine cannot learn that $\alpha$. In other words, the world does not exist in such a way that Nature allows the machine to learn the true probability. Notwithstanding, the machine has a wrong belief about the stochastic path of the true probability, and so cannot learn the true probability. $Q.E.D.$

Proof of Theorem 4.35 Suppose that the machine is self-assured of the stopping time $t_{s}$ along the path where, for any given $\alpha_{0}$, $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$. Then, along this path, the machine obtains

$\Pi\big(P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0\ \forall t>t_{s}\big)=1$, and so $\Pi(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$ by Lemma 4.23.

Now, by the definition of $A_{\alpha_{0}}(t+1)$ and Lemma 4.23 again,

$\Pi(A_{t+1}|\text{\ss}_{t})=\alpha_{0}$, $\forall t>t^{\ast}>t_{s}$ for some $t^{\ast}<\infty$.

Note also that $P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}$, $\forall t>t^{\ast}>t_{s}$ for some $t^{\ast}<\infty$ along this path. Thus,

(19)     $P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}=\Pi(A_{t+1}|\text{\ss}_{t})$, $\forall t>t^{\ast}$ with $t^{\ast}<\infty$.

Then, as in Theorem 4.6, we can construct a test set along the stochastic path, using the assessed $\alpha_{0}$ as a selection criterion, by (19). This test set is also truly guaranteed to be well-calibrated.

Thus, from this test set along the path, the machine obtains the following by Lemma 4.10 and Lemma 4.11:


(20)    $P\big(\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{t=t^{\ast}}^{t^{\ast}+n}P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1$ if and only if $P\big(\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{t=t^{\ast}}^{t^{\ast}+n}1_{\{A_{t+1}\}}=\alpha_{0}\big)=1$

Now, let us gather the sequence $\{A_{t+1}\}_{t=t^{\ast}}^{\infty}$ along the path and call this set a population. The machine then effectively calculates the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$ by the empirical distribution over this population, by (20), which satisfies $(i)$ in Definition 4.34. Also, this effective calculation of the empirical distribution must be successful in returning the true probability $P(A_{t+1}|\text{\ss}_{t})$, for $\frac{1}{n}\sum_{t=t^{\ast}}^{t^{\ast}+n}P(A_{t+1}|\text{\ss}_{t})$ on the left-hand side of (20) is equal to $P(A_{t+1}|\text{\ss}_{t})$, $\forall n$ and $\forall t>t^{\ast}$, by (19), which satisfies $(ii)$ in Definition 4.34. Therefore, by Definition 4.34, the machine directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$. $Q.E.D.$

Proof of Theorem 4.36 (i) Proof of the “if” part: this follows directly from Theorem 4.32 and Theorem 4.35.

(ii) Proof of the “only if” part: suppose that the machine directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ from the given population $S$ at some time $t^{\ast}$. The machine then effectively calculates $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ at $t^{\ast}$, while adopting the following as an instruction: recall that the given set $S$ consists of a sequence of events $A_{t+1}$, $\{A_{t+1}\}_{t=0}^{k-1}$ with $k$ potentially infinite. Since the set $S$ is available in principle to the machine by part $(i)$ of Definition 4.34, there must exist some rule for how to collect the available set of events $\{A_{t+1}\}_{t=0}^{k-1}$. Then let the machine build up the population $S$ by collecting events while following this rule. Now, once each event has been collected by the machine to constitute the set $S$, it must have been observed whether the event has a certain attribute of interest or not, and so a value of the indicator variable $1_{\{A_{t+1}\}}$ must have been assigned accordingly to each event $A_{t+1}$ by the machine. Then, let the machine calculate $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha=\frac{1}{k}\sum_{t=0}^{k-1}1_{\{A_{t+1}\}}$. Therefore, the machine effectively calculates $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Furthermore, note that $\frac{1}{k}\sum_{t=0}^{k-1}1_{\{A_{t+1}\}}$ is defined to be $P(A_{t+1}|\text{\ss}_{t})$ at $t^{\ast}$ by part $(ii)$ of Definition 4.34. The machine then cannot fail to compute $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ from the population $S$. Therefore, the machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ by Definition 2.2. $Q.E.D.$
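
As a minimal computational sketch of this “only if” direction (our own illustration; the collection rule and the attribute test are hypothetical, not taken from the paper), the machine walks the given population, assigns each token its indicator value, and returns the empirical frequency:

# Minimal sketch (our illustration) of the effective calculation in the
# "only if" part of Theorem 4.36: given a population S of event tokens,
# assign each token its indicator value and return the empirical frequency
# alpha = (1/k) * sum of 1_{A_{t+1}}. `has_attribute` is hypothetical.

def has_attribute(event) -> bool:
    # Hypothetical observation rule: does this collected token have the
    # attribute of interest?
    return bool(event["outcome"])

def direct_observation(population) -> float:
    """Empirical frequency of the attribute over the population."""
    indicators = [1 if has_attribute(e) else 0 for e in population]
    return sum(indicators) / len(indicators)

# Usage: a toy population of k = 5 collected event tokens.
S = [{"outcome": o} for o in (1, 0, 1, 1, 0)]
print(direct_observation(S))   # -> 0.6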

Proof of Corollary 4.37 Let us first define what we mean by “most of the time” in the Success Criterion (1). By Lemma 4.10 and Theorem 4.17, machines cannot satisfy the calibration property, when the test set is constructed with an assessed probability $\alpha$ as the selection criterion, if $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)>0$. Therefore, in order to learn, the machines must return the correct calculations except for a finite number of times out of infinite opportunities to learn. Thus, “most of the time” in the Success Criterion (1) should mean “all but finitely often out of infinite opportunities to learn,” which means that machines must be correct not merely infinitely often while also being wrong that often.

Now suppose that the machine is correct most of the time when it aims to learn the true probability $P(A_{t+1}|\text{\ss}_{t})$. Then, by (Case 1) in Theorem 4.17, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at most }f.o.\big)=1$. Thus, there exists a stopping time $t_{s}$, because $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)=0$ if and only if there exists a stopping time $t_{s}$ for any machine forecast $\alpha$, by Lemma 4.28. Furthermore, suppose that the machine is self-assured that it is correct most of the time. Then, again by Lemma 4.28, $\Pi(\text{there exists a stopping time }t_{s})=1$. Thus, if the machine satisfies the Success Criterion (1), then it satisfies the condition of Theorem 4.35. Therefore, if the machine satisfies the Success Criterion (1), it can learn the true probability by Theorem 4.35 and Theorem 4.36. $Q.E.D.$

Appendix B Some Literature for the Necessary Condition in Sec. 3.2

There is a large literature in logic and economics whose discussion bears on when a machine holds a true belief in the probabilistic proposition $A_{p}$. For example, while defining the concept of rationality in economic models, (Cogley & Sargent, 2008, 2009), (Sandroni, 2000), (Blume & Easley, 2006, 2008) and many others stipulate that an agent is rational when his/her partial beliefs are correct, in the sense that his/her subjective probability distributions are congruent to the true probability distribution that Nature identifies as such. In other words, a machine holds such a true belief in $A_{p}$ when it is rational, which entails that its subjective probability $\Pi$ is equal to the true objective probability $P$.

Also, in probabilistic logic, (Nilsson, 1986), (Halpern & Fagin, 1994), and many others follow the probabilistic version of the Tarskian semantic theory of truth in the following way: a formula describing the subjective probability of an agent is true when the agent’s probability assignment corresponds to what the sentence in fact represents. For example, in (Halpern & Fagin, 1994), a formula like $w_{i}(\varphi)\geq 2w_{i}(\psi)$ is true if, according to the probability assignment of agent $i$, the event $\varphi$ is at least twice as probable as $\psi$. Now, if we extend this idea to the true objective probability $P$, if any, a formula such as $w_{i}(\varphi)=w(\varphi)$, where $w_{i}$ denotes the probability operator of agent $i$ and $w$ denotes that of Nature, is true when, according to agent $i$’s probability assignment, the event $\varphi$ is exactly as probable as the true probability value that Nature assigns to $\varphi$ in our world.

It is worth noting, from the economics literature, when it becomes true that agent $i$’s partial belief in the event $\varphi$ has a degree $w_{i}(\varphi)$ that corresponds to the true objective probability $w(\varphi)$. This is indeed true when the subjective probability $w_{i}(\varphi)$ of agent $i$ is in congruence with the true objective probability $w(\varphi)$, which again makes the formula $w_{i}(\varphi)=w(\varphi)$ true. Therefore, the condition for any agent (or, in our context, any rational machine) to be rational in economics is equivalent to the truth condition for the formula in probabilistic logic.

Appendix C Justifications for the Three Assumptions

4.2 The $\text{\ss}_{t}$’s in $P(A_{t+1}|\text{\ss}_{t})$ are the sets of all the true facts up to time $t$.

In other words, $\text{\ss}_{t}$ is the historical path of true facts up to time $t$. To see that 4.2 is reasonable, recall that we are dealing with objective probability true to our world. Therefore, its conditioning facts must also be true in our world. Otherwise, $P(A_{t+1}|\text{\ss}_{t})$ cannot represent the true probability according to which the actual data are realized in our world. For example, if some special gravitational force operated on Mars, so that a fair coin landed on its edge as likely as on its head or tail, then the probability of the coin landing on heads conditional on this hypothesis would be $\frac{1}{3}$. However, if such a special gravitational force does not actually exist on Mars, this conditional probability $\frac{1}{3}$ cannot be true either, because the data would not be realized according to the probability $\frac{1}{3}$ in our world.
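
For concreteness, the arithmetic behind the $\frac{1}{3}$ (our own one-line derivation): under the hypothesis, the three outcomes are equally likely, so

\[
P(\text{head}\mid\text{edge as likely as head or tail})
=\frac{1}{|\{\text{head},\,\text{tail},\,\text{edge}\}|}=\frac{1}{3}.
\]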

4.3 No further knowledge requirement is imposed on the condition $\text{\ss}_{t}$.

To see that 4.3 is reasonable, note the following: if $\text{\ss}_{t}$ were the set of known facts, then $P(A_{t+1}|\text{\ss}_{t})$ could vary from person to person, as the set of events known to each person may differ, depending on who possesses what information. In order for $P(A_{t+1}|\text{\ss}_{t})$ to be objective, however, it should not depend on any particular person. Therefore, we require that $\text{\ss}_{t}$ consist of true facts, not necessarily knowledge.

4.4 Once the probability of an event type $E$ is established, its associated event tokens $E_{t_{k}}$ occur at some infinite subsequence of times $t_{k}$, so that $P(E_{t_{k}})$ does not vanish to zero as $t_{k}\rightarrow\infty$.

Here, “event token” refers to an event that occurs at some specific time and place, while “event type” refers to an abstract object with no specific space-time location. For example, cloudy weather in Denver is an abstract event type $E$ with no time subscript, while cloudy weather in Denver on 29 May 2024 is a particular event token $E_{t_{0}}$. Some literature (e.g., (Halpern, 2016)) deals mainly with probabilities of token events, while other literature (e.g., (Maher, 2010)) deals mainly with probabilities of type events. 4.4 establishes a connection between the probabilities of these two kinds of events.

To see that 4.4 is reasonable, consider the following example: suppose that we try to predict the probability that some person $i$ suffers from lung cancer caused by his/her smoking habit. As we discussed in the Introduction, this causal probability is objective, which makes it relevant to our discussion. Then, as long as the probability of the event type of having lung cancer from smoking is allowed to be considered for forecasting, we require that the true probability of the associated event tokens for the persons $i$ should not be identically zero from some time $t_{0}<\infty$ onward. In other words, although the true probability of such event tokens is allowed to be intermittently zero, it should not vanish to zero as $t_{k}\rightarrow\infty$.

It might be pointed out that a particular person, say Mary, will die at some time in the future, and that it will not make sense to consider the probability of Mary’s suffering from lung cancer after that time. However, unless all generations of human beings suddenly become extinct in the near future, we can consider the true probability of this event token at least for some person $i$ at each time $t$. Hence it makes sense to forecast the probability of such an event token in each specific case as $t\rightarrow\infty$.

Appendix D More Detailed Remarks

Remark 2.4 Now, let $\mathcal{F}$ be the sigma-field generated by $\Omega$, and let $\omega^{t}=(S_{0}^{-1}(s_{0}),\ldots,S_{t}^{-1}(s_{t}),\Omega_{t+1},\Omega_{t+2},\ldots)\in\Omega$ denote a partial history through date $t$. Then, for any probability measure $p_{t}$ on $\mathcal{F}_{t}$, $p_{t}(\omega^{t})$ becomes the (marginal) probability of the partial history, and each $\omega^{t}$ is assumed to be $\mathcal{F}_{t}$-measurable. Note then that $p_{t}(\omega^{t})=\prod_{\tau=1}^{t}p(\omega_{\tau}|\mathcal{F}_{\tau-1})$ for any $t$, and so $p_{t}(\omega^{t})=p(\omega_{t}|\mathcal{F}_{t-1})\,p_{t-1}(\omega^{t-1})$. Furthermore, when $s_{t}$ is only either 0 or 1, $S_{t}(\omega_{t})$ becomes an indicator function for an event $\{\omega_{t}\}$. Then, provided that there indeed exists any true objective probability $P$, $p(\{\omega_{t}\}|\mathcal{F}_{t-1})=P(\{\omega_{t}\}|\mathcal{F}_{t-1})=E(S_{t}(\omega_{t})|\mathcal{F}_{t-1})$, where the expectation $E$ is taken with respect to this true probability $P$.
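To make the recursion $p_{t}(\omega^{t})=p(\omega_{t}|\mathcal{F}_{t-1})\,p_{t-1}(\omega^{t-1})$ concrete, here is a minimal Python sketch; the conditional probabilities supplied to it are hypothetical values chosen for illustration, not quantities from the paper.

def path_probability(conditionals):
    # Marginal probability p_t(w^t) of a partial history, computed as the
    # product of the conditional probabilities p(w_tau | F_{tau-1}).
    prob = 1.0
    for p in conditionals:
        prob *= p
    return prob

# Hypothetical conditional probabilities along one path (illustrative only).
conditionals = [0.5, 0.3, 0.8]
print(f"{path_probability(conditionals):.2f}")            # 0.12
# Recursive form: p_t(w^t) = p(w_t | F_{t-1}) * p_{t-1}(w^{t-1}).
print(f"{path_probability(conditionals[:2]) * 0.8:.2f}")  # same value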

For example, let $S_{t}$ be an $\mathbf{i.i.d.}$ random variable whose value is $1$ if the event $\{\omega_{t}\}$ occurs at $t$ and $0$ otherwise. Then, $X_{n}=\sum_{k=1}^{n}S_{k}$ is the number of events that have occurred up to time $n$. Since $S_{t}$ is $\mathbf{i.i.d.}$, $p(\{\omega_{t}\}|\mathcal{F}_{t-1})$ is the same as $P(\{\omega_{t}\})$ across time. Now, let $\lim_{n\rightarrow\infty}\frac{X_{n}}{n}=\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{k=1}^{n}S_{k}$ be the limiting ratio of events that ever occur. Then, provided that this limit indeed exists, the dominated convergence theorem and Fubini’s theorem imply that $E\{\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{k=1}^{n}S_{k}\}=P(\{\omega_{t}\})$. Thus, in the $\mathbf{i.i.d.}$ case, we can derive that, with the true probability $P$-one, the true objective probability of the event $\{\omega_{t}\}$ is the limiting relative frequency, which is objective.
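The following small simulation illustrates this $\mathbf{i.i.d.}$ case: the relative frequency $X_{n}/n$ approaches the true probability as $n$ grows. The value p_true is an illustrative assumption, not a quantity from the paper.

import random

random.seed(42)
p_true = 0.3  # hypothetical true probability P({w_t}), for illustration only
for n in (100, 10_000, 1_000_000):
    x_n = sum(random.random() < p_true for _ in range(n))  # count of occurrences
    print(f"n = {n:>9}: X_n / n = {x_n / n:.4f}")
# The printed ratios approach p_true = 0.3 as n grows (strong law of large numbers).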

By stipulating that the true objective probability follows the rule by which Nature generates each actual data point, we emphasize that the true probability here is something objective, not subjective, and no more and no less than that. “Nature” is just a metaphor for describing the relationship of the true probability to our objective world. Adopting the widely accepted statistical notion of a data-generating process, we use the term “Nature” to refer to whatever is supposed to govern the underlying true objective process that generates the actual data. Given that Nature is simply a metaphor, it is important to emphasize that, in order to prove the possibility or impossibility of machine learning of the true objective probabilities, we need not commit ourselves to whether there really exists such a thing as a true objective process: probability might be merely something subjective that has nothing to do with “Nature.” If that is the case, then we conclude that no machines can learn the true objective probabilities, simply because there exist no such things as true probabilities for machines to learn.

Remark 3.1 The standard theory of subjective probability was first developed by Ramsey and then further by De Finetti and Savage. Subjective probability is designed to represent a degree of belief possessed by a subject, say some person. Here, two words deserve attention: degree and belief. First, subjective probability represents some aspects of belief. However, belief is an inner thought that, in principle, resists direct observation, while probability quantification requires measurability. Note that the easiest method of measurement is observation. Thus, in order for the degree of belief to be quantified as a probability measure, the unobservable must be made observable. Here comes in the relationship between unobservable belief and observable action: belief causes action. According to (Ramsey, 1931), the strength of our belief can be judged in relation to how we should act in hypothetical situations. Given a preferential system on the lotteries of a set of conditions, the choice of action under hypothetical circumstances reveals the degree of belief of the relevant agent. In this vein, subjective probability represents whatever is in anyone’s mind about anything, as long as his/her belief system is coherent, and thus can be assigned even to what is merely imagined. For instance, while arguing for cogito, ergo sum, (Descartes, 2008) imagined some evil spirit that has devoted all its efforts to deceiving him. Then Descartes can assign some value of subjective probability to this imagined evil spirit, according to how likely it is, to him, that such an imagining could be realized in this world, as long as his belief system is coherent.

Second, it is assumed that the degree of belief ranges between 0 and 1. For example, your belief that there will be rain tomorrow has a degree strictly less than $1$ and is thus called a partial belief, because you have some lack of confidence about future events. In addition to this quantitative usage of the term “belief”, however, there is another, categorical usage: “belief” refers to the proposition that something is the case, that something is not the case, or neither. For instance, your belief in the Moorean fact that here is one hand either affirms the case, denies it, or suspends judgment. Compared to partial belief, this qualitative belief is called belief simpliciter. As the term “belief” has these two faces, a gradational quantitative one and a categorical qualitative one, numerical degrees are assigned to partial beliefs, while truth values are assigned to beliefs simpliciter. In this paper, we abbreviate belief simpliciter by “belief” and denote partial belief by “partial belief” as it is.

In contrast, objective probability, if any, is what must be determined by objective features of the world that do not vary from person to person. Following (Nagel, 1939) and (Carnap, 1963), we list chance, logical probability, and relative frequency as exhaustive examples of objective probability. The best way to clarify these concepts is to consider examples. Following (Maher, 2010), suppose that a coin has the same face on both sides, that is, it is two-headed or two-tailed. Provided further that it is completely uncertain which face value, heads or tails, the coin has on both sides, the chance of getting heads when the coin is tossed is 1 or 0, while its logical probability is $\frac{1}{2}$. Furthermore, when the coin is tossed infinitely often, its relative frequency surely converges to 1 or 0.

Here, the chance is either 1 or 0, depending on what our world is like, namely, whether the coin is indeed two-headed or two-tailed. Therefore, the chance is objective in the sense that it depends on real features of the coin, not on any personal inner thought. On the other hand, the logical probability is $\frac{1}{2}$, because the given conditions logically imply that the coin has the same face value on both sides but leave it completely uncertain whether it is two-headed or two-tailed. Therefore, logical probability is also objective in the sense that it depends on the logical features of our world, not on us. Clearly, the relative frequency is what our world turns out to be, not whatever we believe. However, no matter which interpretation of probability is adopted among these three kinds, it is important to note that the true objective probability $P$ in Definition 2.3 is a mathematical object that is supposed to represent any of them, as long as they satisfy the Kolmogorov axioms.

Remark 4.7 It should be noted that Theorem 4.6 is our building block for proving when a machine cannot learn the true probability, because $p_{\infty}$ in Theorem 4.6 denotes the limiting relative frequency along the test set, the representative true objective probability. We do not consider any limiting behavior of the relative frequency outside the test set, because learning $\alpha$ per se is not possible outside the test set, by the necessary condition for learning in Section 3.2. Therefore, if it is shown to be impossible that, with $P$-probability one, $p_{k}\rightarrow\alpha$ along the stochastic path of the test set collected by the assessed $\alpha$, then it follows from Theorem 4.6 that the machine cannot learn the true probability.

Now, note that $P(p_{k}\rightarrow\alpha)=1$ if and only if for any $\epsilon>0$, $\lim_{n\rightarrow\infty}P(\sup_{m\geq n}|p_{m}-\alpha|<\epsilon)=1$. Thus, if the machine learns, then for all $\epsilon>0$ small enough, $\lim_{n\rightarrow\infty}P(|p_{n}-\alpha|<\epsilon,\;|p_{n+1}-\alpha|<\epsilon,\ldots)=1$, which is $\lim_{n\rightarrow\infty}P(p_{n}=\alpha,\;p_{n+1}=\alpha,\ldots)=1$. Thus, Theorem 4.6 is not committed to what the machine does with the first $n-1$ data points while “learning”. This concept of machine learning is flexible enough to allow for finitely many potential errors, where $p_{t}\neq\alpha$ for all $t<n$ so that $P(A_{t+1}|\text{ß}_{t})\neq\alpha$ for all $t<n$, while the machine processes the data to learn.
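As an illustration of this convergence along the test set, consider the following toy simulation, under the illustrative assumption that Nature’s true conditional probability always matches the machine’s forecast, i.e. that the machine has indeed learned. The forecast stream and the value of alpha are hypothetical choices, not constructions from the paper.

import random

random.seed(7)
alpha = 0.25                 # the machine's assessed probability of interest
hits = trials = 0
for t in range(500_000):
    forecast = random.choice([0.25, 0.5, 0.75])  # the machine's forecast stream
    outcome = random.random() < forecast         # Nature matches each forecast
    if forecast == alpha:                        # selection criterion: the test set
        trials += 1
        hits += outcome
print(f"p_k along the test set = {hits / trials:.4f} (alpha = {alpha})")
# The empirical frequency p_k collected by the assessed alpha converges to alpha.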

Remark 4.9 Indeed, it may well be argued against the (Oakes, 1985) Counterexample that, although such behavior can be imagined, Nature never actually behaves that way. There is no reason why Nature should be so perverse as to generate data in such a deviating way. The true objective probability of Nature being perverse may simply be zero. In that case, Theorem 4.1 and Theorem 4.8 do not necessarily imply that a machine cannot learn the true probability.

Theorem 4.8 shows only that if a machine can imagine such a counterexample, and thus sincerely believes in its possibility, then its subjective probability of long-run mis-calibration is not zero. But recall Descartes’ Demon case from Section 3.1. A mere possibility in imagination does not necessarily imply a real possibility, namely that the true objective probability of its occurring in the actual world is not zero. Theorem 4.1 and Theorem 4.8 show only that if a machine cannot exclude such a counterexample, it cannot be self-assured of being well-calibrated with its own subjective probability $1$. However, recall that there exists an asymmetric relation between subjective and objective probabilities: objective probability binds subjective probability, but not necessarily vice versa. Thus, if the true probability of Nature’s perversity is proven to be zero, the machine can exclude such a possibility, and so its subjective probability of Oakes’ counterexample will be zero as well. Then it follows neither that the machine cannot be self-assured of being well-calibrated nor that it cannot be truly guaranteed to be so, which implies that the impossibility of machine learning does not necessarily follow from Theorem 4.6.

Later, by Theorem 4.19, we prove that such an imagined possibility of Nature’s being perverse is a real one if the true probability is not observable. Meanwhile, we also prove mathematically how the (Oakes, 1985) Counterexample paralyzes Dawid’s Theorem 4.1, which amounts to the proof of Theorem 4.8. Note that if the true probability indeed escapes from the machine’s forecast just as in (Oakes, 1985), Theorem 4.1 breaks down: Theorem 4.1 critically relies on the martingale property of $\sum_{t=1}^{k}X_{t}$ given $\text{ß}_{k-1}$, where $X_{t}=(\sum_{j=1}^{t}\xi_{j})^{-1}\cdot\xi_{t}(Y_{t}-\hat{Y}_{t})$, which follows from $E(X_{k}|\text{ß}_{k-1})=0$. This martingale property, however, breaks down when $P(A_{t+1}|\text{ß}_{t})=E(Y_{t+1}|\text{ß}_{t})\neq\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{ß}_{t})$ for all $t$. Note that (Dawid, 1982) takes it for granted that $E(Y_{t+1}|\text{ß}_{t})=\Pi(A_{t+1}|\text{ß}_{t})=\hat{Y}_{t+1}$ for all $t$. Therefore, if we relax this assumption, we can prove mathematically how (Oakes, 1985) works against (Dawid, 1982), which is shown in (Case 2) of the proof of Theorem 4.17.
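The following toy simulation sketches how the breakdown manifests empirically: if Nature, Oakes-style, generates data by a deviating conditional probability beta whenever the machine forecasts alpha, the empirical frequency along the test set tracks beta rather than alpha. Both alpha and beta are hypothetical values chosen for illustration, not the paper’s formal construction.

import random

random.seed(0)
alpha = 0.5   # the machine's constant subjective forecast on the test set
beta = 0.8    # Nature's deviating true conditional probability (hypothetical)

n = 200_000
hits = 0
for t in range(n):
    hits += random.random() < beta  # Nature generates data by beta, not alpha
print(f"machine forecast alpha = {alpha}")
print(f"empirical frequency    = {hits / n:.4f} (tracks beta = {beta}, not alpha)")
# With E(Y_{t+1} | ß_t) != the forecast, the martingale property fails and the
# machine is mis-calibrated along its own test set.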

Remark 4.12 Regarding Lemma 4.10 and Lemma 4.11, two things deserve note. First, we do not require any standard assumption, such as that the stochastic process be $\mathbf{i.i.d.}$ along the historical path of the test set, so $P(A_{t+1}|\text{ß}_{t})$ can vary along the path. Note also that, unlike (Blume & Easley, 2006, 2008), etc., we do not need to consider all the associated events $A_{t}$ along the stochastic path; we consider only the events $A_{t}$ whose assessed probabilities are $\alpha$. The set of those events $A_{t}$ is called a test set, because it is collected according to the selection criterion of being assessed constantly as $\alpha$. Therefore, we do not assume any specific property of the stochastic process along the path in the test set, such as stationarity or ergodicity, because we include only arbitrary subsequences of the stochastic process in the test set according to the subjective assessment.

Second, by Lemma 4.10 and Lemma 4.11, we obtain that if $P(p_{k}\rightarrow\alpha)=1$, then $E\left|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{ß}_{t_{j}})-\alpha\right|=0$, where the expectation is taken with respect to the true probability $P$. From this equation, we establish a connection between the true guarantee of well-calibration and the real forecasting game between a machine and Nature: $(i)$ the true guarantee of well-calibration is connected to forecasting games between a machine and Nature, for what the machine forecasts is $\alpha$ while what Nature forecasts is $P(A_{t_{j}+1}|\text{ß}_{t_{j}})$, and thus whether $\left|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{ß}_{t_{j}})-\alpha\right|=0$ holds or not is tied to how Nature and the machine play the forecasting games along the stochastic path of the test set. In this game, the machine loses at time $t$ whenever Nature succeeds in deviating from the machine’s forecast at that time. Some literature deals with the problem of well-calibration in various forecasting game settings (e.g. (Foster & Vohra, 1993)). $(ii)$ Also, note that, in the proof of Lemma 4.11, we take both the inner and outer expectations with respect to the true probability $P$ while applying the law of iterated expectations. Thus, it is a real game, not an arbitrary imaginary one, for $\left|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{ß}_{t_{j}})-\alpha\right|=0$ is expected to hold with respect to the true probability $P$, not any other subjective probability $\Pi$.
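The following sketch illustrates the two sides of this game on a single test-set path, under two hypothetical behaviors of Nature: one in which her true conditionals oscillate around the machine’s constant forecast alpha, so that the time-average criterion above holds, and one in which she deviates constantly, so that the machine loses. All numerical values are illustrative assumptions.

alpha = 0.5
k = 100_000

# Nature's true conditionals P(A_{t_j+1} | ß_{t_j}) along the test set.
oscillating = [alpha + 0.1 * (-1) ** j for j in range(k)]  # deviations cancel
constant_dev = [alpha + 0.1] * k                            # Nature always deviates

for name, path in (("oscillating", oscillating), ("constant deviation", constant_dev)):
    gap = abs(sum(path) / k - alpha)
    print(f"{name:>18}: |(1/k) * sum of Nature's forecasts - alpha| = {gap:.6f}")
# Oscillating Nature yields a gap of 0 (calibration criterion holds);
# constantly deviating Nature leaves a gap of 0.1 (the machine loses).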

Remark 4.14 Now, let us establish a connection between the true second-order probability and the forecasting game between Nature and a machine. For simplicity, let us denote by $\Delta_{t}$ the event at time $t$ that $P(A_{t+1}|\text{ß}_{t})=\alpha$ for any machine forecast $\alpha$. In other words, $\Delta_{t}$ denotes the event that the machine makes the correct forecast at time $t$, which means that the machine wins the forecasting game at that time. Note here that, strictly speaking, $\Delta_{t}$ is a complex event consisting of two events, $\{P(A_{t+1}|\text{ß}_{t})=\alpha\}$ and $\{\Pi(A_{t+1}|\text{ß}_{t})=\alpha\}$ for the same functional value $\alpha$, while $P(A_{t+1}|\text{ß}_{t})$ and $\Pi(A_{t+1}|\text{ß}_{t})$ are two probability functions for the common event $A_{t+1}$; that is, $\{\Delta_{t}\}=\{P(A_{t+1}|\text{ß}_{t})=\alpha=\Pi(A_{t+1}|\text{ß}_{t})\}$. However, since we consider only the test set along the stochastic path, here we take $\Pi(A_{t+1}|\text{ß}_{t})$ to be fixed as $\alpha$ along the path.

Then, extending some notions from (Gaifman, 1986), let us derive a second-order probability, i.e. the probability of a probability, from the outcomes of the forecasting game between Nature and the machine as follows: for any event $A_{t+1}$, the true second-order probability $P$ is the probability of the meta-event that the first-order probability (either Nature’s true forecast or the machine’s subjective forecast) of $A_{t+1}$ actually has a certain numerical value $\alpha\in\Re[0,1]$. Thus, the true second-order probability $P$ denotes $P\left(\,P(A_{t+1}|\text{ß}_{t})=\alpha\,\right)$.
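As a concrete illustration, a second-order probability on a countable domain can be represented as a simple assignment over candidate computable values of the first-order probability $P(A_{t+1}|\text{ß}_{t})$; all numerical values below are hypothetical.

from fractions import Fraction

# A minimal sketch of a second-order probability over a countable set of
# candidate first-order values, with illustrative numbers.
second_order = {
    Fraction(1, 3): Fraction(1, 2),  # P( P(A_{t+1} | ß_t) = 1/3 ) = 1/2
    Fraction(1, 2): Fraction(1, 4),  # P( P(A_{t+1} | ß_t) = 1/2 ) = 1/4
    Fraction(2, 3): Fraction(1, 4),  # P( P(A_{t+1} | ß_t) = 2/3 ) = 1/4
}
assert sum(second_order.values()) == 1  # Kolmogorov normalization
print(second_order[Fraction(1, 2)])     # 1/4: the meta-event that alpha = 1/2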

Here, it is worth noting that although we derive the notion of higher-order probabilities by extending some notions from (Gaifman, 1986), our notion differs from his in the following ways. We do not distinguish the first-order and the second-order probabilities and use the same notation $P$ for both, whereas (Gaifman, 1986) uses $P$ and a $PR$ operator to denote the second-order probability and the event on the first-order probability, respectively. This is because Gaifman’s notions differ from ours in that (1) $P$ in Gaifman denotes the agent’s subjective probability, while our second-order probability $P$ can be a true objective one, just like the first-order true probability; (2) his $PR$ operator accepts a closed interval as one of its arguments, while the domain of our second-order probability $P$ does not contain intervals of real numbers. Note that our domain of the second-order probability is assumed to be generated by the collection of all the singletons of the computable real values of the first-order true probability function $P$, and that it is assumed to be countable. Thus, the domain does not contain intervals of real numbers. (3) In addition, our notion of the first-order probability is a precise one, not an imprecise one, so it is not supposed to belong to any interval or any set of probability measures.

Now, the probability space of the second-order probability is defined as $(\Omega,\mathcal{G},P)$, in which $\Omega$ is the set of all the computable functional values for any given true first-order probability function $P(A_{t+1}|\text{ß}_{t})$, $\mathcal{G}$ is a field generated by the collection of all the singletons in $\Omega$, and $P$ is the second-order probability with $P:\mathcal{G}\rightarrow\Re[0,1]$. Note here that $\Omega$ is countable and that $\Omega$ is the set of all the possible forecasts by machines on the event $A_{t+1}$ given $\text{ß}_{t}$. Now, if the domain of the second-order probability were a sigma-field $\mathcal{F}$ generated by $\Omega$, the problem would be that the sigma-field $\mathcal{F}$ becomes uncountable given that $\Omega$ is countable. So, we should consider a field $\mathcal{G}$, not a sigma-field $\mathcal{F}$, for the probability space of the second-order probability $P$.

Here are some justifications for using a field $\mathcal{G}$, not a sigma-field $\mathcal{F}$, as the domain of the second-order probability $P$: we do not require the domain of the second-order probability to include all countably infinite unions, for the number of strategies a machine can use would then become uncountable, which contradicts the fact that the set of numbers a machine can compute is countable. In our forecasting game, any singleton in $\Omega$ can be thought of as a pure strategy of the machine, and any union of those singletons as a mixed strategy of the machine. Again, since the set of numbers a machine can compute is countable, a machine cannot compute uncountably many mixed strategies.
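To illustrate why the field $\mathcal{G}$ suffices, note that the field generated by the singletons of a countable $\Omega$ consists exactly of the finite sets and their complements, so each of its members admits a finite representation. The following minimal sketch encodes this; the forecast values are hypothetical.

from fractions import Fraction

class FieldEvent:
    # An event in the field G generated by singletons of a countable Omega:
    # every such event is a finite set of forecast values or the complement
    # (in Omega) of a finite set, so no uncountable unions are ever needed.
    def __init__(self, finite_part, complemented=False):
        self.finite_part = frozenset(finite_part)
        self.complemented = complemented

    def contains(self, omega):
        inside = omega in self.finite_part
        return not inside if self.complemented else inside

    def complement(self):
        return FieldEvent(self.finite_part, not self.complemented)

pure = FieldEvent([Fraction(1, 3)])                   # a pure strategy (singleton)
mixed = FieldEvent([Fraction(1, 3), Fraction(1, 2)])  # a finite mixed strategy
print(mixed.contains(Fraction(1, 2)))                 # True
print(mixed.complement().contains(Fraction(1, 2)))    # False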

Remark 4.22 Recall from the necessary condition for learning in Section 3.2 that $P(A_{t+1}|\text{ß}_{t})=\Pi(A_{t+1}|\text{ß}_{t})=\alpha$ if the machine learns the true probability $P(A_{t+1}|\text{ß}_{t})$ as $\alpha$. Definition 4.21 then means that while the machine calculates the value of $\Pi(A_{t+1}|\text{ß}_{t})$ as $\alpha$ to learn the true probability $P(A_{t+1}|\text{ß}_{t})$ at time $t$, the machine assigns $\Pi$-probability $>0$ to the event that $P(A_{t+1}|\text{ß}_{t})\neq\alpha$, because the machine tolerates the error that the true value of $P(A_{t+1}|\text{ß}_{t})$ may not be exactly $\alpha$ at that time $t$. In Lemma 4.23, we prove that a machine cannot tolerate errors infinitely often if it aims to learn the true probability.

Remark 4.24 For example, in (Savage, 1972), a vacuous event is null, but not every null event is necessarily vacuous. Here, an event is null to an agent when the event is believed by that very agent to be impossible, and thus its subjective probability is zero for the agent. On the other hand, a vacuous event has absolute impossibility, whose true objective probability is zero by the Kolmogorov axioms. Thus, the true objective probability of an absolutely impossible event binds its subjective probability to zero, but not necessarily vice versa.

We now extend this idea in (Savage, 1972) to all virtually impossible events. Here, note that absolute impossibility is assigned to a vacuous event by the Kolmogorov axiom, while virtual impossibility is assigned to any event whose true objective probability measure is zero by Nature. Thus, in Lemma 4.23, we derive that all virtually impossible events also have subjective probability $\Pi$-zero infinitely often whenever the agent is self-assured that such events are truly impossible, for the subjective probability must be bound to the true objective probability $P$-zero, if any. Otherwise, the machine comes to tolerate error infinitely often, which makes it impossible for the machine to achieve its goal of learning the true probability.