
Can Machines Learn the True Probabilities?

Jinsook Kim
Abstract

When there exists uncertainty, AI machines are designed to make decisions so as to reach the best expected outcomes. Expectations are based on true facts about the objective environment the machines interact with, and those facts can be encoded into AI models in the form of true objective probability functions. Accordingly, AI models involve probabilistic machine learning in which the probabilities should be objectively interpreted. Under some basic assumptions, we prove when machines can learn the true objective probabilities, if any, and when they cannot.

Foundations of Artificial Intelligence, Probabilistic Machine Learning, Non-parametric Estimation, Effective Calculation by Turing Machine, True Guarantee of Well-Calibration

1 Introduction

In the standard AI model under uncertainty, how to measure the degree of uncertainty matters. This paper is about treating such measures in the form of probabilities. In particular, we focus on the true objective probabilities, if any. There are various probabilistic contexts in which the true objective probabilities matter. For example, causal relations of physical events are widely regarded as objective features of the world. Therefore, when causal relations are to be understood in terms of probabilities mainly due to various regularity issues, a probabilistic causal model should include an objective probability function that measures the true objective values about our world.

This paper addresses the question of whether machines can learn the true objective probabilities from the data to perform such probabilistic reasoning. Under some basic assumptions, we prove that machines can learn the true objective probabilities if and only if the probabilities are directly observable by them. Roughly speaking, a true probability is directly observable by a machine when it can calculate the probability by the empirical frequency of a true population given to it.

The outline of the proof is as follows. After defining some main concepts, we identify the Success Criterion and the necessary condition for any machine to learn the true objective probabilities. From these conditions, we derive the theorem that learning implies the true guarantee of well-calibration. Roughly speaking, “truly guaranteed well-calibration” means the following: when a machine collects data according to its subjective forecast along a stochastic path in which the associated events occur, the empirical frequency of the collected data matches the machine’s very probabilistic forecast with true probability $P$-one. Since the machine’s forecasts must indeed be true when the machine learns the true probabilities, this calibration property can then be understood as a calibration version of the strong law of large numbers without the independence assumption.

Note that there exist connections here among machine forecasting, well-calibration, and machine learning. While proving our theorems, therefore, we establish connections between the true guarantee of well-calibration and various settings of the real forecasting games between Nature and a machine. In this game, what Nature forecasts are the true objective probabilities, while what the machine forecasts are its own subjective probabilities. The machine loses when Nature deviates from the probabilistic forecasts of the machine. Bridged by the property of truly guaranteed well-calibration, we then prove whether the machine learns the true probabilities or not under various settings of forecasting games.

With this proof, we establish the fundamental scope and limits of learning the true probabilities by AI machines. One important implication is that machines can relax the independence assumption among data to learn the true probabilities but cannot relax the assumption of identical distribution, such as stationarity or ergodicity, along a stochastic path where the associated events occur. Another implication is that the problem of computability is directly connected to the problem of complexity in the case of learning the true probabilities.

2 Notations and Definitions

In this section, we define some main concepts, including “machine learning” and “true objective probability”. Adopting terminologies from (Nilsson, 2011) and (Boolos et al., 2002), let us first define a machine as an artifact or device that can effectively calculate or compute any target function if there exist definite and explicit instructions to do so in principle. Since we focus on probability functions in this paper, we particularly mean by “an effectively calculating or computing device” a machine that can in principle assign a probability measure (a value of a probability function) to each state (an argument of the probability function) in a given domain, an event space of a sigma-field.

Definition 2.1.

A function is effectively calculable or computable when there are definite and explicit instructions, following which its functional value can be calculated in principle for any given argument. (Boolos et al. (2002))

Two things merit attention in Definition 2.1. First, this notion of effective calculation or computation is an ideal one, with no practical limits on the time, expense, etc., necessary to calculate. Therefore, a proof of a limitation on the effective calculation or computation of any function implies a fundamental limit on computability that cannot be overcome by any practical real machine. Second, as (Kozen, 1997) points out, this notion is an informal one, something that is supposed to be captured in common by all formalisms, such as computation by Turing machines, by the $\lambda$-calculus, and by the $\mu$-recursive method. Accordingly, once we adopt this notion of effective calculation or computation to define “learning”, we can be flexible about which formalism is encoded as instructions to complete a given learning task.

Now, whatever such a formalism is, machines can learn only if there exist some instructions they can follow to complete their tasks. So we can prove that it is impossible for machines to learn a target function under certain conditions in the following way: we first suppose that there exist some successful instructions to be encoded into machine programming to learn the given function under those conditions. We then show that this supposition leads to a conclusion that is impossible to satisfy. We thereby conclude that there cannot exist such instructions for the given function and, accordingly, that machines cannot learn it. This is a simple but clear way of proving the impossibility of learning without being committed to any complex procedure of constructing a formalism such as a Turing machine or the $\lambda$-calculus.

Definition 2.2.

A machine learns when it succeeds in effectively calculating or computing a target function, if any, after processing possibly infinite amounts of data.

The phenomenon of learning must be at least computational in its essence when acquired by a machine. We thus adopt the notion of computation to define what learning is in Definition 2.2. Inspired by the ideas of (Turing, 1936) and (Church, 1936), we require that a machine be able to effectively calculate or compute a target function when the machine can learn the function.

In addition, we add the notion of success to Definition 2.2, which aims to capture the role of “learning” as an epistemic notion, not just a computational one. The epistemic notion of machine learning requires two components: if a machine learns, then (i) it must indeed be correct most of the time and (ii) it must be self-assured that it is correct most of the time.

Learning is the phenomenon of knowledge acquisition. Once something is learned, knowledge about it is acquired. Now, knowledge must be a true representation, and it must be so not just by luck. We thus require that (i) what is effectively calculated or computed by a machine be true and further that it be true most of the time out of infinite opportunities to learn. In addition, if the machine admits errors too many times, say infinitely often, it cannot be said to learn. We thus require also that (ii) the machine be self-assured that what it calculates is correct most of the time. In sum, we provide the following Success Criterion:

(1) If a machine achieves computational success by learning, what it acquires in the end must be true to our world most of the time, which must be assured to the machine itself.

If what the machine computes turns out to be wrong, or it admits errors repeatedly too often out of infinite opportunities to learn, then its computation cannot be considered successful. Later, we prove by Corollary 4.37 that the Success Criterion (1) is sufficient for learning in the case of computing true probabilities. We also clarify there what we mean by “most of the time.”

Definition 2.3.

A true probability is what collectively constitutes a probability space, a triple $(\Omega,\mathcal{F},P)$, of random variables $S_{t}$’s with a joint true probability $P$ of the stochastic process according to which Nature generates a sequence of actual data $s_{t}$’s, each of which is realized as such with the very true probability $P$.

Consider an enumerable set $\Omega_{t}$ of states $\omega_{i}$ at time $t$ with $t\in\mathbb{N}$. For example, $\Omega_{t}$ may be the set $\{\omega_{s},\omega_{c},\omega_{r}\}$, where $\omega_{s}$ denotes the state of a sunny day, $\omega_{c}$ the state of a cloudy day, and $\omega_{r}$ the state of a rainy day at date $t$. Also, consider the set $\Omega$ that consists of all the infinite sequences with a representative sequence $\omega=(S_{0}^{-1}(s_{0}),S_{1}^{-1}(s_{1}),S_{2}^{-1}(s_{2}),\ldots)$. Here, $S_{t}(\omega_{i})$ is a random variable that takes some numerical value $s_{t}\in\Re$ according as which $\omega_{i}$’s are realized at time $t$ in our world. Now, $S_{t}$ comes before $S_{t+1}$ in time, and thus the sequence of $S_{t}$’s represents a discrete-time stochastic process. Nature then generates the actual data set $\{s_{0},s_{1},s_{2},\ldots\}$ with the true probabilities $P$. So the probability function $P$, if any, becomes true to our world when it corresponds to whatever amounts to the rules according to which the actual data are realized in our world. Broadly speaking, this is in line with the correspondence theory of truth, similarly to (Tarski, 1944).
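To make Definition 2.3 concrete, the following is a minimal simulation sketch of Nature generating a sequence of actual data under a hypothetical true probability function. The state space, the probability values, and the numerical coding of $S_t$ are our own illustrative assumptions, not part of the paper’s formal setup.

```python
import numpy as np

# Sketch of Definition 2.3 (illustrative numbers): Nature draws states from
# Omega_t = {sunny, cloudy, rainy} according to a fixed true probability P,
# and the random variable S_t records the realized numerical value s_t.
rng = np.random.default_rng(0)

states = ["sunny", "cloudy", "rainy"]                 # omega_s, omega_c, omega_r
true_P = [0.5, 0.2, 0.3]                              # hypothetical true probabilities
values = {"sunny": 0.0, "cloudy": 1.0, "rainy": 2.0}  # S_t(omega_i) = s_t

T = 100_000
omega = rng.choice(states, size=T, p=true_P)          # Nature's realized states
s = np.array([values[w] for w in omega])              # actual data s_0, s_1, ...
print("first data points:", s[:10])

# The empirical frequency of each state approaches its true probability P.
for state, p in zip(states, true_P):
    print(state, round((omega == state).mean(), 4), "vs true", p)
```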

Remark 2.4.

More detailed discussions on Definition 2.3, including examples, are provided in Appendix D.

Now that we have defined learning and true probability, let us discuss under what conditions machines can or cannot learn the true probabilities. Before we move on, however, let us briefly mention how we can provide formal conditions for learning even though Definition 2.2 contains informal notions.

Recall from the second comment on Definition 2.1 that the general notion of computation has not been mathematically defined. This is why the Church-Turing thesis remains a thesis, not a theorem, given that it uses the general notion of computation. But the computability of a target function in each specific case can be formally specified by giving some definite and explicit instructions to derive the target function in each case, say by a Turing machine. Likewise, our general notion of machine learning cannot be mathematically defined, because Definition 2.2 uses the general notion of computation and the informal notion of success. But this does not prevent us from mathematically analyzing the notion of machine learning of the true probabilities by proving what the necessary and sufficient conditions are for learning them. We can do so by giving some definite and explicit instructions to statistically derive the true probability function by a machine while satisfying the Success Criterion (1).

3 Kinds of Probabilities and Learning

3.1 Subjective vs. Objective Probabilities

Broadly speaking, probabilities can be divided into two kinds, subjective and objective ones. A subjective probability, say $\Pi(A_{t+1}|\text{\ss}_{t})$, depends on each person’s beliefs and thus possibly varies from person to person, while an objective one, say $P(A_{t+1}|\text{\ss}_{t})$, does not.

The standard theory of subjective probability was first developed by Ramsey and then further by De Finetti and Savage. Subjective probability is designed to represent a degree of belief possessed by a subject, say some person or, if possible, a machine. Hence subjective probability represents whatever is in anyone’s mind about anything, as long as his or her belief system is coherent, and so can be assigned even to what is merely imagined. For example, while arguing for cogito, ergo sum, (Descartes, 2008) imagined an evil spirit that has devoted all its efforts to deceiving him. Descartes can assign some value of subjective probability to his imagined evil spirit in accordance with how likely it is, to him, that the imagination can be realized in this world, as long as his belief system remains coherent.

In contrast, objective probability, if any, is what must be determined by objective features of our world that do not vary from person to person. The best way to understand objective probability is to consider examples. Following (Maher, 2010), for example, suppose that a coin has the same face on both sides, that is, two-headed or two-tailed. When this coin is tossed infinitely often, its relative frequency surely converges to 1 or 0. Hence the limiting relative frequency here is either 1 or 0, depending on how our world turns out to be, which is an objective matter, and not on whatever we believe.

It should be noted that subjective and objective probabilities are conceptually bifurcated in two important ways. First, recall that subjective probability represents an aspect of someone’s subjective belief, while objective probability does not. Hence the subjective probability of Descartes’ demon is positive as long as it is believed to some degree that it could exist in our world. However, this does not necessarily imply that the true objective probability of Descartes’ demon is positive, since it might be the case that such a demon is possible only in one’s imagination but impossible in our real world. We will return to this potential bifurcation between subjective and objective probability in Section 4.1.

Second, there exists an asymmetric relation between subjective and objective probability: although the subjective probability of Descartes’ demon does not necessarily bind its objective probability, the converse holds (e.g. (Lewis, 1980)). That is, once it is proven/assumed by an agent that the true objective probability of Descartes’ demon is, say, zero, then the subjective probability of that same agent is bound to this proven/assumed result on the objective probability and thus must be zero as well. From this asymmetric relationship, we derive Lemma 4.23 in Section 4.2.

Remark 3.1.

More detailed discussions on various kinds of probabilities are provided in Appendix D.

3.2 What is Implied by Learning the True Objective Probabilities?

As we pointed out in Section 2, learning is the phenomenon of knowledge acquisition, and knowledge must be at least a true representation. In the case of human beings, the requirement of true representation is expressed as the requirement that (propositional) knowledge be at least a true belief (e.g. (Hintikka, 1962), (Moore, 1985)). What then is the counterpart of such a requirement for machines?

In general, if a machine achieves computational success at $t$ by learning, what the machine represents by learning must be at least true at that time. We then denote the true representation of the machine about what is learned by the “true belief” of the machine, a legitimate analogue to the true belief of human beings. It is a belief analogue, for we have not yet shown that machines have minds or that they have the same kinds of mental representations as human beings. It is nevertheless a legitimate belief analogue, since the computational models of machine intelligence are based on understanding human intelligence. (e.g. (Pearl, 2018), (Russell, 1998), (Valiant, 1984, 2008))

That said, let us discuss the relation between belief and learning on the machine side: the knowledge acquired by machine learning must be at least a true belief. In (Hintikka, 1962), the knowledge of a person $i$ refers to the knowledge of that person $i$ of some proposition $A$. Likewise, a machine’s learning of the true objective probability $P$ here refers to the knowledge acquired by the machine of the probabilistic proposition $A_{p}$. If a machine learns the true probability as $\alpha$, then the probabilistic proposition $A_{p}$ amounts to this: the true objective probability $P$, if any, is what the very machine calculates as $\alpha$. Here, we convert non-propositional learning into propositional learning.

Now, just as person $i$’s knowledge of proposition $A$ must satisfy the necessary condition that person $i$’s belief in $A$ is true, machine learning of the true probability $P$ must also satisfy the condition that the machine’s belief in $A_{p}$ is true. Note here that such a belief in $A_{p}$ is true when what has been calculated by the machine is indeed equal to the true probability $P$. Now, this probability function calculated by a machine is nothing more than the subjective probability of the machine. Therefore, the necessary condition for machine learning of the true probability $P$ requires a machine to hold a true belief whose truth condition is satisfied when its subjective probability is, in fact, in congruence with the true objective probability $P$. In short, if a machine learns the true objective probability $P$, then the subjective probability $\Pi$ of the machine is actually equal to the true probability $P$.

Remark 3.2.

There is a large literature in logic and economics whose discussion bears on when a machine holds a true belief in the probabilistic proposition $A_{p}$. We provide some of this literature in Appendix B.

Therefore, we obtain the following condition:

The Necessary Condition for any Machine to Learn the True Probability

(2) If a machine learns the true objective probability $P(A_{t+1}|\text{\ss}_{t})$, then $\Pi(A_{t+1}|\text{\ss}_{t})=P(A_{t+1}|\text{\ss}_{t})$,

where $\Pi(A_{t+1}|\text{\ss}_{t})$ denotes the subjective probability of the machine at time $t$.

We assume, without loss of generality and for simplicity, that the event $A_{t+1}$ is an elementary event. So the event $A_{t+1}$ is a singleton, i.e. $\{\omega_{t+1}\}$.

Two things should be noted from (2). First, learning/knowledge is not necessarily equivalent to obtaining the true fact that $\Pi(A_{t+1}|\text{\ss}_{t})=P(A_{t+1}|\text{\ss}_{t})$, as the converse of condition (2) does not necessarily hold. Second, if a machine is wrong in calculating the true probability at time $t$, so that $\Pi(A_{t+1}|\text{\ss}_{t})\neq P(A_{t+1}|\text{\ss}_{t})$, then by modus tollens we can derive from (2) that the machine does not learn it at that time. However, this does not preclude the machine from learning it at any other time. What, then, can be said about learnability in general? According to the Success Criterion (1), a machine cannot learn a target function if it is wrong most of the time, except for a few finite cases out of infinite opportunities to learn. But can a machine be said to learn if it is correct infinitely often but also wrong that often? We give a negative answer to this question by proving theorems in Section 4.2.

4 Can Machines Learn the True Probabilities?

4.1 Learning the True Probabilities and Calibration

Let us start with a simple example in which a machine is trying to learn the true probability that it will rain tomorrow. A forecasting system is said to be well-calibrated if it assigns some probability, say 30%, to rainy events in a test set in which the long-run proportion of days that actually rain is 30%. According to (Dawid, 1982), a forecasting machine is self-assured that any fairly arbitrary test set of its forecasts is well-calibrated. This is Theorem 4.1. In addition, we prove in Theorem 4.6 that if the machine learns the true probability, then the machine’s forecasting is truly guaranteed to be well-calibrated.

Now, let us assume that a machine has its own (not necessarily true in our context) probability distribution $\Pi$ defined over $\text{\ss}_{\infty}={\textstyle\bigvee\limits_{t=0}^{\infty}}\text{\ss}_{t}$, where $\text{\ss}_{t}$ denotes the totality of the true facts up to day $t$. The probability forecasts $\Pi(A_{t+1}|\text{\ss}_{t})$ it makes on day $t$ are for events $A_{t+1}$ in $\text{\ss}_{t+1}$ and are $\text{\ss}_{t}$-measurable. For each day $t$ we have an arbitrary associated event $A_{t}\in\text{\ss}_{t}$, say the event of rain on day $t$. We denote the indicator of $A_{t+1}$ by $Y_{t+1}=1_{\{A_{t+1}\}}$ and introduce $\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{\ss}_{t})$, the probabilistic forecast of the machine for day $t+1$. In addition, we introduce indicator variables $\xi_{1},\xi_{2},\ldots$, chosen at will, to denote the inclusion of any particular day $t$ in the test set: $\xi_{t}=1$ if day $t$ is included in the test set and $\xi_{t}=0$ otherwise. Now, if we set the selection criterion for including any day in the test set to be the assessed probability $\alpha$ on day $t$, then we have the following theorem.

Theorem 4.1.

Suppose that $\xi_{t}$ is $\text{\ss}_{t-1}$-measurable. Then $\Pi(p_{k}\rightarrow\alpha)=1$ as $k\to\infty$,

where $k$ is the number of days in the test set,

$p_{k}=({\textstyle\sum\limits_{t=1}^{k}}\xi_{t})^{-1}\cdot({\textstyle\sum\limits_{t=1}^{k}}\xi_{t}\cdot 1_{\{A_{t+1}\}})$, and

$\xi_{t}:=\begin{cases}1&\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{\ss}_{t})=\alpha\\ 0&\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{\ss}_{t})\neq\alpha\end{cases}$

Here, let us use the terms as follows: machine forecasts are self-assured to be well-calibrated when $\Pi(p_{k}\rightarrow\alpha)=1$, while they are truly guaranteed to be so when $P(p_{k}\rightarrow\alpha)=1$. It should be noted, then, that even if the forecasting machine is self-assured to be well-calibrated, this does not necessarily imply that its forecasts are truly guaranteed to be well-calibrated. Recall from Section 3.1 that there is a conceptual bifurcation between subjective and objective probability.

Now, suppose that a machine tries to learn the true probability of a particular event $A_{t+1}$. If this machine indeed learns the true probability of the event as $\alpha$, then the machine should correctly calculate the true probability of the same events repeatedly as $\alpha$ most of the time. Hence, the machine can construct a test set of those associated events $A_{t+1}$’s whose sequentially correct probabilities are $\alpha$. Then we can show further from Theorem 4.1 that the test set will be well-calibrated with true probability $P$-one. This is Theorem 4.6. In short, here “being correct as $\alpha$” itself serves as what (Dawid, 1982) calls a selection criterion.
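The following is a small simulation sketch of this situation (all numerical settings are our own assumptions): when the machine’s forecasts coincide with the true probabilities, the test set selected by the criterion “forecast $=\alpha$” is empirically well-calibrated, illustrating both Theorem 4.1 and the true guarantee in Theorem 4.6.

```python
import numpy as np

# Sketch of Theorems 4.1/4.6 under the assumption Pi = P (truly learned
# forecasts); the probability values and path length are illustrative.
rng = np.random.default_rng(0)
alpha, T = 0.3, 500_000

true_p = rng.choice([0.3, 0.6, 0.9], size=T)   # hypothetical P(A_{t+1}|ß_t)
forecast = true_p.copy()                        # Pi = P: the machine forecasts truly
y = rng.random(T) < true_p                      # realized indicators 1_{A_{t+1}}

xi = (forecast == alpha)                        # selection indicators xi_t
p_k = y[xi].mean()                              # empirical frequency on the test set
print("p_k =", round(p_k, 4), "vs alpha =", alpha)   # p_k is close to alpha
```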

However, note that if the size of $\text{\ss}_{t}$ continues to grow as $t$ goes to infinity, then the $\text{\ss}_{t}$’s might be different for each $t$. Then $P(A_{t+1}|\text{\ss}_{t})$ might not stay the same as $\alpha$ even for the same events $A_{t+1}$’s across infinitely many $t$’s. Now, in order for the correct probability $\alpha$ to work as a selection criterion, $P(A_{t+1}|\text{\ss}_{t})$ should stay the same as $\alpha$ at least for infinitely many $t$’s, even though $\text{\ss}_{t}$ may vary as time passes. Therefore, we prove Lemma 4.5 from the following three assumptions. The justifications for the three assumptions are provided in Appendix C.

Assumption 4.2.

The $\text{\ss}_{t}$’s in $P(A_{t+1}|\text{\ss}_{t})$ are the sets of all the true facts up to time $t$.

Assumption 4.3.

No further knowledge requirement is imposed on the condition $\text{\ss}_{t}$.

Assumption 4.4.

Once a probability of an event type $E$ is established, its associated event tokens $E_{t_{k}}$’s occur at some infinite subsequence of times $t_{k}$’s, so that $P(E_{t_{k}})$ does not vanish to zero as $t_{k}\rightarrow\infty$.

It should be noted from 4.2 and 4.3 that if $\text{\ss}_{t}$ were the set of known facts, the information on the associated events $E_{t}$’s in the $\text{\ss}_{t}$’s might not be independent over time. Once $E_{t}$ has become known in the past at some time $t_{0}$, the same events $E_{t}$’s are more likely to be known afterwards: repeatedly accumulated knowledge of the same events reinforces the probability that the very event will be known again in the future. However, this is not necessarily the case with the set of true facts. It will be clear in Lemma 4.5 why this independence condition matters.

Lemma 4.5.

For any $\alpha\in\Re[0,1]$, let $E_{t}$ denote the event token at time $t\in\mathbb{N}$ whose event type $E$ almost surely determines the true probability of an event type $A$ as $\alpha$. If, for some subsequence $t_{k}$’s, the $E_{t_{k}}$’s are independent across $t_{k}$’s and $P(E_{t_{k}})\neq 0$ for any $t_{k}$, then $P(E_{t}\ i.o.)=1$.

Now that Lemma 4.5 has been established, $P(A_{t+1}|\text{\ss}_{t})$ is truly guaranteed to stay at $\alpha$ infinitely often, and thus the machine has infinite opportunities to learn $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Theorem 4.6.

Let us consider any arbitrary $\alpha\in\Re[0,1]$. If a machine learns the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$, then $P(p_{k}\rightarrow\alpha)=1$.

It should be noted that the notion of learning in Theorem 4.6 is flexible enough to allow for finitely many potential errors, so that there can exist some $t^{\ast}<\infty$ such that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ $\forall t<t^{\ast}$ while the machine processes the data to learn.

Remark 4.7.

More detailed discussions on Theorem 4.6 are provided in Appendix D.

4.2 Can Machines Learn the True Probabilities?

Theorem 4.8.

It is impossible to obtain a joint distribution for an infinite sequence of events that could have the well-calibration property with subjective probability $1$.

The basic idea in the proof of Theorem 4.8 is to construct a counterexample in which the true probability function $P$ deviates infinitely often from the subjective probability function $\Pi$ in such a way that the well-calibration property no longer holds.

Counterexample 1 Following (Oakes, 1985), let $P$ be such that $P(A_{t}|\text{\ss}_{t-1})=f(\Pi(A_{t}|\text{\ss}_{t-1}))$ for any event $A_{t}$, with the function $f:[0,1]\rightarrow[0,1]$ defined by $f(x)=x+\frac{1}{2}$ $(0\leq x\leq\frac{1}{2})$ and $f(x)=1-x$ $(\frac{1}{2}<x\leq 1)$. Then, under $P$ with $P(Y_{I_{k}}=1)=f(\alpha)$, where $\hat{Y}_{t}=\alpha$ for a subsequence $\{t:t=I_{1},I_{2},\ldots\}$ and the $Y_{I_{k}}$’s form a Bernoulli sequence, the well-calibration property does not hold.

Due to this counterexample from (Oakes, 1985), the machine forecaster cannot exclude the possibility that its test set may be mis-calibrated, and thus the machine cannot hold its subjective probability of being well-calibrated at $\Pi$-one. Furthermore, if this artificially imagined possibility of mis-calibration is a real possibility, then it follows that no test set, however large, can be guaranteed to be well-calibrated with true probability $P$-one. Later in this section, we prove that if such an imagined possibility is a real one, then machines cannot learn. Meanwhile, we also prove mathematically how the (Oakes, 1985) counterexample paralyzes Dawid’s Theorem 4.1, which amounts to the proof of Theorem 4.8.
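The following sketch simulates Counterexample 1 under our own illustrative settings (the fixed forecast $\alpha=0.3$ and the path length are assumptions): Nature perversely sets the true probability to $f(\alpha)$, so the relative frequency on the machine’s test set converges to $f(\alpha)$ rather than $\alpha$.

```python
import numpy as np

# Sketch of Counterexample 1 (Oakes, 1985): Nature responds to the machine's
# forecast x with true probability f(x), which never equals x.
def f(x):
    # f: [0,1] -> [0,1], f(x) = x + 1/2 on [0, 1/2], f(x) = 1 - x on (1/2, 1]
    return x + 0.5 if x <= 0.5 else 1.0 - x

rng = np.random.default_rng(1)
alpha, T = 0.3, 500_000          # machine's fixed forecast and path length

true_p = f(alpha)                # Nature's perverse response: f(0.3) = 0.8
y = rng.random(T) < true_p       # Bernoulli sequence Y_{I_k}
print("p_k =", round(y.mean(), 4), "but alpha =", alpha)  # p_k -> 0.8 != 0.3
```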

Remark 4.9.

More detailed discussion on Counterexample 1 is provided in Appendix D.

Lemma 4.10.

Suppose that a machine constructs a test set by the assessed probability $\alpha$. Then $E|p_{\infty}-\alpha|=0$ if and only if $P(p_{k}\rightarrow\alpha)=1$, where the expectation is taken with respect to the true probability $P$. Here, $p_{\infty}=\lim\limits_{k\rightarrow\infty}p_{k}$.

Lemma 4.11.

Let us fix $\alpha\in\Re[0,1]$ and suppose that $p_{\infty}$ exists. Then $E[p_{\infty}-\alpha]=0$ if and only if $E[\lim\limits_{k\rightarrow\infty}\frac{1}{k}{\textstyle\sum\limits_{j=0}^{k-1}}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha]=0$. In general, $E|p_{\infty}-\alpha|\geq E|\lim\limits_{k\rightarrow\infty}\frac{1}{k}{\textstyle\sum\limits_{j=0}^{k-1}}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha|$.

Remark 4.12.

By Lemma 4.10 and Lemma 4.11, we establish a connection between the true guarantee of well-calibration and the real forecasting game between a machine and Nature. Further discussion of this connection is provided in Appendix D.

Definition 4.13.

Nature is perverse when, for any fixed machine forecast $\alpha$, $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least for infinitely many $t$’s along the stochastic path of the test set.

By “at least i.o.” in Definition 4.13, we mean that Nature deviates from $\alpha$ either (i) infinitely often or (ii) all but finitely often along the stochastic path of the test set, and we clearly distinguish (i) from (ii). From now on, by “infinitely often” we mean that Nature not only deviates infinitely often but also conforms infinitely often, while “all but finitely often” has its usual meaning. Then, if the true probability of Nature’s perversity is zero, we denote it by $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.$ along the path of the test set$)=0$, which amounts to $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at most for $t<\infty$ along the path of the test set$)=1$. Furthermore, if there is no confusion, we will abbreviate Nature’s perversity as “$P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.$” while omitting “along the path of the test set.”
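The distinction between the two deviation patterns can be illustrated by a small simulation; all numbers below are our own assumptions. In pattern (i) Nature deviates on a subsequence (here, the even rounds) while conforming infinitely often on the rest; in pattern (ii) she deviates at every round after a finite time. In both patterns the relative frequency fails to converge to the forecast $\alpha$.

```python
import numpy as np

# Illustrative sketch of the two perversity patterns of Definition 4.13,
# with a fixed machine forecast alpha = 0.3 and deviation value 0.8.
rng = np.random.default_rng(2)
alpha, T = 0.3, 200_000

# (i) "infinitely often": deviate on even rounds, conform on odd rounds,
#     so Nature also conforms infinitely often.
p_io = np.where(np.arange(T) % 2 == 0, 0.8, alpha)

# (ii) "all but finitely often": conform only on the first 100 rounds,
#      then deviate forever after.
p_abfo = np.where(np.arange(T) < 100, alpha, 0.8)

for name, p in [("i.o.", p_io), ("all but f.o.", p_abfo)]:
    y = rng.random(T) < p                  # realized indicators 1_{A_{t+1}}
    print(name, "relative frequency:", round(y.mean(), 4), "vs alpha:", alpha)
```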

Now, according to the Success Criterion (1), a machine fails to learn the true probability in case (ii), because the machine then makes wrong forecasts along the path except for a finite few of the infinite opportunities to learn. However, it seems unclear whether the machine can learn or not in case (i). On the one hand, the machine seems not to be able to learn because it makes too many errors, say infinitely many errors. On the other hand, it seems that the machine should be able to learn because it makes astronomically many correct forecasts, say infinitely often. Therefore, while adopting this definition, we clearly prove by Theorem 4.20 and Corollary 4.30 that a machine cannot learn the true probability even when it is correct infinitely often, if it is wrong that often.

Observation Provided that the machine forecast $\Pi(A_{t+1}|\text{\ss}_{t})$ is fixed at some value $\alpha\in\Re[0,1]$, $P(\Delta_{t})$ becomes the true second-order probability on the true first-order probability of the event $A_{t+1}$, that is, $P(\Delta_{t})=P\left(\,P(A_{t+1}|\text{\ss}_{t})=\alpha\,\right)$, where $\Delta_{t}$ denotes the event that the machine makes a correct forecast at $t$.

It should be noted here that the numbers computable by a machine are countably many (e.g. (Turing, 1936)). Thus, the true second-order probability $P$ here is a probability mass function on a countable space and therefore satisfies the Kolmogorov axioms, although $\alpha$ may potentially be any real number in $\Re[0,1]$.

Remark 4.14.

More detailed discussions on the connection between true second-order probability and the forecasting game are provided in Appendix D.

Lemma 4.15.

Let us consider the forecasting game between Nature and a machine, and let us further suppose that the structure of this game at any given time $t$, i.e. whether it is simultaneous or not, is certain to Nature. Now, by 4.2 and 4.3, let us suppose that $\text{\ss}_{t}$ consists of the true facts, not necessarily knowledge. Then there exists a true second-order probability $P$ such that $0<P\left(P(A_{t+1}|\text{\ss}_{t})=\alpha\right)<1$ if and only if the real forecasting game is a simultaneous-move game at time $t$. In particular, $P\left(P(A_{t+1}|\text{\ss}_{t})=\alpha\right)=0$ if and only if the machine moves first and Nature moves later, after observing what move the machine takes in the forecasting game at time $t$.

There are various theories of learning in games (e.g. (Nisan et al., 2007)). What matters, therefore, is what one aims to learn through the games and who competes with whom in them. In the standard model, a machine aims to learn what the optimal actions are to minimize the expected (total) loss or payoff, which is determined in a given environment, say a financial market. In this case, a machine usually competes with other machines in the game. For example, in some online learning settings, a machine aims to learn a sequence of estimates that achieves sub-linear regret, given that the loss functions are convex. It gets a possibly different amount of payoff/loss at each round of the game along the stochastic path where the given sequence of games is played.

In our forecasting games, on the other hand, a machine aims to learn the true objective probability, if any, through the games, and so the machine competes with Nature. Also, whoever wins a round, the winner/loser gets a uniform payoff at every round along the path, for what counts is how many times the machine wins/loses along the path, not how much payoff it gets at each round once it wins/loses.

Theorem 4.16.

In the forecasting game between a machine and Nature, the machine does not necessarily learn that it wins at each round of the game even though it indeed wins.

Thus, a winning strategy is not equivalent to a learning strategy. Now, in the case where a machine does not learn that it wins/loses a game even though it indeed does so, it does not matter what payoff it gets when it wins/loses, because it cannot learn how much it gets at each round. What matters, on the contrary, is how many times it wins along the path, and this is why our game setting in Lemma 4.15 adopts a uniform payoff at each round.

Theorem 4.17.

Let us consider any arbitrary $\alpha\in\Re[0,1]$ for any machine forecast. If $P(p_{k}\rightarrow\alpha)=1$, then the true probability that Nature is perverse is zero with any of these forecasts $\alpha$ (Case 3).

(Case 1) Suppose that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at most finitely often along the stochastic path where the associated events $A_{t+1}$ occur. Then $P(p_{k}\rightarrow\alpha)=1$, where $p_{k}$ denotes the relative frequency along the path.

(Case 2) Suppose that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ just as in (Oakes, 1985)$)\neq 0$. Then $P(p_{k}\rightarrow\alpha)\neq 1$, where $p_{k}$ denotes the relative frequency along the stochastic path of the test set.

(Case 3) Suppose that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least i.o. along the test set$)\neq 0$. Then $P(p_{k}\rightarrow\alpha)\neq 1$, where $p_{k}$ denotes the relative frequency along the path of the test set.

Regarding Theorem 4.17, three things are worth noting: (i) (Case 1) is equivalent to the strong law of large numbers under a weaker assumption than i.i.d.: if the true probability $P(A_{t+1}|\text{\ss}_{t})$ exists and is identically distributed as $\alpha$ all but finitely often along the path, then the relative frequency converges to the same $\alpha$ with true probability $P$-one. (ii) (Case 2) shows that if (Oakes, 1985) holds with $\Pi$-subjective probability $>0$, then (Dawid, 1982) does not hold, which amounts to the proof of Theorem 4.8. (iii) (Case 3) shows, combined with Theorem 4.6, that if $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at most f.o. along the test set$)\neq 1$, then a machine cannot learn the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$. Thus, the third result (iii) has an important implication for time-series analysis: a machine cannot relax the assumption that the true probability $P(A_{t+1}|\text{\ss}_{t})$ is identically distributed along the stochastic path if it aims to learn that probability. To learn, the machine needs some identical-distribution assumption such as stationarity or ergodicity.
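As an illustrative check of implication (i), the following sketch (all settings are our own assumptions) generates data whose true conditional probability equals $\alpha$ at all but finitely many rounds, with the exceptional early rounds depending on past outcomes so that no independence is assumed; the relative frequency still approaches $\alpha$.

```python
import numpy as np

# Illustrative check of (Case 1) of Theorem 4.17: P(A_{t+1}|ß_t) = alpha for
# all but the first 1,000 rounds; the early rounds are history-dependent,
# so the data are not independent, yet the relative frequency -> alpha.
rng = np.random.default_rng(3)
alpha, T, t_star = 0.3, 200_000, 1_000

y = np.zeros(T, dtype=bool)
for t in range(T):
    if t < t_star:
        # history-dependent true probability during the exceptional rounds
        p_t = 0.9 if (t > 0 and y[t - 1]) else 0.1
    else:
        p_t = alpha               # identically distributed from t_star onward
    y[t] = rng.random() < p_t

print("relative frequency:", round(y.mean(), 4), "-> alpha =", alpha)
```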

Definition 4.18.

Suppose that, with true probability $P>0$, Nature is perverse with some forecast $\alpha^{\ast}$. Then Nature is uniformly perverse when, for any forecast $\alpha\in\Re[0,1]$, there exists no $\alpha\neq\alpha^{\ast}$ such that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.)=0$ for any event $A_{t+1}$.

In other words, when Nature deviates from forecasters for any event $A_{t+1}$, she does not discriminate against some forecasters in favor of others whose forecasts $\alpha$ Nature decides to conform to all but finitely often for sure.

Theorem 4.19.

Suppose that, for any $\alpha$, there exists a true second-order probability $P$ such that $P(P(A_{t+1}|\text{\ss}_{t})=\alpha)<1$ at least for infinitely many $t$’s. Then Nature is uniformly perverse.

Theorem 4.20.

Suppose that, for any $\alpha$, there exists a true second-order probability $P$ such that $P(P(A_{t+1}|\text{\ss}_{t})=\alpha)<1$ at least for infinitely many $t$’s. The machine then cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Now, let us discuss what the condition that the true second-order probability is strictly less than 1 means in Theorem 4.20. Note from Lemma 4.15 that $P\left(P(A_{t+1}|\text{\ss}_{t})=\alpha\right)=1$ if and only if Nature moves first and the machine moves later, after observing what move Nature takes in the forecasting game at time $t$. Thus, it is clear from the condition of Theorem 4.20 why and when the machine fails to learn the true probability if Nature is uniformly perverse: when the machine cannot move later after observing the true move of Nature infinitely often, there always exists a real possibility that the machine may not be able to match Nature’s move that often. Hence the machine cannot be truly guaranteed to be well-calibrated, which in turn implies the impossibility of machine learning. Since the machine cannot observe the true move of Nature in those forecasting games, the true probability is unobservable by the machine.

So far we have shown that it is a real possibility that Nature is perverse, and thus that no machine can learn the true objective probability. Now, someone might argue that this proof holds only under the condition that Nature is uniformly perverse. Nature may not be uniformly perverse, however, but only selectively perverse, so that, for some forecast $\alpha_{0}$, Nature may decide to be benevolent enough to conform to that $\alpha_{0}$. Then it may be the case that the true probability of Nature being perverse is zero for this $\alpha_{0}$, and accordingly that machines may be given an opportunity to learn the true objective probability for that $\alpha_{0}$.

Note, however, that it is entirely Nature’s decision when she will be benevolent to a machine and when she will not. Therefore, it is still a random event to the machine whether Nature is perverse or not. If so, we will show further that, even if the true probability of Nature’s being perverse is zero with some $\alpha_{0}$, a machine still cannot learn the true probability if it cannot learn which forecast is the right $\alpha_{0}$ for any event $A_{t+1}$.

Definition 4.21.

A machine tolerates error at $t$ while pursuing its goal of learning the true probability $P(A_{t+1}|\text{\ss}_{t})$, when $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ but $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})>0$ for some $\alpha\in\Re[0,1]$.

Remark 4.22.

In relation to Lemma 4.23, a more detailed interpretation of Definition 4.21 is provided in Appendix D.

Lemma 4.23.

Suppose that a machine aims to learn the true probability $P(A_{t+1}|\text{\ss}_{t})$ and thus performs an effective calculation that returns $\Pi(A_{t+1}|\text{\ss}_{t})=0$ as its result for the true probability $P(A_{t+1}|\text{\ss}_{t})$. Then $\Pi(A_{t+1}|\text{\ss}_{t})=0$ if and only if $\Pi(\{P(A_{t+1}|\text{\ss}_{t})=0\})=1$, for all but finitely many $t$’s.

Remark 4.24.

In relation to (Savage, 1972), more discussions on Lemma 4.23 are provided in Appendix D.

Definition 4.25.

Nature is selectively perverse when $\exists\alpha$ and $\alpha_{0}\neq\alpha$ such that $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.)=0$, while $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ at least $i.o.)>0$ for any other $\alpha\neq\alpha_{0}$.

Now, let us define Nature’s decision to be selectively perverse at $t$, to show by Lemma 4.28 that once Nature decides so at $t$, our real world remains as such.

Definition 4.26.

Nature decides to be selectively perverse at $t$ when there exist forecasts $\alpha$ and $\alpha_{0}\neq\alpha$ such that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$, while $P(A_{\alpha\neq\alpha_{0}}(t+1)|\text{\ss}_{t})\neq 0$, where $A_{\alpha}(t+1)$ denotes the event that, from $t+1$ onward, Nature is perverse with the associated events $A_{t}$’s whose assessed forecasts are $\alpha$.

Definition 4.27.

Suppose that Nature is selectively perverse, so that she freely decides at any time whether to be perverse at any rate or not. Then $t_{s}<\infty$ denotes a stopping time if $t_{s}$ is the last time that Nature changes her mind into non-perversity, so that, for any $\alpha_{0}$ with which Nature is not perverse with true probability $P$-one, $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$, $\forall t>t_{s}$.

Note that $t_{s}$ is $\text{\ss}_{t}$-measurable, because $\text{\ss}_{t}$ includes all the true facts up to $t$, and so whatever Nature decides at $t$, say the event $\{P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0\}$, belongs to the set of true facts $\text{\ss}_{t}$.

Lemma 4.28.

Nature is selectively perverse if and only if there exists a stopping time $t_{s}$ for every forecast $\alpha_{0}$ with which Nature is not perverse with true probability $P$-one, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$, while there is no stopping time $t_{s}$ for any other $\alpha\neq\alpha_{0}$.

Lemma 4.29.

Let us suppose that Nature is selectively perverse and that a machine learns which forecast is the right forecast $\alpha_{0}$ for any associated $A_{t}$’s with which Nature is not perverse with true probability $P$-one. The machine is then self-assured that the stopping time $t_{s}$ arrives for that $\alpha_{0}$.

Corollary 4.30.

Suppose that Nature is selectively perverse, so that, with true probability $P$-one, she is not perverse with some machine forecasts $\alpha_{0}$. Furthermore, suppose that the machine is not self-assured that the stopping time $t_{s}$ arrives for each of those $\alpha_{0}$’s. The machine then cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Note that along the stochastic path considered in Corollary 4.30, $P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.)=0$ $\forall t>t_{s}$. Now, for this $\alpha_{0}$,

(3)     $\limsup\limits_{t\rightarrow\infty}P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0})\leq P(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.)=0$

Therefore, without loss of generality, letting $t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$,

(4)     $P(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0})=1$, $\forall t>t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$.

Now, (4) means, by Lemma 4.15, that the true probability is observable at any time $t>t^{\ast}$ along this path. Then why is the machine still unable to learn the true probability, even though it can move after observing what move Nature takes in the forecasting games all along that path after $t^{\ast}$? According to Corollary 4.30, this is because the machine cannot be self-assured that the true probability will remain observable from time $t^{\ast}+1$ onward, even if it observes Nature’s true move at time $t^{\ast}+1$. Let us show this by the following Lemma 4.31.

Lemma 4.31.

Suppose that a machine is not self-assured of the stopping time $t_{s}$ for $\alpha_{0}$. The machine then cannot be self-assured that the true probability will remain observable from time $t^{\ast}+1$ onward, even if it observes Nature’s true move at time $t^{\ast}+1$.

From Theorem 4.20 and Corollary 4.30, we conclude that the impossibility of learning is derived under the assumption either that Nature is uniformly perverse or that Nature is selectively perverse but the machine is not self-assured of whether the stopping time arrives. What, then, would happen in the case where Nature is selectively perverse and a machine is self-assured of the stopping time $t_{s}$ when $t_{s}$ indeed exists? We show in the following that a machine can learn the true probability in this case, and further that this is the only case in which a machine can learn it.

Theorem 4.32.

Suppose that a machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$. The machine is then self-assured that the stopping time $t_{s}$ arrives for $\alpha$, while the machine is not self-assured that the stopping time $t_{s}$ arrives for any $\alpha$ for which such a $t_{s}$ does not exist.

Let us now define when the true probability is directly observable, based on the notion of population. The concept of population in Definition 4.33 is mainly indebted to (von Mises, 1957, 1967). Since the true probability is defined as the empirical distribution of this population available to a machine, the probability is said to be directly observable by the machine.

Definition 4.33.

Let us consider a set $S$ that consists of a sequence of events $A_{t+1}$’s, $\{A_{t+1}\}_{t=0}^{k-1}$, with $k$ potentially infinite. The set $S$ is defined to be a population with $k$ elements when it is assumed to have a certain attribute of interest, so that, once the set $S$ is collected, an indicator variable $1_{\{A_{t+1}\}}$ is assigned to each event $A_{t+1}$, where $1_{\{A_{t+1}\}}$ has value 1 or 0 depending on whether the event $A_{t+1}$ satisfies the attribute or not. Then the empirical distribution of the population $S$ with respect to the given attribute is defined to be $\frac{1}{k}\sum\limits_{t=0}^{k-1}1_{\{A_{t+1}\}}$.

Definition 4.34.

A machine directly observes $P(A_{t+1}|\text{\ss}_{t})$ from the population $S$ at $t^{\ast}$ if the following two conditions are satisfied: (i) the population $S$ is in principle available to the machine; (ii) the machine calculates the empirical distribution of the population with respect to the given attribute, and this is the true probability distribution of the event $A_{t+1}$.

Now, in the case where the sequence $\{A_{t+1}\}_{t=0}^{k-1}$ is a time series, Definition 4.34 means that $\Pi(A_{t^{\ast}+1}|\text{\ss}_{t^{\ast}})=\frac{1}{k}\sum\limits_{t=0}^{k-1}1_{\{A_{t+1}\}}=P(A_{t^{\ast}+1}|\text{\ss}_{t^{\ast}})$ with $k=t^{\ast}$. Thus, as $t^{\ast}$ goes to infinity, the directly observable true probability becomes the limiting relative frequency, the representative objective true probability.
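The following is a minimal sketch of Definitions 4.33 and 4.34 under our own illustrative assumptions: direct observation amounts to computing the empirical distribution of a given population $S$ with respect to an attribute, which approaches the limiting relative frequency as the population grows.

```python
import numpy as np

# Sketch of Definitions 4.33-4.34: the population size k and the true
# attribute frequency (0.3) are hypothetical choices for illustration.
rng = np.random.default_rng(4)
k = 100_000

# Population S, summarized by the indicators 1_{A_{t+1}} of the attribute,
# realized by Nature with true probability 0.3.
indicators = rng.random(k) < 0.3

# Empirical distribution (1/k) * sum_t 1_{A_{t+1}}: the machine's directly
# observed probability, approaching the limiting relative frequency 0.3.
empirical = indicators.mean()
print("empirical distribution:", round(empirical, 4))
```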

Theorem 4.35.

Suppose that a machine is self-assured of the stopping time $t_{s}$ when $t_{s}$ exists, but is not self-assured of the stopping time $t_{s}$ when no $t_{s}$ exists. The machine then directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$.

Theorem 4.36.

A machine directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ if and only if the machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Two things should be noted from Theorem 4.36. First, whenever the true probability is not directly observable, a machine cannot learn it. Now recall from Definition 2.1 that the machine here is an ideal one, with no practical limits on computational resources such as time or storage space. Therefore, no real machine, hindered by the many practical limits of our world, can overcome this impossibility of learning either, whenever the true probability is not directly observable. Second, Theorem 4.36 also says that the true probability is directly observable by a machine whenever the machine can learn it. Once a machine learns the true probability, so that the probability is successfully computable, the next question is how complex it is to compute. Now that the true probability is directly observable, this makes it easier to deal with the complexity problem (e.g. sorting algorithms). Thus, Theorem 4.36 directly connects the problem of computational solvability to the problem of complexity.

Now, let us finish this section by adding one more claim that the Success Criterion (1) to compute the true probability is sufficient for learning it.

Corollary 4.37.

If a machine calculates the true probability $P(A_{t+1}|\text{\ss}_{t})$ correctly most of the time, and this is self-assured to the machine, then the machine can learn the true probability.

5 Conclusion

We have discussed so far when machines can learn the true probabilities and when they cannot. In summary:

  • $\exists\alpha^{\ast}$ such that $P($Nature is perverse with $\alpha^{\ast})>0$, by Theorem 4.19.

Now that Nature is perverse with at least one forecast $\alpha^{\ast}$,

  • (i) Nature is uniformly perverse: machines cannot learn, by Theorem 4.20.

  • (ii) Nature is selectively perverse: $\exists t_{s}$ for each $\alpha_{0}$ such that $P($Nature is perverse with $\alpha_{0})=0$, by Lemma 4.28.

Then under (ii),

  • (ii-1) Machines are not self-assured of the $t_{s}$: machines cannot learn, by Corollary 4.30.

  • (ii-2) Machines are self-assured of the $t_{s}$:

Then under (ii-2),

  • (ii-2-1) $t_{s}$ actually does not arrive: machines cannot learn, by Theorem 4.32.

  • (ii-2-2) $t_{s}$ indeed arrives: machines can learn, and this is the only case in which machines can learn, by Theorem 4.35 and Theorem 4.36.

Before we close this section, let us add a few remarks. First, we emphasize that in this paper we have focused on a notion of “machine learning” that is not just a technical term, understood as the identification of a target function, but also an epistemic one, a counterpart to “human learning.” We focus on this epistemic notion of machine learning because we particularly mean by “machines” those artifacts that perform human-level intelligent behaviors.

Second, note that we do not need to specify how machines learn the true objective probabilities to prove the impossibility of machine learning on the true probabilities. Instead, we only need the necessary condition for any machine to learn the true objective probabilities if it learns them in any way. Thanks to this flexibility about how to learn, we come to have a powerful and robust result: no matter what kind of learning method a machine uses, it cannot learn the true objective probabilities that are not directly observable.

Lastly, let us emphasize again that our learning machine is an ideal device with no practical limits on time and storage space, etc. Therefore, the scope and limit of machine learning on true probabilities discussed in this paper are more fundamental than practical ones.

Acknowledgements

The author is grateful to Tyler Burge, Michael Christ, Philip Dawid, Joseph Halpern, Jinho Kang, Steven Matthews, Thomas Sargent, and Byeong-uk Yi for discussions that were helpful in various ways to develop this paper. In particular, Tyler helped me pay attention to the idea of converting a non-propositional structure to a propositional one while learning, and Joe helped me open my eyes to the possibility of machine learning on the true probabilities. I discussed every detail of this paper with Jinho so that I insisted that he should be listed as a co-author. Jinho refused on the ground that he did not make direct contributions to mathematical proofs, with which I disagree. But Jinho has been right most of the time when we disagreed, so I decided to agree. Lastly, the author is grateful to three anonymous reviewers and a meta-reviewer. Their reviews were helpful in improving this paper.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Blume & Easley (2006) Blume, L. and Easley, D. If you’re so smart, why aren’t you rich? belief selection in complete and incomplete markets. Econometrica, Vol. 74:929–966, 2006.
  • Blume & Easley (2008) Blume, L. and Easley, D. Market selection and asset pricing. The Handbook of Financial Markets: Dynamics and Evolution, by T. Hens IV and K. Schenk-Hoppe (ed.), North-Holland:403–438, 2008.
  • Boolos et al. (2002) Boolos, G., Burgess, J., and Jeffrey, R. Computability and Logic. Cambridge University Press, Cambridge, 2002.
  • Carnap (1963) Carnap, R. Logical Foundations of Probability. The University of Chicago Press, 1963.
  • Church (1936) Church, A. An unsolvable problem of elementary number theory. American Journal of Mathematics, Vol. 58(2):345–363, 1936.
  • Cogley & Sargent (2008) Cogley, T. and Sargent, T. The market price of risk and the equity premium: A legacy of the great depression. Journal of Monetary Economics, Vol. 55:454–476, 2008.
  • Cogley & Sargent (2009) Cogley, T. and Sargent, T. Diverse belief, survival and the market price of risk. Economic Journal, vol. 119:354–376, 2009.
  • Dawid (1982) Dawid, P. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):604–613, 1982.
  • Descartes (2008) Descartes, R. Meditations on First Philosophy. Translated by Moriarty. M. Oxford University Press, 2008.
  • Foster & Vohra (1993) Foster, D. and Vohra, R. Asymptotic calibration. Biometrika, 85(2):379–390, 1993.
  • Gaifman (1986) Gaifman, H. A theory of higher order probabilities. In TARK, 1986.
  • Halpern (2016) Halpern, J. Actual Causality. The MIT Press, Cambridge, MA, 2016.
  • Halpern & Fagin (1994) Halpern, J. and Fagin, R. Reasoning about knowledge and probability. Journal of the Association for Computing Machinery, Vol 41(2):340–367, 1994.
  • Hintikka (1962) Hintikka, J. Knowledge and Belief. Cornell University Press, Ithaca, 1962.
  • Kozen (1997) Kozen, D. Automata and Computability. Springer, New York, 1997.
  • Lewis (1980) Lewis, D. A subjectivist’s guide to objective chance. Studies in Inductive Logic and Probability, Volume II, by R. Jeffrey (ed.):263–293, 1980.
  • Maher (2010) Maher, P. Explication of inductive probability. Journal of Philosophical Logic, Vol. 39:593–616, 2010.
  • Moore (1985) Moore, R. C. A formal theory of knowledge and action. Formal Theories of the Commonsense World, by J. Hobbs and R. C. Moore, (ed.). Ablex Publishing Corp:319–358, 1985.
  • Nagel (1939) Nagel, E. Principles of the theory of probability. Int. Encycl. Unif. Sc., Vol. I(No. 6), 1939.
  • Nilsson (1986) Nilsson, N. Probabilistic logic. Artificial Intelligence, 28:71–87, 1986.
  • Nilsson (2011) Nilsson, N. Artificial Intelligence: A New Synthesis. Morgan Kaufmann, California, 2011.
  • Nisan et al. (2007) Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V. Algorithmic Game Theory. Cambridge University Press, New York, 2007.
  • Oakes (1985) Oakes, D. Self-calibrating priors do not exist. Journal of the American Statistical Association, 80(390):339, 1985.
  • Pearl (2018) Pearl, J. A personal journey into bayesian networks. UCLA Cognitive Systems Laboratory, Technical Report (R-476), 2018.
  • Ramsey (1931) Ramsey, F. Truth and probability. Studies in Subjective Probability, by Henry Kyburg and Howard Smokler (ed.). Krieger:25–52, 1931.
  • Russell (1998) Russell, S. Learning agents for uncertain environments. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp.  101–103, 1998.
  • Sandroni (2000) Sandroni, A. Do markets favor agents able to make accurate predictions? Econometrica, Vol. 68(6):1303–1341, 2000.
  • Savage (1972) Savage, L. J. The Foundations of Statistics. Dover Publications, New York, 1972.
  • Tarski (1944) Tarski, A. The semantic conception of truth: and the foundations of semantics. Philosophy and Phenomenological Research, Vol. 4(3):341–376, 1944.
  • Turing (1936) Turing, A. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42:230–265, 1936.
  • Valiant (1984) Valiant, L. A theory of the learnable. Communications of the ACM, 27, Nov.:1134–1142, 1984.
  • Valiant (2008) Valiant, L. Knowledge infusion: In pursuit of robustness in artificial intelligence. Proc 28th Conference on Foundations of Software Technology and Theoretical Computer Science, pp.  415–422, 2008.
  • von Mises (1957) von Mises, R. Probability, Statistics and Truth. revised English edition, Macmillan, New York, 1957.
  • von Mises (1967) von Mises, R. Mathematical Theory of Probability and Statistics. 2nd edition, Academic Press Inc., New York, 1967.

Appendix A Proofs for Lemmas, Theorems and Corollaries


Proof of Theorem 4.1 A proof of Theorem 4.1 is suggested in (Dawid, 1982). A simpler one is as follows: let $X_{t}=(\sum_{j=1}^{t}\xi_{j})^{-1}\xi_{t}(Y_{t}-\hat{Y}_{t})$. Since $(\sum_{j=1}^{t}\xi_{j})^{-1}$, $\xi_{t}$ and $\hat{Y}_{t}$ are $\text{\ss}_{t-1}$-measurable, it follows that $E(X_{t}\,|\,\text{\ss}_{t-1})=0$, where $E$ is taken with respect to $\Pi(\cdot\,|\,\text{\ss}_{t-1})$, and hence that $\sum_{t=1}^{k}X_{t}$ is a martingale adapted to $\text{\ss}_{k-1}$. Also, $E\big((\sum_{t=1}^{k}X_{t})^{2}\big)=\sum_{t=1}^{k}E(X_{t}^{2})\leq\lambda\,E\big\{\sum_{t=1}^{k}\big((\sum_{j=1}^{t}\xi_{j})^{-1}\xi_{t}\big)^{2}\big\}\leq\frac{\lambda\pi^{2}}{6}$, because $Y_{t}$ is an indicator variable and so $\mathrm{var}(Y_{t}\,|\,\text{\ss}_{t-1})$ is uniformly bounded above by some $\lambda$ with $0\leq\lambda<\infty$. Then, by the martingale convergence theorem, $\sum_{t=1}^{k}X_{t}$ converges with $\Pi$-probability one, which implies by Kronecker’s lemma that, with $\Pi$-probability one, $p_{k}-\alpha=(\sum_{t=1}^{k}\xi_{t})^{-1}\sum_{t=1}^{k}\xi_{t}(Y_{t}-\hat{Y}_{t})\rightarrow 0$, where $\hat{Y}_{t}=\alpha$ $\forall t$. $Q.E.D.$
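
For intuition, the following minimal simulation (our own illustration, not part of the proof) exhibits the statement of Theorem 4.1 numerically: along rounds picked out by a selection variable $\xi_{t}$ that does not look at the current outcome, the empirical frequency of an indicator with conditional mean $\alpha$ approaches the constant forecast $\alpha$. The value $\alpha=0.4$ and the coin-flip selection rule are arbitrary assumptions.

# Minimal sketch (our illustration) of Theorem 4.1: on rounds selected by
# xi_t = 1, the empirical frequency p_k of the indicators Y_t approaches
# the constant forecast alpha. alpha = 0.4 and the selection rule are
# arbitrary assumptions, not taken from the paper.
import random

random.seed(1)
alpha = 0.4
selected = []
for t in range(200_000):
    xi_t = random.random() < 0.5               # selection decided before seeing Y_t
    y_t = 1 if random.random() < alpha else 0  # E[Y_t | past] = alpha
    if xi_t:
        selected.append(y_t)
p_k = sum(selected) / len(selected)
print(p_k, abs(p_k - alpha))                   # p_k is close to alpha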

Proof of Lemma 4.5 Let $A_{t}$ be an event token at time $t$ and let $P(A|E)=\alpha$ be the true probability of event type $A$ conditional on event type $E$, whose event tokens are denoted by $A_{t}$ and $E_{t}$, respectively. Then, by the definition of $E$ with respect to $A$, $P(A_{t+1}\,|\,E_{t}\in\text{\ss}_{t})=\alpha$ with true probability $P$-one. Now, once $P(A_{t+1}\,|\,E_{t}\in\text{\ss}_{t})$ is learned as such at some $t_{0}$, then $E_{t_{0}}$ must have happened at that time, and so $P(E_{t_{0}})\neq 0$. Also, by 4.4, consider a subsequence of $E_{t_{k}}$’s where $P(E_{t_{k}})\neq 0$ for any $t_{k}>t_{0}$. Then, for this subsequence, $P(E_{t_{0}}\,\&\,E_{t_{k}})\neq 0$ for any $t_{k}>t_{0}$, because the $E_{t_{k}}$’s are independent of one another.

Here, the $E_{t_{k}}$’s are independent for the following reason: recall that, by definition, $P(A_{t_{k}+1}\,|\,E_{t_{k}}\in\text{\ss}_{t_{k}})=\alpha$ with true probability $P$-one. Then, note that $\text{\ss}_{t_{k}}$ includes the fact that $P(A_{t_{k-i}+1}\,|\,E_{t_{k-i}}\in\text{\ss}_{t_{k-i}})=\alpha$ for some $i\geq 1$. Now, without loss of generality, let $i=1$. Thus, we obtain


(1)       $P\big(\,P\big(A_{t_{k}+1}\,\big|\,\{P(A_{t_{k-1}+1}|E_{t_{k-1}})=\alpha\}\in\text{\ss}_{t_{k}}\big)=\alpha\big)=1$

Now that $E_{t_{k}}$ and $E_{t_{k-1}}$ are both included in $\text{\ss}_{t_{k}}$ by (1), to show that the $E_{t_{k}}$’s are independent we need to prove that


(2)       $P\big(\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha\}\,\big|\,\{P(A_{t_{k-1}+1}|\text{\ss}_{t_{k-1}})=\alpha\}\big)=P\big(\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha\}\big)$

But (2) is satisfied because $P(\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha\})=1=P(\{P(A_{t_{k-1}+1}|\text{\ss}_{t_{k-1}})=\alpha\})$.

Now that $P(E_{t_{0}}\,\&\,E_{t_{k}})\neq 0$ for any $t_{k}>t_{0}$ in this subsequence, we can always find some small enough $\epsilon>0$ such that $P(E_{t_{k}})>\epsilon$. Therefore, the probability of the elements of this subsequence does not vanish to zero, which implies that $\lim_{s\rightarrow\infty}P(E_{t_{0}}\,\&\,E_{t_{s}})\neq 0$. Since $\lim_{s\rightarrow\infty}P(E_{t_{0}}\,\&\,E_{t_{s}})\neq 0$, $\sum_{s=1}^{\infty}P(E_{t_{0}}\,\&\,E_{t_{s}})=\infty$. Then, by the second Borel–Cantelli lemma, $P(E_{t_{0}}\,\&\,E_{t_{s}}\ i.o.)=1$ for $s>0$, which means $P(E_{t_{0}}\in\text{\ss}_{t_{0}}\ \&\ E_{t_{k}}\in\text{\ss}_{t_{k}}\ i.o.)=1$ for $t_{k}>t_{0}$, the desired result. $Q.E.D.$

Proof of Theorem 4.6 Suppose that, for infinitely many $t$’s when $P(A_{t+1}|\text{\ss}_{t})$ stays the same as $\alpha$, machines learn this $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ at time $t$. Then, by the Success Criterion (1), $\Pi(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha=P(A_{t_{k}+1}|\text{\ss}_{t_{k}})$ at least infinitely often out of those infinite opportunities at $t$’s to learn. (We prove in Corollary 4.37 what we mean exactly by “most of the time.” Here we tentatively mean “at least $i.o.$” by it, because machines are otherwise wrong too often to learn given the Success Criterion (1).) Thus we can construct a test set consisting of the subsequence of $\Pi(A_{t_{k}+1}|\text{\ss}_{t_{k}})$ that is equal to $P(A_{t_{k}+1}|\text{\ss}_{t_{k}})$ for those infinitely many $t_{k}$’s. Let $\xi_{t_{k}+1}=1$ if and only if $\Pi(A_{t_{k}+1}|\text{\ss}_{t_{k}})=P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\alpha$. Note that $\xi_{t_{k}+1}$ is $\text{\ss}_{t_{k}}$-measurable, because the machine’s forecasting of $\alpha$ occurs at time $t_{k}$. Then, by Theorem 4.1, with true probability $P$-one, $p_{k}-\alpha=(\sum_{j=0}^{k-1}\xi_{t_{j}+1})^{-1}\sum_{j=0}^{k-1}\xi_{t_{j}+1}(Y_{t_{j}+1}-\alpha)\rightarrow 0$ as $k\rightarrow\infty$, where $P$ is defined over $\text{\ss}_{\infty}=\bigvee_{k=0}^{\infty}\text{\ss}_{t_{k}}$ and $\text{\ss}_{t_{k}}$ denotes the totality of true facts up to day $t_{k}$. $Q.E.D.$

Proof of Lemma 4.10 Clearly, if $p_{k}\rightarrow\alpha$ with $P$-probability one, then $E[p_{\infty}-\alpha]=0$, where the expectation is taken with respect to the true probability $P$; but not vice versa. The converse does not necessarily hold because, even when $P(p_{k}\rightarrow\alpha)<1$, we can have $E[p_{\infty}-\alpha]=0$ if $[p_{k}-\alpha]$ converges to $\pm\beta\neq 0$ with equal probability $\frac{1}{2}(1-P(p_{k}\rightarrow\alpha))>0$ each. However, $p_{k}\rightarrow\alpha$ with $P$-probability one if and only if $E|p_{\infty}-\alpha|=0$, for the following reason: letting $\Lambda_{\infty}$ denote the event that $p_{k}\rightarrow\alpha$ as $k$ goes to infinity, $E|p_{\infty}-\alpha|=P(\Lambda_{\infty})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{+}}+(1-P(\Lambda_{\infty}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{-}}=0$ if and only if $P(p_{k}\rightarrow\alpha)=1$, where $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{+}}$ denotes the value of $|p_{\infty}-\alpha|$ when $\Lambda_{\infty}$ occurs, while $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{-}}$ denotes that value when $\Lambda_{\infty}$ does not occur. Here, the “if” part is clear. For the “only if” part, if $P(p_{k}\rightarrow\alpha)<1$, then $(1-P(\Lambda_{\infty}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{-}}>0$ while $P(\Lambda_{\infty})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{+}}=0$, which implies that $E|p_{\infty}-\alpha|\neq 0$. $Q.E.D.$
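
To make the failure of the converse concrete, here is a minimal worked instance (our own numbers, chosen only for illustration): take $P(\Lambda_{\infty})=\frac{1}{2}$ and let $p_{k}-\alpha$ converge to $+\beta$ or $-\beta$, each with probability $\frac{1}{4}$, when $\Lambda_{\infty}$ fails. Then

\[
E[p_{\infty}-\alpha]=\tfrac{1}{2}\cdot 0+\tfrac{1}{4}\beta+\tfrac{1}{4}(-\beta)=0,
\qquad
E|p_{\infty}-\alpha|=\tfrac{1}{2}\cdot 0+\tfrac{1}{4}\beta+\tfrac{1}{4}\beta=\tfrac{\beta}{2}\neq 0,
\]

so the signed expectation vanishes even though $p_{k}\rightarrow\alpha$ fails with probability $\frac{1}{2}$, while the absolute expectation detects the failure.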

Proof of Lemma 4.11 By Fatou’s lemma, $E[\liminf_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]\leq\liminf_{k\rightarrow\infty}E[\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]=\liminf_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})\leq\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})\leq E[\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]$. Now, since $p_{\infty}$ exists by assumption, $\liminf_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}=\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}$. Then, by the squeeze theorem, $\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})$ also exists, and thus $E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]=\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})$. Now, by the law of iterated expectations, $E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}]-\alpha=E\big[E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}\,|\,\text{\ss}_{t_{j}}]-\alpha\big]=E\big[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big]$. Therefore, $E[p_{\infty}-\alpha]=0$ if and only if $E\big[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big]=0$. Also, $E|p_{\infty}-\alpha|=E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\big|=E\big[E\big[\,\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\big|\;\big|\;\text{\ss}_{t_{j}}\big]\big]$. But note that $E\big[E\big[\,\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\big|\;\big|\;\text{\ss}_{t_{j}}\big]\big]\geq E\,\big|E[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}Y_{t_{j}+1}-\alpha\,|\,\text{\ss}_{t_{j}}]\big|=E\,\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big|$ by Jensen’s inequality. Therefore, $E|p_{\infty}-\alpha|\geq E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{\ss}_{t_{j}})-\alpha\big|$. $Q.E.D.$

Proof of Lemma 4.15 Consider a simple two-player game $(I,S_{i},u_{i}(s))$ between Nature (player $i$) and a representative machine (player $-i$), where $I$ is the set of players $\{i,-i\}$, $S_{i}$ is the set of pure strategies $s_{i}$ for each player $i$, and $u_{i}(s)$ is the usual payoff function for player $i$. Since this is a probabilistic forecasting game, the pure strategy of each player can be any number in $\Re[0,1]$. But since the numbers computable by player $-i$ are countably many, we restrict $S_{i}$ to be countable. For simplicity, let $u_{i}:S_{i}\times S_{-i}\rightarrow\{-1,1\}$. In other words, for each profile $s=(s_{i},s_{-i})$, player $i$ obtains $1$ if she wins, and $-1$ otherwise. When Nature (player $i$) succeeds in deviating from the machine forecast, Nature wins; otherwise, the machine (player $-i$) wins. Thus, this is a kind of matching game with countably infinite state space.
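
To see concretely why such a matching game admits no pure-strategy Nash equilibrium, the following sketch (our own illustration; the two forecast values are arbitrary assumptions) restricts both players to two strategies and checks every pure profile:

# Minimal sketch (our illustration): a 2x2 restriction of the forecasting
# matching game. Nature gets +1 by deviating from the machine's forecast;
# the machine gets +1 by matching Nature. No pure profile is stable.
from itertools import product

forecasts = [0.3, 0.7]   # arbitrary restriction of the countable strategy sets

def payoffs(nature, machine):
    """Return (Nature's payoff, machine's payoff) for a pure profile."""
    return (-1, 1) if nature == machine else (1, -1)

def is_pure_nash(nature, machine):
    u_n, u_m = payoffs(nature, machine)
    nature_ok = all(payoffs(n, machine)[0] <= u_n for n in forecasts)
    machine_ok = all(payoffs(nature, m)[1] <= u_m for m in forecasts)
    return nature_ok and machine_ok

print([p for p in product(forecasts, forecasts) if is_pure_nash(*p)])  # -> []

The unique equilibrium mixes both strategies with probability $\frac{1}{2}$ each, which is consistent with the strictly interior second-order probability $0<P(\cdot)<1$ derived in part (ii) below.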

First, let us note that the structure of the forecasting game is given to Nature, because the structure itself is something objective about the world and thus belongs to the realm of Nature herself. In other words, it is certain to Nature whether she and the machine move simultaneously in the game, as follows. If the machine moves before Nature has moved, then it is certain to Nature that the machine moves first, and thus that the game is not simultaneous. If neither the machine nor Nature has moved yet, then it is certain to Nature that the machine does not move first, and whether the game is simultaneous or not depends on Nature herself. If Nature reveals herself to the machine before the machine moves, so that the machine can move after observing Nature’s move, then it is certain to Nature that the game is not simultaneous. Otherwise, it is certain to Nature that it is a simultaneous game.

$(i)$ Proof of the “only if” part: first, fix the machine forecast $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ and consider the relevant test set. Now, suppose that the forecasting game along the stochastic path of this test set is not a simultaneous-move game at time $t$. Then either Nature or the machine moves first, and the other moves later, after observing what move the opponent takes. Thus, the one who can observe the opponent’s move can control her own forecast so as to win the game, and so $\Delta_{t}$ occurs or does not occur at time $t$, which is certain to Nature because the structure of the game is given to her. Then, since $\text{\ss}_{t}$ includes $\Delta_{t}$ or $\lnot\Delta_{t}$ as part of the true facts by 4.2, $P(\Delta_{t}\in\text{\ss}_{t})=1$ or $P(\lnot\Delta_{t}\in\text{\ss}_{t})=1$. Thus, either $P\big(P(A_{t+1}\,|\,\Delta_{t}\in\text{\ss}_{t})=\alpha\big)=1$ or $P\big(P(A_{t+1}\,|\,\lnot\Delta_{t}\in\text{\ss}_{t})=\alpha\big)=0$, according as Nature moves first or the machine moves first. Therefore, the true second-order probability $P$ is neither strictly less than $1$ nor strictly greater than $0$.

$(ii)$ Proof of the “if” part: again, fix the machine forecast $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ and consider the relevant test set. Now, suppose that the forecasting game is a simultaneous-move game at time $t$. Then, for any fixed value $\alpha\in\Re[0,1]$, it is not certain to Nature herself whether $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ or not, because there exists no pure-strategy Nash equilibrium in this simultaneous matching game. Thus, Nature cannot with certainty control $P(A_{t+1}|\text{\ss}_{t})$ so as to make it deviate from $\Pi(A_{t+1}|\text{\ss}_{t})$, and so we obtain


(3)      $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)\neq 0$.


(3) holds even though the $\text{\ss}_{t}$ in $P(A_{t+1}\,|\,\text{\ss}_{t})$ in (3) includes $\Delta_{t}$ or $\lnot\Delta_{t}$ as part of the true facts by 4.2, if either of them indeed occurs at $t$. By the same logic, it is not certain to Nature that the machine can control $\Pi(A_{t+1}|\text{\ss}_{t})$ so as to make it coincide with $P(A_{t+1}|\text{\ss}_{t})$, and so we obtain


(4)       $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)\neq 1$.


Clearly, any mixed-strategy Nash equilibrium, if any, will lead to $0<P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$. Therefore, there exists the true second-order probability $P$ such that $0<P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$.

Furthermore, if Nature moves first, then $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)=1$, as we proved in $(i)$. Therefore, if the machine does not move first, which amounts to either Nature moving first or the machine moving simultaneously with Nature, then clearly $P\big(P(A_{t+1}\,|\,\text{\ss}_{t})=\alpha\big)\neq 0$. $Q.E.D.$

Proof of Theorem 4.16 Consider the necessary condition (2): if a machine learns the true objective probability $P(A_{t+1}|\text{\ss}_{t})$, then $\Pi(A_{t+1}|\text{\ss}_{t})=P(A_{t+1}|\text{\ss}_{t})$. Since this is just a necessary but not a sufficient condition, the converse of (2) does not necessarily hold. Now, for any machine forecast $\alpha\in\Re[0,1]$, suppose that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ for infinitely many $t$’s along the stochastic path where the associated $A_{t+1}$’s occur, but that $P(A_{t+1}|\text{\ss}_{t})=\alpha$ for infinitely many $t^{\ast}$’s. Then, by Theorem 4.19, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\ i.o.\big)>0$ for some event $A_{t+1}$. Thus, by (Case 3) of Theorem 4.17 and Theorem 4.6, the machine cannot learn the true probability $P(A_{t+1}|\text{\ss}_{t})$, even though $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha=P(A_{t+1}|\text{\ss}_{t})$ at infinitely many $t^{\ast}$’s. Thus, the machine does not learn that it wins even though it indeed wins at the $t^{\ast}$’s. Clearly, the machine does not learn whether it wins at the other $t$’s, when it loses. Now, since the machine does not learn whether it wins at each round of the game, it does not learn what its payoff is at each round. Furthermore, the machine is truly guaranteed to be well-calibrated along the path of $t^{\ast}$’s, and so this is a winning strategy in the forecasting game between Nature and the machine (e.g., (Foster & Vohra, 1993)), but the machine still cannot learn the true probability $P(A_{t+1}|\text{\ss}_{t})$. Thus, in this case, a winning strategy is not equivalent to a learning strategy. $Q.E.D.$

Proof of Theorem 4.17 First, let us recall the following: by Nature’s perversity with true probability $0$, we mean that $P(M_{t}\text{ at least }i.o.)=0$ for any fixed $\alpha\in\Re[0,1]$. Here, $M_{t}$ denotes the meta-event $\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ for any event $A_{t+1}$ at time $t\}$ for such a fixed forecast $\alpha$. Given this, let us consider the following three cases, according as how $P(A_{t+1}|\text{\ss}_{t})$ actually varies with respect to $\alpha$ along the path of the test set. (Case 3) amounts to Theorem 4.17.


(Case 1) Suppose that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$ for only finitely many $t$’s along the stochastic path. Now, as in Theorem 4.1, let $X_{t}=(\sum_{j=1}^{t}\xi_{j})^{-1}\xi_{t}(Y_{t}-\alpha)$. But, unlike in Theorem 4.1, here $\xi_{j}=1$ if $P(A_{j+1}|\text{\ss}_{j})=\alpha$, for all $j$ along the stochastic path, not necessarily restricted to the test set. Now, consider those finitely many $t$’s when $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$, and denote the largest among them by $t_{m}$. Then $P(A_{t+1}|\text{\ss}_{t})-\alpha=E[Y_{t}|\text{\ss}_{t-1}]-\alpha=0$, $\forall t>t_{m}$ along the stochastic path. Thus, $E(X_{t}|\text{\ss}_{t-1})=0$, where the expectation $E$ is taken with respect to the true probability $P(\cdot|\text{\ss}_{t-1})$, and so $\sum_{t=t_{m}+1}^{k}X_{t}$ is a martingale adapted to $\text{\ss}_{k-1}$ at $t>t_{m}$ along the path. Then, by the martingale convergence theorem and Kronecker’s lemma, $(\sum_{j=0}^{k-1}\xi_{t_{j}+1})^{-1}\sum_{j=0}^{k-1}\xi_{t_{j}+1}(Y_{t_{j}+1}-\alpha)\rightarrow 0$ with true probability $P$-one.


(Case 2) Consider the case where, with true probability $P>0$, $P(A_{t+1}|\text{\ss}_{t})$ deviates from $\alpha$ in the manner of Oakes (1985) along the test set. Then $E|p_{\infty}-\alpha|\neq 0$, and so the calibration property is not truly guaranteed, for the following reason. Let $\Lambda_{\infty}^{o}$ be the event that $P(A_{t+1}|\text{\ss}_{t})$ deviates from $\alpha$ in the manner of Oakes (1985) along the test set. Then, since some subsequence of the $Y_{t}$’s along the test set forms a Bernoulli sequence whose relative frequency converges to $f(\alpha)\neq\alpha$, $p_{k}$ does not converge to $\alpha$ when $\Lambda_{\infty}^{o}$ occurs. Now, let $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{+}$ be the value of $|p_{\infty}-\alpha|$ when $\Lambda_{\infty}^{o}$ occurs, and $|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{-}$ its value when $\Lambda_{\infty}^{o}$ does not occur along the test set. Then, by the same logic as in Lemma 4.11, we obtain $E|p_{\infty}-\alpha|=P(\Lambda_{\infty}^{o})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{+}+(1-P(\Lambda_{\infty}^{o}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{-}\neq 0$. Thus, $P(p_{k}\rightarrow\alpha)\neq 1$. However, the converse does not hold, for there can be many other ways in which $p_{k}$ fails to converge to $\alpha$ than in Oakes (1985). Hence it does not follow that $P(\Lambda_{\infty}^{o})>0$, even if $E|p_{\infty}-\alpha|\neq 0$.

Now, suppose that, with $\Pi$-subjective probability $>0$, $P(A_{t+1}|\text{\ss}_{t})$ behaves in the manner of Oakes (1985). Then, again by the same logic as in Lemma 4.11, we obtain $E|p_{\infty}-\alpha|=\Pi(\Lambda_{\infty}^{o})\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{+}+(1-\Pi(\Lambda_{\infty}^{o}))\times|p_{\infty}-\alpha|_{\Lambda_{\infty}^{o}}^{-}\neq 0$, where the expectation is now taken with respect to $\Pi$. Hence $\Pi(p_{k}\rightarrow\alpha)\neq 1$. Therefore, we conclude that if Oakes (1985) holds with $\Pi$-subjective probability $>0$, then Dawid (1982) does not hold, which amounts to a proof of Theorem 4.8.


(Case 3) In general, suppose that the true probability of Nature’s being perverse is not zero for any fixed forecast $\alpha$ on any associated events $A_{t}$’s. In other words, suppose that $P(M_{t}\text{ at least }i.o.\text{ along the test set})>0$, where $M_{t}$ is the meta-event that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha$. Then we claim that this implies $E|p_{\infty}-\alpha|\neq 0$, where $E$ is taken with respect to $P$.

First, suppose that $p_{\infty}$ exists. Also, suppose that $\alpha\neq 0$, because (Case 3) trivially holds if $\alpha=0$. Now consider an infinite subsequence of the $A_{t_{k}}$’s, $\{A_{t_{k_{j}}}\}_{j=0}^{\infty}$, which is conditionally identically distributed along the test set where $M_{t}$ occurs at least infinitely often. We can do this by Kolmogorov axioms 1 and 2 and Lemma 4.5, for the following reason: note that by Kolmogorov axioms 1 and 2 there always exists some $\beta\in\Re[0,1]$ such that $P(A|E)=\beta$ for any type events $A$ and $E$, given that the probability of the type event exists at all. Then, for this $\beta$, $P\big(P(A_{t+1}|\text{\ss}_{t})=\beta\ i.o.\big)=1$ according to Lemma 4.5. Thus we have found one subsequence of $\{A_{t_{k}}\}_{k=0}^{\infty}$ that is conditionally identically distributed, in the sense that $\{P(A_{t_{k}+1}|\text{\ss}_{t_{k}})=\beta\}_{k=0}^{\infty}$. Now, fix $\alpha$. Also, without loss of generality, suppose that $\beta\neq\alpha$. Since $\beta\neq\alpha$ is arbitrary, from this subsequence we can consider another subsequence $E_{A}$ of $\{A_{t_{k_{j}}}\}_{j=0}^{\infty}$ with true probability $P>0$ such that $E_{A}=\{P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta\}_{j=0}^{\infty}$ along the stochastic path of the test set in which $M_{t}$ occurs at least infinitely often.

For reductio, suppose that Nature deviates from $\alpha$ by picking numbers from uncountably many values of $\beta$, such that every value of $\beta$ equals $P(A_{t+1}|\text{\ss}_{t})$ at only at most finitely many $t$’s along the test set, with true probability $P$-one. In other words,

(5) For $\beta\in\Re[0,1]$ with $\beta\neq\alpha$, $P(A_{t+1}|\text{\ss}_{t})=\beta$ for at most finitely many $t$’s along the path of the test set where $M_{t}$ occurs at least infinitely often, with true probability $P$-one.

Note that there must be a countably infinite number of different $\beta$’s in (5). Let us denote each different $\beta$ at each time along the path by $\beta_{t_{k_{j}}}$, letting $\beta_{t_{k_{i}}}\neq\beta_{t_{k_{j}}}$ for $i\neq j$ without loss of generality. Now, recall that $p_{\infty}$ is assumed to exist along the stochastic path of the test set. Thus, inspired by this assumption, let us further assume that $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})$ exists, where $P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}$ or $P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\alpha$ along the path of the test set. Then, letting

\[
\xi_{t_{k_{j}}}:=\begin{cases}1 & P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\alpha\\ 0 & P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}\end{cases}
\]

(6) $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\big[\xi_{t_{k_{j}}}\,P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})+(1-\xi_{t_{k_{j}}})\,P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})\big]=\alpha\cdot\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}+\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}$.

Thus,

(7) $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\alpha$ if and only if $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}=\alpha\cdot\big(1-\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}\big)$.

In other words, if Nature deviates from the machine forecasts by the $\beta_{t_{k_{j}}}$’s so that her deviating forecasts on average satisfy (7) under (5), then $E|p_{\infty}-\alpha|=0$, and thus the test set is truly guaranteed to be well-calibrated. But Nature then loses the repeated forecasting games along the path in the long run. So Nature has no reason to behave in this way with true probability $P$-one. Let us then consider the following three cases:

(Case i) $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)=0$ at least $i.o.$

In this case, by Lemma 4.15, Nature observes the machine forecast $\alpha$ at each time $t_{k_{j}}$, whenever the machine predicts $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Now that $1=\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)$,

Nature would choose the deviating values $\beta_{t_{k_{j}}}$ in such a way that she would not allow (7) to hold with true probability $P$-one. Thus,

(8) $P\Big(\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}=\alpha\cdot\big(1-\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}\big)\Big)\neq 1$.

In other words, since Nature observes the machine forecast $\alpha$ at every time, she would deviate from each forecast $\alpha$ at $t_{k_{j}}$ in such a way that (8) holds in the end. Otherwise, $E|p_{\infty}-\alpha|=0$, and Nature would lose in the long run. Therefore, we conclude from (8) that $E|p_{\infty}-\alpha|\neq 0$ in case (i).

(Case ii) $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)=1$ at least $i.o.$

In this case, by Lemma 4.15, Nature moves first, so the machine cannot fail to match $P(A_{t+1}|\text{\ss}_{t})$. But then,

$1=\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\text{ at least }i.o.\big)=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at most }f.o.\big)$, which contradicts (5). Therefore, we exclude case (ii) under (5).

(Case iii) $0<P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$ at least $i.o.$

In this case, by Lemma 4.15, Nature moves simultaneously with the machine, so Nature has no reason to pick any particular $\beta_{t_{k_{j}}}\in\Re[0,1]$ at each $t_{k_{j}}$, for there exists no pure-strategy Nash equilibrium. Hence any combination of $\{\beta_{t_{k_{j}}}\}_{j=0}^{\infty}$ is equally likely. Now, without loss of generality, fix $\alpha$ and $\xi_{t_{k_{j}}}$ for each $t_{k_{j}}$. Then we claim that

(9) $P\big(\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}\to c\alpha\big)<P\big(\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}\to c\alpha^{-}\big)\leq 1$

where $c=1-\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}\xi_{t_{k_{j}}}$ for some fixed $c$, and $c\alpha\in C$ for some fixed $\alpha$ and some set $C$ such that every $x\in C$ satisfies $x\in\Re[0,1]$ but $C$ is countably infinite, and $c\alpha^{-}$ is any real number in $C\setminus\{c\alpha\}$, the set $C$ without $c\alpha$.

First, recall that $\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}$ exists. Then, by definition,

$\forall\epsilon>0$, $\exists N_{1}<\infty$ such that $\big|\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}-c\alpha\big|<\epsilon$, $\forall h>N_{1}$,

$\forall\epsilon>0$, $\exists N_{i}<\infty$ such that $\big|\frac{1}{h}\sum_{j=0}^{h-1}(1-\xi_{t_{k_{j}}})\,\beta_{t_{k_{j}}}-c_{i}\alpha^{-}\big|<\epsilon$, $\forall h>N_{i}$ $(1\neq i\in\mathbb{N})$.

Now, letting $N=\max(N_{1},N_{i})$, $\forall\epsilon>0$,

(10) $P\big(\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:\big|\frac{1}{h}\sum_{j=0}^{h-1}[P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}]-c\alpha\big|>\epsilon,\ \forall h>N\}\big)<P\big(\bigcup_{i=0}^{\infty}\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:\big|\frac{1}{h}\sum_{j=0}^{h-1}[P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta_{t_{k_{j}}}]-c_{i}\alpha^{-}\big|>\epsilon,\ \forall h>N\}\big)\leq 1$.

Therefore, we again obtain (8) by (10). We have now considered all possible cases under (5), all of which lead to $E|p_{\infty}-\alpha|\neq 0$. But this result is what we are trying to show in this proof anyway. Therefore, to continue the proof, let us accept that there exists such a set $E_{A}$ with true probability $P>0$.

Now, note that $E_{A}=\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:1_{\{\omega\}}=1$ when $P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})=\beta\neq\alpha$ for all $t_{k_{j}}$’s along the test set$\}\subset\{\omega\in\text{\ss}_{\infty}=\bigvee_{j=0}^{\infty}\text{\ss}_{t_{k_{j}}}:1_{\{\omega\}}=1$ when $\big|\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})-\alpha\big|\neq 0$ for all $t_{k_{j}}$’s along the test set$\}$. Then, since $P(E_{A})>0$, $P\big(\big|\lim_{h\rightarrow\infty}\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})-\alpha\big|\neq 0$ for all $t_{k_{j}}$’s along the test set$\big)>0$. Thus, since we have found one such subsequence of $\{\frac{1}{h}\sum_{j=0}^{h-1}P(A_{t_{k_{j}}+1}|\text{\ss}_{t_{k_{j}}})\}_{h=1}^{\infty}$ along the test set with true probability $P>0$, and $p_{\infty}$ exists, $P\big(\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{t=0}^{k-1}P(A_{t+1}|\text{\ss}_{t})-\alpha\big|\neq 0$ along the test set$\big)>0$ for $\alpha\neq 0$. Then, by the same reasoning as in Lemma 4.10, $E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{t=0}^{k-1}P(A_{t+1}|\text{\ss}_{t})-\alpha\big|\neq 0$. Now, by Lemma 4.11, we obtain $E|p_{\infty}-\alpha|\geq E\big|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{t=0}^{k-1}P(A_{t+1}|\text{\ss}_{t})-\alpha\big|\neq 0$ when $p_{\infty}$ exists. Clearly, when $p_{\infty}$ does not exist, $E|p_{\infty}-\alpha|\neq 0$.

Therefore, we conclude that if $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)>0$, then $E|p_{\infty}-\alpha|\neq 0$. $Q.E.D.$

Proof of Theorem 4.19 First, note that, with $P$-probability $>0$, $P(A_{t+1}|\text{\ss}_{t})\neq 1$ at least infinitely often for some event $A_{t+1}$. Otherwise, beyond the near future, all events $A_{t+1}$ would certainly continue to occur with $P$-probability one, and thus there would be no uncertainty about any of the $A_{t+1}$’s. Now, if this were the case, then we would have to stop here and simply conclude that no machine can learn the true probability of any $A_{t+1}$, simply because there is no uncertainty for any machine to measure by the true probability in our world. Therefore, to continue to prove our main claim, we accept that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq 1\text{ at least }i.o.\big)>0$ for some event $A_{t+1}$. Now, consider the test set where $\alpha^{\ast}=1$. Then, along the stochastic path of this test set, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha^{\ast}\text{ at least }i.o.\big)>0$. Therefore, we have found some $\alpha^{\ast}$ for which Nature is perverse with true probability $P>0$.

Now, suppose that, for any $\alpha$, $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$ at least for infinitely many $t$’s. In other words, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\big)>0$ at least $i.o.$ Then $0<\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)$. Thus, by Definition 4.18, Nature is uniformly perverse, which again means, by Definition 4.13, that $P(\text{Nature is perverse})>0$ for any $\alpha\in\Re[0,1]$. $Q.E.D.$

Proof of Theorem 4.20 Suppose that, for any $\alpha$, $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha\big)<1$ at least for infinitely many $t$’s. Then, by Theorem 4.17 and Theorem 4.19, $E|p_{\infty}-\alpha|\neq 0$, and so $P(p_{k}\rightarrow\alpha)\neq 1$ for any $\alpha\in\Re[0,1]$, where $P$ is the true objective probability defined over $\text{\ss}_{\infty}=\bigvee_{t=0}^{\infty}\text{\ss}_{t}$ and the expectation $E$ is taken with respect to this true probability $P$. Then, by Theorem 4.6, the machine cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$. $Q.E.D.$

Proof of Lemma 4.23 Suppose that the machine effectively calculates $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ with the goal of learning the true value of $P(A_{t+1}|\text{\ss}_{t})$. Then, by the necessary condition for learning, the machine must return a $\Pi(A_{t+1}|\text{\ss}_{t})$ that is congruent to $P(A_{t+1}|\text{\ss}_{t})=\alpha$ in order to achieve this goal. Now, suppose further that the machine at the same time calculates $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})\neq 0$. Then the machine tolerates error, by Definition 4.21.

However, by Theorem 4.6, the machine cannot tolerate errors infinitely often while achieving this goal of learning, for the following reason: for any $\alpha\in\Re[0,1]$, suppose that $\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ but $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})>0$ infinitely often. Now, since it must be that $P(A_{t+1}|\text{\ss}_{t})=\Pi(A_{t+1}|\text{\ss}_{t})=\alpha$ in order to learn the true probability, it must also be, by Theorem 4.6, that $P(p_{k}\rightarrow\alpha)=\Pi(p_{k}\rightarrow\alpha)=1$. But now, by assumption, $\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})>0$ infinitely often, which leads to $0<\limsup_{t\rightarrow\infty}\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\})\leq\Pi(\{P(A_{t+1}|\text{\ss}_{t})\neq\alpha\}\text{ at least }i.o.)$. But this contradicts $\Pi(p_{k}\rightarrow\alpha)=1$, by the same reasoning as in the proof of (Case 3) in Theorem 4.17 with $P$ replaced by $\Pi$, and so the machine cannot learn the true probability, by Theorem 4.6. Therefore, the machine cannot tolerate errors infinitely often if it aims to learn the true probability. Since $\alpha$ was arbitrary in $\Re[0,1]$, let $\alpha=0$, which gives the desired result. $Q.E.D.$

Proof of Lemma 4.28 (i) Proof of the “if” part: suppose that there exists a stopping time $t_{s}<\infty$ for some forecast $\alpha_{0}$ such that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$, $\forall t>t_{s}$, while there exists no stopping time for any other $\alpha\neq\alpha_{0}$, so that $P(A_{\alpha\neq\alpha_{0}}(t+1)|\text{\ss}_{t})>0$ at least infinitely often. Then, by the definition of $A_{\alpha_{0}}(t+1)$ and the law of iterated expectations,

(11)     $P(A_{\alpha_{0}}(t+1))\searrow P\big(\lim_{t\to\infty}A_{\alpha_{0}}(t+1)\big)$, because $A_{\alpha_{0}}(t+1)\searrow\lim_{t\to\infty}A_{\alpha_{0}}(t+1)$.

Now that $\lim_{t\to\infty}A_{\alpha_{0}}(t+1)$ is the event that $P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}$ at least $i.o.$, so that the limit exists,

(12)     $0=\lim_{t\to\infty}P(A_{\alpha_{0}}(t+1))=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)$ for $\alpha_{0}$.

Also, by the same logic as for $\alpha_{0}$,

(13)     $0<\lim_{t\to\infty}P(A_{\alpha}(t+1))=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)$ for any $\alpha\neq\alpha_{0}$.

Thus, by Definition 4.25, Nature is selectively perverse.

(ii) Proof of the “only if” part: suppose that Nature is selectively perverse. Then, by Definition 4.25, there must exist some $\alpha_{0}$ such that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$. Now, for reductio, suppose that for any such $\alpha_{0}$ there exists no stopping time $t_{s}$, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})>0$ at least infinitely often. In other words, Nature keeps changing her mind infinitely often between perversity and non-perversity, or Nature keeps being perverse all along. Then, by the law of iterated expectations, $P(A_{\alpha_{0}}(t+1))>0$ at least infinitely often, which contradicts the selective perversity of Nature by the same reasoning as in (13). $Q.E.D.$

Proof of Lemma 4.29 For any given $\alpha_{0}$ with which Nature is not perverse with true probability $P$-one, there exists $t_{s}<\infty$ for this $\alpha_{0}$ by Lemma 4.28. Now, by assumption, machines learn that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$. Thus, $\Pi(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$ by the necessary condition for learning. Then, by Lemma 4.23 and the same reasoning as for (11) in the proof of Lemma 4.28, $\Pi\big(P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0,\ \forall t>t_{s}\big)=1$. $Q.E.D.$

Proof of Corollary 4.30 (i) Suppose that Nature is selectively perverse, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$ for some $\alpha_{0}$ by Lemma 4.28. However, since the machine is assumed not to be self-assured that the stopping time $t_{s}$ arrives for that $\alpha_{0}$, the machine cannot learn that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$, by Lemma 4.29.

(ii) Now, note that if the machine learns $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$, the machine also learns that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$, in the following way: first, by Theorem 4.6 and (Case 3) in Theorem 4.17, machine learning of the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$ mathematically implies that $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$. Thus, once the machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$, it cannot fail to effectively calculate the true probability $P(A_{\alpha_{0}}(t+1))$ as $0$, following Theorem 4.6 and (Case 3) in Theorem 4.17 as instructions. Then, by Definition 2.2, the machine learns that $P(A_{\alpha_{0}}(t+1))=0$, in particular $\forall t>t_{s}$, so that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$, while following the law of iterated expectations as an instruction. However, as we proved in (i), the machine cannot learn that $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$. Hence we conclude that the machine cannot learn the true objective probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$ either. $Q.E.D.$

Proof of Lemma 4.31 Suppose that the machine is not self-assured of the stopping time $t_{s}$ for $\alpha_{0}$. Then,

(14)     $\Pi\big(P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0,\ \forall t>t_{s}\big)\neq 1$.

Now that $\lim_{t\to\infty}P(A_{\alpha_{0}}(t+1))=P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)$ for this $\alpha_{0}$,

(15)     $\Pi\big(P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0\big)\neq 1$.

Then, since $\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$,

(16)     $\Pi\big(P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1\ \forall t>t^{\ast}\big)\neq 1$, for some $t^{\ast}<\infty$.

Now, note that along the stochastic path considered in Corollary 4.30, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$ $\forall t>t_{s}$. Now, for this $\alpha_{0}$,

(17)     $\limsup_{t\rightarrow\infty}P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\big)\leq P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha_{0}\text{ at least }i.o.\big)=0$

Therefore, without loss of generality, letting $t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$,

(18)     $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1$, $\forall t>t^{\ast}\geq t_{s}$ with $t^{\ast}<\infty$.

Then, without loss of generality, let $P\big(P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1$ at $t^{\ast}+1$ by (18). Thus, (16) and (18) lead to the desired result by Lemma 4.15. $Q.E.D.$

Proof of Theorem 4.32 Suppose that the machine learns the true probability. Since the machine cannot learn if Nature is uniformly perverse, Nature must then be selectively perverse, so that the stopping time $t_{s}$ exists by Lemma 4.28. Then, by part (ii) of Corollary 4.30 and Lemma 4.29, the machine is self-assured of the stopping time $t_{s}$ when $t_{s}$ exists. We now finish the proof of Theorem 4.32 by showing that if the machine learns the true probability, the machine is not self-assured of the stopping time $t_{s}$ when such a $t_{s}$ does not exist.

Suppose that the machine is self-assured of the stopping time $t_{s}$ even though such a $t_{s}$ does not exist. The machine is then wrong about $t_{s}$, so it cannot learn the true probability along the path where $P(A_{\alpha}(t+1)|\text{\ss}_{t})>0$ at least $i.o.$, for the following reason: first, by Lemma 4.28, with true probability $P>0$, Nature is perverse towards the forecast $\alpha$ along the path where there is no stopping time $t_{s}$. Thus, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)>0$ for such a forecast $\alpha$. Then, by (Case 3) of Theorem 4.17 and then Theorem 4.6, the machine cannot learn that $\alpha$. In other words, the world does not exist in such a way that Nature allows the machine to learn the true probability. Notwithstanding, the machine has a wrong belief about the stochastic path of the true probability, and so cannot learn the true probability. $Q.E.D.$

Proof of Theorem 4.35 Suppose that the machine is self-assured of the stopping time $t_{s}$ along the path where, for any given $\alpha_{0}$, $P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$. Then, along this path, the machine obtains

$\Pi\big(P(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0\ \forall t>t_{s}\big)=1$, and so $\Pi(A_{\alpha_{0}}(t+1)|\text{\ss}_{t})=0$ $\forall t>t_{s}$ by Lemma 4.23.

Now, by the definition of $A_{\alpha_{0}}(t+1)$ and Lemma 4.23 again,

$\Pi(A_{t+1}|\text{\ss}_{t})=\alpha_{0}$, $\forall t>t^{\ast}>t_{s}$ for some $t^{\ast}<\infty$.

Note also that $P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}$, $\forall t>t^{\ast}>t_{s}$ for some $t^{\ast}<\infty$ along this path. Thus,

(19)     $P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}=\Pi(A_{t+1}|\text{\ss}_{t})$, $\forall t>t^{\ast}$ with $t^{\ast}<\infty$.

Then, as in Theorem 4.6, we can construct a test set along the stochastic path, using the assessed $\alpha_{0}$ as a selection criterion, by (19). This test set is also truly guaranteed to be well-calibrated.

Thus, from this test set along the path, the machine obtains the following by Lemma 4.10 and Lemma 4.11:


(20)    $P\big(\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{t=t^{\ast}}^{t^{\ast}+n}P(A_{t+1}|\text{\ss}_{t})=\alpha_{0}\big)=1$ if and only if $P\big(\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{t=t^{\ast}}^{t^{\ast}+n}1_{\{A_{t+1}\}}=\alpha_{0}\big)=1$

Now, let us gather the sequence $\{A_{t+1}\}_{t=t^{\ast}}^{\infty}$ along the path and call this set a population. The machine then effectively calculates the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$ by the empirical distribution over this population, by (20), which satisfies $(i)$ in Definition 4.34. Also, this effective calculation of the empirical distribution must be successful in returning the true probability $P(A_{t+1}|\text{\ss}_{t})$, for $\frac{1}{n}\sum_{t=t^{\ast}}^{t^{\ast}+n}P(A_{t+1}|\text{\ss}_{t})$ on the left-hand side of (20) is equal to $P(A_{t+1}|\text{\ss}_{t})$, $\forall n$ and $\forall t>t^{\ast}$, by (19), which satisfies $(ii)$ in Definition 4.34. Therefore, by Definition 4.34, the machine directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha_{0}$. $Q.E.D.$

Proof of Theorem 4.36 (i) Proof of the “if” part: this follows directly from Theorem 4.32 and Theorem 4.35.

(ii) Proof of the “only if” part: suppose that the machine directly observes the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ from the given population $S$ at some time $t^{\ast}$. The machine then effectively calculates $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ at $t^{\ast}$, while adopting the following as an instruction: recall that the given set $S$ consists of a sequence of events $A_{t+1}$, $\{A_{t+1}\}_{t=0}^{k-1}$ with $k$ potentially infinite. Since the set $S$ is available in principle to the machine by part $(i)$ of Definition 4.34, there must exist some rule for how to collect the available set of events $\{A_{t+1}\}_{t=0}^{k-1}$. Then let the machine build up the population $S$ by collecting events while following this rule. Now, once each event has been collected by the machine to constitute the set $S$, it must have been observed whether the event has a certain attribute of interest or not, and so a value of the indicator variable $1_{\{A_{t+1}\}}$ must have been assigned accordingly to each event $A_{t+1}$ by the machine. Then, let the machine calculate $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha=\frac{1}{k}\sum_{t=0}^{k-1}1_{\{A_{t+1}\}}$. Therefore, the machine effectively calculates $\Pi(A_{t+1}|\text{\ss}_{t})$ as $\alpha$.

Furthermore, note that $\frac{1}{k}\sum_{t=0}^{k-1}1_{\{A_{t+1}\}}$ is defined to be $P(A_{t+1}|\text{\ss}_{t})$ at $t^{\ast}$ by part $(ii)$ of Definition 4.34. The machine then cannot fail to compute $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ from the population $S$. Therefore, the machine learns the true probability $P(A_{t+1}|\text{\ss}_{t})$ as $\alpha$ by Definition 2.2. $Q.E.D.$
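
As a minimal computational sketch of this “only if” direction (our own illustration; the collection rule and the attribute test are hypothetical, not taken from the paper), the machine walks the given population, assigns each token its indicator value, and returns the empirical frequency:

# Minimal sketch (our illustration) of the effective calculation in the
# "only if" part of Theorem 4.36: given a population S of event tokens,
# assign each token its indicator value and return the empirical frequency
# alpha = (1/k) * sum of 1_{A_{t+1}}. `has_attribute` is hypothetical.

def has_attribute(event) -> bool:
    # Hypothetical observation rule: does this collected token have the
    # attribute of interest?
    return bool(event["outcome"])

def direct_observation(population) -> float:
    """Empirical frequency of the attribute over the population."""
    indicators = [1 if has_attribute(e) else 0 for e in population]
    return sum(indicators) / len(indicators)

# Usage: a toy population of k = 5 collected event tokens.
S = [{"outcome": o} for o in (1, 0, 1, 1, 0)]
print(direct_observation(S))   # -> 0.6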

Proof of Corollary 4.37 Let us first define what we mean by “most of the time” in the Success Criterion (1). By Lemma 4.10 and Theorem 4.17, machines cannot satisfy the calibration property, when the test set is constructed with an assessed probability $\alpha$ as the selection criterion, if $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)>0$. Therefore, in order to learn, the machines must return the correct calculations except for a finite number of times out of infinite opportunities to learn. Thus, “most of the time” in the Success Criterion (1) should mean “all but finitely often out of infinite opportunities to learn,” which means that machines must be correct not merely infinitely often while also being wrong that often.

Now suppose that the machine is correct most of the time when it aims to learn the true probability $P(A_{t+1}|\text{\ss}_{t})$. Then, by (Case 1) in Theorem 4.17, $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at most }f.o.\big)=1$. Thus, there exists a stopping time $t_{s}$, because $P\big(P(A_{t+1}|\text{\ss}_{t})\neq\alpha\text{ at least }i.o.\big)=0$ if and only if there exists a stopping time $t_{s}$ for any machine forecast $\alpha$, by Lemma 4.28. Furthermore, suppose that the machine is self-assured that it is correct most of the time. Then, again by Lemma 4.28, $\Pi(\text{there exists a stopping time }t_{s})=1$. Thus, if the machine satisfies the Success Criterion (1), then it satisfies the condition of Theorem 4.35. Therefore, if the machine satisfies the Success Criterion (1), it can learn the true probability by Theorem 4.35 and Theorem 4.36. $Q.E.D.$

Appendix B Some Literature for the Necessary Condition in Sec. 3.2

There is a large literature in logic and economics whose discussion bears on when a machine holds a true belief in the probabilistic proposition $A_{p}$. For example, while defining the concept of rationality in economic models, (Cogley & Sargent, 2008, 2009), (Sandroni, 2000), (Blume & Easley, 2006, 2008) and many others stipulate that an agent is rational when his/her partial beliefs are correct, in the sense that his/her subjective probability distributions are congruent to the true probability distribution that Nature identifies as such. In other words, a machine holds such a true belief in $A_{p}$ when it is rational, which entails that its subjective probability $\Pi$ is equal to the true objective probability $P$.

Also, in probabilistic logic, (Nilsson, 1986), (Halpern & Fagin, 1994), and many others follow the probabilistic version of the Tarskian semantic theory of truth in the following way: a formula describing the subjective probability of an agent is true when the agent’s probability assignment corresponds to what the sentence in fact represents. For example, in (Halpern & Fagin, 1994), a formula like $w_{i}(\varphi)\geq 2w_{i}(\psi)$ is true if, according to the probability assignment of agent $i$, the event $\varphi$ is at least twice as probable as $\psi$. Now, if we extend this idea to the true objective probability $P$, if any, a formula such as $w_{i}(\varphi)=w(\varphi)$, where $w_{i}$ denotes the probability operator of agent $i$ and $w$ denotes that of Nature, is true when, according to agent $i$’s probability assignment, the event $\varphi$ is exactly as probable as the true probability value that Nature assigns to $\varphi$ in our world.

It is worth noting, from the economics literature, when it becomes true that agent $i$’s partial belief in the event $\varphi$ has a degree $w_{i}(\varphi)$ that corresponds to the true objective probability $w(\varphi)$. This is indeed true when the subjective probability $w_{i}(\varphi)$ of agent $i$ is in congruence with the true objective probability $w(\varphi)$, which again makes the formula $w_{i}(\varphi)=w(\varphi)$ true. Therefore, the condition for any agent (or, in our context, any rational machine) to be rational in economics is equivalent to the truth condition for the formula in probabilistic logic.

Appendix C Justifications for the Three Assumptions

4.2 The $\text{\ss}_{t}$’s in $P(A_{t+1}|\text{\ss}_{t})$ are the sets of all the true facts up to time $t$.

In other words, $\text{\ss}_{t}$ is the historical path of true facts up to time $t$. To see that 4.2 is reasonable, recall that we are dealing with objective probability true to our world. Therefore, its conditioning facts must also be true in our world. Otherwise, $P(A_{t+1}|\text{\ss}_{t})$ cannot represent the true probability according to which the actual data are realized in our world. For example, if some special gravitational force operated on Mars, so that a fair coin landed on its edge as likely as on its head or tail, then the probability of the coin landing on heads conditional on this hypothesis would be $\frac{1}{3}$. However, if such a special gravitational force does not actually exist on Mars, this conditional probability $\frac{1}{3}$ cannot be true either, because the data would not be realized according to the probability $\frac{1}{3}$ in our world.
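
For concreteness, the arithmetic behind the $\frac{1}{3}$ (our own one-line derivation): under the hypothesis, the three outcomes are equally likely, so

\[
P(\text{head}\mid\text{edge as likely as head or tail})
=\frac{1}{|\{\text{head},\,\text{tail},\,\text{edge}\}|}=\frac{1}{3}.
\]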

4.3 No further knowledge requirement is imposed on the condition $\text{\ss}_{t}$.

To see that 4.3 is reasonable, note the following: if $\text{\ss}_{t}$ were the set of known facts, then $P(A_{t+1}|\text{\ss}_{t})$ could vary from person to person, as the set of events known to each person may differ, depending on who possesses what information. In order for $P(A_{t+1}|\text{\ss}_{t})$ to be objective, however, it should not depend on any particular person. Therefore, we require that $\text{\ss}_{t}$ consist of true facts, not necessarily knowledge.

4.4 Once the probability of an event type $E$ is established, its associated event tokens $E_{t_{k}}$ occur at some infinite subsequence of times $t_{k}$, so that $P(E_{t_{k}})$ does not vanish to zero as $t_{k}\rightarrow\infty$.

Here, “event token” refers to an event that occurs at some specific time and place, while “event type” refers to an abstract object with no specific space-time location. For example, cloudy weather in Denver is an abstract event type $E$ with no time subscript, while cloudy weather in Denver on 29 May 2024 is a particular event token $E_{t_{0}}$. Some literature (e.g., (Halpern, 2016)) deals mainly with probabilities of token events, while other literature (e.g., (Maher, 2010)) deals mainly with probabilities of type events. 4.4 establishes a connection between the probabilities of these two kinds of events.

To see that 4.4 is reasonable, consider the following example: suppose that we try to predict the probability that some person $i$ suffers from lung cancer caused by his/her smoking habit. As we discussed in the Introduction, this causal probability is objective, which makes it relevant to our discussion. Then, as long as the probability of the event type of having lung cancer from smoking is allowed to be considered for forecasting, we require that the true probability of the associated event tokens for the persons $i$ should not be identically zero from some time $t_{0}<\infty$ onward. In other words, although the true probability of such event tokens is allowed to be intermittently zero, it should not vanish to zero as $t_{k}\rightarrow\infty$.

It might be pointed out that a particular person, say Mary, will die at some time in the future, and that it will not make sense to consider the probability of Mary’s suffering from lung cancer after that time. However, unless all generations of human beings suddenly become extinct in the near future, we can consider the true probability of this event token at least for some person $i$ at each time $t$. Hence it makes sense to forecast the probability of such an event token in each specific case as $t\rightarrow\infty$.

Appendix D More Detailed Remarks

Remark 2.4 Now, let $\mathcal{F}$ be the sigma-field generated by $\Omega$, and let $\omega^{t}=(S_{0}^{-1}(s_{0}),\ldots,S_{t}^{-1}(s_{t}),\Omega_{t+1},\Omega_{t+2},\ldots)\in\Omega$ denote a partial history through date $t$. Then, for any probability measure $p_{t}$ on $\mathcal{F}_{t}$, $p_{t}(\omega^{t})$ becomes the (marginal) probability of the partial history, and each $\omega^{t}$ is assumed to be $\mathcal{F}_{t}$-measurable. Note then that $p_{t}(\omega^{t})=\prod_{\tau=1}^{t}p(\omega_{\tau}|\mathcal{F}_{\tau-1})$ for any $t$, and so $p_{t}(\omega^{t})=p(\omega_{t}|\mathcal{F}_{t-1})\,p_{t-1}(\omega^{t-1})$. Furthermore, when $s_{t}$ is only either 0 or 1, $S_{t}(\omega_{t})$ becomes an indicator function for an event $\{\omega_{t}\}$. Then, provided that there indeed exists any true objective probability $P$, $p(\{\omega_{t}\}|\mathcal{F}_{t-1})=P(\{\omega_{t}\}|\mathcal{F}_{t-1})=E(S_{t}(\omega_{t})|\mathcal{F}_{t-1})$, where the expectation $E$ is taken with respect to this true probability $P$.
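To make the recursion $p_{t}(\omega^{t})=p(\omega_{t}|\mathcal{F}_{t-1})\,p_{t-1}(\omega^{t-1})$ concrete, here is a minimal Python sketch; the conditional probabilities supplied to it are hypothetical values chosen for illustration, not quantities from the paper.

def path_probability(conditionals):
    # Marginal probability p_t(w^t) of a partial history, computed as the
    # product of the conditional probabilities p(w_tau | F_{tau-1}).
    prob = 1.0
    for p in conditionals:
        prob *= p
    return prob

# Hypothetical conditional probabilities along one path (illustrative only).
conditionals = [0.5, 0.3, 0.8]
print(f"{path_probability(conditionals):.2f}")            # 0.12
# Recursive form: p_t(w^t) = p(w_t | F_{t-1}) * p_{t-1}(w^{t-1}).
print(f"{path_probability(conditionals[:2]) * 0.8:.2f}")  # same value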

For example, let $S_{t}$ be an $\mathbf{i.i.d.}$ random variable whose value is $1$ if the event $\{\omega_{t}\}$ occurs at $t$ and $0$ otherwise. Then, $X_{n}=\sum_{k=1}^{n}S_{k}$ is the number of events that have occurred up to time $n$. Since $S_{t}$ is $\mathbf{i.i.d.}$, $p(\{\omega_{t}\}|\mathcal{F}_{t-1})$ is the same as $P(\{\omega_{t}\})$ across time. Now, let $\lim_{n\rightarrow\infty}\frac{X_{n}}{n}=\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{k=1}^{n}S_{k}$ be the limiting ratio of events that ever occur. Then, provided that this limit indeed exists, the dominated convergence theorem and Fubini’s theorem imply that $E\{\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{k=1}^{n}S_{k}\}=P(\{\omega_{t}\})$. Thus, in the $\mathbf{i.i.d.}$ case, we can derive that, with the true probability $P$-one, the true objective probability of the event $\{\omega_{t}\}$ is the limiting relative frequency, which is objective.
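The following small simulation illustrates this $\mathbf{i.i.d.}$ case: the relative frequency $X_{n}/n$ approaches the true probability as $n$ grows. The value p_true is an illustrative assumption, not a quantity from the paper.

import random

random.seed(42)
p_true = 0.3  # hypothetical true probability P({w_t}), for illustration only
for n in (100, 10_000, 1_000_000):
    x_n = sum(random.random() < p_true for _ in range(n))  # count of occurrences
    print(f"n = {n:>9}: X_n / n = {x_n / n:.4f}")
# The printed ratios approach p_true = 0.3 as n grows (strong law of large numbers).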

By stipulating that the true objective probability follows the rule by which Nature generates each actual data point, we emphasize that the true probability here is something objective, not subjective, and no more and no less than that. “Nature” is just a metaphor for describing the relationship of the true probability to our objective world. Adopting the widely accepted statistical notion of a data-generating process, we use the term “Nature” to refer to whatever is supposed to govern the underlying true objective process that generates the actual data. Given that Nature is simply a metaphor, it is important to emphasize that, in order to prove the possibility or impossibility of machine learning of the true objective probabilities, we need not commit ourselves to whether there really exists such a thing as a true objective process: probability might be merely something subjective that has nothing to do with “Nature.” If that is the case, then we conclude that no machines can learn the true objective probabilities, simply because there exist no such things as true probabilities for machines to learn.

Remark 3.1 The standard theory of subjective probability was first developed by Ramsey and then further by De Finetti and Savage. Subjective probability is designed to represent a degree of belief possessed by a subject, say some person. Here, two words deserve attention: degree and belief. First, subjective probability represents some aspects of belief. However, belief is an inner thought that, in principle, resists direct observation, while probability quantification requires measurability. Note that the easiest method of measurement is observation. Thus, in order for the degree of belief to be quantified as a probability measure, the unobservable must be made observable. Here comes in the relationship between unobservable belief and observable action: belief causes action. According to (Ramsey, 1931), the strength of our belief can be judged in relation to how we should act in hypothetical situations. Given a preferential system on the lotteries of a set of conditions, the choice of action under hypothetical circumstances reveals the degree of belief of the relevant agent. In this vein, subjective probability represents whatever is in anyone’s mind about anything, as long as his/her belief system is coherent, and thus can be assigned even to what is merely imagined. For instance, while arguing for cogito, ergo sum, (Descartes, 2008) imagined some evil spirit that has devoted all its efforts to deceiving him. Then Descartes can assign some value of subjective probability to this imagined evil spirit, according to how likely it is, to him, that such an imagining could be realized in this world, as long as his belief system is coherent.

Second, it is assumed that the degree of belief ranges between 0 and 1. For example, your belief that there will be rain tomorrow has a degree strictly less than $1$ and is thus called a partial belief, because you have some lack of confidence about future events. In addition to this quantitative usage of the term “belief”, however, there is another, categorical usage: “belief” refers to the proposition that something is the case, that something is not the case, or neither. For instance, your belief in the Moorean fact that here is one hand either affirms the case, denies it, or suspends judgment. Compared to partial belief, this qualitative belief is called belief simpliciter. As the term “belief” has these two faces, a gradational quantitative one and a categorical qualitative one, numerical degrees are assigned to partial beliefs, while truth values are assigned to beliefs simpliciter. In this paper, we abbreviate belief simpliciter by “belief” and denote partial belief by “partial belief” as it is.

In contrast, objective probability, if any, is what must be determined by objective features of the world that do not vary from person to person. Following (Nagel, 1939) and (Carnap, 1963), we list chance, logical probability, and relative frequency as exhaustive examples of objective probability. The best way to clarify these concepts is to consider examples. Following (Maher, 2010), suppose that a coin has the same face on both sides, that is, it is two-headed or two-tailed. Provided further that it is completely uncertain which face value, heads or tails, the coin has on both sides, the chance of getting heads when the coin is tossed is 1 or 0, while its logical probability is $\frac{1}{2}$. Furthermore, when the coin is tossed infinitely often, its relative frequency surely converges to 1 or 0.

Here, the chance is either 1 or 0, depending on what our world is like, namely, whether the coin is indeed two-headed or two-tailed. Therefore, the chance is objective in the sense that it depends on real features of the coin, not on any personal inner thought. On the other hand, the logical probability is $\frac{1}{2}$, because the given conditions logically imply that the coin has the same face value on both sides but leave it completely uncertain whether it is two-headed or two-tailed. Therefore, logical probability is also objective in the sense that it depends on the logical features of our world, not on us. Clearly, the relative frequency is what our world turns out to be, not whatever we believe. However, no matter which interpretation of probability is adopted among these three kinds, it is important to note that the true objective probability $P$ in Definition 2.3 is a mathematical object that is supposed to represent any of them, as long as they satisfy the Kolmogorov axioms.

Remark 4.7 It should be noted that Theorem 4.6 is our building block for proving when a machine cannot learn the true probability, because $p_{\infty}$ in Theorem 4.6 denotes the limiting relative frequency along the test set, the representative true objective probability. We do not consider any limiting behavior of the relative frequency outside the test set, because learning $\alpha$ per se is not possible outside the test set, by the necessary condition for learning in Section 3.2. Therefore, if it is shown to be impossible that, with $P$-probability one, $p_{k}\rightarrow\alpha$ along the stochastic path of the test set collected by the assessed $\alpha$, then it follows from Theorem 4.6 that the machine cannot learn the true probability.

Now, note that $P(p_{k}\rightarrow\alpha)=1$ if and only if for any $\epsilon>0$, $\lim_{n\rightarrow\infty}P(\sup_{m\geq n}|p_{m}-\alpha|<\epsilon)=1$. Thus, if the machine learns, then for all $\epsilon>0$ small enough, $\lim_{n\rightarrow\infty}P(|p_{n}-\alpha|<\epsilon,\;|p_{n+1}-\alpha|<\epsilon,\ldots)=1$, which is $\lim_{n\rightarrow\infty}P(p_{n}=\alpha,\;p_{n+1}=\alpha,\ldots)=1$. Thus, Theorem 4.6 is not committed to what the machine does with the first $n-1$ data points while “learning”. This concept of machine learning is flexible enough to allow for finitely many potential errors, where $p_{t}\neq\alpha$ for all $t<n$ so that $P(A_{t+1}|\text{ß}_{t})\neq\alpha$ for all $t<n$, while the machine processes the data to learn.
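As an illustration of this convergence along the test set, consider the following toy simulation, under the illustrative assumption that Nature’s true conditional probability always matches the machine’s forecast, i.e. that the machine has indeed learned. The forecast stream and the value of alpha are hypothetical choices, not constructions from the paper.

import random

random.seed(7)
alpha = 0.25                 # the machine's assessed probability of interest
hits = trials = 0
for t in range(500_000):
    forecast = random.choice([0.25, 0.5, 0.75])  # the machine's forecast stream
    outcome = random.random() < forecast         # Nature matches each forecast
    if forecast == alpha:                        # selection criterion: the test set
        trials += 1
        hits += outcome
print(f"p_k along the test set = {hits / trials:.4f} (alpha = {alpha})")
# The empirical frequency p_k collected by the assessed alpha converges to alpha.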

Remark 4.9 Indeed, it may well be argued against the (Oakes, 1985) Counterexample that, although such behavior can be imagined, Nature never actually behaves that way. There is no reason why Nature should be so perverse as to generate data in such a deviating way. The true objective probability of Nature being perverse may simply be zero. In that case, Theorem 4.1 and Theorem 4.8 do not necessarily imply that a machine cannot learn the true probability.

Theorem 4.8 shows only that if a machine can imagine such a counterexample, and thus sincerely believes in its possibility, then its subjective probability of long-run mis-calibration is not zero. But recall Descartes’ Demon case from Section 3.1. A mere possibility in imagination does not necessarily imply a real possibility, namely that the true objective probability of its occurring in the actual world is not zero. Theorem 4.1 and Theorem 4.8 show only that if a machine cannot exclude such a counterexample, it cannot be self-assured of being well-calibrated with its own subjective probability $1$. However, recall that there exists an asymmetric relation between subjective and objective probabilities: objective probability binds subjective probability, but not necessarily vice versa. Thus, if the true probability of Nature’s perversity is proven to be zero, the machine can exclude such a possibility, and so its subjective probability of Oakes’ counterexample will be zero as well. Then it follows neither that the machine cannot be self-assured of being well-calibrated nor that it cannot be truly guaranteed to be so, which implies that the impossibility of machine learning does not necessarily follow from Theorem 4.6.

Later, by Theorem 4.19, we prove that such an imagined possibility of Nature’s being perverse is a real one if the true probability is not observable. Meanwhile, we also prove mathematically how the (Oakes, 1985) Counterexample paralyzes Dawid’s Theorem 4.1, which amounts to the proof of Theorem 4.8. Note that if the true probability indeed escapes from the machine’s forecast just as in (Oakes, 1985), Theorem 4.1 breaks down: Theorem 4.1 critically relies on the martingale property of $\sum_{t=1}^{k}X_{t}$ given $\text{ß}_{k-1}$, where $X_{t}=(\sum_{j=1}^{t}\xi_{j})^{-1}\cdot\xi_{t}(Y_{t}-\hat{Y}_{t})$, which follows from $E(X_{k}|\text{ß}_{k-1})=0$. This martingale property, however, breaks down when $P(A_{t+1}|\text{ß}_{t})=E(Y_{t+1}|\text{ß}_{t})\neq\hat{Y}_{t+1}=\Pi(A_{t+1}|\text{ß}_{t})$ for all $t$. Note that (Dawid, 1982) takes it for granted that $E(Y_{t+1}|\text{ß}_{t})=\Pi(A_{t+1}|\text{ß}_{t})=\hat{Y}_{t+1}$ for all $t$. Therefore, if we relax this assumption, we can prove mathematically how (Oakes, 1985) works against (Dawid, 1982), which is shown in (Case 2) of the proof of Theorem 4.17.
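The following toy simulation sketches how the breakdown manifests empirically: if Nature, Oakes-style, generates data by a deviating conditional probability beta whenever the machine forecasts alpha, the empirical frequency along the test set tracks beta rather than alpha. Both alpha and beta are hypothetical values chosen for illustration, not the paper’s formal construction.

import random

random.seed(0)
alpha = 0.5   # the machine's constant subjective forecast on the test set
beta = 0.8    # Nature's deviating true conditional probability (hypothetical)

n = 200_000
hits = 0
for t in range(n):
    hits += random.random() < beta  # Nature generates data by beta, not alpha
print(f"machine forecast alpha = {alpha}")
print(f"empirical frequency    = {hits / n:.4f} (tracks beta = {beta}, not alpha)")
# With E(Y_{t+1} | ß_t) != the forecast, the martingale property fails and the
# machine is mis-calibrated along its own test set.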

Remark 4.12 Regarding Lemma 4.10 and Lemma 4.11, two things deserve note. First, we do not require any standard assumption, such as that the stochastic process be $\mathbf{i.i.d.}$ along the historical path of the test set, so $P(A_{t+1}|\text{ß}_{t})$ can vary along the path. Note also that, unlike (Blume & Easley, 2006, 2008), etc., we do not need to consider all the associated events $A_{t}$ along the stochastic path; we consider only the events $A_{t}$ whose assessed probabilities are $\alpha$. The set of those events $A_{t}$ is called a test set, because it is collected according to the selection criterion of being assessed constantly as $\alpha$. Therefore, we do not assume any specific property of the stochastic process along the path in the test set, such as stationarity or ergodicity, because we include only arbitrary subsequences of the stochastic process in the test set according to the subjective assessment.

Second, by Lemma 4.10 and Lemma 4.11, we obtain that if $P(p_{k}\rightarrow\alpha)=1$, then $E\left|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{ß}_{t_{j}})-\alpha\right|=0$, where the expectation is taken with respect to the true probability $P$. From this equation, we establish a connection between the true guarantee of well-calibration and the real forecasting game between a machine and Nature: $(i)$ the true guarantee of well-calibration is connected to forecasting games between a machine and Nature, for what the machine forecasts is $\alpha$ while what Nature forecasts is $P(A_{t_{j}+1}|\text{ß}_{t_{j}})$, and thus whether $\left|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{ß}_{t_{j}})-\alpha\right|=0$ holds or not is tied to how Nature and the machine play the forecasting games along the stochastic path of the test set. In this game, the machine loses at time $t$ whenever Nature succeeds in deviating from the machine’s forecast at that time. Some literature deals with the problem of well-calibration in various forecasting game settings (e.g. (Foster & Vohra, 1993)). $(ii)$ Also, note that, in the proof of Lemma 4.11, we take both the inner and outer expectations with respect to the true probability $P$ while applying the law of iterated expectations. Thus, it is a real game, not an arbitrary imaginary one, for $\left|\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{j=0}^{k-1}P(A_{t_{j}+1}|\text{ß}_{t_{j}})-\alpha\right|=0$ is expected to hold with respect to the true probability $P$, not any other subjective probability $\Pi$.
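The following sketch illustrates the two sides of this game on a single test-set path, under two hypothetical behaviors of Nature: one in which her true conditionals oscillate around the machine’s constant forecast alpha, so that the time-average criterion above holds, and one in which she deviates constantly, so that the machine loses. All numerical values are illustrative assumptions.

alpha = 0.5
k = 100_000

# Nature's true conditionals P(A_{t_j+1} | ß_{t_j}) along the test set.
oscillating = [alpha + 0.1 * (-1) ** j for j in range(k)]  # deviations cancel
constant_dev = [alpha + 0.1] * k                            # Nature always deviates

for name, path in (("oscillating", oscillating), ("constant deviation", constant_dev)):
    gap = abs(sum(path) / k - alpha)
    print(f"{name:>18}: |(1/k) * sum of Nature's forecasts - alpha| = {gap:.6f}")
# Oscillating Nature yields a gap of 0 (calibration criterion holds);
# constantly deviating Nature leaves a gap of 0.1 (the machine loses).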

Remark 4.14 Now, let us establish a connection between the true second-order probability and the forecasting game between Nature and a machine. For simplicity, let us denote by $\Delta_{t}$ the event at time $t$ that $P(A_{t+1}|\text{ß}_{t})=\alpha$ for any machine forecast $\alpha$. In other words, $\Delta_{t}$ denotes the event that the machine makes the correct forecast at time $t$, which means that the machine wins the forecasting game at that time. Note here that, strictly speaking, $\Delta_{t}$ is a complex event consisting of two events, $\{P(A_{t+1}|\text{ß}_{t})=\alpha\}$ and $\{\Pi(A_{t+1}|\text{ß}_{t})=\alpha\}$ for the same functional value $\alpha$, while $P(A_{t+1}|\text{ß}_{t})$ and $\Pi(A_{t+1}|\text{ß}_{t})$ are two probability functions for the common event $A_{t+1}$; that is, $\{\Delta_{t}\}=\{P(A_{t+1}|\text{ß}_{t})=\alpha=\Pi(A_{t+1}|\text{ß}_{t})\}$. However, since we consider only the test set along the stochastic path, here we take $\Pi(A_{t+1}|\text{ß}_{t})$ to be fixed as $\alpha$ along the path.

Then, extending some notions from (Gaifman, 1986), let us derive a second-order probability, i.e. the probability of a probability, from the outcomes of the forecasting game between Nature and the machine as follows: for any event $A_{t+1}$, the true second-order probability $P$ is the probability of the meta-event that the first-order probability (either Nature’s true forecast or the machine’s subjective forecast) of $A_{t+1}$ actually has a certain numerical value $\alpha\in\Re[0,1]$. Thus, the true second-order probability $P$ denotes $P\left(\,P(A_{t+1}|\text{ß}_{t})=\alpha\,\right)$.
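As a concrete illustration, a second-order probability on a countable domain can be represented as a simple assignment over candidate computable values of the first-order probability $P(A_{t+1}|\text{ß}_{t})$; all numerical values below are hypothetical.

from fractions import Fraction

# A minimal sketch of a second-order probability over a countable set of
# candidate first-order values, with illustrative numbers.
second_order = {
    Fraction(1, 3): Fraction(1, 2),  # P( P(A_{t+1} | ß_t) = 1/3 ) = 1/2
    Fraction(1, 2): Fraction(1, 4),  # P( P(A_{t+1} | ß_t) = 1/2 ) = 1/4
    Fraction(2, 3): Fraction(1, 4),  # P( P(A_{t+1} | ß_t) = 2/3 ) = 1/4
}
assert sum(second_order.values()) == 1  # Kolmogorov normalization
print(second_order[Fraction(1, 2)])     # 1/4: the meta-event that alpha = 1/2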

Here, it is worth noting that although we derive the notion of higher-order probabilities by extending some notions from (Gaifman, 1986), our notion differs from his in the following ways. We do not distinguish the first-order and the second-order probabilities and use the same notation $P$ for both, whereas (Gaifman, 1986) uses $P$ and a $PR$ operator to denote the second-order probability and the event on the first-order probability, respectively. This is because Gaifman’s notions differ from ours in that (1) $P$ in Gaifman denotes the agent’s subjective probability, while our second-order probability $P$ can be a true objective one, just like the first-order true probability; (2) his $PR$ operator accepts a closed interval as one of its arguments, while the domain of our second-order probability $P$ does not contain intervals of real numbers. Note that our domain of the second-order probability is assumed to be generated by the collection of all the singletons of the computable real values of the first-order true probability function $P$, and that it is assumed to be countable. Thus, the domain does not contain intervals of real numbers. (3) In addition, our notion of the first-order probability is a precise one, not an imprecise one, so it is not supposed to belong to any interval or any set of probability measures.

Now, the probability space of the second-order probability is defined as $(\Omega,\mathcal{G},P)$, in which $\Omega$ is the set of all the computable functional values for any given true first-order probability function $P(A_{t+1}|\text{ß}_{t})$, $\mathcal{G}$ is a field generated by the collection of all the singletons in $\Omega$, and $P$ is the second-order probability with $P:\mathcal{G}\rightarrow\Re[0,1]$. Note here that $\Omega$ is countable and that $\Omega$ is the set of all the possible forecasts by machines on the event $A_{t+1}$ given $\text{ß}_{t}$. Now, if the domain of the second-order probability were a sigma-field $\mathcal{F}$ generated by $\Omega$, the problem would be that the sigma-field $\mathcal{F}$ becomes uncountable given that $\Omega$ is countable. So, we should consider a field $\mathcal{G}$, not a sigma-field $\mathcal{F}$, for the probability space of the second-order probability $P$.

Here are some justifications for using a field $\mathcal{G}$, not a sigma-field $\mathcal{F}$, as the domain of the second-order probability $P$: we do not require the domain of the second-order probability to include all countably infinite unions, for the number of strategies a machine can use would then become uncountable, which contradicts the fact that the set of numbers a machine can compute is countable. In our forecasting game, any singleton in $\Omega$ can be thought of as a pure strategy of the machine, and any union of those singletons as a mixed strategy of the machine. Again, since the set of numbers a machine can compute is countable, a machine cannot compute uncountably many mixed strategies.
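To illustrate why the field $\mathcal{G}$ suffices, note that the field generated by the singletons of a countable $\Omega$ consists exactly of the finite sets and their complements, so each of its members admits a finite representation. The following minimal sketch encodes this; the forecast values are hypothetical.

from fractions import Fraction

class FieldEvent:
    # An event in the field G generated by singletons of a countable Omega:
    # every such event is a finite set of forecast values or the complement
    # (in Omega) of a finite set, so no uncountable unions are ever needed.
    def __init__(self, finite_part, complemented=False):
        self.finite_part = frozenset(finite_part)
        self.complemented = complemented

    def contains(self, omega):
        inside = omega in self.finite_part
        return not inside if self.complemented else inside

    def complement(self):
        return FieldEvent(self.finite_part, not self.complemented)

pure = FieldEvent([Fraction(1, 3)])                   # a pure strategy (singleton)
mixed = FieldEvent([Fraction(1, 3), Fraction(1, 2)])  # a finite mixed strategy
print(mixed.contains(Fraction(1, 2)))                 # True
print(mixed.complement().contains(Fraction(1, 2)))    # False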

Remark 4.22 Recall from the necessary condition for learning in Section 3.2 that $P(A_{t+1}|\text{ß}_{t})=\Pi(A_{t+1}|\text{ß}_{t})=\alpha$ if the machine learns the true probability $P(A_{t+1}|\text{ß}_{t})$ as $\alpha$. Definition 4.21 then means that while the machine calculates the value of $\Pi(A_{t+1}|\text{ß}_{t})$ as $\alpha$ to learn the true probability $P(A_{t+1}|\text{ß}_{t})$ at time $t$, the machine assigns $\Pi$-probability $>0$ to the event that $P(A_{t+1}|\text{ß}_{t})\neq\alpha$, because the machine tolerates the error that the true value of $P(A_{t+1}|\text{ß}_{t})$ may not be exactly $\alpha$ at that time $t$. In Lemma 4.23, we prove that a machine cannot tolerate errors infinitely often if it aims to learn the true probability.

Remark 4.24 For example, in (Savage, 1972), a vacuous event is null, but not every null event is necessarily vacuous. Here, an event is null to an agent when the event is believed by that very agent to be impossible, and thus its subjective probability is zero for the agent. On the other hand, a vacuous event has absolute impossibility, whose true objective probability is zero by the Kolmogorov axioms. Thus, the true objective probability of an absolutely impossible event binds its subjective probability to zero, but not necessarily vice versa.

We now extend this idea in (Savage, 1972) to all virtually impossible events. Here, note that absolute impossibility is assigned to a vacuous event by the Kolmogorov axiom, while virtual impossibility is assigned to any event whose true objective probability measure is zero by Nature. Thus, in Lemma 4.23, we derive that all virtually impossible events also have subjective probability $\Pi$-zero infinitely often whenever the agent is self-assured that such events are truly impossible, for the subjective probability must be bound to the true objective probability $P$-zero, if any. Otherwise, the machine comes to tolerate error infinitely often, which makes it impossible for the machine to achieve its goal of learning the true probability.