Koundinya Vajjha (University of Pittsburgh, United States; https://kodyvajjha.github.io/; ORCID: 0000-0003-3799-6326). Vajjha acknowledges support from the Alfred P. Sloan Foundation under grant number G-2018-10067 and the Andrew W. Mellon Foundation.
Barry Trager (IBM Research, United States)
Avraham Shinnar (IBM Research, United States; https://researcher.watson.ibm.com/researcher/view.php?person=us-shinnar)
Vasily Pestun (IBM Research, United States and IHÉS, France; https://pestun.ihes.fr/; ORCID: 0000-0001-8014-4685)
Copyright: Koundinya Vajjha, Barry Trager, Avraham Shinnar and Vasily Pestun
Supplementary material: https://github.com/IBM/FormalML/releases/tag/ITP2022
13th International Conference on Interactive Theorem Proving (ITP 2022), August 7–10, 2022, Haifa, Israel. Editors: June Andronick and Leonardo de Moura. Volume 237, Article No. 21.
Formalization of a Stochastic Approximation Theorem
Abstract
Stochastic approximation algorithms are iterative procedures which are used to approximate a target value in an environment where the target is unknown and direct observations are corrupted by noise. These algorithms are useful, for instance, for root-finding and function minimization when the target function or model is not directly known. Originally introduced in a 1951 paper by Robbins and Monro, the field of stochastic approximation has grown enormously and has come to influence application domains from adaptive signal processing to artificial intelligence. As an example, the stochastic gradient descent algorithm, which is ubiquitous in various subdomains of machine learning, is based on stochastic approximation theory. In this paper, we give a formal proof (in the Coq proof assistant) of a general convergence theorem due to Aryeh Dvoretzky [21] (proven in 1956) which implies the convergence of important classical methods such as the Robbins-Monro and Kiefer-Wolfowitz algorithms. In the process, we build a comprehensive Coq library of measure-theoretic probability theory and stochastic processes.
keywords:
Formal Verification, Stochastic Approximation, Stochastic Processes, Probability Theory, Optimization Algorithms

1 Introduction
This paper presents a formal proof of Aryeh Dvoretzky’s 1956 result on stochastic approximation.
To motivate this result, let us consider a problem frequently occurring in various contexts of statistical learning. Let $X$ be a real-valued random variable that depends on a parameter $\theta \in \mathbb{R}$. We may say that $D(\theta)$ is the probability distribution of $X$ conditioned or dependent on the parameter $\theta$. Next, suppose we are given a function $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and we want to find $\theta^*$ that solves the equation

\[ \mathbb{E}_{x \sim D(\theta)}\bigl[f(\theta, x)\bigr] = 0. \tag{1} \]

Moreover, assume that $D(\theta)$ is not available to us explicitly, but only in an implicit or sampling form; that is, we are provided a sampling oracle which takes a parameter $\theta$ and returns a sample $x$ of $X$ drawn from the $\theta$-dependent probability distribution $D(\theta)$.
Example 1.1 (Kolmogorov’s Strong Law of Large Numbers).
Let $f(\theta, x) = x - \theta$, and let $X$ be independent of $\theta$. To solve equation (1) in this situation means to solve $\mathbb{E}[X] - \theta = 0$, that is, to find the expected value of a random variable $X$ given an oracle from which we can sample $X$; in other words, to construct a statistical estimator of $\mathbb{E}[X]$ given a series of samples $x_1, x_2, \dots$. The following iterative algorithm does the job

\[ \theta_{n+1} = \theta_n + a_n (x_n - \theta_n) \tag{2} \]

for $n \geq 1$, where $a_n = 1/n$ and $\theta_1$ is arbitrary. Indeed, the iterations (2) are equivalent to the standard sample mean estimator $\theta_{n+1} = \frac{1}{n}\sum_{k=1}^{n} x_k$. Notice that the iterations (2) have the form $\theta_{n+1} = \theta_n + a_n f(\theta_n, x_n)$. The theorem that the estimator $\theta_n$ converges almost surely (with probability one) to the true expectation value $\mathbb{E}[X]$ is famously known as “Kolmogorov's Strong Law of Large Numbers” (SLLN).
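The equivalence with the sample mean is a routine induction, spelled out here for completeness (this computation is not part of the original text): if $\theta_n = \frac{1}{n-1}\sum_{k=1}^{n-1} x_k$, then

\[ \theta_{n+1} \;=\; \Bigl(1 - \tfrac{1}{n}\Bigr)\theta_n + \tfrac{1}{n}\,x_n \;=\; \frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{k=1}^{n-1} x_k + \tfrac{1}{n}\,x_n \;=\; \frac{1}{n}\sum_{k=1}^{n} x_k. \]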
Example 1.2 (Banach’s fixed point and optimal control).
Now consider the opposite example, where $X$ depends on $\theta$ in a deterministic way, say $X = T(\theta)$ where $T : \mathbb{R} \to \mathbb{R}$ is a certain function (and the event space is a single point). In this case, when we pass $\theta$ to the oracle, the oracle deterministically returns to us the value of the function $T$ evaluated at $\theta$. In this case, solving the equation (1) for the function $f(\theta, x) = x - \theta$ means solving the equation

\[ T(\theta) = \theta. \tag{3} \]

If $T$ is a $\gamma$-contraction map (meaning that in some norm $\|\cdot\|$ it holds that $\|T(x) - T(y)\| \leq \gamma\,\|x - y\|$ for all $x, y$, with $0 \leq \gamma < 1$), the standard proof of the Banach fixed point theorem tells us that the iterations

\[ \theta_{n+1} = \theta_n + a_n \bigl(T(\theta_n) - \theta_n\bigr) \tag{4} \]

for a suitable choice of $a_n$, for example $a_n = 1$, form a Cauchy sequence that converges to the fixed point $\theta^*$ of the map $T$. A variation of this process is applied to solve Bellman's equation for optimal control of a Markov Decision Process (MDP), where the $\gamma$-contraction map comes from Bellman's optimality operator for MDPs with discount parameter $\gamma$.
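For the choice $a_n = 1$ the iteration is simply $\theta_{n+1} = T(\theta_n)$, and the familiar geometric error estimate (recalled here for orientation; it is not part of the original text) reads

\[ \|\theta_{n+1} - \theta^*\| = \|T(\theta_n) - T(\theta^*)\| \;\leq\; \gamma\,\|\theta_n - \theta^*\| \;\leq\; \gamma^{\,n}\,\|\theta_1 - \theta^*\|, \]

so the distance to the fixed point decays geometrically.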
Example 1.3 (Stochastic gradient descent).
Now, as a variation of (1), suppose that we want to find $\theta$ that minimizes the expectation value $\mathbb{E}_x[\ell(\theta, x)]$ of a certain loss function $\ell(\theta, x)$ in a context where $x$ is sampled by an oracle from a $\theta$-dependent probability distribution. Assuming that $\theta \mapsto \mathbb{E}_x[\ell(\theta, x)]$ is a locally convex analytic function, finding a local minimum is equivalent to solving the stationary point equation

\[ \nabla_\theta\, \mathbb{E}_x[\ell(\theta, x)] = 0. \tag{5} \]

Since $\nabla_\theta$ is a linear operator, the above equation is equivalent to $\mathbb{E}_x[\nabla_\theta\, \ell(\theta, x)] = 0$ and therefore is again an example of the equation (1), with $f(\theta, x) = -\nabla_\theta\, \ell(\theta, x)$. The iterative sequence

\[ \theta_{n+1} = \theta_n - a_n \nabla_\theta\, \ell(\theta_n, x_n) \tag{6} \]

is known as stochastic gradient descent. This algorithm is a typical component of most machine learning algorithms that search for a parameter $\theta$ that minimizes the expected value of the loss function given samples of $x$. (In the context of supervised learning, $x$ stands for tuples sampled from training data, and $\theta$ stands for the model parameters, e.g. neural network weights; for a neural network $F_\theta$ with a quadratic supervised loss, one normally takes $\ell(\theta, (x_{\mathrm{in}}, y)) = \tfrac{1}{2}\|F_\theta(x_{\mathrm{in}}) - y\|^2$.) Under suitable conditions on $\ell$ and the parameters $a_n$ (for example, $a_n = 1/n$ would satisfy all required assumptions) one can prove convergence of (6) to the critical point of the loss function $\mathbb{E}_x[\ell(\theta, x)]$.
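As a concrete sanity check (not in the original text), the quadratic loss $\ell(\theta, x) = \tfrac{1}{2}(\theta - x)^2$ recovers Example 1.1: here $\nabla_\theta\, \ell(\theta, x) = \theta - x$, so (6) becomes

\[ \theta_{n+1} = \theta_n - a_n(\theta_n - x_n) = \theta_n + a_n(x_n - \theta_n), \]

which is exactly the sample-mean iteration (2), converging to the minimizer $\theta^* = \mathbb{E}[X]$.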
These three examples demonstrate the ubiquity of the problem (1), and many more applications could be mentioned in a longer report.
In all these cases the solution of the problem (1) has the form

\[ \theta_{n+1} = \theta_n + a_n f(\theta_n, x_n) \tag{7} \]

and is called a stochastic approximation algorithm.
A large body of literature explored different versions of assumptions on the domain of variables, on the function $f$ and on the step-sizes (learning rates) $a_n$, under which the convergence of $\theta_n$ to $\theta^*$ could be proven in various senses: as convergence in $L^2$, as convergence in probability, or as convergence with probability 1. (The notion of “convergence with probability 1” is the same as the notion of “convergence almost surely”, but different from the notion of “convergence in probability”, which is much weaker.)
Robbins and Monro introduced in [30] the field of stochastic approximation by proving the convergence of the process (7), for $f(\theta, x) = \alpha - x_\theta$, to the value $\theta^*$ that solves the equation $\mathbb{E}[x_{\theta^*}] = \alpha$. Note that we write $x_\theta$ to indicate that $\theta$ occurs as a parameter in the distribution of $X$. For this theorem, Robbins and Monro assumed

\[ a_n \geq 0, \qquad \sum_{n} a_n = \infty, \qquad \sum_{n} a_n^2 < \infty, \tag{8} \]

that $X$ is bounded with probability 1, and that the function $M(\theta) = \mathbb{E}[x_\theta]$ is (i) non-decreasing, (ii) such that the solution $\theta^*$ of $M(\theta^*) = \alpha$ exists, and (iii) such that the derivative at the solution is positive, $M'(\theta^*) > 0$.
Kiefer and Wolfowitz [26] took a similar approach but considered the problem of estimating the parameter $\theta^*$ at which the function $M(\theta)$ has a maximum, and proved convergence in probability.
Wolfowitz [42] weakened the assumption of Robbins-Monro about the boundedness of $X$: instead, his version assumes only that the variance of $x_\theta$ is bounded uniformly over $\theta$ and that $M(\theta)$ is bounded, and with those assumptions Wolfowitz proves convergence in probability.
Blum [12] weakened further the assumptions of Robbins-Monro and Wolfowitz and proved a substantially stronger result, namely that the iterative sequence (7) (with $f(\theta, x) = \alpha - x_\theta$) converges with probability 1. Blum requires the variance of $x_\theta$ to be uniformly bounded over $\theta$, but he allows the expectation value $M(\theta)$ to be bounded by a linear function of $|\theta|$,

\[ |M(\theta)| \leq A|\theta| + B, \tag{9} \]
instead of a constant. Blum's proof is based on a version of Kolmogorov's inequality adapted in a suitable way by Loève [28], where instead of a series of independent random variables, a certain dependence is allowed but constrained by a conditional expectation value. This extension of Kolmogorov's inequality to the conditional situation was related to earlier works of Borel, Lévy and Doob about convergence with probability 1 of certain stochastic processes.
Finally, the most general form of stochastic approximation was formulated by Dvoretzky [21]. In the original Robbins-Monro stochastic approximation (7), the next value $\theta_{n+1}$ is determined through the previous value $\theta_n$ and the sample $x_n$. Dvoretzky allowed more general estimator algorithms in which $\theta_{n+1}$ is determined through a certain function $T_n$ that can take as arguments the complete history of all previous values $\theta_1, \dots, \theta_n$ and the current sample $x_n$.
Concretely, let $T_n$ be a real-valued function of $n$ variables. Consider the stochastic process

\[ X_{n+1} = T_n(X_1, \dots, X_n) + W_n \tag{10} \]

where $W_n$ are random variables, with $W_n$ dependent on the previous history $(X_1, \dots, X_n)$ such that

\[ \mathbb{E}[W_n \mid X_1, \dots, X_n] = 0. \tag{11} \]
Another way to formulate Dvoretzky's setup is to say that for any sequence of random variables $(X_n)$ we have a conditional probability distribution of $X_{n+1}$ dependent on the complete history $(X_1, \dots, X_n)$, and we then define

\[ T_n(X_1, \dots, X_n) = \mathbb{E}[X_{n+1} \mid X_1, \dots, X_n], \qquad W_n = X_{n+1} - \mathbb{E}[X_{n+1} \mid X_1, \dots, X_n]; \tag{12} \]

in this way we automatically get the relation (10) with noise terms $W_n$ that satisfy (11).
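Indeed, the relation (11) is a one-line check (added here for completeness): since $\mathbb{E}[X_{n+1} \mid X_1, \dots, X_n]$ is itself measurable with respect to the history,

\[ \mathbb{E}[W_n \mid X_1, \dots, X_n] \;=\; \mathbb{E}[X_{n+1} \mid X_1, \dots, X_n] - \mathbb{E}[X_{n+1} \mid X_1, \dots, X_n] \;=\; 0. \]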
For example, in the Robbins-Monro version we take (7) with $f(\theta, x) = \alpha - x_\theta$, which gives $\mathbb{E}[X_{n+1} \mid X_1, \dots, X_n] = X_n + a_n(\alpha - M(X_n))$, and hence the Robbins-Monro process is a specialization of Dvoretzky's process with

\[ T_n(X_1, \dots, X_n) = X_n + a_n\bigl(\alpha - M(X_n)\bigr), \qquad W_n = a_n\bigl(M(X_n) - x_n\bigr), \tag{13} \]

where $X_n$ denotes what was written $\theta_n$ before in the context of Robbins-Monro.
To prove his result, Dvoretzky assumed that:

1. there exists a point $\theta^*$ such that

\[ |T_n(x_1, \dots, x_n) - \theta^*| \;\leq\; \max\bigl(\alpha_n,\; (1 + \beta_n)\,|x_n - \theta^*| - \gamma_n\bigr) \tag{14} \]

where $\alpha_n, \beta_n, \gamma_n$ are sequences of non-negative real numbers with

\[ \lim_{n \to \infty} \alpha_n = 0, \tag{15} \]
\[ \sum_{n=1}^{\infty} \beta_n < \infty, \tag{16} \]
\[ \sum_{n=1}^{\infty} \gamma_n = \infty; \tag{17} \]

2. the cumulative variance of the noise terms is bounded,

\[ \sum_{n=1}^{\infty} \mathbb{E}[W_n^2] < \infty; \tag{18} \]

and proved that the iterative sequence (10) converges with probability 1 to the fixed point $\theta^*$.
The Robbins-Monro theorem in its strongest form (that is, under the weakest assumptions of Blum (9)) becomes an easy consequence of Dvoretzky's theorem. We only have to check that, given the assumptions of Blum, we can apply Dvoretzky. First, Blum's assumption that the variance of $x_\theta$ is bounded by, say, $\sigma^2$ for all $\theta$ and $n$, given the relation (13), implies $\mathbb{E}[W_n^2] \leq a_n^2 \sigma^2$, and therefore, since $\sum_n a_n^2 < \infty$, Dvoretzky's assumption (18) about limited cumulative variance of his noise terms holds. Second, given Blum's $M(\theta)$ in equation (9), we will construct $\alpha_n, \beta_n, \gamma_n$ that satisfy (15)–(17) and such that the bound on the operator $T_n$ in (14) holds. To do that, first choose a real-valued series $b_n$ with $\lim_{n\to\infty} b_n = 0$ and such that $\sum_n a_n b_n = \infty$ (for example, if we start with $a_n = 1/n$ take $b_n = 1/\log n$; in general take $b_n = 1/\sum_{k \leq n} a_k$, see [1]). For simplicity assume, by a change of coordinates, that the fixed point is $\theta^* = 0$. Assuming that $M$ is regular and monotonic in a neighborhood of $0$, there is an inverse map $M^{-1}$, and then we define the sequence $\varepsilon_n = M^{-1}(c\, b_n)$ for sufficiently small $c$. Next, define (for sufficiently large $n$)
\[ \alpha_n = (1 + A a_n)\,\varepsilon_n + B a_n, \qquad \beta_n = 0, \qquad \gamma_n = c\, a_n b_n. \tag{19} \]
A case-by-case check for $|x_n| \leq \varepsilon_n$ and for $|x_n| > \varepsilon_n$ shows that Dvoretzky's bound (14) on $T_n$ holds given the relation (13).
One universal theme passing through the various versions of stochastic approximation convergence theorems is the choice of the scheduling of the step-sizes (or learning rates) $a_n$.
In the Robbins-Monro scheduling assumption (8), it is clear that the step-sizes $a_n$ have to converge to zero (otherwise the model would fluctuate and never converge to the exact solution). The second assumption, $\sum_n a_n = \infty$, which says that the rates should not converge to zero too fast, is also sensible, as otherwise it is easy to imagine a learning schedule with $a_n$ dropping to zero so fast that the iterative process never reaches the fixed point from an initial point (for a concrete example, see [17, pp. 5]). The third assumption, $\sum_n a_n^2 < \infty$, is more subtle and technical; it primarily ensures that even in situations when the noise terms have self-correlation they do not move the iterative process out of its track of converging with probability 1 to the exact fixed point. In certain situations, a more slowly decreasing learning rate schedule still leads to convergence, and is faster in practice.
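To make the three conditions concrete (a standard example, added here; it is not part of the original text), the canonical schedule $a_n = 1/n$ satisfies all of (8):

\[ a_n \to 0, \qquad \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty. \]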
As the above discussion has shown, the Robbins-Monro paper spawned a huge literature on the analysis and applications of such stochastic algorithms. This is because the problem of estimating unknown parameters of a model from observed data is quite a fundamental one, with variants of this problem appearing in one form or another in control theory, learning theory and other fields of engineering.
Because of the pervasive reach of stochastic approximation methods, any serious formalization effort of an algorithm involving parameter estimation when the underlying model is unknown will eventually have to contend with formalizing tricky stochastic convergence proofs. We chose to formalize Dvoretzky's theorem as it implies the convergence of both the Robbins-Monro and Kiefer-Wolfowitz algorithms, various stochastic gradient descent algorithms, and various reinforcement learning algorithms such as Q-learning, which is based on Bellman's optimality operator.
Throughout the text which follows, hyperlinks to theorems, definitions and lemmas which have formal equivalents in the Coq development are indicated by a ✿. Our formalization is open-source and is available at https://github.com/IBM/FormalML.
2 Dvoretzky’s Theorem
After Dvoretzky’s original publication [21] of his theorem and several very useful extensions, several shorter proofs have been proposed. A simplified proof was published by Wolfowitz [43], who like Blum relied on the conditional version of Kolmogorov’s law exposed by Loève [28]. A third, more simplified proof was published by Derman and Sacks [20], who again relied on the conditional version of Kolmogorov’s law, streamlined the chain of inequality manipulations with Dvoretzky’s bounding series parameters, and used Chebyshev’s inequality and the Borel-Cantelli lemma to arrive at a very short proof. Robbins and Siegmund generalized the theorem to the context where the variables take values in generic Hilbert spaces using the methods of supermartingale theory [29], as did Venter [40]. For a survey see Lai [27]. Dvoretzky himself published a revisited version in [22].
We have chosen to formalize the proof following Derman and Sacks [20], as this version appeared to us to be the shortest and the most suitable to formalize using constructions from our library of formalized probability theory.
In this paper we present a complete formalization of the scalar version of Dvoretzky's theorem, with random variables taking values in $\mathbb{R}$.
Here is a full statement of Dvoretzky’s theorem:
Theorem 2.1 (Regular Dvoretzky’s Theorem ✿).
Assuming the following:
- Let $(\Omega, \mathcal{F}, P)$ be a probability space
- For $n = 0, 1, 2, \dots$:
- Let $\{\mathcal{F}_n\}$ be an increasing sequence of sub-$\sigma$-fields of $\mathcal{F}$
- Let $X_n$ be $\mathcal{F}_n$-measurable random variables taking values in $\mathbb{R}$.
- Let $T_n : \mathbb{R} \to \mathbb{R}$ be a measurable function
- Let $W_n$ be $\mathcal{F}_{n+1}$-measurable random variables taking values in $\mathbb{R}$ such that
  - $X_{n+1} = T_n(X_n) + W_n$
  - $\mathbb{E}[W_n \mid \mathcal{F}_n] = 0$
- Let $\alpha_n, \beta_n, \gamma_n$ be series of real numbers such that
  - $\alpha_n \geq 0$
  - $\beta_n \geq 0$
  - $\gamma_n \geq 0$
  - $\lim_{n \to \infty} \alpha_n = 0$
  - $\sum_n \beta_n < \infty$
  - $\sum_n \gamma_n = \infty$
  - $\sum_n \mathbb{E}[W_n^2] < \infty$
- Let $\theta^*$ be a point in $\mathbb{R}$ such that for all $n$ and for all $x \in \mathbb{R}$,
  \[ |T_n(x) - \theta^*| \;\leq\; \max\bigl(\alpha_n,\; (1 + \beta_n)\,|x - \theta^*| - \gamma_n\bigr) \]
Then the sequence of random variables $X_n$ converges with probability 1 to $\theta^*$:
\[ P\Bigl(\lim_{n \to \infty} X_n = \theta^*\Bigr) = 1. \]
An increasing sequence of sub-$\sigma$-fields $\mathcal{F}_n$ of $\mathcal{F}$ (a filtration) formalizes a notion of a discrete stochastic process moving forward in time steps $n = 0, 1, 2, \dots$, where $\mathcal{F}_n$ formalizes the history of the process up to the time step $n$. Assuming an $\mathcal{F}_n$-measurable random variable $X_n$ means assuming a stochastic variable that is included into the history up to the time step $n$.
We have also formalized the extended version of Dvoretzky's theorem in which $\alpha_n, \beta_n, \gamma_n$ are promoted to real-valued functions on the probability space and $T_n$ is promoted to be an $\mathcal{F}_{n+1}$-measurable random function. The hypotheses that have been modified in the extended version are marked by the symbol ⋆ below:
Theorem 2.2 (Extended Dvoretzky’s theorem ✿).
Assuming the following:
- Let $(\Omega, \mathcal{F}, P)$ be a probability space
- For $n = 0, 1, 2, \dots$:
- Let $\{\mathcal{F}_n\}$ be an increasing sequence of sub-$\sigma$-fields of $\mathcal{F}$
- Let $X_n$ be $\mathcal{F}_n$-measurable random variables taking values in $\mathbb{R}$.
- ⋆ Let $T_n$ be an $\mathcal{F}_{n+1}$-measurable $\mathbb{R}$-valued random function, with $X_{n+1} = T_n(X_n) + W_n$
- Let $W_n$ be $\mathcal{F}_{n+1}$-measurable $\mathbb{R}$-valued random variables such that:
  - $\mathbb{E}[W_n \mid \mathcal{F}_n] = 0$
  - $\sum_n \mathbb{E}[W_n^2] < \infty$
- ⋆ Let $\alpha_n, \beta_n, \gamma_n$ be functions on the probability space (technically, Dvoretzky in his revisited paper [22] requires them to be measurable, but this assumption wasn't actually used in the proof, so we have omitted it) such that:
  - $\alpha_n \geq 0$
  - $\beta_n \geq 0$
  - $\gamma_n \geq 0$
  - $\lim_{n\to\infty} \alpha_n = 0$ with probability 1
  - $\sum_n \beta_n < \infty$ with probability 1
  - $\sum_n \gamma_n = \infty$ with probability 1
- ⋆ Let $\theta^*$ be a point in $\mathbb{R}$ such that for all $n$ and for all $x \in \mathbb{R}$ we have:
  \[ |T_n(x) - \theta^*| \;\leq\; \max\bigl(\alpha_n,\; (1 + \beta_n)\,|x - \theta^*| - \gamma_n\bigr) \quad \text{with probability 1.} \tag{20} \]
Then the sequence of random variables $X_n$ converges with probability 1 to $\theta^*$:
\[ P\Bigl(\lim_{n\to\infty} X_n = \theta^*\Bigr) = 1. \]
We now turn to describing the formalization of the above theorems. First, we give a description of our comprehensive supporting Probability Theory library in Section 3 (which may be of independent interest), then we shall give an overview of the proof of Theorem 2.1 in Section 4.1, and finally detail the variants of this theorem we have formalized in Section 4.2.
3 Formalized Probability Library
Our formalization of both Dvoretzky theorems is built on top of our general library of formalized Probability Theory. In particular, we are not restricted to discrete probability measures.
3.1 $\sigma$-Algebras and Probability Spaces
We first introduce pre_events ✿, which are just subsets of a type T, i.e., maps T → Prop. Then we define $\sigma$-algebras, SigmaAlgebra(T) ✿, as collections of pre_events which are closed under countable union and complement and include the full subset of all elements in T:
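(The following self-contained sketch, added here for illustration, shows the shape of this definition; the field names follow the prose but may differ from the development's actual ones.)

```coq
(* A sketch of the SigmaAlgebra typeclass, following the prose description. *)
Definition pre_event (T : Type) : Type := T -> Prop.

Class SigmaAlgebra (T : Type) : Type :=
  { sa_sigma : pre_event T -> Prop
    (* closed under countable unions *)
  ; sa_countable_union :
      forall (collection : nat -> pre_event T),
        (forall n, sa_sigma (collection n)) ->
        sa_sigma (fun t => exists n, collection n t)
    (* closed under complement *)
  ; sa_complement :
      forall (e : pre_event T), sa_sigma e -> sa_sigma (fun t => ~ e t)
    (* contains the full subset of all elements of T *)
  ; sa_all : sa_sigma (fun _ => True)
  }.
```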
Then, we label pre_events which are members of a $\sigma$-algebra as events ✿. Special $\sigma$-algebras, like the one generated by a set of pre_events ✿ and the Borel $\sigma$-algebra ✿, are constructed as usual.
One interesting feature of the formalization of both of these is that they are both provided with alternative characterizations, which is useful when applying the definitions. For the Borel $\sigma$-algebra, we define two variants: borel_sa ✿, defined as the $\sigma$-algebra generated by the half-open intervals, and open_borel_sa ✿, defined as the $\sigma$-algebra generated by the open sets. After proving that the definitions yield the same $\sigma$-algebra ✿, we can choose whichever definition is simpler to work with in a given context, simplifying some proofs.
For the definition of the $\sigma$-algebra generated by a set $E$ of pre-events, we start with the standard definition ✿: the intersection ✿ of the set of $\sigma$-algebras that contain $E$ ✿. This is useful, but as it is non-constructive, it lacks a convenient induction principle. As an alternative, we define the explicit closure of a set of events ✿, built by starting with the set (augmented by the full pre_event), and repeatedly adding in complements and countable unions. In Coq, this is naturally defined using an inductive data type. This closure is then shown to be (a $\sigma$-algebra ✿ and) equivalent to the generated $\sigma$-algebra ✿.
While definitions generally use the standard definition of the generated $\sigma$-algebra, some theorems are more easily proven by switching to the equivalent closure-based characterization. This enables induction, providing an easy way to extend a property on the generating set to the generated $\sigma$-algebra, by showing that complements and countable unions preserve the property in question.
Next, we introduce probability spaces ✿ over a $\sigma$-algebra, equipped with a measure ps_P mapping each event $E$ to a real number $P(E)$, such that $P$ is non-negative, countably additive, and $P(\Omega) = 1$.
3.2 Almost Everywhere
Having defined probability spaces, we can introduce a commonly used assertion in probabilistic proofs: that a certain property holds almost everywhere on a probability space. By this we mean that the set of points where the property holds includes a measurable event of measure 1. We define a predicate almost ✿ to indicate propositions which hold almost everywhere. It is parameterized by a probability space and a proposition on that space.
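A simplified, self-contained sketch of such a predicate (the library's actual definition restricts the measure to measurable events; the names here are illustrative):

```coq
Require Import Reals.

(* Simplified sketch: the probability measure is abstracted to a bare
   function on predicates; in the library, ps_P is defined only on the
   events of the underlying sigma-algebra. *)
Record ProbSpace (T : Type) : Type := { ps_P : (T -> Prop) -> R }.

Arguments ps_P {T}.

(* P holds almost everywhere: some measure-one event is contained in P. *)
Definition almost {T : Type} (prts : ProbSpace T) (P : T -> Prop) : Prop :=
  exists E : T -> Prop,
    ps_P prts E = 1%R /\ (forall x, E x -> P x).
```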
We have introduced machinery to make it more convenient to reason about almost propositions. For example, if we want to show that almost P → almost Q → almost R, we reduce the proof to showing almost (P → Q → R) ✿, which itself is implied by P → Q → R holding everywhere ✿. Usual theorem proving tools can then be used.
On top of the basic almost definition, we defined almostR2 ✿, which says that a binary relation holds almost everywhere.
This is useful, since it inherits many properties from the base relation (e.g. it is a preorder if the base relation is one ✿), and simplifies definitions.
3.3 Measurability and Expectation
We next introduce the concept of measurable functions with respect to two $\sigma$-algebras. Since we are focusing on probability spaces, we call these measurable functions RandomVariables ✿.
In order to define the Expectation of a RandomVariable, we follow the usual technique of first treating the case of finite range functions ✿, then extending to nonnegative functions ✿ (resulting in an extended real) and then to general random variables. In the general case, the expectation is the difference of the expectations of the positive and negative parts of a random variable ✿. Exceptions are handled using the Coq option type. For example, the difference of the expectations of the positive and negative parts of a random variable is not defined if they are both the same infinity. This exception is captured by allowing the difference to be None in that case. A well defined Expectation yields Some r, for some value r in Coquelicot's Rbar type [16]. This represents a value in the extended reals: either a Finite real value, or positive or negative infinity (p_infty or m_infty).
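For illustration, Coquelicot's Rbar_plus' already returns an option for the undefined sum of opposite infinities; a sketch of how the pieces combine (Rbar_minus' and expectation_from_parts are natural derived definitions written for this sketch, not necessarily the library's):

```coq
Require Import Reals.
From Coquelicot Require Import Rbar.

(* Coquelicot's Rbar_plus' : Rbar -> Rbar -> option Rbar returns None
   exactly when the sum (+infinity) + (-infinity) is undefined. *)
Definition Rbar_minus' (x y : Rbar) : option Rbar :=
  Rbar_plus' x (Rbar_opp y).

(* Sketch: if [pos] and [neg] are the (extended-real) expectations of the
   positive and negative parts of a random variable, the total expectation
   is their difference when defined, and None when both are infinite. *)
Definition expectation_from_parts (pos neg : Rbar) : option Rbar :=
  Rbar_minus' pos neg.
```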
Originally our results about Expectation were for random variables taking images in the reals, but as we introduced limiting processes we needed to extend our definition to random variables taking values in the extended reals (Rbar).
This requires extending the support for limits in Coquelicot, allowing for sequences of functions over the extended reals ✿. The approach we took was to copy over all the definitions and lemmas in Coquelicot’s Lim_seq module, extending them as appropriate, and re-proving them. A few changes were made, such as defining the extended version of is_lim_seq ✿ to hold when the inf and sup sequence limits coincide. The original definition uses filters, and is problematic to extend to the extended reals, since they do not form a uniform space. Pleasantly, however, almost all of the lemmas continue to hold with minor modification.
The above construction of Expectation and its properties (including linearity ✿✿, the monotone convergence theorem ✿, and other standard results) are then generalized and proven for functions whose image is the extended reals ✿.
On top of our general definition of Expectation, we define the IsFiniteExpectation property, which asserts that a function has a well-defined, finite expectation ✿✿. For functions that satisfy this property, we can define their FiniteExpectation ✿✿, which returns their (real) expectation. This simplifies working with such functions, and avoids otherwise necessary side-conditions on properties such as linearity ✿✿.
3.4 $L^p$ Spaces
Using these building blocks, we can define $L^p$ spaces: the spaces of measurable functions for which the $p$-th power of the absolute value has finite expectation ✿.
This space is then quotiented, identifying functions that are equal almost everywhere (see Section 3.2) ✿. We use a quotient construction ✿ that avoids needing axioms beyond those already proposed in Coq's standard libraries (specifically, we use functional and propositional extensionality as well as constructive definite description, also known as the axiom of unique choice). This quotienting operation is required in order to define a norm on the space, $\|f\|_p = (\mathbb{E}|f|^p)^{1/p}$, the $p$-th root of the Expectation of the $p$-th power of the absolute value of the function: having a zero Expectation only implies that a non-negative function is zero almost everywhere.
For nonnegative $p$, $L^p$ is shown to be a module space ✿. For $p \geq 1$, it is shown to be a Banach space (a complete normed module space) ✿ ✿.
Furthermore, the important special case of $L^2$ is proven to be a Hilbert space ✿, where the inner product of $f$ and $g$ is defined as the Expectation of the product of $f$ and $g$.
3.5 Conditional Expectation
Building on top of this work, we turn to the definition of conditional expectation, defining it with respect to a general sub-$\sigma$-algebra dom2 (the ambient $\sigma$-algebra being dom). We first postulate a relational definition ✿, characterized by the universal property of conditional expectations: for any event that is in the sub-$\sigma$-algebra dom2, if we multiply the original function and its conditional expectation by that event's associated indicator function, we get equal expectations.
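In symbols (this restates the property just described; the notation is ours): $g$ is a conditional expectation of $f$ with respect to dom2 when, for every event $E$ in dom2,

\[ \mathbb{E}[\,f \cdot \mathbb{1}_E\,] = \mathbb{E}[\,g \cdot \mathbb{1}_E\,], \]

where $\mathbb{1}_E$ denotes the indicator function of $E$.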
Using this definition, we can show uniqueness (where equality is almost everywhere) ✿, and many standard properties of conditional expectation, such as linearity ✿✿, preservation of Expectation ✿, (almost) monotonicity ✿, and the tower law ✿. We also show the “factor out” property ✿, which enables factoring out of a conditional expectation a random variable that is measurable with respect to the sub -algebra. In addition, we verify its interactions with limits (e.g. the conditional version of the monotone convergence theorem ✿), and prove Jensen’s lemma ✿, bounding how convex functions affect the conditional expectation.
After having proven these properties for the is_conditional_expectation relation, we still need to show that the conditional expectation generally exists (at least for functions that are non-negative or have finite expectation).
To do this, we build on our work on $L^p$ spaces (Section 3.4), and in particular our proof that $L^2$ is a Hilbert space. This implies that the subset of $L^2$ functions which are measurable with respect to a smaller $\sigma$-algebra dom2 forms a linear subspace.
The conditional expectation ✿ of an $L^2$ random variable $f$ with respect to dom2 is then defined as the orthogonal projection ✿ of $f$ onto that subspace. For this construction and the definitions of Hilbert spaces we use the library from the formal development of the Lax-Milgram theorem [14]. Note that this definition is for a function in the quotiented space (recall that $L^2$ is quotiented to identify functions that are equal almost everywhere).
We can then define conditional expectation on the unquotiented space by injecting the inputs into the quotiented space, using the conditional expectation operator just defined on $L^2$ functions, and then choosing a representative from the equivalence class of functions it returns ✿. This unquotienting gives insight into why most theorems about conditional expectation only almost hold, as it is defined on equivalence classes of almost equal functions.
Next, we extend our notion of conditional expectation to nonnegative functions whose usual expectation is finite, using the property that $L^2$ functions are dense. In particular, given a nonnegative function $f$, we can define an $L^2$ sequence $f_n$ converging up to $f$. The conditional expectation of $f$ is defined as the limit of the conditional expectations of the $f_n$ ✿.
Using this definition directly has some disadvantages: it forces essentially all theorems, including simple ones such as the result being non-negative, or the conditional expectation being the identity operation on functions that are already measurable with respect to the sub-$\sigma$-algebra dom2, to be only almost valid. To address this, we wrap this definition in a wrapper ✿ that takes the function returned by the original (limit based) definition and tweaks it slightly, producing a “fixed” function almost equivalent to the original, but where such simple properties hold unconditionally.
Finally, we extend this to all measurable functions by taking the difference of the nonnegative conditional expectations of its positive and negative parts ✿. While this function is defined for all measurable functions, it can only be shown to be a conditional expectation (the is_conditional_expectation relation defined above) for functions that are either non-negative ✿ or have finite expectation ✿. Using this property, we now lift all of the properties proven above for the relational version to our explicitly defined version, verifying that it satisfies all the expected properties. For convenience, we also provide a wrapper definition FiniteConditionalExpectation ✿, which assumes that the function has finite expectation, and returns a function whose image is in $\mathbb{R}$ (instead of the extended reals), and we lift all the expected properties to it.
Connecting back to $L^p$ spaces, we can use Jensen's lemma about convex functions to show that if a function is in $L^p$ then its conditional expectation is as well ✿, allowing us to view conditional expectation as a (contractive ✿) operation on $L^p$ spaces. Furthermore, we show that it minimizes the $L^2$-loss for an $L^2$ function ✿.
We chose this approach to defining conditional expectation (via an orthogonal projection on $L^2$) since we could rely on an existing library of Hilbert space theory [14], thus avoiding other tedious constructions involving Radon-Nikodym derivatives.
3.6 Filtrations and Martingales
We next introduce a notion of $\sigma$-algebra filtrations ✿, which are increasing sequences of $\sigma$-algebras. We say that a sequence of random variables IsAdapted ✿ to a filtration $\mathcal{F}_n$ if each random variable of the sequence is measurable with respect to the corresponding $\sigma$-algebra.
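A self-contained sketch of these two notions, with the $\sigma$-algebra and measurability structure abstracted into section variables (the names are illustrative, not the library's):

```coq
Section filtration_sketch.
  (* Abstract ingredients; in the development these are the library's
     SigmaAlgebra, sub-sigma-algebra relation, and measurability predicate. *)
  Variable SA : Type.                       (* sigma-algebras on a fixed space *)
  Variable sub_sa : SA -> SA -> Prop.       (* sub-sigma-algebra relation *)
  Variable RV : Type.                       (* random variables on the space *)
  Variable measurable : RV -> SA -> Prop.   (* measurability w.r.t. a sigma-algebra *)

  (* A filtration is an increasing sequence of sigma-algebras. *)
  Definition is_filtration (F : nat -> SA) : Prop :=
    forall n, sub_sa (F n) (F (S n)).

  (* A process is adapted when each variable is measurable with respect
     to the sigma-algebra at the same stage. *)
  Definition is_adapted (X : nat -> RV) (F : nat -> SA) : Prop :=
    forall n, measurable (X n) (F n).
End filtration_sketch.
```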
Building on these definitions and our development of conditional expectation, we started developing the basics of martingale theory ✿.
Additionally, the language of filtrations and adapted processes enables us to represent the history of a stochastic process, which is critical for stating and verifying properties of stochastic approximation methods.
3.7 Additional results
There are many other results proven in the library; here we highlight two that are used in the Derman-Sacks proof: Chebyshev’s inequality and the Borel-Cantelli lemma.
The first is Chebyshev’s inequality ✿, which states that given a random variable $X$ and a positive constant $\varepsilon$, the probability of $|X| \geq \varepsilon$ is less than or equal to the expectation of $X^2$ divided by $\varepsilon^2$.
Another is the Borel-Cantelli lemma ✿, which states that if the sum of the probabilities of a sequence of events is finite, then the probability that infinitely many of them occur is 0.
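A sketch of the shape of this statement (simplified; the event combinators are abstracted as section variables and the names are illustrative, not the library's):

```coq
Require Import Reals.
From Coquelicot Require Import Coquelicot.

Section borel_cantelli_shape.
  Variable Event : Type.
  Variable ps_P : Event -> R.                              (* probability *)
  Variable union_of_collection : (nat -> Event) -> Event.  (* countable union *)
  Variable inter_of_collection : (nat -> Event) -> Event.  (* countable intersection *)

  (* The event that infinitely many of the E n occur: the intersection
     over k of the union over n of E (k + n). *)
  Definition infinitely_often (E : nat -> Event) : Event :=
    inter_of_collection
      (fun k => union_of_collection (fun n => E (k + n)%nat)).

  (* Shape of the Borel-Cantelli statement. *)
  Hypothesis Borel_Cantelli :
    forall (E : nat -> Event),
      ex_series (fun n => ps_P (E n)) ->
      ps_P (infinitely_often E) = 0%R.
End borel_cantelli_shape.
```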
In this theorem statement, ex_series f, defined in Coquelicot, asserts that the sequence of partial sums of the series converges to a finite limit.
3.8 Retrospective design decisions
In this section we discuss some of the design choices we made (and revisited), and our retrospective opinion on their impact. This may be of benefit to those seeking to pursue similar projects.
Initially, we modeled events as sets (now called pre_events), accompanying them with proofs that the set was in a given -algebra. This resulted in a lot of code threading and transforming these proofs, which was particularly painful when reasoning about lists. Revisiting that decision, we built up events as a subset type: a dependent pair of a pre_event and a proof that it belongs in a relevant -algebra ✿. For many simple uses, this obviates the need for reasoning explicitly about being in a -algebra. There are still cases where explicit reasoning is required, but this change definitely simplified the code.
We support general random variables using a typeclass which specifies the $\sigma$-algebras for the domain and range along with the function ✿. An initial version of our probability library developed expectation and properties of random variables whose codomain was the reals ✿. However, as we proved more properties, especially limiting and convergence properties, it became clear that we needed to allow infinite values, and thus to allow random variables with Rbar (the extended real numbers) as codomain ✿. However, the native support for limits in the Coquelicot package allows the limiting value to be infinite, but restricts to sequences taking values in $\mathbb{R}$. In order to be able to take limits of random variables valued in the extended reals, we developed our own limit package extending the Coquelicot definitions and lemmas to sequences taking values in Rbar ✿. In the end, some results about Rbar-valued random variables become simpler, since one doesn't need to make unnatural finiteness restrictions; on the other hand, one needs to be extra careful, since Rbar is not a field: sums and products are not always defined and operations are not associative.
As in the standard development of expectation, we first defined it for functions whose range is a finite subset of $\mathbb{R}$ (or Rbar). We decided to represent them as a typeclass which includes a field containing a finite list of values which includes all the values in the range of the function ✿.
We decided to allow this list to have duplicates and to contain additional values not in the range. This made several definitions more convenient; for example, when defining the sum of two finite range functions, the new list of values is just the list of sums of all pairs of values, which is guaranteed to contain all the actual values, but can contain values which are not in the image of the sum and can contain duplicated values ✿.
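A self-contained sketch of this design (the class and instance names here are illustrative, not the development's):

```coq
Require Import Reals List.

(* Sketch: the value list need only be a superset of the range,
   possibly with duplicates. *)
Class FiniteRangeFunction {T : Type} (f : T -> R) : Type :=
  { frf_vals : list R
  ; frf_vals_complete : forall x, In (f x) frf_vals }.

(* The sum of two finite-range functions: take all pairwise sums. The
   resulting list covers the range of the sum, but may contain duplicates
   and values the sum never attains -- which is allowed by design. *)
Instance frf_plus {T : Type} (f g : T -> R)
         (ff : FiniteRangeFunction f) (fg : FiniteRangeFunction g) :
  FiniteRangeFunction (fun x => f x + g x).
Proof.
  exists (map (fun p => fst p + snd p)
              (list_prod (@frf_vals _ _ ff) (@frf_vals _ _ fg))).
  intros x.
  apply in_map_iff.
  exists (f x, g x); split.
  - reflexivity.
  - apply in_prod; apply frf_vals_complete.
Defined.
```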
One simplification our code makes is that we deal only with probability spaces, rather than general measures. This was a pragmatic decision, as it simplifies some of the proofs (since, for example, measures must be finite). As our intended use is probability theory, this mostly sufficed. In order to define (dependent) product spaces, we did define the rudiments of measure theory (measures, outer measures, and inner measures) ✿, but the final construction of the product is defined only for probability spaces, since the proof crucially relies on the monotone convergence theorem, which we have not proven for general measure spaces. It would clearly have been nicer to define things more generally, and we may go back and change things in the future; however, this simplification allowed us to use our limited resources to greater effect.
4 Formalization Challenges/Overview
We will now sketch the key pieces which go into the formalization of the Derman-Sacks proof.
4.1 Overview of the proof
The Derman-Sacks proof relies on a number of prerequisites in Probability Theory and Real Analysis. For example, the proof begins by stating that we may replace the series $\sum_n \mathbb{E}[W_n^2]$ by the series $\sum_n b_n \mathbb{E}[W_n^2]$, where $b_n \to \infty$. This statement invokes a classical theorem of du Bois-Reymond [13] which states:
Theorem 4.1.
✿ Let $(a_n)$ be a sequence of nonnegative real numbers. The series $\sum_n a_n$ converges if and only if there is another sequence $(b_n)$ of positive real numbers such that $b_n \to \infty$ and $\sum_n a_n b_n$ converges.
In other words, this theorem states that no worst convergent series exists (see [5]). This elementary theorem did require some effort to formalize, in part because existing proofs such as the one in [5] require the sequence to consist only of positive terms, while our application (Dvoretzky's theorem) needed them to be non-negative. Additionally, we had to prove convergence of the product series without using the integral test (as used in [5]), because it was unavailable in our library. Our final proof of Theorem 4.1 involved a case analysis on whether the sequence was eventually positive or not ✿, and we bypassed the need for the integral test by using an exercise from Rudin's Principles of Mathematical Analysis [31].
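The construction behind that exercise (sketched here for orientation; the details are not spelled out in the original text) is as follows: if $\sum_n a_n$ converges and its tails $r_n = \sum_{k \geq n} a_k$ are positive — this is where the case analysis on eventual positivity enters — then $b_n = 1/\sqrt{r_n}$ tends to infinity, and since $a_n = r_n - r_{n+1}$,

\[ \frac{a_n}{\sqrt{r_n}} \;\leq\; 2\,\bigl(\sqrt{r_n} - \sqrt{r_{n+1}}\bigr), \]

so $\sum_n a_n b_n$ converges by telescoping.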
The main workhorse of the Derman-Sacks proof is the sequence $(b_n)$ obtained above. First, they apply the following theorem (whose proof is a modification of Theorem 6.2.1 in Ash's Probability and Measure Theory [6]) to the sequence of random variables $(\sqrt{b_n}\, W_n)$:
Theorem 4.2 (Loève [28] ✿).
Let $(X_n)$ be a sequence of random variables adapted to a filtration $(\mathcal{F}_n)$. Assume that $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = 0$ almost surely for all $n$ and also that $\sum_n \mathbb{E}[X_n^2]$ converges. Then we have that $\sum_n X_n$ converges almost surely.
to conclude that the series $\sum_n \sqrt{b_n}\, W_n$ converges almost surely. To apply this theorem we need to prove that $(\sqrt{b_n}\, W_n)$ is adapted to the filtration $(\mathcal{F}_{n+1})$, which critically uses the fact that $T_n$ is a measurable function. (Here we take $\mathcal{F}_n = \sigma(X_1, \dots, X_n)$.) The proof of the theorem uses the hypothesis $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = 0$ to show that, since the sequence is adapted, the terms are uncorrelated, $\mathbb{E}[X_m X_n] = 0$ for all $m \neq n$. This depends on the “factor out” property of conditional expectation ✿ (see Section 3.5).
Next, it is shown that $|W_n| \leq b_n^{-1/2}$ almost surely for sufficiently large $n$. This argument uses the Borel-Cantelli lemma ✿ and the Chebyshev inequality ✿, both of which needed a significant amount of probability theory to be set up (see Section 3.7). Using this bound for $W_n$ and the bound for $T_n$ in the hypothesis, an elementary argument shows that

\[ |X_{n+1} - \theta^*| \;\leq\; \max\bigl(\alpha_n,\; (1 + \beta_n)\,|X_n - \theta^*| - \gamma_n\bigr) + |W_n| \]

almost surely for sufficiently large $n$.
Now, the conclusion that $X_n \to \theta^*$ almost surely follows by applying the following lemma:
Lemma 4.3.
✿ Let $(a_n)$ and $(d_n)$ be sequences of real numbers such that
1. the $a_n$ are non-negative,
2. $\sum_n d_n$ converges, and
3. for all $n$ larger than some $N$, $a_{n+1} \leq \max\bigl(\alpha_n,\; (1 + \beta_n)\,a_n - \gamma_n\bigr) + d_n$, where $\alpha_n, \beta_n, \gamma_n$ are as in (15)–(17);
then $\lim_{n \to \infty} a_n = 0$.
The proof of the lemma is somewhat unusual since it involves running an iteration backwards: the property (3) is applied repeatedly to derive an inequality between $a_{n+1}$ and $a_N$ for $n \geq N$ ✿. Besides using several properties of infinite products and list maximums, the final convergence result is an application of Abel's descending convergence criterion ✿, which says that if the series $\sum_n b_n$ converges, and $(c_n)$ is a bounded descending sequence, then the series $\sum_n b_n c_n$ also converges.
We note that our formalization is firmly within classical territory for a number of reasons: first of all, the theory of real numbers within the Coq standard library (which we use) relies on non-computable axioms [23]. Secondly, while constructive measure theory and constructive analysis are both actively researched topics (see [19, 18, 11]), we are unaware if our main result (Dvoretzky's theorem) is constructively valid. Thirdly, as we remarked above, our proof of Theorem 4.1 requires a case split on whether a particular sequence of real numbers is eventually zero or not, for which we use the axiom of constructive indefinite description.
4.2 Variants of Dvoretzky’s Theorem.
While Dvoretzky’s theorem admits generalizations in many different ways, we chose to focus on formalizing the ones most suited for applications.
1. As already mentioned, we prove Theorem 2.2, which is a generalization of Theorem 2.1 in which the sequences of numbers $\alpha_n, \beta_n, \gamma_n$ are replaced by sequences of functions on the probability space. This generalization is called the extended Dvoretzky theorem ✿. All conditions on the sequences now hold pointwise, almost everywhere.
2. To apply Theorem 4.2 in the proof of Theorem 2.1 we needed to prove that the noise sequence is adapted to the filtration, which required us to make assumptions on the functions $T_n$. These assumptions on $T_n$ can be modified and generalized as:
(a) in the regular (non-extended) case, the $T_n$ are deterministic and measurable. ✿
(b) in the extended case, the $T_n$ are stochastic and $\mathcal{F}$-adapted. ✿
Since Derman-Sacks do not explicitly state either assumption, we formalized Dvoretzky's theorem under both assumptions. It should be noted that Dvoretzky's original paper [21] and his revisited paper [22] treat both of the above cases.
3.
We have also formalized a corollary of the extended Dvoretzky’s theorem ✿ which proves that the theorem holds in the context where the bound on in (14) is assumed as follows with all other assumptions intact:
While this formulation is weaker compared to the original, it is convenient to have it for several applications of stochastic approximation theorems. A proof of this corollary used a classical analysis result of Abel [1] on the fact that the terms in a divergent sum-series could be multiplied by infinitesimally small series and the sum-series would still diverge ✿. This was addressed in Dvoretzky’s paper [21, (5.1)].
5 Related work
While our results are general, our intended application was formalizing machine learning theory, on which there is a growing body of work [34, 35, 37, 24, 32, 9, 10]. Our work is a step in this direction, providing future developers of secure machine learning systems a library of formalized stochastic approximation results. Keeping this in mind, we have formalized different versions of our main result (Dvoretzky’s theorem) to facilitate ease of use (see Section 4.2).
For the formalization itself, we make extensive use of the Coquelicot library of Boldo et al. [16] and the library which proved the Lax-Milgram theorem [14], which includes definitions and basic properties of Hilbert spaces. There have also been quite a few formalizations of probability theory in Coq: see Polaris [33], Infotheo [4], and Alea [7]. Alea is an early work and, to the best of our knowledge, incompatible with the latest versions of Coq, while Infotheo and Polaris either fundamentally focus on discrete probability theory (see [3]) or do not have the results we needed to prove Dvoretzky's theorem.
More recently there have been two projects in Coq which formalize measure theory and Lebesgue integration. The MathComp-Analysis project has general measure theory and integration developed on top of their library, which is an alternative to Coquelicot [2]. The Numerical Analysis in Coq (coq-num-analysis) project is built on top of Coquelicot and includes support for Lebesgue integration of nonnegative functions [15]. Neither of these was available at the time we began to develop our probability library. Since we depend on Coquelicot, we could have developed on top of the coq-num-analysis library had it been available earlier. This would have given us the added benefit of supporting general measures instead of our restriction to probability measures. Refactoring our library to build on top of one or more of these formalizations might be a possible direction for future work.
Formal proofs about convergence of random variables (the Central Limit Theorem) have been given by Avigad et al. [8] using the Isabelle/HOL system. Parts of martingale theory and stochastic processes have also recently made their way into the Lean math library [36].
To the best of our knowledge, our work presents the first formal proof of correctness of any theorem in Stochastic Approximation.
6 Applications & Future Work
Our own interest in stochastic approximation began with an attempt to extend our work on convergence proofs of (model-based) Reinforcement Learning (RL) algorithms [39] to include the model-free case. Model-based RL algorithms converge to an optimal policy (a sequence of actions which an agent should probabilistically perform so as to maximize its expected long-term reward) by making full use of the given transition probability structure of the agent. The term model-free refers to the fact that we have no information on how the agent performs its transitions but can only observe them. As we have emphasized above, this situation is perfectly suited for stochastic approximation techniques. Indeed, convergence proofs of Q-Learning (a prominent model-free RL algorithm) appeal to standard results of stochastic approximation (see Watkins & Dayan [41], Jaakkola et al. [25], and Tsitsiklis [38]). We plan to use our formalization of Dvoretzky's theorem to complete a convergence proof of the Q-learning algorithm.
Additionally, as part of this process, we have built up a large library of basic results on (general) probability spaces in Coq, including a general definition of conditional expectation. This library is open source and publicly available at https://github.com/IBM/FormalML. We invite others to use our library and collaborate with us on extending and enhancing it.
References
- [1] Niels Henrik Abel. Note sur le mémoire de Mr. L. Olivier No. 4. du second tome de ce journal, ayant pour titre “Remarques sur les séries infinies et leur convergence”. Crelles Journal, (3), 1828.
- [2] Reynald Affeldt and Cyril Cohen. Formalization of the Lebesgue measure in MathComp-Analysis. The Coq Workshop 2021, online, July 2, 2021, Jul 2021.
- [3] Reynald Affeldt, Jacques Garrigue, and Takafumi Saikawa. Reasoning with conditional probabilities and joint distributions in Coq. Computer Software, 37(3):79–95, 2020. doi:10.11309/jssst.37.3_79.
- [4] Reynald Affeldt and Manabu Hagiwara. Formalization of Shannon’s theorems in SSReflect-Coq. In 3rd Conference on Interactive Theorem Proving (ITP 2012), Princeton, New Jersey, USA, August 13–15, 2012, volume 7406 of Lecture Notes in Computer Science, pages 233–249. Springer, Aug 2012.
- [5] J Marshall Ash. Neither a worst convergent series nor a best divergent series exists. The College Mathematics Journal, 28(4):296–297, 1997.
- [6] Robert B. Ash and Catherine A. Doléans-Dade. Probability and measure theory. Academic Press, 2000.
- [7] Philippe Audebaud and Christine Paulin-Mohring. Proofs of randomized algorithms in Coq. Science of Computer Programming, 74(8):568–589, 2009.
- [8] Jeremy Avigad, Johannes Hölzl, and Luke Serafin. A formally verified proof of the central limit theorem. Journal of Automated Reasoning, 59(4):389–423, 2017.
- [9] Alexander Bagnall and Gordon Stewart. Certifying the True Error: Machine Learning in Coq with Verified Generalization Guarantees. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pages 2662–2669. AAAI Press, 2019. doi:10.1609/aaai.v33i01.33012662.
- [10] Alexander Bentkamp, Jasmin Christian Blanchette, and Dietrich Klakow. A Formal Proof of the Expressiveness of Deep Learning. J. Autom. Reason., 63(2):347–368, 2019. doi:10.1007/s10817-018-9481-5.
- [11] Errett Albert Bishop. Foundations of constructive analysis. 1967.
- [12] Julius R. Blum. Approximation Methods which Converge with Probability one. The Annals of Mathematical Statistics, 25(2):382 – 386, 1954. doi:10.1214/aoms/1177728794.
- [13] Paul du Bois-Reymond. Eine neue Theorie der Convergenz und Divergenz von Reihen mit positiven Gliedern. 1873.
- [14] Sylvie Boldo, François Clément, Florian Faissole, Vincent Martin, and Micaela Mayero. A Coq formal proof of the Lax–Milgram theorem. In 6th ACM SIGPLAN Conference on Certified Programs and Proofs, Paris, France, January 2017. URL: https://hal.inria.fr/hal-01391578, doi:10.1145/3018610.3018625.
- [15] Sylvie Boldo, François Clément, Florian Faissole, Vincent Martin, and Micaela Mayero. A Coq formalization of Lebesgue integration of nonnegative functions. Journal of Automated Reasoning, pages 1–39, 2021.
- [16] Sylvie Boldo, Catherine Lelay, and Guillaume Melquiond. Coquelicot: A user-friendly library of real analysis for Coq. Mathematics in Computer Science, 9, 03 2014. doi:10.1007/s11786-014-0181-1.
- [17] Han-Fu Chen. Stochastic approximation and its applications, volume 64. Springer Science & Business Media, 2006.
- [18] Thierry Coquand and Erik Palmgren. Metric boolean algebras and constructive measure theory. Archive for Mathematical Logic, 41(7):687–704, 2002.
- [19] Thierry Coquand and Bas Spitters. Integrals and valuations. arXiv preprint arXiv:0808.1522, 2008.
- [20] C Derman and J Sacks. On Dvoretzky’s stochastic approximation theorem. The Annals of Mathematical Statistics, 30(2):601–606, 1959.
- [21] A. Dvoretzky. On stochastic approximation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1956.
- [22] Aryeh Dvoretzky. Stochastic approximation revisited. Advances in Applied Mathematics, 7(2):220–227, 1986. URL: https://www.sciencedirect.com/science/article/pii/0196885886900333, doi:https://doi.org/10.1016/0196-8858(86)90033-3.
- [23] Herman Geuvers and Milad Niqui. Constructive reals in coq: Axioms and categoricity. In International Workshop on Types for Proofs and Programs, pages 79–95. Springer, 2000.
- [24] Johannes Hoelzl. Markov processes in Isabelle/HOL. In Proceedings of the 6th ACM SIGPLAN Conference on Certified Programs and Proofs, CPP 2017, page 100–111, New York, NY, USA, 2017. Association for Computing Machinery. doi:10.1145/3018610.3018628.
- [25] Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural computation, 6(6):1185–1201, 1994.
- [26] J. Kiefer and J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics, 23(3):462 – 466, 1952. doi:10.1214/aoms/1177729392.
- [27] Tze Leung Lai. Stochastic Approximation. The Annals of Statistics, 31(2):391–406, 2003. URL: http://www.jstor.org/stable/3448398.
- [28] Michel Loève. On almost sure convergence. Proc. Berkeley Sympos. math. Statist. Probability, California July 31 - August 12, 1950, 279-303 (1951)., 1951.
- [29] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Jagdish S. Rustagi, editor, Optimizing Methods in Statistics, pages 233–257. Academic Press, 1971. URL: https://www.sciencedirect.com/science/article/pii/B9780126045505500158, doi:https://doi.org/10.1016/B978-0-12-604550-5.50015-8.
- [30] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
- [31] Walter Rudin. Principles of mathematical analysis, volume 3. McGraw-Hill, New York, 1976.
- [32] Daniel Selsam, Percy Liang, and David L. Dill. Developing Bug-Free Machine Learning Systems With Formal Mathematics. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 3047–3056. PMLR, 2017. URL: http://proceedings.mlr.press/v70/selsam17a.html.
- [33] Joseph Tassarotti and Robert Harper. A separation logic for concurrent randomized programs. Proceedings of the ACM on Programming Languages, 3(POPL):1–30, 2019.
- [34] Joseph Tassarotti, Jean-Baptiste Tristan, and Koundinya Vajjha. A Formal Proof of PAC Learnability for Decision Stumps. CoRR, abs/1911.00385, 2019. URL: http://arxiv.org/abs/1911.00385, arXiv:1911.00385.
- [35] Joseph Tassarotti, Koundinya Vajjha, Anindya Banerjee, and Jean-Baptiste Tristan. A formal proof of PAC learnability for decision stumps. In Proceedings of the 10th ACM SIGPLAN International Conference on Certified Programs and Proofs, pages 5–17, 2021.
- [36] The mathlib Community. The Lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP 2020), pages 367–381, 2020.
- [37] Jean-Baptiste Tristan, Joseph Tassarotti, Koundinya Vajjha, Michael L. Wick, and Anindya Banerjee. Verification of ML Systems via Reparameterization. CoRR, abs/2007.06776, 2020. URL: https://arxiv.org/abs/2007.06776, arXiv:2007.06776.
- [38] John N Tsitsiklis. Asynchronous stochastic approximation and q-learning. Machine learning, 16(3):185–202, 1994.
- [39] Koundinya Vajjha, Avraham Shinnar, Barry Trager, Vasily Pestun, and Nathan Fulton. Certrl: formalizing convergence proofs for value and policy iteration in coq. In Proceedings of the 10th ACM SIGPLAN International Conference on Certified Programs and Proofs, pages 18–31, 2021.
- [40] J. H. Venter. On Dvoretzky Stochastic Approximation Theorems. The Annals of Mathematical Statistics, 37(6):1534 – 1544, 1966. doi:10.1214/aoms/1177699145.
- [41] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- [42] J. Wolfowitz. On the Stochastic Approximation Method of Robbins and Monro. The Annals of Mathematical Statistics, 23(3):457–461, 1952. URL: http://www.jstor.org/stable/2236689.
- [43] J. Wolfowitz. On Stochastic Approximation Methods. The Annals of Mathematical Statistics, 27(4):1151 – 1156, 1956. doi:10.1214/aoms/1177728082.