SGD Through the Lens of Kolmogorov Complexity
Abstract
We prove that stochastic gradient descent (SGD) finds a solution that achieves classification accuracy on the entire dataset. We do so under two main assumptions: (1. Local progress) The model accuracy improves on average over batches. (2. Models compute simple functions) The function computed by the model is simple (has low Kolmogorov complexity). It is sufficient that these assumptions hold only for a tiny fraction of the epochs.
Intuitively, the above means that intermittent local progress of SGD implies global progress. Assumption 2 trivially holds for underparameterized models, hence, our work gives the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result which is completely model agnostic - we do not require the model to have any specific architecture or activation function, it may not even be a neural network. Our analysis makes use of the entropy compression method, which was first introduced by Moser and Tardos [1; 2] in the context of the Lovász local lemma.
1 Introduction
Stochastic gradient descent (SGD) is at the heart of modern machine learning. However, we are still lacking a theoretical framework that explains its performance for general, non-convex functions. There are known convergence guarantees for overparameterized neural networks [3; 4; 5; 6]. These works show that if the number of network parameters is considerably larger than the number of data points, then the network is guaranteed to converge to a (near) global optimum. The main drawback of these results is the requirement of exceedingly large (and wide) models, where the size is a large polynomial in (the size of the dataset). In other words, the analysis is dependent on the structure of the model. In this work, we take a different approach and show convergence guarantees based on the non-structural properties of the model.
Our two main assumptions are: (1) The model accuracy improves on average over batches following the GD step. That is, if we look at the difference (batch accuracy after GD − batch accuracy before GD), averaged over an epoch, then it is greater than some positive parameter. (2) The function computed by the model is simple (has Kolmogorov complexity sufficiently smaller than ). Note that this is different from saying that the model is simple. It might be the case that a very complicated model computes an extremely simple function. A classical example is a high-degree polynomial over a finite field that always evaluates to 0.
For our convergence guarantees to hold, we only require the above assumptions to hold for some tiny fraction of the iterations. Note that Assumption 2 trivially holds when the model is underparameterized (the number of model parameters is sufficiently smaller than ). To the best of our knowledge, this is the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result which is completely model agnostic. That is, we do not require the model to have any specific architecture or activation function. It may not even be a neural network. Finally, our results answer an intriguing theoretical question: is it possible for SGD to achieve local progress (improvement over batches), but no global progress (good accuracy for the entire dataset)? Intuitively, one might think that this is indeed a situation that might occur. However, we show that (under Assumption 2) this cannot be the case.
Overview of our results
We consider batch SGD, where the model is initialized randomly and the dataset is shuffled once at the beginning of each epoch and then divided into batches. Note that this paper does not deal with the generalization abilities of the model, thus the dataset is always the training set. In each epoch, the algorithm goes over the batches one by one, and performs gradient descent to update the model. This is the “vanilla" version of SGD, without any acceleration or regularization (for a formal definition see Section 2). Furthermore, for the sake of analysis, we add a termination condition after every GD step: if the accuracy on the entire dataset is at least we terminate. Thus in our case termination implies good accuracy.
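To fix ideas, here is a minimal sketch of the SGD variant described above (not the paper's Algorithm 1 verbatim); `grad_loss` and `accuracy` are hypothetical callables standing in for the unspecified loss gradient and accuracy oracle, and `w` is assumed to behave like a NumPy array so that vector arithmetic works elementwise.

```python
import random

def sgd_with_termination(data, w, grad_loss, accuracy, batch_size, lr, target_acc, max_epochs):
    """Vanilla batch SGD: reshuffle once per epoch, split into batches, one GD step per batch.
    The extra termination condition, checked after every GD step, is the one added for the
    analysis: stop as soon as the accuracy on the *entire* dataset reaches target_acc."""
    for epoch in range(max_epochs):
        random.shuffle(data)                       # fresh random permutation at the start of the epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            w = w - lr * grad_loss(w, batch)       # plain gradient step, no momentum or regularization
            if accuracy(w, data) >= target_acc:    # termination therefore implies good accuracy on all data
                return w, epoch
    return w, max_epochs
```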
We show that under reasonable assumptions, local progress of SGD implies global progress. Let be an accuracy parameter of our choice and let be the dataset (every element in consists of a data point, a label, and some unique ID), where . Let be the Kolmogorov complexity of (the size of the shortest Turing machine which outputs ). We can also define the Kolmogorov complexity of a function as the size of the shortest Turing machine which computes the function (see Section 2 for a formal definition). The assumptions we make are formally stated at the beginning of Section 3. Below is an intuitive outline:
-
1.
After every GD step the batch accuracy is improved by at least an additive factor (on average, per epoch), where is a local progress parameter. That is, fix some epoch and let be the model accuracy on the -th batch before the GD step and let be the model accuracy after the GD step, then we require that where is the number of batches in an epoch.
-
2.
The Kolmogorov complexity of the function computed by the model is at most . In other words, either the model is underparameterized, or the function computed by the model is sufficiently simple.
-
3.
The loss function is differentiable and L-smooth. This must also hold when taking numerical stability into account.
-
4.
The batch size is . Note, the dependence on is only polylogarithmic.
-
5.
The size of every element in is at most polynomial in . For example, if contains images, they are not excessively large compared to . Furthermore, the number of model parameters is at most polynomial in (this is not implied by Assumption 2, as the model can be huge but still compute a simple function).
Roughly speaking, we show that if is sufficiently large and assumptions 3 to 5 always hold and assumptions 1, 2 only hold for a -fraction of the epochs ( need not be a constant), then SGD must achieve an accuracy of at least on the entire dataset within epochs with high probability (w.h.p.). (High probability is usually taken to be for some constant ; for our case, it holds that , however, this can be amplified arbitrarily by increasing the batch size by a multiplicative constant.) Note that the number of epochs does not depend on the size of the model. Finally, note that when the accuracy is too high the local progress assumption cannot hold anymore (we discuss this further in Section 3.1), and we cannot guarantee progress past this point.
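As a concrete reading of Assumption 1, the following sketch measures the average per-batch improvement over one epoch; `gd_step` and `accuracy` are hypothetical stand-ins for a single gradient step and for the model's accuracy on a set.

```python
def average_local_progress(batches, w, gd_step, accuracy):
    """Average, over the batches of one epoch, of
    (batch accuracy after the GD step on that batch) - (batch accuracy before the step).
    Assumption 1 asks for this average to be at least some positive parameter."""
    gains = []
    for batch in batches:
        before = accuracy(w, batch)
        w = gd_step(w, batch)                  # one gradient-descent step on this batch
        gains.append(accuracy(w, batch) - before)
    return sum(gains) / len(gains)
```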
Let us briefly discuss our assumptions. As for Assumption 1, it is quite natural as the GD step attempts to minimize the loss function which acts as a proxy for the accuracy on the data. However, Assumption 2 seems at odds with the fact that we assume the model to be initialized randomly. Intuitively, one might believe that a random initialization surely implies that Assumption 2 does not hold. While it is indeed the case that the model has high Kolmogorov complexity in this case, this is not necessarily the case for the function computed by the model. Recent literature suggests that randomly initialized models are biased towards simple functions rather than complicated ones (see Section 7.3 in [7] for a survey of recent work) and that the function learned by the network has low Kolmogorov complexity [8]. Furthermore, our results also generalize if we allow lossy compression. There is a vast literature on model compression (see [9] for a survey of recent literature), where the size of the model can be significantly compressed while preserving its accuracy. This also serves as a strengthening argument towards this assumption.
Note that assumptions (3)-(5) are rather mild. Assumption 3 (or some variation of it) is standard in the literature [3; 4; 5; 6], and discussions regarding numerical stability are generally omitted. Due to our use of Kolmogorov complexity we need to explicitly state that the L-smoothness assumption holds even when we consider some subset of the real numbers (i.e., floats). We discuss this in more detail in Section 2. In Assumption 4 we only require a polylogarithmic dependence on , which is on par with batch sizes used in practice. Furthermore, it is well observed that when the batch size is too small SGD suffers from instability [10]. Finally, Assumption 5 is extremely mild, as our dependence on the size of the elements / model is only logarithmic when we apply this assumption. That is, even elements / model as large as leave our analysis unchanged asymptotically.
Outline of our techniques
To achieve our results we make use of entropy compression, first considered by Moser and Tardos [1; 2] to prove a constructive version of the Lovász local lemma (the name “entropy compression” was coined by Terence Tao [11]). Roughly speaking, the entropy compression argument allows one to bound the running time of a randomized algorithm (we require that the number of random bits used is proportional to the execution time of the algorithm; that is, the algorithm flips coins for every iteration of a loop, rather than just a constant number at the beginning of the execution) by leveraging the fact that a random string of bits (the randomness used by the algorithm) is computationally incompressible (has high Kolmogorov complexity) w.h.p. If one can show that throughout the execution of the algorithm, it (implicitly) compresses the randomness it uses, then one can bound the number of iterations the algorithm may execute without terminating. To show that the algorithm has such a property, one would usually consider the algorithm after executing iterations, and would try to show that just by looking at an “execution log” of the algorithm and some set of “hints”, whose size together is considerably smaller than the number of random bits used by the algorithm, it is possible to reconstruct all of the random bits used by the algorithm. We highly encourage readers unfamiliar with this argument to take a quick look at [12], which provides an excellent one-page example.
We apply this approach to SGD with an added termination condition when the accuracy over the entire dataset is sufficiently high; thus termination in our case guarantees good accuracy. The randomness we compress consists of the bits required to represent the random permutation of the data at every epoch. So indeed the longer SGD executes, the more random bits are generated. We show that under our assumptions it is possible to reconstruct these bits efficiently starting from the dataset and the model after executing epochs. The first step in allowing us to reconstruct the random bits of the permutation in each epoch is to show that under our L-smoothness assumption and a sufficiently small GD step, SGD is reversible. That is, if we are given a model and a batch such that results from taking a gradient step with model where the loss is calculated with respect to , then we can uniquely retrieve using only and . This means that if we can efficiently encode the batches used in every epoch (i.e., using fewer bits than encoding the entire permutation of the data), we can also retrieve all intermediate models in that epoch (at no additional cost). We prove this claim in Section 2.
The crux of this paper is to show that our assumptions imply that at every epoch the batches can indeed be compressed. Let us consider two toy examples that provide intuition for our analysis. First, consider the scenario where, for every epoch, exactly in the “middle" of the epoch (after considering half of the batches for the epoch), our model achieves 100% accuracy on the set of all elements already considered so far during the epoch (elements on the “left") and 0% accuracy for the remaining elements (elements on the “right"). Let us consider a single epoch of this execution. We can express the permutation of the dataset by encoding two sets (for elements on the left and right), and two permutations for these sets. Given the dataset and the function computed by the model at the middle of the epoch, we immediately know which elements are on the left and which are on the right by seeing if they are classified correctly or not by the model (recall that contains data points and labels and that elements have unique IDs). Thus it is sufficient to encode only two permutations, each over elements. While naively encoding the permutation for the entire epoch requires bits, we manage to make do with . Using Stirling’s approximation () and omitting logarithmic factors, we get that . Thus, we can save a linear number of bits using this encoding. To achieve the above encoding, we need the function computed by the model in the middle of the epoch. Assumption 2 guarantees that we can efficiently encode this function.
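The bit accounting in this first toy example can be made concrete. The sketch below is a back-of-the-envelope calculation, not part of the paper's proof: it compares the naive cost of a permutation of n elements with the cost when the left/right split is recovered for free from the mid-epoch model.

```python
import math

def log2_factorial(m):
    # log2(m!) via the log-gamma function, to avoid materializing huge integers
    return math.lgamma(m + 1) / math.log(2)

def savings_from_midpoint_model(n):
    """Naive encoding: one permutation of n elements, log2(n!) bits.
    With the mid-epoch model: the left/right sets are deduced from which elements it
    classifies correctly, so we only pay for two permutations of n/2 elements each."""
    naive = log2_factorial(n)
    with_model = 2 * log2_factorial(n // 2)
    return naive - with_model   # = log2(C(n, n/2)), which is linear in n

# savings_from_midpoint_model(10_000) is roughly 10_000 bits, i.e. a linear saving
```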
Second, let us consider the scenario where, in every epoch, just after a single GD step on a batch we consistently achieve 100% accuracy on the batch. Furthermore, we terminate our algorithm when we achieve 90% accuracy on the entire dataset. Let us consider some epoch of our execution, assume we have access to , and let be the model at the end of the epoch. Our goal is to retrieve the last batch of the epoch, (without knowing the permutation of the data for the epoch). A naive approach would be to simply encode the indices in of the elements in the batch (we can sort by IDs). However, we can use to achieve a more efficient encoding. Specifically, we know that achieves 100% accuracy on but only 90% accuracy on . Thus it is sufficient to encode the elements of using a smaller subset of (the elements classified correctly by , which has size at most ). This allows us to significantly compress . Next, we can use and together with the reversibility of SGD to retrieve . We can now repeat the above argument to compress and so on, until we are able to reconstruct all of the random bits used to generate the permutation of in the epoch. This will again result in a linear reduction in the number of bits required for the encoding.
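A similar calculation applies to the second toy example: the batch is known to lie inside the set of elements the model classifies correctly (at most a 0.9-fraction of the dataset, otherwise we would have terminated), so it can be indexed within that smaller pool. The sketch below only illustrates the counting; it is not the encoding used in the formal proof.

```python
import math

def log2_binom(m, k):
    # log2 of the binomial coefficient C(m, k), via the log-gamma function
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)) / math.log(2)

def per_batch_savings(n, b, acc=0.9):
    """Bits saved by encoding a batch of size b as a subset of the correctly
    classified elements (size at most acc*n) instead of the whole dataset (size n)."""
    return log2_binom(n, b) - log2_binom(int(acc * n), b)   # roughly b * log2(1/acc) bits

# summed over the n/b batches of an epoch, this saving is again linear in n
```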
In our analysis, we show generalized versions of the two scenarios above. First, we show that if at any epoch the accuracy of the model on the left and right elements differs significantly (we need not split the data down the middle anymore), we manage to achieve a meaningful compression of the randomness. Next, we show that if this scenario does not happen, then together with our local progress assumption, we manage to achieve slightly better accuracy on the batch we are trying to retrieve (compared to the set of available elements). This results in a compression similar to the second scenario. Either way, we manage to save a significant (linear in ) number of bits for every epoch, which allows us to use the entropy compression argument to bound our running time.
Related work
An excellent survey of recent work is given in [13], which we follow below. There has been a long line of research proving convergence bounds for SGD under various simplifying assumptions such as linear networks [14; 15], shallow networks [16; 17; 18], etc. However, the most relevant results for this paper are the ones dealing with deep, overparameterized networks [3; 4; 5; 6]. All of these works make use of the NTK (Neural Tangent Kernel) [19] and show convergence guarantees for SGD when the hidden layers have width at least where is the size of the dataset and is the depth of the network. We note that the exponents of the polynomials are quite large. For example, assuming a random weight initialization, [4] require width at least and converge to a model with an error in steps w.h.p. The current best parameters are due to [6], which, under random weight initialization, require width at least and converge to a model with an error in steps w.h.p. Recent works [20; 21; 13] achieve a much improved width dependence (linear / quadratic) and running times; however, they require additional assumptions which make their results incomparable with the above. Specifically, in [20; 21] a special (non-standard) weight initialization is required, and in [13] convergence is shown for a non-standard activation unit.
Our results are not directly comparable with the above, mainly due to assumptions 1 and 2. However, if one accepts our assumptions, our analysis has the benefit of not requiring any specific network architecture. The convergence times are also essentially incomparable due to the dependence on . That is, can be very small if the elements of are small and similar to each other, or very large if the elements are large and noisy (e.g., high-definition images with noise). To the best of our knowledge, this is the first (conditional) global convergence guarantee for SGD for general, non-convex underparameterized models.
Structure of the paper
2 Preliminaries
We consider the following optimization problem. We are given an input (dataset) of size . Let us denote (our inputs contain both data and labels; we do not need to distinguish them in this work). We also associate every with a unique ID of bits. We often consider batches of the input . The size of the batch is denoted by (all batches have the same size). We have some model whose parameters are denoted by , where is the model dimension. We aim to optimize a goal function of the following type: , where the functions are completely determined by . We also define for every set : . Note that .
We denote by the accuracy of model on the set (where we use to classify elements from ). Note that for it holds that is a binary value indicating whether is classified correctly or not. We require that every is differentiable and L-smooth: . This implies that every is also differentiable and L-smooth. To see this, note that the gradient of an average is the average of the per-element gradients, so the triangle inequality immediately yields L-smoothness of the averaged function.
We state another useful property of :
Lemma 2.1.
Let and . For any , if it holds that then .
Proof.
Rearranging the terms we get that . Now let us consider the norm of both sides: Unless , the final strict inequality holds which leads to a contradiction. ∎
The above means that for a sufficiently small gradient step, the gradient descent process is reversible. That is, we can always recover the previous model parameters given the current ones, assuming that the batch is fixed. We use the notion of reversibility throughout this paper. However, in practice we only have finite precision, thus instead of we work with the finite set . Furthermore, due to numerical stability issues, we do not have access to exact gradients, but only to approximate values . For the rest of this paper, we assume these values are L-smooth on all elements in . That is, the L-smoothness inequality above holds with the approximate gradients in place of the exact ones.
This immediately implies that Lemma 2.1 holds even when precision is limited. Let us state the following theorem:
Theorem 2.2.
Let , and . If it holds that , then given and we can retrieve .
Proof.
Given we iterate over all until we find such that . Using Lemma 2.1, there is only a single element such that this equality holds, and thus . We repeat this process until we retrieve . ∎
We note that the assumption that L-smoothness holds for the approximate gradients on all elements in might seem a bit strong. However, if we assume that the gradients have norm bounded by a parameter (i.e., is -Lipschitz), then it holds that . This means that in Theorem 2.2, we only need to go over elements that are close to when retrieving . This in turn means that we can require that the Lipschitz condition for only holds around .
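The retrieval procedure of Theorem 2.2 enumerates a finite grid of parameter vectors. As an illustration of why the step is invertible at all, the sketch below recovers the pre-step parameters by fixed-point iteration, which converges because the map is a contraction whenever lr * L < 1; `grad_loss` is again a hypothetical stand-in, and this is not the enumeration procedure used in the proof.

```python
import numpy as np

def invert_gd_step(w_next, batch, grad_loss, lr, iters=200):
    """Recover w from w_next = w - lr * grad_loss(w, batch).
    When grad_loss is L-smooth and lr * L < 1, the map
        w -> w_next + lr * grad_loss(w, batch)
    is a contraction, and its unique fixed point is the pre-step model w."""
    w_next = np.asarray(w_next, dtype=float)
    w = w_next.copy()                              # start the iteration at w_next itself
    for _ in range(iters):
        w = w_next + lr * grad_loss(w, batch)      # contraction step
    return w
```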
SGD
We analyze the classic SGD algorithm presented in Algorithm 1. One difference to note in our algorithm, compared to the standard implementation, is the termination condition when the accuracy on the dataset is sufficiently high. In practice the termination condition is not used; we include it only to prove that at some point in time the accuracy of the model is indeed sufficiently high. It is also possible to check this condition on a sample of a logarithmic number of elements of and achieve a good approximation to the accuracy w.h.p.
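The remark about checking the termination condition on a sample can be made concrete as follows; `correct(w, x)` is a hypothetical 0/1 indicator of whether the model classifies x correctly, and the sample size comes from a standard Hoeffding bound.

```python
import math
import random

def estimated_accuracy(w, data, correct, eps=0.05, c=2):
    """Estimate the accuracy on `data` from a random sample of O(log n / eps^2) elements.
    By Hoeffding's inequality, the estimate is within an additive eps of the true
    accuracy with probability at least 1 - 1/n^c."""
    n = len(data)
    k = min(n, math.ceil((c * math.log(n) + math.log(2)) / (2 * eps ** 2)))
    sample = random.sample(data, k)                # sampling without replacement
    return sum(correct(w, x) for x in sample) / k
```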
Kolmogorov complexity
The Kolmogorov complexity of a string , denoted by , is defined as the size of the smallest prefix Turing machine which outputs this string. We note that this definition depends on which encoding of Turing machines we use. However, one can show that this changes the Kolmogorov complexity by at most an additive constant [22]. Random strings have the following useful property [22]:
Theorem 2.3.
For an $n$-bit string $x$ chosen uniformly at random and any $c \geq 1$ it holds that $\Pr[K(x) \geq n - c] \geq 1 - 2^{-c}$.
We also use the notion of conditional Kolmogorov complexity, denoted by . This is the length of the shortest prefix Turing machine which gets as an auxiliary input and prints . Note that the length of does not count towards the size of the machine which outputs . So it can be the case that but it holds that . We can also consider the Kolmogorov complexity of functions. Let then is the size of the smallest Turing machine which computes the function .
The following properties of Kolmogorov complexity will be of use [22]. Let be three strings:
-
•
Extra information:
-
•
Subadditivity:
Throughout this paper we often consider the value where is a set. Here the program computing need only output the elements of (in any order). When considering such that , it holds that . To see why, consider Algorithm 2.
In the algorithm is the index of when considering some ordering of all subsets of of size . Thus bits are sufficient to represent . The remaining variables and any additional variables required to construct the set are all of size at most and there is at most a constant number of them.
During our analysis, we often bound the Kolmogorov complexity of tuples of objects. For example, where is a set and is a permutation of (note that together form an ordered tuple of the elements of ). Instead of explicitly presenting a program such as Algorithm 2, we say that if and bits are sufficient to represent , thus . This just means that we directly have a variable encoding into the program that computes given and uses it in the code. For example, we can add a permutation to Algorithm 2 and output an ordered tuple of elements rather than a set. Note that when representing a permutation of , instead of using functions, we can just talk about values in . That is, we can decide on some predetermined ordering of all permutations of elements, and represent a permutation as its number in this ordering.
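One concrete way to realize the index used in Algorithm 2 is the combinatorial number system, which ranks the b-element subsets of {0, ..., n-1} so that a subset is described by an integer of about log2(C(n, b)) bits. This is an illustrative sketch, not necessarily the exact encoding the paper has in mind.

```python
from math import comb

def subset_to_index(subset, n):
    """Lexicographic rank of a b-element subset of {0, ..., n-1};
    the rank fits in ceil(log2(C(n, b))) bits."""
    elems = sorted(subset)
    rank, prev = 0, -1
    for i, e in enumerate(elems):
        for smaller in range(prev + 1, e):
            # subsets that agree on the first i elements but take a smaller i-th element
            rank += comb(n - smaller - 1, len(elems) - i - 1)
        prev = e
    return rank

def index_to_subset(rank, n, b):
    """Inverse of subset_to_index: recover the subset from n, b and its rank."""
    subset, prev = [], -1
    for i in range(b):
        e = prev + 1
        while comb(n - e - 1, b - i - 1) <= rank:
            rank -= comb(n - e - 1, b - i - 1)
            e += 1
        subset.append(e)
        prev = e
    return subset

# index_to_subset(subset_to_index({2, 5, 7}, 10), 10, 3) == [2, 5, 7]
```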
Entropy and KL-divergence
Our proofs make extensive use of binary entropy and KL-divergence. In what follows we define these concepts and provide some useful properties.
Entropy: For we denote by the entropy of . Note that .
KL-divergence: For let be the Kullback–Leibler divergence (KL-divergence) between two Bernoulli distributions with parameters . We also state the following useful special case of Pinsker’s inequality: .
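For reference, the binary entropy and the Bernoulli KL-divergence can be computed in bits as follows; the Pinsker-type bound checked at the end, D(p||q) >= 2(p-q)^2, is one standard form of the Bernoulli special case and is included only as a sanity check.

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bernoulli_kl(p, q):
    """KL-divergence, in bits, between Bernoulli(p) and Bernoulli(q), for q in (0, 1)."""
    d = 0.0
    if p > 0:
        d += p * math.log2(p / q)
    if p < 1:
        d += (1 - p) * math.log2((1 - p) / (1 - q))
    return d

# Pinsker-type lower bound in the Bernoulli case (in bits): D(p||q) >= 2 * (p - q)**2
assert all(bernoulli_kl(p, q) >= 2 * (p - q) ** 2
           for p in (0.0, 0.1, 0.5, 0.9, 1.0) for q in (0.2, 0.5, 0.8))
```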
Lemma 2.4.
For it holds that .
Proof.
Let us write our claim as:
Rearranging we get:
Note that . Where in the final transition we use the fact that is monotonically increasing on . This completes the proof. ∎
Lemma 2.5.
For where it holds that
Proof.
Let us expand the left hand side using the definition of entropy:
Where in the last equality we simply sum the first terms on both lines. To complete the proof we use the log-sum inequality for the last expression. The log-sum inequality states that: Let be non-negative numbers and let , then . We apply the log-sum inequality with and , getting that:
Putting everything together we get that
∎
Representing sets
Let us state some useful bounds on the Kolmogorov complexity of sets.
Lemma 2.6.
Let , and let . For any set let and . It holds that
Proof.
The algorithm is very similar to Algorithm 2; the main difference is that we must first compute from using , and select from , respectively, using two indices . Finally we print . We can now bound the number of bits required to represent . Note that . Note that for we pick elements from elements and for we pick elements from elements. The number of bits required to represent this selection is:
Where in the first inequality we used the fact that , Lemma 2.5 in the second transition, and the lower bound on the KL-divergence in the third transition. The additional factor is due to various counters and variables, similarly to Algorithm 2. ∎
Concentration bounds
We state the following Hoeffding bound for sampling without replacement.
Theorem 2.7 (Hoeffding [23; 24]).
Let be a finite population of size , such that , and let be a random sample drawn without replacement from (first is chosen uniformly at random from , then is chosen uniformly at random from and so on). Let . For all it holds that: .
One important thing to note is that if we have access to as a shuffled tuple (or shuffled array), then the elements at any subset of indices form a random sample of size drawn from without replacement. Furthermore, if we take some sample from the shuffled population and then take a second sample (already knowing ) from the remaining set , then is sampled without replacement from .
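The following small simulation illustrates Theorem 2.7 for a {0,1}-valued population (the case used later, since accuracies are averages of correct/incorrect indicators); it is a sanity check under the assumption that the population values lie in [0,1], not part of any proof.

```python
import math
import random

def hoeffding_check(population, k, t, trials=20_000):
    """Empirical frequency of the event (sample mean - population mean >= t) when k
    elements are drawn without replacement, compared with the Hoeffding bound exp(-2*k*t^2)."""
    mu = sum(population) / len(population)
    bad = sum(1 for _ in range(trials)
              if sum(random.sample(population, k)) / k - mu >= t)
    return bad / trials, math.exp(-2 * k * t * t)

# e.g. hoeffding_check([0]*500 + [1]*500, k=100, t=0.1) returns an empirical rate
# well below the bound exp(-2) ≈ 0.135 (sampling without replacement only helps).
```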
3 Convergence guarantees for SGD
In this section, we prove that Algorithm 1 eventually terminates. That is, the desired accuracy on the entire dataset is achieved. Let us start by defining some notation, followed by formal definitions of our assumptions.

Notations
First let us define some useful notations:
-
•
. This is the accuracy of the model in epoch on the entire dataset , before performing the GD step on batch .
-
•
. This is the accuracy of the model on the -th batch in the -th epoch after performing the GD step on the batch.
-
•
(note that ). This is the set of elements in the first batches of epoch . Let us also denote (Note that , thus need not appear in the subscript).
-
•
, where is the accuracy of the model on the set of all previously seen batch elements, after performing the GD step on the -th batch and is the accuracy of the same model, on all remaining elements (-th batch onward). To avoid computing the accuracy on empty sets, is defined for and is defined for .
Assumptions
In our analysis we consider epochs of the SGD algorithm, where we bound later. Let us state the assumptions we make (later we discuss ways to further alleviate these assumptions).
-
1.
Local progress: There exists some constant such that . That is, we assume that in every epoch, the accuracy of the model on the batch after the GD step increases by an additive factor on average. Let us denote (we explain the reason for this parameterization in Section 3.1).
-
2.
Models compute simple functions: (where is the accuracy function of the model with parameters ). This is actually a slightly weaker condition than bounding the Kolmogorov complexity of the function computed by the model. That is, given and the function computed by the model we can compute , but not the other way around. For simplicity of notation, we assume that contains not only the model parameters, but also the architecture of the model and the indices .
-
3.
. That is, the batch must be sufficiently large. After we bound , we get that is sufficient.
-
4.
is differentiable and L-smooth, even when taking numerical stability into account.
-
5.
. That is, the size of the model and the size of every data point is at most polynomial in . We note that the exponent can be extremely large, say , and this will only change our bounds by a constant.
Bounding :
Our goal is to use the entropy compression argument to bound the running time of the algorithm. That is, we show that if our assumptions hold for consecutive epochs and is sufficiently large, then our algorithm must terminate. Let be the string of random bits representing the random permutation of at epoch . As we consider epochs, let . Note that the number of bits required to represent an arbitrary permutation of is given by: , where we used Stirling’s approximation. Thus, it holds that and according to Theorem 2.3, with probability at least it holds that .
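For concreteness, the number of random bits being compressed can be computed as follows (Stirling enters through log2(n!)); the snippet is only meant to make the orders of magnitude explicit.

```python
import math

def shuffle_bits(n, epochs):
    """Bits needed to describe the per-epoch shuffles: one uniformly random permutation
    of n elements per epoch, i.e. log2(n!) = n*log2(n) - O(n) bits each (Stirling)."""
    return epochs * math.lgamma(n + 1) / math.log(2)

# By Theorem 2.3 a uniformly random string of this length is incompressible (up to a
# small additive term) w.h.p.; saving a linear-in-n number of bits per epoch therefore
# bounds the number of epochs the algorithm can run without terminating.
```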
Our goal for the rest of the section is to show that if our assumptions hold, then in every epoch we manage to compress , which in turn results in a compression of , leading to an upper bound on due to the incompressibility of . For the rest of the section, we assume that all of our assumptions hold and that the algorithm does not terminate within the first epochs. As discussed in the introduction, our proof consists of two cases. For the first case, we show that if at some point during the epoch is sufficiently large, then can be compressed. For the second case, we show that if is small throughout the epoch, then can be compressed by encoding every batch individually.
Case 1: is large
In the introduction, we considered the case where during the epoch the model achieves 100% accuracy on half the elements and 0% on the other half. Let us show a generalized version of this statement, where instead of considering the model during the middle of the epoch, we consider the model after seeing a -fraction of the data. The statement below holds for an arbitrary difference in the accuracies of the model. Instead of considering directly, we first show an auxiliary compression claim involving and . This will immediately allow us to derive the desired result for , and will also be useful later on. Let us state the following claim.
Claim 3.1.
Let . It holds that:
Proof.
To encode we encode it as 2 separate strings, split at the -th index. To encode each such string it is sufficient to encode the sets and , accompanied by a permutation for each set. As is given, we can immediately deduce given an ordering of the elements of . Note that knowing and we immediately know . Thus it is sufficient to only encode . Finally, we need to encode two permutations for the two sets. The number of bits required for the two permutations is bounded by:
To encode the set we use Lemma 2.6 where and . We get that
Summing the two expressions we get that
Encoding instead of and using an identical argument, we get a similar bound for . ∎
Note that in the above if is too small or too large, we are unable to get a meaningful compression. Let us focus our attention on a sub-interval of where we get a meaningful bound. Let us denote by the start and finish indices for this interval. We state the following Corollary which completes the proof of the first case.
Corollary 3.2.
If it holds for some epoch , and some that , then .
Proof.
Assume w.l.o.g that . Note that it holds that . This is because we can write for some . Thus if then either or . We apply Claim 3.1 for either or with parameter . Noting that we get that . Putting everything together we get: ∎
Generalizing Assumption 2
The assumption is only required for this case of the proof, and the proofs of Claim 3.1 and Corollary 3.2 still go through even if we only have a sufficiently good approximation to . That is, there exists a function , which achieves a sufficiently close approximation to the accuracy of on and (say up to an additive factor). Thus, we only need the existence of a compressible function which approximates on and .
Case 2: is small
Let us now consider the second case, where is small throughout the epoch (more precisely, when ). Roughly speaking, we would like to get an effect similar to that seen in the introduction, when we assumed we can achieve 100% accuracy for every batch. The reason we managed to achieve compression in that case was that we could use the model to reduce the set of potential elements the batch can be taken from when going backwards and reconstructing the batches. We will show that we can achieve a similar effect here. Assume we have the entire dataset, , and the model which resulted from a GD step on the -th batch. The argument is inductive, thus let us assume that we have reconstructed all batches . This means we also know . Knowing and , we immediately know , the set of all elements which can appear in . We aim to show that the local progress assumption (Assumption 1) implies that (on average) achieves a slightly higher accuracy on than on , which will allow us to achieve a meaningful compression.
Let us first show that if the batch is large enough, the accuracy of on is sufficiently close to (the accuracy of on ). We state the following claim:
Claim 3.3.
It holds w.h.p that:
Proof.
First let us note that if , then the claim holds trivially: . So for the rest of the proof let us assume that .
Let us denote by , the set of binary indicator variables indicating whether the -th element in was classified correctly by . As stated before, we can say that elements in are chosen uniformly at random without repetitions from . Furthermore, it is important to note that is independent of . We can imagine that is selected after is determined. We can write . Applying a Hoeffding bound with parameters , we get that:
According to Assumption 3 it holds that . Thus, we get that:
We take a union bound over all bad events (at most ) to get that with probability at least none of the bad events occur throughout the epochs of the execution.
∎
That is, w.h.p the accuracy of on is at least its accuracy on , up to an additive factor. Next we use the fact that is small together with the claim above to show that the accuracy of on is also not too far from the accuracy of on . Finally, using the fact that on average has a sufficiently larger accuracy on compared to we prove the following claim. Note that this is an auxiliary claim, and the quantity will appear later as the number of bits we manage to save per epoch, by using the model at every epoch to encode the batches.
Claim 3.4.
If then w.h.p: .
Proof.
Recall that . This is the accuracy of the model on the -th batch in the -th epoch after performing the GD step on the batch. Let us consider the following inequality:
Where in the last line we use Assumption 1 together with the fact that and
Combining everything together we get that:
Let us use the above to derive a lower bound for . Let us write:
Where in the second to last transition we use the assumption that , and in the final transition we assume is sufficiently large. Taking the square of both sides we get that: .
Finally, applying the Cauchy–Schwarz inequality, we get that:
∎
To complete the proof for the second case we use the above claim to show that we can significantly compress all batches; this, in turn, will result in a compression of the entire permutation. We state the following claim.
Claim 3.5.
It holds that if then w.h.p:
Proof.
Recall that is the -th batch in the -th epoch, and let be a permutation of such that the order of the elements in under is the same as under . Note that given , if we know the partition into batches and all permutations, we can reconstruct . According to Theorem 2.2, given and we can compute . Let us denote by the encoding of this procedure. To implement we need to iterate over all possible vectors in and over batch elements to compute the gradients. To express this program we require auxiliary variables of size at most . Thus it holds that . Let us abbreviate as . We write the following.
Let us bound by repeatedly using the subadditivity and extra information properties of Kolmogorov complexity.
Where in the transitions we used the fact that given and we can retrieve . That is, we can always bound by .
To encode the order inside each batch, bits are sufficient. Finally we get that: .
Let us now bound . Knowing we know that . Thus we need to use to compress . Applying Lemma 2.6 with parameters and , we get the following:
Adding to the above, we get the following bound on every element in the sum:
Where in the second line we use Lemma 2.4. Note that the most important term in the sum is . That is, the more the accuracy of on the batch, , differs from the accuracy of on the set of elements containing the batch, , the more efficiently we can represent the batch. Let us now bound the sum: . Let us first bound the sum over :
Applying Claim 3.4 we get that . Finally we can write that:
Where because . ∎
This completes the analysis of the second case. As for every epoch either case 1 or case 2 holds, we can conclude that we always manage to compress . Combining the claim above with Corollary 3.2 we get:
Corollary 3.6.
It holds w.h.p that .
Proof.
We use the fact that for every epoch in which the algorithm does not terminate we manage to significantly compress , together with the fact that is incompressible, to bound . We state the following claim:
Claim 3.7.
It holds w.h.p that .
Proof.
Similarly to the definition of in Claim 3.5, let be the program which receives as input and repeatedly applies Theorem 2.2 to retrieve . As just needs to reconstruct all batches from and call for times, it holds that . Using the subadditivity and extra information properties of , together with the fact that can be reconstructed given , we write the following:
First we note that: . Where in the last inequality we simply execute on to get . Let us write:
Where the last inequality is due to Corollary 3.6. Combining everything together we get that: .
Our proof implies that we can reconstruct not only , but also using . Due to the incompressibility of random strings, we get that w.h.p . Combining the lower and upper bound for we can get the following inequality, using which we can bound .
Assumptions 1 and 2 need not always hold
Let us consider the case where both Assumption 1 and Assumption 2 only hold for some fraction of the epochs that we analyze. Let us denote by the set of all good epochs where the assumptions hold (the rest of the epochs are bad epochs). This means that we can no longer achieve compression for bad epochs. However, for every bad epoch , we can simply encode directly. Given we can reconstruct all of the batches, and retrieve from using Theorem 2.2. We can write that:
This allows us to bound as follows:
Note that also appears in the batch size. By Assumption 5, we know that , which in turn means that is a sufficient condition for Assumption 5 to hold. We state our main theorem for this section:
Theorem 3.8.
If assumptions (3)-(5) hold throughout the execution of SGD and assumptions (1)-(2) hold for an -fraction of the epochs, then SGD achieves an accuracy of on in epochs w.h.p.
3.1 Connection between and
We note that the local progress assumption does not make sense when the accuracy is too close to 1. That is, if the accuracy is too high, there is no room for improvement. We formalize this intuition in the following claim:
Claim 3.9.
If for some it holds that , then .
Proof.
Recall the local progress assumption: . Applying Claim 3.3 we know that w.h.p . So we can write:
It holds that . Let us now bound using . We can write the following for :
Plugging the above into the first inequality we get:
Finally we get . ∎
Taking the above into consideration, we parameterize via and write . This parameterization is correct in the sense that it allows the accuracy parameter to appear in our running time bound.
References
- Moser [2009] Robin A. Moser. A constructive proof of the Lovász local lemma. In STOC, pages 343–350. ACM, 2009.
- Moser and Tardos [2010] Robin A. Moser and Gábor Tardos. A constructive proof of the general Lovász local lemma. J. ACM, 57(2):11:1–11:15, 2010.
- Du et al. [2019] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 1675–1685. PMLR, 2019.
- Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 242–252. PMLR, 2019.
- Zou et al. [2018] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. CoRR, abs/1811.08888, 2018.
- Zou and Gu [2019] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In NeurIPS, pages 2053–2062, 2019.
- Mingard et al. [2021] Chris Mingard, Guillermo Valle Pérez, Joar Skalse, and Ard A. Louis. Is SGD a Bayesian sampler? Well, almost. J. Mach. Learn. Res., 22:79:1–79:64, 2021.
- Pérez et al. [2019] Guillermo Valle Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. In ICLR (Poster), 2019.
- O’Neill [2020] James O’Neill. An overview of neural network compression. CoRR, abs/2006.03669, 2020.
- Goodfellow et al. [2016] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016.
- Tao [2009] Terence Tao. Moser’s entropy compression argument. https://terrytao.wordpress.com/2009/08/05/mosers-entropy-compression-argument, 2009.
- Fortnow [2009] Lance Fortnow. A Kolmogorov complexity proof of the Lovász local lemma. http://blog.computationalcomplexity.org/2009/06/kolmogorov-complexity-proof-of-lov.html, 2009.
- Noy et al. [2021] Asaf Noy, Yi Xu, Yonathan Aflalo, Lihi Zelnik-Manor, and Rong Jin. A convergence theory towards practical over-parameterized deep neural networks. CoRR, abs/2101.04243, 2021.
- Arora et al. [2019] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In ICLR (Poster). OpenReview.net, 2019.
- Arora et al. [2018] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 244–253. PMLR, 2018.
- Safran and Shamir [2018] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 4430–4438. PMLR, 2018.
- Du and Lee [2018] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 1328–1337. PMLR, 2018.
- Oymak and Soltanolkotabi [2019] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. CoRR, abs/1902.04674, 2019.
- Jacot et al. [2018] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, pages 8580–8589, 2018.
- Nguyen and Mondelli [2020] Quynh Nguyen and Marco Mondelli. Global convergence of deep networks with one wide layer followed by pyramidal topology. In NeurIPS, 2020.
- Nguyen [2021] Quynh Nguyen. On the proof of global convergence of gradient descent for deep ReLU networks with linear widths. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8056–8062. PMLR, 2021.
- Li and Vitányi [2019] Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications, 4th Edition. Texts in Computer Science. Springer, 2019.
- Hoeffding [1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- Chvátal [1979] Vasek Chvátal. The tail of the hypergeometric distribution. Discrete Mathematics, 25(3):285–287, 1979.