
SGD Through the Lens of Kolmogorov Complexity

Gregory Schwartzman
JAIST
[email protected]
Abstract

We prove that stochastic gradient descent (SGD) finds a solution that achieves (1-\epsilon) classification accuracy on the entire dataset. We do so under two main assumptions: (1. Local progress) The model accuracy improves on average over batches. (2. Models compute simple functions) The function computed by the model is simple (has low Kolmogorov complexity). It is sufficient that these assumptions hold only for a tiny fraction of the epochs.

Intuitively, the above means that intermittent local progress of SGD implies global progress. Assumption 2 trivially holds for underparameterized models; hence, our work gives the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result which is completely model agnostic: we do not require the model to have any specific architecture or activation function, and it may not even be a neural network. Our analysis makes use of the entropy compression method, which was first introduced by Moser and Tardos [1; 2] in the context of the Lovász local lemma.

1 Introduction

Stochastic gradient descent (SGD) is at the heart of modern machine learning. However, we are still lacking a theoretical framework that explains its performance for general, non-convex functions. There are known convergence guarantees for overparameterized neural networks [3; 4; 5; 6]. These works show that if the number of network parameters is considerably larger than the number of data points, then the network is guaranteed to converge to a (near) global optimum. The main drawback of these results is the requirement of exceedingly large (and wide) models, where the size is a large polynomial in n (the size of the dataset). In other words, the analysis is dependent on the structure of the model. In this work, we take a different approach and show convergence guarantees based on the non-structural properties of the model.

Our two main assumptions are: (1) The model accuracy improves on average over batches following the GD step. That is, if we look at the difference: (batch accuracy after GD - batch accuracy before GD), averaged over an epoch, then it is greater than some positive parameter. (2) The function computed by the model is simple (has Kolmogorov complexity sufficiently smaller than n). Note that this is different than saying that the model is simple. It might be the case that a very complicated model computes an extremely simple function. A classical example is a high degree polynomial over a finite field that always evaluates to 0.

For our convergence guarantees to hold, we only require the above assumptions to hold for some tiny fraction of the epochs. Note that Assumption 2 trivially holds when the model is underparameterized (the number of model parameters is sufficiently smaller than n). To the best of our knowledge, this is the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result which is completely model agnostic. That is, we don't require the model to have any specific architecture or activation function. It may not even be a neural network. Finally, our results answer an intriguing theoretical question: is it possible for SGD to achieve local progress (improvement over batches), but no global progress (good accuracy for the entire dataset)? Intuitively, one might think that this is indeed a situation that might occur. However, we show that (under Assumption 2) this cannot be the case.

Overview of our results

We consider batch SGD, where the model is initialized randomly and the dataset is shuffled once at the beginning of each epoch and then divided into batches. Note that this paper does not deal with the generalization abilities of the model, thus the dataset is always the training set. In each epoch, the algorithm goes over the batches one by one, and performs gradient descent to update the model. This is the "vanilla" version of SGD, without any acceleration or regularization (for a formal definition see Section 2). Furthermore, for the sake of analysis, we add a termination condition after every GD step: if the accuracy on the entire dataset is at least 1-\epsilon we terminate. Thus in our case termination implies good accuracy.

We show that under reasonable assumptions, local progress of SGD implies global progress. Let \epsilon\in(0,1) be an accuracy parameter of our choice and let X be the dataset (every element in X consists of a data point, a label, and some unique ID), where |X|=n. Let K(X) be the Kolmogorov complexity of X (the shortest Turing machine which outputs X). We can also define the Kolmogorov complexity of a function as the shortest Turing machine which computes the function (see Section 2 for a formal definition). The assumptions we make are formally stated at the beginning of Section 3. Below is an intuitive outline:

1. After every GD step the batch accuracy is improved by at least an additive \epsilon\beta' factor (on average, per epoch), where \beta' is a local progress parameter. That is, fix some epoch and let y_{i} be the model accuracy on the i-th batch before the GD step and let y'_{i} be the model accuracy after the GD step; then we require that \frac{1}{t}\sum_{i=1}^{t}(y'_{i}-y_{i})\geq\epsilon\beta', where t is the number of batches in an epoch.

2. The Kolmogorov complexity of the function computed by the model is at most (\epsilon\beta')^{3}n/512. In other words, either the model is underparameterized, or the function computed by the model is sufficiently simple.

3. The loss function is differentiable and L-smooth. This must also hold when taking numerical stability into account.

4. The batch size is \Omega((\beta'\epsilon)^{-2}\log^{2}(n/(\beta'\epsilon))). Note, the dependence on n is only polylogarithmic.

5. The size of every element in X is at most polynomial in n. For example, if X contains images, they are not excessively large compared to n. Furthermore, the number of model parameters is at most polynomial in n (this is not implied by Assumption 2, as the model can be huge but still compute a simple function).

Roughly speaking, we show that if n is sufficiently large, assumptions 3 to 5 always hold, and assumptions 1 and 2 only hold for a \gamma-fraction of the epochs (\gamma\in(0,1] need not be a constant), then SGD must achieve an accuracy of at least (1-\epsilon) on the entire dataset within O(\frac{K(X)}{\gamma(\beta'\epsilon)^{3}n}) epochs with high probability (w.h.p.). (W.h.p. is usually taken to mean probability at least 1-1/n^{c} for some constant c>1; in our case c\in(1,2), but this can be amplified arbitrarily by increasing the batch size by a multiplicative constant.) Note that the number of epochs does not depend on the size of the model. Finally, note that when the accuracy is too high the local progress assumption cannot hold anymore (we discuss this further in Section 3.1), and we cannot guarantee progress past this point.

Let us briefly discuss our assumptions. Assumption 1 is quite natural, as the GD step attempts to minimize the loss function, which acts as a proxy for the accuracy on the data. However, Assumption 2 seems at odds with the fact that we assume the model to be initialized randomly. Intuitively, one might believe that a random initialization surely implies that Assumption 2 does not hold. While the model itself indeed has high Kolmogorov complexity in this case, this is not necessarily true for the function computed by the model. Recent literature suggests that randomly initialized models are biased towards simple functions rather than complicated ones (see Section 7.3 in [7] for a survey of recent work) and that the function learned by the network has low Kolmogorov complexity [8]. Furthermore, our results also generalize if we allow lossy compression. There is a vast literature on model compression (see [9] for a survey of recent literature), where the size of the model can be significantly compressed while preserving its accuracy. This further supports the assumption.

Note that assumptions (3)-(5) are rather mild. Assumption 3 (or some variation of it) is standard in the literature [3; 4; 5; 6], and discussions regarding numerical stability are generally omitted. Due to our use of Kolmogorov complexity we need to explicitly state that the L-smoothness assumption holds even when we consider some subset of the real numbers (i.e., floats). We discuss this in more detail in Section 2. In Assumption 4 we only require a polylogarithmic dependence on n, which is on par with batch sizes used in practice. Furthermore, it is well observed that when the batch size is too small SGD suffers from instability [10]. Finally, Assumption 5 is extremely mild, as our dependence on the size of the elements/model is only logarithmic when we apply this assumption. That is, even elements/models as large as n^{1000} leave our analysis unchanged asymptotically.

Outline of our techniques

To achieve our results we make use of entropy compression, first considered by Moser and Tardos [1; 2] to prove a constructive version of the Lovász local lemma (the name "entropy compression" was coined by Terence Tao [11]). Roughly speaking, the entropy compression argument allows one to bound the running time of a randomized algorithm (we require that the number of random bits used is proportional to the execution time of the algorithm; that is, the algorithm flips coins for every iteration of a loop, rather than just a constant number at the beginning of the execution) by leveraging the fact that a random string of bits (the randomness used by the algorithm) is computationally incompressible (has high Kolmogorov complexity) w.h.p. If one can show that throughout the execution of the algorithm, it (implicitly) compresses the randomness it uses, then one can bound the number of iterations the algorithm may execute without terminating. To show that the algorithm has such a property, one would usually consider the algorithm after executing t iterations, and would try to show that just by looking at an "execution log" of the algorithm and some set of "hints", whose size together is considerably smaller than the number of random bits used by the algorithm, it is possible to reconstruct all of the random bits used by the algorithm. We highly encourage readers unfamiliar with this argument to take a quick look at [12], which provides an excellent 1-page example.

We apply this approach to SGD with an added termination condition when the accuracy over the entire dataset is sufficiently high; thus termination in our case guarantees good accuracy. The randomness we compress is the bits required to represent the random permutation of the data at every epoch. So indeed the longer SGD executes, the more random bits are generated. We show that under our assumptions it is possible to reconstruct these bits efficiently starting from the dataset X and the model after executing t epochs. The first step in allowing us to reconstruct the random bits of the permutation in each epoch is to show that under our L-smoothness assumption and a sufficiently small GD step, SGD is reversible. That is, if we are given a model W_{i+1} and a batch B_{i} such that W_{i+1} results from taking a gradient step with model W_{i} where the loss is calculated with respect to B_{i}, then we can uniquely retrieve W_{i} using only B_{i} and W_{i+1}. This means that if we can efficiently encode the batches used in every epoch (i.e., using fewer bits than encoding the entire permutation of the data), we can also retrieve all intermediate models in that epoch (at no additional cost). We prove this claim in Section 2.

The crux of this paper is to show that our assumptions imply that at every epoch the batches can indeed be compressed. Let us consider two toy examples that provide intuition for our analysis. First, consider the scenario where, for every epoch, exactly in the "middle" of the epoch (after considering half of the batches for the epoch), our model achieves 100% accuracy on the set of all elements already considered so far during the epoch (elements on the "left") and 0% accuracy for the remaining elements (elements on the "right"). Let us consider a single epoch of this execution. We can express the permutation of the dataset by encoding two sets (for elements on the left and right), and two permutations for these sets. Given the dataset X and the function computed by the model at the middle of the epoch, we immediately know which elements are on the left and which are on the right by seeing if they are classified correctly or not by the model (recall that X contains data points and labels and that elements have unique IDs). Thus it is sufficient to encode only two permutations, each over n/2 elements. While naively encoding the permutation for the entire epoch requires \lceil\log(n!)\rceil bits, we manage to make do with 2\lceil\log((n/2)!)\rceil. Using Stirling's approximation (\log(n!)=n\log(n/e)+O(\log n)) and omitting logarithmic factors, we get that \log(n!)-2\log((n/2)!)\approx n\log(n/e)-n\log(n/2e)=n. Thus, we can save a linear number of bits using this encoding. To achieve the above encoding, we need the function computed by the model in the middle of the epoch. Assumption 2 guarantees that we can efficiently encode this function.
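As a quick numerical illustration of this saving (our own arithmetic check, with an arbitrary choice of n, not a value from the paper):

import math

def log2_factorial(m: int) -> float:
    # log2(m!) computed via lgamma for numerical stability
    return math.lgamma(m + 1) / math.log(2)

n = 10**6
naive = log2_factorial(n)            # bits to encode one permutation of n elements
split = 2 * log2_factorial(n // 2)   # bits to encode two permutations of n/2 elements
print(f"bits saved: {naive - split:,.0f}   (compare to n = {n:,})")

The saving printed is roughly n bits, matching \log(n!)-2\log((n/2)!)\approx n.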

Second, let us consider the scenario where, in every epoch, just after a single GD step on a batch we consistently achieve 100% accuracy on the batch. Furthermore, we terminate our algorithm when we achieve 90% accuracy on the entire dataset. Let us consider some epoch of our execution, assume we have access to X, and let W_{f} be the model at the end of the epoch. Our goal is to retrieve the last batch of the epoch, B_{f}\subset X (without knowing the permutation of the data for the epoch). A naive approach would be to simply encode the indices in X of the elements in the batch (we can sort X by IDs). However, we can use W_{f} to achieve a more efficient encoding. Specifically, we know that W_{f} achieves 100% accuracy on B_{f} but only 90% accuracy on X. Thus it is sufficient to encode the elements of B_{f} using a smaller subset of X (the elements classified correctly by W_{f}, which has size at most 0.9|X|). This allows us to significantly compress B_{f}. Next, we can use B_{f} and W_{f} together with the reversibility of SGD to retrieve W_{f-1}. We can now repeat the above argument to compress B_{f-1} and so on, until we are able to reconstruct all of the random bits used to generate the permutation of X in the epoch. This will again result in a linear reduction in the number of bits required for the encoding.
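Again as a rough numerical illustration (n and b are arbitrary values of our choosing): encoding a batch against the correctly classified 90% of X, rather than against all of X, saves roughly b\log(10/9) bits per batch, which adds up to a linear-in-n saving over an epoch.

import math

def log2_binom(m: int, k: int) -> float:
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)) / math.log(2)

n, b = 10**6, 1000
all_of_X = log2_binom(n, b)                 # encode the batch as a subset of all of X
correct_only = log2_binom(9 * n // 10, b)   # encode it as a subset of the 90% classified correctly
saved = all_of_X - correct_only
print(f"bits saved per batch: {saved:.1f}, over n/b batches: {saved * n / b:,.0f}")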

In our analysis, we show generalized versions of the two scenarios above. First, we show that if at any epoch the accuracy of the model on the left and right elements differs significantly (we need not split the data down the middle anymore), we manage to achieve a meaningful compression of the randomness. Next, we show that if this scenario does not happen, then together with our local progress assumption, we manage to achieve slightly better accuracy on the batch we are trying to retrieve (compared to the set of available elements). This results in a compression similar to the second scenario. Either way, we manage to save a significant (linear in n) number of bits for every epoch, which allows us to use the entropy compression argument to bound our running time.

Related work

An excellent survey of recent work is given in [13], which we follow below. There has been a long line of research proving convergence bounds for SGD under various simplifying assumptions such as: linear networks [14; 15], shallow networks [16; 17; 18], etc. However, the most relevant results for this paper are the ones dealing with deep, overparameterized networks [3; 4; 5; 6]. All of these works make use of the NTK (Neural Tangent Kernel) [19] and show convergence guarantees for SGD when the hidden layers have width at least poly(n,L), where n is the size of the dataset and L is the depth of the network. We note that the exponents of the polynomials are quite large. For example, assuming a random weight initialization, [4] requires width at least n^{24}L^{12} and converges to a model with an \epsilon error in O(n^{6}L^{2}\log(1/\epsilon)) steps w.h.p. The current best parameters are due to [6], which, under random weight initialization, requires width at least n^{8}L^{12} and converges to a model with an \epsilon error in O(n^{2}L^{2}\log(1/\epsilon)) steps w.h.p. Recent works [20; 21; 13] achieve a much improved width dependence (linear/quadratic) and running times; however, they require additional assumptions which make their results incomparable with the above. Specifically, in [20; 21] a special (non-standard) weight initialization is required, and in [13] convergence is shown for a non-standard activation unit.

Our results are not directly comparable with the above, mainly due to assumptions 1 and 2. However, if one accepts our assumptions, our analysis has the benefit of not requiring any specific network architecture. The convergence times are also essentially incomparable due to the dependence on K(X). That is, K(X) can be very small if the elements of X are small and similar to each other, or very large if they are large and contain noise (e.g., high definition images with noise). To the best of our knowledge, this is the first (conditional) global convergence guarantee for SGD for general, non-convex, underparameterized models.

Structure of the paper

We first present some preliminary definitions in Section 2 followed by our main results in Section 3.

2 Preliminaries

We consider the following optimization problem. We are given an input (dataset) of size n. Let us denote X=\{x_{i}\}_{i=1}^{n} (our inputs contain both data and labels; we do not need to distinguish them for this work). We also associate every x\in X with a unique ID of \lceil\log n\rceil bits. We often consider batches of the input B\subset X. The size of the batch is denoted by b (all batches have the same size). We have some model whose parameters are denoted by W\in\mathbb{R}^{d}, where d is the model dimension. We aim to optimize a goal function of the following type: f(W)=\frac{1}{n}\sum_{x\in X}f_{x}(W), where the functions f_{x}:\mathbb{R}^{d}\rightarrow\mathbb{R} are completely determined by x\in X. We also define for every set A\subseteq X: f_{A}(W)=\frac{1}{|A|}\sum_{x\in A}f_{x}(W). Note that f_{X}=f.

We denote by acc(W,A):\mathbb{R}^{d}\times 2^{X}\rightarrow[0,1] the accuracy of model W on the set A\subseteq X (where we use W to classify elements from X). Note that for x\in X it holds that acc(W,x) is a binary value indicating whether x is classified correctly or not. We require that every f_{x} is differentiable and L-smooth: \forall W_{1},W_{2}\in\mathbb{R}^{d},\|\nabla f_{x}(W_{1})-\nabla f_{x}(W_{2})\|\leq L\|W_{1}-W_{2}\|. This implies that every f_{A} is also differentiable and L-smooth. To see this, consider the following:

\|\nabla f_{A}(W_{1})-\nabla f_{A}(W_{2})\|=\|\frac{1}{|A|}\sum_{x\in A}\nabla f_{x}(W_{1})-\frac{1}{|A|}\sum_{x\in A}\nabla f_{x}(W_{2})\|
=\frac{1}{|A|}\|\sum_{x\in A}(\nabla f_{x}(W_{1})-\nabla f_{x}(W_{2}))\|\leq\frac{1}{|A|}\sum_{x\in A}\|\nabla f_{x}(W_{1})-\nabla f_{x}(W_{2})\|\leq L\|W_{1}-W_{2}\|

We state another useful property of f_{A}:

Lemma 2.1.

Let W_{1},W_{2}\in\mathbb{R}^{d} and \alpha<1/L. For any A\subseteq X, if it holds that W_{1}-\alpha\nabla f_{A}(W_{1})=W_{2}-\alpha\nabla f_{A}(W_{2}), then W_{1}=W_{2}.

Proof.

Rearranging the terms we get that W_{1}-W_{2}=\alpha\nabla f_{A}(W_{1})-\alpha\nabla f_{A}(W_{2}). Now let us consider the norm of both sides: \|W_{1}-W_{2}\|=\|\alpha\nabla f_{A}(W_{1})-\alpha\nabla f_{A}(W_{2})\|\leq\alpha L\|W_{1}-W_{2}\|<\|W_{1}-W_{2}\|. Unless W_{1}=W_{2}, the final strict inequality holds, which leads to a contradiction. ∎

The above means that for a sufficiently small gradient step, the gradient descent process is reversible. That is, we can always recover the previous model parameters given the current ones, assuming that the batch is fixed. We use the notion of reversibility throughout this paper. However, in practice we only have finite precision, thus instead of \mathbb{R} we work with the finite set \mathbb{F}\subset\mathbb{R}. Furthermore, due to numerical stability issues, we do not have access to exact gradients, but only to approximate values \widetilde{\nabla f_{A}}. For the rest of this paper, we assume these values are L-smooth on all elements in \mathbb{F}^{d}. That is,

\forall W_{1},W_{2}\in\mathbb{F}^{d},A\subseteq X:\quad\|\widetilde{\nabla f_{A}}(W_{1})-\widetilde{\nabla f_{A}}(W_{2})\|\leq L\|W_{1}-W_{2}\|

This immediately implies that Lemma 2.1 holds even when precision is limited. Let us state the following theorem:

Theorem 2.2.

Let W_{1},W_{2},...,W_{k}\in\mathbb{F}^{d}\subset\mathbb{R}^{d}, A_{1},A_{2},...,A_{k}\subseteq X and \alpha<1/L. If it holds that W_{i}=W_{i-1}-\alpha\widetilde{\nabla f_{A_{i-1}}}(W_{i-1}), then given A_{1},A_{2},...,A_{k-1} and W_{k} we can retrieve W_{1}.

Proof.

Given W_{k} we iterate over all W\in\mathbb{F}^{d} until we find W such that W_{k}=W-\alpha\widetilde{\nabla f_{A_{k-1}}}(W). Using Lemma 2.1, there is only a single element for which this equality holds, and thus W=W_{k-1}. We repeat this process until we retrieve W_{1}. ∎

We note that the assumption that L-smoothness holds for the approximate gradients on all elements in \mathbb{F}^{d} might seem a bit strong. However, if we assume that the gradients have norm bounded by a parameter G (i.e., f_{A} is G-Lipschitz), then it holds that \|W_{i}-W_{i-1}\|\leq\alpha G. This means that in Theorem 2.2, we only need to go over elements that are close to W_{i} when retrieving W_{i-1}. This in turn means that we can require that the Lipschitz condition for \widetilde{\nabla f_{A_{i-1}}} only holds around W_{i}.
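The following toy Python sketch (ours, not the paper's construction) illustrates the reversibility argument on a one-dimensional stand-in for \mathbb{F}^{d}: with \alpha<1/L the gradient step is injective, so the previous model can be recovered by brute-force search over the finite parameter set, as in the proof of Theorem 2.2.

# Toy illustration: a gradient step with alpha < 1/L on a fixed batch is injective,
# so the previous parameters can be recovered by scanning a finite parameter set F.
L, alpha = 1.0, 0.25                      # smoothness constant and step size, alpha < 1/L
F = [i / 256 for i in range(257)]         # a tiny stand-in for the finite float set F^d

def grad(w):                              # gradient of the L-smooth loss 0.5*L*(w - 0.5)^2
    return L * (w - 0.5)

def gd_step(w):
    return w - alpha * grad(w)

w_prev = 0.8125                           # some model in F
w_next = gd_step(w_prev)

# Reverse the step by brute force: exactly one w in F maps to w_next.
candidates = [w for w in F if gd_step(w) == w_next]
assert candidates == [w_prev]
print("recovered previous model:", candidates[0])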

SGD

We analyze the classic SGD algorithm presented in Algorithm 1. One difference to note in our algorithm, compared to the standard implementation, is the termination condition when the accuracy on the dataset is sufficiently high. In practice the termination condition is not used; we use it only to prove that at some point in time the accuracy of the model is indeed sufficiently large. It is also possible to check this condition on a sample of a logarithmic number of elements of X and achieve a good approximation of the accuracy w.h.p.

i\leftarrow 1 // epoch counter
W_{1,1} is an initial model
while True do
    Take a random permutation of X, divided into batches \{B_{i,j}\}_{j=1}^{n/b}
    for j from 1 to n/b do
        if acc(W_{i,j},X)\geq(1-\epsilon) then Return W_{i,j}
        W_{i,j+1}\leftarrow W_{i,j}-\alpha\nabla f_{B_{i,j}}(W_{i,j})
    i\leftarrow i+1, W_{i,1}\leftarrow W_{i-1,n/b+1}

Algorithm 1 SGD
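For concreteness, here is a minimal Python sketch of the loop in Algorithm 1. The functions accuracy and grad_batch are assumed black boxes supplied by the model (hypothetical helpers, not part of the paper), and we assume b divides n.

import random

def sgd_with_termination(X, W, b, alpha, eps, accuracy, grad_batch):
    # X: training set (list), W: initial model parameters (list of floats),
    # accuracy(W, S) and grad_batch(W, B) are assumed black boxes.
    n = len(X)
    i = 1                                   # epoch counter
    while True:
        perm = random.sample(list(X), n)    # random permutation of X
        batches = [perm[j:j + b] for j in range(0, n, b)]
        for B in batches:
            if accuracy(W, X) >= 1 - eps:   # termination: global accuracy reached
                return W, i
            W = [w - alpha * g for w, g in zip(W, grad_batch(W, B))]
        i += 1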

Kolmogorov complexity

The Kolmogorov complexity of a string x\in\{0,1\}^{*}, denoted by K(x), is defined as the size of the smallest prefix Turing machine which outputs this string. We note that this definition depends on which encoding of Turing machines we use. However, one can show that this will only change the Kolmogorov complexity by a constant factor [22]. Random strings have the following useful property [22]:

Theorem 2.3.

For an n-bit string x chosen uniformly at random and any c\in\mathbb{N}, it holds that Pr[K(x)\geq n-c]\geq 1-1/2^{c}.

We also use the notion of conditional Kolmogorov complexity, denoted by K(x\mid y). This is the length of the shortest prefix Turing machine which gets y as an auxiliary input and prints x. Note that the length of y does not count towards the size of the machine which outputs x. So it can be the case that |x|\ll|y| but it holds that K(x\mid y)<K(x). We can also consider the Kolmogorov complexity of functions. Let g:\{0,1\}^{*}\rightarrow\{0,1\}^{*}; then K(g) is the size of the smallest Turing machine which computes the function g.

The following properties of Kolmogorov complexity will be of use [22]. Let x,y,z be three strings:

  • Extra information: K(x\mid y,z)\leq K(x\mid z)+O(1)\leq K(x,y\mid z)+O(1)

  • Subadditivity: K(xy\mid z)\leq K(x\mid z,y)+K(y\mid z)+O(1)\leq K(x\mid z)+K(y\mid z)+O(1)

Throughout this paper we often consider the value K(A) where A is a set. Here the program computing A need only output the elements of A (in any order). When considering K(A\mid B) such that A\subseteq B, it holds that K(A\mid B)\leq\lceil\log\binom{|B|}{|A|}\rceil+O(\log|B|). To see why, consider Algorithm 2.

m_{A}\leftarrow|A|, m_{B}\leftarrow|B|, i\leftarrow 0, i_{A} is a target index
for every subset C\subseteq B s.t. |C|=m_{A} (in a predetermined order) do
    if i=i_{A} then Print C
    i\leftarrow i+1

Algorithm 2 Compute A given B as input

In the algorithm, i_{A} is the index of A when considering some ordering of all subsets of B of size |A|. Thus \lceil\log\binom{|B|}{|A|}\rceil bits are sufficient to represent i_{A}. The remaining variables i,m_{A},m_{B} and any additional variables required to construct the set C are all of size at most O(\log|B|), and there is at most a constant number of them.
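A small Python sketch of this encoding scheme (illustrative, not the paper's code): the subset A is represented by its index among all |A|-subsets of B in a fixed lexicographic order, and decoding proceeds exactly as Algorithm 2 does.

from itertools import combinations

def encode_subset(A, B):
    # Index of A among all |A|-subsets of B, in a fixed (lexicographic) order.
    A = tuple(sorted(A))
    for i, C in enumerate(combinations(sorted(B), len(A))):
        if C == A:
            return i

def decode_subset(i_A, m_A, B):
    # Algorithm 2: recover A from B and the target index i_A.
    for i, C in enumerate(combinations(sorted(B), m_A)):
        if i == i_A:
            return set(C)

B = set(range(10))
A = {1, 4, 7}
idx = encode_subset(A, B)                  # fits in ceil(log2 C(10, 3)) = 7 bits
assert decode_subset(idx, len(A), B) == A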

During our analysis, we often bound the Kolmogorov complexity of tuples of objects. For example, K(A,P\mid B) where A\subseteq B is a set and P:A\rightarrow[|A|] is a permutation of A (note that A,P together form an ordered tuple of the elements of A). Instead of explicitly presenting a program such as Algorithm 2, we say that if K(A\mid B)\leq c_{1} and c_{2} bits are sufficient to represent P, then K(A,P\mid B)\leq c_{1}+c_{2}+O(1). This just means that we directly encode P as a variable in the program that computes A given B and use it in the code. For example, we can add a permutation to Algorithm 2 and output an ordered tuple of elements rather than a set. Note that when representing a permutation of A, |A|=k, instead of using functions we can just talk about values representable in \lceil\log k!\rceil bits. That is, we can decide on some predetermined ordering of all permutations of k elements, and represent a permutation as its number in this ordering.
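As a concrete illustration (ours, not from the paper), such a numbering of permutations can be computed via the Lehmer code, so an integer in [k!] (i.e., \lceil\log k!\rceil bits) suffices:

from math import factorial

def rank_permutation(perm):
    # Lehmer-code rank of a permutation of 0..k-1; fits in ceil(log2(k!)) bits.
    k, rank = len(perm), 0
    items = list(range(k))
    for i, p in enumerate(perm):
        rank += items.index(p) * factorial(k - 1 - i)
        items.remove(p)
    return rank

def unrank_permutation(rank, k):
    items, perm = list(range(k)), []
    for i in range(k):
        f = factorial(k - 1 - i)
        perm.append(items.pop(rank // f))
        rank %= f
    return perm

p = [2, 0, 3, 1]
assert unrank_permutation(rank_permutation(p), 4) == p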

Entropy and KL-divergence

Our proofs make extensive use of binary entropy and KL-divergence. In what follows we define these concepts and provide some useful properties.

Entropy: For p\in[0,1] we denote by h(p)=-p\log p-(1-p)\log(1-p) the entropy of p. Note that h(0)=h(1)=0.

KL-divergence: For p,q\in[0,1] let D_{KL}(p\;\|\;q)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q} be the Kullback-Leibler divergence (KL-divergence) between two Bernoulli distributions with parameters p and q. We also state the following useful special case of Pinsker's inequality: D_{KL}(p\;\|\;q)\geq 2(p-q)^{2}.
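The following short Python snippet (ours, for intuition) implements these two quantities in bits and numerically spot-checks the stated Pinsker bound; note that measuring the divergence in bits only makes the inequality stronger.

import math, random

def h(p):                 # binary entropy, in bits; h(0) = h(1) = 0
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def d_kl(p, q):           # KL divergence between Bernoulli(p) and Bernoulli(q), in bits
    total = 0.0
    if p > 0:
        total += p * math.log2(p / q)
    if p < 1:
        total += (1 - p) * math.log2((1 - p) / (1 - q))
    return total

# Numerical spot check of Pinsker's inequality D_KL(p || q) >= 2 (p - q)^2.
for _ in range(10000):
    p, q = random.random(), random.uniform(0.01, 0.99)
    assert d_kl(p, q) >= 2 * (p - q) ** 2 - 1e-9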

Lemma 2.4.

For p\in[0,1] it holds that h(p)\leq p\log(e/p).

Proof.

Let us write our claim as:

h(p)=-p\log p-(1-p)\log(1-p)\leq p\log(e/p)

Rearranging we get:

-(1-p)\log(1-p)\leq p\log p+p\log(1/p)+p\log e
\implies-(1-p)\log(1-p)\leq p\log e
\implies-\ln(1-p)\leq\frac{p}{1-p}

Note that -\ln(1-p)=\int_{0}^{p}\frac{1}{1-x}dx\leq p\cdot\frac{1}{1-p}, where in the final transition we use the fact that \frac{1}{1-x} is monotonically increasing on [0,1). This completes the proof. ∎
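A quick numerical spot check of Lemma 2.4 (ours, illustrative only):

import math

# Check h(p) <= p * log2(e / p) on a grid of p in (0, 1).
for i in range(1, 1000):
    p = i / 1000
    hp = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    assert hp <= p * math.log2(math.e / p) + 1e-12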

Lemma 2.5.

For p,\gamma,q\in[0,1] where p\gamma\leq q and (1-p)\gamma\leq(1-q), it holds that

qh(\frac{p\gamma}{q})+(1-q)h(\frac{(1-p)\gamma}{1-q})\leq h(\gamma)-\gamma D_{KL}(p\;\|\;q)
Proof.

Let us expand the left hand side using the definition of entropy:

qh(\frac{p\gamma}{q})+(1-q)h(\frac{(1-p)\gamma}{1-q})
=-q(\frac{p\gamma}{q}\log\frac{p\gamma}{q}+(1-\frac{p\gamma}{q})\log(1-\frac{p\gamma}{q}))
-(1-q)(\frac{(1-p)\gamma}{1-q}\log\frac{(1-p)\gamma}{1-q}+(1-\frac{(1-p)\gamma}{1-q})\log(1-\frac{(1-p)\gamma}{1-q}))
=-(p\gamma\log\frac{p\gamma}{q}+(q-p\gamma)\log\frac{q-p\gamma}{q})
-((1-p)\gamma\log\frac{(1-p)\gamma}{1-q}+((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q})
=-\gamma\log\gamma-\gamma D_{KL}(p\;\|\;q)
-(q-p\gamma)\log\frac{q-p\gamma}{q}-((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q}

Where in the last equality we simply sum the first terms on both lines. To complete the proof we use the log-sum inequality on the last expression. The log-sum inequality states: let \{a_{k}\}_{k=1}^{m},\{b_{k}\}_{k=1}^{m} be non-negative numbers and let a=\sum_{k=1}^{m}a_{k}, b=\sum_{k=1}^{m}b_{k}; then \sum_{k=1}^{m}a_{k}\log\frac{a_{k}}{b_{k}}\geq a\log\frac{a}{b}. We apply the log-sum inequality with m=2, a_{1}=q-p\gamma, a_{2}=(1-q)-(1-p)\gamma, a=1-\gamma and b_{1}=q, b_{2}=1-q, b=1, getting that:

(q-p\gamma)\log\frac{q-p\gamma}{q}+((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q}\geq(1-\gamma)\log(1-\gamma)

Putting everything together we get that

-\gamma\log\gamma-\gamma D_{KL}(p\;\|\;q)
-(q-p\gamma)\log\frac{q-p\gamma}{q}-((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q}
\leq-\gamma\log\gamma-(1-\gamma)\log(1-\gamma)-\gamma D_{KL}(p\;\|\;q)=h(\gamma)-\gamma D_{KL}(p\;\|\;q) ∎

Representing sets

Let us state some useful bounds on the Kolmogorov complexity of sets.

Lemma 2.6.

Let A\subseteq B, |B|=m, |A|=\gamma m, and let g:B\rightarrow\{0,1\}. For any set Y\subseteq B let Y_{1}=\{x\mid x\in Y,g(x)=1\}, Y_{0}=Y\setminus Y_{1} and \kappa_{Y}=\frac{|Y_{1}|}{|Y|}. It holds that

K(A\mid B,g)\leq m(h(\gamma)-2\gamma(\kappa_{B}-\kappa_{A})^{2})+O(\log m)
Proof.

The algorithm is very similar to Algorithm 2; the main difference is that we must first compute B_{1},B_{0} from B using g, and select A_{1},A_{0} from B_{1},B_{0}, respectively, using two indices i_{A_{1}},i_{A_{0}}. Finally, we print A=A_{1}\cup A_{0}. We can now bound the number of bits required to represent i_{A_{1}},i_{A_{0}}. Note that |B_{1}|=\kappa_{B}m and |B_{0}|=(1-\kappa_{B})m, and that for A_{1} we pick \gamma\kappa_{A}m elements from \kappa_{B}m elements, while for A_{0} we pick \gamma(1-\kappa_{A})m elements from (1-\kappa_{B})m elements. The number of bits required to represent this selection is:

\lceil\log\binom{\kappa_{B}m}{\gamma\kappa_{A}m}\rceil+\lceil\log\binom{(1-\kappa_{B})m}{\gamma(1-\kappa_{A})m}\rceil\leq\kappa_{B}mh(\frac{\gamma\kappa_{A}}{\kappa_{B}})+(1-\kappa_{B})mh(\frac{\gamma(1-\kappa_{A})}{1-\kappa_{B}})
\leq m(h(\gamma)-\gamma D_{KL}(\kappa_{A}\;\|\;\kappa_{B}))\leq m(h(\gamma)-2\gamma(\kappa_{B}-\kappa_{A})^{2})

Where in the first inequality we used the fact that \forall 0\leq k\leq n, \log\binom{n}{k}\leq nh(k/n); Lemma 2.5 in the second transition; and the lower bound on the KL-divergence (Pinsker's inequality) in the third transition. The additional O(\log m) factor is due to various counters and variables, similarly to Algorithm 2. ∎
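A quick numerical sanity check of Lemma 2.6 (illustrative parameter values chosen by us): we compare the number of index bits used in the proof against the stated bound, up to the O(\log m) overhead.

import math

def log2_binom(m, k):
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)) / math.log(2)

def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

m, gamma, kB, kA = 1000, 0.1, 0.6, 0.3   # |B|, gamma = |A|/|B|, 1-label fractions in B and in A
bits = log2_binom(round(kB * m), round(gamma * kA * m)) \
     + log2_binom(round((1 - kB) * m), round(gamma * (1 - kA) * m))
bound = m * (h(gamma) - 2 * gamma * (kB - kA) ** 2)
print(f"index bits: {bits:.1f}  <=  lemma bound: {bound:.1f}")
assert bits <= bound + 2                  # matches Lemma 2.6 up to the O(log m) overhead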

Concentration bounds

We state the following Hoeffding bound for sampling without replacement.

Theorem 2.7 (Hoeffding [23; 24]).

Let \chi=(y_{1},...,y_{N}) be a finite population of size N, such that \forall i, y_{i}\in\{0,1\}, and let (Y_{1},...,Y_{k}) be a random sample drawn without replacement from \chi (first Y_{1} is chosen uniformly at random from \chi, then Y_{2} is chosen uniformly at random from \chi\setminus\{Y_{1}\}, and so on). Let \mu=\frac{1}{N}\sum_{y\in\chi}y. For all 0\leq\delta\leq\mu k it holds that: P(\frac{1}{k}\sum_{i=1}^{k}Y_{i}-\mu\leq-\delta)\leq e^{-2k\delta^{2}}.

One important thing to note is that if we have access to \chi as a shuffled tuple (or shuffled array), then taking any subset of k indices is a random sample of size k drawn from \chi without replacement. Furthermore, if we take some sample Y from the shuffled population \chi and then take a second sample Z (already knowing Y) from the remaining set \chi\setminus Y, then Z is sampled without replacement from \chi\setminus Y.
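The following Monte Carlo snippet (a toy 0/1 population of our choosing, not from the paper) illustrates Theorem 2.7 by comparing the empirical lower tail of the sample mean to the Hoeffding bound.

import math, random

N, k, delta = 1000, 100, 0.1
population = [1] * 600 + [0] * 400        # mu = 0.6
mu = sum(population) / N

trials = 20000
bad = sum(1 for _ in range(trials)
          if sum(random.sample(population, k)) / k - mu <= -delta)
print(f"empirical tail: {bad / trials:.4f}   "
      f"Hoeffding bound exp(-2*k*delta^2): {math.exp(-2 * k * delta**2):.4f}")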

3 Convergence guarantees for SGD

In this section, we prove that Algorithm 1 eventually terminates. That is, a (1-\epsilon) accuracy for the entire dataset is achieved. Let us start by defining some notation, followed by formal definitions of our assumptions.

Figure 1: A visual summary of our notations.

Notations

First let us define some useful notations:

  • \lambda_{i,j}=acc(W_{i,j},X). This is the accuracy of the model in epoch i on the entire dataset X, before performing the GD step on batch j.

  • \varphi_{i,j}=acc(W_{i,j},B_{i,j-1}). This is the accuracy of the model on the (j-1)-th batch in the i-th epoch, after performing the GD step on that batch.

  • X_{i,j}=\bigcup_{k=1}^{j}B_{i,k} (note that \forall i, X_{i,0}=\emptyset and X_{i,n/b}=X). This is the set of elements in the first j batches of epoch i. Let us also denote n_{j}=|X_{i,j}|=jb (note that \forall j,i_{1},i_{2}, |X_{i_{1},j}|=|X_{i_{2},j}|, thus i need not appear in the subscript).

  • \lambda'_{i,j}=acc(W_{i,j},X_{i,j-1}), \lambda''_{i,j}=acc(W_{i,j},X\setminus X_{i,j-1}), where \lambda'_{i,j} is the accuracy of the model on the set of all previously seen batch elements, after performing the GD step on the (j-1)-th batch, and \lambda''_{i,j} is the accuracy of the same model on all remaining elements (j-th batch onward). To avoid computing the accuracy on empty sets, \lambda'_{i,j} is defined for j\in[2,n/b+1] and \lambda''_{i,j} is defined for j\in[1,n/b].

Assumptions

In our analysis we consider t epochs of the SGD algorithm, where we bound t later. Let us state the assumptions we make (later we discuss ways to further alleviate these assumptions).

1. Local progress: There exists some constant \beta' such that \forall i\in[t], \frac{b}{n}\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))>\beta'\epsilon. That is, we assume that in every epoch, the accuracy of the model on the batch after the GD step increases by an additive \beta'\epsilon factor on average. Let us denote \beta=\beta'\epsilon (we explain the reason for this parameterization in Section 3.1).

2. Models compute simple functions: \forall i\in[t], j\in[n/b], K(acc(W_{i,j},\bullet)\mid X)\leq\frac{n\beta^{3}}{512} (where acc(W_{i,j},\bullet) is the accuracy function of the model with parameters W_{i,j}). This is actually a slightly weaker condition than bounding the Kolmogorov complexity of the function computed by the model. That is, given X and the function computed by the model we can compute acc(W_{i,j},\bullet), but not the other way around. For simplicity of notation, we assume that W_{i,j} contains not only the model parameters, but also the architecture of the model and the indices i,j.

3. b\geq 8\beta^{-2}\ln^{2}(tn^{3}). That is, the batch must be sufficiently large. After we bound t, we get that b=\Theta(\beta^{-2}\log^{2}(n/\beta)) is sufficient.

4. \forall x\in X, f_{x} is differentiable and L-smooth, even when taking numerical stability into account.

5. K(X)=n^{O(1)}, d=n^{O(1)}. That is, the size of the model and the size of every data point is at most polynomial in n. We note that the exponent can be extremely large, say 1000, and this will only change our bounds by a constant.

Bounding t:

Our goal is to use the entropy compression argument to bound the running time of the algorithm. That is, we show that if our assumptions hold for t consecutive epochs and t is sufficiently large, then our algorithm must terminate. Let r_{i} be the string of random bits representing the random permutation of X at epoch i. As we consider t epochs, let r=r_{1}r_{2}\dots r_{t}. Note that the number of bits required to represent an arbitrary permutation of [n] is given by \lceil\log(n!)\rceil=n\log n-n\log e+O(\log n)=n\log(n/e)+O(\log n), where we used Stirling's approximation. Thus, it holds that |r|=t(n\log(n/e)+O(\log n)) and according to Theorem 2.3, with probability at least 1-1/n^{2} it holds that K(r)\geq tn\log(n/e)-O(\log n).

Our goal for the rest of the section is to show that if our assumptions hold, then in every epoch we manage to compress r_{i}, which in turn results in a compression of r, leading to an upper bound on t due to the incompressibility of r. For the rest of the section, we assume that all of our assumptions hold and that the algorithm does not terminate within the first t epochs. As discussed in the introduction, our proof consists of two cases. For the first case, we show that if at some point during the epoch |\lambda'_{i,j}-\lambda''_{i,j}| is sufficiently large, then r_{i} can be compressed. For the second case, we show that if |\lambda'_{i,j}-\lambda''_{i,j}| is small throughout the epoch, then r_{i} can be compressed by encoding every batch individually.

Case 1: |\lambda'_{i,j}-\lambda''_{i,j}| is large

In the introduction, we considered the case where during the epoch the model achieves 100% accuracy on half the elements and 0% on the other half. Let us show a generalized version of this statement, where instead of considering the model during the middle of the epoch, we consider the model after seeing an \eta-fraction of the data. The statement below holds for an arbitrary difference in the accuracies of the model. Instead of considering |\lambda'_{i,j}-\lambda''_{i,j}| directly, we first show an auxiliary compression claim involving |\lambda'_{i,j}-\lambda_{i,j}| and |\lambda''_{i,j}-\lambda_{i,j}|. This will immediately allow us to derive the desired result for |\lambda'_{i,j}-\lambda''_{i,j}|, and will also be useful later on. Let us state the following claim.

Claim 3.1.

Let i\in[t], j\in[2,n/b-1], a\in\mathbb{R}, \eta=(j-1)b/n. It holds that:

K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq\begin{cases}n(\log(n/e)-2\eta a^{2})+O(\log n),&\text{if }|\lambda'_{i,j}-\lambda_{i,j}|\geq a\\ n(\log(n/e)-2(1-\eta)a^{2})+O(\log n),&\text{if }|\lambda''_{i,j}-\lambda_{i,j}|\geq a\end{cases}
Proof.

To encode r_{i} we encode it as two separate strings, split at the ((j-1)b)-th index. To encode each such string it is sufficient to encode the sets X_{i,j-1} and X\setminus X_{i,j-1}, accompanied by a permutation for each set. As X is given, we can immediately deduce r_{i} given an ordering of the elements of X. Note that knowing X_{i,j-1} and X we immediately know X\setminus X_{i,j-1}. Thus it is sufficient to only encode X_{i,j-1}. Finally, we need to encode two permutations for the two sets. The number of bits required for the two permutations is bounded by:

\lceil\log(\eta n)!\rceil+\lceil\log((1-\eta)n)!\rceil\leq\eta n\log\frac{\eta n}{e}+(1-\eta)n\log\frac{(1-\eta)n}{e}+O(\log n)
=n(\log\frac{n}{e}+\eta\log\eta+(1-\eta)\log(1-\eta))+O(\log n)=n\log\frac{n}{e}-nh(\eta)+O(\log n)

To encode the set X_{i,j-1} we use Lemma 2.6 with A=X_{i,j-1}, B=X and g(x)=acc(W_{i,j},x). We get that

K(X_{i,j-1}\mid X,acc(W_{i,j},\bullet))\leq n(h(\eta)-2\eta(\lambda'_{i,j}-\lambda_{i,j})^{2})+O(\log n)\leq n(h(\eta)-2\eta a^{2})+O(\log n)

Summing the two expressions we get that

K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq n\log(n/e)-2\eta n(\lambda'_{i,j}-\lambda_{i,j})^{2}+O(\log n)\leq n(\log(n/e)-2\eta a^{2})+O(\log n)

Encoding X\setminus X_{i,j-1} instead of X_{i,j-1} and using an identical argument, we get a similar bound for \lambda''_{i,j}. ∎

Note that in the above, if \eta=(j-1)b/n is too small or too large, we are unable to get a meaningful compression. Let us focus our attention on a sub-interval of j\in[2,n/b-1] where we get a meaningful bound. Let us denote by j_{s}=\lceil\beta n/8b\rceil+1, j_{f}=\lfloor(1-\beta/8)n/b\rfloor+1 the start and finish indices for this interval. We state the following corollary, which completes the proof of the first case.

Corollary 3.2.

If it holds for some epoch i and some j\in[j_{s},j_{f}] that |\lambda'_{i,j}-\lambda''_{i,j}|\geq\beta/4, then K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq n(\log(n/e)-\frac{1}{256}\beta^{3})+O(\log n).

Proof.

Assume w.l.o.g. that \lambda'_{i,j}\leq\lambda''_{i,j}. Note that it holds that \lambda_{i,j}\in[\lambda'_{i,j},\lambda''_{i,j}]. This is because we can write \lambda_{i,j}=q\lambda'_{i,j}+(1-q)\lambda''_{i,j} for some q\in[0,1]. Thus if |\lambda'_{i,j}-\lambda''_{i,j}|\geq\beta/4, then either |\lambda_{i,j}-\lambda''_{i,j}|\geq\beta/8 or |\lambda_{i,j}-\lambda'_{i,j}|\geq\beta/8. We apply Claim 3.1 for either \lambda'_{i,j} or \lambda''_{i,j} with parameter a=\beta/8. Noting that j\in[j_{s},j_{f}], we get that \eta,(1-\eta)\geq\beta/8. Putting everything together we get: K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq n(\log(n/e)-\frac{1}{256}\beta^{3})+O(\log n). ∎

Generalizing Assumption 2

Assumption 2 is only required for this case of the proof, and the proofs of Claim 3.1 and Corollary 3.2 still go through even if we only have a sufficiently good approximation to acc(W_{i,j},\bullet). That is, it suffices that there exists a function g':X\rightarrow\{0,1\} which achieves a sufficiently close approximation to the accuracy of acc(W_{i,j},\bullet) on X\setminus X_{i,j} and X_{i,j} (say, up to an additive \beta/100 factor). Thus, we only need the existence of a compressible function which approximates acc(W_{i,j},\bullet) on X\setminus X_{i,j} and X_{i,j}.

Case 2: |\lambda'_{i,j}-\lambda''_{i,j}| is small

Let us now consider the second case, where |\lambda'_{i,j}-\lambda''_{i,j}| is small throughout the epoch (more precisely, when j\in[j_{s},j_{f}]). Roughly speaking, we would like to get an effect similar to that seen in the introduction, when we assumed we can achieve 100% accuracy for every batch. The reason we managed to achieve compression in that case was that we could use the model to reduce the set of potential elements the batch can be taken from when going backwards and reconstructing the batches. We will show that we can achieve a similar effect here. Assume we have the entire dataset, X, and the model which resulted from a GD step on the (j-1)-th batch. The argument is inductive, thus let us assume that we have reconstructed all batches B_{i,k}, k\geq j. This means we also know X\setminus X_{i,j-1}. Knowing X and X\setminus X_{i,j-1}, we immediately know X_{i,j-1}, the set of all elements which can appear in B_{i,j-1}. We aim to show that the local progress assumption (Assumption 1) implies that (on average) W_{i,j} achieves a slightly higher accuracy on B_{i,j-1} than on X_{i,j-1}, which will allow us to achieve a meaningful compression.
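The following Python sketch (ours; reverse_gd_step and decode_batch are assumed helpers in the spirit of Theorem 2.2 and Algorithm 2, not the paper's code) summarizes this backward reconstruction of a single epoch.

def reconstruct_epoch(X, W_end, codes, reverse_gd_step, decode_batch):
    # Walk backwards through one epoch: decode the last batch against the set of
    # elements not yet accounted for, invert the gradient step to recover the
    # previous model (brute force, as in Theorem 2.2), and repeat.
    available = set(X)            # candidate elements for the not-yet-decoded prefix
    W, batches = W_end, []
    for code in reversed(codes):  # last batch of the epoch first
        B = decode_batch(code, available, W)   # uses W's accuracy split, as in Lemma 2.6
        W = reverse_gd_step(W, B)              # recovers the model before this batch's step
        batches.append(B)
        available.difference_update(B)
    return list(reversed(batches)), W          # batches in forward order, and the epoch's first model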

Let us first show that if the batch is large enough, the accuracy of W_{i,j} on B_{i,j} is sufficiently close to \lambda''_{i,j} (the accuracy of W_{i,j} on X\setminus X_{i,j-1}). We state the following claim:

Claim 3.3.

It holds w.h.p. that:

\forall i\in[t],j\in[2,n/b-1],\; acc(W_{i,j},B_{i,j})\geq\lambda''_{i,j}-\beta/4
Proof.

First let us note that if \lambda''_{i,j}\leq\beta/4, then the claim holds trivially: acc(W_{i,j},B_{i,j})\geq 0\geq\lambda''_{i,j}-\beta/4. So for the rest of the proof let us assume that \lambda''_{i,j}\geq\beta/4.

Let us denote by \{Y_{\ell}\}_{\ell=1}^{b} the binary indicator variables indicating whether the \ell-th element in B_{i,j} was classified correctly by W_{i,j}. As stated before, we can say that elements in B_{i,j} are chosen uniformly at random without replacement from X\setminus X_{i,j-1}. Furthermore, it is important to note that B_{i,j} is independent of W_{i,j}; we can imagine that B_{i,j} is selected after W_{i,j} is determined. We can write acc(W_{i,j},B_{i,j})=\frac{1}{b}\sum_{\ell=1}^{b}Y_{\ell}. Applying a Hoeffding bound (Theorem 2.7) with parameters \mu=\lambda''_{i,j}, k=b, \delta=\beta/4 (note that \delta\leq\mu\leq\mu k, as \lambda''_{i,j}\geq\beta/4), we get that:

Pr[acc(W_{i,j},B_{i,j})\leq\lambda''_{i,j}-\beta/4]\leq e^{-2k\delta^{2}}=e^{-b\frac{\beta^{2}}{8}}

According to Assumption 3 it holds that b\geq\frac{8\ln(tn^{3})}{\beta^{2}}. Thus, we get that:

\forall i\in[t],j\in[2,n/b-1],\; Pr[acc(W_{i,j},B_{i,j})<\lambda''_{i,j}-\beta/4]\leq 1/(n^{3}t)

We take a union bound over all bad events (at most tn) to get that with probability at least 1-1/n^{2} none of the bad events occur throughout the t epochs of the execution. ∎

That is, w.h.p. the accuracy of W_{i,j} on B_{i,j} is at least its accuracy on X\setminus X_{i,j-1} (the set from which B_{i,j} is drawn), up to an additive \beta/4 factor. Next we use the fact that \forall\ell\in[j_{s},j_{f}], |\lambda'_{i,\ell}-\lambda''_{i,\ell}| is small, together with the claim above, to show that the accuracy of W_{i,j} on B_{i,j-1} is also not too far from the accuracy of W_{i,j} on X_{i,j-1}. Finally, using the fact that, on average, W_{i,j} achieves sufficiently higher accuracy on B_{i,j-1} than W_{i,j-1} does, we prove the following claim. Note that this is an auxiliary claim, and the quantity \sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda'_{i,j})^{2} will appear later as the number of bits we manage to save per epoch by using the model at every epoch to encode the batches.

Claim 3.4.

If \forall i\in[t], j\in[j_{s},j_{f}], |\lambda'_{i,j}-\lambda''_{i,j}|\leq\beta/4, then w.h.p.: \forall i\in[t], \sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda'_{i,j})^{2}\geq\frac{n\beta^{2}}{25b}.

Proof.

Recall that \varphi_{i,j}=acc(W_{i,j},B_{i,j-1}). This is the accuracy of the model on the (j-1)-th batch in the i-th epoch after performing the GD step on the batch. Let us consider the following inequality:

1\geq\lambda''_{i,j_{f}}-\lambda''_{i,j_{s}}=\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\lambda''_{i,j-1})
=\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\lambda''_{i,j-1}+\varphi_{i,j}-\varphi_{i,j})=\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\varphi_{i,j})+\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j-1})

Using Claim 3.3 we know that w.h.p:

\sum_{j=j_{s}+1}^{j_{f}}\lambda''_{i,j-1}\leq\sum_{j=j_{s}+1}^{j_{f}}(acc(W_{i,j-1},B_{i,j-1})+\frac{\beta}{4})\leq\sum_{j=j_{s}+1}^{j_{f}}acc(W_{i,j-1},B_{i,j-1})+\frac{\beta n}{4b}

So we can write:

\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j-1})\geq\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\frac{\beta n}{4b}
=\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\sum_{j\in[2,j_{s}]\cup[j_{f}+1,n/b+1]}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\frac{\beta n}{4b}
\geq\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\frac{\beta n}{2b}\geq\frac{\beta n}{b}-\frac{\beta n}{2b}=\frac{\beta n}{2b}

Where in the last line we use Assumption 1 together with the fact that (\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))\in[-1,1] and

|[2,j_{s}]|+|[j_{f}+1,n/b+1]|=j_{s}-1+(n/b+1-(j_{f}+1)+1)=j_{s}+n/b-j_{f}
=\lceil(\beta/8)\frac{n}{b}\rceil+1+(n/b-\lfloor\frac{(1-\beta/8)n}{b}\rfloor-1)\leq(\beta/8)\frac{n}{b}+2+(n/b-\frac{(1-\beta/8)n}{b}-2)\leq\frac{\beta n}{4b}

Combining everything together we get that:

1\geq\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\varphi_{i,j})+\frac{\beta n}{2b}\implies\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j})\geq\frac{\beta n}{2b}-1

Let us use the above to derive a lower bound for \sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j}). Let us write:

\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j})=\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j})+\sum_{j=j_{s}+1}^{j_{f}}(\lambda'_{i,j}-\lambda''_{i,j})
\implies\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j})\geq\frac{\beta n}{2b}-1-\sum_{j=j_{s}+1}^{j_{f}}(\lambda'_{i,j}-\lambda''_{i,j})\geq\frac{\beta n}{4b}-1\geq\frac{\beta n}{5b}

Where in the second to last transition we use the assumption that \forall i\in[t], j\in[j_{s},j_{f}], |\lambda'_{i,j}-\lambda''_{i,j}|\leq\beta/4, and in the final transition we assume n is sufficiently large. Taking the square of both sides we get that: (\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j}))^{2}\geq(\frac{\beta n}{5b})^{2}.

Finally, applying the Cauchy-Schwarz inequality, we get that:

\sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda'_{i,j})^{2}\geq\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j})^{2}\geq\frac{1}{j_{f}-j_{s}}(\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j}))^{2}\geq\frac{b}{(1-\beta/4)n}\cdot(\frac{\beta n}{5b})^{2}\geq\frac{n\beta^{2}}{25b} ∎

To complete the proof of the second case, we use the above claim to show that we can significantly compress all batches; this, in turn, results in a compression of the entire permutation. We state the following claim.

Claim 3.5.

If \forall i\in[t], j\in[j_{s},j_{f}], |\lambda'_{i,j}-\lambda''_{i,j}|\leq\beta/4, then w.h.p.: K(r_{i}\mid W_{i+1,1},X)\leq n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+o(n)

Proof.

Recall that B_{i,j} is the j-th batch in the i-th epoch, and let P_{i,j} be a permutation of B_{i,j} such that the order of the elements in B_{i,j} under P_{i,j} is the same as under r_{i}. Note that given X, if we know the partition into batches and all permutations, we can reconstruct r_{i}. According to Theorem 2.2, given W_{i,j} and B_{i,j-1} we can compute W_{i,j-1}. Let us denote by Y the encoding of this procedure. To implement Y we need to iterate over all possible vectors in \mathbb{F}^{d} and over batch elements to compute the gradients. To express this program we require auxiliary variables of size at most O(\log\min\{d,b\})=O(\log n). Thus it holds that K(Y)=O(\log n). Let us abbreviate B_{i,1},B_{i,2},...,B_{i,j} as (B_{i,k})_{k=1}^{j}. We write the following.

K(r_{i}\mid X,W_{i+1,1})\leq K(r_{i},Y\mid X,W_{i+1,1})+O(1)\leq K(r_{i}\mid X,W_{i+1,1},Y)+K(Y\mid X,W_{i+1,1})+O(1)
\leq O(\log n)+K((B_{i,k})_{k=1}^{n/b}\mid X,W_{i+1,1},Y)+\sum_{j=1}^{n/b}K(P_{i,j})

Let us bound K((B_{i,k})_{k=1}^{n/b}\mid X,W_{i+1,1},Y) by repeatedly using the subadditivity and extra information properties of Kolmogorov complexity.

\displaystyle K((B_{i,k})_{k=1}^{n/b}\mid X,Y,W_{i+1,1})\leq K(B_{i,n/b}\mid X,W_{i+1,1})+K((B_{i,k})_{k=1}^{n/b-1}\mid X,Y,W_{i+1,1},B_{i,n/b})+O(1)
\displaystyle\leq K(B_{i,n/b}\mid X,W_{i+1,1})+K((B_{i,k})_{k=1}^{n/b-1}\mid X,Y,W_{i,n/b},B_{i,n/b})+O(1)
\displaystyle\leq K(B_{i,n/b}\mid X,W_{i+1,1})+K(B_{i,n/b-1}\mid X,W_{i,n/b},B_{i,n/b})
\displaystyle+K((B_{i,k})_{k=1}^{n/b-2}\mid X,Y,W_{i,n/b-1},B_{i,n/b},B_{i,n/b-1})+O(1)
\displaystyle\leq...\leq O(\frac{n}{b})+\sum_{j=1}^{n/b}K(B_{i,j}\mid X,W_{i,j+1},(B_{i,k})_{k=j+1}^{n/b})\leq O(\frac{n}{b})+\sum_{j=1}^{n/b}K(B_{i,j}\mid X_{i,j},W_{i,j+1})

Where in the transitions we used the fact that given W_{i,j},B_{i,j-1} and Y we can retrieve W_{i,j-1}. That is, we can always bound K(...\mid Y,W_{i,j},B_{i,j-1},...) by K(...\mid Y,W_{i,j-1},B_{i,j-1},...)+O(1).
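To make the role of Y concrete, the following is a minimal Python sketch of how the reconstruction in Theorem 2.2 can be phrased as a program; the helper gd_step, the finite parameter domain F and the exhaustive search over F^d are hypothetical stand-ins that mirror the description above, not the paper's actual construction.

    from itertools import product

    def invert_gd_step(W_next, batch, gd_step, F, d):
        """Sketch of the program Y: recover W_{i,j-1} from W_{i,j} (= W_next) and the
        batch B_{i,j-1}. It iterates over every candidate parameter vector in F^d and
        returns the one whose GD step on `batch` yields W_next. Both `gd_step` and the
        finite domain `F` are hypothetical stand-ins for the fixed training procedure
        and parameter representation assumed here."""
        for W in product(F, repeat=d):         # every candidate vector in F^d
            if gd_step(W, batch) == W_next:    # per Theorem 2.2, this identifies W_{i,j-1}
                return W
        raise ValueError("no preimage found")

The running time of this exhaustive search is irrelevant for the argument; all that matters is that Y has a short description, i.e., K(Y)=O(\log n).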

To encode the order P_{i,j} inside each batch, b\log(b/e)+O(\log b) bits are sufficient. Finally we get that: K(r_{i}\mid X,W_{i+1,1})\leq O(\frac{n}{b})+\sum_{j=1}^{n/b}[K(B_{i,j}\mid X_{i,j},W_{i,j+1})+b\log(b/e)+O(\log b)].
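For completeness, the bit count for each P_{i,j} follows from Stirling's approximation: a permutation of the b batch elements can be specified by its index among the b! possible orderings, and

\log(b!)=b\log b-b\log e+O(\log b)=b\log(b/e)+O(\log b)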

Let us now bound K(B_{i,j-1}\mid X_{i,j-1},W_{i,j}). Knowing X_{i,j-1}, we know that B_{i,j-1}\subseteq X_{i,j-1}. Thus, we need to use W_{i,j} to compress B_{i,j-1}. Applying Lemma 2.6 with parameters A=B_{i,j-1},B=X_{i,j-1},\eta=b/n_{j-1},\kappa_{A}=\varphi_{i,j},\kappa_{B}=\lambda^{\prime}_{i,j} and g(x)=acc(W_{i,j},x), we get the following:

\displaystyle K(B_{i,j-1}\mid X_{i,j-1},W_{i,j})\leq n_{j-1}(h(\frac{b}{n_{j-1}})-\frac{2b}{n_{j-1}}(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2})+O(\log n_{j-1})
\displaystyle=n_{j-1}h(\frac{b}{n_{j-1}})-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n_{j-1})

Adding b\log(b/e)+O(\log b) to the above, we get the following bound on every element in the sum:

\displaystyle n_{j-1}h(\frac{b}{n_{j-1}})-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+b\log(b/e)+O(\log b)+O(\log n_{j-1})
\displaystyle\leq b\log\frac{en_{j-1}}{b}+b\log(b/e)+O(\log n_{j-1})-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}
\displaystyle=b\log n_{j-1}-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n_{j-1})

Where in the second line we use Lemma 2.4. Note that the most important term in the sum is -2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}. That is, the more the accuracy of W_{i,j} on the batch B_{i,j-1} differs from the accuracy of W_{i,j} on the set X_{i,j-1} containing the batch, the more efficiently we can represent the batch. Let us now bound the sum \sum_{j=2}^{n/b+1}[b\log n_{j-1}-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n_{j-1})]. Let us first bound the sum over b\log n_{j-1}:

\displaystyle\sum_{j=2}^{n/b+1}b\log n_{j-1}=\sum_{j=1}^{n/b}b\log jb=\sum_{j=1}^{n/b}b(\log b+\log j)
\displaystyle=n\log b+b\log(n/b)!=n\log b+n\log\frac{n}{b\cdot e}+O(\log n)=n\log\frac{n}{e}+O(\log n)

Applying Claim 3.4 we get that 2b\sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}\geq\frac{2n\beta^{2}}{25}. Finally we can write that:

\displaystyle K(r_{i}\mid X,W_{i+1,1})\leq O(\frac{n}{b})+\sum_{j=2}^{n/b+1}[b\log n_{j-1}-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n)]
\displaystyle\leq n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+\frac{n}{b}\cdot O(\log n)=n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+o(n)

Where \frac{n}{b}\cdot O(\log n)=o(n) because b=\omega(\log n). ∎

This completes the analysis of the second case. As either case 1 or case 2 holds for every epoch, we conclude that we always manage to compress r_{i}. Combining the claim above with Corollary 3.2, we get:

Corollary 3.6.

It holds w.h.p that \forall i\in[t],K(r_{i}\mid X,W_{i+1,1})\leq n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n).

Proof.

In every epoch either the conditions of Corollary 3.2 or Claim 3.5 hold. If the conditions of Corollary 3.2 hold for some j\in[j_{s},j_{f}] then we can write:

\displaystyle K(r_{i}\mid X,W_{i+1,1})\leq K(r_{i}\mid X,W_{i+1,1},acc(W_{i,j},\bullet))+K(acc(W_{i,j},\bullet)\mid X)+O(1)
\displaystyle\leq\frac{n}{512}\beta^{3}+n(\log(n/e)-\frac{1}{256}\beta^{3})+O(\log n)\leq n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)

Where we use Assumption 2 to bound K(acc(W_{i,j},\bullet)\mid X). If the conditions of Claim 3.5 hold, then we immediately get:

K(r_{i}\mid X,W_{i+1,1})\leq n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+o(n)\leq n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)

Where the last inequality holds since \frac{2\beta^{2}}{25}\geq\frac{\beta^{3}}{512} (as \beta\leq 1), with the remaining gap absorbing the lower-order terms for sufficiently large n. ∎

To bound t, we combine the fact that for every epoch in which the algorithm does not terminate we manage to significantly compress r_{i} with the fact that r is incompressible. We state the following claim:

Claim 3.7.

It holds w.h.p that t\leq O(\frac{K(X)}{n\beta^{3}}).

Proof.

Similarly to the definition of Y in Claim 3.5, let Y^{\prime} be the program which receives X,r_{i},W_{i+1,1} as input and repeatedly applies Theorem 2.2 to retrieve W_{i,1}. As Y^{\prime} just needs to reconstruct all batches from X,r_{i} and call Y n/b times, it holds that K(Y^{\prime})=O(\log n). Using the subadditivity and extra information properties of K(), together with the fact that W_{1,1} can be reconstructed given X,W_{t+1,1},Y^{\prime}, we write the following:

\displaystyle K(r,W_{1,1})\leq K(r,W_{1,1},X,Y^{\prime},W_{t+1,1})+O(1)\leq K(W_{1,1},X,W_{t+1,1},Y^{\prime})+K(r\mid X,Y^{\prime},W_{t+1,1})+O(1)
\displaystyle\leq K(X,W_{t+1,1})+K(r\mid X,Y^{\prime},W_{t+1,1})+O(\log n)\leq K(X)+d+K(r\mid X,Y^{\prime},W_{t+1,1})+O(\log n)
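The program Y^{\prime} admits a similar minimal sketch. The version below reuses the hypothetical invert_gd_step from the earlier sketch and assumes, purely for illustration, that epoch i processes X in the order given by r_{i} and splits it into consecutive batches of size b.

    def reconstruct_epoch_start(W_next_epoch, r_i, X, gd_step, F, d, b):
        """Sketch of Y': recover W_{i,1} from W_{i+1,1} (= W_next_epoch), the permutation
        r_i and the dataset X, by rebuilding the n/b batches of epoch i and undoing the
        GD steps one at a time via invert_gd_step (sketched earlier). The batch layout
        below (consecutive blocks of size b in the order r_i) is a hypothetical stand-in."""
        ordered = [X[k] for k in r_i]                                # epoch i sees X in the order r_i
        batches = [ordered[s:s + b] for s in range(0, len(ordered), b)]
        W = W_next_epoch                                             # this is W_{i, n/b + 1} = W_{i+1, 1}
        for batch in reversed(batches):                              # undo the most recent GD step first
            W = invert_gd_step(W, batch, gd_step, F, d)
        return W                                                     # W_{i, 1}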

First we note that \forall i\in[t-1],K(r_{i}\mid X,Y^{\prime},W_{i+2,1},r_{i+1})\leq K(r_{i}\mid X,Y^{\prime},W_{i+1,1})+O(1), where we simply execute Y^{\prime} on X,W_{i+2,1},r_{i+1} to get W_{i+1,1}. Let us write:

\displaystyle K(r_{1}r_{2}\dots r_{t}\mid X,Y^{\prime},W_{t+1,1})\leq K(r_{t}\mid X,Y^{\prime},W_{t+1,1})+K(r_{1}r_{2}\dots r_{t-1}\mid X,Y^{\prime},W_{t+1,1},r_{t})+O(1)
\displaystyle\leq K(r_{t}\mid X,W_{t+1,1})+K(r_{1}r_{2}\dots r_{t-1}\mid X,Y^{\prime},W_{t,1})+O(1)
\displaystyle\leq K(r_{t}\mid X,W_{t+1,1})+K(r_{t-1}\mid X,W_{t,1})+K(r_{1}r_{2}\dots r_{t-2}\mid X,Y^{\prime},W_{t-1,1})+O(1)
\displaystyle\leq\dots\leq O(t)+\sum^{t}_{k=1}K(r_{k}\mid X,W_{k+1,1})\leq t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]

Where the last inequality is due to Corollary 3.6. Combining everything together, we get that K(r,W_{1,1})\leq K(X)+d+t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)].

Our proof implies that we can reconstruct not only r, but also W_{1,1}, using X,W_{t+1,1}. Due to the incompressibility of random strings, we get that w.h.p K(r,W_{1,1})\geq d+tn\log(n/e)-O(\log n). Combining the lower and upper bounds for K(r,W_{1,1}) yields the following inequality, from which we bound t.

\displaystyle d+tn\log(n/e)-O(\log n)\leq K(X)+d+t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]
\displaystyle\implies t(\frac{n}{512}\beta^{3}-O(\log n))\leq K(X)\implies t=O(\frac{K(X)}{n\beta^{3}})\qed

Assumptions 1 and 2 need not always hold

Let us consider the case where both Assumption 1 and Assumption 2 only hold for some fraction \gamma of the t epochs that we analyze. Let us denote by J_{g}\subset[t] the set of all good epochs where the assumptions hold (the rest of the epochs are bad epochs). This means that we can no longer achieve compression for bad epochs. However, for every bad epoch i, we can simply encode r_{i} directly. Given r_{i} we can reconstruct all of the batches, and retrieve W_{i,1} from W_{i+1,1} using Theorem 2.2. We can write that:

\displaystyle K(r_{i}\mid X,W_{i+1,1})\leq\begin{cases}n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n),&\text{if }i\in J_{g}\\ n(\log(n/e))+O(\log n),&\text{otherwise}\end{cases}

This allows us to bound t as follows:

\displaystyle d+tn\log(n/e)-O(\log n)
\displaystyle\leq K(X)+d+\gamma t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]+(1-\gamma)t[n(\log(n/e))+O(\log n)]
\displaystyle\implies t(\gamma\frac{n}{512}\beta^{3}-O(\log n))\leq K(X)\implies t=O(\frac{K(X)}{n\gamma\beta^{3}})
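Spelling out the algebra in the derivation above, the \gamma and (1-\gamma) terms on the right-hand side combine as

\gamma t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]+(1-\gamma)t[n(\log(n/e))+O(\log n)]=tn\log(n/e)-\gamma t\frac{n}{512}\beta^{3}+t\cdot O(\log n)

after which d and tn\log(n/e) cancel from both sides, leaving the stated bound on t.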

Note that t also appears in the batch size. By Assumption 5, we know that t=O(\beta^{-3}poly(n)), which in turn means that b=\Theta(\beta^{-2}\log^{2}(n/\beta)) is a sufficient condition for Assumption 5 to hold. We state our main theorem for this section:

Theorem 3.8.

If assumptions (3)-(5) hold throughout the execution of SGD and assumptions (1)-(2) hold for a \gamma-fraction of the epochs, then SGD achieves an accuracy of (1-\epsilon) on X in O(\frac{K(X)}{\gamma n(\epsilon\beta^{\prime})^{3}}) epochs w.h.p.

3.1 Connection between \epsilon and \beta

We note that the local progress assumption does not make sense when the accuracy is too close to 1. That is, if the accuracy is too high, there is no room for improvement. We formalize this intuition in the following claim:

Claim 3.9.

If for some i it holds that \forall j,\lambda_{i,j}\geq(1-\epsilon), then \beta\leq\frac{4}{3}\epsilon(1+\ln(n/b)).

Proof.

Recall the local progress assumption: \frac{b}{n}\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))>\beta. Applying Claim 3.3 we know that w.h.p acc(W_{i,j-1},B_{i,j-1})\geq\lambda^{\prime\prime}_{i,j-1}-\beta/4. So we can write:

\beta<\frac{b}{n}\sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda^{\prime\prime}_{i,j-1}+\beta/4)\leq\beta/4+\frac{b}{n}\sum_{j=2}^{n/b+1}(1-\lambda^{\prime\prime}_{i,j-1})

It holds that \lambda_{i,j}\geq(1-\epsilon). Let us now bound \lambda^{\prime\prime}_{i,j} using \lambda_{i,j}. We can write the following for j\in[2,n/b+1]:

\displaystyle n\lambda_{i,j}=\sum_{x\in X}acc(W_{i,j},x)=\sum_{x\in X_{i,j-1}}acc(W_{i,j},x)+\sum_{x\in X\setminus X_{i,j-1}}acc(W_{i,j},x)
\displaystyle=(j-1)b\lambda^{\prime}_{i,j}+(n-(j-1)b)\lambda^{\prime\prime}_{i,j}
\displaystyle\implies\lambda^{\prime\prime}_{i,j}=\frac{n\lambda_{i,j}-(j-1)b\lambda^{\prime}_{i,j}}{n-(j-1)b}\geq\frac{n(1-\epsilon)-(j-1)b}{n-(j-1)b}=1-\frac{\epsilon n}{n-(j-1)b}

Plugging the above into the first inequality we get:

\displaystyle\frac{3\beta}{4}\leq\frac{b}{n}\sum_{j=2}^{n/b+1}\frac{\epsilon n}{n-(j-2)b}\leq\sum_{j=2}^{n/b+1}\frac{\epsilon}{n/b-(j-2)}\leq\epsilon(1+\ln(n/b))

Finally we get \beta\leq\frac{4}{3}\epsilon(1+\ln(n/b)). ∎
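The last transition in the proof uses the standard harmonic-sum bound \sum_{k=1}^{m}1/k\leq 1+\ln m; a quick numerical check of this bound follows, where the value of m is an illustrative placeholder for n/b.

    import math

    # Check the harmonic-sum bound sum_{k=1}^m 1/k <= 1 + ln(m) used in the last transition.
    m = 10_000                                 # illustrative placeholder for n/b
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    assert harmonic <= 1 + math.log(m)
    print(harmonic, 1 + math.log(m))           # roughly 9.79 vs 10.21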

Taking the above into consideration, we parameterize \beta via \epsilon and write \beta=\beta^{\prime}\epsilon. This parameterization is natural in light of Claim 3.9, and it allows the accuracy parameter \epsilon to appear explicitly in our running time bound.
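For concreteness, substituting \beta=\beta^{\prime}\epsilon into the epoch bound of the \gamma-fraction analysis above recovers the running time stated in Theorem 3.8:

t=O(\frac{K(X)}{n\gamma\beta^{3}})=O(\frac{K(X)}{\gamma n(\epsilon\beta^{\prime})^{3}})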
