
SGD Through the Lens of Kolmogorov Complexity

Gregory Schwartzman
JAIST
[email protected]
Abstract

We prove that stochastic gradient descent (SGD) finds a solution that achieves (1-\epsilon) classification accuracy on the entire dataset. We do so under two main assumptions: (1. Local progress) The model accuracy improves on average over batches. (2. Models compute simple functions) The function computed by the model is simple (has low Kolmogorov complexity). It is sufficient that these assumptions hold only for a tiny fraction of the epochs.

Intuitively, the above means that intermittent local progress of SGD implies global progress. Assumption 2 trivially holds for underparameterized models; hence, our work gives the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result which is completely model agnostic: we do not require the model to have any specific architecture or activation function, and it may not even be a neural network. Our analysis makes use of the entropy compression method, which was first introduced by Moser and Tardos [1; 2] in the context of the Lovász local lemma.

1 Introduction

Stochastic gradient descent (SGD) is at the heart of modern machine learning. However, we are still lacking a theoretical framework that explains its performance for general, non-convex functions. There are known convergence guarantees for overparameterized neural networks [3; 4; 5; 6]. These works show that if the number of network parameters is considerably larger than the number of data points, then the network is guaranteed to converge to a (near) global optimum. The main drawback of these results is the requirement of exceedingly large (and wide) models, where the size is a large polynomial in n (the size of the dataset). In other words, the analysis is dependent on the structure of the model. In this work, we take a different approach and show convergence guarantees based on the non-structural properties of the model.

Our two main assumptions are: (1) The model accuracy improves on average over batches following the GD step. That is, if we look at the difference: (batch accuracy after GD - batch accuracy before GD), averaged over an epoch, then it is greater than some positive parameter. (2) The function computed by the model is simple (has Kolmogorov complexity sufficiently smaller than n). Note that this is different than saying that the model is simple. It might be the case that a very complicated model computes an extremely simple function. A classical example is a high degree polynomial over a finite field that always evaluates to 0.

For our convergence guarantees to hold, we only require the above assumptions to hold for some tiny fraction of the epochs. Note that Assumption 2 trivially holds when the model is underparameterized (the number of model parameters is sufficiently smaller than n). To the best of our knowledge, this is the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result which is completely model agnostic. That is, we don't require the model to have any specific architecture or activation function. It may not even be a neural network. Finally, our results answer an intriguing theoretical question: is it possible for SGD to achieve local progress (improvement over batches), but no global progress (good accuracy for the entire dataset)? Intuitively, one might think that this is indeed a situation that might occur. However, we show that (under Assumption 2) this cannot be the case.

Overview of our results

We consider batch SGD, where the model is initialized randomly and the dataset is shuffled once at the beginning of each epoch and then divided into batches. Note that this paper does not deal with the generalization abilities of the model, thus the dataset is always the training set. In each epoch, the algorithm goes over the batches one by one, and performs gradient descent to update the model. This is the "vanilla" version of SGD, without any acceleration or regularization (for a formal definition see Section 2). Furthermore, for the sake of analysis, we add a termination condition after every GD step: if the accuracy on the entire dataset is at least 1-\epsilon we terminate. Thus in our case termination implies good accuracy.

We show that under reasonable assumptions, local progress of SGD implies global progress. Let \epsilon\in(0,1) be an accuracy parameter of our choice and let X be the dataset (every element in X consists of a data point, a label, and some unique ID), where |X|=n. Let K(X) be the Kolmogorov complexity of X (the shortest Turing machine which outputs X). We can also define the Kolmogorov complexity of a function as the shortest Turing machine which computes the function (see Section 2 for a formal definition). The assumptions we make are formally stated at the beginning of Section 3. Below is an intuitive outline:

1. After every GD step the batch accuracy is improved by at least an additive \epsilon\beta' factor (on average, per epoch), where \beta' is a local progress parameter. That is, fix some epoch and let y_{i} be the model accuracy on the i-th batch before the GD step and let y'_{i} be the model accuracy after the GD step; then we require that \frac{1}{t}\sum_{i=1}^{t}(y'_{i}-y_{i})\geq\epsilon\beta', where t is the number of batches in an epoch.

2. The Kolmogorov complexity of the function computed by the model is at most (\epsilon\beta')^{3}n/512. In other words, either the model is underparameterized, or the function computed by the model is sufficiently simple.

3. The loss function is differentiable and L-smooth. This must also hold when taking numerical stability into account.

4. The batch size is \Omega((\beta'\epsilon)^{-2}\log^{2}(n/(\beta'\epsilon))). Note, the dependence on n is only polylogarithmic.

5. The size of every element in X is at most polynomial in n. For example, if X contains images, they are not excessively large compared to n. Furthermore, the number of model parameters is at most polynomial in n (this is not implied by Assumption 2, as the model can be huge but still compute a simple function).

Roughly speaking, we show that if n is sufficiently large, assumptions 3 to 5 always hold, and assumptions 1 and 2 only hold for a \gamma-fraction of the epochs (\gamma\in(0,1] need not be a constant), then SGD must achieve an accuracy of at least (1-\epsilon) on the entire dataset within O(\frac{K(X)}{\gamma(\beta'\epsilon)^{3}n}) epochs with high probability (w.h.p.). (W.h.p. is usually taken to mean probability at least 1-1/n^{c} for some constant c>1; in our case c\in(1,2), but this can be amplified arbitrarily by increasing the batch size by a multiplicative constant.) Note that the number of epochs does not depend on the size of the model. Finally, note that when the accuracy is too high the local progress assumption cannot hold anymore (we discuss this further in Section 3.1), and we cannot guarantee progress past this point.

Let us briefly discuss our assumptions. Assumption 1 is quite natural, as the GD step attempts to minimize the loss function, which acts as a proxy for the accuracy on the data. However, Assumption 2 seems at odds with the fact that we assume the model to be initialized randomly. Intuitively, one might believe that a random initialization surely implies that Assumption 2 does not hold. While the model itself indeed has high Kolmogorov complexity in this case, this is not necessarily true for the function computed by the model. Recent literature suggests that randomly initialized models are biased towards simple functions rather than complicated ones (see Section 7.3 in [7] for a survey of recent work) and that the function learned by the network has low Kolmogorov complexity [8]. Furthermore, our results also generalize if we allow lossy compression. There is a vast literature on model compression (see [9] for a survey of recent literature), where the size of the model can be significantly compressed while preserving its accuracy. This further supports the assumption.

Note that assumptions (3)-(5) are rather mild. Assumption 3 (or some variation of it) is standard in the literature [3; 4; 5; 6], and discussions regarding numerical stability are generally omitted. Due to our use of Kolmogorov complexity we need to explicitly state that the L-smoothness assumption holds even when we consider some subset of the real numbers (i.e., floats). We discuss this in more detail in Section 2. In Assumption 4 we only require a polylogarithmic dependence on n, which is on par with batch sizes used in practice. Furthermore, it is well observed that when the batch size is too small SGD suffers from instability [10]. Finally, Assumption 5 is extremely mild, as our dependence on the size of the elements/model is only logarithmic when we apply this assumption. That is, even elements/models as large as n^{1000} leave our analysis unchanged asymptotically.

Outline of our techniques

To achieve our results we make use of entropy compression, first considered by Moser and Tardos [1; 2] to prove a constructive version of the Lovász local lemma (the name "entropy compression" was coined by Terence Tao [11]). Roughly speaking, the entropy compression argument allows one to bound the running time of a randomized algorithm (we require that the number of random bits used is proportional to the execution time of the algorithm; that is, the algorithm flips coins for every iteration of a loop, rather than just a constant number at the beginning of the execution) by leveraging the fact that a random string of bits (the randomness used by the algorithm) is computationally incompressible (has high Kolmogorov complexity) w.h.p. If one can show that throughout the execution of the algorithm, it (implicitly) compresses the randomness it uses, then one can bound the number of iterations the algorithm may execute without terminating. To show that the algorithm has such a property, one would usually consider the algorithm after executing t iterations, and would try to show that just by looking at an "execution log" of the algorithm and some set of "hints", whose size together is considerably smaller than the number of random bits used by the algorithm, it is possible to reconstruct all of the random bits used by the algorithm. We highly encourage readers unfamiliar with this argument to take a quick look at [12], which provides an excellent 1-page example.

We apply this approach to SGD with an added termination condition when the accuracy over the entire dataset is sufficiently high; thus termination in our case guarantees good accuracy. The randomness we compress is the bits required to represent the random permutation of the data at every epoch. So indeed the longer SGD executes, the more random bits are generated. We show that under our assumptions it is possible to reconstruct these bits efficiently starting from the dataset X and the model after executing t epochs. The first step in allowing us to reconstruct the random bits of the permutation in each epoch is to show that under our L-smoothness assumption and a sufficiently small GD step, SGD is reversible. That is, if we are given a model W_{i+1} and a batch B_{i} such that W_{i+1} results from taking a gradient step with model W_{i} where the loss is calculated with respect to B_{i}, then we can uniquely retrieve W_{i} using only B_{i} and W_{i+1}. This means that if we can efficiently encode the batches used in every epoch (i.e., using fewer bits than encoding the entire permutation of the data), we can also retrieve all intermediate models in that epoch (at no additional cost). We prove this claim in Section 2.

The crux of this paper is to show that our assumptions imply that at every epoch the batches can indeed be compressed. Let us consider two toy examples that provide intuition for our analysis. First, consider the scenario where, for every epoch, exactly in the "middle" of the epoch (after considering half of the batches for the epoch), our model achieves 100% accuracy on the set of all elements already considered so far during the epoch (elements on the "left") and 0% accuracy for the remaining elements (elements on the "right"). Let us consider a single epoch of this execution. We can express the permutation of the dataset by encoding two sets (for elements on the left and right), and two permutations for these sets. Given the dataset X and the function computed by the model at the middle of the epoch, we immediately know which elements are on the left and which are on the right by seeing if they are classified correctly or not by the model (recall that X contains data points and labels and that elements have unique IDs). Thus it is sufficient to encode only two permutations, each over n/2 elements. While naively encoding the permutation for the entire epoch requires \lceil\log(n!)\rceil bits, we manage to make do with 2\lceil\log((n/2)!)\rceil. Using Stirling's approximation (\log(n!)=n\log(n/e)+O(\log n)) and omitting logarithmic factors, we get that \log(n!)-2\log((n/2)!)\approx n\log(n/e)-n\log(n/2e)=n. Thus, we can save a linear number of bits using this encoding. To achieve the above encoding, we need the function computed by the model in the middle of the epoch. Assumption 2 guarantees that we can efficiently encode this function.
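As a quick numerical illustration of this saving (our own arithmetic check, with an arbitrary choice of n, not a value from the paper):

import math

def log2_factorial(m: int) -> float:
    # log2(m!) computed via lgamma for numerical stability
    return math.lgamma(m + 1) / math.log(2)

n = 10**6
naive = log2_factorial(n)            # bits to encode one permutation of n elements
split = 2 * log2_factorial(n // 2)   # bits to encode two permutations of n/2 elements
print(f"bits saved: {naive - split:,.0f}   (compare to n = {n:,})")

The saving printed is roughly n bits, matching \log(n!)-2\log((n/2)!)\approx n.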

Second, let us consider the scenario where, in every epoch, just after a single GD step on a batch we consistently achieve 100% accuracy on the batch. Furthermore, we terminate our algorithm when we achieve 90% accuracy on the entire dataset. Let us consider some epoch of our execution, assume we have access to X, and let W_{f} be the model at the end of the epoch. Our goal is to retrieve the last batch of the epoch, B_{f}\subset X (without knowing the permutation of the data for the epoch). A naive approach would be to simply encode the indices in X of the elements in the batch (we can sort X by IDs). However, we can use W_{f} to achieve a more efficient encoding. Specifically, we know that W_{f} achieves 100% accuracy on B_{f} but only 90% accuracy on X. Thus it is sufficient to encode the elements of B_{f} using a smaller subset of X (the elements classified correctly by W_{f}, which has size at most 0.9|X|). This allows us to significantly compress B_{f}. Next, we can use B_{f} and W_{f} together with the reversibility of SGD to retrieve W_{f-1}. We can now repeat the above argument to compress B_{f-1} and so on, until we are able to reconstruct all of the random bits used to generate the permutation of X in the epoch. This will again result in a linear reduction in the number of bits required for the encoding.
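Again as a rough numerical illustration (n and b are arbitrary values of our choosing): encoding a batch against the correctly classified 90% of X, rather than against all of X, saves roughly b\log(10/9) bits per batch, which adds up to a linear-in-n saving over an epoch.

import math

def log2_binom(m: int, k: int) -> float:
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)) / math.log(2)

n, b = 10**6, 1000
all_of_X = log2_binom(n, b)                 # encode the batch as a subset of all of X
correct_only = log2_binom(9 * n // 10, b)   # encode it as a subset of the 90% classified correctly
saved = all_of_X - correct_only
print(f"bits saved per batch: {saved:.1f}, over n/b batches: {saved * n / b:,.0f}")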

In our analysis, we show generalized versions of the two scenarios above. First, we show that if at any epoch the accuracy of the model on the left and right elements differs significantly (we need not split the data down the middle anymore), we manage to achieve a meaningful compression of the randomness. Next, we show that if this scenario does not happen, then together with our local progress assumption, we manage to achieve slightly better accuracy on the batch we are trying to retrieve (compared to the set of available elements). This results in a compression similar to the second scenario. Either way, we manage to save a significant (linear in n) number of bits for every epoch, which allows us to use the entropy compression argument to bound our running time.

Related work

An excellent survey of recent work is given in [13], which we follow below. There has been a long line of research proving convergence bounds for SGD under various simplifying assumptions such as: linear networks [14; 15], shallow networks [16; 17; 18], etc. However, the most relevant results for this paper are the ones dealing with deep, overparameterized networks [3; 4; 5; 6]. All of these works make use of the NTK (Neural Tangent Kernel) [19] and show convergence guarantees for SGD when the hidden layers have width at least poly(n,L), where n is the size of the dataset and L is the depth of the network. We note that the exponents of the polynomials are quite large. For example, assuming a random weight initialization, [4] requires width at least n^{24}L^{12} and converges to a model with an \epsilon error in O(n^{6}L^{2}\log(1/\epsilon)) steps w.h.p. The current best parameters are due to [6], which, under random weight initialization, requires width at least n^{8}L^{12} and converges to a model with an \epsilon error in O(n^{2}L^{2}\log(1/\epsilon)) steps w.h.p. Recent works [20; 21; 13] achieve a much improved width dependence (linear/quadratic) and running times; however, they require additional assumptions which make their results incomparable with the above. Specifically, in [20; 21] a special (non-standard) weight initialization is required, and in [13] convergence is shown for a non-standard activation unit.

Our results are not directly comparable with the above, mainly due to assumptions 1 and 2. However, if one accepts our assumptions, our analysis has the benefit of not requiring any specific network architecture. The convergence times are also essentially incomparable due to the dependence on K(X). That is, K(X) can be very small if the elements of X are small and similar to each other, or very large if they are large and contain noise (e.g., high definition images with noise). To the best of our knowledge, this is the first (conditional) global convergence guarantee for SGD for general, non-convex, underparameterized models.

Structure of the paper

We first present some preliminary definitions in Section 2 followed by our main results in Section 3.

2 Preliminaries

We consider the following optimization problem. We are given an input (dataset) of size n. Let us denote X=\{x_{i}\}_{i=1}^{n} (our inputs contain both data and labels; we do not need to distinguish them for this work). We also associate every x\in X with a unique ID of \lceil\log n\rceil bits. We often consider batches of the input B\subset X. The size of the batch is denoted by b (all batches have the same size). We have some model whose parameters are denoted by W\in\mathbb{R}^{d}, where d is the model dimension. We aim to optimize a goal function of the following type: f(W)=\frac{1}{n}\sum_{x\in X}f_{x}(W), where the functions f_{x}:\mathbb{R}^{d}\rightarrow\mathbb{R} are completely determined by x\in X. We also define for every set A\subseteq X: f_{A}(W)=\frac{1}{|A|}\sum_{x\in A}f_{x}(W). Note that f_{X}=f.

We denote by acc(W,A):\mathbb{R}^{d}\times 2^{X}\rightarrow[0,1] the accuracy of model W on the set A\subseteq X (where we use W to classify elements from X). Note that for x\in X it holds that acc(W,x) is a binary value indicating whether x is classified correctly or not. We require that every f_{x} is differentiable and L-smooth: \forall W_{1},W_{2}\in\mathbb{R}^{d},\|\nabla f_{x}(W_{1})-\nabla f_{x}(W_{2})\|\leq L\|W_{1}-W_{2}\|. This implies that every f_{A} is also differentiable and L-smooth. To see this, consider the following:

\|\nabla f_{A}(W_{1})-\nabla f_{A}(W_{2})\|=\|\frac{1}{|A|}\sum_{x\in A}\nabla f_{x}(W_{1})-\frac{1}{|A|}\sum_{x\in A}\nabla f_{x}(W_{2})\|
=\frac{1}{|A|}\|\sum_{x\in A}(\nabla f_{x}(W_{1})-\nabla f_{x}(W_{2}))\|\leq\frac{1}{|A|}\sum_{x\in A}\|\nabla f_{x}(W_{1})-\nabla f_{x}(W_{2})\|\leq L\|W_{1}-W_{2}\|

We state another useful property of f_{A}:

Lemma 2.1.

Let W_{1},W_{2}\in\mathbb{R}^{d} and \alpha<1/L. For any A\subseteq X, if it holds that W_{1}-\alpha\nabla f_{A}(W_{1})=W_{2}-\alpha\nabla f_{A}(W_{2}), then W_{1}=W_{2}.

Proof.

Rearranging the terms we get that W_{1}-W_{2}=\alpha\nabla f_{A}(W_{1})-\alpha\nabla f_{A}(W_{2}). Now let us consider the norm of both sides: \|W_{1}-W_{2}\|=\|\alpha\nabla f_{A}(W_{1})-\alpha\nabla f_{A}(W_{2})\|\leq\alpha L\|W_{1}-W_{2}\|<\|W_{1}-W_{2}\|. Unless W_{1}=W_{2}, the final strict inequality holds, which leads to a contradiction. ∎

The above means that for a sufficiently small gradient step, the gradient descent process is reversible. That is, we can always recover the previous model parameters given the current ones, assuming that the batch is fixed. We use the notion of reversibility throughout this paper. However, in practice we only have finite precision, thus instead of \mathbb{R} we work with the finite set \mathbb{F}\subset\mathbb{R}. Furthermore, due to numerical stability issues, we do not have access to exact gradients, but only to approximate values \widetilde{\nabla f_{A}}. For the rest of this paper, we assume these values are L-smooth on all elements in \mathbb{F}^{d}. That is,

\forall W_{1},W_{2}\in\mathbb{F}^{d},A\subseteq X:\quad\|\widetilde{\nabla f_{A}}(W_{1})-\widetilde{\nabla f_{A}}(W_{2})\|\leq L\|W_{1}-W_{2}\|

This immediately implies that Lemma 2.1 holds even when precision is limited. Let us state the following theorem:

Theorem 2.2.

Let W_{1},W_{2},...,W_{k}\in\mathbb{F}^{d}\subset\mathbb{R}^{d}, A_{1},A_{2},...,A_{k}\subseteq X and \alpha<1/L. If it holds that W_{i}=W_{i-1}-\alpha\widetilde{\nabla f_{A_{i-1}}}(W_{i-1}), then given A_{1},A_{2},...,A_{k-1} and W_{k} we can retrieve W_{1}.

Proof.

Given W_{k} we iterate over all W\in\mathbb{F}^{d} until we find W such that W_{k}=W-\alpha\widetilde{\nabla f_{A_{k-1}}}(W). Using Lemma 2.1, there is only a single element for which this equality holds, and thus W=W_{k-1}. We repeat this process until we retrieve W_{1}. ∎

We note that the assumption that L-smoothness holds for the approximate gradients on all elements in \mathbb{F}^{d} might seem a bit strong. However, if we assume that the gradients have norm bounded by a parameter G (i.e., f_{A} is G-Lipschitz), then it holds that \|W_{i}-W_{i-1}\|\leq\alpha G. This means that in Theorem 2.2, we only need to go over elements that are close to W_{i} when retrieving W_{i-1}. This in turn means that we can require that the Lipschitz condition for \widetilde{\nabla f_{A_{i-1}}} only holds around W_{i}.
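The following toy Python sketch (ours, not the paper's construction) illustrates the reversibility argument on a one-dimensional stand-in for \mathbb{F}^{d}: with \alpha<1/L the gradient step is injective, so the previous model can be recovered by brute-force search over the finite parameter set, as in the proof of Theorem 2.2.

# Toy illustration: a gradient step with alpha < 1/L on a fixed batch is injective,
# so the previous parameters can be recovered by scanning a finite parameter set F.
L, alpha = 1.0, 0.25                      # smoothness constant and step size, alpha < 1/L
F = [i / 256 for i in range(257)]         # a tiny stand-in for the finite float set F^d

def grad(w):                              # gradient of the L-smooth loss 0.5*L*(w - 0.5)^2
    return L * (w - 0.5)

def gd_step(w):
    return w - alpha * grad(w)

w_prev = 0.8125                           # some model in F
w_next = gd_step(w_prev)

# Reverse the step by brute force: exactly one w in F maps to w_next.
candidates = [w for w in F if gd_step(w) == w_next]
assert candidates == [w_prev]
print("recovered previous model:", candidates[0])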

SGD

We analyze the classic SGD algorithm presented in Algorithm 1. One difference to note in our algorithm, compared to the standard implementation, is the termination condition when the accuracy on the dataset is sufficiently high. In practice the termination condition is not used; we use it only to prove that at some point in time the accuracy of the model is indeed sufficiently large. It is also possible to check this condition on a sample of a logarithmic number of elements of X and achieve a good approximation of the accuracy w.h.p.

i\leftarrow 1 // epoch counter
W_{1,1} is an initial model
while True do
    Take a random permutation of X, divided into batches \{B_{i,j}\}_{j=1}^{n/b}
    for j from 1 to n/b do
        if acc(W_{i,j},X)\geq(1-\epsilon) then Return W_{i,j}
        W_{i,j+1}\leftarrow W_{i,j}-\alpha\nabla f_{B_{i,j}}(W_{i,j})
    i\leftarrow i+1, W_{i,1}\leftarrow W_{i-1,n/b+1}

Algorithm 1 SGD
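For concreteness, here is a minimal Python sketch of the loop in Algorithm 1. The functions accuracy and grad_batch are assumed black boxes supplied by the model (hypothetical helpers, not part of the paper), and we assume b divides n.

import random

def sgd_with_termination(X, W, b, alpha, eps, accuracy, grad_batch):
    # X: training set (list), W: initial model parameters (list of floats),
    # accuracy(W, S) and grad_batch(W, B) are assumed black boxes.
    n = len(X)
    i = 1                                   # epoch counter
    while True:
        perm = random.sample(list(X), n)    # random permutation of X
        batches = [perm[j:j + b] for j in range(0, n, b)]
        for B in batches:
            if accuracy(W, X) >= 1 - eps:   # termination: global accuracy reached
                return W, i
            W = [w - alpha * g for w, g in zip(W, grad_batch(W, B))]
        i += 1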

Kolmogorov complexity

The Kolmogorov complexity of a string x\in\{0,1\}^{*}, denoted by K(x), is defined as the size of the smallest prefix Turing machine which outputs this string. We note that this definition depends on which encoding of Turing machines we use. However, one can show that this will only change the Kolmogorov complexity by a constant factor [22]. Random strings have the following useful property [22]:

Theorem 2.3.

For an n-bit string x chosen uniformly at random and any c\in\mathbb{N}, it holds that Pr[K(x)\geq n-c]\geq 1-1/2^{c}.

We also use the notion of conditional Kolmogorov complexity, denoted by K(x\mid y). This is the length of the shortest prefix Turing machine which gets y as an auxiliary input and prints x. Note that the length of y does not count towards the size of the machine which outputs x. So it can be the case that |x|\ll|y| but it holds that K(x\mid y)<K(x). We can also consider the Kolmogorov complexity of functions. Let g:\{0,1\}^{*}\rightarrow\{0,1\}^{*}; then K(g) is the size of the smallest Turing machine which computes the function g.

The following properties of Kolmogorov complexity will be of use [22]. Let x,y,z be three strings:

  • Extra information: K(x\mid y,z)\leq K(x\mid z)+O(1)\leq K(x,y\mid z)+O(1)

  • Subadditivity: K(xy\mid z)\leq K(x\mid z,y)+K(y\mid z)+O(1)\leq K(x\mid z)+K(y\mid z)+O(1)

Throughout this paper we often consider the value K(A) where A is a set. Here the program computing A need only output the elements of A (in any order). When considering K(A\mid B) such that A\subseteq B, it holds that K(A\mid B)\leq\lceil\log\binom{|B|}{|A|}\rceil+O(\log|B|). To see why, consider Algorithm 2.

m_{A}\leftarrow|A|, m_{B}\leftarrow|B|, i\leftarrow 0, i_{A} is a target index
for every subset C\subseteq B s.t. |C|=m_{A} (in a predetermined order) do
    if i=i_{A} then Print C
    i\leftarrow i+1

Algorithm 2 Compute A given B as input

In the algorithm, i_{A} is the index of A when considering some ordering of all subsets of B of size |A|. Thus \lceil\log\binom{|B|}{|A|}\rceil bits are sufficient to represent i_{A}. The remaining variables i,m_{A},m_{B} and any additional variables required to construct the set C are all of size at most O(\log|B|), and there is at most a constant number of them.
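A small Python sketch of this encoding scheme (illustrative, not the paper's code): the subset A is represented by its index among all |A|-subsets of B in a fixed lexicographic order, and decoding proceeds exactly as Algorithm 2 does.

from itertools import combinations

def encode_subset(A, B):
    # Index of A among all |A|-subsets of B, in a fixed (lexicographic) order.
    A = tuple(sorted(A))
    for i, C in enumerate(combinations(sorted(B), len(A))):
        if C == A:
            return i

def decode_subset(i_A, m_A, B):
    # Algorithm 2: recover A from B and the target index i_A.
    for i, C in enumerate(combinations(sorted(B), m_A)):
        if i == i_A:
            return set(C)

B = set(range(10))
A = {1, 4, 7}
idx = encode_subset(A, B)                  # fits in ceil(log2 C(10, 3)) = 7 bits
assert decode_subset(idx, len(A), B) == A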

During our analysis, we often bound the Kolmogorov complexity of tuples of objects. For example, K(A,P\mid B) where A\subseteq B is a set and P:A\rightarrow[|A|] is a permutation of A (note that A,P together form an ordered tuple of the elements of A). Instead of explicitly presenting a program such as Algorithm 2, we say that if K(A\mid B)\leq c_{1} and c_{2} bits are sufficient to represent P, then K(A,P\mid B)\leq c_{1}+c_{2}+O(1). This just means that we directly encode P as a variable in the program that computes A given B and use it in the code. For example, we can add a permutation to Algorithm 2 and output an ordered tuple of elements rather than a set. Note that when representing a permutation of A, |A|=k, instead of using functions we can just talk about values representable in \lceil\log k!\rceil bits. That is, we can decide on some predetermined ordering of all permutations of k elements, and represent a permutation as its number in this ordering.
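As a concrete illustration (ours, not from the paper), such a numbering of permutations can be computed via the Lehmer code, so an integer in [k!] (i.e., \lceil\log k!\rceil bits) suffices:

from math import factorial

def rank_permutation(perm):
    # Lehmer-code rank of a permutation of 0..k-1; fits in ceil(log2(k!)) bits.
    k, rank = len(perm), 0
    items = list(range(k))
    for i, p in enumerate(perm):
        rank += items.index(p) * factorial(k - 1 - i)
        items.remove(p)
    return rank

def unrank_permutation(rank, k):
    items, perm = list(range(k)), []
    for i in range(k):
        f = factorial(k - 1 - i)
        perm.append(items.pop(rank // f))
        rank %= f
    return perm

p = [2, 0, 3, 1]
assert unrank_permutation(rank_permutation(p), 4) == p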

Entropy and KL-divergence

Our proofs make extensive use of binary entropy and KL-divergence. In what follows we define these concepts and provide some useful properties.

Entropy: For p\in[0,1] we denote by h(p)=-p\log p-(1-p)\log(1-p) the entropy of p. Note that h(0)=h(1)=0.

KL-divergence: For p,q\in[0,1] let D_{KL}(p\;\|\;q)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q} be the Kullback-Leibler divergence (KL-divergence) between two Bernoulli distributions with parameters p and q. We also state the following useful special case of Pinsker's inequality: D_{KL}(p\;\|\;q)\geq 2(p-q)^{2}.
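The following short Python snippet (ours, for intuition) implements these two quantities in bits and numerically spot-checks the stated Pinsker bound; note that measuring the divergence in bits only makes the inequality stronger.

import math, random

def h(p):                 # binary entropy, in bits; h(0) = h(1) = 0
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def d_kl(p, q):           # KL divergence between Bernoulli(p) and Bernoulli(q), in bits
    total = 0.0
    if p > 0:
        total += p * math.log2(p / q)
    if p < 1:
        total += (1 - p) * math.log2((1 - p) / (1 - q))
    return total

# Numerical spot check of Pinsker's inequality D_KL(p || q) >= 2 (p - q)^2.
for _ in range(10000):
    p, q = random.random(), random.uniform(0.01, 0.99)
    assert d_kl(p, q) >= 2 * (p - q) ** 2 - 1e-9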

Lemma 2.4.

For p\in[0,1] it holds that h(p)\leq p\log(e/p).

Proof.

Let us write our claim as:

h(p)=-p\log p-(1-p)\log(1-p)\leq p\log(e/p)

Rearranging we get:

-(1-p)\log(1-p)\leq p\log p+p\log(1/p)+p\log e
\implies-(1-p)\log(1-p)\leq p\log e
\implies-\ln(1-p)\leq\frac{p}{1-p}

Note that -\ln(1-p)=\int_{0}^{p}\frac{1}{1-x}dx\leq p\cdot\frac{1}{1-p}, where in the final transition we use the fact that \frac{1}{1-x} is monotonically increasing on [0,1). This completes the proof. ∎
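A quick numerical spot check of Lemma 2.4 (ours, illustrative only):

import math

# Check h(p) <= p * log2(e / p) on a grid of p in (0, 1).
for i in range(1, 1000):
    p = i / 1000
    hp = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    assert hp <= p * math.log2(math.e / p) + 1e-12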

Lemma 2.5.

For p,\gamma,q\in[0,1] where p\gamma\leq q and (1-p)\gamma\leq(1-q), it holds that

qh(\frac{p\gamma}{q})+(1-q)h(\frac{(1-p)\gamma}{1-q})\leq h(\gamma)-\gamma D_{KL}(p\;\|\;q)
Proof.

Let us expand the left hand side using the definition of entropy:

qh(\frac{p\gamma}{q})+(1-q)h(\frac{(1-p)\gamma}{1-q})
=-q(\frac{p\gamma}{q}\log\frac{p\gamma}{q}+(1-\frac{p\gamma}{q})\log(1-\frac{p\gamma}{q}))
-(1-q)(\frac{(1-p)\gamma}{1-q}\log\frac{(1-p)\gamma}{1-q}+(1-\frac{(1-p)\gamma}{1-q})\log(1-\frac{(1-p)\gamma}{1-q}))
=-(p\gamma\log\frac{p\gamma}{q}+(q-p\gamma)\log\frac{q-p\gamma}{q})
-((1-p)\gamma\log\frac{(1-p)\gamma}{1-q}+((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q})
=-\gamma\log\gamma-\gamma D_{KL}(p\;\|\;q)
-(q-p\gamma)\log\frac{q-p\gamma}{q}-((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q}

Where in the last equality we simply sum the first terms on both lines. To complete the proof we use the log-sum inequality on the last expression. The log-sum inequality states: let \{a_{k}\}_{k=1}^{m},\{b_{k}\}_{k=1}^{m} be non-negative numbers and let a=\sum_{k=1}^{m}a_{k}, b=\sum_{k=1}^{m}b_{k}; then \sum_{k=1}^{m}a_{k}\log\frac{a_{k}}{b_{k}}\geq a\log\frac{a}{b}. We apply the log-sum inequality with m=2, a_{1}=q-p\gamma, a_{2}=(1-q)-(1-p)\gamma, a=1-\gamma and b_{1}=q, b_{2}=1-q, b=1, getting that:

(q-p\gamma)\log\frac{q-p\gamma}{q}+((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q}\geq(1-\gamma)\log(1-\gamma)

Putting everything together we get that

-\gamma\log\gamma-\gamma D_{KL}(p\;\|\;q)
-(q-p\gamma)\log\frac{q-p\gamma}{q}-((1-q)-(1-p)\gamma)\log\frac{(1-q)-(1-p)\gamma}{1-q}
\leq-\gamma\log\gamma-(1-\gamma)\log(1-\gamma)-\gamma D_{KL}(p\;\|\;q)=h(\gamma)-\gamma D_{KL}(p\;\|\;q) ∎

Representing sets

Let us state some useful bounds on the Kolmogorov complexity of sets.

Lemma 2.6.

Let A\subseteq B, |B|=m, |A|=\gamma m, and let g:B\rightarrow\{0,1\}. For any set Y\subseteq B let Y_{1}=\{x\mid x\in Y,g(x)=1\}, Y_{0}=Y\setminus Y_{1} and \kappa_{Y}=\frac{|Y_{1}|}{|Y|}. It holds that

K(A\mid B,g)\leq m(h(\gamma)-2\gamma(\kappa_{B}-\kappa_{A})^{2})+O(\log m)
Proof.

The algorithm is very similar to Algorithm 2; the main difference is that we must first compute B_{1},B_{0} from B using g, and select A_{1},A_{0} from B_{1},B_{0}, respectively, using two indices i_{A_{1}},i_{A_{0}}. Finally, we print A=A_{1}\cup A_{0}. We can now bound the number of bits required to represent i_{A_{1}},i_{A_{0}}. Note that |B_{1}|=\kappa_{B}m and |B_{0}|=(1-\kappa_{B})m, and that for A_{1} we pick \gamma\kappa_{A}m elements from \kappa_{B}m elements, while for A_{0} we pick \gamma(1-\kappa_{A})m elements from (1-\kappa_{B})m elements. The number of bits required to represent this selection is:

\lceil\log\binom{\kappa_{B}m}{\gamma\kappa_{A}m}\rceil+\lceil\log\binom{(1-\kappa_{B})m}{\gamma(1-\kappa_{A})m}\rceil\leq\kappa_{B}mh(\frac{\gamma\kappa_{A}}{\kappa_{B}})+(1-\kappa_{B})mh(\frac{\gamma(1-\kappa_{A})}{1-\kappa_{B}})
\leq m(h(\gamma)-\gamma D_{KL}(\kappa_{A}\;\|\;\kappa_{B}))\leq m(h(\gamma)-2\gamma(\kappa_{B}-\kappa_{A})^{2})

Where in the first inequality we used the fact that \forall 0\leq k\leq n, \log\binom{n}{k}\leq nh(k/n); Lemma 2.5 in the second transition; and the lower bound on the KL-divergence (Pinsker's inequality) in the third transition. The additional O(\log m) factor is due to various counters and variables, similarly to Algorithm 2. ∎
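A quick numerical sanity check of Lemma 2.6 (illustrative parameter values chosen by us): we compare the number of index bits used in the proof against the stated bound, up to the O(\log m) overhead.

import math

def log2_binom(m, k):
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)) / math.log(2)

def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

m, gamma, kB, kA = 1000, 0.1, 0.6, 0.3   # |B|, gamma = |A|/|B|, 1-label fractions in B and in A
bits = log2_binom(round(kB * m), round(gamma * kA * m)) \
     + log2_binom(round((1 - kB) * m), round(gamma * (1 - kA) * m))
bound = m * (h(gamma) - 2 * gamma * (kB - kA) ** 2)
print(f"index bits: {bits:.1f}  <=  lemma bound: {bound:.1f}")
assert bits <= bound + 2                  # matches Lemma 2.6 up to the O(log m) overhead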

Concentration bounds

We state the following Hoeffding bound for sampling without replacement.

Theorem 2.7 (Hoeffding [23; 24]).

Let \chi=(y_{1},...,y_{N}) be a finite population of size N, such that \forall i, y_{i}\in\{0,1\}, and let (Y_{1},...,Y_{k}) be a random sample drawn without replacement from \chi (first Y_{1} is chosen uniformly at random from \chi, then Y_{2} is chosen uniformly at random from \chi\setminus\{Y_{1}\}, and so on). Let \mu=\frac{1}{N}\sum_{y\in\chi}y. For all 0\leq\delta\leq\mu k it holds that: P(\frac{1}{k}\sum_{i=1}^{k}Y_{i}-\mu\leq-\delta)\leq e^{-2k\delta^{2}}.

One important thing to note is that if we have access to \chi as a shuffled tuple (or shuffled array), then taking any subset of k indices is a random sample of size k drawn from \chi without replacement. Furthermore, if we take some sample Y from the shuffled population \chi and then take a second sample Z (already knowing Y) from the remaining set \chi\setminus Y, then Z is sampled without replacement from \chi\setminus Y.
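The following Monte Carlo snippet (a toy 0/1 population of our choosing, not from the paper) illustrates Theorem 2.7 by comparing the empirical lower tail of the sample mean to the Hoeffding bound.

import math, random

N, k, delta = 1000, 100, 0.1
population = [1] * 600 + [0] * 400        # mu = 0.6
mu = sum(population) / N

trials = 20000
bad = sum(1 for _ in range(trials)
          if sum(random.sample(population, k)) / k - mu <= -delta)
print(f"empirical tail: {bad / trials:.4f}   "
      f"Hoeffding bound exp(-2*k*delta^2): {math.exp(-2 * k * delta**2):.4f}")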

3 Convergence guarantees for SGD

In this section, we prove that Algorithm 1 eventually terminates. That is, a (1-\epsilon) accuracy for the entire dataset is achieved. Let us start by defining some notation, followed by formal definitions of our assumptions.

Figure 1: A visual summary of our notations.

Notations

First let us define some useful notations:

  • \lambda_{i,j}=acc(W_{i,j},X). This is the accuracy of the model in epoch i on the entire dataset X, before performing the GD step on batch j.

  • \varphi_{i,j}=acc(W_{i,j},B_{i,j-1}). This is the accuracy of the model on the (j-1)-th batch in the i-th epoch, after performing the GD step on that batch.

  • X_{i,j}=\bigcup_{k=1}^{j}B_{i,k} (note that \forall i, X_{i,0}=\emptyset and X_{i,n/b}=X). This is the set of elements in the first j batches of epoch i. Let us also denote n_{j}=|X_{i,j}|=jb (note that \forall j,i_{1},i_{2}, |X_{i_{1},j}|=|X_{i_{2},j}|, thus i need not appear in the subscript).

  • \lambda'_{i,j}=acc(W_{i,j},X_{i,j-1}), \lambda''_{i,j}=acc(W_{i,j},X\setminus X_{i,j-1}), where \lambda'_{i,j} is the accuracy of the model on the set of all previously seen batch elements, after performing the GD step on the (j-1)-th batch, and \lambda''_{i,j} is the accuracy of the same model on all remaining elements (j-th batch onward). To avoid computing the accuracy on empty sets, \lambda'_{i,j} is defined for j\in[2,n/b+1] and \lambda''_{i,j} is defined for j\in[1,n/b].

Assumptions

In our analysis we consider t epochs of the SGD algorithm, where we bound t later. Let us state the assumptions we make (later we discuss ways to further alleviate these assumptions).

1. Local progress: There exists some constant \beta' such that \forall i\in[t], \frac{b}{n}\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))>\beta'\epsilon. That is, we assume that in every epoch, the accuracy of the model on the batch after the GD step increases by an additive \beta'\epsilon factor on average. Let us denote \beta=\beta'\epsilon (we explain the reason for this parameterization in Section 3.1).

2. Models compute simple functions: \forall i\in[t], j\in[n/b], K(acc(W_{i,j},\bullet)\mid X)\leq\frac{n\beta^{3}}{512} (where acc(W_{i,j},\bullet) is the accuracy function of the model with parameters W_{i,j}). This is actually a slightly weaker condition than bounding the Kolmogorov complexity of the function computed by the model. That is, given X and the function computed by the model we can compute acc(W_{i,j},\bullet), but not the other way around. For simplicity of notation, we assume that W_{i,j} contains not only the model parameters, but also the architecture of the model and the indices i,j.

3. b\geq 8\beta^{-2}\ln^{2}(tn^{3}). That is, the batch must be sufficiently large. After we bound t, we get that b=\Theta(\beta^{-2}\log^{2}(n/\beta)) is sufficient.

4. \forall x\in X, f_{x} is differentiable and L-smooth, even when taking numerical stability into account.

5. K(X)=n^{O(1)}, d=n^{O(1)}. That is, the size of the model and the size of every data point is at most polynomial in n. We note that the exponent can be extremely large, say 1000, and this will only change our bounds by a constant.

Bounding t:

Our goal is to use the entropy compression argument to bound the running time of the algorithm. That is, we show that if our assumptions hold for t consecutive epochs and t is sufficiently large, then our algorithm must terminate. Let r_{i} be the string of random bits representing the random permutation of X at epoch i. As we consider t epochs, let r=r_{1}r_{2}\dots r_{t}. Note that the number of bits required to represent an arbitrary permutation of [n] is given by \lceil\log(n!)\rceil=n\log n-n\log e+O(\log n)=n\log(n/e)+O(\log n), where we used Stirling's approximation. Thus, it holds that |r|=t(n\log(n/e)+O(\log n)) and according to Theorem 2.3, with probability at least 1-1/n^{2} it holds that K(r)\geq tn\log(n/e)-O(\log n).

Our goal for the rest of the section is to show that if our assumptions hold, then in every epoch we manage to compress r_{i}, which in turn results in a compression of r, leading to an upper bound on t due to the incompressibility of r. For the rest of the section, we assume that all of our assumptions hold and that the algorithm does not terminate within the first t epochs. As discussed in the introduction, our proof consists of two cases. For the first case, we show that if at some point during the epoch |\lambda'_{i,j}-\lambda''_{i,j}| is sufficiently large, then r_{i} can be compressed. For the second case, we show that if |\lambda'_{i,j}-\lambda''_{i,j}| is small throughout the epoch, then r_{i} can be compressed by encoding every batch individually.

Case 1: |\lambda'_{i,j}-\lambda''_{i,j}| is large

In the introduction, we considered the case where during the epoch the model achieves 100% accuracy on half the elements and 0% on the other half. Let us show a generalized version of this statement, where instead of considering the model during the middle of the epoch, we consider the model after seeing an \eta-fraction of the data. The statement below holds for an arbitrary difference in the accuracies of the model. Instead of considering |\lambda'_{i,j}-\lambda''_{i,j}| directly, we first show an auxiliary compression claim involving |\lambda'_{i,j}-\lambda_{i,j}| and |\lambda''_{i,j}-\lambda_{i,j}|. This will immediately allow us to derive the desired result for |\lambda'_{i,j}-\lambda''_{i,j}|, and will also be useful later on. Let us state the following claim.

Claim 3.1.

Let i\in[t], j\in[2,n/b-1], a\in\mathbb{R}, \eta=(j-1)b/n. It holds that:

K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq\begin{cases}n(\log(n/e)-2\eta a^{2})+O(\log n),&\text{if }|\lambda'_{i,j}-\lambda_{i,j}|\geq a\\ n(\log(n/e)-2(1-\eta)a^{2})+O(\log n),&\text{if }|\lambda''_{i,j}-\lambda_{i,j}|\geq a\end{cases}
Proof.

To encode r_{i} we encode it as two separate strings, split at the ((j-1)b)-th index. To encode each such string it is sufficient to encode the sets X_{i,j-1} and X\setminus X_{i,j-1}, accompanied by a permutation for each set. As X is given, we can immediately deduce r_{i} given an ordering of the elements of X. Note that knowing X_{i,j-1} and X we immediately know X\setminus X_{i,j-1}. Thus it is sufficient to only encode X_{i,j-1}. Finally, we need to encode two permutations for the two sets. The number of bits required for the two permutations is bounded by:

\lceil\log(\eta n)!\rceil+\lceil\log((1-\eta)n)!\rceil\leq\eta n\log\frac{\eta n}{e}+(1-\eta)n\log\frac{(1-\eta)n}{e}+O(\log n)
=n(\log\frac{n}{e}+\eta\log\eta+(1-\eta)\log(1-\eta))+O(\log n)=n\log\frac{n}{e}-nh(\eta)+O(\log n)

To encode the set X_{i,j-1} we use Lemma 2.6 with A=X_{i,j-1}, B=X and g(x)=acc(W_{i,j},x). We get that

K(X_{i,j-1}\mid X,acc(W_{i,j},\bullet))\leq n(h(\eta)-2\eta(\lambda'_{i,j}-\lambda_{i,j})^{2})+O(\log n)\leq n(h(\eta)-2\eta a^{2})+O(\log n)

Summing the two expressions we get that

K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq n\log(n/e)-2\eta n(\lambda'_{i,j}-\lambda_{i,j})^{2}+O(\log n)\leq n(\log(n/e)-2\eta a^{2})+O(\log n)

Encoding X\setminus X_{i,j-1} instead of X_{i,j-1} and using an identical argument, we get a similar bound for \lambda''_{i,j}. ∎

Note that in the above, if \eta=(j-1)b/n is too small or too large, we are unable to get a meaningful compression. Let us focus our attention on a sub-interval of j\in[2,n/b-1] where we get a meaningful bound. Let us denote by j_{s}=\lceil\beta n/8b\rceil+1, j_{f}=\lfloor(1-\beta/8)n/b\rfloor+1 the start and finish indices for this interval. We state the following corollary, which completes the proof of the first case.

Corollary 3.2.

If it holds for some epoch i and some j\in[j_{s},j_{f}] that |\lambda'_{i,j}-\lambda''_{i,j}|\geq\beta/4, then K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq n(\log(n/e)-\frac{1}{256}\beta^{3})+O(\log n).

Proof.

Assume w.l.o.g. that \lambda'_{i,j}\leq\lambda''_{i,j}. Note that it holds that \lambda_{i,j}\in[\lambda'_{i,j},\lambda''_{i,j}]. This is because we can write \lambda_{i,j}=q\lambda'_{i,j}+(1-q)\lambda''_{i,j} for some q\in[0,1]. Thus if |\lambda'_{i,j}-\lambda''_{i,j}|\geq\beta/4, then either |\lambda_{i,j}-\lambda''_{i,j}|\geq\beta/8 or |\lambda_{i,j}-\lambda'_{i,j}|\geq\beta/8. We apply Claim 3.1 for either \lambda'_{i,j} or \lambda''_{i,j} with parameter a=\beta/8. Noting that j\in[j_{s},j_{f}], we get that \eta,(1-\eta)\geq\beta/8. Putting everything together we get: K(r_{i}\mid acc(W_{i,j},\bullet),X)\leq n(\log(n/e)-\frac{1}{256}\beta^{3})+O(\log n). ∎

Generalizing Assumption 2

Assumption 2 is only required for this case of the proof, and the proofs of Claim 3.1 and Corollary 3.2 still go through even if we only have a sufficiently good approximation to acc(W_{i,j},\bullet). That is, it suffices that there exists a function g':X\rightarrow\{0,1\} which achieves a sufficiently close approximation to the accuracy of acc(W_{i,j},\bullet) on X\setminus X_{i,j} and X_{i,j} (say, up to an additive \beta/100 factor). Thus, we only need the existence of a compressible function which approximates acc(W_{i,j},\bullet) on X\setminus X_{i,j} and X_{i,j}.

Case 2: |\lambda'_{i,j}-\lambda''_{i,j}| is small

Let us now consider the second case, where |\lambda'_{i,j}-\lambda''_{i,j}| is small throughout the epoch (more precisely, when j\in[j_{s},j_{f}]). Roughly speaking, we would like to get an effect similar to that seen in the introduction, when we assumed we can achieve 100% accuracy for every batch. The reason we managed to achieve compression in that case was that we could use the model to reduce the set of potential elements the batch can be taken from when going backwards and reconstructing the batches. We will show that we can achieve a similar effect here. Assume we have the entire dataset, X, and the model which resulted from a GD step on the (j-1)-th batch. The argument is inductive, thus let us assume that we have reconstructed all batches B_{i,k}, k\geq j. This means we also know X\setminus X_{i,j-1}. Knowing X and X\setminus X_{i,j-1}, we immediately know X_{i,j-1}, the set of all elements which can appear in B_{i,j-1}. We aim to show that the local progress assumption (Assumption 1) implies that (on average) W_{i,j} achieves a slightly higher accuracy on B_{i,j-1} than on X_{i,j-1}, which will allow us to achieve a meaningful compression.
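The following Python sketch (ours; reverse_gd_step and decode_batch are assumed helpers in the spirit of Theorem 2.2 and Algorithm 2, not the paper's code) summarizes this backward reconstruction of a single epoch.

def reconstruct_epoch(X, W_end, codes, reverse_gd_step, decode_batch):
    # Walk backwards through one epoch: decode the last batch against the set of
    # elements not yet accounted for, invert the gradient step to recover the
    # previous model (brute force, as in Theorem 2.2), and repeat.
    available = set(X)            # candidate elements for the not-yet-decoded prefix
    W, batches = W_end, []
    for code in reversed(codes):  # last batch of the epoch first
        B = decode_batch(code, available, W)   # uses W's accuracy split, as in Lemma 2.6
        W = reverse_gd_step(W, B)              # recovers the model before this batch's step
        batches.append(B)
        available.difference_update(B)
    return list(reversed(batches)), W          # batches in forward order, and the epoch's first model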

Let us first show that if the batch is large enough, the accuracy of W_{i,j} on B_{i,j} is sufficiently close to \lambda''_{i,j} (the accuracy of W_{i,j} on X\setminus X_{i,j-1}). We state the following claim:

Claim 3.3.

It holds w.h.p. that:

\forall i\in[t],j\in[2,n/b-1],\; acc(W_{i,j},B_{i,j})\geq\lambda''_{i,j}-\beta/4
Proof.

First let us note that if \lambda''_{i,j}\leq\beta/4, then the claim holds trivially: acc(W_{i,j},B_{i,j})\geq 0\geq\lambda''_{i,j}-\beta/4. So for the rest of the proof let us assume that \lambda''_{i,j}\geq\beta/4.

Let us denote by \{Y_{\ell}\}_{\ell=1}^{b} the binary indicator variables indicating whether the \ell-th element in B_{i,j} was classified correctly by W_{i,j}. As stated before, we can say that elements in B_{i,j} are chosen uniformly at random without replacement from X\setminus X_{i,j-1}. Furthermore, it is important to note that B_{i,j} is independent of W_{i,j}; we can imagine that B_{i,j} is selected after W_{i,j} is determined. We can write acc(W_{i,j},B_{i,j})=\frac{1}{b}\sum_{\ell=1}^{b}Y_{\ell}. Applying a Hoeffding bound (Theorem 2.7) with parameters \mu=\lambda''_{i,j}, k=b, \delta=\beta/4 (note that \delta\leq\mu\leq\mu k, as \lambda''_{i,j}\geq\beta/4), we get that:

Pr[acc(W_{i,j},B_{i,j})\leq\lambda''_{i,j}-\beta/4]\leq e^{-2k\delta^{2}}=e^{-b\frac{\beta^{2}}{8}}

According to Assumption 3 it holds that b\geq\frac{8\ln(tn^{3})}{\beta^{2}}. Thus, we get that:

\forall i\in[t],j\in[2,n/b-1],\; Pr[acc(W_{i,j},B_{i,j})<\lambda''_{i,j}-\beta/4]\leq 1/(n^{3}t)

We take a union bound over all bad events (at most tn) to get that with probability at least 1-1/n^{2} none of the bad events occur throughout the t epochs of the execution. ∎

That is, w.h.p. the accuracy of W_{i,j} on B_{i,j} is at least its accuracy on X\setminus X_{i,j-1} (the set from which B_{i,j} is drawn), up to an additive \beta/4 factor. Next we use the fact that \forall\ell\in[j_{s},j_{f}], |\lambda'_{i,\ell}-\lambda''_{i,\ell}| is small, together with the claim above, to show that the accuracy of W_{i,j} on B_{i,j-1} is also not too far from the accuracy of W_{i,j} on X_{i,j-1}. Finally, using the fact that, on average, W_{i,j} achieves sufficiently higher accuracy on B_{i,j-1} than W_{i,j-1} does, we prove the following claim. Note that this is an auxiliary claim, and the quantity \sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda'_{i,j})^{2} will appear later as the number of bits we manage to save per epoch by using the model at every epoch to encode the batches.

Claim 3.4.

If \forall i\in[t], j\in[j_{s},j_{f}], |\lambda'_{i,j}-\lambda''_{i,j}|\leq\beta/4, then w.h.p.: \forall i\in[t], \sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda'_{i,j})^{2}\geq\frac{n\beta^{2}}{25b}.

Proof.

Recall that \varphi_{i,j}=acc(W_{i,j},B_{i,j-1}). This is the accuracy of the model on the (j-1)-th batch in the i-th epoch after performing the GD step on the batch. Let us consider the following inequality:

1\geq\lambda''_{i,j_{f}}-\lambda''_{i,j_{s}}=\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\lambda''_{i,j-1})
=\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\lambda''_{i,j-1}+\varphi_{i,j}-\varphi_{i,j})=\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\varphi_{i,j})+\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j-1})

Using Claim 3.3 we know that w.h.p:

\sum_{j=j_{s}+1}^{j_{f}}\lambda''_{i,j-1}\leq\sum_{j=j_{s}+1}^{j_{f}}(acc(W_{i,j-1},B_{i,j-1})+\frac{\beta}{4})\leq\sum_{j=j_{s}+1}^{j_{f}}acc(W_{i,j-1},B_{i,j-1})+\frac{\beta n}{4b}

So we can write:

\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j-1})\geq\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\frac{\beta n}{4b}
=\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\sum_{j\in[2,j_{s}]\cup[j_{f}+1,n/b+1]}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\frac{\beta n}{4b}
\geq\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))-\frac{\beta n}{2b}\geq\frac{\beta n}{b}-\frac{\beta n}{2b}=\frac{\beta n}{2b}

Where in the last line we use Assumption 1 together with the fact that (\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))\in[-1,1] and

|[2,j_{s}]|+|[j_{f}+1,n/b+1]|=j_{s}-1+(n/b+1-(j_{f}+1)+1)=j_{s}+n/b-j_{f}
=\lceil(\beta/8)\frac{n}{b}\rceil+1+(n/b-\lfloor\frac{(1-\beta/8)n}{b}\rfloor-1)\leq(\beta/8)\frac{n}{b}+2+(n/b-\frac{(1-\beta/8)n}{b}-2)\leq\frac{\beta n}{4b}

Combining everything together we get that:

1\geq\sum_{j=j_{s}+1}^{j_{f}}(\lambda''_{i,j}-\varphi_{i,j})+\frac{\beta n}{2b}\implies\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j})\geq\frac{\beta n}{2b}-1

Let us use the above to derive a lower bound for \sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j}). Let us write:

\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda''_{i,j})=\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j})+\sum_{j=j_{s}+1}^{j_{f}}(\lambda'_{i,j}-\lambda''_{i,j})
\implies\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j})\geq\frac{\beta n}{2b}-1-\sum_{j=j_{s}+1}^{j_{f}}(\lambda'_{i,j}-\lambda''_{i,j})\geq\frac{\beta n}{4b}-1\geq\frac{\beta n}{5b}

Where in the second to last transition we use the assumption that \forall i\in[t], j\in[j_{s},j_{f}], |\lambda'_{i,j}-\lambda''_{i,j}|\leq\beta/4, and in the final transition we assume n is sufficiently large. Taking the square of both sides we get that: (\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j}))^{2}\geq(\frac{\beta n}{5b})^{2}.

Finally, applying the Cauchy-Schwarz inequality, we get that:

\sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda'_{i,j})^{2}\geq\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j})^{2}\geq\frac{1}{j_{f}-j_{s}}(\sum_{j=j_{s}+1}^{j_{f}}(\varphi_{i,j}-\lambda'_{i,j}))^{2}\geq\frac{b}{(1-\beta/4)n}\cdot(\frac{\beta n}{5b})^{2}\geq\frac{n\beta^{2}}{25b} ∎

To complete the proof of the second case, we use the above claim to show that we can significantly compress all batches; this, in turn, results in a compression of the entire permutation. We state the following claim.

Claim 3.5.

If \forall i\in[t], j\in[j_{s},j_{f}], |\lambda'_{i,j}-\lambda''_{i,j}|\leq\beta/4, then w.h.p.: K(r_{i}\mid W_{i+1,1},X)\leq n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+o(n)

Proof.

Recall that B_{i,j} is the j-th batch in the i-th epoch, and let P_{i,j} be a permutation of B_{i,j} such that the order of the elements in B_{i,j} under P_{i,j} is the same as under r_{i}. Note that given X, if we know the partition into batches and all permutations, we can reconstruct r_{i}. According to Theorem 2.2, given W_{i,j} and B_{i,j-1} we can compute W_{i,j-1}. Let us denote by Y the encoding of this procedure. To implement Y we need to iterate over all possible vectors in \mathbb{F}^{d} and over batch elements to compute the gradients. To express this program we require auxiliary variables of size at most O(\log\min\{d,b\})=O(\log n). Thus it holds that K(Y)=O(\log n). Let us abbreviate B_{i,1},B_{i,2},...,B_{i,j} as (B_{i,k})_{k=1}^{j}. We write the following.

K(r_{i}\mid X,W_{i+1,1})\leq K(r_{i},Y\mid X,W_{i+1,1})+O(1)\leq K(r_{i}\mid X,W_{i+1,1},Y)+K(Y\mid X,W_{i+1,1})+O(1)
\leq O(\log n)+K((B_{i,k})_{k=1}^{n/b}\mid X,W_{i+1,1},Y)+\sum_{j=1}^{n/b}K(P_{i,j})

Let us bound K((B_{i,k})_{k=1}^{n/b}\mid X,W_{i+1,1},Y) by repeatedly using the subadditivity and extra information properties of Kolmogorov complexity.

\displaystyle K((B_{i,k})_{k=1}^{n/b}\mid X,Y,W_{i+1,1})\leq K(B_{i,n/b}\mid X,W_{i+1,1})+K((B_{i,k})_{k=1}^{n/b-1}\mid X,Y,W_{i+1,1},B_{i,n/b})+O(1)
\displaystyle\leq K(B_{i,n/b}\mid X,W_{i+1,1})+K((B_{i,k})_{k=1}^{n/b-1}\mid X,Y,W_{i,n/b},B_{i,n/b})+O(1)
\displaystyle\leq K(B_{i,n/b}\mid X,W_{i+1,1})+K(B_{i,n/b-1}\mid X,W_{i,n/b},B_{i,n/b})
\displaystyle+K((B_{i,k})_{k=1}^{n/b-2}\mid X,Y,W_{i,n/b-1},B_{i,n/b},B_{i,n/b-1})+O(1)
\displaystyle\leq...\leq O(\frac{n}{b})+\sum_{j=1}^{n/b}K(B_{i,j}\mid X,W_{i,j+1},(B_{i,k})_{k=j+1}^{n/b})\leq O(\frac{n}{b})+\sum_{j=1}^{n/b}K(B_{i,j}\mid X_{i,j},W_{i,j+1})

Where in the transitions we used the fact that given W_{i,j},B_{i,j-1} and Y we can retrieve W_{i,j-1}. That is, we can always bound K(...\mid Y,W_{i,j},B_{i,j-1},...) by K(...\mid Y,W_{i,j-1},B_{i,j-1},...)+O(1).
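To make the role of Y concrete, the following is a minimal Python sketch of how the reconstruction in Theorem 2.2 can be phrased as a program; the helper gd_step, the finite parameter domain F and the exhaustive search over F^d are hypothetical stand-ins that mirror the description above, not the paper's actual construction.

    from itertools import product

    def invert_gd_step(W_next, batch, gd_step, F, d):
        """Sketch of the program Y: recover W_{i,j-1} from W_{i,j} (= W_next) and the
        batch B_{i,j-1}. It iterates over every candidate parameter vector in F^d and
        returns the one whose GD step on `batch` yields W_next. Both `gd_step` and the
        finite domain `F` are hypothetical stand-ins for the fixed training procedure
        and parameter representation assumed here."""
        for W in product(F, repeat=d):         # every candidate vector in F^d
            if gd_step(W, batch) == W_next:    # per Theorem 2.2, this identifies W_{i,j-1}
                return W
        raise ValueError("no preimage found")

The running time of this exhaustive search is irrelevant for the argument; all that matters is that Y has a short description, i.e., K(Y)=O(\log n).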

To encode the order P_{i,j} inside each batch, b\log(b/e)+O(\log b) bits are sufficient. Finally we get that: K(r_{i}\mid X,W_{i+1,1})\leq O(\frac{n}{b})+\sum_{j=1}^{n/b}[K(B_{i,j}\mid X_{i,j},W_{i,j+1})+b\log(b/e)+O(\log b)].
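For completeness, the bit count for each P_{i,j} follows from Stirling's approximation: a permutation of the b batch elements can be specified by its index among the b! possible orderings, and

\log(b!)=b\log b-b\log e+O(\log b)=b\log(b/e)+O(\log b)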

Let us now bound K(B_{i,j-1}\mid X_{i,j-1},W_{i,j}). Knowing X_{i,j-1}, we know that B_{i,j-1}\subseteq X_{i,j-1}. Thus, we need to use W_{i,j} to compress B_{i,j-1}. Applying Lemma 2.6 with parameters A=B_{i,j-1},B=X_{i,j-1},\eta=b/n_{j-1},\kappa_{A}=\varphi_{i,j},\kappa_{B}=\lambda^{\prime}_{i,j} and g(x)=acc(W_{i,j},x), we get the following:

\displaystyle K(B_{i,j-1}\mid X_{i,j-1},W_{i,j})\leq n_{j-1}(h(\frac{b}{n_{j-1}})-\frac{2b}{n_{j-1}}(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2})+O(\log n_{j-1})
\displaystyle=n_{j-1}h(\frac{b}{n_{j-1}})-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n_{j-1})

Adding b\log(b/e)+O(\log b) to the above, we get the following bound on every element in the sum:

\displaystyle n_{j-1}h(\frac{b}{n_{j-1}})-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+b\log(b/e)+O(\log b)+O(\log n_{j-1})
\displaystyle\leq b\log\frac{en_{j-1}}{b}+b\log(b/e)+O(\log n_{j-1})-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}
\displaystyle=b\log n_{j-1}-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n_{j-1})

Where in the second line we use Lemma 2.4. Note that the most important term in the sum is -2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}. That is, the more the accuracy of W_{i,j} on the batch B_{i,j-1} differs from the accuracy of W_{i,j} on the set X_{i,j-1} containing the batch, the more efficiently we can represent the batch. Let us now bound the sum \sum_{j=2}^{n/b+1}[b\log n_{j-1}-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n_{j-1})]. Let us first bound the sum over b\log n_{j-1}:

\displaystyle\sum_{j=2}^{n/b+1}b\log n_{j-1}=\sum_{j=1}^{n/b}b\log jb=\sum_{j=1}^{n/b}b(\log b+\log j)
\displaystyle=n\log b+b\log(n/b)!=n\log b+n\log\frac{n}{b\cdot e}+O(\log n)=n\log\frac{n}{e}+O(\log n)

Applying Claim 3.4 we get that 2b\sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}\geq\frac{2n\beta^{2}}{25}. Finally we can write that:

\displaystyle K(r_{i}\mid X,W_{i+1,1})\leq O(\frac{n}{b})+\sum_{j=2}^{n/b+1}[b\log n_{j-1}-2b(\varphi_{i,j}-\lambda^{\prime}_{i,j})^{2}+O(\log n)]
\displaystyle\leq n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+\frac{n}{b}\cdot O(\log n)=n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+o(n)

Where \frac{n}{b}\cdot O(\log n)=o(n) because b=\omega(\log n). ∎

This completes the analysis of the second case. As either case 1 or case 2 holds for every epoch, we conclude that we always manage to compress r_{i}. Combining the claim above with Corollary 3.2, we get:

Corollary 3.6.

It holds w.h.p that \forall i\in[t],K(r_{i}\mid X,W_{i+1,1})\leq n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n).

Proof.

In every epoch either the conditions of Corollary 3.2 or Claim 3.5 hold. If the conditions of Corollary 3.2 hold for some j\in[j_{s},j_{f}] then we can write:

\displaystyle K(r_{i}\mid X,W_{i+1,1})\leq K(r_{i}\mid X,W_{i+1,1},acc(W_{i,j},\bullet))+K(acc(W_{i,j},\bullet)\mid X)+O(1)
\displaystyle\leq\frac{n}{512}\beta^{3}+n(\log(n/e)-\frac{1}{256}\beta^{3})+O(\log n)\leq n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)

Where we use Assumption 2 to bound K(acc(W_{i,j},\bullet)\mid X). If the conditions of Claim 3.5 hold, then we immediately get:

K(r_{i}\mid X,W_{i+1,1})\leq n\log\frac{n}{e}-\frac{2n\beta^{2}}{25}+o(n)\leq n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)

Where the last inequality holds since \frac{2\beta^{2}}{25}\geq\frac{\beta^{3}}{512} (as \beta\leq 1), with the remaining gap absorbing the lower-order terms for sufficiently large n. ∎

To bound t, we combine the fact that for every epoch in which the algorithm does not terminate we manage to significantly compress r_{i} with the fact that r is incompressible. We state the following claim:

Claim 3.7.

It holds w.h.p that t\leq O(\frac{K(X)}{n\beta^{3}}).

Proof.

Similarly to the definition of Y in Claim 3.5, let Y^{\prime} be the program which receives X,r_{i},W_{i+1,1} as input and repeatedly applies Theorem 2.2 to retrieve W_{i,1}. As Y^{\prime} just needs to reconstruct all batches from X,r_{i} and call Y n/b times, it holds that K(Y^{\prime})=O(\log n). Using the subadditivity and extra information properties of K(), together with the fact that W_{1,1} can be reconstructed given X,W_{t+1,1},Y^{\prime}, we write the following:

\displaystyle K(r,W_{1,1})\leq K(r,W_{1,1},X,Y^{\prime},W_{t+1,1})+O(1)\leq K(W_{1,1},X,W_{t+1,1},Y^{\prime})+K(r\mid X,Y^{\prime},W_{t+1,1})+O(1)
\displaystyle\leq K(X,W_{t+1,1})+K(r\mid X,Y^{\prime},W_{t+1,1})+O(\log n)\leq K(X)+d+K(r\mid X,Y^{\prime},W_{t+1,1})+O(\log n)
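The program Y^{\prime} admits a similar minimal sketch. The version below reuses the hypothetical invert_gd_step from the earlier sketch and assumes, purely for illustration, that epoch i processes X in the order given by r_{i} and splits it into consecutive batches of size b.

    def reconstruct_epoch_start(W_next_epoch, r_i, X, gd_step, F, d, b):
        """Sketch of Y': recover W_{i,1} from W_{i+1,1} (= W_next_epoch), the permutation
        r_i and the dataset X, by rebuilding the n/b batches of epoch i and undoing the
        GD steps one at a time via invert_gd_step (sketched earlier). The batch layout
        below (consecutive blocks of size b in the order r_i) is a hypothetical stand-in."""
        ordered = [X[k] for k in r_i]                                # epoch i sees X in the order r_i
        batches = [ordered[s:s + b] for s in range(0, len(ordered), b)]
        W = W_next_epoch                                             # this is W_{i, n/b + 1} = W_{i+1, 1}
        for batch in reversed(batches):                              # undo the most recent GD step first
            W = invert_gd_step(W, batch, gd_step, F, d)
        return W                                                     # W_{i, 1}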

First we note that \forall i\in[t-1],K(r_{i}\mid X,Y^{\prime},W_{i+2,1},r_{i+1})\leq K(r_{i}\mid X,Y^{\prime},W_{i+1,1})+O(1), where we simply execute Y^{\prime} on X,W_{i+2,1},r_{i+1} to get W_{i+1,1}. Let us write:

\displaystyle K(r_{1}r_{2}\dots r_{t}\mid X,Y^{\prime},W_{t+1,1})\leq K(r_{t}\mid X,Y^{\prime},W_{t+1,1})+K(r_{1}r_{2}\dots r_{t-1}\mid X,Y^{\prime},W_{t+1,1},r_{t})+O(1)
\displaystyle\leq K(r_{t}\mid X,W_{t+1,1})+K(r_{1}r_{2}\dots r_{t-1}\mid X,Y^{\prime},W_{t,1})+O(1)
\displaystyle\leq K(r_{t}\mid X,W_{t+1,1})+K(r_{t-1}\mid X,W_{t,1})+K(r_{1}r_{2}\dots r_{t-2}\mid X,Y^{\prime},W_{t-1,1})+O(1)
\displaystyle\leq\dots\leq O(t)+\sum^{t}_{k=1}K(r_{k}\mid X,W_{k+1,1})\leq t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]

Where the last inequality is due to Corollary 3.6. Combining everything together, we get that K(r,W_{1,1})\leq K(X)+d+t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)].

Our proof implies that we can reconstruct not only r, but also W_{1,1}, using X,W_{t+1,1}. Due to the incompressibility of random strings, we get that w.h.p K(r,W_{1,1})\geq d+tn\log(n/e)-O(\log n). Combining the lower and upper bounds for K(r,W_{1,1}) yields the following inequality, from which we bound t.

\displaystyle d+tn\log(n/e)-O(\log n)\leq K(X)+d+t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]
\displaystyle\implies t(\frac{n}{512}\beta^{3}-O(\log n))\leq K(X)\implies t=O(\frac{K(X)}{n\beta^{3}})\qed

Assumptions 1 and 2 need not always hold

Let us consider the case where both Assumption 1 and Assumption 2 only hold for some fraction \gamma of the t epochs that we analyze. Let us denote by J_{g}\subset[t] the set of all good epochs where the assumptions hold (the rest of the epochs are bad epochs). This means that we can no longer achieve compression for bad epochs. However, for every bad epoch i, we can simply encode r_{i} directly. Given r_{i} we can reconstruct all of the batches, and retrieve W_{i,1} from W_{i+1,1} using Theorem 2.2. We can write that:

\displaystyle K(r_{i}\mid X,W_{i+1,1})\leq\begin{cases}n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n),&\text{if }i\in J_{g}\\ n(\log(n/e))+O(\log n),&\text{otherwise}\end{cases}

This allows us to bound t as follows:

\displaystyle d+tn\log(n/e)-O(\log n)
\displaystyle\leq K(X)+d+\gamma t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]+(1-\gamma)t[n(\log(n/e))+O(\log n)]
\displaystyle\implies t(\gamma\frac{n}{512}\beta^{3}-O(\log n))\leq K(X)\implies t=O(\frac{K(X)}{n\gamma\beta^{3}})
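Spelling out the algebra in the derivation above, the \gamma and (1-\gamma) terms on the right-hand side combine as

\gamma t[n(\log(n/e)-\frac{1}{512}\beta^{3})+O(\log n)]+(1-\gamma)t[n(\log(n/e))+O(\log n)]=tn\log(n/e)-\gamma t\frac{n}{512}\beta^{3}+t\cdot O(\log n)

after which d and tn\log(n/e) cancel from both sides, leaving the stated bound on t.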

Note that t also appears in the batch size. By Assumption 5, we know that t=O(\beta^{-3}poly(n)), which in turn means that b=\Theta(\beta^{-2}\log^{2}(n/\beta)) is a sufficient condition for Assumption 5 to hold. We state our main theorem for this section:

Theorem 3.8.

If assumptions (3)-(5) hold throughout the execution of SGD and assumptions (1)-(2) hold for a \gamma-fraction of the epochs, then SGD achieves an accuracy of (1-\epsilon) on X in O(\frac{K(X)}{\gamma n(\epsilon\beta^{\prime})^{3}}) epochs w.h.p.

3.1 Connection between \epsilon and \beta

We note that the local progress assumption does not make sense when the accuracy is too close to 1. That is, if the accuracy is too high, there is no room for improvement. We formalize this intuition in the following claim:

Claim 3.9.

If for some i it holds that \forall j,\lambda_{i,j}\geq(1-\epsilon), then \beta\leq\frac{4}{3}\epsilon(1+\ln(n/b)).

Proof.

Recall the local progress assumption: \frac{b}{n}\sum_{j=2}^{n/b+1}(\varphi_{i,j}-acc(W_{i,j-1},B_{i,j-1}))>\beta. Applying Claim 3.3 we know that w.h.p acc(W_{i,j-1},B_{i,j-1})\geq\lambda^{\prime\prime}_{i,j-1}-\beta/4. So we can write:

\beta<\frac{b}{n}\sum_{j=2}^{n/b+1}(\varphi_{i,j}-\lambda^{\prime\prime}_{i,j-1}+\beta/4)\leq\beta/4+\frac{b}{n}\sum_{j=2}^{n/b+1}(1-\lambda^{\prime\prime}_{i,j-1})

It holds that \lambda_{i,j}\geq(1-\epsilon). Let us now bound \lambda^{\prime\prime}_{i,j} using \lambda_{i,j}. We can write the following for j\in[2,n/b+1]:

\displaystyle n\lambda_{i,j}=\sum_{x\in X}acc(W_{i,j},x)=\sum_{x\in X_{i,j-1}}acc(W_{i,j},x)+\sum_{x\in X\setminus X_{i,j-1}}acc(W_{i,j},x)
\displaystyle=(j-1)b\lambda^{\prime}_{i,j}+(n-(j-1)b)\lambda^{\prime\prime}_{i,j}
\displaystyle\implies\lambda^{\prime\prime}_{i,j}=\frac{n\lambda_{i,j}-(j-1)b\lambda^{\prime}_{i,j}}{n-(j-1)b}\geq\frac{n(1-\epsilon)-(j-1)b}{n-(j-1)b}=1-\frac{\epsilon n}{n-(j-1)b}

Plugging the above into the first inequality we get:

\displaystyle\frac{3\beta}{4}\leq\frac{b}{n}\sum_{j=2}^{n/b+1}\frac{\epsilon n}{n-(j-2)b}\leq\sum_{j=2}^{n/b+1}\frac{\epsilon}{n/b-(j-2)}\leq\epsilon(1+\ln(n/b))

Finally we get \beta\leq\frac{4}{3}\epsilon(1+\ln(n/b)). ∎
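The last transition in the proof uses the standard harmonic-sum bound \sum_{k=1}^{m}1/k\leq 1+\ln m; a quick numerical check of this bound follows, where the value of m is an illustrative placeholder for n/b.

    import math

    # Check the harmonic-sum bound sum_{k=1}^m 1/k <= 1 + ln(m) used in the last transition.
    m = 10_000                                 # illustrative placeholder for n/b
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    assert harmonic <= 1 + math.log(m)
    print(harmonic, 1 + math.log(m))           # roughly 9.79 vs 10.21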

Taking the above into consideration, we parameterize \beta via \epsilon and write \beta=\beta^{\prime}\epsilon. This parameterization is natural in light of Claim 3.9, and it allows the accuracy parameter \epsilon to appear explicitly in our running time bound.
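For concreteness, substituting \beta=\beta^{\prime}\epsilon into the epoch bound of the \gamma-fraction analysis above recovers the running time stated in Theorem 3.8:

t=O(\frac{K(X)}{n\gamma\beta^{3}})=O(\frac{K(X)}{\gamma n(\epsilon\beta^{\prime})^{3}})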
