List and Certificate Complexities in Replicable Learning
Abstract
We investigate replicable learning algorithms. Ideally, we would like to design algorithms that output the same canonical model over multiple runs, even when different runs observe a different set of samples from the unknown data distribution. In general, such a strong notion of replicability is not achievable. Thus we consider two feasible notions of replicability called list replicability and certificate replicability. Intuitively, these notions capture the degree of (non) replicability. We design algorithms for certain learning problems that are optimal in list and certificate complexity. We establish matching impossibility results.
1 Introduction
Replicability and reproducibility in science are critical concerns. The fundamental requirement that scientific results and experiments be replicable/reproducible is central to the development and evolution of science. In recent years, these concerns have grown as several scientific disciplines turn to data-driven research, which enables exponential progress through data democratization and affordable computing resources. The replicability issue has received attention from a wide spectrum of entities, from general media publications (for example, The Economist’s “How Science Goes Wrong” (2013) [eco13]) to scientific publication venues (for example, see [JP05, Bak16]) to professional and scientific bodies such as the National Academy of Sciences, Engineering, and Medicine (NASEM). The emerging challenges to replicability and reproducibility have been discussed in depth by a consensus study report published by NASEM [NAS19].
A broad approach taken to ensure the reproducibility/replicability of algorithms is to make the datasets, algorithms, and code publicly available. Of late, conferences have been hosting replicability workshops to promote best practices and to encourage researchers to share code (for example, see [PVLS+21] and [MPK19]). An underlying assumption is that consistent results can be obtained using the same input data, computational methods, and code. However, this practice alone is insufficient to ensure replicability as modern-day approaches use computations that inherently involve randomness.
Computing over random variables results in a high degree of non-replicability, especially in machine learning tasks. Machine learning algorithms observe samples from a (sometimes unknown) distribution and output a hypothesis or model. Such algorithms are inherently non-replicable. Two distinct runs of the algorithm will output different models as the algorithms see different sets of samples over the two runs. Ideally, to achieve “perfect replicability,” we would like to design algorithms that output the same canonical hypothesis over multiple runs, even when different runs observe a different set of samples from the unknown distribution.
We first observe that perfect replicability is not achievable in learning, as a dependency of the output on the data samples is inevitable. We illustrate this with a simple learning task of estimating the bias of a coin: given $n$ independent tosses from a coin with unknown bias $p$, output an estimate $\hat p$ of $p$. It is relatively easy to argue that there is no algorithm $A$ that outputs a canonical estimate $v_p$ with probability at least $3/4$ so that $|v_p - p| \le \nu$. Consider a sequence of coins with biases $p_1 < p_2 < \cdots < p_m$ where each $p_{i+1} - p_i \le \gamma$ (for some small $\gamma$) but $p_m - p_1 > 2\nu$. For two adjacent biases, the statistical distance (denoted by $d_{TV}$) between $D_{p_i}$ and $D_{p_{i+1}}$ is at most $n\gamma$, where $D_p$ is the distribution of $n$ independent tosses of the coin with bias $p$. Let $v_i$ and $v_{i+1}$ be the canonical estimates output by the algorithm for biases $p_i$ and $p_{i+1}$ respectively. Since $A$ on samples from distribution $D_{p_i}$ outputs $v_i$ with probability at least $3/4$ and $d_{TV}(D_{p_i}, D_{p_{i+1}}) \le n\gamma$, $A$ on samples from $D_{p_{i+1}}$ must output $v_i$ with probability at least $3/4 - n\gamma$. We take $\gamma < \frac{1}{2n}$, so that $3/4 - n\gamma > 1/4$. Since $A$ must output the canonical value $v_{i+1}$ with probability at least $3/4$, this implies that $v_i = v_{i+1}$. Thus, on all biases $p_1, \ldots, p_m$, $A$ should output the same value, and this leads to a contradiction since $p_1$ and $p_m$ are more than $2\nu$ apart. However, it is easy to see that there is an algorithm for bias-estimation that outputs one of two canonical estimates using $O(\frac{1}{\nu^2})$ tosses: estimate the bias within $\nu/2$ and round the value to the closest multiple of $\nu$. The starting point of our work is these two observations. Even though it may not be possible to design learning algorithms that are perfectly replicable, it is possible to design algorithms whose “non-replicability” is minimized.
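The two-canonical-estimates algorithm sketched above can be made concrete (a minimal sketch; the function names are ours):

```python
import random

def round_to_grid(x, nu):
    """Round x to the nearest multiple of nu (deterministic)."""
    return round(x / nu) * nu

def estimate_bias(tosses, nu):
    """Estimate a coin's bias, then snap the estimate to the nu-grid.

    If the empirical mean lands within nu/2 of the true bias p, the
    output is one of the at most two multiples of nu at distance at
    most nu/2 + nu/2 = nu from p, so repeated runs agree up to a
    2-element list.
    """
    emp = sum(tosses) / len(tosses)
    return round_to_grid(emp, nu)

p, nu = 0.37, 0.1
# Every empirical mean within nu/2 of p rounds to one of <= 2 values.
outputs = {round_to_grid(p + e, nu) for e in (-0.05, -0.02, 0.0, 0.02, 0.05)}

random.seed(0)
tosses = [1 if random.random() < p else 0 for _ in range(10000)]
est = estimate_bias(tosses, nu)
```

Note that the degree of non-replicability here is exactly the list size two: which of the two grid points is returned still depends on the sample, but nothing else does.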
Motivated by this, we study two notions of replicability called list-replicability and certificate-replicability in the context of machine learning algorithms. These notions quantify the degree of (non)-replicability. The notions we consider are rooted in the pseudodeterminism literature [GG11, Gol19, GL19], which has been an active area in randomized algorithms and computational complexity theory. Section 2 discusses this connection and other works that are related to ours.
1.1 Our Results
We consider two notions of replicable learning: list-replicable learning and certificate-replicable learning. In list replicability, the learning algorithm should, with high probability, output one of the models from a list of models of size $L$. This means that when the learning algorithm is run multiple times, we see at most $L$ distinct models/hypotheses. The value $L$ is called the list complexity of the algorithm. An algorithm whose list complexity is $1$ is perfectly replicable. Thus list complexity can be considered as the degree of (non) replicability. The goal in this setting is to design learning algorithms that minimize the list complexity $L$.
In certificate replicability, the learning algorithm has access to an $\ell$-bit random string $r$ called the certificate of replicability (that is independent of samples and the other randomness that it may use). We require that for most certificates $r$, the algorithm must output a canonical model that can depend only on this $\ell$-bit random string $r$. Thus once we fix a certificate $r$, multiple runs of the algorithm that have access to $r$ will output the same model w.h.p. We call $\ell$ the certificate complexity. Note that an algorithm with zero certificate complexity is perfectly replicable. Thus $\ell$ is another measure of the degree of (non) replicability of the algorithm. The goal in this setting is to design learning algorithms that minimize $\ell$.
A critical resource in machine learning tasks is the number of samples that the algorithm observes, known as the sample complexity. An efficient learning algorithm uses as few samples as possible. In this work, we measure the efficiency of learning algorithms in terms of sample complexity, list complexity, and certificate complexity. This work initiates a study of learning algorithms that are efficient in certificate/list complexities as well as sample complexity. We establish both positive results and impossibility results for certain learning tasks.
Estimating the bias of coins.
Our first set of results is on efficient replicable algorithms for the coin-bias estimation problem. We consider the problem of estimating the biases of $d$ coins simultaneously by observing $n$ tosses of each of the coins, which we call the $d$-Coin Bias Estimation Problem. The task is to output a bias vector $\tilde p$ so that $\|\tilde p - p\|_\infty \le \nu$, where $p$ is the true bias vector. We show that there is a $(d+1)$-list replicable learning algorithm for this problem with a sample complexity (number of coin tosses) of $O\left(\frac{d^2}{\nu^2}\log\frac{d}{\delta}\right)$ per coin. We also design an $\lceil \log\frac{d}{\rho} \rceil$-certificate replicable algorithm for the problem with sample complexity $O\left(\frac{d^2}{\rho^2\nu^2}\log\frac{d}{\delta}\right)$ per coin. Here $1-\delta$ is the success probability and $\rho$ is the replicability error parameter.
We also establish the optimality of the above upper bounds in terms of list and certificate complexities. We show that there is no $d$-list replicable learning algorithm for the $d$-Coin Bias Estimation Problem. This leads to a lower bound of $\log(d+1)$ on its certificate complexity. For establishing this impossibility result we use a version of the KKM/Sperner lemma.
PAC learning.
We establish possibility and impossibility results for list/certificate replicable algorithms in the context of PAC learning. We show that any concept class that can be learned using $d$ non-adaptive statistical queries can be learned by a $(d+1)$-list replicable PAC learning algorithm with sample complexity $O\left(\frac{d^2}{\tau^2}\log\frac{d}{\delta}\right)$, where $\tau$ is the statistical query tolerance parameter. We also show that such concept classes admit an $\lceil \log\frac{d}{\rho} \rceil$-certificate replicable PAC learning algorithm with sample complexity $O\left(\frac{d^2}{\rho^2\tau^2}\log\frac{d}{\delta}\right)$. Finally, we establish tight results in the PAC learning model. In particular, we prove that the concept class of $d$-dimensional thresholds does not admit a $d$-list replicable learning algorithm under the uniform distribution. Since we can learn $d$-dimensional thresholds under the uniform distribution using $d$ non-adaptive statistical queries, we get a $(d+1)$-list replicable PAC algorithm under the uniform distribution. This yields matching upper and lower bounds on the list complexity of PAC learning of thresholds under the uniform distribution.
2 Prior and Related Work
Formalizing reproducibility and replicability has gained considerable momentum in recent years. While the terms reproducibility and replicability are very close and often used interchangeably, there has been an effort to distinguish between them and accordingly, our notions fall in the replicability definition [PVLS+21].
In the context of randomized algorithms, various notions of reproducibility/replicability have been investigated. The work of Gat and Goldwasser [GG11] formalized and defined the notion of pseudodeterministic algorithms. A randomized algorithm $A$ is pseudodeterministic if, for any input $x$, there is a canonical value $v_x$ such that $A(x)$ outputs $v_x$ with high probability. Gat and Goldwasser designed polynomial-time pseudodeterministic algorithms for algebraic computational problems, such as finding quadratic non-residues and finding non-roots of multivariate polynomials [GG11]. Later works studied the notion of pseudodeterminism in other algorithmic settings, such as parallel computation, streaming and sub-linear algorithms, interactive proofs, and its connections to complexity theory [GG, GGH18, OS17, OS18, AV20, GGMW20, LOS21, DPVWV22].
In the algorithmic setting, mainly two generalizations of pseudodeterminism have been investigated: multi-pseudodeterministic algorithms [Gol19] and influential-bit algorithms [GL19]. A randomized algorithm $A$ is $k$-pseudodeterministic if, for every input $x$, there is a set $S_x$ of size at most $k$ such that the output of $A(x)$ belongs to $S_x$ with high probability. When $k = 1$, we get pseudodeterminism. A randomized algorithm $A$ is an $\ell$-influential-bit algorithm if, for every input $x$, for most strings $r$ of length $\ell$, there exists a canonical value $v_{x,r}$ such that the algorithm on inputs $x$ and $r$ outputs $v_{x,r}$ with high probability. The string $r$ is called the influential-bit string. Again, when $\ell = 0$, we get back pseudodeterminism. The main focus of these works has been to investigate reproducibility in randomized search algorithms.
Very recently, pseudodeterminism and its generalizations have been explored in the context of learning algorithms to formalize the notion of replicability. In particular, the work of Impagliazzo, Lei, Pitassi, and Sorrell [ILPS22] introduced the notion of $\rho$-replicability. A learning algorithm $A$ is $\rho$-replicable if $\Pr_{S_1, S_2, r}[A(S_1; r) = A(S_2; r)] \ge 1 - \rho$, where $S_1$ and $S_2$ are samples drawn i.i.d. from a distribution $D$ and $r$ is the internal randomness of the learning algorithm $A$. They designed replicable algorithms for many learning tasks, including statistical queries, approximate heavy hitters, median, and learning half-spaces.
Another line of recent work connects replicable learning to differentially private learning. In a recent seminal work, Bun, Livny, and Moran [BLM20] showed that every concept class with finite Littlestone dimension can be learned by an approximate differentially private algorithm. This, together with an earlier work of Alon, Livny, Malliaris, and Moran [ALMM19], establishes an equivalence between online learnability and differentially private PAC learnability. Rather surprisingly, the proof of [BLM20] uses the notion of “global stability,” which is similar to the notion of pseudodeterminism in the context of learning. They define a learning algorithm $A$ to be $(m, \eta)$-globally stable with respect to a distribution $D$ if there is a hypothesis $h$ such that $\Pr_{S \sim D^m}[A(S) = h] \ge \eta$. They showed that any concept class with finite Littlestone dimension has an $(m, \eta)$-globally stable learning algorithm for some sample size $m$ and stability parameter $\eta > 0$ depending on the Littlestone dimension, where the error of $h$ (with respect to the unknown concept) is at most $\nu$. Then they established that a globally stable learner implies a differentially private learner. The notion of globally stable learning is the perfect replicability that we discuss in the introduction when $\eta$ is close to $1$. Thus, as discussed in the introduction, it follows that designing globally stable algorithms with large $\eta$ (say $\eta \ge 3/4$) is not possible, even for the simple task of estimating the bias of a coin. The work of Ghazi, Kumar, and Manurangsi [GKM21] extended the notion of global stability to pseudo-global stability and list-global stability. The notion of pseudo-global stability is very similar to the earlier-mentioned notion of influential-bit algorithms of Grossman and Liu [GL19] when translated to the context of learning. Similarly, list-global stability is similar to Goldreich’s notion of multi-pseudodeterminism [Gol19]. It is also known that the notions of pseudo-global stability and $\rho$-replicability are the same up to polynomial factors in the parameters [ILPS22, GKM21]. The work of [GKM21] uses these notions to design user-level private learning algorithms.
Our notion of list replicability is inspired by the notion of multi-pseudodeterminism and the notion of certificate replicability is inspired by the notion of influential-bit algorithms. In the learning setting, our notion of list replicability is similar to the notion of list-global stability and the notion of certificate replicability is similar to the notion of pseudo-global stability which in turn is similar to the notion of -replicability of [ILPS22].
We introduce the notions of list and certificate complexities that measure the degree of (non) replicability. Our goal is to design learning algorithms with optimal list and certificate complexities while minimizing the sample complexity. The earlier works did not focus on minimizing these quantities. The works of [BLM20, GKM21] used replicable algorithms as an intermediate step to design differentially private algorithms. The work of [ILPS22] did not consider reducing the certificate complexity in their algorithms and also did not study list replicability.
The study of various notions of reproducibility/replicability in various computational fields is an emerging topic. In [EKK+23], the authors consider replicability in the context of stochastic bandits. Their notion is similar to the notion studied in [ILPS22]. In [AJJ+22], the authors investigate reproducibility (see [PVLS+21] for a discussion on reproducibility versus replicability) in the context of optimization with inexact oracles (initialization/gradient oracles). The setup and focus of these works are different from ours.
3 Primary Lemmas
In this section, we state a few lemmas, building on the work of [VWDP+22] and [DLPES02], that will be useful for the algorithmic constructions and impossibility results in the remainder of the paper.
3.1 Partitions and Rounding
In this subsection, we define a deterministic rounding function that we will use repeatedly. This rounding function is based on the notion of secluded partitions defined in the work of [VWDP+22].
We will use the following notation. We use $\mathrm{diam}_\infty(S)$ to indicate the diameter of a set $S \subseteq \mathbb{R}^d$ relative to the $\ell_\infty$ norm and $B_\infty(x, \epsilon)$ to represent the closed ball of radius $\epsilon$ centered at $x$ relative to the $\ell_\infty$ norm. That is, in $\mathbb{R}^d$ we have $B_\infty(x, \epsilon) = \{y \in \mathbb{R}^d : \|y - x\|_\infty \le \epsilon\}$.
Let $\mathcal{P}$ be a partition of $\mathbb{R}^d$. For a point $x \in \mathbb{R}^d$ and $\epsilon > 0$, we use $\mathcal{P}_\epsilon(x)$ to denote the set $\{X \in \mathcal{P} : X \cap B_\infty(x, \epsilon) \ne \emptyset\}$ of members intersected by the $\epsilon$-ball around $x$.
Definition 3.1.
Let $\mathcal{P}$ be a partition of $\mathbb{R}^d$. We say that $\mathcal{P}$ is $(k, \epsilon)$-secluded if for every point $x \in \mathbb{R}^d$, $|\mathcal{P}_\epsilon(x)| \le k$.
The following theorem is from [VWDP+22].
Theorem 3.2.
For each $d \in \mathbb{N}$, there exists a $\left(d+1, \frac{1}{2d}\right)$-secluded partition of $\mathbb{R}^d$, where each member of the partition is a unit hypercube. Moreover, the partition is efficiently computable: given an arbitrary point $x \in \mathbb{R}^d$, the description of the partition member to which $x$ belongs can be computed in time polynomial in $d$.
Definition 3.3 (-Deterministic Rounding).
A deterministic rounding scheme is a family of functions $\{f_d\}_{d \in \mathbb{N}}$ where $f_d : \mathbb{R}^d \to \mathbb{R}^d$. We call $\{f_d\}$ a $(k, \epsilon)$-deterministic rounding scheme if (1) for all $x \in \mathbb{R}^d$, $\|f_d(x) - x\|_\infty \le 1$ (the bound of $1$ is not critical; we can use any constant and scale the parameters appropriately), and (2) for all $x \in \mathbb{R}^d$, $\left|f_d(B_\infty(x, \epsilon))\right| \le k$.
The following observation is from [VWDP+22].
Observation 3.4 (Equivalence of Rounding Schemes and Partitions).
A $(k, \epsilon)$-deterministic rounding scheme induces, for each $d$, a $(k, \epsilon)$-secluded partition of $\mathbb{R}^d$ (the fibers $f_d^{-1}(v)$) in which each member has diameter at most $2$. Conversely, a sequence of partitions $\{\mathcal{P}_d\}_{d \in \mathbb{N}}$, where each $\mathcal{P}_d$ is $(k, \epsilon)$-secluded and contains only members of diameter at most $1$, induces a $(k, \epsilon)$-deterministic rounding scheme by mapping each point to a fixed representative of its member.
3.2 A Universal Rounding Algorithm for List Replicability
In this subsection, we will design a deterministic algorithm that will be used as a sub-routine in our list-replicable algorithms.
Lemma 3.5.
Let $d \in \mathbb{N}$ and $\nu \in (0,1)$, and let $\epsilon = \frac{\nu}{d+1}$. There is an efficiently computable function $f : \mathbb{R}^d \to \mathbb{R}^d$ with the following two properties:
1. For any $x \in \mathbb{R}^d$ and any $\hat x \in B_\infty(x, \epsilon)$ it holds that $\|f(\hat x) - x\|_\infty \le \nu$.
2. For any $x \in \mathbb{R}^d$ the set $f(B_\infty(x, \epsilon))$ has cardinality at most $d+1$.
Informally, these two conditions are: (1) if $\hat x$ is an $\epsilon$-approximation of $x$, then $f(\hat x)$ is a $\nu$-approximation of $x$, and (2) $f$ maps every $\epsilon$-approximation of $x$ to one of at most $d+1$ possible values.
Proof.
Let $\mathcal{P}$ be a $(d+1, \frac{1}{2d})$-secluded partition from Theorem 3.2 and $\hat f$ the associated deterministic rounding function due to Observation 3.4. The partition consists of translates of the unit cube with the property that for any point $y \in \mathbb{R}^d$ the closed cube of side length $\frac{1}{d}$ centered at $y$ (i.e. $B_\infty(y, \frac{1}{2d})$) intersects at most $d+1$ members/cubes in $\mathcal{P}$. The associated rounding function $\hat f$ maps each point of $\mathbb{R}^d$ to the center point of the unique cube in $\mathcal{P}$ which contains it. This means $\hat f$ has the following two properties (which are closely connected to the two properties we want of $f$): (1) $\|\hat f(y) - y\|_\infty \le \frac{1}{2}$ for every $y$ (because every point is mapped to the center of its containing unit cube) and (2) for any point $y$, the set $\hat f(B_\infty(y, \frac{1}{2d}))$ has cardinality at most $d+1$ (because $B_\infty(y, \frac{1}{2d})$ intersects at most $d+1$ members of $\mathcal{P}$). Define $f$ by $f(x) = \alpha\,\hat f(x/\alpha)$ where $\alpha = 2d\epsilon = \frac{2d\nu}{d+1}$. The efficient computability of $f$ comes from the efficient computability of $\hat f$ (i.e. the ability to efficiently compute the center of the unit cube in $\mathcal{P}$ which contains a given point).
To see that $f$ has property (1), let $x \in \mathbb{R}^d$ and $\hat x \in B_\infty(x, \epsilon)$. Then we have the following (justifications will follow):
$$\left\|\hat f\left(\tfrac{\hat x}{\alpha}\right) - \tfrac{x}{\alpha}\right\|_\infty \le \left\|\hat f\left(\tfrac{\hat x}{\alpha}\right) - \tfrac{\hat x}{\alpha}\right\|_\infty + \left\|\tfrac{\hat x}{\alpha} - \tfrac{x}{\alpha}\right\|_\infty \le \frac{1}{2} + \frac{\epsilon}{\alpha} = \frac{1}{2} + \frac{1}{2d} = \frac{d+1}{2d} = \frac{\nu}{\alpha}.$$
The first step is the triangle inequality; the second uses the property of $\hat f$ that points are not mapped a distance more than $\frac{1}{2}$, along with the hypothesis that $\|\hat x - x\|_\infty \le \epsilon$ and the scaling of norms; the remaining equalities use the definition $\alpha = 2d\epsilon$ and the fact that $\epsilon = \frac{\nu}{d+1}$.
Scaling both sides by $\alpha$ and using the scaling of norms, the above gives us $\|f(\hat x) - x\|_\infty \le \nu$, which proves property (1) of the lemma.
To see that $f$ has property (2), let $x \in \mathbb{R}^d$. We have the following set of equalities:
$$f\big(B_\infty(x, \epsilon)\big) = \alpha\,\hat f\left(\tfrac{1}{\alpha} B_\infty(x, \epsilon)\right) = \alpha\,\hat f\left(B_\infty\left(\tfrac{x}{\alpha}, \tfrac{\epsilon}{\alpha}\right)\right) = \alpha\,\hat f\left(B_\infty\left(\tfrac{x}{\alpha}, \tfrac{1}{2d}\right)\right).$$
The first equality is from the definition of $f$, the second is from re-scaling, and the third is from the definition of $\alpha$.
Because $\hat f$ takes on at most $d+1$ distinct values on $B_\infty(\frac{x}{\alpha}, \frac{1}{2d})$, the set $f(B_\infty(x, \epsilon))$ has cardinality at most $d+1$, which proves property (2) of the lemma. ∎
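The general $d$-dimensional partition behind Lemma 3.5 is the involved staircase construction of [VWDP+22], but in one dimension the unit intervals already form a $(2, \frac{1}{2})$-secluded partition, so the lemma can be illustrated directly. A minimal sketch (assuming the scaling $\epsilon = \nu/2$ and $\alpha = 2\epsilon$ for $d = 1$, consistent with $\epsilon = \nu/(d+1)$; names are ours):

```python
import math

def make_rounder(nu):
    """1-D instance of Lemma 3.5 (d = 1, assumed eps = nu/2, alpha = 2*eps).

    hat_f rounds a real to the center of its unit interval [j, j+1);
    the unit intervals form a (2, 1/2)-secluded partition of the line.
    f scales this rounding down by alpha, so any eps-ball maps to at
    most d + 1 = 2 canonical values, each within nu of the true point.
    """
    eps = nu / 2.0
    alpha = 2.0 * eps

    def hat_f(y):                 # center of the unit interval containing y
        return math.floor(y) + 0.5

    def f(x):                     # scaled rounding function
        return alpha * hat_f(x / alpha)

    return f, eps

f, eps = make_rounder(0.1)
x = 0.4173
# All eps-approximations of x land on at most two canonical values.
approxes = [x + t * eps for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
outputs = {f(xh) for xh in approxes}
```

Both lemma properties can be checked on this instance: the output set has size at most $2$, and every output is within $\nu$ of $x$.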
3.3 A Universal Rounding Algorithm for Certificate Replicability
For designing certificate replicable learning algorithms we will use a general randomized procedure which is stated in the following lemma. This lemma uses a randomized rounding scheme. Similar randomized rounding schemes have been used in a few prior works [SZ99, DPV18, Gol19, GL19, ILPS22].
Lemma 3.6.
Let $d \in \mathbb{N}$, $\nu \in (0,1)$, and $\ell \in \mathbb{N}$. There is an efficiently computable deterministic function $f : \{0,1\}^\ell \times \mathbb{R}^d \to \mathbb{R}^d$ with the following property. For any $x \in \mathbb{R}^d$,
$$\Pr_{r}\Big[\exists\, q_{x,r}\ \forall \hat x \in B_\infty(x, \nu):\ f(r, \hat x) = q_{x,r}\ \text{and}\ \|q_{x,r} - x\|_\infty \le (2^\ell + 1)\nu\Big] \ge 1 - \frac{d}{2^\ell},$$
where $r$ is chosen uniformly from $\{0,1\}^\ell$ and $q_{x,r}$ depends only on $x$ and $r$.
Proof.
Partition each coordinate of $\mathbb{R}$ into intervals using a grid of width $w = 2^{\ell+1}\nu$, shifted by a random offset. The algorithm computing the function $f$ does the following simple randomized rounding:
The function $f$: Choose a random integer $r \in \{0, 1, \ldots, 2^\ell - 1\}$. Note that $r$ can be represented using $\ell$ bits. Consider the $i$th coordinate of $\hat x$, denoted by $\hat x_i$. Round $\hat x_i$ to the nearest $y$ in the shifted grid $G_r = \{2\nu r + jw : j \in \mathbb{Z}\}$.
Now we will prove that $f$ satisfies the required properties.
First, we prove the approximation guarantee. Let $q$ denote the point in $\mathbb{R}^d$ obtained after rounding each coordinate of $\hat x$. The points of $G_r$ are $w = 2^{\ell+1}\nu$ apart. Therefore, each coordinate is moved by rounding by at most $\frac{w}{2} = 2^\ell\nu$. That is, for every $i$, $|q_i - \hat x_i| \le 2^\ell\nu$. Since $\hat x$ is a $\nu$-approximation (i.e. each coordinate $\hat x_i$ is within $\nu$ of the true value $x_i$), each coordinate $q_i$ is within $(2^\ell + 1)\nu$ of $x_i$. Therefore $q$ is a $(2^\ell + 1)\nu$-approximation of $x$, for any choice of $r$.
Now we establish that for a $1 - \frac{d}{2^\ell}$ fraction of $r \in \{0, \ldots, 2^\ell - 1\}$, there exists $q_{x,r}$ such that every $\hat x \in B_\infty(x, \nu)$ is rounded to $q_{x,r}$. We argue this with respect to each coordinate and apply the union bound. Fix an $x$ and a coordinate $i$. For $x_i$, consider the interval $(x_i - \nu, x_i + \nu)$ around it.
Consider an offset $r$ from $\{0, \ldots, 2^\ell - 1\}$. When this $r$ is chosen, we round each value to the closest point of $G_r$. Let $M_r$ denote the set of midpoints between consecutive grid points; more precisely, $M_r = \{2\nu r + (j + \frac{1}{2})w : j \in \mathbb{Z}\}$. Note that two values are rounded to different points of $G_r$ exactly when some midpoint of $M_r$ separates them. We call $r$ ‘bad’ for $x_i$ if $x_i$ is close to some midpoint; that is, $r$ is ‘bad’ if there exists $m \in M_r$ with $|m - x_i| < \nu$. Note that for a bad $r$ there may exist $\hat x_i$ and $\hat x_i'$ in $(x_i - \nu, x_i + \nu)$ whose coordinates are rounded to different grid points. The crucial point is that if $r$ is ‘not bad’ for $x_i$, then the entire interval $(x_i - \nu, x_i + \nu)$ lies between two consecutive midpoints, so there exists a canonical value to which every $\hat x_i \in (x_i - \nu, x_i + \nu)$ is rounded. We call $r$ bad for $x$ if there exists at least one coordinate $i$ such that $r$ is bad for $x_i$. With this, it follows that if $r$ is not bad for $x$, then there exists a canonical $q_{x,r}$ such that every $\hat x \in B_\infty(x, \nu)$ is rounded to $q_{x,r}$.
With this, the goal is to bound the probability that a randomly chosen $r$ is bad for $x$. For this, we first bound the probability that $r$ is bad for $x_i$. We will argue that there exists at most one bad $r$ for $x_i$. Suppose that there exist two distinct offsets $r$ and $r'$ that are both bad for $x_i$. This means that $|m - x_i| < \nu$ and $|m' - x_i| < \nu$ for some $m \in M_r$ and $m' \in M_{r'}$. Thus by the triangle inequality $|m - m'| < 2\nu$. However, note that $m - m' = 2\nu\big((r - r') + j \cdot 2^\ell\big)$ for some integer $j$. Since $0 < |r - r'| < 2^\ell$, the factor $(r - r') + j \cdot 2^\ell$ is a nonzero integer. This implies that the absolute value of the difference between $m$ and $m'$ is at least $2\nu$, leading to a contradiction.
Thus the probability that $r$ is bad for $x_i$ is at most $\frac{1}{2^\ell}$ and by the union bound the probability that $r$ is bad for $x$ is at most $\frac{d}{2^\ell}$. This completes the proof. ∎
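The shifted-grid rounding used in the proof can be sketched directly (assuming grid width $2^{\ell+1}\nu$ and offsets spaced $2\nu$ apart, as above; names are ours). For a fixed $x$, at most one offset $r$ per coordinate is bad, i.e., rounds the interval $(x_i - \nu, x_i + \nu)$ inconsistently:

```python
def shifted_round(r, x_hat, nu, ell):
    """The randomized-rounding function f(r, x_hat) of Lemma 3.6 (a sketch).

    Grid width w = 2^(ell+1) * nu; the offset r in {0, ..., 2^ell - 1}
    shifts the grid by 2 * nu * r. Each coordinate of x_hat is rounded
    to the nearest shifted grid point.
    """
    w = (2 ** (ell + 1)) * nu
    shift = 2 * nu * r
    return tuple(round((xi - shift) / w) * w + shift for xi in x_hat)

# For a fixed x, count offsets r for which nu-perturbations of x can
# round to different points; the proof says at most one bad r per coordinate.
x, nu, ell = (0.3173,), 0.01, 4
bad = 0
for r in range(2 ** ell):
    outs = {shifted_round(r, (x[0] + t,), nu, ell) for t in (-nu, 0.0, nu)}
    if len(outs) > 1:
        bad += 1
```

The approximation guarantee also holds for every offset: each rounded coordinate stays within $(2^\ell + 1)\nu$ of the true value.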
3.4 A Consequence of Sperner’s Lemma/KKM Lemma
The following result is a corollary to some cubical variants of Sperner’s lemma/KKM lemma initially developed in [DLPES02] and expanded on in [VWDP+22]. The statement and proof of this result are quite similar to those of Theorem 9.4 (Second Optimality Theorem) in [VWDP+22], except that it is stated here in the form that we need.
Lemma 3.7.
Let $\mathcal{P}$ be a partition of $[0,1]^d$ such that for each member $X \in \mathcal{P}$, it holds that $\mathrm{diam}_\infty(X) < 1$. Then there exists $x \in [0,1]^d$ such that for all $\epsilon > 0$ we have that $B_\infty(x, \epsilon)$ intersects at least $d+1$ members of $\mathcal{P}$.
4 Replicability of Learning Coin Biases
In this section, we establish replicability results for estimating biases of coins.
Definition 4.1.
The $d$-Coin Bias Estimation Problem is the following problem: design an algorithm $A$ (possibly randomized) that, given $\nu$, $\delta$, and $n$ independent tosses (for each coin) of an ordered collection of $d$-many biased coins with a bias vector $p \in [0,1]^d$, outputs $\tilde p$ so that $\|\tilde p - p\|_\infty \le \nu$ with probability at least $1 - \delta$.
Definition 4.2.
We say an algorithm $A$ for the $d$-Coin Bias Estimation Problem is $L$-list replicable if for any bias vector $p \in [0,1]^d$ and parameters $\nu, \delta$, there is a set $S_p$ and an integer $n$ such that $|S_p| \le L$ and $A$, on input $\nu$ and $\delta$ and $n$ independent tosses (per coin) according to the bias vector $p$, outputs an estimate $\tilde p \in S_p$ with $\|\tilde p - p\|_\infty \le \nu$, with probability at least $1 - \delta$. $n$ is the sample complexity of $A$ and $L$ is the list complexity of $A$.
Definition 4.3.
We say an algorithm $A$ for the $d$-Coin Bias Estimation Problem is $\ell$-certificate replicable if for any bias vector $p \in [0,1]^d$ and parameters $\nu, \delta, \rho$: on inputs $\nu$, $\delta$, $\rho$, a certificate string $r \in \{0,1\}^\ell$, and $n$ independent coin tosses (per coin) according to $p$, the outputs of $A$ satisfy the following:
$$\Pr_{r \in \{0,1\}^\ell}\Big[\exists\, q_{p,r}:\ \Pr\big[A \text{ outputs } q_{p,r} \text{ and } \|q_{p,r} - p\|_\infty \le \nu\big] \ge 1 - \delta\Big] \ge 1 - \rho.$$
In the above, the inner probability is taken over the internal randomness of $A$ and the coin tosses. Algorithm $A$ also gets $r$ as an input (in addition to the other inputs). We call $n$ the sample complexity and the number of random bits $\ell$ the certificate complexity of $A$.
We note that in a coarse sense, we can convert list replicable algorithms to certificate replicable algorithms and vice versa. However, such transformations will result in a degradation of sample complexity, which is a concern of this paper.
In the following, the output of an algorithm $A$ for the $d$-Coin Bias Estimation Problem is denoted as $\tilde p$.
4.1 Replicable Algorithms
We present two algorithms for the $d$-Coin Bias Estimation Problem. The first one is $(d+1)$-list replicable and the second one is $\lceil \log\frac{d}{\rho} \rceil$-certificate replicable.
Theorem 4.4.
There exists a $(d+1)$-list replicable algorithm for the $d$-Coin Bias Estimation Problem with sample complexity $O\left(\frac{d^2}{\nu^2}\log\frac{d}{\delta}\right)$ (per coin).
Proof.
Note that when $\nu \ge \frac{1}{2}$, a trivial algorithm that outputs a vector with $\frac{1}{2}$ in each component works. Thus the most interesting case is when $\nu < \frac{1}{2}$. Our list replicable algorithm is described in Algorithm 1.
So we will prove its correctness by giving, for each possible bias vector $p$, a set $S_p$ with the three necessary properties: (1) $|S_p| \le d+1$, (2) $S_p \subseteq B_\infty(p, \nu)$ (and also the problem-specific restriction that $S_p \subseteq [0,1]^d$), and (3) when given access to coins of biases $p$, with probability at least $1 - \delta$ the algorithm returns a value in $S_p$.
Assume notation from Algorithm 1. Let $S_p = \{\Pi(f(\hat p)) : \hat p \in B_\infty(p, \epsilon)\}$, where $\epsilon = \frac{\nu}{d+1}$ and $\Pi$ is the coordinate restriction function of Algorithm 1 that clamps each coordinate to $[0,1]$. By Lemma 3.5, $f$ takes on at most $d+1$ values on $B_\infty(p, \epsilon)$ (which means $\Pi \circ f$ also takes on at most $d+1$ values on this ball), which proves that $|S_p| \le d+1$. This proves property (1).
Next we state the following observation, which says that the coordinate restriction function of Algorithm 1 does not reduce approximation quality. The proof is relatively straightforward, but tedious.
Observation 4.5.
Using the notation of Algorithm 1, if $\|f(\hat p) - p\|_\infty \le \nu$ then $\|\Pi(f(\hat p)) - p\|_\infty \le \nu$.
Proof.
Let $y = f(\hat p)$ and suppose $\|y - p\|_\infty \le \nu$. We must show for each coordinate $i$ that $|\Pi(y)_i - p_i| \le \nu$. Note that
$$\Pi(y)_i = \min(\max(y_i, 0), 1),$$
so we proceed with three cases.
Case 1: $y_i < 0$.
In this case, $\Pi(y)_i = 0$, so because $p_i \ge 0$ we have $|\Pi(y)_i - p_i| = p_i$. Also, because $|y_i - p_i| \le \nu$, we have $p_i - y_i \le \nu$, and since $y_i < 0$, we get $p_i < p_i - y_i \le \nu$. Thus, we have $|\Pi(y)_i - p_i| \le \nu$ as desired.
Case 2: $y_i > 1$.
This case is symmetric to Case 1.
Case 3: $0 \le y_i \le 1$.
In this case $\Pi(y)_i = y_i$, so we are done. ∎
We now establish property (2). We know from Lemma 3.5 that for each $\hat p \in B_\infty(p, \epsilon)$ we have $\|f(\hat p) - p\|_\infty \le \nu$, and by Observation 4.5, $\Pi$ maintains this quality and we have $\|\Pi(f(\hat p)) - p\|_\infty \le \nu$. This shows that $S_p \subseteq B_\infty(p, \nu) \cap [0,1]^d$, proving property (2).
By Chernoff bounds, for a single biased coin, $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta'}\right)$ independent samples of the coin are enough that with probability at least $1 - \delta'$, the empirical bias is within $\epsilon$ of the true bias. Thus, by a union bound with $\delta' = \frac{\delta}{d}$, if we take that many samples of each of the $d$ coins, there is a probability of at most $\delta$ that at least one of the empirical coin biases is not within $\epsilon$ of the true bias. Thus, by taking $O\left(\frac{1}{\epsilon^2}\log\frac{d}{\delta}\right) = O\left(\frac{d^2}{\nu^2}\log\frac{d}{\delta}\right)$ samples of each coin, we have with probability at least $1 - \delta$ that the empirical bias vector $\hat p$ belongs to $B_\infty(p, \epsilon)$. In the case that this occurs, we have by definition of $S_p$ that the value $\Pi(f(\hat p))$ returned by the algorithm belongs to the set $S_p$. This proves property (3). ∎
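As an end-to-end illustration of Algorithm 1 in the simplest case $d = 1$ (a sketch under our assumed scaling $\epsilon = \nu/(d+1)$; for general $d$ the rounding function comes from the secluded partition of [VWDP+22]):

```python
import math, random

def clamp(v):
    """Coordinate restriction to [0, 1] (the role of Observation 4.5)."""
    return min(max(v, 0.0), 1.0)

def algorithm1_single_coin(tosses, nu):
    """d = 1 instance of the list-replicable estimator (a sketch).

    eps = nu / (d + 1) = nu / 2; the empirical bias is rounded with the
    scaled 1-D rounding function of Lemma 3.5 and then clamped to [0, 1].
    """
    eps = nu / 2.0
    alpha = 2.0 * eps     # grid width of the scaled unit-interval partition
    emp = sum(tosses) / len(tosses)
    return clamp(alpha * (math.floor(emp / alpha) + 0.5))

# Repeated runs on fresh samples land in a (d + 1) = 2 element list.
random.seed(3)
p, nu = 0.276, 0.1
results = set()
for _ in range(15):
    tosses = [1 if random.random() < p else 0 for _ in range(20000)]
    results.add(algorithm1_single_coin(tosses, nu))
```

Each run sees an entirely different sample, yet the set of observed outputs stays within the two canonical values, each $\nu$-close to the true bias.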
Theorem 4.6.
There is an $\lceil \log\frac{d}{\rho} \rceil$-certificate replicable algorithm for the $d$-Coin Bias Estimation Problem with sample complexity $O\left(\frac{d^2}{\rho^2\nu^2}\log\frac{d}{\delta}\right)$ per coin.
Proof.
Let $\nu, \delta, \rho$ be the input parameters to the algorithm and $p$ the bias vector. Set $\ell = \lceil \log\frac{d}{\rho} \rceil$ and $\nu_0 = \frac{\nu}{2^\ell + 1}$. The algorithm first estimates the bias of each coin up to $\nu_0$ with probability error parameter $\frac{\delta}{d}$ per coin using a standard estimation algorithm. Note that this can be done using $O\left(\frac{1}{\nu_0^2}\log\frac{d}{\delta}\right) = O\left(\frac{d^2}{\rho^2\nu^2}\log\frac{d}{\delta}\right)$ tosses per coin. Let $\hat p$ be the estimated vector. It follows that $\|\hat p - p\|_\infty \le \nu_0$ with probability at least $1 - \delta$. Then it runs the deterministic function $f$ described in Lemma 3.6 on input $\hat p$ with accuracy parameter $\nu_0$ and certificate $r \in \{0,1\}^\ell$, and outputs the value of the function. Lemma 3.6 guarantees that for a $1 - \frac{d}{2^\ell} \ge 1 - \rho$ fraction of the $r$'s, all $\hat p \in B_\infty(p, \nu_0)$ get rounded to the same value by $f$, and this value is within $(2^\ell + 1)\nu_0 = \nu$ of $p$. Hence the algorithm satisfies the requirements of certificate replicability. The certificate complexity is $\ell = \lceil \log\frac{d}{\rho} \rceil$. ∎
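A hedged sketch of this certificate-replicable estimator (the helper name and the parameter choices $\nu_0 = \nu/(2^\ell + 1)$ and grid width $2^{\ell+1}\nu_0$ are our reconstruction): two runs that share the certificate $r$ round independent empirical estimates on the same shifted grid, so for all but roughly a $d/2^\ell$ fraction of certificates they return identical vectors.

```python
import random

def certificate_replicable_biases(r, ell, toss_lists, nu):
    """Sketch of the Theorem 4.6 algorithm for d coins.

    r in {0, ..., 2^ell - 1} is the certificate. Each coin's bias is
    estimated empirically to accuracy about nu0 = nu / (2^ell + 1), then
    every coordinate is rounded to the grid of width 2^(ell+1) * nu0
    shifted by 2 * nu0 * r (the function f of Lemma 3.6).
    """
    nu0 = nu / (2 ** ell + 1)
    w = (2 ** (ell + 1)) * nu0
    shift = 2 * nu0 * r
    out = []
    for tosses in toss_lists:
        emp = sum(tosses) / len(tosses)
        out.append(round((emp - shift) / w) * w + shift)
    return out

random.seed(1)
p, nu, ell = [0.3, 0.7], 0.2, 3
r = random.randrange(2 ** ell)   # the shared certificate
runs = []
for _ in range(2):               # two runs, fresh samples, same certificate
    toss_lists = [[1 if random.random() < pi else 0 for _ in range(20000)]
                  for pi in p]
    runs.append(certificate_replicable_biases(r, ell, toss_lists, nu))
```

Regardless of whether the drawn certificate happens to be a good one, every output coordinate is guaranteed to be within $\nu$ of the corresponding true bias.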
Note that an $\ell$-certificate replicable algorithm leads to a $2^\ell$-list replicable algorithm. Thus Theorem 4.6 gives an $O\left(\frac{d}{\rho}\right)$-list replicable algorithm for the $d$-Coin Bias Estimation Problem with sample complexity $O\left(\frac{d^2}{\rho^2\nu^2}\log\frac{d}{\delta}\right)$. However, this is clearly sub-optimal, and Theorem 4.4 gives an algorithm with much smaller list and sample complexities.
4.2 An Impossibility Result
Theorem 4.7.
For $\nu < \frac{1}{2}$, there does not exist a $d$-list replicable algorithm for the $d$-Coin Bias Estimation Problem.
Before proving the theorem, we need the following lemma.
Lemma 4.8.
For biases $p, q \in [0,1]^d$ we have $d_{TV}\left(D_p^{\otimes n}, D_q^{\otimes n}\right) \le n \sum_{i=1}^{d} |p_i - q_i|$, where $D_p$ is the distribution of one toss of each coin and $D_p^{\otimes n}$ that of $n$ tosses of each coin; consequently, the same bound holds for the output distributions of any randomized algorithm $A$ given such samples.
Proof.
We can view the model as the algorithm $A$ having access to a single draw from a distribution. The distribution giving one sample flip of each coin in a collection with bias vector $p$ is the $d$-fold product of Bernoulli distributions (which for notational brevity we denote as $D_p = \bigotimes_{i=1}^d \mathrm{Bern}(p_i)$), so the distribution which gives $n$ independent flips of each coin is the $n$-fold product of this (using notation of [Can15], written as $D_p^{\otimes n}$).
Comparing the distributions of independent flips of the coins for bias $p$ as compared to bias $q$, we have for each $i$ that
$$d_{TV}\big(\mathrm{Bern}(p_i), \mathrm{Bern}(q_i)\big) = |p_i - q_i|,$$
so by C.1.2 and C.1.3 of [Can15] we have
$$d_{TV}(D_p, D_q) \le \sum_{i=1}^{d} |p_i - q_i|$$
and
$$d_{TV}\left(D_p^{\otimes n}, D_q^{\otimes n}\right) \le n \sum_{i=1}^{d} |p_i - q_i|.$$
Because the output of $A$ is a randomized function of one draw of this distribution, by D.1.2 of [Can15] we have that $A$ cannot serve to increase the total variation distance, so
$$d_{TV}\big(A(D_p^{\otimes n}),\, A(D_q^{\otimes n})\big) \le n \sum_{i=1}^{d} |p_i - q_i|,$$
which completes the proof. ∎
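The bound of Lemma 4.8 can be checked numerically for small $n$ and $d$ by computing the total variation distance between the two product distributions exactly (a sketch; the function name is ours):

```python
from itertools import product

def tv_product_bernoulli(p, q, n):
    """Exact TV distance between n flips of each of d coins, biases p vs q.

    An outcome is a tuple of d * n bits: n flips of coin 1, then n flips
    of coin 2, etc. TV distance is half the L1 distance between the two
    outcome distributions.
    """
    d = len(p)

    def prob(bias, outcome):
        pr = 1.0
        for i in range(d):
            for b in outcome[i * n:(i + 1) * n]:
                pr *= bias[i] if b else 1.0 - bias[i]
        return pr

    return sum(abs(prob(p, o) - prob(q, o))
               for o in product([0, 1], repeat=d * n)) / 2.0

p, q, n = [0.30, 0.60], [0.32, 0.57], 3
tv = tv_product_bernoulli(p, q, n)
bound = n * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

The exact distance is strictly below the subadditivity bound $n\sum_i |p_i - q_i|$, as the lemma requires.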
Proof of Theorem 4.7.
Fix any $\nu < \frac{1}{2}$.
Suppose for contradiction that a $d$-list replicable algorithm $A$ does exist for this $\nu$, with some sample complexity $n$ and failure parameter $\delta$ chosen small enough that $\frac{1-\delta}{d} > \frac{1}{d+1}$ (e.g. $\delta = \frac{1}{2(d+1)}$). This means that for each possible bias vector $x \in [0,1]^d$, there exists some set $H_x \subseteq B_\infty(x, \nu)$ (not necessarily unique, but consider some fixed one) with $|H_x| \le d$ such that with probability at least $1 - \delta$, $A$ returns an element of $H_x$. By the trivial averaging argument, this means that there exists at least one element in $H_x$ which is returned by $A$ with probability at least $\frac{1-\delta}{d}$. Let $g$ be a function which maps each bias vector $x$ to such an element of $H_x$.
Since $\frac{1-\delta}{d} > \frac{1}{d+1}$, let $\eta > 0$ be such that $\frac{1-\delta}{d} \ge \frac{1}{d+1} + 2\eta$.
The function $g$ induces a partition $\mathcal{P}$ of $[0,1]^d$ where the members of $\mathcal{P}$ are the fibers of $g$ (i.e. $\mathcal{P} = \{g^{-1}(v) : v \in \mathrm{range}(g)\}$). By definition, for any member $X \in \mathcal{P}$ there exists some $v$ such that for all $x \in X$, $g(x) = v$. By definition of $g$, we have $v \in H_x \subseteq B_\infty(x, \nu)$, showing that $\|v - x\|_\infty \le \nu$ and by symmetry $X \subseteq B_\infty(v, \nu)$. This shows that $\mathrm{diam}_\infty(X) \le 2\nu$, so $\mathrm{diam}_\infty(X) < 1$.
Let $\epsilon = \frac{\eta}{nd}$. Since every member of $\mathcal{P}$ has diameter less than $1$, by Lemma 3.7 there exists a point $x^* \in [0,1]^d$ such that $B_\infty(x^*, \epsilon)$ intersects at least $d+1$ members of $\mathcal{P}$. Let $x_1, \ldots, x_{d+1}$ be points belonging to distinct members of $\mathcal{P}$ that all belong to $B_\infty(x^*, \epsilon)$. By definition of $\mathcal{P}$, this means for distinct $j, j'$ that $g(x_j) \ne g(x_{j'})$.
Now, for each $j$, because $\|x_j - x^*\|_\infty \le \epsilon$, by Lemma 4.8 we have $d_{TV}\big(A(D_{x_j}^{\otimes n}), A(D_{x^*}^{\otimes n})\big) \le nd\epsilon \le \eta$. However, this gives rise to a contradiction: the probability that $A$ with access to coins of biases $x_j$ returns $g(x_j)$ is at least $\frac{1-\delta}{d} \ge \frac{1}{d+1} + 2\eta$, so by the bound on the total variation distance, $A$ with access to coins of biases $x^*$ must return $g(x_j)$ with probability at least $\frac{1}{d+1} + \eta$. This holds for each of the $d+1$ distinct values $g(x_1), \ldots, g(x_{d+1})$, which is a contradiction because a distribution cannot have $d+1$ disjoint events that each have probability greater than $\frac{1}{d+1}$. ∎
We conclude this section by noting that the above impossibility result implies a lower bound on certificate complexity for the $d$-Coin Bias Estimation Problem. Since an $\ell$-certificate replicable algorithm yields a $2^\ell$-list replicable algorithm, it follows that there is no $\ell$-certificate replicable algorithm for the $d$-Coin Bias Estimation Problem with $2^\ell \le d$. In particular, any $\ell$-certificate replicable algorithm requires $\ell \ge \log(d+1)$. Hence our algorithms for the $d$-Coin Bias Estimation Problem have optimal list and certificate complexities.
5 Replicability in PAC Learning
In this section, we establish replicability results for the PAC model. First, we define the PAC learning model.
Let $\mathcal{H}$ be a (hypothesis) class of Boolean functions over a domain $X$, and let $D$ be a distribution over $X$. For a function $c : X \to \{0,1\}$, let $D_c$ be the distribution over $X \times \{0,1\}$ that is obtained by sampling an element $x$ according to $D$ and outputting $(x, c(x))$. For a hypothesis $h$, its error with respect to $D_c$, denoted by $\mathrm{err}_{D,c}(h)$, is $\Pr_{x \sim D}[h(x) \ne c(x)]$.
Definition 5.1.
A hypothesis class (or concept class) $\mathcal{H}$ is PAC learnable with sample complexity $m(\nu, \delta)$ if there is a learning algorithm $A$ with the following property: for every $c \in \mathcal{H}$, for every distribution $D$ over $X$, and for all $\nu, \delta \in (0,1)$: on inputs $\nu$, $\delta$, and a sample $S$ drawn i.i.d. according to $D_c$ where $|S| \ge m(\nu, \delta)$, $A$ outputs a hypothesis $h$ so that $\mathrm{err}_{D,c}(h) \le \nu$ with probability at least $1 - \delta$.
We show that every concept class that can be learned via a statistical query learning algorithm has a replicable PAC learning algorithm. We first define the notion of learning with statistical queries [Kea98].
Definition 5.2.
A statistical query oracle $\mathrm{STAT}_{c,D}(\tau)$ takes as input a real-valued function $\phi : X \times \{0,1\} \to [0,1]$ and returns an estimate $v$ such that
$$\big|v - \mathbb{E}_{x \sim D}[\phi(x, c(x))]\big| \le \tau.$$
Definition 5.3.
We say that an algorithm $A$ learns a concept class $\mathcal{H}$ via statistical queries if for every distribution $D$ and every function $c \in \mathcal{H}$, for every $\nu$, there exists a tolerance $\tau$ such that the algorithm $A$ on input $\nu$, with $\mathrm{STAT}_{c,D}(\tau)$ as an oracle, outputs a hypothesis $h$ such that $\mathrm{err}_{D,c}(h) \le \nu$. The concept class is non-adaptively learnable if all the queries made by $A$ are non-adaptive.
5.1 Notions of Replicable PAC Learning
We now define the notions of list and certificate replicable learning in the PAC model.
Definition 5.4.
Let $\mathcal{H}$ be a hypothesis class. We say that $\mathcal{H}$ is $L$-list replicably learnable if there is an algorithm $A$ with the following properties. For every $c \in \mathcal{H}$, for every distribution $D$ over $X$, and for every $\nu$ and $\delta$, there is a list $\mathcal{L}$ of size at most $L$ consisting of $\nu$-approximate hypotheses (each $h \in \mathcal{L}$ satisfies $\mathrm{err}_{D,c}(h) \le \nu$) such that on inputs $\nu$ and $\delta$ and $m$ samples drawn from $D_c$, $A$ outputs a hypothesis from $\mathcal{L}$ with probability at least $1 - \delta$. We call $m$ the sample complexity of the learning algorithm and $L$ its list complexity.
Next we define certificate replicability. This is very close to the notion of replicability given in [ILPS22]. However, our main concern is the amount of randomness needed by the algorithm to make it perfectly replicable.
Definition 5.5.
Let H be a concept class. We say that H is (k, ε, δ)-certificate replicably learnable if the following holds. There is a learning algorithm A such that for every c ∈ H and for every distribution D over X, A gets the following inputs: ε, δ, m examples drawn i.i.d. from D_c, and a random string r ∈ {0, 1}^k.
We call m the sample complexity and k, the number of random bits, the certificate complexity of A.
The definition can be further explained as follows. For the algorithm A, we say that r is a “certificate of replicability” if A, when run with independent examples from D_c and r as input, outputs a canonical ε-approximate hypothesis with probability at least 1 − δ. Then the above definition demands that at least a 1 − δ fraction of the strings r ∈ {0, 1}^k are certificates of replicability.
5.2 Replicable Algorithms
Theorem 5.6.
Let H be a concept class that is learnable with d non-adaptive statistical queries. Then H is (d + 1)-list replicably learnable. Furthermore, the sample complexity of the (d + 1)-list replicable algorithm is polynomial in d and 1/τ and logarithmic in 1/δ, where τ is the approximation error parameter of each statistical query oracle.
Proof.
The proof is very similar to the proof of Theorem 4.4. Our replicable algorithm works as follows. Let ε and δ be the input parameters, let D be a distribution, and let c ∈ H. Let A be the statistical query learning algorithm for H that outputs a hypothesis with approximation error ε, and let τ be the approximation error parameter of its statistical query oracle. Let φ₁, …, φ_d be the statistical queries made by A.
The algorithm first estimates the values E_{(x,b)∼D_c}[φᵢ(x, b)], 1 ≤ i ≤ d, up to an approximation error of τ′ < τ, with success probability 1 − δ/d for each query. Note that this can be done by a simple empirical estimation algorithm that, by a Hoeffding and union bound argument, uses a total of O((d/τ′²) log(d/δ)) samples. Let v be the vector of estimates. It follows that v is within τ′ of the vector of true expectations (in the ℓ∞ norm) with probability at least 1 − δ.
Now the algorithm evaluates the deterministic rounding function f from Lemma 3.5 on input v. Let w be the output vector. Now the algorithm simulates the statistical query algorithm A with wᵢ as the answer to the query φᵢ. By Lemma 3.5, w is within τ of the vector of true expectations, so the error of the hypothesis output by the algorithm is at most ε. Since A is a deterministic algorithm, the number of possible outputs depends only on the number of outputs of the function f; more precisely, the number of possible outputs is the size of the set of values that f takes in a τ′-neighborhood of the true expectation vector, which is at most d + 1 by Lemma 3.5. Thus the total number of possible outputs of the algorithm is at most d + 1, with probability at least 1 − δ. ∎
We note that we can also simulate a statistical query algorithm that makes adaptive queries to get a 2^d-list replicable learning algorithm. This can be done by rounding the answer to each query to one of two possible values (the approximation error increases by a factor of 2). The sample complexity of this algorithm is again polynomial in the relevant parameters and can be improved by using techniques from adaptive data analysis [BNS+21].
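The two-value rounding idea can be sketched as follows (our own illustrative Python; the numbers are hypothetical): rounding every estimate to the nearest multiple of 2τ ensures that all estimates within τ of the true value land on at most two grid points, so each of the d queries contributes a factor of at most 2 to the list size.

```python
def round_to_grid(v, tau):
    """Round an estimate v to the nearest multiple of 2*tau."""
    return 2 * tau * round(v / (2 * tau))

# All estimates within tau of the true value land on at most two grid
# points, so each query contributes a factor of at most 2 to the list size.
tau, true_value = 0.1, 0.37
outputs = {round_to_grid(true_value + e, tau)
           for e in (-0.09, -0.05, 0.0, 0.05, 0.09)}
assert len(outputs) <= 2
```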
Next, we design a certificate replicable algorithm for hypothesis classes that admit non-adaptive statistical query learning algorithms.
Theorem 5.7.
Let H be a concept class that is learnable with d non-adaptive statistical queries. Then H is certificate replicably learnable. Furthermore, the sample complexity of this algorithm is polynomial in d and 1/τ and logarithmic in 1/δ, where τ is the approximation error parameter of each statistical query oracle.
Proof.
The proof is very similar to the proof of Theorem 4.6. Our replicable algorithm works as follows. Let ε and δ be the input parameters, let D be a distribution, and let c ∈ H. Let A be the statistical query learning algorithm for H that outputs a hypothesis with approximation error ε, and let τ be the approximation error parameter of its statistical query oracle. Let φ₁, …, φ_d be the statistical queries made by A.
The algorithm first estimates the values E_{(x,b)∼D_c}[φᵢ(x, b)], 1 ≤ i ≤ d, up to an approximation error of τ′ < τ, with success probability 1 − δ/d for each query. Note that this can be done by a simple empirical estimation algorithm that uses a total of O((d/τ′²) log(d/δ)) samples. Let v be the vector of estimates. It follows that v is within τ′ of the vector of true expectations with probability at least 1 − δ. Now the algorithm evaluates the deterministic function from Lemma 3.6 on inputs v and the certificate randomness r. By Lemma 3.6, for at least a 1 − δ fraction of the r's, the function outputs a canonical vector w. Now the algorithm simulates the statistical query algorithm A with wᵢ as the answer to the query φᵢ. Since A is a deterministic algorithm, it follows that our algorithm is certificate replicable. Finally, note that the certificate complexity is the number of random bits consumed by the function from Lemma 3.6. ∎
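The role of the certificate randomness can be illustrated in one dimension (a sketch of our own, with hypothetical constants, in the spirit of the randomized rounding of Lemma 3.6): round estimates to a grid shifted by a shared random offset; for all but a small fraction of offsets, every sufficiently accurate estimate rounds to the same canonical value.

```python
import random

def shifted_round(v, tau, offset):
    """Round v to the nearest point of the grid offset + 2*tau*Z."""
    return offset + 2 * tau * round((v - offset) / (2 * tau))

random.seed(1)
tau, tau_prime, true_value = 0.1, 0.01, 0.4237
good, trials = 0, 1000
for _ in range(trials):
    offset = random.uniform(0, 2 * tau)  # shared certificate randomness
    # Three runs whose estimates differ from the truth by at most tau_prime.
    outputs = {shifted_round(true_value + e, tau, offset)
               for e in (-tau_prime, 0.0, tau_prime)}
    good += (len(outputs) == 1)
# For roughly a 1 - tau_prime/tau fraction of offsets, all runs agree
# on a single canonical value.
assert good / trials > 0.8
```

An offset is a "certificate of replicability" exactly when no grid boundary falls within tau_prime of the true value.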
As before we consider the case when the statistical query algorithm makes adaptive queries. The proof of the following theorem appears in the appendix.
Theorem 5.8.
Let H be a concept class that is learnable with d adaptive statistical queries. Then H is certificate replicably learnable. Furthermore, the sample complexity of this algorithm is polynomial in d and 1/τ and logarithmic in 1/δ, where τ is the approximation error parameter of each statistical query oracle.
Proof Sketch.
The proof uses similar arguments as before. The main difference is that we evaluate each query with an approximation error of τ′ and with an error probability of δ/d. This requires O((1/τ′²) log(d/δ)) samples per query. We use a fresh set of certificate randomness for each such evaluation. Note that the length of the certificate for each query is the number of random bits consumed by the rounding function, so the total certificate complexity is d times the per-query certificate length. ∎
5.3 Impossibility Results in the PAC Model
In this section, we establish matching upper and lower bounds for the d-Threshold Estimation Problem in the PAC model with respect to the uniform distribution. We establish that this problem admits a (d + 1)-list replicable algorithm and does not admit a d-list replicable algorithm.
Problem 5.9 (d-Threshold Estimation Problem).
Fix some d ≥ 1. Let X = [0, 1]^d. For each value v ∈ [0, 1]^d (which happens to be the same set as the domain X), let h_v : X → {0, 1} be the function defined by h_v(x) = 1 if xᵢ ≤ vᵢ for all 1 ≤ i ≤ d, and h_v(x) = 0 otherwise.
This is the function that determines whether each coordinate of its input is less than or equal to the corresponding threshold specified by v. Let H_d be the hypothesis class consisting of all such threshold functions: H_d = {h_v : v ∈ [0, 1]^d}.
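In code, the threshold hypothesis h_v from the definition above can be written as follows (our own Python sketch):

```python
def threshold_hypothesis(v):
    """h_v(x) = 1 iff every coordinate of x is at most the corresponding
    threshold in v."""
    return lambda x: int(all(xi <= vi for xi, vi in zip(x, v)))

h = threshold_hypothesis([0.5, 0.7])
assert h([0.4, 0.6]) == 1  # both coordinates are below their thresholds
assert h([0.6, 0.6]) == 0  # the first coordinate exceeds 0.5
```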
We first observe the impossibility of list-replicable algorithms in the general PAC model. This follows from known results.
Observation 5.10.
Let ℓ be any constant. There is no ℓ-list replicable algorithm for the d-Threshold Estimation Problem in the general PAC model, even when d = 1.
Proof.
From the works of [BLM20] and [ALMM19], it follows that a class has finite Littlestone dimension if and only if there exists a constant ℓ such that the concept class has an ℓ-list replicable algorithm in the PAC model. Since the concept class of the d-Threshold Estimation Problem has infinite Littlestone dimension even when d = 1, the observation follows. ∎
The above result rules out list-replicable algorithms in the general PAC model. In the rest of this section, we study the replicable learnability of the d-Threshold Estimation Problem in the PAC model under the uniform distribution. We establish matching upper and lower bounds on the list complexity.
Theorem 5.11.
In the PAC model under the uniform distribution, there is a (d + 1)-list replicable algorithm for the d-Threshold Estimation Problem.
Proof.
It is known and easy to see that the d-Threshold Estimation Problem is learnable under the uniform distribution by making d non-adaptive statistical queries. Thus by Theorem 5.6, the d-Threshold Estimation Problem admits a (d + 1)-list replicable algorithm. ∎
We next establish that the above result is tight by proving that there is no d-list replicable algorithm in the PAC model under the uniform distribution.
Theorem 5.12.
For ℓ ≤ d, there does not exist an ℓ-list replicable algorithm for the d-Threshold Estimation Problem in the PAC model under the uniform distribution.
The proof that for ℓ ≤ d there is no algorithm which ℓ-list replicably learns the d-Threshold Estimation Problem in the PAC model is similar to the proof of Theorem 4.7. The reason is that sampling d biased coins with biases b₁, …, b_d is similar to obtaining a point x uniformly at random from [0, 1]^d and evaluating the threshold function h_b on it; this corresponds to asking whether all of the coins came up heads (1's). The two models differ, though, because in the sample model for the d-Coin Bias Estimation Problem the algorithm sees for each coin whether it is heads or tails, but this information is not available in the PAC model for the d-Threshold Estimation Problem. Conversely, in the PAC model for the d-Threshold Estimation Problem a random draw from [0, 1]^d is available to the algorithm, but in the sample model for the d-Coin Bias Estimation Problem the algorithm does not get this information.
Furthermore, there is the following additional complexity in the impossibility result for the d-Threshold Estimation Problem. In the d-Coin Bias Estimation Problem, we said by definition that a collection of coins parameterized by bias vector p was an ε-approximation to a collection of coins parameterized by bias vector q if and only if ‖p − q‖∞ ≤ ε, and we used this norm in applying the results of [VWDP+22]. However, the notion of ε-approximation in the PAC model is quite different from this. It is possible to have hypotheses h_u and h_v in the d-Threshold Estimation Problem such that ‖u − v‖∞ is large, but with respect to some distribution D on the domain we have Pr_{x∼D}[h_u(x) ≠ h_v(x)] = 0. For example, if D is the uniform distribution on [0, 1]^d, u is the zero vector, and v = e₁ is the first standard basis vector, then ‖u − v‖∞ = 1, but Pr_{x∼D}[h_u(x) ≠ h_v(x)] = 0, because h_u(x) ≠ h_v(x) if and only if all of the last d − 1 coordinates of x are 0 and the first coordinate is positive, and there is probability 0 of sampling such an x from the uniform distribution on [0, 1]^d.
For this reason, we cannot just partition [0, 1]^d as we did in the proof of Theorem 4.7 and must do something more clever. It turns out that it is possible to find a subcube of [0, 1]^d on which hypotheses parameterized by vectors on opposite faces of this cube have high PAC error between them. A consequence, by the triangle inequality, is that two such hypotheses cannot both be approximated by a common third hypothesis. That is what the following lemma states.
Lemma 5.13.
Let ε > 0 and let Q ⊆ [0, 1]^d be an appropriate axis-aligned subcube. Let u, v ∈ Q be such that there exists a coordinate i where uᵢ and vᵢ lie on opposite faces of Q (i.e., u and v are on opposite faces of this cube). Then there is no point z such that both err(h_z, h_u) ≤ ε and err(h_z, h_v) ≤ ε (i.e., there is no hypothesis which is an ε-approximation to both h_u and h_v).
Proof.
Let z′ be an auxiliary point, which will serve as a proxy for z.
We need the following claim.
Claim 5.14.
For each x, the following are equivalent:
1.
2. and
3. and for all , .
Furthermore, the above equivalent conditions imply the following:
4. .
Proof of Claim.
Note that because , we have for all that . If then for some it must be that , but since it would also be the case that , so which gives the contradiction that . Thus , and since we have .
Case 1: and for all , . In this case, so and for all , so , so .
Case 2: and for all , . In this case, because and we have and also for all other , (by definition of ). Thus .
Case 3: For some , . In this case, because , we have . Thus .
Thus, it is the case that if and only if and for all , .
With this claim in hand, our next step will be to prove the following two inequalities:
For the second of these inequalities, note that by the (1) ⇒ (4) part of the claim above, we have
Now, for the first of the inequalities above, we will use the (1) ⇒ (3) portion of the claim, our two hypotheses on u, v, and z, and the α-lemma. Utilizing these, we get the following:
Thus, we get the desired two inequalities:
This nearly completes the proof. If there existed some point z such that both err(h_z, h_u) ≤ ε and err(h_z, h_v) ≤ ε, then it would follow from the triangle inequality that
err(h_u, h_v) ≤ err(h_u, h_z) + err(h_z, h_v) ≤ 2ε,
but this would contradict the above inequalities, so no such z exists. ∎
Equipped with the above lemma, we are now ready to prove Theorem 5.12.
Proof of Theorem 5.12.
Fix any ℓ ≤ d, and choose suitable values of the parameters ε and δ. We use a constant side length and consider the corresponding subcube Q ⊆ [0, 1]^d. We will also consider only the uniform distribution over [0, 1]^d.
Suppose for contradiction that such an algorithm A does exist for some ℓ ≤ d. This means that for each possible threshold v, there exists some set H_v of hypotheses with three properties: (1) each element of H_v is an ε-approximation to h_v, (2) |H_v| ≤ ℓ, and (3) with probability at least 1 − δ, A returns an element of H_v.
By a trivial averaging argument, this means that there exists at least one element of H_v which is returned by A with probability at least (1 − δ)/ℓ. Let f be a function which maps each threshold v ∈ Q to such an element of H_v. This is slightly different from the proof of Theorem 4.7, because we are defining the function on only a very specific subset of the possible thresholds. The reason for this was alluded to in the discussion following the statement of Theorem 5.12.
Since each f(v) is a threshold hypothesis, let g(v) be a point such that f(v) = h_{g(v)}.
The function f induces a partition P of Q, where the members of P are the fibers of f (i.e., P = {f⁻¹(h) : h ∈ range(f)}). For any member S ∈ P and any coordinate i, it cannot be that S contains points on both opposite faces of Q in coordinate i: if it did, then there would be two points u, v ∈ S on opposite faces, and because they both belong to S, there is some hypothesis h such that f(u) = f(v) = h; but by definition of the partition, h would have to be an ε-approximation (in the PAC model) of both h_u and h_v, which is not possible by Lemma 5.13.
Thus, the partition P is a “non-spanning” partition of Q as in the proof of 3.7, so there is some point x* such that for every radius r > 0, the ball of radius r around x* intersects at least d + 1 members of P.
Similar to 4.8 and how it is used in the proof of Theorem 4.7, we can use the following two facts. First, the function defined by v ↦ h_v is continuous (with respect to the ℓ∞ norm on the domain). Second, the other relevant function is continuous (with respect to the notion of distance on its domain). A consequence is that the composition of the two is continuous. Thus, we can find some radius r > 0 for which the desired closeness condition holds.
Now we get the same type of contradiction as in the proof of Theorem 4.7: for the special point x*, the output of A is a distribution that has d + 1 disjoint events that each have probability greater than 1/(d + 1), which is impossible. Thus, no d-list replicable algorithm exists.
∎
6 Conclusions
In this work, we investigated the pressing issue of replicability in machine learning from an algorithmic point of view. We observed that replicability in the absolute sense is difficult to achieve. Hence we considered two natural relaxations that capture the degree of (non) replicability: list and certificate replicability. We designed replicable algorithms with small list, certificate, and sample complexities for the d-Coin Bias Estimation Problem and for the class of problems that can be learned via statistical query algorithms making non-adaptive statistical queries. We also established certain impossibility results in the PAC model of learning and for the d-Coin Bias Estimation Problem. There are several interesting research directions that emerge from our work. There is a gap between the sample complexities of our list replicable and certificate replicable algorithms with comparable parameters. Is this gap inevitable? Currently, there is an exponential gap in the replicability parameters between hypothesis classes that can be learned via non-adaptive and via adaptive statistical queries. Is this gap necessary? A generic question is to explore the trade-offs among sample complexity, list complexity, certificate complexity, and adaptivity versus non-adaptivity.
7 Acknowledgements
We thank an anonymous reviewer for suggesting Observation 5.10. Pavan’s work is partly supported by NSF award 2130536. Part of the work was done when Pavan was visiting Simons Institute for the Theory of Computing. Vander Woude and Vinodchandran’s work is partly supported by NSF award 2130608.
References
- [AJJ+22] Kwangjun Ahn, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, and Gil I. Shamir. Reproducibility in optimization: Theoretical framework and limits, 2022. arXiv:2202.04598.
- [ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852–860. ACM, 2019.
- [AV20] Nima Anari and Vijay V. Vazirani. Matching is as easy as the decision problem, in the NC model. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference, ITCS 2020, January 12-14, 2020, Seattle, Washington, USA, volume 151 of LIPIcs, pages 54:1–54:25. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
- [Bak16] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533:452–454, 2016.
- [BLM20] Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 389–402. IEEE, 2020.
- [BNS+21] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan R. Ullman. Algorithmic stability for adaptive data analysis. SIAM J. Comput., 50(3), 2021.
- [Can15] Clément L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Electron. Colloquium Comput. Complex., TR15-063, 2015. URL: https://eccc.weizmann.ac.il/report/2015/063, arXiv:TR15-063.
- [DLPES02] Jesus A. De Loera, Elisha Peterson, and Francis Edward Su. A Polytopal Generalization of Sperner’s Lemma. Journal of Combinatorial Theory, Series A, 100(1):1–26, October 2002. URL: https://www.sciencedirect.com/science/article/pii/S0097316502932747, doi:10.1006/jcta.2002.3274.
- [DPV18] Peter Dixon, Aduri Pavan, and N. V. Vinodchandran. On pseudodeterministic approximation algorithms. In Igor Potapov, Paul G. Spirakis, and James Worrell, editors, 43rd International Symposium on Mathematical Foundations of Computer Science, MFCS 2018, August 27-31, 2018, Liverpool, UK, volume 117 of LIPIcs, pages 61:1–61:11. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
- [DPVWV22] Peter Dixon, Aduri Pavan, Jason Vander Woude, and N. V. Vinodchandran. Pseudodeterminism: promises and lowerbounds. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1552–1565. ACM, 2022.
- [eco13] How science goes wrong. The Economist, pages 25–30, 2013.
- [EKK+23] Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits, 2023. arXiv:2210.01898.
- [GG] Shafi Goldwasser and Ofer Grossman. Bipartite perfect matching in pseudo-deterministic NC. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs.
- [GG11] Eran Gat and Shafi Goldwasser. Probabilistic Search Algorithms with Unique Answers and Their Cryptographic Applications. Technical Report 136, 2011. URL: https://eccc.weizmann.ac.il/report/2011/136/.
- [GGH18] Shafi Goldwasser, Ofer Grossman, and Dhiraj Holden. Pseudo-deterministic proofs. In Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, volume 94 of LIPIcs, pages 17:1–17:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
- [GGMW20] Shafi Goldwasser, Ofer Grossman, Sidhanth Mohanty, and David P. Woodruff. Pseudo-deterministic streaming. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference, ITCS, volume 151 of LIPIcs, pages 79:1–79:25, 2020.
- [GKM21] Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. User-level differentially private learning via correlated sampling. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 20172–20184, 2021.
- [GL19] Ofer Grossman and Yang P. Liu. Reproducibility and pseudo-determinism in log-space. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 606–620. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975482.38, doi:10.1137/1.9781611975482.38.
- [Gol19] Oded Goldreich. Multi-pseudodeterministic algorithms. Electron. Colloquium Comput. Complex., TR19-012, 2019. URL: https://eccc.weizmann.ac.il/report/2019/012, arXiv:TR19-012.
- [ILPS22] Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 818–831. ACM, 2022. URL: https://doi.org/10.1145/3519935.3519973, doi:10.1145/3519935.3519973.
- [JP05] Ioannidis JP. Why most published research findings are false. PLOS Medicine, 2(8), 2005.
- [Kea98] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.
- [LOS21] Zhenjian Lu, Igor Carboni Oliveira, and Rahul Santhanam. Pseudodeterministic algorithms and the structure of probabilistic time. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 303–316. ACM, 2021.
- [MPK19] Harshal Mittal, Kartikey Pandey, and Yash Kant. ICLR reproducibility challenge report (Padam: Closing the generalization gap of adaptive gradient methods in training deep neural networks), 2019. arXiv:1901.09517.
- [NAS19] Reproducibility and Replicability in Science. https://doi.org/10.17226/25303, 2019. National Academies of Sciences, Engineering, and Medicine.
- [OS17] I. Oliveira and R. Santhanam. Pseudodeterministic constructions in subexponential time. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 665–677, 2017.
- [OS18] Igor Carboni Oliveira and Rahul Santhanam. Pseudo-derandomizing learning and approximation. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2018, volume 116 of LIPIcs, pages 55:1–55:19, 2018.
- [PVLS+21] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164):1–20, 2021. URL: http://jmlr.org/papers/v22/20-303.html.
- [SZ99] Michael E. Saks and Shiyu Zhou. BP_HSPACE(S) ⊆ DSPACE(S^{3/2}). J. Comput. Syst. Sci., 58(2):376–403, 1999.
- [VWDP+22] Jason Vander Woude, Peter Dixon, Aduri Pavan, Jamie Radcliffe, and N. V. Vinodchandran. Geometry of rounding. CoRR, abs/2211.02694, 2022. URL: https://doi.org/10.48550/arXiv.2211.02694, arXiv:2211.02694, doi:10.48550/arXiv.2211.02694.