
List and Certificate Complexities in Replicable Learning

Peter Dixon Ben-Gurion University of Negev A. Pavan Iowa State University Jason Vander Woude University of Nebraska-Lincoln N. V. Vinodchandran University of Nebraska-Lincoln
Abstract

We investigate replicable learning algorithms. Ideally, we would like to design algorithms that output the same canonical model over multiple runs, even when different runs observe a different set of samples from the unknown data distribution. In general, such a strong notion of replicability is not achievable. Thus we consider two feasible notions of replicability called list replicability and certificate replicability. Intuitively, these notions capture the degree of (non) replicability. We design algorithms for certain learning problems that are optimal in list and certificate complexity. We establish matching impossibility results.

1 Introduction

Replicability and reproducibility in science are critical concerns. The fundamental requirement that scientific results and experiments be replicable/reproducible is central to the development and evolution of science. In recent years, these concerns have grown as several scientific disciplines turn to data-driven research, which enables exponential progress through data democratization and affordable computing resources. The replicability issue has received attention from a wide spectrum of entities, from general media publications (for example, The Economist's "How Science Goes Wrong," 2013 [eco13]) to scientific publication venues (for example, see [JP05, Bak16]) to professional and scientific bodies such as the National Academies of Sciences, Engineering, and Medicine (NASEM). The emerging challenges to replicability and reproducibility have been discussed in depth in a consensus study report published by NASEM [NAS19].

A broad approach taken to ensure the reproducibility/replicability of algorithms is to make the datasets, algorithms, and code publicly available. Of late, conferences have been hosting replicability workshops to promote best practices and to encourage researchers to share code (for example, see  [PVLS+21] and [MPK19]). An underlying assumption is that consistent results can be obtained using the same input data, computational methods, and code. However, this practice alone is insufficient to ensure replicability as modern-day approaches use computations that inherently involve randomness.

Computing over random variables results in a high degree of non-replicability, especially in machine learning tasks. Machine learning algorithms observe samples from a (sometimes unknown) distribution and output a hypothesis or model. Such algorithms are inherently non-replicable: two distinct runs will output different models because they see different sets of samples. Ideally, to achieve "perfect replicability," we would like to design algorithms that output the same canonical hypothesis over multiple runs, even when different runs observe different sets of samples from the unknown distribution.

We first observe that perfect replicability is not achievable in learning, as a dependency of the output on the data samples is inevitable. We illustrate this with the simple task of estimating the bias of a coin: given $n$ independent tosses of a coin with unknown bias $b$, output an estimate of $b$. It is relatively easy to argue that no algorithm can output a canonical estimate $v_b$ with probability $\geq 2/3$ such that $|v_b-b|\leq\varepsilon$. Consider a sequence of coins with biases $b_1<b_2<\cdots<b_m$ where $b_{i+1}-b_i\leq\eta$ (for some small $\eta$) but $b_m-b_1\geq 2\varepsilon$. Let $\mathcal{D}_i^n$ denote the distribution of $n$ independent tosses of the $i^{th}$ coin. For two adjacent biases, the statistical distance (denoted by $\operatorname{d_{TV}}$) between $\mathcal{D}_i^n$ and $\mathcal{D}_{i+1}^n$ is $\leq n\eta$. Let $v_i$ and $v_{i+1}$ be the canonical estimates output by an algorithm $A$ for biases $b_i$ and $b_{i+1}$ respectively. Take $\eta=1/10n$. Since $A$ on samples from $\mathcal{D}_i^n$ outputs $v_i$ with probability at least $2/3$ and $\operatorname{d_{TV}}(\mathcal{D}_i^n,\mathcal{D}_{i+1}^n)\leq n\eta$, $A(\mathcal{D}_{i+1}^n)$ must output $v_i$ with probability at least $2/3-n\eta>1/2$. Since $A(\mathcal{D}_{i+1}^n)$ also outputs its canonical value $v_{i+1}$ with probability at least $2/3>1/2$, this implies that $v_i=v_{i+1}$. Thus $A$ must output the same value on all biases $b_1,\ldots,b_m$, a contradiction since $b_1$ and $b_m$ are $2\varepsilon$ apart. However, it is easy to see that there is a bias-estimation algorithm that outputs one of two canonical estimates using $n=O(1/\varepsilon^2)$ tosses: estimate the bias to within $\varepsilon/2$ and round the value to the closest multiple of $\varepsilon$. These two observations are the starting point of our work: even though it may not be possible to design learning algorithms that are perfectly replicable, it is possible to design algorithms whose "non-replicability" is minimized.
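The two-canonical-estimates observation can be made concrete with a short sketch (ours, not the paper's; the helper names are illustrative):

```python
def round_to_grid(estimate, eps):
    """Round an estimate to the nearest multiple of eps.

    Any estimate within eps/2 of the true bias b lands on one of the at
    most two multiples of eps closest to b, so repeated runs output at
    most two distinct canonical values.
    """
    return round(estimate / eps) * eps

def estimate_bias(tosses, eps):
    """2-list replicable bias estimation: empirical mean, then rounding."""
    empirical = sum(tosses) / len(tosses)  # within eps/2 of b w.h.p. for n = O(1/eps^2)
    return round_to_grid(empirical, eps)
```

For instance, with true bias $b=0.37$ and $\varepsilon=0.1$, every empirical mean in $[0.32,0.42]$ rounds to either $0.3$ or $0.4$.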

Motivated by this, we study two notions of replicability called list-replicability and certificate-replicability in the context of machine learning algorithms. These notions quantify the degree of (non)replicability. The notions we consider are rooted in the pseudodeterminism literature [GG11, Gol19, GL19], which has been an active area in randomized algorithms and computational complexity theory. Section 2 discusses this connection and other works that are related to our work.

1.1 Our Results

We consider two notions of replicable learning: list-replicable learning and certificate-replicable learning. In list replicability, the learning algorithm should, with high probability, output one of the models from a list of models of size $\leq k$. This means that when the learning algorithm is run multiple times, we see at most $k$ distinct models/hypotheses. The value $k$ is called the list complexity of the algorithm. An algorithm whose list complexity is $1$ is perfectly replicable. Thus list complexity can be considered as the degree of (non)replicability. The goal in this setting is to design learning algorithms that minimize the list complexity $k$.

In certificate replicability, the learning algorithm has access to an $\ell$-bit random string called the certificate of replicability, which is independent of the samples and of any other randomness the algorithm may use. We require that for most certificates $r$, the algorithm must output a canonical model $h$ that depends only on $r$. Thus, once we fix a certificate $r$, multiple runs of the algorithm with access to $r$ will output the same model w.h.p. We call $\ell$ the certificate complexity. Note that an algorithm with zero certificate complexity is perfectly replicable; thus $\ell$ is another measure of the degree of (non)replicability of the algorithm. The goal in this setting is to design learning algorithms that minimize $\ell$.

A critical resource in machine learning tasks is the number of samples the algorithm observes, known as the sample complexity. An efficient learning algorithm uses as few samples as possible. In this work, we measure the efficiency of learning algorithms in terms of sample complexity, list complexity, and certificate complexity. This work initiates a study of learning algorithms that are efficient in certificate/list complexities as well as sample complexity. We establish both positive results and impossibility results for certain learning tasks.

Estimating the bias of dd coins.

Our first set of results concerns efficient replicable algorithms for coin-bias estimation. We consider the problem of simultaneously estimating the biases of $d$ coins by observing $n$ tosses of each coin, which we call the $d$-Coin Bias Estimation Problem. The task is to output a bias vector $\vec{v}$ so that $\lVert\vec{b}-\vec{v}\rVert_\infty\leq\varepsilon$, where $\vec{b}$ is the true bias vector. We show that there is a $(d+1)$-list replicable learning algorithm for this problem with sample complexity (number of coin tosses) $n=O(\frac{d^2}{\varepsilon^2}\log\frac{d}{\delta})$ per coin. We also design a $\lceil\log\frac{d}{\delta}\rceil$-certificate replicable algorithm for the problem with sample complexity $O(\frac{d^2}{\varepsilon^2\delta^2})$ per coin. Here $(1-\delta)$ is the success probability.

We also establish the optimality of the above upper bounds in terms of list and certificate complexities. We show that there is no $d$-list replicable learning algorithm for the $d$-Coin Bias Estimation Problem. This leads to a lower bound of $\Omega(\log d)$ on its certificate complexity. To establish this impossibility result we use a version of the KKM/Sperner lemma.

PAC learning.

We establish possibility and impossibility results for list/certificate replicable algorithms in the context of PAC learning. We show that any concept class that can be learned using $d$ non-adaptive statistical queries can be learned by a $(d+1)$-list replicable PAC learning algorithm with sample complexity $O(\frac{d^2}{\nu^2}\log\frac{d}{\delta})$, where $\nu$ is the statistical query parameter. We also show that such concept classes admit a $\lceil\log\frac{d}{\delta}\rceil$-certificate replicable PAC learning algorithm with sample complexity $O(\frac{d^2}{\nu^2\delta^2}\log\frac{d}{\delta})$. Finally, we establish tight results in the PAC learning model. In particular, we prove that the concept class of $d$-dimensional thresholds does not admit a $d$-list replicable learning algorithm under the uniform distribution. Since $d$-dimensional thresholds can be learned under the uniform distribution using $d$ non-adaptive statistical queries, we get a $(d+1)$-list replicable PAC algorithm under the uniform distribution. This yields matching upper and lower bounds on the list complexity of PAC learning thresholds under the uniform distribution.

2 Prior and Related Work

Formalizing reproducibility and replicability has gained considerable momentum in recent years. While the terms reproducibility and replicability are very close and often used interchangeably, there has been an effort to distinguish between them; accordingly, our notions fall under the replicability definition [PVLS+21].

In the context of randomized algorithms, various notions of reproducibility/replicability have been investigated. The work of Gat and Goldwasser [GG11] formalized and defined the notion of pseudodeterministic algorithms. A randomized algorithm $A$ is pseudodeterministic if, for any input $x$, there is a canonical value $v_x$ such that $\Pr[A(x)=v_x]\geq 2/3$. Gat and Goldwasser designed polynomial-time pseudodeterministic algorithms for algebraic computational problems, such as finding quadratic non-residues and finding non-roots of multivariate polynomials [GG11]. Later works studied pseudodeterminism in other algorithmic settings, such as parallel computation, streaming and sub-linear algorithms, and interactive proofs, as well as its connections to complexity theory [GG, GGH18, OS17, OS18, AV20, GGMW20, LOS21, DPVWV22].

In the algorithmic setting, two generalizations of pseudodeterminism have mainly been investigated: multi-pseudodeterministic algorithms [Gol19] and influential-bit algorithms [GL19]. A randomized algorithm $A$ is $k$-pseudodeterministic if, for every input $x$, there is a set $S_x$ of size at most $k$ such that the output of $A(x)$ belongs to $S_x$ with high probability. When $k=1$, we recover pseudodeterminism. A randomized algorithm $A$ is an $\ell$-influential-bit algorithm if, for every input $x$ and most strings $r$ of length $\ell$, there exists a canonical value $v_{x,r}$ such that $A$ on inputs $x$ and $r$ outputs $v_{x,r}$ with high probability. The string $r$ is called the influential-bit string. Again, when $\ell=0$, we recover pseudodeterminism. The main focus of these works has been to investigate reproducibility in randomized search algorithms.

Very recently, pseudodeterminism and its generalizations have been explored in the context of learning algorithms to formalize the notion of replicability. In particular, the work of Impagliazzo, Lei, Pitassi, and Sorrell [ILPS22] introduced the notion of $\rho$-replicability. A learning algorithm $A$ is $\rho$-replicable if $\Pr[A(S_1,r)=A(S_2,r)]\geq 1-\rho$, where $S_1$ and $S_2$ are samples drawn from a distribution $\mathcal{D}$ and $r$ is the internal randomness of the learning algorithm $A$. They designed replicable algorithms for many learning tasks, including statistical queries, approximate heavy hitters, median, and learning half-spaces.

Another line of recent work connects replicable learning to differentially private learning. In a recent seminal work, Bun, Livny, and Moran [BLM20] showed that every concept class with finite Littlestone dimension can be learned by an approximate differentially private algorithm. This, together with an earlier work of Alon, Livny, Malliaris, and Moran [ALMM19], establishes an equivalence between online learnability and differentially private PAC learnability. Rather surprisingly, the proof of [BLM20] uses the notion of "global stability," which is similar to the notion of pseudodeterminism in the context of learning. They define a learning algorithm $A$ to be $(n,\eta)$-globally stable with respect to a distribution $D$ if there is a hypothesis $h$ such that $\Pr_{S\sim D^n}[A(S)=h]\geq\eta$. They showed that any concept class with Littlestone dimension $d$ has an $(n,\eta)$-globally stable learning algorithm with $n=\tilde{O}(2^{2^d}/\alpha)$ and $\eta=\tilde{O}(2^{-2^d})$, where the error of $h$ (with respect to the unknown hypothesis) is $\leq\alpha$. They then established that a globally stable learner implies a differentially private learner. Globally stable learning with $\eta=2/3$ is the perfect replicability that we discuss in the introduction. Thus, as discussed there, designing globally stable algorithms with $\eta>1/2$ is not possible, even for the simple task of estimating the bias of a coin. The work of Ghazi, Kumar, and Manurangsi [GKM21] extended the notion of global stability to pseudo-global stability and list-global stability. The notion of pseudo-global stability is very similar to the earlier-mentioned influential-bit algorithms of Grossman and Liu [GL19] when translated to the context of learning. Similarly, list-global stability is similar to Goldreich's notion of multi-pseudodeterminism [Gol19].
It is also known that the notions of pseudo-global stability and $\rho$-replicability are the same up to polynomial factors in the parameters [ILPS22, GKM21]. The work of [GKM21] uses these notions to design user-level private learning algorithms.

Our notion of list replicability is inspired by the notion of multi-pseudodeterminism, and our notion of certificate replicability is inspired by the notion of influential-bit algorithms. In the learning setting, our notion of list replicability is similar to list-global stability, and our notion of certificate replicability is similar to pseudo-global stability, which in turn is similar to the $\rho$-replicability of [ILPS22].

We introduce the notions of list and certificate complexities that measure the degree of (non)replicability. Our goal is to design learning algorithms with optimal list and certificate complexities while minimizing the sample complexity. Earlier works did not focus on minimizing these quantities: the works of [BLM20, GKM21] used replicable algorithms as an intermediate step to design differentially private algorithms, and the work of [ILPS22] did not consider reducing the certificate complexity in their algorithms and did not study list replicability.

The study of various notions of reproducibility/replicability across computational fields is an emerging topic. In [EKK+23], the authors consider replicability in the context of stochastic bandits; their notion is similar to the one studied in [ILPS22]. In [AJJ+22], the authors investigate reproducibility (see [PVLS+21] for a discussion on replicability and reproducibility) in the context of optimization with inexact oracles (initialization/gradient oracles). The setup and focus of these works are different from ours.

3 Primary Lemmas

In this section, we state a few lemmas, building on the work of [VWDP+22] and [DLPES02], that will be useful for the algorithmic constructions and impossibility results in the remainder of the paper.

3.1 Partitions and Rounding

In this subsection, we define a deterministic rounding function that we will use repeatedly. This rounding function is based on the notion of secluded partitions defined in the work of [VWDP+22].

We will use the following notation. We use $\operatorname{diam}_\infty$ to indicate the diameter of a set relative to the $\ell_\infty$ norm and $\overline{B}_\epsilon^\infty(\vec{p})$ to represent the closed ball of radius $\epsilon$ centered at $\vec{p}$ relative to the $\ell_\infty$ norm. That is, in $\mathbb{R}^d$ we have $\overline{B}_\epsilon^\infty(\vec{p})=\prod_{i=1}^d[p_i-\epsilon,p_i+\epsilon]$.

Let $\mathcal{P}$ be a partition of $\mathbb{R}^d$. For a point $\vec{p}\in\mathbb{R}^d$, we use $N_\epsilon(\vec{p})$ to denote the set

$\{X\in\mathcal{P} \mid \overline{B}_\epsilon^\infty(\vec{p})\cap X\neq\emptyset\}.$
Definition 3.1.

Let $\mathcal{P}$ be a partition of $\mathbb{R}^d$. We say that $\mathcal{P}$ is $(k,\epsilon)$-secluded if for every point $\vec{p}\in\mathbb{R}^d$, $|N_\epsilon(\vec{p})|\leq k$.

The following theorem is from [VWDP+22].

Theorem 3.2.

For each $d\in\mathbb{N}$, there exists a $(d+1,\frac{1}{2d})$-secluded partition of $\mathbb{R}^d$ in which each member is a unit hypercube. Moreover, the partition is efficiently computable: given an arbitrary point $\vec{x}\in\mathbb{R}^d$, the description of the partition member to which $\vec{x}$ belongs can be computed in time polynomial in $d$.
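For intuition, the $d=1$ case of such a partition can be written in a few lines: the unit intervals $[k,k+1)$ form a $(2,\epsilon)$-secluded partition for any $\epsilon<1/2$, since a closed ball of radius $\epsilon$ has length $2\epsilon<1$ and can meet at most two intervals. This sketch (names ours) only illustrates the definition; the general $d$-dimensional construction is the one from [VWDP+22].

```python
import math

def member_of(x):
    """The unit interval [k, k+1) containing x, identified by its left endpoint k."""
    return math.floor(x)

def members_hit(p, eps):
    """All partition members whose intervals intersect the closed ball [p-eps, p+eps]
    (up to degenerate boundary cases at exact integer endpoints)."""
    return set(range(member_of(p - eps), member_of(p + eps) + 1))
```

For any $p$ and $\epsilon<1/2$, `members_hit(p, eps)` has size at most $2=d+1$, matching the theorem for $d=1$.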

Definition 3.3 ($(k(d),\epsilon(d))$-Deterministic Rounding).

A deterministic rounding scheme is a family of functions $\mathcal{F}=\{f_d\}_{d\in\mathbb{N}}$ where $f_d:\mathbb{R}^d\to\mathbb{R}^d$. We call $\mathcal{F}$ a $(k(d),\varepsilon(d))$-deterministic rounding scheme if (1) for all $\vec{x}\in\mathbb{R}^d$, $\lVert\vec{x}-f_d(\vec{x})\rVert_\infty\leq 1$ (the bound of $1$ is not critical; any constant can be used by scaling the parameters appropriately), and (2) for all $\vec{x}\in\mathbb{R}^d$, $|\{f_d(\vec{z}):\vec{z}\in\overline{B}_{\varepsilon(d)}^\infty(\vec{x})\}|\leq k(d)$.

The following observation is from [VWDP+22].

Observation 3.4 (Equivalence of Rounding Schemes and Partitions).

A $(k(d),\epsilon(d))$-deterministic rounding scheme induces, for each $d\in\mathbb{N}$, a $(k(d),\epsilon(d))$-secluded partition of $\mathbb{R}^d$ in which each member has diameter at most $2$. Conversely, a sequence $\langle\mathcal{P}_d\rangle_{d=1}^\infty$ of partitions, where each $\mathcal{P}_d$ is $(k(d),\epsilon(d))$-secluded and contains only members of diameter at most $1$, induces a $(k(d),\epsilon(d))$-deterministic rounding scheme.

3.2 A Universal Rounding Algorithm for List Replicability

In this subsection, we will design a deterministic algorithm that will be used as a sub-routine in our list-replicable algorithms.

Lemma 3.5.

Let $d\in\mathbb{N}$ and $\epsilon\in(0,\infty)$. Let $\epsilon_0=\frac{\epsilon}{2d}$. There is an efficiently computable function $f_\epsilon:\mathbb{R}^d\to\mathbb{R}^d$ with the following two properties:

  1. For any $x\in\mathbb{R}^d$ and any $\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)$, it holds that $f_\epsilon(\hat{x})\in\overline{B}_\epsilon^\infty(x)$.

  2. For any $x\in\mathbb{R}^d$, the set $\{f_\epsilon(\hat{x}):\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)\}$ has cardinality at most $d+1$.

Informally, these two conditions are: (1) if $\hat{x}$ is an $\epsilon_0$-approximation of $x$, then $f_\epsilon(\hat{x})$ is an $\epsilon$-approximation of $x$; and (2) $f_\epsilon$ maps every $\epsilon_0$-approximation of $x$ to one of at most $d+1$ possible values.

Proof.

Let $\mathcal{P}$ be a $(d+1,\frac{1}{2d})$-secluded partition from Theorem 3.2 and $f:\mathbb{R}^d\to\mathbb{R}^d$ the associated deterministic rounding function due to Observation 3.4. The partition $\mathcal{P}$ consists of translates of the unit cube $[0,1)^d$ with the property that for any point $\vec{p}\in\mathbb{R}^d$, the closed cube of side length $1/d$ centered at $\vec{p}$ (i.e. $\overline{B}_{1/2d}^\infty(\vec{p})$) intersects at most $d+1$ members/cubes of $\mathcal{P}$. The associated rounding function $f$ maps each point of $\mathbb{R}^d$ to the center point of the unique cube in $\mathcal{P}$ which contains it. This means $f$ has the following two properties (which are closely connected to the two properties we want of $f_\epsilon$): (1) for every $a\in\mathbb{R}^d$, $\lVert f(a)-a\rVert_\infty\leq\frac{1}{2}$ (because every point is mapped to the center of its containing unit cube), and (2) for any point $p\in\mathbb{R}^d$, the set $\{f(a):a\in\overline{B}_{1/2d}^\infty(p)\}$ has cardinality at most $d+1$ (because $\overline{B}_{1/2d}^\infty(p)$ intersects at most $d+1$ members of $\mathcal{P}$). Define $f_\epsilon:\mathbb{R}^d\to\mathbb{R}^d$ by $f_\epsilon(a)=\epsilon\cdot f(\frac{1}{\epsilon}a)$. The efficient computability of $f_\epsilon$ follows from the efficient computability of $f$ (i.e. the ability to efficiently compute the center of the unit cube in $\mathcal{P}$ which contains a given point).

To see that $f_\epsilon$ has property (1), let $x\in\mathbb{R}^d$ and $\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)$. Then we have the following (justifications will follow):

\begin{align*}
\left\lVert\tfrac{1}{\epsilon}f_\epsilon(\hat{x})-\tfrac{1}{\epsilon}x\right\rVert_\infty
&=\left\lVert f(\tfrac{1}{\epsilon}\hat{x})-\tfrac{1}{\epsilon}x\right\rVert_\infty\\
&\leq\left\lVert f(\tfrac{1}{\epsilon}\hat{x})-\tfrac{1}{\epsilon}\hat{x}\right\rVert_\infty+\left\lVert\tfrac{1}{\epsilon}\hat{x}-\tfrac{1}{\epsilon}x\right\rVert_\infty\\
&\leq\left\lVert f(\tfrac{1}{\epsilon}\hat{x})-\tfrac{1}{\epsilon}\hat{x}\right\rVert_\infty+\tfrac{1}{\epsilon}\left\lVert\hat{x}-x\right\rVert_\infty\\
&\leq\tfrac{1}{2}+\tfrac{1}{\epsilon}\epsilon_0\\
&=\tfrac{1}{2}+\tfrac{1}{2d}\leq 1
\end{align*}

The first line is by the definition of $f_\epsilon$, the second is the triangle inequality, the third is the scaling property of norms, the fourth uses the property that $f$ moves points by at most $\frac{1}{2}$ together with the hypothesis that $\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)$, the fifth uses the definition of $\epsilon_0$, and the sixth uses the fact that $d\geq 1$.

Scaling both sides by $\epsilon$ and again using the scaling property of norms gives $\lVert f_\epsilon(\hat{x})-x\rVert_\infty\leq\epsilon$, which proves property (1) of the lemma.

To see that $f_\epsilon$ has property (2), let $x\in\mathbb{R}^d$. We have the following set of equalities:

\begin{align*}
\left\{f_\epsilon(\hat{x}):\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)\right\}
&=\left\{\epsilon\cdot f(\tfrac{1}{\epsilon}\hat{x}):\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)\right\}\\
&=\left\{\epsilon\cdot f(a):a\in\overline{B}_{\epsilon_0/\epsilon}^\infty(\tfrac{1}{\epsilon}x)\right\}\\
&=\left\{\epsilon\cdot f(a):a\in\overline{B}_{1/2d}^\infty(\tfrac{1}{\epsilon}x)\right\}
\end{align*}

The first line is from the definition of $f_\epsilon$, the second from the substitution $a=\frac{1}{\epsilon}\hat{x}$, and the third from the definition of $\epsilon_0$.

Because $f$ takes on at most $d+1$ distinct values on $\overline{B}_{1/2d}^\infty(\tfrac{1}{\epsilon}x)$ (property (2) of $f$), the set has cardinality at most $d+1$, which proves property (2) of the lemma. ∎
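The scaling trick $f_\epsilon(a)=\epsilon\cdot f(\frac{1}{\epsilon}a)$ is mechanical; here is a sketch for $d=1$, using the center of the containing unit interval as the rounding function $f$ (names ours, for illustration only):

```python
import math

def f(a):
    """d = 1 rounding: map a to the center of the unit interval [k, k+1)
    containing it, so |f(a) - a| <= 1/2 and any ball of radius 1/(2d) = 1/2
    sees at most d + 1 = 2 distinct outputs."""
    return math.floor(a) + 0.5

def f_eps(a, eps):
    """The rescaled rounding f_eps(a) = eps * f(a / eps): an eps-accurate
    rounding taking at most d + 1 values on any eps0-ball, eps0 = eps/(2d)."""
    return eps * f(a / eps)
```

With $\epsilon=0.2$ (so $\epsilon_0=0.1$ for $d=1$) and $x=0.37$, every $\hat{x}\in[0.27,0.47]$ is mapped to one of the two values $0.3$ and $0.5$, both within $\epsilon$ of $x$.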

3.3 A Universal Rounding Algorithm for Certificate Replicability

For designing certificate replicable learning algorithms, we will use a general randomized procedure stated in the following lemma. The lemma uses a randomized rounding scheme; similar randomized rounding schemes have been used in a few prior works [SZ99, DPV18, Gol19, GL19, ILPS22].

Lemma 3.6.

Let $d\in\mathbb{N}$, $\epsilon_0\in(0,\infty)$, and $0<\delta<1$. There is an efficiently computable deterministic function $f:\{0,1\}^\ell\times\mathbb{R}^d\to\mathbb{R}^d$ with the following property: for any $x\in\mathbb{R}^d$,

$\Pr_{r\in\{0,1\}^\ell}\left[\exists x^*\in\overline{B}_\varepsilon^\infty(x)~\forall\hat{x}\in\overline{B}_{\varepsilon_0}^\infty(x):f(r,\hat{x})=x^*\right]\geq 1-\delta,$

where $\ell=\lceil\log\frac{d}{\delta}\rceil$ and $\varepsilon=(2^\ell+1)\epsilon_0\leq\frac{2\varepsilon_0 d}{\delta}$.

Proof.

Partition each coordinate of $\mathbb{R}^d$ into intervals of width $2\varepsilon_0$. The algorithm computing the function $f$ performs the following simple randomized rounding:

The function $f$: Choose a random integer $r\in\{1,\ldots,2^\ell\}$; note that $r$ can be represented using $\ell$ bits. For the $i^{th}$ coordinate $\hat{x}[i]$ of $\hat{x}$, round $\hat{x}[i]$ to the nearest $k\cdot(2\varepsilon_0)$ such that $k\equiv r\pmod{2^\ell}$.

Now we will prove that ff satisfies the required properties.

First, we prove the approximation guarantee. Let $x'$ denote the point in $\mathbb{R}^d$ obtained by rounding each coordinate of $\hat{x}$. The values $k$ satisfying $k\equiv r\pmod{2^\ell}$ yield points that are $2^\ell\cdot 2\varepsilon_0$ apart; therefore each coordinate moves by at most $2^\ell\epsilon_0$, that is, $|x'[i]-\hat{x}[i]|\leq 2^\ell\epsilon_0$ for every $i$, $1\leq i\leq d$. Since $\hat{x}$ is an $\varepsilon_0$-approximation of $x$ (i.e. each coordinate $\hat{x}[i]$ is within $\varepsilon_0$ of the true value $x[i]$), each coordinate of $x'$ is within $(2^\ell+1)\varepsilon_0$ of $x[i]$. Therefore $x'$ is a $(2^\ell+1)\varepsilon_0$-approximation of $x$, i.e., $x'\in\overline{B}_\varepsilon^\infty(x)$ for any choice of $r$.

Now we establish that for $\geq 1-\delta$ fraction of $r\in\{1,\ldots,2^\ell\}$, there exists $x^*$ such that every $\hat{x}\in\overline{B}_{\varepsilon_0}^\infty(x)$ is rounded to $x^*$. We argue this with respect to each coordinate and apply the union bound. Fix an $x$ and a coordinate $i$, and consider the $\varepsilon_0$-interval around $x[i]$.

Consider $r$ from $\{1,\ldots,2^\ell\}$. When this $r$ is chosen, we round $\hat{x}[i]$ to the closest $k\cdot(2\varepsilon_0)$ such that $k\equiv r\pmod{2^\ell}$. Let $p^r_1,p^r_2,\ldots,p^r_j,\ldots$ be the set of such points; more precisely, $p^r_j=(j2^\ell+r)\cdot 2\varepsilon_0$. Note that $\hat{x}[i]$ is rounded to $p^r_j$ for some $j$. Let $m^r_j$ denote the midpoint between $p^r_j$ and $p^r_{j+1}$, i.e., $m^r_j=(p^r_j+p^r_{j+1})/2$. We call $r$ 'bad' for $x[i]$ if $x[i]$ is close to some $m^r_j$; that is, $r$ is 'bad' if $|x[i]-m^r_j|<\varepsilon_0$. Note that for a bad $r$ there exist $\hat{x}_1$ and $\hat{x}_2$ in $\overline{B}_{\varepsilon_0}^\infty(x)$ whose $i^{th}$ coordinates are rounded to $p^r_j$ and $p^r_{j+1}$ respectively. The crucial point is that if $r$ is not bad for $x[i]$, then there exists a canonical $p^*$ such that the $i^{th}$ coordinate of every $x'\in\overline{B}_{\varepsilon_0}^\infty(x)$ is rounded to $p^*$. We call $r$ bad for $x$ if there exists at least one $i$, $1\leq i\leq d$, such that $r$ is bad for $x[i]$. It follows that if $r$ is not bad for $x$, then there exists a canonical $x^*$ such that every $x'\in\overline{B}_{\varepsilon_0}^\infty(x)$ is rounded to $x^*$.

With this, the goal is to bound the probability that a randomly chosen $r$ is bad for $x$. For this, we first bound the probability that $r$ is bad for $x[i]$. We argue that there exists at most one bad $r$ for $x[i]$. Suppose there exist two numbers $r_1\neq r_2$ that are both bad for $x[i]$. This means that $|x[i]-m^{r_1}_{j_1}|<\varepsilon_0$ and $|x[i]-m^{r_2}_{j_2}|<\varepsilon_0$ for some $j_1$ and $j_2$, so by the triangle inequality $|m^{r_1}_{j_1}-m^{r_2}_{j_2}|<2\varepsilon_0$. However, $|p^{r_1}_{j_1}-p^{r_2}_{j_2}|=|(j_1-j_2)2^\ell+(r_1-r_2)|\cdot 2\epsilon_0$, and since $r_1\neq r_2$ and $|r_1-r_2|<2^\ell$, this value is at least $2\varepsilon_0$. Since $m^r_j=p^r_j+2^\ell\varepsilon_0$, the difference between the midpoints equals the difference between the corresponding points, so $|m^{r_1}_{j_1}-m^{r_2}_{j_2}|\geq 2\varepsilon_0$, leading to a contradiction.

Thus the probability that $r$ is bad for $x[i]$ is at most $\frac{1}{2^\ell}$, and by the union bound the probability that $r$ is bad for $x$ is at most $\frac{d}{2^\ell}\leq\delta$. This completes the proof. ∎
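The per-coordinate randomized rounding in this proof can be sketched as follows (names ours; the $d$-dimensional $f$ applies the same $r$ to every coordinate):

```python
def cert_round(xhat, r, ell, eps0):
    """Round xhat to the nearest point k*(2*eps0) with k = r (mod 2**ell).

    The admissible points are spaced 2**ell * 2 * eps0 apart, so rounding
    moves xhat by at most 2**ell * eps0; for all but at most one choice of
    r, every eps0-approximation of the same x rounds to one canonical point.
    """
    spacing = (2 ** ell) * 2 * eps0   # gap between admissible points
    offset = r * 2 * eps0             # admissible points are j*spacing + offset
    j = round((xhat - offset) / spacing)
    return j * spacing + offset
```

For a good $r$, every $\hat{x}$ within $\varepsilon_0$ of a fixed $x$ produces the same output, which is what makes $r$ a usable certificate.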

3.4 A Consequence of Sperner’s Lemma/KKM Lemma

The following result is a corollary of cubical variants of Sperner's lemma/the KKM lemma initially developed in [DLPES02] and expanded on in [VWDP+22]. The statement and proof of this result are quite similar to those of Theorem 9.4 (Second Optimality Theorem) in [VWDP+22], except that it is stated here for $[0,1]^d$ instead of $\mathbb{R}^d$.

Lemma 3.7.

Let $\mathcal{P}$ be a partition of $[0,1]^d$ such that for each member $X\in\mathcal{P}$, it holds that $\operatorname{diam}_\infty(X)<1$. Then there exists $\vec{p}\in[0,1]^d$ such that for all $\delta>0$, $\overline{B}_\delta^\infty(\vec{p})$ intersects at least $d+1$ members of $\mathcal{P}$.

4 Replicability of Learning Coin Biases

In this section, we establish replicability results for estimating the biases of $d$ coins.

Definition 4.1.

The $d$-Coin Bias Estimation Problem is the following: design an algorithm $A$ (possibly randomized) that, given $\epsilon\in(0,\infty)$, $\delta\in(0,1]$, and $n$ independent tosses of each of an ordered collection of $d$ biased coins with bias vector $\vec{b}\in[0,1]^d$, outputs $\vec{v}$ such that $\lVert\vec{b}-\vec{v}\rVert_\infty\leq\varepsilon$ with probability $\geq 1-\delta$.

Definition 4.2.

We say an algorithm $A$ for the $d$-Coin Bias Estimation Problem is $k$-list replicable if, for any bias vector $\vec{b}\in[0,1]^d$ and parameters $\varepsilon,\delta$, there is a set $L\subseteq\overline{B}_\varepsilon^\infty(\vec{b})$ with $|L|\leq k$ and an integer $n$ such that $A$, on input $\varepsilon$ and $\delta$ and $n$ independent tosses (per coin) according to the bias vector $\vec{b}$, outputs an estimate $\vec{v}\in L$ with probability $\geq 1-\delta$. Here $n$ is the sample complexity and $k$ is the list complexity of $A$.

Definition 4.3.

We say an algorithm AA for dd-Coin Bias Estimation Problem is \ell-certificate replicable if, for any bias vector b[0,1]d\vec{b}\in[0,1]^{d} and parameters ε,δ\varepsilon,\delta, the algorithm AA on inputs ϵ\epsilon, δ\delta, r{0,1}r\in\{0,1\}^{\ell}, and nn independent coin tosses (per coin) according to b\vec{b} satisfies the following:

Prr{0,1}[vrB¯ε(b):Pr[A outputs vr]1δ]1δ\Pr_{r\in\{0,1\}^{\ell}}\left[\exists\vec{v}_{r}\in\overline{B}_{\varepsilon}^{\infty}(\vec{b}):\Pr[A\mbox{ outputs }\vec{v_{r}}]\geq 1-\delta\right]\geq 1-\delta

In the above, the inner probability is taken over the internal randomness of AA and the coin tosses. Algorithm AA also gets rr as an input (in addition to the other inputs). We call nn the sample complexity and the number of random bits \ell the certificate complexity of AA.

We note that, in a coarse sense, list replicable algorithms can be converted to certificate replicable algorithms and vice versa. However, such transformations degrade the sample complexity, which is a central concern of this paper.

In the following, the distribution of the output of an algorithm AA for dd-Coin Bias Estimation Problem, when given nn tosses per coin with bias vector b\vec{b}, is denoted 𝒟A,b,n\mathcal{D}_{A,\vec{b},n}.

4.1 Replicable Algorithms

We present two algorithms for dd-Coin Bias Estimation Problem. The first is (d+1)(d+1)-list replicable and the second is logdδ\lceil\log{d\over\delta}\rceil-certificate replicable.

Theorem 4.4.

There exists a (d+1)(d+1)-list replicable algorithm for dd-Coin Bias Estimation Problem  with sample complexity n=O(d2ε2logdδ)n=O({d^{2}\over\varepsilon^{2}}\cdot\log{d\over\delta}) (per coin).

Algorithm 1 (d+1)(d+1)-list replicable algorithm for dd-Coin Bias Estimation Problem as in Theorem 4.4
  Input: ϵ>0\epsilon>0
  Input: δ(0,1]\delta\in(0,1]
  Input: sample access to dd coins with biases b[0,1]d\vec{b}\in[0,1]^{d}
  Output: The algorithm behaves as a (d+1)(d+1)-pseudodeterministic (ϵ,δ)(\epsilon,\delta)-approximation of b\vec{b} and is guaranteed to return a value in [0,1]d[0,1]^{d}.
  Algorithm:
  ϵ0=defϵ2d\epsilon_{0}\operatorname{\overset{def}{=}}\frac{\epsilon}{2d}
  δ0=defδd\delta_{0}\operatorname{\overset{def}{=}}\frac{\delta}{d}
  n=defO(ln(1/δ0)ϵ02)=O(d2ln(d/δ)ϵ2)n\operatorname{\overset{def}{=}}O\left(\frac{\ln(1/\delta_{0})}{\epsilon_{0}^{2}}\right)=O\left(\frac{d^{2}\ln(d/\delta)}{\epsilon^{2}}\right) for a suitable absolute constant
  Let fϵ:ddf_{\epsilon}:\mathbb{R}^{d}\to\mathbb{R}^{d} be as in 3.5.
  Let g:d[0,1]dg:\mathbb{R}^{d}\to[0,1]^{d} be the function which restricts coordinates to the unit interval (i.e.
g(y)=def{0yi<0yiyi[0,1]1yi>1i=1dg(\vec{y})\operatorname{\overset{def}{=}}\left\langle\begin{cases}0&y_{i}<0\\ y_{i}&y_{i}\in[0,1]\\ 1&y_{i}>1\end{cases}\right\rangle_{i=1}^{d}
)
  Take nn samples from each coin and let a\vec{a} be the empirical biases.
  return g(fϵ(a))g(f_{\epsilon}(\vec{a}))
Proof.

Note that when ε1/2\varepsilon\geq 1/2, a trivial algorithm that outputs a vector with 1/21/2 in each component works. Thus the most interesting case is when ε<1/2\varepsilon<1/2. Our list replicable algorithm is described in Algorithm 1.

We prove its correctness by giving, for each possible bias b[0,1]d\vec{b}\in[0,1]^{d}, a set LbL_{\vec{b}} with the three necessary properties: (1) |Lb|d+1\lvert L_{\vec{b}}\rvert\leq d+1, (2) LbB¯ϵ(b)L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) (and also the problem specific restriction that Lb[0,1]dL_{\vec{b}}\subseteq[0,1]^{d}), and (3) when given access to coins of biases b\vec{b}, with probability at least 1δ1-\delta the algorithm returns a value in LbL_{\vec{b}}.

Assume notation from Algorithm 1. Let Lb={g(fϵ(x)):xB¯ϵ0(b)}L_{\vec{b}}=\left\{g(f_{\epsilon}(\vec{x}))\colon\vec{x}\in\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b})\right\}. By 3.5, fϵf_{\epsilon} takes on at most d+1d+1 values on B¯ϵ0(b)\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b}) (which means gfϵg\circ f_{\epsilon} also takes on at most d+1d+1 values on this ball) which proves that |Lb|d+1\lvert L_{\vec{b}}\rvert\leq d+1. This proves property (1).

Next we state the following observation, which says that the coordinate restriction function gg of Algorithm 1 does not reduce approximation quality. The proof is straightforward but tedious.

Observation 4.5.

Using the notation of Algorithm 1, if yB¯ϵ(b)\vec{y}\in\overline{B}_{\epsilon}^{\infty}(\vec{b}) then g(y)B¯ϵ(b)g(\vec{y})\in\overline{B}_{\epsilon}^{\infty}(\vec{b}).

Proof.

Let z=g(y)\vec{z}=g(\vec{y}). We must show for each i[d]i\in[d] that zi[biϵ,bi+ϵ]z_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon]. Note that

zi={0yi<0yiyi[0,1]1yi>1z_{i}=\begin{cases}0&y_{i}<0\\ y_{i}&y_{i}\in[0,1]\\ 1&y_{i}>1\end{cases}

so we proceed with three cases.

Case 1: yi<0y_{i}<0.

In this case, zi=0z_{i}=0 so because bi[0,1]b_{i}\in[0,1] we have zibi+ϵz_{i}\leq b_{i}+\epsilon. Also, because yi[biϵ,bi+ϵ]y_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon], we have biyi+ϵ<0+ϵ=zi+ϵb_{i}\leq y_{i}+\epsilon<0+\epsilon=z_{i}+\epsilon, so subtracting ϵ\epsilon from both sides gives zi>biϵz_{i}>b_{i}-\epsilon. Thus, we have zi[biϵ,bi+ϵ]z_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon] as desired.

Case 2: yi>1y_{i}>1.

This case is symmetric to Case 1.

Case 3: yi[0,1]y_{i}\in[0,1].

In this case zi=yi[biϵ,bi+ϵ]z_{i}=y_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon] so we are done. ∎
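The case analysis of Observation 4.5 can also be checked mechanically. The following minimal sketch (the helper names are ours, not the paper's) implements the clamping function gg and verifies on a sample point that clamping never increases the ℓ∞ distance to a vector in the unit cube.

```python
def g(y):
    # Coordinate-wise restriction to [0, 1], as in Algorithm 1.
    return [min(1.0, max(0.0, yi)) for yi in y]

def linf(u, v):
    # l_infinity distance between two vectors.
    return max(abs(ui - vi) for ui, vi in zip(u, v))

# A point y that is 0.15-close to b but leaves the unit cube: clamping
# brings it back into [0,1]^2 without hurting the approximation.
b = [0.1, 0.9]
y = [-0.05, 1.02]
assert linf(g(y), b) <= linf(y, b)
```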

We now establish Property (2). We know from 3.5 that for each xB¯ϵ0(b)\vec{x}\in\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b}) we have fϵ(x)B¯ϵ(b)f_{\epsilon}(\vec{x})\in\overline{B}_{\epsilon}^{\infty}(\vec{b}), and by 4.5, gg maintains this quality and we have g(fϵ(x))B¯ϵ(b)g(f_{\epsilon}(\vec{x}))\in\overline{B}_{\epsilon}^{\infty}(\vec{b}). This shows that LbB¯ϵ(b)L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) proving property (2).

By Chernoff’s bounds, for a single biased coin, n=O(ln(1/δ0)ϵ02)n=O\left(\frac{\ln(1/\delta_{0})}{\epsilon_{0}^{2}}\right) independent samples of the coin is enough that with probability at least 1δ01-\delta_{0}, the empirical bias is within ϵ0\epsilon_{0} of the true bias. Thus, by a union bound, if we take nn samples of each of the dd coins, there is a probability of at most dδ0=δd\cdot\delta_{0}=\delta that at least one of the empirical coin biases is not within ϵ0\epsilon_{0} of the true bias. Thus, by taking nn samples of each coin, we have with probability at least 1δ1-\delta that the empirical biases a\vec{a} belong to B¯ϵ0(b)\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b}). In the case that this occurs, we have by definition of LbL_{\vec{b}} that the value g(fϵ(a))g(f_{\epsilon}(\vec{a})) returned by the algorithm belongs to the set LbL_{\vec{b}}. This proves property (3). ∎

Theorem 4.6.

There is a logdδ\lceil\log{d\over\delta}\rceil-certificate replicable algorithm for dd-Coin Bias Estimation Problem with sample complexity of n=O(d2ε2δ2)n=O({d^{2}\over\varepsilon^{2}\delta^{2}}) per coin.

Proof.

Let ε\varepsilon and δ\delta be the input parameters to the algorithm and b\vec{b} the bias vector. Set ε0=εδ2d\varepsilon_{0}=\frac{\varepsilon\delta}{2d}. The algorithm AA first estimates the bias of each coin to within ε0\varepsilon_{0} with probability error parameter δd\frac{\delta}{d} using a standard estimation algorithm. Note that this can be done using O(d2ε2δ2)O({d^{2}\over\varepsilon^{2}\delta^{2}}) tosses per coin. Let v\vec{v} be the output vector. It follows that vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) with probability at least 1δ1-\delta. Then it runs the deterministic function ff described in Lemma 3.6 on inputs r{0,1}r\in\{0,1\}^{\ell}, with =logdδ\ell=\lceil\log{d\over\delta}\rceil, and v\vec{v}, and outputs the value of the function. Lemma 3.6 guarantees that for a 1δ1-\delta fraction of the rr’s, all vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) are rounded to the same value by ff. Hence algorithm AA satisfies the requirements of certificate replicability. The certificate complexity is logdδ\lceil\log{d\over\delta}\rceil. ∎
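For intuition, here is a hedged sketch of the kind of shared-randomness rounding that Lemma 3.6 provides. The lemma's actual construction appears earlier in the paper; the grid width and offset scheme below are our illustrative assumptions, not the paper's definitions. The certificate rr selects one of 2^ℓ shifts of a coordinate-wise grid, and any two estimates that are close in ℓ∞ round to the same canonical point unless some coordinate lands near a grid boundary, which happens for few rr.

```python
def shifted_round(v, eps, r, ell):
    # Certificate r in {0, ..., 2^ell - 1} selects a shift of a grid of
    # width 2*eps; each coordinate is rounded to the nearest shifted grid
    # point.  (Illustrative stand-in for the function of Lemma 3.6: a bad
    # r for one coordinate occurs with probability about 1/2^ell, so a
    # union bound over d coordinates matches the d/2^ell <= delta bound.)
    w = 2 * eps
    offset = (r / 2**ell) * w
    return [round((vi - offset) / w) * w + offset for vi in v]

# Two nearby estimates of the same bias vector, rounded with the same
# certificate, produce the same canonical output.
out1 = shifted_round([0.31, 0.58], eps=0.1, r=5, ell=4)
out2 = shifted_round([0.32, 0.57], eps=0.1, r=5, ell=4)
assert out1 == out2
```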

Note that an \ell-certificate replicable algorithm yields a 22^{\ell}-list replicable algorithm. Thus Theorem 4.6 gives a O(dδ)O(\frac{d}{\delta})-list replicable algorithm for dd-Coin Bias Estimation Problem with sample complexity O(d2ε2δ2)O({d^{2}\over\varepsilon^{2}\delta^{2}}). However, this is clearly sub-optimal, and Theorem 4.4 gives an algorithm with much smaller list and sample complexities.

4.2 An Impossibility Result

Theorem 4.7.

For k<d+1k<d+1, there does not exist a kk-list replicable algorithm for the dd-Coin Bias Estimation Problem.

Before proving the theorem, we need a lemma whose proof appears in the appendix.

Lemma 4.8.

For biases a,b[0,1]d\vec{a},\vec{b}\in[0,1]^{d} we have dTV(𝒟A,a,n,𝒟A,b,n)ndba\operatorname{d_{TV}}\left(\mathcal{D}_{A,\vec{a},n},\mathcal{D}_{A,\vec{b},n}\right)\leq n\cdot d\cdot\lVert\vec{b}-\vec{a}\rVert_{\infty}.

Proof.

We can view the model as algorithm AA having access to a single draw from a distribution. The distribution giving one flip of each coin in a collection with bias b\vec{b} is the dd-fold product of Bernoulli distributions i=1dBern(bi)\prod_{i=1}^{d}\operatorname{Bern}(b_{i}) (which for notational brevity we denote as Bern(b)\operatorname{Bern}(\vec{b})), so the distribution which gives nn independent flips of each coin is the nn-fold product of this (written, using the notation of [Can15], as Bern(b)n\operatorname{Bern}(\vec{b})^{\otimes n}).

Comparing the distributions of nn independent flips of the dd coins for bias b\vec{b} as compared to bias a\vec{a}, we have for each i[d]i\in[d] that

dTV(Bern(bi),Bern(ai))=|biai|\operatorname{d_{TV}}\left(\operatorname{Bern}(b_{i}),\operatorname{Bern}(a_{i})\right)=\lvert b_{i}-a_{i}\rvert

so by C.1.2 and C.1.3 of [Can15] we have

dTV(Bern(b),Bern(a))i=1d|biai|dba\operatorname{d_{TV}}\left(\operatorname{Bern}(\vec{b}),\operatorname{Bern}(\vec{a})\right)\leq\sum_{i=1}^{d}\lvert b_{i}-a_{i}\rvert\leq d\cdot\lVert\vec{b}-\vec{a}\rVert_{\infty}

and

dTV(Bern(b)n,Bern(a)n)ndba.\operatorname{d_{TV}}\left(\operatorname{Bern}(\vec{b})^{\otimes n},\operatorname{Bern}(\vec{a})^{\otimes n}\right)\leq n\cdot d\cdot\lVert\vec{b}-\vec{a}\rVert_{\infty}.

Because AA is a randomized function of one draw of this distribution, by D.1.2 of [Can15] we have that AA cannot serve to increase the total variation distance, so

dTV(𝒟A,a,n,𝒟A,b,n)dTV(Bern(b)n,Bern(a)n)\operatorname{d_{TV}}\left(\mathcal{D}_{A,\vec{a},n},\mathcal{D}_{A,\vec{b},n}\right)\leq\operatorname{d_{TV}}\left(\operatorname{Bern}(\vec{b})^{\otimes n},\operatorname{Bern}(\vec{a})^{\otimes n}\right)

which completes the proof. ∎
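For small dd, the inequality in the proof above can be verified exactly by enumerating all 2^d outcomes of one flip of each coin. The sketch below is ours, for illustration: it computes the total variation distance between the two product distributions directly and checks it against the sum of coordinate-wise bias differences.

```python
from itertools import product

def product_pmf(b, outcome):
    # Probability of a particular heads/tails outcome under Bern(b).
    p = 1.0
    for bi, oi in zip(b, outcome):
        p *= bi if oi else 1.0 - bi
    return p

def tv_product(b, a):
    # Exact total variation distance between the d-fold products Bern(b)
    # and Bern(a), by direct summation over all 2^d outcomes.
    return 0.5 * sum(abs(product_pmf(b, o) - product_pmf(a, o))
                     for o in product([0, 1], repeat=len(b)))

b, a = [0.3, 0.6, 0.5], [0.35, 0.55, 0.5]
bound = sum(abs(bi - ai) for bi, ai in zip(b, a))  # = 0.1
assert tv_product(b, a) <= bound + 1e-12
```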

Proof of Theorem 4.7.

Fix any dd\in\mathbb{N}, and fix parameters ϵ<12\epsilon<\frac{1}{2} and δ1d+2\delta\leq\frac{1}{d+2}.


Suppose for contradiction that such an algorithm does exist for some k<d+1k<d+1. This means that for each possible bias b[0,1]d\vec{b}\in[0,1]^{d}, there exists some set LbB¯ϵ(b)L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) (not necessarily unique, but consider some fixed one) with |Lb|k\lvert L_{\vec{b}}\rvert\leq k such that with probability at least 1δ1-\delta, AA returns an element of LbL_{\vec{b}}. By a trivial averaging argument, this means that there exists at least one element in LbL_{\vec{b}} which is returned by AA with probability at least 1k(1δ)1k(11d+2)=1kd+1d+21kk+1k+2\frac{1}{k}\cdot(1-\delta)\geq\frac{1}{k}\cdot(1-\frac{1}{d+2})=\frac{1}{k}\cdot\frac{d+1}{d+2}\geq\frac{1}{k}\cdot\frac{k+1}{k+2}. Let f:[0,1]d[0,1]df\colon[0,1]^{d}\to[0,1]^{d} be a function which maps each bias b\vec{b} to such an element of LbL_{\vec{b}}.

Since 1kk+1k+2>1k+1\frac{1}{k}\cdot\frac{k+1}{k+2}>\frac{1}{k+1}, let η\eta be such that 0<η<1kk+1k+21k+10<\eta<\frac{1}{k}\cdot\frac{k+1}{k+2}-\frac{1}{k+1}.

The function ff induces a partition 𝒫\mathcal{P} of [0,1]d[0,1]^{d} where the members of 𝒫\mathcal{P} are the fibers of ff (i.e. 𝒫={f1(y):yrange(f)}\mathcal{P}=\left\{f^{-1}(\vec{y})\colon\vec{y}\in\operatorname{range}(f)\right\}). By definition, for any member X𝒫X\in\mathcal{P} there exists some yrange(f)\vec{y}\in\operatorname{range}(f) such that for all bX\vec{b}\in X, f(b)=yf(\vec{b})=\vec{y}. By the choice of ff, we have f(b)LbB¯ϵ(b)f(\vec{b})\in L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) showing that yB¯ϵ(b)\vec{y}\in\overline{B}_{\epsilon}^{\infty}(\vec{b}) and by symmetry bB¯ϵ(y)\vec{b}\in\overline{B}_{\epsilon}^{\infty}(\vec{y}). This shows that XB¯ϵ(y)X\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{y}), so diam(X)2ϵ<1\operatorname{diam}_{\infty}(X)\leq 2\epsilon<1.

Let r=ηndr=\frac{\eta}{n\cdot d}. Since every member of 𝒫\mathcal{P} has \ell_{\infty} diameter less than 11, by 3.7 there exists a point p[0,1]d\vec{p}\in[0,1]^{d} such that B¯r(p)\overline{B}_{r}^{\infty}(\vec{p}) intersects at least d+1>kd+1>k members of 𝒫\mathcal{P}. Let b(1),,b(d+1)\vec{b}^{(1)},\ldots,\vec{b}^{(d+1)} be points belonging to distinct members of 𝒫\mathcal{P} that all belong to B¯r(p)\overline{B}_{r}^{\infty}(\vec{p}). By definition of 𝒫\mathcal{P}, this means for distinct j,j[d+1]j,j^{\prime}\in[d+1] that f(b(j))f(b(j))f(\vec{b}^{(j)})\not=f(\vec{b}^{(j^{\prime})}).

Now, for each j[d+1]j\in[d+1], because pb(j)r\lVert\vec{p}-\vec{b}^{(j)}\rVert_{\infty}\leq r, by 4.8 we have dTV(𝒟A,p,n,𝒟A,b(j),n)ndr=η\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{p},n},\mathcal{D}_{A,\vec{b^{(j)}},n})\leq n\cdot d\cdot r=\eta. However, this gives rise to a contradiction because the probability that AA with access to biased coins b(j)\vec{b}^{(j)} returns f(b(j))f(\vec{b}^{(j)}) is at least 1kk+1k+2\frac{1}{k}\cdot\frac{k+1}{k+2}, and by the total variation distance, it must be that AA with access to biased coins p\vec{p} returns f(b(j))f(\vec{b}^{(j)}) with probability at least 1kk+1k+2η>1k+1\frac{1}{k}\cdot\frac{k+1}{k+2}-\eta>\frac{1}{k+1}; notationally, Pr𝒟A,b(j),n({f(b(j))})1kk+1k+2\Pr_{\mathcal{D}_{A,\vec{b^{(j)}},n}}(\left\{f(\vec{b}^{(j)})\right\})\geq\frac{1}{k}\cdot\frac{k+1}{k+2} and dTV(𝒟A,b(j),n,𝒟A,p,n)η\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{b^{(j)}},n},\mathcal{D}_{A,\vec{p},n})\leq\eta, so Pr𝒟A,p,n({f(b(j))})1kk+1k+2η>1k+1\Pr_{\mathcal{D}_{A,\vec{p},n}}(\left\{f(\vec{b}^{(j)})\right\})\geq\frac{1}{k}\cdot\frac{k+1}{k+2}-\eta>\frac{1}{k+1}. This is a contradiction because a distribution cannot have d+1k+1d+1\geq k+1 disjoint events that each have probability greater than 1k+1\frac{1}{k+1}. ∎

We conclude this section by noting that the above impossibility result implies a lower bound on the certificate complexity for dd-Coin Bias Estimation Problem. It follows that there is no log(d)\lfloor\log(d)\rfloor-certificate replicable algorithm for dd-Coin Bias Estimation Problem; in particular, any \ell-certificate replicable algorithm requires =Ω(logdδ)\ell=\Omega(\log{d\over\delta}). Hence our algorithms for dd-Coin Bias Estimation Problem have optimal list and certificate complexities.

5 Replicability in PAC Learning

In this section, we establish replicability results for the PAC model. First, we define the PAC learning model.

Let \mathcal{H} be a (hypothesis) class of Boolean functions over XX, and let 𝒟\cal D be a distribution over XX. For a function ff\in\cal H, let 𝒟f{\cal D}_{f} be the distribution over X×{0,1}X\times\{0,1\} obtained by sampling an element xXx\in X according to 𝒟{\cal D} and outputting x,f(x)\langle x,f(x)\rangle. For a hypothesis hh, its error with respect to 𝒟f\mathcal{D}_{f}, denoted by e𝒟f(h)e_{\mathcal{D}_{f}}(h), is Prx,f(x)𝒟f(f(x)h(x))\Pr_{\langle x,f(x)\rangle\in\mathcal{D}_{f}}(f(x)\neq h(x)).

Definition 5.1.

A hypothesis class (or concept class) {\cal H} is PAC learnable with sample complexity mm if there is a learning algorithm AA with the following property: For every ff\in\mathcal{H}, for every distribution 𝒟{\cal D} over XX, and for all 0<δ,ϵ<10<\delta,\epsilon<1, AA on inputs δ,ϵ\delta,\epsilon and SS drawn i.i.d. according to 𝒟f\mathcal{D}_{f}, where |S|m|S|\leq m, outputs a hypothesis hh such that e𝒟f(h)εe_{\mathcal{D}_{f}}(h)\leq\varepsilon with probability (1δ)\geq(1-\delta).

We show that every hypothesis class that can be learned via a statistical query learning algorithm has a replicable PAC learning algorithm. We first define the notion of learning with statistical queries [Kea98].

Definition 5.2.

A statistical oracle STAT(𝒟f,ν)STAT({\cal D}_{f},\nu) takes as an input a real-valued function ϕ:X×{0,1}(0,1)\phi:X\times\{0,1\}\rightarrow(0,1) and returns an estimate vv such that

|vEx,y𝒟f[ϕ(x,y)]|ν|v-E_{\langle x,y\rangle\in{\cal D}_{f}}[\phi(x,y)]|\leq\nu
Definition 5.3.

We say that an algorithm A{A} learns a concept class \mathcal{H} via statistical queries if for every distribution 𝒟{\cal D} and every function ff\in\mathcal{H}, for every 0<ε<10<\varepsilon<1, there exists ν\nu such that the algorithm AA on input ε\varepsilon, and STAT(𝒟f,ν)STAT(\mathcal{D}_{f},\nu) as an oracle, outputs a hypothesis hh such that eDf(h)εe_{D_{f}}(h)\leq\varepsilon. The concept class is non-adaptively learnable if all the queries made by AA are non-adaptive.
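A statistical oracle is easy to simulate from samples, which is the engine behind the replicable algorithms below. The sketch that follows is ours, with illustrative names: it answers a query ϕ\phi by its empirical mean, and Hoeffding's inequality bounds the number of labeled samples needed for tolerance ν\nu.

```python
import math
import random

def empirical_stat(samples, phi):
    # Simulate STAT(D_f, nu) by averaging phi over i.i.d. labeled samples.
    return sum(phi(x, y) for x, y in samples) / len(samples)

def samples_needed(nu, delta):
    # Hoeffding: m = ln(2/delta) / (2 nu^2) samples make the empirical
    # mean nu-accurate with probability at least 1 - delta.
    return math.ceil(math.log(2 / delta) / (2 * nu**2))

# Toy target on X = [0,1]: f(x) = 1 iff x <= 0.5, and query phi(x, y) = y,
# whose true expectation under the uniform distribution is 0.5.
rng = random.Random(1)
m = samples_needed(nu=0.05, delta=0.01)
data = [(x, int(x <= 0.5)) for x in (rng.random() for _ in range(m))]
# v is within nu of 0.5 with probability at least 1 - delta.
v = empirical_stat(data, lambda x, y: y)
```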

5.1 Notions of Replicable PAC Learning

We now define the notions of list and certificate replicable learning in the PAC model.

Definition 5.4.

Let \mathcal{H} be a hypothesis class. We say that \mathcal{H} is kk-list replicably learnable if there is an algorithm AA with the following properties. For every ff\in\mathcal{H}, for every distribution 𝒟\mathcal{D} over XX, and for every 0<ϵ<10<\epsilon<1 and 0<δ10<\delta\leq 1, there is a list LL of size at most kk consisting of ε\varepsilon-approximate hypotheses such that AA, on inputs δ,ϵ\delta,\epsilon and a sample S𝒟fmS\sim\mathcal{D}_{f}^{m}, satisfies PrS𝒟fm[A(S)L]1δ\Pr_{S\sim\mathcal{D}_{f}^{m}}[A(S)\in L]\geq 1-\delta. We call mm the sample complexity of the learning algorithm and kk its list complexity.

Next we define certificate replicability. This is very close to the notion of replicability given in  [impagliazzo_replicability_2022]. However, our main concern is the amount of randomness needed by the algorithm to make it perfectly reproducible.

Definition 5.5.

Let \mathcal{H} be a concept class. We say that \mathcal{H} is \ell-certificate replicably learnable if the following holds. There is a learning algorithm AA such that for every ff\in\mathcal{H}, for every distribution 𝒟\mathcal{D} over XX, AA gets the following inputs: ϵ\epsilon, δ\delta, r{0,1}r\in\{0,1\}^{\ell}, and S𝒟fmS\sim\mathcal{D}_{f}^{m}.

Prr{0,1}[hr:Prs𝒟m[A(s;r)=hr]1δ]1δ\Pr_{r\in\{0,1\}^{\ell}}\left[\exists h_{r}:\Pr_{s\sim\mathcal{D}^{m}}[A(s;r)=h_{r}]\geq 1-\delta\right]\geq 1-\delta

We call mm the sample complexity and the number of random bits \ell the certificate complexity of AA.

The definition can be further explained as follows. For the algorithm AA, we say rr is a “certificate of replicability” if AA, with mm independent samples from 𝒟f\mathcal{D}_{f} and rr as input, outputs a canonical ε\varepsilon-approximate hypothesis hrh_{r} with probability 1δ\geq 1-\delta. The above definition demands that a 1δ1-\delta fraction of r{0,1}r\in\{0,1\}^{\ell} are certificates of replicability.

5.2 Replicable Algorithms

Theorem 5.6.

Let \mathcal{H} be a concept class that is learnable with dd non-adaptive statistical queries; then \mathcal{H} is (d+1)(d+1)-list replicably learnable. Furthermore, the sample complexity n=n(ν,δ)n=n(\nu,\delta) of the (d+1)(d+1)-list replicable algorithm equals O(d2logdδ1ν2)O(d^{2}\log{d\over\delta}\cdot{1\over\nu^{2}}), where ν\nu is the approximation error parameter of each statistical query oracle.

Proof.

The proof is very similar to the proof of Theorem 4.4. Our replicable algorithm BB works as follows. Let ε\varepsilon and δ\delta be the input parameters, 𝒟\mathcal{D} a distribution, and ff\in\mathcal{H}. Let AA be the statistical query learning algorithm for \mathcal{H} that outputs a hypothesis hh with approximation error e𝒟f(h)εe_{\mathcal{D}_{f}}(h)\leq\varepsilon. Let STAT(Df,ν)STAT(D_{f},\nu) be the statistical query oracle for this algorithm. Let ϕ1,,ϕd\phi_{1},\ldots,\phi_{d} be the statistical queries made by AA.

Let b=Ex,y𝒟f[ϕ1(x,y)],,Ex,y𝒟f[ϕd(x,y)]\vec{b}=\langle E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{1}(\langle x,y\rangle)],\ldots,E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{d}(\langle x,y\rangle)]\rangle. Set ε0=ν2d\varepsilon_{0}=\frac{\nu}{2d}. The algorithm BB first estimates the values b[i]=Ex,y𝒟f[ϕi(x,y)]b[i]=E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{i}(\langle x,y\rangle)], 1id1\leq i\leq d, up to an approximation error of ϵ0\epsilon_{0} with success probability 1δ/d1-\delta/d for each query. Note that this can be done by a simple empirical estimation algorithm that uses a total of n=O(d2ν2logd/δ)n=O(\frac{d^{2}}{\nu^{2}}\cdot\log d/\delta) samples. Let v\vec{v} be the estimated vector. It follows that vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) with probability at least 1δ1-\delta.

Now the algorithm BB evaluates the deterministic function fϵf_{\epsilon} from Lemma 3.5 on input v\vec{v}. Let u\vec{u} be the output vector. Now the algorithm BB simulates the statistical query algorithm AA with u[i]\vec{u}[i] as the answer to the query ϕi\phi_{i}. By Lemma 3.5, uB¯ν(b)\vec{u}\in\overline{B}_{\nu}^{\infty}(\vec{b}). Thus the error of the hypothesis output by the algorithm is at most ϵ\epsilon. Since AA is a deterministic algorithm, the number of possible outputs depends only on the number of outputs of the function fεf_{\varepsilon}; more precisely, the number of possible outputs is the size of the set {fε(v):vB¯ε0(b)}\{f_{\varepsilon}(\vec{v}):v\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b})\}, which is at most d+1d+1 by Lemma 3.5. Thus the total number of possible outputs of the algorithm BB is at most d+1d+1 with probability at least 1δ1-\delta. ∎

We note that we can simulate a statistical query algorithm that makes dd adaptive queries to get a 2d2^{d}-list replicable learning algorithm. This can be done by rounding each query to two possible values (the approximation factor increases by 2). The sample complexity of this algorithm will be O(dν2log1δ)O({d\over\nu^{2}}\cdot\log{1\over\delta}). The sample complexity can be improved to O~(dν2)\tilde{O}({\sqrt{d}\over\nu^{2}}) by using techniques from adaptive data analysis [BNS+21].

Next, we design a certificate replicable algorithm for hypothesis classes that admit statistical query learning algorithms.

Theorem 5.7.

Let \mathcal{H} be a concept class that is learnable with dd non-adaptive statistical queries; then \mathcal{H} is logdδ\lceil\log{d\over\delta}\rceil-certificate replicably learnable. Furthermore, the sample complexity n=n(ν,δ)n=n(\nu,\delta) of this algorithm equals O(d2ν2δ2logdδ)O(\frac{d^{2}}{\nu^{2}\delta^{2}}\cdot\log{d\over\delta}), where ν\nu is the approximation error parameter of each statistical query oracle.

Proof.

The proof is very similar to the proof of Theorem 4.6. Our replicable algorithm BB works as follows. Let ε\varepsilon and δ\delta be the input parameters, 𝒟\mathcal{D} a distribution, and ff\in\mathcal{H}. Let AA be the statistical query learning algorithm for \mathcal{H} that outputs a hypothesis hh with approximation error e𝒟f(h)εe_{\mathcal{D}_{f}}(h)\leq\varepsilon. Let STAT(Df,ν)STAT(D_{f},\nu) be the statistical query oracle for this algorithm. Let ϕ1,,ϕd\phi_{1},\ldots,\phi_{d} be the statistical queries made by AA.

Let b=Ex,y𝒟f[ϕ1(x,y)],,Ex,y𝒟f[ϕd(x,y)]\vec{b}=\langle E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{1}(\langle x,y\rangle)],\ldots,E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{d}(\langle x,y\rangle)]\rangle. Set ε0=νδ2d\varepsilon_{0}=\frac{\nu\delta}{2d}. The algorithm BB first estimates the values b[i]=Ex,y𝒟f[ϕi(x,y)]b[i]=E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{i}(\langle x,y\rangle)], 1id1\leq i\leq d, up to an approximation error of ϵ0\epsilon_{0} with success probability 1δ/d1-\delta/d for each query. Note that this can be done by a simple empirical estimation algorithm that uses a total of n=O(d2ν2δ2logd/δ)n=O(\frac{d^{2}}{\nu^{2}\delta^{2}}\cdot\log d/\delta) samples. Let v\vec{v} be the estimated vector. It follows that vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) with probability at least 1δ1-\delta. Now the algorithm BB evaluates the deterministic function ff described in Lemma 3.6 with inputs r{0,1}r\in\{0,1\}^{\ell}, where =logdδ\ell=\lceil\log{d\over\delta}\rceil, and v\vec{v}. By Lemma 3.6, for at least a 1δ1-\delta fraction of the rr’s, the function ff outputs a canonical vB¯ν(b)\vec{v}^{*}\in\overline{B}_{\nu}^{\infty}(\vec{b}). Now the algorithm BB simulates the statistical query algorithm AA with v[i]\vec{v}^{*}[i] as the answer to the query ϕi\phi_{i}. Since AA is a deterministic algorithm, it follows that our algorithm BB is certificate replicable. Finally, note that the certificate complexity is logdδ\lceil\log{d\over\delta}\rceil. ∎

As before we consider the case when the statistical query algorithm makes dd adaptive queries. The proof of the following theorem appears in the appendix.

Theorem 5.8.

Let \mathcal{H} be a concept class that is learnable with dd adaptive statistical queries; then \mathcal{H} is dlogdδ\lceil d\log{d\over\delta}\rceil-certificate replicably learnable. Furthermore, the sample complexity of this algorithm equals O(d3ν2δ2logdδ)O(\frac{d^{3}}{\nu^{2}\delta^{2}}\cdot\log{d\over\delta}), where ν\nu is the approximation error parameter of each statistical query oracle.

Proof Sketch.

The proof uses arguments similar to those above. The main difference is that we evaluate each query with an approximation error of νδd\frac{\nu\delta}{d} and a probability error parameter of δ/d\delta/d. This requires O(d2ν2δ2logdδ)O(\frac{d^{2}}{\nu^{2}\delta^{2}}\cdot\log{d\over\delta}) samples per query. We use a fresh set of certificate randomness for each such evaluation. Note that the length of the certificate for each query is logd/δ\lceil\log d/\delta\rceil. Thus the total certificate complexity is dlogdδ\lceil d\log{d\over\delta}\rceil. ∎
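The certificate bookkeeping in this sketch is simple: one shared random string of length dlogdδd\cdot\lceil\log{d\over\delta}\rceil is split into dd fresh per-query certificates. The helper below is ours and purely illustrative of that split.

```python
import math

def split_certificate(r_bits, d, delta):
    # One fresh certificate of length ceil(log2(d/delta)) per adaptive
    # query, carved out of a single shared random string.
    ell = math.ceil(math.log2(d / delta))
    assert len(r_bits) == d * ell, "certificate string has the wrong length"
    return [r_bits[i * ell:(i + 1) * ell] for i in range(d)]

# d = 2 queries, delta = 0.25: each query gets ceil(log2(8)) = 3 bits.
certs = split_certificate("011010", d=2, delta=0.25)
```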

5.3 Impossibility Results in the PAC Model

In this section, we establish matching upper and lower bounds for the dd-Threshold Estimation Problem in the PAC model with respect to the uniform distribution. We establish that this problem admits a (d+1)(d+1)-list replicable algorithm and does not admit a dd-list replicable algorithm.

Problem 5.9 (dd-Threshold Estimation Problem).

Fix some dd\in\mathbb{N}. Let X=[0,1]dX=[0,1]^{d}. For each value t[0,1]d\vec{t}\in[0,1]^{d} (which happens to be the same as XX), let ht:X{0,1}h_{\vec{t}}:X\to\left\{0,1\right\} be the function defined by

ht(x)={1for each i[d], it holds that xiti0otherwise.h_{\vec{t}}(\vec{x})=\begin{cases}1&\text{for each $i\in[d]$, it holds that $x_{i}\leq t_{i}$}\\ 0&\text{otherwise}\end{cases}.

This is the function that determines if each coordinate is less than or equal to the thresholds specified by t\vec{t}. Let \mathcal{H} be the hypothesis class consisting of all such threshold functions: ={ht|t[0,1]d}\mathcal{H}=\left\{h_{\vec{t}}\;|\;\vec{t}\in[0,1]^{d}\right\}.
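Concretely, a threshold hypothesis from this class is a one-liner; the sketch below (ours, for illustration) evaluates hth_{\vec{t}} on a point of the domain.

```python
def h(t, x):
    # h_t(x) = 1 iff every coordinate of x is at most the corresponding
    # threshold in t (Problem 5.9).
    return int(all(xi <= ti for xi, ti in zip(x, t)))

assert h([0.5, 0.7], [0.4, 0.7]) == 1   # both coordinates within thresholds
assert h([0.5, 0.7], [0.6, 0.2]) == 0   # first coordinate exceeds t_1
```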

We first observe the impossibility of list-replicable algorithms in the general PAC model. This follows from known results.

Observation 5.10.

Let kk be any constant. There is no kk-list replicable algorithm for the dd-Threshold Estimation Problem in the PAC model even when d=1d=1.

Proof.

From the works of [BLM20] and [ALMM19], it follows that a class has finite Littlestone dimension if and only if there exists a constant kk such that the concept class has a kk-list replicable algorithm in the PAC model. Since the concept class dd-Threshold Estimation Problem has infinite Littlestone dimension even when d=1d=1, the observation follows. ∎

The above result rules out list-replicable algorithms in the general PAC model. In the rest of this section, we study the replicable learnability of dd-Threshold Estimation Problem in the PAC model under the uniform distribution. We establish matching upper and lower bounds on the list complexity.

Theorem 5.11.

In the PAC model under the uniform distribution, there is a (d+1)(d+1)-list replicable algorithm for dd-Threshold Estimation Problem.

Proof.

It is known and easy to see that dd-Threshold Estimation Problem is learnable under the uniform distribution by making dd nonadaptive statistical queries. Thus by Theorem 5.6, dd-Threshold Estimation Problem admits a (d+1)(d+1)-list replicable algorithm. ∎

We next establish that the above result is tight by proving that there is no dd-list replicable algorithm in the PAC model under the uniform distribution.

Theorem 5.12.

For k<d+1k<d+1, there does not exist a kk-list replicable algorithm for the dd-Threshold Estimation Problem in the PAC model under the uniform distribution.

The proof that for k<d+1k<d+1, there is no algorithm which kk-list replicably learns dd-Threshold Estimation Problem in the PAC model is similar to the proof of Theorem 4.7. The reason is that sampling dd-many biased coins with biases b\vec{b} is similar to obtaining a point x\vec{x} uniformly at random from [0,1]d[0,1]^{d} and evaluating the threshold function hbh_{\vec{b}} on it—this corresponds to asking whether all of the coins were heads/11’s. The two models differ though, because in the sample model for the dd-Coin Bias Estimation Problem, the algorithm sees for each coin whether it is heads or tails, but this information is not available in the PAC model for the dd-Threshold Estimation Problem. Conversely, in the PAC model for the dd-Threshold Estimation Problem, a random draw from [0,1]d[0,1]^{d} is available to the algorithm, but in the sample model for the dd-Coin Bias Estimation Problem the algorithm does not get this information.

Furthermore, there is the following additional complexity in the impossibility result for the dd-Threshold Estimation Problem. In the dd-Coin Bias Estimation Problem, we said by definition that a collection of dd coins parameterized by bias vector a\vec{a} was an ϵ\epsilon-approximation to a collection of dd coins parameterized by bias vector b\vec{b} if and only if baϵ\lVert\vec{b}-\vec{a}\rVert_{\infty}\leq\epsilon, and we used this norm in applying the results of [VWDP+22]. However, the notion of ϵ\epsilon-approximation in the PAC model is quite different from this. It is possible to have hypotheses hah_{\vec{a}} and hbh_{\vec{b}} in the dd-Threshold Estimation Problem such that ba>ϵ\lVert\vec{b}-\vec{a}\rVert_{\infty}>\epsilon but with respect to some distribution 𝒟X\mathcal{D}_{X} on the domain XX we have e𝒟X(ha,hb)ϵ\operatorname{e}_{\mathcal{D}_{X}}(h_{\vec{a}},h_{\vec{b}})\leq\epsilon. For example, if 𝒟X\mathcal{D}_{X} is the uniform distribution on X=[0,1]dX=[0,1]^{d} and a=0\vec{a}=\vec{0} and b\vec{b} is the first standard basis vector b=1,0,,0\vec{b}=\langle 1,0,\ldots,0\rangle, and ϵ=12\epsilon=\frac{1}{2}, then ba=1>ϵ\lVert\vec{b}-\vec{a}\rVert_{\infty}=1>\epsilon, but e𝒟X(ha,hb)=0ϵ\operatorname{e}_{\mathcal{D}_{X}}(h_{\vec{a}},h_{\vec{b}})=0\leq\epsilon because ha(x)hb(x)h_{\vec{a}}(\vec{x})\not=h_{\vec{b}}(\vec{x}) if and only if all of the last d1d-1 coordinates of x\vec{x} are 0 and the first coordinate is >0>0, but there is probability 0 of sampling such x\vec{x} from the uniform distribution on X=[0,1]dX=[0,1]^{d}.
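The example above can be checked empirically. In the sketch below (ours, for illustration), a\vec{a} is the all-zeros vector and b\vec{b} the first standard basis vector, so the two threshold hypotheses are at ℓ∞ distance 11, yet under the uniform distribution they disagree only on a measure-zero set, so the sampled disagreement rate is 0 almost surely.

```python
import random

def h(t, x):
    # Threshold hypothesis of Problem 5.9.
    return int(all(xi <= ti for xi, ti in zip(x, t)))

d = 3
a = [0.0] * d                     # all-zeros threshold vector
b = [1.0] + [0.0] * (d - 1)       # first standard basis vector
rng = random.Random(0)
disagreements = sum(
    h(a, x) != h(b, x)
    for x in ([rng.random() for _ in range(d)] for _ in range(10000))
)
assert disagreements == 0  # the hypotheses agree on every sampled point
```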

For this reason, we cannot simply partition [0,1]d[0,1]^{d} as we did in the proof of Theorem 4.7 and must argue more carefully. It turns out that it is possible to find a subcube [α,1]d[\alpha,1]^{d} on which hypotheses parameterized by vectors on opposite faces of this cube have high PAC error between them. A consequence, by the triangle inequality for e𝒟X\operatorname{e}_{\mathcal{D}_{X}}, is that two such hypotheses cannot both be approximated by a common third hypothesis. That is what the following lemma states.

Lemma 5.13.

Let d\in\mathbb{N} and \alpha=\frac{d-1}{d}. Let \vec{s},\vec{t}\in[\alpha,1]^{d} be such that there exists a coordinate i_{0}\in[d] where s_{i_{0}}=\alpha and t_{i_{0}}=1 (i.e., \vec{s} and \vec{t} are on opposite faces of this cube). Let \epsilon\leq\frac{1}{8d}. Then there is no point \vec{r}\in X such that both \operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{r}})\leq\epsilon and \operatorname{e}_{\mathrm{unif}}(h_{\vec{t}},h_{\vec{r}})\leq\epsilon (i.e., there is no hypothesis that is an \epsilon-approximation to both h_{\vec{s}} and h_{\vec{t}}).

Proof.

Let \vec{q} be the vector with q_{i_{0}}=s_{i_{0}} and q_{i}=t_{i} for every i\not=i_{0}; it will serve as a proxy for \vec{s}.

We need the following claim.

Claim 5.14.

For each \vec{x}\in X, the following are equivalent:

  1. h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x});

  2. h_{\vec{q}}(\vec{x})=0 and h_{\vec{t}}(\vec{x})=1;

  3. x_{i_0}\in(q_{i_0},t_{i_0}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}.

Furthermore, these equivalent conditions imply the following:

  4. h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x}).

Proof of Claim.

(2) \Longrightarrow (1): This is trivial.

(1) \Longrightarrow (2):

Note that because q_{i_{0}}=s_{i_{0}}=\alpha<1=t_{i_{0}}, we have q_{i}\leq t_{i} for all i\in[d]. If h_{\vec{t}}(\vec{x})=0, then for some i_{1}\in[d] it must be that x_{i_{1}}>t_{i_{1}}, but since t_{i_{1}}\geq q_{i_{1}} it would also be the case that x_{i_{1}}>q_{i_{1}}, so h_{\vec{q}}(\vec{x})=0, contradicting h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}). Thus h_{\vec{t}}(\vec{x})=1, and since h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}), we have h_{\vec{q}}(\vec{x})=0.

(1) \iff (3): We partition [0,1]^{d} into three sets and examine each case.

Case 1: x_{i_{0}}\in(q_{i_{0}},t_{i_{0}}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}. In this case, q_{i_{0}}<x_{i_{0}}, so h_{\vec{q}}(\vec{x})=0, and x_{i}\leq t_{i} for all i\in[d], so h_{\vec{t}}(\vec{x})=1; hence h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}).

Case 2: x_{i_{0}}\not\in(q_{i_{0}},t_{i_{0}}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}. In this case, because x_{i_{0}}\in[0,1] and x_{i_{0}}\not\in(\alpha,1], we have x_{i_{0}}\leq\alpha=q_{i_{0}}\leq t_{i_{0}}, and for all other i\in[d]\setminus\{i_{0}\} we have x_{i}\leq t_{i}=q_{i} (by definition of \vec{q}). Thus h_{\vec{q}}(\vec{x})=1=h_{\vec{t}}(\vec{x}).

Case 3: x_{i_{1}}\not\in[0,t_{i_{1}}] for some i_{1}\in[d]\setminus\{i_{0}\}. In this case, because x_{i_{1}}\in[0,1], we have x_{i_{1}}>t_{i_{1}}=q_{i_{1}}. Thus h_{\vec{q}}(\vec{x})=0=h_{\vec{t}}(\vec{x}).

Thus h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}) if and only if x_{i_{0}}\in(q_{i_{0}},t_{i_{0}}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}.

(1, 2, 3) \Longrightarrow (4): By (3), we have x_{i_{0}}>q_{i_{0}}, and since q_{i_{0}}=s_{i_{0}} by definition of \vec{q}, it follows that x_{i_{0}}>s_{i_{0}}, which means h_{\vec{s}}(\vec{x})=0. By (2), h_{\vec{t}}(\vec{x})=1, which gives h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x}). ∎
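As a sanity check, the claim's equivalences can be verified numerically on random points. The sketch below uses the coordinate-wise threshold convention h_{\vec{v}}(\vec{x})=1 iff \vec{x}\leq\vec{v} in every coordinate, and an illustrative (randomly drawn) choice of d, \vec{s}, and \vec{t} on opposite faces of [\alpha,1]^{d}.

```python
import random

def h(v, x):
    # Threshold hypothesis: label 1 iff x is coordinate-wise at most v.
    return 1 if all(xi <= vi for vi, xi in zip(v, x)) else 0

random.seed(0)
d, i0 = 4, 0
alpha = (d - 1) / d

# s and t lie in [alpha, 1]^d, on opposite faces in coordinate i0.
s = [alpha] + [random.uniform(alpha, 1) for _ in range(d - 1)]
t = [1.0] + [random.uniform(alpha, 1) for _ in range(d - 1)]
q = [s[i] if i == i0 else t[i] for i in range(d)]  # the proxy vector

n_disagree = 0
for _ in range(100_000):
    x = [random.random() for _ in range(d)]
    cond1 = h(q, x) != h(t, x)
    cond3 = (q[i0] < x[i0] <= t[i0]) and all(
        x[i] <= t[i] for i in range(d) if i != i0
    )
    assert cond1 == cond3                          # (1) <=> (3)
    if cond1:
        n_disagree += 1
        assert h(q, x) == 0 and h(t, x) == 1       # (1) => (2)
        assert h(s, x) != h(t, x)                  # (1) => (4)

print(n_disagree)
```

Disagreements do occur with noticeable frequency (roughly a (1-\alpha)\alpha^{d-1} fraction of samples), which is exactly the quantity lower-bounded next in the proof.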

With this claim in hand, our next step is to prove the following two inequalities:

2\epsilon<\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}})\leq\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}).

For the second of these inequalities, note that by the (1) \Longrightarrow (4) part of the claim above, h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}) implies h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x}), so we have

\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}}) =\Pr_{\vec{x}\sim\mathrm{unif}(X)}[h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x})]
\leq\Pr_{\vec{x}\sim\mathrm{unif}(X)}[h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x})]
=\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}).

Now, for the first of the inequalities above, we use the (1) \iff (3) portion of the claim, the hypothesis that \vec{t}\in[\alpha,1]^{d} (which implies [0,\alpha]\subseteq[0,t_{i}] for each i\in[d]), the hypothesis that \epsilon\leq\frac{1}{8d}, and the elementary bound (1-\alpha)\cdot\alpha^{d-1}=\frac{1}{d}\left(1-\frac{1}{d}\right)^{d-1}\geq\frac{1}{ed}>\frac{1}{4d}. Utilizing these, we get the following:

\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}})
=\Pr_{\vec{x}\sim\mathrm{unif}(X)}[h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x})]
=\Pr_{\vec{x}\sim\mathrm{unif}(X)}[x_{i_{0}}\in(\alpha,1]\;\wedge\;\forall i\in[d]\setminus\{i_{0}\},\,x_{i}\in[0,t_{i}]]
=\Pr_{x_{i_{0}}\sim\mathrm{unif}([0,1])}[x_{i_{0}}\in(\alpha,1]]\cdot\prod_{\substack{i=1\\ i\not=i_{0}}}^{d}\Pr_{x\sim\mathrm{unif}([0,1])}[x\in[0,t_{i}]]
\geq\Pr_{x_{i_{0}}\sim\mathrm{unif}([0,1])}[x_{i_{0}}\in(\alpha,1]]\cdot\prod_{\substack{i=1\\ i\not=i_{0}}}^{d}\Pr_{x\sim\mathrm{unif}([0,1])}[x\in[0,\alpha]]
=(1-\alpha)\cdot\alpha^{d-1}
>\frac{1}{4d}
\geq 2\epsilon.

Thus, we get the desired two inequalities:

2\epsilon<\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}})\leq\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}).
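The strict inequality (1-\alpha)\cdot\alpha^{d-1}>\frac{1}{4d} (with \alpha=\frac{d-1}{d}), together with \frac{1}{4d}\geq 2\epsilon when \epsilon\leq\frac{1}{8d}, can be confirmed numerically; below is a quick check over a range of dimensions.

```python
# Check that (1 - alpha) * alpha^(d-1) > 1/(4d) for alpha = (d-1)/d.
# This holds because (1 - 1/d)^(d-1) >= 1/e > 1/4 for every d >= 1.
margins = []
for d in range(1, 1001):
    alpha = (d - 1) / d
    lower_bound = (1 - alpha) * alpha ** (d - 1)   # disagreement-probability bound
    two_eps_max = 2 * (1 / (8 * d))                # largest allowed value of 2*epsilon
    assert lower_bound > 1 / (4 * d) >= two_eps_max
    margins.append(lower_bound - 1 / (4 * d))

min_margin = min(margins)
print(min_margin > 0)
```

The margin shrinks like (1/e - 1/4)/d as d grows but never vanishes, which is what the proof needs.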

This nearly completes the proof. If there existed some point \vec{r}\in X such that both \operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{r}})\leq\epsilon and \operatorname{e}_{\mathrm{unif}}(h_{\vec{t}},h_{\vec{r}})\leq\epsilon, then it would follow from the triangle inequality for \operatorname{e}_{\mathrm{unif}} that

\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}})\leq\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{r}})+\operatorname{e}_{\mathrm{unif}}(h_{\vec{t}},h_{\vec{r}})\leq 2\epsilon

but this would contradict the inequalities above, so no such \vec{r} exists. ∎

Equipped with the above lemma, we are now ready to prove Theorem 5.12.

Proof of Theorem 5.12.

Fix any d\in\mathbb{N}, and choose \epsilon\leq\frac{1}{8d} and \delta\leq\frac{1}{d+2}, so that Lemma 5.13 applies. We will use the constant \alpha=\frac{d-1}{d} and consider the cube [\alpha,1]^{d}. We will also consider only the uniform distribution over X.

Suppose for contradiction that such an algorithm A exists for some k<d+1. This means that for each possible threshold \vec{t}\in[0,1]^{d}, there exists some set L_{\vec{t}}\subseteq\mathcal{H} of hypotheses with three properties: (1) each element of L_{\vec{t}} is an \epsilon-approximation to h_{\vec{t}}, (2) \lvert L_{\vec{t}}\rvert\leq k, and (3) with probability at least 1-\delta, A returns an element of L_{\vec{t}}.

By a simple averaging argument, there exists at least one element of L_{\vec{t}} that is returned by A with probability at least \frac{1}{k}(1-\delta)\geq\frac{1}{k}(1-\frac{1}{d+2})=\frac{1}{k}\cdot\frac{d+1}{d+2}\geq\frac{1}{k}\cdot\frac{k+1}{k+2} (using k<d+1 and the fact that x\mapsto\frac{x+1}{x+2} is increasing). Let f\colon[\alpha,1]^{d}\to[0,1]^{d} be a function that maps each threshold \vec{t}\in[\alpha,1]^{d} to such an element of L_{\vec{t}}. This differs slightly from the proof of Theorem 4.7 because we define f only on a specific subset of the possible thresholds; the reason for this was alluded to in the discussion following the statement of Theorem 5.12.

Since \frac{1}{k}\cdot\frac{k+1}{k+2}>\frac{1}{k+1}, let \eta be such that 0<\eta<\frac{1}{k}\cdot\frac{k+1}{k+2}-\frac{1}{k+1}.
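Both inequalities in this averaging step reduce to elementary algebra: \frac{d+1}{d+2}\geq\frac{k+1}{k+2} because k\leq d, and \frac{1}{k}\cdot\frac{k+1}{k+2}>\frac{1}{k+1} because (k+1)^{2}=k(k+2)+1. The short check below confirms them over a range of parameters.

```python
# For every d and every k < d + 1:
#   (1/k) * (d+1)/(d+2) >= (1/k) * (k+1)/(k+2)   (since (x+1)/(x+2) is increasing)
#   (1/k) * (k+1)/(k+2) >  1/(k+1)               (since (k+1)^2 = k(k+2) + 1)
eta_gaps = []
for d in range(1, 200):
    for k in range(1, d + 1):  # ranges over all k < d + 1
        heavy = (1 / k) * (d + 1) / (d + 2)   # lower bound on the heaviest element's probability
        assert heavy >= (1 / k) * (k + 1) / (k + 2)
        eta_gaps.append((1 / k) * (k + 1) / (k + 2) - 1 / (k + 1))

min_gap = min(eta_gaps)
print(min_gap > 0)
```

The gap equals \frac{1}{k(k+1)(k+2)} exactly, so a valid positive \eta always exists.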

The function f induces a partition \mathcal{P} of [\alpha,1]^{d} whose members are the fibers of f (i.e., \mathcal{P}=\{f^{-1}(\vec{y})\colon\vec{y}\in\operatorname{range}(f)\}). For any member W\in\mathcal{P} and any coordinate i\in[d], the set \{w_{i}\colon\vec{w}\in W\} cannot contain both values \alpha and 1. If it did, there would be two points \vec{s},\vec{t}\in W with s_{i}=\alpha and t_{i}=1; since both belong to W, there is some \vec{y}\in[0,1]^{d} with f(\vec{s})=\vec{y}=f(\vec{t}), so h_{\vec{y}} would have to be an \epsilon-approximation (in the PAC model) of both h_{\vec{s}} and h_{\vec{t}}, which is impossible by Lemma 5.13.

Thus, the partition \mathcal{P} is a “non-spanning” partition of [\alpha,1]^{d} as in the proof of 3.7, so there is some point \vec{p}\in[\alpha,1]^{d} such that for every radius r>0, the ball \overline{B}_{r}^{\infty}(\vec{p}) intersects at least d+1 members of \mathcal{P}.

Similar to 4.8 and its use in the proof of Theorem 4.7, we can use the following two facts. First, the function \gamma_{1} defined by \gamma_{1}(\vec{s},\vec{t})=\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}) is continuous (with respect to the \ell_{\infty} norm on the domain). Second, the function \gamma_{2}(h_{\vec{s}},h_{\vec{t}})=\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{s},n},\mathcal{D}_{A,\vec{t},n}) is continuous (with respect to the \operatorname{e}_{\mathrm{unif}} notion of distance on the domain). Consequently, the composition \gamma_{12}(\vec{s},\vec{t})=\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{s},n},\mathcal{D}_{A,\vec{t},n}) is continuous. Thus, we can find some radius r>0 such that if \lVert\vec{t}-\vec{s}\rVert_{\infty}\leq r, then \operatorname{d_{TV}}(\mathcal{D}_{A,\vec{s},n},\mathcal{D}_{A,\vec{t},n})\leq\eta.

Now we get the same type of contradiction as in the proof of Theorem 4.7: for the special point \vec{p}, the distribution \mathcal{D}_{A,\vec{p},n} has d+1\geq k+1 disjoint events, each of probability greater than \frac{1}{k+1}, which is impossible since the probabilities of disjoint events sum to at most 1. Thus, no k-list replicable algorithm exists. ∎

6 Conclusions

In this work, we investigated the pressing issue of replicability in machine learning from an algorithmic point of view. We observed that replicability in the absolute sense is difficult to achieve, so we considered two natural relaxations that capture the degree of (non-)replicability: list and certificate replicability. We designed replicable algorithms with small list, certificate, and sample complexities for the d-Coin Bias Estimation Problem and for the class of problems that can be learned via statistical query algorithms making non-adaptive statistical queries. We also established certain impossibility results in the PAC model of learning and for the d-Coin Bias Estimation Problem. Several interesting research directions emerge from our work. There is a gap between the sample complexities of list and certificate replicability with comparable parameters; is this gap inevitable? Currently, there is an exponential gap in the replicability parameters between hypothesis classes that can be learned via non-adaptive and adaptive statistical queries; is this gap necessary? A generic question is to explore the trade-offs among sample complexity, list complexity, certificate complexity, and adaptivity.

7 Acknowledgements

We thank an anonymous reviewer for suggesting Observation 5.10. Pavan’s work is partly supported by NSF award 2130536. Part of this work was done while Pavan was visiting the Simons Institute for the Theory of Computing. Vander Woude’s and Vinodchandran’s work is partly supported by NSF award 2130608.

References

  • [AJJ+22] Kwangjun Ahn, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, and Gil I. Shamir. Reproducibility in optimization: Theoretical framework and limits, 2022. arXiv:2202.04598.
  • [ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852–860. ACM, 2019.
  • [AV20] Nima Anari and Vijay V. Vazirani. Matching is as easy as the decision problem, in the NC model. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference, ITCS 2020, January 12-14, 2020, Seattle, Washington, USA, volume 151 of LIPIcs, pages 54:1–54:25. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • [Bak16] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533:452–454, 2016.
  • [BLM20] Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 389–402. IEEE, 2020.
  • [BNS+21] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan R. Ullman. Algorithmic stability for adaptive data analysis. SIAM J. Comput., 50(3), 2021.
  • [Can15] Clément L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electron. Colloquium Comput. Complex., TR15-063, 2015. URL: https://eccc.weizmann.ac.il/report/2015/063, arXiv:TR15-063.
  • [DLPES02] Jesus A. De Loera, Elisha Peterson, and Francis Edward Su. A Polytopal Generalization of Sperner’s Lemma. Journal of Combinatorial Theory, Series A, 100(1):1–26, October 2002. URL: https://www.sciencedirect.com/science/article/pii/S0097316502932747, doi:10.1006/jcta.2002.3274.
  • [DPV18] Peter Dixon, Aduri Pavan, and N. V. Vinodchandran. On pseudodeterministic approximation algorithms. In Igor Potapov, Paul G. Spirakis, and James Worrell, editors, 43rd International Symposium on Mathematical Foundations of Computer Science, MFCS 2018, August 27-31, 2018, Liverpool, UK, volume 117 of LIPIcs, pages 61:1–61:11. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
  • [DPVWV22] Peter Dixon, Aduri Pavan, Jason Vander Woude, and N. V. Vinodchandran. Pseudodeterminism: promises and lowerbounds. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1552–1565. ACM, 2022.
  • [eco13] How science goes wrong. The Economist, pages 25–30, 2013.
  • [EKK+23] Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits, 2023. arXiv:2210.01898.
  • [GG] Shafi Goldwasser and Ofer Grossman. Bipartite perfect matching in pseudo-deterministic NC. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs.
  • [GG11] Eran Gat and Shafi Goldwasser. Probabilistic Search Algorithms with Unique Answers and Their Cryptographic Applications. Technical Report 136, 2011. URL: https://eccc.weizmann.ac.il/report/2011/136/.
  • [GGH18] Shafi Goldwasser, Ofer Grossman, and Dhiraj Holden. Pseudo-deterministic proofs. In Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, volume 94 of LIPIcs, pages 17:1–17:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
  • [GGMW20] Shafi Goldwasser, Ofer Grossman, Sidhanth Mohanty, and David P. Woodruff. Pseudo-deterministic streaming. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference, ITCS, volume 151 of LIPIcs, pages 79:1–79:25, 2020.
  • [GKM21] Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. User-level differentially private learning via correlated sampling. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 20172–20184, 2021.
  • [GL19] Ofer Grossman and Yang P. Liu. Reproducibility and pseudo-determinism in log-space. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 606–620. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975482.38, doi:10.1137/1.9781611975482.38.
  • [Gol19] Oded Goldreich. Multi-pseudodeterministic algorithms. Electron. Colloquium Comput. Complex., TR19-012, 2019. URL: https://eccc.weizmann.ac.il/report/2019/012, arXiv:TR19-012.
  • [ILPS22] Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 818–831. ACM, 2022. URL: https://doi.org/10.1145/3519935.3519973, doi:10.1145/3519935.3519973.
  • [JP05] John P. A. Ioannidis. Why most published research findings are false. PLOS Medicine, 2(8), 2005.
  • [Kea98] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.
  • [LOS21] Zhenjian Lu, Igor Carboni Oliveira, and Rahul Santhanam. Pseudodeterministic algorithms and the structure of probabilistic time. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 303–316. ACM, 2021.
  • [MPK19] Harshal Mittal, Kartikey Pandey, and Yash Kant. Iclr reproducibility challenge report (padam : Closing the generalization gap of adaptive gradient methods in training deep neural networks), 2019. arXiv:1901.09517.
  • [NAS19] National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. 2019. https://doi.org/10.17226/25303.
  • [OS17] I. Oliveira and R. Santhanam. Pseudodeterministic constructions in subexponential time. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 665–677, 2017.
  • [OS18] Igor Carboni Oliveira and Rahul Santhanam. Pseudo-derandomizing learning and approximation. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2018, volume 116 of LIPIcs, pages 55:1–55:19, 2018.
  • [PVLS+21] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Lariviere, Alina Beygelzimer, Florence d’Alche Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research(a report from the neurips 2019 reproducibility program). Journal of Machine Learning Research, 22(164):1–20, 2021. URL: http://jmlr.org/papers/v22/20-303.html.
  • [SZ99] Michael E. Saks and Shiyu Zhou. BP_{H}SPACE(S) \subseteq DSPACE(S^{3/2}). J. Comput. Syst. Sci., 58(2):376–403, 1999.
  • [VWDP+22] Jason Vander Woude, Peter Dixon, Aduri Pavan, Jamie Radcliffe, and N. V. Vinodchandran. Geometry of rounding. CoRR, abs/2211.02694, 2022. URL: https://doi.org/10.48550/arXiv.2211.02694, arXiv:2211.02694, doi:10.48550/arXiv.2211.02694.