
List and Certificate Complexities in Replicable Learning

Peter Dixon Ben-Gurion University of Negev A. Pavan Iowa State University Jason Vander Woude University of Nebraska-Lincoln N. V. Vinodchandran University of Nebraska-Lincoln
Abstract

We investigate replicable learning algorithms. Ideally, we would like to design algorithms that output the same canonical model over multiple runs, even when different runs observe a different set of samples from the unknown data distribution. In general, such a strong notion of replicability is not achievable. Thus we consider two feasible notions of replicability called list replicability and certificate replicability. Intuitively, these notions capture the degree of (non) replicability. We design algorithms for certain learning problems that are optimal in list and certificate complexity. We establish matching impossibility results.

1 Introduction

Replicability and reproducibility in science are critical concerns. The fundamental requirement that scientific results and experiments be replicable/reproducible is central to the development and evolution of science. In recent years, these concerns have grown as several scientific disciplines turn to data-driven research, which enables exponential progress through data democratization and affordable computing resources. The replicability issue has received attention from a wide spectrum of entities, from general media publications (for example, The Economist's "How Science Goes Wrong," 2013 [eco13]) to scientific publication venues (for example, see [JP05, Bak16]) to professional and scientific bodies such as the National Academies of Sciences, Engineering, and Medicine (NASEM). The emerging challenges to replicability and reproducibility have been discussed in depth in a consensus study report published by NASEM [NAS19].

A broad approach taken to ensure the reproducibility/replicability of algorithms is to make the datasets, algorithms, and code publicly available. Of late, conferences have been hosting replicability workshops to promote best practices and to encourage researchers to share code (for example, see  [PVLS+21] and [MPK19]). An underlying assumption is that consistent results can be obtained using the same input data, computational methods, and code. However, this practice alone is insufficient to ensure replicability as modern-day approaches use computations that inherently involve randomness.

Computing over random variables results in a high degree of non-replicability, especially in machine learning tasks. Machine learning algorithms observe samples from a (sometimes unknown) distribution and output a hypothesis or model. Such algorithms are inherently non-replicable: two distinct runs will output different models because they see different sets of samples. Ideally, to achieve "perfect replicability," we would like to design algorithms that output the same canonical hypothesis over multiple runs, even when different runs observe different sets of samples from the unknown distribution.

We first observe that perfect replicability is not achievable in learning, as a dependency of the output on the data samples is inevitable. We illustrate this with the simple task of estimating the bias of a coin: given $n$ independent tosses of a coin with unknown bias $b$, output an estimate of $b$. It is relatively easy to argue that no algorithm can output a canonical estimate $v_b$ with probability $\geq 2/3$ such that $|v_b-b|\leq\varepsilon$. Consider a sequence of coins with biases $b_1<b_2<\cdots<b_m$ where $b_{i+1}-b_i\leq\eta$ (for some small $\eta$) but $b_m-b_1\geq 2\varepsilon$. Let $\mathcal{D}_i^n$ denote the distribution of $n$ independent tosses of the $i^{th}$ coin. For two adjacent biases, the statistical distance (denoted by $\operatorname{d_{TV}}$) between $\mathcal{D}_i^n$ and $\mathcal{D}_{i+1}^n$ is $\leq n\eta$. Let $v_i$ and $v_{i+1}$ be the canonical estimates output by an algorithm $A$ for biases $b_i$ and $b_{i+1}$ respectively. Take $\eta=1/10n$. Since $A$ on samples from $\mathcal{D}_i^n$ outputs $v_i$ with probability at least $2/3$ and $\operatorname{d_{TV}}(\mathcal{D}_i^n,\mathcal{D}_{i+1}^n)\leq n\eta$, $A(\mathcal{D}_{i+1}^n)$ must output $v_i$ with probability at least $2/3-n\eta>1/2$. Since $A(\mathcal{D}_{i+1}^n)$ also outputs its canonical value $v_{i+1}$ with probability at least $2/3>1/2$, this implies that $v_i=v_{i+1}$. Thus $A$ must output the same value on all biases $b_1,\ldots,b_m$, a contradiction since $b_1$ and $b_m$ are $2\varepsilon$ apart. However, it is easy to see that there is a bias-estimation algorithm that outputs one of two canonical estimates using $n=O(1/\varepsilon^2)$ tosses: estimate the bias to within $\varepsilon/2$ and round the value to the closest multiple of $\varepsilon$. These two observations are the starting point of our work: even though it may not be possible to design learning algorithms that are perfectly replicable, it is possible to design algorithms whose "non-replicability" is minimized.
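The two-canonical-estimates observation can be made concrete with a short sketch (ours, not the paper's; the helper names are illustrative):

```python
def round_to_grid(estimate, eps):
    """Round an estimate to the nearest multiple of eps.

    Any estimate within eps/2 of the true bias b lands on one of the at
    most two multiples of eps closest to b, so repeated runs output at
    most two distinct canonical values.
    """
    return round(estimate / eps) * eps

def estimate_bias(tosses, eps):
    """2-list replicable bias estimation: empirical mean, then rounding."""
    empirical = sum(tosses) / len(tosses)  # within eps/2 of b w.h.p. for n = O(1/eps^2)
    return round_to_grid(empirical, eps)
```

For instance, with true bias $b=0.37$ and $\varepsilon=0.1$, every empirical mean in $[0.32,0.42]$ rounds to either $0.3$ or $0.4$.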

Motivated by this, we study two notions of replicability called list-replicability and certificate-replicability in the context of machine learning algorithms. These notions quantify the degree of (non)replicability. The notions we consider are rooted in the pseudodeterminism literature [GG11, Gol19, GL19], which has been an active area in randomized algorithms and computational complexity theory. Section 2 discusses this connection and other works that are related to our work.

1.1 Our Results

We consider two notions of replicable learning: list-replicable learning and certificate-replicable learning. In list replicability, the learning algorithm should, with high probability, output one of the models from a list of models of size $\leq k$. This means that when the learning algorithm is run multiple times, we see at most $k$ distinct models/hypotheses. The value $k$ is called the list complexity of the algorithm. An algorithm whose list complexity is $1$ is perfectly replicable. Thus list complexity can be considered as the degree of (non)replicability. The goal in this setting is to design learning algorithms that minimize the list complexity $k$.

In certificate replicability, the learning algorithm has access to an $\ell$-bit random string called the certificate of replicability, which is independent of the samples and of any other randomness the algorithm may use. We require that for most certificates $r$, the algorithm must output a canonical model $h$ that depends only on $r$. Thus, once we fix a certificate $r$, multiple runs of the algorithm with access to $r$ will output the same model w.h.p. We call $\ell$ the certificate complexity. Note that an algorithm with zero certificate complexity is perfectly replicable; thus $\ell$ is another measure of the degree of (non)replicability of the algorithm. The goal in this setting is to design learning algorithms that minimize $\ell$.

A critical resource in machine learning tasks is the number of samples the algorithm observes, known as the sample complexity. An efficient learning algorithm uses as few samples as possible. In this work, we measure the efficiency of learning algorithms in terms of sample complexity, list complexity, and certificate complexity. This work initiates a study of learning algorithms that are efficient in certificate/list complexities as well as sample complexity. We establish both positive results and impossibility results for certain learning tasks.

Estimating the bias of dd coins.

Our first set of results concerns efficient replicable algorithms for coin-bias estimation. We consider the problem of simultaneously estimating the biases of $d$ coins by observing $n$ tosses of each coin, which we call the $d$-Coin Bias Estimation Problem. The task is to output a bias vector $\vec{v}$ so that $\lVert\vec{b}-\vec{v}\rVert_\infty\leq\varepsilon$, where $\vec{b}$ is the true bias vector. We show that there is a $(d+1)$-list replicable learning algorithm for this problem with sample complexity (number of coin tosses) $n=O(\frac{d^2}{\varepsilon^2}\log\frac{d}{\delta})$ per coin. We also design a $\lceil\log\frac{d}{\delta}\rceil$-certificate replicable algorithm for the problem with sample complexity $O(\frac{d^2}{\varepsilon^2\delta^2})$ per coin. Here $(1-\delta)$ is the success probability.

We also establish the optimality of the above upper bounds in terms of list and certificate complexities. We show that there is no $d$-list replicable learning algorithm for the $d$-Coin Bias Estimation Problem. This leads to a lower bound of $\Omega(\log d)$ on its certificate complexity. To establish this impossibility result we use a version of the KKM/Sperner lemma.

PAC learning.

We establish possibility and impossibility results for list/certificate replicable algorithms in the context of PAC learning. We show that any concept class that can be learned using $d$ non-adaptive statistical queries can be learned by a $(d+1)$-list replicable PAC learning algorithm with sample complexity $O(\frac{d^2}{\nu^2}\log\frac{d}{\delta})$, where $\nu$ is the statistical query parameter. We also show that such concept classes admit a $\lceil\log\frac{d}{\delta}\rceil$-certificate replicable PAC learning algorithm with sample complexity $O(\frac{d^2}{\nu^2\delta^2}\log\frac{d}{\delta})$. Finally, we establish tight results in the PAC learning model. In particular, we prove that the concept class of $d$-dimensional thresholds does not admit a $d$-list replicable learning algorithm under the uniform distribution. Since $d$-dimensional thresholds can be learned under the uniform distribution using $d$ non-adaptive statistical queries, we get a $(d+1)$-list replicable PAC algorithm under the uniform distribution. This yields matching upper and lower bounds on the list complexity of PAC learning thresholds under the uniform distribution.

2 Prior and Related Work

Formalizing reproducibility and replicability has gained considerable momentum in recent years. While the terms reproducibility and replicability are very close and often used interchangeably, there has been an effort to distinguish between them; accordingly, our notions fall under the replicability definition [PVLS+21].

In the context of randomized algorithms, various notions of reproducibility/replicability have been investigated. The work of Gat and Goldwasser [GG11] formalized and defined the notion of pseudodeterministic algorithms. A randomized algorithm $A$ is pseudodeterministic if, for any input $x$, there is a canonical value $v_x$ such that $\Pr[A(x)=v_x]\geq 2/3$. Gat and Goldwasser designed polynomial-time pseudodeterministic algorithms for algebraic computational problems, such as finding quadratic non-residues and finding non-roots of multivariate polynomials [GG11]. Later works studied pseudodeterminism in other algorithmic settings, such as parallel computation, streaming and sub-linear algorithms, and interactive proofs, as well as its connections to complexity theory [GG, GGH18, OS17, OS18, AV20, GGMW20, LOS21, DPVWV22].

In the algorithmic setting, two generalizations of pseudodeterminism have mainly been investigated: multi-pseudodeterministic algorithms [Gol19] and influential-bit algorithms [GL19]. A randomized algorithm $A$ is $k$-pseudodeterministic if, for every input $x$, there is a set $S_x$ of size at most $k$ such that the output of $A(x)$ belongs to $S_x$ with high probability. When $k=1$, we recover pseudodeterminism. A randomized algorithm $A$ is an $\ell$-influential-bit algorithm if, for every input $x$ and most strings $r$ of length $\ell$, there exists a canonical value $v_{x,r}$ such that $A$ on inputs $x$ and $r$ outputs $v_{x,r}$ with high probability. The string $r$ is called the influential-bit string. Again, when $\ell=0$, we recover pseudodeterminism. The main focus of these works has been to investigate reproducibility in randomized search algorithms.

Very recently, pseudodeterminism and its generalizations have been explored in the context of learning algorithms to formalize the notion of replicability. In particular, the work of Impagliazzo, Lei, Pitassi, and Sorrell [ILPS22] introduced the notion of $\rho$-replicability. A learning algorithm $A$ is $\rho$-replicable if $\Pr[A(S_1,r)=A(S_2,r)]\geq 1-\rho$, where $S_1$ and $S_2$ are samples drawn from a distribution $\mathcal{D}$ and $r$ is the internal randomness of the learning algorithm $A$. They designed replicable algorithms for many learning tasks, including statistical queries, approximate heavy hitters, median, and learning half-spaces.

Another line of recent work connects replicable learning to differentially private learning. In a recent seminal work, Bun, Livny, and Moran [BLM20] showed that every concept class with finite Littlestone dimension can be learned by an approximate differentially private algorithm. This, together with an earlier work of Alon, Livny, Malliaris, and Moran [ALMM19], establishes an equivalence between online learnability and differentially private PAC learnability. Rather surprisingly, the proof of [BLM20] uses the notion of "global stability," which is similar to the notion of pseudodeterminism in the context of learning. They define a learning algorithm $A$ to be $(n,\eta)$-globally stable with respect to a distribution $D$ if there is a hypothesis $h$ such that $\Pr_{S\sim D^n}[A(S)=h]\geq\eta$. They showed that any concept class with Littlestone dimension $d$ has an $(n,\eta)$-globally stable learning algorithm with $n=\tilde{O}(2^{2^d}/\alpha)$ and $\eta=\tilde{O}(2^{-2^d})$, where the error of $h$ (with respect to the unknown hypothesis) is $\leq\alpha$. They then established that a globally stable learner implies a differentially private learner. Globally stable learning with $\eta=2/3$ is the perfect replicability that we discuss in the introduction. Thus, as discussed there, designing globally stable algorithms with $\eta>1/2$ is not possible, even for the simple task of estimating the bias of a coin. The work of Ghazi, Kumar, and Manurangsi [GKM21] extended the notion of global stability to pseudo-global stability and list-global stability. The notion of pseudo-global stability is very similar to the earlier-mentioned influential-bit algorithms of Grossman and Liu [GL19] when translated to the context of learning. Similarly, list-global stability is similar to Goldreich's notion of multi-pseudodeterminism [Gol19].
It is also known that the notions of pseudo-global stability and $\rho$-replicability are the same up to polynomial factors in the parameters [ILPS22, GKM21]. The work of [GKM21] uses these notions to design user-level private learning algorithms.

Our notion of list replicability is inspired by the notion of multi-pseudodeterminism, and our notion of certificate replicability is inspired by the notion of influential-bit algorithms. In the learning setting, our notion of list replicability is similar to list-global stability, and our notion of certificate replicability is similar to pseudo-global stability, which in turn is similar to the $\rho$-replicability of [ILPS22].

We introduce the notions of list and certificate complexities that measure the degree of (non)replicability. Our goal is to design learning algorithms with optimal list and certificate complexities while minimizing the sample complexity. Earlier works did not focus on minimizing these quantities: the works of [BLM20, GKM21] used replicable algorithms as an intermediate step to design differentially private algorithms, and the work of [ILPS22] did not consider reducing the certificate complexity in their algorithms and did not study list replicability.

The study of various notions of reproducibility/replicability across computational fields is an emerging topic. In [EKK+23], the authors consider replicability in the context of stochastic bandits; their notion is similar to the one studied in [ILPS22]. In [AJJ+22], the authors investigate reproducibility (see [PVLS+21] for a discussion on replicability and reproducibility) in the context of optimization with inexact oracles (initialization/gradient oracles). The setup and focus of these works are different from ours.

3 Primary Lemmas

In this section, we state a few lemmas, building on the work of [VWDP+22] and [DLPES02], that will be useful for the algorithmic constructions and impossibility results in the remainder of the paper.

3.1 Partitions and Rounding

In this subsection, we define a deterministic rounding function that we will use repeatedly. This rounding function is based on the notion of secluded partitions defined in the work of [VWDP+22].

We will use the following notation. We use $\operatorname{diam}_\infty$ to indicate the diameter of a set relative to the $\ell_\infty$ norm and $\overline{B}_\epsilon^\infty(\vec{p})$ to represent the closed ball of radius $\epsilon$ centered at $\vec{p}$ relative to the $\ell_\infty$ norm. That is, in $\mathbb{R}^d$ we have $\overline{B}_\epsilon^\infty(\vec{p})=\prod_{i=1}^d[p_i-\epsilon,p_i+\epsilon]$.

Let $\mathcal{P}$ be a partition of $\mathbb{R}^d$. For a point $\vec{p}\in\mathbb{R}^d$, we use $N_\epsilon(\vec{p})$ to denote the set

$\{X\in\mathcal{P} \mid \overline{B}_\epsilon^\infty(\vec{p})\cap X\neq\emptyset\}.$
Definition 3.1.

Let $\mathcal{P}$ be a partition of $\mathbb{R}^d$. We say that $\mathcal{P}$ is $(k,\epsilon)$-secluded if for every point $\vec{p}\in\mathbb{R}^d$, $|N_\epsilon(\vec{p})|\leq k$.

The following theorem is from [VWDP+22].

Theorem 3.2.

For each $d\in\mathbb{N}$, there exists a $(d+1,\frac{1}{2d})$-secluded partition of $\mathbb{R}^d$ in which each member is a unit hypercube. Moreover, the partition is efficiently computable: given an arbitrary point $\vec{x}\in\mathbb{R}^d$, the description of the partition member to which $\vec{x}$ belongs can be computed in time polynomial in $d$.
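For intuition, the $d=1$ case of such a partition can be written in a few lines: the unit intervals $[k,k+1)$ form a $(2,\epsilon)$-secluded partition for any $\epsilon<1/2$, since a closed ball of radius $\epsilon$ has length $2\epsilon<1$ and can meet at most two intervals. This sketch (names ours) only illustrates the definition; the general $d$-dimensional construction is the one from [VWDP+22].

```python
import math

def member_of(x):
    """The unit interval [k, k+1) containing x, identified by its left endpoint k."""
    return math.floor(x)

def members_hit(p, eps):
    """All partition members whose intervals intersect the closed ball [p-eps, p+eps]
    (up to degenerate boundary cases at exact integer endpoints)."""
    return set(range(member_of(p - eps), member_of(p + eps) + 1))
```

For any $p$ and $\epsilon<1/2$, `members_hit(p, eps)` has size at most $2=d+1$, matching the theorem for $d=1$.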

Definition 3.3 ($(k(d),\epsilon(d))$-Deterministic Rounding).

A deterministic rounding scheme is a family of functions $\mathcal{F}=\{f_d\}_{d\in\mathbb{N}}$ where $f_d:\mathbb{R}^d\to\mathbb{R}^d$. We call $\mathcal{F}$ a $(k(d),\varepsilon(d))$-deterministic rounding scheme if (1) for all $\vec{x}\in\mathbb{R}^d$, $\lVert\vec{x}-f_d(\vec{x})\rVert_\infty\leq 1$ (the bound of $1$ is not critical; any constant can be used by scaling the parameters appropriately), and (2) for all $\vec{x}\in\mathbb{R}^d$, $|\{f_d(\vec{z}):\vec{z}\in\overline{B}_{\varepsilon(d)}^\infty(\vec{x})\}|\leq k(d)$.

The following observation is from [VWDP+22].

Observation 3.4 (Equivalence of Rounding Schemes and Partitions).

A $(k(d),\epsilon(d))$-deterministic rounding scheme induces, for each $d\in\mathbb{N}$, a $(k(d),\epsilon(d))$-secluded partition of $\mathbb{R}^d$ in which each member has diameter at most $2$. Conversely, a sequence $\langle\mathcal{P}_d\rangle_{d=1}^\infty$ of partitions, where each $\mathcal{P}_d$ is $(k(d),\epsilon(d))$-secluded and contains only members of diameter at most $1$, induces a $(k(d),\epsilon(d))$-deterministic rounding scheme.

3.2 A Universal Rounding Algorithm for List Replicability

In this subsection, we will design a deterministic algorithm that will be used as a sub-routine in our list-replicable algorithms.

Lemma 3.5.

Let $d\in\mathbb{N}$ and $\epsilon\in(0,\infty)$. Let $\epsilon_0=\frac{\epsilon}{2d}$. There is an efficiently computable function $f_\epsilon:\mathbb{R}^d\to\mathbb{R}^d$ with the following two properties:

  1. For any $x\in\mathbb{R}^d$ and any $\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)$, it holds that $f_\epsilon(\hat{x})\in\overline{B}_\epsilon^\infty(x)$.

  2. For any $x\in\mathbb{R}^d$, the set $\{f_\epsilon(\hat{x}):\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)\}$ has cardinality at most $d+1$.

Informally, these two conditions are: (1) if $\hat{x}$ is an $\epsilon_0$-approximation of $x$, then $f_\epsilon(\hat{x})$ is an $\epsilon$-approximation of $x$; and (2) $f_\epsilon$ maps every $\epsilon_0$-approximation of $x$ to one of at most $d+1$ possible values.

Proof.

Let $\mathcal{P}$ be a $(d+1,\frac{1}{2d})$-secluded partition from Theorem 3.2 and $f:\mathbb{R}^d\to\mathbb{R}^d$ the associated deterministic rounding function due to Observation 3.4. The partition $\mathcal{P}$ consists of translates of the unit cube $[0,1)^d$ with the property that for any point $\vec{p}\in\mathbb{R}^d$, the closed cube of side length $1/d$ centered at $\vec{p}$ (i.e. $\overline{B}_{1/2d}^\infty(\vec{p})$) intersects at most $d+1$ members/cubes of $\mathcal{P}$. The associated rounding function $f$ maps each point of $\mathbb{R}^d$ to the center point of the unique cube in $\mathcal{P}$ which contains it. This means $f$ has the following two properties (which are closely connected to the two properties we want of $f_\epsilon$): (1) for every $a\in\mathbb{R}^d$, $\lVert f(a)-a\rVert_\infty\leq\frac{1}{2}$ (because every point is mapped to the center of its containing unit cube), and (2) for any point $p\in\mathbb{R}^d$, the set $\{f(a):a\in\overline{B}_{1/2d}^\infty(p)\}$ has cardinality at most $d+1$ (because $\overline{B}_{1/2d}^\infty(p)$ intersects at most $d+1$ members of $\mathcal{P}$). Define $f_\epsilon:\mathbb{R}^d\to\mathbb{R}^d$ by $f_\epsilon(a)=\epsilon\cdot f(\frac{1}{\epsilon}a)$. The efficient computability of $f_\epsilon$ follows from the efficient computability of $f$ (i.e. the ability to efficiently compute the center of the unit cube in $\mathcal{P}$ which contains a given point).

To see that $f_\epsilon$ has property (1), let $x\in\mathbb{R}^d$ and $\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)$. Then we have the following (justifications will follow):

\begin{align*}
\left\lVert\tfrac{1}{\epsilon}f_\epsilon(\hat{x})-\tfrac{1}{\epsilon}x\right\rVert_\infty
&=\left\lVert f(\tfrac{1}{\epsilon}\hat{x})-\tfrac{1}{\epsilon}x\right\rVert_\infty\\
&\leq\left\lVert f(\tfrac{1}{\epsilon}\hat{x})-\tfrac{1}{\epsilon}\hat{x}\right\rVert_\infty+\left\lVert\tfrac{1}{\epsilon}\hat{x}-\tfrac{1}{\epsilon}x\right\rVert_\infty\\
&\leq\left\lVert f(\tfrac{1}{\epsilon}\hat{x})-\tfrac{1}{\epsilon}\hat{x}\right\rVert_\infty+\tfrac{1}{\epsilon}\left\lVert\hat{x}-x\right\rVert_\infty\\
&\leq\tfrac{1}{2}+\tfrac{1}{\epsilon}\epsilon_0\\
&=\tfrac{1}{2}+\tfrac{1}{2d}\leq 1
\end{align*}

The first line is by the definition of $f_\epsilon$, the second is the triangle inequality, the third is the scaling property of norms, the fourth uses the property that $f$ moves points by at most $\frac{1}{2}$ together with the hypothesis that $\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)$, the fifth uses the definition of $\epsilon_0$, and the sixth uses the fact that $d\geq 1$.

Scaling both sides by $\epsilon$ and again using the scaling property of norms gives $\lVert f_\epsilon(\hat{x})-x\rVert_\infty\leq\epsilon$, which proves property (1) of the lemma.

To see that $f_\epsilon$ has property (2), let $x\in\mathbb{R}^d$. We have the following set of equalities:

\begin{align*}
\left\{f_\epsilon(\hat{x}):\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)\right\}
&=\left\{\epsilon\cdot f(\tfrac{1}{\epsilon}\hat{x}):\hat{x}\in\overline{B}_{\epsilon_0}^\infty(x)\right\}\\
&=\left\{\epsilon\cdot f(a):a\in\overline{B}_{\epsilon_0/\epsilon}^\infty(\tfrac{1}{\epsilon}x)\right\}\\
&=\left\{\epsilon\cdot f(a):a\in\overline{B}_{1/2d}^\infty(\tfrac{1}{\epsilon}x)\right\}
\end{align*}

The first line is from the definition of $f_\epsilon$, the second from the substitution $a=\frac{1}{\epsilon}\hat{x}$, and the third from the definition of $\epsilon_0$.

Because $f$ takes on at most $d+1$ distinct values on $\overline{B}_{1/2d}^\infty(\tfrac{1}{\epsilon}x)$ (property (2) of $f$), the set has cardinality at most $d+1$, which proves property (2) of the lemma. ∎
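The scaling trick $f_\epsilon(a)=\epsilon\cdot f(\frac{1}{\epsilon}a)$ is mechanical; here is a sketch for $d=1$, using the center of the containing unit interval as the rounding function $f$ (names ours, for illustration only):

```python
import math

def f(a):
    """d = 1 rounding: map a to the center of the unit interval [k, k+1)
    containing it, so |f(a) - a| <= 1/2 and any ball of radius 1/(2d) = 1/2
    sees at most d + 1 = 2 distinct outputs."""
    return math.floor(a) + 0.5

def f_eps(a, eps):
    """The rescaled rounding f_eps(a) = eps * f(a / eps): an eps-accurate
    rounding taking at most d + 1 values on any eps0-ball, eps0 = eps/(2d)."""
    return eps * f(a / eps)
```

With $\epsilon=0.2$ (so $\epsilon_0=0.1$ for $d=1$) and $x=0.37$, every $\hat{x}\in[0.27,0.47]$ is mapped to one of the two values $0.3$ and $0.5$, both within $\epsilon$ of $x$.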

3.3 A Universal Rounding Algorithm for Certificate Replicability

For designing certificate replicable learning algorithms, we will use a general randomized procedure stated in the following lemma. The lemma uses a randomized rounding scheme; similar randomized rounding schemes have been used in a few prior works [SZ99, DPV18, Gol19, GL19, ILPS22].

Lemma 3.6.

Let $d\in\mathbb{N}$, $\epsilon_0\in(0,\infty)$, and $0<\delta<1$. There is an efficiently computable deterministic function $f:\{0,1\}^\ell\times\mathbb{R}^d\to\mathbb{R}^d$ with the following property: for any $x\in\mathbb{R}^d$,

$\Pr_{r\in\{0,1\}^\ell}\left[\exists x^*\in\overline{B}_\varepsilon^\infty(x)~\forall\hat{x}\in\overline{B}_{\varepsilon_0}^\infty(x):f(r,\hat{x})=x^*\right]\geq 1-\delta,$

where $\ell=\lceil\log\frac{d}{\delta}\rceil$ and $\varepsilon=(2^\ell+1)\epsilon_0\leq\frac{2\varepsilon_0 d}{\delta}$.

Proof.

Partition each coordinate of $\mathbb{R}^d$ into intervals of width $2\varepsilon_0$. The algorithm computing the function $f$ performs the following simple randomized rounding:

The function $f$: Choose a random integer $r\in\{1,\ldots,2^\ell\}$; note that $r$ can be represented using $\ell$ bits. For the $i^{th}$ coordinate $\hat{x}[i]$ of $\hat{x}$, round $\hat{x}[i]$ to the nearest $k\cdot(2\varepsilon_0)$ such that $k\equiv r\pmod{2^\ell}$.

Now we will prove that ff satisfies the required properties.

First, we prove the approximation guarantee. Let $x'$ denote the point in $\mathbb{R}^d$ obtained by rounding each coordinate of $\hat{x}$. The values $k$ satisfying $k\equiv r\pmod{2^\ell}$ yield points that are $2^\ell\cdot 2\varepsilon_0$ apart; therefore each coordinate moves by at most $2^\ell\epsilon_0$, that is, $|x'[i]-\hat{x}[i]|\leq 2^\ell\epsilon_0$ for every $i$, $1\leq i\leq d$. Since $\hat{x}$ is an $\varepsilon_0$-approximation of $x$ (i.e. each coordinate $\hat{x}[i]$ is within $\varepsilon_0$ of the true value $x[i]$), each coordinate of $x'$ is within $(2^\ell+1)\varepsilon_0$ of $x[i]$. Therefore $x'$ is a $(2^\ell+1)\varepsilon_0$-approximation of $x$, i.e., $x'\in\overline{B}_\varepsilon^\infty(x)$ for any choice of $r$.

Now we establish that for $\geq 1-\delta$ fraction of $r\in\{1,\ldots,2^\ell\}$, there exists $x^*$ such that every $\hat{x}\in\overline{B}_{\varepsilon_0}^\infty(x)$ is rounded to $x^*$. We argue this with respect to each coordinate and apply the union bound. Fix an $x$ and a coordinate $i$, and consider the $\varepsilon_0$-interval around $x[i]$.

Consider $r$ from $\{1,\ldots,2^\ell\}$. When this $r$ is chosen, we round $\hat{x}[i]$ to the closest $k\cdot(2\varepsilon_0)$ such that $k\equiv r\pmod{2^\ell}$. Let $p^r_1,p^r_2,\ldots,p^r_j,\ldots$ be the set of such points; more precisely, $p^r_j=(j2^\ell+r)\cdot 2\varepsilon_0$. Note that $\hat{x}[i]$ is rounded to $p^r_j$ for some $j$. Let $m^r_j$ denote the midpoint between $p^r_j$ and $p^r_{j+1}$, i.e., $m^r_j=(p^r_j+p^r_{j+1})/2$. We call $r$ 'bad' for $x[i]$ if $x[i]$ is close to some $m^r_j$; that is, $r$ is 'bad' if $|x[i]-m^r_j|<\varepsilon_0$. Note that for a bad $r$ there exist $\hat{x}_1$ and $\hat{x}_2$ in $\overline{B}_{\varepsilon_0}^\infty(x)$ whose $i^{th}$ coordinates are rounded to $p^r_j$ and $p^r_{j+1}$ respectively. The crucial point is that if $r$ is not bad for $x[i]$, then there exists a canonical $p^*$ such that the $i^{th}$ coordinate of every $x'\in\overline{B}_{\varepsilon_0}^\infty(x)$ is rounded to $p^*$. We call $r$ bad for $x$ if there exists at least one $i$, $1\leq i\leq d$, such that $r$ is bad for $x[i]$. It follows that if $r$ is not bad for $x$, then there exists a canonical $x^*$ such that every $x'\in\overline{B}_{\varepsilon_0}^\infty(x)$ is rounded to $x^*$.

With this, the goal is to bound the probability that a randomly chosen $r$ is bad for $x$. For this, we first bound the probability that $r$ is bad for $x[i]$. We argue that there exists at most one bad $r$ for $x[i]$. Suppose there exist two numbers $r_1\neq r_2$ that are both bad for $x[i]$. This means that $|x[i]-m^{r_1}_{j_1}|<\varepsilon_0$ and $|x[i]-m^{r_2}_{j_2}|<\varepsilon_0$ for some $j_1$ and $j_2$, so by the triangle inequality $|m^{r_1}_{j_1}-m^{r_2}_{j_2}|<2\varepsilon_0$. However, $|p^{r_1}_{j_1}-p^{r_2}_{j_2}|=|(j_1-j_2)2^\ell+(r_1-r_2)|\cdot 2\epsilon_0$, and since $r_1\neq r_2$ and $|r_1-r_2|<2^\ell$, this value is at least $2\varepsilon_0$. Since $m^r_j=p^r_j+2^\ell\varepsilon_0$, the difference between the midpoints equals the difference between the corresponding points, so $|m^{r_1}_{j_1}-m^{r_2}_{j_2}|\geq 2\varepsilon_0$, leading to a contradiction.

Thus the probability that $r$ is bad for $x[i]$ is at most $\frac{1}{2^\ell}$, and by the union bound the probability that $r$ is bad for $x$ is at most $\frac{d}{2^\ell}\leq\delta$. This completes the proof. ∎
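The per-coordinate randomized rounding in this proof can be sketched as follows (names ours; the $d$-dimensional $f$ applies the same $r$ to every coordinate):

```python
def cert_round(xhat, r, ell, eps0):
    """Round xhat to the nearest point k*(2*eps0) with k = r (mod 2**ell).

    The admissible points are spaced 2**ell * 2 * eps0 apart, so rounding
    moves xhat by at most 2**ell * eps0; for all but at most one choice of
    r, every eps0-approximation of the same x rounds to one canonical point.
    """
    spacing = (2 ** ell) * 2 * eps0   # gap between admissible points
    offset = r * 2 * eps0             # admissible points are j*spacing + offset
    j = round((xhat - offset) / spacing)
    return j * spacing + offset
```

For a good $r$, every $\hat{x}$ within $\varepsilon_0$ of a fixed $x$ produces the same output, which is what makes $r$ a usable certificate.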

3.4 A Consequence of Sperner’s Lemma/KKM Lemma

The following result is a corollary of cubical variants of Sperner's lemma/the KKM lemma initially developed in [DLPES02] and expanded on in [VWDP+22]. The statement and proof of this result are quite similar to those of Theorem 9.4 (Second Optimality Theorem) in [VWDP+22], except that it is stated here for $[0,1]^d$ instead of $\mathbb{R}^d$.

Lemma 3.7.

Let $\mathcal{P}$ be a partition of $[0,1]^d$ such that for each member $X\in\mathcal{P}$, it holds that $\operatorname{diam}_\infty(X)<1$. Then there exists $\vec{p}\in[0,1]^d$ such that for all $\delta>0$, $\overline{B}_\delta^\infty(\vec{p})$ intersects at least $d+1$ members of $\mathcal{P}$.

4 Replicability of Learning Coin Biases

In this section, we establish replicability results for estimating the biases of $d$ coins.

Definition 4.1.

The $d$-Coin Bias Estimation Problem is the following: design an algorithm $A$ (possibly randomized) that, given $\epsilon\in(0,\infty)$, $\delta\in(0,1]$, and $n$ independent tosses of each of an ordered collection of $d$ biased coins with bias vector $\vec{b}\in[0,1]^d$, outputs $\vec{v}$ such that $\lVert\vec{b}-\vec{v}\rVert_\infty\leq\varepsilon$ with probability $\geq 1-\delta$.

Definition 4.2.

We say an algorithm $A$ for the $d$-Coin Bias Estimation Problem is $k$-list replicable if, for any bias vector $\vec{b}\in[0,1]^d$ and parameters $\varepsilon,\delta$, there is a set $L\subseteq\overline{B}_\varepsilon^\infty(\vec{b})$ with $|L|\leq k$ and an integer $n$ such that $A$, on input $\varepsilon$ and $\delta$ and $n$ independent tosses (per coin) according to the bias vector $\vec{b}$, outputs an estimate $\vec{v}\in L$ with probability $\geq 1-\delta$. Here $n$ is the sample complexity and $k$ is the list complexity of $A$.

Definition 4.3.

We say an algorithm AA for dd-Coin Bias Estimation Problem is \ell-certificate replicable if, for any bias vector b[0,1]d\vec{b}\in[0,1]^{d} and parameters ε,δ\varepsilon,\delta, the algorithm AA on inputs ϵ\epsilon, δ\delta, r{0,1}r\in\{0,1\}^{\ell}, and nn independent coin tosses (per coin) according to b\vec{b} satisfies the following:

Prr{0,1}[vrB¯ε(b):Pr[A outputs vr]1δ]1δ\Pr_{r\in\{0,1\}^{\ell}}\left[\exists\vec{v}_{r}\in\overline{B}_{\varepsilon}^{\infty}(\vec{b}):\Pr[A\mbox{ outputs }\vec{v_{r}}]\geq 1-\delta\right]\geq 1-\delta

In the above, the inner probability is taken over the internal randomness of AA and the coin tosses. Algorithm AA also gets rr as an input (in addition to the other inputs). We call nn the sample complexity and the number of random bits \ell the certificate complexity of AA.

We note that, in a coarse sense, list replicable algorithms can be converted to certificate replicable algorithms and vice versa. However, such transformations degrade the sample complexity, which is a central concern of this paper.

In the following, the distribution of the output of an algorithm AA for dd-Coin Bias Estimation Problem, when given nn tosses per coin with bias vector b\vec{b}, is denoted 𝒟A,b,n\mathcal{D}_{A,\vec{b},n}.

4.1 Replicable Algorithms

We present two algorithms for dd-Coin Bias Estimation Problem. The first is (d+1)(d+1)-list replicable and the second is logdδ\lceil\log{d\over\delta}\rceil-certificate replicable.

Theorem 4.4.

There exists a (d+1)(d+1)-list replicable algorithm for dd-Coin Bias Estimation Problem  with sample complexity n=O(d2ε2logdδ)n=O({d^{2}\over\varepsilon^{2}}\cdot\log{d\over\delta}) (per coin).

Algorithm 1 (d+1)(d+1)-list replicable algorithm for dd-Coin Bias Estimation Problem as in Theorem 4.4
  Input: ϵ>0\epsilon>0
  Input: δ(0,1]\delta\in(0,1]
  Input: sample access to dd coins with biases b[0,1]d\vec{b}\in[0,1]^{d}
  Output: The algorithm behaves as a (d+1)(d+1)-pseudodeterministic (ϵ,δ)(\epsilon,\delta)-approximation of b\vec{b} and is guaranteed to return a value in [0,1]d[0,1]^{d}.
  Algorithm:
  ϵ0=defϵ2d\epsilon_{0}\operatorname{\overset{def}{=}}\frac{\epsilon}{2d}
  δ0=defδd\delta_{0}\operatorname{\overset{def}{=}}\frac{\delta}{d}
  n=defO(ln(1/δ0)ϵ02)=O(d2ln(d/δ)ϵ2)n\operatorname{\overset{def}{=}}O\left(\frac{\ln(1/\delta_{0})}{\epsilon_{0}^{2}}\right)=O\left(\frac{d^{2}\ln(d/\delta)}{\epsilon^{2}}\right) for a suitable absolute constant
  Let fϵ:ddf_{\epsilon}:\mathbb{R}^{d}\to\mathbb{R}^{d} be as in 3.5.
  Let g:d[0,1]dg:\mathbb{R}^{d}\to[0,1]^{d} be the function which restricts coordinates to the unit interval (i.e.
g(y)=def{0yi<0yiyi[0,1]1yi>1i=1dg(\vec{y})\operatorname{\overset{def}{=}}\left\langle\begin{cases}0&y_{i}<0\\ y_{i}&y_{i}\in[0,1]\\ 1&y_{i}>1\end{cases}\right\rangle_{i=1}^{d}
)
  Take nn samples from each coin and let a\vec{a} be the empirical biases.
  return g(fϵ(a))g(f_{\epsilon}(\vec{a}))
Proof.

Note that when ε1/2\varepsilon\geq 1/2, a trivial algorithm that outputs a vector with 1/21/2 in each component works. Thus the most interesting case is when ε<1/2\varepsilon<1/2. Our list replicable algorithm is described in Algorithm 1.

We prove its correctness by giving, for each possible bias b[0,1]d\vec{b}\in[0,1]^{d}, a set LbL_{\vec{b}} with the three necessary properties: (1) |Lb|d+1\lvert L_{\vec{b}}\rvert\leq d+1, (2) LbB¯ϵ(b)L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) (and also the problem specific restriction that Lb[0,1]dL_{\vec{b}}\subseteq[0,1]^{d}), and (3) when given access to coins of biases b\vec{b}, with probability at least 1δ1-\delta the algorithm returns a value in LbL_{\vec{b}}.

Assume notation from Algorithm 1. Let Lb={g(fϵ(x)):xB¯ϵ0(b)}L_{\vec{b}}=\left\{g(f_{\epsilon}(\vec{x}))\colon\vec{x}\in\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b})\right\}. By 3.5, fϵf_{\epsilon} takes on at most d+1d+1 values on B¯ϵ0(b)\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b}) (which means gfϵg\circ f_{\epsilon} also takes on at most d+1d+1 values on this ball) which proves that |Lb|d+1\lvert L_{\vec{b}}\rvert\leq d+1. This proves property (1).

Next we state the following observation, which says that the coordinate restriction function gg of Algorithm 1 does not reduce approximation quality. The proof is straightforward but tedious.

Observation 4.5.

Using the notation of Algorithm 1, if yB¯ϵ(b)\vec{y}\in\overline{B}_{\epsilon}^{\infty}(\vec{b}) then g(y)B¯ϵ(b)g(\vec{y})\in\overline{B}_{\epsilon}^{\infty}(\vec{b}).

Proof.

Let z=g(y)\vec{z}=g(\vec{y}). We must show for each i[d]i\in[d] that zi[biϵ,bi+ϵ]z_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon]. Note that

zi={0yi<0yiyi[0,1]1yi>1z_{i}=\begin{cases}0&y_{i}<0\\ y_{i}&y_{i}\in[0,1]\\ 1&y_{i}>1\end{cases}

so we proceed with three cases.

Case 1: yi<0y_{i}<0.

In this case, zi=0z_{i}=0 so because bi[0,1]b_{i}\in[0,1] we have zibi+ϵz_{i}\leq b_{i}+\epsilon. Also, because yi[biϵ,bi+ϵ]y_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon], we have biyi+ϵ<0+ϵ=zi+ϵb_{i}\leq y_{i}+\epsilon<0+\epsilon=z_{i}+\epsilon, so subtracting ϵ\epsilon from both sides gives zi>biϵz_{i}>b_{i}-\epsilon. Thus, we have zi[biϵ,bi+ϵ]z_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon] as desired.

Case 2: yi>1y_{i}>1.

This case is symmetric to Case 1.

Case 3: yi[0,1]y_{i}\in[0,1].

In this case zi=yi[biϵ,bi+ϵ]z_{i}=y_{i}\in[b_{i}-\epsilon,b_{i}+\epsilon] so we are done. ∎
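The case analysis of Observation 4.5 can also be checked mechanically. The following minimal sketch (the helper names are ours, not the paper's) implements the clamping function gg and verifies on a sample point that clamping never increases the ℓ∞ distance to a vector in the unit cube.

```python
def g(y):
    # Coordinate-wise restriction to [0, 1], as in Algorithm 1.
    return [min(1.0, max(0.0, yi)) for yi in y]

def linf(u, v):
    # l_infinity distance between two vectors.
    return max(abs(ui - vi) for ui, vi in zip(u, v))

# A point y that is 0.15-close to b but leaves the unit cube: clamping
# brings it back into [0,1]^2 without hurting the approximation.
b = [0.1, 0.9]
y = [-0.05, 1.02]
assert linf(g(y), b) <= linf(y, b)
```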

We now establish Property (2). We know from 3.5 that for each xB¯ϵ0(b)\vec{x}\in\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b}) we have fϵ(x)B¯ϵ(b)f_{\epsilon}(\vec{x})\in\overline{B}_{\epsilon}^{\infty}(\vec{b}), and by 4.5, gg maintains this quality and we have g(fϵ(x))B¯ϵ(b)g(f_{\epsilon}(\vec{x}))\in\overline{B}_{\epsilon}^{\infty}(\vec{b}). This shows that LbB¯ϵ(b)L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) proving property (2).

By Chernoff’s bounds, for a single biased coin, n=O(ln(1/δ0)ϵ02)n=O\left(\frac{\ln(1/\delta_{0})}{\epsilon_{0}^{2}}\right) independent samples of the coin is enough that with probability at least 1δ01-\delta_{0}, the empirical bias is within ϵ0\epsilon_{0} of the true bias. Thus, by a union bound, if we take nn samples of each of the dd coins, there is a probability of at most dδ0=δd\cdot\delta_{0}=\delta that at least one of the empirical coin biases is not within ϵ0\epsilon_{0} of the true bias. Thus, by taking nn samples of each coin, we have with probability at least 1δ1-\delta that the empirical biases a\vec{a} belong to B¯ϵ0(b)\overline{B}_{\epsilon_{0}}^{\infty}(\vec{b}). In the case that this occurs, we have by definition of LbL_{\vec{b}} that the value g(fϵ(a))g(f_{\epsilon}(\vec{a})) returned by the algorithm belongs to the set LbL_{\vec{b}}. This proves property (3). ∎

Theorem 4.6.

There is a logdδ\lceil\log{d\over\delta}\rceil-certificate replicable algorithm for dd-Coin Bias Estimation Problem with sample complexity of n=O(d2ε2δ2)n=O({d^{2}\over\varepsilon^{2}\delta^{2}}) per coin.

Proof.

Let ε\varepsilon and δ\delta be the input parameters to the algorithm and b\vec{b} the bias vector. Set ε0=εδ2d\varepsilon_{0}=\frac{\varepsilon\delta}{2d}. The algorithm AA first estimates the bias of each coin to within ε0\varepsilon_{0} with probability error parameter δd\frac{\delta}{d} using a standard estimation algorithm. Note that this can be done using O(d2ε2δ2)O({d^{2}\over\varepsilon^{2}\delta^{2}}) tosses per coin. Let v\vec{v} be the output vector. It follows that vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) with probability at least 1δ1-\delta. Then it runs the deterministic function ff described in Lemma 3.6 on inputs r{0,1}r\in\{0,1\}^{\ell}, with =logdδ\ell=\lceil\log{d\over\delta}\rceil, and v\vec{v}, and outputs the value of the function. Lemma 3.6 guarantees that for a 1δ1-\delta fraction of the rr’s, all vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) are rounded to the same value by ff. Hence algorithm AA satisfies the requirements of certificate replicability. The certificate complexity is logdδ\lceil\log{d\over\delta}\rceil. ∎
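For intuition, here is a hedged sketch of the kind of shared-randomness rounding that Lemma 3.6 provides. The lemma's actual construction appears earlier in the paper; the grid width and offset scheme below are our illustrative assumptions, not the paper's definitions. The certificate rr selects one of 2^ℓ shifts of a coordinate-wise grid, and any two estimates that are close in ℓ∞ round to the same canonical point unless some coordinate lands near a grid boundary, which happens for few rr.

```python
def shifted_round(v, eps, r, ell):
    # Certificate r in {0, ..., 2^ell - 1} selects a shift of a grid of
    # width 2*eps; each coordinate is rounded to the nearest shifted grid
    # point.  (Illustrative stand-in for the function of Lemma 3.6: a bad
    # r for one coordinate occurs with probability about 1/2^ell, so a
    # union bound over d coordinates matches the d/2^ell <= delta bound.)
    w = 2 * eps
    offset = (r / 2**ell) * w
    return [round((vi - offset) / w) * w + offset for vi in v]

# Two nearby estimates of the same bias vector, rounded with the same
# certificate, produce the same canonical output.
out1 = shifted_round([0.31, 0.58], eps=0.1, r=5, ell=4)
out2 = shifted_round([0.32, 0.57], eps=0.1, r=5, ell=4)
assert out1 == out2
```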

Note that an \ell-certificate replicable algorithm yields a 22^{\ell}-list replicable algorithm. Thus Theorem 4.6 gives a O(dδ)O(\frac{d}{\delta})-list replicable algorithm for dd-Coin Bias Estimation Problem with sample complexity O(d2ε2δ2)O({d^{2}\over\varepsilon^{2}\delta^{2}}). However, this is clearly sub-optimal, and Theorem 4.4 gives an algorithm with much smaller list and sample complexities.

4.2 An Impossibility Result

Theorem 4.7.

For k<d+1k<d+1, there does not exist a kk-list replicable algorithm for the dd-Coin Bias Estimation Problem.

Before proving the theorem, we need a lemma whose proof appears in the appendix.

Lemma 4.8.

For biases a,b[0,1]d\vec{a},\vec{b}\in[0,1]^{d} we have dTV(𝒟A,a,n,𝒟A,b,n)ndba\operatorname{d_{TV}}\left(\mathcal{D}_{A,\vec{a},n},\mathcal{D}_{A,\vec{b},n}\right)\leq n\cdot d\cdot\lVert\vec{b}-\vec{a}\rVert_{\infty}.

Proof.

We can view the model as algorithm AA having access to a single draw from a distribution. The distribution giving one flip of each coin in a collection with bias b\vec{b} is the dd-fold product of Bernoulli distributions i=1dBern(bi)\prod_{i=1}^{d}\operatorname{Bern}(b_{i}) (which for notational brevity we denote as Bern(b)\operatorname{Bern}(\vec{b})), so the distribution which gives nn independent flips of each coin is the nn-fold product of this (written, using the notation of [Can15], as Bern(b)n\operatorname{Bern}(\vec{b})^{\otimes n}).

Comparing the distributions of nn independent flips of the dd coins for bias b\vec{b} as compared to bias a\vec{a}, we have for each i[d]i\in[d] that

dTV(Bern(bi),Bern(ai))=|biai|\operatorname{d_{TV}}\left(\operatorname{Bern}(b_{i}),\operatorname{Bern}(a_{i})\right)=\lvert b_{i}-a_{i}\rvert

so by C.1.2 and C.1.3 of [Can15] we have

dTV(Bern(b),Bern(a))i=1d|biai|dba\operatorname{d_{TV}}\left(\operatorname{Bern}(\vec{b}),\operatorname{Bern}(\vec{a})\right)\leq\sum_{i=1}^{d}\lvert b_{i}-a_{i}\rvert\leq d\cdot\lVert\vec{b}-\vec{a}\rVert_{\infty}

and

dTV(Bern(b)n,Bern(a)n)ndba.\operatorname{d_{TV}}\left(\operatorname{Bern}(\vec{b})^{\otimes n},\operatorname{Bern}(\vec{a})^{\otimes n}\right)\leq n\cdot d\cdot\lVert\vec{b}-\vec{a}\rVert_{\infty}.

Because AA is a randomized function of one draw of this distribution, by D.1.2 of [Can15] we have that AA cannot serve to increase the total variation distance, so

dTV(𝒟A,a,n,𝒟A,b,n)dTV(Bern(b)n,Bern(a)n)\operatorname{d_{TV}}\left(\mathcal{D}_{A,\vec{a},n},\mathcal{D}_{A,\vec{b},n}\right)\leq\operatorname{d_{TV}}\left(\operatorname{Bern}(\vec{b})^{\otimes n},\operatorname{Bern}(\vec{a})^{\otimes n}\right)

which completes the proof. ∎
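For small dd, the inequality in the proof above can be verified exactly by enumerating all 2^d outcomes of one flip of each coin. The sketch below is ours, for illustration: it computes the total variation distance between the two product distributions directly and checks it against the sum of coordinate-wise bias differences.

```python
from itertools import product

def product_pmf(b, outcome):
    # Probability of a particular heads/tails outcome under Bern(b).
    p = 1.0
    for bi, oi in zip(b, outcome):
        p *= bi if oi else 1.0 - bi
    return p

def tv_product(b, a):
    # Exact total variation distance between the d-fold products Bern(b)
    # and Bern(a), by direct summation over all 2^d outcomes.
    return 0.5 * sum(abs(product_pmf(b, o) - product_pmf(a, o))
                     for o in product([0, 1], repeat=len(b)))

b, a = [0.3, 0.6, 0.5], [0.35, 0.55, 0.5]
bound = sum(abs(bi - ai) for bi, ai in zip(b, a))  # = 0.1
assert tv_product(b, a) <= bound + 1e-12
```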

Proof of Theorem 4.7.

Fix any dd\in\mathbb{N}, and fix parameters ϵ<12\epsilon<\frac{1}{2} and δ1d+2\delta\leq\frac{1}{d+2}.


Suppose for contradiction that such an algorithm does exist for some k<d+1k<d+1. This means that for each possible bias b[0,1]d\vec{b}\in[0,1]^{d}, there exists some set LbB¯ϵ(b)L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) (not necessarily unique, but consider some fixed one) with |Lb|k\lvert L_{\vec{b}}\rvert\leq k such that with probability at least 1δ1-\delta, AA returns an element of LbL_{\vec{b}}. By a trivial averaging argument, this means that there exists at least one element in LbL_{\vec{b}} which is returned by AA with probability at least 1k(1δ)1k(11d+2)=1kd+1d+21kk+1k+2\frac{1}{k}\cdot(1-\delta)\geq\frac{1}{k}\cdot(1-\frac{1}{d+2})=\frac{1}{k}\cdot\frac{d+1}{d+2}\geq\frac{1}{k}\cdot\frac{k+1}{k+2}. Let f:[0,1]d[0,1]df\colon[0,1]^{d}\to[0,1]^{d} be a function which maps each bias b\vec{b} to such an element of LbL_{\vec{b}}.

Since 1kk+1k+2>1k+1\frac{1}{k}\cdot\frac{k+1}{k+2}>\frac{1}{k+1}, let η\eta be such that 0<η<1kk+1k+21k+10<\eta<\frac{1}{k}\cdot\frac{k+1}{k+2}-\frac{1}{k+1}.

The function ff induces a partition 𝒫\mathcal{P} of [0,1]d[0,1]^{d} where the members of 𝒫\mathcal{P} are the fibers of ff (i.e. 𝒫={f1(y):yrange(f)}\mathcal{P}=\left\{f^{-1}(\vec{y})\colon\vec{y}\in\operatorname{range}(f)\right\}). By definition, for any member X𝒫X\in\mathcal{P} there exists some yrange(f)\vec{y}\in\operatorname{range}(f) such that for all bX\vec{b}\in X, f(b)=yf(\vec{b})=\vec{y}. By the choice of ff, we have f(b)LbB¯ϵ(b)f(\vec{b})\in L_{\vec{b}}\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{b}) showing that yB¯ϵ(b)\vec{y}\in\overline{B}_{\epsilon}^{\infty}(\vec{b}) and by symmetry bB¯ϵ(y)\vec{b}\in\overline{B}_{\epsilon}^{\infty}(\vec{y}). This shows that XB¯ϵ(y)X\subseteq\overline{B}_{\epsilon}^{\infty}(\vec{y}), so diam(X)2ϵ<1\operatorname{diam}_{\infty}(X)\leq 2\epsilon<1.

Let r=ηndr=\frac{\eta}{n\cdot d}. Since every member of 𝒫\mathcal{P} has \ell_{\infty} diameter less than 11, by 3.7 there exists a point p[0,1]d\vec{p}\in[0,1]^{d} such that B¯r(p)\overline{B}_{r}^{\infty}(\vec{p}) intersects at least d+1>kd+1>k members of 𝒫\mathcal{P}. Let b(1),,b(d+1)\vec{b}^{(1)},\ldots,\vec{b}^{(d+1)} be points belonging to distinct members of 𝒫\mathcal{P} that all belong to B¯r(p)\overline{B}_{r}^{\infty}(\vec{p}). By definition of 𝒫\mathcal{P}, this means for distinct j,j[d+1]j,j^{\prime}\in[d+1] that f(b(j))f(b(j))f(\vec{b}^{(j)})\not=f(\vec{b}^{(j^{\prime})}).

Now, for each j[d+1]j\in[d+1], because pb(j)r\lVert\vec{p}-\vec{b}^{(j)}\rVert_{\infty}\leq r, by 4.8 we have dTV(𝒟A,p,n,𝒟A,b(j),n)ndr=η\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{p},n},\mathcal{D}_{A,\vec{b^{(j)}},n})\leq n\cdot d\cdot r=\eta. However, this gives rise to a contradiction because the probability that AA with access to biased coins b(j)\vec{b}^{(j)} returns f(b(j))f(\vec{b}^{(j)}) is at least 1kk+1k+2\frac{1}{k}\cdot\frac{k+1}{k+2}, and by the total variation distance, it must be that AA with access to biased coins p\vec{p} returns f(b(j))f(\vec{b}^{(j)}) with probability at least 1kk+1k+2η>1k+1\frac{1}{k}\cdot\frac{k+1}{k+2}-\eta>\frac{1}{k+1}; notationally, Pr𝒟A,b(j),n({f(b(j))})1kk+1k+2\Pr_{\mathcal{D}_{A,\vec{b^{(j)}},n}}(\left\{f(\vec{b}^{(j)})\right\})\geq\frac{1}{k}\cdot\frac{k+1}{k+2} and dTV(𝒟A,b(j),n,𝒟A,p,n)η\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{b^{(j)}},n},\mathcal{D}_{A,\vec{p},n})\leq\eta, so Pr𝒟A,p,n({f(b(j))})1kk+1k+2η>1k+1\Pr_{\mathcal{D}_{A,\vec{p},n}}(\left\{f(\vec{b}^{(j)})\right\})\geq\frac{1}{k}\cdot\frac{k+1}{k+2}-\eta>\frac{1}{k+1}. This is a contradiction because a distribution cannot have d+1k+1d+1\geq k+1 disjoint events that each have probability greater than 1k+1\frac{1}{k+1}. ∎

We conclude this section by noting that the above impossibility result implies a lower bound on the certificate complexity for dd-Coin Bias Estimation Problem. It follows that there is no log(d)\lfloor\log(d)\rfloor-certificate replicable algorithm for dd-Coin Bias Estimation Problem; in particular, any \ell-certificate replicable algorithm requires =Ω(logdδ)\ell=\Omega(\log{d\over\delta}). Hence our algorithms for dd-Coin Bias Estimation Problem have optimal list and certificate complexities.

5 Replicability in PAC Learning

In this section, we establish replicability results for the PAC model. First, we define the PAC learning model.

Let \mathcal{H} be a (hypothesis) class of Boolean functions over XX, and let 𝒟\cal D be a distribution over XX. For a function ff\in\cal H, let 𝒟f{\cal D}_{f} be the distribution over X×{0,1}X\times\{0,1\} obtained by sampling an element xXx\in X according to 𝒟{\cal D} and outputting x,f(x)\langle x,f(x)\rangle. For a hypothesis hh, its error with respect to 𝒟f\mathcal{D}_{f}, denoted by e𝒟f(h)e_{\mathcal{D}_{f}}(h), is Prx,f(x)𝒟f(f(x)h(x))\Pr_{\langle x,f(x)\rangle\in\mathcal{D}_{f}}(f(x)\neq h(x)).

Definition 5.1.

A hypothesis class (or concept class) {\cal H} is PAC learnable with sample complexity mm if there is a learning algorithm AA with the following property: For every ff\in\mathcal{H}, for every distribution 𝒟{\cal D} over XX, and for all 0<δ,ϵ<10<\delta,\epsilon<1, AA on inputs δ,ϵ\delta,\epsilon and SS drawn i.i.d. according to 𝒟f\mathcal{D}_{f}, where |S|m|S|\leq m, outputs a hypothesis hh such that e𝒟f(h)εe_{\mathcal{D}_{f}}(h)\leq\varepsilon with probability (1δ)\geq(1-\delta).

We show that every hypothesis class that can be learned via a statistical query learning algorithm has a replicable PAC learning algorithm. We first define the notion of learning with statistical queries [Kea98].

Definition 5.2.

A statistical oracle STAT(𝒟f,ν)STAT({\cal D}_{f},\nu) takes as an input a real-valued function ϕ:X×{0,1}(0,1)\phi:X\times\{0,1\}\rightarrow(0,1) and returns an estimate vv such that

|vEx,y𝒟f[ϕ(x,y)]|ν|v-E_{\langle x,y\rangle\in{\cal D}_{f}}[\phi(x,y)]|\leq\nu
Definition 5.3.

We say that an algorithm A{A} learns a concept class \mathcal{H} via statistical queries if for every distribution 𝒟{\cal D} and every function ff\in\mathcal{H}, for every 0<ε<10<\varepsilon<1, there exists ν\nu such that the algorithm AA on input ε\varepsilon, and STAT(𝒟f,ν)STAT(\mathcal{D}_{f},\nu) as an oracle, outputs a hypothesis hh such that eDf(h)εe_{D_{f}}(h)\leq\varepsilon. The concept class is non-adaptively learnable if all the queries made by AA are non-adaptive.
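A statistical oracle is easy to simulate from samples, which is the engine behind the replicable algorithms below. The sketch that follows is ours, with illustrative names: it answers a query ϕ\phi by its empirical mean, and Hoeffding's inequality bounds the number of labeled samples needed for tolerance ν\nu.

```python
import math
import random

def empirical_stat(samples, phi):
    # Simulate STAT(D_f, nu) by averaging phi over i.i.d. labeled samples.
    return sum(phi(x, y) for x, y in samples) / len(samples)

def samples_needed(nu, delta):
    # Hoeffding: m = ln(2/delta) / (2 nu^2) samples make the empirical
    # mean nu-accurate with probability at least 1 - delta.
    return math.ceil(math.log(2 / delta) / (2 * nu**2))

# Toy target on X = [0,1]: f(x) = 1 iff x <= 0.5, and query phi(x, y) = y,
# whose true expectation under the uniform distribution is 0.5.
rng = random.Random(1)
m = samples_needed(nu=0.05, delta=0.01)
data = [(x, int(x <= 0.5)) for x in (rng.random() for _ in range(m))]
# v is within nu of 0.5 with probability at least 1 - delta.
v = empirical_stat(data, lambda x, y: y)
```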

5.1 Notions of Replicable PAC Learning

We now define the notions of list and certificate replicable learning in the PAC model.

Definition 5.4.

Let \mathcal{H} be a hypothesis class. We say that \mathcal{H} is kk-list replicably learnable if there is an algorithm AA with the following properties. For every ff\in\mathcal{H}, for every distribution 𝒟\mathcal{D} over XX, and for every 0<ϵ<10<\epsilon<1 and 0<δ10<\delta\leq 1, there is a list LL of size at most kk consisting of ε\varepsilon-approximate hypotheses such that AA, on inputs δ,ϵ\delta,\epsilon and a sample S𝒟fmS\sim\mathcal{D}_{f}^{m}, satisfies PrS𝒟fm[A(S)L]1δ\Pr_{S\sim\mathcal{D}_{f}^{m}}[A(S)\in L]\geq 1-\delta. We call mm the sample complexity of the learning algorithm and kk its list complexity.

Next we define certificate replicability. This is very close to the notion of replicability given in  [impagliazzo_replicability_2022]. However, our main concern is the amount of randomness needed by the algorithm to make it perfectly reproducible.

Definition 5.5.

Let \mathcal{H} be a concept class. We say that \mathcal{H} is \ell-certificate replicably learnable if the following holds. There is a learning algorithm AA such that for every ff\in\mathcal{H}, for every distribution 𝒟\mathcal{D} over XX, AA gets the following inputs: ϵ\epsilon, δ\delta, r{0,1}r\in\{0,1\}^{\ell}, and S𝒟fmS\sim\mathcal{D}_{f}^{m}.

Prr{0,1}[hr:Prs𝒟m[A(s;r)=hr]1δ]1δ\Pr_{r\in\{0,1\}^{\ell}}\left[\exists h_{r}:\Pr_{s\sim\mathcal{D}^{m}}[A(s;r)=h_{r}]\geq 1-\delta\right]\geq 1-\delta

We call mm the sample complexity and the number of random bits \ell the certificate complexity of AA.

The definition can be further explained as follows. For the algorithm AA, we say rr is a “certificate of replicability” if AA, with mm independent samples from 𝒟f\mathcal{D}_{f} and rr as input, outputs a canonical ε\varepsilon-approximate hypothesis hrh_{r} with probability 1δ\geq 1-\delta. The above definition demands that a 1δ1-\delta fraction of r{0,1}r\in\{0,1\}^{\ell} are certificates of replicability.

5.2 Replicable Algorithms

Theorem 5.6.

Let \mathcal{H} be a concept class that is learnable with dd non-adaptive statistical queries; then \mathcal{H} is (d+1)(d+1)-list replicably learnable. Furthermore, the sample complexity n=n(ν,δ)n=n(\nu,\delta) of the (d+1)(d+1)-list replicable algorithm equals O(d2logdδ1ν2)O(d^{2}\log{d\over\delta}\cdot{1\over\nu^{2}}), where ν\nu is the approximation error parameter of each statistical query oracle.

Proof.

The proof is very similar to the proof of Theorem 4.4. Our replicable algorithm BB works as follows. Let ε\varepsilon and δ\delta be the input parameters, 𝒟\mathcal{D} a distribution, and ff\in\mathcal{H}. Let AA be the statistical query learning algorithm for \mathcal{H} that outputs a hypothesis hh with approximation error e𝒟f(h)εe_{\mathcal{D}_{f}}(h)\leq\varepsilon. Let STAT(Df,ν)STAT(D_{f},\nu) be the statistical query oracle for this algorithm. Let ϕ1,,ϕd\phi_{1},\ldots,\phi_{d} be the statistical queries made by AA.

Let b=Ex,y𝒟f[ϕ1(x,y)],,Ex,y𝒟f[ϕd(x,y)]\vec{b}=\langle E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{1}(\langle x,y\rangle)],\ldots,E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{d}(\langle x,y\rangle)]\rangle. Set ε0=ν2d\varepsilon_{0}=\frac{\nu}{2d}. The algorithm BB first estimates the values b[i]=Ex,y𝒟f[ϕi(x,y)]b[i]=E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{i}(\langle x,y\rangle)], 1id1\leq i\leq d, up to an approximation error of ϵ0\epsilon_{0} with success probability 1δ/d1-\delta/d for each query. Note that this can be done by a simple empirical estimation algorithm that uses a total of n=O(d2ν2logd/δ)n=O(\frac{d^{2}}{\nu^{2}}\cdot\log d/\delta) samples. Let v\vec{v} be the estimated vector. It follows that vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) with probability at least 1δ1-\delta.

Now the algorithm BB evaluates the deterministic function fϵf_{\epsilon} from Lemma 3.5 on input v\vec{v}. Let u\vec{u} be the output vector. Now the algorithm BB simulates the statistical query algorithm AA with u[i]\vec{u}[i] as the answer to the query ϕi\phi_{i}. By Lemma 3.5, uB¯ν(b)\vec{u}\in\overline{B}_{\nu}^{\infty}(\vec{b}). Thus the error of the hypothesis output by the algorithm is at most ϵ\epsilon. Since AA is a deterministic algorithm, the number of possible outputs depends only on the number of outputs of the function fεf_{\varepsilon}; more precisely, the number of possible outputs is the size of the set {fε(v):vB¯ε0(b)}\{f_{\varepsilon}(\vec{v}):v\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b})\}, which is at most d+1d+1 by Lemma 3.5. Thus the total number of possible outputs of the algorithm BB is at most d+1d+1 with probability at least 1δ1-\delta. ∎

We note that we can simulate a statistical query algorithm that makes dd adaptive queries to get a 2d2^{d}-list replicable learning algorithm. This can be done by rounding each query to two possible values (the approximation factor increases by 2). The sample complexity of this algorithm will be O(dν2log1δ)O({d\over\nu^{2}}\cdot\log{1\over\delta}). The sample complexity can be improved to O~(dν2)\tilde{O}({\sqrt{d}\over\nu^{2}}) by using techniques from adaptive data analysis [BNS+21].

Next, we design a certificate replicable algorithm for hypothesis classes that admit statistical query learning algorithms.

Theorem 5.7.

Let \mathcal{H} be a concept class that is learnable with dd non-adaptive statistical queries; then \mathcal{H} is logdδ\lceil\log{d\over\delta}\rceil-certificate replicably learnable. Furthermore, the sample complexity n=n(ν,δ)n=n(\nu,\delta) of this algorithm equals O(d2ν2δ2logdδ)O(\frac{d^{2}}{\nu^{2}\delta^{2}}\cdot\log{d\over\delta}), where ν\nu is the approximation error parameter of each statistical query oracle.

Proof.

The proof is very similar to the proof of Theorem 4.6. Our replicable algorithm BB works as follows. Let ε\varepsilon and δ\delta be the input parameters, 𝒟\mathcal{D} a distribution, and ff\in\mathcal{H}. Let AA be the statistical query learning algorithm for \mathcal{H} that outputs a hypothesis hh with approximation error e𝒟f(h)εe_{\mathcal{D}_{f}}(h)\leq\varepsilon. Let STAT(Df,ν)STAT(D_{f},\nu) be the statistical query oracle for this algorithm. Let ϕ1,,ϕd\phi_{1},\ldots,\phi_{d} be the statistical queries made by AA.

Let b=Ex,y𝒟f[ϕ1(x,y)],,Ex,y𝒟f[ϕd(x,y)]\vec{b}=\langle E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{1}(\langle x,y\rangle)],\ldots,E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{d}(\langle x,y\rangle)]\rangle. Set ε0=νδ2d\varepsilon_{0}=\frac{\nu\delta}{2d}. The algorithm BB first estimates the values b[i]=Ex,y𝒟f[ϕi(x,y)]b[i]=E_{\langle x,y\rangle\in\mathcal{D}_{f}}[\phi_{i}(\langle x,y\rangle)], 1id1\leq i\leq d, up to an approximation error of ϵ0\epsilon_{0} with success probability 1δ/d1-\delta/d for each query. Note that this can be done by a simple empirical estimation algorithm that uses a total of n=O(d2ν2δ2logd/δ)n=O(\frac{d^{2}}{\nu^{2}\delta^{2}}\cdot\log d/\delta) samples. Let v\vec{v} be the estimated vector. It follows that vB¯ε0(b)\vec{v}\in\overline{B}_{\varepsilon_{0}}^{\infty}(\vec{b}) with probability at least 1δ1-\delta. Now the algorithm BB evaluates the deterministic function ff described in Lemma 3.6 with inputs r{0,1}r\in\{0,1\}^{\ell}, where =logdδ\ell=\lceil\log{d\over\delta}\rceil, and v\vec{v}. By Lemma 3.6, for at least a 1δ1-\delta fraction of the rr’s, the function ff outputs a canonical vB¯ν(b)\vec{v}^{*}\in\overline{B}_{\nu}^{\infty}(\vec{b}). Now the algorithm BB simulates the statistical query algorithm AA with v[i]\vec{v}^{*}[i] as the answer to the query ϕi\phi_{i}. Since AA is a deterministic algorithm, it follows that our algorithm BB is certificate replicable. Finally, note that the certificate complexity is logdδ\lceil\log{d\over\delta}\rceil. ∎

As before we consider the case when the statistical query algorithm makes dd adaptive queries. The proof of the following theorem appears in the appendix.

Theorem 5.8.

Let \mathcal{H} be a concept class that is learnable with dd adaptive statistical queries; then \mathcal{H} is dlogdδ\lceil d\log{d\over\delta}\rceil-certificate replicably learnable. Furthermore, the sample complexity of this algorithm equals O(d3ν2δ2logdδ)O(\frac{d^{3}}{\nu^{2}\delta^{2}}\cdot\log{d\over\delta}), where ν\nu is the approximation error parameter of each statistical query oracle.

Proof Sketch.

The proof uses arguments similar to those above. The main difference is that we evaluate each query with an approximation error of νδd\frac{\nu\delta}{d} and a probability error parameter of δ/d\delta/d. This requires O(d2ν2δ2logdδ)O(\frac{d^{2}}{\nu^{2}\delta^{2}}\cdot\log{d\over\delta}) samples per query. We use a fresh set of certificate randomness for each such evaluation. Note that the length of the certificate for each query is logd/δ\lceil\log d/\delta\rceil. Thus the total certificate complexity is dlogdδ\lceil d\log{d\over\delta}\rceil. ∎
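The certificate bookkeeping in this sketch is simple: one shared random string of length dlogdδd\cdot\lceil\log{d\over\delta}\rceil is split into dd fresh per-query certificates. The helper below is ours and purely illustrative of that split.

```python
import math

def split_certificate(r_bits, d, delta):
    # One fresh certificate of length ceil(log2(d/delta)) per adaptive
    # query, carved out of a single shared random string.
    ell = math.ceil(math.log2(d / delta))
    assert len(r_bits) == d * ell, "certificate string has the wrong length"
    return [r_bits[i * ell:(i + 1) * ell] for i in range(d)]

# d = 2 queries, delta = 0.25: each query gets ceil(log2(8)) = 3 bits.
certs = split_certificate("011010", d=2, delta=0.25)
```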

5.3 Impossibility Results in the PAC Model

In this section, we establish matching upper and lower bounds for the dd-Threshold Estimation Problem in the PAC model with respect to the uniform distribution. We establish that this problem admits a (d+1)(d+1)-list replicable algorithm and does not admit a dd-list replicable algorithm.

Problem 5.9 (dd-Threshold Estimation Problem).

Fix some dd\in\mathbb{N}. Let X=[0,1]dX=[0,1]^{d}. For each value t[0,1]d\vec{t}\in[0,1]^{d} (which happens to be the same as XX), let ht:X{0,1}h_{\vec{t}}:X\to\left\{0,1\right\} be the function defined by

ht(x)={1for each i[d], it holds that xiti0otherwise.h_{\vec{t}}(\vec{x})=\begin{cases}1&\text{for each $i\in[d]$, it holds that $x_{i}\leq t_{i}$}\\ 0&\text{otherwise}\end{cases}.

This is the function that determines if each coordinate is less than or equal to the thresholds specified by t\vec{t}. Let \mathcal{H} be the hypothesis class consisting of all such threshold functions: ={ht|t[0,1]d}\mathcal{H}=\left\{h_{\vec{t}}\;|\;\vec{t}\in[0,1]^{d}\right\}.
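Concretely, a threshold hypothesis from this class is a one-liner; the sketch below (ours, for illustration) evaluates hth_{\vec{t}} on a point of the domain.

```python
def h(t, x):
    # h_t(x) = 1 iff every coordinate of x is at most the corresponding
    # threshold in t (Problem 5.9).
    return int(all(xi <= ti for xi, ti in zip(x, t)))

assert h([0.5, 0.7], [0.4, 0.7]) == 1   # both coordinates within thresholds
assert h([0.5, 0.7], [0.6, 0.2]) == 0   # first coordinate exceeds t_1
```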

We first observe the impossibility of list-replicable algorithms in the general PAC model. This follows from known results.

Observation 5.10.

Let kk be any constant. There is no kk-list replicable algorithm for the dd-Threshold Estimation Problem in the PAC model even when d=1d=1.

Proof.

From the works of [BLM20] and [ALMM19], it follows that a class has finite Littlestone dimension if and only if there exists a constant kk such that the concept class has a kk-list replicable algorithm in the PAC model. Since the concept class dd-Threshold Estimation Problem has infinite Littlestone dimension even when d=1d=1, the observation follows. ∎

The above result rules out list-replicable algorithms in the general PAC model. In the rest of this section, we study the replicable learnability of dd-Threshold Estimation Problem in the PAC model under the uniform distribution. We establish matching upper and lower bounds on the list complexity.

Theorem 5.11.

In the PAC model under the uniform distribution, there is a (d+1)(d+1)-list replicable algorithm for dd-Threshold Estimation Problem.

Proof.

It is known and easy to see that dd-Threshold Estimation Problem is learnable under the uniform distribution by making dd nonadaptive statistical queries. Thus by Theorem 5.6, dd-Threshold Estimation Problem admits a (d+1)(d+1)-list replicable algorithm. ∎

We next establish that the above result is tight by proving that there is no dd-list replicable algorithm in the PAC model under the uniform distribution.

Theorem 5.12.

For k<d+1k<d+1, there does not exist a kk-list replicable algorithm for the dd-Threshold Estimation Problem in the PAC model under the uniform distribution.

The proof that for k<d+1k<d+1, there is no algorithm which kk-list replicably learns dd-Threshold Estimation Problem in the PAC model is similar to the proof of Theorem 4.7. The reason is that sampling dd-many biased coins with biases b\vec{b} is similar to obtaining a point x\vec{x} uniformly at random from [0,1]d[0,1]^{d} and evaluating the threshold function hbh_{\vec{b}} on it—this corresponds to asking whether all of the coins were heads/11’s. The two models differ though, because in the sample model for the dd-Coin Bias Estimation Problem, the algorithm sees for each coin whether it is heads or tails, but this information is not available in the PAC model for the dd-Threshold Estimation Problem. Conversely, in the PAC model for the dd-Threshold Estimation Problem, a random draw from [0,1]d[0,1]^{d} is available to the algorithm, but in the sample model for the dd-Coin Bias Estimation Problem the algorithm does not get this information.

Furthermore, there is the following additional complexity in the impossibility result for the dd-Threshold Estimation Problem. In the dd-Coin Bias Estimation Problem, we said by definition that a collection of dd coins parameterized by bias vector a\vec{a} was an ϵ\epsilon-approximation to a collection of dd coins parameterized by bias vector b\vec{b} if and only if baϵ\lVert\vec{b}-\vec{a}\rVert_{\infty}\leq\epsilon, and we used this norm in applying the results of [VWDP+22]. However, the notion of ϵ\epsilon-approximation in the PAC model is quite different from this. It is possible to have hypotheses hah_{\vec{a}} and hbh_{\vec{b}} in the dd-Threshold Estimation Problem such that ba>ϵ\lVert\vec{b}-\vec{a}\rVert_{\infty}>\epsilon but with respect to some distribution 𝒟X\mathcal{D}_{X} on the domain XX we have e𝒟X(ha,hb)ϵ\operatorname{e}_{\mathcal{D}_{X}}(h_{\vec{a}},h_{\vec{b}})\leq\epsilon. For example, if 𝒟X\mathcal{D}_{X} is the uniform distribution on X=[0,1]dX=[0,1]^{d} and a=0\vec{a}=\vec{0} and b\vec{b} is the first standard basis vector b=1,0,,0\vec{b}=\langle 1,0,\ldots,0\rangle, and ϵ=12\epsilon=\frac{1}{2}, then ba=1>ϵ\lVert\vec{b}-\vec{a}\rVert_{\infty}=1>\epsilon, but e𝒟X(ha,hb)=0ϵ\operatorname{e}_{\mathcal{D}_{X}}(h_{\vec{a}},h_{\vec{b}})=0\leq\epsilon because ha(x)hb(x)h_{\vec{a}}(\vec{x})\not=h_{\vec{b}}(\vec{x}) if and only if all of the last d1d-1 coordinates of x\vec{x} are 0 and the first coordinate is >0>0, but there is probability 0 of sampling such x\vec{x} from the uniform distribution on X=[0,1]dX=[0,1]^{d}.
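The example above can be checked empirically. In the sketch below (ours, for illustration), a\vec{a} is the all-zeros vector and b\vec{b} the first standard basis vector, so the two threshold hypotheses are at ℓ∞ distance 11, yet under the uniform distribution they disagree only on a measure-zero set, so the sampled disagreement rate is 0 almost surely.

```python
import random

def h(t, x):
    # Threshold hypothesis of Problem 5.9.
    return int(all(xi <= ti for xi, ti in zip(x, t)))

d = 3
a = [0.0] * d                     # all-zeros threshold vector
b = [1.0] + [0.0] * (d - 1)       # first standard basis vector
rng = random.Random(0)
disagreements = sum(
    h(a, x) != h(b, x)
    for x in ([rng.random() for _ in range(d)] for _ in range(10000))
)
assert disagreements == 0  # the hypotheses agree on every sampled point
```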

For this reason, we cannot simply partition [0,1]d[0,1]^{d} as we did in the proof of Theorem 4.7 and must argue more carefully. It turns out that it is possible to find a subcube [α,1]d[\alpha,1]^{d} on which hypotheses parameterized by vectors on opposite faces of this cube have high PAC error between them. A consequence, by the triangle inequality for e𝒟X\operatorname{e}_{\mathcal{D}_{X}}, is that two such hypotheses cannot both be approximated by a common third hypothesis. That is what the following lemma states.

Lemma 5.13.

Let d\in\mathbb{N} and \alpha=\frac{d-1}{d}. Let \vec{s},\vec{t}\in[\alpha,1]^{d} be such that there exists a coordinate i_{0}\in[d] where s_{i_{0}}=\alpha and t_{i_{0}}=1 (i.e., \vec{s} and \vec{t} are on opposite faces of this cube). Let \epsilon\leq\frac{1}{8d}. Then there is no point \vec{r}\in X such that both \operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{r}})\leq\epsilon and \operatorname{e}_{\mathrm{unif}}(h_{\vec{t}},h_{\vec{r}})\leq\epsilon (i.e., there is no hypothesis that is an \epsilon-approximation to both h_{\vec{s}} and h_{\vec{t}}).

Proof.

Let \vec{q} be the vector with q_{i_{0}}=s_{i_{0}} and q_{i}=t_{i} for every i\not=i_{0}; it will serve as a proxy for \vec{s}.

We need the following claim.

Claim 5.14.

For each \vec{x}\in X, the following are equivalent:

  1. h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x});

  2. h_{\vec{q}}(\vec{x})=0 and h_{\vec{t}}(\vec{x})=1;

  3. x_{i_0}\in(q_{i_0},t_{i_0}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}.

Furthermore, these equivalent conditions imply the following:

  4. h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x}).

Proof of Claim.

(2) \Longrightarrow (1): This is trivial.

(1) \Longrightarrow (2):

Note that because q_{i_{0}}=s_{i_{0}}=\alpha<1=t_{i_{0}}, we have q_{i}\leq t_{i} for all i\in[d]. If h_{\vec{t}}(\vec{x})=0, then for some i_{1}\in[d] it must be that x_{i_{1}}>t_{i_{1}}, but since t_{i_{1}}\geq q_{i_{1}} it would also be the case that x_{i_{1}}>q_{i_{1}}, so h_{\vec{q}}(\vec{x})=0, contradicting h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}). Thus h_{\vec{t}}(\vec{x})=1, and since h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}), we have h_{\vec{q}}(\vec{x})=0.

(1) \iff (3): We partition [0,1]^{d} into three sets and examine each case.

Case 1: x_{i_{0}}\in(q_{i_{0}},t_{i_{0}}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}. In this case, q_{i_{0}}<x_{i_{0}}, so h_{\vec{q}}(\vec{x})=0, and x_{i}\leq t_{i} for all i\in[d], so h_{\vec{t}}(\vec{x})=1; hence h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}).

Case 2: x_{i_{0}}\not\in(q_{i_{0}},t_{i_{0}}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}. In this case, because x_{i_{0}}\in[0,1] and x_{i_{0}}\not\in(\alpha,1], we have x_{i_{0}}\leq\alpha=q_{i_{0}}\leq t_{i_{0}}, and for all other i\in[d]\setminus\{i_{0}\} we have x_{i}\leq t_{i}=q_{i} (by definition of \vec{q}). Thus h_{\vec{q}}(\vec{x})=1=h_{\vec{t}}(\vec{x}).

Case 3: x_{i_{1}}\not\in[0,t_{i_{1}}] for some i_{1}\in[d]\setminus\{i_{0}\}. In this case, because x_{i_{1}}\in[0,1], we have x_{i_{1}}>t_{i_{1}}=q_{i_{1}}. Thus h_{\vec{q}}(\vec{x})=0=h_{\vec{t}}(\vec{x}).

Thus h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}) if and only if x_{i_{0}}\in(q_{i_{0}},t_{i_{0}}]=(\alpha,1] and x_{i}\in[0,t_{i}] for all i\in[d]\setminus\{i_{0}\}.

(1, 2, 3) \Longrightarrow (4): By (3), we have x_{i_{0}}>q_{i_{0}}, and since q_{i_{0}}=s_{i_{0}} by definition of \vec{q}, it follows that x_{i_{0}}>s_{i_{0}}, which means h_{\vec{s}}(\vec{x})=0. By (2), h_{\vec{t}}(\vec{x})=1, which gives h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x}). ∎
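As a sanity check, the claim's equivalences can be verified numerically on random points. The sketch below uses the coordinate-wise threshold convention h_{\vec{v}}(\vec{x})=1 iff \vec{x}\leq\vec{v} in every coordinate, and an illustrative (randomly drawn) choice of d, \vec{s}, and \vec{t} on opposite faces of [\alpha,1]^{d}.

```python
import random

def h(v, x):
    # Threshold hypothesis: label 1 iff x is coordinate-wise at most v.
    return 1 if all(xi <= vi for vi, xi in zip(v, x)) else 0

random.seed(0)
d, i0 = 4, 0
alpha = (d - 1) / d

# s and t lie in [alpha, 1]^d, on opposite faces in coordinate i0.
s = [alpha] + [random.uniform(alpha, 1) for _ in range(d - 1)]
t = [1.0] + [random.uniform(alpha, 1) for _ in range(d - 1)]
q = [s[i] if i == i0 else t[i] for i in range(d)]  # the proxy vector

n_disagree = 0
for _ in range(100_000):
    x = [random.random() for _ in range(d)]
    cond1 = h(q, x) != h(t, x)
    cond3 = (q[i0] < x[i0] <= t[i0]) and all(
        x[i] <= t[i] for i in range(d) if i != i0
    )
    assert cond1 == cond3                          # (1) <=> (3)
    if cond1:
        n_disagree += 1
        assert h(q, x) == 0 and h(t, x) == 1       # (1) => (2)
        assert h(s, x) != h(t, x)                  # (1) => (4)

print(n_disagree)
```

Disagreements do occur with noticeable frequency (roughly a (1-\alpha)\alpha^{d-1} fraction of samples), which is exactly the quantity lower-bounded next in the proof.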

With this claim in hand, our next step is to prove the following two inequalities:

2\epsilon<\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}})\leq\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}).

For the second of these inequalities, note that by the (1) \Longrightarrow (4) part of the claim above, h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x}) implies h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x}), so we have

\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}}) =\Pr_{\vec{x}\sim\mathrm{unif}(X)}[h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x})]
\leq\Pr_{\vec{x}\sim\mathrm{unif}(X)}[h_{\vec{s}}(\vec{x})\not=h_{\vec{t}}(\vec{x})]
=\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}).

Now, for the first of the inequalities above, we use the (1) \iff (3) portion of the claim, the hypothesis that \vec{t}\in[\alpha,1]^{d} (which implies [0,\alpha]\subseteq[0,t_{i}] for each i\in[d]), the hypothesis that \epsilon\leq\frac{1}{8d}, and the elementary bound (1-\alpha)\cdot\alpha^{d-1}=\frac{1}{d}\left(1-\frac{1}{d}\right)^{d-1}\geq\frac{1}{ed}>\frac{1}{4d}. Utilizing these, we get the following:

\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}})
=\Pr_{\vec{x}\sim\mathrm{unif}(X)}[h_{\vec{q}}(\vec{x})\not=h_{\vec{t}}(\vec{x})]
=\Pr_{\vec{x}\sim\mathrm{unif}(X)}[x_{i_{0}}\in(\alpha,1]\;\wedge\;\forall i\in[d]\setminus\{i_{0}\},\,x_{i}\in[0,t_{i}]]
=\Pr_{x_{i_{0}}\sim\mathrm{unif}([0,1])}[x_{i_{0}}\in(\alpha,1]]\cdot\prod_{\substack{i=1\\ i\not=i_{0}}}^{d}\Pr_{x\sim\mathrm{unif}([0,1])}[x\in[0,t_{i}]]
\geq\Pr_{x_{i_{0}}\sim\mathrm{unif}([0,1])}[x_{i_{0}}\in(\alpha,1]]\cdot\prod_{\substack{i=1\\ i\not=i_{0}}}^{d}\Pr_{x\sim\mathrm{unif}([0,1])}[x\in[0,\alpha]]
=(1-\alpha)\cdot\alpha^{d-1}
>\frac{1}{4d}
\geq 2\epsilon.

Thus, we get the desired two inequalities:

2\epsilon<\operatorname{e}_{\mathrm{unif}}(h_{\vec{q}},h_{\vec{t}})\leq\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}).
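The strict inequality (1-\alpha)\cdot\alpha^{d-1}>\frac{1}{4d} (with \alpha=\frac{d-1}{d}), together with \frac{1}{4d}\geq 2\epsilon when \epsilon\leq\frac{1}{8d}, can be confirmed numerically; below is a quick check over a range of dimensions.

```python
# Check that (1 - alpha) * alpha^(d-1) > 1/(4d) for alpha = (d-1)/d.
# This holds because (1 - 1/d)^(d-1) >= 1/e > 1/4 for every d >= 1.
margins = []
for d in range(1, 1001):
    alpha = (d - 1) / d
    lower_bound = (1 - alpha) * alpha ** (d - 1)   # disagreement-probability bound
    two_eps_max = 2 * (1 / (8 * d))                # largest allowed value of 2*epsilon
    assert lower_bound > 1 / (4 * d) >= two_eps_max
    margins.append(lower_bound - 1 / (4 * d))

min_margin = min(margins)
print(min_margin > 0)
```

The margin shrinks like (1/e - 1/4)/d as d grows but never vanishes, which is what the proof needs.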

This nearly completes the proof. If there existed some point \vec{r}\in X such that both \operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{r}})\leq\epsilon and \operatorname{e}_{\mathrm{unif}}(h_{\vec{t}},h_{\vec{r}})\leq\epsilon, then it would follow from the triangle inequality for \operatorname{e}_{\mathrm{unif}} that

\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}})\leq\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{r}})+\operatorname{e}_{\mathrm{unif}}(h_{\vec{t}},h_{\vec{r}})\leq 2\epsilon

but this would contradict the inequalities above, so no such \vec{r} exists. ∎

Equipped with the above lemma, we are now ready to prove Theorem 5.12.

Proof of Theorem 5.12.

Fix any d\in\mathbb{N}, and choose \epsilon\leq\frac{1}{8d} and \delta\leq\frac{1}{d+2}, so that Lemma 5.13 applies. We will use the constant \alpha=\frac{d-1}{d} and consider the cube [\alpha,1]^{d}. We will also consider only the uniform distribution over X.

Suppose for contradiction that such an algorithm A exists for some k<d+1. This means that for each possible threshold \vec{t}\in[0,1]^{d}, there exists some set L_{\vec{t}}\subseteq\mathcal{H} of hypotheses with three properties: (1) each element of L_{\vec{t}} is an \epsilon-approximation to h_{\vec{t}}, (2) \lvert L_{\vec{t}}\rvert\leq k, and (3) with probability at least 1-\delta, A returns an element of L_{\vec{t}}.

By a simple averaging argument, there exists at least one element of L_{\vec{t}} that is returned by A with probability at least \frac{1}{k}(1-\delta)\geq\frac{1}{k}(1-\frac{1}{d+2})=\frac{1}{k}\cdot\frac{d+1}{d+2}\geq\frac{1}{k}\cdot\frac{k+1}{k+2} (using k<d+1 and the fact that x\mapsto\frac{x+1}{x+2} is increasing). Let f\colon[\alpha,1]^{d}\to[0,1]^{d} be a function that maps each threshold \vec{t}\in[\alpha,1]^{d} to such an element of L_{\vec{t}}. This differs slightly from the proof of Theorem 4.7 because we define f only on a specific subset of the possible thresholds; the reason for this was alluded to in the discussion following the statement of Theorem 5.12.

Since \frac{1}{k}\cdot\frac{k+1}{k+2}>\frac{1}{k+1}, let \eta be such that 0<\eta<\frac{1}{k}\cdot\frac{k+1}{k+2}-\frac{1}{k+1}.
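Both inequalities in this averaging step reduce to elementary algebra: \frac{d+1}{d+2}\geq\frac{k+1}{k+2} because k\leq d, and \frac{1}{k}\cdot\frac{k+1}{k+2}>\frac{1}{k+1} because (k+1)^{2}=k(k+2)+1. The short check below confirms them over a range of parameters.

```python
# For every d and every k < d + 1:
#   (1/k) * (d+1)/(d+2) >= (1/k) * (k+1)/(k+2)   (since (x+1)/(x+2) is increasing)
#   (1/k) * (k+1)/(k+2) >  1/(k+1)               (since (k+1)^2 = k(k+2) + 1)
eta_gaps = []
for d in range(1, 200):
    for k in range(1, d + 1):  # ranges over all k < d + 1
        heavy = (1 / k) * (d + 1) / (d + 2)   # lower bound on the heaviest element's probability
        assert heavy >= (1 / k) * (k + 1) / (k + 2)
        eta_gaps.append((1 / k) * (k + 1) / (k + 2) - 1 / (k + 1))

min_gap = min(eta_gaps)
print(min_gap > 0)
```

The gap equals \frac{1}{k(k+1)(k+2)} exactly, so a valid positive \eta always exists.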

The function f induces a partition \mathcal{P} of [\alpha,1]^{d} whose members are the fibers of f (i.e., \mathcal{P}=\{f^{-1}(\vec{y})\colon\vec{y}\in\operatorname{range}(f)\}). For any member W\in\mathcal{P} and any coordinate i\in[d], the set \{w_{i}\colon\vec{w}\in W\} cannot contain both values \alpha and 1. If it did, there would be two points \vec{s},\vec{t}\in W with s_{i}=\alpha and t_{i}=1; since both belong to W, there is some \vec{y}\in[0,1]^{d} with f(\vec{s})=\vec{y}=f(\vec{t}), so h_{\vec{y}} would have to be an \epsilon-approximation (in the PAC model) of both h_{\vec{s}} and h_{\vec{t}}, which is impossible by Lemma 5.13.

Thus, the partition \mathcal{P} is a “non-spanning” partition of [\alpha,1]^{d} as in the proof of 3.7, so there is some point \vec{p}\in[\alpha,1]^{d} such that for every radius r>0, the ball \overline{B}_{r}^{\infty}(\vec{p}) intersects at least d+1 members of \mathcal{P}.

Similar to 4.8 and its use in the proof of Theorem 4.7, we can use the following two facts. First, the function \gamma_{1} defined by \gamma_{1}(\vec{s},\vec{t})=\operatorname{e}_{\mathrm{unif}}(h_{\vec{s}},h_{\vec{t}}) is continuous (with respect to the \ell_{\infty} norm on the domain). Second, the function \gamma_{2}(h_{\vec{s}},h_{\vec{t}})=\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{s},n},\mathcal{D}_{A,\vec{t},n}) is continuous (with respect to the \operatorname{e}_{\mathrm{unif}} notion of distance on the domain). Consequently, the composition \gamma_{12}(\vec{s},\vec{t})=\operatorname{d_{TV}}(\mathcal{D}_{A,\vec{s},n},\mathcal{D}_{A,\vec{t},n}) is continuous. Thus, we can find some radius r>0 such that if \lVert\vec{t}-\vec{s}\rVert_{\infty}\leq r, then \operatorname{d_{TV}}(\mathcal{D}_{A,\vec{s},n},\mathcal{D}_{A,\vec{t},n})\leq\eta.

Now we get the same type of contradiction as in the proof of Theorem 4.7: for the special point \vec{p}, the distribution \mathcal{D}_{A,\vec{p},n} has d+1\geq k+1 disjoint events, each of probability greater than \frac{1}{k+1}, which is impossible since the probabilities of disjoint events sum to at most 1. Thus, no k-list replicable algorithm exists. ∎

6 Conclusions

In this work, we investigated the pressing issue of replicability in machine learning from an algorithmic point of view. We observed that replicability in the absolute sense is difficult to achieve, so we considered two natural relaxations that capture the degree of (non-)replicability: list and certificate replicability. We designed replicable algorithms with small list, certificate, and sample complexities for the d-Coin Bias Estimation Problem and for the class of problems that can be learned via statistical query algorithms making non-adaptive statistical queries. We also established certain impossibility results in the PAC model of learning and for the d-Coin Bias Estimation Problem. Several interesting research directions emerge from our work. There is a gap between the sample complexities of list and certificate replicability with comparable parameters; is this gap inevitable? Currently, there is an exponential gap in the replicability parameters between hypothesis classes that can be learned via non-adaptive and adaptive statistical queries; is this gap necessary? A generic question is to explore the trade-offs among sample complexity, list complexity, certificate complexity, and adaptivity.

7 Acknowledgements

We thank an anonymous reviewer for suggesting Observation 5.10. Pavan’s work is partly supported by NSF award 2130536. Part of this work was done while Pavan was visiting the Simons Institute for the Theory of Computing. Vander Woude’s and Vinodchandran’s work is partly supported by NSF award 2130608.

References

  • [AJJ+22] Kwangjun Ahn, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, and Gil I. Shamir. Reproducibility in optimization: Theoretical framework and limits, 2022. arXiv:2202.04598.
  • [ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852–860. ACM, 2019.
  • [AV20] Nima Anari and Vijay V. Vazirani. Matching is as easy as the decision problem, in the NC model. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference, ITCS 2020, January 12-14, 2020, Seattle, Washington, USA, volume 151 of LIPIcs, pages 54:1–54:25. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • [Bak16] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533:452–454, 2016.
  • [BLM20] Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 389–402. IEEE, 2020.
  • [BNS+21] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan R. Ullman. Algorithmic stability for adaptive data analysis. SIAM J. Comput., 50(3), 2021.
  • [Can15] Clément L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electron. Colloquium Comput. Complex., TR15-063, 2015. URL: https://eccc.weizmann.ac.il/report/2015/063, arXiv:TR15-063.
  • [DLPES02] Jesus A. De Loera, Elisha Peterson, and Francis Edward Su. A Polytopal Generalization of Sperner’s Lemma. Journal of Combinatorial Theory, Series A, 100(1):1–26, October 2002. URL: https://www.sciencedirect.com/science/article/pii/S0097316502932747, doi:10.1006/jcta.2002.3274.
  • [DPV18] Peter Dixon, Aduri Pavan, and N. V. Vinodchandran. On pseudodeterministic approximation algorithms. In Igor Potapov, Paul G. Spirakis, and James Worrell, editors, 43rd International Symposium on Mathematical Foundations of Computer Science, MFCS 2018, August 27-31, 2018, Liverpool, UK, volume 117 of LIPIcs, pages 61:1–61:11. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
  • [DPVWV22] Peter Dixon, Aduri Pavan, Jason Vander Woude, and N. V. Vinodchandran. Pseudodeterminism: promises and lowerbounds. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1552–1565. ACM, 2022.
  • [eco13] How science goes wrong. The Economist, pages 25–30, 2013.
  • [EKK+23] Hossein Esfandiari, Alkis Kalavasis, Amin Karbasi, Andreas Krause, Vahab Mirrokni, and Grigoris Velegkas. Replicable bandits, 2023. arXiv:2210.01898.
  • [GG] Shafi Goldwasser and Ofer Grossman. Bipartite perfect matching in pseudo-deterministic NC. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs.
  • [GG11] Eran Gat and Shafi Goldwasser. Probabilistic Search Algorithms with Unique Answers and Their Cryptographic Applications. Technical Report 136, 2011. URL: https://eccc.weizmann.ac.il/report/2011/136/.
  • [GGH18] Shafi Goldwasser, Ofer Grossman, and Dhiraj Holden. Pseudo-deterministic proofs. In Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, volume 94 of LIPIcs, pages 17:1–17:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
  • [GGMW20] Shafi Goldwasser, Ofer Grossman, Sidhanth Mohanty, and David P. Woodruff. Pseudo-deterministic streaming. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference, ITCS, volume 151 of LIPIcs, pages 79:1–79:25, 2020.
  • [GKM21] Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. User-level differentially private learning via correlated sampling. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 20172–20184, 2021.
  • [GL19] Ofer Grossman and Yang P. Liu. Reproducibility and pseudo-determinism in log-space. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 606–620. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975482.38, doi:10.1137/1.9781611975482.38.
  • [Gol19] Oded Goldreich. Multi-pseudodeterministic algorithms. Electron. Colloquium Comput. Complex., TR19-012, 2019. URL: https://eccc.weizmann.ac.il/report/2019/012, arXiv:TR19-012.
  • [ILPS22] Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 818–831. ACM, 2022. URL: https://doi.org/10.1145/3519935.3519973, doi:10.1145/3519935.3519973.
  • [JP05] John P. A. Ioannidis. Why most published research findings are false. PLOS Medicine, 2(8), 2005.
  • [Kea98] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.
  • [LOS21] Zhenjian Lu, Igor Carboni Oliveira, and Rahul Santhanam. Pseudodeterministic algorithms and the structure of probabilistic time. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 303–316. ACM, 2021.
  • [MPK19] Harshal Mittal, Kartikey Pandey, and Yash Kant. Iclr reproducibility challenge report (padam : Closing the generalization gap of adaptive gradient methods in training deep neural networks), 2019. arXiv:1901.09517.
  • [NAS19] National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. 2019. https://doi.org/10.17226/25303.
  • [OS17] I. Oliveira and R. Santhanam. Pseudodeterministic constructions in subexponential time. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 665–677, 2017.
  • [OS18] Igor Carboni Oliveira and Rahul Santhanam. Pseudo-derandomizing learning and approximation. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2018, volume 116 of LIPIcs, pages 55:1–55:19, 2018.
  • [PVLS+21] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Lariviere, Alina Beygelzimer, Florence d’Alche Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research(a report from the neurips 2019 reproducibility program). Journal of Machine Learning Research, 22(164):1–20, 2021. URL: http://jmlr.org/papers/v22/20-303.html.
  • [SZ99] Michael E. Saks and Shiyu Zhou. BP_{H}SPACE(S) \subseteq DSPACE(S^{3/2}). J. Comput. Syst. Sci., 58(2):376–403, 1999.
  • [VWDP+22] Jason Vander Woude, Peter Dixon, Aduri Pavan, Jamie Radcliffe, and N. V. Vinodchandran. Geometry of rounding. CoRR, abs/2211.02694, 2022. URL: https://doi.org/10.48550/arXiv.2211.02694, arXiv:2211.02694, doi:10.48550/arXiv.2211.02694.