
On the Statistical Complexity of Sample Amplification

Brian Axelrod, Shivam Garg, Yanjun Han, Vatsal Sharan, and Gregory Valiant. B. Axelrod and G. Valiant are with the Department of Computer Science, Stanford University, emails: {baxelrod,valiant}@cs.stanford.edu. S. Garg is with Microsoft Research, email: [email protected]. Y. Han is with the Courant Institute of Mathematical Sciences and the Center for Data Science, New York University, email: [email protected]. V. Sharan is with the Department of Computer Science, University of Southern California, email: [email protected].
Abstract

The “sample amplification” problem formalizes the following question: Given n i.i.d. samples drawn from an unknown distribution P, when is it possible to produce a larger set of n+m samples which cannot be distinguished from n+m i.i.d. samples drawn from P? In this work, we provide a firm statistical foundation for this problem by deriving generally applicable amplification procedures, lower bound techniques, and connections to existing statistical notions. Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.

1 Introduction

Consider the following problem of manufacturing more data: an amplifier is given n samples drawn i.i.d. from an unknown distribution P, and the goal is to generate a larger set of n+m samples which are indistinguishable from n+m i.i.d. samples from P. How large can m be as a function of n and the distribution class in question? Are there sound and systematic ways to generate a larger set of samples? Is this task strictly easier than the learning task, in the sense that the number of samples required for generating n+1 samples is smaller than that required for learning P?

In our preliminary work [AGSV20], we formalized this problem as the sample amplification problem, considering total variation (TV) as the measure for indistinguishability.

Definition 1.1 (Sample Amplification).

Let 𝒫 be a class of probability distributions over a domain 𝒳. We say that 𝒫 admits an (n, n+m, ε) sample amplification procedure if there exists a (possibly randomized) map T_{𝒫,n,m,ε} : 𝒳^n → 𝒳^{n+m} such that

sup_{P∈𝒫} ‖P^{⊗n} ∘ T_{𝒫,n,m,ε}^{−1} − P^{⊗(n+m)}‖_TV ≤ ε.   (1)

An equivalent way to view Definition 1.1 is as a game between two parties: an amplifier and a verifier. The amplifier gets n samples drawn i.i.d. from the unknown distribution P in the class 𝒫, and her goal is to generate a larger dataset of n+m samples which must be accepted with good probability by any verifier that also accepts a real dataset of n+m i.i.d. samples from P with good probability. Here, the verifier is computationally unbounded and knows the distribution P, but does not observe the amplifier’s original set of n samples.

Along with being a natural statistical task, the sample amplification framework is also relevant from a practical standpoint. Currently, there is an enormous trend in the machine learning community to train models on datasets that have been enlarged in various ways. There are relatively transparent and classical approaches to achieve this, such as leveraging known invariances, e.g. rotation or translation invariance, to augment the dataset by including transformed versions of each original datapoint [SSP+03, KSH12, CZM+19, SK19, CZSL20]. More recently, deep generative models have been used both to directly enlarge training data and to construct larger datasets consisting of samples with properties that are rare in naturally occurring datasets [ASE17, CMST17, FADK+18, MMKSM18, YSH18, SYPS19, CPS+19, YWB19, HRA+19, CMV+21, LSM+22, CQZ+22, LSW+23, BKK+22]. More opaque approaches such as MixUp [ZCDLP18] and related techniques [TUH18, YHO+19, VLB+19, HMC+20], which add a significant fraction of new datapoints that are explicitly not supported in the true data distribution, are also very popular since they seem to improve the performance of the trained models. Given this current zoo of widely implemented approaches to enlarging datasets, there is a clear motivation for bringing a more principled statistical understanding to such approaches. One natural starting point is the statistical setting we consider, which asks to what extent datasets can be enlarged in a perfect sense—where it is not possible to distinguish the enlarged dataset from a set of i.i.d. draws. Moreover, this work lays a foundation for the ambitious broader goal of understanding how various amplification techniques interact with downstream learning algorithms and statistical estimators, and developing amplification techniques that are optimal for certain classes of such algorithms and estimators.

In [AGSV20], a subset of the authors introduced the sample amplification problem, and studied two classes of distributions: the Gaussian location model and discrete distribution model. For these examples, they characterized the statistical complexity of sample amplification and showed that it is strictly smaller than that of learning. In this paper, we work towards a general understanding of the statistical complexity of the sample amplification problem, and its relationship with learning. The main contributions of this paper are as follows:

Distribution Class | Amplification | Procedure
---|---|---
Gaussian with unknown mean and fixed covariance | (n, n+Θ(nε/√d)) | Sufficiency/Learning (UB: Examples 4.1, 4.2, A.8; LB: Theorems 6.2, 6.5)
Gaussian with unknown mean and covariance | (n, n+Θ(nε/d)) | Sufficiency (UB: Examples A.1, A.3; LB: Example A.20)
Gaussian with s-sparse mean and identity covariance | (n, n+Θ(nε/√(s log d))) | Learning (UB: Example A.12; LB: Example A.18)
Discrete distributions with support size at most k | (n, n+Θ(nε/√k)) | Learning (UB: Example A.9; LB: [AGSV20, Theorem 1])
Poissonized discrete distributions with support at most k | (n, n+Θ(√n·ε + nε/√k)) | Learning (UB: Example A.16; LB: Example A.16)
d-dim. product of Exponential distributions | (n, n+Θ(nε/√d)) | Sufficiency/Learning (UB: Examples A.5, A.11; LB: Theorem 6.5)
Uniform distribution on d-dim. rectangle | (n, n+Θ(nε/√d)) | Sufficiency/Learning (UB: Examples A.6, A.10; LB: Theorem 6.5)
d-dim. product of Poisson distributions | (n, n+Θ(nε/√d)) | Sufficiency+Learning (UB: Example A.14; LB: Theorem 6.5)

Table 1: Sample amplification achieved by the presented procedures. Results include matching upper bounds (UB) and lower bounds (LB), with pointers to specific examples or theorems for details.
  1.

    Amplification via sufficiency. Our first contribution is a simple yet powerful procedure for sample amplification: apply the sample amplification map only to sufficient statistics. Specifically, the learner computes a sufficient statistic T_n from X^n, maps T_n to some T_{n+m}, and generates new samples X̂^{n+m} from the conditional distribution given T_{n+m}. Surprisingly, this simple idea leads to a much cleaner procedure than that of [AGSV20] for Gaussian location models, and one that is exactly optimal (cf. Theorem 6.2). The range of applicability also extends to general exponential families, with rate-optimal sample amplification performance. Specifically, for general d-dimensional exponential families satisfying a mild moment condition, the sufficiency-based procedure achieves an (n, n+O(nε/√d), ε) sample amplification for large enough n, which by our matching lower bounds in Section 6 is asymptotically minimax rate-optimal.

  2.

    Amplification via learning. Our second contribution is another general sample amplification procedure that does not require the existence of a sufficient statistic, and also sheds light on the relationship between learning and sample amplification. This procedure essentially draws new samples from a rate-optimal estimate of the true distribution, and outputs a random permutation of the old and new samples. The procedure achieves an (n, n+O(ε√(n/r_{χ²}(𝒫, n))), ε) sample amplification, where r_{χ²}(𝒫, n) denotes the minimax risk for learning P ∈ 𝒫 under the expected χ² divergence given n samples. This shows that learning P to within χ² divergence O(n/ε²) is sufficient for non-trivial sample amplification.

    In addition, we show that for the special case of product distributions, it is important that the permutation step be applied coordinatewise to achieve the optimal sample amplification. Specifically, if 𝒫 = ∏_{j=1}^d 𝒫_j, this procedure achieves a better sample amplification

    (n, n + O(ε √(n / Σ_{j=1}^d r_{χ²}(𝒫_j, n))), ε).

    We have summarized in Table 1 several examples where the sufficiency- and/or learning-based sample amplification procedures are optimal. Note that there is no golden rule for choosing one idea over the other, and there exists an example where the two ideas must be combined.

  3.

    Minimax lower bounds. Complementing our sample amplification procedures, we provide a general recipe for proving lower bounds for sample amplification. This recipe is intrinsically different from the standard techniques for proving lower bounds for hypothesis testing, as the sample amplification problem differs significantly from an estimation problem. In particular, specializing our recipe to product models shows that, for essentially all d-dimensional product models, an (n, n+Cnε/√d, ε) sample amplification is impossible for some absolute constant C < ∞ independent of the product model.

    For non-product models, the above powerful result does not directly apply, but proper applications of the general recipe could still lead to tight lower bounds for sample amplification. Specifically, we obtain matching lower bounds for all examples listed in Table 1, including the non-product examples.
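The learning-based procedure described in the second contribution above can be sketched in a few lines. Below is a minimal, illustrative Python sketch for discrete distributions; the function name and the plain empirical estimator are our choices, whereas the procedures analyzed in this paper use rate-optimal estimators and, for product models, coordinatewise permutations:

```python
import numpy as np

def amplify_via_learning(samples, k, m, rng=None):
    """Sketch of learning-based sample amplification on {0, ..., k-1}:
    fit an estimate of the distribution, draw m fresh samples from it,
    and return a uniformly random permutation of old and new samples."""
    rng = np.random.default_rng(rng)
    p_hat = np.bincount(samples, minlength=k) / len(samples)  # empirical estimate
    fresh = rng.choice(k, size=m, p=p_hat)                    # m new draws
    combined = np.concatenate([samples, fresh])
    rng.shuffle(combined)  # random permutation hides which samples are new
    return combined

# Usage: amplify n = 1000 uniform draws on k = 10 symbols by m = 10.
rng = np.random.default_rng(0)
x = rng.choice(10, size=1000)
y = amplify_via_learning(x, k=10, m=10, rng=1)
```

The permutation step matters: simply appending the fresh samples at the end would let a verifier examine the tail of the dataset separately from the rest.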

We now provide several numerical simulations to suggest the potential utility of sample amplification. Recall that a practical motivation of sample amplification is to produce an enlarged dataset that can be fed into a distribution-agnostic algorithm in downstream applications. Here, we consider the following basic experiments in that vein:

Figure 1: Sample amplification experiments. The x-axis corresponds to the amount of amplification, m, and the shaded area depicts the 95% confidence interval based on 5,000 simulations.
(a) Mean absolute errors for estimation of the fourth moment E[X⁴] in a one-dimensional Gaussian location model X_1, …, X_n ~ 𝒩(μ, 1) with n = 100, μ = 1.
(b) Mean absolute errors for estimation of the squared L₂ norm E[‖X‖₂²] in a d-dimensional Gaussian location model X_1, …, X_n ~ 𝒩(μ, I_d) with n = 50, d = 100, μ = 𝟏/√d.
(c) Classification errors for binary classification between two clusters of covariates X_1, …, X_{n/2} ~ 𝒩(e₁, I_d) and X_{n/2+1}, …, X_n ~ 𝒩(−e₁, I_d), with n = 50 and d = 10.
  • Fourth moment estimation for a one-dimensional Gaussian: here we observe X_1, …, X_n ~ 𝒩(μ, 1) with n = 100 and μ = 1, and we consider three estimators. The empirical estimator operates in a distribution-agnostic fashion and is simply the empirical fourth moment n^{−1} Σ_{i=1}^n X_i⁴. The plug-in estimator uses the knowledge of Gaussianity: it first estimates μ̂ = X̄ and then uses E_{X~𝒩(μ̂,1)}[X⁴] = μ̂⁴ + 6μ̂² + 3. The amplified estimator first amplifies the sample X^n into Y^{n+m} via sufficiency (cf. Example 4.1), and then uses the empirical estimator (n+m)^{−1} Σ_{j=1}^{n+m} Y_j⁴ based on the enlarged sample Y^{n+m}. The mean absolute errors (MAEs) are displayed in Figure 1(a). We observe that although the empirical estimator based on the original sample X^n has a large MAE, its performance is improved by using the amplified sample Y^{n+m}.

  • Squared L₂ norm estimation for a high-dimensional Gaussian: here we observe X_1, …, X_n ~ 𝒩(μ, I_d) with n = 50, d = 100, and μ = 𝟏/√d, and we again consider three estimators for E[‖X‖₂²]. As before, the empirical estimator is simply n^{−1} Σ_{i=1}^n ‖X_i‖₂², and the plug-in estimator estimates μ̂ = X̄ and uses the identity E_{X~𝒩(μ̂,I_d)}[‖X‖₂²] = ‖μ̂‖₂² + d. As for the amplified estimator, it first amplifies the sample X^n into Y^{n+m} via sufficiency (cf. Example 4.1), and then uses the empirical estimator based on Y^{n+m}. The mean absolute errors are displayed in Figure 1(b). Here the empirical estimator outperforms the plug-in estimator due to a smaller bias, while sample amplification further reduces the MAE as long as the size of amplification m is not too large. This can be explained by the bias-variance tradeoff, where the amplified estimator interpolates between the empirical estimator (with no bias) and the plug-in estimator (with the smallest asymptotic variance).

  • Binary classification: here we observe two clusters of covariates X_1, …, X_{n/2} ~ 𝒩(e₁, I_d) (with label 1) and X_{n/2+1}, …, X_n ~ 𝒩(−e₁, I_d) (with label −1), with n = 50, d = 10, and e₁ being the first standard basis vector. The target is to train a classifier with high classification accuracy on test data drawn from the same distribution. The standard classifier is logistic regression, which does not use the knowledge of Gaussianity. To apply sample amplification, we first amplify the sample in each class via either sufficiency (cf. Example 4.1) or learning (cf. Example A.8), and then run logistic regression on the amplified datasets. Figure 1(c) displays the classification errors of all three approaches, and shows that both amplification procedures help reduce the classification error even for small values of m.
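The amplifier used in these experiments (Example 4.1 with the identity map between sample means) is easy to implement. Below is a minimal sketch for the one-dimensional case, with variable and function names of our own choosing: conditioned on its sample mean, an i.i.d. 𝒩(0, 1) vector equals that mean plus centered residuals, so the amplified samples can be generated by recentering fresh Gaussian noise.

```python
import numpy as np

def amplify_gaussian_mean(x, m, rng=None):
    """Sufficiency-based amplification for N(mu, 1):
    keep the sample mean T_n fixed and draw n+m samples from the
    conditional distribution given that sufficient statistic."""
    rng = np.random.default_rng(rng)
    t = x.mean()                         # sufficient statistic T_n
    z = rng.standard_normal(len(x) + m)  # reference draws from N(0, 1)
    return t + (z - z.mean())            # sample mean pinned exactly to t

# Fourth-moment experiment in miniature: n = 100, mu = 1, amplify by m = 20.
rng = np.random.default_rng(0)
x = 1.0 + rng.standard_normal(100)
y = amplify_gaussian_mean(x, m=20, rng=1)
emp, amp = np.mean(x**4), np.mean(y**4)  # empirical vs. amplified estimator
```

By construction the amplified sample has exactly the same sample mean as the original, which is precisely the identity map T̂_{n+m} = T_n between sufficient statistics.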

The above experiments demonstrate the potential for sample amplification to leverage knowledge of the data distribution to produce a larger dataset that is then fed into downstream distribution-agnostic algorithms. Some experiments (e.g. Figure 1(b)) also suggest a limit beyond which the amplification procedure alters the data distribution too much. We believe that rigorously examining amplification through the lens of the performance of downstream estimators and algorithms, including those illustrated in our numerical simulations, would be a fruitful direction for future work.

2 Connections, limitations and future work

As discussed above, it is commonplace in machine learning to increase the size of datasets using various heuristics, often resulting in large gains in downstream learning performance. However, a clear statistical understanding of when this is possible, and of which techniques are useful for it, is missing. A natural starting point for a better understanding is the formulation we consider, which asks to what extent datasets can be amplified in a perfect sense—where any verifier who knows the true distribution is unable to distinguish the amplified dataset from a set of i.i.d. draws.

A limitation of the sample amplification formulation described above is that the additive amplification factor m is rather small (e.g., O(nε/√d) for d-dimensional exponential families). Moreover, we show matching lower bounds demonstrating that this factor cannot be improved even when n is large enough to learn the distribution to non-trivial accuracy. However, it might be possible to achieve larger amplification factors with restricted verifiers, for instance, the class of verifiers corresponding to learning algorithms used for downstream tasks (see [AGSV20] for other possible classes of verifiers). Investigating the sample amplification problem with such restricted verifiers may be a practically fruitful future direction.

Despite this limitation, the sample amplification formulation does yield high-level insights that can inform the way datasets are amplified in practice. For instance, from the results in this paper, we know that sample amplification is possible for a broad class of distributions even when learning is not possible. Moreover, both our sufficiency- and learning-based approaches modify the original data points in general, consistent with the lower bound in [AGSV20] showing that optimal amplification may be impossible if the amplifier returns a superset of the input dataset. These observations show that the folklore recipe of enlarging a dataset by learning the data distribution and adding more samples from the learned distribution can be far from optimal.

Connections with other statistical notions. An equivalent view of Definition 1.1 is through Le Cam’s distance [LC72], a classical concept in statistics. The formal definition of Le Cam’s distance Δ(ℳ, 𝒩) is given in Definition 3.1; roughly speaking, it measures the fundamental difference in power between the statistical models ℳ and 𝒩, without resorting to specific estimation procedures. The sample amplification problem is equivalent to the study of Le Cam’s distance Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}) between product models, where (1) is precisely equivalent to Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}) ≤ ε. However, in the statistics literature, Le Cam’s distance was mainly used to study asymptotic equivalence, where a typical target is to show that lim_{n→∞} Δ(ℳ_n, 𝒩_n) = 0 for certain sequences of statistical models. For example, showing that localized regular statistical models converge to Gaussian location models is the fundamental idea behind Hájek–Le Cam asymptotic statistics; see [LC72, LC86, LCY90] and [VdV00, Chapter 9]. In nonparametric statistics, there is also a rich line of research [BL96, BCLZ02, BCLZ04, RSH19] establishing asymptotic (non-)equivalences, based on Le Cam’s distance, between density models, regression models, and Gaussian white noise models. In the above lines of work, only asymptotic results were typically obtained, with a fixed dimension and possibly slow convergence rates. In contrast, we aim to obtain a non-asymptotic characterization of Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}) in (n, m) and the dimension of the problem, a task which is largely underexplored in the literature.

Another related angle is reductions between statistical models. Over the past decade there has been growing interest in constructing polynomial-time reductions between various statistical models (typically from the planted clique problem) to prove statistical-computational gaps; see, e.g., [BR13, MW15, BB19, BB20]. Sample amplification falls into this reduction framework, and aims to perform reductions from a product model 𝒫^{⊗n} to another product model 𝒫^{⊗(n+m)}. While previous reduction techniques were mainly constructive and employed to prove computational lower bounds, in this paper we also develop general tools to prove limitations of all possible reductions purely from the statistical perspective.

Organization. The rest of this paper is organized as follows. Section 3 collects notation and preliminaries, and in particular introduces the concept of Le Cam’s distance. Section 4 introduces a sufficiency-based procedure for sample amplification, with asymptotic properties for general exponential families and non-asymptotic performance in several specific examples. Section 5 is devoted to a learning-based procedure for sample amplification, with a general relationship between sample amplification and the χ² estimation error, as well as its applications in several examples. Section 6 presents the general idea for establishing lower bounds for sample amplification, with a universal result specializing to product models. Section 7 discusses more examples of sample amplification and learning, and shows that these tasks are in general non-comparable. More concrete examples of both the upper and lower bounds, auxiliary lemmas, and proofs are relegated to the appendices.

3 Preliminaries

We use the following notation throughout this paper. For a random variable X, let ℒ(X) be the law (i.e., probability distribution) of X. For a probability distribution P on a probability space Ω and a measurable map T : Ω → Ω′, let P ∘ T^{−1} denote the pushforward probability measure, i.e., ℒ(T(X)) with ℒ(X) = P. For a probability measure P, let P^{⊗n} be the n-fold product measure. For a positive integer n, let [n] ≜ {1, …, n} and x^n ≜ (x_1, …, x_n). We adopt the following asymptotic notation: for two non-negative sequences (a_n) and (b_n), we use a_n = O(b_n) to denote limsup_{n→∞} a_n/b_n < ∞, a_n = Ω(b_n) to denote b_n = O(a_n), and a_n = Θ(b_n) to denote both a_n = O(b_n) and b_n = O(a_n). We also use the notations O_c, Ω_c, Θ_c for the respective meanings with hidden constants depending on c. For probability measures P, Q defined on the same probability space, the total variation (TV) distance, Hellinger distance, Kullback–Leibler (KL) divergence, and chi-squared divergence are defined as follows:

‖P − Q‖_TV = (1/2) ∫ |dP − dQ|,        H(P, Q) = ((1/2) ∫ (√dP − √dQ)²)^{1/2},
D_KL(P‖Q) = ∫ dP log(dP/dQ),          χ²(P‖Q) = ∫ (dP − dQ)² / dQ.

We will frequently use the following inequalities between the above quantities [Tsy09, Chapter 2]:

H²(P, Q) ≤ ‖P − Q‖_TV ≤ H(P, Q) √(2 − H²(P, Q)),   (2)
‖P − Q‖_TV ≤ √((1/2) D_KL(P‖Q)) ≤ √((1/2) log(1 + χ²(P‖Q))).   (3)
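As a quick numerical sanity check of these definitions, the chain of inequalities (2) and (3) can be verified for an arbitrary pair of discrete distributions; the particular p and q below are our own choices and have full support, so all divergences are finite:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

tv = 0.5 * np.abs(p - q).sum()                             # total variation
h = np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())  # Hellinger distance
kl = (p * np.log(p / q)).sum()                             # KL divergence
chi2 = ((p - q) ** 2 / q).sum()                            # chi-squared divergence

assert h**2 <= tv <= h * np.sqrt(2 - h**2)                       # inequality (2)
assert tv <= np.sqrt(0.5 * kl) <= np.sqrt(0.5 * np.log1p(chi2))  # inequality (3)
```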

Next we define several quantities related to Definition 1.1. For a given distribution class 𝒫 and sample sizes n and m, the minimax error of sample amplification is defined as

ε*(𝒫, n, m) ≜ inf_T sup_{P∈𝒫} ‖P^{⊗(n+m)} − P^{⊗n} ∘ T^{−1}‖_TV,   (4)

where the infimum is over all (possibly randomized) measurable mappings T : 𝒳^n → 𝒳^{n+m}. For a given error level ε, the maximum size of sample amplification is the largest m such that there exists an (n, n+m, ε) sample amplification, i.e.

m*(𝒫, n, ε) ≜ max{m ∈ ℕ : ε*(𝒫, n, m) ≤ ε}.   (5)

For ease of presentation, we often choose ε to be a small constant (say 0.1) and abbreviate the above quantity as m*(𝒫, n); we remark that all our results hold for a generic ε ∈ (0, 1). Finally, we also define the sample amplification complexity as the smallest n such that an amplification from n to n+1 samples is possible:

n*(𝒫) ≜ min{n ∈ ℕ : m*(𝒫, n) ≥ 1}.   (6)

Note that all the above notions are instance-wise in the distribution class 𝒫.

The minimax error of sample amplification (4) is precisely what is known as Le Cam’s distance in the statistics literature. We adopt the standard framework of statistical decision theory [Wal50]. A statistical model (or experiment) ℳ is a tuple (𝒳, (P_θ)_{θ∈Θ}) where an observation X ~ P_θ is drawn for some θ ∈ Θ. A decision rule δ is a regular conditional probability kernel from 𝒳 to the family of probability distributions on a general action space 𝒜, and there is a (measurable) loss function L : Θ × 𝒜 → ℝ₊. The risk function of a given decision rule δ is defined as

R_ℳ(θ, δ, L) ≜ E_θ[L(θ, δ(X))] = ∫_𝒳 ∫_𝒜 L(θ, a) δ(da | x) P_θ(dx).   (7)

Based on the definition of risk functions, we are ready to define a metric, known as Le Cam’s distance, between statistical models.

Definition 3.1 (Le Cam’s distance; see [LC72, LC86, LCY90]).

For two statistical models ℳ = (𝒳, (P_θ)_{θ∈Θ}) and 𝒩 = (𝒴, (Q_θ)_{θ∈Θ}), Le Cam’s distance Δ(ℳ, 𝒩) is defined as

Δ(ℳ, 𝒩) = max{ sup_L sup_{δ_𝒩} inf_{δ_ℳ} sup_{θ∈Θ} |R_ℳ(θ, δ_ℳ, L) − R_𝒩(θ, δ_𝒩, L)|,
                sup_L sup_{δ_ℳ} inf_{δ_𝒩} sup_{θ∈Θ} |R_ℳ(θ, δ_ℳ, L) − R_𝒩(θ, δ_𝒩, L)| }
           = max{ inf_{T₁} sup_θ ‖P_θ ∘ T₁^{−1} − Q_θ‖_TV,  inf_{T₂} sup_θ ‖Q_θ ∘ T₂^{−1} − P_θ‖_TV },

where the supremum is taken over all measurable loss functions L : Θ × 𝒜 → [0, 1].

In the language of model deficiency introduced in [LC64], Le Cam’s distance is the smallest ε > 0 such that the model ℳ is ε-deficient relative to the model 𝒩, and 𝒩 is also ε-deficient relative to ℳ. In the sample amplification problem, (P_θ)_{θ∈Θ} = {P^{⊗n} : P ∈ 𝒫} and (Q_θ)_{θ∈Θ} = {P^{⊗(n+m)} : P ∈ 𝒫}. Here, choosing T₂(x^{n+m}) = x^n in Definition 3.1 shows that 𝒩 is 0-deficient relative to ℳ, and the remaining quantity involving T₁ exactly reduces to the minimax error of sample amplification in (4). Therefore, studying the complexity of sample amplification is equivalent to characterizing the quantity Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}).

4 Sample amplification via sufficient statistics

The first idea we present for sample amplification is the classical idea of reduction by sufficiency. Albeit simple, the sufficiency-based idea reduces the problem of generating multiple random vectors to a simpler problem of generating only a few vectors, achieves the optimal complexity of sample amplification in many examples, and is easy to implement.

4.1 The general idea

We first recall the definition of sufficient statistics: in a statistical model ℳ = (𝒳, (P_θ)_{θ∈Θ}) with X ~ P_θ, a statistic T = T(X) ∈ 𝒯 is sufficient if and only if both θ–X–T and θ–T–X are Markov chains. A classical result in statistical decision theory is reduction by sufficiency, i.e., only the sufficient statistic needs to be maintained to perform statistical tasks, as P_{X|T,θ} does not depend on the unknown parameter θ. In terms of Le Cam’s distance, let ℳ ∘ T^{−1} = (𝒯, (P_θ ∘ T^{−1})_{θ∈Θ}) be the statistical experiment associated with T; then sufficiency of T implies that Δ(ℳ, ℳ ∘ T^{−1}) = 0. Hereafter, we will call ℳ ∘ T^{−1} the T-reduced model, or simply the reduced model.

Algorithm 1 Sample amplification via sufficiency
1: Input: samples X_1, …, X_n, a given transformation f between sufficient statistics
2: Compute the sufficient statistic T_n = T_n(X_1, …, X_n).
3: Apply f to the sufficient statistic and compute T̂_{n+m} = f(T_n).
4: Generate (X̂_1, …, X̂_{n+m}) ~ P_{X^{n+m} | T_{n+m}}(· | T̂_{n+m}).
5: Output: amplified samples (X̂_1, …, X̂_{n+m}).

Reduction by sufficiency can be applied to sample amplification in a simple way, with the general algorithm displayed in Algorithm 1. Suppose that both models 𝒫^{⊗n} and 𝒫^{⊗(n+m)} admit sufficient statistics T_n = T_n(X^n) and T_{n+m} = T_{n+m}(X^{n+m}), respectively. Algorithm 1 claims that it suffices to perform sample amplification on the reduced models 𝒫^{⊗n} ∘ T_n^{−1} and 𝒫^{⊗(n+m)} ∘ T_{n+m}^{−1}, i.e., to construct a randomization map f from T_n to T_{n+m}. Concretely, the algorithm decomposes into three steps:

  1. Step I: map X^n to T_n. This step is straightforward: we simply compute T_n = T_n(X_1, …, X_n).

  2. Step II: apply a randomization map in the reduced model. Upon choosing the map f, we simply compute T̂_{n+m} = f(T_n), with the target that the TV distance ‖ℒ(T̂_{n+m}) − ℒ(T_{n+m})‖_TV is uniformly small. The concrete choice of f depends on the specific model.

  3. Step III: map T_{n+m} to X^{n+m}. By sufficiency of T_{n+m}, the conditional distribution P_{X^{n+m} | T_{n+m}} does not depend on the unknown model. Therefore, after replacing the true statistic T_{n+m} by T̂_{n+m}, it is feasible to generate X̂^{n+m} ~ P_{X^{n+m} | T_{n+m}}(· | T̂_{n+m}). To simulate this random vector, it suffices to choose any distribution P₀ ∈ 𝒫 and generate X̂^{n+m} ~ (P₀^{⊗(n+m)} | T_{n+m}(X̂^{n+m}) = T̂_{n+m}). This step may suffer from computational issues, which will be discussed in Section 4.2.

The validity of this idea simply follows from

Δ(𝒫n,𝒫(n+m))=Δ(𝒫nTn1,𝒫(n+m)Tn+m1),\displaystyle\Delta({\mathcal{P}}^{\otimes n},{\mathcal{P}}^{\otimes(n+m)})=\Delta({\mathcal{P}}^{\otimes n}\circ T_{n}^{-1},{\mathcal{P}}^{\otimes(n+m)}\circ T_{n+m}^{-1}),

or equivalently, under each P𝒫P\in{\mathcal{P}},

(X^n+m)(Xn+m)TV\displaystyle\|{\mathcal{L}}(\widehat{X}^{n+m})-{\mathcal{L}}(X^{n+m})\|_{\text{TV}} =(T^n+m)×PXn+mTn+m(Tn+m)×PXn+mTn+mTV\displaystyle=\|{\mathcal{L}}(\widehat{T}_{n+m})\times P_{X^{n+m}\mid T_{n+m}}-{\mathcal{L}}(T_{n+m})\times P_{X^{n+m}\mid T_{n+m}}\|_{\text{TV}}
=(a)(T^n+m)(Tn+m)TV=(f(Tn))(Tn+m)TV,\displaystyle\overset{\rm(a)}{=}\|{\mathcal{L}}(\widehat{T}_{n+m})-{\mathcal{L}}(T_{n+m})\|_{\text{TV}}=\|{\mathcal{L}}(f(T_{n}))-{\mathcal{L}}(T_{n+m})\|_{\text{TV}},

where (a) is due to the identity PXPY|XQXPY|XTV=PXQXTV\|P_{X}P_{Y|X}-Q_{X}P_{Y|X}\|_{\text{TV}}=\|P_{X}-Q_{X}\|_{\text{TV}}. In other words, it suffices to work on reduced models and find the map ff between sufficient statistics.
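The identity used in step (a), that passing both marginals through the same channel preserves the TV distance, can be checked numerically in a small discrete example. The marginals and channel below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Marginals P_X, Q_X on a 3-point space and a shared channel W = P_{Y|X}.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])
W = np.array([[0.9, 0.1],      # row x gives the conditional P_{Y|X=x}
              [0.4, 0.6],
              [0.1, 0.9]])

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

joint_P = (P[:, None] * W).ravel()   # P_X x P_{Y|X}
joint_Q = (Q[:, None] * W).ravel()   # Q_X x P_{Y|X}
# Since sum_y W(y|x) = 1, the joint TV collapses to the marginal TV.
```

The collapse holds term by term: summing |P(x) − Q(x)| W(y|x) over y recovers |P(x) − Q(x)|.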

This idea of reduction by sufficiency simplifies the design of sample amplification procedures. Unlike in original models where XnX^{n} and Xn+mX^{n+m} typically take values in spaces of different dimensions, in reduced models the sufficient statistics TnT_{n} and Tn+mT_{n+m} are usually drawn from the same space. A simple example is as follows.

Example 4.1 (Gaussian location model with known covariance).

Consider the observations X1,,XnX_{1},\cdots,X_{n} from the Gaussian location model Pθ=𝒩(θ,Σ)P_{\theta}={\mathcal{N}}(\theta,\Sigma) with an unknown mean θd\theta\in\mathbb{R}^{d} and a known covariance Σd×d\Sigma\in\mathbb{R}^{d\times d}. To amplify to n+mn+m samples, note that the sample mean vector is a sufficient statistic here, with

Tn(X1,,Xn)=1ni=1nXi𝒩(θ,Σ/n).\displaystyle T_{n}(X_{1},\cdots,X_{n})=\frac{1}{n}\sum_{i=1}^{n}X_{i}\sim{\mathcal{N}}(\theta,\Sigma/n).

Now consider the identity map between sufficient statistics T^n+m=Tn\widehat{T}_{n+m}=T_{n} used with Algorithm 1. The amplified samples (X^1,,X^n+m)(\widehat{X}_{1},\cdots,\widehat{X}_{n+m}) are drawn from 𝒩(0,Σ){\mathcal{N}}(0,\Sigma) conditioned on the event that Tn+m(X^n+m)=T^n+m=Tn(Xn)T_{n+m}(\widehat{X}^{n+m})=\widehat{T}_{n+m}=T_{n}(X^{n}). For every mean vector θd\theta\in\mathbb{R}^{d} we can upper bound the amplification error of this approach:

(T^n+m)(Tn+m)TV\displaystyle\|{\mathcal{L}}(\widehat{T}_{n+m})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}} =(Tn)(Tn+m)TV\displaystyle=\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}
=𝒩(θ,Σ/n)𝒩(θ,Σ/(n+m))TV\displaystyle=\|{\mathcal{N}}(\theta,\Sigma/n)-{\mathcal{N}}(\theta,\Sigma/(n+m))\|_{\text{\rm TV}}
(a)12DKL(𝒩(θ,Σ/n)𝒩(θ,Σ/(n+m)))\displaystyle\overset{\rm(a)}{\leq}\sqrt{\frac{1}{2}D_{\text{\rm KL}}({\mathcal{N}}(\theta,\Sigma/n)\|{\mathcal{N}}(\theta,\Sigma/(n+m)))}
=d4(mnlog(1+mn))=O(mdn),\displaystyle=\sqrt{\frac{d}{4}\left(\frac{m}{n}-\log\left(1+\frac{m}{n}\right)\right)}=O\left(\frac{m\sqrt{d}}{n}\right),

where (a) is due to (3), and the last step holds whenever m=O(n)m=O(n). Therefore, we can generate Ω(n/d)\Omega(n/\sqrt{d}) additional samples from the nn observations, and the complexity of sample amplification in (6) is n=O(d)n^{\star}=O(\sqrt{d}). In contrast, learning this distribution to within a small TV distance requires n=Ω(d)n=\Omega(d) samples, so learning is strictly harder than sample amplification. This example recovers the upper bound of [AGSV20] with a much simpler analysis, and in later sections we will show that this approach is exactly minimax optimal.
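For concreteness, the final bound can be evaluated numerically. The sketch below simply implements the displayed expression; the particular values of n, m, d are hypothetical:

```python
import math

def tv_upper_bound(n, m, d):
    """Evaluate sqrt(d/4 * (m/n - log(1 + m/n))) from Example 4.1."""
    r = m / n
    return math.sqrt(d / 4.0 * (r - math.log1p(r)))

# With m on the order of n / sqrt(d), the bound stays bounded away from 1,
# and it shrinks further as m decreases.
err = tv_upper_bound(n=1000, m=100, d=100)   # here m = n / sqrt(d)
```

Since m/n − log(1 + m/n) ≈ (m/n)²/2 for small m/n, taking m = cn/√d makes the bound of constant order c/√8, matching the O(m√d/n) claim.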

We make two remarks for the above example. First, the amplified samples X^n+m\widehat{X}^{n+m} are no longer independent, either marginally or conditioned on XnX^{n}. Therefore, the above approach is fundamentally different from first estimating the distribution and then generating independent samples from the estimated distribution. Second, the amplified samples do not contain the original samples as a subset. In contrast, a tempting approach for sample amplification is to add mm fake samples to the original nn observations. However, [AGSV20] showed that any sample amplification approach containing the original samples cannot succeed if n=o(d/logd)n=o(d/\log d) in the above model, and our approach conforms to this result. More examples will be presented in Section A.1.

4.2 Computational issues

A natural computational question in Algorithm 1 is how to sample X^n+mPXn+mTn+m(T^n+m)\widehat{X}^{n+m}\sim P_{X^{n+m}\mid T_{n+m}}(\cdot\mid\widehat{T}_{n+m}) in a computationally efficient way. Under the additional mild assumption that the sufficient statistic TT is also complete (which is easy to find in exponential families), the conditional distribution PXTP_{X\mid T} can be sampled efficiently if we can find a statistic S=S(X)S=S(X) with the following two properties:

  1. 1.

SS is ancillary, i.e. the distribution (S){\mathcal{L}}(S) does not depend on the model parameter θ\theta;

  2. 2.

    There is a (measurable) bijection gg between (T,S)(T,S) and XX, i.e. X=g(T,S)X=g(T,S) almost surely.

In fact, if such an SS exists, then under any θΘ\theta\in\Theta,

(XT=t)=(a)(g(T,S)T=t)=(b)(g(t,S)),\displaystyle{\mathcal{L}}(X\mid T=t)\overset{\rm(a)}{=}{\mathcal{L}}(g(T,S)\mid T=t)\overset{\rm(b)}{=}{\mathcal{L}}(g(t,S)),

where (a) is due to the assumed bijection gg between (T,S)(T,S) and XX, and (b) is due to a classical result of Basu [Bas55, Bas58] that SS and TT are independent. Therefore, by the ancillarity of SS, we could sample XPθ0X\sim P_{\theta_{0}} with any θ0Θ\theta_{0}\in\Theta and compute the statistic SS from XX, then g(t,S)g(t,S) follows the desired conditional distribution PX|T=tP_{X|T=t}. An example of this procedure is illustrated below.

Example 4.2 (Computation in Gaussian location model).

Consider the setting of Example 4.1 where Pθ=𝒩(θ,Σ)(n+m),Tn+m=(n+m)1i=1n+mXiP_{\theta}={\mathcal{N}}(\theta,\Sigma)^{\otimes(n+m)},T_{n+m}=(n+m)^{-1}\sum_{i=1}^{n+m}X_{i}, and the target is to sample from the distribution PXn+m|Tn+mP_{X^{n+m}|T_{n+m}}. In this model, Tn+mT_{n+m} is complete and sufficient, and we choose S=S(Xn+m)=(S1,,Sn+m1)S=S(X^{n+m})=(S_{1},\cdots,S_{n+m-1}) with Si=Xi+1X1S_{i}=X_{i+1}-X_{1} for all ii. Clearly SS is ancillary, and Xn+mX^{n+m} could be recovered from (Tn+m,S)(T_{n+m},S) via

X1=Tn+mi=1n+m1Sin+m,Xi+1=X1+Si,i[n+m1].\displaystyle X_{1}=T_{n+m}-\frac{\sum_{i=1}^{n+m-1}S_{i}}{n+m},\qquad X_{i+1}=X_{1}+S_{i},\quad i\in[n+m-1].

Therefore, the choice of SS satisfies both conditions. Consequently, we can draw Zn+m𝒩(0,Σ)(n+m)Z^{n+m}\sim{\mathcal{N}}(0,\Sigma)^{\otimes(n+m)}, compute S=S(Zn+m)S=S(Z^{n+m}) (where Si=Zi+1Z1S_{i}=Z_{i+1}-Z_{1}), and recover Xn+mX^{n+m} from (Tn+m,S)(T_{n+m},S).
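Putting Examples 4.1 and 4.2 together, the whole amplifier reduces to recentering fresh Gaussian noise around the observed sample mean: reconstructing X^{n+m} from (T̂, S) with S_i = Z_{i+1} − Z_1 simplifies algebraically to X_i = Z_i − Z̄ + T̂. A minimal sketch (the input X below is hypothetical):

```python
import numpy as np

def amplify_gaussian(X, m, Sigma, rng):
    """Sufficiency-based amplification for N(theta, Sigma) (Examples 4.1-4.2)."""
    n, d = X.shape
    T_hat = X.mean(axis=0)          # identity map between sufficient statistics
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n + m)
    # Recover X^{n+m} from (T_hat, S): equivalent to recentering Z at T_hat.
    return Z - Z.mean(axis=0) + T_hat

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.ones(3), np.eye(3), size=50)   # 50 samples, d = 3
X_amp = amplify_gaussian(X, 10, np.eye(3), rng)               # amplify to 60
```

By construction the amplified samples carry exactly the same sufficient statistic (sample mean) as the original data, while their joint law matches the conditional distribution of Example 4.2.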

The proper choice of SS depends on the specific model and may require some effort to find; we refer to Section A.1 for more examples. We remark that in general there is no golden rule for finding SS. One tempting approach is to find a maximal ancillary statistic SS such that any other ancillary statistic SS^{\prime} must be a function of SS. This idea is motivated by the existence of the minimal sufficient statistic under mild conditions, together with a known computationally efficient approach to compute it. However, for ancillary statistics there is typically no such maximal statistic in the above sense, and there may exist uncountably many “maximal” ancillary statistics which are not equivalent to each other. From the measure-theoretic perspective, this is because the family of all ancillary sets is not closed under intersection and thus does not form a σ\sigma-algebra. In addition, even if a proper notion of a “maximal” or “essentially maximal” ancillary statistic could be defined, there is no guarantee that it would satisfy the bijection condition, and it is hard to determine whether a given ancillary statistic is maximal. We refer to [Bas59, LS92] for detailed discussions on ancillarity in mathematical statistics.

The conditional inference literature [Wil82] offers another procedure for sampling from PXn|TnP_{X^{n}|T_{n}}. Specifically, for each i[n]i\in[n], this approach sequentially generates the observation XiX_{i} from the one-dimensional distribution PXiXi1,TnP_{X_{i}\mid X^{i-1},T_{n}}, which is a simple task as long as its CDF is available. Although this works in simple models such as the Gaussian location model above, in more complicated models exact computation of the CDF is typically infeasible.
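As an illustration of this sequential scheme, consider the one-dimensional unit-variance Gaussian location model, where each conditional is an explicit Gaussian: given X^{i−1} and the mean constraint, the k = n − i + 1 remaining samples are i.i.d. N(θ, 1) conditioned on their residual sum s, so X_i ~ N(s/k, (k − 1)/k), free of θ. This is a sketch under those model assumptions:

```python
import numpy as np

def sequential_conditional_sample(T, n, rng):
    """Draw (X_1, ..., X_n) ~ N(theta, 1)^n conditioned on sample mean T."""
    x = np.zeros(n)
    remaining_sum = n * T
    for i in range(n):
        k = n - i                            # samples left to draw (incl. this one)
        # Conditional law of X_i given the residual sum constraint:
        x[i] = rng.normal(remaining_sum / k, np.sqrt((k - 1) / k))
        remaining_sum -= x[i]                # last draw (k = 1) is deterministic
    return x

rng = np.random.default_rng(0)
x = sequential_conditional_sample(2.5, 20, rng)
```

The output satisfies the mean constraint exactly, since the final conditional has zero variance.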

4.3 General theory for exponential families

In this section, we show that a general (n,n+Ω(nε/d),ε)(n,n+\Omega(n\varepsilon/\sqrt{d}),\varepsilon) sample amplification phenomenon holds for a rich class of exponential families, and is achieved by the sufficiency-based procedure in Algorithm 1. Specifically, we consider the following natural exponential family.

Definition 4.3 (Exponential family).

The exponential family (𝒳,(Pθ)θΘ)({\mathcal{X}},(P_{\theta})_{\theta\in\Theta}) of probability measures is determined by

dPθ(x)=exp(θT(x)A(θ))dμ(x),\displaystyle dP_{\theta}(x)=\exp(\theta^{\top}T(x)-A(\theta))d\mu(x),

where θΘ\theta\in\Theta is the natural parameter with Θ={θd:A(θ)<}\Theta=\{\theta\in\mathbb{R}^{d}:A(\theta)<\infty\}, T(x)T(x) is the sufficient statistic, A(θ)A(\theta) is the log-partition function, and μ\mu is the base measure.

The exponential family includes many widely used probability distributions such as the Gaussian, Gamma, Poisson, Exponential, and Beta distributions. In an exponential family, the statistic T(x)T(x) is sufficient and complete, and two well-known identities are 𝔼θ[T(X)]=A(θ)\mathbb{E}_{\theta}[T(X)]=\nabla A(\theta) and 𝖢𝗈𝗏θ[T(X)]=2A(θ)\mathsf{Cov}_{\theta}[T(X)]=\nabla^{2}A(\theta). We refer to [DY79] for a mathematical theory of exponential families.
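These identities are easy to check numerically. Below is a sketch for the one-dimensional exponential distribution written as an exponential family with natural parameter θ < 0, statistic T(x) = x, and log-partition A(θ) = −log(−θ); the specific θ is an arbitrary choice:

```python
import numpy as np

theta = -2.0                  # natural parameter; the rate is -theta
grad_A = -1.0 / theta         # A'(theta),  should equal E_theta[T(X)]
hess_A = 1.0 / theta ** 2     # A''(theta), should equal Var_theta[T(X)]

rng = np.random.default_rng(0)
x = rng.exponential(scale=-1.0 / theta, size=200_000)
# Monte Carlo check: sample mean ~ grad A, sample variance ~ hess A.
mean_err, var_err = abs(x.mean() - grad_A), abs(x.var() - hess_A)
```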

To establish a general theory of sample amplification for exponential families, we shall make the following assumptions on the exponential family.

Assumption 1 (Continuity).

The parameter set Θ\Theta has a non-empty interior. Under each θΘ\theta\in\Theta, the probability distribution (T(X)){\mathcal{L}}(T(X)) is absolutely continuous with respect to the Lebesgue measure.

Assumption 2 (Moment condition 𝖬k\mathsf{M}_{k}).

For a given integer k>0k>0, it holds that

supθΘ𝔼θ[(2A(θ))1/2(T(X)A(θ))2k]<.\displaystyle\sup_{\theta\in\Theta}\mathbb{E}_{\theta}\left[\left\|(\nabla^{2}A(\theta))^{-1/2}(T(X)-\nabla A(\theta))\right\|_{2}^{k}\right]<\infty.

We call it the moment condition 𝖬k\mathsf{M}_{k}.

Assumption 1 requires an exponential family of continuous distributions. The motivation is that for a continuous exponential family, the sufficient statistics Tn(X)T_{n}(X) and Tn+m(X)T_{n+m}(X) for different sample sizes take continuous values in the same space, so it is easier to construct a general transformation. We will propose a different sample amplification approach for discrete statistical models in Section 5. Assumption 2 is a moment condition on the normalized statistic (2A(θ))1/2(T(X)A(θ))(\nabla^{2}A(\theta))^{-1/2}(T(X)-\nabla A(\theta)), whose moments always exist since the moment generating function of T(X)T(X) exists around the origin. The moment condition 𝖬k\mathsf{M}_{k} states that the above normalized statistic has a uniformly bounded kk-th moment over all θΘ\theta\in\Theta, which holds in several examples (such as the Gaussian, exponential, and Pareto families) and can otherwise be ensured by restricting to a slightly smaller Θ0Θ\Theta_{0}\subseteq\Theta bounded away from the boundary. The following lemma presents a sufficient criterion for the moment condition 𝖬k\mathsf{M}_{k}.

Lemma 4.4.

If the log-partition function A(θ)A(\theta) satisfies

supθΘsupud\{0}|3A(θ)[u;u;u]|(2A(θ)[u;u])3/2M<,\displaystyle\sup_{\theta\in\Theta}\sup_{u\in\mathbb{R}^{d}\backslash\{0\}}\frac{|\nabla^{3}A(\theta)[u;u;u]|}{(\nabla^{2}A(\theta)[u;u])^{3/2}}\leq M<\infty,

then the exponential family satisfies the moment condition 𝖬k\mathsf{M}_{k} for all kk\in\mathbb{N}. Here for a kk-tensor TT and vectors u1,,uku_{1},\cdots,u_{k}, T[u1;;uk]T[u_{1};\cdots;u_{k}] denotes the value of T,u1uk\langle T,u_{1}\otimes\cdots\otimes u_{k}\rangle.

The condition in Lemma 4.4 is called the self-concordant condition, which is a key concept in the interior point method for convex optimization [Nes03]. For example, all quadratic functions and logarithmic functions are self-concordant (which correspond to Gaussian, exponential, and Pareto distributions), and the self-concordance is always fulfilled when Θ\Theta is compact.
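As a quick sanity check of the condition in Lemma 4.4, consider the one-dimensional exponential distribution with A(θ)=logθA(\theta)=-\log(-\theta) on Θ=(,0)\Theta=(-\infty,0). Then

A′′(θ)=1/θ2,A′′′(θ)=2/θ3,supθ<0|A′′′(θ)|(A′′(θ))3/2=supθ<02/|θ|31/|θ|3=2,\displaystyle A^{\prime\prime}(\theta)=1/\theta^{2},\qquad A^{\prime\prime\prime}(\theta)=-2/\theta^{3},\qquad\sup_{\theta<0}\frac{|A^{\prime\prime\prime}(\theta)|}{(A^{\prime\prime}(\theta))^{3/2}}=\sup_{\theta<0}\frac{2/|\theta|^{3}}{1/|\theta|^{3}}=2,

so the self-concordance condition holds with M=2M=2 uniformly over the entire natural parameter space.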

Given any exponential family 𝒫{\mathcal{P}} satisfying Assumptions 1 and 2, we will show that a simple procedure achieves a sample amplification of size Ω(n/d)\Omega(n/\sqrt{d}). Let X1,,XnX_{1},\cdots,X_{n} be i.i.d. samples drawn from PθP_{\theta} of the general form in Definition 4.3; then the sample average

Tn(Xn)1ni=1nT(Xi)\displaystyle T_{n}(X^{n})\triangleq\frac{1}{n}\sum_{i=1}^{n}T(X_{i})

is a sufficient statistic by the factorization theorem. We will apply the general Algorithm 1 with an identity map between sufficient statistics, i.e. T^n+m=Tn\widehat{T}_{n+m}=T_{n}. The next theorem shows the performance of this approach.

Theorem 4.5.

If the exponential family 𝒫{\mathcal{P}} satisfies Assumptions 1 and 2 with k=3k=3, then for θΘ\theta\in\Theta, it holds that

ε(𝒫,n,m)(Tn)(Tn+m)TVCn+mdn,\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}\leq\frac{C}{\sqrt{n}}+\frac{m\sqrt{d}}{n},

where C<C<\infty is an absolute constant depending only on dd and the moment upper bound in Assumption 2. In particular, for sufficiently large nn, a sample amplification of size Ω(n/d)\Omega(n/\sqrt{d}) is achievable.

Theorem 4.5 shows that the above simple procedure can achieve a sample amplification from nn to n+Ω(n/d)n+\Omega(n/\sqrt{d}) samples in general continuous exponential families, provided that nn is large enough. The main idea behind the proof of Theorem 4.5 is also simple. We show that the distribution of TnT_{n} is approximately Gn𝒩(A(θ),2A(θ)/n)G_{n}\sim{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n) by the CLT, apply the same CLT to Tn+mT_{n+m}, and then compute the TV distance between two Gaussians as in Example 4.1. Theorem 4.5 is then a direct consequence of the triangle inequality:

(Tn)(Tn+m)TV\displaystyle\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}
(Tn)(Gn)TV+(Gn)(Gn+m)TV+(Tn+m)(Gn+m)TV.\displaystyle\leq\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(G_{n})\|_{\text{\rm TV}}+\|{\mathcal{L}}(G_{n})-{\mathcal{L}}(G_{n+m})\|_{\text{\rm TV}}+\|{\mathcal{L}}(T_{n+m})-{\mathcal{L}}(G_{n+m})\|_{\text{\rm TV}}.

Note that Assumption 1 ensures a vanishing TV distance for the Gaussian approximation, and Assumption 2 enables us to apply Berry–Esseen type arguments and obtain an O(1/n)O(1/\sqrt{n}) convergence rate for the Gaussian approximation.

The main drawback of Theorem 4.5 is that there is a hidden constant CC depending on the dimension dd, so it does not imply that an (n,n+1,ε)(n,n+1,\varepsilon) sample amplification is possible whenever n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon). To tackle this issue, we need to improve the first term in Theorem 4.5 and find the best possible dependence of the constant CC on dd. We remark that this is a challenging task in probability theory: although the convergence rates of both TV [Pro52, SM62, BC14, BC16] and KL [Bar86, BCG14] in the CLT were studied, almost all of these works focused solely on the convergence rate in nn, leaving the tight dependence on dd open. Moreover, direct computation of the quantity (Tn)(Gn)TV\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(G_{n})\|_{\text{\rm TV}} shows that even if the random vector TnT_{n} has independent components, this quantity is typically at least Ω(d/n)\Omega(\sqrt{d/n}). Therefore, C=Ω(d)C=\Omega(\sqrt{d}) under this proof technique, and a vanishing first term in Theorem 4.5 requires n=Ω(d)n=\Omega(d), which is already larger than the anticipated sample amplification complexity n=O(d)n=O(\sqrt{d}).

To overcome the above difficulties, we make the following changes to both the assumption and the analysis. First, to avoid the unknown dependence on dd, we additionally assume a product exponential family, i.e. Pθ(dx)=i=1dpθi(dxi)P_{\theta}(dx)=\prod_{i=1}^{d}p_{\theta_{i}}(dx_{i}), where each pθi(xi)p_{\theta_{i}}(x_{i}) is a one-dimensional exponential family. Exploiting the product structure enables us to find a constant CC depending linearly on dd. Second, we improve the O(1/n)O(1/\sqrt{n}) dependence on nn by applying a higher-order CLT result to TnT_{n} and Tn+mT_{n+m}, known as the Edgeworth expansion [BR10]. For any k2k\geq 2 and nn\in\mathbb{N}, the signed measure of the Edgeworth expansion on d\mathbb{R}^{d} is

Γn,k(dx)=γ(x)(1+=1k/3𝒦(x)n/2)dx,\displaystyle\Gamma_{n,k}(dx)=\gamma(x)\left(1+\sum_{\ell=1}^{\lfloor k/3\rfloor}\frac{{\mathcal{K}}_{\ell}(x)}{n^{\ell/2}}\right)dx, (8)

where γ(x)\gamma(x) is the density of a standard normal random variable on d\mathbb{R}^{d}, and 𝒦(x){\mathcal{K}}_{\ell}(x) is a polynomial of degree 33\ell which depends only on the first 33\ell moments of the distribution. We note that unlike the CLT, the general Edgeworth expansion is a signed measure with possibly negative densities; however, it is close to Gaussian with an O(n1/2)O(n^{-1/2}) approximation error. Such a higher-order expansion enables us to obtain better Berry–Esseen type bounds, but upper bounding Γn,kΓn+m,kTV\|\Gamma_{n,k}-\Gamma_{n+m,k}\|_{\text{\rm TV}} becomes more complicated and requires handling the Gaussian part and the correction part separately; see Appendix B.2 for details. In particular, we can improve the error dependence on nn from O(1/n)O(1/\sqrt{n}) to O(1/n2)O(1/n^{2}).

Formally, the next theorem shows a better sample amplification performance for product exponential families.

Theorem 4.6.

Let (𝒳,𝒫=(Pθ)θΘ)({\mathcal{X}},{\mathcal{P}}=(P_{\theta})_{\theta\in\Theta}) be a product exponential family, where each one-dimensional component satisfies Assumptions 1 and 2 with k=10k=10. Then for θΘ\theta\in\Theta, it holds that

ε(𝒫,n,m)(Tn)(Tn+m)TVC(dn2+mdn),\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}\leq C\left(\frac{d}{n^{2}}+\frac{m\sqrt{d}}{n}\right),

where C<C<\infty is an absolute constant independent of (n,d)(n,d). In particular, as long as n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon), an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification of size m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}) is achievable.

Theorem 4.6 shows that for product exponential families, we not only achieve the amplification size m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}), but also attain a sample complexity n=O(d/ε)n=O(\sqrt{d}/\varepsilon) for sample amplification. This additional result on sample complexity is important in the sense that even when distribution learning is impossible, sample amplification may still be possible. Although the independence and even the exponential family assumptions may be restrictive in practice, in Section A.1 we show that both phenomena m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}) and n=O(d/ε)n=O(\sqrt{d}/\varepsilon) hold in many natural models.

5 Sample amplification via learning

The sufficiency-based approach to sample amplification is not always desirable. First, models outside the exponential family typically do not admit non-trivial sufficient statistics, so the reduction by sufficiency may not be very helpful. Second, the identity map applied to the sufficient statistics only works for continuous models, and incurs too large a TV distance when the underlying model is discrete. Third, the previous approaches are not directly related to learning the model, so a general relationship between learning and sample amplification is largely missing. In this section, we propose another sample amplification approach, and show how a good learner yields a good sample amplifier.

5.1 The general result

For a class of distributions 𝒫{\mathcal{P}} and nn i.i.d. observations drawn from an unknown P𝒫P\in{\mathcal{P}}, we define the following notion of the χ2\chi^{2}-estimation error.

Definition 5.1 (χ2\chi^{2}-estimation error).

For a class of distributions 𝒫{\mathcal{P}} and sample size nn, the χ2\chi^{2}-estimation error rχ2(𝒫,n)r_{\chi^{2}}({\mathcal{P}},n) is defined to be the minimax estimation error under the expected χ2\chi^{2}-divergence:

rχ2(𝒫,n)infP^nsupP𝒫𝔼P[χ2(P^n,P)],\displaystyle r_{\chi^{2}}({\mathcal{P}},n)\triangleq\inf_{\widehat{P}_{n}}\sup_{P\in{\mathcal{P}}}\mathbb{E}_{P}[\chi^{2}(\widehat{P}_{n},P)],

where the infimum is taken over all possible distribution estimators P^n\widehat{P}_{n} based on nn samples.

Roughly speaking, the χ2\chi^{2}-estimation error in the above definition characterizes the complexity of the distribution class 𝒫{\mathcal{P}} in terms of distribution learning under the χ2\chi^{2}-divergence. The main result of this section shows that the error of sample amplification in (4) can be upper bounded using the χ2\chi^{2}-estimation error.

Theorem 5.2.

For general 𝒫{\mathcal{P}} and n,m0n,m\geq 0, it holds that

ε(𝒫,n,m)m2nrχ2(𝒫,n/2).\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\sqrt{\frac{m^{2}}{n}\cdot r_{\chi^{2}}({\mathcal{P}},n/2)}.

The following corollary is immediate from Theorem 5.2.

Corollary 5.3.

An (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is possible if m=O(εn/rχ2(𝒫,n/2))m=O(\varepsilon\sqrt{n/r_{\chi^{2}}({\mathcal{P}},n/2)}). Moreover, the sample complexity of amplification in (6) satisfies

n(𝒫)=O(min{n:rχ2(𝒫,n/2)n}).\displaystyle n^{\star}({\mathcal{P}})=O\left(\min\left\{n\in\mathbb{N}:r_{\chi^{2}}({\mathcal{P}},n/2)\leq n\right\}\right).
Remark 5.4.

Although the error of sample amplification in (4) is measured under the TV distance, the same result holds for the square root of the KL divergence (which by (3) is no smaller than the TV distance).

The above result provides a quantitative guarantee that the sample amplification is easier than learning (under the χ2\chi^{2}-divergence). Specifically, the sample complexity of learning is the smallest nn\in\mathbb{N} such that rχ2(𝒫,n)=O(1)r_{\chi^{2}}({\mathcal{P}},n)=O(1), while Corollary 5.3 shows that the complexity for amplification is at most the smallest nn\in\mathbb{N} such that rχ2(𝒫,n/2)=O(n)r_{\chi^{2}}({\mathcal{P}},n/2)=O(n). As rχ2(𝒫,n)r_{\chi^{2}}({\mathcal{P}},n) is non-increasing in nn, this means that the learning complexity is in general larger.

When the distribution class 𝒫{\mathcal{P}} has a product structure 𝒫=j=1d𝒫j{\mathcal{P}}=\prod_{j=1}^{d}{\mathcal{P}}_{j}, the next theorem shows a better relationship between the amplification error and the learning error.

Theorem 5.5.

For 𝒫=j=1d𝒫j{\mathcal{P}}=\prod_{j=1}^{d}{\mathcal{P}}_{j} and n,m0n,m\geq 0, it holds that

ε(𝒫,n,m)m2nj=1drχ2(𝒫j,n/2).\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\sqrt{\frac{m^{2}}{n}\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)}.
Corollary 5.6.

For product models, an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is achievable if

m=O(εnj=1drχ2(𝒫j,n/2)).\displaystyle m=O\left(\varepsilon\sqrt{\frac{n}{\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)}}\right).

Moreover, the sample complexity of amplification in (6) satisfies

n(𝒫)=O(min{n:j=1drχ2(𝒫j,n/2)n}).\displaystyle n^{\star}({\mathcal{P}})=O\left(\min\left\{n\in\mathbb{N}:\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)\leq n\right\}\right).

We observe that the result of Theorem 5.5 typically improves over Theorem 5.2 for product models. In fact, since

χ2(j=1dPj,j=1dQj)=j=1d(1+χ2(Pj,Qj))1j=1dχ2(Pj,Qj),\displaystyle\chi^{2}\left(\prod_{j=1}^{d}P_{j},\prod_{j=1}^{d}Q_{j}\right)=\prod_{j=1}^{d}(1+\chi^{2}(P_{j},Q_{j}))-1\geq\sum_{j=1}^{d}\chi^{2}(P_{j},Q_{j}),

the inequality j=1drχ2(𝒫j,n/2)rχ2(𝒫,n/2)\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)\leq r_{\chi^{2}}({\mathcal{P}},n/2) typically holds. Moreover, there are scenarios where j=1drχ2(𝒫j,n/2)rχ2(𝒫,n/2)\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)\ll r_{\chi^{2}}({\mathcal{P}},n/2), so Theorem 5.5 provides a substantial improvement over Theorem 5.2. For example, when 𝒫=(𝒩(θ,Id))θd{\mathcal{P}}=({\mathcal{N}}(\theta,I_{d}))_{\theta\in\mathbb{R}^{d}}, it can be verified that rχ2(𝒫j,n/2)=O(1/n)r_{\chi^{2}}({\mathcal{P}}_{j},n/2)=O(1/n) for each j[d]j\in[d], while rχ2(𝒫,n/2)=exp(Ω(d/n))1r_{\chi^{2}}({\mathcal{P}},n/2)=\exp(\Omega(d/n))-1. Hence, in the important regime dnd\sqrt{d}\ll n\ll d where learning is impossible but sample amplification is possible, Theorem 5.5 is strictly stronger than Theorem 5.2.
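The tensorization identity and the resulting inequality are elementary to verify numerically; the per-coordinate divergence values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.uniform(0.0, 0.5, size=20)        # c_j = chi^2(P_j, Q_j) >= 0 per coordinate
# chi^2 of the product distributions: prod(1 + c_j) - 1, which dominates sum(c_j).
product_form = np.prod(1.0 + c) - 1.0
gap = product_form - c.sum()
```

The gap grows multiplicatively in the dimension, which is why the per-coordinate bound of Theorem 5.5 can be much smaller than the joint bound of Theorem 5.2.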

Remark 5.7.

In the above Gaussian location model, there is an alternative way to conclude that Theorem 5.5 is strictly stronger than Theorem 5.2. We will see that the shuffling approach achieving the bound in Theorem 5.2 keeps all the observed samples, whereas [AGSV20] shows that all such approaches must incur a sample complexity n=Ω(d/logd)n=\Omega(d/\log d) for the Gaussian model. In contrast, Theorem 5.5 and Corollary 5.6 give a sample complexity n=O(d)n=O(\sqrt{d}) of amplification in the Gaussian location model.

5.2 The shuffling approach

This section presents the sample amplification approaches achieving the bounds in Theorems 5.2 and 5.5. The idea is simple: we find a good distribution learner P^n\widehat{P}_{n} which attains the rate-optimal χ2\chi^{2}-estimation error, draw mm additional samples Y1,,YmY_{1},\cdots,Y_{m} from P^n\widehat{P}_{n}, and shuffle them with the original samples X1,,XnX_{1},\cdots,X_{n} uniformly at random. This approach suffices to achieve the sample amplification error in Theorem 5.2, while for Theorem 5.5 an additional trick is applied: instead of shuffling the whole vectors, we shuffle each coordinate independently. For technical reasons, in both approaches we apply sample splitting: the first n/2n/2 samples are used for the estimation of P^n\widehat{P}_{n}, while the second n/2n/2 samples are used for shuffling. The algorithms are summarized in Algorithms 2 and 3.

Algorithm 2 Sample amplification via shuffling: general model
1:Input: samples X1,,XnX_{1},\cdots,X_{n}, a given class of distributions 𝒫{\mathcal{P}}.
2:Based on samples X1,,Xn/2X_{1},\cdots,X_{n/2}, find an estimator P^n\widehat{P}_{n} such that
supP𝒫𝔼P[χ2(P^n,P)]Crχ2(𝒫,n/2).\displaystyle\sup_{P\in{\mathcal{P}}}\mathbb{E}_{P}[\chi^{2}(\widehat{P}_{n},P)]\leq C\cdot r_{\chi^{2}}({\mathcal{P}},n/2).
3:Draw mm additional samples Y1,,YmY_{1},\cdots,Y_{m} from P^n\widehat{P}_{n}.
4:Uniformly at random, shuffle the pool of Xn/2+1,,Xn,Y1,,YmX_{n/2+1},\cdots,X_{n},Y_{1},\cdots,Y_{m} to obtain (Z1,,Zn/2+m)(Z_{1},\cdots,Z_{n/2+m}).
5:Output: amplified samples (X1,,Xn/2,Z1,,Zn/2+m)(X_{1},\cdots,X_{n/2},Z_{1},\cdots,Z_{n/2+m}).
Algorithm 3 Sample amplification via shuffling: product model
1:Input: samples X1,,XnX_{1},\cdots,X_{n}, a given class of product distributions 𝒫=j=1d𝒫j{\mathcal{P}}=\prod_{j=1}^{d}{\mathcal{P}}_{j}
2:for j=1,2,,dj=1,2,\cdots,d do
3:     Based on samples X1,j,,Xn/2,jX_{1,j},\cdots,X_{n/2,j}, find an estimator P^n,j\widehat{P}_{n,j} such that
supPj𝒫j𝔼Pj[χ2(P^n,j,Pj)]Crχ2(𝒫j,n/2).\displaystyle\sup_{P_{j}\in{\mathcal{P}}_{j}}\mathbb{E}_{P_{j}}[\chi^{2}(\widehat{P}_{n,j},P_{j})]\leq C\cdot r_{\chi^{2}}({\mathcal{P}}_{j},n/2).
4:     Draw mm additional samples Y1,j,,Ym,jY_{1,j},\cdots,Y_{m,j} from P^n,j\widehat{P}_{n,j}.
5:     Uniformly at random, shuffle Xn/2+1,j,,Xn,j,Y1,j,,Ym,jX_{n/2+1,j},\cdots,X_{n,j},Y_{1,j},\cdots,Y_{m,j} to obtain (Z1,j,,Zn/2+m,j)(Z_{1,j},\cdots,Z_{n/2+m,j}).
6:end for
7:For each i[n/2+m]i\in[n/2+m], form the vector Zi=(Zi,1,,Zi,d)Z_{i}=(Z_{i,1},\cdots,Z_{i,d}).
8:Output: amplified samples (X1,,Xn/2,Z1,,Zn/2+m)(X_{1},\cdots,X_{n/2},Z_{1},\cdots,Z_{n/2+m}).
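The coordinate-wise shuffling of Algorithm 3 can be sketched as follows for the Gaussian product model N(θ, I_d), where the sample mean of the first half plays the role of the learner P̂ (its χ² risk is O(1/n) per coordinate in this model); the input data are hypothetical:

```python
import numpy as np

def amplify_by_shuffling(X, m, rng):
    """Sketch of Algorithm 3 for the product Gaussian model N(theta, I_d)."""
    n, d = X.shape
    half = n // 2
    theta_hat = X[:half].mean(axis=0)              # learn P_hat on the first half
    Y = theta_hat + rng.standard_normal((m, d))    # m fake samples from P_hat
    pool = np.vstack([X[half:], Y])                # second half + fake samples
    for j in range(d):                             # shuffle each coordinate
        pool[:, j] = rng.permutation(pool[:, j])   # independently
    return np.vstack([X[:half], pool])             # output n + m samples

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) + 1.0            # hypothetical theta = (1, ..., 1)
out = amplify_by_shuffling(X, 10, rng)
```

Note that the first n/2 samples are passed through unchanged, exactly as in step 8 of the algorithm.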

The following lemma is the key to analyzing the performance of the shuffling approach.

Lemma 5.8.

Let X1,,XnX_{1},\cdots,X_{n} be i.i.d. drawn from PP, and Y1,,YmY_{1},\cdots,Y_{m} be i.i.d. drawn from QQ independent of (X1,,Xn)(X_{1},\cdots,X_{n}). Let (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}) be a uniformly random permutation of (X1,,Xn,Y1,,Ym)(X_{1},\cdots,X_{n},Y_{1},\cdots,Y_{m}), and PmixP_{\text{\rm mix}} be the distribution of the random mixture (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}). Then

χ2(Pmix,P(n+m))(1+mn+mχ2(Q,P))m1.\displaystyle\chi^{2}\left(P_{\text{\rm mix}},P^{\otimes(n+m)}\right)\leq\left(1+\frac{m}{n+m}\chi^{2}(Q,P)\right)^{m}-1.

Based on Lemma 5.8, the advantage of random shuffling is clear: if we simply append Y1,,YmY_{1},\cdots,Y_{m} to the end of the original sequence X1,,XnX_{1},\cdots,X_{n}, then the χ2\chi^{2}-divergence is exactly (1+χ2(Q,P))m1(1+\chi^{2}(Q,P))^{m}-1. Comparing with the upper bound in Lemma 5.8, we observe that after a random shuffle a smaller coefficient m/(n+m)m/(n+m) multiplies the individual χ2\chi^{2}-divergence. The proofs of Theorems 5.2 and 5.5 are then immediate: we simply take Q=P^nQ=\widehat{P}_{n} and apply the above lemma. Note that the statement of Lemma 5.8 requires that Y1,,YmY_{1},\cdots,Y_{m} be independent of X1,,XnX_{1},\cdots,X_{n}, which is precisely why we apply sample splitting in Algorithms 2 and 3. The proof of Lemma 5.8 is presented in Appendix C, and the complete proofs of Theorems 5.2 and 5.5 are relegated to Appendix B. We also include concrete examples of the shuffling approach in Section A.2.
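A quick numerical comparison makes the advantage concrete; the values of n, m, and the per-sample divergence χ²(Q, P) below are hypothetical:

```python
n, m, chi2 = 1000, 30, 0.05                     # hypothetical parameters

# Lemma 5.8 bound for the shuffled pool vs. naive appending of the fakes.
shuffled = (1 + m / (n + m) * chi2) ** m - 1
appended = (1 + chi2) ** m - 1
```

With these numbers the shuffled bound stays small while the appended one is of constant order, reflecting the m/(n + m) discount inside the exponentiation.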

6 Minimax lower bounds

In this section we establish minimax lower bounds for sample amplification in different statistical models. Section 6.1 presents a general and tight approach for establishing the lower bound, which leads to an exact sample amplification result for the Gaussian location model. Based on this result, we show that for dd-dimensional continuous exponential families, the sample amplification size cannot be of order ω(nε/d)\omega(n\varepsilon/\sqrt{d}) for sufficiently large sample size nn. Section 6.2 provides a specialized criterion for product models, where we show that n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon) and m=O(nε/d)m=O(n\varepsilon/\sqrt{d}) are always valid lower bounds, with hidden constants independent of all parameters. Section A.3 lists several concrete examples where our general idea can be applied to provide tight and non-asymptotic results.

6.1 General idea

The main tool for establishing the lower bound is the first equality in Definition 3.1 of Le Cam’s distance. Specifically, for a class of distributions 𝒫=(Pθ)θΘ{\mathcal{P}}=(P_{\theta})_{\theta\in\Theta}, let μ\mu be a given prior distribution on Θ\Theta, and L:Θ×𝒜[0,1]L:\Theta\times{\mathcal{A}}\to[0,1] be a given non-negative loss function upper bounded by one. Given nn i.i.d. samples from an unknown distribution in 𝒫{\mathcal{P}}, define the following Bayes risk and minimax risk:

rB(𝒫,n,L,μ)\displaystyle r_{\text{\rm B}}({\mathcal{P}},n,L,\mu) =infθ^Θ𝔼θ[L(θ,θ^(Xn))]μ(dθ),\displaystyle=\inf_{\widehat{\theta}}\int_{\Theta}\mathbb{E}_{\theta}[L(\theta,\widehat{\theta}(X^{n}))]\mu(d\theta),
r(𝒫,n,L)\displaystyle r({\mathcal{P}},n,L) =infθ^supθΘ𝔼θ[L(θ,θ^(Xn))],\displaystyle=\inf_{\widehat{\theta}}\sup_{\theta\in\Theta}\mathbb{E}_{\theta}[L(\theta,\widehat{\theta}(X^{n}))],

where the infimum is over all possible estimators θ^()\widehat{\theta}(\cdot) taking value in 𝒜{\mathcal{A}}. The following result is a direct consequence of Definition 3.1.

Lemma 6.1.

For any integer n,m>0n,m>0, any class of distributions 𝒫=(Pθ)θΘ{\mathcal{P}}=(P_{\theta})_{\theta\in\Theta}, any prior μ\mu on Θ\Theta, and any loss function L:Θ×𝒜[0,1]L:\Theta\times{\mathcal{A}}\to[0,1], the minimax error of sample amplification ε(𝒫,n,m)\varepsilon^{\star}({\mathcal{P}},n,m) in (4) satisfies that

ε(𝒫,n,m)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m) rB(𝒫,n,L,μ)rB(𝒫,n+m,L,μ),\displaystyle\geq r_{\text{\rm B}}({\mathcal{P}},n,L,\mu)-r_{\text{\rm B}}({\mathcal{P}},n+m,L,\mu),
ε(𝒫,n,m)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m) r(𝒫,n,L)r(𝒫,n+m,L).\displaystyle\geq r({\mathcal{P}},n,L)-r({\mathcal{P}},n+m,L).

Based on Lemma 6.1, it suffices to find an appropriate prior distribution μ\mu and a loss function LL, and then compute (or lower bound) the difference between the Bayes or minimax risks with different sample sizes. We note that the lower bound technique in [AGSV20], albeit seemingly different, is a special case of Lemma 6.1. Specifically, the authors of [AGSV20] designed a set-valued mapping An:θ𝒫(𝒳n)A_{n}:\theta\to{\mathcal{P}}({\mathcal{X}}^{n}) for each nn\in\mathbb{N} such that θ(Xn+mAn+m(θ))0.99\mathbb{P}_{\theta}(X^{n+m}\in A_{n+m}(\theta))\geq 0.99 for all θΘ\theta\in\Theta, while there is a prior distribution μ\mu on Θ\Theta such that

𝔼Xn[supxn𝒳nθXn(xnAn(θ))]0.5.\displaystyle\mathbb{E}_{X^{n}}\left[\sup_{x^{n}\in{\mathcal{X}}^{n}}\mathbb{P}_{\theta\mid X^{n}}(x^{n}\in A_{n}(\theta))\right]\leq 0.5. (9)

If the above conditions hold, then an (n,n+m)(n,n+m) sample amplification is impossible. Note that the probability term in (9) is the maximum coverage probability of the sets An(θ)A_{n}(\theta) where θ\theta follows the posterior distribution θXn\mathbb{P}_{\theta\mid X^{n}}, which is a well-defined geometric object when both An(θ)A_{n}(\theta) and the posterior are known. To see that the above approach falls into our framework, consider the loss function L:Θ×n1𝒳n[0,1]L:\Theta\times\cup_{n\geq 1}{\mathcal{X}}^{n}\to[0,1] with L(θ,Xn)=𝟙(XnAn(θ))L(\theta,X^{n})=\mathbbm{1}(X^{n}\notin A_{n}(\theta)). Then the first condition ensures that rB(𝒫,n+m,L,ν)0.01r_{\text{B}}({\mathcal{P}},n+m,L,\nu)\leq 0.01 for each prior ν\nu, and the second condition (9) exactly states that rB(𝒫,n,L,μ)0.5r_{\text{B}}({\mathcal{P}},n,L,\mu)\geq 0.5 for the chosen prior μ\mu.
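To make Lemma 6.1 concrete, consider a uniform two-point prior and the 0–1 test loss: the Bayes risk with k samples is then (1 − TV_k)/2, so the first inequality of the lemma yields ε*(𝒫, n, m) ≥ (TV_{n+m} − TV_n)/2. The following minimal numerical sketch instantiates this for a Bernoulli pair; the points and constants are illustrative choices, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import binom

def tv_binom(k, p, q):
    # TV between Bern(p)^{(x)k} and Bern(q)^{(x)k}; by sufficiency of the
    # count, this equals the TV between Binomial(k, p) and Binomial(k, q)
    j = np.arange(k + 1)
    return 0.5 * np.abs(binom.pmf(j, k, p) - binom.pmf(j, k, q)).sum()

# uniform two-point prior {p, q} with the 0-1 test loss: the Bayes risk
# with k samples is (1 - TV_k)/2, so Lemma 6.1 gives
#   eps*(P, n, m) >= (TV_{n+m} - TV_n)/2
n, m, p, q = 100, 50, 0.5, 0.55
lower_bound = 0.5 * (tv_binom(n + m, p, q) - tv_binom(n, p, q))
```

Since the TV sequence is nondecreasing in the sample size, the resulting bound is always nonnegative, and it is strictly positive whenever the extra m samples genuinely improve the test.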

A first application of Lemma 6.1 is an exact lower bound in Gaussian location models.

Theorem 6.2.

For the Gaussian location model 𝒫={𝒩(θ,Σ)}θd{\mathcal{P}}=\{{\mathcal{N}}(\theta,\Sigma)\}_{\theta\in\mathbb{R}^{d}} with a fixed covariance Σd×d\Sigma\in\mathbb{R}^{d\times d}, the minimax error of sample amplification in (4) is exactly

ε(𝒫,n,m)=𝒩(0,Idn)𝒩(0,Idn+m)TV.\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{\rm TV}}.

In particular, the sufficiency-based approach in Example 4.1 is exactly minimax optimal.
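The TV distance in Theorem 6.2 can be evaluated in closed form: the two densities cross on the sphere ∥x∥² = r* with r* = (d/m)·log(1 + m/n), so the TV distance reduces to a difference of chi-square CDFs. A short numerical sketch (assuming scipy):

```python
import numpy as np
from scipy.stats import chi2

def tv_theorem_6_2(n, m, d):
    # N(0, I_d/n) vs N(0, I_d/(n+m)): the densities cross on the sphere
    # ||x||^2 = r_star, so the TV distance is a difference of chi^2 CDFs
    r_star = (d / m) * np.log1p(m / n)
    return chi2.cdf((n + m) * r_star, d) - chi2.cdf(n * r_star, d)
```

Evaluating this expression confirms numerically that the minimax error stays bounded away from one precisely when m = O(nε/√d), matching the sufficiency-based upper bound of Example 4.1.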

Theorem 6.2 shows that an exact error characterization for the Gaussian location model is possible through the general lower bound approach in Lemma 6.1. This result is also asymptotically useful for a rich family of models: by the CLT, the sufficient statistic in a continuous exponential family follows a Gaussian distribution asymptotically, with a vanishing TV distance. This idea was used in Section 4.3 to establish the O(nε/√d) upper bound, and the same observation leads to an Ω(nε/√d) lower bound as well, under slightly different assumptions. Specifically, we drop Assumption 2 while introducing an additional assumption.

Assumption 3 (Linear independence).

The components of the sufficient statistic T(x) are linearly independent, i.e. aᵀT(x) = 0 for μ-almost all x ∈ 𝒳 implies a = 0.

Assumption 3 ensures that the true dimension of the exponential family is indeed d. Whenever Assumption 3 fails, we can pass to a minimal exponential family of lower dimension that fulfills it. Note that when Assumptions 1 and 3 hold, the mean mapping θ ↦ ∇A(θ) is a diffeomorphism between Θ and its image ∇A(Θ); see, e.g. [ML08, Theorem 1.22]. Therefore, ∇A(·) is an open map, and the set {∇A(θ): θ ∈ Θ} contains a d-dimensional ball. This fact enables us to obtain a d-dimensional Gaussian location model after we apply the CLT.

The following theorem characterizes an asymptotic lower bound for every exponential family satisfying Assumptions 1 and 3.

Theorem 6.3.

Given a dd-dimensional exponential family 𝒫{\mathcal{P}} satisfying Assumptions 1 and 3, for every n,mn,m\in\mathbb{N}, the minimax error of sample amplification satisfies

ε(𝒫,n,m)c(mdn1)C(lognn)13,\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq c\cdot\left(\frac{m\sqrt{d}}{n}\wedge 1\right)-C\cdot\left(\frac{\log n}{n}\right)^{\frac{1}{3}},

where c>0c>0 is an absolute constant independent of (n,m,d,𝒫)(n,m,d,{\mathcal{P}}), and constant C>0C>0 depends only on the exponential family (and thus on dd).

Theorem 6.3 shows that there exists some n₀ > 0 depending only on the exponential family, such that sample amplification from n to n + ω(nε/√d) samples is impossible for all n > n₀. However, as with the upper bound in Theorem 4.5, this asymptotic result does not imply that sample amplification is impossible when n = o(√d/ε). Nevertheless, in the following sections we show that the sample complexity lower bound n = Ω(√d/ε) does hold for product families, as well as in several other concrete examples.

6.2 Product models

Although Lemma 6.1 presents a general lower bound argument, computing exact Bayes or minimax risks can be very challenging, and the usual rate-optimal analysis (i.e. bounding the risks within a multiplicative constant) does not lead to meaningful results here. In addition, choosing the right prior and loss is a difficult task which may change from instance to instance. It is therefore helpful to develop specialized versions of Lemma 6.1 that are easier to work with. Perhaps surprisingly, such a simple version exists for product models, as summarized in the following theorem.

Theorem 6.4.

Let ε(0,1)\varepsilon\in(0,1) and Pθ=j=1dpθjP_{\theta}=\prod_{j=1}^{d}p_{\theta_{j}} be a product model with (θ1,,θd)j=1dΘj(\theta_{1},\cdots,\theta_{d})\in\prod_{j=1}^{d}\Theta_{j}. Suppose for each j[d]j\in[d], there exist two points θj,+,θj,Θj\theta_{j,+},\theta_{j,-}\in\Theta_{j} such that

pθj,+npθj,nTV\displaystyle\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{\rm TV}} αjεd,\displaystyle\leq\alpha_{j}-\frac{\varepsilon}{\sqrt{d}}, (10)
pθj,+(n+m)pθj,(n+m)TV\displaystyle\|p_{\theta_{j,+}}^{\otimes(n+m)}-p_{\theta_{j,-}}^{\otimes(n+m)}\|_{\text{\rm TV}} αj+εd,\displaystyle\geq\alpha_{j}+\frac{\varepsilon}{\sqrt{d}}, (11)

with αj(α¯,α¯)\alpha_{j}\in(\underline{\alpha},\overline{\alpha}), where α¯,α¯(0,1)\underline{\alpha},\overline{\alpha}\in(0,1) are absolute constants. Then there exists an absolute constant c=c(α¯,α¯)>0c=c(\underline{\alpha},\overline{\alpha})>0 such that

ε(𝒫,n,m)cε.\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq c\varepsilon.

Theorem 6.4 leaves the choices of the prior and loss function in Lemma 6.1 implicit, and provides a simple criterion for product models. The usual routine for applying Theorem 6.4 is as follows: fix a constant α and a target error ε, and for each j ∈ [d] find two points θ_{j,+}, θ_{j,−} ∈ Θ_j such that condition (10) holds for the given sample size n. Condition (11) then becomes an inequality in m alone, from which we can solve for the smallest m_j ∈ ℕ such that (11) holds along the j-th coordinate. Setting m = max_{j∈[d]} m_j, the above theorem shows that sample amplification from n to n + m samples is impossible. Although Theorem 6.4 is stated only for product models, similar ideas can also be applied to non-product models; we refer to Section A.3 for concrete examples.
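The routine above can be carried out numerically. The following sketch does so for a Bernoulli product model, where the pair θ_{j,±} and all constants are illustrative choices (scipy assumed):

```python
import numpy as np
from scipy.stats import binom

def tv_binom(k, p, q):
    # TV between k-fold products of Bern(p) and Bern(q), computed via the
    # sufficient Binomial counts
    j = np.arange(k + 1)
    return 0.5 * np.abs(binom.pmf(j, k, p) - binom.pmf(j, k, q)).sum()

n, d, eps, alpha = 200, 25, 0.3, 0.5
p, q = 0.5, 0.5 + 1 / (2 * np.sqrt(n))       # candidate pair theta_{j,+-}
assert tv_binom(n, p, q) <= alpha - eps / np.sqrt(d)      # condition (10)
m = 1
while tv_binom(n + m, p, q) < alpha + eps / np.sqrt(d):   # solve (11) for m
    m += 1
# amplification from n to n+m samples is now ruled out by Theorem 6.4
```

The loop terminates because the TV distance between the k-fold products increases to one as k grows; the resulting m scales like nε/√d for this choice of pair, consistent with the theorem.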

Theorem 6.4 also provides some intuition for why the sample complexity lower bound for amplification is typically smaller than that for learning. Specifically, learning under the TV distance requires a small TV distance ∥∏_{j=1}^d p_{θ_{j,+}}^{⊗n} − ∏_{j=1}^d p_{θ_{j,−}}^{⊗n}∥_TV between product distributions. This requirement typically forces a much smaller individual TV distance ∥p_{θ_{j,+}}^{⊗n} − p_{θ_{j,−}}^{⊗n}∥_TV, e.g. O(1/√d) for many regular models. In contrast, conditions (10) and (11) only require a constant individual TV distance, which leads to a smaller sample complexity n in the sample amplification lower bound. To understand why a larger individual TV distance suffices for sample amplification, in the proof of Theorem 6.4 we consider the uniform prior on the 2^d points ∏_{j=1}^d {θ_{j,+}, θ_{j,−}}. Under this prior, the test accuracy along each dimension is precisely (1 + TV_j)/2, which is slightly smaller than (1 + α)/2 with n samples and slightly larger than (1 + α)/2 with n + m samples (assuming α_j ≡ α). Therefore, if a unit loss is incurred whenever the fraction of correct tests does not exceed (1 + α)/2, the scaling in (10) and (11) yields an Ω(ε) difference between the expected losses under the two sample sizes. In other words, such an aggregate voting test can exploit a larger individual TV distance. The details of the proof are deferred to Appendix B.

Theorem 6.4 has a far-reaching consequence: with almost no assumption on the product model 𝒫{\mathcal{P}}, for any c>0c>0 it always holds that ε(𝒫,n,cεn/d)cε\varepsilon^{\star}({\mathcal{P}},n,\lceil c\varepsilon n/\sqrt{d}\rceil)\geq c^{\prime}\varepsilon for some absolute constant c>0c^{\prime}>0 independent of the product model 𝒫{\mathcal{P}}. The only assumption (besides the product structure) we make on 𝒫{\mathcal{P}} is as follows (here nn\in\mathbb{N} is a given sample size):

Assumption 4.

Let 𝒫 possess the product structure as in Theorem 6.4. For each j ∈ [d], there exist two points θ_{j,+}, θ_{j,−} ∈ Θ_j such that 1/(10n) ≤ H²(p_{θ_{j,+}}, p_{θ_{j,−}}) ≤ 1/(5n).

Assumption 4 is mild: it essentially requires only that, for each coordinate, the map θ_j ↦ p_{θ_j} be continuous under the Hellinger distance. This assumption is satisfied by almost all practical models, discrete or continuous, and is invariant to model reparametrizations and bijective transformations of the observations. We note that the constants 1/10 and 1/5 are not essential, and can be replaced by any smaller constants. The next theorem states that when Assumption 4 holds, we always have the lower bound n = Ω(√d) on the sample complexity and the upper bound m = O(n/√d) on the size of sample amplification.

Theorem 6.5.

Let 𝒫{\mathcal{P}} be a product model satisfying Assumption 4. Then for any c>0c>0, there is some c>0c^{\prime}>0 depending only on cc (thus independent of n,d,ε,𝒫n,d,\varepsilon,{\mathcal{P}}) such that

ε(𝒫,n,cεnd)cε.\displaystyle\varepsilon^{\star}\left({\mathcal{P}},n,\left\lceil\frac{c\varepsilon n}{\sqrt{d}}\right\rceil\right)\geq c^{\prime}\varepsilon.

Theorem 6.5 is a general lower bound for sample amplification in product models, with the intriguing property that it holds instance-wise in the model 𝒫 while the constants c and c′ are independent of 𝒫. As a result, the sample complexity is uniformly Ω(√d/ε), and the maximum size of sample amplification is uniformly O(nε/√d), over all product models. In comparison, the matching upper bound in Theorem 4.6 for product models has a hidden constant depending on the statistical model. It is indeed natural to have sample amplification results independent of the underlying statistical model: by definition, sample amplification is invariant to bijective transformations of the observations. However, Assumption 2 depends on such transformations, and thus possibly contains some redundancy. In contrast, Assumption 4 remains invariant, and is therefore more natural.

The proof idea of Theorem 6.5 is best illustrated in the case d = 1. Using the two points θ_+, θ_− in Assumption 4, one can show that the TV distance between n copies of p_{θ_+} and p_{θ_−} is upper bounded by a small constant. Similarly, for a large C > 0, the TV distance between Cn copies of them is lower bounded by a large constant. Consequently, if m = (C − 1)n, Theorem 6.4 applied with d = 1 gives an Ω(1) lower bound on ε*(𝒫, n, m). What happens if m = cεn for a small c? The idea is to consider the TV distances between n, n + cεn, n + 2cεn, ⋯, Cn copies of p_{θ_+} and p_{θ_−}, which form an increasing sequence. By the pigeonhole principle, there must be two adjacent TV distances differing by at least Ω(cε/C) = Ω(ε), and Theorem 6.4 can be applied to this pair of sample sizes. This idea generalizes to arbitrary dimension, with the full proof in Appendix B.
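The increasing TV sequence and the pigeonhole gap can be checked numerically in the case d = 1. In the sketch below, the Bernoulli pair is an illustrative choice with H² ≈ 0.15/n, inside the bracket of Assumption 4 (scipy assumed):

```python
import numpy as np
from scipy.stats import binom

def tv_binom(k, p, q):
    # TV between k-fold products of Bern(p) and Bern(q)
    j = np.arange(k + 1)
    return 0.5 * np.abs(binom.pmf(j, k, p) - binom.pmf(j, k, q)).sum()

n, C, c, eps = 100, 10, 0.5, 0.2
p, q = 0.5, 0.5 + np.sqrt(0.3 / n)  # Hellinger^2 ~ 0.15/n (Assumption 4)
step = max(1, int(c * eps * n))     # spacing c*eps*n between sample sizes
sizes = np.arange(n, C * n + 1, step)
tvs = np.array([tv_binom(int(k), p, q) for k in sizes])
gaps = np.diff(tvs)
# the TV sequence is nondecreasing in the sample size (data processing),
# so by pigeonhole some adjacent pair of sizes has at least the average gap
assert np.all(gaps >= -1e-10)
assert gaps.max() >= (tvs[-1] - tvs[0]) / len(gaps)
```

The pair of sizes attaining the largest gap is exactly the pair to which Theorem 6.4 is applied in the proof.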

We note that the lower bounds in Theorems 6.2 and 6.3 and in Theorem 6.5 sit at two different ends of the spectrum. Theorems 6.2 and 6.3 essentially consider an asymptotic setting (d fixed and n → ∞), and crucially use a Gaussian limit, which is available under local asymptotic normality. In comparison, Theorem 6.5 deals with a high-dimensional scenario (n and d can grow together) but is restricted to product models. Nevertheless, looking at product submodels and/or exploiting the proof techniques can still lead to tight lower bounds for several non-product models, as shown in the examples in Section A.3.

7 Discussions on sample amplification versus learning

In all the examples in the previous sections, there is a square-root relationship between the statistical complexities of sample amplification and learning. Specifically, when the dimensionality of the problem is d, the complexity of learning the distribution (within a small TV distance) is typically n = Θ(d), whereas that of sample amplification is typically n = Θ(√d). In this section, we give examples where this relationship breaks down in either direction, thus showing that there is no universal scaling between the sample complexities of amplification and learning.

7.1 An example where the complexity of sample amplification is o(d)o(\sqrt{d})

We first provide an example where learning the distribution is hard, but an (n, n+1, 0.1) sample amplification is easy. Consider the following class 𝒫_{d,t} of discrete distributions:

𝒫d,t={(p0,,pd):pi0,i=0dpi=1,p0=t},\displaystyle{\mathcal{P}}_{d,t}=\left\{(p_{0},\cdots,p_{d}):p_{i}\geq 0,\sum_{i=0}^{d}p_{i}=1,p_{0}=t\right\},

This class is the same as the class of all discrete distributions over d + 1 points, except that the learner has perfect knowledge of p₀ = t for some known t ∈ [1/(2√d), 1/2]. It is a classical result (see, e.g. [HJW15]) that the sample complexity of learning a distribution in 𝒫_{d,t} within a small TV distance is still n = Θ(d), regardless of t. However, the next theorem shows that the complexity of sample amplification is much smaller.

Theorem 7.1.

For the class 𝒫d,t{\mathcal{P}}_{d,t} with t[1/(2d),1/2]t\in[1/(2\sqrt{d}),1/2], an (n,n+1,0.1)(n,n+1,0.1) sample amplification is possible if and only if

n=Ω(1t).\displaystyle n=\Omega\left(\frac{1}{t}\right).

Note that for the choice t = Θ(d^{−α}) with α ∈ [0, 1/2], the complexity of sample amplification is n = Θ(d^α) for every α ∈ [0, 1/2], showing that it can be o(√d) by an arbitrary polynomial factor in d. Moreover, if t = o(1/√d), the complexity of sample amplification returns to n = Θ(√d), the same as without the knowledge of t. The main reason why sample amplification is easier here is that the additional fake sample can be chosen to be the first symbol, which has a large known probability. In contrast, learning the distribution requires estimating all the other probability masses, so the existence of one probable symbol does not help much in learning.

7.2 An example where the complexity of sample amplification is ω(d)\omega(\sqrt{d})

Next we provide an example where the complexity of sample amplification matches that of learning. Consider a low-rank covariance estimation model: X₁, ⋯, X_n ∼ 𝒩(0, Σ), where Σ ∈ ℝ^{p×p} can be written as Σ = UUᵀ with U ∈ ℝ^{p×d} and UᵀU = I_d. In other words, the covariance matrix Σ is isotropic on some d-dimensional subspace. Here n ≥ d samples suffice to estimate Σ, and thus the whole distribution, perfectly, since the d-dimensional subspace can be recovered from d i.i.d. samples with probability one. Therefore, the complexity of learning the distribution is n = d. The following theorem states that this is also the complexity of sample amplification.

Theorem 7.2.

For the above low-rank covariance estimation model with pd+1p\geq d+1, an (n,n+1,0.1)(n,n+1,0.1) sample amplification is possible if and only if ndn\geq d.

Theorem 7.2 shows that as opposed to learning, sample amplification fails to exploit the low-rank structure in the covariance estimation problem. As a result, the complexity of sample amplification coincides with that of learning in this example. Note that sample amplification is always no harder than learning: the learner could always estimate the distribution, generate one observation from the distribution and append it to the original samples. Therefore, Theorem 7.2 provides an example where the relationship between sample amplification and learning is the worst possible.
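Since Σ = UUᵀ is determined by the subspace, once n ≥ d the learner recovers the distribution exactly and can append a genuinely fresh draw, exactly as in the learning-based amplifier just described. A minimal numpy sketch (all dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 8, 3, 5                   # ambient dim, rank, sample size (n >= d)
U = np.linalg.qr(rng.standard_normal((p, d)))[0]   # unknown U with U^T U = I_d
X = U @ rng.standard_normal((d, n))                # n samples from N(0, U U^T)

# with probability one, span(X) equals the d-dimensional subspace
U_hat = np.linalg.svd(X, full_matrices=False)[0][:, :d]
x_new = U_hat @ rng.standard_normal(d)  # fresh draw from N(0, U_hat U_hat^T)

# the recovered projector matches the truth, so x_new ~ N(0, Sigma) exactly
assert np.allclose(U_hat @ U_hat.T, U @ U.T)
```

The sketch also makes the n < d failure transparent: with fewer than d samples the span is a strict subspace, and any appended sample betrays the deficient rank.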

7.3 An example where the TV distance is not the right metric

Finally we provide an example showing that the TV distance is not the right metric for the learning-based approach in Section 5, thereby partially illustrating the necessity of the χ² divergence. This example also takes sample amplification beyond parametric families. Let 𝒫 be the class of all L-Lipschitz densities supported on [0,1], i.e. densities f satisfying |f(x) − f(y)| ≤ L|x − y| for all x, y ∈ [0,1]. For c ∈ (0,1), let 𝒫_c ⊆ 𝒫 be the subclass of densities lower bounded by c everywhere, i.e. f(x) ≥ c for all x ∈ [0,1]. It is a classical result (see, e.g. [Tsy09]) that the minimax density estimation error under the TV distance is Θ(n^{−1/3}) for both 𝒫 and 𝒫_c. The next theorem shows that the sample complexities for amplification are nevertheless different.

Theorem 7.3.

Let L8L\geq 8 and c(0,1)c\in(0,1) be fixed. It holds that

m(𝒫c,n)n5/6, while m(𝒫,n)n3/4.\displaystyle m^{\star}({\mathcal{P}}_{c},n)\asymp n^{5/6},\quad\text{ while }\quad m^{\star}({\mathcal{P}},n)\lesssim n^{3/4}.

Theorem 7.3 shows that, although assuming a density lower bound does not alter the TV estimation error, it boosts the size of amplified samples from O(n3/4)O(n^{3/4}) to Θ(n5/6)\Theta(n^{5/6}). In fact, the χ2\chi^{2}-estimation error is also reduced from 𝒫{\mathcal{P}} to 𝒫c{\mathcal{P}}_{c}: in the proof of Theorem 7.3 we show that rχ2(𝒫c,n)n2/3r_{\chi^{2}}({\mathcal{P}}_{c},n)\lesssim n^{-2/3}, but m(𝒫,n)n3/4m^{\star}({\mathcal{P}},n)\lesssim n^{3/4} together with Theorem 5.2 imply that rχ2(𝒫,n)n1/2r_{\chi^{2}}({\mathcal{P}},n)\gtrsim n^{-1/2}. Therefore, this is an example suggesting that measuring the estimation error under the χ2\chi^{2} divergence might be a better indicator for the complexity of sample amplification than the TV distance.

Acknowledgements.

We thank the anonymous reviewers for their helpful feedback on earlier drafts of this paper.

Funding.

Shivam Garg conducted this research while affiliated with Stanford University and was supported by a Stanford Interdisciplinary Graduate Fellowship. Yanjun Han was supported by a Simons-Berkeley research fellowship and the Norbert Wiener postdoctoral fellowship in statistics at MIT IDSS. Vatsal Sharan was supported by NSF CAREER Award CCF-2239265 and an Amazon Research Award. Gregory Valiant was supported by NSF Awards AF-2341890, CCF-1704417, CCF-1813049, UT Austin’s Foundation of ML NSF AI Institute, and a Simons Foundation Investigator Award.

Appendix A Concrete examples of sample amplification

In this section we include concrete examples of sample amplification omitted from the main text due to space limitations, including the non-asymptotic upper bounds in Sections 4 and 5, and the lower bounds for non-product models in Section 6.

A.1 Concrete examples of amplification via sufficiency

In this section, in contrast to the mostly asymptotic results in Section 4.3, we investigate several non-asymptotic examples of sample amplification in concrete models. We show that for many natural models not covered by the general theory, including exponential families with dependent coordinates and non-exponential families, the sufficiency-based sample amplification approach can still amplify Ω(n/√d) additional samples. We also illustrate the computational idea in Section 4.2 via more involved examples.

Our first example concerns the Gaussian model with a known mean but an unknown covariance. It is a folklore result that estimating the unknown covariance to a vanishing Frobenius-norm error requires n = Ω(d²) samples [DMR20, Corollary 1.2], which is also the sample complexity of learning the distribution within a small TV distance. The following example shows that n = O(d) samples suffice for sample amplification.

Example A.1 (Gaussian covariance model with known mean).

Consider the i.i.d. observations X1,,XnX_{1},\cdots,X_{n} drawn from 𝒩(0,Σ){\mathcal{N}}(0,\Sigma) with zero mean and an unknown covariance Σd×d\Sigma\in\mathbb{R}^{d\times d}. Here a minimal sufficient statistic is the sample covariance matrix

Σ^n=1ni=1nXiXi.\displaystyle\widehat{\Sigma}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}.

Lemma A.2 shows that ∥ℒ(Σ̂_n) − ℒ(Σ̂_{n+m})∥_TV ≤ ε as long as m = O(εn/d); therefore drawing samples from P_{X^{n+m}∣Σ̂_{n+m}} achieves sample amplification of size m = Ω(nε/d). This coincides with the general O(n/√D) result where D ≍ d² is the parameter dimension.

In order to sample from this conditional distribution, consider the following statistic

Sn+m=[(n+m)Σ^n+m]1/2[X1,X2,,Xn+m]d×(n+m),\displaystyle S_{n+m}=[(n+m)\widehat{\Sigma}_{n+m}]^{-1/2}[X_{1},X_{2},\cdots,X_{n+m}]\in\mathbb{R}^{d\times(n+m)},

which always exists even if Σ^n+m\widehat{\Sigma}_{n+m} is not invertible. Clearly there is a bijection between Xn+mX^{n+m} and (Σ^m+n,Sn+m)(\widehat{\Sigma}_{m+n},S_{n+m}), and Lemma A.2 shows that Sn+mS_{n+m} is an ancillary statistic. In particular, Sn+mS_{n+m} always follows the uniform distribution on the following set:

A={Ud×(n+m):UU=Id}.\displaystyle A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d}\}.

Consequently an (n,n+Ω(nε/d),ε)(n,n+\Omega(n\varepsilon/d),\varepsilon) sample amplification is efficiently achievable in the Gaussian covariance model using the following algorithm:

1. Given samples [X₁, X₂, ⋯, X_n], compute Σ̂_n = (1/n)∑_{i=1}^n X_i X_iᵀ.

2. Sample [Z₁, Z₂, ⋯, Z_{n+m}] with Z_i ∼ 𝒩(0, I_d) i.i.d., compute Σ̂_{n+m} = (1/(n+m))∑_{i=1}^{n+m} Z_i Z_iᵀ, and based on these compute S_{n+m} = [(n+m)Σ̂_{n+m}]^{−1/2}[Z₁, Z₂, ⋯, Z_{n+m}].

3. Output the (n+m) samples X^{n+m} = [(n+m)Σ̂_n]^{1/2} S_{n+m}.
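The three steps above can be sketched in numpy, with the matrix square roots computed by eigendecomposition (a minimal illustration, not an optimized implementation):

```python
import numpy as np

def sym_power(M, a):
    # power of a symmetric positive-definite matrix via eigendecomposition
    w, V = np.linalg.eigh(M)
    return (V * w ** a) @ V.T

def amplify_cov(X, m, rng):
    # X has shape (d, n); steps 1-3 of the algorithm above
    d, n = X.shape
    Sigma_n = X @ X.T / n                         # step 1
    Z = rng.standard_normal((d, n + m))           # step 2: Z_i ~ N(0, I_d)
    S = sym_power(Z @ Z.T, -0.5) @ Z              # ancillary S_{n+m}
    return sym_power((n + m) * Sigma_n, 0.5) @ S  # step 3

rng = np.random.default_rng(0)
d, n, m = 3, 40, 4
X = rng.standard_normal((d, n))
X_amp = amplify_cov(X, m, rng)
# the amplified samples reproduce the original sufficient statistic:
assert np.allclose(X_amp @ X_amp.T / (n + m), X @ X.T / n)
```

Since S Sᵀ = I_d by construction, the output exactly reproduces the sufficient statistic Σ̂_n, as the final assertion checks.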

Lemma A.2.

Under the notations of Example A.1, for n4max{m,d}n\geq 4\max\{m,d\} it holds that

(Σ^n)(Σ^n+m)TV2mdn.\displaystyle\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}}\leq\frac{2md}{n}.

In addition, Sn+mS_{n+m} is uniformly distributed on A={Ud×(n+m):UU=Id}A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d}\}.

The next example shows that the sample amplification result does not change much if both the mean and covariance are unknown, though the sampling procedure becomes slightly more involved.

Example A.3 (Gaussian model with unknown mean and covariance).

Next we consider the most general Gaussian model where X1,,Xn𝒩(θ,Σ)X_{1},\cdots,X_{n}\sim{\mathcal{N}}(\theta,\Sigma) with unknown mean vector θd\theta\in\mathbb{R}^{d} and unknown covariance matrix Σd×d\Sigma\in\mathbb{R}^{d\times d}. In this case, a minimal sufficient statistic is the pair (X¯n,Σ^n)(\overline{X}_{n},\widehat{\Sigma}_{n}), with

X¯n=1ni=1nXi,Σ^n=1n1i=1n(XiX¯n)(XiX¯n).\displaystyle\overline{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i},\qquad\widehat{\Sigma}_{n}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\overline{X}_{n})(X_{i}-\overline{X}_{n})^{\top}.

Lemma A.4 shows that (X¯n,Σ^n)(X¯n+m,Σ^n+m)TVε\|{\mathcal{L}}(\overline{X}_{n},\widehat{\Sigma}_{n})-{\mathcal{L}}(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}}\leq\varepsilon as long as m=O(nε/d)m=O(n\varepsilon/d), therefore drawing amplified samples from PXn+m(X¯n+m,Σ^n+m)P_{X^{n+m}\mid(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})} achieves sample amplification of size m=Ω(nε/d)m=\Omega(n\varepsilon/d). For the computation, consider the following statistic

Sn+m=[(n+m1)Σ^n+m]1/2[X1X¯n+m,,Xn+mX¯n+m]d×(n+m),\displaystyle S_{n+m}=[(n+m-1)\widehat{\Sigma}_{n+m}]^{-1/2}[X_{1}-\overline{X}_{n+m},\cdots,X_{n+m}-\overline{X}_{n+m}]\in\mathbb{R}^{d\times(n+m)},

and it is clear that the whole samples Xn+mX^{n+m} could be recovered from (X¯n+m,Σ^n+m,Sn+m)(\overline{X}_{n+m},\widehat{\Sigma}_{n+m},S_{n+m}). Again, Lemma A.4 shows that Sn+mS_{n+m} is ancillary and uniformly distributed on the following set (assuming n+m1dn+m-1\geq d):

A={Ud×(n+m):UU=Id,U𝟏=𝟎}.\displaystyle A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d},U{\bf 1}={\bf 0}\}.
Lemma A.4.

Under the notations of Example A.3, for n4max{m,d}n\geq 4\max\{m,d\} it holds that

(X¯n,Σ^n)(X¯n+m,Σ^n+m)TV3mdn1.\displaystyle\|{\mathcal{L}}(\overline{X}_{n},\widehat{\Sigma}_{n})-{\mathcal{L}}(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}}\leq\frac{3md}{n-1}.

In addition, Sn+mS_{n+m} is uniformly distributed on A={Ud×(n+m):UU=Id,U𝟏=𝟎}A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d},U{\bf 1}={\bf 0}\}.

The following example concerns the product exponential distributions, or general Gamma distributions with a fixed shape parameter.

Example A.5 (Product exponential distribution).

In this example, we consider the product exponential model where X1,,Xni=1dExp(λi)X_{1},\cdots,X_{n}\sim\prod_{i=1}^{d}\text{\rm Exp}(\lambda_{i}) with unknown rate vector λ=(λ1,,λd)\lambda=(\lambda_{1},\cdots,\lambda_{d}). Again, in this model the sample mean X¯n\overline{X}_{n} is sufficient, and follows a product Gamma distribution i=1dGamma(n,nλi)\prod_{i=1}^{d}\text{\rm Gamma}(n,n\lambda_{i}). Consequently,

DKL((X¯n)(X¯n+m))=d(m(n+m)log(1+mn)+logΓ(n+m)Γ(n)mψ(n)),\displaystyle D_{\text{\rm KL}}({\mathcal{L}}(\overline{X}_{n})\|{\mathcal{L}}(\overline{X}_{n+m}))=d\cdot\left(m-(n+m)\log\left(1+\frac{m}{n}\right)+\log\frac{\Gamma(n+m)}{\Gamma(n)}-m\psi(n)\right),

where Γ(x) = ∫₀^∞ t^{x−1}e^{−t}dt and ψ(x) = (d/dx)[log Γ(x)] are the gamma and digamma functions, respectively. Using the definition of f(n, m, d) in (C.2), we note that the above KL divergence is precisely d·f(2n, 2m, 1), and hence by the proof of Lemma A.2 it is at most O(dm²/n²). Consequently, sample amplification is possible as long as n = Ω(√d/ε) and m = O(nε/√d). Alternatively, the same result also follows from Theorem 4.6, since both Assumptions 1 and 2 hold for the exponential distribution.

To draw amplified samples X1,,Xn+mX_{1},\cdots,X_{n+m} conditioned on X¯n+m\overline{X}_{n+m}, note that the following statistic Sd×(n+m)S\in\mathbb{R}^{d\times(n+m)} with ii-th row being

Si=(X1,iX¯n+m,i,X2,iX¯n+m,i,,Xn+m,iX¯n+m,i),\displaystyle S_{i}=\left(\frac{X_{1,i}}{\overline{X}_{n+m,i}},\frac{X_{2,i}}{\overline{X}_{n+m,i}},\cdots,\frac{X_{n+m,i}}{\overline{X}_{n+m,i}}\right),

is ancillary. In fact, each normalized row S_i/(n+m) follows the flat Dirichlet distribution Dir(1, 1, ⋯, 1) on the probability simplex. Computational efficiency then follows from the fact that X^{n+m} is determined by (X̄_{n+m}, S).
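The resulting sampler is simple: treat the observed mean X̄_n as if it were X̄_{n+m}, draw the ancillary Dirichlet weights, and rescale. A minimal numpy sketch (shapes and parameters illustrative):

```python
import numpy as np

def amplify_exponential(X, m, rng):
    # X has shape (n, d) with column i containing i.i.d. Exp(lambda_i) draws
    n, d = X.shape
    xbar = X.mean(axis=0)                        # sufficient statistic
    # ancillary weights: each column is a flat Dirichlet draw on the simplex
    W = rng.dirichlet(np.ones(n + m), size=d).T  # shape (n+m, d)
    return (n + m) * xbar * W                    # amplified samples

rng = np.random.default_rng(0)
n, m, d = 30, 5, 4
X = rng.exponential(scale=1.0, size=(n, d))
X_amp = amplify_exponential(X, m, rng)
# the amplified samples reproduce the observed coordinate-wise means
assert np.allclose(X_amp.mean(axis=0), X.mean(axis=0))
```

Each output column sums to (n+m) times the observed mean by construction, so the sufficient statistic is preserved exactly.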

Our final example is a non-exponential family which is not even differentiable in quadratic mean, but the n=O(d)n=O(\sqrt{d}) sample complexity still holds.

Example A.6 (Uniform distribution over a rectangle).

In this example, let X1,,XnX_{1},\cdots,X_{n} be i.i.d. samples from the uniform distribution on an unknown rectangle j=1d[aj,bj]\prod_{j=1}^{d}[a_{j},b_{j}] in d\mathbb{R}^{d}. Note that this is not an exponential family. However, the sufficiency-based sample amplification could still be applied in this case. Specifically, here the sufficient statistics are XminnX_{\min}^{n} and XmaxnX_{\max}^{n}, where for j[d]j\in[d],

Xmin,jn=mini[n]Xi,j,Xmax,jn=maxi[n]Xi,j.\displaystyle X_{\min,j}^{n}=\min_{i\in[n]}X_{i,j},\qquad X_{\max,j}^{n}=\max_{i\in[n]}X_{i,j}.

It is not hard to see that the joint density of (X_{min,j}^n, X_{max,j}^n) is

fn(a,b)=n(n1)(ba)n2(bjaj)n𝟙(ajabbj).\displaystyle f_{n}(a,b)=\frac{n(n-1)(b-a)^{n-2}}{(b_{j}-a_{j})^{n}}\cdot\mathbbm{1}(a_{j}\leq a\leq b\leq b_{j}).

Then after some algebra, the KL divergence between the sufficient statistics is

DKL((Xminn,Xmaxn)(Xminn+m,Xmaxn+m))=d(mnlog(1+mn)+mn1log(1+mn1)),\displaystyle D_{\text{\rm KL}}({\mathcal{L}}(X_{\min}^{n},X_{\max}^{n})\|{\mathcal{L}}(X_{\min}^{n+m},X_{\max}^{n+m}))=d\left(\frac{m}{n}-\log\left(1+\frac{m}{n}\right)+\frac{m}{n-1}-\log\left(1+\frac{m}{n-1}\right)\right),

which is O(dm2/n2)O(dm^{2}/n^{2}) when m=O(n)m=O(n). Therefore, sample amplification for uniform distributions is still possible whenever m=O(nε/d)m=O(n\varepsilon/\sqrt{d}) and n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon), same as exponential families.

To draw amplified samples X₁, ⋯, X_{n+m} conditioned on (X_min^{n+m}, X_max^{n+m}), note that the statistic S ∈ ℝ^{d×(n+m)} with i-th row

S_i = ( (X_{1,i} − X_{min,i}^{n+m})/(X_{max,i}^{n+m} − X_{min,i}^{n+m}), ⋯, (X_{n+m,i} − X_{min,i}^{n+m})/(X_{max,i}^{n+m} − X_{min,i}^{n+m}) ),

is ancillary. This is due to the invariance property of the uniform distribution: if X𝖴(0,1)X\sim\mathsf{U}(0,1), then aX+b𝖴(b,a+b)aX+b\sim\mathsf{U}(b,a+b). Since (X1,,Xn+m)(X_{1},\cdots,X_{n+m}) is determined by (Xminn+m,Xmaxn+m,S)(X_{\min}^{n+m},X_{\max}^{n+m},S), the computational efficiency follows.
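The corresponding sampler standardizes fresh uniforms by their own coordinate-wise range and rescales them to the observed rectangle. A minimal numpy sketch (shapes illustrative):

```python
import numpy as np

def amplify_uniform_rect(X, m, rng):
    # X has shape (n, d), rows i.i.d. uniform on an unknown rectangle
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)        # sufficient statistics
    # ancillary statistic: standardize n+m fresh uniforms by their own
    # coordinate-wise range, then rescale to the observed rectangle [lo, hi]
    Z = rng.uniform(size=(n + m, d))
    S = (Z - Z.min(axis=0)) / (Z.max(axis=0) - Z.min(axis=0))
    return lo + (hi - lo) * S

rng = np.random.default_rng(0)
n, m, d = 25, 4, 3
X = rng.uniform(-1.0, 2.0, size=(n, d))
X_amp = amplify_uniform_rect(X, m, rng)
# the amplified sample preserves the coordinate-wise minima and maxima
assert np.allclose(X_amp.min(axis=0), X.min(axis=0))
assert np.allclose(X_amp.max(axis=0), X.max(axis=0))
```

By the invariance property quoted above, conditioning on the extremes leaves the remaining points i.i.d. uniform between them, which is exactly what the rescaling produces.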

Although all of the above examples work exclusively for continuous models and apply an identity map between sufficient statistics, we remark that more sophisticated maps based on learning could be useful and work for discrete models. The idea of sample amplification via learning is presented in Section 5, and an example which combines both the sufficiency and learning ideas could be found in Example A.14.

A.2 Concrete examples of amplification via learning

In this section we show how learning-based approaches achieve optimal performances of sample amplification in several examples, including both continuous and discrete models. In some scenarios the following strengthened lemma may be useful to deal with too large χ2\chi^{2}-divergence.

Lemma A.7.

The same results as in Theorems 5.2 and 5.5 hold for the following modification of the \chi^{2}-estimation error:

rχ¯2(𝒫,n)=infP^nsupP𝒫𝔼P[χ2(P^n,P)n].\displaystyle r_{\bar{\chi}^{2}}({\mathcal{P}},n)=\inf_{\widehat{P}_{n}}\sup_{P\in{\mathcal{P}}}\mathbb{E}_{P}\left[\chi^{2}(\widehat{P}_{n},P)\wedge n\right].

The idea behind Lemma A.7 is that for some models, \chi^{2}(\widehat{P}_{n},P)=\infty may occur with a small probability. However, since we always have \|\widehat{P}_{n}-P\|_{\text{TV}}\leq 1 for the TV distance, a large \chi^{2}-divergence can still yield a meaningful TV distance. The proof of Lemma A.7 can be found in Appendix C.

The first example is again the Gaussian location model in Example 4.1, where we show that the shuffling-based approach also achieves the complexity n=O(d/ε)n=O(\sqrt{d}/\varepsilon) and the size m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}).

Example A.8 (Gaussian location model with known covariance, continued).

Consider the setting of Example 4.1, where the family of distributions is 𝒫={𝒩(θ,Id)}θd{\mathcal{P}}=\{{\mathcal{N}}(\theta,I_{d})\}_{\theta\in\mathbb{R}^{d}}. Here 𝒫{\mathcal{P}} has a product structure, with 𝒫j={𝒩(θj,1)}θj{\mathcal{P}}_{j}=\{{\mathcal{N}}(\theta_{j},1)\}_{\theta_{j}\in\mathbb{R}} for each j[d]j\in[d]. To find an upper bound on the χ2\chi^{2}-estimation error, consider the distribution estimator P^n,j=𝒩(θ^j,1)\widehat{P}_{n,j}={\mathcal{N}}(\widehat{\theta}_{j},1), where θ^j=n1i=1nXi,j\widehat{\theta}_{j}=n^{-1}\sum_{i=1}^{n}X_{i,j}. Consequently,

χ2(P^n,j,Pj)=χ2(𝒩(θ^j,1),𝒩(θj,1))=exp((θ^jθj)2)1,\displaystyle\chi^{2}(\widehat{P}_{n,j},P_{j})=\chi^{2}({\mathcal{N}}(\widehat{\theta}_{j},1),{\mathcal{N}}(\theta_{j},1))=\exp((\widehat{\theta}_{j}-\theta_{j})^{2})-1,

and using θ^jθj𝒩(0,1/n)\widehat{\theta}_{j}-\theta_{j}\sim{\mathcal{N}}(0,1/n), we have

𝔼[χ2(P^n,j,Pj)]=nn21=O(1n)\displaystyle\mathbb{E}[\chi^{2}(\widehat{P}_{n,j},P_{j})]=\sqrt{\frac{n}{n-2}}-1=O\left(\frac{1}{n}\right)

whenever n3n\geq 3. Consequently, rχ2(𝒫j,n)=O(1/n)r_{\chi^{2}}({\mathcal{P}}_{j},n)=O(1/n) for all j[d]j\in[d], and Theorem 5.5 implies that an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is possible if n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon) and m=O(nε/d)m=O(n\varepsilon/\sqrt{d}).
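The one-dimensional computation above is easy to check by simulation. The following sketch (our own, with an illustrative n) estimates \mathbb{E}[\chi^{2}] in a single coordinate using \widehat{\theta}_{j}-\theta_{j}\sim{\mathcal{N}}(0,1/n):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 1_000_000

# theta_hat - theta ~ N(0, 1/n), and chi^2 = exp((theta_hat - theta)^2) - 1
z = rng.normal(0.0, 1.0 / np.sqrt(n), size=trials)
chi2_mc = np.mean(np.exp(z ** 2)) - 1.0
chi2_exact = np.sqrt(n / (n - 2)) - 1.0          # closed form from the display
print(chi2_mc, chi2_exact)
```

Both values are of order 1/n, as claimed.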

The next example is the discrete distribution model considered in [AGSV20].

Example A.9 (Discrete distribution model).

Let 𝒫{\mathcal{P}} be the class of all discrete distributions supported on kk elements. In this case, a natural learner is the empirical distribution P^n=(p^1,,p^k)\widehat{P}_{n}=(\widehat{p}_{1},\cdots,\widehat{p}_{k}), with

p^j=1ni=1n𝟙(Xi=j),j[k].\displaystyle\widehat{p}_{j}=\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}(X_{i}=j),\qquad j\in[k].

Consequently,

𝔼[χ2(P^n,P)]=𝔼[j=1k(p^jpj)2pj]=j=1k1pjn=k1n,\displaystyle\mathbb{E}[\chi^{2}(\widehat{P}_{n},P)]=\mathbb{E}\left[\sum_{j=1}^{k}\frac{(\widehat{p}_{j}-p_{j})^{2}}{p_{j}}\right]=\sum_{j=1}^{k}\frac{1-p_{j}}{n}=\frac{k-1}{n},

meaning that r_{\chi^{2}}({\mathcal{P}},n)\leq(k-1)/n. Hence, Theorem 5.2 implies that sample amplification is possible whenever n=\Omega(\sqrt{k}/\varepsilon) and m=O(n\varepsilon/\sqrt{k}). In this case, Algorithm 2 essentially subsamples from the original data and adds the subsample back after a random shuffle, which is the same algorithm as in [AGSV20]; however, the analysis here is much simpler.
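The identity \mathbb{E}[\chi^{2}(\widehat{P}_{n},P)]=(k-1)/n is exact for any fully supported P, which the following Monte Carlo sketch (ours, with an illustrative P) confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 5, 100, 200_000
p = np.array([0.4, 0.25, 0.2, 0.1, 0.05])        # any fully supported P

counts = rng.multinomial(n, p, size=trials)       # n samples per trial
p_hat = counts / n                                # empirical distributions
mean_chi2 = (((p_hat - p) ** 2 / p).sum(axis=1)).mean()
print(mean_chi2, (k - 1) / n)
```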

The following example revisits the uniform distribution in Example A.6, and again shows that the same amplification performance can be achieved by random shuffling.

Example A.10 (Uniform distribution, continued).

Consider the setting in Example A.6, where the distribution class 𝒫{\mathcal{P}} is the family of all uniform distributions on a rectangle j=1d[aj,bj]\prod_{j=1}^{d}[a_{j},b_{j}]. For each jj, a natural distribution estimator P^n,j\widehat{P}_{n,j} is simply the uniform distribution on [Xmin,jn,Xmax,jn][X_{\min,j}^{n},X_{\max,j}^{n}], where these quantities are defined in Example A.6. For this learner, it holds that

\displaystyle\chi^{2}\left(\widehat{P}_{n,j},P_{j}\right)=\frac{b_{j}-a_{j}}{X_{\max,j}^{n}-X_{\min,j}^{n}}-1.

Since Example A.6 shows that the joint density of (X_{\min,j}^{n},X_{\max,j}^{n}) is given by

\displaystyle f_{n}(a,b)=\frac{n(n-1)(b-a)^{n-2}}{(b_{j}-a_{j})^{n}}\cdot\mathbbm{1}(a_{j}\leq a\leq b\leq b_{j}),

the expected \chi^{2}-divergence can be computed as

\displaystyle\mathbb{E}\left[\chi^{2}\left(\widehat{P}_{n,j},P_{j}\right)\right]=-1+\iint_{a_{j}\leq a\leq b\leq b_{j}}\frac{n(n-1)(b-a)^{n-3}}{(b_{j}-a_{j})^{n-1}}\,da\,db=\frac{2}{n-2},

which is O(n^{-1}) if n\geq 3. Hence, we have r_{\chi^{2}}({\mathcal{P}}_{j},n)=O(1/n) for each j\in[d], and Theorem 5.5 shows an (n,n+m,\varepsilon) sample amplification if n=\Omega(\sqrt{d}/\varepsilon) and m=O(n\varepsilon/\sqrt{d}).

The next example is the exponential distribution model studied in Example A.5, where the modified χ2\chi^{2}-learning error rχ¯2(𝒫,n)r_{\bar{\chi}^{2}}({\mathcal{P}},n) in Lemma A.7 will be useful.

Example A.11 (Exponential distribution, continued).

Consider the setting in Example A.5, where the distribution class {\mathcal{P}} is a product of exponential distributions with unknown rate parameters. In this case, a natural learner estimates each rate parameter by \widehat{\lambda}_{n,j}=n/\sum_{i=1}^{n}X_{i,j}, and uses \text{\rm Exp}(\widehat{\lambda}_{n,j}) to estimate the truth \text{\rm Exp}(\lambda_{j}). Note that

χ2(Exp(λ^n,j),Exp(λj))=(λ^n,jλj)2λj(2λ^n,jλj)\displaystyle\chi^{2}\left(\text{\rm Exp}(\widehat{\lambda}_{n,j}),\text{\rm Exp}(\lambda_{j})\right)=\frac{(\widehat{\lambda}_{n,j}-\lambda_{j})^{2}}{\lambda_{j}(2\widehat{\lambda}_{n,j}-\lambda_{j})}

whenever 2λ^n,j>λj2\widehat{\lambda}_{n,j}>\lambda_{j}. However, if 2λ^n,jλj2\widehat{\lambda}_{n,j}\leq\lambda_{j} the χ2\chi^{2}-divergence will be unbounded, which occurs with a small but positive probability.

To address this issue, note that when \lambda_{j}=1, sub-exponential concentration gives |\sum_{i=1}^{n}X_{i,j}-n|\leq n/3 with probability at least 1-\exp(-\Omega(n)). By a simple scaling, this event implies that \widehat{\lambda}_{n,j}/\lambda_{j}\in[3/4,3/2]. Hence,

𝔼[χ2(Exp(λ^n,j),Exp(λj))n]2𝔼[(λ^n,jλj1)2]+nexp(Ω(n))=O(1n),\displaystyle\mathbb{E}\left[\chi^{2}\left(\text{\rm Exp}(\widehat{\lambda}_{n,j}),\text{\rm Exp}(\lambda_{j})\right)\wedge n\right]\leq 2\cdot\mathbb{E}\left[\left(\frac{\widehat{\lambda}_{n,j}}{\lambda_{j}}-1\right)^{2}\right]+n\cdot\exp(-\Omega(n))=O\left(\frac{1}{n}\right),

which means that r_{\bar{\chi}^{2}}({\mathcal{P}}_{j},n)=O(1/n). Therefore, by Lemma A.7, sample amplification is possible whenever n=\Omega(\sqrt{d}/\varepsilon) and m=O(n\varepsilon/\sqrt{d}), the same as in Example A.5.
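The truncation at n only matters on the exponentially rare event 2\widehat{\lambda}_{n,j}\leq\lambda_{j}, where the divergence is infinite. The following one-coordinate sketch (ours, with \lambda_{j}=1 and illustrative n) checks that n\cdot\mathbb{E}[\chi^{2}\wedge n] stays bounded:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 200_000

s = rng.gamma(shape=n, scale=1.0, size=trials)     # sum of n Exp(1) variables
lam_hat = n / s                                    # rate estimate; true rate = 1
chi2 = np.where(
    2 * lam_hat > 1,
    (lam_hat - 1) ** 2 / np.maximum(2 * lam_hat - 1, 1e-12),
    np.inf)                                        # divergence is infinite here
scaled_mean = n * np.minimum(chi2, n).mean()       # n * E[chi^2 ^ n]
print(scaled_mean)
```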

The following example considers an interesting non-product model, i.e. the Gaussian distribution with a sparse mean vector.

Example A.12 (Sparse Gaussian model).

Consider the Gaussian location model 𝒫={𝒩(θ,Id)}θΘ{\mathcal{P}}=\{{\mathcal{N}}(\theta,I_{d})\}_{\theta\in\Theta}, with an additional constraint that the mean vector θ\theta is ss-sparse, i.e. Θ={θd:θ0s}\Theta=\{\theta\in\mathbb{R}^{d}:\|\theta\|_{0}\leq s\}. For the learning problem, it is well-known (cf. [DJ94, Theorem 1]) that the soft-thresholding estimator θ^n\widehat{\theta}_{n} with

\displaystyle\widehat{\theta}_{n,j}=\text{\rm sign}\left(\frac{1}{n}\sum_{i=1}^{n}X_{i,j}\right)\cdot\left(\left|\frac{1}{n}\sum_{i=1}^{n}X_{i,j}\right|-\sqrt{\frac{C\log d}{n}}\right)_{+}

and any constant C>2 achieves \sup_{\theta\in\Theta}\mathbb{E}[\|\widehat{\theta}_{n}-\theta\|_{2}^{2}]=O(s\log d/n). Therefore, the sample complexity of learning sparse Gaussian distributions to constant accuracy is n=O(s\log d).

For the complexity of sample amplification, we apply {\mathcal{N}}(\widehat{\theta}_{n,j},1) as the distribution estimator for each j\in[d]. The \chi^{2}-estimation performance of this estimator is summarized in Lemma A.13. Therefore, by a simple adaptation of Lemma A.7, an (n,n+m,\varepsilon) sample amplification is possible as long as n=\Omega(\sqrt{s\log d}/\varepsilon) and m=O(n\varepsilon/\sqrt{s\log d}).

Lemma A.13.

Under the setting of Example A.12, it holds that

supθΘj=1d𝔼[χ2(𝒩(θ^n,j,1),𝒩(θj,1))n]=O(slogdn).\displaystyle\sup_{\theta\in\Theta}\sum_{j=1}^{d}\mathbb{E}\left[\chi^{2}\left({\mathcal{N}}(\widehat{\theta}_{n,j},1),{\mathcal{N}}(\theta_{j},1)\right)\wedge n\right]=O\left(\frac{s\log d}{n}\right).

The final example is one where applying either sufficient statistics or learning alone fails to achieve the optimal sample amplification; the solution is to combine both ideas.

Example A.14 (Poisson distribution).

Consider the product Poisson model \prod_{j=1}^{d}\mathsf{Poi}(\lambda_{j}) with \lambda\in\mathbb{R}_{+}^{d}. We first show that a naïve application of either the sufficiency-based or the shuffling-based idea alone does not lead to the desired sample amplification. Indeed, the sufficient statistic here is T_{n}=\sum_{i=1}^{n}X_{i}, which follows the product Poisson distribution \prod_{j=1}^{d}\mathsf{Poi}(n\lambda_{j}); as T_{n} takes discrete values, applying any linear map from T_{n} to T_{n+m} will not result in a small TV distance between the sufficient statistics.

The argument for the shuffling-based approach is subtler. A natural distribution estimator for \mathsf{Poi}(\lambda_{j}) is \mathsf{Poi}(\widehat{\lambda}_{n,j}), with \widehat{\lambda}_{n,j}=n^{-1}\sum_{i=1}^{n}X_{i,j}. Lemma A.15 shows that this distribution estimator can suffer an expected \chi^{2}-estimation error of order far larger than 1/n, so we cannot conclude an (n,n+\Omega(n\varepsilon/\sqrt{d}),\varepsilon) sample amplification from Theorem 5.5.

Now we show that a combination of the sufficient statistic and learning leads to rate-optimal sample amplification in this model. Specifically, we split the samples and compute the empirical rate parameter \widehat{\lambda}_{n/2} based on the first n/2 samples. Next, conditioned on the first half of the samples, the sufficient statistic of the remaining half is T_{n/2}=\sum_{i=n/2+1}^{n}X_{i}. Define

T^n/2+m=Tn/2+Z,\displaystyle\widehat{T}_{n/2+m}=T_{n/2}+Z,

where Z\sim\prod_{j=1}^{d}\mathsf{Poi}(m\widehat{\lambda}_{n/2,j}) is drawn independently of T_{n/2} conditioned on the first half of the samples. Finally, we draw n/2+m amplified samples from the conditional distribution given \widehat{T}_{n/2+m} and append them to (X_{1},\cdots,X_{n/2}). By the second statement of Lemma A.15, this achieves an (n,n+m,\varepsilon) sample amplification whenever n=\Omega(\sqrt{d}/\varepsilon) and m=O(n\varepsilon/\sqrt{d}).
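The procedure is simple to implement because, conditioned on its sum t, a vector of N i.i.d. \mathsf{Poi}(\lambda) counts is \text{Multinomial}(t,(1/N,\cdots,1/N)). A minimal sketch (our own naming; d, rates, and sizes are illustrative, and n is assumed even):

```python
import numpy as np

def amplify_poisson(x, m, rng):
    # Sufficiency + learning amplifier for the product Poisson model.
    n, d = x.shape
    n2 = n // 2
    lam_hat = x[:n2].mean(axis=0)      # learned rates from the first half
    t = x[n2:].sum(axis=0)             # sufficient statistic of the second half
    t_amp = t + rng.poisson(m * lam_hat)          # shifted sufficient statistic
    # draw n2 + m counts per coordinate conditioned on the new sum
    second = np.column_stack(
        [rng.multinomial(t_amp[j], np.full(n2 + m, 1.0 / (n2 + m)))
         for j in range(d)])
    return np.vstack([x[:n2], second])

rng = np.random.default_rng(0)
lam = np.array([0.5, 2.0, 7.0])
x = rng.poisson(lam, size=(40, 3))     # n = 40 samples, d = 3
amplified = amplify_poisson(x, m=6, rng=rng)
print(amplified.shape)
```

The first half of the data is returned unchanged, and the second half is redrawn conditioned on the shifted sufficient statistic.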

Lemma A.15.

Under the settings of Example A.14, there exists some λj>0\lambda_{j}>0 such that

𝔼[χ2(𝖯𝗈𝗂(λ^n,j),𝖯𝗈𝗂(λj))n]=Ω(1logn)1n.\displaystyle\mathbb{E}\left[\chi^{2}\left(\mathsf{Poi}(\widehat{\lambda}_{n,j}),\mathsf{Poi}(\lambda_{j})\right)\wedge n\right]=\Omega\left(\frac{1}{\log n}\right)\gg\frac{1}{n}.

In addition, the proposed sufficiency+learning approach satisfies

(X^n+m)(Xn+m)TVm2dn.\displaystyle\|{\mathcal{L}}(\widehat{X}^{n+m})-{\mathcal{L}}(X^{n+m})\|_{\text{\rm TV}}\leq\frac{m\sqrt{2d}}{n}.

A.3 Non-asymptotic examples of lower bounds

In this section, we apply the lower bound techniques to several concrete examples from Sections 4 and 5, and show that the previously established upper bounds for sample amplification are indeed tight. Since Theorem 6.5 already handles all product models (including the Gaussian location model in Examples 4.1, 4.2, and A.8, the exponential model in Examples A.5 and A.11, the uniform model in Examples A.6 and A.10, and the Poisson model in Example A.14), only the non-product models remain. In the sequel, we use the general Lemma 6.1 to prove non-asymptotic lower bounds in these examples.

The lower bound for the discrete distribution model was obtained in [AGSV20], where the learning approach in Example A.9 is rate optimal. Our first example concerns the “Poissonized” version of the discrete distribution model, where the results turn out to be slightly different.

Example A.16 (“Poissonized” discrete distribution model).

We consider the following “Poissonized” discrete distribution model, where we have nn i.i.d. samples drawn from j=1k𝖯𝗈𝗂(pj)\prod_{j=1}^{k}\mathsf{Poi}(p_{j}), with (p1,,pk)(p_{1},\cdots,p_{k}) being an unknown probability vector. Although the Poissonization does not affect the optimal rate of estimation in many problems, Lemma A.17 shows the following distinction when it comes to sample amplification: the optimal amplification size is m=Θ(nε/k+nε)m=\Theta(n\varepsilon/\sqrt{k}+\sqrt{n}\varepsilon) under the Poissonized model, while it is m=Θ(nε/k)m=\Theta(n\varepsilon/\sqrt{k}) under the non-Poissonized model [AGSV20].

The complete proof of Lemma A.17 is deferred to the appendix, but we briefly comment on why Theorem 6.5 is not directly applicable when k\gg n. To apply Theorem 6.5, we would construct a parametric submodel which is a product model:

Pθ=j=1k0[𝖯𝗈𝗂(1k+θj)×𝖯𝗈𝗂(1kθj)],θΘ[1k,1k]k0,\displaystyle P_{\theta}=\prod_{j=1}^{k_{0}}\left[\mathsf{Poi}\left(\frac{1}{k}+\theta_{j}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta_{j}\right)\right],\qquad\theta\in\Theta\triangleq\left[-\frac{1}{k},\frac{1}{k}\right]^{k_{0}},

where k=2k0k=2k_{0}. However, for θ,θ[0,1/k]\theta,\theta^{\prime}\in[0,1/k], the range of the squared Hellinger distance

H2(𝖯𝗈𝗂(1k+θ)×𝖯𝗈𝗂(1kθ),𝖯𝗈𝗂(1k+θ)×𝖯𝗈𝗂(1kθ))\displaystyle H^{2}\left(\mathsf{Poi}\left(\frac{1}{k}+\theta\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta\right),\mathsf{Poi}\left(\frac{1}{k}+\theta^{\prime}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta^{\prime}\right)\right)

is only [0,Θ(1/k)][0,\Theta(1/k)], so Assumption 4 does not hold when knk\gg n. This is precisely the subtlety in the Poissonized model.

Lemma A.17.

Under the Poissonized discrete distribution model, an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is possible if and only if n=Ω(1)n=\Omega(1) and m=O(nε/k+nε)m=O(n\varepsilon/\sqrt{k}+\sqrt{n}\varepsilon).

Our next example is the sparse Gaussian model in Example A.12, where we establish the tightness of n=Ω(slogd/ε)n=\Omega(\sqrt{s\log d}/\varepsilon) and m=O(nε/slogd)m=O(n\varepsilon/\sqrt{s\log d}) directly using Lemma 6.1.

Example A.18 (Sparse Gaussian location model).

Consider the setting of the sparse Gaussian location model in Example A.12; we aim to prove a matching lower bound for sample amplification. We first handle the case s=1 to illustrate the main idea, and postpone the case of general s to Lemma A.19.

For s=1s=1, we apply Lemma 6.1 to a proper choice of the prior μ\mu and loss LL. Fixing a parameter t>0t>0 to be chosen later, let μ\mu be the uniform distribution on the finite set of vectors {te1,,ted}\{te_{1},\cdots,te_{d}\}, where e1,,ede_{1},\cdots,e_{d} are the canonical vectors in d\mathbb{R}^{d}. Moreover, for an estimator θ^d\widehat{\theta}\in\mathbb{R}^{d} of the unknown mean vector, the loss function is chosen to be L(θ,θ^)=𝟙(θ^θ)L(\theta,\widehat{\theta})=\mathbbm{1}(\widehat{\theta}\neq\theta). For the above prior and loss, it is clear that the maximum likelihood estimator θ^=tej^\widehat{\theta}=te_{\widehat{j}} with

j^=argmaxj[d]i=1nXi,j\displaystyle\widehat{j}=\arg\max_{j\in[d]}\sum_{i=1}^{n}X_{i,j}

is the Bayes estimator, and the Bayes risk admits the following expression:

rB(𝒫,n,μ,L)=1𝔼[exp(nt(nt+Z1))exp(nt(nt+Z1))+j=2dexp(ntZj)]1pd(nt),\displaystyle r_{\text{\rm B}}({\mathcal{P}},n,\mu,L)=1-\mathbb{E}\left[\frac{\exp(\sqrt{n}t(\sqrt{n}t+Z_{1}))}{\exp(\sqrt{n}t(\sqrt{n}t+Z_{1}))+\sum_{j=2}^{d}\exp(\sqrt{n}tZ_{j})}\right]\triangleq 1-p_{d}(\sqrt{n}\cdot t),

where Z_{1},\cdots,Z_{d}\sim{\mathcal{N}}(0,1) are i.i.d. standard normal random variables. Similarly, the Bayes risk with n+m samples is r_{\text{\rm B}}({\mathcal{P}},n+m,\mu,L)=1-p_{d}(\sqrt{n+m}\cdot t), and it remains to investigate the properties of the function p_{d}(\cdot). Lemma A.19 records the property used in the lower bound: p_{d}(z) undergoes a phase transition around z\sim\sqrt{2\log d}.

Based on Lemma A.19, the pigeonhole principle implies that for every given c>0c>0, there exists some z[2logdC,2logd+C]z\in[\sqrt{2\log d}-C,\sqrt{2\log d}+C] such that pd(z+cε)pd(z)=Ωc(ε)p_{d}(z+c\varepsilon)-p_{d}(z)=\Omega_{c}(\varepsilon). Therefore, if m=cnε/logdm=\lceil cn\varepsilon/\sqrt{\log d}\rceil, choosing t=z/nt=z/\sqrt{n} for the above zz yields that pd(n+mt)pd(nt)=Ωc(ε)p_{d}(\sqrt{n+m}\cdot t)-p_{d}(\sqrt{n}\cdot t)=\Omega_{c}(\varepsilon), and consequently

ε(𝒫,n,m)rB(𝒫,n,μ,L)rB(𝒫,n+m,μ,L)=pd(n+mt)pd(nt)=Ωc(ε).\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq r_{\text{\rm B}}({\mathcal{P}},n,\mu,L)-r_{\text{\rm B}}({\mathcal{P}},n+m,\mu,L)=p_{d}(\sqrt{n+m}\cdot t)-p_{d}(\sqrt{n}\cdot t)=\Omega_{c}(\varepsilon).

Therefore, we must have n=Ω(logd/ε)n=\Omega(\sqrt{\log d}/\varepsilon) and m=O(nε/logd)m=O(n\varepsilon/\sqrt{\log d}) for sample amplification.
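The phase transition of p_{d} underlying this argument is easy to observe numerically. The following Monte Carlo sketch (ours; d and the window half-width 2, standing in for the constant C, are illustrative) evaluates p_{d} just below and just above \sqrt{2\log d}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 10_000, 500
z0 = np.sqrt(2 * np.log(d))

def p_d(z):
    # p_d(z) = E[ softmax weight of the true coordinate ], as in the display
    vals = np.empty(trials)
    for t in range(trials):
        logits = z * rng.standard_normal(d)
        logits[0] += z * z              # the true coordinate gets the z^2 boost
        logits -= logits.max()          # stabilize the exponentials
        w = np.exp(logits)
        vals[t] = w[0] / w.sum()
    return vals.mean()

p_lo, p_hi = p_d(z0 - 2.0), p_d(z0 + 2.0)
print(p_lo, p_hi)
```

The probability of identifying the true coordinate jumps from near 0 to near 1 over a constant-width window around \sqrt{2\log d}.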

The following lemma summarizes the phase-transition property of pdp_{d} used in the above example.

Lemma A.19.

For the function pd(z)p_{d}(z) in Example A.18, there exists an absolute constant CC independent of dd such that

pd(2logdC)\displaystyle p_{d}(\sqrt{2\log d}-C) 0.1,\displaystyle\leq 0.1,
pd(2logd+C)\displaystyle p_{d}(\sqrt{2\log d}+C) 0.9.\displaystyle\geq 0.9.

Moreover, for general s<d/2s<d/2, an (n,n+m)(n,n+m) sample amplification is possible under the sparse Gaussian model only if n=Ω(slog(d/s)/ε)n=\Omega(\sqrt{s\log(d/s)}/\varepsilon) and m=O(nε/slog(d/s))m=O(n\varepsilon/\sqrt{s\log(d/s)}).

Our final example proves the tightness of the sufficiency-based approaches in Examples A.1 and A.3 for Gaussian models with unknown covariance. The proof relies on the computation of the minimax risks in Lemma 6.1, as well as the statistical theory of invariance.

Example A.20 (Gaussian model with unknown covariance).

We establish the matching lower bounds for sample amplification in Example A.1 with a known mean, which imply the lower bounds in Example A.3. We use the minimax risk formulation of Lemma 6.1 and consider the following loss:

L(Σ,Σ^)=𝟙((Σ,Σ^)g(n+m1d,d)+Cdn),\displaystyle L(\Sigma,\widehat{\Sigma})=\mathbbm{1}\left(\ell(\Sigma,\widehat{\Sigma})\geq g(n+m-1-d,d)+C\cdot\frac{d}{n}\right),

where the function g is given by (13) in Lemma A.21 below, C>0 is a large absolute constant to be determined later, and \ell(\Sigma,\widehat{\Sigma}) is Stein's loss

(Σ,Σ^)=tr(Σ1Σ^)logdet(Σ1Σ^)d.\displaystyle\ell(\Sigma,\widehat{\Sigma})=\text{\rm tr}(\Sigma^{-1}\widehat{\Sigma})-\log\det(\Sigma^{-1}\widehat{\Sigma})-d.

To search for the minimax estimator under the above loss, arguments similar to those in [JS61] based on the theory of invariance show that it suffices to consider estimators of the form

Σ^n=LnDnLn,\displaystyle\widehat{\Sigma}_{n}=L_{n}D_{n}L_{n}^{\top}, (12)

where D_{n}\in\mathbb{R}^{d\times d} is a diagonal matrix independent of the observations, and L_{n}\in\mathbb{R}^{d\times d} is the lower triangular matrix satisfying L_{n}L_{n}^{\top}=\sum_{i=1}^{n}X_{i}X_{i}^{\top}. Moreover, the risk of the above estimator does not depend on the unknown \Sigma, so in the sequel we assume that \Sigma=I_{d}. Note that when D_{n}=I_{d}/n, the estimator \widehat{\Sigma}_{n} is the sample covariance; however, other choices of the diagonal matrix D_{n} give a uniform improvement over the sample covariance, see [JS61].
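The claimed improvement is easy to observe numerically. The following sketch (ours, with illustrative d and n) compares the Stein risk of the sample covariance (D_{n}=I_{d}/n) with the [JS61]-type choice \lambda_{j}=1/(n+d+1-2j), using \Sigma=I_{d} by invariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 5, 20, 20_000

def stein_loss(sigma_hat):
    # Stein's loss with Sigma = I_d: tr(sigma_hat) - log det(sigma_hat) - d
    _, logdet = np.linalg.slogdet(sigma_hat)
    return np.trace(sigma_hat) - logdet - d

lam_js = 1.0 / (n + d + 1 - 2.0 * np.arange(1, d + 1))   # [JS61]-type weights
lam_cov = np.full(d, 1.0 / n)                            # sample covariance

loss_cov = loss_js = 0.0
for _ in range(trials):
    X = rng.standard_normal((n, d))
    L = np.linalg.cholesky(X.T @ X)           # L L^T = sum_i X_i X_i^T
    loss_cov += stein_loss(L @ np.diag(lam_cov) @ L.T)
    loss_js += stein_loss(L @ np.diag(lam_js) @ L.T)
loss_cov /= trials
loss_js /= trials
print(loss_cov, loss_js)
```

The estimated risk of the [JS61]-type estimator is strictly smaller than that of the sample covariance.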

The proof idea is to show that with n+mn+m samples, there exists an estimator Σ^n+m\widehat{\Sigma}_{n+m} with a specific choice of Dn+mD_{n+m} in (12) such that (cf. Lemma A.21)

|𝔼[(Σ,Σ^n+m)]g(n+m+1d,d)|5dn+m,𝖵𝖺𝗋((Σ,Σ^n+m))4dn+m.\displaystyle|\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n+m})]-g(n+m+1-d,d)|\leq\frac{5d}{n+m},\qquad\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n+m}))}\leq\frac{4d}{n+m}.

Consequently, by Chebyshev's inequality, r({\mathcal{P}},n+m,L)\leq 0.1 for C>0 large enough.

To lower bound the minimax risk r({\mathcal{P}},n,L) with n samples, we need to consider all possible estimators of the form (12). It turns out that for any choice of D_{n}, the first two moments of \ell(\Sigma,\widehat{\Sigma}_{n}) admit explicit expressions, and it always holds that (cf. Lemma A.21)

𝔼[(Σ,Σ^n)]g(n+1d,d)+C1(𝖵𝖺𝗋((Σ,Σ^n))4dn)6dn\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]\geq g(n+1-d,d)+C_{1}\left(\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}-\frac{4d}{n}\right)-\frac{6d}{n}

for any given constant C_{1}>0, provided that n,d\geq C_{2} with C_{2} depending only on C_{1}. Choosing C_{1}>0 large enough, Chebyshev's inequality shows that \ell(\Sigma,\widehat{\Sigma}_{n})\geq g(n+1-d,d)-5C_{1}d/n with probability at least 0.9 for every \widehat{\Sigma}_{n} of the form (12). By the last statement of Lemma A.21, for m=C_{3}n/d with a large enough constant C_{3}>0, this event implies that r({\mathcal{P}},n,L)\geq 0.9.

Combining the above scenarios and applying the pigeonhole principle, Lemma 6.1 shows that m=O(n\varepsilon/d) is necessary for an (n,n+m,\varepsilon) sample amplification.

The following lemma summarizes the necessary technical results for Example A.20.

Lemma A.21.

Let ndn\geq d. For Dn=diag(λ1,,λd)D_{n}=\text{\rm diag}(\lambda_{1},\cdots,\lambda_{d}) with λj=1/(n+d+12j)\lambda_{j}=1/(n+d+1-2j) for all j[d]j\in[d], the corresponding estimator Σ^n\widehat{\Sigma}_{n} in (12) satisfies

|𝔼[(Σ,Σ^n)]g(n+1d,d)|5dn,𝖵𝖺𝗋((Σ,Σ^n))4dn,\displaystyle|\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]-g(n+1-d,d)|\leq\frac{5d}{n},\qquad\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}\leq\frac{4d}{n},

where

g(u,v)\displaystyle g(u,v) (u+2v)log(u+2v)+ulogu2(u+v)log(u+v).\displaystyle\triangleq\frac{(u+2v)\log(u+2v)+u\log u}{2}-(u+v)\log(u+v). (13)

Meanwhile, for any choice of DnD_{n} and any absolute constant C1>0C_{1}>0, the estimator Σ^n\widehat{\Sigma}_{n} in (12) satisfies

𝔼[(Σ,Σ^n)]g(n+1d,d)+C1(𝖵𝖺𝗋((Σ,Σ^n))4dn)6dn,\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]\geq g(n+1-d,d)+C_{1}\left(\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}-\frac{4d}{n}\right)-\frac{6d}{n},

as long as n,dC2n,d\geq C_{2} for some large enough constant C2C_{2} depending only on C1C_{1}.

Finally, the function gg defined in (13) satisfies the following inequality: for n2dn\geq 2d and 0mn0\leq m\leq n, it holds that

g(n+1d,d)g(n+m+1d,d)md213n2.\displaystyle g(n+1-d,d)-g(n+m+1-d,d)\geq\frac{md^{2}}{13n^{2}}.

Appendix B Proof of main theorems

B.1 Proof of Theorem 4.5

We recall the following result from [BC16, Theorem 2.6]: under Assumptions 1 and 2 with k=3, there exists a constant C>0 depending only on d and the moment upper bound such that

supθΘ(n[2A(θ)]1/2(TnA(θ)))𝒩(0,Id)TVCn.\displaystyle\sup_{\theta\in\Theta}\|{\mathcal{L}}(\sqrt{n}[\nabla^{2}A(\theta)]^{-1/2}(T_{n}-\nabla A(\theta)))-{\mathcal{N}}(0,I_{d})\|_{\text{TV}}\leq\frac{C}{\sqrt{n}}.

By an affine transformation, it is then clear that

supθΘ(Tn)𝒩(A(θ),2A(θ)/n)TVCn.\displaystyle\sup_{\theta\in\Theta}\|{\mathcal{L}}(T_{n})-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)\|_{\text{TV}}\leq\frac{C}{\sqrt{n}}.

Moreover, the computation in Example 4.1 shows that

supθΘ𝒩(A(θ),2A(θ)/n)𝒩(A(θ),2A(θ)/(n+m))TVmdn.\displaystyle\sup_{\theta\in\Theta}\|{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/(n+m))\|_{\text{TV}}\leq\frac{m\sqrt{d}}{n}.

Now the desired result follows from the above inequalities and a triangle inequality.
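The middle inequality can be checked numerically in one dimension. The following sketch (ours, with illustrative n and m) computes the TV distance between {\mathcal{N}}(0,1/n) and {\mathcal{N}}(0,1/(n+m)) by quadrature and compares it with m\sqrt{d}/n=m/n:

```python
import numpy as np

n, m = 100, 10
x = np.linspace(-1.0, 1.0, 200_001)        # roughly +-10 standard deviations
dx = x[1] - x[0]

def gauss_pdf(t, var):
    return np.exp(-t ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# TV distance = half the L1 distance between the two densities
tv = 0.5 * np.sum(np.abs(gauss_pdf(x, 1.0 / n) - gauss_pdf(x, 1.0 / (n + m)))) * dx
bound = m / n                               # m * sqrt(d) / n with d = 1
print(tv, bound)
```

The quadrature value sits well below the stated bound.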

B.2 Proof of Theorem 4.6

As the density in the Edgeworth expansion (8) may be negative at some points, throughout the proof we extend the definition of the TV distance to arbitrary signed measures P and Q, namely half the L_{1} distance. Under this extended definition, the triangle inequality \|P-R\|_{\text{TV}}\leq\|P-Q\|_{\text{TV}}+\|Q-R\|_{\text{TV}} still holds. The following tensorization property also holds: for general signed measures P_{1},\cdots,P_{d} and Q_{1},\cdots,Q_{d} with \max_{i\in[d]}\max\{|P_{i}|(\Omega),|Q_{i}|(\Omega)\}\leq r, we have

P1××PdQ1××QdTV\displaystyle\|P_{1}\times\cdots\times P_{d}-Q_{1}\times\cdots\times Q_{d}\|_{\text{TV}}
i=1dP1××Pi1×Qi××QdP1××Pi×Qi+1××QdTV\displaystyle\leq\sum_{i=1}^{d}\|P_{1}\times\cdots\times P_{i-1}\times Q_{i}\times\cdots\times Q_{d}-P_{1}\times\cdots\times P_{i}\times Q_{i+1}\times\cdots\times Q_{d}\|_{\text{TV}}
i=1dPiQiTVj<i|Pj|(Ω)k>i|Qk|(Ω)\displaystyle\leq\sum_{i=1}^{d}\|P_{i}-Q_{i}\|_{\text{TV}}\cdot\prod_{j<i}|P_{j}|(\Omega)\cdot\prod_{k>i}|Q_{k}|(\Omega)
rdi=1dPiQiTV.\displaystyle\leq r^{d}\sum_{i=1}^{d}\|P_{i}-Q_{i}\|_{\text{TV}}. (14)

Fix any \theta\in\Theta, and let P_{n,i} (resp. P_{n+m,i}) be the probability distribution of the i-th coordinate of T_{n} (resp. T_{n+m}), and Q_{n,i} (resp. Q_{n+m,i}) be the signed measure of the corresponding Edgeworth expansion of the form (8) with k=9. We note that the polynomials {\mathcal{K}}_{\ell}(x) in (8) are the same for Q_{n,i} and Q_{n+m,i}, and their coefficients are uniformly bounded over \theta\in\Theta thanks to Assumption 2. Then, based on Assumptions 1 and 2 with k=10, [BC16, Theorem 2.7] shows that

Pn,iQn,iTVCn2,\displaystyle\|P_{n,i}-Q_{n,i}\|_{\text{TV}}\leq\frac{C}{n^{2}},

with C>0 independent of (n,d,\theta). Moreover, the signed measure of the Edgeworth expansion in (8) can be negative only if |x|=\Omega(\sqrt{n}), and therefore the total variation of each Q_{n,i} satisfies

|Qn,i|()=|Γn,9|()Γn,9([cn,cn])+|Γn,9|(\[cn,cn])1+exp(Ω(n)),\displaystyle|Q_{n,i}|(\mathbb{R})=|\Gamma_{n,9}|(\mathbb{R})\leq\Gamma_{n,9}([-c\sqrt{n},c\sqrt{n}])+|\Gamma_{n,9}|(\mathbb{R}\backslash[-c\sqrt{n},c\sqrt{n}])\leq 1+\exp(-\Omega(n)),

where the last inequality follows from integrating the Gaussian tails. Finally, letting Q_{n,i}^{+} be the positive part of Q_{n,i} in the Jordan decomposition, the tensorization property (14) leads to

(Tn)i=1dQn,i+TV(Tn)i=1dQn,iTVCdn2(1+exp(Ω(n)))d,\displaystyle\left\|{\mathcal{L}}(T_{n})-\prod_{i=1}^{d}Q_{n,i}^{+}\right\|_{\text{TV}}\leq\left\|{\mathcal{L}}(T_{n})-\prod_{i=1}^{d}Q_{n,i}\right\|_{\text{TV}}\leq\frac{Cd}{n^{2}}\cdot\left(1+\exp(-\Omega(n))\right)^{d}, (15)

and a similar result holds for {\mathcal{L}}(T_{n+m}).

Next, it remains to upper bound the TV distance between \prod_{i=1}^{d}Q_{n,i}^{+} and \prod_{i=1}^{d}Q_{n+m,i}^{+}. To this end, we also extend the definition of the Hellinger distance to general measures which are not necessarily probability measures. The following inequality between the generalized TV and Hellinger distances then holds: for measures P and Q on \Omega,

PQTV\displaystyle\|P-Q\|_{\text{TV}} =12|dPdQ|=12|dPdQ|(dP+dQ)\displaystyle=\frac{1}{2}\int|\mathrm{d}P-\mathrm{d}Q|=\frac{1}{2}\int|\sqrt{\mathrm{d}P}-\sqrt{\mathrm{d}Q}|(\sqrt{\mathrm{d}P}+\sqrt{\mathrm{d}Q})
12(dPdQ)2(dP+dQ)2\displaystyle\leq\frac{1}{2}\sqrt{\int(\sqrt{\mathrm{d}P}-\sqrt{\mathrm{d}Q})^{2}\cdot\int(\sqrt{\mathrm{d}P}+\sqrt{\mathrm{d}Q})^{2}}
H(P,Q)P(Ω)+Q(Ω).\displaystyle\leq H(P,Q)\cdot\sqrt{P(\Omega)+Q(\Omega)}. (16)

Also, the following tensorization property holds for the Hellinger distance between general measures: for (not necessarily probability) measures Pi,QiP_{i},Q_{i} on Ωi\Omega_{i},

H2(i=1dPi,i=1dQi)=i=1dPi(Ωi)+i=1dQi(Ωi)2i=1d(Pi(Ωi)+Qi(Ωi)2H2(Pi,Qi)).\displaystyle H^{2}\left(\prod_{i=1}^{d}P_{i},\prod_{i=1}^{d}Q_{i}\right)=\frac{\prod_{i=1}^{d}P_{i}(\Omega_{i})+\prod_{i=1}^{d}Q_{i}(\Omega_{i})}{2}-\prod_{i=1}^{d}\left(\frac{P_{i}(\Omega_{i})+Q_{i}(\Omega_{i})}{2}-H^{2}(P_{i},Q_{i})\right). (17)

Consequently, it suffices to prove an upper bound on the individual Hellinger distances H(Q_{n,i}^{+},Q_{n+m,i}^{+}); the TV distance between the product measures then follows directly from (16) and (17).
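Since (17) is an exact identity, it can be verified numerically for arbitrary unnormalized measures on finite sets; the following sketch (ours) checks it for random discrete measures, using that the affinity \int\sqrt{dP\,dQ}=(P(\Omega)+Q(\Omega))/2-H^{2}(P,Q) multiplies across product measures:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4                                 # d factors, k atoms per factor
P = 2.0 * rng.random((d, k))                # unnormalized nonnegative measures
Q = 2.0 * rng.random((d, k))

def h2(p, q):
    # generalized squared Hellinger distance between finite measures
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# left-hand side: H^2 between the product measures
prod_P, prod_Q = P[0], Q[0]
for j in range(1, d):
    prod_P = np.multiply.outer(prod_P, P[j]).ravel()
    prod_Q = np.multiply.outer(prod_Q, Q[j]).ravel()
lhs = h2(prod_P, prod_Q)

# right-hand side of (17), with total masses playing the role of P_i(Omega_i)
mass_P, mass_Q = P.sum(axis=1), Q.sum(axis=1)
rhs = (mass_P.prod() + mass_Q.prod()) / 2 - np.prod(
    [(mass_P[j] + mass_Q[j]) / 2 - h2(P[j], Q[j]) for j in range(d)])
print(lhs, rhs)
```

The two sides agree up to floating-point rounding.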

To upper bound the individual Hellinger distance, note that after a proper affine transformation, the densities of Qn,iQ_{n,i} and Qn+m,iQ_{n+m,i} are as follows:

Qn,i(dx)\displaystyle Q_{n,i}(dx) =γn(x)(1+=13𝒦,i(nx)n/2)dx,\displaystyle=\gamma_{n}(x)\left(1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}\right)dx, (18)
Qn+m,i(dx)\displaystyle Q_{n+m,i}(dx) =γn+m(x)(1+=13𝒦,i(n+mx)(n+m)/2)dx,\displaystyle=\gamma_{n+m}(x)\left(1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)dx, (19)

where \gamma_{n} is the density of {\mathcal{N}}(0,1/n), and {\mathcal{K}}_{\ell,i} is a polynomial of degree 3\ell with uniformly bounded coefficients. By (18) and (19), there exists an absolute constant c>0 independent of (n,m) such that Q_{n,i}(x)/\gamma_{n}(x),Q_{n+m,i}(x)/\gamma_{n+m}(x)\in[1/2,3/2] whenever |x|\leq c. Consequently, the squared Hellinger distance can then be bounded as

H2(Qn,i+,Qn+m,i+)\displaystyle H^{2}(Q_{n,i}^{+},Q_{n+m,i}^{+}) =12|x|c(Qn,i(x)Qn+m,i(x))2dx+12|x|>c(Qn,i+(x)Qn+m,i+(x))2dx\displaystyle=\frac{1}{2}\int_{|x|\leq c}\left(\sqrt{Q_{n,i}(x)}-\sqrt{Q_{n+m,i}(x)}\right)^{2}\mathrm{d}x+\frac{1}{2}\int_{|x|>c}\left(\sqrt{Q_{n,i}^{+}(x)}-\sqrt{Q_{n+m,i}^{+}(x)}\right)^{2}\mathrm{d}x
|x|c(γn(x)γn+m(x))2(1+=13𝒦,i(n+mx)(n+m)/2)dx\displaystyle\leq\int_{|x|\leq c}(\sqrt{\gamma_{n}(x)}-\sqrt{\gamma_{n+m}(x)})^{2}\left(1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)\mathrm{d}x
+|x|cγn(x)(1+=13𝒦,i(nx)n/21+=13𝒦,i(n+mx)(n+m)/2)2dx\displaystyle\qquad+\int_{|x|\leq c}\gamma_{n}(x)\left(\sqrt{1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}}-\sqrt{1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}}\right)^{2}\mathrm{d}x
+|x|>c(Qn,i+(x)+Qn+m,i+(x))dx\displaystyle\qquad+\int_{|x|>c}\left(Q_{n,i}^{+}(x)+Q_{n+m,i}^{+}(x)\right)\mathrm{d}x
A1+A2+A3.\displaystyle\equiv A_{1}+A_{2}+A_{3}.

Next we upper bound the terms A1,A2A_{1},A_{2}, and A3A_{3} separately.

  1.

    Upper bounding A1A_{1}: note that by the definition of cc, the multiplication factor in A1A_{1} is at most 3/23/2. Therefore,

    A1\displaystyle A_{1} 32|x|c(γn(x)γn+m(x))2dx\displaystyle\leq\frac{3}{2}\int_{|x|\leq c}(\sqrt{\gamma_{n}(x)}-\sqrt{\gamma_{n+m}(x)})^{2}\mathrm{d}x
    32(γn(x)γn+m(x))2dx\displaystyle\leq\frac{3}{2}\int_{\mathbb{R}}(\sqrt{\gamma_{n}(x)}-\sqrt{\gamma_{n+m}(x)})^{2}\mathrm{d}x
    =(a)3(1(n(n+m)(n+m/2)2)1/4)\displaystyle\overset{\rm(a)}{=}3\left(1-\left(\frac{n(n+m)}{(n+m/2)^{2}}\right)^{1/4}\right)
    (b)3m24n2,\displaystyle\overset{\rm(b)}{\leq}\frac{3m^{2}}{4n^{2}}, (20)

    where (a) is due to the direct computation of the squared Hellinger distance between 𝒩(0,1/n){\mathcal{N}}(0,1/n) and 𝒩(0,1/(n+m)){\mathcal{N}}(0,1/(n+m)), and (b) makes use of the inequality 1x1/41x1-x^{1/4}\leq 1-x for x[0,1]x\in[0,1].

  2.

    Upper bounding A2A_{2}: using |ab||ab||\sqrt{a}-\sqrt{b}|\leq|a-b| for a,b1/2a,b\geq 1/2, we have

    A2\displaystyle A_{2} |x|cγn(x)(=13𝒦,i(nx)n/2=13𝒦,i(n+mx)(n+m)/2)2dx\displaystyle\leq\int_{|x|\leq c}\gamma_{n}(x)\left(\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}-\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)^{2}\mathrm{d}x
    3=13γn(x)(𝒦,i(nx)n/2𝒦,i(n+mx)(n+m)/2)2dx\displaystyle\leq 3\sum_{\ell=1}^{3}\int_{\mathbb{R}}\gamma_{n}(x)\left(\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}-\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)^{2}\mathrm{d}x
    =3=13γ(x)(𝒦,i(x)n/2𝒦,i(1+m/nx)(n+m)/2)2dx,\displaystyle=3\sum_{\ell=1}^{3}\int_{\mathbb{R}}\gamma(x)\left(\frac{{\mathcal{K}}_{\ell,i}(x)}{n^{\ell/2}}-\frac{{\mathcal{K}}_{\ell,i}(\sqrt{1+m/n}\cdot x)}{(n+m)^{\ell/2}}\right)^{2}\mathrm{d}x,

    where the last step is a change of measure, and γ\gamma is the density of 𝒩(0,1){\mathcal{N}}(0,1). Writing 𝒦,i(x)=j=03ai,jxj{\mathcal{K}}_{\ell,i}(x)=\sum_{j=0}^{3\ell}a_{i,j}x^{j}, then

    |𝒦,i(x)n/2𝒦,i(1+m/nx)(n+m)/2|=|j=03ai,jxjn/2(1(1+mn)j2)|m(1+x3)n/2+1\displaystyle\left|\frac{{\mathcal{K}}_{\ell,i}(x)}{n^{\ell/2}}-\frac{{\mathcal{K}}_{\ell,i}(\sqrt{1+m/n}\cdot x)}{(n+m)^{\ell/2}}\right|=\left|\sum_{j=0}^{3\ell}\frac{a_{i,j}x^{j}}{n^{\ell/2}}\left(1-\left(1+\frac{m}{n}\right)^{\frac{j-\ell}{2}}\right)\right|\lesssim\frac{m(1+x^{3\ell})}{n^{\ell/2+1}}

    whenever m=O(n)m=O(n). Combining the above two inequalities yields

    A2m2n3.\displaystyle A_{2}\lesssim\frac{m^{2}}{n^{3}}. (21)
  3. 3.

    Upper bounding A3A_{3}: note that the tail probability of γn\gamma_{n} outside [c,c][-c,c] is at most exp(Ω(n))\exp(-\Omega(n)), and the same holds for Qn,i+Q_{n,i}^{+} and Qn+m,i+Q_{n+m,i}^{+} by (18) and (19); integrating these tails leads to

    A3=exp(Ω(n)).\displaystyle A_{3}=\exp(-\Omega(n)). (22)
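Step (a) of (20), the closed-form squared Hellinger distance between 𝒩(0,1/n){\mathcal{N}}(0,1/n) and 𝒩(0,1/(n+m)){\mathcal{N}}(0,1/(n+m)), can be double checked numerically. The sketch below is ours and is not part of the proof; it integrates the Hellinger integrand directly and compares with the closed form and with the final bound in (b):

```python
import math

def hellinger_sq_closed(n, n_plus_m):
    """Closed form H^2(N(0,1/n), N(0,1/(n+m))) from step (a)."""
    m = n_plus_m - n
    return 1.0 - (n * n_plus_m / (n + m / 2.0) ** 2) ** 0.25

def hellinger_sq_numeric(n, n_plus_m, grid=100000, half_width=2.0):
    """H^2 = (1/2) * integral of (sqrt(p) - sqrt(q))^2, by the midpoint rule."""
    def dens(x, k):
        # density of N(0, 1/k)
        return math.sqrt(k / (2 * math.pi)) * math.exp(-k * x * x / 2)
    dx = 2 * half_width / grid
    total = 0.0
    for i in range(grid):
        x = -half_width + (i + 0.5) * dx
        total += (math.sqrt(dens(x, n)) - math.sqrt(dens(x, n_plus_m))) ** 2 * dx
    return 0.5 * total

n, m = 50, 5
closed = hellinger_sq_closed(n, n + m)
assert abs(closed - hellinger_sq_numeric(n, n + m)) < 1e-6
assert 3 * closed <= 3 * m ** 2 / (4 * n ** 2)   # step (b)
```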

In summary, a combination of (20), (21), and (22) leads to

H2(Qn,i+,Qn+m,i+)=O(m2n2+exp(Ω(n))).\displaystyle H^{2}(Q_{n,i}^{+},Q_{n+m,i}^{+})=O\left(\frac{m^{2}}{n^{2}}+\exp(-\Omega(n))\right).

Now using (B.2), (17), and |Qn,i|()1+exp(Ω(n))|Q_{n,i}|(\mathbb{R})\leq 1+\exp(-\Omega(n)), we conclude that

i=1dQn,i+i=1dQn+m,i+TV=O(mdn+(1+exp(Ω(n)))d1).\displaystyle\left\|\prod_{i=1}^{d}Q_{n,i}^{+}-\prod_{i=1}^{d}Q_{n+m,i}^{+}\right\|_{\text{TV}}=O\left(\frac{m\sqrt{d}}{n}+(1+\exp(-\Omega(n)))^{d}-1\right). (23)

Hence, when n=Ω(d)n=\Omega(\sqrt{d}) and m=O(n)m=O(n), the desired result follows from (15) and (23). For the other scenarios, we simply use that the TV distance is upper bounded by one, and the result still holds.

B.3 Proof of Theorem 5.2

Let P^n\widehat{P}_{n} be the distribution learned from the first n/2n/2 samples which achieves the χ2\chi^{2}-learning error rχ2(𝒫,n/2)r_{\chi^{2}}({\mathcal{P}},n/2), and PmixP_{\text{mix}} be the distribution of the shuffled samples (Z1,,Zn/2+m)(Z_{1},\cdots,Z_{n/2+m}) in Algorithm 2. Note that both distributions depend on the first n/2n/2 samples and are therefore random. Then the final TV distance of sample amplification is

PXn/2×Pmix(Xn/2)P(n+m)TV=𝔼Xn/2Pmix(Xn/2)P(n/2+m)TV,\displaystyle\|P_{X^{n/2}}\times P_{\text{mix}}(X^{n/2})-P^{\otimes(n+m)}\|_{\text{TV}}=\mathbb{E}_{X^{n/2}}\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}},

which is the expected TV distance between the mixture distribution and the product distribution.

By Lemma 5.8, for any realization of Xn/2X^{n/2} it holds that

χ2(Pmix,P(n/2+m))(1+mn/2+mχ2(P^n,P))m1.\displaystyle\chi^{2}\left(P_{\text{mix}},P^{\otimes(n/2+m)}\right)\leq\left(1+\frac{m}{n/2+m}\chi^{2}(\widehat{P}_{n},P)\right)^{m}-1.

Since (3) shows that DKL(PQ)log(1+χ2(P,Q))D_{\text{KL}}(P\|Q)\leq\log(1+\chi^{2}(P,Q)), we have

DKL(PmixP(n/2+m))mlog(1+mn/2+mχ2(P^n,P))m2n/2+mχ2(P^n,P).\displaystyle D_{\text{KL}}(P_{\text{mix}}\|P^{\otimes(n/2+m)})\leq m\log\left(1+\frac{m}{n/2+m}\chi^{2}(\widehat{P}_{n},P)\right)\leq\frac{m^{2}}{n/2+m}\chi^{2}(\widehat{P}_{n},P).

Consequently, by (3) again and the concavity of xxx\mapsto\sqrt{x}, we have

𝔼Xn/2Pmix(Xn/2)P(n/2+m)TV\displaystyle\mathbb{E}_{X^{n/2}}\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}} 𝔼Xn/212DKL(Pmix(Xn/2)P(n/2+m))\displaystyle\leq\mathbb{E}_{X^{n/2}}\sqrt{\frac{1}{2}D_{\text{KL}}(P_{\text{mix}}(X^{n/2})\|P^{\otimes(n/2+m)})}
𝔼Xn/2m2n+2mχ2(P^n(Xn/2),P)\displaystyle\leq\mathbb{E}_{X^{n/2}}\sqrt{\frac{m^{2}}{n+2m}\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)}
m2n+2m𝔼Xn/2[χ2(P^n(Xn/2),P)]\displaystyle\leq\sqrt{\frac{m^{2}}{n+2m}\mathbb{E}_{X^{n/2}}[\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)]}
m2n+2mrχ2(𝒫,n/2).\displaystyle\leq\sqrt{\frac{m^{2}}{n+2m}\cdot r_{\chi^{2}}({\mathcal{P}},n/2)}.
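The chain above reduces the amplification guarantee to a single χ²-learning rate. The sketch below (toy numbers, discrete family, and estimator are ours) evaluates the resulting bound from the last display for a fixed pair of distributions:

```python
import math

def chi2_div(p_hat, p):
    """chi^2(p_hat, p) = sum_i (p_hat_i - p_i)^2 / p_i for discrete distributions."""
    return sum((a - b) ** 2 / b for a, b in zip(p_hat, p))

def amplification_tv_bound(n, m, chi2_err):
    """Final display of the proof: TV <= sqrt(m^2/(n+2m) * chi2_err)."""
    kl_bound = m ** 2 / (n / 2 + m) * chi2_err   # bound on KL(P_mix || P^(n/2+m))
    return math.sqrt(kl_bound / 2)               # Pinsker: TV <= sqrt(KL/2)

p, p_hat = [0.3, 0.7], [0.35, 0.65]
n, m = 1000, 50
bound = amplification_tv_bound(n, m, chi2_div(p_hat, p))
assert 0 < bound < 1
```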

B.4 Proof of Theorem 5.5

The main arguments are essentially the same as in the proof of Theorem 5.2. In particular, for each j[d]j\in[d] we have

DKL(Pmix,jPj(n/2+m))m2n/2+mχ2(P^n,j,Pj).\displaystyle D_{\text{KL}}(P_{\text{mix},j}\|P_{j}^{\otimes(n/2+m)})\leq\frac{m^{2}}{n/2+m}\chi^{2}(\widehat{P}_{n,j},P_{j}).

Since both PmixP_{\text{mix}} and PP have product structures, the chain rule of KL divergence implies that

DKL(PmixP(n/2+m))m2n/2+mj=1dχ2(P^n,j,Pj).\displaystyle D_{\text{KL}}(P_{\text{mix}}\|P^{\otimes(n/2+m)})\leq\frac{m^{2}}{n/2+m}\sum_{j=1}^{d}\chi^{2}(\widehat{P}_{n,j},P_{j}).

Now the rest of the proof follows the last few lines of the proof of Theorem 5.2.

B.5 Proof of Theorem 6.2

The proof relies on Lemma 6.1 and a classical statistical result known as Anderson’s lemma. Without loss of generality we assume Σ=Id\Sigma=I_{d}. Choose L(θ,θ^)=(θθ^)L(\theta,\widehat{\theta})=\ell(\theta-\widehat{\theta}), with a bowl-shaped (i.e. symmetric and quasi-convex) loss function ()[0,1]\ell(\cdot)\in[0,1]. Then Anderson’s lemma (see, e.g. [VdV00, Lemma 8.5]) implies that the minimax estimator under LL and nn samples is θ^n=n1i=1nXi\widehat{\theta}_{n}=n^{-1}\sum_{i=1}^{n}X_{i}, thus

r(𝒫,n,L)=𝔼[(Zn)],\displaystyle r({\mathcal{P}},n,L)=\mathbb{E}\left[\ell\left(\frac{Z}{\sqrt{n}}\right)\right],

with Z𝒩(0,Id)Z\sim{\mathcal{N}}(0,I_{d}). Choosing (x)=𝟙(x2r)\ell(x)=\mathbbm{1}(\|x\|_{2}\geq r) with the parameter r>0r>0 determined by

(Zn2r)(Zn+m2r)=𝒩(0,Idn)𝒩(0,Idn+m)TV,\displaystyle\mathbb{P}\left(\left\|\frac{Z}{\sqrt{n}}\right\|_{2}\geq r\right)-\mathbb{P}\left(\left\|\frac{Z}{\sqrt{n+m}}\right\|_{2}\geq r\right)=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{\rm TV}},

then clearly (){0,1}\ell(\cdot)\in\{0,1\} is bowl-shaped. Hence, for this choice of LL, Lemma 6.1 gives that

ε(𝒫,n,m)r(𝒫,n,L)r(𝒫,n+m,L)=𝒩(0,Idn)𝒩(0,Idn+m)TV,\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq r({\mathcal{P}},n,L)-r({\mathcal{P}},n+m,L)=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{\rm TV}},

and a matching upper bound is presented in Example 4.1.
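Both sides of this sandwich equal the TV distance between two centered isotropic Gaussians. For even dd this TV admits a closed form through the χd2\chi^{2}_{d} distribution function: equating the two densities gives the crossing radius ρ2=(d/m)log(1+m/n)\rho^{2}=(d/m)\log(1+m/n), and integrating each density over the resulting ball reduces to a chi-square CDF. The sketch below is our derivation, not taken from the paper:

```python
import math

def chi2_cdf_even(x, d):
    """CDF of the chi-square distribution with even d degrees of freedom."""
    assert d % 2 == 0
    lam = x / 2.0
    term, acc = 1.0, 1.0
    for i in range(1, d // 2):
        term *= lam / i
        acc += term
    return 1.0 - math.exp(-lam) * acc

def tv_isotropic_gaussians(n, m, d):
    """Exact TV between N(0, I_d/n) and N(0, I_d/(n+m)), d even."""
    rho2 = (d / m) * math.log((n + m) / n)   # squared radius where densities cross
    return chi2_cdf_even((n + m) * rho2, d) - chi2_cdf_even(n * rho2, d)

tv = tv_isotropic_gaussians(n=100, m=10, d=4)
assert 0 < tv < 1
# TV increases with m, consistent with the m*sqrt(d)/n scaling in the text
assert tv < tv_isotropic_gaussians(n=100, m=20, d=4)
```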

B.6 Proof of Theorem 6.3

Based on the discussions above Theorem 6.3, we choose an arbitrary open ball Θ0Θ\Theta_{0}\subseteq\Theta, and pick any dd-dimensional ball Bd(μ0;r)B_{d}(\mu_{0};r) contained in A(Θ0)\nabla A(\Theta_{0}). Let θ0\theta_{0} be the center of Θ0\Theta_{0}, and after a proper affine transformation we assume that 2A(θ0)=Id\nabla^{2}A(\theta_{0})=I_{d}. Consider a truncated Gaussian prior ν\nu with

μ𝒩(μ0,cnrn2)μBd(μ0;rn),\displaystyle\mu\sim{\mathcal{N}}(\mu_{0},c_{n}r_{n}^{2})\mid\mu\in B_{d}(\mu_{0};r_{n}),

where cn>0c_{n}>0 and rn(0,r)r_{n}\in(0,r) are parameters to be determined later. Using the diffeomorphism A\nabla A, there is a prior ν0\nu_{0} on Θ0\Theta_{0} which induces the above prior on A(Θ0)\nabla A(\Theta_{0}). Moreover, as Θ¯0\overline{\Theta}_{0}, the closure of Θ0\Theta_{0}, is a compact set, a weaker Assumption 2 with the supremum restricted to Θ0\Theta_{0} holds for k=3k=3.

Now we analyze the Bayes risks under the above prior ν0\nu_{0} and the loss

L(θ,μ^)=(A(θ)μ^)[0,1],\displaystyle L(\theta,\widehat{\mu})=\ell(\nabla A(\theta)-\widehat{\mu})\in[0,1],

where \ell is a bowl-shaped function. By sufficiency, it remains to consider the class of estimators depending only on the sufficient statistic Tn(Xn)=n1i=1nT(Xi)T_{n}(X^{n})=n^{-1}\sum_{i=1}^{n}T(X_{i}). Let μ^(Tn)\widehat{\mu}(T_{n}) be such an estimator, then under each θΘ0\theta\in\Theta_{0}, we have

𝔼θ[L(θ,μ^(Tn))]𝔼θ[L(θ,μ^(Zn))](Tn)(Zn)TV,\displaystyle\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(T_{n}))]\geq\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))]-\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(Z_{n})\|_{\text{TV}},

where Znθ𝒩(A(θ),Id/n)Z_{n}\mid\theta\sim{\mathcal{N}}(\nabla A(\theta),I_{d}/n). To upper bound this TV distance, the proof of Theorem 4.5 together with Assumption 2 applied to Θ0\Theta_{0} gives that

(Tn)𝒩(A(θ),2A(θ)/n)TVC1n,\displaystyle\|{\mathcal{L}}(T_{n})-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)\|_{\text{TV}}\leq\frac{C_{1}}{\sqrt{n}},

where C1>0C_{1}>0 only depends on the exponential family. Moreover,

𝒩(A(θ),Id/n)𝒩(A(θ),2A(θ)/n)TV\displaystyle\|{\mathcal{N}}(\nabla A(\theta),I_{d}/n)-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)\|_{\text{TV}}
=𝒩(0,Id)𝒩(0,2A(θ))TV\displaystyle=\|{\mathcal{N}}(0,I_{d})-{\mathcal{N}}(0,\nabla^{2}A(\theta))\|_{\text{TV}}
(a)322A(θ)IdF(b)C2rn,\displaystyle\overset{\rm(a)}{\leq}\frac{3}{2}\|\nabla^{2}A(\theta)-I_{d}\|_{\text{F}}\overset{\rm(b)}{\leq}C_{2}r_{n},

where (a) follows from [DMR18, Theorem 1.1], and (b) makes use of the analytical property of A(θ)A(\theta) (see, e.g. [ML08, Theorem 1.17]), the assumption that A(θ)A(θ0)2rn\|\nabla A(\theta)-\nabla A(\theta_{0})\|_{2}\leq r_{n}, and 2A(θ0)=Id\nabla^{2}A(\theta_{0})=I_{d}. Again, here the constant C2>0C_{2}>0 is independent of nn. Combining the above inequalities yields that

𝔼θ[L(θ,μ^(Tn))]𝔼θ[L(θ,μ^(Zn))]C1nC2rn.\displaystyle\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(T_{n}))]\geq\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))]-\frac{C_{1}}{\sqrt{n}}-C_{2}r_{n}. (24)

Next we lower bound the Bayes risk 𝔼ν0𝔼θ[L(θ,μ^(Zn))]\mathbb{E}_{\nu_{0}}\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))] when TnT_{n} has been replaced by ZnZ_{n}. Let ν\nu^{\prime} be the non-truncated Gaussian distribution 𝒩(μ0,cnrn2){\mathcal{N}}(\mu_{0},c_{n}r_{n}^{2}), and Gn𝒩(0,Id/n)G_{n}\sim{\mathcal{N}}(0,I_{d}/n), then

𝔼ν0𝔼θ[L(θ,μ^(Zn))]\displaystyle\mathbb{E}_{\nu_{0}}\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))] =𝔼μν[(μμ^(Gn+μ))]\displaystyle=\mathbb{E}_{\mu\sim\nu}[\ell(\mu-\widehat{\mu}(G_{n}+\mu))]
𝔼μν[(μμ^(Gn+μ))]ν({μBd(μ0;rn)})\displaystyle\geq\mathbb{E}_{\mu\sim\nu^{\prime}}[\ell(\mu-\widehat{\mu}(G_{n}+\mu))]-\nu^{\prime}(\left\{\mu\notin B_{d}(\mu_{0};r_{n})\right\})
(c)𝔼μν[(μμ^(Gn+μ))]eC3/cn\displaystyle\overset{\rm(c)}{\geq}\mathbb{E}_{\mu\sim\nu^{\prime}}[\ell(\mu-\widehat{\mu}(G_{n}+\mu))]-e^{-C_{3}/c_{n}}
(d)𝔼[(ncnrn21+ncnrn2Gn)]eC3/cn\displaystyle\overset{\rm(d)}{\geq}\mathbb{E}\left[\ell\left(\sqrt{\frac{nc_{n}r_{n}^{2}}{1+nc_{n}r_{n}^{2}}}\cdot G_{n}\right)\right]-e^{-C_{3}/c_{n}}
(e)𝔼[(Gn)]eC3/cnC41+ncnrn2.\displaystyle\overset{\rm(e)}{\geq}\mathbb{E}[\ell(G_{n})]-e^{-C_{3}/c_{n}}-\frac{C_{4}}{1+nc_{n}r_{n}^{2}}. (25)

Here (c) follows from the Gaussian tail probability, (d) makes use of Anderson’s lemma and the fact that under ν\nu^{\prime}, the posterior distribution of μ\mu given ZnZ_{n} is Gaussian with covariance cnrn2Id/(1+ncnrn2)c_{n}r_{n}^{2}I_{d}/(1+nc_{n}r_{n}^{2}). The final inequality (e) is due to the following upper bound on the TV distance:

𝒩(0,Σ)𝒩(0,cΣ)TV32(c1)IdF=3d|c1|2.\displaystyle\|{\mathcal{N}}(0,\Sigma)-{\mathcal{N}}(0,c\Sigma)\|_{\text{TV}}\leq\frac{3}{2}\|(c-1)I_{d}\|_{\text{F}}=\frac{3\sqrt{d}|c-1|}{2}.

Now combining (24) and (25), we obtain a lower bound on the Bayes risk:

rB(𝒫,n,ν0,L)𝔼[(Gn)]C1nC2rneC3/cnC41+ncnrn2.\displaystyle r_{\text{B}}({\mathcal{P}},n,\nu_{0},L)\geq\mathbb{E}[\ell(G_{n})]-\frac{C_{1}}{\sqrt{n}}-C_{2}r_{n}-e^{-C_{3}/c_{n}}-\frac{C_{4}}{1+nc_{n}r_{n}^{2}}.

Similarly, by reversing all the above inequalities, an upper bound of the Bayes risk with n+mn+m samples is also available:

rB(𝒫,n+m,ν0,L)𝔼[(Gn+m)]+C1n+C2rn+eC3/cn+C41+ncnrn2.\displaystyle r_{\text{B}}({\mathcal{P}},n+m,\nu_{0},L)\leq\mathbb{E}[\ell(G_{n+m})]+\frac{C_{1}}{\sqrt{n}}+C_{2}r_{n}+e^{-C_{3}/c_{n}}+\frac{C_{4}}{1+nc_{n}r_{n}^{2}}.

By the proof of Theorem 6.2, a proper choice of \ell satisfies that

𝔼[(Gn)]𝔼[(Gn+m)]=𝒩(0,Idn)𝒩(0,Idn+m)TV=Ω(mdn1),\displaystyle\mathbb{E}[\ell(G_{n})]-\mathbb{E}[\ell(G_{n+m})]=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{TV}}=\Omega\left(\frac{m\sqrt{d}}{n}\wedge 1\right),

where the last step is again due to [DMR18, Theorem 1.1]. Consequently, by choosing cn=Θ(1/logn)c_{n}=\Theta(1/\log n) and rn=Θ((logn/n)1/3)r_{n}=\Theta((\log n/n)^{1/3}), the desired result follows from Lemma 6.1.

B.7 Proof of Theorem 6.4

We will apply Lemma 6.1 to the uniform prior μ\mu over 2d2^{d} points j=1d{θj,+,θj,}\prod_{j=1}^{d}\{\theta_{j,+},\theta_{j,-}\}, and the loss function L:Θ×Θ[0,1]L:\Theta\times\Theta\to[0,1] with

L((θ1,,θd),(θ^1,,θ^d))=𝟙(j=1d𝟙(θj=θ^j)d2+12j=1dαj).\displaystyle L((\theta_{1},\cdots,\theta_{d}),(\widehat{\theta}_{1},\cdots,\widehat{\theta}_{d}))=\mathbbm{1}\left(\sum_{j=1}^{d}\mathbbm{1}(\theta_{j}=\widehat{\theta}_{j})\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right).

We compute the Bayes risks rB(𝒫,n,μ,L)r_{\text{B}}({\mathcal{P}},n,\mu,L) and rB(𝒫,n+m,μ,L)r_{\text{B}}({\mathcal{P}},n+m,\mu,L) in this scenario.

Under the uniform prior μ\mu, it is straightforward to see that the posterior distribution of θ\theta given XnX^{n} is a product distribution j=1dpθjXjn\prod_{j=1}^{d}p_{\theta_{j}\mid X_{j}^{n}}, with pθjXjnp_{\theta_{j}\mid X_{j}^{n}} supported on two elements {θj,+,θj,}\{\theta_{j,+},\theta_{j,-}\}, and

pθjXjn(θj,+)\displaystyle p_{\theta_{j}\mid X_{j}^{n}}(\theta_{j,+}) =i=1npθj,+(Xi,j)i=1npθj,+(Xi,j)+i=1npθj,(Xi,j),\displaystyle=\frac{\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})}{\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})+\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})},
pθjXjn(θj,)\displaystyle p_{\theta_{j}\mid X_{j}^{n}}(\theta_{j,-}) =i=1npθj,(Xi,j)i=1npθj,+(Xi,j)+i=1npθj,(Xi,j).\displaystyle=\frac{\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})}{\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})+\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})}.

Then given XnX^{n}, the Bayes estimator θ^(Xn)Θ\widehat{\theta}(X^{n})\in\Theta which minimizes the posterior expected loss LL is any minimizer ξΘ\xi\in\Theta of the expression

θXn(j=1d𝟙(θj=ξj)d2+12j=1dαj),\displaystyle\mathbb{P}_{\theta\mid X^{n}}\left(\sum_{j=1}^{d}\mathbbm{1}(\theta_{j}=\xi_{j})\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right),

which is easily seen to be

θ^j(Xn)=θ^j(Xjn)=θj,+(θj,+θj,)𝟙(i=1npθj,+(Xi,j)i=1npθj,(Xi,j)).\displaystyle\widehat{\theta}_{j}(X^{n})=\widehat{\theta}_{j}(X_{j}^{n})=\theta_{j,-}+(\theta_{j,+}-\theta_{j,-})\cdot\mathbbm{1}\left(\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})\geq\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})\right).

For the above Bayes estimator, the random variables 𝟙(θj=θ^j(Xn))\mathbbm{1}(\theta_{j}=\widehat{\theta}_{j}(X^{n})) are mutually independent, with the mean value

pn,j\displaystyle p_{n,j} =(θj=θ^j(Xn))\displaystyle=\mathbb{P}(\theta_{j}=\widehat{\theta}_{j}(X^{n}))
=12pθj,+n(i=1npθj,+(Xi,j)i=1npθj,(Xi,j))+12pθj,n(i=1npθj,+(Xi,j)<i=1npθj,(Xi,j))\displaystyle=\frac{1}{2}p_{\theta_{j,+}}^{\otimes n}\left(\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})\geq\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})\right)+\frac{1}{2}p_{\theta_{j,-}}^{\otimes n}\left(\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})<\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})\right)
=12(1+pθj,+npθj,nTV).\displaystyle=\frac{1}{2}\left(1+\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}}\right).

Consequently, we have pn,j(1+αj)/2ε/(2d)p_{n,j}\leq(1+\alpha_{j})/2-\varepsilon/(2\sqrt{d}) for each j[d]j\in[d] by (10), and thus

rB(𝒫,n,μ,L)\displaystyle r_{\text{B}}({\mathcal{P}},n,\mu,L) =(j=1d𝖡𝖾𝗋𝗇(pn,j)d2+12j=1dαj)\displaystyle=\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}(p_{n,j})\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right)
(j=1d𝖡𝖾𝗋𝗇(1+αj2ε2d)d2+12j=1dαj).\displaystyle\geq\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right). (26)

An entirely symmetric argument leads to

rB(𝒫,n+m,μ,L)(j=1d𝖡𝖾𝗋𝗇(1+αj2+ε2d)d2+12j=1dαj).\displaystyle r_{\text{B}}({\mathcal{P}},n+m,\mu,L)\leq\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right). (27)
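The identity pn,j=(1+pθj,+npθj,nTV)/2p_{n,j}=(1+\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}})/2 for the likelihood-ratio rule can be verified by exhaustive enumeration for a toy pair of Bernoulli distributions (the parameters below are ours):

```python
import itertools
import math

def seq_prob(seq, p):
    """Probability of a binary sequence under i.i.d. Bern(p)."""
    return math.prod(p if s else 1 - p for s in seq)

def lr_test_success(n, p_plus, p_minus):
    """Success probability of the likelihood-ratio rule under a uniform prior."""
    succ = 0.0
    for seq in itertools.product([0, 1], repeat=n):
        a, b = seq_prob(seq, p_plus), seq_prob(seq, p_minus)
        # the rule declares "+" iff the "+" likelihood is at least the "-" one
        succ += 0.5 * (a if a >= b else b)
    return succ

def tv_product(n, p_plus, p_minus):
    """TV distance between the n-fold product distributions, by enumeration."""
    return 0.5 * sum(abs(seq_prob(s, p_plus) - seq_prob(s, p_minus))
                     for s in itertools.product([0, 1], repeat=n))

n_obs, p_plus, p_minus = 5, 0.8, 0.3
success = lr_test_success(n_obs, p_plus, p_minus)
assert abs(success - (1 + tv_product(n_obs, p_plus, p_minus)) / 2) < 1e-12
```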

To further lower bound (26) and upper bound (27), the following lemma shows that we may assume that the above Bernoulli distributions have the same parameter.

Lemma B.1 (Theorems 4 and 5 of [Hoe56]).

Let Xi𝖡𝖾𝗋𝗇(pi)X_{i}\sim\mathsf{Bern}(p_{i}) be independent Bernoulli random variables for i=1,,ni=1,\cdots,n, with p¯=n1i=1npi\bar{p}=n^{-1}\sum_{i=1}^{n}p_{i}. Then for 0knp¯10\leq k\leq n\bar{p}-1, we have

(i=1nXik)(𝖡(n,p¯)k),\displaystyle\mathbb{P}\left(\sum_{i=1}^{n}X_{i}\leq k\right)\leq\mathbb{P}\left(\mathsf{B}(n,\bar{p})\leq k\right),

and for knp¯k\geq n\bar{p}, we have

(i=1nXik)(𝖡(n,p¯)k).\displaystyle\mathbb{P}\left(\sum_{i=1}^{n}X_{i}\leq k\right)\geq\mathbb{P}\left(\mathsf{B}(n,\bar{p})\leq k\right).

In particular, for integers (b,c)(b,c) with bnp¯cb\leq n\bar{p}\leq c, we have

(bi=1nXic)(b𝖡(n,p¯)c).\displaystyle\mathbb{P}\left(b\leq\sum_{i=1}^{n}X_{i}\leq c\right)\geq\mathbb{P}\left(b\leq\mathsf{B}(n,\bar{p})\leq c\right).
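Lemma B.1, i.e. Hoeffding’s comparison [Hoe56], can be verified exactly on a small instance by computing the Poisson-binomial pmf with dynamic programming (the parameter values below are ours):

```python
def poisson_binomial_pmf(ps):
    """Exact pmf of a sum of independent Bern(p_i), by dynamic programming."""
    pmf = [1.0]
    for p in ps:
        pmf = [(pmf[k] if k < len(pmf) else 0.0) * (1 - p)
               + (pmf[k - 1] * p if k >= 1 else 0.0)
               for k in range(len(pmf) + 1)]
    return pmf

def cdf(pmf, k):
    """P(sum <= k)."""
    return sum(pmf[:k + 1])

ps = [0.3, 0.5, 0.6, 0.7, 0.9]          # heterogeneous parameters
n, pbar = len(ps), sum(ps) / len(ps)    # pbar = 0.6, so n * pbar = 3
mixed = poisson_binomial_pmf(ps)
binom = poisson_binomial_pmf([pbar] * n)
assert cdf(mixed, 2) <= cdf(binom, 2)   # first statement, k = 2 <= n*pbar - 1
assert cdf(mixed, 3) >= cdf(binom, 3)   # second statement, k = 3 >= n*pbar
```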

For d4/ε2d\geq 4/\varepsilon^{2}, based on the second statement of Lemma B.1, the quantity in (26) satisfies

rB(𝒫,n,μ,L)(𝖡(d,1+α2ε2d)1+α2d),\displaystyle r_{\text{B}}({\mathcal{P}},n,\mu,L)\geq\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right), (28)

with αd1j=1dαj\alpha\triangleq d^{-1}\sum_{j=1}^{d}\alpha_{j}. Similarly, based on the first statement of Lemma B.1, for (27) we have

rB(𝒫,n+m,μ,L)(𝖡(d,1+α2+ε2d)1+α2d).\displaystyle r_{\text{B}}({\mathcal{P}},n+m,\mu,L)\leq\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right). (29)

Since

d(𝖡(n,x)=k)dx=n((𝖡(n1,x)=k1)(𝖡(n1,x)=k)),\displaystyle\frac{\mathrm{d}\mathbb{P}(\mathsf{B}(n,x)=k)}{\mathrm{d}x}=n\left(\mathbb{P}(\mathsf{B}(n-1,x)=k-1)-\mathbb{P}(\mathsf{B}(n-1,x)=k)\right),

we invoke Lemma 6.1 and lower bound the Bayes risk difference as

ε(𝒫,n,m)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m) rB(𝒫,n,μ,L)rB(𝒫,n+m,μ,L)\displaystyle\geq r_{\text{B}}({\mathcal{P}},n,\mu,L)-r_{\text{B}}({\mathcal{P}},n+m,\mu,L)
(𝖡(d,1+α2ε2d)1+α2d)(𝖡(d,1+α2+ε2d)1+α2d)\displaystyle\geq\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right)-\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right)
=0k(1+α)d/21+α2ε2d1+α2+ε2dd(𝖡(d,x)=k)dxdx\displaystyle=-\sum_{0\leq k\leq(1+\alpha)d/2}\int_{\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}}\frac{\mathrm{d}\mathbb{P}(\mathsf{B}(d,x)=k)}{\mathrm{d}x}\mathrm{d}x
=d1+α2ε2d1+α2+ε2d(𝖡(d1,x)=1+α2d)dx\displaystyle=d\int_{\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}}\mathbb{P}\left(\mathsf{B}(d-1,x)=\left\lfloor\frac{1+\alpha}{2}\cdot d\right\rfloor\right)\mathrm{d}x
(a)d1+α2ε2d1+α2+ε2dc(α¯,α¯)ddx=c(α¯,α¯)ε,\displaystyle\overset{\rm(a)}{\geq}d\int_{\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}}\frac{c(\underline{\alpha},\overline{\alpha})}{\sqrt{d}}\mathrm{d}x=c(\underline{\alpha},\overline{\alpha})\varepsilon,

where (a) is due to

minp0pp1,|knp|Cn(𝖡(n,p)=k)=Ωp0,p1,C(1n)\displaystyle\min_{p_{0}\leq p\leq p_{1},|k-np|\leq C\sqrt{n}}\mathbb{P}(\mathsf{B}(n,p)=k)=\Omega_{p_{0},p_{1},C}\left(\frac{1}{\sqrt{n}}\right) (30)

for any p0,p1(0,1)p_{0},p_{1}\in(0,1) and C>0C>0, by Stirling’s approximation.

For d<4/ε2d<4/\varepsilon^{2}, we first note the following identity:

ddx|x=0(i=1n𝖡𝖾𝗋𝗇(pi+x)=k)\displaystyle\frac{\mathrm{d}}{\mathrm{d}x}\bigg{|}_{x=0}\mathbb{P}\left(\sum_{i=1}^{n}\mathsf{Bern}(p_{i}+x)=k\right)
=ddx|x=0w{0,1}n:i=1nwi=ki=1n(pi+x)wi(1pix)1wi\displaystyle=\frac{\mathrm{d}}{\mathrm{d}x}\bigg{|}_{x=0}\sum_{w\in\{0,1\}^{n}:\sum_{i=1}^{n}w_{i}=k}\prod_{i=1}^{n}(p_{i}+x)^{w_{i}}(1-p_{i}-x)^{1-w_{i}}
=i=1n(w\i{0,1}n1:jiwj=k1jipjwj(1pj)1wjw\i{0,1}n1:jiwj=kjipjwj(1pj)1wj)\displaystyle=\sum_{i=1}^{n}\left(\sum_{w_{\backslash i}\in\{0,1\}^{n-1}:\sum_{j\neq i}w_{j}=k-1}\prod_{j\neq i}p_{j}^{w_{j}}(1-p_{j})^{1-w_{j}}-\sum_{w_{\backslash i}\in\{0,1\}^{n-1}:\sum_{j\neq i}w_{j}=k}\prod_{j\neq i}p_{j}^{w_{j}}(1-p_{j})^{1-w_{j}}\right)
=i=1n[(ji𝖡𝖾𝗋𝗇(pj)=k1)(ji𝖡𝖾𝗋𝗇(pj)=k)].\displaystyle=\sum_{i=1}^{n}\left[\mathbb{P}\left(\sum_{j\neq i}\mathsf{Bern}(p_{j})=k-1\right)-\mathbb{P}\left(\sum_{j\neq i}\mathsf{Bern}(p_{j})=k\right)\right].
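This identity can be checked by central finite differences against the exact distribution of a sum of independent Bernoullis (the DP helper and toy parameters are ours):

```python
def poisson_binomial_pmf(ps):
    """Exact pmf of a sum of independent Bern(p_i), by dynamic programming."""
    pmf = [1.0]
    for p in ps:
        pmf = [(pmf[k] if k < len(pmf) else 0.0) * (1 - p)
               + (pmf[k - 1] * p if k >= 1 else 0.0)
               for k in range(len(pmf) + 1)]
    return pmf

ps = [0.55, 0.6, 0.65, 0.7]
k, h = 2, 1e-6
# left-hand side: central finite difference of P(sum Bern(p_i + x) = k) at x = 0
num = (poisson_binomial_pmf([p + h for p in ps])[k]
       - poisson_binomial_pmf([p - h for p in ps])[k]) / (2 * h)
# right-hand side: sum_i [P(sum_{j != i} = k - 1) - P(sum_{j != i} = k)]
rhs = sum(poisson_binomial_pmf(ps[:i] + ps[i + 1:])[k - 1]
          - poisson_binomial_pmf(ps[:i] + ps[i + 1:])[k]
          for i in range(len(ps)))
assert abs(num - rhs) < 1e-6
```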

Based on (26) and (27), we then have

ε(𝒫,n,m)rB(𝒫,n,μ,L)rB(𝒫,n+m,μ,L)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq r_{\text{B}}({\mathcal{P}},n,\mu,L)-r_{\text{B}}({\mathcal{P}},n+m,\mu,L)
(j=1d𝖡𝖾𝗋𝗇(1+αj2ε2d)(1+α)d2)(j=1d𝖡𝖾𝗋𝗇(1+αj2+ε2d)(1+α)d2)\displaystyle\geq\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{(1+\alpha)d}{2}\right)-\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{(1+\alpha)d}{2}\right)
=0k(1+α)d/2ε2dε2dddx(j=1d𝖡𝖾𝗋𝗇(1+αj2+x)=k)dx\displaystyle=-\sum_{0\leq k\leq(1+\alpha)d/2}\int_{-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{\varepsilon}{2\sqrt{d}}}\frac{\mathrm{d}}{\mathrm{d}x}\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+x\right)=k\right)\mathrm{d}x
=i=1dε2dε2d(ji𝖡𝖾𝗋𝗇(1+αj2+x)=(1+α)d2)dx.\displaystyle=\sum_{i=1}^{d}\int_{-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{\varepsilon}{2\sqrt{d}}}\mathbb{P}\left(\sum_{j\neq i}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+x\right)=\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor\right)\mathrm{d}x.

We will show that the integrand is uniformly of the order Ω(1/d)\Omega(1/\sqrt{d}). To this end, note that for |x|ε/(2d)<1/d|x|\leq\varepsilon/(2\sqrt{d})<1/d as d<4/ε2d<4/\varepsilon^{2}, it holds that

ji(1+αj2+x)\displaystyle\sum_{j\neq i}\left(\frac{1+\alpha_{j}}{2}+x\right) <j=1d1+αj2+1(1+α)d2+2,\displaystyle<\sum_{j=1}^{d}\frac{1+\alpha_{j}}{2}+1\leq\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor+2,
ji(1+αj2+x)\displaystyle\sum_{j\neq i}\left(\frac{1+\alpha_{j}}{2}+x\right) >j=1d1+αj211(1+α)d22.\displaystyle>\sum_{j=1}^{d}\frac{1+\alpha_{j}}{2}-1-1\geq\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor-2.

Consequently, by the last statement of Lemma B.1, one has

(|ji𝖡𝖾𝗋𝗇(1+αj2+x)(1+α)d2|2)\displaystyle\mathbb{P}\left(\left|\sum_{j\neq i}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+x\right)-\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor\right|\leq 2\right)
(|𝖡(d1,1d1ji(1+αj2+x))(1+α)d2|2)=Ωα¯,α¯(1d),\displaystyle\geq\mathbb{P}\left(\left|\mathsf{B}\left(d-1,\frac{1}{d-1}\sum_{j\neq i}\left(\frac{1+\alpha_{j}}{2}+x\right)\right)-\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor\right|\leq 2\right)=\Omega_{\underline{\alpha},\overline{\alpha}}\left(\frac{1}{\sqrt{d}}\right),

where the last step is again due to (30). The above display is a lower bound for the probability of a size-55 set, and in view of the following lemma, the same Ωα¯,α¯(1/d)\Omega_{\underline{\alpha},\overline{\alpha}}(1/\sqrt{d}) lower bound also holds for the probability of any singleton. Plugging this lower bound back into the integral then yields ε(𝒫,n,m)=Ωα¯,α¯(ε)\varepsilon^{\star}({\mathcal{P}},n,m)=\Omega_{\underline{\alpha},\overline{\alpha}}(\varepsilon), as desired.

Lemma B.2.

Let p1,,pn[a,b]p_{1},\cdots,p_{n}\in[a,b] with 0<ab<10<a\leq b<1, and cnk<k+1(1c)ncn\leq k<k+1\leq(1-c)n for some c>0c>0. Define

f(k)=(i=1n𝖡𝖾𝗋𝗇(pi)=k).\displaystyle f(k)=\mathbb{P}\left(\sum_{i=1}^{n}\mathsf{Bern}(p_{i})=k\right).

Then there exists an absolute constant C=C(a,b,c)<C=C(a,b,c)<\infty such that

C1f(k+1)f(k)C.\displaystyle C^{-1}\leq\frac{f(k+1)}{f(k)}\leq C.
Proof.

Let Wk={w{0,1}n:i=1nwi=k}W_{k}=\{w\in\{0,1\}^{n}:\sum_{i=1}^{n}w_{i}=k\}, and g(w)=i=1npiwi(1pi)1wig(w)=\prod_{i=1}^{n}p_{i}^{w_{i}}(1-p_{i})^{1-w_{i}}. Then it is clear that

f(k)=wWkg(w).\displaystyle f(k)=\sum_{w\in W_{k}}g(w).

Call two binary vectors ww and ww^{\prime} neighbors (denoted by www\sim w^{\prime}) if they differ in exactly one coordinate. It is clear that every wWkw\in W_{k} has nkn-k neighbors in Wk+1W_{k+1}, and every wWk+1w\in W_{k+1} has k+1k+1 neighbors in WkW_{k}. Moreover, for www\sim w^{\prime},

g(w)g(w)max{ba,1a1b}=:ρ.\displaystyle\frac{g(w)}{g(w^{\prime})}\leq\max\left\{\frac{b}{a},\frac{1-a}{1-b}\right\}=:\rho.

Consequently, by double counting,

f(k+1)\displaystyle f(k+1) =wWk+1g(w)=1k+1wWkwWk+1:wwg(w)\displaystyle=\sum_{w\in W_{k+1}}g(w)=\frac{1}{k+1}\sum_{w^{\prime}\in W_{k}}\sum_{w\in W_{k+1}:w\sim w^{\prime}}g(w)
ρk+1wWkwWk+1:wwg(w)=(nk)ρk+1wWkg(w)\displaystyle\leq\frac{\rho}{k+1}\sum_{w^{\prime}\in W_{k}}\sum_{w\in W_{k+1}:w\sim w^{\prime}}g(w^{\prime})=\frac{(n-k)\rho}{k+1}\sum_{w^{\prime}\in W_{k}}g(w^{\prime})
=(nk)ρk+1f(k)(1c)ρcf(k).\displaystyle=\frac{(n-k)\rho}{k+1}f(k)\leq\frac{(1-c)\rho}{c}f(k).

The other inequality can be established analogously. ∎
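The two-sided ratio bound of Lemma B.2 can likewise be verified exactly on a small instance (the parameter values are illustrative; the DP recomputes the Poisson-binomial pmf):

```python
def poisson_binomial_pmf(ps):
    """Exact pmf of a sum of independent Bern(p_i), by dynamic programming."""
    pmf = [1.0]
    for p in ps:
        pmf = [(pmf[k] if k < len(pmf) else 0.0) * (1 - p)
               + (pmf[k - 1] * p if k >= 1 else 0.0)
               for k in range(len(pmf) + 1)]
    return pmf

a, b, c = 0.3, 0.7, 0.25
ps = [0.3, 0.4, 0.45, 0.55, 0.6, 0.65, 0.7, 0.35]   # all within [a, b]
n = len(ps)
rho = max(b / a, (1 - a) / (1 - b))
C = (1 - c) * rho / c                   # the constant from the proof
f = poisson_binomial_pmf(ps)
for k in range(int(c * n), n - int(c * n)):   # c*n <= k < k+1 <= (1-c)*n
    assert 1 / C <= f[k + 1] / f[k] <= C
```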

B.8 Proof of Theorem 6.5

For each j[d]j\in[d], we pick two points θj,+,θj,Θj\theta_{j,+},\theta_{j,-}\in\Theta_{j} in Assumption 4. We first aim to show that

0.09ε1pθj,+npθj,nTV\displaystyle 0.09\triangleq\varepsilon_{1}^{\prime}\leq\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}} ε10.6,\displaystyle\leq\varepsilon_{1}\triangleq 0.6, (31)
0.99995ε2pθj,+20npθj,20nTV\displaystyle 0.99995\triangleq\varepsilon_{2}^{\prime}\geq\|p_{\theta_{j,+}}^{\otimes 20n}-p_{\theta_{j,-}}^{\otimes 20n}\|_{\text{TV}} ε20.86.\displaystyle\geq\varepsilon_{2}\triangleq 0.86. (32)

The proofs of (31) and (32) rely on the tensorization property of the Hellinger distance

H2(i=1dPi,i=1dQi)=1i=1d(1H2(Pi,Qi)),\displaystyle H^{2}\left(\prod_{i=1}^{d}P_{i},\prod_{i=1}^{d}Q_{i}\right)=1-\prod_{i=1}^{d}\left(1-H^{2}(P_{i},Q_{i})\right),

and the relationship between TV and Hellinger distance in (2). For example, for (31), we have

pθj,+npθj,nTV\displaystyle\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}} 1(1H2(pθj,+n,pθj,n))2\displaystyle\leq\sqrt{1-(1-H^{2}(p_{\theta_{j,+}}^{\otimes n},p_{\theta_{j,-}}^{\otimes n}))^{2}}
=1(1H2(pθj,+,pθj,))2n\displaystyle=\sqrt{1-\left(1-H^{2}(p_{\theta_{j,+}},p_{\theta_{j,-}})\right)^{2n}}
1(115n)2n\displaystyle\leq\sqrt{1-\left(1-\frac{1}{5n}\right)^{2n}}
925=0.6.\displaystyle\leq\sqrt{\frac{9}{25}}=0.6.

Applying the other inequality, for (32) we have

pθj,+20npθj,20nTV\displaystyle\|p_{\theta_{j,+}}^{\otimes 20n}-p_{\theta_{j,-}}^{\otimes 20n}\|_{\text{TV}} H2(pθj,+20n,pθj,20n)\displaystyle\geq H^{2}(p_{\theta_{j,+}}^{\otimes 20n},p_{\theta_{j,-}}^{\otimes 20n})
1(1110n)20n\displaystyle\geq 1-\left(1-\frac{1}{10n}\right)^{20n}
1exp(2)>0.86.\displaystyle\geq 1-\exp(-2)>0.86.

The other inequalities involving ε1\varepsilon_{1}^{\prime} and ε2\varepsilon_{2}^{\prime} can be established analogously.
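The numerical constants in (31) and (32) can be confirmed directly, assuming H2(pθj,+,pθj,)1/(5n)H^{2}(p_{\theta_{j,+}},p_{\theta_{j,-}})\leq 1/(5n) for (31) and 1/(10n)\geq 1/(10n) for (32), as in the two displays above:

```python
def tv_upper(n):
    """Upper bound in (31), from H^2(p_+, p_-) <= 1/(5n)."""
    return (1 - (1 - 1 / (5 * n)) ** (2 * n)) ** 0.5

def tv_lower(n):
    """Lower bound in (32), from H^2(p_+, p_-) >= 1/(10n)."""
    return 1 - (1 - 1 / (10 * n)) ** (20 * n)

# the first bound is largest at n = 1, where it equals sqrt(9/25) = 0.6;
# the second decreases towards 1 - exp(-2) > 0.86
for n in range(1, 200):
    assert tv_upper(n) <= 0.6 + 1e-12
    assert tv_lower(n) >= 0.86
```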

Next, for the choice of m=cεn/dm=\lceil c\varepsilon n/\sqrt{d}\rceil in the statement of Theorem 6.5, we show that there exists nj[n,20n]n_{j}\in[n,20n] such that

pθj,+(nj+m)pθj,(nj+m)TVpθj,+njpθj,njTVε2ε119d/(cε).\displaystyle\|p_{\theta_{j,+}}^{\otimes(n_{j}+m)}-p_{\theta_{j,-}}^{\otimes(n_{j}+m)}\|_{\text{TV}}-\|p_{\theta_{j,+}}^{\otimes n_{j}}-p_{\theta_{j,-}}^{\otimes n_{j}}\|_{\text{TV}}\geq\frac{\varepsilon_{2}-\varepsilon_{1}}{\lceil 19\sqrt{d}/(c\varepsilon)\rceil}. (33)

To prove (33), first note that tfj(t)pθj,+tpθj,tTVt\mapsto f_{j}(t)\triangleq\|p_{\theta_{j,+}}^{\otimes t}-p_{\theta_{j,-}}^{\otimes t}\|_{\text{TV}} is non-decreasing by the data-processing property of the TV distance. Moreover, by (31) and (32), we have fj(n)ε1f_{j}(n)\leq\varepsilon_{1} and fj(20n)ε2f_{j}(20n)\geq\varepsilon_{2}. Consequently,

ε2ε1\displaystyle\varepsilon_{2}-\varepsilon_{1} fj(20n)fj(n)\displaystyle\leq f_{j}(20n)-f_{j}(n)
k=119n/m1[fj(n+km)fj(n+(k1)m)]+[fj(20n)fj(20nm)]\displaystyle\leq\sum_{k=1}^{\lceil 19n/m\rceil-1}[f_{j}(n+km)-f_{j}(n+(k-1)m)]+[f_{j}(20n)-f_{j}(20n-m)]
19nmmaxnnj20nm[fj(nj+m)fj(nj)],\displaystyle\leq\left\lceil\frac{19n}{m}\right\rceil\cdot\max_{n\leq n_{j}\leq 20n-m}[f_{j}(n_{j}+m)-f_{j}(n_{j})],

which gives (33). In addition, we also have ε1fj(nj)fj(nj+m)ε2\varepsilon_{1}^{\prime}\leq f_{j}(n_{j})\leq f_{j}(n_{j}+m)\leq\varepsilon_{2}^{\prime}.

We are now ready to apply Theorem 6.4. By (33), there exist αj(ε1,ε2)\alpha_{j}\in(\varepsilon_{1}^{\prime},\varepsilon_{2}^{\prime}) such that

pθj,+njpθj,njTV\displaystyle\|p_{\theta_{j,+}}^{\otimes n_{j}}-p_{\theta_{j,-}}^{\otimes n_{j}}\|_{\text{TV}} αjΩc(εd),\displaystyle\leq\alpha_{j}-\Omega_{c}\left(\frac{\varepsilon}{\sqrt{d}}\right),
pθj,+(nj+m)pθj,(nj+m)TV\displaystyle\|p_{\theta_{j,+}}^{\otimes(n_{j}+m)}-p_{\theta_{j,-}}^{\otimes(n_{j}+m)}\|_{\text{TV}} αj+Ωc(εd).\displaystyle\geq\alpha_{j}+\Omega_{c}\left(\frac{\varepsilon}{\sqrt{d}}\right).

Therefore, Theorem 6.4 shows that there is a constant c>0c^{\prime}>0 depending only on (c,ε1,ε2)(c,\varepsilon_{1}^{\prime},\varepsilon_{2}^{\prime}) such that

ε(𝒫,(n1,,nd),m)cε,\displaystyle\varepsilon^{\star}({\mathcal{P}},(n_{1},\cdots,n_{d}),m)\geq c^{\prime}\varepsilon,

where the above quantity denotes the minimax error in a new sample amplification problem: suppose we draw njn_{j} independent samples from 𝒫j{\mathcal{P}}_{j}, also independently for each j[d]j\in[d], and we aim to amplify into nj+mn_{j}+m independent samples from 𝒫j{\mathcal{P}}_{j}. In other words, the sample sizes for different dimensions may not be equal in the new problem, but the target is still to generate mm more samples. We claim that ε(𝒫,(n1,,nd),m)ε(𝒫,n,m)\varepsilon^{\star}({\mathcal{P}},(n_{1},\cdots,n_{d}),m)\leq\varepsilon^{\star}({\mathcal{P}},n,m), and thereby complete the proof. To show the claim, note that njnn_{j}\geq n for all j[d]j\in[d], hence in the new problem we can keep njnn_{j}-n samples unused for each jj, use the remaining samples to amplify into n+mn+m vectors, and add the above unused samples back to form the final amplification.

B.9 Proof of Theorem 7.1

We first prove the upper bound. Consider the distribution estimator P^n=(1,0,,0)\widehat{P}_{n}=(1,0,\cdots,0), which has a χ2\chi^{2}-divergence

χ2(P^nP)=1t1,P𝒫d,t.\displaystyle\chi^{2}(\widehat{P}_{n}\|P)=\frac{1}{t}-1,\qquad\forall P\in{\mathcal{P}}_{d,t}.

Consequently, the χ2\chi^{2}-estimation error rχ2(𝒫d,t,n)r_{\chi^{2}}({\mathcal{P}}_{d,t},n) is at most 1/t1/t, and Theorem 5.2 states that the random shuffling approach achieves an (n,n+1,0.1)(n,n+1,0.1) sample amplification if n=Θ(1/t)n=\Theta(1/t).
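The χ²-divergence of this trivial estimator can be confirmed numerically for a representative member of 𝒫d,t{\mathcal{P}}_{d,t} (the specific support layout below is our illustration):

```python
def chi2_div(p_hat, p):
    """chi^2(p_hat, p) = sum_i (p_hat_i - p_i)^2 / p_i over the support of p."""
    return sum((a - b) ** 2 / b for a, b in zip(p_hat, p) if b > 0)

t, support_size = 0.01, 5
# one member of the class: mass t on symbol 0, the rest uniform on 5 symbols
P = [t] + [(1 - t) / support_size] * support_size
P_hat = [1.0] + [0.0] * support_size    # the trivial estimator (1, 0, ..., 0)
assert abs(chi2_div(P_hat, P) - (1 / t - 1)) < 1e-9
```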

Next we prove the lower bound. Let n=1/(100t)n=1/(100t) and dd be a multiple of 100100. Consider the following prior μ\mu over 𝒫d,t{\mathcal{P}}_{d,t}: μ\mu is the uniform prior over (p0,,pd)𝒫d,t(p_{0},\cdots,p_{d})\in{\mathcal{P}}_{d,t} with p0=tp_{0}=t and the remaining 1t1-t mass evenly distributed on a uniformly random subset of [d][d] of size d/100d/100. The action space 𝒜{\mathcal{A}} is chosen to be 𝒳n+1{\mathcal{X}}^{n+1}, and the loss function is

L(P,xn+1)=1𝟙(xn+1 belongs to the support of P and contains neither the symbol 0 nor repeated symbols).\displaystyle L(P,x^{n+1})=1-\mathbbm{1}(x^{n+1}\text{ belongs to the support of }P\text{ and contains neither the symbol }0\text{ nor repeated symbols}).

We first show that rB(𝒫d,t,n+1,μ)0.1r_{\text{B}}({\mathcal{P}}_{d,t},n+1,\mu)\leq 0.1. In fact, after observing n+1n+1 samples Xn+1X^{n+1}, we simply use Xn+1X^{n+1} as the estimator under the above loss. Clearly Xn+1X^{n+1} belongs to the support of PP. For the remaining conditions,

(Xn+1 contains symbol 0)\displaystyle\mathbb{P}(X^{n+1}\text{ contains symbol }0) =1(1t)n+1=1(1t)1/(100t)+10.05,\displaystyle=1-(1-t)^{n+1}=1-(1-t)^{1/(100t)+1}\leq 0.05,
(Xn+1 contains repeated symbols)\displaystyle\mathbb{P}(X^{n+1}\text{ contains repeated symbols}) (n+12)(t2+j=1d/1001(d/100)2)\displaystyle\leq\binom{n+1}{2}\left(t^{2}+\sum_{j=1}^{d/100}\frac{1}{(d/100)^{2}}\right)
n2(t2+100d)110000+1100<0.05.\displaystyle\leq n^{2}\left(t^{2}+\frac{100}{d}\right)\leq\frac{1}{10000}+\frac{1}{100}<0.05.

By the union bound, this estimator achieves a Bayes risk at most 0.10.1, and thus rB(𝒫d,t,n+1,μ)0.1r_{\text{B}}({\mathcal{P}}_{d,t},n+1,\mu)\leq 0.1.
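The two probability bounds above can also be checked by a quick Monte Carlo sketch. The parameters below are illustrative choices, not prescribed by the proof: t = 0.001 (so that n = 1/(100t) = 10) and d = 10^6, with the random support fixed by symmetry to the first d/100 non-zero symbols.

```python
import random

# Illustrative parameters: t = 0.001 gives n = 1/(100 t) = 10; d = 10^6 is a
# multiple of 100, and by symmetry we fix the support to {0} U {1, ..., d/100}.
random.seed(0)
t, n, d = 1e-3, 10, 10**6
support_size = d // 100      # 10^4 non-zero symbols, each carrying mass (1 - t)/10^4
trials = 20000

zero_hits = repeat_hits = 0
for _ in range(trials):
    sample = [0 if random.random() < t else random.randrange(1, support_size + 1)
              for _ in range(n + 1)]
    zero_hits += 0 in sample
    repeat_hits += len(set(sample)) < n + 1

p_zero = zero_hits / trials
p_repeat = repeat_hits / trials
print(p_zero, p_repeat)   # both empirical frequencies stay well below 0.05
```

Both frequencies come out comfortably below the 0.05 thresholds entering the union bound.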

Next we show that rB(𝒫d,t,n,μ)0.9r_{\text{B}}({\mathcal{P}}_{d,t},n,\mu)\geq 0.9, which combined with Lemma 6.1 gives the desired lower bound. To this end, suppose the learner observes XnX^{n}, and consider a symbol in the estimator xn+1x^{n+1} that does not appear in XnX^{n}: if xn+1x^{n+1} contains no repeated symbols, then since it has length n+1>nn+1>n, at least one such new symbol exists. In order to have L(P,xn+1)=0L(P,x^{n+1})=0, this new symbol can neither be 00 nor appear in XnX^{n}. Moreover, given XnX^{n}, the posterior distribution of the support of PμP\sim\mu is {0,X1,,Xn}S\{0,X_{1},\cdots,X_{n}\}\cup S, where SS is uniformly distributed over

{S[d]\{X1,,Xn}:|S|=d100n}.\displaystyle\left\{S\subseteq[d]\backslash\{X_{1},\cdots,X_{n}\}:|S|=\frac{d}{100}-n\right\}.

Therefore, the posterior probability that the new symbol falls outside the support of PP (recall that it must be neither one of {X1,,Xn}\{X_{1},\cdots,X_{n}\} nor 00) is at least

1d/100ndn=0.99ddn0.99>0.9,\displaystyle 1-\frac{d/100-n}{d-n}=\frac{0.99d}{d-n}\geq 0.99>0.9,

giving the desired inequality rB(𝒫d,t,n,μ)0.9r_{\text{B}}({\mathcal{P}}_{d,t},n,\mu)\geq 0.9.

B.10 Proof of Theorem 7.2

The upper bound for sample amplification directly follows from that for learning, and it remains to show the lower bound ndn\geq d. If nd1n\leq d-1, then with probability one the observations XnX^{n} span an nn-dimensional subspace of the row space of Σ\Sigma. An (n,n+1,0.1)(n,n+1,0.1) sample amplification calls for at least one additional observation not in XnX^{n}, and for nd1n\leq d-1 this additional observation must, with probability 11, lie outside the nn-dimensional subspace spanned by XnX^{n}. However, since pd+1p\geq d+1, under a uniformly chosen dd-dimensional row space of Σ\Sigma, the posterior probability that the additional observation belongs to the row space of Σ\Sigma is zero. Consequently, an (n,n+1,0.1)(n,n+1,0.1) sample amplification is impossible if n<dn<d.

B.11 Proof of Theorem 7.3

We prove the three claims separately.

The proof of m(𝒫c,n)cn5/6m^{\star}({\mathcal{P}}_{c},n)\gtrsim_{c}n^{5/6}. Classical theory of nonparametric estimation (see, e.g. [Tsy09]) shows that there exists a density estimator f^\widehat{f} such that 𝔼ff^f22n2/3\mathbb{E}_{f}\|\widehat{f}-f\|_{2}^{2}\lesssim n^{-2/3}. Since ff is lower bounded by cc, this implies that

χ2(f^,f)f^f22ccn2/3.\displaystyle\chi^{2}(\widehat{f},f)\leq\frac{\|\widehat{f}-f\|_{2}^{2}}{c}\lesssim_{c}n^{-2/3}.

Consequently, rχ2(𝒫c,n)n2/3r_{\chi^{2}}({\mathcal{P}}_{c},n)\lesssim n^{-2/3}, and Theorem 5.2 implies that m(𝒫c,n)cn5/6m^{\star}({\mathcal{P}}_{c},n)\gtrsim_{c}n^{5/6}.

The proof of m(𝒫c,n)cn5/6m^{\star}({\mathcal{P}}_{c},n)\lesssim_{c}n^{5/6}. We construct a parametric subfamily of 𝒫c{\mathcal{P}}_{c} and invoke Theorem 6.5. Let gg be a 11-Lipschitz function supported on [0,1][0,1] with 01g(x)𝑑x=0\int_{0}^{1}g(x)dx=0 and g2>0\|g\|_{2}>0; in particular g1\|g\|_{\infty}\leq 1. Let h=n1/3h=n^{-1/3} and assume that M:=h1M:=h^{-1} is an integer. For u=(u1,,uM){±1}Mu=(u_{1},\cdots,u_{M})\in\{\pm 1\}^{M}, define

fu(x)=1+c0i=1Muihg(x(i1)hh),\displaystyle f_{u}(x)=1+c_{0}\sum_{i=1}^{M}u_{i}hg\left(\frac{x-(i-1)h}{h}\right),

where c0(0,1)c_{0}\in(0,1) is a small constant satisfying c0h1cc_{0}h\leq 1-c. Consequently, fu𝒫cf_{u}\in{\mathcal{P}}_{c} for every u{±1}Mu\in\{\pm 1\}^{M}. However, the density estimation model X1,,XnfuX_{1},\cdots,X_{n}\sim f_{u} is not a product model, so Theorem 6.5 cannot be directly applied.

To overcome the above issue, we consider a Poissonized model as follows: first we draw NPoi(n)N\sim\text{Poi}(n), and then draw NN i.i.d. samples X1,,XNfuX_{1},\cdots,X_{N}\sim f_{u}. The Poissonized model satisfies the following two properties:

  1. 1.

    For any measurable set A[0,1]A\subseteq[0,1], we have

    M(A):=|{i[N]:XiA}|Poi(nAf(x)dx).\displaystyle M(A):=|\{i\in[N]:X_{i}\in A\}|\sim\text{Poi}\left(n\int_{A}f(x)\mathrm{d}x\right).
  2. 2.

    For any collection of disjoint subsets {Ai,i1}\{A_{i},i\geq 1\}, the random variables {M(Ai),i1}\{M(A_{i}),i\geq 1\} are mutually independent.
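These two properties are easy to visualize empirically. The sketch below (with the uniform density on [0,1] and illustrative values n = 100 and M = 4 bins, not taken from the text) contrasts the Poissonized bin counts, which are independent, with the fixed-sample-size multinomial counts, which are negatively correlated:

```python
import math
import random

random.seed(1)

def poisson(lam):
    # Knuth's method; adequate for moderate lam.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

n, M, trials = 100, 4, 20000
pois_counts, fixed_counts = [], []
for _ in range(trials):
    # Poissonized model: N ~ Poi(n) uniform points, bucketed into M equal bins.
    c = [0] * M
    for _ in range(poisson(n)):
        c[random.randrange(M)] += 1
    pois_counts.append(c)
    # Fixed-sample-size model: exactly n uniform points.
    c = [0] * M
    for _ in range(n):
        c[random.randrange(M)] += 1
    fixed_counts.append(c)

def cov01(samples):
    # Empirical covariance between the counts of the first two bins.
    x = [s[0] for s in samples]
    y = [s[1] for s in samples]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

cov_pois, cov_fixed = cov01(pois_counts), cov01(fixed_counts)
print(cov_pois, cov_fixed)  # near 0 (independence) versus near -n/M^2 = -6.25
```

The Poissonized covariance hovers around zero, reflecting Property 2, while the fixed-n covariance sits near the multinomial value -n/M².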

For i[M]i\in[M], let Ai=[(i1)/M,i/M)A_{i}=[(i-1)/M,i/M), so that Aifu(x)dx=1/M\int_{A_{i}}f_{u}(x)\mathrm{d}x=1/M for every u{±1}Mu\in\{\pm 1\}^{M}. Clearly, there is a one-to-one correspondence between (X1,,XN)(X_{1},\cdots,X_{N}) and (Y1,,YM)(Y_{1},\cdots,Y_{M}), where YiY_{i} is the collection of observations in X1,,XNX_{1},\cdots,X_{N} that fall into the set AiA_{i}. By the above two properties, (Y1,,YM)(Y_{1},\cdots,Y_{M}) are mutually independent, and Yifi,uinY_{i}\sim f_{i,u_{i}}^{\otimes n} under fuf_{u}. Here fi,uif_{i,u_{i}} is the probability distribution of the following process: sample NiPoi(1/M)N_{i}\sim\text{Poi}(1/M), and draw NiN_{i} i.i.d. samples from the density

M(1+c0uihg(x(i1)hh))\displaystyle M\left(1+c_{0}u_{i}hg\left(\frac{x-(i-1)h}{h}\right)\right)

supported on AiA_{i}. In addition,

H2(fi,+1,fi,1)𝔼[Ni]Mc02h3g221n,\displaystyle H^{2}(f_{i,+1},f_{i,-1})\asymp\mathbb{E}[N_{i}]\cdot Mc_{0}^{2}h^{3}\|g\|_{2}^{2}\asymp\frac{1}{n},

so the Poissonized model is a product model which satisfies the prerequisite of Theorem 6.5.

Next we denote the i.i.d. sampling model by 𝒫cn{\mathcal{P}}_{c}^{\otimes n}, and the Poissonized model by 𝒫cPoi(n){\mathcal{P}}_{c}^{\text{Poi}(n)}. Let n1=n+Cn,n2=n+mCn+mn_{1}=n+C\sqrt{n},n_{2}=n+m-C\sqrt{n+m}, where C>0C>0 is a large universal constant to be chosen later. By Theorem 6.5 applied to d=Mn1/3d=M\asymp n^{1/3}, Le Cam’s distance in Definition 3.1 between Poissonized models satisfies

Δ(𝒫cPoi(n1),𝒫cPoi(n2))=Ω(1),\displaystyle\Delta({\mathcal{P}}_{c}^{\text{Poi}(n_{1})},{\mathcal{P}}_{c}^{\text{Poi}(n_{2})})=\Omega(1),

as long as m=Ω(n5/6)m=\Omega(n^{5/6}). In addition, the model 𝒫cPoi(n1){\mathcal{P}}_{c}^{\text{Poi}(n_{1})} is more informative than 𝒫cn{\mathcal{P}}_{c}^{\otimes n} with probability at least 1ε11-\varepsilon_{1}, where ε1=(Poi(n1)<n)\varepsilon_{1}=\mathbb{P}(\text{Poi}(n_{1})<n). Similarly, 𝒫c(n+m){\mathcal{P}}_{c}^{\otimes(n+m)} is more informative than 𝒫cPoi(n2){\mathcal{P}}_{c}^{\text{Poi}(n_{2})} with probability at least 1ε21-\varepsilon_{2}, where ε2=(Poi(n2)>n+m)\varepsilon_{2}=\mathbb{P}(\text{Poi}(n_{2})>n+m). Therefore,

Δ(𝒫cn,𝒫c(m+n))Δ(𝒫cPoi(n1),𝒫cPoi(n2))ε1ε2=Ω(1),\displaystyle\Delta({\mathcal{P}}_{c}^{\otimes n},{\mathcal{P}}_{c}^{\otimes(m+n)})\geq\Delta({\mathcal{P}}_{c}^{\text{Poi}(n_{1})},{\mathcal{P}}_{c}^{\text{Poi}(n_{2})})-\varepsilon_{1}-\varepsilon_{2}=\Omega(1),

where in the last step we have ε1,ε20\varepsilon_{1},\varepsilon_{2}\to 0 by choosing C>0C>0 large enough.

The proof of m(𝒫,n)n3/4m^{\star}({\mathcal{P}},n)\lesssim n^{3/4}. Suppose nmCn3/4n\geq m\geq Cn^{3/4} for a large constant C>0C>0. Consider the following prior μ\mu on the density ff: let M=nM=\sqrt{n}, and u=(u1,,uM)u=(u_{1},\cdots,u_{M}) be uniformly distributed over all vectors in {0,1}M\{0,1\}^{M} such that the number of 11’s is M/100M/100. Given uu, we construct

fu(x)={ui(2/M4|x(2i1)/2M|)if xAi[(i1)/M,i/M),i[M],(82/(25M))(x1/2)if 1/2x1,\displaystyle f_{u}(x)=\begin{cases}u_{i}(2/M-4|x-(2i-1)/2M|)&\text{if }x\in A_{i}\triangleq[(i-1)/M,i/M),i\in[M],\\ (8-2/(25M))(x-1/2)&\text{if }1/2\leq x\leq 1,\end{cases}

and let f=fuf=f_{u}. It is not hard to verify that each fuf_{u} is a density and 88-Lipschitz, so fu𝒫f_{u}\in{\mathcal{P}}.

Next, in the context of Lemma 6.1, we choose the action space to be [0,1]n+m[0,1]^{n+m}, as well as the following loss:

L(f,xn+m)\displaystyle L(f,x^{n+m}) =1𝟙(xn+m belongs to the support of f,\displaystyle=1-\mathbbm{1}(x^{n+m}\text{ belongs to the support of }f,
 and falls into at least [1(1n1)n]n/100+C0n1/4 sets in {Ai}i=1M),\displaystyle\qquad\text{ and falls into at least }[1-(1-n^{-1})^{n}]\sqrt{n}/100+C_{0}n^{1/4}\text{ sets in }\{A_{i}\}_{i=1}^{M}),

where C0>0C_{0}>0 is a large enough constant. We aim to show that under the loss LL and prior μ\mu, the Bayes risk satisfies rB(𝒫(n+m),μ)0.1r_{\text{B}}({\mathcal{P}}^{\otimes(n+m)},\mu)\leq 0.1 and rB(𝒫n,μ)0.9r_{\text{B}}({\mathcal{P}}^{\otimes n},\mu)\geq 0.9. Then an application of Lemma 6.1 completes the proof of m(𝒫,n)n3/4m^{\star}({\mathcal{P}},n)\lesssim n^{3/4}.

We first show that rB(𝒫(n+m),μ)0.1r_{\text{B}}({\mathcal{P}}^{\otimes(n+m)},\mu)\leq 0.1. We simply use the observed sample Xn+mX^{n+m} as the estimator xn+mx^{n+m} under the above loss; then clearly xn+mx^{n+m} belongs to the support of ff. For the second event in the loss function, let Un+mU_{n+m} be the number of sets in {Ai}i=1M\{A_{i}\}_{i=1}^{M} that receive any observation in Xn+mX^{n+m}. By the linearity of expectation, as well as the negative dependence across different set counts, we compute for every fuf_{u} that

𝔼[Un+m]\displaystyle\mathbb{E}[U_{n+m}] =i=1M/100[1(11M2)n+m]=n100[1(11n)n]+Ω(mn),\displaystyle=\sum_{i=1}^{M/100}\left[1-\left(1-\frac{1}{M^{2}}\right)^{n+m}\right]=\frac{\sqrt{n}}{100}\left[1-\left(1-\frac{1}{n}\right)^{n}\right]+\Omega\left(\frac{m}{\sqrt{n}}\right),
𝖵𝖺𝗋(Un+m)\displaystyle\mathsf{Var}(U_{n+m}) 𝔼[Un+m]=O(n).\displaystyle\leq\mathbb{E}[U_{n+m}]=O(\sqrt{n}).

By Chebyshev’s inequality, for mCn3/4m\geq Cn^{3/4} with a large enough C>0C>0, we have rB(𝒫(n+m),μ)0.1r_{\text{B}}({\mathcal{P}}^{\otimes(n+m)},\mu)\leq 0.1.

Next we show that rB(𝒫n,μ)0.9r_{\text{B}}({\mathcal{P}}^{\otimes n},\mu)\geq 0.9. Let (i1,,ik)[M]k(i_{1},\cdots,i_{k})\in[M]^{k} be the indices such that the set AijA_{i_{j}} is hit by the observations XnX^{n}. Clearly, given XnX^{n}, the posterior distribution of the support of the vector uu is {i1,,ik}S\{i_{1},\cdots,i_{k}\}\cup S, where SS is uniformly distributed over

{S[M]\{i1,,ik}:|S|=M100k}.\displaystyle\left\{S\subseteq[M]\backslash\{i_{1},\cdots,i_{k}\}:|S|=\frac{M}{100}-k\right\}.

On one hand, if the estimator xn+mx^{n+m} hits a set AjA_{j} with j{i1,,ik}j\notin\{i_{1},\cdots,i_{k}\}, the probability that xn+mx^{n+m} does not belong to the support of ff is at least

1M/100kMk=0.99MMk0.99.\displaystyle 1-\frac{M/100-k}{M-k}=\frac{0.99M}{M-k}\geq 0.99.

On the other hand, if the estimator xn+mx^{n+m} never hits a set AjA_{j} with j{i1,,ik}j\notin\{i_{1},\cdots,i_{k}\}, then xn+mx^{n+m} falls into at most UnU_{n} sets in {Ai}i=1M\{A_{i}\}_{i=1}^{M}, where UnU_{n} is defined analogously to Un+mU_{n+m}. Similarly to the previous computation, we have

𝔼[Un]=n100[1(11n)n],𝖵𝖺𝗋(Un)=O(n).\displaystyle\mathbb{E}[U_{n}]=\frac{\sqrt{n}}{100}\left[1-\left(1-\frac{1}{n}\right)^{n}\right],\qquad\mathsf{Var}(U_{n})=O(\sqrt{n}).

By Chebyshev’s inequality, for C0>0C_{0}>0 large enough, the probability that xn+mx^{n+m} violates the second event in the loss function is at least 0.990.99. Now a combination of the above two cases implies that rB(𝒫n,μ)0.9r_{\text{B}}({\mathcal{P}}^{\otimes n},\mu)\geq 0.9, as desired.

Appendix C Proof of main lemmas

C.1 Proof of Lemma 4.4

Recall that the log-partition function A(θ)A(\theta) is defined as

A(θ)=log𝒳exp(θT(x))dμ(x).\displaystyle A(\theta)=\log\int_{{\mathcal{X}}}\exp(\theta^{\top}T(x))\mathrm{d}\mu(x).

Then for any vector λd\lambda\in\mathbb{R}^{d} with θ+(2A(θ))1/2λΘ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda\in\Theta, we have

𝔼θ[exp(λ(2A(θ))1/2(T(X)A(θ)))]\displaystyle\mathbb{E}_{\theta}\left[\exp\left(\lambda^{\top}(\nabla^{2}A(\theta))^{-1/2}(T(X)-\nabla A(\theta))\right)\right]
=𝒳exp((θ+(2A(θ))1/2λ)T(x)A(θ)[(2A(θ))1/2λ]A(θ))dμ(x)\displaystyle=\int_{{\mathcal{X}}}\exp\left((\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda)^{\top}T(x)-A(\theta)-[(\nabla^{2}A(\theta))^{-1/2}\lambda]^{\top}\nabla A(\theta)\right)\mathrm{d}\mu(x)
=exp(A(θ+(2A(θ))1/2λ)A(θ)[(2A(θ))1/2λ]A(θ)).\displaystyle=\exp\left(A(\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda)-A(\theta)-[(\nabla^{2}A(\theta))^{-1/2}\lambda]^{\top}\nabla A(\theta)\right). (34)

It remains to show that when λ2\|\lambda\|_{2} is sufficiently small, we always have θ+(2A(θ))1/2λΘ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda\in\Theta, and the exponent of (34) is uniformly bounded from above over θΘ\theta\in\Theta and λd\lambda\in\mathbb{R}^{d}. Then the existence of a uniformly bounded MGF around zero implies uniformly bounded moments of every order.

The result of [Nes03, Theorem 4.1.6] shows that for a self-concordant and convex function ff, we have 2f(y)42f(x)\nabla^{2}f(y)\preceq 4\nabla^{2}f(x) whenever (yx)2f(x)(yx)M2/16(y-x)^{\top}\nabla^{2}f(x)(y-x)\leq M^{2}/16. Consequently, for λ2M/4\|\lambda\|_{2}\leq M/4, a Taylor expansion with a Lagrange remainder gives

A(θ+(2A(θ))1/2λ)A(θ)[(2A(θ))1/2λ]A(θ)\displaystyle A(\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda)-A(\theta)-[(\nabla^{2}A(\theta))^{-1/2}\lambda]^{\top}\nabla A(\theta)
=12λ(2A(θ))1/22A(ξ)(2A(θ))1/2λ,\displaystyle=\frac{1}{2}\lambda^{\top}(\nabla^{2}A(\theta))^{-1/2}\cdot\nabla^{2}A(\xi)\cdot(\nabla^{2}A(\theta))^{-1/2}\lambda, (35)

where ξ\xi lies on the line segment between θ\theta and θ+(2A(θ))1/2λ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda. Consequently, we have

(ξθ)2A(θ)(ξθ)λ(2A(θ))1/22A(θ)(2A(θ))1/2λ=λ22M216,\displaystyle(\xi-\theta)^{\top}\cdot\nabla^{2}A(\theta)\cdot(\xi-\theta)\leq\lambda^{\top}(\nabla^{2}A(\theta))^{-1/2}\cdot\nabla^{2}A(\theta)\cdot(\nabla^{2}A(\theta))^{-1/2}\lambda=\|\lambda\|_{2}^{2}\leq\frac{M^{2}}{16},

and therefore 2A(ξ)42A(θ)\nabla^{2}A(\xi)\preceq 4\nabla^{2}A(\theta). Plugging this back into (35) establishes the boundedness of (34), as well as the finiteness of A(θ+(2A(θ))1/2λ)A(\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda), or equivalently, θ+(2A(θ))1/2λΘ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda\in\Theta.

C.2 Proof of Lemma A.2

For X1,,Xn𝒩(0,Σ)X_{1},\cdots,X_{n}\sim{\mathcal{N}}(0,\Sigma) and ndn\geq d, it is well known that the empirical covariance Σ^n\widehat{\Sigma}_{n} follows a Wishart distribution 𝒲d(Σ/n,n){\mathcal{W}}_{d}(\Sigma/n,n), where the density of 𝒲d(V,n){\mathcal{W}}_{d}(V,n) is given by

fV,n(X)=det(X)(nd1)/2exp(𝖳𝗋(V1X)/2)2nd/2det(V)n/2Γd(n/2),\displaystyle f_{V,n}(X)=\frac{\det(X)^{(n-d-1)/2}\exp(-\mathsf{Tr}(V^{-1}X)/2)}{2^{nd/2}\det(V)^{n/2}\Gamma_{d}(n/2)},

where Γd(x)=πd(d1)/4i=1dΓ(x(i1)/2)\Gamma_{d}(x)=\pi^{d(d-1)/4}\prod_{i=1}^{d}\Gamma(x-(i-1)/2) is the multivariate Gamma function, and Γ(x)=0tx1et𝑑t\Gamma(x)=\int_{0}^{\infty}t^{x-1}e^{-t}dt is the usual Gamma function. After some algebra, we have

DKL(𝒲d(Σ/n,n)𝒲d(Σ/(n+m),n+m))\displaystyle D_{\text{KL}}({\mathcal{W}}_{d}(\Sigma/n,n)\|{\mathcal{W}}_{d}(\Sigma/(n+m),n+m))
=d2(m(n+m)log(1+mn))+logΓd((n+m)/2)Γd(n/2)m2ψd(n2),\displaystyle=\frac{d}{2}\left(m-(n+m)\log\left(1+\frac{m}{n}\right)\right)+\log\frac{\Gamma_{d}((n+m)/2)}{\Gamma_{d}(n/2)}-\frac{m}{2}\psi_{d}\left(\frac{n}{2}\right), (36)

where ψd(x)=ddx[logΓd(x)]\psi_{d}(x)=\frac{d}{dx}[\log\Gamma_{d}(x)] is the multivariate digamma function. Note that the above KL divergence (36) does not depend on Σ\Sigma; we denote it by f(n,m,d)f(n,m,d).

By (3), it suffices to establish an upper bound on f(n,m,d)f(n,m,d). Expanding logΓd(x)\log\Gamma_{d}(x) into its Taylor series around x=n/2x=n/2 yields

logΓd(n+m2)\displaystyle\log\Gamma_{d}\left(\frac{n+m}{2}\right) =logΓd(n2)+m2ψd(n2)+t=21t!(m2)tψd(t1)(n2)\displaystyle=\log\Gamma_{d}\left(\frac{n}{2}\right)+\frac{m}{2}\psi_{d}\left(\frac{n}{2}\right)+\sum_{t=2}^{\infty}\frac{1}{t!}\left(\frac{m}{2}\right)^{t}\psi_{d}^{(t-1)}\left(\frac{n}{2}\right)
=logΓd(n2)+m2ψd(n2)+t=21t!(m2)tk=1dψ(t1)(nk+12),\displaystyle=\log\Gamma_{d}\left(\frac{n}{2}\right)+\frac{m}{2}\psi_{d}\left(\frac{n}{2}\right)+\sum_{t=2}^{\infty}\frac{1}{t!}\left(\frac{m}{2}\right)^{t}\sum_{k=1}^{d}\psi^{(t-1)}\left(\frac{n-k+1}{2}\right),

where ψ(t1)(x)=dtdxt[logΓ(x)]\psi^{(t-1)}(x)=\frac{d^{t}}{dx^{t}}[\log\Gamma(x)] is the polygamma function. For any t2t\geq 2 and x1x\geq 1, the following inequality holds for the polygamma function [AS64, Equation 6.4.10]:

|ψ(t1)(x)(1)t(t2)!xt1|(t1)!xt.\displaystyle\left|\psi^{(t-1)}(x)-(-1)^{t}\frac{(t-2)!}{x^{t-1}}\right|\leq\frac{(t-1)!}{x^{t}}.

As a result, for the following modification

g(n,m,d)d2(m(n+m)log(1+mn))+mk=1dt=2(1)t2t(t1)(mnk+1)t1,\displaystyle g(n,m,d)\triangleq\frac{d}{2}\left(m-(n+m)\log\left(1+\frac{m}{n}\right)\right)+m\sum_{k=1}^{d}\sum_{t=2}^{\infty}\frac{(-1)^{t}}{2t(t-1)}\left(\frac{m}{n-k+1}\right)^{t-1}, (37)

it holds that

|f(n,m,d)g(n,m,d)|\displaystyle|f(n,m,d)-g(n,m,d)| k=1dt=21t!(m2)t(t1)![(nk+1)/2]t\displaystyle\leq\sum_{k=1}^{d}\sum_{t=2}^{\infty}\frac{1}{t!}\left(\frac{m}{2}\right)^{t}\cdot\frac{(t-1)!}{[(n-k+1)/2]^{t}}
=k=1dt=21t(mnk+1)t\displaystyle=\sum_{k=1}^{d}\sum_{t=2}^{\infty}\frac{1}{t}\left(\frac{m}{n-k+1}\right)^{t}
dt=212(2mn)t4dm2n2,\displaystyle\leq d\cdot\sum_{t=2}^{\infty}\frac{1}{2}\left(\frac{2m}{n}\right)^{t}\leq\frac{4dm^{2}}{n^{2}}, (38)

where we have used the assumption n4max{m,d}n\geq 4\max\{m,d\}.

Next we establish an upper bound of g(n,m,d)g(n,m,d). Using the identity

h(x)(1+x)log(1+x)xx=t=2(1)txt1t(t1)\displaystyle h(x)\triangleq\frac{(1+x)\log(1+x)-x}{x}=\sum_{t=2}^{\infty}\frac{(-1)^{t}x^{t-1}}{t(t-1)}

and some algebra, we have

g(n,m,d)=dm2h(mn)+m2k=1dh(mnk+1)=m2k=1d[h(mnk+1)h(mn)].\displaystyle g(n,m,d)=-\frac{dm}{2}h\left(\frac{m}{n}\right)+\frac{m}{2}\sum_{k=1}^{d}h\left(\frac{m}{n-k+1}\right)=\frac{m}{2}\sum_{k=1}^{d}\left[h\left(\frac{m}{n-k+1}\right)-h\left(\frac{m}{n}\right)\right].
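Both the power-series identity for hh and this rewriting of the first term of g(n,m,d)g(n,m,d) are straightforward to verify numerically; the sketch below uses the arbitrary values x = 0.3 and (n, m, d) = (200, 5, 10):

```python
import math

def h(x):
    # h(x) = ((1 + x) log(1 + x) - x) / x
    return ((1 + x) * math.log(1 + x) - x) / x

def h_series(x, T=200):
    # Truncation of the alternating series sum_{t >= 2} (-1)^t x^{t-1} / (t (t - 1)).
    return sum((-1) ** t * x ** (t - 1) / (t * (t - 1)) for t in range(2, T))

series_err = abs(h(0.3) - h_series(0.3))

# First term of g(n, m, d): (d/2)(m - (n + m) log(1 + m/n)) = -(d m / 2) h(m / n).
n, m, d = 200, 5, 10
lhs = (d / 2) * (m - (n + m) * math.log(1 + m / n))
rhs = -(d * m / 2) * h(m / n)
print(series_err, abs(lhs - rhs))  # both differences are at floating-point level
```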

Since for x[0,1]x\in[0,1] we have

h(x)=xlog(1+x)x2[0,12],\displaystyle h^{\prime}(x)=\frac{x-\log(1+x)}{x^{2}}\in\left[0,\frac{1}{2}\right],

we conclude that

g(n,m,d)m2k=1d12[mnk+1mn]md22mdn2=m2d2n2.\displaystyle g(n,m,d)\leq\frac{m}{2}\sum_{k=1}^{d}\frac{1}{2}\left[\frac{m}{n-k+1}-\frac{m}{n}\right]\leq\frac{md}{2}\cdot\frac{2md}{n^{2}}=\frac{m^{2}d^{2}}{n^{2}}. (39)

Finally, by (38), (39) and (3), we conclude that

(Σ^n)(Σ^n+m)TVf(n,m,d)22mdn.\displaystyle\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{TV}}\leq\sqrt{\frac{f(n,m,d)}{2}}\leq\frac{2md}{n}.
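This chain of bounds can be checked numerically by evaluating the KL divergence (36) directly, computing the multivariate digamma via a central difference of the multivariate log-gamma. The values (n, m, d) = (200, 5, 10) below are illustrative choices satisfying n ≥ 4 max{m, d}:

```python
import math

def log_gamma_d(x, d):
    # Multivariate log-gamma: log Gamma_d(x) = (d(d-1)/4) log pi + sum_i log Gamma(x - (i-1)/2).
    return d * (d - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(x - (i - 1) / 2) for i in range(1, d + 1))

def psi_d(x, d, h=1e-5):
    # Multivariate digamma psi_d = (log Gamma_d)' via a central difference.
    return (log_gamma_d(x + h, d) - log_gamma_d(x - h, d)) / (2 * h)

def f_kl(n, m, d):
    # The KL divergence (36) between the two Wishart laws; it does not depend on Sigma.
    return ((d / 2) * (m - (n + m) * math.log(1 + m / n))
            + log_gamma_d((n + m) / 2, d) - log_gamma_d(n / 2, d)
            - (m / 2) * psi_d(n / 2, d))

n, m, d = 200, 5, 10                     # satisfies n >= 4 max{m, d}
tv_bound = math.sqrt(f_kl(n, m, d) / 2)  # Pinsker-type bound on the TV distance
print(tv_bound, 2 * m * d / n)           # the first stays below the second
```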

For the claim that Sn+mS_{n+m} follows the uniform distribution on the set A={Ud×(n+m):UU=Id}A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d}\}, we need the following auxiliary definitions and results. A topological group is a group (G,+)(G,+) with a topology such that the operation +:G×GG+:G\times G\to G is continuous. A (right) group action of GG on XX is a function ϕ:X×GX\phi:X\times G\to X such that ϕ(ϕ(x,g),g)=ϕ(x,gg)\phi(\phi(x,g),g^{\prime})=\phi(x,gg^{\prime}) and ϕ(x,e)=x\phi(x,e)=x, where ee is the identity element of GG. A group action is called transitive if for every x,xXx,x^{\prime}\in X, there exists some gGg\in G such that ϕ(x,g)=x\phi(x,g)=x^{\prime}. A group action is called proper if for any compact KXK\subseteq X and xXx\in X, the map ϕx:GX\phi_{x}:G\to X with gϕ(x,g)g\mapsto\phi(x,g) satisfies that ϕx1(K)G\phi_{x}^{-1}(K)\subseteq G is compact. The following lemma is useful.

Lemma C.1 (Chapter 14, Theorem 25 of [Roy88]).

Let GG be a locally compact group acting transitively and properly on a locally compact Hausdorff space XX. Then there is a unique (up to multiplicative factors) Baire measure on XX which is invariant under the action of GG.

Now for the claimed result, it is easy to verify that (a proper version of) Sn+mS_{n+m} always takes value in AA, assuming n+mdn+m\geq d. To show that Sn+mS_{n+m} is uniform on AA, note that for any orthogonal matrix V(n+m)×(n+m)V\in\mathbb{R}^{(n+m)\times(n+m)}, it is easy to verify

[X1,,Xn+m]=𝑑[X1,,Xn+m]V.\displaystyle[X_{1},\cdots,X_{n+m}]\overset{d}{=}[X_{1},\cdots,X_{n+m}]V.

Denoting the RHS by [Y1,,Yn+m][Y_{1},\cdots,Y_{n+m}], we also have Σ^n+m(Yn+m)=Σ^n+m(Xn+m)\widehat{\Sigma}_{n+m}(Y^{n+m})=\widehat{\Sigma}_{n+m}(X^{n+m}). Consequently,

Sn+m(Xm+n)V\displaystyle S_{n+m}(X^{m+n})\cdot V =[(n+m)Σ^n+m(Xn+m)]1/2[Y1,Y2,,Yn+m]\displaystyle=[(n+m)\widehat{\Sigma}_{n+m}(X^{n+m})]^{-1/2}[Y_{1},Y_{2},\cdots,Y_{n+m}]
=[(n+m)Σ^n+m(Yn+m)]1/2[Y1,Y2,,Yn+m]\displaystyle=[(n+m)\widehat{\Sigma}_{n+m}(Y^{n+m})]^{-1/2}[Y_{1},Y_{2},\cdots,Y_{n+m}]
=𝑑[(n+m)Σ^n+m(Xn+m)]1/2[X1,X2,,Xn+m]=Sn+m(Xm+n),\displaystyle\overset{d}{=}[(n+m)\widehat{\Sigma}_{n+m}(X^{n+m})]^{-1/2}[X_{1},X_{2},\cdots,X_{n+m}]=S_{n+m}(X^{m+n}),

meaning that the distribution of Sn+mS_{n+m} is invariant under right multiplication by an orthogonal matrix. Let GG be the orthogonal group 𝖮(n+m)\mathsf{O}(n+m); then the map ϕ:A×GA\phi:A\times G\to A with ϕ(U,V)=UV\phi(U,V)=UV is a group action of GG on AA. This action is transitive: for any U,UAU,U^{\prime}\in A, we can add more rows to U,UU,U^{\prime} to obtain U~,U~G\widetilde{U},\widetilde{U}^{\prime}\in G, and then V=U~U~GV=\widetilde{U}^{\top}\widetilde{U}^{\prime}\in G maps UU to UU^{\prime}. This action is also proper as GG itself is compact. Hence, Lemma C.1 shows that Sn+mS_{n+m} is uniformly distributed on AA.

C.3 Proof of Lemma A.4

The proof of Lemma A.4 is a simple consequence of several known results. First, Basu's theorem implies that X¯n\overline{X}_{n} and Σ^n\widehat{\Sigma}_{n} are independent. Second, the computation in Example 4.1 shows

(X¯n)(X¯n+m)TVmdn.\displaystyle\|{\mathcal{L}}(\overline{X}_{n})-{\mathcal{L}}(\overline{X}_{n+m})\|_{\text{TV}}\leq\frac{m\sqrt{d}}{n}.

Finally, as Σ^n𝒲d(Σ/(n1),n1)\widehat{\Sigma}_{n}\sim{\mathcal{W}}_{d}(\Sigma/(n-1),n-1) again follows a Wishart distribution, the proof of Lemma A.2 shows that

(Σ^n)(Σ^n+m)TV2mdn1.\displaystyle\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{TV}}\leq\frac{2md}{n-1}.

In conclusion, we have

(X¯n,Σ^n)(X¯n+m,Σ^n+m)TV\displaystyle\|{\mathcal{L}}(\overline{X}_{n},\widehat{\Sigma}_{n})-{\mathcal{L}}(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}} (X¯n)(X¯n+m)TV+(Σ^n)(Σ^n+m)TV\displaystyle\leq\|{\mathcal{L}}(\overline{X}_{n})-{\mathcal{L}}(\overline{X}_{n+m})\|_{\text{TV}}+\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{TV}}
3mdn1.\displaystyle\leq\frac{3md}{n-1}.

For the distribution of Sn+mS_{n+m}, it is clear that (a proper version of) Sn+mS_{n+m} always takes values in AA, and we show that the distribution of Sn+mS_{n+m} is invariant under a suitable group action on AA. Consider the set

G={V(n+m)×(n+m):VV=In+m,V𝟏=𝟏}\displaystyle G=\{V\in\mathbb{R}^{(n+m)\times(n+m)}:VV^{\top}=I_{n+m},V{\bf 1}={\bf 1}\}

with the usual matrix multiplication. We show that GG is a group: clearly In+mGI_{n+m}\in G; for V,VGV,V^{\prime}\in G, it is clear that VV𝟏=V𝟏=𝟏VV^{\prime}{\bf 1}=V{\bf 1}={\bf 1} and therefore VVGVV^{\prime}\in G; for VGV\in G, it holds that V1𝟏=V1V𝟏=𝟏V^{-1}{\bf 1}=V^{-1}V{\bf 1}={\bf 1} and thus V1GV^{-1}\in G. Next we show that the map ϕ:A×GA\phi:A\times G\to A with ϕ(U,V)=UV\phi(U,V)=UV is a group action on AA, and it suffices to show that UVAUV\in A. This is true as (UV)(UV)=UVVU=UU=Id(UV)(UV)^{\top}=UVV^{\top}U^{\top}=UU^{\top}=I_{d}, and UV𝟏=U𝟏=𝟎UV{\bf 1}=U{\bf 1}={\bf 0}. This group action is also transitive: for any U,UAU,U^{\prime}\in A, we may add suitable rows to them to obtain U~,U~𝖮(n+m)\widetilde{U},\widetilde{U}^{\prime}\in\mathsf{O}(n+m), where one of the added rows is a scalar multiple of 𝟏{\bf 1}^{\top}, which is feasible as U𝟏=𝟎U{\bf 1}={\bf 0}. Consequently, the matrix V=U~1U~𝖮(n+m)V=\widetilde{U}^{-1}\widetilde{U}^{\prime}\in\mathsf{O}(n+m) maps UU to UU^{\prime}, and also 𝟏\bf 1^{\top} to 𝟏\bf 1^{\top}; hence VGV\in G. Finally, we show that the action of any element of GG does not change the distribution of Sm+nS_{m+n}. To see this, for any VGV\in G we have

[X1,,Xn+m]\displaystyle[X_{1},\cdots,X_{n+m}] =𝑑[X1,,Xn+m]V,\displaystyle\overset{d}{=}[X_{1},\cdots,X_{n+m}]V,
X¯n+m([X1,,Xn+m]V)\displaystyle\overline{X}_{n+m}([X_{1},\cdots,X_{n+m}]V) =X¯n+m([X1,,Xn+m]),\displaystyle=\overline{X}_{n+m}([X_{1},\cdots,X_{n+m}]),
Σ^n+m([X1,,Xn+m]V)\displaystyle\widehat{\Sigma}_{n+m}([X_{1},\cdots,X_{n+m}]V) =Σ^n+m([X1,,Xn+m]).\displaystyle=\widehat{\Sigma}_{n+m}([X_{1},\cdots,X_{n+m}]).

Therefore, following the same arguments as in Example A.1 we arrive at the desired invariance, and the uniform distribution of Sm+nS_{m+n} is a direct consequence of Lemma C.1.

C.4 Proof of Lemma 5.8

For any subset S[n+m]S\subseteq[n+m] with |S|=m|S|=m, let PSP_{S} be the distribution of (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}) when the samples (Y1,,Ym)(Y_{1},\cdots,Y_{m}) are placed in the index set SS of the pool (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}). Then it is clear that Pmix=𝔼[PS]P_{\text{mix}}=\mathbb{E}[P_{S}], with SS uniformly distributed on all size-mm subsets of [n+m][n+m]. To compute the χ2\chi^{2}-divergence where the first distribution is a mixture, the following identity holds [IS12]:

χ2(Pmix,P(n+m))=𝔼S,S[dPSdPSdP(n+m)]1,\chi^{2}\left(P_{\text{\rm mix}},P^{\otimes(n+m)}\right)=\mathbb{E}_{S,S^{\prime}}\left[\int\frac{\mathrm{d}P_{S}\mathrm{d}P_{S^{\prime}}}{\mathrm{d}P^{\otimes(n+m)}}\right]-1,

where SS^{\prime} is an independent copy of SS. By the independence assumption, we have

dPSdP(n+m)(z1,,zn+m)=iSdQdP(zi),\frac{\mathrm{d}P_{S}}{\mathrm{d}P^{\otimes(n+m)}}(z_{1},\cdots,z_{n+m})=\prod_{i\in S}\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i}),

and consequently

dPSdPSdP(n+m)\displaystyle\int\frac{\mathrm{d}P_{S}\mathrm{d}P_{S^{\prime}}}{\mathrm{d}P^{\otimes(n+m)}} =𝔼P[iSdQdP(zi)iSdQdP(zi)]\displaystyle=\mathbb{E}_{P}\left[\prod_{i\in S}\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\prod_{i\in S^{\prime}}\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\right]
=iSS𝔼P[(dQdP(zi))2]iSΔS𝔼P[dQdP(zi)]\displaystyle=\prod_{i\in S\cap S^{\prime}}\mathbb{E}_{P}\left[\left(\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\right)^{2}\right]\cdot\prod_{i\in S\Delta S^{\prime}}\mathbb{E}_{P}\left[\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\right]
=(1+χ2(Q,P))|SS|.\displaystyle=(1+\chi^{2}(Q,P))^{|S\cap S^{\prime}|}.

It remains to upper bound the expectation with respect to the random variable |SS||S\cap S^{\prime}|. Note that |SS||S\cap S^{\prime}| follows the hypergeometric distribution with parameter (n+m,m,m)(n+m,m,m), which corresponds to sampling without replacement. The counterpart for sampling with replacement corresponds to a Binomial distribution 𝖡(m,mn+m)\mathsf{B}(m,\frac{m}{n+m}), and the following lemma shows that the latter dominates the former in terms of the convex order:

Lemma C.2.

[Hoe63, Theorem 4] Let the population be 𝒞={c1,,cN}{\mathcal{C}}=\{c_{1},\cdots,c_{N}\}. Let X1,,XnX_{1},\cdots,X_{n} denote random samples without replacement from 𝒞{\mathcal{C}} and Y1,,YnY_{1},\cdots,Y_{n} denote random samples with replacement. If f()f(\cdot) is convex, then

𝔼[f(i=1nXi)]𝔼[f(i=1nYi)].\mathbb{E}\left[f\left(\sum_{i=1}^{n}X_{i}\right)\right]\leq\mathbb{E}\left[f\left(\sum_{i=1}^{n}Y_{i}\right)\right].

Applying Lemma C.2 to the convex function x(1+χ2(Q,P))xx\mapsto(1+\chi^{2}(Q,P))^{x} yields

𝔼S,S[(1+χ2(Q,P))|SS|]\displaystyle\mathbb{E}_{S,S^{\prime}}[(1+\chi^{2}(Q,P))^{|S\cap S^{\prime}|}] 𝔼[(1+χ2(Q,P))𝖡(m,m/(n+m))]\displaystyle\leq\mathbb{E}[(1+\chi^{2}(Q,P))^{\mathsf{B}(m,m/(n+m))}]
=(𝔼[(1+χ2(Q,P))Bern(m/(n+m))])m\displaystyle=\left(\mathbb{E}[(1+\chi^{2}(Q,P))^{\text{Bern}(m/(n+m))}]\right)^{m}
=(1+mn+mχ2(Q,P))m,\displaystyle=\left(1+\frac{m}{n+m}\chi^{2}(Q,P)\right)^{m},

as desired.
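Since both distributions of |S ∩ S'| are supported on finitely many values, Lemma C.2 and the closing Binomial moment identity can be verified exactly from the pmfs; the sketch below uses the arbitrary values n = 20, m = 5 and 1 + χ²(Q, P) = 1.5:

```python
import math

def hyper_pmf(j, n, m):
    # P(|S ∩ S'| = j) for two independent uniform size-m subsets of [n + m]:
    # hypergeometric with population n + m, m marked elements, m draws.
    return math.comb(m, j) * math.comb(n, m - j) / math.comb(n + m, m)

n, m = 20, 5
c = 1.5  # plays the role of 1 + chi^2(Q, P); any value > 1 works

# E[c^{|S ∩ S'|}] under sampling without replacement (hypergeometric) ...
e_hyper = sum(hyper_pmf(j, n, m) * c ** j for j in range(m + 1))
# ... versus with replacement (Binomial(m, m/(n+m))), and its closed form.
p = m / (n + m)
e_binom = sum(math.comb(m, j) * p ** j * (1 - p) ** (m - j) * c ** j
              for j in range(m + 1))
closed_form = (1 + p * (c - 1)) ** m
print(e_hyper, e_binom, closed_form)  # e_hyper <= e_binom, e_binom == closed_form
```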

C.5 Proof of Lemma A.7

Note that in the proof of Theorem 5.2, we have

Pmix(Xn/2)P(n/2+m)TVm2nχ2(P^n(Xn/2),P).\displaystyle\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}}\leq\sqrt{\frac{m^{2}}{n}\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)}.

Since the TV distance is always upper bounded by one, the following upper bound is also true:

Pmix(Xn/2)P(n/2+m)TVm2n(χ2(P^n(Xn/2),P)n).\displaystyle\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}}\leq\sqrt{\frac{m^{2}}{n}\cdot\left(\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)\wedge n\right)}.

Consequently, taking the expectation over Xn/2X^{n/2} leads to the claim.

C.6 Proof of Lemma A.13

First we note that

χ2(𝒩(θ^n,j,1),𝒩(θj,1))=exp((θ^n,jθj)2)1.\displaystyle\chi^{2}\left({\mathcal{N}}(\widehat{\theta}_{n,j},1),{\mathcal{N}}(\theta_{j},1)\right)=\exp\left((\widehat{\theta}_{n,j}-\theta_{j})^{2}\right)-1.

By the triangle inequality,

𝔼|θ^n,jθj|\displaystyle\mathbb{E}|\widehat{\theta}_{n,j}-\theta_{j}| 𝔼|θ^n,j1ni=1nXi,j|+𝔼|1ni=1nXi,jθj|\displaystyle\leq\mathbb{E}\left|\widehat{\theta}_{n,j}-\frac{1}{n}\sum_{i=1}^{n}X_{i,j}\right|+\mathbb{E}\left|\frac{1}{n}\sum_{i=1}^{n}X_{i,j}-\theta_{j}\right|
Clognn+2nπ=O(lognn).\displaystyle\leq\sqrt{\frac{C\log n}{n}}+\sqrt{\frac{2}{n\pi}}=O\left(\sqrt{\frac{\log n}{n}}\right). (40)

Moreover, it is straightforward to verify that |θ^n,jθj||\widehat{\theta}_{n,j}-\theta_{j}| is (1/n)(1/n)-Lipschitz with respect to (X1,j,,Xn,j)(X_{1,j},\cdots,X_{n,j}). Therefore, the Gaussian Lipschitz concentration (see, e.g. [BLM13, Theorem 10.17]) gives that

(|θ^n,jθj|𝔼|θ^n,jθj|+t)exp(nt22)\displaystyle\mathbb{P}\left(|\widehat{\theta}_{n,j}-\theta_{j}|\geq\mathbb{E}|\widehat{\theta}_{n,j}-\theta_{j}|+t\right)\leq\exp\left(-\frac{nt^{2}}{2}\right) (41)

for any t0t\geq 0. Hence, combining (40) and (41), we conclude that |θ^n,jθj|=O(log(nd)/n)|\widehat{\theta}_{n,j}-\theta_{j}|=O(\sqrt{\log(nd)/n}) holds with probability at least 1(nd)21-(nd)^{-2}, and therefore

𝔼[χ2(𝒩(θ^n,j,1),𝒩(θj,1))n]2𝔼[(θ^n,jθj)2]+1nd,\displaystyle\mathbb{E}\left[\chi^{2}\left({\mathcal{N}}(\widehat{\theta}_{n,j},1),{\mathcal{N}}(\theta_{j},1)\right)\wedge n\right]\leq 2\cdot\mathbb{E}[(\widehat{\theta}_{n,j}-\theta_{j})^{2}]+\frac{1}{nd},

where we have used that ex1+2xe^{x}\leq 1+2x whenever x[0,1]x\in[0,1]. Summing over j[d]j\in[d] and using the property of the soft-thresholding estimator supθΘ𝔼[θ^nθ22]=O(slogd/n)\sup_{\theta\in\Theta}\mathbb{E}[\|\widehat{\theta}_{n}-\theta\|_{2}^{2}]=O(s\log d/n) gives the claimed result.

C.7 Proof of Lemma A.15

For the first claim, note that for Poisson models, we have

χ2(𝖯𝗈𝗂(λ^n,j),𝖯𝗈𝗂(λj))=exp((λ^n,jλj)2λj)1.\displaystyle\chi^{2}\left(\mathsf{Poi}(\widehat{\lambda}_{n,j}),\mathsf{Poi}(\lambda_{j})\right)=\exp\left(\frac{(\widehat{\lambda}_{n,j}-\lambda_{j})^{2}}{\lambda_{j}}\right)-1.

Consequently, for λj=1/(n2logn)\lambda_{j}=1/(n^{2}\log n), we have

𝔼[χ2(𝖯𝗈𝗂(λ^n,j),𝖯𝗈𝗂(λj))n]\displaystyle\mathbb{E}\left[\chi^{2}\left(\mathsf{Poi}(\widehat{\lambda}_{n,j}),\mathsf{Poi}(\lambda_{j})\right)\wedge n\right] [(exp((1/nλj)2λj)1)n](λ^n,j=1/n)\displaystyle\geq\left[\left(\exp\left(\frac{(1/n-\lambda_{j})^{2}}{\lambda_{j}}\right)-1\right)\wedge n\right]\cdot\mathbb{P}(\widehat{\lambda}_{n,j}=1/n)
=Ω(n)enλjnλj=Ω(1logn)1n,\displaystyle=\Omega(n)\cdot e^{-n\lambda_{j}}n\lambda_{j}=\Omega\left(\frac{1}{\log n}\right)\gg\frac{1}{n},

establishing the first claim.
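The Poisson χ² closed form used above follows from summing p₁(k)²/p₂(k) over k, and can be confirmed numerically (working in log space for stability); the rates below are arbitrary:

```python
import math

def log_poi_pmf(lam, k):
    # Log of the Poisson pmf: -lam + k log(lam) - log(k!).
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def chi2_poisson(l1, l2, K=60):
    # chi^2(Poi(l1), Poi(l2)) = sum_k p1(k)^2 / p2(k) - 1, truncated at K terms.
    return sum(math.exp(2 * log_poi_pmf(l1, k) - log_poi_pmf(l2, k))
               for k in range(K)) - 1

l1, l2 = 0.7, 0.4
series_val = chi2_poisson(l1, l2)
closed_val = math.exp((l1 - l2) ** 2 / l2) - 1
print(series_val, closed_val)  # the two values agree
```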

For the second claim, note that conditioned on Xn/2X^{n/2},

DKL((T^n/2+m)(Tn/2+m))\displaystyle D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right) =j=1dDKL(𝖯𝗈𝗂(nλj/2+mλ^n,j)𝖯𝗈𝗂((n/2+m)λj))\displaystyle=\sum_{j=1}^{d}D_{\text{\rm KL}}\left(\mathsf{Poi}(n\lambda_{j}/2+m\widehat{\lambda}_{n,j})\|\mathsf{Poi}((n/2+m)\lambda_{j})\right)
j=1dm2(λ^n,jλj)2(n/2+m)λj,\displaystyle\leq\sum_{j=1}^{d}\frac{m^{2}(\widehat{\lambda}_{n,j}-\lambda_{j})^{2}}{(n/2+m)\lambda_{j}},

where we have used DKL(𝖯𝗈𝗂(λ1)𝖯𝗈𝗂(λ2))=λ2λ1+λ1log(λ1/λ2)(λ1λ2)2/λ2D_{\text{\rm KL}}(\mathsf{Poi}(\lambda_{1})\|\mathsf{Poi}(\lambda_{2}))=\lambda_{2}-\lambda_{1}+\lambda_{1}\log(\lambda_{1}/\lambda_{2})\leq(\lambda_{1}-\lambda_{2})^{2}/\lambda_{2}. Hence,

\displaystyle\mathbb{E}_{X^{n/2}}\left[D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right)\right]\leq\frac{dm^{2}}{(n/2+m)n/2}\leq\frac{4dm^{2}}{n^{2}},

and the TV distance satisfies

\displaystyle\|{\mathcal{L}}(\widehat{X}^{n+m})-{\mathcal{L}}(X^{n+m})\|_{\text{\rm TV}} \displaystyle=\mathbb{E}_{X^{n/2}}\left[\|{\mathcal{L}}(\widehat{T}_{n/2+m})-{\mathcal{L}}(T_{n/2+m})\|_{\text{\rm TV}}\right]
\displaystyle\leq\mathbb{E}_{X^{n/2}}\left[\sqrt{\frac{1}{2}D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right)}\right]
\displaystyle\leq\sqrt{\frac{1}{2}\mathbb{E}_{X^{n/2}}\left[D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right)\right]}
\displaystyle\leq\frac{m\sqrt{2d}}{n}.

C.8 Proof of Lemma A.17

The upper bound results are straightforward. The m=O(n\varepsilon/\sqrt{k}) upper bound is a consequence of the general product Poisson model considered in Example A.14. For the m=O(\sqrt{n}\varepsilon) upper bound, we consider the sufficient statistic T_{n}=\sum_{i=1}^{n}X_{i}\sim\prod_{j=1}^{k}\mathsf{Poi}(np_{j}), and simply apply the sufficiency-based algorithm to \widehat{T}_{n+m}=T_{n}. Since

\displaystyle D_{\text{\rm KL}}({\mathcal{L}}(T_{n+m})\|{\mathcal{L}}(T_{n}))=\sum_{j=1}^{k}D_{\text{\rm KL}}(\mathsf{Poi}((n+m)p_{j})\|\mathsf{Poi}(np_{j}))\leq\sum_{j=1}^{k}\frac{(mp_{j})^{2}}{np_{j}}=\frac{m^{2}}{n},

where the last equality crucially uses the identity \sum_{j=1}^{k}p_{j}=1, this procedure works.
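Both the exact formula for the KL divergence between Poisson laws and its quadratic upper bound, used in the display above and earlier in this section, can be sanity-checked numerically (the parameter values are arbitrary):

```python
import math

def poi_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def kl_poisson(l1, l2, kmax=80):
    # D_KL(Poi(l1) || Poi(l2)) computed by truncating the defining series
    return sum(poi_pmf(l1, k) * math.log(poi_pmf(l1, k) / poi_pmf(l2, k))
               for k in range(kmax + 1))

l1, l2 = 2.0, 3.0  # arbitrary test values
closed_form = l2 - l1 + l1 * math.log(l1 / l2)
assert abs(kl_poisson(l1, l2) - closed_form) < 1e-9
assert closed_form <= (l1 - l2)**2 / l2  # the quadratic upper bound
```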

Next we show that sample amplification is impossible when m=\omega(n\varepsilon/\sqrt{k}) and k=O(n). This implies that m=\omega(\max\{n\varepsilon/\sqrt{k},\sqrt{n}\varepsilon\}) is impossible in general, since the case k>n is at least as hard as the case k=n. To prove the above claim, w.l.o.g. we assume that k=2k_{0} is even, and consider the following parametric submodel:

\displaystyle P_{\theta}=\prod_{j=1}^{k_{0}}\left[\mathsf{Poi}\left(\frac{1}{k}+\theta_{j}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta_{j}\right)\right],\qquad\theta\in\Theta\triangleq\left[-\frac{1}{k},\frac{1}{k}\right]^{k_{0}}.

Clearly P_{\theta} is a parametric submodel, obtained by setting p_{2j-1}=1/k+\theta_{j} and p_{2j}=1/k-\theta_{j} in the original model. This submodel is a product model, so we can apply Theorem 6.5 once Assumption 4 is verified. Note that as \theta,\theta^{\prime} vary arbitrarily in [-1/k,1/k], the range of

\displaystyle H^{2}\left(\mathsf{Poi}\left(\frac{1}{k}+\theta\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta\right),\mathsf{Poi}\left(\frac{1}{k}+\theta^{\prime}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta^{\prime}\right)\right)

is [0,\Theta(1/k)], so Assumption 4 is fulfilled when k\leq cn for a small constant c>0. Consequently, Theorem 6.5 establishes the desired bound m=O(n\varepsilon/\sqrt{k}).

C.9 Proof of Lemma A.19

For the first inequality, note that for a large C_{0}>0, both Z_{1}\leq C_{0} and \max_{2\leq j\leq d}Z_{j}\geq\sqrt{2\log d}-C_{0} hold with probability at least 0.99. When both events hold, we have

\displaystyle\exp(t(t+Z_{1})) \displaystyle\leq\exp(t(t+C_{0})),
\displaystyle\sum_{j=2}^{d}\exp(tZ_{j}) \displaystyle\geq\exp(t(\sqrt{2\log d}-C_{0})).

Consequently, for C=3C_{0} with C_{0}>0 large enough, we have

\displaystyle p_{d}(\sqrt{2\log d}-C) \displaystyle\leq 0.01+\frac{\exp((\sqrt{2\log d}-3C_{0})(\sqrt{2\log d}-2C_{0}))}{\exp((\sqrt{2\log d}-3C_{0})(\sqrt{2\log d}-C_{0}))}
\displaystyle=0.01+\exp\left(-C_{0}(\sqrt{2\log d}-3C_{0})\right)<0.1.

For the second inequality, note that Jensen’s inequality yields that

\displaystyle p_{d}(t) \displaystyle\geq\mathbb{E}_{Z_{1}}\left[\frac{\exp(t(t+Z_{1}))}{\exp(t(t+Z_{1}))+\sum_{j=2}^{d}\mathbb{E}[\exp(tZ_{j})]}\right]
\displaystyle\geq\mathbb{E}_{Z_{1}}\left[\frac{\exp(t(t+Z_{1}))}{\exp(t(t+Z_{1}))+d\exp(t^{2}/2)}\right].

Again, with probability at least 0.99 we have Z_{1}\geq-C_{0}, and therefore for t=\sqrt{2\log d}+2C_{0},

\displaystyle p_{d}(t) \displaystyle\geq 0.99\times\frac{\exp(t(t-C_{0}))}{\exp(t(t-C_{0}))+d\exp(t^{2}/2)}
\displaystyle=0.99\times\frac{1}{1+\exp(t^{2}/2+(t-2C_{0})^{2}/2-t(t-C_{0}))}
\displaystyle=0.99\times\frac{1}{1+\exp(-C_{0}\cdot\sqrt{2\log d})}>0.9,

for C_{0}>0 large enough.

For the last claim, consider any s<d/2. Let d_{0}\triangleq\lfloor d/s\rfloor\geq 2, and consider the product prior \mu^{\otimes s} over s blocks, each of dimension d_{0}. In other words, we set exactly one of the first d_{0} coordinates of the mean vector to t uniformly at random, and do the same for the next d_{0} coordinates, and so on. Clearly the resulting mean vector is always s-sparse. Writing \theta=(\theta_{1},\cdots,\theta_{s}) with each \theta_{i}\in\mathbb{R}^{d_{0}}, let the loss function be

\displaystyle L(\theta,\widehat{\theta})=\mathbbm{1}\left(\sum_{j=1}^{s}\mathbbm{1}(\theta_{j}\neq\widehat{\theta}_{j})\geq N\right),

with an integer N to be specified later. Then in each block, we reduce to the case s=1, and the error probability for this block is 1-p_{d_{0}}(\sqrt{n}\cdot t) for sample size n. Moreover, the errors in different blocks are independent. Consequently,

\displaystyle r_{\text{\rm B}}({\mathcal{P}},n,\mu,L)-r_{\text{\rm B}}({\mathcal{P}},n+m,\mu,L)
\displaystyle=\mathbb{P}\left(\mathsf{B}(s,1-p_{d_{0}}(\sqrt{n}\cdot t))\geq N\right)-\mathbb{P}\left(\mathsf{B}(s,1-p_{d_{0}}(\sqrt{n+m}\cdot t))\geq N\right). (42)

Finally, again by the properties of p_{d}(\cdot) summarized in Lemma A.19 and the pigeonhole principle, for m=\lceil cn\varepsilon/\sqrt{s\log d_{0}}\rceil, we can always find some t>0 such that p_{d_{0}}(\sqrt{n+m}\cdot t)-p_{d_{0}}(\sqrt{n}\cdot t)=\Omega_{c}(\varepsilon/\sqrt{s}), with both quantities in [0.1,0.9]. Now choosing

\displaystyle N=\left\lceil\frac{s(1-p_{d_{0}}(\sqrt{n}\cdot t))}{2}+\frac{s(1-p_{d_{0}}(\sqrt{n+m}\cdot t))}{2}\right\rceil

with the above choice of t, the Bayes risk difference is lower bounded by \Omega_{c}(\varepsilon) by an argument similar to the proof of Theorem 6.4. Now by (42) and Lemma 6.1, we see that n=\Omega(\sqrt{s\log(d/s)}/\varepsilon) and m=O(n\varepsilon/\sqrt{s\log(d/s)}) are necessary for sample amplification.

C.10 Proof of Lemma A.21

The following results are the key to the proof of Lemma A.21; we assume them for now and prove them in the subsequent sections.

Lemma C.3.

For D_{n}=\text{\rm diag}(\lambda_{1},\cdots,\lambda_{d}), the following identities hold for the estimator \widehat{\Sigma}_{n}:

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})] \displaystyle=\sum_{j=1}^{d}\left[(n+d+1-2j)\lambda_{j}-\log\lambda_{j}-\mathbb{E}\left[\log\chi_{n-j+1}^{2}\right]\right]-d, (43)
\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n})) \displaystyle=\sum_{j=1}^{d}\left[2(n+d+1-2j)\lambda_{j}^{2}-4\lambda_{j}+\psi^{\prime}\left(\frac{n+1-j}{2}\right)\right], (44)

where \chi_{m}^{2} denotes the chi-squared distribution with m degrees of freedom, and \psi^{\prime}(x) is the polygamma function of order 1. In particular, when n\geq 2d, the following inequalities hold:

\displaystyle\left|\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]-\left(g(n+1-d,d)+\sum_{j=1}^{d}h((n+d+1-2j)\lambda_{j})\right)\right|\leq\frac{5d}{n}, (45)
\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))\leq\frac{16d^{2}}{n^{2}}+\sum_{j=1}^{d}\frac{4((n+d+1-2j)\lambda_{j}-1)^{2}}{n}, (46)

where the functions g and h are given by (13) and

\displaystyle h(u) \displaystyle\triangleq u-\log u-1. (47)
Lemma C.4.

For the function h in (47) and u_{1},\cdots,u_{d}\in\mathbb{R}_{+}, it holds that

\displaystyle\sum_{j=1}^{d}h(u_{j})\geq\frac{1}{8}\min\left\{\sum_{j=1}^{d}(u_{j}-1)^{2},\sqrt{\sum_{j=1}^{d}(u_{j}-1)^{2}}\right\}. (48)

Returning to the proof of Lemma A.21, the first claim is a direct application of (45) and (46). For the second claim, let D_{n}=\text{\rm diag}(\lambda_{1},\cdots,\lambda_{d}), and

\displaystyle V\triangleq\sum_{j=1}^{d}((n+d+1-2j)\lambda_{j}-1)^{2}.

Then by (45), (46) and Lemma C.4, we have

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})] \displaystyle\geq g(n+1-d,d)+\frac{V\wedge\sqrt{V}}{8}-\frac{5d}{n}.

Meanwhile, the variance satisfies

\displaystyle\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}\leq\frac{4d}{n}+2\sqrt{\frac{V}{n}}.

Note that for n,d larger than a constant depending only on C_{1}, we always have

\displaystyle\frac{d}{n}+\frac{V\wedge\sqrt{V}}{8}\geq 2C_{1}\sqrt{\frac{V}{n}}.

Therefore, for n,d=\Omega(1) large enough, the above inequalities imply

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]\geq g(n+1-d,d)+C_{1}\left(\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}-\frac{4d}{n}\right)-\frac{6d}{n},

establishing the second claim.

For the last statement, note that

\displaystyle\frac{\partial g(u,v)}{\partial u}=\frac{\log(u+2v)+\log(u)}{2}-\log(u+v)=\frac{1}{2}\log\left(\frac{u(u+2v)}{(u+v)^{2}}\right)\leq-\frac{v^{2}}{2(u+v)^{2}},

where the last inequality uses u(u+2v)=(u+v)^{2}-v^{2} and \log(1-x)\leq-x. The mean value theorem then implies that

\displaystyle g(n+1-d,d)-g(n+m+1-d,d)\geq m\cdot\min_{u\in[n+1-d,n+m+1-d]}\frac{d^{2}}{2(u+d)^{2}}\geq\frac{md^{2}}{13n^{2}}.
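As a numerical illustration (not part of the proof), the final bound can be checked by taking the integral representation of g from the proof of Lemma C.3 as the definition, assuming the two-argument form g(u,v)=\int_{0}^{v}\log\frac{u+2x}{u+x}\,\mathrm{d}x consistent with that representation; the values of n, d, m below are arbitrary.

```python
import math

def g(u, v, steps=20000):
    # g(u, v) = integral_0^v log((u + 2x)/(u + x)) dx, via the midpoint rule
    h = v / steps
    return sum(math.log((u + 2*x) / (u + x))
               for x in (h * (i + 0.5) for i in range(steps))) * h

n, d, m = 100, 20, 30  # arbitrary test values with n >= 2d
lhs = g(n + 1 - d, d) - g(n + m + 1 - d, d)
assert lhs >= m * d**2 / (13 * n**2)
```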

C.11 Proof of Lemma C.3

We first recall the well-known Bartlett decomposition: for the lower triangular matrix L_{n}=(L_{ij})_{i\geq j}, the random variables \{L_{ij}\}_{i\geq j} are mutually independent, with

\displaystyle L_{ij}\sim{\mathcal{N}}(0,1),\quad i>j;\qquad L_{jj}^{2}\sim\chi_{n+1-j}^{2},\quad j\in[d].

For \Sigma=I_{d} and \widehat{\Sigma}_{n}=L_{n}D_{n}L_{n}^{\top}, simple algebra gives

\displaystyle\ell(\Sigma,\widehat{\Sigma}_{n})=\sum_{j=1}^{d}\left(\sum_{i\geq j}\lambda_{j}L_{ij}^{2}-\log\lambda_{j}-\log L_{jj}^{2}-1\right).

Consequently, the identity (43) follows from the above Bartlett decomposition. This identity was also obtained in [JS61].

For the identity (44) on the variance, by the mutual independence we have

\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))=\sum_{j=1}^{d}\sum_{i\geq j}\lambda_{j}^{2}\cdot\mathsf{Var}(L_{ij}^{2})+\sum_{j=1}^{d}\mathsf{Var}(\log L_{jj}^{2})-2\sum_{j=1}^{d}\lambda_{j}\cdot\mathsf{Cov}(L_{jj}^{2},\log L_{jj}^{2}). (49)

Next we evaluate each term of (49). Clearly \mathsf{Var}(L_{ij}^{2})=\mathbb{E}[Z^{4}]-\mathbb{E}[Z^{2}]^{2}=2 for i>j and Z\sim{\mathcal{N}}(0,1). For the other random variables, we need to recall the following identity for \chi_{m}^{2}:

\displaystyle\Lambda(t)\triangleq\log\mathbb{E}[(\chi_{m}^{2})^{t}]=t\log 2+\log\Gamma\left(\frac{m}{2}+t\right)-\log\Gamma\left(\frac{m}{2}\right). (50)

Based on (50), we have

\displaystyle\mathsf{Var}(L_{jj}^{2})=(n+1-j)(n+3-j)-(n+1-j)^{2}=2(n+1-j).

Moreover, differentiating \Lambda(t) at t=0 gives

\displaystyle\mathbb{E}[\log\chi_{m}^{2}] \displaystyle=\Lambda^{\prime}(0)=\log 2+\psi\left(\frac{m}{2}\right), (51)
\displaystyle\mathbb{E}[(\log\chi_{m}^{2})^{2}]-\left(\mathbb{E}[\log\chi_{m}^{2}]\right)^{2} \displaystyle=\Lambda^{\prime\prime}(0)=\psi^{\prime}\left(\frac{m}{2}\right). (52)

Note that (52) leads to

\displaystyle\mathsf{Var}(\log L_{jj}^{2})=\psi^{\prime}\left(\frac{n+1-j}{2}\right).

Finally, differentiating \Lambda(t) at t=1 gives

\displaystyle\mathbb{E}[L_{jj}^{2}\log L_{jj}^{2}] \displaystyle=\mathbb{E}[L_{jj}^{2}]\cdot\left(\log 2+\psi\left(\frac{n+1-j}{2}+1\right)\right)
\displaystyle=(n+1-j)\left(\log 2+\psi\left(\frac{n+1-j}{2}+1\right)\right),

and hence the identity \psi(x+1)=\psi(x)+x^{-1} for the digamma function, together with (51), leads to

\displaystyle\mathsf{Cov}(L_{jj}^{2},\log L_{jj}^{2})=2.

Therefore, plugging the above quantities into (49) gives the identity (44).
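The chi-squared moment identities used above, namely \mathsf{Var}(\chi_{m}^{2})=2m and \mathsf{Cov}(X,\log X)=2 for X\sim\chi_{m}^{2}, can be verified (as a sanity check, not part of the proof) by direct numerical integration against the \chi_{m}^{2} density; m=7 is an arbitrary choice.

```python
import math

def chi2_pdf(x, m):
    # density of the chi-squared distribution with m degrees of freedom
    return math.exp((m/2 - 1) * math.log(x) - x/2
                    - (m/2) * math.log(2) - math.lgamma(m/2))

def expect(f, m, hi=100.0, steps=100000):
    # E[f(X)] for X ~ chi2_m, via the midpoint rule on [0, hi]
    h = hi / steps
    return sum(f(x) * chi2_pdf(x, m)
               for x in (h * (i + 0.5) for i in range(steps))) * h

m = 7  # arbitrary degrees of freedom
EX = expect(lambda x: x, m)
EX2 = expect(lambda x: x * x, m)
ElogX = expect(math.log, m)
EXlogX = expect(lambda x: x * math.log(x), m)
assert abs(EX - m) < 1e-5
assert abs(EX2 - EX * EX - 2 * m) < 1e-3      # Var(chi2_m) = 2m
assert abs(EXlogX - EX * ElogX - 2.0) < 1e-4  # Cov(X, log X) = 2
```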

Next we prove the remaining inequalities when n\geq 2d. For (45), note that (43), together with (51), gives the identity

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]-\left(g(n+1-d,d)+\sum_{j=1}^{d}h((n+d+1-2j)\lambda_{j})\right)
\displaystyle=\sum_{j=1}^{d}\left(\log(n+d+1-2j)-\log 2-\psi\left(\frac{n+1-j}{2}\right)\right)-g(n+1-d,d).

Since |\psi(x)-\log(x)|\leq 1/x for all x\geq 1, replacing \psi(\cdot) by \log(\cdot) in the above expression only incurs an absolute difference of at most

\displaystyle\sum_{j=1}^{d}\frac{2}{n+1-j}\leq 2\int_{0}^{d}\frac{\mathrm{d}x}{n-x}=2\log\left(1+\frac{d}{n-d}\right)\leq\frac{4d}{n}.
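The digamma bound |\psi(x)-\log(x)|\leq 1/x invoked above can be spot-checked numerically; here a central difference of \log\Gamma stands in for \psi (accurate to roughly 10^{-9} at these arguments, far below the slack of about 1/(2x) in the bound).

```python
import math

def digamma(x, h=1e-5):
    # psi(x) approximated by a central difference of log-Gamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

for x in [1.0, 1.5, 2.0, 5.0, 10.0, 50.0]:
    # |psi(x) - log(x)| <= 1/x for x >= 1
    assert abs(digamma(x) - math.log(x)) <= 1.0 / x
```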

For the remaining terms, it is not hard to verify that

\displaystyle g(n+1-d,d)=\int_{0}^{d}\log\frac{n+1-d+2x}{n+1-d+x}\,\mathrm{d}x.

As x\mapsto\log(n+1-d+2x)-\log(n+1-d+x) is increasing on [0,\infty), we have

\displaystyle 0\leq g(n+1-d,d)-\sum_{x=0}^{d-1}\log\frac{n+1-d+2x}{n+1-d+x}\leq\log\frac{n+1-d+2d}{n+1-d+d}\leq\frac{d}{n}.

Now (45) follows from a combination of the above inequalities.

For the inequality (46), we complete the square of (44) to obtain

\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n})) \displaystyle=\sum_{j=1}^{d}\frac{2((n+d+1-2j)\lambda_{j}-1)^{2}}{n+d+1-2j}+\sum_{j=1}^{d}\left(\psi^{\prime}\left(\frac{n+1-j}{2}\right)-\frac{2}{n+d+1-2j}\right)
\displaystyle\leq\frac{4}{n}\sum_{j=1}^{d}((n+d+1-2j)\lambda_{j}-1)^{2}+\sum_{j=1}^{d}\left(\psi^{\prime}\left(\frac{n+1-j}{2}\right)-\frac{2}{n+d+1-2j}\right).

To handle the second sum, note that [AS64, Equation 6.4.10] gives |\psi^{\prime}(x)-x^{-1}|\leq x^{-2} for x\geq 1. Therefore, the second term has an absolute value at most

\displaystyle\sum_{j=1}^{d}\left[\frac{2(d-j)}{(n+1-j)(n+d+1-2j)}+\left(\frac{2}{n+1-j}\right)^{2}\right]\leq d\cdot\left[\frac{2d}{(n/2)^{2}}+\left(\frac{2}{n/2}\right)^{2}\right]\leq\frac{16d^{2}}{n^{2}}.
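Similarly, the trigamma bound |\psi^{\prime}(x)-x^{-1}|\leq x^{-2} from [AS64] can be spot-checked with a second central difference of \log\Gamma (the true gap is about 1/(2x^{2}), so the discretization error is comfortably absorbed):

```python
import math

def trigamma(x, h=1e-3):
    # psi'(x) approximated by a second central difference of log-Gamma
    return (math.lgamma(x + h) - 2 * math.lgamma(x) + math.lgamma(x - h)) / (h * h)

for x in [1.0, 1.5, 2.0, 5.0, 10.0, 50.0]:
    # |psi'(x) - 1/x| <= 1/x^2 for x >= 1
    assert abs(trigamma(x) - 1.0 / x) <= 1.0 / (x * x)
```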

C.12 Proof of Lemma C.4

Note that when 0<u\leq 2, a Taylor expansion of h(\cdot) at u=1 gives

\displaystyle h(u)\geq\frac{(u-1)^{2}}{2}\min_{\theta\in(0,2]}h^{\prime\prime}(\theta)=\frac{(u-1)^{2}}{8}.

For u>2, we have u-1\geq\log_{2}u>10/7\cdot\log u, and therefore

\displaystyle h(u)=u-\log u-1\geq\frac{3(u-1)}{10}.

Therefore, in both cases we have

\displaystyle h(u)\geq\frac{1}{8}\min\left\{(u-1)^{2},|u-1|\right\}.
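This pointwise inequality is easy to check numerically on a grid (an arbitrary illustration, not part of the proof):

```python
import math

def h(u):
    # h(u) = u - log(u) - 1, as in (47)
    return u - math.log(u) - 1

# check h(u) >= (1/8) * min{(u-1)^2, |u-1|} on a grid over (0, 10]
for i in range(1, 1001):
    u = i / 100.0
    assert h(u) >= 0.125 * min((u - 1)**2, abs(u - 1))
```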

To prove (48), let J\triangleq\{j\in[d]:|u_{j}-1|\leq 1\} and S\triangleq\sum_{j=1}^{d}(u_{j}-1)^{2}. Using the above inequality, we have

\displaystyle\sum_{j=1}^{d}h(u_{j}) \displaystyle\geq\frac{1}{8}\sum_{j\in J}(u_{j}-1)^{2}+\frac{1}{8}\sum_{j\notin J}|u_{j}-1|
\displaystyle\geq\frac{1}{8}\sum_{j\in J}(u_{j}-1)^{2}+\frac{1}{8}\sqrt{\sum_{j\notin J}(u_{j}-1)^{2}}
\displaystyle\geq\frac{1}{8}\min_{x\in[0,S]}\left(x+\sqrt{S-x}\right)
\displaystyle=\frac{1}{8}\min\{S,\sqrt{S}\},

which is precisely (48).
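Likewise, the vector inequality (48) can be spot-checked on a few arbitrary test vectors:

```python
import math

def h(u):
    return u - math.log(u) - 1

def lemma_c4_holds(us):
    # sum_j h(u_j) >= (1/8) * min{S, sqrt(S)} with S = sum_j (u_j - 1)^2
    S = sum((u - 1)**2 for u in us)
    return sum(h(u) for u in us) >= 0.125 * min(S, math.sqrt(S))

for us in ([0.5, 1.2, 3.0], [1.01] * 20, [0.1, 0.1, 10.0], [2.5], [1.0, 4.0, 0.3, 1.7]):
    assert lemma_c4_holds(us)
```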

References

  • [AGSV20] Brian Axelrod, Shivam Garg, Vatsal Sharan, and Gregory Valiant. Sample amplification: Increasing dataset size even when learning is impossible. In International Conference on Machine Learning, pages 442–451. PMLR, 2020.
  • [AS64] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1964.
  • [ASE17] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
  • [Bar86] Andrew R Barron. Entropy and the central limit theorem. The Annals of probability, pages 336–342, 1986.
  • [Bas55] Dev Basu. On statistics independent of a complete sufficient statistic. Sankhyā: The Indian Journal of Statistics (1933-1960), 15(4):377–380, 1955.
  • [Bas58] Dev Basu. On statistics independent of sufficient statistics. Sankhyā: The Indian Journal of Statistics, pages 223–226, 1958.
  • [Bas59] Dev Basu. The family of ancillary statistics. Sankhyā: The Indian Journal of Statistics, pages 247–256, 1959.
  • [BB19] Matthew Brennan and Guy Bresler. Optimal average-case reductions to sparse pca: From weak assumptions to strong hardness. In Conference on Learning Theory, pages 469–470. PMLR, 2019.
  • [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, pages 648–847. PMLR, 2020.
  • [BC14] Vlad Bally and Lucia Caramellino. On the distances between probability density functions. Electronic Journal of Probability, 19, 2014.
  • [BC16] Vlad Bally and Lucia Caramellino. Asymptotic development for the CLT in total variation distance. Bernoulli, 22(4):2442–2485, 2016.
  • [BCG14] Sergey G Bobkov, Gennadiy P Chistyakov, and Friedrich Götze. Berry–esseen bounds in the entropic central limit theorem. Probability Theory and Related Fields, 159(3-4):435–478, 2014.
  • [BCLZ02] Lawrence D Brown, T Tony Cai, Mark G Low, and Cun-Hui Zhang. Asymptotic equivalence theory for nonparametric regression with random design. The Annals of statistics, 30(3):688–707, 2002.
  • [BCLZ04] Lawrence D Brown, Andrew V Carter, Mark G Low, and Cun-Hui Zhang. Equivalence theory for density estimation, poisson processes and gaussian white noise with drift. The Annals of Statistics, 32(5):2074–2097, 2004.
  • [BKK+22] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
  • [BL96] Lawrence D Brown and Mark G Low. Asymptotic equivalence of nonparametric regression and white noise. The Annals of Statistics, 24(6):2384–2398, 1996.
  • [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
  • [BR10] Rabi N Bhattacharya and R Ranga Rao. Normal approximation and asymptotic expansions. SIAM, 2010.
  • [BR13] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
  • [CMST17] Francesco Calimeri, Aldo Marzullo, Claudio Stamile, and Giorgio Terracina. Biomedical data augmentation using generative adversarial neural networks. In International conference on artificial neural networks, pages 626–634. Springer, 2017.
  • [CMV+21] Phillip Chlap, Hang Min, Nym Vandenberg, Jason Dowling, Lois Holloway, and Annette Haworth. A review of medical image data augmentation techniques for deep learning applications. Journal of Medical Imaging and Radiation Oncology, 65(5):545–563, 2021.
  • [CPS+19] Aggelina Chatziagapi, Georgios Paraskevopoulos, Dimitris Sgouropoulos, Georgios Pantazopoulos, Malvina Nikandrou, Theodoros Giannakopoulos, Athanasios Katsamanis, Alexandros Potamianos, and Shrikanth Narayanan. Data augmentation using gans for speech emotion recognition. In Interspeech, pages 171–175, 2019.
  • [CQZ+22] Dong Chen, Xinda Qi, Yu Zheng, Yuzhen Lu, and Zhaojian Li. Deep data augmentation for weed recognition enhancement: A diffusion probabilistic model and transfer learning based approach. arXiv preprint arXiv:2210.09509, 2022.
  • [CZM+19] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
  • [CZSL20] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
  • [DJ94] David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
  • [DMR18] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians. arXiv preprint arXiv:1810.08693, 2018.
  • [DMR20] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The minimax learning rates of normal and Ising undirected graphical models. Electronic Journal of Statistics, 14(1):2338 – 2361, 2020.
  • [DY79] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential families. The Annals of statistics, pages 269–281, 1979.
  • [FADK+18] Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. GAN-based synthetic medical image augmentation for increased cnn performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
  • [HJW15] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under 1\ell_{1} loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.
  • [HMC+20] Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.
  • [Hoe56] Wassily Hoeffding. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, pages 713–721, 1956.
  • [Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • [HRA+19] Changhee Han, Leonardo Rundo, Ryosuke Araki, Yudai Nagano, Yujiro Furukawa, Giancarlo Mauri, Hideki Nakayama, and Hideaki Hayashi. Combining noise-to-image and image-to-image gans: Brain mr image augmentation for tumor detection. IEEE Access, 7:156966–156977, 2019.
  • [IS12] Yuri Ingster and Irina A Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. Springer Science & Business Media, 2012.
  • [JS61] W James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
  • [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • [LC64] Lucien Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, pages 1419–1455, 1964.
  • [LC72] Lucien Le Cam. Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
  • [LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer, 1986.
  • [LCY90] Lucien Le Cam and Grace Lo Yang. Asymptotics in Statistics. Springer, New York, 1990.
  • [LS92] Erich Leo Lehmann and FW Scholz. Ancillarity. Lecture Notes-Monograph Series, 17:32–51, 1992.
  • [LSM+22] Lorenzo Luzi, Ali Siahkoohi, Paul M Mayer, Josue Casco-Rodriguez, and Richard Baraniuk. Boomerang: Local sampling on image manifolds using diffusion models. arXiv preprint arXiv:2210.12100, 2022.
  • [LSW+23] Yingzhou Lu, Minjie Shen, Huazheng Wang, Xiao Wang, Capucine van Rechem, and Wenqi Wei. Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062, 2023.
  • [ML08] Klaus-J Miescke and F Liese. Statistical Decision Theory: Estimation, Testing, and Selection. Springer, 2008.
  • [MMKSM18] Ali Madani, Mehdi Moradi, Alexandros Karargyris, and Tanveer Syeda-Mahmood. Chest x-ray generation and data augmentation for cardiovascular abnormality classification. In Medical imaging 2018: Image processing, volume 10574, pages 415–420. SPIE, 2018.
  • [MW15] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. The Annals of Statistics, 43(3):1089–1116, 2015.
  • [Nes03] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • [Pro52] Yu V Prohorov. A local theorem for densities. In Doklady Akad. Nauk SSSR (NS), volume 83, pages 797–800, 1952.
  • [Roy88] Halsey Lawrence Royden. Real analysis (3rd edition). Prentice-Hall, 1988.
  • [RSH19] Kolyan Ray and Johannes Schmidt-Hieber. Asymptotic nonequivalence of density estimation and gaussian white noise for small densities. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 55, pages 2195–2208. Institut Henri Poincaré, 2019.
  • [SK19] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
  • [SM62] S Kh Sirazhdinov and M Mamatov. On convergence in the mean for densities. Theory of Probability & Its Applications, 7(4):424–428, 1962.
  • [SSP+03] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, volume 3. Edinburgh, 2003.
  • [SYPS19] Veit Sandfort, Ke Yan, Perry J Pickhardt, and Ronald M Summers. Data augmentation using generative adversarial networks (cyclegan) to improve generalizability in ct segmentation tasks. Nature Scientific reports, 9(1):1–9, 2019.
  • [Tsy09] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
  • [TUH18] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada. Between-class learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5486–5494, 2018.
  • [VdV00] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • [VLB+19] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019.
  • [Wal50] Abraham Wald. Statistical decision functions. Wiley, 1950.
  • [Wil82] EJ Williams. Some classes of conditional inference procedures. Journal of Applied Probability, 19(A):293–303, 1982.
  • [YHO+19] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
  • [YSH18] Wei Yi, Yaoran Sun, and Sailing He. Data augmentation using conditional gans for facial emotion recognition. In 2018 Progress in Electromagnetics Research Symposium (PIERS-Toyama), pages 710–714. IEEE, 2018.
  • [YWB19] Xin Yi, Ekta Walia, and Paul Babyn. Generative adversarial network in medical imaging: A review. Medical image analysis, 58:101552, 2019.
  • [ZCDLP18] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.