
On the Statistical Complexity of Sample Amplification

Brian Axelrod, Shivam Garg, Yanjun Han, Vatsal Sharan, and Gregory Valiant. B. Axelrod and G. Valiant are with the Department of Computer Science, Stanford University, emails: {baxelrod,valiant}@cs.stanford.edu. S. Garg is with Microsoft Research, email: [email protected]. Y. Han is with the Courant Institute of Mathematical Sciences and the Center for Data Science, New York University, email: [email protected]. V. Sharan is with the Department of Computer Science, University of Southern California, email: [email protected].
Abstract

The “sample amplification” problem formalizes the following question: Given n i.i.d. samples drawn from an unknown distribution P, when is it possible to produce a larger set of n+m samples which cannot be distinguished from n+m i.i.d. samples drawn from P? In this work, we provide a firm statistical foundation for this problem by deriving generally applicable amplification procedures, lower bound techniques, and connections to existing statistical notions. Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.

1 Introduction

Consider the following problem of manufacturing more data: an amplifier is given n samples drawn i.i.d. from an unknown distribution P, and the goal is to generate a larger set of n+m samples which are indistinguishable from n+m i.i.d. samples from P. How large can m be as a function of n and the distribution class in question? Are there sound and systematic ways to generate a larger set of samples? Is this task strictly easier than the learning task, in the sense that the number of samples required for generating n+1 samples is smaller than that required for learning P?

In our preliminary work [AGSV20], we formalized this problem as the sample amplification problem, considering total variation (TV) as the measure for indistinguishability.

Definition 1.1 (Sample Amplification).

Let 𝒫 be a class of probability distributions over a domain 𝒳. We say that 𝒫 admits an (n, n+m, ε) sample amplification procedure if there exists a (possibly randomized) map T_{𝒫,n,m,ε} : 𝒳^n → 𝒳^{n+m} such that

sup_{P∈𝒫} ‖P^{⊗n} ∘ T_{𝒫,n,m,ε}^{−1} − P^{⊗(n+m)}‖_TV ≤ ε.   (1)

An equivalent way to view Definition 1.1 is as a game between two parties: an amplifier and a verifier. The amplifier gets n samples drawn i.i.d. from the unknown distribution P in the class 𝒫, and her goal is to generate a larger dataset of n+m samples which must be accepted with good probability by any verifier that also accepts a real dataset of n+m i.i.d. samples from P with good probability. Here, the verifier is computationally unbounded and knows the distribution P, but does not observe the amplifier’s original set of n samples.

Along with being a natural statistical task, the sample amplification framework is also relevant from a practical standpoint. Currently, there is an enormous trend in the machine learning community to train models on datasets that have been enlarged in various ways. There are relatively transparent and classical approaches to achieve this, such as leveraging known invariances, e.g. rotation or translation invariance, to augment the dataset by including transformed versions of each original datapoint [SSP+03, KSH12, CZM+19, SK19, CZSL20]. More recently, deep generative models have been used both to directly enlarge training data and to construct larger datasets consisting of samples with properties that are rare in naturally occurring datasets [ASE17, CMST17, FADK+18, MMKSM18, YSH18, SYPS19, CPS+19, YWB19, HRA+19, CMV+21, LSM+22, CQZ+22, LSW+23, BKK+22]. More opaque approaches such as MixUp [ZCDLP18] and related techniques [TUH18, YHO+19, VLB+19, HMC+20], which add a significant fraction of new datapoints that are explicitly not supported in the true data distribution, are also very popular since they seem to improve the performance of the trained models. Given this current zoo of widely implemented approaches to enlarging datasets, there is a clear motivation for bringing a more principled statistical understanding to such approaches. One natural starting point is the statistical setting we consider, which asks to what extent datasets can be enlarged in a perfect sense—where it is not possible to distinguish the enlarged dataset from a set of i.i.d. draws. Moreover, this work lays a foundation for the ambitious broader goal of understanding how various amplification techniques interact with downstream learning algorithms and statistical estimators, and developing amplification techniques that are optimal for certain classes of such algorithms and estimators.

In [AGSV20], a subset of the authors introduced the sample amplification problem, and studied two classes of distributions: the Gaussian location model and discrete distribution model. For these examples, they characterized the statistical complexity of sample amplification and showed that it is strictly smaller than that of learning. In this paper, we work towards a general understanding of the statistical complexity of the sample amplification problem, and its relationship with learning. The main contributions of this paper are as follows:

Distribution Class | Amplification | Procedure
---|---|---
Gaussian with unknown mean and fixed covariance | (n, n+Θ(nε/√d)) | Sufficiency/Learning (UB: Examples 4.1, 4.2, A.8; LB: Theorems 6.2, 6.5)
Gaussian with unknown mean and covariance | (n, n+Θ(nε/d)) | Sufficiency (UB: Examples A.1, A.3; LB: Example A.20)
Gaussian with s-sparse mean and identity covariance | (n, n+Θ(nε/√(s log d))) | Learning (UB: Example A.12; LB: Example A.18)
Discrete distributions with support size at most k | (n, n+Θ(nε/√k)) | Learning (UB: Example A.9; LB: [AGSV20, Theorem 1])
Poissonized discrete distributions with support at most k | (n, n+Θ(√n·ε + nε/√k)) | Learning (UB: Example A.16; LB: Example A.16)
d-dim. product of Exponential distributions | (n, n+Θ(nε/√d)) | Sufficiency/Learning (UB: Examples A.5, A.11; LB: Theorem 6.5)
Uniform distribution on d-dim. rectangle | (n, n+Θ(nε/√d)) | Sufficiency/Learning (UB: Examples A.6, A.10; LB: Theorem 6.5)
d-dim. product of Poisson distributions | (n, n+Θ(nε/√d)) | Sufficiency+Learning (UB: Example A.14; LB: Theorem 6.5)

Table 1: Sample amplification achieved by the presented procedures. Results include matching upper bounds (UB) and lower bounds (LB), with pointers to specific examples or theorems for details.
  1.

    Amplification via sufficiency. Our first contribution is a simple yet powerful procedure for sample amplification: apply the sample amplification map only to sufficient statistics. Specifically, the learner computes a sufficient statistic T_n from X^n, maps T_n to some T_{n+m}, and generates new samples X̂^{n+m} from the conditional distribution given T_{n+m}. Surprisingly, this simple idea leads to a much cleaner procedure than that of [AGSV20] for Gaussian location models, and one that is exactly optimal (cf. Theorem 6.2). The range of applicability also extends to general exponential families, with rate-optimal sample amplification performance. Specifically, for general d-dimensional exponential families satisfying a mild moment condition, the sufficiency-based procedure achieves an (n, n+O(nε/√d), ε) sample amplification for large enough n, which by our matching lower bounds in Section 6 is asymptotically minimax rate-optimal.

  2.

    Amplification via learning. Our second contribution is another general sample amplification procedure that does not require the existence of a sufficient statistic, and also sheds light on the relationship between learning and sample amplification. This procedure essentially draws new samples from a rate-optimal estimate of the true distribution, and outputs a random permutation of the old and new samples. The procedure achieves an (n, n+O(ε√(n/r_{χ²}(𝒫, n))), ε) sample amplification, where r_{χ²}(𝒫, n) denotes the minimax risk for learning P ∈ 𝒫 under the expected χ² divergence given n samples. This shows that learning P to within χ² divergence O(n/ε²) is sufficient for non-trivial sample amplification.

    In addition, we show that for the special case of product distributions, it is important that the permutation step be applied coordinatewise to achieve the optimal sample amplification. Specifically, if 𝒫 = ∏_{j=1}^d 𝒫_j, this procedure achieves a better sample amplification

    (n, n + O(ε √(n / Σ_{j=1}^d r_{χ²}(𝒫_j, n))), ε).

    We have summarized in Table 1 several examples where the sufficiency- and/or learning-based sample amplification procedures are optimal. Note that there is no golden rule for choosing one idea over the other, and there exists an example where the two ideas must be combined.

  3.

    Minimax lower bounds. Complementing our sample amplification procedures, we provide a general recipe for proving lower bounds for sample amplification. This recipe is intrinsically different from the standard techniques for proving lower bounds for hypothesis testing, as the sample amplification problem differs significantly from an estimation problem. In particular, specializing our recipe to product models shows that, for essentially all d-dimensional product models, an (n, n+Cnε/√d, ε) sample amplification is impossible for some absolute constant C < ∞ independent of the product model.

    For non-product models, the above powerful result does not directly apply, but proper applications of the general recipe could still lead to tight lower bounds for sample amplification. Specifically, we obtain matching lower bounds for all examples listed in Table 1, including the non-product examples.
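The learning-based procedure described in the second contribution above can be sketched in a few lines. Below is a minimal, illustrative Python sketch for discrete distributions; the function name and the plain empirical estimator are our choices, whereas the procedures analyzed in this paper use rate-optimal estimators and, for product models, coordinatewise permutations:

```python
import numpy as np

def amplify_via_learning(samples, k, m, rng=None):
    """Sketch of learning-based sample amplification on {0, ..., k-1}:
    fit an estimate of the distribution, draw m fresh samples from it,
    and return a uniformly random permutation of old and new samples."""
    rng = np.random.default_rng(rng)
    p_hat = np.bincount(samples, minlength=k) / len(samples)  # empirical estimate
    fresh = rng.choice(k, size=m, p=p_hat)                    # m new draws
    combined = np.concatenate([samples, fresh])
    rng.shuffle(combined)  # random permutation hides which samples are new
    return combined

# Usage: amplify n = 1000 uniform draws on k = 10 symbols by m = 10.
rng = np.random.default_rng(0)
x = rng.choice(10, size=1000)
y = amplify_via_learning(x, k=10, m=10, rng=1)
```

The permutation step matters: simply appending the fresh samples at the end would let a verifier examine the tail of the dataset separately from the rest.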

We now provide several numerical simulations to suggest the potential utility of sample amplification. Recall that a practical motivation of sample amplification is to produce an enlarged dataset that can be fed into a distribution-agnostic algorithm in downstream applications. Here, we consider the following basic experiments in that vein:

Figure 1: Sample amplification experiments. The x-axis corresponds to the amount of amplification, m, and the shaded area depicts the 95% confidence interval based on 5,000 simulations.
(a) Mean absolute errors for estimation of the fourth moment E[X⁴] in a one-dimensional Gaussian location model X_1, …, X_n ~ 𝒩(μ, 1) with n = 100, μ = 1.
(b) Mean absolute errors for estimation of the squared L₂ norm E[‖X‖₂²] in a d-dimensional Gaussian location model X_1, …, X_n ~ 𝒩(μ, I_d) with n = 50, d = 100, μ = 𝟏/√d.
(c) Classification errors for binary classification between two clusters of covariates X_1, …, X_{n/2} ~ 𝒩(e₁, I_d) and X_{n/2+1}, …, X_n ~ 𝒩(−e₁, I_d), with n = 50 and d = 10.
  • Fourth moment estimation for a one-dimensional Gaussian: here we observe X_1, …, X_n ~ 𝒩(μ, 1) with n = 100 and μ = 1, and we consider three estimators. The empirical estimator operates in a distribution-agnostic fashion and is simply the empirical fourth moment n^{−1} Σ_{i=1}^n X_i⁴. The plug-in estimator uses the knowledge of Gaussianity: it first estimates μ̂ = X̄ and then uses E_{X~𝒩(μ̂,1)}[X⁴] = μ̂⁴ + 6μ̂² + 3. The amplified estimator first amplifies the sample X^n into Y^{n+m} via sufficiency (cf. Example 4.1), and then uses the empirical estimator (n+m)^{−1} Σ_{j=1}^{n+m} Y_j⁴ based on the enlarged sample Y^{n+m}. The mean absolute errors (MAEs) are displayed in Figure 1(a). We observe that although the empirical estimator based on the original sample X^n has a large MAE, its performance is improved by using the amplified sample Y^{n+m}.

  • Squared L₂ norm estimation for a high-dimensional Gaussian: here we observe X_1, …, X_n ~ 𝒩(μ, I_d) with n = 50, d = 100, and μ = 𝟏/√d, and we again consider three estimators for E[‖X‖₂²]. As before, the empirical estimator is simply n^{−1} Σ_{i=1}^n ‖X_i‖₂², and the plug-in estimator estimates μ̂ = X̄ and uses the identity E_{X~𝒩(μ̂,I_d)}[‖X‖₂²] = ‖μ̂‖₂² + d. As for the amplified estimator, it first amplifies the sample X^n into Y^{n+m} via sufficiency (cf. Example 4.1), and then uses the empirical estimator based on Y^{n+m}. The mean absolute errors are displayed in Figure 1(b). Here the empirical estimator outperforms the plug-in estimator due to a smaller bias, while sample amplification further reduces the MAE as long as the size of amplification m is not too large. This can be explained by the bias-variance tradeoff, where the amplified estimator interpolates between the empirical estimator (with no bias) and the plug-in estimator (with the smallest asymptotic variance).

  • Binary classification: here we observe two clusters of covariates X_1, …, X_{n/2} ~ 𝒩(e₁, I_d) (with label 1) and X_{n/2+1}, …, X_n ~ 𝒩(−e₁, I_d) (with label −1), with n = 50, d = 10, and e₁ being the first standard basis vector. The target is to train a classifier with high classification accuracy on test data drawn from the same distribution. The standard classifier is logistic regression, which does not use the knowledge of Gaussianity. To apply sample amplification, we first amplify the sample in each class via either sufficiency (cf. Example 4.1) or learning (cf. Example A.8), and then run logistic regression on the amplified datasets. Figure 1(c) displays the classification errors of all three approaches, and shows that both amplification procedures help reduce the classification error even for small values of m.
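The amplifier used in these experiments (Example 4.1 with the identity map between sample means) is easy to implement. Below is a minimal sketch for the one-dimensional case, with variable and function names of our own choosing: conditioned on its sample mean, an i.i.d. 𝒩(0, 1) vector equals that mean plus centered residuals, so the amplified samples can be generated by recentering fresh Gaussian noise.

```python
import numpy as np

def amplify_gaussian_mean(x, m, rng=None):
    """Sufficiency-based amplification for N(mu, 1):
    keep the sample mean T_n fixed and draw n+m samples from the
    conditional distribution given that sufficient statistic."""
    rng = np.random.default_rng(rng)
    t = x.mean()                         # sufficient statistic T_n
    z = rng.standard_normal(len(x) + m)  # reference draws from N(0, 1)
    return t + (z - z.mean())            # sample mean pinned exactly to t

# Fourth-moment experiment in miniature: n = 100, mu = 1, amplify by m = 20.
rng = np.random.default_rng(0)
x = 1.0 + rng.standard_normal(100)
y = amplify_gaussian_mean(x, m=20, rng=1)
emp, amp = np.mean(x**4), np.mean(y**4)  # empirical vs. amplified estimator
```

By construction the amplified sample has exactly the same sample mean as the original, which is precisely the identity map T̂_{n+m} = T_n between sufficient statistics.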

The above experiments demonstrate the potential for sample amplification to leverage knowledge of the data distribution to produce a larger dataset that is then fed into downstream distribution-agnostic algorithms. Some experiments (e.g. Figure 1(b)) also suggest a limit beyond which the amplification procedure alters the data distribution too much. We believe that rigorously examining amplification through the lens of the performance of downstream estimators and algorithms, including those illustrated in our numerical simulations, would be a fruitful direction for future work.

2 Connections, limitations and future work

As discussed above, it is commonplace in machine learning to increase the size of datasets using various heuristics, often resulting in large gains in downstream learning performance. However, a clear statistical understanding of when this is possible, and of which techniques are useful for it, is missing. A natural starting point for a better understanding is the formulation we consider, which asks to what extent datasets can be amplified in a perfect sense—where any verifier who knows the true distribution is unable to distinguish the amplified dataset from a set of i.i.d. draws.

A limitation of the sample amplification formulation described above is that the additive amplification factor m is rather small (e.g., O(nε/√d) for d-dimensional exponential families). Moreover, we show matching lower bounds demonstrating that this factor cannot be improved even when n is large enough to learn the distribution to non-trivial accuracy. However, it might be possible to achieve larger amplification factors with restricted verifiers, for instance, the class of verifiers corresponding to learning algorithms used for downstream tasks (see [AGSV20] for other possible classes of verifiers). Investigating the sample amplification problem with such restricted verifiers may be a practically fruitful future direction.

Despite this limitation, the sample amplification formulation does yield high-level insights that can inform the way datasets are amplified in practice. For instance, from the results in this paper, we know that sample amplification is possible for a broad class of distributions even when learning is not possible. Moreover, both our sufficiency- and learning-based approaches modify the original data points in general, consistent with the lower bound in [AGSV20] showing that optimal amplification may be impossible if the amplifier returns a superset of the input dataset. These observations show that the folklore recipe of enlarging a dataset by learning the data distribution and adding more samples from the learned distribution can be far from optimal.

Connections with other statistical notions. An equivalent view of Definition 1.1 is through Le Cam’s distance [LC72], a classical concept in statistics. The formal definition of Le Cam’s distance Δ(ℳ, 𝒩) is given in Definition 3.1; roughly speaking, it measures the fundamental difference in power between the statistical models ℳ and 𝒩, without resorting to specific estimation procedures. The sample amplification problem is equivalent to the study of Le Cam’s distance Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}) between product models, where (1) is precisely equivalent to Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}) ≤ ε. However, in the statistics literature, Le Cam’s distance was mainly used to study asymptotic equivalence, where a typical target is to show that lim_{n→∞} Δ(ℳ_n, 𝒩_n) = 0 for certain sequences of statistical models. For example, showing that localized regular statistical models converge to Gaussian location models is the fundamental idea behind Hájek–Le Cam asymptotic statistics; see [LC72, LC86, LCY90] and [VdV00, Chapter 9]. In nonparametric statistics, there is also a rich line of research [BL96, BCLZ02, BCLZ04, RSH19] establishing asymptotic (non-)equivalences, based on Le Cam’s distance, between density models, regression models, and Gaussian white noise models. In the above lines of work, only asymptotic results were typically obtained, with a fixed dimension and possibly slow convergence rates. In contrast, we aim to obtain a non-asymptotic characterization of Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}) in (n, m) and the dimension of the problem, a task which is largely underexplored in the literature.

Another related angle is reductions between statistical models. Over the past decade there has been growing interest in constructing polynomial-time reductions between various statistical models (typically from the planted clique problem) to prove statistical-computational gaps; see, e.g., [BR13, MW15, BB19, BB20]. Sample amplification falls into this reduction framework, and aims to perform reductions from a product model 𝒫^{⊗n} to another product model 𝒫^{⊗(n+m)}. While previous reduction techniques were mainly constructive and employed to prove computational lower bounds, in this paper we also develop general tools to prove limitations of all possible reductions purely from the statistical perspective.

Organization. The rest of this paper is organized as follows. Section 3 collects notation and preliminaries, and in particular introduces the concept of Le Cam’s distance. Section 4 introduces a sufficiency-based procedure for sample amplification, with asymptotic properties for general exponential families and non-asymptotic performance in several specific examples. Section 5 is devoted to a learning-based procedure for sample amplification, with a general relationship between sample amplification and the χ² estimation error, as well as its applications in several examples. Section 6 presents the general idea for establishing lower bounds for sample amplification, with a universal result specializing to product models. Section 7 discusses more examples of sample amplification and learning, and shows that these tasks are in general non-comparable. More concrete examples of both the upper and lower bounds, auxiliary lemmas, and proofs are relegated to the appendices.

3 Preliminaries

We use the following notation throughout this paper. For a random variable X, let ℒ(X) be the law (i.e., probability distribution) of X. For a probability distribution P on a probability space Ω and a measurable map T : Ω → Ω′, let P ∘ T^{−1} denote the pushforward probability measure, i.e., ℒ(T(X)) with ℒ(X) = P. For a probability measure P, let P^{⊗n} be the n-fold product measure. For a positive integer n, let [n] ≜ {1, …, n} and x^n ≜ (x_1, …, x_n). We adopt the following asymptotic notation: for two non-negative sequences (a_n) and (b_n), we use a_n = O(b_n) to denote limsup_{n→∞} a_n/b_n < ∞, a_n = Ω(b_n) to denote b_n = O(a_n), and a_n = Θ(b_n) to denote both a_n = O(b_n) and b_n = O(a_n). We also use the notations O_c, Ω_c, Θ_c for the respective meanings with hidden constants depending on c. For probability measures P, Q defined on the same probability space, the total variation (TV) distance, Hellinger distance, Kullback–Leibler (KL) divergence, and chi-squared divergence are defined as follows:

‖P − Q‖_TV = (1/2) ∫ |dP − dQ|,        H(P, Q) = ((1/2) ∫ (√dP − √dQ)²)^{1/2},
D_KL(P‖Q) = ∫ dP log(dP/dQ),          χ²(P‖Q) = ∫ (dP − dQ)² / dQ.

We will frequently use the following inequalities between the above quantities [Tsy09, Chapter 2]:

H²(P, Q) ≤ ‖P − Q‖_TV ≤ H(P, Q) √(2 − H²(P, Q)),   (2)
‖P − Q‖_TV ≤ √((1/2) D_KL(P‖Q)) ≤ √((1/2) log(1 + χ²(P‖Q))).   (3)
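As a quick numerical sanity check of these definitions, the chain of inequalities (2) and (3) can be verified for an arbitrary pair of discrete distributions; the particular p and q below are our own choices and have full support, so all divergences are finite:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

tv = 0.5 * np.abs(p - q).sum()                             # total variation
h = np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())  # Hellinger distance
kl = (p * np.log(p / q)).sum()                             # KL divergence
chi2 = ((p - q) ** 2 / q).sum()                            # chi-squared divergence

assert h**2 <= tv <= h * np.sqrt(2 - h**2)                       # inequality (2)
assert tv <= np.sqrt(0.5 * kl) <= np.sqrt(0.5 * np.log1p(chi2))  # inequality (3)
```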

Next we define several quantities related to Definition 1.1. For a given distribution class 𝒫 and sample sizes n and m, the minimax error of sample amplification is defined as

ε*(𝒫, n, m) ≜ inf_T sup_{P∈𝒫} ‖P^{⊗(n+m)} − P^{⊗n} ∘ T^{−1}‖_TV,   (4)

where the infimum is over all (possibly randomized) measurable mappings T : 𝒳^n → 𝒳^{n+m}. For a given error level ε, the maximum size of sample amplification is the largest m such that there exists an (n, n+m, ε) sample amplification, i.e.

m*(𝒫, n, ε) ≜ max{m ∈ ℕ : ε*(𝒫, n, m) ≤ ε}.   (5)

For ease of presentation, we often choose ε to be a small constant (say 0.1) and abbreviate the above quantity as m*(𝒫, n); we remark that all our results hold for a generic ε ∈ (0, 1). Finally, we also define the sample amplification complexity as the smallest n such that an amplification from n to n+1 samples is possible:

n*(𝒫) ≜ min{n ∈ ℕ : m*(𝒫, n) ≥ 1}.   (6)

Note that all the above notions are instance-wise in the distribution class 𝒫.

The minimax error of sample amplification (4) is precisely what is known as Le Cam’s distance in the statistics literature. We adopt the standard framework of statistical decision theory [Wal50]. A statistical model (or experiment) ℳ is a tuple (𝒳, (P_θ)_{θ∈Θ}) where an observation X ~ P_θ is drawn for some θ ∈ Θ. A decision rule δ is a regular conditional probability kernel from 𝒳 to the family of probability distributions on a general action space 𝒜, and there is a (measurable) loss function L : Θ × 𝒜 → ℝ₊. The risk function of a given decision rule δ is defined as

R_ℳ(θ, δ, L) ≜ E_θ[L(θ, δ(X))] = ∫_𝒳 ∫_𝒜 L(θ, a) δ(da | x) P_θ(dx).   (7)

Based on the definition of risk functions, we are ready to define a metric, known as Le Cam’s distance, between statistical models.

Definition 3.1 (Le Cam’s distance; see [LC72, LC86, LCY90]).

For two statistical models ℳ = (𝒳, (P_θ)_{θ∈Θ}) and 𝒩 = (𝒴, (Q_θ)_{θ∈Θ}), Le Cam’s distance Δ(ℳ, 𝒩) is defined as

Δ(ℳ, 𝒩) = max{ sup_L sup_{δ_𝒩} inf_{δ_ℳ} sup_{θ∈Θ} |R_ℳ(θ, δ_ℳ, L) − R_𝒩(θ, δ_𝒩, L)|,
                sup_L sup_{δ_ℳ} inf_{δ_𝒩} sup_{θ∈Θ} |R_ℳ(θ, δ_ℳ, L) − R_𝒩(θ, δ_𝒩, L)| }
           = max{ inf_{T₁} sup_θ ‖P_θ ∘ T₁^{−1} − Q_θ‖_TV,  inf_{T₂} sup_θ ‖Q_θ ∘ T₂^{−1} − P_θ‖_TV },

where the supremum is taken over all measurable loss functions L : Θ × 𝒜 → [0, 1].

In the language of model deficiency introduced in [LC64], Le Cam’s distance is the smallest ε > 0 such that the model ℳ is ε-deficient relative to the model 𝒩, and 𝒩 is also ε-deficient relative to ℳ. In the sample amplification problem, (P_θ)_{θ∈Θ} = {P^{⊗n} : P ∈ 𝒫} and (Q_θ)_{θ∈Θ} = {P^{⊗(n+m)} : P ∈ 𝒫}. Here, choosing T₂(x^{n+m}) = x^n in Definition 3.1 shows that 𝒩 is 0-deficient relative to ℳ, and the remaining quantity involving T₁ exactly reduces to the minimax error of sample amplification in (4). Therefore, studying the complexity of sample amplification is equivalent to characterizing the quantity Δ(𝒫^{⊗n}, 𝒫^{⊗(n+m)}).

4 Sample amplification via sufficient statistics

The first idea we present for sample amplification is the classical idea of reduction by sufficiency. Albeit simple, the sufficiency-based idea reduces the problem of generating multiple random vectors to a simpler problem of generating only a few vectors, achieves the optimal complexity of sample amplification in many examples, and is easy to implement.

4.1 The general idea

We first recall the definition of sufficient statistics: in a statistical model ℳ = (𝒳, (P_θ)_{θ∈Θ}) with X ~ P_θ, a statistic T = T(X) ∈ 𝒯 is sufficient if and only if both θ–X–T and θ–T–X are Markov chains. A classical result in statistical decision theory is reduction by sufficiency, i.e., only the sufficient statistic needs to be maintained to perform statistical tasks, as P_{X|T,θ} does not depend on the unknown parameter θ. In terms of Le Cam’s distance, let ℳ ∘ T^{−1} = (𝒯, (P_θ ∘ T^{−1})_{θ∈Θ}) be the statistical experiment associated with T; then sufficiency of T implies that Δ(ℳ, ℳ ∘ T^{−1}) = 0. Hereafter, we will call ℳ ∘ T^{−1} the T-reduced model, or simply the reduced model.

Algorithm 1 Sample amplification via sufficiency
1: Input: samples X_1, …, X_n, a given transformation f between sufficient statistics
2: Compute the sufficient statistic T_n = T_n(X_1, …, X_n).
3: Apply f to the sufficient statistic and compute T̂_{n+m} = f(T_n).
4: Generate (X̂_1, …, X̂_{n+m}) ~ P_{X^{n+m} | T_{n+m}}(· | T̂_{n+m}).
5: Output: amplified samples (X̂_1, …, X̂_{n+m}).

Reduction by sufficiency can be applied to sample amplification in a simple way, with the general algorithm displayed in Algorithm 1. Suppose that both models 𝒫^{⊗n} and 𝒫^{⊗(n+m)} admit sufficient statistics T_n = T_n(X^n) and T_{n+m} = T_{n+m}(X^{n+m}), respectively. Algorithm 1 claims that it suffices to perform sample amplification on the reduced models 𝒫^{⊗n} ∘ T_n^{−1} and 𝒫^{⊗(n+m)} ∘ T_{n+m}^{−1}, i.e., to construct a randomization map f from T_n to T_{n+m}. Concretely, the algorithm decomposes into three steps:

  1. Step I: map X^n to T_n. This step is straightforward: we simply compute T_n = T_n(X_1, …, X_n).

  2. Step II: apply a randomization map in the reduced model. Upon choosing the map f, we simply compute T̂_{n+m} = f(T_n), with the target that the TV distance ‖ℒ(T̂_{n+m}) − ℒ(T_{n+m})‖_TV is uniformly small. The concrete choice of f depends on the specific model.

  3. Step III: map T_{n+m} to X^{n+m}. By sufficiency of T_{n+m}, the conditional distribution P_{X^{n+m} | T_{n+m}} does not depend on the unknown model. Therefore, after replacing the true statistic T_{n+m} by T̂_{n+m}, it is feasible to generate X̂^{n+m} ~ P_{X^{n+m} | T_{n+m}}(· | T̂_{n+m}). To simulate this random vector, it suffices to choose any distribution P₀ ∈ 𝒫 and generate X̂^{n+m} ~ (P₀^{⊗(n+m)} | T_{n+m}(X̂^{n+m}) = T̂_{n+m}). This step may suffer from computational issues, which will be discussed in Section 4.2.

The validity of this idea simply follows from

Δ(𝒫n,𝒫(n+m))=Δ(𝒫nTn1,𝒫(n+m)Tn+m1),\displaystyle\Delta({\mathcal{P}}^{\otimes n},{\mathcal{P}}^{\otimes(n+m)})=\Delta({\mathcal{P}}^{\otimes n}\circ T_{n}^{-1},{\mathcal{P}}^{\otimes(n+m)}\circ T_{n+m}^{-1}),

or equivalently, under each P𝒫P\in{\mathcal{P}},

(X^n+m)(Xn+m)TV\displaystyle\|{\mathcal{L}}(\widehat{X}^{n+m})-{\mathcal{L}}(X^{n+m})\|_{\text{TV}} =(T^n+m)×PXn+mTn+m(Tn+m)×PXn+mTn+mTV\displaystyle=\|{\mathcal{L}}(\widehat{T}_{n+m})\times P_{X^{n+m}\mid T_{n+m}}-{\mathcal{L}}(T_{n+m})\times P_{X^{n+m}\mid T_{n+m}}\|_{\text{TV}}
=(a)(T^n+m)(Tn+m)TV=(f(Tn))(Tn+m)TV,\displaystyle\overset{\rm(a)}{=}\|{\mathcal{L}}(\widehat{T}_{n+m})-{\mathcal{L}}(T_{n+m})\|_{\text{TV}}=\|{\mathcal{L}}(f(T_{n}))-{\mathcal{L}}(T_{n+m})\|_{\text{TV}},

where (a) is due to the identity PXPY|XQXPY|XTV=PXQXTV\|P_{X}P_{Y|X}-Q_{X}P_{Y|X}\|_{\text{TV}}=\|P_{X}-Q_{X}\|_{\text{TV}}. In other words, it suffices to work on reduced models and find the map ff between sufficient statistics.
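The identity used in step (a), that passing both marginals through the same channel preserves the TV distance, can be checked numerically in a small discrete example. The marginals and channel below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Marginals P_X, Q_X on a 3-point space and a shared channel W = P_{Y|X}.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])
W = np.array([[0.9, 0.1],      # row x gives the conditional P_{Y|X=x}
              [0.4, 0.6],
              [0.1, 0.9]])

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

joint_P = (P[:, None] * W).ravel()   # P_X x P_{Y|X}
joint_Q = (Q[:, None] * W).ravel()   # Q_X x P_{Y|X}
# Since sum_y W(y|x) = 1, the joint TV collapses to the marginal TV.
```

The collapse holds term by term: summing |P(x) − Q(x)| W(y|x) over y recovers |P(x) − Q(x)|.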

This idea of reduction by sufficiency simplifies the design of sample amplification procedures. Unlike in original models where XnX^{n} and Xn+mX^{n+m} typically take values in spaces of different dimensions, in reduced models the sufficient statistics TnT_{n} and Tn+mT_{n+m} are usually drawn from the same space. A simple example is as follows.

Example 4.1 (Gaussian location model with known covariance).

Consider the observations X1,,XnX_{1},\cdots,X_{n} from the Gaussian location model Pθ=𝒩(θ,Σ)P_{\theta}={\mathcal{N}}(\theta,\Sigma) with an unknown mean θd\theta\in\mathbb{R}^{d} and a known covariance Σd×d\Sigma\in\mathbb{R}^{d\times d}. To amplify to n+mn+m samples, note that the sample mean vector is a sufficient statistic here, with

Tn(X1,,Xn)=1ni=1nXi𝒩(θ,Σ/n).\displaystyle T_{n}(X_{1},\cdots,X_{n})=\frac{1}{n}\sum_{i=1}^{n}X_{i}\sim{\mathcal{N}}(\theta,\Sigma/n).

Now consider the identity map between sufficient statistics T^n+m=Tn\widehat{T}_{n+m}=T_{n} used with Algorithm 1. The amplified samples (X^1,,X^n+m)(\widehat{X}_{1},\cdots,\widehat{X}_{n+m}) are drawn from 𝒩(0,Σ){\mathcal{N}}(0,\Sigma) conditioned on the event that Tn+m(X^n+m)=T^n+m=Tn(Xn)T_{n+m}(\widehat{X}^{n+m})=\widehat{T}_{n+m}=T_{n}(X^{n}). For every mean vector θd\theta\in\mathbb{R}^{d} we can upper bound the amplification error of this approach:

(T^n+m)(Tn+m)TV\displaystyle\|{\mathcal{L}}(\widehat{T}_{n+m})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}} =(Tn)(Tn+m)TV\displaystyle=\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}
=𝒩(θ,Σ/n)𝒩(θ,Σ/(n+m))TV\displaystyle=\|{\mathcal{N}}(\theta,\Sigma/n)-{\mathcal{N}}(\theta,\Sigma/(n+m))\|_{\text{\rm TV}}
(a)12DKL(𝒩(θ,Σ/n)𝒩(θ,Σ/(n+m)))\displaystyle\overset{\rm(a)}{\leq}\sqrt{\frac{1}{2}D_{\text{\rm KL}}({\mathcal{N}}(\theta,\Sigma/n)\|{\mathcal{N}}(\theta,\Sigma/(n+m)))}
=d4(mnlog(1+mn))=O(mdn),\displaystyle=\sqrt{\frac{d}{4}\left(\frac{m}{n}-\log\left(1+\frac{m}{n}\right)\right)}=O\left(\frac{m\sqrt{d}}{n}\right),

where (a) is due to (3), and the last step holds whenever m=O(n)m=O(n). Therefore, we can generate Ω(n/d)\Omega(n/\sqrt{d}) additional samples from the nn observations, and the complexity of sample amplification in (6) is n=O(d)n^{\star}=O(\sqrt{d}). In contrast, learning this distribution to within a small TV distance requires n=Ω(d)n=\Omega(d) samples, so learning is strictly harder than sample amplification. This example recovers the upper bound of [AGSV20] with a much simpler analysis, and in later sections we will show that this approach is exactly minimax optimal.
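For concreteness, the final bound can be evaluated numerically. The sketch below simply implements the displayed expression; the particular values of n, m, d are hypothetical:

```python
import math

def tv_upper_bound(n, m, d):
    """Evaluate sqrt(d/4 * (m/n - log(1 + m/n))) from Example 4.1."""
    r = m / n
    return math.sqrt(d / 4.0 * (r - math.log1p(r)))

# With m on the order of n / sqrt(d), the bound stays bounded away from 1,
# and it shrinks further as m decreases.
err = tv_upper_bound(n=1000, m=100, d=100)   # here m = n / sqrt(d)
```

Since m/n − log(1 + m/n) ≈ (m/n)²/2 for small m/n, taking m = cn/√d makes the bound of constant order c/√8, matching the O(m√d/n) claim.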

We make two remarks for the above example. First, the amplified samples X^n+m\widehat{X}^{n+m} are no longer independent, either marginally or conditioned on XnX^{n}. Therefore, the above approach is fundamentally different from first estimating the distribution and then generating independent samples from the estimated distribution. Second, the amplified samples do not contain the original samples as a subset. In contrast, a tempting approach for sample amplification is to add mm fake samples to the original nn observations. However, [AGSV20] showed that any sample amplification approach containing the original samples cannot succeed if n=o(d/logd)n=o(d/\log d) in the above model, and our approach conforms to this result. More examples will be presented in Section A.1.

4.2 Computational issues

A natural computational question in Algorithm 1 is how to sample X^n+mPXn+mTn+m(T^n+m)\widehat{X}^{n+m}\sim P_{X^{n+m}\mid T_{n+m}}(\cdot\mid\widehat{T}_{n+m}) in a computationally efficient way. Under the additional mild assumption that the sufficient statistic TT is also complete (which is easy to find in exponential families), the conditional distribution PXTP_{X\mid T} can be sampled efficiently if we can find a statistic S=S(X)S=S(X) with the following two properties:

  1. 1.

SS is ancillary, i.e. the distribution (S){\mathcal{L}}(S) does not depend on the model parameter θ\theta;

  2. 2.

    There is a (measurable) bijection gg between (T,S)(T,S) and XX, i.e. X=g(T,S)X=g(T,S) almost surely.

In fact, if such an SS exists, then under any θΘ\theta\in\Theta,

(XT=t)=(a)(g(T,S)T=t)=(b)(g(t,S)),\displaystyle{\mathcal{L}}(X\mid T=t)\overset{\rm(a)}{=}{\mathcal{L}}(g(T,S)\mid T=t)\overset{\rm(b)}{=}{\mathcal{L}}(g(t,S)),

where (a) is due to the assumed bijection gg between (T,S)(T,S) and XX, and (b) is due to a classical result of Basu [Bas55, Bas58] that SS and TT are independent. Therefore, by the ancillarity of SS, we could sample XPθ0X\sim P_{\theta_{0}} with any θ0Θ\theta_{0}\in\Theta and compute the statistic SS from XX, then g(t,S)g(t,S) follows the desired conditional distribution PX|T=tP_{X|T=t}. An example of this procedure is illustrated below.

Example 4.2 (Computation in Gaussian location model).

Consider the setting of Example 4.1 where Pθ=𝒩(θ,Σ)(n+m),Tn+m=(n+m)1i=1n+mXiP_{\theta}={\mathcal{N}}(\theta,\Sigma)^{\otimes(n+m)},T_{n+m}=(n+m)^{-1}\sum_{i=1}^{n+m}X_{i}, and the target is to sample from the distribution PXn+m|Tn+mP_{X^{n+m}|T_{n+m}}. In this model, Tn+mT_{n+m} is complete and sufficient, and we choose S=S(Xn+m)=(S1,,Sn+m1)S=S(X^{n+m})=(S_{1},\cdots,S_{n+m-1}) with Si=Xi+1X1S_{i}=X_{i+1}-X_{1} for all ii. Clearly SS is ancillary, and Xn+mX^{n+m} could be recovered from (Tn+m,S)(T_{n+m},S) via

X1=Tn+mi=1n+m1Sin+m,Xi+1=X1+Si,i[n+m1].\displaystyle X_{1}=T_{n+m}-\frac{\sum_{i=1}^{n+m-1}S_{i}}{n+m},\qquad X_{i+1}=X_{1}+S_{i},\quad i\in[n+m-1].

Therefore, the choice of SS satisfies both conditions. Consequently, we can draw Zn+m𝒩(0,Σ)(n+m)Z^{n+m}\sim{\mathcal{N}}(0,\Sigma)^{\otimes(n+m)}, compute S=S(Zn+m)S=S(Z^{n+m}) (where Si=Zi+1Z1S_{i}=Z_{i+1}-Z_{1}), and recover Xn+mX^{n+m} from (Tn+m,S)(T_{n+m},S).
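Putting Examples 4.1 and 4.2 together, the whole amplifier reduces to recentering fresh Gaussian noise around the observed sample mean: reconstructing X^{n+m} from (T̂, S) with S_i = Z_{i+1} − Z_1 simplifies algebraically to X_i = Z_i − Z̄ + T̂. A minimal sketch (the input X below is hypothetical):

```python
import numpy as np

def amplify_gaussian(X, m, Sigma, rng):
    """Sufficiency-based amplification for N(theta, Sigma) (Examples 4.1-4.2)."""
    n, d = X.shape
    T_hat = X.mean(axis=0)          # identity map between sufficient statistics
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n + m)
    # Recover X^{n+m} from (T_hat, S): equivalent to recentering Z at T_hat.
    return Z - Z.mean(axis=0) + T_hat

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.ones(3), np.eye(3), size=50)   # 50 samples, d = 3
X_amp = amplify_gaussian(X, 10, np.eye(3), rng)               # amplify to 60
```

By construction the amplified samples carry exactly the same sufficient statistic (sample mean) as the original data, while their joint law matches the conditional distribution of Example 4.2.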

The proper choice of SS depends on the specific model and may require some effort to find; we refer to Section A.1 for more examples. We remark that in general there is no golden rule for finding SS. One tempting approach is to find a maximal ancillary statistic SS such that any other ancillary statistic SS^{\prime} must be a function of SS. This idea is motivated by the existence of the minimal sufficient statistic under mild conditions, together with a known computationally efficient approach to compute it. However, for ancillary statistics there is typically no such maximal statistic in the above sense, and there may exist uncountably many “maximal” ancillary statistics which are not equivalent to each other. From the measure-theoretic perspective, this is because the family of all ancillary sets is not closed under intersection and thus does not form a σ\sigma-algebra. In addition, even if a proper notion of a “maximal” or “essentially maximal” ancillary statistic could be defined, there is no guarantee that it would satisfy the bijection condition, and it is hard to determine whether a given ancillary statistic is maximal. We refer to [Bas59, LS92] for detailed discussions on ancillarity in mathematical statistics.

The conditional inference literature [Wil82] offers another procedure for sampling from PXn|TnP_{X^{n}|T_{n}}. Specifically, for each i[n]i\in[n], this approach sequentially generates the observation XiX_{i} from the one-dimensional distribution PXiXi1,TnP_{X_{i}\mid X^{i-1},T_{n}}, which is a simple task as long as its CDF is available. Although this works in simple models such as the Gaussian location model above, in more complicated models exact computation of the CDF is typically infeasible.
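As an illustration of this sequential scheme, consider the one-dimensional unit-variance Gaussian location model, where each conditional is an explicit Gaussian: given X^{i−1} and the mean constraint, the k = n − i + 1 remaining samples are i.i.d. N(θ, 1) conditioned on their residual sum s, so X_i ~ N(s/k, (k − 1)/k), free of θ. This is a sketch under those model assumptions:

```python
import numpy as np

def sequential_conditional_sample(T, n, rng):
    """Draw (X_1, ..., X_n) ~ N(theta, 1)^n conditioned on sample mean T."""
    x = np.zeros(n)
    remaining_sum = n * T
    for i in range(n):
        k = n - i                            # samples left to draw (incl. this one)
        # Conditional law of X_i given the residual sum constraint:
        x[i] = rng.normal(remaining_sum / k, np.sqrt((k - 1) / k))
        remaining_sum -= x[i]                # last draw (k = 1) is deterministic
    return x

rng = np.random.default_rng(0)
x = sequential_conditional_sample(2.5, 20, rng)
```

The output satisfies the mean constraint exactly, since the final conditional has zero variance.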

4.3 General theory for exponential families

In this section, we show that a general (n,n+Ω(nε/d),ε)(n,n+\Omega(n\varepsilon/\sqrt{d}),\varepsilon) sample amplification phenomenon holds for a rich class of exponential families, and is achieved by the sufficiency-based procedure in Algorithm 1. Specifically, we consider the following natural exponential family.

Definition 4.3 (Exponential family).

The exponential family (𝒳,(Pθ)θΘ)({\mathcal{X}},(P_{\theta})_{\theta\in\Theta}) of probability measures is determined by

dPθ(x)=exp(θT(x)A(θ))dμ(x),\displaystyle dP_{\theta}(x)=\exp(\theta^{\top}T(x)-A(\theta))d\mu(x),

where θΘ\theta\in\Theta is the natural parameter with Θ={θd:A(θ)<}\Theta=\{\theta\in\mathbb{R}^{d}:A(\theta)<\infty\}, T(x)T(x) is the sufficient statistic, A(θ)A(\theta) is the log-partition function, and μ\mu is the base measure.

The exponential family includes many widely used probability distributions such as the Gaussian, Gamma, Poisson, Exponential, and Beta distributions. In an exponential family, the statistic T(x)T(x) is sufficient and complete, and two well-known identities are 𝔼θ[T(X)]=A(θ)\mathbb{E}_{\theta}[T(X)]=\nabla A(\theta) and 𝖢𝗈𝗏θ[T(X)]=2A(θ)\mathsf{Cov}_{\theta}[T(X)]=\nabla^{2}A(\theta). We refer to [DY79] for a mathematical theory of exponential families.
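These identities are easy to check numerically. Below is a sketch for the one-dimensional exponential distribution written as an exponential family with natural parameter θ < 0, statistic T(x) = x, and log-partition A(θ) = −log(−θ); the specific θ is an arbitrary choice:

```python
import numpy as np

theta = -2.0                  # natural parameter; the rate is -theta
grad_A = -1.0 / theta         # A'(theta),  should equal E_theta[T(X)]
hess_A = 1.0 / theta ** 2     # A''(theta), should equal Var_theta[T(X)]

rng = np.random.default_rng(0)
x = rng.exponential(scale=-1.0 / theta, size=200_000)
# Monte Carlo check: sample mean ~ grad A, sample variance ~ hess A.
mean_err, var_err = abs(x.mean() - grad_A), abs(x.var() - hess_A)
```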

To establish a general theory of sample amplification for exponential families, we shall make the following assumptions on the exponential family.

Assumption 1 (Continuity).

The parameter set Θ\Theta has a non-empty interior. Under each θΘ\theta\in\Theta, the probability distribution (T(X)){\mathcal{L}}(T(X)) is absolutely continuous with respect to the Lebesgue measure.

Assumption 2 (Moment condition 𝖬k\mathsf{M}_{k}).

For a given integer k>0k>0, it holds that

supθΘ𝔼θ[(2A(θ))1/2(T(X)A(θ))2k]<.\displaystyle\sup_{\theta\in\Theta}\mathbb{E}_{\theta}\left[\left\|(\nabla^{2}A(\theta))^{-1/2}(T(X)-\nabla A(\theta))\right\|_{2}^{k}\right]<\infty.

We call it the moment condition 𝖬k\mathsf{M}_{k}.

Assumption 1 requires an exponential family of continuous distributions. The motivation is that for a continuous exponential family, the sufficient statistics Tn(X)T_{n}(X) and Tn+m(X)T_{n+m}(X) for different sample sizes take continuous values in the same space, so it is easier to construct a general transformation. We will propose a different sample amplification approach for discrete statistical models in Section 5. Assumption 2 is a moment condition on the normalized statistic (2A(θ))1/2(T(X)A(θ))(\nabla^{2}A(\theta))^{-1/2}(T(X)-\nabla A(\theta)), whose moments always exist since the moment generating function of T(X)T(X) exists around the origin. The moment condition 𝖬k\mathsf{M}_{k} states that the above normalized statistic has a uniformly bounded kk-th moment over all θΘ\theta\in\Theta, which holds in several examples (such as the Gaussian, exponential, and Pareto families) and can otherwise be ensured by restricting to a slightly smaller Θ0Θ\Theta_{0}\subseteq\Theta bounded away from the boundary. The following lemma presents a sufficient criterion for the moment condition 𝖬k\mathsf{M}_{k}.

Lemma 4.4.

If the log-partition function A(θ)A(\theta) satisfies

supθΘsupud\{0}|3A(θ)[u;u;u]|(2A(θ)[u;u])3/2M<,\displaystyle\sup_{\theta\in\Theta}\sup_{u\in\mathbb{R}^{d}\backslash\{0\}}\frac{|\nabla^{3}A(\theta)[u;u;u]|}{(\nabla^{2}A(\theta)[u;u])^{3/2}}\leq M<\infty,

then the exponential family satisfies the moment condition 𝖬k\mathsf{M}_{k} for all kk\in\mathbb{N}. Here for a kk-tensor TT and vectors u1,,uku_{1},\cdots,u_{k}, T[u1;;uk]T[u_{1};\cdots;u_{k}] denotes the value of T,u1uk\langle T,u_{1}\otimes\cdots\otimes u_{k}\rangle.

The condition in Lemma 4.4 is called the self-concordant condition, which is a key concept in the interior point method for convex optimization [Nes03]. For example, all quadratic functions and logarithmic functions are self-concordant (which correspond to Gaussian, exponential, and Pareto distributions), and the self-concordance is always fulfilled when Θ\Theta is compact.
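As a quick sanity check of the condition in Lemma 4.4, consider the one-dimensional exponential distribution with A(θ)=logθA(\theta)=-\log(-\theta) on Θ=(,0)\Theta=(-\infty,0). Then

A′′(θ)=1/θ2,A′′′(θ)=2/θ3,supθ<0|A′′′(θ)|(A′′(θ))3/2=supθ<02/|θ|31/|θ|3=2,\displaystyle A^{\prime\prime}(\theta)=1/\theta^{2},\qquad A^{\prime\prime\prime}(\theta)=-2/\theta^{3},\qquad\sup_{\theta<0}\frac{|A^{\prime\prime\prime}(\theta)|}{(A^{\prime\prime}(\theta))^{3/2}}=\sup_{\theta<0}\frac{2/|\theta|^{3}}{1/|\theta|^{3}}=2,

so the self-concordance condition holds with M=2M=2 uniformly over the entire natural parameter space.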

Given any exponential family 𝒫{\mathcal{P}} satisfying Assumptions 1 and 2, we will show that a simple procedure achieves a sample amplification of size Ω(n/d)\Omega(n/\sqrt{d}). Let X1,,XnX_{1},\cdots,X_{n} be i.i.d. samples drawn from PθP_{\theta} of the general form in Definition 4.3; then the sample average

Tn(Xn)1ni=1nT(Xi)\displaystyle T_{n}(X^{n})\triangleq\frac{1}{n}\sum_{i=1}^{n}T(X_{i})

is a sufficient statistic by the factorization theorem. We will apply the general Algorithm 1 with an identity map between sufficient statistics, i.e. T^n+m=Tn\widehat{T}_{n+m}=T_{n}. The next theorem shows the performance of this approach.

Theorem 4.5.

If the exponential family 𝒫{\mathcal{P}} satisfies Assumptions 1 and 2 with k=3k=3, then for θΘ\theta\in\Theta, it holds that

ε(𝒫,n,m)(Tn)(Tn+m)TVCn+mdn,\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}\leq\frac{C}{\sqrt{n}}+\frac{m\sqrt{d}}{n},

where C<C<\infty is an absolute constant depending only on dd and the moment upper bound in Assumption 2. In particular, for sufficiently large nn, a sample amplification of size Ω(n/d)\Omega(n/\sqrt{d}) is achievable.

Theorem 4.5 shows that the above simple procedure can achieve a sample amplification from nn to n+Ω(n/d)n+\Omega(n/\sqrt{d}) samples in general continuous exponential families, provided that nn is large enough. The main idea behind the proof of Theorem 4.5 is also simple. We show that the distribution of TnT_{n} is approximately Gn𝒩(A(θ),2A(θ)/n)G_{n}\sim{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n) by the CLT, apply the same CLT to Tn+mT_{n+m}, and then compute the TV distance between two Gaussians as in Example 4.1. Theorem 4.5 is then a direct consequence of the triangle inequality:

(Tn)(Tn+m)TV\displaystyle\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}
(Tn)(Gn)TV+(Gn)(Gn+m)TV+(Tn+m)(Gn+m)TV.\displaystyle\leq\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(G_{n})\|_{\text{\rm TV}}+\|{\mathcal{L}}(G_{n})-{\mathcal{L}}(G_{n+m})\|_{\text{\rm TV}}+\|{\mathcal{L}}(T_{n+m})-{\mathcal{L}}(G_{n+m})\|_{\text{\rm TV}}.

Note that Assumption 1 ensures a vanishing TV distance for the Gaussian approximation, and Assumption 2 enables us to apply Berry–Esseen type arguments and obtain an O(1/n)O(1/\sqrt{n}) convergence rate for the Gaussian approximation.

The main drawback of Theorem 4.5 is that there is a hidden constant CC depending on the dimension dd, so it does not imply that an (n,n+1,ε)(n,n+1,\varepsilon) sample amplification is possible whenever n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon). To tackle this issue, we need to improve the first term in Theorem 4.5 and find the best possible dependence of the constant CC on dd. We remark that this is a challenging task in probability theory: although the convergence rates of both TV [Pro52, SM62, BC14, BC16] and KL [Bar86, BCG14] in the CLT were studied, almost all of these works focused solely on the convergence rate in nn, leaving the tight dependence on dd open. Moreover, direct computation of the quantity (Tn)(Gn)TV\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(G_{n})\|_{\text{\rm TV}} shows that even if the random vector TnT_{n} has independent components, this quantity is typically at least Ω(d/n)\Omega(\sqrt{d/n}). Therefore, C=Ω(d)C=\Omega(\sqrt{d}) under this proof technique, and a vanishing first term in Theorem 4.5 requires n=Ω(d)n=\Omega(d), which is already larger than the anticipated sample amplification complexity n=O(d)n=O(\sqrt{d}).

To overcome the above difficulties, we make the following changes to both the assumption and the analysis. First, to avoid the unknown dependence on dd, we additionally assume a product exponential family, i.e. Pθ(dx)=i=1dpθi(dxi)P_{\theta}(dx)=\prod_{i=1}^{d}p_{\theta_{i}}(dx_{i}), where each pθi(xi)p_{\theta_{i}}(x_{i}) is a one-dimensional exponential family. Exploiting the product structure enables us to find a constant CC depending linearly on dd. Second, we improve the O(1/n)O(1/\sqrt{n}) dependence on nn by applying a higher-order CLT result to TnT_{n} and Tn+mT_{n+m}, known as the Edgeworth expansion [BR10]. For any k2k\geq 2 and nn\in\mathbb{N}, the signed measure of the Edgeworth expansion on d\mathbb{R}^{d} is

Γn,k(dx)=γ(x)(1+=1k/3𝒦(x)n/2)dx,\displaystyle\Gamma_{n,k}(dx)=\gamma(x)\left(1+\sum_{\ell=1}^{\lfloor k/3\rfloor}\frac{{\mathcal{K}}_{\ell}(x)}{n^{\ell/2}}\right)dx, (8)

where γ(x)\gamma(x) is the density of a standard normal random variable on d\mathbb{R}^{d}, and 𝒦(x){\mathcal{K}}_{\ell}(x) is a polynomial of degree 33\ell which depends only on the first 33\ell moments of the distribution. We note that unlike the CLT, the general Edgeworth expansion is a signed measure with possibly negative densities; however, it is close to Gaussian with an O(n1/2)O(n^{-1/2}) approximation error. Such a higher-order expansion enables us to obtain better Berry–Esseen type bounds, but upper bounding Γn,kΓn+m,kTV\|\Gamma_{n,k}-\Gamma_{n+m,k}\|_{\text{\rm TV}} becomes more complicated and requires handling the Gaussian part and the correction part separately; see Appendix B.2 for details. In particular, we can improve the error dependence on nn from O(1/n)O(1/\sqrt{n}) to O(1/n2)O(1/n^{2}).

Formally, the next theorem shows a better sample amplification performance for product exponential families.

Theorem 4.6.

Let (𝒳,𝒫=(Pθ)θΘ)({\mathcal{X}},{\mathcal{P}}=(P_{\theta})_{\theta\in\Theta}) be a product exponential family, where each one-dimensional component satisfies Assumptions 1 and 2 with k=10k=10. Then for θΘ\theta\in\Theta, it holds that

ε(𝒫,n,m)(Tn)(Tn+m)TVC(dn2+mdn),\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(T_{n+m})\|_{\text{\rm TV}}\leq C\left(\frac{d}{n^{2}}+\frac{m\sqrt{d}}{n}\right),

where C<C<\infty is an absolute constant independent of (n,d)(n,d). In particular, as long as n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon), an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification of size m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}) is achievable.

Theorem 4.6 shows that for product exponential families, we not only achieve the amplification size m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}), but also attain a sample complexity n=O(d/ε)n=O(\sqrt{d}/\varepsilon) for sample amplification. This additional result on sample complexity is important in the sense that even when distribution learning is impossible, sample amplification may still be possible. Although the independence and even the exponential family assumptions may be restrictive in practice, in Section A.1 we show that both phenomena m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}) and n=O(d/ε)n=O(\sqrt{d}/\varepsilon) hold in many natural models.

5 Sample amplification via learning

The sufficiency-based approach to sample amplification is not always desirable. First, models outside the exponential family typically do not admit non-trivial sufficient statistics, so the reduction by sufficiency may not be very helpful. Second, the identity map applied to the sufficient statistics only works for continuous models, and incurs too large a TV distance when the underlying model is discrete. Third, the previous approaches are not directly related to learning the model, so a general relationship between learning and sample amplification is largely missing. In this section, we propose another sample amplification approach, and show how a good learner yields a good sample amplifier.

5.1 The general result

For a class of distributions 𝒫{\mathcal{P}} and nn i.i.d. observations drawn from an unknown P𝒫P\in{\mathcal{P}}, we define the following notion of the χ2\chi^{2}-estimation error.

Definition 5.1 (χ2\chi^{2}-estimation error).

For a class of distributions 𝒫{\mathcal{P}} and sample size nn, the χ2\chi^{2}-estimation error rχ2(𝒫,n)r_{\chi^{2}}({\mathcal{P}},n) is defined to be the minimax estimation error under the expected χ2\chi^{2}-divergence:

rχ2(𝒫,n)infP^nsupP𝒫𝔼P[χ2(P^n,P)],\displaystyle r_{\chi^{2}}({\mathcal{P}},n)\triangleq\inf_{\widehat{P}_{n}}\sup_{P\in{\mathcal{P}}}\mathbb{E}_{P}[\chi^{2}(\widehat{P}_{n},P)],

where the infimum is taken over all possible distribution estimators P^n\widehat{P}_{n} based on nn samples.

Roughly speaking, the χ2\chi^{2}-estimation error in the above definition characterizes the complexity of the distribution class 𝒫{\mathcal{P}} in terms of distribution learning under the χ2\chi^{2}-divergence. The main result of this section shows that the error of sample amplification in (4) can be upper bounded using the χ2\chi^{2}-estimation error.

Theorem 5.2.

For general 𝒫{\mathcal{P}} and n,m0n,m\geq 0, it holds that

ε(𝒫,n,m)m2nrχ2(𝒫,n/2).\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\sqrt{\frac{m^{2}}{n}\cdot r_{\chi^{2}}({\mathcal{P}},n/2)}.

The following corollary is immediate from Theorem 5.2.

Corollary 5.3.

An (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is possible if m=O(εn/rχ2(𝒫,n/2))m=O(\varepsilon\sqrt{n/r_{\chi^{2}}({\mathcal{P}},n/2)}). Moreover, the sample complexity of amplification in (6) satisfies

n(𝒫)=O(min{n:rχ2(𝒫,n/2)n}).\displaystyle n^{\star}({\mathcal{P}})=O\left(\min\left\{n\in\mathbb{N}:r_{\chi^{2}}({\mathcal{P}},n/2)\leq n\right\}\right).
Remark 5.4.

Although the error of sample amplification in (4) is measured under the TV distance, the same result holds for the square root of the KL divergence (which by (3) is no smaller than the TV distance).

The above result provides a quantitative guarantee that the sample amplification is easier than learning (under the χ2\chi^{2}-divergence). Specifically, the sample complexity of learning is the smallest nn\in\mathbb{N} such that rχ2(𝒫,n)=O(1)r_{\chi^{2}}({\mathcal{P}},n)=O(1), while Corollary 5.3 shows that the complexity for amplification is at most the smallest nn\in\mathbb{N} such that rχ2(𝒫,n/2)=O(n)r_{\chi^{2}}({\mathcal{P}},n/2)=O(n). As rχ2(𝒫,n)r_{\chi^{2}}({\mathcal{P}},n) is non-increasing in nn, this means that the learning complexity is in general larger.

When the distribution class 𝒫{\mathcal{P}} has a product structure 𝒫=j=1d𝒫j{\mathcal{P}}=\prod_{j=1}^{d}{\mathcal{P}}_{j}, the next theorem shows a better relationship between the amplification error and the learning error.

Theorem 5.5.

For 𝒫=j=1d𝒫j{\mathcal{P}}=\prod_{j=1}^{d}{\mathcal{P}}_{j} and n,m0n,m\geq 0, it holds that

ε(𝒫,n,m)m2nj=1drχ2(𝒫j,n/2).\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\leq\sqrt{\frac{m^{2}}{n}\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)}.
Corollary 5.6.

For product models, an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is achievable if

m=O(εnj=1drχ2(𝒫j,n/2)).\displaystyle m=O\left(\varepsilon\sqrt{\frac{n}{\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)}}\right).

Moreover, the sample complexity of amplification in (6) satisfies

n(𝒫)=O(min{n:j=1drχ2(𝒫j,n/2)n}).\displaystyle n^{\star}({\mathcal{P}})=O\left(\min\left\{n\in\mathbb{N}:\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)\leq n\right\}\right).

We observe that the result of Theorem 5.5 typically improves over Theorem 5.2 for product models. In fact, since

χ2(j=1dPj,j=1dQj)=j=1d(1+χ2(Pj,Qj))1j=1dχ2(Pj,Qj),\displaystyle\chi^{2}\left(\prod_{j=1}^{d}P_{j},\prod_{j=1}^{d}Q_{j}\right)=\prod_{j=1}^{d}(1+\chi^{2}(P_{j},Q_{j}))-1\geq\sum_{j=1}^{d}\chi^{2}(P_{j},Q_{j}),

the inequality j=1drχ2(𝒫j,n/2)rχ2(𝒫,n/2)\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)\leq r_{\chi^{2}}({\mathcal{P}},n/2) typically holds. Moreover, there are scenarios where j=1drχ2(𝒫j,n/2)rχ2(𝒫,n/2)\sum_{j=1}^{d}r_{\chi^{2}}({\mathcal{P}}_{j},n/2)\ll r_{\chi^{2}}({\mathcal{P}},n/2), so Theorem 5.5 provides a substantial improvement over Theorem 5.2. For example, when 𝒫=(𝒩(θ,Id))θd{\mathcal{P}}=({\mathcal{N}}(\theta,I_{d}))_{\theta\in\mathbb{R}^{d}}, it can be verified that rχ2(𝒫j,n/2)=O(1/n)r_{\chi^{2}}({\mathcal{P}}_{j},n/2)=O(1/n) for each j[d]j\in[d], while rχ2(𝒫,n/2)=exp(Ω(d/n))1r_{\chi^{2}}({\mathcal{P}},n/2)=\exp(\Omega(d/n))-1. Hence, in the important regime dnd\sqrt{d}\ll n\ll d where learning is impossible but sample amplification is possible, Theorem 5.5 is strictly stronger than Theorem 5.2.
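The tensorization identity and the resulting inequality are elementary to verify numerically; the per-coordinate divergence values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.uniform(0.0, 0.5, size=20)        # c_j = chi^2(P_j, Q_j) >= 0 per coordinate
# chi^2 of the product distributions: prod(1 + c_j) - 1, which dominates sum(c_j).
product_form = np.prod(1.0 + c) - 1.0
gap = product_form - c.sum()
```

The gap grows multiplicatively in the dimension, which is why the per-coordinate bound of Theorem 5.5 can be much smaller than the joint bound of Theorem 5.2.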

Remark 5.7.

In the above Gaussian location model, there is an alternative way to conclude that Theorem 5.5 is strictly stronger than Theorem 5.2. We will see that the shuffling approach achieving the bound in Theorem 5.2 keeps all the observed samples, whereas [AGSV20] shows that all such approaches must incur a sample complexity n=Ω(d/logd)n=\Omega(d/\log d) for the Gaussian model. In contrast, Theorem 5.5 and Corollary 5.6 give a sample complexity n=O(d)n=O(\sqrt{d}) of amplification in the Gaussian location model.

5.2 The shuffling approach

This section presents the sample amplification approaches achieving the bounds in Theorems 5.2 and 5.5. The idea is simple: we find a good distribution learner P^n\widehat{P}_{n} which attains the rate-optimal χ2\chi^{2}-estimation error, draw mm additional samples Y1,,YmY_{1},\cdots,Y_{m} from P^n\widehat{P}_{n}, and shuffle them with the original samples X1,,XnX_{1},\cdots,X_{n} uniformly at random. This approach suffices to achieve the sample amplification error in Theorem 5.2, while for Theorem 5.5 an additional trick is applied: instead of shuffling the whole vectors, we shuffle each coordinate independently. For technical reasons, in both approaches we apply sample splitting: the first n/2n/2 samples are used for the estimation of P^n\widehat{P}_{n}, while the second n/2n/2 samples are used for shuffling. The algorithms are summarized in Algorithms 2 and 3.

Algorithm 2 Sample amplification via shuffling: general model
1:Input: samples X1,,XnX_{1},\cdots,X_{n}, a given class of distributions 𝒫{\mathcal{P}}.
2:Based on samples X1,,Xn/2X_{1},\cdots,X_{n/2}, find an estimator P^n\widehat{P}_{n} such that
supP𝒫𝔼P[χ2(P^n,P)]Crχ2(𝒫,n/2).\displaystyle\sup_{P\in{\mathcal{P}}}\mathbb{E}_{P}[\chi^{2}(\widehat{P}_{n},P)]\leq C\cdot r_{\chi^{2}}({\mathcal{P}},n/2).
3:Draw mm additional samples Y1,,YmY_{1},\cdots,Y_{m} from P^n\widehat{P}_{n}.
4:Uniformly at random, shuffle the pool of Xn/2+1,,Xn,Y1,,YmX_{n/2+1},\cdots,X_{n},Y_{1},\cdots,Y_{m} to obtain (Z1,,Zn/2+m)(Z_{1},\cdots,Z_{n/2+m}).
5:Output: amplified samples (X1,,Xn/2,Z1,,Zn/2+m)(X_{1},\cdots,X_{n/2},Z_{1},\cdots,Z_{n/2+m}).
Algorithm 3 Sample amplification via shuffling: product model
1:Input: samples X1,,XnX_{1},\cdots,X_{n}, a given class of product distributions 𝒫=j=1d𝒫j{\mathcal{P}}=\prod_{j=1}^{d}{\mathcal{P}}_{j}
2:for j=1,2,,dj=1,2,\cdots,d do
3:     Based on samples X1,j,,Xn/2,jX_{1,j},\cdots,X_{n/2,j}, find an estimator P^n,j\widehat{P}_{n,j} such that
supPj𝒫j𝔼Pj[χ2(P^n,j,Pj)]Crχ2(𝒫j,n/2).\displaystyle\sup_{P_{j}\in{\mathcal{P}}_{j}}\mathbb{E}_{P_{j}}[\chi^{2}(\widehat{P}_{n,j},P_{j})]\leq C\cdot r_{\chi^{2}}({\mathcal{P}}_{j},n/2).
4:     Draw mm additional samples Y1,j,,Ym,jY_{1,j},\cdots,Y_{m,j} from P^n,j\widehat{P}_{n,j}.
5:     Uniformly at random, shuffle Xn/2+1,j,,Xn,j,Y1,j,,Ym,jX_{n/2+1,j},\cdots,X_{n,j},Y_{1,j},\cdots,Y_{m,j} to obtain (Z1,j,,Zn/2+m,j)(Z_{1,j},\cdots,Z_{n/2+m,j}).
6:end for
7:For each i[n/2+m]i\in[n/2+m], form the vector Zi=(Zi,1,,Zi,d)Z_{i}=(Z_{i,1},\cdots,Z_{i,d}).
8:Output: amplified samples (X1,,Xn/2,Z1,,Zn/2+m)(X_{1},\cdots,X_{n/2},Z_{1},\cdots,Z_{n/2+m}).
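The coordinate-wise shuffling of Algorithm 3 can be sketched as follows for the Gaussian product model N(θ, I_d), where the sample mean of the first half plays the role of the learner P̂ (its χ² risk is O(1/n) per coordinate in this model); the input data are hypothetical:

```python
import numpy as np

def amplify_by_shuffling(X, m, rng):
    """Sketch of Algorithm 3 for the product Gaussian model N(theta, I_d)."""
    n, d = X.shape
    half = n // 2
    theta_hat = X[:half].mean(axis=0)              # learn P_hat on the first half
    Y = theta_hat + rng.standard_normal((m, d))    # m fake samples from P_hat
    pool = np.vstack([X[half:], Y])                # second half + fake samples
    for j in range(d):                             # shuffle each coordinate
        pool[:, j] = rng.permutation(pool[:, j])   # independently
    return np.vstack([X[:half], pool])             # output n + m samples

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) + 1.0            # hypothetical theta = (1, ..., 1)
out = amplify_by_shuffling(X, 10, rng)
```

Note that the first n/2 samples are passed through unchanged, exactly as in step 8 of the algorithm.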

The following lemma is the key to analyzing the performance of the shuffling approach.

Lemma 5.8.

Let X1,,XnX_{1},\cdots,X_{n} be i.i.d. drawn from PP, and Y1,,YmY_{1},\cdots,Y_{m} be i.i.d. drawn from QQ independent of (X1,,Xn)(X_{1},\cdots,X_{n}). Let (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}) be a uniformly random permutation of (X1,,Xn,Y1,,Ym)(X_{1},\cdots,X_{n},Y_{1},\cdots,Y_{m}), and PmixP_{\text{\rm mix}} be the distribution of the random mixture (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}). Then

χ2(Pmix,P(n+m))(1+mn+mχ2(Q,P))m1.\displaystyle\chi^{2}\left(P_{\text{\rm mix}},P^{\otimes(n+m)}\right)\leq\left(1+\frac{m}{n+m}\chi^{2}(Q,P)\right)^{m}-1.

Based on Lemma 5.8, the advantage of random shuffling is clear: if we simply append Y1,,YmY_{1},\cdots,Y_{m} to the end of the original sequence X1,,XnX_{1},\cdots,X_{n}, then the χ2\chi^{2}-divergence is exactly (1+χ2(Q,P))m1(1+\chi^{2}(Q,P))^{m}-1. Comparing with the upper bound in Lemma 5.8, we observe that after a random shuffle a smaller coefficient m/(n+m)m/(n+m) multiplies the individual χ2\chi^{2}-divergence. The proofs of Theorems 5.2 and 5.5 are then immediate: we simply take Q=P^nQ=\widehat{P}_{n} and apply the above lemma. Note that the statement of Lemma 5.8 requires that Y1,,YmY_{1},\cdots,Y_{m} be independent of X1,,XnX_{1},\cdots,X_{n}, which is precisely why we apply sample splitting in Algorithms 2 and 3. The proof of Lemma 5.8 is presented in Appendix C, and the complete proofs of Theorems 5.2 and 5.5 are relegated to Appendix B. We also include concrete examples of the shuffling approach in Section A.2.
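A quick numerical comparison makes the advantage concrete; the values of n, m, and the per-sample divergence χ²(Q, P) below are hypothetical:

```python
n, m, chi2 = 1000, 30, 0.05                     # hypothetical parameters

# Lemma 5.8 bound for the shuffled pool vs. naive appending of the fakes.
shuffled = (1 + m / (n + m) * chi2) ** m - 1
appended = (1 + chi2) ** m - 1
```

With these numbers the shuffled bound stays small while the appended one is of constant order, reflecting the m/(n + m) discount inside the exponentiation.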

6 Minimax lower bounds

In this section we establish minimax lower bounds for sample amplification in different statistical models. Section 6.1 presents a general and tight approach for establishing the lower bound, which leads to an exact sample amplification result for the Gaussian location model. Based on this result, we show that for dd-dimensional continuous exponential families, the sample amplification size cannot be of order ω(nε/d)\omega(n\varepsilon/\sqrt{d}) for sufficiently large sample size nn. Section 6.2 provides a specialized criterion for product models, where we show that n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon) and m=O(nε/d)m=O(n\varepsilon/\sqrt{d}) are always valid lower bounds, with hidden constants independent of all parameters. Section A.3 lists several concrete examples where our general idea can be applied to provide tight and non-asymptotic results.

6.1 General idea

The main tool for establishing the lower bound is the first equality in Definition 3.1 of Le Cam’s distance. Specifically, for a class of distributions 𝒫=(Pθ)θΘ{\mathcal{P}}=(P_{\theta})_{\theta\in\Theta}, let μ\mu be a given prior distribution on Θ\Theta, and L:Θ×𝒜[0,1]L:\Theta\times{\mathcal{A}}\to[0,1] be a given non-negative loss function upper bounded by one. Given nn i.i.d. samples from an unknown distribution in 𝒫{\mathcal{P}}, define the following Bayes risk and minimax risk:

rB(𝒫,n,L,μ)\displaystyle r_{\text{\rm B}}({\mathcal{P}},n,L,\mu) =infθ^Θ𝔼θ[L(θ,θ^(Xn))]μ(dθ),\displaystyle=\inf_{\widehat{\theta}}\int_{\Theta}\mathbb{E}_{\theta}[L(\theta,\widehat{\theta}(X^{n}))]\mu(d\theta),
r(𝒫,n,L)\displaystyle r({\mathcal{P}},n,L) =infθ^supθΘ𝔼θ[L(θ,θ^(Xn))],\displaystyle=\inf_{\widehat{\theta}}\sup_{\theta\in\Theta}\mathbb{E}_{\theta}[L(\theta,\widehat{\theta}(X^{n}))],

where the infimum is over all possible estimators θ^()\widehat{\theta}(\cdot) taking value in 𝒜{\mathcal{A}}. The following result is a direct consequence of Definition 3.1.

Lemma 6.1.

For any integer n,m>0n,m>0, any class of distributions 𝒫=(Pθ)θΘ{\mathcal{P}}=(P_{\theta})_{\theta\in\Theta}, any prior μ\mu on Θ\Theta, and any loss function L:Θ×𝒜[0,1]L:\Theta\times{\mathcal{A}}\to[0,1], the minimax error of sample amplification ε(𝒫,n,m)\varepsilon^{\star}({\mathcal{P}},n,m) in (4) satisfies that

ε(𝒫,n,m)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m) rB(𝒫,n,L,μ)rB(𝒫,n+m,L,μ),\displaystyle\geq r_{\text{\rm B}}({\mathcal{P}},n,L,\mu)-r_{\text{\rm B}}({\mathcal{P}},n+m,L,\mu),
ε(𝒫,n,m)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m) r(𝒫,n,L)r(𝒫,n+m,L).\displaystyle\geq r({\mathcal{P}},n,L)-r({\mathcal{P}},n+m,L).

Based on Lemma 6.1, it suffices to find an appropriate prior distribution μ\mu and a loss function LL, and then compute (or lower bound) the difference between the Bayes or minimax risks with different sample sizes. We note that the lower bound technique in [AGSV20], albeit seemingly different, is a special case of Lemma 6.1. Specifically, the authors of [AGSV20] designed a set-valued mapping An:θ𝒫(𝒳n)A_{n}:\theta\to{\mathcal{P}}({\mathcal{X}}^{n}) for each nn\in\mathbb{N} such that θ(Xn+mAn+m(θ))0.99\mathbb{P}_{\theta}(X^{n+m}\in A_{n+m}(\theta))\geq 0.99 for all θΘ\theta\in\Theta, while there is a prior distribution μ\mu on Θ\Theta such that

𝔼Xn[supxn𝒳nθXn(xnAn(θ))]0.5.\displaystyle\mathbb{E}_{X^{n}}\left[\sup_{x^{n}\in{\mathcal{X}}^{n}}\mathbb{P}_{\theta\mid X^{n}}(x^{n}\in A_{n}(\theta))\right]\leq 0.5. (9)

If the above conditions hold, then an (n,n+m)(n,n+m) sample amplification is impossible. Note that the probability term in (9) is the maximum coverage probability of the sets An(θ)A_{n}(\theta) where θ\theta follows the posterior distribution θXn\mathbb{P}_{\theta\mid X^{n}}, which is a well-defined geometric object when both An(θ)A_{n}(\theta) and the posterior are known. To see that the above approach falls into our framework, consider the loss function L:Θ×n1𝒳n[0,1]L:\Theta\times\cup_{n\geq 1}{\mathcal{X}}^{n}\to[0,1] with L(θ,Xn)=𝟙(XnAn(θ))L(\theta,X^{n})=\mathbbm{1}(X^{n}\notin A_{n}(\theta)). Then the first condition ensures that rB(𝒫,n+m,L,ν)0.01r_{\text{B}}({\mathcal{P}},n+m,L,\nu)\leq 0.01 for each prior ν\nu, and the second condition (9) exactly states that rB(𝒫,n,L,μ)0.5r_{\text{B}}({\mathcal{P}},n,L,\mu)\geq 0.5 for the chosen prior μ\mu.
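To make Lemma 6.1 concrete, consider a uniform two-point prior and the 0–1 test loss: the Bayes risk with k samples is then (1 − TV_k)/2, so the first inequality of the lemma yields ε*(𝒫, n, m) ≥ (TV_{n+m} − TV_n)/2. The following minimal numerical sketch instantiates this for a Bernoulli pair; the points and constants are illustrative choices, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import binom

def tv_binom(k, p, q):
    # TV between Bern(p)^{(x)k} and Bern(q)^{(x)k}; by sufficiency of the
    # count, this equals the TV between Binomial(k, p) and Binomial(k, q)
    j = np.arange(k + 1)
    return 0.5 * np.abs(binom.pmf(j, k, p) - binom.pmf(j, k, q)).sum()

# uniform two-point prior {p, q} with the 0-1 test loss: the Bayes risk
# with k samples is (1 - TV_k)/2, so Lemma 6.1 gives
#   eps*(P, n, m) >= (TV_{n+m} - TV_n)/2
n, m, p, q = 100, 50, 0.5, 0.55
lower_bound = 0.5 * (tv_binom(n + m, p, q) - tv_binom(n, p, q))
```

Since the TV sequence is nondecreasing in the sample size, the resulting bound is always nonnegative, and it is strictly positive whenever the extra m samples genuinely improve the test.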

A first application of Lemma 6.1 is an exact lower bound in Gaussian location models.

Theorem 6.2.

For the Gaussian location model 𝒫={𝒩(θ,Σ)}θd{\mathcal{P}}=\{{\mathcal{N}}(\theta,\Sigma)\}_{\theta\in\mathbb{R}^{d}} with a fixed covariance Σd×d\Sigma\in\mathbb{R}^{d\times d}, the minimax error of sample amplification in (4) is exactly

ε(𝒫,n,m)=𝒩(0,Idn)𝒩(0,Idn+m)TV.\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{\rm TV}}.

In particular, the sufficiency-based approach in Example 4.1 is exactly minimax optimal.
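The TV distance in Theorem 6.2 can be evaluated in closed form: the two densities cross on the sphere ∥x∥² = r* with r* = (d/m)·log(1 + m/n), so the TV distance reduces to a difference of chi-square CDFs. A short numerical sketch (assuming scipy):

```python
import numpy as np
from scipy.stats import chi2

def tv_theorem_6_2(n, m, d):
    # N(0, I_d/n) vs N(0, I_d/(n+m)): the densities cross on the sphere
    # ||x||^2 = r_star, so the TV distance is a difference of chi^2 CDFs
    r_star = (d / m) * np.log1p(m / n)
    return chi2.cdf((n + m) * r_star, d) - chi2.cdf(n * r_star, d)
```

Evaluating this expression confirms numerically that the minimax error stays bounded away from one precisely when m = O(nε/√d), matching the sufficiency-based upper bound of Example 4.1.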

Theorem 6.2 shows that an exact error characterization for the Gaussian location model is possible through the general lower bound approach in Lemma 6.1. This result is also asymptotically useful for a rich family of models: by the CLT, the sufficient statistic in a continuous exponential family follows a Gaussian distribution asymptotically, with a vanishing TV distance. This idea was used in Section 4.3 to establish the O(nε/√d) upper bound, and the same observation leads to an Ω(nε/√d) lower bound as well, under slightly different assumptions. Specifically, we drop Assumption 2 while introducing an additional assumption.

Assumption 3 (Linear independence).

The components of the sufficient statistic T(x) are linearly independent, i.e. aᵀT(x) = 0 for μ-almost all x ∈ 𝒳 implies a = 0.

Assumption 3 ensures that the true dimension of the exponential family is indeed d. Whenever Assumption 3 fails, we can pass to a minimal exponential family of lower dimension that fulfills it. Note that when Assumptions 1 and 3 hold, the mean mapping θ ↦ ∇A(θ) is a diffeomorphism between Θ and its image ∇A(Θ); see, e.g. [ML08, Theorem 1.22]. Therefore, ∇A(·) is an open map, and the set {∇A(θ): θ ∈ Θ} contains a d-dimensional ball. This fact enables us to obtain a d-dimensional Gaussian location model after we apply the CLT.

The following theorem characterizes an asymptotic lower bound for every exponential family satisfying Assumptions 1 and 3.

Theorem 6.3.

Given a dd-dimensional exponential family 𝒫{\mathcal{P}} satisfying Assumptions 1 and 3, for every n,mn,m\in\mathbb{N}, the minimax error of sample amplification satisfies

ε(𝒫,n,m)c(mdn1)C(lognn)13,\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq c\cdot\left(\frac{m\sqrt{d}}{n}\wedge 1\right)-C\cdot\left(\frac{\log n}{n}\right)^{\frac{1}{3}},

where c>0c>0 is an absolute constant independent of (n,m,d,𝒫)(n,m,d,{\mathcal{P}}), and constant C>0C>0 depends only on the exponential family (and thus on dd).

Theorem 6.3 shows that there exists some n₀ > 0 depending only on the exponential family, such that sample amplification from n to n + ω(nε/√d) samples is impossible for all n > n₀. However, as with the upper bound in Theorem 4.5, this asymptotic result does not imply that sample amplification is impossible when n = o(√d/ε). Nevertheless, in the following sections we show that the sample complexity lower bound n = Ω(√d/ε) does hold for product families, as well as in several other concrete examples.

6.2 Product models

Although Lemma 6.1 presents a general lower bound argument, computing exact Bayes or minimax risks can be very challenging, and the usual rate-optimal analysis (i.e. bounding the risks within a multiplicative constant) does not lead to meaningful results here. In addition, choosing the right prior and loss is a difficult task which may change from instance to instance. It is therefore helpful to develop specialized versions of Lemma 6.1 that are easier to work with. Perhaps surprisingly, such a simple version exists for product models, as summarized in the following theorem.

Theorem 6.4.

Let ε(0,1)\varepsilon\in(0,1) and Pθ=j=1dpθjP_{\theta}=\prod_{j=1}^{d}p_{\theta_{j}} be a product model with (θ1,,θd)j=1dΘj(\theta_{1},\cdots,\theta_{d})\in\prod_{j=1}^{d}\Theta_{j}. Suppose for each j[d]j\in[d], there exist two points θj,+,θj,Θj\theta_{j,+},\theta_{j,-}\in\Theta_{j} such that

pθj,+npθj,nTV\displaystyle\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{\rm TV}} αjεd,\displaystyle\leq\alpha_{j}-\frac{\varepsilon}{\sqrt{d}}, (10)
pθj,+(n+m)pθj,(n+m)TV\displaystyle\|p_{\theta_{j,+}}^{\otimes(n+m)}-p_{\theta_{j,-}}^{\otimes(n+m)}\|_{\text{\rm TV}} αj+εd,\displaystyle\geq\alpha_{j}+\frac{\varepsilon}{\sqrt{d}}, (11)

with αj(α¯,α¯)\alpha_{j}\in(\underline{\alpha},\overline{\alpha}), where α¯,α¯(0,1)\underline{\alpha},\overline{\alpha}\in(0,1) are absolute constants. Then there exists an absolute constant c=c(α¯,α¯)>0c=c(\underline{\alpha},\overline{\alpha})>0 such that

ε(𝒫,n,m)cε.\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq c\varepsilon.

Theorem 6.4 leaves the choices of the prior and loss function in Lemma 6.1 implicit, and provides a simple criterion for product models. The usual routine for applying Theorem 6.4 is as follows: fix a constant α and a target error ε, and for each j ∈ [d] find two points θ_{j,+}, θ_{j,−} ∈ Θ_j such that condition (10) holds for the given sample size n. Condition (11) then becomes an inequality in m alone, from which we can solve for the smallest m_j ∈ ℕ such that (11) holds along the j-th coordinate. Setting m = max_{j∈[d]} m_j, the above theorem shows that sample amplification from n to n + m samples is impossible. Although Theorem 6.4 is stated only for product models, similar ideas can also be applied to non-product models; we refer to Section A.3 for concrete examples.
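The routine above can be carried out numerically. The following sketch does so for a Bernoulli product model, where the pair θ_{j,±} and all constants are illustrative choices (scipy assumed):

```python
import numpy as np
from scipy.stats import binom

def tv_binom(k, p, q):
    # TV between k-fold products of Bern(p) and Bern(q), computed via the
    # sufficient Binomial counts
    j = np.arange(k + 1)
    return 0.5 * np.abs(binom.pmf(j, k, p) - binom.pmf(j, k, q)).sum()

n, d, eps, alpha = 200, 25, 0.3, 0.5
p, q = 0.5, 0.5 + 1 / (2 * np.sqrt(n))       # candidate pair theta_{j,+-}
assert tv_binom(n, p, q) <= alpha - eps / np.sqrt(d)      # condition (10)
m = 1
while tv_binom(n + m, p, q) < alpha + eps / np.sqrt(d):   # solve (11) for m
    m += 1
# amplification from n to n+m samples is now ruled out by Theorem 6.4
```

The loop terminates because the TV distance between the k-fold products increases to one as k grows; the resulting m scales like nε/√d for this choice of pair, consistent with the theorem.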

Theorem 6.4 also provides some intuition for why the sample complexity lower bound for amplification is typically smaller than that for learning. Specifically, learning under the TV distance requires a small TV distance ∥∏_{j=1}^d p_{θ_{j,+}}^{⊗n} − ∏_{j=1}^d p_{θ_{j,−}}^{⊗n}∥_TV between product distributions. This requirement typically forces a much smaller individual TV distance ∥p_{θ_{j,+}}^{⊗n} − p_{θ_{j,−}}^{⊗n}∥_TV, e.g. O(1/√d) for many regular models. In contrast, conditions (10) and (11) only require a constant individual TV distance, which leads to a smaller sample complexity n in the sample amplification lower bound. To understand why a larger individual TV distance suffices for sample amplification, in the proof of Theorem 6.4 we consider the uniform prior on the 2^d points ∏_{j=1}^d {θ_{j,+}, θ_{j,−}}. Under this prior, the test accuracy along each dimension is precisely (1 + TV_j)/2, which is slightly smaller than (1 + α)/2 with n samples and slightly larger than (1 + α)/2 with n + m samples (assuming α_j ≡ α). Therefore, if a unit loss is incurred whenever the fraction of correct tests does not exceed (1 + α)/2, the scaling in (10) and (11) yields an Ω(ε) difference between the expected losses under the two sample sizes. In other words, such an aggregate voting test can exploit a larger individual TV distance. The details of the proof are deferred to Appendix B.

Theorem 6.4 has a far-reaching consequence: with almost no assumption on the product model 𝒫{\mathcal{P}}, for any c>0c>0 it always holds that ε(𝒫,n,cεn/d)cε\varepsilon^{\star}({\mathcal{P}},n,\lceil c\varepsilon n/\sqrt{d}\rceil)\geq c^{\prime}\varepsilon for some absolute constant c>0c^{\prime}>0 independent of the product model 𝒫{\mathcal{P}}. The only assumption (besides the product structure) we make on 𝒫{\mathcal{P}} is as follows (here nn\in\mathbb{N} is a given sample size):

Assumption 4.

Let 𝒫 possess the product structure as in Theorem 6.4. For each j ∈ [d], there exist two points θ_{j,+}, θ_{j,−} ∈ Θ_j such that 1/(10n) ≤ H²(p_{θ_{j,+}}, p_{θ_{j,−}}) ≤ 1/(5n).

Assumption 4 is mild: it essentially requires only that, for each coordinate, the map θ_j ↦ p_{θ_j} be continuous under the Hellinger distance. This assumption is satisfied by almost all practical models, discrete or continuous, and is invariant to model reparametrizations and bijective transformations of the observations. We note that the constants 1/10 and 1/5 are not essential, and can be replaced by any smaller constants. The next theorem states that when Assumption 4 holds, we always have the lower bound n = Ω(√d) on the sample complexity and the upper bound m = O(n/√d) on the size of sample amplification.

Theorem 6.5.

Let 𝒫{\mathcal{P}} be a product model satisfying Assumption 4. Then for any c>0c>0, there is some c>0c^{\prime}>0 depending only on cc (thus independent of n,d,ε,𝒫n,d,\varepsilon,{\mathcal{P}}) such that

ε(𝒫,n,cεnd)cε.\displaystyle\varepsilon^{\star}\left({\mathcal{P}},n,\left\lceil\frac{c\varepsilon n}{\sqrt{d}}\right\rceil\right)\geq c^{\prime}\varepsilon.

Theorem 6.5 is a general lower bound for sample amplification in product models, with the intriguing property that it holds instance-wise in the model 𝒫 while the constants c and c′ are independent of 𝒫. As a result, the sample complexity is uniformly Ω(√d/ε), and the maximum size of sample amplification is uniformly O(nε/√d), over all product models. In comparison, the matching upper bound in Theorem 4.6 for product models has a hidden constant depending on the statistical model. It is indeed natural to have sample amplification results independent of the underlying statistical model: by definition, sample amplification is invariant to bijective transformations of the observations. However, Assumption 2 depends on such transformations, and thus possibly contains some redundancy. In contrast, Assumption 4 remains invariant, and is therefore more natural.

The proof idea of Theorem 6.5 is best illustrated in the case d = 1. Using the two points θ_+, θ_− in Assumption 4, one can show that the TV distance between n copies of p_{θ_+} and p_{θ_−} is upper bounded by a small constant. Similarly, for a large C > 0, the TV distance between Cn copies of them is lower bounded by a large constant. Consequently, if m = (C − 1)n, Theorem 6.4 applied with d = 1 gives an Ω(1) lower bound on ε*(𝒫, n, m). What happens if m = cεn for a small c? The idea is to consider the TV distances between n, n + cεn, n + 2cεn, ⋯, Cn copies of p_{θ_+} and p_{θ_−}, which form an increasing sequence. By the pigeonhole principle, there must be two adjacent TV distances differing by at least Ω(cε/C) = Ω(ε), and Theorem 6.4 can be applied to this pair of sample sizes. This idea generalizes to arbitrary dimension, with the full proof in Appendix B.
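The increasing TV sequence and the pigeonhole gap can be checked numerically in the case d = 1. In the sketch below, the Bernoulli pair is an illustrative choice with H² ≈ 0.15/n, inside the bracket of Assumption 4 (scipy assumed):

```python
import numpy as np
from scipy.stats import binom

def tv_binom(k, p, q):
    # TV between k-fold products of Bern(p) and Bern(q)
    j = np.arange(k + 1)
    return 0.5 * np.abs(binom.pmf(j, k, p) - binom.pmf(j, k, q)).sum()

n, C, c, eps = 100, 10, 0.5, 0.2
p, q = 0.5, 0.5 + np.sqrt(0.3 / n)  # Hellinger^2 ~ 0.15/n (Assumption 4)
step = max(1, int(c * eps * n))     # spacing c*eps*n between sample sizes
sizes = np.arange(n, C * n + 1, step)
tvs = np.array([tv_binom(int(k), p, q) for k in sizes])
gaps = np.diff(tvs)
# the TV sequence is nondecreasing in the sample size (data processing),
# so by pigeonhole some adjacent pair of sizes has at least the average gap
assert np.all(gaps >= -1e-10)
assert gaps.max() >= (tvs[-1] - tvs[0]) / len(gaps)
```

The pair of sizes attaining the largest gap is exactly the pair to which Theorem 6.4 is applied in the proof.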

We note that the lower bounds in Theorems 6.2 and 6.3 and in Theorem 6.5 sit at two different ends of the spectrum. Theorems 6.2 and 6.3 essentially consider an asymptotic setting (d fixed and n → ∞), and crucially use a Gaussian limit, which is available under local asymptotic normality. In comparison, Theorem 6.5 deals with a high-dimensional scenario (n and d can grow together) but is restricted to product models. Nevertheless, looking at product submodels and/or exploiting the proof techniques can still lead to tight lower bounds for several non-product models, as shown in the examples in Section A.3.

7 Discussions on sample amplification versus learning

In all the examples in the previous sections, there is a square-root relationship between the statistical complexities of sample amplification and learning. Specifically, when the dimensionality of the problem is d, the complexity of learning the distribution (within a small TV distance) is typically n = Θ(d), whereas that of sample amplification is typically n = Θ(√d). In this section, we give examples where this relationship breaks down in either direction, thus showing that there is no universal scaling between the sample complexities of amplification and learning.

7.1 An example where the complexity of sample amplification is o(d)o(\sqrt{d})

We first provide an example where learning the distribution is hard, but an (n, n+1, 0.1) sample amplification is easy. Consider the following class 𝒫_{d,t} of discrete distributions:

𝒫d,t={(p0,,pd):pi0,i=0dpi=1,p0=t},\displaystyle{\mathcal{P}}_{d,t}=\left\{(p_{0},\cdots,p_{d}):p_{i}\geq 0,\sum_{i=0}^{d}p_{i}=1,p_{0}=t\right\},

This class is the same as the class of all discrete distributions over d + 1 points, except that the learner has perfect knowledge of p₀ = t for some known t ∈ [1/(2√d), 1/2]. It is a classical result (see, e.g. [HJW15]) that the sample complexity of learning a distribution in 𝒫_{d,t} within a small TV distance is still n = Θ(d), regardless of t. However, the next theorem shows that the complexity of sample amplification is much smaller.

Theorem 7.1.

For the class 𝒫d,t{\mathcal{P}}_{d,t} with t[1/(2d),1/2]t\in[1/(2\sqrt{d}),1/2], an (n,n+1,0.1)(n,n+1,0.1) sample amplification is possible if and only if

n=Ω(1t).\displaystyle n=\Omega\left(\frac{1}{t}\right).

Note that for the choice t = Θ(d^{−α}) with α ∈ [0, 1/2], the complexity of sample amplification is n = Θ(d^α) for every α ∈ [0, 1/2], showing that it can be o(√d) by an arbitrary polynomial factor in d. Moreover, if t = o(1/√d), the complexity of sample amplification returns to n = Θ(√d), the same as without the knowledge of t. The main reason why sample amplification is easier here is that the additional fake sample can be chosen to be the first symbol, which has a large known probability. In contrast, learning the distribution requires estimating all the other probability masses, so the existence of one probable symbol does not help much in learning.

7.2 An example where the complexity of sample amplification is ω(d)\omega(\sqrt{d})

Next we provide an example where the complexity of sample amplification matches that of learning. Consider a low-rank covariance estimation model: X₁, ⋯, X_n ∼ 𝒩(0, Σ), where Σ ∈ ℝ^{p×p} can be written as Σ = UUᵀ with U ∈ ℝ^{p×d} and UᵀU = I_d. In other words, the covariance matrix Σ is isotropic on some d-dimensional subspace. Here n ≥ d samples suffice to estimate Σ, and thus the whole distribution, perfectly, since the d-dimensional subspace can be recovered from d i.i.d. samples with probability one. Therefore, the complexity of learning the distribution is n = d. The following theorem states that this is also the complexity of sample amplification.

Theorem 7.2.

For the above low-rank covariance estimation model with pd+1p\geq d+1, an (n,n+1,0.1)(n,n+1,0.1) sample amplification is possible if and only if ndn\geq d.

Theorem 7.2 shows that as opposed to learning, sample amplification fails to exploit the low-rank structure in the covariance estimation problem. As a result, the complexity of sample amplification coincides with that of learning in this example. Note that sample amplification is always no harder than learning: the learner could always estimate the distribution, generate one observation from the distribution and append it to the original samples. Therefore, Theorem 7.2 provides an example where the relationship between sample amplification and learning is the worst possible.
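Since Σ = UUᵀ is determined by the subspace, once n ≥ d the learner recovers the distribution exactly and can append a genuinely fresh draw, exactly as in the learning-based amplifier just described. A minimal numpy sketch (all dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 8, 3, 5                   # ambient dim, rank, sample size (n >= d)
U = np.linalg.qr(rng.standard_normal((p, d)))[0]   # unknown U with U^T U = I_d
X = U @ rng.standard_normal((d, n))                # n samples from N(0, U U^T)

# with probability one, span(X) equals the d-dimensional subspace
U_hat = np.linalg.svd(X, full_matrices=False)[0][:, :d]
x_new = U_hat @ rng.standard_normal(d)  # fresh draw from N(0, U_hat U_hat^T)

# the recovered projector matches the truth, so x_new ~ N(0, Sigma) exactly
assert np.allclose(U_hat @ U_hat.T, U @ U.T)
```

The sketch also makes the n < d failure transparent: with fewer than d samples the span is a strict subspace, and any appended sample betrays the deficient rank.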

7.3 An example where the TV distance is not the right metric

Finally we provide an example showing that the TV distance is not the right metric for the learning-based approach in Section 5, thereby partially illustrating the necessity of the χ² divergence. This example also takes sample amplification beyond parametric families. Let 𝒫 be the class of all L-Lipschitz densities supported on [0,1], i.e. densities f satisfying |f(x) − f(y)| ≤ L|x − y| for all x, y ∈ [0,1]. For c ∈ (0,1), let 𝒫_c ⊆ 𝒫 be the subclass of densities lower bounded by c everywhere, i.e. f(x) ≥ c for all x ∈ [0,1]. It is a classical result (see, e.g. [Tsy09]) that the minimax density estimation error under the TV distance is Θ(n^{−1/3}) for both 𝒫 and 𝒫_c. The next theorem shows that the sample complexities for amplification are nevertheless different.

Theorem 7.3.

Let L8L\geq 8 and c(0,1)c\in(0,1) be fixed. It holds that

m(𝒫c,n)n5/6, while m(𝒫,n)n3/4.\displaystyle m^{\star}({\mathcal{P}}_{c},n)\asymp n^{5/6},\quad\text{ while }\quad m^{\star}({\mathcal{P}},n)\lesssim n^{3/4}.

Theorem 7.3 shows that, although assuming a density lower bound does not alter the TV estimation error, it boosts the size of amplified samples from O(n3/4)O(n^{3/4}) to Θ(n5/6)\Theta(n^{5/6}). In fact, the χ2\chi^{2}-estimation error is also reduced from 𝒫{\mathcal{P}} to 𝒫c{\mathcal{P}}_{c}: in the proof of Theorem 7.3 we show that rχ2(𝒫c,n)n2/3r_{\chi^{2}}({\mathcal{P}}_{c},n)\lesssim n^{-2/3}, but m(𝒫,n)n3/4m^{\star}({\mathcal{P}},n)\lesssim n^{3/4} together with Theorem 5.2 imply that rχ2(𝒫,n)n1/2r_{\chi^{2}}({\mathcal{P}},n)\gtrsim n^{-1/2}. Therefore, this is an example suggesting that measuring the estimation error under the χ2\chi^{2} divergence might be a better indicator for the complexity of sample amplification than the TV distance.

Acknowledgements.

We thank the anonymous reviewers for their helpful feedback on earlier drafts of this paper.

Funding.

Shivam Garg conducted this research while affiliated with Stanford University and was supported by a Stanford Interdisciplinary Graduate Fellowship. Yanjun Han was supported by a Simons-Berkeley research fellowship and the Norbert Wiener postdoctoral fellowship in statistics at MIT IDSS. Vatsal Sharan was supported by NSF CAREER Award CCF-2239265 and an Amazon Research Award. Gregory Valiant was supported by NSF Awards AF-2341890, CCF-1704417, CCF-1813049, UT Austin’s Foundation of ML NSF AI Institute, and a Simons Foundation Investigator Award.

Appendix A Concrete examples of sample amplification

In this section we include concrete examples of sample amplification omitted from the main text due to space limitations, including the non-asymptotic upper bounds in Sections 4 and 5, and the lower bounds for non-product models in Section 6.

A.1 Concrete examples of amplification via sufficiency

In this section, in contrast to the mostly asymptotic results in Section 4.3, we investigate several non-asymptotic examples of sample amplification in concrete models. We show that for many natural models not covered by the general theory, including exponential families with dependent coordinates and non-exponential families, the sufficiency-based sample amplification approach can still amplify Ω(n/√d) additional samples. We also illustrate the computational idea in Section 4.2 via more involved examples.

Our first example concerns the Gaussian model with a known mean but an unknown covariance. It is a folklore result that estimating the unknown covariance to a vanishing Frobenius-norm error requires n = Ω(d²) samples [DMR20, Corollary 1.2], which is also the sample complexity of learning the distribution within a small TV distance. The following example shows that n = O(d) samples suffice for sample amplification.

Example A.1 (Gaussian covariance model with known mean).

Consider the i.i.d. observations X1,,XnX_{1},\cdots,X_{n} drawn from 𝒩(0,Σ){\mathcal{N}}(0,\Sigma) with zero mean and an unknown covariance Σd×d\Sigma\in\mathbb{R}^{d\times d}. Here a minimal sufficient statistic is the sample covariance matrix

Σ^n=1ni=1nXiXi.\displaystyle\widehat{\Sigma}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}^{\top}.

Lemma A.2 shows that ∥ℒ(Σ̂_n) − ℒ(Σ̂_{n+m})∥_TV ≤ ε as long as m = O(εn/d); therefore drawing samples from P_{X^{n+m}∣Σ̂_{n+m}} achieves sample amplification of size m = Ω(nε/d). This coincides with the general O(n/√D) result where D ≍ d² is the parameter dimension.

In order to sample from this conditional distribution, consider the following statistic

Sn+m=[(n+m)Σ^n+m]1/2[X1,X2,,Xn+m]d×(n+m),\displaystyle S_{n+m}=[(n+m)\widehat{\Sigma}_{n+m}]^{-1/2}[X_{1},X_{2},\cdots,X_{n+m}]\in\mathbb{R}^{d\times(n+m)},

which always exists even if Σ^n+m\widehat{\Sigma}_{n+m} is not invertible. Clearly there is a bijection between Xn+mX^{n+m} and (Σ^m+n,Sn+m)(\widehat{\Sigma}_{m+n},S_{n+m}), and Lemma A.2 shows that Sn+mS_{n+m} is an ancillary statistic. In particular, Sn+mS_{n+m} always follows the uniform distribution on the following set:

A={Ud×(n+m):UU=Id}.\displaystyle A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d}\}.

Consequently an (n,n+Ω(nε/d),ε)(n,n+\Omega(n\varepsilon/d),\varepsilon) sample amplification is efficiently achievable in the Gaussian covariance model using the following algorithm:

1. Given samples [X₁, X₂, ⋯, X_n], compute Σ̂_n = (1/n)∑_{i=1}^n X_i X_iᵀ.

2. Sample [Z₁, Z₂, ⋯, Z_{n+m}] with Z_i ∼ 𝒩(0, I_d) i.i.d., compute Σ̂_{n+m} = (1/(n+m))∑_{i=1}^{n+m} Z_i Z_iᵀ, and based on these compute S_{n+m} = [(n+m)Σ̂_{n+m}]^{−1/2}[Z₁, Z₂, ⋯, Z_{n+m}].

3. Output the (n+m) samples X^{n+m} = [(n+m)Σ̂_n]^{1/2} S_{n+m}.
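The three steps above can be sketched in numpy, with the matrix square roots computed by eigendecomposition (a minimal illustration, not an optimized implementation):

```python
import numpy as np

def sym_power(M, a):
    # power of a symmetric positive-definite matrix via eigendecomposition
    w, V = np.linalg.eigh(M)
    return (V * w ** a) @ V.T

def amplify_cov(X, m, rng):
    # X has shape (d, n); steps 1-3 of the algorithm above
    d, n = X.shape
    Sigma_n = X @ X.T / n                         # step 1
    Z = rng.standard_normal((d, n + m))           # step 2: Z_i ~ N(0, I_d)
    S = sym_power(Z @ Z.T, -0.5) @ Z              # ancillary S_{n+m}
    return sym_power((n + m) * Sigma_n, 0.5) @ S  # step 3

rng = np.random.default_rng(0)
d, n, m = 3, 40, 4
X = rng.standard_normal((d, n))
X_amp = amplify_cov(X, m, rng)
# the amplified samples reproduce the original sufficient statistic:
assert np.allclose(X_amp @ X_amp.T / (n + m), X @ X.T / n)
```

Since S Sᵀ = I_d by construction, the output exactly reproduces the sufficient statistic Σ̂_n, as the final assertion checks.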

Lemma A.2.

Under the notations of Example A.1, for n4max{m,d}n\geq 4\max\{m,d\} it holds that

(Σ^n)(Σ^n+m)TV2mdn.\displaystyle\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}}\leq\frac{2md}{n}.

In addition, Sn+mS_{n+m} is uniformly distributed on A={Ud×(n+m):UU=Id}A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d}\}.

The next example shows that the sample amplification result does not change much if both the mean and covariance are unknown, though the sampling procedure becomes slightly more involved.

Example A.3 (Gaussian model with unknown mean and covariance).

Next we consider the most general Gaussian model where X1,,Xn𝒩(θ,Σ)X_{1},\cdots,X_{n}\sim{\mathcal{N}}(\theta,\Sigma) with unknown mean vector θd\theta\in\mathbb{R}^{d} and unknown covariance matrix Σd×d\Sigma\in\mathbb{R}^{d\times d}. In this case, a minimal sufficient statistic is the pair (X¯n,Σ^n)(\overline{X}_{n},\widehat{\Sigma}_{n}), with

X¯n=1ni=1nXi,Σ^n=1n1i=1n(XiX¯n)(XiX¯n).\displaystyle\overline{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i},\qquad\widehat{\Sigma}_{n}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\overline{X}_{n})(X_{i}-\overline{X}_{n})^{\top}.

Lemma A.4 shows that (X¯n,Σ^n)(X¯n+m,Σ^n+m)TVε\|{\mathcal{L}}(\overline{X}_{n},\widehat{\Sigma}_{n})-{\mathcal{L}}(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}}\leq\varepsilon as long as m=O(nε/d)m=O(n\varepsilon/d), therefore drawing amplified samples from PXn+m(X¯n+m,Σ^n+m)P_{X^{n+m}\mid(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})} achieves sample amplification of size m=Ω(nε/d)m=\Omega(n\varepsilon/d). For the computation, consider the following statistic

Sn+m=[(n+m1)Σ^n+m]1/2[X1X¯n+m,,Xn+mX¯n+m]d×(n+m),\displaystyle S_{n+m}=[(n+m-1)\widehat{\Sigma}_{n+m}]^{-1/2}[X_{1}-\overline{X}_{n+m},\cdots,X_{n+m}-\overline{X}_{n+m}]\in\mathbb{R}^{d\times(n+m)},

and it is clear that the whole samples Xn+mX^{n+m} could be recovered from (X¯n+m,Σ^n+m,Sn+m)(\overline{X}_{n+m},\widehat{\Sigma}_{n+m},S_{n+m}). Again, Lemma A.4 shows that Sn+mS_{n+m} is ancillary and uniformly distributed on the following set (assuming n+m1dn+m-1\geq d):

A={Ud×(n+m):UU=Id,U𝟏=𝟎}.\displaystyle A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d},U{\bf 1}={\bf 0}\}.
Lemma A.4.

Under the notations of Example A.3, for n4max{m,d}n\geq 4\max\{m,d\} it holds that

(X¯n,Σ^n)(X¯n+m,Σ^n+m)TV3mdn1.\displaystyle\|{\mathcal{L}}(\overline{X}_{n},\widehat{\Sigma}_{n})-{\mathcal{L}}(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}}\leq\frac{3md}{n-1}.

In addition, Sn+mS_{n+m} is uniformly distributed on A={Ud×(n+m):UU=Id,U𝟏=𝟎}A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d},U{\bf 1}={\bf 0}\}.

The following example concerns the product exponential distributions, or general Gamma distributions with a fixed shape parameter.

Example A.5 (Product exponential distribution).

In this example, we consider the product exponential model where X1,,Xni=1dExp(λi)X_{1},\cdots,X_{n}\sim\prod_{i=1}^{d}\text{\rm Exp}(\lambda_{i}) with unknown rate vector λ=(λ1,,λd)\lambda=(\lambda_{1},\cdots,\lambda_{d}). Again, in this model the sample mean X¯n\overline{X}_{n} is sufficient, and follows a product Gamma distribution i=1dGamma(n,nλi)\prod_{i=1}^{d}\text{\rm Gamma}(n,n\lambda_{i}). Consequently,

DKL((X¯n)(X¯n+m))=d(m(n+m)log(1+mn)+logΓ(n+m)Γ(n)mψ(n)),\displaystyle D_{\text{\rm KL}}({\mathcal{L}}(\overline{X}_{n})\|{\mathcal{L}}(\overline{X}_{n+m}))=d\cdot\left(m-(n+m)\log\left(1+\frac{m}{n}\right)+\log\frac{\Gamma(n+m)}{\Gamma(n)}-m\psi(n)\right),

where Γ(x) = ∫₀^∞ t^{x−1}e^{−t}dt and ψ(x) = (d/dx)[log Γ(x)] are the gamma and digamma functions, respectively. Using the definition of f(n, m, d) in (C.2), we note that the above KL divergence is precisely d·f(2n, 2m, 1), and hence by the proof of Lemma A.2 it is at most O(dm²/n²). Consequently, sample amplification is possible as long as n = Ω(√d/ε) and m = O(nε/√d). Alternatively, the same result also follows from Theorem 4.6, since both Assumptions 1 and 2 hold for the exponential distribution.

To draw amplified samples X1,,Xn+mX_{1},\cdots,X_{n+m} conditioned on X¯n+m\overline{X}_{n+m}, note that the following statistic Sd×(n+m)S\in\mathbb{R}^{d\times(n+m)} with ii-th row being

Si=(X1,iX¯n+m,i,X2,iX¯n+m,i,,Xn+m,iX¯n+m,i),\displaystyle S_{i}=\left(\frac{X_{1,i}}{\overline{X}_{n+m,i}},\frac{X_{2,i}}{\overline{X}_{n+m,i}},\cdots,\frac{X_{n+m,i}}{\overline{X}_{n+m,i}}\right),

is ancillary. In fact, each normalized row S_i/(n+m) follows the flat Dirichlet distribution Dir(1, 1, ⋯, 1) on the probability simplex. Computational efficiency then follows from the fact that X^{n+m} is determined by (X̄_{n+m}, S).
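The resulting sampler is simple: treat the observed mean X̄_n as if it were X̄_{n+m}, draw the ancillary Dirichlet weights, and rescale. A minimal numpy sketch (shapes and parameters illustrative):

```python
import numpy as np

def amplify_exponential(X, m, rng):
    # X has shape (n, d) with column i containing i.i.d. Exp(lambda_i) draws
    n, d = X.shape
    xbar = X.mean(axis=0)                        # sufficient statistic
    # ancillary weights: each column is a flat Dirichlet draw on the simplex
    W = rng.dirichlet(np.ones(n + m), size=d).T  # shape (n+m, d)
    return (n + m) * xbar * W                    # amplified samples

rng = np.random.default_rng(0)
n, m, d = 30, 5, 4
X = rng.exponential(scale=1.0, size=(n, d))
X_amp = amplify_exponential(X, m, rng)
# the amplified samples reproduce the observed coordinate-wise means
assert np.allclose(X_amp.mean(axis=0), X.mean(axis=0))
```

Each output column sums to (n+m) times the observed mean by construction, so the sufficient statistic is preserved exactly.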

Our final example is a non-exponential family which is not even differentiable in quadratic mean, but the n=O(d)n=O(\sqrt{d}) sample complexity still holds.

Example A.6 (Uniform distribution over a rectangle).

In this example, let X1,,XnX_{1},\cdots,X_{n} be i.i.d. samples from the uniform distribution on an unknown rectangle j=1d[aj,bj]\prod_{j=1}^{d}[a_{j},b_{j}] in d\mathbb{R}^{d}. Note that this is not an exponential family. However, the sufficiency-based sample amplification could still be applied in this case. Specifically, here the sufficient statistics are XminnX_{\min}^{n} and XmaxnX_{\max}^{n}, where for j[d]j\in[d],

Xmin,jn=mini[n]Xi,j,Xmax,jn=maxi[n]Xi,j.\displaystyle X_{\min,j}^{n}=\min_{i\in[n]}X_{i,j},\qquad X_{\max,j}^{n}=\max_{i\in[n]}X_{i,j}.

It is not hard to see that the joint density of (X_{min,j}^n, X_{max,j}^n) is

fn(a,b)=n(n1)(ba)n2(bjaj)n𝟙(ajabbj).\displaystyle f_{n}(a,b)=\frac{n(n-1)(b-a)^{n-2}}{(b_{j}-a_{j})^{n}}\cdot\mathbbm{1}(a_{j}\leq a\leq b\leq b_{j}).

Then after some algebra, the KL divergence between the sufficient statistics is

DKL((Xminn,Xmaxn)(Xminn+m,Xmaxn+m))=d(mnlog(1+mn)+mn1log(1+mn1)),\displaystyle D_{\text{\rm KL}}({\mathcal{L}}(X_{\min}^{n},X_{\max}^{n})\|{\mathcal{L}}(X_{\min}^{n+m},X_{\max}^{n+m}))=d\left(\frac{m}{n}-\log\left(1+\frac{m}{n}\right)+\frac{m}{n-1}-\log\left(1+\frac{m}{n-1}\right)\right),

which is O(dm2/n2)O(dm^{2}/n^{2}) when m=O(n)m=O(n). Therefore, sample amplification for uniform distributions is still possible whenever m=O(nε/d)m=O(n\varepsilon/\sqrt{d}) and n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon), same as exponential families.

To draw amplified samples X₁, ⋯, X_{n+m} conditioned on (X_min^{n+m}, X_max^{n+m}), note that the statistic S ∈ ℝ^{d×(n+m)} with i-th row

S_i = ( (X_{1,i} − X_{min,i}^{n+m})/(X_{max,i}^{n+m} − X_{min,i}^{n+m}), ⋯, (X_{n+m,i} − X_{min,i}^{n+m})/(X_{max,i}^{n+m} − X_{min,i}^{n+m}) ),

is ancillary. This is due to the invariance property of the uniform distribution: if X𝖴(0,1)X\sim\mathsf{U}(0,1), then aX+b𝖴(b,a+b)aX+b\sim\mathsf{U}(b,a+b). Since (X1,,Xn+m)(X_{1},\cdots,X_{n+m}) is determined by (Xminn+m,Xmaxn+m,S)(X_{\min}^{n+m},X_{\max}^{n+m},S), the computational efficiency follows.
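The corresponding sampler standardizes fresh uniforms by their own coordinate-wise range and rescales them to the observed rectangle. A minimal numpy sketch (shapes illustrative):

```python
import numpy as np

def amplify_uniform_rect(X, m, rng):
    # X has shape (n, d), rows i.i.d. uniform on an unknown rectangle
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)        # sufficient statistics
    # ancillary statistic: standardize n+m fresh uniforms by their own
    # coordinate-wise range, then rescale to the observed rectangle [lo, hi]
    Z = rng.uniform(size=(n + m, d))
    S = (Z - Z.min(axis=0)) / (Z.max(axis=0) - Z.min(axis=0))
    return lo + (hi - lo) * S

rng = np.random.default_rng(0)
n, m, d = 25, 4, 3
X = rng.uniform(-1.0, 2.0, size=(n, d))
X_amp = amplify_uniform_rect(X, m, rng)
# the amplified sample preserves the coordinate-wise minima and maxima
assert np.allclose(X_amp.min(axis=0), X.min(axis=0))
assert np.allclose(X_amp.max(axis=0), X.max(axis=0))
```

By the invariance property quoted above, conditioning on the extremes leaves the remaining points i.i.d. uniform between them, which is exactly what the rescaling produces.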

Although all of the above examples work exclusively for continuous models and apply an identity map between sufficient statistics, we remark that more sophisticated maps based on learning could be useful and work for discrete models. The idea of sample amplification via learning is presented in Section 5, and an example which combines both the sufficiency and learning ideas could be found in Example A.14.

A.2 Concrete examples of amplification via learning

In this section we show how learning-based approaches achieve optimal performances of sample amplification in several examples, including both continuous and discrete models. In some scenarios the following strengthened lemma may be useful to deal with too large χ2\chi^{2}-divergence.

Lemma A.7.

The same results as in Theorems 5.2 and 5.5 hold for the following modification of the \chi^{2}-estimation error:

rχ¯2(𝒫,n)=infP^nsupP𝒫𝔼P[χ2(P^n,P)n].\displaystyle r_{\bar{\chi}^{2}}({\mathcal{P}},n)=\inf_{\widehat{P}_{n}}\sup_{P\in{\mathcal{P}}}\mathbb{E}_{P}\left[\chi^{2}(\widehat{P}_{n},P)\wedge n\right].

The idea behind Lemma A.7 is that for some models, \chi^{2}(\widehat{P}_{n},P)=\infty may occur with a small probability. However, since we always have \|\widehat{P}_{n}-P\|_{\text{TV}}\leq 1 for the TV distance, a large \chi^{2}-divergence can still yield a meaningful TV distance. The proof of Lemma A.7 can be found in Appendix C.

The first example is again the Gaussian location model in Example 4.1, where we show that the shuffling-based approach also achieves the complexity n=O(d/ε)n=O(\sqrt{d}/\varepsilon) and the size m=Ω(nε/d)m=\Omega(n\varepsilon/\sqrt{d}).

Example A.8 (Gaussian location model with known covariance, continued).

Consider the setting of Example 4.1, where the family of distributions is 𝒫={𝒩(θ,Id)}θd{\mathcal{P}}=\{{\mathcal{N}}(\theta,I_{d})\}_{\theta\in\mathbb{R}^{d}}. Here 𝒫{\mathcal{P}} has a product structure, with 𝒫j={𝒩(θj,1)}θj{\mathcal{P}}_{j}=\{{\mathcal{N}}(\theta_{j},1)\}_{\theta_{j}\in\mathbb{R}} for each j[d]j\in[d]. To find an upper bound on the χ2\chi^{2}-estimation error, consider the distribution estimator P^n,j=𝒩(θ^j,1)\widehat{P}_{n,j}={\mathcal{N}}(\widehat{\theta}_{j},1), where θ^j=n1i=1nXi,j\widehat{\theta}_{j}=n^{-1}\sum_{i=1}^{n}X_{i,j}. Consequently,

χ2(P^n,j,Pj)=χ2(𝒩(θ^j,1),𝒩(θj,1))=exp((θ^jθj)2)1,\displaystyle\chi^{2}(\widehat{P}_{n,j},P_{j})=\chi^{2}({\mathcal{N}}(\widehat{\theta}_{j},1),{\mathcal{N}}(\theta_{j},1))=\exp((\widehat{\theta}_{j}-\theta_{j})^{2})-1,

and using θ^jθj𝒩(0,1/n)\widehat{\theta}_{j}-\theta_{j}\sim{\mathcal{N}}(0,1/n), we have

𝔼[χ2(P^n,j,Pj)]=nn21=O(1n)\displaystyle\mathbb{E}[\chi^{2}(\widehat{P}_{n,j},P_{j})]=\sqrt{\frac{n}{n-2}}-1=O\left(\frac{1}{n}\right)

whenever n3n\geq 3. Consequently, rχ2(𝒫j,n)=O(1/n)r_{\chi^{2}}({\mathcal{P}}_{j},n)=O(1/n) for all j[d]j\in[d], and Theorem 5.5 implies that an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is possible if n=Ω(d/ε)n=\Omega(\sqrt{d}/\varepsilon) and m=O(nε/d)m=O(n\varepsilon/\sqrt{d}).
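The one-dimensional computation above is easy to check by simulation. The following sketch (our own, with an illustrative n) estimates \mathbb{E}[\chi^{2}] in a single coordinate using \widehat{\theta}_{j}-\theta_{j}\sim{\mathcal{N}}(0,1/n):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 1_000_000

# theta_hat - theta ~ N(0, 1/n), and chi^2 = exp((theta_hat - theta)^2) - 1
z = rng.normal(0.0, 1.0 / np.sqrt(n), size=trials)
chi2_mc = np.mean(np.exp(z ** 2)) - 1.0
chi2_exact = np.sqrt(n / (n - 2)) - 1.0          # closed form from the display
print(chi2_mc, chi2_exact)
```

Both values are of order 1/n, as claimed.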

The next example is the discrete distribution model considered in [AGSV20].

Example A.9 (Discrete distribution model).

Let 𝒫{\mathcal{P}} be the class of all discrete distributions supported on kk elements. In this case, a natural learner is the empirical distribution P^n=(p^1,,p^k)\widehat{P}_{n}=(\widehat{p}_{1},\cdots,\widehat{p}_{k}), with

p^j=1ni=1n𝟙(Xi=j),j[k].\displaystyle\widehat{p}_{j}=\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}(X_{i}=j),\qquad j\in[k].

Consequently,

𝔼[χ2(P^n,P)]=𝔼[j=1k(p^jpj)2pj]=j=1k1pjn=k1n,\displaystyle\mathbb{E}[\chi^{2}(\widehat{P}_{n},P)]=\mathbb{E}\left[\sum_{j=1}^{k}\frac{(\widehat{p}_{j}-p_{j})^{2}}{p_{j}}\right]=\sum_{j=1}^{k}\frac{1-p_{j}}{n}=\frac{k-1}{n},

meaning that r_{\chi^{2}}({\mathcal{P}},n)\leq(k-1)/n. Hence, Theorem 5.2 implies that sample amplification is possible whenever n=\Omega(\sqrt{k}/\varepsilon) and m=O(n\varepsilon/\sqrt{k}). In this case, Algorithm 2 essentially subsamples from the original data and adds the subsample back after a random shuffle, which is the same algorithm as in [AGSV20]; however, the analysis here is much simpler.
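The identity \mathbb{E}[\chi^{2}(\widehat{P}_{n},P)]=(k-1)/n is exact for any fully supported P, which the following Monte Carlo sketch (ours, with an illustrative P) confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 5, 100, 200_000
p = np.array([0.4, 0.25, 0.2, 0.1, 0.05])        # any fully supported P

counts = rng.multinomial(n, p, size=trials)       # n samples per trial
p_hat = counts / n                                # empirical distributions
mean_chi2 = (((p_hat - p) ** 2 / p).sum(axis=1)).mean()
print(mean_chi2, (k - 1) / n)
```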

The following example revisits the uniform distribution in Example A.6, and again shows that the same amplification performance can be achieved by random shuffling.

Example A.10 (Uniform distribution, continued).

Consider the setting in Example A.6, where the distribution class 𝒫{\mathcal{P}} is the family of all uniform distributions on a rectangle j=1d[aj,bj]\prod_{j=1}^{d}[a_{j},b_{j}]. For each jj, a natural distribution estimator P^n,j\widehat{P}_{n,j} is simply the uniform distribution on [Xmin,jn,Xmax,jn][X_{\min,j}^{n},X_{\max,j}^{n}], where these quantities are defined in Example A.6. For this learner, it holds that

\displaystyle\chi^{2}\left(\widehat{P}_{n,j},P_{j}\right)=\frac{b_{j}-a_{j}}{X_{\max,j}^{n}-X_{\min,j}^{n}}-1.

Since Example A.6 shows that the joint density of (X_{\min,j}^{n},X_{\max,j}^{n}) is given by

\displaystyle f_{n}(a,b)=\frac{n(n-1)(b-a)^{n-2}}{(b_{j}-a_{j})^{n}}\cdot\mathbbm{1}(a_{j}\leq a\leq b\leq b_{j}),

the expected \chi^{2}-divergence can be computed as

\displaystyle\mathbb{E}\left[\chi^{2}\left(\widehat{P}_{n,j},P_{j}\right)\right]=-1+\iint_{a_{j}\leq a\leq b\leq b_{j}}\frac{n(n-1)(b-a)^{n-3}}{(b_{j}-a_{j})^{n-1}}\,da\,db=\frac{2}{n-2},

which is O(n^{-1}) if n\geq 3. Hence, we have r_{\chi^{2}}({\mathcal{P}}_{j},n)=O(1/n) for each j\in[d], and Theorem 5.5 shows an (n,n+m,\varepsilon) sample amplification if n=\Omega(\sqrt{d}/\varepsilon) and m=O(n\varepsilon/\sqrt{d}).

The next example is the exponential distribution model studied in Example A.5, where the modified χ2\chi^{2}-learning error rχ¯2(𝒫,n)r_{\bar{\chi}^{2}}({\mathcal{P}},n) in Lemma A.7 will be useful.

Example A.11 (Exponential distribution, continued).

Consider the setting in Example A.5, where the distribution class {\mathcal{P}} is a product of exponential distributions with unknown rate parameters. In this case, a natural learner estimates each rate parameter by \widehat{\lambda}_{n,j}=n/\sum_{i=1}^{n}X_{i,j}, and uses \text{\rm Exp}(\widehat{\lambda}_{n,j}) to estimate the truth \text{\rm Exp}(\lambda_{j}). Note that

χ2(Exp(λ^n,j),Exp(λj))=(λ^n,jλj)2λj(2λ^n,jλj)\displaystyle\chi^{2}\left(\text{\rm Exp}(\widehat{\lambda}_{n,j}),\text{\rm Exp}(\lambda_{j})\right)=\frac{(\widehat{\lambda}_{n,j}-\lambda_{j})^{2}}{\lambda_{j}(2\widehat{\lambda}_{n,j}-\lambda_{j})}

whenever 2λ^n,j>λj2\widehat{\lambda}_{n,j}>\lambda_{j}. However, if 2λ^n,jλj2\widehat{\lambda}_{n,j}\leq\lambda_{j} the χ2\chi^{2}-divergence will be unbounded, which occurs with a small but positive probability.

To address this issue, note that when \lambda_{j}=1, sub-exponential concentration gives |\sum_{i=1}^{n}X_{i,j}-n|\leq n/3 with probability at least 1-\exp(-\Omega(n)). By a simple scaling, this event implies that \widehat{\lambda}_{n,j}/\lambda_{j}\in[3/4,3/2]. Hence,

𝔼[χ2(Exp(λ^n,j),Exp(λj))n]2𝔼[(λ^n,jλj1)2]+nexp(Ω(n))=O(1n),\displaystyle\mathbb{E}\left[\chi^{2}\left(\text{\rm Exp}(\widehat{\lambda}_{n,j}),\text{\rm Exp}(\lambda_{j})\right)\wedge n\right]\leq 2\cdot\mathbb{E}\left[\left(\frac{\widehat{\lambda}_{n,j}}{\lambda_{j}}-1\right)^{2}\right]+n\cdot\exp(-\Omega(n))=O\left(\frac{1}{n}\right),

which means that r_{\bar{\chi}^{2}}({\mathcal{P}}_{j},n)=O(1/n). Therefore, by Lemma A.7, sample amplification is possible whenever n=\Omega(\sqrt{d}/\varepsilon) and m=O(n\varepsilon/\sqrt{d}), the same as in Example A.5.
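The truncation at n only matters on the exponentially rare event 2\widehat{\lambda}_{n,j}\leq\lambda_{j}, where the divergence is infinite. The following one-coordinate sketch (ours, with \lambda_{j}=1 and illustrative n) checks that n\cdot\mathbb{E}[\chi^{2}\wedge n] stays bounded:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 200_000

s = rng.gamma(shape=n, scale=1.0, size=trials)     # sum of n Exp(1) variables
lam_hat = n / s                                    # rate estimate; true rate = 1
chi2 = np.where(
    2 * lam_hat > 1,
    (lam_hat - 1) ** 2 / np.maximum(2 * lam_hat - 1, 1e-12),
    np.inf)                                        # divergence is infinite here
scaled_mean = n * np.minimum(chi2, n).mean()       # n * E[chi^2 ^ n]
print(scaled_mean)
```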

The following example considers an interesting non-product model, i.e. the Gaussian distribution with a sparse mean vector.

Example A.12 (Sparse Gaussian model).

Consider the Gaussian location model 𝒫={𝒩(θ,Id)}θΘ{\mathcal{P}}=\{{\mathcal{N}}(\theta,I_{d})\}_{\theta\in\Theta}, with an additional constraint that the mean vector θ\theta is ss-sparse, i.e. Θ={θd:θ0s}\Theta=\{\theta\in\mathbb{R}^{d}:\|\theta\|_{0}\leq s\}. For the learning problem, it is well-known (cf. [DJ94, Theorem 1]) that the soft-thresholding estimator θ^n\widehat{\theta}_{n} with

\displaystyle\widehat{\theta}_{n,j}=\text{\rm sign}\left(\frac{1}{n}\sum_{i=1}^{n}X_{i,j}\right)\cdot\left(\left|\frac{1}{n}\sum_{i=1}^{n}X_{i,j}\right|-\sqrt{\frac{C\log d}{n}}\right)_{+}

and any constant C>2 achieves \sup_{\theta\in\Theta}\mathbb{E}[\|\widehat{\theta}_{n}-\theta\|_{2}^{2}]=O(s\log d/n). Therefore, the sample complexity of learning sparse Gaussian distributions to constant accuracy is n=O(s\log d).

For the complexity of sample amplification, we apply {\mathcal{N}}(\widehat{\theta}_{n,j},1) as the distribution estimator for each j\in[d]. The \chi^{2}-estimation performance of this estimator is summarized in Lemma A.13. Therefore, by a simple adaptation of Lemma A.7, an (n,n+m,\varepsilon) sample amplification is possible as long as n=\Omega(\sqrt{s\log d}/\varepsilon) and m=O(n\varepsilon/\sqrt{s\log d}).

Lemma A.13.

Under the setting of Example A.12, it holds that

supθΘj=1d𝔼[χ2(𝒩(θ^n,j,1),𝒩(θj,1))n]=O(slogdn).\displaystyle\sup_{\theta\in\Theta}\sum_{j=1}^{d}\mathbb{E}\left[\chi^{2}\left({\mathcal{N}}(\widehat{\theta}_{n,j},1),{\mathcal{N}}(\theta_{j},1)\right)\wedge n\right]=O\left(\frac{s\log d}{n}\right).

The final example is one where applying either sufficient statistics or learning alone fails to achieve the optimal sample amplification; the solution is to combine both ideas.

Example A.14 (Poisson distribution).

Consider the product Poisson model \prod_{j=1}^{d}\mathsf{Poi}(\lambda_{j}) with \lambda\in\mathbb{R}_{+}^{d}. We first show that a naïve application of either the sufficiency-based or the shuffling-based idea alone does not lead to the desired sample amplification. Indeed, the sufficient statistic here is T_{n}=\sum_{i=1}^{n}X_{i}, which follows the product Poisson distribution \prod_{j=1}^{d}\mathsf{Poi}(n\lambda_{j}); as T_{n} takes discrete values, applying any linear map from T_{n} to T_{n+m} will not result in a small TV distance between the sufficient statistics.

The argument for the shuffling-based approach is subtler. A natural distribution estimator for \mathsf{Poi}(\lambda_{j}) is \mathsf{Poi}(\widehat{\lambda}_{n,j}), with \widehat{\lambda}_{n,j}=n^{-1}\sum_{i=1}^{n}X_{i,j}. Lemma A.15 shows that this distribution estimator can suffer an expected \chi^{2}-estimation error of order far larger than 1/n, so we cannot conclude an (n,n+\Omega(n\varepsilon/\sqrt{d}),\varepsilon) sample amplification from Theorem 5.5.

Now we show that a combination of the sufficient statistic and learning leads to rate-optimal sample amplification in this model. Specifically, we split the samples and compute the empirical rate parameter \widehat{\lambda}_{n/2} based on the first n/2 samples. Next, conditioned on the first half of the samples, the sufficient statistic of the remaining half is T_{n/2}=\sum_{i=n/2+1}^{n}X_{i}. Define

T^n/2+m=Tn/2+Z,\displaystyle\widehat{T}_{n/2+m}=T_{n/2}+Z,

where Z\sim\prod_{j=1}^{d}\mathsf{Poi}(m\widehat{\lambda}_{n/2,j}) is drawn independently of T_{n/2} conditioned on the first half of the samples. Finally, we draw n/2+m amplified samples from the conditional distribution given \widehat{T}_{n/2+m} and append them to (X_{1},\cdots,X_{n/2}). By the second statement of Lemma A.15, this achieves an (n,n+m,\varepsilon) sample amplification whenever n=\Omega(\sqrt{d}/\varepsilon) and m=O(n\varepsilon/\sqrt{d}).
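The procedure is simple to implement because, conditioned on its sum t, a vector of N i.i.d. \mathsf{Poi}(\lambda) counts is \text{Multinomial}(t,(1/N,\cdots,1/N)). A minimal sketch (our own naming; d, rates, and sizes are illustrative, and n is assumed even):

```python
import numpy as np

def amplify_poisson(x, m, rng):
    # Sufficiency + learning amplifier for the product Poisson model.
    n, d = x.shape
    n2 = n // 2
    lam_hat = x[:n2].mean(axis=0)      # learned rates from the first half
    t = x[n2:].sum(axis=0)             # sufficient statistic of the second half
    t_amp = t + rng.poisson(m * lam_hat)          # shifted sufficient statistic
    # draw n2 + m counts per coordinate conditioned on the new sum
    second = np.column_stack(
        [rng.multinomial(t_amp[j], np.full(n2 + m, 1.0 / (n2 + m)))
         for j in range(d)])
    return np.vstack([x[:n2], second])

rng = np.random.default_rng(0)
lam = np.array([0.5, 2.0, 7.0])
x = rng.poisson(lam, size=(40, 3))     # n = 40 samples, d = 3
amplified = amplify_poisson(x, m=6, rng=rng)
print(amplified.shape)
```

The first half of the data is returned unchanged, and the second half is redrawn conditioned on the shifted sufficient statistic.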

Lemma A.15.

Under the settings of Example A.14, there exists some λj>0\lambda_{j}>0 such that

𝔼[χ2(𝖯𝗈𝗂(λ^n,j),𝖯𝗈𝗂(λj))n]=Ω(1logn)1n.\displaystyle\mathbb{E}\left[\chi^{2}\left(\mathsf{Poi}(\widehat{\lambda}_{n,j}),\mathsf{Poi}(\lambda_{j})\right)\wedge n\right]=\Omega\left(\frac{1}{\log n}\right)\gg\frac{1}{n}.

In addition, the proposed sufficiency+learning approach satisfies

(X^n+m)(Xn+m)TVm2dn.\displaystyle\|{\mathcal{L}}(\widehat{X}^{n+m})-{\mathcal{L}}(X^{n+m})\|_{\text{\rm TV}}\leq\frac{m\sqrt{2d}}{n}.

A.3 Non-asymptotic examples of lower bounds

In this section, we apply the lower bound techniques to several concrete examples from Sections 4 and 5, and show that the previously established upper bounds for sample amplification are indeed tight. Since Theorem 6.5 already handles all product models (including the Gaussian location model in Examples 4.1, 4.2, and A.8, the exponential model in Examples A.5 and A.11, the uniform model in Examples A.6 and A.10, and the Poisson model in Example A.14), only the non-product models remain. In the sequel, we use the general Lemma 6.1 to prove non-asymptotic lower bounds in these examples.

The lower bound for the discrete distribution model was obtained in [AGSV20], where the learning approach in Example A.9 is rate optimal. Our first example concerns the “Poissonized” version of the discrete distribution model, where the results turn out to be slightly different.

Example A.16 (“Poissonized” discrete distribution model).

We consider the following “Poissonized” discrete distribution model, where we have nn i.i.d. samples drawn from j=1k𝖯𝗈𝗂(pj)\prod_{j=1}^{k}\mathsf{Poi}(p_{j}), with (p1,,pk)(p_{1},\cdots,p_{k}) being an unknown probability vector. Although the Poissonization does not affect the optimal rate of estimation in many problems, Lemma A.17 shows the following distinction when it comes to sample amplification: the optimal amplification size is m=Θ(nε/k+nε)m=\Theta(n\varepsilon/\sqrt{k}+\sqrt{n}\varepsilon) under the Poissonized model, while it is m=Θ(nε/k)m=\Theta(n\varepsilon/\sqrt{k}) under the non-Poissonized model [AGSV20].

The complete proof of Lemma A.17 is deferred to the appendix, but we briefly comment on why Theorem 6.5 is not directly applicable when k\gg n. To apply Theorem 6.5, we would construct a parametric submodel which is a product model:

Pθ=j=1k0[𝖯𝗈𝗂(1k+θj)×𝖯𝗈𝗂(1kθj)],θΘ[1k,1k]k0,\displaystyle P_{\theta}=\prod_{j=1}^{k_{0}}\left[\mathsf{Poi}\left(\frac{1}{k}+\theta_{j}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta_{j}\right)\right],\qquad\theta\in\Theta\triangleq\left[-\frac{1}{k},\frac{1}{k}\right]^{k_{0}},

where k=2k0k=2k_{0}. However, for θ,θ[0,1/k]\theta,\theta^{\prime}\in[0,1/k], the range of the squared Hellinger distance

H2(𝖯𝗈𝗂(1k+θ)×𝖯𝗈𝗂(1kθ),𝖯𝗈𝗂(1k+θ)×𝖯𝗈𝗂(1kθ))\displaystyle H^{2}\left(\mathsf{Poi}\left(\frac{1}{k}+\theta\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta\right),\mathsf{Poi}\left(\frac{1}{k}+\theta^{\prime}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta^{\prime}\right)\right)

is only [0,Θ(1/k)][0,\Theta(1/k)], so Assumption 4 does not hold when knk\gg n. This is precisely the subtlety in the Poissonized model.

Lemma A.17.

Under the Poissonized discrete distribution model, an (n,n+m,ε)(n,n+m,\varepsilon) sample amplification is possible if and only if n=Ω(1)n=\Omega(1) and m=O(nε/k+nε)m=O(n\varepsilon/\sqrt{k}+\sqrt{n}\varepsilon).

Our next example is the sparse Gaussian model in Example A.12, where we establish the tightness of n=Ω(slogd/ε)n=\Omega(\sqrt{s\log d}/\varepsilon) and m=O(nε/slogd)m=O(n\varepsilon/\sqrt{s\log d}) directly using Lemma 6.1.

Example A.18 (Sparse Gaussian location model).

Consider the setting of the sparse Gaussian location model in Example A.12; we aim to prove a matching lower bound for sample amplification. We first handle the case s=1 to illustrate the main idea, and postpone the case of general s to Lemma A.19.

For s=1s=1, we apply Lemma 6.1 to a proper choice of the prior μ\mu and loss LL. Fixing a parameter t>0t>0 to be chosen later, let μ\mu be the uniform distribution on the finite set of vectors {te1,,ted}\{te_{1},\cdots,te_{d}\}, where e1,,ede_{1},\cdots,e_{d} are the canonical vectors in d\mathbb{R}^{d}. Moreover, for an estimator θ^d\widehat{\theta}\in\mathbb{R}^{d} of the unknown mean vector, the loss function is chosen to be L(θ,θ^)=𝟙(θ^θ)L(\theta,\widehat{\theta})=\mathbbm{1}(\widehat{\theta}\neq\theta). For the above prior and loss, it is clear that the maximum likelihood estimator θ^=tej^\widehat{\theta}=te_{\widehat{j}} with

j^=argmaxj[d]i=1nXi,j\displaystyle\widehat{j}=\arg\max_{j\in[d]}\sum_{i=1}^{n}X_{i,j}

is the Bayes estimator, and the Bayes risk admits the following expression:

rB(𝒫,n,μ,L)=1𝔼[exp(nt(nt+Z1))exp(nt(nt+Z1))+j=2dexp(ntZj)]1pd(nt),\displaystyle r_{\text{\rm B}}({\mathcal{P}},n,\mu,L)=1-\mathbb{E}\left[\frac{\exp(\sqrt{n}t(\sqrt{n}t+Z_{1}))}{\exp(\sqrt{n}t(\sqrt{n}t+Z_{1}))+\sum_{j=2}^{d}\exp(\sqrt{n}tZ_{j})}\right]\triangleq 1-p_{d}(\sqrt{n}\cdot t),

where Z_{1},\cdots,Z_{d}\sim{\mathcal{N}}(0,1) are i.i.d. standard normal random variables. Similarly, the Bayes risk with n+m samples is r_{\text{\rm B}}({\mathcal{P}},n+m,\mu,L)=1-p_{d}(\sqrt{n+m}\cdot t), and it remains to investigate the properties of the function p_{d}(\cdot). Lemma A.19 records the property used in the lower bound: p_{d}(z) undergoes a phase transition around z\sim\sqrt{2\log d}.

Based on Lemma A.19, the pigeonhole principle implies that for every given c>0c>0, there exists some z[2logdC,2logd+C]z\in[\sqrt{2\log d}-C,\sqrt{2\log d}+C] such that pd(z+cε)pd(z)=Ωc(ε)p_{d}(z+c\varepsilon)-p_{d}(z)=\Omega_{c}(\varepsilon). Therefore, if m=cnε/logdm=\lceil cn\varepsilon/\sqrt{\log d}\rceil, choosing t=z/nt=z/\sqrt{n} for the above zz yields that pd(n+mt)pd(nt)=Ωc(ε)p_{d}(\sqrt{n+m}\cdot t)-p_{d}(\sqrt{n}\cdot t)=\Omega_{c}(\varepsilon), and consequently

ε(𝒫,n,m)rB(𝒫,n,μ,L)rB(𝒫,n+m,μ,L)=pd(n+mt)pd(nt)=Ωc(ε).\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq r_{\text{\rm B}}({\mathcal{P}},n,\mu,L)-r_{\text{\rm B}}({\mathcal{P}},n+m,\mu,L)=p_{d}(\sqrt{n+m}\cdot t)-p_{d}(\sqrt{n}\cdot t)=\Omega_{c}(\varepsilon).

Therefore, we must have n=Ω(logd/ε)n=\Omega(\sqrt{\log d}/\varepsilon) and m=O(nε/logd)m=O(n\varepsilon/\sqrt{\log d}) for sample amplification.
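The phase transition of p_{d} underlying this argument is easy to observe numerically. The following Monte Carlo sketch (ours; d and the window half-width 2, standing in for the constant C, are illustrative) evaluates p_{d} just below and just above \sqrt{2\log d}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 10_000, 500
z0 = np.sqrt(2 * np.log(d))

def p_d(z):
    # p_d(z) = E[ softmax weight of the true coordinate ], as in the display
    vals = np.empty(trials)
    for t in range(trials):
        logits = z * rng.standard_normal(d)
        logits[0] += z * z              # the true coordinate gets the z^2 boost
        logits -= logits.max()          # stabilize the exponentials
        w = np.exp(logits)
        vals[t] = w[0] / w.sum()
    return vals.mean()

p_lo, p_hi = p_d(z0 - 2.0), p_d(z0 + 2.0)
print(p_lo, p_hi)
```

The probability of identifying the true coordinate jumps from near 0 to near 1 over a constant-width window around \sqrt{2\log d}.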

The following lemma summarizes the phase-transition property of pdp_{d} used in the above example.

Lemma A.19.

For the function pd(z)p_{d}(z) in Example A.18, there exists an absolute constant CC independent of dd such that

pd(2logdC)\displaystyle p_{d}(\sqrt{2\log d}-C) 0.1,\displaystyle\leq 0.1,
pd(2logd+C)\displaystyle p_{d}(\sqrt{2\log d}+C) 0.9.\displaystyle\geq 0.9.

Moreover, for general s<d/2s<d/2, an (n,n+m)(n,n+m) sample amplification is possible under the sparse Gaussian model only if n=Ω(slog(d/s)/ε)n=\Omega(\sqrt{s\log(d/s)}/\varepsilon) and m=O(nε/slog(d/s))m=O(n\varepsilon/\sqrt{s\log(d/s)}).

Our final example proves the tightness of the sufficiency-based approaches in Examples A.1 and A.3 for Gaussian models with unknown covariance. The proof relies on the computation of the minimax risks in Lemma 6.1, as well as the statistical theory of invariance.

Example A.20 (Gaussian model with unknown covariance).

We establish the matching lower bounds for sample amplification in Example A.1 with a known mean, which imply the lower bounds in Example A.3. We use the minimax risk formulation of Lemma 6.1 and consider the following loss:

L(Σ,Σ^)=𝟙((Σ,Σ^)g(n+m1d,d)+Cdn),\displaystyle L(\Sigma,\widehat{\Sigma})=\mathbbm{1}\left(\ell(\Sigma,\widehat{\Sigma})\geq g(n+m-1-d,d)+C\cdot\frac{d}{n}\right),

where the function g is given by (13) in Lemma A.21 below, C>0 is a large absolute constant to be determined later, and \ell(\Sigma,\widehat{\Sigma}) is Stein's loss

(Σ,Σ^)=tr(Σ1Σ^)logdet(Σ1Σ^)d.\displaystyle\ell(\Sigma,\widehat{\Sigma})=\text{\rm tr}(\Sigma^{-1}\widehat{\Sigma})-\log\det(\Sigma^{-1}\widehat{\Sigma})-d.

To search for the minimax estimator under the above loss, arguments similar to those in [JS61] based on the theory of invariance show that it suffices to consider estimators of the form

Σ^n=LnDnLn,\displaystyle\widehat{\Sigma}_{n}=L_{n}D_{n}L_{n}^{\top}, (12)

where D_{n}\in\mathbb{R}^{d\times d} is a diagonal matrix independent of the observations, and L_{n}\in\mathbb{R}^{d\times d} is the lower triangular matrix satisfying L_{n}L_{n}^{\top}=\sum_{i=1}^{n}X_{i}X_{i}^{\top}. Moreover, the risk of the above estimator does not depend on the unknown \Sigma, so in the sequel we assume that \Sigma=I_{d}. Note that when D_{n}=I_{d}/n, the estimator \widehat{\Sigma}_{n} is the sample covariance; however, other choices of the diagonal matrix D_{n} give a uniform improvement over the sample covariance, see [JS61].
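The claimed improvement is easy to observe numerically. The following sketch (ours, with illustrative d and n) compares the Stein risk of the sample covariance (D_{n}=I_{d}/n) with the [JS61]-type choice \lambda_{j}=1/(n+d+1-2j), using \Sigma=I_{d} by invariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 5, 20, 20_000

def stein_loss(sigma_hat):
    # Stein's loss with Sigma = I_d: tr(sigma_hat) - log det(sigma_hat) - d
    _, logdet = np.linalg.slogdet(sigma_hat)
    return np.trace(sigma_hat) - logdet - d

lam_js = 1.0 / (n + d + 1 - 2.0 * np.arange(1, d + 1))   # [JS61]-type weights
lam_cov = np.full(d, 1.0 / n)                            # sample covariance

loss_cov = loss_js = 0.0
for _ in range(trials):
    X = rng.standard_normal((n, d))
    L = np.linalg.cholesky(X.T @ X)           # L L^T = sum_i X_i X_i^T
    loss_cov += stein_loss(L @ np.diag(lam_cov) @ L.T)
    loss_js += stein_loss(L @ np.diag(lam_js) @ L.T)
loss_cov /= trials
loss_js /= trials
print(loss_cov, loss_js)
```

The estimated risk of the [JS61]-type estimator is strictly smaller than that of the sample covariance.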

The proof idea is to show that with n+mn+m samples, there exists an estimator Σ^n+m\widehat{\Sigma}_{n+m} with a specific choice of Dn+mD_{n+m} in (12) such that (cf. Lemma A.21)

|𝔼[(Σ,Σ^n+m)]g(n+m+1d,d)|5dn+m,𝖵𝖺𝗋((Σ,Σ^n+m))4dn+m.\displaystyle|\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n+m})]-g(n+m+1-d,d)|\leq\frac{5d}{n+m},\qquad\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n+m}))}\leq\frac{4d}{n+m}.

Consequently, by Chebyshev's inequality, r({\mathcal{P}},n+m,L)\leq 0.1 for C>0 large enough.

To lower bound the minimax risk r({\mathcal{P}},n,L) with n samples, we need to consider all possible estimators of the form (12). It turns out that for any choice of D_{n}, the first two moments of \ell(\Sigma,\widehat{\Sigma}_{n}) admit explicit expressions, and it always holds that (cf. Lemma A.21)

𝔼[(Σ,Σ^n)]g(n+1d,d)+C1(𝖵𝖺𝗋((Σ,Σ^n))4dn)6dn\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]\geq g(n+1-d,d)+C_{1}\left(\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}-\frac{4d}{n}\right)-\frac{6d}{n}

for any given constant C_{1}>0, provided that n,d\geq C_{2} with C_{2} depending only on C_{1}. Choosing C_{1}>0 large enough, Chebyshev's inequality shows that \ell(\Sigma,\widehat{\Sigma}_{n})\geq g(n+1-d,d)-5C_{1}d/n with probability at least 0.9 for every \widehat{\Sigma}_{n} of the form (12). By the last statement of Lemma A.21, for m=C_{3}n/d with a large enough constant C_{3}>0, this event implies that r({\mathcal{P}},n,L)\geq 0.9.

Combining the above scenarios and applying the pigeonhole principle, Lemma 6.1 shows that m=O(n\varepsilon/d) is necessary for an (n,n+m,\varepsilon) sample amplification.

The following lemma summarizes the necessary technical results for Example A.20.

Lemma A.21.

Let ndn\geq d. For Dn=diag(λ1,,λd)D_{n}=\text{\rm diag}(\lambda_{1},\cdots,\lambda_{d}) with λj=1/(n+d+12j)\lambda_{j}=1/(n+d+1-2j) for all j[d]j\in[d], the corresponding estimator Σ^n\widehat{\Sigma}_{n} in (12) satisfies

|𝔼[(Σ,Σ^n)]g(n+1d,d)|5dn,𝖵𝖺𝗋((Σ,Σ^n))4dn,\displaystyle|\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]-g(n+1-d,d)|\leq\frac{5d}{n},\qquad\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}\leq\frac{4d}{n},

where

g(u,v)\displaystyle g(u,v) (u+2v)log(u+2v)+ulogu2(u+v)log(u+v).\displaystyle\triangleq\frac{(u+2v)\log(u+2v)+u\log u}{2}-(u+v)\log(u+v). (13)

Meanwhile, for any choice of DnD_{n} and any absolute constant C1>0C_{1}>0, the estimator Σ^n\widehat{\Sigma}_{n} in (12) satisfies

𝔼[(Σ,Σ^n)]g(n+1d,d)+C1(𝖵𝖺𝗋((Σ,Σ^n))4dn)6dn,\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]\geq g(n+1-d,d)+C_{1}\left(\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}-\frac{4d}{n}\right)-\frac{6d}{n},

as long as n,dC2n,d\geq C_{2} for some large enough constant C2C_{2} depending only on C1C_{1}.

Finally, the function gg defined in (13) satisfies the following inequality: for n2dn\geq 2d and 0mn0\leq m\leq n, it holds that

g(n+1d,d)g(n+m+1d,d)md213n2.\displaystyle g(n+1-d,d)-g(n+m+1-d,d)\geq\frac{md^{2}}{13n^{2}}.

Appendix B Proof of main theorems

B.1 Proof of Theorem 4.5

We recall the following result from [BC16, Theorem 2.6]: under Assumptions 1 and 2 with k=3, there exists a constant C>0 depending only on d and the moment upper bound such that

supθΘ(n[2A(θ)]1/2(TnA(θ)))𝒩(0,Id)TVCn.\displaystyle\sup_{\theta\in\Theta}\|{\mathcal{L}}(\sqrt{n}[\nabla^{2}A(\theta)]^{-1/2}(T_{n}-\nabla A(\theta)))-{\mathcal{N}}(0,I_{d})\|_{\text{TV}}\leq\frac{C}{\sqrt{n}}.

By an affine transformation, it is then clear that

supθΘ(Tn)𝒩(A(θ),2A(θ)/n)TVCn.\displaystyle\sup_{\theta\in\Theta}\|{\mathcal{L}}(T_{n})-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)\|_{\text{TV}}\leq\frac{C}{\sqrt{n}}.

Moreover, the computation in Example 4.1 shows that

supθΘ𝒩(A(θ),2A(θ)/n)𝒩(A(θ),2A(θ)/(n+m))TVmdn.\displaystyle\sup_{\theta\in\Theta}\|{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/(n+m))\|_{\text{TV}}\leq\frac{m\sqrt{d}}{n}.

Now the desired result follows from the above inequalities and a triangle inequality.
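The middle inequality can be checked numerically in one dimension. The following sketch (ours, with illustrative n and m) computes the TV distance between {\mathcal{N}}(0,1/n) and {\mathcal{N}}(0,1/(n+m)) by quadrature and compares it with m\sqrt{d}/n=m/n:

```python
import numpy as np

n, m = 100, 10
x = np.linspace(-1.0, 1.0, 200_001)        # roughly +-10 standard deviations
dx = x[1] - x[0]

def gauss_pdf(t, var):
    return np.exp(-t ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# TV distance = half the L1 distance between the two densities
tv = 0.5 * np.sum(np.abs(gauss_pdf(x, 1.0 / n) - gauss_pdf(x, 1.0 / (n + m)))) * dx
bound = m / n                               # m * sqrt(d) / n with d = 1
print(tv, bound)
```

The quadrature value sits well below the stated bound.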

B.2 Proof of Theorem 4.6

As the density in the Edgeworth expansion (8) may be negative at some points, throughout the proof we extend the definition of the TV distance to arbitrary signed measures P and Q, namely half the L_{1} distance. Under this extended definition, the triangle inequality \|P-R\|_{\text{TV}}\leq\|P-Q\|_{\text{TV}}+\|Q-R\|_{\text{TV}} still holds. The following tensorization property also holds: for general signed measures P_{1},\cdots,P_{d} and Q_{1},\cdots,Q_{d} with \max_{i\in[d]}\max\{|P_{i}|(\Omega),|Q_{i}|(\Omega)\}\leq r, we have

P1××PdQ1××QdTV\displaystyle\|P_{1}\times\cdots\times P_{d}-Q_{1}\times\cdots\times Q_{d}\|_{\text{TV}}
i=1dP1××Pi1×Qi××QdP1××Pi×Qi+1××QdTV\displaystyle\leq\sum_{i=1}^{d}\|P_{1}\times\cdots\times P_{i-1}\times Q_{i}\times\cdots\times Q_{d}-P_{1}\times\cdots\times P_{i}\times Q_{i+1}\times\cdots\times Q_{d}\|_{\text{TV}}
i=1dPiQiTVj<i|Pj|(Ω)k>i|Qk|(Ω)\displaystyle\leq\sum_{i=1}^{d}\|P_{i}-Q_{i}\|_{\text{TV}}\cdot\prod_{j<i}|P_{j}|(\Omega)\cdot\prod_{k>i}|Q_{k}|(\Omega)
rdi=1dPiQiTV.\displaystyle\leq r^{d}\sum_{i=1}^{d}\|P_{i}-Q_{i}\|_{\text{TV}}. (14)

Fix any \theta\in\Theta, and let P_{n,i} (resp. P_{n+m,i}) be the probability distribution of the i-th coordinate of T_{n} (resp. T_{n+m}), and Q_{n,i} (resp. Q_{n+m,i}) be the signed measure of the corresponding Edgeworth expansion of the form (8) with k=9. We note that the polynomials {\mathcal{K}}_{\ell}(x) in (8) are the same for Q_{n,i} and Q_{n+m,i}, and their coefficients are uniformly bounded over \theta\in\Theta thanks to Assumption 2. Then, based on Assumptions 1 and 2 with k=10, [BC16, Theorem 2.7] shows that

Pn,iQn,iTVCn2,\displaystyle\|P_{n,i}-Q_{n,i}\|_{\text{TV}}\leq\frac{C}{n^{2}},

with C>0 independent of (n,d,\theta). Moreover, the signed measure of the Edgeworth expansion in (8) can be negative only if |x|=\Omega(\sqrt{n}), and therefore the total variation of each Q_{n,i} satisfies

|Qn,i|()=|Γn,9|()Γn,9([cn,cn])+|Γn,9|(\[cn,cn])1+exp(Ω(n)),\displaystyle|Q_{n,i}|(\mathbb{R})=|\Gamma_{n,9}|(\mathbb{R})\leq\Gamma_{n,9}([-c\sqrt{n},c\sqrt{n}])+|\Gamma_{n,9}|(\mathbb{R}\backslash[-c\sqrt{n},c\sqrt{n}])\leq 1+\exp(-\Omega(n)),

where the last inequality follows from integrating the Gaussian tails. Finally, letting Q_{n,i}^{+} be the positive part of Q_{n,i} in the Jordan decomposition, the tensorization property (14) leads to

(Tn)i=1dQn,i+TV(Tn)i=1dQn,iTVCdn2(1+exp(Ω(n)))d,\displaystyle\left\|{\mathcal{L}}(T_{n})-\prod_{i=1}^{d}Q_{n,i}^{+}\right\|_{\text{TV}}\leq\left\|{\mathcal{L}}(T_{n})-\prod_{i=1}^{d}Q_{n,i}\right\|_{\text{TV}}\leq\frac{Cd}{n^{2}}\cdot\left(1+\exp(-\Omega(n))\right)^{d}, (15)

and a similar result holds for {\mathcal{L}}(T_{n+m}).

Next, it remains to upper bound the TV distance between \prod_{i=1}^{d}Q_{n,i}^{+} and \prod_{i=1}^{d}Q_{n+m,i}^{+}. To this end, we also extend the definition of the Hellinger distance to general measures which are not necessarily probability measures. The following inequality between the generalized TV and Hellinger distances then holds: for measures P and Q on \Omega,

PQTV\displaystyle\|P-Q\|_{\text{TV}} =12|dPdQ|=12|dPdQ|(dP+dQ)\displaystyle=\frac{1}{2}\int|\mathrm{d}P-\mathrm{d}Q|=\frac{1}{2}\int|\sqrt{\mathrm{d}P}-\sqrt{\mathrm{d}Q}|(\sqrt{\mathrm{d}P}+\sqrt{\mathrm{d}Q})
12(dPdQ)2(dP+dQ)2\displaystyle\leq\frac{1}{2}\sqrt{\int(\sqrt{\mathrm{d}P}-\sqrt{\mathrm{d}Q})^{2}\cdot\int(\sqrt{\mathrm{d}P}+\sqrt{\mathrm{d}Q})^{2}}
H(P,Q)P(Ω)+Q(Ω).\displaystyle\leq H(P,Q)\cdot\sqrt{P(\Omega)+Q(\Omega)}. (16)

Also, the following tensorization property holds for the Hellinger distance between general measures: for (not necessarily probability) measures Pi,QiP_{i},Q_{i} on Ωi\Omega_{i},

H2(i=1dPi,i=1dQi)=i=1dPi(Ωi)+i=1dQi(Ωi)2i=1d(Pi(Ωi)+Qi(Ωi)2H2(Pi,Qi)).\displaystyle H^{2}\left(\prod_{i=1}^{d}P_{i},\prod_{i=1}^{d}Q_{i}\right)=\frac{\prod_{i=1}^{d}P_{i}(\Omega_{i})+\prod_{i=1}^{d}Q_{i}(\Omega_{i})}{2}-\prod_{i=1}^{d}\left(\frac{P_{i}(\Omega_{i})+Q_{i}(\Omega_{i})}{2}-H^{2}(P_{i},Q_{i})\right). (17)

Consequently, it suffices to prove an upper bound on the individual Hellinger distances H(Q_{n,i}^{+},Q_{n+m,i}^{+}); the TV distance between the product measures then follows directly from (16) and (17).
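Since (17) is an exact identity, it can be verified numerically for arbitrary unnormalized measures on finite sets; the following sketch (ours) checks it for random discrete measures, using that the affinity \int\sqrt{dP\,dQ}=(P(\Omega)+Q(\Omega))/2-H^{2}(P,Q) multiplies across product measures:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4                                 # d factors, k atoms per factor
P = 2.0 * rng.random((d, k))                # unnormalized nonnegative measures
Q = 2.0 * rng.random((d, k))

def h2(p, q):
    # generalized squared Hellinger distance between finite measures
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# left-hand side: H^2 between the product measures
prod_P, prod_Q = P[0], Q[0]
for j in range(1, d):
    prod_P = np.multiply.outer(prod_P, P[j]).ravel()
    prod_Q = np.multiply.outer(prod_Q, Q[j]).ravel()
lhs = h2(prod_P, prod_Q)

# right-hand side of (17), with total masses playing the role of P_i(Omega_i)
mass_P, mass_Q = P.sum(axis=1), Q.sum(axis=1)
rhs = (mass_P.prod() + mass_Q.prod()) / 2 - np.prod(
    [(mass_P[j] + mass_Q[j]) / 2 - h2(P[j], Q[j]) for j in range(d)])
print(lhs, rhs)
```

The two sides agree up to floating-point rounding.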

To upper bound the individual Hellinger distance, note that after a proper affine transformation, the densities of Qn,iQ_{n,i} and Qn+m,iQ_{n+m,i} are as follows:

Qn,i(dx)\displaystyle Q_{n,i}(dx) =γn(x)(1+=13𝒦,i(nx)n/2)dx,\displaystyle=\gamma_{n}(x)\left(1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}\right)dx, (18)
Qn+m,i(dx)\displaystyle Q_{n+m,i}(dx) =γn+m(x)(1+=13𝒦,i(n+mx)(n+m)/2)dx,\displaystyle=\gamma_{n+m}(x)\left(1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)dx, (19)

where \gamma_{n} is the density of {\mathcal{N}}(0,1/n), and {\mathcal{K}}_{\ell,i} is a polynomial of degree 3\ell with uniformly bounded coefficients. By (18) and (19), there exists an absolute constant c>0 independent of (n,m) such that Q_{n,i}(x)/\gamma_{n}(x),Q_{n+m,i}(x)/\gamma_{n+m}(x)\in[1/2,3/2] whenever |x|\leq c. Consequently, the squared Hellinger distance can then be bounded as

H2(Qn,i+,Qn+m,i+)\displaystyle H^{2}(Q_{n,i}^{+},Q_{n+m,i}^{+}) =12|x|c(Qn,i(x)Qn+m,i(x))2dx+12|x|>c(Qn,i+(x)Qn+m,i+(x))2dx\displaystyle=\frac{1}{2}\int_{|x|\leq c}\left(\sqrt{Q_{n,i}(x)}-\sqrt{Q_{n+m,i}(x)}\right)^{2}\mathrm{d}x+\frac{1}{2}\int_{|x|>c}\left(\sqrt{Q_{n,i}^{+}(x)}-\sqrt{Q_{n+m,i}^{+}(x)}\right)^{2}\mathrm{d}x
|x|c(γn(x)γn+m(x))2(1+=13𝒦,i(n+mx)(n+m)/2)dx\displaystyle\leq\int_{|x|\leq c}(\sqrt{\gamma_{n}(x)}-\sqrt{\gamma_{n+m}(x)})^{2}\left(1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)\mathrm{d}x
+|x|cγn(x)(1+=13𝒦,i(nx)n/21+=13𝒦,i(n+mx)(n+m)/2)2dx\displaystyle\qquad+\int_{|x|\leq c}\gamma_{n}(x)\left(\sqrt{1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}}-\sqrt{1+\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}}\right)^{2}\mathrm{d}x
+|x|>c(Qn,i+(x)+Qn+m,i+(x))dx\displaystyle\qquad+\int_{|x|>c}\left(Q_{n,i}^{+}(x)+Q_{n+m,i}^{+}(x)\right)\mathrm{d}x
A1+A2+A3.\displaystyle\equiv A_{1}+A_{2}+A_{3}.

Next we upper bound the terms A1,A2A_{1},A_{2}, and A3A_{3} separately.

  1.

    Upper bounding A1A_{1}: note that by the definition of cc, the multiplication factor in A1A_{1} is at most 3/23/2. Therefore,

    A1\displaystyle A_{1} 32|x|c(γn(x)γn+m(x))2dx\displaystyle\leq\frac{3}{2}\int_{|x|\leq c}(\sqrt{\gamma_{n}(x)}-\sqrt{\gamma_{n+m}(x)})^{2}\mathrm{d}x
    32(γn(x)γn+m(x))2dx\displaystyle\leq\frac{3}{2}\int_{\mathbb{R}}(\sqrt{\gamma_{n}(x)}-\sqrt{\gamma_{n+m}(x)})^{2}\mathrm{d}x
    =(a)3(1(n(n+m)(n+m/2)2)1/4)\displaystyle\overset{\rm(a)}{=}3\left(1-\left(\frac{n(n+m)}{(n+m/2)^{2}}\right)^{1/4}\right)
    (b)3m24n2,\displaystyle\overset{\rm(b)}{\leq}\frac{3m^{2}}{4n^{2}}, (20)

    where (a) is due to the direct computation of the squared Hellinger distance between 𝒩(0,1/n){\mathcal{N}}(0,1/n) and 𝒩(0,1/(n+m)){\mathcal{N}}(0,1/(n+m)), and (b) makes use of the inequality 1x1/41x1-x^{1/4}\leq 1-x for x[0,1]x\in[0,1].

  2.

    Upper bounding A2A_{2}: using |ab||ab||\sqrt{a}-\sqrt{b}|\leq|a-b| for a,b1/2a,b\geq 1/2, we have

    A2\displaystyle A_{2} |x|cγn(x)(=13𝒦,i(nx)n/2=13𝒦,i(n+mx)(n+m)/2)2dx\displaystyle\leq\int_{|x|\leq c}\gamma_{n}(x)\left(\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}-\sum_{\ell=1}^{3}\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)^{2}\mathrm{d}x
    3=13γn(x)(𝒦,i(nx)n/2𝒦,i(n+mx)(n+m)/2)2dx\displaystyle\leq 3\sum_{\ell=1}^{3}\int_{\mathbb{R}}\gamma_{n}(x)\left(\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n}\cdot x)}{n^{\ell/2}}-\frac{{\mathcal{K}}_{\ell,i}(\sqrt{n+m}\cdot x)}{(n+m)^{\ell/2}}\right)^{2}\mathrm{d}x
    =3=13γ(x)(𝒦,i(x)n/2𝒦,i(1+m/nx)(n+m)/2)2dx,\displaystyle=3\sum_{\ell=1}^{3}\int_{\mathbb{R}}\gamma(x)\left(\frac{{\mathcal{K}}_{\ell,i}(x)}{n^{\ell/2}}-\frac{{\mathcal{K}}_{\ell,i}(\sqrt{1+m/n}\cdot x)}{(n+m)^{\ell/2}}\right)^{2}\mathrm{d}x,

    where the last step is a change of measure, and γ\gamma is the density of 𝒩(0,1){\mathcal{N}}(0,1). Writing 𝒦,i(x)=j=03ai,jxj{\mathcal{K}}_{\ell,i}(x)=\sum_{j=0}^{3\ell}a_{i,j}x^{j}, then

    |𝒦,i(x)n/2𝒦,i(1+m/nx)(n+m)/2|=|j=03ai,jxjn/2(1(1+mn)j2)|m(1+x3)n/2+1\displaystyle\left|\frac{{\mathcal{K}}_{\ell,i}(x)}{n^{\ell/2}}-\frac{{\mathcal{K}}_{\ell,i}(\sqrt{1+m/n}\cdot x)}{(n+m)^{\ell/2}}\right|=\left|\sum_{j=0}^{3\ell}\frac{a_{i,j}x^{j}}{n^{\ell/2}}\left(1-\left(1+\frac{m}{n}\right)^{\frac{j-\ell}{2}}\right)\right|\lesssim\frac{m(1+x^{3\ell})}{n^{\ell/2+1}}

    whenever m=O(n)m=O(n). Combining the above two inequalities yields

    A2m2n3.\displaystyle A_{2}\lesssim\frac{m^{2}}{n^{3}}. (21)
  3. 3.

    Upper bounding A3A_{3}: note that the tail probability of γn\gamma_{n} outside [c,c][-c,c] is at most exp(Ω(n))\exp(-\Omega(n)), and the same holds for Qn,i+Q_{n,i}^{+} and Qn+m,i+Q_{n+m,i}^{+} by (18) and (19); integrating these tails leads to

    A3=exp(Ω(n)).\displaystyle A_{3}=\exp(-\Omega(n)). (22)
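Step (a) of (20), the closed-form squared Hellinger distance between 𝒩(0,1/n){\mathcal{N}}(0,1/n) and 𝒩(0,1/(n+m)){\mathcal{N}}(0,1/(n+m)), can be double checked numerically. The sketch below is ours and is not part of the proof; it integrates the Hellinger integrand directly and compares with the closed form and with the final bound in (b):

```python
import math

def hellinger_sq_closed(n, n_plus_m):
    """Closed form H^2(N(0,1/n), N(0,1/(n+m))) from step (a)."""
    m = n_plus_m - n
    return 1.0 - (n * n_plus_m / (n + m / 2.0) ** 2) ** 0.25

def hellinger_sq_numeric(n, n_plus_m, grid=100000, half_width=2.0):
    """H^2 = (1/2) * integral of (sqrt(p) - sqrt(q))^2, by the midpoint rule."""
    def dens(x, k):
        # density of N(0, 1/k)
        return math.sqrt(k / (2 * math.pi)) * math.exp(-k * x * x / 2)
    dx = 2 * half_width / grid
    total = 0.0
    for i in range(grid):
        x = -half_width + (i + 0.5) * dx
        total += (math.sqrt(dens(x, n)) - math.sqrt(dens(x, n_plus_m))) ** 2 * dx
    return 0.5 * total

n, m = 50, 5
closed = hellinger_sq_closed(n, n + m)
assert abs(closed - hellinger_sq_numeric(n, n + m)) < 1e-6
assert 3 * closed <= 3 * m ** 2 / (4 * n ** 2)   # step (b)
```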

In summary, a combination of (20), (21), and (22) leads to

H2(Qn,i+,Qn+m,i+)=O(m2n2+exp(Ω(n))).\displaystyle H^{2}(Q_{n,i}^{+},Q_{n+m,i}^{+})=O\left(\frac{m^{2}}{n^{2}}+\exp(-\Omega(n))\right).

Now using (B.2), (17), and |Qn,i|()1+exp(Ω(n))|Q_{n,i}|(\mathbb{R})\leq 1+\exp(-\Omega(n)), we conclude that

i=1dQn,i+i=1dQn+m,i+TV=O(mdn+(1+exp(Ω(n)))d1).\displaystyle\left\|\prod_{i=1}^{d}Q_{n,i}^{+}-\prod_{i=1}^{d}Q_{n+m,i}^{+}\right\|_{\text{TV}}=O\left(\frac{m\sqrt{d}}{n}+(1+\exp(-\Omega(n)))^{d}-1\right). (23)

Hence, when n=Ω(d)n=\Omega(\sqrt{d}) and m=O(n)m=O(n), the desired result follows from (15) and (23). For the other scenarios, we simply use that the TV distance is upper bounded by one, and the result still holds.

B.3 Proof of Theorem 5.2

Let P^n\widehat{P}_{n} be the distribution learned from the first n/2n/2 samples which achieves the χ2\chi^{2}-learning error rχ2(𝒫,n/2)r_{\chi^{2}}({\mathcal{P}},n/2), and PmixP_{\text{mix}} be the distribution of the shuffled samples (Z1,,Zn/2+m)(Z_{1},\cdots,Z_{n/2+m}) in Algorithm 2. Note that both distributions depend on the first n/2n/2 samples and are therefore random. Then the final TV distance of sample amplification is

PXn/2×Pmix(Xn/2)P(n+m)TV=𝔼Xn/2Pmix(Xn/2)P(n/2+m)TV,\displaystyle\|P_{X^{n/2}}\times P_{\text{mix}}(X^{n/2})-P^{\otimes(n+m)}\|_{\text{TV}}=\mathbb{E}_{X^{n/2}}\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}},

which is the expected TV distance between the mixture distribution and the product distribution.

By Lemma 5.8, for any realization of Xn/2X^{n/2} it holds that

χ2(Pmix,P(n/2+m))(1+mn/2+mχ2(P^n,P))m1.\displaystyle\chi^{2}\left(P_{\text{mix}},P^{\otimes(n/2+m)}\right)\leq\left(1+\frac{m}{n/2+m}\chi^{2}(\widehat{P}_{n},P)\right)^{m}-1.

Since (3) shows that DKL(PQ)log(1+χ2(P,Q))D_{\text{KL}}(P\|Q)\leq\log(1+\chi^{2}(P,Q)), we have

DKL(PmixP(n/2+m))mlog(1+mn/2+mχ2(P^n,P))m2n/2+mχ2(P^n,P).\displaystyle D_{\text{KL}}(P_{\text{mix}}\|P^{\otimes(n/2+m)})\leq m\log\left(1+\frac{m}{n/2+m}\chi^{2}(\widehat{P}_{n},P)\right)\leq\frac{m^{2}}{n/2+m}\chi^{2}(\widehat{P}_{n},P).

Consequently, by (3) again and the concavity of xxx\mapsto\sqrt{x}, we have

𝔼Xn/2Pmix(Xn/2)P(n/2+m)TV\displaystyle\mathbb{E}_{X^{n/2}}\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}} 𝔼Xn/212DKL(Pmix(Xn/2)P(n/2+m))\displaystyle\leq\mathbb{E}_{X^{n/2}}\sqrt{\frac{1}{2}D_{\text{KL}}(P_{\text{mix}}(X^{n/2})\|P^{\otimes(n/2+m)})}
𝔼Xn/2m2n+2mχ2(P^n(Xn/2),P)\displaystyle\leq\mathbb{E}_{X^{n/2}}\sqrt{\frac{m^{2}}{n+2m}\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)}
m2n+2m𝔼Xn/2[χ2(P^n(Xn/2),P)]\displaystyle\leq\sqrt{\frac{m^{2}}{n+2m}\mathbb{E}_{X^{n/2}}[\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)]}
m2n+2mrχ2(𝒫,n/2).\displaystyle\leq\sqrt{\frac{m^{2}}{n+2m}\cdot r_{\chi^{2}}({\mathcal{P}},n/2)}.
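The chain above reduces the amplification guarantee to a single χ²-learning rate. The sketch below (toy numbers, discrete family, and estimator are ours) evaluates the resulting bound from the last display for a fixed pair of distributions:

```python
import math

def chi2_div(p_hat, p):
    """chi^2(p_hat, p) = sum_i (p_hat_i - p_i)^2 / p_i for discrete distributions."""
    return sum((a - b) ** 2 / b for a, b in zip(p_hat, p))

def amplification_tv_bound(n, m, chi2_err):
    """Final display of the proof: TV <= sqrt(m^2/(n+2m) * chi2_err)."""
    kl_bound = m ** 2 / (n / 2 + m) * chi2_err   # bound on KL(P_mix || P^(n/2+m))
    return math.sqrt(kl_bound / 2)               # Pinsker: TV <= sqrt(KL/2)

p, p_hat = [0.3, 0.7], [0.35, 0.65]
n, m = 1000, 50
bound = amplification_tv_bound(n, m, chi2_div(p_hat, p))
assert 0 < bound < 1
```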

B.4 Proof of Theorem 5.5

The main arguments are essentially the same as in the proof of Theorem 5.2. In particular, for each j[d]j\in[d] we have

DKL(Pmix,jPj(n/2+m))m2n/2+mχ2(P^n,j,Pj).\displaystyle D_{\text{KL}}(P_{\text{mix},j}\|P_{j}^{\otimes(n/2+m)})\leq\frac{m^{2}}{n/2+m}\chi^{2}(\widehat{P}_{n,j},P_{j}).

Since both PmixP_{\text{mix}} and PP have product structures, the chain rule of KL divergence implies that

DKL(PmixP(n/2+m))m2n/2+mj=1dχ2(P^n,j,Pj).\displaystyle D_{\text{KL}}(P_{\text{mix}}\|P^{\otimes(n/2+m)})\leq\frac{m^{2}}{n/2+m}\sum_{j=1}^{d}\chi^{2}(\widehat{P}_{n,j},P_{j}).

Now the rest of the proof follows the last few lines of the proof of Theorem 5.2.

B.5 Proof of Theorem 6.2

The proof relies on Lemma 6.1 and a classical statistical result known as Anderson’s lemma. Without loss of generality we assume Σ=Id\Sigma=I_{d}. Choose L(θ,θ^)=(θθ^)L(\theta,\widehat{\theta})=\ell(\theta-\widehat{\theta}), with a bowl-shaped (i.e. symmetric and quasi-convex) loss function ()[0,1]\ell(\cdot)\in[0,1]. Then Anderson’s lemma (see, e.g. [VdV00, Lemma 8.5]) implies that the minimax estimator under LL and nn samples is θ^n=n1i=1nXi\widehat{\theta}_{n}=n^{-1}\sum_{i=1}^{n}X_{i}, thus

r(𝒫,n,L)=𝔼[(Zn)],\displaystyle r({\mathcal{P}},n,L)=\mathbb{E}\left[\ell\left(\frac{Z}{\sqrt{n}}\right)\right],

with Z𝒩(0,Id)Z\sim{\mathcal{N}}(0,I_{d}). Choosing (x)=𝟙(x2r)\ell(x)=\mathbbm{1}(\|x\|_{2}\geq r) with the parameter r>0r>0 determined by

(Zn2r)(Zn+m2r)=𝒩(0,Idn)𝒩(0,Idn+m)TV,\displaystyle\mathbb{P}\left(\left\|\frac{Z}{\sqrt{n}}\right\|_{2}\geq r\right)-\mathbb{P}\left(\left\|\frac{Z}{\sqrt{n+m}}\right\|_{2}\geq r\right)=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{\rm TV}},

then clearly (){0,1}\ell(\cdot)\in\{0,1\} is bowl-shaped. Hence, for this choice of LL, Lemma 6.1 gives that

ε(𝒫,n,m)r(𝒫,n,L)r(𝒫,n+m,L)=𝒩(0,Idn)𝒩(0,Idn+m)TV,\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq r({\mathcal{P}},n,L)-r({\mathcal{P}},n+m,L)=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{\rm TV}},

and a matching upper bound is presented in Example 4.1.
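Both sides of this sandwich equal the TV distance between two centered isotropic Gaussians. For even dd this TV admits a closed form through the χd2\chi^{2}_{d} distribution function: equating the two densities gives the crossing radius ρ2=(d/m)log(1+m/n)\rho^{2}=(d/m)\log(1+m/n), and integrating each density over the resulting ball reduces to a chi-square CDF. The sketch below is our derivation, not taken from the paper:

```python
import math

def chi2_cdf_even(x, d):
    """CDF of the chi-square distribution with even d degrees of freedom."""
    assert d % 2 == 0
    lam = x / 2.0
    term, acc = 1.0, 1.0
    for i in range(1, d // 2):
        term *= lam / i
        acc += term
    return 1.0 - math.exp(-lam) * acc

def tv_isotropic_gaussians(n, m, d):
    """Exact TV between N(0, I_d/n) and N(0, I_d/(n+m)), d even."""
    rho2 = (d / m) * math.log((n + m) / n)   # squared radius where densities cross
    return chi2_cdf_even((n + m) * rho2, d) - chi2_cdf_even(n * rho2, d)

tv = tv_isotropic_gaussians(n=100, m=10, d=4)
assert 0 < tv < 1
# TV increases with m, consistent with the m*sqrt(d)/n scaling in the text
assert tv < tv_isotropic_gaussians(n=100, m=20, d=4)
```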

B.6 Proof of Theorem 6.3

Based on the discussions above Theorem 6.3, we choose an arbitrary open ball Θ0Θ\Theta_{0}\subseteq\Theta, and pick any dd-dimensional ball Bd(μ0;r)B_{d}(\mu_{0};r) contained in A(Θ0)\nabla A(\Theta_{0}). Let θ0\theta_{0} be the center of Θ0\Theta_{0}, and after a proper affine transformation we assume that 2A(θ0)=Id\nabla^{2}A(\theta_{0})=I_{d}. Consider a truncated Gaussian prior ν\nu with

μ𝒩(μ0,cnrn2)μBd(μ0;rn),\displaystyle\mu\sim{\mathcal{N}}(\mu_{0},c_{n}r_{n}^{2})\mid\mu\in B_{d}(\mu_{0};r_{n}),

where cn>0c_{n}>0 and rn(0,r)r_{n}\in(0,r) are parameters to be determined later. Using the diffeomorphism A\nabla A, there is a prior ν0\nu_{0} on Θ0\Theta_{0} which induces the above prior on A(Θ0)\nabla A(\Theta_{0}). Moreover, as Θ¯0\overline{\Theta}_{0}, the closure of Θ0\Theta_{0}, is a compact set, a weaker Assumption 2 with the supremum restricted to Θ0\Theta_{0} holds for k=3k=3.

Now we analyze the Bayes risks under the above prior ν0\nu_{0} and the loss

L(θ,μ^)=(A(θ)μ^)[0,1],\displaystyle L(\theta,\widehat{\mu})=\ell(\nabla A(\theta)-\widehat{\mu})\in[0,1],

where \ell is a bowl-shaped function. By sufficiency, it remains to consider the class of estimators depending only on the sufficient statistic Tn(Xn)=n1i=1nT(Xi)T_{n}(X^{n})=n^{-1}\sum_{i=1}^{n}T(X_{i}). Let μ^(Tn)\widehat{\mu}(T_{n}) be such an estimator, then under each θΘ0\theta\in\Theta_{0}, we have

𝔼θ[L(θ,μ^(Tn))]𝔼θ[L(θ,μ^(Zn))](Tn)(Zn)TV,\displaystyle\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(T_{n}))]\geq\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))]-\|{\mathcal{L}}(T_{n})-{\mathcal{L}}(Z_{n})\|_{\text{TV}},

where Znθ𝒩(A(θ),Id/n)Z_{n}\mid\theta\sim{\mathcal{N}}(\nabla A(\theta),I_{d}/n). To upper bound this TV distance, the proof of Theorem 4.5 together with Assumption 2 applied to Θ0\Theta_{0} gives that

(Tn)𝒩(A(θ),2A(θ)/n)TVC1n,\displaystyle\|{\mathcal{L}}(T_{n})-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)\|_{\text{TV}}\leq\frac{C_{1}}{\sqrt{n}},

where C1>0C_{1}>0 only depends on the exponential family. Moreover,

𝒩(A(θ),Id/n)𝒩(A(θ),2A(θ)/n)TV\displaystyle\|{\mathcal{N}}(\nabla A(\theta),I_{d}/n)-{\mathcal{N}}(\nabla A(\theta),\nabla^{2}A(\theta)/n)\|_{\text{TV}}
=𝒩(0,Id)𝒩(0,2A(θ))TV\displaystyle=\|{\mathcal{N}}(0,I_{d})-{\mathcal{N}}(0,\nabla^{2}A(\theta))\|_{\text{TV}}
(a)322A(θ)IdF(b)C2rn,\displaystyle\overset{\rm(a)}{\leq}\frac{3}{2}\|\nabla^{2}A(\theta)-I_{d}\|_{\text{F}}\overset{\rm(b)}{\leq}C_{2}r_{n},

where (a) follows from [DMR18, Theorem 1.1], and (b) makes use of the analytical property of A(θ)A(\theta) (see, e.g. [ML08, Theorem 1.17]), the assumption that A(θ)A(θ0)2rn\|\nabla A(\theta)-\nabla A(\theta_{0})\|_{2}\leq r_{n}, and 2A(θ0)=Id\nabla^{2}A(\theta_{0})=I_{d}. Again, here the constant C2>0C_{2}>0 is independent of nn. Combining the above inequalities yields that

𝔼θ[L(θ,μ^(Tn))]𝔼θ[L(θ,μ^(Zn))]C1nC2rn.\displaystyle\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(T_{n}))]\geq\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))]-\frac{C_{1}}{\sqrt{n}}-C_{2}r_{n}. (24)

Next we lower bound the Bayes risk 𝔼ν0𝔼θ[L(θ,μ^(Zn))]\mathbb{E}_{\nu_{0}}\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))] when TnT_{n} has been replaced by ZnZ_{n}. Let ν\nu^{\prime} be the non-truncated Gaussian distribution 𝒩(μ0,cnrn2){\mathcal{N}}(\mu_{0},c_{n}r_{n}^{2}), and Gn𝒩(0,Id/n)G_{n}\sim{\mathcal{N}}(0,I_{d}/n), then

𝔼ν0𝔼θ[L(θ,μ^(Zn))]\displaystyle\mathbb{E}_{\nu_{0}}\mathbb{E}_{\theta}[L(\theta,\widehat{\mu}(Z_{n}))] =𝔼μν[(μμ^(Gn+μ))]\displaystyle=\mathbb{E}_{\mu\sim\nu}[\ell(\mu-\widehat{\mu}(G_{n}+\mu))]
𝔼μν[(μμ^(Gn+μ))]ν({μBd(μ0;rn)})\displaystyle\geq\mathbb{E}_{\mu\sim\nu^{\prime}}[\ell(\mu-\widehat{\mu}(G_{n}+\mu))]-\nu^{\prime}(\left\{\mu\notin B_{d}(\mu_{0};r_{n})\right\})
(c)𝔼μν[(μμ^(Gn+μ))]eC3/cn\displaystyle\overset{\rm(c)}{\geq}\mathbb{E}_{\mu\sim\nu^{\prime}}[\ell(\mu-\widehat{\mu}(G_{n}+\mu))]-e^{-C_{3}/c_{n}}
(d)𝔼[(ncnrn21+ncnrn2Gn)]eC3/cn\displaystyle\overset{\rm(d)}{\geq}\mathbb{E}\left[\ell\left(\sqrt{\frac{nc_{n}r_{n}^{2}}{1+nc_{n}r_{n}^{2}}}\cdot G_{n}\right)\right]-e^{-C_{3}/c_{n}}
(e)𝔼[(Gn)]eC3/cnC41+ncnrn2.\displaystyle\overset{\rm(e)}{\geq}\mathbb{E}[\ell(G_{n})]-e^{-C_{3}/c_{n}}-\frac{C_{4}}{1+nc_{n}r_{n}^{2}}. (25)

Here (c) follows from the Gaussian tail probability, (d) makes use of Anderson’s lemma and the fact that under ν\nu^{\prime}, the posterior distribution of μ\mu given ZnZ_{n} is Gaussian with covariance cnrn2Id/(1+ncnrn2)c_{n}r_{n}^{2}I_{d}/(1+nc_{n}r_{n}^{2}). The final inequality (e) is due to the following upper bound on the TV distance:

𝒩(0,Σ)𝒩(0,cΣ)TV32(c1)IdF=3d|c1|2.\displaystyle\|{\mathcal{N}}(0,\Sigma)-{\mathcal{N}}(0,c\Sigma)\|_{\text{TV}}\leq\frac{3}{2}\|(c-1)I_{d}\|_{\text{F}}=\frac{3\sqrt{d}|c-1|}{2}.

Now combining (24) and (25), we obtain a lower bound on the Bayes risk:

rB(𝒫,n,ν0,L)𝔼[(Gn)]C1nC2rneC3/cnC41+ncnrn2.\displaystyle r_{\text{B}}({\mathcal{P}},n,\nu_{0},L)\geq\mathbb{E}[\ell(G_{n})]-\frac{C_{1}}{\sqrt{n}}-C_{2}r_{n}-e^{-C_{3}/c_{n}}-\frac{C_{4}}{1+nc_{n}r_{n}^{2}}.

Similarly, by reversing all the above inequalities, an upper bound of the Bayes risk with n+mn+m samples is also available:

rB(𝒫,n+m,ν0,L)𝔼[(Gn+m)]+C1n+C2rn+eC3/cn+C41+ncnrn2.\displaystyle r_{\text{B}}({\mathcal{P}},n+m,\nu_{0},L)\leq\mathbb{E}[\ell(G_{n+m})]+\frac{C_{1}}{\sqrt{n}}+C_{2}r_{n}+e^{-C_{3}/c_{n}}+\frac{C_{4}}{1+nc_{n}r_{n}^{2}}.

By the proof of Theorem 6.2, a proper choice of \ell satisfies that

𝔼[(Gn)]𝔼[(Gn+m)]=𝒩(0,Idn)𝒩(0,Idn+m)TV=Ω(mdn1),\displaystyle\mathbb{E}[\ell(G_{n})]-\mathbb{E}[\ell(G_{n+m})]=\left\|{\mathcal{N}}\left(0,\frac{I_{d}}{n}\right)-{\mathcal{N}}\left(0,\frac{I_{d}}{n+m}\right)\right\|_{\text{TV}}=\Omega\left(\frac{m\sqrt{d}}{n}\wedge 1\right),

where the last step is again due to [DMR18, Theorem 1.1]. Consequently, by choosing cn=Θ(1/logn)c_{n}=\Theta(1/\log n) and rn=Θ((logn/n)1/3)r_{n}=\Theta((\log n/n)^{1/3}), the desired result follows from Lemma 6.1.

B.7 Proof of Theorem 6.4

We will apply Lemma 6.1 to the uniform prior μ\mu over 2d2^{d} points j=1d{θj,+,θj,}\prod_{j=1}^{d}\{\theta_{j,+},\theta_{j,-}\}, and the loss function L:Θ×Θ[0,1]L:\Theta\times\Theta\to[0,1] with

L((θ1,,θd),(θ^1,,θ^d))=𝟙(j=1d𝟙(θj=θ^j)d2+12j=1dαj).\displaystyle L((\theta_{1},\cdots,\theta_{d}),(\widehat{\theta}_{1},\cdots,\widehat{\theta}_{d}))=\mathbbm{1}\left(\sum_{j=1}^{d}\mathbbm{1}(\theta_{j}=\widehat{\theta}_{j})\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right).

We compute the Bayes risks rB(𝒫,n,μ,L)r_{\text{B}}({\mathcal{P}},n,\mu,L) and rB(𝒫,n+m,μ,L)r_{\text{B}}({\mathcal{P}},n+m,\mu,L) in this scenario.

Under the uniform prior μ\mu, it is straightforward to see that the posterior distribution of θ\theta given XnX^{n} is a product distribution j=1dpθjXjn\prod_{j=1}^{d}p_{\theta_{j}\mid X_{j}^{n}}, with pθjXjnp_{\theta_{j}\mid X_{j}^{n}} supported on two elements {θj,+,θj,}\{\theta_{j,+},\theta_{j,-}\}, and

pθjXjn(θj,+)\displaystyle p_{\theta_{j}\mid X_{j}^{n}}(\theta_{j,+}) =i=1npθj,+(Xi,j)i=1npθj,+(Xi,j)+i=1npθj,(Xi,j),\displaystyle=\frac{\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})}{\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})+\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})},
pθjXjn(θj,)\displaystyle p_{\theta_{j}\mid X_{j}^{n}}(\theta_{j,-}) =i=1npθj,(Xi,j)i=1npθj,+(Xi,j)+i=1npθj,(Xi,j).\displaystyle=\frac{\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})}{\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})+\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})}.

Then given XnX^{n}, the Bayes estimator θ^(Xn)Θ\widehat{\theta}(X^{n})\in\Theta which minimizes the posterior expected loss LL is any minimizer ξΘ\xi\in\Theta of the expression

θXn(j=1d𝟙(θj=ξj)d2+12j=1dαj),\displaystyle\mathbb{P}_{\theta\mid X^{n}}\left(\sum_{j=1}^{d}\mathbbm{1}(\theta_{j}=\xi_{j})\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right),

which is easily seen to be

θ^j(Xn)=θ^j(Xjn)=θj,+(θj,+θj,)𝟙(i=1npθj,+(Xi,j)i=1npθj,(Xi,j)).\displaystyle\widehat{\theta}_{j}(X^{n})=\widehat{\theta}_{j}(X_{j}^{n})=\theta_{j,-}+(\theta_{j,+}-\theta_{j,-})\cdot\mathbbm{1}\left(\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})\geq\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})\right).

For the above Bayes estimator, the random variables 𝟙(θj=θ^j(Xn))\mathbbm{1}(\theta_{j}=\widehat{\theta}_{j}(X^{n})) are mutually independent, with the mean value

pn,j\displaystyle p_{n,j} =(θj=θ^j(Xn))\displaystyle=\mathbb{P}(\theta_{j}=\widehat{\theta}_{j}(X^{n}))
=12pθj,+n(i=1npθj,+(Xi,j)i=1npθj,(Xi,j))+12pθj,n(i=1npθj,+(Xi,j)<i=1npθj,(Xi,j))\displaystyle=\frac{1}{2}p_{\theta_{j,+}}^{\otimes n}\left(\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})\geq\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})\right)+\frac{1}{2}p_{\theta_{j,-}}^{\otimes n}\left(\prod_{i=1}^{n}p_{\theta_{j,+}}(X_{i,j})<\prod_{i=1}^{n}p_{\theta_{j,-}}(X_{i,j})\right)
=12(1+pθj,+npθj,nTV).\displaystyle=\frac{1}{2}\left(1+\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}}\right).

Consequently, we have pn,j(1+αj)/2ε/(2d)p_{n,j}\leq(1+\alpha_{j})/2-\varepsilon/(2\sqrt{d}) for each j[d]j\in[d] by (10), and thus

rB(𝒫,n,μ,L)\displaystyle r_{\text{B}}({\mathcal{P}},n,\mu,L) =(j=1d𝖡𝖾𝗋𝗇(pn,j)d2+12j=1dαj)\displaystyle=\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}(p_{n,j})\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right)
(j=1d𝖡𝖾𝗋𝗇(1+αj2ε2d)d2+12j=1dαj).\displaystyle\geq\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right). (26)

An entirely symmetric argument leads to

rB(𝒫,n+m,μ,L)(j=1d𝖡𝖾𝗋𝗇(1+αj2+ε2d)d2+12j=1dαj).\displaystyle r_{\text{B}}({\mathcal{P}},n+m,\mu,L)\leq\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{d}{2}+\frac{1}{2}\sum_{j=1}^{d}\alpha_{j}\right). (27)
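The identity pn,j=(1+pθj,+npθj,nTV)/2p_{n,j}=(1+\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}})/2 for the likelihood-ratio rule can be verified by exhaustive enumeration for a toy pair of Bernoulli distributions (the parameters below are ours):

```python
import itertools
import math

def seq_prob(seq, p):
    """Probability of a binary sequence under i.i.d. Bern(p)."""
    return math.prod(p if s else 1 - p for s in seq)

def lr_test_success(n, p_plus, p_minus):
    """Success probability of the likelihood-ratio rule under a uniform prior."""
    succ = 0.0
    for seq in itertools.product([0, 1], repeat=n):
        a, b = seq_prob(seq, p_plus), seq_prob(seq, p_minus)
        # the rule declares "+" iff the "+" likelihood is at least the "-" one
        succ += 0.5 * (a if a >= b else b)
    return succ

def tv_product(n, p_plus, p_minus):
    """TV distance between the n-fold product distributions, by enumeration."""
    return 0.5 * sum(abs(seq_prob(s, p_plus) - seq_prob(s, p_minus))
                     for s in itertools.product([0, 1], repeat=n))

n_obs, p_plus, p_minus = 5, 0.8, 0.3
success = lr_test_success(n_obs, p_plus, p_minus)
assert abs(success - (1 + tv_product(n_obs, p_plus, p_minus)) / 2) < 1e-12
```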

To further lower bound (26) and upper bound (27), the following lemma shows that we may assume that the above Bernoulli distributions have the same parameter.

Lemma B.1 (Theorems 4 and 5 of [Hoe56]).

Let Xi𝖡𝖾𝗋𝗇(pi)X_{i}\sim\mathsf{Bern}(p_{i}) be independent Bernoulli random variables for i=1,,ni=1,\cdots,n, with p¯=n1i=1npi\bar{p}=n^{-1}\sum_{i=1}^{n}p_{i}. Then for 0knp¯10\leq k\leq n\bar{p}-1, we have

(i=1nXik)(𝖡(n,p¯)k),\displaystyle\mathbb{P}\left(\sum_{i=1}^{n}X_{i}\leq k\right)\leq\mathbb{P}\left(\mathsf{B}(n,\bar{p})\leq k\right),

and for knp¯k\geq n\bar{p}, we have

(i=1nXik)(𝖡(n,p¯)k).\displaystyle\mathbb{P}\left(\sum_{i=1}^{n}X_{i}\leq k\right)\geq\mathbb{P}\left(\mathsf{B}(n,\bar{p})\leq k\right).

In particular, for integers (b,c)(b,c) with bnp¯cb\leq n\bar{p}\leq c, we have

(bi=1nXic)(b𝖡(n,p¯)c).\displaystyle\mathbb{P}\left(b\leq\sum_{i=1}^{n}X_{i}\leq c\right)\geq\mathbb{P}\left(b\leq\mathsf{B}(n,\bar{p})\leq c\right).
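Lemma B.1, i.e. Hoeffding’s comparison [Hoe56], can be verified exactly on a small instance by computing the Poisson-binomial pmf with dynamic programming (the parameter values below are ours):

```python
def poisson_binomial_pmf(ps):
    """Exact pmf of a sum of independent Bern(p_i), by dynamic programming."""
    pmf = [1.0]
    for p in ps:
        pmf = [(pmf[k] if k < len(pmf) else 0.0) * (1 - p)
               + (pmf[k - 1] * p if k >= 1 else 0.0)
               for k in range(len(pmf) + 1)]
    return pmf

def cdf(pmf, k):
    """P(sum <= k)."""
    return sum(pmf[:k + 1])

ps = [0.3, 0.5, 0.6, 0.7, 0.9]          # heterogeneous parameters
n, pbar = len(ps), sum(ps) / len(ps)    # pbar = 0.6, so n * pbar = 3
mixed = poisson_binomial_pmf(ps)
binom = poisson_binomial_pmf([pbar] * n)
assert cdf(mixed, 2) <= cdf(binom, 2)   # first statement, k = 2 <= n*pbar - 1
assert cdf(mixed, 3) >= cdf(binom, 3)   # second statement, k = 3 >= n*pbar
```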

For d4/ε2d\geq 4/\varepsilon^{2}, based on the second statement of Lemma B.1, the quantity in (26) satisfies

rB(𝒫,n,μ,L)(𝖡(d,1+α2ε2d)1+α2d),\displaystyle r_{\text{B}}({\mathcal{P}},n,\mu,L)\geq\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right), (28)

with αd1j=1dαj\alpha\triangleq d^{-1}\sum_{j=1}^{d}\alpha_{j}. Similarly, based on the first statement of Lemma B.1, for (27) we have

rB(𝒫,n+m,μ,L)(𝖡(d,1+α2+ε2d)1+α2d).\displaystyle r_{\text{B}}({\mathcal{P}},n+m,\mu,L)\leq\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right). (29)

Since

d(𝖡(n,x)=k)dx=n((𝖡(n1,x)=k1)(𝖡(n1,x)=k)),\displaystyle\frac{\mathrm{d}\mathbb{P}(\mathsf{B}(n,x)=k)}{\mathrm{d}x}=n\left(\mathbb{P}(\mathsf{B}(n-1,x)=k-1)-\mathbb{P}(\mathsf{B}(n-1,x)=k)\right),

we invoke Lemma 6.1 and lower bound the Bayes risk difference as

ε(𝒫,n,m)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m) rB(𝒫,n,μ,L)rB(𝒫,n+m,μ,L)\displaystyle\geq r_{\text{B}}({\mathcal{P}},n,\mu,L)-r_{\text{B}}({\mathcal{P}},n+m,\mu,L)
(𝖡(d,1+α2ε2d)1+α2d)(𝖡(d,1+α2+ε2d)1+α2d)\displaystyle\geq\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right)-\mathbb{P}\left(\mathsf{B}\left(d,\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{1+\alpha}{2}\cdot d\right)
=0k(1+α)d/21+α2ε2d1+α2+ε2dd(𝖡(d,x)=k)dxdx\displaystyle=-\sum_{0\leq k\leq(1+\alpha)d/2}\int_{\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}}\frac{\mathrm{d}\mathbb{P}(\mathsf{B}(d,x)=k)}{\mathrm{d}x}\mathrm{d}x
=d1+α2ε2d1+α2+ε2d(𝖡(d1,x)=1+α2d)dx\displaystyle=d\int_{\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}}\mathbb{P}\left(\mathsf{B}(d-1,x)=\left\lfloor\frac{1+\alpha}{2}\cdot d\right\rfloor\right)\mathrm{d}x
(a)d1+α2ε2d1+α2+ε2dc(α¯,α¯)ddx=c(α¯,α¯)ε,\displaystyle\overset{\rm(a)}{\geq}d\int_{\frac{1+\alpha}{2}-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{1+\alpha}{2}+\frac{\varepsilon}{2\sqrt{d}}}\frac{c(\underline{\alpha},\overline{\alpha})}{\sqrt{d}}\mathrm{d}x=c(\underline{\alpha},\overline{\alpha})\varepsilon,

where (a) is due to

minp0pp1,|knp|Cn(𝖡(n,p)=k)=Ωp0,p1,C(1n)\displaystyle\min_{p_{0}\leq p\leq p_{1},|k-np|\leq C\sqrt{n}}\mathbb{P}(\mathsf{B}(n,p)=k)=\Omega_{p_{0},p_{1},C}\left(\frac{1}{\sqrt{n}}\right) (30)

for any p0,p1(0,1)p_{0},p_{1}\in(0,1) and C>0C>0, by Stirling’s approximation.

For d<4/ε2d<4/\varepsilon^{2}, we first note the following identity:

ddx|x=0(i=1n𝖡𝖾𝗋𝗇(pi+x)=k)\displaystyle\frac{\mathrm{d}}{\mathrm{d}x}\bigg{|}_{x=0}\mathbb{P}\left(\sum_{i=1}^{n}\mathsf{Bern}(p_{i}+x)=k\right)
=ddx|x=0w{0,1}n:i=1nwi=ki=1n(pi+x)wi(1pix)1wi\displaystyle=\frac{\mathrm{d}}{\mathrm{d}x}\bigg{|}_{x=0}\sum_{w\in\{0,1\}^{n}:\sum_{i=1}^{n}w_{i}=k}\prod_{i=1}^{n}(p_{i}+x)^{w_{i}}(1-p_{i}-x)^{1-w_{i}}
=i=1n(w\i{0,1}n1:jiwj=k1jipjwj(1pj)1wjw\i{0,1}n1:jiwj=kjipjwj(1pj)1wj)\displaystyle=\sum_{i=1}^{n}\left(\sum_{w_{\backslash i}\in\{0,1\}^{n-1}:\sum_{j\neq i}w_{j}=k-1}\prod_{j\neq i}p_{j}^{w_{j}}(1-p_{j})^{1-w_{j}}-\sum_{w_{\backslash i}\in\{0,1\}^{n-1}:\sum_{j\neq i}w_{j}=k}\prod_{j\neq i}p_{j}^{w_{j}}(1-p_{j})^{1-w_{j}}\right)
=i=1n[(ji𝖡𝖾𝗋𝗇(pj)=k1)(ji𝖡𝖾𝗋𝗇(pj)=k)].\displaystyle=\sum_{i=1}^{n}\left[\mathbb{P}\left(\sum_{j\neq i}\mathsf{Bern}(p_{j})=k-1\right)-\mathbb{P}\left(\sum_{j\neq i}\mathsf{Bern}(p_{j})=k\right)\right].
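This identity can be checked by central finite differences against the exact distribution of a sum of independent Bernoullis (the DP helper and toy parameters are ours):

```python
def poisson_binomial_pmf(ps):
    """Exact pmf of a sum of independent Bern(p_i), by dynamic programming."""
    pmf = [1.0]
    for p in ps:
        pmf = [(pmf[k] if k < len(pmf) else 0.0) * (1 - p)
               + (pmf[k - 1] * p if k >= 1 else 0.0)
               for k in range(len(pmf) + 1)]
    return pmf

ps = [0.55, 0.6, 0.65, 0.7]
k, h = 2, 1e-6
# left-hand side: central finite difference of P(sum Bern(p_i + x) = k) at x = 0
num = (poisson_binomial_pmf([p + h for p in ps])[k]
       - poisson_binomial_pmf([p - h for p in ps])[k]) / (2 * h)
# right-hand side: sum_i [P(sum_{j != i} = k - 1) - P(sum_{j != i} = k)]
rhs = sum(poisson_binomial_pmf(ps[:i] + ps[i + 1:])[k - 1]
          - poisson_binomial_pmf(ps[:i] + ps[i + 1:])[k]
          for i in range(len(ps)))
assert abs(num - rhs) < 1e-6
```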

Based on (26) and (27), we then have

ε(𝒫,n,m)rB(𝒫,n,μ,L)rB(𝒫,n+m,μ,L)\displaystyle\varepsilon^{\star}({\mathcal{P}},n,m)\geq r_{\text{B}}({\mathcal{P}},n,\mu,L)-r_{\text{B}}({\mathcal{P}},n+m,\mu,L)
(j=1d𝖡𝖾𝗋𝗇(1+αj2ε2d)(1+α)d2)(j=1d𝖡𝖾𝗋𝗇(1+αj2+ε2d)(1+α)d2)\displaystyle\geq\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}-\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{(1+\alpha)d}{2}\right)-\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+\frac{\varepsilon}{2\sqrt{d}}\right)\leq\frac{(1+\alpha)d}{2}\right)
=0k(1+α)d/2ε2dε2dddx(j=1d𝖡𝖾𝗋𝗇(1+αj2+x)=k)dx\displaystyle=-\sum_{0\leq k\leq(1+\alpha)d/2}\int_{-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{\varepsilon}{2\sqrt{d}}}\frac{\mathrm{d}}{\mathrm{d}x}\mathbb{P}\left(\sum_{j=1}^{d}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+x\right)=k\right)\mathrm{d}x
=i=1dε2dε2d(ji𝖡𝖾𝗋𝗇(1+αj2+x)=(1+α)d2)dx.\displaystyle=\sum_{i=1}^{d}\int_{-\frac{\varepsilon}{2\sqrt{d}}}^{\frac{\varepsilon}{2\sqrt{d}}}\mathbb{P}\left(\sum_{j\neq i}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+x\right)=\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor\right)\mathrm{d}x.

We will show that the integrand is uniformly of the order Ω(1/d)\Omega(1/\sqrt{d}). To this end, note that for |x|ε/(2d)<1/d|x|\leq\varepsilon/(2\sqrt{d})<1/d as d<4/ε2d<4/\varepsilon^{2}, it holds that

ji(1+αj2+x)\displaystyle\sum_{j\neq i}\left(\frac{1+\alpha_{j}}{2}+x\right) <j=1d1+αj2+1(1+α)d2+2,\displaystyle<\sum_{j=1}^{d}\frac{1+\alpha_{j}}{2}+1\leq\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor+2,
ji(1+αj2+x)\displaystyle\sum_{j\neq i}\left(\frac{1+\alpha_{j}}{2}+x\right) >j=1d1+αj211(1+α)d22.\displaystyle>\sum_{j=1}^{d}\frac{1+\alpha_{j}}{2}-1-1\geq\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor-2.

Consequently, by the last statement of Lemma B.1, one has

(|ji𝖡𝖾𝗋𝗇(1+αj2+x)(1+α)d2|2)\displaystyle\mathbb{P}\left(\left|\sum_{j\neq i}\mathsf{Bern}\left(\frac{1+\alpha_{j}}{2}+x\right)-\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor\right|\leq 2\right)
(|𝖡(d1,1d1ji(1+αj2+x))(1+α)d2|2)=Ωα¯,α¯(1d),\displaystyle\geq\mathbb{P}\left(\left|\mathsf{B}\left(d-1,\frac{1}{d-1}\sum_{j\neq i}\left(\frac{1+\alpha_{j}}{2}+x\right)\right)-\left\lfloor\frac{(1+\alpha)d}{2}\right\rfloor\right|\leq 2\right)=\Omega_{\underline{\alpha},\overline{\alpha}}\left(\frac{1}{\sqrt{d}}\right),

where the last step is again due to (30). The above display is a lower bound for the probability of a size-55 set, and in view of the following lemma, the same Ωα¯,α¯(1/d)\Omega_{\underline{\alpha},\overline{\alpha}}(1/\sqrt{d}) lower bound also holds for the probability of any singleton. Plugging this lower bound back into the integral then yields ε(𝒫,n,m)=Ωα¯,α¯(ε)\varepsilon^{\star}({\mathcal{P}},n,m)=\Omega_{\underline{\alpha},\overline{\alpha}}(\varepsilon), as desired.

Lemma B.2.

Let p1,,pn[a,b]p_{1},\cdots,p_{n}\in[a,b] with 0<ab<10<a\leq b<1, and cnk<k+1(1c)ncn\leq k<k+1\leq(1-c)n for some c>0c>0. Define

f(k)=(i=1n𝖡𝖾𝗋𝗇(pi)=k).\displaystyle f(k)=\mathbb{P}\left(\sum_{i=1}^{n}\mathsf{Bern}(p_{i})=k\right).

Then there exists an absolute constant C=C(a,b,c)<C=C(a,b,c)<\infty such that

C1f(k+1)f(k)C.\displaystyle C^{-1}\leq\frac{f(k+1)}{f(k)}\leq C.
Proof.

Let Wk={w{0,1}n:i=1nwi=k}W_{k}=\{w\in\{0,1\}^{n}:\sum_{i=1}^{n}w_{i}=k\}, and g(w)=i=1npiwi(1pi)1wig(w)=\prod_{i=1}^{n}p_{i}^{w_{i}}(1-p_{i})^{1-w_{i}}. Then it is clear that

f(k)=wWkg(w).\displaystyle f(k)=\sum_{w\in W_{k}}g(w).

Call two binary vectors ww and ww^{\prime} neighbors (denoted by www\sim w^{\prime}) if they differ in exactly one coordinate. It is clear that every wWkw\in W_{k} has nkn-k neighbors in Wk+1W_{k+1}, and every wWk+1w\in W_{k+1} has k+1k+1 neighbors in WkW_{k}. Moreover, for www\sim w^{\prime},

g(w)g(w)max{ba,1a1b}=:ρ.\displaystyle\frac{g(w)}{g(w^{\prime})}\leq\max\left\{\frac{b}{a},\frac{1-a}{1-b}\right\}=:\rho.

Consequently, by double counting,

f(k+1)\displaystyle f(k+1) =wWk+1g(w)=1k+1wWkwWk+1:wwg(w)\displaystyle=\sum_{w\in W_{k+1}}g(w)=\frac{1}{k+1}\sum_{w^{\prime}\in W_{k}}\sum_{w\in W_{k+1}:w\sim w^{\prime}}g(w)
ρk+1wWkwWk+1:wwg(w)=(nk)ρk+1wWkg(w)\displaystyle\leq\frac{\rho}{k+1}\sum_{w^{\prime}\in W_{k}}\sum_{w\in W_{k+1}:w\sim w^{\prime}}g(w^{\prime})=\frac{(n-k)\rho}{k+1}\sum_{w^{\prime}\in W_{k}}g(w^{\prime})
=(nk)ρk+1f(k)(1c)ρcf(k).\displaystyle=\frac{(n-k)\rho}{k+1}f(k)\leq\frac{(1-c)\rho}{c}f(k).

The other inequality can be established analogously. ∎
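The two-sided ratio bound of Lemma B.2 can likewise be verified exactly on a small instance (the parameter values are illustrative; the DP recomputes the Poisson-binomial pmf):

```python
def poisson_binomial_pmf(ps):
    """Exact pmf of a sum of independent Bern(p_i), by dynamic programming."""
    pmf = [1.0]
    for p in ps:
        pmf = [(pmf[k] if k < len(pmf) else 0.0) * (1 - p)
               + (pmf[k - 1] * p if k >= 1 else 0.0)
               for k in range(len(pmf) + 1)]
    return pmf

a, b, c = 0.3, 0.7, 0.25
ps = [0.3, 0.4, 0.45, 0.55, 0.6, 0.65, 0.7, 0.35]   # all within [a, b]
n = len(ps)
rho = max(b / a, (1 - a) / (1 - b))
C = (1 - c) * rho / c                   # the constant from the proof
f = poisson_binomial_pmf(ps)
for k in range(int(c * n), n - int(c * n)):   # c*n <= k < k+1 <= (1-c)*n
    assert 1 / C <= f[k + 1] / f[k] <= C
```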

B.8 Proof of Theorem 6.5

For each j[d]j\in[d], we pick two points θj,+,θj,Θj\theta_{j,+},\theta_{j,-}\in\Theta_{j} in Assumption 4. We first aim to show that

0.09ε1pθj,+npθj,nTV\displaystyle 0.09\triangleq\varepsilon_{1}^{\prime}\leq\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}} ε10.6,\displaystyle\leq\varepsilon_{1}\triangleq 0.6, (31)
0.99995ε2pθj,+20npθj,20nTV\displaystyle 0.99995\triangleq\varepsilon_{2}^{\prime}\geq\|p_{\theta_{j,+}}^{\otimes 20n}-p_{\theta_{j,-}}^{\otimes 20n}\|_{\text{TV}} ε20.86.\displaystyle\geq\varepsilon_{2}\triangleq 0.86. (32)

The proofs of (31) and (32) rely on the tensorization property of the Hellinger distance

H2(i=1dPi,i=1dQi)=1i=1d(1H2(Pi,Qi)),\displaystyle H^{2}\left(\prod_{i=1}^{d}P_{i},\prod_{i=1}^{d}Q_{i}\right)=1-\prod_{i=1}^{d}\left(1-H^{2}(P_{i},Q_{i})\right),

and the relationship between TV and Hellinger distance in (2). For example, for (31), we have

pθj,+npθj,nTV\displaystyle\|p_{\theta_{j,+}}^{\otimes n}-p_{\theta_{j,-}}^{\otimes n}\|_{\text{TV}} 1(1H2(pθj,+n,pθj,n))2\displaystyle\leq\sqrt{1-(1-H^{2}(p_{\theta_{j,+}}^{\otimes n},p_{\theta_{j,-}}^{\otimes n}))^{2}}
=1(1H2(pθj,+,pθj,))2n\displaystyle=\sqrt{1-\left(1-H^{2}(p_{\theta_{j,+}},p_{\theta_{j,-}})\right)^{2n}}
1(115n)2n\displaystyle\leq\sqrt{1-\left(1-\frac{1}{5n}\right)^{2n}}
925=0.6.\displaystyle\leq\sqrt{\frac{9}{25}}=0.6.

Applying the other inequality, for (32) we have

pθj,+20npθj,20nTV\displaystyle\|p_{\theta_{j,+}}^{\otimes 20n}-p_{\theta_{j,-}}^{\otimes 20n}\|_{\text{TV}} H2(pθj,+20n,pθj,20n)\displaystyle\geq H^{2}(p_{\theta_{j,+}}^{\otimes 20n},p_{\theta_{j,-}}^{\otimes 20n})
1(1110n)20n\displaystyle\geq 1-\left(1-\frac{1}{10n}\right)^{20n}
1exp(2)>0.86.\displaystyle\geq 1-\exp(-2)>0.86.

The other inequalities involving ε1\varepsilon_{1}^{\prime} and ε2\varepsilon_{2}^{\prime} can be established analogously.
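The numerical constants in (31) and (32) can be confirmed directly, assuming H2(pθj,+,pθj,)1/(5n)H^{2}(p_{\theta_{j,+}},p_{\theta_{j,-}})\leq 1/(5n) for (31) and 1/(10n)\geq 1/(10n) for (32), as in the two displays above:

```python
def tv_upper(n):
    """Upper bound in (31), from H^2(p_+, p_-) <= 1/(5n)."""
    return (1 - (1 - 1 / (5 * n)) ** (2 * n)) ** 0.5

def tv_lower(n):
    """Lower bound in (32), from H^2(p_+, p_-) >= 1/(10n)."""
    return 1 - (1 - 1 / (10 * n)) ** (20 * n)

# the first bound is largest at n = 1, where it equals sqrt(9/25) = 0.6;
# the second decreases towards 1 - exp(-2) > 0.86
for n in range(1, 200):
    assert tv_upper(n) <= 0.6 + 1e-12
    assert tv_lower(n) >= 0.86
```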

Next, for the choice of m=cεn/dm=\lceil c\varepsilon n/\sqrt{d}\rceil in the statement of Theorem 6.5, we show that there exists nj[n,20n]n_{j}\in[n,20n] such that

pθj,+(nj+m)pθj,(nj+m)TVpθj,+njpθj,njTVε2ε119d/(cε).\displaystyle\|p_{\theta_{j,+}}^{\otimes(n_{j}+m)}-p_{\theta_{j,-}}^{\otimes(n_{j}+m)}\|_{\text{TV}}-\|p_{\theta_{j,+}}^{\otimes n_{j}}-p_{\theta_{j,-}}^{\otimes n_{j}}\|_{\text{TV}}\geq\frac{\varepsilon_{2}-\varepsilon_{1}}{\lceil 19\sqrt{d}/(c\varepsilon)\rceil}. (33)

To prove (33), first note that tfj(t)pθj,+tpθj,tTVt\mapsto f_{j}(t)\triangleq\|p_{\theta_{j,+}}^{\otimes t}-p_{\theta_{j,-}}^{\otimes t}\|_{\text{TV}} is non-decreasing by the data-processing property of the TV distance. Moreover, by (31) and (32), we have fj(n)ε1f_{j}(n)\leq\varepsilon_{1} and fj(20n)ε2f_{j}(20n)\geq\varepsilon_{2}. Consequently,

ε2ε1\displaystyle\varepsilon_{2}-\varepsilon_{1} fj(20n)fj(n)\displaystyle\leq f_{j}(20n)-f_{j}(n)
k=119n/m1[fj(n+km)fj(n+(k1)m)]+[fj(20n)fj(20nm)]\displaystyle\leq\sum_{k=1}^{\lceil 19n/m\rceil-1}[f_{j}(n+km)-f_{j}(n+(k-1)m)]+[f_{j}(20n)-f_{j}(20n-m)]
19nmmaxnnj20nm[fj(nj+m)fj(nj)],\displaystyle\leq\left\lceil\frac{19n}{m}\right\rceil\cdot\max_{n\leq n_{j}\leq 20n-m}[f_{j}(n_{j}+m)-f_{j}(n_{j})],

which gives (33). In addition, we also have ε1fj(nj)fj(nj+m)ε2\varepsilon_{1}^{\prime}\leq f_{j}(n_{j})\leq f_{j}(n_{j}+m)\leq\varepsilon_{2}^{\prime}.

We are now ready to apply Theorem 6.4. By (33), there exist αj(ε1,ε2)\alpha_{j}\in(\varepsilon_{1}^{\prime},\varepsilon_{2}^{\prime}) such that

pθj,+njpθj,njTV\displaystyle\|p_{\theta_{j,+}}^{\otimes n_{j}}-p_{\theta_{j,-}}^{\otimes n_{j}}\|_{\text{TV}} αjΩc(εd),\displaystyle\leq\alpha_{j}-\Omega_{c}\left(\frac{\varepsilon}{\sqrt{d}}\right),
pθj,+(nj+m)pθj,(nj+m)TV\displaystyle\|p_{\theta_{j,+}}^{\otimes(n_{j}+m)}-p_{\theta_{j,-}}^{\otimes(n_{j}+m)}\|_{\text{TV}} αj+Ωc(εd).\displaystyle\geq\alpha_{j}+\Omega_{c}\left(\frac{\varepsilon}{\sqrt{d}}\right).

Therefore, Theorem 6.4 shows that there is a constant c>0c^{\prime}>0 depending only on (c,ε1,ε2)(c,\varepsilon_{1}^{\prime},\varepsilon_{2}^{\prime}) such that

ε(𝒫,(n1,,nd),m)cε,\displaystyle\varepsilon^{\star}({\mathcal{P}},(n_{1},\cdots,n_{d}),m)\geq c^{\prime}\varepsilon,

where the above quantity denotes the minimax error in a new sample amplification problem: suppose we draw njn_{j} independent samples from 𝒫j{\mathcal{P}}_{j}, also independently for each j[d]j\in[d], and we aim to amplify into nj+mn_{j}+m independent samples from 𝒫j{\mathcal{P}}_{j}. In other words, the sample sizes for different dimensions may not be equal in the new problem, but the target is still to generate mm more samples. We claim that ε(𝒫,(n1,,nd),m)ε(𝒫,n,m)\varepsilon^{\star}({\mathcal{P}},(n_{1},\cdots,n_{d}),m)\leq\varepsilon^{\star}({\mathcal{P}},n,m), and thereby complete the proof. To show the claim, note that njnn_{j}\geq n for all j[d]j\in[d], hence in the new problem we can keep njnn_{j}-n samples unused for each jj, use the remaining samples to amplify into n+mn+m vectors, and add the above unused samples back to form the final amplification.

B.9 Proof of Theorem 7.1

We first prove the upper bound. Consider the distribution estimator P^n=(1,0,,0)\widehat{P}_{n}=(1,0,\cdots,0), which has a χ2\chi^{2}-divergence

χ2(P^nP)=1t1,P𝒫d,t.\displaystyle\chi^{2}(\widehat{P}_{n}\|P)=\frac{1}{t}-1,\qquad\forall P\in{\mathcal{P}}_{d,t}.

Consequently, the χ2\chi^{2}-estimation error rχ2(𝒫d,t,n)r_{\chi^{2}}({\mathcal{P}}_{d,t},n) is at most 1/t1/t, and Theorem 5.2 states that the random shuffling approach achieves an (n,n+1,0.1)(n,n+1,0.1) sample amplification if n=Θ(1/t)n=\Theta(1/t).
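The χ²-divergence of this trivial estimator can be confirmed numerically for a representative member of 𝒫d,t{\mathcal{P}}_{d,t} (the specific support layout below is our illustration):

```python
def chi2_div(p_hat, p):
    """chi^2(p_hat, p) = sum_i (p_hat_i - p_i)^2 / p_i over the support of p."""
    return sum((a - b) ** 2 / b for a, b in zip(p_hat, p) if b > 0)

t, support_size = 0.01, 5
# one member of the class: mass t on symbol 0, the rest uniform on 5 symbols
P = [t] + [(1 - t) / support_size] * support_size
P_hat = [1.0] + [0.0] * support_size    # the trivial estimator (1, 0, ..., 0)
assert abs(chi2_div(P_hat, P) - (1 / t - 1)) < 1e-9
```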

Next we prove the lower bound. Let n=1/(100t)n=1/(100t) and dd be a multiple of 100100. Consider the following prior μ\mu over 𝒫d,t{\mathcal{P}}_{d,t}: μ\mu is the uniform prior over (p0,,pd)𝒫d,t(p_{0},\cdots,p_{d})\in{\mathcal{P}}_{d,t} with p0=tp_{0}=t and the remaining 1t1-t mass evenly distributed on a uniformly random subset of [d][d] of size d/100d/100. The action space 𝒜{\mathcal{A}} is chosen to be 𝒳n+1{\mathcal{X}}^{n+1}, and the loss function is

L(P,xn+1)=1𝟙(xn+1 belongs to the support of P and contains neither the symbol 0 nor repeated symbols).\displaystyle L(P,x^{n+1})=1-\mathbbm{1}(x^{n+1}\text{ belongs to the support of }P\text{ and contains neither the symbol }0\text{ nor repeated symbols}).

We first show that rB(𝒫d,t,n+1,μ)0.1r_{\text{B}}({\mathcal{P}}_{d,t},n+1,\mu)\leq 0.1. In fact, after observing n+1n+1 samples Xn+1X^{n+1}, we simply use Xn+1X^{n+1} as the estimator under the above loss. Clearly Xn+1X^{n+1} belongs to the support of PP. For the remaining conditions,

(Xn+1 contains symbol 0)\displaystyle\mathbb{P}(X^{n+1}\text{ contains symbol }0) =1(1t)n+1=1(1t)1/(100t)+10.05,\displaystyle=1-(1-t)^{n+1}=1-(1-t)^{1/(100t)+1}\leq 0.05,
(Xn+1 contains repeated symbols)\displaystyle\mathbb{P}(X^{n+1}\text{ contains repeated symbols}) (n+12)(t2+j=1d/1001(d/100)2)\displaystyle\leq\binom{n+1}{2}\left(t^{2}+\sum_{j=1}^{d/100}\frac{1}{(d/100)^{2}}\right)
n2(t2+100d)110000+1100<0.05.\displaystyle\leq n^{2}\left(t^{2}+\frac{100}{d}\right)\leq\frac{1}{10000}+\frac{1}{100}<0.05.

By the union bound, this estimator achieves a Bayes risk at most 0.10.1, and thus rB(𝒫d,t,n+1,μ)0.1r_{\text{B}}({\mathcal{P}}_{d,t},n+1,\mu)\leq 0.1.
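The two probability bounds above can also be checked by a quick Monte Carlo sketch. The parameters below are illustrative choices, not prescribed by the proof: t = 0.001 (so that n = 1/(100t) = 10) and d = 10^6, with the random support fixed by symmetry to the first d/100 non-zero symbols.

```python
import random

# Illustrative parameters: t = 0.001 gives n = 1/(100 t) = 10; d = 10^6 is a
# multiple of 100, and by symmetry we fix the support to {0} U {1, ..., d/100}.
random.seed(0)
t, n, d = 1e-3, 10, 10**6
support_size = d // 100      # 10^4 non-zero symbols, each carrying mass (1 - t)/10^4
trials = 20000

zero_hits = repeat_hits = 0
for _ in range(trials):
    sample = [0 if random.random() < t else random.randrange(1, support_size + 1)
              for _ in range(n + 1)]
    zero_hits += 0 in sample
    repeat_hits += len(set(sample)) < n + 1

p_zero = zero_hits / trials
p_repeat = repeat_hits / trials
print(p_zero, p_repeat)   # both empirical frequencies stay well below 0.05
```

Both frequencies come out comfortably below the 0.05 thresholds entering the union bound.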

Next we show that rB(𝒫d,t,n,μ)0.9r_{\text{B}}({\mathcal{P}}_{d,t},n,\mu)\geq 0.9, which combined with Lemma 6.1 gives the desired lower bound. To this end, suppose the learner observes XnX^{n}, and consider a symbol in the estimator xn+1x^{n+1} that does not appear in XnX^{n}: if xn+1x^{n+1} contains no repeated symbols, then since it has length n+1>nn+1>n, at least one such new symbol exists. In order to have L(P,xn+1)=0L(P,x^{n+1})=0, this new symbol can neither be 00 nor appear in XnX^{n}. Moreover, given XnX^{n}, the posterior distribution of the support of PμP\sim\mu is {0,X1,,Xn}S\{0,X_{1},\cdots,X_{n}\}\cup S, where SS is uniformly distributed over

{S[d]\{X1,,Xn}:|S|=d100n}.\displaystyle\left\{S\subseteq[d]\backslash\{X_{1},\cdots,X_{n}\}:|S|=\frac{d}{100}-n\right\}.

Therefore, the posterior probability that the new symbol falls outside the support of PP (recall that it must be neither one of {X1,,Xn}\{X_{1},\cdots,X_{n}\} nor 00) is at least

1d/100ndn=0.99ddn0.99>0.9,\displaystyle 1-\frac{d/100-n}{d-n}=\frac{0.99d}{d-n}\geq 0.99>0.9,

giving the desired inequality rB(𝒫d,t,n,μ)0.9r_{\text{B}}({\mathcal{P}}_{d,t},n,\mu)\geq 0.9.

B.10 Proof of Theorem 7.2

The upper bound for sample amplification directly follows from that for learning, and it remains to show the lower bound ndn\geq d. If nd1n\leq d-1, then with probability one the observations XnX^{n} span an nn-dimensional subspace of the row space of Σ\Sigma. An (n,n+1,0.1)(n,n+1,0.1) sample amplification calls for at least one additional observation not in XnX^{n}, and for nd1n\leq d-1 this additional observation must, with probability 11, lie outside the nn-dimensional subspace spanned by XnX^{n}. However, since pd+1p\geq d+1, under a uniformly chosen dd-dimensional row space of Σ\Sigma, the posterior probability that the additional observation belongs to the row space of Σ\Sigma is zero. Consequently, an (n,n+1,0.1)(n,n+1,0.1) sample amplification is impossible if n<dn<d.

B.11 Proof of Theorem 7.3

We prove the three claims separately.

The proof of m(𝒫c,n)cn5/6m^{\star}({\mathcal{P}}_{c},n)\gtrsim_{c}n^{5/6}. Classical theory of nonparametric estimation (see, e.g. [Tsy09]) shows that there exists a density estimator f^\widehat{f} such that 𝔼ff^f22n2/3\mathbb{E}_{f}\|\widehat{f}-f\|_{2}^{2}\lesssim n^{-2/3}. Since ff is lower bounded by cc, this implies that

χ2(f^,f)f^f22ccn2/3.\displaystyle\chi^{2}(\widehat{f},f)\leq\frac{\|\widehat{f}-f\|_{2}^{2}}{c}\lesssim_{c}n^{-2/3}.

Consequently, rχ2(𝒫c,n)n2/3r_{\chi^{2}}({\mathcal{P}}_{c},n)\lesssim n^{-2/3}, and Theorem 5.2 implies that m(𝒫c,n)cn5/6m^{\star}({\mathcal{P}}_{c},n)\gtrsim_{c}n^{5/6}.

The proof of m(𝒫c,n)cn5/6m^{\star}({\mathcal{P}}_{c},n)\lesssim_{c}n^{5/6}. We construct a parametric subfamily of 𝒫c{\mathcal{P}}_{c} and invoke Theorem 6.5. Let gg be a 11-Lipschitz function supported on [0,1][0,1] with 01g(x)𝑑x=0\int_{0}^{1}g(x)dx=0 and g2>0\|g\|_{2}>0; in particular g1\|g\|_{\infty}\leq 1. Let h=n1/3h=n^{-1/3} and assume that M:=h1M:=h^{-1} is an integer. For u=(u1,,uM){±1}Mu=(u_{1},\cdots,u_{M})\in\{\pm 1\}^{M}, define

fu(x)=1+c0i=1Muihg(x(i1)hh),\displaystyle f_{u}(x)=1+c_{0}\sum_{i=1}^{M}u_{i}hg\left(\frac{x-(i-1)h}{h}\right),

where c0(0,1)c_{0}\in(0,1) is a small constant satisfying c0h1cc_{0}h\leq 1-c. Consequently, fu𝒫cf_{u}\in{\mathcal{P}}_{c} for every u{±1}Mu\in\{\pm 1\}^{M}. However, the density estimation model X1,,XnfuX_{1},\cdots,X_{n}\sim f_{u} is not a product model, so Theorem 6.5 cannot be directly applied.

To overcome the above issue, we consider a Poissonized model as follows: first we draw NPoi(n)N\sim\text{Poi}(n), and then draw NN i.i.d. samples X1,,XNfuX_{1},\cdots,X_{N}\sim f_{u}. The Poissonized model satisfies the following two properties:

  1. 1.

    For any measurable set A[0,1]A\subseteq[0,1], we have

    M(A):=|{i[N]:XiA}|Poi(nAf(x)dx).\displaystyle M(A):=|\{i\in[N]:X_{i}\in A\}|\sim\text{Poi}\left(n\int_{A}f(x)\mathrm{d}x\right).
  2. 2.

    For any collection of disjoint subsets {Ai,i1}\{A_{i},i\geq 1\}, the random variables {M(Ai),i1}\{M(A_{i}),i\geq 1\} are mutually independent.
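These two properties are easy to visualize empirically. The sketch below (with the uniform density on [0,1] and illustrative values n = 100 and M = 4 bins, not taken from the text) contrasts the Poissonized bin counts, which are independent, with the fixed-sample-size multinomial counts, which are negatively correlated:

```python
import math
import random

random.seed(1)

def poisson(lam):
    # Knuth's method; adequate for moderate lam.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

n, M, trials = 100, 4, 20000
pois_counts, fixed_counts = [], []
for _ in range(trials):
    # Poissonized model: N ~ Poi(n) uniform points, bucketed into M equal bins.
    c = [0] * M
    for _ in range(poisson(n)):
        c[random.randrange(M)] += 1
    pois_counts.append(c)
    # Fixed-sample-size model: exactly n uniform points.
    c = [0] * M
    for _ in range(n):
        c[random.randrange(M)] += 1
    fixed_counts.append(c)

def cov01(samples):
    # Empirical covariance between the counts of the first two bins.
    x = [s[0] for s in samples]
    y = [s[1] for s in samples]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

cov_pois, cov_fixed = cov01(pois_counts), cov01(fixed_counts)
print(cov_pois, cov_fixed)  # near 0 (independence) versus near -n/M^2 = -6.25
```

The Poissonized covariance hovers around zero, reflecting Property 2, while the fixed-n covariance sits near the multinomial value -n/M².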

For i[M]i\in[M], let Ai=[(i1)/M,i/M)A_{i}=[(i-1)/M,i/M), so that Aifu(x)dx=1/M\int_{A_{i}}f_{u}(x)\mathrm{d}x=1/M for every u{±1}Mu\in\{\pm 1\}^{M}. Clearly, there is a one-to-one correspondence between (X1,,XN)(X_{1},\cdots,X_{N}) and (Y1,,YM)(Y_{1},\cdots,Y_{M}), where YiY_{i} is the collection of observations in X1,,XNX_{1},\cdots,X_{N} that fall into the set AiA_{i}. By the above two properties, (Y1,,YM)(Y_{1},\cdots,Y_{M}) are mutually independent, and Yifi,uinY_{i}\sim f_{i,u_{i}}^{\otimes n} under fuf_{u}. Here fi,uif_{i,u_{i}} is the probability distribution of the following process: sample NiPoi(1/M)N_{i}\sim\text{Poi}(1/M), and draw NiN_{i} i.i.d. samples from the density

M(1+c0uihg(x(i1)hh))\displaystyle M\left(1+c_{0}u_{i}hg\left(\frac{x-(i-1)h}{h}\right)\right)

supported on AiA_{i}. In addition,

H2(fi,+1,fi,1)𝔼[Ni]Mc02h3g221n,\displaystyle H^{2}(f_{i,+1},f_{i,-1})\asymp\mathbb{E}[N_{i}]\cdot Mc_{0}^{2}h^{3}\|g\|_{2}^{2}\asymp\frac{1}{n},

so the Poissonized model is a product model which satisfies the prerequisite of Theorem 6.5.

Next we denote the i.i.d. sampling model by 𝒫cn{\mathcal{P}}_{c}^{\otimes n}, and the Poissonized model by 𝒫cPoi(n){\mathcal{P}}_{c}^{\text{Poi}(n)}. Let n1=n+Cn,n2=n+mCn+mn_{1}=n+C\sqrt{n},n_{2}=n+m-C\sqrt{n+m}, where C>0C>0 is a large universal constant to be chosen later. By Theorem 6.5 applied to d=Mn1/3d=M\asymp n^{1/3}, Le Cam’s distance in Definition 3.1 between Poissonized models satisfies

Δ(𝒫cPoi(n1),𝒫cPoi(n2))=Ω(1),\displaystyle\Delta({\mathcal{P}}_{c}^{\text{Poi}(n_{1})},{\mathcal{P}}_{c}^{\text{Poi}(n_{2})})=\Omega(1),

as long as m=Ω(n5/6)m=\Omega(n^{5/6}). In addition, the model 𝒫cPoi(n1){\mathcal{P}}_{c}^{\text{Poi}(n_{1})} is more informative than 𝒫cn{\mathcal{P}}_{c}^{\otimes n} with probability at least 1ε11-\varepsilon_{1}, where ε1=(Poi(n1)<n)\varepsilon_{1}=\mathbb{P}(\text{Poi}(n_{1})<n). Similarly, 𝒫c(n+m){\mathcal{P}}_{c}^{\otimes(n+m)} is more informative than 𝒫cPoi(n2){\mathcal{P}}_{c}^{\text{Poi}(n_{2})} with probability at least 1ε21-\varepsilon_{2}, where ε2=(Poi(n2)>n+m)\varepsilon_{2}=\mathbb{P}(\text{Poi}(n_{2})>n+m). Therefore,

Δ(𝒫cn,𝒫c(m+n))Δ(𝒫cPoi(n1),𝒫cPoi(n2))ε1ε2=Ω(1),\displaystyle\Delta({\mathcal{P}}_{c}^{\otimes n},{\mathcal{P}}_{c}^{\otimes(m+n)})\geq\Delta({\mathcal{P}}_{c}^{\text{Poi}(n_{1})},{\mathcal{P}}_{c}^{\text{Poi}(n_{2})})-\varepsilon_{1}-\varepsilon_{2}=\Omega(1),

where in the last step we have ε1,ε20\varepsilon_{1},\varepsilon_{2}\to 0 by choosing C>0C>0 large enough.

The proof of m(𝒫,n)n3/4m^{\star}({\mathcal{P}},n)\lesssim n^{3/4}. Suppose nmCn3/4n\geq m\geq Cn^{3/4} for a large constant C>0C>0. Consider the following prior μ\mu on the density ff: let M=nM=\sqrt{n}, and u=(u1,,uM)u=(u_{1},\cdots,u_{M}) be uniformly distributed over all vectors in {0,1}M\{0,1\}^{M} such that the number of 11’s is M/100M/100. Given uu, we construct

fu(x)={ui(2/M4|x(2i1)/2M|)if xAi[(i1)/M,i/M),i[M],(82/(25M))(x1/2)if 1/2x1,\displaystyle f_{u}(x)=\begin{cases}u_{i}(2/M-4|x-(2i-1)/2M|)&\text{if }x\in A_{i}\triangleq[(i-1)/M,i/M),i\in[M],\\ (8-2/(25M))(x-1/2)&\text{if }1/2\leq x\leq 1,\end{cases}

and let f=fuf=f_{u}. It is not hard to verify that each fuf_{u} is a density and 88-Lipschitz, so fu𝒫f_{u}\in{\mathcal{P}}.

Next, in the context of Lemma 6.1, we choose the action space to be [0,1]n+m[0,1]^{n+m}, as well as the following loss:

L(f,xn+m)\displaystyle L(f,x^{n+m}) =1𝟙(xn+m belongs to the support of f,\displaystyle=1-\mathbbm{1}(x^{n+m}\text{ belongs to the support of }f,
 and falls into at least [1(1n1)n]n/100+C0n1/4 sets in {Ai}i=1M),\displaystyle\qquad\text{ and falls into at least }[1-(1-n^{-1})^{n}]\sqrt{n}/100+C_{0}n^{1/4}\text{ sets in }\{A_{i}\}_{i=1}^{M}),

where C0>0C_{0}>0 is a large enough constant. We aim to show that under the loss LL and prior μ\mu, the Bayes risk satisfies rB(𝒫(n+m),μ)0.1r_{\text{B}}({\mathcal{P}}^{\otimes(n+m)},\mu)\leq 0.1 and rB(𝒫n,μ)0.9r_{\text{B}}({\mathcal{P}}^{\otimes n},\mu)\geq 0.9. Then an application of Lemma 6.1 completes the proof of m(𝒫,n)n3/4m^{\star}({\mathcal{P}},n)\lesssim n^{3/4}.

We first show that rB(𝒫(n+m),μ)0.1r_{\text{B}}({\mathcal{P}}^{\otimes(n+m)},\mu)\leq 0.1. We simply use the observed sample Xn+mX^{n+m} as the estimator xn+mx^{n+m} under the above loss; then clearly xn+mx^{n+m} belongs to the support of ff. For the second event in the loss function, let Un+mU_{n+m} be the number of sets in {Ai}i=1M\{A_{i}\}_{i=1}^{M} that receive any observation in Xn+mX^{n+m}. By the linearity of expectation, as well as the negative dependence across different set counts, we compute for every fuf_{u} that

𝔼[Un+m]\displaystyle\mathbb{E}[U_{n+m}] =i=1M/100[1(11M2)n+m]=n100[1(11n)n]+Ω(mn),\displaystyle=\sum_{i=1}^{M/100}\left[1-\left(1-\frac{1}{M^{2}}\right)^{n+m}\right]=\frac{\sqrt{n}}{100}\left[1-\left(1-\frac{1}{n}\right)^{n}\right]+\Omega\left(\frac{m}{\sqrt{n}}\right),
𝖵𝖺𝗋(Un+m)\displaystyle\mathsf{Var}(U_{n+m}) 𝔼[Un+m]=O(n).\displaystyle\leq\mathbb{E}[U_{n+m}]=O(\sqrt{n}).

By Chebyshev’s inequality, for mCn3/4m\geq Cn^{3/4} with a large enough C>0C>0, we have rB(𝒫(n+m),μ)0.1r_{\text{B}}({\mathcal{P}}^{\otimes(n+m)},\mu)\leq 0.1.

Next we show that rB(𝒫n,μ)0.9r_{\text{B}}({\mathcal{P}}^{\otimes n},\mu)\geq 0.9. Let (i1,,ik)[M]k(i_{1},\cdots,i_{k})\in[M]^{k} be the indices such that the set AijA_{i_{j}} is hit by the observations XnX^{n}. Clearly, given XnX^{n}, the posterior distribution of the support of the vector uu is {i1,,ik}S\{i_{1},\cdots,i_{k}\}\cup S, where SS is uniformly distributed over

{S[M]\{i1,,ik}:|S|=M100k}.\displaystyle\left\{S\subseteq[M]\backslash\{i_{1},\cdots,i_{k}\}:|S|=\frac{M}{100}-k\right\}.

On one hand, if the estimator xn+mx^{n+m} hits a set AjA_{j} with j{i1,,ik}j\notin\{i_{1},\cdots,i_{k}\}, the probability that xn+mx^{n+m} does not belong to the support of ff is at least

1M/100kMk=0.99MMk0.99.\displaystyle 1-\frac{M/100-k}{M-k}=\frac{0.99M}{M-k}\geq 0.99.

On the other hand, if the estimator xn+mx^{n+m} never hits a set AjA_{j} with j{i1,,ik}j\notin\{i_{1},\cdots,i_{k}\}, then xn+mx^{n+m} falls into at most UnU_{n} sets in {Ai}i=1M\{A_{i}\}_{i=1}^{M}, where UnU_{n} is defined analogously to Un+mU_{n+m}. Similarly to the previous computation, we have

𝔼[Un]=n100[1(11n)n],𝖵𝖺𝗋(Un)=O(n).\displaystyle\mathbb{E}[U_{n}]=\frac{\sqrt{n}}{100}\left[1-\left(1-\frac{1}{n}\right)^{n}\right],\qquad\mathsf{Var}(U_{n})=O(\sqrt{n}).

By Chebyshev’s inequality, for C0>0C_{0}>0 large enough, the probability that xn+mx^{n+m} violates the second event in the loss function is at least 0.990.99. Now a combination of the above two cases implies that rB(𝒫n,μ)0.9r_{\text{B}}({\mathcal{P}}^{\otimes n},\mu)\geq 0.9, as desired.

Appendix C Proof of main lemmas

C.1 Proof of Lemma 4.4

Recall that the log-partition function A(θ)A(\theta) is defined as

A(θ)=log𝒳exp(θT(x))dμ(x).\displaystyle A(\theta)=\log\int_{{\mathcal{X}}}\exp(\theta^{\top}T(x))\mathrm{d}\mu(x).

Then for any vector λd\lambda\in\mathbb{R}^{d} with θ+(2A(θ))1/2λΘ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda\in\Theta, we have

𝔼θ[exp(λ(2A(θ))1/2(T(X)A(θ)))]\displaystyle\mathbb{E}_{\theta}\left[\exp\left(\lambda^{\top}(\nabla^{2}A(\theta))^{-1/2}(T(X)-\nabla A(\theta))\right)\right]
=𝒳exp((θ+(2A(θ))1/2λ)T(x)A(θ)[(2A(θ))1/2λ]A(θ))dμ(x)\displaystyle=\int_{{\mathcal{X}}}\exp\left((\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda)^{\top}T(x)-A(\theta)-[(\nabla^{2}A(\theta))^{-1/2}\lambda]^{\top}\nabla A(\theta)\right)\mathrm{d}\mu(x)
=exp(A(θ+(2A(θ))1/2λ)A(θ)[(2A(θ))1/2λ]A(θ)).\displaystyle=\exp\left(A(\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda)-A(\theta)-[(\nabla^{2}A(\theta))^{-1/2}\lambda]^{\top}\nabla A(\theta)\right). (34)

It remains to show that when λ2\|\lambda\|_{2} is sufficiently small, we always have θ+(2A(θ))1/2λΘ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda\in\Theta, and the exponent of (34) is uniformly bounded from above over θΘ\theta\in\Theta and λd\lambda\in\mathbb{R}^{d}. Then the existence of a uniformly bounded MGF around zero implies uniformly bounded moments of every order.

The result of [Nes03, Theorem 4.1.6] shows that for a self-concordant and convex function ff, we have 2f(y)42f(x)\nabla^{2}f(y)\preceq 4\nabla^{2}f(x) whenever (yx)2f(x)(yx)M2/16(y-x)^{\top}\nabla^{2}f(x)(y-x)\leq M^{2}/16. Consequently, for λ2M/4\|\lambda\|_{2}\leq M/4, a Taylor expansion with a Lagrange remainder gives

A(θ+(2A(θ))1/2λ)A(θ)[(2A(θ))1/2λ]A(θ)\displaystyle A(\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda)-A(\theta)-[(\nabla^{2}A(\theta))^{-1/2}\lambda]^{\top}\nabla A(\theta)
=12λ(2A(θ))1/22A(ξ)(2A(θ))1/2λ,\displaystyle=\frac{1}{2}\lambda^{\top}(\nabla^{2}A(\theta))^{-1/2}\cdot\nabla^{2}A(\xi)\cdot(\nabla^{2}A(\theta))^{-1/2}\lambda, (35)

where ξ\xi lies on the line segment between θ\theta and θ+(2A(θ))1/2λ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda. Consequently, we have

(ξθ)2A(θ)(ξθ)λ(2A(θ))1/22A(θ)(2A(θ))1/2λ=λ22M216,\displaystyle(\xi-\theta)^{\top}\cdot\nabla^{2}A(\theta)\cdot(\xi-\theta)\leq\lambda^{\top}(\nabla^{2}A(\theta))^{-1/2}\cdot\nabla^{2}A(\theta)\cdot(\nabla^{2}A(\theta))^{-1/2}\lambda=\|\lambda\|_{2}^{2}\leq\frac{M^{2}}{16},

and therefore 2A(ξ)42A(θ)\nabla^{2}A(\xi)\preceq 4\nabla^{2}A(\theta). Plugging this back into (35) establishes the boundedness of (34), as well as the finiteness of A(θ+(2A(θ))1/2λ)A(\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda), or equivalently, θ+(2A(θ))1/2λΘ\theta+(\nabla^{2}A(\theta))^{-1/2}\lambda\in\Theta.

C.2 Proof of Lemma A.2

For X1,,Xn𝒩(0,Σ)X_{1},\cdots,X_{n}\sim{\mathcal{N}}(0,\Sigma) and ndn\geq d, it is well known that the empirical covariance Σ^n\widehat{\Sigma}_{n} follows a Wishart distribution 𝒲d(Σ/n,n){\mathcal{W}}_{d}(\Sigma/n,n), where the density of 𝒲d(V,n){\mathcal{W}}_{d}(V,n) is given by

fV,n(X)=det(X)(nd1)/2exp(𝖳𝗋(V1X)/2)2nd/2det(V)n/2Γd(n/2),\displaystyle f_{V,n}(X)=\frac{\det(X)^{(n-d-1)/2}\exp(-\mathsf{Tr}(V^{-1}X)/2)}{2^{nd/2}\det(V)^{n/2}\Gamma_{d}(n/2)},

where Γd(x)=πd(d1)/4i=1dΓ(x(i1)/2)\Gamma_{d}(x)=\pi^{d(d-1)/4}\prod_{i=1}^{d}\Gamma(x-(i-1)/2) is the multivariate Gamma function, and Γ(x)=0tx1et𝑑t\Gamma(x)=\int_{0}^{\infty}t^{x-1}e^{-t}dt is the usual Gamma function. After some algebra, we have

DKL(𝒲d(Σ/n,n)𝒲d(Σ/(n+m),n+m))\displaystyle D_{\text{KL}}({\mathcal{W}}_{d}(\Sigma/n,n)\|{\mathcal{W}}_{d}(\Sigma/(n+m),n+m))
=d2(m(n+m)log(1+mn))+logΓd((n+m)/2)Γd(n/2)m2ψd(n2),\displaystyle=\frac{d}{2}\left(m-(n+m)\log\left(1+\frac{m}{n}\right)\right)+\log\frac{\Gamma_{d}((n+m)/2)}{\Gamma_{d}(n/2)}-\frac{m}{2}\psi_{d}\left(\frac{n}{2}\right), (36)

where ψd(x)=ddx[logΓd(x)]\psi_{d}(x)=\frac{d}{dx}[\log\Gamma_{d}(x)] is the multivariate digamma function. Note that the above KL divergence (36) does not depend on Σ\Sigma; we denote it by f(n,m,d)f(n,m,d).

By (3), it suffices to establish an upper bound on f(n,m,d)f(n,m,d). Expanding logΓd(x)\log\Gamma_{d}(x) into its Taylor series around x=n/2x=n/2 yields

logΓd(n+m2)\displaystyle\log\Gamma_{d}\left(\frac{n+m}{2}\right) =logΓd(n2)+m2ψd(n2)+t=21t!(m2)tψd(t1)(n2)\displaystyle=\log\Gamma_{d}\left(\frac{n}{2}\right)+\frac{m}{2}\psi_{d}\left(\frac{n}{2}\right)+\sum_{t=2}^{\infty}\frac{1}{t!}\left(\frac{m}{2}\right)^{t}\psi_{d}^{(t-1)}\left(\frac{n}{2}\right)
=logΓd(n2)+m2ψd(n2)+t=21t!(m2)tk=1dψ(t1)(nk+12),\displaystyle=\log\Gamma_{d}\left(\frac{n}{2}\right)+\frac{m}{2}\psi_{d}\left(\frac{n}{2}\right)+\sum_{t=2}^{\infty}\frac{1}{t!}\left(\frac{m}{2}\right)^{t}\sum_{k=1}^{d}\psi^{(t-1)}\left(\frac{n-k+1}{2}\right),

where ψ(t1)(x)=dtdxt[logΓ(x)]\psi^{(t-1)}(x)=\frac{d^{t}}{dx^{t}}[\log\Gamma(x)] is the polygamma function. For any t2t\geq 2 and x1x\geq 1, the following inequality holds for the polygamma function [AS64, Equation 6.4.10]:

|ψ(t1)(x)(1)t(t2)!xt1|(t1)!xt.\displaystyle\left|\psi^{(t-1)}(x)-(-1)^{t}\frac{(t-2)!}{x^{t-1}}\right|\leq\frac{(t-1)!}{x^{t}}.

As a result, for the following modification

g(n,m,d)d2(m(n+m)log(1+mn))+mk=1dt=2(1)t2t(t1)(mnk+1)t1,\displaystyle g(n,m,d)\triangleq\frac{d}{2}\left(m-(n+m)\log\left(1+\frac{m}{n}\right)\right)+m\sum_{k=1}^{d}\sum_{t=2}^{\infty}\frac{(-1)^{t}}{2t(t-1)}\left(\frac{m}{n-k+1}\right)^{t-1}, (37)

it holds that

|f(n,m,d)g(n,m,d)|\displaystyle|f(n,m,d)-g(n,m,d)| k=1dt=21t!(m2)t(t1)![(nk+1)/2]t\displaystyle\leq\sum_{k=1}^{d}\sum_{t=2}^{\infty}\frac{1}{t!}\left(\frac{m}{2}\right)^{t}\cdot\frac{(t-1)!}{[(n-k+1)/2]^{t}}
=k=1dt=21t(mnk+1)t\displaystyle=\sum_{k=1}^{d}\sum_{t=2}^{\infty}\frac{1}{t}\left(\frac{m}{n-k+1}\right)^{t}
dt=212(2mn)t4dm2n2,\displaystyle\leq d\cdot\sum_{t=2}^{\infty}\frac{1}{2}\left(\frac{2m}{n}\right)^{t}\leq\frac{4dm^{2}}{n^{2}}, (38)

where we have used the assumption n4max{m,d}n\geq 4\max\{m,d\}.

Next we establish an upper bound of g(n,m,d)g(n,m,d). Using the identity

h(x)(1+x)log(1+x)xx=t=2(1)txt1t(t1)\displaystyle h(x)\triangleq\frac{(1+x)\log(1+x)-x}{x}=\sum_{t=2}^{\infty}\frac{(-1)^{t}x^{t-1}}{t(t-1)}

and some algebra, we have

g(n,m,d)=dm2h(mn)+m2k=1dh(mnk+1)=m2k=1d[h(mnk+1)h(mn)].\displaystyle g(n,m,d)=-\frac{dm}{2}h\left(\frac{m}{n}\right)+\frac{m}{2}\sum_{k=1}^{d}h\left(\frac{m}{n-k+1}\right)=\frac{m}{2}\sum_{k=1}^{d}\left[h\left(\frac{m}{n-k+1}\right)-h\left(\frac{m}{n}\right)\right].
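Both the power-series identity for hh and this rewriting of the first term of g(n,m,d)g(n,m,d) are straightforward to verify numerically; the sketch below uses the arbitrary values x = 0.3 and (n, m, d) = (200, 5, 10):

```python
import math

def h(x):
    # h(x) = ((1 + x) log(1 + x) - x) / x
    return ((1 + x) * math.log(1 + x) - x) / x

def h_series(x, T=200):
    # Truncation of the alternating series sum_{t >= 2} (-1)^t x^{t-1} / (t (t - 1)).
    return sum((-1) ** t * x ** (t - 1) / (t * (t - 1)) for t in range(2, T))

series_err = abs(h(0.3) - h_series(0.3))

# First term of g(n, m, d): (d/2)(m - (n + m) log(1 + m/n)) = -(d m / 2) h(m / n).
n, m, d = 200, 5, 10
lhs = (d / 2) * (m - (n + m) * math.log(1 + m / n))
rhs = -(d * m / 2) * h(m / n)
print(series_err, abs(lhs - rhs))  # both differences are at floating-point level
```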

Since for x[0,1]x\in[0,1] we have

h(x)=xlog(1+x)x2[0,12],\displaystyle h^{\prime}(x)=\frac{x-\log(1+x)}{x^{2}}\in\left[0,\frac{1}{2}\right],

we conclude that

g(n,m,d)m2k=1d12[mnk+1mn]md22mdn2=m2d2n2.\displaystyle g(n,m,d)\leq\frac{m}{2}\sum_{k=1}^{d}\frac{1}{2}\left[\frac{m}{n-k+1}-\frac{m}{n}\right]\leq\frac{md}{2}\cdot\frac{2md}{n^{2}}=\frac{m^{2}d^{2}}{n^{2}}. (39)

Finally, by (38), (39) and (3), we conclude that

(Σ^n)(Σ^n+m)TVf(n,m,d)22mdn.\displaystyle\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{TV}}\leq\sqrt{\frac{f(n,m,d)}{2}}\leq\frac{2md}{n}.
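This chain of bounds can be checked numerically by evaluating the KL divergence (36) directly, computing the multivariate digamma via a central difference of the multivariate log-gamma. The values (n, m, d) = (200, 5, 10) below are illustrative choices satisfying n ≥ 4 max{m, d}:

```python
import math

def log_gamma_d(x, d):
    # Multivariate log-gamma: log Gamma_d(x) = (d(d-1)/4) log pi + sum_i log Gamma(x - (i-1)/2).
    return d * (d - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(x - (i - 1) / 2) for i in range(1, d + 1))

def psi_d(x, d, h=1e-5):
    # Multivariate digamma psi_d = (log Gamma_d)' via a central difference.
    return (log_gamma_d(x + h, d) - log_gamma_d(x - h, d)) / (2 * h)

def f_kl(n, m, d):
    # The KL divergence (36) between the two Wishart laws; it does not depend on Sigma.
    return ((d / 2) * (m - (n + m) * math.log(1 + m / n))
            + log_gamma_d((n + m) / 2, d) - log_gamma_d(n / 2, d)
            - (m / 2) * psi_d(n / 2, d))

n, m, d = 200, 5, 10                     # satisfies n >= 4 max{m, d}
tv_bound = math.sqrt(f_kl(n, m, d) / 2)  # Pinsker-type bound on the TV distance
print(tv_bound, 2 * m * d / n)           # the first stays below the second
```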

For the claim that Sn+mS_{n+m} follows the uniform distribution on the set A={Ud×(n+m):UU=Id}A=\{U\in\mathbb{R}^{d\times(n+m)}:UU^{\top}=I_{d}\}, we need the following auxiliary definitions and results. A topological group is a group (G,+)(G,+) with a topology such that the operation +:G×GG+:G\times G\to G is continuous. A (right) group action of GG on XX is a function ϕ:X×GX\phi:X\times G\to X such that ϕ(ϕ(x,g),g)=ϕ(x,gg)\phi(\phi(x,g),g^{\prime})=\phi(x,gg^{\prime}) and ϕ(x,e)=x\phi(x,e)=x, where ee is the identity element of GG. A group action is called transitive if for every x,xXx,x^{\prime}\in X, there exists some gGg\in G such that ϕ(x,g)=x\phi(x,g)=x^{\prime}. A group action is called proper if for any compact KXK\subseteq X and xXx\in X, the map ϕx:GX\phi_{x}:G\to X with gϕ(x,g)g\mapsto\phi(x,g) satisfies that ϕx1(K)G\phi_{x}^{-1}(K)\subseteq G is compact. The following lemma is useful.

Lemma C.1 (Chapter 14, Theorem 25 of [Roy88]).

Let GG be a locally compact group acting transitively and properly on a locally compact Hausdorff space XX. Then there is a unique (up to multiplicative factors) Baire measure on XX which is invariant under the action of GG.

Now for the claimed result, it is easy to verify that (a proper version of) Sn+mS_{n+m} always takes value in AA, assuming n+mdn+m\geq d. To show that Sn+mS_{n+m} is uniform on AA, note that for any orthogonal matrix V(n+m)×(n+m)V\in\mathbb{R}^{(n+m)\times(n+m)}, it is easy to verify

[X1,,Xn+m]=𝑑[X1,,Xn+m]V.\displaystyle[X_{1},\cdots,X_{n+m}]\overset{d}{=}[X_{1},\cdots,X_{n+m}]V.

Denoting the RHS by [Y1,,Yn+m][Y_{1},\cdots,Y_{n+m}], we also have Σ^n+m(Yn+m)=Σ^n+m(Xn+m)\widehat{\Sigma}_{n+m}(Y^{n+m})=\widehat{\Sigma}_{n+m}(X^{n+m}). Consequently,

Sn+m(Xm+n)V\displaystyle S_{n+m}(X^{m+n})\cdot V =[(n+m)Σ^n+m(Xn+m)]1/2[Y1,Y2,,Yn+m]\displaystyle=[(n+m)\widehat{\Sigma}_{n+m}(X^{n+m})]^{-1/2}[Y_{1},Y_{2},\cdots,Y_{n+m}]
=[(n+m)Σ^n+m(Yn+m)]1/2[Y1,Y2,,Yn+m]\displaystyle=[(n+m)\widehat{\Sigma}_{n+m}(Y^{n+m})]^{-1/2}[Y_{1},Y_{2},\cdots,Y_{n+m}]
=𝑑[(n+m)Σ^n+m(Xn+m)]1/2[X1,X2,,Xn+m]=Sn+m(Xm+n),\displaystyle\overset{d}{=}[(n+m)\widehat{\Sigma}_{n+m}(X^{n+m})]^{-1/2}[X_{1},X_{2},\cdots,X_{n+m}]=S_{n+m}(X^{m+n}),

meaning that the distribution of Sn+mS_{n+m} is invariant under right multiplication by an orthogonal matrix. Let GG be the orthogonal group 𝖮(n+m)\mathsf{O}(n+m); then the map ϕ:A×GA\phi:A\times G\to A with ϕ(U,V)=UV\phi(U,V)=UV is a group action of GG on AA. This action is transitive: for any U,UAU,U^{\prime}\in A, we can add more rows to U,UU,U^{\prime} to obtain U~,U~G\widetilde{U},\widetilde{U}^{\prime}\in G, and then V=U~U~GV=\widetilde{U}^{\top}\widetilde{U}^{\prime}\in G maps UU to UU^{\prime}. This action is also proper as GG itself is compact. Hence, Lemma C.1 shows that Sn+mS_{n+m} is uniformly distributed on AA.

C.3 Proof of Lemma A.4

The proof of Lemma A.4 is a simple consequence of several known results. First, Basu's theorem implies that X¯n\overline{X}_{n} and Σ^n\widehat{\Sigma}_{n} are independent. Second, the computation in Example 4.1 shows

(X¯n)(X¯n+m)TVmdn.\displaystyle\|{\mathcal{L}}(\overline{X}_{n})-{\mathcal{L}}(\overline{X}_{n+m})\|_{\text{TV}}\leq\frac{m\sqrt{d}}{n}.

Finally, as Σ^n𝒲d(Σ/(n1),n1)\widehat{\Sigma}_{n}\sim{\mathcal{W}}_{d}(\Sigma/(n-1),n-1) again follows a Wishart distribution, the proof of Lemma A.2 shows that

(Σ^n)(Σ^n+m)TV2mdn1.\displaystyle\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{TV}}\leq\frac{2md}{n-1}.

In conclusion, we have

(X¯n,Σ^n)(X¯n+m,Σ^n+m)TV\displaystyle\|{\mathcal{L}}(\overline{X}_{n},\widehat{\Sigma}_{n})-{\mathcal{L}}(\overline{X}_{n+m},\widehat{\Sigma}_{n+m})\|_{\text{\rm TV}} (X¯n)(X¯n+m)TV+(Σ^n)(Σ^n+m)TV\displaystyle\leq\|{\mathcal{L}}(\overline{X}_{n})-{\mathcal{L}}(\overline{X}_{n+m})\|_{\text{TV}}+\|{\mathcal{L}}(\widehat{\Sigma}_{n})-{\mathcal{L}}(\widehat{\Sigma}_{n+m})\|_{\text{TV}}
3mdn1.\displaystyle\leq\frac{3md}{n-1}.

For the distribution of Sn+mS_{n+m}, it is clear that (a proper version of) Sn+mS_{n+m} always takes values in AA, and we show that the distribution of Sn+mS_{n+m} is invariant under a suitable group action on AA. Consider the set

G={V(n+m)×(n+m):VV=In+m,V𝟏=𝟏}\displaystyle G=\{V\in\mathbb{R}^{(n+m)\times(n+m)}:VV^{\top}=I_{n+m},V{\bf 1}={\bf 1}\}

with the usual matrix multiplication. We show that GG is a group: clearly In+mGI_{n+m}\in G; for V,VGV,V^{\prime}\in G, it is clear that VV𝟏=V𝟏=𝟏VV^{\prime}{\bf 1}=V{\bf 1}={\bf 1} and therefore VVGVV^{\prime}\in G; for VGV\in G, it holds that V1𝟏=V1V𝟏=𝟏V^{-1}{\bf 1}=V^{-1}V{\bf 1}={\bf 1} and thus V1GV^{-1}\in G. Next we show that the map ϕ:A×GA\phi:A\times G\to A with ϕ(U,V)=UV\phi(U,V)=UV is a group action on AA, and it suffices to show that UVAUV\in A. This is true as (UV)(UV)=UVVU=UU=Id(UV)(UV)^{\top}=UVV^{\top}U^{\top}=UU^{\top}=I_{d}, and UV𝟏=U𝟏=𝟎UV{\bf 1}=U{\bf 1}={\bf 0}. This group action is also transitive: for any U,UAU,U^{\prime}\in A, we may add suitable rows to them to obtain U~,U~𝖮(n+m)\widetilde{U},\widetilde{U}^{\prime}\in\mathsf{O}(n+m), where one of the added rows is a scalar multiple of 𝟏{\bf 1}^{\top}, which is feasible as U𝟏=𝟎U{\bf 1}={\bf 0}. Consequently, the matrix V=U~1U~𝖮(n+m)V=\widetilde{U}^{-1}\widetilde{U}^{\prime}\in\mathsf{O}(n+m) maps UU to UU^{\prime}, and also 𝟏\bf 1^{\top} to 𝟏\bf 1^{\top}; hence VGV\in G. Finally, we show that the action of any element of GG does not change the distribution of Sm+nS_{m+n}. To see this, for any VGV\in G we have

[X1,,Xn+m]\displaystyle[X_{1},\cdots,X_{n+m}] =𝑑[X1,,Xn+m]V,\displaystyle\overset{d}{=}[X_{1},\cdots,X_{n+m}]V,
X¯n+m([X1,,Xn+m]V)\displaystyle\overline{X}_{n+m}([X_{1},\cdots,X_{n+m}]V) =X¯n+m([X1,,Xn+m]),\displaystyle=\overline{X}_{n+m}([X_{1},\cdots,X_{n+m}]),
Σ^n+m([X1,,Xn+m]V)\displaystyle\widehat{\Sigma}_{n+m}([X_{1},\cdots,X_{n+m}]V) =Σ^n+m([X1,,Xn+m]).\displaystyle=\widehat{\Sigma}_{n+m}([X_{1},\cdots,X_{n+m}]).

Therefore, following the same arguments as in Example A.1 we arrive at the desired invariance, and the uniform distribution of Sm+nS_{m+n} is a direct consequence of Lemma C.1.

C.4 Proof of Lemma 5.8

For any subset S[n+m]S\subseteq[n+m] with |S|=m|S|=m, let PSP_{S} be the distribution of (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}) when the samples (Y1,,Ym)(Y_{1},\cdots,Y_{m}) are placed in the index set SS of the pool (Z1,,Zn+m)(Z_{1},\cdots,Z_{n+m}). Then it is clear that Pmix=𝔼[PS]P_{\text{mix}}=\mathbb{E}[P_{S}], with SS uniformly distributed on all size-mm subsets of [n+m][n+m]. To compute the χ2\chi^{2}-divergence where the first distribution is a mixture, the following identity holds [IS12]:

χ2(Pmix,P(n+m))=𝔼S,S[dPSdPSdP(n+m)]1,\chi^{2}\left(P_{\text{\rm mix}},P^{\otimes(n+m)}\right)=\mathbb{E}_{S,S^{\prime}}\left[\int\frac{\mathrm{d}P_{S}\mathrm{d}P_{S^{\prime}}}{\mathrm{d}P^{\otimes(n+m)}}\right]-1,

where SS^{\prime} is an independent copy of SS. By the independence assumption, we have

dPSdP(n+m)(z1,,zn+m)=iSdQdP(zi),\frac{\mathrm{d}P_{S}}{\mathrm{d}P^{\otimes(n+m)}}(z_{1},\cdots,z_{n+m})=\prod_{i\in S}\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i}),

and consequently

dPSdPSdP(n+m)\displaystyle\int\frac{\mathrm{d}P_{S}\mathrm{d}P_{S^{\prime}}}{\mathrm{d}P^{\otimes(n+m)}} =𝔼P[iSdQdP(zi)iSdQdP(zi)]\displaystyle=\mathbb{E}_{P}\left[\prod_{i\in S}\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\prod_{i\in S^{\prime}}\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\right]
=iSS𝔼P[(dQdP(zi))2]iSΔS𝔼P[dQdP(zi)]\displaystyle=\prod_{i\in S\cap S^{\prime}}\mathbb{E}_{P}\left[\left(\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\right)^{2}\right]\cdot\prod_{i\in S\Delta S^{\prime}}\mathbb{E}_{P}\left[\frac{\mathrm{d}Q}{\mathrm{d}P}(z_{i})\right]
=(1+χ2(Q,P))|SS|.\displaystyle=(1+\chi^{2}(Q,P))^{|S\cap S^{\prime}|}.

It remains to upper bound the expectation with respect to the random variable |SS||S\cap S^{\prime}|. Note that |SS||S\cap S^{\prime}| follows the hypergeometric distribution with parameter (n+m,m,m)(n+m,m,m), which corresponds to sampling without replacement. The counterpart for sampling with replacement corresponds to a Binomial distribution 𝖡(m,mn+m)\mathsf{B}(m,\frac{m}{n+m}), and the following lemma shows that the latter dominates the former in terms of the convex order:

Lemma C.2.

[Hoe63, Theorem 4] Let the population be 𝒞={c1,,cN}{\mathcal{C}}=\{c_{1},\cdots,c_{N}\}. Let X1,,XnX_{1},\cdots,X_{n} denote random samples without replacement from 𝒞{\mathcal{C}} and Y1,,YnY_{1},\cdots,Y_{n} denote random samples with replacement. If f()f(\cdot) is convex, then

𝔼[f(i=1nXi)]𝔼[f(i=1nYi)].\mathbb{E}\left[f\left(\sum_{i=1}^{n}X_{i}\right)\right]\leq\mathbb{E}\left[f\left(\sum_{i=1}^{n}Y_{i}\right)\right].

Applying Lemma C.2 to the convex function x(1+χ2(Q,P))xx\mapsto(1+\chi^{2}(Q,P))^{x} yields

𝔼S,S[(1+χ2(Q,P))|SS|]\displaystyle\mathbb{E}_{S,S^{\prime}}[(1+\chi^{2}(Q,P))^{|S\cap S^{\prime}|}] 𝔼[(1+χ2(Q,P))𝖡(m,m/(n+m))]\displaystyle\leq\mathbb{E}[(1+\chi^{2}(Q,P))^{\mathsf{B}(m,m/(n+m))}]
=(𝔼[(1+χ2(Q,P))Bern(m/(n+m))])m\displaystyle=\left(\mathbb{E}[(1+\chi^{2}(Q,P))^{\text{Bern}(m/(n+m))}]\right)^{m}
=(1+mn+mχ2(Q,P))m,\displaystyle=\left(1+\frac{m}{n+m}\chi^{2}(Q,P)\right)^{m},

as desired.
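Since both distributions of |S ∩ S'| are supported on finitely many values, Lemma C.2 and the closing Binomial moment identity can be verified exactly from the pmfs; the sketch below uses the arbitrary values n = 20, m = 5 and 1 + χ²(Q, P) = 1.5:

```python
import math

def hyper_pmf(j, n, m):
    # P(|S ∩ S'| = j) for two independent uniform size-m subsets of [n + m]:
    # hypergeometric with population n + m, m marked elements, m draws.
    return math.comb(m, j) * math.comb(n, m - j) / math.comb(n + m, m)

n, m = 20, 5
c = 1.5  # plays the role of 1 + chi^2(Q, P); any value > 1 works

# E[c^{|S ∩ S'|}] under sampling without replacement (hypergeometric) ...
e_hyper = sum(hyper_pmf(j, n, m) * c ** j for j in range(m + 1))
# ... versus with replacement (Binomial(m, m/(n+m))), and its closed form.
p = m / (n + m)
e_binom = sum(math.comb(m, j) * p ** j * (1 - p) ** (m - j) * c ** j
              for j in range(m + 1))
closed_form = (1 + p * (c - 1)) ** m
print(e_hyper, e_binom, closed_form)  # e_hyper <= e_binom, e_binom == closed_form
```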

C.5 Proof of Lemma A.7

Note that in the proof of Theorem 5.2, we have

Pmix(Xn/2)P(n/2+m)TVm2nχ2(P^n(Xn/2),P).\displaystyle\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}}\leq\sqrt{\frac{m^{2}}{n}\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)}.

Since the TV distance is always upper bounded by one, the following upper bound is also true:

Pmix(Xn/2)P(n/2+m)TVm2n(χ2(P^n(Xn/2),P)n).\displaystyle\|P_{\text{mix}}(X^{n/2})-P^{\otimes(n/2+m)}\|_{\text{TV}}\leq\sqrt{\frac{m^{2}}{n}\cdot\left(\chi^{2}(\widehat{P}_{n}(X^{n/2}),P)\wedge n\right)}.

Consequently, taking the expectation over Xn/2X^{n/2} leads to the claim.

C.6 Proof of Lemma A.13

First we note that

χ2(𝒩(θ^n,j,1),𝒩(θj,1))=exp((θ^n,jθj)2)1.\displaystyle\chi^{2}\left({\mathcal{N}}(\widehat{\theta}_{n,j},1),{\mathcal{N}}(\theta_{j},1)\right)=\exp\left((\widehat{\theta}_{n,j}-\theta_{j})^{2}\right)-1.

By the triangle inequality,

𝔼|θ^n,jθj|\displaystyle\mathbb{E}|\widehat{\theta}_{n,j}-\theta_{j}| 𝔼|θ^n,j1ni=1nXi,j|+𝔼|1ni=1nXi,jθj|\displaystyle\leq\mathbb{E}\left|\widehat{\theta}_{n,j}-\frac{1}{n}\sum_{i=1}^{n}X_{i,j}\right|+\mathbb{E}\left|\frac{1}{n}\sum_{i=1}^{n}X_{i,j}-\theta_{j}\right|
Clognn+2nπ=O(lognn).\displaystyle\leq\sqrt{\frac{C\log n}{n}}+\sqrt{\frac{2}{n\pi}}=O\left(\sqrt{\frac{\log n}{n}}\right). (40)

Moreover, it is straightforward to verify that |θ^n,jθj||\widehat{\theta}_{n,j}-\theta_{j}| is (1/n)(1/n)-Lipschitz with respect to (X1,j,,Xn,j)(X_{1,j},\cdots,X_{n,j}). Therefore, the Gaussian Lipschitz concentration (see, e.g. [BLM13, Theorem 10.17]) gives that

(|θ^n,jθj|𝔼|θ^n,jθj|+t)exp(nt22)\displaystyle\mathbb{P}\left(|\widehat{\theta}_{n,j}-\theta_{j}|\geq\mathbb{E}|\widehat{\theta}_{n,j}-\theta_{j}|+t\right)\leq\exp\left(-\frac{nt^{2}}{2}\right) (41)

for any t0t\geq 0. Hence, combining (40) and (41), we conclude that |θ^n,jθj|=O(log(nd)/n)|\widehat{\theta}_{n,j}-\theta_{j}|=O(\sqrt{\log(nd)/n}) holds with probability at least 1(nd)21-(nd)^{-2}, and therefore

𝔼[χ2(𝒩(θ^n,j,1),𝒩(θj,1))n]2𝔼[(θ^n,jθj)2]+1nd,\displaystyle\mathbb{E}\left[\chi^{2}\left({\mathcal{N}}(\widehat{\theta}_{n,j},1),{\mathcal{N}}(\theta_{j},1)\right)\wedge n\right]\leq 2\cdot\mathbb{E}[(\widehat{\theta}_{n,j}-\theta_{j})^{2}]+\frac{1}{nd},

where we have used that ex1+2xe^{x}\leq 1+2x whenever x[0,1]x\in[0,1]. Summing over j[d]j\in[d] and using the property of the soft-thresholding estimator supθΘ𝔼[θ^nθ22]=O(slogd/n)\sup_{\theta\in\Theta}\mathbb{E}[\|\widehat{\theta}_{n}-\theta\|_{2}^{2}]=O(s\log d/n) gives the claimed result.

C.7 Proof of Lemma A.15

For the first claim, note that for Poisson models, we have

χ2(𝖯𝗈𝗂(λ^n,j),𝖯𝗈𝗂(λj))=exp((λ^n,jλj)2λj)1.\displaystyle\chi^{2}\left(\mathsf{Poi}(\widehat{\lambda}_{n,j}),\mathsf{Poi}(\lambda_{j})\right)=\exp\left(\frac{(\widehat{\lambda}_{n,j}-\lambda_{j})^{2}}{\lambda_{j}}\right)-1.

Consequently, for λj=1/(n2logn)\lambda_{j}=1/(n^{2}\log n), we have

𝔼[χ2(𝖯𝗈𝗂(λ^n,j),𝖯𝗈𝗂(λj))n]\displaystyle\mathbb{E}\left[\chi^{2}\left(\mathsf{Poi}(\widehat{\lambda}_{n,j}),\mathsf{Poi}(\lambda_{j})\right)\wedge n\right] [(exp((1/nλj)2λj)1)n](λ^n,j=1/n)\displaystyle\geq\left[\left(\exp\left(\frac{(1/n-\lambda_{j})^{2}}{\lambda_{j}}\right)-1\right)\wedge n\right]\cdot\mathbb{P}(\widehat{\lambda}_{n,j}=1/n)
=Ω(n)enλjnλj=Ω(1logn)1n,\displaystyle=\Omega(n)\cdot e^{-n\lambda_{j}}n\lambda_{j}=\Omega\left(\frac{1}{\log n}\right)\gg\frac{1}{n},

establishing the first claim.
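The Poisson χ² closed form used above follows from summing p₁(k)²/p₂(k) over k, and can be confirmed numerically (working in log space for stability); the rates below are arbitrary:

```python
import math

def log_poi_pmf(lam, k):
    # Log of the Poisson pmf: -lam + k log(lam) - log(k!).
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def chi2_poisson(l1, l2, K=60):
    # chi^2(Poi(l1), Poi(l2)) = sum_k p1(k)^2 / p2(k) - 1, truncated at K terms.
    return sum(math.exp(2 * log_poi_pmf(l1, k) - log_poi_pmf(l2, k))
               for k in range(K)) - 1

l1, l2 = 0.7, 0.4
series_val = chi2_poisson(l1, l2)
closed_val = math.exp((l1 - l2) ** 2 / l2) - 1
print(series_val, closed_val)  # the two values agree
```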

For the second claim, note that conditioned on Xn/2X^{n/2},

DKL((T^n/2+m)(Tn/2+m))\displaystyle D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right) =j=1dDKL(𝖯𝗈𝗂(nλj/2+mλ^n,j)𝖯𝗈𝗂((n/2+m)λj))\displaystyle=\sum_{j=1}^{d}D_{\text{\rm KL}}\left(\mathsf{Poi}(n\lambda_{j}/2+m\widehat{\lambda}_{n,j})\|\mathsf{Poi}((n/2+m)\lambda_{j})\right)
j=1dm2(λ^n,jλj)2(n/2+m)λj,\displaystyle\leq\sum_{j=1}^{d}\frac{m^{2}(\widehat{\lambda}_{n,j}-\lambda_{j})^{2}}{(n/2+m)\lambda_{j}},

where we have used DKL(𝖯𝗈𝗂(λ1)𝖯𝗈𝗂(λ2))=λ2λ1+λ1log(λ1/λ2)(λ1λ2)2/λ2D_{\text{\rm KL}}(\mathsf{Poi}(\lambda_{1})\|\mathsf{Poi}(\lambda_{2}))=\lambda_{2}-\lambda_{1}+\lambda_{1}\log(\lambda_{1}/\lambda_{2})\leq(\lambda_{1}-\lambda_{2})^{2}/\lambda_{2}. Hence,

\displaystyle\mathbb{E}_{X^{n/2}}\left[D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right)\right]\leq\frac{dm^{2}}{(n/2+m)n/2}\leq\frac{4dm^{2}}{n^{2}},

and the TV distance satisfies

\displaystyle\|{\mathcal{L}}(\widehat{X}^{n+m})-{\mathcal{L}}(X^{n+m})\|_{\text{\rm TV}} \displaystyle=\mathbb{E}_{X^{n/2}}\left[\|{\mathcal{L}}(\widehat{T}_{n/2+m})-{\mathcal{L}}(T_{n/2+m})\|_{\text{\rm TV}}\right]
\displaystyle\leq\mathbb{E}_{X^{n/2}}\left[\sqrt{\frac{1}{2}D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right)}\right]
\displaystyle\leq\sqrt{\frac{1}{2}\mathbb{E}_{X^{n/2}}\left[D_{\text{\rm KL}}\left({\mathcal{L}}(\widehat{T}_{n/2+m})\|{\mathcal{L}}(T_{n/2+m})\right)\right]}
\displaystyle\leq\frac{m\sqrt{2d}}{n}.

C.8 Proof of Lemma A.17

The upper bound results are straightforward. The m=O(n\varepsilon/\sqrt{k}) upper bound is a consequence of the general product Poisson model considered in Example A.14. For the m=O(\sqrt{n}\varepsilon) upper bound, we consider the sufficient statistic T_{n}=\sum_{i=1}^{n}X_{i}\sim\prod_{j=1}^{k}\mathsf{Poi}(np_{j}), and simply apply the sufficiency-based algorithm to \widehat{T}_{n+m}=T_{n}. Since

\displaystyle D_{\text{\rm KL}}({\mathcal{L}}(T_{n+m})\|{\mathcal{L}}(T_{n}))=\sum_{j=1}^{k}D_{\text{\rm KL}}(\mathsf{Poi}((n+m)p_{j})\|\mathsf{Poi}(np_{j}))\leq\sum_{j=1}^{k}\frac{(mp_{j})^{2}}{np_{j}}=\frac{m^{2}}{n},

where the last equality crucially uses the identity \sum_{j=1}^{k}p_{j}=1, this procedure works.
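Both the exact formula for the KL divergence between Poisson laws and its quadratic upper bound, used in the display above and earlier in this section, can be sanity-checked numerically (the parameter values are arbitrary):

```python
import math

def poi_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def kl_poisson(l1, l2, kmax=80):
    # D_KL(Poi(l1) || Poi(l2)) computed by truncating the defining series
    return sum(poi_pmf(l1, k) * math.log(poi_pmf(l1, k) / poi_pmf(l2, k))
               for k in range(kmax + 1))

l1, l2 = 2.0, 3.0  # arbitrary test values
closed_form = l2 - l1 + l1 * math.log(l1 / l2)
assert abs(kl_poisson(l1, l2) - closed_form) < 1e-9
assert closed_form <= (l1 - l2)**2 / l2  # the quadratic upper bound
```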

Next we show that sample amplification is impossible when m=\omega(n\varepsilon/\sqrt{k}) and k=O(n). This implies that m=\omega(\max\{n\varepsilon/\sqrt{k},\sqrt{n}\varepsilon\}) is impossible in general, since the case k>n is at least as hard as the case k=n. To prove the above claim, w.l.o.g. we assume that k=2k_{0} is even, and consider the following parametric submodel:

\displaystyle P_{\theta}=\prod_{j=1}^{k_{0}}\left[\mathsf{Poi}\left(\frac{1}{k}+\theta_{j}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta_{j}\right)\right],\qquad\theta\in\Theta\triangleq\left[-\frac{1}{k},\frac{1}{k}\right]^{k_{0}}.

Clearly P_{\theta} is a parametric submodel, obtained by setting p_{2j-1}=1/k+\theta_{j} and p_{2j}=1/k-\theta_{j} in the original model. This submodel is a product model, so we can apply Theorem 6.5 once Assumption 4 is verified. Note that as \theta,\theta^{\prime} vary arbitrarily in [-1/k,1/k], the range of

\displaystyle H^{2}\left(\mathsf{Poi}\left(\frac{1}{k}+\theta\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta\right),\mathsf{Poi}\left(\frac{1}{k}+\theta^{\prime}\right)\times\mathsf{Poi}\left(\frac{1}{k}-\theta^{\prime}\right)\right)

is [0,\Theta(1/k)], so Assumption 4 is fulfilled when k\leq cn for a small constant c>0. Consequently, Theorem 6.5 establishes the desired bound m=O(n\varepsilon/\sqrt{k}).

C.9 Proof of Lemma A.19

For the first inequality, note that for a large C_{0}>0, both Z_{1}\leq C_{0} and \max_{2\leq j\leq d}Z_{j}\geq\sqrt{2\log d}-C_{0} hold with probability at least 0.99. When both events hold, we have

\displaystyle\exp(t(t+Z_{1})) \displaystyle\leq\exp(t(t+C_{0})),
\displaystyle\sum_{j=2}^{d}\exp(tZ_{j}) \displaystyle\geq\exp(t(\sqrt{2\log d}-C_{0})).

Consequently, for C=3C_{0} with C_{0}>0 large enough, we have

\displaystyle p_{d}(\sqrt{2\log d}-C) \displaystyle\leq 0.01+\frac{\exp((\sqrt{2\log d}-3C_{0})(\sqrt{2\log d}-2C_{0}))}{\exp((\sqrt{2\log d}-3C_{0})(\sqrt{2\log d}-C_{0}))}
\displaystyle=0.01+\exp\left(-C_{0}(\sqrt{2\log d}-3C_{0})\right)<0.1.

For the second inequality, note that Jensen’s inequality yields that

\displaystyle p_{d}(t) \displaystyle\geq\mathbb{E}_{Z_{1}}\left[\frac{\exp(t(t+Z_{1}))}{\exp(t(t+Z_{1}))+\sum_{j=2}^{d}\mathbb{E}[\exp(tZ_{j})]}\right]
\displaystyle\geq\mathbb{E}_{Z_{1}}\left[\frac{\exp(t(t+Z_{1}))}{\exp(t(t+Z_{1}))+d\exp(t^{2}/2)}\right].

Again, with probability at least 0.99 we have Z_{1}\geq-C_{0}, and therefore for t=\sqrt{2\log d}+2C_{0},

\displaystyle p_{d}(t) \displaystyle\geq 0.99\times\frac{\exp(t(t-C_{0}))}{\exp(t(t-C_{0}))+d\exp(t^{2}/2)}
\displaystyle=0.99\times\frac{1}{1+\exp(t^{2}/2+(t-2C_{0})^{2}/2-t(t-C_{0}))}
\displaystyle=0.99\times\frac{1}{1+\exp(-C_{0}\cdot\sqrt{2\log d})}>0.9,

for C_{0}>0 large enough.

For the last claim, consider any s<d/2. Let d_{0}\triangleq\lfloor d/s\rfloor\geq 2, and consider the product prior \mu^{\otimes s} over s blocks, each of dimension d_{0}. In other words, we set exactly one of the first d_{0} coordinates of the mean vector to t uniformly at random, and do the same for the next d_{0} coordinates, and so on. Clearly the resulting mean vector is always s-sparse. Writing \theta=(\theta_{1},\cdots,\theta_{s}) with each \theta_{i}\in\mathbb{R}^{d_{0}}, let the loss function be

\displaystyle L(\theta,\widehat{\theta})=\mathbbm{1}\left(\sum_{j=1}^{s}\mathbbm{1}(\theta_{j}\neq\widehat{\theta}_{j})\geq N\right),

with an integer N to be specified later. Then in each block, we reduce to the case s=1, and the error probability for this block is 1-p_{d_{0}}(\sqrt{n}\cdot t) for sample size n. Moreover, the errors in different blocks are independent. Consequently,

\displaystyle r_{\text{\rm B}}({\mathcal{P}},n,\mu,L)-r_{\text{\rm B}}({\mathcal{P}},n+m,\mu,L)
\displaystyle=\mathbb{P}\left(\mathsf{B}(s,1-p_{d_{0}}(\sqrt{n}\cdot t))\geq N\right)-\mathbb{P}\left(\mathsf{B}(s,1-p_{d_{0}}(\sqrt{n+m}\cdot t))\geq N\right). (42)

Finally, again by the properties of p_{d}(\cdot) summarized in Lemma A.19 and the pigeonhole principle, for m=\lceil cn\varepsilon/\sqrt{s\log d_{0}}\rceil, we can always find some t>0 such that p_{d_{0}}(\sqrt{n+m}\cdot t)-p_{d_{0}}(\sqrt{n}\cdot t)=\Omega_{c}(\varepsilon/\sqrt{s}), with both quantities in [0.1,0.9]. Now choosing

\displaystyle N=\left\lceil\frac{s(1-p_{d_{0}}(\sqrt{n}\cdot t))}{2}+\frac{s(1-p_{d_{0}}(\sqrt{n+m}\cdot t))}{2}\right\rceil

with the above choice of t, the Bayes risk difference is lower bounded by \Omega_{c}(\varepsilon) by an argument similar to the proof of Theorem 6.4. Now by (42) and Lemma 6.1, we see that n=\Omega(\sqrt{s\log(d/s)}/\varepsilon) and m=O(n\varepsilon/\sqrt{s\log(d/s)}) are necessary for sample amplification.

C.10 Proof of Lemma A.21

The following results are the key to the proof of Lemma A.21; we assume them for now and prove them in the subsequent sections.

Lemma C.3.

For D_{n}=\text{\rm diag}(\lambda_{1},\cdots,\lambda_{d}), the following identities hold for the estimator \widehat{\Sigma}_{n}:

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})] \displaystyle=\sum_{j=1}^{d}\left[(n+d+1-2j)\lambda_{j}-\log\lambda_{j}-\mathbb{E}\left[\log\chi_{n-j+1}^{2}\right]\right]-d, (43)
\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n})) \displaystyle=\sum_{j=1}^{d}\left[2(n+d+1-2j)\lambda_{j}^{2}-4\lambda_{j}+\psi^{\prime}\left(\frac{n+1-j}{2}\right)\right], (44)

where \chi_{m}^{2} denotes the chi-squared distribution with m degrees of freedom, and \psi^{\prime}(x) is the polygamma function of order 1. In particular, when n\geq 2d, the following inequalities hold:

\displaystyle\left|\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]-\left(g(n+1-d,d)+\sum_{j=1}^{d}h((n+d+1-2j)\lambda_{j})\right)\right|\leq\frac{5d}{n}, (45)
\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))\leq\frac{16d^{2}}{n^{2}}+\sum_{j=1}^{d}\frac{4((n+d+1-2j)\lambda_{j}-1)^{2}}{n}, (46)

where the functions g and h are given by (13) and

\displaystyle h(u) \displaystyle\triangleq u-\log u-1. (47)
Lemma C.4.

For the function h in (47) and u_{1},\cdots,u_{d}\in\mathbb{R}_{+}, it holds that

\displaystyle\sum_{j=1}^{d}h(u_{j})\geq\frac{1}{8}\min\left\{\sum_{j=1}^{d}(u_{j}-1)^{2},\sqrt{\sum_{j=1}^{d}(u_{j}-1)^{2}}\right\}. (48)

Returning to the proof of Lemma A.21, the first claim is a direct application of (45) and (46). For the second claim, let D_{n}=\text{\rm diag}(\lambda_{1},\cdots,\lambda_{d}), and

\displaystyle V\triangleq\sum_{j=1}^{d}((n+d+1-2j)\lambda_{j}-1)^{2}.

Then by (45), (46) and Lemma C.4, we have

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})] \displaystyle\geq g(n+1-d,d)+\frac{V\wedge\sqrt{V}}{8}-\frac{5d}{n}.

Meanwhile, the variance satisfies

\displaystyle\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}\leq\frac{4d}{n}+2\sqrt{\frac{V}{n}}.

Note that for n,d larger than a constant depending only on C_{1}, we always have

\displaystyle\frac{d}{n}+\frac{V\wedge\sqrt{V}}{8}\geq 2C_{1}\sqrt{\frac{V}{n}}.

Therefore, for n,d=\Omega(1) large enough, the above inequalities imply

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]\geq g(n+1-d,d)+C_{1}\left(\sqrt{\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))}-\frac{4d}{n}\right)-\frac{6d}{n},

establishing the second claim.

For the last statement, note that

\displaystyle\frac{\partial g(u,v)}{\partial u}=\frac{\log(u+2v)+\log(u)}{2}-\log(u+v)=\frac{1}{2}\log\left(\frac{u(u+2v)}{(u+v)^{2}}\right)\leq-\frac{v^{2}}{2(u+v)^{2}},

where the last inequality uses u(u+2v)=(u+v)^{2}-v^{2} and \log(1-x)\leq-x. The mean value theorem then implies that

\displaystyle g(n+1-d,d)-g(n+m+1-d,d)\geq m\cdot\min_{u\in[n+1-d,n+m+1-d]}\frac{d^{2}}{2(u+d)^{2}}\geq\frac{md^{2}}{13n^{2}}.
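As a numerical illustration (not part of the proof), the final bound can be checked by taking the integral representation of g from the proof of Lemma C.3 as the definition, assuming the two-argument form g(u,v)=\int_{0}^{v}\log\frac{u+2x}{u+x}\,\mathrm{d}x consistent with that representation; the values of n, d, m below are arbitrary.

```python
import math

def g(u, v, steps=20000):
    # g(u, v) = integral_0^v log((u + 2x)/(u + x)) dx, via the midpoint rule
    h = v / steps
    return sum(math.log((u + 2*x) / (u + x))
               for x in (h * (i + 0.5) for i in range(steps))) * h

n, d, m = 100, 20, 30  # arbitrary test values with n >= 2d
lhs = g(n + 1 - d, d) - g(n + m + 1 - d, d)
assert lhs >= m * d**2 / (13 * n**2)
```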

C.11 Proof of Lemma C.3

We first recall the well-known Bartlett decomposition: for the lower triangular matrix L_{n}=(L_{ij})_{i\geq j}, the random variables \{L_{ij}\}_{i\geq j} are mutually independent, with

\displaystyle L_{ij}\sim{\mathcal{N}}(0,1),\quad i>j;\qquad L_{jj}^{2}\sim\chi_{n+1-j}^{2},\quad j\in[d].

For \Sigma=I_{d} and \widehat{\Sigma}_{n}=L_{n}D_{n}L_{n}^{\top}, simple algebra gives

\displaystyle\ell(\Sigma,\widehat{\Sigma}_{n})=\sum_{j=1}^{d}\left(\sum_{i\geq j}\lambda_{j}L_{ij}^{2}-\log\lambda_{j}-\log L_{jj}^{2}-1\right).

Consequently, the identity (43) follows from the above Bartlett decomposition. This identity was also obtained in [JS61].

For the identity (44) on the variance, by the mutual independence we have

\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n}))=\sum_{j=1}^{d}\sum_{i\geq j}\lambda_{j}^{2}\cdot\mathsf{Var}(L_{ij}^{2})+\sum_{j=1}^{d}\mathsf{Var}(\log L_{jj}^{2})-2\sum_{j=1}^{d}\lambda_{j}\cdot\mathsf{Cov}(L_{jj}^{2},\log L_{jj}^{2}). (49)

Next we evaluate each term of (49). Clearly \mathsf{Var}(L_{ij}^{2})=\mathbb{E}[Z^{4}]-\mathbb{E}[Z^{2}]^{2}=2 for i>j and Z\sim{\mathcal{N}}(0,1). For the other random variables, we need to recall the following identity for \chi_{m}^{2}:

\displaystyle\Lambda(t)\triangleq\log\mathbb{E}[(\chi_{m}^{2})^{t}]=t\log 2+\log\Gamma\left(\frac{m}{2}+t\right)-\log\Gamma\left(\frac{m}{2}\right). (50)

Based on (50), we have

\displaystyle\mathsf{Var}(L_{jj}^{2})=(n+1-j)(n+3-j)-(n+1-j)^{2}=2(n+1-j).

Moreover, differentiating \Lambda(t) at t=0 gives

\displaystyle\mathbb{E}[\log\chi_{m}^{2}] \displaystyle=\Lambda^{\prime}(0)=\log 2+\psi\left(\frac{m}{2}\right), (51)
\displaystyle\mathbb{E}[(\log\chi_{m}^{2})^{2}]-\left(\mathbb{E}[\log\chi_{m}^{2}]\right)^{2} \displaystyle=\Lambda^{\prime\prime}(0)=\psi^{\prime}\left(\frac{m}{2}\right). (52)

Note that (52) leads to

\displaystyle\mathsf{Var}(\log L_{jj}^{2})=\psi^{\prime}\left(\frac{n+1-j}{2}\right).

Finally, differentiating \Lambda(t) at t=1 gives

\displaystyle\mathbb{E}[L_{jj}^{2}\log L_{jj}^{2}] \displaystyle=\mathbb{E}[L_{jj}^{2}]\cdot\left(\log 2+\psi\left(\frac{n+1-j}{2}+1\right)\right)
\displaystyle=(n+1-j)\left(\log 2+\psi\left(\frac{n+1-j}{2}+1\right)\right),

and hence the identity \psi(x+1)=\psi(x)+x^{-1} for the digamma function, together with (51), leads to

\displaystyle\mathsf{Cov}(L_{jj}^{2},\log L_{jj}^{2})=2.

Therefore, plugging the above quantities into (49) gives the identity (44).
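The chi-squared moment identities used above, namely \mathsf{Var}(\chi_{m}^{2})=2m and \mathsf{Cov}(X,\log X)=2 for X\sim\chi_{m}^{2}, can be verified (as a sanity check, not part of the proof) by direct numerical integration against the \chi_{m}^{2} density; m=7 is an arbitrary choice.

```python
import math

def chi2_pdf(x, m):
    # density of the chi-squared distribution with m degrees of freedom
    return math.exp((m/2 - 1) * math.log(x) - x/2
                    - (m/2) * math.log(2) - math.lgamma(m/2))

def expect(f, m, hi=100.0, steps=100000):
    # E[f(X)] for X ~ chi2_m, via the midpoint rule on [0, hi]
    h = hi / steps
    return sum(f(x) * chi2_pdf(x, m)
               for x in (h * (i + 0.5) for i in range(steps))) * h

m = 7  # arbitrary degrees of freedom
EX = expect(lambda x: x, m)
EX2 = expect(lambda x: x * x, m)
ElogX = expect(math.log, m)
EXlogX = expect(lambda x: x * math.log(x), m)
assert abs(EX - m) < 1e-5
assert abs(EX2 - EX * EX - 2 * m) < 1e-3      # Var(chi2_m) = 2m
assert abs(EXlogX - EX * ElogX - 2.0) < 1e-4  # Cov(X, log X) = 2
```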

Next we prove the remaining inequalities when n\geq 2d. For (45), note that (43), together with (51), gives the identity

\displaystyle\mathbb{E}[\ell(\Sigma,\widehat{\Sigma}_{n})]-\left(g(n+1-d,d)+\sum_{j=1}^{d}h((n+d+1-2j)\lambda_{j})\right)
\displaystyle=\sum_{j=1}^{d}\left(\log(n+d+1-2j)-\log 2-\psi\left(\frac{n+1-j}{2}\right)\right)-g(n+1-d,d).

Since |\psi(x)-\log(x)|\leq 1/x for all x\geq 1, replacing \psi(\cdot) by \log(\cdot) in the above expression only incurs an absolute difference of at most

\displaystyle\sum_{j=1}^{d}\frac{2}{n+1-j}\leq 2\int_{0}^{d}\frac{\mathrm{d}x}{n-x}=2\log\left(1+\frac{d}{n-d}\right)\leq\frac{4d}{n}.
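The digamma bound |\psi(x)-\log(x)|\leq 1/x invoked above can be spot-checked numerically; here a central difference of \log\Gamma stands in for \psi (accurate to roughly 10^{-9} at these arguments, far below the slack of about 1/(2x) in the bound).

```python
import math

def digamma(x, h=1e-5):
    # psi(x) approximated by a central difference of log-Gamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

for x in [1.0, 1.5, 2.0, 5.0, 10.0, 50.0]:
    # |psi(x) - log(x)| <= 1/x for x >= 1
    assert abs(digamma(x) - math.log(x)) <= 1.0 / x
```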

For the remaining terms, it is not hard to verify that

\displaystyle g(n+1-d,d)=\int_{0}^{d}\log\frac{n+1-d+2x}{n+1-d+x}\,\mathrm{d}x.

As x\mapsto\log(n+1-d+2x)-\log(n+1-d+x) is increasing on [0,\infty), we have

\displaystyle 0\leq g(n+1-d,d)-\sum_{x=0}^{d-1}\log\frac{n+1-d+2x}{n+1-d+x}\leq\log\frac{n+1-d+2d}{n+1-d+d}\leq\frac{d}{n}.

Now (45) follows from a combination of the above inequalities.

For the inequality (46), we complete the square of (44) to obtain

\displaystyle\mathsf{Var}(\ell(\Sigma,\widehat{\Sigma}_{n})) \displaystyle=\sum_{j=1}^{d}\frac{2((n+d+1-2j)\lambda_{j}-1)^{2}}{n+d+1-2j}+\sum_{j=1}^{d}\left(\psi^{\prime}\left(\frac{n+1-j}{2}\right)-\frac{2}{n+d+1-2j}\right)
\displaystyle\leq\frac{4}{n}\sum_{j=1}^{d}((n+d+1-2j)\lambda_{j}-1)^{2}+\sum_{j=1}^{d}\left(\psi^{\prime}\left(\frac{n+1-j}{2}\right)-\frac{2}{n+d+1-2j}\right).

To handle the second sum, note that [AS64, Equation 6.4.10] gives |\psi^{\prime}(x)-x^{-1}|\leq x^{-2} for x\geq 1. Therefore, the second term has an absolute value at most

\displaystyle\sum_{j=1}^{d}\left[\frac{2(d-j)}{(n+1-j)(n+d+1-2j)}+\left(\frac{2}{n+1-j}\right)^{2}\right]\leq d\cdot\left[\frac{2d}{(n/2)^{2}}+\left(\frac{2}{n/2}\right)^{2}\right]\leq\frac{16d^{2}}{n^{2}}.
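Similarly, the trigamma bound |\psi^{\prime}(x)-x^{-1}|\leq x^{-2} from [AS64] can be spot-checked with a second central difference of \log\Gamma (the true gap is about 1/(2x^{2}), so the discretization error is comfortably absorbed):

```python
import math

def trigamma(x, h=1e-3):
    # psi'(x) approximated by a second central difference of log-Gamma
    return (math.lgamma(x + h) - 2 * math.lgamma(x) + math.lgamma(x - h)) / (h * h)

for x in [1.0, 1.5, 2.0, 5.0, 10.0, 50.0]:
    # |psi'(x) - 1/x| <= 1/x^2 for x >= 1
    assert abs(trigamma(x) - 1.0 / x) <= 1.0 / (x * x)
```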

C.12 Proof of Lemma C.4

Note that when 0<u\leq 2, a Taylor expansion of h(\cdot) at u=1 gives

\displaystyle h(u)\geq\frac{(u-1)^{2}}{2}\min_{\theta\in(0,2]}h^{\prime\prime}(\theta)=\frac{(u-1)^{2}}{8}.

For u>2, we have u-1\geq\log_{2}u>10/7\cdot\log u, and therefore

\displaystyle h(u)=u-\log u-1\geq\frac{3(u-1)}{10}.

Therefore, in both cases we have

\displaystyle h(u)\geq\frac{1}{8}\min\left\{(u-1)^{2},|u-1|\right\}.
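This pointwise inequality is easy to check numerically on a grid (an arbitrary illustration, not part of the proof):

```python
import math

def h(u):
    # h(u) = u - log(u) - 1, as in (47)
    return u - math.log(u) - 1

# check h(u) >= (1/8) * min{(u-1)^2, |u-1|} on a grid over (0, 10]
for i in range(1, 1001):
    u = i / 100.0
    assert h(u) >= 0.125 * min((u - 1)**2, abs(u - 1))
```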

To prove (48), let J\triangleq\{j\in[d]:|u_{j}-1|\leq 1\} and S\triangleq\sum_{j=1}^{d}(u_{j}-1)^{2}. Using the above inequality, we have

\displaystyle\sum_{j=1}^{d}h(u_{j}) \displaystyle\geq\frac{1}{8}\sum_{j\in J}(u_{j}-1)^{2}+\frac{1}{8}\sum_{j\notin J}|u_{j}-1|
\displaystyle\geq\frac{1}{8}\sum_{j\in J}(u_{j}-1)^{2}+\frac{1}{8}\sqrt{\sum_{j\notin J}(u_{j}-1)^{2}}
\displaystyle\geq\frac{1}{8}\min_{x\in[0,S]}\left(x+\sqrt{S-x}\right)
\displaystyle=\frac{1}{8}\min\{S,\sqrt{S}\},

which is precisely (48).
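Likewise, the vector inequality (48) can be spot-checked on a few arbitrary test vectors:

```python
import math

def h(u):
    return u - math.log(u) - 1

def lemma_c4_holds(us):
    # sum_j h(u_j) >= (1/8) * min{S, sqrt(S)} with S = sum_j (u_j - 1)^2
    S = sum((u - 1)**2 for u in us)
    return sum(h(u) for u in us) >= 0.125 * min(S, math.sqrt(S))

for us in ([0.5, 1.2, 3.0], [1.01] * 20, [0.1, 0.1, 10.0], [2.5], [1.0, 4.0, 0.3, 1.7]):
    assert lemma_c4_holds(us)
```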

References

  • [AGSV20] Brian Axelrod, Shivam Garg, Vatsal Sharan, and Gregory Valiant. Sample amplification: Increasing dataset size even when learning is impossible. In International Conference on Machine Learning, pages 442–451. PMLR, 2020.
  • [AS64] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1964.
  • [ASE17] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
  • [Bar86] Andrew R Barron. Entropy and the central limit theorem. The Annals of probability, pages 336–342, 1986.
  • [Bas55] Dev Basu. On statistics independent of a complete sufficient statistic. Sankhyā: The Indian Journal of Statistics (1933-1960), 15(4):377–380, 1955.
  • [Bas58] Dev Basu. On statistics independent of sufficient statistics. Sankhyā: The Indian Journal of Statistics, pages 223–226, 1958.
  • [Bas59] Dev Basu. The family of ancillary statistics. Sankhyā: The Indian Journal of Statistics, pages 247–256, 1959.
  • [BB19] Matthew Brennan and Guy Bresler. Optimal average-case reductions to sparse pca: From weak assumptions to strong hardness. In Conference on Learning Theory, pages 469–470. PMLR, 2019.
  • [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, pages 648–847. PMLR, 2020.
  • [BC14] Vlad Bally and Lucia Caramellino. On the distances between probability density functions. Electronic Journal of Probability, 19, 2014.
  • [BC16] Vlad Bally and Lucia Caramellino. Asymptotic development for the CLT in total variation distance. Bernoulli, 22(4):2442–2485, 2016.
  • [BCG14] Sergey G Bobkov, Gennadiy P Chistyakov, and Friedrich Götze. Berry–esseen bounds in the entropic central limit theorem. Probability Theory and Related Fields, 159(3-4):435–478, 2014.
  • [BCLZ02] Lawrence D Brown, T Tony Cai, Mark G Low, and Cun-Hui Zhang. Asymptotic equivalence theory for nonparametric regression with random design. The Annals of statistics, 30(3):688–707, 2002.
  • [BCLZ04] Lawrence D Brown, Andrew V Carter, Mark G Low, and Cun-Hui Zhang. Equivalence theory for density estimation, poisson processes and gaussian white noise with drift. The Annals of Statistics, 32(5):2074–2097, 2004.
  • [BKK+22] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
  • [BL96] Lawrence D Brown and Mark G Low. Asymptotic equivalence of nonparametric regression and white noise. The Annals of Statistics, 24(6):2384–2398, 1996.
  • [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
  • [BR10] Rabi N Bhattacharya and R Ranga Rao. Normal approximation and asymptotic expansions. SIAM, 2010.
  • [BR13] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
  • [CMST17] Francesco Calimeri, Aldo Marzullo, Claudio Stamile, and Giorgio Terracina. Biomedical data augmentation using generative adversarial neural networks. In International conference on artificial neural networks, pages 626–634. Springer, 2017.
  • [CMV+21] Phillip Chlap, Hang Min, Nym Vandenberg, Jason Dowling, Lois Holloway, and Annette Haworth. A review of medical image data augmentation techniques for deep learning applications. Journal of Medical Imaging and Radiation Oncology, 65(5):545–563, 2021.
  • [CPS+19] Aggelina Chatziagapi, Georgios Paraskevopoulos, Dimitris Sgouropoulos, Georgios Pantazopoulos, Malvina Nikandrou, Theodoros Giannakopoulos, Athanasios Katsamanis, Alexandros Potamianos, and Shrikanth Narayanan. Data augmentation using gans for speech emotion recognition. In Interspeech, pages 171–175, 2019.
  • [CQZ+22] Dong Chen, Xinda Qi, Yu Zheng, Yuzhen Lu, and Zhaojian Li. Deep data augmentation for weed recognition enhancement: A diffusion probabilistic model and transfer learning based approach. arXiv preprint arXiv:2210.09509, 2022.
  • [CZM+19] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
  • [CZSL20] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
  • [DJ94] David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
  • [DMR18] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians. arXiv preprint arXiv:1810.08693, 2018.
  • [DMR20] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The minimax learning rates of normal and Ising undirected graphical models. Electronic Journal of Statistics, 14(1):2338 – 2361, 2020.
  • [DY79] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential families. The Annals of statistics, pages 269–281, 1979.
  • [FADK+18] Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. GAN-based synthetic medical image augmentation for increased cnn performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
  • [HJW15] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under 1\ell_{1} loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.
  • [HMC+20] Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.
  • [Hoe56] Wassily Hoeffding. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, pages 713–721, 1956.
  • [Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • [HRA+19] Changhee Han, Leonardo Rundo, Ryosuke Araki, Yudai Nagano, Yujiro Furukawa, Giancarlo Mauri, Hideki Nakayama, and Hideaki Hayashi. Combining noise-to-image and image-to-image gans: Brain mr image augmentation for tumor detection. IEEE Access, 7:156966–156977, 2019.
  • [IS12] Yuri Ingster and Irina A Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. Springer Science & Business Media, 2012.
  • [JS61] W James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
  • [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • [LC64] Lucien Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, pages 1419–1455, 1964.
  • [LC72] Lucien Le Cam. Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
  • [LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer, 1986.
  • [LCY90] Lucien Le Cam and Grace Lo Yang. Asymptotics in Statistics. Springer, New York, 1990.
  • [LS92] Erich Leo Lehmann and FW Scholz. Ancillarity. Lecture Notes-Monograph Series, 17:32–51, 1992.
  • [LSM+22] Lorenzo Luzi, Ali Siahkoohi, Paul M Mayer, Josue Casco-Rodriguez, and Richard Baraniuk. Boomerang: Local sampling on image manifolds using diffusion models. arXiv preprint arXiv:2210.12100, 2022.
  • [LSW+23] Yingzhou Lu, Minjie Shen, Huazheng Wang, Xiao Wang, Capucine van Rechem, and Wenqi Wei. Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062, 2023.
  • [ML08] Klaus-J Miescke and F Liese. Statistical Decision Theory: Estimation, Testing, and Selection. Springer, 2008.
  • [MMKSM18] Ali Madani, Mehdi Moradi, Alexandros Karargyris, and Tanveer Syeda-Mahmood. Chest x-ray generation and data augmentation for cardiovascular abnormality classification. In Medical imaging 2018: Image processing, volume 10574, pages 415–420. SPIE, 2018.
  • [MW15] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. The Annals of Statistics, 43(3):1089–1116, 2015.
  • [Nes03] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • [Pro52] Yu V Prohorov. A local theorem for densities. In Doklady Akad. Nauk SSSR (NS), volume 83, pages 797–800, 1952.
  • [Roy88] Halsey Lawrence Royden. Real analysis (3rd edition). Prentice-Hall, 1988.
  • [RSH19] Kolyan Ray and Johannes Schmidt-Hieber. Asymptotic nonequivalence of density estimation and gaussian white noise for small densities. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 55, pages 2195–2208. Institut Henri Poincaré, 2019.
  • [SK19] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
  • [SM62] S Kh Sirazhdinov and M Mamatov. On convergence in the mean for densities. Theory of Probability & Its Applications, 7(4):424–428, 1962.
  • [SSP+03] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, volume 3. Edinburgh, 2003.
  • [SYPS19] Veit Sandfort, Ke Yan, Perry J Pickhardt, and Ronald M Summers. Data augmentation using generative adversarial networks (cyclegan) to improve generalizability in ct segmentation tasks. Nature Scientific reports, 9(1):1–9, 2019.
  • [Tsy09] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
  • [TUH18] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada. Between-class learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5486–5494, 2018.
  • [VdV00] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • [VLB+19] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019.
  • [Wal50] Abraham Wald. Statistical decision functions. Wiley, 1950.
  • [Wil82] EJ Williams. Some classes of conditional inference procedures. Journal of Applied Probability, 19(A):293–303, 1982.
  • [YHO+19] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
  • [YSH18] Wei Yi, Yaoran Sun, and Sailing He. Data augmentation using conditional gans for facial emotion recognition. In 2018 Progress in Electromagnetics Research Symposium (PIERS-Toyama), pages 710–714. IEEE, 2018.
  • [YWB19] Xin Yi, Ekta Walia, and Paul Babyn. Generative adversarial network in medical imaging: A review. Medical image analysis, 58:101552, 2019.
  • [ZCDLP18] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.