On the Statistical Complexity of Sample Amplification
Abstract
The “sample amplification” problem formalizes the following question: given $n$ i.i.d. samples drawn from an unknown distribution $P$, when is it possible to produce a larger set of $n+m$ samples which cannot be distinguished from $n+m$ i.i.d. samples drawn from $P$? In this work, we provide a firm statistical foundation for this problem by deriving generally applicable amplification procedures, lower bound techniques and connections to existing statistical notions. Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.
1 Introduction
Consider the following problem of manufacturing more data: an amplifier is given $n$ samples drawn i.i.d. from an unknown distribution $P$, and the goal is to generate a larger set of $n+m$ samples which are indistinguishable from i.i.d. samples drawn from $P$. How large can $m$ be as a function of $n$ and the distribution class in question? Are there sound and systematic ways to generate a larger set of samples? Is this task strictly easier than the learning task, in the sense that the number of samples required for generating $m$ additional samples is smaller than that required for learning $P$?
In our preliminary work [AGSV20], we formalized this problem as the sample amplification problem, considering total variation (TV) as the measure for indistinguishability.
Definition 1.1 (Sample Amplification).
Let $\mathcal{P}$ be a class of probability distributions over a domain $\mathcal{X}$. We say that $\mathcal{P}$ admits an $(n, n+m)$ sample amplification procedure if there exists a (possibly randomized) map $T: \mathcal{X}^n \to \mathcal{X}^{n+m}$ such that
$$\sup_{P \in \mathcal{P}} \mathrm{TV}\Big( \mathcal{L}\big(T(X_1,\ldots,X_n)\big),\ P^{\otimes(n+m)} \Big) \le \frac{1}{3}, \qquad (X_1,\ldots,X_n) \overset{\text{i.i.d.}}{\sim} P. \tag{1}$$
An equivalent way to view Definition 1.1 is as a game between two parties: an amplifier and a verifier. The amplifier gets $n$ samples drawn i.i.d. from the unknown distribution $P$ in the class $\mathcal{P}$, and her goal is to generate a larger dataset of $n+m$ samples which must be accepted with good probability by any verifier that also accepts a real dataset of $n+m$ i.i.d. samples from $P$ with good probability. Here, the verifier is computationally unbounded and knows the distribution $P$, but does not observe the amplifier’s original set of $n$ samples.
Along with being a natural statistical task, the sample amplification framework is also relevant from a practical standpoint. Currently, there is an enormous trend in the machine learning community to train models on datasets that have been enlarged in various ways. There are relatively transparent and classical approaches to achieve this, such as leveraging known invariances such as rotation or translation invariance to augment the dataset by including transformed versions of each original datapoint [SSP+03, KSH12, CZM+19, SK19, CZSL20]. More recently, deep generative models have been used to both directly enlarge training data and construct larger datasets consisting of samples with properties that are rare in naturally occurring datasets [ASE17, CMST17, FADK+18, MMKSM18, YSH18, SYPS19, CPS+19, YWB19, HRA+19, CMV+21, LSM+22, CQZ+22, LSW+23, BKK+22]. More opaque approaches such as MixUp [ZCDLP18] and related techniques [TUH18, YHO+19, VLB+19, HMC+20] which add a significant fraction of new datapoints that are explicitly not supported in the true data distribution are also very popular since they seem to improve the performance of the trained models. Given this current zoo of widely implemented approaches to enlarging datasets, there is a clear motivation for bringing a more principled statistical understanding to such approaches. One natural starting point is the statistical setting we consider that asks the extent to which datasets can be enlarged in a perfect sense—where it is not possible to distinguish the enlarged dataset from a set of i.i.d. draws. Moreover, this work lays a foundation for the ambitious broader goal of understanding how various amplification techniques interact with downstream learning algorithms and statistical estimators, and developing amplification techniques that are optimal for certain classes of such algorithms and estimators.
In [AGSV20], a subset of the authors introduced the sample amplification problem, and studied two classes of distributions: the Gaussian location model and discrete distribution model. For these examples, they characterized the statistical complexity of sample amplification and showed that it is strictly smaller than that of learning. In this paper, we work towards a general understanding of the statistical complexity of the sample amplification problem, and its relationship with learning. The main contributions of this paper are as follows:
Table 1: Summary of sample amplification sizes and procedures (UB = upper bound, LB = lower bound).

| Distribution class | Amplification | Procedure |
|---|---|---|
| Gaussian with unknown mean and fixed covariance (UB: Examples 4.1, 4.2, A.8; LB: Theorems 6.2, 6.5) | $n + \Theta(n/\sqrt{d})$ | Sufficiency/Learning |
| Gaussian with unknown mean and covariance (UB: Examples A.1, A.3; LB: Example A.20) | $n + \Theta(n/d)$ | Sufficiency |
| Gaussian with $s$-sparse mean and identity covariance (UB: Example A.12; LB: Example A.18) | $n + \Theta\big(n/\sqrt{s\log(ed/s)}\big)$ | Learning |
| Discrete distributions with support size at most $k$ (UB: Example A.9; LB: [AGSV20, Theorem 1]) | $n + \Theta(n/\sqrt{k})$ | Learning |
| Poissonized discrete distributions with support size at most $k$ (UB: Example A.16; LB: Example A.16) | $n + \Theta(n/\sqrt{k})$ (for $n = \Omega(\sqrt{k})$) | Learning |
| $d$-dim. product of Exponential distributions (UB: Examples A.5, A.11; LB: Theorem 6.5) | $n + \Theta(n/\sqrt{d})$ | Sufficiency/Learning |
| Uniform distribution on $d$-dim. rectangle (UB: Examples A.6, A.10; LB: Theorem 6.5) | $n + \Theta(n/\sqrt{d})$ | Sufficiency/Learning |
| $d$-dim. product of Poisson distributions (UB: Example A.14; LB: Theorem 6.5) | $n + \Theta(n/\sqrt{d})$ | Sufficiency+Learning |
1. Amplification via sufficiency. Our first contribution is a simple yet powerful procedure for sample amplification, i.e. to apply the sample amplification map only to sufficient statistics. Specifically, the learner computes a sufficient statistic $T_n$ from the samples $(X_1,\ldots,X_n)$, maps $T_n$ properly to some $\widehat{T}_{n+m}$, and generates $n+m$ new samples from the conditional distribution of the sample given the sufficient statistic, evaluated at $\widehat{T}_{n+m}$. Surprisingly, this simple idea leads to a much cleaner procedure than [AGSV20] under Gaussian location models which is also exactly optimal (cf. Theorem 6.2). The range of applicability also extends to general exponential families, with rate-optimal sample amplification performance. Specifically, for general $d$-dimensional exponential families with a mild moment condition, the sufficiency-based procedure achieves an $(n, n+\Theta(n/\sqrt{d}))$ sample amplification for $n$ large enough, which by our matching lower bounds in Section 6 is asymptotically minimax rate-optimal.
2. Amplification via learning. Our second contribution is another general sample amplification procedure that does not require the existence of a sufficient statistic, and also sheds light on the relationship between learning and sample amplification. This procedure essentially draws new samples from a rate-optimal estimate of the true distribution, and outputs a random permutation of the old and new samples. The procedure achieves an $(n, n+m)$ sample amplification with $m = \Theta(\sqrt{n}/\varepsilon_{\chi^2}(n))$, where $\varepsilon_{\chi^2}^2(n)$ denotes the minimax risk for learning $\mathcal{P}$ under the expected $\chi^2$ divergence given $n$ samples. This shows that learning to within a bounded $\chi^2$ divergence is sufficient for non-trivial sample amplification.
In addition, we show that for the special case of product distributions, it is important that the permutation step be applied coordinatewise to achieve the optimal sample amplification. Specifically, if $\mathcal{P} = \prod_{j=1}^d \mathcal{P}_j$ is a product class, the coordinatewise procedure achieves a better sample amplification driven by the coordinatewise $\chi^2$-estimation errors rather than the joint one.
We have summarized several examples in Table 1 where the sufficiency and/or learning based sample amplification procedures are optimal. Note that there is no golden rule for choosing one idea over the other, and there exists an example where the above two ideas must be combined.
3. Minimax lower bounds. Complementing our sample amplification procedures, we provide a general recipe for proving lower bounds for sample amplification. This recipe is intrinsically different from the standard techniques of proving lower bounds via hypothesis testing, for the sample amplification problem differs significantly from an estimation problem. In particular, specializing our recipe to product models shows that, for essentially all $d$-dimensional product models, an $(n, n + Cn/\sqrt{d})$ sample amplification is impossible for some absolute constant $C > 0$ independent of the product model.
For non-product models, the above powerful result does not directly apply, but proper applications of the general recipe could still lead to tight lower bounds for sample amplification. Specifically, we obtain matching lower bounds for all examples listed in Table 1, including the non-product examples.
We now provide several numerical simulations to suggest the potential utility of sample amplification. Recall that a practical motivation of sample amplification is to produce an enlarged dataset that can be fed into a distribution-agnostic algorithm in downstream applications. Here, we consider the following basic experiments in that vein:
[Figure 1: (a) mean absolute error of fourth moment estimation for the one-dimensional Gaussian; (b) mean absolute error of squared norm estimation for the high-dimensional Gaussian; (c) classification error in the binary classification experiment.]
- Fourth moment estimation for one-dimensional Gaussian: here we observe i.i.d. samples $X_1,\ldots,X_n$ from a one-dimensional Gaussian $N(\mu, \sigma^2)$, and we consider three estimators of the fourth moment $\mathbb{E}[X^4]$. The empirical estimator operates in a distribution-agnostic fashion and is simply the empirical fourth moment $\frac{1}{n}\sum_{i=1}^n X_i^4$. The plug-in estimator uses the knowledge of Gaussianity: it first estimates $(\mu, \sigma^2)$ and then uses the fourth moment $\hat{\mu}^4 + 6\hat{\mu}^2\hat{\sigma}^2 + 3\hat{\sigma}^4$ of the fitted Gaussian. The amplified estimator first amplifies the sample into $n+m$ samples via sufficiency (cf. Example 4.1), and then uses the empirical estimator based on the enlarged sample (a code sketch of this experiment follows this list). The mean absolute errors (MAEs) are displayed in Figure 1(a). We observe that although the empirical estimator based on the original sample has a large MAE, its performance is improved by using the amplified sample.
- Squared norm estimation for high-dimensional Gaussian: here we observe i.i.d. samples $X_1,\ldots,X_n$ from a high-dimensional Gaussian $N(\mu, I_d)$, and we again consider three estimators for the squared norm $\|\mu\|_2^2$. As before, the empirical estimator is simply the empirical counterpart of the target, and the plug-in estimator uses the knowledge of Gaussianity. As for the amplified estimator, it first amplifies the sample into $n+m$ samples via sufficiency (cf. Example 4.1), and then uses the empirical estimator based on the enlarged sample. The mean absolute errors are displayed in Figure 1(b). Here the empirical estimator outperforms the plug-in estimator due to a smaller bias, while the sample amplification further reduces the MAE as long as the size of amplification is not too large. This could be explained by the bias–variance tradeoff, where the amplified estimator interpolates between the empirical estimator (with no bias) and the plug-in estimator (with the smallest asymptotic variance).
- Binary classification: here we observe two clusters of covariates $X_1,\ldots,X_n \sim N(\mu_0, I_d)$ (with label $0$) and $Y_1,\ldots,Y_n \sim N(\mu_1, I_d)$ (with label $1$), with $\mu_0 = 0$ and $\mu_1$ being the first basis vector. The target is to train a classifier with a high classification accuracy on test data from the same distribution. The baseline classifier is logistic regression, which does not use the knowledge of Gaussianity. To apply sample amplification, we first amplify the sample in each class via either sufficiency (cf. Example 4.1) or learning (cf. Example A.8), and then run logistic regression on the amplified datasets. Figure 1(c) displays the classification errors of all three approaches, and shows that both amplification procedures help reduce the classification error even for small values of $n$.
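To make the first experiment concrete, below is a minimal Python sketch of the fourth-moment comparison. The parameter values, sample sizes, and trial count are illustrative choices of ours, not the settings behind Figure 1; the amplifier is the sufficiency-based one from Examples 4.1 and 4.2, specialized to one dimension with known variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def amplify_gaussian_mean(x, m, sigma=1.0):
    """Sufficiency-based amplification for N(mu, sigma^2) with known sigma:
    keep the sample mean of x, redraw the ancillary deviations (Example 4.2)."""
    z = rng.normal(0.0, sigma, size=len(x) + m)
    return x.mean() + (z - z.mean())  # the (n+m)-sample mean equals x.mean()

mu, sigma, n, m, trials = 1.0, 1.0, 100, 30, 2000    # illustrative values
truth = mu**4 + 6 * mu**2 * sigma**2 + 3 * sigma**4  # E[X^4] for N(mu, sigma^2)
mae_emp, mae_amp = 0.0, 0.0
for _ in range(trials):
    x = rng.normal(mu, sigma, size=n)
    y = amplify_gaussian_mean(x, m, sigma)
    mae_emp += abs(np.mean(x**4) - truth) / trials   # empirical estimator
    mae_amp += abs(np.mean(y**4) - truth) / trials   # amplified estimator
print(f"MAE (empirical): {mae_emp:.3f}   MAE (amplified): {mae_amp:.3f}")
```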
The above experiments demonstrate the potential for sample amplification to leverage knowledge of the data distribution to produce a larger dataset that is then fed into downstream distribution-agnostic algorithms. Some experiments (e.g. Figure 1(b)) also suggest a limit beyond which the amplification procedure alters the data distribution too much. We believe that rigorously examining amplification through the lens of the performance of downstream estimators and algorithms, including those illustrated in our numerical simulations, would be a fruitful direction for future work.
2 Connections, limitations and future work
As discussed above, it is commonplace in machine learning to increase the size of datasets using various heuristics, often resulting in large gains in downstream learning performance. However, a clear statistical understanding of when this is possible and what techniques are useful for this is missing. A natural starting point to get a better understanding is the formulation we consider that asks the extent to which datasets can be amplified in a perfect sense—where any verifier who knows the true distribution is not able to distinguish the amplified dataset from a set of i.i.d. draws.
A limitation of the sample amplification formulation described above is that the additive amplification factor is rather small (e.g., $m = \Theta(n/\sqrt{d})$ for $d$-dimensional exponential families). Moreover, we show matching lower bounds demonstrating that this factor cannot be improved even when $n$ is large enough to learn the distribution to non-trivial accuracy. However, it might be possible to achieve larger amplification factors with restricted verifiers, for instance, the class of verifiers corresponding to learning algorithms used for downstream tasks (see [AGSV20] for other possible classes of verifiers). Investigating the sample amplification problem with such restricted verifiers may be a practically fruitful future direction.
Despite this limitation, the sample amplification formulation does yield high-level insights that can inform the way datasets are amplified in practice. For instance, from the results in this paper, we know that sample amplification is possible for a broad class of distributions even when learning is not possible. Moreover, both our sufficiency-based and learning-based approaches modify the original data points in general, conforming to the lower bound in [AGSV20] that optimal amplification may be impossible if the amplifier returns a superset of the input dataset. These observations show that the folklore recipe of enlarging datasets by learning the data distribution and adding more samples from the learned distribution can be far from optimal.
Connections with other statistical notions. An equivalent view of Definition 1.1 is through Le Cam’s distance [LC72], a classical concept in statistics. The formal definition of Le Cam’s distance is given in Definition 3.1; roughly speaking, it measures the fundamental difference in statistical power between two models, without resorting to specific estimation procedures. The sample amplification problem is equivalent to the study of Le Cam’s distance between product models, where (1) is precisely equivalent to the statement that the $n$-fold product model is close to the $(n+m)$-fold product model in Le Cam’s distance. However, in the statistics literature, Le Cam’s distance was mainly used to study asymptotic equivalence, where a typical target is to show that Le Cam’s distance between two sequences of statistical models vanishes asymptotically. For example, showing that localized regular statistical models converge to Gaussian location models is the fundamental idea behind the Hájek–Le Cam asymptotic statistics; see [LC72, LC86, LCY90] and [VdV00, Chapter 9]. In nonparametric statistics, there is also a rich line of research [BL96, BCLZ02, BCLZ04, RSH19] establishing asymptotic (non-)equivalences, based on Le Cam’s distance, between density models, regression models, and Gaussian white noise models. In the above lines of work, only asymptotic results were typically obtained, with a fixed dimension and a possibly slow convergence rate. In contrast, we aim to obtain a non-asymptotic characterization of Le Cam’s distance in terms of the sample sizes $n, m$ and the dimension of the problem, a task which is largely underexplored in the literature.
Another related angle is that of reductions between statistical models. Over the past decade there has been a growing interest in constructing polynomial-time reductions between various statistical models (typically from the planted clique problem) to prove statistical-computational gaps; see, e.g. [BR13, MW15, BB19, BB20]. Sample amplification falls into this reduction framework, and aims to perform a reduction from the $n$-fold product model to the $(n+m)$-fold product model. While previous reduction techniques were mainly constructive and employed to prove computational lower bounds, in this paper we also develop general tools to prove limitations of all possible reductions purely from the statistical perspective.
Organization. The rest of this paper is organized as follows. Section 3 lists some notations and preliminaries for this paper, and in particular introduces the concept of Le Cam’s distance. Section 4 introduces a sufficiency-based procedure for sample amplification, with asymptotic properties for general exponential families and non-asymptotic performances in several specific examples. Section 5 is devoted to a learning-based procedure for sample amplification, with a general relationship between sample amplification and the estimation error, as well as its applications in several examples. Section 6 presents the general idea of establishing lower bounds for sample amplification, with a universal result specializing to product models. Section 7 discusses more examples in sample amplification and learning, and shows that these tasks are in general non-comparable. More concrete examples of both the upper and lower bounds, auxiliary lemmas, and proofs are relegated to the appendices.
3 Preliminaries
We use the following notation throughout this paper. For a random variable $X$, let $\mathcal{L}(X)$ be the law (i.e. probability distribution) of $X$. For a probability distribution $P$ on a probability space $(\Omega, \mathcal{F})$ and a measurable map $T$, let $T_{\#}P$ denote the pushforward probability measure, i.e. $T_{\#}P(A) = P(T^{-1}(A))$. For a probability measure $P$, let $P^{\otimes n}$ be the $n$-fold product measure. For a positive integer $n$, let $[n] := \{1, 2, \ldots, n\}$. We adopt the following asymptotic notation: for two non-negative sequences $(a_n)$ and $(b_n)$, we use $a_n \lesssim b_n$ to denote $\limsup_{n\to\infty} a_n/b_n < \infty$, $a_n \gtrsim b_n$ to denote $b_n \lesssim a_n$, and $a_n \asymp b_n$ to denote both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. We also use the notations $\lesssim_d, \gtrsim_d, \asymp_d$ to denote the respective meanings with hidden constants depending on $d$. For probability measures $P, Q$ defined on the same probability space, the total variation (TV) distance, Hellinger distance, Kullback–Leibler (KL) divergence, and chi-squared divergence are defined as follows:
$$\mathrm{TV}(P,Q) := \frac{1}{2}\int |\mathrm{d}P - \mathrm{d}Q|, \qquad H^2(P, Q) := \int \big(\sqrt{\mathrm{d}P} - \sqrt{\mathrm{d}Q}\big)^2,$$
$$\mathrm{KL}(P \,\|\, Q) := \int \mathrm{d}P \log\frac{\mathrm{d}P}{\mathrm{d}Q}, \qquad \chi^2(P \,\|\, Q) := \int \frac{(\mathrm{d}P - \mathrm{d}Q)^2}{\mathrm{d}Q}.$$
We will frequently use the following inequalities between the above quantities [Tsy09, Chapter 2]:
$$\frac{H^2(P,Q)}{2} \le \mathrm{TV}(P, Q) \le H(P,Q)\sqrt{1 - \frac{H^2(P,Q)}{4}}, \tag{2}$$
$$\mathrm{TV}(P, Q) \le \sqrt{\frac{\mathrm{KL}(P\|Q)}{2}}, \qquad \mathrm{KL}(P \,\|\, Q) \le \log\big(1 + \chi^2(P \,\|\, Q)\big) \le \chi^2(P \,\|\, Q). \tag{3}$$
Next we define several quantities related to Definition 1.1. For a given distribution class $\mathcal{P}$ and sample sizes $n$ and $n+m$, the minimax error of sample amplification is defined as
$$\varepsilon(n, n+m) := \inf_{T} \sup_{P \in \mathcal{P}} \mathrm{TV}\Big( \mathcal{L}\big(T(X_1,\ldots,X_n)\big),\ P^{\otimes(n+m)} \Big), \tag{4}$$
where the infimum is over all (possibly randomized) measurable mappings $T: \mathcal{X}^n \to \mathcal{X}^{n+m}$. For a given error level $\varepsilon$, the maximum size of sample amplification is the largest $m$ such that there exists an $(n, n+m)$ sample amplification, i.e.
$$m^\star(n, \varepsilon) := \max\{ m \in \mathbb{N} : \varepsilon(n, n+m) \le \varepsilon \}. \tag{5}$$
For ease of presentation, we often choose $\varepsilon$ to be a small constant (say $1/3$) and abbreviate the above quantity as $m^\star(n)$; we remark that all our results work for a generic $\varepsilon$. Finally, we also define the sample amplification complexity as the smallest $n$ such that an amplification from $n$ to $n+1$ samples is possible:
$$n^\star := \min\{ n \in \mathbb{N} : m^\star(n, 1/3) \ge 1 \}. \tag{6}$$
Note that all the above notions are instance-wise in the distribution class .
The minimax error of sample amplification (4) is precisely Le Cam’s distance between the corresponding statistical models in the statistics literature. We adopt the standard framework of statistical decision theory [Wal50]. A statistical model (or experiment) is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{F}, (P_\theta)_{\theta \in \Theta})$ where an observation $X \in \mathcal{X}$ is drawn from $P_\theta$ for some $\theta \in \Theta$. A decision rule $\delta$ is a regular conditional probability kernel from $(\mathcal{X}, \mathcal{F})$ to the family of probability distributions on a general action space $\mathcal{A}$, and there is a (measurable) loss function $L: \Theta \times \mathcal{A} \to \mathbb{R}_{\ge 0}$. The risk function of a given decision rule $\delta$ is defined as
$$R_\theta(\delta) := \int_{\mathcal{X}} \int_{\mathcal{A}} L(\theta, a)\, \delta(x, \mathrm{d}a)\, P_\theta(\mathrm{d}x). \tag{7}$$
Based on the definition of risk functions, we are ready to define a metric, known as Le Cam’s distance, between statistical models.
Definition 3.1 (Le Cam’s distance; see [LC72, LC86, LCY90]).
For two statistical models $\mathcal{M}_1 = (\mathcal{X}_1, \mathcal{F}_1, (P_{1,\theta})_{\theta\in\Theta})$ and $\mathcal{M}_2 = (\mathcal{X}_2, \mathcal{F}_2, (P_{2,\theta})_{\theta\in\Theta})$, Le Cam’s distance $\Delta(\mathcal{M}_1, \mathcal{M}_2)$ is defined as
$$\Delta(\mathcal{M}_1, \mathcal{M}_2) := \max\bigg\{ \sup_{L}\, \sup_{\delta_2}\, \inf_{\delta_1}\, \sup_{\theta\in\Theta} \Big( R^{\mathcal{M}_1}_\theta(\delta_1) - R^{\mathcal{M}_2}_\theta(\delta_2) \Big),\ \sup_{L}\, \sup_{\delta_1}\, \inf_{\delta_2}\, \sup_{\theta\in\Theta} \Big( R^{\mathcal{M}_2}_\theta(\delta_2) - R^{\mathcal{M}_1}_\theta(\delta_1) \Big) \bigg\},$$
where the supremum over the loss function $L$ is taken over all measurable functions $L: \Theta \times \mathcal{A} \to [0,1]$.
In the language of model deficiency introduced in [LC64], Le Cam’s distance is the smallest $\varepsilon \ge 0$ such that the model $\mathcal{M}_1$ is $\varepsilon$-deficient relative to $\mathcal{M}_2$, and $\mathcal{M}_2$ is also $\varepsilon$-deficient relative to $\mathcal{M}_1$. In the sample amplification problem, $\mathcal{M}_1 = (P^{\otimes n})_{P \in \mathcal{P}}$ and $\mathcal{M}_2 = (P^{\otimes(n+m)})_{P \in \mathcal{P}}$. Here, choosing the transition that simply discards $m$ of the $n+m$ observations in Definition 3.1 shows that $\mathcal{M}_2$ is $0$-deficient relative to $\mathcal{M}_1$, and the remaining quantity involving the deficiency of $\mathcal{M}_1$ relative to $\mathcal{M}_2$ exactly reduces to the minimax error of sample amplification in (4). Therefore, studying the complexity of sample amplification is equivalent to characterizing $\Delta(\mathcal{M}_1, \mathcal{M}_2)$.
4 Sample amplification via sufficient statistics
The first idea we present for sample amplification is the classical idea of reduction by sufficiency. Albeit simple, the sufficiency-based idea reduces the problem of generating multiple random vectors to a simpler problem of generating only a few vectors, achieves the optimal complexity of sample amplification in many examples, and is easy to implement.
4.1 The general idea
We first recall the definition of sufficient statistics: in a statistical model $(\mathcal{X}, \mathcal{F}, (P_\theta)_{\theta\in\Theta})$ with observation $X$, a statistic $T = T(X)$ is sufficient if and only if $\theta \to T(X) \to X$ forms a Markov chain. A classical result in statistical decision theory is reduction by sufficiency, i.e. only the sufficient statistic needs to be maintained to perform statistical tasks, as the conditional distribution $\mathcal{L}(X \mid T)$ does not depend on the unknown parameter $\theta$. In terms of Le Cam’s distance, let $\mathcal{M}^T$ be the statistical experiment associated with the observation $T(X)$; then sufficiency of $T$ implies that $\Delta(\mathcal{M}, \mathcal{M}^T) = 0$. Hereafter, we will call $\mathcal{M}^T$ the $T$-reduced model, or simply the reduced model in short.
Reduction by sufficiency could be applied to sample amplification in a simple way, with the general algorithm displayed in Algorithm 1. Suppose that both models $(P^{\otimes n})_{P\in\mathcal{P}}$ and $(P^{\otimes(n+m)})_{P\in\mathcal{P}}$ admit sufficient statistics $T_n = T_n(X_1,\ldots,X_n)$ and $T_{n+m} = T_{n+m}(Y_1,\ldots,Y_{n+m})$, respectively. Algorithm 1 claims that it suffices to perform sample amplification on the reduced models, i.e. to construct a randomization map $T$ from $T_n$ to $\widehat{T}_{n+m}$. Concretely, the algorithm decomposes into three steps:
1. Step I: map $(X_1,\ldots,X_n)$ to $T_n$. This step is straightforward: we simply compute the sufficient statistic $T_n = T_n(X_1,\ldots,X_n)$.
2. Step II: apply a randomization map in the reduced model. Upon choosing the map $T$, we simply compute $\widehat{T}_{n+m} = T(T_n)$, with the target that the TV distance between $\mathcal{L}(\widehat{T}_{n+m})$ and $\mathcal{L}(T_{n+m})$ is uniformly small. The concrete choice of $T$ depends on the specific model.
3. Step III: map $\widehat{T}_{n+m}$ to $(Y_1,\ldots,Y_{n+m})$. By sufficiency of $T_{n+m}$, the conditional distribution $\mathcal{L}(Y_1,\ldots,Y_{n+m} \mid T_{n+m})$ does not depend on the unknown model. Therefore, after replacing the true statistic $T_{n+m}$ by $\widehat{T}_{n+m}$, it is feasible to generate $(Y_1,\ldots,Y_{n+m}) \sim \mathcal{L}(Y_1,\ldots,Y_{n+m} \mid T_{n+m} = \widehat{T}_{n+m})$. To simulate this random vector, it suffices to choose any distribution in the class and generate the samples conditioned on the value of the sufficient statistic. This step may suffer from computational issues which will be discussed in Section 4.2.
The validity of this idea simply follows from
$$\sup_{P\in\mathcal{P}} \mathrm{TV}\big( \mathcal{L}(Y_1,\ldots,Y_{n+m}),\ P^{\otimes(n+m)} \big) \le \sup_{P\in\mathcal{P}} \mathrm{TV}\big( \mathcal{L}(\widehat{T}_{n+m}),\ \mathcal{L}(T_{n+m}) \big),$$
or equivalently, under each $P \in \mathcal{P}$,
$$\mathrm{TV}\big( \mathcal{L}(Y_1,\ldots,Y_{n+m}),\ P^{\otimes(n+m)} \big) \overset{\text{(a)}}{=} \mathrm{TV}\Big( \mathbb{E}_{\widehat{T}_{n+m}}\big[ \mathcal{L}(Y^{n+m} \mid T_{n+m} = \widehat{T}_{n+m}) \big],\ \mathbb{E}_{T_{n+m}}\big[ \mathcal{L}(Y^{n+m} \mid T_{n+m}) \big] \Big) \le \mathrm{TV}\big( \mathcal{L}(\widehat{T}_{n+m}),\ \mathcal{L}(T_{n+m}) \big),$$
where (a) is due to the identity $P^{\otimes(n+m)} = \mathbb{E}_{T_{n+m}}[\mathcal{L}(Y^{n+m} \mid T_{n+m})]$, and the final step is the data-processing inequality. In other words, it suffices to work on the reduced models and find the map $T$ between sufficient statistics.
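In code, Algorithm 1 is a three-step pipeline. The following Python skeleton is a sketch of this structure; the three callables are model-specific ingredients that must be supplied by the user, and all names are ours.

```python
def amplify_via_sufficiency(x, compute_T, map_T, sample_given_T):
    """Skeleton of Algorithm 1 (a sketch, not a full implementation).

    compute_T:      Step I.   Sufficient statistic T_n of the n input samples.
    map_T:          Step II.  Randomization map T_n -> hat{T}_{n+m} in the reduced model.
    sample_given_T: Step III. Draw n+m samples from the parameter-free conditional
                    distribution of the sample given its sufficient statistic.
    """
    return sample_given_T(map_T(compute_T(x)))
```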
This idea of reduction by sufficiency simplifies the design of sample amplification procedures. Unlike in the original models, where $(X_1,\ldots,X_n)$ and $(Y_1,\ldots,Y_{n+m})$ typically take values in spaces of different dimensions, in the reduced models the sufficient statistics $T_n$ and $T_{n+m}$ are usually drawn from the same space. A simple example is as follows.
Example 4.1 (Gaussian location model with known covariance).
Consider observations $X_1,\ldots,X_n \overset{\text{i.i.d.}}{\sim} N(\theta, \Sigma)$ from the Gaussian location model with an unknown mean $\theta \in \mathbb{R}^d$ and a known covariance $\Sigma$. To amplify to $n+m$ samples, note that the sample mean vector $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ is a sufficient statistic here, with
$$\bar{X}_n \sim N\Big(\theta, \frac{\Sigma}{n}\Big).$$
Now consider the identity map between sufficient statistics used with Algorithm 1. The amplified samples $(Y_1,\ldots,Y_{n+m})$ are drawn from $P^{\otimes(n+m)}$ conditioned on the event that $\bar{Y}_{n+m} = \bar{X}_n$. For every mean vector $\theta$ we can upper bound the amplification error of this approach:
$$\mathrm{TV}\Big( N\big(\theta, \tfrac{\Sigma}{n}\big),\ N\big(\theta, \tfrac{\Sigma}{n+m}\big) \Big) \overset{\text{(a)}}{\le} \sqrt{\frac{1}{2}\,\mathrm{KL}\Big( N\big(\theta, \tfrac{\Sigma}{n}\big) \,\Big\|\, N\big(\theta, \tfrac{\Sigma}{n+m}\big) \Big)} = \sqrt{\frac{d}{4}\Big( \frac{m}{n} - \log\Big(1 + \frac{m}{n}\Big) \Big)} = O\Big( \frac{m\sqrt{d}}{n} \Big),$$
where (a) is due to (3), and the last step holds whenever $m = O(n)$. Therefore, we could amplify $m = \Theta(n/\sqrt{d})$ additional samples based on $n$ observations, and the complexity of sample amplification in (6) is $n^\star = \Theta(\sqrt{d})$. In contrast, learning this distribution to within a small TV distance requires $\Theta(d)$ samples, which is strictly harder than sample amplification. This example recovers the upper bound of [AGSV20] with a much simpler analysis, and in later sections we will show that this approach is exactly minimax optimal.
We make two remarks for the above example. First, the amplified samples $(Y_1,\ldots,Y_{n+m})$ are no longer independent, either marginally or conditionally on the original samples. Therefore, the above approach is fundamentally different from first estimating the distribution and then generating independent samples from the estimate. Second, the amplified samples do not contain the original samples as a subset. In contrast, a tempting approach for sample amplification is to add fake samples to the original observations. However, [AGSV20] showed that in the above model any sample amplification approach containing the original samples cannot succeed unless $n = \Omega(d)$, and our approach conforms to this result. More examples will be presented in Section A.1.
4.2 Computational issues
A natural computational question in Algorithm 1 is how to sample from the conditional distribution $\mathcal{L}(Y_1,\ldots,Y_{n+m} \mid T_{n+m} = \widehat{T}_{n+m})$ in a computationally efficient way. Under the additional mild assumption that the sufficient statistic $T_{n+m}$ is also complete (which is easy to verify in exponential families), the conditional distribution could be efficiently sampled if we could find a statistic $A = A(Y_1,\ldots,Y_{n+m})$ with the following two properties:
1. $A$ is ancillary, i.e. the distribution of $A$ does not depend on the model parameter $\theta$;
2. there is a (measurable) bijection between $(T_{n+m}, A)$ and $(Y_1,\ldots,Y_{n+m})$, i.e. $(Y_1,\ldots,Y_{n+m}) = f(T_{n+m}, A)$ almost surely for some measurable $f$.
In fact, if such an $A$ exists, then under any $\theta$,
$$\mathcal{L}\big( Y_1,\ldots,Y_{n+m} \mid T_{n+m} \big) \overset{\text{(a)}}{=} \mathcal{L}\big( f(T_{n+m}, A) \mid T_{n+m} \big) \overset{\text{(b)}}{=} \mathcal{L}\big( f(t, A) \big)\big|_{t = T_{n+m}},$$
where (a) is due to the assumed bijection between $(T_{n+m}, A)$ and $(Y_1,\ldots,Y_{n+m})$, and (b) is due to a classical result of Basu [Bas55, Bas58] that the complete sufficient statistic $T_{n+m}$ and the ancillary statistic $A$ are independent. Therefore, by the ancillarity of $A$, we could sample $(Y_1',\ldots,Y_{n+m}')$ from any fixed distribution in the class and compute the statistic $A' = A(Y_1',\ldots,Y_{n+m}')$; then $f(\widehat{T}_{n+m}, A')$ follows the desired conditional distribution $\mathcal{L}(Y_1,\ldots,Y_{n+m} \mid T_{n+m} = \widehat{T}_{n+m})$. An example of this procedure is illustrated below.
Example 4.2 (Computation in Gaussian location model).
Consider the setting of Example 4.1 where , and the target is to sample from the distribution . In this model, is complete and sufficient, and we choose with for all . Clearly is ancillary, and could be recovered from via
Therefore, the choice of satisfies both conditions. Consequently, we can draw , compute (where ), and recover from .
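A minimal NumPy sketch of this recipe for a general dimension $d$ and known covariance $\Sigma$ (the function and variable names are ours):

```python
import numpy as np

def amplify_gaussian_location(X, m, Sigma, rng):
    """Example 4.2 as code (sketch): draw Z_i ~ N(0, Sigma) i.i.d., form the
    ancillary A_i = Z_i - mean(Z), and output Y_i = mean(X) + A_i."""
    n, d = X.shape
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n + m, d)) @ L.T   # Z_i ~ N(0, Sigma)
    A = Z - Z.mean(axis=0)                      # ancillary deviations
    return X.mean(axis=0) + A                   # (n+m)-sample mean equals mean(X)

# usage sketch:
# Y = amplify_gaussian_location(X, m=10, Sigma=np.eye(3), rng=np.random.default_rng(0))
```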
The proper choice of $A$ depends on the specific model and may require some effort to find; we refer to Section A.1 for more examples. We remark that in general there is no golden rule for finding $A$. One tempting approach is to find a maximal ancillary statistic $A$, such that any other ancillary statistic must be a function of $A$. This idea is motivated by the existence of a minimal sufficient statistic under mild conditions, together with a known computationally efficient approach to compute it. However, for ancillary statistics there is typically no maximal one in the above sense, and there may exist uncountably many “maximal” ancillary statistics which are not equivalent to each other. From the measure-theoretic perspective, this is due to the fact that the family of all ancillary sets is not closed under intersection and thus not a $\sigma$-algebra. In addition, even if a proper notion of a “maximal” or “essentially maximal” ancillary statistic could be defined, there is no guarantee that it satisfies the bijection condition, and it is hard to determine whether a given ancillary statistic is maximal or not. We refer to [Bas59, LS92] for detailed discussions of ancillarity in mathematical statistics.
There is also another procedure for sampling from $\mathcal{L}(Y_1,\ldots,Y_{n+m} \mid T_{n+m} = \widehat{T}_{n+m})$ in the conditional inference literature [Wil82]. Specifically, for each $i = 1,\ldots,n+m$, this approach sequentially generates the observation $Y_i$ from the one-dimensional conditional distribution $\mathcal{L}(Y_i \mid T_{n+m} = \widehat{T}_{n+m}, Y_1, \ldots, Y_{i-1})$, which is a simple task as long as we know its CDF. Although this works in simple models such as the Gaussian location model above, in more complicated models exact computation of the CDF is typically not feasible.
4.3 General theory for exponential families
In this section, we show that a general sample amplification phenomenon holds for a rich class of exponential families, and is achieved by the sufficiency-based procedure in Algorithm 1. Specifically, we consider the following natural exponential family.
Definition 4.3 (Exponential family).
The exponential family $(P_\theta)_{\theta \in \Theta}$ of probability measures is determined by
$$\frac{\mathrm{d}P_\theta}{\mathrm{d}\mu}(x) = \exp\big( \langle \theta, T(x) \rangle - A(\theta) \big),$$
where $\theta$ is the natural parameter with $\theta \in \Theta := \{\theta \in \mathbb{R}^d : A(\theta) < \infty\}$, $T(x) \in \mathbb{R}^d$ is the sufficient statistic, $A(\theta)$ is the log-partition function, and $\mu$ is the base measure.
The exponential family includes many widely used probability distributions such as the Gaussian, Gamma, Poisson, Exponential, and Beta distributions. In the exponential family, the statistic $T(X)$ is sufficient and complete, and several well-known identities include $\mathbb{E}_\theta[T(X)] = \nabla A(\theta)$ and $\mathrm{Cov}_\theta(T(X)) = \nabla^2 A(\theta)$. We refer to [DY79] for a mathematical theory of the exponential family.
To establish a general theory of sample amplification for exponential families, we shall make the following assumptions on the exponential family.
Assumption 1 (Continuity).
The parameter set $\Theta$ has a non-empty interior. Under each $\theta \in \Theta$, the distribution of the sufficient statistic $T(X)$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$.
Assumption 2 (Moment condition $(M_s)$).
For a given integer $s \ge 3$, it holds that
$$\sup_{\theta \in \Theta} \mathbb{E}_\theta\Big[ \big\| (\nabla^2 A(\theta))^{-1/2}\big( T(X) - \nabla A(\theta) \big) \big\|_2^s \Big] < \infty.$$
We call this the moment condition $(M_s)$.
Assumption 1 requires an exponential family of continuous distributions. The motivation is that for a continuous exponential family, the sufficient statistics $T_n$ and $T_{n+m}$ for different sample sizes take continuous values in the same space, and it is easier to construct a general transformation between them. We will propose a different sample amplification approach for discrete statistical models in Section 5. Assumption 2 is a moment condition on the normalized statistic $(\nabla^2 A(\theta))^{-1/2}(T(X) - \nabla A(\theta))$, whose moments always exist as the moment generating function of $T(X)$ exists around the origin. The moment condition $(M_s)$ requires that the above normalized statistic have a uniformly bounded $s$-th moment over all $\theta \in \Theta$, which holds in several examples (such as Gaussian, exponential, and Pareto) or by considering a slightly smaller parameter set bounded away from the boundary of $\Theta$. The following lemma presents a sufficient criterion for the moment condition $(M_s)$.
Lemma 4.4.
If the log-partition function $A$ satisfies the self-concordance condition
$$\big| \nabla^3 A(\theta)[u, u, u] \big| \le 2\big( \nabla^2 A(\theta)[u, u] \big)^{3/2} \qquad \text{for all } \theta \in \Theta,\ u \in \mathbb{R}^d,$$
then the exponential family satisfies the moment condition $(M_s)$ for every $s \ge 3$. Here for a $k$-tensor $T$ and vectors $u_1,\ldots,u_k$, the quantity $T[u_1,\ldots,u_k]$ denotes the value of $\sum_{i_1,\ldots,i_k} T_{i_1,\ldots,i_k} (u_1)_{i_1} \cdots (u_k)_{i_k}$.
The condition in Lemma 4.4 is called the self-concordance condition, a key concept in interior-point methods for convex optimization [Nes03]. For example, all quadratic and logarithmic functions are self-concordant (which correspond to the Gaussian, exponential, and Pareto distributions), and self-concordance is always fulfilled when $\Theta$ is compact.
Given any exponential family satisfying Assumptions 1 and 2, we will show that a simple procedure gives a sample amplification of size $m = \Theta(n/\sqrt{d})$. Let $X_1,\ldots,X_n$ be i.i.d. samples drawn from a distribution $P_\theta$ taking the general form in Definition 4.3; then it is clear that the sample average
$$T_n := \frac{1}{n}\sum_{i=1}^n T(X_i)$$
is a sufficient statistic by the factorization theorem. We will apply the general Algorithm 1 with the identity map between sufficient statistics, i.e. $\widehat{T}_{n+m} = T_n$. The next theorem shows the performance of this approach.
Theorem 4.5.
Theorem 4.5 shows that the above simple procedure could achieve a sample amplification from $n$ to $n + \Theta(n/\sqrt{d})$ samples in general continuous exponential families, provided that $n$ is large enough. The main idea behind the proof of Theorem 4.5 is also simple. We show that the distribution of $T_n$ is approximately $N(\nabla A(\theta), \nabla^2 A(\theta)/n)$ by the CLT, apply the same CLT to $T_{n+m}$, and then compute the TV distance between the two Gaussians as in Example 4.1. Theorem 4.5 is then a direct consequence of the triangle inequality:
$$\mathrm{TV}\big( \mathcal{L}(T_n), \mathcal{L}(T_{n+m}) \big) \le \mathrm{TV}\Big( \mathcal{L}(T_n),\ N\big( \nabla A(\theta), \tfrac{\nabla^2 A(\theta)}{n} \big) \Big) + \mathrm{TV}\Big( N\big( \nabla A(\theta), \tfrac{\nabla^2 A(\theta)}{n} \big),\ N\big( \nabla A(\theta), \tfrac{\nabla^2 A(\theta)}{n+m} \big) \Big) + \mathrm{TV}\Big( N\big( \nabla A(\theta), \tfrac{\nabla^2 A(\theta)}{n+m} \big),\ \mathcal{L}(T_{n+m}) \Big).$$
Note that Assumption 1 ensures a vanishing TV distance in the Gaussian approximation, and Assumption 2 enables us to apply Berry–Esseen type arguments and obtain an explicit polynomial convergence rate in $n$ for the Gaussian approximation.
The main drawback of Theorem 4.5 is that there is a hidden constant depending on the dimension $d$, thus it does not imply that an $(n, n+\Theta(n/\sqrt{d}))$ sample amplification is possible as long as $n = \Omega(\sqrt{d})$. To tackle this issue, we need to improve the first term in Theorem 4.5 and find the best possible dependence of the constant on $d$. We remark that this is a challenging task in probability theory: although the convergence rates of the CLT in both TV [Pro52, SM62, BC14, BC16] and KL [Bar86, BCG14] were studied, almost all of these works solely focused on the convergence rate in $n$, leaving the tight dependence on $d$ still open. Moreover, a direct computation shows that even if the random vector $T(X)$ has independent components, the Gaussian approximation error is typically at least of order $\sqrt{d/n}$. Therefore, under this proof technique, a vanishing first term in Theorem 4.5 requires that $n = \Omega(d)$, which is already larger than the anticipated sample amplification complexity $\Theta(\sqrt{d})$.
To overcome the above difficulties, we make the following changes to both the assumption and the analysis. First, to avoid the unknown dependence on $d$, we additionally assume a product exponential family, i.e. $P_\theta = \prod_{j=1}^d P_{\theta_j}$, where each $P_{\theta_j}$ is a one-dimensional exponential family. Exploiting the product structure enables us to find a constant depending linearly on $d$. Second, we improve the dependence on $n$ by applying a higher-order CLT result to $T_n$ and $T_{n+m}$, known as the Edgeworth expansion [BR10]. For any $s \ge 3$ and a one-dimensional distribution with $s$ finite moments, the signed measure of the Edgeworth expansion on $\mathbb{R}$ is
$$\mu_{n,s}(\mathrm{d}x) = \varphi(x)\bigg( 1 + \sum_{j=1}^{s-2} \frac{p_j(x)}{n^{j/2}} \bigg)\, \mathrm{d}x, \tag{8}$$
where $\varphi$ is the density of a standard normal random variable on $\mathbb{R}$, and each $p_j$ is a polynomial of degree at most $3j$ which depends only on the first $j+2$ moments of the distribution. We note that unlike the CLT, the general Edgeworth expansion is a signed measure with possibly negative densities; however, it is close to a Gaussian with a vanishing approximation error. Such a higher-order expansion enables us to have better Berry–Esseen type bounds, but upper bounding the resulting TV distance becomes more complicated and requires handling the Gaussian part and the correction part separately; see Appendix B.2 for details. In particular, we could improve the error dependence on $n$ beyond the $O(n^{-1/2})$ rate of the plain CLT.
Formally, the next theorem shows a better sample amplification performance for product exponential families.
Theorem 4.6.
Theorem 4.6 shows that for product exponential families, we not only achieve the amplification size $m = \Theta(n/\sqrt{d})$, but also attain a sample complexity $n^\star = O(\sqrt{d})$ for sample amplification. This additional result on the sample complexity is important in the sense that, even in regimes where distribution learning is impossible, it is still possible to perform sample amplification. Although the independence or even the exponential family assumption could be strong in practice, in Section A.1 we show that both phenomena, $m^\star(n) = \Theta(n/\sqrt{d})$ and $n^\star = \Theta(\sqrt{d})$, hold in many natural models.
5 Sample amplification via learning
The sufficiency-based approach to sample amplification is not always desirable. First, models outside the exponential family typically do not admit non-trivial sufficient statistics, and therefore the reduction by sufficiency may not be very helpful. Second, the identity map applied to the sufficient statistics only works for continuous models, and incurs a too-large TV distance when the underlying model is discrete. Third, the previous approaches are not directly related to learning the model, so a general relationship between learning and sample amplification is still missing. In this section, we propose another sample amplification approach, and show how a good learner helps to obtain a good sample amplifier.
5.1 The general result
For a class of distributions $\mathcal{P}$ and i.i.d. observations $X_1,\ldots,X_n$ drawn from an unknown $P \in \mathcal{P}$, we define the following notion of the $\chi^2$-estimation error.
Definition 5.1 ($\chi^2$-estimation error).
For a class of distributions $\mathcal{P}$ and sample size $n$, the $\chi^2$-estimation error is defined to be the minimax estimation error under the expected $\chi^2$-divergence:
$$\varepsilon_{\chi^2}^2(n) := \inf_{\widehat{P}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P^{\otimes n}}\big[ \chi^2\big( \widehat{P} \,\big\|\, P \big) \big],$$
where the infimum is taken over all possible distribution estimators $\widehat{P}$ based on $n$ samples.
Roughly speaking, the $\chi^2$-estimation error in the above definition characterizes the complexity of the distribution class $\mathcal{P}$ in terms of distribution learning under the $\chi^2$-divergence. The main result of this section shows that the error of sample amplification in (4) can be upper bounded using the $\chi^2$-estimation error.
Theorem 5.2.
For a general class $\mathcal{P}$ and sample sizes $n, m$, it holds that
The following corollary is immediate from Theorem 5.2.
Corollary 5.3.
An sample amplification is possible if . Moreover, the sample complexity of amplification in (6) satisfies
Remark 5.4.
The above result provides a quantitative guarantee that sample amplification is easier than learning (under the $\chi^2$-divergence). Specifically, the sample complexity of learning is the smallest $n$ such that the $\chi^2$-estimation error $\varepsilon_{\chi^2}^2(n)$ drops below a small constant, while Corollary 5.3 shows that the complexity for amplification is at most the smallest $n$ satisfying a strictly weaker condition. As $\varepsilon_{\chi^2}^2(n)$ is non-increasing in $n$, this means that the learning complexity is in general larger.
When the distribution class has a product structure $\mathcal{P} = \prod_{j=1}^d \mathcal{P}_j$, the next theorem shows a better relationship between the amplification error and the learning error.
Theorem 5.5.
For a product class $\mathcal{P} = \prod_{j=1}^d \mathcal{P}_j$ and sample sizes $n, m$, it holds that
Corollary 5.6.
For product models, an $(n, n+m)$ sample amplification is achievable whenever the upper bound in Theorem 5.5 is at most $1/3$. Moreover, this yields a corresponding upper bound on the sample complexity of amplification in (6).
We observe that the result of Theorem 5.5 typically improves over Theorem 5.2 for product models: since the $\chi^2$-divergence compounds multiplicatively across coordinates, the joint $\chi^2$-estimation error is typically much larger than the sum of the coordinatewise $\chi^2$-estimation errors. Moreover, there are scenarios where the joint $\chi^2$-estimation error blows up while the coordinatewise errors remain small, so that Theorem 5.5 provides a substantial improvement over Theorem 5.2. For example, when $\mathcal{P}$ is the Gaussian location model $\{N(\theta, I_d) : \theta \in \mathbb{R}^d\}$, it could be verified that the $\chi^2$-estimation error of each coordinate is $\Theta(1/n)$, while the joint $\chi^2$-estimation error grows as $\exp(\Theta(d/n)) - 1$, which is not even bounded unless $n = \Omega(d)$. Hence, in the important regime $\sqrt{d} \lesssim n \ll d$ where learning is impossible but sample amplification is possible, Theorem 5.5 is strictly stronger than Theorem 5.2.
Remark 5.7.
In the above Gaussian location model, there is an alternative way to see that Theorem 5.5 is strictly stronger than Theorem 5.2. We will see that the shuffling approach achieving the bound in Theorem 5.2 keeps all the observed samples, whereas [AGSV20] shows that all such approaches must incur an $\Omega(d)$ sample complexity for the Gaussian model. In contrast, Theorem 5.5 and Corollary 5.6 give an $O(\sqrt{d})$ sample complexity of amplification in the Gaussian location model.
5.2 The shuffling approach
This section presents the sample amplification approaches achieving Theorems 5.2 and 5.5. The idea is simple: we find a good distribution learner $\widehat{P}$ which attains the rate-optimal $\chi^2$-estimation error, draw $m$ additional samples from $\widehat{P}$, and shuffle them with the original samples uniformly at random. This approach suffices to achieve the sample amplification error in Theorem 5.2, while for Theorem 5.5 an additional trick is applied: instead of shuffling the whole vectors, we independently shuffle each coordinate. For technical reasons, in both approaches we apply sample splitting: the first $\lceil n/2 \rceil$ samples are used for the estimation of $\widehat{P}$, while the remaining samples are used for shuffling. The algorithms are summarized in Algorithms 2 and 3.
The following lemma is the key to analyze the performance of the shuffling approach.
Lemma 5.8.
Let $X_1,\ldots,X_n$ be i.i.d. samples drawn from $P$, and $Y_1,\ldots,Y_m$ be i.i.d. samples drawn from $Q$, independent of $(X_1,\ldots,X_n)$. Let $(Z_1,\ldots,Z_{n+m})$ be a uniformly random permutation of $(X_1,\ldots,X_n,Y_1,\ldots,Y_m)$, and let $\mu_{n,m}$ be the distribution of the random mixture $(Z_1,\ldots,Z_{n+m})$. Then
Based on Lemma 5.8, the advantage of random shuffling is clear: if we simply append $(Y_1,\ldots,Y_m)$ to the end of the original sequence $(X_1,\ldots,X_n)$, then the resulting $\chi^2$-divergence from $P^{\otimes(n+m)}$ is exactly $(1 + \chi^2(Q \,\|\, P))^m - 1$. By comparing with the upper bound in Lemma 5.8, we observe that a smaller coefficient is applied to the individual $\chi^2$-divergence $\chi^2(Q\,\|\,P)$ after a random shuffle. The proofs of Theorems 5.2 and 5.5 are then clear, where we simply take $Q = \widehat{P}$ and apply the above lemma. Note that the statement of Lemma 5.8 requires that $Q$ be independent of $(X_1,\ldots,X_n)$, which is exactly the reason why we apply sample splitting in Algorithms 2 and 3. The proof of Lemma 5.8 is presented in Appendix C, and the complete proofs of Theorems 5.2 and 5.5 are relegated to Appendix B. We also include concrete examples of the shuffling approach in Section A.2.
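The following Python sketch mirrors the description of Algorithms 2 and 3 above (the precise algorithms are given in the paper; the interface with a user-supplied learner `fit` and sampler `sample` is our own illustration). The input `X` is an $n \times d$ array.

```python
import numpy as np

def amplify_by_shuffling(X, m, fit, sample, rng, coordinatewise=False):
    """Learning-based amplification (sketch): split the sample, fit an estimate
    on the first half, draw m fresh samples from it, and shuffle them together
    with the held-back second half."""
    n = len(X)
    X1, X2 = X[: n // 2], X[n // 2 :]            # sample splitting
    P_hat = fit(X1)                              # chi^2-rate-optimal learner (user-supplied)
    block = np.concatenate([X2, sample(P_hat, m)], axis=0)
    if coordinatewise:                           # Algorithm 3: for product models,
        for j in range(block.shape[1]):          # permute each coordinate independently
            block[:, j] = block[rng.permutation(len(block)), j]
    else:                                        # Algorithm 2: permute whole vectors
        block = block[rng.permutation(len(block))]
    return np.concatenate([X1, block], axis=0)   # n + m samples in total
```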
6 Minimax lower bounds
In this section we establish minimax lower bounds for sample amplification in different statistical models. Section 6.1 presents a general and tight approach for establishing lower bounds, which leads to an exact sample amplification result for the Gaussian location model. Based on this result, we show that for $d$-dimensional continuous exponential families, the sample amplification size cannot exceed $O(n/\sqrt{d})$ for a sufficiently large sample size $n$. Section 6.2 provides a specialized criterion for product models, where we show that the $\Omega(\sqrt{d})$ lower bound on the sample complexity and the $O(n/\sqrt{d})$ upper bound on the amplification size are always valid, with hidden constants independent of all parameters. Section A.3 lists several concrete examples where our general idea could be properly applied to provide tight and non-asymptotic results.
6.1 General idea
The main tool to establish the lower bound is the first quantity in Definition 3.1 of Le Cam’s distance. Specifically, for a class of distributions $\mathcal{P}$, let $\pi$ be a given prior distribution on $\mathcal{P}$, and $L$ be a given non-negative loss function upper bounded by one. Given $n$ i.i.d. samples from an unknown distribution $P$ in $\mathcal{P}$, define the following Bayes risk and minimax risk:
$$r_\pi(n) := \inf_{\widehat{a}}\, \mathbb{E}_{P \sim \pi}\, \mathbb{E}_{X^n \sim P^{\otimes n}}\big[ L(P, \widehat{a}(X^n)) \big], \qquad r(n) := \inf_{\widehat{a}}\, \sup_{P \in \mathcal{P}}\, \mathbb{E}_{X^n \sim P^{\otimes n}}\big[ L(P, \widehat{a}(X^n)) \big],$$
where the infimum is over all possible estimators $\widehat{a}$ taking values in the action space. The following result is a direct consequence of Definition 3.1.
Lemma 6.1.
For any integers $n, m$, any class of distributions $\mathcal{P}$, any prior $\pi$ on $\mathcal{P}$, and any loss function $L$ taking values in $[0,1]$, the minimax error of sample amplification in (4) satisfies
$$\varepsilon(n, n+m) \ge \max\big\{ r_\pi(n) - r_\pi(n+m),\ r(n) - r(n+m) \big\}.$$
Based on Lemma 6.1, it suffices to find an appropriate prior distribution $\pi$ and loss function $L$, and then compute (or lower bound) the difference between the Bayes or minimax risks under different sample sizes. We note that the lower bound technique in [AGSV20], albeit seemingly different, is a special case of Lemma 6.1. Specifically, the authors of [AGSV20] designed a set-valued mapping $P \mapsto S_P \subseteq \mathcal{X}^{n+m}$ such that $P^{\otimes(n+m)}(S_P)$ is at least a large constant for all $P \in \mathcal{P}$, while there is a prior distribution $\pi$ on $\mathcal{P}$ such that
$$\mathbb{E}_{X^n}\bigg[ \sup_{y \in \mathcal{X}^{n+m}} \mathbb{P}_{P \sim \pi(\cdot \mid X^n)}\big( y \in S_P \big) \bigg] \text{ is at most a small constant.} \tag{9}$$
If the above conditions hold, then an $(n, n+m)$ sample amplification is impossible. Note that the probability term in (9) is the maximum coverage probability of the sets $S_P$ where $P$ follows the posterior distribution $\pi(\cdot \mid X^n)$, which is a well-defined geometric object when both the sets $S_P$ and the posterior are known. To see that the above approach falls into our framework, consider the loss function $L(P, y) = \mathbb{1}(y \notin S_P)$ with action space $\mathcal{A} = \mathcal{X}^{n+m}$. Then the first condition ensures that $r_\pi(n+m)$ is small for each prior $\pi$ (the trivial rule that outputs the observed $n+m$ samples already achieves this), and the second condition (9) exactly states that $r_\pi(n)$ is large for the chosen prior.
A first application of Lemma 6.1 is an exact lower bound in Gaussian location models.
Theorem 6.2.
Theorem 6.2 shows that an exact error characterization for the Gaussian location model is possible through the general lower bound approach in Lemma 6.1. This result is also asymptotically useful for a rich family of models: note that by the CLT, the sufficient statistic in a continuous exponential family asymptotically follows a Gaussian distribution, with a vanishing TV distance. This idea was used in Section 4.3 to establish the upper bound, and the same observation could lead to a matching lower bound as well, under slightly different assumptions. Specifically, we drop Assumption 2 while introducing an additional assumption.
Assumption 3 (Linear independence).
The components of the sufficient statistic $T(X) \in \mathbb{R}^d$ are linearly independent, i.e. $\langle v, T(x) \rangle = c$ for $\mu$-almost all $x$ implies $v = 0$.
Assumption 3 ensures that the true dimension of the exponential family is indeed $d$. Whenever Assumption 3 does not hold, we could transform the family into a minimal exponential family of lower dimension fulfilling this assumption. Note that when Assumptions 1 and 3 hold, the mean mapping $\theta \mapsto \nabla A(\theta)$ is a diffeomorphism on the interior of $\Theta$; see, e.g. [ML08, Theorem 1.22]. Therefore, $\nabla A$ is an open map, and the set $\nabla A(\mathrm{int}(\Theta))$ contains a $d$-dimensional ball. This fact enables us to obtain a $d$-dimensional Gaussian location model after we apply the CLT.
The following theorem characterizes an asymptotic lower bound for every exponential family satisfying Assumptions 1 and 3.
Theorem 6.3.
Theorem 6.3 shows that there exists some $n_0$ depending only on the exponential family, such that a sample amplification from $n$ to $n + \omega(n/\sqrt{d})$ samples is impossible for all $n \ge n_0$. However, similar to the nature of the upper bound in Theorem 4.5, this asymptotic result does not imply that sample amplification is impossible whenever $n = o(\sqrt{d})$. Nevertheless, in the following sections we show that the $\Omega(\sqrt{d})$ sample complexity lower bound indeed holds in product families, as well as in several other concrete examples.
6.2 Product models
Although Lemma 6.1 presents a lower bound argument in general, the computation of exact Bayes or minimax risks could be very challenging, and the usual rate-optimal analysis (i.e. bounding the risks within a multiplicative constant) will not lead to meaningful results. In addition, choosing the right prior and loss is a difficult task which may change from instance to instance. Therefore, it is helpful to propose specialized versions of Lemma 6.1 which are easier to work with. Surprisingly, such a simple version exists for product models, which is summarized in the following theorem.
Theorem 6.4.
Let $n, m$ be given, and let $\mathcal{P} = \prod_{j=1}^d \mathcal{P}_j$ be a product model. Suppose that for each $j \in [d]$, there exist two points $P_{j,0}, P_{j,1} \in \mathcal{P}_j$ such that the conditions (10) and (11) hold with absolute constants $c_0, c_1 > 0$. Then there exists an absolute constant $c > 0$ such that
Theorem 6.4 leaves the choices of the prior and the loss function in Lemma 6.1 implicit, and provides a simple criterion for product models. The usual routine of applying Theorem 6.4 is as follows: fix any constant target error and a given sample size $n$, and find for each $j \in [d]$ two points $P_{j,0}, P_{j,1}$ such that the condition (10) holds. Then the condition (11) becomes an inequality solely in $m$, from which we could solve for the smallest $m$ such that (11) holds along the $j$-th coordinate. Finally, the sample amplification from $n$ to $n+m$ samples is impossible by the above theorem. Although Theorem 6.5 is only stated for product models, similar ideas could also be applied to non-product models; we refer to Section A.3 for concrete examples.
Theorem 6.4 also provides some intuition for why the sample complexity lower bound for amplification is typically smaller than that of learning. Specifically, for learning under the TV distance, a small TV distance between the product distributions is required. This requirement typically forces a much smaller individual TV distance $\mathrm{TV}(P_{j,0}, P_{j,1})$, e.g. of order $1/\sqrt{n}$ in many regular models. In contrast, the conditions (10) and (11) only require a constant individual TV distance, which leads to a smaller sample complexity in the sample amplification lower bound. To understand why a larger individual TV distance works for sample amplification, in the proof of Theorem 6.4 we consider the uniform prior on the $2^d$ product distributions formed by the points $P_{j,0}, P_{j,1}$. Under this prior, the test accuracy for each dimension is determined precisely by the individual TV distance, which is slightly smaller with $n$ samples and slightly larger with $n+m$ samples. Therefore, if a unit loss is incurred when the fraction of correct tests does not exceed a suitable threshold, the scaling in (10), (11) shows that there is a constant difference in the expected loss under the two sample sizes. In other words, such an aggregate voting test tolerates a larger individual TV distance. The details of the proof are deferred to Appendix B.
Theorem 6.4 has a far-reaching consequence: with almost no assumption on the product model $\mathcal{P} = \prod_{j=1}^d \mathcal{P}_j$, for any $n$ it always holds that $m^\star(n) = O(n/\sqrt{d})$ with an absolute hidden constant independent of the product model $\mathcal{P}$. The only assumption (besides the product structure) we make on $\mathcal{P}$ is as follows (here $n$ is a given sample size):
Assumption 4.
Let $\mathcal{P}$ possess the product structure as in Theorem 6.4. For each $j \in [d]$, there exist two points $P_{j,0}, P_{j,1} \in \mathcal{P}_j$ whose squared Hellinger distance $H^2(P_{j,0}, P_{j,1})$ is of order $1/n$.
Assumption 4 is a mild assumption which essentially requires that, for each coordinate, the class $\mathcal{P}_j$ varies continuously under the Hellinger distance, so that it contains a pair of distributions at squared Hellinger distance of order $1/n$. This assumption is satisfied by almost all practical models, either discrete or continuous, and is invariant under model reparametrizations or bijective transformations of the observations. We note that the specific coefficients in Assumption 4 are not essential, and could be replaced by any smaller constants. The next theorem states that if Assumption 4 holds, we always have a lower bound for the sample complexity and an upper bound for the size of sample amplification.
Theorem 6.5.
Let $\mathcal{P}$ be a product model satisfying Assumption 4. Then for any $\varepsilon \in (0, 1)$, there is some $C_\varepsilon > 0$ depending only on $\varepsilon$ (thus independent of $\mathcal{P}$) such that
$$m^\star(n, \varepsilon) \le \frac{C_\varepsilon\, n}{\sqrt{d}} \qquad \text{and} \qquad n^\star \ge \frac{\sqrt{d}}{C_\varepsilon}.$$
Theorem 6.5 is a general lower bound for sample amplification in product models, with the intriguing property that it is instance-wise in the model $\mathcal{P}$, while the constant $C_\varepsilon$ is independent of $\mathcal{P}$. As a result, the sample complexity is uniformly $\Omega(\sqrt{d})$, and the maximum size of sample amplification is uniformly $O(n/\sqrt{d})$, for all product models. In comparison, the matching upper bound in Theorem 4.6 for product models has a hidden constant depending on the statistical model. We note that it is indeed natural to have sample amplification results independent of the underlying statistical model. For example, it is clear by definition that sample amplification is invariant under bijective transformations of the observations. However, Assumption 2 depends on such transformations, so it possibly contains some redundancy. In contrast, Assumption 4 remains invariant, and is therefore more natural.
The proof idea of Theorem 6.5 is best illustrated in the case $d = 1$. Using the two points $(P_0, P_1)$ in Assumption 4, one could show that the TV distance between $n$ copies of $P_0$ and $P_1$ is bounded from above by a small constant. Similarly, for a large $m \asymp n$, the TV distance between $n+m$ copies of them is lower bounded by a large constant. Consequently, if $m = \Omega(n)$, Theorem 6.4 applied to this pair gives an $\Omega(1)$ lower bound on the amplification error. What happens if $m = cn$ with a small $c$? The idea is to consider the TV distances between $n, n+m, n+2m, \ldots$ copies of $P_0$ and $P_1$, which form an increasing sequence. Now by the pigeonhole principle, there must be two adjacent TV distances differing by at least a constant multiple of $m/n$, and Theorem 6.4 could be applied to this pair of sample sizes. This idea is easily generalized to any dimension $d$, with the full proof in Appendix B.
We note that the lower bounds in Theorem 6.3 (as well as 6.2) and Theorem 6.5 are on two different ends of the spectrum. In Theorems 6.2 and 6.3, an asymptotic setting (i.e. fixed $d$ and $n \to \infty$) is essentially considered, and a Gaussian limit is crucially used as long as there is local asymptotic normality. In comparison, Theorem 6.5 deals with a high-dimensional scenario ($n$ and $d$ can grow together) but restricts to a product model. However, looking at product submodels and/or exploiting its proof techniques could still lead to tight lower bounds for several non-product models, as shown in the examples in Section A.3.
7 Discussions on sample amplification versus learning
In all the examples we have seen in the previous sections, there is always a square-root relationship between the statistical complexities of sample amplification and learning. Specifically, when the dimensionality of the problem is $d$, the complexity of learning the distribution (to within a small TV distance) is typically $\Theta(d)$, whereas that of sample amplification is typically $\Theta(\sqrt{d})$. In this section, we give examples where this relationship breaks down in either direction, thus showing that there is no universal scaling between the sample complexities of amplification and learning.
7.1 An example where the complexity of sample amplification is arbitrarily small
We first provide an example where distribution learning is hard, but sample amplification is easy. Consider the following class of discrete distributions:
$$\mathcal{P}_\alpha := \Big\{ (p_1, \ldots, p_k) : p_1 = \alpha,\ p_i \ge 0,\ \sum_{i=1}^k p_i = 1 \Big\},$$
which is the same as the class of all discrete distributions over $k$ points, except that the learner has perfect knowledge of $p_1 = \alpha$ for some known $\alpha \in (0, 1)$. It is a classical result (see, e.g. [HJW15]) that the sample complexity of learning a distribution in this class to within a small TV distance is still linear in $k$, regardless of the knowledge of $p_1$. However, the next theorem shows that the complexity of sample amplification is much smaller.
Theorem 7.1.
For the class $\mathcal{P}_\alpha$ defined above, an $(n, n+1)$ sample amplification is possible if and only if $n = \Omega(\sqrt{(1-\alpha)k})$.
Note that for choices of $\alpha$ approaching one at a polynomial rate in $k$, the complexity of sample amplification could possibly be $O(k^c)$ for every $c > 0$, showing that it could be smaller than the $\Theta(\sqrt{k})$ benchmark by an arbitrary polynomial factor in $k$. Moreover, if $\alpha$ is bounded away from one, the complexity of sample amplification reduces to $\Theta(\sqrt{k})$, the case without the knowledge of $p_1$. The main reason why sample amplification is easier here is that the additional fake sample could be chosen as the first symbol, which has a large probability. In contrast, learning the distribution requires the estimation of all the other probability masses, so the existence of a probable symbol does not help much in learning.
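A minimal sketch of the amplifier suggested by this discussion, assuming the heavy symbol is labeled $0$; inserting it at a uniformly random position is one natural way to mix it in, and the paper's exact construction may differ:

```python
import numpy as np

def amplify_known_heavy_symbol(xs, rng):
    """Append one fake sample equal to the symbol with known large probability
    (labeled 0 here), inserted at a uniformly random position."""
    pos = rng.integers(len(xs) + 1)
    return xs[:pos] + [0] + xs[pos:]

# usage sketch: ys = amplify_known_heavy_symbol(list(samples), np.random.default_rng(0))
```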
7.2 An example where the complexity of sample amplification is the same as that of learning
Next we provide an example where the complexity of sample amplification is the same as that of learning. Consider a low-rank covariance estimation model: $X_1,\ldots,X_n \overset{\text{i.i.d.}}{\sim} N(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{d\times d}$ could be written as $\Sigma = UU^\top$ with $U \in \mathbb{R}^{d\times r}$, $U^\top U = I_r$, and $r \le d$. In other words, the covariance matrix is isotropic on some $r$-dimensional subspace. Here $n = r$ samples suffice to estimate $\Sigma$, and thus the whole distribution, perfectly, for the $r$-dimensional subspace could be recovered using $r$ i.i.d. samples with probability one. Therefore, the complexity of learning the distribution is $\Theta(r)$. The following theorem states that this is also the complexity of sample amplification.
Theorem 7.2.
For the above low-rank covariance estimation model with $r \le d$, an $(n, n+1)$ sample amplification is possible if and only if $n = \Omega(r)$.
Theorem 7.2 shows that, as opposed to learning, sample amplification fails to exploit the low-rank structure in the covariance estimation problem. As a result, the complexity of sample amplification coincides with that of learning in this example. Note that sample amplification is always no harder than learning: the learner could always estimate the distribution, generate one additional observation from the estimated distribution, and append it to the original samples. Therefore, Theorem 7.2 provides an example where the relationship between sample amplification and learning is the worst possible.
7.3 An example where the TV distance is not the right metric
Finally, we provide an example showing that the TV distance is not the right metric for the learning-based approach in Section 5, thereby partially illustrating the necessity of using the $\chi^2$ divergence. This example also goes beyond parametric families for sample amplification. Let $\mathcal{P}_{\mathrm{Lip}}$ be the class of all $L$-Lipschitz densities supported on $[0, 1]$, i.e. the density $f$ satisfies $|f(x) - f(y)| \le L|x - y|$ for all $x, y \in [0, 1]$. For $c > 0$, also let $\mathcal{P}_{\mathrm{Lip}, c}$ be the subclass of densities lower bounded by $c$ everywhere, i.e. $f(x) \ge c$ for all $x \in [0, 1]$. It is a classical result (see, e.g. [Tsy09]) that the minimax density estimation error under the TV distance is $\Theta(n^{-1/3})$ for both $\mathcal{P}_{\mathrm{Lip}}$ and $\mathcal{P}_{\mathrm{Lip}, c}$. The next theorem shows that the sample complexities for amplification are actually different.
Theorem 7.3.
Let $L > 0$ and $c > 0$ be fixed. It holds that
Theorem 7.3 shows that, although assuming a density lower bound does not alter the TV estimation error, it boosts the achievable size of amplified samples by a polynomial factor in $n$. In fact, the density lower bound also reduces the $\chi^2$-estimation error by a polynomial factor: in the proof of Theorem 7.3 we upper bound the $\chi^2$-estimation error over $\mathcal{P}_{\mathrm{Lip}, c}$, which together with Theorem 5.2 implies the claimed amplification size. Therefore, this is an example suggesting that measuring the estimation error under the $\chi^2$ divergence might be a better indicator of the complexity of sample amplification than the TV distance.
Acknowledgements.
We thank the anonymous reviewers for helpful feedback on earlier drafts of this paper.
Funding.
Shivam Garg conducted this research while affiliated with Stanford University and was supported by a Stanford Interdisciplinary Graduate Fellowship. Yanjun Han was supported by a Simons-Berkeley research fellowship and the Norbert Wiener postdoctoral fellowship in statistics at MIT IDSS. Vatsal Sharan was supported by NSF CAREER Award CCF-2239265 and an Amazon Research Award. Gregory Valiant was supported by NSF Awards AF-2341890, CCF-1704417, CCF-1813049, UT Austin’s Foundation of ML NSF AI Institute, and a Simons Foundation Investigator Award.
Appendix A Concrete examples of sample amplification
In this section we include concrete examples of sample amplification omitted from the main text due to space limitations, including the non-asymptotic upper bounds in Sections 4 and 5, and the lower bounds for non-product models in Section 6.
A.1 Concrete examples of amplification via sufficiency
In this section, in contrast to the mostly asymptotic results in Section 4.3, we investigate several non-asymptotic examples of sample amplification in concrete models. We show that for many natural models, including exponential families with dependent coordinates and non-exponential families which are not covered by the general theory, the sufficiency-based sample amplification approach could still amplify the optimal number of additional samples. We also illustrate the computational idea in Section 4.2 via more involved examples.
Our first example concerns the Gaussian model with a known mean but an unknown covariance. It is a folklore result that estimating the unknown covariance to within a vanishing Frobenius norm requires $\Omega(d^2)$ samples [DMR20, Corollary 1.2], which is also the sample complexity for learning the distribution to within a small TV distance. The following example shows that $O(d)$ samples suffice for sample amplification.
Example A.1 (Gaussian covariance model with known mean).
Consider i.i.d. observations $X_1,\ldots,X_n$ drawn from $N(0, \Sigma)$ with zero mean and an unknown covariance $\Sigma \succ 0$. Here a minimal sufficient statistic is the sample covariance matrix
$$\widehat{\Sigma}_n := \frac{1}{n}\sum_{i=1}^n X_i X_i^\top.$$
Lemma A.2 shows that the TV distance between $\mathcal{L}(\widehat{\Sigma}_n)$ and $\mathcal{L}(\widehat{\Sigma}_{n+m})$ is small as long as $m = O(n/d)$, therefore drawing samples from the conditional distribution given $\widehat{\Sigma}_{n+m} = \widehat{\Sigma}_n$ achieves a sample amplification of size $m = \Theta(n/d)$. This coincides with the general result $m = \Theta(n/\sqrt{D})$, where $D = \Theta(d^2)$ is the parameter dimension.
In order to sample from this conditional distribution, consider the following statistic
$$A = (A_1, \ldots, A_n), \qquad A_i := \widehat{\Sigma}_n^{-1/2} X_i,$$
where the inverse square root is interpreted via the pseudo-inverse, so that $A$ always exists even if $\widehat{\Sigma}_n$ is not invertible. Clearly there is a bijection between $(\widehat{\Sigma}_n, A)$ and $(X_1,\ldots,X_n)$, and Lemma A.2 shows that $A$ is an ancillary statistic. In particular, $A$ always follows the uniform distribution on the following set:
$$\Big\{ (a_1, \ldots, a_n) \in (\mathbb{R}^d)^n : \frac{1}{n}\sum_{i=1}^n a_i a_i^\top = I_d \Big\}.$$
Consequently an sample amplification is efficiently achievable in the Gaussian covariance model using the following algorithm:
1. Given samples $X_1, \dots, X_n$, compute the sufficient statistic $\mathbf{X}^\top \mathbf{X} = \sum_{i=1}^n X_i X_i^\top$.
2. Sample $Z_1, \dots, Z_{n+m}$ with $Z_i \sim \mathcal{N}(0, I_d)$ i.i.d. for all $i$. Form the matrix $\mathbf{Z} \in \mathbb{R}^{(n+m) \times d}$ with rows $Z_i$, and compute the ancillary statistic $U' = \mathbf{Z}(\mathbf{Z}^\top \mathbf{Z})^{-1/2}$.
3. Output the $n + m$ samples given by the rows of $U'(\mathbf{X}^\top \mathbf{X})^{1/2}$.
Lemma A.2.
The next example shows that the sample amplification result does not change much if both the mean and covariance are unknown, though the sampling procedure becomes slightly more involved.
Example A.3 (Gaussian model with unknown mean and covariance).
Next we consider the most general Gaussian model where $X_1, \dots, X_n$ are i.i.d. from $\mathcal{N}(\mu, \Sigma)$ with unknown mean vector $\mu$ and unknown covariance matrix $\Sigma$. In this case, a minimal sufficient statistic is the pair $(\bar{X}, \hat\Sigma)$, with $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ and $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\top$.
Lemma A.4 shows that the TV distance between the sufficient statistics is small as long as $m = O(n/d)$; therefore drawing amplified samples from the conditional distribution achieves a sample amplification of size $m = \Theta(n/d)$. For the computation, consider the following statistic
and it is clear that the whole set of samples can be recovered from this statistic together with $(\bar{X}, \hat\Sigma)$. Again, Lemma A.4 shows that this statistic is ancillary and uniformly distributed on the following set (assuming $n \ge d + 1$):
Lemma A.4.
The following example concerns product exponential distributions, or more generally Gamma distributions with a fixed shape parameter.
Example A.5 (Product exponential distribution).
In this example, we consider the product exponential model where each observation has $d$ independent coordinates, the $j$-th being $\mathsf{Exp}(\lambda_j)$ with unknown rate vector $\lambda = (\lambda_1, \dots, \lambda_d)$. Again, in this model the sample mean is sufficient, and $n\bar{X}_j$ follows a Gamma distribution $\mathsf{Gamma}(n, \lambda_j)$ independently across coordinates. Consequently,
where $\Gamma$ and $\psi$ are the gamma and digamma functions, respectively. Using the definition in (C.2), we note that the above KL divergence is precisely $d$ times its univariate counterpart, which by the proof of Lemma A.2 is small whenever $m = O(n/\sqrt{d})$. Consequently, sample amplification is possible as long as $m = O(n/\sqrt{d})$ and $n = \Omega(\sqrt{d})$. Alternatively, the same result also follows from Theorem 4.6, as Assumptions 1 and 2 both hold for the exponential distribution.
To draw amplified samples conditioned on the sample mean, note that the statistic whose $i$-th row is $\big( X_{i,j} / \sum_{k} X_{k,j} \big)_{j \in [d]}$
is ancillary. In fact, each column follows the flat Dirichlet distribution $\mathsf{Dir}(1, \dots, 1)$. Then computational efficiency follows from the obvious fact that the sample is determined by this statistic together with the sample mean.
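As a concrete illustration, here is a minimal numpy sketch of this sampling step (the function name and normalization are ours): fresh $\mathsf{Exp}(1)$ variables supply the flat Dirichlet proportions, which are then scaled by the observed coordinatewise sums, so the sufficient statistic is preserved exactly.

```python
import numpy as np

def amplify_exponential(X, m, seed=None):
    """Sufficiency-based amplification sketch for product exponentials.

    The coordinatewise sum is sufficient; conditioned on it, the sample
    proportions follow a flat Dirichlet, so we redraw them from fresh
    Exp(1) variables normalized to sum to one.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = X.sum(axis=0)                        # sufficient statistic per coordinate
    W = rng.exponential(size=(n + m, d))     # fresh Exp(1) variables
    D = W / W.sum(axis=0)                    # Dirichlet(1, ..., 1) proportions
    return D * S                             # new samples with the same sums
```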
Our final example is a non-exponential family which is not even differentiable in quadratic mean, but for which the same sample amplification complexity still holds.
Example A.6 (Uniform distribution over a rectangle).
In this example, let $X_1, \dots, X_n$ be i.i.d. samples from the uniform distribution on an unknown rectangle $\prod_{j=1}^d [a_j, b_j]$ in $\mathbb{R}^d$. Note that this is not an exponential family. However, the sufficiency-based sample amplification can still be applied in this case. Specifically, here the sufficient statistics are the coordinatewise minima and maxima: for each coordinate $j \in [d]$,
It is not hard to see that the joint density of the coordinatewise minimum and maximum is
Then, after some algebra, the KL divergence between the sufficient statistics is
which is small when $m = O(n/\sqrt{d})$. Therefore, sample amplification for uniform distributions is still possible whenever $m = O(n/\sqrt{d})$ and $n = \Omega(\sqrt{d})$, the same as for exponential families.
To draw amplified samples conditioned on the coordinatewise minima and maxima, note that the statistic whose $i$-th row rescales each coordinate of $X_i$ to $[0,1]$ via the observed minimum and maximum
is ancillary. This is due to the invariance property of the uniform distribution: if $U \sim \mathsf{Unif}[0,1]$, then $a + (b - a)U \sim \mathsf{Unif}[a, b]$. Since the sample is determined by this statistic together with the sufficient statistics, the computational efficiency follows.
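A minimal numpy sketch of this procedure (the function name is ours) reads as follows: fresh uniforms supply the ancillary rescaled positions, which are mapped back through the observed coordinatewise minima and maxima, so the sufficient statistics are preserved exactly.

```python
import numpy as np

def amplify_uniform_rectangle(X, m, seed=None):
    """Sufficiency-based amplification sketch for Unif(prod_j [a_j, b_j])."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)    # sufficient statistics
    U = rng.random((n + m, d))               # fresh uniforms on [0, 1]^d
    # Ancillary part: positions rescaled so each coordinate spans [0, 1].
    V = (U - U.min(axis=0)) / (U.max(axis=0) - U.min(axis=0))
    return lo + (hi - lo) * V                # same coordinatewise min and max
```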
Although all of the above examples work exclusively for continuous models and apply an identity map between sufficient statistics, we remark that more sophisticated maps based on learning can be useful and apply to discrete models. The idea of sample amplification via learning is presented in Section 5, and an example combining the sufficiency and learning ideas can be found in Example A.14.
A.2 Concrete examples of amplification via learning
In this section we show how learning-based approaches achieve optimal sample amplification performance in several examples, including both continuous and discrete models. In some scenarios, the following strengthened lemma may be useful to deal with an excessively large $\chi^2$-divergence.
Lemma A.7.
The idea behind Lemma A.7 is that for some models, the $\chi^2$-divergence might be huge or even infinite with a small probability. However, the TV distance is always at most one, so a large $\chi^2$-divergence on a rare event can still be compatible with a meaningful TV bound. The proof of Lemma A.7 can be found in Appendix C.
The first example is again the Gaussian location model in Example 4.1, where we show that the shuffling-based approach also achieves the optimal sample complexity and the amplification size $m = \Theta(n/\sqrt{d})$.
Example A.8 (Gaussian location model with known covariance, continued).
Consider the setting of Example 4.1, where the family of distributions is $\{\mathcal{N}(\theta, I_d) : \theta \in \mathbb{R}^d\}$. Here the model has a product structure across the $d$ coordinates. To find an upper bound on the $\chi^2$-estimation error, consider the distribution estimator $\mathcal{N}(\hat\theta, I_d)$, where $\hat\theta$ is the sample mean. Consequently,
and using the elementary inequality $1 + x \le e^x$, we have
whenever $n$ exceeds an absolute constant. Consequently, the $\chi^2$-estimation error is $O(d/n)$ for every mean vector, and Theorem 5.5 implies that an $(n, n+m)$ sample amplification is possible if $m = O(n/\sqrt{d})$ and $n = \Omega(\sqrt{d})$.
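A minimal sketch of the shuffling-based amplifier in this model (the function name is ours; the half-half sample split mirrors the learn-then-shuffle idea of Algorithm 2) might look as follows.

```python
import numpy as np

def amplify_gaussian_location(X, m, seed=None):
    """Shuffling-based amplification sketch for N(theta, I_d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta_hat = X[: n // 2].mean(axis=0)           # learn theta from first half
    new = theta_hat + rng.standard_normal((m, d))  # m fresh synthetic samples
    out = np.concatenate([X, new])
    return out[rng.permutation(n + m)]             # hide them by shuffling
```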
The next example is the discrete distribution model considered in [AGSV20].
Example A.9 (Discrete distribution model).
Let $\mathcal{P}$ be the class of all discrete distributions supported on $k$ elements. In this case, a natural learner is the empirical distribution $\hat{p}$, with
Consequently,
meaning that the $\chi^2$-estimation error is at most $k/n$. Hence, Theorem 5.2 implies that an $(n, n+m)$ sample amplification is possible whenever $m = O(n/\sqrt{k})$ and $n = \Omega(\sqrt{k})$. In this case, Algorithm 2 essentially subsamples from the original data and adds the subsample back after a random shuffle, which is the same algorithm as in [AGSV20]; however, the analysis here is much simpler.
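For concreteness, a minimal sketch of this subsample-and-shuffle amplifier (the function name is ours; drawing from the learned empirical distribution is the same as subsampling the observed data) is given below.

```python
import numpy as np

def amplify_discrete(samples, m, k, seed=None):
    """Learning-plus-shuffling amplification sketch for k-ary distributions."""
    rng = np.random.default_rng(seed)
    n = len(samples)
    counts = np.bincount(samples[: n // 2], minlength=k)
    p_hat = counts / counts.sum()            # empirical distribution (first half)
    new = rng.choice(k, size=m, p=p_hat)     # equivalently: subsample the data
    out = np.concatenate([samples, new])
    rng.shuffle(out)                         # hide the new samples' positions
    return out
```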
The following example revisits the uniform distribution of Example A.6, and again shows that the same amplification performance can be achieved by random shuffling.
Example A.10 (Uniform distribution, continued).
Consider the setting in Example A.6, where the distribution class is the family of all uniform distributions on a rectangle $\prod_{j=1}^d [a_j, b_j]$. For each coordinate, a natural distribution estimator is simply the uniform distribution on the observed range, with the coordinatewise minima and maxima defined in Example A.6. For this learner, it holds that
Note that Example A.6 shows that the joint density of the coordinatewise minimum and maximum is given by
the expected $\chi^2$-divergence can then be computed as
which is $O(1/n)$ per coordinate for $n$ larger than an absolute constant. Hence, Theorem 5.5 shows an $(n, n+m)$ sample amplification if $m = O(n/\sqrt{d})$ and $n = \Omega(\sqrt{d})$.
The next example is the exponential distribution model studied in Example A.5, where the modified $\chi^2$-learning error of Lemma A.7 will be useful.
Example A.11 (Exponential distribution, continued).
Consider the setting in Example A.5, where the distribution class is a product of exponential distributions with unknown rate parameters. In this case, a natural distribution learner estimates each rate parameter by the reciprocal of the corresponding sample mean, and uses the resulting product distribution to estimate the truth. Note that
whenever the estimated rates are within a suitable constant factor of the true ones. However, if some estimated rate deviates from the truth by a constant factor, the $\chi^2$-divergence becomes unbounded, and this event occurs with a small but positive probability.
The following example considers an interesting non-product model, namely the Gaussian distribution with a sparse mean vector.
Example A.12 (Sparse Gaussian model).
Consider the Gaussian location model $\{\mathcal{N}(\theta, I_d) : \theta \in \mathbb{R}^d\}$, with the additional constraint that the mean vector is $s$-sparse, i.e. $\|\theta\|_0 \le s$. For the learning problem, it is well known (cf. [DJ94, Theorem 1]) that the soft-thresholding estimator with
and any large enough constant achieves the squared-error rate $O(s\log(d/s)/n)$. Therefore, the sample complexity of learning sparse Gaussian distributions scales as $s\log(d/s)$.
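For concreteness, a minimal sketch of such a soft-thresholding estimator (the constant and the exact threshold scaling $\sqrt{\log d / n}$ are illustrative choices consistent with the standard analysis) might look as follows.

```python
import numpy as np

def soft_threshold_mean(X, c=2.0):
    """Soft-thresholding estimate of a sparse Gaussian mean vector.

    Each coordinate of the sample mean has standard deviation 1/sqrt(n),
    so the threshold is set at c * sqrt(log(d) / n) to kill pure-noise
    coordinates while retaining large signal coordinates.
    """
    n, d = X.shape
    xbar = X.mean(axis=0)
    tau = c * np.sqrt(np.log(d) / n)
    return np.sign(xbar) * np.maximum(np.abs(xbar) - tau, 0.0)
```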
Lemma A.13.
Under the setting of Example A.12, it holds that
The final example is one in which the sole application of either sufficient statistics or learning fails to achieve the optimal sample amplification; the solution is to use a combination of both ideas.
Example A.14 (Poisson distribution).
Consider the product Poisson model in which each observation has $d$ independent coordinates, the $j$-th being $\mathsf{Poi}(\lambda_j)$. We first show that a naïve application of either the sufficiency-based or the shuffling-based idea does not lead to the desired sample amplification. In fact, the sufficient statistic here is the coordinatewise sum, which follows a product Poisson distribution with means $n\lambda_j$; as this statistic takes discrete values, applying any linear map between the sufficient statistics of $n$ and $n+m$ samples will not result in a small TV distance between them.
The argument for the shuffling-based approach is subtler. A natural distribution estimator replaces each rate by its empirical counterpart. Lemma A.15 shows that this distribution estimator can suffer an expected $\chi^2$-estimation error far greater than the parametric rate, so we cannot conclude an $(n, n+m)$ sample amplification from Theorem 5.5.
Now we show that a combination of the sufficient statistic and learning leads to the rate-optimal sample amplification in this model. Specifically, we split the samples and compute the empirical rate parameter based on the first half. Next, conditioned on the first half of the samples, the sufficient statistic for the remaining half is the coordinatewise sum. Define
where the added count vector is, conditionally on the first half of the samples, an independent Poisson draw with means proportional to the empirical rates. Finally, we generate the amplified samples from the conditional distribution given this inflated sufficient statistic and append them to the first half. By the second statement of Lemma A.15, this achieves an $(n, n+m)$ sample amplification whenever $m = O(n/\sqrt{d})$ and $n = \Omega(\sqrt{d})$.
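A minimal numpy sketch of this sufficiency-plus-learning amplifier (the function name is ours) is given below; it uses the standard fact that i.i.d. Poisson counts conditioned on their sum are multinomial, which makes the final conditional sampling step explicit.

```python
import numpy as np

def amplify_poisson(X, m, seed=None):
    """Sufficiency-plus-learning amplification sketch for product Poisson."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    first, second = X[: n // 2], X[n // 2 :]
    lam_hat = first.mean(axis=0)             # empirical rates from first half
    T = second.sum(axis=0)                   # sufficient statistic, second half
    T_new = T + rng.poisson(m * lam_hat)     # inflate by m synthetic samples
    n2 = second.shape[0] + m
    # Conditional law of i.i.d. Poissons given their sum is multinomial.
    out = np.stack(
        [rng.multinomial(T_new[j], np.full(n2, 1.0 / n2)) for j in range(d)],
        axis=1,
    )
    return np.concatenate([first, out])      # total of n + m samples
```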
Lemma A.15.
Under the setting of Example A.14, there exists some rate vector such that
In addition, the proposed sufficiency+learning approach satisfies
A.3 Non-asymptotic examples of lower bounds
In this section, we apply the lower bound ideas to several concrete examples from Sections 4 and 5, and show that the previously established upper bounds for sample amplification are indeed tight. Since Theorem 6.5 already handles all product models (including the Gaussian location model in Examples 4.1, 4.2, and A.8, the exponential model in Examples A.5 and A.11, the uniform model in Examples A.6 and A.10, and the Poisson model in Example A.14), we are only left with the remaining non-product models. In the sequel, we will make use of the general Lemma 6.1 to prove non-asymptotic lower bounds in these examples.
The lower bound for the discrete distribution model was obtained in [AGSV20], showing that the learning approach in Example A.9 is rate-optimal. Our first example concerns the “Poissonized” version of the discrete distribution model, where we show that the results become slightly different.
Example A.16 (“Poissonized” discrete distribution model).
We consider the following “Poissonized” discrete distribution model, where we observe i.i.d. samples whose coordinates are independent Poisson variables with means given by an unknown probability vector. Although Poissonization does not affect the optimal rate of estimation in many problems, Lemma A.17 shows the following distinction when it comes to sample amplification: the optimal amplification size under the Poissonized model differs from that under the non-Poissonized model [AGSV20].
The complete proof of Lemma A.17 is relegated to Appendix C, but we briefly comment on why Theorem 6.5 is not directly applicable in the small-sample regime. The reason is that to apply Theorem 6.5, we would construct a parametric submodel which is a product model:
where the parameters range over a suitable interval. However, in the small-sample regime, the range of the squared Hellinger distance
is too narrow, so Assumption 4 does not hold there. This is precisely the subtlety in the Poissonized model.
Lemma A.17.
Under the Poissonized discrete distribution model, an sample amplification is possible if and only if and .
Our next example is the sparse Gaussian model of Example A.12, where we establish the tightness of the upper bounds directly using Lemma 6.1.
Example A.18 (Sparse Gaussian location model).
Consider the setting of the sparse Gaussian location model in Example A.12, where we aim to prove a matching lower bound for sample amplification. In the sequel, we first handle the case $s = 1$ to illustrate the main idea, and postpone the case of general $s$ to Lemma A.19.
For $s = 1$, we apply Lemma 6.1 to a proper choice of the prior and loss. Fixing a parameter $\tau > 0$ to be chosen later, let the prior be the uniform distribution on the finite set of vectors $\{\tau e_1, \dots, \tau e_d\}$, where $e_1, \dots, e_d$ are the canonical basis vectors of $\mathbb{R}^d$. Moreover, for an estimator of the unknown mean vector, the loss function is chosen to be the indicator of an estimation error. For the above prior and loss, it is clear that the maximum likelihood estimator with
is the Bayes estimator, and the Bayes risk admits the following expression:
where the relevant random variables are i.i.d. standard normal. Similarly, the Bayes risk under $n + m$ samples admits the same expression with $n$ replaced by $n + m$, and it remains to investigate the properties of this function. Lemma A.19 summarizes a useful property for the lower bound, namely that the function enjoys a phase transition around a critical value of $\tau$.
Based on Lemma A.19, the pigeonhole principle implies that for every given $n$, there exists some $\tau$ at which the Bayes risks under $n$ and $n + m$ samples differ substantially. Therefore, choosing this $\tau$ in the above construction yields a constant gap between the Bayes risks, and consequently
we must have the stated conditions on $n$ and $m$ for sample amplification.
The following lemma summarizes the phase-transition property used in the above example.
Lemma A.19.
For the function in Example A.18, there exists an absolute constant independent of such that
Moreover, for general , an sample amplification is possible under the sparse Gaussian model only if and .
Our final example proves the tightness of the sufficiency-based approaches in Examples A.1 and A.3 for Gaussian models with unknown covariance. The proof relies on the computation of the minimax risks in Lemma 6.1, as well as the statistical theory of invariance.
Example A.20 (Gaussian model with unknown covariance).
We establish the matching lower bounds for sample amplification in Example A.1 with a known mean, which imply the lower bounds in Example A.3. We make use of the minimax risk formulation in Lemma 6.1, and consider the following loss:
where the function is given by (13) in Lemma C.3 below, the constant is a large absolute constant to be determined later, and the final ingredient is Stein's loss
To search for the minimax estimator under the above loss, arguments similar to those in [JS61] based on the theory of invariance show that it suffices to consider estimators taking the form
(12)
where the first factor is a diagonal matrix independent of the observations, and the second factor is the lower triangular matrix from the Cholesky factorization of the sample covariance. Moreover, the risk of the above estimator does not depend on the unknown covariance, so in the sequel we assume $\Sigma = I_d$. Note that for a particular choice of the diagonal matrix the estimator recovers the sample covariance; however, other choices of the diagonal matrix can give a uniform improvement over the sample covariance, see [JS61].
The proof idea is to show that with samples, there exists an estimator with a specific choice of in (12) such that (cf. Lemma A.21)
Consequently, by Chebyshev’s inequality, the loss concentrates below the threshold for large enough $n$.
To lower bound the minimax risk with samples, we need to enumerate over all possible estimators taking the form of (12). It turns out that for any choice of , the first two moments of admit explicit expressions, and it always holds that (cf. Lemma A.21)
for any given constant , provided that with depending only on . Choosing large enough, Chebyshev’s inequality leads to with probability at least for every taking the form of (12). By the last statement of Lemma A.21, for with a large enough , the above event implies that .
Combining the above scenarios and applying the pigeonhole principle, an application of Lemma 6.1 shows that the stated bound on $m$ is necessary for an $(n, n+m)$ sample amplification.
The following lemma summarizes the necessary technical results for Example A.20.
Lemma A.21.
Let . For with for all , the corresponding estimator in (12) satisfies
where
(13)
Meanwhile, for any choice of and any absolute constant , the estimator in (12) satisfies
as long as for some large enough constant depending only on .
Finally, the function defined in (13) satisfies the following inequality: for and , it holds that
Appendix B Proof of main theorems
B.1 Proof of Theorem 4.5
We recall the following result in [BC16, Theorem 2.6]: given Assumptions 1 and 2 with , there exists a constant depending only on and the moment upper bound such that
By an affine transformation, it is then clear that
Moreover, the computation in Example 4.1 shows that
Now the desired result follows from the above inequalities and a triangle inequality.
B.2 Proof of Theorem 4.6
As the density in the Edgeworth expansion (8) may be negative at some points, throughout the proof we extend the formal definition of the TV distance to arbitrary signed measures, simply as half of the total variation norm of their difference. Under this new definition, it is clear that the triangle inequality still holds. The following tensorization property also holds: for general signed measures of unit total mass, we have
(14)
Fix any , let (resp. ) be the probability distribution of the -th coordinate of (resp. ), and (resp. ) be the signed measure of the corresponding Edgeworth expansion taking the form (8) with . We note that the polynomials in (8) are the same for and , and their coefficients are uniformly bounded over thanks to Assumption 2. Then based on Assumptions 1 and 2 with , the result of [BC16, Theorem 2.7] claims that
with independent of . Moreover, the signed measure of the Edgeworth expansion in (8) could be negative only if , and therefore the total variation of each satisfies
where the last inequality follows from integrating the Gaussian tails. Finally, let be the positive part of in the Jordan decomposition, the tensorization property (B.2) leads to
(15)
and a similar result holds for the other expansion.
Next, it remains to upper bound the TV distance between the Edgeworth expansions. To this end, we also generalize the formal definition of the Hellinger distance to general measures which are not necessarily probability measures. Then the following inequality between the generalized TV and Hellinger distances holds: for measures on the real line,
(16)
Also, the following tensorization property holds for the Hellinger distance between general measures: for (not necessarily probability) measures on ,
(17)
Consequently, it suffices to prove an upper bound on the Hellinger distance , and the TV distance on the product measure is a direct consequence of (B.2) and (17).
To upper bound the individual Hellinger distance, note that after a proper affine transformation, the densities of and are as follows:
(18)
(19)
where is the density of , and is a polynomial of degree with uniformly bounded coefficients. By (18) and (19), we observe that there exists an absolute constant independent of such that whenever . Consequently, the squared Hellinger distance can be decomposed as
Next we upper bound the three terms separately.
1. Upper bounding the first term: note that by the definition of the expansion, the multiplication factor is at most the stated bound. Therefore,
(20)
where (a) is due to the direct computation of the squared Hellinger distance, and (b) makes use of the stated elementary inequality for the exponential function.
2. Upper bounding the second term: using the same elementary inequality, we have
where the last step is a change of measure with respect to the stated density. Writing the expression explicitly, then
whenever the argument lies in the valid range. Combining the above two inequalities yields
(21)
3. Upper bounding the third term: note that the relevant tail is bounded below in the stated regime; integrating the tail leads to
(22)
B.3 Proof of Theorem 5.2
Let the first distribution be the one learned from the initial samples, achieving the $\chi^2$-learning error, and let the second be the distribution of the shuffled samples in Algorithm 2. Note that both distributions depend on the initial samples and are therefore random. Then the final TV distance of sample amplification is
which is the expected TV distance between the mixture distribution and the product distribution.
B.4 Proof of Theorem 5.5
B.5 Proof of Theorem 6.2
The proof relies on Lemma 6.1 and a classical statistical result known as Anderson’s lemma. Without loss of generality we assume . Choose , with a bowl-shaped (i.e. symmetric and quasi-convex) loss function . Then Anderson’s lemma (see, e.g. [VdV00, Lemma 8.5]) implies that the minimax estimator under and samples is , thus
with . Choosing with the parameter determined by
then clearly is bowl-shaped. Hence, for this choice of , Lemma 6.1 gives that
and a matching upper bound is presented in Example 4.1.
B.6 Proof of Theorem 6.3
Based on the discussions above Theorem 6.3, we choose an arbitrary open ball , and pick any -dimensional ball contained in . Let be the center of , and after a proper affine transformation we assume that . Consider a truncated Gaussian prior with
where the parameters are to be determined later. Using the stated diffeomorphism, there is a prior on the parameter space which induces the above prior. Moreover, as the closure of the ball is a compact set, a weaker form of Assumption 2, with the supremum restricted to this compact set, holds.
Now we analyze the Bayes risks under the above prior and the loss
where the underlying function is bowl-shaped. By sufficiency, it remains to consider the class of estimators depending only on the sufficient statistic. For such an estimator, under each parameter value we have
where . To upper bound this TV distance, the proof of Theorem 4.5 together with Assumption 2 applied to gives that
where only depends on the exponential family. Moreover,
where (a) follows from [DMR18, Theorem 1.1], and (b) makes use of the analytical property of (see, e.g. [ML08, Theorem 1.17]), the assumption that , and . Again, here the constant is independent of . Combining the above inequalities yields that
(24)
Next we lower bound the Bayes risk when the truncated prior is replaced by the non-truncated Gaussian distribution; then
(25)
Here (c) follows from the Gaussian tail probability, (d) makes use of Anderson’s lemma and the fact that under , the posterior distribution of given is Gaussian with covariance . The final inequality (e) is due to the following upper bound on the TV distance:
Now combining (24) and (B.6), we obtain a lower bound on the Bayes risk:
Similarly, by reversing all the above inequalities, an upper bound of the Bayes risk with samples is also available:
By the proof of Theorem 6.2, a proper choice of satisfies that
where the last step is again due to [DMR18, Theorem 1.1]. Consequently, by choosing and , the desired result follows from Lemma 6.1.
B.7 Proof of Theorem 6.4
We will apply Lemma 6.1 to the uniform prior over points , and the loss function with
We compute the Bayes risks and in this scenario.
Under the uniform prior , it is straightforward to see that the posterior distribution of given is a product distribution , with supported on two elements , and
Then given , the Bayes estimator which minimizes the expected loss is a minimizer of the above expression
which is easily seen to be
For the above Bayes estimator, the random variables are mutually independent, with the mean value
Consequently, we have for each by (10), and thus
(26)
An entirely symmetric argument leads to
(27)
To further lower bound (B.7) and upper bound (27), the following lemma shows that we may assume that the above Bernoulli distributions have the same parameter.
Lemma B.1 (Theorems 4 and 5 of [Hoe56]).
Let be independent Bernoulli random variables for , with . Then for , we have
and for , we have
In particular, for integers with , we have
For , based on the second statement of Lemma B.1, the quantity in (B.7) satisfies
(28)
with . Similarly, based on the first statement of Lemma B.1, for (27) we have
(29)
Since
we invoke Lemma 6.1 and lower bound the Bayes risk difference as
where (a) is due to
(30)
for any and , by Stirling’s approximation.
For , we first note the following identity:
Based on (B.7) and (27), we then have
We will show that the integrand is uniformly of the claimed order. To this end, note that in the relevant regime, it holds that
Consequently, by the last statement of Lemma B.1, one has
where the last step is again due to (30). The above display is a lower bound for the probability of a set of the given size, and in view of the following lemma, the same lower bound also holds for the probability of any singleton. Plugging this lower bound back into the integral then yields the desired result.
Lemma B.2.
Let with , and for some . Define
Then there exists an absolute constant such that
Proof.
Let , and . Then it is clear that
Call two binary vectors and neighbors (denoted by ) if they differ in exactly one coordinate. It is clear that every has neighbors in , and every has neighbors in . Moreover, for ,
Consequently, by double counting,
The other inequality can be established analogously. ∎
B.8 Proof of Theorem 6.5
For each , we pick two points in Assumption 4. We first aim to show that
(31)
(32)
The proofs of (31) and (32) rely on the tensorization property of the Hellinger distance
and the relationship between TV and Hellinger distance in (2). For example, for (31), we have
Applying the other inequality, for (32) we have
The other inequalities involving and could be established analogously.
Next, for the choice of in the statement of Theorem 6.5, we show that there exists such that
(33)
To prove (33), first note that is non-decreasing by the data-processing property of the TV distance. Moreover, by (31) and (32), we have and . Consequently,
which gives (33). In addition, we also have .
Next we apply Theorem 6.4. By (33), there exist such that
Therefore, Theorem 6.4 shows that there is an absolute constant depending only on such that
where the above quantity denotes the minimax error in a new sample amplification problem: suppose we draw independent samples from , also independently for each , and we aim to amplify into independent samples from . In other words, the sample sizes for different dimensions may not be equal in the new problem, but the target is still to generate more samples. We claim that , and thereby complete the proof. To show the claim, note that for all , hence in the new problem we could keep samples unused for each , use the remaining samples to amplify into vectors, and add the above unused samples back to form the final amplification.
B.9 Proof of Theorem 7.1
We first prove the upper bound. Consider the distribution estimator, which has a $\chi^2$-divergence
Consequently, the $\chi^2$-estimation error is bounded as displayed, and Theorem 5.2 states that the random shuffling approach achieves the claimed $(n, n+m)$ sample amplification.
Next we prove the lower bound. Let and be a multiple of . Consider the following prior over : is the uniform prior over with and the remaining mass evenly distributed on a uniformly random subset of of size . The action space is chosen to be , and the loss function is
We first show that . In fact, after observing samples , we simply use as the estimator under the above loss. Clearly belongs to the support of . For the remaining conditions,
By the union bound, this estimator achieves a Bayes risk at most , and thus .
Next we show the corresponding lower bound on the Bayes risk, which combined with Lemma 6.1 gives the desired result. To show this, consider a new symbol appearing in the estimator but not among the observed samples; since the amplified dataset is longer, there is at least one such symbol. In order to avoid incurring the loss, this new symbol can neither be the designated symbol nor appear among the observations. Moreover, the posterior distribution of the support given the observations is uniformly distributed over
Therefore, the posterior probability of the new symbol being outside the support of (recall that it could be neither one of nor ) is at least
giving the desired inequality .
B.10 Proof of Theorem 7.2
The upper bound of sample amplification directly follows from that of learning, and it remains to show the lower bound. In the stated regime, with probability one the observations span a proper subspace of the row space of the covariance. A sample amplification calls for at least one additional observation outside the original dataset, which with constant probability should not belong to the subspace spanned by the observations. However, under a uniformly chosen row space, the posterior probability of the additional observation belonging to the row space is zero. Consequently, the claimed sample amplification is impossible in this regime.
B.11 Proof of Theorem 7.3
We prove the three claims separately.
The proof of the first claim. The classical theory of nonparametric estimation (see, e.g. [Tsy09]) shows that there exists a density estimator attaining the minimax TV rate. Since the density is lower bounded by a constant, this implies that
Consequently, the $\chi^2$-estimation error is controlled, and Theorem 5.2 implies the first claim.
The proof of the second claim. We construct a parametric subfamily of and invoke Theorem 6.5. Let be a -Lipschitz function supported on with and ; in particular . Let and assume that is an integer. For , define
where is a small constant satisfying . Consequently, for every . However, the density estimation model is not a product model, so Theorem 6.5 cannot be directly applied.
To overcome the above issue, we consider a Poissonized model as follows: first we draw $N \sim \mathsf{Poi}(n)$, and then draw $N$ i.i.d. samples from the density. The Poissonized model satisfies the following two properties:
1. For any measurable set $A \subseteq [0,1]$, the number of observations falling into $A$ follows a Poisson distribution with mean $n\int_A f(x)\,\mathrm{d}x$.
2. For any collection of disjoint subsets $A_1, \dots, A_K \subseteq [0,1]$, the observation counts in these sets are mutually independent.
For each set in the partition, let the corresponding count parameter be defined accordingly. Clearly, there is a one-to-one correspondence between the full sample and the collection of observations that fall into each set. By the above two properties, these collections are mutually independent, and each is distributed according to the following process: sample a Poisson number of points, and draw that many i.i.d. samples from the density
supported on . In addition,
so the Poissonized model is a product model which satisfies the prerequisite of Theorem 6.5.
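The two properties above are easy to check empirically; the short simulation below (with an illustrative partition of $[0,1]$ and the uniform density, both our choices) verifies that the bin counts have near-diagonal covariance with variances matching their means, as expected for independent Poisson variables.

```python
import numpy as np

# Empirical check of the Poissonization properties: draw N ~ Poi(n), then
# N i.i.d. points, and look at the counts in disjoint bins. The partition
# below is an illustrative choice.
rng = np.random.default_rng(0)
n, trials = 50, 20000
edges = np.array([0.0, 0.3, 0.7, 1.0])          # three disjoint intervals
counts = np.empty((trials, 3))
for t in range(trials):
    N = rng.poisson(n)                          # Poissonized sample size
    x = rng.random(N)                           # i.i.d. Unif[0, 1] samples
    counts[t] = np.histogram(x, bins=edges)[0]
print(np.cov(counts, rowvar=False).round(1))    # near-diagonal covariance
print(counts.mean(axis=0))                      # approx n * bin lengths
```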
Next we denote the i.i.d. sampling model by , and the Poissonized model by . Let , where is a large universal constant to be chosen later. By Theorem 6.5 applied to , Le Cam’s distance in Definition 3.1 between Poissonized models satisfies
as long as . In addition, the model is more informative than with probability at least , where . Similarly, is more informative than with probability at least , where . Therefore,
where in the last step we have by choosing large enough.
The proof of the final claim. Suppose for a large constant . Consider the following prior on the density : let , and be uniformly distributed over all vectors in such that the number of ’s is . Given , we construct
and let . It is not hard to verify that each is a density and -Lipschitz, so .
Next, in the context of Lemma 6.1, we choose the action space to be , as well as the following loss:
where is a large enough constant. We aim to show that under the loss and prior , the Bayes risk satisfies and . Then an application of Lemma 6.1 completes the proof of .
We first show the upper bound on the Bayes risk. We simply use the observed sample as the estimator under the above loss; then it clearly belongs to the support of . For the second event in the loss function, let be the number of sets which receive any observations in . By the linearity of expectation, as well as the negative dependence across different set counts, we compute for every that
By Chebyshev’s inequality, for with a large enough , we have .
Next we show that . Let be the indices such that the set is hit by the observations . Clearly the posterior distribution of the support of the vector is uniformly distributed over
On one hand, if the estimator hits a set with , the probability that does not belong to the support of is at least
On the other hand, if the estimator never hits a set with , then the estimator's observations fall into at most sets in , where the count is defined in a similar way as before. Similar to the previous computation, we have
By Chebyshev’s inequality, for large enough, the probability that violates the second event in the loss function is at least . Now a combination of the above two cases implies that , as desired.
Appendix C Proof of main lemmas
C.1 Proof of Lemma 4.4
Recall that the log-partition function is defined as
for any vector with , we have
(34)
It remains to show that when the perturbation is sufficiently small, the shifted parameter always remains in the domain, and the exponent of (C.1) is uniformly bounded from above. Then the existence of a uniformly bounded MGF around zero implies uniformly bounded moments of every order.
The result of [Nes03, Theorem 4.1.6] shows that for a self-concordant and convex function , we have whenever . Consequently, for , a Taylor expansion with a Lagrange remainder gives
(35)
where the intermediate point lies on the line segment between the two parameters. Hence, we have
and therefore . Plugging it back into (C.1) establishes the boundedness of (C.1), as well as the finiteness of , or equivalently, .
C.2 Proof of Lemma A.2
For and , it is well known that the empirical covariance follows a Wishart distribution , where the density of is given by
where is the multivariate Gamma function, and is the usual Gamma function. After some algebra, we have
(36)
where is the multivariate digamma function. Note that the above KL divergence (C.2) does not depend on ; we denote it by .
By (3), it suffices to establish an upper bound of the KL divergence. Applying an infinite Taylor expansion yields
where is the polygamma function. For any and , the following inequality holds for the polygamma function [AS64, Equation 6.4.10]:
As a result, for the following modification
(37)
it holds that
(38)
where we have used the assumption .
Next we establish an upper bound of . Using the identity
and some algebra, we have
Note that for , we have
so we conclude that
(39)
For the claim that the statistic follows the uniform distribution on the stated set, we need the following auxiliary definitions and results. A topological group is a group with a topology such that the group operation is continuous. A (right) group action of on is a function such that and , where is the identity element of . A group action is called transitive if for every , there exists some such that . A group action is called proper if for any compact and , the map with satisfies that is compact. The following lemma is useful.
Lemma C.1 (Chapter 14, Theorem 25 of [Roy88]).
Let be a locally compact group acting transitively and properly on a locally compact Hausdorff space . Then there is a unique (up to multiplicative factors) Baire measure on which is invariant under the action of .
Now for the claimed result, it is easy to verify that (a proper version of) always takes value in , assuming . To show that is uniform on , note that for any orthogonal matrix , it is easy to verify
Denoting the RHS by , we also have . Consequently,
meaning that the distribution of the statistic is invariant under right multiplication by an orthogonal matrix. Let the orthogonal group act via the map ; this defines a group action. This action is transitive, as for any two points we can add more rows to obtain orthogonal matrices, one of which maps the first point to the second. This action is also proper as the orthogonal group itself is compact. Hence, Lemma C.1 shows that the statistic is uniformly distributed on the stated set.
C.3 Proof of Lemma A.4
The proof of Lemma A.4 is a simple consequence of several known results. First, Basu’s theorem implies that the sample mean and the sample covariance are independent. Second, the computation in Example 4.1 shows
Finally, as again follows a Wishart distribution, the proof of Lemma A.2 shows that
In conclusion, we have
For the distribution of , it is clear that (a proper version of) always takes value in , and we show that the distribution of is invariant with proper group actions in . Consider the set
with the usual matrix multiplication. We show that is a group: clearly ; for , it is clear that and therefore ; for , it holds that and thus . Next we show that the action with is a group action on , and it suffices to show that . This is true as , and . This group action is also transitive: for any , we may properly add rows to them and obtain , where one of the added rows is a scalar multiple of , which is feasible as . Consequently, the matrix maps to , and also to ; hence . Finally, we show that any group action of on does not change the distribution of . To see this, for any we have
Therefore, following the same arguments as in Example A.1 we arrive at the desired invariance, and the uniform distribution of is a direct consequence of Lemma C.1.
C.4 Proof of Lemma 5.8
For any subset with the prescribed size, let the corresponding distribution be that of the pool when the new samples are placed at the given index set. Then it is clear that the overall distribution is the uniform mixture over all subsets of that size. To compute the $\chi^2$-divergence where the first distribution is a mixture, the following identity holds [IS12]:
where is an independent copy of . By the independence assumption, we have
and consequently
It remains to upper bound the expectation with respect to the overlap of the two index sets. Note that this overlap follows a hypergeometric distribution, which corresponds to sampling without replacement. The counterpart for sampling with replacement is a Binomial distribution, and the following lemma shows that the latter dominates the former in the convex order:
Lemma C.2.
[Hoe63, Theorem 4] Let the population be . Let denote random samples without replacement from and denote random samples with replacement. If is convex, then
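The convex ordering in Lemma C.2 is also easy to check by simulation; the snippet below (the parameters and the convex test function are illustrative choices) compares the two sampling schemes.

```python
import numpy as np

# Monte Carlo check of Lemma C.2: for convex f, sampling without
# replacement (hypergeometric) is dominated by sampling with replacement
# (binomial) in expectation. All parameters are illustrative.
rng = np.random.default_rng(0)
N, K, draws, trials = 100, 30, 20, 200000
f = lambda x: np.exp(0.3 * x)                       # a convex test function
hyp = rng.hypergeometric(K, N - K, draws, trials)   # without replacement
binom = rng.binomial(draws, K / N, trials)          # with replacement
print(f(hyp).mean(), "<=", f(binom).mean())
```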
C.5 Proof of Lemma A.7
Note that in the proof of Theorem 5.2, we have
Since the TV distance is always upper bounded by one, the following upper bound is also true:
Consequently, taking the expectation over leads to the claim.
C.6 Proof of Lemma A.13
First we note that
By the triangle inequality,
(40)
Moreover, it is straightforward to verify that is -Lipschitz with respect to . Therefore, the Gaussian Lipschitz concentration (see, e.g. [BLM13, Theorem 10.17]) gives that
(41)
for any . Hence, combining (C.6) and (41), we conclude that holds with probability at least , and therefore
where we have used that whenever . Summing over and using the property of the soft-thresholding estimator gives the claimed result.
C.7 Proof of Lemma A.15
For the first claim, note that for Poisson models, we have
Consequently, for , we have
establishing the first claim.
For the second claim, note that conditioning on ,
where we have used . Hence,
and the TV distance satisfies
C.8 Proof of Lemma A.17
The upper bound is easy: it is a consequence of the general product Poisson model considered in Example A.14. Specifically, we consider the sufficient statistic and simply apply the sufficiency-based algorithm to it. Since
where the last identity crucially makes use of the constraint that the probability vector sums to one, this procedure works.
Next we show that sample amplification is impossible when the stated conditions on $n$ and $m$ fail. Note that this implies the general impossibility result, for the larger case is never easier than the smaller one. To prove the above claim, w.l.o.g. we assume that the alphabet size is even, and consider the following parametric submodel:
Clearly this is a parametric submodel of the original model. It is also a product model, so we may apply Theorem 6.5 once Assumption 4 has been verified. Note that as the parameters vary arbitrarily over their range, the range of
is , so Assumption 4 is fulfilled when for a small constant . Consequently, Theorem 6.5 establishes the desired bound .
C.9 Proof of Lemma A.19
For the first inequality, note that for a large , both and hold with probability at least . When both events hold, we have
Consequently, for with large enough, we have
For the second inequality, note that Jensen’s inequality yields that
Again, with probability at least we have , and therefore for ,
for large enough.
For the last claim, consider any . Let , and consider the product prior over blocks, each of dimension . In other words, we set exactly one of the first coordinates of the mean vector to a common value uniformly at random, do the same for the next coordinates, and so on. Clearly the resulting mean vector is always -sparse. Writing the mean vector blockwise, let the loss function be
with an integer to be specified later. Then in each block, we reduce to the case , and the error probability for this block is for sample size . Moreover, the errors in different blocks are independent. Consequently,
(42)
Finally, again by the properties of summarized in Lemma A.19 and the pigeonhole principle, for , we could always find some such that , with both quantities in . Now choosing
with the above choice of , the Bayes risk difference will be lower bounded by by a similar argument to the proof of Theorem 6.4. Now by (C.9) and Lemma 6.1, we see that and are necessary for sample amplification.
C.10 Proof of Lemma A.21
The following results are the key to the proof of Lemma A.21, which we will assume for now and prove in the subsequent sections.
Lemma C.3.
For , the following identities hold for the estimator :
(43)
(44)
where denotes the chi-squared distribution with degrees of freedom, and is the polygamma function of order . In particular, when , the following inequalities hold:
(45) | |||
(46) |
where the functions and are given by (13) and
(47)
Lemma C.4.
For the function in (47) and , it holds that
(48)
Returning to the proof of Lemma A.21, the first claim is a direct application of (45) and (46). For the second claim, let , and
Then by (45), (46) and Lemma C.4, we have
Meanwhile, the variance satisfies
Note that for larger than a constant depending only on , we always have
Therefore, for large enough, the above inequalities imply
establishing the second claim.
For the last statement, note that
the intermediate value theorem then implies that
C.11 Proof of Lemma C.3
We first recall the well-known Bartlett decomposition: for the lower triangular matrix , the random variables are mutually independent, with
For and , simple algebra gives
Consequently, the identity (43) follows from the above Bartlett decomposition. This identity was also obtained in [JS61].
For the identity (44) on the variance, by the mutual independence we have
(49)
Next we evaluate each term of (49). Clearly for and . For the other random variables, we need to recall the following identity for :
(50)
Based on (50), we have
Moreover, differentiating at gives
(51)
(52)
Note that (52) leads to
Finally, differentiating at gives that
and hence the identity for the digamma function together with (51) leads to
Therefore, plugging the above quantities in (49) gives the identity (44).
Next we prove the remaining inequalities when . For (45), note that (43) gives an identity (together with (51))
Since for all , replacing by in the above expression only incurs an absolute difference at most
For the remaining terms, it is not hard to verify that
As is increasing on , we have
Now (45) follows from a combination of the above inequalities.
C.12 Proof of Lemma C.4
Note that when , Taylor expansion of at gives
For , we have , and therefore
Hence, in both cases we have
References
- [AGSV20] Brian Axelrod, Shivam Garg, Vatsal Sharan, and Gregory Valiant. Sample amplification: Increasing dataset size even when learning is impossible. In International Conference on Machine Learning, pages 442–451. PMLR, 2020.
- [AS64] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government Printing Office, 1964.
- [ASE17] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
- [Bar86] Andrew R Barron. Entropy and the central limit theorem. The Annals of Probability, pages 336–342, 1986.
- [Bas55] Dev Basu. On statistics independent of a complete sufficient statistic. Sankhyā: The Indian Journal of Statistics (1933-1960), 15(4):377–380, 1955.
- [Bas58] Dev Basu. On statistics independent of sufficient statistics. Sankhyā: The Indian Journal of Statistics, pages 223–226, 1958.
- [Bas59] Dev Basu. The family of ancillary statistics. Sankhyā: The Indian Journal of Statistics, pages 247–256, 1959.
- [BB19] Matthew Brennan and Guy Bresler. Optimal average-case reductions to sparse pca: From weak assumptions to strong hardness. In Conference on Learning Theory, pages 469–470. PMLR, 2019.
- [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, pages 648–847. PMLR, 2020.
- [BC14] Vlad Bally and Lucia Caramellino. On the distances between probability density functions. Electronic Journal of Probability, 19, 2014.
- [BC16] Vlad Bally and Lucia Caramellino. Asymptotic development for the CLT in total variation distance. Bernoulli, 22(4):2442–2485, 2016.
- [BCG14] Sergey G Bobkov, Gennadiy P Chistyakov, and Friedrich Götze. Berry–Esseen bounds in the entropic central limit theorem. Probability Theory and Related Fields, 159(3-4):435–478, 2014.
- [BCLZ02] Lawrence D Brown, T Tony Cai, Mark G Low, and Cun-Hui Zhang. Asymptotic equivalence theory for nonparametric regression with random design. The Annals of Statistics, 30(3):688–707, 2002.
- [BCLZ04] Lawrence D Brown, Andrew V Carter, Mark G Low, and Cun-Hui Zhang. Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. The Annals of Statistics, 32(5):2074–2097, 2004.
- [BKK+22] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [BL96] Lawrence D Brown and Mark G Low. Asymptotic equivalence of nonparametric regression and white noise. The Annals of Statistics, 24(6):2384–2398, 1996.
- [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
- [BR10] Rabi N Bhattacharya and R Ranga Rao. Normal approximation and asymptotic expansions. SIAM, 2010.
- [BR13] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
- [CMST17] Francesco Calimeri, Aldo Marzullo, Claudio Stamile, and Giorgio Terracina. Biomedical data augmentation using generative adversarial neural networks. In International conference on artificial neural networks, pages 626–634. Springer, 2017.
- [CMV+21] Phillip Chlap, Hang Min, Nym Vandenberg, Jason Dowling, Lois Holloway, and Annette Haworth. A review of medical image data augmentation techniques for deep learning applications. Journal of Medical Imaging and Radiation Oncology, 65(5):545–563, 2021.
- [CPS+19] Aggelina Chatziagapi, Georgios Paraskevopoulos, Dimitris Sgouropoulos, Georgios Pantazopoulos, Malvina Nikandrou, Theodoros Giannakopoulos, Athanasios Katsamanis, Alexandros Potamianos, and Shrikanth Narayanan. Data augmentation using gans for speech emotion recognition. In Interspeech, pages 171–175, 2019.
- [CQZ+22] Dong Chen, Xinda Qi, Yu Zheng, Yuzhen Lu, and Zhaojian Li. Deep data augmentation for weed recognition enhancement: A diffusion probabilistic model and transfer learning based approach. arXiv preprint arXiv:2210.09509, 2022.
- [CZM+19] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
- [CZSL20] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
- [DJ94] David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
- [DMR18] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians. arXiv preprint arXiv:1810.08693, 2018.
- [DMR20] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The minimax learning rates of normal and Ising undirected graphical models. Electronic Journal of Statistics, 14(1):2338 – 2361, 2020.
- [DY79] Persi Diaconis and Donald Ylvisaker. Conjugate priors for exponential families. The Annals of Statistics, pages 269–281, 1979.
- [FADK+18] Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. GAN-based synthetic medical image augmentation for increased cnn performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
- [HJW15] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under $\ell_1$ loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.
- [HMC+20] Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020.
- [Hoe56] Wassily Hoeffding. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, pages 713–721, 1956.
- [Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- [HRA+19] Changhee Han, Leonardo Rundo, Ryosuke Araki, Yudai Nagano, Yujiro Furukawa, Giancarlo Mauri, Hideki Nakayama, and Hideaki Hayashi. Combining noise-to-image and image-to-image gans: Brain mr image augmentation for tumor detection. IEEE Access, 7:156966–156977, 2019.
- [IS12] Yuri Ingster and Irina A Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. Springer Science & Business Media, 2012.
- [JS61] W James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
- [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
- [LC64] Lucien Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, pages 1419–1455, 1964.
- [LC72] Lucien Le Cam. Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
- [LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer, 1986.
- [LCY90] Lucien Le Cam and Grace Lo Yang. Asymptotics in Statistics. Springer, New York, 1990.
- [LS92] Erich Leo Lehmann and FW Scholz. Ancillarity. Lecture Notes-Monograph Series, 17:32–51, 1992.
- [LSM+22] Lorenzo Luzi, Ali Siahkoohi, Paul M Mayer, Josue Casco-Rodriguez, and Richard Baraniuk. Boomerang: Local sampling on image manifolds using diffusion models. arXiv preprint arXiv:2210.12100, 2022.
- [LSW+23] Yingzhou Lu, Minjie Shen, Huazheng Wang, Xiao Wang, Capucine van Rechem, and Wenqi Wei. Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062, 2023.
- [ML08] Klaus-J Miescke and F Liese. Statistical Decision Theory: Estimation, Testing, and Selection. Springer, 2008.
- [MMKSM18] Ali Madani, Mehdi Moradi, Alexandros Karargyris, and Tanveer Syeda-Mahmood. Chest x-ray generation and data augmentation for cardiovascular abnormality classification. In Medical imaging 2018: Image processing, volume 10574, pages 415–420. SPIE, 2018.
- [MW15] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. The Annals of Statistics, 43(3):1089–1116, 2015.
- [Nes03] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
- [Pro52] Yu V Prohorov. A local theorem for densities. In Doklady Akad. Nauk SSSR (NS), volume 83, pages 797–800, 1952.
- [Roy88] Halsey Lawrence Royden. Real analysis (3rd edition). Prentice-Hall, 1988.
- [RSH19] Kolyan Ray and Johannes Schmidt-Hieber. Asymptotic nonequivalence of density estimation and Gaussian white noise for small densities. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 55, pages 2195–2208. Institut Henri Poincaré, 2019.
- [SK19] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019.
- [SM62] S Kh Sirazhdinov and M Mamatov. On convergence in the mean for densities. Theory of Probability & Its Applications, 7(4):424–428, 1962.
- [SSP+03] Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, volume 3. Edinburgh, 2003.
- [SYPS19] Veit Sandfort, Ke Yan, Perry J Pickhardt, and Ronald M Summers. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific Reports, 9(1):1–9, 2019.
- [Tsy09] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
- [TUH18] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada. Between-class learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5486–5494, 2018.
- [VdV00] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
- [VLB+19] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019.
- [Wal50] Abraham Wald. Statistical decision functions. Wiley, 1950.
- [Wil82] EJ Williams. Some classes of conditional inference procedures. Journal of Applied Probability, 19(A):293–303, 1982.
- [YHO+19] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
- [YSH18] Wei Yi, Yaoran Sun, and Sailing He. Data augmentation using conditional gans for facial emotion recognition. In 2018 Progress in Electromagnetics Research Symposium (PIERS-Toyama), pages 710–714. IEEE, 2018.
- [YWB19] Xin Yi, Ekta Walia, and Paul Babyn. Generative adversarial network in medical imaging: A review. Medical image analysis, 58:101552, 2019.
- [ZCDLP18] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.