Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation in Contaminated Gaussian Models
Abstract
Consider the problem of simultaneous estimation of location and variance matrix under Huber's contaminated Gaussian model. First, we study minimum $f$-divergence estimation at the population level, corresponding to a generative adversarial method with a nonparametric discriminator, and establish conditions on $f$-divergences which lead to robust estimation, similarly to robustness of minimum distance estimation. More importantly, we develop tractable adversarial algorithms with simple spline discriminators, which can be implemented via nested optimization such that the discriminator parameters can be fully updated by maximizing a concave objective function given the current generator. The proposed methods are shown to achieve minimax optimal rates or near-optimal rates depending on the $f$-divergence and the penalty used. This is the first time such near-optimal error rates have been established for adversarial algorithms with linear discriminators under Huber's contamination model. We present simulation studies to demonstrate advantages of the proposed methods over classic robust estimators, pairwise methods, and a generative adversarial method with neural network discriminators.
1 Introduction
Consider Huber's contaminated Gaussian model (Huber (1964)): independent observations $X_1, \dots, X_n$ are obtained from $(1-\epsilon) N(\mu^*, \Sigma^*) + \epsilon Q$, where $N(\mu^*, \Sigma^*)$ is a $p$-dimensional Gaussian distribution with mean vector $\mu^*$ and variance matrix $\Sigma^*$, $Q$ is a probability distribution for contaminated data, and $\epsilon \in [0,1)$ is a contamination fraction. Our goal is to estimate the Gaussian parameters $(\mu^*, \Sigma^*)$, without any restriction on $Q$, for a small $\epsilon$. This allows both outliers located in areas with vanishing probabilities under $N(\mu^*, \Sigma^*)$ and other contaminated observations in areas with non-vanishing probabilities under $N(\mu^*, \Sigma^*)$. We focus on the setting where the dimension $p$ is small relative to the sample size $n$, and no sparsity assumption is placed on $\Sigma^*$ or its inverse matrix. The latter, $\Sigma^{*-1}$, is called the precision matrix and is of particular interest in Gaussian graphical modeling. In the low-dimensional setting, estimation of $\Sigma^*$ and $\Sigma^{*-1}$ can be treated as being equivalent.
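To make the data-generating setup concrete, the following Python sketch draws a sample from an $\epsilon$-contaminated Gaussian model; the function names, the contamination choice, and the parameter values are illustrative, not those used in our simulations.

```python
import numpy as np

def sample_huber_contaminated(n, mu, Sigma, eps, sample_Q, rng=None):
    """Draw n observations from (1 - eps) * N(mu, Sigma) + eps * Q.

    sample_Q(k, rng) should return k draws from the contamination
    distribution Q; its form is arbitrary under Huber's model.
    """
    rng = np.random.default_rng(rng)
    # Latent contamination indicators: U_i = 1 with probability eps.
    u = rng.random(n) < eps
    x = rng.multivariate_normal(mu, Sigma, size=n)   # uncontaminated draws
    if u.any():
        x[u] = sample_Q(int(u.sum()), rng)           # overwrite contaminated rows
    return x, u

# Example: a tight Gaussian contamination cluster far from the bulk (illustrative values).
p = 5
mu_star, Sigma_star = np.zeros(p), np.eye(p)
x, u = sample_huber_contaminated(
    n=1000, mu=mu_star, Sigma=Sigma_star, eps=0.05,
    sample_Q=lambda k, rng: rng.multivariate_normal(5.0 * np.ones(p), 0.1 * np.eye(p), size=k),
)
```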
There is a vast literature on robust statistics (e.g., Huber and Ronchetti 2009; Maronna et al. 2018). In particular, the problem of robust estimation from contaminated Gaussian data has been extensively studied, and various interesting methods and results have been obtained recently. Under Huber’s contamination model above, while the bulk of the data are still Gaussian distributed, a challenge is that the contamination status of each observation is hidden, and the contaminated data may be arbitrarily distributed. In this sense, this problem should be distinguished from various related problems, including multivariate scatter estimation for elliptical distributions as in Tyler (1987) and estimation in Gaussian copula graphical models as in Liu et al. (2012) and Xue and Zou (2012), among others. For motivation and comparison, we discuss below several existing approaches directly related to our work.
Existing work. As suggested by the definition of the variance matrix $\Sigma^*$, a numerically simple method, proposed in Öllerer and Croux (2015) and Tarr, Müller and Weber (2016), is to apply a robust covariance estimator for each pair of variables, for example, based on robust scale and correlation estimators, and then assemble those estimators into an estimated variance matrix. These pairwise methods are naturally suitable for both Huber's contamination model and the cellwise contamination model where the components of a data vector can be contaminated independently, each with a small probability. For various choices of the correlation estimator, such as the transformed Kendall's $\tau$ and Spearman's $\rho$ estimators, this method is shown in Loh and Tan (2018) to achieve, in the maximum norm, the minimax error rate under cellwise contamination and Huber's contamination model. However, because a transformed correlation estimator is used, the variance matrix estimator in Loh and Tan (2018) may not be positive semidefinite (Öllerer and Croux (2015)). Moreover, this approach seems to rely on the availability of individual elements of $\Sigma^*$ as pairwise covariances, and generalization to other multivariate models can be difficult. In our numerical experiments, such pairwise methods have relatively poor performance when contaminated data are not easily separable from the uncontaminated data marginally, especially with nonnegligible $\epsilon$.
For location and scatter estimation under Huber's contamination model, Chen, Gao and Ren (2018) showed that the minimax error rates in the $\ell_2$ and operator norms, for $\mu^*$ and $\Sigma^*$ respectively, are of order $\sqrt{p/n} + \epsilon$ and attained by maximizing Tukey's half-space depth (Tukey (1975)) and a matrix depth function, which is also studied in Zhang (2002) and Paindaveine and Van Bever (2018). Both depth functions, defined through minimization of certain discontinuous objective functions, are in general difficult to compute, and maximization of these depth functions is also numerically intractable. Subsequently, Gao et al. (2019) and Gao, Yao and Zhu (2020) exploited a connection between depth-based estimators and generative adversarial nets (GANs) (Goodfellow et al. (2014)), and proposed robust location and scatter estimators in the form of GANs. These estimators are also proved to achieve the minimax error rates in the $\ell_2$ and operator norms under Huber's contamination model. More recent work in this direction includes Zhu, Jiao and Tse (2020), Wu et al. (2020), and Liu and Loh (2021).
GANs are a popular approach for learning generative models, with numerous impressive applications (Goodfellow et al. (2014)). In the GAN approach, a generator is defined to transform white noises into fake data, and a discriminator is then employed to distinguish between the fake and real data. The generator and discriminator are trained through minimax optimization with a certain objective function. For the GANs used in Gao et al. (2019) and Gao, Yao and Zhu (2020), the generator is defined by the Gaussian model and the discriminator is a multi-layer neural network with sigmoid activations in the top and bottom layers. Hence the discriminator can be seen as logistic regression with the “predictors” defined by the remaining layers of the neural network. The GAN objective function, usually taken to be the log-likelihood function in the classification of fake and real data, is more tractable than discontinuous depth functions, but remains nonconvex in the discriminator parameters and nonconcave in the generator parameters. Training such GANs is challenging through nonconvex-nonconcave minimax optimization (Farnia and Ozdaglar (2020); Jin, Netrapalli and Jordan (2020)).
There is also an interesting connection between GANs and minimum divergence (or distance) (MD) estimation, which has been traditionally studied for robust estimation (Donoho and Liu (1988); Lindsay (1994); Basu and Lindsay (1994)). A prominent example is minimum Hellinger distance estimation (Beran (1977); Tamura and Boos (1986)). In fact, as shown in $f$-GANs (Nowozin, Cseke and Tomioka (2016)), various choices of the objective function in GANs can be derived from variational lower bounds of $f$-divergences between the generator and real data distributions. Familiar examples of $f$-divergences include the Kullback–Leibler (KL) and squared Hellinger divergences, and the total variation (TV) distance (Ali and Silvey (1966); Csiszár (1967)). In particular, using the log-likelihood function in optimizing the discriminator leads to a lower bound of the Jensen–Shannon (JS) divergence for the generator. Furthermore, the lower bound becomes tight if the discriminator class is sufficiently rich (to include the nonparametrically optimal discriminator given any generator). In this sense, $f$-GANs can be said to nearly implement minimum $f$-divergence estimation, where the parameters are estimated by minimizing an $f$-divergence between the model and data distributions. However, this relationship is only approximate and suggestive, because even a class of neural network discriminators may not be nonparametrically rich with population data. A similar issue can also be found in previous studies, where minimum Hellinger distance estimation and related methods require a smoothed density function of sample data. This approach is impractical for multivariate continuous data.
In addition to MD estimation based on $f$-divergences as mentioned above, two other methods of MD estimation have also been studied for robust estimation both in general parametric models and in multivariate Gaussian models. The two methods are defined by minimization of power density divergences (also called $\beta$-divergences) (Basu et al. (1998); Miyamura and Kano (2006)) and that of $\gamma$-divergences (Windham (1995); Fujisawa and Eguchi (2008); Hirose, Fujisawa and Sese (2017)). See Jones et al. (2001) for a comparison of these two methods. In contrast with $f$-divergences, these two divergences can be evaluated without requiring smooth density estimation from sample data, and hence the corresponding MD estimators can be computed by standard optimization algorithms. To our knowledge, error bounds have not been formally derived for these methods under Huber's contaminated Gaussian model.
Various methods based on iterative pruning or convex programming have been studied with provable error bounds for robust estimation in Huber’s contaminated Gaussian model (Lai, Rao and Vempala (2016); Balmand and Dalalyan (2015); Diakonikolas et al. (2019)). These methods either handle scatter estimation after location estimation sequentially in two stages, or resort to using normalized differences of pairs with mean zero for scatter estimation.
Our work. We propose and study adversarial algorithms with linear spline discriminators, and establish various error bounds for simultaneous location and scatter estimation under Huber's contaminated Gaussian model. Two distinct types of GANs are exploited. The first one is logit $f$-GANs (Tan, Song and Ou (2019)), which correspond to a specific choice of $f$-GANs with the objective function formulated as a negative loss function for logistic regression (or equivalently a density ratio model between fake and real data) when training the discriminator. The second is hinge GAN (Lim and Ye (2017); Zhao, Mathieu and LeCun (2017)), where the objective function is taken to be the negative hinge loss function when training the discriminator. The hinge objective can be derived from a variational lower bound of the total variation distance (Nguyen, Wainwright and Jordan (2010); Tan, Song and Ou (2019)), but cannot be deduced as a special case of the $f$-GAN objective even though the total variation is also an $f$-divergence. See Remark 3. In addition, we allow two-objective GANs, including the log trick in Goodfellow et al. (2014), where two objective functions are used, one for updating the discriminator and the other for updating the generator.
As a major departure from previous studies of GANs, our methods use a simple linear class of spline discriminators, where the basis functions consist of univariate truncated linear functions (or shifted ReLUs) at knots and the pairwise products of such univariate functions. For hinge GAN and certain logit $f$-GANs including those based on the reverse KL (rKL) and JS divergences, the objective function is concave in the discriminator. By the linearity of the spline class, the objective function is then concave in the spline coefficients. Hence our hinge GAN and logit $f$-GAN methods involve maximization of a concave function when training the spline discriminator for any fixed generator. In contrast with nonconvex-nonconcave minimax optimization for GANs with neural network discriminators (Gao et al. (2019); Gao, Yao and Zhu (2020)), our methods can be implemented through nested optimization in a numerically tractable manner. See Remarks 1, 10 and 11 and Algorithm 1. While the outer minimization for updating the generator remains nonconvex in our methods, such a single nonconvex optimization is usually more tractable than nonconvex-nonconcave minimax optimization.
In spite of the limited capacity of the spline discriminators, we establish various error bounds for our location and scatter estimators, depending on whether the hinge GAN or logit $f$-GAN is used and whether an $\ell_1$ or $\ell_2$ penalty is incorporated when training the discriminator. See Table 1 for a summary of existing and our error rates in scatter estimation. Our $\ell_1$ penalized hinge GAN method achieves the minimax error rate, $\sqrt{\log p / n} + \epsilon$, in the maximum norm. Our $\ell_2$ penalized hinge GAN method achieves an error rate in the $\ell_2$ and Frobenius norms which is in general larger than the minimax error rate. While this might indicate the price paid for maintaining the convexity in training the discriminator, our error rate reduces to the same order as the minimax error rate provided that $\epsilon$ is sufficiently small, such that the contamination error term is dominated by the sampling variation term up to a constant factor. To our knowledge, such near-optimal error rates were previously inconceivable for adversarial algorithms with linear discriminators in robust estimation. Moreover, the error rates for our logit $f$-GAN methods exhibit a square-root dependency on the contamination fraction $\epsilon$, instead of a linear dependency for our hinge GAN methods. This shows, for the first time, some theoretical advantage of hinge GAN over logit $f$-GANs, although comparative performances of these methods may vary in practice, depending on specific settings.
Table 1. Summary of error rates and computation.
Method | Error rates | Computation
OC15, TMW15, LT18 | — | Non-iterative computation
CGR18 | — | Minimax optimization with zero-one discriminators
DKKLMS19 | — (up to log factors) | Convex optimization
GYZ20 | — | Minimax optimization with neural network discriminators
$\ell_1$ logit $f$-GAN (Theorem 2) | — | Nested optimization with an objective function concave in linear spline discriminators
$\ell_1$ hinge GAN (Theorem 4) | — | (same as above)
$\ell_2$ logit $f$-GAN (Theorem 3) | — | (same as above)
$\ell_2$ hinge GAN (Theorem 5) | — (minimax provided $\epsilon$ is sufficiently small, up to a constant factor) | (same as above)
Note: OC15, TMW15, LT18 refer to methods and theory in Öllerer and Croux (2015), Tarr, Müller and Weber (2016), and Loh and Tan (2018); CGR18, DKKLMS19, and GYZ20 refer to, respectively, Chen, Gao and Ren (2018), Diakonikolas et al. (2019), and Gao, Yao and Zhu (2020).
To facilitate and complement our sample analysis, we provide error bounds for the population version of hinge GAN or logit $f$-GANs with nonparametric discriminators, that is, minimization of the exact total variation or $f$-divergence at the population level. From Theorem 1, population minimum TV or $f$-divergence estimation under a simple set of conditions on $f$ (Assumption 1) leads to errors of order $\epsilon$ or $\sqrt{\epsilon}$ respectively under Huber's contamination model. Assumption 1 allows the reverse KL, JS, reverse $\chi^2$, and squared Hellinger divergences, but excludes the mixed KL divergence, the Pearson $\chi^2$ divergence, and, as reassurance, the KL divergence, which corresponds to maximum likelihood estimation and is known to be non-robust. Hence certain (but not all) minimum $f$-divergence estimation achieves robustness under Huber's contamination model or a TV-contaminated neighborhood. Such robustness is identified for the first time for minimum $f$-divergence estimation, and is related to, but distinct from, robustness of minimum distance estimation under a contaminated neighborhood with respect to the same distance (Donoho and Liu (1988)). See Remark 7 for further discussion. The population error bounds in the $\ell_2$ and Frobenius norms are independent of the dimension $p$ and hence tighter than the corresponding terms in our sample error bounds for both hinge GAN and logit $f$-GAN. These gaps can be attributed to the use of nonparametric versus spline discriminators.
Remarkably, our population analysis also sheds light on the comparison of our sample results and those in Gao, Yao and Zhu (2020). On one hand, another set of conditions (Assumption 2), in addition to Assumption 1, is required in our sample analysis of logit $f$-GANs with spline discriminators. On the other hand, the GANs used in Gao, Yao and Zhu (2020) can be recast as logit $f$-GANs with neural network discriminators (see Section 5.2). But minimax error rates are shown to be achieved in Gao, Yao and Zhu (2020) for an $f$-divergence (for example, the mixed KL divergence) which, let alone Assumption 2, does not even satisfy Assumption 1 used in our analysis to show robustness of minimum $f$-divergence estimation. The main reason for this discrepancy is that the neural network discriminator in Gao, Yao and Zhu (2020) is directly constrained to be of the order of the minimax rate in the log odds, which considerably simplifies the proofs of rate-optimal robust estimation. In contrast, our methods use linear spline discriminators (with penalties independent of $\epsilon$), and our proofs of robust estimation need to carefully tackle various technical difficulties due to the simple design of our methods. See Figure 2(b) for an illustration of non-robustness by minimization of the mixed KL divergence, and Section 5.2 for further discussion on this subtle issue in Gao, Yao and Zhu (2020).
Notation. For a vector $a = (a_1, \dots, a_p)^{\mathrm T}$, we denote by $\|a\|_1$, $\|a\|_2$, and $\|a\|_\infty$ the $\ell_1$ norm, $\ell_2$ norm, and $\ell_\infty$ norm of $a$, respectively. For a matrix $A = (A_{jk})$, we define the element-wise maximum norm $\|A\|_{\max} = \max_{j,k} |A_{jk}|$, the Frobenius norm $\|A\|_{\mathrm F} = (\sum_{j,k} A_{jk}^2)^{1/2}$, the vectorized $\ell_1$ norm $\|A\|_1 = \sum_{j,k} |A_{jk}|$, the operator norm $\|A\|_{\mathrm{op}}$ (the largest singular value), and the $\ell_1$-induced operator norm $\|A\|_{1\to 1} = \max_k \sum_j |A_{jk}|$. For a square matrix $A$, we write $A \succeq 0$ to indicate that $A$ is positive semidefinite. The tensor product of vectors $a$ and $b$ is denoted by $a \otimes b$, and the vectorization of a matrix $A$ is denoted by $\mathrm{vec}(A)$. The cumulative distribution function of the standard normal distribution is denoted by $\Phi$, and the Gaussian error function is denoted by $\mathrm{erf}$.
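As a quick reference for the matrix norms above, here is a minimal numpy sketch; the comments follow the notation just defined.

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])

max_norm   = np.max(np.abs(A))                  # element-wise maximum norm
frob_norm  = np.linalg.norm(A, 'fro')           # Frobenius norm
vec_l1     = np.sum(np.abs(A))                  # vectorized L1 norm
op_norm    = np.linalg.norm(A, 2)               # operator (spectral) norm
l1_induced = np.max(np.sum(np.abs(A), axis=0))  # L1-induced operator norm (max column sum)
psd        = np.all(np.linalg.eigvalsh(A) >= -1e-12)  # positive semidefiniteness check
```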
2 Numerical illustration
We illustrate the performance of our rKL logit $f$-GAN and six existing methods, with two samples from Huber's contaminated Gaussian distribution, based on a Toeplitz covariance matrix and the first contamination distribution described in Section 6.2. Figure 1 shows the 95% Gaussian ellipsoids for the first two coordinates, using the estimated location vectors and variance matrices, except for Tyler's M-estimator (Tyler (1987)), Kendall's $\tau$ with MAD (Loh and Tan (2018)), and Spearman's $\rho$ with $Q_n$-estimator (Öllerer and Croux (2015)), where the locations are set to the true means. The performance of our JS logit $f$-GAN and hinge GAN is close to that of the rKL logit $f$-GAN. See the Supplement for an illustration based on the second contamination in Section 6.2.
Among the methods shown in Figure 1, the rKL logit $f$-GAN gives an ellipsoid closest to the truth, followed with relatively small but noticeable differences by the JS-GAN (Gao, Yao and Zhu (2020)), MCD (Rousseeuw (1985)), and Mest (Rocke (1996)), which are briefly described in Section 6.1. The remaining three methods, Kendall's $\tau$ with MAD, Spearman's $\rho$ with $Q_n$-estimator, and Tyler's M-estimator, show much less satisfactory performance. The estimated distributions from these methods are dragged towards the corner contamination cluster.
The relatively poor performance of the pairwise methods, Kendall's $\tau$ with MAD and Spearman's $\rho$ with $Q_n$-estimator, may be explained by the fact that, as shown by the marginal histograms in Figure 1, the data in each coordinate are one-sided heavy-tailed, but no obvious outliers can be seen marginally. The correlation estimates from Kendall's $\tau$ and Spearman's $\rho$ tend to be inaccurate even after sine transformations, especially with nonnegligible $\epsilon$. In contrast, our GAN methods as well as JS-GAN, MCD, and Mest are capable of capturing higher dimensional information so that the impact of contamination is limited to various extents.

3 Adversarial algorithms
We review various adversarial algorithms (or GANs), which are exploited by our methods for robust location and scatter estimation. To focus on main ideas, the algorithms are stated in their population versions, where the underlying data distribution $P$ is involved instead of the empirical distribution $P_n$. Let $\{P_\theta : \theta \in \Theta\}$ be a statistical model and $\mathcal{H}$ be a function class, where $P_\theta$ is called a generator and $h \in \mathcal{H}$ a discriminator. In our study, $P_\theta$ is a multivariate Gaussian distribution $N(\mu, \Sigma)$ with $\theta = (\mu, \Sigma)$, and $h$ is a pairwise spline function which is specified later in Section 4.2.
For a convex function $f$ with $f(1) = 0$, the $f$-divergence between two distributions $P$ and $Q$ with density functions $p$ and $q$ is $D_f(P, Q) = \int f\{p(x)/q(x)\}\, q(x)\,\mathrm{d}x$.
For example, taking $f(t) = t \log t$ yields the Kullback–Leibler (KL) divergence $\mathrm{KL}(P, Q)$. The logit $f$-GAN (Tan, Song and Ou (2019)) is defined by solving the minimax program
$\min_{\theta \in \Theta}\, \max_{h \in \mathcal{H}}\, K_f(P_\theta, P; h)$,   (1)
where $K_f(P_\theta, P; h) = E_P\{f'(e^{h(X)})\} - E_{P_\theta}\{f^{\#}(e^{h(X)})\}$.
Throughout, $f^{\#}(t) = t f'(t) - f(t)$ and $f'$ denotes the derivative of $f$. A motivation for this method is that the objective $K_f$ is a nonparametrically tight lower bound of the $f$-divergence (Tan, Song and Ou (2019), Proposition S1): for each $\theta$, it holds that for any function $h$,
$K_f(P_\theta, P; h) \le D_f(P, P_\theta)$,   (2)
where the equality is attained at $h = \log(\mathrm{d}P/\mathrm{d}P_\theta)$, the log density ratio between $P$ and $P_\theta$, or equivalently the log odds for classifying whether a data point is from $P$ or $P_\theta$. There are two choices of $f$ of particular interest. Taking the convex function $f$ associated with the Jensen–Shannon (JS) divergence, $D_{\mathrm{JS}}(P, P_\theta)$, leads to the objective function $E_P[\log\,\mathrm{sigmoid}\{h(X)\}] + E_{P_\theta}[\log\,\mathrm{sigmoid}\{-h(X)\}]$,
which is, up to a constant, the expected log-likelihood for logistic regression with log odds function $h$. For this choice of $f$, program (1) corresponds to the original GAN (Goodfellow et al. (2014)) with discrimination probability $\mathrm{sigmoid}(h)$. Taking $f(t) = -\log t$ leads to the reverse KL divergence, $\mathrm{KL}(P_\theta, P)$, and, up to an additive constant, the objective function $-E_P\{e^{-h(X)}\} - E_{P_\theta}\{h(X)\}$,
which is the negative calibration loss for logistic regression in Tan (2020).
The objective $K_f$ with $\theta$ fixed can be seen as a proper scoring rule reparameterized in terms of the log odds function for binary classification (Tan and Zhang (2020)). Replacing this objective in (1) by the negative hinge loss (which is not a proper scoring rule) leads to
$\min_{\theta \in \Theta}\, \max_{h \in \mathcal{H}}\, K_{\mathrm{HG}}(P_\theta, P; h)$,   (3)
where $K_{\mathrm{HG}}(P_\theta, P; h) = -E_P[\max\{0, 1 - h(X)\}] - E_{P_\theta}[\max\{0, 1 + h(X)\}]$.
This method is related to the geometric GAN described later in (13) and will be called hinge GAN. By Nguyen, Wainwright and Jordan (2009) or Proposition 5 in Tan, Song and Ou (2019), the objective $K_{\mathrm{HG}}$ is, up to an additive constant, a nonparametrically tight lower bound of the total variation distance scaled by 2: for each $\theta$, it holds that for any function $h$,
$K_{\mathrm{HG}}(P_\theta, P; h) \le 2\,\mathrm{TV}(P, P_\theta) - 2$,   (4)
where the equality is attained at $h = \mathrm{sign}(p - p_\theta)$, with $p$ and $p_\theta$ the density functions of $P$ and $P_\theta$. The objectives $K_f$ and $K_{\mathrm{HG}}$, with $\theta$ fixed, represent two types of loss functions for binary classification. See Buja, Stuetzle and Shen (2005) and Nguyen, Wainwright and Jordan (2009) for further discussions about loss functions and scoring rules.
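The following sketch evaluates the two types of discriminator objectives on samples of log-odds values: the logistic (JS-type) objective and the negative hinge objective. It is a simplified illustration under the conventions above; the paper's exact constants and argument ordering may differ.

```python
import numpy as np

def js_objective(h_real, h_fake):
    """Logistic (JS-type) discriminator objective for log-odds values h.

    Up to an additive constant, this is the expected log-likelihood for
    classifying real vs fake data with discrimination probability sigmoid(h);
    it is concave in the log-odds values."""
    log_sigmoid = lambda z: -np.logaddexp(0.0, -z)   # log sigmoid(z), numerically stable
    return np.mean(log_sigmoid(h_real)) + np.mean(log_sigmoid(-h_fake))

def hinge_objective(h_real, h_fake):
    """Negative hinge loss; its supremum over unrestricted h equals
    2 * TV(P, P_theta) - 2, attained at h = sign(p - p_theta)."""
    return (-np.mean(np.maximum(0.0, 1.0 - h_real))
            - np.mean(np.maximum(0.0, 1.0 + h_fake)))
```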
The preceding programs, (1) and (3), are defined as minimax optimization, each with a single objective function. There are also adversarial algorithms which are formulated as alternating optimization with two objective functions (see Remark 1). For example, GAN with the log trick in Goodfellow et al. (2014) is defined by solving
(7)
The second objective is introduced mainly to overcome vanishing gradients in the generator update when the discriminator is confident. The calibrated rKL-GAN (Huszár (2016); Tan, Song and Ou (2019)) is defined by solving
(10)
The two objectives are chosen to stabilize gradients in both the discriminator and the generator during training. The geometric GAN in Lim and Ye (2017) or, equivalently, the energy-based GAN in Zhao, Mathieu and LeCun (2017), as shown in Tan, Song and Ou (2019), is defined by solving
(13)
Interestingly, the second line in (10) or (13) involves the same objective for updating the generator, which can be equivalently simplified by dropping the terms involving only the real data distribution $P$, because those terms are fixed in the optimization over the generator.
Remark 1.
We discuss precise definitions for a solution to a minimax problem such as (1) or (3), and a solution to an alternating optimization problem such as (7)–(13). For an objective function $K(\theta, h)$, we say that $(\hat\theta, \hat h)$ is a solution to
$\min_{\theta}\, \max_{h}\, K(\theta, h)$   (14)
if $\hat h \in \arg\max_{h} K(\hat\theta, h)$ and $\max_{h} K(\hat\theta, h) \le \max_{h} K(\theta, h)$ for any $\theta$. In other words, we treat (14) as nested optimization: $\hat\theta$ is a minimizer of $\max_{h} K(\theta, h)$ as a function of $\theta$, and $\hat h$ is a maximizer of $K(\hat\theta, h)$ for fixed $\hat\theta$. This choice is directly exploited in both numerical implementation and theoretical analysis of our methods later. For two objective functions $K_1(\theta, h)$ and $K_2(\theta, h)$, we say that $(\hat\theta, \hat h)$ is a solution to the alternating optimization problem
(17)
if $\hat h \in \arg\max_{h} K_1(\hat\theta, h)$ and $\hat\theta \in \arg\min_{\theta} K_2(\theta, \hat h)$. In the special case where $K_1 = K_2$, denoted as $K$, a solution to (17) is also called a Nash equilibrium of $K$, satisfying $K(\hat\theta, h) \le K(\hat\theta, \hat h) \le K(\theta, \hat h)$ for all $\theta$ and $h$. A solution to minimax problem (14) and a Nash equilibrium of $K$ may in general differ from each other, although they coincide by Sion's minimax theorem in the special case where $K(\theta, h)$ is convex in $\theta$ for each $h$ and concave in $h$ for each $\theta$. Hence our treatment of GANs in the form (14) as nested optimization should be distinguished from existing studies where GANs are interpreted as finding Nash equilibria (Farnia and Ozdaglar (2020); Jin, Netrapalli and Jordan (2020)).
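The nested treatment of (14) can be sketched generically as follows: the inner problem is solved fully at each outer step. This is only a schematic loop with assumed callables (`inner_max`, `outer_grad`), not the actual Algorithm 1.

```python
import numpy as np

def nested_minimax(theta0, inner_max, outer_grad, step=0.05, n_iter=200):
    """Generic nested optimization for min_theta max_h K(theta, h).

    inner_max(theta)      : returns argmax_h K(theta, h) (a concave problem in our methods).
    outer_grad(theta, h)  : returns the gradient of K(theta, h) in theta at fixed h.
    Step sizes, stopping rules, and parameterizations are illustrative only.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        h_hat = inner_max(theta)                          # fully update the discriminator
        theta = theta - step * outer_grad(theta, h_hat)   # one descent step on the generator
    return theta
```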
Remark 2.
The population $f$-GAN (Nowozin, Cseke and Tomioka (2016)) is defined by solving
$\min_{\theta \in \Theta}\, \max_{g}\; E_P\{g(X)\} - E_{P_\theta}[f^{*}\{g(X)\}]$,   (18)
where $f^{*}$ is the Fenchel conjugate of $f$, i.e., $f^{*}(s) = \sup_{t} \{s t - f(t)\}$, and $g$ is a function taking values in the domain of $f^{*}$. Typically, $g$ is represented as $g = \rho(h)$, where $\rho$ is an activation function and $h$ takes values unrestricted in $\mathbb{R}$. The logit $f$-GAN corresponds to $f$-GAN with the specific choice $g = f'(e^{h})$, by the relationship $f^{*}\{f'(t)\} = t f'(t) - f(t) = f^{\#}(t)$ (Tan, Song and Ou (2019)). Nevertheless, a benefit of logit $f$-GAN is that the objective in (1) takes the explicit form of a negative discrimination loss such that $h$ can be seen to approximate the log density ratio between $P$ and $P_\theta$.
Remark 3.
There is an important difference between hinge GAN and logit $f$-GAN, although the total variation is also an $f$-divergence, with $f(t) = |t - 1|/2$. In fact, taking this choice of $f$ in logit $f$-GAN (1) yields
(19)
This is called TV learning and is related to depth-based estimation in Gao et al. (2019). Compared with hinge GAN in (3), program (19) is computationally more difficult to solve. Such a difference also exists in the application of general $f$-GAN to the total variation. For the total variation distance scaled by 2, with $f(t) = |t - 1|$, the conjugate $f^{*}(s)$ equals $s$ on a bounded interval and is infinite beyond it. If $g$ in (18) is specified with values restricted accordingly, then the objective in $f$-GAN (18) can be shown to be a difference of expectations under $P$ and $P_\theta$,
which in general differs from the negative hinge loss in (3) unless the discriminator is upper bounded by 1. If $g$ is specified as $g = \rho(h)$ for a function $h$ taking values unrestricted in $\mathbb{R}$, the resulting $f$-GAN is equivalent to TV-GAN in Gao et al. (2019), defined by solving
(20)
However, solving program (20) is numerically intractable as discussed in Gao et al. (2019).
4 Theory and methods
We propose and study various adversarial algorithms with simple spline discriminators for robust estimation in a multivariate Gaussian model. Assume that $X_1, \dots, X_n$ are independent observations obtained from Huber's $\epsilon$-contamination model, that is, the data distribution is of the form
$P_\epsilon = (1 - \epsilon)\, N(\mu^*, \Sigma^*) + \epsilon\, Q$,   (21)
where $N(\mu^*, \Sigma^*)$ is the Gaussian distribution with unknown $(\mu^*, \Sigma^*)$, $Q$ is a probability distribution for contaminated data, and $\epsilon$ is a contamination fraction. Both $Q$ and $\epsilon$ are unknown. The dependency of $P_\epsilon$ on $(\mu^*, \Sigma^*, Q, \epsilon)$ is suppressed in the notation. Equivalently, the data can be represented in a latent model: $(X_i, U_i)$, $i = 1, \dots, n$, are independent, where $U_i$ is Bernoulli with $\mathrm{pr}(U_i = 1) = \epsilon$, and $X_i$ is drawn from $N(\mu^*, \Sigma^*)$ or $Q$ given $U_i = 0$ or 1.
For theoretical analysis, we consider two choices of the parameter space. The first choice is $\Theta_1 = \{(\mu, \Sigma): \|\Sigma\|_{\max} \le M_1\}$ for a constant $M_1 > 0$. Equivalently, the diagonal elements of $\Sigma$ are upper bounded by $M_1$ for $(\mu, \Sigma) \in \Theta_1$. The second choice is $\Theta_2 = \{(\mu, \Sigma): \|\Sigma\|_{\mathrm{op}} \le M_2\}$ for a constant $M_2 > 0$. For simplicity, the dependency of $\Theta_1$ on $M_1$ or $\Theta_2$ on $M_2$ is suppressed in the notation. For the second parameter space $\Theta_2$, the minimax rates in the $\ell_2$ and operator norms have been shown to be achieved using matrix depth (Chen, Gao and Ren (2018)) and GANs with certain neural network discriminators (Gao, Yao and Zhu (2020)).
Our work aims to investigate adversarial algorithms with a simple linear class of spline discriminators for computational tractability, and to establish various error bounds for the proposed estimators, including those matching the minimax rates in the maximum norms for location and scatter estimation over $\Theta_1$, and, provided that $\sqrt{n}\,\epsilon$ is bounded by a constant (independent of $n$ and $p$), the minimax rates in the $\ell_2$ and Frobenius norms over $\Theta_2$.
4.1 Population analysis with nonparametric discriminators
A distinctive feature of GANs is that they can be motivated as approximations to minimum divergence estimation. For example, if the discriminator class $\mathcal{H}$ in (1) is rich enough to include the nonparametrically optimal discriminator $\log(\mathrm{d}P/\mathrm{d}P_\theta)$ for each $\theta$, then the (population) logit $f$-GAN amounts to minimizing the $f$-divergence $D_f(P, P_\theta)$ over $\theta$. Similarly, if the discriminator class in (3) is sufficiently rich, then the (population) hinge GAN amounts to minimizing the total variation $\mathrm{TV}(P, P_\theta)$.
As a prelude to our sample analysis, Theorem 1 shows that at the population level, minimization of the total variation and of certain $f$-divergences satisfying Assumption 1 achieves robustness under Huber's contamination model, in the sense that the estimation errors are respectively of order $\epsilon$ and $\sqrt{\epsilon}$, uniformly over all possible contamination distributions $Q$. Hence with sufficiently rich (or nonparametric) discriminators, the population versions of the hinge GAN and certain logit $f$-GANs can be said to be robust under Huber's contamination. From Table 2, Assumption 1 is satisfied by the reverse KL, JS, and squared Hellinger divergences, but violated by the KL divergence. Minimization of the KL divergence corresponds to maximum likelihood estimation, which is known to be non-robust under Huber's contamination model.
Assumption 1.
Suppose that $f$ is convex with $f(1) = 0$ and satisfies the following conditions.
-
(i)
is twice differentiable with .
-
(ii)
is non-increasing.
-
(iii)
is concave (i.e., is non-increasing)
See Table 2 for validity of conditions (ii) and (iii) in various $f$-divergences.
Remark 4.
Given a convex function $f$ with $f(1) = 0$, the same $f$-divergence can be defined using the convex function $f(t) + c(t - 1)$ for any constant $c$. Hence condition (ii) in Assumption 1 can be relaxed such that the relevant derivative is upper bounded by a constant. The non-increasingness in condition (ii) is stated above for ease of interpretation. The other conditions in Assumption 1 and Assumption 2 are not affected by non-unique choices of $f$.
Theorem 1.
[Statement: part (i) gives population location and scatter error bounds of order $\sqrt{\epsilon}$ for minimum $f$-divergence estimation under Assumption 1, and part (ii) gives bounds of order $\epsilon$ for minimum TV estimation, both uniformly over the contamination distribution $Q$; see Section 7.1 for the proofs.]
Figure 2 provides a simple numerical illustration. From Figure 2(a), the location errors of minimum divergence estimators corresponding to the four robust $f$-divergences (reverse KL, JS, squared Hellinger, and reverse $\chi^2$) satisfying Assumption 1 are of shapes in agreement with the order $\sqrt{\epsilon}$ in Theorem 1, whereas those corresponding to TV appear to be linear in $\epsilon$, for $\epsilon$ close to 0. For the KL, mixed KL, and Pearson $\chi^2$ divergences, which do not satisfy Assumption 1(ii), the corresponding errors quickly increase out of the plotting range, indicating non-robustness of the associated minimum divergence estimation. The differences between robust and non-robust $f$-divergences are further demonstrated in Figure 2(b). As the contamination location moves farther away, the errors of the robust $f$-divergences increase initially but then decrease to near 0, whereas those of the non-robust $f$-divergences appear to increase unboundedly.
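As a one-dimensional illustration of the population analysis (in the spirit of Figure 2), the following sketch computes the reverse KL divergence from a Gaussian model to a contaminated target by numerical integration and minimizes it over the location; the contamination location, fraction, and integration range are illustrative choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def rkl_to_contaminated(mu, eps, contam_loc=6.0):
    """Reverse KL divergence KL(N(mu, 1) || P) for a 1-D contaminated target
    P = (1 - eps) N(0, 1) + eps N(contam_loc, 1), by numerical integration."""
    def integrand(x):
        q = norm.pdf(x, mu, 1.0)
        p = (1 - eps) * norm.pdf(x) + eps * norm.pdf(x, contam_loc, 1.0)
        return q * (np.log(q) - np.log(p))
    return quad(integrand, mu - 12, mu + 12)[0]

# Population minimum-rKL location estimate under 5% contamination:
# the minimizer stays close to 0 here, illustrating the robustness in Theorem 1.
mu_hat = minimize_scalar(lambda m: rkl_to_contaminated(m, eps=0.05),
                         bounds=(-2, 8), method='bounded').x
```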
Name | $f$ convex | Non-incr. (Assumption 1(ii)) | Concave (Assumption 1(iii)) | Concave (Assumption 2(i)) | Lipschitz (Assumption 2(ii))
Total variation | ✓ | — | — | — | |
Reverse KL | ✓ | ✓ | ✓ | ✓ | |
Jensen-Shannon | ✓ | ✓ | ✓ | ✓ | |
Squared Hellinger | ✓ | ✓ | ✓ | ||
Reverse | ✓ | ✓ | ✓ | ||
KL | ✓ | ✓ | |||
Mixed KL | ✓ | ✓ | |||
Pearson $\chi^2$ | ✓ |
Note: The mixed KL divergence is defined as .
Remark 5.
From the proof in Section 7.1, Theorem 1(i) remains valid if is replaced by in Assumption 1(i) and the definition of , and Assumption 1(iii), the concavity of , is removed. On the other hand, a stronger condition than Assumption 1(iii) is used in our sample analysis: for convex , the concavity of is implied by Assumption 2(i), as discussed in Remark 9.
Remark 6.
The population bounds in Theorem 1 are more refined than those in our sample analysis later. The population minimizer is defined by minimization over the unrestricted parameter space instead of $\Theta_1$ or $\Theta_2$ with the restriction $\|\Sigma\|_{\max} \le M_1$ or $\|\Sigma\|_{\mathrm{op}} \le M_2$. The scaling factors in the population bounds also depend directly on the maximum or operator norm of the true variance matrix $\Sigma^*$, instead of the pre-specified constants $M_1$ or $M_2$. Note that the parameter space is also restricted and the error bounds depend on the corresponding bound in the sample analysis of Gao, Yao and Zhu (2020). Nevertheless, the population bounds share a similar feature as in our sample bounds later: the error bounds in the maximum norms are governed by $\|\Sigma^*\|_{\max}$, which can be much smaller than $\|\Sigma^*\|_{\mathrm{op}}$ involved in the error bounds in the operator norm.
Remark 7.
It is interesting to connect and compare our results with Donoho and Liu (1988), where minimum distance (MD) estimation is studied, that is, minimization of a proper distance satisfying the triangle inequality. For minimum TV estimation, let . For location estimation, define
which are called the bias distortion curve and the gauge function. Scatter estimation can be discussed in a similar manner. For a general family , the first half in our proof of Theorem 1(ii) shows that for any satisfying , we have . This implies a bound similar to Proposition 5.1 in Donoho and Liu (1988):
(24)
For the multivariate Gaussian family , the second half in our proof of Theorem 1(ii) derives an explicit upper bound on provided that for a constant :
where . Combining the preceding inequalities yields in Theorem 1(ii), with . In addition, Proposition 5.1 in Donoho and Liu (1988) gives the same bound as (24) for MD estimation using certain other distances , including the Hellinger distance, where the MD functional is defined as , and and are defined with replaced by . The distances used in defining the MD functional and the contamination neighborhood are tied to each other. Hence, except for minimum TV estimation, our setting differs from Donoho and Liu (1988) in studying different choices of minimum -divergence estimation over the same Huber’s contamination neighborhood.
Remark 8.
We briefly comment on how our result is related to breakdown points in robust statistics (Huber (1981), Section 1.4). For estimating , the population breakdown point of a functional can be defined as , where . Scatter estimation can be discussed in a similar manner. For defined from minimum TV estimation, Theorem 1(ii) shows that if , then , as noted in Remark 7. This not only provides an explicit bound on , but also implies that the population breakdown point is at least for minimum TV estimation. Similar implications can be obtained from Theorem 1(i) for minimum -divergence estimation. For defined from minimum rKL divergence estimation, Theorem 1(i) shows that if , then , and hence the population breakdown point is at least . While these estimates of breakdown points can potentially be improved, our population analysis as well as sample analysis in the subsequent sections focus on deriving quantitative error bounds in terms of sufficiently small and some scaling constants free of .
4.2 Logit $f$-GAN with spline discriminators
For the population analysis in Section 4.1, a discriminator class is assumed to be rich enough to include the nonparametrically optimal discriminator, which depends on the unknown data distribution $P$. Because the contamination distribution $Q$ can be arbitrary, this nonparametric assumption is inappropriate for sample analysis. Recently, GANs with certain neural network discriminators have been shown to achieve sample error bounds matching minimax rates (Gao et al. (2019); Gao, Yao and Zhu (2020)). It is interesting to study whether similar results can be obtained when using GANs with simpler and computationally more tractable discriminators.
We propose and study adversarial algorithms, including logit $f$-GAN in this section and hinge GAN in Section 4.3, each with simple spline discriminators. Define a linear class of pairwise spline functions, in which each function is linear in a coefficient vector applied to a fixed set of basis functions:
the basis functions are obtained by applying univariate truncated linear functions (that is, shifted ReLUs) componentwise to $x$ at a small set of knots in each coordinate, together with the pairwise products of such univariate functions. For concreteness, assume that every two coefficients are identical if associated with the same product of two components of the basis vector, that is, the corresponding coefficients can be arranged into a symmetric matrix. The preceding specification is sufficient for our theoretical analysis. Nevertheless, similar results can also be obtained, while allowing various changes to the basis functions, for example, adding $x$ itself as a subvector of the basis vector. With this change, a function in the class has a main-effect term in each coordinate $x_j$, which is a linear spline with knots in that coordinate, and a square or interaction term in each pair $(x_j, x_k)$, which is a product of two spline functions in $x_j$ and $x_k$.
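The following sketch builds one possible version of the pairwise spline feature map: univariate ramp (shifted ReLU) functions at a few knots in each coordinate, plus their pairwise products. Knot locations, intercept handling, and any symmetry constraints are assumptions of this sketch rather than the paper's exact specification.

```python
import numpy as np

def ramp_basis(x, knots):
    """Univariate truncated linear (shifted ReLU) basis: max(x - t, 0) for each knot t."""
    return np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0)

def pairwise_spline_features(X, knots):
    """Feature map for a pairwise spline discriminator h(x) = gamma' phi(x).

    phi consists of an intercept, the univariate ramp functions in each coordinate,
    and the pairwise products of ramp functions across coordinate pairs (j <= k)."""
    n, p = X.shape
    uni = [ramp_basis(X[:, j], knots) for j in range(p)]     # per-coordinate ramps
    feats = [np.ones((n, 1))] + uni                          # intercept + main effects
    for j in range(p):
        for k in range(j, p):
            # products of univariate ramps: all knot combinations for (x_j, x_k)
            feats.append(np.einsum('ia,ib->iab', uni[j], uni[k]).reshape(n, -1))
    return np.concatenate(feats, axis=1)

# Example: features for 100 points in dimension 3 with knots at -1, 0, 1.
Phi = pairwise_spline_features(np.random.randn(100, 3), knots=[-1.0, 0.0, 1.0])
```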
We consider two logit $f$-GAN methods with an $\ell_1$ or $\ell_2$ penalty on the discriminator, which lead to meaningful error bounds over the parameter space $\Theta_1$ or $\Theta_2$ respectively, under the following conditions on $f$ in addition to Assumption 1. Among the $f$-divergences in Table 2, the reverse KL and JS divergences satisfy both Assumptions 1 and 2, and hence the corresponding logit $f$-GANs achieve sample robust estimation using spline discriminators. The squared Hellinger and reverse $\chi^2$ divergences satisfy Assumption 1, but not the Lipschitz condition in Assumption 2(ii). For such $f$-divergences, it remains a theoretical question whether sample robust estimation can be achieved using spline discriminators.
Assumption 2.
Suppose that $f$ is strictly convex and three-times continuously differentiable, and satisfies the following conditions.
-
(i)
is concave in .
-
(ii)
is -Lipschitz in for a constant .
See Table 2 for validity of conditions (i) and (ii) in various $f$-divergences, and Remarks 9 and 10 for further discussions.
The first method, $\ell_1$ penalized logit $f$-GAN, is defined by solving
(25)
where the objective is $K_f$ in (1) with the data distribution $P$ replaced by the empirical distribution $P_n$, the discriminator is a pairwise spline function with knots depending on the location parameter $\mu$, the penalty is the $\ell_1$ norm of the spline coefficients excluding the intercept, and $\lambda$ is a tuning parameter. In addition to the replacement of $P$ by $P_n$, there are two notable modifications in (25) compared with the population version (1). First, a penalty term is introduced on the spline coefficients, to achieve suitable control of sampling variation. Second, the discriminator is a spline function with knots depending on $\mu$, the location parameter for the generator. By a change of variables, the non-penalized objective in (25) can be equivalently written as
(26)
where the empirical distribution is taken on the shifted data $X_i - \mu$, $i = 1, \dots, n$. Hence the objective is a negative loss for discriminating between the shifted empirical distribution and the mean-zero generator $N(0, \Sigma)$. The adaptive choice of knots for the spline discriminator not only is numerically desirable but also facilitates the control of sampling variation in our theoretical analysis. See Propositions S5, S8, S13, and S15.
Theorem 2.
For $\ell_1$ penalized logit $f$-GAN, Theorem 2 shows that the estimator achieves error bounds in the maximum norms of order $\sqrt{\log p / n} + \sqrt{\epsilon}$. These error bounds match sampling errors of order $\sqrt{\log p / n}$ in the maximum norms for the standard estimators (i.e., the sample mean and variance) in a multivariate Gaussian model in the case of $\epsilon = 0$. Moreover, up to sampling variation, the error bounds also match the population error bounds of order $\sqrt{\epsilon}$ in the maximum norms with nonparametric discriminators in Theorem 1(i), even though a simple, linear class of spline discriminators is used.
The second method, $\ell_2$ penalized logit $f$-GAN, is defined by solving
(27)
where the objective and the spline discriminator are defined as in (25), the penalties are the $\ell_2$ norms of the two blocks of spline coefficients (associated with the univariate basis functions and their pairwise products), and $\lambda_1$ and $\lambda_2$ are tuning parameters. Compared with $\ell_1$ penalized logit $f$-GAN (25), the two blocks of coefficients are separately associated with tuning parameters $\lambda_1$ and $\lambda_2$ in (27), in addition to the change from $\ell_1$ to $\ell_2$ penalties. As seen from our proofs in Sections II.2 and II.3 in the Supplement, the use of separate tuning parameters $\lambda_1$ and $\lambda_2$ is crucial for achieving meaningful error bounds in the $\ell_2$ and Frobenius norms for simultaneous estimation of $(\mu^*, \Sigma^*)$. Our method does not rely on the use of normalized differences of pairs of the observations to reduce the unknown mean to 0 for scatter estimation as in Diakonikolas et al. (2019).
Theorem 3.
Assume that $(\mu^*, \Sigma^*) \in \Theta_2$, $f$ satisfies Assumptions 1–2, and an additional scaling quantity is upper bounded by a constant. Let $(\hat\mu, \hat\Sigma)$ be a solution to (27). For suitable choices of the tuning parameters $\lambda_1$ and $\lambda_2$, with high probability the following bounds hold uniformly over the contamination distribution $Q$,
where the leading factors are constants, depending on $f$ and the pre-specified bounds but independent of $(n, p, \epsilon)$ and $Q$ except through those bounds.
For $\ell_2$ penalized logit $f$-GAN, Theorem 3 provides error bounds in the $\ell_2$ and Frobenius norms for location and scatter estimation, with a square-root dependency on $\epsilon$. A technical difference from Theorem 2 is that these bounds are derived under an extraneous boundedness condition. Nevertheless, the error rate matches the population error bounds of order $\sqrt{\epsilon}$ in Theorem 1(i), up to sampling variation in the $\ell_2$ and Frobenius norms. We defer to Section 4.3 further discussion about the error bounds in Theorems 2–3 compared with minimax error rates.
Remark 9.
There are important implications of Assumption 2(i) together with Assumption 1(ii), based on the fact (“composition rule”) that the composition of a non-decreasing concave function and a concave function is concave. First, for convex $f$, concavity of the function in Assumption 2(i) implies Assumption 1(iii). This follows by applying the composition rule, where the outer function, in addition to being concave, is non-decreasing by convexity of $f$, and the inner function is concave. Note that the reverse implication may fail, as shown by the Pearson $\chi^2$ in Table 2. Second, for convex $f$ and under the non-increasingness in Assumption 1(ii), concavity of the function in Assumption 2(i) also implies concavity of the other term of the objective in the discriminator. In fact, as mentioned in Remark 2, this term can be equivalently obtained through the Fenchel conjugate of $f$ (Tan, Song and Ou (2019)), and the composition rule applies again, with the outer function concave and non-decreasing by the non-increasingness in Assumption 1(ii).
Remark 10.
The concavity properties in the discriminator implied by Assumptions 1(ii) and 2(i), as discussed in Remark 9, are instrumental from both theoretical and computational perspectives. These concavity properties are crucial to our proofs of Theorems 2–3 and Corollary 1(i). See Lemmas S4 and S13 in the Supplement. Moreover, this concavity in the discriminator, in conjunction with the linearity of the spline discriminator in its coefficients, indicates that the objective function is concave in the spline coefficients for any fixed generator $(\mu, \Sigma)$. Hence our penalized logit $f$-GAN (25) or (27) under Assumptions 1–2 can be implemented through nested optimization as shown in Algorithm 1, where a concave optimizer is used in the inner stage to train the spline discriminators. See Remark 1 for further discussion.
4.3 Hinge GAN with spline discriminators
We consider two hinge GAN methods with an $\ell_1$ or $\ell_2$ penalty on the spline discriminator, which lead to theoretically improved error bounds in terms of dependency on $\epsilon$ over the parameter space $\Theta_1$ or $\Theta_2$ respectively, compared with the corresponding logit $f$-GAN methods in Section 4.2.
The first method, $\ell_1$ penalized hinge GAN, is defined by solving
(28)
where the objective is the hinge objective in (3) with the data distribution $P$ replaced by the empirical distribution $P_n$ and, similarly as in $\ell_1$ penalized logit $f$-GAN (25), the discriminator is a spline function with knots depending on $\mu$, the penalty is the $\ell_1$ norm of the spline coefficients excluding the intercept, and $\lambda$ is a tuning parameter.
Theorem 4.
Assume that $(\mu^*, \Sigma^*) \in \Theta_1$. Let $(\hat\mu, \hat\Sigma)$ be a solution to (28). For a suitable choice of the tuning parameter $\lambda$, with high probability the following bounds hold uniformly over the contamination distribution $Q$,
where the leading factors are constants, depending on $M_1$ but independent of $(n, p, \epsilon)$ and $Q$.
For $\ell_1$ penalized hinge GAN, Theorem 4 shows that the estimator achieves error bounds in the maximum norms of order $\sqrt{\log p / n} + \epsilon$, which improve upon the error rate $\sqrt{\log p / n} + \sqrt{\epsilon}$ in terms of dependency on $\epsilon$ for $\ell_1$ penalized logit $f$-GAN. This difference can be traced to that in the population error bounds in Theorem 1. Moreover, Theorem 5.1 in Chen, Gao and Ren (2018) indicates that a minimax lower bound on the estimation errors $\|\hat\mu - \mu^*\|_\infty$ or $\|\hat\Sigma - \Sigma^*\|_{\max}$ is also of order $\sqrt{\log p / n} + \epsilon$ in Huber's contaminated Gaussian model, where $\sqrt{\log p / n}$ is a minimax lower bound in the maximum norms in the case of $\epsilon = 0$. Therefore, our $\ell_1$ penalized hinge GAN achieves the minimax rates in the maximum norms for Gaussian location and scatter estimation over $\Theta_1$.
The second method, $\ell_2$ penalized hinge GAN, is defined by solving
(29)
where, similarly as in $\ell_2$ penalized logit $f$-GAN (27), the discriminator is a spline function with knots depending on $\mu$, the penalties are the $\ell_2$ norms of the two blocks of spline coefficients, and $\lambda_1$ and $\lambda_2$ are tuning parameters.
Theorem 5.
For $\ell_2$ penalized hinge GAN, Theorem 5 shows that the estimator achieves error bounds in the $\ell_2$ and Frobenius norms with a linear dependency on $\epsilon$. On one hand, these error bounds reduce to the same order as those for $\ell_2$ penalized logit $f$-GAN, under an additional boundedness condition. On the other hand, when compared with the minimax rates, there remain nontrivial differences between $\ell_2$ penalized hinge GAN and logit $f$-GAN. In fact, the minimax rate in the $\ell_2$ and operator norms for location and scatter estimation over $\Theta_2$ is known to be $\sqrt{p/n} + \epsilon$ in Huber's contaminated Gaussian model (Chen, Gao and Ren (2018)). The same minimax rate can also be shown in the Frobenius norm for scatter estimation. Then the error rate for $\ell_2$ penalized hinge GAN in Theorem 5 matches the minimax rate, and both reduce to the contamination-free error rate $\sqrt{p/n}$, provided that $\sqrt{n}\,\epsilon$ is bounded by a constant, independently of $p$. For $\ell_2$ penalized logit $f$-GAN associated with the reverse KL or JS divergence (satisfying Assumptions 1–2), the error bounds from Theorem 3 match the minimax rate provided that two more stringent conditions hold; the latter condition can be restrictive when $p$ is large.
Remark 11.
The two expectation terms in the hinge objective are concave in the discriminator $h$. This is reminiscent of the concavity of the two terms of the logit $f$-GAN objective in $h$ under Assumptions 1(ii) and 2(i), as discussed in Remark 10. These concavity properties are crucial to our proofs of Theorems 4–5 and Corollary 1(ii). See Lemmas S12 and S14 in the Supplement. Moreover, the concavity of the hinge objective in $h$, together with the linearity of the spline discriminator in its coefficients, implies that the objective function is concave in the spline coefficients for any fixed generator $(\mu, \Sigma)$. Hence, similarly to penalized logit $f$-GAN, our penalized hinge GAN (28) or (29) can also be implemented through nested optimization with a concave inner stage in training spline discriminators, as shown in Algorithm 1.
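To illustrate the concave inner stage for the $\ell_2$ penalized hinge GAN, here is a hedged cvxpy sketch; the coefficient block layout (`n_main`) and the penalty placement are assumptions of this sketch, and the implementation described in Section 6.1 (CVX framework with MOSEK in R) may differ in details.

```python
import cvxpy as cp

def hinge_discriminator_update(Phi_real, Phi_fake, lam1, lam2, n_main):
    """Inner update for the L2 penalized hinge GAN: maximize the (concave)
    negative hinge loss over spline coefficients gamma.

    Assumed layout: column 0 of Phi is the intercept, the next n_main columns
    are main-effect basis functions, and the rest are interaction terms.
    """
    q = Phi_real.shape[1]
    gamma = cp.Variable(q)
    obj = (-cp.sum(cp.pos(1 - Phi_real @ gamma)) / Phi_real.shape[0]
           - cp.sum(cp.pos(1 + Phi_fake @ gamma)) / Phi_fake.shape[0]
           - lam1 * cp.norm(gamma[1:1 + n_main], 2)
           - lam2 * cp.norm(gamma[1 + n_main:], 2))
    cp.Problem(cp.Maximize(obj)).solve()
    return gamma.value
```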
4.4 Two-objective GAN with spline discriminators
We study two-objective GANs, where the spline discriminator is trained using the objective function in logit $f$-GAN or hinge GAN, but the generator is trained using a different objective function.
Consider the following two-objective GAN related to logit $f$-GANs (25) and (27):
(32)
Similarly, consider the two-objective GAN related to the hinge GAN (28) and (29):
(35)
Here the penalty is either an $\ell_1$ penalty as in (25) or an $\ell_2$ penalty as in (27), and the generator objective involves a function satisfying Assumption 3. Note that the discriminator is a spline function with knots depending on $\mu$, so that this dependency cannot be dropped in the optimization over the generator in (32) or (35). We show that the two-objective logit $f$-GAN and hinge GAN achieve similar error bounds as the corresponding one-objective versions in Theorems 2–5.
Assumption 3.
Corollary 1.
The two-objective GANs studied in Corollary 1 differ slightly from existing ones as described in (7)–(13), due to the use of the discriminator depending on $\mu$ to facilitate theoretical analysis as mentioned in Section 4.2. If the spline discriminator were replaced by one defined independently of $\mu$, then suitable choices of the generator objective in (32) would reduce to GAN with the log trick (7) or calibrated rKL-GAN (10) respectively, and the corresponding choice in (35) would reduce to geometric GAN (13).
5 Discussion
5.1 GANs with data transformation
Compared with the usual formulations (1) and (3), our logit $f$-GAN and hinge GAN methods in Sections 4.2–4.3 involve a notable modification: both the real and fake data are discriminated against each other after being shifted by the current location parameter. Without the modification, a direct approach based on logit $f$-GAN would use the objective function
(36)
where the real data and the Gaussian fake data generated from standard noises are discriminated against each other given the parameters $(\mu, \Sigma)$. The idea behind our modification can be extended by allowing both location and scatter transformation. For example, consider logit $f$-GAN with full transformation:
(37)
where the objective is the logit $f$-GAN objective as in (25) and (27), and the penalty is an $\ell_1$ or $\ell_2$ penalty term. The discriminator is obtained by applying a spline function with fixed knots to the transformed data $\Sigma^{-1/2}(X_i - \mu)$. Similarly to (26), the non-penalized objective in (37) can be equivalently written as
(38)
where the empirical distribution is now taken on the transformed data $\Sigma^{-1/2}(X_i - \mu)$, $i = 1, \dots, n$. Compared with (26) and (36), there are two advantages of using (38) with full transformation. First, due to both location and scatter transformation, logit $f$-GAN (37), but not (25) or (27), can be shown to be affine equivariant. Second, the transformed real data and the standard Gaussian noises in (38) are discriminated against each other given the current parameters $(\mu, \Sigma)$, while employing the spline discriminators with knots fixed in each coordinate. Because standard Gaussian data are well covered by the grid formed from these marginal knots, the discrimination involved in (38) can be informative even when the parameters are updated. The discrimination involved in (36) may be problematic when employing the fixed-knot spline discriminators, because both the real and fake data may not be adequately covered by the grid formed from the knots.
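A minimal sketch of the full (location and scatter) transformation used in (37)–(38), assuming a Cholesky square root; the exact choice of matrix square root is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import solve_triangular

def back_transform(X, mu, Sigma):
    """Transform real data to z_i = L^{-1}(x_i - mu), where Sigma = L L'.

    Under the current generator (mu, Sigma), the transformed real data are
    compared with standard Gaussian noise, so spline knots can stay fixed
    on a grid suited to N(0, I)."""
    L = np.linalg.cholesky(Sigma)                        # Sigma = L @ L.T
    return solve_triangular(L, (X - mu).T, lower=True).T
```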
From the preceding discussion, it can be more desirable to incorporate both location and scatter transformation as in (38) than just location transformation as in (26), which only aligns the centers, but not the scales and correlations, of the Gaussian fake data with the knots in the spline discriminators. As mentioned in Section 4.2, our sample analysis exploits the location transformation in establishing certain concentration properties in the proofs. On the other hand, our current proofs are not directly applicable when allowing both location and scatter transformation. We leave an extension of our theoretical analysis in this direction to future work.
5.2 Comparison with Gao, Yao and Zhu (2020)
We first point out a connection between logit $f$-GANs and the GANs based on proper scoring rules in Gao, Yao and Zhu (2020). For a convex function, a proper scoring rule can be defined in a standard manner (Savage (1971); Buja, Stuetzle and Shen (2005); Gneiting and Raftery (2007)).
The population version of the GAN studied in Gao, Yao and Zhu (2020) is defined as
(39)
where the discriminator $D$ represents the probability that an observation comes from the real data distribution $P$ rather than the generator $P_\theta$.
The objective is shown to be a lower bound, being tight if the discriminator class is sufficiently rich, for an associated divergence between $P$ and $P_\theta$. For example, one choice leads to the log score, and the corresponding objective function reduces to the expected log-likelihood with discrimination probability $D$ as used in Goodfellow et al. (2014). We show that if $D$ is specified as a sigmoid probability, then the objective can be equivalently obtained as a logit $f$-GAN objective for a suitable choice of $f$.
Proposition 1.
Suppose that the discriminator is specified as a sigmoid probability, $D = \mathrm{sigmoid}(h)$. Then the objective in (39) coincides with $K_f$ defined in (1) for a convex function $f$ determined by the scoring rule.
In contrast with our methods, where the log odds function is parameterized as a pairwise spline function, Gao, Yao and Zhu (2020) studied robust estimation in Huber's contaminated Gaussian model, where the discriminator is parameterized as a neural network with two or more layers and sigmoid activations in the top and bottom layers. In the case of two layers, the neural network in Gao, Yao and Zhu (2020), Section 4, is defined as
(40)
where the bottom layer consists of sigmoid units with unconstrained weights and intercepts, and the weights in the top layer are constrained in the $\ell_1$ norm by a tuning parameter $\kappa$. Assume that the scoring-rule function is three-times continuously differentiable and, for a universal constant,
(41)
Then Gao, Yao and Zhu (2020) showed that the location and scatter estimators from the sample version of (39) with discriminator (40) achieve the minimax error rates, $\sqrt{p/n} + \epsilon$, in the $\ell_2$ and operator norms, provided that, among other conditions, $\kappa$ is of a suitable order. However, with sigmoid activations used inside the network, the sample objective may exhibit a complex, non-concave landscape in the discriminator parameters, which makes minimax optimization difficult.
There is also a subtle issue in how the above result from Gao, Yao and Zhu (2020) can be compared with even our population analysis for minimum $f$-divergence estimation, i.e., population versions of GANs with nonparametric discriminators. In fact, condition (41) can be directly shown to be equivalent to a condition on the function $f$ associated with the scoring rule in Proposition 1. This condition can be satisfied, while Assumption 1 is violated, for example, by a choice corresponding to the mixed KL divergence. As shown in Figure 2, minimization of the mixed KL does not in general lead to robust estimation. Hence it seems paradoxical that minimax error rates can be achieved by the GAN in Gao, Yao and Zhu (2020) with its objective function derived from the mixed KL. On the other hand, a possible explanation can be seen as follows. By the sigmoid activation and the constraint on the top-layer weights, the log-odds discriminator in (40) is forced to be bounded by $\kappa$, which is further assumed to be small, of the same order as the minimax rate. As a result, maximization of the population objective over such constrained discriminators may produce a divergence measure with a substantial gap to the actual divergence for any fixed $\theta$. Instead, the implied divergence measure may behave more similarly to the total variation than to the original divergence, due to the boundedness of the log odds by a sufficiently small $\kappa$, so that minimax error rates can still be achieved.
6 Simulation studies
We conducted simulation studies to compare the performance of our logit $f$-GAN and hinge GAN methods with several existing methods in various settings depending on the sample size $n$, the dimension $p$, the contamination fraction $\epsilon$, and the contamination distribution $Q$. Results about error dependency on $\epsilon$ are provided in the main paper and others are presented in the Supplement. Two contamination distributions are considered to allow different types of contaminations.
6.1 Implementation of methods
Our methods are implemented following Algorithm 1, where the penalized GAN objective function is defined as in (37) for logit $f$-GAN, or with the logit $f$-GAN objective replaced by the hinge objective for hinge GAN. As discussed in Section 5.1, this scheme allows adequate discrimination between the back-transformed real data, $\Sigma^{-1/2}(X_i - \mu)$, and the standard Gaussian noises using spline discriminators with fixed knots. In the following, we describe the discriminator optimization and penalty choices. See the Supplement for further details including the initial values and learning rates.
For Algorithm 1 with spline discriminators, the training objective is concave in the discriminator parameter and hence the discriminator can be fully optimized to provide a proper updating direction for the generator in each iteration. This implementation via nested optimization can be much more tractable than nonconvex-nonconcave minimax optimization in other GAN methods, where the discriminator and generator are updated by gradient ascent and descent, with careful tuning of the learning rates. In numerical work, we formulate the discriminator optimization from our GAN methods in the CVX framework (Fu, Narasimhan and Boyd (2020)) and then apply the convex optimization solver MOSEK (https://docs.mosek.com/latest/rmosek/index.html).
As dictated by our theory, we employ $\ell_1$ or $\ell_2$ penalties on the spline discriminators to control sampling variation, especially when the sample size is relatively small compared to the dimension of the discriminator parameter. Numerically, these penalties help stabilize the training process by weakening the discriminator power in the early stage. We tested our methods under different penalty levels and identified default choices of the penalty levels for our rKL and JS logit $f$-GANs and hinge GAN. These penalty choices are then fixed in all our subsequent simulations. See the Supplement for results from our tuning experiments.
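For concreteness, the sketch below formulates one inner discriminator update for a JS-type logit $f$-GAN objective with an $\ell_1$ penalty as a disciplined convex program in cvxpy; the feature matrices, penalty form, and scaling are illustrative, and the implementation described above uses the CVX framework with MOSEK in R rather than this Python sketch.

```python
import cvxpy as cp

def js_discriminator_update(Phi_real, Phi_fake, lam):
    """One inner update: maximize a concave JS-type objective over spline
    coefficients gamma with an L1 penalty (excluding the intercept).

    Phi_real, Phi_fake: spline feature matrices for the (transformed) real data
    and generated standard Gaussian noise."""
    q = Phi_real.shape[1]
    gamma = cp.Variable(q)
    # log sigmoid(z) = -logistic(-z), so the objective below is concave in gamma.
    obj = (cp.sum(-cp.logistic(-Phi_real @ gamma)) / Phi_real.shape[0]
           + cp.sum(-cp.logistic(Phi_fake @ gamma)) / Phi_fake.shape[0]
           - lam * cp.norm1(gamma[1:]))
    cp.Problem(cp.Maximize(obj)).solve()
    return gamma.value
```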
For comparison, we also implement ten existing methods for robust estimation.
• JS-GAN (Gao, Yao and Zhu (2020)). We use the code from Gao, Yao and Zhu (2020) with minimal modification. The batch size is set to a fraction of the data size because the default choice is too large in our experiment settings. We use the network structure with LeakyReLU and Sigmoid activations as recommended in Gao, Yao and Zhu (2020).
• Tyler's M-estimator (Tyler (1987)). This method is included for completeness, being designed for multivariate scatter estimation from elliptical distributions, not Huber's contaminated Gaussian distribution. We use R package fastM for implementation (https://cran.r-project.org/web/packages/fastM/index.html).
• Kendall's $\tau$ with MAD (Loh and Tan (2018)). Kendall's $\tau$ (Kendall (1938)) is used to estimate the correlations after sine transformation, and the median absolute deviation (MAD) (Hampel (1974)) is used to estimate the scales. We use the R built-in function cor to compute Kendall's $\tau$ correlations and the built-in function mad for MAD.
• Spearman's $\rho$ with $Q_n$-estimator (Öllerer and Croux (2015)). The $Q_n$-estimator (Rousseeuw and Croux (1993)) is used for scale estimation, and Spearman's $\rho$ (Spearman (1987)) is used with sine transformation for correlation estimation. We use the R function cor to compute Spearman's $\rho$ correlations and the Qn function in R package robustbase (https://cran.r-project.org/web/packages/robustbase/index.html).
• MVE (Rousseeuw (1985)). The minimum volume ellipsoid (MVE) estimator is a high-breakdown robust method for multivariate location and scatter estimation. We use the function CovMve in R package rrcov for implementation (https://cran.r-project.org/web/packages/rrcov/index.html).
• MCD (Rousseeuw (1985)). The minimum covariance determinant (MCD) estimator, another high-breakdown robust method for multivariate location and scatter estimation.
• Sest. An S-estimator of multivariate location and scatter.
• Mest (Rocke (1996)). This is a constrained M-estimator of multivariate location and scatter, based on a translated biweight function. We use the function CovMest in R package rrcov for implementation, with MVE as the initial value.
• MMest. An MM-estimator of multivariate location and scatter.
• $\gamma$-Lasso (Hirose, Fujisawa and Sese (2017)). The method is implemented in the R package rsggm, which has become unavailable on CRAN. We use an archived version (https://mran.microsoft.com/snapshot/2017-02-04/web/packages/rsggm/index.html). We use the default choice of $\gamma$ and deactivate the vectorized $\ell_1$ penalty.
Tyler's M-estimator, Kendall's $\tau$ with MAD, and Spearman's $\rho$ with $Q_n$ deal with scatter estimation only, whereas the other methods handle both location and scatter estimation. In our experiments, we focus on comparing the performance of existing and proposed methods in terms of scatter estimation (i.e., variance matrix estimation).
6.2 Simulation settings
The uncontaminated distribution is a p-dimensional Gaussian distribution with mean vector μ and variance matrix Σ, where Σ is a Toeplitz matrix whose (j, k)th component depends only on |j − k|. The location parameter is unknown and estimated together with the variance matrix, except for Tyler’s M-estimator, Kendall’s τ, and Spearman’s ρ, which handle scatter only. We consider two contamination distributions of different types. Denote the p × p identity matrix as I_p and the p-dimensional vector of ones as 1_p.
• First contamination: a Gaussian distribution whose mean is a p-dimensional vector with alternating ±1 entries (suitably scaled). In this setting, the contaminated points may not be seen as outliers marginally in each coordinate. On the other hand, these contaminated points can be easily separated as outliers from the uncontaminated Gaussian distribution in higher dimensions.
• Second contamination: a Gaussian contamination located and scaled such that contaminated points may lie in both low-density and high-density regions of the uncontaminated Gaussian distribution. The majority of contaminated points are outliers far from the uncontaminated data, while some contaminated points are enclosed by the uncontaminated points. This setting is also used in Gao, Yao and Zhu (2020).
The Gaussianity of the above contamination distributions is used for convenience and easy characterization of data patterns, given the specific locations and scales. See Figure 1 for an illustration of the first contamination, and the Supplement for that of the second.
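For concreteness, the following sketch draws a sample from Huber’s contaminated Gaussian model with a Toeplitz variance matrix; the Toeplitz decay 0.5, the alternating-sign mean scaled by 2, and the fraction eps = 0.05 are placeholder values, not the exact settings used in the paper.
\begin{verbatim}
library(MASS)  # for mvrnorm

## Draw n observations from (1 - eps) * N(mu, Sigma) + eps * Q.
r_huber <- function(n, eps, mu, Sigma, r_contam) {
  z <- rbinom(n, 1, eps)                    # hidden contamination indicators
  x <- mvrnorm(n, mu = mu, Sigma = Sigma)   # uncontaminated draws
  m <- sum(z)
  if (m > 0) x[z == 1, ] <- r_contam(m)     # overwrite contaminated rows
  x
}

p <- 10
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))      # placeholder Toeplitz matrix
mu <- rep(0, p)
## First contamination type: Gaussian centered at an alternating-sign vector.
Q1 <- function(m) mvrnorm(m, mu = 2 * rep_len(c(1, -1), p), Sigma = diag(p))
X <- r_huber(n = 2000, eps = 0.05, mu = mu, Sigma = Sigma, r_contam = Q1)
\end{verbatim}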
6.3 Experiment results
Table 3 summarizes scatter estimation errors in the maximum norm from the penalized hinge GAN and logit f-GANs and the existing methods, where n = 2000, p = 10, and the contamination proportion increases from 2.5% to 10%. The errors are obtained by averaging over repeated runs, and the numbers in brackets are standard deviations. From these results, the two logit f-GANs, JS and rKL, have the best performance, followed closely by the hinge GAN and then, with more noticeable differences, by the JS-GAN in Gao, Yao and Zhu (2020) and the five high-breakdown methods, MVE, MCD, Sest, Mest, and MMest. The pairwise methods, Kendall’s τ with MAD and Spearman’s ρ with Qn, have relatively poor performance, especially for the first contamination, as expected from Figure 1. The γ-Lasso performs competitively when the contamination proportion is small (e.g., 2.5%), but deteriorates quickly as it increases. A possible explanation is that the default choice of γ may not work well in the current settings, even though the numerical studies in Hirose, Fujisawa and Sese (2017) suggest that γ does not require special tuning. Tyler’s M-estimator performs poorly, primarily because it is not designed for robust estimation under Huber’s contamination.
Estimation errors in the Frobenius norm from our penalized GAN methods and the existing methods are shown in Table 4. We observe a pattern of comparison similar to that in Table 3.
From Tables 3–4, we see that the estimation errors of our GAN methods, as well as the other methods, increase as the contamination proportion increases. However, the dependency on the contamination proportion is not precisely linear for the hinge GAN, nor of the theoretical order for the two logit f-GANs. This does not contradict our theoretical bounds, which are derived to hold over all possible contamination distributions, that is, for the worst-case contamination. For specific contamination settings, it is possible for a logit f-GAN to outperform the hinge GAN, and for each method to achieve a better error dependency on the contamination proportion than in the worst case.
ε (%) | hinge GAN | JS logit f-GAN | rKL logit f-GAN | GYZ JS-GAN | Tyler_M | Kendall_MAD | Spearman_Qn |
2.5 | 0.0711 (0.0203) | 0.0692 (0.0191) | 0.0661 (0.0121) | 0.0960 (0.0384) | 0.3340 (0.0360) | 0.1484 (0.0320) | 0.1434 (0.0245) |
5 | 0.0768 (0.0201) | 0.0694 (0.0166) | 0.0658 (0.0161) | 0.1049 (0.0423) | 0.3115 (0.0279) | 0.2235 (0.0351) | 0.2351 (0.0284) |
7.5 | 0.0828 (0.0203) | 0.0690 (0.0171) | 0.0661 (0.0092) | 0.0957 (0.0303) | 0.3048 (0.0234) | 0.3164 (0.0319) | 0.3294 (0.0241) |
10 | 0.0981 (0.0221) | 0.0737 (0.0207) | 0.0720 (0.0151) | 0.1029 (0.0467) | 0.3526 (0.0226) | 0.4133 (0.0264) | 0.4362 (0.0256) |
2.5 | 0.0742 (0.0214) | 0.0732 (0.0207) | 0.0765 (0.0209) | 0.0994 (0.0334) | 0.3886 (0.0346) | 0.1428 (0.0302) | 0.1578 (0.0266) |
5 | 0.0814 (0.0216) | 0.0739 (0.0204) | 0.0824 (0.0196) | 0.1053 (0.0235) | 0.4322 (0.0242) | 0.2211 (0.0286) | 0.2812 (0.0297) |
7.5 | 0.0901 (0.0206) | 0.0757 (0.0209) | 0.0893 (0.0155) | 0.1063 (0.0400) | 0.4788 (0.0312) | 0.3149 (0.0278) | 0.4153 (0.0318) |
10 | 0.1051 (0.0236) | 0.0802 (0.0225) | 0.0935 (0.0219) | 0.1275 (0.0453) | 0.5295 (0.0332) | 0.4074 (0.0278) | 0.5693 (0.0358) |
ε (%) | γ-Lasso | MVE | MCD | Sest | Mest | MMest | |
2.5 | 0.0737 (0.0182) | 0.0935 (0.0266) | 0.0834 (0.0270) | 0.0892 (0.0282) | 0.0802 (0.0236) | 0.0876 (0.0278) | |
5 | 0.2182 (0.0162) | 0.1124 (0.0257) | 0.1078 (0.0267) | 0.1314 (0.0257) | 0.0891 (0.0245) | 0.1299 (0.0259) | |
7.5 | 0.3740 (0.0153) | 0.1373 (0.0276) | 0.1306 (0.0247) | 0.1774 (0.0245) | 0.1026 (0.0239) | 0.1759 (0.0240) | |
10 | 0.5277 (0.0154) | 0.1700 (0.0284) | 0.1619 (0.0258) | 0.2358 (0.0292) | 0.1289 (0.0269) | 0.2349 (0.0292) | |
2.5 | 0.0769 (0.0243) | 0.0984 (0.0284) | 0.0837 (0.0271) | 0.0892 (0.0282) | 0.0799 (0.0229) | 0.0876 (0.0278) | |
5 | 0.1044 (0.0215) | 0.1171 (0.0258) | 0.1081 (0.0267) | 0.1314 (0.0257) | 0.0890 (0.0243) | 0.1299 (0.0259) | |
7.5 | 0.1668 (0.0312) | 0.1383 (0.0251) | 0.1301 (0.0237) | 0.1774 (0.0245) | 0.1031 (0.0243) | 0.1759 (0.0240) | |
10 | 0.2935 (0.0619) | 0.1697 (0.0240) | 0.1618 (0.0250) | 0.2358 (0.0292) | 0.1289 (0.0262) | 0.2349 (0.0291) |
ε (%) | hinge GAN | JS logit f-GAN | rKL logit f-GAN | GYZ JS-GAN | Tyler_M | Kendall_MAD | Spearman_Qn |
2.5 | 0.2673 (0.0457) | 0.2574 (0.0461) | 0.2502 (0.0345) | 0.3354 (0.0904) | 1.0973 (0.0890) | 0.7524 (0.0286) | 0.7936 (0.0201) |
5 | 0.2729 (0.0344) | 0.2649 (0.0385) | 0.2528 (0.0404) | 0.3511 (0.0757) | 1.0190 (0.0525) | 1.4888 (0.0282) | 1.5991 (0.0217) |
7.5 | 0.2985 (0.0573) | 0.2790 (0.0554) | 0.2568 (0.0339) | 0.3349 (0.0655) | 1.2802 (0.0439) | 2.3065 (0.0444) | 2.4908 (0.0401) |
10 | 0.3222 (0.0547) | 0.2898 (0.0565) | 0.2723 (0.0477) | 0.3642 (0.0885) | 2.3363 (0.0557) | 3.2143 (0.0489) | 3.4669 (0.0398) |
2.5 | 0.2705 (0.0475) | 0.2688 (0.0526) | 0.2454 (0.0376) | 0.3350 (0.0690) | 1.5503 (0.1226) | 0.7688 (0.1108) | 0.8884 (0.1040) |
5 | 0.2754 (0.0335) | 0.2738 (0.0365) | 0.2466 (0.0324) | 0.3526 (0.0489) | 1.9679 (0.0806) | 1.5229 (0.0793) | 1.8049 (0.0736) |
7.5 | 0.3071 (0.0621) | 0.3026 (0.0688) | 0.2685 (0.0435) | 0.3570 (0.0778) | 2.6099 (0.1545) | 2.3906 (0.1473) | 2.9247 (0.1490) |
10 | 0.3301 (0.0583) | 0.3049 (0.0635) | 0.2744 (0.0505) | 0.3829 (0.0852) | 3.2420 (0.2083) | 3.3202 (0.1462) | 4.1527 (0.1602) |
ε (%) | γ-Lasso | MVE | MCD | Sest | Mest | MMest | |
2.5 | 0.2679 (0.0339) | 0.3262 (0.0610) | 0.2906 (0.0567) | 0.2982 (0.0619) | 0.2853 (0.0615) | 0.2921 (0.0616) | |
5 | 1.6199 (0.0381) | 0.3583 (0.0452) | 0.3454 (0.0484) | 0.4059 (0.0489) | 0.3016 (0.0425) | 0.4032 (0.0487) | |
7.5 | 3.0849 (0.0461) | 0.4562 (0.0877) | 0.4326 (0.0803) | 0.5805 (0.0863) | 0.3549 (0.0699) | 0.5768 (0.0870) | |
10 | 4.4364 (0.0688) | 0.5409 (0.0799) | 0.5183 (0.0767) | 0.7732 (0.0839) | 0.4079 (0.0681) | 0.7707 (0.0831) | |
2.5 | 0.3016 (0.0853) | 0.3299 (0.0658) | 0.2896 (0.0568) | 0.2982 (0.0619) | 0.2841 (0.0599) | 0.2921 (0.0616) | |
5 | 0.4872 (0.0670) | 0.3781 (0.0567) | 0.3460 (0.0487) | 0.4059 (0.0489) | 0.3011 (0.0439) | 0.4032 (0.0487) | |
7.5 | 1.0551 (0.2097) | 0.4532 (0.0740) | 0.4342 (0.0792) | 0.5805 (0.0863) | 0.3580 (0.0717) | 0.5768 (0.0870) | |
10 | 2.1356 (0.4419) | 0.5331 (0.0715) | 0.5167 (0.0753) | 0.7732 (0.0839) | 0.4076 (0.0667) | 0.7708 (0.0831) |
7 Main proofs
We present main proofs of Theorems 1 and 4 in this section. The main proofs of the other results and details of all main proofs are provided in the Supplementary Material.
At the center of our proofs is a unified strategy designed to establish error bounds for GANs; see, for example, (43) and (49). Moreover, we carefully exploit the location transformation and the L1 or L2 penalties in our GAN objective functions and develop suitable concentration properties, in addition to leveraging the concavity in updating the spline discriminators, as discussed in Remarks 10 and 11.
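Schematically, with generic notation rather than the exact quantities appearing in (43) and (49), each proof sandwiches the adversarial objective evaluated at the estimator and then converts the sandwich into an error bound:
\[
c\, d(\hat\theta, \theta^*) - \Delta_{\mathrm{low}}
\;\le\; \bigl(\text{objective evaluated at } \hat\theta\bigr)
\;\le\; \Delta_{\mathrm{up}}
\quad\Longrightarrow\quad
d(\hat\theta, \theta^*) \;\le\; \frac{\Delta_{\mathrm{low}} + \Delta_{\mathrm{up}}}{c},
\]
where $d(\hat\theta, \theta^*)$ is a discrepancy (a divergence or a moment matching term) controlling the estimation error, $c$ is a scaling constant, and $\Delta_{\mathrm{low}}$, $\Delta_{\mathrm{up}}$ collect the bias or high-probability error terms.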
7.1 Proof of Theorem 1
We state and prove the following result which implies Theorem 1.
Proposition 2.
Let the parameter space be the set of all pairs $(\mu, \Sigma)$ such that $\Sigma$ is a variance matrix.
(i) Assume that satisfies Assumption 1, and for a constant . Let . If for a constant , then we have
where and . If further , then
(42) | ||||
where , , and the constant is defined such that . The same inequality as (42) also holds with replaced by .
(ii) Let . Then the statements in (i) hold with replaced by throughout.
Proof of Proposition 2.
(i) Our main strategy is to show the following inequalities hold:
(43) |
where and are bias terms, depending on and respectively and is the total variation or simply . Under certain conditions, delivers upper bounds, up to scaling constants, on the estimation bias to be controlled, , , , and .
(Step 1) The upper bound in (43) follows from Lemma S4 (iv): for any satisfying Assumption 1 and any , we have
where . The constant is nonnegative because is non-increasing by Assumption 1 (ii).
(Step 2) We show the lower bound in (43) as follows:
(44) | ||||
(45) |
where . Line (44) follows by the triangle inequality and the fact that . Line (45) follows from Lemma S1: for any -divergence satisfying Assumption 1 (iii), we have
The scaling constant, , in Lemma S1 reduces to , because is non-increasing by Assumption 1 (iii).
(Step 3) Combining the lower and upper bounds in (43), we have
where . The location result then follows from Proposition S4 provided that for a constant . The variance matrix result follows if .
(ii) For the TV minimizer , Steps 1 and 2 in (i) can be combined to directly obtain an upper bound on as follows:
(46) | ||||
(47) | ||||
(48) |
Line (46) is due to the triangle inequality. Line (47) follows because by the definition of and the symmetry of TV. Line (48) follows because as in (44).
Given the upper bound on , the location result then follows from Proposition S4 provided that for a constant . The variance matrix result follows if . ∎
7.2 Proof of Theorem 4
We state and prove the following result which implies Theorem 4. See the Supplement for details about how Proposition 3 implies Theorem 4. For , define
where , depending on universal constants and in Lemma S26 and Corollary S2 in the Supplement. Denote
Proposition 3.
Proof of Proposition 3.
The main strategy of our proof is to show that the following inequalities hold with high probabilities,
(49) |
where and are error terms, and is a moment matching term, which under certain conditions delivers upper bounds, up to scaling constants, on the estimation errors to be controlled, and .
(Step 1) For upper bound in (49), we show that with probability at least ,
(50) | |||
(51) |
Inequality (50) follows from the definition of . Inequality (51) follows from Proposition S13: it holds with probability at least that for any ,
where , , and
From (50)–(51), the upper bound in (49) holds with probability at least , provided that the tuning parameter is chosen such that .
(Step 2) For the lower bound in (49), we show that with probability at least ,
(52) | |||
(53) |
Inequality (52) holds provided that is a subset of .
Take , where is the subset of associated with pairwise ramp functions as in the proof of Theorem 2. Inequality (53) follows from Proposition S14 because for , and hence the hinge loss reduces to a moment matching term: it holds with probability at least that for any ,
where and
From (52)–(53), the lower bound in (49) holds with probability at least , where and .
(Step 3) We complete the proof by relating the moment matching term to the estimation error between and . First, combining the lower and upper bounds in (49) shows that with probability at least ,
(54) |
where
The desired result then follows from Proposition S7: provided , inequality (54) implies that
∎
Supplement to “Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation in Contaminated Gaussian Models”. The Supplement provides additional information for simulation studies, and main proofs and technical details. Auxiliary lemmas and technical tools are also provided.
References
- Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B 28, 131–142.
- Balmand, S. and Dalalyan, A. (2015). Convex programming approach to robust estimation of a multivariate Gaussian model. arXiv preprint arXiv:1512.04734.
- Basu, A. and Lindsay, B. G. (1994). Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics 46, 683–705.
- Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85, 549–559.
- Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics 5, 445–463.
- Buja, A., Stuetzle, W. and Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania.
- Butler, R. W., Davies, P. L. and Jhun, M. (1993). Asymptotics for the minimum covariance determinant estimator. Annals of Statistics, 1385–1400.
- Chen, M., Gao, C. and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber’s contamination model. Annals of Statistics 46, 1932–1960.
- Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2, 229–318.
- Davies, P. L. (1987). Asymptotic behaviour of S-estimates of multivariate location parameters and dispersion matrices. Annals of Statistics, 1269–1292.
- Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A. and Stewart, A. (2019). Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing 48, 742–864.
- Donoho, D. L. and Liu, R. C. (1988). The “automatic” robustness of minimum distance functionals. Annals of Statistics 16, 552–586.
- Farnia, F. and Ozdaglar, A. (2020). Do GANs always have Nash equilibria? In International Conference on Machine Learning, 3029–3039. PMLR.
- Fu, A., Narasimhan, B. and Boyd, S. (2020). CVXR: An R package for disciplined convex optimization. Journal of Statistical Software 94, 1–34.
- Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis 99, 2053–2081.
- Gao, C., Yao, Y. and Zhu, W. (2020). Generative adversarial nets for robust scatter estimation: a proper scoring rule perspective. Journal of Machine Learning Research 21, 160–1.
- Gao, C., Liu, J., Yao, Y. and Zhu, W. (2019). Robust estimation and generative adversarial networks. In 7th International Conference on Learning Representations, ICLR 2019.
- Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc.
- Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383–393.
- Hirose, K., Fujisawa, H. and Sese, J. (2017). Robust sparse Gaussian graphical modeling. Journal of Multivariate Analysis 161, 172–190.
- Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35, 73–101.
- Huber, P. J. (1981). Robust Statistics. Wiley Series in Probability and Mathematical Statistics.
- Huszár, F. (2016). An alternative update rule for generative adversarial networks. Blog post.
- Jin, C., Netrapalli, P. and Jordan, M. (2020). What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, 4880–4889. PMLR.
- Jones, M. C., Hjort, N. L., Harris, I. R. and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika 88, 865–873.
- Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30, 81–93.
- Lai, K. A., Rao, A. B. and Vempala, S. (2016). Agnostic estimation of mean and covariance. In IEEE 57th Annual Symposium on Foundations of Computer Science, 665–674. IEEE.
- Lim, J. H. and Ye, J. C. (2017). Geometric GAN. arXiv preprint arXiv:1705.02894.
- Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Annals of Statistics 22, 1081–1114.
- Liu, Z. and Loh, P.-L. (2021). Robust W-GAN-based estimation under Wasserstein contamination. arXiv preprint arXiv:2101.07969.
- Liu, H., Han, F., Yuan, M., Lafferty, J. and Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. Annals of Statistics 40, 2293–2326.
- Loh, P.-L. and Tan, X. L. (2018). High-dimensional robust precision matrix estimation: Cellwise corruption under ϵ-contamination. Electronic Journal of Statistics 12, 1429–1467.
- Lopuhaa, H. P. (1989). On the relation between S-estimators and M-estimators of multivariate location and covariance. Annals of Statistics, 1662–1683.
- Miyamura, M. and Kano, Y. (2006). Robust Gaussian graphical modeling. Journal of Multivariate Analysis 97, 1525–1550.
- Nguyen, X., Wainwright, M. J. and Jordan, M. I. (2009). On surrogate loss functions and f-divergences. Annals of Statistics 37, 876–904.
- Nguyen, X., Wainwright, M. J. and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory 56, 5847–5861.
- Nowozin, S., Cseke, B. and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
- Öllerer, V. and Croux, C. (2015). Robust high-dimensional precision matrix estimation. In Modern Nonparametric, Robust and Multivariate Methods, 325–350. Springer.
- Paindaveine, D. and Van Bever, G. (2018). Halfspace depths for scatter, concentration and shape matrices. Annals of Statistics 46, 3276–3307.
- Rocke, D. M. (1996). Robustness properties of S-estimators of multivariate location and shape in high dimension. Annals of Statistics, 1327–1345.
- Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. Mathematical Statistics and Applications 8, 37.
- Rousseeuw, P. J. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association 88, 1273–1283.
- Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association 66, 783–801.
- Spearman, C. (1987). The proof and measurement of association between two things. The American Journal of Psychology 100, 441–471.
- Tamura, R. N. and Boos, D. D. (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association 81, 223–229.
- Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika 107, 137–158.
- Tan, Z., Song, Y. and Ou, Z. (2019). Calibrated adversarial algorithms for generative modelling. Stat 8, e224.
- Tan, Z. and Zhang, X. (2020). On loss functions and regret bounds for multi-category classification. arXiv preprint arXiv:2005.08155.
- Tarr, G., Müller, S. and Weber, N. C. (2016). Robust estimation of precision matrices under cellwise contamination. Computational Statistics & Data Analysis 93, 404–420.
- Tatsuoka, K. S. and Tyler, D. E. (2000). On the uniqueness of S and M-functionals under nonelliptical distributions. Annals of Statistics, 1219–1243.
- Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, vol. 2, 523–531.
- Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Annals of Statistics, 234–251.
- Windham, M. P. (1995). Robustifying model fitting. Journal of the Royal Statistical Society: Series B, 599–609.
- Wu, K., Ding, G. W., Huang, R. and Yu, Y. (2020). On minimax optimality of GANs for robust mean estimation. In International Conference on Artificial Intelligence and Statistics, 4541–4551. PMLR.
- Xue, L. and Zou, H. (2012). Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Annals of Statistics 40, 2541–2571.
- Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. Annals of Statistics, 642–656.
- Zhang, J. (2002). Some extensions of Tukey’s depth function. Journal of Multivariate Analysis 82, 134–165.
- Zhao, J., Mathieu, M. and LeCun, Y. (2017). Energy-based generative adversarial networks. In International Conference on Learning Representations.
- Zhu, B., Jiao, J. and Tse, D. (2020). Deconstructing generative adversarial networks. IEEE Transactions on Information Theory 66, 7155–7179.
Supplementary Material for
“Tractable and Near-Optimal Adversarial Algorithms for
Robust Estimation in Contaminated Gaussian Models”
Ziyue Wang and Zhiqiang Tan
The Supplementary Material contains Appendices I-V on additional information for simulation studies, main proofs and details, auxiliary lemmas, and technical tools.
I Additional information for simulation studies
I.1 Implementation of proposed methods
The detailed algorithm to implement our logit f-GANs and hinge GAN is shown as Algorithm S1. In our experiments, the learning rate is set with , and the iteration number is set to . This choice leads to stable convergence while keeping the running time relatively short.
For the rKL logit f-GAN, we modify the un-penalized objective as . This modification helps stabilize the initial steps of training, where the fake data and real data can be almost separable depending on the initial value of the generator.
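The nested loop itself can be sketched as follows (a minimal sketch with placeholders, not Algorithm S1): at each iteration the discriminator is fully re-optimized given the current generator, and the generator parameters (mu, A), with Sigma = A A^T, take a gradient step; discriminator_update() and generator_gradient() are hypothetical helpers for the steps described above.
\begin{verbatim}
## Nested optimization: full discriminator update, then one generator step.
run_gan <- function(X, spline_basis, discriminator_update, generator_gradient,
                    n_iter = 200, lr = 0.01, lam = 0.1) {
  n <- nrow(X); p <- ncol(X)
  mu <- apply(X, 2, median)        # robust initial location
  A <- diag(apply(X, 2, mad))      # robust initial scale, Sigma = A %*% t(A)
  for (t in seq_len(n_iter)) {
    Z <- matrix(rnorm(n * p), n, p)
    X_fake <- Z %*% t(A) + matrix(mu, n, p, byrow = TRUE)
    disc <- discriminator_update(X, X_fake, spline_basis, lam)  # solved exactly
    g <- generator_gradient(X, X_fake, disc, mu, A)  # gradients of GAN objective
    mu <- mu - lr * g$mu
    A <- A - lr * g$A
  }
  list(mu = mu, Sigma = A %*% t(A))
}
\end{verbatim}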
I.2 Tuning penalty levels
We conducted tuning experiments to identify base penalty levels which are expected to work reasonably well in various settings for our logit f-GANs and hinge GAN, where the dependency on the sample size and dimension is already absorbed in the penalty parameters. In the tuning experiments, we tried two contamination proportions and two choices of contamination distributions as described in Section 6.2. Results are collected from 20 repeated experiments on a grid of penalty levels for each method.
As shown in Figure S1, although the average estimation error varies as the contamination setting changes, there is a consistent and stable range of penalty levels which leads to approximately the best performance for each method, with either the L1 or L2 penalty used. We manually pick default penalty levels for the hinge GAN, the JS logit f-GAN, and the rKL logit f-GAN, which are then fixed in all subsequent simulations. It is desirable in future work to design and study automatic tuning of penalty levels for GANs.

I.3 Error dependency on n and p
Tables S1–S2 show the performance of various methods depending on the sample size n for the two choices of contamination in Section 6.2. We fix the dimension and the contamination proportion and increase n from 500 to 4000. Tables S3–S4 show how the performance of the methods depends on the dimension p for the two choices of contamination. We fix the sample size and the contamination proportion and increase p from 5 to 15. Estimation errors are measured in the maximum norm and the Frobenius norm.
For all methods considered, the estimation errors decrease as n increases. As p increases, however, the estimation errors seem to be affected to a lesser extent when measured in the maximum norm. This is expected because an error rate with only a logarithmic dependency on the dimension (the contamination term aside) has been established in the maximum norm for our three penalized methods as well as Kendall’s τ and Spearman’s ρ (\citeappendLT18). When measured in the Frobenius norm, the estimation errors go up as p increases, which is also expected. In Table S3, the γ-Lasso outperforms our methods for the second contamination with p = 15, and even has better performance there than in the settings of smaller p. A possible explanation is that the default choice of γ may be near optimal for this particular setting, but much less suitable in other settings.
n | hinge GAN | JS logit f-GAN | rKL logit f-GAN | GYZ JS-GAN | Tyler_M | Kendall_MAD | Spearman_Qn |
500 | 0.1740 (0.0529) | 0.1775 (0.0496) | 0.1418 (0.0335) | 0.3357 (0.1771) | 0.4211 (0.0570) | 0.5304 (0.0824) | 0.5074 (0.0555) |
1000 | 0.1238 (0.0325) | 0.1134 (0.0292) | 0.1027 (0.0212) | 0.1982 (0.0917) | 0.3830 (0.0320) | 0.4846 (0.0735) | 0.4715 (0.0431) |
2000 | 0.0980 (0.0222) | 0.0737 (0.0207) | 0.0720 (0.0151) | 0.1029 (0.0467) | 0.3526 (0.0226) | 0.4133 (0.0264) | 0.4362 (0.0256) |
4000 | 0.0848 (0.0184) | 0.0528 (0.0124) | 0.0502 (0.0105) | 0.0770 (0.0241) | 0.3351 (0.0166) | 0.3948 (0.0259) | 0.4117 (0.0186) |
500 | 0.1855 (0.0536) | 0.1970 (0.0498) | 0.1548 (0.0353) | 0.2792 (0.1706) | 0.6061 (0.0579) | 0.5200 (0.0846) | 0.6393 (0.0807) |
1000 | 0.1382 (0.0331) | 0.1231 (0.0324) | 0.1230 (0.0266) | 0.1618 (0.0615) | 0.5734 (0.0429) | 0.4724 (0.0732) | 0.6187 (0.0578) |
2000 | 0.1050 (0.0237) | 0.0802 (0.0225) | 0.0935 (0.0219) | 0.1275 (0.0453) | 0.5295 (0.0332) | 0.4074 (0.0278) | 0.5693 (0.0358) |
4000 | 0.0903 (0.0180) | 0.0590 (0.0141) | 0.0739 (0.0167) | 0.0881 (0.0285) | 0.5154 (0.0225) | 0.3863 (0.0251) | 0.5481 (0.0284) |
n | γ-Lasso | MVE | MCD | Sest | Mest | MMest | |
500 | 0.5786 (0.0288) | 0.2270 (0.0518) | 0.2175 (0.0516) | 0.2819 (0.0597) | 0.1900 (0.0461) | 0.2805 (0.0573) | |
1000 | 0.5426 (0.0261) | 0.1877 (0.0362) | 0.1883 (0.0397) | 0.2650 (0.0422) | 0.1551 (0.0394) | 0.2630 (0.0413) | |
2000 | 0.5277 (0.0154) | 0.1637 (0.0269) | 0.1618 (0.0257) | 0.2358 (0.0292) | 0.1292 (0.0260) | 0.2349 (0.0292) | |
4000 | 0.5150 (0.0139) | 0.1457 (0.0211) | 0.1471 (0.0222) | 0.2206 (0.0221) | 0.1127 (0.0218) | 0.2198 (0.0219) | |
500 | 0.3472 (0.1021) | 0.2179 (0.0542) | 0.2190 (0.0508) | 0.2819 (0.0597) | 0.1853 (0.0481) | 0.2805 (0.0573) | |
1000 | 0.3118 (0.0843) | 0.1953 (0.0435) | 0.1887 (0.0397) | 0.2650 (0.0422) | 0.1547 (0.0399) | 0.2630 (0.0412) | |
2000 | 0.2935 (0.0619) | 0.1685 (0.0303) | 0.1617 (0.0251) | 0.2358 (0.0292) | 0.1284 (0.0255) | 0.2349 (0.0291) | |
4000 | 0.2627 (0.0422) | 0.1511 (0.0216) | 0.1470 (0.0222) | 0.2206 (0.0221) | 0.1132 (0.0217) | 0.2198 (0.0219) |
n | hinge GAN | JS logit f-GAN | rKL logit f-GAN | GYZ JS-GAN | Tyler_M | Kendall_MAD | Spearman_Qn |
500 | 0.6855 (0.1138) | 0.7519 (0.1238) | 0.8294 (0.1381) | 0.9512 (0.3576) | 2.4034 (0.1264) | 3.3007 (0.1748) | 3.5111 (0.1224) |
1000 | 0.4455 (0.0645) | 0.4304 (0.0683) | 0.4197 (0.0734) | 0.6198 (0.2318) | 2.3129 (0.0668) | 3.2204 (0.0862) | 3.4565 (0.0645) |
2000 | 0.3222 (0.0547) | 0.2898 (0.0565) | 0.2723 (0.0477) | 0.3642 (0.0885) | 2.3363 (0.0557) | 3.2143 (0.0489) | 3.4669 (0.0398) |
4000 | 0.2431 (0.0438) | 0.1948 (0.0380) | 0.1877 (0.0346) | 0.2797 (0.0738) | 2.3165 (0.0405) | 3.2030 (0.0417) | 3.4472 (0.0219) |
500 | 0.7157 (0.1151) | 0.8054 (0.1400) | 0.7190 (0.0970) | 0.8889 (0.3253) | 3.3388 (0.2880) | 3.3704 (0.2784) | 4.1824 (0.2852) |
1000 | 0.4589 (0.0695) | 0.4589 (0.0850) | 0.3942 (0.0606) | 0.5367 (0.1282) | 3.2529 (0.2480) | 3.3263 (0.1921) | 4.1522 (0.2150) |
2000 | 0.3301 (0.0583) | 0.3049 (0.0635) | 0.2744 (0.0505) | 0.3829 (0.0852) | 3.2420 (0.2083) | 3.3202 (0.1462) | 4.1527 (0.1602) |
4000 | 0.2472 (0.0458) | 0.2076 (0.0419) | 0.1990 (0.0334) | 0.2986 (0.0536) | 3.1988 (0.1288) | 3.2955 (0.1269) | 4.1117 (0.1182) |
n | γ-Lasso | MVE | MCD | Sest | Mest | MMest | |
500 | 4.4395 (0.0997) | 0.7574 (0.1373) | 0.7448 (0.1249) | 0.9269 (0.1562) | 0.6615 (0.1164) | 0.9190 (0.1562) | |
1000 | 4.4150 (0.0618) | 0.6005 (0.1132) | 0.5843 (0.0960) | 0.8262 (0.1130) | 0.4894 (0.0948) | 0.8206 (0.1137) | |
2000 | 4.4364 (0.0688) | 0.5416 (0.0816) | 0.5170 (0.0754) | 0.7732 (0.0839) | 0.4078 (0.0670) | 0.7707 (0.0831) | |
4000 | 4.4204 (0.0444) | 0.4663 (0.0613) | 0.4542 (0.0623) | 0.7232 (0.0714) | 0.3453 (0.0577) | 0.7231 (0.0711) | |
500 | 2.0989 (0.7667) | 0.7577 (0.1291) | 0.7497 (0.1272) | 0.9269 (0.1562) | 0.6581 (0.1176) | 0.9190 (0.1562) | |
1000 | 1.9989 (0.6520) | 0.6021 (0.1016) | 0.5819 (0.0959) | 0.8263 (0.1131) | 0.4889 (0.0988) | 0.8211 (0.1140) | |
2000 | 2.1356 (0.4419) | 0.5314 (0.0835) | 0.5163 (0.0754) | 0.7732 (0.0839) | 0.4083 (0.0675) | 0.7708 (0.0831) | |
4000 | 1.9796 (0.3431) | 0.4740 (0.0628) | 0.4541 (0.0621) | 0.7232 (0.0714) | 0.3460 (0.0574) | 0.7232 (0.0711) |
p | hinge GAN | JS logit f-GAN | rKL logit f-GAN | GYZ JS-GAN | Tyler_M | Kendall_MAD | Spearman_Qn |
5 | 0.1179 (0.0220) | 0.0561 (0.0147) | 0.0576 (0.0138) | 0.1315 (0.0845) | 0.2153 (0.0231) | 0.3880 (0.0342) | 0.4050 (0.0260) |
10 | 0.0980 (0.0222) | 0.0737 (0.0207) | 0.0720 (0.0151) | 0.1029 (0.0467) | 0.3526 (0.0226) | 0.4133 (0.0264) | 0.4362 (0.0256) |
15 | 0.0929 (0.0219) | 0.0905 (0.0222) | 0.0761 (0.0130) | 0.1586 (0.0741) | 0.5856 (0.0194) | 0.4265 (0.0337) | 0.4377 (0.0227) |
5 | 0.1344 (0.0230) | 0.0674 (0.0189) | 0.0991 (0.0236) | 0.6339 (1.7289) | 0.4503 (0.0286) | 0.3803 (0.0225) | 0.5381 (0.0371) |
10 | 0.1050 (0.0237) | 0.0802 (0.0225) | 0.0935 (0.0219) | 0.1275 (0.0453) | 0.5295 (0.0332) | 0.4074 (0.0278) | 0.5693 (0.0358) |
15 | 0.1021 (0.0252) | 0.0957 (0.0232) | 0.0925 (0.0180) | 0.1294 (0.0432) | 0.5769 (0.0251) | 0.4207 (0.0322) | 0.5782 (0.0301) |
p | γ-Lasso | MVE | MCD | Sest | Mest | MMest | |
5 | 0.4866 (0.0197) | 0.2107 (0.0270) | 0.1973 (0.0246) | 0.2305 (0.0267) | 0.1398 (0.0296) | 0.2226 (0.0243) | |
10 | 0.5277 (0.0154) | 0.1662 (0.0253) | 0.1608 (0.0241) | 0.2358 (0.0292) | 0.1286 (0.0269) | 0.2349 (0.0292) | |
15 | 0.5556 (0.0204) | 0.1473 (0.0272) | 0.1430 (0.0246) | 0.3918 (0.1740) | 0.1190 (0.0241) | 0.3918 (0.1740) | |
5 | 1.2973 (0.0860) | 0.2123 (0.0261) | 0.1971 (0.0241) | 0.2306 (0.0266) | 0.1398 (0.0296) | 0.2311 (0.0270) | |
10 | 0.2935 (0.0619) | 0.1698 (0.0258) | 0.1616 (0.0256) | 0.2358 (0.0292) | 0.1292 (0.0264) | 0.2349 (0.0291) | |
15 | 0.0885 (0.0191) | 0.1462 (0.0263) | 0.1432 (0.0262) | 0.2362 (0.0287) | 0.1191 (0.0241) | 0.2362 (0.0287) |
p | hinge GAN | JS logit f-GAN | rKL logit f-GAN | GYZ JS-GAN | Tyler_M | Kendall_MAD | Spearman_Qn |
5 | 0.2257 (0.0550) | 0.1345 (0.0372) | 0.1296 (0.0322) | 0.2774 (0.1523) | 0.6149 (0.0447) | 1.5525 (0.0462) | 1.7124 (0.0383) |
10 | 0.3222 (0.0547) | 0.2898 (0.0565) | 0.2723 (0.0477) | 0.3642 (0.0885) | 2.3363 (0.0557) | 3.2143 (0.0489) | 3.4669 (0.0398) |
15 | 0.4578 (0.0691) | 0.4582 (0.0754) | 0.4328 (0.0460) | 0.6387 (0.2095) | 6.8919 (0.1709) | 4.8584 (0.0640) | 5.1907 (0.0477) |
5 | 0.2378 (0.0594) | 0.1514 (0.0477) | 0.1641 (0.0589) | 2.4720 (7.0730) | 1.5732 (0.1017) | 1.5974 (0.0809) | 2.1323 (0.0963) |
10 | 0.3301 (0.0583) | 0.3049 (0.0635) | 0.2744 (0.0505) | 0.3829 (0.0852) | 3.2420 (0.2083) | 3.3202 (0.1462) | 4.1527 (0.1602) |
15 | 0.4683 (0.0766) | 0.4834 (0.0953) | 0.4183 (0.0486) | 0.5459 (0.0998) | 4.9017 (0.2202) | 4.9803 (0.1710) | 6.0796 (0.1809) |
p | γ-Lasso | MVE | MCD | Sest | Mest | MMest | |
5 | 2.1000 (0.0427) | 0.4682 (0.0742) | 0.4494 (0.0608) | 0.5283 (0.0666) | 0.3014 (0.0649) | 0.5079 (0.0635) | |
10 | 4.4364 (0.0688) | 0.5338 (0.0846) | 0.5175 (0.0752) | 0.7732 (0.0839) | 0.4078 (0.0684) | 0.7707 (0.0831) | |
15 | 6.9967 (0.0857) | 0.6057 (0.0894) | 0.5921 (0.0777) | 3.7412 (3.1416) | 0.5143 (0.0672) | 3.7413 (3.1417) | |
5 | 5.3996 (0.3406) | 0.4685 (0.0754) | 0.4496 (0.0617) | 0.5284 (0.0669) | 0.3014 (0.0649) | 0.5450 (0.0701) | |
10 | 2.1356 (0.4419) | 0.5419 (0.0812) | 0.5175 (0.0755) | 0.7732 (0.0839) | 0.4075 (0.0689) | 0.7708 (0.0831) | |
15 | 0.4703 (0.1192) | 0.6017 (0.0789) | 0.5938 (0.0791) | 0.9585 (0.1027) | 0.5142 (0.0678) | 0.9585 (0.1027) |
I.4 Illustration with the second contamination

For completeness, Figure S2 shows the 95% Gaussian ellipsoids estimated for the first two coordinates, similarly as in Figure 1 but with two samples from Huber’s contaminated Gaussian distributions based on the second contamination in Section 6.2. The comparison of the methods studied is qualitatively similar to that found in Figure 1.
II Main proofs of results
II.1 Proof of Theorem 2
We state and prove the following result which implies Theorem 2. For , define two factors and with and . For , define
where , depending on universal constants and in Lemma S26 and Corollary S2 in the Supplement. Denote
where . Note that , are bounded provided that is bounded, because is three-times continuously differentiable as required in Assumption 2.
Proposition S1.
Proof of Proposition S1.
The main strategy of our proof is to show that the following inequalities hold with high probabilities,
(S1) |
where and are error terms, and is a moment matching term, which under certain conditions delivers upper bounds, up to scaling constants, on the estimation errors to be controlled, and .
(Step 1) For the upper bound in (S1), we show that with probability at least ,
(S2) | |||
(S3) |
Inequality (S2) follows from the definition of . Inequality (S3) follows from Proposition S5: it holds with probability at least that for any ,
where , , and
Note that is the same as in the proof of Theorem 4, and the above differs from in the proof of Theorem 4 only in the factor . From (S2)–(S3), the upper bound in (S1) holds with probability at least , provided that the tuning parameter is chosen such that .
(Step 2) For the lower bound in (S1), we show that with probability at least ,
(S4) | |||
(S5) |
Inequality (S4) holds provided that is a subset of . As a subset of the pairwise spline class , define a class of pairwise ramp functions, , such that each function in can be expressed as, for ,
where for , with , and with and . For symmetry as in , assume that the coefficients in are symmetric, for any . By the definition of , each function can be represented as in the spline class , where and satisfy , , and . Incidentally, this relationship also holds when symmetry is not imposed in the coefficients in or in . Denote as the subset of such that .
Take for some fixed . Inequality (S5) follows from Proposition S6: it holds with probability at least that for any ,
where , and
Note that is the same as in the proof of Theorem 4. From (S4)–(S5), the lower bound in (S1) holds with probability at least , where and .
(Step 3) We complete the proof by choosing appropriate and relating the moment matching term to the estimation error between and . First, due to the linearity of in , combining the lower and upper bounds in (S1) shows that with probability at least ,
Taking in the preceding display and rearranging yields
(S6) |
where
The desired result then follows from Proposition S7: provided , inequality (S6) implies that
∎
II.2 Proof of Theorem 3
We state and prove the following result which implies Theorem 3. For , define , in addition to and as in Proposition S1. For , define
where , depending on universal constants and in Lemma S23 and Corollary S2 in the Supplement. Denote
where , , , and . Note that by the strict convexity and monotonicity of as required in Assumption 1–2, we have that
is bounded away from zero provided that is bounded.
Proposition S2.
Proof of Proposition S2.
The main strategy of our proof is to show that the following inequalities hold with high probabilities,
(S7) |
where and are error terms, and is a moment matching term, similarly as in the proof of Theorem 2. However, additional considerations are involved.
We split the proof into several steps. In Step 1, we derive the upper bound in (S7) by exploiting two tuning parameters and associated with and respectively. In Steps 2 and 3, we derive the first version of the lower bound in (S7) and then deduce upper bounds on and , where or is the vector of standard deviations from or respectively. In Steps 4 and 5, we derive the second version of the lower bound in (S7) and then deduce an upper bound on .
(Step 1) For the upper bound in (S7), we show that with probability at least ,
(S8) | |||
(S9) |
Inequality (S8) follows from the definition of . Inequality (S9) follows from Proposition S8: it holds with probability at least that for any ,
where
and
From (S8)–(S9), the upper bound in (S7) holds with probability at least , provided that the tuning parameters and are chosen such that and .
(Step 2) For the first version of the lower bound in (S7), we show that with probability at least ,
(S10) | |||
(S11) | |||
(S12) |
where . Inequality (S10) follows because is a subset of such that and hence for . Inequality (S11) holds provided that is a subset of . As a subset of the main-effect spline class , define a main-effect ramp class, , such that each function in can be expressed as, for ,
where for , with , and with . Only the main-effect ramp functions are included, while the interaction ramp functions are excluded, in . By the definition of , each function can be represented as with , such that and satisfy and . For example, for , the associated norms are and . Denote as the subset of such that .
Take for some fixed . Inequality (S12) follows from Proposition S9: it holds with probability at least that for any ,
where , , and
From (S11)–(S12), the lower bound in (S7) holds with probability at least , where and .
(Step 3) We deduce upper bounds on and , by choosing appropriate and relating the moment matching term to the estimation errors. First, combining the upper bound in (S7) from Step 1 and the lower bound from Step 2 shows that with probability at least ,
Taking in the preceding display and rearranging yields
(S13) |
where
The error bounds for then follows from Proposition S10: provided , inequality (S13) implies that
(S14) | |||
(S15) |
(Step 4) For the second version of the lower bound in (S7), we show that with probability at least ,
(S16) | |||
(S17) | |||
(S18) |
where . Inequality (S16) follows because is a subset of such that and hence for . Inequality (S17) holds provided that is a subset of . As a subset of the interaction spline class , define an interaction ramp class, , such that each function in can be expressed as, for ,
where for , and with . In contrast with the function in , only the interaction ramp functions are included, while the main-effect ramp functions are excluded, in . For symmetry as in , assume that the coefficients in are symmetric, for any . By the definition of , each function can be represented as with , such that and satisfy and . For example, for , the associated norms are and . Denote as the subset of such that .
Take for some fixed . Inequality (S18) follows from Proposition S11: it holds with probability at least that for any ,
where , , and
From (S17)–(S18), the lower bound in (S7) holds with probability at least , where and .
(Step 5) We deduce an upper bound on , by choosing appropriate and relating the moment matching term to the estimation error. First, combining the upper bound in (S7) from Step 1 and the lower bound from Step 4 shows that with probability ,
Taking in the preceding display and rearranging yields
(S19) |
where
The error bound for then follows from Proposition S12: inequality (S19) together with the error bounds (S14)–(S15) implies that
where and . ∎
II.3 Proof of Theorem 5
We state and prove the following result which implies Theorem 5. For , define the same as in Sections II.1 and II.2. Denote
Proposition S3.
Proof of Proposition S3.
The main strategy of our proof is to show that the following inequalities hold with high probabilities,
(S20) |
where and are error terms, and is a moment matching term, similarly as in the proof of Theorem 4. However, additional considerations are involved.
(Step 1) For the upper bound in (S20), we show that with probability at least ,
(S21) | |||
(S22) |
Inequality (S21) follows from the definition of . Inequality (S22) follows from Proposition S15: it holds with probability at least that for any ,
where
and and are the same as in the proof of Theorem 3. Note that and differ from those in the proof of Theorem 3 only in that is removed. From (S21)–(S22), the upper bound in (S20) holds with probability at least , provided that the tuning parameters and are chosen such that and .
(Step 2) For the first version of the lower bound in (S20), we show that with probability at least ,
(S23) | |||
(S24) | |||
(S25) |
where Inequality (S23) follows because is defined as a subset of such that and hence for .
Take , where is the subset of associated with main-effect ramp functions as in the proof of Theorem 3. Inequality (S24) holds because by definition. Inequality (S25) follows from Proposition S16: it holds with probability at least that for any ,
where , and
Note that is the same as in the proof of Theorem 3. From (S23)–(S25), the lower bound in (S20) holds with probability at least , where and .
(Step 3) We deduce upper bounds on and , by choosing appropriate and relating the moment matching term to the estimation errors. First, combining the upper bound in (S20) from Step 1 and the lower bound from Step 2 shows that with probability at least ,
which gives
(S26) |
where
The desired result then follows from Proposition S10: provided , inequality (S26) implies that
(Step 4) For the second version of the lower bound in (S7), we show that with probability at least ,
(S27) | |||
(S28) | |||
(S29) |
where . Inequality (S27) follows because is a subset of such that and hence for .
Take for , where is the subset of associated with interaction ramp functions as in the proof of Theorem 3. Inequality (S28) holds because by definition. Inequality (S29) follows from Proposition S17: it holds with probability at least that for any ,
where and
Note that is the same as in the proof of Theorem 3. From (S27)–(S29), the lower bound in (S20) holds with probability at least , where and .
(Step 5) We deduce an upper bound on , by relating the moment matching term to the estimation error. First, combining the upper bound in (S20) from Step 1 and the lower bound from Step 4 shows that with probability ,
which gives
(S30) |
where
The desired result then follows from Proposition S12: inequality (S30) implies that
where and .
∎
II.4 Proof of Corollary 1
(i) In the proofs of Theorems 2 and 3, we used the main framework of inequalities,
(S31) |
where is or . For Theorems 2 and 3, we showed the upper bound in (S31) using the fact that is the minimizer of
which is a function of the generator parameters by the definitions (25) and (27) as nested optimization (see Remark 1). Now the estimator is not defined as a minimizer of the above function, but as a solution to an alternating optimization problem (32) with two objectives, so new arguments need to be developed. On the other hand, we showed the lower bound in (S31) for Theorems 2 and 3 by choosing different subsets of the discriminator class. The previous arguments are still applicable here.
(Step 1) For the upper bound in (S31), we show that the following holds with probability at least ,
(S32) | |||
(S33) |
where is the (unobserved) fraction of contamination in . Inequality (S32) follows from Supplement Lemma S13, and is the most important step for connecting the two-objective GAN with the logit f-GAN. Inequality (S33) follows from an upper bound on as proved in Supplement Proposition S5, where , the same as and in the proofs of Theorems 2 and 3.
Similarly as in Supplement Proposition S5 or S8, the term can be controlled in terms of the or norms of , using Supplement Lemma S3 or S9. Then for defined as an or penalty, it can be shown that the following holds with probability at least or ,
(S34) |
provided that the tuning parameters or are chosen as in Theorem 2 or 3 respectively. From (S32)–(S34), the upper bound holds in (S31) with probability or .
III Technical details
III.1 Details in main proof of Theorem 1
Lemma S1.
Proof.
Because is convex in , by Jensen’s inequality, we have
where and is the density ratio . Note that can be equivalently obtained as , where . Therefore, it suffices to show that for .
Denote as the cumulative distribution function of , and the probability of under for .
Lemma S2.
Let be arbitrarily fixed.
(i) If for , then
where .
(ii) If for , then
where and is a universal constant such that .
Proof.
(i) By the mean value theorem, we have , because is decreasing on .
(ii) By the mean value theorem, we have , because is decreasing on . ∎
Proposition S4.
Proof.
The TV distance can be equivalently defined as $\mathrm{TV}(P_1, P_2) = \sup_{A} |P_1(A) - P_2(A)|$ for probability distributions $P_1$ and $P_2$ defined on a common probability space, with the supremum taken over measurable events $A$. This definition is applicable to multivariate Gaussian distributions with singular variance matrices. To derive the desired results, we choose specific events and show that the differences in the means and variance matrices can be upper bounded by the TV distance.
We first show results (i) and (ii), when and are nonsingular. Then we show that the results remain valid when or is singular.
(i) Assume that both and are nonsingular. For any , we have by the definition of TV,
For nonzero , because and , we have
Combining the preceding three displays shows that for nonzero ,
By Lemma S2 (i), if for a constant , then for any satisfying ,
(S36) |
Let . By (S36) with restricted such that and , we have
Similarly, let , where is a vector with th coordinate being one and others being zero. By (S36) with restricted such that and , we have
The last line uses the fact that by the nature of variance matrices.
(ii) Assume that and are nonsingular. We first separate the bias caused by the location difference between and . By the triangle inequality, we have
(S37) |
By Lemma S1, we know that . Then we have
(S38) |
provided that . Inequality (S38) follows because by standard calculation
and taking in (S36) gives
Combining (S37) and (S38) yields
(S39) |
For any such that , (S39) implies
where is a universal constant such that . Similarly, for any such that , (S39) implies
Notice that the choice of ensures that for ,
Moreover, for any nonzero , we have by the definition of ,
Combining the preceding four displays shows that for any nonzero ,
By Lemma S2 (ii), if , then for any nonzero ,
or equivalently for any ,
(S40) |
Notice that for any , if then . Thus, inequality (S40) implies
(S41) |
where .
To handle , let , where is a vector in with only th and th coordinates possibly being nonzero. For , we have and , where is formed by th and th coordinates of , and and are matrices, formed by selecting th and th rows and columns from and respectively. Similarly as in the derivation of (S41), applying inequality (S40) with , we have
Because for a matrix , , the above inequality implies that for any ,
(S42) |
Taking the maximum on both sides of (S42) over gives the desired result:
(iii) Consider the case where or is singular. As the following argument is symmetric in and , we assume without loss of generality that is singular. Fix any nonzero such that .
First, we show that for such that , we also have . In fact, implies
(S43) |
Note that because . If , then , and hence (S43) gives
which is a contradiction. Thus .
Next we show that for such that , we also have . In fact, with as shown above, we have that if and otherwise. If , then inequality (S43) gives
which is a contradiction. Thus .
III.2 Details in main proof of Theorem 2
Lemma S3.
Suppose that are independent and identically distributed as with . For fixed knots in , denote , where is obtained by applying componentwise to for . Then the following results hold.
(i) Each component of the random vector is a sub-gaussian random variable with tail parameter .
(ii) For any , we have that with probability at least ,
where , depending on the universal constant in Lemma S25.
Proof.
(i) This can be obtained as the univariate case of Lemma S9 (i). It is only required that the marginal variance of each component of is upper bounded by .
(ii) Notice that
By (i) and sub-gaussian concentration (Lemma S25), each component of is sub-gaussian with tail parameter . Then for any , by the union bound, we have that with probability at least ,
Taking gives the desired result.
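As a numerical illustration of this union-bound step (a minimal sketch, not part of the formal argument: the clipped ramp form, the single knot, and the Hoeffding-type envelope below are our own illustrative assumptions), the maximum deviation of the empirical means of componentwise ramp features over the coordinates concentrates at the rate of the square root of log d over n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
knot = 0.5  # one illustrative knot; the lemma allows a fixed grid of knots

def ramp(u):
    # ramp clipped to [0, 1]; assumed form for illustration only
    return np.clip(u, 0.0, 1.0)

X = rng.standard_normal((n, d))            # i.i.d. N(0, I_d) data
feats = ramp(X - knot)                     # componentwise ramp features, bounded in [0, 1]
emp_mean = feats.mean(axis=0)              # empirical mean, one per coordinate
pop_mean = ramp(rng.standard_normal(10**6) - knot).mean()  # Monte Carlo proxy for the population mean

max_dev = np.max(np.abs(emp_mean - pop_mean))
# Bounded (hence sub-gaussian) features plus a union bound over d coordinates:
# with probability at least 1 - delta, the max deviation is of order sqrt(log(2d/delta)/n).
delta = 0.05
envelope = np.sqrt(np.log(2 * d / delta) / (2 * n))  # Hoeffding-type envelope for [0,1]-bounded features
print(f"max deviation = {max_dev:.4f}, union-bound envelope = {envelope:.4f}")
```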
(iii) The difference of interest can be expressed in terms of the centered variables as
(S44)
(S45)
We handle the concentration of the two terms separately. Denote , , , and .
First, for , the term in (S45) can be bounded as follows:
where . The second step holds because for . The third step holds because for by Lemma S15. By (ii), for any , we have that with probability at least ,
From the preceding two displays, we obtain that with probability at least ,
(S46)
Next, notice that
From (i), each component of is sub-gaussian with tail parameter . By Lemma S30, each element of is sub-exponential with tail parameter . By Lemma S31, each element of the centered version, , is sub-exponential with tail parameter . Then for any , by Lemma S32 and the union bound, we have that with probability at least ,
Taking , we obtain that with probability at least ,
(S47)
Combining the two bounds (S46) and (S47) gives the desired result. ∎
Lemma S4.
Suppose that is convex, non-increasing, and differentiable, and . Denote .
(i) For any and , we have
(S48)
(ii) Let be fixed. For any and any function , we have
(iii) Suppose, in addition, that is concave in . Let be fixed. If , then for any function ,
(S49)
where denotes the empirical distribution of in the latent representation of Huber’s contamination model.
Proof.
(i) Notice that by definition,
By the convexity of , we have that for any and ,
Moreover, by the convexity of and , we have
Combining the preceding three displays yields the desired result.
(ii) For any function , we have
(S50)
using the fact that is non-increasing and hence for . Setting in (S48) shows that for ,
(S51)
where for by the convexity of . Combining (S50) and (S51) leads to the desired result.
(iii) For any function , can be bounded as follows:
(S52)
(S53)
(S54)
(S55)
Line (S52) follows because for . Line (S53) follows from Jensen’s inequality by the concavity of and in . Line (S54) follows because
obtained by taking and in (S48) and using for . Finally, line (S55) follows because is -Lipschitz in . ∎
Remark S5.
Compared with (S49), can also be bounded as
(S56)
In fact, this follows directly from (S52), because for ,
which can be obtained by taking and in (S48), similarly as (S51). However, the bound (S56) involves the moment difference of between and , which is difficult to control for in our spline class, even with Lipschitz in . In contrast, by exploiting the concavity of and in , the bound (S49) is derived such that it involves the moment difference of , which can be controlled by Lemma S3 in Proposition S5 or by Lemma S9 in Proposition S8.
Proposition S5.
Proof.
Consider the event . By Chebyshev’s inequality, we have . In the event , we have by the assumption and hence by the assumption . By Lemma S4 with , it holds in the event that for any ,
(S57)
The last step uses the fact that and , by the definition .
Next, conditionally on the contamination indicators such that the event holds, we have that are independent and identically distributed observations from , where . Denote as the event that for any and ,
and
where , , and are defined as in Lemma S3 with . In the event , the preceding inequalities imply that for any ,
(S58)
where and . By applying Lemma S3 with to , we have for any such that holds. Taking the expectation over given shows that and hence .
Lemma S5.
Suppose that are independent and identically distributed as . Let be fixed and be a vector of fixed functions. For a convex and twice differentiable function , define
where . Suppose that conditionally on , the random variable is sub-gaussian with tail parameter for , where are Rademacher variables, independent of , and denotes the th component of . Then for any , we have that with probability at least ,
where and is the universal constant in Lemma S26.
Proof.
First, satisfies the bounded difference condition, because with and is non-decreasing by the convexity of :
where . By McDiarmid’s inequality (\citeappendMC89), for any , we have that with probability at least ,
For any , taking shows that with probability at least ,
Next, the expectation of can be bounded as follows:
(S59)
(S60)
(S61)
Line (S59) follows from the symmetrization Lemma S33, where are Rademacher variables, independent of . Line (S60) follows by Lemma S34, because is -Lipschitz in . Line (S61) follows because
For the last step, we use the assumption that conditionally on , the random variable is sub-gaussian with tail parameter for each , and apply Lemma S26 to obtain , and then .
Combining the tail probability and expectation bounds yields the desired result. ∎
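The tail-probability step in the proof above rests on the bounded difference condition; for reference, a standard form of McDiarmid's inequality, in our own generic notation with bounded-difference constants $c_i$, is sketched below.

```latex
% McDiarmid's (bounded differences) inequality, standard form; notation ours.
% If changing the i-th argument of g changes its value by at most c_i, then for
% independent X_1, ..., X_n and any t > 0,
\Pr\bigl( g(X_1,\dots,X_n) - \mathbb{E}\, g(X_1,\dots,X_n) \ge t \bigr)
\;\le\; \exp\!\Bigl( - \frac{2 t^2}{\sum_{i=1}^n c_i^2} \Bigr).
```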
Remark S2.
Lemma S6.
Suppose that is convex and three-times differentiable. Let be fixed. For any function , we have
where , , and .
Proof.
First, can be bounded as
where the inequality follows because for by the convexity of . Next, consider the function . A Taylor expansion of about yields
where for some ,
The desired result then follows because and for , and hence by the definition of . ∎
Proposition S6.
Proof.
By definition, for any , can be represented as such that and :
where with , and with and . Then for any with and , we have and correspondingly, and hence by the boundedness of the ramp function in . Moreover, with and can be expressed in the form , where for , is a unit vector, and is a vector of functions including and for , and for . For symmetry, and are included as two distinct components in , and the corresponding coefficients are identical to each other in . Parenthetically, at most one of the coefficients in associated with and is nonzero for each , but this property is not used in the subsequent discussion.
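For concreteness, a minimal sketch of the kind of ramp main-effect and pairwise-interaction feature map underlying this representation is given below; the clipped ramp form, the knot grid, and the pair enumeration are our own illustrative assumptions rather than the exact construction used in the main text.

```python
import numpy as np
from itertools import combinations

def ramp(u):
    """Ramp function clipped to [0, 1] (assumed form for illustration)."""
    return np.clip(u, 0.0, 1.0)

def spline_features(X, knots):
    """Moving-knot ramp main effects and pairwise ramp interactions.

    X: (n, d) data matrix; knots: 1-D array of knot locations.
    Columns are ramp(x_j - t) for each coordinate j and knot t, plus products
    ramp(x_j - t) * ramp(x_k - t') for pairs j < k; all features lie in [0, 1].
    """
    n, d = X.shape
    mains = [ramp(X[:, j, None] - knots[None, :]) for j in range(d)]  # each (n, K)
    main_block = np.concatenate(mains, axis=1)
    inter_cols = []
    for j, k in combinations(range(d), 2):
        for a in range(len(knots)):
            for b in range(len(knots)):
                inter_cols.append(mains[j][:, a] * mains[k][:, b])
    inter_block = np.column_stack(inter_cols) if inter_cols else np.empty((n, 0))
    return np.concatenate([main_block, inter_block], axis=1)

# A linear spline discriminator is then h(x) = beta^T phi(x) with phi = spline_features;
# the coefficient vector plays the role of the unit-norm direction in the proof.
X = np.random.default_rng(1).standard_normal((100, 3))
Phi = spline_features(X, knots=np.array([-1.0, 0.0, 1.0]))
print(Phi.shape)  # (100, 36): 3*3 main effects + 3 pairs * 3*3 interactions
```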
Next, can be bounded as
For any with and , applying Lemma S6 with and yields
By Lemma S5 with and defined above, it holds with probability at least that for any with and ,
where is determined in Lemma S5 as follows. For , consider the function class , where and, as defined in Lemma S5, is either a moving-knot ramp function, or or a product of moving-knot ramp functions, for . By Lemma S16, the VC index of moving-knot ramp functions is 2. By applying Corollary S2 (i) and (ii) with vanishing , we obtain that conditionally on , the random variable is sub-gaussian with tail parameter for .
Combining the preceding three displays leads to the desired result. ∎
Lemma S7 (Local linearity 1).
For and , denote , where is a function on and is a standard Gaussian random variable. For , if for , then we have
(S62)
where . For , we have
(S63)
where .
Remark S6.
Define a ramp function class
In the setting of Lemma S7, suppose that for fixed ,
Then we have
where . This shows that the moment matching discrepancy over delivers upper bounds, up to scaling constants, on the mean and standard deviation differences, provided that is sufficiently small, for example, .
Proof.
[Proof of (S62)] First, assume that is nonnegative. The other direction will be discussed later. Take . Then for all and
Define for . Then and . We notice the following properties for the function .
-
(i)
is non-decreasing and concave for .
-
(ii)
is non-increasing for .
-
(iii)
for .
For property (i), the derivative of is . Moreover, is non-increasing: for ,
and
The last inequality holds because for any and hence . To show (ii), we write . Then is non-increasing in because is non-increasing. To show (iii), we notice that and hence for ,
The last inequality follows by the Gaussian tail bound: for .
By the preceding properties, we show that and hence (S62) holds. Without loss of generality, assume that . For , let be determined such that . Then and, by property (ii), . If , then, by property (iii), , and hence
This inequality remains valid if . Therefore, , by the assumption and the definition of .
When is negative, a similar argument taking , which is the same as , shows that and hence (S62) holds.
[Proof of (S63)] First, assume that is nonnegative. Take . Notice that is -Lipschitz and hence . Then by the triangle inequality, we have
Define for . Then and . The derivative can be calculated as
By the mean value theorem, , where and hence because . Therefore, we have
by the definition .
When is negative, a similar argument taking shows that and hence (S63) holds. ∎
Lemma S8 (Local linearity 2).
For , , and , denote , where is a function on , is a Gaussian random vector in with mean 0 and variance matrix , and is a Gaussian random vector in with mean and variance matrix . For , we have
where , which behaves like to as or as , and , with and .
Remark S4.
Define a ramp main-effect and interaction class
In the setting of Lemma S8, suppose that for fixed ,
Then combining Lemmas S7 and S8 yields
where . This shows that the moment matching discrepancy over delivers upper bounds, up to scaling constants, on the mean, variance, and covariance differences, provided that is sufficiently small.
Proof.
First, we handle the effect of different means and standard deviations between and in . Denote , where is a Gaussian random vector with mean 0 and variance matrix . Then we have
(S64)
In fact, assume that and , where is a Gaussian random vector with mean 0 and variance matrix . For , we have
The first inequality follows by the triangle inequality and the fact that is bounded in . The second step uses , by the fact that is -Lipschitz, , and . The third inequality follows because .
Next, we show that
(S65)
Assume that is nonnegative. The other direction will be discussed later. Now take , and define for , where is a Gaussian random vector with mean 0 and variance matrix . Then and . The derivative can be calculated as
(S66)
In fact, can be represented as , where is a standard Gaussian variable independent of . Then direct calculation yields
By Stein’s lemma using the fact that given is Gaussian with mean and variance , we have
Substituting this into the expression for gives the formula (S66).
To show (S65), we derive a lower bound on . Assume without loss of generality that , because formula (S66) is symmetric in and . By the previous representation of , we have and hence
where . The last inequality follows by the independence of and , , , and the two probability bounds: and . Hence by the mean value theorem, we have
which gives the desired bound (S65) in the case of :
When is negative, a similar argument taking and interchanging the roles of and leads to (S65).
Proposition S7.
Proof.
For any with , the function can be expressed as with . For , by restricting in (S67) to those with defined as ramp functions of and using Remark S6, we obtain
For , by restricting in (S67) to those with defined as ramp interaction functions of and using Remark S4, we obtain
where and are the th elements of and respectively. Combining the preceding two displays leads to the desired result. ∎
III.3 Details in main proof of Theorem 3
Lemma S9.
Suppose that are independent and identically distributed as with . For fixed knots in , denote , where is obtained by applying componentwise to for . Then the following results hold.
(i) is a sub-gaussian random vector with tail parameter .
(ii) For any , we have that with probability at least ,
where and are the universal constants in Lemmas S25 and S27.
(iii) Let . For any , we have that with probability at least , both the inequality in (ii) and
where , , , and is the universal constant in Lemma S28.
Proof.
(i) First, we show that is a -Lipschitz function for any unit vector . For any , we have , where is partitioned as , and because each component of is 1-Lipschitz, as a function of only the corresponding component of . Next, can be represented as , where is a standard Gaussian random vector. For any and unit vector , we have
Hence is a -Lipschitz function of the standard Gaussian vector . By Theorem 5.6 in \citeappendBLM13, the centered version satisfies that for any ,
That is, is sub-gaussian with tail parameter . The desired result follows by the definition of sub-gaussian random vectors.
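The concentration step invoked from \citeappendBLM13, Theorem 5.6, is the Gaussian concentration inequality for Lipschitz functions; a standard one-sided form, in our own generic notation, is sketched below.

```latex
% Gaussian concentration for Lipschitz functions (cf. BLM13, Theorem 5.6); notation ours.
% For an L-Lipschitz function g and a standard Gaussian vector Z,
\Pr\bigl( g(Z) - \mathbb{E}\, g(Z) \ge t \bigr) \;\le\; \exp\!\Bigl( - \frac{t^2}{2 L^2} \Bigr),
\qquad t \ge 0.
% Applied to the Lipschitz transform of the standardized Gaussian vector constructed above,
% this gives the stated sub-gaussian tail parameter.
```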
(ii) As shown above, is sub-gaussian with tail parameter for any unit vector . Then is sub-gaussian with tail parameter by sub-gaussian concentration (Lemma S25). Hence by definition, we have that is a sub-gaussian random vector with tail parameter . Notice that
The desired result follows from Lemma S27: with probability at least , we have
(iii) The difference of interest can be expressed in terms of the centered variables as
(S68)
(S69)
We handle the concentration of the two terms separately. Denote , , , and .
First, for the term in (S69) can be bounded as follows:
where . The second inequality holds because . The third inequality holds because by Lemma S15. By (ii), for any , we have that with probability at least ,
From the preceding two displays, we obtain that with probability at least ,
(S70)
Next, consider an eigen-decomposition , where ’s are eigenvalues and ’s are the eigenvectors with . The concentration of the term in (S68) can be controlled as follows:
The last inequality uses the fact that and hence for . From (i), is a sub-gaussian random vector with tail parameter . By Lemma S28, for any , we have that with probability at least ,
From the preceding two displays, we obtain that with probability at least ,
(S71)
Combining the two bounds (S70) and (S71) gives the desired result. ∎
Proposition S8.
Proof.
Consider the event . By Chebyshev’s inequality, we have . In the event , we have by the assumption and hence by the assumption . By Lemma S4 with , it holds in the event that for any ,
(S72)
The last step also uses the fact that and , by the definition .
Next, conditionally on the contamination indicators such that the event holds, we have that are independent and identically distributed observations from , where . Denote as the event that for any and ,
and
where , , and are defined as in Lemma S9 with . In the event , the preceding inequalities imply that for any ,
(S73)
where , , and . By applying Lemma S9 with to , we have for any such that holds. Taking the expectation over given shows that and hence .
Lemma S10.
Suppose that are independent and identically distributed as . Let be fixed and be a vector of fixed functions. For a convex and twice differentiable function , define
where . Suppose that conditionally on , the random variable is sub-gaussian with tail parameter for , where are Rademacher variables, independent of , and denotes the th component of . Then for any , we have that with probability at least ,
where and is the universal constant in Lemma S23.
Proof.
First, satisfies the bounded difference condition, because with and is non-decreasing by the convexity of :
where . By McDiarmid’s inequality (\citeappendMC89), for any , we have that with probability at least ,
For any , taking shows that with probability at least ,
Next, the expectation of can be bounded as follows:
(S74)
(S75)
(S76)
Line (S74) follows from the symmetrization Lemma S33, where are Rademacher variables, independent of . Line (S75) follows by Lemma S34, because is -Lipschitz in . Line (S76) follows because
For the last step, we use the assumption that conditionally on , the random variable is sub-gaussian with tail parameter for each , and apply Lemma S23 to obtain , and then .
Combining the tail probability and expectation bounds yields the desired result. ∎
Lemma S11.
Suppose that is convex and three-times differentiable. Let be fixed.
(i) For any function , we have
where , , and .
(ii) If, in addition, and , then
where .
Proof.
(i) First, can be bounded as follows:
where the inequality follows because for by the convexity of . Next, consider the function
where , , , and . A Taylor expansion of about yields
where for some ,
The desired result then follows because and for and hence by the definition of .
(ii) The inequality from (i) can be rewritten as
If , then
Moreover, if , then
by the mean value theorem and the definition of . Combining the preceding two displays gives the desired result. ∎
Proposition S9.
Proof.
By definition, for any , can be represented as such that and :
where with , with , and denotes the vector of functions with the th component for . Then for any with , we have and correspondingly, and hence by the Cauchy–Schwarz inequality and the boundedness of the ramp function in . Moreover, can be expressed in the form , where for , is a unit vector, is a vector of functions, including and for . Parenthetically, at most one of the coefficients in associated with and is nonzero for each , although this property is not used in the subsequent discussion.
For any with and , the function can be expressed as
where . The mean-centered ramp functions in are bounded between , and hence similarly as above. Moreover, such can be expressed in the form , where is a unit vector, is defined as above, and by the boundedness of the ramp function in .
Next, can be bounded as
(S77)
For any with , , and , applying Lemma S11(ii) with and yields
By Lemma S18(i), can be bounded as follows:
Similarly, can also be bounded by , because . Hence can be bounded as
(S78)
where as in Lemma S6. Moreover, by Lemma S5 with and defined above, it holds with probability at least that for any with and ,
(S79)
where is determined in Lemma S5 as follows. For , consider the function class , where , , and, as defined in Lemma S5, is of the form or . By Lemma S16, the VC index of moving-knot ramp functions is 2. By Lemma S17, the VC index of constant functions is also 2. By applying Corollary S2 (ii) with vanishing , we obtain that conditionally on , the random variable is sub-gaussian with tail parameter for .
Proposition S10.
Proof.
For any with , the function can be obtained as with . For , we restrict in (S80) such that is a ramp function of , in the form for . Applying Lemma S7 shows that there exists in the form and in the form such that
(S83)
(S84)
Proposition S11.
Proof.
By definition, for any , can be represented as such that and , where
where with , and denotes the vector of functions with the th component for . Then for any with , we have and correspondingly, and hence , by the boundedness of the ramp function in and the Cauchy–Schwarz inequality, . Moreover, can be expressed in the form , where for , is a unit vector, is a vector of functions, including for . For symmetry, and are included as two distinct components in , and the corresponding coefficients are assumed to be identical to each other in .
For any with and , the function can be expressed as
where . The mean-centered ramp functions in are bounded between , and hence similarly as above. Moreover, such can be expressed in the form , where is a unit vector, is defined as above, and by the boundedness of the ramp function in .
Next, can be bounded as
(S85)
For any with , , and , applying Lemma S11(ii) with and yields
By Lemma S19(ii), can be bounded as follows:
Similarly, can also be bounded by , because . Hence, with , can be bounded as
(S86)
where as in Lemma S6. Moreover, by Lemma S5 with and defined above, it holds with probability at least that for any with and ,
(S87)
where is determined in Lemma S5 as follows. For , consider the function class , where , , and, as defined in Lemma S5, is of the form for . By Lemma S16, the VC index of moving-knot ramp functions is 2. By Lemma S17, the VC index of constant functions is also 2. By applying Corollary S2 (ii), we obtain that conditionally on , the random variable is sub-gaussian with tail parameter for .
Proposition S12.
Proof.
For any with , the function can be obtained as with . First, we handle the effect of different means and standard deviations between and in . Denote
where and is defined as the correlation matrix such that . Then can be related to as follows:
(S88)
where . In fact, by Lemma S21 with set to , which is ()-Lipschitz and componentwise bounded in , we have that for any with ,
For each pair , we restrict such that is or . Applying Lemma S8 shows that there exists such that
where . By the triangle inequality, we have
In addition, we have . Then for any unit vector ,
where . The function can be expressed as such that for and for , and hence . By the definition of , we have . Moreover, by the Cauchy–Schwarz inequality, . Substituting these inequalities into the preceding display shows that
(S89)
III.4 Details in main proof of Theorem 4
For completeness, we restate Proposition 3 in the main proof of Theorem 4 below. For , define
where , depending on universal constants and in Lemma S26 and Corollary S2 in the Supplement. Denote
Proposition 3.
To see why Proposition 3 leads to Theorem 4, we show that conditions in Proposition 3 are satisfied under the setting of Theorem 4. For a constant , let
Then the following conditions
(i) ,
(ii) ,
imply the conditions
(iii) ,
(iv) ,
(v) and .
In fact, condition (v) follows directly from condition (ii) because . For conditions (iii) and (iv), we first show that and can be upper bounded as follows:
(S90)
(S91)
and
(S92)
(S93)
Lines (S90) and (S93) hold because for . Line (S91) holds because and hence the linear term in is upper bounded by the square root term. To see this, by conditions (i) and (ii) we have
Line (S92) holds because for . With the above upper bounds for and , we show that condition (i) implies condition (iii) as follows:
and condition (ii) implies condition (iv) as follows:
Lemma S12.
Consider the hinge GAN (3).
(i) For any and any function , we have
(ii) For any function , we have
(S94)
where and denotes the empirical distribution of in the latent representation of Huber’s contamination model.
Proof.
(i) For any we have
(S95)
(S96)
(S97)
Inequalities (S95) and (S96) hold because for all . Inequality (S97) holds because for all .
(ii) Because both and are concave in and upper bounded by , we have
(S98)
(S99)
(S100)
(S101)
(S102)
Lines (S98) and (S100) hold because for all . Line (S99) follows from Jensen’s inequality by the concavity of and . Line (S101) follows because for all , and the last line (S102) holds because is -Lipschitz in . ∎
Proposition S13.
Proof.
Proposition S14.
Proof.
By definition, for any , can be represented as such that and in the same way as in Proposition S6. Then for any with and , we have and correspondingly, and hence by the boundedness of the ramp function in . Moreover, with and can be expressed in the form , where for , is a unit vector, and is a vector of functions including and for , and for . For symmetry, and are included as two distinct components in , and the corresponding coefficients are identical to each other in .
Next, can be bounded as
(S105)
For any with and , because , we have and . Hence the hinge term in (S105) reduces to a moment matching term and can be lower bounded as follows:
Similarly, the two other hinge terms in (S105) also reduce to moment matching terms:
We apply Lemma S5 with , defined above, and and replaced by the identity function in . It holds with probability at least that for any with and ,
as shown in Proposition S6. Combining the preceding three displays leads to the desired result. ∎
III.5 Details in main proof of Theorem 5
Proposition S15.
Proof.
Proposition S16.
Proof.
For any , because and , we have and . Hence, by the Cauchy–Schwarz inequality and the boundedness of the ramp function in . Then the mean-centered version , with , is also bounded in . Moreover, such can be expressed in the form , where, for , is a unit vector, is a vector of functions, including and for , and .
Next, can be bounded as
(S108)
For any , because , we have and . Then the hinge term reduces to a moment matching term and can be lower bounded as follows:
(S109)
Similarly, the absolute difference term in (S108) can be simplified as follows:
We apply Lemma S10 with , and defined above, and and replaced by the identity function in . It holds with probability at least that for any ,
(S110)
as shown in Proposition S9. Combining the inequalities (S108)–(S110) leads to the desired result. ∎
Proposition S17.
In the setting of Proposition S3, it holds with probability at least that for any ,
where , and, with ,
Proof.
For any , because and , we have . Hence by the boundedness of the ramp function in and the Cauchy–Schwarz inequality, . Then the mean-centered version , with , is also bounded in . Moreover, such can be expressed in the form , where, for , is a unit vector, is a vector of functions, including for , and .
Next, can be bounded as
(S111)
For any , because , we have and . Then the hinge term reduces to a moment matching term and can be lower bounded as follows:
(S112)
Similarly, the absolute difference term in (S111) can be simplified as follows:
We apply Lemma S10 with , and defined above, and and replaced by the identity function in . It holds with probability at least that for any ,
(S113)
as shown in Proposition S11. Combining the inequalities (S111)–(S113) leads to the desired result. ∎
III.6 Details in proof of Corollary 1
Lemma S13.
Assume that satisfies Assumptions 1 and 2, and satisfies Assumption 3. Let be a solution to the alternating optimization problem (32).
(i) Let be fixed. For any and any function , we have
(ii) Let be fixed. If , then for any function , we have
(S114)
where denotes the empirical distribution of in the latent representation of Huber’s contamination model.
Proof.
(i) As mentioned in Remark 2, the logit -GAN objective in (18) can be equivalently written as
where and is the convex conjugate of . Because is convex and non-decreasing by Assumptions 1 and 2, we have that is convex and non-decreasing.
Denote as the generator objective function, . By the definition of a solution to alternating optimization (see Remark 1), we have
that is, the generator loss at is less than that at any , with the discriminator parameter fixed at . The preceding inequalities can be written out as
and
Note that either the location or the variance matrix, but not both, is changed on the two sides in each inequality. In the first inequality, we have because both are equal to . In the second inequality, the term is on both sides. Then the two inequalities yield
Now we are ready to derive an upper bound for :
(S115)
(S116)
(S117)
(S118)
(S119)
Line (S115) follows from Jensen’s inequality by the convexity of . Line (S116) follows from the two inequalities derived above, together with the fact that is non-increasing, because , , and are non-decreasing. In (S117) we use the fact that and drop the term because . Line (S118) follows from Jensen’s inequality by the convexity of and the concavity of , together with the fact that is non-increasing. For the last line (S119), by the definition of the Fenchel conjugate we have
with set to .
(ii) To derive an upper bound for , we first argue similarly as in part (i):
Then we use the -Lipschitz property of and obtain
Combining the preceding displays completes the proof. ∎
Lemma S14.
Let be a solution to the alternating optimization problem (35).
(i) For any and any function , we have
(ii) If , then for any function , we have
(S120)
where denotes the empirical distribution of in the latent representation of Huber’s contamination model.
Proof.
(i) By the same argument used in the proof of Lemma S13 with replaced by and replaced by we have:
(S121)
(S122)
Inequality (S121) is derived by the same argument that leads to (S115)–(S118), using the fact that is concave and non-decreasing and that is concave and non-increasing, just like and in Lemma S13, respectively. The term results from the fact that . Inequality (S122) follows by the same argument used in Lemma S12:
with set to be .
(ii) To derive an upper bound for , we first argue similarly as in part (i):
Then we use the -Lipschitz property of and obtain
Combining the preceding displays completes the proof. ∎
III.7 Proofs in Section 5
Proof of Proposition 1.
We first verify that is convex on and so that is a valid -divergence. Because is convex by the convexity of and , it follows that is convex on . Direct calculation gives . Thus, defines a valid -divergence.
Next we show that . Denote by and by . Then we have
(S123)
(S124)
(S125)
Line (S123) follows by direct calculation. Lines (S124)–(S125) follow from the definitions of and .
Finally, by the definition of from , direct calculation gives
which implies that . ∎
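For reference, the validity requirement being checked in this proof is the standard one for $f$-divergences; a generic sketch, in our own notation, is given below.

```latex
% An f-divergence is well defined when f is convex on (0, \infty) with f(1) = 0:
D_f(P \,\|\, Q) \;=\; \int f\!\Bigl( \frac{\mathrm{d}P}{\mathrm{d}Q} \Bigr)\, \mathrm{d}Q
\;\ge\; f\!\Bigl( \int \frac{\mathrm{d}P}{\mathrm{d}Q}\, \mathrm{d}Q \Bigr) \;=\; f(1) \;=\; 0,
% by Jensen's inequality, with equality when P = Q.
```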
IV Auxiliary lemmas
IV.1 Truncated linear basis
The following result gives upper bounds on the moments of the truncated linear basis, which are used in the proofs of Lemmas S3 and S9.
Lemma S15.
For and , we have
Proof.
The second result is immediate: . The first result can be shown as follows:
The last inequality holds because . ∎
IV.2 VC index of ramp functions
For a collection of subsets of , and points , define
that is, is the number of subsets of picked up by the collection . We say that a subset is picked up by if . For convenience, we also say that is picked up by if . Moreover, define
and the Vapnik–Chervonenkis (VC) index of as (\citeappendVW)
where the infimum over the empty set is taken to be infinity.
The subgraph of a function is defined as . For a collection of functions , denote the collection of corresponding subgraphs as , and define the VC index of as .
Lemma S16.
For , we have that . Moreover, the VC index of is . That is, the VC index of moving-knot ramp functions is .
Proof.
To show , we need to show and . The first property is trivially true. For the second property, it suffices to show that for any two distinct points , there is at least one subset of that cannot be picked up by for any . Without loss of generality, assume that . We arbitrarily fix and discuss several cases depending on and .
If , then if , i.e., , we have
This implies that and hence . As a result, a subset containing just cannot be picked up by .
If , then if , i.e., , we have
and hence . Thus, the subset can never be picked up by .
If and , then if , we have
This implies that . As a result, the subset can never be picked up by .
Combining the preceding cases shows that and . Moreover, the class of functions , denoted as , admits a one-to-one correspondence with . A subset of is picked up by if and only if the corresponding subset of is picked up by . Hence . ∎
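The count behind this lemma can also be checked by brute force. Below is a minimal sketch under an assumed moving-knot parametrization (the clipped ramp $x \mapsto \min\{\max(x - t, 0), 1\}$ with knot $t$, which is our illustrative choice, not necessarily the exact class in the text): since the ramp is pointwise non-increasing in the knot, the subgraphs are nested, so no two points in the plane can be shattered.

```python
import numpy as np

def ramp(u):
    return np.clip(u, 0.0, 1.0)

def picked_subsets(points, knots):
    """Subsets of a two-point set picked up by subgraphs {(x, y): y < ramp(x - t)}."""
    (x1, y1), (x2, y2) = points
    in1 = y1 < ramp(x1 - knots)   # boolean over knots: is point 1 in the subgraph?
    in2 = y2 < ramp(x2 - knots)
    return {(a, b) for a, b in zip(in1, in2)}

rng = np.random.default_rng(3)
knots = np.linspace(-10, 10, 4001)
worst = 0
for _ in range(2000):
    pts = [(rng.uniform(-3, 3), rng.uniform(0.0, 1.0)) for _ in range(2)]
    worst = max(worst, len(picked_subsets(pts, knots)))
print("max number of picked-up subsets over random pairs:", worst)  # stays below 4
# Because ramp(x - t) decreases pointwise as the knot t increases, the subgraphs are nested,
# so at most 3 of the 2^2 = 4 subsets of any two points can be picked up:
# VC dimension 1, hence VC index 2, consistent with the lemma.
```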
Lemma S17.
For , we have that . That is, the VC index of constant functions is .
Proof.
For any two distinct points and , assume without loss of generality that . Then the singleton can never be picked up by for any , and hence and . In fact, if is in the subgraph of , then . As a result, , indicating that is also in the subgraph. ∎
IV.3 Lipschitz functions of Gaussian vectors
Say that a function is -Lipschitz if for any .
Lemma S18.
Let , and be an -Lipschitz function.
(ii) For any symmetric matrix with , we have
Proof.
(i) By \citeappendBLM13, Theorem 5.6, it can be shown that for any unit vector , is sub-gaussian with tail parameter . See the proof of Lemma S9(i) for a similar argument. Then by Lemma S23, .
(ii) Consider an eigen-decomposition , where ’s are eigenvalues and ’s are the eigenvectors with . Denote and . Then and
The first step uses the Cauchy–Schwarz inequality, and the second step uses the fact that and by Lemma S23 because is sub-gaussian with tail parameter for each . ∎
The following result provides a th-moment bound which depends linearly on , under a boundedness condition in addition to the Lipschitz condition.
Lemma S19.
Let , and be an -Lipschitz function.
(i) For any matrix with , we have
(ii) For any matrix with , we have
Proof.
(i) Denote and . Then each component of is contained in by the boundedness of . The variable can be bounded as follows:
(S126)
(S127)
Line (S126) follows from von Neumann’s trace inequality. Line (S127) uses the fact that , because for any , by the boundedness of . Then the desired result follows because
(S128)
(S129)
Line (S128) also follows from von Neumann’s trace inequality. Line (S129) follows because and by Lemma S18(i), with being an -Lipschitz function.
(ii) The difference can be expressed in terms of the centered variables as . Then
(S130)
The expectation of the first term on (S130) can be bounded using (i) as
The expectation of the second term on (S130) can be bounded as
by inequality (S129) and the fact that . Combining the preceding two bounds yields the desired result. ∎
IV.4 Moment matching for Lipschitz functions
The following result gives an upper bound on moment matching of quadratic forms under a Lipschitz condition.
Lemma S20.
Let be an -Lipschitz function, and let and , where is a random vector in which the second moments of all components are 1, , and and with . Then for any matrix with ,
where .
Proof.
Denote and . The difference can be decomposed as
(S131)
The expectation of the first term on (S131) can be bounded as
(S132)
(S133)
where are the singular values of , and . Line (S132) follows from von Neumann’s trace inequality. Line (S133) follows because with and , which can be shown as follows. For any unit vector , we have
(S134)
using the fact that is -Lipschitz and the marginal variances of are 1. The expectation of the second term on (S131) can be bounded as
(S135)
Line (S135) uses the fact that based on (S134) and the following argument:
where the last step follows because . Combining (S131), (S133), and (S135) yields the desired result. ∎
The following result gives a tighter bound than in Lemma S20 under a boundedness condition in addition to the Lipschitz condition.
Lemma S21.
In the setting of Lemma S20, suppose that each component of and is bounded in . Then for any matrix with ,
where .
Proof.
The difference can also be decomposed as
By (S135), both terms on the right-hand side can be bounded in absolute value by and hence by , because by the componentwise boundedness of and . ∎
V Technical tools
V.1 von Neumann’s trace inequality
Lemma S22.
(\citeappendVon62) For any matrices and with singular values and respectively,
As a direct consequence, if is symmetric and non-negative definite, then
This follows because the singular values ’s are also the eigenvalues of and hence for a symmetric and nonnegative definite matrix , and by the definition of .
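In our own notation, with the singular values arranged in decreasing order, the inequality and the consequence just described read as follows.

```latex
% von Neumann's trace inequality and the consequence used in the text (notation ours).
\bigl| \operatorname{tr}(A B) \bigr| \;\le\; \sum_{i} \sigma_i(A)\, \sigma_i(B),
\qquad
\bigl| \operatorname{tr}(A B) \bigr| \;\le\; \|A\|_{\mathrm{op}} \operatorname{tr}(B)
\ \ \text{for $B$ symmetric and nonnegative definite,}
% because the singular values of such a B are its eigenvalues, which sum to tr(B).
```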
V.2 Sub-gaussian and sub-exponential properties
The following results can be obtained from \citeappendVer18, Proposition 2.5.2.
Lemma S23.
For a random variable , the following properties are equivalent: there exist universal constants such that for all , where is the parameter appearing in property (i).
(i) for any .
(ii) for any .
(iii) .
If , then properties (i)–(iii) are also equivalent to the following one.
(iv) , for any .
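In generic notation (our constants $K_1, \dots, K_4$, which agree up to universal factors; the correspondence with the symbols of Lemma S23 is our assumption), the four properties take the standard form below.

```latex
% Standard equivalent sub-gaussian properties (cf. \citeappend{Ver18}, Prop. 2.5.2); notation ours.
\text{(i)}\ \ \Pr(|X| \ge t) \le 2 \exp(-t^2 / K_1^2)\ \ \forall t \ge 0, \qquad
\text{(ii)}\ \ \bigl(\mathbb{E}|X|^p\bigr)^{1/p} \le K_2 \sqrt{p}\ \ \forall p \ge 1,
\text{(iii)}\ \ \mathbb{E}\exp(X^2 / K_3^2) \le 2, \qquad
\text{(iv)}\ \ \mathbb{E}\exp(\lambda X) \le \exp(K_4^2 \lambda^2)\ \ \forall \lambda \in \mathbb{R}
\quad \text{(when } \mathbb{E}X = 0\text{)}.
```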
Say that is a sub-gaussian random variable with tail parameter if property (i) holds in Lemma S23 with . The following result shows that being sub-gaussian depends only on tail probabilities of a random variable.
Lemma S24.
Suppose that for some , a random variable satisfies that
Then is sub-gaussian with tail parameter .
Proof.
We distinguish two cases of . First, if , then and
Second, if , then
Hence the desired result holds. ∎
The following result follows directly from Chernoff’s inequality.
Lemma S25.
Suppose that are independent such that and is sub-gaussian with tail parameter for . Then
where as in Lemma S23.
The following result can be obtained from \citeappendVer18, Exercise 2.5.10.
Lemma S26.
Let be random variables such that is sub-gaussian with tail parameter for . Then
where is a universal constant.
Say that is a sub-gaussian random vector with tail parameter if is sub-gaussian with tail parameter for any with . The following result can be obtained from \citeappendHKZ12, Theorem 2.1.
Lemma S27.
Suppose that with is a sub-gaussian random vector with tail parameter . Then for any , we have that with probability at least ,
where is a universal constant.
The following result can be obtained from \citeappendVer10, Theorem 5.39 and Remark 5.40(1). Formally, this is different from \citeappendVer18, Theorem 4.7.1 and Exercise 4.7.3, due to assumption (4.24) used in the latter result.
Lemma S28.
Suppose that are independent and identically distributed as , where is a sub-gaussian random vector with tail parameter . Then for any , we have that with probability at least ,
where and is a universal constant.
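Schematically, the bound has the familiar shape for sample second-moment matrices of sub-gaussian vectors; the display below is our own paraphrase (notation and the exact dependence on the tail parameter are assumptions), not a restatement of the cited theorem.

```latex
% Schematic form of the covariance-type bound (cf. \citeappend{Ver10}, Thm 5.39, Rmk 5.40); notation ours.
\Bigl\| \frac{1}{n} \sum_{i=1}^n X_i X_i^\top - \Sigma \Bigr\|_{\mathrm{op}}
\;\lesssim\; \sigma^2 \Bigl( \sqrt{\tfrac{d + \log(1/\epsilon)}{n}} + \tfrac{d + \log(1/\epsilon)}{n} \Bigr)
\quad \text{with probability at least } 1 - \epsilon,
% for i.i.d. sub-gaussian vectors X_i in R^d with tail parameter sigma and second-moment matrix Sigma.
```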
The following result can be obtained from \citeappendVer18, Proposition 2.5.2.
Lemma S29.
For a random variable , the following properties are equivalent: there exist universal constants such that for all , where is the parameter appearing in property (i).
(i) for any .
(ii) for any .
(iii) .
If , then properties (i)–(iii) are also equivalent to the following one.
(iv) , for any satisfying .
Say that is a sub-exponential random variable with tail parameter if property (i) holds in Lemma S29 with . The following result, from \citeappendVer18, Lemma 2.7.7, provides a link from sub-gaussian to sub-exponential random variables.
Lemma S30.
Suppose that and are sub-gaussian random variables with tail parameters and respectively. Then is sub-exponential with tail parameter , where is a universal constant.
The following result about centering can be obtained from \citeappendVer18, Exercise 2.7.10.
Lemma S31.
Suppose that is a sub-exponential random variable with tail parameter . Then is a sub-exponential random variable with tail parameter , where is a universal constant.
The following result can be obtained from \citeappendVer18, Corollary 2.8.3.
Lemma S32.
Suppose that are independent such that and is sub-exponential with tail parameter for . Then
where is a universal constant.
V.3 Symmetrization and contraction
The following result can be obtained from the symmetrization inequality (Section 2.3.1 in \citeappendVW) and Theorem 7 in \citeappendMZ03.
Lemma S33.
Let be i.i.d. random vectors and be a class of real-valued functions such that for all . Then we have
where are i.i.d. Rademacher random variables that are independent of . The above inequality also holds with the left-hand side replaced by
Proof.
For completeness, we give a direct proof. Let be i.i.d. copies of . Then we have
(S136)
(S137)
Line (S136) follows from Jensen’s inequality. Line (S137) follows because for a pair of i.i.d. random variables, their difference is a symmetric random variable about 0 and its distribution remains the same when multiplied by an independent Rademacher random variable. A similar argument is also applicable for upper bounding . ∎
Lemma S34.
Let be a function with a Lipschitz constant . Then in the setting of Lemma S33, we have
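In generic notation, the two tools combine as sketched below (standard forms; the class $\mathcal{F}$, the Lipschitz constant $L$, and the requirement $\varphi(0) = 0$ are our own placeholders for the missing symbols).

```latex
% Symmetrization followed by Rademacher contraction (standard forms; notation ours).
\mathbb{E} \sup_{f \in \mathcal{F}} \Bigl| \frac{1}{n}\sum_{i=1}^n \bigl( f(X_i) - \mathbb{E} f(X_i) \bigr) \Bigr|
\;\le\; 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \Bigl| \frac{1}{n}\sum_{i=1}^n \epsilon_i f(X_i) \Bigr|,
\qquad
\mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^n \epsilon_i\, \varphi\bigl( f(X_i) \bigr)
\;\le\; L\, \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^n \epsilon_i f(X_i),
% where the epsilon_i are independent Rademacher variables and phi is L-Lipschitz with phi(0) = 0.
```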
V.4 Entropy and maximal inequality
For a function class in a metric space endowed with norm , the covering number is defined as the smallest number of balls of radius in the -metric needed to cover . The entropy, is defined as . The following maximal inequality can be obtained from Dudley’s inequality for sub-gaussian variables (e.g., \citeappendVan2000, Corollary 8.3; \citeappendBLT18, Proposition 9.2) including Rademacher variables.
Lemma S35.
Let be a class of functions , and be independent Rademacher random variables. For a fixed set of points , define the random variable
Suppose that and , where is the empirical norm, . Then for any ,
(S138)
where is a universal constant.
The following result, taken from \citeappendVW, Theorem 2.6.7, provides an upper bound on the entropy of a function class in terms of the VC index. For any and probability measure , the norm is defined as .
Lemma S36.
Let be a VC class of functions such that . Then for any and probability measure , we have
where denotes the VC index of and is a universal constant.
We deduce the following implications of the preceding results, which can be used in conjunction with Lemmas S16–S17.
Corollary S1.
In the setting of Lemma S35, the random variable is sub-gaussian with tail parameter .
Corollary S2.
In the setting of Lemma S35, the following results hold.
(i) If , then is sub-gaussian with tail parameter , where .
(ii) Consider another two classes and of functions from to in addition to , and let . If , then is sub-gaussian with tail parameter , where .
Proof.
(i) Take and to be the empirical distribution on . By Lemma S36, the entropy integral can be upper bounded by
using for and . Taking in (i) to be the right-hand side of the preceding display yields the desired result by Corollary S1.
(ii) First, we show that the covering number is upper bounded by the product . Denote as a -net of with the cardinality . Similarly, denote as and those of , with the cardinality and respectively. For any , and , there exist , and such that
By the triangle inequality and , we have
This shows that is a -net of with respect to . Hence the covering number is upper bounded by the cardinality of , that is, .
imsart-nameyear \bibliographyappendfinal