
Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation in Contaminated Gaussian Models

Ziyue Wang and Zhiqiang Tan, Department of Statistics, Rutgers University, Piscataway, New Jersey 08854, USA
Abstract

Consider the problem of simultaneous estimation of location and variance matrix under Huber's contaminated Gaussian model. First, we study minimum $f$-divergence estimation at the population level, corresponding to a generative adversarial method with a nonparametric discriminator, and establish conditions on $f$-divergences which lead to robust estimation, similarly to robustness of minimum distance estimation. More importantly, we develop tractable adversarial algorithms with simple spline discriminators, which can be implemented via nested optimization such that the discriminator parameters can be fully updated by maximizing a concave objective function given the current generator. The proposed methods are shown to achieve minimax optimal rates or near-optimal rates depending on the $f$-divergence and the penalty used. This is the first time such near-optimal error rates are established for adversarial algorithms with linear discriminators under Huber's contamination model. We present simulation studies to demonstrate advantages of the proposed methods over classic robust estimators, pairwise methods, and a generative adversarial method with neural network discriminators.

Keywords: $f$-divergence, generative adversarial algorithm, Huber's contamination model, minimum divergence estimation, penalized estimation, robust location and scatter estimation.

1 Introduction

Consider Huber's contaminated Gaussian model (Huber (1964)): independent observations $X_1,\ldots,X_n$ are obtained from $P_\epsilon=(1-\epsilon)\mathrm{N}(\mu^*,\Sigma^*)+\epsilon Q$, where $\mathrm{N}(\mu^*,\Sigma^*)$ is a $p$-dimensional Gaussian distribution with mean vector $\mu^*$ and variance matrix $\Sigma^*$, $Q$ is a probability distribution for contaminated data, and $\epsilon$ is a contamination fraction. Our goal is to estimate the Gaussian parameters $(\mu^*,\Sigma^*)$, without any restriction on $Q$, for a small $\epsilon$. This allows both outliers located in areas with vanishing probabilities under $\mathrm{N}(\mu^*,\Sigma^*)$ and other contaminated observations in areas with non-vanishing probabilities under $\mathrm{N}(\mu^*,\Sigma^*)$. We focus on the setting where the dimension $p$ is small relative to the sample size $n$, and no sparsity assumption is placed on $\Sigma^*$ or its inverse matrix. The latter, $\Sigma^{*-1}$, is called the precision matrix and is of particular interest in Gaussian graphical modeling. In the low-dimensional setting, estimation of $\Sigma^*$ and $\Sigma^{*-1}$ can be treated as equivalent.
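To make the data-generating process concrete, the following is a minimal sketch (ours, not from the paper) of drawing a sample from this model, using the latent-indicator representation described in Section 4 and, purely for illustration, a Gaussian contamination $Q$; the function and parameter names are hypothetical.

```python
import numpy as np

def sample_huber(n, mu, Sigma, eps, mu_Q, Sigma_Q, rng):
    """Draw n observations from (1 - eps) N(mu, Sigma) + eps Q, with Q = N(mu_Q, Sigma_Q)."""
    U = rng.random(n) < eps                                   # latent contamination indicators
    X = rng.multivariate_normal(mu, Sigma, size=n)            # uncontaminated draws
    X[U] = rng.multivariate_normal(mu_Q, Sigma_Q, size=int(U.sum()))  # overwrite contaminated rows
    return X, U

rng = np.random.default_rng(0)
p = 10
X, U = sample_huber(2000, np.zeros(p), np.eye(p), 0.05,
                    5.0 * np.ones(p), 0.25 * np.eye(p), rng)
```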

There is a vast literature on robust statistics (e.g., Huber and Ronchetti 2009; Maronna et al. 2018). In particular, the problem of robust estimation from contaminated Gaussian data has been extensively studied, and various interesting methods and results have been obtained recently. Under Huber’s contamination model above, while the bulk of the data are still Gaussian distributed, a challenge is that the contamination status of each observation is hidden, and the contaminated data may be arbitrarily distributed. In this sense, this problem should be distinguished from various related problems, including multivariate scatter estimation for elliptical distributions as in Tyler (1987) and estimation in Gaussian copula graphical models as in Liu et al. (2012) and Xue and Zou (2012), among others. For motivation and comparison, we discuss below several existing approaches directly related to our work.

Existing work.  As suggested by the definition of the variance matrix $\Sigma^*$, a numerically simple method, proposed in Öllerer and Croux (2015) and Tarr, Müller and Weber (2016), is to apply a robust covariance estimator to each pair of variables, for example, based on robust scale and correlation estimators, and then assemble those estimators into an estimated variance matrix $\hat\Sigma$. These pairwise methods are naturally suitable for both Huber's contamination model and the cellwise contamination model, where the components of a data vector can be contaminated independently, each with a small probability $\epsilon$. For various choices of the correlation estimator, such as the transformed Kendall's $\tau$ and Spearman's $\rho$ estimators, this method is shown in Loh and Tan (2018) to achieve, in the maximum norm $\|\hat\Sigma-\Sigma^*\|_{\max}$, the minimax error rate $\epsilon+\sqrt{\log(p)/n}$ under cellwise contamination and Huber's contamination model. However, because a transformed correlation estimator is used, the variance matrix estimator in Loh and Tan (2018) may not be positive semidefinite (Öllerer and Croux (2015)). Moreover, this approach seems to rely on the availability of individual elements of $\Sigma^*$ as pairwise covariances, and generalization to other multivariate models can be difficult. In our numerical experiments, such pairwise methods have relatively poor performance when contaminated data are not easily separable from the uncontaminated data marginally, especially with nonnegligible $\epsilon$.

For location and scatter estimation under Huber's contamination model, Chen, Gao and Ren (2018) showed that the minimax error rates in the $L_2$ and operator norms, for $\|\hat\mu-\mu^*\|_2$ and $\|\hat\Sigma-\Sigma^*\|_{\mathrm{op}}$, are $\epsilon+\sqrt{p/n}$ and attained by maximizing Tukey's half-space depth (Tukey (1975)) and a matrix depth function, which is also studied in Zhang (2002) and Paindaveine and Van Bever (2018). Both depth functions, defined through minimization of certain discontinuous objective functions, are in general difficult to compute, and maximization of these depth functions is also numerically intractable. Subsequently, Gao et al. (2019) and Gao, Yao and Zhu (2020) exploited a connection between depth-based estimators and generative adversarial nets (GANs) (Goodfellow et al. (2014)), and proposed robust location and scatter estimators in the form of GANs. These estimators are also proved to achieve the minimax error rates in the $L_2$ and operator norms under Huber's contamination model. More recent work in this direction includes Zhu, Jiao and Tse (2020), Wu et al. (2020), and Liu and Loh (2021).

GANs are a popular approach for learning generative models, with numerous impressive applications (Goodfellow et al. (2014)). In the GAN approach, a generator is defined to transform white noise into fake data, and a discriminator is then employed to distinguish between the fake and real data. The generator and discriminator are trained through minimax optimization with a certain objective function. For the GANs used in Gao et al. (2019) and Gao, Yao and Zhu (2020), the generator is defined by the Gaussian model and the discriminator is a multi-layer neural network with sigmoid activations in the top and bottom layers. Hence the discriminator can be seen as logistic regression with the "predictors" defined by the remaining layers of the neural network. The GAN objective function, usually taken to be the log-likelihood function in the classification of fake and real data, is more tractable than discontinuous depth functions, but remains nonconvex in the discriminator parameters and nonconcave in the generator parameters. Training such GANs is challenging through nonconvex-nonconcave minimax optimization (Farnia and Ozdaglar (2020); Jin, Netrapalli and Jordan (2020)).

There is also an interesting connection between GANs and minimum divergence (or distance) (MD) estimation, which has been traditionally studied for robust estimation (Donoho and Liu (1988); Lindsay (1994); Basu and Lindsay (1994)). A prominent example is minimum Hellinger distance estimation (Beran (1977); Tamura and Boos (1986)). In fact, as shown in $f$-GANs (Nowozin, Cseke and Tomioka (2016)), various choices of the objective function in GANs can be derived from variational lower bounds of $f$-divergences between the generator and real data distributions. Familiar examples of $f$-divergences include the Kullback–Leibler (KL) and squared Hellinger divergences, and the total variation (TV) distance (Ali and Silvey (1966); Csiszár (1967)). In particular, using the log-likelihood function in optimizing the discriminator leads to a lower bound of the Jensen–Shannon (JS) divergence for the generator. Furthermore, the lower bound becomes tight if the discriminator class is sufficiently rich (to include the nonparametrically optimal discriminator given any generator). In this sense, $f$-GANs can be said to nearly implement minimum $f$-divergence estimation, where the parameters are estimated by minimizing an $f$-divergence between the model and data distributions. However, this relationship is only approximate and suggestive, because even a class of neural network discriminators may not be nonparametrically rich with population data. A similar issue can also be found in previous studies, where minimum Hellinger distance estimation and related methods require a smoothed density estimate of the sample data. This approach is impractical for multivariate continuous data.

In addition to the MD estimation mentioned above, two other MD methods have also been studied for robust estimation, both in general parametric models and in multivariate Gaussian models. The two methods are defined by minimization of power density divergences (also called $\beta$-divergences) (Basu et al. (1998); Miyamura and Kano (2006)) and of $\gamma$-divergences (Windham (1995); Fujisawa and Eguchi (2008); Hirose, Fujisawa and Sese (2017)). See Jones et al. (2001) for a comparison of these two methods. In contrast with $f$-divergences, these two divergences can be evaluated without requiring smooth density estimation from sample data, and hence the corresponding MD estimators can be computed by standard optimization algorithms. To our knowledge, error bounds have not been formally derived for these methods under Huber's contaminated Gaussian model.

Various methods based on iterative pruning or convex programming have been studied with provable error bounds for robust estimation in Huber's contaminated Gaussian model (Lai, Rao and Vempala (2016); Balmand and Dalalyan (2015); Diakonikolas et al. (2019)). These methods either handle scatter estimation after location estimation sequentially in two stages, or resort to using normalized differences of pairs of observations, which have mean zero, for scatter estimation.

Our work.  We propose and study adversarial algorithms with linear spline discriminators, and establish various error bounds for simultaneous location and scatter estimation under Huber's contaminated Gaussian model. Two distinct types of GANs are exploited. The first one is logit $f$-GANs (Tan, Song and Ou (2019)), which correspond to a specific choice of $f$-GANs with the objective function formulated as a negative loss function for logistic regression (or equivalently a density ratio model between fake and real data) when training the discriminator. The second is hinge GAN (Lim and Ye (2017); Zhao, Mathieu and LeCun (2017)), where the objective function is taken to be the negative hinge loss function when training the discriminator. The hinge objective can be derived from a variational lower bound of the total variation distance (Nguyen, Wainwright and Jordan (2010); Tan, Song and Ou (2019)), but cannot be deduced as a special case of the $f$-GAN objective even though the total variation is also an $f$-divergence. See Remark 3. In addition, we allow two-objective GANs, including the $\log D$ trick in Goodfellow et al. (2014), where two objective functions are used, one for updating the discriminator and the other for updating the generator.

As a major departure from previous studies of GANs, our methods use a simple linear class of spline discriminators, where the basis functions consist of univariate truncated linear functions (or shifted ReLUs) at 5 knots and the pairwise products of such univariate functions. For hinge GAN and certain logit $f$-GANs, including those based on the reverse KL (rKL) and JS divergences, the objective function is concave in the discriminator. By the linearity of the spline class, the objective function is then concave in the spline coefficients. Hence our hinge GAN and logit $f$-GAN methods involve maximization of a concave function when training the spline discriminator for any fixed generator. In contrast with nonconvex-nonconcave minimax optimization for GANs with neural network discriminators (Gao et al. (2019); Gao, Yao and Zhu (2020)), our methods can be implemented through nested optimization in a numerically tractable manner. See Remarks 1, 10 and 11 and Algorithm 1. While the outer minimization for updating the generator remains nonconvex in our methods, such a single nonconvex optimization is usually more tractable than nonconvex-nonconcave minimax optimization.

In spite of the limited capacity of the spline discriminators, we establish various error bounds for our location and scatter estimators, depending on whether the hinge GAN or logit $f$-GAN is used and whether an $L_1$ or $L_2$ penalty is incorporated when training the discriminator. See Table 1 for a summary of existing and our error rates in scatter estimation. Our $L_1$ penalized hinge GAN method achieves the minimax error rate $\epsilon+\sqrt{\log(p)/n}$ in the maximum norm. Our $L_2$ penalized hinge GAN method achieves the error rate $\epsilon\sqrt{p}+\sqrt{p/n}$ in the $p^{-1/2}$-Frobenius norm, whereas the minimax error rate is $\epsilon+\sqrt{p/n}$. While this might indicate the price paid for maintaining the concavity in training the discriminator, our error rate reduces to the same order $\sqrt{p/n}$ as the minimax error rate provided that $\epsilon$ is sufficiently small, $\epsilon=O(\sqrt{1/n})$, such that the contamination error term $\epsilon\sqrt{p}$ is dominated by the sampling variation term $\sqrt{p/n}$ up to a constant factor. To our knowledge, such near-optimal error rates were previously inconceivable for adversarial algorithms with linear discriminators in robust estimation. Moreover, the error rates for our logit $f$-GAN methods exhibit a square-root dependency on the contamination fraction $\epsilon$, instead of the linear dependency for our hinge GAN methods. This shows, for the first time, some theoretical advantage of hinge GAN over logit $f$-GANs, although the comparative performance of these methods may vary in practice, depending on specific settings.

  • OC15, TMW15, LT18: $\|\hat{\Sigma}-\Sigma^{*}\|_{\max}=(\epsilon+\sqrt{\log(p)/n})\,O_p(1)$. Computation: non-iterative.

  • CGR18: $\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{op}}=(\epsilon+\sqrt{p/n})\,O_p(1)$. Computation: minimax optimization with zero-one discriminators.

  • DKKLMS19: $\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{op}}=O_p(\epsilon)$, provided $\epsilon\geq p/\sqrt{n}$ up to log factors. Computation: convex optimization.

  • GYZ20: $\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{op}}=(\epsilon+\sqrt{p/n})\,O_p(1)$. Computation: minimax optimization with neural network discriminators.

  • $L_1$ logit $f$-GAN (Theorem 2): $\|\hat{\Sigma}-\Sigma^{*}\|_{\max}=(\sqrt{\epsilon}+\sqrt{\log(p)/n})\,O_p(1)$.

  • $L_1$ hinge GAN (Theorem 4): $\|\hat{\Sigma}-\Sigma^{*}\|_{\max}=(\epsilon+\sqrt{\log(p)/n})\,O_p(1)$.

  • $L_2$ logit $f$-GAN (Theorem 3): $p^{-1/2}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}}=(\sqrt{\epsilon}+\sqrt{p/n})\,O_p(1)$, provided $\epsilon\leq 1/p$ up to a constant factor.

  • $L_2$ hinge GAN (Theorem 5): $p^{-1/2}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}}=(\epsilon\sqrt{p}+\sqrt{p/n})\,O_p(1)$.

Computation for the four proposed methods: nested optimization with an objective function concave in the linear spline discriminators.

Note: OC15, TMW15, LT18 refer to methods and theory in Öllerer and Croux (2015), Tarr, Müller and Weber (2016), and Loh and Tan (2018); CGR18, DKKLMS19, and GYZ20 refer to, respectively, Chen, Gao and Ren (2018), Diakonikolas et al. (2019), and Gao, Yao and Zhu (2020).

Table 1: Comparison of existing and proposed methods.

To facilitate and complement our sample analysis, we provide error bounds for the population version of hinge GAN or logit $f$-GANs with nonparametric discriminators, that is, minimization of the exact total variation or $f$-divergence at the population level. From Theorem 1, population minimum TV or $f$-divergence estimation under a simple set of conditions on $f$ (Assumption 1) leads to errors of order $O(\epsilon)$ or $O(\sqrt{\epsilon})$, respectively, under Huber's contamination model. Assumption 1 allows the reverse KL, JS, reverse $\chi^2$, and squared Hellinger divergences, but excludes the mixed KL divergence, the $\chi^2$ divergence, and, as reassurance, the KL divergence, which corresponds to maximum likelihood estimation and is known to be non-robust. Hence certain (but not all) minimum $f$-divergence estimation achieves robustness under Huber's contamination model or an $\epsilon$ TV-contaminated neighborhood. Such robustness is identified for the first time for minimum $f$-divergence estimation, and is related to, but distinct from, robustness of minimum distance estimation under an $\epsilon$-contaminated neighborhood with respect to the same distance (Donoho and Liu (1988)). See Remark 7 for further discussion. The population error bounds in the $L_2$ and $p^{-1/2}$-Frobenius norms are independent of $p$ and hence tighter than the corresponding $\epsilon$ terms in our sample error bounds for both hinge GAN and logit $f$-GAN. These gaps can be attributed to the use of nonparametric versus spline discriminators.

Remarkably, our population analysis also sheds light on the comparison of our sample results and those in Gao, Yao and Zhu (2020). On one hand, another set of conditions (Assumption 2), in addition to Assumption 1, is required in our sample analysis of logit $f$-GANs with spline discriminators. On the other hand, the GANs used in Gao, Yao and Zhu (2020) can be recast as logit $f$-GANs with neural network discriminators (see Section 5.2). But minimax error rates are shown to be achieved in Gao, Yao and Zhu (2020) for an $f$-divergence (for example, the mixed KL divergence) which, let alone Assumption 2, does not even satisfy Assumption 1, used in our analysis to show robustness of minimum $f$-divergence estimation. The main reason for this discrepancy is that the neural network discriminator in Gao, Yao and Zhu (2020) is directly constrained to be of order $\epsilon+\sqrt{p/n}$ in the log odds, which considerably simplifies the proofs of rate-optimal robust estimation. In contrast, our methods use linear spline discriminators (with penalties independent of $\epsilon$), and our proofs of robust estimation need to carefully tackle various technical difficulties due to the simple design of our methods. See Figure 2(b) for an illustration of non-robustness under minimization of the mixed KL divergence, and Section 5.2 for further discussion of this subtle issue in Gao, Yao and Zhu (2020).

Notation.  For a vector $a=(a_1,\dots,a_p)^{\mathrm{T}}\in\mathbb{R}^p$, we denote by $\|a\|_1=\sum_{i=1}^p|a_i|$, $\|a\|_\infty=\max_{1\leq i\leq p}|a_i|$, and $\|a\|_2=(\sum_{i=1}^p a_i^2)^{1/2}$ the $L_1$ norm, $L_\infty$ norm, and $L_2$ norm of $a$, respectively. For a matrix $A=(a_{ij})\in\mathbb{R}^{m\times n}$, we define the element-wise maximum norm $\|A\|_{\max}=\max_{1\leq i\leq m,1\leq j\leq n}|a_{ij}|$, the Frobenius norm $\|A\|_{\mathrm{F}}=(\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2)^{1/2}$, the vectorized $L_1$ norm $\|A\|_{1,1}=\sum_{i=1}^m\sum_{j=1}^n|a_{ij}|$, the operator norm $\|A\|_{\mathrm{op}}=\sup_{\|x\|_2\leq 1}\|Ax\|_2$, and the $L_\infty$-induced operator norm $\|A\|_\infty=\sup_{\|x\|_\infty\leq 1}\|Ax\|_\infty$. For a square matrix $A$, we write $A\succeq 0$ to indicate that $A$ is positive semidefinite. The tensor product of vectors $a$ and $b$ is denoted by $a\otimes b$, and the vectorization of a matrix $A$ is denoted by $\mathrm{vec}(A)$. The cumulative distribution function of the standard normal distribution is denoted by $\Phi(x)$, and the Gaussian error function is denoted by $\mathrm{erf}(x)$.

2 Numerical illustration

We illustrate the performance of our rKL logit $f$-GAN and six existing methods, with two samples of size 2000 from a 10-dimensional Huber's contaminated Gaussian distribution with $\epsilon=5\%$ and $10\%$, based on a Toeplitz covariance matrix and the first contamination $Q$ described in Section 6.2. Figure 1 shows the 95% Gaussian ellipsoids for the first two coordinates, using the estimated location vectors and variance matrices, except for Tyler's M-estimator (Tyler (1987)), Kendall's $\tau$ with MAD (Loh and Tan (2018)), and Spearman's $\rho$ with the $Q_n$-estimator (Öllerer and Croux (2015)), where the locations are set to the true means. The performance of our JS logit $f$-GAN and hinge GAN is close to that of the rKL logit $f$-GAN. See the Supplement for an illustration based on the second contamination in Section 6.2.

Among the methods shown in Figure 1, the rKL logit $f$-GAN gives an ellipsoid closest to the truth, followed with relatively small but noticeable differences by JS-GAN (Gao, Yao and Zhu (2020)), MCD (Rousseeuw (1985)), and Mest (Rocke (1996)), which are briefly described in Section 6.1. The remaining three methods, Kendall's $\tau$ with MAD, Spearman's $\rho$ with the $Q_n$-estimator, and Tyler's M-estimator, show much less satisfactory performance. The estimated distributions from these methods are dragged towards the corner contamination cluster.

The relatively poor performance of the pairwise methods, Kendall's $\tau$ with MAD and Spearman's $\rho$ with the $Q_n$-estimator, may be explained by the fact that, as shown by the marginal histograms in Figure 1, the data in each coordinate are one-sided heavy-tailed, but no obvious outliers can be seen marginally. The correlation estimates from Kendall's $\tau$ and Spearman's $\rho$ tend to be inaccurate even after sine transformations, especially with nonnegligible $\epsilon=10\%$. In contrast, our GAN methods as well as JS-GAN, MCD, and Mest are capable of capturing higher-dimensional information, so that the impact of contamination is limited to various extents.

Figure 1: The 95% Gaussian ellipsoids estimated for the first two coordinates and observed marginal histograms, from contaminated data based on the first contamination in Section 6.2 with $\epsilon=5\%$ (left) or $10\%$ (right).

3 Adversarial algorithms

We review various adversarial algorithms (or GANs), which are exploited by our methods for robust location and scatter estimation. To focus on the main ideas, the algorithms are stated in their population versions, where the underlying data distribution $P_*$ is involved instead of the empirical distribution $P_n$. Let $\{P_\theta:\theta\in\Theta\}$ be a statistical model and $\{h_\gamma:\gamma\in\Gamma\}$ be a function class, where $P_\theta$ is called a generator and $h_\gamma$ a discriminator. In our study, $P_\theta$ is a multivariate Gaussian distribution $\mathrm{N}(\mu,\Sigma)$, and $h_\gamma$ is a pairwise spline function specified later in Section 4.2.

For a convex function $f:(0,\infty)\to\mathbb{R}$, the $f$-divergence between the distributions $P_*$ and $P_\theta$ with density functions $p_*$ and $p_\theta$ is

$$D_f(P_*\|P_\theta)=\int f\left(\frac{p_*(x)}{p_\theta(x)}\right)\,\mathrm{d}P_\theta.$$

For example, taking $f(t)=t\log t$ yields the Kullback–Leibler (KL) divergence $D_{\mathrm{KL}}(P_*\|P_\theta)$. The logit $f$-GAN (Tan, Song and Ou (2019)) is defined by solving the minimax program

$$\min_{\theta\in\Theta}\max_{\gamma\in\Gamma}\;K_f(P_*,P_\theta;h_\gamma), \quad (1)$$

where

$$K_f(P_*,P_\theta;h)=\mathrm{E}_{P_*}f'(\mathrm{e}^{h(x)})-\mathrm{E}_{P_\theta}f^{\#}(\mathrm{e}^{h(x)})=\mathrm{E}_{P_*}f'(\mathrm{e}^{h(x)})-\mathrm{E}_{P_\theta}\left\{\mathrm{e}^{h(x)}f'(\mathrm{e}^{h(x)})-f(\mathrm{e}^{h(x)})\right\}.$$

Throughout, $f^{\#}(t)=tf'(t)-f(t)$ and $f'$ denotes the derivative of $f$. A motivation for this method is that the objective $K_f$ is a nonparametrically tight lower bound of the $f$-divergence (Tan, Song and Ou (2019), Proposition S1): for each $\theta$, it holds that for any function $h$,

$$K_f(P_*,P_\theta;h)\leq D_f(P_*\|P_\theta), \quad (2)$$

where the equality is attained at $h_{*\theta}(x)=\log\{p_*(x)/p_\theta(x)\}$, the log density ratio between $P_*$ and $P_\theta$, or equivalently the log odds for classifying whether a data point $x$ is from $P_*$ or $P_\theta$. There are two choices of $f$ of particular interest. Taking $f(t)=t\log t-(t+1)\log(t+1)+\log 4$ leads to the Jensen–Shannon (JS) divergence, $D_{\mathrm{JS}}(P_*\|P_\theta)=D_{\mathrm{KL}}(P_*\|(P_*+P_\theta)/2)+D_{\mathrm{KL}}(P_\theta\|(P_*+P_\theta)/2)$, and the objective function

$$K_{\mathrm{JS}}(P_*,P_\theta;h)=-\mathrm{E}_{P_*}\log(1+\mathrm{e}^{-h(x)})-\mathrm{E}_{P_\theta}\log(1+\mathrm{e}^{h(x)})+\log 4,$$

which is, up to a constant, the expected log-likelihood for logistic regression with log odds function $h(x)$. For $K_f=K_{\mathrm{JS}}$, program (1) corresponds to the original GAN (Goodfellow et al. (2014)) with discrimination probability $\mathrm{sigmoid}(h(x))$. Taking $f(t)=-\log t$ leads to the reverse KL divergence $D_{\mathrm{rKL}}(P_*\|P_\theta)=D_{\mathrm{KL}}(P_\theta\|P_*)$ and the objective function

$$K_{\mathrm{rKL}}(P_*,P_\theta;h)=1-\mathrm{E}_{P_*}\mathrm{e}^{-h(x)}-\mathrm{E}_{P_\theta}h(x),$$

which is the negative calibration loss for logistic regression in Tan (2020).
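As a simple illustration, the sample versions of $K_{\mathrm{JS}}$ and $K_{\mathrm{rKL}}$ can be computed by Monte Carlo averages. The sketch below (ours, with hypothetical names) assumes samples `x_real` from $P_*$, samples `x_fake` from $P_\theta$, and a vectorized discriminator `h` returning log odds.

```python
import numpy as np

def K_JS(h, x_real, x_fake):
    # Expected log-likelihood for logistic regression, plus the constant log 4.
    return (-np.mean(np.log1p(np.exp(-h(x_real))))
            - np.mean(np.log1p(np.exp(h(x_fake)))) + np.log(4.0))

def K_rKL(h, x_real, x_fake):
    # Negative calibration loss; h approximates the log density ratio log(p_*/p_theta).
    return 1.0 - np.mean(np.exp(-h(x_real))) - np.mean(h(x_fake))
```

Both objectives are concave in $h$, and hence concave in the coefficients of a linear discriminator class, which is the property exploited by our methods later.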

The objective $K_f$ with fixed $\theta$ can be seen as a proper scoring rule reparameterized in terms of the log odds function $h(x)$ for binary classification (Tan and Zhang (2020)). Replacing $K_f$ in (1) by the negative hinge loss (which is not a proper scoring rule) leads to

$$\min_{\theta\in\Theta}\max_{\gamma\in\Gamma}\;K_{\mathrm{HG}}(P_*,P_\theta;h_\gamma), \quad (3)$$

where

$$K_{\mathrm{HG}}(P_*,P_\theta;h)=-\left\{\mathrm{E}_{P_*}\max(0,1-h(x))+\mathrm{E}_{P_\theta}\max(0,1+h(x))\right\}+2=\mathrm{E}_{P_*}\min(1,h(x))+\mathrm{E}_{P_\theta}\min(1,-h(x)).$$

This method is related to the geometric GAN described later in (13) and will be called hinge GAN. By Nguyen, Wainwright and Jordan (2009) or Proposition 5 in Tan, Song and Ou (2019), the objective $K_{\mathrm{HG}}$ is a nonparametrically tight lower bound of the total variation distance scaled by 2: for each $\theta$, it holds that for any function $h(x)$,

$$K_{\mathrm{HG}}(P_*,P_\theta;h)\leq 2D_{\mathrm{TV}}(P_*\|P_\theta), \quad (4)$$

where the equality is attained at $h_{*\theta}(x)=\mathrm{sign}(p_*(x)-p_\theta(x))$, and $D_{\mathrm{TV}}(P_*\|P_\theta)=\int|p_*(x)-p_\theta(x)|/2\,\mathrm{d}x$. The objectives $K_f$ and $K_{\mathrm{HG}}$, with fixed $\theta$, represent two types of loss functions for binary classification. See Buja, Stuetzle and Shen (2005) and Nguyen, Wainwright and Jordan (2009) for further discussion of loss functions and scoring rules.
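In the same Monte Carlo style as above, the hinge objective has the following sketch (ours); it is again concave in $h$, and its population value lower-bounds $2D_{\mathrm{TV}}(P_*\|P_\theta)$ by (4).

```python
import numpy as np

def K_HG(h, x_real, x_fake):
    # Negative hinge loss shifted by 2, written in the min form of K_HG above.
    return np.mean(np.minimum(1.0, h(x_real))) + np.mean(np.minimum(1.0, -h(x_fake)))
```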

The preceding programs, (1) and (3), are defined as minimax optimization, each with a single objective function. There are also adversarial algorithms formulated as alternating optimization with two objective functions (see Remark 1). For example, GAN with the $\log D$ trick in Goodfellow et al. (2014) is defined by solving

$$\left\{\begin{array}{ll}\max\limits_{\gamma\in\Gamma}\;K_{\mathrm{JS}}(\theta,\gamma)&\text{with $\theta$ fixed},\\ \min\limits_{\theta\in\Theta}\;\mathrm{E}_{P_\theta}\log(1+\mathrm{e}^{-h_\gamma(x)})&\text{with $\gamma$ fixed}.\end{array}\right. \quad (7)$$

The second objective is introduced mainly to overcome vanishing gradients in $\theta$ when the discriminator is confident. The calibrated rKL-GAN (Huszár (2016); Tan, Song and Ou (2019)) is defined by solving

$$\left\{\begin{array}{ll}\max\limits_{\gamma\in\Gamma}\;K_{\mathrm{JS}}(\theta,\gamma)&\text{with $\theta$ fixed},\\ \min\limits_{\theta\in\Theta}\;-\mathrm{E}_{P_\theta}h_\gamma(x)&\text{with $\gamma$ fixed}.\end{array}\right. \quad (10)$$

The two objectives are chosen to stabilize gradients in both $\theta$ and $\gamma$ during training. The geometric GAN in Lim and Ye (2017) or, equivalently, the energy-based GAN in Zhao, Mathieu and LeCun (2017), as shown in Tan, Song and Ou (2019), is defined by solving

$$\left\{\begin{array}{ll}\max\limits_{\gamma\in\Gamma}\;K_{\mathrm{HG}}(\theta,\gamma)&\text{with $\theta$ fixed},\\ \min\limits_{\theta\in\Theta}\;-\mathrm{E}_{P_\theta}h_\gamma(x)&\text{with $\gamma$ fixed}.\end{array}\right. \quad (13)$$

Interestingly, the second line in (10) or (13) involves the same objective $-\mathrm{E}_{P_\theta}h_\gamma(x)$, which can be equivalently replaced by $K_{\mathrm{rKL}}(P_*,P_\theta;h_\gamma)$ because $\gamma$, and hence $h_\gamma$, is fixed.

Remark 1.

We discuss precise definitions for a solution to a minimax problem such as (1) or (3), and a solution to an alternating optimization problem such as (7)–(13). For an objective function $K(\theta,\gamma)$, we say that $(\hat\theta,\hat\gamma)$ is a solution to

$$\min_\theta\max_\gamma\;K(\theta,\gamma), \quad (14)$$

if $K(\hat\theta,\hat\gamma)=\max_\gamma K(\hat\theta,\gamma)\leq\max_\gamma K(\theta,\gamma)$ for any $\theta$. In other words, we treat (14) as nested optimization: $\hat\theta$ is a minimizer of $K(\theta,\hat\gamma_\theta)$ as a function of $\theta$ and $\hat\gamma=\hat\gamma_{\hat\theta}$, where $\hat\gamma_\theta$ is a maximizer of $K(\theta,\gamma)$ for fixed $\theta$. This choice is directly exploited in both the numerical implementation and the theoretical analysis of our methods later. For two objective functions $K_1(\theta,\gamma)$ and $K_2(\theta,\gamma)$, we say that $(\hat\theta,\hat\gamma)$ is a solution to the alternating optimization problem

$$\left\{\begin{array}{ll}\max\limits_\gamma\;K_1(\theta,\gamma)&\text{with $\theta$ fixed},\\ \min\limits_\theta\;K_2(\theta,\gamma)&\text{with $\gamma$ fixed},\end{array}\right. \quad (17)$$

if $K_1(\hat\theta,\hat\gamma)=\max_\gamma K_1(\hat\theta,\gamma)$ and $K_2(\hat\theta,\hat\gamma)=\min_\theta K_2(\theta,\hat\gamma)$. In the special case where $K_1(\theta,\gamma)=K_2(\theta,\gamma)$, denoted as $K(\theta,\gamma)$, a solution $(\hat\theta,\hat\gamma)$ to (17) is also called a Nash equilibrium of $K(\theta,\gamma)$, satisfying $K(\hat\theta,\hat\gamma)=\max_\gamma K(\hat\theta,\gamma)=\min_\theta K(\theta,\hat\gamma)$. A solution to the minimax problem (14) and a Nash equilibrium of $K(\theta,\gamma)$ may in general differ from each other, although they coincide by Sion's minimax theorem in the special case where $K(\theta,\gamma)$ is convex in $\theta$ for each $\gamma$ and concave in $\gamma$ for each $\theta$. Hence our treatment of GANs in the form (14) as nested optimization should be distinguished from existing studies where GANs are interpreted as finding Nash equilibria (Farnia and Ozdaglar (2020); Jin, Netrapalli and Jordan (2020)).
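For intuition, the nested-optimization reading of (14) can be sketched as follows (our schematic, with hypothetical helpers `inner_max` and `outer_step`): the inner concave problem over $\gamma$ is solved fully at each outer iteration, and the generator $\theta$ is then updated by a single descent step.

```python
def nested_solve(K, theta0, inner_max, outer_step, n_outer=100):
    """Schematic nested optimization for min_theta max_gamma K(theta, gamma)."""
    theta = theta0
    for _ in range(n_outer):
        # Fully maximize the (concave) inner objective at the current generator.
        gamma = inner_max(lambda g: K(theta, g))
        # One descent step on theta |-> K(theta, gamma), with gamma held fixed.
        theta = outer_step(theta, gamma)
    return theta, gamma
```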

Remark 2.

The population $f$-GAN (Nowozin, Cseke and Tomioka (2016)) is defined by solving

$$\min_{\theta\in\Theta}\max_{\gamma\in\Gamma}\;\left\{\mathrm{E}_{P_*}T_\gamma(x)-\mathrm{E}_{P_\theta}f^*(T_\gamma(x))\right\}, \quad (18)$$

where $f^*$ is the Fenchel conjugate of $f$, i.e., $f^*(s)=\sup_{t\in(0,\infty)}(st-f(t))$, and $T_\gamma$ is a function taking values in the domain of $f^*$. Typically, $T_\gamma$ is represented as $T_\gamma(x)=\tau_f(h_\gamma(x))$, where $\tau_f:\mathbb{R}\to\mathrm{dom}(f^*)$ is an activation function and $h_\gamma(x)$ takes values unrestricted in $\mathbb{R}$. The logit $f$-GAN corresponds to $f$-GAN with the specific choice $\tau_f(u)=f'(\mathrm{e}^u)$, by the relationship $f^*(f'(t))=f^{\#}(t)$ (Tan, Song and Ou (2019)). Nevertheless, a benefit of logit $f$-GAN is that the objective $K_f$ in (1) takes the explicit form of a negative discrimination loss, such that $h_\gamma(x)$ can be seen to approximate the log density ratio between $P_*$ and $P_\theta$.

Remark 3.

There is an important difference between hinge GAN and logit $f$-GAN, although the total variation is also an $f$-divergence with $f(t)=|t-1|/2$. In fact, taking this choice of $f$ in logit $f$-GAN (1) yields

$$\min_{\theta\in\Theta}\max_{\gamma\in\Gamma}\;\left\{\mathrm{E}_{P_*}\mathrm{sign}(h_\gamma(x))-\mathrm{E}_{P_\theta}\mathrm{sign}(h_\gamma(x))\right\}. \quad (19)$$

This is called TV learning and is related to depth-based estimation in Gao et al. (2019). Compared with hinge GAN in (3), program (19) is computationally more difficult to solve. Such a difference also exists in the application of the general $f$-GAN to the total variation. For the total variation distance scaled by 2, with $f(t)=|t-1|$, the conjugate is $f^*(s)=\max(-1,s)$ if $s\leq 1$, or $\infty$ if $s>1$. If $T_\gamma$ is specified as $T_\gamma=\min(1,h_\gamma(x))$, then the objective in $f$-GAN (18) can be shown to be

$$\mathrm{E}_{P_*}\min(1,h_\gamma(x))+\mathrm{E}_{P_\theta}\min(1,\max(-1,-h_\gamma(x))),$$

which in general differs from the negative hinge loss in (3) unless $h_\gamma$ is upper bounded by 1. If $h_\gamma$ is specified as $2\,\mathrm{sigmoid}(\tilde h_\gamma)-1$ for a function $\tilde h_\gamma$ taking values unrestricted in $\mathbb{R}$, the resulting $f$-GAN is equivalent to TV-GAN in Gao et al. (2019), defined by solving

$$\min_{\theta\in\Theta}\max_{\gamma\in\Gamma}\;\left\{\mathrm{E}_{P_*}\mathrm{sigmoid}(\tilde h_\gamma(x))-\mathrm{E}_{P_\theta}\mathrm{sigmoid}(\tilde h_\gamma(x))\right\}. \quad (20)$$

However, solving program (20) is numerically intractable as discussed in Gao et al. (2019).

4 Theory and methods

We propose and study various adversarial algorithms with simple spline discriminators for robust estimation in a multivariate Gaussian model. Assume that $X_1,\ldots,X_n$ are independent observations obtained from Huber's $\epsilon$-contamination model, that is, the data distribution $P_*$ is of the form

$$P_\epsilon=(1-\epsilon)P_{\theta^*}+\epsilon Q, \quad (21)$$

where $P_{\theta^*}$ is $\mathrm{N}(\mu^*,\Sigma^*)$ with unknown $\theta^*=(\mu^*,\Sigma^*)$, $Q$ is a probability distribution for contaminated data, and $\epsilon$ is a contamination fraction. Both $Q$ and $\epsilon$ are unknown. The dependency of $P_\epsilon$ on $(\theta^*,Q)$ is suppressed in the notation. Equivalently, the data $(X_1,\ldots,X_n)$ can be represented in a latent model: $(U_1,X_1),\ldots,(U_n,X_n)$ are independent, where $U_i$ is Bernoulli with $\mathrm{P}(U_i=1)=\epsilon$, and $X_i$ is drawn from $P_{\theta^*}$ or $Q$ given $U_i=0$ or $1$, for $i=1,\ldots,n$.

For theoretical analysis, we consider two choices of the parameter space. The first choice is $\Theta_1=\{(\mu,\Sigma):\mu\in\mathbb{R}^p,\Sigma\succeq 0,\|\Sigma\|_{\max}\leq M_1\}$ for a constant $M_1>0$. Equivalently, the diagonal elements of $\Sigma$ are upper bounded by $M_1$ for $(\mu,\Sigma)\in\Theta_1$. The second choice is $\Theta_2=\{(\mu,\Sigma):\mu\in\mathbb{R}^p,\Sigma\succeq 0,\|\Sigma\|_{\mathrm{op}}\leq M_2\}$ for a constant $M_2>0$. For simplicity, the dependency of $\Theta_1$ on $M_1$ or $\Theta_2$ on $M_2$ is suppressed in the notation. For the second parameter space $\Theta_2$, the minimax rates in the $L_2$ and operator norms have been shown to be achieved using matrix depth (Chen, Gao and Ren (2018)) and GANs with certain neural network discriminators (Gao, Yao and Zhu (2020)).

Our work aims to investigate adversarial algorithms with a simple linear class of spline discriminators for computational tractability, and to establish various error bounds for the proposed estimators, including those matching the minimax rates in the maximum norms for location and scatter estimation over $\Theta_1$, and, provided that $\epsilon\sqrt{n}$ is bounded by a constant (independent of $p$), the minimax rates in the $L_2$ and Frobenius norms over $\Theta_2$.

4.1 Population analysis with nonparametric discriminators

A distinctive feature of GANs is that they can be motivated as approximations to minimum divergence estimation. For example, if the discriminator class $\{h_\gamma\}$ in (1) is rich enough to include the nonparametrically optimal discriminator such that $\max_{\gamma\in\Gamma}K_f(P_*,P_\theta;h_\gamma)=D_f(P_*\|P_\theta)$ for each $\theta$, then the (population) logit $f$-GAN amounts to minimizing the $f$-divergence $D_f(P_*\|P_\theta)$. Similarly, if the discriminator class $\{h_\gamma\}$ in (3) is sufficiently rich, then the (population) hinge GAN amounts to minimizing the total variation $D_{\mathrm{TV}}(P_*\|P_\theta)$.

As a prelude to our sample analysis, Theorem 1 shows that at the population level, minimization of the total variation and of certain $f$-divergences satisfying Assumption 1 achieves robustness under Huber's contamination model, in the sense that the estimation errors are respectively $O(\epsilon)$ and $O(\sqrt{\epsilon})$, uniformly over all possible $Q$. Hence with sufficiently rich (or nonparametric) discriminators, the population versions of the hinge GAN and certain $f$-GANs can be said to be robust under Huber's contamination. From Table 2, Assumption 1 is satisfied by the reverse KL, JS, and squared Hellinger divergences, but violated by the KL divergence. Minimization of the KL divergence corresponds to maximum likelihood estimation, which is known to be non-robust under Huber's contamination model.

Assumption 1.

Suppose that $f:(0,\infty)\to\mathbb{R}$ is convex with $f(1)=0$ and satisfies the following conditions.

  • (i) $f$ is twice differentiable with $f''(1)>0$.

  • (ii) $f$ is non-increasing.

  • (iii) $f'$ is concave (i.e., $f''$ is non-increasing).

See Table 2 for the validity of conditions (ii) and (iii) for various $f$-divergences.

Remark 4.

Given a convex function $f$ with $f(1)=0$, the same $f$-divergence $D_f$ can be defined using the convex function $f(t)+c(t-1)$ for any constant $c\in\mathbb{R}$. Hence condition (ii) in Assumption 1 can be relaxed such that $f'$ is upper bounded by a constant. The non-increasingness of $f$ is stated above for ease of interpretation. The other conditions in Assumptions 1 and 2 are not affected by non-unique choices of $f$.

Theorem 1.

Let $\Theta_0=\{(\mu,\Sigma):\mu\in\mathbb{R}^p,\ \Sigma\text{ is a }p\times p\text{ variance matrix}\}$.

(i) Assume that $f$ satisfies Assumption 1. Let $\bar\theta=\mathrm{argmin}_{\theta\in\Theta_0}D_f(P_\epsilon\|P_\theta)$. If $\sqrt{-2(f''(1))^{-1}f'(1/2)\epsilon}+\epsilon<1/2$, then for any contamination distribution $Q$,

$$\|\bar\mu-\mu^*\|_2\leq C\|\Sigma^*\|_{\mathrm{op}}^{1/2}\sqrt{\epsilon},\quad \|\bar\mu-\mu^*\|_\infty\leq C\|\Sigma^*\|_{\max}^{1/2}\sqrt{\epsilon}, \quad (22)$$

and

$$\|\bar\Sigma-\Sigma^*\|_{\mathrm{op}}\leq C\|\Sigma^*\|_{\mathrm{op}}\sqrt{\epsilon},\quad \|\bar\Sigma-\Sigma^*\|_{\max}\leq C\|\Sigma^*\|_{\max}\sqrt{\epsilon}, \quad (23)$$

where $C>0$ is a constant depending only on $f$. The same inequality as in (23) also holds with $\|\bar\Sigma-\Sigma^*\|_{\mathrm{op}}$ replaced by $p^{-1/2}\|\bar\Sigma-\Sigma^*\|_{\mathrm{F}}$.

(ii) Let $\bar\theta=\mathrm{argmin}_{\theta\in\Theta_0}D_{\mathrm{TV}}(P_\epsilon\|P_\theta)$. If $\epsilon<1/4$, then (22) and (23) hold for an absolute constant $C>0$ with $\sqrt{\epsilon}$ replaced by $\epsilon$ throughout.

Figure 2: Illustration of robustness of minimum $f$-divergence location estimation. Panel (a): location error $|\bar\mu-\mu^*|$ against contamination fraction $\epsilon$ from 0 to 0.015, with $P_{\theta^*}$ being $\mathrm{N}(0,1)$ and contamination $Q$ being $\mathrm{N}(5,1/4)$ fixed. Panel (b): location error $|\bar\mu-\mu^*|$ against contamination location $\mu_Q$ from 0 to 10, with $P_{\theta^*}$ being $\mathrm{N}(0,1)$, contamination fraction $\epsilon=0.1$ fixed, and contamination $Q$ being $\mathrm{N}(\mu_Q,1/4)$. The squared Hellinger, reverse $\chi^2$, and mixed KL divergences are denoted by H2, rChi2, and mKL, respectively.

Figure 2 provides a simple numerical illustration. From Figure 2(a), the location errors $|\bar\mu-\mu^*|$ of minimum divergence estimators corresponding to the four robust $f$-divergences (reverse KL, JS, squared Hellinger, and reverse $\chi^2$) satisfying Assumption 1 have shapes in agreement with the order $\sqrt{\epsilon}$ in Theorem 1, whereas those corresponding to TV appear to be linear in $\epsilon$, for $\epsilon$ close to 0. For the KL, mixed KL, and $\chi^2$ divergences, which do not satisfy Assumption 1(ii), the corresponding errors quickly increase out of the plotting range, indicating non-robustness of the associated minimum divergence estimation. The differences between robust and non-robust $f$-divergences are further demonstrated in Figure 2(b). As the contamination location moves farther away, the errors of the robust $f$-divergences increase initially but then decrease to near 0, whereas those of the non-robust $f$-divergences appear to increase unboundedly.

Table 2: Common $f$-divergences and validity of Assumptions 1(ii)–(iii) and 2(i)–(ii): non-increasing $f(t)$ (1(ii)), concave $f'(t)$ (1(iii)), concave $f'(\mathrm{e}^u)$ (2(i)), and Lipschitz $f^{\#}(\mathrm{e}^u)$ (2(ii)).

  • Total variation: $f(t)=(1-t)_+$ or $|t-1|/2$; non-increasing, but $f$ is not differentiable at $t=1$, so the differentiability-based conditions do not apply (the TV case is handled separately via hinge GAN; see Remark 3).

  • Reverse KL: $f(t)=-\log t$; satisfies all four conditions.

  • Jensen–Shannon: $f(t)=t\log t-(t+1)\log(t+1)+\log 4$; satisfies all four conditions.

  • Squared Hellinger: $f(t)=(\sqrt{t}-1)^2$; satisfies 1(ii) (via the reparameterization in Remark 4), 1(iii), and 2(i), but not 2(ii).

  • Reverse $\chi^2$: $f(t)=t^{-1}-1$; satisfies 1(ii), 1(iii), and 2(i), but not 2(ii).

  • KL: $f(t)=t\log t$; violates 1(ii) and 2(ii).

  • Mixed KL: $f(t)=\{(t-1)\log t\}/2$; violates 1(ii) and 2(ii).

  • $\chi^2$: $f(t)=(t-1)^2$; violates 1(ii), 2(i), and 2(ii).

Note: The mixed KL divergence is defined as $D_{\mathrm{mKL}}(P\|Q)=D_{\mathrm{KL}}(P\|Q)/2+D_{\mathrm{KL}}(Q\|P)/2$.
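For instance (our computation, using the identity $f^{\#}(t)=tf'(t)-f(t)$ stated after (1)), the Lipschitz entries for the reverse KL, JS, and KL divergences can be verified directly:

$$f^{\#}_{\mathrm{rKL}}(\mathrm{e}^u)=u-1,\qquad f^{\#}_{\mathrm{JS}}(\mathrm{e}^u)=\log(1+\mathrm{e}^u)-\log 4,\qquad f^{\#}_{\mathrm{KL}}(\mathrm{e}^u)=\mathrm{e}^u;$$

the first two are 1-Lipschitz in $u$ (the derivative of the JS expression is $\mathrm{sigmoid}(u)\in(0,1)$), whereas the last is not Lipschitz.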

Remark 5.

From the proof in Section 7.1, Theorem 1(i) remains valid if $f''(1)$ is replaced by $C_f=\inf_{t\in(0,1]}f''(t)$ in Assumption 1(i) and in the definition of $\mathrm{Err}_{f0}(\epsilon)$, and Assumption 1(iii), the concavity of $f'$, is removed. On the other hand, a stronger condition than Assumption 1(iii) is used in our sample analysis: for convex $f$, the concavity of $f'$ is implied by Assumption 2(i), as discussed in Remark 9.

Remark 6.

The population bounds in Theorem 1 are more refined than those in our sample analysis later. The population minimizer $\bar\theta=(\bar\mu,\bar\Sigma)$ is defined by minimization over the unrestricted space $\Theta_0$, instead of $\Theta_1$ or $\Theta_2$ with the restriction $\|\Sigma\|_{\max}\leq M_1$ or $\|\Sigma\|_{\mathrm{op}}\leq M_2$. The scaling factors in the population bounds also depend directly on the maximum or operator norm of the true variance matrix $\Sigma^*$, instead of the pre-specified constants $M_1$ or $M_2$. Note that the parameter space is also restricted such that $\|\Sigma\|_{\mathrm{op}}\leq M_2$, and the error bounds depend on $M_2$, in the sample analysis of Gao, Yao and Zhu (2020). Nevertheless, the population bounds share a similar feature with our sample bounds later: the error bounds in the maximum norms are governed by $\|\Sigma^*\|_{\max}$, which can be much smaller than $\|\Sigma^*\|_{\mathrm{op}}$ involved in the error bounds in the operator norm.

Remark 7.

It is interesting to connect and compare our results with Donoho and Liu (1988), where minimum distance (MD) estimation is studied, that is, minimization of a proper distance $D(P,P_\theta)$ satisfying the triangle inequality. For minimum TV estimation, let $\bar\theta_P=(\bar\mu_P,\bar\Sigma_P)=\mathrm{argmin}_\theta D_{\mathrm{TV}}(P\|P_\theta)$. For location estimation, define

$$b(\epsilon)=\sup_{P:D_{\mathrm{TV}}(P\|P_{\theta^*})\leq\epsilon}\|\bar\mu_P-\mu^*\|_2,\quad b_0(\epsilon)=\sup_{\theta:D_{\mathrm{TV}}(P_\theta\|P_{\theta^*})\leq\epsilon}\|\mu-\mu^*\|_2,$$

which are called the bias distortion curve and the gauge function, respectively. Scatter estimation can be discussed in a similar manner. For a general family $\{P_\theta\}$, the first half of our proof of Theorem 1(ii) shows that for any $P$ satisfying $D_{\mathrm{TV}}(P\|P_{\theta^*})\leq\epsilon$, we have $D_{\mathrm{TV}}(P_{\bar\theta_P}\|P_{\theta^*})\leq 2\epsilon$. This implies a bound similar to Proposition 5.1 in Donoho and Liu (1988):

$$b(\epsilon)\leq b_0(2\epsilon). \quad (24)$$

For the multivariate Gaussian family $\{P_\theta\}$, the second half of our proof of Theorem 1(ii) derives an explicit upper bound on $b_0(\epsilon)$, provided that $2\epsilon\leq a$ for a constant $a\in[0,1/2)$:

$$b_0(2\epsilon)\leq S_{1,a}\|\Sigma^*\|_{\mathrm{op}}^{1/2}(2\epsilon),$$

where $S_{1,a}=\{\Phi'(\Phi^{-1}(1/2+a))\}^{-1}$. Combining the preceding inequalities yields $b(\epsilon)\leq C\|\Sigma^*\|_{\mathrm{op}}^{1/2}\epsilon$ in Theorem 1(ii), with $C=2S_{1,a}$. In addition, Proposition 5.1 in Donoho and Liu (1988) gives the same bound as (24) for MD estimation using certain other distances $D(P,P_\theta)$, including the Hellinger distance, where the MD functional $\bar\theta_P=(\bar\mu_P,\bar\Sigma_P)$ is defined as $\mathrm{argmin}_\theta D(P,P_\theta)$, and $b(\epsilon)$ and $b_0(\epsilon)$ are defined with $D_{\mathrm{TV}}(P\|P_{\theta^*})$ replaced by $D(P,P_\theta)$. The distances used in defining the MD functional and the contamination neighborhood are tied to each other. Hence, except for minimum TV estimation, our setting differs from Donoho and Liu (1988) in studying different choices of minimum $f$-divergence estimation over the same Huber's contamination neighborhood.

Remark 8.

We briefly comment on how our result is related to breakdown points in robust statistics (Huber (1981), Section 1.4). For estimating $\mu^*$, the population breakdown point of a functional $T=T(P)$ can be defined as $\sup\{\epsilon:b_T(\epsilon)<\infty\}$, where $b_T(\epsilon)=\sup_{P:D_{\mathrm{TV}}(P\|P_{\theta^*})\leq\epsilon}\|T(P)-\mu^*\|_2$. Scatter estimation can be discussed in a similar manner. For $T$ defined from minimum TV estimation, Theorem 1(ii) shows that if $\epsilon<1/4$, then $b_T(\epsilon)\leq C\|\Sigma^*\|_{\mathrm{op}}^{1/2}\epsilon$, as noted in Remark 7. This not only provides an explicit bound on $b_T(\epsilon)$, but also implies that the population breakdown point is at least $1/4$ for minimum TV estimation. Similar implications can be obtained from Theorem 1(i) for minimum $f$-divergence estimation. For $T$ defined from minimum rKL divergence estimation, Theorem 1(i) shows that if $2\sqrt{\epsilon}+\epsilon<1/2$, then $b_T(\epsilon)\leq C\|\Sigma^*\|_{\mathrm{op}}^{1/2}\sqrt{\epsilon}$, and hence the population breakdown point is at least $0.051$. While these estimates of breakdown points can potentially be improved, our population analysis as well as the sample analysis in the subsequent sections focus on deriving quantitative error bounds in terms of sufficiently small $\epsilon$ and scaling constants free of $\epsilon$.

4.2 Logit $f$-GAN with spline discriminators

For the population analysis in Section 4.1, a discriminator class is assumed to be rich enough to include the nonparametrically optimal discriminator, which depends on the unknown $(\epsilon,Q)$. Because $Q$ can be arbitrary, this nonparametric assumption is inappropriate for sample analysis. Recently, GANs with certain neural network discriminators have been shown to achieve sample error bounds matching minimax rates (Gao et al. (2019); Gao, Yao and Zhu (2020)). It is interesting to study whether similar results can be obtained when using GANs with simpler and computationally more tractable discriminators.

We propose and study adversarial algorithms, including logit $f$-GAN in this section and hinge GAN in Section 4.3, each with simple spline discriminators. Define a linear class of pairwise spline functions, denoted as $\mathcal{H}_{\mathrm{sp}}$:

$$h_{\mathrm{sp},\gamma}(x)=\gamma_0+\gamma_1^{\mathrm{T}}\varphi(x)+\gamma_2^{\mathrm{T}}\mathrm{vec}(\varphi(x)\otimes\varphi(x)),$$

where $\gamma=(\gamma_0,\gamma_1^{\mathrm{T}},\gamma_2^{\mathrm{T}})^{\mathrm{T}}\in\Gamma$ with $\Gamma=\mathbb{R}^{1+5p+(5p)^2}$, and $\varphi(x)=(\varphi_1^{\mathrm{T}}(x),\ldots,\varphi_5^{\mathrm{T}}(x))^{\mathrm{T}}$. The basis vector $\varphi_l(x)\in\mathbb{R}^p$ is obtained by applying $t\mapsto(t-\xi_l)_+$ componentwise to $x\in\mathbb{R}^p$, with the knot $\xi_l=-2,-1,0,1$, or $2$ for $l=1,\ldots,5$, respectively. For concreteness, assume that every two components of $\gamma_2$ are identical if associated with the same product of two components of $\varphi(x)$, that is, $\gamma_2$ can be arranged into a symmetric matrix. The preceding specification is sufficient for our theoretical analysis. Nevertheless, similar results can also be obtained while allowing various changes to the basis functions, for example, adding $x$ as a subvector of $\varphi(x)$. With this change, a function in $\mathcal{H}_{\mathrm{sp}}$ has a main effect term in each $x_j$, which is a linear spline with knots in $\{-2,-1,0,1,2\}$, and a square or interaction term in each pair $(x_{j_1},x_{j_2})$, which is a product of two spline functions in $x_{j_1}$ and $x_{j_2}$ for $1\leq j_1,j_2\leq p$.
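The spline features are straightforward to construct; the following is a minimal sketch (ours, not the authors' implementation) of $\varphi(x)$ and the resulting discriminator $h_{\mathrm{sp},\gamma}$, with $\gamma_2$ stored in its symmetric $5p\times 5p$ matrix form.

```python
import numpy as np

KNOTS = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

def phi(x):
    """x: (p,) vector -> (5p,) features (x_j - xi_l)_+ , stacked knot by knot."""
    return np.maximum(x[None, :] - KNOTS[:, None], 0.0).reshape(-1)

def h_spline(x, gamma0, gamma1, Gamma2):
    """Pairwise spline discriminator; Gamma2 is the symmetric matrix form of gamma_2."""
    f = phi(x)
    return gamma0 + gamma1 @ f + f @ Gamma2 @ f
```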

We consider two logit $f$-GAN methods, with an $L_1$ or $L_2$ penalty on the discriminator, which lead to meaningful error bounds over the parameter space $\Theta_1$ or $\Theta_2$, respectively, under the following conditions on $f$, in addition to Assumption 1. Among the $f$-divergences in Table 2, the reverse KL and JS divergences satisfy both Assumptions 1 and 2, and hence the corresponding logit $f$-GANs achieve robust estimation at the sample level using spline discriminators. The squared Hellinger and reverse $\chi^2$ divergences satisfy Assumption 1, but not the Lipschitz condition in Assumption 2(ii). For such $f$-divergences, it remains a theoretical question whether robust estimation at the sample level can be achieved using spline discriminators.

Assumption 2.

Suppose that $f:(0,\infty)\to\mathbb{R}$ is strictly convex and three-times continuously differentiable with $f(1)=0$ and satisfies the following conditions.

  • (i) $f'(\mathrm{e}^u)$ is concave in $u\in\mathbb{R}$.

  • (ii) $f^{\#}(\mathrm{e}^u)$ is $R_1$-Lipschitz in $u\in\mathbb{R}$ for a constant $R_1>0$.

See Table 2 for the validity of conditions (i) and (ii) for various $f$-divergences, and Remarks 9 and 10 for further discussion.

The first method, $L_1$ penalized logit $f$-GAN, is defined by solving

$$\min_{\theta\in\Theta_1}\max_{\gamma\in\Gamma}\;\left\{K_f(P_n,P_\theta;h_{\gamma,\mu})-\lambda_1\,\mathrm{pen}_1(\gamma)\right\}, \quad (25)$$

where $K_f(P_n,P_\theta;h)$ is $K_f(P_*,P_\theta;h)$ in (1) with $P_*$ replaced by $P_n$, $h_{\gamma,\mu}(x)=h_{\mathrm{sp},\gamma}(x-\mu)$, $\mathrm{pen}_1(\gamma)=\|\gamma_1\|_1+\|\gamma_2\|_1$ is the $L_1$ norm of $\gamma$ excluding the intercept $\gamma_0$, and $\lambda_1\geq 0$ is a tuning parameter. In addition to the replacement of $P_*$ by $P_n$, there are two notable modifications in (25) compared with the population version (1). First, a penalty term on $\gamma$ is introduced to achieve suitable control of sampling variation. Second, the discriminator $h_{\gamma,\mu}$ is a spline function with knots depending on $\mu$, the location parameter of the generator. By a change of variables, the non-penalized objective in (25) can be equivalently written as

$$K_f(P_n,P_\theta;h_{\gamma,\mu})=\mathrm{E}_{P_n-\mu}f'(\mathrm{e}^{h_{\mathrm{sp},\gamma}(x)})-\mathrm{E}_{P_{0,\Sigma}}f^{\#}(\mathrm{e}^{h_{\mathrm{sp},\gamma}(x)}), \quad (26)$$

where $P_n-\mu$ denotes the empirical distribution on $\{X_1-\mu,\ldots,X_n-\mu\}$. Hence $K_f(P_n,P_\theta;h_{\gamma,\mu})$ is a negative loss for discriminating between the shifted empirical distribution $P_n-\mu$ and the mean-zero generator $P_{0,\Sigma}$. The adaptive choice of knots for the spline discriminator $h_{\gamma,\mu}$ is not only numerically desirable but also facilitates the control of sampling variation in our theoretical analysis. See Propositions S5, S8, S13, and S15.
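For a concrete sense of the inner concave problem in (25), the following sketch (ours, using the rKL objective and hypothetical feature matrices) performs one proximal gradient ascent step on $\gamma$: `feats_real` holds rows $\Phi(X_i-\mu)$ with a leading 1 for the intercept, and `feats_fake` holds the same features for draws from $\mathrm{N}(0,\Sigma)$.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inner_step_rkl(gamma, feats_real, feats_fake, lam1, step):
    """One proximal ascent step on K_rKL(P_n - mu, P_{0,Sigma}; h_gamma) - lam1 * pen_1(gamma)."""
    h_real = feats_real @ gamma
    # Gradient of the smooth concave part: 1 - mean(exp(-h_real)) - mean(h_fake).
    grad = (feats_real * np.exp(-h_real)[:, None]).mean(axis=0) - feats_fake.mean(axis=0)
    gamma = gamma + step * grad
    gamma[1:] = soft_threshold(gamma[1:], step * lam1)  # L1 prox; intercept gamma_0 unpenalized
    return gamma
```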

Theorem 2.

Assume that $\|\Sigma^*\|_{\max}\leq M_1$ and $f$ satisfies Assumptions 1–2. Let $\hat\theta=(\hat\mu,\hat\Sigma)$ be a solution to (25). For $\delta<1/7$, if $\lambda_1\geq C_1(\sqrt{\log(p)/n}+\sqrt{\log(1/\delta)/n})$ and $\sqrt{\epsilon}+\sqrt{1/(n\delta)}+\lambda_1\leq C_2$, then with probability at least $1-7\delta$ the following bounds hold uniformly over the contamination distribution $Q$:

$$\|\hat\mu-\mu^*\|_\infty\leq C\left(\sqrt{\epsilon}+\sqrt{1/(n\delta)}+\lambda_1\right),\quad \|\hat\Sigma-\Sigma^*\|_{\max}\leq C\left(\sqrt{\epsilon}+\sqrt{1/(n\delta)}+\lambda_1\right),$$

where C1,C2,C>0C_{1},C_{2},C>0 are constants, depending on M1M_{1} and ff but independent of (n,p,ϵ,δ)(n,p,\epsilon,\delta).

For L1L_{1} penalized logit ff-GAN, Theorem 2 shows that the estimator (μ^,Σ^)(\hat{\mu},\hat{\Sigma}) achieves error bounds in the maximum norms in the order ϵ+log(p)/n\sqrt{\epsilon}+\sqrt{\log(p)/n}. These error bounds match sampling errors of order log(p)/n\sqrt{\log(p)/n} in the maximum norms for the standard estimators (i.e., the sample mean and variance) in a multivariate Gaussian model in the case of ϵ=0\epsilon=0. Moreover, up to sampling variation, the error bounds also match the population error bounds of order ϵ\sqrt{\epsilon} in the maximum norms with nonparametric discriminators in Theorem 1(i), even though a simple, linear class of spline discriminators is used.
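For a rough sense of scale, ignoring the constants: with n = 2000, p = 10, and ϵ = 0.05 as in our simulations, the contamination term √ϵ ≈ 0.224 dominates the sampling term √(log(p)/n) ≈ 0.034, so the bound in Theorem 2 is driven by the contamination fraction.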

The second method, L2L_{2} penalized logit ff-GAN, is defined by solving

minθΘ2maxγΓ{Kf(Pn,Pθ;hγ,μ)λ2pen2(γ1)λ3pen2(γ2)},\displaystyle\min_{\theta\in\Theta_{2}}\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\theta};h_{\gamma,\mu})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}, (27)

where Kf(Pn,Pθ;h)K_{f}(P_{n},P_{\theta};h) and hγ,μ(x)h_{\gamma,\mu}(x) are defined as in (25), pen2(γ1)=γ12\mathrm{pen}_{2}(\gamma_{1})=\|\gamma_{1}\|_{2} and pen2(γ2)=γ22\mathrm{pen}_{2}(\gamma_{2})=\|\gamma_{2}\|_{2}, the L2L_{2} norms of γ1\gamma_{1} and γ2\gamma_{2}, and λ20\lambda_{2}\geq 0 and λ30\lambda_{3}\geq 0 are tuning parameters. Compared with L1L_{1} penalized logit ff-GAN (25), the L2L_{2} norms of γ1\gamma_{1} and γ2\gamma_{2} are separately associated with tuning parameters λ2\lambda_{2} and λ3\lambda_{3} in (27), in addition to the change from L1L_{1} to L2L_{2} penalties. As seen from our proofs in Sections II.2 and II.3 in the Supplement, the use of separate tuning parameters λ2\lambda_{2} and λ3\lambda_{3} is crucial for achieving meaningful error bounds in the L2L_{2} and Frobenius norms for simultaneous estimation of (μ,Σ)(\mu^{*},\Sigma^{*}). Our method does not rely on the use of normalized differences of pairs of the observations to reduce the unknown mean to 0 for scatter estimation as in Diakonikolas et al. (2019).

Theorem 3.

Assume that ΣopM2\|\Sigma^{*}\|_{\mathrm{op}}\leq M_{2}, ff satisfies Assumptions 1–2, and pϵp\epsilon is upper bounded by a constant BB. Let θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) be a solution to (27). For δ<1/8\delta<1/8, if λ2C1(p/n+log(1/δ)/n)\lambda_{2}\geq C_{1}\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right), λ3C1p(p/n+log(1/δ)/n)\lambda_{3}\geq C_{1}\sqrt{p}\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right), and ϵ+1/(nδ)+λ2C2\sqrt{\epsilon}+\sqrt{1/(n\delta)}+\lambda_{2}\leq C_{2}, then with probability at least 18δ1-8\delta the following bounds hold uniformly over contamination distribution QQ,

μ^μ2\displaystyle\|\hat{\mu}-\mu^{*}\|_{2} C(ϵ+1/(nδ)+λ2),\displaystyle\leq C\left(\sqrt{\epsilon}+\sqrt{1/(n\delta)}+\lambda_{2}\right),
p1/2Σ^ΣF\displaystyle p^{-1/2}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}} C(ϵ+1/(nδ)+λ2+λ3/p),\displaystyle\leq C\left(\sqrt{\epsilon}+\sqrt{1/(n\delta)}+\lambda_{2}+\lambda_{3}/\sqrt{p}\right),

where C1,C2,C>0C_{1},C_{2},C>0 are constants, depending on M2M_{2} and ff but independent of (n,p,ϵ,δ)(n,p,\epsilon,\delta) except through the bound BB on pϵp\epsilon.

For L2L_{2} penalized logit ff-GAN, Theorem 3 provides error bounds of order ϵ+p/n\sqrt{\epsilon}+\sqrt{p/n} in the L2L_{2} and p1/2p^{-1/2}-Frobenius norms for location and scatter estimation. A technical difference from Theorem 2 is that these bounds are derived under the extra condition that pϵp\epsilon is upper bounded. Nevertheless, the error rate ϵ+p/n\sqrt{\epsilon}+\sqrt{p/n} matches the population error bounds of order ϵ\sqrt{\epsilon} in Theorem 1(i), up to sampling variation of order p/n\sqrt{p/n} in the L2L_{2} and p1/2p^{-1/2}-Frobenius norms. We defer to Section 4.3 further discussion of the error bounds in Theorems 2–3 in comparison with minimax error rates.

Remark 9.

There are important implications of Assumption 2(i) together with Assumption 1(ii), based on the fact (“composition rule”) that the composition of a non-decreasing concave function and a concave function is concave. First, for convex ff, concavity of f(eu)f^{\prime}(\mathrm{e}^{u}) in uu\in\mathbb{R} implies Assumption 1(iii), that is, concavity of f(t)f^{\prime}(t) in t(0,)t\in(0,\infty). This follows by writing f(t)=g(logt)f^{\prime}(t)=g(\log t) and applying the composition rule, where g(u)=f(eu)g(u)=f^{\prime}(\mathrm{e}^{u}), in addition to being concave, is non-decreasing by convexity of ff, and logt\log t is concave in tt. Note that concavity of f(t)f^{\prime}(t) in tt may not imply concavity of f(eu)f^{\prime}(\mathrm{e}^{u}) in uu, as shown by the Pearson χ2\chi^{2} in Table 2. Second, for convex and non-increasing ff, concavity of f(eu)f^{\prime}(\mathrm{e}^{u}) in uu\in\mathbb{R} also implies concavity of f#(eu)-f^{\#}(\mathrm{e}^{u}) in uu\in\mathbb{R}. In fact, as mentioned in Remark 2, f#(t)f^{\#}(t) can be equivalently obtained as f#(t)=f(f(t))f^{\#}(t)=f^{*}(f^{\prime}(t)), where ff^{*} is the Fenchel conjugate of ff (Tan, Song and Ou (2019)). By the composition rule, f#(eu)=g(f(eu))-f^{\#}(\mathrm{e}^{u})=g(f^{\prime}(\mathrm{e}^{u})) is concave, where g=fg=-f^{*} is concave and non-decreasing by non-increasingness of ff.

Remark 10.

The concavity of f(eu)f^{\prime}(\mathrm{e}^{u}) and f#(eu)-f^{\#}(\mathrm{e}^{u}) in uu from Assumptions 1(ii) and 2(i), as discussed in Remark 9, is instrumental from both theoretical and computational perspectives. These concavity properties are crucial to our proofs of Theorems 2–3 and Corollary 1(i). See Lemmas S4 and S13 in the Supplement. Moreover, the concavity of f(eu)f^{\prime}(\mathrm{e}^{u}) and f#(eu)-f^{\#}(\mathrm{e}^{u}) in uu, in conjunction with the linearity of the spline discriminator hγ,μh_{\gamma,\mu} in γ\gamma, indicates that the objective function Kf(Pn,Pθ;hγ,μ)K_{f}(P_{n},P_{\theta};h_{\gamma,\mu}) is concave in γ\gamma for any fixed θ\theta. Hence our penalized logit ff-GAN (25) or (27) under Assumptions 1–2 can be implemented through nested optimization as shown in Algorithm 1, where a concave optimizer is used in the inner stage to train the spline discriminators. See Remark 1 for further discussion.

Require: a penalized GAN objective function K(θ,γ;λ)K(\theta,\gamma;\lambda) as in (25), (27), (28), or (29), an initial value θ0\theta_{0} for the generator, and a decaying learning rate αt\alpha_{t}.
Repeat 
       Sampling: Draw (Z1,,Zn)(Z_{1},\ldots,Z_{n}) from N(0,I)\mathrm{N}(0,I) and approximate Pθt1P_{\theta_{t-1}} by
       the empirical distribution on the fake data ξi=μt1+Σt11/2Zi\xi_{i}=\mu_{t-1}+\Sigma_{t-1}^{1/2}Z_{i}, i=1,,ni=1,\ldots,n.
       Updating: Compute γt=argmaxγK(θt1,γ;λ)\gamma_{t}=\mathrm{argmax}_{\gamma}K(\theta_{t-1},\gamma;\lambda) by a concave optimizer,
       and compute θt=θt1αtθK(θ,γt;λ)|θt1\theta_{t}=\theta_{t-1}-\alpha_{t}\nabla_{\theta}K(\theta,\gamma_{t};\lambda)|_{\theta_{t-1}} by gradient descent.
until convergence.
Algorithm 1 Penalized logit ff-GAN or hinge GAN
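For illustration, Algorithm 1 can be sketched in a few lines of Python as below; argmax_gamma_K and grad_theta_K are hypothetical user-supplied stand-ins for the concave inner solver and the gradient of K in θ, and a fixed iteration count with a decaying learning rate replaces the convergence test.

```python
import numpy as np

def penalized_gan(X, argmax_gamma_K, grad_theta_K, theta0, lam,
                  n_iter=500, lr0=0.05):
    """Nested-optimization sketch of Algorithm 1 for theta = (mu, Sigma)."""
    n, p = X.shape
    mu, Sigma = theta0
    for t in range(1, n_iter + 1):
        # Sampling: fake data xi_i = mu + Sigma^{1/2} Z_i approximate P_theta.
        Z = np.random.randn(n, p)
        w, V = np.linalg.eigh(Sigma)
        Sigma_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
        fake = mu + Z @ Sigma_half
        # Updating: full concave maximization over gamma, then one
        # gradient-descent step in theta with a decaying learning rate.
        gamma = argmax_gamma_K(X, fake, mu, lam)
        g_mu, g_Sigma = grad_theta_K(X, Z, mu, Sigma, gamma, lam)
        alpha = lr0 / np.sqrt(t)
        mu = mu - alpha * g_mu
        Sigma = Sigma - alpha * g_Sigma
        Sigma = (Sigma + Sigma.T) / 2.0   # keep the iterate symmetric
    return mu, Sigma
```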

4.3 Hinge GAN with spline discriminators

We consider two hinge GAN methods with an L1L_{1} or L2L_{2} penalty on the spline discriminator, which lead to theoretically improved error bounds in terms of dependency on (ϵ,p)(\epsilon,p) over the parameter space Θ1\Theta_{1} or Θ2\Theta_{2} respectively, compared with the corresponding logit ff-GAN methods in Section 4.2.

The first method, L1L_{1} penalized hinge GAN, is defined by solving

minθΘ1maxγΓ{KHG(Pn,Pθ;hγ,μ)λ1pen1(γ)},\displaystyle\min_{\theta\in\Theta_{1}}\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\theta};h_{\gamma,\mu})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}, (28)

where KHG(Pn,Pθ;h)K_{\mathrm{HG}}(P_{n},P_{\theta};h) is the hinge objective KHG(P,Pθ;h)K_{\mathrm{HG}}(P_{*},P_{\theta};h) in (3) with PP_{*} replaced by PnP_{n} and, similarly as in L1L_{1} penalized logit ff-GAN (25), hγ,μ(x)=hsp,γ(xμ)h_{\gamma,\mu}(x)=h_{\mathrm{sp},\gamma}(x-\mu), pen1(γ)=γ11+γ21\mathrm{pen}_{1}(\gamma)=\|\gamma_{1}\|_{1}+\|\gamma_{2}\|_{1}, and λ10\lambda_{1}\geq 0 is a tuning parameter.

Theorem 4.

Assume that ΣmaxM1\|\Sigma^{*}\|_{\max}\leq M_{1}. Let θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) be a solution to (28). For δ<1/7\delta<1/7, if λ1C1(logp/n+log(1/δ)/n)\lambda_{1}\geq C_{1}\left(\sqrt{\log{p}/n}+\sqrt{\log{(1/\delta)}/n}\right) and ϵ+ϵ/(nδ)+λ1C2\epsilon+\sqrt{\epsilon/(n\delta)}+\lambda_{1}\leq C_{2}, then with probability at least 17δ1-7\delta the following bounds hold uniformly over contamination distribution QQ,

μ^μ\displaystyle\|\hat{\mu}-\mu^{*}\|_{\infty} C(ϵ+ϵ/(nδ)+λ1),\displaystyle{\leq}C\left(\epsilon+\sqrt{\epsilon/(n\delta)}+\lambda_{1}\right),
Σ^Σmax\displaystyle\|\hat{\Sigma}-\Sigma^{*}\|_{\max} C(ϵ+ϵ/(nδ)+λ1),\displaystyle{\leq}C\left(\epsilon+\sqrt{\epsilon/(n\delta)}+\lambda_{1}\right),

where C,C1,C2>0C,C_{1},C_{2}>0 are constants, depending on M1M_{1} but independent of (n,p,ϵ,δ)(n,p,\epsilon,\delta).

For L1L_{1} penalized hinge GAN, Theorem 4 shows that the estimator (μ^,Σ^)(\hat{\mu},\hat{\Sigma}) achieves error bounds in the maximum norms in the order ϵ+log(p)/n\epsilon+\sqrt{\log(p)/n}, which improve upon the error rate ϵ+log(p)/n\sqrt{\epsilon}+\sqrt{\log(p)/n} in terms of dependency on ϵ\epsilon for L1L_{1} penalized logit ff-GAN. This difference can be traced to that in the population error bounds in Theorem 1. Moreover, Theorem 5.1 in Chen, Gao and Ren (2018) indicates that a minimax lower bound on estimator errors μ^μ\|\hat{\mu}-\mu^{*}\|_{\infty} or Σ^Σmax\|\hat{\Sigma}-\Sigma^{*}\|_{\max} is also of order ϵ+log(p)/n\epsilon+\sqrt{\log(p)/n} in Huber’s contaminated Gaussian model, where log(p)/n\sqrt{\log(p)/n} is a minimax lower bound in the maximum norms in the case of ϵ=0\epsilon=0. Therefore, our L1L_{1} penalized hinge GAN achieves the minimax rates in the maximum norms for Gaussian location and scatter estimation over Θ1\Theta_{1}.
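Numerically, with ϵ = 0.05 and (n, p) = (2000, 10) as before and ignoring the constants, the hinge GAN bound scales as ϵ + √(log(p)/n) ≈ 0.050 + 0.034, whereas the corresponding logit ff-GAN bound scales as √ϵ + √(log(p)/n) ≈ 0.224 + 0.034, illustrating the improved dependency on ϵ.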

The second method, L2L_{2} penalized hinge GAN, is defined by solving

minθΘ2maxγΓ{KHG(Pn,Pθ;hγ,μ)λ2pen2(γ1)λ3pen2(γ2)},\displaystyle\min_{\theta\in\Theta_{2}}\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\theta};h_{\gamma,\mu})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}, (29)

where, similarly as in L2L_{2} penalized logit ff-GAN (27), hγ,μ(x)=hsp,γ(xμ)h_{\gamma,\mu}(x)=h_{\mathrm{sp},\gamma}(x-\mu), pen2(γ1)=γ12\mathrm{pen}_{2}(\gamma_{1})=\|\gamma_{1}\|_{2} and pen2(γ2)=γ22\mathrm{pen}_{2}(\gamma_{2})=\|\gamma_{2}\|_{2}, and λ20\lambda_{2}\geq 0 and λ30\lambda_{3}\geq 0 are tuning parameters.

Theorem 5.

Assume that ΣopM2\|\Sigma^{*}\|_{\mathrm{op}}\leq M_{2}. Let θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) be a solution to (29). For δ<1/8\delta<1/8, if λ2C1(p/n+log(1/δ)/n)\lambda_{2}\geq C_{1}\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right), λ3C1p(p/n+log(1/δ)/n)\lambda_{3}\geq C_{1}\sqrt{p}\left(\sqrt{p/n}+\sqrt{\log(1/\delta)/n}\right), and p(ϵ+ϵ/(nδ))+λ2C2\sqrt{p}\left(\epsilon+\sqrt{\epsilon/(n\delta)}\right)+\lambda_{2}\leq C_{2}, then with probability at least 18δ1-8\delta the following bounds hold uniformly over contamination distribution QQ,

μ^μ2\displaystyle\|\hat{\mu}-\mu^{*}\|_{2} C(p(ϵ+ϵ/(nδ))+λ2),\displaystyle\leq C\left(\sqrt{p}\left(\epsilon+\sqrt{\epsilon/(n\delta)}\right)+\lambda_{2}\right),
p1/2Σ^ΣF\displaystyle p^{-1/2}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}} C(p(ϵ+ϵ/(nδ))+λ2+λ3/p),\displaystyle\leq C\left(\sqrt{p}\left(\epsilon+\sqrt{\epsilon/(n\delta)}\right)+\lambda_{2}+\lambda_{3}/\sqrt{p}\right),

where C1,C2,C>0C_{1},C_{2},C>0 are constants, depending on M2M_{2} but independent of (n,p,ϵ,δ)(n,p,\epsilon,\delta).

For L2L_{2} penalized hinge GAN, Theorem 5 shows that the estimator (μ^,Σ^)(\hat{\mu},\hat{\Sigma}) achieves error bounds in the L2L_{2} and p1/2p^{-1/2}-Frobenius norms in the order ϵp+p/n\epsilon\sqrt{p}+\sqrt{p/n}. On one hand, these error bounds reduce to the same order, ϵ+p/n\sqrt{\epsilon}+\sqrt{p/n}, as those for L2L_{2} penalized logit ff-GAN, under the condition that pϵp\epsilon is upper bounded by a constant. On the other hand, when compared with the minimax rates, there remain nontrivial differences between L2L_{2} penalized hinge GAN and logit ff-GAN. In fact, the minimax rate in the L2L_{2} and operator norms for location and scatter estimation over Θ2\Theta_{2} is known to be ϵ+p/n\epsilon+\sqrt{p/n} in Huber’s contaminated Gaussian model (Chen, Gao and Ren (2018)). The same minimax rate can also be shown in the p1/2p^{-1/2}-Frobenius norm for scatter estimation. The error rate for L2L_{2} penalized hinge GAN in Theorem 5 then matches the minimax rate, and both reduce to the contamination-free error rate p/n\sqrt{p/n}, provided that ϵn\epsilon\sqrt{n} is bounded by a constant, i.e., ϵ=O(1/n)\epsilon=O(\sqrt{1/n}), independently of pp. For L2L_{2} penalized logit ff-GAN associated with the reverse KL or JS divergence (satisfying Assumptions 1–2), the error bounds from Theorem 3 match the minimax rate provided both ϵ=O(p/n)\epsilon=O(p/n) and ϵ=O(1/p)\epsilon=O(1/p). The latter condition can be restrictive when pp is large.

Remark 11.

The two functionals, min(1,h)\min(1,h) and min(−1,h)\min(-1,h), are concave in hh in the hinge objective KHG(Pn,Pθ;h)K_{\mathrm{HG}}(P_{n},P_{\theta};h). This is reminiscent of the concavity of f(eh)f^{\prime}(\mathrm{e}^{h}) and f#(eh)-f^{\#}(\mathrm{e}^{h}) in hh in the logit ff-GAN objective Kf(Pn,Pθ;h)K_{f}(P_{n},P_{\theta};h) under Assumptions 1(ii) and 2(i) as discussed in Remark 10. These concavity properties are crucial to our proofs of Theorems 4–5 and Corollary 1(ii). See Lemmas S12 and S14 in the Supplement. Moreover, the concavity of KHG(Pn,Pθ;h)K_{\mathrm{HG}}(P_{n},P_{\theta};h) in hh, together with the linearity of the spline discriminator hγ,μh_{\gamma,\mu} in γ\gamma, implies that the objective function KHG(Pn,Pθ;hγ,μ)K_{\mathrm{HG}}(P_{n},P_{\theta};h_{\gamma,\mu}) is concave in γ\gamma for any fixed θ\theta. Hence, similarly to penalized logit ff-GAN, our penalized hinge GAN (28) or (29) can also be implemented through nested optimization with a concave inner stage in training spline discriminators, as shown in Algorithm 1.

4.4 Two-objective GAN with spline discriminators

We study two-objective GANs, where the spline discriminator is trained using the objective function in logit ff-GAN or hinge GAN, but the generator is trained using a different objective function.

Consider the following two-objective GAN related to logit ff-GANs (25) and (27):

{maxγΓKf(Pn,Pθ;hγ,μ)pen(γ;λ)with θ fixed,minθΘEPnf(ehγ,μ(x))EPθG(hγ,μ(x))with γ fixed.\displaystyle\left\{\begin{array}[]{ll}\max\limits_{\gamma\in\Gamma}\;K_{f}({P_{n}},P_{\theta};h_{\gamma,\mu})-\mathrm{pen}(\gamma;\lambda)&\quad\text{with $\theta$ fixed},\\ \min\limits_{\theta\in\Theta}\;\mathrm{E}_{{P_{n}}}f^{\prime}(\mathrm{e}^{h_{\gamma,\mu}(x)})-\mathrm{E}_{P_{\theta}}G(h_{\gamma,\mu}(x))&\quad\text{with $\gamma$ fixed}.\\ \end{array}\right. (32)

Similarly, consider the two-objective GAN related to the hinge GAN (28) and (29):

{maxγΓKHG(Pn,Pθ;hγ,μ)pen(γ;λ)with θ fixed,minθΘEPnmin(hγ,μ(x),1)EPθG(hγ,μ(x))with γ fixed.\displaystyle\left\{\begin{array}[]{ll}\max\limits_{\gamma\in\Gamma}\;K_{\mathrm{HG}}({P_{n}},P_{\theta};h_{\gamma,\mu})-\mathrm{pen}(\gamma;\lambda)&\quad\text{with $\theta$ fixed},\\ \min\limits_{\theta\in\Theta}\;\mathrm{E}_{{P_{n}}}\min(h_{\gamma,\mu}(x),1)-\mathrm{E}_{P_{\theta}}G(h_{\gamma,\mu}(x))&\quad\text{with $\gamma$ fixed}.\\ \end{array}\right. (35)

Here pen(γ;λ)\mathrm{pen}(\gamma;\lambda) is an L1L_{1} penalty, λ1(γ11+γ21)\lambda_{1}(\|\gamma_{1}\|_{1}+\|\gamma_{2}\|_{1}), and Θ\Theta is Θ1={(μ,Σ):μp,ΣmaxM1}\Theta_{1}=\{(\mu,\Sigma):\mu\in\mathbb{R}^{p},\|\Sigma\|_{\max}\leq M_{1}\} as in (25), or pen(γ;λ)\mathrm{pen}(\gamma;\lambda) is an L2L_{2} penalty, λ2γ12+λ3γ22\lambda_{2}\|\gamma_{1}\|_{2}+\lambda_{3}\|\gamma_{2}\|_{2}, and Θ\Theta is Θ2={(μ,Σ):μp,ΣopM2}\Theta_{2}=\{(\mu,\Sigma):\mu\in\mathbb{R}^{p},\|\Sigma\|_{\mathrm{op}}\leq M_{2}\} as in (27), and GG is a function satisfying Assumption 3. Note that the discriminator hγ,μh_{\gamma,\mu} is a spline function with knots depending on μ\mu, so that EPnf(ehγ,μ(x))\mathrm{E}_{{P_{n}}}f^{\prime}(\mathrm{e}^{h_{\gamma,\mu}(x)}) cannot be dropped in the optimization over θ\theta in (32) or (35). We show that the two-objective logit ff-GAN and hinge GAN achieve similar error bounds as the corresponding one-objective versions in Theorems 2–5.

Assumption 3.

Function GG in (32) or (35) is convex and strictly increasing. Hence the inverse function G1G^{-1} exists and is concave and strictly increasing.

Corollary 1.

(i) If θ^\hat{\theta} is replaced by a solution to the alternating optimization problem (32) with the L1L_{1} or L2L_{2} penalty on γ\gamma as in (25) or (27) and the corresponding choice of Θ\Theta, then the results in Theorem 2 or 3 remain valid, respectively.

(ii) If θ^\hat{\theta} is replaced by a solution to the alternating optimization problem (35) with the L1L_{1} or L2L_{2} penalty on γ\gamma as in (28) or (29) and the corresponding choice of Θ\Theta, then the results in Theorem 4 or 5 remain valid, respectively.

The two-objective GANs studied in Corollary 1 differ slightly from existing ones as described in (7)–(13), due to the use of the discriminator hγ,μh_{\gamma,\mu} depending on μ\mu to facilitate theoretical analysis, as mentioned in Section 4.2. If hγ,μh_{\gamma,\mu} were replaced by a discriminator hγh_{\gamma} defined independently of θ\theta, then (32) with Kf=KJSK_{f}=K_{\mathrm{JS}} and G(h)=log(1+eh)G(h)=-\log(1+\mathrm{e}^{-h}) or G(h)=hG(h)=-h would reduce to GAN with the logDD trick (7) or calibrated rKL-GAN (10) respectively, and (35) with G(h)=hG(h)=-h would reduce to geometric GAN (13).

5 Discussion

5.1 GANs with data transformation

Compared with the usual formulations (1) and (3), our logit ff-GAN and hinge GAN methods in Sections 4.2–4.3 involve a notable modification: both the real and fake data are discriminated against each other after being shifted by the current location parameter. Without this modification, a direct approach based on logit ff-GAN would use the objective function

Kf(Pn,Pθ;hsp,γ)=EPnf(ehsp,γ(x))EPμ,Σf#(ehsp,γ(x)),\displaystyle K_{f}(P_{n},P_{\theta};h_{\mathrm{sp},\gamma})=\mathrm{E}_{P_{n}}f^{\prime}(\mathrm{e}^{h_{\mathrm{sp},\gamma}(x)})-\mathrm{E}_{P_{\mu,\Sigma}}f^{\#}(e^{h_{\mathrm{sp},\gamma}(x)}), (36)

where the real data and the Gaussian fake data generated from standard noises are discriminated against each other given the parameters (μ,Σ)(\mu,\Sigma). The idea behind our modification can be extended by allowing both location and scatter transformation. For example, consider logit ff-GAN with full transformation:

minθΘmaxγΓ{Kf(Pn,Pθ;hγ,μ,Σ)pen(γ;λ)},\displaystyle\min_{\theta\in\Theta}\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\theta};h_{\gamma,\mu,\Sigma})-\mathrm{pen}(\gamma;\lambda)\right\}, (37)

where KfK_{f} is the logit ff-GAN objective as in (25) and (27), hγ,μ,Σ(x)=hsp,γ(Σ1/2(xμ))h_{\gamma,\mu,\Sigma}(x)=h_{\mathrm{sp},\gamma}(\Sigma^{-1/2}(x-\mu)) and pen(γ;λ)\mathrm{pen}(\gamma;\lambda) is an L1L_{1} or L2L_{2} penalty term. The discriminator hγ,μ,Σ(x)h_{\gamma,\mu,\Sigma}(x) is obtained by applying hsp,γ()h_{\mathrm{sp},\gamma}(\cdot) with fixed knots to the transformed data Σ1/2(xμ)\Sigma^{-1/2}(x-\mu). Similarly to (26), the non-penalized objective in (37) can be equivalently written as

Kf(Pn,Pθ;hγ,μ,Σ)=EΣ1/2(Pnμ)f(ehsp,γ(x))EP0,If#(ehsp,γ(x)),\displaystyle K_{f}(P_{n},P_{\theta};h_{\gamma,\mu,\Sigma})=\mathrm{E}_{\Sigma^{-1/2}(P_{n}-\mu)}f^{\prime}(\mathrm{e}^{h_{\mathrm{sp},\gamma}(x)})-\mathrm{E}_{P_{0,I}}f^{\#}(e^{h_{\mathrm{sp},\gamma}(x)}), (38)

where Σ1/2(Pnμ)\Sigma^{-1/2}(P_{n}-\mu) denotes the empirical distribution on {Σ1/2(X1μ),,Σ1/2(Xnμ)}\{\Sigma^{-1/2}(X_{1}-\mu),\ldots,\Sigma^{-1/2}(X_{n}-\mu)\}. Compared with (26) and (36), there are two advantages of using (38) with full transformation. First, due to both location and scatter transformation, logit ff-GAN (37), but not (25) or (27), can be shown to be affine equivariant. Second, the transformed real data and the standard Gaussian noises in (38) are discriminated against each other given the current parameters (μ,Σ)(\mu,\Sigma), while employing the spline discriminators hsp,γ(x)h_{\mathrm{sp},\gamma}(x) with knots fixed at {2,1,0,1,2}\{-2,-1,0,1,2\}. Because standard Gaussian data are well covered by the grid formed from these marginal knots, the discrimination involved in (38) can be informative even when the parameters (μ,Σ)(\mu,\Sigma) are updated. The discrimination involved in (36) may be problematic when employing the fixed-knot spline discriminators, because both the real and fake data may not be adequately covered by the grid formed from the knots.
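For illustration, the back-transformation in (38) can be computed via an eigendecomposition of Σ; the helper below is a minimal sketch of our own (not part of the implementation details), assuming Σ is positive definite.

```python
import numpy as np

def back_transform(X, mu, Sigma, eps=1e-10):
    """Return Sigma^{-1/2}(x - mu) for each row x of X, as used in (37)-(38),
    so that the fixed spline knots {-2,-1,0,1,2} cover the transformed data."""
    w, V = np.linalg.eigh(Sigma)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T
    return (X - mu) @ inv_sqrt
```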

From the preceding discussion, it can be more desirable to incorporate both location and scatter transformation as in (38) than location transformation alone as in (26), which aligns only the centers, not the scales and correlations, of the Gaussian fake data with the knots in the spline discriminators. As mentioned in Section 4.2, our sample analysis exploits the location transformation in establishing certain concentration properties in the proofs. On the other hand, our current proofs do not directly apply when both location and scatter transformation are allowed. Extending our theoretical analysis in this direction is left for future work.

5.2 Comparison with Gao, Yao and Zhu (2020)

We first point out a connection between logit ff-GANs and the GANs based on proper scoring rules in Gao, Yao and Zhu (2020). For a convex function g:(0,1)g:(0,1)\to\mathbb{R}, a proper scoring rule can be defined as (Savage (1971); Buja, Stuetzle and Shen (2005); Gneiting and Raftery (2007))

Sg(η,1)=g(η)+(1η)g(η),Sg(η,0)=g(η)ηg(η).\displaystyle S_{g}(\eta,1)=g(\eta)+(1-\eta)g^{\prime}(\eta),\quad S_{g}(\eta,0)=g(\eta)-\eta g^{\prime}(\eta).

The population version of the GAN studied in Gao, Yao and Zhu (2020) is defined as

minθΘmaxγΓLg(P,Pθ;qγ),\displaystyle\min_{\theta\in\Theta}\max_{\gamma\in\Gamma}\;L_{g}(P_{*},P_{\theta};q_{\gamma}), (39)

where qγ(x)[0,1]q_{\gamma}(x)\in[0,1], also called a discriminator, represents the probability that an observation xx comes from PP_{*} rather than PθP_{\theta}, and

Lg(P,Pθ;q)\displaystyle L_{g}(P_{*},P_{\theta};q) =(1/2){EPSg(q(x),1)+EPθSg(q(x),0)}g(1/2).\displaystyle=(1/2)\left\{\mathrm{E}_{P_{*}}S_{g}(q(x),1)+\mathrm{E}_{P_{\theta}}S_{g}(q(x),0)\right\}-g(1/2).

The objective Lg(P,Pθ;q)L_{g}(P_{*},P_{\theta};q) is shown to be a lower bound, being tight if q=2dP/d(P+Pθ)q=2\,\mathrm{d}P_{*}/\mathrm{d}(P_{*}+P_{\theta}), for the divergence Dg0(P(P+Pθ)/2)D_{g_{0}}(P_{*}\|(P_{*}+P_{\theta})/2), where g0(t)=g(t/2)g(1/2)g_{0}(t)=g(t/2)-g(1/2) for t(0,2)t\in(0,2). For example, taking g(η)=ηlogη+(1η)log(1η)g(\eta)=\eta\log\eta+(1-\eta)\log(1-\eta) leads to the log score, Sg(η,1)=logηS_{g}(\eta,1)=\log\eta and Sg(η,0)=log(1η)S_{g}(\eta,0)=\log(1-\eta). The corresponding objective function Lg(P,Pθ;qγ)L_{g}(P_{*},P_{\theta};q_{\gamma}) reduces to the expected log-likelihood with discrimination probability qγ(x)q_{\gamma}(x) as used in Goodfellow et al. (2014). We show that if qγ(x)q_{\gamma}(x) is specified as a sigmoid probability, then Lg(P,Pθ;qγ)L_{g}(P_{*},P_{\theta};q_{\gamma}) can be equivalently obtained as a logit ff-GAN objective for a suitable choice of ff.

Proposition 1.

Suppose that the discriminator is specified as qγ(x)=sigmoid(hγ(x))q_{\gamma}(x)=\mathrm{sigmoid}(h_{\gamma}(x)). Then Lg(P,Pθ;qγ)=Kf(P,Pθ;hγ)L_{g}(P_{*},P_{\theta};q_{\gamma})=K_{f}(P_{*},P_{\theta};h_{\gamma}) for KfK_{f} defined in (1) and f(t)=1+t2g0(2t1+t)f(t)=\frac{1+t}{2}g_{0}(\frac{2t}{1+t}) satisfying that Dg0(P(P+Pθ)/2)=Df(PPθ)D_{g_{0}}(P_{*}\|(P_{*}+P_{\theta})/2)=D_{f}(P_{*}\|P_{\theta}).
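As a worked instance, take the log score with g(η) = η log η + (1−η) log(1−η), so that g(1/2) = −log 2 and g_0(t) = g(t/2) + log 2. Substituting η = 2t/(1+t) into f(t) = ((1+t)/2) g_0(2t/(1+t)), and using η/2 = t/(1+t) and 1 − η/2 = 1/(1+t), gives

f(t) = ((1+t)/2){ (t/(1+t)) log(t/(1+t)) + (1/(1+t)) log(1/(1+t)) + log 2 } = (1/2){ t log t − (1+t) log((1+t)/2) },

the generator of the JS divergence, consistent with the reduction of L_g to the expected log-likelihood objective of Goodfellow et al. (2014) noted above.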

In contrast with our parameterization of hγ(x)h_{\gamma}(x) as a pairwise spline function, Gao, Yao and Zhu (2020) studied robust estimation in Huber’s contaminated Gaussian model with qγ(x)q_{\gamma}(x) parameterized as a neural network with two or more layers and sigmoid activations in the top and bottom layers. In the case of two layers, the neural network in Gao, Yao and Zhu (2020), Section 4, is defined as

qγ(x)=sigmoid(hγ(x)),hγ(x)=j=1Jγj(1)sigmoid(γj(2)Tx+γ0j(2)),\displaystyle q_{\gamma}(x)=\mathrm{sigmoid}(h_{\gamma}(x)),\quad h_{\gamma}(x)=\sum_{j=1}^{J}\gamma_{j}^{(1)}\mathrm{sigmoid}(\gamma^{(2){\mathrm{\scriptscriptstyle T}}}_{j}x+\gamma_{0j}^{(2)}), (40)

where (γj(2),γ0j(2))(\gamma_{j}^{(2)},\gamma_{0j}^{(2)}), j=1,,Jj=1,\ldots,J, are the weights and intercepts in the bottom layer, and γj(1)\gamma_{j}^{(1)}, j=1,,Jj=1,\ldots,J, are the weights in the top layer, constrained such that j=1J|γj(1)|κ\sum_{j=1}^{J}|\gamma_{j}^{(1)}|\leq\kappa for a tuning parameter κ\kappa. Assume that g(η)g(\eta) is three-times continuously differentiable at η=1/2\eta=1/2, g′′(1/2)>0g^{\prime\prime}(1/2)>0, and for a universal constant c0>0c_{0}>0,

2g′′(1/2)g′′′(1/2)+c0.\displaystyle 2g^{\prime\prime}(1/2)\geq g^{\prime\prime\prime}(1/2)+c_{0}. (41)

Then Gao, Yao and Zhu (2020) showed that the location and scatter estimators from the sample version of (39) with discriminator (40) achieve the minimax error rates, O(ϵ+p/n)O(\epsilon+\sqrt{p/n}), in the L2L_{2} and operator norms, provided that κ=O(ϵ+p/n)\kappa=O(\epsilon+\sqrt{p/n}) among other conditions. However, with sigmoid activations used inside hγ(x)h_{\gamma}(x), the sample objective Lg(Pn,Pθ;qγ)L_{g}(P_{n},P_{\theta};q_{\gamma}) may exhibit a complex, non-concave landscape in γ\gamma, which makes minimax optimization difficult.

There is also a subtle issue in how the above result from Gao, Yao and Zhu (2020) can be compared even with our population analysis for minimum ff-divergence estimation, i.e., population versions of GANs with nonparametric discriminators. In fact, condition (41) can be directly shown to be equivalent to saying that d2du2f(eu)|u=0c0\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\prime}(\mathrm{e}^{u})|_{u=0}\geq c_{0} for ff associated with gg in Proposition 1. This condition can be satisfied, while Assumption 1 is violated, for example, by the choice g(η)=(η1)log(η/(2η))g(\eta)=(\eta-1)\log(\eta/(2-\eta)) and f(t)={(t1)logt}/2f(t)=\{(t-1)\log t\}/2, corresponding to the mixed KL divergence DKL(P||Q)/2+DKL(Q||P)/2D_{\mathrm{KL}}(P||Q)/2+D_{\mathrm{KL}}(Q||P)/2. As shown in Figure 2, minimization of the mixed KL does not in general lead to robust estimation. Hence it seems paradoxical that minimax error rates can be achieved by the GAN in Gao, Yao and Zhu (2020) with its objective function derived from the mixed KL. On the other hand, a possible explanation can be seen as follows. By the sigmoid activation and the constraint j=1J|γj(1)|κ\sum_{j=1}^{J}|\gamma_{j}^{(1)}|\leq\kappa, the log-odds discriminator hγ(x)h_{\gamma}(x) in (40) is forced to be bounded, |hγ(x)|κ|h_{\gamma}(x)|\leq\kappa, where κ\kappa is further assumed to be small, of the same order as the minimax rate O(ϵ+p/n)O(\epsilon+\sqrt{p/n}). As a result, maximization of the population objective Lg(P,Pθ;qγ)L_{g}(P_{*},P_{\theta};q_{\gamma}) over such constrained discriminators may produce a divergence with a substantial gap to the actual divergence Df(PPθ)D_{f}(P_{*}\|P_{\theta}) for any fixed θ\theta. Instead, the implied divergence measure may behave more similarly to the total variation DTV(PPθ)D_{\mathrm{TV}}(P_{*}\|P_{\theta}) than to Df(PPθ)D_{f}(P_{*}\|P_{\theta}), due to the boundedness of hγ(x)h_{\gamma}(x) by a sufficiently small κ\kappa, so that minimax error rates can still be achieved.

6 Simulation studies

We conducted simulation studies to compare the performance of our logit ff-GAN and hinge GAN methods with several existing methods in various settings depending on QQ, ϵ\epsilon, nn, and pp. Results on the error dependency on ϵ\epsilon are provided in the main paper, and the remaining results are presented in the Supplement. Two contamination distributions QQ are considered to allow different types of contamination.

6.1 Implementation of methods

Our methods are implemented following Algorithm 1, where the penalized GAN objective function K(θ,γ;λ)K(\theta,\gamma;\lambda) is defined as in (37) for logit ff-GAN or with KfK_{f} replaced by KHGK_{\mathrm{HG}} for hinge GAN. As discussed in Section 5.1, this scheme allows adequate discrimination between the back-transformed real data, Σ1/2(xμ)\Sigma^{-1/2}(x-\mu), and the standard Gaussian noises using spline discriminators with fixed knots. In the following, we describe the discriminator optimization and penalty choices. See the Supplement for further details including the initial values (μ0,Σ0)(\mu_{0},\Sigma_{0}) and learning rate αt\alpha_{t}.

For Algorithm 1 with spline discriminators, the training objective is concave in the discriminator parameter γ\gamma and hence the discriminator can be fully optimized to provide a proper updating direction for the generator in each iteration. This implementation via nested optimization can be much more tractable than nonconvex-nonconcave minimax optimization in other GAN methods, where the discriminator and generator are updated by gradient ascent and descent, with careful tuning of the learning rates. In numerical work, we formulate the discriminator optimization from our GAN methods in the CVX framework (Fu, Narasimhan and Boyd (2020)) and then apply the convex optimization solver MOSEK (https://docs.mosek.com/latest/rmosek/index.html).
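In our numerical work the inner optimization is formulated in the R CVX framework as described above; purely for illustration, a minimal Python/cvxpy sketch of the inner stage for the L1 penalized JS logit f-GAN is given below, with hypothetical basis matrices B_real and B_fake standing for the spline basis evaluated at the transformed real data and at the standard Gaussian noises.

```python
import cvxpy as cp

def update_js_discriminator(B_real, B_fake, lam1):
    """Concave inner maximization for the L1 penalized JS logit f-GAN.

    B_real, B_fake: n x (d+1) spline basis matrices; column 0 carries the
    intercept gamma_0, which is excluded from the L1 penalty.
    For JS, f'(e^h) = log(2*sigmoid(h))/2 and -f#(e^h) = log(2*sigmoid(-h))/2;
    dropping the constant log(2) terms, both are expressed through
    cp.logistic(u) = log(1 + e^u), so this is a disciplined concave program.
    """
    gamma = cp.Variable(B_real.shape[1])
    h_real = B_real @ gamma
    h_fake = B_fake @ gamma
    obj = 0.5 * (cp.sum(-cp.logistic(-h_real)) / B_real.shape[0]
                 + cp.sum(-cp.logistic(h_fake)) / B_fake.shape[0])
    obj = obj - lam1 * cp.norm1(gamma[1:])
    cp.Problem(cp.Maximize(obj)).solve()
    return gamma.value
```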

As dictated by our theory, we employ L1L_{1} or L2L_{2} penalties on the spline discriminators to control sampling variation, especially when the sample size nn is small relative to the dimension of the discriminator parameter γ\gamma. Numerically, these penalties help stabilize the training process by weakening the power of the discriminator in the early stage. We tested our methods under different penalty levels and identified default choices of λ\lambda for our rKL and JS logit ff-GANs and hinge GAN. These penalty choices are then fixed in all our subsequent simulations. See the Supplement for results from our tuning experiments.

For comparison, we also implement ten existing methods for robust estimation.

  • JS-GAN (Gao, Yao and Zhu (2020)). We use the code from Gao, Yao and Zhu (2020) with minimal modification. The batch size is set to 1/101/10 of the data size because the default choice 500500 is too large in our experiment settings. We use the network structure p-2p-p/2-1p\text{-}2p\text{-}\lfloor p/2\rfloor\text{-}1 with LeakyReLU and Sigmoid activations as recommended in Gao, Yao and Zhu (2020).

  • Tyler’s M-estimator (Tyler (1987)). This method is included for completeness, being designed for multivariate scatter estimation from elliptical distributions, not Huber’s contaminated Gaussian distribution. We use R package fastM for implementation (https://cran.r-project.org/web/packages/fastM/index.html).

  • Kendall’s τ\tau with MAD (Loh and Tan (2018)). Kendall’s τ\tau (Kendall (1938)) is used to estimate the correlations after sine transformation and the median absolute deviation (MAD) (Hampel (1974)) is used to estimate the scales. We use the R built-in function cor to compute Kendall’s τ\tau correlations and use the built-in function mad for MAD.

  • Spearman’s ρ\rho with QnQ_{n}-estimator (Öllerer and Croux (2015)). The QnQ_{n}-estimator (Rousseeuw and Croux (1993)) is used for scale estimation and Spearman’s ρ\rho (Spearman (1987)) is used with sine transformation for correlation estimation. We use the R function cor to compute Spearman’s ρ\rho correlations and the Qn function in R package robustbase (https://cran.r-project.org/web/packages/robustbase/index.html).

  • MVE (Rousseeuw (1985)). The minimum volume ellipsoid (MVE) estimator is a high-breakdown robust method for multivariate location and scatter estimation. We use the function CovMve in R package rrcov for implementation (https://cran.r-project.org/web/packages/rrcov/index.html).

  • MCD (Rousseeuw (1985)). The minimum covariance determinant (MCD) estimator is a high-breakdown robust method and is shown to be superior to MVE in statistical efficiency (Butler, Davies and Jhun (1993)). We use the function CovMcd in R package rrcov for implementation.

  • Sest (Davies (1987); Lopuhaa (1989)). This is an S-estimator of multivariate location and scatter, based on Tukey’s biweight function. It has high-breakdown point and improved statistical efficiency over MVE. We use the function CovSest in R package rrcov for implementation.

  • Mest (Rocke (1996)). This is a constrained M-estimator of multivariate location and scatter, based on a translated biweight function. We use the function CovMest in R package rrcov for implementation, with MVE as the initial value.

  • MMest (Tatsuoka and Tyler (2000)). This is a constrained M-estimator which is a multivariate version of MM-functionals in Yohai (1987). We use the function CovMMest in R package rrcov for implementation, with an S-estimate as the initial value.

  • γ\gamma-Lasso (Hirose, Fujisawa and Sese (2017)). The method is implemented in the R package rsggm, which has become unavailable on CRAN. We use an archived version (https://mran.microsoft.com/snapshot/2017-02-04/web/packages/rsggm/index.html). We set γ=0.05\gamma=0.05 and deactivate the vectorized L1L_{1} penalty.

Tyler’s M-estimator, Kendall’s τ\tau with MAD, and Spearman’s ρ\rho with QnQ_{n} deal with scatter estimation only, whereas the other methods handle both location and scatter estimation. In our experiments, we focus on comparing the performance of existing and proposed methods in terms of scatter estimation (i.e., variance matrix estimation).

6.2 Simulation settings

The uncontaminated distribution is N(0,Σ)\mathrm{N}(0,\Sigma^{*}) where Σ\Sigma^{*} is a Toeplitz matrix with (i,j)(i,j) component equal to (1/2)|ij|(1/2)^{|i-j|}. The location parameter is unknown and estimated together with the variance matrix, except for Tyler’s M-estimator, Kendall’s τ\tau, and Spearman’s ρ\rho. We consider two contamination distributions QQ of different types. Denote a p×pp\times p identity matrix as IpI_{p} and a pp-dimensional vector of ones as 1p1_{p}.

  • Q=N(2.25c,13Ip)Q=\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right) where c=(1,1,1,1,1,)c=(1,-1,1,-1,1,\dots) is a pp-dimensional vector of alternating ±1\pm 1. In this setting, the contaminated points may not be seen as outliers marginally in each coordinate. On the other hand, these contaminated points can be easily separated as outliers from the uncontaminated Gaussian distribution in higher dimensions.

  • Q=N(51p,5Ip)Q=\mathrm{N}(51_{p},5I_{p}). Contaminated points may lie in both low-density and high-density regions of the uncontaminated Gaussian distribution. The majority of contaminated points are outliers that are far from the uncontaminated data, and there are also contaminated points that are enclosed by the uncontaminated points. This setting is also used in Gao, Yao and Zhu (2020).

The Gaussianity of the above contamination distributions is used for convenience and easy characterization of data patterns, given the specific locations and scales. See Figure 1 for an illustration of the first contamination, and the Supplement for that of the second.
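For concreteness, the data-generating process for both settings can be sketched as follows (the function and argument names are ours):

```python
import numpy as np

def simulate(n=2000, p=10, eps=0.05, setting=1, seed=0):
    """Draw a sample from (1-eps) N(0, Sigma*) + eps Q with Toeplitz
    Sigma*_{ij} = (1/2)^{|i-j|} and the two contaminations Q above."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    contaminated = rng.random(n) < eps          # hidden contamination status
    m = int(contaminated.sum())
    if setting == 1:                            # Q = N(2.25c, (1/3) I_p)
        c = 2.25 * (-1.0) ** np.arange(p)       # alternating +1/-1 pattern
        X[contaminated] = c + rng.standard_normal((m, p)) / np.sqrt(3.0)
    else:                                       # Q = N(5 * 1_p, 5 I_p)
        X[contaminated] = 5.0 + rng.standard_normal((m, p)) * np.sqrt(5.0)
    return X
```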

6.3 Experiment results

Table 3 summarizes scatter estimation errors in the maximum norm from L1L_{1} penalized hinge GAN and logit ff-GANs and existing methods, where p=10p=10, n=2000n=2000, and ϵ\epsilon increases from 0.0250.025 to 0.10.1 (reported as percentages in the tables). The errors are obtained by averaging 2020 repeated runs, and the numbers in parentheses are standard deviations. From these results, the logit ff-GANs, JS and rKL, have the best performance, followed closely by the hinge GAN and then, with more noticeable differences, by JS-GAN in Gao, Yao and Zhu (2020) and the five high-breakdown methods, MVE, MCD, Sest, Mest, and MMest. The pairwise methods, Kendall’s τ\tau with MAD and Spearman’s ρ\rho with QnQ_{n}-estimator, have relatively poor performance, especially for the first contamination, as expected from Figure 1. The γ\gamma-Lasso performs competitively when ϵ\epsilon is small (e.g., 2.5%), but deteriorates quickly as ϵ\epsilon increases. A possible explanation is that the default choice γ=0.05\gamma=0.05 may not work well in the current settings, even though the numerical studies in Hirose, Fujisawa and Sese (2017) suggest that γ\gamma does not require special tuning. Tyler’s M-estimator performs poorly, primarily because it is not designed for robust estimation under Huber’s contamination.

Estimation errors in the Frobenius norm from our L2L_{2} penalized GAN methods and existing methods are shown in Table 4. We observe a similar pattern of comparison as in Table 3.

From Tables 34, we see that the estimation errors of our GAN methods, as well as other methods, increase as ϵ\epsilon increases. However, the dependency on ϵ\epsilon is not precisely linear for the hinge GAN, and not in the order ϵ\sqrt{\epsilon} for the two logit ff-GANs. This does not violate our theoretical bounds, which are derived to hold over all possible contamination distributions, i.e., for the worst scenario of contamination. For specific contamination settings, it is possible for logit ff-GAN to outperform hinge GAN, and for each method to achieve a better error dependency on ϵ\epsilon than in the worst scenario.

Table 3: Comparison of existing methods and proposed L1L_{1} penalized GAN methods (p=10,n=2000p=10,n=2000). Estimation error of the variance matrix is reported in the maximum norm max\|\cdot\|_{\max}.
ϵ\epsilon (%) hinge GAN JS logit ff-GAN rKL logit ff-GAN GYZ JS-GAN Tyler_M Kendall_MAD Spearman_Qn
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
2.5 0.0711 (0.0203) 0.0692 (0.0191) 0.0661 (0.0121) 0.0960 (0.0384) 0.3340 (0.0360) 0.1484 (0.0320) 0.1434 (0.0245)
5 0.0768 (0.0201) 0.0694 (0.0166) 0.0658 (0.0161) 0.1049 (0.0423) 0.3115 (0.0279) 0.2235 (0.0351) 0.2351 (0.0284)
7.5 0.0828 (0.0203) 0.0690 (0.0171) 0.0661 (0.0092) 0.0957 (0.0303) 0.3048 (0.0234) 0.3164 (0.0319) 0.3294 (0.0241)
10 0.0981 (0.0221) 0.0737 (0.0207) 0.0720 (0.0151) 0.1029 (0.0467) 0.3526 (0.0226) 0.4133 (0.0264) 0.4362 (0.0256)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
2.5 0.0742 (0.0214) 0.0732 (0.0207) 0.0765 (0.0209) 0.0994 (0.0334) 0.3886 (0.0346) 0.1428 (0.0302) 0.1578 (0.0266)
5 0.0814 (0.0216) 0.0739 (0.0204) 0.0824 (0.0196) 0.1053 (0.0235) 0.4322 (0.0242) 0.2211 (0.0286) 0.2812 (0.0297)
7.5 0.0901 (0.0206) 0.0757 (0.0209) 0.0893 (0.0155) 0.1063 (0.0400) 0.4788 (0.0312) 0.3149 (0.0278) 0.4153 (0.0318)
10 0.1051 (0.0236) 0.0802 (0.0225) 0.0935 (0.0219) 0.1275 (0.0453) 0.5295 (0.0332) 0.4074 (0.0278) 0.5693 (0.0358)
ϵ\epsilon (%) γ\gamma-Lasso MVE MCD Sest Mest MMest
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
2.5 0.0737 (0.0182) 0.0935 (0.0266) 0.0834 (0.0270) 0.0892 (0.0282) 0.0802 (0.0236) 0.0876 (0.0278)
5 0.2182 (0.0162) 0.1124 (0.0257) 0.1078 (0.0267) 0.1314 (0.0257) 0.0891 (0.0245) 0.1299 (0.0259)
7.5 0.3740 (0.0153) 0.1373 (0.0276) 0.1306 (0.0247) 0.1774 (0.0245) 0.1026 (0.0239) 0.1759 (0.0240)
10 0.5277 (0.0154) 0.1700 (0.0284) 0.1619 (0.0258) 0.2358 (0.0292) 0.1289 (0.0269) 0.2349 (0.0292)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
2.5 0.0769 (0.0243) 0.0984 (0.0284) 0.0837 (0.0271) 0.0892 (0.0282) 0.0799 (0.0229) 0.0876 (0.0278)
5 0.1044 (0.0215) 0.1171 (0.0258) 0.1081 (0.0267) 0.1314 (0.0257) 0.0890 (0.0243) 0.1299 (0.0259)
7.5 0.1668 (0.0312) 0.1383 (0.0251) 0.1301 (0.0237) 0.1774 (0.0245) 0.1031 (0.0243) 0.1759 (0.0240)
10 0.2935 (0.0619) 0.1697 (0.0240) 0.1618 (0.0250) 0.2358 (0.0292) 0.1289 (0.0262) 0.2349 (0.0291)
Table 4: Comparison of existing methods and proposed L2L_{2} penalized GAN methods (p=10,n=2000p=10,n=2000). Estimation error of the variance matrix is reported in the Frobenius norm F\|\cdot\|_{\mathrm{F}}.
ϵ\epsilon (%) hinge GAN JS logit ff-GAN rKL logit ff-GAN GYZ JS-GAN Tyler_M Kendall_MAD Spearman_Qn
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
2.5 0.2673 (0.0457) 0.2574 (0.0461) 0.2502 (0.0345) 0.3354 (0.0904) 1.0973 (0.0890) 0.7524 (0.0286) 0.7936 (0.0201)
5 0.2729 (0.0344) 0.2649 (0.0385) 0.2528 (0.0404) 0.3511 (0.0757) 1.0190 (0.0525) 1.4888 (0.0282) 1.5991 (0.0217)
7.5 0.2985 (0.0573) 0.2790 (0.0554) 0.2568 (0.0339) 0.3349 (0.0655) 1.2802 (0.0439) 2.3065 (0.0444) 2.4908 (0.0401)
10 0.3222 (0.0547) 0.2898 (0.0565) 0.2723 (0.0477) 0.3642 (0.0885) 2.3363 (0.0557) 3.2143 (0.0489) 3.4669 (0.0398)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
2.5 0.2705 (0.0475) 0.2688 (0.0526) 0.2454 (0.0376) 0.3350 (0.0690) 1.5503 (0.1226) 0.7688 (0.1108) 0.8884 (0.1040)
5 0.2754 (0.0335) 0.2738 (0.0365) 0.2466 (0.0324) 0.3526 (0.0489) 1.9679 (0.0806) 1.5229 (0.0793) 1.8049 (0.0736)
7.5 0.3071 (0.0621) 0.3026 (0.0688) 0.2685 (0.0435) 0.3570 (0.0778) 2.6099 (0.1545) 2.3906 (0.1473) 2.9247 (0.1490)
10 0.3301 (0.0583) 0.3049 (0.0635) 0.2744 (0.0505) 0.3829 (0.0852) 3.2420 (0.2083) 3.3202 (0.1462) 4.1527 (0.1602)
ϵ\epsilon (%) γ\gamma-Lasso MVE MCD Sest Mest MMest
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
2.5 0.2679 (0.0339) 0.3262 (0.0610) 0.2906 (0.0567) 0.2982 (0.0619) 0.2853 (0.0615) 0.2921 (0.0616)
5 1.6199 (0.0381) 0.3583 (0.0452) 0.3454 (0.0484) 0.4059 (0.0489) 0.3016 (0.0425) 0.4032 (0.0487)
7.5 3.0849 (0.0461) 0.4562 (0.0877) 0.4326 (0.0803) 0.5805 (0.0863) 0.3549 (0.0699) 0.5768 (0.0870)
10 4.4364 (0.0688) 0.5409 (0.0799) 0.5183 (0.0767) 0.7732 (0.0839) 0.4079 (0.0681) 0.7707 (0.0831)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
2.5 0.3016 (0.0853) 0.3299 (0.0658) 0.2896 (0.0568) 0.2982 (0.0619) 0.2841 (0.0599) 0.2921 (0.0616)
5 0.4872 (0.0670) 0.3781 (0.0567) 0.3460 (0.0487) 0.4059 (0.0489) 0.3011 (0.0439) 0.4032 (0.0487)
7.5 1.0551 (0.2097) 0.4532 (0.0740) 0.4342 (0.0792) 0.5805 (0.0863) 0.3580 (0.0717) 0.5768 (0.0870)
10 2.1356 (0.4419) 0.5331 (0.0715) 0.5167 (0.0753) 0.7732 (0.0839) 0.4076 (0.0667) 0.7708 (0.0831)

7 Main proofs

We present the main proofs of Theorems 1 and 4 in this section. The main proofs of the remaining results, together with further details for all main proofs, are provided in the Supplementary Material.

At the center of our proofs is a unified strategy designed to establish error bounds for GANs. See, for example, (43) and (49). Moreover, we carefully exploit the location transformation and L1L_{1} or L2L_{2} penalties in our GAN objective functions and develop suitable concentration properties, in addition to leveraging the concavity in updating the spline discriminators, as discussed in Remarks 10 and 11.

7.1 Proof of Theorem 1

We state and prove the following result which implies Theorem 1.

Proposition 2.

Let Θ0={(μ,Σ):μp,Σ\Theta_{0}=\{(\mu,\Sigma):\mu\in\mathbb{R}^{p},\Sigma is a p×pp\times p variance matrix }\}.

(i) Assume that ff satisfies Assumption 1, and ϵ[0,ϵ0]\epsilon\in[0,\epsilon_{0}] for a constant ϵ0[0,1/2)\epsilon_{0}\in[0,1/2). Let θ¯=argminθΘ0Df(Pϵ||Pθ)\bar{\theta}=\mathrm{argmin}_{\theta\in\Theta_{0}}D_{f}(P_{\epsilon}||P_{\theta}). If Errf0(ϵ)a\mathrm{Err}_{f0}(\epsilon)\leq a for a constant a[0,1/2)a\in[0,1/2), then we have

μ¯μ2\displaystyle\|\bar{\mu}-\mu^{*}\|_{2} S1,aΣop1/2Errf0(ϵ),\displaystyle\leq S_{1,a}\|\Sigma^{*}\|_{\mathrm{op}}^{1/2}\mathrm{Err}_{f0}(\epsilon),
μ¯μ\displaystyle\|\bar{\mu}-\mu^{*}\|_{\infty} S1,aΣmax1/2Errf0(ϵ),\displaystyle\leq S_{1,a}\|\Sigma^{*}\|_{\max}^{1/2}\mathrm{Err}_{f0}(\epsilon),

where S1,a={Φ(Φ1(1/2+a))}1S_{1,a}=\{\Phi^{\prime}(\Phi^{-1}(1/2+a))\}^{-1} and Errf0(ϵ)=2(f′′(1))1f(1ϵ0)ϵ+ϵ\mathrm{Err}_{f0}(\epsilon)=\sqrt{-2(f^{{\prime\prime}}(1))^{-1}f^{\prime}(1-\epsilon_{0})\epsilon}+\epsilon. If further Errf0(ϵ)a/(1+S1,a)\mathrm{Err}_{f0}(\epsilon)\leq a/(1+S_{1,a}), then

Σ¯Σop\displaystyle\|\bar{\Sigma}-\Sigma^{*}\|_{\mathrm{op}} 2S3,aΣopErrf0(ϵ)+S3,a2Σop(Errf0(ϵ))2,\displaystyle\leq 2S_{3,a}\|\Sigma^{*}\|_{\mathrm{op}}\mathrm{Err}_{f0}(\epsilon)+S^{2}_{3,a}\|\Sigma^{*}\|_{\mathrm{op}}(\mathrm{Err}_{f0}(\epsilon))^{2}, (42)
Σ¯Σmax\displaystyle\|\bar{\Sigma}-\Sigma^{*}\|_{\max} 4S3,aΣmaxErrf0(ϵ)+2S3,a2Σmax(Errf0(ϵ))2,\displaystyle\leq{4S_{3,a}}\|\Sigma^{*}\|_{\max}\mathrm{Err}_{f0}(\epsilon)+{2S_{3,a}^{2}}\|\Sigma^{*}\|_{\max}(\mathrm{Err}_{f0}(\epsilon))^{2},

where S3,a=S2,a(1+S1,a)S_{3,a}=S_{2,a}(1+S_{1,a}), S2,a={z0/2erf(2/z0erf1(1/2+a))}1S_{2,a}=\{\sqrt{z_{0}/2}\,\mathrm{erf}^{\prime}(\sqrt{2/z_{0}}\,\mathrm{erf}^{-1}(1/2+a))\}^{-1}, and the constant z0z_{0} is defined such that erf(z0/2)=1/2\mathrm{erf}(\sqrt{z_{0}/2})=1/2. The same inequality as (42) also holds with Σ¯Σop\|\bar{\Sigma}-\Sigma^{*}\|_{\mathrm{op}} replaced by p1/2Σ¯ΣFp^{-1/2}\|\bar{\Sigma}-\Sigma^{*}\|_{\mathrm{F}}.

(ii) Let θ¯=argminθΘ0DTV(Pϵ||Pθ)\bar{\theta}=\mathrm{argmin}_{\theta\in\Theta_{0}}D_{\mathrm{TV}}(P_{\epsilon}||P_{\theta}). Then the statements in (i) hold with Errf0(ϵ)\mathrm{Err}_{f0}(\epsilon) replaced by Errh0(ϵ)=2ϵ\mathrm{Err}_{h0}(\epsilon)=2\epsilon throughout.

Proof of Proposition 2.

(i) Our main strategy is to show that the following inequalities hold:

d(θ¯,θ)Δ1(ϵ)Df(Pϵ||Pθ¯)Δ2(ϵ,f),\displaystyle d(\bar{\theta},\theta^{*})-\Delta_{1}(\epsilon)\leq\sqrt{D_{f}(P_{\epsilon}||P_{\bar{\theta}})}\leq\Delta_{2}(\epsilon,f), (43)

where Δ1(ϵ)\Delta_{1}(\epsilon) and Δ2(ϵ,f)\Delta_{2}(\epsilon,f) are bias terms, depending on ϵ\epsilon and (ϵ,f)(\epsilon,f) respectively, and d(θ¯,θ)d(\bar{\theta},\theta^{*}) is the total variation DTV(Pθ¯Pθ)D_{\mathrm{TV}}(P_{\bar{\theta}}\|P_{\theta^{*}}) or simply TV(Pθ¯,Pθ)\mathrm{TV}(P_{\bar{\theta}},P_{\theta^{*}}). Under certain conditions, d(θ¯,θ)d(\bar{\theta},\theta^{*}) delivers upper bounds, up to scaling constants, on the estimation biases to be controlled, μ¯μ\|\bar{\mu}-\mu^{*}\|_{\infty}, μ¯μ2\|\bar{\mu}-\mu^{*}\|_{2}, Σ¯Σmax\|\bar{\Sigma}-\Sigma^{*}\|_{\max}, and Σ¯Σop\|\bar{\Sigma}-\Sigma^{*}\|_{\mathrm{op}}.

(Step 1) The upper bound in (43) follows from Lemma S4 (iv): for any ff satisfying Assumption 1 and any ϵ[0,ϵ0]\epsilon\in[0,\epsilon_{0}], we have

Df(Pϵ||Pθ¯)Df(Pϵ||Pθ)f(1ϵ0)ϵ=Δ22(ϵ,f),\displaystyle D_{f}(P_{\epsilon}||P_{\bar{\theta}})\leq D_{f}(P_{\epsilon}||P_{\theta^{*}})\leq-f^{\prime}(1-\epsilon_{0})\epsilon=\Delta_{2}^{2}(\epsilon,f),

where Δ2(ϵ,f)=f(1ϵ0)ϵ\Delta_{2}(\epsilon,f)=\sqrt{-f^{\prime}(1-\epsilon_{0})\epsilon}. The constant f(1ϵ0)-f^{\prime}(1-\epsilon_{0}) is nonnegative because ff is non-increasing by Assumption 1 (ii).

(Step 2) We show the lower bound in (43) as follows:

d(θ¯,θ)\displaystyle d(\bar{\theta},\theta^{*}) TV(Pθ¯,Pϵ)+TV(Pϵ,Pθ)TV(Pθ¯,Pϵ)+Δ1(ϵ)\displaystyle\leq\mathrm{TV}(P_{\bar{\theta}},P_{\epsilon})+\mathrm{TV}(P_{\epsilon},P_{\theta^{*}})\leq\mathrm{TV}(P_{\bar{\theta}},P_{\epsilon})+\Delta_{1}(\epsilon) (44)
2(f′′(1))1Df(Pϵ||Pθ¯)+Δ1(ϵ),\displaystyle\leq\sqrt{2(f^{{\prime\prime}}(1))^{-1}D_{f}(P_{\epsilon}||P_{\bar{\theta}})}+\Delta_{1}(\epsilon), (45)

where Δ1(ϵ)=ϵ\Delta_{1}(\epsilon)=\epsilon. Line (44) follows by the triangle inequality and the fact that TV(Pϵ,Pθ)ϵTV(PQ,Pθ)ϵ\mathrm{TV}(P_{\epsilon},P_{\theta^{*}})\leq\epsilon\mathrm{TV}(P_{Q},P_{\theta^{*}})\leq\epsilon. Line (45) follows from Lemma S1: for any ff-divergence satisfying Assumption 1 (iii), we have

Df(Pϵ||Pθ¯)\displaystyle D_{f}(P_{\epsilon}||P_{\bar{\theta}}) f′′(1)2TV(Pϵ,Pθ¯)2.\displaystyle\geq\frac{f^{{\prime\prime}}(1)}{2}\mathrm{TV}(P_{\epsilon},P_{\bar{\theta}})^{2}.

The scaling constant, inft(0,1]f′′(t)/2\inf_{t\in(0,1]}f^{{\prime\prime}}(t)/2, in Lemma S1 reduces to f′′(1)/2f^{{\prime\prime}}(1)/2, because f′′f^{\prime\prime} is non-increasing by Assumption 1 (iii).

(Step 3) Combining the lower and upper bounds in (43), we have

d(θ¯,θ)\displaystyle d(\bar{\theta},\theta^{*}) 2(f′′(1))1Δ2(ϵ,f)+Δ1(ϵ)=Errf0(ϵ),\displaystyle\leq\sqrt{2(f^{{\prime\prime}}(1))^{-1}}\Delta_{2}(\epsilon,f)+\Delta_{1}(\epsilon)=\mathrm{Err}_{f0}(\epsilon),

where Errf0(ϵ)=2(f′′(1))1f(1ϵ0)ϵ+ϵ\mathrm{Err}_{f0}(\epsilon)=\sqrt{-2(f^{{\prime\prime}}(1))^{-1}f^{\prime}(1-\epsilon_{0})\epsilon}+\epsilon. The location result then follows from Proposition S4 provided that Errf0(ϵ)a\mathrm{Err}_{f0}(\epsilon)\leq a for a constant a[0,1/2)a\in[0,1/2). The variance matrix result follows if Errf0(ϵ)a/(1+S1,a)\mathrm{Err}_{f0}(\epsilon)\leq a/(1+S_{1,a}).

(ii) For the TV minimizer θ¯\bar{\theta}, Steps 1 and 2 in (i) can be combined to directly obtain an upper bound on d(θ¯,θ)d(\bar{\theta},\theta^{*}) as follows:

d(θ¯,θ)\displaystyle d(\bar{\theta},\theta^{*}) TV(Pθ¯,Pϵ)+TV(Pϵ,Pθ)\displaystyle\leq\mathrm{TV}(P_{\bar{\theta}},P_{\epsilon})+\mathrm{TV}(P_{\epsilon},P_{\theta^{*}}) (46)
2TV(Pϵ,Pθ)\displaystyle\leq 2\mathrm{TV}(P_{\epsilon},P_{\theta^{*}}) (47)
2ϵ.\displaystyle\leq 2\epsilon. (48)

Line (46) is due to the triangle inequality. Line (47) follows because TV(Pθ¯,Pϵ)TV(Pθ,Pϵ)=TV(Pϵ,Pθ)\mathrm{TV}(P_{\bar{\theta}},P_{\epsilon})\leq\mathrm{TV}(P_{\theta^{*}},P_{\epsilon})=\mathrm{TV}(P_{\epsilon},P_{\theta^{*}}) by the definition of θ¯\bar{\theta} and the symmetry of TV. Line (48) follows because TV(Pϵ,Pθ)ϵTV(PQ,Pθ)ϵ\mathrm{TV}(P_{\epsilon},P_{\theta^{*}})\leq\epsilon\mathrm{TV}(P_{Q},P_{\theta^{*}})\leq\epsilon as in (44).

Given the upper bound on d(θ¯,θ)d(\bar{\theta},\theta^{*}), the location result then follows from Proposition S4 provided that Errh0(ϵ)a\mathrm{Err}_{h0}(\epsilon)\leq a for a constant a[0,1/2)a\in[0,1/2). The variance matrix result follows if Errh0(ϵ)a/(1+S1,a)\mathrm{Err}_{h0}(\epsilon)\leq a/(1+S_{1,a}). ∎

7.2 Proof of Theorem 4

We state and prove the following result which implies Theorem 4. See the Supplement for details about how Proposition 3 implies Theorem 4. For δ(0,1/7)\delta\in(0,1/7), define

λ11=2log(5p)+log(δ1)n+2log(5p)+log(δ1)n,\displaystyle\lambda_{11}=\sqrt{\frac{2\log({5p})+\log(\delta^{-1})}{n}}+\frac{2\log({5p})+\log(\delta^{-1})}{n},
λ12=2Crad4log(2p(p+1))n+2log(δ1)n,\displaystyle\lambda_{12}=2C_{\mathrm{rad4}}\sqrt{\frac{\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}},

where Crad4=Csg6Crad3C_{\mathrm{rad4}}=C_{\mathrm{sg6}}C_{\mathrm{rad3}}, depending on universal constants Csg6C_{\mathrm{sg6}} and Crad3C_{\mathrm{rad3}} in Lemma S26 and Corollary S2 in the Supplement. Denote

Errh1(n,p,δ,ϵ)=3ϵ+2ϵ/(nδ)+λ12+λ1.\displaystyle\mathrm{Err}_{h1}(n,p,\delta,\epsilon)=3\epsilon+2\sqrt{\epsilon/(n\delta)}+\lambda_{12}+\lambda_{1}.
Proposition 3.

Assume that ΣmaxM1\|\Sigma^{*}\|_{\max}\leq M_{1} and ϵ1/5\epsilon\leq 1/5. Let θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) be a solution to (28) with λ1Csp13M11λ11\lambda_{1}\geq C_{\mathrm{sp13}}M_{11}\lambda_{11} where M11=M11/2(M11/2+22π)M_{11}=M_{1}^{1/2}(M_{1}^{1/2}+2\sqrt{2\pi}) and Csp13=(5/3)(Csp11Csp12)C_{\mathrm{sp13}}=(5/3)(C_{\mathrm{sp11}}\vee C_{\mathrm{sp12}}), depending on universal constants Csp11C_{\mathrm{sp11}} and Csp12C_{\mathrm{sp12}} in Lemma S3 in the Supplement. If ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5 and Errh1(n,p,δ,ϵ)a\mathrm{Err}_{h1}(n,p,\delta,\epsilon)\leq a for a constant a(0,1/2)a\in(0,1/2), then the following holds with probability at least 17δ1-7\delta uniformly over contamination distribution QQ,

μ^μ\displaystyle\|\hat{\mu}-\mu^{*}\|_{\infty} S4,aErrh1(n,p,δ,ϵ),\displaystyle{\leq}S_{4,a}\mathrm{Err}_{h1}(n,p,\delta,\epsilon),
Σ^Σmax\displaystyle\|\hat{\Sigma}-\Sigma^{*}\|_{\max} S8,aErrh1(n,p,δ,ϵ),\displaystyle{\leq}S_{8,a}\mathrm{Err}_{h1}(n,p,\delta,\epsilon),

where S4,a=(1+2M1log212a)/aS_{4,a}=(1+\sqrt{2M_{1}\log\frac{2}{1-2a}})/a and S8,a=2M11/2S6,a+S7(1+S4,a+S6,a)S_{8,a}=2M_{1}^{1/2}S_{6,a}+S_{7}(1+S_{4,a}+S_{6,a}) with S6,a=S5(1+S4,a/2)S_{6,a}=S_{5}(1+S_{4,a}/2), S5=22π(1e2/M1)1S_{5}=2\sqrt{2\pi}(1-\mathrm{e}^{-2/M_{1}})^{-1}, and S7=4{(12πM1e1/(8M1))(12e1/(8M1))}2S_{7}=4\{(\frac{1}{\sqrt{2\pi M_{1}}}\mathrm{e}^{-1/(8M_{1})})\vee(1-2\mathrm{e}^{-1/(8M_{1})})\}^{-2}.

Proof of Proposition 3.

The main strategy of our proof is to show that the following inequalities hold with high probabilities,

d(θ^,θ)Δ12maxγΓ{KHG(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}Δ11,\displaystyle d(\hat{\theta},\theta^{*})-\Delta_{12}\leq\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}\leq\Delta_{11}, (49)

where Δ11\Delta_{11} and Δ12\Delta_{12} are error terms, and d(θ^,θ)d(\hat{\theta},\theta^{*}) is a moment matching term, which under certain conditions delivers upper bounds, up to scaling constants, on the estimation errors to be controlled, μ^μ\|\hat{\mu}-\mu^{*}\|_{\infty} and Σ^Σmax\|\hat{\Sigma}-\Sigma^{*}\|_{\max}.

(Step 1) For the upper bound in (49), we show that with probability at least 15δ1-5\delta,

maxγΓ{KHG(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}
maxγΓ{KHG(Pn,Pθ;hγ,μ)λ1pen1(γ)}\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\} (50)
maxγΓ{Δ11+pen1(γ)Δ~11λ1pen1(γ)}.\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{\Delta_{11}+\mathrm{pen}_{1}(\gamma)\tilde{\Delta}_{11}-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}. (51)

Inequality (50) follows from the definition of θ^\hat{\theta}. Inequality (51) follows from Proposition S13: it holds with probability at least 15δ1-5\delta that for any γΓ\gamma\in\Gamma,

KHG(Pn,Pθ;hγ,μ)Δ11+pen1(γ)Δ~11,\displaystyle K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})\leq\Delta_{11}+\mathrm{pen}_{1}(\gamma)\tilde{\Delta}_{11},

where Δ11=2(ϵ+ϵ/(nδ))\Delta_{11}=2(\epsilon+\sqrt{\epsilon/(n\delta)}) and Δ~11=Csp13M11λ11\tilde{\Delta}_{11}=C_{\mathrm{sp13}}M_{11}\lambda_{11}, with λ11\lambda_{11} as defined before Proposition 3.

From (50)–(51), the upper bound in (49) holds with probability at least 15δ1-5\delta, provided that the tuning parameter λ1\lambda_{1} is chosen such that λ1Δ~11\lambda_{1}\geq\tilde{\Delta}_{11}.

(Step 2) For the lower bound in (49), we show that with probability at least 12δ1-2\delta,

maxγΓ{KHG(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}
maxγΓ0{KHG(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}\displaystyle\geq\max_{\gamma\in\Gamma_{0}}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\} (52)
maxγΓ0{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~12λ1.\displaystyle\geq\max_{\gamma\in\Gamma_{0}}\;\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{12}-\lambda_{1}. (53)

Inequality (52) holds provided that Γ0\Gamma_{0} is a subset of Γ\Gamma.

Take Γ0={γΓrp:γ0=0,pen1(γ)=1}\Gamma_{0}=\{\gamma\in\Gamma_{\mathrm{rp}}:\gamma_{0}=0,\mathrm{pen}_{1}(\gamma)=1\}, where Γrp\Gamma_{\mathrm{rp}} is the subset of Γ\Gamma associated with pairwise ramp functions as in the proof of Theorem 2. Inequality (53) follows from Proposition S14 because hγ,μ^(x)[1,1]h_{\gamma,\hat{\mu}}(x)\in[-1,1] for γΓ0\gamma\in\Gamma_{0}, and hence the hinge loss reduces to a moment matching term: it holds with probability at least 12δ1-2\delta that for any γΓ0\gamma\in\Gamma_{0},

KHG(Pn,Pθ^;hγ,μ^){EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~12\displaystyle K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{12}

where Δ~12=ϵ+λ12\tilde{\Delta}_{12}=\epsilon+\lambda_{12}, with λ12\lambda_{12} as defined before Proposition 3.

From (52)–(53), the lower bound in (49) holds with probability at least 12δ1-2\delta, where Δ12=Δ~12+λ1\Delta_{12}=\tilde{\Delta}_{12}+\lambda_{1} and d(θ^,θ)=maxγΓ0{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}d(\hat{\theta},\theta^{*})=\max_{\gamma\in\Gamma_{0}}\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\}.

(Step 3) We complete the proof by relating the moment matching term d(θ^,θ)d(\hat{\theta},\theta^{*}) to the estimation error between θ^\hat{\theta} and θ\theta^{*}. First, combining the lower and upper bounds in (49) shows that with probability at least 17δ1-7\delta,

maxγΓrp,pen1(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Errh1(n,p,δ,ϵ).\displaystyle\max_{\gamma\in\Gamma_{\mathrm{rp}},\mathrm{pen}_{1}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq\mathrm{Err}_{h1}(n,p,\delta,\epsilon). (54)

where

Errh1(n,p,δ,ϵ)\displaystyle\mathrm{Err}_{h1}(n,p,\delta,\epsilon) =3ϵ+2ϵ/(nδ)+λ12+λ1.\displaystyle=3\epsilon+2\sqrt{\epsilon/(n\delta)}+\lambda_{12}+\lambda_{1}.

The desired result then follows from Proposition S7: provided Errh1(n,p,δ,ϵ)a{\mathrm{Err}_{h1}}(n,p,\delta,\epsilon)\leq a, inequality (54) implies that

μ^μS4,aErrh1(n,p,δ,ϵ),Σ^ΣmaxS8,aErrh1(n,p,δ,ϵ).\displaystyle\|\hat{\mu}-\mu^{*}\|_{\infty}\leq S_{4,a}{\mathrm{Err}_{h1}}(n,p,\delta,\epsilon),\quad\|\hat{\Sigma}-\Sigma^{*}\|_{\max}\leq S_{8,a}{\mathrm{Err}_{h1}}(n,p,\delta,\epsilon).

Supplement to “Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation in Contaminated Gaussian Models”. The Supplement provides additional information for the simulation studies, the main proofs and technical details, and auxiliary lemmas and technical tools.

References

  • Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B 28 131–142.
  • Balmand, S. and Dalalyan, A. (2015). Convex programming approach to robust estimation of a multivariate Gaussian model. arXiv preprint arXiv:1512.04734.
  • Basu, A. and Lindsay, B. G. (1994). Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics 46 683–705.
  • Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85 549–559.
  • Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics 5 445–463.
  • Buja, A., Stuetzle, W. and Shen, Y. (2005). Loss functions for binary class probability estimation and classification: structure and applications. Technical report, University of Pennsylvania.
  • Butler, R. W., Davies, P. L. and Jhun, M. (1993). Asymptotics for the minimum covariance determinant estimator. Annals of Statistics 1385–1400.
  • Chen, M., Gao, C. and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber’s contamination model. Annals of Statistics 46 1932–1960.
  • Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2 229–318.
  • Davies, P. L. (1987). Asymptotic behaviour of S-estimates of multivariate location parameters and dispersion matrices. Annals of Statistics 1269–1292.
  • Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A. and Stewart, A. (2019). Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing 48 742–864.
  • Donoho, D. L. and Liu, R. C. (1988). The “automatic” robustness of minimum distance functionals. Annals of Statistics 16 552–586.
  • Farnia, F. and Ozdaglar, A. (2020). Do GANs always have Nash equilibria? In International Conference on Machine Learning 3029–3039. PMLR.
  • Fu, A., Narasimhan, B. and Boyd, S. (2020). CVXR: An R package for disciplined convex optimization. Journal of Statistical Software 94 1–34.
  • Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis 99 2053–2081.
  • Gao, C., Yao, Y. and Zhu, W. (2020). Generative adversarial nets for robust scatter estimation: a proper scoring rule perspective. Journal of Machine Learning Research 21(160).
  • Gao, C., Liu, J., Yao, Y. and Zhu, W. (2019). Robust estimation and generative adversarial networks. In 7th International Conference on Learning Representations, ICLR 2019.
  • Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 359–378.
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc.
  • Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69 383–393.
  • Hirose, K., Fujisawa, H. and Sese, J. (2017). Robust sparse Gaussian graphical modeling. Journal of Multivariate Analysis 161 172–190.
  • Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35 73–101.
  • Huber, P. J. (1981). Robust Statistics. Wiley Series in Probability and Mathematical Statistics.
  • Huszár, F. (2016). An alternative update rule for generative adversarial networks. Blog post.
  • Jin, C., Netrapalli, P. and Jordan, M. (2020). What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning 4880–4889. PMLR.
  • Jones, M. C., Hjort, N. L., Harris, I. R. and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika 88 865–873.
  • Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30 81–93.
  • Lai, K. A., Rao, A. B. and Vempala, S. (2016). Agnostic estimation of mean and covariance. In IEEE 57th Annual Symposium on Foundations of Computer Science 665–674. IEEE.
  • Lim, J. H. and Ye, J. C. (2017). Geometric GAN. arXiv preprint arXiv:1705.02894.
  • Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Annals of Statistics 22 1081–1114.
  • Liu, Z. and Loh, P.-L. (2021). Robust W-GAN-based estimation under Wasserstein contamination. arXiv preprint arXiv:2101.07969.
  • Liu, H., Han, F., Yuan, M., Lafferty, J. and Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. Annals of Statistics 40 2293–2326.
  • Loh, P.-L. and Tan, X. L. (2018). High-dimensional robust precision matrix estimation: cellwise corruption under ϵ-contamination. Electronic Journal of Statistics 12 1429–1467.
  • Lopuhaa, H. P. (1989). On the relation between S-estimators and M-estimators of multivariate location and covariance. Annals of Statistics 1662–1683.
  • Miyamura, M. and Kano, Y. (2006). Robust Gaussian graphical modeling. Journal of Multivariate Analysis 97 1525–1550.
  • Nguyen, X., Wainwright, M. J. and Jordan, M. I. (2009). On surrogate loss functions and f-divergences. Annals of Statistics 37 876–904.
  • Nguyen, X., Wainwright, M. J. and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory 56 5847–5861.
  • Nowozin, S., Cseke, B. and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
  • Öllerer, V. and Croux, C. (2015). Robust high-dimensional precision matrix estimation. In Modern Nonparametric, Robust and Multivariate Methods 325–350. Springer.
  • Paindaveine, D. and Van Bever, G. (2018). Halfspace depths for scatter, concentration and shape matrices. Annals of Statistics 46 3276–3307.
  • Rocke, D. M. (1996). Robustness properties of S-estimators of multivariate location and shape in high dimension. Annals of Statistics 1327–1345.
  • Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. Mathematical Statistics and Applications 8 37.
  • Rousseeuw, P. J. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association 88 1273–1283.
  • Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association 66 783–801.
  • Spearman, C. (1987). The proof and measurement of association between two things. The American Journal of Psychology 100 441–471.
  • Tamura, R. N. and Boos, D. D. (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association 81 223–229.
  • Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika 107 137–158.
  • Tan, Z., Song, Y. and Ou, Z. (2019). Calibrated adversarial algorithms for generative modelling. Stat 8 e224.
  • Tan, Z. and Zhang, X. (2020). On loss functions and regret bounds for multi-category classification. arXiv preprint arXiv:2005.08155.
  • Tarr, G., Müller, S. and Weber, N. C. (2016). Robust estimation of precision matrices under cellwise contamination. Computational Statistics & Data Analysis 93 404–420.
  • Tatsuoka, K. S. and Tyler, D. E. (2000). On the uniqueness of S and M-functionals under nonelliptical distributions. Annals of Statistics 1219–1243.
  • Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975 2 523–531.
  • Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Annals of Statistics 234–251.
  • Windham, M. P. (1995). Robustifying model fitting. Journal of the Royal Statistical Society: Series B 599–609.
  • Wu, K., Ding, G. W., Huang, R. and Yu, Y. (2020). On minimax optimality of GANs for robust mean estimation. In International Conference on Artificial Intelligence and Statistics 4541–4551. PMLR.
  • Xue, L. and Zou, H. (2012). Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Annals of Statistics 40 2541–2571.
  • Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. Annals of Statistics 642–656.
  • Zhang, J. (2002). Some extensions of Tukey’s depth function. Journal of Multivariate Analysis 82 134–165.
  • Zhao, J., Mathieu, M. and LeCun, Y. (2017). Energy-based generative adversarial networks. In International Conference on Learning Representations.
  • Zhu, B., Jiao, J. and Tse, D. (2020). Deconstructing generative adversarial networks. IEEE Transactions on Information Theory 66 7155–7179.

Supplementary Material for
“Tractable and Near-Optimal Adversarial Algorithms for
Robust Estimation in Contaminated Gaussian Models”

Ziyue Wang and Zhiqiang Tan

The Supplementary Material contains Appendices I-V on additional information for simulation studies, main proofs and details, auxiliary lemmas, and technical tools.

I Additional information for simulation studies

I.1 Implementation of proposed methods

The detailed algorithm implementing our logit ff-GANs and hinge GAN is shown as Algorithm 2. In our experiments, the learning rate αt\alpha_{t} is set with α0=1\alpha_{0}=1, and the number of iterations is set to 100100. This choice leads to stable convergence while keeping the running time relatively short.

For the rKL logit ff-GAN, we modify the unpenalized objective KrKLK_{\mathrm{rKL}} to min(KrKL,10)\min(K_{\mathrm{rKL}},10). This modification helps stabilize the initial steps of training, where the fake data and real data can be almost separable depending on the initial value of θ\theta.

Require: a penalized GAN objective function K(θ,γ;λ)K(\theta,\gamma;\lambda) as in (37) for logit ff-GAN, or with KfK_{f} replaced by KHGK_{\mathrm{HG}} for hinge GAN; a decaying learning rate αt=α0exp(0.05t)\alpha_{t}=\alpha_{0}\exp(-0.05t); a base penalty level λ0\lambda_{0} so that λ1=λ0log(p)/n\lambda_{1}=\lambda_{0}\sqrt{\log(p)/n}, λ2=λ0p/n\lambda_{2}=\lambda_{0}\sqrt{p/n}, and λ3=λ0p2/n\lambda_{3}=\lambda_{0}\sqrt{p^{2}/n}.
Initialization: initialize μ0\mu_{0} by the coordinatewise median of the real data xx, and Σ0\Sigma_{0} by S^K^S^\hat{S}\hat{K}\hat{S}, where K^\hat{K} is Kendall’s τ\tau correlation coefficient matrix and S^\hat{S} is a diagonal matrix with the MAD scale estimates on the diagonal.
Repeat:
Sampling: draw (Z1,,Zn)(Z_{1},\ldots,Z_{n}) from N(0,I)\mathrm{N}(0,I) and approximate Pθt1P_{\theta_{t-1}} by the empirical distribution on the fake data ξi=μt1+Σt11/2zi\xi_{i}=\mu_{t-1}+\Sigma_{t-1}^{1/2}z_{i}, i=1,,ni=1,\ldots,n.
Updating: compute γt=argmaxγK(θt1,γ;λ)\gamma_{t}=\mathrm{argmax}_{\gamma}K(\theta_{t-1},\gamma;\lambda) using the optimizer MOSEK, and compute θt=θt1αtθK(θ,γt;λ)|θt1\theta_{t}=\theta_{t-1}-\alpha_{t}\nabla_{\theta}K(\theta,\gamma_{t};\lambda)|_{\theta_{t-1}} by gradient descent, where the gradient θK(θ,γt;λ)|θt1\nabla_{\theta}K(\theta,\gamma_{t};\lambda)|_{\theta_{t-1}} is clipped at 0.10.1 in the LL_{\infty} norm.
Until convergence or the maximum number of iterations is reached.
Algorithm 2: Penalized logit ff-GAN or hinge GAN (in detail)
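For illustration, the following is a minimal Python sketch of Algorithm 2 for the hinge GAN with main-effect ramp features and the L1L_{1} penalty only. The concave discriminator update is solved with cvxpy (MOSEK can be attached as the solver), and the generator step uses a finite-difference gradient. The simplifications here (ramp features only, a Cholesky factor in place of Σ1/2\Sigma^{1/2}, the sine transform of Kendall's τ\tau, numerical gradients, and the particular normalization of the hinge objective) are assumptions of this sketch, not the implementation used in the experiments.

import numpy as np
import cvxpy as cp
from scipy.stats import kendalltau

def ramp(t):
    # ramp(t) = ((t+1)_+ - (t-1)_+)/2: 0 for t <= -1, 1 for t >= 1
    return 0.5 * (np.maximum(t + 1.0, 0.0) - np.maximum(t - 1.0, 0.0))

def robust_init(x):
    # mu0: coordinatewise median; Sigma0 = S K S with MAD scales
    # (1.4826: Gaussian consistency factor) and Kendall's tau
    # correlations; the sine transform is an assumption of this sketch
    mu0 = np.median(x, axis=0)
    s = 1.4826 * np.median(np.abs(x - mu0), axis=0)
    p = x.shape[1]
    K = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau = kendalltau(x[:, i], x[:, j])[0]
            K[i, j] = K[j, i] = np.sin(0.5 * np.pi * tau)
    return mu0, np.diag(s) @ K @ np.diag(s)

def disc_step(f_real, f_fake, lam1):
    # Full discriminator update: h_gamma is linear in gamma, so this
    # penalized hinge objective (a common two-sided hinge form; the
    # paper's normalization may differ) is concave in gamma and the
    # maximizer is global, e.g. via solver=cp.MOSEK
    g0, g1 = cp.Variable(), cp.Variable(f_real.shape[1])
    obj = (2.0
           - cp.sum(cp.pos(1.0 - (g0 + f_real @ g1))) / f_real.shape[0]
           - cp.sum(cp.pos(1.0 + (g0 + f_fake @ g1))) / f_fake.shape[0]
           - lam1 * cp.norm1(g1))
    cp.Problem(cp.Maximize(obj)).solve()
    return g0.value, g1.value

def numgrad(f, v, eps=1e-4):
    # Central finite differences, a simple stand-in for an analytic gradient
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2.0 * eps)
    return g

def fit_hinge_gan(x, lam0=0.1, iters=100, clip=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = x.shape
    lam1 = lam0 * np.sqrt(np.log(p) / n)
    mu, Sigma = robust_init(x)
    # small ridge for numerical stability; a PSD projection may be
    # needed in general (assumption of this sketch)
    L = np.linalg.cholesky(Sigma + 1e-6 * np.eye(p))
    for t in range(iters):
        lr = np.exp(-0.05 * t)  # alpha_t with alpha_0 = 1
        z = rng.standard_normal((n, p))
        fake = mu + z @ L.T
        g0, g1 = disc_step(ramp(x - mu), ramp(fake - mu), lam1)

        def K_of(theta):  # objective as a function of theta = (mu, L)
            m, A = theta[:p], theta[p:].reshape(p, p)
            fk = m + z @ A.T
            h_r = g0 + ramp(x - m) @ g1
            h_f = g0 + ramp(fk - m) @ g1
            return (2.0 - np.mean(np.maximum(1.0 - h_r, 0.0))
                        - np.mean(np.maximum(1.0 + h_f, 0.0)))

        theta = np.concatenate([mu, L.ravel()])
        theta -= lr * np.clip(numgrad(K_of, theta), -clip, clip)  # L_inf clip
        mu, L = theta[:p], theta[p:].reshape(p, p)
    return mu, L @ L.T

Extending the feature map to the pairwise spline terms and adding the λ2\lambda_{2}, λ3\lambda_{3} penalties follows the same pattern: the discriminator problem stays concave because the discriminator remains linear in its coefficients.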

I.2 Tuning penalty levels

We conducted tuning experiments to identify base penalty levels λ0\lambda_{0} which are expected to work reasonably well in various settings for our logit ff-GANs and hinge GAN, where the dependency on (p,n)(p,n) is already absorbed in the penalty parameters λ1,λ2,λ3\lambda_{1},\lambda_{2},\lambda_{3}. In the tuning experiments, we tried two contamination proportions ϵ\epsilon and two choices of contamination distributions QQ as described in Section 6.2. Results are collected from 20 repeated experiments on a grid of penalty levels for each method.

As shown in Figure S1, although the average estimation error varies as the contamination setting changes, there is a consistent and stable range of the penalty level λ0\lambda_{0} which leads to approximately the best performance for each method, with either L1L_{1} or L2L_{2} penalty used. We manually pick λ0=0.1\lambda_{0}=0.1 for the hinge GAN, λ0=0.025\lambda_{0}=0.025 for the JS logit ff-GAN, and λ0=0.3\lambda_{0}=0.3 for the rKL logit ff-GAN, which are then fixed in all subsequent simulations. It is desirable in future work to design and study automatic tuning of penalty levels for GANs.
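As a sketch of this tuning protocol, the grid search amounts to averaging estimation errors over repeated simulated data sets for each candidate λ0\lambda_{0}. Here simulate and fit are user-supplied stand-ins (for example, fit_hinge_gan from the sketch in Section I.1), not functions from our code base.

import numpy as np

def tune_lambda0(simulate, fit, grid, reps=20):
    # Average max-norm error of Sigma-hat over `reps` simulated data
    # sets for each candidate base penalty level lambda0
    avg_err = {}
    for lam0 in grid:
        errs = []
        for r in range(reps):
            x, Sigma_true = simulate(seed=r)
            _, Sigma_hat = fit(x, lam0=lam0)
            errs.append(np.max(np.abs(Sigma_hat - Sigma_true)))
        avg_err[lam0] = float(np.mean(errs))
    return min(avg_err, key=avg_err.get), avg_err

# e.g. tune_lambda0(simulate, fit, grid=[0.0, 0.1, 0.2, 0.3, 0.4])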

Figure S1: Tuning the penalty level. In this setting p=5p=5, n=500n=500, ϵ{0.05,0.1}\epsilon\in\{0.05,0.1\}, and contamination distribution QQ is either (A) N(2.25c,(1/3)Ip)\mathrm{N}(2.25c,(1/3)I_{p}) or (B) N(51p,5Ip)\mathrm{N}(51_{p},5I_{p}). Penalty levels for the hinge GAN are {0,0.1,0.2,0.3,0.4}\{0,0.1,0.2,0.3,0.4\}, penalty levels for the JS logit ff-GAN are {0,0.025,0.05,0.075,0.1}\{0,0.025,0.05,0.075,0.1\}, and penalty levels for the rKL logit ff-GAN are {0.1,0.3,0.5,0.7,0.9}\{0.1,0.3,0.5,0.7,0.9\}.

I.3 Error dependency on nn and pp

Tables S1–S2 show the performance of various methods depending on the sample size nn for the two choices of contamination in Section 6.2. We fix the dimension p=10p=10 and ϵ=0.1\epsilon=0.1 and increase nn from 500500 to 40004000. Tables S3–S4 show how the performance of the methods depends on the dimension pp for the two choices of contamination. We fix ϵ=0.1\epsilon=0.1 and n=2000n=2000 and increase pp from 55 to 1515. Estimation errors are measured in the maximum norm and the Frobenius norm.

For all methods considered, the estimation errors decrease as nn increases. As pp increases, however, the estimation errors seem to be affected to a lesser extent when measured in the maximum norm. This is expected because an error rate log(p)/n\sqrt{\log(p)/n} (ϵ\epsilon term aside) has been established for our three L1L_{1} penalized methods as well as for Kendall's τ\tau and Spearman's ρ\rho (\citeappendLT18). When measured in the Frobenius norm, the estimation errors increase as pp increases, which is also expected. In Table S3 with QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p}), the γ\gamma-Lasso outperforms our methods in the setting of p=15p=15, and even performs better there than in the settings with smaller pp. A possible explanation is that the default choice γ=0.05\gamma=0.05 may be near optimal for this particular setting of p=15p=15, but much less suitable in other settings.

Table S1: Comparison of existing methods and proposed L1L_{1} penalized GAN methods (p=10,ϵ=0.1p=10,\epsilon=0.1). Estimation error of the variance matrix is reported in the maximum norm max\|\cdot\|_{\max}.
nn hinge GAN JS logit ff-GAN rKL logit ff-GAN GYZ JS-GAN Tyler_M Kendall_MAD Spearman_Qn
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
500 0.1740 (0.0529) 0.1775 (0.0496) 0.1418 (0.0335) 0.3357 (0.1771) 0.4211 (0.0570) 0.5304 (0.0824) 0.5074 (0.0555)
1000 0.1238 (0.0325) 0.1134 (0.0292) 0.1027 (0.0212) 0.1982 (0.0917) 0.3830 (0.0320) 0.4846 (0.0735) 0.4715 (0.0431)
2000 0.0980 (0.0222) 0.0737 (0.0207) 0.0720 (0.0151) 0.1029 (0.0467) 0.3526 (0.0226) 0.4133 (0.0264) 0.4362 (0.0256)
4000 0.0848 (0.0184) 0.0528 (0.0124) 0.0502 (0.0105) 0.0770 (0.0241) 0.3351 (0.0166) 0.3948 (0.0259) 0.4117 (0.0186)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
500 0.1855 (0.0536) 0.1970 (0.0498) 0.1548 (0.0353) 0.2792 (0.1706) 0.6061 (0.0579) 0.5200 (0.0846) 0.6393 (0.0807)
1000 0.1382 (0.0331) 0.1231 (0.0324) 0.1230 (0.0266) 0.1618 (0.0615) 0.5734 (0.0429) 0.4724 (0.0732) 0.6187 (0.0578)
2000 0.1050 (0.0237) 0.0802 (0.0225) 0.0935 (0.0219) 0.1275 (0.0453) 0.5295 (0.0332) 0.4074 (0.0278) 0.5693 (0.0358)
4000 0.0903 (0.0180) 0.0590 (0.0141) 0.0739 (0.0167) 0.0881 (0.0285) 0.5154 (0.0225) 0.3863 (0.0251) 0.5481 (0.0284)
nn γ\gamma-LASSO MVE MCD Sest Mest MMest
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
500 0.5786 (0.0288) 0.2270 (0.0518) 0.2175 (0.0516) 0.2819 (0.0597) 0.1900 (0.0461) 0.2805 (0.0573)
1000 0.5426 (0.0261) 0.1877 (0.0362) 0.1883 (0.0397) 0.2650 (0.0422) 0.1551 (0.0394) 0.2630 (0.0413)
2000 0.5277 (0.0154) 0.1637 (0.0269) 0.1618 (0.0257) 0.2358 (0.0292) 0.1292 (0.0260) 0.2349 (0.0292)
4000 0.5150 (0.0139) 0.1457 (0.0211) 0.1471 (0.0222) 0.2206 (0.0221) 0.1127 (0.0218) 0.2198 (0.0219)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
500 0.3472 (0.1021) 0.2179 (0.0542) 0.2190 (0.0508) 0.2819 (0.0597) 0.1853 (0.0481) 0.2805 (0.0573)
1000 0.3118 (0.0843) 0.1953 (0.0435) 0.1887 (0.0397) 0.2650 (0.0422) 0.1547 (0.0399) 0.2630 (0.0412)
2000 0.2935 (0.0619) 0.1685 (0.0303) 0.1617 (0.0251) 0.2358 (0.0292) 0.1284 (0.0255) 0.2349 (0.0291)
4000 0.2627 (0.0422) 0.1511 (0.0216) 0.1470 (0.0222) 0.2206 (0.0221) 0.1132 (0.0217) 0.2198 (0.0219)
Table S2: Comparison of existing methods and proposed L2L_{2} penalized GAN methods (p=10,ϵ=0.1p=10,\epsilon=0.1). Estimation error of the variance matrix is reported in the Frobenius norm F\|\cdot\|_{\mathrm{F}}.
nn hinge GAN JS logit ff-GAN rKL logit ff-GAN GYZ JS-GAN Tyler_M Kendall_MAD Spearman_Qn
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
500 0.6855 (0.1138) 0.7519 (0.1238) 0.8294 (0.1381) 0.9512 (0.3576) 2.4034 (0.1264) 3.3007 (0.1748) 3.5111 (0.1224)
1000 0.4455 (0.0645) 0.4304 (0.0683) 0.4197 (0.0734) 0.6198 (0.2318) 2.3129 (0.0668) 3.2204 (0.0862) 3.4565 (0.0645)
2000 0.3222 (0.0547) 0.2898 (0.0565) 0.2723 (0.0477) 0.3642 (0.0885) 2.3363 (0.0557) 3.2143 (0.0489) 3.4669 (0.0398)
4000 0.2431 (0.0438) 0.1948 (0.0380) 0.1877 (0.0346) 0.2797 (0.0738) 2.3165 (0.0405) 3.2030 (0.0417) 3.4472 (0.0219)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
500 0.7157 (0.1151) 0.8054 (0.1400) 0.7190 (0.0970) 0.8889 (0.3253) 3.3388 (0.2880) 3.3704 (0.2784) 4.1824 (0.2852)
1000 0.4589 (0.0695) 0.4589 (0.0850) 0.3942 (0.0606) 0.5367 (0.1282) 3.2529 (0.2480) 3.3263 (0.1921) 4.1522 (0.2150)
2000 0.3301 (0.0583) 0.3049 (0.0635) 0.2744 (0.0505) 0.3829 (0.0852) 3.2420 (0.2083) 3.3202 (0.1462) 4.1527 (0.1602)
4000 0.2472 (0.0458) 0.2076 (0.0419) 0.1990 (0.0334) 0.2986 (0.0536) 3.1988 (0.1288) 3.2955 (0.1269) 4.1117 (0.1182)
nn γ\gamma-LASSO MVE MCD Sest Mest MMest
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
500 4.4395 (0.0997) 0.7574 (0.1373) 0.7448 (0.1249) 0.9269 (0.1562) 0.6615 (0.1164) 0.9190 (0.1562)
1000 4.4150 (0.0618) 0.6005 (0.1132) 0.5843 (0.0960) 0.8262 (0.1130) 0.4894 (0.0948) 0.8206 (0.1137)
2000 4.4364 (0.0688) 0.5416 (0.0816) 0.5170 (0.0754) 0.7732 (0.0839) 0.4078 (0.0670) 0.7707 (0.0831)
4000 4.4204 (0.0444) 0.4663 (0.0613) 0.4542 (0.0623) 0.7232 (0.0714) 0.3453 (0.0577) 0.7231 (0.0711)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
500 2.0989 (0.7667) 0.7577 (0.1291) 0.7497 (0.1272) 0.9269 (0.1562) 0.6581 (0.1176) 0.9190 (0.1562)
1000 1.9989 (0.6520) 0.6021 (0.1016) 0.5819 (0.0959) 0.8263 (0.1131) 0.4889 (0.0988) 0.8211 (0.1140)
2000 2.1356 (0.4419) 0.5314 (0.0835) 0.5163 (0.0754) 0.7732 (0.0839) 0.4083 (0.0675) 0.7708 (0.0831)
4000 1.9796 (0.3431) 0.4740 (0.0628) 0.4541 (0.0621) 0.7232 (0.0714) 0.3460 (0.0574) 0.7232 (0.0711)
Table S3: Comparison of existing methods and proposed L1L_{1} penalized GAN methods (n=2000,ϵ=0.1n=2000,\epsilon=0.1). Estimation error of the variance matrix is reported in the maximum norm max\|\cdot\|_{\max}.
pp hinge GAN JS logit ff-GAN rKL logit ff-GAN GYZ JS-GAN Tyler_M Kendall_MAD Spearman_Qn
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
5 0.1179 (0.0220) 0.0561 (0.0147) 0.0576 (0.0138) 0.1315 (0.0845) 0.2153 (0.0231) 0.3880 (0.0342) 0.4050 (0.0260)
10 0.0980 (0.0222) 0.0737 (0.0207) 0.0720 (0.0151) 0.1029 (0.0467) 0.3526 (0.0226) 0.4133 (0.0264) 0.4362 (0.0256)
15 0.0929 (0.0219) 0.0905 (0.0222) 0.0761 (0.0130) 0.1586 (0.0741) 0.5856 (0.0194) 0.4265 (0.0337) 0.4377 (0.0227)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
5 0.1344 (0.0230) 0.0674 (0.0189) 0.0991 (0.0236) 0.6339 (1.7289) 0.4503 (0.0286) 0.3803 (0.0225) 0.5381 (0.0371)
10 0.1050 (0.0237) 0.0802 (0.0225) 0.0935 (0.0219) 0.1275 (0.0453) 0.5295 (0.0332) 0.4074 (0.0278) 0.5693 (0.0358)
15 0.1021 (0.0252) 0.0957 (0.0232) 0.0925 (0.0180) 0.1294 (0.0432) 0.5769 (0.0251) 0.4207 (0.0322) 0.5782 (0.0301)
pp γ\gamma-LASSO MVE MCD Sest Mest MMest
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
5 0.4866 (0.0197) 0.2107 (0.0270) 0.1973 (0.0246) 0.2305 (0.0267) 0.1398 (0.0296) 0.2226 (0.0243)
10 0.5277 (0.0154) 0.1662 (0.0253) 0.1608 (0.0241) 0.2358 (0.0292) 0.1286 (0.0269) 0.2349 (0.0292)
15 0.5556 (0.0204) 0.1473 (0.0272) 0.1430 (0.0246) 0.3918 (0.1740) 0.1190 (0.0241) 0.3918 (0.1740)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
5 1.2973 (0.0860) 0.2123 (0.0261) 0.1971 (0.0241) 0.2306 (0.0266) 0.1398 (0.0296) 0.2311 (0.0270)
10 0.2935 (0.0619) 0.1698 (0.0258) 0.1616 (0.0256) 0.2358 (0.0292) 0.1292 (0.0264) 0.2349 (0.0291)
15 0.0885 (0.0191) 0.1462 (0.0263) 0.1432 (0.0262) 0.2362 (0.0287) 0.1191 (0.0241) 0.2362 (0.0287)
Table S4: Comparison of existing methods and proposed L2L_{2} penalized GAN methods (n=2000,ϵ=0.1n=2000,\epsilon=0.1). Estimation error of the variance matrix is reported in the Frobenius norm F\|\cdot\|_{\mathrm{F}}.
pp hinge GAN JS logit ff-GAN rKL logit ff-GAN GYZ JS-GAN Tyler_M Kendall_MAD Spearman_Qn
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
5 0.2257 (0.0550) 0.1345 (0.0372) 0.1296 (0.0322) 0.2774 (0.1523) 0.6149 (0.0447) 1.5525 (0.0462) 1.7124 (0.0383)
10 0.3222 (0.0547) 0.2898 (0.0565) 0.2723 (0.0477) 0.3642 (0.0885) 2.3363 (0.0557) 3.2143 (0.0489) 3.4669 (0.0398)
15 0.4578 (0.0691) 0.4582 (0.0754) 0.4328 (0.0460) 0.6387 (0.2095) 6.8919 (0.1709) 4.8584 (0.0640) 5.1907 (0.0477)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
5 0.2378 (0.0594) 0.1514 (0.0477) 0.1641 (0.0589) 2.4720 (7.0730) 1.5732 (0.1017) 1.5974 (0.0809) 2.1323 (0.0963)
10 0.3301 (0.0583) 0.3049 (0.0635) 0.2744 (0.0505) 0.3829 (0.0852) 3.2420 (0.2083) 3.3202 (0.1462) 4.1527 (0.1602)
15 0.4683 (0.0766) 0.4834 (0.0953) 0.4183 (0.0486) 0.5459 (0.0998) 4.9017 (0.2202) 4.9803 (0.1710) 6.0796 (0.1809)
pp γ\gamma-LASSO MVE MCD Sest Mest MMest
QN(2.25c,13Ip)Q\sim\mathrm{N}\left(2.25c,\frac{1}{3}I_{p}\right)
5 2.1000 (0.0427) 0.4682 (0.0742) 0.4494 (0.0608) 0.5283 (0.0666) 0.3014 (0.0649) 0.5079 (0.0635)
10 4.4364 (0.0688) 0.5338 (0.0846) 0.5175 (0.0752) 0.7732 (0.0839) 0.4078 (0.0684) 0.7707 (0.0831)
15 6.9967 (0.0857) 0.6057 (0.0894) 0.5921 (0.0777) 3.7412 (3.1416) 0.5143 (0.0672) 3.7413 (3.1417)
QN(51p,5Ip)Q\sim\mathrm{N}(51_{p},5I_{p})
5 5.3996 (0.3406) 0.4685 (0.0754) 0.4496 (0.0617) 0.5284 (0.0669) 0.3014 (0.0649) 0.5450 (0.0701)
10 2.1356 (0.4419) 0.5419 (0.0812) 0.5175 (0.0755) 0.7732 (0.0839) 0.4075 (0.0689) 0.7708 (0.0831)
15 0.4703 (0.1192) 0.6017 (0.0789) 0.5938 (0.0791) 0.9585 (0.1027) 0.5142 (0.0678) 0.9585 (0.1027)

I.4 Illustration with the second contamination

Figure S2: The 95%95\% Gaussian ellipsoids estimated for the first two coordinates and observed marginal histograms, from contaminated data based on the second contamination in Section 6.2 with ϵ=5%\epsilon=5\% (top) or 10%10\% (bottom).

For completeness, Figure S2 shows the 95% Gaussian ellipsoids estimated for the first two coordinates, similarly to Figure 1 but with two samples of size 20002000 from a 1010-dimensional Huber's contaminated Gaussian distribution based on the second contamination QQ in Section 6.2. The comparison of the methods studied is qualitatively similar to that found in Figure 1.

II Main proofs of results

II.1 Proof of Theorem 2

We state and prove the following result which implies Theorem 2. For b>0b>0, define two factors R2,b=sup|u|bdduf(eu)R_{2,b}=\sup_{|u|\leq b}\frac{\mathrm{d}}{\mathrm{d}u}f^{\prime}(\mathrm{e}^{u}) and R3,b=R31,b+R32,bR_{3,b}=R_{31,b}+R_{32,b} with R31,b=sup|u|bd2du2{f(eu)}R_{31,b}=\sup_{|u|\leq b}\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}\{-f^{\prime}(\mathrm{e}^{u})\} and R32,b=sup|u|bd2du2f#(eu)R_{32,b}=\sup_{|u|\leq b}\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\#}(\mathrm{e}^{u}). For δ(0,1)\delta\in(0,1), define

λ11=2log(5p)+log(δ1)n+2log(5p)+log(δ1)n,\displaystyle\lambda_{11}=\sqrt{\frac{2\log({5p})+\log(\delta^{-1})}{n}}+\frac{2\log({5p})+\log(\delta^{-1})}{n},
λ12=Crad44log(2p(p+1))n+2log(δ1)n,\displaystyle\lambda_{12}=C_{\mathrm{rad4}}\sqrt{\frac{4\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}},

where Crad4=Csg6Crad3C_{\mathrm{rad4}}=C_{\mathrm{sg6}}C_{\mathrm{rad3}}, depending on universal constants Csg6C_{\mathrm{sg6}} and Crad3C_{\mathrm{rad3}} in Lemma S26 and Corollary S2 in the Supplement. Denote

Errf1(n,p,δ,ϵ)\displaystyle\mathrm{Err}_{f1}(n,p,\delta,\epsilon) =(f′′(1))1{f(3/5)(ϵ+1/(nδ))\displaystyle=(f^{{\prime\prime}}(1))^{-1}\Big{\{}-f^{\prime}(3/5)(\sqrt{\epsilon}+\sqrt{1/(n\delta)})
f(eb1)ϵ+12R3,b1(ϵ+1/n)+R2,b1λ12+λ1},\displaystyle\quad-f^{\prime}(\mathrm{e}^{-b_{1}})\sqrt{\epsilon}+\frac{1}{2}R_{3,b_{1}}(\sqrt{\epsilon}+\sqrt{1/n})+R_{2,b_{1}}\lambda_{12}+\lambda_{1}\Big{\}},

where b1=ϵ+1/nb_{1}=\sqrt{\epsilon}+\sqrt{1/n}. Note that R2,bR_{2,b}, R3,bR_{3,b} are bounded provided that bb is bounded, because ff is three-times continuously differentiable as required in Assumption 2.

Proposition S1.

Assume that ΣmaxM1\|\Sigma^{*}\|_{\max}\leq M_{1}, and ff satisfies Assumptions 1–2. Let θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) be a solution to (25) with λ1Csp13R1M11λ11\lambda_{1}\geq C_{\mathrm{sp13}}R_{1}M_{11}\lambda_{11}, where M11=M11/2(M11/2+22π)M_{11}=M_{1}^{1/2}(M_{1}^{1/2}+{2}\sqrt{2\pi}) and Csp13=(5/3)(Csp11Csp12)C_{\mathrm{sp13}}=(5/3)(C_{\mathrm{sp11}}\vee C_{\mathrm{sp12}}), depending on universal constants Csp11C_{\mathrm{sp11}} and Csp12C_{\mathrm{sp12}} in Lemma S3 in the Supplement. If ϵ1/5\epsilon\leq 1/5, ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5, and Errf1(n,p,δ,ϵ)a\mathrm{Err}_{f1}(n,p,\delta,\epsilon)\leq a for a constant a(0,1/2)a\in(0,1/2), then we have that with probability at least 17δ1-7\delta,

μ^μ\displaystyle\|\hat{\mu}-\mu^{*}\|_{\infty} S4,aErrf1(n,p,δ,ϵ),\displaystyle\leq S_{4,a}\mathrm{Err}_{f1}(n,p,\delta,\epsilon),
Σ^Σmax\displaystyle\|\hat{\Sigma}-\Sigma^{*}\|_{\max} S8,aErrf1(n,p,δ,ϵ),\displaystyle\leq S_{8,a}\mathrm{Err}_{f1}(n,p,\delta,\epsilon),

where S4,a=(1+2M1log212a)/aS_{4,a}=(1+\sqrt{2M_{1}\log\frac{2}{1-2a}})/a and S8,a=2M11/2S6,a+S7(1+S4,a+S6,a)S_{8,a}=2M_{1}^{1/2}S_{6,a}+S_{7}(1+S_{4,a}+S_{6,a}) with S6,a=S5(1+S4,a/2)S_{6,a}=S_{5}(1+S_{4,a}/2), S5=22π(1e2/M1)1S_{5}=2\sqrt{2\pi}(1-\mathrm{e}^{-2/M_{1}})^{-1}, and S7=4{(12πM1e1/(8M1))(12e1/(8M1))}2S_{7}=4\{(\frac{1}{\sqrt{2\pi M_{1}}}\mathrm{e}^{-1/(8M_{1})})\vee(1-2\mathrm{e}^{-1/(8M_{1})})\}^{-2}.

Proof of Proposition S1.

The main strategy of our proof is to show that the following inequalities hold with high probabilities,

d(θ^,θ)Δ12maxγΓ{Kf(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}Δ11,\displaystyle d(\hat{\theta},\theta^{*})-\Delta_{12}\leq\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}\leq\Delta_{11}, (S1)

where Δ11\Delta_{11} and Δ12\Delta_{12} are error terms, and d(θ^,θ)d(\hat{\theta},\theta^{*}) is a moment matching term, which under certain conditions delivers upper bounds, up to scaling constants, on the estimation errors to be controlled, μ^μ\|\hat{\mu}-\mu^{*}\|_{\infty} and Σ^Σmax\|\hat{\Sigma}-\Sigma^{*}\|_{\max}.

(Step 1) For the upper bound in (S1), we show that with probability at least 15δ1-5\delta,

maxγΓ{Kf(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}
maxγΓ{Kf(Pn,Pθ;hγ,μ)λ1pen1(γ)}\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\} (S2)
maxγΓ{Δ11+pen1(γ)Δ~11λ1pen1(γ)}.\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{\Delta_{11}+\mathrm{pen}_{1}(\gamma)\tilde{\Delta}_{11}-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}. (S3)

Inequality (S2) follows from the definition of θ^\hat{\theta}. Inequality (S3) follows from Proposition S5: it holds with probability at least 15δ1-{5}\delta that for any γΓ\gamma\in\Gamma,

Kf(Pn,Pθ;hγ,μ)Δ11+pen1(γ)Δ~11,\displaystyle K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})\leq\Delta_{11}+\mathrm{pen}_{1}(\gamma)\tilde{\Delta}_{11},

where Δ11=f(3/5)(ϵ+ϵ/(nδ))\Delta_{11}=-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)}), Δ~11=Csp13R1M11λ11\tilde{\Delta}_{11}=C_{\mathrm{sp13}}R_{1}M_{11}\lambda_{11}, and

λ11=2log(5p)+log(δ1)n+2log(5p)+log(δ1)n.\lambda_{11}=\sqrt{\frac{2\log({5p})+\log(\delta^{-1})}{n}}+\frac{2\log({5p})+\log(\delta^{-1})}{n}.

Note that λ11\lambda_{11} is the same as in the proof of Theorem 4, and the above Δ~11\tilde{\Delta}_{11} differs from Δ~11\tilde{\Delta}_{11} in the proof of Theorem 4 only in the factor R1R_{1}. From (S2)–(S3), the upper bound in (S1) holds with probability at least 15δ1-5\delta, provided that the tuning parameter λ1\lambda_{1} is chosen such that λ1Δ~11\lambda_{1}\geq\tilde{\Delta}_{11}.

(Step 2) For the lower bound in (S1), we show that with probability at least 12δ1-2\delta,

maxγΓ{Kf(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\}
maxγΓ0{Kf(Pn,Pθ^;hγ,μ^)λ1pen1(γ)}\displaystyle\geq\max_{\gamma\in\Gamma_{0}}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{1}\;\mathrm{pen}_{1}(\gamma)\right\} (S4)
maxγΓ0f′′(1){EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~12λ1b1.\displaystyle\geq\max_{\gamma\in\Gamma_{0}}\;f^{{\prime\prime}}(1)\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{12}-\lambda_{1}b_{1}. (S5)

Inequality (S4) holds provided that Γ0\Gamma_{0} is a subset of Γ\Gamma. As a subset of the pairwise spline class sp\mathcal{H}_{\mathrm{sp}}, define a class of pairwise ramp functions, rp\mathcal{H}_{\mathrm{rp}}, such that each function in rp\mathcal{H}_{\mathrm{rp}} can be expressed as, for x=(x1,,xp)Tpx=(x_{1},\ldots,x_{p})^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{p},

hrp,β,c(x)=β0+j=1pβ1jramp(xjcj)+1ijpβ2,ijramp(xi)ramp(xj),\displaystyle h_{\mathrm{rp},\beta,c}(x)=\beta_{0}+\sum_{j=1}^{p}\beta_{1j}\,\mathrm{ramp}(x_{j}-c_{j})+\sum_{1\leq i\not=j\leq p}\beta_{2,ij}\,\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}),

where ramp(t)=12(t+1)+12(t1)+\mathrm{ramp}(t)=\frac{1}{2}(t+1)_{+}-\frac{1}{2}(t-1)_{+} for tt\in\mathbb{R}, c=(c1,,cp)Tc=(c_{1},\ldots,c_{p})^{\mathrm{\scriptscriptstyle T}} with cj{0,1}c_{j}\in\{0,1\}, and β=(β0,β1T,β2T)T\beta=(\beta_{0},\beta_{1}^{\mathrm{\scriptscriptstyle T}},\beta_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}} with β1=(β1j:j=1,,p)T\beta_{1}=(\beta_{1j}:j=1,\ldots,p)^{\mathrm{\scriptscriptstyle T}} and β2=(β2,ij:1ijp)T\beta_{2}=(\beta_{2,ij}:1\leq i\not=j\leq p)^{\mathrm{\scriptscriptstyle T}}. For symmetry as in γ2\gamma_{2}, assume that the coefficients in β2\beta_{2} are symmetric, β2,ij=β2,ji\beta_{2,ij}=\beta_{2,ji} for any iji\not=j. By the definition of ramp()\mathrm{ramp}(\cdot), each function hrp,β,c(x)h_{\mathrm{rp},\beta,c}(x) can be represented as hγ(x)h_{\gamma}(x) in the spline class sp\mathcal{H}_{\mathrm{sp}}, where β\beta and γ\gamma satisfy β0=γ0\beta_{0}=\gamma_{0}, β11=γ11\|\beta_{1}\|_{1}=\|\gamma_{1}\|_{1}, and β21=γ21\|\beta_{2}\|_{1}=\|\gamma_{2}\|_{1}. Incidentally, this relationship also holds when symmetry is not imposed in the coefficients in γ2\gamma_{2} or in β2\beta_{2}. Denote as Γrp\Gamma_{\mathrm{rp}} the subset of Γ\Gamma such that rp={hγ(x):γΓrp}\mathcal{H}_{\mathrm{rp}}=\{h_{\gamma}(x):\gamma\in\Gamma_{\mathrm{rp}}\}.
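For intuition, ramp()\mathrm{ramp}(\cdot) and the pairwise ramp functions are straightforward to evaluate. The following short numpy sketch computes hrp,β,c(x)h_{\mathrm{rp},\beta,c}(x) with illustrative (hypothetical) coefficients, using a symmetric β2\beta_{2} matrix with zero diagonal so that the quadratic form matches the sum over iji\not=j in the display above.

import numpy as np

def ramp(t):
    # ramp(t) = ((t+1)_+ - (t-1)_+)/2: equals 0 for t <= -1,
    # (t+1)/2 for -1 < t < 1, and 1 for t >= 1
    return 0.5 * (np.maximum(t + 1.0, 0.0) - np.maximum(t - 1.0, 0.0))

def h_ramp(x, beta0, beta1, beta2, c):
    # Evaluate h_{rp,beta,c} at the rows of x; beta2 is a symmetric
    # p x p matrix with zero diagonal, matching beta_{2,ij} = beta_{2,ji}
    r_shift = ramp(x - c)          # main-effect terms ramp(x_j - c_j)
    r0 = ramp(x)                   # interaction terms use ramp(x_i)ramp(x_j)
    main = beta0 + r_shift @ beta1
    inter = np.einsum('ni,ij,nj->n', r0, beta2, r0)  # zero diagonal => i != j
    return main + inter

# Illustrative use with hypothetical coefficients:
x = np.random.default_rng(0).standard_normal((4, 3))
c = np.array([0.0, 1.0, 0.0])
beta1 = np.array([1.0, -0.5, 0.0])
beta2 = np.zeros((3, 3)); beta2[0, 1] = beta2[1, 0] = 0.3
print(h_ramp(x, 0.0, beta1, beta2, c))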

Take Γ0={γΓrp:γ0=0,pen1(γ)=b1}\Gamma_{0}=\{\gamma\in\Gamma_{\mathrm{rp}}:\gamma_{0}=0,\mathrm{pen}_{1}(\gamma)=b_{1}\} for some fixed b1>0b_{1}>0. Inequality (S5) follows from Proposition S6: it holds with probability at least 12δ1-2\delta that for any γΓ0\gamma\in\Gamma_{0},

Kf(Pn,Pθ^;hγ,μ^)f′′(1){EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~12,\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq f^{{\prime\prime}}(1)\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{12},

where Δ~12=f(eb1)ϵ+12b12R3,b1+b1R2,b1λ12\tilde{\Delta}_{12}=-f^{\prime}(\mathrm{e}^{-b_{1}})\epsilon+\frac{1}{2}b_{1}^{2}R_{3,b_{1}}+b_{1}R_{2,b_{1}}\lambda_{12}, and

λ12=Crad44log(2p(p+1))n+2log(δ1)n.\lambda_{12}=C_{\mathrm{rad4}}\sqrt{\frac{4\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}}.

Note that λ12\lambda_{12} is the same as in the proof of Theorem 4. From (S4)–(S5), the lower bound in (S1) holds with probability at least 12δ1-2\delta, where Δ12=Δ~12+λ1b1\Delta_{12}=\tilde{\Delta}_{12}+\lambda_{1}b_{1} and d(θ^,θ)=f′′(1)maxγΓ0{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}d(\hat{\theta},\theta^{*})=f^{{\prime\prime}}(1)\max_{\gamma\in\Gamma_{0}}\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\}.

(Step 3) We complete the proof by choosing appropriate b1b_{1} and relating the moment matching term d(θ^,θ)d(\hat{\theta},\theta^{*}) to the estimation error between θ^\hat{\theta} and θ\theta^{*}. First, due to the linearity of hγ,μ^h_{\gamma,\hat{\mu}} in γ\gamma, combining the lower and upper bounds in (S1) shows that with probability at least 17δ1-{7}\delta,

f′′(1)b1maxγΓrp,pen1(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}\displaystyle\quad f^{{\prime\prime}}(1)b_{1}\,\max_{\gamma\in\Gamma_{\mathrm{rp}},\mathrm{pen}_{1}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}
f(3/5)(ϵ+ϵ/(nδ))f(eb1)ϵ+12b12R3,b1+b1R2,b1λ12+λ1b1.\displaystyle\leq-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)})-f^{\prime}(\mathrm{e}^{-b_{1}})\epsilon+\frac{1}{2}b_{1}^{2}R_{3,b_{1}}+b_{1}R_{2,b_{1}}\lambda_{12}+\lambda_{1}b_{1}.

Taking b1=ϵ+1/nb_{1}=\sqrt{\epsilon}+1/\sqrt{n} in the preceding display and rearranging yields

maxγΓrp,pen1(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Errf1(n,p,δ,ϵ),\displaystyle\quad\max_{\gamma\in\Gamma_{\mathrm{rp}},\mathrm{pen}_{1}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq\mathrm{Err}_{f1}(n,p,\delta,\epsilon), (S6)

where

Errf1(n,p,δ,ϵ)\displaystyle\mathrm{Err}_{f1}(n,p,\delta,\epsilon) =(f′′(1))1{f(3/5)(ϵ+1/(nδ))\displaystyle=(f^{{\prime\prime}}(1))^{-1}\Big{\{}-f^{\prime}(3/5)(\sqrt{\epsilon}+\sqrt{1/(n\delta)})
f(eb1)ϵ+12R3,b1(ϵ+1/n)+R2,b1λ12+λ1}.\displaystyle\quad-f^{\prime}(\mathrm{e}^{-b_{1}})\sqrt{\epsilon}+\frac{1}{2}R_{3,b_{1}}(\sqrt{\epsilon}+1/\sqrt{n})+R_{2,b_{1}}\lambda_{12}+\lambda_{1}\Big{\}}.
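To spell out the rearrangement leading to (S6), each term of the preceding inequality is divided by f^{\prime\prime}(1)b_{1}>0 and bounded using b_{1}=\sqrt{\epsilon}+1/\sqrt{n}\geq\sqrt{\epsilon}, together with the nonnegativity of -f^{\prime}(3/5) and -f^{\prime}(\mathrm{e}^{-b_{1}}) here; for instance,

\frac{\epsilon+\sqrt{\epsilon/(n\delta)}}{b_{1}}\leq\sqrt{\epsilon}+\sqrt{1/(n\delta)},\qquad \frac{\epsilon}{b_{1}}\leq\sqrt{\epsilon},\qquad \frac{b_{1}^{2}R_{3,b_{1}}/2}{b_{1}}=\frac{1}{2}R_{3,b_{1}}(\sqrt{\epsilon}+1/\sqrt{n}),

while the terms b_{1}R_{2,b_{1}}\lambda_{12} and \lambda_{1}b_{1} divide exactly to R_{2,b_{1}}\lambda_{12} and \lambda_{1}. The same rearrangement, with b_{2} in place of b_{1}, yields \mathrm{Err}_{f2} in the proof of Proposition S2 below.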

The desired result then follows from Proposition S7: provided Errf1(n,p,δ,ϵ)a\mathrm{Err}_{f1}(n,p,\delta,\epsilon)\leq a, inequality (S6) implies that

μ^μS4,aErrf1(n,p,δ,ϵ),Σ^ΣmaxS8,aErrf1(n,p,δ,ϵ).\displaystyle\|\hat{\mu}-\mu^{*}\|_{\infty}\leq S_{4,a}\mathrm{Err}_{f1}(n,p,\delta,\epsilon),\quad\|\hat{\Sigma}-\Sigma^{*}\|_{\max}\leq S_{8,a}\mathrm{Err}_{f1}(n,p,\delta,\epsilon).

II.2 Proof of Theorem 3

We state and prove the following result which implies Theorem 3. For b>0b>0, define R4,b=inf|u|bdduf#(eu)R_{4,b}=\inf_{|u|\leq b}\frac{\mathrm{d}}{\mathrm{d}u}f^{\#}(\mathrm{e}^{u}), in addition to R2,bR_{2,b} and R3,bR_{3,b} as in Proposition S1. For δ(0,1)\delta\in(0,1), define

λ21=5p+log(δ1)n,λ22=Crad516pn+2plog(δ1)n,\displaystyle\lambda_{21}=\sqrt{\frac{{5p}+\log(\delta^{-1})}{n}},\quad\lambda_{22}=C_{\mathrm{rad5}}\sqrt{\frac{16p}{n}}+\sqrt{\frac{2p\log(\delta^{-1})}{n}},
λ31=λ21+5p+log(δ1)n,λ32=Crad56(p1)n+(p1)log(δ1)n.\displaystyle\lambda_{31}=\lambda_{21}+\frac{{5p}+\log(\delta^{-1})}{n},\lambda_{32}=C_{\mathrm{rad5}}\sqrt{\frac{6(p-1)}{n}}+\sqrt{\frac{(p-1)\log(\delta^{-1})}{n}}.

where Crad5=Csg,12Crad3C_{\mathrm{rad5}}=C_{\mathrm{sg,12}}C_{\mathrm{rad3}}, depending on universal constants Csg,12C_{\mathrm{sg,12}} and Crad3C_{\mathrm{rad3}} in Lemma S23 and Corollary S2 in the Supplement. Denote

Errf2(n,p,δ,ϵ)\displaystyle\mathrm{Err}_{f2}(n,p,\delta,\epsilon) =(2R4,b2)1{f(3/5)(ϵ+1/(nδ))f(eb2)ϵ\displaystyle=({\sqrt{2}}R_{4,b_{2}^{\dagger}})^{-1}\Big{\{}-f^{\prime}(3/5)(\sqrt{\epsilon}+\sqrt{1/(n\delta)})-f^{\prime}(\mathrm{e}^{-b_{2}^{\dagger}})\sqrt{\epsilon}
+4Csg,122M2R3,b2(ϵ+1/(np))+R2,b2λ22+λ2},\displaystyle\quad+4C_{\mathrm{sg,12}}^{2}M_{2}R_{3,b_{2}^{\dagger}}(\sqrt{\epsilon}+\sqrt{1/(np)})+R_{2,b_{2}^{\dagger}}\lambda_{22}+\lambda_{2}\Big{\}},
Errf3(n,p,δ,ϵ)\displaystyle\mathrm{Err}_{f3}(n,p,\delta,\epsilon) =(2R4,2b3)1{f(3/5)(ϵ+1/(nδ))f(e2b3)ϵ\displaystyle=({2}R_{4,2b_{3}^{\dagger}})^{-1}\Big{\{}-f^{\prime}(3/5)(\sqrt{\epsilon}+\sqrt{1/(n\delta)})-f^{\prime}(\mathrm{e}^{-2b_{3}^{\dagger}})\sqrt{\epsilon}
+(80Csg,122M2)R3,2b3(ϵ+1/(np))+R2,b3λ32+λ3/p},\displaystyle\quad+(80C_{\mathrm{sg,12}}^{2}M_{2})R_{3,2b_{3}^{\dagger}}(\sqrt{\epsilon}+\sqrt{1/(np)})+R_{2,b_{3}^{\dagger}}\lambda_{32}+\lambda_{3}/\sqrt{p}\Big{\}},

where b2=ϵ+1/(np)b_{2}=\sqrt{\epsilon}+\sqrt{1/(np)}, b2=b22pb_{2}^{\dagger}=b_{2}\sqrt{2p}, b3=ϵ/p+1/(np2)b_{3}=\sqrt{\epsilon/p}+\sqrt{1/(np^{2})}, and b3=b2p(p1)b_{3}^{\dagger}=b_{2}\sqrt{p(p-1)}. Note that by the strict convexity and monotonicity of ff as required in Assumptions 1–2, we have that

R4,b=inf|u|bdduf#(eu)=inf|u|bddu{f(f(eu))}R_{4,b}=\inf_{|u|\leq b}\frac{\mathrm{d}}{\mathrm{d}u}f^{\#}(\mathrm{e}^{u})=\inf_{|u|\leq b}\frac{\mathrm{d}}{\mathrm{d}u}\left\{-f^{*}(f^{\prime}(\mathrm{e}^{u}))\right\}

is bounded away from zero provided that bb is bounded.

Proposition S2.

Assume that ΣopM2\|\Sigma^{*}\|_{\mathrm{op}}\leq M_{2}, and ff satisfies Assumptions 1–2. Let θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) be a solution to (27) with

λ2(5/3)Csp21M21/2R1λ21,λ3/p(255/3)Csp22M21R1λ31,\lambda_{2}\geq(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}R_{1}\lambda_{21},\quad\lambda_{3}/\sqrt{p}\geq{(25\sqrt{5}/3)}C_{\mathrm{sp22}}M_{21}R_{1}\lambda_{31},

where M21=M21/2(M21/2+22π)M_{21}=M_{2}^{1/2}(M_{2}^{1/2}+2\sqrt{2\pi}), Csp21=2Csg7Csg5C_{\mathrm{sp21}}=\sqrt{2}C_{\mathrm{sg7}}C_{\mathrm{sg5}}, and Csp22=2/πCsp21+Csg8C_{\mathrm{sp22}}=\sqrt{2/\pi}C_{\mathrm{sp21}}+C_{\mathrm{sg8}}, depending on universal constants Csg5C_{\mathrm{sg5}}, Csg7C_{\mathrm{sg7}}, and Csg8C_{\mathrm{sg8}} in Lemmas S25, S27, and S28 in the Supplement. If ϵ1/5\epsilon\leq 1/5, ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5, and Errf2(n,p,δ,ϵ)a\mathrm{Err}_{f2}(n,p,\delta,\epsilon)\leq a for a constant a(0,1/2)a\in(0,1/2), then we have that with probability at least 18δ1-{8}\delta,

μ^μ2\displaystyle\|\hat{\mu}-\mu^{*}\|_{2} S4,aErrf2(n,p,δ,ϵ),\displaystyle\leq S_{4,a}\mathrm{Err}_{f2}(n,p,\delta,\epsilon),
p1/2Σ^ΣF\displaystyle p^{-1/2}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}} S9,aErrf2(n,p,δ,ϵ)+S7Errf3(n,p,δ,ϵ),\displaystyle\leq S_{9,a}\mathrm{Err}_{f2}(n,p,\delta,\epsilon)+S_{7}\mathrm{Err}_{f3}(n,p,\delta,\epsilon),

where S9,a=2M21/2S6,a+2S7(S4,a+S6,a)S_{9,a}=2M_{2}^{1/2}S_{6,a}+\sqrt{2}S_{7}(S_{4,a}+S_{6,a}) and (S4,a,S6,a,S7)(S_{4,a},S_{6,a},S_{7}) are defined as in Proposition S1 except with M1M_{1} replaced by M2M_{2} throughout.

Proof of Proposition S2.

The main strategy of our proof is to show that the following inequalities hold with high probabilities,

d(θ^,θ)Δ22maxγΓ{Kf(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}Δ21,\displaystyle d(\hat{\theta},\theta^{*})-\Delta_{22}\leq\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}\leq\Delta_{21}, (S7)

where Δ21\Delta_{21} and Δ22\Delta_{22} are error terms, and d(θ^,θ)d(\hat{\theta},\theta^{*}) is a moment matching term, similarly to the proof of Theorem 2. However, additional considerations are involved.

We split the proof into several steps. In Step 1, we derive the upper bound in (S7) by exploiting two tuning parameters λ2\lambda_{2} and λ3\lambda_{3} associated with γ1\gamma_{1} and γ2\gamma_{2} respectively. In Steps 2 and 3, we derive the first version of the lower bound in (S7) and then deduce upper bounds on μ^μ2\|\hat{\mu}-\mu^{*}\|_{2} and σ^σ2\|\hat{\sigma}-\sigma^{*}\|_{2}, where σ^\hat{\sigma} or σ\sigma^{*} is the vector of standard deviations from Σ^\hat{\Sigma} or Σ\Sigma^{*} respectively. In Steps 4 and 5, we derive the second version of the lower bound in (S7) and then deduce an upper bound on Σ^ΣF\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}}.

(Step 1) For the upper bound in (S7), we show that with probability at least 14δ1-4\delta,

maxγΓ{Kf(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}
maxγΓ{Kf(Pn,Pθ;hγ,μ)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\} (S8)
maxγΓ{Δ21+pen2(γ1)Δ~21+pen2(γ2)pΔ~31λ2pen2(γ1)λ3pen2(γ2)}.\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{\Delta_{21}+\mathrm{pen}_{2}(\gamma_{1})\tilde{\Delta}_{21}+\mathrm{pen}_{2}(\gamma_{2})\sqrt{p}\,\tilde{\Delta}_{31}-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}. (S9)

Inequality (S8) follows from the definition of θ^\hat{\theta}. Inequality (S9) follows from Proposition S8: it holds with probability at least 14δ1-4\delta that for any γΓ\gamma\in\Gamma,

Kf(Pn,Pθ;hγ,μ)Δ21+pen2(γ1)Δ~21+pen2(γ2)pΔ~31,\displaystyle K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})\leq\Delta_{21}+\mathrm{pen}_{2}(\gamma_{1})\tilde{\Delta}_{21}+\mathrm{pen}_{2}(\gamma_{2})\sqrt{p}\tilde{\Delta}_{31},

where

Δ21\displaystyle\Delta_{21} =f(3/5)(ϵ+ϵ/(nδ)),Δ~21=(5/3)Csp21M21/2R1λ21,\displaystyle=-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)}),\quad\tilde{\Delta}_{21}=(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}R_{1}\lambda_{21},
Δ~31\displaystyle\tilde{\Delta}_{31} =(255/3)Csp22M21R1λ31,\displaystyle={(25\sqrt{5}/3)}C_{\mathrm{sp22}}M_{21}R_{1}\lambda_{31},

and

λ21=5p+log(δ1)n,λ31=λ21+5p+log(δ1)n.\lambda_{21}=\sqrt{\frac{{5p}+\log(\delta^{-1})}{n}},\quad\lambda_{31}=\lambda_{21}+\frac{{5p}+\log(\delta^{-1})}{n}.

From (S8)–(S9), the upper bound in (S7) holds with probability at least 14δ1-4\delta, provided that the tuning parameters λ2\lambda_{2} and λ3\lambda_{3} are chosen such that λ2Δ~21\lambda_{2}\geq\tilde{\Delta}_{21} and λ3pΔ~31\lambda_{3}\geq\sqrt{p}\tilde{\Delta}_{31}.

(Step 2) For the first version of the lower bound in (S7), we show that with probability at least 12δ1-2\delta,

maxγΓ{Kf(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}
maxγΓ1{Kf(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)}\displaystyle\geq\max_{\gamma\in\Gamma_{1}}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})\right\} (S10)
maxγΓ10{Kf(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)}\displaystyle\geq\max_{\gamma\in\Gamma_{10}}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})\right\} (S11)
maxγΓ10R4,b2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~22λ2b2,\displaystyle\geq\max_{\gamma\in\Gamma_{10}}\;R_{4,b_{2}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{22}-\lambda_{2}b_{2}, (S12)

where Γ1={(γ0,γ1T,γ2T)T:γ2=0}\Gamma_{1}=\{(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}:\gamma_{2}=0\}. Inequality (S10) follows because Γ1\Gamma_{1} is a subset of Γ\Gamma such that γ2=0\gamma_{2}=0 and hence pen2(γ2)=0\mathrm{pen}_{2}(\gamma_{2})=0 for γΓ1\gamma\in\Gamma_{1}. Inequality (S11) holds provided that Γ10\Gamma_{10} is a subset of Γ1\Gamma_{1}. As a subset of the main-effect spline class sp1\mathcal{H}_{\mathrm{sp1}}, define a main-effect ramp class, rp1\mathcal{H}_{\mathrm{rp1}}, such that each function in rp1\mathcal{H}_{\mathrm{rp1}} can be expressed as, for x=(x1,,xp)Tpx=(x_{1},\ldots,x_{p})^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{p},

hrp1,β,c(x)\displaystyle h_{\mathrm{rp1},\beta,c}(x) =β0+j=1pβ1jramp(xjcj),\displaystyle=\beta_{0}+\sum_{j=1}^{p}\beta_{1j}\mathrm{ramp}(x_{j}-c_{j}),

where ramp(t)=12(t+1)+12(t1)+\mathrm{ramp}(t)=\frac{1}{2}(t+1)_{+}-\frac{1}{2}(t-1)_{+} for tt\in\mathbb{R}, c=(c1,,cp)Tc=(c_{1},\ldots,c_{p})^{\mathrm{\scriptscriptstyle T}} with cj{0,1}c_{j}\in\{0,1\}, and β=(β0,β1T)T\beta=(\beta_{0},\beta_{1}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}} with β1=(β11,,β1p)T\beta_{1}=(\beta_{11},\ldots,\beta_{1p})^{\mathrm{\scriptscriptstyle T}}. In contrast with the pairwise ramp class rp\mathcal{H}_{\mathrm{rp}}, only the main-effect ramp functions are included in hrp1,β,c(x)h_{\mathrm{rp1},\beta,c}(x), and the interaction ramp functions are excluded. By the definition of ramp()\mathrm{ramp}(\cdot), each function hrp1,β,c(x)h_{\mathrm{rp1},\beta,c}(x) can be represented as hγ(x)sp1h_{\gamma}(x)\in\mathcal{H}_{\mathrm{sp1}} with γ=(γ0,γ1T)TΓrp1\gamma=(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in\Gamma_{\mathrm{rp1}}, such that β\beta and γ\gamma satisfy β0=γ0\beta_{0}=\gamma_{0} and β12=2γ12\|\beta_{1}\|_{2}=\sqrt{2}\|\gamma_{1}\|_{2}. For example, for ramp(x1)\mathrm{ramp}(x_{1}), the associated norms are β12=1\|\beta_{1}\|_{2}=1 and γ12=1/2\|\gamma_{1}\|_{2}=\sqrt{1/2}. Denote as Γrp1\Gamma_{\mathrm{rp1}} the subset of Γ1\Gamma_{1} such that rp1={hγ(x):γΓrp1}\mathcal{H}_{\mathrm{rp1}}=\{h_{\gamma}(x):\gamma\in\Gamma_{\mathrm{rp1}}\}.

Take Γ10={γΓrp1:pen2(γ)=b2,EPθhγ,μ^(x)=0,EPθ^hγ,μ^(x)0}\Gamma_{10}=\{\gamma\in\Gamma_{\mathrm{rp1}}:\mathrm{pen}_{2}(\gamma)=b_{2},\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0,\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\leq 0\} for some fixed b2>0b_{2}>0. Inequality (S12) follows from Proposition S9: it holds with probability at least 12δ1-2\delta that for any γΓ10\gamma\in\Gamma_{10},

Kf(Pn,Pθ^;hγ,μ^)R4,b2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~22,\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq R_{4,b_{2}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{22},

where b2=b22pb_{2}^{\dagger}=b_{2}\sqrt{2p}, Δ~22=f(eb2)ϵ+4Csg,122M2b22R3,b2+b2R2,b2λ22\tilde{\Delta}_{22}=-f^{\prime}(\mathrm{e}^{-b_{2}^{\dagger}})\epsilon+4C_{\mathrm{sg,12}}^{2}M_{2}b_{2}^{2}R_{3,b_{2}^{\dagger}}+b_{2}R_{2,b_{2}^{\dagger}}\lambda_{22}, and

λ22=Crad516pn+2plog(δ1)n.\lambda_{22}=C_{\mathrm{rad5}}\sqrt{\frac{16p}{n}}+\sqrt{\frac{2p\log(\delta^{-1})}{n}}.

From (S11)–(S12), the lower bound in (S7) holds with probability at least 12δ1-2\delta, where Δ22=Δ~22+λ2b2\Delta_{22}=\tilde{\Delta}_{22}+\lambda_{2}b_{2} and d(θ^,θ)=R4,b2maxγΓ10{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}d(\hat{\theta},\theta^{*})=R_{4,b_{2}^{\dagger}}\max_{\gamma\in\Gamma_{10}}\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\}.

(Step 3) We deduce upper bounds on μ^μ2\|\hat{\mu}-\mu^{*}\|_{2} and σ^σ2\|\hat{\sigma}-\sigma^{*}\|_{2}, by choosing appropriate b2b_{2} and relating the moment matching term d(θ^,θ)d(\hat{\theta},\theta^{*}) to the estimation errors. First, combining the upper bound in (S7) from Step 1 and the lower bound from Step 2 shows that with probability at least 16δ1-6\delta,

R4,b2b2maxγΓrp1,pen2(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}\displaystyle\quad R_{4,b_{2}^{\dagger}}b_{2}\,\max_{\gamma\in\Gamma_{\mathrm{rp1}},\mathrm{pen}_{2}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}
f(3/5)(ϵ+ϵ/(nδ))f(eb2)ϵ+4Csg,122M2b22R3,b2+b2R2,b2λ22+λ2b2.\displaystyle\leq-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)})-f^{\prime}(\mathrm{e}^{-b_{2}^{\dagger}})\epsilon+4C_{\mathrm{sg,12}}^{2}M_{2}b_{2}^{2}R_{3,b_{2}^{\dagger}}+b_{2}R_{2,b_{2}^{\dagger}}\lambda_{22}+\lambda_{2}b_{2}.

Taking b2=ϵ+1/(np)b_{2}=\sqrt{\epsilon}+\sqrt{1/(np)} in the preceding display and rearranging yields

maxγΓrp1,pen2(γ)=1/2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Errf2(n,p,δ,ϵ),\displaystyle\quad\max_{\gamma\in\Gamma_{\mathrm{rp1}},\mathrm{pen}_{2}(\gamma)={\sqrt{1/2}}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq\mathrm{Err}_{f2}(n,p,\delta,\epsilon), (S13)

where

Errf2(n,p,δ,ϵ)\displaystyle\mathrm{Err}_{f2}(n,p,\delta,\epsilon) =(2R4,b2)1{f(3/5)(ϵ+1/(nδ))\displaystyle=({\sqrt{2}}R_{4,b_{2}^{\dagger}})^{-1}\Big{\{}-f^{\prime}(3/5)(\sqrt{\epsilon}+\sqrt{1/(n\delta)})
f(eb2)ϵ+4Csg,122M2R3,b2(ϵ+1/(np))+R2,b2λ22+λ2}.\displaystyle\quad-f^{\prime}(\mathrm{e}^{-b_{2}^{\dagger}})\sqrt{\epsilon}+4C_{\mathrm{sg,12}}^{2}M_{2}R_{3,b_{2}^{\dagger}}(\sqrt{\epsilon}+\sqrt{1/(np)})+R_{2,b_{2}^{\dagger}}\lambda_{22}+\lambda_{2}\Big{\}}.

The error bounds for (μ^,σ^)(\hat{\mu},\hat{\sigma}) then follow from Proposition S10: provided Errf2(n,p,δ,ϵ)a\mathrm{Err}_{f2}(n,p,\delta,\epsilon)\leq a, inequality (S13) implies that

μ^μ2S4,aErrf2(n,p,δ,ϵ),\displaystyle\|\hat{\mu}-\mu^{*}\|_{2}\leq S_{4,a}\mathrm{Err}_{f2}(n,p,\delta,\epsilon), (S14)
σ^σ2S6,aErrf2(n,p,δ,ϵ).\displaystyle\|\hat{\sigma}-\sigma^{*}\|_{2}\leq S_{6,a}\mathrm{Err}_{f2}(n,p,\delta,\epsilon). (S15)
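To see the order of Errf2\mathrm{Err}_{f2} concretely, the following sketch evaluates its dominant terms with all multiplicative constants (R4,b2R_{4,b_{2}^{\dagger}}, R2,b2R_{2,b_{2}^{\dagger}}, R3,b2R_{3,b_{2}^{\dagger}}, Csg,12C_{\mathrm{sg,12}}, M2M_{2}, Crad5C_{\mathrm{rad5}}) and λ2\lambda_{2} set to one; this normalization is an assumption made purely for illustration, not a set of values taken from the paper. The output scales as p/n+ϵ\sqrt{p/n}+\sqrt{\epsilon}, matching the rate discussed in the main text.

```python
import numpy as np

def err_f2_order(n, p, eps, delta=0.05):
    # Order-of-magnitude evaluation of Err_f2 with all multiplicative
    # constants and lambda_2 set to 1 (an illustrative normalization only).
    b2 = np.sqrt(eps) + np.sqrt(1.0 / (n * p))
    lam22 = np.sqrt(16.0 * p / n) + np.sqrt(2.0 * p * np.log(1.0 / delta) / n)
    return (np.sqrt(eps) + np.sqrt(1.0 / (n * delta))  # -f'(3/5) term
            + np.sqrt(eps)                             # -f'(e^{-b2 dagger}) term
            + b2                                       # quadratic remainder term
            + lam22)                                   # stochastic error term

for n, p, eps in [(10_000, 10, 0.0), (10_000, 10, 0.05), (40_000, 10, 0.05)]:
    print(n, p, eps, round(err_f2_order(n, p, eps), 4))
# the printed values scale as sqrt(p/n) + sqrt(eps)
```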

(Step 4) For the second version of the lower bound in (S7), we show that with probability at least 12δ1-2\delta,

maxγΓ{Kf(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}
maxγΓ2{Kf(Pn,Pθ^;hγ,μ^)λ3pen2(γ2)}\displaystyle\geq\max_{\gamma\in\Gamma_{2}}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\} (S16)
maxγΓ20{Kf(Pn,Pθ^;hγ,μ^)λ3pen2(γ2)}\displaystyle\geq\max_{\gamma\in\Gamma_{20}}\;\left\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\} (S17)
maxγΓ20R4,2b3{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~32λ3b3,\displaystyle\geq\max_{\gamma\in\Gamma_{20}}\;R_{4,2b_{3}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{32}-\lambda_{3}b_{3}, (S18)

where Γ2={(γ0,γ1T,γ2T)T:γ1=0}\Gamma_{2}=\{(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}:\gamma_{1}=0\}. Inequality (S16) follows because Γ2\Gamma_{2} is a subset of Γ\Gamma such that γ1=0\gamma_{1}=0 and hence pen2(γ1)=0\mathrm{pen}_{2}(\gamma_{1})=0 for γΓ2\gamma\in\Gamma_{2}. Inequality (S17) holds provided that Γ20\Gamma_{20} is a subset of Γ2\Gamma_{2}. As a subset of the interaction spline class sp2\mathcal{H}_{\mathrm{sp2}}, define an interaction ramp class, rp2\mathcal{H}_{\mathrm{rp2}}, such that each function in rp2\mathcal{H}_{\mathrm{rp2}} can be expressed as, for x=(x1,,xp)Tpx=(x_{1},\ldots,x_{p})^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{p},

hrp2,β(x)\displaystyle h_{\mathrm{rp2},\beta}(x) =β0+1ijpβ2,ijramp(xi)ramp(xj),\displaystyle=\beta_{0}+\sum_{1\leq i\not=j\leq p}\beta_{2,ij}\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}),

where ramp(t)=12(t+1)+12(t1)+\mathrm{ramp}(t)=\frac{1}{2}(t+1)_{+}-\frac{1}{2}(t-1)_{+} for tt\in\mathbb{R}, and β=(β0,β2T)T\beta=(\beta_{0},\beta_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}} with β2=(β2,ij:1ijp)T\beta_{2}=(\beta_{2,ij}:1\leq i\not=j\leq p)^{\mathrm{\scriptscriptstyle T}}. In contrast with the function hrp1,β,c(x)h_{\mathrm{rp1},\beta,c}(x) in sp1\mathcal{H}_{\mathrm{sp1}}, the function hrp2,β(x)h_{\mathrm{rp2},\beta}(x) includes only the interaction ramp functions and excludes the main-effect ramp functions. To match the symmetry of γ2\gamma_{2}, assume that the coefficients in β2\beta_{2} are symmetric: β2,ij=β2,ji\beta_{2,ij}=\beta_{2,ji} for any iji\not=j. Denote as Γrp2\Gamma_{\mathrm{rp2}} the subset of Γ2\Gamma_{2} such that rp2={hγ(x):γΓrp2}\mathcal{H}_{\mathrm{rp2}}=\{h_{\gamma}(x):\gamma\in\Gamma_{\mathrm{rp2}}\}. By the definition of ramp()\mathrm{ramp}(\cdot), each function hrp2,β(x)h_{\mathrm{rp2},\beta}(x) can be represented as hγ(x)sp2h_{\gamma}(x)\in\mathcal{H}_{\mathrm{sp2}} with γ=(γ0,γ2T)TΓrp2\gamma=(\gamma_{0},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in\Gamma_{\mathrm{rp2}}, such that β\beta and γ\gamma satisfy β0=γ0\beta_{0}=\gamma_{0} and β22=2γ22\|\beta_{2}\|_{2}=2\|\gamma_{2}\|_{2}. For example, for ramp(x1)ramp(x2)\mathrm{ramp}(x_{1})\mathrm{ramp}(x_{2}), the associated norms are β22=1\|\beta_{2}\|_{2}=1 and γ22=1/2\|\gamma_{2}\|_{2}=1/2.
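The factor of 2 in ‖β2‖2 = 2‖γ2‖2 can be checked in the same way (again an added illustration assuming numpy, not part of the formal argument): expanding ramp(xi)ramp(xj) into the four hinge products gives coefficients ±1/4, whose L2 norm is 1/2.

```python
import numpy as np

def hinge(t):
    return np.maximum(t, 0.0)

def ramp(t):
    return 0.5 * hinge(t + 1.0) - 0.5 * hinge(t - 1.0)

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=1000), rng.normal(size=1000)
# ramp(x_i) * ramp(x_j) expands into four hinge products with coefficients 1/4
gamma = {(+1, +1): 0.25, (+1, -1): -0.25, (-1, +1): -0.25, (-1, -1): 0.25}
expansion = sum(coef * hinge(xi + si) * hinge(xj + sj)
                for (si, sj), coef in gamma.items())
assert np.allclose(ramp(xi) * ramp(xj), expansion)

# a unit product-ramp coefficient (beta = 1) corresponds to spline
# coefficients of L2 norm 1/2, i.e., ||beta_2||_2 = 2 * ||gamma_2||_2
assert np.isclose(1.0, 2.0 * np.linalg.norm(list(gamma.values())))
```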

Take Γ20={γΓrp2:pen2(γ)=b3,EPθhγ,μ^(x)=0,EPθ^hγ,μ^(x)0}\Gamma_{20}=\{\gamma\in\Gamma_{\mathrm{rp2}}:\mathrm{pen}_{2}(\gamma)=b_{3},\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0,\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\leq 0\} for some fixed b3>0b_{3}>0. Inequality (S18) follows from Proposition S11: it holds with probability at least 12δ1-2\delta that for any γΓ20\gamma\in\Gamma_{20},

Kf(Pn,Pθ^;hγ,μ^)R4,2b3{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~32,\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq R_{4,2b_{3}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{32},

where b3=b3p(p1)b_{3}^{\dagger}=b_{3}\sqrt{p(p-1)}, Δ~32=f(e2b3)ϵ+(80Csg,122M2)pb32R3,2b3+pb3R2,b3λ32\tilde{\Delta}_{32}=-f^{\prime}(\mathrm{e}^{-2b_{3}^{\dagger}})\epsilon+(80C_{\mathrm{sg,12}}^{2}M_{2})pb_{3}^{2}R_{3,2b_{3}^{\dagger}}+\sqrt{p}b_{3}R_{2,b_{3}^{\dagger}}\lambda_{32}, and

λ32=Crad46(p1)n+(p1)log(δ1)n.\lambda_{32}=C_{\mathrm{rad4}}\sqrt{\frac{6(p-1)}{n}}+\sqrt{\frac{(p-1)\log(\delta^{-1})}{n}}.

From (S17)–(S18), the lower bound in (S7) holds with probability at least 12δ1-2\delta, where Δ22=Δ~32+λ3b3\Delta_{22}=\tilde{\Delta}_{32}+\lambda_{3}b_{3} and d(θ^,θ)=R4,2b3maxγΓ20{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}d(\hat{\theta},\theta^{*})=R_{4,2b_{3}^{\dagger}}\max_{\gamma\in\Gamma_{20}}\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\}.

(Step 5) We deduce an upper bound on Σ^ΣF\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}}, by choosing appropriate b3b_{3} and relating the moment matching term d(θ^,θ)d(\hat{\theta},\theta^{*}) to the estimation error. First, combining the upper bound in (S7) from Step 1 and the lower bound from Step 4 shows that with probability at least 16δ1-6\delta,

R4,2b3b3maxγΓrp2,pen2(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}\displaystyle\quad R_{4,2b_{3}^{\dagger}}b_{3}\,\max_{\gamma\in\Gamma_{\mathrm{rp2}},\mathrm{pen}_{2}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}
f(3/5)(ϵ+ϵ/(nδ))f(e2b3)ϵ+(80Csg,122M2)pb32R3,2b3+pb3R2,b3λ32+λ3b3.\displaystyle\leq-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)})-f^{\prime}(\mathrm{e}^{-2b_{3}^{\dagger}})\epsilon+(80C_{\mathrm{sg,12}}^{2}M_{2})pb_{3}^{2}R_{3,2b_{3}^{\dagger}}+\sqrt{p}b_{3}R_{2,b_{3}^{\dagger}}\lambda_{32}+\lambda_{3}b_{3}.

Taking b3=ϵ/p+1/(np2)b_{3}=\sqrt{\epsilon/p}+\sqrt{1/(np^{2})} in the preceding display and rearranging yields

maxγΓrp2,pen2(γ)=1/2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}pErrf3(n,p,δ,ϵ),\displaystyle\quad\max_{\gamma\in\Gamma_{\mathrm{rp2}},\mathrm{pen}_{2}(\gamma)={1/2}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq\sqrt{p}\;\mathrm{Err}_{f3}(n,p,\delta,\epsilon), (S19)

where

Errf3(n,p,δ,ϵ)=(2R4,2b3)1{f(3/5)(ϵ+1/(nδ))\displaystyle\mathrm{Err}_{f3}(n,p,\delta,\epsilon)=({2}R_{4,2b_{3}^{\dagger}})^{-1}\Big{\{}-f^{\prime}(3/5)(\sqrt{\epsilon}+\sqrt{1/(n\delta)})
f(e2b3)ϵ+(80Csg,122M2)R3,2b3(ϵ+1/(np))+R2,b3λ32+λ3/p}.\displaystyle\quad-f^{\prime}(\mathrm{e}^{-2b_{3}^{\dagger}})\sqrt{\epsilon}+(80C_{\mathrm{sg,12}}^{2}M_{2})R_{3,2b_{3}^{\dagger}}(\sqrt{\epsilon}+\sqrt{1/(np)})+R_{2,b_{3}^{\dagger}}\lambda_{32}+\lambda_{3}/\sqrt{p}\Big{\}}.

The error bound for Σ^\hat{\Sigma} then follows from Proposition S12: inequality (S19) together with the error bounds (S14)–(S15) implies that

1pΣ^ΣF\displaystyle\frac{1}{\sqrt{p}}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}} 2M21/2σ^σ2+S7{2Δμ^,σ^+Errf3(n,p,δ,ϵ)}\displaystyle\leq 2M_{2}^{1/2}\|\hat{\sigma}-\sigma^{*}\|_{2}+S_{7}\left\{\sqrt{2}\Delta_{\hat{\mu},\hat{\sigma}}+\mathrm{Err}_{f3}(n,p,\delta,\epsilon)\right\}
S9,aErrf2(n,p,δ,ϵ)+S7Errf3(n,p,δ,ϵ),\displaystyle\leq S_{9,a}\mathrm{Err}_{f2}(n,p,\delta,\epsilon)+S_{7}\mathrm{Err}_{f3}(n,p,\delta,\epsilon),

where Δμ^,σ^=(μ^μ22+σ^σ22)1/2\Delta_{\hat{\mu},\hat{\sigma}}=(\|\hat{\mu}-\mu^{*}\|_{2}^{2}+\|\hat{\sigma}-\sigma^{*}\|_{2}^{2})^{1/2} and S9,a=2M21/2S6,a+2S7(S4,a+S6,a)S_{9,a}=2M_{2}^{1/2}S_{6,a}+\sqrt{2}S_{7}(S_{4,a}+S_{6,a}). ∎

II.3 Proof of Theorem 5

We state and prove the following result, which implies Theorem 5. For δ(0,1)\delta\in(0,1), define (λ21,λ31,λ22,λ32)(\lambda_{21},\lambda_{31},\lambda_{22},\lambda_{32}) to be the same as in Sections II.1 and II.2. Denote

Errh2(n,p,δ,ϵ)=3ϵ(2p)1/2+22pϵ/(nδ)+λ2+λ22,\displaystyle\mathrm{Err}_{h2}(n,p,\delta,\epsilon)=3\epsilon(2p)^{1/2}+2\sqrt{2p\epsilon/(n\delta)}+\lambda_{2}+\lambda_{22},
Errh3(n,p,δ,ϵ)=3ϵp1+2ϵ(p1)/(nδ)+λ32/2+(255/6)Csp22M21λ31.\displaystyle\mathrm{Err}_{{h3}}(n,p,\delta,\epsilon)=3\epsilon\sqrt{p-1}+2\sqrt{\epsilon(p-1)/(n\delta)}+\lambda_{32}/2+(25\sqrt{5}/6)C_{\mathrm{sp22}}M_{21}\lambda_{31}.
Proposition S3.

Assume that ΣopM2\|\Sigma^{*}\|_{\mathrm{op}}\leq M_{2}. Let θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) be a solution to (29) with λ3/p(255/3)Csp22M21λ31\lambda_{3}/\sqrt{p}\geq(25\sqrt{5}/3)C_{\mathrm{sp22}}M_{21}\lambda_{31} and λ2(5/3)Csp21M21/2λ21\lambda_{2}\geq(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}\lambda_{21}, where M21M_{21}, Csp21C_{\mathrm{sp21}}, and Csp22C_{\mathrm{sp22}} are defined as in Proposition S2. If ϵ1/5\epsilon\leq 1/5, ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5 and Errh2(n,p,δ,ϵ)a\mathrm{Err}_{h2}(n,p,\delta,\epsilon)\leq a for a constant a(0,1/2)a\in(0,1/2), then we have that with probability at least 18δ1-8\delta,

μ^μ2\displaystyle\|\hat{\mu}-\mu^{*}\|_{2} S4,aErrh2(n,p,δ,ϵ),\displaystyle\leq S_{4,a}\mathrm{Err}_{h2}(n,p,\delta,\epsilon),
p1/2Σ^ΣF\displaystyle p^{-1/2}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}} S9,aErrh2(n,p,δ,ϵ)+S7Errh3(n,p,δ,ϵ),\displaystyle\leq S_{9,a}\mathrm{Err}_{h2}(n,p,\delta,\epsilon)+S_{7}\mathrm{Err}_{h3}(n,p,\delta,\epsilon),

where (S4,a,S6,a,S7,S9,a)(S_{4,a},S_{6,a},S_{7},S_{9,a}) are defined as in Proposition S2.

Proof of Proposition S3.

The main strategy of our proof is to show that the following inequalities hold with high probability:

d(θ^,θ)Δ22maxγΓ{KHG(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}Δ21,\displaystyle d(\hat{\theta},\theta^{*})-\Delta_{22}\leq\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}\leq\Delta_{21}, (S20)

where Δ21\Delta_{21} and Δ22\Delta_{22} are error terms, and d(θ^,θ)d(\hat{\theta},\theta^{*}) is a moment matching term, as in the proof of Theorem 4. However, additional considerations are involved.

(Step 1) For the upper bound in (S20), we show that with probability at least 14δ1-4\delta,

maxγΓ{KHG(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}
maxγΓ{KHG(Pn,Pθ;hγ,μ)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\} (S21)
maxγΓ{Δ21+pen2(γ1)Δ~21+pen2(γ2)pΔ~31λ2pen2(γ1)λ3pen2(γ2)}.\displaystyle\leq\max_{\gamma\in\Gamma}\;\left\{\Delta_{21}+\mathrm{pen}_{2}(\gamma_{1})\tilde{\Delta}_{21}+\mathrm{pen}_{2}(\gamma_{2})\sqrt{p}\tilde{\Delta}_{31}-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}. (S22)

Inequality (S21) follows from the definition of θ^\hat{\theta}. Inequality (S22) follows from Proposition S15: it holds with probability at least 14δ1-4\delta that for any γΓ\gamma\in\Gamma,

KHG(Pn,Pθ;hγ,μ)Δ21+pen2(γ1)Δ~21+pen2(γ2)pΔ~31,\displaystyle K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})\leq\Delta_{21}+\mathrm{pen}_{2}(\gamma_{1})\tilde{\Delta}_{21}+\mathrm{pen}_{2}(\gamma_{2})\sqrt{p}\tilde{\Delta}_{31},

where

Δ21=2(ϵ+ϵ/(nδ)),Δ~21=(5/3)Csp21M21/2λ21,Δ~31=(255/3)Csp22M21λ31,\displaystyle\Delta_{21}=2(\epsilon+\sqrt{\epsilon/(n\delta)}),\quad\tilde{\Delta}_{21}=(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}\lambda_{21},\quad\tilde{\Delta}_{31}=(25\sqrt{5}/3)C_{\mathrm{sp22}}M_{21}\lambda_{31},

and λ21\lambda_{21} and λ31\lambda_{31} are the same as in the proof of Theorem 3. Note that Δ~21\tilde{\Delta}_{21} and Δ~31\tilde{\Delta}_{31} differ from those in the proof of Theorem 3 only in that R1R_{1} is removed. From (S21)–(S22), the upper bound in (S20) holds with probability at least 14δ1-4\delta, provided that the tuning parameters λ2\lambda_{2} and λ3\lambda_{3} are chosen such that λ2Δ~21\lambda_{2}\geq\tilde{\Delta}_{21} and λ3pΔ~31\lambda_{3}\geq\sqrt{p}\tilde{\Delta}_{31}.

(Step 2) For the first version of the lower bound in (S20), we show that with probability at least 12δ1-2\delta,

maxγΓ{KHG(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}
maxγΓ1{KHG(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)}\displaystyle\geq\max_{\gamma\in\Gamma_{1}}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})\right\} (S23)
maxγΓ10{KHG(Pn,Pθ^;hγ,μ^)}λ2(2p)1/2\displaystyle\geq\max_{\gamma\in\Gamma_{10}}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\right\}-\lambda_{2}\;(2p)^{-1/2} (S24)
maxγΓ10{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~22λ2(2p)1/2,\displaystyle\geq\max_{\gamma\in\Gamma_{10}}\;\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{22}-\lambda_{2}(2p)^{-1/2}, (S25)

where Γ1={(γ0,γ1T,γ2T)T:γ2=0}.\Gamma_{1}=\{(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}:\gamma_{2}=0\}. Inequality (S23) follows because Γ1\Gamma_{1} is defined as a subset of Γ\Gamma such that γ2=0\gamma_{2}=0 and hence pen2(γ2)=0\mathrm{pen}_{2}(\gamma_{2})=0 for γΓ1\gamma\in\Gamma_{1}.

Take Γ10={γΓrp1:γ0=0,pen2(γ)=(2p)1/2}\Gamma_{10}=\{\gamma\in\Gamma_{\mathrm{rp1}}:\gamma_{0}=0,\mathrm{pen}_{2}(\gamma)=(2p)^{-1/2}\}, where Γrp1\Gamma_{\mathrm{rp1}} is the subset of Γ1\Gamma_{1} associated with main-effect ramp functions as in the proof of Theorem 3. Inequality (S24) holds because Γ10Γ1\Gamma_{10}\subset\Gamma_{1} by definition.  Inequality (S25) follows from Proposition S16: it holds with probability at least 12δ1-2\delta that for any γΓ10\gamma\in\Gamma_{10},

KHG(Pn,Pθ^;hγ,μ^)EPθhγ,μ^(x)EPθ^hγ,μ^(x)Δ~22,\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)-\tilde{\Delta}_{22},

where Δ~22=ϵ+λ22(2p)1/2\tilde{\Delta}_{22}=\epsilon+\lambda_{22}(2p)^{-1/2}, and

λ22=Crad516pn+2plog(δ1)n.\lambda_{22}=C_{\mathrm{rad5}}\sqrt{\frac{16p}{n}}+\sqrt{\frac{2p\log(\delta^{-1})}{n}}.

Note that λ22\lambda_{22} is the same as in the proof of Theorem 3. From (S23)–(S25), the lower bound in (S20) holds with probability at least 12δ1-2\delta, where Δ22=Δ~22+λ2(2p)1/2\Delta_{22}=\tilde{\Delta}_{22}+\lambda_{2}(2p)^{-1/2} and d(θ^,θ)=maxγΓ10{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}d(\hat{\theta},\theta^{*})=\max_{\gamma\in\Gamma_{10}}\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\}.

(Step 3) We deduce upper bounds on μ^μ2\|\hat{\mu}-\mu^{*}\|_{2} and σ^σ2\|\hat{\sigma}-\sigma^{*}\|_{2}, by choosing appropriate b2b_{2} and relating the moment matching term d(θ^,θ)d(\hat{\theta},\theta^{*}) to the estimation errors. First, combining the upper bound in (S20) from Step 1 and the lower bound from Step 2 shows that with probability at least 16δ1-{6}\delta,

(2p)1/2maxγΓrp1,pen2(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}\displaystyle\quad(2p)^{-1/2}\,\max_{\gamma\in\Gamma_{\mathrm{rp1}},\mathrm{pen}_{2}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}
3ϵ+2ϵ/(nδ)+(λ2+λ22)(2p)1/2,\displaystyle\leq 3\epsilon+2\sqrt{\epsilon/(n\delta)}+(\lambda_{2}+\lambda_{22})(2p)^{-1/2},

which, multiplied through by p=(2p)1/21/2\sqrt{p}=(2p)^{1/2}\sqrt{1/2}, gives

maxγΓrp1,pen2(γ)=1/2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}3ϵp+2pϵ/(nδ)+(λ2+λ22)/2Errh2(n,p,δ,ϵ),\displaystyle\quad\max_{\gamma\in\Gamma_{\mathrm{rp1}},\mathrm{pen}_{2}(\gamma)=\sqrt{1/2}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq 3\epsilon\sqrt{p}+2\sqrt{p\epsilon/(n\delta)}+(\lambda_{2}+\lambda_{22})/\sqrt{2}\leq\mathrm{Err}_{h2}(n,p,\delta,\epsilon), (S26)

where Errh2(n,p,δ,ϵ)\mathrm{Err}_{h2}(n,p,\delta,\epsilon) is as defined before the statement of Proposition S3; the second inequality holds term by term, because 3ϵp3ϵ(2p)1/23\epsilon\sqrt{p}\leq 3\epsilon(2p)^{1/2}, 2pϵ/(nδ)22pϵ/(nδ)2\sqrt{p\epsilon/(n\delta)}\leq 2\sqrt{2p\epsilon/(n\delta)}, and (λ2+λ22)/2λ2+λ22(\lambda_{2}+\lambda_{22})/\sqrt{2}\leq\lambda_{2}+\lambda_{22}.

The desired result then follows from Proposition S10: provided Errh2(n,p,δ,ϵ)a\mathrm{Err}_{h2}(n,p,\delta,\epsilon)\leq a, inequality (S26) implies that

μ^μ2S4,aErrh2(n,p,δ,ϵ),\displaystyle\|\hat{\mu}-\mu^{*}\|_{2}\leq S_{4,a}\mathrm{Err}_{h2}(n,p,\delta,\epsilon),
σ^σ2S6,aErrh2(n,p,δ,ϵ).\displaystyle\|\hat{\sigma}-\sigma^{*}\|_{2}\leq S_{6,a}\mathrm{Err}_{h2}(n,p,\delta,\epsilon).

(Step 4) For the second version of the lower bound in (S20), we show that with probability at least 12δ1-2\delta,

maxγΓ{KHG(Pn,Pθ^;hγ,μ^)λ2pen2(γ1)λ3pen2(γ2)}\displaystyle\quad\max_{\gamma\in\Gamma}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{2}\;\mathrm{pen}_{2}(\gamma_{1})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\}
maxγΓ2{KHG(Pn,Pθ^;hγ,μ^)λ3pen2(γ2)}\displaystyle\geq\max_{\gamma\in\Gamma_{2}}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\lambda_{3}\;\mathrm{pen}_{2}(\gamma_{2})\right\} (S27)
maxγΓ20{KHG(Pn,Pθ^;hγ,μ^)}λ3(4q)1/2\displaystyle\geq\max_{\gamma\in\Gamma_{20}}\;\left\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\right\}-\lambda_{3}\;(4q)^{-1/2} (S28)
maxγΓ20{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}Δ~32λ3(4q)1/2,\displaystyle\geq\max_{\gamma\in\Gamma_{20}}\;\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\tilde{\Delta}_{32}-\lambda_{3}(4q)^{-1/2}, (S29)

where Γ2={(γ0,γ1T,γ2T)T:γ1=0}\Gamma_{2}=\{(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}:\gamma_{1}=0\}. Inequality (S27) follows because Γ2\Gamma_{2} is a subset of Γ\Gamma such that γ1=0\gamma_{1}=0 and hence pen2(γ1)=0\mathrm{pen}_{2}(\gamma_{1})=0 for γΓ2\gamma\in\Gamma_{2}.

Take Γ20={γΓrp2:γ0=0,pen2(γ)=(4q)1/2}\Gamma_{20}=\{\gamma\in\Gamma_{\mathrm{rp2}}:\gamma_{0}=0,\mathrm{pen}_{2}(\gamma)=(4q)^{-1/2}\} for q=p(p1)q=p(p-1), where Γrp2\Gamma_{\mathrm{rp2}} is the subset of Γ2\Gamma_{2} associated with interaction ramp functions as in the proof of Theorem 3. Inequality (S28) holds because Γ20Γ2\Gamma_{20}\subset\Gamma_{2} by definition. Inequality (S29) follows from Proposition S17: it holds with probability at least 12δ1-2\delta that for any γΓ20\gamma\in\Gamma_{20},

KHG(Pn,Pθ^;hγ,μ^)EPθhγ,μ^(x)EPθ^hγ,μ^(x)Δ~32,\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)-\tilde{\Delta}_{32},

where Δ~32=ϵ+pλ32(4q)1/2\tilde{\Delta}_{32}=\epsilon+\sqrt{p}\lambda_{32}(4q)^{-1/2} and

λ32=Crad46(p1)n+(p1)log(δ1)n.\lambda_{32}=C_{\mathrm{rad4}}\sqrt{\frac{6(p-1)}{n}}+\sqrt{\frac{(p-1)\log(\delta^{-1})}{n}}.

Note that λ32\lambda_{32} is the same as in the proof of Theorem 3. From (S27)–(S29), the lower bound in (S20) holds with probability at least 12δ1-2\delta, where Δ22=Δ~32+λ3(4q)1/2\Delta_{22}=\tilde{\Delta}_{32}+\lambda_{3}(4q)^{-1/2} and d(θ^,θ)=maxγΓ20{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}d(\hat{\theta},\theta^{*})=\max_{\gamma\in\Gamma_{20}}\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\}.

(Step 5) We deduce an upper bound on Σ^ΣF\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}}, by relating the moment matching term d(θ^,θ)d(\hat{\theta},\theta^{*}) to the estimation error. First, combining the upper bound in (S20) from Step 1 and the lower bound from Step 4 shows that with probability at least 16δ1-6\delta,

(4q)1/2maxγΓrp2,pen2(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}\displaystyle\quad(4q)^{-1/2}\max_{\gamma\in\Gamma_{\mathrm{rp2}},\mathrm{pen}_{2}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}
3ϵ+2ϵ/(nδ)+pλ32(4q)1/2+(25/3)5pCsp22M21λ31(4q)1/2,\displaystyle\leq 3\epsilon+2\sqrt{\epsilon/(n\delta)}+\sqrt{p}\lambda_{32}(4q)^{-1/2}+(25/3)\sqrt{5p}C_{\mathrm{sp22}}M_{21}\lambda_{31}(4q)^{-1/2},

which gives

maxγΓrp2,pen2(γ)=1/2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}pErrh3(n,p,δ,ϵ),\displaystyle\quad\max_{\gamma\in\Gamma_{\mathrm{rp2}},\mathrm{pen}_{2}(\gamma)=1/2}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq\sqrt{p}\;\mathrm{Err}_{h3}(n,p,\delta,\epsilon), (S30)

where

Errh3(n,p,δ,ϵ)=3ϵp1+2ϵ(p1)/(nδ)+λ32/2+(255/6)Csp22M21λ31.\displaystyle\mathrm{Err}_{h3}(n,p,\delta,\epsilon)=3\epsilon\sqrt{p-1}+2\sqrt{\epsilon(p-1)/(n\delta)}+\lambda_{32}/2+(25\sqrt{5}/6)C_{\mathrm{sp22}}M_{21}\lambda_{31}.

The desired result then follows from Proposition S12: inequality (S30), together with the error bounds from Step 3, implies that

1pΣ^ΣF\displaystyle\frac{1}{\sqrt{p}}\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}} 2M21/2σ^σ2+S7(2Δμ^,σ^+Errh3(n,p,δ,ϵ))\displaystyle\leq 2M_{2}^{1/2}\|\hat{\sigma}-\sigma^{*}\|_{2}+S_{7}(\sqrt{2}\Delta_{\hat{\mu},\hat{\sigma}}+\mathrm{Err}_{h3}(n,p,\delta,\epsilon))
S9,aErrh2(n,p,δ,ϵ)+S7Errh3(n,p,δ,ϵ),\displaystyle\leq S_{9,a}\mathrm{Err}_{h2}(n,p,\delta,\epsilon)+S_{7}\mathrm{Err}_{h3}(n,p,\delta,\epsilon),

where Δμ^,σ^=(μ^μ22+σ^σ22)1/2\Delta_{\hat{\mu},\hat{\sigma}}=(\|\hat{\mu}-\mu^{*}\|_{2}^{2}+\|\hat{\sigma}-\sigma^{*}\|_{2}^{2})^{1/2} and S9,a=2M21/2S6,a+2S7(S4,a+S6,a)S_{9,a}=2M_{2}^{1/2}S_{6,a}+\sqrt{2}S_{7}(S_{4,a}+S_{6,a}). ∎

II.4 Proof of Corollary 1

(i) In the proofs of Theorems 2 and 3, we used the following main framework,

d(θ^,θ)Δ1maxγΓ{Kf(Pn,Pθ^;hγ,μ^)pen(γ;λ)}Δ2,\displaystyle d(\hat{\theta},\theta^{*})-\Delta_{1}\leq\max_{\gamma\in\Gamma}\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-{\mathrm{pen}(\gamma;\lambda)}\}\leq\Delta_{2}, (S31)

where pen(γ;λ)\mathrm{pen}(\gamma;\lambda) is λ1(γ11+γ21)\lambda_{1}(\|\gamma_{1}\|_{1}+\|\gamma_{2}\|_{1}) or λ2γ12+λ3γ22\lambda_{2}\|\gamma_{1}\|_{2}+\lambda_{3}\|\gamma_{2}\|_{2}. For Theorems 2 and 3, we showed the upper bound in (S31) using the fact that θ^\hat{\theta} is the minimizer of

maxγΓ{Kf(Pn,Pθ;hγ,μ)pen(γ;λ)},\max_{\gamma\in\Gamma}\{K_{f}(P_{n},P_{\theta};h_{\gamma,\mu})-{\mathrm{pen}(\gamma;\lambda)}\},

which is a function of θ\theta by the definition of (25) and (27) as nested optimization (see Remark 1). Now θ^\hat{\theta} is not defined as a minimizer of the above function, but as a solution to an alternating optimization problem (32) with two objectives, so new arguments need to be developed for the upper bound. On the other hand, we showed the lower bound in (S31) for Theorems 2 and 3 by choosing suitable subsets of Γ\Gamma, and those arguments remain applicable here.

(Step 1) For the upper bound in (S31), we show that the following holds with probability at least 1δ1-\delta,

maxγΓ{Kf(Pn,Pθ^;hγ,μ^)pen(γ;λ)}\displaystyle\quad\max_{\gamma\in\Gamma}\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-{\mathrm{pen}(\gamma;\lambda)}\}
maxγΓ{f(1ϵ^)ϵ^+R1|EPθ,nhγ,μ^(x)EPθhγ,μ^(x)|pen(γ;λ)}\displaystyle\leq\max_{\gamma\in\Gamma}\{-f^{\prime}(1-\hat{\epsilon})\hat{\epsilon}+R_{1}\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)\right|-{\mathrm{pen}(\gamma;\lambda)}\} (S32)
Δ1+maxγΓ{R1|EPθ,nhγ,μ^(x)EPθhγ,μ^(x)|pen(γ;λ)},\displaystyle\leq\Delta_{1}+\max_{\gamma\in\Gamma}\{R_{1}\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)\right|-{\mathrm{pen}(\gamma;\lambda)}\}, (S33)

where ϵ^\hat{\epsilon} is the (unobserved) fraction of contamination in (X1,,Xn)(X_{1},\ldots,X_{n}). Inequality (S32) follows from Supplement Lemma S13, and is the most important step for connecting two-objective GAN with logit ff-GAN. Inequality (S33) follows from an upper bound on ϵ^\hat{\epsilon} as proved in Supplement Proposition S5, where Δ1=f(3/5)(ϵ+ϵ/(nδ))\Delta_{1}=-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)}), the same as Δ11\Delta_{11} and Δ21\Delta_{21} in the proofs of Theorems 2 and 3.

Similarly to Supplement Propositions S5 and S8, the term |EPθ,nhγ,μ^(x)EPθhγ,μ^(x)||\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)| can be controlled in terms of the L1L_{1} or L2L_{2} norms of (γ1,γ2)(\gamma_{1},\gamma_{2}), using Supplement Lemma S3 or S9. Then for pen(γ;λ)\mathrm{pen}(\gamma;\lambda) defined as an L1L_{1} or L2L_{2} penalty, it can be shown that the following holds with probability at least 14δ1-4\delta or 16δ1-6\delta respectively,

R1|EPθ,nhγ,μ^(x)EPθhγ,μ^(x)|pen(γ;λ)0,\displaystyle R_{1}\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)\right|-{\mathrm{pen}(\gamma;\lambda)}\leq 0, (S34)

provided that the tuning parameters λ1\lambda_{1} or (λ2,λ3)(\lambda_{2},\lambda_{3}) are chosen as in Theorem 2 or 3 respectively. From (S32)–(S34), the upper bound in (S31) holds with probability at least 15δ1-5\delta or 17δ1-7\delta.

(Steps 2,3) The lower bound step and the estimation error step for L1L_{1} or L2L_{2} penalized two-objective GAN are the same as in the proofs of Theorems 2 and 3 respectively.

(ii) For the two-objective hinge GAN (35), the result follows similarly using Lemma S14 with Δ1=2(ϵ+ϵ/(nδ))\Delta_{1}=2(\epsilon+\sqrt{\epsilon/(n\delta)}) and R1=1R_{1}=1.

III Technical details

III.1 Details in main proof of Theorem 1

Lemma S1.

Suppose that f:(0,)f:(0,\infty)\to\mathbb{R} is convex with f(1)=0f(1)=0 and satisfies Assumption 1(i). Denote Cf=inft(0,1]f′′(t)C_{f}=\inf_{t\in(0,1]}{f^{{\prime\prime}}(t)}. Then

Df(P||Q)Cf2TV(P,Q)2.D_{f}(P||Q)\geq\frac{C_{f}}{2}\mathrm{TV}(P,Q)^{2}.

If further Assumption 1(iii) holds, then Df(P||Q)f′′(1)2TV(P,Q)2D_{f}(P||Q)\geq\frac{f^{\prime\prime}(1)}{2}\mathrm{TV}(P,Q)^{2}.

Proof.

Because x2x^{2} is convex in xx and TV(P,Q)=fTV(p/q)dQ\mathrm{TV}(P,Q)=\int f_{\mathrm{TV}}(p/q)\mathrm{d}Q, where fTV(t)=(1t)+f_{\mathrm{TV}}(t)=(1-t)_{+} and p/qp/q is the density ratio dP/dQ\mathrm{d}P/\mathrm{d}Q, Jensen's inequality gives

TV(P,Q)2fTV2(p/q)dQ.\mathrm{TV}(P,Q)^{2}\leq\int f^{2}_{\mathrm{TV}}(p/q)\mathrm{d}Q.

Note that Df(PQ)D_{f}(P\|Q) can be equivalently obtained as Df~(PQ)D_{\tilde{f}}(P\|Q), where f~(t)=f(t)f(1)(t1)\tilde{f}(t)=f(t)-f^{\prime}(1)(t-1). Therefore, it suffices to show that f~(t)Cf2fTV2(t)\tilde{f}(t)\geq\frac{C_{f}}{2}f^{2}_{\mathrm{TV}}(t) for t(0,)t\in(0,\infty).

By a Taylor expansion of ff, we have

f(t)\displaystyle f(t) =f(1)+f(1)(t1)+f′′(t~)2(t1)2\displaystyle=f(1)+f^{\prime}(1)(t-1)+\frac{f^{\prime\prime}(\tilde{t})}{2}(t-1)^{2}
f(1)(t1)+Cf2(t1)2,\displaystyle\geq f^{\prime}(1)(t-1)+\frac{C_{f}}{2}(t-1)^{2}, (S35)

where t~\tilde{t} lies between tt and 11; the inequality in (S35) uses f′′(t~)Cff^{\prime\prime}(\tilde{t})\geq C_{f}, which is valid when t(0,1]t\in(0,1] because then t~(0,1]\tilde{t}\in(0,1]. If t(0,1]t\in(0,1], then (S35) gives

f~(t)Cf2(t1)2=Cf2(1t)+2=Cf2fTV2(t).\tilde{f}(t)\geq\frac{C_{f}}{2}(t-1)^{2}=\frac{C_{f}}{2}(1-t)_{+}^{2}=\frac{C_{f}}{2}f^{2}_{\mathrm{TV}}(t).

If t(1,)t\in(1,\infty), then t~\tilde{t} may exceed 11, so (S35) is not available with CfC_{f}; instead, the convexity of ff gives f(t)f(1)+f(1)(t1)f(t)\geq f(1)+f^{\prime}(1)(t-1), and hence

f~(t)0=Cf2(1t)+2=Cf2fTV2(t).\tilde{f}(t)\geq 0=\frac{C_{f}}{2}(1-t)^{2}_{+}=\frac{C_{f}}{2}f^{2}_{\mathrm{TV}}(t).

Combining the two cases completes the proof. ∎
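As a quick numerical sanity check of Lemma S1 (an added illustration, not part of the proof; it assumes numpy), take f(t) = −log t, for which f(1) = 0 and Cf = inf_{t∈(0,1]} t^{−2} = 1, so that Df(P‖Q) = DKL(Q‖P) and the lemma asserts DKL(Q‖P) ≥ TV(P,Q)²/2:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    # random discrete distributions P, Q on 10 support points
    P = rng.dirichlet(np.ones(10))
    Q = rng.dirichlet(np.ones(10))
    # f(t) = -log(t): convex, f(1) = 0, and C_f = inf_{t in (0,1]} t^{-2} = 1
    Df = np.sum(Q * (-np.log(P / Q)))   # D_f(P || Q) = E_Q f(p/q) = KL(Q || P)
    TV = 0.5 * np.sum(np.abs(P - Q))
    assert Df >= 0.5 * TV**2            # Lemma S1 with C_f = 1
```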

Denote as Φ()\Phi(\cdot) the cumulative distribution function of N(0,1)\mathrm{N}(0,1), and erf(x)\mathrm{erf}(x) the probability of [2x,2x][-\sqrt{2}x,\sqrt{2}x] under N(0,1)\mathrm{N}(0,1) for x0x\geq 0.

Lemma S2.

Let a[0,1/2)a\in[0,1/2) be arbitrarily fixed.

(i) If Φ(x)1/2+a\Phi(x)\leq 1/2+a for x0x\geq 0, then

xS1,a{Φ(x)1/2},x\leq S_{1,a}\left\{\Phi(x)-1/2\right\},

where S1,a={Φ(Φ1(1/2+a))}1S_{1,a}=\{\Phi^{\prime}(\Phi^{-1}(1/2+a))\}^{-1}.

(ii) If |erf(xz0/2)1/2|a|\mathrm{erf}(x\sqrt{z_{0}/2})-1/2|\leq a for x0x\geq 0, then

|x1|S2,a|erf(xz0/2)1/2|,|x-1|\leq S_{2,a}\left|\mathrm{erf}(x\sqrt{z_{0}/2})-1/2\right|,

where S2,a={z0/2erf(2/z0erf1(1/2+a))}1S_{2,a}=\{\sqrt{z_{0}/2}\mathrm{erf}^{\prime}(\sqrt{2/z_{0}}\mathrm{erf}^{-1}(1/2+a))\}^{-1} and z0z_{0} is a universal constant such that erf(z0/2)=1/2\mathrm{erf}(\sqrt{z_{0}/2})=1/2.

Proof.

(i) By the mean value theorem, Φ(x)=1/2+Φ(ξ)x\Phi(x)=1/2+\Phi^{\prime}(\xi)x for some ξ(0,x)\xi\in(0,x). The assumption Φ(x)1/2+a\Phi(x)\leq 1/2+a implies xΦ1(1/2+a)x\leq\Phi^{-1}(1/2+a), and Φ()\Phi^{\prime}(\cdot) is decreasing on [0,+)[0,+\infty), so Φ(ξ)Φ(Φ1(1/2+a))=S1,a1\Phi^{\prime}(\xi)\geq\Phi^{\prime}(\Phi^{-1}(1/2+a))=S^{-1}_{1,a}. Hence Φ(x)12+S1,a1x\Phi(x)\geq\frac{1}{2}+S^{-1}_{1,a}x.

(ii) Note that erf(z0/2)=1/2\mathrm{erf}(\sqrt{z_{0}/2})=1/2 by the definition of z0z_{0}. By the same mean value theorem argument, using that erf()\mathrm{erf}^{\prime}(\cdot) is decreasing on [0,+)[0,+\infty) and |erf(xz0/2)1/2|a|\mathrm{erf}(x\sqrt{z_{0}/2})-1/2|\leq a, we have |erf(xz0/2)1/2|S2,a1|x1||\text{erf}(x\sqrt{z_{0}/2})-1/2|\geq S^{-1}_{2,a}{|x-1|}. ∎
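Since erf here coincides with the standard error function (erf(x) = P(|Z| ≤ √2 x) for Z ∼ N(0,1)), the constants above are directly computable. The following sketch (an added illustration assuming numpy and scipy) evaluates z0, which is the median of the χ²₁ distribution (≈ 0.4549), together with S1,a and S2,a:

```python
import numpy as np
from scipy import stats
from scipy.special import erf, erfinv

# z0 solves erf(sqrt(z0/2)) = 1/2, i.e., z0 is the median of chi-squared(1)
z0 = stats.chi2.ppf(0.5, df=1)                 # ~ 0.4549
assert np.isclose(erf(np.sqrt(z0 / 2.0)), 0.5)

def S1(a):
    # S_{1,a} = 1 / Phi'(Phi^{-1}(1/2 + a))
    return 1.0 / stats.norm.pdf(stats.norm.ppf(0.5 + a))

def S2(a):
    # S_{2,a} = 1 / ( sqrt(z0/2) * erf'(sqrt(2/z0) * erf^{-1}(1/2 + a)) )
    x = np.sqrt(2.0 / z0) * erfinv(0.5 + a)
    erf_deriv = 2.0 / np.sqrt(np.pi) * np.exp(-x**2)
    return 1.0 / (np.sqrt(z0 / 2.0) * erf_deriv)

print(round(z0, 4), round(S1(0.1), 3), round(S2(0.1), 3))
```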

Proposition S4.

For two multivariate Gaussian distributions, Pθ¯P_{\bar{\theta}} and PθP_{\theta^{*}}, with θ¯=(μ¯,Σ¯)\bar{\theta}=(\bar{\mu},\bar{\Sigma}) and θ=(μ,Σ)\theta^{*}=(\mu^{*},\Sigma^{*}), denote d(θ¯,θ)=TV(Pθ¯,Pθ)d(\bar{\theta},\theta^{*})=\mathrm{TV}(P_{\bar{\theta}},P_{\theta^{*}}).

  • (i)

    If d(θ¯,θ)ad(\bar{\theta},\theta^{*})\leq a for a constant a[0,1/2)a\in[0,1/2), then

    μ¯μ2\displaystyle\|\bar{\mu}-\mu^{*}\|_{2} S1,aΣop1/2d(θ¯,θ),\displaystyle\leq S_{1,a}\|\Sigma^{*}\|_{\mathrm{op}}^{1/2}d(\bar{\theta},\theta^{*}),
    μ¯μ\displaystyle\|\bar{\mu}-\mu^{*}\|_{\infty} S1,aΣmax1/2d(θ¯,θ),\displaystyle\leq S_{1,a}\|\Sigma^{*}\|_{\max}^{1/2}d(\bar{\theta},\theta^{*}),

    where S1,a={Φ(Φ1(1/2+a))}1S_{1,a}=\{\Phi^{\prime}(\Phi^{-1}(1/2+a))\}^{-1} as in Lemma S2.

  • (ii)

    If further d(θ¯,θ)a/(1+S1,a)d(\bar{\theta},\theta^{*})\leq a/(1+S_{1,a}), then

    Σ¯Σop\displaystyle\|\bar{\Sigma}-\Sigma^{*}\|_{\mathrm{op}} 2S3,aΣopd(θ¯,θ)+S3,a2Σop(d(θ¯,θ))2,\displaystyle\leq 2S_{3,a}\|\Sigma^{*}\|_{\mathrm{op}}d(\bar{\theta},\theta^{*})+S^{2}_{3,a}\|\Sigma^{*}\|_{\mathrm{op}}(d(\bar{\theta},\theta^{*}))^{2},
    Σ¯Σmax\displaystyle\|\bar{\Sigma}-\Sigma^{*}\|_{\max} 4S3,aΣmaxd(θ¯,θ)+2S3,a2Σmax(d(θ¯,θ))2,\displaystyle\leq 4S_{3,a}\|\Sigma^{*}\|_{\max}d(\bar{\theta},\theta^{*})+2S^{2}_{3,a}\|\Sigma^{*}\|_{\max}(d(\bar{\theta},\theta^{*}))^{2},

where S3,a=S2,a(1+S1,a)S_{3,a}=S_{2,a}(1+S_{1,a}), S2,a={z0/2erf(2/z0erf1(1/2+a))}1S_{2,a}=\{\sqrt{z_{0}/2}\,\mathrm{erf}^{\prime}(\sqrt{2/z_{0}}\,\mathrm{erf}^{-1}(1/2+a))\}^{-1}, and the constant z0z_{0} is defined such that erf(z0/2)=1/2\mathrm{erf}(\sqrt{z_{0}/2})=1/2, as in Lemma S2.

Proof.

The TV distance TV(P1,P2)\mathrm{TV}(P_{1},P_{2}) can be equivalently defined as

TV(P1,P2)=supA𝒜|P1(A)P2(A)|,\displaystyle\mathrm{TV}(P_{1},P_{2})=\sup_{A\in\mathcal{A}}|P_{1}(A)-P_{2}(A)|,

for P1P_{1} and P2P_{2} defined on a common measurable space (𝒳,𝒜)(\mathcal{X},\mathcal{A}). This definition is applicable to multivariate Gaussian distributions with singular variance matrices. To derive the desired results, we choose specific events AA and show that the differences in the means and variance matrices can be upper bounded by |Pθ¯(A)Pθ(A)||P_{\bar{\theta}}(A)-P_{\theta^{*}}(A)|.

We first show results (i) and (ii) when Σ\Sigma^{*} and Σ¯\bar{\Sigma} are nonsingular. Then we show that the results remain valid when Σ\Sigma^{*} or Σ¯\bar{\Sigma} is singular.

(i) Assume that both Σ\Sigma^{*} and Σ¯\bar{\Sigma} are nonsingular. For any upu\in\mathbb{R}^{p}, we have by the definition of TV,

Pμ,Σ(uTXuTμ¯)Pμ¯,Σ¯(uTXuTμ¯)d(θ¯,θ).P_{\mu^{*},\Sigma^{*}}(u^{\mathrm{\scriptscriptstyle T}}X\leq u^{\mathrm{\scriptscriptstyle T}}\bar{\mu})-P_{\bar{\mu},\bar{\Sigma}}(u^{\mathrm{\scriptscriptstyle T}}X\leq u^{\mathrm{\scriptscriptstyle T}}\bar{\mu})\leq d(\bar{\theta},\theta^{*}).

For nonzero upu\in\mathbb{R}^{p}, because uTΣ¯u0u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\not=0 and uTΣu0u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u\not=0, we have

Pμ¯,Σ¯(uTXuTμ¯)\displaystyle P_{\bar{\mu},\bar{\Sigma}}(u^{\mathrm{\scriptscriptstyle T}}X\leq u^{\mathrm{\scriptscriptstyle T}}\bar{\mu}) =12,\displaystyle=\frac{1}{2},
Pμ,Σ(uTXuTμ¯)\displaystyle P_{\mu^{*},\Sigma^{*}}(u^{\mathrm{\scriptscriptstyle T}}X\leq u^{\mathrm{\scriptscriptstyle T}}\bar{\mu}) =Φ(uT(μ¯μ)uTΣu).\displaystyle=\Phi\left(\frac{u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*})}{\sqrt{u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}}\right).

Combining the preceding three displays shows that for nonzero upu\in\mathbb{R}^{p},

Φ(uT(μ¯μ)uTΣu)12+d(θ¯,θ).\displaystyle\Phi\left(\frac{u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*})}{\sqrt{u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}}\right)\leq\frac{1}{2}+d(\bar{\theta},\theta^{*}).

By Lemma S2 (i), if d(θ¯,θ)ad(\bar{\theta},\theta^{*})\leq a for a constant a[0,1/2)a\in[0,1/2), then for any upu\in\mathbb{R}^{p} satisfying uT(μ¯μ)0u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*})\geq 0,

0uT(μ¯μ)\displaystyle{0\leq}u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*}) uTΣuS1,ad(θ¯,θ).\displaystyle\leq\sqrt{u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}S_{1,a}d(\bar{\theta},\theta^{*}). (S36)

Let 𝒰2={up:u2=1}\mathcal{U}_{2}=\{u\in\mathbb{R}^{p}:\|u\|_{2}=1\}. By (S36) with uu restricted such that u𝒰2u\in\mathcal{U}_{2} and uT(μ¯μ)0u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*})\geq 0, we have

μ¯μ2\displaystyle\|\bar{\mu}-\mu^{*}\|_{2} =supu𝒰2uT(μ¯μ)\displaystyle=\sup_{u\in\mathcal{U}_{2}}u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*})
S1,aΣop1/2d(θ¯,θ).\displaystyle\leq S_{1,a}\|\Sigma^{*}\|_{\mathrm{op}}^{1/2}d(\bar{\theta},\theta^{*}).

Similarly, let 𝒰={±ej:j=1,,p}\mathcal{U}_{\infty}=\{\pm\mathrm{e}_{j}:j=1,\dots,p\}, where eje_{j} is a vector with jjth coordinate being one and others being zero. By (S36) with uu restricted such that u𝒰u\in\mathcal{U}_{\infty} and uT(μ¯μ)0u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*})\geq 0, we have

μ¯μ\displaystyle\|\bar{\mu}-\mu^{*}\|_{\infty} =supu𝒰uT(μ¯μ)\displaystyle=\sup_{u\in\mathcal{U}_{\infty}}u^{\mathrm{\scriptscriptstyle T}}(\bar{\mu}-\mu^{*})
S1,aΣmax1/2d(θ¯,θ).\displaystyle\leq S_{1,a}\|\Sigma^{*}\|_{\max}^{1/2}d(\bar{\theta},\theta^{*}).

The last line uses the fact that supu𝒰uTΣu=diag(Σ)=Σmax\sup_{u\in\mathcal{U}_{\infty}}u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u=\|\mathrm{diag}(\Sigma^{*})\|_{\infty}=\|\Sigma^{*}\|_{\max}, because the largest absolute entry of a variance matrix is attained on the diagonal.

(ii) Assume that Σ\Sigma^{*} and Σ¯\bar{\Sigma} are nonsingular. We first separate the bias caused by the location difference between Pθ¯P_{\bar{\theta}} and PθP_{\theta^{*}}. By the triangle inequality, we have

TV(Pμ¯,Σ,Pμ¯,Σ¯)TV(Pμ¯,Σ,Pμ,Σ)+TV(Pμ,Σ,Pμ¯,Σ¯).\displaystyle\mathrm{TV}(P_{\bar{\mu},\Sigma^{*}},P_{\bar{\mu},\bar{\Sigma}})\leq\mathrm{TV}(P_{\bar{\mu},\Sigma^{*}},P_{\mu^{*},\Sigma^{*}})+\mathrm{TV}(P_{\mu^{*},\Sigma^{*}},P_{\bar{\mu},\bar{\Sigma}}). (S37)

By Lemma S1 applied with f(t)=tlogtf(t)=t\log t, we know that TV(P,Q)2DKL(P||Q)TV(P,Q)\leq\sqrt{2D_{\mathrm{KL}}(P||Q)}. Then we have

TV(Pμ¯,Σ,Pμ,Σ)\displaystyle\mathrm{TV}(P_{\bar{\mu},\Sigma^{*}},P_{\mu^{*},\Sigma^{*}}) 2DKL(Pμ¯,Σ||Pμ,Σ)\displaystyle\leq\sqrt{2D_{\mathrm{KL}}(P_{\bar{\mu},\Sigma^{*}}||P_{\mu^{*},\Sigma^{*}})}
S1,ad(θ¯,θ),\displaystyle\leq S_{1,a}d(\bar{\theta},\theta^{*}), (S38)

provided that d(θ¯,θ)ad(\bar{\theta},\theta^{*})\leq a. Inequality (S38) follows because, by a standard calculation,

DKL(N(μ¯,Σ)||N(μ,Σ))\displaystyle D_{\text{KL}}(N(\bar{\mu},\Sigma^{*})||N(\mu^{*},\Sigma^{*})) =12(μ¯μ)TΣ1(μ¯μ),\displaystyle=\frac{1}{2}(\bar{\mu}-\mu^{*})^{\mathrm{\scriptscriptstyle T}}\Sigma^{*-1}(\bar{\mu}-\mu^{*}),

and taking u=Σ1(μ¯μ)u=\Sigma^{*-1}(\bar{\mu}-\mu^{*}) in (S36) gives

(μ¯μ)TΣ1(μ¯μ)\displaystyle\sqrt{(\bar{\mu}-\mu^{*})^{\mathrm{\scriptscriptstyle T}}\Sigma^{*-1}(\bar{\mu}-\mu^{*})} S1,ad(θ¯,θ).\displaystyle\leq S_{1,a}d(\bar{\theta},\theta^{*}).

Combining (S37) and (S38) yields

TV(Pμ¯,Σ,Pμ¯,Σ¯)\displaystyle\mathrm{TV}(P_{\bar{\mu},\Sigma^{*}},P_{\bar{\mu},\bar{\Sigma}}) d(θ¯,θ)+S1,ad(θ¯,θ).\displaystyle\leq d(\bar{\theta},\theta^{*})+{S_{1,a}}d(\bar{\theta},\theta^{*}). (S39)

For any upu\in\mathbb{R}^{p} such that uT(Σ¯Σ)u0u^{\mathrm{\scriptscriptstyle T}}(\bar{\Sigma}-\Sigma^{*})u\geq 0, (S39) implies

0\displaystyle 0 Pμ¯,Σ{(uTXuTμ¯)2z0uTΣ¯u}Pμ¯,Σ¯{(uTXuTμ¯)2z0uTΣ¯u}\displaystyle\leq P_{\bar{\mu},\Sigma^{*}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X-u^{\mathrm{\scriptscriptstyle T}}\bar{\mu})^{2}\leq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}-P_{\bar{\mu},\bar{\Sigma}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X-u^{\mathrm{\scriptscriptstyle T}}\bar{\mu})^{2}\leq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}
=P0,Σ{(uTX)2z0uTΣ¯u}P0,Σ¯{(uTX)2z0uTΣ¯u}\displaystyle=P_{0,\Sigma^{*}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X)^{2}\leq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}-P_{0,\bar{\Sigma}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X)^{2}\leq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}
d(θ¯,θ)+S1,ad(θ¯,θ),\displaystyle\leq d(\bar{\theta},\theta^{*})+S_{1,a}d(\bar{\theta},\theta^{*}),

where z0z_{0} is a universal constant such that erf(z0/2)=1/2\mathrm{erf}(\sqrt{z_{0}/2})=1/2. Similarly, for any upu\in\mathbb{R}^{p} such that uT(Σ¯Σ)u0u^{\mathrm{\scriptscriptstyle T}}(\bar{\Sigma}-\Sigma^{*})u\leq 0, (S39) implies

0\displaystyle 0 Pμ¯,Σ{(uTXuTμ¯)2z0uTΣ¯u}Pμ¯,Σ¯{(uTXuTμ¯)2z0uTΣ¯u}\displaystyle\leq P_{\bar{\mu},\Sigma^{*}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X-u^{\mathrm{\scriptscriptstyle T}}\bar{\mu})^{2}\geq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}-P_{\bar{\mu},\bar{\Sigma}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X-u^{\mathrm{\scriptscriptstyle T}}\bar{\mu})^{2}\geq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}
=P0,Σ{(uTX)2z0uTΣ¯u}P0,Σ¯{(uTX)2z0uTΣ¯u}\displaystyle=P_{0,\Sigma^{*}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X)^{2}\geq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}-P_{0,\bar{\Sigma}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X)^{2}\geq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}
d(θ¯,θ)+S1,ad(θ¯,θ).\displaystyle\leq d(\bar{\theta},\theta^{*})+S_{1,a}d(\bar{\theta},\theta^{*}).

Notice that the choice of z0z_{0} ensures that for ZN(0,1)Z\sim\mathrm{N}(0,1),

P0,Σ¯{(uTX)2z0uTΣ¯u}\displaystyle P_{0,\bar{\Sigma}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X)^{2}\leq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\} =P(Z2z0)=12\displaystyle=\mathrm{P}(Z^{2}\leq z_{0})=\frac{1}{2}
=P(Z2z0)=P0,Σ¯{(uTX)2z0uTΣ¯u}.\displaystyle=\mathrm{P}(Z^{2}\geq z_{0})=P_{0,\bar{\Sigma}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X)^{2}\geq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}.

Moreover, for any nonzero upu\in\mathbb{R}^{p}, we have by the definition of erf\mathrm{erf},

P0,Σ{(uTX)2z0uTΣ¯u}=erf(z0uTΣ¯u2uTΣu).\displaystyle P_{0,\Sigma^{*}}\left\{(u^{\mathrm{\scriptscriptstyle T}}X)^{2}\leq z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u\right\}=\mathrm{erf}\left(\sqrt{\frac{z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u}{2u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}}\right).

Combining the preceding four displays shows that for any nonzero upu\in\mathbb{R}^{p},

|erf(z0uTΣ¯u2uTΣu)12|(1+S1,a)d(θ¯,θ).\displaystyle\left|\mathrm{erf}\left(\sqrt{\frac{z_{0}u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u}{2u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}}\right)-\frac{1}{2}\right|\leq(1+S_{1,a})d(\bar{\theta},\theta^{*}).

By Lemma S2 (ii), if d(θ¯,θ)min(a,a/(1+S1,a))=a/(1+S1,a)d(\bar{\theta},\theta^{*})\leq\min(a,a/(1+S_{1,a}))=a/(1+S_{1,a}), then for any nonzero upu\in\mathbb{R}^{p},

|uTΣ¯uuTΣu1|\displaystyle\left|\sqrt{\frac{u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u}{u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}}-1\right| S2,a(1+S1,a)d(θ¯,θ),\displaystyle\leq S_{2,a}(1+S_{1,a})d(\bar{\theta},\theta^{*}),

or equivalently for any upu\in\mathbb{R}^{p},

|uTΣ¯uuTΣu|\displaystyle\left|\sqrt{u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u}-\sqrt{u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}\right| S2,a(1+S1,a)uTΣud(θ¯,θ).\displaystyle\leq S_{2,a}(1+S_{1,a})\sqrt{u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u}d(\bar{\theta},\theta^{*}). (S40)

Notice that for any a,b,c0a,b,c{\geq}0, if |ab|c|\sqrt{a}-\sqrt{b}|\leq c then |ab|2bc+c2|a-b|\leq 2\sqrt{b}c+c^{2}. Thus, inequality (S40) implies

Σ¯Σop=supu𝒰2|uT(Σ¯Σ)u|\displaystyle\quad\|\bar{\Sigma}-\Sigma^{*}\|_{\mathrm{op}}=\sup_{u\in\mathcal{U}_{2}}\left|u^{\mathrm{\scriptscriptstyle T}}(\bar{\Sigma}-\Sigma^{*})u\right|
2S3,aΣopd(θ¯,θ)+S3,a2Σop(d(θ¯,θ))2,\displaystyle\leq 2S_{3,a}\|\Sigma^{*}\|_{\mathrm{op}}d(\bar{\theta},\theta^{*})+S^{2}_{3,a}\|\Sigma^{*}\|_{\mathrm{op}}(d(\bar{\theta},\theta^{*}))^{2}, (S41)

where S3,a=S2,a(1+S1,a)S_{3,a}=S_{2,a}(1+S_{1,a}).

To handle Σ¯Σmax\|\bar{\Sigma}-\Sigma^{*}\|_{\max}, let 𝒰2,={±eij:i,j=1,,p,ij}\mathcal{U}_{2,\infty}=\{\pm e_{ij}:i,j=1,\dots,p,i\neq j\}, where eije_{ij} is a vector in 𝒰2\mathcal{U}_{2} with only iith and jjth coordinates possibly being nonzero. For u𝒰2,u\in\mathcal{U}_{2,\infty}, we have uTΣu=uijTΣijuiju^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u=u_{ij}^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}_{ij}u_{ij} and uTΣ¯u=uijTΣ¯ijuiju^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u=u_{ij}^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}_{ij}u_{ij}, where uij2u_{ij}\in\mathbb{R}^{2} is formed by iith and jjth coordinates of uu, and Σij\Sigma^{*}_{ij} and Σ¯ij\bar{\Sigma}_{ij} are 2×22\times 2 matrices, formed by selecting iith and jjth rows and columns from Σ\Sigma^{*} and Σ¯\bar{\Sigma} respectively. Similarly to the derivation of (S41), applying inequality (S40) with u𝒰2,u\in\mathcal{U}_{2,\infty}, we have

Σ¯ijΣijop\displaystyle\|\bar{\Sigma}_{ij}-\Sigma_{ij}^{*}\|_{\mathrm{op}} =supu{±eij}|uT(Σ¯Σ)u|\displaystyle=\sup_{u\in\{\pm e_{ij}\}}\left|u^{\mathrm{\scriptscriptstyle T}}(\bar{\Sigma}-\Sigma^{*})u\right|
2S3,aΣijopd(θ¯,θ)+S3,a2Σijop(d(θ¯,θ))2.\displaystyle\leq 2S_{3,a}\|\Sigma_{ij}^{*}\|_{\mathrm{op}}d(\bar{\theta},\theta^{*})+S^{2}_{3,a}\|\Sigma_{ij}^{*}\|_{\mathrm{op}}(d(\bar{\theta},\theta^{*}))^{2}.

Because for a matrix Am1×m2A\in\mathbb{R}^{m_{1}\times m_{2}}, AmaxAopm1m2Amax\|A\|_{\max}\leq\|A\|_{\mathrm{op}}\leq\sqrt{m_{1}m_{2}}\|A\|_{\max}, the above inequality implies that for any ij{1,,p}i\not=j\in\{1,\ldots,p\},

Σ¯ijΣijmax\displaystyle\quad\|\bar{\Sigma}_{ij}-\Sigma_{ij}^{*}\|_{\max}
4S3,aΣijmaxd(θ¯,θ)+2S3,a2Σijmax(d(θ¯,θ))2.\displaystyle\leq 4S_{3,a}\|\Sigma_{ij}^{*}\|_{\max}d(\bar{\theta},\theta^{*})+2S^{2}_{3,a}\|\Sigma_{ij}^{*}\|_{\max}(d(\bar{\theta},\theta^{*}))^{2}. (S42)

Taking the maximum on both sides of (S42) over iji\not=j gives the desired result:

Σ¯Σmax=maxij{1,,p}Σ¯ijΣijmax\displaystyle\quad\|\bar{\Sigma}-\Sigma^{*}\|_{\max}{=}\max_{i\neq j\in\{1,\dots,p\}}\|\bar{\Sigma}_{ij}-\Sigma_{ij}^{*}\|_{\max}
4S3,aΣmaxd(θ¯,θ)+2S3,a2Σmax(d(θ¯,θ))2.\displaystyle\leq 4S_{3,a}\|\Sigma^{*}\|_{\max}d(\bar{\theta},\theta^{*})+2S^{2}_{3,a}\|\Sigma^{*}\|_{\max}(d(\bar{\theta},\theta^{*}))^{2}.

(iii) Consider the case where Σ\Sigma^{*} or Σ¯\bar{\Sigma} is singular. As the following argument is symmetric in Σ\Sigma^{*} and Σ¯\bar{\Sigma}, we assume without loss of generality that Σ\Sigma^{*} is singular. Fix any nonzero uu such that uTΣu=0u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u=0.

First, we show that for θ¯=(μ¯,Σ¯)\bar{\theta}=(\bar{\mu},\bar{\Sigma}) such that TV(Pθ¯,Pθ)<1\mathrm{TV}(P_{\bar{\theta}},P_{\theta^{*}})<1, we also have uTΣ¯u=0u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u=0. In fact, TV(Pθ¯,Pθ)<1\mathrm{TV}(P_{\bar{\theta}},P_{\theta^{*}})<1 implies

|Pμ,Σ(uTX=uTμ)Pμ¯,Σ¯(uTX=uTμ)|d(θ¯,θ)<1.\displaystyle\left|P_{\mu^{*},\Sigma^{*}}(u^{\mathrm{\scriptscriptstyle T}}X=u^{\mathrm{\scriptscriptstyle T}}\mu^{*})-P_{\bar{\mu},\bar{\Sigma}}(u^{\mathrm{\scriptscriptstyle T}}X=u^{\mathrm{\scriptscriptstyle T}}\mu^{*})\right|\leq d(\bar{\theta},\theta^{*})<1. (S43)

Note that Pμ,Σ(uTX=uTμ)=1P_{\mu^{*},\Sigma^{*}}(u^{\mathrm{\scriptscriptstyle T}}X=u^{\mathrm{\scriptscriptstyle T}}\mu^{*})=1 because uTΣu=0u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u=0. If uTΣ¯u>0u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u>0, then Pμ¯,Σ¯(uTX=uTμ)=0P_{\bar{\mu},\bar{\Sigma}}(u^{\mathrm{\scriptscriptstyle T}}X=u^{\mathrm{\scriptscriptstyle T}}\mu^{*})=0, and hence (S43) gives

|10|d(θ¯,θ)<1,|1-0|\leq d(\bar{\theta},\theta^{*})<1,

which is a contradiction. Thus uTΣ¯u=0u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u=0.

Next we show that for θ¯=(μ¯,Σ¯)\bar{\theta}=(\bar{\mu},\bar{\Sigma}) such that TV(Pθ¯,Pθ)<1\mathrm{TV}(P_{\bar{\theta}},P_{\theta^{*}})<1, we also have uT(μμ¯)=0u^{\mathrm{\scriptscriptstyle T}}(\mu^{*}-\bar{\mu})=0. In fact, with uTΣ¯u=0u^{\mathrm{\scriptscriptstyle T}}\bar{\Sigma}u=0 as shown above, we have that Pμ¯,Σ¯(uTX=uTμ)=1P_{\bar{\mu},\bar{\Sigma}}(u^{\mathrm{\scriptscriptstyle T}}X=u^{\mathrm{\scriptscriptstyle T}}\mu^{*})=1 if uTμ=uTμ¯u^{\mathrm{\scriptscriptstyle T}}\mu^{*}=u^{\mathrm{\scriptscriptstyle T}}\bar{\mu} and Pμ¯,Σ¯(uTX=uTμ)=0P_{\bar{\mu},\bar{\Sigma}}(u^{\mathrm{\scriptscriptstyle T}}X=u^{\mathrm{\scriptscriptstyle T}}\mu^{*})=0 otherwise. If uTμuTμ¯u^{\mathrm{\scriptscriptstyle T}}\mu^{*}\neq u^{\mathrm{\scriptscriptstyle T}}\bar{\mu}, then inequality (S43) gives

|10|d(θ¯,θ)<1,|1-0|\leq d(\bar{\theta},\theta^{*})<1,

which is a contradiction. Thus uTμ=uTμ¯u^{\mathrm{\scriptscriptstyle T}}\mu^{*}=u^{\mathrm{\scriptscriptstyle T}}\bar{\mu}.

From the two preceding results, we see that the upper bounds (S36) and (S40) derived in (i) and (ii) remain valid for any upu\in\mathbb{R}^{p} satisfying uTΣu=0u^{\mathrm{\scriptscriptstyle T}}\Sigma^{*}u=0. Hence the desired results hold by the remaining proofs in (i) and (ii). ∎
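Proposition S4(i) can be sanity-checked in one dimension, where the TV distance between N(μ̄,σ²) and N(μ*,σ²) has the closed form 2Φ(|μ̄−μ*|/(2σ)) − 1. The following sketch (an added illustration assuming numpy and scipy, with arbitrary values of σ and a) confirms that |μ̄−μ*| ≤ S1,a σ TV whenever TV ≤ a:

```python
import numpy as np
from scipy import stats

def S1(a):
    # S_{1,a} = 1 / Phi'(Phi^{-1}(1/2 + a)), as in Lemma S2
    return 1.0 / stats.norm.pdf(stats.norm.ppf(0.5 + a))

a, sigma = 0.2, 1.5   # arbitrary illustrative values
for delta in np.linspace(1e-4, 2.0, 200):
    # closed-form TV distance between N(0, sigma^2) and N(delta, sigma^2)
    tv = 2.0 * stats.norm.cdf(delta / (2.0 * sigma)) - 1.0
    if tv <= a:
        # Proposition S4(i) in one dimension
        assert delta <= S1(a) * sigma * tv
```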

III.2 Details in main proof of Theorem 2

Lemma S3.

Suppose that X1,,XnX_{1},\ldots,X_{n} are independent and identically distributed as XNp(0,Σ)X\sim N_{p}(0,\Sigma) with ΣmaxM1\|\Sigma\|_{\max}\leq M_{1}. For kk fixed knots ξ1,,ξk\xi_{1},\ldots,\xi_{k} in \mathbb{R}, denote φ(x)=(φ1T(x),,φkT(x))T\varphi(x)=(\varphi^{\mathrm{\scriptscriptstyle T}}_{1}(x),\ldots,\varphi^{\mathrm{\scriptscriptstyle T}}_{k}(x))^{\mathrm{\scriptscriptstyle T}}, where φl(x)p\varphi_{l}(x)\in\mathbb{R}^{p} is obtained by applying t(tξl)+t\mapsto(t-\xi_{l})_{+} componentwise to xpx\in\mathbb{R}^{p} for l=1,,kl=1,\ldots,k. Then the following results hold.

(i) Each component of the random vector φ(X)Eφ(X)\varphi(X)-\mathrm{E}\varphi(X) is a sub-gaussian random variable with tail parameter M11/2M_{1}^{1/2}.

(ii) For any δ>0\delta>0, we have that with probability at least 12δ1-2\delta,

supw1=1|wT{1ni=1nφ(Xi)Eφ(X)}|\displaystyle\quad\sup_{\|w\|_{1}=1}\left|w^{\mathrm{\scriptscriptstyle T}}\left\{\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X)\right\}\right|
Csp11M11/22log(kp)+log(δ1)n,\displaystyle\leq C_{\mathrm{sp11}}M_{1}^{1/2}\sqrt{\frac{2\log(kp)+\log(\delta^{-1})}{n}},

where Csp11=2Csg5C_{\mathrm{sp11}}=\sqrt{2}C_{\mathrm{sg5}}, depending on the universal constant Csg5C_{\mathrm{sg5}} in Lemma S25.

(iii) Let 𝒜1={Akp×kp:A1,1=1}\mathcal{A}_{1}=\{A\in\mathbb{R}^{kp\times kp}:\|A\|_{1,1}=1\}. For any δ>0\delta>0, we have that with probability at least 14δ1-4\delta, both the inequality in (ii) and

supA𝒜1|1ni=1nφT(Xi)Aφ(Xi)EφT(X)Aφ(X)|\displaystyle\quad\sup_{A\in\mathcal{A}_{1}}\left|\frac{1}{n}\sum_{i=1}^{n}\varphi^{\mathrm{\scriptscriptstyle T}}(X_{i})A\varphi(X_{i})-\mathrm{E}\varphi^{\mathrm{\scriptscriptstyle T}}(X)A\varphi(X)\right|
Csp12M11{2log(kp)+log(δ1)n+2log(kp)+log(δ1)n},\displaystyle\leq C_{\mathrm{sp12}}M_{11}\left\{\sqrt{\frac{2\log(kp)+\log(\delta^{-1})}{n}}+\frac{2\log(kp)+\log(\delta^{-1})}{n}\right\},

where M11=M11/2(M11/2+2πξ)M_{11}=M_{1}^{1/2}(M_{1}^{1/2}+\sqrt{2\pi}\|\xi\|_{\infty}), ξ=maxl=1,,k|ξl|\|\xi\|_{\infty}=\max_{{l=1,\ldots,k}}|\xi_{l}|, and Csp12=2/πCsp11+Csx7Csx6Csx5C_{\mathrm{sp12}}=\sqrt{2/\pi}C_{\mathrm{sp11}}+C_{\mathrm{sx7}}C_{\mathrm{sx6}}C_{\mathrm{sx5}}. Constants (Csx5,Csx6,Csx7)(C_{\mathrm{sx5}},C_{\mathrm{sx6}},C_{\mathrm{sx7}}) are the universal constants in Lemmas S30, S31, and S32.

Proof.

(i) This can be obtained as the univariate case of Lemma S9 (i). It is only required that the marginal variance of each component of XX is upper bounded by M1M_{1}.

(ii) Notice that

supw1=1|wT{1ni=1nφ(Xi)Eφ(X)}|=1ni=1nφ(Xi)Eφ(X).\displaystyle\sup_{\|w\|_{1}=1}\left|w^{\mathrm{\scriptscriptstyle T}}\left\{\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X)\right\}\right|=\left\|\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X)\right\|_{\infty}.

By (i) and sub-gaussian concentration (Lemma S25), each component of n1i=1nφ(Xi)Eφ(X)n^{-1}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X) is sub-gaussian with tail parameter Csg5(M1/n)1/2C_{\mathrm{sg5}}(M_{1}/n)^{1/2}. Then for any t>0t>0, by the union bound, we have that with probability at least 12k2p2et1-2k^{2}p^{2}\mathrm{e}^{-t},

1ni=1nφ(Xi)Eφ(X)2Csg5(M1/n)1/2t1/2.\displaystyle\left\|\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X)\right\|_{\infty}\leq\sqrt{2}C_{\mathrm{sg5}}(M_{1}/n)^{1/2}t^{1/2}.

Taking t=2log(kp)+log(δ1)t=2\log(kp)+\log(\delta^{-1}) gives the desired result.

(iii) Denote φi=φ(Xi)\varphi_{i}=\varphi(X_{i}), φ=φ(X)\varphi=\varphi(X), φ~i=φiEφ\tilde{\varphi}_{i}=\varphi_{i}-\mathrm{E}\varphi, and φ~=φEφ\tilde{\varphi}=\varphi-\mathrm{E}\varphi. The difference of interest can be expressed in terms of the centered variables as

1ni=1nφiTAφiEφTAφ\displaystyle\quad\frac{1}{n}\sum_{i=1}^{n}\varphi^{\mathrm{\scriptscriptstyle T}}_{i}A\varphi_{i}-\mathrm{E}\varphi^{\mathrm{\scriptscriptstyle T}}A\varphi
=1ni=1n(φiEφ)TA(φiEφ)E{(φEφ)TA(φEφ)}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}(\varphi_{i}-\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A(\varphi_{i}-\mathrm{E}\varphi)-\mathrm{E}\{(\varphi-\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A(\varphi-\mathrm{E}\varphi)\} (S44)
+1ni=1n2(Eφ)TA(φiEφ).\displaystyle\quad+\frac{1}{n}\sum_{i=1}^{n}2(\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A(\varphi_{i}-\mathrm{E}\varphi). (S45)

We handle the concentration of the two terms separately.

First, for A𝒜1A\in\mathcal{A}_{1}, the term in (S45) can be bounded as follows:

|2(Eφ)TA1ni=1nφ~i|2EφA1,11ni=1nφ~i=2Eφ1ni=1nφ~i\displaystyle\quad\left|2(\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right|\leq 2\|\mathrm{E}\varphi\|_{\infty}\|A\|_{1,1}\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{\infty}=2\|\mathrm{E}\varphi\|_{\infty}\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{\infty}
2(M11/22π+ξ)1ni=1nφ~i,\displaystyle\leq 2\left(\frac{M_{1}^{1/2}}{\sqrt{2\pi}}+\|\xi\|_{\infty}\right)\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{\infty},

where ξ=maxl=1,,k|ξl|\|\xi\|_{\infty}=\max_{l=1,\ldots,k}|\xi_{l}|. The second step holds because A1,1=1\|A\|_{1,1}=1 for A𝒜1A\in\mathcal{A}_{1}. The third step holds because EφlM11/2/2π+|ξl|\|\mathrm{E}\varphi_{l}\|_{\infty}\leq M_{1}^{1/2}/\sqrt{2\pi}+|\xi_{l}| for l=1,,kl=1,\dots,k by Lemma S15. By (ii), for any δ>0\delta>0, we have that with probability at least 12δ1-2\delta,

1ni=1nφ~iCsp11M11/22log(kp)+log(δ1)n.\displaystyle\quad\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{\infty}\leq C_{\mathrm{sp11}}M_{1}^{1/2}\sqrt{\frac{2\log(kp)+\log(\delta^{-1})}{n}}.

From the preceding two displays, we obtain that with probability at least 12δ1-2\delta,

supA𝒜1|2(Eφ)TA1ni=1nφ~i|\displaystyle\quad\sup_{A\in\mathcal{A}_{1}}\left|2(\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right|
2πCsp11M11/2(M11/2+2πξ)2log(kp)+log(δ1)n.\displaystyle\leq\sqrt{\frac{2}{\pi}}C_{\mathrm{sp11}}M_{1}^{1/2}\left(M_{1}^{1/2}+\sqrt{2\pi}\|\xi\|_{\infty}\right)\sqrt{\frac{2\log(kp)+\log(\delta^{-1})}{n}}. (S46)

Next, notice that

supA𝒜1|1ni=1nφ~TiAφ~iEφ~TAφ~|=1ni=1nφ~iφ~iEφ~φ~max.\displaystyle\quad\sup_{A\in\mathcal{A}_{1}}\left|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}_{i}A\tilde{\varphi}_{i}-\mathrm{E}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}A\tilde{\varphi}\right|=\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\otimes\tilde{\varphi}_{i}-\mathrm{E}\tilde{\varphi}\otimes\tilde{\varphi}\right\|_{\max}.

From (i), each component of φ~i\tilde{\varphi}_{i} is sub-gaussian with tail parameter M11/2M_{1}^{1/2}. By Lemma S30, each element of φ~iφ~i\tilde{\varphi}_{i}\otimes\tilde{\varphi}_{i} is sub-exponential with tail parameter Csx5M1C_{\mathrm{sx5}}M_{1}. By Lemma S31, each element of the centered version, φ~iφ~iEφ~φ~\tilde{\varphi}_{i}\otimes\tilde{\varphi}_{i}-\mathrm{E}\tilde{\varphi}\otimes\tilde{\varphi}, is sub-exponential with tail parameter Csx6Csx5M1C_{\mathrm{sx6}}C_{\mathrm{sx5}}M_{1}. Then for any t>0t>0, by Lemma S32 and the union bound, we have that with probability at least 12k2p2et1-2k^{2}p^{2}\mathrm{e}^{-t},

1ni=1nφ~iφ~iEφ~φ~maxCsx7Csx6Csx5M1(tntn).\displaystyle\quad\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\otimes\tilde{\varphi}_{i}-\mathrm{E}\tilde{\varphi}\otimes\tilde{\varphi}\right\|_{\max}\leq C_{\mathrm{sx7}}C_{\mathrm{sx6}}C_{\mathrm{sx5}}M_{1}\left(\sqrt{\frac{t}{n}}\vee\frac{t}{n}\right).

Taking t=2log(kp)+log(δ1)t=2\log(kp)+\log(\delta^{-1}), we obtain that with probability at least 12δ1-2\delta,

1ni=1nφ~iφ~iEφ~φ~max\displaystyle\quad\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\otimes\tilde{\varphi}_{i}-\mathrm{E}\tilde{\varphi}\otimes\tilde{\varphi}\right\|_{\max}
Csx7Csx6Csx5M1{2log(kp)+log(δ1)n2log(kp)+log(δ1)n}.\displaystyle\leq C_{\mathrm{sx7}}C_{\mathrm{sx6}}C_{\mathrm{sx5}}M_{1}\left\{\sqrt{\frac{2\log(kp)+\log(\delta^{-1})}{n}}\vee\frac{2\log(kp)+\log(\delta^{-1})}{n}\right\}. (S47)

Combining the two bounds (S46) and (S47) gives the desired result. ∎
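For concreteness, the feature map φ and the sup-norm deviation in part (ii) can be simulated directly. The sketch below (an added illustration assuming numpy; the knots, dimension, and variance matrix are arbitrary choices, and the reference mean is itself estimated from a large simulation) shows the empirical maximum deviation shrinking at roughly the √(log(kp)/n) rate:

```python
import numpy as np

rng = np.random.default_rng(2)
p, knots = 5, np.array([-1.0, 0.0, 1.0])   # illustrative dimension and k = 3 knots
k = len(knots)

def phi(X):
    # feature map (x_j - xi_l)_+ for every coordinate j and knot xi_l,
    # stacked into an (n, k*p) array
    return np.maximum(X[:, None, :] - knots[None, :, None], 0.0).reshape(len(X), k * p)

Sigma = 0.5 * np.eye(p) + 0.5              # ||Sigma||_max = 1
L = np.linalg.cholesky(Sigma)
# reference value of E phi(X), itself estimated from a large simulation
Ephi = phi(rng.standard_normal((200_000, p)) @ L.T).mean(axis=0)

for n in [500, 2_000, 8_000]:
    X = rng.standard_normal((n, p)) @ L.T
    dev = np.max(np.abs(phi(X).mean(axis=0) - Ephi))
    print(n, round(dev, 4), round(np.sqrt(np.log(k * p) / n), 4))
```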

Lemma S4.

Suppose that f:(0,)f:(0,\infty)\to\mathbb{R} is convex, non-increasing, and differentiable, and f(1)=0f(1)=0. Denote f#(t)=tf(t)f(t)f^{\#}(t)=tf^{\prime}(t)-f(t).

(i) For any t>0t>0 and ϵ[0,1)\epsilon\in[0,1), we have

(1ϵ)f(t)f#(t)f(1ϵ)ϵ.\displaystyle(1-\epsilon)f^{\prime}(t)-f^{\#}(t)\leq-f^{\prime}(1-\epsilon)\epsilon. (S48)

(ii) Let ϵ0(0,1)\epsilon_{0}\in(0,1) be fixed. For any ϵ[0,ϵ0]\epsilon\in[0,\epsilon_{0}] and any function h:ph:\mathbb{R}^{p}\to\mathbb{R}, we have

Kf(Pϵ,Pθ;h)f(1ϵ0)ϵ.\displaystyle K_{f}(P_{\epsilon},P_{\theta^{*}};h)\leq-f^{\prime}(1-\epsilon_{0})\epsilon.

(iii) Suppose, in addition, that f(eu)f^{\prime}(\mathrm{e}^{u}) is concave in uu. Let ϵ1(0,1)\epsilon_{1}\in(0,1) be fixed. If ϵ^=n1i=1nUi[0,ϵ1]\hat{\epsilon}=n^{-1}\sum_{i=1}^{n}U_{i}\in[0,\epsilon_{1}], then for any function h:ph:\mathbb{R}^{p}\to\mathbb{R},

Kf(Pn,Pθ;h)f(1ϵ1)ϵ^+R1|EPθ,nh(x)EPθh(x)|,\displaystyle K_{f}(P_{n},P_{\theta^{*}};h)\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon}+R_{1}|\mathrm{E}_{P_{\theta^{*},n}}h(x)-\mathrm{E}_{P_{\theta^{*}}}h(x)|, (S49)

where Pθ,nP_{\theta^{*},n} denotes the empirical distribution of {Xi:Ui=0,i=1,,n}\{X_{i}:U_{i}=0,i=1,\ldots,n\} in the latent representation of Huber’s contamination model.

Proof.

(i) Notice that by definition,

(1ϵ)f(t)f#(t)=(1ϵ)f(t)tf(t)+f(t)\displaystyle\quad(1-\epsilon)f^{\prime}(t)-f^{\#}(t)=(1-\epsilon)f^{\prime}(t)-tf^{\prime}(t)+f(t)
=f(t)+f(t)((1ϵ)t).\displaystyle=f(t)+f^{\prime}(t)((1-\epsilon)-t).

By the convexity of ff, we have that for any t>0t>0 and ϵ[0,1)\epsilon\in[0,1),

f(t)+f(t)((1ϵ)t)f(1ϵ).f(t)+f^{\prime}(t)((1-\epsilon)-t)\leq f(1-\epsilon).

Moreover, by the convexity of ff and f(1)=0f(1)=0, we have

f(1ϵ)f(1)+f(1ϵ)((1ϵ)1)=f(1ϵ)ϵ.f(1-\epsilon)\leq f(1)+f^{\prime}(1-\epsilon)((1-\epsilon)-1)=-f^{\prime}(1-\epsilon)\epsilon.

Combining the preceding three displays yields the desired result.

(ii) For any function hh, we have

Kf(Pϵ,Pθ;h)=ϵEQf(eh(x))+EPθ{(1ϵ)f(eh(x))f#(eh(x))}\displaystyle\quad K_{f}(P_{\epsilon},P_{\theta^{*}};h)=\epsilon\,\mathrm{E}_{Q}f^{\prime}(\mathrm{e}^{h(x)})+\mathrm{E}_{P_{\theta^{*}}}\left\{(1-\epsilon)f^{\prime}(\mathrm{e}^{h(x)})-f^{\#}(\mathrm{e}^{h(x)})\right\}
EPθ{(1ϵ)f(eh(x))f#(eh(x))},\displaystyle\leq\mathrm{E}_{P_{\theta^{*}}}\left\{(1-\epsilon)f^{\prime}(\mathrm{e}^{h(x)})-f^{\#}(\mathrm{e}^{h(x)})\right\}, (S50)

using the fact that ff is non-increasing and hence f(t)0f^{\prime}(t)\leq 0 for t>0t>0. Setting t=eh(x)t=\mathrm{e}^{h(x)} in (S48) shows that for ϵϵ0\epsilon\leq\epsilon_{0},

(1ϵ)f(eh(x))f#(eh(x))f(1ϵ)ϵf(1ϵ0)ϵ,\displaystyle(1-\epsilon)f^{\prime}(\mathrm{e}^{h(x)})-f^{\#}(\mathrm{e}^{h(x)})\leq-f^{\prime}(1-\epsilon)\epsilon\leq-f^{\prime}(1-\epsilon_{0})\epsilon, (S51)

where f(1ϵ)f(1ϵ0)f^{\prime}(1-\epsilon)\geq f^{\prime}(1-\epsilon_{0}) for ϵϵ0\epsilon\leq\epsilon_{0} by the convexity of ff. Combining (S50) and (S51) leads to the desired result.

(iii) For any function hh, Kf(Pn,Pθ;h)K_{f}(P_{n},P_{\theta^{*}};h) can be bounded as follows:

Kf(Pn,Pθ;h)\displaystyle\quad K_{f}(P_{n},P_{\theta^{*}};h)
=1ni=1nUif(eh(Xi))+1ni=1n(1Ui)f(eh(Xi))EPθf#(eh(x))\displaystyle=\frac{1}{n}\sum_{i=1}^{n}U_{i}f^{\prime}(\mathrm{e}^{h(X_{i})})+\frac{1}{n}\sum_{i=1}^{n}(1-U_{i})f^{\prime}(\mathrm{e}^{h(X_{i})})-\mathrm{E}_{P_{\theta^{*}}}f^{\#}(\mathrm{e}^{h(x)})
(1ϵ^)EPθ,nf(eh(x))EPθf#(eh(x))\displaystyle\leq(1-\hat{\epsilon})\mathrm{E}_{P_{\theta^{*},n}}f^{\prime}(\mathrm{e}^{h(x)})-\mathrm{E}_{P_{\theta^{*}}}f^{\#}(\mathrm{e}^{h(x)}) (S52)
(1ϵ^)f(eEPθ,nh(x))f#(eEPθh(x))\displaystyle\leq(1-\hat{\epsilon})f^{\prime}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*},n}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)}) (S53)
f(1ϵ1)ϵ^+|f#(eEPθ,nh(x))f#(eEPθh(x))|\displaystyle\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon}+|f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*},n}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)})| (S54)
f(1ϵ1)ϵ^+R1|EPθ,nh(x)EPθh(x)|.\displaystyle\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon}+R_{1}|\mathrm{E}_{P_{\theta^{*},n}}h(x)-\mathrm{E}_{P_{\theta^{*}}}h(x)|. (S55)

Line (S52) follows because f(t)0f^{\prime}(t)\leq 0 for t>0t>0. Line (S53) follows from Jensen’s inequality by the concavity of f(eu)f^{\prime}(\mathrm{e}^{u}) and f#(eu)-f^{\#}(\mathrm{e}^{u}) in uu. Line (S54) follows because

(1ϵ^)f(eEPθ,nh(x))f#(eEPθ,nh(x))f(1ϵ1)ϵ^,\displaystyle(1-\hat{\epsilon})f^{\prime}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*},n}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*},n}}h(x)})\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon},

obtained by taking ϵ=ϵ^\epsilon=\hat{\epsilon} and t=eEPθ,nh(x)t=\mathrm{e}^{\mathrm{E}_{P_{\theta^{*},n}}h(x)} in (S48) and using f(1ϵ^)f(1ϵ1)f^{\prime}(1-\hat{\epsilon})\geq f^{\prime}(1-\epsilon_{1}) for ϵ^ϵ1\hat{\epsilon}\leq\epsilon_{1}. Finally, line (S55) follows because f#(eu)f^{\#}(\mathrm{e}^{u}) is R1R_{1}-Lipschitz in uu. ∎
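
As a quick sanity check of (S48), the following sketch (Python) verifies the inequality numerically for one admissible choice, f(t) = -log(t), which is convex, non-increasing, differentiable, and has f(1) = 0, with f'(e^u) = -e^{-u} concave in u as required in part (iii); this f is chosen only for concreteness and is not tied to any particular result in the paper.

import numpy as np

# Grid check of (S48) for f(t) = -log t, so that f'(t) = -1/t and
# f#(t) = t f'(t) - f(t) = -1 + log t.
f      = lambda t: -np.log(t)
fprime = lambda t: -1.0 / t
fsharp = lambda t: t * fprime(t) - f(t)

ts = np.linspace(1e-3, 10.0, 2000)
for eps in [0.0, 0.05, 0.2, 0.5, 0.9]:
    lhs = (1 - eps) * fprime(ts) - fsharp(ts)
    rhs = -fprime(1 - eps) * eps
    assert lhs.max() <= rhs + 1e-12
print("(S48) holds on the grid for f(t) = -log t")

For this f, the left-hand side is maximized at t = 1 - eps, where it equals -log(1 - eps) <= eps/(1 - eps), matching the displayed argument.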

Remark S5.

Compared with (S49), Kf(Pn,Pθ;h)K_{f}(P_{n},P_{\theta^{*}};h) can also be bounded as

Kf(Pn,Pθ;h)f(1ϵ1)ϵ^+|EPθ,nf#(eh(x))EPθf#(eh(x))|.\displaystyle K_{f}(P_{n},P_{\theta^{*}};h)\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon}+|\mathrm{E}_{P_{\theta^{*},n}}f^{\#}(\mathrm{e}^{h(x)})-\mathrm{E}_{P_{\theta^{*}}}f^{\#}(\mathrm{e}^{h(x)})|. (S56)

In fact, this follows directly from (S52), because for ϵ^ϵ1\hat{\epsilon}\leq\epsilon_{1},

(1ϵ^)f(eh(x))f#(eh(x))f(1ϵ1)ϵ^,\displaystyle\quad(1-\hat{\epsilon})f^{\prime}(\mathrm{e}^{h(x)})-f^{\#}(\mathrm{e}^{h(x)})\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon},

which can be obtained by taking ϵ=ϵ^\epsilon=\hat{\epsilon} and t=eh(x)t=\mathrm{e}^{h(x)} in (S48), similarly as (S51). However, the bound (S56) involves the moment difference of f#(eh(x))f^{\#}(\mathrm{e}^{h(x)}) between PθP_{\theta^{*}} and Pθ,nP_{\theta^{*},n}, which is difficult to control for hh in our spline class, even with f#(eu)f^{\#}(\mathrm{e}^{u}) Lipschitz in uu. In contrast, by exploiting the concavity of f(eu)f^{\prime}(\mathrm{e}^{u}) and f#(eu)-f^{\#}(\mathrm{e}^{u}) in uu, the bound (S49) is derived such that it involves the moment difference of h(x)h(x), which can be controlled by Lemma S3 in Proposition S5 or by Lemma S9 in Proposition S8.

Proposition S5.

In the setting of Proposition S1, it holds with probability at least 15δ1-{5}\delta that for any γΓ\gamma\in\Gamma,

Kf(Pn,Pθ;hγ,μ)f(3/5)(ϵ+ϵ/(nδ))+pen1(γ)Csp13R1M11λ11,\displaystyle K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})\leq-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)})+\mathrm{pen}_{1}(\gamma)C_{\mathrm{sp13}}R_{1}M_{11}\lambda_{11},

where Csp13=(5/3)(Csp11Csp12)C_{\mathrm{sp13}}=(5/3)(C_{\mathrm{sp11}}\vee C_{\mathrm{sp12}}) with Csp11C_{\mathrm{sp11}} and Csp12C_{\mathrm{sp12}} as in Lemma S3, M11=M11/2(M11/2+22π)M_{11}=M_{1}^{1/2}(M_{1}^{1/2}+{2}\sqrt{2\pi}), and

λ11=2log(5p)+log(δ1)n+2log(5p)+log(δ1)n.\lambda_{11}=\sqrt{\frac{2\log({5p})+\log(\delta^{-1})}{n}}+\frac{2\log({5p})+\log(\delta^{-1})}{n}.
Proof.

Consider the event Ω1={|ϵ^ϵ|ϵ(1ϵ)/(nδ)}\Omega_{1}=\{|\hat{\epsilon}-\epsilon|\leq\sqrt{\epsilon(1-\epsilon)/(n\delta)}\}. By Chebyshev’s inequality, we have P(Ω1)1δ\mathrm{P}(\Omega_{1})\geq 1-\delta. In the event Ω1\Omega_{1}, we have |ϵ^ϵ|1/5|\hat{\epsilon}-\epsilon|\leq 1/5 by the assumption ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5 and hence ϵ^2/5\hat{\epsilon}\leq 2/5 by the assumption ϵ1/5\epsilon\leq 1/5. By Lemma S4 with ϵ1=2/5\epsilon_{1}=2/5, it holds in the event Ω1\Omega_{1} that for any γΓ\gamma\in\Gamma,

Kf(Pn,Pθ;hγ,μ)\displaystyle\quad K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})
f(3/5)ϵ^+R1|EPθ,nhγ,μ(x)EPθhγ,μ(x)|\displaystyle\leq-f^{\prime}(3/5)\hat{\epsilon}+R_{1}\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)\right|
f(3/5)(ϵ+ϵ/(nδ))+R1|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|.\displaystyle\leq-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)})+R_{1}\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|. (S57)

The last step uses the fact that EPθhγ,μ(x)=EP(0,Σ)hγ(x)\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x) and EPθ,nhγ,μ(x)=EPθ,nhγ(xμ)\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*}), by the definition hγ,μ(x)=hγ(xμ)h_{\gamma,\mu^{*}}(x)=h_{\gamma}(x-\mu^{*}).

Next, conditionally on the contamination indicators (U1,,Un)(U_{1},\ldots,U_{n}) such that the event Ω1\Omega_{1} holds, we have that {Xi:Ui=0,i=1,,n}\{X_{i}:U_{i}=0,i=1,\ldots,n\} are n1n_{1} independent and identically distributed observations from PθP_{\theta^{*}}, where n1=i=1n(1Ui)=n(1ϵ^)(3/5)nn_{1}=\sum_{i=1}^{n}(1-U_{i})=n(1-\hat{\epsilon})\geq(3/5)n. Denote as Ω2\Omega_{2} the event that for any γ1\gamma_{1} and γ2\gamma_{2},

|EPθ,nγ1Tφ(xμ)EP(0,Σ)γ1Tφ(x)|γ11Csp11M11/22log(5p)+log(δ1)(3/5)n,\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x)\right|\leq\|\gamma_{1}\|_{1}C_{\mathrm{sp11}}M_{1}^{1/2}\sqrt{\frac{2\log(5p)+\log(\delta^{-1})}{(3/5)n}},

and

|EPθ,nγ2T(φ(xμ)φ(xμ))EP(0,Σ)γ2T(φ(x)φ(x))|\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x-\mu^{*})\otimes\varphi(x-\mu^{*}))-\mathrm{E}_{P_{(0,\Sigma^{*})}}\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x)\otimes\varphi(x))\right|
γ21Csp12M11{2log(5p)+log(δ1)(3/5)n+2log(5p)+log(δ1)(3/5)n},\displaystyle\leq\|\gamma_{2}\|_{1}C_{\mathrm{sp12}}M_{11}\left\{\sqrt{\frac{2\log(5p)+\log(\delta^{-1})}{(3/5)n}}+\frac{2\log(5p)+\log(\delta^{-1})}{(3/5)n}\right\},

where Csp11C_{\mathrm{sp11}}, Csp12C_{\mathrm{sp12}}, and M11M_{11} are defined as in Lemma S3 with ξ=2\|\xi\|_{\infty}=2. In the event Ω2\Omega_{2}, the preceding inequalities imply that for any γ=(γ0,γ1T,γ2T)TΓ\gamma=(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in\Gamma,

|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|
pen1(γ)(5/3)(Csp11Csp12)M11λ11,\displaystyle\leq\mathrm{pen}_{1}(\gamma)(5/3)(C_{\mathrm{sp11}}\vee C_{\mathrm{sp12}})M_{11}\lambda_{11}, (S58)

where hγ(x)=γ0+γ1Tφ(x)+γ2T(φ(x)φ(x))h_{\gamma}(x)=\gamma_{0}+\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x)+\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x)\otimes\varphi(x)) and pen1(γ)=γ11+γ21\mathrm{pen}_{1}(\gamma)=\|\gamma_{1}\|_{1}+\|\gamma_{2}\|_{1}. By applying Lemma S3 with k=5k=5 to {Xiμ:Ui=0,i=1,,n}\{X_{i}-\mu^{*}:U_{i}=0,i=1,\ldots,n\}, we have P(Ω2|U1,,Un)14δ\mathrm{P}(\Omega_{2}|U_{1},\ldots,U_{n})\geq 1-{4}\delta for any (U1,,Un)(U_{1},\ldots,U_{n}) such that Ω1\Omega_{1} holds. Taking the expectation over (U1,,Un)(U_{1},\ldots,U_{n}) given Ω1\Omega_{1} shows that P(Ω2|Ω1)14δ\mathrm{P}(\Omega_{2}|\Omega_{1})\geq 1-{4}\delta and hence P(Ω1Ω2)(1δ)(14δ)15δ\mathrm{P}(\Omega_{1}\cap\Omega_{2})\geq(1-\delta)(1-4\delta)\geq 1-{5}\delta.

Combining (S57) and (S58) in the event Ω1Ω2\Omega_{1}\cap\Omega_{2} indicates that, with probability at least 15δ1-5\delta, the desired inequality holds for any γΓ\gamma\in\Gamma. ∎

Lemma S5.

Suppose that X1,,XnX_{1},\ldots,X_{n} are independent and identically distributed as XPϵX\sim P_{\epsilon}. Let b>0b>0 be fixed and g:p[0,1]qg:\mathbb{R}^{p}\to[0,1]^{q} be a vector of fixed functions. For a convex and twice differentiable function f:(0,)f:(0,\infty)\to\mathbb{R}, define

F(X1,,Xn)\displaystyle F(X_{1},\dots,X_{n}) =supw1=1,μp{Kf(Pn,Pθ^;bwTgμ)Kf(Pϵ,Pθ^;bwTgμ)}\displaystyle=\sup_{\|w\|_{1}=1,\mu\in\mathbb{R}^{p}}\left\{K_{f}(P_{n},P_{\hat{\theta}};bw^{\mathrm{\scriptscriptstyle T}}g_{\mu})-K_{f}(P_{\epsilon},P_{\hat{\theta}};bw^{\mathrm{\scriptscriptstyle T}}g_{\mu})\right\}
=supw1=1,μp{1ni=1nf(ebwTgμ(Xi))Ef(ebwTgμ(X))},\displaystyle=\sup_{\|w\|_{1}=1,\mu\in\mathbb{R}^{p}}\left\{\frac{1}{n}\sum_{i=1}^{n}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu}(X_{i})})-\mathrm{E}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu}(X)})\right\},

where gμ(x)=g(xμ)g_{\mu}(x)=g(x-\mu). Suppose that conditionally on (X1,,Xn)(X_{1},\ldots,X_{n}), the random variable Zn,j=supμp|n1i=1nϵigμ,j(Xi)|Z_{n,j}=\sup_{\mu\in\mathbb{R}^{p}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,j}(X_{i})| is sub-gaussian with tail parameter Vg/n\sqrt{V_{g}/n} for j=1,,qj=1,\ldots,q, where (ϵ1,,ϵn)(\epsilon_{1},\ldots,\epsilon_{n}) are Rademacher variables, independent of (X1,,Xn)(X_{1},\ldots,X_{n}), and gμ,j:p[0,1]g_{\mu,j}:\mathbb{R}^{p}\to[0,1] denotes the jjth component of gμg_{\mu}. Then for any δ>0\delta>0, we have that with probability at least 12δ1-2\delta,

F(X1,,Xn)bR2,b{Csg6Vglog(2q)n+2log(δ1)n},\displaystyle F(X_{1},\dots,X_{n})\leq bR_{2,b}\left\{C_{\mathrm{sg6}}\sqrt{\frac{V_{g}\log(2q)}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}}\right\},

where R2,b=sup|u|bdduf(eu)R_{2,b}=\sup_{|u|\leq b}\frac{\mathrm{d}}{\mathrm{d}u}f^{\prime}(\mathrm{e}^{u}) and Csg6C_{\mathrm{sg6}} is the universal constant in Lemma S26.

Proof.

First, FF satisfies the bounded difference condition, because |bwTgμ|b|bw^{\mathrm{\scriptscriptstyle T}}g_{\mu}|\leq b with w1=1\|w\|_{1}=1 and ff^{\prime} is non-decreasing by the convexity of ff:

supX1,,Xn,Xi|F(X1,,Xn)F(X1,,Xi,,Xn)|\displaystyle\quad\sup_{X_{1},\dots,X_{n},X_{i}^{\prime}}\left|F(X_{1},\dots,X_{n})-F(X_{1},\dots,X_{i}^{\prime},\dots,X_{n})\right|
f(eb)f(eb)n2bR2,bn,\displaystyle\leq\frac{f^{\prime}(\mathrm{e}^{b})-f^{\prime}(\mathrm{e}^{-b})}{n}\leq\frac{2bR_{2,b}}{n},

where R2,b=sup|u|bdduf(eu)R_{2,b}=\sup_{|u|\leq b}\frac{\mathrm{d}}{\mathrm{d}u}f^{\prime}(\mathrm{e}^{u}). By McDiarmid’s inequality (\citeappendMC89), for any t>0t>0, we have that with probability at least 12e2nt21-2\mathrm{e}^{-2nt^{2}},

|F(X1,,Xn)EF(X1,,Xn)|2bR2,bt.\displaystyle|F(X_{1},\dots,X_{n})-\mathrm{E}F(X_{1},\dots,X_{n})|\leq 2bR_{2,b}t.

For any δ>0\delta>0, taking t=log(δ1)/(2n)t=\sqrt{\log(\delta^{-1})/(2n)} shows that with probability at least 12δ1-2\delta,

|F(X1,,Xn)EF(X1,,Xn)|bR2,b2log(δ1)n.\displaystyle|F(X_{1},\dots,X_{n})-\mathrm{E}F(X_{1},\dots,X_{n})|\leq bR_{2,b}\sqrt{\frac{2\log(\delta^{-1})}{n}}.

Next, the expectation of F(X1,,Xn)F(X_{1},\dots,X_{n}) can be bounded as follows:

Esupw1=1,μp{1ni=1nf(ebwTgμ(Xi))Ef(ebwTgμ(X))}\displaystyle\quad\mathrm{E}\sup_{\|w\|_{1}=1,\mu\in\mathbb{R}^{p}}\left\{\frac{1}{n}\sum_{i=1}^{n}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu}(X_{i})})-\mathrm{E}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu}(X)})\right\}
2Esupw1=1,μp{1ni=1nϵif(ebwTgμ(Xi))}\displaystyle\leq 2\mathrm{E}\sup_{\|w\|_{1}=1,\mu\in\mathbb{R}^{p}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu}(X_{i})})\right\} (S59)
2R2,bEsupw1=1,μp{1ni=1nϵibwTgμ(Xi)}\displaystyle\leq 2R_{2,b}\,\mathrm{E}\sup_{\|w\|_{1}=1,\mu\in\mathbb{R}^{p}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}bw^{\mathrm{\scriptscriptstyle T}}g_{\mu}(X_{i})\right\} (S60)
2bR2,bEsupμp1ni=1nϵigμ(Xi)\displaystyle\leq 2bR_{2,b}\,\mathrm{E}\sup_{\mu\in\mathbb{R}^{p}}\left\|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu}(X_{i})\right\|_{\infty}
2bR2,bCsg6Vglog(2q)n.\displaystyle\leq 2bR_{2,b}C_{\mathrm{sg6}}\sqrt{\frac{V_{g}\log(2q)}{n}}. (S61)

Line (S59) follows from the symmetrization Lemma S33, where (ϵ1,,ϵn)(\epsilon_{1},\ldots,\epsilon_{n}) are Rademacher variables, independent of (X1,,Xn)(X_{1},\ldots,X_{n}). Line (S60) follows by Lemma S34, because f(eu)f^{\prime}(\mathrm{e}^{u}) is R2,bR_{2,b}-Lipschitz in u[b,b]u\in[-b,b]. Line (S61) follows because

Esupμp1ni=1nϵigμ(Xi)=Esupμpmaxj=1,,q|1ni=1nϵigμ,j(Xi)|\displaystyle\quad\mathrm{E}\sup_{\mu\in\mathbb{R}^{p}}\left\|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu}(X_{i})\right\|_{\infty}=\mathrm{E}\sup_{\mu\in\mathbb{R}^{p}}\max_{j=1,\dots,q}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,j}(X_{i})\right|
=Emaxj=1,,qsupμp|1ni=1nϵigμ,j(Xi)|\displaystyle=\mathrm{E}\max_{j=1,\dots,q}\sup_{\mu\in\mathbb{R}^{p}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,j}(X_{i})\right|
Csg6Vglog(2q)n.\displaystyle\leq C_{\mathrm{sg6}}\sqrt{\frac{V_{g}\log(2q)}{n}}.

For the last step, we use the assumption that conditionally on (X1,,Xn)(X_{1},\ldots,X_{n}), the random variable Zn,j=supμp|n1i=1nϵigμ,j(Xi)|Z_{n,j}=\sup_{\mu\in\mathbb{R}^{p}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,j}(X_{i})| is sub-gaussian with tail parameter Vg/n\sqrt{V_{g}/n} for each j=1,,qj=1,\ldots,q, and apply Lemma S26 to obtain E(maxj=1,,qZn,j|X1,,Xn)Csg6Vglog(2q)/n\mathrm{E}(\max_{j=1,\ldots,q}Z_{n,j}|X_{1},\ldots,X_{n})\leq C_{\mathrm{sg6}}\sqrt{V_{g}\log(2q)/n}, and then E(maxj=1,,qZn,j)Csg6Vglog(2q)/n\mathrm{E}(\max_{j=1,\ldots,q}Z_{n,j})\leq C_{\mathrm{sg6}}\sqrt{V_{g}\log(2q)/n}.

Combining the tail probability and expectation bounds yields the desired result. ∎
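
The sub-gaussian condition on Zn,jZ_{n,j} is where the moving-knot structure enters (via Lemma S16 and Corollary S2, as used later in Proposition S6). The following rough Monte Carlo sketch (Python; a finite grid over mu stands in for the exact supremum, the average is over both X and the Rademacher signs, and ramp(x) = min{max{(x+1)/2, 0}, 1}, consistent with the properties used in Lemma S7) illustrates that sup_mu |n^{-1} sum_i eps_i ramp(X_i - mu)| indeed scales like 1/sqrt(n):

import numpy as np

# E sup_mu |n^{-1} sum eps_i ramp(X_i - mu)| times sqrt(n) should stay
# roughly constant in n for the VC-type moving-knot ramp class.
rng = np.random.default_rng(2)
ramp = lambda x: np.clip((x + 1.0) / 2.0, 0.0, 1.0)
grid = np.linspace(-6.0, 6.0, 121)                 # stand-in for sup over mu
for n in [250, 1000, 4000, 16000]:
    sups = []
    for _ in range(100):
        X = rng.standard_normal(n)
        eps = rng.choice([-1.0, 1.0], size=n)
        vals = ramp(X[:, None] - grid[None, :])    # (n, len(grid))
        sups.append(np.abs(eps @ vals / n).max())
    print(n, np.mean(sups) * np.sqrt(n))           # roughly constant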

Remark S2.

Results of Lemma S5 and Lemma S10 still hold with R2,bR_{2,b} and R2,bqR_{2,b\sqrt{q}} replaced by 11 if KfK_{f} is replaced by KHGK_{\mathrm{HG}}. This holds because f(eu)f^{\prime}(\mathrm{e}^{u}) is then replaced by the identity function, which is 11-Lipschitz.

Lemma S6.

Suppose that f:(0,)f:(0,\infty)\to\mathbb{R} is convex and three-times differentiable. Let b>0b>0 be fixed. For any function h:p[b,b]h:\mathbb{R}^{p}\to[-b,b], we have

Kf(Pϵ,Pθ^;h)f(eb)ϵ+f(1){EPθh(x)EPθ^h(x)}12b2R3,b,\displaystyle K_{f}(P_{\epsilon},P_{\hat{\theta}};h)\geq f^{\prime}(\mathrm{e}^{-b})\epsilon+f^{{\prime\prime}}(1)\left\{\mathrm{E}_{P_{\theta^{*}}}h(x)-\mathrm{E}_{P_{\hat{\theta}}}h(x)\right\}-\frac{1}{2}b^{2}R_{3,b},

where R3,b=R31,b+R32,bR_{3,b}=R_{31,b}+R_{32,b}, R31,b=sup|u|bd2du2{f(eu)}R_{31,b}=\sup_{|u|\leq b}\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}\{-f^{\prime}(\mathrm{e}^{u})\}, and R32,b=sup|u|bd2du2f#(eu)R_{32,b}=\sup_{|u|\leq b}\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\#}(\mathrm{e}^{u}).

Proof.

First, Kf(Pϵ,Pθ^;h)K_{f}(P_{\epsilon},P_{\hat{\theta}};h) can be bounded as

Kf(Pϵ,Pθ^;h)\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h)
=ϵEQf(eh(x))+(1ϵ)EPθf(eh(x))EPθ^f#(eh(x))\displaystyle=\epsilon\mathrm{E}_{Q}f^{\prime}(\mathrm{e}^{h(x)})+(1-\epsilon)\mathrm{E}_{P_{\theta^{*}}}f^{\prime}(\mathrm{e}^{h(x)})-\mathrm{E}_{P_{\hat{\theta}}}f^{\#}(\mathrm{e}^{h(x)})
f(eb)ϵ+EPθf(eh(x))EPθ^f#(eh(x))\displaystyle\geq f^{\prime}(\mathrm{e}^{-b})\epsilon+\mathrm{E}_{P_{\theta^{*}}}f^{\prime}(\mathrm{e}^{h(x)})-\mathrm{E}_{P_{\hat{\theta}}}f^{\#}(\mathrm{e}^{h(x)})
=f(eb)ϵ+Kf(Pθ,Pθ^;h),\displaystyle=f^{\prime}(\mathrm{e}^{-b})\epsilon+K_{f}(P_{\theta^{*}},P_{\hat{\theta}};h),

where the inequality follows because f(eh(x))f(eb)f^{\prime}(\mathrm{e}^{h(x)})\geq f^{\prime}(\mathrm{e}^{-b}) for h(x)[b,b]h(x)\in[-b,b] by the convexity of ff. Next, consider the function κ(t)=Kf(Pθ,Pθ^;th)\kappa(t)=K_{f}(P_{\theta^{*}},P_{\hat{\theta}};th). A Taylor expansion of κ(1)=Kf(Pθ,Pθ^;h)\kappa(1)=K_{f}(P_{\theta^{*}},P_{\hat{\theta}};h) about t=0t=0, using κ(0)=f(1)f#(1)=f(1)=0\kappa(0)=f^{\prime}(1)-f^{\#}(1)=f(1)=0, yields

Kf(Pθ,Pθ^;h)\displaystyle K_{f}(P_{\theta^{*}},P_{\hat{\theta}};h) =f(1){EPθh(x)EPθ^h(x)}+12κ(t),\displaystyle=f^{{\prime\prime}}(1)\left\{\mathrm{E}_{P_{\theta^{*}}}h(x)-\mathrm{E}_{P_{\hat{\theta}}}h(x)\right\}+\frac{1}{2}\kappa^{{\prime\prime}}(t),

where for some t[0,1]t\in[0,1],

κ(t)\displaystyle\kappa^{{\prime\prime}}(t) =EPθ{h2(x)d2du2f(eu)|u=th(x)}EPθ^{h2(x)d2du2f#(eu)|u=th(x)}.\displaystyle=\mathrm{E}_{P_{\theta^{*}}}\left\{h^{2}(x)\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\prime}(\mathrm{e}^{u})|_{u=th(x)}\right\}-\mathrm{E}_{P_{\hat{\theta}}}\left\{h^{2}(x)\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\#}(\mathrm{e}^{u})|_{u=th(x)}\right\}.

The desired result then follows because h(x)[b,b]h(x)\in[-b,b] and th(x)[b,b]th(x)\in[-b,b] for t[0,1]t\in[0,1], and hence κ(t)b2R3,b\kappa^{{\prime\prime}}(t)\geq-b^{2}R_{3,b} by the definition of R3,bR_{3,b}. ∎
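
As an illustration of the constants involved, take f(t) = -log(t), an admissible choice used here only for concreteness: it is convex, non-increasing, three times differentiable, and satisfies f(1) = 0. Then f'(e^u) = -e^{-u} and f#(e^u) = u - 1, so that (d^2/du^2){-f'(e^u)} = e^{-u} and (d^2/du^2) f#(e^u) = 0. Hence R_{31,b} = e^b, R_{32,b} = 0, and R_{3,b} = e^b. Likewise, R_{2,b} = sup_{|u| <= b} e^{-u} = e^b in Lemma S5, and f#(e^u) is 1-Lipschitz in u, so R_1 = 1 in Lemma S4 (iii).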

Proposition S6.

Let b1>0b_{1}>0 be fixed. In the setting of Proposition S1, it holds with probability at least 12δ1-2\delta that for any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=b1\mathrm{pen}_{1}(\gamma)=b_{1},

Kf(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
f(1){EPθhγ,μ^(x)EPθ^hγ,μ^(x)}+f(eb1)ϵ12b12R3,b1b1R2,b1λ12,\displaystyle\geq f^{{\prime\prime}}(1)\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}+f^{\prime}(\mathrm{e}^{-b_{1}})\epsilon-\frac{1}{2}b_{1}^{2}R_{3,b_{1}}-b_{1}R_{2,b_{1}}\lambda_{12},

where, with Crad4=Csg6Crad3C_{\mathrm{rad4}}=C_{\mathrm{sg6}}C_{\mathrm{rad3}},

λ12=Crad44log(2p(p+1))n+2log(δ1)n,\lambda_{12}=C_{\mathrm{rad4}}\sqrt{\frac{4\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}},

depending on the universal constants Csg6C_{\mathrm{sg6}} and Crad3C_{\mathrm{rad3}} in Lemma S26 and Corollary S2.

Proof.

By definition, for any γΓrp\gamma\in\Gamma_{\mathrm{rp}}, hγ(x)h_{\gamma}(x) can be represented as hrp,β,c(x)h_{\mathrm{rp},\beta,c}(x) such that β0=γ0\beta_{0}=\gamma_{0} and pen1(β)=pen1(γ)\mathrm{pen}_{1}(\beta)=\mathrm{pen}_{1}(\gamma):

hrp,β,c(x)=β0+j=1pβ1jramp(xjcj)+1ijpβ2,ijramp(xi)ramp(xj),\displaystyle h_{\mathrm{rp},\beta,c}(x)=\beta_{0}+\sum_{j=1}^{p}\beta_{1j}\,\mathrm{ramp}(x_{j}-c_{j})+\sum_{1\leq i\not=j\leq p}\beta_{2,ij}\,\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}),

where c=(c1,,cp)Tc=(c_{1},\ldots,c_{p})^{\mathrm{\scriptscriptstyle T}} with cj{0,1}c_{j}\in\{0,1\}, and β=(β0,β1T,β2T)T\beta=(\beta_{0},\beta_{1}^{\mathrm{\scriptscriptstyle T}},\beta_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}} with β1=(β1j:j=1,,p)T\beta_{1}=(\beta_{1j}:j=1,\ldots,p)^{\mathrm{\scriptscriptstyle T}} and β2=(β2,ij:1ijp)\beta_{2}=(\beta_{2,ij}:1\leq i\not=j\leq p). Then for any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=b1\mathrm{pen}_{1}(\gamma)=b_{1}, we have β0=0\beta_{0}=0 and pen1(β)=b1\mathrm{pen}_{1}(\beta)=b_{1} correspondingly, and hence hγ(x)=hrp,β,c(x)[b1,b1]h_{\gamma}(x)=h_{\mathrm{rp},\beta,c}(x)\in[-b_{1},b_{1}] by the boundedness of the ramp function in [0,1][0,1]. Moreover, hrp,β,c(x)h_{\mathrm{rp},\beta,c}(x) with β0=0\beta_{0}=0 and pen1(β)=b1\mathrm{pen}_{1}(\beta)=b_{1} can be expressed in the form b1wTg(x)b_{1}w^{\mathrm{\scriptscriptstyle T}}g(x), where for q=2p+p(p1)q=2p+p(p-1), wqw\in\mathbb{R}^{q} is an L1L_{1} unit vector, and g:p[0,1]qg:\mathbb{R}^{p}\to[0,1]^{q} is a vector of functions including ramp(xj)\mathrm{ramp}(x_{j}) and ramp(xj1)\mathrm{ramp}(x_{j}-1) for j=1,,pj=1,\ldots,p, and ramp(xi)ramp(xj)\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}) for 1ijp1\leq i\not=j\leq p. For symmetry, ramp(xi)ramp(xj)\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}) and ramp(xj)ramp(xi)\mathrm{ramp}(x_{j})\mathrm{ramp}(x_{i}) are included as two distinct components in gg, and the corresponding coefficients are identical to each other in ww. Parenthetically, at most one of the coefficients in ww associated with ramp(xj)\mathrm{ramp}(x_{j}) and ramp(xj1)\mathrm{ramp}(x_{j}-1) is nonzero for each jj, but this property is not used in the subsequent discussion.

Next, Kf(Pn,Pθ^;hγ,μ^)K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) can be bounded as

Kf(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
Kf(Pϵ,Pθ^;hγ,μ^){Kf(Pn,Pθ^;hγ,μ^)Kf(Pϵ,Pθ^;hγ,μ^)}.\displaystyle\geq K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}.

For any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=b1\mathrm{pen}_{1}(\gamma)=b_{1}, applying Lemma S6 with h=hγ,μ^h=h_{\gamma,\hat{\mu}} and b=b1b=b_{1} yields

Kf(Pϵ,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
f(1){EPθhγ,μ^(x)EPθ^hγ,μ^(x)}+f(eb1)ϵ12b12R3,b1.\displaystyle\geq f^{{\prime\prime}}(1)\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}+f^{\prime}(\mathrm{e}^{-b_{1}})\epsilon-\frac{1}{2}b_{1}^{2}R_{3,b_{1}}.

By Lemma S5 with b=b1b=b_{1} and g(x)[0,1]qg(x)\in[0,1]^{q} defined above, it holds with probability at least 12δ1-2\delta that for any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=b1\mathrm{pen}_{1}(\gamma)=b_{1},

{Kf(Pn,Pθ^;hγ,μ^)Kf(Pϵ,Pθ^;hγ,μ^)}\displaystyle\quad\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}
b1R2,b1{Csg6Vglog(2q)n+2log(δ1)n}\displaystyle\leq b_{1}R_{2,b_{1}}\left\{C_{\mathrm{sg6}}\sqrt{\frac{V_{g}\log(2q)}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}}\right\}
=b1R2,b1{Csg6Crad34log(2p(p+1))n+2log(δ1)n},\displaystyle=b_{1}R_{2,b_{1}}\left\{C_{\mathrm{sg6}}C_{\mathrm{rad3}}\sqrt{\frac{4\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}}\right\},

where Vg=4Crad32V_{g}=4C_{\mathrm{rad3}}^{2} is determined in Lemma S5 as follows. For j=1,,qj=1,\ldots,q, consider the function class 𝒢j={gμ,j:μp}\mathcal{G}_{j}=\{g_{\mu,j}:\mu\in\mathbb{R}^{p}\}, where μ=(μ1,,μp)T\mu=(\mu_{1},\ldots,\mu_{p})^{\mathrm{\scriptscriptstyle T}} and, as defined in Lemma S5, gμ,j(x)g_{\mu,j}(x) is either a moving-knot ramp function, ramp(xj1μj1)\mathrm{ramp}(x_{j_{1}}-\mu_{j_{1}}) or ramp(xj1μj11)\mathrm{ramp}(x_{j_{1}}-\mu_{j_{1}}-1) or a product of moving-knot ramp functions, ramp(xj1μj1)ramp(xj2μj2)\mathrm{ramp}(x_{j_{1}}-\mu_{j_{1}})\mathrm{ramp}(x_{j_{2}}-\mu_{j_{2}}) for 1j1j2p1\leq j_{1}\not=j_{2}\leq p. By Lemma S16, the VC index of moving-knot ramp functions is 2. By applying Corollary S2 (i) and (ii) with vanishing \mathcal{H}, we obtain that conditionally on (X1,,Xn)(X_{1},\ldots,X_{n}), the random variable Zn,j=supμp|n1i=1nϵigμ,j(Xi)|=supfj𝒢j|n1i=1nϵifj(Xi)|Z_{n,j}=\sup_{\mu\in\mathbb{R}^{p}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,j}(X_{i})|=\sup_{f_{j}\in\mathcal{G}_{j}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}f_{j}(X_{i})| is sub-gaussian with tail parameter Crad34/nC_{\mathrm{rad3}}\sqrt{4/n} for j=1,,qj=1,\ldots,q.

Combining the preceding three displays leads to the desired result. ∎

Lemma S7 (Local linearity 1).

For δ\delta\in\mathbb{R} and 0σ1,σ2M1/20\leq\sigma_{1},\sigma_{2}\leq M^{1/2}, denote Dh=Eh(σ1Z+δ)Eh(σ2Z)D_{h}=\mathrm{E}h(\sigma_{1}Z+\delta)-\mathrm{E}h(\sigma_{2}Z), where hh is a function on \mathbb{R} and ZZ is a standard Gaussian random variable. For h1(x)=±ramp(x)h_{1}(x)=\pm\mathrm{ramp}(x), if |Dh1|a|D_{h_{1}}|\leq a for a(0,1/2)a\in(0,1/2), then we have

|δ|S4,a|Dh1|,\displaystyle|\delta|\leq S_{4,a}|D_{h_{1}}|, (S62)

where S4,a=(1+2Mlog212a)/aS_{4,a}=(1+\sqrt{2M\log\frac{2}{1-2a}})/a. For h2(x)=±ramp(x1)h_{2}(x)=\pm\mathrm{ramp}(x-1), we have

|σ1σ2|S5(|Dh2|+|δ|/2),\displaystyle|\sigma_{1}-\sigma_{2}|\leq S_{5}(|D_{h_{2}}|+|\delta|/2), (S63)

where S5=22π(1e2/M)1S_{5}=2\sqrt{2\pi}(1-\mathrm{e}^{-2/M})^{-1}.

Remark S6.

Define a ramp function class

1={±ramp(xc),x:c=0,1}.\mathcal{R}_{1}=\left\{\pm\mathrm{ramp}(x-c),x\in\mathbb{R}:c=0,1\right\}.

In the setting of Lemma S7, suppose that for fixed a(0,1/2)a\in(0,1/2),

D=defsuph1{Eh(σ1Z+δ)Eh(σ2Z)}a.\displaystyle D\stackrel{{\scriptstyle\text{def}}}{{=}}\sup_{h\in\mathcal{R}_{1}}\left\{\mathrm{E}h(\sigma_{1}Z+\delta)-\mathrm{E}h(\sigma_{2}Z)\right\}\leq a.

Then we have

|δ|S4,aD,|σ1σ2|S6,aD,\displaystyle|\delta|\leq S_{4,a}D,\quad|\sigma_{1}-\sigma_{2}|\leq S_{6,a}D,

where S6,a=S5(1+S4,a/2)S_{6,a}=S_{5}(1+S_{4,a}/2). This shows that the moment matching discrepancy DD over 1\mathcal{R}_{1} delivers upper bounds, up to scaling constants, on the mean and standard deviation differences, provided that DD is sufficiently small, for example, D1/3D\leq 1/3.
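
The following quadrature sketch (Python; hypothetical values of delta, sigma_1, sigma_2 with M = 1 and a = 1/3, and ramp(x) = min{max{(x+1)/2, 0}, 1}, consistent with the properties used in the proof below) illustrates the conclusion of Remark S6:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

ramp = lambda x: np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def E_ramp(sigma, shift, c):
    # E ramp(sigma*Z + shift - c) for Z ~ N(0,1), by quadrature
    return quad(lambda z: ramp(sigma * z + shift - c) * norm.pdf(z), -12, 12)[0]

M, a = 1.0, 1.0 / 3.0
S4 = (1 + np.sqrt(2 * M * np.log(2 / (1 - 2 * a)))) / a
S5 = 2 * np.sqrt(2 * np.pi) / (1 - np.exp(-2 / M))
S6 = S5 * (1 + S4 / 2)

delta, s1, s2 = 0.05, 0.9, 1.0                      # hypothetical parameters
D = max(s * (E_ramp(s1, delta, c) - E_ramp(s2, 0.0, c))
        for s in (+1.0, -1.0) for c in (0.0, 1.0))  # sup over the class R_1
print(D <= a, abs(delta) <= S4 * D, abs(s1 - s2) <= S6 * D)

With these values all three printed checks hold, illustrating that the discrepancy D controls both the mean and the standard deviation differences up to the stated constants.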

Proof.

[Proof of (S62)] First, assume that δ\delta is nonnegative. The other direction will be discussed later. Take h(x)=ramp(x)h(x)=\mathrm{ramp}(x). Then h(x)+h(x)=1h(x)+h(-x)=1 for all xx\in\mathbb{R} and

Eh(σ2Z)=Eh(σ1Z)=12.\displaystyle\mathrm{E}h(\sigma_{2}Z)=\mathrm{E}h(\sigma_{1}Z)=\frac{1}{2}.

Define g(t)=Eh(σ1Z+t)Eh(σ1Z)=Eh(σ1Z+t)12g(t)=\mathrm{E}h(\sigma_{1}Z+t)-\mathrm{E}h(\sigma_{1}Z)=\mathrm{E}h(\sigma_{1}Z+t)-\frac{1}{2} for tt\in\mathbb{R}. Then g(0)=0g(0)=0 and g(δ)=Dhag(\delta)=D_{h}\leq a. We notice the following properties of the function gg.

  • (i) g(t)g(t) is non-decreasing and concave for t0t\geq 0.

  • (ii) g(t)/tg(t)/t is non-increasing for t>0t>0.

  • (iii) g(t)12exp{(t1)2/(2σ12)}g(t)\geq\frac{1}{2}-\exp\{-(t-1)^{2}/(2\sigma_{1}^{2})\} for t1t\geq 1.

For property (i), the derivative of g(t)g(t) is g(t)=12E{𝟙σ1Z+t[1,1]}0g^{\prime}(t)=\frac{1}{2}\mathrm{E}\{\mathbbm{1}_{\sigma_{1}Z+t\in[-1,1]}\}\geq 0. Moreover, g(t)g^{\prime}(t) is non-increasing: for 0t1<t20\leq t_{1}<t_{2},

g(t1)=12P(1t1σ1Z1t1)\displaystyle\quad g^{\prime}(t_{1})=\frac{1}{2}\mathrm{P}(-1-t_{1}\leq\sigma_{1}Z\leq 1-t_{1})
=12P(1t1σ1Z1t2)+12P(1t2σ1Z1t1),\displaystyle=\frac{1}{2}\mathrm{P}(-1-t_{1}\leq\sigma_{1}Z\leq 1-t_{2})+\frac{1}{2}\mathrm{P}(1-t_{2}\leq\sigma_{1}Z\leq 1-t_{1}),
g(t2)=12P(1t2σ1Z1t2)\displaystyle\quad g^{\prime}(t_{2})=\frac{1}{2}\mathrm{P}(-1-t_{2}\leq\sigma_{1}Z\leq 1-t_{2})
=12P(1t1σ1Z1t2)+12P(1t2σ1Z1t1),\displaystyle=\frac{1}{2}\mathrm{P}(-1-t_{1}\leq\sigma_{1}Z\leq 1-t_{2})+\frac{1}{2}\mathrm{P}(-1-t_{2}\leq\sigma_{1}Z\leq-1-t_{1}),

and

P(1t2σ1Z1t1)=P(t11σ1Zt21)\displaystyle\quad\mathrm{P}(1-t_{2}\leq\sigma_{1}Z\leq 1-t_{1})=\mathrm{P}(t_{1}-1\leq\sigma_{1}Z\leq t_{2}-1)
P(t1+1σ1Zt2+1)=P(1t2σ1Z1t1).\displaystyle\leq\mathrm{P}(t_{1}+1\leq\sigma_{1}Z\leq t_{2}+1)=\mathrm{P}(-1-t_{2}\leq\sigma_{1}Z\leq-1-t_{1}).

The last inequality holds because (t1)2(t+1)2(t-1)^{2}\leq(t+1)^{2} for any t0t\geq 0 and hence t1t2exp{(t1)2/(2σ12)}dtt1t2exp{(t+1)2/(2σ12)}dt\int_{t_{1}}^{t_{2}}\exp\{-(t-1)^{2}/(2\sigma_{1}^{2})\}\,\mathrm{d}t\geq\int_{t_{1}}^{t_{2}}\exp\{-(t+1)^{2}/(2\sigma_{1}^{2})\}\,\mathrm{d}t. To show (ii), we write g(t)=g(t)g(0)=t01g(tz)dzg(t)=g(t)-g(0)=t\int_{0}^{1}g^{\prime}(tz)\,\mathrm{d}z. Then g(t)/t=01g(tz)dzg(t)/t=\int_{0}^{1}g^{\prime}(tz)\,\mathrm{d}z is non-increasing in tt because gg^{\prime} is non-increasing. To show (iii), we notice that h(x)𝟙x>1h(x)\geq\mathbbm{1}_{x>1} and hence for t1t\geq 1,

g(t)+12=Eh(σ1Z+t)\displaystyle\quad g(t)+\frac{1}{2}=\mathrm{E}h(\sigma_{1}Z+t)
1P(σ1Z+t1)=1P(σ1Zt1)\displaystyle\geq 1-P(\sigma_{1}Z+t\leq 1)=1-\mathrm{P}(\sigma_{1}Z\geq t-1)
1exp{(t1)2/(2σ12)}.\displaystyle\geq 1-\exp\{-(t-1)^{2}/(2\sigma_{1}^{2})\}.

The last inequality follows by the Gaussian tail bound: P(Zz)ez2/2\mathrm{P}(Z\geq z)\leq\mathrm{e}^{-z^{2}/2} for z>0z>0.

By the preceding properties, we show that δS4,aDh\delta\leq S_{4,a}D_{h} and hence (S62) holds. Without loss of generality, assume that δ0\delta\not=0. For a(0,1/2)a\in(0,1/2), let ta>0t_{a}>0 be determined such that g(ta)=ag(t_{a})=a. Then δta\delta\leq t_{a} and, by property (ii), g(δ)/δa/tag(\delta)/\delta\geq a/t_{a}. If ta1t_{a}\geq 1, then, by property (iii), a=g(ta)12exp{(ta1)2/(2σ12)}a=g(t_{a})\geq\frac{1}{2}-\exp\{-(t_{a}-1)^{2}/(2\sigma_{1}^{2})\}, and hence

ta1+2σ1log212a.t_{a}\leq 1+\sqrt{2}\sigma_{1}\sqrt{\log\frac{2}{1-2a}}.

This inequality remains valid if ta<1t_{a}<1. Therefore, δg(δ)ta/ag(δ)S4,a=DhS4,a\delta\leq g(\delta)t_{a}/a\leq g(\delta)S_{4,a}=D_{h}S_{4,a}, by the assumption σ1M1/2\sigma_{1}\leq M^{1/2} and the definition of S4,aS_{4,a}.

When δ\delta is negative, a similar argument taking h(x)=ramp(x)h(x)=-\mathrm{ramp}(x), which is the same as ramp(x)1\mathrm{ramp}(-x)-1, shows that δS4,aDh-\delta\leq S_{4,a}D_{h} and hence (S62) holds.

[Proof of (S63)] First, assume that σ1σ2\sigma_{1}-\sigma_{2} is nonnegative. Take h(x)=ramp(x1)h(x)=\mathrm{ramp}(x-1). Notice that h(x)h(x) is (1/2)(1/2)-Lipschitz and hence |h(x+δ)h(x)||δ|/2|h(x+\delta)-h(x)|\leq|\delta|/2. Then by the triangle inequality, we have

Eh(σ1Z)Eh(σ2Z)Dh+|δ|/2.\displaystyle\mathrm{E}h(\sigma_{1}Z)-\mathrm{E}h(\sigma_{2}Z)\leq D_{h}+|\delta|/2.

Define g(t)=Eh(tZ)Eh(σ2Z)g(t)=\mathrm{E}h(tZ)-\mathrm{E}h(\sigma_{2}Z) for t0t\geq 0. Then g(σ2)=0g(\sigma_{2})=0 and g(σ1)Dh+|δ|/2g(\sigma_{1})\leq D_{h}+|\delta|/2. The derivative g(t)=12E{Z𝟙tZ[0,2]}g^{\prime}(t)=\frac{1}{2}\mathrm{E}\{Z\mathbbm{1}_{tZ\in[0,2]}\} can be calculated as

g(t)=1202/t12πzez2/2dz=122π{1e(2/t)22}.\displaystyle g^{\prime}(t)=\frac{1}{2}\int_{0}^{2/t}\frac{1}{\sqrt{2\pi}}z\mathrm{e}^{-z^{2}/2}\,\mathrm{d}z=\frac{1}{2\sqrt{2\pi}}\left\{1-\mathrm{e}^{-\frac{(2/t)^{2}}{2}}\right\}.

By the mean value theorem, g(σ1)=g(σ1)g(σ2)=g(t)(σ1σ2)g(M1/2)(σ1σ2)g(\sigma_{1})=g(\sigma_{1})-g(\sigma_{2})=g^{\prime}(t)(\sigma_{1}-\sigma_{2})\geq g^{\prime}(M^{1/2})(\sigma_{1}-\sigma_{2}), where t[σ2,σ1]t\in[\sigma_{2},\sigma_{1}] and hence tM1/2t\leq M^{1/2} because σ1M1/2\sigma_{1}\leq M^{1/2}. Therefore, we have

σ1σ2g(M1/2)1g(σ1)S5(Dh+|δ|/2),\displaystyle\sigma_{1}-\sigma_{2}\leq g^{\prime}(M^{1/2})^{-1}g(\sigma_{1})\leq S_{5}(D_{h}+|\delta|/2),

by the definition S5=g(M1/2)1=22π(1e2/M)1S_{5}=g^{\prime}(M^{1/2})^{-1}=2\sqrt{2\pi}(1-\mathrm{e}^{-2/M})^{-1}.

When σ1σ2\sigma_{1}-\sigma_{2} is negative, a similar argument taking h(x)=ramp(x1)h(x)=-\mathrm{ramp}(x-1) shows that σ2σ1S5(Dh+|δ|/2)\sigma_{2}-\sigma_{1}\leq S_{5}(D_{h}+|\delta|/2) and hence (S63) holds. ∎

Lemma S8 (Local linearity 2).

For δ1,δ2\delta_{1},\delta_{2}\in\mathbb{R}, 0σ1,σ2,σ~1,σ~2M1/20\leq\sigma_{1},\sigma_{2},\tilde{\sigma}_{1},\tilde{\sigma}_{2}\leq M^{1/2}, and ρ,ρ~[1,1]\rho,\tilde{\rho}\in[-1,1], denote Dh=Eh(X~)Eh(X)D_{h}=\mathrm{E}h(\tilde{X})-\mathrm{E}h(X), where hh is a function on 2\mathbb{R}^{2}, X=(X1,X2)TX=(X_{1},X_{2})^{\mathrm{\scriptscriptstyle T}} is a Gaussian random vector in 2\mathbb{R}^{2} with mean 0 and variance matrix (σ12σ1σ2ρσ1σ2ρσ22)\begin{pmatrix}\sigma_{1}^{2}&\sigma_{1}\sigma_{2}\rho\\ \sigma_{1}\sigma_{2}\rho&\sigma_{2}^{2}\end{pmatrix}, and X~=(X~1,X~2)T\tilde{X}=(\tilde{X}_{1},\tilde{X}_{2})^{\mathrm{\scriptscriptstyle T}} is a Gaussian random vector in 2\mathbb{R}^{2} with mean δ=(δ1,δ2)T\delta=(\delta_{1},\delta_{2})^{\mathrm{\scriptscriptstyle T}} and variance matrix (σ~12σ~1σ~2ρ~σ~1σ~2ρ~σ~22)\begin{pmatrix}\tilde{\sigma}_{1}^{2}&\tilde{\sigma}_{1}\tilde{\sigma}_{2}\tilde{\rho}\\ \tilde{\sigma}_{1}\tilde{\sigma}_{2}\tilde{\rho}&\tilde{\sigma}_{2}^{2}\end{pmatrix}. For h(x)=±ramp(x1)ramp(x2)h(x)=\pm\mathrm{ramp}(x_{1})\mathrm{ramp}(x_{2}), we have

|ρ~σ~1σ~2ρσ1σ2|M1/2σ~σ1+S7(|Dh|+Δ/2),\displaystyle|\tilde{\rho}\tilde{\sigma}_{1}\tilde{\sigma}_{2}-\rho\sigma_{1}\sigma_{2}|\leq M^{1/2}\|\tilde{\sigma}-\sigma\|_{1}+S_{7}(|D_{h}|+\Delta/2),

where S7=4{(12πMe1/(8M))(12e1/(8M))}2S_{7}=4\{(\frac{1}{\sqrt{2\pi M}}\mathrm{e}^{-1/(8M)})\vee(1-2\mathrm{e}^{-1/(8M)})\}^{-2}, which behaves like 4(12e1/(8M))24(1-2\mathrm{e}^{-1/(8M)})^{-2} as M0M\to 0 or 8πMe1/(4M)8\pi M\mathrm{e}^{1/(4M)} as MM\to\infty, and Δ=δ1+σ~σ1\Delta=\|\delta\|_{1}+\|\tilde{\sigma}-\sigma\|_{1}, with σ=(σ1,σ2)T\sigma=(\sigma_{1},\sigma_{2})^{\mathrm{\scriptscriptstyle T}} and σ~=(σ~1,σ~2)T\tilde{\sigma}=(\tilde{\sigma}_{1},\tilde{\sigma}_{2})^{\mathrm{\scriptscriptstyle T}}.

Remark S4.

Define a ramp main-effect and interaction class

2\displaystyle\mathcal{R}_{2} ={±ramp(xjc),(x1,x2)2:j=1,2,c=0,1}\displaystyle=\left\{\pm\mathrm{ramp}(x_{j}-c),(x_{1},x_{2})\in\mathbb{R}^{2}:j=1,2,\,c=0,1\right\}
{±ramp(x1)ramp(x2),(x1,x2)2}.\displaystyle\quad\bigcup\left\{\pm\mathrm{ramp}(x_{1})\mathrm{ramp}(x_{2}),(x_{1},x_{2})\in\mathbb{R}^{2}\right\}.

In the setting of Lemma S8, suppose that for fixed a(0,1/2)a\in(0,1/2),

D=defsuph2{Eh(X~)Eh(X)}a.\displaystyle D\stackrel{{\scriptstyle\text{def}}}{{=}}\sup_{h\in\mathcal{R}_{2}}\left\{\mathrm{E}h(\tilde{X})-\mathrm{E}h(X)\right\}\leq a.

Then combining Lemmas S7 and S8 yields

max(|δ1|,|δ2|)S4,aD,max(|σ~1σ1|,|σ~2σ2|)S6,aD,\displaystyle\max(|\delta_{1}|,|\delta_{2}|)\leq S_{4,a}D,\quad\max(|\tilde{\sigma}_{1}-\sigma_{1}|,|\tilde{\sigma}_{2}-\sigma_{2}|)\leq S_{6,a}D,
max(|σ~12σ12|,|σ~22σ22|,|ρ~σ~1σ~2ρσ1σ2|)S8,aD,\displaystyle\max(|\tilde{\sigma}_{1}^{2}-\sigma_{1}^{2}|,|\tilde{\sigma}_{2}^{2}-\sigma_{2}^{2}|,|\tilde{\rho}\tilde{\sigma}_{1}\tilde{\sigma}_{2}-\rho\sigma_{1}\sigma_{2}|)\leq S_{8,a}D,

where S8,a=S7(1+S4,a+S6,a)+2M1/2S6,aS_{8,a}=S_{7}(1+S_{4,a}+S_{6,a})+2M^{1/2}S_{6,a}. This shows that the moment matching discrepancy DD over 2\mathcal{R}_{2} delivers upper bounds, up to scaling constants, on the mean, variance, and covariance differences, provided that DD is sufficiently small.
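
A companion Monte Carlo sketch of Remark S4 (Python; hypothetical bivariate parameters with M = 1 and a = 1/3; the supremum over R_2 is a maximum over the finitely many signed features, and the constants are computed from their definitions above):

import numpy as np

rng = np.random.default_rng(0)
ramp = lambda x: np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def draw(n, delta, s, rho):
    # sample (X_1, X_2) with means delta, standard deviations s, correlation rho
    z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
    return (s[0] * z1 + delta[0],
            s[1] * (rho * z1 + np.sqrt(1 - rho**2) * z2) + delta[1])

M, a = 1.0, 1.0 / 3.0
S4 = (1 + np.sqrt(2 * M * np.log(2 / (1 - 2 * a)))) / a
S5 = 2 * np.sqrt(2 * np.pi) / (1 - np.exp(-2 / M))
S6 = S5 * (1 + S4 / 2)
S7 = 4 / max(np.exp(-1 / (8 * M)) / np.sqrt(2 * np.pi * M),
             1 - 2 * np.exp(-1 / (8 * M))) ** 2
S8 = S7 * (1 + S4 + S6) + 2 * np.sqrt(M) * S6

n = 1_000_000
x1, x2 = draw(n, (0.0, 0.0), (0.9, 1.0), 0.3)       # X
y1, y2 = draw(n, (0.05, -0.03), (0.85, 0.95), 0.4)  # X-tilde
feats = lambda u, v: [ramp(u), ramp(u - 1), ramp(v), ramp(v - 1),
                      ramp(u) * ramp(v)]
diffs = [fy.mean() - fx.mean() for fy, fx in zip(feats(y1, y2), feats(x1, x2))]
D = max(abs(d) for d in diffs)                      # R_2 contains +/- each h
cov_err = abs(0.4 * 0.85 * 0.95 - 0.3 * 0.9 * 1.0)
print(f"D = {D:.4f}, |cov diff| = {cov_err:.4f} <= S8*D = {S8 * D:.2f}")

Because S_{8,a} is conservative, the bound holds with a wide margin in this example.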

Proof.

First, we handle the effect of different means and standard deviations between X~\tilde{X} and XX in DhD_{h}. Denote Dh=Eh(Y~)Eh(X)D^{\dagger}_{h}=\mathrm{E}h(\tilde{Y})-\mathrm{E}h(X), where Y~=(Y~1,Y~2)T\tilde{Y}=(\tilde{Y}_{1},\tilde{Y}_{2})^{\mathrm{\scriptscriptstyle T}} is a Gaussian random vector with mean 0 and variance matrix (σ12σ1σ2ρ~σ1σ2ρ~σ22)\begin{pmatrix}\sigma_{1}^{2}&\sigma_{1}\sigma_{2}\tilde{\rho}\\ \sigma_{1}\sigma_{2}\tilde{\rho}&\sigma_{2}^{2}\end{pmatrix}. Then we have

|Dh||Dh|+Δ/2.\displaystyle|D^{\dagger}_{h}|\leq|D_{h}|+\Delta/2. (S64)

In fact, assume that Y~=(σ1Z1,σ2Z2)T\tilde{Y}=(\sigma_{1}Z_{1},\sigma_{2}Z_{2})^{\mathrm{\scriptscriptstyle T}} and X~=δ+(σ~1Z1,σ~2Z2)T\tilde{X}=\delta+(\tilde{\sigma}_{1}Z_{1},\tilde{\sigma}_{2}Z_{2})^{\mathrm{\scriptscriptstyle T}}, where (Z1,Z2)T(Z_{1},Z_{2})^{\mathrm{\scriptscriptstyle T}} is a Gaussian random vector with mean 0 and variance matrix (1ρ~ρ~1)\begin{pmatrix}1&\tilde{\rho}\\ \tilde{\rho}&1\end{pmatrix}. For h(x)=±ramp(x1)ramp(x2)h(x)=\pm\mathrm{ramp}(x_{1})\mathrm{ramp}(x_{2}), we have

|Eh(Y~)Eh(X)|\displaystyle\quad|\mathrm{E}h(\tilde{Y})-\mathrm{E}h(X)|
|Eh(X~)Eh(X)|+j=1,2E|ramp(Y~j)ramp(X~j)|\displaystyle\leq|\mathrm{E}h(\tilde{X})-\mathrm{E}h(X)|+\sum_{j=1,2}\mathrm{E}\,|\mathrm{ramp}(\tilde{Y}_{j})-\mathrm{ramp}(\tilde{X}_{j})|
|Eh(X~)Eh(X)|+j=1,2{δj2+(σ~jσj)2}1/2/2\displaystyle\leq|\mathrm{E}h(\tilde{X})-\mathrm{E}h(X)|+\sum_{j=1,2}\{\delta_{j}^{2}+(\tilde{\sigma}_{j}-\sigma_{j})^{2}\}^{1/2}/2
|Eh(X~)Eh(X)|+Δ/2.\displaystyle\leq|\mathrm{E}h(\tilde{X})-\mathrm{E}h(X)|+\Delta/2.

The first inequality follows by the triangle inequality and the fact that ramp()\mathrm{ramp}(\cdot) is bounded in [0,1][0,1]. The second step uses E|ramp(Y~j)ramp(X~j)|(1/2)E|Y~jX~j|(1/2)E1/2[{δj+(σ~jσj)Zj}2]={δj2+(σ~jσj)2}1/2/2\mathrm{E}\,|\mathrm{ramp}(\tilde{Y}_{j})-\mathrm{ramp}(\tilde{X}_{j})|\leq(1/2)\mathrm{E}|\tilde{Y}_{j}-\tilde{X}_{j}|\leq(1/2)\mathrm{E}^{1/2}[\{\delta_{j}+(\tilde{\sigma}_{j}-\sigma_{j})Z_{j}\}^{2}]=\{\delta_{j}^{2}+(\tilde{\sigma}_{j}-\sigma_{j})^{2}\}^{1/2}/2, by the fact that ramp()\mathrm{ramp}(\cdot) is (1/2)(1/2)-Lipschitz, EZj=0\mathrm{E}Z_{j}=0, and E(Zj2)=1\mathrm{E}(Z_{j}^{2})=1. The third inequality follows because u1+u2u1+u2\sqrt{u_{1}+u_{2}}\leq\sqrt{u_{1}}+\sqrt{u_{2}}.

Next, we show that

|ρ~ρ|σ1σ2S7|Dh|.\displaystyle|\tilde{\rho}-\rho|\sigma_{1}\sigma_{2}\leq S_{7}|D^{\dagger}_{h}|. (S65)

Assume that ρ~ρ\tilde{\rho}-\rho is nonnegative. The other direction will be discussed later. Now take h(x)=ramp(x1)ramp(x2)h(x)=\mathrm{ramp}(x_{1})\mathrm{ramp}(x_{2}), and define g(t)=Eh(Y)Eh(X)g(t)=\mathrm{E}h(Y)-\mathrm{E}h(X) for t[1,1]t\in[-1,1], where Y=(Y1,Y2)TY=(Y_{1},Y_{2})^{\mathrm{\scriptscriptstyle T}} is a Gaussian random vector with mean 0 and variance matrix (σ12σ1σ2tσ1σ2tσ22)\begin{pmatrix}\sigma_{1}^{2}&\sigma_{1}\sigma_{2}t\\ \sigma_{1}\sigma_{2}t&\sigma_{2}^{2}\end{pmatrix}. Then g(ρ)=0g(\rho)=0 and g(ρ~)=Dhg(\tilde{\rho})=D^{\dagger}_{h}. The derivative g(t)g^{\prime}(t) can be calculated as

g(t)=14σ1σ2E𝟙Y1[1,1]𝟙Y2[1,1].\displaystyle g^{\prime}(t)=\frac{1}{4}\sigma_{1}\sigma_{2}\mathrm{E}\mathbbm{1}_{Y_{1}\in[-1,1]}\mathbbm{1}_{Y_{2}\in[-1,1]}. (S66)

In fact, Y1Y_{1} can be represented as Y1=tσ1σ2Y2+1t2σ1Z1Y_{1}=t\frac{\sigma_{1}}{\sigma_{2}}Y_{2}+\sqrt{1-t^{2}}\sigma_{1}Z_{1}, where Z1Z_{1} is a standard Gaussian variable independent of Y2Y_{2}. Then direct calculation yields

g(t)\displaystyle g^{\prime}(t) =12E𝟙Y1[1,1](σ1σ2Y2t1t2σ1Z1)ramp(Y2)\displaystyle=\frac{1}{2}\mathrm{E}\mathbbm{1}_{Y_{1}\in[-1,1]}\left(\frac{\sigma_{1}}{\sigma_{2}}Y_{2}-\frac{t}{\sqrt{1-t^{2}}}\sigma_{1}Z_{1}\right)\mathrm{ramp}(Y_{2})
=12E𝟙Y1[1,1]{σ1σ2Y2t1t2(Y1tσ1σ2Y2)}ramp(Y2)\displaystyle=\frac{1}{2}\mathrm{E}\mathbbm{1}_{Y_{1}\in[-1,1]}\left\{\frac{\sigma_{1}}{\sigma_{2}}Y_{2}-\frac{t}{1-t^{2}}(Y_{1}-t\frac{\sigma_{1}}{\sigma_{2}}Y_{2})\right\}\mathrm{ramp}(Y_{2})
=12E𝟙Y1[1,1](11t2σ1σ2Y2t1t2Y1)ramp(Y2)\displaystyle=\frac{1}{2}\mathrm{E}\mathbbm{1}_{Y_{1}\in[-1,1]}\left(\frac{1}{1-t^{2}}\frac{\sigma_{1}}{\sigma_{2}}Y_{2}-\frac{t}{1-t^{2}}Y_{1}\right)\mathrm{ramp}(Y_{2})
=1211t2σ1σ2E𝟙Y1[1,1](Y2tσ2σ1Y1)ramp(Y2).\displaystyle=\frac{1}{2}\frac{1}{1-t^{2}}\frac{\sigma_{1}}{\sigma_{2}}\mathrm{E}\mathbbm{1}_{Y_{1}\in[-1,1]}\left(Y_{2}-t\frac{\sigma_{2}}{\sigma_{1}}Y_{1}\right)\mathrm{ramp}(Y_{2}).

By Stein’s lemma using the fact that Y2Y_{2} given Y1Y_{1} is Gaussian with mean tσ2σ1Y1t\frac{\sigma_{2}}{\sigma_{1}}Y_{1} and variance (1t2)σ22(1-t^{2})\sigma_{2}^{2}, we have

E{(Y2tσ2σ1Y1)ramp(Y2)|Y1}=E{(1t2)σ2212𝟙Y2[1,1]|Y1}.\displaystyle\mathrm{E}\left\{\left(Y_{2}-t\frac{\sigma_{2}}{\sigma_{1}}Y_{1}\right)\mathrm{ramp}(Y_{2})\Big{|}Y_{1}\right\}=\mathrm{E}\left\{(1-t^{2})\sigma_{2}^{2}\frac{1}{2}\mathbbm{1}_{Y_{2}\in[-1,1]}\Big{|}Y_{1}\right\}.

Substituting this into the expression for g(t)g^{\prime}(t) gives the formula (S66).

To show (S65), we derive a lower bound on g(t)g^{\prime}(t). Assume without loss of generality that σ1σ2\sigma_{1}\leq\sigma_{2}, because formula (S66) is symmetric in Y1Y_{1} and Y2Y_{2}. By the previous representation of Y1Y_{1}, we have |Y1|1t2σ1|Z1|+σ1σ2|Y2|σ1|Z1|+|Y2||Y_{1}|\leq\sqrt{1-t^{2}}\sigma_{1}|Z_{1}|+\frac{\sigma_{1}}{\sigma_{2}}|Y_{2}|\leq\sigma_{1}|Z_{1}|+|Y_{2}| and hence

g(t)\displaystyle g^{\prime}(t) 14σ1σ2E𝟙Y1[1,1]𝟙Y2[1/2,1/2]\displaystyle\geq\frac{1}{4}\sigma_{1}\sigma_{2}\mathrm{E}\mathbbm{1}_{Y_{1}\in[-1,1]}\mathbbm{1}_{Y_{2}\in[-1/2,1/2]}
14σ1σ2E𝟙σ1|Z1|[1/2,1/2]𝟙Y2[1/2,1/2]\displaystyle\geq\frac{1}{4}\sigma_{1}\sigma_{2}\mathrm{E}\mathbbm{1}_{\sigma_{1}|Z_{1}|\in[-1/2,1/2]}\mathbbm{1}_{Y_{2}\in[-1/2,1/2]}
S71σ1σ2,\displaystyle\geq S_{7}^{-1}\sigma_{1}\sigma_{2},

where S7=4{(12πM1/2e1/(8M))(12e1/(8M))}2S_{7}=4\{(\frac{1}{\sqrt{2\pi}M^{1/2}}\mathrm{e}^{-1/(8M)})\vee(1-2\mathrm{e}^{-1/(8M)})\}^{-2}. The last inequality follows by the independence of Z1Z_{1} and Y2Y_{2}, σ1M1/2\sigma_{1}\leq M^{1/2}, σ2M1/2\sigma_{2}\leq M^{1/2}, and the two probability bounds: P(M1/2|Z1|1/2)12πMe(1/2)2/(2M)\mathrm{P}(M^{1/2}|Z_{1}|\leq 1/2)\geq\frac{1}{\sqrt{2\pi M}}\mathrm{e}^{-(1/2)^{2}/(2M)} and P(M1/2|Z1|1/2)12e(1/2)2/(2M)\mathrm{P}(M^{1/2}|Z_{1}|\leq 1/2)\geq 1-2\mathrm{e}^{-(1/2)^{2}/(2M)}. Hence by the mean value theorem, we have

g(ρ~)=g(ρ~)g(ρ)(ρ~ρ)σ1σ2S71,\displaystyle g(\tilde{\rho})=g(\tilde{\rho})-g(\rho)\geq(\tilde{\rho}-\rho)\sigma_{1}\sigma_{2}S_{7}^{-1},

which gives the desired bound (S65) in the case of ρ~ρ\tilde{\rho}\geq\rho:

(ρ~ρ)σ1σ2S7g(ρ~)=S7Dh.\displaystyle(\tilde{\rho}-\rho)\sigma_{1}\sigma_{2}\leq S_{7}g(\tilde{\rho})=S_{7}D^{\dagger}_{h}.

When ρ~ρ\tilde{\rho}-\rho is negative, a similar argument taking h(x)=ramp(x1)ramp(x2)h(x)=-\mathrm{ramp}(x_{1})\mathrm{ramp}(x_{2}) and interchanging the roles of Y~\tilde{Y} and XX leads to (S65).

Finally, by the triangle inequality, we find

|ρ~σ~1σ~2ρσ1σ2|\displaystyle|\tilde{\rho}\tilde{\sigma}_{1}\tilde{\sigma}_{2}-\rho\sigma_{1}\sigma_{2}| |σ~1σ~2σ1σ2|+|ρ~ρ|σ1σ2\displaystyle\leq|\tilde{\sigma}_{1}\tilde{\sigma}_{2}-\sigma_{1}\sigma_{2}|+|\tilde{\rho}-\rho|\sigma_{1}\sigma_{2}
M1/2(|σ~1σ1|+|σ~2σ2|)+|ρ~ρ|σ1σ2.\displaystyle\leq M^{1/2}(|\tilde{\sigma}_{1}-\sigma_{1}|+|\tilde{\sigma}_{2}-\sigma_{2}|)+|\tilde{\rho}-\rho|\sigma_{1}\sigma_{2}.

Combining this with (S64) and (S65) leads to the desired result. ∎

Proposition S7.

In the setting of Proposition S1 or Proposition 3, suppose that for a(0,1/2)a\in(0,1/2),

D=defsupγΓrp,pen1(γ)=1{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}a.\displaystyle D\stackrel{{\scriptstyle\text{def}}}{{=}}\sup_{\gamma\in\Gamma_{\mathrm{rp}},\mathrm{pen}_{1}(\gamma)=1}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq a. (S67)

Then we have

μ^μS4,aD,\displaystyle\|\hat{\mu}-\mu^{*}\|_{\infty}\leq S_{4,a}D,
Σ^ΣmaxS8,aD,\displaystyle\|\hat{\Sigma}-\Sigma^{*}\|_{\max}\leq S_{8,a}D,

where S4,aS_{4,a} and S8,aS_{8,a} are defined as in Lemma S7 and Remark S4 with M=M1M=M_{1}.

Proof.

For any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with pen1(γ)=1\mathrm{pen}_{1}(\gamma)=1, the function hγ,μ^(x)=hγ(xμ^)h_{\gamma,\hat{\mu}}(x)=h_{\gamma}(x-\hat{\mu}) can be expressed as hrp,β,c(xμ^)h_{\mathrm{rp},\beta,c}(x-\hat{\mu}) with pen1(β)=1\mathrm{pen}_{1}(\beta)=1. For j=1,,pj=1,\ldots,p, by restricting hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) in (S67) to those with hγ(x)h_{\gamma}(x) defined as ramp functions of xjx_{j} and using Remark S6, we obtain

|μ^jμj|S4,aD,|σ^jσj|S6,aD.\displaystyle|\hat{\mu}_{j}-\mu^{*}_{j}|\leq S_{4,a}D,\quad|\hat{\sigma}_{j}-\sigma^{*}_{j}|\leq S_{6,a}D.

For 1ijp1\leq i\not=j\leq p, by restricting hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) in (S67) to those with hγ(x)h_{\gamma}(x) defined as ramp interaction functions of (xi,xj)(x_{i},x_{j}) and using Remark S4, we obtain

max(|Σ^iiΣii|,|Σ^jjΣjj|,|Σ^ijΣij|)S8,aD,\displaystyle\max\left(|\hat{\Sigma}_{ii}-\Sigma_{ii}^{*}|,|\hat{\Sigma}_{jj}-\Sigma_{jj}^{*}|,|\hat{\Sigma}_{ij}-\Sigma_{ij}^{*}|\right)\leq S_{8,a}D,

where Σ^ij\hat{\Sigma}_{ij} and Σij\Sigma_{ij}^{*} are the (i,j)(i,j)th elements of Σ^\hat{\Sigma} and Σ\Sigma^{*} respectively. Combining the preceding two displays leads to the desired result. ∎

III.3 Details in main proof of Theorem 3

Lemma S9.

Suppose that X1,,XnX_{1},\ldots,X_{n} are independent and identically distributed as XNp(0,Σ)X\sim N_{p}(0,\Sigma) with ΣopM2\|\Sigma\|_{\mathrm{op}}\leq M_{2}. For kk fixed knots ξ1,,ξk\xi_{1},\ldots,\xi_{k} in \mathbb{R}, denote φ(x)=(φT1(x),,φTk(x))T\varphi(x)=(\varphi^{\mathrm{\scriptscriptstyle T}}_{1}(x),\ldots,\varphi^{\mathrm{\scriptscriptstyle T}}_{k}(x))^{\mathrm{\scriptscriptstyle T}}, where φl(x)p\varphi_{l}(x)\in\mathbb{R}^{p} is obtained by applying t(tξl)+t\mapsto(t-\xi_{l})_{+} componentwise to xpx\in\mathbb{R}^{p} for l=1,,kl=1,\ldots,k. Then the following results hold.

(i) φ(X)Eφ(X)\varphi(X)-\mathrm{E}\varphi(X) is a sub-gaussian random vector with tail parameter (kM2)1/2(kM_{2})^{1/2}.

(ii) For any δ>0\delta>0, we have that with probability at least 1δ1-\delta,

supw2=1|1ni=1nwTφ(Xi)EwTφ(X)|\displaystyle\quad\sup_{\|w\|_{2}=1}\left|\frac{1}{n}\sum_{i=1}^{n}w^{\mathrm{\scriptscriptstyle T}}\varphi(X_{i})-\mathrm{E}w^{\mathrm{\scriptscriptstyle T}}\varphi(X)\right|
Csp21(kM2)1/2kp+log(δ1)n,\displaystyle\leq C_{\mathrm{sp21}}(kM_{2})^{1/2}\sqrt{\frac{kp+\log(\delta^{-1})}{n}},

where Csp21=2Csg7Csg5C_{\mathrm{sp21}}=\sqrt{2}C_{\mathrm{sg7}}C_{\mathrm{sg5}} and (Csg5,Csg7)(C_{\mathrm{sg5}},C_{\mathrm{sg7}}) are the universal constants in Lemmas S25 and S27.

(iii) Let 𝒜2={Akp×kp:AF=1,AT=A}\mathcal{A}_{2}=\{A\in\mathbb{R}^{kp\times kp}:\|A\|_{\mathrm{F}}=1,A^{\mathrm{\scriptscriptstyle T}}=A\}. For any δ>0\delta>0, we have that with probability at least 13δ1-3\delta, both the inequality in (ii) and

1kpsupA𝒜2|1ni=1nφT(Xi)Aφ(Xi)EφT(X)Aφ(X)|\displaystyle\quad\frac{1}{\sqrt{kp}}\sup_{A\in\mathcal{A}_{2}}\left|\frac{1}{n}\sum_{i=1}^{n}\varphi^{\mathrm{\scriptscriptstyle T}}(X_{i})A\varphi(X_{i})-\mathrm{E}\varphi^{\mathrm{\scriptscriptstyle T}}(X)A\varphi(X)\right|
Csp22kM21{kp+log(δ1)n+kp+log(δ1)n},\displaystyle\leq C_{\mathrm{sp22}}kM_{21}\left\{\sqrt{\frac{kp+\log(\delta^{-1})}{n}}+\frac{kp+\log(\delta^{-1})}{n}\right\},

where M21=M21/2(M21/2+2πξ)M_{21}=M_{2}^{1/2}(M_{2}^{1/2}+\sqrt{2\pi}\|\xi\|_{\infty}), ξ=maxl=1,,k|ξl|\|\xi\|_{\infty}=\max_{{l=1,\ldots,k}}|\xi_{l}|, Csp22=2/πCsp21+Csg8C_{\mathrm{sp22}}=\sqrt{2/\pi}C_{\mathrm{sp21}}+C_{\mathrm{sg8}}, and Csg8C_{\mathrm{sg8}} is the universal constant in Lemma S28.

Proof.

(i) First, we show that wTφ(x)w^{\mathrm{\scriptscriptstyle T}}\varphi(x) is a k1/2k^{1/2}-Lipschitz function for any L2L_{2} unit vector wkpw\in\mathbb{R}^{kp}. For any x1,x2px_{1},x_{2}\in\mathbb{R}^{p}, we have |wT(φ(x1)φ(x2))|l=1kwl2φl(x1)φl(x2)2l=1kwl2x1x22k1/2x1x22|w^{\mathrm{\scriptscriptstyle T}}(\varphi(x_{1})-\varphi(x_{2}))|\leq\sum_{l=1}^{k}\|w_{l}\|_{2}\|\varphi_{l}(x_{1})-\varphi_{l}(x_{2})\|_{2}\leq\sum_{l=1}^{k}\|w_{l}\|_{2}\|x_{1}-x_{2}\|_{2}\leq k^{1/2}\|x_{1}-x_{2}\|_{2}, where ww is partitioned as w=(w1T,,wkT)Tw=(w_{1}^{\mathrm{\scriptscriptstyle T}},\ldots,w_{k}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}, and φl(x1)φl(x2)2x1x22\|\varphi_{l}(x_{1})-\varphi_{l}(x_{2})\|_{2}\leq\|x_{1}-x_{2}\|_{2} because each component of φl(x)\varphi_{l}(x) is 1-Lipschitz, as a function of only the corresponding component of xx. The last inequality uses l=1kwl2k1/2w2=k1/2\sum_{l=1}^{k}\|w_{l}\|_{2}\leq k^{1/2}\|w\|_{2}=k^{1/2} by the Cauchy–Schwarz inequality.

|wTφ(Σ1/2z1)wTφ(Σ1/2z2)|\displaystyle\quad|w^{\mathrm{\scriptscriptstyle T}}\varphi(\Sigma^{1/2}z_{1})-w^{\mathrm{\scriptscriptstyle T}}\varphi(\Sigma^{1/2}z_{2})|
k1/2Σ1/2(z1z2)2k1/2Σ1/2opz1z22(kM2)1/2z1z22.\displaystyle\leq k^{1/2}\|\Sigma^{1/2}(z_{1}-z_{2})\|_{2}\leq k^{1/2}\|\Sigma^{1/2}\|_{\mathrm{op}}\|z_{1}-z_{2}\|_{2}\leq(kM_{2})^{1/2}\|z_{1}-z_{2}\|_{2}.

Hence wTφ(X)w^{\mathrm{\scriptscriptstyle T}}\varphi(X) is a (kM2)1/2(kM_{2})^{1/2}-Lipschitz function of the standard Gaussian vector ZZ. By Theorem 5.6 in \citeappendBLM13, the centered version satisfies that for any t>0t>0,

P(|wT(φ(X)Eφ(X))|>t)2et2/(2kM2).\displaystyle\mathrm{P}\left(|w^{\mathrm{\scriptscriptstyle T}}(\varphi(X)-\mathrm{E}\varphi(X))|>t\right)\leq 2\mathrm{e}^{-t^{2}/(2kM_{2})}.

That is, wT(φ(X)Eφ(X))w^{\mathrm{\scriptscriptstyle T}}(\varphi(X)-\mathrm{E}\varphi(X)) is sub-gaussian with tail parameter (kM2)1/2(kM_{2})^{1/2}. The desired result follows by the definition of sub-gaussian random vectors.

(ii) As shown above, wT(φ(X)Eφ(X))w^{\mathrm{\scriptscriptstyle T}}(\varphi(X)-\mathrm{E}\varphi(X)) is sub-gaussian with tail parameter (kM2)1/2(kM_{2})^{1/2} for any L2L_{2} unit vector ww. Then wT{n1i=1nφ(Xi)Eφ(X)}w^{\mathrm{\scriptscriptstyle T}}\{n^{-1}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X)\} is sub-gaussian with tail parameter Csg5(kM2/n)1/2C_{\mathrm{sg5}}(kM_{2}/n)^{1/2} by sub-gaussian concentration (Lemma S25). Hence by definition, we have that n1i=1nφ(Xi)Eφ(X)n^{-1}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X) is a sub-gaussian random vector with tail parameter Csg5(kM2/n)1/2C_{\mathrm{sg5}}(kM_{2}/n)^{1/2}. Notice that

supw2=1|1ni=1nwTφ(Xi)EwTφ(X)|=1ni=1nφ(Xi)Eφ(X)2.\displaystyle\sup_{\|w\|_{2}=1}\left|\frac{1}{n}\sum_{i=1}^{n}w^{\mathrm{\scriptscriptstyle T}}\varphi(X_{i})-\mathrm{E}w^{\mathrm{\scriptscriptstyle T}}\varphi(X)\right|=\left\|\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X)\right\|_{2}.

The desired result follows from Lemma S27: with probability at least 1δ1-\delta, we have

1ni=1nφ(Xi)Eφ(X)2\displaystyle\quad\left\|\frac{1}{n}\sum_{i=1}^{n}\varphi(X_{i})-\mathrm{E}\varphi(X)\right\|_{2} Csg7Csg5(kM2)1/2{kpn+log(δ1)n}\displaystyle\leq C_{\mathrm{sg7}}C_{\mathrm{sg5}}(kM_{2})^{1/2}\left\{\sqrt{\frac{kp}{n}}+\sqrt{\frac{\log(\delta^{-1})}{n}}\right\}
2Csg7Csg5(kM2)1/2kp+log(δ1)n.\displaystyle\leq\sqrt{2}C_{\mathrm{sg7}}C_{\mathrm{sg5}}(kM_{2})^{1/2}\sqrt{\frac{kp+\log(\delta^{-1})}{n}}.

(iii) The difference of interest can be expressed in terms of the centered variables as

1ni=1nφTiAφiEφTAφ\displaystyle\quad\frac{1}{n}\sum_{i=1}^{n}\varphi^{\mathrm{\scriptscriptstyle T}}_{i}A\varphi_{i}-\mathrm{E}\varphi^{\mathrm{\scriptscriptstyle T}}A\varphi
=1ni=1n(φiEφ)TA(φiEφ)E{(φEφ)TA(φEφ)}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}(\varphi_{i}-\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A(\varphi_{i}-\mathrm{E}\varphi)-\mathrm{E}\{(\varphi-\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A(\varphi-\mathrm{E}\varphi)\} (S68)
+1ni=1n2(Eφ)TA(φiEφ).\displaystyle\quad+\frac{1}{n}\sum_{i=1}^{n}2(\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A(\varphi_{i}-\mathrm{E}\varphi). (S69)

We handle the concentration of the two terms separately. Denote φi=φ(Xi)\varphi_{i}=\varphi(X_{i}), φ=φ(X)\varphi=\varphi(X), φ~i=φiEφ\tilde{\varphi}_{i}=\varphi_{i}-\mathrm{E}\varphi, and φ~=φEφ\tilde{\varphi}=\varphi-\mathrm{E}\varphi.

First, for A𝒜2A\in\mathcal{A}_{2} the term in (S69) can be bounded as follows:

|2(Eφ)TA1ni=1nφ~i|2Eφ2Aop1ni=1nφ~i22Eφ21ni=1nφ~i2\displaystyle\quad\left|2(\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right|\leq 2\|\mathrm{E}\varphi\|_{2}\|A\|_{\mathrm{op}}\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{2}\leq 2\|\mathrm{E}\varphi\|_{2}\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{2}
2kp(M21/22π+ξ)1ni=1nφ~i2,\displaystyle\leq 2\sqrt{kp}\left(\frac{M_{2}^{1/2}}{\sqrt{2\pi}}+\|\xi\|_{\infty}\right)\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{2},

where ξ=maxl=1,,k|ξl|\|\xi\|_{\infty}=\max_{l=1,\ldots,k}|\xi_{l}|. The second inequality holds because AopAF=1\|A\|_{\mathrm{op}}\leq\|A\|_{\mathrm{F}}=1. The third inequality holds because Eφl2p(M21/2/2π+|ξl|)\|\mathrm{E}\varphi_{l}\|_{2}\leq\sqrt{p}(M_{2}^{1/2}/\sqrt{2\pi}+|\xi_{l}|) by Lemma S15. By (ii), for any δ>0\delta>0, we have that with probability at least 1δ1-\delta,

1ni=1nφ~i2Csp21(kM2)1/2kp+log(δ1)n.\displaystyle\quad\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right\|_{2}\leq C_{\mathrm{sp21}}(kM_{2})^{1/2}\sqrt{\frac{kp+\log(\delta^{-1})}{n}}.

From the preceding two displays, we obtain that with probability at least 1δ1-\delta,

1kpsupA𝒜2|2(Eφ)TA1ni=1nφ~i|\displaystyle\quad\frac{1}{\sqrt{kp}}\sup_{A\in\mathcal{A}_{2}}\left|2(\mathrm{E}\varphi)^{\mathrm{\scriptscriptstyle T}}A\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\right|
2πCsp21(kM2)1/2(M21/2+2πξ)kp+log(δ1)n.\displaystyle\leq\sqrt{\frac{2}{\pi}}C_{\mathrm{sp21}}(kM_{2})^{1/2}\left(M_{2}^{1/2}+\sqrt{2\pi}\|\xi\|_{\infty}\right)\sqrt{\frac{kp+\log(\delta^{-1})}{n}}. (S70)

Next, consider an eigen-decomposition A=l=1kpλlwlwlTA=\sum_{l=1}^{kp}\lambda_{l}w_{l}w_{l}^{{\mathrm{\scriptscriptstyle T}}}, where λl\lambda_{l}’s are eigenvalues and wlw_{l}’s are the eigenvectors with wl2=1\|w_{l}\|_{2}=1. The concentration of the term in (S68) can be controlled as follows:

supA𝒜2|1ni=1nφ~iTAφ~iEφ~TAφ~|=supA𝒜2|l=1kpλlwlT(1ni=1nφ~iφ~iTEφ~φ~T)wl|\displaystyle\quad\sup_{A\in\mathcal{A}_{2}}\left|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}^{\mathrm{\scriptscriptstyle T}}A\tilde{\varphi}_{i}-\mathrm{E}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}A\tilde{\varphi}\right|=\sup_{A\in\mathcal{A}_{2}}\left|\sum_{l=1}^{kp}\lambda_{l}w_{l}^{\mathrm{\scriptscriptstyle T}}\left(\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\tilde{\varphi}_{i}^{\mathrm{\scriptscriptstyle T}}-\mathrm{E}\tilde{\varphi}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}\right)w_{l}\right|
supA𝒜2(l=1kp|λl|)1ni=1nφ~iφ~iTEφ~φ~Topkp1ni=1nφ~iφ~iTEφ~φ~Top.\displaystyle\leq\sup_{A\in\mathcal{A}_{2}}\left(\sum_{l=1}^{kp}|\lambda_{l}|\right)\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\tilde{\varphi}_{i}^{\mathrm{\scriptscriptstyle T}}-\mathrm{E}\tilde{\varphi}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}\right\|_{\mathrm{op}}\leq\sqrt{kp}\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\tilde{\varphi}_{i}^{\mathrm{\scriptscriptstyle T}}-\mathrm{E}\tilde{\varphi}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}\right\|_{\mathrm{op}}.

The last inequality uses the fact that AF=(l=1kpλl2)1/2=1\|A\|_{\mathrm{F}}=(\sum_{l=1}^{kp}\lambda_{l}^{2})^{1/2}=1 and hence l=1kp|λl|kp\sum_{l=1}^{kp}|\lambda_{l}|\leq\sqrt{kp} for A𝒜2A\in\mathcal{A}_{2}. From (i), φ~i\tilde{\varphi}_{i} is a sub-gaussian random vector with tail parameter (kM2)1/2(kM_{2})^{1/2}. By Lemma S28, for any δ>0\delta>0, we have that with probability at least 12δ1-2\delta,

1ni=1nφ~iφ~iTEφ~φ~TopCsg8kM2{kp+log(δ1)n+kp+log(δ1)n}.\displaystyle\left\|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}\tilde{\varphi}_{i}^{\mathrm{\scriptscriptstyle T}}-\mathrm{E}\tilde{\varphi}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}\right\|_{\mathrm{op}}\leq C_{\mathrm{sg8}}kM_{2}\left\{\sqrt{\frac{kp+\log(\delta^{-1})}{n}}+\frac{kp+\log(\delta^{-1})}{n}\right\}.

From the preceding two displays, we obtain that with probability at least 12δ1-2\delta,

1kpsupA𝒜2|1ni=1nφ~iTAφ~iEφ~TAφ~|\displaystyle\quad\frac{1}{\sqrt{kp}}\sup_{A\in\mathcal{A}_{2}}\left|\frac{1}{n}\sum_{i=1}^{n}\tilde{\varphi}_{i}^{\mathrm{\scriptscriptstyle T}}A\tilde{\varphi}_{i}-\mathrm{E}\tilde{\varphi}^{\mathrm{\scriptscriptstyle T}}A\tilde{\varphi}\right|
Csg8kM2{kp+log(δ1)n+kp+log(δ1)n}.\displaystyle\leq C_{\mathrm{sg8}}kM_{2}\left\{\sqrt{\frac{kp+\log(\delta^{-1})}{n}}+\frac{kp+\log(\delta^{-1})}{n}\right\}. (S71)

Combining the two bounds (S70) and (S71) gives the desired result. ∎
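
As a quick illustration of part (ii), the following sketch (Python; toy k, p, and Sigma; the population mean E(X_j - xi)_+ = s phi(xi/s) - xi(1 - Phi(xi/s)) for X_j ~ N(0, s^2) is computed in closed form) tracks the L2 norm of the empirical deviation, which should scale like sqrt(1/n):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, knots = 4, np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # k = 5
A = rng.standard_normal((p, p)); Sigma = A @ A.T / p  # toy variance matrix

def phi(X):   # (n, p) -> (n, k*p), blocks ordered by knot as in the lemma
    return np.maximum(X[:, None, :] - knots[None, :, None], 0.0).reshape(len(X), -1)

sd = np.sqrt(np.diag(Sigma))
Ephi = np.array([s * norm.pdf(xi / s) - xi * norm.sf(xi / s)
                 for xi in knots for s in sd])        # exact E (X_j - xi)_+
for n in [500, 2000, 8000, 32000]:
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    dev = np.linalg.norm(phi(X).mean(axis=0) - Ephi)
    print(n, dev * np.sqrt(n))                        # roughly constant in n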

Proposition S8.

In the setting of Proposition S2, it holds with probability at least 14δ1-{4}\delta that for any γ=(γ0,γ1T,γ2T)TΓ\gamma=(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in{\Gamma},

Kf(Pn,Pθ;hγ,μ)\displaystyle K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}}) f(3/5)(ϵ+ϵ/(nδ))\displaystyle\leq-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)})
+pen2(γ1)(5/3)Csp21M21/2R1λ21\displaystyle\quad+\mathrm{pen}_{2}(\gamma_{1})(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}R_{1}\lambda_{21}
+pen2(γ2)(255/3)Csp22M21R1pλ31,\displaystyle\quad+\mathrm{pen}_{2}(\gamma_{2}){(25\sqrt{5}/3)}C_{\mathrm{sp22}}M_{21}R_{1}\sqrt{p}\lambda_{31},

where Csp21C_{\mathrm{sp21}} and Csp22C_{\mathrm{sp22}} are defined as in Lemma S9, M21=M21/2(M21/2+22π)M_{21}=M_{2}^{1/2}(M_{2}^{1/2}+2\sqrt{2\pi}), and

λ21=5p+log(δ1)n,λ31=λ21+5p+log(δ1)n.\displaystyle\lambda_{21}=\sqrt{\frac{{5p}+\log(\delta^{-1})}{n}},\quad\lambda_{31}=\lambda_{21}+\frac{{5p}+\log(\delta^{-1})}{n}.
Proof.

Consider the event Ω1={|ϵ^ϵ|ϵ(1ϵ)/(nδ)}\Omega_{1}=\{|\hat{\epsilon}-\epsilon|\leq\sqrt{\epsilon(1-\epsilon)/(n\delta)}\}. By Chebyshev’s inequality, we have P(Ω1)1δ\mathrm{P}(\Omega_{1})\geq 1-\delta. In the event Ω1\Omega_{1}, we have |ϵ^ϵ|1/5|\hat{\epsilon}-\epsilon|\leq 1/5 by the assumption ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5 and hence ϵ^2/5\hat{\epsilon}\leq 2/5 by the assumption ϵ1/5\epsilon\leq 1/5. By Lemma S4 with ϵ1=2/5\epsilon_{1}=2/5, it holds in the event Ω1\Omega_{1} that for any γΓ\gamma\in\Gamma,

Kf(Pn,Pθ;hγ,μ)\displaystyle\quad K_{f}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})
f(3/5)ϵ^+R1|EPθ,nhγ,μ(x)EPθhγ,μ(x)|\displaystyle\leq-f^{\prime}(3/5)\hat{\epsilon}+R_{1}\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)\right|
f(3/5)(ϵ+ϵ/(nδ))+R1|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|.\displaystyle\leq-f^{\prime}(3/5)(\epsilon+\sqrt{\epsilon/(n\delta)})+R_{1}\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|. (S72)

The last step also uses the facts that $\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)$ and $\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})$, by the definition $h_{\gamma,\mu^{*}}(x)=h_{\gamma}(x-\mu^{*})$.

Next, conditionally on the contamination indicators $(U_{1},\ldots,U_{n})$ such that the event $\Omega_{1}$ holds, we have that $\{X_{i}:U_{i}=0,i=1,\ldots,n\}$ are $n_{1}$ independent and identically distributed observations from $P_{\theta^{*}}$, where $n_{1}=\sum_{i=1}^{n}(1-U_{i})=n(1-\hat{\epsilon})\geq(3/5)n$. Denote by $\Omega_{2}$ the event that for any $\gamma_{1}$ and $\gamma_{2}$,

|EPθ,nγ1Tφ(xμ)EP(0,Σ)γ1Tφ(x)|γ12Csp21M21/25p+log(δ1)(3/5)n,\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x)\right|\leq\|\gamma_{1}\|_{2}C_{\mathrm{sp21}}M_{2}^{1/2}\sqrt{\frac{{5p}+\log(\delta^{-1})}{(3/5)n}},

and

|EPθ,nγ2T(φ(xμ)φ(xμ))EP(0,Σ)γ2T(φ(x)φ(x))|\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x-\mu^{*})\otimes\varphi(x-\mu^{*}))-\mathrm{E}_{P_{(0,\Sigma^{*})}}\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x)\otimes\varphi(x))\right|
γ22Csp225M215p{5p+log(δ1)(3/5)n+5p+log(δ1)(3/5)n},\displaystyle\leq\|\gamma_{2}\|_{2}C_{\mathrm{sp22}}{5}M_{21}\sqrt{{5p}}\left\{\sqrt{\frac{{5p}+\log(\delta^{-1})}{(3/5)n}}+\frac{{5p}+\log(\delta^{-1})}{(3/5)n}\right\},

where Csp21C_{\mathrm{sp21}}, Csp22C_{\mathrm{sp22}}, and M21M_{21} are defined as in Lemma S9 with ξ=1\|\xi\|_{\infty}=1. In the event Ω2\Omega_{2}, the preceding inequalities imply that for any γ=(γ0,γ1T,γ2T)TΓ\gamma=(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in\Gamma,

|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|
pen2(γ1)(5/3)Csp21M21/2λ21+pen2(γ2)(5/3)Csp225M215pλ31,\displaystyle\leq\mathrm{pen}_{2}(\gamma_{1})(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}\lambda_{21}+\mathrm{pen}_{2}(\gamma_{2})(5/3)C_{\mathrm{sp22}}{5}M_{21}\sqrt{{5p}}\lambda_{31}, (S73)

where $h_{\gamma}(x)=\gamma_{0}+\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x)+\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x)\otimes\varphi(x))$, $\mathrm{pen}_{2}(\gamma_{1})=\|\gamma_{1}\|_{2}$, and $\mathrm{pen}_{2}(\gamma_{2})=\|\gamma_{2}\|_{2}$. By applying Lemma S9 with $k=5$ to $\{X_{i}-\mu^{*}:U_{i}=0,i=1,\ldots,n\}$, we have $\mathrm{P}(\Omega_{2}|U_{1},\ldots,U_{n})\geq 1-3\delta$ for any $(U_{1},\ldots,U_{n})$ such that $\Omega_{1}$ holds. Taking the expectation over $(U_{1},\ldots,U_{n})$ given $\Omega_{1}$ shows that $\mathrm{P}(\Omega_{2}|\Omega_{1})\geq 1-3\delta$ and hence $\mathrm{P}(\Omega_{1}\cap\Omega_{2})\geq(1-\delta)(1-3\delta)\geq 1-4\delta$.
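For the final probability bound, the elementary arithmetic is

(1-\delta)(1-3\delta)=1-4\delta+3\delta^{2}\geq 1-4\delta.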

Combining (S72) and (S73) in the event Ω1Ω2\Omega_{1}\cap\Omega_{2} indicates that, with probability at least 14δ1-4\delta, the desired inequality holds for any γΓ\gamma\in\Gamma. ∎

Lemma S10.

Suppose that X1,,XnX_{1},\ldots,X_{n} are independent and identically distributed as XPϵX\sim P_{\epsilon}. Let b>0b>0 be fixed and g:p[0,1]qg:\mathbb{R}^{p}\to[0,1]^{q} be a vector of fixed functions. For a convex and twice differentiable function f:(0,)f:(0,\infty)\to\mathbb{R}, define

F(X1,,Xn)\displaystyle\quad F(X_{1},\dots,X_{n})
=supw2=1,μp,η0[0,1]q{Kf(Pn,Pθ^;bwTgμ,η0)Kf(Pϵ,Pθ^;bwTgμ,η0)}\displaystyle=\sup_{\|w\|_{2}=1,\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\{K_{f}(P_{n},P_{\hat{\theta}};bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}})-K_{f}(P_{\epsilon},P_{\hat{\theta}};bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}})\right\}
=supw2=1,μp,η0[0,1]q{1ni=1nf(ebwTgμ,η0(Xi))Ef(ebwTgμ,η0(X))},\displaystyle=\sup_{\|w\|_{2}=1,\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\{\frac{1}{n}\sum_{i=1}^{n}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}}(X_{i})})-\mathrm{E}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}}(X)})\right\},

where gμ,η0(x)=g(xμ)η0g_{\mu,\eta_{0}}(x)=g(x-\mu)-\eta_{0}. Suppose that conditionally on (X1,,Xn)(X_{1},\ldots,X_{n}), the random variable Zn,j=supμp,η0[0,1]q|n1i=1nϵigμ,η0,j(Xi)|Z_{n,j}=\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0},j}(X_{i})| is sub-gaussian with tail parameter Vg/n\sqrt{V_{g}/n} for j=1,,qj=1,\ldots,q, where (ϵ1,,ϵn)(\epsilon_{1},\ldots,\epsilon_{n}) are Rademacher variables, independent of (X1,,Xn)(X_{1},\ldots,X_{n}), and gμ,η0,j:p[1,1]g_{\mu,\eta_{0},j}:\mathbb{R}^{p}\to[-1,1] denotes the jjth component of gμ,η0g_{\mu,\eta_{0}}. Then for any δ>0\delta>0, we have that with probability at least 12δ1-2\delta,

F(X1,,Xn)bR2,bq{Csg,122qVgn+2qlog(δ1)n},\displaystyle F(X_{1},\dots,X_{n})\leq bR_{2,b\sqrt{q}}\left\{C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}+\sqrt{\frac{2q\log(\delta^{-1})}{n}}\right\},

where R2,bq=sup|u|bqdduf(eu)R_{2,b\sqrt{q}}=\sup_{|u|\leq b\sqrt{q}}\frac{\mathrm{d}}{\mathrm{d}u}f^{\prime}(\mathrm{e}^{u}) and Csg,12C_{\mathrm{sg,12}} is the universal constant in Lemma S23.

Proof.

First, FF satisfies the bounded difference condition, because |bwTgμ,η0|bq|bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}}|\leq b\sqrt{q} with w21\|w\|_{2}\leq 1 and ff^{\prime} is non-decreasing by the convexity of ff:

\quad\sup_{X_{1},\dots,X_{n},X_{i}^{\prime}}\left|F(X_{1},\dots,X_{n})-F(X_{1},\dots,X_{i}^{\prime},\dots,X_{n})\right|
\leq\frac{f^{\prime}(\mathrm{e}^{b\sqrt{q}})-f^{\prime}(\mathrm{e}^{-b\sqrt{q}})}{n}\leq\frac{2b\sqrt{q}R_{2,b\sqrt{q}}}{n},

where $R_{2,b\sqrt{q}}=\sup_{|u|\leq b\sqrt{q}}\frac{\mathrm{d}}{\mathrm{d}u}f^{\prime}(\mathrm{e}^{u})$ as in the statement. By McDiarmid’s inequality (\citeappendMC89), for any $t>0$, we have that with probability at least $1-2\mathrm{e}^{-2nt^{2}}$,

|F(X_{1},\dots,X_{n})-\mathrm{E}F(X_{1},\dots,X_{n})|\leq 2b\sqrt{q}R_{2,b\sqrt{q}}t.

For any $\delta>0$, taking $t=\sqrt{\log(\delta^{-1})/(2n)}$ shows that with probability at least $1-2\delta$,

|F(X_{1},\dots,X_{n})-\mathrm{E}F(X_{1},\dots,X_{n})|\leq bR_{2,b\sqrt{q}}\sqrt{\frac{2q\log(\delta^{-1})}{n}}.
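For completeness, the substitution can be checked directly:

2b\sqrt{q}R_{2,b\sqrt{q}}\sqrt{\frac{\log(\delta^{-1})}{2n}}=bR_{2,b\sqrt{q}}\sqrt{\frac{4q\log(\delta^{-1})}{2n}}=bR_{2,b\sqrt{q}}\sqrt{\frac{2q\log(\delta^{-1})}{n}}.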

Next, the expectation of F(X1,,Xn)F(X_{1},\dots,X_{n}) can be bounded as follows:

Esupw2=1,μp,η0[0,1]q{1ni=1nf(ebwTgμ,η0(Xi))Ef(ebwTgμ,η0(X))}\displaystyle\quad\mathrm{E}\sup_{\|w\|_{2}=1,\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\{\frac{1}{n}\sum_{i=1}^{n}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}}(X_{i})})-\mathrm{E}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}}(X)})\right\}
2Esupw2=1,μp,η0[0,1]q{1ni=1nϵif(ebwTgμ,η0(Xi))}\displaystyle\leq 2\mathrm{E}\sup_{\|w\|_{2}=1,\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f^{\prime}(\mathrm{e}^{bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}}(X_{i})})\right\} (S74)
2R2,bqEsupw2=1,μp,η0[0,1]q{1ni=1nϵibwTgμ,η0(Xi)}\displaystyle\leq 2R_{2,b\sqrt{q}}\,\mathrm{E}\sup_{\|w\|_{2}=1,\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}bw^{\mathrm{\scriptscriptstyle T}}g_{\mu,\eta_{0}}(X_{i})\right\} (S75)
2bR2,bqEsupμp,η0[0,1]q1ni=1nϵigμ,η0(Xi)2\displaystyle\leq 2bR_{2,b\sqrt{q}}\,\mathrm{E}\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0}}(X_{i})\right\|_{2}
2bR2,bqCsg,122qVgn.\displaystyle\leq 2bR_{2,b\sqrt{q}}C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}. (S76)

Line (S74) follows from the symmetrization Lemma S33, where $(\epsilon_{1},\ldots,\epsilon_{n})$ are Rademacher variables, independent of $(X_{1},\ldots,X_{n})$. Line (S75) follows by Lemma S34, because $f^{\prime}(\mathrm{e}^{u})$ is $R_{2,b\sqrt{q}}$-Lipschitz in $u\in[-b\sqrt{q},b\sqrt{q}]$. Line (S76) follows because

Esupμp,η0[0,1]q1ni=1nϵigμ,η0(Xi)2{Esupμp,η0[0,1]q1ni=1nϵigμ,η0(Xi)22}1/2\displaystyle\quad\mathrm{E}\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0}}(X_{i})\right\|_{2}\leq\left\{\mathrm{E}\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left\|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0}}(X_{i})\right\|_{2}^{2}\right\}^{1/2}
{j=1qEsupμp,η0[0,1]q|1ni=1nϵigμ,η0,j(Xi)|2}1/2\displaystyle\leq\left\{\sum_{j=1}^{q}\mathrm{E}\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0},j}(X_{i})\right|^{2}\right\}^{1/2}
Csg,122qVgn.\displaystyle\leq C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}.

For the last step, we use the assumption that conditionally on (X1,,Xn)(X_{1},\ldots,X_{n}), the random variable Zn,j=supμp,η0[0,1]q|n1i=1nϵigμ,η0,j(Xi)|Z_{n,j}=\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0},j}(X_{i})| is sub-gaussian with tail parameter Vg/n\sqrt{V_{g}/n} for each j=1,,qj=1,\ldots,q, and apply Lemma S23 to obtain E(Z2n,j|X1,,Xn)Csg,122(2Vg/n)\mathrm{E}(Z^{2}_{n,j}|X_{1},\ldots,X_{n})\leq C_{\mathrm{sg,12}}^{2}(2V_{g}/n), and then E(Z2n,j)Csg,122(2Vg/n)\mathrm{E}(Z^{2}_{n,j})\leq C_{\mathrm{sg,12}}^{2}(2V_{g}/n).

Combining the tail probability and expectation bounds yields the desired result. ∎

Lemma S11.

Suppose that f:(0,)f:(0,\infty)\to\mathbb{R} is convex and three-times differentiable. Let b>0b>0 be fixed.

(i) For any function h:p[b,b]h:\mathbb{R}^{p}\to[-b,b], we have

Kf(Pϵ,Pθ^;h)f(eb)ϵ+f(eEPθh(x))f#(eEPθ^h(x))12R33,b,\displaystyle K_{f}(P_{\epsilon},P_{\hat{\theta}};h)\geq f^{\prime}(\mathrm{e}^{-b})\epsilon+f^{\prime}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\hat{\theta}}}h(x)})-\frac{1}{2}R_{33,b},

where R33,b=R31,bVarPθh(x)+R32,bVarPθ^h(x)R_{33,b}=R_{31,b}\mathrm{Var}_{P_{\theta^{*}}}h(x)+R_{32,b}\mathrm{Var}_{P_{\hat{\theta}}}h(x), R31,b=sup|u|bd2du2{f(eu)}R_{31,b}=\sup_{|u|\leq b}\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}\{-f^{\prime}(\mathrm{e}^{u})\}, and R32,b=sup|u|bd2du2f#(eu)R_{32,b}=\sup_{|u|\leq b}\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\#}(\mathrm{e}^{u}).

(ii) If, in addition, EPθh(x)=0\mathrm{E}_{P_{\theta^{*}}}h(x)=0 and EPθ^h(x)0\mathrm{E}_{P_{\hat{\theta}}}h(x)\leq 0, then

Kf(Pϵ,Pθ^;h)f(eb)ϵ+R4,b{EPθh(x)EPθ^h(x)}12R33,b,\displaystyle K_{f}(P_{\epsilon},P_{\hat{\theta}};h)\geq f^{\prime}(\mathrm{e}^{-b})\epsilon+R_{4,b}\left\{\mathrm{E}_{P_{\theta^{*}}}h(x)-\mathrm{E}_{P_{\hat{\theta}}}h(x)\right\}-\frac{1}{2}R_{33,b},

where R4,b=inf|u|bdduf#(eu)R_{4,b}=\inf_{|u|\leq b}\frac{\mathrm{d}}{\mathrm{d}u}f^{\#}(\mathrm{e}^{u}).

Proof.

(i) First, Kf(Pϵ,Pθ^;h)K_{f}(P_{\epsilon},P_{\hat{\theta}};h) can be bounded as follows:

Kf(Pϵ,Pθ^;h)\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h)
=ϵEQf(eh(x))+(1ϵ)EPθf(eh(x))EPθ^f#(eh(x))\displaystyle=\epsilon\mathrm{E}_{Q}f^{\prime}(\mathrm{e}^{h(x)})+(1-\epsilon)\mathrm{E}_{P_{\theta^{*}}}f^{\prime}(\mathrm{e}^{h(x)})-\mathrm{E}_{P_{\hat{\theta}}}f^{\#}(\mathrm{e}^{h(x)})
f(eb)ϵ+EPθf(eh(x))EPθ^f#(eh(x)),\displaystyle\geq f^{\prime}(\mathrm{e}^{-b})\epsilon+\mathrm{E}_{P_{\theta^{*}}}f^{\prime}(\mathrm{e}^{h(x)})-\mathrm{E}_{P_{\hat{\theta}}}f^{\#}(\mathrm{e}^{h(x)}),
=f(eb)ϵ+Kf(Pθ,Pθ^;h),\displaystyle=f^{\prime}(\mathrm{e}^{-b})\epsilon+K_{f}(P_{\theta^{*}},P_{\hat{\theta}};h),

where the inequality follows because f(eh(x))f(eb)f^{\prime}(\mathrm{e}^{h(x)})\geq f^{\prime}(\mathrm{e}^{-b}) for h(x)[b,b]h(x)\in[-b,b] by the convexity of ff. Next, consider the function

κ(t)=EPθf(eE1+th~1(x))EPθ^f#(eE2+th~2(x)),\displaystyle\kappa(t)=\mathrm{E}_{P_{\theta^{*}}}f^{\prime}(\mathrm{e}^{E_{1}+t\tilde{h}_{1}(x)})-\mathrm{E}_{P_{\hat{\theta}}}f^{\#}(\mathrm{e}^{E_{2}+t\tilde{h}_{2}(x)}),

where E1=EPθh(x)E_{1}=\mathrm{E}_{P_{\theta^{*}}}h(x), E2=EPθ^h(x)E_{2}=\mathrm{E}_{P_{\hat{\theta}}}h(x), h~1(x)=h(x)E1\tilde{h}_{1}(x)=h(x)-E_{1}, and h~2(x)=h(x)E2\tilde{h}_{2}(x)=h(x)-E_{2}. A Taylor expansion of κ(1)=Kf(Pθ,Pθ^;h)\kappa(1)=K_{f}(P_{\theta^{*}},P_{\hat{\theta}};h) about t=0t=0 yields

Kf(Pθ,Pθ^;h)\displaystyle K_{f}(P_{\theta^{*}},P_{\hat{\theta}};h) =f(eE1)f#(eE2)12κ(t),\displaystyle=f^{\prime}(\mathrm{e}^{E_{1}})-f^{\#}(\mathrm{e}^{E_{2}})-\frac{1}{2}\kappa^{{\prime\prime}}(t),

where, for some $t\in[0,1]$ and with a slight abuse of notation, $\kappa^{\prime\prime}(t)$ below denotes the negative of the second derivative of $\kappa$ at $t$:

κ(t)\displaystyle\kappa^{{\prime\prime}}(t) =EPθ{h~12(x)d2du2f(eE1+u)|u=th~1(x)}+EPθ^{h~22(x)d2du2f#(eE2+u)|u=th~2(x)}.\displaystyle=-\mathrm{E}_{P_{\theta^{*}}}\left\{\tilde{h}_{1}^{2}(x)\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\prime}(\mathrm{e}^{E_{1}+u})|_{u=t\tilde{h}_{1}(x)}\right\}+\mathrm{E}_{P_{\hat{\theta}}}\left\{\tilde{h}_{2}^{2}(x)\frac{\mathrm{d}^{2}}{\mathrm{d}u^{2}}f^{\#}(\mathrm{e}^{E_{2}+u})|_{u=t\tilde{h}_{2}(x)}\right\}.
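Here the first-order term in the expansion vanishes because $\tilde{h}_{1}$ and $\tilde{h}_{2}$ are mean-centered under $P_{\theta^{*}}$ and $P_{\hat{\theta}}$ respectively:

\kappa^{\prime}(0)=\frac{\mathrm{d}}{\mathrm{d}u}f^{\prime}(\mathrm{e}^{u})\Big|_{u=E_{1}}\,\mathrm{E}_{P_{\theta^{*}}}\tilde{h}_{1}(x)-\frac{\mathrm{d}}{\mathrm{d}u}f^{\#}(\mathrm{e}^{u})\Big|_{u=E_{2}}\,\mathrm{E}_{P_{\hat{\theta}}}\tilde{h}_{2}(x)=0.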

The desired result then follows because E1+th~1(x)[b,b]E_{1}+t\tilde{h}_{1}(x)\in[-b,b] and E2+th~2(x)[b,b]E_{2}+t\tilde{h}_{2}(x)\in[-b,b] for t[0,1]t\in[0,1] and hence κ(t)R33,b\kappa^{{\prime\prime}}(t)\leq R_{33,b} by the definition of R33,bR_{33,b}.

(ii) The inequality from (i) can be rewritten as

Kf(Pϵ,Pθ^;h)\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h)
f(eb)ϵ+{f(eEPθh(x))f#(eEPθh(x))}+f#(eEPθh(x))f#(eEPθ^h(x))12R33,b.\displaystyle\geq f^{\prime}(\mathrm{e}^{-b})\epsilon+\left\{f^{\prime}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)})\right\}+f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\hat{\theta}}}h(x)})-\frac{1}{2}R_{33,b}.

If $\mathrm{E}_{P_{\theta^{*}}}h(x)=0$, then the term in braces reduces to $f^{\prime}(1)-f^{\#}(1)$, which vanishes for the normalized $f$-divergences considered here, so that

Kf(Pϵ,Pθ^;h)f(eb)ϵ+f#(eEPθh(x))f#(eEPθ^h(x))12R33,b.\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h)\geq f^{\prime}(\mathrm{e}^{-b})\epsilon+f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\hat{\theta}}}h(x)})-\frac{1}{2}R_{33,b}.

Moreover, if EPθh(x)EPθ^h(x)0\mathrm{E}_{P_{\theta^{*}}}h(x)-\mathrm{E}_{P_{\hat{\theta}}}h(x)\geq 0, then

f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\theta^{*}}}h(x)})-f^{\#}(\mathrm{e}^{\mathrm{E}_{P_{\hat{\theta}}}h(x)})\geq R_{4,b}\left\{\mathrm{E}_{P_{\theta^{*}}}h(x)-\mathrm{E}_{P_{\hat{\theta}}}h(x)\right\}

by the mean value theorem and the definition of R4,bR_{4,b}. Combining the preceding two displays gives the desired result. ∎

Proposition S9.

Let b2>0b_{2}>0 be fixed and b2=b22pb_{2}^{\dagger}=b_{2}\sqrt{2p}. In the setting of Proposition S2, it holds with probability at least 12δ1-2\delta that for any γΓrp1\gamma\in\Gamma_{\mathrm{rp1}} with pen2(γ)=b2\mathrm{pen}_{2}(\gamma)=b_{2}, EPθhγ,μ^(x)=0\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, and EPθ^hγ,μ^(x)0\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\leq 0,

Kf(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
\geq R_{4,b_{2}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}+f^{\prime}(\mathrm{e}^{-b_{2}^{\dagger}})\epsilon-4C_{\mathrm{sg,12}}^{2}M_{2}b_{2}^{2}R_{3,b_{2}^{\dagger}}-b_{2}R_{2,b_{2}^{\dagger}}\lambda_{22},

where R3,b=R31,b+R32,bR_{3,b}=R_{31,b}+R_{32,b} as in Lemma S6 and, with Crad5=Csg,12Crad3C_{\mathrm{rad5}}=C_{\mathrm{sg,12}}C_{\mathrm{rad3}},

λ22=Crad516pn+2plog(δ1)n,\lambda_{22}=C_{\mathrm{rad5}}\sqrt{\frac{16p}{n}}+\sqrt{\frac{2p\log(\delta^{-1})}{n}},

depending on the universal constants Csg,12C_{\mathrm{sg,12}} and Crad3C_{\mathrm{rad3}} in Lemma S23 and Corollary S2.

Proof.

By definition, for any γΓrp1\gamma\in\Gamma_{\mathrm{rp1}}, hγ(x)h_{\gamma}(x) can be represented as hrp1,β,c(x)h_{\mathrm{rp1},\beta,c}(x) such that β0=γ0\beta_{0}=\gamma_{0} and pen2(β)=2pen2(γ)\mathrm{pen}_{2}(\beta)=\sqrt{2}\mathrm{pen}_{2}(\gamma):

hrp1,β,c(x)\displaystyle h_{\mathrm{rp1},\beta,c}(x) =β0+j=1pβ1jramp(xjcj)\displaystyle=\beta_{0}+\sum_{j=1}^{p}\beta_{1j}\mathrm{ramp}(x_{j}-c_{j})
=β0+β1Tφrp,c(x),\displaystyle=\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}\varphi_{\mathrm{rp},c}(x),

where $c=(c_{1},\ldots,c_{p})^{\mathrm{\scriptscriptstyle T}}$ with $c_{j}\in\{0,1\}$, $\beta=(\beta_{0},\beta_{1}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$ with $\beta_{1}=(\beta_{11},\ldots,\beta_{1p})^{\mathrm{\scriptscriptstyle T}}$, and $\varphi_{\mathrm{rp},c}(x):\mathbb{R}^{p}\to[0,1]^{p}$ denotes the vector of functions with the $j$th component $\mathrm{ramp}(x_{j}-c_{j})$ for $j=1,\ldots,p$. Then for any $\gamma\in\Gamma_{\mathrm{rp1}}$ with $\mathrm{pen}_{2}(\gamma)=b_{2}$, we have $\beta_{0}=\gamma_{0}$ and $\mathrm{pen}_{2}(\beta)=\sqrt{2}b_{2}$ correspondingly, and hence $h_{\gamma}(x)-\gamma_{0}=h_{\mathrm{rp1},\beta,c}(x)-\beta_{0}\in[-b_{2}\sqrt{2p},b_{2}\sqrt{2p}]$ by the Cauchy–Schwarz inequality and the boundedness of the ramp function in $[0,1]$. Moreover, $h_{\mathrm{rp1},\beta,c}(x)$ can be expressed in the form $\beta_{0}+\mathrm{pen}_{2}(\beta)w^{\mathrm{\scriptscriptstyle T}}g(x)$, where for $q=2p$, $w\in\mathbb{R}^{q}$ is an $L_{2}$ unit vector, and $g:\mathbb{R}^{p}\to[0,1]^{q}$ is a vector of functions, including $\mathrm{ramp}(x_{j})$ and $\mathrm{ramp}(x_{j}-1)$ for $j=1,\ldots,p$. Parenthetically, at most one of the coefficients in $w$ associated with $\mathrm{ramp}(x_{j})$ and $\mathrm{ramp}(x_{j}-1)$ is nonzero for each $j$, although this property is not used in the subsequent discussion.

For any γΓrp1\gamma\in\Gamma_{\mathrm{rp1}} with pen2(γ)=b2\mathrm{pen}_{2}(\gamma)=b_{2} and EPθhγ,μ^(x)=0\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, the function hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) can be expressed as

hγ,μ^(x)=β1T{φrp,c(xμ^)β01},\displaystyle h_{\gamma,\hat{\mu}}(x)=\beta_{1}^{\mathrm{\scriptscriptstyle T}}\left\{\varphi_{\mathrm{rp},c}(x-\hat{\mu})-\beta_{01}\right\},

where β01=EPθφrp,c(xμ^)\beta_{01}=\mathrm{E}_{P_{\theta^{*}}}\varphi_{\mathrm{rp},c}(x-\hat{\mu}). The mean-centered ramp functions in φrp,c(xμ^)β01\varphi_{\mathrm{rp},c}(x-\hat{\mu})-\beta_{01} are bounded between [1,1][-1,1], and hence hγ,μ^(x)[b22p,b22p]h_{\gamma,\hat{\mu}}(x)\in[-b_{2}\sqrt{2p},b_{2}\sqrt{2p}] similarly as above. Moreover, such hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) can be expressed in the form pen2(β)wT{g(xμ^)η0}\mathrm{pen}_{2}(\beta)w^{\mathrm{\scriptscriptstyle T}}\{g(x-\hat{\mu})-\eta_{0}\}, where wqw\in\mathbb{R}^{q} is an L2L_{2} unit vector, g(x):p[0,1]qg(x):\mathbb{R}^{p}\to[0,1]^{q} is defined as above, and η0=EPθg(xμ^)[0,1]q\eta_{0}=\mathrm{E}_{P_{\theta^{*}}}g(x-\hat{\mu})\in[0,1]^{q} by the boundedness of the ramp function in [0,1][0,1].

Next, Kf(Pn,Pθ^;hγ,μ^)K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) can be bounded as

Kf(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
Kf(Pϵ,Pθ^;hγ,μ^){Kf(Pn,Pθ^;hγ,μ^)Kf(Pϵ,Pθ^;hγ,μ^)}.\displaystyle\geq K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}. (S77)

For any γΓrp1\gamma\in\Gamma_{\mathrm{rp1}} with pen2(γ)=b2\mathrm{pen}_{2}(\gamma)=b_{2}, EPθhγ,μ^(x)=0E_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, and EPθ^hγ,μ^(x)0E_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\leq 0, applying Lemma S11(ii) with h=hγ,μ^h=h_{\gamma,\hat{\mu}} and b=b2=b22pb=b_{2}^{\dagger}=b_{2}\sqrt{2p} yields

Kf(Pϵ,Pθ^;hγ,μ^)f(eb2)ϵ\displaystyle K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq f^{\prime}(\mathrm{e}^{-b_{2}^{\dagger}})\epsilon
+R_{4,b_{2}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\frac{1}{2}\left\{R_{31,b_{2}^{\dagger}}\mathrm{Var}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)+R_{32,b_{2}^{\dagger}}\mathrm{Var}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}.

By Lemma S18(i), $\mathrm{Var}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)$ can be bounded as follows:

VarPθhγ,μ^(x)=VarPθβ1Tφrp,c(xμ^)\displaystyle\quad\mathrm{Var}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=\mathrm{Var}_{P_{\theta^{*}}}\beta_{1}^{\mathrm{\scriptscriptstyle T}}\varphi_{\mathrm{rp},c}(x-\hat{\mu})
\leq\|\beta_{1}\|_{2}^{2}\cdot 2C_{\mathrm{sg,12}}^{2}(\sqrt{2})^{2}\|\Sigma^{*}\|_{\mathrm{op}}\leq 4\mathrm{pen}_{2}^{2}(\beta)C_{\mathrm{sg,12}}^{2}M_{2}.

Similarly, $\mathrm{Var}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)$ can also be bounded by $4\mathrm{pen}_{2}^{2}(\beta)C_{\mathrm{sg,12}}^{2}M_{2}$, because $\|\hat{\Sigma}\|_{\mathrm{op}}\leq M_{2}$. Hence $K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})$ can be bounded as

Kf(Pϵ,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
\geq f^{\prime}(\mathrm{e}^{-b_{2}^{\dagger}})\epsilon+R_{4,b_{2}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-4C_{\mathrm{sg,12}}^{2}M_{2}b_{2}^{2}R_{3,b_{2}^{\dagger}}, (S78)

where R3,b=R31,b+R32,bR_{3,b}=R_{31,b}+R_{32,b} as in Lemma S6. Moreover, by Lemma S5 with b=2b2b=\sqrt{2}b_{2} and g(x)[0,1]qg(x)\in[0,1]^{q} defined above, it holds with probability at least 12δ1-2\delta that for any γΓrp1\gamma\in\Gamma_{\mathrm{rp1}} with γ0=0\gamma_{0}=0 and pen2(γ)=b2\mathrm{pen}_{2}(\gamma)=b_{2},

{Kf(Pn,Pθ^;hγ,μ^)Kf(Pϵ,Pθ^;hγ,μ^)}\displaystyle\quad\{K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}
b2R2,b2q{Csg,122qVgn+qlog(δ1)n}\displaystyle\leq b_{2}R_{2,b_{2}\sqrt{q}}\left\{C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}+\sqrt{\frac{q\log(\delta^{-1})}{n}}\right\}
b2R2,b22p{Csg,12Crad316pn+2plog(δ1)n},\displaystyle\leq b_{2}R_{2,b_{2}\sqrt{2p}}\left\{C_{\mathrm{sg,12}}C_{\mathrm{rad3}}\sqrt{\frac{16p}{n}}+\sqrt{\frac{2p\log(\delta^{-1})}{n}}\right\}, (S79)

where Vg=4Crad32V_{g}=4C_{\mathrm{rad3}}^{2} is determined in Lemma S5 as follows. For j=1,,qj=1,\ldots,q, consider the function class 𝒢j={gμ,η0,j:μp,η0[0,1]q}\mathcal{G}_{j}=\{g_{\mu,\eta_{0},j}:\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}\}, where μ=(μ1,,μp)T\mu=(\mu_{1},\ldots,\mu_{p})^{\mathrm{\scriptscriptstyle T}}, η0=(η01,,η0q)T\eta_{0}=(\eta_{01},\ldots,\eta_{0q})^{\mathrm{\scriptscriptstyle T}}, and, as defined in Lemma S5, gμ,η0,j(x)g_{\mu,\eta_{0},j}(x) is of the form ramp(xj1μj1)η0j\mathrm{ramp}(x_{j_{1}}-\mu_{j_{1}})-\eta_{0j} or ramp(xj1μj11)η0j\mathrm{ramp}(x_{j_{1}}-\mu_{j_{1}}-1)-\eta_{0j}. By Lemma S16, the VC index of moving-knot ramp functions is 2. By Lemma S17, the VC index of constant functions is also 2. By applying Corollary S2 (ii) with vanishing 𝒢\mathcal{G}, we obtain that conditionally on (X1,,Xn)(X_{1},\ldots,X_{n}), the random variable Zn,j=supμp,η0[0,1]q|n1i=1nϵigμ,η0,j(Xi)|=supfj𝒢j|n1i=1nϵifj(Xi)|Z_{n,j}=\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0},j}(X_{i})|=\sup_{f_{j}\in\mathcal{G}_{j}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}f_{j}(X_{i})| is sub-gaussian with tail parameter Crad34/nC_{\mathrm{rad3}}\sqrt{4/n} for j=1,,qj=1,\ldots,q.
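For completeness, the constants in (S79) follow from the arithmetic

C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}=C_{\mathrm{sg,12}}\sqrt{\frac{2\cdot 2p\cdot 4C_{\mathrm{rad3}}^{2}}{n}}=C_{\mathrm{sg,12}}C_{\mathrm{rad3}}\sqrt{\frac{16p}{n}}=C_{\mathrm{rad5}}\sqrt{\frac{16p}{n}},

with $q=2p$ and $V_{g}=4C_{\mathrm{rad3}}^{2}$, which matches the definition of $\lambda_{22}$ in the statement.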

Combining the inequalities (S77)–(S79) leads to the desired result. ∎

Proposition S10.

In the setting of Proposition S2 or Proposition S3, suppose that for a(0,1/2)a\in(0,1/2),

D=defsupγΓrp1,pen2(γ)=1/2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}a.\displaystyle D\stackrel{{\scriptstyle\text{def}}}{{=}}\sup_{\gamma\in\Gamma_{\mathrm{rp1}},\mathrm{pen}_{2}(\gamma)=\sqrt{1/2}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\leq a. (S80)

Then we have

μ^μ2S4,aD,\displaystyle\|\hat{\mu}-\mu^{*}\|_{2}\leq S_{4,a}D, (S81)
σ^σ2S5(D+μ^μ2/2)S6,aD,\displaystyle\|\hat{\sigma}-\sigma^{*}\|_{2}\leq S_{5}\left(D+\|\hat{\mu}-\mu^{*}\|_{2}/2\right)\leq S_{6,a}D, (S82)

where S4,aS_{4,a}, S5S_{5}, and S6,aS_{6,a} are defined as in Lemma S7 and Remark S6 with M=M2M=M_{2}.

Proof.

For any γΓrp1\gamma\in\Gamma_{\mathrm{rp1}} with pen2(γ)=1/2\mathrm{pen}_{2}(\gamma)=\sqrt{1/2}, the function hγ(x)h_{\gamma}(x) can be obtained as hrp1,β,c(x)h_{\mathrm{rp1},\beta,c}(x) with pen2(β)=1\mathrm{pen}_{2}(\beta)=1. For j=1,,pj=1,\ldots,p, we restrict hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) in (S80) such that hγ(x)h_{\gamma}(x) is a ramp function of xjx_{j}, in the form ±ramp(xjc)\pm\mathrm{ramp}(x_{j}-c) for c{0,1}c\in\{0,1\}. Applying Lemma S7 shows that there exists hj(1)(xj)h_{j}^{(1)}(x_{j}) in the form ±ramp(xj)\pm\mathrm{ramp}(x_{j}) and hj(2)(xj)h_{j}^{(2)}(x_{j}) in the form ±ramp(xj1)\pm\mathrm{ramp}(x_{j}-1) such that

|μ^jμj|S4,a{EPθhj(1)(xjμ^j)EPθ^hj(1)(xjμ^j)},\displaystyle|\hat{\mu}_{j}-\mu^{*}_{j}|\leq S_{4,a}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{j}^{(1)}(x_{j}-\hat{\mu}_{j})-\mathrm{E}_{P_{\hat{\theta}}}h_{j}^{(1)}(x_{j}-\hat{\mu}_{j})\right\}, (S83)
|σ^jσj|S5{EPθhj(2)(xjμ^j)EPθ^hj(2)(xjμ^j)}+S5|μ^jμj|/2.\displaystyle|\hat{\sigma}_{j}-\sigma^{*}_{j}|\leq S_{5}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{j}^{(2)}(x_{j}-\hat{\mu}_{j})-\mathrm{E}_{P_{\hat{\theta}}}h_{j}^{(2)}(x_{j}-\hat{\mu}_{j})\right\}+S_{5}|\hat{\mu}_{j}-\mu^{*}_{j}|/2. (S84)

From (S83), we have that for any L2L_{2} unit vector w=(w1,,wp)Tw=(w_{1},\ldots,w_{p})^{\mathrm{\scriptscriptstyle T}},

\sum_{j=1}^{p}|w_{j}(\hat{\mu}_{j}-\mu_{j}^{*})| \leq S_{4,a}\sum_{j=1}^{p}|w_{j}|\left\{\mathrm{E}_{P_{\theta^{*}}}h_{j}^{(1)}(x_{j}-\hat{\mu}_{j})-\mathrm{E}_{P_{\hat{\theta}}}h_{j}^{(1)}(x_{j}-\hat{\mu}_{j})\right\}
=S4,a{EPθh(1)(xμ^)EPθ^h(1)(xμ^)},\displaystyle=S_{4,a}\left\{\mathrm{E}_{P_{\theta^{*}}}h^{(1)}(x-\hat{\mu})-\mathrm{E}_{P_{\hat{\theta}}}h^{(1)}(x-\hat{\mu})\right\},

where h(1)(x)=j=1p|wj|hj(1)(xj)h^{(1)}(x)=\sum_{j=1}^{p}|w_{j}|h_{j}^{(1)}(x_{j}). In fact, h(1)(x)h^{(1)}(x) can be expressed as hrp1,β,c(x)h_{\mathrm{rp1},\beta,c}(x) such that c=(0,,0)Tc=(0,\ldots,0)^{\mathrm{\scriptscriptstyle T}} and each component in β1\beta_{1} is either |wj||w_{j}| or |wj|-|w_{j}| for j=1,,pj=1,\ldots,p, which implies that pen2(β)=w2=1\mathrm{pen}_{2}(\beta)=\|w\|_{2}=1. Hence by the definition of DD, we obtain (S81).
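Concretely, assuming $\hat{\mu}\neq\mu^{*}$, the supremum over $L_{2}$ unit vectors is attained at $w_{j}=|\hat{\mu}_{j}-\mu_{j}^{*}|/\|\hat{\mu}-\mu^{*}\|_{2}$, which gives

\sum_{j=1}^{p}|w_{j}(\hat{\mu}_{j}-\mu_{j}^{*})|=\frac{\sum_{j=1}^{p}(\hat{\mu}_{j}-\mu_{j}^{*})^{2}}{\|\hat{\mu}-\mu^{*}\|_{2}}=\|\hat{\mu}-\mu^{*}\|_{2}.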

Similarly, from (S84), we have that for any L2L_{2} unit vector w=(w1,,wp)Tw=(w_{1},\ldots,w_{p})^{\mathrm{\scriptscriptstyle T}},

\quad\sum_{j=1}^{p}|w_{j}(\hat{\sigma}_{j}-\sigma_{j}^{*})|
\leq S_{5}\sum_{j=1}^{p}|w_{j}|\left\{\mathrm{E}_{P_{\theta^{*}}}h_{j}^{(2)}(x_{j}-\hat{\mu}_{j})-\mathrm{E}_{P_{\hat{\theta}}}h_{j}^{(2)}(x_{j}-\hat{\mu}_{j})\right\}+S_{5}\sum_{j=1}^{p}|w_{j}(\hat{\mu}_{j}-\mu_{j}^{*})|/2
\leq S_{5}\left\{\mathrm{E}_{P_{\theta^{*}}}h^{(2)}(x-\hat{\mu})-\mathrm{E}_{P_{\hat{\theta}}}h^{(2)}(x-\hat{\mu})\right\}+S_{5}\|\hat{\mu}-\mu^{*}\|_{2}/2,

where $h^{(2)}(x)=\sum_{j=1}^{p}|w_{j}|h_{j}^{(2)}(x_{j})$, which can be expressed in the form $h_{\mathrm{rp1},\beta,c}(x)$ with $c=(1,\ldots,1)^{\mathrm{\scriptscriptstyle T}}$ and $\mathrm{pen}_{2}(\beta)=\|w\|_{2}=1$, and the last step uses the Cauchy–Schwarz inequality $\sum_{j=1}^{p}|w_{j}(\hat{\mu}_{j}-\mu_{j}^{*})|\leq\|w\|_{2}\|\hat{\mu}-\mu^{*}\|_{2}=\|\hat{\mu}-\mu^{*}\|_{2}$. Hence by the definition of $D$, we obtain (S82). ∎

Proposition S11.

Let b3>0b_{3}>0 be fixed and b3=b3p(p1)b_{3}^{\dagger}=b_{3}\sqrt{p(p-1)}. In the setting of Proposition S2, it holds with probability at least 12δ1-2\delta that for any γΓrp2\gamma\in\Gamma_{\mathrm{rp2}} with pen2(γ)=b3\mathrm{pen}_{2}(\gamma)=b_{3}, EPθhγ,μ^(x)=0\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, and EPθ^hγ,μ^(x)0\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\leq 0,

Kf(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
\geq R_{4,2b_{3}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}+f^{\prime}(\mathrm{e}^{-2b_{3}^{\dagger}})\epsilon-(80C_{\mathrm{sg,12}}^{2}M_{2})pb_{3}^{2}R_{3,2b_{3}^{\dagger}}-\sqrt{p}b_{3}R_{2,b_{3}^{\dagger}}\lambda_{32},

where R3,b=R31,b+R32,bR_{3,b}=R_{31,b}+R_{32,b} as in Lemma S6 and, with Crad5=Csg,12Crad3C_{\mathrm{rad5}}=C_{\mathrm{sg,12}}C_{\mathrm{rad3}},

λ32=Crad512(p1)n+(p1)log(δ1)n,\lambda_{32}={C_{\mathrm{rad5}}}\sqrt{\frac{12(p-1)}{n}}+\sqrt{\frac{(p-1)\log(\delta^{-1})}{n}},

depending on the universal constants Csg,12C_{\mathrm{sg,12}} and Crad3C_{\mathrm{rad3}} in Lemma S23 and Corollary S2.

Proof.

By definition, for any γΓrp2\gamma\in\Gamma_{\mathrm{rp2}}, hγ(x)h_{\gamma}(x) can be represented as hrp2,β(x)h_{\mathrm{rp2},\beta}(x) such that β0=γ0\beta_{0}=\gamma_{0} and pen2(β)=2pen2(γ)\mathrm{pen}_{2}(\beta)=2\mathrm{pen}_{2}(\gamma), where

hrp2,β(x)\displaystyle h_{\mathrm{rp2},\beta}(x) =β0+1ijpβ2,ijramp(xi)ramp(xj)\displaystyle=\beta_{0}+\sum_{1\leq i\not=j\leq p}\beta_{2,ij}\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j})
=β0+β2Tvec(φrp(x)φrp(x)),\displaystyle=\beta_{0}+\beta_{2}^{\mathrm{\scriptscriptstyle T}}\mathrm{vec}(\varphi_{\mathrm{rp}}(x)\otimes\varphi_{\mathrm{rp}}(x)),

where $\beta=(\beta_{0},\beta_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$ with $\beta_{2}=(\beta_{2,ij}:1\leq i\not=j\leq p)^{\mathrm{\scriptscriptstyle T}}$, and $\varphi_{\mathrm{rp}}(x):\mathbb{R}^{p}\to[0,1]^{p}$ denotes the vector of functions with the $j$th component $\mathrm{ramp}(x_{j})$ for $j=1,\ldots,p$. Then for any $\gamma\in\Gamma_{\mathrm{rp2}}$ with $\mathrm{pen}_{2}(\gamma)=b_{3}$, we have $\beta_{0}=\gamma_{0}$ and $\mathrm{pen}_{2}(\beta)=2b_{3}$ correspondingly, and hence $h_{\gamma}(x)-\gamma_{0}=h_{\mathrm{rp2},\beta}(x)-\beta_{0}\in[-2b_{3}\sqrt{p(p-1)},2b_{3}\sqrt{p(p-1)}]$, by the boundedness of the ramp function in $[0,1]$ and the Cauchy–Schwarz inequality, $\|\beta_{2}\|_{1}\leq\sqrt{p(p-1)}\|\beta_{2}\|_{2}$. Moreover, $h_{\mathrm{rp2},\beta}(x)$ can be expressed in the form $\beta_{0}+\mathrm{pen}_{2}(\beta)w^{\mathrm{\scriptscriptstyle T}}g(x)$, where for $q=p(p-1)$, $w\in\mathbb{R}^{q}$ is an $L_{2}$ unit vector, and $g:\mathbb{R}^{p}\to[0,1]^{q}$ is a vector of functions, including $\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j})$ for $1\leq i\not=j\leq p$. For symmetry, $\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j})$ and $\mathrm{ramp}(x_{j})\mathrm{ramp}(x_{i})$ are included as two distinct components in $g$, and the corresponding coefficients are assumed to be identical to each other in $w$.

For any γΓrp2\gamma\in\Gamma_{\mathrm{rp2}} with pen2(γ)=b3\mathrm{pen}_{2}(\gamma)=b_{3} and EPθhγ,μ^(x)=0\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, the function hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) can be expressed as

h_{\gamma,\hat{\mu}}(x)=\beta_{2}^{\mathrm{\scriptscriptstyle T}}\left\{\mathrm{vec}(\varphi_{\mathrm{rp}}(x-\hat{\mu})\otimes\varphi_{\mathrm{rp}}(x-\hat{\mu}))-\beta_{02}\right\},

where $\beta_{02}=\mathrm{E}_{P_{\theta^{*}}}\mathrm{vec}(\varphi_{\mathrm{rp}}(x-\hat{\mu})\otimes\varphi_{\mathrm{rp}}(x-\hat{\mu}))$. The mean-centered products of ramp functions in $\mathrm{vec}(\varphi_{\mathrm{rp}}(x-\hat{\mu})\otimes\varphi_{\mathrm{rp}}(x-\hat{\mu}))-\beta_{02}$ are bounded between $[-1,1]$, and hence $h_{\gamma,\hat{\mu}}(x)\in[-2b_{3}\sqrt{p(p-1)},2b_{3}\sqrt{p(p-1)}]$ similarly as above. Moreover, such $h_{\gamma,\hat{\mu}}(x)$ can be expressed in the form $\mathrm{pen}_{2}(\beta)w^{\mathrm{\scriptscriptstyle T}}\{g(x-\hat{\mu})-\eta_{0}\}$, where $w\in\mathbb{R}^{q}$ is an $L_{2}$ unit vector, $g(x):\mathbb{R}^{p}\to[0,1]^{q}$ is defined as above, and $\eta_{0}=\mathrm{E}_{P_{\theta^{*}}}g(x-\hat{\mu})\in[0,1]^{q}$ by the boundedness of the ramp function in $[0,1]$.

Next, Kf(Pn,Pθ^;hγ,μ^)K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) can be bounded as

Kf(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
Kf(Pϵ,Pθ^;hγ,μ^)|Kf(Pn,Pθ^;hγ,μ^)Kf(Pϵ,Pθ^;hγ,μ^)|.\displaystyle\geq K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-|K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})|. (S85)

For any γΓrp2\gamma\in\Gamma_{\mathrm{rp2}} with pen2(γ)=b3\mathrm{pen}_{2}(\gamma)=b_{3}, EPθhγ,μ^(x)=0E_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, and EPθ^hγ,μ^(x)0E_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\leq 0, applying Lemma S11(ii) with h=hγ,μ^h=h_{\gamma,\hat{\mu}} and b=2b3=2b3p(p1)b=2b_{3}^{\dagger}=2b_{3}\sqrt{p(p-1)} yields

Kf(Pϵ,Pθ^;hγ,μ^)f(e2b3)ϵ\displaystyle K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\geq f^{\prime}(\mathrm{e}^{-2b_{3}^{\dagger}})\epsilon
+R_{4,2b_{3}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\frac{1}{2}\left\{R_{31,2b_{3}^{\dagger}}\mathrm{Var}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)+R_{32,2b_{3}^{\dagger}}\mathrm{Var}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}.

By Lemma S19(ii), $\mathrm{Var}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)$ can be bounded as follows:

VarPθhγ,μ^(x)\displaystyle\quad\mathrm{Var}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)
=VarPθβ2Tvec(φrp(xμ^)φrp(xμ^))\displaystyle=\mathrm{Var}_{P_{\theta^{*}}}\beta_{2}^{\mathrm{\scriptscriptstyle T}}\mathrm{vec}(\varphi_{\mathrm{rp}}(x-\hat{\mu})\otimes\varphi_{\mathrm{rp}}(x-\hat{\mu}))
β22220Csg,122(2)2pΣop\displaystyle\leq\|\beta_{2}\|_{2}^{2}\cdot 20C_{\mathrm{sg,12}}^{2}(\sqrt{2})^{2}p\|\Sigma^{*}\|_{\mathrm{op}}
40pen22(β)Csg,122pM2.\displaystyle\leq 40\mathrm{pen}_{2}^{2}(\beta)C_{\mathrm{sg,12}}^{2}pM_{2}.

Similarly, $\mathrm{Var}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)$ can also be bounded by $40\mathrm{pen}_{2}^{2}(\beta)C_{\mathrm{sg,12}}^{2}pM_{2}$, because $\|\hat{\Sigma}\|_{\mathrm{op}}\leq M_{2}$. Hence, with $\mathrm{pen}_{2}(\beta)=2b_{3}$, $K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})$ can be bounded as

Kf(Pϵ,Pθ^;hγ,μ^)\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
\geq f^{\prime}(\mathrm{e}^{-2b_{3}^{\dagger}})\epsilon+R_{4,2b_{3}^{\dagger}}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-80C_{\mathrm{sg,12}}^{2}pM_{2}b_{3}^{2}R_{3,2b_{3}^{\dagger}}, (S86)

where R3,b=R31,b+R32,bR_{3,b}=R_{31,b}+R_{32,b} as in Lemma S6. Moreover, by Lemma S5 with b=2b3b=2b_{3} and g(x)[0,1]qg(x)\in[0,1]^{q} defined above, it holds with probability at least 12δ1-2\delta that for any γΓrp2\gamma\in\Gamma_{\mathrm{rp2}} with γ0=0\gamma_{0}=0 and pen2(γ)=b3\mathrm{pen}_{2}(\gamma)=b_{3},

|Kf(Pn,Pθ^;hγ,μ^)Kf(Pϵ,Pθ^;hγ,μ^)|\displaystyle\quad|K_{f}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})|
b3R2,b3q{Csg,122qVgn+qlog(δ1)n}\displaystyle\leq b_{3}R_{2,b_{3}\sqrt{q}}\left\{C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}+\sqrt{\frac{q\log(\delta^{-1})}{n}}\right\}
=b3R2,b3{Csg,12Crad312p(p1)n+p(p1)log(δ1)n},\displaystyle=b_{3}R_{2,b_{3}^{\dagger}}\left\{C_{\mathrm{sg,12}}C_{\mathrm{rad3}}\sqrt{\frac{12p(p-1)}{n}}+\sqrt{\frac{p(p-1)\log(\delta^{-1})}{n}}\right\}, (S87)

where Vg=6Crad32V_{g}=6C_{\mathrm{rad3}}^{2} is determined in Lemma S5 as follows. For j=1,,qj=1,\ldots,q, consider the function class 𝒢j={gμ,η0,j:μp,η0[0,1]q}\mathcal{G}_{j}=\{g_{\mu,\eta_{0},j}:\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}\}, where μ=(μ1,,μp)T\mu=(\mu_{1},\ldots,\mu_{p})^{\mathrm{\scriptscriptstyle T}}, η0=(η01,,η0q)T\eta_{0}=(\eta_{01},\ldots,\eta_{0q})^{\mathrm{\scriptscriptstyle T}}, and, as defined in Lemma S5, gμ,η0,j(x)g_{\mu,\eta_{0},j}(x) is of the form ramp(xj1μj1)ramp(xj2μj2)η0j\mathrm{ramp}(x_{j_{1}}-\mu_{j_{1}})\mathrm{ramp}(x_{j_{2}}-\mu_{j_{2}})-\eta_{0j} for 1j1j2p1\leq j_{1}\not=j_{2}\leq p. By Lemma S16, the VC index of moving-knot ramp functions is 2. By Lemma S17, the VC index of constant functions is also 2. By applying Corollary S2 (ii), we obtain that conditionally on (X1,,Xn)(X_{1},\ldots,X_{n}), the random variable Zn,j=supμp,η0[0,1]q|n1i=1nϵigμ,η0,j(Xi)|=supfj𝒢j|n1i=1nϵifj(Xi)|Z_{n,j}=\sup_{\mu\in\mathbb{R}^{p},\eta_{0}\in[0,1]^{q}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}g_{\mu,\eta_{0},j}(X_{i})|=\sup_{f_{j}\in\mathcal{G}_{j}}|n^{-1}\sum_{i=1}^{n}\epsilon_{i}f_{j}(X_{i})| is sub-gaussian with tail parameter Crad36/nC_{\mathrm{rad3}}\sqrt{6/n} for j=1,,qj=1,\ldots,q.
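Similarly as before, the constants in (S87) follow from

C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}=C_{\mathrm{sg,12}}\sqrt{\frac{2\cdot p(p-1)\cdot 6C_{\mathrm{rad3}}^{2}}{n}}=C_{\mathrm{rad5}}\sqrt{\frac{12p(p-1)}{n}}=\sqrt{p}\,C_{\mathrm{rad5}}\sqrt{\frac{12(p-1)}{n}},

with $q=p(p-1)$ and $V_{g}=6C_{\mathrm{rad3}}^{2}$; factoring out $\sqrt{p}$ yields the form of $\lambda_{32}$ in the statement.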

Combining the inequalities (S85)–(S87) leads to the desired result. ∎

Proposition S12.

In the setting of Proposition S2 or Proposition S3, denote

D=supγΓrp2,pen2(γ)=1/2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}.\displaystyle D=\sup_{\gamma\in\Gamma_{\mathrm{rp2}},\mathrm{pen}_{2}(\gamma)=1/2}\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}.

Then we have

Σ^ΣF2M21/2pσ^σ2+S7(2pΔμ^,σ^+D),\displaystyle\|\hat{\Sigma}-\Sigma^{*}\|_{\mathrm{F}}\leq 2M_{2}^{1/2}\sqrt{p}\|\hat{\sigma}-\sigma^{*}\|_{2}+S_{7}(\sqrt{2p}\Delta_{\hat{\mu},\hat{\sigma}}+D),

where Δμ^,σ^=(μ^μ22+σ^σ22)1/2\Delta_{\hat{\mu},\hat{\sigma}}=(\|\hat{\mu}-\mu^{*}\|_{2}^{2}+\|\hat{\sigma}-\sigma^{*}\|_{2}^{2})^{1/2} and S7S_{7} is defined as in Lemma S8 with M=M2M=M_{2}.

Proof.

For any γΓrp2\gamma\in\Gamma_{\mathrm{rp2}} with pen2(γ)=1/2\mathrm{pen}_{2}(\gamma)=1/2, the function hγ(x)rp2h_{\gamma}(x)\in\mathcal{H}_{\mathrm{rp2}} can be obtained as hrp2,β(x)h_{\mathrm{rp2},\beta}(x) with pen2(β)=1\mathrm{pen}_{2}(\beta)=1. First, we handle the effect of different means and standard deviations between PθP_{\theta^{*}} and Pθ^P_{\hat{\theta}} in DD. Denote

D=supγΓrp2,pen2(γ)=1/2{EPθhγ,μ^(x)EPθ^hγ,μ^(x)},\displaystyle D^{\dagger}=\sup_{\gamma\in\Gamma_{\mathrm{rp2}},\mathrm{pen}_{2}(\gamma)=1/2}\left\{\mathrm{E}_{P_{\theta^{\dagger}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\},

where θ=(μ^,diag(σ^)Σ0diag(σ^))\theta^{\dagger}=(\hat{\mu},\mathrm{diag}(\hat{\sigma})\Sigma_{0}^{*}\mathrm{diag}(\hat{\sigma})) and Σ0\Sigma_{0}^{*} is defined as the correlation matrix such that Σ=diag(σ)Σ0diag(σ)\Sigma^{*}=\mathrm{diag}(\sigma^{*})\Sigma_{0}^{*}\mathrm{diag}(\sigma^{*}). Then DD^{\dagger} can be related to DD as follows:

DD+2pΔμ^,σ^,\displaystyle D^{\dagger}\leq D+\sqrt{2p}\Delta_{\hat{\mu},\hat{\sigma}}, (S88)

where $\Delta_{\hat{\mu},\hat{\sigma}}=(\|\hat{\mu}-\mu^{*}\|_{2}^{2}+\|\hat{\sigma}-\sigma^{*}\|_{2}^{2})^{1/2}$. In fact, by Lemma S21 with $g(x)$ set to $\varphi_{\mathrm{rp}}(x)$, which is ($1/2$)-Lipschitz and componentwise bounded in $[0,1]$, we have that for any $\gamma\in\Gamma_{\mathrm{rp2}}$ with $\mathrm{pen}_{2}(\gamma)=1/2$,

|EPθhγ,μ^(x)EPθhγ,μ^(x)|2pΔμ^,σ^.\displaystyle\left|\mathrm{E}_{P_{\theta^{\dagger}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)\right|\leq\sqrt{2p}\Delta_{\hat{\mu},\hat{\sigma}}.

For each pair 1ijp1\leq i\not=j\leq p, we restrict hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) such that hγ(x)h_{\gamma}(x) is ramp(xi)ramp(xj)\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}) or ramp(xi)ramp(xj)-\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}). Applying Lemma S8 shows that there exists rij{1,1}r_{ij}\in\{-1,1\} such that

|ρ^ijρij|σ^iσ^jS7rij{EPθhij(xμ^)EPθ^hij(xμ^)},\displaystyle|\hat{\rho}_{ij}-\rho^{*}_{ij}|\hat{\sigma}_{i}\hat{\sigma}_{j}\leq S_{7}r_{ij}\left\{\mathrm{E}_{P_{\theta^{\dagger}}}h_{ij}(x-\hat{\mu})-\mathrm{E}_{P_{\hat{\theta}}}h_{ij}(x-\hat{\mu})\right\},

where hij(x)=ramp(xi)ramp(xj)h_{ij}(x)=\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}). By the triangle inequality, we have

|ρ^ijσ^iσ^jρijσiσj|\displaystyle\quad|\hat{\rho}_{ij}\hat{\sigma}_{i}\hat{\sigma}_{j}-\rho^{*}_{ij}\sigma_{i}^{*}\sigma_{j}^{*}|
\leq\hat{\sigma}_{i}|\hat{\sigma}_{j}-\sigma_{j}^{*}|+\sigma_{j}^{*}|\hat{\sigma}_{i}-\sigma_{i}^{*}|+|\hat{\rho}_{ij}-\rho^{*}_{ij}|\hat{\sigma}_{i}\hat{\sigma}_{j}
M21/2|σ^iσi|+M21/2|σ^jσj|+S7rij{EPθhij(xμ^)EPθ^hij(xμ^)}.\displaystyle\leq M_{2}^{1/2}|\hat{\sigma}_{i}-\sigma_{i}^{*}|+M_{2}^{1/2}|\hat{\sigma}_{j}-\sigma_{j}^{*}|+S_{7}r_{ij}\left\{\mathrm{E}_{P_{\theta^{\dagger}}}h_{ij}(x-\hat{\mu})-\mathrm{E}_{P_{\hat{\theta}}}h_{ij}(x-\hat{\mu})\right\}.

In addition, we have |σ^i2σi2|=|(σ^i+σi)(σ^iσi)|2M21/2|σ^iσi||\hat{\sigma}_{i}^{2}-{\sigma_{i}^{*}}^{2}|=|(\hat{\sigma}_{i}+\sigma_{i}^{*})(\hat{\sigma}_{i}-\sigma_{i}^{*})|\leq 2M_{2}^{1/2}|\hat{\sigma}_{i}-\sigma_{i}^{*}|. Then for any L2L_{2} unit vector w=(wij:1i,jp)Tp×pw=(w_{ij}:1\leq i,j\leq p)^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{p\times p},

\quad\sum_{i=1}^{p}\left|w_{ii}(\hat{\sigma}_{i}^{2}-{\sigma_{i}^{*}}^{2})\right|+\sum_{1\leq i\not=j\leq p}\left|w_{ij}(\hat{\rho}_{ij}\hat{\sigma}_{i}\hat{\sigma}_{j}-\rho^{*}_{ij}\sigma_{i}^{*}\sigma_{j}^{*})\right|
2M21/2i=1p|wii||σ^iσi|+M21/21ijp|wij|(|σ^iσi|+|σ^jσj|)\displaystyle\leq 2M_{2}^{1/2}\sum_{i=1}^{p}|w_{ii}||\hat{\sigma}_{i}-\sigma_{i}^{*}|+M_{2}^{1/2}\sum_{1\leq i\not=j\leq p}|w_{ij}|\left(|\hat{\sigma}_{i}-\sigma_{i}^{*}|+|\hat{\sigma}_{j}-\sigma_{j}^{*}|\right)
+S71ijp|wij|rij{EPθhij(xμ^)EPθ^hij(xμ^)}\displaystyle\quad+S_{7}\sum_{1\leq i\not=j\leq p}|w_{ij}|r_{ij}\left\{\mathrm{E}_{P_{\theta^{\dagger}}}h_{ij}(x-\hat{\mu})-\mathrm{E}_{P_{\hat{\theta}}}h_{ij}(x-\hat{\mu})\right\}
=M21/21i,jp|wij|(|σ^iσi|+|σ^jσj|)+S7{EPθh(xμ^)EPθ^h(xμ^)},\displaystyle=M_{2}^{1/2}\sum_{1\leq i,j\leq p}|w_{ij}|\left(|\hat{\sigma}_{i}-\sigma_{i}^{*}|+|\hat{\sigma}_{j}-\sigma_{j}^{*}|\right)+S_{7}\left\{\mathrm{E}_{P_{\theta^{\dagger}}}h(x-\hat{\mu})-\mathrm{E}_{P_{\hat{\theta}}}h(x-\hat{\mu})\right\},

where $h(x)=\sum_{1\leq i\not=j\leq p}|w_{ij}|r_{ij}h_{ij}(x)$. The function $h(x)$ can be expressed as $h_{\mathrm{rp2},\beta}(x)$ such that $\beta_{2,ii}=0$ for $i=1,\ldots,p$ and $\beta_{2,ij}=|w_{ij}|r_{ij}$ for $1\leq i\not=j\leq p$, and hence $\mathrm{pen}_{2}(\beta)\leq\|w\|_{2}=1$. By the definition of $D^{\dagger}$, we have $\mathrm{E}_{P_{\theta^{\dagger}}}h(x-\hat{\mu})-\mathrm{E}_{P_{\hat{\theta}}}h(x-\hat{\mu})\leq D^{\dagger}$. Moreover, by the Cauchy–Schwarz inequality, $\sum_{1\leq i,j\leq p}|w_{ij}||\hat{\sigma}_{i}-\sigma_{i}^{*}|\leq\sqrt{p}\|\hat{\sigma}-\sigma^{*}\|_{2}$. Substituting these inequalities into the preceding display shows that

\quad\sum_{i=1}^{p}\left|w_{ii}(\hat{\sigma}_{i}^{2}-{\sigma_{i}^{*}}^{2})\right|+\sum_{1\leq i\not=j\leq p}\left|w_{ij}(\hat{\rho}_{ij}\hat{\sigma}_{i}\hat{\sigma}_{j}-\rho^{*}_{ij}\sigma_{i}^{*}\sigma_{j}^{*})\right|
2M21/2pσ^σ2+S7D.\displaystyle\leq 2M_{2}^{1/2}\sqrt{p}\|\hat{\sigma}-\sigma^{*}\|_{2}+S_{7}D^{\dagger}. (S89)
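Here the Cauchy–Schwarz step can be spelled out, for completeness, as

\sum_{1\leq i,j\leq p}|w_{ij}||\hat{\sigma}_{i}-\sigma_{i}^{*}|\leq\sqrt{p}\sum_{i=1}^{p}|\hat{\sigma}_{i}-\sigma_{i}^{*}|\Big(\sum_{j=1}^{p}w_{ij}^{2}\Big)^{1/2}\leq\sqrt{p}\,\|\hat{\sigma}-\sigma^{*}\|_{2}\|w\|_{2}=\sqrt{p}\,\|\hat{\sigma}-\sigma^{*}\|_{2},

applying the inequality first to each row of $w$ and then across $i$.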

Combining (S88) and (S89) yields the desired result. ∎

III.4 Details in the main proof of Theorem 4

For completeness, we restate Proposition 3 in the main proof of Theorem 4 below. For δ(0,1/7)\delta\in(0,1/7), define

λ11=2log(5p)+log(δ1)n+2log(5p)+log(δ1)n,\displaystyle\lambda_{11}=\sqrt{\frac{2\log({5p})+\log(\delta^{-1})}{n}}+\frac{2\log({5p})+\log(\delta^{-1})}{n},
λ12=2Crad4log(2p(p+1))n+2log(δ1)n,\displaystyle\lambda_{12}=2C_{\mathrm{rad4}}\sqrt{\frac{\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}},

where $C_{\mathrm{rad4}}=C_{\mathrm{sg6}}C_{\mathrm{rad3}}$, depending on the universal constants $C_{\mathrm{sg6}}$ and $C_{\mathrm{rad3}}$ in Lemma S26 and Corollary S2 in the Supplement. Denote

Errh1(n,p,δ,ϵ)=3ϵ+2ϵ/(nδ)+λ12+λ1.\displaystyle\mathrm{Err}_{h1}(n,p,\delta,\epsilon)=3\epsilon+2\sqrt{\epsilon/(n\delta)}+\lambda_{12}+\lambda_{1}.
Proposition 3.

Assume that $\|\Sigma^{*}\|_{\max}\leq M_{1}$ and $\epsilon\leq 1/5$. Let $\hat{\theta}=(\hat{\mu},\hat{\Sigma})$ be a solution to (28) with $\lambda_{1}\geq C_{\mathrm{sp13}}M_{11}\lambda_{11}$, where $M_{11}=M_{1}^{1/2}(M_{1}^{1/2}+2\sqrt{2\pi})$ and $C_{\mathrm{sp13}}=(5/3)(C_{\mathrm{sp11}}\vee C_{\mathrm{sp12}})$, depending on the universal constants $C_{\mathrm{sp11}}$ and $C_{\mathrm{sp12}}$ in Lemma S3 in the Supplement. If $\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5$ and $\mathrm{Err}_{h1}(n,p,\delta,\epsilon)\leq a$ for a constant $a\in(0,1/2)$, then the following holds with probability at least $1-7\delta$ uniformly over the contamination distribution $Q$,

μ^μ\displaystyle\|\hat{\mu}-\mu^{*}\|_{\infty} S4,aErrh1(n,p,δ,ϵ),\displaystyle{\leq}S_{4,a}\mathrm{Err}_{h1}(n,p,\delta,\epsilon),
Σ^Σmax\displaystyle\|\hat{\Sigma}-\Sigma^{*}\|_{\max} S8,aErrh1(n,p,δ,ϵ),\displaystyle{\leq}S_{8,a}\mathrm{Err}_{h1}(n,p,\delta,\epsilon),

where $S_{4,a}=(1+\sqrt{2M_{1}\log\frac{2}{1-2a}})/a$ and $S_{8,a}=2M_{1}^{1/2}S_{6,a}+S_{7}(1+S_{4,a}+S_{6,a})$ with $S_{6,a}=S_{5}(1+S_{4,a}/2)$, $S_{5}=2\sqrt{2\pi}(1-\mathrm{e}^{-2/M_{1}})^{-1}$, and $S_{7}=4\{(\frac{1}{\sqrt{2\pi M_{1}}}\mathrm{e}^{-1/(8M_{1})})\vee(1-2\mathrm{e}^{-1/(8M_{1})})\}^{-2}\leq 8\pi M_{1}\mathrm{e}^{1/(4M_{1})}$.
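To verify the stated upper bound on $S_{7}$, note that

4\left\{\Big(\frac{1}{\sqrt{2\pi M_{1}}}\mathrm{e}^{-1/(8M_{1})}\Big)\vee\Big(1-2\mathrm{e}^{-1/(8M_{1})}\Big)\right\}^{-2}\leq 4\Big(\frac{1}{\sqrt{2\pi M_{1}}}\mathrm{e}^{-1/(8M_{1})}\Big)^{-2}=8\pi M_{1}\mathrm{e}^{1/(4M_{1})}.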

To see why Proposition 3 leads to Theorem 4, we show that the conditions in Proposition 3 are satisfied under the setting of Theorem 4. For a constant $a\in(0,1/2)$, let

C1\displaystyle C_{1} =23Csp13M11,\displaystyle=2\sqrt{3}C_{\mathrm{sp13}}M_{11},
C_{2}=\frac{1}{5}\wedge\frac{\sqrt{3}C_{1}}{6}\wedge a\left(3\vee\left(\frac{4C_{\mathrm{rad4}}+2}{\sqrt{2}C_{1}}+1\right)\right)^{-1},
C=S_{8,a}\left(3\vee\left(\frac{4C_{\mathrm{rad4}}+2}{\sqrt{2}C_{1}}+1\right)\right).

Then the following conditions

  • (i)

    λ1C1(logp/n+log(1/δ)/n)\lambda_{1}\geq C_{1}\left(\sqrt{\log{p}/{n}}+\sqrt{\log{(1/\delta)}/{n}}\right),

  • (ii)

    ϵ+ϵ/(nδ)+λ1C2\epsilon+\sqrt{\epsilon/(n\delta)}+\lambda_{1}\leq C_{2},

imply the conditions

  • (iii)

    λ1Csp13M11λ11\lambda_{1}\geq C_{\mathrm{sp13}}M_{11}\lambda_{11},

  • (iv)

    Errh1(n,p,δ,ϵ)a\mathrm{Err}_{h1}(n,p,\delta,\epsilon)\leq a,

  • (v)

    ϵ1/5\epsilon\leq 1/5 and ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5.

In fact, condition (v) follows directly from condition (ii) because C21/5C_{2}\leq 1/5. For conditions (iii) and (iv), we first show that λ11\lambda_{11} and λ12\lambda_{12} can be upper bounded as follows:

λ11\displaystyle\lambda_{11} 2logp+3log(δ1)n+2logp+3log(δ1)n\displaystyle\leq\sqrt{\frac{2\log{p}+3\log(\delta^{-1})}{n}}+{\frac{2\log{p}+3\log(\delta^{-1})}{n}} (S90)
22logp+3log(δ1)n22logpn+23log(δ1)n,\displaystyle\leq 2\sqrt{\frac{2\log{p}+3\log(\delta^{-1})}{n}}\leq 2\sqrt{\frac{2\log{p}}{n}}+2\sqrt{\frac{3\log(\delta^{-1})}{n}}, (S91)

and

λ12\displaystyle\lambda_{12} 2Crad42log2+2logpn+2log(δ1)n\displaystyle\leq 2C_{\mathrm{rad4}}\sqrt{\frac{2\log{2}+2\log{p}}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}} (S92)
2Crad42logpn+(2Crad4+1)2log(δ1)n.\displaystyle\leq 2C_{\mathrm{rad4}}\sqrt{\frac{2\log{p}}{n}}+\left(2C_{\mathrm{rad4}}+1\right)\sqrt{\frac{2\log(\delta^{-1})}{n}}. (S93)

Lines (S90) and (S93) hold because log(1/δ)log(5)\log(1/\delta)\geq\log(5) for δ(0,1/7)\delta\in(0,1/7). Line (S91) holds because 2logp+3log(δ1)n1\sqrt{\frac{2\log{p}+3\log(\delta^{-1})}{n}}\leq 1 and hence the linear term in λ11\lambda_{11} is upper bounded by the square root term. To see this, by conditions (i) and (ii) we have

2logp+3log(δ1)n\displaystyle\sqrt{\frac{2\log{p}+3\log(\delta^{-1})}{n}} 3logpn+3log(δ1)n3λ1C13C2C11.\displaystyle\leq\sqrt{\frac{3\log{p}}{n}}+\sqrt{\frac{3\log(\delta^{-1})}{n}}\leq\frac{\sqrt{3}\lambda_{1}}{C_{1}}\leq\frac{\sqrt{3}C_{2}}{C_{1}}\leq 1.

Line (S92) holds because log(2p(p+1))2log2+2logp\log{(2p(p+1))}\leq 2\log{2}+2\log{p} for p1p\geq 1. With the above upper bounds for λ11\lambda_{11} and λ12\lambda_{12}, we show that condition (i) implies condition (iii) as follows:

λ1\displaystyle\lambda_{1} C1(logpn+log(1/δ)n)\displaystyle\geq C_{1}\left(\sqrt{\frac{\log{p}}{n}}+\sqrt{\frac{\log{(1/\delta)}}{n}}\right)
=2\sqrt{3}C_{\mathrm{sp13}}M_{11}\left(\sqrt{\frac{\log{p}}{n}}+\sqrt{\frac{\log{(1/\delta)}}{n}}\right)\geq C_{\mathrm{sp13}}M_{11}\lambda_{11}\quad\text{(using (S91))},

and condition (ii) implies condition (iv) as follows:

\mathrm{Err}_{h1}(n,p,\delta,\epsilon)=3\epsilon+2\sqrt{\epsilon/(n\delta)}+\lambda_{12}+\lambda_{1}
\leq 3\left(\epsilon+\sqrt{\frac{\epsilon}{n\delta}}\right)+\left(\frac{4C_{\mathrm{rad4}}+2}{\sqrt{2}C_{1}}+1\right)\lambda_{1}
\leq\left(3\vee\left(\frac{4C_{\mathrm{rad4}}+2}{\sqrt{2}C_{1}}+1\right)\right)\left(\epsilon+\sqrt{\frac{\epsilon}{n\delta}}+\lambda_{1}\right)
\leq\left(3\vee\left(\frac{4C_{\mathrm{rad4}}+2}{\sqrt{2}C_{1}}+1\right)\right)C_{2}\leq a,

where the second inequality uses $\lambda_{12}\leq\sqrt{2}(2C_{\mathrm{rad4}}+1)\lambda_{1}/C_{1}=\{(4C_{\mathrm{rad4}}+2)/(\sqrt{2}C_{1})\}\lambda_{1}$, by (S93) and condition (i).

Therefore, Proposition 3 implies Theorem 4 with the constant $C=S_{8,a}\left(3\vee\left(\frac{4C_{\mathrm{rad4}}+2}{\sqrt{2}C_{1}}+1\right)\right)$.

Lemma S12.

Consider the hinge GAN (3).

(i) For any ϵ[0,1]\epsilon\in[0,1] and any function h:ph:\mathbb{R}^{p}\to\mathbb{R}, we have

KHG(Pϵ,Pθ;h)2ϵ.\displaystyle K_{\mathrm{HG}}(P_{\epsilon},P_{\theta^{*}};h)\leq 2\epsilon.

(ii) For any function h:ph:\mathbb{R}^{p}\to\mathbb{R}, we have

KHG(Pn,Pθ;h)2ϵ^+|EPθ,nh(x)EPθh(x)|,\displaystyle K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h)\leq 2\hat{\epsilon}+|\mathrm{E}_{P_{\theta^{*},n}}h(x)-\mathrm{E}_{P_{\theta^{*}}}h(x)|, (S94)

where ϵ^=n1i=1nUi\hat{\epsilon}=n^{-1}\sum_{i=1}^{n}{U_{i}} and Pθ,nP_{\theta^{*},n} denotes the empirical distribution of {Xi:Ui=0,i=1,,n}\{X_{i}:{U_{i}}=0,i=1,\ldots,n\} in the latent representation of Huber’s contamination model.

Proof.

(i) For any h:ph:\mathbb{R}^{p}\to\mathbb{R} we have

KHG(Pϵ,Pθ;h)\displaystyle\quad K_{\mathrm{HG}}(P_{\epsilon},P_{\theta^{*}};h)
=ϵEQmin(1,h(x))+(1ϵ)EPθmin(1,h(x))+EPθmin(1,h(x))\displaystyle=\epsilon\mathrm{E}_{Q}\min(1,h(x))+(1-\epsilon)\mathrm{E}_{P_{\theta^{*}}}\min(1,h(x))+\mathrm{E}_{P_{\theta^{*}}}\min(1,-h(x))
ϵ+(1ϵ)EPθmin(1,h(x))+EPθmin(1,h(x))\displaystyle\leq\epsilon+(1-\epsilon)\mathrm{E}_{P_{\theta^{*}}}\min(1,h(x))+\mathrm{E}_{P_{\theta^{*}}}\min(1,-h(x)) (S95)
2ϵ+(1ϵ){EPθmin(1,h(x))+EPθmin(1,h(x))}\displaystyle\leq 2\epsilon+(1-\epsilon)\left\{\mathrm{E}_{P_{\theta^{*}}}\min(1,h(x))+\mathrm{E}_{P_{\theta^{*}}}\min(1,-h(x))\right\} (S96)
2ϵ.\displaystyle\leq 2\epsilon. (S97)

Inequalities (S95) and (S96) hold because min(1,u)min(1,u)1\min(1,u)\vee\min(1,-u)\leq 1 for all uu\in\mathbb{R}. Inequality (S97) holds because min(1,u)+min(1,u)0\min(1,u)+\min(1,-u)\leq 0 for all uu\in\mathbb{R}.
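For completeness, the elementary inequality $\min(1,u)+\min(1,-u)\leq 0$ can be checked case by case:

\min(1,u)+\min(1,-u)=\begin{cases}u+(-u)=0,&|u|\leq 1,\\ 1-u<0,&u>1,\\ u+1<0,&u<-1.\end{cases}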

(ii) Because both min(1,u)\min(1,u) and min(1,u)\min(1,-u) are concave in uu\in\mathbb{R} and upper bounded by 11, we have

KHG(Pn,Pθ;h)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h)
=\frac{1}{n}\sum_{i=1}^{n}U_{i}\min(1,h(X_{i}))+\frac{1}{n}\sum_{i=1}^{n}(1-U_{i})\min(1,h(X_{i}))+\mathrm{E}_{P_{\theta^{*}}}\min(1,-h(x))
\leq\hat{\epsilon}+(1-\hat{\epsilon})\mathrm{E}_{P_{\theta^{*},n}}\min(1,h(x))+\mathrm{E}_{P_{\theta^{*}}}\min(1,-h(x)) (S98)
\leq\hat{\epsilon}+(1-\hat{\epsilon})\min(1,\mathrm{E}_{P_{\theta^{*},n}}h(x))+\min(1,-\mathrm{E}_{P_{\theta^{*}}}h(x)) (S99)
\leq 2\hat{\epsilon}+(1-\hat{\epsilon})\left\{\min(1,\mathrm{E}_{P_{\theta^{*},n}}h(x))+\min(1,-\mathrm{E}_{P_{\theta^{*}}}h(x))\right\} (S100)
\leq 2\hat{\epsilon}+0+|\min(1,-\mathrm{E}_{P_{\theta^{*},n}}h(x))-\min(1,-\mathrm{E}_{P_{\theta^{*}}}h(x))| (S101)
\leq 2\hat{\epsilon}+0+|\mathrm{E}_{P_{\theta^{*},n}}h(x)-\mathrm{E}_{P_{\theta^{*}}}h(x)|. (S102)

Lines (S98) and (S100) hold because min(1,u)min(1,u)1\min(1,u)\vee\min(1,-u)\leq 1 for all uu\in\mathbb{R}. Line (S99) follows from Jensen’s inequality by the concavity of min(1,u)\min(1,u) and min(1,u)\min(1,-u). Line (S101) follows because min(1,u)+min(1,u)0\min(1,u)+\min(1,-u)\leq 0 for all uu\in\mathbb{R}, and the last line (S102) holds because min(1,u)\min(1,-u) is 11-Lipschitz in uu. ∎

Proposition S13.

In the setting of Proposition 3, it holds with probability at least 15δ1-{5}\delta that for any γΓ\gamma\in\Gamma,

KHG(Pn,Pθ;hγ,μ)2(ϵ+ϵ/(nδ))+pen1(γ)Csp13M11λ11,\displaystyle K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})\leq 2(\epsilon+\sqrt{\epsilon/(n\delta)})+\mathrm{pen}_{1}(\gamma)C_{\mathrm{sp13}}M_{11}\lambda_{11},

where Csp13=(5/3)(Csp11Csp12)C_{\mathrm{sp13}}=(5/3)(C_{\mathrm{sp11}}\vee C_{\mathrm{sp12}}) with Csp11C_{\mathrm{sp11}} and Csp12C_{\mathrm{sp12}} as in Lemma S3, M11=M11/2(M11/2+22π)M_{11}=M_{1}^{1/2}(M_{1}^{1/2}+2\sqrt{2\pi}), and

λ11=2log(5p)+log(δ1)n+2log(5p)+log(δ1)n.\lambda_{11}=\sqrt{\frac{2\log(5p)+\log(\delta^{-1})}{n}}+\frac{2\log(5p)+\log(\delta^{-1})}{n}.
Proof.

The proof is similar to that of Proposition S5, and we use the same definitions of $\Omega_{1}$ and $\Omega_{2}$. In the event $\Omega_{1}$, we have $|\hat{\epsilon}-\epsilon|\leq 1/5$ by the assumption $\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5$ and hence $\hat{\epsilon}\leq 2/5$ by the assumption $\epsilon\leq 1/5$. Thus, by Lemma S12(ii), it holds in the event $\Omega_{1}$ that for any $\gamma\in\Gamma$,

KHG(Pn,Pθ;hγ,μ)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})
2ϵ^+|EPθ,nhγ,μ(x)EPθhγ,μ(x)|\displaystyle\leq 2\hat{\epsilon}+\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)\right|
2(ϵ+ϵ/(nδ))+|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|.\displaystyle\leq 2(\epsilon+\sqrt{\epsilon/(n\delta)})+\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|. (S103)

The last step uses the fact that EPθhγ,μ(x)=EP(0,Σ)hγ(x)\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x) and EPθ,nhγ,μ(x)=EPθ,nhγ(xμ)\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*}), by the definition hγ,μ(x)=hγ(xμ)h_{\gamma,\mu^{*}}(x)=h_{\gamma}(x-\mu^{*}).

Next, as shown in the proof of Proposition S5, it holds in the event $\Omega_{2}$, conditionally on $(U_{1},\ldots,U_{n})$ such that $\Omega_{1}$ holds, that for any $\gamma=(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in\Gamma$,

|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|
pen1(γ)(5/3)(Csp11Csp12)M11λ11,\displaystyle\leq\mathrm{pen}_{1}(\gamma)(5/3)(C_{\mathrm{sp11}}\vee C_{\mathrm{sp12}})M_{11}\lambda_{11}, (S104)

where $h_{\gamma}(x)=\gamma_{0}+\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x)+\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x)\otimes\varphi(x))$ and $\mathrm{pen}_{1}(\gamma)=\|\gamma_{1}\|_{1}+\|\gamma_{2}\|_{1}$. Combining (S103) and (S104) shows that in the event $\Omega_{1}\cap\Omega_{2}$, which occurs with probability at least $1-5\delta$, the desired inequality holds for any $\gamma\in\Gamma$. ∎

Proposition S14.

In the setting of Proposition 3, it holds with probability at least 12δ1-2\delta that for any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=1\mathrm{pen}_{1}(\gamma)=1,

KHG(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}ϵλ12\displaystyle\geq\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\epsilon-\lambda_{12}

where Crad4=Csg6Crad3C_{\mathrm{rad4}}=C_{\mathrm{sg6}}C_{\mathrm{rad3}} is the same constant as in Proposition S6, and

λ12=Crad44log(2p(p+1))n+2log(δ1)n,\lambda_{12}=C_{\mathrm{rad4}}\sqrt{\frac{4\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}},

depending on the universal constants Csg6C_{\mathrm{sg6}} and Crad3C_{\mathrm{rad3}} in Lemma S26 and Corollary S2.

Proof.

By definition, for any γΓrp\gamma\in\Gamma_{\mathrm{rp}}, hγ(x)h_{\gamma}(x) can be represented as hrp,β,c(x)h_{\mathrm{rp},\beta,c}(x) such that β0=γ0\beta_{0}=\gamma_{0} and pen1(β)=pen1(γ)\mathrm{pen}_{1}(\beta)=\mathrm{pen}_{1}(\gamma) in the same way as in Proposition S6. Then for any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=1\mathrm{pen}_{1}(\gamma)=1, we have β0=0\beta_{0}=0 and pen1(β)=1\mathrm{pen}_{1}(\beta)=1 correspondingly, and hence hγ(x)=hrp,β,c(x)[1,1]h_{\gamma}(x)=h_{\mathrm{rp},\beta,c}(x)\in[-1,1] by the boundedness of the ramp function in [0,1][0,1]. Moreover, hrp,β,c(x)h_{\mathrm{rp},\beta,c}(x) with β0=0\beta_{0}=0 and pen1(β)=1\mathrm{pen}_{1}(\beta)=1 can be expressed in the form wTg(x)w^{\mathrm{\scriptscriptstyle T}}g(x), where for q=2p+p(p1)q=2p+p(p-1), wqw\in\mathbb{R}^{q} is an L1L_{1} unit vector, and g:p[0,1]qg:\mathbb{R}^{p}\to[0,1]^{q} is a vector of functions including ramp(xj)\mathrm{ramp}(x_{j}) and ramp(xj1)\mathrm{ramp}(x_{j}-1) for j=1,,pj=1,\ldots,p, and ramp(xi)ramp(xj)\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}) for 1ijp1\leq i\not=j\leq p. For symmetry, ramp(xi)ramp(xj)\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}) and ramp(xj)ramp(xi)\mathrm{ramp}(x_{j})\mathrm{ramp}(x_{i}) are included as two distinct components in gg, and the corresponding coefficients are identical to each other in ww.
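To make this construction concrete, the following Python sketch builds such a feature map gg and checks that wTg(x)w^{\mathrm{\scriptscriptstyle T}}g(x) is bounded in [1,1][-1,1] for an L1L_{1} unit vector ww. The definition ramp(x)=min(max(x,0),1)\mathrm{ramp}(x)=\min(\max(x,0),1) used below is an assumption, consistent with the boundedness of the ramp function in [0,1][0,1].

import numpy as np

def ramp(x):
    # Assumed form of the ramp function: x clipped to [0, 1].
    return np.clip(x, 0.0, 1.0)

def feature_map(x):
    # g : R^p -> [0, 1]^q with q = 2p + p(p-1).
    p = x.shape[0]
    r = ramp(x)                 # ramp(x_j), j = 1, ..., p
    r1 = ramp(x - 1.0)          # ramp(x_j - 1), j = 1, ..., p
    # ramp(x_i) ramp(x_j) for i != j; (i, j) and (j, i) are kept as two
    # distinct components, matching the symmetric construction above.
    cross = np.array([r[i] * r[j] for i in range(p) for j in range(p) if i != j])
    return np.concatenate([r, r1, cross])

p = 3
x = np.random.default_rng(0).normal(size=p)
g = feature_map(x)
assert g.shape == (2 * p + p * (p - 1),)
assert np.all((0.0 <= g) & (g <= 1.0))

w = np.random.default_rng(1).normal(size=g.size)
w /= np.abs(w).sum()            # L1-normalize, so pen_1 = 1
assert abs(w @ g) <= 1.0        # h(x) = w^T g(x) lies in [-1, 1]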

Next, KHG(Pn,Pθ^;hγ,μ^)K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) can be bounded as

KHG(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
KHG(Pϵ,Pθ^;hγ,μ^){KHG(Pn,Pθ^;hγ,μ^)KHG(Pϵ,Pθ^;hγ,μ^)}.\displaystyle\geq K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}. (S105)

For any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=1\mathrm{pen}_{1}(\gamma)=1, because hγ,μ^(x)[1,1]h_{\gamma,\hat{\mu}}(x)\in[-1,1], we have min(hγ,μ^(x),1)=hγ,μ^(x)\min(h_{\gamma,\hat{\mu}}(x),1)=h_{\gamma,\hat{\mu}}(x) and min(hγ,μ^(x),1)=hγ,μ^(x)\min(-h_{\gamma,\hat{\mu}}(x),1)=-h_{\gamma,\hat{\mu}}(x). Hence the hinge term KHG(Pϵ,Pθ^;hγ,μ^)K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) in (S105) reduces to a moment matching term and can be lower bounded as follows:

KHG(Pϵ,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
=EPϵmin(hγ,μ^(x),1)+EPθ^min(hγ,μ^(x),1)\displaystyle=\mathrm{E}_{P_{\epsilon}}\min(h_{\gamma,\hat{\mu}}(x),1)+\mathrm{E}_{P_{\hat{\theta}}}\min(-h_{\gamma,\hat{\mu}}(x),1)
=EPϵhγ,μ^(x)EPθ^hγ,μ^(x)\displaystyle=\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)
=ϵEQhγ,μ^(x)+(1ϵ)EPθhγ,μ^(x)EPθ^hγ,μ^(x)\displaystyle=\epsilon\mathrm{E}_{Q}h_{\gamma,\hat{\mu}}(x)+(1-\epsilon)\mathrm{E}_{P_{\theta}^{*}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)
ϵ+{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}.\displaystyle\geq-\epsilon+\left\{\mathrm{E}_{P_{\theta}^{*}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}.

Similarly, the two other hinge terms in (S105) also reduce to moment matching terms:

{KHG(Pn,Pθ^;hγ,μ^)KHG(Pϵ,Pθ^;hγ,μ^)}\displaystyle\quad\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}
={EPnhγ,μ^(x)EPθ^hγ,μ^(x)}{EPϵhγ,μ^(x)EPθ^hγ,μ^(x)}\displaystyle=\left\{\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\left\{\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}
=EPnhγ,μ^(x)EPϵhγ,μ^(x).\displaystyle=\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x).

We apply Lemma S5 with b=1b=1, g(x)[0,1]qg(x)\in[0,1]^{q} defined above, and f(eu)f^{\prime}(\mathrm{e}^{u}) and f#(eu)f^{\#}(\mathrm{e}^{u}) replaced by the identity function in uu. It holds with probability at least 12δ1-2\delta that for any γΓrp\gamma\in\Gamma_{\mathrm{rp}} with γ0=0\gamma_{0}=0 and pen1(γ)=1\mathrm{pen}_{1}(\gamma)=1,

EPnhγ,μ^(x)EPϵhγ,μ^(x)\displaystyle\quad\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)
Csg6Vglog(2q)n+2log(δ1)n\displaystyle\leq C_{\mathrm{sg6}}\sqrt{\frac{V_{g}\log(2q)}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}}
=Csg6Crad34log(2p(p+1))n+2log(δ1)n,\displaystyle=C_{\mathrm{sg6}}C_{\mathrm{rad3}}\sqrt{\frac{4\log(2p(p+1))}{n}}+\sqrt{\frac{2\log(\delta^{-1})}{n}},

as shown in Proposition S6. Combining the preceding three displays leads to the desired result. ∎

III.5 Details in main proof of Theorem 5

Proposition S15.

In the setting of Proposition S3, it holds with probability at least 14δ1-{4}\delta that for any γ=(γ0,γ1T,γ2T)TΓ\gamma=(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in\Gamma,

KHG(Pn,Pθ;hγ,μ)\displaystyle K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}}) 2(ϵ+ϵ/(nδ))+pen2(γ1)(5/3)Csp21M21/2R1λ21\displaystyle\leq 2(\epsilon+\sqrt{\epsilon/(n\delta)})+\mathrm{pen}_{2}(\gamma_{1})(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}R_{1}\lambda_{21}
+pen2(γ2)(255/3)Csp22M21R1pλ31,\displaystyle\quad+\mathrm{pen}_{2}(\gamma_{2}){(25\sqrt{5}/3)}C_{\mathrm{sp22}}M_{21}R_{1}\sqrt{p}\lambda_{31},

where Csp21C_{\mathrm{sp21}} and Csp22C_{\mathrm{sp22}} are defined as in Lemma S9, M21=M21/2(M21/2+22π)M_{21}=M_{2}^{1/2}(M_{2}^{1/2}+2\sqrt{2\pi}), and

λ21=5p+log(δ1)n,λ31=λ21+5p+log(δ1)n.\displaystyle\lambda_{21}=\sqrt{\frac{5p+\log(\delta^{-1})}{n}},\quad\lambda_{31}=\lambda_{21}+\frac{5p+\log(\delta^{-1})}{n}.
Proof.

The proof is similar to that of Proposition S8, and we use the same definitions of Ω1\Omega_{1} and Ω2\Omega_{2}. In the event Ω1\Omega_{1}, we have |ϵ^ϵ|1/5|\hat{\epsilon}-\epsilon|\leq 1/5 by the assumption ϵ(1ϵ)/(nδ)1/5\sqrt{\epsilon(1-\epsilon)/(n\delta)}\leq 1/5 and hence ϵ^2/5\hat{\epsilon}\leq 2/5 by the assumption ϵ1/5\epsilon\leq 1/5. By Lemma S12 with ϵ1=2/5\epsilon_{1}=2/5, it holds in the event Ω1\Omega_{1} that for any γΓ\gamma\in\Gamma,

KHG(Pn,Pθ;hγ,μ)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\theta^{*}};h_{\gamma,\mu^{*}})
2ϵ^+|EPθ,nhγ,μ(x)EPθhγ,μ(x)|\displaystyle\leq 2\hat{\epsilon}+\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)\right|
2(ϵ+ϵ/(nδ))+|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|.\displaystyle\leq 2(\epsilon+\sqrt{\epsilon/(n\delta)})+\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|. (S106)

The last step uses the fact that EPθhγ,μ(x)=EP(0,Σ)hγ(x)\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x) and EPθ,nhγ,μ(x)=EPθ,nhγ(xμ)\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma,\mu^{*}}(x)=\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*}), by the definition hγ,μ(x)=hγ(xμ)h_{\gamma,\mu^{*}}(x)=h_{\gamma}(x-\mu^{*}).

Next, as shown in Proposition S8, it holds in the event Ω2\Omega_{2}, conditionally on Ω1\Omega_{1}, that for any γ=(γ0,γ1T,γ2T)TΓ\gamma=(\gamma_{0},\gamma_{1}^{\mathrm{\scriptscriptstyle T}},\gamma_{2}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}\in\Gamma,

|EPθ,nhγ(xμ)EP(0,Σ)hγ(x)|\displaystyle\quad\left|\mathrm{E}_{P_{\theta^{*},n}}h_{\gamma}(x-\mu^{*})-\mathrm{E}_{P_{(0,\Sigma^{*})}}h_{\gamma}(x)\right|
pen2(γ1)(5/3)Csp21M21/2λ21+pen2(γ2)(5/3)Csp223M213pλ31,\displaystyle\leq\mathrm{pen}_{2}(\gamma_{1})(5/3)C_{\mathrm{sp21}}M_{2}^{1/2}\lambda_{21}+\mathrm{pen}_{2}(\gamma_{2})(5/3)C_{\mathrm{sp22}}3M_{21}\sqrt{3p}\lambda_{31}, (S107)

where hγ(x)=γ0+γ1Tφ(x)+γ2T(φ(x)φ(x))h_{\gamma}(x)=\gamma_{0}+\gamma_{1}^{\mathrm{\scriptscriptstyle T}}\varphi(x)+\gamma_{2}^{\mathrm{\scriptscriptstyle T}}(\varphi(x)\otimes\varphi(x)), pen2(γ1)=γ12\mathrm{pen}_{2}(\gamma_{1})=\|\gamma_{1}\|_{2}, and pen2(γ2)=γ22\mathrm{pen}_{2}(\gamma_{2})=\|\gamma_{2}\|_{2}. Combining (S106) and (S107) indicates that in the event Ω1Ω2\Omega_{1}\cap\Omega_{2}, which occurs with probability at least 14δ1-4\delta, the desired inequality holds for any γΓ\gamma\in\Gamma. ∎

Proposition S16.

In the setting of Proposition S3, it holds with probability at least 12δ1-2\delta that for any γΓ10\gamma\in\Gamma_{10},

KHG(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}ϵλ22(2p)1/2\displaystyle\geq\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\epsilon-\lambda_{22}(2p)^{-1/2}

where, with Crad5=Csg,12Crad3C_{\mathrm{rad5}}=C_{\mathrm{sg,12}}C_{\mathrm{rad3}},

λ22=Crad516pn+2plog(δ1)n,\lambda_{22}=C_{\mathrm{rad5}}\sqrt{\frac{16p}{n}}+\sqrt{\frac{2p\log(\delta^{-1})}{n}},

depending on the universal constants Csg,12C_{\mathrm{sg,12}} and Crad3C_{\mathrm{rad3}} in Lemma S23 and Corollary S2.

Proof.

For any γΓ10Γrp1\gamma\in\Gamma_{10}\subset\Gamma_{\mathrm{rp1}}, because pen2(γ)=(2p)1/2\mathrm{pen}_{2}(\gamma)=(2p)^{-1/2} and γ0=0\gamma_{0}=0, we have β0=0\beta_{0}=0 and pen2(β)=p1/2\mathrm{pen}_{2}(\beta)=p^{-1/2}. Hence, hγ(x)=hrp1,β,c(x)[1,1]h_{\gamma}(x)=h_{\mathrm{rp1},\beta,c}(x)\in[-1,1] by the Cauchy–Schwarz inequality and the boundedness of the ramp function in [0,1][0,1]. Then the mean-centered version hγ,μ^(x)h_{\gamma,\hat{\mu}}(x), with EPθhγ,μ^(x)=0\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, is also bounded in [1,1][-1,1]. Moreover, such hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) can be expressed in the form pen2(β)wT{g(xμ^)η0}\mathrm{pen}_{2}(\beta)w^{\mathrm{\scriptscriptstyle T}}\{g(x-\hat{\mu})-\eta_{0}\}, where, for q=2pq=2p, wqw\in\mathbb{R}^{q} is an L2L_{2} unit vector, g:p[0,1]qg:\mathbb{R}^{p}\to[0,1]^{q} is a vector of functions, including ramp(xj)\mathrm{ramp}(x_{j}) and ramp(xj1)\mathrm{ramp}(x_{j}-1) for j=1,,pj=1,\ldots,p, and η0=EPθg(xμ^)[0,1]q\eta_{0}=\mathrm{E}_{P_{\theta^{*}}}g(x-\hat{\mu})\in[0,1]^{q}.

Next, KHG(Pn,Pθ^;hγ,μ^)K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) can be bounded as

KHG(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
KHG(Pϵ,Pθ^;hγ,μ^){KHG(Pn,Pθ^;hγ,μ^)KHG(Pϵ,Pθ^;hγ,μ^)}.\displaystyle\geq K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}. (S108)

For any γΓ10\gamma\in\Gamma_{10}, because hγ,μ^(x)[1,1]h_{\gamma,\hat{\mu}}(x)\in[-1,1], we have min(hγ,μ^(x),1)=hγ,μ^(x)\min(h_{\gamma,\hat{\mu}}(x),1)=h_{\gamma,\hat{\mu}}(x) and min(hγ,μ^(x),1)=hγ,μ^(x)\min(-h_{\gamma,\hat{\mu}}(x),1)=-h_{\gamma,\hat{\mu}}(x). Then the hinge term KHG(Pϵ,Pθ^;hγ,μ^)K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) in (S108) reduces to a moment matching term and can be lower bounded as follows:

KHG(Pϵ,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
=EPϵmin(hγ,μ^(x),1)+EPθ^min(hγ,μ^(x),1)\displaystyle=\mathrm{E}_{P_{\epsilon}}\min(h_{\gamma,\hat{\mu}}(x),1)+\mathrm{E}_{P_{\hat{\theta}}}\min(-h_{\gamma,\hat{\mu}}(x),1)
=EPϵhγ,μ^(x)EPθ^hγ,μ^(x)\displaystyle=\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)
=ϵEQhγ,μ^(x)+(1ϵ)EPθhγ,μ^(x)EPθ^hγ,μ^(x)\displaystyle=\epsilon\mathrm{E}_{Q}h_{\gamma,\hat{\mu}}(x)+(1-\epsilon)\mathrm{E}_{P_{\theta}^{*}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)
ϵ+{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}.\displaystyle\geq-\epsilon+\left\{\mathrm{E}_{P_{\theta}^{*}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}. (S109)

Similarly, the difference term in (S108) can be simplified as follows:

{KHG(Pn,Pθ^;hγ,μ^)KHG(Pϵ,Pθ^;hγ,μ^)}\displaystyle\quad\{K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})\}
={EPnhγ,μ^(x)EPθ^hγ,μ^(x)}{EPϵhγ,μ^(x)EPθ^hγ,μ^(x)}\displaystyle=\left\{\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\left\{\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}
=EPnhγ,μ^(x)EPϵhγ,μ^(x).\displaystyle=\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x).

We apply Lemma S10 with b=1b=1, g(x)[0,1]qg(x)\in[0,1]^{q} and η0\eta_{0} defined above, and f(eu)f^{\prime}(\mathrm{e}^{u}) and f#(eu)f^{\#}(\mathrm{e}^{u}) replaced by the identity function in uu. It holds with probability at least 12δ1-2\delta that for any γΓ10\gamma\in\Gamma_{10},

EPnhγ,μ^(x)EPϵhγ,μ^(x)\displaystyle\quad\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)
(2p)1/2{Csg,122qVgn+qlog(δ1)n}\displaystyle\leq(2p)^{-1/2}\left\{C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}+\sqrt{\frac{q\log(\delta^{-1})}{n}}\right\}
(2p)1/2{Csg,12Crad316pn+2plog(δ1)n},\displaystyle\leq(2p)^{-1/2}\left\{C_{\mathrm{sg,12}}C_{\mathrm{rad3}}\sqrt{\frac{16p}{n}}+\sqrt{\frac{2p\log(\delta^{-1})}{n}}\right\}, (S110)

as shown in Proposition S9. Combining the inequalities (S108)–(S110) leads to the desired result. ∎

Proposition S17.

In the setting of Proposition S3, it holds with probability at least 12δ1-2\delta that for any γΓ20\gamma\in\Gamma_{20},

KHG(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}ϵpλ32(4q)1/2\displaystyle\geq\left\{\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\epsilon-\sqrt{p}\lambda_{32}(4q)^{-1/2}

where q=p(p1)q=p(p-1), and, with Crad6=Csg,12Crad3C_{\mathrm{rad6}}=C_{\mathrm{sg,12}}C_{\mathrm{rad3}},

λ32=Crad612(p1)n+(p1)log(δ1)n.\lambda_{32}=C_{\mathrm{rad6}}\sqrt{\frac{12(p-1)}{n}}+\sqrt{\frac{(p-1)\log(\delta^{-1})}{n}}.
Proof.

For any γΓ20Γrp2\gamma\in\Gamma_{20}\subset\Gamma_{\mathrm{rp2}}, because pen2(γ)=(4q)1/2\mathrm{pen}_{2}(\gamma)=(4q)^{-1/2} and γ0=0\gamma_{0}=0, we have pen2(β)=2pen2(γ)=q1/2\mathrm{pen}_{2}(\beta)=2\mathrm{pen}_{2}(\gamma)=q^{-1/2}. Hence hγ(x)=hrp2,β(x)[1,1]h_{\gamma}(x)=h_{\mathrm{rp2},\beta}(x)\in[-1,1] by the boundedness of the ramp function in [0,1][0,1] and the Cauchy–Schwarz inequality, β21q1/2β22\|\beta_{2}\|_{1}\leq q^{1/2}\|\beta_{2}\|_{2}. Then the mean-centered version hγ,μ^(x)h_{\gamma,\hat{\mu}}(x), with EPθhγ,μ^(x)=0\mathrm{E}_{P_{\theta^{*}}}h_{\gamma,\hat{\mu}}(x)=0, is also bounded in [1,1][-1,1]. Moreover, such hγ,μ^(x)h_{\gamma,\hat{\mu}}(x) can be expressed in the form pen2(β)wT{g(xμ^)η0}\mathrm{pen}_{2}(\beta)w^{\mathrm{\scriptscriptstyle T}}\{g(x-\hat{\mu})-\eta_{0}\}, where, for q=p(p1)q=p(p-1), wqw\in\mathbb{R}^{q} is an L2L_{2} unit vector, g:p[0,1]qg:\mathbb{R}^{p}\to[0,1]^{q} is a vector of functions, including ramp(xi)ramp(xj)\mathrm{ramp}(x_{i})\mathrm{ramp}(x_{j}) for 1ijp1\leq i\not=j\leq p, and η0=EPθg(xμ^)[0,1]q\eta_{0}=\mathrm{E}_{P_{\theta^{*}}}g(x-\hat{\mu})\in[0,1]^{q}.

Next, KHG(Pn,Pθ^;hγ,μ^)K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) can be bounded as

KHG(Pn,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
KHG(Pϵ,Pθ^;hγ,μ^)|KHG(Pn,Pθ^;hγ,μ^)KHG(Pϵ,Pθ^;hγ,μ^)|.\displaystyle\geq K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-|K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})|. (S111)

For any γΓ20\gamma\in\Gamma_{20}, because hγ,μ^(x)[1,1]h_{\gamma,\hat{\mu}}(x)\in[-1,1], we have min(hγ,μ^(x),1)=hγ,μ^(x)\min(h_{\gamma,\hat{\mu}}(x),1)=h_{\gamma,\hat{\mu}}(x) and min(hγ,μ^(x),1)=hγ,μ^(x)\min(-h_{\gamma,\hat{\mu}}(x),1)=-h_{\gamma,\hat{\mu}}(x). Then the hinge term KHG(Pϵ,Pθ^;hγ,μ^)K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}}) in (S111) reduces to a moment matching term and can be lower bounded as follows:

KHG(Pϵ,Pθ^;hγ,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})
=EPϵmin(hγ,μ^(x),1)+EPθ^min(hγ,μ^(x),1)\displaystyle=\mathrm{E}_{P_{\epsilon}}\min(h_{\gamma,\hat{\mu}}(x),1)+\mathrm{E}_{P_{\hat{\theta}}}\min(-h_{\gamma,\hat{\mu}}(x),1)
=EPϵhγ,μ^(x)EPθ^hγ,μ^(x)\displaystyle=\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)
=ϵEQhγ,μ^(x)+(1ϵ)EPθhγ,μ^(x)EPθ^hγ,μ^(x)\displaystyle=\epsilon\mathrm{E}_{Q}h_{\gamma,\hat{\mu}}(x)+(1-\epsilon)\mathrm{E}_{P_{\theta}^{*}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)
ϵ+{EPθhγ,μ^(x)EPθ^hγ,μ^(x)}.\displaystyle\geq-\epsilon+\left\{\mathrm{E}_{P_{\theta}^{*}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}. (S112)

Similarly, the absolute difference term in (S111) can be simplified as follows:

|KHG(Pn,Pθ^;hγ,μ^)KHG(Pϵ,Pθ^;hγ,μ^)|\displaystyle\quad|K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})-K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\gamma,\hat{\mu}})|
=|{EPnhγ,μ^(x)EPθ^hγ,μ^(x)}{EPϵhγ,μ^(x)EPθ^hγ,μ^(x)}|\displaystyle=\left|\left\{\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}-\left\{\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\hat{\theta}}}h_{\gamma,\hat{\mu}}(x)\right\}\right|
=|EPnhγ,μ^(x)EPϵhγ,μ^(x)|.\displaystyle=|\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)|.

We apply Lemma S10 with b=1b=1, g(x)[0,1]qg(x)\in[0,1]^{q} and η0\eta_{0} defined above, and f(eu)f^{\prime}(\mathrm{e}^{u}) and f#(eu)f^{\#}(\mathrm{e}^{u}) replaced by the identity function in uu. It holds with probability at least 12δ1-2\delta that for any γΓ20\gamma\in\Gamma_{20},

|EPnhγ,μ^(x)EPϵhγ,μ^(x)|\displaystyle\quad|\mathrm{E}_{P_{n}}h_{\gamma,\hat{\mu}}(x)-\mathrm{E}_{P_{\epsilon}}h_{\gamma,\hat{\mu}}(x)|
(4q)1/2{Csg,122qVgn+qlog(δ1)n}\displaystyle\leq(4q)^{-1/2}\left\{C_{\mathrm{sg,12}}\sqrt{\frac{2qV_{g}}{n}}+\sqrt{\frac{q\log(\delta^{-1})}{n}}\right\}
=p(4q)1/2{Csg,12Crad312(p1)n+(p1)log(δ1)n},\displaystyle=\sqrt{p}(4q)^{-1/2}\left\{C_{\mathrm{sg,12}}C_{\mathrm{rad3}}\sqrt{\frac{12(p-1)}{n}}+\sqrt{\frac{(p-1)\log(\delta^{-1})}{n}}\right\}, (S113)

as shown in Proposition S11. Combining the inequalities (S111)–(S113) leads to the desired result. ∎

III.6 Details in proof of Corollary 1

Lemma S13.

Assume that ff satisfies Assumptions 1 and 2, and GG satisfies Assumption 3. Let (γ^,θ^)(\hat{\gamma},\hat{\theta}) be a solution to the alternating optimization problem (32).

(i) Let ϵ0(0,1)\epsilon_{0}\in(0,1) be fixed. For any ϵ[0,ϵ0]\epsilon\in[0,\epsilon_{0}], the discriminator hγ^,μ^h_{\hat{\gamma},\hat{\mu}} at the solution satisfies

Kf(Pϵ,Pθ^;hγ^,μ^)f(1ϵ0)ϵ.\displaystyle K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})\leq-f^{\prime}(1-\epsilon_{0})\epsilon.

(ii) Let ϵ1(0,1)\epsilon_{1}\in(0,1) be fixed. If ϵ^=n1i=1nUi[0,ϵ1]\hat{\epsilon}=n^{-1}\sum_{i=1}^{n}U_{i}\in[0,\epsilon_{1}], then we have

Kf(Pn,Pθ^;hγ^,μ^)f(1ϵ1)ϵ^+R1|EPθ,nhγ^,μ(x)EPθhγ^,μ(x)|,\displaystyle K_{f}(P_{n},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon}+R_{1}|\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}(x)|, (S114)

where Pθ,nP_{\theta^{*},n} denotes the empirical distribution of {Xi:Ui=0,i=1,,n}\{X_{i}:U_{i}=0,i=1,\ldots,n\} in the latent representation of Huber’s contamination model.

Proof.

(i) As mentioned in Remark 2, the logit ff-GAN objective in (18) can be equivalently written as

Kf(Pϵ,Pθ^;h)\displaystyle K_{f}(P_{\epsilon},P_{\hat{\theta}};h) =EPϵf(eh(x))EPθ^f#(eh(x))\displaystyle=\mathrm{E}_{P_{\epsilon}}f^{\prime}(\mathrm{e}^{h(x)})-\mathrm{E}_{P_{\hat{\theta}}}f^{\#}(\mathrm{e}^{h(x)})
=EPϵT(h(x))EPθ^f{T(h(x))},\displaystyle=\mathrm{E}_{P_{\epsilon}}T(h(x))-\mathrm{E}_{P_{\hat{\theta}}}f^{*}\{T(h(x))\},

where T(u)=f(eu)T(u)=f^{\prime}(\mathrm{e}^{u}) and ff^{*} is the convex conjugate of ff. Because ff is convex and non-decreasing by Assumptions 1 and 2, we have that ff^{*} is convex and non-decreasing.

Denote by LG(θ,γ)L_{G}(\theta,\gamma) the generator objective function, EPϵf(ehγ,μ(x))EPθG(hγ,μ(x))\mathrm{E}_{P_{\epsilon}}f^{\prime}(\mathrm{e}^{h_{\gamma,\mu}(x)})-\mathrm{E}_{P_{\theta}}G(h_{\gamma,\mu}(x)). By the definition of a solution to alternating optimization (see Remark 1), we have

LG{(μ^,Σ^),γ^}LG{(μ,Σ^),γ^},\displaystyle L_{G}\{(\hat{\mu},\hat{\Sigma}),\hat{\gamma}\}\leq L_{G}\{(\mu^{*},\hat{\Sigma}),\hat{\gamma}\},
LG{(μ^,Σ^),γ^}LG{(μ^,Σ),γ^},\displaystyle L_{G}\{(\hat{\mu},\hat{\Sigma}),\hat{\gamma}\}\leq L_{G}\{(\hat{\mu},\Sigma^{*}),\hat{\gamma}\},

that is, the generator loss at θ^=(μ^,Σ^)\hat{\theta}=(\hat{\mu},\hat{\Sigma}) is no greater than that at (μ,Σ^)(\mu^{*},\hat{\Sigma}) or at (μ^,Σ)(\hat{\mu},\Sigma^{*}), with the discriminator parameter fixed at γ^\hat{\gamma}. The preceding inequalities can be written out as

EPϵT(hγ^,μ^(x))EPθ^G(hγ^,μ^(x))EPϵT(hγ^,μ(x))EPμ,Σ^G(hγ^,μ(x)),\displaystyle\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\hat{\mu}}(x))-\mathrm{E}_{P_{\hat{\theta}}}G(h_{\hat{\gamma},\hat{\mu}}(x))\leq\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\mu^{*}}(x))-\mathrm{E}_{P_{\mu^{*},\hat{\Sigma}}}G(h_{\hat{\gamma},\mu^{*}}(x)),

and

EPϵT(hγ^,μ^(x))EPθ^G(hγ^,μ^(x))EPϵT(hγ^,μ^(x))EPμ^,ΣG(hγ^,μ^(x)).\displaystyle\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\hat{\mu}}(x))-\mathrm{E}_{P_{\hat{\theta}}}G(h_{\hat{\gamma},\hat{\mu}}(x))\leq\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\hat{\mu}}(x))-\mathrm{E}_{P_{\hat{\mu},\Sigma^{*}}}G(h_{\hat{\gamma},\hat{\mu}}(x)).

Note that either the location or the variance matrix, but not both, is changed on the two sides of each inequality. In the first inequality, we have EPθ^G(hγ^,μ^(x))=EPμ,Σ^G(hγ^,μ(x))\mathrm{E}_{P_{\hat{\theta}}}G(h_{\hat{\gamma},\hat{\mu}}(x))=\mathrm{E}_{P_{\mu^{*},\hat{\Sigma}}}G(h_{\hat{\gamma},\mu^{*}}(x)) because both are equal to EP0,Σ^G(hγ^,0(x))\mathrm{E}_{P_{0,\hat{\Sigma}}}G(h_{\hat{\gamma},0}(x)). In the second inequality, the term EPϵT(hγ^,μ^(x))\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\hat{\mu}}(x)) appears on both sides. Then the two inequalities yield

EPϵT(hγ^,μ^(x))\displaystyle\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\hat{\mu}}(x)) EPϵT(hγ^,μ(x)),\displaystyle\leq\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\mu^{*}}(x)),
EPθ^G(hγ^,μ^(x))\displaystyle-\mathrm{E}_{P_{\hat{\theta}}}G(h_{\hat{\gamma},\hat{\mu}}(x)) EPμ^,ΣG(hγ^,μ^(x)).\displaystyle\leq-\mathrm{E}_{P_{\hat{\mu},\Sigma^{*}}}G(h_{\hat{\gamma},\hat{\mu}}(x)).

Now we are ready to derive an upper bound for Kf(Pϵ,Pθ^;hγ^,μ^)K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}}):

Kf(Pϵ,Pθ^;hγ^,μ^)\displaystyle\quad K_{f}(P_{\epsilon},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})
=EPϵT(hγ^,μ^)EPθ^f{T(G1(G(hγ^,μ^)))}\displaystyle=\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\hat{\mu}})-\mathrm{E}_{P_{\hat{\theta}}}f^{*}\{T(G^{-1}(G(h_{\hat{\gamma},\hat{\mu}})))\}
EPϵT(hγ^,μ^)f{T(G1(EPθ^G(hγ^,μ^)))}\displaystyle\leq\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\hat{\mu}})-f^{*}\{T(G^{-1}(\mathrm{E}_{P_{\hat{\theta}}}G(h_{\hat{\gamma},\hat{\mu}})))\} (S115)
EPϵT(hγ^,μ)f{T(G1(EPμ^,ΣG(hγ^,μ^)))}\displaystyle\leq\mathrm{E}_{P_{\epsilon}}T(h_{\hat{\gamma},\mu^{*}})-f^{*}\{T(G^{-1}(\mathrm{E}_{P_{\hat{\mu},\Sigma^{*}}}G(h_{\hat{\gamma},\hat{\mu}})))\} (S116)
(1ϵ)EPθT(hγ^,μ)f{T(G1(EPμ,ΣG(hγ^,μ)))}\displaystyle\leq(1-\epsilon)\mathrm{E}_{P_{\theta^{*}}}T(h_{\hat{\gamma},\mu^{*}})-f^{*}\{T(G^{-1}(\mathrm{E}_{P_{\mu^{*},\Sigma^{*}}}G(h_{\hat{\gamma},\mu^{*}})))\} (S117)
(1ϵ)T(EPθhγ^,μ)f{T(EPμ,Σhγ^,μ)}\displaystyle\leq(1-\epsilon)T(\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}})-f^{*}\{T(\mathrm{E}_{P_{\mu^{*},\Sigma^{*}}}h_{\hat{\gamma},\mu^{*}})\} (S118)
f(1ϵ)f(1ϵ0)ϵ.\displaystyle\leq f(1-\epsilon)\leq-f^{\prime}(1-\epsilon_{0})\epsilon. (S119)

Line (S115) follows from Jensen’s inequality by the convexity of ff^{*}. Line (S116) follows from the two inequalities derived above, together with the fact that f(T(G1))-f^{*}(T(G^{-1})) is non-increasing, by non-decreasingness of ff^{*}, TT, and G1G^{-1}. In (S117) we use the fact that EPμ^,ΣG(hγ^,μ^)=EPμ,ΣG(hγ^,μ(x))\mathrm{E}_{P_{\hat{\mu},\Sigma^{*}}}G(h_{\hat{\gamma},\hat{\mu}})=\mathrm{E}_{P_{\mu^{*},\Sigma^{*}}}G(h_{\hat{\gamma},\mu^{*}}(x)) and drop the EQ\mathrm{E}_{Q} term because T0T\leq 0. Line (S118) follows from Jensen’s inequality by the convexity of GG and the concavity of TT, together with the fact that f(T(G1))-f^{*}(T(G^{-1})) is non-increasing. For the last line (S119), by the definition of Fenchel conjugate we have

(1ϵ)sf(s)f(1ϵ)f(1ϵ0)ϵ,\displaystyle(1-\epsilon)s-f^{*}(s)\leq f(1-\epsilon)\leq-f^{\prime}(1-\epsilon_{0})\epsilon,

with ss set to T(EPθhγ^,μ)T(\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}).

(ii) To derive an upper bound for Kf(Pn,Pθ^;hγ^,μ^)K_{f}(P_{n},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}}), we first argue similarly as in part (i):

Kf(Pn,Pθ^;hγ^,μ^)\displaystyle\quad K_{f}(P_{n},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})
=ϵ^EQT(hγ^,μ^)+(1ϵ^)EPθ,nT(hγ^,μ^)EPθ^f{T(hγ^,μ^)}\displaystyle=\hat{\epsilon}\,\mathrm{E}_{Q}T(h_{\hat{\gamma},\hat{\mu}})+(1-\hat{\epsilon})\mathrm{E}_{P_{\theta^{*},n}}T(h_{\hat{\gamma},\hat{\mu}})-\mathrm{E}_{P_{\hat{\theta}}}f^{*}\{T(h_{\hat{\gamma},\hat{\mu}})\}
(1ϵ^)T(EPθ,nhγ^,μ)f{T(EPμ,Σhγ^,μ)}.\displaystyle\leq(1-\hat{\epsilon})T(\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}})-f^{*}\{T(\mathrm{E}_{P_{\mu^{*},\Sigma^{*}}}h_{\hat{\gamma},\mu^{*}})\}.

Then we use the R1R_{1}-Lipschitz property of f(T)f^{*}(T) and obtain

(1ϵ^)T(EPθ,nhγ^,μ)f{T(EPμ,Σhγ^,μ)}\displaystyle\quad(1-\hat{\epsilon})T(\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}})-f^{*}\{T(\mathrm{E}_{P_{\mu^{*},\Sigma^{*}}}h_{\hat{\gamma},\mu^{*}})\}
(1ϵ^)T(EPθ,nhγ^,μ)f{T(EPθ,nhγ^,μ)}+R1|EPθ,nhγ^,μEPθhγ^,μ|\displaystyle\leq(1-\hat{\epsilon})T(\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}})-f^{*}\{T(\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}})\}+R_{1}|\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}}-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}|
f(1ϵ^)+R1|EPθ,nhγ^,μEPθhγ^,μ|\displaystyle\leq f(1-\hat{\epsilon})+R_{1}|\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}}-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}|
f(1ϵ1)ϵ^+R1|EPθ,nhγ^,μEPθhγ^,μ|.\displaystyle\leq-f^{\prime}(1-\epsilon_{1})\hat{\epsilon}+R_{1}|\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}}-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}|.

Combining the preceding displays completes the proof. ∎

Lemma S14.

Let (γ^,θ^)(\hat{\gamma},\hat{\theta}) be a solution to the alternating optimization problem (35).

(i) For any ϵ[0,1]\epsilon\in[0,1], the discriminator hγ^,μ^h_{\hat{\gamma},\hat{\mu}} at the solution satisfies

KHG(Pϵ,Pθ^;hγ^,μ^)2ϵ.\displaystyle K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})\leq 2\epsilon.

(ii) If ϵ^=n1i=1nUi[0,1]\hat{\epsilon}=n^{-1}\sum_{i=1}^{n}U_{i}\in[0,1], then we have

KHG(Pn,Pθ^;hγ^,μ^)2ϵ^+|EPθ,nhγ^,μ(x)EPθhγ^,μ(x)|,\displaystyle K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})\leq 2\hat{\epsilon}+|\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}}(x)-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}(x)|, (S120)

where Pθ,nP_{\theta^{*},n} denotes the empirical distribution of {Xi:Ui=0,i=1,,n}\{X_{i}:U_{i}=0,i=1,\ldots,n\} in the latent representation of Huber’s contamination model.

Proof.

(i) By the same argument used in the proof of Lemma S13, with T(u)T(u) replaced by min(u,1)\min(u,1) and f(T(u))-f^{*}(T(u)) replaced by min(u,1)\min(-u,1), we have:

KHG(Pϵ,Pθ^;hγ^,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{\epsilon},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})
ϵ+(1ϵ)min(EPθhγ^,μ,1)+min(EPθhγ^,μ,1)\displaystyle\leq\epsilon+(1-\epsilon)\min(\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}},1)+\min(-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}},1) (S121)
2ϵ.\displaystyle\leq 2\epsilon. (S122)

Inequality (S121) is derived by the same argument that leads to (S115)–(S118), using the fact that min(u,1)\min(u,1) is concave and non-decreasing and that min(u,1)\min(-u,1) is concave and non-increasing, playing the roles of TT and f(T)-f^{*}(T) in Lemma S13 respectively. The ϵ\epsilon term results from the fact that min(u,1)1\min(u,1)\leq 1. Inequality (S122) is by the same argument used in Lemma S12:

(1ϵ)min(u,1)+min(u,1)\displaystyle\quad(1-\epsilon)\min(u,1)+\min(-u,1)
ϵ+(1ϵ){min(u,1)+min(u,1)}\displaystyle\leq\epsilon+(1-\epsilon)\{\min(u,1)+\min(-u,1)\}
ϵ,\displaystyle\leq\epsilon,

with uu set to EPθhγ^,μ\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}.

(ii) To derive an upper bound for KHG(Pn,Pθ^;hγ^,μ^)K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}}), we first argue similarly as in part (i):

KHG(Pn,Pθ^;hγ^,μ^)\displaystyle\quad K_{\mathrm{HG}}(P_{n},P_{\hat{\theta}};h_{\hat{\gamma},\hat{\mu}})
=ϵ^EQmin(hγ^,μ^,1)+(1ϵ^)EPθ,nmin(hγ^,μ^,1)+EPθ^min(hγ^,μ^,1)\displaystyle=\hat{\epsilon}\,\mathrm{E}_{Q}\min(h_{\hat{\gamma},\hat{\mu}},1)+(1-\hat{\epsilon})\mathrm{E}_{P_{\theta^{*},n}}\min(h_{\hat{\gamma},\hat{\mu}},1)+\mathrm{E}_{P_{\hat{\theta}}}\min(-h_{\hat{\gamma},\hat{\mu}},1)
ϵ^+(1ϵ^)min(EPθ,nhγ^,μ,1)+min(EPμ,Σhγ^,μ,1).\displaystyle\leq\hat{\epsilon}+(1-\hat{\epsilon})\min(\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}},1)+\min(-\mathrm{E}_{P_{\mu^{*},\Sigma^{*}}}h_{\hat{\gamma},\mu^{*}},1).

Then we use the 11-Lipschitz property of min(u,1)\min(-u,1) and obtain

(1ϵ^)min(EPθ,nhγ^,μ,1)+min(EPμ,Σhγ^,μ,1)\displaystyle\quad(1-\hat{\epsilon})\min(\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}},1)+\min(-\mathrm{E}_{P_{\mu^{*},\Sigma^{*}}}h_{\hat{\gamma},\mu^{*}},1)
(1ϵ^)min(EPθ,nhγ^,μ,1)+min(EPθ,nhγ^,μ,1)+|EPθ,nhγ^,μEPθhγ^,μ|\displaystyle\leq(1-\hat{\epsilon})\min(\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}},1)+\min(-\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}},1)+|\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}}-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}|
ϵ^+|EPθ,nhγ^,μEPθhγ^,μ|.\displaystyle\leq\hat{\epsilon}+|\mathrm{E}_{P_{\theta^{*},n}}h_{\hat{\gamma},\mu^{*}}-\mathrm{E}_{P_{\theta^{*}}}h_{\hat{\gamma},\mu^{*}}|.

Combining the preceding displays completes the proof. ∎
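For illustration, the bound in part (i) can be examined by Monte Carlo in a simple special case. The sketch below is only an assumption-laden illustration: it takes the oracle generator θ^=θ\hat{\theta}=\theta^{*} rather than an actual solution of (35), together with an arbitrary discriminator bounded in [1,1][-1,1], for which the same moment-matching argument applies.

import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 200_000, 2, 0.1

def K_HG(x_real, x_gen, h):
    # K_HG(P, P_theta; h) = E_P min(h, 1) + E_{P_theta} min(-h, 1).
    return np.minimum(h(x_real), 1.0).mean() + np.minimum(-h(x_gen), 1.0).mean()

# Contaminated sample from (1 - eps) N(0, I) + eps Q, with Q = N(5, I) here.
u = rng.random(n) < eps
x = rng.normal(size=(n, p)) + 5.0 * u[:, None]
x_gen = rng.normal(size=(n, p))          # oracle generator N(0, I)

# A discriminator bounded in [-1, 1]: a clipped linear score.
h = lambda z: np.clip(z.mean(axis=1), -1.0, 1.0)

print(K_HG(x, x_gen, h), '<=', 2 * eps)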

III.7 Proofs in Section 5

Proof of Proposition 1.

We first verify that f(t)=1+t2g0(2t1+t)f(t)=\frac{1+t}{2}g_{0}(\frac{2t}{1+t}) is convex on [0,+)[0,+\infty) and f(1)=0f(1)=0 so that DfD_{f} is a valid ff-divergence. Because g0g_{0} is convex by the convexity of gg and f(t)=2(1+t)3g0(2t1+t)f^{{\prime\prime}}(t)=\frac{2}{(1+t)^{3}}g_{0}^{{\prime\prime}}(\frac{2t}{1+t}), it follows that ff is convex on [0,+)[0,+\infty). Direct calculation gives f(1)=g0(1)=0f(1)=g_{0}(1)=0. Thus, ff defines a valid ff-divergence.

Next we show that Lg(P,Pθ;qγ)=Kf(P,Pθ;hγ)L_{g}(P_{*},P_{\theta};q_{\gamma})=K_{f}(P_{*},P_{\theta};h_{\gamma}). Denote ehγ(x)\mathrm{e}^{h_{\gamma}(x)} by tt and qγ(x)q_{\gamma}(x) by qq. Then we have

Kf(P,Pθ;hγ)=EPf(t)EPθ{tf(t)f(t)}\displaystyle\quad K_{f}(P_{*},P_{\theta};h_{\gamma})=\mathrm{E}_{P_{*}}f^{\prime}(t)-\mathrm{E}_{P_{\theta}}\left\{tf^{\prime}(t)-f(t)\right\}
=EP{11+tg0(2t1+t)+12g0(2t1+t)}EPθ{t1+tg0(2t1+t)12g0(2t1+t)}\displaystyle=\mathrm{E}_{P_{*}}\left\{\frac{1}{1+t}g_{0}^{\prime}\left(\frac{2t}{1+t}\right)+\frac{1}{2}g_{0}\left(\frac{2t}{1+t}\right)\right\}-\mathrm{E}_{P_{\theta}}\left\{\frac{t}{1+t}g_{0}^{\prime}\left(\frac{2t}{1+t}\right)-\frac{1}{2}g_{0}\left(\frac{2t}{1+t}\right)\right\} (S123)
=EP{(1q)g0(2q)+12g0(2q)}EPθ{qg0(2q)12g0(2q)}\displaystyle=\mathrm{E}_{P_{*}}\left\{(1-q)g_{0}^{\prime}\left(2q\right)+\frac{1}{2}g_{0}\left(2q\right)\right\}-\mathrm{E}_{P_{\theta}}\left\{qg_{0}^{\prime}\left(2q\right)-\frac{1}{2}g_{0}\left(2q\right)\right\}
=EP{1q2g(q)+12g(q)12g(12)}EPθ{q2g(q)12g(q)+12g(12)}\displaystyle=\mathrm{E}_{P_{*}}\left\{\frac{1-q}{2}g^{\prime}\left(q\right)+\frac{1}{2}g\left(q\right)-\frac{1}{2}g\left(\frac{1}{2}\right)\right\}-\mathrm{E}_{P_{\theta}}\left\{\frac{q}{2}g^{\prime}\left(q\right)-\frac{1}{2}g\left(q\right)+\frac{1}{2}g\left(\frac{1}{2}\right)\right\} (S124)
=12{EPSg(q,1)EPθSg(q,0)}g(12).\displaystyle=\frac{1}{2}\left\{\mathrm{E}_{P_{*}}S_{g}(q,1)-\mathrm{E}_{P_{\theta}}S_{g}(q,0)\right\}-g\left(\frac{1}{2}\right). (S125)

Line (S123) is by direct calculation. Lines (S124)–(S125) are by the definition of gg and SgS_{g}.

Finally, by the definition of ff from g0g_{0}, direct calculation gives

qf(pq)=p+q2g0(2pp+q),\int qf\left(\frac{p}{q}\right)=\int\frac{p+q}{2}g_{0}\left(\frac{2p}{p+q}\right),

which implies that Dg0(P||(P+Pθ)/2)=Df(P||Pθ)D_{g_{0}}(P_{*}||(P_{*}+P_{\theta})/2)=D_{f}(P_{*}||P_{\theta}). ∎
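The identity Dg0(P||(P+Pθ)/2)=Df(P||Pθ)D_{g_{0}}(P_{*}||(P_{*}+P_{\theta})/2)=D_{f}(P_{*}||P_{\theta}) can be verified numerically on discrete distributions. The sketch below uses the concrete choice g0(t)=tlogtg_{0}(t)=t\log t, which is convex with g0(1)=0g_{0}(1)=0; both this choice and the random distributions are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
p = rng.random(6); p /= p.sum()          # a discrete P_*
q = rng.random(6); q /= q.sum()          # a discrete P_theta

g0 = lambda t: t * np.log(t)
f = lambda t: (1 + t) / 2 * g0(2 * t / (1 + t))

m = (p + q) / 2
lhs = np.sum(m * g0(p / m))              # D_{g0}(P_* || (P_* + P_theta)/2)
rhs = np.sum(q * f(p / q))               # D_f(P_* || P_theta)
assert np.isclose(lhs, rhs)
print(lhs, rhs)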

IV Auxiliary lemmas

IV.1 Truncated linear basis

The following result gives upper bounds on the moments of the truncated linear basis, which are used in the proofs of Lemmas S3 and S9.

Lemma S15.

For XN(0,σ2)X\sim\mathrm{N}(0,\sigma^{2}) and ξ\xi\in\mathbb{R}, we have

E(Xξ)+σ2π+|ξ|,\displaystyle\mathrm{E}(X-\xi)_{+}\leq\frac{\sigma}{\sqrt{2\pi}}+|\xi|,
E[{(Xξ)+}2]σ2+ξ2.\displaystyle\mathrm{E}[\{(X-\xi)_{+}\}^{2}]\leq\sigma^{2}+\xi^{2}.
Proof.

The second result is immediate: E[{(Xξ)+}2]E{(Xξ)2}=σ2+ξ2\mathrm{E}[\{(X-\xi)_{+}\}^{2}]\leq\mathrm{E}\{(X-\xi)^{2}\}=\sigma^{2}+\xi^{2}. The first result can be shown as follows:

E(Xξ)+=ξ(xξ)12πσex22σ2dx\displaystyle\quad\mathrm{E}(X-\xi)_{+}=\int_{\xi}^{\infty}(x-\xi)\frac{1}{\sqrt{2\pi}\sigma}\mathrm{e}^{-\frac{x^{2}}{2\sigma^{2}}}\,\mathrm{d}x
=σ2πeξ22σ2ξξ12πσex22σ2dx\displaystyle=\frac{\sigma}{\sqrt{2\pi}}\mathrm{e}^{-\frac{\xi^{2}}{2\sigma^{2}}}-\xi\int_{\xi}^{\infty}\frac{1}{\sqrt{2\pi}\sigma}\mathrm{e}^{-\frac{x^{2}}{2\sigma^{2}}}\,\mathrm{d}x
σ2π+|ξ|.\displaystyle\leq\frac{\sigma}{\sqrt{2\pi}}+|\xi|.

The last inequality holds because 0ξ12πσex22σ2dx10\leq\int_{\xi}^{\infty}\frac{1}{\sqrt{2\pi}\sigma}\mathrm{e}^{-\frac{x^{2}}{2\sigma^{2}}}\,\mathrm{d}x\leq 1. ∎
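As a quick sanity check, not needed for the proof, the two moment bounds can be verified by Monte Carlo for a few values of σ\sigma and ξ\xi; the sample size and tolerances below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
for sigma in (0.5, 1.0, 2.0):
    for xi in (-1.0, 0.0, 1.5):
        x = rng.normal(scale=sigma, size=n)
        t = np.maximum(x - xi, 0.0)      # (X - xi)_+
        assert t.mean() <= sigma / np.sqrt(2 * np.pi) + abs(xi) + 1e-2
        assert (t ** 2).mean() <= sigma ** 2 + xi ** 2 + 1e-2
print('moment bounds on the truncated linear basis verified')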

IV.2 VC index of ramp functions

For a collection 𝒞\mathcal{C} of subsets of 𝒳\mathcal{X}, and points x1,,xn𝒳x_{1},\dots,x_{n}\in\mathcal{X}, define

Δn𝒞(x1,,xn)=#{C{x1,,xn}:C𝒞},\Delta_{n}^{\mathcal{C}}(x_{1},\dots,x_{n})=\#\{C\cap\{x_{1},\dots,x_{n}\}:C\in\mathcal{C}\},

that is, Δn𝒞(x1,,xn)\Delta_{n}^{\mathcal{C}}(x_{1},\dots,x_{n}) is the number of subsets of {x1,,xn}\{x_{1},\dots,x_{n}\} picked up by the collection 𝒞\mathcal{C}. We say that a subset {xi,,xj}{x1,,xn}\{x_{i},\dots,x_{j}\}\subset\{x_{1},\dots,x_{n}\} is picked up by 𝒞\mathcal{C} if {xi,,xj}{C{x1,,xn}:C𝒞}\{x_{i},\dots,x_{j}\}\in\{C\cap\{x_{1},\dots,x_{n}\}:C\in\mathcal{C}\}. For convenience, we also say that {xi,,xj}\{x_{i},\dots,x_{j}\} is picked up by CC if {xi,,xj}=C{x1,,xn}\{x_{i},\dots,x_{j}\}=C\cap\{x_{1},\dots,x_{n}\}. Moreover, define

m𝒞(n)=maxx1,,xnΔn𝒞(x1,,xn),\displaystyle m^{\mathcal{C}}(n)=\max_{x_{1},\dots,x_{n}}\Delta_{n}^{\mathcal{C}}(x_{1},\dots,x_{n}),

and the Vapnik–Chervonenkis (VC) index of 𝒞\mathcal{C} as (\citeappendVW)

V(𝒞)=inf{n1:m𝒞(n)<2n},\displaystyle V(\mathcal{C})=\inf\{n\geq 1:m^{\mathcal{C}}(n)<2^{n}\},

where the infimum over the empty set is taken to be infinity.

The subgraph of a function f:𝒳f:\mathcal{X}\to\mathbb{R} is defined as Gf={(x,t)𝒳×:t<f(x)}G_{f}=\{(x,t)\in\mathcal{X}\times\mathbb{R}:t<f(x)\}. For a collection of functions \mathcal{F}, denote the collection of corresponding subgraphs as G={Gf:f}G_{\mathcal{F}}=\{G_{f}:f\in\mathcal{F}\}, and define the VC index of \mathcal{F} as V()=V(G)V(\mathcal{F})=V(G_{\mathcal{F}}).

Lemma S16.

For ={fb(x)=1(x+b)++(x+b1)+:b}\mathcal{F}=\{f_{b}(x)=1-(x+b)_{+}+(x+b-1)_{+}:b\in\mathbb{R}\}, we have that V()=2V(\mathcal{F})=2. Moreover, the VC index of {fb(x):b}={ramp(xb):b}\{f_{b}(-x):b\in\mathbb{R}\}=\{\mathrm{ramp}(x-b):b\in\mathbb{R}\} is 22. That is, the VC index of moving-knots ramp functions is 22.

Proof.

To show V()=2V(\mathcal{F})=2, we need to show mG(1)=21m^{G_{\mathcal{F}}}(1)=2^{1} and mG(2)<22m^{G_{\mathcal{F}}}(2)<2^{2}. The first property is trivially true. For the second property, it suffices to show that for any two distinct points {(x1,t1),(x2,t2)}\{(x_{1},t_{1}),(x_{2},t_{2})\}, there is at least one subset of {(x1,t1),(x2,t2)}\{(x_{1},t_{1}),(x_{2},t_{2})\} that cannot be picked up by GfbG_{f_{b}} for any fbf_{b}\in\mathcal{F}. Without loss of generality, assume that x1x2x_{1}\leq x_{2}. We arbitrarily fix fbf_{b}\in\mathcal{F} and discuss several cases depending on t2t1t_{2}-t_{1} and x2x1x_{2}-x_{1}.

If t2t1(x2x1)t_{2}-t_{1}\leq-(x_{2}-x_{1}), then if (x1,t1)Gfb(x_{1},t_{1})\in G_{f_{b}}, i.e., fb(x1)>t1f_{b}(x_{1})>t_{1}, we have

fb(x2)fb(x1)\displaystyle f_{b}(x_{2})-f_{b}(x_{1}) (1)(x2x1)\displaystyle\geq(-1)(x_{2}-x_{1})
t2t1.\displaystyle\geq t_{2}-t_{1}.

This implies that fb(x2)fb(x1)t1+t2>t2f_{b}(x_{2})\geq f_{b}(x_{1})-t_{1}+t_{2}>t_{2} and hence (x2,t2)Gfb(x_{2},t_{2})\in G_{f_{b}}. As a result, a subset containing just (x1,t1)(x_{1},t_{1}) cannot be picked up by GfbG_{f_{b}}.

If t2t10t_{2}-t_{1}\geq 0, then if (x2,t2)Gfb(x_{2},t_{2})\in G_{f_{b}}, i.e., fb(x2)>t2f_{b}(x_{2})>t_{2}, we have

fb(x1)fb(x2)>t2t1,\displaystyle f_{b}(x_{1})\geq f_{b}(x_{2})>t_{2}\geq t_{1},

and hence (x1,t1)Gfb(x_{1},t_{1})\in G_{f_{b}}, where the first inequality holds because fbf_{b} is non-increasing and x1x2x_{1}\leq x_{2}. Thus, the subset {(x2,t2)}\{(x_{2},t_{2})\} can never be picked up by GfbG_{f_{b}}.

If t2t1<0t_{2}-t_{1}<0 and t2t1>(x2x1)t_{2}-t_{1}>-(x_{2}-x_{1}), then if fb(x2)>t2f_{b}(x_{2})>t_{2}, we have

fb(x1)fb(x2)\displaystyle f_{b}(x_{1})-f_{b}(x_{2}) (1)(x1x2)\displaystyle\geq(-1)(x_{1}-x_{2})
>(t2t1).\displaystyle>-(t_{2}-t_{1}).

This implies that fb(x1)>fb(x2)t2+t1>t1f_{b}(x_{1})>f_{b}(x_{2})-t_{2}+t_{1}>t_{1}. As a result, the subset {(x2,t2)}\{(x_{2},t_{2})\} can never be picked up by GfbG_{f_{b}}.

Combining the preceding cases shows that mG(2)<22m^{G_{\mathcal{F}}}(2)<2^{2} and V()=2V(\mathcal{F})=2. Moreover, the class of functions {fb(x):b}\{f_{b}(-x):b\in\mathbb{R}\}, denoted as ~\tilde{\mathcal{F}}, admits a one-to-one correspondence with \mathcal{F}. A subset of {(x1,t1),(x2,t2)}\{(x_{1},t_{1}),(x_{2},t_{2})\} is picked up by GG_{\mathcal{F}} if and only if the corresponding subset of {(x1,t1),(x2,t2)}\{(-x_{1},t_{1}),(-x_{2},t_{2})\} is picked up by G~G_{\tilde{\mathcal{F}}}. Hence V(~)=V()=2V(\tilde{\mathcal{F}})=V(\mathcal{F})=2. ∎
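The conclusion can also be illustrated by brute force: over a grid of knots bb and randomly drawn pairs of points, the subgraphs of fbf_{b} never pick up all four subsets of a two-point set. The grid and sampling ranges in the Python sketch below are arbitrary, so this is a numerical illustration rather than a proof.

import numpy as np

pos = lambda u: np.maximum(u, 0.0)
f = lambda x, b: 1.0 - pos(x + b) + pos(x + b - 1.0)

rng = np.random.default_rng(0)
bs = np.linspace(-10.0, 10.0, 2001)      # grid of knots b
for _ in range(500):
    x1, x2 = rng.uniform(-3.0, 3.0, size=2)
    t1, t2 = rng.uniform(-0.5, 1.5, size=2)
    s1 = f(x1, bs) > t1                  # is (x1, t1) in the subgraph of f_b?
    s2 = f(x2, bs) > t2                  # is (x2, t2) in the subgraph of f_b?
    dichotomies = set(zip(s1.tolist(), s2.tolist()))
    assert len(dichotomies) < 4          # two points are never shattered
print('no two-point set is shattered on the grid')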

Lemma S17.

For ={f(x)b:b}\mathcal{F}=\{f(x)\equiv b:b\in\mathbb{R}\}, we have that V()=2V(\mathcal{F})=2. That is, the VC index of constant functions is 22.

Proof.

For any two distinct points (x1,t1)(x_{1},t_{1}) and (x2,t2)(x_{2},t_{2}), assume without loss of generality that t1t2t_{1}\leq t_{2}. Then the singleton {(x2,t2)}\{(x_{2},t_{2})\} can never be picked up by GfG_{f} for any ff\in\mathcal{F}, and hence mG(2)<22m^{G_{\mathcal{F}}}(2)<2^{2} and V()=2V(\mathcal{F})=2. In fact, if (x2,t2)(x_{2},t_{2}) is in the subgraph of f(x)bf(x)\equiv b, then t2<bt_{2}<b. As a result, t1t2<bt_{1}\leq t_{2}<b, indicating that (x1,t1)(x_{1},t_{1}) is also in the subgraph.

IV.3 Lipschitz functions of Gaussian vectors

Say that a function g:pmg:\mathbb{R}^{p}\to\mathbb{R}^{m} is LL-Lipschitz if g(x1)g(x2)2Lx1x22\|g(x_{1})-g(x_{2})\|_{2}\leq L\|x_{1}-x_{2}\|_{2} for any x1,x2px_{1},x_{2}\in\mathbb{R}^{p}.

Lemma S18.

Let XNp(μ,Σ)X\sim\mathrm{N}_{p}(\mu,\Sigma), and g:pmg:\mathbb{R}^{p}\to\mathbb{R}^{m} be an LL-Lipschitz function.

(i) For any vector wmw\in\mathbb{R}^{m} with w2=1\|w\|_{2}=1, we have

E[{wT(g(X)Eg(X))}2]2Csg,122L2Σop,\displaystyle\mathrm{E}\left[\{w^{\mathrm{\scriptscriptstyle T}}(g(X)-\mathrm{E}g(X))\}^{2}\right]\leq 2C_{\mathrm{sg,12}}^{2}L^{2}\|\Sigma\|_{\mathrm{op}},

where Csg,12C_{\mathrm{sg,12}} is the universal constant from Lemma S23. Hence we have

Varg(X)op2Csg,122L2Σop.\|\mathrm{Var}\,g(X)\|_{\mathrm{op}}\leq 2C_{\mathrm{sg,12}}^{2}L^{2}\|\Sigma\|_{\mathrm{op}}.

(ii) For any symmetric matrix Am×mA\in\mathbb{R}^{m\times m} with AF=1\|A\|_{\mathrm{F}}=1, we have

E[{(g(X)Eg(X))TA(g(X)Eg(X))}2]4Csg,124mL4Σop2.\displaystyle\mathrm{E}\left[\{(g(X)-\mathrm{E}g(X))^{\mathrm{\scriptscriptstyle T}}A(g(X)-\mathrm{E}g(X))\}^{2}\right]\leq 4C_{\mathrm{sg,12}}^{4}mL^{4}\|\Sigma\|_{\mathrm{op}}^{2}.
Proof.

(i) By \citeappendBLM13, Theorem 5.6, it can be shown that for any L2L_{2} unit vector ww, wT(g(X)Eg(X))w^{\mathrm{\scriptscriptstyle T}}(g(X)-\mathrm{E}g(X)) is sub-gaussian with tail parameter LΣop1/2L\|\Sigma\|_{\mathrm{op}}^{1/2}. See the proof of Lemma S9(i) for a similar argument. Then by Lemma S23, E[{wT(g(X)Eg(X))}2]2Csg,122L2Σop\mathrm{E}[\{w^{\mathrm{\scriptscriptstyle T}}(g(X)-\mathrm{E}g(X))\}^{2}]\leq 2C_{\mathrm{sg,12}}^{2}L^{2}\|\Sigma\|_{op}.

(ii) Consider an eigen-decomposition A=j=1mλjwjwjTA=\sum_{j=1}^{m}\lambda_{j}w_{j}w_{j}^{{\mathrm{\scriptscriptstyle T}}}, where λj\lambda_{j}’s are eigenvalues and wjw_{j}’s are the eigenvectors with wj2=1\|w_{j}\|_{2}=1. Denote g=g(X)g=g(X) and g~=gEg\tilde{g}=g-\mathrm{E}g. Then g~TAg~=j=1mλj(wjTg~)2\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g}=\sum_{j=1}^{m}\lambda_{j}(w_{j}^{\mathrm{\scriptscriptstyle T}}\tilde{g})^{2} and

E{(g~TAg~)2}(j=1mλj2){j=1mE(wjTg~)4}4Csg,124mL4Σop2.\displaystyle\quad\mathrm{E}\{(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}\}\leq\left(\sum_{j=1}^{m}\lambda_{j}^{2}\right)\left\{\sum_{j=1}^{m}\mathrm{E}(w_{j}^{\mathrm{\scriptscriptstyle T}}\tilde{g})^{4}\right\}\leq 4C_{\mathrm{sg,12}}^{4}mL^{4}\|\Sigma\|_{\mathrm{op}}^{2}.

The first step uses the Cauchy–Schwarz inequality, and the second step uses the fact that j=1mλj2=AF2=1\sum_{j=1}^{m}\lambda_{j}^{2}=\|A\|_{\mathrm{F}}^{2}=1 and E(wjTg~)44Csg,124L4Σop2\mathrm{E}(w_{j}^{\mathrm{\scriptscriptstyle T}}\tilde{g})^{4}\leq 4C_{\mathrm{sg,12}}^{4}L^{4}\|\Sigma\|_{\mathrm{op}}^{2} by Lemma S23 because wjTg~w_{j}^{\mathrm{\scriptscriptstyle T}}\tilde{g} is sub-gaussian with tail parameter LΣop1/2L\|\Sigma\|_{\mathrm{op}}^{1/2} for each jj. ∎
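The universal constant Csg,12C_{\mathrm{sg,12}} is not explicit, so only the qualitative scaling of part (i) in L2ΣopL^{2}\|\Sigma\|_{\mathrm{op}} can be illustrated numerically. The sketch below uses the 11-Lipschitz coordinatewise map g(x)=clip(x,0,1)g(x)=\mathrm{clip}(x,0,1), an assumption made for illustration.

import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 500_000
B = rng.normal(size=(p, p))
Sigma = B @ B.T / p                      # a generic covariance matrix
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
g = np.clip(X, 0.0, 1.0)                 # 1-Lipschitz map into [0, 1]^p

var_op = np.linalg.eigvalsh(np.cov(g.T)).max()    # ||Var g(X)||_op
sigma_op = np.linalg.eigvalsh(Sigma).max()        # ||Sigma||_op
print(var_op, '<= const *', sigma_op)
assert var_op <= 2.0 * sigma_op          # holds comfortably in this example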

The following result provides a 44th-moment bound which depends linearly on L2ΣopL^{2}\|\Sigma\|_{\mathrm{op}}, under a boundedness condition in addition to the Lipschitz condition.

Lemma S19.

Let XNp(μ,Σ)X\sim\mathrm{N}_{p}(\mu,\Sigma), and g:p[0,1]mg:\mathbb{R}^{p}\to[0,1]^{m} be an LL-Lipschitz function.

(i) For any matrix Am×mA\in\mathbb{R}^{m\times m} with AF=1\|A\|_{\mathrm{F}}=1, we have

E[{(g(X)Eg(X))TA(g(X)Eg(X))}2]2Csg,122mL2Σop.\displaystyle\mathrm{E}\left[\{(g(X)-\mathrm{E}g(X))^{\mathrm{\scriptscriptstyle T}}A(g(X)-\mathrm{E}g(X))\}^{2}\right]\leq 2C_{\mathrm{sg,12}}^{2}mL^{2}\|\Sigma\|_{\mathrm{op}}.

(ii) For any matrix Am×mA\in\mathbb{R}^{m\times m} with AF=1\|A\|_{\mathrm{F}}=1, we have

E[{gT(X)Ag(X)EgT(X)Ag(X)}2]20Csg,122mL2Σop.\displaystyle\mathrm{E}\left[\{g^{\mathrm{\scriptscriptstyle T}}(X)Ag(X)-\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}(X)Ag(X)\}^{2}\right]\leq 20C_{\mathrm{sg,12}}^{2}mL^{2}\|\Sigma\|_{\mathrm{op}}.
Proof.

(i) Denote g=g(X)g=g(X) and g~=gEg\tilde{g}=g-\mathrm{E}g. Then each component of g~\tilde{g} is contained in [1,1][-1,1] by the boundedness of gg. The variable (g~TAg~)2(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2} can be bounded as follows:

(g~TAg~)2=tr(g~TAg~g~TATg~)=tr(Ag~g~TATg~g~T)\displaystyle\quad(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}=\mathrm{tr}(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A^{\mathrm{\scriptscriptstyle T}}\tilde{g})=\mathrm{tr}(A\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A^{\mathrm{\scriptscriptstyle T}}\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}})
tr(Ag~g~TAT)g~g~Top\displaystyle\leq\mathrm{tr}(A\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A^{\mathrm{\scriptscriptstyle T}})\|\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}\|_{\mathrm{op}} (S126)
mtr(Ag~g~TAT)=mtr(ATAg~g~T).\displaystyle\leq m\,\mathrm{tr}(A\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A^{\mathrm{\scriptscriptstyle T}})=m\,\mathrm{tr}(A^{\mathrm{\scriptscriptstyle T}}A\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}). (S127)

Line (S126) follows from von Neumann’s trace inequality (Lemma S22). Line (S127) uses the fact that g~g~Topm\|\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}\|_{\mathrm{op}}\leq m, because wTg~g~Tw=(wTg~)2w22g~22mw22w^{\mathrm{\scriptscriptstyle T}}\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}}w=(w^{\mathrm{\scriptscriptstyle T}}\tilde{g})^{2}\leq\|w\|_{2}^{2}\|\tilde{g}\|_{2}^{2}\leq m\|w\|_{2}^{2} for any wmw\in\mathbb{R}^{m}, by the boundedness of g~\tilde{g}. Then the desired result follows because

Etr(ATAg~g~T)=tr(ATAVar(g))\displaystyle\quad\mathrm{E}\mathrm{tr}(A^{\mathrm{\scriptscriptstyle T}}A\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}})=\mathrm{tr}(A^{\mathrm{\scriptscriptstyle T}}A\,\mathrm{Var}(g))
tr(ATA)Var(g)op\displaystyle\leq\mathrm{tr}(A^{\mathrm{\scriptscriptstyle T}}A)\|\mathrm{Var}(g)\|_{\mathrm{op}} (S128)
2Csg,122L2Σop.\displaystyle\leq 2C_{\mathrm{sg,12}}^{2}L^{2}\|\Sigma\|_{\mathrm{op}}. (S129)

Line (S128) also follows from von Neumann’s trace inequality. Line (S129) follows because tr(ATA)=AF2=1\mathrm{tr}(A^{\mathrm{\scriptscriptstyle T}}A)=\|A\|_{\mathrm{F}}^{2}=1 and Var(g)op2Csg,122L2Σop\|\mathrm{Var}(g)\|_{\mathrm{op}}\leq 2C_{\mathrm{sg,12}}^{2}L^{2}\|\Sigma\|_{\mathrm{op}} by Lemma S18(i), with gg being an LL-Lipschitz function.

(ii) The difference gTAgEgTAgg^{\mathrm{\scriptscriptstyle T}}Ag-\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}Ag can be expressed in terms of the centered variables as gTAgEgTAg=(g~TAg~Eg~TAg~)+2(Eg)TAg~g^{\mathrm{\scriptscriptstyle T}}Ag-\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}Ag=(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g}-\mathrm{E}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})+2(\mathrm{E}g)^{\mathrm{\scriptscriptstyle T}}A\tilde{g}. Then

(gTAgEgTAg)22(g~TAg~Eg~TAg~)2+8{(Eg)TAg~}2\displaystyle\quad(g^{\mathrm{\scriptscriptstyle T}}Ag-\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}Ag)^{2}\leq 2(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g}-\mathrm{E}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}+8\{(\mathrm{E}g)^{\mathrm{\scriptscriptstyle T}}A\tilde{g}\}^{2}
2(g~TAg~Eg~TAg~)2+8Eg22Ag~22.\displaystyle\leq 2(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g}-\mathrm{E}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}+8\|\mathrm{E}g\|_{2}^{2}\|A\tilde{g}\|_{2}^{2}. (S130)

The expectation of the first term on (S130) can be bounded using (i) as

2E{(g~TAg~Eg~TAg~)2}=2E{(g~TAg~)2}2(Eg~TAg~)2\displaystyle\quad 2\mathrm{E}\{(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g}-\mathrm{E}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}\}=2\mathrm{E}\{(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}\}-2(\mathrm{E}\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}
2E{(g~TAg~)2}4mCsg,122L2Σop.\displaystyle\leq 2\mathrm{E}\{(\tilde{g}^{\mathrm{\scriptscriptstyle T}}A\tilde{g})^{2}\}\leq 4mC_{\mathrm{sg,12}}^{2}L^{2}\|\Sigma\|_{\mathrm{op}}.

The expectation of the second term on (S130) can be bounded as

8Eg22EAg~22=8Eg22Etr(ATAg~g~T)\displaystyle\quad 8\|\mathrm{E}g\|_{2}^{2}\,\mathrm{E}\|A\tilde{g}\|_{2}^{2}=8\|\mathrm{E}g\|_{2}^{2}\;\mathrm{E}\mathrm{tr}(A^{\mathrm{\scriptscriptstyle T}}A\tilde{g}\tilde{g}^{\mathrm{\scriptscriptstyle T}})
16mCsg,122L2Σop,\displaystyle\leq 16mC_{\mathrm{sg,12}}^{2}L^{2}\|\Sigma\|_{\mathrm{op}},

by inequality (S129) and the fact that Eg22m\|\mathrm{E}g\|_{2}^{2}\leq m. Combining the preceding two bounds yields the desired result. ∎

IV.4 Moment matching for Lipschitz functions

The following result gives an upper bound on moment matching of quadratic forms under a Lipschitz condition.

Lemma S20.

Let g:pmg:\mathbb{R}^{p}\to\mathbb{R}^{m} be an LL-Lipschitz function, and let X1=μ1+D1ZX_{1}=\mu_{1}+D_{1}Z and X2=μ2+D2ZX_{2}=\mu_{2}+D_{2}Z, where ZpZ\in\mathbb{R}^{p} is a random vector in which the second moments of all components are 1, μ1,μ2p\mu_{1},\mu_{2}\in\mathbb{R}^{p}, and D1=diag(d1)D_{1}=\mathrm{diag}(d_{1}) and D2=diag(d2)D_{2}=\mathrm{diag}(d_{2}) with d1,d2+pd_{1},d_{2}\in\mathbb{R}_{+}^{p}. Then for any matrix Am×mA\in\mathbb{R}^{m\times m} with AF=1\|A\|_{\mathrm{F}}=1,

|EgT(X1)Ag(X1)EgT(X2)Ag(X2)|\displaystyle\quad\left|\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}(X_{1})Ag(X_{1})-\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}(X_{2})Ag(X_{2})\right|
2mL2Δ2+22LΔ(Eg222)1/2,\displaystyle\leq 2\sqrt{m}L^{2}\Delta^{2}+2\sqrt{2}L\Delta(\mathrm{E}\|g_{2}\|_{2}^{2})^{1/2},

where Δ2=μ1μ222+d1d222\Delta^{2}=\|\mu_{1}-\mu_{2}\|_{2}^{2}+\|d_{1}-d_{2}\|_{2}^{2}.

Proof.

Denote g1=g(X1)g_{1}=g(X_{1}) and g2=g(X2)g_{2}=g(X_{2}). The difference g1TAg1g2TAg2g_{1}^{\mathrm{\scriptscriptstyle T}}Ag_{1}-g_{2}^{\mathrm{\scriptscriptstyle T}}Ag_{2} can be decomposed as

g1TAg1g2TAg2\displaystyle\quad g_{1}^{\mathrm{\scriptscriptstyle T}}Ag_{1}-g_{2}^{\mathrm{\scriptscriptstyle T}}Ag_{2}
=(g1g2)TA(g1g2)+2(g1g2)TAg2.\displaystyle=(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}A(g_{1}-g_{2})+2(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}Ag_{2}. (S131)

The expectation of the first term on (S131) can be bounded as

|E(g1g2)TA(g1g2)|=|tr(AV)|\displaystyle\quad\left|\mathrm{E}(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}A(g_{1}-g_{2})\right|=\left|\mathrm{tr}\left(AV\right)\right|
j=1msj(A)Vop\displaystyle\leq\sum_{j=1}^{m}s_{j}(A)\|V\|_{\mathrm{op}} (S132)
2mL2(μ1μ222+d1d222),\displaystyle\leq 2\sqrt{m}L^{2}\left(\|\mu_{1}-\mu_{2}\|_{2}^{2}+\|d_{1}-d_{2}\|_{2}^{2}\right), (S133)

where s1(A),,sm(A)s_{1}(A),\ldots,s_{m}(A) are the singular values of AA, and V=E{(g1g2)(g1g2)T}V=\mathrm{E}\{(g_{1}-g_{2})(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}\}. Line (S132) follows from von Neumann’s trace inequality. Line (S133) follows because j=1msj(A)m{j=1msj2(A)}1/2=mAF=m\sum_{j=1}^{m}s_{j}(A)\leq\sqrt{m}\{\sum_{j=1}^{m}s_{j}^{2}(A)\}^{1/2}=\sqrt{m}\|A\|_{\mathrm{F}}=\sqrt{m} with AF=1\|A\|_{\mathrm{F}}=1 and Vop2L2(μ1μ222+d1d222)\|V\|_{\mathrm{op}}\leq 2L^{2}(\|\mu_{1}-\mu_{2}\|_{2}^{2}+\|d_{1}-d_{2}\|_{2}^{2}), which can be shown as follows. For any L2L_{2} unit vector ww, we have

wTVw=E[{wT(g1g2)}2]Eg1g222\displaystyle\quad w^{\mathrm{\scriptscriptstyle T}}Vw=\mathrm{E}\left[\{w^{\mathrm{\scriptscriptstyle T}}(g_{1}-g_{2})\}^{2}\right]\leq\mathrm{E}\|g_{1}-g_{2}\|_{2}^{2}
L2Eμ1+D1Z(μ2+D2Z)22\displaystyle\leq L^{2}\mathrm{E}\|\mu_{1}+D_{1}Z-(\mu_{2}+D_{2}Z)\|_{2}^{2}
2L2{μ1μ222+E(D1D2)Z22}\displaystyle\leq 2L^{2}\left\{\|\mu_{1}-\mu_{2}\|_{2}^{2}+\mathrm{E}\|(D_{1}-D_{2})Z\|_{2}^{2}\right\}
2L2(μ1μ222+d1d222),\displaystyle\leq 2L^{2}\left(\|\mu_{1}-\mu_{2}\|_{2}^{2}+\|d_{1}-d_{2}\|_{2}^{2}\right), (S134)

using the fact that g()g(\cdot) is LL-Lipschitz and the marginal second moments of ZZ are 1. The expectation of the second term on (S131) can be bounded as

|E(g1g2)TAg2|E|(g1g2)TAg2|\displaystyle\quad\left|\mathrm{E}(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}Ag_{2}\right|\leq\mathrm{E}\left|(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}Ag_{2}\right|
Eg1g22Ag22\displaystyle\leq\mathrm{E}\|g_{1}-g_{2}\|_{2}\|Ag_{2}\|_{2}
E1/2(g1g222)E1/2(Ag222)\displaystyle\leq\mathrm{E}^{1/2}(\|g_{1}-g_{2}\|_{2}^{2})\mathrm{E}^{1/2}(\|Ag_{2}\|_{2}^{2})
2L(μ1μ222+d1d222)1/2(Eg222)1/2.\displaystyle\leq\sqrt{2}L\left(\|\mu_{1}-\mu_{2}\|_{2}^{2}+\|d_{1}-d_{2}\|_{2}^{2}\right)^{1/2}(\mathrm{E}\|g_{2}\|_{2}^{2})^{1/2}. (S135)

Line (S135) uses the fact that E{g1g222}2L2(μ1μ222+d1d222)\mathrm{E}\{\|g_{1}-g_{2}\|_{2}^{2}\}\leq 2L^{2}(\|\mu_{1}-\mu_{2}\|_{2}^{2}+\|d_{1}-d_{2}\|_{2}^{2}) based on (S134) and the following argument:

EAg222E(Aop2g222)Eg222,\displaystyle\mathrm{E}\|Ag_{2}\|_{2}^{2}\leq\mathrm{E}(\|A\|_{\mathrm{op}}^{2}\|g_{2}\|_{2}^{2})\leq\mathrm{E}\|g_{2}\|_{2}^{2},

where the last step follows because AopAF=1\|A\|_{\mathrm{op}}\leq\|A\|_{\mathrm{F}}=1. Combining (S131), (S133), and (S135) yields the desired result. ∎
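The bound can be examined by Monte Carlo. The sketch below uses the 11-Lipschitz coordinatewise map g(x)=clip(x,0,1)g(x)=\mathrm{clip}(x,0,1) and a standard normal ZZ, so that the second moments of the components of ZZ are 1; both choices are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
p = m = 4
n = 400_000
A = rng.normal(size=(m, m)); A /= np.linalg.norm(A)   # ||A||_F = 1
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
d1, d2 = rng.uniform(0.5, 2.0, size=p), rng.uniform(0.5, 2.0, size=p)

Z = rng.normal(size=(n, p))
g1 = np.clip(mu1 + d1 * Z, 0.0, 1.0)     # g(X_1), X_1 = mu_1 + D_1 Z
g2 = np.clip(mu2 + d2 * Z, 0.0, 1.0)     # g(X_2), X_2 = mu_2 + D_2 Z

lhs = abs(np.einsum('ni,ij,nj->n', g1, A, g1).mean()
          - np.einsum('ni,ij,nj->n', g2, A, g2).mean())
L = 1.0
Delta = np.sqrt(np.sum((mu1 - mu2) ** 2) + np.sum((d1 - d2) ** 2))
rhs = (2 * np.sqrt(m) * L ** 2 * Delta ** 2
       + 2 * np.sqrt(2) * L * Delta * np.sqrt((g2 ** 2).sum(axis=1).mean()))
print(lhs, '<=', rhs)
assert lhs <= rhs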

The following result gives a tighter bound than in Lemma S20 under a boundedness condition in addition to the Lipschitz condition.

Lemma S21.

In the setting of Lemma S20, suppose that each component of g(x)g(x) is bounded in [1,1][-1,1]. Then for any matrix Am×mA\in\mathbb{R}^{m\times m} with AF=1\|A\|_{\mathrm{F}}=1,

|EgT(X1)Ag(X1)EgT(X2)Ag(X2)|22mLΔ,\displaystyle\left|\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}(X_{1})Ag(X_{1})-\mathrm{E}g^{\mathrm{\scriptscriptstyle T}}(X_{2})Ag(X_{2})\right|\leq 2\sqrt{2m}L\Delta,

where Δ2=μ1μ222+d1d222\Delta^{2}=\|\mu_{1}-\mu_{2}\|_{2}^{2}+\|d_{1}-d_{2}\|_{2}^{2}.

Proof.

The difference g1TAg1g2TAg2g_{1}^{\mathrm{\scriptscriptstyle T}}Ag_{1}-g_{2}^{\mathrm{\scriptscriptstyle T}}Ag_{2} can also be decomposed as

g1TAg1g2TAg2=(g1g2)TAg1+(g1g2)TAg2.\displaystyle g_{1}^{\mathrm{\scriptscriptstyle T}}Ag_{1}-g_{2}^{\mathrm{\scriptscriptstyle T}}Ag_{2}=(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}Ag_{1}+(g_{1}-g_{2})^{\mathrm{\scriptscriptstyle T}}Ag_{2}.

By (S135) and its analogue with g2g_{2} replaced by g1g_{1}, the two terms on the right-hand side can be bounded in absolute value by 2LΔ(Eg122)1/2\sqrt{2}L\Delta(\mathrm{E}\|g_{1}\|_{2}^{2})^{1/2} and 2LΔ(Eg222)1/2\sqrt{2}L\Delta(\mathrm{E}\|g_{2}\|_{2}^{2})^{1/2} respectively, and hence by 2mLΔ\sqrt{2m}L\Delta, because Eg122m\mathrm{E}\|g_{1}\|_{2}^{2}\leq m and Eg222m\mathrm{E}\|g_{2}\|_{2}^{2}\leq m by the componentwise boundedness of g1g_{1} and g2g_{2}. ∎

V Technical tools

V.1 von Neumann’s trace inequality

Lemma S22.

(\citeappendVon62) For any m×mm\times m matrices AA and BB with singular values α1αm0\alpha_{1}\geq\cdots\geq\alpha_{m}\geq 0 and β1βm0\beta_{1}\geq\cdots\geq\beta_{m}\geq 0 respectively,

|tr(AB)|j=1mαjβj.\displaystyle|\mathrm{tr}(AB)|\leq\sum_{j=1}^{m}\alpha_{j}\beta_{j}.

As a direct consequence, if AA is symmetric and non-negative definite, then

|tr(AB)|tr(A)Bop.\displaystyle|\mathrm{tr}(AB)|\leq\mathrm{tr}(A)\|B\|_{\mathrm{op}}.

This follows because the singular values αj\alpha_{j}’s are also the eigenvalues of AA and hence tr(A)=j=1mαj\mathrm{tr}(A)=\sum_{j=1}^{m}\alpha_{j} for a symmetric and non-negative definite matrix AA, and Bop=maxj=1,,mβj\|B\|_{\mathrm{op}}=\max_{j=1,\ldots,m}\beta_{j} by the definition of Bop\|B\|_{\mathrm{op}}.
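Both the inequality and its stated consequence are straightforward to check numerically; the matrix dimension and random draws below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
m = 6
A0, B = rng.normal(size=(m, m)), rng.normal(size=(m, m))
A = A0 @ A0.T                                # symmetric, non-negative definite

alpha0 = np.linalg.svd(A0, compute_uv=False)   # singular values, descending
alpha = np.linalg.svd(A, compute_uv=False)
beta = np.linalg.svd(B, compute_uv=False)
# General case of the trace inequality.
assert abs(np.trace(A0 @ B)) <= np.sum(alpha0 * beta) + 1e-9
# Consequence for symmetric non-negative definite A.
assert abs(np.trace(A @ B)) <= np.trace(A) * np.linalg.norm(B, 2) + 1e-9
print('von Neumann trace inequality verified')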

V.2 Sub-gaussian and sub-exponential properties

The following results can be obtained from \citeappendVer18, Proposition 2.5.2.

Lemma S23.

For a random variable YY, the following properties are equivalent: there exist universal constants Csg,ij>0C_{\mathrm{sg},ij}>0 such that Csg,ij1KjKiCsg,ijKjC_{\mathrm{sg},ij}^{-1}K_{j}\leq K_{i}\leq C_{\mathrm{sg},ij}K_{j} for all 1ij41\leq i\not=j\leq 4, where KiK_{i} is the parameter appearing in property (i).

  • (i)

    P(|Y|>t)2exp(t22K12)\mathrm{P}(|Y|>t)\leq 2\exp(-\frac{t^{2}}{2K_{1}^{2}}) for any t>0t>0.

  • (ii)

    E1/p(|Y|p)K2p\mathrm{E}^{1/p}(|Y|^{p})\leq K_{2}\sqrt{p} for any p1p\geq 1.

  • (iii)

    Eexp(Y2/K32)2\mathrm{E}\exp(Y^{2}/K_{3}^{2})\leq 2.

If EY=0\mathrm{E}Y=0, then properties (i)–(iii) are also equivalent to the following one.

  • (iv)

    Eexp(sY)exp(K42s22)\mathrm{E}\exp(sY)\leq\exp(\frac{K_{4}^{2}s^{2}}{2}), for any ss\in\mathbb{R}.

Say that YY is a sub-gaussian random variable with tail parameter KK if property (i) holds in Lemma S23 with K1=KK_{1}=K. The following result shows that being sub-gaussian depends only on the tail probabilities of a random variable.

Lemma S24.

Suppose that for some b,K>0b,K>0, a random variable YY satisfies that

P(|Y|>b+t)2et22K2for any t>0.\displaystyle\mathrm{P}(|Y|>b+t)\leq 2\mathrm{e}^{-\frac{t^{2}}{2K^{2}}}\quad\text{for any $t>0$}.

Then YY is sub-gaussian with tail parameter K+bK+b.

Proof.

We distinguish two cases for y>0y>0. First, if y>K+by>K+b, then (yb)/K>y/(K+b)>1(y-b)/K>y/(K+b)>1 and

P(|Y|>y)2exp{(yb)22K2}2exp{y22(K+b)2}.\mathrm{P}(|Y|>y)\leq 2\exp\left\{-\frac{(y-b)^{2}}{2K^{2}}\right\}\leq 2\exp\left\{-\frac{y^{2}}{2(K+b)^{2}}\right\}.

Second, if yK+by\leq K+b, then

P(|Y|>y)12e122exp{y22(K+b)2}.\mathrm{P}(|Y|>y)\leq 1\leq 2\mathrm{e}^{-\frac{1}{2}}\leq 2\exp\left\{-\frac{y^{2}}{2(K+b)^{2}}\right\}.

Hence the desired result holds. ∎
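The elementary inequalities behind the two cases can be checked on a grid; the values of KK and bb below are arbitrary.

import numpy as np

for K in (0.5, 1.0, 3.0):
    for b in (0.1, 1.0, 2.0):
        # First case: for y > K + b, (y - b)^2 / K^2 >= y^2 / (K + b)^2.
        y = np.linspace(K + b + 1e-9, 100.0 * (K + b), 10_000)
        assert np.all((y - b) ** 2 / K ** 2 >= y ** 2 / (K + b) ** 2)
        # Second case: for y <= K + b, 1 <= 2 exp(-y^2 / (2 (K + b)^2)).
        y = np.linspace(0.0, K + b, 10_000)
        assert np.all(1.0 <= 2.0 * np.exp(-y ** 2 / (2.0 * (K + b) ** 2)))
print('tail-shift inequalities verified')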

The following result follows directly from Chernoff’s inequality.

Lemma S25.

Suppose that (Y1,,Yn)(Y_{1},\ldots,Y_{n}) are independent such that EYi=0\mathrm{E}Y_{i}=0 and YiY_{i} is sub-gaussian with tail parameter KK for i=1,,ni=1,\ldots,n. Then

P(|1ni=1nYi|>Csg5t)2exp(nt22K2)for any t>0,\displaystyle\mathrm{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}Y_{i}\right|>C_{\mathrm{sg}5}t\right)\leq 2\exp\left(-\frac{nt^{2}}{2K^{2}}\right)\quad\text{for any $t>0$},

where Csg5=Csg,14C_{\mathrm{sg5}}=C_{\mathrm{sg,14}} as in Lemma S23.
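In the exactly Gaussian case the constant can be taken as one, which gives a quick numerical check of the shape of the bound; the following Python sketch (assuming numpy and scipy are available, with illustrative n and sigma) is outside the formal development.

```python
import numpy as np
from scipy.stats import norm

# For Y_i ~ N(0, sigma^2) i.i.d., the sample mean is N(0, sigma^2 / n), so
# P(|mean| > t) = 2 * norm.sf(sqrt(n) t / sigma) <= 2 exp(-n t^2 / (2 sigma^2)),
# i.e., Lemma S25 holds with the constant replaced by 1 in this special case.
n, sigma = 50, 2.0
t = np.linspace(1e-3, 3.0, 200)
exact = 2 * norm.sf(np.sqrt(n) * t / sigma)
bound = 2 * np.exp(-n * t**2 / (2 * sigma**2))
assert np.all(exact <= bound)
```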

The following result can be obtained from \citeappendVer18, Exercise 2.5.10.

Lemma S26.

Let (Y1,,Yn)(Y_{1},\ldots,Y_{n}) be random variables such that YiY_{i} is sub-gaussian with tail parameter KK for i=1,,ni=1,\ldots,n. Then

Emaxi=1,,n|Yi|Csg6Klog(2n),\displaystyle\mathrm{E}\max_{i=1,\ldots,n}|Y_{i}|\leq C_{\mathrm{sg6}}K\sqrt{\log(2n)},

where Csg6>0C_{\mathrm{sg6}}>0 is a universal constant.
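For standard Gaussian variables the explicit bound \mathrm{E}\max_{i}|Y_{i}|\leq\sqrt{2\log(2n)} holds, so C_{\mathrm{sg6}}=\sqrt{2} suffices in that special case; the following Monte Carlo sketch in Python (assuming numpy, with arbitrary n and number of replications) illustrates this, outside the formal development.

```python
import numpy as np

# Monte Carlo estimate of E max_i |Y_i| for n i.i.d. N(0,1) variables (K = 1),
# compared with the explicit Gaussian bound sqrt(2 log(2n)).
rng = np.random.default_rng(1)
n, reps = 200, 20000
maxima = np.max(np.abs(rng.standard_normal((reps, n))), axis=1)
print(maxima.mean(), np.sqrt(2 * np.log(2 * n)))  # estimate <= bound
assert maxima.mean() <= np.sqrt(2 * np.log(2 * n))
```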

Say that YdY\in\mathbb{R}^{d} is a sub-gaussian random vector with tail parameter KK if wTYw^{\mathrm{\scriptscriptstyle T}}Y is sub-gaussian with tail parameter KK for any wdw\in\mathbb{R}^{d} with w2=1\|w\|_{2}=1. The following result can be obtained from \citeappendHKZ12, Theorem 2.1.

Lemma S27.

Suppose that YdY\in\mathbb{R}^{d} with EY=0\mathrm{E}Y=0 is a sub-gaussian random vector with tail parameter KK. Then for any t>0t>0, we have that with probability at least 1et1-\mathrm{e}^{-t},

Y2Csg7K(d+t),\displaystyle\|Y\|_{2}\leq C_{\mathrm{sg7}}K(\sqrt{d}+\sqrt{t}),

where Csg7>0C_{\mathrm{sg7}}>0 is a universal constant.
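In the Gaussian case Y\sim\mathrm{N}(0,I_{d}), the squared norm \|Y\|_{2}^{2} is chi-squared with d degrees of freedom, and the Laurent–Massart bound \mathrm{P}(\chi^{2}_{d}\geq d+2\sqrt{dt}+2t)\leq\mathrm{e}^{-t} yields the conclusion of Lemma S27 with K=1 and C_{\mathrm{sg7}}=\sqrt{2}; the following Python sketch (assuming numpy and scipy) checks this numerically, outside the formal development.

```python
import numpy as np
from scipy.stats import chi2

# For Y ~ N(0, I_d): P(||Y||_2 >= sqrt(d) + sqrt(2t)) <= e^{-t}, since
# (sqrt(d) + sqrt(2t))^2 >= d + 2 sqrt(d t) + 2 t (Laurent-Massart bound).
d = 10
t = np.linspace(0.1, 50.0, 200)
assert np.all(chi2.sf((np.sqrt(d) + np.sqrt(2 * t))**2, d) <= np.exp(-t))
```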

The following result can be obtained from \citeappendVer10, Theorem 5.39 and Remark 5.40(1). Formally, it differs from \citeappendVer18, Theorem 4.7.1 and Exercise 4.7.3, which rely on assumption (4.24) there.

Lemma S28.

Suppose that Y1,,YnY_{1},\ldots,Y_{n} are independent and identically distributed as YdY\in\mathbb{R}^{d}, where YY is a sub-gaussian random vector with tail parameter KK. Then for any t>0t>0, we have that with probability at least 12et1-2\mathrm{e}^{-t},

1ni=1nYiYiTΣopCsg8K2(d+tn+d+tn),\displaystyle\left\|\frac{1}{n}\sum_{i=1}^{n}Y_{i}Y_{i}^{\mathrm{\scriptscriptstyle T}}-\Sigma\right\|_{\mathrm{op}}\leq C_{\mathrm{sg8}}K^{2}\left(\sqrt{\frac{d+t}{n}}+\frac{d+t}{n}\right),

where Σ=E(YYT)\Sigma=\mathrm{E}(YY^{\mathrm{\scriptscriptstyle T}}) and Csg8C_{\mathrm{sg8}} is a universal constant.
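The following Monte Carlo sketch in Python (assuming numpy, with illustrative d and n) is outside the formal development; it illustrates that, for Y\sim\mathrm{N}(0,I_{d}) with d\leq n, the operator-norm error indeed scales like \sqrt{d/n}, so the ratio printed below stays roughly constant as n grows.

```python
import numpy as np

# Operator-norm error of the sample second-moment matrix for Y ~ N(0, I_d):
# the error divided by sqrt(d/n) should stay roughly constant in n.
rng = np.random.default_rng(2)
d = 20
for n in [200, 800, 3200]:
    Y = rng.standard_normal((n, d))
    err = np.linalg.norm(Y.T @ Y / n - np.eye(d), 2)  # spectral norm
    print(n, err / np.sqrt(d / n))
```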

The following result can be obtained from \citeappendVer18, Proposition 2.7.1.

Lemma S29.

For a random variable Y, the following properties are equivalent, in the sense that there exist universal constants C_{\mathrm{sx},ij}>0 such that C_{\mathrm{sx},ij}^{-1}K_{j}\leq K_{i}\leq C_{\mathrm{sx},ij}K_{j} for all 1\leq i\not=j\leq 4, where K_{i} is the parameter appearing in property (i).

  • (i)

    P(|Y|>t)2exp(tK1)\mathrm{P}(|Y|>t)\leq 2\exp(-\frac{t}{K_{1}}) for any t>0t>0.

  • (ii)

    E1/p(|Y|p)K2p\mathrm{E}^{1/p}(|Y|^{p})\leq K_{2}p for any p1p\geq 1.

  • (iii)

    Eexp(|Y|/K3)2\mathrm{E}\exp(|Y|/K_{3})\leq 2.

If EY=0\mathrm{E}Y=0, then properties (i)–(iii) are also equivalent to the following one.

  • (iv)

    Eexp(sY)exp(K42s22)\mathrm{E}\exp(sY)\leq\exp(\frac{K_{4}^{2}s^{2}}{2}), for any ss\in\mathbb{R} satisfying |s|K41|s|\leq K_{4}^{-1}.

Say that Y is a sub-exponential random variable with tail parameter K if property (i) holds in Lemma S29 with K_{1}=K. The following result, from \citeappendVer18, Lemma 2.7.7, provides a link from sub-gaussian to sub-exponential random variables.

Lemma S30.

Suppose that Y1Y_{1} and Y2Y_{2} are sub-gaussian random variables with tail parameters K1K_{1} and K2K_{2} respectively. Then Y1Y2Y_{1}Y_{2} is sub-exponential with tail parameter Csx5K1K2C_{\mathrm{sx5}}K_{1}K_{2}, where Csx5>0C_{\mathrm{sx5}}>0 is a universal constant.
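Taking Y_{1}=Y_{2}=Z\sim\mathrm{N}(0,1) gives the familiar example that Z^{2} is sub-exponential; the following Python sketch (assuming numpy and scipy) checks property (i) of Lemma S29 for Z^{2} with K_{1}=2, outside the formal development.

```python
import numpy as np
from scipy.stats import chi2

# For Z ~ N(0,1): P(Z^2 > t) = 2 P(Z > sqrt(t)) <= exp(-t/2) <= 2 exp(-t/2),
# so Z^2 satisfies property (i) of Lemma S29 with K1 = 2.
t = np.linspace(1e-3, 80.0, 400)
assert np.all(chi2.sf(t, 1) <= 2 * np.exp(-t / 2))
```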

The following result about centering can be obtained from \citeappendVer18, Exercise 2.7.10.

Lemma S31.

Suppose that Y is a sub-exponential random variable with tail parameter K. Then Y-\mathrm{E}Y is a sub-exponential random variable with tail parameter C_{\mathrm{sx6}}K, where C_{\mathrm{sx6}}>0 is a universal constant.

The following result can be obtained from \citeappendVer18, Corollary 2.8.3.

Lemma S32.

Suppose that (Y_{1},\ldots,Y_{n}) are independent such that \mathrm{E}Y_{i}=0 and Y_{i} is sub-exponential with tail parameter K for i=1,\ldots,n. Then

\displaystyle\mathrm{P}\left\{\left|\frac{1}{n}\sum_{i=1}^{n}Y_{i}\right|>C_{\mathrm{sx7}}K\left(\sqrt{\frac{t}{n}}\vee\frac{t}{n}\right)\right\}\leq 2\mathrm{e}^{-t}\quad\text{for any $t>0$},

where C_{\mathrm{sx7}}>0 is a universal constant.
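The mixed \sqrt{t/n} versus t/n behavior can be seen numerically for Y_{i}=Z_{i}^{2}-1 with Z_{i}\sim\mathrm{N}(0,1), whose sample-mean tail is available exactly through the chi-squared distribution; the following Python sketch (assuming numpy and scipy, with an illustrative constant 1/8 in place of the unspecified universal constant) is outside the formal development.

```python
import numpy as np
from scipy.stats import chi2

# Sample mean of Y_i = Z_i^2 - 1 (centered sub-exponential): the exact tail
# sits below a Bernstein-type curve 2 exp{-(n/8) min(s^2, s)} in this example.
n = 40
s = np.linspace(0.05, 3.0, 100)
exact = chi2.sf(n * (1 + s), n) + chi2.cdf(n * (1 - s), n)
bernstein = 2 * np.exp(-(n / 8) * np.minimum(s**2, s))
assert np.all(exact <= bernstein)
```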

V.3 Symmetrization and contraction

The following result can be obtained from the symmetrization inequality (Section 2.3.1 in \citeappendVW) and Theorem 7 in \citeappendMZ03.

Lemma S33.

Let X1,,XnX_{1},\dots,X_{n} be i.i.d. random vectors and \mathcal{F} be a class of real-valued functions such that Ef(X1)<\mathrm{E}f(X_{1})<\infty for all ff\in\mathcal{F}. Then we have

Esupf{1ni=1nf(Xi)Ef(Xi)}2Esupf{1ni=1nϵif(Xi)},\mathrm{E}\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathrm{E}f(X_{i})\right\}\leq 2\mathrm{E}\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(X_{i})\right\},

where ϵ1,,ϵn\epsilon_{1},\dots,\epsilon_{n} are i.i.d. Rademacher random variables that are independent of X1,,XnX_{1},\dots,X_{n}. The above inequality also holds with the left-hand side replaced by

Esupf{Ef(Xi)1ni=1nf(Xi)}.\mathrm{E}\sup_{f\in\mathcal{F}}\left\{\mathrm{E}f(X_{i})-\frac{1}{n}\sum_{i=1}^{n}f(X_{i})\right\}.
Proof.

For completeness, we give a direct proof. Let (X1,,Xn)(X_{1}^{\prime},\ldots,X_{n}^{\prime}) be i.i.d. copies of (X1,,Xn)(X_{1},\ldots,X_{n}). Then we have

\displaystyle\quad\mathrm{E}\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}f(X_{i})-\mathrm{E}f(X_{1})\right\}
\displaystyle=\mathrm{E}_{X_{i}}\left[\sup_{f\in\mathcal{F}}\mathrm{E}_{X_{i}^{\prime}}\left\{\frac{1}{n}\sum_{i=1}^{n}\big(f(X_{i})-f(X_{i}^{\prime})\big)\right\}\right]
\displaystyle\leq\mathrm{E}_{X_{i},X_{i}^{\prime}}\left[\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}\big(f(X_{i})-f(X_{i}^{\prime})\big)\right\}\right] (S136)
\displaystyle=\mathrm{E}_{X_{i},X_{i}^{\prime},\epsilon_{i}}\left[\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}\big(f(X_{i})-f(X_{i}^{\prime})\big)\right\}\right] (S137)
\displaystyle\leq\mathrm{E}_{X_{i},\epsilon_{i}}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(X_{i})\right\}+\mathrm{E}_{X_{i}^{\prime},\epsilon_{i}}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(-\epsilon_{i})f(X_{i}^{\prime})\right\}
\displaystyle=2\,\mathrm{E}_{X_{i},\epsilon_{i}}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(X_{i})\right\}.

Line (S136) follows from Jensen's inequality (the supremum of an expectation is no greater than the expectation of the supremum). Line (S137) follows because, for a pair of i.i.d. random variables, their difference is a symmetric random variable about 0 and its distribution remains the same when multiplied by an independent Rademacher random variable. A similar argument applies to upper bounding \mathrm{E}\sup_{f\in\mathcal{F}}\left\{\mathrm{E}f(X_{i})-\frac{1}{n}\sum_{i=1}^{n}f(X_{i})\right\}. ∎
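As a Monte Carlo illustration of the symmetrization inequality, outside the formal development, the following Python sketch (assuming numpy) uses the class of threshold functions f_{\theta}(x)=1\{x\leq\theta\} with X\sim\mathrm{Uniform}(0,1), so that \mathrm{E}f_{\theta}(X)=\theta; the threshold grid, sample size, and number of replications are arbitrary.

```python
import numpy as np

# Compare E sup_f {empirical mean - E f} with 2 E sup_f {Rademacher average}
# over the threshold class f_theta(x) = 1{x <= theta}, X ~ Uniform(0,1).
rng = np.random.default_rng(3)
n, reps = 100, 2000
thetas = np.linspace(0.0, 1.0, 51)
lhs, rhs = [], []
for _ in range(reps):
    X = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)
    F = (X[:, None] <= thetas[None, :]).astype(float)  # n x 51 function values
    lhs.append(np.max(F.mean(axis=0) - thetas))        # sup_f (P_n f - P f)
    rhs.append(np.max(eps @ F / n))                    # sup_f Rademacher avg
print(np.mean(lhs), 2 * np.mean(rhs))  # first value should not exceed second
```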

Lemma S34.

Let \phi be a function with Lipschitz constant R. Then, in the setting of Lemma S33, we have

Esupf{1ni=1nϵiϕ(f(Xi))}Esupf{1ni=1nϵiRf(Xi)}.\mathrm{E}\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}\phi(f(X_{i}))\right\}\leq\mathrm{E}\sup_{f\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}Rf(X_{i})\right\}.

V.4 Entropy and maximal inequality

For a function class \mathcal{F} in a metric space endowed with norm \|\cdot\|, the covering number \mathcal{N}(\delta,\mathcal{F},\|\cdot\|) is defined as the smallest number of balls of radius \delta in the \|\cdot\|-metric needed to cover \mathcal{F}. The entropy H(\delta,\mathcal{F},\|\cdot\|) is defined as \log\mathcal{N}(\delta,\mathcal{F},\|\cdot\|). The following maximal inequality can be obtained from Dudley's inequality for sub-gaussian variables (e.g., \citeappendVan2000, Corollary 8.3; \citeappendBLT18, Proposition 9.2), including Rademacher variables.

Lemma S35.

Let \mathcal{F} be a class of functions f:𝒳f:\mathcal{X}\to\mathbb{R}, and (ϵ1,,ϵn)(\epsilon_{1},\ldots,\epsilon_{n}) be independent Rademacher random variables. For a fixed set of points {xi𝒳:i=1,,n}\{x_{i}\in\mathcal{X}:i=1,\ldots,n\}, define the random variable

Zn()=supf|1ni=1nϵif(xi)|.\displaystyle Z_{n}(\mathcal{F})=\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(x_{i})\right|.

Suppose that supffn1\sup_{f\in\mathcal{F}}\|f\|_{n}\leq 1 and 01H1/2(u,,n)duΨn()\int_{0}^{1}H^{1/2}(u,\mathcal{F},\|\cdot\|_{n})\,\mathrm{d}u\leq\Psi_{n}(\mathcal{F}), where n\|\cdot\|_{n} is the empirical L2L_{2} norm, fn={n1i=1nf2(xi)}1/2\|f\|_{n}=\{n^{-1}\sum_{i=1}^{n}f^{2}(x_{i})\}^{1/2}. Then for any t>0t>0,

P{Zn()/Crad>n1/2(Ψn()+t)}2et22,\displaystyle\mathrm{P}\left\{Z_{n}(\mathcal{F})/C_{\mathrm{rad}}>n^{-1/2}(\Psi_{n}(\mathcal{F})+t)\right\}\leq 2\mathrm{e}^{-\frac{t^{2}}{2}}, (S138)

where Crad>0C_{\mathrm{rad}}>0 is a universal constant.

The following result, taken from \citeappendVW, Theorem 2.6.7, provides an upper bound on the entropy of a function class in terms of the VC index. For any r1r\geq 1 and probability measure QQ, the Lr(Q)L_{r}(Q) norm is defined as fr,Q=(|f|rdQ)1/r\|f\|_{r,Q}=(\int|f|^{r}\,\mathrm{d}Q)^{1/r}.

Lemma S36.

Let \mathcal{F} be a VC class of functions such that supf|f|1\sup_{f\in\mathcal{F}}|f|\leq 1. Then for any r1r\geq 1 and probability measure QQ, we have

𝒩(u,,r,Q)CvcV()(16e)V()ur(V()1)for any u(0,1),\displaystyle\mathcal{N}(u,\mathcal{F},\|\cdot\|_{r,Q})\leq C_{\mathrm{vc}}V(\mathcal{F})(16\mathrm{e})^{V(\mathcal{F})}u^{-r(V(\mathcal{F})-1)}\quad\text{for any $u\in(0,1)$},

where V()V(\mathcal{F}) denotes the VC index of \mathcal{F} and Cvc1C_{\mathrm{vc}}\geq 1 is a universal constant.
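As a toy numerical check of Lemma S36, outside the formal development, the following Python sketch (assuming numpy) compares a greedily constructed u-net of the threshold class \{1\{x\leq\theta\}\}, whose VC index is V=2, with the bound evaluated at C_{\mathrm{vc}}=1; taking C_{\mathrm{vc}}=1 is an assumption that happens to suffice in this example, and the size of any u-net, in particular the greedy one, upper bounds the covering number.

```python
import numpy as np

# Greedy u-net of the threshold class {1{x <= theta}} in the empirical L2 norm
# on n points, compared with the VC bound V (16e)^V u^{-2(V-1)} (r = 2, C_vc = 1).
rng = np.random.default_rng(4)
n = 500
x = np.sort(rng.uniform(size=n))
thetas = np.concatenate(([0.0], x, [1.0]))         # all distinct behaviors
F = (x[:, None] <= thetas[None, :]).astype(float)  # n x (n+2) function values

def greedy_net_size(F, u):
    # Pick a function, discard all functions within empirical L2 distance u,
    # repeat; the selected centers form a u-net of the columns of F.
    remaining = list(range(F.shape[1]))
    size = 0
    while remaining:
        c = remaining[0]
        dist = np.sqrt(np.mean((F[:, remaining] - F[:, [c]])**2, axis=0))
        remaining = [remaining[i] for i in np.nonzero(dist > u)[0]]
        size += 1
    return size

V = 2  # VC index of the threshold class
for u in [0.5, 0.25, 0.1]:
    bound = V * (16 * np.e)**V * u**(-2 * (V - 1))
    assert greedy_net_size(F, u) <= bound
```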

We deduce the following implications of the preceding results, which can be used in conjunction with Lemmas S16–S17.

Corollary S1.

In the setting of Lemma S35, the random variable Zn()Z_{n}(\mathcal{F}) is sub-gaussian with tail parameter Cradn1/2(Ψn()+1)C_{\mathrm{rad}}n^{-1/2}(\Psi_{n}(\mathcal{F})+1).

Proof.

By (S138), the result follows from an application of Lemma S24. ∎

Corollary S2.

In the setting of Lemma S35, the following results hold.

(i) If supf|f|1\sup_{f\in\mathcal{F}}|f|\leq 1, then Zn()Z_{n}(\mathcal{F}) is sub-gaussian with tail parameter Crad2V()/nC_{\mathrm{rad2}}\sqrt{V(\mathcal{F})/n}, where Crad2=Crad{1+2+log(16Cvc)+012log(u1)du}C_{\mathrm{rad2}}=C_{\mathrm{rad}}\{1+\sqrt{2+\log(16C_{\mathrm{vc}})}+\int_{0}^{1}\sqrt{2\log(u^{-1})}\,\mathrm{d}u\}.

(ii) Consider two further classes \mathcal{G} and \mathcal{H} of functions from \mathcal{X} to \mathbb{R}, in addition to \mathcal{F}, and let \mathcal{F}_{\mathrm{com}}=\{fg+h:f\in\mathcal{F},g\in\mathcal{G},h\in\mathcal{H}\}. If \sup_{f\in\mathcal{F}\cup\mathcal{G}\cup\mathcal{H}}|f|\leq 1, then Z_{n}(\mathcal{F}_{\mathrm{com}}) is sub-gaussian with tail parameter C_{\mathrm{rad3}}\sqrt{\{V(\mathcal{F})+V(\mathcal{G})+V(\mathcal{H})\}/n}, where C_{\mathrm{rad3}}=C_{\mathrm{rad}}\{1+\sqrt{2+\log(16C_{\mathrm{vc}})}+\int_{0}^{1}\sqrt{2\log(3u^{-1})}\,\mathrm{d}u\}.

Proof.

(i) Take r=2r=2 and QQ to be the empirical distribution on {x1,,xn}\{x_{1},\ldots,x_{n}\}. By Lemma S36, the entropy integral 01H1/2(u,,n)du\int_{0}^{1}H^{1/2}(u,\mathcal{F},\|\cdot\|_{n})\,\mathrm{d}u can be upper bounded by

01log1/2{CvcV()(16e)V()u2(V()1)}du\displaystyle\quad\int_{0}^{1}\log^{1/2}\left\{C_{\mathrm{vc}}V(\mathcal{F})(16\mathrm{e})^{V(\mathcal{F})}u^{-2(V(\mathcal{F})-1)}\right\}\,\mathrm{d}u
=01{log(Cvc)+logV()+V()log(16e)+2(V()1)log(u1)}1/2du\displaystyle=\int_{0}^{1}\left\{\log(C_{\mathrm{vc}})+\log V(\mathcal{F})+V(\mathcal{F})\log(16\mathrm{e})+2(V(\mathcal{F})-1)\log(u^{-1})\right\}^{1/2}\,\mathrm{d}u
V()01{2+log(16Cvc)+2log(u1)}du,\displaystyle\leq\sqrt{V(\mathcal{F})}\int_{0}^{1}\left\{\sqrt{2+\log(16C_{\mathrm{vc}})}+\sqrt{2\log(u^{-1})}\right\}\,\mathrm{d}u,

using \log V(\mathcal{F})\leq V(\mathcal{F}) for V(\mathcal{F})\geq 1 and \sqrt{u_{1}+u_{2}}\leq\sqrt{u_{1}}+\sqrt{u_{2}}. Taking \Psi_{n}(\mathcal{F}) in Lemma S35 to be the right-hand side of the preceding display yields the desired result by Corollary S1, using also \sqrt{V(\mathcal{F})}\geq 1.

(ii) First, we show that the covering number \mathcal{N}(u,\mathcal{F}_{\mathrm{com}},\|\cdot\|_{n}) is upper bounded by the product \mathcal{N}(u/3,\mathcal{F},\|\cdot\|_{n})\,\mathcal{N}(u/3,\mathcal{G},\|\cdot\|_{n})\,\mathcal{N}(u/3,\mathcal{H},\|\cdot\|_{n}). Denote by \hat{\mathcal{F}} a (u/3)-net of \mathcal{F} with cardinality \mathcal{N}(u/3,\mathcal{F},\|\cdot\|_{n}), and similarly denote by \hat{\mathcal{G}} and \hat{\mathcal{H}} (u/3)-nets of \mathcal{G} and \mathcal{H} with cardinalities \mathcal{N}(u/3,\mathcal{G},\|\cdot\|_{n}) and \mathcal{N}(u/3,\mathcal{H},\|\cdot\|_{n}) respectively. For any f\in\mathcal{F}, g\in\mathcal{G} and h\in\mathcal{H}, there exist \hat{f}\in\hat{\mathcal{F}}, \hat{g}\in\hat{\mathcal{G}} and \hat{h}\in\hat{\mathcal{H}} such that

f^fnu/3,g^gnu/3,h^hnu/3.\displaystyle\|\hat{f}-f\|_{n}\leq u/3,\quad\|\hat{g}-g\|_{n}\leq u/3,\quad\|\hat{h}-h\|_{n}\leq u/3.

By the triangle inequality and supf𝒢|f|1\sup_{f\in\mathcal{F}\cup\mathcal{G}}|f|\leq 1, we have

f^g^+h^fghn\displaystyle\quad\|\hat{f}\hat{g}+\hat{h}-fg-h\|_{n}
(f^f)g^n+f(g^g)n+h^hn\displaystyle\leq\|(\hat{f}-f)\hat{g}\|_{n}+\|f(\hat{g}-g)\|_{n}+\|\hat{h}-h\|_{n}
f^fn+g^gn+h^hnu.\displaystyle\leq\|\hat{f}-f\|_{n}+\|\hat{g}-g\|_{n}+\|\hat{h}-h\|_{n}\leq u.

This shows that ^com={f^g^+h^:f^^,g^𝒢^,h^^}\hat{\mathcal{F}}_{\mathrm{com}}=\{\hat{f}\hat{g}+\hat{h}:\hat{f}\in\hat{\mathcal{F}},\hat{g}\in\hat{\mathcal{G}},\hat{h}\in\hat{\mathcal{H}}\} is a uu-net of com\mathcal{F}_{\mathrm{com}} with respect to n\|\cdot\|_{n}. Hence the covering number 𝒩(u,com,n)\mathcal{N}(u,\mathcal{F}_{\mathrm{com}},\|\cdot\|_{n}) is upper bounded by the cardinality of ^com\hat{\mathcal{F}}_{\mathrm{com}}, that is, 𝒩(u/3,,n)𝒩(u/3,𝒢,n)𝒩(u/3,,n)\mathcal{N}(u/3,\mathcal{F},\|\cdot\|_{n})\mathcal{N}(u/3,\mathcal{G},\|\cdot\|_{n})\mathcal{N}(u/3,\mathcal{H},\|\cdot\|_{n}).

Next, by Lemma S36 applied to \mathcal{F}, \mathcal{G}, and \mathcal{H} and a similar calculation as in (i), the entropy integral \int_{0}^{1}H^{1/2}(u,\mathcal{F}_{\mathrm{com}},\|\cdot\|_{n})\,\mathrm{d}u can be upper bounded by

01{log𝒩(u/3,,n)+log𝒩(u/3,𝒢,n)+log𝒩(u/3,,n)}1/2du\displaystyle\quad\int_{0}^{1}\left\{\log\mathcal{N}(u/3,\mathcal{F},\|\cdot\|_{n})+\log\mathcal{N}(u/3,\mathcal{G},\|\cdot\|_{n})+\log\mathcal{N}(u/3,\mathcal{H},\|\cdot\|_{n})\right\}^{1/2}\,\mathrm{d}u
V()+V(𝒢)+V()01{2+log(16Cvc)+2log(3u1)}du.\displaystyle\leq\sqrt{V(\mathcal{F})+V(\mathcal{G})+V(\mathcal{H})}\int_{0}^{1}\left\{\sqrt{2+\log(16C_{\mathrm{vc}})}+\sqrt{2\log(3u^{-1})}\right\}\,\mathrm{d}u.

The desired result follows by applying Corollary S1 to the class com\mathcal{F}_{\mathrm{com}}, with Ψn(com)\Psi_{n}(\mathcal{F}_{\mathrm{com}}) taken to be the right-hand side of the preceding display. ∎

\bibliographystyleappend{imsart-nameyear}
\bibliographyappend{final}