
A Distributionally Robust Estimator that
Dominates the Empirical Average

Nikolas Koumpis   and   Dionysis Kalogerias
Yale University
Contact: {nikolaos.koumpis,dionysis.kalogerias}@yale.edu

Abstract

We leverage the duality between risk-averse and distributionally robust optimization (DRO) to devise a distributionally robust estimator that strictly outperforms the empirical average for all probability distributions with negative excess kurtosis. The aforesaid estimator solves the $\chi^{2}$-robust mean squared error problem in closed form.

1 Introduction

The aim of this paper is to deploy the basic insights of DRO in order to devise an empirical estimator that outperforms the (commonly used) sample average w.r.t. the mean squared error; the latter being a standard figure of merit for measuring losses in statistics and machine learning (Girshick and Savage, 1951; James and Stein, 1961; Devroye et al., 2003; Wang and Bovik, 2009; Verhaegen and Verdult, 2007; Kuhn et al., 2019).

To set the stage for our investigation, let $\mathrm{B}(\mathbb{R}^{d})$ be the set of Borel probability measures on $\mathbb{R}^{d}$ and $D=\{X_{1},\dots,X_{n}\}$ a set of independent identically distributed (i.i.d.) random vectors realizing $X\sim\mu\in\mathrm{B}(\mathbb{R}^{d})$. The empirical measure corresponding to $D$ is $\mu_{n}=n^{-1}\sum^{n}_{i=1}\delta_{X_{i}}$, where $\delta_{x}$ is the Delta measure of unit mass located at $x\in\mathbb{R}^{d}$.

By defining the real-valued map $f(\xi,\widehat{\mathrm{X}})\equiv\mathbb{E}_{\xi}\|\widehat{\mathrm{X}}-X\|^{2}_{2}$, $\xi\in\mathrm{B}(\mathbb{R}^{d})$, we obtain the following characterization of the empirical average

\alpha(D;\mu_{n})=\dfrac{1}{n}\sum^{n}_{i=1}X_{i}=\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{argmin}}\,f(\mu_{n},\widehat{\mathrm{X}}),  (1)

as the solution to the surrogate of the “true” problem

\min_{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}f(\mu,\widehat{\mathrm{X}}),  (2)

when $\mu$ is unknown. Plausibly then, we are interested in the performance of (1) on $\mu$:

\pi_{\alpha}(\mu,\mu_{n})\equiv f(\mu,\alpha(D;\mu_{n}))-f^{\star},  (3)

where $f^{\star}=f(\mu,\mathbb{E}X)$ is the optimal value of (2) (we assume that the first-order moment is finite). Although (1) dominates among unbiased estimators for normally distributed low dimensional data, in the case of skewed or heavy-tailed measures it exhibits rather sub-optimal performance and therefore leaves room for improvement. Such an improvement could potentially be achieved by an appropriate re-weighting $\nu_{n}\in\mathrm{B}(\mathbb{R}^{d})$, $\nu_{n}\ll\mu_{n}$, on $D$, such that

\pi_{\alpha}(\mu,\nu_{n})\leq\pi_{\alpha}(\mu,\mu_{n}).  (4)

The most prominent approach for doing so is via distributionally robust optimization. As a paradigm, DRO (Scarf et al., 1957) has gained interest due to its connections to regularization, generalization, robustness, and statistics (Staib and Jegelka, 2019; Bertsimas and Sim, 2004; Blanchet et al., 2021; Blanchet and Shapiro, 2023; Nguyen et al., 2023; Blanchet et al., 2024).

Let $\mathrm{F}(\beta,\rho)$ be the field of DRO problem instances $\mathrm{P}$ of the form

\min_{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}\bigg\{\sup_{\xi\ll\mu}f(\xi,\widehat{\mathrm{X}})~:~\xi\in\beta_{\rho}(\mu)\bigg\},  (5)

where $\beta_{\rho}(\mu)\subseteq\mathrm{B}(\mathbb{R}^{d})$ is the uncertainty set of radius $\rho\geq 0$ centered at $\mu$. The two main approaches for representing the uncertainty set, recently unified by (Blanchet et al., 2023), are the divergence approach and the Wasserstein approach. In the former one, distributional shifts are measured in terms of likelihood ratios (Bertsimas and Sim, 2004; Bayraksan and Love, 2015; Namkoong and Duchi, 2016; Duchi and Namkoong, 2018; Van Parys et al., 2021), while in the latter, the uncertainty set is represented by a Wasserstein ball (Esfahani and Kuhn, 2015; Gao and Kleywegt, 2023; Lee and Raginsky, 2018; Kuhn et al., 2019).

Let $\widehat{\mathrm{X}}^{\star}$ be a solution of $\mathrm{P}\in\mathrm{F}(\beta,\rho)$. Based on the discussion so far, we are interested in an ideal sub-field of instances $\mathrm{F}^{\star}(\beta,\rho)\subset\mathrm{F}(\beta,\rho)$ formed by the following design specifications. First, if $\mathrm{P}\in\mathrm{F}^{\star}(\beta,\rho)$, then there exists a saddle point $(\widehat{\mathrm{X}}^{\star}(\xi^{\star}),\xi^{\star}(\widehat{\mathrm{X}}^{\star}))$. This first specification implies that necessarily

\widehat{\mathrm{X}}^{\star}=\widehat{\mathrm{X}}^{\star}(\xi^{\star})=\mathbb{E}_{\xi^{\star}(\widehat{\mathrm{X}}^{\star})}X.  (6)

In some cases, robustness against distributional shifts amounts to safeguarding against risky events (naturally coordinated by the distribution tail). That is, solving a DRO problem is equivalent to solving a risk-averse problem. This brings us to the second specification, which requires $\widehat{\mathrm{X}}^{\star}$ to be the closed-form solution of a risk-averse problem (over $\mu$). Combining the aforesaid specifications would allow us to implement the empirical counterpart $\alpha(D;\xi^{\star}_{\rho,n})$ of $\widehat{\mathrm{X}}^{\star}$ from data $D$ drawn from $\mu$. That being said, in this paper we study the following motivating question:

Is there an instance $\mathrm{P}\in\mathrm{F}^{\star}(\beta,\rho)$ such that $\pi_{\alpha}(\mu,\xi^{\star}_{\rho,n}(\beta))<\pi_{\alpha}(\mu,\mu_{n})$?

The aforesaid dual relation between risk-averse optimization and DRO suggests a place to start our design. In particular, our methodology is as follows: we state a risk-averse problem (which in principle is easier to design) that is also solvable in closed form; we then show that, under certain conditions, its solution also solves a certain, precisely defined DRO problem; finally, we verify our initial assertion by studying the performance of the corresponding empirical solution. We start by considering the following risk-constrained extension of (2):

\begin{array}{cl}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\operatorname{min}}&f(\mu,\widehat{\mathrm{X}})\\ \text{s.t.}&\mathbb{E}_{\mu}\big(\|X-\widehat{\mathrm{X}}\|_{2}^{2}-\mathbb{E}_{\mu}\|X-\widehat{\mathrm{X}}\|_{2}^{2}\big)^{2}\leq\varepsilon.\end{array}  (7)

The risk-averse problem (7) is a quadratically constrained quadratic program and admits the closed form solution (see Section 2)

\widehat{\mathrm{X}}^{\star}_{\lambda^{\star}}=\big(I+4\lambda^{\star}\Sigma\big)^{-1}\Big(\mathbb{E}X+2\lambda^{\star}\big(\mathbb{E}[\|X\|^{2}_{2}X]-\mathbb{E}\|X\|^{2}_{2}\mathbb{E}X\big)\Big),  (8)

where $\lambda^{\star}\geq 0$. Risk-averse optimization refers to the optimization of risk measures. Unlike expectations, these are (often) convex functionals on appropriate spaces of random variables and deliver optimizers that take into account the risky events of the underlying distribution. Moreover, a convex risk measure can be represented through the Legendre transform as the supremum of its affine minorants. This representation essentially characterizes the risk measure as a supremum over a certain class of affine functionals, specifically over a subset of the dual space of signed measures (Shapiro et al., 2021). For the particular class of coherent risk measures, this set contains probability measures and thus, safeguarding against risky events amounts to being robust against distributional shifts. In other words, solving a risk-constrained problem on the population at hand is equivalent to solving a min-max problem where expectations are taken w.r.t. an appropriate re-weighting of that population.

That said, the challenge is that minimization of a risk measure does not always correspond to a DRO formulation. As an example, problem (7) is closely related to the mean-variance risk measure (Shapiro et al., 2021, p. 275), which does not lead to a DRO formulation in general. Mean-variance optimization was initially utilized by (Markowitz, 1952) in portfolio selection. Later on, (Abeille et al., 2016) deployed mean-variance criteria for dynamic portfolio allocation problems, and (Kalogerias et al., 2020) stressed the importance of safeguarding against risky events in mean squared error Bayesian estimation. DRO with the $\chi^{2}$-divergence is almost equivalent to controlling the variance (Gotoh et al., 2018; Lam, 2016; Duchi and Namkoong, 2019), where the worst-case measure is computed exactly in $O(n\log(n))$ (Staib and Jegelka, 2019). Although minimization of the mean-variance risk measure does not correspond to an element of $\mathrm{F}(\beta,\rho)$, the mean-deviation risk measure of order $p=2$ does (see Theorem 1).

At this point, we state informally the first result of this paper (see Theorem 1), which establishes that the candidate problem (7) corresponds to an instance $\mathrm{P}\in\mathrm{F}^{\star}(\beta,\rho)$:

Informal Theorem 1.

There exists a constant $\lambda^{\star}>0$ such that, for every $\lambda\leq\lambda^{\star}$, $f$ has a saddle point and (8) solves the $\chi^{2}$-DRO problem

\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}~\bigg\{\sup_{\nu\ll\mu}~\mathbb{E}_{\nu}\|\widehat{\mathrm{X}}-X\|^{2}_{2}~:~\chi^{2}(\nu\,||\,\mu)\leq\lambda\bigg\}.  (9)

As discussed later, Theorem 1 states that, for the corresponding levels of risk $\varepsilon$ of (7), the closed-form solution of (7) solves all $\mathrm{P}\in\mathrm{F}(\chi^{2},\lambda^{\star})$. We answer the main question of the paper in the one-dimensional case (see Theorem 2): the empirical counterpart of (8) dominates the empirical average for all probability distributions with negative excess kurtosis $\kappa$.

Informal Theorem 2.

Let $\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}$ be the empirical counterpart of (8), and $\bar{\mathrm{X}}_{n}$ be the empirical average. For all platykurtic probability distributions, there is $\lambda_{n}=O(|\kappa|/n)$ such that, for every $n\geq 3$ and every $\bar{\lambda}<\lambda_{n}$,

\mathbb{E}|\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}-\mathbb{E}X|^{2}<\mathbb{E}|\bar{\mathrm{X}}_{n}-\mathbb{E}X|^{2}.  (10)
Figure 1: Gaussian mixture model of two mixands with parameters $\mathrm{mean}_{1}=3$, $\mathrm{mean}_{2}=0.1$, $\sigma_{1}=0.5$, $\sigma_{2}=1.7$, and mixing proportions $\pi_{1}=0.3$, $\pi_{2}=1-\pi_{1}$. The condition of Theorem 2 is satisfied with $3m_{2}^{2}-m_{4}=11.4622$. Specifically, the variance of the mixture is calculated to be $m_{2}=3.826$, and the fourth central moment is approximately $m_{4}=32.46$. We draw $n=23$ i.i.d. samples (left) and we test $\widehat{\mathrm{X}}_{23,\bar{\lambda}}^{\star}=1.963-3.61\bar{\lambda}$ against the nominal population simulated with $n=10000$ samples (middle) w.r.t. the mse (right), $\operatorname{mse}(\bar{\lambda})=\mathbb{E}_{\mu}(1.963-3.61\bar{\lambda}-X)^{2}$.
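
A minimal sketch of the experiment in Figure 1, assuming a Python/numpy environment (our illustration, not the authors' code; a fresh random draw gives numbers different from those quoted in the caption):

import numpy as np

rng = np.random.default_rng(0)

# Mixture parameters taken from the caption of Figure 1.
means = np.array([3.0, 0.1])
sigmas = np.array([0.5, 1.7])
props = np.array([0.3, 0.7])

def sample_mixture(n):
    comp = rng.choice(2, size=n, p=props)          # pick a mixand per sample
    return rng.normal(means[comp], sigmas[comp])

# Small data set and the ingredients of the simplified estimator (36).
D = sample_mixture(23)
x_bar = D.mean()                                    # empirical average
t_n = np.mean((D - x_bar) ** 3)                     # empirical third central moment

# Nominal population approximated with 10000 samples, and the mse as a function of lambda_bar.
X_pop = sample_mixture(10_000)
for lam_bar in (0.0, 0.02, 0.05, 0.1):
    mse = np.mean((x_bar + lam_bar * t_n - X_pop) ** 2)
    print(f"lambda_bar = {lam_bar:.2f}: mse = {mse:.4f}")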

All proofs of the results presented hereafter can be found in the supplementary material.

2 Risk-constrained minimum mean squared error estimators

We begin analyzing the risk-constrained mmse problem (7) by first referring the reader to (Kalogerias et al., 2019, Lemma 1), where the well-definedness of problem (7) is ensured under the following regularity condition.

Assumption 1.

It is true that $\int\|x\|_{2}^{3}\,\mathrm{d}\mu(x)<+\infty$.

By noticing that the constraint in (7) can be written as a perfect square, we establish a quadratic reformulation that on the one hand, allows us to interpret the solution to (7) as a stereographic projection, and on the other, sets the stage for a derivative-free solution:

Lemma 1 (The risk-constrained estimator as stereographic projection).

Under Assumption 1, problem (7) is well-defined and equivalent to the convex quadratic program

\begin{array}{cl}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}&\mathbb{E}_{\mu}\|X-\widehat{\mathrm{X}}\|^{2}_{2}\\ \mathrm{s.t.}&\|\widehat{\mathrm{X}}-\frac{1}{4}\Sigma^{-1}\zeta^{-}\|^{2}_{4\Sigma}\leq\bar{\varepsilon},\end{array}  (11)

where $\zeta^{-}=2\big(\mathbb{E}[\|X\|^{2}_{2}X]-\mathbb{E}\|X\|^{2}_{2}\mathbb{E}X\big)$, $\bar{\varepsilon}=\varepsilon+(\zeta^{-})^{\top}\frac{\Sigma^{-1}}{4}\zeta^{-}-\alpha$, $\alpha=\mathbb{E}(\|X\|^{2}-\mathbb{E}\|X\|^{2})^{2}$, and $\Sigma\succcurlyeq 0$ is the covariance matrix of $X$. (By $\|\cdot\|_{Q}$ we mean the quadratic form $x^{\top}Qx$ w.r.t. the positive semi-definite matrix $Q\in\mathbb{R}^{d\times d}$.)

Lemma 1 shows the equivalence of (7) to the convex quadratic program (11), allowing us to interpret the former in geometric terms based on the standard inner product in $\mathbb{R}^{d}$. Within this setting, the solution to (7) is a stereographic projection of the $\mathrm{mmse}$-sphere onto $\mathbb{R}^{d}$. Further, we can do slightly more by applying Pythagoras' theorem to the objective in (11) to obtain

\begin{array}{cl}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}&\|\widehat{\mathrm{X}}-\mathbb{E}X\|^{2}_{2}\\ \mathrm{s.t.}&\|\widehat{\mathrm{X}}-\frac{1}{4}\Sigma^{-1}\zeta^{-}\|^{2}_{4\Sigma}\leq\bar{\varepsilon},\end{array}  (12)

where by $\zeta^{-}$ we mean the opposite vector to $\zeta$. According to Lemma 1, the solution of (12) (and therefore of (7)) is the point lying on the ellipse of radius $\sqrt{\bar{\varepsilon}}$ and of center $\frac{1}{4}\Sigma^{-1}\zeta^{-}$ with the least distance from the orthogonal projection $\mathbb{E}X$. Guided by the geometric picture, the problem is feasible for any $\bar{\varepsilon}\geq 0$. In particular, for all $\bar{\varepsilon}\geq\bar{\varepsilon}_{c}=\|\mathbb{E}X-\frac{1}{4}\Sigma^{-1}\zeta^{-}\|^{2}_{4\Sigma}$, $\widehat{\mathrm{X}}^{*}=\mathbb{E}X$, while for $\bar{\varepsilon}=0$, a solution exists only when $\mathbb{E}X=\frac{1}{4}\Sigma^{-1}\zeta^{-}$. Moving forward, for all $\bar{\varepsilon}\in(0,\bar{\varepsilon}_{c})$ the solution is the unique touching point of the circle and the ellipse, and it is obtained by equating the “equilibrium forces”. However, based on Lemma 1, we may proceed derivative-free (see Appendix A).

Proposition 1 (The risk-constrained mmse estimator).

Under Assumption 1, the optimal solution to (7) is

\widehat{\mathrm{X}}^{\star}_{\lambda^{\star}}=\big(I+4\lambda^{*}\Sigma\big)^{-1}\Big(\mathbb{E}X+2\lambda^{*}\big(\mathbb{E}[\|X\|^{2}_{2}X]-\mathbb{E}\|X\|^{2}_{2}\mathbb{E}X\big)\Big),  (13)

where $\lambda^{*}\geq 0$ is the solution to the corresponding dual problem.
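
For concreteness, here is a minimal numpy sketch (ours, not the authors' implementation) that evaluates (13) with plug-in empirical moments; it treats $\lambda$ as a user-supplied parameter rather than solving the dual problem, and the exponential toy data are an arbitrary choice:

import numpy as np

def risk_constrained_estimator(X, lam):
    # Plug-in evaluation of (13) for an (n, d) sample X and a given lambda >= 0.
    n, d = X.shape
    mean = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)                  # covariance matrix
    sq_norm = np.sum(X ** 2, axis=1)                            # ||X_i||_2^2
    zeta = 2.0 * (X.T @ sq_norm / n - sq_norm.mean() * mean)    # 2(E[||X||^2 X] - E||X||^2 E X)
    return np.linalg.solve(np.eye(d) + 4.0 * lam * Sigma, mean + lam * zeta)

rng = np.random.default_rng(1)
X = rng.exponential(scale=[1.0, 2.0], size=(500, 2))            # right-skewed toy data
print(risk_constrained_estimator(X, lam=0.0))                   # recovers the sample mean
print(risk_constrained_estimator(X, lam=0.05))                  # shifted towards the right tail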

Because, by construction, (7) aims to reduce the variability of the estimation error, it is expected to do so by shifting the optimal estimates towards the areas suffering high loss, that is, towards the tail of the underlying distribution. The next result characterizes the bias of (13): the risk-constrained estimator is a shifted version of the conditional expectation by a fraction of Fisher's moment coefficient of skewness plus the cross third-order statistics of $X$.

Corollary 1 (Risk-constrained mmse estimator and Fisher’s skewness).

The estimator (13) operates according to

\widehat{\mathrm{X}}^{\star}_{i,u,\lambda}=\mathbb{E}X_{i,u}+\frac{\lambda\sigma_{i}\sqrt{\sigma_{i}}}{1+2\lambda\sigma_{i}}\,\mathbb{E}\left(\frac{X_{i,u}-\mathbb{E}X_{i,u}}{\sqrt{\sigma_{i}}}\right)^{3}+\mathrm{T}_{i,\lambda},  (14)

where $X_{i,u}$ and $\widehat{\mathrm{X}}^{\star}_{i,u,\lambda}$ denote the $i$th components of $X_{u}\equiv U^{\top}X$ and $U^{\top}\widehat{\mathrm{X}}^{*}_{\lambda}$, respectively, $U$ is the orthonormal matrix whose columns are the eigenvectors of $\Sigma_{X}$, and
\mathrm{T}_{i,\lambda}=\frac{\lambda}{1+2\lambda\sigma_{i}}\sum_{k\neq i}\big(\mathbb{E}X^{2}_{k,u}\mathbb{E}X_{i,u}-\mathbb{E}[X^{2}_{k,u}X_{i,u}]\big).

In an effort to minimize the squared error on average while keeping its variance small, (13) attributes the estimation error made by the conditional expectation to the skewness of the conditional distribution along the eigen-directions of the conditional expected error. In particular, in the one-dimensional case, cross third-order statistics vanish and therefore

\widehat{\mathrm{X}}^{\star}_{\lambda}=\mathbb{E}X+\frac{\lambda\sigma\sqrt{\sigma}}{1+2\lambda\sigma}\,\mathbb{E}\Bigg(\frac{X-\mathbb{E}X}{\sqrt{\sigma}}\Bigg)^{3},\quad\lambda\in[0,+\infty).  (15)

Corollary 1 reveals the “internal mechanism” with which (13) biases towards the tail of the underlying distribution. Per error eigen-direction, (13) compensates for the high losses relative to the expectation by a fraction of Fisher's skewness coefficient and a term $\mathrm{T}_{i}$ that captures the cross third-order statistics of the state components. Although (14) involves the cross third-order statistics of the projected state, these statistics cannot emerge artificially from a coordinate change such as $U^{\top}X$, but are intrinsic to the state itself. Besides, a meaningful measure of risk should always be invariant to coordinate changes.
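
As a concrete illustration (our example, not taken from the paper), let $\mu$ be the unit-rate exponential distribution, so that $\mathbb{E}X=1$, $\sigma=1$, and $\mathbb{E}\big((X-\mathbb{E}X)/\sqrt{\sigma}\big)^{3}=2$. Then (15) gives $\widehat{\mathrm{X}}^{\star}_{\lambda}=1+\frac{2\lambda}{1+2\lambda}$, which coincides with the mean at $\lambda=0$ and moves monotonically towards the right tail, approaching $2$ as $\lambda\rightarrow\infty$.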

Motivated by the geometric picture and the Lagrangian of (7), we note that there exists a unique random element such that (13) can be viewed as its corresponding average (see Appendix A). Fortunately, such an element can be expressed as a Radon-Nikodym derivative w.r.t. $\mu$.

Lemma 2 (Risk constrained estimator as a weighted average).

The primal optimal solution of (7) can be written as

\widehat{\mathrm{X}}^{\star}_{\lambda}=\mathbb{E}_{\xi(\widehat{\mathrm{X}}^{\star}_{\lambda},\lambda)}X,  (16)

where

\frac{\mathrm{d}\xi}{\mathrm{d}\mu}(X)\equiv 1+2\lambda\big(\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}-\mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}\big).  (17)

In particular, for $\lambda\leq 1/\gamma^{*}$, where $\gamma^{*}=2\big(\mathbb{E}\|\widehat{\mathrm{X}}^{\star}_{\infty}-\widehat{\mathrm{X}}^{\star}_{0}\|^{2}_{2}+\mathrm{mmse}(\mu)\big)$, it follows that $\frac{\mathrm{d}\xi}{\mathrm{d}\mu}\geq 0$.
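
As an illustration of Lemma 2 (our sketch, assuming numpy and a $\lambda$ small enough that the induced affine map is a contraction with factor $\|4\lambda\Sigma\|<1$), the fixed point of (16)-(17) can be computed by iterating the re-weighted average:

import numpy as np

def reweighted_fixed_point(X, lam, iters=100):
    # Iterate (16)-(17): x_hat = E_xi[X] with dxi/dmu = 1 + 2*lam*(loss - mean loss).
    x_hat = X.mean(axis=0)
    for _ in range(iters):
        loss = np.sum((X - x_hat) ** 2, axis=1)             # ||X_i - x_hat||_2^2
        w = 1.0 + 2.0 * lam * (loss - loss.mean())          # empirical Radon-Nikodym derivative (17)
        x_hat = (w[:, None] * X).mean(axis=0)               # weights average to one
    return x_hat, w

rng = np.random.default_rng(2)
X = rng.exponential(size=(1000, 1))                         # positively skewed toy sample
x_hat, w = reweighted_fixed_point(X, lam=0.05)
print(X.mean(axis=0), x_hat)      # re-weighted mean sits to the right of the sample mean
print(w.min() >= 0)               # nonnegative weights, consistent with lam <= 1/gamma*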

Combining Lemma 2 and (15), we obtain (where we explicitly indicate the measure w.r.t. which expectations are taken)

\mathbb{E}_{\xi_{\lambda}}X=\mathbb{E}_{\mu}X+\frac{\lambda\sigma\sqrt{\sigma}}{1+2\lambda\sigma}\,\mathbb{E}_{\mu}\left(\frac{X-\mathbb{E}_{\mu}X}{\sqrt{\sigma}}\right)^{3},\quad\lambda\leq 1/\gamma^{*}.  (18)

In words, averaging over the re-distribution $\xi$ is the same as biasing the expectation w.r.t. the initial measure $\mu$ towards the tail of $\mu$ by a fraction of Fisher's moment coefficient of skewness. This fact can be considered a strong indication of another manifestation of the duality between risk-averse and distributionally robust optimization. In addition, for the one-dimensional case, assuming that $\mu$ is positively skewed, we obtain the bound

\mathbb{E}_{\xi(\gamma^{*})}X-\mathbb{E}_{\mu}X\leq\frac{\sigma\sqrt{\sigma}}{2\Big(\frac{\sigma}{2}\mathbb{E}\Big|\frac{X-\mathbb{E}X}{\sqrt{\sigma}}\Big|^{3}+\mathrm{mmse}(\mu)\Big)+2\sigma}\,\mathbb{E}\Bigg(\frac{X-\mathbb{E}X}{\sqrt{\sigma}}\Bigg)^{3}.  (19)

That is, the expectations of neighboring populations that can be tracked with re-weighting are determined by a fraction of the skewness of the population at hand. In the case of empirical measures, (19) implies that lack of mass in the area of the tail is seen as the presence of risky events.

3 Distributionally robust mean square error estimation

In this section we show that (16) (and therefore the corresponding solution of (7)) is the solution to a $\chi^{2}$-DRO problem for all $\chi^{2}$-radii less than a computable constant. We start by taking into account the equivalence between (7) and the mean-deviation risk measure of order $p=2$ (see e.g. (Shapiro et al., 2021)). To not overload the notation, let us write $\|X-\widehat{\mathrm{X}}\|^{2}_{2}=Z_{\widehat{\mathrm{X}}}$. Then, (7) reads

\begin{array}{cl}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\operatorname{min}}&\mathbb{E}_{\mu}\,Z_{\widehat{\mathrm{X}}}\\ \text{s.t.}&\mathrm{var}_{\mu}Z_{\widehat{\mathrm{X}}}\leq\varepsilon,\end{array}  (20)

or equivalently

\begin{array}{cl}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\operatorname{min}}&\mathbb{E}_{\mu}\,Z_{\widehat{\mathrm{X}}}\\ \text{s.t.}&\sqrt{\mathrm{var}_{\mu}Z_{\widehat{\mathrm{X}}}}\leq\sqrt{\varepsilon}.\end{array}  (21)

It is a standard exercise to verify that the constraint in (21) is convex w.r.t. $\widehat{\mathrm{X}}$, which leads us to the following result.

Lemma 3 (Relation between the Lagrange multipliers of (20) and (21)).

Let $\lambda^{*}_{mv}$ and $\lambda^{*}_{md}$ be the Lagrange multipliers associated with the constraints of (20) and (21), respectively. Then, for all $\varepsilon\geq\varepsilon_{\min}$ it holds that

\lambda^{*}_{md}(\varepsilon)=2\lambda^{*}_{mv}(\varepsilon)\sqrt{\varepsilon}.  (22)

Moving forward, strong Lagrangian duality yields

\begin{array}{cl}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\operatorname{min}}&\mathbb{E}_{\mu}\,Z_{\widehat{\mathrm{X}}}\\ \text{s.t.}&\sqrt{\mathrm{var}_{\mu}Z_{\widehat{\mathrm{X}}}}\leq\sqrt{\varepsilon}\end{array}~=~-\lambda_{md}^{\star}(\varepsilon)\sqrt{\varepsilon}+\min_{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}\rho(Z_{\widehat{\mathrm{X}}}),  (25)

where

\rho(Z_{\widehat{\mathrm{X}}})\equiv\mathbb{E}_{\mu}Z_{\widehat{\mathrm{X}}}+\lambda_{md}^{\star}(\varepsilon)\sqrt{\mathrm{var}_{\mu}Z_{\widehat{\mathrm{X}}}},  (26)

is the mean-deviation risk measure of order $p=2$ w.r.t. $\mu$ (see e.g. (Shapiro et al., 2021)), with sensitivity constant the optimal Lagrange multiplier $\lambda^{\star}_{md}(\varepsilon)$.

3.1 Dual representation of the mean-deviation risk measure of order $p=2$

Equation (26) expresses the mean-deviation risk measure in its primal form, where the sensitivity constant is identified with the Lagrange multiplier associated with the constraint in (7). Aiming to study robustness to distributional shifts, the place to start is the dual representation of (26). Following (Shapiro et al., 2021), the standard deviation in (26) can be expressed via the norm in $\mathcal{L}_{2}$

(\mathrm{var}_{\mu}Z_{\widehat{\mathrm{X}}})^{1/2}=(\mathbb{E}_{\mu}|Z_{\widehat{\mathrm{X}}}-\mathbb{E}_{\mu}Z_{\widehat{\mathrm{X}}}|^{2})^{1/2}=\|Z_{\widehat{\mathrm{X}}}-\mathbb{E}_{\mu}Z_{\widehat{\mathrm{X}}}\|_{\mathcal{L}_{2}},

the latter being identified with its dual norm $\|Z\|_{\mathcal{L}^{*}_{2}}=\sup_{\|\xi\|_{\mathcal{L}_{2}}\leq 1}\langle\xi,Z\rangle$, where $\|\xi\|_{\mathcal{L}_{2}}\equiv\|\xi\|_{2}=\sqrt{\mathbb{E}_{\mu}\xi^{2}}$ and $\langle\xi,Z\rangle=\mathbb{E}_{\mu}\xi Z$. Thus, we may write

(\mathrm{var}_{\mu}Z_{\widehat{\mathrm{X}}})^{1/2}=\sup_{\|\xi\|_{2}\leq 1}\langle\xi-\mathbb{E}_{\mu}\xi,Z_{\widehat{\mathrm{X}}}\rangle,

and subsequently (26) as:

\rho(Z_{\widehat{\mathrm{X}}})=\langle 1,Z_{\widehat{\mathrm{X}}}\rangle+\lambda_{md}^{*}(\varepsilon)\sup_{\|\xi\|_{2}\leq 1}\langle\xi-\mathbb{E}_{\mu}\xi,Z_{\widehat{\mathrm{X}}}\rangle
=\sup_{\|\xi\|_{2}\leq 1}\langle Z_{\widehat{\mathrm{X}}},1\rangle+\lambda_{md}^{*}(\varepsilon)\langle\xi-\mathbb{E}_{\mu}\xi,Z_{\widehat{\mathrm{X}}}\rangle
=\sup_{\|\xi\|_{2}\leq 1}\langle 1+\lambda_{md}^{*}(\varepsilon)(\xi-\mathbb{E}_{\mu}\xi),Z_{\widehat{\mathrm{X}}}\rangle
=\sup_{\xi^{\prime}\in\Xi}\langle\xi^{\prime},Z_{\widehat{\mathrm{X}}}\rangle,\quad\Xi\equiv\bigg\{\xi^{\prime}\in\mathcal{L}^{*}_{2}:\xi^{\prime}=1+\lambda_{md}^{*}(\varepsilon)(\xi-\mathbb{E}_{\mu}\xi),\ \|\xi\|_{2}\leq 1\bigg\}.  (27)

From (27), $\mathbb{E}_{\mu}\xi^{\prime}=1$, while $\mathbb{E}_{\mu}\xi^{2}\leq 1$ implies $\mathbb{E}_{\mu}(\xi^{\prime}-1)^{2}\leq\lambda^{2*}_{md}(\varepsilon)$. Therefore, $\Xi\subseteq\mathrm{U}(\lambda^{*}_{md})$, where

\mathrm{U}(\lambda^{*}_{md})\equiv\big\{\xi\in\mathcal{L}^{*}_{2}~:~\mathbb{E}_{\mu}\xi=1,~\mathbb{E}_{\mu}(\xi-1)^{2}\leq\lambda_{md}^{2*}(\varepsilon)\big\}.  (28)

In addition, let $\xi^{\prime}=\xi-1\Leftrightarrow\xi=1+\xi^{\prime}+\mathbb{E}\xi^{\prime}$. Then $\lambda^{2*}_{md}(\varepsilon)\geq\mathbb{E}(\xi-1)^{2}=\mathbb{E}\xi^{\prime 2}$, which implies that $\Xi=\mathrm{U}(\lambda^{*}_{md})$. Thus, we may write

\rho(Z_{\widehat{\mathrm{X}}})=\sup_{\xi\in\mathrm{U}(\lambda^{*}_{md})}\langle\xi,Z_{\widehat{\mathrm{X}}}\rangle.  (29)

Based on the previous discussion, (7) is equivalent to

\min_{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}\sup_{\xi\in\mathrm{U}(\lambda^{*}_{md})}\langle\xi,Z_{\widehat{\mathrm{X}}}\rangle.  (30)

The equality constraint in (28) can be absorbed, and the constraint set may be re-expressed as the chi-square ball of radius $\rho=\lambda^{2\star}_{md}(\varepsilon)\geq 0$ centered at the probability measure $\mu$ and formed by the $\chi^{2}$-divergence from $\nu$ to $\mu$:

\chi^{2}_{\lambda^{2\star}_{md}(\varepsilon)}(\mu)\equiv\big\{\nu\in\mathcal{L}^{\star}_{2}~:~\mathbb{E}_{\mu}(\mathrm{d}\nu/\mathrm{d}\mu-1)^{2}\leq\lambda_{md}^{2\star}(\varepsilon)\big\}.

Subsequently then, (30) reads

\min_{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}\left\{\sup_{\nu\in\mathcal{L}^{\star}_{2}}\mathbb{E}_{\nu}Z_{\widehat{\mathrm{X}}}~:~\nu\in\chi^{2}_{\lambda^{2\star}_{md}(\varepsilon)}(\mu)\right\}.  (31)

The inner optimization problem in (31) has a linear objective subject to a quadratic constraint over the infinite-dimensional (dual) vector space $\mathcal{L}^{\star}_{2}$, and can be easily handled by variational calculus. In particular, by introducing an additional Lagrange multiplier (dualizing the quadratic constraint), it is easy to show that the supremum is attained at

\left(\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\right)^{\star}=\lambda^{\star}_{md}(\varepsilon)\frac{Z_{\widehat{\mathrm{X}}}-\mathbb{E}_{\mu}Z_{\widehat{\mathrm{X}}}}{(\mathrm{var}_{\mu}Z_{\widehat{\mathrm{X}}})^{1/2}}+1.  (32)

Note that the objective in (31) is not convex over $\widehat{\mathrm{X}}$, since the expectation is taken w.r.t. signed measures. At this point, the reader might find it instructive to compare (32) with (17) through (22), and ask the question: is there a subset of admissible Lagrange multipliers for which (31) (and therefore (7)) can be formulated as a distributionally (over probability measures) robust optimization problem? We answer this question with the following result.

Theorem 1 (Distributionally robust mean square error estimator).

Let $\Lambda\equiv\{\lambda\geq 0~:~\lambda\leq 1/\gamma^{\star}\}$ be a subset of admissible Lagrange multipliers associated with the constraint of (7), with $\gamma^{\star}$ as in Lemma 2, and let $\nu^{\star}$ be determined by $\mu$ as in (32). Then, $(\nu^{\star}(\widehat{\mathrm{X}}^{\star}),\widehat{\mathrm{X}}^{\star}(\nu^{\star}))$ is a saddle point of $f(\xi,\widehat{\mathrm{X}})$ over $\chi^{2}_{\gamma^{\star}}(\mu)\times\mathbb{R}^{d}$. In particular, the estimator $\widehat{\mathrm{X}}^{\star}_{\lambda}$ (see Proposition 1) solves the $\chi^{2}$-DRO problem

\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}~\bigg\{\sup_{0\leq\nu\in\mathcal{L}^{\star}_{2}}~\mathbb{E}_{\nu}\|X-\widehat{\mathrm{X}}\|^{2}_{2}~:~\chi^{2}(\nu\,||\,\mu)\leq\lambda\bigg\},\quad\lambda\in\Lambda.  (33)

According to Theorem 1, when the risk in (7) is relaxed beyond $\varepsilon(1/\gamma^{\star})$ (here $\varepsilon(1/\gamma^{\star})$ denotes the level of risk corresponding to $1/\gamma^{\star}$), (33) can be viewed as a zero-sum game between the statistician (S), who plays $\widehat{\mathrm{X}}$, and its fictitious opponent (A), who chooses $\nu\in\chi^{2}_{\gamma^{\star}}(\mu)$. Then, $\widehat{\mathrm{X}}^{\star}_{\lambda}$ is interpreted as the best response of (S) to the best response $\nu^{\star}$ of (A), with the objective $f$ resting on this equilibrium.

Then, according to Corollary 1, the optimal strategy $\widehat{\mathrm{X}}^{\star}$ of (S) is to bias, up to a change of coordinates, towards the areas of high loss (of $\mu$), with a fraction of Fisher's moment coefficient of skewness plus some additional cross third-order statistics. The directionality in the play of (S) induces a directionality in the play of (A), as we formally discuss in the next section.
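
To make the play of (A) concrete, the following numpy sketch (ours, not part of the paper) evaluates the empirical analogue of the worst-case re-weighting (32) for a fixed estimate; the weights average to one and, for a sufficiently small sensitivity constant, remain nonnegative while inflating the expected loss:

import numpy as np

def worst_case_weights(X, x_hat, lam_md):
    # Empirical analogue of (32) for a fixed estimate x_hat and sensitivity lam_md.
    z = np.sum((X - x_hat) ** 2, axis=1)                # per-sample loss Z
    return 1.0 + lam_md * (z - z.mean()) / z.std()      # (d nu / d mu)* evaluated on the sample

rng = np.random.default_rng(3)
X = rng.exponential(size=(2000, 1))
x_hat = X.mean(axis=0)
w = worst_case_weights(X, x_hat, lam_md=0.5)
z = np.sum((X - x_hat) ** 2, axis=1)
print(np.mean(w))                    # ~1, so w defines a re-weighting of the sample
print(np.mean(z), np.mean(w * z))    # nominal loss vs. strictly larger worst-case loss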

4 Skewness as the 2-Wasserstein gradient of the $\chi^{2}$-divergence

According to the previous discussion, we observe that the shift of the expectation in (14) and (18) attributes directionality to the mass re-distribution play of (A) in Theorem 1. In this section we make this observation more formal through the smooth structure of the 2-Wasserstein space $\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})$ of probability measures absolutely continuous w.r.t. the Lebesgue measure with finite second moment, equipped with the 2-Wasserstein metric $W_{2}$. In particular, the dynamic formulation of the Wasserstein distance (Benamou and Brenier, 2000) entails the definition of a (pseudo) Riemannian structure (Chewi, 2023). At every point $\rho\in\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})$, the tangent space $T_{\rho}\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d}):=\big\{\dot{\rho}:\mathbb{R}^{d}\rightarrow\mathbb{R}\,\big|\int\dot{\rho}=0\big\}$ contains the re-allocation directions and is parameterized by functions $\omega(x)$ on $\mathbb{R}^{d}$ through the elliptic operator $\omega\mapsto-\nabla\cdot(\rho\nabla\omega)$. The metric $g:\mathrm{T}\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})\times\mathrm{T}\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})\rightarrow C^{\infty}(\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d}))$, where $C^{\infty}(\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d}))$ denotes the class of smooth real-valued maps defined on $\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})$, has value at the point $\rho\in\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})$ (Otto, 2001):

g(\dot{\rho}_{1},\dot{\rho}_{2})(\rho):=\int_{\mathbb{R}^{d}}\langle\nabla\omega_{1}(x),\nabla\omega_{2}(x)\rangle\rho(x)\,dx=\mathbb{E}_{\rho}\langle\nabla\omega_{1}(x),\nabla\omega_{2}(x)\rangle.

Moving forward, let $f\in C^{\infty}(\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d}))$. Then, the gradient w.r.t. the defined metric, $\mathrm{grad}f:\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})\rightarrow\mathrm{T}\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})$, is the unique vector field satisfying

\mathrm{d}f(X)(\rho)=g(\mathrm{grad}f,X)(\rho),

where $\mathrm{d}f:\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})\rightarrow\mathrm{T}^{*}\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})$ is the differential of $f$. To derive the gradient $(\mathrm{grad}f)(\rho)$, consider curves $\rho_{t}:[0,1]\rightarrow\mathrm{P}_{2,\text{ac}}(\mathbb{R}^{d})$ passing through $\rho$ and realizing the tangent vector $\dot{\rho}_{t}=-\nabla\cdot(\rho_{t}\nabla\omega(x))$ for some $\omega(x)$. It is

\mathrm{d}f(\dot{\rho})(\rho)=\lim_{t\rightarrow 0}\frac{f(\rho+t\dot{\rho}_{t})-f(\rho)}{t}=\int\frac{\delta f}{\delta\rho}(x)\,\dot{\rho}_{t}\,dx=-\int\frac{\delta f}{\delta\rho}(x)\,\nabla\cdot(\rho\nabla\omega(x))\,dx=\int\Big\langle\nabla\frac{\delta f}{\delta\rho}(x),\nabla\omega(x)\Big\rangle\rho\,dx.

We identify $(\mathrm{grad}f)(\rho)\equiv\nabla\frac{\delta f}{\delta\rho}(x)$, or equivalently $(\mathrm{grad}f)(\rho)=-\nabla\cdot(\rho\nabla\frac{\delta f}{\delta\rho}(x))$, where $\frac{\delta f}{\delta\rho}$ denotes the first variation of $f$ w.r.t. $\rho$. For the particular case of the $\chi^{2}$-divergence from $\xi$ to $\mu$, we obtain $\mathrm{grad}\,\chi^{2}(\cdot\|\mu)(\xi)=2\nabla\frac{\mathrm{d}\xi}{\mathrm{d}\mu}$, which when evaluated on (32) yields

\mathrm{grad}\,\chi^{2}(\cdot\|\mu)(\nu_{\lambda^{\star}_{md}})=\frac{2\lambda^{\star}_{md}\,\nabla\|\widehat{\mathrm{X}}-X\|^{2}_{2}}{(\mathrm{var}_{\mu}\|\widehat{\mathrm{X}}-X\|^{2}_{2})^{1/2}}.  (34)

By plugging $\widehat{\mathrm{X}}=\widehat{\mathrm{X}}^{\star}_{\lambda_{mv}}$ in (34) and subsequently taking expectations, we obtain (for simplicity, in the one-dimensional case)

\mathbb{E}_{\mu}[\mathrm{grad}\,\chi^{2}(\cdot\|\mu)(\nu_{\lambda^{\star}_{md}})]=\frac{\lambda^{2}\sigma\sqrt{\sigma}}{1+\lambda\sigma}\,\mathbb{E}\left(\frac{X-\mathbb{E}X}{\sqrt{\sigma}}\right)^{3},\quad\lambda\leq 1/2\gamma^{*}.  (35)

By combining (15) and (35) we may further write

\mathbb{E}_{\nu_{\lambda^{\star}_{md}}}X-\mathbb{E}_{\mu}X=\frac{\mathbb{E}_{\mu}\,\mathrm{grad}\,\chi^{2}(\cdot\|\mu)(\nu_{\lambda^{\star}_{md}})}{\lambda},\quad\lambda\in(0,1/2\gamma^{*}).

Equation (35) characterizes infinitesimally the optimal strategy of (A): given the optimal play of (S), (A) changes the measure in (31) aiming to increase the $\chi^{2}$-divergence from $\nu$ to $\mu$. For a more comprehensive review of the theory of optimal transport and its applications we refer the reader to (Ambrosio et al., 2005; Villani et al., 2009; Wibisono, 2018; Figalli and Glaudo, 2021; Chewi, 2023).

5 Mean squared error of the empirical distributionally robust mse estimator

Now that the risk-constrained estimator has all the desired features, we want to study its performance w.r.t. the mean squared error. We do so in two steps. First, we consider a simplified version of the empirical distributionally robust estimator and show that it outperforms the empirical average over a rather large class of models. Second (see Appendix B), we argue that with high probability this estimator is the empirical version of the distributionally robust estimator (15). The next results refer to the one-dimensional case ($d=1$), with the extension to higher dimensions left open for future exploration.

By considering (15) w.r.t. $\mu_{n}$, we define the following estimator

\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}(D)\equiv\frac{1}{n}\sum_{i}X_{i}+\bar{\lambda}\,\frac{1}{n}\sum_{i}\Big(X_{i}-n^{-1}\sum_{j}X_{j}\Big)^{3},  (36)

where $\bar{\lambda}$ is a non-negative and deterministic parameter that replaces the first factor in the second term of (15). For ease of notation, let us denote $\bar{\mathrm{X}}_{n}(D)=n^{-1}\sum_{i}X_{i}$ and $\mathrm{T}_{n}(D)=n^{-1}\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}$. We call (36) the simplified empirical distributionally robust estimator, which under the above notation reads $\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}=\bar{\mathrm{X}}_{n}+\bar{\lambda}\mathrm{T}_{n}$, and achieves mean squared error

\mathbb{E}(\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}-X)^{2}\leq\mathbb{E}(\bar{\mathrm{X}}_{n}-X)^{2}+\bar{\lambda}^{2}(\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon)-2\bar{\lambda}\,\mathbb{E}\big[(X-\bar{\mathrm{X}}_{n})\mathrm{T}_{n}\big],  (37)

for any $\epsilon>0$. Strict domination follows for all those positive $\bar{\lambda}$ such that

\bar{\lambda}<2\,\mathbb{E}\big[(X-\bar{\mathrm{X}}_{n})\mathrm{T}_{n}\big]\,/\,(\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon),  (38)

provided that reasonable conditions render the right-hand side of (38) strictly positive. Let $Q$ be the set of all probability measures with finite second-order moment and $g:Q\rightarrow\mathbb{R}$ the real-valued map with $g(\mu)\equiv 3m^{2}_{2}-m_{4}$, where $m_{2}$ and $m_{4}$ are the second and fourth central moments of $\mu$, respectively. Note that the Gaussian measures are all in $g^{-1}(0)$, and define the set $\mathrm{N}\equiv g^{-1}((0,+\infty))\subset Q$. Strict domination of (36) is demonstrated by the following

Theorem 2 (Strict domination).

Let $\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}$ be as in (36), and $\bar{\mathrm{X}}_{n}$ be the empirical average. Fix an $\epsilon>0$. Then, for probability measures $\mu\in\mathrm{N}$ and for any $n\geq 3$,

\mathbb{E}|\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}-\mathbb{E}X|^{2}<\mathbb{E}|\bar{\mathrm{X}}_{n}-\mathbb{E}X|^{2},  (39)

for all $\bar{\lambda}<\frac{1}{n}\left(\frac{8g(\mu)}{\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon}\right)$.

As mentioned earlier, Gaussian measures are among those with $3m^{2}_{2}=m_{4}$. In this case $\bar{\lambda}=0$, condition (37) is satisfied with equality for any $n\geq 1$, and the mse performance of (36) reduces to that of the empirical average. On the contrary, when the data are generated from $\mu\in\mathrm{N}$, (36) outperforms the average as soon as the parameter $\bar{\lambda}$ is chosen inside the prescribed interval. Interestingly, domination remains feasible for arbitrarily large data sets provided that $\bar{\lambda}$ is sent to zero at a rate of at least $O(1/n)$. The distributionally robust estimator achieves better mse performance than the sample mean by leveraging its bias term. Further, if $\bar{\lambda}\mapsto\varphi(\bar{\lambda})\equiv\mathbb{E}|\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}-X|^{2}$, then the slope of $\varphi$ at zero is

\varphi^{\prime}(0)=-(3m_{2}^{2}-m_{4})\,\frac{n^{2}-3n+2}{n^{3}}.  (40)

It holds that $\varphi^{\prime}(0)\leq 0$ for all $\mu\in\mathrm{N}\cup g^{-1}(0)$, and the maximum value is attained for $\mu\in g^{-1}(0)$; in that case, of course, the best choice is $\bar{\lambda}=0$. On the contrary, for all those probability measures $\mu\in\mathrm{N}$, strict domination follows for (see also Appendix B):

\bar{\lambda}<\frac{2(3m^{2}_{2}-m_{4})}{\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon}\,\frac{n^{2}-3n+2}{n^{3}}=\frac{-\varphi^{\prime}(0)}{\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon}.  (41)

The more negative the slope at zero is, the wider the feasible set becomes. A very negative slope at zero permits a large improvement in mse performance even for very small values of $\bar{\lambda}$, thus rendering, at the same time, the estimator as unbiased as possible. Since the slope at zero is determined by $-(3m^{2}_{2}-m_{4})$, models with large $|g(\mu)|=|3m^{2}_{2}-m_{4}|$ trade mse performance for unbiasedness more favorably. That said, for models with a large value of $g$, values of $\bar{\lambda}$ close to zero achieve a non-trivial performance difference even for a large number of samples. However, on the basis of (40), large data sets tend to flatten out $\varphi$ at zero, and to maintain performance $\bar{\lambda}$ has to be increased to

\bar{\lambda}^{\star}_{n}=\mathrm{argmin}\{\varphi(\bar{\lambda}),~\bar{\lambda}\leq-\varphi^{\prime}(0)/(\mathbb{E}\mathrm{T}^{2}_{n}+\varepsilon)\},  (42)

thus rendering $\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}^{\star}_{n}}$ more biased. Lastly, $\bar{\lambda}^{\star}_{n}$ approaches zero as the feasible set shrinks at a rate of at least $O(1/n)$, pushing $\widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}^{\star}_{n}}$ towards the empirical average.
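
A Monte Carlo sketch of the strict domination in Theorem 2 (ours; it assumes numpy and uses the uniform distribution on $[0,1]$, a platykurtic model with $m_{2}=1/12$ and $m_{4}=1/80$, hence $g(\mu)=1/120>0$):

import numpy as np

rng = np.random.default_rng(4)

g, true_mean, n, trials = 1.0 / 120.0, 0.5, 30, 100_000
lam_bar, eps = 0.5, 1e-6        # lam_bar sits well inside the interval of Theorem 2 (printed below)

err_avg = err_dro = t2_sum = 0.0
for _ in range(trials):
    D = rng.uniform(size=n)
    x_bar = D.mean()
    t_n = np.mean((D - x_bar) ** 3)
    t2_sum += t_n ** 2
    err_avg += (x_bar - true_mean) ** 2
    err_dro += (x_bar + lam_bar * t_n - true_mean) ** 2

print("mse(empirical average)  =", err_avg / trials)
print("mse(simplified DRO)     =", err_dro / trials)                 # strictly smaller here
print("Theorem 2 bound on lam  =", 8.0 * g / (n * (t2_sum / trials + eps)))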

6 Conclusion and future work

We deployed the main insights of DRO to come up with a distributionally robust estimator whose empirical version serves as a better estimator of the mean than the empirical average, w.r.t. the mean squared error, for all platykurtic probability distributions. The aforesaid estimator biases, up to a change of coordinates, towards the tail of the distribution with Fisher's moment coefficient of skewness plus additional cross third-order statistics and, on top of that, it can be written as an expectation w.r.t. an optimal re-weighting of the original measure. This optimal re-weighting, along with the optimal estimator, forms a saddle point of the mse cost, and the bias characterizes the directionality of the infinitesimal play through the Wasserstein geometry of the space of measures. Lastly, extending Theorem 2 to the multi-dimensional setting is left for future investigation.

References

  • Abeille et al. [2016] Marc Abeille, Alessandro Lazaric, Xavier Brokmann, et al. LQG for portfolio optimization. arXiv preprint arXiv:1611.00997, 2016.
  • Ambrosio et al. [2005] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
  • Bayraksan and Love [2015] Güzin Bayraksan and David K Love. Data-driven stochastic programming using phi-divergences. In The operations research revolution, pages 1–19. INFORMS, 2015.
  • Benamou and Brenier [2000] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
  • Bertsimas and Sim [2004] Dimitris Bertsimas and Melvyn Sim. The price of robustness. Operations research, 52(1):35–53, 2004.
  • Blanchet and Shapiro [2023] Jose Blanchet and Alexander Shapiro. Statistical limit theorems in distributionally robust optimization. In 2023 Winter Simulation Conference (WSC), pages 31–45. IEEE, 2023.
  • Blanchet et al. [2021] Jose Blanchet, Karthyek Murthy, and Viet Anh Nguyen. Statistical analysis of Wasserstein distributionally robust estimators. In Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, pages 227–254. INFORMS, 2021.
  • Blanchet et al. [2023] Jose Blanchet, Daniel Kuhn, Jiajin Li, and Bahar Taskesen. Unifying distributionally robust optimization via optimal transport theory. arXiv preprint arXiv:2308.05414, 2023.
  • Blanchet et al. [2024] Jose Blanchet, Jiajin Li, Sirui Lin, and Xuhui Zhang. Distributionally robust optimization and robust statistics. arXiv preprint arXiv:2401.14655, 2024.
  • Chewi [2023] Sinho Chewi. An optimization perspective on log-concave sampling and beyond. PhD thesis, Massachusetts Institute of Technology, 2023.
  • Devroye et al. [2003] Luc Devroye, Dominik Schäfer, László Györfi, and Harro Walk. The estimation problem of minimum mean squared error. Statistics & Decisions, 21(1):15–28, 2003.
  • Duchi and Namkoong [2018] John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
  • Duchi and Namkoong [2019] John Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1–55, 2019.
  • Esfahani and Kuhn [2015] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116, 2015.
  • Figalli and Glaudo [2021] Alessio Figalli and Federico Glaudo. An invitation to optimal transport, Wasserstein distances, and gradient flows. 2021.
  • Gao and Kleywegt [2023] Rui Gao and Anton Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2):603–655, 2023.
  • Girshick and Savage [1951] MA Girshick and LJ Savage. Bayes and minimax estimates for quadratic loss functions. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pages 53–74. University of California Press, 1951.
  • Gotoh et al. [2018] Jun-ya Gotoh, Michael Jong Kim, and Andrew EB Lim. Robust empirical optimization is almost the same as mean–variance optimization. Operations research letters, 46(4):448–452, 2018.
  • James and Stein [1961] William James and Charles Stein. Estimation with quadratic loss. In Breakthroughs in statistics: Foundations and basic theory, pages 443–460. Springer, 1961.
  • Kalogerias et al. [2019] Dionysios S Kalogerias, Luiz FO Chamon, George J Pappas, and Alejandro Ribeiro. Risk-aware MMSE estimation. arXiv preprint arXiv:1912.02933, 2019.
  • Kalogerias et al. [2020] Dionysios S Kalogerias, Luiz FO Chamon, George J Pappas, and Alejandro Ribeiro. Better safe than sorry: Risk-aware nonlinear Bayesian estimation. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5480–5484. IEEE, 2020.
  • Kuhn et al. [2019] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations research & management science in the age of analytics, pages 130–166. Informs, 2019.
  • Lam [2016] Henry Lam. Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research, 41(4):1248–1275, 2016.
  • Lee and Raginsky [2018] Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. Advances in Neural Information Processing Systems, 31, 2018.
  • Markowitz [1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952. ISSN 00221082, 15406261. URL http://www.jstor.org/stable/2975974.
  • Namkoong and Duchi [2016] Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in neural information processing systems, 29, 2016.
  • Nguyen et al. [2023] Viet Anh Nguyen, Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Bridging Bayesian and minimax mean square error estimation via Wasserstein distributionally robust optimization. Mathematics of Operations Research, 48(1):1–37, 2023.
  • Otto [2001] Felix Otto. The geometry of dissipative evolution equations: The porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
  • Scarf et al. [1957] Herbert E Scarf, KJ Arrow, and S Karlin. A min-max solution of an inventory problem. Rand Corporation Santa Monica, 1957.
  • Shapiro et al. [2021] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochastic programming: modeling and theory. SIAM, 2021.
  • Staib and Jegelka [2019] Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. Advances in Neural Information Processing Systems, 32, 2019.
  • Van Parys et al. [2021] Bart PG Van Parys, Peyman Mohajerin Esfahani, and Daniel Kuhn. From data to decisions: Distributionally robust optimization is optimal. Management Science, 67(6):3387–3402, 2021.
  • Verhaegen and Verdult [2007] Michel Verhaegen and Vincent Verdult. Filtering and system identification: a least squares approach. Cambridge university press, 2007.
  • Villani et al. [2009] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.
  • Wang and Bovik [2009] Zhou Wang and Alan C Bovik. Mean squared error: Love it or leave it? a new look at signal fidelity measures. IEEE signal processing magazine, 26(1):98–117, 2009.
  • Wibisono [2018] Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.

Appendix A (Proofs of Section 2)

Proof of Lemma 1.

The square of the constraint of (7) is written as (all expectations are taken w.r.t. $\mu$):

\big(\|X-\widehat{\mathrm{X}}\|^{2}_{2}-\mathbb{E}\|X-\widehat{\mathrm{X}}\|^{2}_{2}\big)^{2}=(\|X\|^{2}_{2}-\mathbb{E}\|X\|^{2}_{2})^{2}+4\widehat{\mathrm{X}}^{\top}(X-\mathbb{E}X)(X-\mathbb{E}X)^{\top}\widehat{\mathrm{X}}+4(\|X\|^{2}_{2}-\mathbb{E}\|X\|^{2}_{2})(\mathbb{E}X-X)^{\top}\widehat{\mathrm{X}}
=\left[\begin{array}{cc}1&\widehat{\mathrm{X}}^{\top}\end{array}\right]\left[\begin{array}{ll}\alpha(X)&\zeta(X)^{\top}\\ \zeta(X)&\gamma(X)\end{array}\right]\left[\begin{array}{c}1\\ \widehat{\mathrm{X}}\end{array}\right],  (48)

where

\alpha(X)=(\|X\|^{2}_{2}-\mathbb{E}\|X\|^{2}_{2})^{2},
\zeta(X)=2(\|X\|_{2}^{2}-\mathbb{E}\|X\|_{2}^{2})(\mathbb{E}X-X),
\gamma(X)=4(X-\mathbb{E}X)(X-\mathbb{E}X)^{\top}.

Thus, by taking expectations

\mathbb{E}\Big(\|X-\widehat{\mathrm{X}}\|^{2}_{2}-\mathbb{E}_{\mu}\|X-\widehat{\mathrm{X}}\|^{2}_{2}\Big)^{2}=\left[\begin{array}{cc}1&\widehat{\mathrm{X}}^{\top}\end{array}\right]\left[\begin{array}{ll}\alpha&\zeta^{\top}\\ \zeta&\Gamma\end{array}\right]\left[\begin{array}{c}1\\ \widehat{\mathrm{X}}\end{array}\right],  (54)

where $\alpha=\mathbb{E}\alpha(X)$, $\zeta=\mathbb{E}\zeta(X)$, and $\Gamma=4\Sigma$. Equation (54) rests on the permutation-invariance property of the trace. Without loss of generality, let us assume that the covariance matrix $\Sigma$ is strictly positive definite, and consider its Schur complement w.r.t. $\Gamma$, thus obtaining the following standard factorization (see e.g. [Verhaegen and Verdult, 2007, Lemma 2.3]):

\mathbb{E}\big(\|X-\widehat{\mathrm{X}}\|^{2}_{2}-\mathbb{E}\|X-\widehat{\mathrm{X}}\|^{2}_{2}\big)^{2}=\left[\begin{array}{c}1\\ \Gamma^{-1}\zeta-\widehat{\mathrm{X}}\end{array}\right]^{\top}\left[\begin{array}{ll}\alpha-\zeta^{\top}\Gamma^{-1}\zeta&0\\ 0&\Gamma\end{array}\right]\left[\begin{array}{c}1\\ \Gamma^{-1}\zeta-\widehat{\mathrm{X}}\end{array}\right].  (61)

Proof of Proposition 1.

From Lemma 1, problem (7) can equivalently be written as:

\begin{array}{rl}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}&\|\widehat{\mathrm{X}}-\mathbb{E}X\|^{2}_{2}\\ \mathrm{s.t.}&\|\widehat{\mathrm{X}}-\frac{1}{4}\Sigma^{-1}\zeta^{-}\|^{2}_{4\Sigma}\leq\bar{\varepsilon}\end{array},~\bar{\varepsilon}\geq 0~=~\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}\sup_{\lambda\geq 0}\mathrm{L}(\widehat{\mathrm{X}},\lambda),

where $\mathrm{L}:\mathbb{R}^{d}\times\mathbb{R}^{+}\rightarrow\mathbb{R}$ is the Lagrangian of (12) given by

\mathrm{L}(\widehat{\mathrm{X}},\lambda)=\|\widehat{\mathrm{X}}-\mathbb{E}X\|^{2}_{2}+\lambda\,\|\widehat{\mathrm{X}}-\tfrac{1}{4}\Sigma^{-1}\zeta^{-}\|^{2}_{4\Sigma}-\lambda\bar{\varepsilon},  (62)

and $\lambda\in\mathbb{R}^{+}$ is the multiplier associated with the constraint of (12). Slater's condition is directly verified from (12) and therefore, due to convexity, the dual-optimal value is attained, and (12) has zero duality gap:

\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}\sup_{\lambda\geq 0}\mathrm{L}(\widehat{\mathrm{X}},\lambda)=\sup_{\lambda\geq 0}\underset{\widehat{\mathrm{X}}\in\mathbb{R}^{d}}{\mathrm{min}}\mathrm{L}(\widehat{\mathrm{X}},\lambda).

The Lagrangian reads

\mathrm{L}(\widehat{\mathrm{X}},\lambda)=\left[\begin{array}{cc}\widehat{\mathrm{X}}^{\top}&1\end{array}\right]\left[\begin{array}{ll}I+4\lambda\Sigma&-\big(\mathbb{E}X+\lambda\zeta^{-}\big)\\ -\big(\mathbb{E}X+\lambda\zeta^{-}\big)^{\top}&\|\mathbb{E}X\|^{2}_{2}+\lambda\|\frac{1}{4}\Sigma^{-1}\zeta^{-}\|^{2}_{2}\end{array}\right]\left[\begin{array}{c}\widehat{\mathrm{X}}\\ 1\end{array}\right]-\lambda\bar{\varepsilon},  (68)

and by performing a similar decomposition as in (61) we obtain

\mathrm{L}(\widehat{\mathrm{X}},\lambda)=\|\widehat{\mathrm{X}}-(I+4\lambda\Sigma)^{-1}(\mathbb{E}X+\lambda\zeta^{-})\|^{2}_{I+4\lambda\Sigma}
-\big(\mathbb{E}X+\lambda\zeta^{-}\big)^{\top}\big(I+4\lambda\Sigma\big)^{-1}\big(\mathbb{E}X+\lambda\zeta^{-}\big)
+\|\mathbb{E}X\|^{2}_{2}+\lambda\|\tfrac{1}{4}\Sigma^{-1}\zeta^{-}\|^{2}_{\Sigma}-\lambda\bar{\varepsilon}.  (69)

All terms in (69) other than the first do not depend on $\widehat{\mathrm{X}}$, and thus

\widehat{\mathrm{X}}^{\star}_{\lambda}=(I+4\lambda\Sigma)^{-1}(\mathbb{E}X+\lambda\zeta^{-}),  (70)

is primal-optimal. ∎

Proof of Corollary 1.

Let us denote $\Pi(\lambda)=(I+4\lambda\Sigma)^{-1}$. Differentiation of (70) w.r.t. $\lambda\in(0,+\infty)$ gives

\mathrm{d}_{\lambda}\widehat{\mathrm{X}}^{\star}_{\lambda}=-4\Pi(\lambda)\Sigma\Pi(\lambda)(\mathbb{E}X+\lambda\zeta^{-})+\Pi(\lambda)\zeta^{-},  (71)

where the commutator $[\Pi,\Sigma]\equiv\Pi\Sigma-\Sigma\Pi=0$, and therefore we may write

\mathrm{d}_{\lambda}\widehat{\mathrm{X}}^{\star}_{\lambda}=\Pi(\lambda)^{2}\Big(-4\Sigma(\mathbb{E}X+\lambda\zeta^{-})+\Pi(\lambda)^{-1}\zeta^{-}\Big)=-4\Pi(\lambda)^{2}\Sigma\Big(\mathbb{E}X-(4\Sigma)^{-1}\zeta^{-}\Big).  (72)

It is worth noticing that the dependency on $\lambda$ has been transferred to $\Pi$, and that $\mathbb{E}X=\widehat{\mathrm{X}}^{\star}_{0}$ and $(4\Sigma)^{-1}\zeta^{-}=\widehat{\mathrm{X}}^{\star}_{\infty}$. Thus, we declare $\widehat{\Delta X}^{\star}_{0\infty}=(4\Sigma)^{-1}\zeta^{-}-\mathbb{E}X$. By integrating (72) w.r.t. $\lambda$ we obtain

\widehat{\mathrm{X}}^{\star}_{\lambda}=\widehat{\mathrm{X}}^{\star}_{0}+U\mathrm{H}(\lambda)U^{\top}\widehat{\Delta X}^{\star}_{0\infty},  (73)

where $\mathrm{H}(\lambda)=\mathrm{diag}\big(\frac{4\lambda\sigma_{i}}{1+4\lambda\sigma_{i}}\big)$ and $\Sigma=U\Lambda U^{\top}$ refers to the spectral decomposition of $\Sigma$. That is, the optimal estimates are shifted versions of the conditional expectation by a transformed version of the vector $\widehat{\Delta X}^{\star}_{0\infty}$. Recall that this vector encodes the difference between the center of the circle and the center of the ellipsoid. From (73) we obtain

\displaystyle U^{\top}\widehat{\mathrm{X}}^{\star}_{\lambda}=U^{\top}\widehat{\mathrm{X}}^{\star}_{0}+\mathrm{H}(\lambda)\big{(}U^{\top}\widehat{\mathrm{X}}^{\star}_{0}-U^{\top}\widehat{\mathrm{X}}^{\star}_{\infty}\big{)}. (74)

Further,

\displaystyle U^{\top}\widehat{\mathrm{X}}^{\star}_{0}=U^{\top}\mathbb{E}{X}=\mathbb{E}{X}_{u}, (75)

where {X}_{u}\equiv U^{\top}{X}. In addition, since

U^{\top}\mathbb{E}[\|{X}\|^{2}_{2}{X}]=\mathbb{E}[\|UU^{\top}{X}\|^{2}_{2}U^{\top}{X}],

and U^{\top} is orthogonal (so that \|UU^{\top}X\|_{2}=\|X\|_{2}),

\displaystyle U^{\top}\widehat{\mathrm{X}}^{\star}_{\infty}=\frac{1}{2}\Lambda^{-1}\bigg{(}\mathbb{E}[\|{X}_{u}\|^{2}_{2}{X}_{u}]-\mathbb{E}\|{X}_{u}\|^{2}_{2}\mathbb{E}{X}_{u}\bigg{)}. (76)

By \widehat{\mathrm{X}}^{\star}_{i,u,\lambda} and {X}_{i,u} we mean the ith component of \widehat{\mathrm{X}}^{\star}_{u,\lambda}\equiv U^{\top}\widehat{\mathrm{X}}^{\star}_{\lambda} and of {X}_{u}, respectively. Also, note that the covariance matrix of {X}_{u} is U^{\top}\Sigma U=\Lambda, so the eigenvalues are unchanged by the orthogonal change of coordinates, \Lambda=\Lambda_{{X}_{u}}, and \mathrm{var}[{X}_{i,u}]\equiv\sigma_{i}. Thus,

X^i,u,λ\displaystyle\widehat{\mathrm{X}}^{\star}_{i,u,\lambda} =𝔼Xi,u+2λσi1+2λσi(𝔼Xi,u12σi(k𝔼[Xk,u2Xi,u]k𝔼Xk,u2𝔼Xi,u))\displaystyle=\mathbb{E}{X}_{i,u}+\frac{2\lambda\sigma_{i}}{1+2\lambda\sigma_{i}}\Bigg{(}\mathbb{E}{X}_{i,u}-\frac{1}{2\sigma_{i}}\Big{(}\sum_{k}\mathbb{E}[{X}^{2}_{k,u}{X}_{i,u}]-\sum_{k}\mathbb{E}{X}^{2}_{k,u}\mathbb{E}{X}_{i,u}\Big{)}\Bigg{)}
=𝔼Xi,u\displaystyle=\mathbb{E}{X}_{i,u}
+2λσi1+2λσi(𝔼Xi,u12σi(𝔼Xi,u3𝔼Xi,u2𝔼Xi,u))\displaystyle+\frac{2\lambda\sigma_{i}}{1+2\lambda\sigma_{i}}\Bigg{(}\mathbb{E}{X}_{i,u}-\frac{1}{2\sigma_{i}}\Big{(}\mathbb{E}{X}^{3}_{i,u}-\mathbb{E}{X}^{2}_{i,u}\mathbb{E}{X}_{i,u}\Big{)}\Bigg{)}
+2λσi1+2λσi(12σi(ki𝔼[Xk,u2Xi,u]ki𝔼Xk,u2𝔼Xi,u))\displaystyle+\frac{2\lambda\sigma_{i}}{1+2\lambda\sigma_{i}}\Bigg{(}-\frac{1}{2\sigma_{i}}\Big{(}\sum_{k\neq{i}}\mathbb{E}[{X}^{2}_{k,u}{X}_{i,u}]-\sum_{k\neq{i}}\mathbb{E}{X}^{2}_{k,u}\mathbb{E}{X}_{i,u}\Big{)}\Bigg{)} (77)

Completing the third central moment in the second term of (77), via \mathbb{E}{X}^{3}_{i,u}-\mathbb{E}{X}^{2}_{i,u}\mathbb{E}{X}_{i,u}=\mathbb{E}({X}_{i,u}-\mathbb{E}{X}_{i,u})^{3}+2\sigma_{i}\mathbb{E}{X}_{i,u}, and subsequently setting the last term equal to \mathrm{T}_{i,\lambda} concludes the proof. ∎
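
The eigenbasis representation (73) can be checked numerically against the direct formula (70); the sketch below (with arbitrary stand-ins for \mathbb{E}X, \Sigma and \zeta^{-}, and \mathrm{H}(\lambda)=\mathrm{diag}(4\lambda\sigma_{i}/(1+4\lambda\sigma_{i})) as in (73)) confirms that the two expressions coincide:

```python
# Sketch: (70) and (73) agree for arbitrary (mean, Sigma, zeta_minus).
import numpy as np

rng = np.random.default_rng(1)
d, lam = 4, 0.3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                  # arbitrary SPD covariance
mean = rng.standard_normal(d)                # stands in for E X
zeta_minus = rng.standard_normal(d)          # stands in for zeta^-

# Direct formula (70).
X_lam = np.linalg.solve(np.eye(d) + 4 * lam * Sigma, mean + lam * zeta_minus)

# Eigenbasis form (73): X_0 + U H(lam) U^T (X_inf - X_0).
sigma, U = np.linalg.eigh(Sigma)             # Sigma = U diag(sigma) U^T
X_inf = np.linalg.solve(4 * Sigma, zeta_minus)
H = np.diag(4 * lam * sigma / (1 + 4 * lam * sigma))
X_lam_eig = mean + U @ H @ U.T @ (X_inf - mean)

print(np.allclose(X_lam, X_lam_eig))         # True
```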

Proof of Lemma 2.

The optimality condition for the Lagrangian corresponding to (7) yields:

dX^L(X^,λ)=0,\displaystyle\mathrm{d}_{\widehat{\mathrm{X}}}\mathrm{L}(\widehat{\mathrm{X}}^{\star},\lambda)=0,

or

2𝔼(X^λX)+λ𝔼{2(XX^λ22𝔼XX^λ22)(2(XX^λ)2𝔼(XX^λ))}=0,\displaystyle 2\mathbb{E}(\widehat{\mathrm{X}}^{\star}_{\lambda}-X)+\lambda\mathbb{E}\bigg{\{}2\Big{(}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}-\mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}\Big{)}\big{(}2(X-\widehat{\mathrm{X}}^{\star}_{\lambda})-2\mathbb{E}(X-\widehat{\mathrm{X}}^{\star}_{\lambda})\big{)}\bigg{\}}=0,

or

2𝔼(X^λX)λ𝔼{2(XX^λ22𝔼XX^λ22)(2X2𝔼X)}=0,\displaystyle 2\mathbb{E}(\widehat{\mathrm{X}}^{\star}_{\lambda}-X)-\lambda\mathbb{E}\bigg{\{}2\Big{(}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}-\mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}\Big{)}\big{(}2X-2\mathbb{E}X\big{)}\bigg{\}}=0,

or

𝔼(X^λX)2λ𝔼{(XX^λ22𝔼XX^λ22)X}=0,\displaystyle\mathbb{E}(\widehat{\mathrm{X}}^{\star}_{\lambda}-X)-2\lambda\mathbb{E}\bigg{\{}\Big{(}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}-\mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}\Big{)}X\bigg{\}}=0,

or

\displaystyle\widehat{\mathrm{X}}^{\star}_{\lambda}=\mathbb{E}\bigg{\{}\bigg{[}1+2\lambda\Big{(}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}-\mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}\Big{)}\bigg{]}X\bigg{\}}. (78)

The Radon--Nikodym derivative in (78) can take both positive and negative values. However, \frac{\mathrm{d}{\xi}}{\mathrm{d}\mu}\geq 1-2\lambda\mathbb{E}||X-\widehat{\mathrm{X}}^{\star}_{\lambda}||^{2}_{2}, and therefore any \lambda\in[0,+\infty) with \lambda\leq{1}/{(2\mathbb{E}||X-\widehat{\mathrm{X}}^{\star}_{\lambda}||^{2}_{2})} renders the factor in (78) a positive density. This is guaranteed for every such \lambda when

λ\displaystyle\lambda infλ[0,+)12𝔼XX^λ22=12𝔼XX^22=12(𝔼ΔX^022+mmse(μ))1/γ,\displaystyle{\leq}\inf_{\lambda\in[0,+\infty)}\frac{1}{2\mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2}}{=}\frac{1}{2\mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\infty}\|^{2}_{2}}{=}\frac{1}{2\big{(}\mathbb{E}\|\widehat{\Delta X}^{\star}_{0\infty}\|^{2}_{2}+\mathrm{mmse}(\mu)\big{)}}\equiv{1}/{\gamma^{\star}},

where the first equality follows because \mathbb{E}\|X-\widehat{\mathrm{X}}^{\star}_{\lambda}\|^{2}_{2} is nondecreasing in \lambda\geq{0}, so the infimum is attained in the limit \lambda\rightarrow\infty. ∎
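
The fixed-point relation (78) also lends itself to a simple empirical illustration (a hypothetical sketch, not the paper's code): iterating (78) on data and checking that, for \lambda below the bound of Lemma 2, the implied Radon--Nikodym factor is a nonnegative reweighting that averages to one.

```python
# Sketch: iterate the fixed point (78) on skewed data and verify that the
# reweighting factor 1 + 2*lam*(||X - x_hat||^2 - E||X - x_hat||^2) is a
# nonnegative density when lam <= 1 / (2 * E||X - x_hat||^2)  (Lemma 2).
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=(50_000, 2))       # skewed data
lam = 0.05                                             # small multiplier

x_hat = X.mean(axis=0)                                 # initialize at the sample mean
for _ in range(200):                                   # fixed-point iteration on (78)
    r2 = ((X - x_hat) ** 2).sum(axis=1)                # ||X - x_hat||^2
    w = 1 + 2 * lam * (r2 - r2.mean())                 # Radon-Nikodym factor
    x_hat = (w[:, None] * X).mean(axis=0)              # E[w X]

r2 = ((X - x_hat) ** 2).sum(axis=1)                    # recompute at the fixed point
w = 1 + 2 * lam * (r2 - r2.mean())
print("lambda below the bound :", lam <= 1 / (2 * r2.mean()))
print("weights nonnegative    :", bool((w >= 0).all()))
print("weights average to one :", bool(np.isclose(w.mean(), 1.0)))
```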

Appendix B (Proofs of Sections 3 and 5)

Proof of Lemma 3.

Convexity of (20) and of (21), together with Slater's condition, implies that the dual optimum is attained. The KKT conditions for (20) and (21) are \mathrm{d}_{\widehat{\mathrm{X}}}\mathbb{E}Z_{\widehat{\mathrm{X}}^{\star}}=-\lambda^{\star}_{mv}(\varepsilon)\nabla\mathrm{var}Z_{\widehat{\mathrm{X}}^{\star}} and \mathrm{d}_{\widehat{\mathrm{X}}}\mathbb{E}Z_{\widehat{\mathrm{X}}^{\star}}=-\lambda^{\star}_{md}(\varepsilon)\nabla\mathrm{var}Z_{\widehat{\mathrm{X}}^{\star}}({2\sqrt{\rule{0.0pt}{5.16663pt}\mathrm{var}Z_{\widehat{\mathrm{X}}^{\star}}}})^{-1}, respectively. The complementary slackness condition for (21) yields {\lambda^{\star}_{md}(\varepsilon)}=2\lambda^{\star}_{mv}(\varepsilon){\sqrt{\varepsilon}}. ∎

Proof of Theorem 1.

To keep the notation simple, we write \lambda for the optimal Lagrange multiplier \lambda^{\star}_{md}(\varepsilon) of the mean-deviation problem, and \sigma for the optimal Lagrange multiplier \lambda^{\star}_{mv}(\varepsilon) of the mean-variance problem. Then, on the basis of Lemma 3, \lambda=2\sigma\sqrt{\varepsilon} and

𝔼ν(λ)XX^σ22\displaystyle\mathbb{E}_{\nu^{\star}(\lambda)}\|X-\widehat{\mathrm{X}}^{\star}_{\sigma}\|^{2}_{2} =minX^supνU(λ)𝔼νXX^22\displaystyle=\min_{\widehat{\mathrm{X}}}\sup_{\nu\in\mathrm{U}(\lambda)}\mathbb{E}_{\nu}\|X-\widehat{\mathrm{X}}\|^{2}_{2} (79)
=supνU(λ)𝔼νXX^σ22\displaystyle=\sup_{\nu\in\mathrm{U}(\lambda)}\mathbb{E}_{\nu}\|X-\widehat{\mathrm{X}}^{\star}_{\sigma}\|^{2}_{2} (80)
𝔼νXX^σ22,νU(λ).\displaystyle\geq\mathbb{E}_{\nu}\|X-\widehat{\mathrm{X}}^{\star}_{\sigma}\|^{2}_{2},~{}\nu\in U(\lambda). (81)

Equation (80) follows from (79) because \widehat{\mathrm{X}}^{\star}_{\sigma} is the optimal solution of (29). On top of that,

minX^supνU(λ)𝔼XX^22\displaystyle\min_{\widehat{\mathrm{X}}}\sup_{\nu\in\mathrm{U}(\lambda)}\mathbb{E}\|X-\widehat{\mathrm{X}}\|^{2}_{2} =𝔼ν(λ)XX^σ22\displaystyle=\mathbb{E}_{\nu^{\star}(\lambda)}\|X-\widehat{\mathrm{X}}^{\star}_{\sigma}\|^{2}_{2} (82)
=𝔼ν(λ)X𝔼ν(λ)X22,\displaystyle=\mathbb{E}_{\nu^{\star}(\lambda)}\|X-\mathbb{E}_{\nu^{\star}(\lambda)}X\|^{2}_{2}, (83)

where (83) follows from (82) because of Lemma 2. With \lambda\leq 2\sqrt{\varepsilon}/\gamma^{\star}, again from Lemma 2, the measure \nu^{\star}(\lambda) is positive and thus,

𝔼ν(λ)X𝔼ν(λ)X22\displaystyle\mathbb{E}_{\nu^{\star}(\lambda)}\|X-\mathbb{E}_{\nu^{\star}(\lambda)}X\|^{2}_{2} =minX^𝔼ν(λ)XX^22\displaystyle=\min_{\widehat{\mathrm{X}}}\mathbb{E}_{\nu^{\star}(\lambda)}\|X-\widehat{\mathrm{X}}\|^{2}_{2}
𝔼ν(λ)XX^22.\displaystyle\leq\mathbb{E}_{\nu^{\star}(\lambda)}\|X-\widehat{\mathrm{X}}\|^{2}_{2}. (84)

Combining (81) and (84) for \lambda\leq 2\sqrt{\varepsilon}/\gamma^{\star} concludes the proof. ∎

Proof of Theorem 2.

Recall that \bar{\mathrm{X}}_{n}=n^{-1}\sum_{i}X_{i}, \mathrm{T}_{n}=n^{-1}\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}, and \widehat{\mathrm{X}}^{\star}_{n,\bar{\lambda}}=\bar{\mathrm{X}}_{n}+\bar{\lambda}\mathrm{T}_{n}. The condition for strict domination reads

λ¯<2𝔼[(XX¯n)Tn]𝔼Tn2+ϵ=2n𝔼[(Xn1iXi)i(Xin1iXi)3]𝔼[(i(Xin1Xi)3)2]+ϵ.\displaystyle\bar{\lambda}<\frac{2\mathbb{E}\big{[}(X-\bar{\mathrm{X}}_{n})\mathrm{T}_{n}\big{]}}{\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon}=\frac{2n\mathbb{E}\big{[}(X-n^{-1}\sum_{i}X_{i})\sum_{i}(X_{i}-n^{-1}\sum_{i}X_{i})^{3}\big{]}}{\mathbb{E}\big{[}\big{(}\sum_{i}(X_{i}-n^{-1}\sum X_{i})^{3}\big{)}^{2}\big{]}+\epsilon}. (85)

For the numerator of (85) we have:

𝔼[(XX¯n)i(XiX¯n)3]=𝔼[Xi(XiX¯n)3]𝔼[X¯ni(XiX¯n)3]\displaystyle\mathbb{E}\big{[}(X-\bar{\mathrm{X}}_{n})\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]}=\mathbb{E}\big{[}X\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]}-\mathbb{E}\big{[}\bar{\mathrm{X}}_{n}\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]} (86)

We compute each term of (86). For the first one we have

𝔼[Xi(XiX¯n)3]=𝔼[Xi(Xi33Xi2X¯n+3XiX¯n2X¯n3)]\displaystyle\mathbb{E}\big{[}X\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]}=\mathbb{E}\big{[}X\sum_{i}(X_{i}^{3}-3X_{i}^{2}\bar{\mathrm{X}}_{n}+3X_{i}\bar{X}^{2}_{n}-\bar{\mathrm{X}}^{3}_{n})\big{]}
=𝔼Xin(𝔼Xi33𝔼[Xi2X¯n]+3𝔼[XiX¯n2]𝔼X¯n3).\displaystyle=\mathbb{E}X\sum^{n}_{i}\big{(}\mathbb{E}X^{3}_{i}-3\mathbb{E}[X^{2}_{i}\bar{\mathrm{X}}_{n}]+3\mathbb{E}[X_{i}\bar{X}^{2}_{n}]-\mathbb{E}\bar{\mathrm{X}}^{3}_{n}\big{)}. (87)

We compute each term inside the parenthesis in (87). For the first we have \mathbb{E}X_{i}^{3}=\mathbb{E}X^{3}, while for the second:

3𝔼[Xi2X¯n]\displaystyle-3\mathbb{E}[X_{i}^{2}\bar{\mathrm{X}}_{n}] =3n1𝔼[Xi2(Xi+j=1,jinXj)]\displaystyle=-3n^{-1}\mathbb{E}[X^{2}_{i}(X_{i}+\sum^{n}_{j=1,j\neq{i}}X_{j})]
=3n1𝔼X33n1(n1)𝔼X2𝔼X.\displaystyle=-3n^{-1}\mathbb{E}X^{3}-3n^{-1}(n-1)\mathbb{E}X^{2}\mathbb{E}X. (88)

For the third term inside the parenthesis of (87) we use

𝔼(j=1,jinXj)2=(n1)𝔼X2+(𝔼X)2(n1)(n2)\displaystyle\mathbb{E}\bigg{(}\sum^{n}_{j=1,j\neq{i}}X_{j}\bigg{)}^{2}=(n-1)\mathbb{E}X^{2}+(\mathbb{E}X)^{2}(n-1)(n-2) (89)

to obtain

3𝔼[XiX¯n2]\displaystyle 3\mathbb{E}[X_{i}\bar{\mathrm{X}}^{2}_{n}] =3n2𝔼X3+9n2(n1)𝔼X𝔼X2+3n2(n1)(n2)(𝔼X)3.\displaystyle=3n^{-2}\mathbb{E}X^{3}+9n^{-2}(n-1)\mathbb{E}X\mathbb{E}X^{2}+3n^{-2}(n-1)(n-2)(\mathbb{E}X)^{3}. (90)

Lastly, the fourth term reads:

𝔼X¯n3=n2𝔼X3+3n2(n1)𝔼X2𝔼X+n2(𝔼X)3(n23n+2).\displaystyle\mathbb{E}\bar{\mathrm{X}}^{3}_{n}=n^{-2}\mathbb{E}X^{3}+3n^{-2}(n-1)\mathbb{E}X^{2}\mathbb{E}X+n^{-2}(\mathbb{E}X)^{3}\left(n^{2}-3n+2\right). (91)

Thus, by (88), (90), (91), the parenthesis in (87) reads:

\displaystyle\big{(}\mathbb{E}X^{3}_{i}-3\mathbb{E}[X^{2}_{i}\bar{\mathrm{X}}_{n}]+3\mathbb{E}[X_{i}\bar{\mathrm{X}}^{2}_{n}]-\mathbb{E}\bar{\mathrm{X}}^{3}_{n}\big{)}=\mathbb{E}X^{3}\left[2n^{-2}-3n^{-1}+1\right]
\displaystyle\hskip 10.0pt-\mathbb{E}X^{2}\mathbb{E}X\left[3n^{-2}(n-1)(n-2)\right]
\displaystyle\hskip 10.0pt+(\mathbb{E}X)^{3}2n^{-2}\left(n^{2}-3n+2\right). (92)

Summing over i in (87) multiplies (92) by n:

\displaystyle\sum^{n}_{i=1}\big{(}\mathbb{E}X^{3}_{i}-3\mathbb{E}[X^{2}_{i}\bar{\mathrm{X}}_{n}]+3\mathbb{E}[X_{i}\bar{\mathrm{X}}^{2}_{n}]-\mathbb{E}\bar{\mathrm{X}}^{3}_{n}\big{)}=\mathbb{E}X^{3}\left[2n^{-1}+n-3\right]
\displaystyle-\mathbb{E}X^{2}\mathbb{E}X\left[3n^{-1}(n-1)(n-2)\right]
\displaystyle+(\mathbb{E}X)^{3}2n^{-1}\left(n^{2}-3n+2\right). (93)

Therefore, for the first term in (86) we have:

\displaystyle\mathbb{E}\big{[}X\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]}=\mathbb{E}X\mathbb{E}X^{3}\left(2n^{-1}+n-3\right)
\displaystyle\hskip 15.0pt-\mathbb{E}X^{2}(\mathbb{E}X)^{2}3n^{-1}(n-1)(n-2)+(\mathbb{E}X)^{4}2n^{-1}\left(n^{2}-3n+2\right)
\displaystyle=\mathbb{E}X\big{(}\mathbb{E}X^{3}-3\mathbb{E}X^{2}\mathbb{E}X+2(\mathbb{E}X)^{3}\big{)}n^{-1}(n-1)(n-2)
\displaystyle=\mathbb{E}X\mathbb{E}(X-\mathbb{E}X)^{3}n^{-1}(n-1)(n-2) (94)

Moving forward, for the second term of (86) we have

𝔼[X¯ni(XiX¯n)3]=𝔼[X¯ni(Xi33Xi2X¯n+3XiX¯n2X¯n3)]\displaystyle\mathbb{E}\big{[}\bar{\mathrm{X}}_{n}\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]}=\mathbb{E}\big{[}\bar{\mathrm{X}}_{n}\sum_{i}\left(X^{3}_{i}-3X^{2}_{i}\bar{\mathrm{X}}_{n}+3X_{i}\bar{\mathrm{X}}^{2}_{n}-\bar{\mathrm{X}}^{3}_{n}\right)\big{]}
=i(𝔼[Xi3X¯n]3𝔼[Xi2X¯n2]+3𝔼[XiX¯n3]𝔼X¯n4)\displaystyle=\sum_{i}\left(\mathbb{E}[X^{3}_{i}\bar{\mathrm{X}}_{n}]-3\mathbb{E}[X^{2}_{i}\bar{\mathrm{X}}^{2}_{n}]+3\mathbb{E}[X_{i}\bar{\mathrm{X}}^{3}_{n}]-\mathbb{E}\bar{\mathrm{X}}^{4}_{n}\right) (95)

We compute the terms inside the parenthesis in (95). For the first one we have:

𝔼[Xi3X¯n]\displaystyle\mathbb{E}[X^{3}_{i}\bar{\mathrm{X}}_{n}] =n1𝔼[Xi3(Xi+j=1,jinXj)]\displaystyle=n^{-1}\mathbb{E}[X_{i}^{3}\left(X_{i}+\sum_{j=1,j\neq i}^{n}X_{j}\right)]
=n1𝔼X4+n1(n1)𝔼X3𝔼X.\displaystyle=n^{-1}\mathbb{E}X^{4}+n^{-1}(n-1)\mathbb{E}X^{3}\mathbb{E}X. (96)

For the second:

3𝔼[Xi2X¯n2]=3n2𝔼[Xi2(Xi+j=1,jiXj)2]\displaystyle-3\mathbb{E}[X^{2}_{i}\bar{\mathrm{X}}^{2}_{n}]=-3n^{-2}\mathbb{E}[X^{2}_{i}\left(X_{i}+\sum_{j=1,j\neq{i}}X_{j}\right)^{2}]
=(89)3n2𝔼X46n2(n1)𝔼X3𝔼X3n2(n1)(𝔼X2)23n2(n1)(n2)𝔼X2(𝔼X)2\displaystyle\stackrel{{\scriptstyle(\ref{quadraticterm})}}{{=}}-3n^{-2}\mathbb{E}X^{4}-6n^{-2}(n-1)\mathbb{E}X^{3}\mathbb{E}X-3n^{-2}(n-1)(\mathbb{E}X^{2})^{2}-3n^{-2}(n-1)(n-2)\mathbb{E}X^{2}(\mathbb{E}X)^{2} (97)

The third term in (95) reads:

3𝔼[XiX¯n3]=3n3𝔼[Xi(jXj)3]\displaystyle 3\mathbb{E}[X_{i}\bar{\mathrm{X}}^{3}_{n}]=3n^{-3}\mathbb{E}[X_{i}(\sum_{j}X_{j})^{3}]
=3n3𝔼[Xi(Xi+j=1,jiXj)3]\displaystyle=3n^{-3}\mathbb{E}[X_{i}(X_{i}+\sum_{j=1,j\neq{i}}X_{j})^{3}]
=3n3𝔼X4+12n3(n1)𝔼X3𝔼X+9n3(n1)(𝔼X2)2\displaystyle=3n^{-3}\mathbb{E}X^{4}+12n^{-3}(n-1)\mathbb{E}X^{3}\mathbb{E}X+9n^{-3}(n-1)(\mathbb{E}X^{2})^{2}
+18n3(n1)(n2)𝔼X2(𝔼X)2+3n3(𝔼X)4(n1)(n2)(n3),\displaystyle\hskip 15.0pt+18n^{-3}(n-1)(n-2)\mathbb{E}X^{2}(\mathbb{E}X)^{2}+3n^{-3}(\mathbb{E}X)^{4}(n-1)(n-2)(n-3), (98)

where we expanded the third power and subsequently used that

𝔼(j=1,jinXj)3=(n1)𝔼X3+3(n1)(n2)𝔼X2𝔼X+(𝔼X)3(n1)(n2)(n3).\displaystyle\mathbb{E}\left(\sum_{j=1,j\neq{i}}^{n}X_{j}\right)^{3}=(n-1)\mathbb{E}X^{3}+3(n-1)(n-2)\mathbb{E}X^{2}\mathbb{E}X+(\mathbb{E}X)^{3}(n-1)(n-2)(n-3). (99)
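
The i.i.d. power-sum expansions (89) and (99) can be verified exactly on a small example; the sketch below (an arbitrary two-point law and n=5, chosen only for illustration) enumerates all outcomes of \sum_{j\neq i}X_{j} and compares against the stated right-hand sides:

```python
# Exact finite check of (89) and (99) for a two-point distribution.
import itertools

vals, probs = [0.0, 3.0], [0.7, 0.3]                   # arbitrary two-point law
EX  = sum(p * v for v, p in zip(vals, probs))          # E X
EX2 = sum(p * v ** 2 for v, p in zip(vals, probs))     # E X^2
EX3 = sum(p * v ** 3 for v, p in zip(vals, probs))     # E X^3

n = 5                                                  # the sum below has n - 1 terms
e2 = e3 = 0.0
for combo in itertools.product(range(2), repeat=n - 1):    # all outcomes of (X_j), j != i
    p, s = 1.0, 0.0
    for k in combo:
        p *= probs[k]
        s += vals[k]
    e2 += p * s ** 2                                   # E (sum_{j != i} X_j)^2
    e3 += p * s ** 3                                   # E (sum_{j != i} X_j)^3

rhs89 = (n - 1) * EX2 + EX ** 2 * (n - 1) * (n - 2)
rhs99 = (n - 1) * EX3 + 3 * (n - 1) * (n - 2) * EX2 * EX + EX ** 3 * (n - 1) * (n - 2) * (n - 3)
print(abs(e2 - rhs89) < 1e-9, abs(e3 - rhs99) < 1e-9)  # True True
```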

The last term in the parenthesis of (95) is

𝔼X¯n4\displaystyle-\mathbb{E}\bar{\mathrm{X}}^{4}_{n} =n3𝔼X44n3(n1)𝔼X3𝔼X3n3(n1)(𝔼X2)2\displaystyle=-n^{-3}\mathbb{E}X^{4}-4n^{-3}(n-1)\mathbb{E}X^{3}\mathbb{E}X-3n^{-3}(n-1)(\mathbb{E}X^{2})^{2}
6n3(n1)(n2)𝔼X2(𝔼X)2n3(n1)(n2)(n3)(𝔼X)4,\displaystyle-6n^{-3}(n-1)(n-2)\mathbb{E}X^{2}(\mathbb{E}X)^{2}{-}n^{-3}(n-1)(n-2)(n-3)(\mathbb{E}X)^{4}, (100)

where we used

\displaystyle\mathbb{E}\bar{\mathrm{X}}_{n}^{4}=n^{-4}\sum_{k_{1}+\dots+k_{n}=4}\frac{4!}{k_{1}!\cdots k_{n}!}\mathbb{E}X^{k_{1}}\cdots\mathbb{E}X^{k_{n}}, (101)

where the sum runs over all k_{1},\dots,k_{n}\geq{0} that sum up to 4, and independence factorizes the expectation of the product into the product of expectations. By gathering all the terms we have that the parenthesis in (95) reads:

(𝔼[Xi3X¯n]3𝔼[Xi2X¯n2]+3𝔼[XiX¯n3]𝔼X¯n4)\displaystyle\left(\mathbb{E}[X^{3}_{i}\bar{\mathrm{X}}_{n}]-3\mathbb{E}[X^{2}_{i}\bar{\mathrm{X}}^{2}_{n}]+3\mathbb{E}[X_{i}\bar{\mathrm{X}}^{3}_{n}]-\mathbb{E}\bar{\mathrm{X}}^{4}_{n}\right) =n3𝔼X4(n1)(n2)\displaystyle=n^{-3}\mathbb{E}X^{4}(n-1)(n-2)
+n3𝔼X3𝔼X(n1)(n2)(n4)\displaystyle+n^{-3}\mathbb{E}X^{3}\mathbb{E}X(n-1)(n-2)(n-4)
3(𝔼X2)2n3(n1)(n2)\displaystyle-3(\mathbb{E}X^{2})^{2}n^{-3}(n-1)(n-2)
3𝔼X2(𝔼X)2n3(n1)(n2)(n4)\displaystyle-3\mathbb{E}X^{2}(\mathbb{E}X)^{2}n^{-3}(n-1)(n-2)(n-4)
+2(𝔼X)4n3(n1)(n2)(n3)\displaystyle+2(\mathbb{E}X)^{4}n^{-3}(n-1)(n-2)(n-3) (102)

Lastly, the outer summation in (95) multiplies (102) by n:

𝔼[X¯ni(XiX¯n)3]\displaystyle\mathbb{E}\left[\bar{\mathrm{X}}_{n}\sum_{i}\left(X_{i}-\bar{\mathrm{X}}_{n}\right)^{3}\right] =n2𝔼X4(n1)(n2)\displaystyle=n^{-2}\mathbb{E}X^{4}(n-1)(n-2)
+n2𝔼X3𝔼X(n1)(n2)(n4)\displaystyle+n^{-2}\mathbb{E}X^{3}\mathbb{E}X(n-1)(n-2)(n-4)
3(𝔼X2)2n2(n1)(n2)\displaystyle-3(\mathbb{E}X^{2})^{2}n^{-2}(n-1)(n-2)
3𝔼X2(𝔼X)2n2(n1)(n2)(n4)\displaystyle-3\mathbb{E}X^{2}(\mathbb{E}X)^{2}n^{-2}(n-1)(n-2)(n-4)
+2(𝔼X)4n2(n1)(n2)(n3)\displaystyle+2(\mathbb{E}X)^{4}n^{-2}(n-1)(n-2)(n-3)
=𝔼X𝔼(X𝔼X)3n1(n1)(n2)\displaystyle=\mathbb{E}X\mathbb{E}(X-\mathbb{E}X)^{3}n^{-1}(n-1)(n-2)
+n2(n1)(n2)(𝔼(X𝔼X)43[𝔼(X𝔼X)2]2)\displaystyle+n^{-2}(n-1)(n-2)\big{(}\mathbb{E}(X-\mathbb{E}X)^{4}-3[\mathbb{E}(X-\mathbb{E}X)^{2}]^{2}\big{)} (103)
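
The passage from raw to central moments in the last step of (103) is a purely algebraic identity: after factoring out n^{-2}(n-1)(n-2), the remaining combination of raw moments equals n\mathbb{E}X\,\mathbb{E}(X-\mathbb{E}X)^{3}+\mathbb{E}(X-\mathbb{E}X)^{4}-3[\mathbb{E}(X-\mathbb{E}X)^{2}]^{2}. A short symbolic check (a sketch, with EX, EX2, EX3, EX4 standing for the raw moments \mathbb{E}X^{k}):

```python
# Symbolic check of the raw-to-central-moment identity behind (103).
import sympy as sp

n, EX, EX2, EX3, EX4 = sp.symbols('n EX EX2 EX3 EX4')

# Raw-moment combination left after factoring out n^{-2}(n-1)(n-2) in (103).
lhs = EX4 + EX3 * EX * (n - 4) - 3 * EX2 ** 2 \
      - 3 * EX2 * EX ** 2 * (n - 4) + 2 * EX ** 4 * (n - 3)

mu3 = EX3 - 3 * EX2 * EX + 2 * EX ** 3                       # third central moment
mu4 = EX4 - 4 * EX3 * EX + 6 * EX2 * EX ** 2 - 3 * EX ** 4   # fourth central moment
var = EX2 - EX ** 2                                          # second central moment

rhs = n * EX * mu3 + mu4 - 3 * var ** 2
print(sp.simplify(lhs - rhs) == 0)                            # True
```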

By plugging (94) and (103) into (86) we obtain

𝔼[(XX¯n)i(XiX¯n)3]\displaystyle\mathbb{E}\big{[}(X-\bar{\mathrm{X}}_{n})\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]} =𝔼X𝔼(X𝔼X)3n1(n1)(n2)\displaystyle=\mathbb{E}X\mathbb{E}(X-\mathbb{E}X)^{3}n^{-1}(n-1)(n-2)
𝔼X𝔼(X𝔼X)3n1(n1)(n2)\displaystyle-\mathbb{E}X\mathbb{E}(X-\mathbb{E}X)^{3}n^{-1}(n-1)(n-2)
n2(n1)(n2)(𝔼(X𝔼X)43[𝔼(X𝔼X)2]2)\displaystyle-n^{-2}(n-1)(n-2)\big{(}\mathbb{E}(X-\mathbb{E}X)^{4}-3[\mathbb{E}(X-\mathbb{E}X)^{2}]^{2}\big{)} (104)

or

𝔼[(XX¯n)i(XiX¯n)3]=n2(n1)(n2)(3m22m4),\displaystyle\mathbb{E}\big{[}(X-\bar{\mathrm{X}}_{n})\sum_{i}(X_{i}-\bar{\mathrm{X}}_{n})^{3}\big{]}=n^{-2}(n-1)(n-2)\big{(}3m^{2}_{2}-m_{4}\big{)}, (105)

where m_{2} and m_{4} denote the second and fourth central moments of the probability measure \mu, respectively. Strictly positive values for \bar{\lambda} are feasible for all models with 3m^{2}_{2}>m_{4}, that is, for all distributions with negative excess kurtosis. Lastly, the condition (85) reads:

λ¯\displaystyle\bar{\lambda} 2𝔼[(XX¯n)Tn]𝔼Tn2+ϵ=2(3m22m4)𝔼Tn2+ϵn23n+2n3\displaystyle\leq\frac{2\mathbb{E}\big{[}(X-\bar{\mathrm{X}}_{n})\mathrm{T}_{n}\big{]}}{\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon}=\frac{2(3m^{2}_{2}-m_{4})}{\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon}\frac{n^{2}-3n+2}{n^{3}}
1n[8(3m22m4)𝔼Tn2+ϵ].\displaystyle\leq\frac{1}{n}\left[\frac{8(3m^{2}_{2}-m_{4})}{\mathbb{E}\mathrm{T}^{2}_{n}+\epsilon}\right]. (106)

For the case where the random variable is supported over a compact subset of \mathbb{R}^{d}, we can bound the polynomial terms of \gamma^{\star} in (2), and also the variance in the denominator of the first factor of (15). This assumption is not absolutely necessary and can be avoided by showing that (36) is a DRO estimator with high probability. Since Theorem 2 allows n\geq 3, this can be done by controlling the empirical variance in the factor \lambda/(1+2\lambda\sigma) of (15), as well as the polynomial terms in the expectation (w.r.t. \mu_{n}) related to the threshold \gamma^{\star}. Then, for a fixed model, as \lambda\rightarrow 0 one has to trade performance for distributional robustness, the latter being favored by the logarithmic dependence on the number of samples. ∎
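
Finally, Theorem 2 admits a direct Monte Carlo illustration (a sketch with arbitrary illustrative parameters, not an experiment reported in the paper): for a platykurtic law such as the uniform distribution, 3m^{2}_{2}>m_{4}, and for \bar{\lambda} below the (empirical) threshold in (85) the estimator \bar{\mathrm{X}}_{n}+\bar{\lambda}\mathrm{T}_{n} attains a strictly smaller mean squared error than the sample mean:

```python
# Monte Carlo sketch: Xbar_n + lambda_bar * T_n vs the sample mean, Uniform(0,1).
import numpy as np

rng = np.random.default_rng(3)
n, reps, lam_bar = 20, 200_000, 1.0
true_mean = 0.5

X = rng.uniform(0.0, 1.0, size=(reps, n))
xbar = X.mean(axis=1)                                   # sample means
T = ((X - xbar[:, None]) ** 3).mean(axis=1)             # empirical third central moments

mse_mean = np.mean((xbar - true_mean) ** 2)
mse_shift = np.mean((xbar + lam_bar * T - true_mean) ** 2)

# Empirical version of the threshold in (85) (with eps = 0); note that
# E[(X - Xbar_n) T_n] = E[(E X - Xbar_n) T_n] since X is an independent copy.
threshold = 2 * np.mean((true_mean - xbar) * T) / np.mean(T ** 2)

print(f"MSE of the sample mean      : {mse_mean:.6f}")
print(f"MSE of the shifted estimator: {mse_shift:.6f}")
print(f"lambda_bar = {lam_bar} vs empirical threshold = {threshold:.2f}")
```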