
Distributionally Robust Instrumental Variables Estimation

Zhaonan Qu
Data Science Institute, Columbia University
Yongchan Kwon
Department of Statistics, Columbia University
Corresponding author: [email protected], [email protected]
Abstract

Instrumental variables (IV) estimation is a fundamental method in econometrics and statistics for estimating causal effects in the presence of unobserved confounding. However, challenges such as untestable model assumptions and poor finite sample properties have undermined its reliability in practice. Viewing common issues in IV estimation as distributional uncertainties, we propose DRIVE, a distributionally robust IV estimation method. We show that DRIVE minimizes a square root variant of the ridge regularized two stage least squares (TSLS) objective when the ambiguity set is based on a Wasserstein distance. In addition, we develop a novel asymptotic theory for this estimator, showing that it achieves consistency without requiring the regularization parameter to vanish. This novel property ensures that the estimator is robust to distributional uncertainties that persist in large samples. We further derive the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures to select the regularization parameter based on our theoretical results. Simulation studies demonstrate the superior finite sample performance of Wasserstein DRIVE in terms of estimation error and out-of-sample prediction. Thanks to its regularization and robustness properties, Wasserstein DRIVE presents an appealing option when the practitioner is uncertain about model assumptions or distributional shifts in the data.

Keywords: Causal Inference; Distributionally Robust Optimization; Square Root Ridge; Invalid Instruments; Distribution Shift

1 Introduction

Instrumental variables (IV) estimation, also known as IV regression, is a fundamental method in econometrics and statistics for inferring causal relationships in observational data with unobserved confounding. It leverages access to additional variables (instruments) that affect the outcome exogenously and exclusively through the endogenous regressor to yield consistent causal estimates, even when the standard ordinary least squares (OLS) estimator is biased by unobserved confounding (Imbens and Angrist, 1994; Angrist et al., 1996; Imbens and Rubin, 2015). Over the years, IV estimation has become an indispensable tool for causal inference in empirical work in economics (Card and Krueger, 1994), as well as in the study of genetic and epidemiological data (Davey Smith and Ebrahim, 2003).

Despite the widespread use of IV in empirical and applied work, it has important limitations and challenges, such as invalid instruments (Sargan, 1958; Murray, 2006), weak instruments (Staiger and Stock, 1997), non-compliance (Imbens and Angrist, 1994), and heteroskedasticity, especially in settings with weak instruments or highly leveraged datasets (Andrews et al., 2019; Young, 2022). These issues can significantly impact the validity and quality of estimation and inference using instrumental variables (Jiang, 2017). Many works have been devoted to assessing and addressing these issues, through statistical tests (Hansen, 1982; Stock and Yogo, 2002), sensitivity analysis (Rosenbaum and Rubin, 1983; Bonhomme and Weidner, 2022), and additional assumptions or structures on the data generating process (Kolesár et al., 2015; Kang et al., 2016; Guo et al., 2018b).

Recently, an emerging line of work has highlighted interesting connections between causality and the concepts of invariance and robustness (Peters et al., 2016; Meinshausen, 2018; Rothenhäusler et al., 2021; Bühlmann, 2020; Jakobsen and Peters, 2022; Fan et al., 2024). The guiding philosophy is that causal properties can be viewed as robustness against changes across heterogeneous environments, represented by a set $\mathcal{P}$ of data distributions. The robustness of an estimator against $\mathcal{P}$ is often represented in a distributionally robust optimization (DRO) framework via the min-max problem

$$\min_{\beta}\ \sup_{\mathbb{P}\in\mathcal{P}}\ \mathbb{E}_{\mathbb{P}}[\ell(W;\beta)], \qquad (1)$$

where $\ell(W;\beta)$ is a loss function of the data $W$ and the parameter of interest $\beta$.

In many estimation and regression settings, one assumes that the true data distribution satisfies certain conditions, e.g., conditional independence or moment equations. Such conditions guarantee that standard statistical procedures based on the empirical distribution $\mathbb{P}_{0}$ of the data, such as M-estimation and the generalized method of moments (GMM), enable valid estimation and inference. In practice, however, it is often reasonable to expect that the distribution $\mathbb{P}_{0}$ of the observed data might deviate from that generated by the ideal model satisfying such conditions, e.g., due to measurement errors or model mis-specification. DRO addresses such distributional uncertainties by explicitly incorporating possible deviations into an ambiguity set $\mathcal{P}=\mathcal{P}(\mathbb{P}_{0},\rho)$ of distributions that are “close” to $\mathbb{P}_{0}$. The parameter $\rho$ quantifies the degree of uncertainty, e.g., as the radius of a ball centered around $\mathbb{P}_{0}$ and defined by some divergence measure between probability distributions. By minimizing the worst-case loss over $\mathcal{P}(\mathbb{P}_{0},\rho)$ in the min-max optimization problem (1), the DRO approach achieves robustness against the deviations captured by $\mathcal{P}(\mathbb{P}_{0},\rho)$.

DRO provides a useful perspective for understanding the robustness properties of statistical methods. For example, it is well known that members of the family of $k$-class estimators (Anderson and Rubin, 1949; Nagar, 1959; Theil, 1961) are more robust than the standard IV estimator against weak instruments (Andrews, 2007). Recent works by Rothenhäusler et al. (2021) and Jakobsen and Peters (2022) show that $k$-class estimators in fact have a DRO representation of the form (1), where $\ell$ is the square loss, $W=(X,Y)$, and $X,Y$ are endogenous and outcome variables generated from structural equation models parameterized by the natural parameter of $k$-class estimators. See Appendix A.2 for details. The general robust optimization problem (1) traces its roots to the classical robust statistics literature (Huber, 1964; Huber and Ronchetti, 2011) as well as classic works on robustness in economics (Hansen and Sargent, 2008). Drawing inspiration from them, recent works in econometrics have also explored the use of robust optimization to account for (local) deviations from model assumptions (Kitamura et al., 2013; Armstrong and Kolesár, 2021; Chen et al., 2021; Bonhomme and Weidner, 2022; Adjaho and Christensen, 2022; Fan et al., 2023). These works, together with works on invariance and robustness, highlight the emerging interactions between econometrics, statistics, and robust optimization.

Despite new developments connecting causality and robustness, many questions and opportunities remain. An important challenge in DRO is the choice of the ambiguity set $\mathcal{P}(\mathbb{P}_{0},\rho)$ to adequately capture distributional uncertainties. This choice is highly dependent on the structure of the particular problem of interest. While some existing DRO approaches use ambiguity sets $\mathcal{P}(\mathbb{P}_{0},\rho)$ based on marginal or joint distributions of the data, such sets may not effectively capture the structure of IV estimation models. In addition, as the min-max problem (1) minimizes the loss function under the worst-case distribution in $\mathcal{P}(\mathbb{P}_{0},\rho)$, a common concern is that the resulting estimator is too conservative when $\mathcal{P}(\mathbb{P}_{0},\rho)$ is too large. In particular, although DRO estimators enjoy better empirical performance in finite samples, their asymptotic validity typically requires the ambiguity set to vanish to a singleton, i.e., $\rho\rightarrow 0$ (Blanchet et al., 2019, 2022). However, in the context of IV estimation, distributional uncertainties about untestable model assumptions could persist in large samples, necessitating an ambiguity set that does not vanish to a singleton. It is therefore important to ask whether and how one can construct an estimator in the IV estimation setting that sufficiently captures the distributional uncertainties about model assumptions, and at the same time remains asymptotically valid with a non-vanishing robustness parameter.

In this paper, we propose to view common challenges to IV estimation through the lens of DRO, whereby uncertainties about model assumptions, such as the exclusion restriction and homoskedasticity, are captured by a suitably chosen ambiguity set in (1). Based on this perspective, we propose DRIVE, a general DRO approach to IV estimation. Instead of constructing the ambiguity set based on marginal or joint distributions as in existing works, we construct $\mathcal{P}(\mathbb{P}_{0},\rho)$ from distributions conditional on the instrumental variables. More precisely, we construct $\mathbb{P}_{0}$ as the empirical distribution of the outcome and endogenous variables $Y,X$ projected onto the space spanned by the instrumental variables. When the ambiguity set of DRIVE is based on the 2-Wasserstein metric, we show that the resulting estimator minimizes a square root version of the ridge regularized two stage least squares (TSLS) objective, where the radius $\rho$ of the ambiguity set becomes the regularization parameter. This regularized regression formulation relies on the general duality of Wasserstein DRO problems (Gao and Kleywegt, 2023; Blanchet et al., 2019; Kuhn et al., 2019).

We next reveal a surprising statistical property of the square root ridge by showing that Wasserstein DRIVE is consistent as long as the regularization parameter $\rho$ is bounded above by an estimable constant, which depends on the first stage coefficient of the IV model and can be interpreted as a measure of instrument quality. To our knowledge, this is the first consistency result for regularized regression estimators where the regularization parameter does not vanish as the sample size $n\rightarrow\infty$. One implication of our results is that Wasserstein DRIVE, being a regularized regression estimator, enjoys better finite sample properties, but does not introduce bias asymptotically even for non-vanishing $\rho$, unlike standard regularized regression estimators such as ridge and LASSO.

We further characterize the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures to select the regularization parameter. We demonstrate with numerical experiments that Wasserstein DRIVE improves over the finite sample performance of IV and $k$-class estimators, thanks to its ridge type regularization, while at the same time retaining asymptotic validity whenever instruments are valid. In particular, Wasserstein DRIVE achieves significant improvements in mean squared error (MSE) over IV and OLS when instruments are moderately invalid. These findings suggest that Wasserstein DRIVE can be an attractive option in practice when we are concerned about model assumptions.

The rest of the paper is organized as follows. In Section 2, we discuss the standard IV estimation framework and common challenges. In Section 3, we propose the Wasserstein DRIVE framework and provide the duality theory. In Section 4, we develop asymptotic results for Wasserstein DRIVE, including consistency under a non-vanishing robustness/regularization parameter. Section 5 conducts numerical studies that compare Wasserstein DRIVE with other estimators, including IV, OLS, and $k$-class estimators. Background materials, proofs, and additional results are included in the appendices in the supplementary material.

Notation. Throughout the paper, $\|v\|_{p}$ denotes the $p$-norm of a vector $v$, while $\|v\| := \|v\|_{2}$ denotes the Euclidean norm. $\text{Tr}(M)$ denotes the trace of a matrix $M$, and $\lambda_{k}(M)$ the $k$-th largest eigenvalue of a symmetric matrix $M$. Boldfaced variables, such as $\mathbf{X}$, represent a matrix whose $i$-th row is the variable $X_{i}$.

2 Background and Challenges in IV Estimation

In this section, we first provide a brief review of the standard IV estimation framework. We then motivate the DRO approach to IV estimation by viewing common challenges from the perspective of distributional uncertainties. In Section 3, we propose the Wasserstein distributionally robust instrumental variables estimation (DRIVE) framework.

2.1 Instrumental Variables Estimation

Consider the following standard linear instrumental variables regression model with $X\in\mathbb{R}^{p}$, $Z\in\mathbb{R}^{d}$ where $d\geq p$, and $\beta_{0}\in\mathbb{R}^{p}$, $\gamma\in\mathbb{R}^{d\times p}$:

$$Y=\beta_{0}^{T}X+\epsilon, \qquad X=\gamma^{T}Z+\xi. \qquad (2)$$

In (2), $X$ are the endogenous variables, $Z$ are the instrumental variables, and $Y$ is the outcome variable. The error terms $\epsilon$ and $\xi$ capture the unobserved (or residual) components of $Y$ and $X$, respectively. We are interested in estimating the causal effects $\beta_{0}$ of the endogenous variables $X$ on the outcome variable $Y$ given independent and identically distributed (i.i.d.) samples $\{X_{i},Y_{i},Z_{i}\}_{i=1}^{n}$. However, $X$ and $Y$ are confounded through some unobserved confounders $U$ that are correlated with both $Y$ and $X$, represented graphically in the directed acyclic graph (DAG) below:

[DAG: $Z \xrightarrow{\gamma} X \xrightarrow{\beta_{0}} Y$, with the unobserved confounder $U$ pointing into both $X$ and $Y$.]

Mathematically, the unobserved confounding can be described by the moment condition

$$\mathbb{E}[X\epsilon]\neq\mathbf{0}.$$

As a result of the unobserved confounding, the standard ordinary least squares (OLS) estimator of $\beta_{0}$ that regresses $Y$ on $X$ is biased. To address this problem, the IV estimation approach leverages access to the instrumental variables $Z$, also often called instruments, which satisfy the moment conditions

$$\text{rank}(\mathbb{E}[ZX^{T}])=p, \qquad (3)$$
$$\mathbb{E}[Z\epsilon]=\mathbf{0}, \quad \mathbb{E}[Z\xi^{T}]=\mathbf{0}. \qquad (4)$$

Under these conditions, a popular IV estimator is the two stage least squares (TSLS, sometimes also stylized as 2SLS) estimator (Theil, 1953). With $\Pi_{Z}:=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}$ and $\mathbf{X},\mathbf{Y},\mathbf{Z}$ matrix representations of $\{X_{i},Y_{i},Z_{i}\}_{i=1}^{n}$ whose $i$-th rows correspond to $X_{i},Y_{i},Z_{i}$, respectively, the TSLS estimator $\hat{\beta}^{\text{IV}}:=(\mathbf{X}^{T}\Pi_{Z}\mathbf{X})^{-1}\mathbf{X}^{T}\Pi_{Z}\mathbf{Y}$ minimizes the objective

$$\min_{\beta}\ \frac{1}{n}\|\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}. \qquad (5)$$

In contrast, the standard OLS estimator $\hat{\beta}^{\text{OLS}}$ solves the problem

$$\min_{\beta}\ \frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}. \qquad (6)$$

When the moment conditions (3) and (4) hold, the TSLS estimator is a consistent estimator of the causal effect $\beta_{0}$ under standard assumptions (Wooldridge, 2020), and valid inference can be performed by constructing variance estimators based on the asymptotic distribution of $\hat{\beta}^{\text{IV}}$ (Imbens and Rubin, 2015). Although not the most common presentation of TSLS, the optimization formulation in (5) provides intuition on how IV estimation works: when the instruments $Z$ are uncorrelated with the unobserved confounders $U$ affecting $X$ and $Y$, the projection operator $\Pi_{Z}$ applied to $\mathbf{X}$ “removes” the confounding from $X$, so that $\Pi_{Z}\mathbf{X}$ becomes (asymptotically) uncorrelated with $\epsilon$. Regressing $\mathbf{Y}$ on $\Pi_{Z}\mathbf{X}$ then yields a consistent estimator of $\beta_{0}$.
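To make the projection formulation concrete, the following minimal Python sketch (our own illustration; function and variable names are not from the paper) computes the TSLS estimator by solving (5) directly:

```python
import numpy as np

def tsls(X, Y, Z):
    """TSLS via the projection formulation (5): regress Y on Pi_Z X."""
    # First stage: fitted values Pi_Z X from regressing X on Z
    gamma_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)   # (d, p)
    X_hat = Z @ gamma_hat                               # Pi_Z X, shape (n, p)
    # Second stage: (X_hat' X_hat)^{-1} X_hat' Y = (X' Pi_Z X)^{-1} X' Pi_Z Y
    beta_hat, *_ = np.linalg.lstsq(X_hat, Y, rcond=None)
    return beta_hat
```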

The validity of estimation and inference based on $\hat{\beta}^{\text{IV}}$ relies critically on the moment conditions (3) and (4). Condition (3) is often called the relevance condition or rank condition, and requires $\mathbb{E}[ZX^{T}]$ to have full rank (recall $p\leq d$). In the special case of one-dimensional instrumental and endogenous variables, i.e., $d=p=1$, it simply reduces to $\mathbb{E}[ZX]\neq 0$. Intuitively, the relevance condition ensures that the instruments $Z$ can explain sufficient variation in the endogenous variables $X$; in this case, the instruments are said to be relevant and strong. When $\mathbb{E}[ZX^{T}]$ is close to being rank deficient, i.e., the smallest eigenvalue $\lambda_{p}(\mathbb{E}[ZX^{T}])\approx 0$, IV estimation suffers from the so-called weak instrument problem, which results in many issues in estimation and inference, such as small sample bias and non-normal statistics (Stock et al., 2002). Some $k$-class estimators, such as limited information maximum likelihood (LIML) (Anderson and Rubin, 1949), are partially motivated by these problems. Condition (4) is often referred to as the exclusion restriction or instrument exogeneity (Imbens and Rubin, 2015), and instruments that satisfy this condition are called valid instruments. When an instrument $Z$ is correlated with the unobserved confounder of $X,Y$, or when $Z$ affects the outcome $Y$ through an unobserved variable other than the endogenous variable $X$, the instrument becomes invalid, resulting in biased estimation and invalid inference for $\beta_{0}$ (Murray, 2006). These issues can be exacerbated when the instruments are weak, when there is heteroskedasticity (Andrews et al., 2019), or when the data is highly leveraged (Young, 2022).

Although many works have been devoted to addressing the problems of weak and invalid instruments, there are fundamental limits on the extent to which one can test for these issues. Given the popularity of IV estimation in practice, it is therefore desirable to have estimation and inference procedures that are robust to the presence of such issues. Our work is precisely motivated by these considerations. Compared to many existing robust approaches to IV estimation, we take a more agnostic approach via distributionally robust optimization. More precisely, we argue that many common challenges in IV estimation can be viewed as uncertainties about the data distribution, i.e., deviations from the ideal model that satisfies IV assumptions, which can be explicitly taken into account by choosing an appropriate ambiguity set in a DRO formulation of the standard IV estimation. To demonstrate this perspective more concretely, we now examine some common problems in IV estimation and show that they can be viewed as distributional shifts under a suitable metric, and therefore amenable to a DRO approach.

2.2 Challenges in IV Estimation as Distributional Uncertainties

Consider now the following one-dimensional version of the IV model in (2)

$$Y=X\beta_{0}+\epsilon, \qquad X=Z\gamma+\xi,$$

where we assume that $X,Y$ are confounded by an unobserved confounder $U$ through

$$\epsilon=Z\eta+U, \qquad \xi=U.$$

Note that in addition to $U$, there is also potentially a direct effect $\eta$ from the instrument $Z$ to the outcome variable $Y$. We focus on the resulting model for our subsequent discussions:

$$Y=X\beta_{0}+Z\eta+U, \qquad X=Z\gamma+U. \qquad (7)$$

The standard IV assumptions can be succinctly summarized for (7). The relevance condition (3) requires that $\gamma\neq 0$, while the exclusion restriction (4) requires that $Z$ is uncorrelated with $U$ and, in addition, that $\eta=0$. Assume that $U,Z$ are independent standard normal variables; $X,Y$ are then determined by (7). We are interested in the shifts in the data distribution, appropriately defined and measured, when the exogeneity and relevance conditions are violated.

Example 1 (Invalid Instruments).

As $U,Z$ are independent, $Z$ becomes invalid if and only if $\eta\neq 0$, and $|\eta|$ quantifies the degree of instrument invalidity. Let $\mathbb{P}_{\eta}$ denote the joint distribution on $(X,Y,Z)$ in the model (7) indexed by $\eta\in\mathbb{R}$. Let $\tilde{\mathbb{P}}_{\eta,Z}$ be the resulting normal distribution on the conditional random variables $(\tilde{X},\tilde{Y})=(X\mid Z,\,Y\mid Z)$, given $Z$. We are interested in the (expected) distributional shift between $\tilde{\mathbb{P}}_{\eta,Z}$ and $\tilde{\mathbb{P}}_{0,Z}$. We choose the 2-Wasserstein distance $W_{2}(\cdot,\cdot)$ (Kantorovich, 1942, 1960), also known as the Kantorovich metric, to measure this shift. Conveniently, the 2-Wasserstein distance between two normal distributions $\mathbb{Q}_{1}=\mathcal{N}(\mu_{1},\Sigma_{1})$ and $\mathbb{Q}_{2}=\mathcal{N}(\mu_{2},\Sigma_{2})$ has an explicit formula due to Olkin and Pukelsheim (1982):

$$W_{2}(\mathbb{Q}_{1},\mathbb{Q}_{2})^{2}=\|\mu_{1}-\mu_{2}\|^{2}+\text{Tr}\big(\Sigma_{1}+\Sigma_{2}-2(\Sigma_{2}^{1/2}\Sigma_{1}\Sigma_{2}^{1/2})^{1/2}\big). \qquad (8)$$

Applying (8) to the conditional distributions $\tilde{\mathbb{P}}_{\eta,Z}$ and $\tilde{\mathbb{P}}_{0,Z}$, and taking the expectation with respect to $Z$, we obtain the simple formula

$$\mathbb{E}\,W_{2}(\tilde{\mathbb{P}}_{\eta,Z},\tilde{\mathbb{P}}_{0,Z})=\sqrt{\frac{2}{\pi}}\cdot|\eta|. \qquad (9)$$

This calculation shows that the degree of instrument invalidity, as measured by the strength of the direct effect of the instrument on the outcome, is proportional to the expected distributional shift of the distribution on $(\tilde{X},\tilde{Y})$ from that under the valid IV assumption. Moreover, the simple form of the expected distributional shift relies on our choice of the Wasserstein distance to measure the distributional shift of the conditional random variables $(\tilde{X},\tilde{Y})$. If we instead measure shifts in the joint distribution $\mathbb{P}_{\eta}$ on $(X,Y,Z)$, the resulting distributional shift will depend on other model parameters in addition to $\eta$. This example therefore suggests that the Wasserstein metric applied to the conditional distributional shift of $(\tilde{X},\tilde{Y})$ could be an appropriate measure of distributional uncertainty in IV regression models.
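As a quick sanity check of (9): conditional on $Z$, turning on $\eta$ shifts the mean of $\tilde{Y}$ by $\eta Z$ while leaving the conditional covariance unchanged, so (8) gives $W_{2}(\tilde{\mathbb{P}}_{\eta,Z},\tilde{\mathbb{P}}_{0,Z})=|\eta Z|$, and $\mathbb{E}|\eta Z|=\sqrt{2/\pi}\,|\eta|$. A short Monte Carlo sketch (our illustration, with an arbitrary choice of $\eta$) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.7
Z = rng.standard_normal(1_000_000)

# Conditional on Z, only the mean of Y | Z shifts, by eta * Z, so by (8)
# the conditional 2-Wasserstein distance is exactly |eta * Z|.
w2 = np.abs(eta * Z)
print(w2.mean())                       # Monte Carlo estimate of E[W2]
print(np.sqrt(2 / np.pi) * abs(eta))   # closed form (9), ~0.5585
```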

Example 2 (Weak Instruments).

Now consider another common problem with IV estimation, which arises when the first stage coefficient $\gamma$ is close to $0$. Let $\tilde{\mathbb{Q}}_{\gamma,Z}$ be the distribution on $(\tilde{X},\tilde{Y})$ indexed by $\gamma\in\mathbb{R}$, with $\eta=0$ in (7). In this case, we can verify that

$$\mathbb{E}\,W_{2}(\tilde{\mathbb{Q}}_{\gamma_{1},Z},\tilde{\mathbb{Q}}_{\gamma_{2},Z})=\sqrt{\frac{2}{\pi}}\sqrt{1+\beta_{0}^{2}}\,|\gamma_{1}-\gamma_{2}|.$$

The expected distributional shift between a setting with a “strong” instrument, $\gamma=\gamma_{0}$, and one with a “weak” instrument, $\gamma=\delta\cdot\gamma_{0}$ with $\delta\rightarrow 0$ in the limit, measured by the 2-Wasserstein metric, is therefore equal to

$$\sqrt{\frac{2}{\pi}}\sqrt{1+\beta_{0}^{2}}\cdot|\gamma_{0}|. \qquad (10)$$

Similar to the previous example, the degree of violation of the strong instrument assumption, as measured by the presumed instrument strength $|\gamma_{0}|$, is proportional to the expected distributional shift on $(\tilde{X},\tilde{Y})$. Note, however, that the distance is also proportional to the magnitude of the causal parameter $\beta_{0}$. This is reasonable because instrument strength is relative, and should be measured against the scale of the true causal parameter.

Next, we consider the distributional shift resulting from heteroskedastic errors, which are known to render the TSLS estimator inefficient and the standard variance estimator invalid (Baum et al., 2003). Some $k$-class estimators, such as the LIML and Fuller estimators, also become inconsistent under heteroskedasticity (Hausman et al., 2012).

Example 3 (Heteroskedasticity).

In this example, we assume $\eta=0$ in (7) and that the conditional distribution of $U$ given $Z$ is centered normal with standard deviation $\alpha\cdot|Z|+1$, where $\alpha\geq 0$. We are interested in the average distributional shift of the heteroskedastic setting ($\alpha>0$) from the homoskedastic setting ($\alpha=0$). We can verify that the expected 2-Wasserstein distance between the conditional distributions on $(\tilde{X},\tilde{Y})$ is

$$\sqrt{\frac{2}{\pi}}\sqrt{1+(\beta_{0}^{2}+1)^{2}}\cdot\alpha, \qquad (11)$$

which is proportional to the degree of heteroskedasticity $\alpha$.

The preceding discussions demonstrate that distributional uncertainties resulting from violations of common model assumptions in IV estimation are well captured by the 2-Wasserstein distance on the distributions of the conditional variables $(\tilde{X},\tilde{Y})$. We therefore propose to construct the ambiguity set in (1) as a Wasserstein ball around the empirical distribution on $(\tilde{X},\tilde{Y})$. We provide details of this framework in the next section.

3 Wasserstein Distributionally Robust IV Estimation

In this section, we propose the distributionally robust IV estimation framework, using Wasserstein ambiguity sets to account for distributional uncertainties in IV estimation. We develop the dual formulation of Wasserstein DRIVE as a regularized regression, and discuss its connections to, and distinctions from, other regularized regression estimators.

3.1 DRIVE

Motivated by the intuition that common challenges to IV estimation in practice, such as violations of model assumptions, can be viewed as distributional uncertainties on the conditional distributions of $(\tilde{X},\tilde{Y})=(X\mid Z,\,Y\mid Z)$, we propose the Distributionally Robust IV Estimation (DRIVE) framework, which solves the following DRO problem given a dataset $\{(X_{i},Y_{i},Z_{i})\}_{i=1}^{n}$ and robustness parameter $\rho$:

$$\textbf{(DRIVE Objective)}\quad \min_{\beta}\ \sup_{\{\mathbb{Q}:\,D(\mathbb{Q},\tilde{\mathbb{P}}_{n})\leq\rho\}}\ \mathbb{E}_{\mathbb{Q}}\big[(Y-X^{T}\beta)^{2}\big], \qquad (12)$$

where $\tilde{\mathbb{P}}_{n}\in\mathcal{P}(\mathcal{X}\times\mathcal{Y})$ is the empirical distribution on $(X,Y)$ induced by the projected samples

$$\{\tilde{X}_{i},\tilde{Y}_{i}\}_{i=1}^{n}\equiv\{(\Pi_{\mathbf{Z}}\mathbf{X})_{i},(\Pi_{\mathbf{Z}}\mathbf{Y})_{i}\}_{i=1}^{n}.$$

Here $\mathbf{X}\in\mathbb{R}^{n\times p}$, $\mathbf{Y}\in\mathbb{R}^{n}$, $\mathbf{Z}\in\mathbb{R}^{n\times d}$ are the matrix representations of the observations, $\Pi_{\mathbf{Z}}=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}$ is the projection matrix onto the column space of $\mathbf{Z}$, and $D(\cdot,\cdot)$ is a metric or divergence measure on the space of probability distributions on $\mathcal{X}\times\mathcal{Y}$. Therefore, in our DRIVE framework, we first regress both the outcome $Y$ and the covariate $X$ on the instrument $Z$ to form the $n$ predicted samples $(\Pi_{\mathbf{Z}}\mathbf{X},\Pi_{\mathbf{Z}}\mathbf{Y})$. An ambiguity set is then constructed using $D$ around the empirical distribution $\tilde{\mathbb{P}}_{n}$. This choice of the reference distribution $\mathbb{P}_{0}$ is a key distinction of our work from previous works that leverage DRO in statistical models. In the standard regression/classification setting, the reference distribution is often chosen as the empirical distribution $\hat{\mathbb{P}}_{n}$ on $\{X_{i},Y_{i}\}_{i=1}^{n}$ (Blanchet et al., 2019). In the IV estimation setting, where we have additional access to instruments $Z$, we could construct ambiguity sets around the empirical distribution on the marginal quantities $\{(X_{i},Y_{i},Z_{i})\}_{i=1}^{n}$, which is the approach taken in Bertsimas et al. (2022). In contrast, we choose to use the empirical distribution on the conditional quantities $\{(\tilde{X}_{i},\tilde{Y}_{i})\}_{i=1}^{n}$. This choice is motivated by the intuition that violations of IV assumptions can be captured by conditional distributional shifts, as illustrated by the examples in the previous section.

The choice of the divergence measure $D(\cdot,\cdot)$ is also important, as it characterizes the potential distributional uncertainties that DRIVE is robust to. In this paper, we propose to use the 2-Wasserstein distance $W_{2}(\mu,\nu)$ between two probability distributions $\mu,\nu$ (Mohajerin Esfahani and Kuhn, 2018; Gao and Kleywegt, 2023). One advantage of the Wasserstein distance is the tractability of its associated DRO problems (Blanchet et al., 2019), which can often be formulated as regularized regression problems with unique solutions; see also Appendix A.1. In Section 2.2, we provided several examples demonstrating that the 2-Wasserstein distance is able to capture common distributional uncertainties in the IV estimation setting. Alternative distance measures on probability distributions, such as the class of $\phi$-divergences (Ben-Tal et al., 2013), can also be used instead of the Wasserstein distance. For example, Kitamura et al. (2013) use the Hellinger distance to model local perturbations in robust estimation under moment restrictions, although not in the IV estimation setting. In this paper, we focus on the Wasserstein DRIVE framework based on $D=W_{2}$, and leave studies of DRIVE with other choices of $D$ to future work.

We next begin our formal study of Wasserstein DRIVE. In Section 3.2, we show that the Wasserstein DRIVE objective is dual to a convex regularized regression problem. As a result, the solution to the optimization problem (12) is well-defined, and we denote this estimator by $\hat{\beta}_{\mathrm{DRIVE}}$. In Section 4, we show that $\hat{\beta}_{\mathrm{DRIVE}}$ is consistent with potentially non-vanishing choices of the robustness parameter and derive its asymptotic distribution.

3.2 Dual Representation of Wasserstein DRIVE

It is well known in the optimization literature that min-max optimization problems such as (12) often have equivalent formulations as regularized regression problems. This correspondence between regularization and robustness already manifests itself in ridge regression, which is equivalent to an $\ell_{2}$-robust OLS regression (Bertsimas and Copenhaver, 2018). Importantly, the regularized regression formulations are often more tractable for solving the resulting optimization problem, and also facilitate the study of the statistical properties of the estimators. We first show that the Wasserstein DRIVE objective can also be written as a regularized regression problem similar to, but distinct from, the standard TSLS objective with ridge regularization. Proofs can be found in Appendix F.

Theorem 3.1.

The optimization problem in (12) is equivalent to the following convex regularized regression problem:

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}}+\sqrt{\rho(\|\beta\|^{2}+1)}, \qquad (13)$$

where $\Pi_{\mathbf{Z}}=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}$ is the finite sample projection operator, and $\Pi_{\mathbf{Z}}\mathbf{Y}$ and $\Pi_{\mathbf{Z}}\mathbf{X}$ are the OLS predictions of $\mathbf{Y}$ and $\mathbf{X}$ using the instruments $\mathbf{Z}$.

Note that the robustness parameter $\rho$ of the DRO formulation (12) now has a dual interpretation as the regularization parameter in (13). This convex regularized regression formulation implies that the min-max problem (12) associated with Wasserstein DRIVE has a unique solution, thanks to the strict convexity of the regularization term $\sqrt{\rho(\|\beta\|^{2}+1)}$, and it is easy to compute despite not having a closed form solution. In particular, (13) can be reformulated as a standard second order conic program (SOCP) (El Ghaoui and Lebret, 1997), which can be solved efficiently with off-the-shelf convex optimization routines, such as CVX. More importantly, we leverage this formulation of Wasserstein DRIVE as a regularized regression problem to study its novel statistical properties in Section 4.
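To illustrate, the following sketch solves (13) as an SOCP using the CVXPY modeling package (an assumption on tooling; the paper's experiments may use a different solver, and all names here are our own):

```python
import cvxpy as cp
import numpy as np

def drive(X, Y, Z, rho):
    """Solve the Wasserstein DRIVE dual (13) as a second order conic program."""
    n, p = X.shape
    Pi = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection onto col(Z)
    PX, PY = Pi @ X, Pi @ Y                  # Pi_Z X and Pi_Z Y
    beta = cp.Variable(p)
    # sqrt(||PY - PX beta||^2 / n) + sqrt(rho * (||beta||^2 + 1))
    obj = cp.norm(PY - PX @ beta, 2) / np.sqrt(n) \
        + np.sqrt(rho) * cp.norm(cp.hstack([beta, np.ones(1)]), 2)
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value
```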

The equivalence between Wasserstein DRO problems and regularized regression problems is a familiar general result in recent works. For example, Blanchet et al. (2019) and Gao and Kleywegt (2023) derive similar duality results for distributionally robust regression with $q$-Wasserstein distances for $q>1$. Compared to previous works, our work is distinct in the following aspects. First, we apply Wasserstein DRO to the IV estimation setting instead of standard regression settings, such as OLS or logistic regression. Although from an optimization point of view there is no substantial difference, the IV setting motivates a new asymptotic regime that uncovers interesting statistical properties of the resulting estimators. Second, the regularization term in (13) is distinct from those in previous works, which often use $\|\beta\|_{p}$ with $p\geq 1$. This seemingly innocuous difference turns out to be crucial for our novel results on Wasserstein DRIVE. Lastly, compared to the proof in Blanchet et al. (2019), our proof of Theorem 3.1 is based on a different argument using the Sherman-Morrison formula instead of Hölder’s inequality, which provides an independent proof of the important duality result for Wasserstein distributionally robust optimization.

3.3 Wasserstein DRIVE and Regularized Regression

The regularized regression formulation of the Wasserstein DRIVE problem in (13) resembles the standard ridge regularized (Hoerl and Kennard, 1970) TSLS regression:

$$\min_{\beta}\ \frac{1}{n}\sum_{i}(Y_{i}-\tilde{X}_{i}^{T}\beta)^{2}+\rho\|\beta\|^{2}\ \Longleftrightarrow\ \min_{\beta}\ \frac{1}{n}\|\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}+\rho\|\beta\|^{2}. \qquad (14)$$

We therefore refer to (13) as the square root ridge regularized TSLS. However, there are three major distinctions between (13) and (14) that are essential to the statistical properties of Wasserstein DRIVE not enjoyed by the standard ridge regularized TSLS: first, the square root operations applied to both the risk term and the penalty term; second, the constant inside the regularization term; and third, the additional projection of the outcomes in $\Pi_{\mathbf{Z}}\mathbf{Y}$. We further elaborate on these features in Section 4.

In the standard regression setting without instrumental variables, the square root ridge

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho(1+\|\beta\|^{2})} \qquad (15)$$

also resembles the “square root LASSO” of Belloni et al. (2011):

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\lambda\|\beta\|_{1}. \qquad (16)$$

In particular, both can be written as dual problems of Wasserstein DRO problems (Blanchet et al., 2019). However, the square root LASSO is motivated by high-dimensional regression settings where the dimension of $X$ is potentially larger than the sample size $n$ but $\beta$ is very sparse. In contrast, our study of the square root ridge is motivated by its robustness properties in the IV estimation setting, where the dimension of the endogenous variable is small (often one). In other words, variable selection is not the main focus of this paper. A variant of the square root ridge estimator in (15) was also considered in the standard regression setting by Owen (2007), who instead uses the penalty term $\|\beta\|_{2}$.

As is well known in the regularized regression literature (Fu and Knight, 2000), when the regularization parameter decays to $0$ at rate $O_{p}(1/\sqrt{n})$, the ridge estimator is consistent. A similar result also holds for the square root ridge (15) in the standard regression setting as $\rho\rightarrow 0$. However, in the IV estimation setting, our distributional uncertainties about model assumptions, such as the validity of instruments, could persist even in large samples. Recall that $\rho$ is also the robustness parameter in the DRO formulation (12). Therefore, the usual requirement that $\rho\rightarrow 0$ as $n\rightarrow\infty$ cannot adequately capture distributional uncertainties in the IV estimation setting. In the next section, we study the asymptotic properties of Wasserstein DRIVE when $\rho$ does not necessarily vanish. In particular, we establish the consistency of Wasserstein DRIVE by leveraging the three distinct features of (13) that are absent in the standard ridge regularized TSLS regression (14). This asymptotic result is in stark contrast to the conventional wisdom that regularized regression achieves lower variance at the cost of non-zero bias.

4 Asymptotic Theory of Wasserstein DRIVE

In this section, we leverage distinct geometric features of the square root ridge regression to study the asymptotic properties of Wasserstein DRIVE. In Section 4.1, we show that the Wasserstein DRIVE estimator is consistent for any $\rho\in[0,\overline{\rho}]$, where $\overline{\rho}$ depends on the first stage coefficient $\gamma$. This property is a consequence of the consistency of the square root ridge estimator in settings where the objective value at the true parameter vanishes, such as the GMM estimation setting. It ensures that Wasserstein DRIVE can achieve better finite sample performance thanks to its ridge type regularization, while at the same time retaining asymptotic validity when instruments are valid. In Section 4.2, we characterize the asymptotic distribution of Wasserstein DRIVE, and discuss several special settings particularly relevant in practice, such as the just-identified setting with one-dimensional instrumental and endogenous variables.

4.1 Consistency of Wasserstein DRIVE

Recall the linear IV regression model in (2)

$$Y=\beta_{0}^{T}X+\epsilon, \qquad X=\gamma^{T}Z+\xi,$$

where $X\in\mathbb{R}^{p}$, $Z\in\mathbb{R}^{d}$, and $\beta_{0}\in\mathbb{R}^{p}$, $\gamma\in\mathbb{R}^{d\times p}$ with $d\geq p$ to ensure identification. In this section, we make the standard assumptions that the instruments satisfy the relevance and exogeneity conditions in (3) and (4), that $\epsilon,\xi$ are homoskedastic, that the instruments $Z$ are not perfectly collinear, and that $\mathbb{E}\|Z\|^{2k}<\infty$, $\mathbb{E}\|\xi\|^{2k}<\infty$, $\mathbb{E}|\epsilon|^{2k}<\infty$ for some $k>2$. The results can be extended in a straightforward manner when we relax these assumptions, e.g., by only requiring that exogeneity holds asymptotically. Given i.i.d. samples from the linear IV model, recall the regularized regression formulation of the Wasserstein DRIVE objective

$$\min_{\beta}\ \sqrt{\frac{1}{n}\sum_{i}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)_{i}^{2}}+\sqrt{\rho_{n}(\|\beta\|^{2}+1)}, \qquad (17)$$

where $\Pi_{\mathbf{Z}}=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\in\mathbb{R}^{n\times n}$, and $\Pi_{\mathbf{Z}}\mathbf{Y}\in\mathbb{R}^{n}$ and $\Pi_{\mathbf{Z}}\mathbf{X}\in\mathbb{R}^{n\times p}$ are $\mathbf{Y}\in\mathbb{R}^{n}$ and $\mathbf{X}\in\mathbb{R}^{n\times p}$ projected onto the instrument space spanned by $\mathbf{Z}\in\mathbb{R}^{n\times d}$.

Theorem 4.1 (Consistency of Wasserstein DRIVE).

Let $\hat{\beta}_{n}^{\text{DRIVE}}$ be the unique minimizer of the objective in (17). Let $\rho_{n}\rightarrow\rho\geq 0$ and $\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\rightarrow_{p}\mathbb{E}[ZZ^{T}]=\Sigma_{Z}$. Under the relevance and exogeneity conditions (3) and (4), the Wasserstein DRIVE estimator $\hat{\beta}_{n}^{\text{DRIVE}}$ converges in probability as $n\rightarrow\infty$ to $\beta^{\text{DRIVE}}$, the unique minimizer of

$$\min_{\beta}\ \sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)}. \qquad (18)$$

Moreover, whenever $\rho\in[0,\overline{\rho}]$, where $\overline{\rho}=\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$ is the smallest eigenvalue of $\gamma^{T}\Sigma_{Z}\gamma\in\mathbb{R}^{p\times p}$, the unique minimizer of the objective (18) is the true causal effect, i.e., $\beta^{\text{DRIVE}}\equiv\beta_{0}$.

Therefore, Wasserstein DRIVE is consistent as long as the limit of the regularization parameter is bounded above by $\overline{\rho}=\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$. In the case when $\Sigma_{Z}=\sigma_{Z}^{2}I_{d}$ for $\sigma_{Z}^{2}>0$, the upper bound $\overline{\rho}$ is proportional to the square of the smallest singular value of the first stage coefficient $\gamma$, which is positive under the relevance condition (3). Recall that $\rho_{n}$ is the radius of the Wasserstein ball in the min-max formulation of DRIVE in (12). Theorem 4.1 therefore guarantees that even when the robustness parameter $\rho_{n}\equiv\rho\neq 0$, in which case the solution to the min-max problem differs from the TSLS estimator ($\rho=0$), the resulting estimator always has the same limit as long as $\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$.

The significance of Theorem 4.1 is twofold. First, it provides a meaningful interpretation of the robustness parameter $\rho$ in Wasserstein DRIVE in terms of problem parameters, namely the variance-covariance matrix $\Sigma_{Z}$ of $Z$ and the first stage coefficient $\gamma$ in the IV regression model. The maximum amount of robustness that can be afforded by Wasserstein DRIVE without sacrificing consistency is $\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$, which directly depends on the strength and variance of the instrument. This relation can be described more precisely when $\Sigma_{Z}=\sigma_{Z}^{2}I_{d}$, in which case the bound is proportional to $\sigma_{Z}^{2}$ and $\lambda_{p}(\gamma^{T}\gamma)$. Both quantities improve the quality of the instruments: a larger $\sigma_{Z}^{2}$ increases the proportion of variance in $X$ and $Y$ explained by the instrument relative to noise, while a $\gamma$ far from rank deficiency avoids the weak instrument problem. Therefore, the robustness of Wasserstein DRIVE depends intrinsically on the strength of the instruments. The quantity $\gamma^{T}\Sigma_{Z}\gamma$ is not unfamiliar in the IV setting, as it is proportional to the inverse of the asymptotic variance of the standard TSLS estimator when errors are homoskedastic. This observation suggests an intrinsic connection between robustness and efficiency in the IV setting. See the discussions after Theorem 4.2 for more on this point.

More importantly, Theorem 4.1 is the first consistency result for regularized regression estimators where the regularization parameter does not vanish with the sample size. Although regularized regression methods such as ridge and LASSO are often associated with better finite sample performance at the cost of introducing some bias, our work demonstrates that, in the IV estimation setting, we can get the best of both worlds. On one hand, the ridge type regularization in Wasserstein DRIVE improves upon the finite sample properties of standard IV estimators, which aligns with conventional wisdom on regularized regression. On the other hand, with a bounded level of the regularization parameter $\rho$, Wasserstein DRIVE can still achieve consistency. This is in stark contrast to existing asymptotic results on regularized regression (Fu and Knight, 2000). Therefore, in the context of IV estimation, Wasserstein DRIVE achieves consistency and a certain amount of robustness at the same time, by leveraging additional information in the form of valid instruments. The maximum degree of robustness that can be achieved also has a natural interpretation in terms of the strength and variance of the instruments.

Theorem 4.1 also suggests the following procedure to construct a feasible and valid robustness/regularization parameter $\hat{\rho}$ given data $\{X_{i},Y_{i},Z_{i}\}_{i=1}^{n}$. Let $\hat{\gamma}=(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\mathbf{X}$ be the OLS estimator of the first stage coefficient $\gamma$ and $\hat{\Sigma}_{Z}$ an estimator of $\Sigma_{Z}$, such as the heteroskedasticity-robust estimator (White, 1980). We can use any $\hat{\rho}\leq\lambda_{p}(\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma})$ to construct the Wasserstein DRIVE objective, i.e., any value bounded above by the smallest eigenvalue of $\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma}$. Under the assumptions in Theorem 4.1, $\lambda_{p}(\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma})\rightarrow\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$, which guarantees that the Wasserstein DRIVE estimator with parameter $\hat{\rho}$ is consistent. In Appendix E, we discuss the construction of feasible regularization parameters in more detail. We demonstrate the validity and superior finite sample performance of DRIVE based on these proposals in the simulation studies in Section 5.
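A minimal plug-in version of this procedure (our sketch; it uses the sample second moment of $Z$ for $\hat{\Sigma}_{Z}$ and returns the largest feasible value $\lambda_{p}(\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma})$) might look as follows:

```python
import numpy as np

def feasible_rho(X, Z):
    """Plug-in bound lambda_p(gamma_hat' Sigma_hat_Z gamma_hat) from Theorem 4.1."""
    n = Z.shape[0]
    gamma_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)  # first stage OLS, (d, p)
    Sigma_hat = Z.T @ Z / n                            # sample second moment of Z
    M = gamma_hat.T @ Sigma_hat @ gamma_hat            # (p, p), symmetric PSD
    return np.linalg.eigvalsh(M)[0]                    # smallest eigenvalue
```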

One may wonder why Wasserstein DRIVE can achieve consistency with a non-zero regularization $\rho$. Here we briefly discuss why the limiting objective (18),

$$\min_{\beta}\ \sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)},$$

has a unique minimizer at $\beta_{0}$ for bounded $\rho>0$. The first term $\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}$ achieves its minimum value of $0$ at $\beta=\beta_{0}$. When $\rho$ is small, adding the regularization term $\sqrt{\rho(\|\beta\|^{2}+1)}$ does not overwhelm the first term, especially when its curvature at $\beta_{0}$ is large. As a result, we may expect the minimizer not to deviate much from $\beta_{0}$. While this intuition is reasonable qualitatively, it does not fully explain the fact that the minimizer does not change at all for small $\rho$. In the standard regression setting, the same intuition can be applied to the standard ridge regularization, yet we know shrinkage occurs as soon as $\rho>0$. The key distinction of (17) turns out to be the square root operations applied to the loss and regularization terms, which endow the objective with a geometric interpretation and ensure that the minimizer does not deviate from $\beta_{0}$ unless $\rho$ exceeds a positive threshold. We call this phenomenon the “delayed shrinkage” of the square root ridge, as shrinkage does not happen until the regularization is large enough. We illustrate it with a simple example in Fig. 1, where the minimizer of the limiting square root objective does not change for a bounded range of $\rho$.

Figure 1: Plot of $|\beta-1|+\sqrt{\rho}\sqrt{\beta^{2}+1}$, the dual limit objective function in the one-dimensional case with $\sigma_{Z}^{2}=\gamma=\beta_{0}=1$, together with the limit of the standard ridge loss $(\beta-1)^{2}+\beta^{2}$. For $\rho\leq 2$, the minimum is achieved at $\beta=1$, while for $\rho=5$, as for the standard ridge, the minimum is achieved at $\beta=0.5$.
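The delayed shrinkage in Fig. 1 is easy to reproduce numerically. The grid search below (our illustration of the one-dimensional limit objective) shows the minimizer staying at $\beta_{0}=1$ until $\rho$ exceeds the threshold of $2$:

```python
import numpy as np

beta = np.linspace(-1.0, 3.0, 400_001)
for rho in [0.5, 1.0, 2.0, 5.0]:
    loss = np.abs(beta - 1.0) + np.sqrt(rho) * np.sqrt(beta**2 + 1.0)
    print(f"rho = {rho}: minimizer ~ {beta[np.argmin(loss)]:.2f}")
# rho <= 2 keeps the minimizer at 1.00; at rho = 5 it shrinks to 0.50
```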

Lastly, we comment on the importance of the projection operations in Wasserstein DRIVE. A crucial feature of the Wasserstein DRIVE objective is that both the outcome and the covariates are regressed on the instrument to compute their predicted values. In other words, the objective (12) uses $\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta$ instead of $\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta$. For standard IV estimation ($\rho=0$), there is no substantial difference between the two objectives, since their minimizers are exactly the same, due to the idempotent property $\Pi_{\mathbf{Z}}^{2}=\Pi_{\mathbf{Z}}$. Indeed, in applications of TSLS, the outcome variable is often not regressed on the instrument. However, Wasserstein DRIVE is consistent for positive $\rho$ only if the outcome $Y$ is also projected onto the instrument space. In other words, the following problem does not yield a consistent estimator when $\rho>0$:

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}}+\sqrt{\rho(\|\beta\|^{2}+1)}.$$

The reason behind this phenomenon is that $\frac{1}{n}\|\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}$ is a GMM objective,

$$\Big(\frac{1}{n}\sum_{i}Z_{i}(Y_{i}-\beta^{T}X_{i})\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}Z_{i}(Y_{i}-\beta^{T}X_{i})\Big),$$

which as $n\rightarrow\infty$ achieves a minimal value of $0$ at $\beta_{0}$, while the limit of $\frac{1}{n}\|\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}$ does not vanish even at $\beta_{0}$. In the former case, the geometric properties of the square root ridge ensure that the minimizer of the regularized objective is $\beta_{0}$.
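The identity between the projected least squares objective and this GMM objective is simple to verify numerically; the sketch below (with arbitrary simulated data of our own choosing) checks that the two expressions agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 500, 1, 3
Z = rng.standard_normal((n, d))
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
beta = np.array([0.3])

Pi = Z @ np.linalg.solve(Z.T @ Z, Z.T)
lhs = np.sum((Pi @ Y - Pi @ X @ beta) ** 2) / n  # (1/n)||Pi_Z Y - Pi_Z X beta||^2
g = Z.T @ (Y - X @ beta) / n                     # sample moment (1/n) sum_i Z_i (Y_i - beta'X_i)
rhs = g @ np.linalg.solve(Z.T @ Z / n, g)        # GMM objective with weight (Z'Z/n)^{-1}
print(np.isclose(lhs, rhs))                      # True
```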

4.2 Asymptotic Distribution of Wasserstein DRIVE

Having established the consistency of Wasserstein DRIVE with bounded $\rho$, we now turn to the characterization of its asymptotic distribution. In general, the asymptotic distribution of Wasserstein DRIVE is different from that of the standard IV estimator. However, we will also examine several special cases relevant in practice where they coincide.

Theorem 4.2 (Asymptotic Distribution).

When $\lim_{n\rightarrow\infty}\rho_{n}=\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$, the Wasserstein DRIVE estimator $\hat{\beta}_{n}^{\text{DRIVE}}$ has an asymptotic distribution characterized by the following optimization problem:

$$\sqrt{n}(\hat{\beta}^{\text{DRIVE}}-\beta_{0})\ \rightarrow_{d}\ \arg\min_{\delta}\ \sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}+\frac{\sqrt{\rho}\,\beta_{0}^{T}}{\sqrt{1+\|\beta_{0}\|^{2}}}\cdot\delta, \qquad (19)$$

where $\mathcal{Z}\sim\mathcal{N}(0,\sigma^{2}\Sigma_{Z})$ and $\sigma^{2}=\mathbb{E}\epsilon^{2}$.

In particular, when $\rho_{n}\rightarrow 0$ at any rate, we have

$$\sqrt{n}(\hat{\beta}^{\text{DRIVE}}-\beta_{0})\ \rightarrow_{d}\ \mathcal{N}\big(0,\sigma^{2}(\gamma^{T}\Sigma_{Z}\gamma)^{-1}\big),$$

which is the asymptotic distribution of the TSLS estimator with homoskedastic errors $\epsilon$.

Recall that the maximal robustness parameter $\rho$ for which Wasserstein DRIVE remains consistent is equal to the smallest eigenvalue of $\gamma^{T}\Sigma_{Z}\gamma$, which is proportional to the inverse of the asymptotic variance of the TSLS estimator. Therefore, as the efficiency of TSLS increases, so does the robustness of the associated Wasserstein DRIVE estimator. The “price” to pay for robustness when $\rho>0$ is an interesting question. It is clear from Fig. 1 that the curvature of the population objective decreases as $\rho$ increases. Since the objective is not differentiable at $\beta_{0}$, however, a generalized notion of curvature is needed to precisely characterize this behavior. Note that the asymptotic distribution of the TSLS estimator minimizes the objective

$$(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta).$$

Theorem 4.2 thus implies that, in general, the asymptotic distributions of Wasserstein DRIVE and TSLS are different when $\rho>0$. However, there are still several cases relevant in practice where their asymptotic distributions do coincide, which we discuss next.

Corollary 4.3.

In the following cases, the asymptotic distribution of Wasserstein DRIVE is the same as that of the standard TSLS estimator:

1. When $\rho=0$;

2. When $\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$ and $\beta_{0}$ is identically $\mathbf{0}$;

3. When $\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$ and both $\beta_{0}$ and $\gamma$ are one-dimensional, i.e., $d=p=1$.

In particular, the just-identified case with $d=p=1$ covers many empirical applications of IV estimation, since in practice we are often interested in the causal effect of a single endogenous variable, for which we have a single instrument. The case when $\beta_{0}\equiv\mathbf{0}$ is also very relevant, since an important question in practice is whether the causal effect of a variable is zero. Our theory suggests that the asymptotic distribution of Wasserstein DRIVE should be the same as that of TSLS when the causal effect is zero and $d>1$, even for $\rho>0$. Based on this observation, we should intuitively expect the Wasserstein DRIVE and TSLS estimators to be “close” to each other; if the estimators or their asymptotic variance estimators differ significantly, then $\beta_{0}$ may not be identically $0$. We can design statistical tests by leveraging this intuition. For example, we can construct test statistics using the TSLS estimator and the DRIVE estimator with $\rho>0$, such as the difference $\hat{\beta}^{\text{DRIVE}}-\hat{\beta}^{\text{TSLS}}$. Then we can use bootstrap-based tests, such as a bootstrapped permutation test, to assess the null hypothesis that $\hat{\beta}^{\text{DRIVE}}-\hat{\beta}^{\text{TSLS}}=0$. If we fail to reject the null hypothesis, then there is evidence that the true causal effect is $\beta_{0}=\mathbf{0}$.
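One possible implementation of this idea, sketched below for the scalar case with hypothetical estimator routines drive_fit and tsls_fit (not from the paper; e.g., the sketches above), compares the observed difference to its centered nonparametric bootstrap distribution:

```python
import numpy as np

def bootstrap_diff_test(X, Y, Z, rho, drive_fit, tsls_fit, B=500, seed=0):
    """Bootstrap p-value sketch for H0: beta_DRIVE - beta_TSLS = 0 (scalar case)."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    stat = drive_fit(X, Y, Z, rho) - tsls_fit(X, Y, Z)
    boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # nonparametric resample of the data
        boot[b] = drive_fit(X[idx], Y[idx], Z[idx], rho) - tsls_fit(X[idx], Y[idx], Z[idx])
    # two-sided p-value against the centered bootstrap distribution
    return np.mean(np.abs(boot - boot.mean()) >= np.abs(stat))
```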

Corollary 4.3 may seem pessimistic, as it shows that the asymptotic distribution of Wasserstein DRIVE can coincide with that of TSLS in special cases. However, recall that Wasserstein DRIVE is formulated to minimize the worst-case risk over a set of distributions designed to capture deviations from model assumptions. There is therefore no a priori reason that it should coincide with TSLS when $\rho>0$. In this sense, the fact that Wasserstein DRIVE is consistent with $\rho>0$, and may even coincide with TSLS asymptotically, is rather surprising. In the latter case, the worst-case distribution for Wasserstein DRIVE in the large sample limit must coincide with the standard population distribution, a point that may be worth further investigation.

The asymptotic results developed in this section provide the basis on which one can perform estimation and inference with the Wasserstein DRIVE estimator. In the next section, we study the finite sample properties of DRIVE in simulation studies and demonstrate that it is superior in terms of estimation error and out-of-sample prediction compared to other popular estimators.

5 Numerical Studies

In this section, we study the empirical performance of Wasserstein DRIVE. Our results deliver three main messages. First, we demonstrate with simulations that Wasserstein DRIVE, with a non-zero robustness parameter $\rho$ based on Theorem 4.1, performs comparably to the standard IV estimator whenever instruments are valid. Second, when instruments become invalid, Wasserstein DRIVE outperforms other methods in terms of MSE. Third, on the education dataset of Card (1993), Wasserstein DRIVE also has superior performance at prediction for a heterogeneous target population.

5.1 MSE of Wasserstein DRIVE

We use the data generating process

$$Y=X\beta_{0}+Z\eta+U, \qquad X=\gamma Z+U, \qquad Z=U\beta_{UZ}+\epsilon_{Z},$$

where $U,\epsilon_{Z}\sim\mathcal{N}(0,\sigma^{2})$ and we allow a direct effect $\eta$ from the instrument $Z$ to the outcome $Y$. Moreover, the instrument $Z$ can also be correlated with the unobserved confounder $U$ ($\beta_{UZ}\neq 0$). We fix the true parameters and generate independent datasets from the model, varying the degree of instrument invalidity: we vary $\eta$, the direct effect of the instrument on the outcome, and $\beta_{UZ}$, the correlation between the unobserved confounder and the instrument. In Table 1, we report the MSE of the estimators averaged over 500 repeated experiments; a sketch of the data generating process appears after the table. Results in Table 1 are based on data where $\|\gamma\|\gg 0$ is large. We see that when instruments are strong, Wasserstein DRIVE performs as well as TSLS when instruments are valid, but performs significantly better than OLS, TSLS, anchor, and TSLS ridge when instruments become invalid. This suggests that DRIVE could be preferable in practice when we are concerned about instrument validity.

$\eta$ $\beta_{UZ}$ OLS TSLS anchor TSLS ridge DRIVE
0 0 0.21 0.03 0.19 0.03 0.03
0.4 0 0.20 0.07 0.16 0.06 0.03
0.4 0.4 0.26 0.25 0.24 0.21 0.07
0.4 0.8 0.29 0.62 0.29 0.56 0.09
0.8 0 0.26 0.23 0.23 0.22 0.06
0.8 0.4 0.32 0.51 0.31 0.46 0.10
0.8 0.8 0.37 0.82 0.38 0.81 0.14
Table 1: MSE of estimators when instruments are potentially invalid. $\beta_{0}=1$, $n=2000$, $\sigma=0.5$. For TSLS ridge, the regularization parameter is selected using cross validation based on out-of-sample prediction errors. For anchor regression, the regularization parameter is selected based on the proposal in Rothenhäusler et al., (2021). For DRIVE, the regularization parameter is selected using a nonparametric bootstrap of the score quantile (Appendix E).
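For concreteness, the simulation design can be sketched in Python as below. This is our own minimal illustration: the default first stage coefficient gamma is a stand-in for the large $\|\gamma\|$ used in Table 1, and the function name is ours.

```python
import numpy as np

def simulate(n=2000, beta0=1.0, gamma=1.0, eta=0.0, beta_uz=0.0, sigma=0.5, seed=0):
    """Draw one dataset: the instrument is invalid when eta != 0 (direct
    effect on Y) or beta_uz != 0 (correlation with the confounder U)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, sigma, n)    # unobserved confounder
    eps_Z = rng.normal(0.0, sigma, n)
    Z = beta_uz * U + eps_Z          # instrument
    X = gamma * Z + U                # endogenous regressor
    Y = beta0 * X + eta * Z + U      # outcome
    return Y, X, Z
```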
Figure 2: MSEs of estimators when instruments are potentially invalid. The instrument $Z$ can have a direct effect $\eta$ on the outcome $Y$, or be correlated with the unobserved confounder $U$. Wasserstein DRIVE consistently outperforms the other estimators.

We further investigate the empirical performance of Wasserstein DRIVE when instruments are potentially invalid or weak. We present box plots (omitting outliers) of MSEs in Fig. 2. The Wasserstein DRIVE estimator with regularization parameter $\rho$ based on bootstrapped quantiles of the score function consistently outperforms OLS, TSLS, anchor ($k$-class), and TSLS with ridge regularization. Moreover, the selected penalties increase as the direct effect of $Z$ on $Y$ or the correlation between the unobserved confounder $U$ and the instrument $Z$ increases, i.e., as the model assumption of valid instruments becomes increasingly violated; see Fig. 5 in Appendix E for more details. This property is highly desirable: in the DRO formulation of DRIVE, $\rho$ represents the amount of robustness against distributional shifts associated with the estimator, which should increase as the instruments become more invalid (larger distributional shift). The box plots of estimation errors in Fig. 2 also verify that even when instruments are valid, the finite sample performance of Wasserstein DRIVE is still better than that of the standard IV estimator, suggesting that there is no additional cost in terms of MSE when applying Wasserstein DRIVE, even when instruments are valid.
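The precise score statistic is defined in Appendix E and not reproduced here; the sketch below shows only the generic outer loop of the selection rule, with the score left as a user-supplied callable score_fn and the quantile level alpha as a tuning choice.

```python
import numpy as np

def rho_from_score_quantile(data, score_fn, alpha=0.95, n_boot=200, seed=0):
    """Set rho to a bootstrap quantile of a score statistic.

    data: array whose rows are observations; score_fn: callable mapping a
    resampled dataset to a scalar score (its exact form is in Appendix E).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # nonparametric bootstrap resample
        scores[b] = score_fn(data[idx])
    return np.quantile(scores, alpha)
```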

5.2 Prediction under Distributional Shifts on Education Data

We now turn our attention to a different task that has received increasing attention in recent years, especially in the context of policy learning and estimating causal effects across heterogeneous populations (Dehejia et al.,, 2021; Adjaho and Christensen,, 2022; Menzel,, 2023). We study the prediction performance of estimators when they are estimated on a dataset (training set) whose distribution potentially differs from that of the dataset on which they are used to make predictions (test set). We demonstrate that whenever the training and test distributions differ significantly, the prediction error of Wasserstein DRIVE is significantly smaller than that of the OLS, IV, and anchor ($k$-class) estimators.

We conduct our numerical study using the classic dataset on the returns to education compiled by David Card (Card,, 1993). Here, the causal inference problem is estimating the effect of additional years of schooling on the increase in wage later in life. The dataset contains demographic information about interviewed subjects. Importantly, each sample comes from one of nine regions in the United States, which differ in the average number of years of schooling and other characteristics, i.e., there are covariate shifts in data collected from different regions. Our strategy is to divide the dataset into a training set and a test set based on the relative ranks of the regions' average years of schooling, which is the endogenous variable. We expect that if there are distributional shifts between different regions, then models trained on the training data to predict wages using education and other information may not perform well on the test data.

training set (size) test set (size) OLS TSLS DRIVE anchor ridge TSLS ridge
bottom 3 educated regions (1247) top 3 educated regions (841) 0.444 0.537 0.364 0.444 0.444 0.421
(0.009) (0.031) (0.002) (0.009) (0.009) (0.019)
top 6 educated regions (1763) 0.451 1.064 0.371 0.451 0.451 0.430
(0.011) (0.274) (0.003) (0.011) (0.011) (0.027)
top 6 educated regions (1763) bottom 3 educated regions (1247) 0.390 0.584 0.356 0.390 0.390 0.377
(0.007) (0.120) (0.002) (0.007) (0.007) (0.015)
top 3 educated regions (841) bottom 3 educated regions (1247) 0.389 1.99 0.355 0.388 0.359 0.344
(0.013) (0.775) (0.005) (0.013) (0.014) (0.004)
middle 3 educated regions (922) 0.328 3.18 0.364 0.328 0.326 0.361
(0.001) (1.213) (0.005) (0.001) (0.001) (0.005)
middle 3 educated regions (922) top 3 educated regions (841) 0.332 0.410 0.332 0.332 0.333 0.332
(0.001) (0.025) (0.001) (0.001) (0.001) (0.001)
bottom 3 educated regions (1247) 0.416 0.538 0.363 0.416 0.409 0.386
(0.014) (0.063) (0.005) (0.014) (0.016) (0.019)
most+least educated regions (374) 0.395 2.47 0.362 0.395 0.392 0.355
(0.001) (1.81) (0.004) (0.011) (0.012) (0.004)
top 3+bottom 3 educated regions (2088) 0.396 0.451 0.358 0.396 0.382 0.366
(0.005) (0.032) (0.003) (0.009) (0.006) (0.009)
Table 2: Comparison of estimation methods in terms of MSE on test data. The training and test datasets are split according to the 9 regions in the Card college proximity dataset, based on their average education levels. In this specification, we did not include experience squared. Standard errors are obtained using 10 bootstrapped datasets.

Since each sample is labeled as working in 1976 in one of nine regions in the U.S., we split the samples based on these labels, using the number of years of education as the splitting variable. For example, we can construct the training set from samples in the top 6 regions with the highest average years of schooling, and the test set from samples in the bottom 3 regions with the lowest average years of schooling. In this case, we would expect the training and test sets to come from different distributions. Indeed, the average years of schooling differs by more than 1 year between the two sets, and the difference is statistically significant.
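As an illustration of this splitting strategy, the following is a hypothetical pandas sketch; the column names region and educ are placeholders we introduce here, not the dataset's actual variable names.

```python
import pandas as pd

def split_by_schooling(df: pd.DataFrame, n_train_regions: int = 6):
    """Split samples into train/test by ranking regions on average schooling."""
    region_rank = (df.groupby("region")["educ"].mean()
                     .sort_values(ascending=False))
    train_regions = region_rank.index[:n_train_regions]  # most educated regions
    train = df[df["region"].isin(train_regions)]
    test = df[~df["region"].isin(train_regions)]
    return train, test
```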

In splitting the samples based on the distribution of the endogenous variable, we are also motivated by the long-standing debates revolving around the use of instrumental variables in classic economic studies (Card and Krueger,, 1994). A leading concern is the validity of instruments. In the study of educational returns, valid estimation and inference require that the instruments (proximity to college and quarter of birth) are not correlated with unobserved characteristics of the subjects that may also affect their earnings. The following quote from Card, (1999) illustrates this concern:

“In the case of quasi or natural experiments, however, inferences are based on difference between groups of individuals who attended schools at different times, or in different locations, or had differences in other characteristics such as month of birth. The use of these differences to draw causal inferences about the effect of schooling requires careful consideration of the maintained assumption that the groups are otherwise identical.”

When this assumption is violated, estimates based on a particular subpopulation become unreliable for the wider population, and we evaluate estimators by how well they generalize to other groups of the population under potential distributional or covariate shifts. In Table 2, we compare the test set MSE of the OLS, IV, Wasserstein DRIVE, anchor regression, ridge, and ridge regularized IV estimators. We see that Wasserstein DRIVE consistently outperforms the other estimators commonly used in practice.

6 Concluding Remarks

In this paper, we propose a distributionally robust instrumental variables estimation framework. Our approach is motivated by two main considerations in practice. The first is the concern about model mis-specification in IV estimation, most notably the validity of instruments. The second is that, beyond estimating the causal effect of the endogenous variable, practitioners may also be interested in making good predictions with the help of instruments when there is heterogeneity between training and test datasets, e.g., when generalizing findings based on samples from a particular population or geographical group to other groups. We argue that both challenges can be naturally unified as problems of distributional shift, and then addressed using frameworks from distributionally robust optimization.

We provide a dual representation of our Wasserstein DRIVE framework as a regularized TSLS problem, and reveal a distinctive property of the resulting estimator: it is consistent even with a non-vanishing penalty parameter. We further characterize the asymptotic distribution of Wasserstein DRIVE, and establish a few special cases in which it coincides with that of the standard TSLS estimator. Numerical studies suggest that Wasserstein DRIVE has superior finite sample performance in two regards. First, it has lower estimation error when instruments are potentially invalid, while performing as well as TSLS when instruments are valid. Second, it outperforms existing methods at predicting outcomes under distributional shifts between training and test data. These findings support the appeal of our DRO approach to IV estimation, and suggest that Wasserstein DRIVE could be preferable in practice to standard IV methods. Finally, there are many future research directions of interest, such as further results on inference and testing, as well as connections to sensitivity analysis. Extensions to nonlinear models would also be useful in practice.

Acknowledgements

We are indebted to Han Hong, Guido Imbens, and Yinyu Ye for invaluable advice and guidance throughout this project, and to Agostino Capponi, Timothy Cogley, Rajeev Dehejia, Yanqin Fan, Alfred Galichon, Rui Gao, Wenzhi Gao, Vishal Kamat, Samir Khan, Frederic Koehler, Michal Kolesár, Simon Sokbae Lee, Greg Lewis, Elena Manresa, Konrad Menzel, Axel Peytavin, Debraj Ray, Martin Rotemberg, Vasilis Syrgkanis, Johan Ugander, and Ruoxuan Xiong for helpful discussions and suggestions. This work was supported in part by a Stanford Interdisciplinary Graduate Fellowship (SIGF).

References

  • Adjaho and Christensen, (2022) Adjaho, C. and Christensen, T. (2022). Externally valid treatment choice. arXiv preprint arXiv:2205.05561.
  • Anderson and Rubin, (1949) Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. The Annals of mathematical statistics, 20(1):46–63.
  • Andrews, (1994) Andrews, D. W. (1994). Empirical process methods in econometrics. Handbook of econometrics, 4:2247–2294.
  • Andrews, (1999) Andrews, D. W. (1999). Consistent moment selection procedures for generalized method of moments estimation. Econometrica, 67(3):543–563.
  • Andrews, (2007) Andrews, D. W. (2007). Inference with weak instruments. Advances in Economics and Econometrics, 3:122–173.
  • Andrews et al., (2019) Andrews, I., Stock, J. H., and Sun, L. (2019). Weak instruments in instrumental variables regression: Theory and practice. Annual Review of Economics, 11(1).
  • Angrist et al., (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455.
  • Angrist and Pischke, (2009) Angrist, J. D. and Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.
  • Armstrong and Kolesár, (2021) Armstrong, T. B. and Kolesár, M. (2021). Sensitivity analysis using approximate moment condition models. Quantitative Economics, 12(1):77–108.
  • Basmann, (1960a) Basmann, R. L. (1960a). On finite sample distributions of generalized classical linear identifiability test statistics. Journal of the American Statistical Association, 55(292):650–659.
  • Basmann, (1960b) Basmann, R. L. (1960b). On the asymptotic distribution of generalized linear estimators. Econometrica, Journal of the Econometric Society, pages 97–107.
  • Baum et al., (2003) Baum, C. F., Schaffer, M. E., and Stillman, S. (2003). Instrumental variables and gmm: Estimation and testing. The Stata Journal, 3(1):1–31.
  • Bekker, (1994) Bekker, P. A. (1994). Alternative approximations to the distributions of instrumental variable estimators. Econometrica: Journal of the Econometric Society, pages 657–681.
  • Belloni et al., (2012) Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
  • Belloni et al., (2018) Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C., and Kato, K. (2018). High-dimensional econometrics and regularized gmm. arXiv preprint arXiv:1806.01888.
  • Belloni et al., (2011) Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.
  • Ben-Tal et al., (2013) Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
  • Bennett and Kallus, (2023) Bennett, A. and Kallus, N. (2023). The variational method of moments. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):810–841.
  • Berkowitz et al., (2008) Berkowitz, D., Caner, M., and Fang, Y. (2008). Are “nearly exogenous instruments” reliable? Economics Letters, 101(1):20–23.
  • Bertsimas and Copenhaver, (2018) Bertsimas, D. and Copenhaver, M. S. (2018). Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3):931–942.
  • Bertsimas et al., (2022) Bertsimas, D., Imai, K., and Li, M. L. (2022). Distributionally robust causal inference with observational data. arXiv preprint arXiv:2210.08326.
  • Bertsimas and Popescu, (2005) Bertsimas, D. and Popescu, I. (2005). Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780–804.
  • Bickel et al., (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705 – 1732.
  • Blanchet et al., (2019) Blanchet, J., Kang, Y., and Murthy, K. (2019). Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857.
  • Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
  • Blanchet et al., (2022) Blanchet, J., Murthy, K., and Si, N. (2022). Confidence regions in wasserstein distributionally robust estimation. Biometrika, 109(2):295–315.
  • Bonhomme and Weidner, (2022) Bonhomme, S. and Weidner, M. (2022). Minimizing sensitivity to model misspecification. Quantitative Economics, 13(3):907–954.
  • Bound et al., (1995) Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American statistical association, 90(430):443–450.
  • Bowden et al., (2015) Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalid instruments: effect estimation and bias detection through egger regression. International journal of epidemiology, 44(2):512–525.
  • Bowden et al., (2016) Bowden, J., Davey Smith, G., Haycock, P. C., and Burgess, S. (2016). Consistent estimation in mendelian randomization with some invalid instruments using a weighted median estimator. Genetic epidemiology, 40(4):304–314.
  • Bühlmann, (2020) Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science, 35(3):404–426.
  • Burgess et al., (2016) Burgess, S., Bowden, J., Dudbridge, F., and Thompson, S. G. (2016). Robust instrumental variable methods using multiple candidate instruments with application to mendelian randomization. arXiv preprint arXiv:1606.03729.
  • Burgess et al., (2020) Burgess, S., Foley, C. N., Allara, E., Staley, J. R., and Howson, J. M. (2020). A robust and efficient method for mendelian randomization with hundreds of genetic variants. Nature communications, 11(1):376.
  • Burgess et al., (2011) Burgess, S., Thompson, S. G., and Collaboration, C. C. G. (2011). Avoiding bias from weak instruments in mendelian randomization studies. International journal of epidemiology, 40(3):755–764.
  • Calafiore and Ghaoui, (2006) Calafiore, G. C. and Ghaoui, L. E. (2006). On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications, 130:1–22.
  • Caner, (2009) Caner, M. (2009). Lasso-type gmm estimator. Econometric Theory, 25(1):270–290.
  • Caner and Kock, (2018) Caner, M. and Kock, A. B. (2018). High dimensional linear gmm. arXiv preprint arXiv:1811.08779.
  • Card, (1993) Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. NBER Working Paper, (w4483).
  • Card, (1999) Card, D. (1999). The causal effect of education on earnings. Handbook of labor economics, 3:1801–1863.
  • Card and Krueger, (1994) Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in new jersey and pennsylvania. American Economic Review, 84(4).
  • Chamberlain and Imbens, (2004) Chamberlain, G. and Imbens, G. (2004). Random effects estimators with many instrumental variables. Econometrica, 72(1):295–306.
  • Chao and Swanson, (2005) Chao, J. C. and Swanson, N. R. (2005). Consistent estimation with a large number of weak instruments. Econometrica, 73(5):1673–1692.
  • Chen et al., (2021) Chen, X., Hansen, L. P., and Hansen, P. G. (2021). Robust inference for moment condition models without rational expectations. Journal of Econometrics, forthcoming.
  • Chernozhukov et al., (2015) Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Post-selection and post-regularization inference in linear models with many controls and instruments. American Economic Review, 105(5):486–490.
  • Cigliutti and Manresa, (2022) Cigliutti, I. and Manresa, E. (2022). Adversarial method of moments.
  • Conley et al., (2012) Conley, T. G., Hansen, C. B., and Rossi, P. E. (2012). Plausibly exogenous. Review of Economics and Statistics, 94(1):260–272.
  • Cragg and Donald, (1993) Cragg, J. G. and Donald, S. G. (1993). Testing identifiability and specification in instrumental variable models. Econometric Theory, 9(2):222–240.
  • Davey Smith and Ebrahim, (2003) Davey Smith, G. and Ebrahim, S. (2003). ‘mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International journal of epidemiology, 32(1):1–22.
  • Dehejia et al., (2021) Dehejia, R., Pop-Eleches, C., and Samii, C. (2021). From local to global: External validity in a fertility natural experiment. Journal of Business & Economic Statistics, 39(1):217–243.
  • Delage and Ye, (2010) Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595–612.
  • Duchi et al., (2021) Duchi, J. C., Glynn, P. W., and Namkoong, H. (2021). Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3):946–969.
  • Dupačová, (1987) Dupačová, J. (1987). The minimax approach to stochastic programming and an illustrative application. Stochastics: An International Journal of Probability and Stochastic Processes, 20(1):73–88.
  • El Ghaoui and Lebret, (1997) El Ghaoui, L. and Lebret, H. (1997). Robust solutions to least-squares problems with uncertain data. SIAM Journal on matrix analysis and applications, 18(4):1035–1064.
  • Emdin et al., (2017) Emdin, C. A., Khera, A. V., and Kathiresan, S. (2017). Mendelian randomization. Jama, 318(19):1925–1926.
  • Fan et al., (2024) Fan, J., Fang, C., Gu, Y., and Zhang, T. (2024). Environment invariant linear least squares. The Annals of Statistics, 52(5):2268–2292.
  • Fan et al., (2023) Fan, Y., Park, H., and Xu, G. (2023). Quantifying distributional model risk in marginal problems via optimal transport. arXiv preprint arXiv:2307.00779.
  • Fisher, (1961) Fisher, F. M. (1961). On the cost of approximate specification in simultaneous equation estimation. Econometrica: journal of the Econometric Society, pages 139–170.
  • Fu and Knight, (2000) Fu, W. and Knight, K. (2000). Asymptotics for lasso-type estimators. The Annals of statistics, 28(5):1356–1378.
  • Fuller, (1977) Fuller, W. A. (1977). Some properties of a modification of the limited information estimator. Econometrica: Journal of the Econometric Society, pages 939–953.
  • Galichon, (2018) Galichon, A. (2018). Optimal transport methods in economics. Princeton University Press.
  • Galichon, (2021) Galichon, A. (2021). The unreasonable effectiveness of optimal transport in economics. arXiv preprint arXiv:2107.04700.
  • Gao and Kleywegt, (2023) Gao, R. and Kleywegt, A. (2023). Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 48(2):603–655.
  • Goodfellow et al., (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • Guo et al., (2018a) Guo, Z., Kang, H., Cai, T. T., and Small, D. S. (2018a). Testing endogeneity with high dimensional covariates. Journal of Econometrics, 207(1):175–187.
  • Guo et al., (2018b) Guo, Z., Kang, H., Tony Cai, T., and Small, D. S. (2018b). Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4):793–815.
  • Hahn and Hausman, (2005) Hahn, J. and Hausman, J. (2005). Estimation with valid and invalid instruments. Annales d’Economie et de Statistique, pages 25–57.
  • Hahn et al., (2004) Hahn, J., Hausman, J., and Kuersteiner, G. (2004). Estimation with weak instruments: Accuracy of higher-order bias and mse approximations. The Econometrics Journal, 7(1):272–306.
  • Hall, (2003) Hall, A. R. (2003). Generalized method of moments. A companion to theoretical econometrics, pages 230–255.
  • Hansen, (1982) Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the econometric society, pages 1029–1054.
  • Hansen and Sargent, (2008) Hansen, L. P. and Sargent, T. J. (2008). Robustness. Princeton university press.
  • Hansen and Sargent, (2010) Hansen, L. P. and Sargent, T. J. (2010). Wanting robustness in macroeconomics. In Handbook of monetary economics, volume 3, pages 1097–1157. Elsevier.
  • Hausman et al., (2012) Hausman, J. A., Newey, W. K., Woutersen, T., Chao, J. C., and Swanson, N. R. (2012). Instrumental variable estimation with heteroskedasticity and many instruments. Quantitative Economics, 3(2):211–255.
  • Hoerl and Kennard, (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82.
  • Hu and Hong, (2013) Hu, Z. and Hong, L. J. (2013). Kullback-leibler divergence constrained distributionally robust optimization. Available at Optimization Online, pages 1695–1724.
  • Huber, (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.
  • Huber and Ronchetti, (2011) Huber, P. J. and Ronchetti, E. M. (2011). Robust statistics. John Wiley & Sons.
  • Imbens and Angrist, (1994) Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.
  • Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge university press.
  • Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
  • Jakobsen and Peters, (2022) Jakobsen, M. E. and Peters, J. (2022). Distributional robustness of k-class estimators and the pulse. The Econometrics Journal, 25(2):404–432.
  • Jiang, (2017) Jiang, W. (2017). Have instrumental variables brought us closer to the truth. Review of Corporate Finance Studies, 6(2):127–140.
  • Kadane and Anderson, (1977) Kadane, J. B. and Anderson, T. (1977). A comment on the test of overidentifying restrictions. Econometrica: Journal of the Econometric Society, pages 1027–1031.
  • Kaji et al., (2020) Kaji, T., Manresa, E., and Pouliot, G. (2020). An adversarial approach to structural estimation. arXiv preprint arXiv:2007.06169.
  • Kallus and Zhou, (2021) Kallus, N. and Zhou, A. (2021). Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5):2870–2890.
  • Kang et al., (2016) Kang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016). Instrumental variables estimation with some invalid instruments and its application to mendelian randomization. Journal of the American statistical Association, 111(513):132–144.
  • Kantorovich, (1942) Kantorovich, L. V. (1942). On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201.
  • Kantorovich, (1960) Kantorovich, L. V. (1960). Mathematical methods of organizing and planning production. Management science, 6(4):366–422.
  • Kitamura et al., (2013) Kitamura, Y., Otsu, T., and Evdokimov, K. (2013). Robustness, infinitesimal neighborhoods, and moment restrictions. Econometrica, 81(3):1185–1201.
  • Kolesár, (2018) Kolesár, M. (2018). Minimum distance approach to inference with many instruments. Journal of Econometrics, 204(1):86–100.
  • Kolesár et al., (2015) Kolesár, M., Chetty, R., Friedman, J., Glaeser, E., and Imbens, G. W. (2015). Identification and inference with many invalid instruments. Journal of Business & Economic Statistics, 33(4):474–484.
  • Koopmans, (1949) Koopmans, T. C. (1949). Optimum utilization of the transportation system. Econometrica: Journal of the Econometric Society, pages 136–146.
  • Kuhn et al., (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations research & management science in the age of analytics, pages 130–166. Informs.
  • Kunitomo, (1980) Kunitomo, N. (1980). Asymptotic expansions of the distributions of estimators in a linear functional relationship and simultaneous equations. Journal of the American Statistical Association, 75(371):693–700.
  • Lei et al., (2023) Lei, L., Sahoo, R., and Wager, S. (2023). Policy learning under biased sample selection. arXiv preprint arXiv:2304.11735.
  • Lewis and Syrgkanis, (2018) Lewis, G. and Syrgkanis, V. (2018). Adversarial generalized method of moments. arXiv preprint arXiv:1803.07164.
  • McDonald, (1977) McDonald, J. B. (1977). The k-class estimators as least variance difference estimators. Econometrica: Journal of the Econometric Society, pages 759–763.
  • Meinshausen, (2018) Meinshausen, N. (2018). Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pages 6–10. IEEE.
  • Menzel, (2023) Menzel, K. (2023). Transfer estimates for causal effects across heterogeneous sites. arXiv preprint arXiv:2305.01435.
  • Metzger, (2022) Metzger, J. (2022). Adversarial estimators. arXiv preprint arXiv:2204.10495.
  • Mohajerin Esfahani and Kuhn, (2018) Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166.
  • Murray, (2006) Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments. Journal of economic Perspectives, 20(4):111–132.
  • Nagar, (1959) Nagar, A. L. (1959). The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica: Journal of the Econometric Society, pages 575–595.
  • Nelson and Startz, (1990a) Nelson, C. R. and Startz, R. (1990a). The distribution of the instrumental variables estimator and its t-ratio when the instrument is a poor one. Journal of Business, pages S125–S140.
  • Nelson and Startz, (1990b) Nelson, C. R. and Startz, R. (1990b). Some further results on the exact small sample properties of the instrumental variable estimator. Econometrica: Journal of the Econometric Society, pages 967–976.
  • Olkin and Pukelsheim, (1982) Olkin, I. and Pukelsheim, F. (1982). The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263.
  • Owen, (2007) Owen, A. B. (2007). A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443(7):59–72.
  • Peters et al., (2016) Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal Inference by using Invariant Prediction: Identification and Confidence Intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947–1012.
  • Pollard, (1991) Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7(2):186–199.
  • Prékopa, (2013) Prékopa, A. (2013). Stochastic programming, volume 324. Springer Science & Business Media.
  • Richardson, (1968) Richardson, D. H. (1968). The exact distribution of a structural coefficient estimator. Journal of the American Statistical Association, 63(324):1214–1226.
  • Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212–218.
  • Rothenhäusler et al., (2021) Rothenhäusler, D., Meinshausen, N., Bühlmann, P., Peters, J., et al. (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B, 83(2):215–246.
  • Ruszczyński, (2010) Ruszczyński, A. (2010). Risk-averse dynamic programming for markov decision processes. Mathematical programming, 125:235–261.
  • Sahoo et al., (2022) Sahoo, R., Lei, L., and Wager, S. (2022). Learning from a biased sample. arXiv preprint arXiv:2209.01754.
  • Sanderson and Windmeijer, (2016) Sanderson, E. and Windmeijer, F. (2016). A weak instrument f-test in linear iv models with multiple endogenous variables. Journal of econometrics, 190(2):212–221.
  • Sargan, (1958) Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica: Journal of the econometric society, pages 393–415.
  • Scarf, (1958) Scarf, H. (1958). A min-max solution of an inventory problem. Studies in the mathematical theory of inventory and production.
  • Shapiro and Kleywegt, (2002) Shapiro, A. and Kleywegt, A. (2002). Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3):523–542.
  • Sherman and Morrison, (1950) Sherman, J. and Morrison, W. J. (1950). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127.
  • Sinha et al., (2017) Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2017). Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
  • Small, (2007) Small, D. S. (2007). Sensitivity analysis for instrumental variables regression with overidentifying restrictions. Journal of the American Statistical Association, 102(479):1049–1058.
  • Staiger and Stock, (1997) Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica: Journal of the Econometric Society, pages 557–586.
  • Stock and Wright, (2000) Stock, J. H. and Wright, J. H. (2000). Gmm with weak identification. Econometrica, 68(5):1055–1096.
  • Stock et al., (2002) Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 20(4):518–529.
  • Stock and Yogo, (2002) Stock, J. H. and Yogo, M. (2002). Testing for weak instruments in linear iv regression.
  • Theil, (1953) Theil, H. (1953). Repeated least squares applied to complete equation systems. The Hague: central planning bureau.
  • Theil, (1961) Theil, H. (1961). Economic forecasts and policy.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • van der Vaart and Wellner, (1996) van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
  • VanderWeele et al., (2014) VanderWeele, T. J., Tchetgen, E. J. T., Cornelis, M., and Kraft, P. (2014). Methodological challenges in mendelian randomization. Epidemiology (Cambridge, Mass.), 25(3):427.
  • Vaserstein, (1969) Vaserstein, L. N. (1969). Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72.
  • Vershik, (2013) Vershik, A. M. (2013). Long history of the monge-kantorovich transportation problem: (marking the centennial of lv kantorovich’s birth!). The Mathematical Intelligencer, 35:1–9.
  • Villani, (2009) Villani, C. (2009). Optimal transport: old and new, volume 338. Springer.
  • Von Neumann and Morgenstern, (1947) Von Neumann, J. and Morgenstern, O. (1947). Theory of games and economic behavior, 2nd rev.
  • Wang et al., (2016) Wang, Z., Glynn, P. W., and Ye, Y. (2016). Likelihood robust optimization for data-driven problems. Computational Management Science, 13:241–261.
  • White, (1980) White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, pages 817–838.
  • Windmeijer et al., (2018) Windmeijer, F., Farbmacher, H., Davies, N., and Smith, G. D. (2018). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association.
  • Windmeijer et al., (2021) Windmeijer, F., Liang, X., Hartwig, F. P., and Bowden, J. (2021). The confidence interval method for selecting valid instrumental variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(4):752–776.
  • Wooldridge, (2020) Wooldridge, J. M. (2020). Introductory econometrics: a modern approach.
  • Young, (2022) Young, A. (2022). Consistency without inference: Instrumental variables in practical application. European Economic Review, page 104112.

Appendix A Background and Preliminaries

A.1 Wasserstein Distributionally Robust Optimization

We first formally define the Wasserstein distance and discuss relevant results used in this paper. The Wasserstein distance is a metric on the space of probability distributions defined via the optimal transport problem. More specifically, given a Polish space $\mathcal{X}$ with metric $d$, let $\mathcal{P}(\mathcal{X})$ be the set of Borel probability measures on $\mathcal{X}$ and $\mathbb{P},\mathbb{Q}\in\mathcal{P}(\mathcal{X})$. For exposition, we assume they have densities $f_{1}$ and $f_{2}$, respectively, although the Wasserstein distance is well-defined for more general probability measures using the concept of push-forwards (Villani,, 2009). The optimal transport problem, whose study was pioneered by Kantorovich, (1942, 1960), aims to find the joint distribution (coupling) of $\mathbb{P}$ and $\mathbb{Q}$ with the smallest transport cost, as specified by the metric $d$:

$$\min_{\pi\in\mathcal{P}(\mathcal{X}\times\mathcal{X}):\ \int\pi(x_{1},x_{2})\,dx_{1}=f_{2}(x_{2}),\ \int\pi(x_{1},x_{2})\,dx_{2}=f_{1}(x_{1})}\ \int_{\mathcal{X}\times\mathcal{X}}d^{p}(x_{1},x_{2})\,\pi(x_{1},x_{2})\,dx_{1}\,dx_{2},\qquad(20)$$

where $p\geq 1$. The $p$-Wasserstein distance $W_{p}(\mathbb{P},\mathbb{Q})$ is defined as the $p$-th root of the optimal value of the optimal transport problem above. The Wasserstein distance is a metric on the space $\mathcal{P}(\mathcal{X})$ of probability measures, and the dual of (20) yields the following important duality result due to Kantorovich (Villani,, 2009):

$$W^{p}_{p}(\mathbb{P},\mathbb{Q})=\sup_{u\in L^{1}(\mathbb{P}),\,v\in L^{1}(\mathbb{Q}):\ u(x_{1})+v(x_{2})\leq d^{p}(x_{1},x_{2})}\left\{\int u(x_{1})f_{1}(x_{1})\,dx_{1}+\int v(x_{2})f_{2}(x_{2})\,dx_{2}\right\}.$$
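As a concrete special case (our own illustration, not part of the paper's development), for two empirical measures on the real line with the same number of atoms, the optimal coupling matches order statistics, so $W_{p}$ reduces to an $L_{p}$ distance between sorted samples:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1-D empirical measures with
    equally many atoms: sorting both samples gives the optimal coupling."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y), "equal sample sizes assumed for simplicity"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```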

It should be noted that the term “Wasserstein metric” for the optimal transport distance defined above is something of a historical accident, as Kantorovich, (1942) should be credited with pioneering the theory of optimal transport and proposing the metric. However, because a work of Wasserstein (Vaserstein,, 1969) that briefly discussed the optimal transport distance was initially better known in the West, the terminology of the Wasserstein metric persists to this day (Vershik,, 2013). The optimal transport problem was also studied in the seminal work of Koopmans, (1949).

One of the appeals of the Wasserstein distance when formulating distributionally robust optimization problems lies in the tractability of the dual DRO problem. Specifically, in (12), the inner maximization requires solving an infinite-dimensional optimization problem for every $\beta$, and is in general intractable. However, if $D$ is the Wasserstein distance, the inner problem has a tractable dual minimization problem, which, when combined with the outer minimization over $\beta$, yields a simple and tractable minimization problem. This allows us to efficiently compute the WDRO estimator. Moreover, it establishes connections with the popular statistical approach of ridge regression.

Let $c\in L^{1}(\mathcal{X})$ be a general loss function and $\mathbb{P}\in\mathcal{P}(\mathcal{X})$ with density $f_{1}$. The following general duality result (Gao and Kleywegt,, 2023; Blanchet and Murthy,, 2019) provides a tractable reformulation of the Wasserstein DRIVE objective introduced in Section 2:

$$\sup_{\mathbb{Q}\in\mathcal{P}(\mathcal{X}):\,W_{p}(\mathbb{Q},\mathbb{P})\leq\rho}\int f_{2}(x)\,c(x)\,dx=\inf_{\lambda\geq 0}\Big\{\lambda\rho^{p}-\int\inf_{x_{2}\in\mathcal{X}}\big[\lambda d^{p}(x_{1},x_{2})-c(x_{2})\big]f_{1}(x_{1})\,dx_{1}\Big\},\qquad(21)$$

where $f_{2}$ is the density of $\mathbb{Q}$.

A.2 Anchor Regression of Rothenhäusler et al., (2021)

In the anchor regression framework of Rothenhäusler et al., (2021), the baseline distribution $\mathbb{P}_{0}$ on $(X,Y,U,A)$ is prescribed by the following linear structural equation model (SEM), given well-defined $\mathbf{B},\mathbf{M}$ and distributions of $A,\epsilon$:

$$\begin{pmatrix}X\\ Y\\ U\end{pmatrix}=\mathbf{B}\begin{pmatrix}X\\ Y\\ U\end{pmatrix}+\mathbf{M}A+\epsilon\quad\Longleftrightarrow\quad\begin{pmatrix}X\\ Y\\ U\end{pmatrix}=(I-\mathbf{B})^{-1}(\mathbf{M}A+\epsilon).$$

Here $U$ represents unobserved confounders, $Y$ is the outcome variable, $X$ are observed regressors, and $A$ are “anchors” that can be understood as potentially invalid instrumental variables that may violate the exclusion restriction. Under this SEM, Rothenhäusler et al., (2021) posit that potential deviations from the reference distribution $\mathbb{P}_{0}$ are driven by bounded uncertainties in the anchors $A$. Their main result provides a DRO interpretation of a modified population version of IV regression that interpolates between the IV and OLS objectives for $\gamma>1$:

$$\min_{\beta}\ \mathbb{E}[(Y-X^{T}\beta)^{2}]+(\gamma-1)\,\mathbb{E}[(P_{A}(Y-X^{T}\beta))^{2}]=\min_{\beta}\sup_{v\in C^{\gamma}}\mathbb{E}_{v}[(Y-X^{T}\beta)^{2}].\qquad(22)$$

The set of distributions $\mathbb{E}_{v}$ induced by $v\in C^{\gamma}$ is defined via the following SEM with a bounded set $C^{\gamma}$:

$$\begin{pmatrix}X\\ Y\\ U\end{pmatrix}=(I-\mathbf{B})^{-1}v,\qquad C^{\gamma}:=\{v:\,vv^{T}\preceq\gamma\,\mathbf{M}\mathbb{E}(AA^{T})\mathbf{M}^{T}\}.\qquad(23)$$

In the interpolated objective in Eq. 22, $P_{A}(\cdot)=\mathbb{E}(\cdot\mid A)$, and $\mathbb{E}[(P_{A}(Y-X^{T}\beta))^{2}]$ is the population version of the IV (TSLS) regression objective with $A$ as the instrument. Letting $\kappa=1-1/\gamma$, we can rewrite the interpolated objective on the left hand side of (22) equivalently as

$$\min_{\beta}\ \mathbb{E}[(P_{A}(Y-X^{T}\beta))^{2}]+\frac{1-\kappa}{\kappa}\,\mathbb{E}[(Y-X^{T}\beta)^{2}],\qquad(24)$$

which can be interpreted as “regularizing” the IV objective with the OLS objective, with penalty parameter $(1-\kappa)/\kappa$. Jakobsen and Peters, (2022) observe that the finite sample version of the objective in (24) is precisely that of a $k$-class estimator with parameter $\kappa$ (Theil,, 1961; Nagar,, 1959). Together with (22), this observation therefore provides a DRO interpretation of $k$-class estimators, which Jakobsen and Peters, (2022) also extend to more general instrumental variables estimation settings. Moreover, when $\kappa=1$, or equivalently $\gamma\rightarrow\infty$, we recover the standard IV objective in (24). The IV estimator therefore has a distributionally robust interpretation via (22) when the distributional shifts $v$ are unbounded.
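In finite samples, the interpolated objective admits a closed-form minimizer: the first-order condition of the left hand side of (22) gives $\hat{\beta}=(X^{T}WX)^{-1}X^{T}WY$ with $W=I+(\gamma-1)P_{A}$. The following sketch (ours, assuming full-rank design matrices) implements it.

```python
import numpy as np

def anchor_regression(Y, X, A, gamma):
    """Minimize ||Y - X b||^2 + (gamma - 1) * ||P_A (Y - X b)||^2.

    gamma = 1 recovers OLS; gamma -> infinity approaches the IV/TSLS solution.
    """
    P_A = A @ np.linalg.solve(A.T @ A, A.T)    # projection onto columns of A
    W = np.eye(len(Y)) + (gamma - 1.0) * P_A   # weight from first-order condition
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
```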

The DRO interpretation (22) of $k$-class estimators sheds new light on some old wisdom about IV estimation. As a large literature has observed and studied (Richardson,, 1968; Nelson and Startz, 1990a, b; Bound et al.,, 1995; Staiger and Stock,, 1997; Hahn et al.,, 2004; Burgess et al.,, 2011; Andrews et al.,, 2019; Young,, 2022), when instruments are weak, the usual normal approximation to the distribution of the IV estimator may be very poor, and the IV estimator is biased in small samples and under weak instrument asymptotics. Moreover, a small violation of the exclusion restriction, i.e., a direct effect of instruments on the outcome, can result in large bias when instruments are weak (Angrist and Pischke,, 2009). Consequently, IV may not perform as well as the OLS estimator in such settings. Regularizing the IV objective by the OLS objective in (24) can therefore alleviate the weak instrument problem. This improvement has also been observed for $k$-class estimators with particular choices of $\kappa$ (Fuller,, 1977; Hahn et al.,, 2004). The DRO interpretation complements this intuition based on regularizing the IV objective with the OLS objective. Insofar as weak instruments can be understood as a form of distributional shift away from standard modeling assumptions (strong first stage effects), a distributionally robust regression approach is a natural solution to the challenge of weak instruments. In the case of anchor regression, the distributional uncertainty set indexed by $v\in C^{\gamma}$ always contains distributions on $(X,Y,U,A)$ under which the association between $A$ and $X$ is weak, obtained by selecting appropriate $\|v\|\approx 0$. The DRO formulation (22) therefore demonstrates that $k$-class estimators are robust against the weak instrument problem by design. An additional insight of the DRO formulation is that $k$-class estimators and anchor regression are also optimal for predicting $Y$ with $X$ when the distribution of $(X,Y)$ may change between the training and test datasets in a bounded manner induced by the anchors $A$.

On the other hand, the DRO interpretation of $k$-class estimators also exposes their potential limitations. First, the ambiguity set in (23) does not in fact contain the reference distribution $\mathbb{P}_{0}$ itself for any finite robustness parameter, which is unsatisfactory. Moreover, the SEM in (23) implies that the instruments (anchors) $A$ cannot be influenced by the unobserved confounder $U$, which is a major source of uncertainty about the validity of instruments in applications of IV estimation. In this sense, we may understand $k$-class estimators as being robust against weak instruments (Young,, 2022), since they minimize an objective that interpolates between the OLS and IV objectives. By contrast, the DRIVE approach we propose in this paper is by design robust against invalid instruments, as the ambiguity set captures distributional shifts arising from conditional correlations between the instrument and the outcome variable, conditional on the endogenous variable.

Appendix B Related Works

Our work is related to several literatures, including distributionally robust optimization, instrumental variables estimation, and regularized (penalized) regression. Although these literatures historically developed largely independently of each other, recent works have started to explore their interesting connections, and our work can be viewed as an effort in this direction.

B.1 Distributionally Robust Optimization and Min-max Optimization

DRO has been an important research area in operations research, and traces its origins to game theory (Von Neumann and Morgenstern,, 1947). Scarf, (1958) first studied DRO in the context of inventory control under uncertainty about future demand distributions. This work was followed by a line of research on min-max stochastic optimization models, notably the works of Shapiro and Kleywegt, (2002), Calafiore and Ghaoui, (2006), and Ruszczyński, (2010). Distributional uncertainty sets based on moment conditions are considered by Dupačová, (1987); Prékopa, (2013); Bertsimas and Popescu, (2005); Delage and Ye, (2010), while uncertainty sets based on distance or divergence measures are considered by Iyengar, (2005); Wang et al., (2016). In recent years, distributional uncertainty sets based on the Wasserstein metric have gained traction, appearing in Mohajerin Esfahani and Kuhn, (2018); Blanchet et al., (2019); Blanchet and Murthy, (2019); Duchi et al., (2021), partly due to their close connections to regularized regression, such as the LASSO (Tibshirani,, 1996; Belloni et al.,, 2011) and regularized logistic regression. Other works employ alternative divergence measures, such as the KL divergence (Hu and Hong,, 2013) and, more generally, $\phi$-divergences (Ben-Tal et al.,, 2013). In this work, we focus on DRO based on the Wasserstein metric, originally proposed by Kantorovich, (1942) in the context of optimal transport, which has also become a popular tool in economics in recent years (Galichon,, 2018, 2021).

DRO has gained traction in causal inference problems in econometrics and statistics very recently. For example, Kallus and Zhou, (2021); Adjaho and Christensen, (2022); Lei et al., (2023) apply DRO in policy learning to handle distributional uncertainties. Chen et al., (2021) apply DRO to address possible mis-specification of rational expectations in the estimation of structural models. Sahoo et al., (2022) use distributional shifts to model sampling bias. Bertsimas et al., (2022) study DRO versions of classic causal inference frameworks. Fan et al., (2023) study distributional model risk when data come from multiple sources and only marginal reference measures are identified. DRO is also connected to the literature in macroeconomics on robust control (Hansen and Sargent,, 2010). A related recent line of works in econometrics also employs a min-max approach to estimation (Lewis and Syrgkanis,, 2018; Kaji et al.,, 2020; Metzger,, 2022; Cigliutti and Manresa,, 2022; Bennett and Kallus,, 2023), inspired by adversarial networks from machine learning (Goodfellow et al.,, 2014). These works leverage adversarial learning to enforce a large, possibly infinite, number of (conditional) moment constraints, in order to achieve efficiency gains. In contrast, the emphasis of the min-max formulation in our paper is on capturing potential violations of model assumptions using a distributional uncertainty set.

The DRO approach we propose in this paper is motivated by a recent line of works that reveal interesting connections between causality and notions of invariance and distributional robustness (Peters et al.,, 2016; Meinshausen,, 2018; Rothenhäusler et al.,, 2021; Bühlmann,, 2020; Jakobsen and Peters,, 2022), whereby causality is interpreted as an invariance or robustness property under distributional shifts. In particular, Rothenhäusler et al., (2021) and Jakobsen and Peters, (2022) provide a distributionally robust interpretation of the classic $k$-class estimators. In our work, instead of constructing the ambiguity set based on marginal or joint distributions as is commonly done in previous works, we propose a Wasserstein DRO version of the IV estimation problem based on distributional shifts in conditional quantities, which is then reformulated as a ridge type regularized IV estimation problem. In this regard, our estimator is fundamentally different from the $k$-class estimators, which minimize an IV regression objective regularized by an OLS objective.

B.2 Instrumental Variables Estimation

Our work is also closely related to the classic literatures in econometrics and statistics on instrumental variables estimation (regression), originally proposed and developed by Theil, (1953) and Nagar, (1959), which became widely used in applied fields in economics. Since then, many works have investigated potential challenges to instrumental variables estimation and their solutions, including invalid instruments (Fisher,, 1961; Hahn and Hausman,, 2005; Berkowitz et al.,, 2008; Kolesár et al.,, 2015) and weak instruments (Nelson and Startz, 1990a, b; Staiger and Stock,, 1997; Murray,, 2006; Andrews et al.,, 2019). Tests for weak instruments have been proposed by Stock and Yogo, (2002) and Sanderson and Windmeijer, (2016). Notably, the test of Stock and Yogo, (2002) for multi-dimensional instruments is based on the minimum eigenvalue rank test statistic of Cragg and Donald, (1993). In our Wasserstein DRIVE framework, the penalty/robustness parameter can also be selected using the minimum eigenvalue of the first stage coefficient, and it remains to further study the connections between our work and the weak instrument literature in this regard. The related econometric literature on many (weak) instruments studies the regime where the number of instruments is allowed to diverge proportionally with the sample size (Kunitomo,, 1980; Bekker,, 1994; Chamberlain and Imbens,, 2004; Chao and Swanson,, 2005; Kolesár,, 2018). In this work, we assume a fixed number of instruments to best illustrate the Wasserstein DRIVE approach, although it would be interesting to extend the framework and analysis to the many instruments setting.

Testing for invalid instruments is possible in the over-identified regime, where there are more instruments than endogenous variables (Sargan,, 1958; Kadane and Anderson,, 1977; Hansen,, 1982; Andrews,, 1999). These tests have been used in combination with variable selection methods, such as LASSO and thresholding, to select valid instruments under certain assumptions (Kang et al.,, 2016; Windmeijer et al.,, 2018; Guo et al., 2018a, ; Windmeijer et al.,, 2021). In this paper, we propose a regularization selection procedure for Wasserstein DRIVE based on a bootstrapped score quantile. In simulations, we find that the selected $\rho$ increases with the degree of instrument invalidity. It remains to further study the relation between this score quantile and test statistics for instrument invalidity in the over-identified setting. Lastly, our framework can be viewed as complementary to post-hoc sensitivity analysis for invalid instruments (Angrist et al.,, 1996; Small,, 2007; Conley et al.,, 2012): instead of bounding the potential bias of IV estimators arising from violations of model assumptions after estimation, we incorporate such potential deviations directly into the estimation procedure.

Instrumental variables estimation has also gained wide adoption in epidemiology and genetics, where it is known as Mendelian randomization (MR) (VanderWeele et al.,, 2014; Bowden et al.,, 2015; Sanderson and Windmeijer,, 2016; Emdin et al.,, 2017). An important consideration in MR is invalid instruments, because many genetic variants, which are candidate instruments in Mendelian randomization, could be correlated with the outcome variable through unknown mechanisms, either direct effects (horizontal pleiotropy) or correlations with unobserved confounders. Methods have been proposed to address these challenges, based on robust regression and regularization ideas (Bowden et al.,, 2015, 2016; Burgess et al.,, 2016, 2020). Our proposed DRIVE framework contributes to this area by providing a novel regularization method robust against potentially invalid instruments.

B.3 Regularized Regression

Our Wasserstein DRIVE framework can be viewed as an instance of a data-driven regularized IV method. In this regard, it complements the classic $k$-class estimators, which regularize the IV objective with the OLS objective (Rothenhäusler et al.,, 2021). Data-driven $k$-class estimators have been shown to enjoy better finite sample properties. These include the LIML (Anderson and Rubin,, 1949) and the Fuller estimator (Fuller,, 1977), a modification of LIML that works well when instruments are weak (Stock et al.,, 2002). More recently, Jakobsen and Peters, (2022) proposed another data-driven $k$-class estimator called the PULSE, which minimizes the OLS objective subject to a constraint set defined by statistical tests of independence between instruments and residuals. Kolesár et al., (2015) propose a modification of the $k$-class estimator that is consistent with invalid instruments whose direct effects on the outcome are independent of the first stage effects on the endogenous regressor.

There is a rich literature exploring the interactions and connections between regularized regression and instrumental variables methods. One line of works seeks to improve the finite sample performance and asymptotic properties of IV type estimators using methods from regularized regression. For example, Windmeijer et al., (2018) apply LASSO regression to the first stage, motivated by applications in genetics where one may have access to many weak or invalid instruments. Belloni et al., (2012) and Chernozhukov et al., (2015) also apply LASSO, but to select optimal instruments in the many instruments setting or when covariates are high-dimensional. Caner, (2009), Caner and Kock, (2018), and Belloni et al., (2018) apply LASSO to GMM estimators, generalizing regularized regression results from the M-estimation setting to the moment estimation setting, which includes IV estimation.

Another line of works on regularized regression, more closely related to ours, investigates the connections and equivalences between regularized regression and causal effect estimators in econometrics based on instrumental variables. Basmann, (1960a, b) and McDonald, (1977) were the first to connect $k$-class estimators to regularized regressions. Rothenhäusler et al., (2021) and Jakobsen and Peters, (2022) further study the distributional robustness of $k$-class estimators as minimizers of the TSLS objective regularized by the OLS objective. The Wasserstein DRIVE estimator that we propose in this work applies a different type of regularization, namely a square root ridge penalty, to the second stage coefficient in the TSLS objective. As such, it behaves differently from the anchor and $k$-class estimators, which regularize using the OLS objective, and it also differs from works that apply regularization to the first stage.

Appendix C Square Root Ridge Regression

In this section, we turn our attention to the square root ridge estimator in the standard regression setting. We first establish the \sqrt{n}-consistency of the square root ridge when the regularization parameter vanishes at the appropriate rate. We then consider a novel regime with non-vanishing regularization parameter and vanishing noise, revealing properties that are strikingly different from the standard ridge regression. As we will see, these observations in the standard setting help motivate and provide the essential intuitions for our results in the IV estimation setting. In short, the interesting behaviors of the square root ridge arise from its unique geometry in the regime of vanishing noise, where \sum_{i=1}^{n}\epsilon^{2}_{i}=o_{p}(n) as n\rightarrow\infty. This regime is rarely studied in conventional regression settings in statistics, but it precisely captures features of the instrumental variables estimation setting, where the projected residuals satisfy (\mathbf{Y}-\mathbf{X}\beta_{0})^{T}\mathbf{\Pi}_{\mathbf{Z}}(\mathbf{Y}-\mathbf{X}\beta_{0})=o_{p}(n) when instruments are valid and \beta_{0} is the true effect coefficient. In addition to providing intuitions for the IV estimation setting, the regularization parameter selection procedure proposed for the square root LASSO in the standard regression setting by Belloni et al., (2011) also inspires us to propose a novel procedure for the IV setting in Appendix E, which is shown to perform well in simulations.

C.1 \sqrt{n}-Consistency of the Square Root Ridge

We now consider the square root ridge estimator in the standard regression setting, and prove its \sqrt{n}-consistency. We will build on the results of Belloni et al., (2011) on the non-asymptotic estimation error of the square root LASSO estimator. Conditional on a fixed design X_{i}\in\mathbb{R}^{p}, and with \Phi the CDF of \epsilon_{i}, we consider the data generating process

Y_{i}=X_{i}^{T}\beta_{0}+\sigma\epsilon_{i}.

In this section, we rewrite the objective of the square root ridge estimation (15) as

\min_{\beta}\sqrt{\hat{Q}(\beta)}+\frac{\lambda}{n}\sqrt{\|\beta\|^{2}+1}, \qquad (25)

\hat{Q}(\beta)=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-X_{i}^{T}\beta)^{2}, \qquad (26)

and denote by \hat{\beta} the minimizer of the objective. Without loss of generality, we assume for all j,

\frac{1}{n}\sum_{i=1}^{n}X_{ij}^{2}=1.

In other words, each covariate (feature) is normalized to have unit norm. Similar to the square root LASSO case, we will show that by selecting \lambda=\mathcal{O}(\sqrt{n}) properly, or equivalently \rho=\mathcal{O}(n^{-1}) in (15), we can achieve, with probability 1-\alpha, the \sqrt{n}-consistency result

\|\hat{\beta}-\beta_{0}\|_{2}\lesssim\sigma\sqrt{p\log(2p/\alpha)/n}.

Compare this with the bound for the square root LASSO, which is

\|\hat{\beta}-\beta_{0}\|_{2}\lesssim\sigma\sqrt{s\log(2p/\alpha)/n},

where s is the number of non-zero entries of \beta_{0} and is allowed to diverge as n\rightarrow\infty. Since we do not impose assumptions on the size of the support of the p-dimensional vector \beta_{0}, if s=p is finite in the square root LASSO framework, we achieve the same bound on the estimation error. Our bound for the square root ridge is therefore sharp in this sense.
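To make the estimator concrete, the following minimal sketch (our illustration, not part of the original derivation; the simulated design, the optimizer, and the defaults c=1.1, \alpha=0.05 are assumptions) solves objective (25) with a generic smooth optimizer, using the penalty \lambda=\sqrt{p}\lambda^{\ast} discussed below.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def sqrt_ridge(X, Y, lam):
    # Minimize sqrt(mean squared residual) + (lam/n) * sqrt(||beta||^2 + 1), as in (25).
    n, p = X.shape
    obj = lambda b: np.sqrt(np.mean((Y - X @ b) ** 2)) + (lam / n) * np.sqrt(b @ b + 1.0)
    beta_init = np.linalg.lstsq(X, Y, rcond=None)[0]   # warm start at OLS
    return minimize(obj, beta_init, method="BFGS").x

rng = np.random.default_rng(0)
n, p, alpha, c = 500, 5, 0.05, 1.1
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))                    # normalize so (1/n) sum_i X_ij^2 = 1
beta_true = rng.standard_normal(p)
Y = X @ beta_true + 0.5 * rng.standard_normal(n)
lam = np.sqrt(p) * c * np.sqrt(n) * norm.ppf(1 - alpha / (2 * p))   # lambda = sqrt(p) * lambda*
print(np.linalg.norm(sqrt_ridge(X, Y, lam) - beta_true))            # estimation error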

An important quantity in the analysis is the score \tilde{S}, which is the gradient of \sqrt{\hat{Q}(\beta)} evaluated at the true parameter value \beta=\beta_{0}:

\tilde{S}=\nabla\sqrt{\hat{Q}}(\beta_{0})=\frac{\nabla\hat{Q}(\beta_{0})}{2\sqrt{\hat{Q}(\beta_{0})}}=\frac{E_{n}(X\sigma\epsilon)}{\sqrt{E_{n}(\sigma^{2}\epsilon^{2})}}=\frac{E_{n}(X\epsilon)}{\sqrt{E_{n}(\epsilon^{2})}},

where E_{n} denotes the empirical average. Similar to the lower bound on the regularization parameter in terms of the score function, \lambda\geq cn\|\tilde{S}\|_{\infty}, in Belloni et al., (2011), we will aim to impose the condition that \lambda\geq cn\|\tilde{S}\|_{2} for some c>1. Conveniently, this condition is already implied by \lambda=\sqrt{p}\lambda^{*}, where \lambda^{\ast} follows the selection procedures proposed in that paper. To see this point, note that \|\tilde{S}\|_{2}\leq\sqrt{p}\|\tilde{S}\|_{\infty}, so that with high probability, \sqrt{p}\lambda^{*}\geq\sqrt{p}cn\|\tilde{S}\|_{\infty}\geq cn\|\tilde{S}\|_{2}. Thus we may use the exact same selection procedure to achieve the desired bound, although there are other selection procedures for \lambda that would guarantee \lambda\geq cn\|\tilde{S}\|_{2} with high probability; for example, choosing the (1-\alpha)-quantile of n\|\tilde{S}\|_{2} given the X_{i}'s. We will for now adopt the selection procedure and the model assumptions in Belloni et al., (2011).

Assumption C.1.

We have \log^{2}(p/\alpha)\log(1/\alpha)=o(n) and p/\alpha\rightarrow\infty as n\rightarrow\infty.

Under this assumption, and assuming that \epsilon is normal, the selected regularization \lambda=\sqrt{p}\lambda^{*} satisfies

\lambda\lesssim\sqrt{pn\log(2p/\alpha)}

with probability 1-\alpha for all large n, using the same argument as Lemma 1 of Belloni et al., (2011).

An important quantity in deriving the bound on the estimation error is the "prediction" norm

\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}:=\frac{1}{n}\sum_{i}(X_{i}^{T}(\hat{\beta}-\beta_{0}))^{2}=(\hat{\beta}-\beta_{0})^{T}\Big(\frac{1}{n}\sum_{i}X_{i}X_{i}^{T}\Big)(\hat{\beta}-\beta_{0}),

which is related to the Euclidean norm \|\hat{\beta}-\beta_{0}\|_{2} through the Gram matrix \frac{1}{n}\sum_{i}X_{i}X_{i}^{T}. We need to make an assumption on the modulus of continuity.

Assumption C.2.

There exist a constant \kappa and an n_{0} such that for all n\geq n_{0}, \kappa\|\delta\|_{2}\leq\|\delta\|_{2,n} for all \delta\in\mathbb{R}^{p}.

When p\leq n, the Gram matrix \frac{1}{n}\sum_{i}X_{i}X_{i}^{T} will be full rank (with high probability under a random design) and concentrate around the population covariance matrix. This setting of p\leq n is different from the high-dimensional setting in the square root LASSO paper, as LASSO-type penalties are able to achieve selection consistency when p>n under sparsity, whereas ridge-type penalties generally cannot. Note also that when p>n, restricted eigenvalues are necessary when defining \kappa, and it is necessary to prove that \hat{\beta}-\beta_{0} belongs to a restricted subset of \mathbb{R}^{p} on which the bound with \kappa holds. When p\leq n, the restricted subset and eigenvalues are not necessary, and \kappa can be understood as the minimum eigenvalue of the Gram matrix, which is bounded away from 0 (with high probability). The exact value of \kappa is a function of the data generating process. For example, if covariates are generated independently of each other, then \kappa\approx 1.

Theorem C.3.

Assume that p\leq n, where p is allowed to grow with n. Let the regularization be \lambda=\sqrt{p}\lambda^{*} with \lambda^{\ast}=c\sqrt{n}\Phi^{-1}(1-\alpha/2p). Under Assumptions C.1 and C.2, the solution \hat{\beta} to the square root ridge problem

\min_{\beta}\sqrt{\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta)^{2}}+\frac{\lambda}{n}\sqrt{\|\beta\|^{2}+1}

satisfies

\|\hat{\beta}-\beta_{0}\|_{2}\leq\frac{2(\frac{1}{c}+1)}{1-(\frac{\lambda}{n})^{2}\kappa^{2}}\cdot\frac{\lambda}{n}\cdot\sigma\sqrt{E_{n}(\epsilon^{2})}\lesssim\sigma\sqrt{p\log(2p/\alpha)/n}

with probability at least 1-\alpha for all n large enough.

We remark that the quantile of the score function \frac{E_{n}(X\epsilon)}{\sqrt{E_{n}(\epsilon^{2})}} is not only critical for establishing the \sqrt{n}-consistency of the square root ridge. It is also important in practice as the basis for regularization parameter selection. In Appendix E, we propose a data-driven regularization selection procedure that uses the nonparametric bootstrap to estimate the quantile of the score, and demonstrate in Section 5 that it has very good empirical performance. The nonparametric bootstrap procedure may be of independent interest as well. Before we discuss regularization parameter selection in detail, we first focus on the statistical properties of the square root ridge in the novel vanishing noise regime.

C.2 Delayed Shrinkage of Square Root Ridge

Conventional wisdom on regressions with ridge-type penalties is that they induce shrinkage on parameter estimates, and this shrinkage happens for any non-zero regularization. Asymptotically, if the regularization parameter does not vanish as the sample size increases, the limit of the estimator, when it exists, is not equal to the true parameter. The same behavior may be expected of the square root ridge regression. Indeed, this is the case in the standard linear regression setting with constant variance, i.e., \text{Var}(\epsilon_{i})=\sigma^{2}>0 and

Y_{i}=X_{i}^{T}\beta_{0}+\epsilon_{i}.

However, as we will see, when \text{Var}(\epsilon_{i}) depends on the sample size and vanishes as n\rightarrow\infty, the square root ridge estimator can be consistent for non-vanishing penalties.

To best illustrate the intuition behind this property of the square root ridge, we start with the following simple example. Consider the data generating process written in matrix-vector form:

\mathbf{Y}=\mathbf{X}\beta_{0}+\mathbf{\epsilon}, \qquad (27)

where the rows of \mathbf{X}\in\mathbb{R}^{n\times p} are i.i.d. \mathcal{N}(0,I_{p}) and independent of \mathbf{\epsilon}\sim\mathcal{N}(0,\sigma^{2}_{n}I_{n}). Suppose that the variance of the noise vanishes: \sigma^{2}_{n}\rightarrow 0 as n\rightarrow\infty. This is not a standard regression setup, but it captures the essence of the IV estimation setting, as we show in Section 4.

Recall the square root ridge regression problem in (15), which is strictly convex:

\min_{\beta}\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho(1+\|\beta\|^{2})}.

Let \hat{\beta}^{(n)}_{\mathrm{sqrt}} be its unique minimizer. As the sample size n\rightarrow\infty, we will fix the regularization parameter \rho\equiv 1, instead of letting \rho\rightarrow 0. Standard asymptotic theory implies that \hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{\mathrm{sqrt}}, where \beta_{\mathrm{sqrt}} is the minimizer of the limit of the square root ridge objective. For the simple model (27), we can verify that

\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{1+\|\beta\|^{2}}\rightarrow_{p}\|\beta_{0}-\beta\|+\sqrt{1+\|\beta\|^{2}},

where we have used the crucial property \sigma_{n}^{2}\rightarrow 0. Therefore, under standard conditions, we have

\hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{\mathrm{sqrt}}:=\arg\min_{\beta}\|\beta_{0}-\beta\|+\sqrt{1+\|\beta\|^{2}}. \qquad (28)

Note that the limiting objective above is strictly convex and hence has a unique minimizer. Moreover,

\|\beta_{0}-\beta\|+\sqrt{1+\|\beta\|^{2}}=\|(\beta_{0},-1)-(\beta,-1)\|+\|(\beta,-1)\|\geq\|(\beta_{0},-1)\|,

using the triangle inequality. On the other hand, setting \beta=\beta_{0} in (28) achieves the lower bound \|(\beta_{0},-1)\|. Therefore, \beta_{\mathrm{sqrt}}=\beta_{0} is the unique minimizer of the limiting objective, and so

\hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{0}

with \rho=1 non-vanishing. We have therefore demonstrated that with a non-vanishing regularization parameter, the square root ridge regression can still produce a consistent estimator. This phenomenon holds more generally: the square root ridge estimator is consistent for any (limiting) regularization parameter \rho\in[0,1+\frac{1}{\|\beta_{0}\|^{2}}], as long as the noise vanishes, in the sense that \sum_{i=1}^{n}\epsilon^{2}_{i}=o_{p}(n). This condition holds for a wide variety of empirical risk minimization objectives, including the IV estimation objective.
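Before stating the general result, a small simulation (our illustration; the design, the fixed penalty \rho=1, and the optimizer are choices we make for the sketch) contrasts the two estimators in the vanishing noise regime: the square root ridge error shrinks to zero as n grows, while the standard ridge with the same fixed penalty stays near \beta_{0}/(1+\rho).

import numpy as np
from scipy.optimize import minimize

def fit(X, Y, rho, sqrt_version):
    n, p = X.shape
    if not sqrt_version:  # standard ridge, closed form
        return np.linalg.solve(X.T @ X / n + rho * np.eye(p), X.T @ Y / n)
    obj = lambda b: np.sqrt(np.mean((Y - X @ b) ** 2)) + np.sqrt(rho * (b @ b + 1))
    return minimize(obj, np.zeros(p), method="Nelder-Mead").x

rng = np.random.default_rng(1)
rho, beta0 = 1.0, np.array([1.0, -2.0, 0.5])            # rho = 1 < 1 + 1/||beta0||^2
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, 3))
    Y = X @ beta0 + rng.standard_normal(n) / np.sqrt(n)  # sigma_n^2 = 1/n -> 0
    print(n,
          np.linalg.norm(fit(X, Y, rho, True) - beta0),   # -> 0: delayed shrinkage
          np.linalg.norm(fit(X, Y, rho, False) - beta0))  # stays near ||beta0|| rho/(1+rho)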

Theorem C.4.

In the linear model (27) where the rows of \mathbf{X} are distributed i.i.d. \mathcal{N}(0,I_{p}), if \sum_{i=1}^{n}\epsilon^{2}_{i}=o_{p}(n) as the sample size n\rightarrow\infty, then for any \rho\in[0,1+\frac{1}{\|\beta_{0}\|^{2}}], the unique solution \hat{\beta}_{\mathrm{sqrt}} to (15) is consistent:

\hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{0}.

In Fig. 3, we plot the solution of the limiting square root ridge objective in a one-dimensional example. As we can see, (asymptotic) shrinkage is delayed until the regularization \rho exceeds the limit 1+\frac{1}{\|\beta_{0}\|^{2}} in the vanishing noise regime. This behavior is in stark contrast with the regular ridge regression estimator, for which shrinkage starts from the origin, even in the vanishing noise setting.

Figure 3: Limit of the square root ridge estimator in a one-dimensional example with vanishing noise, as a function of the regularization parameter \rho. The value \rho=1+\frac{1}{\|\beta_{0}\|^{2}} is the largest regularization level that guarantees consistency of the square root ridge.
Remark C.5 (Necessary Requirements of Delayed Shrinkage).

Although the delayed shrinkage property of the square root ridge is essentially a simple consequence of the triangle inequality, it relies crucially on three features of the square root ridge estimation procedure.

First, even though the square root ridge shares similarities with the standard ridge regression

\min_{\beta}\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}+\rho\|\beta\|^{2},

only the former has delayed shrinkage: the square root operations applied to the mean squared loss and the squared norm of the parameter above are essential. To see this, note that with vanishing noise, the limit of the standard ridge estimator for the model in (27) is the solution to the following problem:

\min_{\beta}\|\beta_{0}-\beta\|^{2}+\rho\|\beta\|^{2},

which results in the optimal solution \beta=\beta_{0}/(1+\rho). Therefore, with any non-zero \rho, the ridge estimator exhibits shrinkage, even with vanishing noise.
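For completeness, the shrunk limit follows from the first-order condition of the limiting ridge objective:

\nabla_{\beta}\left[\|\beta_{0}-\beta\|^{2}+\rho\|\beta\|^{2}\right]=-2(\beta_{0}-\beta)+2\rho\beta=0\quad\Longleftrightarrow\quad\beta=\frac{\beta_{0}}{1+\rho}.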

Second, the inclusion of the extra constant 1 in the regularization term \sqrt{1+\|\beta\|^{2}} is crucial in guaranteeing that \beta_{\mathrm{sqrt}}=\beta_{0} is the unique limit of the square root ridge estimator. To see this, consider instead the following modified square root ridge problem, which appears in Owen, (2007); Blanchet et al., (2019):

\min_{\beta}\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho}\|\beta\|_{2},

where the regularization term does not include an additive constant inside the square root, and so simplifies to \|\beta\|. Under model (27) with vanishing noise and \rho=1, this objective has limit \|\beta_{0}-\beta_{\mathrm{sqrt}}\|+\|\beta_{\mathrm{sqrt}}\|. Without the "curvature" guaranteed by the additional constant in the regularization term, the limiting objective is no longer strictly convex, and there are in fact infinitely many solutions that achieve the lower bound in the triangle inequality

\|\beta_{0}-\beta_{\mathrm{sqrt}}\|+\|\beta_{\mathrm{sqrt}}\|\geq\|\beta_{0}\|,

including \beta_{\mathrm{sqrt}}=0. This implies that the solution to the modified objective is no longer guaranteed to be a consistent estimator of \beta_{0}. Indeed, the inconsistency of this curvature-less version of the square root ridge estimator has also been corroborated by simulations.

Third, given that small penalties in the square root ridge objective could achieve regularization in finite samples without sacrificing consistency, one may wonder why it is not widely used. This is partly due to the standard ridge being easier to implement computationally, but the main reason is that the delayed shrinkage of the square root ridge estimator is only present in the vanishing noise regime. To see this, assume now \mathbf{\epsilon}\sim\mathcal{N}(0,\sigma^{2}I) with non-vanishing \sigma^{2}>0 in (27). As the sample size n\rightarrow\infty,

\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho(1+\|\beta\|^{2})}=\sqrt{\frac{1}{n}(\mathbf{X}\beta_{0}+\mathbf{\epsilon}-\mathbf{X}\beta)^{T}(\mathbf{X}\beta_{0}+\mathbf{\epsilon}-\mathbf{X}\beta)}+\sqrt{\rho(1+\|\beta\|^{2})}
\rightarrow_{p}\sqrt{\|\beta_{0}-\beta\|^{2}+\sigma^{2}}+\sqrt{\rho(1+\|\beta\|^{2})},

and as before \hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{\mathrm{sqrt}}, the unique minimizer of the limiting objective above. The optimality condition is given by

\frac{\beta-\beta_{0}}{\sqrt{\|\beta-\beta_{0}\|^{2}+\sigma^{2}}}+\sqrt{\rho}\frac{\beta}{\sqrt{1+\|\beta\|^{2}}}=0,

and now \hat{\beta}^{(n)}_{\mathrm{sqrt}} is a consistent estimator of \beta_{0} only when \rho\rightarrow 0, unless \beta_{0}\equiv 0. For this reason, the fact that the square root ridge can be consistent with non-vanishing regularization may not be particularly useful in standard regression settings: in the presence of non-vanishing noise, shrinkage happens for any non-zero regularization, which has also been confirmed in simulations.

Although the consistency of the square root ridge estimator with non-vanishing regularization does not have immediate practical implications for conventional regression problems, it is very well suited to the instrumental variables estimation setting. The reason is that IV and TSLS regressions involve projecting the endogenous (and outcome) variables onto the space spanned by the instrumental variables in the first stage. When instruments are valid, this projection cancels out the noise terms asymptotically, resulting in \frac{1}{n}(\mathbf{Y}-\mathbf{X}\beta_{0})^{T}\mathbf{\Pi}_{\mathbf{Z}}(\mathbf{Y}-\mathbf{X}\beta_{0})\rightarrow_{p}0. The subsequent second-stage regression involving the projected variables therefore precisely corresponds to the vanishing noise regime, and we may expect a similar delayed shrinkage effect. This is indeed the case: with non-zero regularization and (asymptotically) valid instruments, we show in Section 4 that the Wasserstein DRIVE estimator is consistent. This result suggests that we can introduce robustness and regularization into standard IV estimation through the square root ridge objective without sacrificing asymptotic validity, which has important implications in practice.
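The projection argument can be checked numerically; the sketch below (our illustration, with arbitrary choices of k and \sigma) shows that the projected noise term \frac{1}{n}\epsilon^{T}\Pi_{\mathbf{Z}}\epsilon decays like k/n, while the unprojected term \frac{1}{n}\epsilon^{T}\epsilon stays near \sigma^{2}.

import numpy as np

rng = np.random.default_rng(2)
k, sigma = 4, 1.0                          # k valid instruments
for n in [200, 2000, 20000]:
    Z = rng.standard_normal((n, k))
    eps = sigma * rng.standard_normal(n)
    b = Z.T @ eps                          # eps' Pi_Z eps = b' (Z'Z)^{-1} b
    proj = b @ np.linalg.solve(Z.T @ Z, b)
    print(n, proj / n, eps @ eps / n)      # first value -> 0, second stays ~ sigma^2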

C.3 Square Root Ridge vs. Ridge for GMM and M-Estimators

We also remark on the distinction between the square root ridge and the standard ridge in the case when \rho_{n}\rightarrow 0. From Fu and Knight, (2000), we know that if \rho approaches 0 at a rate equal to or slower than O(1/\sqrt{n}), then the ridge estimator has asymptotic bias, i.e., it is not centered at \beta_{0}. However, for the square root ridge (and DRIVE), as long as \rho_{n}\rightarrow 0 at any rate, the estimator will not have any bias. This feature is a result of the self-normalization property of the square root ridge. In (19), the second term results from

\sqrt{n\rho(1+\|\beta_{0}+\delta/\sqrt{n}\|^{2})}-\sqrt{n\rho(1+\|\beta_{0}\|^{2})}=\frac{n\rho\beta_{0}^{T}}{\sqrt{n\rho(1+\|\beta_{0}\|^{2})}}\cdot\delta/\sqrt{n}+o(\delta/\sqrt{n})
\rightarrow\frac{\sqrt{\rho}\beta_{0}^{T}\delta}{\sqrt{1+\|\beta_{0}\|^{2}}},

which does not depend on n. In this sense, the parameter \rho in the square root ridge is scale-free, unlike the regularization parameter in the standard ridge case, whose natural scale is O(1/\sqrt{n}). In the same spirit, when \rho does not vanish, the resulting square root ridge estimator behaves similarly to a standard ridge estimator with a regularization parameter vanishing at rate \Theta(1/\sqrt{n}). Moreover, the amount of shrinkage essentially does not depend on the magnitude of \beta_{0}, due to the normalization of \beta_{0} by \sqrt{1+\|\beta_{0}\|^{2}}, which is also different from the standard ridge setting.

Lastly, we discuss the distinction between our work and that of Blanchet et al., (2022), which analyzes the asymptotic properties of a general class of DRO estimators. In that work, the original estimators are based on minimizing a sample loss of the form

\frac{1}{n}\sum_{i=1}^{n}\ell(X_{i},Y_{i},\beta),

which encompasses most M-estimators, including the maximum likelihood estimator, and they focus on the case when \rho_{n}\rightarrow 0. However, the IV (TSLS) estimator is different in that it is a moment-based estimator, more precisely a GMM estimator (Hansen,, 1982). The key distinction between these estimators is that the objective function of GMM estimators (and Z-estimators based on estimating equations) usually converges to a weighted distance function that evaluates to 0 at the true parameter \beta_{0}, whereas the objectives of M-estimators tend to converge to a limit that does not vanish even at the true parameter. To see this distinction more precisely, consider the limit of the OLS objective under the linear model Y_{i}=X_{i}^{T}\beta_{0}+\epsilon_{i} with \mathbb{E}(X_{i}\epsilon_{i})=0 and \frac{1}{n}\mathbf{X}^{T}\mathbf{X}\rightarrow_{p}I_{p}:

\frac{1}{n}(\mathbf{Y}-\mathbf{X}\beta)^{T}(\mathbf{Y}-\mathbf{X}\beta)=\frac{1}{n}(\mathbf{X}\beta_{0}+\epsilon-\mathbf{X}\beta)^{T}(\mathbf{X}\beta_{0}+\epsilon-\mathbf{X}\beta)
\rightarrow_{p}(\beta_{0}-\beta)^{T}(\beta_{0}-\beta)+\sigma^{2}(\epsilon),

which is minimized at \beta_{0}, achieving a minimum of \sigma^{2}(\epsilon). On the other hand, consider the following GMM version of the OLS estimator, based on the moment condition \mathbb{E}(X_{i}\epsilon_{i})=0:

\min_{\beta}\frac{1}{n}\left\{(\mathbf{Y}-\mathbf{X}\beta)^{T}\mathbf{X}\right\}W\left\{\mathbf{X}^{T}(\mathbf{Y}-\mathbf{X}\beta)\right\},

where W is a weighting matrix, with the optimal choice being (\mathbf{X}^{T}\mathbf{X})^{-1} in this setting. We have, assuming again \frac{1}{n}\mathbf{X}^{T}\mathbf{X}\rightarrow_{p}I_{p},

\frac{1}{n}\left\{(\mathbf{Y}-\mathbf{X}\beta)^{T}\mathbf{X}\right\}(\mathbf{X}^{T}\mathbf{X})^{-1}\left\{\mathbf{X}^{T}(\mathbf{Y}-\mathbf{X}\beta)\right\}=\left\{\frac{1}{n}(\mathbf{Y}-\mathbf{X}\beta)^{T}\mathbf{X}\right\}\Big(\frac{1}{n}\mathbf{X}^{T}\mathbf{X}\Big)^{-1}\left\{\frac{1}{n}\mathbf{X}^{T}(\mathbf{Y}-\mathbf{X}\beta)\right\}
\rightarrow_{p}(\beta_{0}-\beta)^{T}I(\beta_{0}-\beta)=\|\beta_{0}-\beta\|^{2},

which is also minimized at \beta_{0} but achieves a minimum value of 0. This distinction between M-estimators and Z-estimators (and GMM estimators) is negligible in the standard setting without the distributionally robust optimization component, and in fact the standard OLS estimator is preferable to the GMM version for being more stable (Hall,, 2003). However, when we apply square root ridge regularization to these estimators, they start behaving differently: only regularized regression based on GMM and Z-estimators enjoys consistency with a non-zero \rho>0. In Appendix D.1, we exploit this property to generalize our results and develop asymptotic theory for a general class of GMM estimators.
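The contrast is easy to see numerically; in the sketch below (our illustration, with an arbitrary design), the OLS objective evaluated at \beta_{0} converges to \sigma^{2}, while the optimally weighted GMM objective at \beta_{0} converges to 0.

import numpy as np

rng = np.random.default_rng(3)
sigma, beta0 = 1.0, np.array([1.0, -1.0])
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, 2))
    Y = X @ beta0 + sigma * rng.standard_normal(n)
    r = Y - X @ beta0                            # residuals at the true parameter
    g = X.T @ r / n                              # sample moments E_n[X eps]
    print(n,
          r @ r / n,                             # M-estimator objective -> sigma^2
          g @ np.linalg.solve(X.T @ X / n, g))   # weighted GMM objective -> 0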

Appendix D Extensions to GMM Estimation and q-Wasserstein Distances

In this section, we consider generalizations of the framework and results in the main paper. We first formulate a Wasserstein distributionally robust GMM estimation framework, and generalize the asymptotic results on Wasserstein DRIVE to this setting. We then consider Wasserstein DRIVE with the q-Wasserstein distance where q\neq 2, and demonstrate that the resulting estimator enjoys a similar consistency property with a non-vanishing robustness/regularization parameter.

D.1 Wasserstein Distributionally Robust GMM

In this section, we consider general GMM estimation and propose a distributionally robust GMM estimation framework. Let \theta_{0}\in\mathbb{R}^{p} be the true parameter vector in the interior of some compact space \Theta\subseteq\mathbb{R}^{p}. Let \psi(W,\theta) be a vector of moments that satisfy

\mathbb{E}[\psi(W_{i},\theta_{0})]=0

for all i, where \{W_{1},\dots,W_{n}\} are independent but not necessarily identically distributed variables. Let \psi_{i}(\theta)=\psi(W_{i},\theta). We consider the GMM estimators that minimize the objective

\min_{\theta}\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right),

where W_{n} is a positive definite weight matrix, e.g., the weight matrix corresponding to the two-step or continuous updating estimator, and \frac{1}{n}\sum_{i}\psi_{i}(\theta) are the sample moments under the empirical distribution \mathbb{P}_{n} on \psi(\theta). Both IV estimation and the GMM formulation of OLS regression fall under this framework. When we are uncertain about the validity of the moment conditions, similarly to Wasserstein DRIVE, we consider a regularized regression objective given by

\min_{\theta}\sqrt{\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)}+\sqrt{\rho(1+\|\theta\|^{2})}. \qquad (29)
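As a concrete illustration (ours; the moment function, weight matrix, and optimizer are choices we make for the sketch, not prescribed by the text), the following implements objective (29) for the IV moments \psi_{i}(\theta)=Z_{i}(Y_{i}-X_{i}^{T}\theta) with W_{n}=(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z})^{-1}.

import numpy as np
from scipy.optimize import minimize

def regularized_gmm_iv(Z, X, Y, rho):
    # Square root ridge regularized GMM objective (29), specialized to IV moments.
    n = len(Y)
    W = np.linalg.inv(Z.T @ Z / n)                     # weight matrix W_n

    def objective(theta):
        g = Z.T @ (Y - X @ theta) / n                  # sample moments (1/n) sum_i psi_i(theta)
        return np.sqrt(g @ W @ g) + np.sqrt(rho * (theta @ theta + 1.0))

    theta_init = np.linalg.lstsq(X, Y, rcond=None)[0]  # warm start at OLS
    return minimize(objective, theta_init, method="Nelder-Mead").x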

We will study the asymptotic properties of this regularized GMM objective. We make use of the following sufficient technical conditions in Caner, (2009) on GMM estimation to simplify the proof.

Assumption D.1.

The following conditions are satisfied:

  1. For all i and \theta_{1},\theta_{2}\in\Theta, we have \|\psi_{i}(\theta_{1})-\psi_{i}(\theta_{2})\|\leq B_{i}\|\theta_{1}-\theta_{2}\|, with \lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}B_{i}^{d}<\infty for some d>2; \sup_{\theta\in\Theta}\mathbb{E}\|\psi_{i}(\theta)\|^{d}<\infty for some d>2.

  2. Let m_{n}(\theta):=\frac{1}{n}\sum_{i}\mathbb{E}\psi_{i}(\theta) and assume that m_{n}(\theta)\rightarrow m(\theta) uniformly over \Theta, m_{n}(\theta) is continuously differentiable in \theta, m(\theta)=0 if and only if \theta=\theta_{0}, and m(\theta) is continuous in \theta; the Jacobian matrix \partial m_{n}(\theta)/\partial\theta\rightarrow_{p}J(\theta) in a neighborhood of \theta_{0}, and J(\theta_{0}) has full rank.

  3. W_{n}(\theta) is positive definite and continuous on \Theta, and W_{n}(\theta)\rightarrow_{p}W(\theta) uniformly in \theta. W(\theta) is continuous in \theta and positive definite for all \theta\in\Theta.

  4. The population objective m(\theta)^{T}W(\theta)m(\theta) is lower bounded by the squared distance \|\theta-\theta_{0}\|^{2}, i.e., m(\theta)^{T}W(\theta)m(\theta)\geq\overline{\rho}\|\theta-\theta_{0}\|^{2} for all \theta\in\Theta and some \overline{\rho}>0.

See also Andrews, (1994); Stock and Wright, (2000), which assume conditions similar to 1-3 on the GMM estimation setup. Condition 4 requires that the weighted moment be bounded below by a quadratic function near \theta_{0}. Under these conditions, we have the following result.

Theorem D.2.

Under Assumption D.1, the unique solution \hat{\theta}^{GMM} to

\min_{\theta}\sqrt{\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)}+\sqrt{\rho_{n}(1+\|\theta\|^{2})}

converges to the solution \theta^{GMM} of the population objective

\min_{\theta}\sqrt{m(\theta)^{T}W(\theta)m(\theta)}+\sqrt{\rho(1+\|\theta\|^{2})}.

Moreover, whenever \rho\leq\overline{\rho}, \theta^{GMM}=\theta_{0}, so that \hat{\theta}^{GMM}\rightarrow_{p}\theta_{0}.

Therefore, the square root ridge regularized GMM estimator also satisfies the consistency property with a non-zero regularization parameter \rho. Next, we consider the general q-Wasserstein distance with q\neq 2.

D.2 Generalization to qq-Wasserstein DRIVE

The duality result in Theorem 3.1 can be generalized to q-Wasserstein ambiguity sets. The resulting estimator can enjoy a similar consistency result as the square root Wasserstein DRIVE (q=2), but only when q\in(1,2]. This is because the limiting objective can be written as (assuming \rho=1 and \lambda_{p}(\gamma^{T}\gamma)=1)

\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\gamma(\beta-\beta_{0})}+\sqrt[p]{\|\beta\|^{p}+1},

where 1/p+1/q=1. When q\in(1,2], p\in[2,\infty), and so \|x\|_{2}\geq\|x\|_{p}. As a result, the limiting objective is bounded below by

\|\beta-\beta_{0}\|_{2}+\sqrt[p]{\|\beta\|^{p}+1}\geq\|(\beta,-1)-(\beta_{0},-1)\|_{p}+\sqrt[p]{\|\beta\|^{p}+1}
=\|(\beta,-1)-(\beta_{0},-1)\|_{p}+\|(\beta,-1)\|_{p}
\geq\|(\beta_{0},-1)\|_{p},

with equality holding in both inequalities if and only if \beta=\beta_{0}, i.e., \beta_{0} is again the unique minimizer of the limiting objective. We therefore have the following result.

Corollary D.3.

Under the same assumptions as Theorem 4.1, the regularized regression problem

\min_{\beta}\sqrt{\frac{1}{n}\sum_{i}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)_{i}^{2}}+\sqrt[p]{\rho_{n}(\|\beta\|^{p}+1)} \qquad (30)

has a unique solution that converges in probability to \beta_{0} whenever q\in(1,2] and \lim_{n\rightarrow\infty}\rho_{n}\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma).

Appendix E Regularization Parameter Selection for Wasserstein DRIVE

The selection of penalty/regularization parameters is an important consideration for all regularized regression problems. The most common approach is cross validation based on loss function minimization. However, for Wasserstein DRIVE, this standard cross validation procedure may not adequately address the challenges and goals of DRIVE. For example, from Theorem 4.1 we know that the Wasserstein DRIVE is only consistent when the penalty parameter is bounded above. We therefore need to take this result into account when selecting the penalty parameter. In this section, we discuss two selection procedures, one based on the first stage regression coefficient, and the other based on quantiles of the score estimated using a nonparametric bootstrap procedure, which is also of independent interest. We connect our procedures to existing works in the literature on weak and invalid IVs and investigate their empirical performance in Section 5.

E.1 Selecting \rho Based on an Estimate of the First Stage Coefficient

Theorem 4.1 guarantees that as long as the regularization parameter converges to a value in the interval [0,\sigma_{\min}(\gamma)], Wasserstein DRIVE is consistent. A natural procedure to select \rho is thus to compute the minimum singular value \rho_{\max}:=\sigma_{\min}(\hat{\gamma}) of the first stage regression coefficient estimate \hat{\gamma}, and then select a regularization parameter \rho=c\cdot\rho_{\max} for c\in[0,1]. In Section 5, we verify that this procedure produces consistent DRIVE estimators whenever instruments are valid. Moreover, when the instrument is invalid or weak, Wasserstein DRIVE enjoys superior finite sample properties, outperforming the standard IV, OLS, and related estimators in estimation accuracy and in prediction accuracy under distributional shift. This approach is also related to the test of Cragg and Donald, (1993), which was originally used to test for under-identification, and was later used by Stock and Yogo, (2002) to test for weak instruments. In the Cragg-Donald test, the minimum eigenvalue of the first stage rank matrix is used to construct the F-statistic.
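A minimal sketch (ours; the data generating process and the choice c=0.5 are illustrative assumptions) of this rule:

import numpy as np

def select_rho_first_stage(Z, X, c=0.5):
    # First stage OLS estimate gamma_hat (k x p), then rho = c * sigma_min(gamma_hat).
    gamma_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
    return c * np.linalg.svd(gamma_hat, compute_uv=False).min()

rng = np.random.default_rng(4)
n = 1000
Z = rng.standard_normal((n, 3))
gamma = np.array([[1.0, 0.3], [0.5, 1.0], [0.2, -0.4]])
X = Z @ gamma + rng.standard_normal((n, 2))
print(select_rho_first_stage(Z, X))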

Although selecting \rho based on the first stage coefficient gives rise to Wasserstein DRIVE estimators that perform well in practice, one important challenge remains. Recall that violations of the exclusion restriction can be viewed as a form of distributional shift. We therefore expect that as the degree of invalidity increases, the distributional shift becomes stronger. From the DRO formulation of DRIVE in Eq. 12, we know that the regularization parameter \rho is also the radius of the Wasserstein ambiguity set. Therefore, \rho should adaptively increase with increasingly invalid instruments. However, since the selection procedure proposed here only depends on the first stage estimate, it does not take this consideration into account. More importantly, when the instruments are weak, the smallest singular value of the first stage coefficient is likely to be very close to zero, which results in a DRIVE estimate with a very small penalty parameter that may therefore have similar problems as the standard IV. We next introduce another parameter selection procedure for \rho, based on Theorem C.3, that is able to better handle invalid and weak instruments.

E.2 Selecting \rho Based on Nonparametric Bootstrap of Quantile of Score

Recall that the square root LASSO uses the following valid penalty:

\lambda^{\ast}=cn\|\tilde{S}\|_{\infty},

where the score function \tilde{S}=\nabla\hat{Q}^{1/2}(\beta_{0})=\frac{E_{n}(x\epsilon)}{\sqrt{E_{n}(\epsilon^{2})}} with

\hat{Q}(\beta)=\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta)^{2},

and c=1.1 is a constant from Bickel et al., (2009). The intuition for this penalty level comes from the simplest case \beta_{0}\equiv 0, when the optimality condition requires \lambda\geq n\|\tilde{S}\|_{\infty}. To estimate \|\tilde{S}\|_{\infty}, Belloni et al., (2011) propose to estimate the empirical (1-\alpha)-quantile (conditional on the X_{i}) of \frac{\|E_{n}(x\epsilon)\|_{\infty}}{\sqrt{E_{n}(\epsilon^{2})}} by sampling i.i.d. errors \epsilon from the known error distribution \Phi with zero mean and variance 1, resulting in

\lambda^{\ast}=c\sqrt{n}\Phi^{-1}(1-\alpha/2p), \qquad (31)

where the confidence level 1-\alpha is usually set to 0.95.

The consistency result in Theorem C.3 then suggests a natural choice of penalty parameter \rho for the square root ridge, given by \sqrt{\rho}=\frac{\sqrt{p}}{n}\lambda^{\ast}, where \lambda^{\ast} is constructed from (31). However, there are two main challenges when applying this regularization parameter selection procedure to Wasserstein DRIVE in the instrumental variables estimation setting. First, it requires prior knowledge of the type of distribution \Phi of the errors \epsilon, e.g., Gaussian, even if we do not need its variance. Second, \sqrt{\rho}=\frac{\sqrt{p}}{n}\lambda^{\ast} is only valid for the square root ridge problem in the standard regression setting without instruments. When applied to the IV setting, the empirical risk is now

\hat{Q}(\beta)=\frac{1}{n}\sum_{i}(\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta)^{2},

where \tilde{Y}_{i}=(\Pi_{\mathbf{Z}}\mathbf{Y})_{i} and \tilde{X}_{i}=(\Pi_{\mathbf{Z}}\mathbf{X})_{i} are variables projected onto the instrument space. This means that the "observations" (\tilde{Y}_{i},\tilde{X}_{i}) are no longer independent. Therefore, the i.i.d. assumption on the errors in the standard regression setting no longer holds.

We propose the following iterative procedure based on the nonparametric bootstrap that simultaneously addresses the two challenges above. Given a starting estimate \beta^{(0)} of \beta_{0} (say the IV estimator), we compute the residuals r_{i}^{(0)}=\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta^{(0)}. Then we bootstrap these residuals to compute the empirical quantile of

\frac{\|E_{n}(\tilde{x}\epsilon)\|_{\infty}}{\sqrt{E_{n}(\epsilon^{2})}},

where \epsilon is drawn uniformly with replacement from the residuals r_{i}. The bootstrapped quantile then replaces \Phi^{-1}(1-\alpha/2p) in (31) to give a penalty level \rho, which we can use to solve the square root ridge problem to obtain a new estimate \beta^{(1)}. Then we use \beta^{(1)} to compute new residuals r_{i}^{(1)}=\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta^{(1)}, and repeat the process. In practice, we can use the OLS or TSLS estimate as the starting point \beta^{(0)}. Fig. 4 shows that this procedure converges very quickly and does not depend on the initial \beta^{(0)}. Moreover, in Section 5 we demonstrate that the resulting Wasserstein DRIVE estimator has superior estimation performance in terms of \ell^{2} error, as well as prediction under distributional shift.

Figure 4: Left: Penalty parameter convergence as a function of iteration number, with \beta^{(0)} starting from OLS, TSLS, and TSLS ridge estimates. Right: Converged penalty for the standard linear regression model as a function of sample size.
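A sketch (ours) of the iterative procedure described above; the number of bootstrap draws, the iteration count, and the reuse of the sqrt_ridge solver sketched in Appendix C.1 are implementation choices, not prescribed by the text.

import numpy as np

def bootstrap_rho(X_t, Y_t, beta_init, alpha=0.05, c=1.1, n_boot=500, n_iter=10, seed=0):
    # X_t, Y_t: projected data (Pi_Z X, Pi_Z Y); beta_init: e.g. OLS or TSLS estimate.
    rng = np.random.default_rng(seed)
    n, p = X_t.shape
    beta, rho = beta_init, None
    for _ in range(n_iter):
        resid = Y_t - X_t @ beta
        stats = []
        for _ in range(n_boot):
            e = rng.choice(resid, size=n, replace=True)   # resample residuals
            stats.append(np.abs(X_t.T @ e / n).max() / np.sqrt(np.mean(e ** 2)))
        q = np.quantile(stats, 1 - alpha)    # bootstrap quantile of the score norm
        rho = p * (c * q) ** 2               # sqrt(rho) = (sqrt(p)/n) lambda*, lambda* = c n q
        beta = sqrt_ridge(X_t, Y_t, n * np.sqrt(rho))   # solver sketched in Appendix C.1
    return rho, beta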

E.3 Bootstrapped Score Quantile As Test Statistic for Invalid Instruments

When instruments are valid, one should expect that the bootstrapped quantiles converge to 0. We next formalize this intuition in Proposition E.1 and also confirm it in numerical experiments.

Proposition E.1.

The bootstrapped quantiles converge to 0 when instruments are valid.

Figure 5: Penalty strength selected based on nonparametric bootstrap of score quantile vs. correlation strength between invalid instrument and unobserved confounders.

More importantly, in practice we observe that the bootstrapped quantile increases with the degree of instrument invalidity. Fig. 5 illustrates this phenomenon with increasing correlation between the instruments and the unobserved confounder. The intuition behind this observation is that the quantile essentially describes the orthogonality (moment) condition for valid IVs, and so should be close to zero with a valid IV; a large value therefore indicates a possible violation. Thus, the bootstrapped quantile could potentially be used as a test statistic for invalid instruments, using for example permutation tests. Equivalently, in a sensitivity analysis it could be used as a sensitivity parameter, based on which we can bound the worst-case bias of IV/OLS models.

We provide a more detailed discussion to further justify our proposal. In a linear regression setting, the quantity

\frac{\|\sum_{i}(x\epsilon)\|_{\infty}}{\sqrt{\sum_{i}(\epsilon^{2})}}

is a test statistic for the orthogonality condition \mathbb{E}[X(Y-X^{T}\beta)]=0, which holds asymptotically for \beta equal to the OLS estimator when the model assumption that X is independent of the error term is correct. When this quantity is far from zero, it indicates a violation of the orthogonality condition X\perp\epsilon, in which case a non-zero penalty could be beneficial. Similarly, in a TSLS model,

\frac{\|\sum_{i}(\tilde{x}\epsilon)\|_{\infty}}{\sqrt{\sum_{i}(\epsilon^{2})}}

is a test statistic for the orthogonality condition \mathbb{E}[\tilde{X}(\tilde{Y}-\tilde{X}^{T}\beta)]=0, which holds asymptotically when \beta is the TSLS estimator and Z is a valid instrument. A large value of this statistic therefore indicates potential violations of the IV assumptions, and the selected penalty can be seen as assessing the degree of such violations. We may also compare this quantity with the Sargan test statistic (Sargan,, 1958) for instrument invalidity in the over-identified setting and note the similarities.

Remark E.2.

Besides the data-driven procedures discussed above, we can also consider incorporating information provided by statistical tests for IV estimation. For example, in over-identified settings, the Sargan-Hansen test (Sargan,, 1958; Hansen,, 1982) can be used to test the exclusion restriction, and thus to provide evidence on the validity of the instrument. For testing weak instruments, the popular test of Stock and Yogo, (2002) can be used. This proposal is also related to our observation that \rho based on bootstrapped quantiles increases with the degree of invalidity, i.e., a direct effect on the outcome or correlation with confounders, and can therefore potentially be used as a test statistic for the reliability of the IV estimator. We leave a detailed investigation of this proposal to future work.

Appendix F Proofs

F.1 Proof of Theorem 3.1

Proof.

The proof of Theorem 3.1 relies on a general duality result on Wasserstein DRO, with different variations derived in (Gao and Kleywegt,, 2023; Blanchet and Murthy,, 2019; Sinha et al.,, 2017). We start with the inner problem in the objective in (12):

\sup_{\{\mathbb{Q}:D(\mathbb{Q},\tilde{\mathbb{P}}_{n})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}-\tilde{X}^{T}\beta)^{2}\right],

where D is the 2-Wasserstein distance and \tilde{\mathbb{P}}_{n} is the empirical distribution on the projected data \{\tilde{Y}_{i},\tilde{X}_{i}\}_{i=1}^{n}\equiv\{(\Pi_{Z}\mathbf{Y})_{i},(\Pi_{Z}\mathbf{X})_{i}\}_{i=1}^{n}. Proposition 1 of Sinha et al., (2017) and Proposition 1 of Blanchet et al., (2019) both imply that

\sup_{\{\mathbb{Q}:D(\mathbb{Q},\mathbb{P})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}-\tilde{X}^{T}\beta)^{2}\right]=\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i=1}^{n}\phi_{\gamma}(\beta;(\tilde{X}_{i},\tilde{Y}_{i})),

where the "robust" loss function is

\phi_{\gamma}(\beta;(\tilde{X},\tilde{Y}))=\sup_{(X,Y)}(Y-X^{T}\beta)^{2}-\gamma\|X-\tilde{X}\|_{2}^{2}-\gamma(Y-\tilde{Y})^{2}
=\sup_{W}W^{T}\alpha\alpha^{T}W-\gamma\|W-\tilde{W}\|_{2}^{2},

with W=(X,Y), \tilde{W}=(\tilde{X},\tilde{Y}) and \alpha=(-\beta,1). Note that \gamma is always chosen large enough, i.e., \gamma I-\alpha\alpha^{T}\succeq 0, so that the objective W^{T}\alpha\alpha^{T}W-\gamma\|W-\tilde{W}\|_{2}^{2} is concave in W. Otherwise, the supremum over W in the inner problem is unbounded. Therefore, the first order condition is sufficient:

\alpha\alpha^{T}W-\gamma(W-\tilde{W})=0,

so that

(\alpha\alpha^{T}-\gamma I)W=-\gamma\tilde{W},

and

W=\gamma(\gamma I-\alpha\alpha^{T})^{-1}\tilde{W}=(I-\alpha\alpha^{T}/\gamma)^{-1}\tilde{W},

where I-\alpha\alpha^{T}/\gamma is invertible if \gamma I-\alpha\alpha^{T} is positive definite, which is required to make sure that the quadratic is concave in W. The supremum is then given by

\tilde{W}^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}\alpha\alpha^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}\tilde{W}-\gamma\tilde{W}^{T}((I-\alpha\alpha^{T}/\gamma)^{-1}-I)^{2}\tilde{W}
=\tilde{W}^{T}\big((I-\alpha\alpha^{T}/\gamma)^{-1}\alpha\alpha^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}-\gamma((I-\alpha\alpha^{T}/\gamma)^{-1}-I)^{2}\big)\tilde{W}\equiv\|\tilde{W}\|_{A}^{2},

where

A=(I-\alpha\alpha^{T}/\gamma)^{-1}\alpha\alpha^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}-\gamma((I-\alpha\alpha^{T}/\gamma)^{-1}-I)^{2}.

Using the Sherman-Morrison formula (Sherman and Morrison,, 1950), whose condition is satisfied if \gamma I-\alpha\alpha^{T} is positive definite,

(I-\alpha\alpha^{T}/\gamma)^{-1}=I+\frac{1}{\gamma-\alpha^{T}\alpha}\alpha\alpha^{T},

and A can be simplified as

A=\frac{\gamma}{\gamma-\alpha^{T}\alpha}\alpha\alpha^{T}.
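This simplification can also be verified numerically; the sketch below (ours) compares the two expressions for A at a random \alpha and an admissible \gamma.

import numpy as np

rng = np.random.default_rng(5)
alpha = rng.standard_normal(4)
gamma = alpha @ alpha + 2.0                             # ensures gamma*I - alpha alpha^T > 0
I = np.eye(4)
M = np.linalg.inv(I - np.outer(alpha, alpha) / gamma)   # (I - alpha alpha^T / gamma)^{-1}
A_long = M @ np.outer(alpha, alpha) @ M - gamma * (M - I) @ (M - I)
A_short = gamma / (gamma - alpha @ alpha) * np.outer(alpha, alpha)
print(np.allclose(A_long, A_short))                     # True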

In summary, for each projected observation (for the IV estimate) \tilde{W}_{i}=(\tilde{X}_{i},\tilde{Y}_{i}), we can obtain a new "robustified" sample using the above operation, then minimize the following modified empirical risk constructed from the robustified samples:

\min_{\beta}\sup_{\{\mathbb{Q}:D(\mathbb{Q},\mathbb{P})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta)^{2}\right]\Leftrightarrow\min_{\beta}\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i=1}^{n}\phi_{\gamma}(\beta;(\tilde{X}_{i},\tilde{Y}_{i}))
\Leftrightarrow\min_{\beta}\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i}\|(\tilde{X}_{i},\tilde{Y}_{i})\|_{A}^{2},

where for fixed \beta, \gamma\geq 0 is always chosen large enough so that \phi_{\gamma}(\beta;X,Y) is finite.

Now, the inner minimization problem can be further solved explicitly. Recall that it is equal to

\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i}\tilde{W}_{i}^{T}\Big(\frac{\gamma}{\gamma-\alpha^{T}\alpha}\alpha\alpha^{T}\Big)\tilde{W}_{i},

which is convex in \gamma and hence minimized at the first order condition:

\rho=\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}\,\alpha^{T}\alpha}{(\gamma-\alpha^{T}\alpha)^{2}},

or \gamma=\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}\,\alpha^{T}\alpha}{\rho}}+\alpha^{T}\alpha, where we have chosen the larger root since only it is guaranteed to satisfy \gamma I-\alpha\alpha^{T}\succeq 0 for any \alpha=(-\beta,1).

Plugging this expression of \gamma into the objective, and using the notation \ell_{IV}:=\frac{1}{n}\sum_{i}(\tilde{Y}_{i}-\beta^{T}\tilde{X}_{i})^{2}, we obtain

\gamma\rho+\frac{1}{n}\sum_{i}\|(\tilde{X}_{i},\tilde{Y}_{i})\|_{A}^{2}=\sqrt{\rho\alpha^{T}\alpha\cdot\frac{1}{n}\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}+\rho\alpha^{T}\alpha+\frac{1}{n}\sum_{i}\tilde{W}_{i}^{T}\Bigg(\frac{1}{1-\frac{\alpha^{T}\alpha}{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}\,\alpha^{T}\alpha}{\rho}}+\alpha^{T}\alpha}}\alpha\alpha^{T}\Bigg)\tilde{W}_{i}
=\sqrt{\rho\alpha^{T}\alpha\cdot\ell_{IV}}+\rho\alpha^{T}\alpha+\frac{\ell_{IV}}{\frac{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}{\rho}}}{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}{\rho}}+\sqrt{\alpha^{T}\alpha}}}
=\sqrt{\rho\alpha^{T}\alpha\cdot\ell_{IV}}+\rho\alpha^{T}\alpha+\ell_{IV}+\frac{\sqrt{\alpha^{T}\alpha}}{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}{\rho}}}\ell_{IV}
=2\sqrt{\rho\alpha^{T}\alpha\cdot\ell_{IV}}+\rho\alpha^{T}\alpha+\ell_{IV}
=(\sqrt{\ell_{IV}}+\sqrt{\rho\alpha^{T}\alpha})^{2}.

Therefore, we have proved that the Wasserstein DRIVE objective

\min_{\beta}\sup_{\{\mathbb{Q}:D(\mathbb{Q},\tilde{\mathbb{P}}_{n})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}-\tilde{X}^{T}\beta)^{2}\right]

is equivalent to the following square root ridge regularized IV objective:

\min_{\beta}\sqrt{\frac{1}{n}\sum_{i}(\tilde{Y}_{i}-\beta^{T}\tilde{X}_{i})^{2}}+\sqrt{\rho(\|\beta\|^{2}+1)}. ∎

F.2 Proof of Theorem 4.1

Proof.

We will show that as n\rightarrow\infty, \hat{\beta}^{\text{DRIVE}}\rightarrow_{p}\beta_{0} as long as \rho_{n}\rightarrow\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma). Recall the linear IV model (2),

Y=\beta^{T}_{0}X+\epsilon,\qquad X=\gamma^{T}Z+\xi,

with instrument relevance and exogeneity conditions

\text{rank}(\mathbb{E}[ZX^{T}])=p,\qquad\mathbb{E}[Z\epsilon]=0,\qquad\mathbb{E}[Z\xi^{T}]=\mathbf{0}.

First, we compute the limit of the objective function (17), reproduced below:

\sqrt{\frac{1}{n}\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}}+\sqrt{\rho_{n}(\|\beta\|^{2}+1)}. \qquad (32)

For the loss term, we have

\sqrt{\frac{1}{n}\sum_{i}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)_{i}^{2}}=\sqrt{\frac{1}{n}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)^{T}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)}
=\sqrt{\frac{1}{n}(\Pi_{\mathbf{Z}}(\mathbf{X}\beta_{0}+\mathbf{\epsilon})-\Pi_{\mathbf{Z}}\mathbf{X}\beta)^{T}(\Pi_{\mathbf{Z}}(\mathbf{X}\beta_{0}+\mathbf{\epsilon})-\Pi_{\mathbf{Z}}\mathbf{X}\beta)}
=\sqrt{\frac{1}{n}(\Pi_{\mathbf{Z}}\mathbf{X}(\beta_{0}-\beta)+\Pi_{\mathbf{Z}}\mathbf{\epsilon})^{T}(\Pi_{\mathbf{Z}}\mathbf{X}(\beta_{0}-\beta)+\Pi_{\mathbf{Z}}\mathbf{\epsilon})}
=\sqrt{\frac{1}{n}\big(\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{\epsilon}-2\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})+(\beta-\beta_{0})^{T}\mathbf{X}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})\big)}.

Note first that \frac{1}{n}\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})=o_{p}(1) whenever the instruments are valid, since

\frac{1}{n}\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})=\frac{1}{n}\Big(\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\Big(\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)
=\Big(\frac{1}{n}\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)
\rightarrow_{p}\mathbb{E}[Z\epsilon]^{T}\cdot\Sigma^{-1}_{Z}\cdot\mathbb{E}[ZX^{T}](\beta-\beta_{0})=0,

by the continuous mapping theorem. Similarly,

\frac{1}{n}(\beta-\beta_{0})^{T}\mathbf{X}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})=\Big(\frac{1}{n}\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)
\rightarrow_{p}(\beta-\beta_{0})^{T}\mathbb{E}(X_{i}Z_{i}^{T})\Sigma^{-1}_{Z}\mathbb{E}(Z_{i}X_{i}^{T})(\beta-\beta_{0})
=(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\Sigma^{-1}_{Z}\Sigma_{Z}\gamma(\beta-\beta_{0})
=(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0}).

The most important part is the "vanishing noise" behavior, i.e.,

\frac{1}{n}\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{\epsilon}=\frac{1}{n}\Big(\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\Big(\sum_{i}\epsilon_{i}Z_{i}\Big)
=\Big(\frac{1}{n}\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}\epsilon_{i}Z_{i}\Big)
\rightarrow_{p}\mathbb{E}(\epsilon_{i}Z_{i})^{T}\Sigma_{Z}^{-1}\mathbb{E}(\epsilon_{i}Z_{i})=0.

It then follows that the regularized regression objective (17) of the Wasserstein DRIVE estimator converges in probability to (18), reproduced below:

\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)}. \qquad (33)

For \rho>0, the population objective (33) is continuous and strictly convex in \beta, and so has a unique minimizer \beta^{\text{DRIVE}}. Applying the convexity lemma of Pollard, (1991), since (32) is also strictly convex in \beta, the convergence to (33) is uniform on compact sets B\subseteq\mathbb{R}^{p} that contain \beta^{\text{DRIVE}}. Applying Corollary 3.2.3 of van der Vaart and Wellner, (1996), we can therefore conclude that the minimizers of the empirical objectives converge in probability to the minimizer of the population objective, i.e.,

\hat{\beta}^{\text{DRIVE}}\rightarrow_{p}\beta^{\text{DRIVE}}.

Next, we consider minimizing the population objective (33). If \rho is bounded above by the smallest singular value of \gamma^{T}\Sigma_{Z}\gamma, i.e.,

\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma),

the population objective is lower bounded by

\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)}\geq\sqrt{\rho}\|\beta-\beta_{0}\|_{2}+\sqrt{\rho}\sqrt{\|\beta\|^{2}+1}
=\sqrt{\rho}\|(\beta,1)-(\beta_{0},1)\|_{2}+\sqrt{\rho}\|(\beta,1)\|_{2}
\geq\sqrt{\rho}\|(\beta_{0},1)\|_{2},

where in the second line we augment the vectors \beta,\beta_{0} with an extra coordinate equal to 1. The last line follows from the triangle inequality, with equality if and only if \beta\equiv\beta_{0}. The lower bound \sqrt{\rho}\|(\beta_{0},1)\|_{2} of the population objective is therefore achieved uniquely at \beta\equiv\beta_{0}, by strict convexity. We have thus proved that when 0<\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma), the population objective has a unique minimizer at \beta_{0}. When \rho=0, the consistency of \hat{\beta}^{\text{DRIVE}} can be proved similarly as long as \lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)>0, which guarantees that \beta_{0} is the unique minimizer of (33). Therefore, whenever \rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma), we have \hat{\beta}^{\text{DRIVE}}\rightarrow_{p}\beta_{0}. ∎

F.3 Proof of Theorem 4.2

Proof.

For a local parameter $\delta\in\mathbb{R}^{p}$, define the objective function $H_{n}(\delta)$ as follows:

ϕn(β)\displaystyle\phi_{n}(\beta) :=1nΠZ𝐘ΠZ𝐗β2+ρn(β2+1)\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\sqrt{\frac{1}{n}\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}}+\sqrt{\rho_{n}(\|\beta\|^{2}+1)}
Hn(δ)\displaystyle H_{n}(\delta) :=n[ϕn(β0+δ/n)ϕn(β0)].\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\sqrt{n}\left[\phi_{n}(\beta_{0}+\delta/\sqrt{n})-\phi_{n}(\beta_{0})\right].

Note that $H_{n}(\delta)$ is minimized at $\delta=\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})$. The key steps of the proof are to compute the limit $H(\delta)$ of $H_{n}(\delta)$, in distribution and uniformly on compact sets, and to verify that the minimizers are uniformly tight, i.e., $\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})=O_{p}(1)$. We can then apply Theorem 3.2.2 of van der Vaart and Wellner (1996) to conclude that the sequence of minimizers $\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})$ of $H_{n}(\delta)$ converges in distribution to the minimizer of the limit $H(\delta)$. We have

Hn(δ)=n(ϕn(β0+δ/n)ϕn(β0))\displaystyle H_{n}(\delta)=\sqrt{n}\cdot(\phi_{n}(\beta_{0}+\delta/\sqrt{n})-\phi_{n}(\beta_{0})) =ΠZ𝐘ΠZ𝐗(β0+δ/n)2ΠZ𝐘ΠZ𝐗β02I\displaystyle=\underset{\textbf{I}}{\underbrace{\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}(\beta_{0}+\delta/\sqrt{n})\|^{2}}-\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta_{0}\|^{2}}}}
+nρn(1+β0+δ/n2)nρn(1+β02)II.\displaystyle+\underset{\textbf{II}}{\underbrace{\sqrt{n\rho_{n}(1+\|\beta_{0}+\delta/\sqrt{n}\|^{2})}-\sqrt{n\rho_{n}(1+\|\beta_{0}\|^{2})}}}.

We first focus on I:

I =ΠZ𝐘ΠZ𝐗(β0+δ/n)2ΠZ𝐘ΠZ𝐗β02\displaystyle=\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}(\beta_{0}+\delta/\sqrt{n})\|^{2}}-\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta_{0}\|^{2}}
=Fn(β0+δ/n)Fn(β0),\displaystyle=\sqrt{F_{n}(\beta_{0}+\delta/\sqrt{n})}-\sqrt{F_{n}(\beta_{0})},

where

Fn(β)\displaystyle F_{n}(\beta) =ΠZ𝐘ΠZ𝐗β2.\displaystyle=\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}.

We have, with $\psi_{i}(\beta)\equiv Z_{i}(\beta^{T}X_{i}-Y_{i})$ (the sign convention is immaterial for the quadratic forms below, but makes the drift computed later enter with a positive sign),

Fn(β0+δ/n)\displaystyle F_{n}(\beta_{0}+\delta/\sqrt{n}) =ΠZ𝐘ΠZ𝐗(β0+δ/n)2\displaystyle=\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}(\beta_{0}+\delta/\sqrt{n})\|^{2}
=(1niψi(β0+δ/n))T(1n𝐙T𝐙)1(1niψi(β0+δ/n)),\displaystyle=(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n}))^{T}(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z})^{-1}(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n})),
Fn(β0)\displaystyle F_{n}(\beta_{0}) =ΠZ𝐘ΠZ𝐗β02\displaystyle=\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta_{0}\|^{2}
=(1niψi(β0))T(1n𝐙T𝐙)1(1niψi(β0)).\displaystyle=(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}))^{T}(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z})^{-1}(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0})).

We compute the limits of Fn(β0+δ/n)F_{n}(\beta_{0}+\delta/\sqrt{n}) and Fn(β0)F_{n}(\beta_{0}). We have

ψi(β1)ψi(β2)\displaystyle\|\psi_{i}(\beta_{1})-\psi_{i}(\beta_{2})\| ZiXiT(β1β2)\displaystyle\leq\|Z_{i}X_{i}^{T}(\beta_{1}-\beta_{2})\|
=Z(ZTγ+ξT)(β1β2)\displaystyle=\|Z(Z^{T}\gamma+\xi^{T})(\beta_{1}-\beta_{2})\|
(ZZTγ+ZξT)(β1β2)\displaystyle\leq(\|ZZ^{T}\|\|\gamma\|+\|Z\xi^{T}\|)\cdot\|(\beta_{1}-\beta_{2})\|
(Z2γ+Zξ)(β1β2)\displaystyle\leq(\|Z\|^{2}\|\gamma\|+\|Z\|\|\xi\|)\cdot\|(\beta_{1}-\beta_{2})\|

where $\|\cdot\|$ denotes the operator norm for matrices and the Euclidean norm for vectors, and we drop the observation index $i$ since the data are i.i.d. We have, for some constant $c$ that depends on $k$,

𝔼(Z2γ+Zξ)k\displaystyle\mathbb{E}(\|Z\|^{2}\|\gamma\|+\|Z\|\|\xi\|)^{k} c(𝔼Z2kγk+𝔼Zkξk)\displaystyle\leq c(\mathbb{E}\|Z\|^{2k}\|\gamma\|^{k}+\mathbb{E}\|Z\|^{k}\|\xi\|^{k})
c(𝔼Z2kγk+𝔼Z2k𝔼ξ2k)\displaystyle\leq c(\mathbb{E}\|Z\|^{2k}\|\gamma\|^{k}+\sqrt{\mathbb{E}\|Z\|^{2k}}\cdot\sqrt{\mathbb{E}\|\xi\|^{2k}})
<\displaystyle<\infty

using the assumptions that 𝔼Z2k<\mathbb{E}\|Z\|^{2k}<\infty and 𝔼ξ2k<\mathbb{E}\|\xi\|^{2k}<\infty. Moreover, we have

𝔼ψ(β)k\displaystyle\mathbb{E}\|\psi(\beta)\|^{k} =𝔼Z(YβTX)k\displaystyle=\mathbb{E}\|Z(Y-\beta^{T}X)\|^{k}
=𝔼Z(XT(β0β)+ϵ)k\displaystyle=\mathbb{E}\|Z(X^{T}(\beta_{0}-\beta)+\epsilon)\|^{k}
=𝔼Z((ZTγ+ξT)(β0β)+ϵ)k\displaystyle=\mathbb{E}\|Z((Z^{T}\gamma+\xi^{T})(\beta_{0}-\beta)+\epsilon)\|^{k}
c(𝔼ZZTγ(β0β)k+𝔼ZξT(β0β)k+𝔼Zϵk)\displaystyle\leq c\left(\mathbb{E}\|ZZ^{T}\gamma(\beta_{0}-\beta)\|^{k}+\mathbb{E}\|Z\xi^{T}(\beta_{0}-\beta)\|^{k}+\mathbb{E}\|Z\epsilon\|^{k}\right)
c(𝔼Z2kγ(β0β)k+𝔼Z2k𝔼ξ2k(β0β)k+𝔼Z2k𝔼ϵ2k)\displaystyle\leq c\left(\mathbb{E}\|Z\|^{2k}\|\gamma(\beta_{0}-\beta)\|^{k}+\sqrt{\mathbb{E}\|Z\|^{2k}}\sqrt{\mathbb{E}\|\xi\|^{2k}}\|(\beta_{0}-\beta)\|^{k}+\sqrt{\mathbb{E}\|Z\|^{2k}}\sqrt{\mathbb{E}\|\epsilon\|^{2k}}\right)

which is uniformly bounded on compact subsets. The consistency result in Theorem 4.1 combined with the above bounds guarantees stochastic equicontinuity (Andrews, 1994), so that as $n\rightarrow\infty$, uniformly in $\delta$ on compact sets that contain $\delta=\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})$,

1n(iψi(β0+δ/n)𝔼ψi(β0+δ/n))\displaystyle\frac{1}{\sqrt{n}}\left(\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n})-\mathbb{E}\psi_{i}(\beta_{0}+\delta/\sqrt{n})\right) d𝒩(0,Ω(β0))𝒵,\displaystyle\rightarrow_{d}\mathcal{N}(0,\Omega(\beta_{0}))\equiv\mathcal{Z},

where $\Omega(\beta)=\frac{1}{n}\mathbb{E}\sum_{i}\psi_{i}(\beta)\psi_{i}^{T}(\beta)$, so that

\Omega(\beta_{0})=\frac{1}{n}\mathbb{E}\sum_{i}\psi_{i}(\beta_{0})\psi_{i}^{T}(\beta_{0})=\frac{1}{n}\mathbb{E}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})^{2}Z_{i}Z_{i}^{T}=\frac{1}{n}\mathbb{E}\sum_{i}\epsilon_{i}^{2}Z_{i}Z_{i}^{T}=\sigma^{2}\Sigma_{Z},

using independence and homoskedasticity. Moreover,

1ni𝔼ψi(β0+δ/n)\displaystyle\frac{1}{\sqrt{n}}\sum_{i}\mathbb{E}\psi_{i}(\beta_{0}+\delta/\sqrt{n}) =n𝔼[XT(β0+δ/n)Y]Z\displaystyle=\sqrt{n}\mathbb{E}\left[X^{T}(\beta_{0}+\delta/\sqrt{n})-Y\right]Z
=n𝔼[XT(β0+δ/n)(XTβ0+ϵ)]Z\displaystyle=\sqrt{n}\mathbb{E}\left[X^{T}(\beta_{0}+\delta/\sqrt{n})-(X^{T}\beta_{0}+\epsilon)\right]Z
\displaystyle=\mathbb{E}ZX^{T}\delta=\mathbb{E}Z(Z^{T}\gamma+\xi^{T})\delta
\displaystyle=\Sigma_{Z}\gamma\delta,

using $\mathbb{E}Z\xi^{T}=0$.

Combining these, we have

1niψi(β0+δ/n)\displaystyle\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n}) d𝒵+ΣZγδ,\displaystyle\rightarrow_{d}\mathcal{Z}+\Sigma_{Z}\gamma\delta,

uniformly in δ\delta on compact sets, so that

Fn(β0+δ/n)\displaystyle F_{n}(\beta_{0}+\delta/\sqrt{n}) d(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)\displaystyle\rightarrow_{d}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)
Fn(β0)\displaystyle F_{n}(\beta_{0}) d𝒵TΣZ1𝒵,\displaystyle\rightarrow_{d}\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z},

and applying the continuous mapping theorem to the square root function,

I =Fn(β0+δ/n)Fn(β0)d(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵.\displaystyle=\sqrt{F_{n}(\beta_{0}+\delta/\sqrt{n})}-\sqrt{F_{n}(\beta_{0})}\rightarrow_{d}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}-\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}}.

Next we have

𝐈𝐈\displaystyle\mathbf{II} =nρn(1+β0+δ/n2)nρn(1+β02)\displaystyle=\sqrt{n\rho_{n}(1+\|\beta_{0}+\delta/\sqrt{n}\|^{2})}-\sqrt{n\rho_{n}(1+\|\beta_{0}\|^{2})}
=nρnβ0Tnρn(1+β02)δ/n+o(δ/n)\displaystyle=\frac{n\rho_{n}\beta_{0}^{T}}{\sqrt{n\rho_{n}(1+\|\beta_{0}\|^{2})}}\cdot\delta/\sqrt{n}+o(\delta/\sqrt{n})
ρnβ0T(1+β02)δ.\displaystyle\rightarrow\frac{\sqrt{\rho_{n}}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta.
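
As a quick numerical sanity check on this expansion (our own illustration; the values of $\beta_{0}$, $\delta$, and the constant sequence $\rho_{n}\equiv\rho=0.25$ are arbitrary choices), the difference II approaches $\sqrt{\rho}\beta_{0}^{T}\delta/\sqrt{1+\|\beta_{0}\|^{2}}$:

import numpy as np

# Sanity check of the expansion of II (illustrative values, not from the paper).
b0 = np.array([1.0, -2.0]); delta = np.array([0.5, 0.3]); rho = 0.25
limit = np.sqrt(rho) * (b0 @ delta) / np.sqrt(1 + b0 @ b0)
for n in [1e2, 1e4, 1e6]:
    II = (np.sqrt(n * rho * (1 + np.sum((b0 + delta / np.sqrt(n)) ** 2)))
          - np.sqrt(n * rho * (1 + b0 @ b0)))
    print(int(n), II, limit)   # II converges to the linear drift term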

Combining the analyses of 𝐈\mathbf{I} and 𝐈𝐈\mathbf{II}, we have

Hn(δ)\displaystyle H_{n}(\delta) d(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵+ρβ0T(1+β02)δ\displaystyle\rightarrow_{d}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}-\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta

uniformly in $\delta$ on compact sets. Because $H_{n}(\delta)$ is convex and the limit $H(\delta)$ has a unique minimizer, $\arg\min_{\delta}H_{n}(\delta)=\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})=O_{p}(1)$. Applying Theorem 3.2.2 of van der Vaart and Wellner (1996) allows us to conclude that

n(β^nDRIVEβ0)\displaystyle\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0}) dargminδ(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵+ρβ0T(1+β02)δ.\displaystyle\rightarrow_{d}\arg\min_{\delta}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}-\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta.

In fact, we may drop the term 𝒵TΣZ1𝒵\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}} since it does not depend on δ\delta. Therefore,

n(β^nDRIVEβ0)=argminδHn(δ)\displaystyle\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})=\arg\min_{\delta}H_{n}(\delta) dargminδ(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)+ρβ0T(1+β02)δ.\displaystyle\rightarrow_{d}\arg\min_{\delta}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta.

Now when ρ=0\rho=0, the objective above reduces to

(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ),\displaystyle\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)},

which recovers the same minimizer as the TSLS objective

(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵\displaystyle(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)-\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z} =2δTγT𝒵+δTγTΣZγδ,\displaystyle=2\delta^{T}\gamma^{T}\mathcal{Z}+\delta^{T}\gamma^{T}\Sigma_{Z}\gamma\delta,

since the first order condition of the former is

\frac{\gamma^{T}\mathcal{Z}+\gamma^{T}\Sigma_{Z}\gamma\delta}{\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}}=0,

and of the latter is

γT𝒵+γTΣZγδ\displaystyle\gamma^{T}\mathcal{Z}+\gamma^{T}\Sigma_{Z}\gamma\delta =0.\displaystyle=0.

We can therefore conclude that whenever $\rho_{n}\rightarrow\rho=0$, regardless of the rate, the asymptotic distribution of Wasserstein DRIVE coincides with that of the standard TSLS estimator. ∎
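
The equivalence of these two first order conditions can also be checked numerically. The following Python sketch (our own illustration, with arbitrary $\Sigma_{Z}$ and $\gamma$, and $\sigma=1$ so that a draw of $\mathcal{Z}$ has covariance $\Sigma_{Z}$) minimizes both limiting objectives over $\delta$ and compares them with the closed form $\delta=-(\gamma^{T}\Sigma_{Z}\gamma)^{-1}\gamma^{T}\mathcal{Z}$:

import numpy as np
from scipy.optimize import minimize

# Sanity check (illustrative): the square root objective and the TSLS quadratic
# share the same minimizer in delta.
rng = np.random.default_rng(2)
Sigma_Z = np.array([[2.0, 0.3], [0.3, 1.0]])
gamma = np.array([[1.0], [0.5]])          # two instruments, one regressor
Zc = np.linalg.cholesky(Sigma_Z) @ rng.normal(size=2)   # one draw of the Gaussian limit
Si = np.linalg.inv(Sigma_Z)

v = lambda d: Zc + Sigma_Z @ gamma @ d
d_sqrt = minimize(lambda d: np.sqrt(v(d) @ Si @ v(d)), x0=np.zeros(1)).x
d_quad = minimize(lambda d: v(d) @ Si @ v(d), x0=np.zeros(1)).x
d_closed = -np.linalg.solve(gamma.T @ Sigma_Z @ gamma, gamma.T @ Zc)
print(d_sqrt, d_quad, d_closed)           # all three agree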

F.4 Proof of Corollary 4.3

Proof.

When $\rho_{n}\rightarrow 0$, after the change of variables $\mathcal{Z}\mapsto\Sigma_{Z}^{-1/2}\mathcal{Z}$ and $\gamma\mapsto\Sigma_{Z}^{1/2}\gamma$ (so that we may take $\Sigma_{Z}=I$ in the limiting objective), the limiting objective is $\sqrt{(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)}$, which is minimized at the same $\delta$ that minimizes the standard limit $(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)$.

If $0<\sqrt{\rho}\leq|\gamma|$, then the first order condition (FOC) gives

γT𝒵+δTγTγ(𝒵+γδ)T(𝒵+γδ)+ρβ0(1+β02)\displaystyle\frac{\gamma^{T}\mathcal{Z}+\delta^{T}\gamma^{T}\gamma}{\sqrt{(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)}}+\frac{\sqrt{\rho}\beta_{0}}{\sqrt{(1+\|\beta_{0}\|^{2})}} =0.\displaystyle=0.

If $\beta_{0}=0$ (while $\gamma$ can be a vector, i.e., multiple instruments), the second term vanishes and the FOC reduces to

γT𝒵+δTγTγ\displaystyle\gamma^{T}\mathcal{Z}+\delta^{T}\gamma^{T}\gamma =0,\displaystyle=0,

which is the same FOC as the standard IV limiting objective.

If both γ\gamma and β0\beta_{0} are one-dimensional, but β0\beta_{0} is not necessarily 0, we have that

(𝒵+γδ)T(𝒵+γδ)+ρβ0T(1+β02)δ\displaystyle\sqrt{(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta =|𝒵+γδ|+ρβ0(1+β02)δ\displaystyle=|\mathcal{Z}+\gamma\delta|+\frac{\sqrt{\rho}\beta_{0}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta

Write $c\mathrel{\mathop{:}}=\frac{\sqrt{\rho}\beta_{0}}{\sqrt{1+\|\beta_{0}\|^{2}}}$. The objective is $\mathcal{Z}+\gamma\delta+c\delta$ when $\gamma\delta+\mathcal{Z}\geq 0$ and $-\mathcal{Z}-\gamma\delta+c\delta$ when $\gamma\delta+\mathcal{Z}\leq 0$. Recall that by assumption $\sqrt{\rho}\leq|\gamma|$, so $|c|<\sqrt{\rho}\leq|\gamma|$.

The sign of $\beta_{0}$ only determines the sign of $c$; each case uses nothing beyond the bound $|c|\leq\sqrt{\rho}\leq|\gamma|$. If $\gamma>0$ (whether $\beta_{0}>0$ or $\beta_{0}<0$), the branch $\mathcal{Z}+\gamma\delta+c\delta$ applies for $\delta\geq-\gamma^{-1}\mathcal{Z}$ and has nonnegative slope $\gamma+c\geq\gamma-\sqrt{\rho}\geq 0$, so it is minimized at the left endpoint $\delta=-\gamma^{-1}\mathcal{Z}$, while the branch $-\mathcal{Z}-\gamma\delta+c\delta$ applies for $\delta\leq-\gamma^{-1}\mathcal{Z}$ and has nonpositive slope $-\gamma+c\leq-\gamma+\sqrt{\rho}\leq 0$, so it is also minimized at $-\gamma^{-1}\mathcal{Z}$. If $\gamma<0$, the branches trade places: $\mathcal{Z}+\gamma\delta+c\delta$ applies for $\delta\leq-\gamma^{-1}\mathcal{Z}$ with nonpositive slope $\gamma+c\leq\gamma+\sqrt{\rho}\leq 0$, and $-\mathcal{Z}-\gamma\delta+c\delta$ applies for $\delta\geq-\gamma^{-1}\mathcal{Z}$ with nonnegative slope $-\gamma+c\geq-\gamma-\sqrt{\rho}\geq 0$, so both branches are again minimized at $\delta=-\gamma^{-1}\mathcal{Z}$.

We can therefore conclude that the objective is always minimized at $\delta=-\gamma^{-1}\mathcal{Z}$, so the asymptotic distribution coincides with the limiting distribution of TSLS. ∎
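
The case analysis above reduces to a statement about piecewise linear functions that is easy to verify numerically. The following Python sketch (our own illustration; $\rho=0.2$ and $|\gamma|\geq 0.5$ guarantee $\sqrt{\rho}\leq|\gamma|$) confirms on a grid that the minimizer is always the kink $-\gamma^{-1}\mathcal{Z}$, regardless of the signs of $\beta_{0}$ and $\gamma$:

import numpy as np

# Grid check (illustrative): argmin of |Zc + g*delta| + c*delta is -Zc/g
# whenever |c| <= sqrt(rho) <= |g|.
rng = np.random.default_rng(3)
rho = 0.2
grid = np.linspace(-20.0, 20.0, 400001)
for _ in range(5):
    Zc = rng.normal()
    g = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 2.0)   # gamma, either sign
    b0 = rng.normal()                                      # beta_0, either sign
    c = np.sqrt(rho) * b0 / np.sqrt(1 + b0 ** 2)           # |c| < sqrt(rho)
    f = np.abs(Zc + g * grid) + c * grid
    print(grid[np.argmin(f)], -Zc / g)                     # grid minimizer vs. kink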

F.5 Proof of Theorem D.2

Proof.

We can write

1niψi(θ)\displaystyle\frac{1}{n}\sum_{i}\psi_{i}(\theta) =1ni[ψi(θ)𝔼ψi(θ)]+1ni𝔼ψi(θ).\displaystyle=\frac{1}{n}\sum_{i}[\psi_{i}(\theta)-\mathbb{E}\psi_{i}(\theta)]+\frac{1}{n}\sum_{i}\mathbb{E}\psi_{i}(\theta).

D.1.1 guarantees that

1ni[ψi(θ)𝔼ψi(θ)]\displaystyle\frac{1}{n}\sum_{i}[\psi_{i}(\theta)-\mathbb{E}\psi_{i}(\theta)] =op(1),\displaystyle=o_{p}(1),

using, for example, the results of Andrews (1994).

Next, D.1.2 guarantees that 1ni𝔼ψi(θ)m(θ)\frac{1}{n}\sum_{i}\mathbb{E}\psi_{i}(\theta)\rightarrow m(\theta) uniformly in θ\theta, and D.1.3 further guarantees that

(1niψi(θ))TWn(θ)(1niψi(θ))+ρn(1+θ2)\displaystyle\sqrt{\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)}+\sqrt{\rho_{n}(1+\|\theta\|^{2})} p\displaystyle\rightarrow_{p}
m(θ)TW(θ)m(θ)+ρ(1+θ2)\displaystyle\sqrt{m(\theta)^{T}W(\theta)m(\theta)}+\sqrt{\rho(1+\|\theta\|^{2})}

uniformly in $\theta$. Applying Corollary 3.2.3 of van der Vaart and Wellner (1996), we can conclude $\hat{\theta}^{GMM}\rightarrow_{p}\theta^{GMM}$.

Next, we consider the minimizer of the population objective. Applying D.1.4, when ρρ¯\rho\leq\overline{\rho}, it is lower bounded by

\sqrt{m(\theta)^{T}W(\theta)m(\theta)}+\sqrt{\rho(1+\|\theta\|^{2})}\geq\sqrt{\overline{\rho}}\|\theta-\theta_{0}\|_{2}+\sqrt{\rho}\sqrt{1+\|\theta\|^{2}}
\geq\sqrt{\rho}\left(\|\theta-\theta_{0}\|_{2}+\sqrt{1+\|\theta\|^{2}}\right)
=\sqrt{\rho}\left(\|(\theta,1)-(\theta_{0},1)\|_{2}+\|(\theta,1)\|_{2}\right)
\geq\sqrt{\rho}\|(\theta_{0},1)\|_{2},

where the second inequality uses $\rho\leq\overline{\rho}$ and the last inequality follows from the triangle inequality. As before, equality is achieved if and only if $\theta=\theta_{0}$, which guarantees that $\hat{\theta}^{GMM}\rightarrow_{p}\theta_{0}$. The condition $m(\theta)^{T}W(\theta)m(\theta)\geq\overline{\rho}\|\theta-\theta_{0}\|^{2}$ is satisfied by many GMM estimators, including TSLS, so this proof applies to Theorem 4.1 as well. ∎

F.6 Proof of Theorem C.3

Proof.

First, we use the optimality of $\hat{\beta}$ to bound

Q^(β^)Q^(β0)\displaystyle\sqrt{\hat{Q}(\hat{\beta})}-\sqrt{\hat{Q}(\beta_{0})} λnβ02+1λnβ^2+1\displaystyle\leq\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1}

On the other hand, by convexity of Q^(β)\sqrt{\hat{Q}(\beta)},

Q^(β^)Q^(β0)\displaystyle\sqrt{\hat{Q}(\hat{\beta})}-\sqrt{\hat{Q}(\beta_{0})} S~T(β^β0)S~2β^β02λcnβ^β02\displaystyle\geq\tilde{S}^{T}(\hat{\beta}-\beta_{0})\geq-\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}\geq-\frac{\lambda}{cn}\|\hat{\beta}-\beta_{0}\|_{2}

Now the estimation error in the “prediction norm” (the seminorm defined by the Gram matrix)

β^β02,n2\displaystyle\|\hat{\beta}-\beta_{0}\|_{2,n}^{2} :=1ni(XiT(β^β0))2\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{n}\sum_{i}(X_{i}^{T}(\hat{\beta}-\beta_{0}))^{2}
=(β^β0)T1niXiXiT(β^β0)\displaystyle=(\hat{\beta}-\beta_{0})^{T}\frac{1}{n}\sum_{i}X_{i}X_{i}^{T}(\hat{\beta}-\beta_{0})

is related to the difference Q^(β^)Q^(β0)\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0}) as follows:

Q^(β^)Q^(β0)\displaystyle\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0}) =1ni(YiXiTβ^)21ni(YiXiTβ0)2\displaystyle=\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\hat{\beta})^{2}-\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})^{2}
=1ni(YiXiTβ0+XiTβ0XiTβ^)21ni(YiXiTβ0)2\displaystyle=\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0}+X_{i}^{T}\beta_{0}-X_{i}^{T}\hat{\beta})^{2}-\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})^{2}
=β^β02,n2+21ni(YiXiTβ0)(XiTβ0XiTβ^)\displaystyle=\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}+2\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})(X_{i}^{T}\beta_{0}-X_{i}^{T}\hat{\beta})
=β^β02,n2+21ni(σϵi)XiT(β0β^)\displaystyle=\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}+2\frac{1}{n}\sum_{i}(\sigma\epsilon_{i})X_{i}^{T}(\beta_{0}-\hat{\beta})
=β^β02,n22En(σϵXT(β^β0))\displaystyle=\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}-2E_{n}(\sigma\epsilon X^{T}(\hat{\beta}-\beta_{0}))

On the other hand,

Q^(β^)Q^(β0)\displaystyle\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0}) =[Q^(β^)+Q^(β0)][Q^(β^)Q^(β0)]\displaystyle=\left[\sqrt{\hat{Q}(\hat{\beta})}+\sqrt{\hat{Q}(\beta_{0})}\right]\cdot\left[\sqrt{\hat{Q}(\hat{\beta})}-\sqrt{\hat{Q}(\beta_{0})}\right]

and using Hölder’s inequality,

2En(σϵXT(β^β0))\displaystyle 2E_{n}(\sigma\epsilon X^{T}(\hat{\beta}-\beta_{0})) =21ni(σϵi)XiT(β^β0)\displaystyle=2\frac{1}{n}\sum_{i}(\sigma\epsilon_{i})X_{i}^{T}(\hat{\beta}-\beta_{0})
\displaystyle=2\sqrt{\frac{1}{n}\sum_{i}(\sigma\epsilon_{i})^{2}}\cdot\frac{\frac{1}{n}\sum_{i}(\sigma\epsilon_{i}X_{i}^{T})}{\sqrt{\frac{1}{n}\sum_{i}(\sigma^{2}\epsilon_{i}^{2})}}(\hat{\beta}-\beta_{0})
=2Q^(β0)S~T(β^β0)\displaystyle=2\sqrt{\hat{Q}(\beta_{0})}\cdot\tilde{S}^{T}(\hat{\beta}-\beta_{0})
2Q^(β0)S~2β^β02\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}

Combining these, we can bound the estimation error β^β02,n2\|\hat{\beta}-\beta_{0}\|_{2,n}^{2} as

β^β02,n2\displaystyle\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}
=2En(σϵXT(β^β0))+Q^(β^)Q^(β0)\displaystyle=2E_{n}(\sigma\epsilon X^{T}(\hat{\beta}-\beta_{0}))+\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0})
2Q^(β0)S~2β^β02+(λnβ02+1λnβ^2+1)(Q^(β^)+Q^(β0))\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})\cdot(\sqrt{\hat{Q}(\hat{\beta})}+\sqrt{\hat{Q}(\beta_{0})})
2Q^(β0)S~2β^β02+(λnβ02+1λnβ^2+1)(2Q^(β0)+λnβ02+1λnβ^2+1)\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})\cdot(2\sqrt{\hat{Q}(\beta_{0})}+\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})
=2Q^(β0)S~2β^β02+(λnβ02+1λnβ^2+1)2+2Q^(β0)(λnβ02+1λnβ^2+1)\displaystyle=2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})^{2}+2\sqrt{\hat{Q}(\beta_{0})}(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})
2Q^(β0)S~2β^β02+(λn)2β^β022+2Q^(β0)λnβ^β02\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n})^{2}\|\hat{\beta}-\beta_{0}\|_{2}^{2}+2\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}\|\hat{\beta}-\beta_{0}\|_{2}
2Q^(β0)λn(1c+1)β^β02+(λn)2β^β022\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n})^{2}\|\hat{\beta}-\beta_{0}\|_{2}^{2}

Now the norms $\|\hat{\beta}-\beta_{0}\|_{2,n}$ and $\|\hat{\beta}-\beta_{0}\|_{2}$ differ by the Gram matrix $\frac{1}{n}\sum_{i}X_{i}X_{i}^{T}$, which by the normalization $\frac{1}{n}\sum_{i}X_{ij}^{2}=1$ has diagonal entries equal to 1. Recall that $\kappa$ is the largest constant such that

κβ^β02\displaystyle\kappa\|\hat{\beta}-\beta_{0}\|_{2} β^β02,n\displaystyle\leq\|\hat{\beta}-\beta_{0}\|_{2,n}

for any β^β0\hat{\beta}-\beta_{0}, so we get

β^β0221κ2β^β02,n2\displaystyle\|\hat{\beta}-\beta_{0}\|_{2}^{2}\leq\frac{1}{\kappa^{2}}\|\hat{\beta}-\beta_{0}\|_{2,n}^{2} 21κ2Q^(β0)λn(1c+1)β^β02+1κ2(λn)2β^β022\displaystyle\leq 2\frac{1}{\kappa^{2}}\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)\|\hat{\beta}-\beta_{0}\|_{2}+\frac{1}{\kappa^{2}}(\frac{\lambda}{n})^{2}\|\hat{\beta}-\beta_{0}\|_{2}^{2}

which yields

β^β02\displaystyle\|\hat{\beta}-\beta_{0}\|_{2} 111κ2(λn)221κ2Q^(β0)λn(1c+1)\displaystyle\leq\frac{1}{1-\frac{1}{\kappa^{2}}(\frac{\lambda}{n})^{2}}2\frac{1}{\kappa^{2}}\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)
=2Q^(β0)λn(1c+1)κ2(λn)2\displaystyle=\frac{2\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)}{\kappa^{2}-(\frac{\lambda}{n})^{2}}

provided that

\left(\frac{\lambda}{n}\right)^{2}<\kappa^{2}.

Since $\lambda/n\rightarrow 0$ and $\kappa$ is a constant linking the two norms, this condition is satisfied for all $n$ large enough when Assumption 2 holds, so the rate of convergence of $\|\hat{\beta}-\beta_{0}\|_{2}$ is governed by that of $\lambda/n\rightarrow 0$:

β^β022λn(1c+1)κ2(λn)2Q^(β0)σplog(2p/α)/n.\displaystyle\|\hat{\beta}-\beta_{0}\|_{2}\leq\frac{2\frac{\lambda}{n}(\frac{1}{c}+1)}{\kappa^{2}-(\frac{\lambda}{n})^{2}}\cdot\sqrt{\hat{Q}(\beta_{0})}\lesssim\sigma\sqrt{p\log(2p/\alpha)/n}.
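
The rate in this bound can be illustrated with a small Monte Carlo sketch (our own construction under simplifying assumptions: a Gaussian design with normalized columns, $\sigma=1$, and $\lambda/n$ set to $\sqrt{p\log(2p/\alpha)/n}$); the estimation error of the square root ridge estimator tracks the theoretical rate as $n$ grows:

import numpy as np
from scipy.optimize import minimize

# Monte Carlo sketch (illustrative, not the paper's experiment): the square
# root ridge error ||beta_hat - beta_0||_2 tracks sigma*sqrt(p log(2p/alpha)/n).
rng = np.random.default_rng(4)
p, sigma, alpha = 5, 1.0, 0.05
beta0 = np.ones(p)

def sqrt_ridge_error(n):
    X = rng.normal(size=(n, p))
    X /= np.sqrt((X ** 2).mean(axis=0))   # enforce (1/n) sum_i X_ij^2 = 1
    Y = X @ beta0 + sigma * rng.normal(size=n)
    lam_over_n = np.sqrt(p * np.log(2 * p / alpha) / n)
    obj = lambda b: (np.sqrt(np.sum((Y - X @ b) ** 2) / n)
                     + lam_over_n * np.sqrt(b @ b + 1.0))
    return np.linalg.norm(minimize(obj, x0=np.zeros(p)).x - beta0)

for n in [200, 2000, 20000]:
    print(n, sqrt_ridge_error(n), sigma * np.sqrt(p * np.log(2 * p / alpha) / n))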