
Distributionally Robust Instrumental Variables Estimation

Zhaonan Qu
Data Science Institute, Columbia University
Yongchan Kwon
Department of Statistics, Columbia University
Corresponding author: [email protected], [email protected]
Abstract

Instrumental variables (IV) estimation is a fundamental method in econometrics and statistics for estimating causal effects in the presence of unobserved confounding. However, challenges such as untestable model assumptions and poor finite sample properties have undermined its reliability in practice. Viewing common issues in IV estimation as distributional uncertainties, we propose DRIVE, a distributionally robust IV estimation method. We show that DRIVE minimizes a square root variant of the ridge regularized two stage least squares (TSLS) objective when the ambiguity set is based on a Wasserstein distance. In addition, we develop a novel asymptotic theory for this estimator, showing that it achieves consistency without requiring the regularization parameter to vanish. This novel property ensures that the estimator is robust to distributional uncertainties that persist in large samples. We further derive the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures to select the regularization parameter based on our theoretical results. Simulation studies demonstrate the superior finite sample performance of Wasserstein DRIVE in terms of estimation error and out-of-sample prediction. Thanks to its regularization and robustness properties, Wasserstein DRIVE presents an appealing option when the practitioner is uncertain about model assumptions or distributional shifts in the data.

Keywords: Causal Inference; Distributionally Robust Optimization; Square Root Ridge; Invalid Instruments; Distribution Shift

1 Introduction

Instrumental variables (IV) estimation, also known as IV regression, is a fundamental method in econometrics and statistics for inferring causal relationships in observational data with unobserved confounding. It leverages access to additional variables (instruments) that affect the outcome exogenously and exclusively through the endogenous regressor to yield consistent causal estimates, even when the standard ordinary least squares (OLS) estimator is biased by unobserved confounding (Imbens and Angrist, 1994; Angrist et al., 1996; Imbens and Rubin, 2015). Over the years, IV estimation has become an indispensable tool for causal inference in empirical work in economics (Card and Krueger, 1994), as well as in the study of genetic and epidemiological data (Davey Smith and Ebrahim, 2003).

Despite the widespread use of IV in empirical and applied work, it has important limitations and challenges, such as invalid instruments (Sargan, 1958; Murray, 2006), weak instruments (Staiger and Stock, 1997), non-compliance (Imbens and Angrist, 1994), and heteroskedasticity, especially in settings with weak instruments or highly leveraged datasets (Andrews et al., 2019; Young, 2022). These issues can significantly impact the validity and quality of estimation and inference using instrumental variables (Jiang, 2017). Many works have been devoted to assessing and addressing these issues, through statistical tests (Hansen, 1982; Stock and Yogo, 2002), sensitivity analysis (Rosenbaum and Rubin, 1983; Bonhomme and Weidner, 2022), and additional assumptions or structures on the data generating process (Kolesár et al., 2015; Kang et al., 2016; Guo et al., 2018b).

Recently, an emerging line of work has highlighted interesting connections between causality and the concepts of invariance and robustness (Peters et al., 2016; Meinshausen, 2018; Rothenhäusler et al., 2021; Bühlmann, 2020; Jakobsen and Peters, 2022; Fan et al., 2024). The guiding philosophy is that causal properties can be viewed as robustness against changes across heterogeneous environments, represented by a set $\mathcal{P}$ of data distributions. The robustness of an estimator against $\mathcal{P}$ is often represented in a distributionally robust optimization (DRO) framework via the min-max problem

$$\min_{\beta}\ \sup_{\mathbb{P}\in\mathcal{P}}\ \mathbb{E}_{\mathbb{P}}[\ell(W;\beta)], \qquad (1)$$

where $\ell(W;\beta)$ is a loss function of the data $W$ and the parameter of interest $\beta$.

In many estimation and regression settings, one assumes that the true data distribution satisfies certain conditions, e.g., conditional independence or moment equations. Such conditions guarantee that standard statistical procedures based on the empirical distribution $\mathbb{P}_{0}$ of the data, such as M-estimation and the generalized method of moments (GMM), enable valid estimation and inference. In practice, however, it is often reasonable to expect that the distribution $\mathbb{P}_{0}$ of the observed data might deviate from that generated by the ideal model satisfying such conditions, e.g., due to measurement errors or model mis-specification. DRO addresses such distributional uncertainties by explicitly incorporating possible deviations into an ambiguity set $\mathcal{P}=\mathcal{P}(\mathbb{P}_{0},\rho)$ of distributions that are “close” to $\mathbb{P}_{0}$. The parameter $\rho$ quantifies the degree of uncertainty, e.g., as the radius of a ball centered around $\mathbb{P}_{0}$ and defined by some divergence measure between probability distributions. By minimizing the worst-case loss over $\mathcal{P}(\mathbb{P}_{0},\rho)$ in the min-max optimization problem (1), the DRO approach achieves robustness against the deviations captured by $\mathcal{P}(\mathbb{P}_{0},\rho)$.

DRO provides a useful perspective for understanding the robustness properties of statistical methods. For example, it is well known that members of the family of $k$-class estimators (Anderson and Rubin, 1949; Nagar, 1959; Theil, 1961) are more robust than the standard IV estimator against weak instruments (Andrews, 2007). Recent works by Rothenhäusler et al. (2021) and Jakobsen and Peters (2022) show that $k$-class estimators in fact have a DRO representation of the form (1), where $\ell$ is the square loss, $W=(X,Y)$, and $X,Y$ are endogenous and outcome variables generated from structural equation models parameterized by the natural parameter of $k$-class estimators. See Appendix A.2 for details. The general robust optimization problem (1) traces its roots to the classical robust statistics literature (Huber, 1964; Huber and Ronchetti, 2011) as well as classic works on robustness in economics (Hansen and Sargent, 2008). Drawing inspiration from them, recent works in econometrics have also explored the use of robust optimization to account for (local) deviations from model assumptions (Kitamura et al., 2013; Armstrong and Kolesár, 2021; Chen et al., 2021; Bonhomme and Weidner, 2022; Adjaho and Christensen, 2022; Fan et al., 2023). These works, together with works on invariance and robustness, highlight the emerging interactions between econometrics, statistics, and robust optimization.

Despite new developments connecting causality and robustness, many questions and opportunities remain. An important challenge in DRO is the choice of the ambiguity set $\mathcal{P}(\mathbb{P}_{0},\rho)$ to adequately capture distributional uncertainties. This choice is highly dependent on the structure of the particular problem of interest. While some existing DRO approaches use ambiguity sets $\mathcal{P}(\mathbb{P}_{0},\rho)$ based on marginal or joint distributions of the data, such sets may not effectively capture the structure of IV estimation models. In addition, as the min-max problem (1) minimizes the loss function under the worst-case distribution in $\mathcal{P}(\mathbb{P}_{0},\rho)$, a common concern is that the resulting estimator is too conservative when $\mathcal{P}(\mathbb{P}_{0},\rho)$ is too large. In particular, although DRO estimators enjoy better empirical performance in finite samples, their asymptotic validity typically requires the ambiguity set to vanish to a singleton, i.e., $\rho\rightarrow 0$ (Blanchet et al., 2019, 2022). However, in the context of IV estimation, distributional uncertainties about untestable model assumptions could persist in large samples, necessitating an ambiguity set that does not vanish to a singleton. It is therefore important to ask whether and how one can construct an estimator in the IV estimation setting that sufficiently captures the distributional uncertainties about model assumptions, and at the same time remains asymptotically valid with a non-vanishing robustness parameter.

In this paper, we propose to view common challenges to IV estimation through the lens of DRO, whereby uncertainties about model assumptions, such as the exclusion restriction and homoskedasticity, are captured by a suitably chosen ambiguity set in (1). Based on this perspective, we propose DRIVE, a general DRO approach to IV estimation. Instead of constructing the ambiguity set based on marginal or joint distributions as in existing works, we construct $\mathcal{P}(\mathbb{P}_{0},\rho)$ from distributions conditional on the instrumental variables. More precisely, we construct $\mathbb{P}_{0}$ as the empirical distribution of the outcome and endogenous variables $Y,X$ projected onto the space spanned by the instrumental variables. When the ambiguity set of DRIVE is based on the 2-Wasserstein metric, we show that the resulting estimator minimizes a square root version of the ridge regularized two stage least squares (TSLS) objective, where the radius $\rho$ of the ambiguity set becomes the regularization parameter. This regularized regression formulation relies on the general duality of Wasserstein DRO problems (Gao and Kleywegt, 2023; Blanchet et al., 2019; Kuhn et al., 2019).

We next reveal a surprising statistical property of the square root ridge by showing that Wasserstein DRIVE is consistent as long as the regularization parameter $\rho$ is bounded above by an estimable constant, which depends on the first stage coefficient of the IV model and can be interpreted as a measure of instrument quality. To our knowledge, this is the first consistency result for regularized regression estimators where the regularization parameter does not vanish as the sample size $n\rightarrow\infty$. One implication of our results is that Wasserstein DRIVE, being a regularized regression estimator, enjoys better finite sample properties, but does not introduce bias asymptotically even for non-vanishing $\rho$, unlike standard regularized regression estimators such as ridge and LASSO.

We further characterize the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures to select the regularization parameter. We demonstrate with numerical experiments that Wasserstein DRIVE improves over the finite sample performance of IV and $k$-class estimators, thanks to its ridge type regularization, while at the same time retaining asymptotic validity whenever instruments are valid. In particular, Wasserstein DRIVE achieves significant improvements in mean squared error (MSE) over IV and OLS when instruments are moderately invalid. These findings suggest that Wasserstein DRIVE can be an attractive option in practice when we are concerned about model assumptions.

The rest of the paper is organized as follows. In Section 2, we discuss the standard IV estimation framework and common challenges. In Section 3, we propose the Wasserstein DRIVE framework and provide the duality theory. In Section 4, we develop asymptotic results for Wasserstein DRIVE, including consistency under a non-vanishing robustness/regularization parameter. Section 5 conducts numerical studies that compare Wasserstein DRIVE with other estimators, including IV, OLS, and $k$-class estimators. Background materials, proofs, and additional results are included in the appendices in the supplementary material.

Notation. Throughout the paper, $\|v\|_{p}$ denotes the $p$-norm of a vector $v$, while $\|v\| := \|v\|_{2}$ denotes the Euclidean norm. $\text{Tr}(M)$ denotes the trace of a matrix $M$, and $\lambda_{k}(M)$ the $k$-th largest eigenvalue of a symmetric matrix $M$. Boldfaced variables, such as $\mathbf{X}$, represent a matrix whose $i$-th row is the variable $X_{i}$.

2 Background and Challenges in IV Estimation

In this section, we first provide a brief review of the standard IV estimation framework. We then motivate the DRO approach to IV estimation by viewing common challenges from the perspective of distributional uncertainties. In Section 3, we propose the Wasserstein distributionally robust instrumental variables estimation (DRIVE) framework.

2.1 Instrumental Variables Estimation

Consider the following standard linear instrumental variables regression model with $X\in\mathbb{R}^{p}$, $Z\in\mathbb{R}^{d}$ where $d\geq p$, and $\beta_{0}\in\mathbb{R}^{p}$, $\gamma\in\mathbb{R}^{d\times p}$:

$$Y=\beta_{0}^{T}X+\epsilon, \qquad X=\gamma^{T}Z+\xi. \qquad (2)$$

In (2), $X$ are the endogenous variables, $Z$ are the instrumental variables, and $Y$ is the outcome variable. The error terms $\epsilon$ and $\xi$ capture the unobserved (or residual) components of $Y$ and $X$, respectively. We are interested in estimating the causal effects $\beta_{0}$ of the endogenous variables $X$ on the outcome variable $Y$ given independent and identically distributed (i.i.d.) samples $\{X_{i},Y_{i},Z_{i}\}_{i=1}^{n}$. However, $X$ and $Y$ are confounded through some unobserved confounders $U$ that are correlated with both $Y$ and $X$, represented graphically in the directed acyclic graph (DAG) below:

[DAG: $Z \xrightarrow{\gamma} X \xrightarrow{\beta_{0}} Y$, with the unobserved confounder $U$ pointing into both $X$ and $Y$.]

Mathematically, the unobserved confounding can be described by the moment condition

$$\mathbb{E}[X\epsilon]\neq\mathbf{0}.$$

As a result of the unobserved confounding, the standard ordinary least squares (OLS) estimator of $\beta_{0}$ that regresses $Y$ on $X$ is biased. To address this problem, the IV estimation approach leverages access to the instrumental variables $Z$, also often called instruments, which satisfy the moment conditions

$$\text{rank}(\mathbb{E}[ZX^{T}])=p, \qquad (3)$$
$$\mathbb{E}[Z\epsilon]=\mathbf{0}, \quad \mathbb{E}[Z\xi^{T}]=\mathbf{0}. \qquad (4)$$

Under these conditions, a popular IV estimator is the two stage least squares (TSLS, sometimes also stylized as 2SLS) estimator (Theil, 1953). With $\Pi_{Z}:=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}$ and $\mathbf{X},\mathbf{Y},\mathbf{Z}$ matrix representations of $\{X_{i},Y_{i},Z_{i}\}_{i=1}^{n}$ whose $i$-th rows correspond to $X_{i},Y_{i},Z_{i}$, respectively, the TSLS estimator $\hat{\beta}^{\text{IV}}:=(\mathbf{X}^{T}\Pi_{Z}\mathbf{X})^{-1}\mathbf{X}^{T}\Pi_{Z}\mathbf{Y}$ minimizes the objective

$$\min_{\beta}\ \frac{1}{n}\|\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}. \qquad (5)$$

In contrast, the standard OLS estimator $\hat{\beta}^{\text{OLS}}$ solves the problem

$$\min_{\beta}\ \frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}. \qquad (6)$$

When the moment conditions (3) and (4) hold, the TSLS estimator is a consistent estimator of the causal effect $\beta_{0}$ under standard assumptions (Wooldridge, 2020), and valid inference can be performed by constructing variance estimators based on the asymptotic distribution of $\hat{\beta}^{\text{IV}}$ (Imbens and Rubin, 2015). Although not the most common presentation of TSLS, the optimization formulation in (5) provides intuition on how IV estimation works: when the instruments $Z$ are uncorrelated with the unobserved confounders $U$ affecting $X$ and $Y$, the projection operator $\Pi_{Z}$ applied to $\mathbf{X}$ “removes” the confounding from $X$, so that $\Pi_{Z}\mathbf{X}$ becomes (asymptotically) uncorrelated with $\epsilon$. Regressing $\mathbf{Y}$ on $\Pi_{Z}\mathbf{X}$ then yields a consistent estimator of $\beta_{0}$.
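To make the projection formulation concrete, the following minimal Python sketch (our own illustration; function and variable names are not from the paper) computes the TSLS estimator by solving (5) directly:

```python
import numpy as np

def tsls(X, Y, Z):
    """TSLS via the projection formulation (5): regress Y on Pi_Z X."""
    # First stage: fitted values Pi_Z X from regressing X on Z
    gamma_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)   # (d, p)
    X_hat = Z @ gamma_hat                               # Pi_Z X, shape (n, p)
    # Second stage: (X_hat' X_hat)^{-1} X_hat' Y = (X' Pi_Z X)^{-1} X' Pi_Z Y
    beta_hat, *_ = np.linalg.lstsq(X_hat, Y, rcond=None)
    return beta_hat
```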

The validity of estimation and inference based on $\hat{\beta}^{\text{IV}}$ relies critically on the moment conditions (3) and (4). Condition (3) is often called the relevance condition or rank condition, and requires $\mathbb{E}[ZX^{T}]$ to have full rank (recall $p\leq d$). In the special case of one-dimensional instrumental and endogenous variables, i.e., $d=p=1$, it simply reduces to $\mathbb{E}[ZX]\neq 0$. Intuitively, the relevance condition ensures that the instruments $Z$ can explain sufficient variation in the endogenous variables $X$; in this case, the instruments are said to be relevant and strong. When $\mathbb{E}[ZX^{T}]$ is close to being rank deficient, i.e., the smallest eigenvalue $\lambda_{p}(\mathbb{E}[ZX^{T}])\approx 0$, IV estimation suffers from the so-called weak instrument problem, which results in many issues in estimation and inference, such as small sample bias and non-normal statistics (Stock et al., 2002). Some $k$-class estimators, such as limited information maximum likelihood (LIML) (Anderson and Rubin, 1949), are partially motivated by these problems. Condition (4) is often referred to as the exclusion restriction or instrument exogeneity (Imbens and Rubin, 2015), and instruments that satisfy this condition are called valid instruments. When an instrument $Z$ is correlated with the unobserved confounder of $X,Y$, or when $Z$ affects the outcome $Y$ through an unobserved variable other than the endogenous variable $X$, the instrument becomes invalid, resulting in biased estimation and invalid inference for $\beta_{0}$ (Murray, 2006). These issues can be exacerbated when the instruments are weak, when there is heteroskedasticity (Andrews et al., 2019), or when the data is highly leveraged (Young, 2022).

Although many works have been devoted to addressing the problems of weak and invalid instruments, there are fundamental limits on the extent to which one can test for these issues. Given the popularity of IV estimation in practice, it is therefore desirable to have estimation and inference procedures that are robust to the presence of such issues. Our work is precisely motivated by these considerations. Compared to many existing robust approaches to IV estimation, we take a more agnostic approach via distributionally robust optimization. More precisely, we argue that many common challenges in IV estimation can be viewed as uncertainties about the data distribution, i.e., deviations from the ideal model that satisfies IV assumptions, which can be explicitly taken into account by choosing an appropriate ambiguity set in a DRO formulation of the standard IV estimation. To demonstrate this perspective more concretely, we now examine some common problems in IV estimation and show that they can be viewed as distributional shifts under a suitable metric, and therefore amenable to a DRO approach.

2.2 Challenges in IV Estimation as Distributional Uncertainties

Consider now the following one-dimensional version of the IV model in (2)

$$Y=X\beta_{0}+\epsilon, \qquad X=Z\gamma+\xi,$$

where we assume that $X,Y$ are confounded by an unobserved confounder $U$ through

$$\epsilon=Z\eta+U, \qquad \xi=U.$$

Note that in addition to $U$, there is also potentially a direct effect $\eta$ from the instrument $Z$ to the outcome variable $Y$. We focus on the resulting model for our subsequent discussions:

$$Y=X\beta_{0}+Z\eta+U, \qquad X=Z\gamma+U. \qquad (7)$$

The standard IV assumptions can be succinctly summarized for (7). The relevance condition (3) requires that $\gamma\neq 0$, while the exclusion restriction (4) requires that $Z$ is uncorrelated with $U$ and, in addition, that $\eta=0$. Assume that $U,Z$ are independent standard normal variables; $X,Y$ are then determined by (7). We are interested in the shifts in the data distribution, appropriately defined and measured, when the exogeneity and relevance conditions are violated.

Example 1 (Invalid Instruments).

As $U,Z$ are independent, $Z$ becomes invalid if and only if $\eta\neq 0$, and $|\eta|$ quantifies the degree of instrument invalidity. Let $\mathbb{P}_{\eta}$ denote the joint distribution on $(X,Y,Z)$ in the model (7) indexed by $\eta\in\mathbb{R}$. Let $\tilde{\mathbb{P}}_{\eta,Z}$ be the resulting normal distribution on the conditional random variables $(\tilde{X},\tilde{Y})=(X\mid Z,\,Y\mid Z)$, given $Z$. We are interested in the (expected) distributional shift between $\tilde{\mathbb{P}}_{\eta,Z}$ and $\tilde{\mathbb{P}}_{0,Z}$. We choose the 2-Wasserstein distance $W_{2}(\cdot,\cdot)$ (Kantorovich, 1942, 1960), also known as the Kantorovich metric, to measure this shift. Conveniently, the 2-Wasserstein distance between two normal distributions $\mathbb{Q}_{1}=\mathcal{N}(\mu_{1},\Sigma_{1})$ and $\mathbb{Q}_{2}=\mathcal{N}(\mu_{2},\Sigma_{2})$ has an explicit formula due to Olkin and Pukelsheim (1982):

$$W_{2}(\mathbb{Q}_{1},\mathbb{Q}_{2})^{2}=\|\mu_{1}-\mu_{2}\|^{2}+\text{Tr}\big(\Sigma_{1}+\Sigma_{2}-2(\Sigma_{2}^{1/2}\Sigma_{1}\Sigma_{2}^{1/2})^{1/2}\big). \qquad (8)$$

Applying (8) to the conditional distributions $\tilde{\mathbb{P}}_{\eta,Z}$ and $\tilde{\mathbb{P}}_{0,Z}$, and taking the expectation with respect to $Z$, we obtain the simple formula

$$\mathbb{E}\,W_{2}(\tilde{\mathbb{P}}_{\eta,Z},\tilde{\mathbb{P}}_{0,Z})=\sqrt{\frac{2}{\pi}}\cdot|\eta|. \qquad (9)$$

This calculation shows that the degree of instrument invalidity, as measured by the strength of the direct effect of the instrument on the outcome, is proportional to the expected distributional shift of the distribution on $(\tilde{X},\tilde{Y})$ from that under the valid IV assumption. Moreover, the simple form of the expected distributional shift relies on our choice of the Wasserstein distance to measure the distributional shift of the conditional random variables $(\tilde{X},\tilde{Y})$. If we instead measure shifts in the joint distribution $\mathbb{P}_{\eta}$ on $(X,Y,Z)$, the resulting distributional shift will depend on other model parameters in addition to $\eta$. This example therefore suggests that the Wasserstein metric applied to the conditional distributional shift of $(\tilde{X},\tilde{Y})$ could be an appropriate measure of distributional uncertainty in IV regression models.
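As a quick sanity check of (9): conditional on $Z$, turning on $\eta$ shifts the mean of $\tilde{Y}$ by $\eta Z$ while leaving the conditional covariance unchanged, so (8) gives $W_{2}(\tilde{\mathbb{P}}_{\eta,Z},\tilde{\mathbb{P}}_{0,Z})=|\eta Z|$, and $\mathbb{E}|\eta Z|=\sqrt{2/\pi}\,|\eta|$. A short Monte Carlo sketch (our illustration, with an arbitrary choice of $\eta$) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.7
Z = rng.standard_normal(1_000_000)

# Conditional on Z, only the mean of Y | Z shifts, by eta * Z, so by (8)
# the conditional 2-Wasserstein distance is exactly |eta * Z|.
w2 = np.abs(eta * Z)
print(w2.mean())                       # Monte Carlo estimate of E[W2]
print(np.sqrt(2 / np.pi) * abs(eta))   # closed form (9), ~0.5585
```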

Example 2 (Weak Instruments).

Now consider another common problem with IV estimation, which arises when the first stage coefficient $\gamma$ is close to $0$. Let $\tilde{\mathbb{Q}}_{\gamma,Z}$ be the distribution on $(\tilde{X},\tilde{Y})$ indexed by $\gamma\in\mathbb{R}$, with $\eta=0$ in (7). In this case, we can verify that

$$\mathbb{E}\,W_{2}(\tilde{\mathbb{Q}}_{\gamma_{1},Z},\tilde{\mathbb{Q}}_{\gamma_{2},Z})=\sqrt{\frac{2}{\pi}}\sqrt{1+\beta_{0}^{2}}\,|\gamma_{1}-\gamma_{2}|.$$

The expected distributional shift between a setting with a “strong” instrument, $\gamma=\gamma_{0}$, and one with a “weak” instrument, $\gamma=\delta\cdot\gamma_{0}$ with $\delta\rightarrow 0$ in the limit, measured by the 2-Wasserstein metric, is therefore equal to

$$\sqrt{\frac{2}{\pi}}\sqrt{1+\beta_{0}^{2}}\cdot|\gamma_{0}|. \qquad (10)$$

Similar to the previous example, the degree of violation of the strong instrument assumption, as measured by the presumed instrument strength $|\gamma_{0}|$, is proportional to the expected distributional shift on $(\tilde{X},\tilde{Y})$. Note, however, that the distance is also proportional to the magnitude of the causal parameter $\beta_{0}$. This is reasonable because instrument strength is relative, and should be measured against the scale of the true causal parameter.

Next, we consider the distributional shift resulting from heteroskedastic errors, which are known to render the TSLS estimator inefficient and the standard variance estimator invalid (Baum et al., 2003). Some $k$-class estimators, such as the LIML and Fuller estimators, also become inconsistent under heteroskedasticity (Hausman et al., 2012).

Example 3 (Heteroskedasticity).

In this example, we assume $\eta=0$ in (7) and that the conditional distribution of $U$ given $Z$ is centered normal with standard deviation $\alpha\cdot|Z|+1$, where $\alpha\geq 0$. We are interested in the average distributional shift of the heteroskedastic setting ($\alpha>0$) from the homoskedastic setting ($\alpha=0$). We can verify that the expected 2-Wasserstein distance between the conditional distributions on $(\tilde{X},\tilde{Y})$ is

$$\sqrt{\frac{2}{\pi}}\sqrt{1+(\beta_{0}^{2}+1)^{2}}\cdot\alpha, \qquad (11)$$

which is proportional to the degree of heteroskedasticity $\alpha$.

The preceding discussions demonstrate that distributional uncertainties resulting from violations of common model assumptions in IV estimation are well captured by the 2-Wasserstein distance on the distributions of the conditional variables $(\tilde{X},\tilde{Y})$. We therefore propose to construct the ambiguity set in (1) as a Wasserstein ball around the empirical distribution on $(\tilde{X},\tilde{Y})$. We provide details of this framework in the next section.

3 Wasserstein Distributionally Robust IV Estimation

In this section, we propose the distributionally robust IV estimation framework, using Wasserstein ambiguity sets to account for distributional uncertainties in IV estimation. We develop the dual formulation of Wasserstein DRIVE as a regularized regression, and discuss its connections to, and distinctions from, other regularized regression estimators.

3.1 DRIVE

Motivated by the intuition that common challenges to IV estimation in practice, such as violations of model assumptions, can be viewed as distributional uncertainties on the conditional distributions of $(\tilde{X},\tilde{Y})=(X\mid Z,\,Y\mid Z)$, we propose the Distributionally Robust IV Estimation (DRIVE) framework, which solves the following DRO problem given a dataset $\{(X_{i},Y_{i},Z_{i})\}_{i=1}^{n}$ and robustness parameter $\rho$:

$$\textbf{(DRIVE Objective)}\quad \min_{\beta}\ \sup_{\{\mathbb{Q}:\,D(\mathbb{Q},\tilde{\mathbb{P}}_{n})\leq\rho\}}\ \mathbb{E}_{\mathbb{Q}}\big[(Y-X^{T}\beta)^{2}\big], \qquad (12)$$

where $\tilde{\mathbb{P}}_{n}\in\mathcal{P}(\mathcal{X}\times\mathcal{Y})$ is the empirical distribution on $(X,Y)$ induced by the projected samples

$$\{\tilde{X}_{i},\tilde{Y}_{i}\}_{i=1}^{n}\equiv\{(\Pi_{\mathbf{Z}}\mathbf{X})_{i},(\Pi_{\mathbf{Z}}\mathbf{Y})_{i}\}_{i=1}^{n}.$$

Here $\mathbf{X}\in\mathbb{R}^{n\times p}$, $\mathbf{Y}\in\mathbb{R}^{n}$, $\mathbf{Z}\in\mathbb{R}^{n\times d}$ are the matrix representations of the observations, $\Pi_{\mathbf{Z}}=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}$ is the projection matrix onto the column space of $\mathbf{Z}$, and $D(\cdot,\cdot)$ is a metric or divergence measure on the space of probability distributions on $\mathcal{X}\times\mathcal{Y}$. Therefore, in our DRIVE framework, we first regress both the outcome $Y$ and the covariate $X$ on the instrument $Z$ to form the $n$ predicted samples $(\Pi_{\mathbf{Z}}\mathbf{X},\Pi_{\mathbf{Z}}\mathbf{Y})$. An ambiguity set is then constructed using $D$ around the empirical distribution $\tilde{\mathbb{P}}_{n}$. This choice of the reference distribution $\mathbb{P}_{0}$ is a key distinction of our work from previous works that leverage DRO in statistical models. In the standard regression/classification setting, the reference distribution is often chosen as the empirical distribution $\hat{\mathbb{P}}_{n}$ on $\{X_{i},Y_{i}\}_{i=1}^{n}$ (Blanchet et al., 2019). In the IV estimation setting, where we have additional access to instruments $Z$, we could construct ambiguity sets around the empirical distribution on the marginal quantities $\{(X_{i},Y_{i},Z_{i})\}_{i=1}^{n}$, which is the approach taken in Bertsimas et al. (2022). In contrast, we choose to use the empirical distribution on the conditional quantities $\{(\tilde{X}_{i},\tilde{Y}_{i})\}_{i=1}^{n}$. This choice is motivated by the intuition that violations of IV assumptions can be captured by conditional distributional shifts, as illustrated by the examples in the previous section.

The choice of the divergence measure $D(\cdot,\cdot)$ is also important, as it characterizes the potential distributional uncertainties that DRIVE is robust to. In this paper, we propose to use the 2-Wasserstein distance $W_{2}(\mu,\nu)$ between two probability distributions $\mu,\nu$ (Mohajerin Esfahani and Kuhn, 2018; Gao and Kleywegt, 2023). One advantage of the Wasserstein distance is the tractability of its associated DRO problems (Blanchet et al., 2019), which can often be formulated as regularized regression problems with unique solutions; see also Appendix A.1. In Section 2.2, we provided several examples demonstrating that the 2-Wasserstein distance is able to capture common distributional uncertainties in the IV estimation setting. Alternative distance measures on probability distributions, such as the class of $\phi$-divergences (Ben-Tal et al., 2013), can also be used instead of the Wasserstein distance. For example, Kitamura et al. (2013) use the Hellinger distance to model local perturbations in robust estimation under moment restrictions, although not in the IV estimation setting. In this paper, we focus on the Wasserstein DRIVE framework based on $D=W_{2}$, and leave studies of DRIVE with other choices of $D$ to future work.

We next begin our formal study of Wasserstein DRIVE. In Section 3.2, we show that the Wasserstein DRIVE objective is dual to a convex regularized regression problem. As a result, the solution to the optimization problem (12) is well-defined, and we denote this estimator by $\hat{\beta}_{\mathrm{DRIVE}}$. In Section 4, we show that $\hat{\beta}_{\mathrm{DRIVE}}$ is consistent with potentially non-vanishing choices of the robustness parameter and derive its asymptotic distribution.

3.2 Dual Representation of Wasserstein DRIVE

It is well known in the optimization literature that min-max optimization problems such as (12) often have equivalent formulations as regularized regression problems. This correspondence between regularization and robustness already manifests itself in ridge regression, which is equivalent to an $\ell_{2}$-robust OLS regression (Bertsimas and Copenhaver, 2018). Importantly, the regularized regression formulations are often more tractable for solving the resulting optimization problem, and also facilitate the study of the statistical properties of the estimators. We first show that the Wasserstein DRIVE objective can also be written as a regularized regression problem similar to, but distinct from, the standard TSLS objective with ridge regularization. Proofs can be found in Appendix F.

Theorem 3.1.

The optimization problem in (12) is equivalent to the following convex regularized regression problem:

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}}+\sqrt{\rho(\|\beta\|^{2}+1)}, \qquad (13)$$

where $\Pi_{\mathbf{Z}}=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}$ is the finite sample projection operator, and $\Pi_{\mathbf{Z}}\mathbf{Y}$ and $\Pi_{\mathbf{Z}}\mathbf{X}$ are the OLS predictions of $\mathbf{Y}$ and $\mathbf{X}$ using the instruments $\mathbf{Z}$.

Note that the robustness parameter $\rho$ of the DRO formulation (12) now has a dual interpretation as the regularization parameter in (13). This convex regularized regression formulation implies that the min-max problem (12) associated with Wasserstein DRIVE has a unique solution, thanks to the strict convexity of the regularization term $\sqrt{\rho(\|\beta\|^{2}+1)}$, and it is easy to compute despite not having a closed form solution. In particular, (13) can be reformulated as a standard second order conic program (SOCP) (El Ghaoui and Lebret, 1997), which can be solved efficiently with off-the-shelf convex optimization routines, such as CVX. More importantly, we leverage this formulation of Wasserstein DRIVE as a regularized regression problem to study its novel statistical properties in Section 4.
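To illustrate, the following sketch solves (13) as an SOCP using the CVXPY modeling package (an assumption on tooling; the paper's experiments may use a different solver, and all names here are our own):

```python
import cvxpy as cp
import numpy as np

def drive(X, Y, Z, rho):
    """Solve the Wasserstein DRIVE dual (13) as a second order conic program."""
    n, p = X.shape
    Pi = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection onto col(Z)
    PX, PY = Pi @ X, Pi @ Y                  # Pi_Z X and Pi_Z Y
    beta = cp.Variable(p)
    # sqrt(||PY - PX beta||^2 / n) + sqrt(rho * (||beta||^2 + 1))
    obj = cp.norm(PY - PX @ beta, 2) / np.sqrt(n) \
        + np.sqrt(rho) * cp.norm(cp.hstack([beta, np.ones(1)]), 2)
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value
```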

The equivalence between Wasserstein DRO problems and regularized regression problems is a familiar general result in recent works. For example, Blanchet et al. (2019) and Gao and Kleywegt (2023) derive similar duality results for distributionally robust regression with $q$-Wasserstein distances for $q>1$. Compared to previous works, our work is distinct in the following aspects. First, we apply Wasserstein DRO to the IV estimation setting instead of standard regression settings, such as OLS or logistic regression. Although from an optimization point of view there is no substantial difference, the IV setting motivates a new asymptotic regime that uncovers interesting statistical properties of the resulting estimators. Second, the regularization term in (13) is distinct from those in previous works, which often use $\|\beta\|_{p}$ with $p\geq 1$. This seemingly innocuous difference turns out to be crucial for our novel results on Wasserstein DRIVE. Lastly, compared to the proof in Blanchet et al. (2019), our proof of Theorem 3.1 is based on a different argument using the Sherman-Morrison formula instead of Hölder’s inequality, which provides an independent proof of the important duality result for Wasserstein distributionally robust optimization.

3.3 Wasserstein DRIVE and Regularized Regression

The regularized regression formulation of the Wasserstein DRIVE problem in (13) resembles the standard ridge regularized (Hoerl and Kennard, 1970) TSLS regression:

$$\min_{\beta}\ \frac{1}{n}\sum_{i}(Y_{i}-\tilde{X}_{i}^{T}\beta)^{2}+\rho\|\beta\|^{2}\ \Longleftrightarrow\ \min_{\beta}\ \frac{1}{n}\|\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}+\rho\|\beta\|^{2}. \qquad (14)$$

We therefore refer to (13) as the square root ridge regularized TSLS. However, there are three major distinctions between (13) and (14) that are essential to the statistical properties of Wasserstein DRIVE not enjoyed by the standard ridge regularized TSLS: first, the square root operations applied to both the risk term and the penalty term; second, the constant inside the regularization term; and third, the additional projection of the outcomes in $\Pi_{\mathbf{Z}}\mathbf{Y}$. We further elaborate on these features in Section 4.

In the standard regression setting without instrumental variables, the square root ridge

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho(1+\|\beta\|^{2})} \qquad (15)$$

also resembles the “square root LASSO” of Belloni et al. (2011):

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\lambda\|\beta\|_{1}. \qquad (16)$$

In particular, both can be written as dual problems of Wasserstein DRO problems (Blanchet et al., 2019). However, the square root LASSO is motivated by high-dimensional regression settings where the dimension of $X$ is potentially larger than the sample size $n$ but $\beta$ is very sparse. In contrast, our study of the square root ridge is motivated by its robustness properties in the IV estimation setting, where the dimension of the endogenous variable is small (often one). In other words, variable selection is not the main focus of this paper. A variant of the square root ridge estimator in (15) was also considered in the standard regression setting by Owen (2007), who instead uses the penalty term $\|\beta\|_{2}$.

As is well known in the regularized regression literature (Fu and Knight, 2000), when the regularization parameter decays to $0$ at rate $O_{p}(1/\sqrt{n})$, the ridge estimator is consistent. A similar result also holds for the square root ridge (15) in the standard regression setting as $\rho\rightarrow 0$. However, in the IV estimation setting, our distributional uncertainties about model assumptions, such as the validity of instruments, could persist even in large samples. Recall that $\rho$ is also the robustness parameter in the DRO formulation (12). Therefore, the usual requirement that $\rho\rightarrow 0$ as $n\rightarrow\infty$ cannot adequately capture distributional uncertainties in the IV estimation setting. In the next section, we study the asymptotic properties of Wasserstein DRIVE when $\rho$ does not necessarily vanish. In particular, we establish the consistency of Wasserstein DRIVE by leveraging the three distinct features of (13) that are absent in the standard ridge regularized TSLS regression (14). This asymptotic result is in stark contrast to the conventional wisdom that regularized regression achieves lower variance at the cost of non-zero bias.

4 Asymptotic Theory of Wasserstein DRIVE

In this section, we leverage distinct geometric features of the square root ridge regression to study the asymptotic properties of Wasserstein DRIVE. In Section 4.1, we show that the Wasserstein DRIVE estimator is consistent for any $\rho\in[0,\overline{\rho}]$, where $\overline{\rho}$ depends on the first stage coefficient $\gamma$. This property is a consequence of the consistency of the square root ridge estimator in settings where the objective value at the true parameter vanishes, such as the GMM estimation setting. It ensures that Wasserstein DRIVE can achieve better finite sample performance thanks to its ridge type regularization, while at the same time retaining asymptotic validity when instruments are valid. In Section 4.2, we characterize the asymptotic distribution of Wasserstein DRIVE, and discuss several special settings particularly relevant in practice, such as the just-identified setting with one-dimensional instrumental and endogenous variables.

4.1 Consistency of Wasserstein DRIVE

Recall the linear IV regression model in (2)

$$Y=\beta_{0}^{T}X+\epsilon, \qquad X=\gamma^{T}Z+\xi,$$

where $X\in\mathbb{R}^{p}$, $Z\in\mathbb{R}^{d}$, and $\beta_{0}\in\mathbb{R}^{p}$, $\gamma\in\mathbb{R}^{d\times p}$ with $d\geq p$ to ensure identification. In this section, we make the standard assumptions that the instruments satisfy the relevance and exogeneity conditions in (3) and (4), that $\epsilon,\xi$ are homoskedastic, that the instruments $Z$ are not perfectly collinear, and that $\mathbb{E}\|Z\|^{2k}<\infty$, $\mathbb{E}\|\xi\|^{2k}<\infty$, $\mathbb{E}|\epsilon|^{2k}<\infty$ for some $k>2$. The results can be extended in a straightforward manner when we relax these assumptions, e.g., by only requiring that exogeneity holds asymptotically. Given i.i.d. samples from the linear IV model, recall the regularized regression formulation of the Wasserstein DRIVE objective

$$\min_{\beta}\ \sqrt{\frac{1}{n}\sum_{i}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)_{i}^{2}}+\sqrt{\rho_{n}(\|\beta\|^{2}+1)}, \qquad (17)$$

where $\Pi_{\mathbf{Z}}=\mathbf{Z}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\in\mathbb{R}^{n\times n}$, and $\Pi_{\mathbf{Z}}\mathbf{Y}\in\mathbb{R}^{n}$ and $\Pi_{\mathbf{Z}}\mathbf{X}\in\mathbb{R}^{n\times p}$ are $\mathbf{Y}\in\mathbb{R}^{n}$ and $\mathbf{X}\in\mathbb{R}^{n\times p}$ projected onto the instrument space spanned by $\mathbf{Z}\in\mathbb{R}^{n\times d}$.

Theorem 4.1 (Consistency of Wasserstein DRIVE).

Let $\hat{\beta}_{n}^{\text{DRIVE}}$ be the unique minimizer of the objective in (17). Let $\rho_{n}\rightarrow\rho\geq 0$ and $\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\rightarrow_{p}\mathbb{E}[ZZ^{T}]=\Sigma_{Z}$. Under the relevance and exogeneity conditions (3) and (4), the Wasserstein DRIVE estimator $\hat{\beta}_{n}^{\text{DRIVE}}$ converges in probability as $n\rightarrow\infty$ to $\beta^{\text{DRIVE}}$, the unique minimizer of

$$\min_{\beta}\ \sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)}. \qquad (18)$$

Moreover, whenever $\rho\in[0,\overline{\rho}]$, where $\overline{\rho}=\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$ is the smallest eigenvalue of $\gamma^{T}\Sigma_{Z}\gamma\in\mathbb{R}^{p\times p}$, the unique minimizer of the objective (18) is the true causal effect, i.e., $\beta^{\text{DRIVE}}\equiv\beta_{0}$.

Therefore, Wasserstein DRIVE is consistent as long as the limit of the regularization parameter is bounded above by $\overline{\rho}=\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$. In the case when $\Sigma_{Z}=\sigma_{Z}^{2}I_{d}$ for $\sigma_{Z}^{2}>0$, the upper bound $\overline{\rho}$ is proportional to the square of the smallest singular value of the first stage coefficient $\gamma$, which is positive under the relevance condition (3). Recall that $\rho_{n}$ is the radius of the Wasserstein ball in the min-max formulation of DRIVE in (12). Theorem 4.1 therefore guarantees that even when the robustness parameter $\rho_{n}\equiv\rho\neq 0$, in which case the solution to the min-max problem differs from the TSLS estimator ($\rho=0$), the resulting estimator always has the same limit as long as $\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$.

The significance of Theorem 4.1 is twofold. First, it provides a meaningful interpretation of the robustness parameter $\rho$ in Wasserstein DRIVE in terms of problem parameters, namely the variance-covariance matrix $\Sigma_{Z}$ of $Z$ and the first stage coefficient $\gamma$ in the IV regression model. The maximum amount of robustness that can be afforded by Wasserstein DRIVE without sacrificing consistency is $\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$, which directly depends on the strength and variance of the instrument. This relation can be described more precisely when $\Sigma_{Z}=\sigma_{Z}^{2}I_{d}$, in which case the bound is proportional to $\sigma_{Z}^{2}$ and $\lambda_{p}(\gamma^{T}\gamma)$. Both quantities improve the quality of the instruments: a larger $\sigma_{Z}^{2}$ increases the proportion of variance in $X$ and $Y$ explained by the instrument relative to noise, while a $\gamma$ far from rank deficiency avoids the weak instrument problem. Therefore, the robustness of Wasserstein DRIVE depends intrinsically on the strength of the instruments. The quantity $\gamma^{T}\Sigma_{Z}\gamma$ is not unfamiliar in the IV setting, as it is proportional to the inverse of the asymptotic variance of the standard TSLS estimator when errors are homoskedastic. This observation suggests an intrinsic connection between robustness and efficiency in the IV setting. See the discussions after Theorem 4.2 for more on this point.

More importantly, Theorem 4.1 is the first consistency result for regularized regression estimators where the regularization parameter does not vanish with the sample size. Although regularized regression methods such as ridge and LASSO are often associated with better finite sample performance at the cost of introducing some bias, our work demonstrates that, in the IV estimation setting, we can get the best of both worlds. On one hand, the ridge type regularization in Wasserstein DRIVE improves upon the finite sample properties of standard IV estimators, which aligns with conventional wisdom on regularized regression. On the other hand, with a bounded level of the regularization parameter $\rho$, Wasserstein DRIVE can still achieve consistency. This is in stark contrast to existing asymptotic results on regularized regression (Fu and Knight, 2000). Therefore, in the context of IV estimation, Wasserstein DRIVE achieves consistency and a certain amount of robustness at the same time, by leveraging additional information in the form of valid instruments. The maximum degree of robustness that can be achieved also has a natural interpretation in terms of the strength and variance of the instruments.

Theorem 4.1 also suggests the following procedure to construct a feasible and valid robustness/regularization parameter $\hat{\rho}$ given data $\{X_{i},Y_{i},Z_{i}\}_{i=1}^{n}$. Let $\hat{\gamma}=(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\mathbf{X}$ be the OLS estimator of the first stage coefficient $\gamma$ and $\hat{\Sigma}_{Z}$ an estimator of $\Sigma_{Z}$, such as the heteroskedasticity-robust estimator (White, 1980). We can use any $\hat{\rho}\leq\lambda_{p}(\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma})$ to construct the Wasserstein DRIVE objective, i.e., any value bounded above by the smallest eigenvalue of $\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma}$. Under the assumptions in Theorem 4.1, $\lambda_{p}(\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma})\rightarrow\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$, which guarantees that the Wasserstein DRIVE estimator with parameter $\hat{\rho}$ is consistent. In Appendix E, we discuss the construction of feasible regularization parameters in more detail. We demonstrate the validity and superior finite sample performance of DRIVE based on these proposals in the simulation studies in Section 5.
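A minimal plug-in version of this procedure (our sketch; it uses the sample second moment of $Z$ for $\hat{\Sigma}_{Z}$ and returns the largest feasible value $\lambda_{p}(\hat{\gamma}^{T}\hat{\Sigma}_{Z}\hat{\gamma})$) might look as follows:

```python
import numpy as np

def feasible_rho(X, Z):
    """Plug-in bound lambda_p(gamma_hat' Sigma_hat_Z gamma_hat) from Theorem 4.1."""
    n = Z.shape[0]
    gamma_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)  # first stage OLS, (d, p)
    Sigma_hat = Z.T @ Z / n                            # sample second moment of Z
    M = gamma_hat.T @ Sigma_hat @ gamma_hat            # (p, p), symmetric PSD
    return np.linalg.eigvalsh(M)[0]                    # smallest eigenvalue
```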

One may wonder why Wasserstein DRIVE can achieve consistency with a non-zero regularization $\rho$. Here we briefly discuss why the limiting objective (18),

$$\min_{\beta}\ \sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)},$$

has a unique minimizer at $\beta_{0}$ for bounded $\rho>0$. The first term $\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}$ achieves its minimum value of $0$ at $\beta=\beta_{0}$. When $\rho$ is small, adding the regularization term $\sqrt{\rho(\|\beta\|^{2}+1)}$ does not overwhelm the first term, especially when its curvature at $\beta_{0}$ is large. As a result, we may expect the minimizer not to deviate much from $\beta_{0}$. While this intuition is reasonable qualitatively, it does not fully explain the fact that the minimizer does not change at all for small $\rho$. In the standard regression setting, the same intuition can be applied to the standard ridge regularization, yet we know shrinkage occurs as soon as $\rho>0$. The key distinction of (17) turns out to be the square root operations applied to the loss and regularization terms, which endow the objective with a geometric interpretation and ensure that the minimizer does not deviate from $\beta_{0}$ unless $\rho$ exceeds a positive threshold. We call this phenomenon the “delayed shrinkage” of the square root ridge, as shrinkage does not happen until the regularization is large enough. We illustrate it with a simple example in Fig. 1, where the minimizer of the limiting square root objective does not change for a bounded range of $\rho$.

Figure 1: Plot of $|\beta-1|+\sqrt{\rho}\sqrt{\beta^{2}+1}$, the dual limit objective function in the one-dimensional case with $\sigma_{Z}^{2}=\gamma=\beta_{0}=1$, together with the limit of the standard ridge loss $(\beta-1)^{2}+\beta^{2}$. For $\rho\leq 2$, the minimum is achieved at $\beta=1$, while for $\rho=5$, as for the standard ridge, the minimum is achieved at $\beta=0.5$.
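The delayed shrinkage in Fig. 1 is easy to reproduce numerically. The grid search below (our illustration of the one-dimensional limit objective) shows the minimizer staying at $\beta_{0}=1$ until $\rho$ exceeds the threshold of $2$:

```python
import numpy as np

beta = np.linspace(-1.0, 3.0, 400_001)
for rho in [0.5, 1.0, 2.0, 5.0]:
    loss = np.abs(beta - 1.0) + np.sqrt(rho) * np.sqrt(beta**2 + 1.0)
    print(f"rho = {rho}: minimizer ~ {beta[np.argmin(loss)]:.2f}")
# rho <= 2 keeps the minimizer at 1.00; at rho = 5 it shrinks to 0.50
```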

Lastly, we comment on the importance of the projection operations in Wasserstein DRIVE. A crucial feature of the Wasserstein DRIVE objective is that both the outcome and the covariates are regressed on the instrument to compute their predicted values. In other words, the objective (12) uses $\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta$ instead of $\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta$. For standard IV estimation ($\rho=0$), there is no substantial difference between the two objectives, since their minimizers are exactly the same, due to the idempotent property $\Pi_{\mathbf{Z}}^{2}=\Pi_{\mathbf{Z}}$. Indeed, in applications of TSLS, the outcome variable is often not regressed on the instrument. However, Wasserstein DRIVE is consistent for positive $\rho$ only if the outcome $Y$ is also projected onto the instrument space. In other words, the following problem does not yield a consistent estimator when $\rho>0$:

$$\min_{\beta}\ \sqrt{\frac{1}{n}\|\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}}+\sqrt{\rho(\|\beta\|^{2}+1)}.$$

The reason behind this phenomenon is that $\frac{1}{n}\|\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}$ is a GMM objective,

$$\Big(\frac{1}{n}\sum_{i}Z_{i}(Y_{i}-\beta^{T}X_{i})\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}Z_{i}(Y_{i}-\beta^{T}X_{i})\Big),$$

which as $n\rightarrow\infty$ achieves a minimal value of $0$ at $\beta_{0}$, while the limit of $\frac{1}{n}\|\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta\|^{2}$ does not vanish even at $\beta_{0}$. In the former case, the geometric properties of the square root ridge ensure that the minimizer of the regularized objective is $\beta_{0}$.
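The identity between the projected least squares objective and this GMM objective is simple to verify numerically; the sketch below (with arbitrary simulated data of our own choosing) checks that the two expressions agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 500, 1, 3
Z = rng.standard_normal((n, d))
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
beta = np.array([0.3])

Pi = Z @ np.linalg.solve(Z.T @ Z, Z.T)
lhs = np.sum((Pi @ Y - Pi @ X @ beta) ** 2) / n  # (1/n)||Pi_Z Y - Pi_Z X beta||^2
g = Z.T @ (Y - X @ beta) / n                     # sample moment (1/n) sum_i Z_i (Y_i - beta'X_i)
rhs = g @ np.linalg.solve(Z.T @ Z / n, g)        # GMM objective with weight (Z'Z/n)^{-1}
print(np.isclose(lhs, rhs))                      # True
```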

4.2 Asymptotic Distribution of Wasserstein DRIVE

Having established the consistency of Wasserstein DRIVE with bounded $\rho$, we now turn to the characterization of its asymptotic distribution. In general, the asymptotic distribution of Wasserstein DRIVE is different from that of the standard IV estimator. However, we will also examine several special cases relevant in practice where they coincide.

Theorem 4.2 (Asymptotic Distribution).

When $\lim_{n\rightarrow\infty}\rho_{n}=\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$, the Wasserstein DRIVE estimator $\hat{\beta}_{n}^{\text{DRIVE}}$ has an asymptotic distribution characterized by the following optimization problem:

$$\sqrt{n}(\hat{\beta}^{\text{DRIVE}}-\beta_{0})\ \rightarrow_{d}\ \arg\min_{\delta}\ \sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}+\frac{\sqrt{\rho}\,\beta_{0}^{T}}{\sqrt{1+\|\beta_{0}\|^{2}}}\cdot\delta, \qquad (19)$$

where $\mathcal{Z}\sim\mathcal{N}(0,\sigma^{2}\Sigma_{Z})$ and $\sigma^{2}=\mathbb{E}\epsilon^{2}$.

In particular, when $\rho_{n}\rightarrow 0$ at any rate, we have

$$\sqrt{n}(\hat{\beta}^{\text{DRIVE}}-\beta_{0})\ \rightarrow_{d}\ \mathcal{N}\big(0,\sigma^{2}(\gamma^{T}\Sigma_{Z}\gamma)^{-1}\big),$$

which is the asymptotic distribution of the TSLS estimator with homoskedastic errors $\epsilon$.

Recall that the maximal robustness parameter $\rho$ for which Wasserstein DRIVE remains consistent is equal to the smallest eigenvalue of $\gamma^{T}\Sigma_{Z}\gamma$, which is proportional to the inverse of the asymptotic variance of the TSLS estimator. Therefore, as the efficiency of TSLS increases, so does the robustness of the associated Wasserstein DRIVE estimator. The “price” to pay for robustness when $\rho>0$ is an interesting question. It is clear from Fig. 1 that the curvature of the population objective decreases as $\rho$ increases. Since the objective is not differentiable at $\beta_{0}$, however, a generalized notion of curvature is needed to precisely characterize this behavior. Note that the asymptotic distribution of the TSLS estimator minimizes the objective

$$(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta).$$

Theorem 4.2 thus implies that, in general, the asymptotic distributions of Wasserstein DRIVE and TSLS are different when $\rho>0$. However, there are still several cases relevant in practice where their asymptotic distributions do coincide, which we discuss next.

Corollary 4.3.

In the following cases, the asymptotic distribution of Wasserstein DRIVE is the same as that of the standard TSLS estimator:

1. When $\rho=0$;

2. When $\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$ and $\beta_{0}$ is identically $\mathbf{0}$;

3. When $\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)$ and both $\beta_{0}$ and $\gamma$ are one-dimensional, i.e., $d=p=1$.

In particular, the just-identified case with $d=p=1$ covers many empirical applications of IV estimation, since in practice we are often interested in the causal effect of a single endogenous variable, for which we have a single instrument. The case when $\beta_{0}\equiv\mathbf{0}$ is also very relevant, since an important question in practice is whether the causal effect of a variable is zero. Our theory suggests that the asymptotic distribution of Wasserstein DRIVE should be the same as that of TSLS when the causal effect is zero and $d>1$, even for $\rho>0$. Based on this observation, we should intuitively expect the Wasserstein DRIVE and TSLS estimators to be “close” to each other; if the estimators or their asymptotic variance estimators differ significantly, then $\beta_{0}$ may not be identically $0$. We can design statistical tests by leveraging this intuition. For example, we can construct test statistics using the TSLS estimator and the DRIVE estimator with $\rho>0$, such as the difference $\hat{\beta}^{\text{DRIVE}}-\hat{\beta}^{\text{TSLS}}$. Then we can use bootstrap-based tests, such as a bootstrapped permutation test, to assess the null hypothesis that $\hat{\beta}^{\text{DRIVE}}-\hat{\beta}^{\text{TSLS}}=0$. If we fail to reject the null hypothesis, then there is evidence that the true causal effect is $\beta_{0}=\mathbf{0}$.
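One possible implementation of this idea, sketched below for the scalar case with hypothetical estimator routines drive_fit and tsls_fit (not from the paper; e.g., the sketches above), compares the observed difference to its centered nonparametric bootstrap distribution:

```python
import numpy as np

def bootstrap_diff_test(X, Y, Z, rho, drive_fit, tsls_fit, B=500, seed=0):
    """Bootstrap p-value sketch for H0: beta_DRIVE - beta_TSLS = 0 (scalar case)."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    stat = drive_fit(X, Y, Z, rho) - tsls_fit(X, Y, Z)
    boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # nonparametric resample of the data
        boot[b] = drive_fit(X[idx], Y[idx], Z[idx], rho) - tsls_fit(X[idx], Y[idx], Z[idx])
    # two-sided p-value against the centered bootstrap distribution
    return np.mean(np.abs(boot - boot.mean()) >= np.abs(stat))
```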

Corollary 4.3 may seem pessimistic, as it shows that the asymptotic distribution of Wasserstein DRIVE can coincide with that of TSLS in special cases. However, recall that Wasserstein DRIVE is formulated to minimize the worst-case risk over a set of distributions designed to capture deviations from model assumptions. There is therefore no a priori reason that it should coincide with TSLS when $\rho>0$. In this sense, the fact that Wasserstein DRIVE is consistent with $\rho>0$, and may even coincide with TSLS asymptotically, is rather surprising. In the latter case, the worst-case distribution for Wasserstein DRIVE in the large sample limit must coincide with the standard population distribution, a point that may be worth further investigation.

The asymptotic results developed in this section provide the basis on which one can perform estimation and inference with the Wasserstein DRIVE estimator. In the next section, we study the finite sample properties of DRIVE in simulation studies and demonstrate that it is superior in terms of estimation error and out-of-sample prediction compared to other popular estimators.

5 Numerical Studies

In this section, we study the empirical performance of Wasserstein DRIVE. Our results deliver three main messages. First, we demonstrate with simulations that Wasserstein DRIVE, with a non-zero robustness parameter $\rho$ based on Theorem 4.1, performs comparably to the standard IV estimator whenever instruments are valid. Second, when instruments become invalid, Wasserstein DRIVE outperforms other methods in terms of MSE. Third, on the education dataset of Card (1993), Wasserstein DRIVE also has superior performance at prediction for a heterogeneous target population.

5.1 MSE of Wasserstein DRIVE

We use the data generating process

$$Y=X\beta_{0}+Z\eta+U, \qquad X=\gamma Z+U, \qquad Z=U\beta_{UZ}+\epsilon_{Z},$$

where $U,\epsilon_{Z}\sim\mathcal{N}(0,\sigma^{2})$ and we allow a direct effect $\eta$ from the instrument $Z$ to the outcome $Y$. Moreover, the instrument $Z$ can also be correlated with the unobserved confounder $U$ ($\beta_{UZ}\neq 0$). We fix the true parameters and generate independent datasets from the model, varying the degree of instrument invalidity: we vary $\eta$, the direct effect of the instrument on the outcome, and $\beta_{UZ}$, the correlation between the unobserved confounder and the instrument. In Table 1, we report the MSE of the estimators averaged over 500 repeated experiments; a sketch of the data generating process appears after the table. Results in Table 1 are based on data where $\|\gamma\|\gg 0$ is large. We see that when instruments are strong, Wasserstein DRIVE performs as well as TSLS when instruments are valid, but performs significantly better than OLS, TSLS, anchor, and TSLS ridge when instruments become invalid. This suggests that DRIVE could be preferable in practice when we are concerned about instrument validity.

$\eta$ $\beta_{UZ}$ OLS TSLS anchor TSLS ridge DRIVE
0 0 0.21 0.03 0.19 0.03 0.03
0.4 0 0.20 0.07 0.16 0.06 0.03
0.4 0.4 0.26 0.25 0.24 0.21 0.07
0.4 0.8 0.29 0.62 0.29 0.56 0.09
0.8 0 0.26 0.23 0.23 0.22 0.06
0.8 0.4 0.32 0.51 0.31 0.46 0.10
0.8 0.8 0.37 0.82 0.38 0.81 0.14
Table 1: MSE of estimators when instruments are potentially invalid. $\beta_{0}=1$, $n=2000$, $\sigma=0.5$. For TSLS ridge, the regularization parameter is selected using cross validation based on out-of-sample prediction errors. For anchor regression, the regularization parameter is selected based on the proposal in Rothenhäusler et al., (2021). For DRIVE, the regularization parameter is selected using a nonparametric bootstrap of the score quantile (Appendix E).
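For concreteness, the simulation design can be sketched in Python as below. This is our own minimal illustration: the default first stage coefficient gamma is a stand-in for the large $\|\gamma\|$ used in Table 1, and the function name is ours.

```python
import numpy as np

def simulate(n=2000, beta0=1.0, gamma=1.0, eta=0.0, beta_uz=0.0, sigma=0.5, seed=0):
    """Draw one dataset: the instrument is invalid when eta != 0 (direct
    effect on Y) or beta_uz != 0 (correlation with the confounder U)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, sigma, n)    # unobserved confounder
    eps_Z = rng.normal(0.0, sigma, n)
    Z = beta_uz * U + eps_Z          # instrument
    X = gamma * Z + U                # endogenous regressor
    Y = beta0 * X + eta * Z + U      # outcome
    return Y, X, Z
```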
Figure 2: MSEs of estimators when instruments are potentially invalid. The instrument $Z$ can have a direct effect $\eta$ on the outcome $Y$, or be correlated with the unobserved confounder $U$. Wasserstein DRIVE consistently outperforms the other estimators.

We further investigate the empirical performance of Wasserstein DRIVE when instruments are potentially invalid or weak. We present box plots (omitting outliers) of MSEs in Fig. 2. The Wasserstein DRIVE estimator with regularization parameter $\rho$ based on bootstrapped quantiles of the score function consistently outperforms OLS, TSLS, anchor ($k$-class), and TSLS with ridge regularization. Moreover, the selected penalties increase as the direct effect of $Z$ on $Y$ or the correlation between the unobserved confounder $U$ and the instrument $Z$ increases, i.e., as the model assumption of valid instruments becomes increasingly violated; see Fig. 5 in Appendix E for more details. This property is highly desirable: in the DRO formulation of DRIVE, $\rho$ represents the amount of robustness against distributional shifts associated with the estimator, which should increase as the instruments become more invalid (larger distributional shift). The box plots of estimation errors in Fig. 2 also verify that even when instruments are valid, the finite sample performance of Wasserstein DRIVE is still better than that of the standard IV estimator, suggesting that there is no additional cost in terms of MSE when applying Wasserstein DRIVE, even when instruments are valid.
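The precise score statistic is defined in Appendix E and not reproduced here; the sketch below shows only the generic outer loop of the selection rule, with the score left as a user-supplied callable score_fn and the quantile level alpha as a tuning choice.

```python
import numpy as np

def rho_from_score_quantile(data, score_fn, alpha=0.95, n_boot=200, seed=0):
    """Set rho to a bootstrap quantile of a score statistic.

    data: array whose rows are observations; score_fn: callable mapping a
    resampled dataset to a scalar score (its exact form is in Appendix E).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # nonparametric bootstrap resample
        scores[b] = score_fn(data[idx])
    return np.quantile(scores, alpha)
```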

5.2 Prediction under Distributional Shifts on Education Data

We now turn our attention to a different task that has received increasing attention in recent years, especially in the context of policy learning and estimating causal effects across heterogeneous populations (Dehejia et al.,, 2021; Adjaho and Christensen,, 2022; Menzel,, 2023). We study the prediction performance of estimators when they are estimated on a dataset (training set) whose distribution potentially differs from that of the dataset on which they are used to make predictions (test set). We demonstrate that whenever the training and test distributions differ significantly, the prediction error of Wasserstein DRIVE is significantly smaller than that of the OLS, IV, and anchor ($k$-class) estimators.

We conduct our numerical study using the classic dataset on the returns to education compiled by David Card (Card,, 1993). Here, the causal inference problem is estimating the effect of additional years of schooling on the increase in wage later in life. The dataset contains demographic information about interviewed subjects. Importantly, each sample comes from one of nine regions in the United States, which differ in the average number of years of schooling and other characteristics, i.e., there are covariate shifts in data collected from different regions. Our strategy is to divide the dataset into a training set and a test set based on the relative ranks of the regions' average years of schooling, which is the endogenous variable. We expect that if there are distributional shifts between different regions, then models trained on the training data to predict wages using education and other information may not perform well on the test data.

training set (size) test set (size) OLS TSLS DRIVE anchor ridge TSLS ridge
bottom 3 educated regions (1247) top 3 educated regions (841) 0.444 0.537 0.364 0.444 0.444 0.421
(0.009) (0.031) (0.002) (0.009) (0.009) (0.019)
top 6 educated regions (1763) 0.451 1.064 0.371 0.451 0.451 0.430
(0.011) (0.274) (0.003) (0.011) (0.011) (0.027)
top 6 educated regions (1763) bottom 3 educated regions (1247) 0.390 0.584 0.356 0.390 0.390 0.377
(0.007) (0.120) (0.002) (0.007) (0.007) (0.015)
top 3 educated regions (841) bottom 3 educated regions (1247) 0.389 1.99 0.355 0.388 0.359 0.344
(0.013) (0.775) (0.005) (0.013) (0.014) (0.004)
middle 3 educated regions (922) 0.328 3.18 0.364 0.328 0.326 0.361
(0.001) (1.213) (0.005) (0.001) (0.001) (0.005)
middle 3 educated regions (922) top 3 educated regions (841) 0.332 0.410 0.332 0.332 0.333 0.332
(0.001) (0.025) (0.001) (0.001) (0.001) (0.001)
bottom 3 educated regions (1247) 0.416 0.538 0.363 0.416 0.409 0.386
(0.014) (0.063) (0.005) (0.014) (0.016) (0.019)
most+least educated regions (374) 0.395 2.47 0.362 0.395 0.392 0.355
(0.001) (1.81) (0.004) (0.011) (0.012) (0.004)
top 3+bottom 3 educated regions (2088) 0.396 0.451 0.358 0.396 0.382 0.366
(0.005) (0.032) (0.003) (0.009) (0.006) (0.009)
Table 2: Comparison of estimation methods in terms of MSE on test data. The training and test datasets are split according to the 9 regions in the Card college proximity dataset, based on their average education levels. In this specification, we did not include experience squared. Standard errors are obtained using 10 bootstrapped datasets.

Since each sample is labeled as working in 1976 in one of nine regions in the U.S., we split the samples based on these labels, using the number of years of education as the splitting variable. For example, we can construct the training set from samples in the top 6 regions with the highest average years of schooling, and the test set from samples in the bottom 3 regions with the lowest average years of schooling. In this case, we would expect the training and test sets to come from different distributions. Indeed, the average years of schooling differs by more than 1 year between the two sets, and the difference is statistically significant.
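As an illustration of this splitting strategy, the following is a hypothetical pandas sketch; the column names region and educ are placeholders we introduce here, not the dataset's actual variable names.

```python
import pandas as pd

def split_by_schooling(df: pd.DataFrame, n_train_regions: int = 6):
    """Split samples into train/test by ranking regions on average schooling."""
    region_rank = (df.groupby("region")["educ"].mean()
                     .sort_values(ascending=False))
    train_regions = region_rank.index[:n_train_regions]  # most educated regions
    train = df[df["region"].isin(train_regions)]
    test = df[~df["region"].isin(train_regions)]
    return train, test
```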

In splitting the samples based on the distribution of the endogenous variable, we are also motivated by the long-standing debates revolving around the use of instrumental variables in classic economic studies (Card and Krueger,, 1994). A leading concern is the validity of instruments. In the study of educational returns, valid estimation and inference require that the instruments (proximity to college and quarter of birth) are not correlated with unobserved characteristics of the subjects that may also affect their earnings. The following quote from Card, (1999) illustrates this concern:

“In the case of quasi or natural experiments, however, inferences are based on difference between groups of individuals who attended schools at different times, or in different locations, or had differences in other characteristics such as month of birth. The use of these differences to draw causal inferences about the effect of schooling requires careful consideration of the maintained assumption that the groups are otherwise identical.”

When this assumption is violated, estimates based on a particular subpopulation become unreliable for the wider population, and we evaluate estimators by how well they generalize to other groups of the population under potential distributional or covariate shifts. In Table 2, we compare the test set MSE of the OLS, IV, Wasserstein DRIVE, anchor regression, ridge, and ridge regularized IV estimators. We see that Wasserstein DRIVE consistently outperforms the other estimators commonly used in practice.

6 Concluding Remarks

In this paper, we propose a distributionally robust instrumental variables estimation framework. Our approach is motivated by two main considerations in practice. The first is the concern about model mis-specification in IV estimation, most notably the validity of instruments. The second is that, beyond estimating the causal effect of the endogenous variable, practitioners may also be interested in making good predictions with the help of instruments when there is heterogeneity between training and test datasets, e.g., when generalizing findings based on samples from a particular population or geographical group to other groups. We argue that both challenges can be naturally unified as problems of distributional shift, and then addressed using frameworks from distributionally robust optimization.

We provide a dual representation of our Wasserstein DRIVE framework as a regularized TSLS problem, and reveal a distinctive property of the resulting estimator: it is consistent even with a non-vanishing penalty parameter. We further characterize the asymptotic distribution of Wasserstein DRIVE, and establish a few special cases in which it coincides with that of the standard TSLS estimator. Numerical studies suggest that Wasserstein DRIVE has superior finite sample performance in two regards. First, it has lower estimation error when instruments are potentially invalid, while performing as well as TSLS when instruments are valid. Second, it outperforms existing methods at predicting outcomes under distributional shifts between training and test data. These findings support the appeal of our DRO approach to IV estimation, and suggest that Wasserstein DRIVE could be preferable in practice to standard IV methods. Finally, there are many future research directions of interest, such as further results on inference and testing, as well as connections to sensitivity analysis. Extensions to nonlinear models would also be useful in practice.

Acknowledgements

We are indebted to Han Hong, Guido Imbens, and Yinyu Ye for invaluable advice and guidance throughout this project, and to Agostino Capponi, Timothy Cogley, Rajeev Dehejia, Yanqin Fan, Alfred Galichon, Rui Gao, Wenzhi Gao, Vishal Kamat, Samir Khan, Frederic Koehler, Michal Kolesár, Simon Sokbae Lee, Greg Lewis, Elena Manresa, Konrad Menzel, Axel Peytavin, Debraj Ray, Martin Rotemberg, Vasilis Syrgkanis, Johan Ugander, and Ruoxuan Xiong for helpful discussions and suggestions. This work was supported in part by a Stanford Interdisciplinary Graduate Fellowship (SIGF).

References

  • Adjaho and Christensen, (2022) Adjaho, C. and Christensen, T. (2022). Externally valid treatment choice. arXiv preprint arXiv:2205.05561.
  • Anderson and Rubin, (1949) Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. The Annals of mathematical statistics, 20(1):46–63.
  • Andrews, (1994) Andrews, D. W. (1994). Empirical process methods in econometrics. Handbook of econometrics, 4:2247–2294.
  • Andrews, (1999) Andrews, D. W. (1999). Consistent moment selection procedures for generalized method of moments estimation. Econometrica, 67(3):543–563.
  • Andrews, (2007) Andrews, D. W. (2007). Inference with weak instruments. Advances in Economics and Econometrics, 3:122–173.
  • Andrews et al., (2019) Andrews, I., Stock, J. H., and Sun, L. (2019). Weak instruments in instrumental variables regression: Theory and practice. Annual Review of Economics, 11(1).
  • Angrist et al., (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455.
  • Angrist and Pischke, (2009) Angrist, J. D. and Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton university press.
  • Armstrong and Kolesár, (2021) Armstrong, T. B. and Kolesár, M. (2021). Sensitivity analysis using approximate moment condition models. Quantitative Economics, 12(1):77–108.
  • Basmann, (1960a) Basmann, R. L. (1960a). On finite sample distributions of generalized classical linear identifiability test statistics. Journal of the American Statistical Association, 55(292):650–659.
  • Basmann, (1960b) Basmann, R. L. (1960b). On the asymptotic distribution of generalized linear estimators. Econometrica, Journal of the Econometric Society, pages 97–107.
  • Baum et al., (2003) Baum, C. F., Schaffer, M. E., and Stillman, S. (2003). Instrumental variables and gmm: Estimation and testing. The Stata Journal, 3(1):1–31.
  • Bekker, (1994) Bekker, P. A. (1994). Alternative approximations to the distributions of instrumental variable estimators. Econometrica: Journal of the Econometric Society, pages 657–681.
  • Belloni et al., (2012) Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
  • Belloni et al., (2018) Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C., and Kato, K. (2018). High-dimensional econometrics and regularized gmm. arXiv preprint arXiv:1806.01888.
  • Belloni et al., (2011) Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.
  • Ben-Tal et al., (2013) Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
  • Bennett and Kallus, (2023) Bennett, A. and Kallus, N. (2023). The variational method of moments. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):810–841.
  • Berkowitz et al., (2008) Berkowitz, D., Caner, M., and Fang, Y. (2008). Are “nearly exogenous instruments” reliable? Economics Letters, 101(1):20–23.
  • Bertsimas and Copenhaver, (2018) Bertsimas, D. and Copenhaver, M. S. (2018). Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3):931–942.
  • Bertsimas et al., (2022) Bertsimas, D., Imai, K., and Li, M. L. (2022). Distributionally robust causal inference with observational data. arXiv preprint arXiv:2210.08326.
  • Bertsimas and Popescu, (2005) Bertsimas, D. and Popescu, I. (2005). Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780–804.
  • Bickel et al., (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705 – 1732.
  • Blanchet et al., (2019) Blanchet, J., Kang, Y., and Murthy, K. (2019). Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857.
  • Blanchet and Murthy, (2019) Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
  • Blanchet et al., (2022) Blanchet, J., Murthy, K., and Si, N. (2022). Confidence regions in wasserstein distributionally robust estimation. Biometrika, 109(2):295–315.
  • Bonhomme and Weidner, (2022) Bonhomme, S. and Weidner, M. (2022). Minimizing sensitivity to model misspecification. Quantitative Economics, 13(3):907–954.
  • Bound et al., (1995) Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American statistical association, 90(430):443–450.
  • Bowden et al., (2015) Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalid instruments: effect estimation and bias detection through egger regression. International journal of epidemiology, 44(2):512–525.
  • Bowden et al., (2016) Bowden, J., Davey Smith, G., Haycock, P. C., and Burgess, S. (2016). Consistent estimation in mendelian randomization with some invalid instruments using a weighted median estimator. Genetic epidemiology, 40(4):304–314.
  • Bühlmann, (2020) Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science, 35(3):404–426.
  • Burgess et al., (2016) Burgess, S., Bowden, J., Dudbridge, F., and Thompson, S. G. (2016). Robust instrumental variable methods using multiple candidate instruments with application to mendelian randomization. arXiv preprint arXiv:1606.03729.
  • Burgess et al., (2020) Burgess, S., Foley, C. N., Allara, E., Staley, J. R., and Howson, J. M. (2020). A robust and efficient method for mendelian randomization with hundreds of genetic variants. Nature communications, 11(1):376.
  • Burgess et al., (2011) Burgess, S., Thompson, S. G., and Collaboration, C. C. G. (2011). Avoiding bias from weak instruments in mendelian randomization studies. International journal of epidemiology, 40(3):755–764.
  • Calafiore and Ghaoui, (2006) Calafiore, G. C. and Ghaoui, L. E. (2006). On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications, 130:1–22.
  • Caner, (2009) Caner, M. (2009). Lasso-type gmm estimator. Econometric Theory, 25(1):270–290.
  • Caner and Kock, (2018) Caner, M. and Kock, A. B. (2018). High dimensional linear gmm. arXiv preprint arXiv:1811.08779.
  • Card, (1993) Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. NBER Working Paper, (w4483).
  • Card, (1999) Card, D. (1999). The causal effect of education on earnings. Handbook of labor economics, 3:1801–1863.
  • Card and Krueger, (1994) Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in new jersey and pennsylvania. American Economic Review, 84(4).
  • Chamberlain and Imbens, (2004) Chamberlain, G. and Imbens, G. (2004). Random effects estimators with many instrumental variables. Econometrica, 72(1):295–306.
  • Chao and Swanson, (2005) Chao, J. C. and Swanson, N. R. (2005). Consistent estimation with a large number of weak instruments. Econometrica, 73(5):1673–1692.
  • Chen et al., (2021) Chen, X., Hansen, L. P., and Hansen, P. G. (2021). Robust inference for moment condition models without rational expectations. Journal of Econometrics, forthcoming.
  • Chernozhukov et al., (2015) Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Post-selection and post-regularization inference in linear models with many controls and instruments. American Economic Review, 105(5):486–490.
  • Cigliutti and Manresa, (2022) Cigliutti, I. and Manresa, E. (2022). Adversarial method of moments.
  • Conley et al., (2012) Conley, T. G., Hansen, C. B., and Rossi, P. E. (2012). Plausibly exogenous. Review of Economics and Statistics, 94(1):260–272.
  • Cragg and Donald, (1993) Cragg, J. G. and Donald, S. G. (1993). Testing identifiability and specification in instrumental variable models. Econometric Theory, 9(2):222–240.
  • Davey Smith and Ebrahim, (2003) Davey Smith, G. and Ebrahim, S. (2003). ‘mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International journal of epidemiology, 32(1):1–22.
  • Dehejia et al., (2021) Dehejia, R., Pop-Eleches, C., and Samii, C. (2021). From local to global: External validity in a fertility natural experiment. Journal of Business & Economic Statistics, 39(1):217–243.
  • Delage and Ye, (2010) Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595–612.
  • Duchi et al., (2021) Duchi, J. C., Glynn, P. W., and Namkoong, H. (2021). Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3):946–969.
  • Dupačová, (1987) Dupačová, J. (1987). The minimax approach to stochastic programming and an illustrative application. Stochastics: An International Journal of Probability and Stochastic Processes, 20(1):73–88.
  • El Ghaoui and Lebret, (1997) El Ghaoui, L. and Lebret, H. (1997). Robust solutions to least-squares problems with uncertain data. SIAM Journal on matrix analysis and applications, 18(4):1035–1064.
  • Emdin et al., (2017) Emdin, C. A., Khera, A. V., and Kathiresan, S. (2017). Mendelian randomization. Jama, 318(19):1925–1926.
  • Fan et al., (2024) Fan, J., Fang, C., Gu, Y., and Zhang, T. (2024). Environment invariant linear least squares. The Annals of Statistics, 52(5):2268–2292.
  • Fan et al., (2023) Fan, Y., Park, H., and Xu, G. (2023). Quantifying distributional model risk in marginal problems via optimal transport. arXiv preprint arXiv:2307.00779.
  • Fisher, (1961) Fisher, F. M. (1961). On the cost of approximate specification in simultaneous equation estimation. Econometrica: journal of the Econometric Society, pages 139–170.
  • Fu and Knight, (2000) Fu, W. and Knight, K. (2000). Asymptotics for lasso-type estimators. The Annals of statistics, 28(5):1356–1378.
  • Fuller, (1977) Fuller, W. A. (1977). Some properties of a modification of the limited information estimator. Econometrica: Journal of the Econometric Society, pages 939–953.
  • Galichon, (2018) Galichon, A. (2018). Optimal transport methods in economics. Princeton University Press.
  • Galichon, (2021) Galichon, A. (2021). The unreasonable effectiveness of optimal transport in economics. arXiv preprint arXiv:2107.04700.
  • Gao and Kleywegt, (2023) Gao, R. and Kleywegt, A. (2023). Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 48(2):603–655.
  • Goodfellow et al., (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • Guo et al., (2018a) Guo, Z., Kang, H., Cai, T. T., and Small, D. S. (2018a). Testing endogeneity with high dimensional covariates. Journal of Econometrics, 207(1):175–187.
  • Guo et al., (2018b) Guo, Z., Kang, H., Tony Cai, T., and Small, D. S. (2018b). Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4):793–815.
  • Hahn and Hausman, (2005) Hahn, J. and Hausman, J. (2005). Estimation with valid and invalid instruments. Annales d’Economie et de Statistique, pages 25–57.
  • Hahn et al., (2004) Hahn, J., Hausman, J., and Kuersteiner, G. (2004). Estimation with weak instruments: Accuracy of higher-order bias and mse approximations. The Econometrics Journal, 7(1):272–306.
  • Hall, (2003) Hall, A. R. (2003). Generalized method of moments. A companion to theoretical econometrics, pages 230–255.
  • Hansen, (1982) Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the econometric society, pages 1029–1054.
  • Hansen and Sargent, (2008) Hansen, L. P. and Sargent, T. J. (2008). Robustness. Princeton university press.
  • Hansen and Sargent, (2010) Hansen, L. P. and Sargent, T. J. (2010). Wanting robustness in macroeconomics. In Handbook of monetary economics, volume 3, pages 1097–1157. Elsevier.
  • Hausman et al., (2012) Hausman, J. A., Newey, W. K., Woutersen, T., Chao, J. C., and Swanson, N. R. (2012). Instrumental variable estimation with heteroskedasticity and many instruments. Quantitative Economics, 3(2):211–255.
  • Hoerl and Kennard, (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82.
  • Hu and Hong, (2013) Hu, Z. and Hong, L. J. (2013). Kullback-leibler divergence constrained distributionally robust optimization. Available at Optimization Online, pages 1695–1724.
  • Huber, (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.
  • Huber and Ronchetti, (2011) Huber, P. J. and Ronchetti, E. M. (2011). Robust statistics. John Wiley & Sons.
  • Imbens and Angrist, (1994) Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.
  • Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge university press.
  • Iyengar, (2005) Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
  • Jakobsen and Peters, (2022) Jakobsen, M. E. and Peters, J. (2022). Distributional robustness of k-class estimators and the pulse. The Econometrics Journal, 25(2):404–432.
  • Jiang, (2017) Jiang, W. (2017). Have instrumental variables brought us closer to the truth. Review of Corporate Finance Studies, 6(2):127–140.
  • Kadane and Anderson, (1977) Kadane, J. B. and Anderson, T. (1977). A comment on the test of overidentifying restrictions. Econometrica: Journal of the Econometric Society, pages 1027–1031.
  • Kaji et al., (2020) Kaji, T., Manresa, E., and Pouliot, G. (2020). An adversarial approach to structural estimation. arXiv preprint arXiv:2007.06169.
  • Kallus and Zhou, (2021) Kallus, N. and Zhou, A. (2021). Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5):2870–2890.
  • Kang et al., (2016) Kang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016). Instrumental variables estimation with some invalid instruments and its application to mendelian randomization. Journal of the American statistical Association, 111(513):132–144.
  • Kantorovich, (1942) Kantorovich, L. V. (1942). On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201.
  • Kantorovich, (1960) Kantorovich, L. V. (1960). Mathematical methods of organizing and planning production. Management science, 6(4):366–422.
  • Kitamura et al., (2013) Kitamura, Y., Otsu, T., and Evdokimov, K. (2013). Robustness, infinitesimal neighborhoods, and moment restrictions. Econometrica, 81(3):1185–1201.
  • Kolesár, (2018) Kolesár, M. (2018). Minimum distance approach to inference with many instruments. Journal of Econometrics, 204(1):86–100.
  • Kolesár et al., (2015) Kolesár, M., Chetty, R., Friedman, J., Glaeser, E., and Imbens, G. W. (2015). Identification and inference with many invalid instruments. Journal of Business & Economic Statistics, 33(4):474–484.
  • Koopmans, (1949) Koopmans, T. C. (1949). Optimum utilization of the transportation system. Econometrica: Journal of the Econometric Society, pages 136–146.
  • Kuhn et al., (2019) Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations research & management science in the age of analytics, pages 130–166. Informs.
  • Kunitomo, (1980) Kunitomo, N. (1980). Asymptotic expansions of the distributions of estimators in a linear functional relationship and simultaneous equations. Journal of the American Statistical Association, 75(371):693–700.
  • Lei et al., (2023) Lei, L., Sahoo, R., and Wager, S. (2023). Policy learning under biased sample selection. arXiv preprint arXiv:2304.11735.
  • Lewis and Syrgkanis, (2018) Lewis, G. and Syrgkanis, V. (2018). Adversarial generalized method of moments. arXiv preprint arXiv:1803.07164.
  • McDonald, (1977) McDonald, J. B. (1977). The k-class estimators as least variance difference estimators. Econometrica: Journal of the Econometric Society, pages 759–763.
  • Meinshausen, (2018) Meinshausen, N. (2018). Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pages 6–10. IEEE.
  • Menzel, (2023) Menzel, K. (2023). Transfer estimates for causal effects across heterogeneous sites. arXiv preprint arXiv:2305.01435.
  • Metzger, (2022) Metzger, J. (2022). Adversarial estimators. arXiv preprint arXiv:2204.10495.
  • Mohajerin Esfahani and Kuhn, (2018) Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166.
  • Murray, (2006) Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments. Journal of economic Perspectives, 20(4):111–132.
  • Nagar, (1959) Nagar, A. L. (1959). The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica: Journal of the Econometric Society, pages 575–595.
  • Nelson and Startz, (1990a) Nelson, C. R. and Startz, R. (1990a). The distribution of the instrumental variables estimator and its t-ratio when the instrument is a poor one. Journal of Business, pages S125–S140.
  • Nelson and Startz, (1990b) Nelson, C. R. and Startz, R. (1990b). Some further results on the exact small sample properties of the instrumental variable estimator. Econometrica: Journal of the Econometric Society, pages 967–976.
  • Olkin and Pukelsheim, (1982) Olkin, I. and Pukelsheim, F. (1982). The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263.
  • Owen, (2007) Owen, A. B. (2007). A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443(7):59–72.
  • Peters et al., (2016) Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal Inference by using Invariant Prediction: Identification and Confidence Intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947–1012.
  • Pollard, (1991) Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7(2):186–199.
  • Prékopa, (2013) Prékopa, A. (2013). Stochastic programming, volume 324. Springer Science & Business Media.
  • Richardson, (1968) Richardson, D. H. (1968). The exact distribution of a structural coefficient estimator. Journal of the American Statistical Association, 63(324):1214–1226.
  • Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212–218.
  • Rothenhäusler et al., (2021) Rothenhäusler, D., Meinshausen, N., Bühlmann, P., Peters, J., et al. (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B, 83(2):215–246.
  • Ruszczyński, (2010) Ruszczyński, A. (2010). Risk-averse dynamic programming for markov decision processes. Mathematical programming, 125:235–261.
  • Sahoo et al., (2022) Sahoo, R., Lei, L., and Wager, S. (2022). Learning from a biased sample. arXiv preprint arXiv:2209.01754.
  • Sanderson and Windmeijer, (2016) Sanderson, E. and Windmeijer, F. (2016). A weak instrument f-test in linear iv models with multiple endogenous variables. Journal of econometrics, 190(2):212–221.
  • Sargan, (1958) Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica: Journal of the econometric society, pages 393–415.
  • Scarf, (1958) Scarf, H. (1958). A min-max solution of an inventory problem. Studies in the mathematical theory of inventory and production.
  • Shapiro and Kleywegt, (2002) Shapiro, A. and Kleywegt, A. (2002). Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3):523–542.
  • Sherman and Morrison, (1950) Sherman, J. and Morrison, W. J. (1950). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127.
  • Sinha et al., (2017) Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2017). Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
  • Small, (2007) Small, D. S. (2007). Sensitivity analysis for instrumental variables regression with overidentifying restrictions. Journal of the American Statistical Association, 102(479):1049–1058.
  • Staiger and Stock, (1997) Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica: Journal of the Econometric Society, pages 557–586.
  • Stock and Wright, (2000) Stock, J. H. and Wright, J. H. (2000). Gmm with weak identification. Econometrica, 68(5):1055–1096.
  • Stock et al., (2002) Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 20(4):518–529.
  • Stock and Yogo, (2002) Stock, J. H. and Yogo, M. (2002). Testing for weak instruments in linear iv regression.
  • Theil, (1953) Theil, H. (1953). Repeated least squares applied to complete equation systems. The Hague: central planning bureau.
  • Theil, (1961) Theil, H. (1961). Economic forecasts and policy.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • van der Vaart and Wellner, (1996) van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
  • VanderWeele et al., (2014) VanderWeele, T. J., Tchetgen, E. J. T., Cornelis, M., and Kraft, P. (2014). Methodological challenges in mendelian randomization. Epidemiology (Cambridge, Mass.), 25(3):427.
  • Vaserstein, (1969) Vaserstein, L. N. (1969). Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72.
  • Vershik, (2013) Vershik, A. M. (2013). Long history of the monge-kantorovich transportation problem: (marking the centennial of lv kantorovich’s birth!). The Mathematical Intelligencer, 35:1–9.
  • Villani, (2009) Villani, C. (2009). Optimal transport: old and new, volume 338. Springer.
  • Von Neumann and Morgenstern, (1947) Von Neumann, J. and Morgenstern, O. (1947). Theory of games and economic behavior, 2nd rev.
  • Wang et al., (2016) Wang, Z., Glynn, P. W., and Ye, Y. (2016). Likelihood robust optimization for data-driven problems. Computational Management Science, 13:241–261.
  • White, (1980) White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, pages 817–838.
  • Windmeijer et al., (2018) Windmeijer, F., Farbmacher, H., Davies, N., and Smith, G. D. (2018). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association.
  • Windmeijer et al., (2021) Windmeijer, F., Liang, X., Hartwig, F. P., and Bowden, J. (2021). The confidence interval method for selecting valid instrumental variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(4):752–776.
  • Wooldridge, (2020) Wooldridge, J. M. (2020). Introductory econometrics: a modern approach.
  • Young, (2022) Young, A. (2022). Consistency without inference: Instrumental variables in practical application. European Economic Review, page 104112.

Appendix A Background and Preliminaries

A.1 Wasserstein Distributionally Robust Optimization

We first formally define the Wasserstein distance and discuss relevant results used in this paper. The Wasserstein distance is a metric on the space of probability distributions defined via the optimal transport problem. More specifically, given a Polish space $\mathcal{X}$ with metric $d$, let $\mathcal{P}(\mathcal{X})$ be the set of Borel probability measures on $\mathcal{X}$ and $\mathbb{P},\mathbb{Q}\in\mathcal{P}(\mathcal{X})$. For exposition, we assume they have densities $f_{1}$ and $f_{2}$, respectively, although the Wasserstein distance is well-defined for more general probability measures using the concept of push-forwards (Villani,, 2009). The optimal transport problem, whose study was pioneered by Kantorovich, (1942, 1960), aims to find the joint distribution (coupling) of $\mathbb{P}$ and $\mathbb{Q}$ with the smallest transport cost, as specified by the metric $d$:

$$\min_{\pi\in\mathcal{P}(\mathcal{X}\times\mathcal{X}):\ \int\pi(x_{1},x_{2})\,dx_{1}=f_{2}(x_{2}),\ \int\pi(x_{1},x_{2})\,dx_{2}=f_{1}(x_{1})}\ \int_{\mathcal{X}\times\mathcal{X}}d^{p}(x_{1},x_{2})\,\pi(x_{1},x_{2})\,dx_{1}\,dx_{2},\qquad(20)$$

where $p\geq 1$. The $p$-Wasserstein distance $W_{p}(\mathbb{P},\mathbb{Q})$ is defined as the $p$-th root of the optimal value of the optimal transport problem above. The Wasserstein distance is a metric on the space $\mathcal{P}(\mathcal{X})$ of probability measures, and the dual of (20) yields the following important duality result due to Kantorovich (Villani,, 2009):

$$W^{p}_{p}(\mathbb{P},\mathbb{Q})=\sup_{u\in L^{1}(\mathbb{P}),\,v\in L^{1}(\mathbb{Q}):\ u(x_{1})+v(x_{2})\leq d^{p}(x_{1},x_{2})}\left\{\int u(x_{1})f_{1}(x_{1})\,dx_{1}+\int v(x_{2})f_{2}(x_{2})\,dx_{2}\right\}.$$
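As a concrete special case (our own illustration, not part of the paper's development), for two empirical measures on the real line with the same number of atoms, the optimal coupling matches order statistics, so $W_{p}$ reduces to an $L_{p}$ distance between sorted samples:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1-D empirical measures with
    equally many atoms: sorting both samples gives the optimal coupling."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y), "equal sample sizes assumed for simplicity"
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```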

It should be noted that the term “Wasserstein metric” for the optimal transport distance defined above is something of a historical accident, as Kantorovich, (1942) should be credited with pioneering the theory of optimal transport and proposing the metric. However, because a work of Wasserstein (Vaserstein,, 1969) that briefly discussed the optimal transport distance was initially better known in the West, the terminology of the Wasserstein metric persists to this day (Vershik,, 2013). The optimal transport problem was also studied in the seminal work of Koopmans, (1949).

One of the appeals of the Wasserstein distance when formulating distributionally robust optimization problems lies in the tractability of the dual DRO problem. Specifically, in (12), the inner maximization requires solving an infinite-dimensional optimization problem for every $\beta$, and is in general intractable. However, if $D$ is the Wasserstein distance, the inner problem has a tractable dual minimization problem, which, when combined with the outer minimization over $\beta$, yields a simple and tractable minimization problem. This allows us to efficiently compute the WDRO estimator. Moreover, it establishes connections with the popular statistical approach of ridge regression.

Let $c\in L^{1}(\mathcal{X})$ be a general loss function and $\mathbb{P}\in\mathcal{P}(\mathcal{X})$ with density $f_{1}$. The following general duality result (Gao and Kleywegt,, 2023; Blanchet and Murthy,, 2019) provides a tractable reformulation of the Wasserstein DRIVE objective introduced in Section 2:

$$\sup_{\mathbb{Q}\in\mathcal{P}(\mathcal{X}):\,W_{p}(\mathbb{Q},\mathbb{P})\leq\rho}\int f_{2}(x)\,c(x)\,dx=\inf_{\lambda\geq 0}\Big\{\lambda\rho^{p}-\int\inf_{x_{2}\in\mathcal{X}}\big[\lambda d^{p}(x_{1},x_{2})-c(x_{2})\big]f_{1}(x_{1})\,dx_{1}\Big\},\qquad(21)$$

where $f_{2}$ is the density of $\mathbb{Q}$.

A.2 Anchor Regression of Rothenhäusler et al., (2021)

In the anchor regression framework of Rothenhäusler et al., (2021), the baseline distribution $\mathbb{P}_{0}$ on $(X,Y,U,A)$ is prescribed by the following linear structural equation model (SEM), given well-defined $\mathbf{B},\mathbf{M}$ and distributions of $A,\epsilon$:

$$\begin{pmatrix}X\\ Y\\ U\end{pmatrix}=\mathbf{B}\begin{pmatrix}X\\ Y\\ U\end{pmatrix}+\mathbf{M}A+\epsilon\quad\Longleftrightarrow\quad\begin{pmatrix}X\\ Y\\ U\end{pmatrix}=(I-\mathbf{B})^{-1}(\mathbf{M}A+\epsilon).$$

Here $U$ represents unobserved confounders, $Y$ is the outcome variable, $X$ are observed regressors, and $A$ are “anchors” that can be understood as potentially invalid instrumental variables that may violate the exclusion restriction. Under this SEM, Rothenhäusler et al., (2021) posit that potential deviations from the reference distribution $\mathbb{P}_{0}$ are driven by bounded uncertainties in the anchors $A$. Their main result provides a DRO interpretation of a modified population version of IV regression that interpolates between the IV and OLS objectives for $\gamma>1$:

$$\min_{\beta}\ \mathbb{E}[(Y-X^{T}\beta)^{2}]+(\gamma-1)\,\mathbb{E}[(P_{A}(Y-X^{T}\beta))^{2}]=\min_{\beta}\sup_{v\in C^{\gamma}}\mathbb{E}_{v}[(Y-X^{T}\beta)^{2}].\qquad(22)$$

The set of distributions $\mathbb{E}_{v}$ induced by $v\in C^{\gamma}$ is defined via the following SEM with a bounded set $C^{\gamma}$:

$$\begin{pmatrix}X\\ Y\\ U\end{pmatrix}=(I-\mathbf{B})^{-1}v,\qquad C^{\gamma}:=\{v:\,vv^{T}\preceq\gamma\,\mathbf{M}\mathbb{E}(AA^{T})\mathbf{M}^{T}\}.\qquad(23)$$

In the interpolated objective in Eq. 22, $P_{A}(\cdot)=\mathbb{E}(\cdot\mid A)$, and $\mathbb{E}[(P_{A}(Y-X^{T}\beta))^{2}]$ is the population version of the IV (TSLS) regression objective with $A$ as the instrument. Letting $\kappa=1-1/\gamma$, we can rewrite the interpolated objective on the left hand side of (22) equivalently as

$$\min_{\beta}\ \mathbb{E}[(P_{A}(Y-X^{T}\beta))^{2}]+\frac{1-\kappa}{\kappa}\,\mathbb{E}[(Y-X^{T}\beta)^{2}],\qquad(24)$$

which can be interpreted as “regularizing” the IV objective with the OLS objective, with penalty parameter $(1-\kappa)/\kappa$. Jakobsen and Peters, (2022) observe that the finite sample version of the objective in (24) is precisely that of a $k$-class estimator with parameter $\kappa$ (Theil,, 1961; Nagar,, 1959). Together with (22), this observation therefore provides a DRO interpretation of $k$-class estimators, which Jakobsen and Peters, (2022) also extend to more general instrumental variables estimation settings. Moreover, when $\kappa=1$, or equivalently $\gamma\rightarrow\infty$, we recover the standard IV objective in (24). The IV estimator therefore has a distributionally robust interpretation via (22) when the distributional shifts $v$ are unbounded.
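In finite samples, the interpolated objective admits a closed-form minimizer: the first-order condition of the left hand side of (22) gives $\hat{\beta}=(X^{T}WX)^{-1}X^{T}WY$ with $W=I+(\gamma-1)P_{A}$. The following sketch (ours, assuming full-rank design matrices) implements it.

```python
import numpy as np

def anchor_regression(Y, X, A, gamma):
    """Minimize ||Y - X b||^2 + (gamma - 1) * ||P_A (Y - X b)||^2.

    gamma = 1 recovers OLS; gamma -> infinity approaches the IV/TSLS solution.
    """
    P_A = A @ np.linalg.solve(A.T @ A, A.T)    # projection onto columns of A
    W = np.eye(len(Y)) + (gamma - 1.0) * P_A   # weight from first-order condition
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
```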

The DRO interpretation (22) of $k$-class estimators sheds new light on some old wisdom about IV estimation. As a large literature has observed and studied (Richardson,, 1968; Nelson and Startz, 1990a, b; Bound et al.,, 1995; Staiger and Stock,, 1997; Hahn et al.,, 2004; Burgess et al.,, 2011; Andrews et al.,, 2019; Young,, 2022), when instruments are weak, the usual normal approximation to the distribution of the IV estimator may be very poor, and the IV estimator is biased in small samples and under weak instrument asymptotics. Moreover, a small violation of the exclusion restriction, i.e., a direct effect of instruments on the outcome, can result in large bias when instruments are weak (Angrist and Pischke,, 2009). Consequently, IV may not perform as well as the OLS estimator in such settings. Regularizing the IV objective by the OLS objective in (24) can therefore alleviate the weak instrument problem. This improvement has also been observed for $k$-class estimators with particular choices of $\kappa$ (Fuller,, 1977; Hahn et al.,, 2004). The DRO interpretation complements this intuition based on regularizing the IV objective with the OLS objective. Insofar as weak instruments can be understood as a form of distributional shift away from standard modeling assumptions (strong first stage effects), a distributionally robust regression approach is a natural solution to the challenge of weak instruments. In the case of anchor regression, the distributional uncertainty set indexed by $v\in C^{\gamma}$ always contains distributions on $(X,Y,U,A)$ under which the association between $A$ and $X$ is weak, obtained by selecting appropriate $\|v\|\approx 0$. The DRO formulation (22) therefore demonstrates that $k$-class estimators are robust against the weak instrument problem by design. An additional insight of the DRO formulation is that $k$-class estimators and anchor regression are also optimal for predicting $Y$ with $X$ when the distribution of $(X,Y)$ may change between the training and test datasets in a bounded manner induced by the anchors $A$.

On the other hand, the DRO interpretation of $k$-class estimators also exposes their potential limitations. First, the ambiguity set in (23) does not in fact contain the reference distribution $\mathbb{P}_{0}$ itself for any finite robustness parameter, which is unsatisfactory. Moreover, the SEM in (23) implies that the instruments (anchors) $A$ cannot be influenced by the unobserved confounder $U$, which is a major source of uncertainty about the validity of instruments in applications of IV estimation. In this sense, we may understand $k$-class estimators as being robust against weak instruments (Young,, 2022), since they minimize an objective that interpolates between the OLS and IV objectives. By contrast, the DRIVE approach we propose in this paper is by design robust against invalid instruments, as the ambiguity set captures distributional shifts arising from conditional correlations between the instrument and the outcome variable, conditional on the endogenous variable.

Appendix B Related Works

Our work is related to several literatures, including distributionally robust optimization, instrumental variables estimation, and regularized (penalized) regression. Although these literatures historically developed largely independently of each other, recent works have started to explore their interesting connections, and our work can be viewed as an effort in this direction.

B.1 Distributionally Robust Optimization and Min-max Optimization

DRO has been an important research area in operations research, and traces its origins to game theory (Von Neumann and Morgenstern,, 1947). Scarf, (1958) first studied DRO in the context of inventory control under uncertainty about future demand distributions. This work was followed by a line of research on min-max stochastic optimization models, notably the works of Shapiro and Kleywegt, (2002), Calafiore and Ghaoui, (2006), and Ruszczyński, (2010). Distributional uncertainty sets based on moment conditions are considered by Dupačová, (1987); Prékopa, (2013); Bertsimas and Popescu, (2005); Delage and Ye, (2010), while uncertainty sets based on distance or divergence measures are considered by Iyengar, (2005); Wang et al., (2016). In recent years, distributional uncertainty sets based on the Wasserstein metric have gained traction, appearing in Mohajerin Esfahani and Kuhn, (2018); Blanchet et al., (2019); Blanchet and Murthy, (2019); Duchi et al., (2021), partly due to their close connections to regularized regression, such as the LASSO (Tibshirani,, 1996; Belloni et al.,, 2011) and regularized logistic regression. Other works employ alternative divergence measures, such as the KL divergence (Hu and Hong,, 2013) and, more generally, $\phi$-divergences (Ben-Tal et al.,, 2013). In this work, we focus on DRO based on the Wasserstein metric, originally proposed by Kantorovich, (1942) in the context of optimal transport, which has also become a popular tool in economics in recent years (Galichon,, 2018, 2021).

DRO has gained traction in causal inference problems in econometrics and statistics very recently. For example, Kallus and Zhou, (2021); Adjaho and Christensen, (2022); Lei et al., (2023) apply DRO in policy learning to handle distributional uncertainties. Chen et al., (2021) apply DRO to address possible mis-specification of rational expectations in the estimation of structural models. Sahoo et al., (2022) use distributional shifts to model sampling bias. Bertsimas et al., (2022) study DRO versions of classic causal inference frameworks. Fan et al., (2023) study distributional model risk when data come from multiple sources and only marginal reference measures are identified. DRO is also connected to the literature in macroeconomics on robust control (Hansen and Sargent,, 2010). A related recent line of works in econometrics also employs a min-max approach to estimation (Lewis and Syrgkanis,, 2018; Kaji et al.,, 2020; Metzger,, 2022; Cigliutti and Manresa,, 2022; Bennett and Kallus,, 2023), inspired by adversarial networks from machine learning (Goodfellow et al.,, 2014). These works leverage adversarial learning to enforce a large, possibly infinite, number of (conditional) moment constraints, in order to achieve efficiency gains. In contrast, the emphasis of the min-max formulation in our paper is on capturing potential violations of model assumptions using a distributional uncertainty set.

The DRO approach we propose in this paper is motivated by a recent line of works that reveal interesting connections between causality and notions of invariance and distributional robustness (Peters et al.,, 2016; Meinshausen,, 2018; Rothenhäusler et al.,, 2021; Bühlmann,, 2020; Jakobsen and Peters,, 2022), whereby causality is interpreted as an invariance or robustness property under distributional shifts. In particular, Rothenhäusler et al., (2021) and Jakobsen and Peters, (2022) provide a distributionally robust interpretation of the classic $k$-class estimators. In our work, instead of constructing the ambiguity set based on marginal or joint distributions as is commonly done in previous works, we propose a Wasserstein DRO version of the IV estimation problem based on distributional shifts in conditional quantities, which is then reformulated as a ridge type regularized IV estimation problem. In this regard, our estimator is fundamentally different from the $k$-class estimators, which minimize an IV regression objective regularized by an OLS objective.

B.2 Instrumental Variables Estimation

Our work is also closely related to the classic literatures in econometrics and statistics on instrumental variables estimation (regression), originally proposed and developed by Theil, (1953) and Nagar, (1959), which became widely used in applied fields in economics. Since then, many works have investigated potential challenges to instrumental variables estimation and their solutions, including invalid instruments (Fisher,, 1961; Hahn and Hausman,, 2005; Berkowitz et al.,, 2008; Kolesár et al.,, 2015) and weak instruments (Nelson and Startz, 1990a, b; Staiger and Stock,, 1997; Murray,, 2006; Andrews et al.,, 2019). Tests for weak instruments have been proposed by Stock and Yogo, (2002) and Sanderson and Windmeijer, (2016). Notably, the test of Stock and Yogo, (2002) for multi-dimensional instruments is based on the minimum eigenvalue rank test statistic of Cragg and Donald, (1993). In our Wasserstein DRIVE framework, the penalty/robustness parameter can also be selected using the minimum eigenvalue of the first stage coefficient, and it remains to further study the connections between our work and the weak instrument literature in this regard. The related econometric literature on many (weak) instruments studies the regime where the number of instruments is allowed to diverge proportionally with the sample size (Kunitomo,, 1980; Bekker,, 1994; Chamberlain and Imbens,, 2004; Chao and Swanson,, 2005; Kolesár,, 2018). In this work, we assume a fixed number of instruments to best illustrate the Wasserstein DRIVE approach, although it would be interesting to extend the framework and analysis to the many instruments setting.

Testing for invalid instruments is possible in the over-identified regime, where there are more instruments than endogenous variables (Sargan,, 1958; Kadane and Anderson,, 1977; Hansen,, 1982; Andrews,, 1999). These tests have been used in combination with variable selection methods, such as LASSO and thresholding, to select valid instruments under certain assumptions (Kang et al.,, 2016; Windmeijer et al.,, 2018; Guo et al., 2018a, ; Windmeijer et al.,, 2021). In this paper, we propose a regularization selection procedure for Wasserstein DRIVE based on a bootstrapped score quantile. In simulations, we find that the selected $\rho$ increases with the degree of instrument invalidity. It remains to further study the relation between this score quantile and test statistics for instrument invalidity in the over-identified setting. Lastly, our framework can be viewed as complementary to post-hoc sensitivity analysis for invalid instruments (Angrist et al.,, 1996; Small,, 2007; Conley et al.,, 2012): instead of bounding the potential bias of IV estimators arising from violations of model assumptions after estimation, we incorporate such potential deviations directly into the estimation procedure.

Instrumental variables estimation has also gained wide adoption in epidemiology and genetics, where it is known as Mendelian randomization (MR) (VanderWeele et al.,, 2014; Bowden et al.,, 2015; Sanderson and Windmeijer,, 2016; Emdin et al.,, 2017). An important consideration in MR is invalid instruments, because many genetic variants, which are candidate instruments in Mendelian randomization, could be correlated with the outcome variable through unknown mechanisms, either direct effects (horizontal pleiotropy) or correlations with unobserved confounders. Methods have been proposed to address these challenges, based on robust regression and regularization ideas (Bowden et al.,, 2015, 2016; Burgess et al.,, 2016, 2020). Our proposed DRIVE framework contributes to this area by providing a novel regularization method robust against potentially invalid instruments.

B.3 Regularized Regression

Our Wasserstein DRIVE framework can be viewed as an instance of a data-driven regularized IV method. In this regard, it complements the classic $k$-class estimators, which regularize the IV objective with the OLS objective (Rothenhäusler et al.,, 2021). Data-driven $k$-class estimators have been shown to enjoy better finite sample properties. These include the LIML (Anderson and Rubin,, 1949) and the Fuller estimator (Fuller,, 1977), a modification of LIML that works well when instruments are weak (Stock et al.,, 2002). More recently, Jakobsen and Peters, (2022) proposed another data-driven $k$-class estimator called the PULSE, which minimizes the OLS objective subject to a constraint set defined by statistical tests of independence between instruments and residuals. Kolesár et al., (2015) propose a modification of the $k$-class estimator that is consistent with invalid instruments whose direct effects on the outcome are independent of the first stage effects on the endogenous regressor.

There is a rich literature exploring the interactions and connections between regularized regression and instrumental variables methods. One line of works seeks to improve the finite sample performance and asymptotic properties of IV type estimators using methods from regularized regression. For example, Windmeijer et al., (2018) apply LASSO regression to the first stage, motivated by applications in genetics where one may have access to many weak or invalid instruments. Belloni et al., (2012) and Chernozhukov et al., (2015) also apply LASSO, but to select optimal instruments in the many instruments setting or when covariates are high-dimensional. Caner, (2009), Caner and Kock, (2018), and Belloni et al., (2018) apply LASSO to GMM estimators, generalizing regularized regression results from the M-estimation setting to the moment estimation setting, which includes IV estimation.

Another line of works on regularized regression, more closely related to ours, investigates the connections and equivalences between regularized regression and causal effect estimators in econometrics based on instrumental variables. Basmann, (1960a, b) and McDonald, (1977) were the first to connect $k$-class estimators to regularized regressions. Rothenhäusler et al., (2021) and Jakobsen and Peters, (2022) further study the distributional robustness of $k$-class estimators as minimizers of the TSLS objective regularized by the OLS objective. The Wasserstein DRIVE estimator that we propose in this work applies a different type of regularization, namely a square root ridge penalty, to the second stage coefficient in the TSLS objective. As such, it behaves differently from the anchor and $k$-class estimators, which regularize using the OLS objective, and it also differs from works that apply regularization to the first stage.

Appendix C Square Root Ridge Regression

In this section, we turn our attention to the square root ridge estimator in the standard regression setting. We first establish the \sqrt{n}-consistency of the square root ridge when the regularization parameter vanishes at the appropriate rate. We then consider a novel regime with non-vanishing regularization parameter and vanishing noise, revealing properties that are strikingly different from the standard ridge regression. As we will see, these observations in the standard setting help motivate and provide the essential intuitions for our results in the IV estimation setting. In short, the interesting behaviors of the square root ridge arise from its unique geometry in the regime of vanishing noise, where \sum_{i=1}^{n}\epsilon^{2}_{i}=o_{p}(n) as n\rightarrow\infty. This regime is rarely studied in conventional regression settings in statistics, but it precisely captures features of the instrumental variables estimation setting, where the projected residuals satisfy (\mathbf{Y}-\mathbf{X}\beta_{0})^{T}\mathbf{\Pi}_{\mathbf{Z}}(\mathbf{Y}-\mathbf{X}\beta_{0})=o_{p}(n) when instruments are valid and \beta_{0} is the true effect coefficient. In addition to providing intuitions for the IV estimation setting, the regularization parameter selection procedure proposed for the square root LASSO in the standard regression setting by Belloni et al., (2011) also inspires us to propose a novel procedure for the IV setting in Appendix E, which is shown to perform well in simulations.

C.1 \sqrt{n}-Consistency of the Square Root Ridge

We now consider the square root ridge estimator in the standard regression setting, and prove its \sqrt{n}-consistency. We will build on the results of Belloni et al., (2011) on the non-asymptotic estimation error of the square root LASSO estimator. Conditional on a fixed design X_{i}\in\mathbb{R}^{p}, and with \Phi the CDF of \epsilon_{i}, we consider the data generating process

Y_{i}=X_{i}^{T}\beta_{0}+\sigma\epsilon_{i}.

In this section, we rewrite the objective of the square root ridge estimation (15) as

\min_{\beta}\sqrt{\hat{Q}(\beta)}+\frac{\lambda}{n}\sqrt{\|\beta\|^{2}+1}, \qquad (25)

\hat{Q}(\beta)=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-X_{i}^{T}\beta)^{2}, \qquad (26)

and denote by \hat{\beta} the minimizer of the objective. Without loss of generality, we assume for all j,

\frac{1}{n}\sum_{i=1}^{n}X_{ij}^{2}=1.

In other words, each covariate (feature) is normalized to have unit norm. Similar to the square root LASSO case, we will show that by selecting \lambda=\mathcal{O}(\sqrt{n}) properly, or equivalently \rho=\mathcal{O}(n^{-1}) in (15), we can achieve, with probability 1-\alpha, the \sqrt{n}-consistency result

\|\hat{\beta}-\beta_{0}\|_{2}\lesssim\sigma\sqrt{p\log(2p/\alpha)/n}.

Compare this with the bound for the square root LASSO, which is

\|\hat{\beta}-\beta_{0}\|_{2}\lesssim\sigma\sqrt{s\log(2p/\alpha)/n},

where s is the number of non-zero entries of \beta_{0} and is allowed to diverge as n\rightarrow\infty. Since we do not impose assumptions on the size of the support of the p-dimensional vector \beta_{0}, if s=p is finite in the square root LASSO framework, we achieve the same bound on the estimation error. Our bound for the square root ridge is therefore sharp in this sense.
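To make the estimator concrete, the following minimal sketch (our illustration, not part of the original derivation; the simulated design, the optimizer, and the defaults c=1.1, \alpha=0.05 are assumptions) solves objective (25) with a generic smooth optimizer, using the penalty \lambda=\sqrt{p}\lambda^{\ast} discussed below.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def sqrt_ridge(X, Y, lam):
    # Minimize sqrt(mean squared residual) + (lam/n) * sqrt(||beta||^2 + 1), as in (25).
    n, p = X.shape
    obj = lambda b: np.sqrt(np.mean((Y - X @ b) ** 2)) + (lam / n) * np.sqrt(b @ b + 1.0)
    beta_init = np.linalg.lstsq(X, Y, rcond=None)[0]   # warm start at OLS
    return minimize(obj, beta_init, method="BFGS").x

rng = np.random.default_rng(0)
n, p, alpha, c = 500, 5, 0.05, 1.1
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))                    # normalize so (1/n) sum_i X_ij^2 = 1
beta_true = rng.standard_normal(p)
Y = X @ beta_true + 0.5 * rng.standard_normal(n)
lam = np.sqrt(p) * c * np.sqrt(n) * norm.ppf(1 - alpha / (2 * p))   # lambda = sqrt(p) * lambda*
print(np.linalg.norm(sqrt_ridge(X, Y, lam) - beta_true))            # estimation error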

An important quantity in the analysis is the score \tilde{S}, which is the gradient of \sqrt{\hat{Q}(\beta)} evaluated at the true parameter value \beta=\beta_{0}:

\tilde{S}=\nabla\sqrt{\hat{Q}}(\beta_{0})=\frac{\nabla\hat{Q}(\beta_{0})}{2\sqrt{\hat{Q}(\beta_{0})}}=\frac{E_{n}(X\sigma\epsilon)}{\sqrt{E_{n}(\sigma^{2}\epsilon^{2})}}=\frac{E_{n}(X\epsilon)}{\sqrt{E_{n}(\epsilon^{2})}},

where E_{n} denotes the empirical average. Similar to the lower bound on the regularization parameter in terms of the score function, \lambda\geq cn\|\tilde{S}\|_{\infty}, in Belloni et al., (2011), we will aim to impose the condition that \lambda\geq cn\|\tilde{S}\|_{2} for some c>1. Conveniently, this condition is already implied by \lambda=\sqrt{p}\lambda^{*}, where \lambda^{\ast} follows the selection procedures proposed in that paper. To see this point, note that \|\tilde{S}\|_{2}\leq\sqrt{p}\|\tilde{S}\|_{\infty}, so that with high probability, \sqrt{p}\lambda^{*}\geq\sqrt{p}cn\|\tilde{S}\|_{\infty}\geq cn\|\tilde{S}\|_{2}. Thus we may use the exact same selection procedure to achieve the desired bound, although there are other selection procedures for \lambda that would guarantee \lambda\geq cn\|\tilde{S}\|_{2} with high probability; for example, choosing the (1-\alpha)-quantile of n\|\tilde{S}\|_{2} given the X_{i}'s. We will for now adopt the selection procedure and the model assumptions in Belloni et al., (2011).

Assumption C.1.

We have \log^{2}(p/\alpha)\log(1/\alpha)=o(n) and p/\alpha\rightarrow\infty as n\rightarrow\infty.

Under this assumption, and assuming that \epsilon is normal, the selected regularization \lambda=\sqrt{p}\lambda^{*} satisfies

\lambda\lesssim\sqrt{pn\log(2p/\alpha)}

with probability 1-\alpha for all large n, using the same argument as Lemma 1 of Belloni et al., (2011).

An important quantity in deriving the bound on the estimation error is the "prediction" norm

\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}:=\frac{1}{n}\sum_{i}(X_{i}^{T}(\hat{\beta}-\beta_{0}))^{2}=(\hat{\beta}-\beta_{0})^{T}\Big(\frac{1}{n}\sum_{i}X_{i}X_{i}^{T}\Big)(\hat{\beta}-\beta_{0}),

which is related to the Euclidean norm \|\hat{\beta}-\beta_{0}\|_{2} through the Gram matrix \frac{1}{n}\sum_{i}X_{i}X_{i}^{T}. We need to make an assumption on the modulus of continuity.

Assumption C.2.

There exist a constant \kappa and an n_{0} such that for all n\geq n_{0}, \kappa\|\delta\|_{2}\leq\|\delta\|_{2,n} for all \delta\in\mathbb{R}^{p}.

When p\leq n, the Gram matrix \frac{1}{n}\sum_{i}X_{i}X_{i}^{T} will be full rank (with high probability under a random design) and concentrate around the population covariance matrix. This setting of p\leq n is different from the high-dimensional setting in the square root LASSO paper, as LASSO-type penalties are able to achieve selection consistency when p>n under sparsity, whereas ridge-type penalties generally cannot. Note also that when p>n, restricted eigenvalues are necessary when defining \kappa, and it is necessary to prove that \hat{\beta}-\beta_{0} belongs to a restricted subset of \mathbb{R}^{p} on which the bound with \kappa holds. When p\leq n, the restricted subset and eigenvalues are not necessary, and \kappa can be understood as the minimum eigenvalue of the Gram matrix, which is bounded away from 0 (with high probability). The exact value of \kappa is a function of the data generating process. For example, if covariates are generated independently of each other, then \kappa\approx 1.

Theorem C.3.

Assume that p\leq n, where p is allowed to grow with n. Let the regularization be \lambda=\sqrt{p}\lambda^{*} with \lambda^{\ast}=c\sqrt{n}\Phi^{-1}(1-\alpha/2p). Under Assumptions C.1 and C.2, the solution \hat{\beta} to the square root ridge problem

\min_{\beta}\sqrt{\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta)^{2}}+\frac{\lambda}{n}\sqrt{\|\beta\|^{2}+1}

satisfies

\|\hat{\beta}-\beta_{0}\|_{2}\leq\frac{2(\frac{1}{c}+1)}{1-(\frac{\lambda}{n})^{2}\kappa^{2}}\cdot\frac{\lambda}{n}\cdot\sigma\sqrt{E_{n}(\epsilon^{2})}\lesssim\sigma\sqrt{p\log(2p/\alpha)/n}

with probability at least 1-\alpha for all n large enough.

We remark that the quantile of the score function \frac{E_{n}(X\epsilon)}{\sqrt{E_{n}(\epsilon^{2})}} is not only critical for establishing the \sqrt{n}-consistency of the square root ridge. It is also important in practice as the basis for regularization parameter selection. In Appendix E, we propose a data-driven regularization selection procedure that uses the nonparametric bootstrap to estimate the quantile of the score, and demonstrate in Section 5 that it has very good empirical performance. The nonparametric bootstrap procedure may be of independent interest as well. Before we discuss regularization parameter selection in detail, we first focus on the statistical properties of the square root ridge in the novel vanishing noise regime.

C.2 Delayed Shrinkage of Square Root Ridge

Conventional wisdom on regressions with ridge-type penalties is that they induce shrinkage on parameter estimates, and this shrinkage happens for any non-zero regularization. Asymptotically, if the regularization parameter does not vanish as the sample size increases, the limit of the estimator, when it exists, is not equal to the true parameter. The same behavior may be expected of the square root ridge regression. Indeed, this is the case in the standard linear regression setting with constant variance, i.e., \text{Var}(\epsilon_{i})=\sigma^{2}>0 and

Y_{i}=X_{i}^{T}\beta_{0}+\epsilon_{i}.

However, as we will see, when \text{Var}(\epsilon_{i}) depends on the sample size and vanishes as n\rightarrow\infty, the square root ridge estimator can be consistent for non-vanishing penalties.

To best illustrate the intuition behind this property of the square root ridge, we start with the following simple example. Consider the data generating process written in matrix-vector form:

\mathbf{Y}=\mathbf{X}\beta_{0}+\mathbf{\epsilon}, \qquad (27)

where the rows of \mathbf{X}\in\mathbb{R}^{n\times p} are i.i.d. \mathcal{N}(0,I_{p}) and independent of \mathbf{\epsilon}\sim\mathcal{N}(0,\sigma^{2}_{n}I_{n}). Suppose that the variance of the noise vanishes: \sigma^{2}_{n}\rightarrow 0 as n\rightarrow\infty. This is not a standard regression setup, but it captures the essence of the IV estimation setting, as we show in Section 4.

Recall the square root ridge regression problem in (15), which is strictly convex:

\min_{\beta}\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho(1+\|\beta\|^{2})}.

Let \hat{\beta}^{(n)}_{\mathrm{sqrt}} be its unique minimizer. As the sample size n\rightarrow\infty, we will fix the regularization parameter \rho\equiv 1, instead of letting \rho\rightarrow 0. Standard asymptotic theory implies that \hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{\mathrm{sqrt}}, where \beta_{\mathrm{sqrt}} is the minimizer of the limit of the square root ridge objective. For the simple model (27), we can verify that

\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{1+\|\beta\|^{2}}\rightarrow_{p}\|\beta_{0}-\beta\|+\sqrt{1+\|\beta\|^{2}},

where we have used the crucial property \sigma_{n}^{2}\rightarrow 0. Therefore, under standard conditions, we have

\hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{\mathrm{sqrt}}:=\arg\min_{\beta}\|\beta_{0}-\beta\|+\sqrt{1+\|\beta\|^{2}}. \qquad (28)

Note that the limiting objective above is strictly convex and hence has a unique minimizer. Moreover,

\|\beta_{0}-\beta\|+\sqrt{1+\|\beta\|^{2}}=\|(\beta_{0},-1)-(\beta,-1)\|+\|(\beta,-1)\|\geq\|(\beta_{0},-1)\|,

using the triangle inequality. On the other hand, setting \beta=\beta_{0} in (28) achieves the lower bound \|(\beta_{0},-1)\|. Therefore, \beta_{\mathrm{sqrt}}=\beta_{0} is the unique minimizer of the limiting objective, and so

\hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{0}

with \rho=1 non-vanishing. We have therefore demonstrated that with a non-vanishing regularization parameter, the square root ridge regression can still produce a consistent estimator. This phenomenon holds more generally: the square root ridge estimator is consistent for any (limiting) regularization parameter \rho\in[0,1+\frac{1}{\|\beta_{0}\|^{2}}], as long as the noise vanishes, in the sense that \sum_{i=1}^{n}\epsilon^{2}_{i}=o_{p}(n). This condition holds for a wide variety of empirical risk minimization objectives, including the IV estimation objective.
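Before stating the general result, a small simulation (our illustration; the design, the fixed penalty \rho=1, and the optimizer are choices we make for the sketch) contrasts the two estimators in the vanishing noise regime: the square root ridge error shrinks to zero as n grows, while the standard ridge with the same fixed penalty stays near \beta_{0}/(1+\rho).

import numpy as np
from scipy.optimize import minimize

def fit(X, Y, rho, sqrt_version):
    n, p = X.shape
    if not sqrt_version:  # standard ridge, closed form
        return np.linalg.solve(X.T @ X / n + rho * np.eye(p), X.T @ Y / n)
    obj = lambda b: np.sqrt(np.mean((Y - X @ b) ** 2)) + np.sqrt(rho * (b @ b + 1))
    return minimize(obj, np.zeros(p), method="Nelder-Mead").x

rng = np.random.default_rng(1)
rho, beta0 = 1.0, np.array([1.0, -2.0, 0.5])            # rho = 1 < 1 + 1/||beta0||^2
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, 3))
    Y = X @ beta0 + rng.standard_normal(n) / np.sqrt(n)  # sigma_n^2 = 1/n -> 0
    print(n,
          np.linalg.norm(fit(X, Y, rho, True) - beta0),   # -> 0: delayed shrinkage
          np.linalg.norm(fit(X, Y, rho, False) - beta0))  # stays near ||beta0|| rho/(1+rho)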

Theorem C.4.

In the linear model (27) where the rows of \mathbf{X} are distributed i.i.d. \mathcal{N}(0,I_{p}), if \sum_{i=1}^{n}\epsilon^{2}_{i}=o_{p}(n) as the sample size n\rightarrow\infty, then for any \rho\in[0,1+\frac{1}{\|\beta_{0}\|^{2}}], the unique solution \hat{\beta}_{\mathrm{sqrt}} to (15) is consistent:

\hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{0}.

In Fig. 3, we plot the solution of the limiting square root ridge objective in a one-dimensional example. As we can see, (asymptotic) shrinkage is delayed until the regularization \rho exceeds the limit 1+\frac{1}{\|\beta_{0}\|^{2}} in the vanishing noise regime. This behavior is in stark contrast with the regular ridge regression estimator, for which shrinkage starts from the origin, even in the vanishing noise setting.

Figure 3: Limit of the square root ridge estimator in a one-dimensional example with vanishing noise, as a function of the regularization parameter \rho. The value \rho=1+\frac{1}{\|\beta_{0}\|^{2}} is the largest regularization level that guarantees consistency of the square root ridge.
Remark C.5 (Necessary Requirements of Delayed Shrinkage).

Although the delayed shrinkage property of the square root ridge is essentially a simple consequence of the triangle inequality, it relies crucially on three features of the square root ridge estimation procedure.

First, even though the square root ridge shares similarities with the standard ridge regression

\min_{\beta}\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}+\rho\|\beta\|^{2},

only the former has delayed shrinkage: the square root operations applied to the mean squared loss and the squared norm of the parameter above are essential. To see this, note that with vanishing noise, the limit of the standard ridge estimator for the model in (27) is the solution to the following problem:

\min_{\beta}\|\beta_{0}-\beta\|^{2}+\rho\|\beta\|^{2},

which results in the optimal solution \beta=\beta_{0}/(1+\rho). Therefore, with any non-zero \rho, the ridge estimator exhibits shrinkage, even with vanishing noise.
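For completeness, the shrunk limit follows from the first-order condition of the limiting ridge objective:

\nabla_{\beta}\left[\|\beta_{0}-\beta\|^{2}+\rho\|\beta\|^{2}\right]=-2(\beta_{0}-\beta)+2\rho\beta=0\quad\Longleftrightarrow\quad\beta=\frac{\beta_{0}}{1+\rho}.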

Second, the inclusion of the extra constant 1 in the regularization term \sqrt{1+\|\beta\|^{2}} is crucial in guaranteeing that \beta_{\mathrm{sqrt}}=\beta_{0} is the unique limit of the square root ridge estimator. To see this, consider instead the following modified square root ridge problem, which appears in Owen, (2007); Blanchet et al., (2019):

\min_{\beta}\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho}\|\beta\|_{2},

where the regularization term does not include an additive constant inside the square root, and so simplifies to \|\beta\|. Under model (27) with vanishing noise and \rho=1, this objective has limit \|\beta_{0}-\beta_{\mathrm{sqrt}}\|+\|\beta_{\mathrm{sqrt}}\|. Without the "curvature" guaranteed by the additional constant in the regularization term, the limiting objective is no longer strictly convex, and there are in fact infinitely many solutions that achieve the lower bound in the triangle inequality

\|\beta_{0}-\beta_{\mathrm{sqrt}}\|+\|\beta_{\mathrm{sqrt}}\|\geq\|\beta_{0}\|,

including \beta_{\mathrm{sqrt}}=0. This implies that the solution to the modified objective is no longer guaranteed to be a consistent estimator of \beta_{0}. Indeed, the inconsistency of this curvature-less version of the square root ridge estimator has also been corroborated by simulations.

Third, given that small penalties in the square root ridge objective could achieve regularization in finite samples without sacrificing consistency, one may wonder why it is not widely used. This is partly due to the standard ridge being easier to implement computationally, but the main reason is that the delayed shrinkage of the square root ridge estimator is only present in the vanishing noise regime. To see this, assume now \mathbf{\epsilon}\sim\mathcal{N}(0,\sigma^{2}I) with non-vanishing \sigma^{2}>0 in (27). As the sample size n\rightarrow\infty,

\sqrt{\frac{1}{n}\|\mathbf{Y}-\mathbf{X}\beta\|^{2}}+\sqrt{\rho(1+\|\beta\|^{2})}=\sqrt{\frac{1}{n}(\mathbf{X}\beta_{0}+\mathbf{\epsilon}-\mathbf{X}\beta)^{T}(\mathbf{X}\beta_{0}+\mathbf{\epsilon}-\mathbf{X}\beta)}+\sqrt{\rho(1+\|\beta\|^{2})}
\rightarrow_{p}\sqrt{\|\beta_{0}-\beta\|^{2}+\sigma^{2}}+\sqrt{\rho(1+\|\beta\|^{2})},

and as before \hat{\beta}^{(n)}_{\mathrm{sqrt}}\rightarrow_{p}\beta_{\mathrm{sqrt}}, the unique minimizer of the limiting objective above. The optimality condition is given by

\frac{\beta-\beta_{0}}{\sqrt{\|\beta-\beta_{0}\|^{2}+\sigma^{2}}}+\sqrt{\rho}\frac{\beta}{\sqrt{1+\|\beta\|^{2}}}=0,

and now \hat{\beta}^{(n)}_{\mathrm{sqrt}} is a consistent estimator of \beta_{0} only when \rho\rightarrow 0, unless \beta_{0}\equiv 0. For this reason, the fact that the square root ridge can be consistent with non-vanishing regularization may not be particularly useful in standard regression settings: in the presence of non-vanishing noise, shrinkage happens for any non-zero regularization, which has also been confirmed in simulations.

Although the consistency of the square root ridge estimator with non-vanishing regularization does not have immediate practical implications for conventional regression problems, it is very well suited to the instrumental variables estimation setting. The reason is that IV and TSLS regressions involve projecting the endogenous (and outcome) variables onto the space spanned by the instrumental variables in the first stage. When instruments are valid, this projection cancels out the noise terms asymptotically, resulting in \frac{1}{n}(\mathbf{Y}-\mathbf{X}\beta_{0})^{T}\mathbf{\Pi}_{\mathbf{Z}}(\mathbf{Y}-\mathbf{X}\beta_{0})\rightarrow_{p}0. The subsequent second-stage regression involving the projected variables therefore precisely corresponds to the vanishing noise regime, and we may expect a similar delayed shrinkage effect. This is indeed the case: with non-zero regularization and (asymptotically) valid instruments, we show in Section 4 that the Wasserstein DRIVE estimator is consistent. This result suggests that we can introduce robustness and regularization into standard IV estimation through the square root ridge objective without sacrificing asymptotic validity, which has important implications in practice.
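The projection argument can be checked numerically; the sketch below (our illustration, with arbitrary choices of k and \sigma) shows that the projected noise term \frac{1}{n}\epsilon^{T}\Pi_{\mathbf{Z}}\epsilon decays like k/n, while the unprojected term \frac{1}{n}\epsilon^{T}\epsilon stays near \sigma^{2}.

import numpy as np

rng = np.random.default_rng(2)
k, sigma = 4, 1.0                          # k valid instruments
for n in [200, 2000, 20000]:
    Z = rng.standard_normal((n, k))
    eps = sigma * rng.standard_normal(n)
    b = Z.T @ eps                          # eps' Pi_Z eps = b' (Z'Z)^{-1} b
    proj = b @ np.linalg.solve(Z.T @ Z, b)
    print(n, proj / n, eps @ eps / n)      # first value -> 0, second stays ~ sigma^2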

C.3 Square Root Ridge vs. Ridge for GMM and M-Estimators

We also remark on the distinction between the square root ridge and the standard ridge in the case when \rho_{n}\rightarrow 0. From Fu and Knight, (2000), we know that if \rho approaches 0 at a rate equal to or slower than O(1/\sqrt{n}), then the ridge estimator has asymptotic bias, i.e., it is not centered at \beta_{0}. However, for the square root ridge (and DRIVE), as long as \rho_{n}\rightarrow 0 at any rate, the estimator will not have any bias. This feature is a result of the self-normalization property of the square root ridge. In (19), the second term results from

\sqrt{n\rho(1+\|\beta_{0}+\delta/\sqrt{n}\|^{2})}-\sqrt{n\rho(1+\|\beta_{0}\|^{2})}=\frac{n\rho\beta_{0}^{T}}{\sqrt{n\rho(1+\|\beta_{0}\|^{2})}}\cdot\delta/\sqrt{n}+o(\delta/\sqrt{n})
\rightarrow\frac{\sqrt{\rho}\beta_{0}^{T}\delta}{\sqrt{1+\|\beta_{0}\|^{2}}},

which does not depend on n. In this sense, the parameter \rho in the square root ridge is scale-free, unlike the regularization parameter in the standard ridge case, whose natural scale is O(1/\sqrt{n}). In the same spirit, when \rho does not vanish, the resulting square root ridge estimator behaves similarly to a standard ridge estimator with a regularization parameter vanishing at rate \Theta(1/\sqrt{n}). Moreover, the amount of shrinkage essentially does not depend on the magnitude of \beta_{0}, due to the normalization of \beta_{0} by \sqrt{1+\|\beta_{0}\|^{2}}, which is also different from the standard ridge setting.

Lastly, we discuss the distinction between our work and that of Blanchet et al., (2022), which analyzes the asymptotic properties of a general class of DRO estimators. In that work, the original estimators are based on minimizing a sample loss of the form

\frac{1}{n}\sum_{i=1}^{n}\ell(X_{i},Y_{i},\beta),

which encompasses most M-estimators, including the maximum likelihood estimator, and they focus on the case when \rho_{n}\rightarrow 0. However, the IV (TSLS) estimator is different in that it is a moment-based estimator, more precisely a GMM estimator (Hansen,, 1982). The key distinction between these estimators is that the objective function of GMM estimators (and Z-estimators based on estimating equations) usually converges to a weighted distance function that evaluates to 0 at the true parameter \beta_{0}, whereas the objectives of M-estimators tend to converge to a limit that does not vanish even at the true parameter. To see this distinction more precisely, consider the limit of the OLS objective under the linear model Y_{i}=X_{i}^{T}\beta_{0}+\epsilon_{i} with \mathbb{E}(X_{i}\epsilon_{i})=0 and \frac{1}{n}\mathbf{X}^{T}\mathbf{X}\rightarrow_{p}I_{p}:

\frac{1}{n}(\mathbf{Y}-\mathbf{X}\beta)^{T}(\mathbf{Y}-\mathbf{X}\beta)=\frac{1}{n}(\mathbf{X}\beta_{0}+\epsilon-\mathbf{X}\beta)^{T}(\mathbf{X}\beta_{0}+\epsilon-\mathbf{X}\beta)
\rightarrow_{p}(\beta_{0}-\beta)^{T}(\beta_{0}-\beta)+\sigma^{2}(\epsilon),

which is minimized at \beta_{0}, achieving a minimum of \sigma^{2}(\epsilon). On the other hand, consider the following GMM version of the OLS estimator, based on the moment condition \mathbb{E}(X_{i}\epsilon_{i})=0:

\min_{\beta}\frac{1}{n}\left\{(\mathbf{Y}-\mathbf{X}\beta)^{T}\mathbf{X}\right\}W\left\{\mathbf{X}^{T}(\mathbf{Y}-\mathbf{X}\beta)\right\},

where W is a weighting matrix, with the optimal choice being (\mathbf{X}^{T}\mathbf{X})^{-1} in this setting. We have, assuming again \frac{1}{n}\mathbf{X}^{T}\mathbf{X}\rightarrow_{p}I_{p},

\frac{1}{n}\left\{(\mathbf{Y}-\mathbf{X}\beta)^{T}\mathbf{X}\right\}(\mathbf{X}^{T}\mathbf{X})^{-1}\left\{\mathbf{X}^{T}(\mathbf{Y}-\mathbf{X}\beta)\right\}=\left\{\frac{1}{n}(\mathbf{Y}-\mathbf{X}\beta)^{T}\mathbf{X}\right\}\Big(\frac{1}{n}\mathbf{X}^{T}\mathbf{X}\Big)^{-1}\left\{\frac{1}{n}\mathbf{X}^{T}(\mathbf{Y}-\mathbf{X}\beta)\right\}
\rightarrow_{p}(\beta_{0}-\beta)^{T}I(\beta_{0}-\beta)=\|\beta_{0}-\beta\|^{2},

which is also minimized at \beta_{0} but achieves a minimum value of 0. This distinction between M-estimators and Z-estimators (and GMM estimators) is negligible in the standard setting without the distributionally robust optimization component, and in fact the standard OLS estimator is preferable to the GMM version for being more stable (Hall,, 2003). However, when we apply square root ridge regularization to these estimators, they start behaving differently: only regularized regression based on GMM and Z-estimators enjoys consistency with a non-zero \rho>0. In Appendix D.1, we exploit this property to generalize our results and develop asymptotic theory for a general class of GMM estimators.
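The contrast is easy to see numerically; in the sketch below (our illustration, with an arbitrary design), the OLS objective evaluated at \beta_{0} converges to \sigma^{2}, while the optimally weighted GMM objective at \beta_{0} converges to 0.

import numpy as np

rng = np.random.default_rng(3)
sigma, beta0 = 1.0, np.array([1.0, -1.0])
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, 2))
    Y = X @ beta0 + sigma * rng.standard_normal(n)
    r = Y - X @ beta0                            # residuals at the true parameter
    g = X.T @ r / n                              # sample moments E_n[X eps]
    print(n,
          r @ r / n,                             # M-estimator objective -> sigma^2
          g @ np.linalg.solve(X.T @ X / n, g))   # weighted GMM objective -> 0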

Appendix D Extensions to GMM Estimation and q-Wasserstein Distances

In this section, we consider generalizations of the framework and results in the main paper. We first formulate a Wasserstein distributionally robust GMM estimation framework, and generalize the asymptotic results on Wasserstein DRIVE to this setting. We then consider Wasserstein DRIVE with the q-Wasserstein distance where q\neq 2, and demonstrate that the resulting estimator enjoys a similar consistency property with a non-vanishing robustness/regularization parameter.

D.1 Wasserstein Distributionally Robust GMM

In this section, we consider general GMM estimation and propose a distributionally robust GMM estimation framework. Let \theta_{0}\in\mathbb{R}^{p} be the true parameter vector in the interior of some compact space \Theta\subseteq\mathbb{R}^{p}. Let \psi(W,\theta) be a vector of moments that satisfy

\mathbb{E}[\psi(W_{i},\theta_{0})]=0

for all i, where \{W_{1},\dots,W_{n}\} are independent but not necessarily identically distributed variables. Let \psi_{i}(\theta)=\psi(W_{i},\theta). We consider the GMM estimators that minimize the objective

\min_{\theta}\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right),

where W_{n} is a positive definite weight matrix, e.g., the weight matrix corresponding to the two-step or continuous updating estimator, and \frac{1}{n}\sum_{i}\psi_{i}(\theta) are the sample moments under the empirical distribution \mathbb{P}_{n} on \psi(\theta). Both IV estimation and the GMM formulation of OLS regression fall under this framework. When we are uncertain about the validity of the moment conditions, similarly to Wasserstein DRIVE, we consider a regularized regression objective given by

\min_{\theta}\sqrt{\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)}+\sqrt{\rho(1+\|\theta\|^{2})}. \qquad (29)
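As a concrete illustration (ours; the moment function, weight matrix, and optimizer are choices we make for the sketch, not prescribed by the text), the following implements objective (29) for the IV moments \psi_{i}(\theta)=Z_{i}(Y_{i}-X_{i}^{T}\theta) with W_{n}=(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z})^{-1}.

import numpy as np
from scipy.optimize import minimize

def regularized_gmm_iv(Z, X, Y, rho):
    # Square root ridge regularized GMM objective (29), specialized to IV moments.
    n = len(Y)
    W = np.linalg.inv(Z.T @ Z / n)                     # weight matrix W_n

    def objective(theta):
        g = Z.T @ (Y - X @ theta) / n                  # sample moments (1/n) sum_i psi_i(theta)
        return np.sqrt(g @ W @ g) + np.sqrt(rho * (theta @ theta + 1.0))

    theta_init = np.linalg.lstsq(X, Y, rcond=None)[0]  # warm start at OLS
    return minimize(objective, theta_init, method="Nelder-Mead").x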

We will study the asymptotic properties of this regularized GMM objective. We make use of the following sufficient technical conditions in Caner, (2009) on GMM estimation to simplify the proof.

Assumption D.1.

The following conditions are satisfied:

  1. For all i and \theta_{1},\theta_{2}\in\Theta, we have \|\psi_{i}(\theta_{1})-\psi_{i}(\theta_{2})\|\leq B_{i}\|\theta_{1}-\theta_{2}\|, with \lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}B_{i}^{d}<\infty for some d>2; \sup_{\theta\in\Theta}\mathbb{E}\|\psi_{i}(\theta)\|^{d}<\infty for some d>2.

  2. Let m_{n}(\theta):=\frac{1}{n}\sum_{i}\mathbb{E}\psi_{i}(\theta) and assume that m_{n}(\theta)\rightarrow m(\theta) uniformly over \Theta, m_{n}(\theta) is continuously differentiable in \theta, m(\theta)=0 if and only if \theta=\theta_{0}, and m(\theta) is continuous in \theta; the Jacobian matrix \partial m_{n}(\theta)/\partial\theta\rightarrow_{p}J(\theta) in a neighborhood of \theta_{0}, and J(\theta_{0}) has full rank.

  3. W_{n}(\theta) is positive definite and continuous on \Theta, and W_{n}(\theta)\rightarrow_{p}W(\theta) uniformly in \theta. W(\theta) is continuous in \theta and positive definite for all \theta\in\Theta.

  4. The population objective m(\theta)^{T}W(\theta)m(\theta) is lower bounded by the squared distance \|\theta-\theta_{0}\|^{2}, i.e., m(\theta)^{T}W(\theta)m(\theta)\geq\overline{\rho}\|\theta-\theta_{0}\|^{2} for all \theta\in\Theta and some \overline{\rho}>0.

See also Andrews, (1994); Stock and Wright, (2000), which assume conditions similar to 1-3 on the GMM estimation setup. Condition 4 requires that the weighted moment be bounded below by a quadratic function near \theta_{0}. Under these conditions, we have the following result.

Theorem D.2.

Under Assumption D.1, the unique solution \hat{\theta}^{GMM} to

\min_{\theta}\sqrt{\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)}+\sqrt{\rho_{n}(1+\|\theta\|^{2})}

converges to the solution \theta^{GMM} of the population objective

\min_{\theta}\sqrt{m(\theta)^{T}W(\theta)m(\theta)}+\sqrt{\rho(1+\|\theta\|^{2})}.

Moreover, whenever \rho\leq\overline{\rho}, \theta^{GMM}=\theta_{0}, so that \hat{\theta}^{GMM}\rightarrow_{p}\theta_{0}.

Therefore, the square root ridge regularized GMM estimator also satisfies the consistency property with a non-zero regularization parameter \rho. Next, we consider the general q-Wasserstein distance with q\neq 2.

D.2 Generalization to qq-Wasserstein DRIVE

The duality result in Theorem 3.1 can be generalized to q-Wasserstein ambiguity sets. The resulting estimator can enjoy a similar consistency result as the square root Wasserstein DRIVE (q=2), but only when q\in(1,2]. This is because the limiting objective can be written as (assuming \rho=1 and \lambda_{p}(\gamma^{T}\gamma)=1)

\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\gamma(\beta-\beta_{0})}+\sqrt[p]{\|\beta\|^{p}+1},

where 1/p+1/q=1. When q\in(1,2], p\in[2,\infty), and so \|x\|_{2}\geq\|x\|_{p}. As a result, the limiting objective is bounded below by

\|\beta-\beta_{0}\|_{2}+\sqrt[p]{\|\beta\|^{p}+1}\geq\|(\beta,-1)-(\beta_{0},-1)\|_{p}+\sqrt[p]{\|\beta\|^{p}+1}
=\|(\beta,-1)-(\beta_{0},-1)\|_{p}+\|(\beta,-1)\|_{p}
\geq\|(\beta_{0},-1)\|_{p},

with equality holding in both inequalities if and only if \beta=\beta_{0}, i.e., \beta_{0} is again the unique minimizer of the limiting objective. We therefore have the following result.

Corollary D.3.

Under the same assumptions as Theorem 4.1, the regularized regression problem

\min_{\beta}\sqrt{\frac{1}{n}\sum_{i}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)_{i}^{2}}+\sqrt[p]{\rho_{n}(\|\beta\|^{p}+1)} \qquad (30)

has a unique solution that converges in probability to \beta_{0} whenever q\in(1,2] and \lim_{n\rightarrow\infty}\rho_{n}\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma).

Appendix E Regularization Parameter Selection for Wasserstein DRIVE

The selection of penalty/regularization parameters is an important consideration for all regularized regression problems. The most common approach is cross validation based on loss function minimization. However, for Wasserstein DRIVE, this standard cross validation procedure may not adequately address the challenges and goals of DRIVE. For example, from Theorem 4.1 we know that the Wasserstein DRIVE is only consistent when the penalty parameter is bounded above. We therefore need to take this result into account when selecting the penalty parameter. In this section, we discuss two selection procedures, one based on the first stage regression coefficient, and the other based on quantiles of the score estimated using a nonparametric bootstrap procedure, which is also of independent interest. We connect our procedures to existing works in the literature on weak and invalid IVs and investigate their empirical performance in Section 5.

E.1 Selecting \rho Based on an Estimate of the First Stage Coefficient

Theorem 4.1 guarantees that as long as the regularization parameter converges to a value in the interval [0,\sigma_{\min}(\gamma)], Wasserstein DRIVE is consistent. A natural procedure to select \rho is thus to compute the minimum singular value \rho_{\max}:=\sigma_{\min}(\hat{\gamma}) of the first stage regression coefficient estimate \hat{\gamma}, and then select a regularization parameter \rho=c\cdot\rho_{\max} for c\in[0,1]. In Section 5, we verify that this procedure produces consistent DRIVE estimators whenever instruments are valid. Moreover, when the instrument is invalid or weak, Wasserstein DRIVE enjoys superior finite sample properties, outperforming the standard IV, OLS, and related estimators in estimation accuracy and in prediction accuracy under distributional shift. This approach is also related to the test of Cragg and Donald, (1993), which was originally used to test for under-identification, and was later used by Stock and Yogo, (2002) to test for weak instruments. In the Cragg-Donald test, the minimum eigenvalue of the first stage rank matrix is used to construct the F-statistic.
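A minimal sketch (ours; the data generating process and the choice c=0.5 are illustrative assumptions) of this rule:

import numpy as np

def select_rho_first_stage(Z, X, c=0.5):
    # First stage OLS estimate gamma_hat (k x p), then rho = c * sigma_min(gamma_hat).
    gamma_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
    return c * np.linalg.svd(gamma_hat, compute_uv=False).min()

rng = np.random.default_rng(4)
n = 1000
Z = rng.standard_normal((n, 3))
gamma = np.array([[1.0, 0.3], [0.5, 1.0], [0.2, -0.4]])
X = Z @ gamma + rng.standard_normal((n, 2))
print(select_rho_first_stage(Z, X))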

Although selecting \rho based on the first stage coefficient gives rise to Wasserstein DRIVE estimators that perform well in practice, one important challenge remains. Recall that violations of the exclusion restriction can be viewed as a form of distributional shift. We therefore expect that as the degree of invalidity increases, the distributional shift becomes stronger. From the DRO formulation of DRIVE in Eq. 12, we know that the regularization parameter \rho is also the radius of the Wasserstein ambiguity set. Therefore, \rho should adaptively increase with increasingly invalid instruments. However, since the selection procedure proposed here only depends on the first stage estimate, it does not take this consideration into account. More importantly, when the instruments are weak, the smallest singular value of the first stage coefficient is likely to be very close to zero, which results in a DRIVE estimate with a very small penalty parameter that may therefore have similar problems as the standard IV. We next introduce another parameter selection procedure for \rho, based on Theorem C.3, that is able to better handle invalid and weak instruments.

E.2 Selecting \rho Based on Nonparametric Bootstrap of Quantile of Score

Recall that the square root LASSO uses the following valid penalty:

\lambda^{\ast}=cn\|\tilde{S}\|_{\infty},

where the score function \tilde{S}=\nabla\hat{Q}^{1/2}(\beta_{0})=\frac{E_{n}(x\epsilon)}{\sqrt{E_{n}(\epsilon^{2})}} with

\hat{Q}(\beta)=\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta)^{2},

and c=1.1 is a constant from Bickel et al., (2009). The intuition for this penalty level comes from the simplest case \beta_{0}\equiv 0, when the optimality condition requires \lambda\geq n\|\tilde{S}\|_{\infty}. To estimate \|\tilde{S}\|_{\infty}, Belloni et al., (2011) propose to estimate the empirical (1-\alpha)-quantile (conditional on the X_{i}) of \frac{\|E_{n}(x\epsilon)\|_{\infty}}{\sqrt{E_{n}(\epsilon^{2})}} by sampling i.i.d. errors \epsilon from the known error distribution \Phi with zero mean and variance 1, resulting in

\lambda^{\ast}=c\sqrt{n}\Phi^{-1}(1-\alpha/2p), \qquad (31)

where the confidence level 1-\alpha is usually set to 0.95.

The consistency result in Theorem C.3 then suggests a natural choice of penalty parameter \rho for the square root ridge, given by \sqrt{\rho}=\frac{\sqrt{p}}{n}\lambda^{\ast}, where \lambda^{\ast} is constructed from (31). However, there are two main challenges when applying this regularization parameter selection procedure to Wasserstein DRIVE in the instrumental variables estimation setting. First, it requires prior knowledge of the type of distribution \Phi of the errors \epsilon, e.g., Gaussian, even if we do not need its variance. Second, \sqrt{\rho}=\frac{\sqrt{p}}{n}\lambda^{\ast} is only valid for the square root ridge problem in the standard regression setting without instruments. When applied to the IV setting, the empirical risk is now

\hat{Q}(\beta)=\frac{1}{n}\sum_{i}(\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta)^{2},

where \tilde{Y}_{i}=(\Pi_{\mathbf{Z}}\mathbf{Y})_{i} and \tilde{X}_{i}=(\Pi_{\mathbf{Z}}\mathbf{X})_{i} are variables projected onto the instrument space. This means that the "observations" (\tilde{Y}_{i},\tilde{X}_{i}) are no longer independent. Therefore, the i.i.d. assumption on the errors in the standard regression setting no longer holds.

We propose the following iterative procedure based on the nonparametric bootstrap that simultaneously addresses the two challenges above. Given a starting estimate \beta^{(0)} of \beta_{0} (say the IV estimator), we compute the residuals r_{i}^{(0)}=\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta^{(0)}. Then we bootstrap these residuals to compute the empirical quantile of

\frac{\|E_{n}(\tilde{x}\epsilon)\|_{\infty}}{\sqrt{E_{n}(\epsilon^{2})}},

where \epsilon is drawn uniformly with replacement from the residuals r_{i}. The bootstrapped quantile then replaces \Phi^{-1}(1-\alpha/2p) in (31) to give a penalty level \rho, which we can use to solve the square root ridge problem to obtain a new estimate \beta^{(1)}. Then we use \beta^{(1)} to compute new residuals r_{i}^{(1)}=\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta^{(1)}, and repeat the process. In practice, we can use the OLS or TSLS estimate as the starting point \beta^{(0)}. Fig. 4 shows that this procedure converges very quickly and does not depend on the initial \beta^{(0)}. Moreover, in Section 5 we demonstrate that the resulting Wasserstein DRIVE estimator has superior estimation performance in terms of \ell^{2} error, as well as prediction under distributional shift.

Figure 4: Left: Penalty parameter convergence as a function of iteration number, with \beta^{(0)} starting from OLS, TSLS, and TSLS ridge estimates. Right: Converged penalty for the standard linear regression model as a function of sample size.
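A sketch (ours) of the iterative procedure described above; the number of bootstrap draws, the iteration count, and the reuse of the sqrt_ridge solver sketched in Appendix C.1 are implementation choices, not prescribed by the text.

import numpy as np

def bootstrap_rho(X_t, Y_t, beta_init, alpha=0.05, c=1.1, n_boot=500, n_iter=10, seed=0):
    # X_t, Y_t: projected data (Pi_Z X, Pi_Z Y); beta_init: e.g. OLS or TSLS estimate.
    rng = np.random.default_rng(seed)
    n, p = X_t.shape
    beta, rho = beta_init, None
    for _ in range(n_iter):
        resid = Y_t - X_t @ beta
        stats = []
        for _ in range(n_boot):
            e = rng.choice(resid, size=n, replace=True)   # resample residuals
            stats.append(np.abs(X_t.T @ e / n).max() / np.sqrt(np.mean(e ** 2)))
        q = np.quantile(stats, 1 - alpha)    # bootstrap quantile of the score norm
        rho = p * (c * q) ** 2               # sqrt(rho) = (sqrt(p)/n) lambda*, lambda* = c n q
        beta = sqrt_ridge(X_t, Y_t, n * np.sqrt(rho))   # solver sketched in Appendix C.1
    return rho, beta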

E.3 Bootstrapped Score Quantile As Test Statistic for Invalid Instruments

When instruments are valid, one should expect that the bootstrapped quantiles converge to 0. We next formalize this intuition in Proposition E.1 and also confirm it in numerical experiments.

Proposition E.1.

The bootstrapped quantiles converge to 0 when instruments are valid.

Figure 5: Penalty strength selected based on nonparametric bootstrap of score quantile vs. correlation strength between invalid instrument and unobserved confounders.

More importantly, in practice we observe that the bootstrapped quantile increases with the degree of instrument invalidity. Fig. 5 illustrates this phenomenon with increasing correlation between the instruments and the unobserved confounder. The intuition behind this observation is that the quantile essentially describes the orthogonality (moment) condition for valid IVs, and so should be close to zero with a valid IV; a large value therefore indicates a possible violation. Thus, the bootstrapped quantile could potentially be used as a test statistic for invalid instruments, using for example permutation tests. Equivalently, in a sensitivity analysis it could be used as a sensitivity parameter, based on which we can bound the worst-case bias of IV/OLS models.

We provide a more detailed discussion to further justify our proposal. In a linear regression setting, the quantity

\frac{\|\sum_{i}(x\epsilon)\|_{\infty}}{\sqrt{\sum_{i}(\epsilon^{2})}}

is a test statistic for the orthogonality condition \mathbb{E}[X(Y-X^{T}\beta)]=0, which holds asymptotically for \beta equal to the OLS estimator when the model assumption that X is independent of the error term is correct. When this quantity is far from zero, it indicates a violation of the orthogonality condition X\perp\epsilon, in which case a non-zero penalty could be beneficial. Similarly, in a TSLS model,

\frac{\|\sum_{i}(\tilde{x}\epsilon)\|_{\infty}}{\sqrt{\sum_{i}(\epsilon^{2})}}

is a test statistic for the orthogonality condition \mathbb{E}[\tilde{X}(\tilde{Y}-\tilde{X}^{T}\beta)]=0, which holds asymptotically when \beta is the TSLS estimator and Z is a valid instrument. A large value of this statistic therefore indicates potential violations of the IV assumptions, and the selected penalty can be seen as assessing the degree of such violations. We may also compare this quantity with the Sargan test statistic (Sargan,, 1958) for instrument invalidity in the over-identified setting and note the similarities.

Remark E.2.

Besides the data-driven procedures discussed above, we can also consider incorporating information provided by statistical tests for IV estimation. For example, in over-identified settings, the Sargan-Hansen test (Sargan,, 1958; Hansen,, 1982) can be used to test the exclusion restriction, and thus to provide evidence on the validity of the instrument. For testing weak instruments, the popular test of Stock and Yogo, (2002) can be used. This proposal is also related to our observation that \rho based on bootstrapped quantiles increases with the degree of invalidity, i.e., a direct effect on the outcome or correlation with confounders, and can therefore potentially be used as a test statistic for the reliability of the IV estimator. We leave a detailed investigation of this proposal to future work.

Appendix F Proofs

F.1 Proof of Theorem 3.1

Proof.

The proof of Theorem 3.1 relies on a general duality result on Wasserstein DRO, with different variations derived in (Gao and Kleywegt,, 2023; Blanchet and Murthy,, 2019; Sinha et al.,, 2017). We start with the inner problem in the objective in (12):

\sup_{\{\mathbb{Q}:D(\mathbb{Q},\tilde{\mathbb{P}}_{n})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}-\tilde{X}^{T}\beta)^{2}\right],

where D is the 2-Wasserstein distance and \tilde{\mathbb{P}}_{n} is the empirical distribution on the projected data \{\tilde{Y}_{i},\tilde{X}_{i}\}_{i=1}^{n}\equiv\{(\Pi_{Z}\mathbf{Y})_{i},(\Pi_{Z}\mathbf{X})_{i}\}_{i=1}^{n}. Proposition 1 of Sinha et al., (2017) and Proposition 1 of Blanchet et al., (2019) both imply that

\sup_{\{\mathbb{Q}:D(\mathbb{Q},\mathbb{P})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}-\tilde{X}^{T}\beta)^{2}\right]=\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i=1}^{n}\phi_{\gamma}(\beta;(\tilde{X}_{i},\tilde{Y}_{i})),

where the "robust" loss function is

\phi_{\gamma}(\beta;(\tilde{X},\tilde{Y}))=\sup_{(X,Y)}(Y-X^{T}\beta)^{2}-\gamma\|X-\tilde{X}\|_{2}^{2}-\gamma(Y-\tilde{Y})^{2}
=\sup_{W}W^{T}\alpha\alpha^{T}W-\gamma\|W-\tilde{W}\|_{2}^{2},

with W=(X,Y), \tilde{W}=(\tilde{X},\tilde{Y}) and \alpha=(-\beta,1). Note that \gamma is always chosen large enough, i.e., \gamma I-\alpha\alpha^{T}\succeq 0, so that the objective W^{T}\alpha\alpha^{T}W-\gamma\|W-\tilde{W}\|_{2}^{2} is concave in W. Otherwise, the supremum over W in the inner problem is unbounded. Therefore, the first order condition is sufficient:

\alpha\alpha^{T}W-\gamma(W-\tilde{W})=0,

so that

(\alpha\alpha^{T}-\gamma I)W=-\gamma\tilde{W},

and

W=\gamma(\gamma I-\alpha\alpha^{T})^{-1}\tilde{W}=(I-\alpha\alpha^{T}/\gamma)^{-1}\tilde{W},

where I-\alpha\alpha^{T}/\gamma is invertible if \gamma I-\alpha\alpha^{T} is positive definite, which is required to make sure that the quadratic is concave in W. The supremum is then given by

\tilde{W}^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}\alpha\alpha^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}\tilde{W}-\gamma\tilde{W}^{T}((I-\alpha\alpha^{T}/\gamma)^{-1}-I)^{2}\tilde{W}
=\tilde{W}^{T}\big((I-\alpha\alpha^{T}/\gamma)^{-1}\alpha\alpha^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}-\gamma((I-\alpha\alpha^{T}/\gamma)^{-1}-I)^{2}\big)\tilde{W}\equiv\|\tilde{W}\|_{A}^{2},

where

A=(I-\alpha\alpha^{T}/\gamma)^{-1}\alpha\alpha^{T}(I-\alpha\alpha^{T}/\gamma)^{-1}-\gamma((I-\alpha\alpha^{T}/\gamma)^{-1}-I)^{2}.

Using the Sherman-Morrison formula (Sherman and Morrison,, 1950), whose condition is satisfied if \gamma I-\alpha\alpha^{T} is positive definite,

(I-\alpha\alpha^{T}/\gamma)^{-1}=I+\frac{1}{\gamma-\alpha^{T}\alpha}\alpha\alpha^{T},

and A can be simplified as

A=\frac{\gamma}{\gamma-\alpha^{T}\alpha}\alpha\alpha^{T}.
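This simplification can also be verified numerically; the sketch below (ours) compares the two expressions for A at a random \alpha and an admissible \gamma.

import numpy as np

rng = np.random.default_rng(5)
alpha = rng.standard_normal(4)
gamma = alpha @ alpha + 2.0                             # ensures gamma*I - alpha alpha^T > 0
I = np.eye(4)
M = np.linalg.inv(I - np.outer(alpha, alpha) / gamma)   # (I - alpha alpha^T / gamma)^{-1}
A_long = M @ np.outer(alpha, alpha) @ M - gamma * (M - I) @ (M - I)
A_short = gamma / (gamma - alpha @ alpha) * np.outer(alpha, alpha)
print(np.allclose(A_long, A_short))                     # True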

In summary, for each projected observation (for the IV estimate) \tilde{W}_{i}=(\tilde{X}_{i},\tilde{Y}_{i}), we can obtain a new "robustified" sample using the above operation, then minimize the following modified empirical risk constructed from the robustified samples:

\min_{\beta}\sup_{\{\mathbb{Q}:D(\mathbb{Q},\mathbb{P})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}_{i}-\tilde{X}_{i}^{T}\beta)^{2}\right]\Leftrightarrow\min_{\beta}\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i=1}^{n}\phi_{\gamma}(\beta;(\tilde{X}_{i},\tilde{Y}_{i}))
\Leftrightarrow\min_{\beta}\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i}\|(\tilde{X}_{i},\tilde{Y}_{i})\|_{A}^{2},

where for fixed \beta, \gamma\geq 0 is always chosen large enough so that \phi_{\gamma}(\beta;X,Y) is finite.

Now, the inner minimization problem can be further solved explicitly. Recall that it is equal to

\inf_{\gamma\geq 0}\gamma\rho+\frac{1}{n}\sum_{i}\tilde{W}_{i}^{T}\Big(\frac{\gamma}{\gamma-\alpha^{T}\alpha}\alpha\alpha^{T}\Big)\tilde{W}_{i},

which is convex in \gamma and hence minimized at the first order condition:

\rho=\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}\,\alpha^{T}\alpha}{(\gamma-\alpha^{T}\alpha)^{2}},

or \gamma=\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}\,\alpha^{T}\alpha}{\rho}}+\alpha^{T}\alpha, where we have chosen the larger root since only it is guaranteed to satisfy \gamma I-\alpha\alpha^{T}\succeq 0 for any \alpha=(-\beta,1).

Plugging this expression of \gamma into the objective, and using the notation \ell_{IV}:=\frac{1}{n}\sum_{i}(\tilde{Y}_{i}-\beta^{T}\tilde{X}_{i})^{2}, we obtain

\gamma\rho+\frac{1}{n}\sum_{i}\|(\tilde{X}_{i},\tilde{Y}_{i})\|_{A}^{2}=\sqrt{\rho\alpha^{T}\alpha\cdot\frac{1}{n}\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}+\rho\alpha^{T}\alpha+\frac{1}{n}\sum_{i}\tilde{W}_{i}^{T}\Bigg(\frac{1}{1-\frac{\alpha^{T}\alpha}{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}\,\alpha^{T}\alpha}{\rho}}+\alpha^{T}\alpha}}\alpha\alpha^{T}\Bigg)\tilde{W}_{i}
=\sqrt{\rho\alpha^{T}\alpha\cdot\ell_{IV}}+\rho\alpha^{T}\alpha+\frac{\ell_{IV}}{\frac{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}{\rho}}}{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}{\rho}}+\sqrt{\alpha^{T}\alpha}}}
=\sqrt{\rho\alpha^{T}\alpha\cdot\ell_{IV}}+\rho\alpha^{T}\alpha+\ell_{IV}+\frac{\sqrt{\alpha^{T}\alpha}}{\sqrt{\frac{1}{n}\frac{\sum_{i}\tilde{W}_{i}^{T}\alpha\alpha^{T}\tilde{W}_{i}}{\rho}}}\ell_{IV}
=2\sqrt{\rho\alpha^{T}\alpha\cdot\ell_{IV}}+\rho\alpha^{T}\alpha+\ell_{IV}
=(\sqrt{\ell_{IV}}+\sqrt{\rho\alpha^{T}\alpha})^{2}.

Therefore, we have proved that the Wasserstein DRIVE objective

\min_{\beta}\sup_{\{\mathbb{Q}:D(\mathbb{Q},\tilde{\mathbb{P}}_{n})\leq\rho\}}\mathbb{E}_{\mathbb{Q}}\left[(\tilde{Y}-\tilde{X}^{T}\beta)^{2}\right]

is equivalent to the following square root ridge regularized IV objective:

\min_{\beta}\sqrt{\frac{1}{n}\sum_{i}(\tilde{Y}_{i}-\beta^{T}\tilde{X}_{i})^{2}}+\sqrt{\rho(\|\beta\|^{2}+1)}. ∎

F.2 Proof of Theorem 4.1

Proof.

We will show that as n\rightarrow\infty, \hat{\beta}^{\text{DRIVE}}\rightarrow_{p}\beta_{0} as long as \rho_{n}\rightarrow\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma). Recall the linear IV model (2),

Y=\beta^{T}_{0}X+\epsilon,\qquad X=\gamma^{T}Z+\xi,

with instrument relevance and exogeneity conditions

\text{rank}(\mathbb{E}[ZX^{T}])=p,\qquad\mathbb{E}[Z\epsilon]=0,\qquad\mathbb{E}[Z\xi^{T}]=\mathbf{0}.

First, we compute the limit of the objective function (17), reproduced below:

\sqrt{\frac{1}{n}\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}}+\sqrt{\rho_{n}(\|\beta\|^{2}+1)}. \qquad (32)

For the loss term, we have

\sqrt{\frac{1}{n}\sum_{i}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)_{i}^{2}}=\sqrt{\frac{1}{n}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)^{T}(\Pi_{\mathbf{Z}}\mathbf{Y}-\Pi_{\mathbf{Z}}\mathbf{X}\beta)}
=\sqrt{\frac{1}{n}(\Pi_{\mathbf{Z}}(\mathbf{X}\beta_{0}+\mathbf{\epsilon})-\Pi_{\mathbf{Z}}\mathbf{X}\beta)^{T}(\Pi_{\mathbf{Z}}(\mathbf{X}\beta_{0}+\mathbf{\epsilon})-\Pi_{\mathbf{Z}}\mathbf{X}\beta)}
=\sqrt{\frac{1}{n}(\Pi_{\mathbf{Z}}\mathbf{X}(\beta_{0}-\beta)+\Pi_{\mathbf{Z}}\mathbf{\epsilon})^{T}(\Pi_{\mathbf{Z}}\mathbf{X}(\beta_{0}-\beta)+\Pi_{\mathbf{Z}}\mathbf{\epsilon})}
=\sqrt{\frac{1}{n}\big(\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{\epsilon}-2\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})+(\beta-\beta_{0})^{T}\mathbf{X}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})\big)}.

Note first that \frac{1}{n}\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})=o_{p}(1) whenever the instruments are valid, since

\frac{1}{n}\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})=\frac{1}{n}\Big(\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\Big(\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)
=\Big(\frac{1}{n}\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)
\rightarrow_{p}\mathbb{E}[Z\epsilon]^{T}\cdot\Sigma^{-1}_{Z}\cdot\mathbb{E}[ZX^{T}](\beta-\beta_{0})=0,

by the continuous mapping theorem. Similarly,

\frac{1}{n}(\beta-\beta_{0})^{T}\mathbf{X}^{T}\Pi_{\mathbf{Z}}\mathbf{X}(\beta-\beta_{0})=\Big(\frac{1}{n}\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}Z_{i}X_{i}^{T}(\beta-\beta_{0})\Big)
\rightarrow_{p}(\beta-\beta_{0})^{T}\mathbb{E}(X_{i}Z_{i}^{T})\Sigma^{-1}_{Z}\mathbb{E}(Z_{i}X_{i}^{T})(\beta-\beta_{0})
=(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\Sigma^{-1}_{Z}\Sigma_{Z}\gamma(\beta-\beta_{0})
=(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0}).

The most important part is the "vanishing noise" behavior, i.e.,

\frac{1}{n}\mathbf{\epsilon}^{T}\Pi_{\mathbf{Z}}\mathbf{\epsilon}=\frac{1}{n}\Big(\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\Big(\sum_{i}\epsilon_{i}Z_{i}\Big)
=\Big(\frac{1}{n}\sum_{i}\epsilon_{i}Z_{i}\Big)^{T}\Big(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z}\Big)^{-1}\Big(\frac{1}{n}\sum_{i}\epsilon_{i}Z_{i}\Big)
\rightarrow_{p}\mathbb{E}(\epsilon_{i}Z_{i})^{T}\Sigma_{Z}^{-1}\mathbb{E}(\epsilon_{i}Z_{i})=0.

It then follows that the regularized regression objective (17) of the Wasserstein DRIVE estimator converges in probability to (18), reproduced below:

\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)}. \qquad (33)

For \rho>0, the population objective (33) is continuous and strictly convex in \beta, and so has a unique minimizer \beta^{\text{DRIVE}}. Applying the convexity lemma of Pollard, (1991), since (32) is also strictly convex in \beta, the convergence to (33) is uniform on compact sets B\subseteq\mathbb{R}^{p} that contain \beta^{\text{DRIVE}}. Applying Corollary 3.2.3 of van der Vaart and Wellner, (1996), we can therefore conclude that the minimizers of the empirical objectives converge in probability to the minimizer of the population objective, i.e.,

\hat{\beta}^{\text{DRIVE}}\rightarrow_{p}\beta^{\text{DRIVE}}.

Next, we consider minimizing the population objective (33). If \rho is bounded above by the smallest singular value of \gamma^{T}\Sigma_{Z}\gamma, i.e.,

\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma),

the population objective is lower bounded by

\sqrt{(\beta-\beta_{0})^{T}\gamma^{T}\Sigma_{Z}\gamma(\beta-\beta_{0})}+\sqrt{\rho(\|\beta\|^{2}+1)}\geq\sqrt{\rho}\|\beta-\beta_{0}\|_{2}+\sqrt{\rho}\sqrt{\|\beta\|^{2}+1}
=\sqrt{\rho}\|(\beta,1)-(\beta_{0},1)\|_{2}+\sqrt{\rho}\|(\beta,1)\|_{2}
\geq\sqrt{\rho}\|(\beta_{0},1)\|_{2},

where in the second line we augment the vectors \beta,\beta_{0} with an extra coordinate equal to 1. The last line follows from the triangle inequality, with equality if and only if \beta\equiv\beta_{0}. The lower bound \sqrt{\rho}\|(\beta_{0},1)\|_{2} of the population objective is therefore achieved uniquely at \beta\equiv\beta_{0}, by strict convexity. We have thus proved that when 0<\rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma), the population objective has a unique minimizer at \beta_{0}. When \rho=0, the consistency of \hat{\beta}^{\text{DRIVE}} can be proved similarly as long as \lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma)>0, which guarantees that \beta_{0} is the unique minimizer of (33). Therefore, whenever \rho\leq\lambda_{p}(\gamma^{T}\Sigma_{Z}\gamma), we have \hat{\beta}^{\text{DRIVE}}\rightarrow_{p}\beta_{0}. ∎

F.3 Proof of Theorem 4.2

Proof.

For a local parameter $\delta\in\mathbb{R}^{p}$, define the objective function $H_{n}(\delta)$ as follows:

ϕn(β)\displaystyle\phi_{n}(\beta) :=1nΠZ𝐘ΠZ𝐗β2+ρn(β2+1)\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\sqrt{\frac{1}{n}\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}}+\sqrt{\rho_{n}(\|\beta\|^{2}+1)}
Hn(δ)\displaystyle H_{n}(\delta) :=n[ϕn(β0+δ/n)ϕn(β0)].\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\sqrt{n}\left[\phi_{n}(\beta_{0}+\delta/\sqrt{n})-\phi_{n}(\beta_{0})\right].

Note that $H_{n}(\delta)$ is minimized at $\delta=\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})$. The key steps of the proof are to compute the limit $H(\delta)$ of $H_{n}(\delta)$, in distribution and uniformly on compact sets, and to verify that the minimizers are uniformly tight, i.e., $\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})=O_{p}(1)$. We can then apply Theorem 3.2.2 of van der Vaart and Wellner (1996) to conclude that the sequence of minimizers $\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})$ of $H_{n}(\delta)$ converges in distribution to the minimizer of the limit $H(\delta)$. We have

Hn(δ)=n(ϕn(β0+δ/n)ϕn(β0))\displaystyle H_{n}(\delta)=\sqrt{n}\cdot(\phi_{n}(\beta_{0}+\delta/\sqrt{n})-\phi_{n}(\beta_{0})) =ΠZ𝐘ΠZ𝐗(β0+δ/n)2ΠZ𝐘ΠZ𝐗β02I\displaystyle=\underset{\textbf{I}}{\underbrace{\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}(\beta_{0}+\delta/\sqrt{n})\|^{2}}-\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta_{0}\|^{2}}}}
+nρn(1+β0+δ/n2)nρn(1+β02)II.\displaystyle+\underset{\textbf{II}}{\underbrace{\sqrt{n\rho_{n}(1+\|\beta_{0}+\delta/\sqrt{n}\|^{2})}-\sqrt{n\rho_{n}(1+\|\beta_{0}\|^{2})}}}.

We first focus on I:

I =ΠZ𝐘ΠZ𝐗(β0+δ/n)2ΠZ𝐘ΠZ𝐗β02\displaystyle=\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}(\beta_{0}+\delta/\sqrt{n})\|^{2}}-\sqrt{\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta_{0}\|^{2}}
=Fn(β0+δ/n)Fn(β0),\displaystyle=\sqrt{F_{n}(\beta_{0}+\delta/\sqrt{n})}-\sqrt{F_{n}(\beta_{0})},

where

Fn(β)\displaystyle F_{n}(\beta) =ΠZ𝐘ΠZ𝐗β2.\displaystyle=\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta\|^{2}.

We have, with $\psi_{i}(\beta)\equiv Z_{i}(\beta^{T}X_{i}-Y_{i})$ (the sign convention is immaterial for the quadratic forms below, but makes the drift computed later enter with a positive sign),

Fn(β0+δ/n)\displaystyle F_{n}(\beta_{0}+\delta/\sqrt{n}) =ΠZ𝐘ΠZ𝐗(β0+δ/n)2\displaystyle=\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}(\beta_{0}+\delta/\sqrt{n})\|^{2}
=(1niψi(β0+δ/n))T(1n𝐙T𝐙)1(1niψi(β0+δ/n)),\displaystyle=(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n}))^{T}(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z})^{-1}(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n})),
Fn(β0)\displaystyle F_{n}(\beta_{0}) =ΠZ𝐘ΠZ𝐗β02\displaystyle=\|\Pi_{Z}\mathbf{Y}-\Pi_{Z}\mathbf{X}\beta_{0}\|^{2}
=(1niψi(β0))T(1n𝐙T𝐙)1(1niψi(β0)).\displaystyle=(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}))^{T}(\frac{1}{n}\mathbf{Z}^{T}\mathbf{Z})^{-1}(\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0})).

We compute the limits of Fn(β0+δ/n)F_{n}(\beta_{0}+\delta/\sqrt{n}) and Fn(β0)F_{n}(\beta_{0}). We have

ψi(β1)ψi(β2)\displaystyle\|\psi_{i}(\beta_{1})-\psi_{i}(\beta_{2})\| ZiXiT(β1β2)\displaystyle\leq\|Z_{i}X_{i}^{T}(\beta_{1}-\beta_{2})\|
=Z(ZTγ+ξT)(β1β2)\displaystyle=\|Z(Z^{T}\gamma+\xi^{T})(\beta_{1}-\beta_{2})\|
(ZZTγ+ZξT)(β1β2)\displaystyle\leq(\|ZZ^{T}\|\|\gamma\|+\|Z\xi^{T}\|)\cdot\|(\beta_{1}-\beta_{2})\|
(Z2γ+Zξ)(β1β2)\displaystyle\leq(\|Z\|^{2}\|\gamma\|+\|Z\|\|\xi\|)\cdot\|(\beta_{1}-\beta_{2})\|

where $\|\cdot\|$ denotes the operator norm for matrices and the Euclidean norm for vectors, and we drop the observation index $i$ since the data are i.i.d. We have, for some constant $c$ that depends on $k$,

𝔼(Z2γ+Zξ)k\displaystyle\mathbb{E}(\|Z\|^{2}\|\gamma\|+\|Z\|\|\xi\|)^{k} c(𝔼Z2kγk+𝔼Zkξk)\displaystyle\leq c(\mathbb{E}\|Z\|^{2k}\|\gamma\|^{k}+\mathbb{E}\|Z\|^{k}\|\xi\|^{k})
c(𝔼Z2kγk+𝔼Z2k𝔼ξ2k)\displaystyle\leq c(\mathbb{E}\|Z\|^{2k}\|\gamma\|^{k}+\sqrt{\mathbb{E}\|Z\|^{2k}}\cdot\sqrt{\mathbb{E}\|\xi\|^{2k}})
<\displaystyle<\infty

using the assumptions that 𝔼Z2k<\mathbb{E}\|Z\|^{2k}<\infty and 𝔼ξ2k<\mathbb{E}\|\xi\|^{2k}<\infty. Moreover, we have

𝔼ψ(β)k\displaystyle\mathbb{E}\|\psi(\beta)\|^{k} =𝔼Z(YβTX)k\displaystyle=\mathbb{E}\|Z(Y-\beta^{T}X)\|^{k}
=𝔼Z(XT(β0β)+ϵ)k\displaystyle=\mathbb{E}\|Z(X^{T}(\beta_{0}-\beta)+\epsilon)\|^{k}
=𝔼Z((ZTγ+ξT)(β0β)+ϵ)k\displaystyle=\mathbb{E}\|Z((Z^{T}\gamma+\xi^{T})(\beta_{0}-\beta)+\epsilon)\|^{k}
c(𝔼ZZTγ(β0β)k+𝔼ZξT(β0β)k+𝔼Zϵk)\displaystyle\leq c\left(\mathbb{E}\|ZZ^{T}\gamma(\beta_{0}-\beta)\|^{k}+\mathbb{E}\|Z\xi^{T}(\beta_{0}-\beta)\|^{k}+\mathbb{E}\|Z\epsilon\|^{k}\right)
c(𝔼Z2kγ(β0β)k+𝔼Z2k𝔼ξ2k(β0β)k+𝔼Z2k𝔼ϵ2k)\displaystyle\leq c\left(\mathbb{E}\|Z\|^{2k}\|\gamma(\beta_{0}-\beta)\|^{k}+\sqrt{\mathbb{E}\|Z\|^{2k}}\sqrt{\mathbb{E}\|\xi\|^{2k}}\|(\beta_{0}-\beta)\|^{k}+\sqrt{\mathbb{E}\|Z\|^{2k}}\sqrt{\mathbb{E}\|\epsilon\|^{2k}}\right)

which is uniformly bounded on compact subsets. The consistency result in Theorem 4.1 combined with the above bounds guarantees stochastic equicontinuity (Andrews, 1994), so that as $n\rightarrow\infty$, uniformly in $\delta$ on compact sets that contain $\delta=\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})$,

1n(iψi(β0+δ/n)𝔼ψi(β0+δ/n))\displaystyle\frac{1}{\sqrt{n}}\left(\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n})-\mathbb{E}\psi_{i}(\beta_{0}+\delta/\sqrt{n})\right) d𝒩(0,Ω(β0))𝒵,\displaystyle\rightarrow_{d}\mathcal{N}(0,\Omega(\beta_{0}))\equiv\mathcal{Z},

where $\Omega(\beta)=\frac{1}{n}\mathbb{E}\sum_{i}\psi_{i}(\beta)\psi_{i}^{T}(\beta)$, so that

\Omega(\beta_{0})=\frac{1}{n}\mathbb{E}\sum_{i}\psi_{i}(\beta_{0})\psi_{i}^{T}(\beta_{0})=\frac{1}{n}\mathbb{E}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})^{2}Z_{i}Z_{i}^{T}=\frac{1}{n}\mathbb{E}\sum_{i}\epsilon_{i}^{2}Z_{i}Z_{i}^{T}=\sigma^{2}\Sigma_{Z},

using independence and homoskedasticity. Moreover,

1ni𝔼ψi(β0+δ/n)\displaystyle\frac{1}{\sqrt{n}}\sum_{i}\mathbb{E}\psi_{i}(\beta_{0}+\delta/\sqrt{n}) =n𝔼[XT(β0+δ/n)Y]Z\displaystyle=\sqrt{n}\mathbb{E}\left[X^{T}(\beta_{0}+\delta/\sqrt{n})-Y\right]Z
=n𝔼[XT(β0+δ/n)(XTβ0+ϵ)]Z\displaystyle=\sqrt{n}\mathbb{E}\left[X^{T}(\beta_{0}+\delta/\sqrt{n})-(X^{T}\beta_{0}+\epsilon)\right]Z
\displaystyle=\mathbb{E}ZX^{T}\delta=\mathbb{E}Z(Z^{T}\gamma+\xi^{T})\delta
\displaystyle=\Sigma_{Z}\gamma\delta,

using $\mathbb{E}Z\xi^{T}=0$.

Combining these, we have

1niψi(β0+δ/n)\displaystyle\frac{1}{\sqrt{n}}\sum_{i}\psi_{i}(\beta_{0}+\delta/\sqrt{n}) d𝒵+ΣZγδ,\displaystyle\rightarrow_{d}\mathcal{Z}+\Sigma_{Z}\gamma\delta,

uniformly in δ\delta on compact sets, so that

Fn(β0+δ/n)\displaystyle F_{n}(\beta_{0}+\delta/\sqrt{n}) d(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)\displaystyle\rightarrow_{d}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)
Fn(β0)\displaystyle F_{n}(\beta_{0}) d𝒵TΣZ1𝒵,\displaystyle\rightarrow_{d}\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z},

and applying the continuous mapping theorem to the square root function,

I =Fn(β0+δ/n)Fn(β0)d(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵.\displaystyle=\sqrt{F_{n}(\beta_{0}+\delta/\sqrt{n})}-\sqrt{F_{n}(\beta_{0})}\rightarrow_{d}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}-\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}}.

Next we have

𝐈𝐈\displaystyle\mathbf{II} =nρn(1+β0+δ/n2)nρn(1+β02)\displaystyle=\sqrt{n\rho_{n}(1+\|\beta_{0}+\delta/\sqrt{n}\|^{2})}-\sqrt{n\rho_{n}(1+\|\beta_{0}\|^{2})}
=nρnβ0Tnρn(1+β02)δ/n+o(δ/n)\displaystyle=\frac{n\rho_{n}\beta_{0}^{T}}{\sqrt{n\rho_{n}(1+\|\beta_{0}\|^{2})}}\cdot\delta/\sqrt{n}+o(\delta/\sqrt{n})
ρnβ0T(1+β02)δ.\displaystyle\rightarrow\frac{\sqrt{\rho_{n}}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta.
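
As a quick numerical sanity check on this expansion (our own illustration; the values of $\beta_{0}$, $\delta$, and the constant sequence $\rho_{n}\equiv\rho=0.25$ are arbitrary choices), the difference II approaches $\sqrt{\rho}\beta_{0}^{T}\delta/\sqrt{1+\|\beta_{0}\|^{2}}$:

import numpy as np

# Sanity check of the expansion of II (illustrative values, not from the paper).
b0 = np.array([1.0, -2.0]); delta = np.array([0.5, 0.3]); rho = 0.25
limit = np.sqrt(rho) * (b0 @ delta) / np.sqrt(1 + b0 @ b0)
for n in [1e2, 1e4, 1e6]:
    II = (np.sqrt(n * rho * (1 + np.sum((b0 + delta / np.sqrt(n)) ** 2)))
          - np.sqrt(n * rho * (1 + b0 @ b0)))
    print(int(n), II, limit)   # II converges to the linear drift term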

Combining the analyses of 𝐈\mathbf{I} and 𝐈𝐈\mathbf{II}, we have

Hn(δ)\displaystyle H_{n}(\delta) d(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵+ρβ0T(1+β02)δ\displaystyle\rightarrow_{d}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}-\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta

uniformly in $\delta$ on compact sets. Because $H_{n}(\delta)$ is convex and the limit $H(\delta)$ has a unique minimizer, $\arg\min_{\delta}H_{n}(\delta)=\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})=O_{p}(1)$. Applying Theorem 3.2.2 of van der Vaart and Wellner (1996) allows us to conclude that

n(β^nDRIVEβ0)\displaystyle\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0}) dargminδ(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵+ρβ0T(1+β02)δ.\displaystyle\rightarrow_{d}\arg\min_{\delta}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}-\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta.

In fact, we may drop the term 𝒵TΣZ1𝒵\sqrt{\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z}} since it does not depend on δ\delta. Therefore,

n(β^nDRIVEβ0)=argminδHn(δ)\displaystyle\sqrt{n}(\hat{\beta}^{\text{DRIVE}}_{n}-\beta_{0})=\arg\min_{\delta}H_{n}(\delta) dargminδ(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)+ρβ0T(1+β02)δ.\displaystyle\rightarrow_{d}\arg\min_{\delta}\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta.

Now when ρ=0\rho=0, the objective above reduces to

(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ),\displaystyle\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)},

which recovers the same minimizer as the TSLS objective

(𝒵+ΣZγδ)TΣZ1(𝒵+ΣZγδ)𝒵TΣZ1𝒵\displaystyle(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)-\mathcal{Z}^{T}\Sigma_{Z}^{-1}\mathcal{Z} =2δTγT𝒵+δTγTΣZγδ,\displaystyle=2\delta^{T}\gamma^{T}\mathcal{Z}+\delta^{T}\gamma^{T}\Sigma_{Z}\gamma\delta,

since the first order condition of the former is

\frac{\gamma^{T}\mathcal{Z}+\gamma^{T}\Sigma_{Z}\gamma\delta}{\sqrt{(\mathcal{Z}+\Sigma_{Z}\gamma\delta)^{T}\Sigma_{Z}^{-1}(\mathcal{Z}+\Sigma_{Z}\gamma\delta)}}=0,

and of the latter is

γT𝒵+γTΣZγδ\displaystyle\gamma^{T}\mathcal{Z}+\gamma^{T}\Sigma_{Z}\gamma\delta =0.\displaystyle=0.

We can therefore conclude that whenever $\rho_{n}\rightarrow\rho=0$, regardless of the rate, the asymptotic distribution of Wasserstein DRIVE coincides with that of the standard TSLS estimator. ∎
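
The equivalence of these two first order conditions can also be checked numerically. The following Python sketch (our own illustration, with arbitrary $\Sigma_{Z}$ and $\gamma$, and $\sigma=1$ so that a draw of $\mathcal{Z}$ has covariance $\Sigma_{Z}$) minimizes both limiting objectives over $\delta$ and compares them with the closed form $\delta=-(\gamma^{T}\Sigma_{Z}\gamma)^{-1}\gamma^{T}\mathcal{Z}$:

import numpy as np
from scipy.optimize import minimize

# Sanity check (illustrative): the square root objective and the TSLS quadratic
# share the same minimizer in delta.
rng = np.random.default_rng(2)
Sigma_Z = np.array([[2.0, 0.3], [0.3, 1.0]])
gamma = np.array([[1.0], [0.5]])          # two instruments, one regressor
Zc = np.linalg.cholesky(Sigma_Z) @ rng.normal(size=2)   # one draw of the Gaussian limit
Si = np.linalg.inv(Sigma_Z)

v = lambda d: Zc + Sigma_Z @ gamma @ d
d_sqrt = minimize(lambda d: np.sqrt(v(d) @ Si @ v(d)), x0=np.zeros(1)).x
d_quad = minimize(lambda d: v(d) @ Si @ v(d), x0=np.zeros(1)).x
d_closed = -np.linalg.solve(gamma.T @ Sigma_Z @ gamma, gamma.T @ Zc)
print(d_sqrt, d_quad, d_closed)           # all three agree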

F.4 Proof of Corollary 4.3

Proof.

When $\rho_{n}\rightarrow 0$, after the change of variables $\mathcal{Z}\mapsto\Sigma_{Z}^{-1/2}\mathcal{Z}$ and $\gamma\mapsto\Sigma_{Z}^{1/2}\gamma$ (so that we may take $\Sigma_{Z}=I$ in the limiting objective), the limiting objective is $\sqrt{(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)}$, which is minimized at the same $\delta$ that minimizes the standard limit $(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)$.

If $0<\sqrt{\rho}\leq|\gamma|$, then the first order condition (FOC) gives

γT𝒵+δTγTγ(𝒵+γδ)T(𝒵+γδ)+ρβ0(1+β02)\displaystyle\frac{\gamma^{T}\mathcal{Z}+\delta^{T}\gamma^{T}\gamma}{\sqrt{(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)}}+\frac{\sqrt{\rho}\beta_{0}}{\sqrt{(1+\|\beta_{0}\|^{2})}} =0.\displaystyle=0.

If $\beta_{0}=0$ (while $\gamma$ can be a vector, i.e., multiple instruments), the second term vanishes and the FOC reduces to

γT𝒵+δTγTγ\displaystyle\gamma^{T}\mathcal{Z}+\delta^{T}\gamma^{T}\gamma =0,\displaystyle=0,

which is the same FOC as the standard IV limiting objective.

If both γ\gamma and β0\beta_{0} are one-dimensional, but β0\beta_{0} is not necessarily 0, we have that

(𝒵+γδ)T(𝒵+γδ)+ρβ0T(1+β02)δ\displaystyle\sqrt{(\mathcal{Z}+\gamma\delta)^{T}(\mathcal{Z}+\gamma\delta)}+\frac{\sqrt{\rho}\beta_{0}^{T}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta =|𝒵+γδ|+ρβ0(1+β02)δ\displaystyle=|\mathcal{Z}+\gamma\delta|+\frac{\sqrt{\rho}\beta_{0}}{\sqrt{(1+\|\beta_{0}\|^{2})}}\cdot\delta

Write $c\mathrel{\mathop{:}}=\frac{\sqrt{\rho}\beta_{0}}{\sqrt{1+\|\beta_{0}\|^{2}}}$. The objective is $\mathcal{Z}+\gamma\delta+c\delta$ when $\gamma\delta+\mathcal{Z}\geq 0$ and $-\mathcal{Z}-\gamma\delta+c\delta$ when $\gamma\delta+\mathcal{Z}\leq 0$. Recall that by assumption $\sqrt{\rho}\leq|\gamma|$, so $|c|<\sqrt{\rho}\leq|\gamma|$.

The sign of $\beta_{0}$ only determines the sign of $c$; each case uses nothing beyond the bound $|c|\leq\sqrt{\rho}\leq|\gamma|$. If $\gamma>0$ (whether $\beta_{0}>0$ or $\beta_{0}<0$), the branch $\mathcal{Z}+\gamma\delta+c\delta$ applies for $\delta\geq-\gamma^{-1}\mathcal{Z}$ and has nonnegative slope $\gamma+c\geq\gamma-\sqrt{\rho}\geq 0$, so it is minimized at the left endpoint $\delta=-\gamma^{-1}\mathcal{Z}$, while the branch $-\mathcal{Z}-\gamma\delta+c\delta$ applies for $\delta\leq-\gamma^{-1}\mathcal{Z}$ and has nonpositive slope $-\gamma+c\leq-\gamma+\sqrt{\rho}\leq 0$, so it is also minimized at $-\gamma^{-1}\mathcal{Z}$. If $\gamma<0$, the branches trade places: $\mathcal{Z}+\gamma\delta+c\delta$ applies for $\delta\leq-\gamma^{-1}\mathcal{Z}$ with nonpositive slope $\gamma+c\leq\gamma+\sqrt{\rho}\leq 0$, and $-\mathcal{Z}-\gamma\delta+c\delta$ applies for $\delta\geq-\gamma^{-1}\mathcal{Z}$ with nonnegative slope $-\gamma+c\geq-\gamma-\sqrt{\rho}\geq 0$, so both branches are again minimized at $\delta=-\gamma^{-1}\mathcal{Z}$.

We can therefore conclude that the objective is always minimized at $\delta=-\gamma^{-1}\mathcal{Z}$, so the asymptotic distribution coincides with the limiting distribution of TSLS. ∎
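
The case analysis above reduces to a statement about piecewise linear functions that is easy to verify numerically. The following Python sketch (our own illustration; $\rho=0.2$ and $|\gamma|\geq 0.5$ guarantee $\sqrt{\rho}\leq|\gamma|$) confirms on a grid that the minimizer is always the kink $-\gamma^{-1}\mathcal{Z}$, regardless of the signs of $\beta_{0}$ and $\gamma$:

import numpy as np

# Grid check (illustrative): argmin of |Zc + g*delta| + c*delta is -Zc/g
# whenever |c| <= sqrt(rho) <= |g|.
rng = np.random.default_rng(3)
rho = 0.2
grid = np.linspace(-20.0, 20.0, 400001)
for _ in range(5):
    Zc = rng.normal()
    g = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 2.0)   # gamma, either sign
    b0 = rng.normal()                                      # beta_0, either sign
    c = np.sqrt(rho) * b0 / np.sqrt(1 + b0 ** 2)           # |c| < sqrt(rho)
    f = np.abs(Zc + g * grid) + c * grid
    print(grid[np.argmin(f)], -Zc / g)                     # grid minimizer vs. kink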

F.5 Proof of Theorem D.2

Proof.

We can write

1niψi(θ)\displaystyle\frac{1}{n}\sum_{i}\psi_{i}(\theta) =1ni[ψi(θ)𝔼ψi(θ)]+1ni𝔼ψi(θ).\displaystyle=\frac{1}{n}\sum_{i}[\psi_{i}(\theta)-\mathbb{E}\psi_{i}(\theta)]+\frac{1}{n}\sum_{i}\mathbb{E}\psi_{i}(\theta).

D.1.1 guarantees that

1ni[ψi(θ)𝔼ψi(θ)]\displaystyle\frac{1}{n}\sum_{i}[\psi_{i}(\theta)-\mathbb{E}\psi_{i}(\theta)] =op(1),\displaystyle=o_{p}(1),

using, for example, the results of Andrews (1994).

Next, D.1.2 guarantees that 1ni𝔼ψi(θ)m(θ)\frac{1}{n}\sum_{i}\mathbb{E}\psi_{i}(\theta)\rightarrow m(\theta) uniformly in θ\theta, and D.1.3 further guarantees that

(1niψi(θ))TWn(θ)(1niψi(θ))+ρn(1+θ2)\displaystyle\sqrt{\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)^{T}W_{n}(\theta)\left(\frac{1}{n}\sum_{i}\psi_{i}(\theta)\right)}+\sqrt{\rho_{n}(1+\|\theta\|^{2})} p\displaystyle\rightarrow_{p}
m(θ)TW(θ)m(θ)+ρ(1+θ2)\displaystyle\sqrt{m(\theta)^{T}W(\theta)m(\theta)}+\sqrt{\rho(1+\|\theta\|^{2})}

uniformly in $\theta$. Applying Corollary 3.2.3 of van der Vaart and Wellner (1996), we can conclude $\hat{\theta}^{GMM}\rightarrow_{p}\theta^{GMM}$.

Next, we consider the minimizer of the population objective. Applying D.1.4, when ρρ¯\rho\leq\overline{\rho}, it is lower bounded by

\sqrt{m(\theta)^{T}W(\theta)m(\theta)}+\sqrt{\rho(1+\|\theta\|^{2})}\geq\sqrt{\overline{\rho}}\|\theta-\theta_{0}\|_{2}+\sqrt{\rho}\sqrt{1+\|\theta\|^{2}}
\geq\sqrt{\rho}\left(\|\theta-\theta_{0}\|_{2}+\sqrt{1+\|\theta\|^{2}}\right)
=\sqrt{\rho}\left(\|(\theta,1)-(\theta_{0},1)\|_{2}+\|(\theta,1)\|_{2}\right)
\geq\sqrt{\rho}\|(\theta_{0},1)\|_{2},

where the second inequality uses $\rho\leq\overline{\rho}$ and the last inequality follows from the triangle inequality. As before, equality is achieved if and only if $\theta=\theta_{0}$, which guarantees that $\hat{\theta}^{GMM}\rightarrow_{p}\theta_{0}$. The condition $m(\theta)^{T}W(\theta)m(\theta)\geq\overline{\rho}\|\theta-\theta_{0}\|^{2}$ is satisfied by many GMM estimators, including TSLS, so this proof applies to Theorem 4.1 as well. ∎

F.6 Proof of Theorem C.3

Proof.

First, we use the optimality of $\hat{\beta}$ to bound

Q^(β^)Q^(β0)\displaystyle\sqrt{\hat{Q}(\hat{\beta})}-\sqrt{\hat{Q}(\beta_{0})} λnβ02+1λnβ^2+1\displaystyle\leq\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1}

On the other hand, by convexity of Q^(β)\sqrt{\hat{Q}(\beta)},

Q^(β^)Q^(β0)\displaystyle\sqrt{\hat{Q}(\hat{\beta})}-\sqrt{\hat{Q}(\beta_{0})} S~T(β^β0)S~2β^β02λcnβ^β02\displaystyle\geq\tilde{S}^{T}(\hat{\beta}-\beta_{0})\geq-\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}\geq-\frac{\lambda}{cn}\|\hat{\beta}-\beta_{0}\|_{2}

Now the estimation error in the “prediction norm” (the seminorm defined by the Gram matrix)

β^β02,n2\displaystyle\|\hat{\beta}-\beta_{0}\|_{2,n}^{2} :=1ni(XiT(β^β0))2\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{n}\sum_{i}(X_{i}^{T}(\hat{\beta}-\beta_{0}))^{2}
=(β^β0)T1niXiXiT(β^β0)\displaystyle=(\hat{\beta}-\beta_{0})^{T}\frac{1}{n}\sum_{i}X_{i}X_{i}^{T}(\hat{\beta}-\beta_{0})

is related to the difference Q^(β^)Q^(β0)\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0}) as follows:

Q^(β^)Q^(β0)\displaystyle\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0}) =1ni(YiXiTβ^)21ni(YiXiTβ0)2\displaystyle=\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\hat{\beta})^{2}-\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})^{2}
=1ni(YiXiTβ0+XiTβ0XiTβ^)21ni(YiXiTβ0)2\displaystyle=\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0}+X_{i}^{T}\beta_{0}-X_{i}^{T}\hat{\beta})^{2}-\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})^{2}
=β^β02,n2+21ni(YiXiTβ0)(XiTβ0XiTβ^)\displaystyle=\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}+2\frac{1}{n}\sum_{i}(Y_{i}-X_{i}^{T}\beta_{0})(X_{i}^{T}\beta_{0}-X_{i}^{T}\hat{\beta})
=β^β02,n2+21ni(σϵi)XiT(β0β^)\displaystyle=\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}+2\frac{1}{n}\sum_{i}(\sigma\epsilon_{i})X_{i}^{T}(\beta_{0}-\hat{\beta})
=β^β02,n22En(σϵXT(β^β0))\displaystyle=\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}-2E_{n}(\sigma\epsilon X^{T}(\hat{\beta}-\beta_{0}))

On the other hand,

Q^(β^)Q^(β0)\displaystyle\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0}) =[Q^(β^)+Q^(β0)][Q^(β^)Q^(β0)]\displaystyle=\left[\sqrt{\hat{Q}(\hat{\beta})}+\sqrt{\hat{Q}(\beta_{0})}\right]\cdot\left[\sqrt{\hat{Q}(\hat{\beta})}-\sqrt{\hat{Q}(\beta_{0})}\right]

and using Hölder’s inequality,

2En(σϵXT(β^β0))\displaystyle 2E_{n}(\sigma\epsilon X^{T}(\hat{\beta}-\beta_{0})) =21ni(σϵi)XiT(β^β0)\displaystyle=2\frac{1}{n}\sum_{i}(\sigma\epsilon_{i})X_{i}^{T}(\hat{\beta}-\beta_{0})
\displaystyle=2\sqrt{\frac{1}{n}\sum_{i}(\sigma\epsilon_{i})^{2}}\cdot\frac{\frac{1}{n}\sum_{i}(\sigma\epsilon_{i}X_{i}^{T})}{\sqrt{\frac{1}{n}\sum_{i}(\sigma^{2}\epsilon_{i}^{2})}}(\hat{\beta}-\beta_{0})
=2Q^(β0)S~T(β^β0)\displaystyle=2\sqrt{\hat{Q}(\beta_{0})}\cdot\tilde{S}^{T}(\hat{\beta}-\beta_{0})
2Q^(β0)S~2β^β02\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}

Combining these, we can bound the estimation error β^β02,n2\|\hat{\beta}-\beta_{0}\|_{2,n}^{2} as

β^β02,n2\displaystyle\|\hat{\beta}-\beta_{0}\|_{2,n}^{2}
=2En(σϵXT(β^β0))+Q^(β^)Q^(β0)\displaystyle=2E_{n}(\sigma\epsilon X^{T}(\hat{\beta}-\beta_{0}))+\hat{Q}(\hat{\beta})-\hat{Q}(\beta_{0})
2Q^(β0)S~2β^β02+(λnβ02+1λnβ^2+1)(Q^(β^)+Q^(β0))\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})\cdot(\sqrt{\hat{Q}(\hat{\beta})}+\sqrt{\hat{Q}(\beta_{0})})
2Q^(β0)S~2β^β02+(λnβ02+1λnβ^2+1)(2Q^(β0)+λnβ02+1λnβ^2+1)\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})\cdot(2\sqrt{\hat{Q}(\beta_{0})}+\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})
=2Q^(β0)S~2β^β02+(λnβ02+1λnβ^2+1)2+2Q^(β0)(λnβ02+1λnβ^2+1)\displaystyle=2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})^{2}+2\sqrt{\hat{Q}(\beta_{0})}(\frac{\lambda}{n}\sqrt{\|\beta_{0}\|^{2}+1}-\frac{\lambda}{n}\sqrt{\|\hat{\beta}\|^{2}+1})
2Q^(β0)S~2β^β02+(λn)2β^β022+2Q^(β0)λnβ^β02\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\|\tilde{S}\|_{2}\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n})^{2}\|\hat{\beta}-\beta_{0}\|_{2}^{2}+2\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}\|\hat{\beta}-\beta_{0}\|_{2}
2Q^(β0)λn(1c+1)β^β02+(λn)2β^β022\displaystyle\leq 2\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)\|\hat{\beta}-\beta_{0}\|_{2}+(\frac{\lambda}{n})^{2}\|\hat{\beta}-\beta_{0}\|_{2}^{2}

Now the norms $\|\hat{\beta}-\beta_{0}\|_{2,n}$ and $\|\hat{\beta}-\beta_{0}\|_{2}$ differ by the Gram matrix $\frac{1}{n}\sum_{i}X_{i}X_{i}^{T}$, which by the normalization $\frac{1}{n}\sum_{i}X_{ij}^{2}=1$ has diagonal entries equal to 1. Recall that $\kappa$ is the largest constant such that

κβ^β02\displaystyle\kappa\|\hat{\beta}-\beta_{0}\|_{2} β^β02,n\displaystyle\leq\|\hat{\beta}-\beta_{0}\|_{2,n}

for any β^β0\hat{\beta}-\beta_{0}, so we get

β^β0221κ2β^β02,n2\displaystyle\|\hat{\beta}-\beta_{0}\|_{2}^{2}\leq\frac{1}{\kappa^{2}}\|\hat{\beta}-\beta_{0}\|_{2,n}^{2} 21κ2Q^(β0)λn(1c+1)β^β02+1κ2(λn)2β^β022\displaystyle\leq 2\frac{1}{\kappa^{2}}\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)\|\hat{\beta}-\beta_{0}\|_{2}+\frac{1}{\kappa^{2}}(\frac{\lambda}{n})^{2}\|\hat{\beta}-\beta_{0}\|_{2}^{2}

which yields

β^β02\displaystyle\|\hat{\beta}-\beta_{0}\|_{2} 111κ2(λn)221κ2Q^(β0)λn(1c+1)\displaystyle\leq\frac{1}{1-\frac{1}{\kappa^{2}}(\frac{\lambda}{n})^{2}}2\frac{1}{\kappa^{2}}\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)
=2Q^(β0)λn(1c+1)κ2(λn)2\displaystyle=\frac{2\sqrt{\hat{Q}(\beta_{0})}\frac{\lambda}{n}(\frac{1}{c}+1)}{\kappa^{2}-(\frac{\lambda}{n})^{2}}

provided that

\left(\frac{\lambda}{n}\right)^{2}<\kappa^{2}.

Since $\lambda/n\rightarrow 0$ and $\kappa$ is a constant linking the two norms, this condition is satisfied for all $n$ large enough when Assumption 2 holds, so the rate of convergence of $\|\hat{\beta}-\beta_{0}\|_{2}$ is governed by that of $\lambda/n\rightarrow 0$:

β^β022λn(1c+1)κ2(λn)2Q^(β0)σplog(2p/α)/n.\displaystyle\|\hat{\beta}-\beta_{0}\|_{2}\leq\frac{2\frac{\lambda}{n}(\frac{1}{c}+1)}{\kappa^{2}-(\frac{\lambda}{n})^{2}}\cdot\sqrt{\hat{Q}(\beta_{0})}\lesssim\sigma\sqrt{p\log(2p/\alpha)/n}.
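
The rate in this bound can be illustrated with a small Monte Carlo sketch (our own construction under simplifying assumptions: a Gaussian design with normalized columns, $\sigma=1$, and $\lambda/n$ set to $\sqrt{p\log(2p/\alpha)/n}$); the estimation error of the square root ridge estimator tracks the theoretical rate as $n$ grows:

import numpy as np
from scipy.optimize import minimize

# Monte Carlo sketch (illustrative, not the paper's experiment): the square
# root ridge error ||beta_hat - beta_0||_2 tracks sigma*sqrt(p log(2p/alpha)/n).
rng = np.random.default_rng(4)
p, sigma, alpha = 5, 1.0, 0.05
beta0 = np.ones(p)

def sqrt_ridge_error(n):
    X = rng.normal(size=(n, p))
    X /= np.sqrt((X ** 2).mean(axis=0))   # enforce (1/n) sum_i X_ij^2 = 1
    Y = X @ beta0 + sigma * rng.normal(size=n)
    lam_over_n = np.sqrt(p * np.log(2 * p / alpha) / n)
    obj = lambda b: (np.sqrt(np.sum((Y - X @ b) ** 2) / n)
                     + lam_over_n * np.sqrt(b @ b + 1.0))
    return np.linalg.norm(minimize(obj, x0=np.zeros(p)).x - beta0)

for n in [200, 2000, 20000]:
    print(n, sqrt_ridge_error(n), sigma * np.sqrt(p * np.log(2 * p / alpha) / n))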