
Instrument Space Selection for Kernel Maximum Moment Restriction

Rui Zhang §        Krikamol Muandet        Bernhard Schölkopf        Masaaki Imaizumi
§ Australian National University, Canberra, Australia        MPI for Intelligent Systems, Tübingen, Germany        University of Tokyo, Tokyo, Japan
Abstract

Kernel maximum moment restriction (KMMR) has recently emerged as a popular framework for instrumental variable (IV) based conditional moment restriction (CMR) models, with important applications in conditional moment (CM) testing and parameter estimation for IV regression and proximal causal learning. The effectiveness of this framework, however, depends critically on the choice of a reproducing kernel Hilbert space (RKHS) as a space of instruments. In this work, we present a systematic way to select the instrument space for parameter estimation based on a principle of the least identifiable instrument space (LIIS), which identifies the model parameters with the least space complexity. Our selection criterion combines two distinct objectives to determine such an optimal space: (i) a test criterion to check identifiability; (ii) an information criterion based on the effective dimension of RKHSs as a complexity measure. We analyze the consistency of our method in determining the LIIS, and demonstrate its effectiveness for parameter estimation via simulations.

1 Introduction

The instrumental variable (IV) based conditional moment restriction (CMR) models [61, 1, 24] have a wide range of applications in causal inference, economics, and finance modeling, where for correctly-specified models the conditional mean of certain functions of the data equals zero almost surely. Such models also appear in Mendelian randomization, a technique in genetic epidemiology that uses genetic variation to improve causal inference of a modifiable exposure on disease [22, 13]. Rational expectation models [59], widely used in macroeconomics, measure how available information is exploited by decision-makers to form future expectations as conditional moments. Furthermore, CMRs have also gained popularity in the community of causal machine learning, leading to novel algorithms such as generalized random forests [6], double/debiased machine learning [21] and nonparametric IV regression [8, 58, 73]; see also related works therein, as well as in offline reinforcement learning [47].

Learning with CMRs is challenging because a CMR implies an infinite number of unconditional moment restrictions (UMRs). Although the asymptotic efficiency of the instrumental variable (IV) estimator can in principle improve when we add more moment restrictions, it was observed that an excessive number of moments can be harmful in practice [3] because the finite-sample bias increases with the number of moment conditions [63]. Hence, traditional works in econometrics often select a finite number of UMRs for estimation based on the generalized method of moments (GMM) [36, 34]. Unfortunately, an ad hoc choice of moments can potentially lead to a loss of efficiency or even a loss of identification [25]. For this reason, subsequent works advocate incorporating all moment restrictions simultaneously in different ways, such as the method of sieves [23, 27] and a continuum of moment restrictions [17, 19, 16, 18], among others. Despite this progress, the question of moment selection in general remains open.

Recent interest in modeling the CMR with a reproducing kernel Hilbert space (RKHS) [57, 24] and deep neural networks [45, 8] opens up a new possibility of resolving the selection problem with modern tools in machine learning. In this work, we focus on the RKHS approach, where the CMR is reformulated as a minimax optimization problem whose inner maximization is taken over functions in the RKHS. This framework is known as the kernel maximum moment restriction (KMMR). An advantage of the KMMR is that one can obtain a closed-form solution to the maximization problem, which is related to a continuum generalization of GMM [17, 16]. Furthermore, it has been shown that an RKHS with a specific type of kernel is sufficient to model the CMR; see Muandet et al. [57, Theorem 3.2] and Zhang et al. [73, Theorem 1]. Hence, in this context, the moment selection problem becomes a kernel selection problem, which is itself a long-standing problem in machine learning. Besides, the KMMR can be viewed as an approximate dual of the well-known two-stage regression procedure [58, 46]. KMMRs have been applied successfully to IV regression [73], proximal causal learning [51] and conditional moment testing [57]. Nevertheless, all of these works employed a simple heuristic to select the kernel function, e.g., the median heuristic, which limits the full potential of this framework.

Our contributions. In this paper, we aim to address the kernel selection problem for the KMMR framework. We focus on the IV estimator and term our problem the kernel or RKHS instrument space selection, because the RKHS functions as a space of instruments. We define an optimal instrument space, named the least identifiable instrument space (LIIS), which identifies the true model parameters and has the least complexity. To determine the LIIS in practice, we propose an approach based on a combination of two criteria: (i) the identification test criterion (ITC) to test the identifiability of instrument spaces, and (ii) the kernel effective information criterion (KEIC) to select the space based on its complexity. Our method has the following advantages: (a) compared with methods based on higher-order asymptotics [26, 28], our approach is easy to use and analyze, and can filter out invalid instrument spaces; (b) our method is a combination of several information criteria, thereby compensating for the shortcomings of individual criteria. Moreover, we analyze the consistency of our method in selecting the LIIS, and we show in simulation experiments that our method effectively identifies the LIIS and improves the performance of parameter estimation for IV estimators. To the best of our knowledge, no existing method achieves all of these for kernel instrument space selection.

2 Preliminaries

2.1 Conditional moment restriction (CMR)

Let (X,Z) be a random variable taking values in \mathcal{X}\times\mathcal{Z} and \Theta a parameter space. A conditional moment restriction (CMR) [61, 1] can then be expressed as

\text{CMR}(\theta_{0})=\mathbb{E}[\varphi_{\theta_{0}}(X)\,|\,Z]=\mathbf{0},\quad P_{Z}\text{-almost surely (a.s.)} \qquad (1)

for the true parameter \theta_{0}\in\Theta. The function \varphi_{\theta}(X) is a problem-dependent generalized residual function in \mathbb{R}^{q} parameterized by \theta. Intuitively, the CMR asserts that, for correctly specified models, the conditional mean of the generalized residual function is almost surely equal to zero. Many statistical models can be written as (1), including nonparametric regression models where X=(\tilde{X},Y), Z=\tilde{X} and \varphi_{\theta}(X)=Y-f(\tilde{X};\theta); conditional quantile models where X=(\tilde{X},Y), Z=\tilde{X}, and \varphi_{\theta}(X)=\mathbbm{1}\{Y<f(\tilde{X};\theta)\}-\tau for the target quantile \tau\in[0,1]; and IV regression models where X=(\tilde{X},Y), Z is an IV, and \varphi_{\theta}(X)=Y-f(\tilde{X};\theta).

Maximum moment restriction (MMR).

An important observation about the CMR (1) is that it implies a continuum of unconditional moment restrictions (UMRs) [17, 45, 8]: \mathbb{E}[\varphi_{\theta_{0}}(X)^{\top}h(Z)]=0 for all measurable functions h\in\mathcal{H}, where \mathcal{H} is a space of measurable functions h:\mathcal{Z}\to\mathbb{R}^{q}. We refer to \mathcal{H} as an instrument space. Traditionally, inference and estimation of \theta_{0} can be performed, for example, via the generalized method of moments (GMM) applied to UMRs based on a specific subset of \mathcal{H} [34]. Consequently, the choice of this subset plays an important role in parameter estimation for conditional moment models. In this paper, we discuss the optimal instrument space \mathcal{H}, building on an equivalent formulation of the UMRs, called the maximum moment restriction (MMR) [45, 57, 73], as follows:

R_{\mathcal{H}}(\theta_{0})\coloneqq\sup_{h\in\mathcal{H}}\mathbb{E}^{2}[\varphi_{\theta_{0}}(X)^{\top}h(Z)]=0. \qquad (2)

Note that the MMR R_{\mathcal{H}} depends critically on the choice of an instrument space \mathcal{H}. In this paper, we focus exclusively on IV regression models, so that \varphi_{\theta}(X)=Y-f(\tilde{X};\theta)\in\mathbb{R} and \mathcal{H} is a real-valued function space. We defer applications of our method to other scenarios to future work.

2.2 Kernel maximum moment restriction (KMMR)

In this work, we focus on R_{\mathcal{H}}(\theta) when the instrument space \mathcal{H} is a reproducing kernel Hilbert space (RKHS) associated with a kernel k:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R} [5, 66, 9].

Reproducing kernel Hilbert space (RKHS).

Let \mathcal{H} be a reproducing kernel Hilbert space (RKHS) of functions from \mathcal{Z} to \mathbb{R}, with \langle\cdot,\cdot\rangle_{\mathcal{H}} and \|\cdot\|_{\mathcal{H}} being its inner product and norm, respectively. Since for any z\in\mathcal{Z} the linear functional h\mapsto h(z) is continuous for h\in\mathcal{H}, it follows from the Riesz representation theorem [64] that there exists, for every z\in\mathcal{Z}, a function k(z,\cdot)\in\mathcal{H} such that h(z)=\langle h,k(z,\cdot)\rangle_{\mathcal{H}} for all h\in\mathcal{H}. This is generally known as the reproducing property of \mathcal{H} [5, 66]. We call k(z,z^{\prime}):=\langle k(z,\cdot),k(z^{\prime},\cdot)\rangle_{\mathcal{H}} a reproducing kernel of \mathcal{H}. The reproducing kernel k is unique (up to an isometry) and fully characterizes the RKHS \mathcal{H} [5]. Examples of commonly used kernels on \mathbb{R}^{d} include the Gaussian RBF kernel k(z,z^{\prime})=\exp(-\|z-z^{\prime}\|^{2}_{2}/2\sigma^{2}) and the Laplacian kernel k(z,z^{\prime})=\exp(-\|z-z^{\prime}\|_{1}/\sigma), where \sigma>0 is a bandwidth parameter. For a detailed exposition on kernel methods and RKHSs, see, e.g., [66, 9, 54, 56].

By representing \mathcal{H} in (2) with an RKHS, Muandet et al. [57, Theorem 3.3] showed that R_{\mathcal{H}}(\theta) has a closed-form expression after introducing an Ivanov regularization [40] \|h\|=1 to remove the scale effect of instruments:

R_{\mathcal{H}}(\theta):=\sup_{h\in\mathcal{H},\,\|h\|=1}\mathbb{E}^{2}[\varphi_{\theta}(X)h(Z)]=\mathbb{E}[\varphi_{\theta}(X)k(Z,Z^{\prime})\varphi_{\theta}(X^{\prime})], \qquad (3)

where (X^{\prime},Z^{\prime}) is an independent copy of (X,Z). Given i.i.d. data \{(x_{i},z_{i})\}_{i=1}^{n}, we define the empirical analogue \hat{R}_{\mathcal{H}}(\theta)\coloneqq\mathbb{E}_{n}[\varphi_{\theta}(X)k(Z,Z^{\prime})\varphi_{\theta}(X^{\prime})]=n^{-2}\sum_{i,j=1}^{n}\varphi_{\theta}(x_{i})k(z_{i},z_{j})\varphi_{\theta}(x_{j}) and its minimizer \hat{\theta}=\operatornamewithlimits{argmin}_{\theta}\hat{R}_{\mathcal{H}}(\theta).
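For concreteness, the empirical KMMR risk can be computed directly from a kernel matrix. The following minimal numpy sketch (our own illustration, not the authors' released code) assumes a scalar residual and a Gaussian kernel; the residual function and the bandwidth are placeholders to be supplied by the user.

```python
import numpy as np

def gaussian_kernel(z1, z2, bandwidth=1.0):
    """Gaussian RBF kernel matrix k(z_i, z_j) for one-dimensional instruments."""
    return np.exp(-(z1[:, None] - z2[None, :]) ** 2 / (2.0 * bandwidth ** 2))

def kmmr_risk(residuals, K):
    """Empirical KMMR risk: R_hat_H(theta) = n^{-2} sum_{i,j} phi_i * k(z_i, z_j) * phi_j."""
    n = len(residuals)
    return residuals @ K @ residuals / n ** 2

# Example for a linear IV model f(x; theta) = theta[0] + theta[1] * x:
#   r = y - (theta[0] + theta[1] * x)
#   risk = kmmr_risk(r, gaussian_kernel(z, z, bandwidth=1.0))
```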

We focus on this expression, despite a similar quadratic expression following from a Tikhonov regularization on h [24, Eqn. (10)]. It is instructive to observe that the MMR (3) resembles the optimally-weighted GMM formulation of Carrasco and Florens [17], but without the re-weighting matrix; see also Carrasco et al. [19, Sec. 6] and Zhang et al. [73, Sec. 3]. While the optimally-weighted GMM (OWGMM) was originally motivated by asymptotic efficiency theory in a parametric setting [18], the need to compute the inverse of a parameter-dependent covariance operator can lead to more cumbersome estimation [16] and poor finite-sample performance [10]. Hence, we will consider R_{\mathcal{H}}(\theta) throughout for its simplicity.

The following result adapted from Zhang et al. [73, Theorem 1] and Muandet et al. [57, Theorem 3.2] guarantees that KMMR has the same roots as those of CMR (1).

Theorem 1 (Sufficiency of the instrument space).

Suppose that k is continuous, bounded (i.e., \sup_{z\in\mathcal{Z}}\sqrt{k(z,z)}<\infty) and satisfies the condition of integrally strictly positive definite (ISPD) kernels, i.e., for any function g that satisfies 0<\|g\|_{2}^{2}<\infty, we have \iint_{\mathcal{Z}}g(z)k(z,z^{\prime})g(z^{\prime})\,\mathrm{d}z\,\mathrm{d}z^{\prime}>0. Then, R_{\mathcal{H}}(\theta)=0 if and only if \text{CMR}(\theta)=\mathbb{E}[\varphi_{\theta}(X)\,|\,Z]=\mathbf{0} for P_{Z}-almost all z.

In other words, Theorem 1 implies that it is sufficient to restrict \mathcal{H} in (2) to an RKHS associated with an ISPD kernel. However, it does not guarantee the optimality of \mathcal{H} as an instrument space.

3 Least identifiable instrument space (LIIS)

The choice of an instrument space is critical for the KMMR. If an instrument space is excessively small, the KMMR loses identification power, i.e., another parameter \theta\neq\theta_{0} can also satisfy the KMMR condition (3), and hence it is impossible in principle to obtain a consistent estimator of \theta_{0}. This scenario is often referred to as the under-identification problem [34, Chapter 2.1]. In contrast, an excessively large instrument space increases the error of estimators with finite samples: it has been shown that the mean squared error (MSE) of MMR-based estimators grows with the size of the instrument space [12, 33, 20, 24]. For example, Theorem 1 of [24] shows that the MSE of their estimator has an upper bound that increases with the critical radius of the instrument space (defined through an upper bound on its Rademacher complexity). Therefore, it is important to avoid excessively large instrument spaces in order to reduce estimation errors.

Unfortunately, the instrument space selection problem cannot be solved by a straightforward cross-validation (CV) procedure, because the loss function (3) depends on the instrument space itself. For example, by the Mercer decomposition of the corresponding kernel, k(z,z^{\prime})=\sum_{i\geq 1}\lambda_{i}\phi_{i}(z)\phi_{i}(z^{\prime}) with eigenfunctions and eigenvalues \{\phi_{i}(\cdot),\lambda_{i}\}_{i\geq 1}, we can rewrite the KMMR as R_{\mathcal{H}}(\theta)=\sum_{i\geq 1}\lambda_{i}\mathbb{E}[\varphi_{\theta}(X)\phi_{i}(Z)]^{2}. Due to this dependence, CV always selects an excessively small instrument space, which makes R_{\mathcal{H}} small for every \theta.
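This dependence on the spectrum can be checked numerically: for the eigenpairs (\hat{\lambda}_{i},v_{i}) of the kernel matrix K, the empirical risk r^{\top}Kr/n^{2} equals \sum_{i}\hat{\lambda}_{i}(v_{i}^{\top}r/n)^{2}, so shrinking the kernel's eigenvalues shrinks the loss for every \theta. A short sketch of this identity, reusing the kmmr_risk helper above (again our own illustration):

```python
import numpy as np

def kmmr_risk_spectral(residuals, K):
    """Rewrite r^T K r / n^2 as sum_i lambda_i * (v_i^T r / n)^2 via the eigendecomposition of K."""
    n = len(residuals)
    eigvals, eigvecs = np.linalg.eigh(K)        # K = sum_i lambda_i v_i v_i^T
    projections = eigvecs.T @ residuals / n     # v_i^T r / n, the empirical moments
    return np.sum(eigvals * projections ** 2)

# kmmr_risk_spectral(r, K) agrees with kmmr_risk(r, K) up to numerical error, which shows
# why a kernel with small eigenvalues makes the loss small for every theta and fools CV.
```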

Figure 1: Illustration of instrument spaces and identifiability. Instrument spaces are located on the top horizontal axis, with higher complexity to the right. \mathcal{H}^{*} denotes the LIIS. Spaces are categorized into three classes: under-identification, identification, and over-identification, with an example of each shown below the axis. Under-identification: the MMR value R=0 at multiple model parameters \theta; identification: R=0 at a unique \theta; over-identification: R=0 at no \theta; weak identification: R=0 at a unique \theta but R\approx 0 at multiple \theta's.

The proposed optimality. To resolve this issue, we consider candidate instrument spaces \mathcal{H}_{1},\dots,\mathcal{H}_{M}; for example, RKHSs induced by the Gaussian kernel with different lengthscale parameters. We introduce assumptions on the identification of \theta_{0} and the identifiability of the \mathcal{H}_{i}.

Assumption 1 (Global identification).

There exists a unique \theta_{0}\in\Theta satisfying \mathrm{CMR}(\theta_{0})=\mathbf{0}.

Assumption 2 (Identifiability).

There exists a non-empty index set \mathcal{M}\subset\{1,2,\dots,M\} such that for any m\in\mathcal{M}, \theta_{0} uniquely satisfies R_{\mathcal{H}_{m}}(\theta_{0})=0.

These assumptions guarantee that at least one instrument space among the candidates identifies the unique solution \theta_{0} of the CMR problem (1). We then define a notion of optimality for instrument spaces. Let \Omega(\mathcal{H}) be a complexity measure of \mathcal{H}, which will be specified later.

Definition 1 (Least Identifiable Instrument Space (LIIS)).

A least identifiable instrument space \mathcal{H}^{*} is an identifiable instrument space with the least complexity, i.e., \mathcal{H}^{*}:=\operatornamewithlimits{argmin}_{\mathcal{H}_{j}:j\in\mathcal{M}}\Omega(\mathcal{H}_{j}).

The notion of the LIIS is designed to satisfy two requirements: identifiability of \theta_{0}, and low complexity to reduce the estimation error. This notion of optimality differs from criteria based on test errors, such as CV. We provide an illustration of these concepts in Figure 1.

4 Least identification selection criterion (LISC)

We propose a method to find the LIIS from finite samples, named the least identification selection criterion (LISC), by developing several criteria necessary for the LIIS and combining them. The criteria are as follows: an identification test criterion (ITC) and a kernel effective information criterion (KEIC). We first introduce the overall methodology and then explain each criterion.

Based on the ITC and KEIC, we propose a simple two-step procedure to select the optimal space. We first select a set of identifiable spaces via the ITC and then choose the identifiable space with the least KEIC. In case no identifiable instrument space is determined, e.g., when neural networks are employed, we minimize the ratio of the KEIC to the ITC to select the LIIS:

\hat{\mathcal{H}}=\operatornamewithlimits{argmin}_{\mathcal{H}_{j}:j=1,\dots,M}\mathrm{KEIC}(\mathcal{H}_{j})/\mathrm{ITC}(\mathcal{H}_{j}).

Note that the ratio-based method follows the spirit of the LIIS and of our two-step procedure. A sketch of the full selection procedure is given below.
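The following minimal sketch outlines the selection procedure, assuming helper functions itc(space, data) and keic(space, data) implementing the criteria of Sections 4.1 and 4.2 (the names are ours) and a threshold q corresponding to the significance level \alpha:

```python
def select_liis(spaces, data, itc, keic, q):
    """Two-step LISC: keep spaces whose ITC rejects the rank-deficiency null, then pick the
    identifiable space with the smallest KEIC; if none is identifiable, minimize KEIC / ITC."""
    itc_vals = {s: itc(s, data) for s in spaces}
    keic_vals = {s: keic(s, data) for s in spaces}

    identifiable = [s for s in spaces if itc_vals[s] > q]
    if identifiable:
        return min(identifiable, key=lambda s: keic_vals[s])
    # Fallback used, e.g., with neural networks (Section 6.3).
    return min(spaces, key=lambda s: keic_vals[s] / itc_vals[s])
```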

4.1 Criterion 1: identification test criterion (ITC)

We first develop a method to assess the identifiability of an instrument space by validating whether the minimizer of R_{\mathcal{H}}(\theta) is unique. The uniqueness is verified by examining the rank of a Hessian-like matrix of R_{\mathcal{H}}(\theta), namely F_{\mathcal{H}}(\theta_{*}):=\mathbb{E}[u_{\theta_{*}}(S)], where u_{\theta}(S)\coloneqq\nabla_{\theta}\varphi_{\theta}(X)k(Z,Z^{\prime})\nabla_{\theta}\varphi_{\theta}(X^{\prime})^{\top} with S\coloneqq(X,Z,X^{\prime},Z^{\prime}), and \theta_{*} is a minimizer of R_{\mathcal{H}} assumed to satisfy R_{\mathcal{H}}(\theta_{*})=0. Full-rankness of the matrix F_{\mathcal{H}}(\theta_{*}) is a sufficient condition for global identification with linear models \varphi_{\theta}, and for local identification with nonlinear models [34, Assumptions 2.3, 3.6].

Test of full-rankness. We develop a statistical test for the full-rankness of F_{\mathcal{H}}(\theta_{*}), based on the test of ranks [65]. With a c-dimensional parameter \theta and the c\times c matrix F_{\mathcal{H}}(\theta_{*}), we consider the null hypothesis H_{0}:\mathrm{rank}(F_{\mathcal{H}}(\theta_{*}))=c-1 against the alternative H_{1}:\mathrm{rank}(F_{\mathcal{H}}(\theta_{*}))=c. Let \lambda_{c} be the smallest eigenvalue of F_{\mathcal{H}}(\theta_{*}), which is non-negative owing to the quadratic form of F_{\mathcal{H}}. For the test, we consider the empirical analogue \hat{F}_{\mathcal{H}}(\hat{\theta})=\mathbb{E}_{n}[u_{\hat{\theta}}(S)]\coloneqq n^{-2}\sum_{i,j=1}^{n}u_{\hat{\theta}}(s_{ij}) with s_{ij}:=(x_{i},z_{i},x_{j},z_{j}) and \hat{\theta}\coloneqq\operatornamewithlimits{argmin}_{\theta}\hat{R}_{\mathcal{H}}(\theta). We then take the smallest eigenvalue \hat{\lambda}_{c}\geq 0 of \hat{F}_{\mathcal{H}}(\hat{\theta}) and employ it as the test statistic \hat{T}:=\hat{\lambda}_{c}^{2}. Let N be a standard normal random variable and \Lambda a fixed scalar given in Appendix A. For a significance level \alpha\in(0,1), such as \alpha=0.05, let Q_{1-\alpha} be the (1-\alpha)-quantile of N^{2}. Our test has limiting power one:

Theorem 2.

Assume the conditions of Theorem 6. The test that rejects the null \mathrm{rank}(F_{\mathcal{H}}(\theta_{*}))=c-1 when n\hat{T}>\Lambda Q_{1-\alpha} is consistent against any fixed alternative \mathrm{rank}(F_{\mathcal{H}}(\theta_{*}))=c.

Test criterion. We propose a criterion based on the test statistic \hat{T}. We randomly split the dataset into two parts and use one part to compute \hat{T} and the other part to compute \hat{\Lambda}=(\hat{C}\otimes\hat{C})^{\top}\hat{\Omega}(\hat{C}\otimes\hat{C}), where \hat{C} is the eigenvector of \hat{F}_{\mathcal{H}}(\hat{\theta}) corresponding to the smallest eigenvalue \hat{\lambda}_{c}, and \hat{\Omega}=\mathbb{E}_{n}[\mathrm{vec}(u_{\hat{\theta}}(S))\mathrm{vec}(u_{\hat{\theta}}(S))^{\top}]-\mathbb{E}_{n}[\mathrm{vec}(u_{\hat{\theta}}(S))]\mathbb{E}_{n}[\mathrm{vec}(u_{\hat{\theta}}(S))]^{\top}. Here, \mathrm{vec}(\cdot) is the vectorization operator. We then define the identification test criterion (ITC) as

\mathrm{ITC}(\mathcal{H}):=n\hat{T}/\hat{\Lambda}.

We select an instrument space under which we can reject the null hypothesis, namely, we select \mathcal{H} when \mathrm{ITC}(\mathcal{H})>Q_{1-\alpha} holds. The validity of this selection is shown in the following result:

Theorem 3 (Consistency of ITC).

Suppose that the conditions of Theorem 2 and Assumption 1 hold, \Theta is compact, R_{\mathcal{H}}(\theta) is consistent with \mathrm{CMR}(\theta), \hat{R}_{\mathcal{H}}(\theta) converges to R_{\mathcal{H}}(\theta) uniformly in probability, and R_{\mathcal{H}}(\theta) is finite. Then the instrument space selected by the ITC is identifiable with probability approaching 1 as n\to\infty.
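For concreteness, the following numpy sketch shows one way to compute the ITC for a model whose residual gradient \nabla_{\theta}\varphi_{\theta}(x) is available in closed form (e.g., minus the polynomial basis for the linear-in-parameters models of Section 6). It follows our reading of the definitions above, including the sample split, and is not the authors' implementation; details such as which half supplies \hat{C} are our own choices.

```python
import numpy as np

def f_hat(grads, K):
    """F_hat_H(theta_hat) = n^{-2} sum_{i,j} grad_i * k(z_i, z_j) * grad_j^T  (a c x c matrix)."""
    n = grads.shape[0]
    return grads.T @ K @ grads / n ** 2

def itc(grads, z, kernel, seed=0):
    """ITC(H) = n * lambda_min(F_hat)^2 / Lambda_hat, with a random sample split:
    one half for the test statistic, the other half for Lambda_hat."""
    n = grads.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    a, b = idx[: n // 2], idx[n // 2:]

    # Part A: test statistic T_hat = (smallest eigenvalue of F_hat)^2.
    lam_a = np.linalg.eigvalsh(f_hat(grads[a], kernel(z[a], z[a])))
    t_hat = lam_a[0] ** 2

    # Part B: Lambda_hat = (C kron C)^T Omega_hat (C kron C), with C the eigenvector of
    # F_hat for the smallest eigenvalue and Omega_hat the covariance of vec(u) over pairs.
    K_b = kernel(z[b], z[b])
    _, vecs = np.linalg.eigh(f_hat(grads[b], K_b))
    c_hat = vecs[:, 0]
    m, c = grads[b].shape
    u = np.einsum("ip,ij,jq->ijpq", grads[b], K_b, grads[b]).reshape(m * m, c * c)
    omega = np.cov(u, rowvar=False, bias=True)
    cc = np.kron(c_hat, c_hat)
    lam_hat = cc @ omega @ cc
    return len(a) * t_hat / lam_hat   # we take n to be the size of the split used for T_hat

# For the polynomial model f_m(x) = sum_i c_i x^i, the residual gradient is simply
# grads = -np.vander(x, m + 1, increasing=True), independent of theta.
```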

4.2 Criterion 2: kernel effective information criterion (KEIC)

We develop another criterion for the LIIS in the spirit of information criteria such as those of Akaike [2] and Andrews [4]. The strategy is to estimate both elements required for the LIIS: the complexity of \mathcal{H}, measured by the notion of effective dimension, and the identifiability, measured by the empirical loss.

Effective dimension. The effective dimension is a common measure of the complexity of an RKHS [74, 52, 15] and has been used to analyze the performance of kernel methods. To develop the notion, we consider the Mercer expansion [53] k(z,z^{\prime})=\sum_{i\geq 1}\lambda_{i}^{k}\phi_{i}(z)\phi_{i}(z^{\prime}) for q=1, as provided in Section 3, with a superscript k to differentiate the eigenvalues of the kernel k from those of F_{\mathcal{H}}. Based on [15], we define the effective dimension as

\mathrm{E}_{k}=\left(\sum_{i=1}^{\infty}\lambda_{i}^{k}\right)\left[\sum_{i=1}^{\infty}(\lambda_{i}^{k})^{2}\right]^{-\frac{1}{2}}. \qquad (4)

We develop an empirical estimator of \mathrm{E}_{k} as \mathrm{Tr}(K_{\bm{z}})\mathrm{Tr}(K_{\bm{z}}^{2})^{-\frac{1}{2}}, where [K_{\bm{z}}]_{ij}=k(z_{i},z_{j}) is the kernel matrix and \mathrm{Tr}(\cdot) is the trace operator. We show its consistency as follows.

Theorem 4.

As n\to\infty, \mathrm{Tr}(K_{\bm{z}})\mathrm{Tr}(K_{\bm{z}}^{2})^{-\frac{1}{2}}\to\mathrm{E}_{k} holds.

The effective dimension measures the complexity of the instrument space \mathcal{H}_{k} and, at the same time, quantifies some capacity properties of the marginal measure P_{Z}. An interpretation of our definition is that the numerator counts the effective UMRs, namely those assigned relatively larger eigenvalues, while the denominator normalizes the count. Effective dimensions may differ across tasks. For example, \sum_{i}\lambda_{i}(\lambda_{i}+\alpha)^{-1} is used in least-squares regression [74, 15], where \alpha>0 is a regularization parameter, and Lopes et al. [50] interpret (\sum_{i}\lambda_{i})^{2}/(\sum_{i=1}^{\infty}\lambda_{i}^{2}) as the effective dimension of a covariance matrix.
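A one-line sketch of the plug-in estimator of (4) from a kernel matrix (our own illustration):

```python
import numpy as np

def effective_dimension(K):
    """Plug-in estimator of (4): Tr(K) / sqrt(Tr(K^2)); the scale of K cancels out."""
    return np.trace(K) / np.sqrt(np.sum(K * K))   # Tr(K^2) = sum_ij K_ij^2 for symmetric K
```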

Information criterion. We then develop an information criterion based on the notion of effective dimension. The key idea is to add the estimated effective dimension as a penalty term to the empirical KMMR risk. Note that the ITC is developed under the assumption that a parameter satisfying the KMMR exists, which may not always hold; it is therefore necessary to check this existence through the empirical risk. We propose the kernel effective information criterion (KEIC) on \mathcal{H} below, in analogy to standard information criteria such as the BIC (Bayesian Information Criterion) [67].

\mathrm{KEIC}(\mathcal{H}):=n\hat{R}_{\mathcal{H}}(\hat{\theta})+\mathrm{Tr}(K_{\bm{z}})\mathrm{Tr}(K_{\bm{z}}^{2})^{-1/2}\log n.
Remark 1.

Given a set of valid and invalid instrument spaces, the KEIC filters out the invalid ones with probability approaching 1 as n\to\infty, since they do not have zero risk, i.e., R_{\mathcal{H}}>0, so the first term of the KEIC grows faster than the second.
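Putting the pieces together, the KEIC can be evaluated from the residuals at the minimizer \hat{\theta} and the kernel matrix, as in the following sketch (our own illustration, not the authors' code):

```python
import numpy as np

def keic(residuals_at_theta_hat, K):
    """KEIC(H) = n * R_hat_H(theta_hat) + [Tr(K) / sqrt(Tr(K^2))] * log(n)."""
    n = len(residuals_at_theta_hat)
    risk = residuals_at_theta_hat @ K @ residuals_at_theta_hat / n ** 2   # empirical KMMR risk
    eff_dim = np.trace(K) / np.sqrt(np.sum(K * K))                        # effective dimension
    return n * risk + eff_dim * np.log(n)
```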

Theorem 5 (Consistency of KEIC).

Suppose that a function space \mathcal{H}^{l} satisfying Assumption 2 uniquely exists and that Assumption 1 holds. Given the set of identifiable spaces \mathcal{H}_{i}, i\in\mathcal{M}, we have \mathrm{P}(\tilde{\mathcal{H}}=\mathcal{H}^{l})\to 1 as n\to\infty, where \tilde{\mathcal{H}}=\operatornamewithlimits{argmin}_{\mathcal{H}_{j}:j\in\mathcal{M}}\mathrm{KEIC}(\mathcal{H}_{j}).

Remark 2 (Consistency of two-step procedure).

Suppose that the conditions of Theorem 5 hold. Then the consistency of the two-step selection procedure for the LIIS holds if the space selected by the ITC is identifiable with probability approaching 1 as n\to\infty.

5 Related work

The problems of moment and instrument selection have a long history in econometrics [61, 34, 32]. While both problems often involve the GMM estimator, the latter focuses on the IVs and the corresponding estimator is referred to as the generalized instrumental variable (GIV) estimator [37]. In general, existing selection criteria can be summarized into three broad categories: (i) large sample or first-order asymptotic property based criteria, (ii) finite sample or higher-order asymptotic property based criteria, and (iii) information criteria.

The first category of selection methods, which was popular in the 1970s-1980s, treats the asymptotic efficiency of the resulting estimator as the most desirable criterion; see [61] for a review. However, such criteria may not guarantee good finite-sample properties, as they incur large biases in practice [55, 26]. Thus, subsequent work gradually turned to the second category, which aims to improve the finite-sample precision of parameter estimation. Donald and Newey [26] proposed an instrument selection criterion based on a second-order asymptotic property, i.e., a Nagar-type approximation [60] to the bias of linear causal models. Newey and Smith [63] explored higher-order asymptotic properties of GMM and generalized empirical likelihood estimation, which were later applied by Donald et al. [28] to develop instrument selection for nonlinear causal models. Interestingly, as shown by Newey and Smith [63], many bias-correction methods implicitly improve higher-order asymptotic efficiency. Moreover, the idea of improving finite-sample biases is closely related to cross-validation (see, e.g., Donald and Newey [26, pp. 1165] and Carrasco [16, Section 4]), which has been used for different targets such as selecting the weight matrix of GMM [16] and the regularization parameters in GMM estimation [73]. Nonetheless, higher-order asymptotic properties often rely on complicated theoretical analyses, and their practical performance is sensitive to the accuracy of the empirical approximation and to noise in the data [30]; the empirical approximation is also often computationally heavy (see, e.g., Donald et al. [28]). Additionally, this category of methods requires the strong assumption that all instrument candidates are valid [26, 28]. Therefore, it is desirable to seek a theoretically simpler selection method which remains robust and easy to use in practice. The last category, to which our method also belongs, relies on information criteria. Andrews [4] proposed an orthogonality-condition-based criterion: the method selects a maximal number of valid moments by simultaneously minimizing the objective of the GMM estimation and maximizing the number of instruments. Hall et al. [33] proposed an efficiency- and non-redundancy-based criterion to avoid the inclusion of redundant moment conditions [12], which is a weakness of the orthogonality-condition-based criterion. For comprehensive reviews, we refer readers to [34, Chapter 7] and [32].

Recently, CMR models have become increasingly popular in the machine learning community [57, 24, 8], which opens up a new possibility of resolving the selection problem with modern machine learning tools. Popular works include DeepIV [38], KernelIV [69], DualIV [58], and adversarial structured equation models (SEM) [46] in the sub-area of causal machine learning; see also Liao et al. [47] and references therein for related work from the reinforcement learning (RL) perspective. In this line of work, no instruments are exploited and the focus is mainly on estimating the conditional density or solving the saddle-point reformulation of the CMR. The second line of work, which is more closely related to ours, includes adversarial GMM [45, 24, 73, 49, 29, 70, 43] with an adversarial instrument drawn from a function space, and DeepGMM [8] with fixed instruments. All of these methods employ flexible instruments such as neural networks [45, 24] or RKHSs [24, 73, 49, 29, 70, 43], and adversarial GMM selects appropriate instruments by maximizing the GMM objective in order to obtain robust estimators. To the best of our knowledge, none of these works addresses the moment selection problem directly for the IV estimator.

Lastly, the KMMR objective is also related to the maximum mean discrepancy (MMD) [11] and the kernel Stein discrepancy (KSD) [48], as pointed out by Muandet et al. [57]. The kernel selection problem for the MMD was previously studied for the two-sample testing problem [31]. The principle there is to maximize the test power, and it was later widely applied to kernel selection for other hypothesis tests, such as independence testing based on the finite-set independence criterion [41] and goodness-of-fit testing based on the finite-set KSD [42]. While this approach can be applied to KMMR-based hypothesis testing [57], it is not applicable in our case since we focus on the parameter estimation problem.

6 Experiments

We demonstrate the effectiveness of our selection criterion on the IV regression task with two baselines: (1) the median heuristic employed by Zhang et al. [73] and (2) Silverman’s rule-of-thumb [68]. These baselines are widely used in statistics and related fields [35].

6.1 Simple linear IV regression

We first demonstrate the ability of our method to select the LIIS among a set of candidate instrument spaces. We consider simple linear (in parameters) IV regression models, because it is easy to analyze the global identifiability of instrument spaces in this setting; Hall et al. [34, Section 2.1] provide the details.

Data. We consider data generation models employed by Bennett et al. [8] and adapt them to our experiments: Y=f^{*}(X)+e+\delta, X=Z+e+\gamma, where Z\sim\mathrm{Uniform}[-3,3], e\sim\mathcal{N}(0,1) and \delta,\gamma\sim\mathcal{N}(0,0.1^{2}). The variable e is the confounding variable that creates the correlation between X and the residual (Y-f^{*}(X)). Z is an instrumental variable correlated with X, and we employ two choices of the function f^{*} to enrich the test scenarios: (i) linear: f^{*}(x)=x; (ii) quadratic: f^{*}(x)=x^{2}+x. We generate n\in\{100,500,1000\} data points for the training, validation and test sets, respectively. We give more experimental details in Appendix C.2, and a sketch of this data-generating process below.
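A sketch of the data-generating process described above (our own code; function and argument names are ours):

```python
import numpy as np

def generate_data(n, f_star, seed=0):
    """Y = f*(X) + e + delta,  X = Z + e + gamma,  Z ~ Uniform[-3, 3]."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-3, 3, size=n)          # instrument
    e = rng.normal(0, 1, size=n)            # confounder shared by X and Y
    x = z + e + rng.normal(0, 0.1, size=n)
    y = f_star(x) + e + rng.normal(0, 0.1, size=n)
    return x, y, z

# e.g. x, y, z = generate_data(500, f_star=lambda x: x ** 2 + x)   # quadratic f*
```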

Algorithms. We use a linear combination of polynomial basis functions to estimate the unknown true function f^{*}(x), namely f_{m}(x)\coloneqq\sum_{i=0}^{m}c_{i}x^{i}, where c_{i}\in\mathbb{R} denote the unknown parameters and we consider two polynomial degrees m\in\{2,4\}. For the KMMR framework, we employ a wide range of kernels to construct instrument spaces: (i) the linear kernel (denoted L), k_{L}(z,z^{\prime})=zz^{\prime}; (ii) the polynomial kernel k_{Pd}(z,z^{\prime})=(zz^{\prime}+p)^{d}, where p\geq 0 is a kernel parameter and d\in\mathbb{N} controls the polynomial degree; we consider d\in\{2,4\} (denoted P2, P4) and p\in\{1,2\}; (iii) the Gaussian kernel (denoted G), k_{G}(z,z^{\prime})=\exp(-(z-z^{\prime})^{2}/(2p^{2})), where we consider bandwidths p\in\{0.1,0.2,0.5,1,2\}. The LIIS is identified under this setting (see details in Appendix C.1): P2 with p=1 for f_{2}, and P4 with p=1 for f_{4}. The candidate kernels can be constructed as in the sketch below.
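The candidate kernels can be built, for instance, as follows (our own helper code; the dictionary keys match the [kernel]-[parameter] labels used in Figure 2):

```python
import numpy as np

def poly_kernel(d, p):
    return lambda z1, z2: (np.outer(z1, z2) + p) ** d

def gauss_kernel(p):
    return lambda z1, z2: np.exp(-(z1[:, None] - z2[None, :]) ** 2 / (2 * p ** 2))

candidates = {"L": lambda z1, z2: np.outer(z1, z2)}                                  # linear kernel
candidates.update({f"P{d}-{p}": poly_kernel(d, p) for d in (2, 4) for p in (1, 2)})
candidates.update({f"G-{p}": gauss_kernel(p) for p in (0.1, 0.2, 0.5, 1, 2)})
```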

Results. We present the results for the linear f^{*} in Fig. 2 and for the quadratic f^{*} in Fig. 3 in the Appendix. We can see that the unidentifiable instrument spaces (L for f_{2}; L, P2-1 and P2-2 for f_{4}) are always tested to be unidentifiable, while the identifiable spaces show increasing significance of identifiability as the data size grows. For f_{2}, the LIIS P2-1 is chosen by our method on the larger datasets, while, due to the larger number of parameters in f_{4}, the LIIS P4-1 requires more data before it is tested as significantly identifiable. This reveals a difficulty of our method in identifying the LIIS when the model has many parameters and the dataset is small.

[Figure 2 panels: normalized ITC for the instrument spaces L, P2-1, P2-2, P4-1, P4-2, G-2, G-1, G-0.5, G-0.2, G-0.1, with n = 100, 500, 1000.]
Figure 2: ITC evaluated on the linear function f^{*}. The left and right plots employ f_{2} and f_{4} in the estimation, respectively. We use [kernel]-[parameter] to denote different kernels and normalize all values in each plot to [0,1]. The symbols (*) to the right of nodes denote the selected LIISs. The red dashed lines denote the quantile corresponding to the significance level \alpha=0.05.
Table 1: The mean squared error (MSE) ± one standard deviation (n=500, f_{4}). Columns give the true function f^{*}.

| Scenario | Algorithm | abs | linear | quad | sin |
|---|---|---|---|---|---|
| LS | Silverman | 0.023 ± 0.006 | 0.006 ± 0.005 | 0.006 ± 0.007 | 0.032 ± 0.012 |
| | Med-Heuristic | 0.023 ± 0.007 | 0.006 ± 0.006 | 0.006 ± 0.008 | 0.032 ± 0.010 |
| | Our Method | 0.023 ± 0.007 | 0.006 ± 0.006 | 0.006 ± 0.008 | 0.031 ± 0.010 |
| LW | Silverman | 0.055 ± 0.059 | 0.032 ± 0.023 | 0.017 ± 0.011 | 0.058 ± 0.050 |
| | Med-Heuristic | 0.214 ± 0.019 | 0.066 ± 0.011 | 0.037 ± 0.018 | 0.101 ± 0.012 |
| | Our Method | 0.024 ± 0.010 | 0.015 ± 0.016 | 0.009 ± 0.010 | 0.019 ± 0.022 |
| NS | Silverman | 7.384 ± 18.271 | 0.137 ± 0.118 | 0.595 ± 0.733 | 0.539 ± 0.913 |
| | Med-Heuristic | 0.070 ± 0.044 | 0.021 ± 0.019 | 0.019 ± 0.011 | 0.074 ± 0.053 |
| | Our Method | 0.039 ± 0.019 | 0.006 ± 0.004 | 0.007 ± 0.003 | 0.028 ± 0.027 |

6.2 Robustness of parameter estimation with linear models

We compare our method and baselines on parameter estimation with the linear IV model.

Settings. We employ the previously-defined linear model f_{4} for IV regression and measure the mean squared error (MSE) between the optimized f_{4} and the ground truth f^{*}. We adapt the data generation models as follows: Y=f^{*}(X)+e+\delta, with X=d^{-1}\sum_{i=1}^{d}g(Z_{i})+e+\gamma, where Z:=(Z_{1},\cdots,Z_{d})\sim\mathrm{Uniform}([-3,3]^{d}) and the other variables are defined as in the last subsection. We consider two more choices of f^{*}: (iii) abs: f^{*}(x)=|x|, and (iv) sin: f^{*}(x)=\sin(x), in addition to the (i) linear and (ii) quadratic f^{*} of the last subsection. We sample n\in\{100,500\} data points for the training, validation and test sets, respectively. We design three evaluation scenarios (sketched below): (a) d=1, g(Z)=Z; (b) d=6, g(Z)=Z; (c) d=1, g(Z)=\sin(Z). In (a), Z is a strong IV linearly correlated with X, which is an ideal data generation process; we refer to this as the linearly strong (LS) IV scenario. Scenario (b) has d=6 weak IVs linearly correlated with X (the linearly weak, LW, IV scenario), which is common in real-world applications, such as the genetic variants employed in Mendelian randomization, which are known to be weak IVs [44, 39, 14]. Scenario (c) considers a nonlinearly strong correlation between Z and X (the nonlinearly strong, NS, IV scenario); this setting is also common, e.g., in two-stage least squares methods, where nonlinear models are often employed to fit the relation between Z and X [38, 69].
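A sketch of the three data-generating scenarios (our own code; names are ours):

```python
import numpy as np

def generate_scenario(n, f_star, d=1, g=lambda z: z, seed=0):
    """Y = f*(X) + e + delta,  X = mean_i g(Z_i) + e + gamma,  Z ~ Uniform([-3, 3]^d)."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-3, 3, size=(n, d))
    e = rng.normal(0, 1, size=n)
    x = g(z).mean(axis=1) + e + rng.normal(0, 0.1, size=n)
    y = f_star(x) + e + rng.normal(0, 0.1, size=n)
    return x, y, z

# LS: generate_scenario(n, f_star, d=1)            # strong, linearly correlated IV
# LW: generate_scenario(n, f_star, d=6)            # six weak, linearly correlated IVs
# NS: generate_scenario(n, f_star, d=1, g=np.sin)  # strong, nonlinearly correlated IV
```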

Results. Results are shown in Table 1 (n=500) and Table 3 (n=100, in the Appendix). The two baselines and our method have very close performance in the LS scenario. In contrast, the weak IVs in the LW scenario have a significantly negative effect on the median-heuristic method, whereas our method performs stably and Silverman's rule of thumb somewhat less stably than our method. Moreover, Silverman's rule of thumb suffers from the nonlinear correlation between Z and X in the NS scenario, while the median-heuristic method performs better there than in the LW scenario; our method still performs stably and well. These results show the robustness of our method for parameter estimation across the different scenarios, as well as its flexibility: it can select any kernel, whereas the baselines can in principle only handle Gaussian kernels. Therefore, our method is preferable to the two baselines for the task of parameter estimation.

6.3 Robustness of parameter estimation with neural networks

We further assess the effectiveness of our method on parameter estimation with neural networks (NNs), which are commonly-used models in machine learning. Due to the complicated structure of NNs, this problem is harder than the previous one.

Settings. We employ the weak IV scenario with d=2 from the last subsection and generate n=500 samples for the training, validation and test datasets, respectively. For a clean demonstration and to avoid, e.g., training difficulties with deep NN models, we employ two relatively simple models: (i) one with a single hidden layer of 10 units, denoted N_{10}, and (ii) one with two hidden layers of 5 units each, denoted N_{55}. The sigmoid activation function is used for the hidden layers. Both models are sufficient to fit the simple true causal functions f^{*}(X). We find that, due to the many parameters, there are often no instrument spaces that are significantly identifiable. Therefore, we minimize the ratio \mathrm{KEIC}/\mathrm{ITC} defined in Section 4 to select the LIIS.

An approximation to the ITC. We approximate the ITC to reduce the computational burden caused by the many parameters of NNs. We consider a common form of NN, f(x)=W_{0}\Phi(x)+b_{0}, where \Phi(x)=\sigma_{h}(b_{h}+W_{h}\sigma_{h-1}(\cdots\sigma_{1}(b_{1}+W_{1}x))) denotes a depth-h structure with weights \bm{W}, biases \bm{b} and activation functions \bm{\sigma}. We view \Phi(x) as basis functions, whose parameters are updated during training but held fixed when computing the ITC. As a result, we only use the gradients with respect to W_{0} and b_{0} to evaluate the ITC, which approximates the ITC computed on the gradients with respect to all parameters. This accelerates the computation of the ITC for an NN, and we assess the performance of parameter estimation with this approximation; a sketch is given below.
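A sketch of this approximation for a one-hidden-layer network, assuming the itc helper sketched in Section 4.1: since f(x)=W_{0}\Phi(x)+b_{0}, the residual gradient with respect to (W_{0},b_{0}) is simply -[\Phi(x),1], so only a forward pass up to the last hidden layer is needed (our own illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_features(x, W1, b1):
    """Phi(x) for a one-hidden-layer network f(x) = W0 . Phi(x) + b0 (scalar input x)."""
    return sigmoid(x[:, None] * W1[None, :] + b1)          # shape (n, n_hidden)

def last_layer_grads(x, W1, b1):
    """Gradient of the residual y - f(x) with respect to (W0, b0), namely -[Phi(x), 1]."""
    phi = hidden_features(x, W1, b1)
    return -np.hstack([phi, np.ones((len(x), 1))])

# Approximate ITC: feed these gradients into the itc helper sketched in Section 4.1,
# e.g. itc(last_layer_grads(x, W1, b1), z, kernel).
```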

Results. The MSEs between the optimized NN \hat{f}_{\mathrm{NN}} and f^{*} are reported in Table 2. First, we find that minimizing the ratio \mathrm{KEIC}/\mathrm{ITC} indeed helps to reduce the bias of parameter estimation with NNs. Compared with the two baselines, our method performs stably across datasets and NN structures. This provides further evidence that our method improves the practical performance of the KMMR. Second, our approximate ITC works well for the employed NNs.

Table 2: The mean squared error (MSE) ± one standard deviation (n=500, NN). Columns give the true function f^{*}.

| NN | Algorithm | abs | linear | quad | sin |
|---|---|---|---|---|---|
| NN_{10} | Silverman | 0.328 ± 0.152 | 0.166 ± 0.107 | 0.031 ± 0.020 | 0.327 ± 0.098 |
| | Med-Heuristic | 0.231 ± 0.030 | 0.045 ± 0.021 | 0.012 ± 0.004 | 0.179 ± 0.034 |
| | Our Method | 0.041 ± 0.027 | 0.027 ± 0.016 | 0.006 ± 0.003 | 0.058 ± 0.045 |
| NN_{55} | Silverman | 0.444 ± 0.255 | 0.102 ± 0.074 | 0.037 ± 0.018 | 0.630 ± 0.398 |
| | Med-Heuristic | 0.187 ± 0.044 | 0.039 ± 0.018 | 0.013 ± 0.004 | 0.145 ± 0.039 |
| | Our Method | 0.036 ± 0.013 | 0.013 ± 0.003 | 0.007 ± 0.003 | 0.035 ± 0.024 |

7 Conclusions

The conditional moment restriction (CMR) is ubiquitous in many fields, and the kernel maximum moment restriction (KMMR) is a promising framework for dealing with the CMR owing to its easy-to-use and easy-to-analyze form and its strong practical performance. However, the optimal choice of the instrument space is a challenge that affects the effectiveness of the framework. The present work proposes a systematic procedure to select the instrument space for instrumental variable (IV) estimators. We first define a selection principle, the least identifiable instrument space (LIIS), that identifies the model parameters with the least space complexity. To determine the LIIS among a set of candidates, we propose the least identification selection criterion (LISC) as a combination of two criteria: (i) the identification test criterion (ITC) to check identifiability, and (ii) the kernel effective information criterion (KEIC), consisting of the risk function value to check the CMR condition and the effective dimension as a complexity penalty. We analyze the consistency of our method, and our experiments provide evidence of its effectiveness for identifying the LIIS and for parameter estimation with IV estimators.

References

  • Ai and Chen [2003] C. Ai and X. Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003. ISSN 00129682, 14680262.
  • Akaike [1974] H. Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
  • Andersen and Sorensen [1996] T. G. Andersen and B. E. Sorensen. GMM estimation of a stochastic volatility model: A monte carlo study. Journal of Business & Economic Statistics, 14(3):328–352, 1996.
  • Andrews [1999] D. W. K. Andrews. Consistent Moment Selection Procedures for Generalized Method of Moments Estimation. Econometrica, 67(3):543–564, May 1999.
  • Aronszajn [1950] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
  • Athey et al. [2019] S. Athey, J. Tibshirani, S. Wager, et al. Generalized random forests. Annals of Statistics, 47(2):1148–1178, 2019.
  • Baker [1977] C. T. Baker. The numerical treatment of integral equations. 1977.
  • Bennett et al. [2019] A. Bennett, N. Kallus, and T. Schnabel. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems 32, pages 3564–3574. Curran Associates, Inc., 2019.
  • Berlinet and Thomas-Agnan [2004] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
  • Bond and Windmeijer [2002] S. R. Bond and F. Windmeijer. Finite sample inference for gmm estimators in linear panel data models. 2002.
  • Borgwardt et al. [2006] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
  • Breusch et al. [1999] T. Breusch, H. Qian, P. Schmidt, and D. Wyhowski. Redundancy of moment conditions. Journal of Econometrics, 91(1):89–111, 1999. ISSN 0304-4076.
  • Burgess et al. [2017] S. Burgess, D. S. Small, and S. G. Thompson. A review of instrumental variable estimators for mendelian randomization. Statistical Methods in Medical Research, 26(5):2333–2355, 2017.
  • Burgess et al. [2020] S. Burgess, C. N. Foley, E. Allara, J. R. Staley, and J. M. M. Howson. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nature Communications, 11(1):376, 2020.
  • Caponnetto and De Vito [2007] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • Carrasco [2012] M. Carrasco. A regularization approach to the many instruments problem. Journal of Econometrics, 170(2):383–398, 2012.
  • Carrasco and Florens [2000] M. Carrasco and J.-P. Florens. Generalization of gmm to a continuum of moment conditions. Econometric Theory, 16(6):797–834, 2000.
  • Carrasco and Florens [2014] M. Carrasco and J.-P. Florens. On the asymptotic efficiency of GMM. Econometric Theory, 30(2):372–406, 2014. doi: 10.1017/S0266466613000340.
  • Carrasco et al. [2007] M. Carrasco, J.-P. Florens, and E. Renault. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. In J. Heckman and E. Leamer, editors, Handbook of Econometrics, volume 6B, chapter 77. Elsevier, 1 edition, 2007.
  • Cheng and Liao [2015] X. Cheng and Z. Liao. Select the valid and relevant moments: An information-based lasso for gmm with many moments. Journal of Econometrics, 186(2):443–464, 2015.
  • Chernozhukov et al. [2018] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
  • Davey Smith and Ebrahim [2003] G. Davey Smith and S. Ebrahim. ‘mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International journal of epidemiology, 32(1):1–22, 2003.
  • de Jong [1996] R. M. de Jong. The bierens test under data dependence. Journal of Econometrics, 72(1):1–32, 1996. ISSN 0304-4076.
  • Dikkala et al. [2020] N. Dikkala, G. Lewis, L. Mackey, and V. Syrgkanis. Minimax estimation of conditional moment models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12248–12262. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/8fcd9e5482a62a5fa130468f4cf641ef-Paper.pdf.
  • Dominguez and Lobato [2004] M. Dominguez and I. Lobato. Consistent estimation of models defined by conditional moment restrictions. Econometrica, 72(5):1601–1615, 2004.
  • Donald and Newey [2001] S. G. Donald and W. K. Newey. Choosing the number of instruments. Econometrica, 69(5):1161–1191, 2001.
  • Donald et al. [2003] S. G. Donald, G. W. Imbens, and W. K. Newey. Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics, 117(1):55–93, 2003. ISSN 0304-4076. doi: https://doi.org/10.1016/S0304-4076(03)00118-0. URL https://www.sciencedirect.com/science/article/pii/S0304407603001180.
  • Donald et al. [2009] S. G. Donald, G. W. Imbens, and W. K. Newey. Choosing instrumental variables in conditional moment restriction models. Journal of Econometrics, 152(1):28–36, 2009. ISSN 0304-4076. Recent Adavances in Nonparametric and Semiparametric Econometrics: A Volume Honouring Peter M. Robinson.
  • Feng et al. [2019] Y. Feng, L. Li, and Q. Liu. A kernel loss for solving the bellman equation. Advances in neural information processing systems, 32, 2019.
  • Ghosh [1994] J. K. Ghosh. Higher order asymptotics. NSF-CBMS Regional Conference Series in Probability and Statistics, 4:i–111, 1994. ISSN 19355920, 23290978. URL http://www.jstor.org/stable/4153181.
  • Gretton et al. [2012] A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 1205–1213, Red Hook, NY, USA, 2012. Curran Associates Inc.
  • Hall [2015] A. R. Hall. Econometricians have their moments: Gmm at 32. Economic Record, 91(S1):1–24, 2015.
  • Hall et al. [2007] A. R. Hall, A. Inoue, K. Jana, and C. Shin. Information in generalized method of moments estimation and entropy-based moment selection. Journal of Econometrics, 138(2):488–512, 2007.
  • Hall et al. [2005] A. R. Hall et al. Generalized method of moments. Oxford university press, 2005.
  • Hall et al. [1991] P. Hall, S. J. Sheather, M. Jones, and J. S. Marron. On optimal data-based bandwidth selection in kernel density estimation. Biometrika, 78(2):263–269, 1991.
  • Hansen [1982] L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054, 1982.
  • Hansen and Singleton [1982] L. P. Hansen and K. J. Singleton. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica, 50(5):1269–1286, 1982. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/1911873.
  • Hartford et al. [2017] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1414–1423. PMLR, 2017.
  • Hartford et al. [2020] J. S. Hartford, V. Veitch, D. Sridhar, and K. Leyton-Brown. Valid causal inference with (some) invalid instruments. CoRR, abs/2006.11386, 2020.
  • Ivanov et al. [2002] V. K. Ivanov, V. V. Vasin, and V. Tanana. Theory of Linear Ill-posed Problems and Its Applications. Inverse and ill-posed problems series. VSP, 2002.
  • Jitkrittum et al. [2017a] W. Jitkrittum, Z. Szabó, and A. Gretton. An adaptive test of independence with analytic kernel embeddings. In International Conference on Machine Learning, pages 1742–1751. PMLR, 2017a.
  • Jitkrittum et al. [2017b] W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. NIPS’17, page 261–270, Red Hook, NY, USA, 2017b. Curran Associates Inc. ISBN 9781510860964.
  • Kallus [2020] N. Kallus. Generalized optimal matching methods for causal inference. Journal of Machine Learning Research, 21(62):1–54, 2020. URL http://jmlr.org/papers/v21/19-120.html.
  • Kuang et al. [2020] Z. Kuang, F. Sala, N. Sohoni, S. Wu, A. Córdova-Palomera, J. Dunnmon, J. Priest, and C. Re. Ivy: Instrumental variable synthesis for causal inference. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 398–410. PMLR, 2020.
  • Lewis and Syrgkanis [2018] G. Lewis and V. Syrgkanis. Adversarial generalized method of moments. 03 2018.
  • Liao et al. [2020] L. Liao, Y. Chen, Z. Yang, B. Dai, Z. Wang, and M. Kolar. Provably efficient neural estimation of structural equation model: An adversarial approach. In Advances in Neural Information Processing Systems 33. 2020.
  • Liao et al. [2021] L. Liao, Z. Fu, Z. Yang, M. Kolar, and Z. Wang. Instrumental variable value iteration for causal offline reinforcement learning. arXiv preprint arXiv:2102.09907, 2021.
  • Liu et al. [2016] Q. Liu, J. Lee, and M. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284. PMLR, 2016.
  • Liu et al. [2018] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/dda04f9d634145a9c68d5dfe53b21272-Paper.pdf.
  • Lopes et al. [2011] M. E. Lopes, L. J. Jacob, and M. J. Wainwright. A more powerful two-sample test in high dimensions using random projection. arXiv preprint arXiv:1108.2401, 2011.
  • Mastouri et al. [2021] A. Mastouri, Y. Zhu, L. Gultchin, A. Korba, R. Silva, M. Kusner, A. Gretton, and K. Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. In International conference on machine learning. PMLR, 2021.
  • Mendelson et al. [2003] S. Mendelson, T. Graepel, and R. Herbrich. On the performance of kernel classes. Journal of Machine Learning Research, 4:2003, 2003.
  • Mercer [1909] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209:415–446, 1909. ISSN 02643952.
  • Micchelli and Pontil [2005] C. A. Micchelli and M. A. Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.
  • Morimune [1983] K. Morimune. Approximate distributions of k-class estimators when the degree of overidentifiability is large compared with the sample size. Econometrica: Journal of the Econometric Society, pages 821–841, 1983.
  • Muandet et al. [2017] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
  • Muandet et al. [2020a] K. Muandet, W. Jitkrittum, and J. Kübler. Kernel conditional moment test via maximum moment restriction. In J. Peters and D. Sontag, editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 41–50. PMLR, 03–06 Aug 2020a.
  • Muandet et al. [2020b] K. Muandet, A. Mehrjou, S. K. Lee, and A. Raj. Dual instrumental variable regression. In Advances in Neural Information Processing Systems 33. Curran Associates, Inc., 2020b. Forthcoming.
  • Muth [1961] J. F. Muth. Rational expectations and the theory of price movements. Econometrica: Journal of the Econometric Society, pages 315–335, 1961.
  • Nagar [1959] A. L. Nagar. The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica: Journal of the Econometric Society, pages 575–595, 1959.
  • Newey [1993] W. Newey. Efficient estimation of models with conditional moment restrictions. In Handbook of Statistics, volume 11, chapter 16, pages 419–454. Elsevier, 1993.
  • Newey and McFadden [1994] W. K. Newey and D. McFadden. Chapter 36 large sample estimation and hypothesis testing. volume 4 of Handbook of Econometrics, pages 2111 – 2245. Elsevier, 1994.
  • Newey and Smith [2004] W. K. Newey and R. J. Smith. Higher order properties of gmm and generalized empirical likelihood estimators. Econometrica, 72(1):219–255, 2004. ISSN 00129682, 14680262.
  • Riesz [1909] F. Riesz. Sur les opérations fonctionnelles linéaires, 1909.
  • Robin and Smith [2000] J.-M. Robin and R. J. Smith. Tests of rank. Econometric Theory, pages 151–175, 2000.
  • Schölkopf and Smola [2002] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002.
  • Schwarz et al. [1978] G. Schwarz et al. Estimating the dimension of a model. Annals of statistics, 6(2):461–464, 1978.
  • Silverman [1986] B. W. Silverman. Density estimation for statistics and data analysis. London: Chapman & Hall/CRC, 1986.
  • Singh et al. [2019] R. Singh, M. Sahani, and A. Gretton. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems, pages 4595–4607, 2019.
  • Uehara et al. [2020] M. Uehara, J. Huang, and N. Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.
  • Van der Vaart [2000] A. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
  • Vuong [1989] Q. H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2):307–333, 1989.
  • Zhang et al. [2020] R. Zhang, M. Imaizumi, B. Schölkopf, and K. Muandet. Maximum moment restriction for instrumental variable regression. arXiv preprint arXiv:2010.07684, 2020.
  • Zhang [2003] T. Zhang. Effective dimension and generalization of kernel learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

Checklist

  1. For all authors…

    (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] As stated in the abstract and introduction, this paper addresses the kernel instrument space selection problem. The problem arises from the recent kernel maximum moment restriction framework for instrumental variable estimators and differs from the traditional problem of selecting instruments.

    (b) Did you describe the limitations of your work? [Yes] In Section 6.1 of the experiments, we show that our identification test criterion cannot identify the optimal instrument space for models with many parameters when the dataset is small.

    (c) Did you discuss any potential negative societal impacts of your work? [N/A]

    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…

    (a) Did you state the full set of assumptions of all theoretical results? [Yes] We state Assumptions 1 and 2 and include any additional conditions in the statements of the theorems.

    (b) Did you include complete proofs of all theoretical results? [Yes] All proofs are provided in the appendix.

  3. If you ran experiments…

    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code for our experiments is included in the supplemental material.

    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] All training details are included in the experiment section and additional sections in the appendix.

    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We report error bars and provide the random seed and the number of repeated experiments.

    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] Our experimental results do not depend on the type of computational resources used.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    (a) If your work uses existing assets, did you cite the creators? [Yes]

    (b) Did you mention the license of the assets? [N/A] We implemented the baselines and our method ourselves, and our data are simulated.

    (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] Our experiment code is included in the supplemental material.

    (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A] See the answer to 4(b).

    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] Our data are simulated and contain no personally identifiable information or offensive content.

  5. If you used crowdsourcing or conducted research with human subjects…

    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]


Appendix
Instrument Space Selection for Kernel Maximum Moment Restriction


Appendix A Derivation of the ITC

To derive the asymptotic distribution of $\hat{T}$, we need the asymptotic distribution of $\hat{F}_{\mathcal{H}}(\hat{\theta})$, given below. As preparation, let $\mathrm{vec}:\mathbb{R}^{s\times t}\to\mathbb{R}^{st\times 1}$ be the vectorization operator defined by the rule $[\mathrm{vec}(A)]_{(i-1)t+j,1}=A_{i,j}$ for $i=1,\dots,s$ and $j=1,\dots,t$.
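As a minimal illustration (ours, not part of the paper's code), the NumPy sketch below implements this vectorization rule, which coincides with row-major flattening; the function name vec is chosen only for exposition.

```python
import numpy as np

def vec(A: np.ndarray) -> np.ndarray:
    """Stack the rows of an s x t matrix into an (s*t) x 1 column vector,
    following the rule [vec(A)]_{(i-1)t+j,1} = A_{i,j}."""
    s, t = A.shape
    return A.reshape(s * t, 1)  # NumPy's default row-major order matches the rule

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(vec(A).ravel())  # [1. 2. 3. 4.]
```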

Lemma 1.

For $\hat{\theta}\rightarrow\theta$ almost surely and a non-zero matrix $\Omega=\mathbb{E}_{S}\left[\mathrm{vec}(u_{\theta}(S))\mathrm{vec}(u_{\theta}(S))^{\top}\right]-\mathbb{E}_{S}\left[\mathrm{vec}(u_{\theta}(S))\right]\mathbb{E}_{S}\left[\mathrm{vec}(u_{\theta}(S))\right]^{\top}\in\mathbb{R}^{c^{2}\times c^{2}}$, we have $\sqrt{n}\,\mathrm{vec}(\hat{F}_{\mathcal{H}}(\hat{\theta})-F_{\mathcal{H}}(\theta))\overset{\mathrm{d}}{\to}\mathcal{N}(0,\Omega)$.

The asymptotic variance $\Omega$ is essential for determining the asymptotic distribution of $\hat{T}$. We need an assumption on $\Omega$ under the non-full-rank hypothesis $H_{0}$. Let $C$ be the eigenvector of $F_{\mathcal{H}}(\theta)$ corresponding to the eigenvalue $\lambda_{c}=0$ under $H_{0}$. We make the following rank assumption.

Assumption 3 (rank condition).

Under the null hypothesis $H_{0}$, the matrix $(C\otimes C)^{\top}\Omega(C\otimes C)$ is non-zero, where $\otimes$ denotes the Kronecker product.

An equivalent statement of Assumption 3 is that the $c^{2}\times 1$ vector $C\otimes C$ does not lie in the space spanned by the eigenvectors of $\Omega$ associated with the zero eigenvalue. Therefore, if $\Omega$ is positive definite, Assumption 3 is automatically satisfied. This assumption is required to obtain a non-degenerate asymptotic distribution of the test statistic. Let $\hat{C}$ be the eigenvector of $\hat{F}_{\mathcal{H}}(\hat{\theta})$ corresponding to the eigenvalue $\hat{\lambda}_{c}\geq 0$. Under the hypothesis $H_{0}$ and the rank Assumption 3, we obtain the following asymptotic distribution.

Theorem 6.

(Asymptotic distribution of $\hat{T}$) Suppose the conditions of Lemma 1, Assumption 3, and the null hypothesis $H_{0}$ hold. Then $n\hat{T}\overset{\mathrm{d}}{\to}\Lambda Z^{2}$, where $\Lambda\coloneqq(C\otimes C)^{\top}\Omega(C\otimes C)$ and $Z$ is a standard normal random variable.

Using this asymptotic distribution, we can control the type-I error. For a significance level $\alpha\in(0,1)$, let $Q_{1-\alpha}$ be the $(1-\alpha)$-quantile of $\Lambda Z^{2}$. The following result shows that the type-I error asymptotically equals the chosen level $\alpha\in(0,1)$.

Theorem 7.

Suppose Assumption 3 holds. Let $\mathbb{P}_{0}$ be the probability measure of the data under the null hypothesis $H_{0}$. Then, for any $\alpha\in(0,1)$, we have $\mathbb{P}_{0}(n\hat{T}\geq Q_{1-\alpha})\to\alpha$ as $n\to\infty$.
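As a hedged sketch of how Theorems 6 and 7 can be operationalized, the code below computes $n\hat{T}$ from samples of $u_{\hat{\theta}}(S)$ and compares it with a plug-in estimate of $Q_{1-\alpha}=\Lambda\,\chi^{2}_{1,1-\alpha}$. The input array u, the sample-covariance estimate of $\Omega$, and the assumption that each $u_{\hat{\theta}}(S_{i})$ is symmetric are our own simplifications, not the paper's implementation.

```python
import numpy as np
from scipy.stats import chi2

def itc_test(u: np.ndarray, alpha: float = 0.05) -> bool:
    """Reject the non-full-rank hypothesis H0 when n * T_hat exceeds a plug-in
    (1 - alpha)-quantile of Lambda * Z^2. `u` has shape (n, c, c) and holds
    samples u_{theta_hat}(S_i), assumed symmetric."""
    n, c, _ = u.shape
    F_hat = u.mean(axis=0)                       # \hat{F}_H(\hat{theta})
    eigvals, eigvecs = np.linalg.eigh(F_hat)     # eigenvalues in ascending order
    lam_c, C_hat = eigvals[0], eigvecs[:, 0]     # smallest eigenvalue and its eigenvector
    T_hat = lam_c ** 2                           # \hat{T} = \hat{lambda}_c^2
    vec_u = u.reshape(n, c * c)                  # vec of each sample
    Omega_hat = np.cov(vec_u, rowvar=False)      # plug-in estimate of Omega
    CC = np.kron(C_hat, C_hat)                   # C (kron) C
    Lambda_hat = CC @ Omega_hat @ CC             # (C kron C)^T Omega (C kron C)
    Q = Lambda_hat * chi2.ppf(1 - alpha, df=1)   # (1-alpha)-quantile of Lambda * Z^2
    return n * T_hat >= Q
```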

Appendix B Proofs

B.1 Proof of Lemma 1

Proof.

We first decompose the objective into two terms (i) and (ii):

$$\sqrt{n}\,\mathrm{vec}(\hat{F}_{\mathcal{H}}(\hat{\theta})-F_{\mathcal{H}}(\theta))=\underbrace{\sqrt{n}\,\mathrm{vec}(\hat{F}_{\mathcal{H}}(\hat{\theta})-F_{\mathcal{H}}(\hat{\theta}))}_{\text{(i)}}+\underbrace{\sqrt{n}\,\mathrm{vec}(F_{\mathcal{H}}(\hat{\theta})-F_{\mathcal{H}}(\theta))}_{\text{(ii)}}\qquad(5)$$

The term (ii) converges to zero as $n\to\infty$ since $\hat{\theta}\rightarrow\theta$ almost surely. By Slutsky’s theorem [71, Lemma 2.8], the asymptotic distribution is therefore determined by the term (i). Let $\mathbb{E}_{n}$ denote the empirical expectation; the asymptotic distribution of (i) then follows from the central limit theorem:

$$\begin{aligned}&\sqrt{n}\,\mathrm{vec}(\hat{F}_{\mathcal{H}}(\hat{\theta})-F_{\mathcal{H}}(\hat{\theta}))=\sqrt{n}\left(\mathbb{E}_{n}\left[\mathrm{vec}(u_{\hat{\theta}}(S))\right]-\mathbb{E}_{S}\left[\mathrm{vec}(u_{\hat{\theta}}(S))\right]\right)\overset{\mathrm{d}}{\to}\mathcal{N}(\bm{0},\Omega), && (6)\\ &\Omega=\mathbb{E}_{S}\left[\mathrm{vec}(u_{\theta}(S))\mathrm{vec}(u_{\theta}(S))^{\top}\right]-\mathbb{E}_{S}\left[\mathrm{vec}(u_{\theta}(S))\right]\mathbb{E}_{S}\left[\mathrm{vec}(u_{\theta}(S))\right]^{\top}. && (7)\end{aligned}$$
∎

B.2 Proof of Theorem 6

Proof.

We write $\hat{T}^{1/2}\coloneqq\hat{\lambda}=\hat{C}\hat{F}_{\mathcal{H}}(\hat{\theta})\hat{C}$ and $T^{1/2}\coloneqq\lambda=CF_{\mathcal{H}}(\theta)C$. We then decompose $\sqrt{n}(\hat{T}^{1/2}-T^{1/2})$ as follows:

$$\begin{aligned}\sqrt{n}\left(\hat{T}^{1/2}-T^{1/2}\right)&=\sqrt{n}\left(\hat{C}\hat{F}_{\mathcal{H}}(\hat{\theta})\hat{C}-CF_{\mathcal{H}}(\theta)C\right) && (8)\\ &=\underbrace{\sqrt{n}\left(\hat{C}\hat{F}_{\mathcal{H}}(\hat{\theta})\hat{C}-C\hat{F}_{\mathcal{H}}(\hat{\theta})C\right)}_{\text{(i)}}+\underbrace{\sqrt{n}\left(C\hat{F}_{\mathcal{H}}(\hat{\theta})C-CF_{\mathcal{H}}(\theta)C\right)}_{\text{(ii)}}. && (9)\end{aligned}$$

The term (i) converges to $0$ as $n\to\infty$ because $C\hat{F}_{\mathcal{H}}(\hat{\theta})C\to CF_{\mathcal{H}}(\theta)C=\lambda_{c}$ and $\hat{C}\hat{F}_{\mathcal{H}}(\hat{\theta})\hat{C}=\hat{\lambda}_{c}\to\lambda_{c}$ by the consistency of $\hat{F}_{\mathcal{H}}(\hat{\theta})$ to $F_{\mathcal{H}}(\theta)$ given in Lemma 1. The term (ii) has an asymptotic distribution that follows from Lemma 1 and Assumption 3:

$$\begin{aligned}\sqrt{n}\left(C\hat{F}_{\mathcal{H}}(\hat{\theta})C-CF_{\mathcal{H}}(\theta)C\right)&=\sqrt{n}\,(C\otimes C)^{\top}\mathrm{vec}(\hat{F}_{\mathcal{H}}(\hat{\theta})-F_{\mathcal{H}}(\theta)) && (10)\\ &\overset{\mathrm{d}}{\to}\mathcal{N}\left(0,(C\otimes C)^{\top}\Omega(C\otimes C)\right) && (11)\\ &=\left[(C\otimes C)^{\top}\Omega(C\otimes C)\right]^{1/2}\mathcal{N}(0,1). && (12)\end{aligned}$$

Under the null hypothesis, the matrix $F_{\mathcal{H}}(\theta)$ is not of full rank, so $CF_{\mathcal{H}}(\theta)C=\lambda_{c}=0$. Therefore, $\sqrt{n}\hat{T}^{1/2}\overset{\mathrm{d}}{\to}\Lambda^{1/2}Z$ by Slutsky’s theorem, and $n\hat{T}\overset{\mathrm{d}}{\to}\Lambda Z^{2}$ by [72, Lemma 3.2], where $\Lambda=(C\otimes C)^{\top}\Omega(C\otimes C)$ and $Z$ is a standard normal random variable. ∎
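The step from (9) to (10) relies on the identity $CFC=(C\otimes C)^{\top}\mathrm{vec}(F)$ for the quadratic form. The following quick numerical check (ours, for illustration only) verifies it on a random symmetric matrix, mirroring the symmetric $F_{\mathcal{H}}(\theta)$.

```python
import numpy as np

# Numerical check of the identity used in (10):
# for a symmetric matrix F and a vector C, C^T F C = (C kron C)^T vec(F).
rng = np.random.default_rng(0)
c = 4
A = rng.standard_normal((c, c))
F = (A + A.T) / 2                    # symmetric, like F_H(theta)
C = rng.standard_normal(c)

lhs = C @ F @ C                      # quadratic form C^T F C
rhs = np.kron(C, C) @ F.reshape(-1)  # (C kron C)^T vec(F)
print(np.isclose(lhs, rhs))          # True
```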

B.3 Proof of Theorem 7

Proof.

The conclusion follows directly from the asymptotic distribution of $n\hat{T}$ in Theorem 6. ∎

B.4 Proof of Theorem 3

Proof.

By Assumption 1 and the consistency of $R_{\mathcal{H}}(\theta)$ to $\mathrm{CMR}(\theta)$, $R_{\mathcal{H}}(\theta)$ has a unique minimizer at $\theta_{0}$, which satisfies $\mathrm{CMR}(\theta_{0})=0$. Further, since $\Theta$ is compact, $R_{n,\mathcal{H}}(\theta)$ converges to $R_{\mathcal{H}}(\theta)$ uniformly in probability, and $R_{\mathcal{H}}(\theta)$ is continuous (because $R_{\mathcal{H}}(\theta)$ is finite and $\varphi_{\theta}$ is implicitly assumed to be differentiable), the empirical minimizer satisfies $\hat{\theta}\overset{\mathrm{p}}{\to}\theta_{0}$ by Theorem 2.1 of Newey and McFadden [62]. Hence, as $n\to\infty$, $\mathrm{rank}(\hat{F}_{\mathcal{H}}(\hat{\theta}))\overset{\mathrm{p}}{\to}\mathrm{rank}(F_{\mathcal{H}}(\theta_{0}))$, and the ITC consistently estimates identifiability in probability. ∎

B.5 Proof of Theorem 4

Proof.

By the existing result [7, Theorem 3.4] that $\lambda_{i}^{\mathrm{mat}}/n\to\lambda_{i}^{K}$ as $n\to\infty$, where $\lambda_{i}^{\mathrm{mat}}$ is the $i$-th ordered eigenvalue of $K_{\bm{z}}$, we know that $\mathrm{Tr}(K_{\bm{z}})/n=\sum_{i=1}^{n}\lambda_{i}^{\mathrm{mat}}/n$ is a consistent estimator of $\sum_{i=1}^{\infty}\lambda_{i}^{K}$. Similarly, $(\mathrm{Tr}(K_{\bm{z}}^{2})/n^{2})^{1/2}=(\sum_{i=1}^{n}(\lambda_{i}^{\mathrm{mat}})^{2}/n^{2})^{1/2}$ consistently estimates $(\sum_{i=1}^{\infty}(\lambda_{i}^{K})^{2})^{1/2}$. Therefore, $\mathrm{Tr}(K_{\bm{z}})\,\mathrm{Tr}(K_{\bm{z}}^{2})^{-1/2}$ is a consistent estimator of $E_{k}$. ∎
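A minimal sketch (under our own illustrative choices of kernel and lengthscale) of the estimator $\mathrm{Tr}(K_{\bm{z}})\,\mathrm{Tr}(K_{\bm{z}}^{2})^{-1/2}$ analyzed above:

```python
import numpy as np

def effective_dimension_estimate(Z: np.ndarray, lengthscale: float = 1.0) -> float:
    """Estimate the effective dimension via Tr(K_z) / sqrt(Tr(K_z^2))
    with a Gaussian kernel on the instruments Z (shape (n, d))."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * lengthscale ** 2))       # kernel matrix K_z
    return float(np.trace(K) / np.sqrt(np.sum(K * K)))   # Tr(K^2) = sum of squared entries

Z = np.random.default_rng(0).standard_normal((500, 1))
print(effective_dimension_estimate(Z, lengthscale=0.5))
```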

B.6 Proof of Theorem 5

Proof.

We define $j^{*}\in\mathcal{M}$ as the index of the optimal RKHS instrument space, namely $\mathcal{H}_{j^{*}}=\mathcal{H}^{l}$. With this setting, the statement of Theorem 5 is equivalent to showing that $\mathrm{P}(\mathrm{KEIC}(\mathcal{H}_{j})>\mathrm{KEIC}(\mathcal{H}_{j^{*}}))\to 1$ as $n\to\infty$ for any $j\in\mathcal{M}$ with $j\neq j^{*}$. We define $E_{j}$ as the effective dimension of $\mathcal{H}_{j}$ and $\hat{E}_{j}:=\mathrm{Tr}(K_{\bm{z}})\,\mathrm{Tr}(K_{\bm{z}}^{2})^{-1/2}$ with the kernel $k$ corresponding to the RKHS $\mathcal{H}_{j}$. For $j\neq j^{*}$, we study the following difference:

$$\mathrm{KEIC}(\mathcal{H}_{j})-\mathrm{KEIC}(\mathcal{H}_{j^{*}})=n\underbrace{\left(\hat{R}_{\mathcal{H}_{j}}(\hat{\theta})-\hat{R}_{\mathcal{H}_{j^{*}}}(\hat{\theta})\right)}_{\coloneqq\Delta_{R}(j,j^{*})}+\underbrace{\left(\hat{E}_{j}-\hat{E}_{j^{*}}\right)}_{\coloneqq\Delta_{E}(j,j^{*})}\log n.$$

Given the condition that a unique least identifiable instrument space exists and the consistency of $\hat{E}$ to $E$ shown in Theorem 4, we obtain that $\mathrm{P}(\hat{E}_{j}-\hat{E}_{j^{*}}>0)\to 1$ and that $\Delta_{E}(j,j^{*})$ converges to some constant. Hence $\Delta_{E}(j,j^{*})\log n=\Omega(\log n)$, where $\Omega(\cdot)$ here denotes the order in the sense of complexity theory. By Assumption 1, $\mathrm{CMR}(\theta)=0$ has a unique solution, say $\theta_{0}$, and since $\mathcal{H}_{j}$ is identifiable, the minimizer $\hat{\theta}=\operatorname*{argmin}_{\theta}\hat{R}_{\mathcal{H}_{j}}(\theta)$ converges to $\theta_{0}$. As a result, $n\Delta_{R}(j,j^{*})=O(1)$ by the asymptotic distribution in conclusion (2) of Theorem 4.1 in Muandet et al. [57]. Hence the term $\Delta_{E}(j,j^{*})\log n$ has a larger order than $n\Delta_{R}(j,j^{*})$, and we obtain $\mathrm{P}(\mathrm{KEIC}(\mathcal{H}_{j})>\mathrm{KEIC}(\mathcal{H}_{j^{*}}))\to 1$ as $n\to\infty$ for $j\neq j^{*}$. ∎
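To make the selection rule concrete, the hedged sketch below scores each candidate RKHS by $\mathrm{KEIC}(\mathcal{H}_{j})=n\hat{R}_{\mathcal{H}_{j}}(\hat{\theta})+\hat{E}_{j}\log n$, consistent with the difference displayed above, and returns the minimizer. The callables empirical_risk and effective_dim are hypothetical stand-ins for the quantities defined in the paper, and the candidates are assumed to have already passed the ITC identification test.

```python
import numpy as np

def select_liis(kernels, empirical_risk, effective_dim, n: int) -> int:
    """Return the index of the candidate kernel minimizing the KEIC.
    `empirical_risk(k)` should return R_hat_{H_j}(theta_hat) and
    `effective_dim(k)` should return E_hat_j for the RKHS induced by kernel k."""
    scores = [n * empirical_risk(k) + effective_dim(k) * np.log(n) for k in kernels]
    return int(np.argmin(scores))
```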

Figure 3: ITC evaluated on the quadratic function $f^{*}$ for sample sizes $n=100$, $500$, and $1000$ (y-axis: normalized ITC). The left and right plots employ $f_{2}$ and $f_{4}$ in the estimation, respectively. We use [kernel]-[parameter] to denote different kernels and normalize all values in each plot to $[0,1]$. The symbols (*) to the right of the nodes denote the selected LIISs. The red dashed lines denote the quantile corresponding to the significance level $\alpha=0.05$.

Appendix C More Details on Experiments

C.1 LIIS for Section 6.1

Under the experimental settings of Section 6.1, an identifiable instrument space should satisfy (i) the existence of a column vector of parameters $\bm{c}\coloneqq[c_{i}]_{i=1}^{m}$ that meets the CMR condition (1), and (ii) the uniqueness of $\bm{c}$, namely $\mathrm{rank}\{\mathbb{E}[(\partial f_{m}(X)/\partial\bm{c})(\partial f_{m}(X)/\partial\bm{c}^{\top})k(Z,Z^{\prime})]\}=m+1$ [34, Assumption 2.3]. Since the same models are used for estimation and data generation, (i) holds immediately, and (ii) also holds when the number of instruments exceeds the number of parameters. Since the linear kernel has a feature map $\phi_{\mathrm{L}}$ projecting $x$ to a one-dimensional space, using it for the instrument space is equivalent to employing $h(Z)=Z$ as the instrument; condition (ii) then fails because the number of instruments is less than the number of parameters of both $f_{2}$ and $f_{4}$. Moreover, among the candidate parameters, P4 identifies the parameters $\bm{c}$ of $f_{2}$ and $f_{4}$, while P2 fails to identify those of $f_{4}$; G identifies the parameters of $f_{2}$ and $f_{4}$ for any candidate lengthscale. Considering the complexity measured by the effective dimension (4), we obtain the LIISs: P2 with $p=1$ for $f_{2}$ and P4 with $p=1$ for $f_{4}$. The effective dimension can be approximated using a large sample.
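A Monte Carlo sketch (with our own data-generating choices, tolerance, and an illustrative parameterization of the polynomial model with $m+1$ coefficients) of the rank check in condition (ii): estimate $\mathbb{E}[(\partial f_{m}(X)/\partial\bm{c})(\partial f_{m}(X)/\partial\bm{c}^{\top})k(Z,Z^{\prime})]$ from samples and test whether its numerical rank equals $m+1$.

```python
import numpy as np

def rank_condition(x: np.ndarray, z: np.ndarray, m: int,
                   lengthscale: float = 1.0, tol: float = 1e-8) -> bool:
    """Check condition (ii) by Monte Carlo: build the (m+1) x (m+1) matrix
    Phi^T K Phi / n^2 with a Gaussian kernel on Z and compare its numerical
    rank with m + 1."""
    n = len(x)
    Phi = np.vander(x, N=m + 1, increasing=True)   # rows: [1, x, ..., x^m]
    K = np.exp(-(z[:, None] - z[None, :]) ** 2 / (2 * lengthscale ** 2))
    M = Phi.T @ K @ Phi / n ** 2                   # Monte Carlo estimate
    return bool(np.linalg.matrix_rank(M, tol=tol) == m + 1)

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)
x = z + 0.5 * rng.standard_normal(1000)            # X depends on the instrument Z
print(rank_condition(x, z, m=2))                   # Gaussian kernel: expected True
```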

Table 3: The mean square error (MSE) ± one standard deviation ($n=100$, $f_{4}$). The columns abs, linear, quad, and sin denote the true function $f^{*}$.

Scenario  Algorithm      abs            linear         quad           sin
LS        Silverman      0.076 ± 0.032  0.050 ± 0.053  0.024 ± 0.032  0.124 ± 0.105
LS        Med-Heuristic  0.100 ± 0.081  0.058 ± 0.053  0.024 ± 0.032  0.146 ± 0.091
LS        Our Method     0.051 ± 0.029  0.048 ± 0.054  0.025 ± 0.032  0.100 ± 0.111
LW        Silverman      0.282 ± 0.067  0.096 ± 0.044  0.050 ± 0.022  0.169 ± 0.047
LW        Med-Heuristic  0.127 ± 0.075  0.050 ± 0.032  0.040 ± 0.026  0.101 ± 0.066
LW        Our Method     0.128 ± 0.077  0.036 ± 0.017  0.026 ± 0.012  0.041 ± 0.022
NS        Silverman      0.136 ± 0.080  0.153 ± 0.115  0.071 ± 0.042  0.198 ± 0.147
NS        Med-Heuristic  0.315 ± 0.260  0.571 ± 0.823  1.229 ± 2.753  0.560 ± 0.322
NS        Our Method     0.056 ± 0.040  0.029 ± 0.012  0.043 ± 0.042  0.052 ± 0.027

C.2 More Experiment Settings

We standardize the values of $Y$ to have zero mean and unit variance for numerical stability. Experiments for each setting are repeated 10 times. To avoid overfitting, we compute the empirical risk in the KEIC as a two-fold cross-validation error. The random seed is set to 527. Code for the experiments is provided in the supplementary material.
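A minimal sketch of this preprocessing and of the two-fold cross-validated risk used in the KEIC, under the settings above; fit_fn and risk_fn are hypothetical callables standing in for the KMMR estimator and its empirical risk, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(527)  # the random seed used in the experiments

def standardize(y: np.ndarray) -> np.ndarray:
    """Standardize the responses Y to zero mean and unit variance."""
    return (y - y.mean()) / y.std()

def two_fold_cv_risk(data: np.ndarray, fit_fn, risk_fn) -> float:
    """Fit on one half of the data, evaluate the empirical risk on the other
    half, and average the two directions."""
    idx = rng.permutation(len(data))
    fold_a, fold_b = np.array_split(idx, 2)
    risks = [risk_fn(fit_fn(data[train]), data[test])
             for train, test in [(fold_a, fold_b), (fold_b, fold_a)]]
    return float(np.mean(risks))
```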