Instrument Space Selection for Kernel Maximum Moment Restriction
Abstract
Kernel maximum moment restriction (KMMR) has recently emerged as a popular framework for instrumental variable (IV) based conditional moment restriction (CMR) models, with important applications in conditional moment (CM) testing and in parameter estimation for IV regression and proximal causal learning. The effectiveness of this framework, however, depends critically on the choice of the reproducing kernel Hilbert space (RKHS) that serves as the space of instruments. In this work, we present a systematic way to select the instrument space for parameter estimation based on the principle of the least identifiable instrument space (LIIS): the space that identifies the model parameters with the least complexity. Our selection criterion combines two distinct objectives to determine such an optimal space: (i) a test criterion that checks identifiability; (ii) an information criterion that uses the effective dimension of an RKHS as a complexity measure. We analyze the consistency of our method in determining the LIIS and demonstrate its effectiveness for parameter estimation via simulations.
1 Introduction
Instrumental variable (IV) based conditional moment restriction (CMR) models [61, 1, 24] have a wide range of applications in causal inference, economics, and finance modeling; for correctly specified models, the conditional mean of certain functions of the data equals zero almost surely. Models of this kind also appear in Mendelian randomization, a technique in genetic epidemiology that uses genetic variation to improve causal inference about the effect of a modifiable exposure on disease [22, 13]. Rational expectation models [59], widely used in macroeconomics, express as conditional moments how decision-makers exploit available information to form expectations about the future. Furthermore, CMRs have gained popularity in the causal machine learning community, leading to novel algorithms such as generalized random forests [6], double/debiased machine learning [21], and nonparametric IV regression [8, 58, 73]; see also the related works therein, as well as applications in offline reinforcement learning [47].
Learning with CMRs is challenging because a CMR implies an infinite number of unconditional moment restrictions (UMRs). Although the asymptotic efficiency of the instrumental variable (IV) estimator can in principle improve as more moment restrictions are added, an excessive number of moments can be harmful in practice [3] because the finite-sample bias increases with the number of moment conditions [63]. Hence, traditional works in econometrics often select a finite number of UMRs for estimation based on the generalized method of moments (GMM) [36, 34]. Unfortunately, an ad hoc choice of moments can lead to a loss of efficiency or even a loss of identification [25]. For this reason, subsequent works advocate incorporating all moment restrictions simultaneously in different ways, such as the method of sieves [23, 27] and a continuum of moment restrictions [17, 19, 16, 18], among others. Despite this progress, the question of moment selection in general remains open.
Recent interest in modeling the CMR with a reproducing kernel Hilbert space (RKHS) [57, 24] and with deep neural networks [45, 8] opens up a new possibility to resolve the selection problem with modern tools in machine learning. In this work, we focus on the RKHS approach, where the CMR is reformulated as a minimax optimization problem whose inner maximization is taken over functions in the RKHS. This framework is known as kernel maximum moment restriction (KMMR). An advantage of the KMMR is that the inner maximization problem has a closed-form solution, which is related to a continuum generalization of GMM [17, 16]. Furthermore, it has been shown that an RKHS with a specific type of kernel is sufficient to model the CMR; see Muandet et al. [57, Theorem 3.2] and Zhang et al. [73, Theorem 1]. Hence, in this context, the moment selection problem becomes a kernel selection problem, itself a long-standing problem in machine learning. Moreover, the KMMR can be viewed as an approximate dual of the well-known two-stage regression procedure [58, 46]. KMMRs have been applied successfully to IV regression [73], proximal causal learning [51], and conditional moment testing [57]. Nevertheless, all of these works employ a simple heuristic, e.g., the median heuristic, to select the kernel function, which limits the full potential of this framework.
Our contributions. In this paper, we address the kernel selection problem for the KMMR framework. We focus on the IV estimator and term our problem kernel (or RKHS) instrument space selection, because the RKHS functions as a space of instruments. We define an optimal instrument space, named the least identifiable instrument space (LIIS), which identifies the true model parameters and has the least complexity. To determine the LIIS in practice, we propose an approach based on a combination of two criteria: (i) the identification test criterion (ITC), which tests the identifiability of instrument spaces, and (ii) the kernel effective information criterion (KEIC), which selects a space based on its complexity. Our method has the following advantages: (a) compared with methods based on higher-order asymptotics [26, 28], our approach is easy to use and analyze, and can filter out invalid instrument spaces; (b) our method combines several criteria, thereby compensating for the shortcomings of each individual criterion. Moreover, we analyze the consistency of our method in selecting the LIIS, and we show in simulation experiments that our method effectively identifies the LIIS and improves parameter estimation for IV estimators. To the best of our knowledge, no other method achieves all of these for kernel instrument space selection.
2 Preliminaries
2.1 Conditional moment restriction (CMR)
Let $X$ be a random variable taking values in a set $\mathcal{X}$ and let $\Theta$ be a parameter space. A conditional moment restriction (CMR) [61, 1] can then be expressed as
$$\mathbb{E}[\psi(X;\theta_0) \mid Z] = 0 \quad P_Z\text{-almost surely} \qquad (1)$$
for the true parameter $\theta_0 \in \Theta$, where $Z$ is an observable conditioning variable taking values in $\mathcal{Z}$. The function $\psi : \mathcal{X} \times \Theta \to \mathbb{R}^q$ is a problem-dependent generalized residual function in $\mathbb{R}^q$ parameterized by $\theta$. Intuitively, the CMR asserts that, for correctly specified models, the conditional mean of the generalized residual function is almost surely equal to zero. Many statistical models can be written in the form (1), including nonparametric regression models, where $X = (T, Y)$, $Z = T$, and $\psi(X;\theta) = Y - f_\theta(T)$; conditional quantile models, where $X = (T, Y)$, $Z = T$, and $\psi(X;\theta) = \mathbb{1}\{Y \le f_\theta(T)\} - \tau$ for the target quantile $\tau \in (0,1)$; and IV regression models, where $X = (T, Y)$, $Z$ is an IV, and $\psi(X;\theta) = Y - f_\theta(T)$.
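To make these residual functions concrete, here is a minimal Python sketch; the function names are ours, for illustration only.

```python
import numpy as np

def residual_regression(y, t, f_theta):
    """psi(x; theta) = y - f_theta(t), used for (nonparametric) regression
    and for IV regression."""
    return y - f_theta(t)

def residual_quantile(y, t, f_theta, tau=0.5):
    """psi(x; theta) = 1{y <= f_theta(t)} - tau for conditional quantile models."""
    return (y <= f_theta(t)).astype(float) - tau
```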
Maximum moment restriction (MMR).
An important observation about the CMR (1) is that it implies a continuum of unconditional moment restrictions (UMRs) [17, 45, 8]: $\mathbb{E}[\psi(X;\theta_0)^\top f(Z)] = 0$ for all measurable functions $f \in \mathcal{F}$, where $\mathcal{F}$ is a space of measurable functions $f : \mathcal{Z} \to \mathbb{R}^q$. We refer to $\mathcal{F}$ as an instrument space. Traditionally, inference and estimation of $\theta_0$ can be performed, for example, via the generalized method of moments (GMM) applied to the UMRs induced by a specific subset of $\mathcal{F}$ [34]. Consequently, the choice of this subset plays an important role in parameter estimation for conditional moment models. In this paper, we study the optimal instrument space based on an equivalent formulation of the UMRs, called the maximum moment restriction (MMR) [45, 57, 73]:
$$\sup_{f \in \mathcal{F}} \big( \mathbb{E}[\psi(X;\theta_0)^\top f(Z)] \big)^2 = 0. \qquad (2)$$
Note that the MMR depends critically on the choice of the instrument space $\mathcal{F}$. In this paper, we focus exclusively on IV regression models, so that $\psi(X;\theta) = Y - f_\theta(T)$ and $\mathcal{F}$ is a space of real-valued functions on $\mathcal{Z}$. We defer applications of our method to other scenarios to future work.
2.2 Kernel maximum moment restriction (KMMR)
In this work, we focus on the case where the instrument space $\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ associated with a kernel $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ [5, 66, 9].
Reproducing kernel Hilbert space (RKHS).
Let $\mathcal{H}_k$ be a reproducing kernel Hilbert space (RKHS) of functions from $\mathcal{Z}$ to $\mathbb{R}$, with $\langle\cdot,\cdot\rangle_{\mathcal{H}_k}$ and $\|\cdot\|_{\mathcal{H}_k}$ being its inner product and norm, respectively. Since for any $z \in \mathcal{Z}$ the evaluation functional $f \mapsto f(z)$ is continuous on $\mathcal{H}_k$, it follows from the Riesz representation theorem [64] that there exists, for every $z \in \mathcal{Z}$, a function $k(z, \cdot) \in \mathcal{H}_k$ such that $f(z) = \langle f, k(z, \cdot)\rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$. This is generally known as the reproducing property of $\mathcal{H}_k$ [5, 66], and we call $k$ a reproducing kernel of $\mathcal{H}_k$. The reproducing kernel is unique (up to an isometry) and fully characterizes the RKHS [5]. Examples of commonly used kernels on $\mathbb{R}^d$ include the Gaussian RBF kernel $k(z, z') = \exp\!\big(-\|z - z'\|_2^2 / (2\sigma^2)\big)$ and the Laplacian kernel $k(z, z') = \exp\!\big(-\|z - z'\|_1 / \sigma\big)$, where $\sigma > 0$ is a bandwidth parameter. For a detailed exposition of kernel methods and RKHSs, see, e.g., [66, 9, 54, 56].
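To make the kernel choices above concrete, the following Python sketch (illustrative only; the function names and sample data are ours) builds the Gaussian RBF and Laplacian Gram matrices for a sample of instrument values and checks that they are valid (symmetric, positive semi-definite) kernel matrices.

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||z_i - z_j||_2^2 / (2 sigma^2))."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def laplacian_gram(Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||z_i - z_j||_1 / sigma)."""
    l1 = np.sum(np.abs(Z[:, None, :] - Z[None, :, :]), axis=-1)
    return np.exp(-l1 / sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(5, 2))          # five instrument observations in R^2
    K = gaussian_gram(Z, sigma=0.5)
    # A valid reproducing kernel yields a symmetric positive semi-definite Gram matrix.
    assert np.allclose(K, K.T)
    assert np.min(np.linalg.eigvalsh(K)) > -1e-10
```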
By taking the instrument space $\mathcal{F}$ in (2) to be the RKHS $\mathcal{H}_k$, Muandet et al. [57, Theorem 3.3] showed that the resulting maximum moment restriction has a closed-form expression after introducing an Ivanov regularization [40] (i.e., restricting to the unit ball $\|f\|_{\mathcal{H}_k} \le 1$) to remove the scale effect of the instruments:
$$R_k(\theta) := \sup_{f \in \mathcal{H}_k,\ \|f\|_{\mathcal{H}_k} \le 1} \big( \mathbb{E}[\psi(X;\theta)^\top f(Z)] \big)^2 = \mathbb{E}\big[\psi(X;\theta)^\top k(Z, Z')\, \psi(X';\theta)\big], \qquad (3)$$
where $(X', Z')$ is an independent copy of $(X, Z)$. Given i.i.d. data $\{(x_i, z_i)\}_{i=1}^n$, we define the empirical analogue $\hat{R}_k(\theta) := \frac{1}{n^2}\sum_{i,j=1}^n \psi(x_i;\theta)^\top k(z_i, z_j)\, \psi(x_j;\theta)$ and its minimizer $\hat{\theta} := \arg\min_{\theta \in \Theta} \hat{R}_k(\theta)$.
We focus on this expression, despite the existence of a similar quadratic expression that follows from a Tikhonov regularization on the RKHS norm [24, Eqn. (10)]. It is instructive to observe that the MMR (3) resembles the optimally-weighted GMM formulation of Carrasco and Florens [17], but without the re-weighting matrix; see also Carrasco et al. [19, Sec. 6] and Zhang et al. [73, Sec. 3]. While the optimally-weighted GMM (OWGMM) was originally motivated by asymptotic efficiency theory in a parametric setting [18], the need to compute the inverse of the parameter-dependent covariance operator can lead to more cumbersome estimation [16] and poor finite-sample performance [10]. Hence, we consider (3) throughout for its simplicity.
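As an illustration of the empirical objective $\hat{R}_k$, the following sketch evaluates it for a linear-in-parameters residual $\psi(x;\theta) = y - \theta^\top\phi(t)$ and solves the corresponding normal equations; the feature-map name `phi_T`, the V-statistic form, and the small ridge term are our own illustrative choices.

```python
import numpy as np

def empirical_kmmr_risk(theta, phi_T, Y, K):
    """V-statistic estimate R_hat(theta) = (1/n^2) * psi^T K psi for the
    linear-in-parameters residual psi_i = y_i - theta . phi(t_i)."""
    psi = Y - phi_T @ theta          # (n,) residual vector
    n = len(Y)
    return psi @ K @ psi / n ** 2

def kmmr_linear_estimator(phi_T, Y, K, reg=1e-8):
    """Minimizing R_hat over theta for this linear model is a quadratic problem;
    the normal equations are Phi^T K Phi theta = Phi^T K Y (small ridge added
    purely for numerical stability)."""
    A = phi_T.T @ K @ phi_T
    b = phi_T.T @ K @ Y
    return np.linalg.solve(A + reg * np.eye(A.shape[0]), b)
```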
The following result, adapted from Zhang et al. [73, Theorem 1] and Muandet et al. [57, Theorem 3.2], guarantees that the KMMR has the same roots as the CMR (1).
Theorem 1 (Sufficiency of the instrument space).
Suppose that the kernel $k$ is continuous, bounded (i.e., $\sup_{z \in \mathcal{Z}} k(z, z) < \infty$), and integrally strictly positive definite (ISPD), i.e., for any function $g$ satisfying $0 < \|g\|_2^2 < \infty$, we have $\int_{\mathcal{Z}} \int_{\mathcal{Z}} g(z)\, k(z, z')\, g(z')\, \mathrm{d}z\, \mathrm{d}z' > 0$. Then $R_k(\theta) = 0$ if and only if $\mathbb{E}[\psi(X;\theta) \mid Z] = 0$ for $P_Z$-almost all $Z$.
3 Least identifiable instrument space (LIIS)
The choice of the instrument space is critical for the KMMR. If the instrument space is too small, the KMMR loses identification power, i.e., a parameter other than $\theta_0$ can satisfy the KMMR condition (3), and it is then impossible in principle to obtain a consistent estimator of $\theta_0$. This scenario is often referred to as the under-identification problem [34, Chapter 2.1]. In contrast, an excessively large instrument space increases the error of finite-sample estimators. It has been shown that the mean squared error (MSE) of an MMR estimator grows with the size of the instrument space [12, 33, 20, 24]. For example, Theorem 1 of [24] shows that the MSE of their estimator has an upper bound that increases with the critical radius of the instrument space (defined via an upper bound on its Rademacher complexity). Therefore, it is important to avoid making the instrument space excessively large in order to reduce estimation error.
Unfortunately, the instrument space selection problem cannot be solved by a straightforward cross-validation (CV) procedure, because the loss function (3) depends on the instrument space itself. For example, by Mercer's decomposition of the corresponding kernel with eigenfunctions $e_j$ and eigenvalues $\mu_j$, we can rewrite the KMMR risk as $R_k(\theta) = \sum_{j \ge 1} \mu_j \big(\mathbb{E}[\psi(X;\theta)\, e_j(Z)]\big)^2$. Due to this dependence, CV always selects an excessively small instrument space, since such a space makes the loss small for every $\theta$.
The proposed optimality. To resolve this issue, we consider a set of candidate instrument spaces $\{\mathcal{H}_{k_1}, \dots, \mathcal{H}_{k_M}\}$; for example, RKHSs induced by Gaussian kernels with different lengthscale parameters. We introduce assumptions on the identification of $\theta_0$ and on the identifiability afforded by the candidate spaces.
Assumption 1 (Global identification).
There exists a unique $\theta_0 \in \Theta$ satisfying the CMR (1).
Assumption 2 (Identifiability).
There exists a non-empty index set $I \subseteq \{1, \dots, M\}$ such that for any $m \in I$, there is a unique $\theta \in \Theta$ that satisfies $R_{k_m}(\theta) = 0$.
These assumptions guarantee that at least one instrument space among the candidates identifies the unique solution of the CMR problem (1). We then define a notion of optimality for instrument spaces. Let $C(\mathcal{H}_k)$ be a complexity measure of $\mathcal{H}_k$, to be specified later.
Definition 1 (Least Identifiable Instrument Space (LIIS)).
A least identifiable instrument space is an identifiable instrument space with the least complexity, i.e., $\mathcal{H}_{k_{m^*}}$ with $m^* \in \arg\min_{m \in I} C(\mathcal{H}_{k_m})$.
The notion of LIIS is designed to satisfy two requirements: identifiability of $\theta_0$, and low complexity so as to reduce the estimation error. This notion of optimality differs from optimality with respect to test error, as used in CV. We provide an illustration of the concepts in Figure 1.
4 Least identification selection criterion (LISC)
We propose a method, named the least identification selection criterion (LISC), to find the LIIS from finite samples by developing two criteria necessary for the LIIS and combining them. The criteria are as follows: an identification test criterion (ITC) and a kernel effective information criterion (KEIC). We first introduce the overall methodology and then explain these criteria.
Based on the ITC and the KEIC, we propose a simple two-step procedure to select the optimal space: we first select a set of identifiable spaces via the ITC and then choose, among them, the identifiable space with the least KEIC. In case no instrument space is determined to be identifiable, e.g., when neural networks are employed, we instead select the LIIS by minimizing the ratio of the ITC and the KEIC.
Note that the ratio-based method follows the spirit of the LIIS and of our two-step procedure.
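The two-step procedure and its ratio-based fallback can be summarized by the following sketch; `itc`, `keic`, and `reject_threshold` are placeholders for the criteria defined in Sections 4.1 and 4.2, and the direction of the fallback ratio (KEIC over ITC) is our assumption for illustration.

```python
def select_instrument_space(candidates, itc, keic, reject_threshold):
    """Two-step LISC sketch: keep candidate spaces whose ITC statistic rejects
    the non-identifiability null, then pick the one with the smallest KEIC."""
    identifiable = [H for H in candidates if itc(H) > reject_threshold(H)]
    if identifiable:
        return min(identifiable, key=keic)
    # Fallback when no space is tested identifiable (e.g. neural-network models):
    # trade off the two criteria via their ratio, as described in the text.
    return min(candidates, key=lambda H: keic(H) / max(itc(H), 1e-12))
```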
4.1 Criterion 1: identification test criterion (ITC)
We first develop a method to assess the identifiability of an instrument space by checking whether the minimizer of $R_k$ is unique. Uniqueness is verified by examining the rank of a Hessian-like matrix $G$ obtained from $R_k$ at a minimizer $\tilde{\theta}$ assumed to satisfy $R_k(\tilde{\theta}) = 0$; for a linear model $\psi(X;\theta) = Y - \theta^\top \phi(T)$, one can take $G = \mathbb{E}\big[k(Z, Z')\, \phi(T)\, \phi(T')^\top\big]$, which is proportional to the Hessian of $R_k$. Full-rankness of this matrix is a sufficient condition for global identification in linear models and for local identification in nonlinear models [34, Assumptions 2.3, 3.6].
Test of full-rankness. We develop a statistical test for the full-rankness of $G$, based on the test of ranks [65]. With a $p$-dimensional parameter, $G$ is a $p \times p$ matrix, and we consider the null hypothesis $H_0$ that $G$ is not of full rank against the alternative $H_1$ that $G$ has full rank. Let $\lambda_{\min}$ be the smallest eigenvalue of $G$, which is non-negative owing to the quadratic form underlying $G$. For the test, we consider an empirical analogue $\hat{G}$ of $G$, computed from the data and the empirical minimizer $\hat{\theta}$, take its smallest eigenvalue $\hat{\lambda}_{\min}$, and employ it as the test statistic. Let $\zeta$ be a standard normal random variable and $\sigma$ a fixed scalar given in Appendix A. For a significance level $\alpha$, such as $\alpha = 0.05$, let $q_{1-\alpha}$ be the $(1-\alpha)$-quantile of $\sigma\zeta$. Our test has limiting power one:
Theorem 2.
Assume the conditions of Theorem 6. The test that rejects the null when the test statistic exceeds the critical value $q_{1-\alpha}$ is consistent against any fixed alternative $H_1$.
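For the linear-in-parameters case discussed above, the Hessian-like matrix and the smallest-eigenvalue statistic can be computed as in the following sketch; the plain V-statistic average and all names are our illustrative choices, not the exact construction used for the test.

```python
import numpy as np

def hessian_like_matrix(phi_T, K):
    """Empirical analogue of E[k(Z, Z') phi(T) phi(T')^T] for the linear
    model psi(x; theta) = y - theta . phi(t): G_hat = Phi^T K Phi / n^2."""
    n = phi_T.shape[0]
    return phi_T.T @ K @ phi_T / n ** 2

def smallest_eigenvalue(G_hat):
    """Ingredient of the rank-test statistic: the smallest eigenvalue of G_hat.
    It is near zero when the instrument space under-identifies theta."""
    return np.linalg.eigvalsh(G_hat)[0]
```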
Test criterion. We propose a criterion based on the test statistic above. We randomly split the dataset into two parts, using one part to compute $\hat{G}$ and the other to compute an estimate $\hat{\sigma}$ of the scale $\sigma$, where $\hat{v}$ is the eigenvector of $\hat{G}$ corresponding to its smallest eigenvalue $\hat{\lambda}_{\min}$ and $\mathrm{vec}(\cdot)$ is the vectorization operator. We then define the identification test criterion (ITC) as the standardized test statistic built from $\hat{\lambda}_{\min}$ and $\hat{\sigma}$.
We select the instrument spaces under which we can reject the null hypothesis, namely, those whose ITC exceeds the critical value $q_{1-\alpha}$. The validity of this selection is established in Theorem 3, whose proof is given in Appendix B.4.
4.2 Criterion 2: kernel effective information criterion (KEIC)
We develop another criterion for the LIIS in the spirit of information criteria such as those of Akaike [2] and Andrews [4]. The strategy is to estimate both elements required for the LIIS: the complexity of the space, measured via the notion of effective dimension, and identifiability, measured via the empirical loss.
Effective dimension. The effective dimension is a common measure of the complexity of an RKHS [74, 52, 15] and has been used to analyze the performance of kernel methods. To develop the notion, we consider the Mercer expansion [53] of the kernel as provided in Section 3, writing its eigenvalues as $\mu_j$ (with a superscript where needed to distinguish the spectra of different candidate kernels). Based on [15], a definition of the effective dimension is written as
(4)
We develop an empirical estimator of the effective dimension from the kernel (Gram) matrix $K$ using the trace operator $\mathrm{Tr}(\cdot)$, and we show its consistency as follows.
Theorem 4.
As $n \to \infty$, the empirical estimator converges in probability to the effective dimension.
The effective dimension measures the complexity of the instrument space and, at the same time, quantifies capacity properties of the marginal measure of $Z$. One interpretation of our definition is that the numerator counts the effective UMRs, namely those assigned to relatively large eigenvalues, while the denominator regularizes this count. Effective dimensions may differ across tasks: for example, the regularized form $\sum_j \mu_j / (\mu_j + \lambda)$ is considered in least-squares regression [74, 15], where $\lambda$ is a regularization parameter, and Lopes et al. [50] interpret a related ratio of traces of the covariance matrix as its effective dimension.
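As a concrete instance of an eigenvalue-based complexity estimate, the following sketch computes the regularized effective dimension $\sum_j \mu_j/(\mu_j+\lambda)$ of [74, 15] from a Gram matrix; it illustrates the general idea rather than the exact quantity defined in (4).

```python
import numpy as np

def regularized_effective_dimension(K, lam=1e-2):
    """Estimate sum_j mu_j / (mu_j + lam) by plugging in the empirical
    eigenvalues mu_hat_j = eig_j(K) / n of the Gram matrix K."""
    n = K.shape[0]
    mu_hat = np.linalg.eigvalsh(K) / n
    mu_hat = np.clip(mu_hat, 0.0, None)   # guard against tiny negative values
    return float(np.sum(mu_hat / (mu_hat + lam)))
```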
Information criterion. We then develop an information criterion based on the notion of effective dimension. The key idea is to add the estimated effective dimension as a penalty term to the empirical KMMR risk. Note that the ITC is built on the assumption that a parameter satisfying the KMMR exists, which may not always hold; it is therefore necessary to check this existence through the empirical risk. We propose the kernel effective information criterion (KEIC), which combines the empirical risk and the estimated effective dimension in analogy with standard information criteria such as the BIC (Bayesian Information Criterion) [67].
Remark 1.
Given a set of valid and invalid instrument spaces, the KEIC filters out the invalid ones with probability approaching 1 as $n \to \infty$, since invalid spaces do not have zero risk, i.e., $\inf_{\theta \in \Theta} R_k(\theta) > 0$, and the first term in the KEIC then grows faster than the second term.
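To fix ideas, here is a hypothetical BIC-style instantiation of the KEIC for the linear-in-parameters model; the exact weighting of the two terms is our assumption, chosen so that the empirical-risk term dominates for invalid spaces, in line with Remark 1.

```python
import numpy as np

def keic(theta_hat, phi_T, Y, K, eff_dim_hat):
    """Hypothetical BIC-style KEIC: n * empirical KMMR risk at theta_hat
    plus a log(n)-weighted effective-dimension penalty."""
    n = len(Y)
    psi = Y - phi_T @ theta_hat
    risk = psi @ K @ psi / n ** 2
    return n * risk + np.log(n) * eff_dim_hat
```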
Theorem 5 (Consistency of KEIC). Suppose Assumptions 1 and 2 hold and the LIIS is unique. Then the KEIC selects the LIIS among the candidate instrument spaces with probability approaching 1 as $n \to \infty$.
Remark 2 (Consistency of two-step procedure).
Suppose that the conditions of Theorem 5 hold. Then the two-step selection procedure is consistent for the LIIS, provided that the space selected by the ITC is identifiable with probability approaching 1 as $n \to \infty$.
5 Related work
The problems of moment and instrument selection have a long history in econometrics [61, 34, 32]. While both problems often involve the GMM estimator, the latter focuses on the IVs and the corresponding estimator is referred to as the generalized instrumental variable (GIV) estimator [37]. In general, existing selection criteria can be summarized into three broad categories: (i) large sample or first-order asymptotic property based criteria, (ii) finite sample or higher-order asymptotic property based criteria, and (iii) information criteria.
The first category of selection methods, which was popular in the 1970s-1980s, treats the asymptotic efficiency of the resulting estimator as the most desirable criterion; see [61] for a review. However, such criteria may not guarantee good finite-sample properties, as they incur large biases in practice [55, 26]. Thus, subsequent work gradually turned to the second category, which aims to improve the finite-sample precision of parameter estimation. Donald and Newey [26] proposed an instrument selection criterion based on a second-order asymptotic property, i.e., a Nagar-type approximation [60] to the bias of linear causal models. Newey and Smith [63] explored higher-order asymptotic properties of GMM and generalized empirical likelihood estimation, which Donald et al. [28] later applied to developing instrument selection for nonlinear causal models. Interestingly, as shown by Newey and Smith [63], many bias-correction methods implicitly improve higher-order asymptotic efficiency. Moreover, the idea of reducing finite-sample biases is closely related to cross-validation (see, e.g., Donald and Newey [26, pp. 1165] and Carrasco [16, Section 4]), which has been used for different targets such as selection of the weight matrix of GMM [16] and of regularization parameters in GMM estimation [73]. Nonetheless, higher-order asymptotic properties often rely on complicated theoretical analyses, and their practical performance is sensitive to the accuracy of the empirical approximation and to noise in the data [30]; the empirical approximation is also often computationally heavy (see, e.g., Donald et al. [28]). Additionally, this category of methods requires the strong assumption that all instrument candidates are valid [26, 28]. Therefore, it is desirable to seek a theoretically simpler selection method that remains robust and easy to use in practice. The last category, to which our method also belongs, relies on information criteria. Andrews [4] proposed an orthogonality-condition-based criterion: the method selects a maximal number of valid moments by simultaneously minimizing the objective of the GMM estimation and maximizing the number of instruments. Hall et al. [33] proposed an efficiency- and non-redundancy-based criterion to avoid the inclusion of redundant moment conditions [12], a weakness of the orthogonality-condition-based criterion. For comprehensive reviews, we refer readers to [34, Chapter 7] and [32].
Recently, CMR models have become increasingly popular in the machine learning community [57, 24, 8], which opens up a new possibility to resolve the selection problem with modern tools in machine learning. Popular works include DeepIV [38], KernelIV [69], DualIV [58], and adversarial structured equation models (SEMs) [46] in the sub-area of causal machine learning; see also Liao et al. [47] and references therein for related works from the reinforcement learning (RL) perspective. In this line of work, no instruments are explicitly selected, and the focus is mainly on estimating the conditional density or solving the saddle-point reformulation of the CMR. The second line of work, which is more closely related to ours, includes adversarial GMM [45, 24, 73, 49, 29, 70, 43], with an adversarial instrument drawn from a function space, and DeepGMM [8], with fixed instruments. All of these methods employ flexible instruments such as neural networks [45, 24] or RKHSs [24, 73, 49, 29, 70, 43], and adversarial GMM selects appropriate instruments by maximizing the GMM objective function in order to obtain robust estimators. To the best of our knowledge, none of these works addresses the moment selection problem directly for the IV estimator.
Lastly, the KMMR objective is also related to the maximum mean discrepancy (MMD) [11] and the kernel Stein discrepancy (KSD) [48], as pointed out by Muandet et al. [57]. The kernel selection problem for the MMD was previously studied for the two-sample testing problem [31]. The principle there is to maximize the test power, and it was later widely applied to kernel selection for other hypothesis tests, such as independence testing based on the finite-set independence criterion [41] and goodness-of-fit testing based on the finite-set KSD [42]. While this approach can be applied to KMMR-based hypothesis testing [57], it is not applicable in our case, as we focus on the parameter estimation problem.
6 Experiments
We demonstrate the effectiveness of our selection criterion on the IV regression task with two baselines: (1) the median heuristic employed by Zhang et al. [73] and (2) Silverman’s rule-of-thumb [68]. These baselines are widely used in statistics and related fields [35].
6.1 Simple linear IV regression
We first demonstrate the ability of our method to select the LIIS among a set of candidate instrument spaces. We consider simple linear (in parameters) IV regression models, because it is easy to analyze the global identifiability of their instrument spaces; Hall et al. [34, Section 2.1] provide the details.
Data. We consider data generation models employed by Bennett et al. [8] and adapt them to our experiments. A confounding variable creates the correlation between the treatment and the residual, and an instrumental variable is correlated with the treatment; we employ two choices of the function linking the instrument to the treatment to enrich the test scenarios: (i) linear and (ii) quadratic. We generate data for the training, validation, and test sets, respectively. More experimental details are given in Appendix C.2.
Algorithms. We use a linear combination of polynomial basis functions to estimate the unknown true function $f$, namely $f_\theta(t) = \sum_{j} \theta_j t^j$, where $\theta$ denotes the unknown parameters and we consider two polynomial degrees. For the KMMR framework, we employ a range of kernels to construct instrument spaces: (i) the linear kernel (denoted L), $k(z, z') = z^\top z'$; (ii) the polynomial kernel, $k(z, z') = (z^\top z' + c)^r$, where $c$ is a kernel parameter and $r$ controls the polynomial degree; we consider $r \in \{2, 4\}$ (denoted P2, P4); (iii) the Gaussian kernel (denoted G), for which we consider several bandwidths. The LIIS can be identified analytically in this setting (see Appendix C.1): P2 (with an appropriate kernel parameter) for the lower-degree model and P4 for the higher-degree model.
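The candidate instrument spaces can be encoded as Gram-matrix constructors, as in the following sketch; the parameter values are placeholders rather than the exact grid used in the experiments.

```python
import numpy as np

def linear_gram(Z):
    return Z @ Z.T

def polynomial_gram(Z, c=1.0, degree=2):
    return (Z @ Z.T + c) ** degree

def gaussian_gram(Z, sigma=1.0):
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

# Candidate instrument spaces, indexed by the labels used for reporting.
CANDIDATES = {
    "L": linear_gram,
    "P2": lambda Z: polynomial_gram(Z, degree=2),
    "P4": lambda Z: polynomial_gram(Z, degree=4),
    "G": lambda Z: gaussian_gram(Z, sigma=1.0),
}
```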
Results. We present the results for the linear setting in Fig. 2 and for the quadratic setting in Fig. 3 in the Appendix. Unidentifiable instrument spaces (L for the lower-degree model; L, P2-1, and P2-2 for the higher-degree model) are always tested as unidentifiable, while the identifiable spaces show increasing significance of identifiability as the data size grows. For the lower-degree model, P2-1 is chosen as the LIIS by our method on the larger datasets; due to the larger number of parameters in the higher-degree model, the LIIS P4-1 requires more data before it is tested as identifiable with significance. This highlights a difficulty for our method in identifying the LIIS when the model has many parameters and the dataset is small.
| Scenario | Algorithm | abs | linear | quad | sin |
|---|---|---|---|---|---|
| LS | Silverman | 0.023 ± 0.006 | 0.006 ± 0.005 | 0.006 ± 0.007 | 0.032 ± 0.012 |
| LS | Med-Heuristic | 0.023 ± 0.007 | 0.006 ± 0.006 | 0.006 ± 0.008 | 0.032 ± 0.010 |
| LS | Our Method | 0.023 ± 0.007 | 0.006 ± 0.006 | 0.006 ± 0.008 | 0.031 ± 0.010 |
| LW | Silverman | 0.055 ± 0.059 | 0.032 ± 0.023 | 0.017 ± 0.011 | 0.058 ± 0.050 |
| LW | Med-Heuristic | 0.214 ± 0.019 | 0.066 ± 0.011 | 0.037 ± 0.018 | 0.101 ± 0.012 |
| LW | Our Method | 0.024 ± 0.010 | 0.015 ± 0.016 | 0.009 ± 0.010 | 0.019 ± 0.022 |
| NS | Silverman | 7.384 ± 18.271 | 0.137 ± 0.118 | 0.595 ± 0.733 | 0.539 ± 0.913 |
| NS | Med-Heuristic | 0.070 ± 0.044 | 0.021 ± 0.019 | 0.019 ± 0.011 | 0.074 ± 0.053 |
| NS | Our Method | 0.039 ± 0.019 | 0.006 ± 0.004 | 0.007 ± 0.003 | 0.028 ± 0.027 |
6.2 Robustness of parameter estimation with linear models
We compare our method and baselines on parameter estimation with the linear IV model.
Settings. We employ the previously defined linear model for IV regression and measure the mean squared error (MSE) between the optimized parameters and the ground truth. We adapt the data generation models of the previous subsection, keeping the same variable definitions, and consider two more choices of the true function, (iii) abs and (iv) sin, in addition to (i) linear and (ii) quadratic. We sample data for the training, validation, and test sets, respectively, and design three evaluation scenarios. In scenario (a), the instrument is strong and linearly correlated with the treatment, which is an ideal data generation process; we refer to it as the linearly strong (LS) IV scenario. Scenario (b) has weak IVs linearly correlated with the treatment (referred to as the linearly weak (LW) IV scenario), which is common in real-world applications, such as the genetic variants employed in Mendelian randomization, which are known to be weak IVs [44, 39, 14]. Scenario (c) considers a nonlinearly strong correlation between the instrument and the treatment (referred to as the nonlinearly strong (NS) IV scenario); this setting is also common, e.g., in two-stage least squares methods, where nonlinear models are often employed to fit the relation between the instrument and the treatment [38, 69].
Results. Results are shown in Table 1 and in Table 3 in the Appendix. The two baselines and our method perform very similarly in the LS scenario. In contrast, the weak IVs in the LW scenario have a significantly negative effect on the median-heuristic method, whereas our method performs stably and Silverman's rule of thumb is somewhat less stable than our method. Moreover, Silverman's rule of thumb suffers from the nonlinear correlation between the instrument and the treatment in the NS scenario, where the median-heuristic method performs better than in the LW scenario; our method still performs stably and well. These results show the robustness of our method for parameter estimation across the different scenarios, as well as its flexibility: it can select any kernel, whereas the baselines can in principle only handle Gaussian kernels. Therefore, our method is preferable to the two baselines for the task of parameter estimation.
6.3 Robustness of parameter estimation with neural networks
We further assess the effectiveness of our method for parameter estimation with neural networks (NNs), a commonly used model class in machine learning. Due to the complicated structure of NNs, this problem is harder than the previous one.
Settings. We employ the weak IV scenario from the last subsection and generate samples for the training, validation, and test datasets, respectively. For a clean demonstration, and to avoid, e.g., training difficulties with deep NN models, we employ two relatively simple models: (i) a network with one hidden layer of 10 units, and (ii) a network with two hidden layers of 5 units each. The sigmoid activation function is used in the hidden layers. Both models are sufficient to fit the simple true causal functions. We find that, due to the large number of parameters, there are often no instrument spaces that are significantly identifiable; therefore, we minimize the ratio defined in Section 4 to select the LIIS.
An approximation to the ITC. We approximate the ITC to reduce the computational burden caused by the many parameters in NNs. We consider a common form of NN in which the output layer is a linear combination of the hidden-layer outputs, with the hidden layers forming a multi-layer structure of weights, biases, and activation functions. We therefore view the hidden layers as basis functions whose parameters are updated during training but held fixed when computing the ITC. As a result, we only use the gradients with respect to the output-layer weights and biases to evaluate the ITC, which approximates the ITC computed with gradients with respect to all parameters. This accelerates the computation of the ITC for an NN, and we assess the performance of parameter estimation under this approximation.
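A minimal sketch of this approximation for a one-hidden-layer sigmoid network, reusing the linear-model routine of Section 4.1: the hidden activations play the role of fixed basis functions, and only the output-layer weights (and bias) enter the Hessian-like matrix. All names are ours, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_features(T, W1, b1):
    """Hidden-layer activations, treated as fixed basis functions phi(t)
    when evaluating the approximate ITC."""
    return sigmoid(T @ W1 + b1)

def approximate_itc_statistic(T, K, W1, b1):
    """Smallest eigenvalue of the Hessian-like matrix computed only from
    gradients w.r.t. the output-layer weights and bias."""
    phi = hidden_features(T, W1, b1)
    phi = np.hstack([phi, np.ones((phi.shape[0], 1))])   # bias column
    n = phi.shape[0]
    G_hat = phi.T @ K @ phi / n ** 2
    return np.linalg.eigvalsh(G_hat)[0]
```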
Results. The MSEs between the optimized NNs and the true functions are reported in Table 2. First, we find that minimizing the ratio of the ITC and the KEIC indeed helps to reduce the bias of parameter estimation with NNs: compared with the two baselines, our method performs stably across different datasets and different NN structures. This provides further evidence that our method improves the practical performance of the KMMR. Second, the approximate ITC works well for the NNs employed.
| NN | Algorithm | abs | linear | quad | sin |
|---|---|---|---|---|---|
| 1 hidden layer | Silverman | 0.328 ± 0.152 | 0.166 ± 0.107 | 0.031 ± 0.020 | 0.327 ± 0.098 |
| 1 hidden layer | Med-Heuristic | 0.231 ± 0.030 | 0.045 ± 0.021 | 0.012 ± 0.004 | 0.179 ± 0.034 |
| 1 hidden layer | Our Method | 0.041 ± 0.027 | 0.027 ± 0.016 | 0.006 ± 0.003 | 0.058 ± 0.045 |
| 2 hidden layers | Silverman | 0.444 ± 0.255 | 0.102 ± 0.074 | 0.037 ± 0.018 | 0.630 ± 0.398 |
| 2 hidden layers | Med-Heuristic | 0.187 ± 0.044 | 0.039 ± 0.018 | 0.013 ± 0.004 | 0.145 ± 0.039 |
| 2 hidden layers | Our Method | 0.036 ± 0.013 | 0.013 ± 0.003 | 0.007 ± 0.003 | 0.035 ± 0.024 |
7 Conclusions
The conditional moment restriction (CMR) is ubiquitous in many fields, and the kernel maximum moment restriction (KMMR) is a promising framework for handling the CMR, thanks to its easy-to-use-and-analyze form and strong practical performance. However, the choice of the instrument space is challenging and affects the effectiveness of the framework. The present work proposes a systematic procedure to select the instrument space for instrumental variable (IV) estimators. We first define a selection principle, the least identifiable instrument space (LIIS), which identifies the model parameters with the least space complexity. To determine the LIIS among a set of candidates, we propose the least identification selection criterion (LISC) as a combination of two criteria: (i) the identification test criterion (ITC), which checks identifiability; and (ii) the kernel effective information criterion (KEIC), which combines the risk function value and the effective dimension to check the CMR condition and penalize complexity. We analyze the consistency of our method, and our experiments provide evidence of its effectiveness both in identifying the LIIS and in parameter estimation for IV estimators.
References
- Ai and Chen [2003] C. Ai and X. Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003. ISSN 00129682, 14680262.
- Akaike [1974] H. Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
- Andersen and Sorensen [1996] T. G. Andersen and B. E. Sorensen. GMM estimation of a stochastic volatility model: A monte carlo study. Journal of Business & Economic Statistics, 14(3):328–352, 1996.
- Andrews [1999] D. W. K. Andrews. Consistent Moment Selection Procedures for Generalized Method of Moments Estimation. Econometrica, 67(3):543–564, May 1999.
- Aronszajn [1950] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
- Athey et al. [2019] S. Athey, J. Tibshirani, S. Wager, et al. Generalized random forests. Annals of Statistics, 47(2):1148–1178, 2019.
- Baker [1977] C. T. Baker. The numerical treatment of integral equations. 1977.
- Bennett et al. [2019] A. Bennett, N. Kallus, and T. Schnabel. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems 32, pages 3564–3574. Curran Associates, Inc., 2019.
- Berlinet and Thomas-Agnan [2004] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
- Bond and Windmeijer [2002] S. R. Bond and F. Windmeijer. Finite sample inference for gmm estimators in linear panel data models. 2002.
- Borgwardt et al. [2006] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
- Breusch et al. [1999] T. Breusch, H. Qian, P. Schmidt, and D. Wyhowski. Redundancy of moment conditions. Journal of Econometrics, 91(1):89–111, 1999. ISSN 0304-4076.
- Burgess et al. [2017] S. Burgess, D. S. Small, and S. G. Thompson. A review of instrumental variable estimators for mendelian randomization. Statistical Methods in Medical Research, 26(5):2333–2355, 2017.
- Burgess et al. [2020] S. Burgess, C. N. Foley, E. Allara, J. R. Staley, and J. M. M. Howson. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nature Communications, 11(1):376, 2020.
- Caponnetto and De Vito [2007] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
- Carrasco [2012] M. Carrasco. A regularization approach to the many instruments problem. Journal of Econometrics, 170(2):383–398, 2012.
- Carrasco and Florens [2000] M. Carrasco and J.-P. Florens. Generalization of gmm to a continuum of moment conditions. Econometric Theory, 16(6):797–834, 2000.
- Carrasco and Florens [2014] M. Carrasco and J.-P. Florens. On the asymptotic efficiency of GMM. Econometric Theory, 30(2):372–406, 2014. doi: 10.1017/S0266466613000340.
- Carrasco et al. [2007] M. Carrasco, J.-P. Florens, and E. Renault. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. In J. Heckman and E. Leamer, editors, Handbook of Econometrics, volume 6B, chapter 77. Elsevier, 1 edition, 2007.
- Cheng and Liao [2015] X. Cheng and Z. Liao. Select the valid and relevant moments: An information-based lasso for gmm with many moments. Journal of Econometrics, 186(2):443–464, 2015.
- Chernozhukov et al. [2018] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
- Davey Smith and Ebrahim [2003] G. Davey Smith and S. Ebrahim. ‘mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International journal of epidemiology, 32(1):1–22, 2003.
- de Jong [1996] R. M. de Jong. The bierens test under data dependence. Journal of Econometrics, 72(1):1–32, 1996. ISSN 0304-4076.
- Dikkala et al. [2020] N. Dikkala, G. Lewis, L. Mackey, and V. Syrgkanis. Minimax estimation of conditional moment models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12248–12262. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/8fcd9e5482a62a5fa130468f4cf641ef-Paper.pdf.
- Dominguez and Lobato [2004] M. Dominguez and I. Lobato. Consistent estimation of models defined by conditional moment restrictions. Econometrica, 72(5):1601–1615, 2004.
- Donald and Newey [2001] S. G. Donald and W. K. Newey. Choosing the number of instruments. Econometrica, 69(5):1161–1191, 2001.
- Donald et al. [2003] S. G. Donald, G. W. Imbens, and W. K. Newey. Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics, 117(1):55–93, 2003. ISSN 0304-4076. doi: https://doi.org/10.1016/S0304-4076(03)00118-0. URL https://www.sciencedirect.com/science/article/pii/S0304407603001180.
- Donald et al. [2009] S. G. Donald, G. W. Imbens, and W. K. Newey. Choosing instrumental variables in conditional moment restriction models. Journal of Econometrics, 152(1):28–36, 2009. ISSN 0304-4076. Recent Adavances in Nonparametric and Semiparametric Econometrics: A Volume Honouring Peter M. Robinson.
- Feng et al. [2019] Y. Feng, L. Li, and Q. Liu. A kernel loss for solving the bellman equation. Advances in neural information processing systems, 32, 2019.
- Ghosh [1994] J. K. Ghosh. Higher order asymptotics. NSF-CBMS Regional Conference Series in Probability and Statistics, 4:i–111, 1994. ISSN 19355920, 23290978. URL http://www.jstor.org/stable/4153181.
- Gretton et al. [2012] A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 1205–1213, Red Hook, NY, USA, 2012. Curran Associates Inc.
- Hall [2015] A. R. Hall. Econometricians have their moments: Gmm at 32. Economic Record, 91(S1):1–24, 2015.
- Hall et al. [2007] A. R. Hall, A. Inoue, K. Jana, and C. Shin. Information in generalized method of moments estimation and entropy-based moment selection. Journal of Econometrics, 138(2):488–512, 2007.
- Hall et al. [2005] A. R. Hall et al. Generalized method of moments. Oxford university press, 2005.
- Hall et al. [1991] P. Hall, S. J. Sheather, M. Jones, and J. S. Marron. On optimal data-based bandwidth selection in kernel density estimation. Biometrika, 78(2):263–269, 1991.
- Hansen [1982] L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054, 1982.
- Hansen and Singleton [1982] L. P. Hansen and K. J. Singleton. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica, 50(5):1269–1286, 1982. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/1911873.
- Hartford et al. [2017] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1414–1423. PMLR, 2017.
- Hartford et al. [2020] J. S. Hartford, V. Veitch, D. Sridhar, and K. Leyton-Brown. Valid causal inference with (some) invalid instruments. CoRR, abs/2006.11386, 2020.
- Ivanov et al. [2002] V. K. Ivanov, V. V. Vasin, and V. Tanana. Theory of Linear Ill-posed Problems and Its Applications. Inverse and ill-posed problems series. VSP, 2002.
- Jitkrittum et al. [2017a] W. Jitkrittum, Z. Szabó, and A. Gretton. An adaptive test of independence with analytic kernel embeddings. In International Conference on Machine Learning, pages 1742–1751. PMLR, 2017a.
- Jitkrittum et al. [2017b] W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. NIPS’17, page 261–270, Red Hook, NY, USA, 2017b. Curran Associates Inc. ISBN 9781510860964.
- Kallus [2020] N. Kallus. Generalized optimal matching methods for causal inference. Journal of Machine Learning Research, 21(62):1–54, 2020. URL http://jmlr.org/papers/v21/19-120.html.
- Kuang et al. [2020] Z. Kuang, F. Sala, N. Sohoni, S. Wu, A. Córdova-Palomera, J. Dunnmon, J. Priest, and C. Re. Ivy: Instrumental variable synthesis for causal inference. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 398–410. PMLR, 2020.
- Lewis and Syrgkanis [2018] G. Lewis and V. Syrgkanis. Adversarial generalized method of moments. 03 2018.
- Liao et al. [2020] L. Liao, Y. Chen, Z. Yang, B. Dai, Z. Wang, and M. Kolar. Provably efficient neural estimation of structural equation model: An adversarial approach. In Advances in Neural Information Processing Systems 33. 2020.
- Liao et al. [2021] L. Liao, Z. Fu, Z. Yang, M. Kolar, and Z. Wang. Instrumental variable value iteration for causal offline reinforcement learning. arXiv preprint arXiv:2102.09907, 2021.
- Liu et al. [2016] Q. Liu, J. Lee, and M. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284. PMLR, 2016.
- Liu et al. [2018] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/dda04f9d634145a9c68d5dfe53b21272-Paper.pdf.
- Lopes et al. [2011] M. E. Lopes, L. J. Jacob, and M. J. Wainwright. A more powerful two-sample test in high dimensions using random projection. arXiv preprint arXiv:1108.2401, 2011.
- Mastouri et al. [2021] A. Mastouri, Y. Zhu, L. Gultchin, A. Korba, R. Silva, M. Kusner, A. Gretton, and K. Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. In International conference on machine learning. PMLR, 2021.
- Mendelson et al. [2003] S. Mendelson, T. Graepel, and R. Herbrich. On the performance of kernel classes. Journal of Machine Learning Research, 4:2003, 2003.
- Mercer [1909] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209:415–446, 1909. ISSN 02643952.
- Micchelli and Pontil [2005] C. A. Micchelli and M. A. Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.
- Morimune [1983] K. Morimune. Approximate distributions of k-class estimators when the degree of overidentifiability is large compared with the sample size. Econometrica: Journal of the Econometric Society, pages 821–841, 1983.
- Muandet et al. [2017] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
- Muandet et al. [2020a] K. Muandet, W. Jitkrittum, and J. Kübler. Kernel conditional moment test via maximum moment restriction. In J. Peters and D. Sontag, editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 41–50. PMLR, 03–06 Aug 2020a.
- Muandet et al. [2020b] K. Muandet, A. Mehrjou, S. K. Lee, and A. Raj. Dual instrumental variable regression. In Advances in Neural Information Processing Systems 33. Curran Associates, Inc., 2020b. Forthcoming.
- Muth [1961] J. F. Muth. Rational expectations and the theory of price movements. Econometrica: Journal of the Econometric Society, pages 315–335, 1961.
- Nagar [1959] A. L. Nagar. The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica: Journal of the Econometric Society, pages 575–595, 1959.
- Newey [1993] W. Newey. Efficient estimation of models with conditional moment restrictions. In Handbook of Statistics, volume 11, chapter 16, pages 419–454. Elsevier, 1993.
- Newey and McFadden [1994] W. K. Newey and D. McFadden. Chapter 36 large sample estimation and hypothesis testing. volume 4 of Handbook of Econometrics, pages 2111 – 2245. Elsevier, 1994.
- Newey and Smith [2004] W. K. Newey and R. J. Smith. Higher order properties of gmm and generalized empirical likelihood estimators. Econometrica, 72(1):219–255, 2004. ISSN 00129682, 14680262.
- Riesz [1909] F. Riesz. Sur les opérations fonctionnelles linéaires, 1909.
- Robin and Smith [2000] J.-M. Robin and R. J. Smith. Tests of rank. Econometric Theory, pages 151–175, 2000.
- Schölkopf and Smola [2002] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002.
- Schwarz et al. [1978] G. Schwarz et al. Estimating the dimension of a model. Annals of statistics, 6(2):461–464, 1978.
- Silverman [1986] B. W. Silverman. Density estimation for statistics and data analysis. London: Chapman & Hall/CRC, 1986.
- Singh et al. [2019] R. Singh, M. Sahani, and A. Gretton. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems, pages 4595–4607, 2019.
- Uehara et al. [2020] M. Uehara, J. Huang, and N. Jiang. Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.
- Van der Vaart [2000] A. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
- Vuong [1989] Q. H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2):307–333, 1989. ISSN 00129682, 14680262.
- Zhang et al. [2020] R. Zhang, M. Imaizumi, B. Schölkopf, and K. Muandet. Maximum moment restriction for instrumental variable regression. arXiv preprint arXiv:2010.07684, 2020.
- Zhang [2003] T. Zhang. Effective dimension and generalization of kernel learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.
Checklist
1. For all authors…
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] As mentioned in the abstract and introduction, this paper addresses the kernel instrument space selection problem, which arises from the recently proposed kernel maximum moment restriction framework for instrumental variable estimators and differs from the traditional instrument selection problem.
   (b) Did you describe the limitations of your work? [Yes] In Section 6.1, we show that our identification test criterion cannot identify the optimal instrument space for models with many parameters when the dataset is small.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results…
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] We put the proofs in the appendix.
3. If you ran experiments…
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We put the code for our experiments in the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] All training details are included in the experiment section and in additional sections in the appendix.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We report error bars and provide the random seed and the number of repeated experiments.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] Our experimental results do not depend on the type of computational resources.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A] We implement the baselines and our method ourselves, and our data are simulated.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We put our code for the experiments in the supplemental material.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] See the answer to 4(b).
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] Our data are simulated, without any personally identifiable information or offensive content.
5. If you used crowdsourcing or conducted research with human subjects…
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]
Appendix
Appendix A Derivation Process of ITC
To derive the asymptotic distribution of the test statistic, we first need the asymptotic distribution of $\hat{G}$, shown as follows. As preparation, let $\mathrm{vec}(\cdot)$ be the vectorization operator that stacks the columns of a matrix into a single vector.
Lemma 1.
Suppose $\hat{\theta}$ converges to $\tilde{\theta}$ almost surely and the asymptotic covariance matrix $\Sigma$ is non-zero. Then $\sqrt{n}\,\mathrm{vec}(\hat{G} - G)$ converges in distribution to a centered Gaussian vector with covariance $\Sigma$.
The asymptotic variance is essential for determining the asymptotic distribution of the test statistic, and we need an assumption on it under the non-full-rank hypothesis $H_0$. Let $v$ be the eigenvector of $G$ corresponding to the zero eigenvalue under $H_0$. We make the following rank assumption.
Assumption 3 (rank condition).
Under the null hypothesis $H_0$, the quantity $(v \otimes v)^\top \Sigma\, (v \otimes v)$ is non-zero, where $\otimes$ is the Kronecker product and $\Sigma$ is the asymptotic covariance matrix from Lemma 1.
An equivalent statement of Assumption 3 is that the vector $v \otimes v$ does not lie in the space spanned by the eigenvectors associated with the zero eigenvalues of $\Sigma$. Therefore, if $\Sigma$ is positive definite, Assumption 3 is automatically satisfied. This assumption is required to obtain a non-degenerate asymptotic distribution of the test statistic. Let $\sigma^2 := (v \otimes v)^\top \Sigma\, (v \otimes v)$ denote the corresponding variance. Under the null hypothesis and the rank Assumption 3, we obtain the asymptotic distribution.
Theorem 6. Suppose Assumption 3 holds and $\hat{\theta}$ converges to $\tilde{\theta}$ almost surely. Then, under the null hypothesis $H_0$, the test statistic converges in distribution to $\sigma\zeta$, where $\zeta$ is a standard normal random variable.
Given this asymptotic distribution, we can control the type-I error. For a significance level $\alpha$, let $q_{1-\alpha}$ be the $(1-\alpha)$-quantile of $\sigma\zeta$. The following result shows that the type-I error is asymptotically equal to the prescribed level $\alpha$.
Theorem 7.
Suppose Assumption 3 holds. Let $P_0$ be the probability measure of the data under the null hypothesis $H_0$. Then, for any $\alpha \in (0, 1)$, the probability under $P_0$ that the test statistic exceeds $q_{1-\alpha}$ converges to $\alpha$ as $n \to \infty$.
Appendix B Proofs
B.1 Proof of Lemma 1
Proof.
We first decompose the objective into two terms (i) and (ii):
(5)
The term (ii) converges to zero as $n \to \infty$, given the almost sure convergence of $\hat{\theta}$. By Slutsky's theorem [71, Lemma 2.8], the asymptotic distribution is determined by the term (i). Letting $\hat{\mathbb{E}}_n$ denote the empirical expectation, the asymptotic distribution of (i) is obtained by the central limit theorem:
(6)–(7)
∎
B.2 Proof of Theorem 6
Proof.
We rewrite the test statistic and decompose it into two terms as below:
(8)–(9)
The term (i) converges to zero as $n \to \infty$, based on the consistency of $\hat{G}$ to $G$ given in Lemma 1. The term (ii) has an asymptotic distribution that follows from Lemma 1 and Assumption 3:
(10)–(12)
Because the null hypothesis holds, i.e., the matrix $G$ is not of full rank, its smallest eigenvalue is zero. Therefore, by Slutsky's theorem and by [72, Lemma 3.2], the test statistic converges in distribution to $\sigma\zeta$, where $\zeta$ is a standard normal random variable. ∎
B.3 Proof of Theorem 7
Proof.
The conclusion follows directly from the asymptotic distribution of the test statistic given in Theorem 6. ∎
B.4 Proof of Theorem 3
Proof.
By Assumption 1 and the consistency of $\hat{R}_k$ to $R_k$, we know that $R_k$ has a unique minimizer $\theta_0$, which satisfies $R_k(\theta_0) = 0$. Further, since $\Theta$ is compact, $\hat{R}_k$ converges to $R_k$ uniformly in probability, and $R_k$ is continuous (because the kernel is bounded and $\psi$ is implicitly assumed to be differentiable), the empirical minimizer $\hat{\theta}$ converges in probability to $\theta_0$ by Theorem 2.1 in Newey and McFadden [62]. Therefore, as $n \to \infty$, $\hat{G}$ converges to $G$ in probability, and the ITC consistently estimates identifiability. ∎
B.5 Proof of Theorem 4
Proof.
By the existing result [7, Theorem 3.4] that the ordered eigenvalues of the kernel matrix converge to the corresponding Mercer eigenvalues as $n \to \infty$, the numerator of the empirical effective dimension is a consistent estimator of its population counterpart. Similarly, the denominator is consistently estimated. Therefore, the empirical effective dimension is a consistent estimator of the effective dimension. ∎
B.6 Proof of Theorem 5
Proof.
We define $m^*$ as the index of the optimal RKHS instrument space, namely the LIIS. With the above settings, the statement of Theorem 5 is equivalent to showing that, as $n \to \infty$, the KEIC of the LIIS is smaller than that of any other candidate space with probability approaching 1. For the LIIS and any competing candidate, we study the following difference:
Given the condition that a unique least identifiable instrument space exists, and the consistency of the empirical effective dimension shown in Theorem 4, the difference of the penalty terms converges to a constant. By Assumption 1, the CMR has a unique solution $\theta_0$; since the LIIS is identifiable, the minimizer of its empirical risk converges to $\theta_0$, so its empirical risk term vanishes at the rate given by the asymptotic distribution in conclusion (2) of Theorem 4.1 in Muandet et al. [57]. Hence, the risk term of any competing candidate has a larger order than that of the LIIS, and the difference is positive with probability approaching 1 as $n \to \infty$.
∎
Appendix C More Details on Experiments
C.1 LIIS for Section 6.1
Under the experimental settings of Section 6.1, an identifiable instrument space should satisfy (i) the existence of a parameter vector that meets the CMR condition (1) and (ii) the uniqueness of this parameter [34, Assumption 2.3]. Since the same models are used for estimation and data generation, (i) immediately holds, and (ii) also holds when the number of instruments is at least the number of parameters. As the linear kernel has a feature map projecting onto a one-dimensional space, using it as the instrument space is equivalent to employing a single instrument; condition (ii) then fails because the number of instruments is smaller than the number of parameters of both estimation models. Moreover, among the candidate kernel parameters, P4 identifies the parameters of the higher-degree model, while P2 fails to do so; G can identify the parameters using any candidate lengthscale. Considering the complexity measured by the effective dimension (4), we obtain the LIISs: P2 for the lower-degree model and P4 for the higher-degree model. We can approximate the effective dimension via a large sample.
| Scenario | Algorithm | abs | linear | quad | sin |
|---|---|---|---|---|---|
| LS | Silverman | 0.076 ± 0.032 | 0.050 ± 0.053 | 0.024 ± 0.032 | 0.124 ± 0.105 |
| LS | Med-Heuristic | 0.100 ± 0.081 | 0.058 ± 0.053 | 0.024 ± 0.032 | 0.146 ± 0.091 |
| LS | Our Method | 0.051 ± 0.029 | 0.048 ± 0.054 | 0.025 ± 0.032 | 0.100 ± 0.111 |
| LW | Silverman | 0.282 ± 0.067 | 0.096 ± 0.044 | 0.050 ± 0.022 | 0.169 ± 0.047 |
| LW | Med-Heuristic | 0.127 ± 0.075 | 0.050 ± 0.032 | 0.040 ± 0.026 | 0.101 ± 0.066 |
| LW | Our Method | 0.128 ± 0.077 | 0.036 ± 0.017 | 0.026 ± 0.012 | 0.041 ± 0.022 |
| NS | Silverman | 0.136 ± 0.080 | 0.153 ± 0.115 | 0.071 ± 0.042 | 0.198 ± 0.147 |
| NS | Med-Heuristic | 0.315 ± 0.260 | 0.571 ± 0.823 | 1.229 ± 2.753 | 0.560 ± 0.322 |
| NS | Our Method | 0.056 ± 0.040 | 0.029 ± 0.012 | 0.043 ± 0.042 | 0.052 ± 0.027 |
C.2 More Experiment Settings
We standardize the data values to have zero mean and unit variance for numerical stability. Experiments for each setting are repeated 10 times. To avoid overfitting, we compute the empirical risk in the KEIC as a two-fold cross-validation error. The random seed is set to 527. Code for the experiments is provided in the supplementary material.