Adversarial Estimators
Abstract
We develop an asymptotic theory of adversarial estimators (‘A-estimators’). They generalize maximum-likelihood-type estimators (‘M-estimators’) as their average objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.
1 Introduction
Although it is not always obvious, nearly all population parameters that are estimated in econometrics and machine learning can be written as the solution of so-called saddle-point or adversarial objectives of the form:
(1.1)   $\theta_0 \in \arg\min_{\theta \in \Theta}\ \sup_{\lambda \in \Lambda}\ E\left[\ell(\theta, \lambda, X)\right]$
where $\ell$ is a known loss function, $X$ is a random variable with distribution $P$, and $\Theta, \Lambda$ are parameter spaces, containing the unknown parameter of interest $\theta_0$ and the nuisance $\lambda_0 = \arg\max_{\lambda \in \Lambda} E[\ell(\theta_0, \lambda, X)]$. We examine the natural estimator $(\hat\theta, \hat\lambda)$ that approximately solves the empirical Nash condition:
(1.2)   $E_n\left[\ell(\hat\theta, \hat\lambda, X)\right] \le \inf_{\theta \in \Theta_n} E_n\left[\ell(\theta, \hat\lambda, X)\right] + \epsilon_n$
(1.3)   $E_n\left[\ell(\hat\theta, \hat\lambda, X)\right] \ge \sup_{\lambda \in \Lambda_n} E_n\left[\ell(\hat\theta, \lambda, X)\right] - \delta_n$
which replaces the expectation $E$ of the population objective 1.1 with the average $E_n$ over $n$ iid samples $X_1, \dots, X_n \sim P$. We search for the estimators over so-called sieve spaces $\Theta_n \subseteq \Theta$, $\Lambda_n \subseteq \Lambda$ (Grenander [1981]), which approximate the full parameter spaces and grow with the sample size $n$. These could be neural networks, for example, growing in depth and width. The sequences $\epsilon_n, \delta_n \ge 0$ accommodate numerical procedures which only yield approximate Nash equilibria. This class of A-estimators (A for adversarial) strictly generalizes so-called M-estimators (M for maximum likelihood-type), which are obtained by fixing $\Lambda$ to be a singleton.
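To make the Nash condition concrete, the following sketch (not from the paper; the toy loss and step sizes are our own choices) solves the empirical Nash problem by simultaneous gradient descent-ascent for the loss $\ell(\theta, \lambda, x) = \lambda(x - \theta) - \lambda^2/2$, whose empirical Nash equilibrium is $\hat\theta = \bar{x}$ and $\hat\lambda = 0$:

```python
import numpy as np

# Toy loss l(theta, lam, x) = lam*(x - theta) - lam**2 / 2.
# Concentrating out lam gives max_lam E_n[l] = (mean(x) - theta)^2 / 2,
# so the Nash equilibrium is theta_hat = sample mean, lam_hat = 0.
x = np.array([1.0, 2.0, 3.0, 6.0])

def grad(theta, lam):
    """Gradients of the empirical objective E_n[l(theta, lam, X)]."""
    g_theta = -lam                       # d/dtheta E_n[l]
    g_lam = np.mean(x - theta) - lam     # d/dlam   E_n[l]
    return g_theta, g_lam

theta, lam, eta = 0.0, 0.0, 0.05
for _ in range(5000):
    g_t, g_l = grad(theta, lam)
    # descent in theta (the estimated parameter), ascent in lam (the adversary)
    theta, lam = theta - eta * g_t, lam + eta * g_l

print(theta, lam)  # approximately (3.0, 0.0)
```

The alternating sign of the updates is the only difference to a plain M-estimation gradient loop; for this strongly concave-in-$\lambda$ toy problem the iterates spiral into the saddle point.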
A-estimators have become a workhorse of econometrics and causal inference long before the advent of deep learning. Hansen et al. [1996]'s continuous-updating Generalized Method of Moments (GMM), which looks for $\theta$ satisfying $E[g(X, \theta)] = 0$ for some known moment function $g$, can be written as:
$\hat\theta \in \arg\min_{\theta \in \Theta}\ \max_{\lambda}\ E_n\left[\lambda' g(X, \theta) - \tfrac{1}{2}\left(\lambda' g(X, \theta)\right)^2\right]$
and is therefore an A-estimator, but not an M-estimator. In statistics, an earlier example consists of the Empirical Likelihood (EL) approach pioneered in Cosslett [1981], Owen [1988, 1990], Qin and Lawless [1994]. Subsequently, EL was unified with GMM into the Generalized Empirical Likelihood (GEL) framework (Newey and Smith [2004]), also subsuming the exponential-tilting estimator (Imbens et al. [1998]), for example. All GEL estimators are A-estimators, but their adversarial formulation was rarely salient. However, some of their benefits may be owed directly to their adversarial objective: the adversary automatically detects which moment violations are most informative at a given parameter guess, adaptively guiding the estimation towards an efficient solution. This contrasts with earlier estimators which weighted the moments in a way that depended on choices of the researcher: the weights of Pearson's Method-of-Moments were manually set by the researcher (implicitly), resulting in inefficient root-$n$ asymptotics. Two-step GMM (Hansen [1982]) required choosing a first-step estimator to compute the weights, yielding inefficient higher-order asymptotics (see Newey and Smith [2004]). Formally, the optimal weights are nuisance parameters, and as we will see in Section 3.2, estimating them via an adversary ensures that $\hat\theta$ is robust to estimation errors in these nuisance parameters.
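The concentration of the adversary can be checked numerically. The sketch below (our own illustration; the data, moment function and the normalization $\ell = \lambda' g - \frac{1}{2}(\lambda' g)^2$ are assumptions) verifies that maximizing over $\lambda$ in closed form recovers the continuous-updating GMM criterion $\frac{1}{2}\,\bar{g}'\hat{W}^{-1}\bar{g}$ with $\hat{W} = E_n[gg']$:

```python
import numpy as np

# Moments g(x, theta) = (x - theta, (x - theta)^2 - 1), as in fitting a
# unit-variance location model; theta_0 solves E[g] = 0.
x = np.array([0.3, -1.2, 0.8, 1.9, -0.5, 0.1])
theta = 0.2  # an arbitrary parameter guess

g = np.stack([x - theta, (x - theta) ** 2 - 1.0], axis=1)   # n x 2 moment matrix
gbar = g.mean(axis=0)                                       # E_n[g]
W = (g[:, :, None] * g[:, None, :]).mean(axis=0)            # E_n[g g']

def inner(lam):
    """Empirical adversarial objective E_n[lam'g - (lam'g)^2 / 2]."""
    return lam @ gbar - 0.5 * lam @ W @ lam

lam_star = np.linalg.solve(W, gbar)       # closed-form inner maximizer
value = inner(lam_star)
cu_gmm = 0.5 * gbar @ np.linalg.solve(W, gbar)  # CU-GMM criterion at theta

print(np.isclose(value, cu_gmm))  # True
```

Since the inner problem is a strictly concave quadratic, any perturbation of `lam_star` lowers the objective, which is exactly the "adversary detects the most informative moment violation" mechanic described above.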
A key invention which put a spotlight on adversarial objectives in recent years were Generative Adversarial Networks, or GANs (Goodfellow et al. [2014]). They search for a generative model for which no adversary (called ‘critic’ or ‘discriminator’) could tell apart the generated data from real samples:
$\min_{\theta}\ \max_{\lambda}\ E\left[\log D_\lambda(X)\right] + E\left[\log\left(1 - D_\lambda(\tilde X_\theta)\right)\right]$
where $X$ denotes a real sample, $\tilde X_\theta$ a sample from the generative model, and $D_\lambda$ the adversary's classification probability.
The objective contains the log-likelihood of a binary classifier discriminating between an equal number of real and generated samples. As we show in Section 2.1, this directly measures the Jensen-Shannon divergence between the model distribution and the data distribution. As of today,
versions of this objective are key to state-of-the-art image generation, see e.g. Jabbar et al. [2022] for a recent survey. An analogy to human-generated images makes this unsurprising: it is much easier to tell apart a photo from an image drawn by a human, than it is to draw a realistic image, or to define what makes a drawing realistic. This intuition motivates the objective: train the generator until its critic has nothing more to criticize. The ingenuity is that the researcher need not define a meaningful measure of ‘realism’ of a piece of data anymore. Instead, this measure is learned by the adversary. It is clear that the utility of this idea extends beyond image generation: in Imitation Learning, a sub-field of Robotics, it has been used to teach human behavior to artificial agents without requiring hand-crafted measures of ‘humanness’ (Ho and Ermon [2016]). In Econometrics, where new causal inference methods can usually only be benchmarked on simulated data sets, Athey et al. [2019] used the objective to limit the impact of researcher’s subjective choices by requiring simulations to be indistinguishable from real data. Kaji et al. [2020] proposed to use the objective to estimate structural economic models which produce realistic data beyond the set of features that would otherwise be manually specified by the researcher.
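A small discrete sanity check (our own illustration, not from the paper) makes the Jensen-Shannon connection concrete: with the pointwise-optimal discriminator $D^*(x) = p(x)/(p(x) + q(x))$, the GAN criterion equals $2\,\mathrm{JS}(p, q) - 2\log 2$, so training the generator until the critic is indifferent drives $q$ towards $p$:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # "real" distribution over three points
q = np.array([0.4, 0.4, 0.2])   # "generated" distribution

# Pointwise-optimal discriminator and the resulting GAN criterion value
d_star = p / (p + q)
gan_value = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star))

# Jensen-Shannon divergence via its mixture definition
m = 0.5 * (p + q)
kl = lambda a, b: np.sum(a * np.log(a / b))
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(np.isclose(gan_value, 2.0 * js - 2.0 * np.log(2.0)))  # True
```

The criterion is bounded below by $-2\log 2$, attained exactly when $p = q$ and the critic has "nothing more to criticize".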
More generally, other adversarial objectives have proven useful beyond fitting models to data. In Reinforcement Learning, a sub-field of Robotics where agents independently discover strategies to reach predefined goals without copying prior examples, Dai et al. [2018] proposed an A-estimator in which the adversary detects and penalizes any systematic deviation from optimal behavior. Cotter et al. [2019] proposed an estimator which extends a standard ML objective by an adversary imposing fairness constraints across sub-populations. More recently, research in econometrics established A-estimators as a natural framework for integrating machine learning methods into causal inference, where quantities of interest are frequently identified by a continuum of restrictions. Chernozhukov et al. [2020] propose to estimate Riesz representers of causal parameters directly, via an adversary enforcing the restrictions identifying the Riesz representer. Estimating Riesz representers is key to obtaining well-behaved estimates of causal parameters in the presence of nuisance functions, and can also be useful for estimating asymptotic variances, e.g. Chen et al. [2019]. Another line of research develops novel adversarial objectives to estimate causal parameters from conditional moment restrictions, which naturally arise from causal assumptions (e.g. the instrumental variable setting), and are usually more informative than any finite set of unconditional moment restrictions. In this line of research, the adversary can be viewed as adaptively finding the unconditional moment restriction which is most violated at the current parameter guess, among infinitely many which are implied by the conditional moment restriction. The key works are Lewis and Syrgkanis [2018], Dikkala et al. [2020] and Bennett et al. [2019b], Bennett and Kallus [2020]. 
Metzger [2022] proposes a semi-parametrically efficient generalization of GEL to the conditional case via adversarial networks, containing Bennett and Kallus [2020] as a special case.
In summary, a recurring theme of adversarial objectives is that instead of manually defining which specific features of the data are important for a model to capture, the researcher’s role is restricted to stating a general principle which should be satisfied by all features of the correct model, and the adversary adaptively focuses the estimation on the model’s features which violate this principle the most. Over the course of the paper, we will encounter further interesting connections between various A-estimators, such as their Neyman orthogonality, their information-theoretic foundation via f-Divergences, and their ties to Lagrangian Duality.
Despite their popularity, we are not aware of a unified statistical theory of A-estimators. For some individual estimators, consistency (Bennett et al. [2019b]) and convergence rate results (Dikkala et al. [2020], Singh et al. [2018], Liang [2021], Belomestny et al. [2021]) were obtained, but normality results are limited to parametric $\theta$, either in kernel settings (Bennett and Kallus [2020]) or leaving high-level assumptions about neural networks unverified (Kaji et al. [2020]). This can be attributed to two main obstacles. The first is that the theory of M-estimation does not apply to A-estimators, and the arguments from which the former is built up are insufficient to e.g. obtain the required uniform convergence of the adversary. The second issue is that adversarial objectives are most popular in the context of (deep) neural networks, whose statistical analysis (particularly their asymptotic normality) is complicated, e.g. due to their non-convex sieve space. Even in M-estimation settings, it was not clear whether known, high-level conditions for normality could be verified for neural networks (cf. the Conclusion of Shen et al. [2019]). We therefore make three separate contributions:
1. We characterize the general class of A-estimators, and show that a wide range of estimators proposed in econometrics and machine learning fall into this class. We point out desirable characteristics shared between A-estimators, which help explain their recent success in practice.
2. We develop a unified statistical theory of A-estimators, yielding their consistency, convergence rates (both under point- and partial identification), and asymptotic normality of functionals of their parameters. We provide high-level conditions for arbitrary sieves, as well as low-level conditions for semi-parametric settings with neural networks, to simplify verification in practice.
3. We extend the theory of neural network M-estimators (as a special case). Our convergence rates hold uniformly over families of losses, allow more general losses than Farrell et al. [2018] and attain a reduced curse-of-dimensionality which Nakada and Imaizumi [2020], Bauer et al. [2019] observed in regression settings with lower-dimensional structures. To the best of our knowledge, we provide the first normality result for functionals of deep neural networks which does not rely on Neyman-orthogonality or unverified high-level assumptions.
The remainder of the paper is structured as follows. In Section 2, we review five different A-estimators proposed in the econometrics and machine learning literatures. We present our general statistical theory of A-estimators in Section 3 and apply it in Section 4 to derive novel results about the examples of Section 2. We conclude by recapping the similar role adversaries play across all examples, providing intuition which types of problems may generally benefit from adversarial formulations. Appendix C and Online Appendix D contain the proofs omitted in Sections 3 and 4, respectively.
2 Examples
2.1 Minimum $f$-Divergence
A powerful class of estimation objectives asymptotically minimizes an $f$-divergence $D_f(P \,\|\, P_\theta)$ between the distribution $P$ of the data and the distribution $P_\theta$ of some model indexed by $\theta \in \Theta$, sharing a common support. This class, introduced by Nowozin et al. [2016], subsumes GANs (Goodfellow et al. [2014]) and many follow-ups such as Mao et al. [2017], Tao et al. [2018]. For a continuous, proper convex function $f$ satisfying $f(1) = 0$, the $f$-divergence is defined as $D_f(P \,\|\, P_\theta) = E_{P_\theta}\left[f\left(dP/dP_\theta\right)\right]$, where $dP/dP_\theta$ denotes the Radon-Nikodym derivative of $P$ with respect to $P_\theta$ (the likelihood ratio), which we assume exists for all $\theta$. Notably, $D_f$ admits a useful dual representation:
(2.1)   $D_f(P \,\|\, P_\theta) = \sup_{T}\ E_{X \sim P}\left[T(X)\right] - E_{\tilde X \sim P_\theta}\left[f^*\left(T(\tilde X)\right)\right]$
where $f^*(t) = \sup_u \{u t - f(u)\}$ denotes the convex conjugate of $f$. The equality above follows from the Fenchel-Moreau biconjugation theorem, $f^{**} = f$. Various choices111None of the objectives are unique: replacing $f(u)$ by $f(u) + c\,(u - 1)$ for any $c \in \mathbb{R}$ yields the same divergence, but changes the expressions. Note that we may also swap the roles of $P$ and $P_\theta$, which yields valid objectives for the respective “reverse” $f$-divergences. for $f$ are presented in Table 1. This duality is useful because the right-hand side suggests a finite-sample analog which does not depend on unknown quantities: we obtain an A-estimator for $\theta_0$ by letting
(2.2)   $\ell(\theta, \lambda, X) = T_\lambda(X) - E_{\tilde X \sim P_\theta}\left[f^*\left(T_\lambda(\tilde X)\right)\right]$
and solving for $(\hat\theta, \hat\lambda)$ satisfying the Nash condition 1.2, 1.3. Normalizing $f''(1) = 1$ without loss of generality222This merely re-scales the divergence 2.1 by a constant factor, assuming the second derivative exists., the function attaining the supremum in 2.1 at a given $\theta$ is $T^*_\theta = f'(dP/dP_\theta)$. The adversary therefore estimates this transformed likelihood ratio at the current guess for $\theta$, and the Nash equilibrium corresponds to the case where it is approximately constant, i.e. the distribution $P_\theta$ is close to that of the data. Notably, the objective can be evaluated using only samples from the two distributions333Note that we require neither explicit knowledge of $P_\theta$ nor infinitely many samples from it at a given $\theta$: it suffices to draw Monte Carlo samples from $P_\theta$ and solve for the corresponding finite sample saddle point. The resulting Monte Carlo approximation error for the expectation vanishes at the parametric rate in the number of draws and can thus be accounted for via the tolerances in equations 1.2, 1.3, which has no impact on our asymptotic results.. This is crucial for GANs, where $P_\theta$ is only implicitly defined via a push-forward mapping parametrized by a neural net. As proposed by Kaji et al. [2020], this also makes it a drop-in alternative to the Simulated Method of Moments, which similarly estimates economic models from data they generate, but matches only a finite set of moments instead of the full distribution.
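The claim that the optimal adversary is the transformed likelihood ratio $T^* = f'(dP/dP_\theta)$ can be verified directly for the KL case. The sketch below (our own discrete illustration; the distributions are arbitrary) uses $f(u) = u \log u$, so $f'(u) = 1 + \log u$ and $f^*(t) = e^{t - 1}$, and checks that plugging $T^*$ into the dual criterion 2.1 recovers $\mathrm{KL}(p \,\|\, q)$ exactly:

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])   # "data" distribution
q = np.array([0.3, 0.3, 0.4])   # "model" distribution

# Optimal critic T*(x) = f'(p/q) = 1 + log(p/q) for f(u) = u log u
t_star = 1.0 + np.log(p / q)

# Dual criterion E_p[T] - E_q[f*(T)] with f*(t) = exp(t - 1)
dual = np.sum(p * t_star) - np.sum(q * np.exp(t_star - 1.0))
kl = np.sum(p * np.log(p / q))
print(np.isclose(dual, kl))  # True

# Any other critic can only lower the dual criterion (f* is strictly convex):
t_other = t_star + np.array([0.2, -0.1, 0.3])
dual_other = np.sum(p * t_other) - np.sum(q * np.exp(t_other - 1.0))
```

The same pointwise computation works for the other rows of Table 1 by swapping in the corresponding $f'$ and $f^*$.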
Table 1: common choices of $f$ with their domains, convex conjugates $f^*$ and corresponding generative adversarial objectives, covering the Total Variation, KL, reverse KL, $\chi^2$, squared Hellinger and rescaled Jensen-Shannon (GAN) divergences.
2.2 Generalized Empirical Likelihood
Our next example is a class of A-estimators that was proposed long before the recent success of adversarial objectives in deep learning. In econometrics, many important parameters are identified by a moment restriction of the form:
$E\left[g(X, \theta_0)\right] = 0$
for some known, possibly vector-valued function $g$. In the Introduction, we presented the continuous-updating GMM objective (Hansen et al. [1996]) for estimating $\theta_0$, a workhorse for causal inference in econometrics. In this section, we review the more general class of Generalized Empirical Likelihood (GEL) estimators (Newey and Smith [2004]), which solve the constrained minimization problem:
$\min_{\theta \in \Theta,\ \pi}\ D_f(\pi \,\|\, \hat P_n) \quad \text{subject to} \quad E_\pi\left[g(X, \theta)\right] = 0$
That is, they seek a parameter $\theta$ and a corresponding population distribution $\pi$ that is as close as possible to the sample distribution $\hat P_n$, subject to satisfying the moment constraint at $\theta$. At this high level, it is worth noting that GEL optimizes the same target as the objective in Section 2.1, which imposes a full model $P_\theta$ instead of a moment constraint. Glossing over some details, we can obtain a tractable estimator in this setting by concentrating out $\pi$ from the corresponding Lagrangian:
(2.3)   $\hat\theta \in \arg\min_{\theta \in \Theta}\ \sup_{\lambda}\ -E_n\left[f^*\left(\lambda' g(X, \theta)\right)\right]$
which again uses the convex conjugate $f^*$ of $f$ (see Section 2.1). For a formal proof of this equivalence, see e.g. Imbens et al. [1998]. It is easy to see that GEL is an A-estimator with $\ell(\theta, \lambda, X) = -f^*\left(\lambda' g(X, \theta)\right)$, where $\Theta$ is the parameter space of the economic model and $\lambda$ ranges over a Euclidean space of the same dimension as $g$. A particularly popular version of this objective corresponds to the $\chi^2$ case $f(u) = \frac{1}{2}(u - 1)^2$, for which $f^*(t) = t + \frac{1}{2} t^2$. In this case, we can analytically solve for the optimal adversary given $\theta$. Substituting it in, we get the continuous-updating GMM objective presented in the introduction:
$\hat\theta \in \arg\min_{\theta \in \Theta}\ \tfrac{1}{2}\, E_n\left[g(X, \theta)\right]'\, E_n\left[g(X, \theta) g(X, \theta)'\right]^{-1} E_n\left[g(X, \theta)\right]$
2.3 Off-Policy Reinforcement Learning
Next, we review the Smoothed Bellman Error Embedding (SBEED) algorithm introduced by Dai et al. [2018], a popular off-policy learning method in robotics. Off-policy learning aims to learn the optimal policy for an agent from data that was generated under an entirely different policy regime. This problem is not limited to robotics: since it was identified in the monetary policy context by Lucas [1976], it became a primary concern in econometrics and its recognition played a key role in the credibility revolution (Angrist and Pischke [2010]) of econometrics. While problem definitions otherwise differ between these literatures, off-policy learning methods have received recent interest in econometrics (Zhan et al. [2021], Athey and Wager [2021]).
For an agent receiving reward $R(s, a)$ for taking action $a$ at state $s$, forming an expectation over the future state $s'$, SBEED's goal is to learn the value function $V$ and policy $\pi$ which satisfy the regularized Bellman equation:
$V(s) = \max_{\pi(\cdot \mid s)}\ E_{a \sim \pi(\cdot \mid s)}\left[R(s, a) + \gamma\, E\left[V(s') \mid s, a\right]\right] + \eta\, H(\pi(\cdot \mid s))$
where the entropy $H$ regularizes the optimal policy towards exploring all actions. Given the researcher's choice of the regularization strength $\eta > 0$, the goal is to learn $(V, \pi)$ from finite samples of transitions $(s, a, R, s')$. Importantly, the actions may be sampled from a suboptimal behavior policy which does not equal $\pi$. Starting from the first-order condition of the Bellman equation, Dai et al. [2018] develop an adversarial population objective, whose finite-sample analog is the A-estimator 1.2, 1.3 with loss:
(2.4)   $\ell(\theta, \lambda, X) = \left(R + \gamma V(s') - V(s) - \eta \log \pi(a \mid s)\right) \lambda(s, a) - \tfrac{1}{2}\, \lambda(s, a)^2$
where $\theta = (V, \pi)$, and $V$, $\pi$ and the adversary $\lambda$ are implemented as neural networks in practice.
2.4 A-Estimators for Conditional Moment Restrictions
Another powerful application of A-estimators, recently pursued by the econometric literature, is conditional moment estimation. These methods estimate parameters which are identified by restrictions of the form:
(2.5)   $E\left[g(X, \theta_0) \mid Z\right] = 0$
for some random variables $X, Z$ and a known function $g$. Conditions of this type occur e.g. when estimating some causal effect via instrumental variables, or as the first-order conditions of agents optimizing some expected utility given an information set. As a result, nonparametric conditional moment estimators received considerable interest in econometrics, see e.g. Ai and Chen [2003, 2007], Chen and Qiu [2016]. These earlier estimators rely on first-step estimates of nuisance parameters capturing the conditional means and variances. Intuitively however, estimating the nuisance parameters via predictive objectives in a separate first step may dedicate scarce model capacity to capturing features which are not useful for the purpose of estimating $\theta$ downstream. This motivates recent work on adversarial objectives which unify the estimation into a single objective, more plausibly targeting the nuisance estimation towards the goal of identifying $\theta$. Specifically, we will examine the estimator of Dikkala et al. [2020], yielding the finite sample objective
$\hat\theta \in \arg\min_{\theta \in \Theta_n}\ \max_{f \in \mathcal{F}_n}\ E_n\left[g(X, \theta) f(Z) - f(Z)^2\right]$
where $\mathcal{F}_n$ is a class of neural networks. The methods proposed by Bennett et al. [2019a], Bennett and Kallus [2020] are closely related, but differ in the penalty they impose on $f$. Dikkala et al. [2020] consider the case of instrumental variable regression, where $g(X, \theta) = Y - \theta(W)$ for an outcome $Y$, endogenous regressor $W$ and instrument $Z$, but we will examine the general case. We note that Example 2.3 (SBEED, Dai et al. [2018]) can be viewed as a special case of a re-scaled version of this objective, with the conditioning variable given by the state-action pair $(s, a)$, although both literatures seem to be unaware of their connection. One can analytically solve for the optimal adversary to rewrite the population objective as:
$\min_{\theta}\ \tfrac{1}{4}\, E\left[E\left[g(X, \theta) \mid Z\right]^2\right]$
which can be understood as a measure of distance between $E[g(X, \theta) \mid Z]$ and zero, which clearly attains its minimum at $\theta = \theta_0$, when 2.5 holds. In Section 4.4, we will apply our theory to derive the asymptotic distribution of this estimator and show that it is in fact inefficient. We further discuss how the adversarial formulation of GMM can directly inform a simple modification similar to Bennett and Kallus [2020] which yields an efficient A-estimator.
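The concentration step can be checked on a discrete example (our own illustration; the distributions and $g$-values are arbitrary assumptions). Assuming the Dikkala et al.-style population objective $\sup_f E[g(X, \theta) f(Z) - f(Z)^2]$, the inner supremum is pointwise in $Z$, attained at $f^*(z) = \frac{1}{2} E[g \mid Z = z]$, with value $\frac{1}{4} E[E[g \mid Z]^2]$:

```python
import numpy as np

# Z takes two values; within each Z-cell, g(X, theta) takes two known
# values with equal probability (a stand-in for the conditional law of g).
pz = np.array([0.4, 0.6])                                     # P(Z = z)
g_cond = {0: np.array([1.0, -0.2]), 1: np.array([0.5, 0.1])}  # g-values given Z

cond_mean = np.array([g_cond[z].mean() for z in (0, 1)])  # E[g | Z = z]
f_star = 0.5 * cond_mean                  # pointwise maximizer of E[g f - f^2 | Z]
value = np.sum(pz * (cond_mean * f_star - f_star ** 2))
closed_form = 0.25 * np.sum(pz * cond_mean ** 2)   # (1/4) E[ E[g|Z]^2 ]
print(np.isclose(value, closed_form))  # True

# Any other adversary does strictly worse (pointwise concave quadratic):
f_other = f_star + 0.3
value_other = np.sum(pz * (cond_mean * f_other - f_other ** 2))
```

This makes the interpretation above concrete: the adversary's value is a squared distance between the conditional moment function and zero.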
2.5 Estimating Riesz Representers
Chernozhukov et al. [2020] propose a distinct A-estimator to estimate Riesz representers for structural parameters which can be written as linear functionals $\theta_0 = E\left[m(X, g_0)\right]$. Here, $g_0$ is an unknown function for which an estimate $\hat g$ is available from some first-stage regression of $Y$ on $X$, where $g_0(x) = E[Y \mid X = x]$. Quantities like $\theta_0$ are common in the average treatment effect or asset pricing literature, for example. Unfortunately, especially if $g_0$ is estimated via machine learning, the ‘naive’ estimator
$\hat\theta_{\text{naive}} = E_n\left[m(X, \hat g)\right]$
is often not well behaved: $\sqrt{n}\,(\hat\theta_{\text{naive}} - \theta_0)$ may not converge in distribution to a Gaussian limit and thus one cannot provide confidence intervals around the estimate. Under the conditions of the Riesz representation theorem however, there may exist a function $\alpha_0$ called the Riesz representer of the functional $g \mapsto E[m(X, g)]$, which satisfies:
$E\left[m(X, g)\right] = E\left[\alpha_0(X)\, g(X)\right] \quad \text{for all } g$
If a well-behaved estimate $\hat\alpha$ of $\alpha_0$ is available, it can be combined with $\hat g$ to define the so-called orthogonalized estimator:
$\hat\theta = E_n\left[m(X, \hat g) + \hat\alpha(X)\left(Y - \hat g(X)\right)\right]$
which attains asymptotic normality under rather weak conditions on $\hat g$ and $\hat\alpha$ (see Lemma 17 of Chernozhukov et al. [2020]). Chernozhukov et al. [2020] propose a generalized procedure to estimate $\alpha_0$ via an A-estimator, which we will simplify as follows:
$\hat\alpha \in \arg\min_{\alpha \in \mathcal{A}_n}\ \max_{f \in \mathcal{F}_n}\ E_n\left[m(X, f) - \alpha(X) f(X) - f(X)^2\right]$
where $\mathcal{A}_n, \mathcal{F}_n$ are neural networks. To clarify why this objective works, it is useful to analytically solve the adversarial component of the corresponding population objective:
$\max_{f}\ E\left[m(X, f) - \alpha(X) f(X) - f(X)^2\right] = \tfrac{1}{4}\, E\left[\left(\alpha_0(X) - \alpha(X)\right)^2\right]$
using the Riesz representation $E[m(X, f)] = E[\alpha_0(X) f(X)]$.
As we will show in Section 4.5, our theory directly yields the convergence rates for $\hat\alpha$ that Chernozhukov et al. [2020]'s Lemma 17 requires for the asymptotic normality of $\hat\theta$. It does so at a reduced curse of dimensionality for rather general function classes, i.e. under weaker conditions on the smoothness and dimension of the data, complementing the original work.
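The defining property of a Riesz representer can be verified on a toy example. The sketch below (a hypothetical illustration of ours, not from the paper) takes binary $X$ with $P(X = 1) = p$ and the contrast functional $\theta(g) = g(1) - g(0)$, whose representer is $\alpha_0(x) = x/p - (1 - x)/(1 - p)$, and checks $E[\alpha_0(X) g(X)] = g(1) - g(0)$ for arbitrary test functions:

```python
import numpy as np

p = 0.3  # P(X = 1); an arbitrary choice
alpha0 = lambda x: x / p - (1 - x) / (1 - p)   # candidate Riesz representer

def functional(g):
    """The target linear functional theta(g) = g(1) - g(0)."""
    return g(1) - g(0)

def riesz_form(g):
    """E[alpha_0(X) g(X)] under P(X=1) = p."""
    return p * alpha0(1) * g(1) + (1 - p) * alpha0(0) * g(0)

g1 = lambda x: 2.0 * x + 1.0
g2 = lambda x: np.cos(x)
checks = [np.isclose(functional(g), riesz_form(g)) for g in (g1, g2)]
print(checks)  # [True, True]
```

In this two-point example $\alpha_0$ is exactly the inverse-propensity weight, which is the form the representer takes in average treatment effect applications.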
3 General Theory
Roadmap.
This section will present our general theory of A-estimators. Subsection 3.1 briefly discusses an alternative definition of A-estimators that may be more natural to some readers. In Subsection 3.2, we establish that A-estimators satisfy the desirable condition of Neyman-orthogonality with respect to the adversary and discuss its implications. Next, we characterize the convergence rates of A-estimators: Subsection 3.3 provides a high-level result for arbitrary sieves such as splines or wavelets, not just neural nets. Under more easily verifiable low-level conditions, Subsection 3.4 provides convergence rates for semiparametric settings involving neural networks, showing they exhibit a reduced curse-of-dimensionality. Finally, we characterize the asymptotic normality of smooth functionals of A-estimators. We again begin with a general, high-level result for arbitrary sieves, followed by the low-level conditions for the normality of neural networks. Notably, we show that a combination of undersmoothing and regularizing towards a convex target space suffices to overcome a key issue for normality proofs of neural networks: their non-convex sieve space.
Notation.
Throughout, we consider a random variable $X$ with support $\mathcal{X}$, distribution $P$ and corresponding expectation operator $E$. We also denote the variance operator by $V[\cdot]$. We denote the sample average, i.e. the expectation under the empirical distribution $\hat P_n$, by $E_n$. Throughout, we will treat estimated parameters as deterministic sequences indexed by $n$, as is common in the literature. We also consider subvectors of $X$, with their respective supports being subspaces of $\mathcal{X}$. We require various norms: throughout, $\|v\|_p$ will denote the $\ell_p$ norm of a finite dimensional vector $v$, with $\|v\|_2$ being the Euclidean norm. For a possibly vector-valued function $h$, we denote its $L_2$ function norm over some subset $\mathcal{X}_0 \subseteq \mathcal{X}$ by $\|h\|_{L_2(\mathcal{X}_0)}$. We denote the supremum norm of a vector by $\|v\|_\infty$, and the supremum norm of $h$ over $\mathcal{X}_0$ by $\|h\|_{\infty, \mathcal{X}_0}$. For $\mathcal{X}_0 = \mathcal{X}$, we may omit the dependence on the domain. We will often write $a \lesssim b$ to denote $a \le C b$, implying that a sufficiently large global constant $C$ exists such that the inequality holds, where $C$ does not depend on any varying aspects of the problem, such as any parameters, sample sizes, et cetera. We write $a \asymp b$ if $a \lesssim b \lesssim a$. We will also write $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Throughout, we will write $\ell(\theta, \lambda)$ for $E[\ell(\theta, \lambda, X)]$ and $\ell_n(\theta, \lambda)$ for $E_n[\ell(\theta, \lambda, X)]$ for short. We denote by $\Pi_n$ a (not necessarily linear) projection onto the respective sieves, i.e. $\Pi_n \theta \in \Theta_n$ for any $\theta \in \Theta$ and $\Pi_n \lambda \in \Lambda_n$ for any $\lambda \in \Lambda$.
3.1 Nash vs Minimax
We presented our preferred definition of A-estimators in the introduction, as satisfying a Nash condition of the empirical loss. All results of this paper will apply to this definition. However, the reader may have noticed that the “simultaneous” Nash condition of the estimator is symmetric in $\theta$ and $\lambda$, unlike the ‘sequential’ mini-max population objective, which nests a family of inner maximizations:
(3.1)   $\min_{\theta \in \Theta}\ E\left[\ell(\theta, \lambda^*(\theta), X)\right], \quad \text{where } \lambda^*(\theta) \in \arg\max_{\lambda \in \Lambda} E\left[\ell(\theta, \lambda, X)\right]$
where the loss, and as a result the solutions $\lambda^*(\theta)$, are indexed by the parameter $\theta$. The reader may therefore wonder if we could define an A-estimator for $\theta_0$ in a similar ‘sequential’ mini-max fashion. That is, we could consider a family of M-estimators approximately maximizing the empirical loss at any value of $\theta$:
(3.2)   $E_n\left[\ell(\theta, \hat\lambda(\theta), X)\right] \ge \sup_{\lambda \in \Lambda_n} E_n\left[\ell(\theta, \lambda, X)\right] - \delta_n \quad \text{for all } \theta \in \Theta_n$
and then look for $\hat\theta$ satisfying:
(3.3)   $E_n\left[\ell(\hat\theta, \hat\lambda(\hat\theta), X)\right] \le \inf_{\theta \in \Theta_n} E_n\left[\ell(\theta, \hat\lambda(\theta), X)\right] + \epsilon_n$
where $\epsilon_n$ again accommodates approximate minimization. Fortunately, it turns out that any $(\hat\theta, \hat\lambda)$ satisfying the more compact Nash condition from the introduction always satisfies the mini-max condition presented above, as summarized by the following Lemma:
Lemma 3.1. Any $(\hat\theta, \hat\lambda)$ satisfying the Nash conditions 1.2, 1.3 also satisfies the mini-max conditions 3.2, 3.3 for some family $\hat\lambda(\cdot)$ with $\hat\lambda(\hat\theta) = \hat\lambda$, up to suitably enlarged tolerances.
Proof.
Pick any $(\hat\theta, \hat\lambda)$ satisfying 1.2 and 1.3 for some $\epsilon_n, \delta_n \ge 0$. Now pick some arbitrary family $\hat\lambda(\theta)$ satisfying 3.2 for all $\theta \ne \hat\theta$, and define $\hat\lambda(\hat\theta) = \hat\lambda$. Note that 1.3 directly implies that this $\hat\lambda(\cdot)$ also satisfies 3.2 at $\hat\theta$. It remains to show that the resulting $\hat\theta$ and $\hat\lambda(\cdot)$ satisfy 3.3:
$E_n\left[\ell(\hat\theta, \hat\lambda(\hat\theta), X)\right] \le \inf_{\theta \in \Theta_n} E_n\left[\ell(\theta, \hat\lambda, X)\right] + \epsilon_n \le \inf_{\theta \in \Theta_n} E_n\left[\ell(\theta, \hat\lambda(\theta), X)\right] + \epsilon_n + \delta_n$
where the first inequality used $\hat\lambda(\hat\theta) = \hat\lambda$ and the Nash condition 1.2, and the second used the fact that $\hat\lambda(\theta)$ was constructed to satisfy 3.2, together with $\hat\lambda \in \Lambda_n$. ∎
This reassures us that it suffices to find one pair of values $(\hat\theta, \hat\lambda)$ which satisfies the Nash condition from the introduction, rather than a continuum of solutions indexed by $\theta$. The final $\hat\theta$ will satisfy the mini-max condition regardless, for some (unknown) family $\hat\lambda(\cdot)$. For our theory, it was crucial to derive the uniform convergence of $\hat\lambda(\theta)$, hence we will state the rate results for the more general mini-max definition. For the normality result, it was more convenient to work with the stronger Nash definition.
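The Lemma can be illustrated on a finite game (our own toy example; the payoff matrix is arbitrary). With $\theta$ indexing rows (minimizer) and $\lambda$ indexing columns (maximizer), a pure Nash entry also solves the sequential mini-max problem:

```python
import numpy as np

# Payoff matrix L[i, j] = empirical loss at (theta_i, lambda_j);
# theta minimizes over rows, lambda maximizes over columns.
L = np.array([[3.0, 1.0],
              [4.0, 2.0]])

# (i, j) = (0, 0) is a pure Nash equilibrium with zero tolerance:
i_hat, j_hat = 0, 0
nash_min = L[i_hat, j_hat] <= L[:, j_hat].min()   # analog of condition 1.2
nash_max = L[i_hat, j_hat] >= L[i_hat, :].max()   # analog of condition 1.3

# Mini-max check: best adversary response per row (the family lambda_hat(theta)
# of condition 3.2), then minimize the concentrated loss over rows (3.3).
best_resp = L.argmax(axis=1)
concentrated = L[np.arange(L.shape[0]), best_resp]   # max_j L[i, j] per row
minimax = L[i_hat, j_hat] <= concentrated.min()
print(nash_min, nash_max, minimax)  # True True True
```

Brute-force enumeration like this also shows why the converse can fail: mini-max solutions exist for every matrix, while pure Nash entries need not.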
3.2 Adversaries are Neyman-Orthogonal
For many A-estimators, one could construct non-adversarial estimators which capture the same population objective. Whenever the adversarial nuisance parameter $\lambda$ is a function, this usually requires a non-parametric first-step estimation of an alternative nuisance parameter. However, such an alternative estimator may not have a desirable property that is guaranteed for A-estimators: Neyman-orthogonality of the objective with respect to the nuisance parameter.
This property has a long history in statistics, dating back at least to Neyman [1959]. It was popularized in econometrics by Chernozhukov et al. [2017] as a key setting in which standard machine learning methods can be applied without invalidating causal inference, which sparked follow-up work such as Chernozhukov et al. [2021] seeking to reformulate non-orthogonal problems as orthogonal ones. The notion applies to parameters which are identified by a moment restriction of the form:
(3.4)   $E\left[m(X, \theta_0, h_0)\right] = 0$
where $m$ is known and $h_0$ is an unknown nuisance parameter which has to be estimated in a first step. A popular estimator in this setting would be Hansen [1982]'s GMM, for example. The moment condition above is called (Neyman-)orthogonal whenever:
(3.5)   $\frac{\partial}{\partial h}\, E\left[m(X, \theta_0, h)\right]\Big|_{h = h_0} = 0$
Intuitively, this states that the condition identifying $\theta_0$ is “locally robust” against perturbations in $h$. This guarantees that the uncertainty introduced by an appropriate first-step estimation of $h_0$ has no first-order effect on the GMM estimator $\hat\theta$. Specifically, the asymptotic distribution of $\hat\theta$ is the same as in the case in which $h_0$ is known. In contrast, when moment restrictions do not satisfy this orthogonality condition, uncertainty about $h_0$ generally amplifies the asymptotic variance of $\hat\theta$, see e.g. Chen and Liao [2015], and normality may break down altogether.
Notably, if (and only if) $\theta$ is parametric, we can examine the first order condition for $\theta$ that is implied by the A-estimation objective 1.1 in this moment restriction framework444Note however that even when $\theta$ is parametric, we usually cannot estimate it via GMM, as the adversary's first order condition will not exist if $\Lambda_n$ is a typical sieve, such as a neural network. For the same reason, the theory developed in this paper must not rely on any finite-sample first order conditions. Instead, it will use only the approximate Nash condition 1.2, 1.3.: let $m(X, \theta, \Lambda) = \frac{\partial}{\partial \theta}\,\ell(\theta, \Lambda(\theta), X)$, where $\Lambda$ denotes the functional evaluating $\theta$ to the adversary's solution, with truth $\lambda^*(\cdot)$. Orthogonality then follows from the continuum of first-order conditions identifying $\lambda^*(\cdot)$:
$\frac{\partial}{\partial \Lambda}\, E\left[m(X, \theta_0, \Lambda)\right]\Big|_{\Lambda = \lambda^*(\cdot)} = \frac{\partial}{\partial \theta}\left\{\frac{\partial}{\partial \lambda}\, E\left[\ell(\theta, \lambda, X)\right]\Big|_{\lambda = \lambda^*(\theta)}\right\}\Big|_{\theta = \theta_0} = 0,$
since the derivative operators are exchangeable and $\frac{\partial}{\partial \lambda} E[\ell(\theta, \lambda, X)]|_{\lambda = \lambda^*(\theta)} = 0$ for every $\theta$. This implies that as $\hat\theta$ approaches $\theta_0$, an A-estimator is robust to estimation errors in the adversary $\hat\lambda$ relative to $\lambda^*(\hat\theta)$, meaning they do not reduce the accuracy of $\hat\theta$ to first order.
Consider the example of Section 2.1, which estimates $\theta$ minimizing the f-Divergence between the model $P_\theta$ and the data $P$. As a non-adversarial alternative, we could re-parametrize the problem and estimate the density of the data via a first-step kernel density estimator, and subsequently approximate the f-Divergence as an average over the implied likelihood ratios. However, the first-order condition for $\theta$ would not satisfy orthogonality with respect to this density estimate, hence a GMM estimator based on this condition may not attain the variance of the analogous GMM estimator using the true density instead. In contrast, the A-estimator of Section 2.1 does attain this variance, due to its orthogonal adversary, which we formally establish in Section 4.1. Moreover, this remains true when generalizing to a setting in which $\theta$ contains unknown functions, where no analogous GMM estimator exists that could capture the continuum of first-order conditions in $\theta$.
3.3 Convergence Rate of A-Estimators
We begin with a general theorem characterizing the convergence rates of sieve A-estimators, for arbitrary loss functions and parameter spaces. It can be viewed as a generalization of Shen and Wong [1994]'s M-estimator result. Its proof is provided in Appendix C.1, with the main challenge being that Shen and Wong [1994]'s chaining arguments need to be carefully modified to hold uniformly over $\theta$. Our theorem adopts a more compact formulation than Shen and Wong [1994] which does not require any norm over the parameter spaces to state our assumptions, although convergence rates are obtained for any (pseudo-)norm which is dominated by the objective.
Theorem 3.1 (Convergence Rate of A-Estimators).
Assume that:
• C1: The criterion variance is bounded by a power of its expectation, as formalized in 3.6 and 3.7, for all parameters for which the right hand sides are less than some constant.
• C2: The metric entropy of the joint sieve space $\Theta_n \times \Lambda_n$ satisfies the bound B.3.
Then the following conclusions hold.
i) The population objective converges at the rate given in 3.9, and the adversary converges uniformly over $\Theta_n$ as in 3.10.
ii) Hence, convergence holds for any (pseudo-)norm under which the parameter space is compact and the objective continuous. If the objective additionally dominates a power of the norm locally around $\theta_0$, the same rate transfers to that norm.
Remark 3.1 (Discussion of Assumptions).
The theorem extends Shen and Wong [1994]'s convergence rate result for sieve M-estimators to A-estimators. There is a direct mapping between our assumptions and theirs: our C1 combines their assumptions C1 and C2, and our C2 corresponds to their C3. Our proof in Appendix C.1 is structured in the same way as that of Shen and Wong [1994], although we need to modify their Lemmas to obtain the uniform convergence of the adversary in 3.10, which is crucial to the main result 3.9. The key modifications to our assumptions which allow us to do so are: C1) the constant factor implicit in the “$\lesssim$” relation of 3.7 must not depend on the parameters, as implied by the definition of “$\lesssim$” at the beginning of this section, and C2) the complexity of the joint sieve space $\Theta_n \times \Lambda_n$ must satisfy the entropy bound. Otherwise, the assumptions are conceptually the same and we refer the reader to Shen and Wong [1994] for a more detailed discussion.
Remark 3.2.
Using similar arguments as ours, one may establish the uniform convergence of A-estimators over a third parameter space, generalizing the setting to arbitrary finite sequences of alternating minimizations and maximizations over different parameter spaces. This would yield convergence rates towards more general Stackelberg equilibria in so-called empirical games, for which we are currently only aware of a consistency result by Tuyls et al. [2018].
3.4 Semiparametric Rates with Neural Networks
Next, we will apply the general result of the previous section to derive the convergence rates for neural network A-estimators. For generality, we will consider the semiparametric setting in which $\theta$ and $\lambda$ may contain both Euclidean vectors and functions. These lower-level conditions are easy to verify in practice, but are general enough to apply to all estimators considered in Section 2. We will include the proof as it is short and an instructive application of Theorem 3.1. The theorem allows for two types of function classes, both of which can be viewed as generalizations of traditional Hölder functions with $d$-dimensional domain, each with its own notion of an intrinsic dimension, which may be smaller than that of the data. As we will review in Remark 3.5, we observe that neural networks achieve a reduced curse of dimensionality in these settings.
Theorem 3.2 (Semiparametric Rates with Neural Networks).
Consider the semiparametric setting in which and , where are subsets of some Euclidean spaces and are some function spaces. Let be compact under . For all , assume the following conditions hold:
- A1:
- A2:
- A3:
Pick any two values . Consider the A-estimator 3.2 with where and implement neural networks (cf. Definition 2) satisfying and for any large enough choice of . For A0a) choose and for A0b) choose . Then:
Hence, for any (pseudo-)norm under which is compact and continuous. Further, if for , we get:
Proof.
We will verify the conditions of Theorem 3.1. A2 and A3 imply C1 (3.6 and 3.7) with . Lipschitzness A1 together with Lemma B.1 imply C2 (B.3) with for any and . Therefore Theorem 3.1 applies with , which dominates and by assumption. We are therefore left with bounding and . By A3, we can bound for any . In the case of A0a), we set and use Lemma B.2 to obtain which yields . For A0b), Lemma B.3 yields the same bound as Lemma B.2, but only over a subset with for some arbitrarily large constant , which only affects the constant in the bound on . Hence we conclude that . Analogous arguments yield the same bound for . ∎
Remark 3.4 (Discussion of Assumptions).
A0 defines the function classes addressed by the Theorem. Both are generalizations of traditional Hölder classes which arise for , see Remark 3.5. Condition A1 requires the loss to be Lipschitz in both parameters, which simplifies (but is not necessary for) the verification of C2. Condition A2 (and analogously A3) consists of two parts. First, it states that for a given parameter, the variance of the criterion difference must be bounded by its expectation, a simplified version of Assumption C1 of Theorem 3.1 which happens to be satisfied in all of our examples; versions of this Theorem with can be derived via the same steps as the proof above. The second part of the condition bounds the expected loss by a squared -norm over any subset of the function domain . For the case of A0a), it would have sufficed to state the condition with only, but for A0b) we require arbitrary subsets to apply the approximation result of Lemma B.3. A2 is implied, for example, by for some and Lipschitz map . The assumption is significantly weaker than that of Shen et al. [2019] or Farrell et al. [2018], who impose , which would not hold for Examples 2.2 or 2.4. It could be generalized further to allow for arbitrary powers of the -norms (and proved in the same way via Theorem 3.1), but the squares arise rather universally via Taylor expansions.
Remark 3.5.
Theorem 3.2 clarifies that neural networks do not necessarily exhibit the curse of dimensionality, as the lower bound on does not depend on the dimension of the data. Instead, what matters is the intrinsic dimension of the target function. In the setting A0a), introduced by Nakada and Imaizumi [2020], refers to the Minkowski dimension of the manifold which supports the data. It has been observed that for many high-dimensional types of data: intuitively, is low whenever there is strong statistical dependency between the individual dimensions of the data. Examples include the characteristics of physical products, images and natural language. In the setting A0b), introduced by Bauer et al. [2019], refers to the order of a generalized hierarchical interaction model. It is common for structural models in e.g. economics or optimal control to suggest that an unknown function is hierarchically composed of some finite number of individual functions which only depend on inputs at a time. The result underscores that neural networks can adaptively (that is, without the researcher modifying the estimation procedure) exploit structures in the target function, allowing them to model the relationships more efficiently than standard convergence results suggest.
3.5 Asymptotic Normality of A-Estimators
In applications, we are often interested in estimating a quantity of the form , where is some known functional. To derive confidence intervals around the plug-in estimate , we need its asymptotic distribution. To this end, we present Theorem 3.3, which can roughly be viewed as a generalization of Shen [1997] to A-estimators. For this section, we make use of the pathwise derivative presented in Definition 7. We require a particular inner product over the space :
As discussed in Definition 7, the notation implicitly assumes that the corresponding limit exists and is linear in . For short, we write , and .
Theorem 3.3 (General Normality of A-Estimators).
Consider the estimators satisfying the Nash conditions 1.2 and 1.3. Fix a sequence . Assume is smooth enough and converge fast enough such that a Riesz representer exists, satisfying:
(3.11)
where and are the shrinking neighborhoods defined in Lemma B.5. For , define the local perturbations and and assume:
- CONDITION N1: Stochastic Equicontinuity
- CONDITION N2: Population Criterion Difference
- CONDITION N3: Approximation Error
Remark 3.6 (Discussion of Assumptions).
In contrast to our convergence rate result, our proof requires the A-estimator to satisfy the (stronger) Nash condition from the introduction. Our conditions N1-3 are analogues of Shen [1997]’s and play the same roles in our proof. N1 combines their assumptions A and D, N2 corresponds to their B, and N3 to their C. Shen [1997]’s high-level discussion of their assumptions therefore applies to ours as well, and we again refer the reader there for additional context. The main difference is that their conditions are formulated to control the remainder of a second order Taylor expansion, whereas we look at the convergence of the first derivative, which results in requirements for N1 and N2, rather than the found in Shen [1997]’s conditions A and B.
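The practical content of the normality result can be checked by simulation in a toy case (our own construction, not from the cited works): with the loss l(theta, lam; x) = lam*(x - theta) - lam**2/2, the A-estimator's saddle point is exactly the sample mean, so for the functional f(theta) = theta the standardized plug-in estimate should be approximately standard normal across Monte Carlo replications.

```python
import numpy as np

# Toy Monte Carlo check (our own construction): the A-estimator for the
# loss lam*(x - theta) - lam**2/2 equals the sample mean, so for the
# functional f(theta) = theta the standardized plug-in estimate
# sqrt(n)*(theta_hat - theta0)/sigma should be approximately N(0, 1).
rng = np.random.default_rng(0)
theta0, sigma, n = 2.0, 1.0, 400
z = []
for _ in range(500):
    x = rng.normal(theta0, sigma, n)
    theta_hat = x.mean()                                # saddle point in the toy case
    z.append(np.sqrt(n) * (theta_hat - theta0) / sigma)
z = np.asarray(z)
print(z.mean(), z.std())  # both close to the N(0, 1) values 0 and 1
```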
Remark 3.7.
Condition N3 is a version of a known condition on approximation error in M-estimation settings (see Condition C4 in Shen et al. [2019] and Condition C in Shen and Wong [1994]). Its verification usually exploits convexity of , such that . This holds for series or kernel based estimators, but not neural networks. Shen et al. [2019] therefore leave it as an explicit assumption, concluding that it is unclear how to verify it for neural networks. In Theorem 3.4, we resolve this issue, showing that N3 can be verified for non-convex sieves such as neural networks by adhering to two simple implementation choices: 1) undersmoothing, i.e. choosing a sieve which grows faster than rate-optimal, achieving an approximation error of and 2) regularizing the sieves towards the convex target classes containing .
3.6 Semiparametric Normality with Neural Networks
Next, we present Theorem 3.4, which strengthens the assumptions of our previous neural network convergence rate result (Theorem 3.2) in a way that allows us to derive the asymptotic normality of functionals via Theorem 3.3. A crucial innovation is that we are able to work around the non-convexity issues of deep neural networks discussed in Remark 3.7, to obtain a normality result from low-level conditions, which only consist of general properties that the loss function must satisfy (A4-A7), and certain implementation choices for the neural networks that must be followed. To the best of our knowledge, the theorem therefore also provides the first low-level conditions for the normality of smooth functionals of deep neural network M-estimators (as the special case where is singleton).
Theorem 3.4 (Semiparametric Normality with Neural Networks).
Let all assumptions of Theorem 3.2 be satisfied with , and choose . Let be convex and lie in their interior. Replace the neural network sieves with the following regularized versions:
for any which is small enough to guarantee that are nonempty. Further, for all , assume:
- A4: Lipschitz Derivative:
- A5: The perturbations are smooth: ,
- A6: The Taylor remainders vanish with the loss:
  - i)
  - ii)
- A7: For non-Donsker classes, the variance of the derivatives is bounded by the loss:
  - i) or is -Donsker
  - ii) or is -Donsker
If satisfy the Nash condition 1.2,1.3 with , then:
Remark 3.8 (Discussion of Assumptions).
The Theorem requires that the neural network sieves are implemented to undersmooth (i.e. grow faster than the rate-optimal sieve would) via the condition on , while being regularized towards the convex target spaces . Note that this does not affect the sieve’s approximation power towards these spaces, and there always exists an for which are non-empty due to their approximation rates. While in principle just an implementation choice, the current -norm regularization is arguably not practical and future work may be able to clarify whether e.g. an appropriate L2 penalty on the weights suffices. Conditions A4-A7 are general conditions on the loss function which can be satisfied in all our examples. A4 is a simple Lipschitz condition analogous to A1. The smoothness of the Riesz representer (A5) is most easily verified by computing and examining a given directly, although the Riesz representation theorem can provide general conditions under which lives in the same space as . A6 is a standard condition controlling the Taylor remainder. For a discussion, see e.g. Assumptions 4.5 in Ai and Chen [2003] and Ai and Chen [2007], or Assumption 3.5ii) in Chen and Pouzo [2015]. Whether it holds depends on how non-linear the objective is: e.g. for the quadratic objective of Dikkala et al. [2020], the left-hand side is zero. A7 serves to control the empirical process (N1). It can be easily satisfied either by bounding the variances of the derivatives, or by relying on the Donsker property of the target space (cf. Remark 3.9).
Remark 3.9.
Note that the Donsker property and thus A7 always holds if , where standard results using bracketing number bounds imply that the Hölder spaces satisfy the Donsker property. We conjecture that this analogously holds for our lower-dimensional classes A0a) and A0b) whenever , which would make the verification of A7 unnecessary in general, since we require . Verifying this conjecture is beyond the scope of this paper however, hence we provide A7 as an explicit assumption for maximum flexibility.
4 Application to Examples
4.1 Minimum -Divergence
Applying our general Theorem 3.2 to the estimator of Section 2.1 yields Proposition 4.1, which provides the convergence rate of semiparametric if and all unknown functions in are approximated by classes of neural networks.
Proposition 4.1.
Remark 4.1.
In general, the convergence rate of is faster the slower the neural network grows. However, the growth must be fast enough to control the approximation error of the sieves relative to the target function classes . This lower bound depends on the ratio of the smoothness of the target classes and to their intrinsic dimensions and , which may be smaller than the dimension of the data , in which case -GANs attain a reduced curse of dimensionality relative to traditional nonparametric density estimators.
Remark 4.2.
This convergence rate result stands in contrast to Arora et al. [2017], who argued that Generative Adversarial Networks do not generalize with respect to the metric given by the population objective, only under a weaker “neural net distance” which they introduce. The convergence rate result above clarifies that the broad class of -GANs in fact does converge quickly under population divergence.
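A self-contained numerical sketch can make this concrete. The example below is our own construction, not code from the cited works: it estimates KL(p‖q) between p = N(0.5, 1) and q = N(0, 1) via the f-GAN-style variational bound sup_T E_p[T] − E_q[exp(T − 1)], whose true value is 0.5²/2 = 0.125. Because the two Gaussians share their variance, the optimal critic T = 1 + log(p/q) is affine, so a linear discriminator T(x) = a + bx is well-specified.

```python
import numpy as np

# f-GAN-style variational estimate of KL(p || q) for p = N(0.5, 1), q = N(0, 1).
# True KL = 0.5**2 / 2 = 0.125. We run gradient ascent on the concave
# empirical objective  mean_p[T] - mean_q[exp(T - 1)]  over T(x) = a + b*x.
rng = np.random.default_rng(0)
xp = rng.normal(0.5, 1.0, 4000)   # samples from p
xq = rng.normal(0.0, 1.0, 4000)   # samples from q

a, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    e = np.exp(a + b * xq - 1.0)
    a += lr * (1.0 - e.mean())             # gradient in the intercept
    b += lr * (xp.mean() - (xq * e).mean())  # gradient in the slope
val = (a + b * xp).mean() - np.exp(a + b * xq - 1.0).mean()
print(val)  # close to the true divergence 0.125
```

The fitted slope approaches the population value b = 0.5, the log-density-ratio coefficient, illustrating how the adversary learns which aspect of the data to emphasize.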
While a fast convergence rate of the model distribution is a key goal in semi- and nonparametric estimation, whenever some function of the estimate informs downstream decision-making, we are often interested in obtaining confidence intervals around . To this end, we derive the asymptotic normality of the adversarial -Divergence objective, an entirely novel result at this level of generality to the best of our knowledge. First, we compute the inner product defined in Section 3.5, which can be expressed concisely:
where is a pathwise derivative of the Radon-Nikodym derivative. Conditions under which the normality result of Section 3.5 applies are presented in Proposition 4.2.
Proposition 4.2.
Consider a functional for which a Riesz representer exists satisfying 3.11 with defined above. Let all assumptions of Theorem 4.1 be satisfied for and assume that is Donsker. Let be convex, let lie in their interior, and let them contain . Assume the Lipschitz condition and let be Lipschitz. Pick and regularize as in Theorem 3.4. Finally, for any on a path between and , assume that:
Then:
(4.1) |
Remark 4.3.
In applications, the key difficulty lies in verifying that the third derivative above is bounded by the loss. This condition serves to control the higher order term of the Taylor expansion. Such assumptions are common in the semiparametric literature, e.g. Ai and Chen [2003]’s Assumptions 4.5 and 4.6 play the same role. It is easiest to verify in the parametric setting, where
Note that and hence the asymptotics of are independent of , so the -divergences are asymptotically equivalent. An example of a smooth functional of particular interest in the semiparametric setting is , which “picks out” a linear combination of the parametric components. This allows us to derive the asymptotic normality of the vector in the following Corollary, which makes use of the orthogonal scores assumption that is standard in the semiparametric literature.
Corollary 4.2.1.
In addition to the assumptions of Proposition 4.2, assume the orthogonal scores condition holds:
Then the parametric component attains the Cramér-Rao bound:
Proof.
We simply choose , such that . Since , Proposition 4.2 yields . The result then follows via the Cramér-Wold device. ∎
The -GAN objective therefore attains the efficient asymptotics of maximum likelihood, but does not require explicit knowledge of the model density .
4.2 Generalized Empirical Likelihood
For the class of Generalized Empirical Likelihood estimators introduced in Section 2.2, the -normality and asymptotic efficiency of is long established in the parametric case (Imbens et al. [1998], Imbens [2002], Newey and Smith [2004]). However, our theoretical framework allows us to extend the known results to the semiparametric case in which may contain unknown functions, approximated by a class of neural networks which may grow with . In this case, we can characterize the convergence rate of to the identified set , which is unlikely to be a singleton given that an infinite-dimensional parameter is hardly pinned down by a finite number of unconditional moment restrictions. We obtain the following result:
Proposition 4.3.
Let and consider the A-estimator satisfying 1.2,1.3 with . Let be as in Theorem 3.2 and , with . Assume that and . Then:
Proof.
We verify the conditions of Theorem 3.2. Assumption A0 holds by assumption, and A1 follows from the Lipschitzness of and that of . To verify Assumption A2, note that and boundedness of imply:
For the second part of condition A2, simply verify that , which follows by applying the Lipschitzness of in and the tower-property of to , akin to the proof of 4.1. Assumption A3 can be verified for the Euclidean via a Taylor expansion, yielding: . ∎
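To illustrate the GEL saddle structure numerically, consider a deliberately simple sketch. It is our own construction for illustration: a single unconditional moment g(x, theta) = x - theta with the exponential-tilting carrier rho(v) = -exp(v), solved by brute-force grid search. In this exactly-identified case the estimator recovers the sample mean.

```python
import numpy as np

# Toy GEL saddle problem (our construction): theta minimizes the adversary's
# maximized criterion  max_lam  mean(-exp(lam * (x - theta))).
# At theta = sample mean, the adversary's optimum is lam = 0 with value -1;
# any other theta lets the adversary push the criterion above -1.
rng = np.random.default_rng(1)
x = rng.normal(1.5, 1.0, 500)

def inner_max(theta, lams):
    g = x - theta                          # moment evaluated at theta
    return max(-np.exp(l * g).mean() for l in lams)

lams = np.linspace(-1.0, 1.0, 101)         # crude grid for the adversary
thetas = np.linspace(0.0, 3.0, 151)        # crude grid for the parameter
theta_hat = min(thetas, key=lambda t: inner_max(t, lams))
print(abs(theta_hat - x.mean()))  # small: GEL recovers the sample mean
```

With more moments than parameters the inner maximization instead reweights the moments, which is where the efficiency properties discussed above come from.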
4.3 Off-Policy Reinforcement Learning
Next, we use our theory to extend the known results about SBEED, the off-policy RL algorithm of Dai et al. [2018] introduced in Section 2.3. Theorem 3.2 makes it easy to obtain the convergence rates of the corresponding A-estimator:
Proposition 4.4.
Consider the A-estimator satisfying 1.2,1.3 with as in 2.4. Assume the observations are iid for simplicity, and that and , where satisfy A0 in Theorem 3.2 with . Let , with compact under and path-connected. Let be continuous. Let the parametrizations satisfy the Lipschitz conditions and . Let the neural network classes be constructed as in Theorem 3.2, for any . Then for any :
Remark 4.4.
In contrast to the original work, our result also applies in the case where and are continuous, and we characterize the growth rate of the neural network function approximators which optimally trades off bias and variance. While following almost trivially from the general Theorem 3.2, our result yields significantly faster convergence rates than those obtained by Dai et al. [2018], and our rates further exhibit the reduced curse of dimensionality of neural networks.
Remark 4.5.
We noticed that SBEED can be viewed as a special case of some of the econometric conditional moment estimators treated in Example 2.4, such as Dikkala et al. [2020]. We therefore refer the reader to Section 4.4 for an application of our asymptotic normality result. Interestingly, neither literature seems to be aware of this connection. Dai et al. [2018] cite convex conjugation and the interchangeability principle as the inspiration for their objective, whereas the adversarial conditional moment estimators in econometrics were inspired by Hansen [1982]’s Generalized Method of Moments.
4.4 A-Estimators for Conditional Moment Restrictions
We will now apply our theory to examine the asymptotic behavior of the conditional moment estimator of Dikkala et al. [2020], introduced in Section 2.4. We can apply Theorem 3.2 to obtain the rate at which converges:
Proposition 4.5.
Proof.
Condition A0 is satisfied by assumption, and A1 follows from Lipschitzness of and boundedness. Assumptions A2 and A3 can be verified by using boundedness to establish
∎
Remark 4.6.
Note that just like in the previous Example 2.2, this result does not require the parameter to be identified by the restriction 2.5. If it is, however, the above rates can be translated into similar rates in any norm which is dominated by the objective, usually by construction. See Ai and Chen [2003] for an example of such a norm in the semi-parametric setting.
Remark 4.7.
In contrast to Dikkala et al. [2020], our convergence rate result allows for general, possibly vector-valued, semiparametric in which unknown functions are approximated by neural networks. Our rates also exhibit the reduced curse of dimensionality of neural networks.
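A minimal sanity check of the minimax conditional-moment objective can be carried out in closed form. The sketch below is our own linear-sieve construction, not the authors' code: when the model g and the adversary f range over the same linear class, the objective mean(f*(y - g)) - mean(f**2) is quadratic in the adversary, the outer minimization reduces to least squares, and the Nash property can be verified directly.

```python
import numpy as np

# Quadratic minimax objective  mean(f * (y - g)) - mean(f**2)
# with g = Phi @ beta and f = Phi @ lam over a shared linear sieve.
rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, n)
Phi = np.column_stack([np.ones(n), x])      # shared sieve basis

def objective(beta, lam):
    f, g = Phi @ lam, Phi @ beta
    return np.mean(f * (y - g)) - np.mean(f ** 2)

# Closed-form saddle: the outer minimization reduces to OLS of y on Phi,
# at which point the adversary's best response is lam* = 0.
beta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
lam_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ (y - Phi @ beta_hat)) / 2.0

# Nash check: no adversary direction improves on lam_star at beta_hat.
lam_try = rng.normal(size=(50, 2))
best_dev = max(objective(beta_hat, l) for l in lam_try)
print(best_dev <= objective(beta_hat, lam_star) + 1e-10)  # True
```

That the saddle point collapses to an unweighted projection, with no inverse-variance weighting, is exactly the source of the inefficiency discussed next.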
Next, we use Theorem 3.4 to derive the asymptotic variance of the estimator, showing that it is inefficient in general. For this purpose, it suffices to consider the simpler parametric setting.
Proposition 4.6.
Consider the parametric case where is Euclidean. In addition to the assumptions of Proposition 4.5, assume that the identification condition 2.5 holds. Let be bounded and satisfy the Lipschitz condition . Assume that is three times differentiable in . For all , let for a satisfying A0 with , let lie in the respective interiors of , and let for any . Let be regularized as in Theorem 3.4. Then:
where .
Chamberlain [1987] derived the efficiency bound for the parametric conditional moment setting, corresponding to the smallest (in a p.s.d. sense) -asymptotic variance for any unbiased estimator. It is given by the covariance matrix:
Note that in general, implying that is an inefficient estimator. By extension, this also applies to the Reinforcement Learning algorithm of Example 2.3. Comparing the GMM objective of Example 2.2 - which is known to be efficient in the unconditional moment setting - to the population objective of the present example, this may be unsurprising: in contrast to GMM, the population objective of Dikkala et al. [2020] corresponds to a regular norm, without the inverse covariance weighting which is crucial for asymptotic efficiency in the unconditional case. Generalizing GEL to the conditional moment setting by replacing the constant adversary with a neural network , Metzger [2022] therefore proposes the A-estimator given by:
which nests a simplified variant of Bennett and Kallus [2020] for , and for can be viewed as an alternative to the kernel approach of Kitamura et al. [2004]. Metzger [2022] provides an information-theoretic foundation similar to that of the GEL estimator and, building on the theory developed in the present paper, derives the convergence rates and asymptotic efficiency of this estimator, where may contain unknown functions which are modeled as neural networks.
4.5 Estimating Riesz Representers
Finally, we show that Theorem 3.2 can be used to quickly derive the convergence rates of Chernozhukov et al. [2020]’s adversarial estimator for Riesz representers, which we introduced in Section 2.5.
Proposition 4.7.
Let be as in Theorem 3.2. Let be Lipschitz in . Let the support of be bounded. Then, for any :
This result clarifies that the Riesz representer estimator of Chernozhukov et al. [2020] can similarly benefit from the adaptivity properties of neural networks, which yield faster rates for our target classes if . In combination with their Lemma 17, this implies that compared to other non-parametric sieves, neural networks guarantee the asymptotic normality of the orthogonalized estimator under weaker conditions on smoothness and . Since the normality of is of primary interest and already follows from Chernozhukov et al. [2020]’s Lemma 17 given our convergence rates, we refrain from deriving it for arbitrary functionals , although it would be possible to use Theorem 3.4 to derive for some , for example.
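The adversarial Riesz estimation idea can be sketched in a few lines. The example is our own toy construction under assumed choices (functional, sieve, and loss form are illustrative, not from the cited work): for the functional theta(g) = E[X * g(X)] the Riesz representer is a0(x) = x, and with both players on the linear sieve [1, x], simultaneous gradient descent/ascent on the empirical loss mean(m(W, f) - a(X)*f(X)) - mean(f(X)**2) recovers it.

```python
import numpy as np

# Toy adversarial Riesz-representer sketch for theta(g) = E[X * g(X)],
# whose representer is a0(x) = x. Empirical loss:
#   mean(x * f(x) - a(x) * f(x)) - mean(f(x)**2),
# minimized over a = Phi @ gamma, maximized over f = Phi @ lam.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 2000)
Phi = np.column_stack([np.ones_like(x), x])

gamma = np.zeros(2)   # coefficients of the representer estimate a
lam = np.zeros(2)     # coefficients of the adversary f
lr = 0.1
for _ in range(3000):
    f = Phi @ lam
    a = Phi @ gamma
    grad_gamma = -Phi.T @ f / len(x)             # descent direction for a
    grad_lam = Phi.T @ (x - a - 2.0 * f) / len(x)  # ascent direction for f
    gamma -= lr * grad_gamma
    lam += lr * grad_lam

print(np.round(gamma, 2))  # approximately [0., 1.], i.e. a(x) = x
```

At the saddle point the adversary's play is again zero and the estimated a is the least-squares projection of the representer onto the sieve, matching the logic of Proposition 4.7.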
5 Conclusion
We characterize the general class of adversarial estimators (‘A-estimators’), subsuming many estimators independently proposed in the fields of econometrics and machine learning. Our unified framework suggests interesting commonalities between A-estimators: their adversary is always Neyman-orthogonal with respect to the main model, guaranteeing that its estimation errors have no first-order asymptotic impact on the estimated model. Most objectives have versions which asymptotically minimize an -divergence criterion and are asymptotically efficient. Typically, A-estimators adaptively learn how to optimally emphasize the restrictions implied by the researcher’s estimation assumptions, performing particularly well when this set is large. This makes them a promising framework for incorporating machine learning methods into causal inference, where even simple target parameters often satisfy a continuum of restrictions. We characterize the convergence rates of A-estimators, as well as the asymptotic normality of smooth functionals of their parameters. We also provide low-level analogues of these results for semi-parametric models, in which unknown functions are approximated by deep neural networks. Our convergence and normality results also extend the theory of neural network M-estimators as a special case: building on recent results in approximation theory, our neural network convergence rates exhibit a reduced curse of dimensionality for more general losses than previously examined, and hold uniformly over a second parameter space. Our normality result overcomes a problem previously posed by the non-convexity of neural network sieves, showing that a particular regularization, combined with under-smoothing, can be used to satisfy a strong, high-level approximation error condition which the literature had hitherto left unverified.
References
- Ai and Chen [2003] Chunrong Ai and Xiaohong Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003. doi: https://doi.org/10.1111/1468-0262.00470. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00470.
- Ai and Chen [2007] Chunrong Ai and Xiaohong Chen. Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics, 141(1):5–43, 2007. URL https://EconPapers.repec.org/RePEc:eee:econom:v:141:y:2007:i:1:p:5-43.
- Angrist and Pischke [2010] Joshua D. Angrist and Jörn-Steffen Pischke. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3–30, June 2010. doi: 10.1257/jep.24.2.3. URL https://www.aeaweb.org/articles?id=10.1257/jep.24.2.3.
- Arora et al. [2017] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/arora17a.html.
- Athey and Wager [2021] Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021. doi: https://doi.org/10.3982/ECTA15732. URL https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA15732.
- Athey et al. [2019] Susan Athey, Guido W Imbens, Jonas Metzger, and Evan M Munro. Using wasserstein generative adversarial networks for the design of monte carlo simulations. Technical report, National Bureau of Economic Research, 2019.
- Bauer et al. [2019] Benedikt Bauer, Michael Kohler, et al. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Annals of Statistics, 47(4):2261–2285, 2019.
- Belomestny et al. [2021] Denis Belomestny, Eric Moulines, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Rates of convergence for density estimation with gans. arXiv preprint arXiv:2102.00199, 2021.
- Bennett and Kallus [2020] Andrew Bennett and Nathan Kallus. The variational method of moments. CoRR, abs/2012.09422, 2020. URL https://arxiv.org/abs/2012.09422.
- Bennett et al. [2019a] Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In NeurIPS, 2019a.
- Bennett et al. [2019b] Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper/2019/file/15d185eaa7c954e77f5343d941e25fbd-Paper.pdf.
- Chamberlain [1987] Gary Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.
- Chen et al. [2020] Minshuo Chen, Wenjing Liao, Hongyuan Zha, and Tuo Zhao. Statistical guarantees of generative adversarial networks for distribution estimation, 2020.
- Chen and Liao [2015] Xiaohong Chen and Zhipeng Liao. Sieve semiparametric two-step gmm under weak dependence. Journal of Econometrics, 189(1):163–186, 2015. ISSN 0304-4076. doi: https://doi.org/10.1016/j.jeconom.2015.07.001. URL https://www.sciencedirect.com/science/article/pii/S0304407615002031.
- Chen and Pouzo [2015] Xiaohong Chen and Demian Pouzo. Sieve wald and qlr inferences on semi/nonparametric conditional moment models. Econometrica, 2015.
- Chen and Qiu [2016] Xiaohong Chen and Yin Jia Qiu. Methods for nonparametric and semiparametric regressions with endogeneity: a gentle guide. Cowles Foundation Discussion Papers 2032, Cowles Foundation for Research in Economics, Yale University, 2016. URL https://EconPapers.repec.org/RePEc:cwl:cwldpp:2032.
- Chen et al. [2003] Xiaohong Chen, Oliver Linton, and Ingrid Van Keilegom. Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608, 2003.
- Chen et al. [2019] Xiaohong Chen, Demian Pouzo, and James L. Powell. Penalized sieve gel for weighted average derivatives of nonparametric quantile iv regressions. Journal of Econometrics, 2019.
- Chernozhukov et al. [2017] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double/debiased/neyman machine learning of treatment effects. The American Economic Review, 107:261–265, 2017.
- Chernozhukov et al. [2020] Victor Chernozhukov, Whitney Newey, Rahul Singh, and Vasilis Syrgkanis. Adversarial estimation of riesz representers, 2020.
- Chernozhukov et al. [2021] Victor Chernozhukov, Whitney Newey, Victor Quintas-Martinez, and Vasilis Syrgkanis. Automatic debiased machine learning via neural nets for generalized linear regression. 2021.
- Cosslett [1981] Stephen Cosslett. Maximum likelihood estimator for choice-based samples. Econometrica, 49:1289–1316, 02 1981. doi: 10.2307/1912755.
- Cotter et al. [2019] Andrew Cotter, Heinrich Jiang, and Karthik Sridharan. Two-player games for efficient non-convex constrained optimization. In Aurélien Garivier and Satyen Kale, editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 300–332. PMLR, 22–24 Mar 2019. URL https://proceedings.mlr.press/v98/cotter19a.html.
- Dai et al. [2018] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. CoRR, abs/1712.10285, 2018. URL http://arxiv.org/abs/1712.10285.
- Dikkala et al. [2020] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimation of conditional moment models. arXiv preprint arXiv:2006.07201, 2020.
- Farrell et al. [2018] Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. arXiv preprint arXiv:1809.09953, 2018.
- Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Grenander [1981] Ulf Grenander. Abstract inference. Technical report, 1981.
- Hansen [1982] Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054, 1982. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/1912775.
- Hansen et al. [1996] Lars Peter Hansen, John Heaton, and Amir Yaron. Finite-sample properties of some alternative gmm estimators. Journal of Business & Economic Statistics, 14(3):262–280, 1996. ISSN 07350015. URL http://www.jstor.org/stable/1392442.
- Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. NIPS, abs/1606.03476, 2016. URL http://arxiv.org/abs/1606.03476.
- Imbens [2002] Guido W. Imbens. Generalized method of moments and empirical likelihood. Journal of Business & Economic Statistics, 20(4):493–506, 2002. ISSN 07350015. URL http://www.jstor.org/stable/1392419.
- Imbens et al. [1998] Guido W. Imbens, Richard H. Spady, and Phillip Johnson. Information theoretic approaches to inference in moment condition models. Econometrica, 66(2):333–357, 1998. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/2998561.
- Jabbar et al. [2022] Abdul Jabbar, Xi Li, and Bourahla Omar. A survey on generative adversarial networks: Variants, applications, and training. ACM Computing Surveys (CSUR), 54:1 – 49, 2022.
- Kaji et al. [2020] Tetsuya Kaji, Elena Manresa, and Guillaume Pouliot. An adversarial approach to structural estimation, 2020.
- Kitamura et al. [2004] Yuichi Kitamura, Gautam Tripathi, and Hyungtaik Ahn. Empirical likelihood-based inference in conditional moment restriction models. Econometrica, 72(6):1667–1714, 2004.
- Lewis and Syrgkanis [2018] Greg Lewis and Vasilis Syrgkanis. Adversarial generalized method of moments, 2018.
- Liang [2021] Tengyuan Liang. How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1–41, 2021.
- Lucas [1976] Robert E. Lucas. Econometric policy evaluation: A critique. Carnegie-Rochester Conference Series on Public Policy, 1:19–46, 1976. ISSN 0167-2231. doi: https://doi.org/10.1016/S0167-2231(76)80003-6. URL https://www.sciencedirect.com/science/article/pii/S0167223176800036.
- Mao et al. [2017] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks, 2017.
- Metzger [2022] Jonas Metzger. Adversarial conditional moment estimation, 2022.
- Nakada and Imaizumi [2020] Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38, 2020. URL http://jmlr.org/papers/v21/20-002.html.
- Newey and Smith [2004] Whitney K. Newey and Richard J. Smith. Higher order properties of gmm and generalized empirical likelihood estimators. Econometrica, 72(1):219–255, 2004. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/3598854.
- Neyman [1959] Jerzy Neyman. Optimal asymptotic tests of composite statistical hypotheses. Probability and Statistics: The Harald Cramer Volume, 1959.
- Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization, 2016.
- Owen [1990] Art Owen. Empirical Likelihood Ratio Confidence Regions. The Annals of Statistics, 18(1):90 – 120, 1990. doi: 10.1214/aos/1176347494. URL https://doi.org/10.1214/aos/1176347494.
- Owen [1988] Art B. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249, 06 1988. ISSN 0006-3444. doi: 10.1093/biomet/75.2.237. URL https://doi.org/10.1093/biomet/75.2.237.
- Qin and Lawless [1994] Jin Qin and Jerry Lawless. Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1):300–325, 1994. ISSN 00905364. URL http://www.jstor.org/stable/2242455.
- Shen [1997] Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, 25(6):2555–2591, 1997. ISSN 00905364. URL http://www.jstor.org/stable/2959045.
- Shen and Wong [1994] Xiaotong Shen and Wing Hung Wong. Convergence rate of sieve estimates. Ann. Statist., 22(2):580–615, 06 1994. doi: 10.1214/aos/1176325486. URL https://doi.org/10.1214/aos/1176325486.
- Shen et al. [2019] Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.
- Singh et al. [2018] Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabás Póczos. Nonparametric density estimation under adversarial losses. arXiv preprint arXiv:1805.08836, 2018.
- Tao et al. [2018] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin Duke. Chi-square generative adversarial network. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4887–4896. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/tao18b.html.
- Tuyls et al. [2018] K Tuyls, J Perolat, M Lanctot, JZ Leibo, and T Graepel. A generalised method for empirical game theoretic analysis. In AAMAS’18: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 77–85. ACM, 2018.
- Yarotsky [2017] Dmitry Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
- Zhan et al. [2021] Ruohan Zhan, Vitor Hadad, David A. Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 2125–2135, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383325. doi: 10.1145/3447548.3467456. URL https://doi.org/10.1145/3447548.3467456.
Appendix A Definitions
Definition 1 (Covering Number).
For some norm over some metric space , the covering number is defined as the cardinality of the smallest set such that . The quantity is also called metric entropy.
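To make the definition concrete, the following sketch (our own illustration, not from the paper) computes a greedy ε-cover of a finite point set under the Euclidean norm; the greedy construction only upper-bounds the covering number, and the function name is ours:

```python
import numpy as np

def greedy_covering_number(points, eps):
    """Upper-bound the eps-covering number of a finite point set under the
    Euclidean norm: greedily pick uncovered points as ball centers."""
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    n_centers = 0
    while uncovered.any():
        center = points[int(np.argmax(uncovered))]  # first uncovered point
        n_centers += 1
        uncovered &= np.linalg.norm(points - center, axis=1) > eps
    return n_centers

# Four collinear points with spacing 1: two radius-1 balls cover them.
grid = np.array([[0.0], [1.0], [2.0], [3.0]])
print(greedy_covering_number(grid, 1.0))  # prints 2
```

The logarithm of this count corresponds to the metric entropy used in the entropy conditions below.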
Definition 2 (Deep Neural Networks).
We define the class of deep networks as parametrized functions of the form:
where the ’s are weight matrices and ’s are intercept vectors with real-valued elements, and is applied element-wise. For example, the choice (rectified linear unit) gives rise to the class of deep ReLU networks, and gives rise to the class of networks. We say the network is layers deep and call the upper bound its width. Further, we assume that
i.e. all elements in the ’s and ’s are bounded in absolute value by , and there are at most non-zero parameters in total. Finally, we assume for all . If the particular value is an arbitrary large enough constant, we may suppress the notation and write .
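As a concrete rendering of this definition (a sketch under our own naming conventions, not code from the paper), the snippet below evaluates a deep ReLU network and checks the two constraints of the class: every weight and intercept bounded in absolute value, and at most a fixed number of non-zero parameters:

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit activation."""
    return np.maximum(x, 0.0)

def deep_relu_net(x, weights, biases):
    """Evaluate a network of the form W_L relu(... relu(W_1 x + b_1) ...) + b_L."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

def check_class_membership(weights, biases, B, s):
    """Check the constraints of the definition: all entries bounded by B in
    absolute value, and at most s non-zero parameters in total."""
    params = np.concatenate([np.ravel(a) for a in list(weights) + list(biases)])
    return bool(np.max(np.abs(params)) <= B and np.count_nonzero(params) <= s)

# A two-layer example whose hidden unit computes relu(x1 - x2).
W = [np.array([[1.0, -1.0]]), np.array([[2.0]])]
b = [np.array([0.0]), np.array([0.5])]
print(deep_relu_net([3.0, 1.0], W, b))          # prints [4.5]
print(check_class_membership(W, b, B=2.0, s=5))  # prints True
```

Swapping `relu` for `np.tanh` gives the corresponding smooth-activation class mentioned in the definition.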
Definition 3 (Minkowski Dimension).
Definition 4 (Hölder Space).
For a function , is a partial derivative with respect to the -th component, using the multi-index . For , denotes the largest integer that is less than . Let be the degree of smoothness. For , the Hölder norm is defined as
Then, the Hölder space on is defined as
Also, denotes the -radius closed ball in .
Definition 5 ((p, C)-smoothness).
Let for some and . A function is called smooth, if for every with the partial derivative exists and satisfies
for all .
Definition 6 (Generalized Hierarchical Interaction Models).
Let , , and for some and .
-
a)
We say that satisfies a generalized hierarchical interaction model of order and level with bound , if there exist and some such that
and where is Lipschitz continuous with constant and all of its partial derivatives of order less than or equal to are bounded in absolute value by .
-
b)
We say that satisfies a generalized hierarchical interaction model of order and level with bound if there exist , and such that satisfy a generalized hierarchical interaction model of order and level and
where are Lipschitz continuous with constant and all of their partial derivatives of order less than or equal to are bounded by some constant .
-
c)
We say that the generalized hierarchical interaction model defined above is -smooth, if all functions occurring in its definition are -smooth, cf. Definition 5.
-
d)
We define as the class of all functions satisfying a -smooth generalized hierarchical interaction model of order and level with bound , where . Since the particular value of is not important as long as , we also write .
Definition 7 (Pathwise Derivatives).
For some , and some functional , we define the first pathwise derivative in the direction as
for some real number . Throughout this paper, the usage of implicitly assumes that the derivative and the limit on the RHS exist, and that the derivative is linear in .
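Numerically, the pathwise derivative along t ↦ θ + tν can be approximated by a central difference; the sketch below is our own illustration (treating parameters as coordinate vectors), not part of the paper's formal apparatus:

```python
def pathwise_derivative(F, theta, nu, t=1e-5):
    """Central-difference approximation of the pathwise derivative of the
    functional F at theta in the direction nu."""
    plus = [a + t * b for a, b in zip(theta, nu)]
    minus = [a - t * b for a, b in zip(theta, nu)]
    return (F(plus) - F(minus)) / (2.0 * t)

# Example: F(theta) = sum_i theta_i^2 has pathwise derivative 2 <theta, nu>,
# which is indeed linear in the direction nu.
F = lambda th: sum(v * v for v in th)
d = pathwise_derivative(F, [1.0, 2.0], [1.0, 0.0])
print(abs(d - 2.0) < 1e-6)  # prints True
```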
Appendix B Supporting Lemmas
Lemma B.1 (Covering Number of Neural Networks).
Consider the class of deep neural networks (Definition 2), with activation satisfying (e.g. ReLU, tanh) and consider the norm for some where . Its -covering number (Definition 1) can be bounded by:
Proof.
This is Lemma 7 in Chen et al. [2020]. While they only state the Lemma for the case of ReLU networks , their proof works for any activation satisfying and for all . We substituted the bound and renamed some variables. ∎
Lemma B.2 (Approximation by Deep ReLU Networks on Low Dimensional Data).
Proof.
The case is covered by Theorem 5 in Nakada and Imaizumi [2020]. While they do not state a bound on the width , it is easy to see that any network described by Definition 2 with at most non-zero parameters can be represented by a network with width bounded by . In the case of , the Lemma simply states the approximation error for conventional Hölder spaces as established in Yarotsky [2017]. ∎
Lemma B.3 (Approximation of Generalized Interaction Models by Deep ReLU Networks).
Proof.
This directly follows from Theorem 3 in Bauer et al. [2019]; however, our notation is greatly simplified by the fact that we are not interested in most of their constants, and that we offloaded most of the assumptions into Definition 6. What matters is that the network they construct has a depth that is bounded by a constant (their equation (6)), and a number of non-zero parameters that is proportional to what they call in their Theorem 3 (by their equations (7) and (5) and the definition of in their Theorem 3). Since we assumed bounded support (leaving their as a constant), their bound yields an approximation error of for some , such that the number of non-zero parameters can be bounded as . They bound ( in their notation) in terms of and , yielding for some large enough constant . Finally, their theorem holds only for activation functions which satisfy a property they call N-admissible. While this is technically not satisfied by , it is easy to verify that this property is satisfied by the activation function . Since for any there exists some such that , the same approximation bound holds with . ∎
Lemma B.4 (Empirical Process of Donsker Classes).
If is iid and is -Donsker for some satisfying
then
for any .
Proof.
This directly follows from Lemma 1 in Chen et al. [2003]. ∎
Lemma B.5 (Empirical Process Rates for A-Estimators).
Under the assumptions of Theorem 3.1, for any function satisfying the following conditions:
-
•
For any sequence and all :
(B.1) (B.2) at least if the right hand sides are smaller than some .
-
•
For all small , we have:
(B.3)
we obtain the following empirical processes bounds:
where and are shrinking neighborhoods around and containing and .
Appendix C Proofs
C.1 Theorem 3.1 and Lemma B.5
Theorem 3.1 and Lemma B.5 are simplified versions of the slightly more general Theorems C.1 and C.2, which modify Shen and Wong [1994]’s M-estimator convergence rate arguments to hold uniformly over another parameter space and accommodate estimators which are finite-sample optimal up to some stochastic remainder. Theorem C.1 is presented in C.1.1 and derives the uniform convergence rates for . Theorem C.2 is presented in C.1.2 and derives the rates for . In C.1.3, we then discuss how Theorem 3.1 and Lemma B.5 follow from these results.
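For intuition about estimators that are finite-sample optimal only up to a stochastic remainder, here is a toy sketch of simultaneous gradient descent-ascent converging to an approximate empirical Nash point. The objective, step size, and iteration count are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)  # iid sample

# Toy saddle objective l(theta, lam) = lam*(x - theta) - lam^2/2: the inner
# maximum over lam equals (mean(x) - theta)^2 / 2, minimized at the mean.
def loss(theta, lam):
    return np.mean(lam * (x - theta) - 0.5 * lam**2)

theta, lam, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    g_theta = -lam                      # descent direction for theta
    g_lam = np.mean(x - theta) - lam    # ascent direction for lam
    theta -= lr * g_theta
    lam += lr * g_lam

print(abs(theta - x.mean()) < 1e-3)  # prints True: theta is near the saddle
```

The iterate only satisfies the Nash condition approximately after finitely many steps, which is exactly the role of the stochastic remainder sequences in the theorems.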
C.1.1 Uniform convergence rate of
Theorem C.1 (Uniform Convergence Rates of Sieve M-Estimators).
Let be a pseudo-distance on , possibly indexed by . For the estimator of 3.2, assume:
-
CONDITION C1a. For some constants and and all small :
-
CONDITION C1b. For some constants and and all small :
-
CONDITION C2. Let . For some , and all small , its entropy (Def. 1) is bounded as:
where either or , which is understood to represent .
Let , then
where is given by:
And for any satisfying C1a and C2 when is replaced by , we can bound the empirical process as follows:
(C.1) |
where is defined as in Lemma B.5.
Proof.
The theorem generalizes Theorem 1 of Shen and Wong [1994] such that it holds uniformly over a family of losses indexed by the parameter , and to allow for the finite-sample optimum to hold approximately up to a possibly stochastic sequence . Fortunately, the proof can remain almost identical. Shen and Wong [1994] prove the Theorem by induction, through a chaining argument. They use their Lemma 2 to derive an initial, slow rate which corresponds to the induction start, yielding the assumptions of their Lemma 3 at step . Next, their Lemma 3 is repeatedly applied as the induction step until the rates of Theorem C.1 are obtained. We do not reproduce these algebraic steps, as they are the same as in Shen and Wong [1994]. Like Shen and Wong [1994], we also do not provide the proof for the induction start, as it is similar to, but simpler than, the proof of the induction step, which we present in Lemma C.1. ∎
Lemma C.1 (Induction Step for Theorem C.1).
Suppose Conditions C1a, C1b and C2 hold. If at Step we have a rate so that
where and Then at Step we can find an improved rate
where so that
And for any function satisfying C1b and C2 when is replaced by , and is defined as in Lemma B.5, the same bound applies to:
Proof.
We assume (wlog) and only prove the case of . Let for . Then
To prove the Lemma , we only need to tackle . By Condition C1a,
The last inequality requires . This holds for (wlog), which follows from and . Thus
Let . By Condition C1b and we get . Since satisfies
we know that for some constants and . We can therefore apply Shen and Wong [1994]’s Lemma 1 and obtain:
The behavior of can be analyzed via Shen and Wong [1994]’s Remark 12.
(i) If then
for some constant . (ii) If , then
for some . Hence,
Take , and as defined in the Lemma, and we obtain . This yields the convergence rate of . The statement about arbitrary follows by applying analogous arguments (starting at the definition of ) to the expression instead of . ∎
C.1.2 Convergence of
Theorem C.2 (Convergence Rate of A-Estimators).
Consider the family of M-estimators defined in 3.2 and the A-estimator defined in 3.3. Assume that conditions C1b and C2 are satisfied with (hence C1a is automatically satisfied with ). Let be some pseudo-distance on . Assume that the following conditions are satisfied:
-
CONDITION C1a’ For some constants and , and all small :
-
CONDITION C1b’. For some constants and and all small :
-
CONDITION C2’. Let . For some , and all small , its entropy (Def. 1) is bounded as:
where either or , which represents .
Let be the approximation error of . Then:
Where and are defined as in Theorem C.1 and . Also, for every satisfying C1b and C2 when is replaced by (recall ), we can bound the empirical process:
Proof.
The proof is similar to that of Theorem C.1. Again, we will only prove the induction step via Lemma D.1 in the Online Appendix, as the remaining arguments are analogous to the proof of the previous Theorem or that of Theorem 1 in Shen and Wong [1994]. The proof of Lemma D.1 largely mirrors that of Lemma C.1, but uses the results of Theorem C.1 to control the convergence of the adversary. The main additional complexity lies in properly switching back and forth between the sieve spaces and the target function spaces, when bounding the empirical process terms and variances respectively. ∎
C.1.3 Proofs of Theorem 3.1 and Lemma B.5
Proof.
To see that Theorem 3.1 and Lemma B.5 follow from the previous results, simply choose and , such that Conditions C1a and C1a’ are automatically satisfied with . Further, substitute such that Conditions C1b and C1b’ hold by assumptions 3.7 and 3.6 for all . Conditions C2 and C2’ directly follow from 3.8, substituting and fixing . This yields Theorem 3.1, and Lemma B.5 with .
For a proof of Lemma B.5 with , note that Condition C1b in Theorem C.1 is only needed to verify in the proof of Lemma C.1. Hence we can re-define such that the definition of automatically ensures . This change in constants does not affect the result. Analogous arguments can be applied to Lemma D.1 and thus Theorem C.2. ∎
C.2 Theorem 3.3
Proof.
The approximate Nash conditions 1.2 and 1.3 imply
(C.2) |
The second line uses Taylor’s theorem. The third line substitutes the definitions of and applies Condition N3. Since the signs of are arbitrary, we may replace the inequality with an equality, which yields . Adding and subtracting a few terms, we get:
(C.3) |
Where the last line uses Conditions N1 and N2. Substituting , we get:
with by the standard central limit theorem. ∎
C.3 Theorem 3.4
Proof of Theorem 3.4.
Note that the regularization does not interfere with Theorem 3.2: the approximation power relative to is not reduced as we remove only elements from the sieves that are far away from , and the regularization is sufficiently slow to guarantee that the sieves are nonempty for small enough . Hence Theorem 3.2 holds with , yielding rates . Also note that the assumption along with the lower bound ensures and . We first verify condition N1, decomposing it into two parts by adding and subtracting . First, we show
If A7ii) holds with , then this can be established with Lemma B.5 for , using the Lipschitz condition A4 and analogous arguments to those in the proof of Theorem 3.2. Lemma B.5 then yields the same rates as Theorem 3.2. If A7ii) instead asserts that is -Donsker, the same result follows from Lemma B.4, which can be applied because the Lipschitz continuity A4 implies the L2 continuity required by the Lemma. This implies condition N1, together with:
which can be established via analogous arguments and A7i). We proceed to verify condition N2. Using a similar decomposition, we note that
which follows from A6i) and the rates of Theorem 3.2. Similarly, A6ii) implies , hence condition N2 holds. Finally, we verify condition N3. Define as the projection onto . Similarly, . Due to the regularization, we know . Therefore . By convexity of , we have for large enough. Given that due to and our choice of , we get . Taken together, these statements imply . Analogous arguments yield . Hence N3 holds. ∎
Appendix D Online Appendix
D.1 Proof of Theorem C.2
The induction step for the proof of Theorem C.2 is given by the following Lemma.
Lemma D.1 (Induction Step for Theorem C.2).
Suppose the Conditions of Theorem C.2 hold. If at Step we have a rate
so that
where and , then at Step , we can find an improved rate
where so that
Furthermore, for every satisfying Conditions C1b and C2 when is replaced by (recall ), the same bound applies to:
Proof.
As in the proof of Lemma C.1, we assume (wlog) and only prove the case of . Let for . As before, we only need to bound . To this end, it will be useful to define
such that Theorem C.1 implies , which also implies, by definition of and : . Together with 3.3 this yields:
(D.1) |
By Condition C1a’, we can therefore bound:
(D.2) |
Where the last line used C1a’, the definition of the approximation errors, assumed large enough (wlog) and used various lower-bounds implied by the definition of . Together with D.1, this yields:
Let . To bound , we add and subtract terms and apply the Cauchy-Schwarz inequality:
By Conditions C1b and C1b’, and since , we obtain , assuming is large enough (wlog). The remaining arguments are unchanged from the proof of Lemma C.1, which eventually yields: . This completes the proof for the convergence rate of . To prove the statement about arbitrary satisfying C1b and C2, simply repeat the arguments of the previous proof (starting at the definition of ) with replaced by . ∎
D.2 Proof of Proposition 4.1
Proof.
The first order conditions for in 2.1 yield the optimal population adversary :
(D.3) |
We verify the conditions of Theorem 3.2. The Lipschitz condition A1 can be verified by writing and using the Lipschitzness of in and that of (which follows from boundedness of and differentiability of ). Towards A2, apply a second-order Taylor expansion (with mean-value remainder) of at in direction of , which yields for some on a path from to :
where the last step follows from strict positivity and boundedness of wp1. Further note that
due to Lipschitzness of . Also note that Lipschitzness of in implies
Taken together, this implies A2. A3 can be verified analogously, starting with a Taylor expansion yielding for some on a path from to . ∎
D.3 Proof of Proposition 4.2
Proof.
First, we verify the conditions of Theorem 3.4. Note that
The Lipschitz condition A4 is therefore satisfied by the Lipschitzness of , and that of . Towards A6 i), we apply a Taylor expansion with second-order mean-value remainder around , yielding for some on a path between and :
where we used to get rid of the first-order term. The last line follows from the boundedness (in absolute value) of and on their compact support. Hence A6i) is satisfied. A7i) follows from:
where the last step used the Lipschitzness of and again the boundedness of . Assumption A7 ii) is satisfied since is Donsker by assumption. Finally, we verify A6 ii). Applying the mean value theorem twice, we get that for some on a path between and , which is dominated by via the last assumption stated in the proposition. Therefore A6ii) is satisfied, and Theorem 3.4 applies. ∎
D.4 Proof of Proposition 4.6
Proof.
Note that . We verify the conditions of Theorem 3.4, for , such that its conclusion becomes , which yields the Proposition via the Cramér-Wold device. Note that . Hence assumption A4 follows from boundedness and Lipschitzness of . A5 holds by assumption. To verify A6i), notice that:
where the last equality used the fact that . Towards Assumption A6ii), note that we can apply a first-order Taylor expansion with mean-value remainder twice, which yields for some on a path between and :
Given the identification assumption 2.5, we can use a second-order Taylor expansion with mean-value remainder to show that , which then yields A6ii). Similarly, we can show A7ii) via a mean-value remainder:
We similarly can establish A7i) via
∎
D.5 Proof of Proposition 4.4
Proof.
A0 holds by assumption, and condition A1 is implied by Lipschitzness of in . Continuity of , compactness of and A0 further imply that there is some such that
Given that , condition A2 then follows from
as well as . A3 can be established analogously, hence the conclusions of Theorem 3.2 hold. ∎
D.6 Proof of Proposition 4.7
Proof.
Note that follows from the first order conditions of the adversary. Condition A0 is satisfied by assumption, and A1 follows from Lipschitzness of and boundedness. Lipschitzness of in and boundedness imply that is Lipschitz in . This implies
and together with , it yields:
Both bounds imply A3 and A2 respectively. The result follows because the loss is proportional to the squared L2 norm of . ∎