

Adversarial Estimators

Jonas Metzger
Stanford University
Abstract

We develop an asymptotic theory of adversarial estimators (‘A-estimators’). They generalize maximum-likelihood-type estimators (‘M-estimators’) as their average objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.

1 Introduction

Although it is not always obvious, nearly all population parameters that are estimated in econometrics and machine learning can be written as the solution of so-called saddle-point or adversarial objectives of the form:

\theta_{*}=\underset{\theta\in\Theta}{\arg\min}\,\max_{\lambda\in\Lambda}\mathbb{E}\,l(\theta,\lambda,Y) \qquad (1.1)

where $l$ is a known loss function, $Y$ is a random variable and $\Theta,\Lambda$ are parameter spaces, containing the unknown parameter of interest $\theta_{*}$ and the nuisance $\lambda$. We examine the natural estimator $\widehat{\theta}_{n}$ that approximately solves the empirical Nash conditions:

\mathbb{E}_{n}l(\widehat{\theta}_{n},\widehat{\lambda}_{n},Y) \leq \inf_{\theta\in\Theta_{n}}\mathbb{E}_{n}l(\theta,\widehat{\lambda}_{n},Y)+\widetilde{\eta}_{n} \qquad (1.2)
\mathbb{E}_{n}l(\widehat{\theta}_{n},\widehat{\lambda}_{n},Y) \geq \sup_{\lambda\in\Lambda_{n}}\mathbb{E}_{n}l(\widehat{\theta}_{n},\lambda,Y)-\eta_{n} \qquad (1.3)

which replaces the expectation $\mathbb{E}$ of the population objective 1.1 with the average over $n$ i.i.d. samples, $\mathbb{E}_{n}$. We search for the estimators over so-called sieve spaces, $\widehat{\theta}_{n}\in\Theta_{n}$ and $\widehat{\lambda}_{n}\in\Lambda_{n}$ (Grenander [1981]), which approximate the full parameter spaces ($\Theta_{n}\subset\Theta$, $\Lambda_{n}\subset\Lambda$) and grow with the sample size $n$. These could be neural networks, for example, growing in depth and width. The sequences $\widetilde{\eta}_{n},\eta_{n}=o_{\mathbb{P}}(1)$ accommodate numerical procedures which only yield approximate Nash equilibria. This class of A-estimators (A for adversarial) strictly generalizes so-called M-estimators (M for maximum likelihood-type), which are obtained by fixing $\Lambda$ to be a singleton.
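To make the Nash conditions concrete, here is a minimal numerical sketch (our own illustration, not from the paper): for the scalar loss $l(\theta,\lambda,y)=(y-\theta)\lambda-\lambda^{2}/4$, the population saddle point sits at $\theta_{*}=\mathbb{E}[Y]$, and simultaneous gradient descent-ascent on the empirical objective finds an approximate empirical Nash equilibrium.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=5000)

# Empirical objective E_n[(y - theta)*lam - lam**2/4]: the adversary lam
# tracks the violation E_n[y] - theta, and the saddle point sits at the
# sample mean theta = E_n[y].
theta, lam = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    grad_theta = -lam                      # d/d theta of the objective
    grad_lam = np.mean(y - theta) - lam / 2.0  # d/d lam of the objective
    theta -= lr * grad_theta               # descent step in theta
    lam += lr * grad_lam                   # ascent step in lam

print(theta, np.mean(y))  # theta converges to the sample mean
```

The alternating updates mimic how adversarial estimators are trained in practice; here the game is concave-convex, so the simple dynamics converge.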

A-estimators were a workhorse of econometrics and causal inference long before the advent of deep learning. Hansen et al. [1996]'s continuous-updating Generalized Method of Moments (GMM), which looks for $\theta$ satisfying $\mathbb{E}[m(\theta,Y)]=0$ for some known function $m(\theta,Y)$, can be written as:

\inf_{\theta}\,\mathbb{E}_{n}\left[m(\theta,Y)\right]^{\prime}\mathbb{E}_{n}\left[m(\theta,Y)m(\theta,Y)^{\prime}\right]^{-1}\mathbb{E}_{n}\left[m(\theta,Y)\right] = \inf_{\theta}\sup_{\lambda}\,\mathbb{E}_{n}\left[m(\theta,Y)^{\prime}\lambda-(m(\theta,Y)^{\prime}\lambda)^{2}/4\right]

and is therefore an A-estimator, but not an M-estimator. In statistics, an earlier example is the Empirical Likelihood (EL) approach pioneered by Cosslett [1981], Owen [1988, 1990], Qin and Lawless [1994]. Subsequently, EL was unified with GMM into the Generalized Empirical Likelihood (GEL) framework (Newey and Smith [2004]), which also subsumes the exponential-tilting estimator (Imbens et al. [1998]), for example. All GEL estimators are A-estimators, but their adversarial formulation was rarely salient. However, some of their benefits may be owed directly to their adversarial objective: the adversary $\lambda$ automatically detects which moment violations are most informative at a given parameter guess, adaptively guiding the estimation towards an efficient solution. This contrasts with earlier estimators, which weighted the moments in a way that depended on choices of the researcher: the weights of Pearson's Method of Moments were (implicitly) set by the researcher manually, resulting in inefficient root-$n$ asymptotics. Two-step GMM (Hansen [1982]) required choosing a first-step estimator to compute the weights, yielding inefficient higher-order asymptotics (see Newey and Smith [2004]). Formally, the optimal weights are nuisance parameters, and as we will see in Section 3.2, estimating them via an adversary ensures that $\widehat{\theta}_{n}$ is robust to estimation errors in these nuisance parameters.
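The minimax identity above is easy to verify numerically. The following sketch (our own illustration, with simulated moment values at a fixed parameter guess) checks that concentrating out the adversary $\lambda$ recovers the continuous-updating GMM quadratic form.

```python
import numpy as np

rng = np.random.default_rng(1)
# simulated moment values m(theta, Y_i) at a fixed parameter guess: n x 2
m = rng.normal(size=(4000, 2)) + np.array([0.3, -0.1])

Em = m.mean(axis=0)                        # E_n[m]
Emm = (m.T @ m) / len(m)                   # E_n[m m']
quad_form = Em @ np.linalg.solve(Emm, Em)  # CU-GMM criterion

def inner(lam):
    # adversarial inner objective E_n[m'lam - (m'lam)^2 / 4]
    s = m @ lam
    return np.mean(s - s ** 2 / 4.0)

# the inner sup over lam is attained at lam* = 2 * Emm^{-1} Em,
# and its value equals the CU-GMM quadratic form
lam_star = 2.0 * np.linalg.solve(Emm, Em)
print(inner(lam_star), quad_form)  # equal
```

Evaluating `inner` at random alternative values of `lam` confirms that `lam_star` is indeed the maximizer.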

A key invention which put a spotlight on adversarial objectives in recent years was the Generative Adversarial Network, or GAN (Goodfellow et al. [2014]). GANs search for a generative model $Y\sim\mathbb{P}_{\theta}$ for which no adversary $\lambda(Y)\in(0,1)$ (called 'critic' or 'discriminator') can tell the generated data apart from $n$ real samples:

\inf_{\mathbb{P}_{\theta}}\sup_{\lambda(\cdot)\in(0,1)}\mathbb{E}_{\mathbb{P}_{\theta}}\log\lambda(Y)+\mathbb{E}_{n}\log(1-\lambda(Y))

The objective contains the log-likelihood of a binary classifier $\lambda(Y)$ discriminating between an equal number of real and generated samples. As we show in Section 2.1, this directly measures the Jensen-Shannon divergence between $\mathbb{P}_{\theta}$ and $\mathbb{P}_{n}$. As of today, versions of this objective are key to state-of-the-art image generation; see e.g. Jabbar et al. [2022] for a recent survey. An analogy to human-generated images makes this unsurprising: it is much easier to tell a photo apart from an image drawn by a human than it is to draw a realistic image, or to define what makes a drawing realistic. This intuition motivates the objective: train the generator until its critic has nothing left to criticize. The ingenuity is that the researcher need not define a meaningful measure of 'realism' of a piece of data anymore. Instead, this measure is learned by the adversary. It is clear that the utility of this idea extends beyond image generation: in Imitation Learning, a sub-field of Robotics, it has been used to teach human behavior to artificial agents without requiring hand-crafted measures of 'humanness' (Ho and Ermon [2016]). In Econometrics, where new causal inference methods can usually only be benchmarked on simulated data sets, Athey et al. [2019] used the objective to limit the impact of researchers' subjective choices by requiring simulations to be indistinguishable from real data. Kaji et al. [2020] proposed to use the objective to estimate structural economic models which produce realistic data beyond the set of features that would otherwise be manually specified by the researcher.
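The connection to the Jensen-Shannon divergence can be verified directly on a finite support (our own illustration): the optimal critic is $\lambda_{*}(y)=\frac{p_{\theta}(y)}{p_{\theta}(y)+p(y)}$, and plugging it into the population objective yields $2\,\mathrm{JS}(\mathbb{P}_{\theta}\|\mathbb{P})-\log 4$.

```python
import numpy as np

# two discrete distributions on a common finite support
p_theta = np.array([0.5, 0.3, 0.2])   # generated
p_data = np.array([0.2, 0.3, 0.5])    # real

# optimal critic: lam*(y) = p_theta(y) / (p_theta(y) + p_data(y))
lam_star = p_theta / (p_theta + p_data)
value = np.sum(p_theta * np.log(lam_star) + p_data * np.log(1 - lam_star))

# Jensen-Shannon divergence: average KL to the mixture
mix = (p_theta + p_data) / 2
kl = lambda p, q: np.sum(p * np.log(p / q))
js = 0.5 * kl(p_theta, mix) + 0.5 * kl(p_data, mix)

# objective value at the optimal critic equals 2*JS - log(4)
print(value, 2 * js - np.log(4))
```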

More generally, other adversarial objectives have proven useful beyond fitting models to data. In Reinforcement Learning, a sub-field of Robotics where agents independently discover strategies to reach predefined goals without copying prior examples, Dai et al. [2018] proposed an A-estimator in which the adversary detects and penalizes any systematic deviation from optimal behavior. Cotter et al. [2019] proposed an estimator which extends a standard ML objective by an adversary imposing fairness constraints across sub-populations. More recently, research in econometrics established A-estimators as a natural framework for integrating machine learning methods into causal inference, where quantities of interest are frequently identified by a continuum of restrictions. Chernozhukov et al. [2020] propose to estimate Riesz representers of causal parameters directly, via an adversary enforcing the restrictions identifying the Riesz representer. Estimating Riesz representers is key to obtaining well-behaved estimates of causal parameters in the presence of nuisance functions, and can also be useful for estimating asymptotic variances, e.g. Chen et al. [2019]. Another line of research develops novel adversarial objectives to estimate causal parameters from conditional moment restrictions, which naturally arise from causal assumptions (e.g. the instrumental variable setting), and are usually more informative than any finite set of unconditional moment restrictions. In this line of research, the adversary can be viewed as adaptively finding the unconditional moment restriction which is most violated at the current parameter guess, among infinitely many which are implied by the conditional moment restriction. The key works are Lewis and Syrgkanis [2018], Dikkala et al. [2020] and Bennett et al. [2019b], Bennett and Kallus [2020]. 
Metzger [2022] proposes a semi-parametrically efficient generalization of GEL to the conditional case via adversarial networks, containing Bennett and Kallus [2020] as a special case.

In summary, a recurring theme of adversarial objectives is that instead of manually defining which specific features of the data are important for a model to capture, the researcher’s role is restricted to stating a general principle which should be satisfied by all features of the correct model, and the adversary adaptively focuses the estimation on the model’s features which violate this principle the most. Over the course of the paper, we will encounter further interesting connections between various A-estimators, such as their Neyman orthogonality, their information-theoretic foundation via f-Divergences, and their ties to Lagrangian Duality.

Despite their popularity, we are not aware of a unified statistical theory of A-estimators. For some individual estimators, consistency (Bennett et al. [2019b]) and convergence rate results (Dikkala et al. [2020], Singh et al. [2018], Liang [2021], Belomestny et al. [2021]) were obtained, but normality results are limited to parametric $\theta$, either in kernel settings (Bennett and Kallus [2020]) or leaving high-level assumptions about neural networks unverified (Kaji et al. [2020]). This can be attributed to two main obstacles. First, the theory of M-estimation does not apply to A-estimators, and the arguments from which it is built are insufficient to, e.g., obtain the required uniform convergence of the adversary. Second, adversarial objectives are most popular in the context of (deep) neural networks, whose statistical analysis (particularly their asymptotic normality) is complicated, e.g. due to their non-convex sieve space. Even in M-estimation settings, it was not clear whether known high-level conditions for normality could be verified for neural networks (cf. the Conclusion of Shen et al. [2019]). We therefore make three separate contributions:

  1. We characterize the general class of A-estimators, and show that a wide range of estimators proposed in econometrics and machine learning fall into this class. We point out desirable characteristics shared by A-estimators, which help explain their recent success in practice.

  2. We develop a unified statistical theory of A-estimators, yielding their consistency, convergence rates (both under point and partial identification), and the asymptotic normality of functionals of their parameters. We provide high-level conditions for arbitrary sieves, as well as low-level conditions for semi-parametric settings with neural networks, to simplify verification in practice.

  3. We extend the theory of neural network M-estimators (as a special case). Our convergence rates hold uniformly over families of losses, allow more general losses than Farrell et al. [2018], and attain the reduced curse of dimensionality which Nakada and Imaizumi [2020], Bauer et al. [2019] observed in regression settings with lower-dimensional structure. To the best of our knowledge, we provide the first normality result for functionals of deep neural networks which does not rely on Neyman orthogonality or unverified high-level assumptions.

The remainder of the paper is structured as follows. In Section 2, we review five different A-estimators proposed in the econometrics and machine learning literatures. We present our general statistical theory of A-estimators in Section 3 and apply it in Section 4 to derive novel results about the examples of Section 2. We conclude by recapping the similar role adversaries play across all examples, providing intuition about which types of problems may generally benefit from adversarial formulations. Appendix C and Online Appendix D contain the proofs omitted in Sections 3 and 4, respectively.

2 Examples

2.1 Minimum $f$-Divergence

A powerful class of estimation objectives asymptotically minimizes an $f$-divergence $D_{f}(\mathbb{P}_{\theta}\|\mathbb{P})$ between the distribution of the data $Y\sim\mathbb{P}=\mathbb{P}_{\theta_{*}}$ and the distribution of some model $\mathbb{P}_{\theta}$, $\theta\in\Theta_{n}$, with support $\mathcal{Y}$. This class, introduced by Nowozin et al. [2016], subsumes GANs (Goodfellow et al. [2014]) and many follow-ups such as Mao et al. [2017], Tao et al. [2018]. For a continuous, proper convex function $f:\mathbb{R}\mapsto\mathbb{R}$ satisfying $f(1)=0$, the $f$-divergence is defined as $D_{f}(\mathbb{P}_{\theta}\|\mathbb{P})=\mathbb{E}_{\mathbb{P}}[f(\frac{\mathrm{d}\mathbb{P}_{\theta}(Y)}{\mathrm{d}\mathbb{P}(Y)})]$, where $\frac{\mathrm{d}\mathbb{P}_{\theta}(Y)}{\mathrm{d}\mathbb{P}(Y)}$ denotes the Radon-Nikodym derivative (i.e. the likelihood ratio) of $\mathbb{P}_{\theta}$ with respect to $\mathbb{P}$, which we assume exists for all $\theta\in\Theta$. Notably, $D_{f}(\mathbb{P}_{\theta}\|\mathbb{P})$ admits a useful dual representation:

D_{f}(\mathbb{P}_{\theta}\|\mathbb{P}):=\mathbb{E}_{\mathbb{P}}f\left(\frac{\mathrm{d}\mathbb{P}_{\theta}(Y)}{\mathrm{d}\mathbb{P}(Y)}\right)=\sup_{\lambda:\mathcal{Y}\to\mathbb{R}}\mathbb{E}_{\mathbb{P}_{\theta}}[\lambda(Y)]-\mathbb{E}_{\mathbb{P}}[f_{*}(\lambda(Y))] \qquad (2.1)

where $f_{*}(t):=\sup_{\lambda\in\mathbb{R}}\lambda t-f(\lambda)$ denotes the convex conjugate of $f$. The equality above follows from $f=(f_{*})_{*}$. Various choices for $f$ are presented in Table 1. (None of the objectives is unique: $f(t)\leftarrow f(t)+c(t-1)$ for any $c$ yields the same divergence, but changes the expressions. Note that we may also swap $\mathbb{E}_{\mathbb{P}_{\theta}}$ and $\mathbb{E}_{n}$, which yields valid objectives for the respective "reverse" $f$-divergences.) This duality is useful because the right-hand side suggests a finite-sample analog which does not depend on unknown quantities: we obtain an A-estimator for $\theta_{*}$ by letting

l(\theta,\lambda,Y)=\mathbb{E}_{\mathbb{P}_{\theta}}[\lambda(Y)]-f_{*}(\lambda(Y)) \qquad (2.2)

and solving for $\widehat{\theta}_{n},\widehat{\lambda}_{n}$ satisfying the Nash conditions 1.2, 1.3 in $\mathbb{E}_{n}[l(\theta,\lambda,Y)]$. Normalizing $f(t)\leftarrow\frac{f(t)-f^{\prime}(1)(t-1)}{f^{\prime\prime}(1)}$ without loss of generality (this merely re-scales the divergence 2.1 by a factor of $1/f^{\prime\prime}(1)$, assuming the second derivative $f^{\prime\prime}$ exists, and implies $f^{\prime}(1)=0$, $f^{\prime\prime}(1)=1$ and $f_{*}(0)=0$, $f_{*}^{\prime}(0)=f_{*}^{\prime\prime}(0)=1$), the function $\lambda$ attaining the supremum in 2.1 at some $\theta$ is $\lambda_{*}^{\theta}=f^{\prime}(\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}})$. The adversary $\widehat{\lambda}_{n}$ therefore estimates this transformed likelihood ratio at the current guess $\widehat{\theta}_{n}$, and the Nash equilibrium corresponds to the case where it is approximately constant, i.e. the distribution $\mathbb{P}_{\widehat{\theta}_{n}}$ is close to that of the data. Notably, $\mathbb{E}_{n}[l(\theta,\lambda,Y)]$ can be evaluated using only samples from the two distributions: we require neither explicit knowledge of $\mathbb{P}_{\theta}$ nor infinitely many samples from it at a given $n$, since it suffices to draw $m\succ n^{2}$ Monte Carlo samples from $\mathbb{P}_{\theta}$ and solve for the corresponding finite-sample saddle point. The resulting Monte Carlo approximation error for the expectation $\mathbb{E}_{\mathbb{P}_{\theta}}$ is then of order $\sqrt{m}^{-1}=n^{-1}$ and can be accounted for by letting $\widetilde{\eta}_{n},\eta_{n}=O_{\mathbb{P}}(n^{-1})$ in equations 1.2, 1.3, which has no impact on our asymptotic results. This is crucial for GANs, where $\mathbb{P}_{\theta}$ is only implicitly defined via a push-forward mapping parametrized by a neural net. As proposed by Kaji et al. [2020], this also makes it a drop-in alternative to the Simulated Method of Moments, which similarly estimates economic models from data they generate, but matches only a finite set of moments instead of the full distribution.

Total Variation: $f(t)=|t-1|/2$; $f_{*}(t)=t$ for $|t|\leq\frac{1}{2}$; objective: $\sup_{|\lambda|\leq\frac{1}{2}}\mathbb{E}_{\mathbb{P}_{\theta}}\lambda(Y)-\mathbb{E}_{n}\lambda(Y)$

KL Divergence: $f(t)=t\log t$; $f_{*}(t)=e^{t-1}$; objective: $\sup_{\lambda\in\mathbb{R}}1+\mathbb{E}_{\mathbb{P}_{\theta}}\lambda(Y)-\mathbb{E}_{n}e^{\lambda(Y)}$

Reverse KL: $f(t)=-\log t$; $f_{*}(t)=-\log(-te)$ for $t\leq 0$; objective: $\sup_{\lambda\leq 0}1+\mathbb{E}_{\mathbb{P}_{\theta}}\lambda(Y)+\mathbb{E}_{n}\log(-\lambda(Y))$

$\chi^{2}$ Divergence: $f(t)=(t-1)^{2}$; $f_{*}(t)=t+t^{2}/4$; objective: $\sup_{\lambda\in\mathbb{R}}\mathbb{E}_{\mathbb{P}_{\theta}}\lambda(Y)-\mathbb{E}_{n}\left[\lambda(Y)+\lambda(Y)^{2}/4\right]$

Squared Hellinger: $f(t)=(\sqrt{t}-1)^{2}$; $f_{*}(t)=\frac{t}{1-t}$ for $t\leq 1$; objective: $\sup_{\lambda\leq 1}\mathbb{E}_{\mathbb{P}_{\theta}}\lambda(Y)-\mathbb{E}_{n}\left[\frac{\lambda(Y)}{1-\lambda(Y)}\right]$

Rescaled JS (GAN): $f(t)=t\log t-(1+t)\log(1+t)$; $f_{*}(t)=-\log(1-e^{t})$ for $t<0$; objective: $\sup_{\log\lambda<0}\mathbb{E}_{\mathbb{P}_{\theta}}\log\lambda(Y)+\mathbb{E}_{n}\log(1-\lambda(Y))$

Table 1: Various adversarial $f$-divergence objectives. $f_{*}(t)=\infty$ outside the domain.
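The rows of Table 1 can be checked mechanically. The sketch below (our own illustration) verifies, for the $\chi^{2}$ row, both that $f_{*}(t)=\sup_{\lambda}\lambda t-f(\lambda)$ and that the dual representation 2.1 recovers the primal divergence on a finite support.

```python
import numpy as np

# chi^2 row of Table 1: f(t) = (t-1)^2, f_*(t) = t + t^2/4
f = lambda t: (t - 1) ** 2
f_conj = lambda t: t + t ** 2 / 4

# 1) convex conjugate: f_*(t) = sup_lam (lam * t - f(lam)), here via a fine grid
grid = np.linspace(-10, 10, 200_001)
sups = {t: np.max(grid * t - f(grid)) for t in [-1.0, 0.0, 0.5, 2.0]}
print({t: (s, f_conj(t)) for t, s in sups.items()})

# 2) dual representation of D_f on a finite support: the pointwise
# maximizer is lam*(y) = f'(p_theta / p) = 2 * (p_theta / p - 1)
p_theta = np.array([0.5, 0.3, 0.2])
p = np.array([0.2, 0.3, 0.5])
primal = np.sum(p * f(p_theta / p))
lam_star = 2 * (p_theta / p - 1)
dual = np.sum(p_theta * lam_star - p * f_conj(lam_star))
print(primal, dual)  # equal
```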

2.2 Generalized Empirical Likelihood

Our next example is a class of A-estimators that was proposed long before the recent success of adversarial objectives in deep learning. In econometrics, many important parameters $\theta_{*}$ are identified by a moment restriction of the form:

\mathbb{E}[m(Y,\theta)]=0 \iff \theta=\theta_{*}

for some known, possibly vector-valued function $m(Y,\theta)$. In the Introduction, we presented the continuous-updating GMM objective (Hansen et al. [1996]) for estimating $\theta_{*}$, a workhorse for causal inference in econometrics. In this section, we review the more general class of Generalized Empirical Likelihood (GEL) estimators (Newey and Smith [2004]), which solve the constrained minimization problem:

\inf_{\bar{\mathbb{P}},\,\theta\in\Theta}D_{f}(\bar{\mathbb{P}}\|\mathbb{P}_{n})\quad\text{s.t.}\quad\mathbb{E}_{\bar{\mathbb{P}}}[m(Y,\theta)]=0

That is, they seek a parameter $\theta$ and a corresponding population distribution $\bar{\mathbb{P}}$ that is as close as possible to the sample distribution $\mathbb{P}_{n}$, subject to satisfying the moment constraint $\mathbb{E}_{\bar{\mathbb{P}}}[m(Y,\theta)]=0$. At this high level, it is worth noting that GEL optimizes the same target as the objective in Section 2.1, which imposes $\bar{\mathbb{P}}=\mathbb{P}_{\theta}$ instead of a moment constraint. Glossing over some details, we can obtain a tractable estimator in this setting by concentrating out $\bar{\mathbb{P}}$ from the corresponding Lagrangian:

\inf_{\bar{\mathbb{P}},\,\theta\in\Theta}\sup_{\lambda\in\mathbb{R}^{\operatorname{dim}(m)}}D_{f}(\bar{\mathbb{P}}\|\mathbb{P}_{n})-\lambda^{\prime}\mathbb{E}_{\bar{\mathbb{P}}}[m(Y,\theta)]=\inf_{\theta\in\Theta}\sup_{\lambda\in\mathbb{R}^{\operatorname{dim}(m)}}\mathbb{E}_{n}\left[-f_{*}\left(\lambda^{\prime}m(Y,\theta)\right)\right] \qquad (2.3)

which again uses the convex conjugate $f_{*}$ of $f$ (see Section 2.1). For a formal proof of this equivalence, see e.g. Imbens et al. [1998]. It is easy to see that GEL is an A-estimator with $l(\theta,\lambda,Y)=-f_{*}(\lambda^{\prime}m(Y,\theta))$, where $\Lambda_{n}=\mathbb{R}^{\operatorname{dim}(m)}$ and $\Theta_{n}$ is the parameter space of the economic model. A particularly popular version of this objective corresponds to the case $D_{f}=\chi^{2}$, where Table 1 tells us that $f(t)=(t-1)^{2}$ and $f_{*}(t)=t+t^{2}/4$. In this case, we can analytically solve for the optimal adversary given $\theta$. Substituting it in, we recover the continuous-updating GMM objective presented in the Introduction:

\sup_{\lambda\in\mathbb{R}^{\operatorname{dim}(m)}}\mathbb{E}_{n}\left[-f_{*}\left(\lambda^{\prime}m(Y,\theta)\right)\right]=\mathbb{E}_{n}\left[m(Y,\theta)\right]^{\prime}\mathbb{E}_{n}\left[m(Y,\theta)m(Y,\theta)^{\prime}\right]^{-1}\mathbb{E}_{n}\left[m(Y,\theta)\right]

2.3 Off-Policy Reinforcement Learning

Next, we review the Smoothed Bellman Error Embedding (SBEED) algorithm introduced by Dai et al. [2018], a popular off-policy learning method in robotics. Off-policy learning aims to learn the optimal policy for an agent from data that was generated under an entirely different policy regime. This problem is not limited to robotics: since it was identified in the monetary policy context by Lucas [1976], it has become a primary concern in econometrics, and its recognition played a key role in the credibility revolution (Angrist and Pischke [2010]). While problem definitions otherwise differ between these literatures, off-policy learning methods have received recent interest in econometrics (Zhan et al. [2021], Athey and Wager [2021]).

For an agent receiving reward $R(s,a)$ for taking action $a\in\mathcal{A}$ at state $s\in\mathcal{S}$, and forming an expectation over the future state $s^{+}\in\mathcal{S}$, SBEED's goal is to learn the value function $V_{*}(s)$ and policy $a\sim P_{*}(\cdot|s)$ which satisfy the regularized Bellman equation:

V_{*}(s)=\max_{P(\cdot|s)}\mathbb{E}_{a\sim P(\cdot|s)}\Big[R(s,a)+\beta\,\mathbb{E}_{s^{+}|s,a}[V_{*}(s^{+})\,|\,s,a]\Big]+H(P,s)

where the entropy $H(P,s)=-\mathbb{E}_{a\sim P(\cdot|s)}[\log P(a|s)]$ regularizes the optimal policy $P_{*}(\cdot|s)$ towards exploring all actions $a\in\mathcal{A}$. Given the researcher's choice of $R,\beta$, the goal is to learn $P_{*},V_{*}$ from finite samples $\{(s_{i},a_{i},s_{i}^{+})\}_{i=1}^{n}$. Importantly, the actions $a_{i}$ may be sampled from a suboptimal policy which does not equal $P_{*}$. Starting from the first-order condition of the Bellman equation, Dai et al. [2018] develop an adversarial population objective, whose finite-sample analog is the A-estimator 1.2, 1.3 with loss:

l(\theta,\lambda,Y)=\big(R(s,a)+\beta V_{\theta}(s^{+})-V_{\theta}(s)-\log P_{\theta}(a|s)\big)\lambda(s,a)-\frac{1}{2}\lambda(s,a)^{2} \qquad (2.4)

where $\lambda(s,a)$, $\log P_{\theta}(a|s)$ and $V_{\theta}(s)$ are implemented as neural networks in practice.
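The entropy-regularized maximization in the Bellman equation above has a well-known closed form: $\max_{P}\mathbb{E}_{a\sim P}[Q(s,a)]+H(P,s)=\log\sum_{a}e^{Q(s,a)}$, attained by the softmax policy. The following sketch (our own toy MDP, not from Dai et al. [2018]) iterates the resulting 'soft' value iteration to the fixed point $V_{*}$.

```python
import numpy as np

# toy MDP: 2 states, 2 actions, known rewards and transition probabilities
R = np.array([[1.0, 0.0], [0.0, 2.0]])        # R[s, a]
T = np.array([[[0.9, 0.1], [0.2, 0.8]],       # T[s, a, s+]
              [[0.5, 0.5], [0.1, 0.9]]])
beta = 0.9

# With entropy regularization, max_P E_P[Q(s,a)] + H(P, s) = log sum_a exp(Q(s,a)),
# so the regularized Bellman equation becomes a "soft" value iteration.
V = np.zeros(2)
for _ in range(500):
    Q = R + beta * (T @ V)                     # Q[s, a] = R + beta * E[V(s+)|s,a]
    V = np.log(np.exp(Q).sum(axis=1))          # soft maximum over actions

# the maximizing policy is the softmax of Q
P_star = np.exp(Q) / np.exp(Q).sum(axis=1, keepdims=True)
print(V, P_star)
```

At the fixed point, one can check that $\mathbb{E}_{a\sim P_{*}}[Q(s,a)]+H(P_{*},s)=V_{*}(s)$ in each state, i.e. the regularized Bellman equation holds.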

2.4 A-Estimators for Conditional Moment Restrictions

Another powerful application of A-estimators recently pursued by the econometric literature is conditional moment estimation. These methods estimate parameters $\theta_{*}$ which are identified by restrictions of the form:

\mathbb{E}[m(X,\theta)\,|\,Z]=0\ \ \forall Z \iff \theta=\theta_{*} \qquad (2.5)

for some random variables $Y=(X,Z)$ and a known function $m(X,\theta)$. Conditions of this type occur e.g. when estimating some causal effect $\theta$ via instrumental variables, or as the first-order conditions of agents optimizing some expected utility given some information $Z$. As a result, nonparametric conditional moment estimators have received considerable interest in econometrics; see e.g. Ai and Chen [2003, 2007], Chen and Qiu [2016]. These earlier estimators rely on first-step estimates of nuisance parameters capturing the conditional means and variances. Intuitively, however, estimating the nuisance parameters via predictive objectives in a separate first step may dedicate scarce model capacity to capturing features which are not useful for estimating $\theta_{*}$ downstream. This motivates recent work on adversarial objectives which unify the estimation into a single objective, more plausibly targeting the nuisance estimation towards the goal of identifying $\theta_{*}$. Specifically, we will examine the estimator of Dikkala et al. [2020], with $l(\theta,\lambda,Y)=m(X,\theta)^{\prime}\lambda(Z)-\frac{1}{4}\|\lambda(Z)\|_{2}^{2}$, yielding the finite-sample objective

\inf_{\theta\in\Theta_{n}}\sup_{\lambda\in\Lambda_{n}}\mathbb{E}_{n}\left[m(X,\theta)^{\prime}\lambda(Z)-\frac{1}{4}\lambda(Z)^{\prime}\lambda(Z)\right],

where $\Lambda_{n}$ is a class of neural networks. The methods proposed by Bennett et al. [2019a], Bennett and Kallus [2020] are closely related, but differ in the penalty they impose on $\lambda$. Dikkala et al. [2020] consider the case of instrumental variable regression, where $X=(y,x)$ and $m(X,\theta)=y-\theta(x)$, but we will examine the general case. We note that Example 2.3 (SBEED, Dai et al. [2018]) can be viewed as a special case of a re-scaled version of this objective, with $X=(s,a,s^{+})$ and $Z=(s,a)$, although both literatures seem to be unaware of this connection. One can analytically solve for the optimal adversary $\lambda_{*}^{\theta}(Z)=2\mathbb{E}[m(X,\theta)|Z]$ to rewrite the population objective as:

\mathbb{E}[l(\theta,Y)]:=\mathbb{E}[l(\theta,\lambda_{*}^{\theta},Y)]=\mathbb{E}\big[\|\mathbb{E}[m(X,\theta)|Z]\|_{2}^{2}\big]=\mathbb{E}\big[\|\mathbb{E}[m(X,\theta)-m(X,\theta_{*})|Z]\|_{2}^{2}\big]

which can be understood as a measure of distance between $\theta$ and $\theta_{*}$: it clearly attains its minimum at $\theta=\theta_{*}$, where $\mathbb{E}[m(X,\theta)|Z]\equiv 0$. In Section 4.4, we will apply our theory to derive the asymptotic distribution of this estimator and show that it is in fact inefficient. We further discuss how the adversarial formulation of GMM can directly inform a simple modification, similar to Bennett and Kallus [2020], which yields an efficient A-estimator.
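To build intuition for this objective, consider a sketch of our own (with linear sieves $\theta(x)=\theta x$, $\lambda(z)=cz$ in place of neural networks, in the scalar instrumental variable setting): concentrating out the adversary turns the game into minimizing $\mathbb{E}_{n}[(y-\theta x)z]^{2}/\mathbb{E}_{n}[z^{2}]$, so the minimizer is the classical instrumental-variables estimator, which recovers a causal coefficient that least squares misses.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
z = rng.normal(size=n)                 # instrument
u = rng.normal(size=n)                 # unobserved confounder
x = z + u + rng.normal(size=n)        # endogenous regressor
y = 1.5 * x + u + rng.normal(size=n)  # true causal effect: 1.5

# With theta(x) = theta*x and lam(z) = c*z, the objective
# E_n[(y - theta*x)*c*z - (c*z)^2/4] has inner solution
# c(theta) = 2*E_n[(y - theta*x)*z] / E_n[z^2], with concentrated value
# E_n[(y - theta*x)*z]^2 / E_n[z^2] -- minimized where the moment
# E_n[(y - theta*x)*z] vanishes, i.e. at the IV estimator.
theta_adv = np.mean(y * z) / np.mean(x * z)   # minimizer of the game
theta_ols = np.mean(y * x) / np.mean(x * x)   # ignores endogeneity

print(theta_adv, theta_ols)  # theta_adv near 1.5, theta_ols biased upward
```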

2.5 Estimating Riesz Representers

Chernozhukov et al. [2020] propose a distinct A-estimator for the Riesz representers of structural parameters $\phi_{*}$ which can be written as linear functionals $\phi_{*}=\phi(g_{*})=\mathbb{E}[m(Y,g_{*})]$. Here, $g_{*}=\mathbb{E}[y|x]$ is an unknown function for which an estimate $\widehat{g}_{n}$ is available from some first-stage regression of $y$ on $x$, where $Y=(y,x,w)$. Quantities like $\phi_{*}$ are common in the average treatment effect or asset pricing literatures, for example. Unfortunately, especially if $\widehat{g}_{n}$ is estimated via machine learning, the 'naive' estimator

\widehat{\phi}_{n}=\mathbb{E}_{n}[m(Y,\widehat{g}_{n})]

is often not well behaved: $\sqrt{n}(\widehat{\phi}_{n}-\phi_{*})$ may not converge in distribution to a Gaussian limit, and thus one cannot provide confidence intervals around the estimate. Under the conditions of the Riesz representation theorem, however, there may exist a function $\theta_{*}\in\Theta$, called the Riesz representer of the functional $\phi(g)$, which satisfies:

\phi(g)=\mathbb{E}[\theta_{*}(x)g(x)]\quad\forall g\in\Theta

If a well-behaved estimate $\widehat{\theta}_{n}$ of $\theta_{*}$ is available, it can be combined with $\widehat{g}_{n}$ to define the so-called orthogonalized estimator:

\tilde{\phi}_{n}=\mathbb{E}_{n}\big[m(Y,\widehat{g}_{n})-\widehat{\theta}_{n}(x)\,(y-\widehat{g}_{n}(x))\big]

which attains asymptotic normality under rather weak conditions on $\widehat{g}_{n}$ (see Lemma 17 of Chernozhukov et al. [2020]). Chernozhukov et al. [2020] propose a generalized procedure to estimate $\widehat{\theta}_{n}$ via an A-estimator, which we will simplify as follows:

\inf_{\theta\in\Theta_{n}}\sup_{\lambda\in\Lambda_{n}}\mathbb{E}_{n}[m(Y,\lambda)-\theta(x)\lambda(x)-\lambda(x)^{2}/2]

where $\Theta_{n},\Lambda_{n}$ are neural networks. To clarify why this objective works, it is useful to analytically solve the adversarial component of the corresponding population objective:

\sup_{\lambda}\mathbb{E}[m(Y,\lambda)-\theta(x)\lambda(x)-\lambda(x)^{2}/2]=\frac{1}{2}\mathbb{E}[(\theta_{*}(x)-\theta(x))^{2}]

As we will show in Section 4.5, our theory directly yields the convergence rates for $\widehat{\theta}_{n}$ that Chernozhukov et al. [2020]'s Lemma 17 requires for the asymptotic normality of $\tilde{\phi}_{n}$. It does so at a reduced curse of dimensionality in $x$ for rather general function classes, i.e. under weaker conditions on the smoothness and dimension of the data, complementing the original work.
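As a toy check of the identity above (our own construction): for a weighted-mean functional $\phi(g)=\mathbb{E}[v(x)g(x)]$ with known weight $v$, the Riesz representer is $\theta_{*}(x)=v(x)$, and the adversarial population objective indeed evaluates to $\frac{1}{2}\mathbb{E}[(\theta_{*}(x)-\theta(x))^{2}]$.

```python
import numpy as np

# discrete x with probabilities p; functional phi(g) = E[v(x) g(x)],
# i.e. m(Y, g) = v(x) g(x), whose Riesz representer is theta*(x) = v(x)
p = np.array([0.2, 0.5, 0.3])
v = np.array([1.0, -0.5, 2.0])        # theta*
theta = np.array([0.5, 0.0, 1.0])     # some candidate theta

# population objective sup_lam E[v*lam - theta*lam - lam^2/2],
# solved pointwise at lam(x) = v(x) - theta(x)
lam_star = v - theta
value = np.sum(p * (v * lam_star - theta * lam_star - lam_star ** 2 / 2))

print(value, 0.5 * np.sum(p * (v - theta) ** 2))  # equal
```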

3 General Theory

Roadmap.

This section presents our general theory of A-estimators. Subsection 3.1 briefly discusses an alternative definition of A-estimators that may be more natural to some readers. In Subsection 3.2, we establish that A-estimators satisfy the desirable condition of Neyman orthogonality with respect to the adversary, and discuss its implications. Next, we characterize the convergence rates of A-estimators: Subsection 3.3 provides a high-level result for arbitrary sieves such as splines or wavelets, not just neural nets. Under more easily verifiable low-level conditions, Subsection 3.4 provides convergence rates for semiparametric settings involving neural networks, showing that they exhibit a reduced curse of dimensionality. Finally, we characterize the asymptotic normality of smooth functionals of A-estimators. We again begin with a general, high-level result for arbitrary sieves, followed by low-level conditions for the normality of neural networks. Notably, we show that a combination of undersmoothing and regularizing towards a convex target space suffices to overcome a key issue for normality proofs of neural networks: their non-convex sieve space.

Notation.

Throughout, we consider a random variable $Y$ with support $\mathcal{Y}$, distribution $\mathbb{P}$ and corresponding expectation operator $\mathbb{E}$. We denote the variance operator by $\mathbb{V}[f(Y)]=\mathbb{E}(f(Y)-\mathbb{E}[f(Y)])^{2}$ for any function $f:\mathcal{Y}\mapsto\mathbb{R}$. We denote the sample average, i.e. the expectation under the empirical distribution $\mathbb{P}_{n}$, by $\mathbb{E}_{n}$. Throughout, $\mathbb{E}$ treats estimated parameters as deterministic sequences indexed by $n$, as is common in the literature. We also consider subvectors of $Y$, denoted by $x\in\mathcal{X}$, $\bar{x}\in\bar{\mathcal{X}}$, with their respective supports $\mathcal{X},\bar{\mathcal{X}}$ being subspaces of $\mathcal{Y}$. We require various norms: throughout, $\|x\|_{q}$ denotes the $\ell^{q}$ norm of a finite-dimensional vector $x$, with $\|x\|=\|x\|_{2}$ being the Euclidean norm. For a possibly vector-valued function $f(x)$, we denote its $L^{q}$ function norm over some subset $\widetilde{\mathcal{X}}\subset\mathcal{X}$ by $\|f\|_{L^{q}(\widetilde{\mathcal{X}})}=\mathbb{E}[\|f(x)\|_{q}^{q}\,|\,x\in\widetilde{\mathcal{X}}]^{1/q}$. We denote the supremum norm of a vector $x$ with components $x_{i}$ by $\|x\|_{\infty}=\max_{i}|x_{i}|$, and the supremum norm of $f$ over $\widetilde{\mathcal{X}}$ by $\|f\|_{\widetilde{\mathcal{X}}}=\sup_{x\in\widetilde{\mathcal{X}}}\|f(x)\|_{\infty}$. For $\widetilde{\mathcal{X}}=\mathcal{X}$, we may omit the dependence on $\mathcal{X}$ by writing $\|f\|_{\infty}:=\|f\|_{\mathcal{X}}$. We often write $a\prec b$ to denote $a=O(b)$, meaning that a sufficiently large global constant $\infty>C>0$ exists such that $a\leq Cb$, where $C$ does not depend on any varying aspects of the problem, such as parameters or sample sizes. We write $a\asymp b$ if $a\prec b\prec a$, and $a\vee b=\max(a,b)$, $a\wedge b=\min(a,b)$. Throughout, we write $l^{\theta}(\lambda,Y)=l(\theta,\lambda,Y)$ and $l(\theta,Y)=l(\theta,\lambda_{*}^{\theta},Y)$ for short, where $\lambda_{*}^{\theta}=\arg\max_{\lambda\in\Lambda}\mathbb{E}\,l(\theta,\lambda,Y)$. We denote by $\pi_{n}$ a (not necessarily linear) projection onto the respective sieves, i.e. $\pi_{n}\theta\in\arg\inf_{\theta^{\prime}\in\Theta_{n}}\|\theta^{\prime}-\theta\|_{\infty}$ for any $\theta\in\Theta$ and $\pi_{n}\lambda\in\arg\inf_{\lambda^{\prime}\in\Lambda_{n}}\|\lambda^{\prime}-\lambda\|_{\infty}$ for any $\lambda\in\Lambda$.

3.1 Nash vs Minimax

We presented our preferred definition for A-estimators in the introduction, as satisfying a Nash condition of the empirical loss. All results of this paper will apply to this definition. However, the reader may have noticed that the “simultaneous” Nash condition of the estimator is symmetric in θ^n\widehat{\theta}_{n} and λ^n\widehat{\lambda}_{n}, unlike the ‘sequential’ mini-max population objective, which nests a family of inner maximizations:

λθ=argmaxλΛ𝔼l(θ,λ,Y)\lambda_{*}^{\theta}=\arg\max_{\lambda\in\Lambda}\mathbb{E}l(\theta,\lambda,Y) (3.1)

where the loss ll and as a result the solutions λθ\lambda_{*}^{\theta} are indexed by the parameter θΘ\theta\in\Theta. The reader may therefore wonder if we could define an A-estimator for θ\theta_{*} in a similar ‘sequential’ mini-max fashion. That is, we could consider a family of M-estimators λ^nθ\widehat{\lambda}_{n}^{\theta} approximately maximizing the empirical loss at any value of θΘ\theta\in\Theta:

λ^nθΛn:𝔼nl(θ,λ^nθ,Y)supλΛn𝔼nl(θ,λ,Y)ηnθΘn\widehat{\lambda}_{n}^{\theta}\in\Lambda_{n}:\ \ \mathbb{E}_{n}l\left(\theta,\widehat{\lambda}_{n}^{\theta},Y\right)\geq\sup_{\lambda\in\Lambda_{n}}\mathbb{E}_{n}l(\theta,\lambda,Y)-\eta_{n}\ \ \forall\theta\in\Theta_{n} (3.2)

And then look for θ^nΘn\widehat{\theta}_{n}\in\Theta_{n} satisfying:

𝔼nl(θ^n,λ^nθ^n,Y)infθΘn𝔼nl(θ,λ^nθ,Y)+η¯n\mathbb{E}_{n}l(\widehat{\theta}_{n},\widehat{\lambda}_{n}^{\widehat{\theta}_{n}},Y)\leq\inf_{\theta\in\Theta_{n}}\mathbb{E}_{n}l(\theta,\widehat{\lambda}_{n}^{\theta},Y)+\bar{\eta}_{n} (3.3)

where η¯n=o(1)\bar{\eta}_{n}=o_{\mathbb{P}}(1) again accommodates approximate minimization. Fortunately, it turns out that any θ^n\widehat{\theta}_{n} satisfying the more compact Nash condition from the introduction always satisfies the mini-max condition presented above, as summarized by the following Lemma:

Lemma 3.1.

Any \widehat{\theta}_{n} satisfying 1.2 and 1.3 for some \widehat{\lambda}_{n} also satisfies 3.3 with \bar{\eta}_{n}=\widetilde{\eta}_{n}+\eta_{n}, for some family \widehat{\lambda}_{n}^{\theta} which satisfies 3.2 and \widehat{\lambda}_{n}^{\widehat{\theta}_{n}}=\widehat{\lambda}_{n}.

Proof.

Pick any θ^n\widehat{\theta}_{n} satisfying 1.2 and 1.3 for some λ^n\widehat{\lambda}_{n}. Now pick some arbitrary family λ^nθ\widehat{\lambda}_{n}^{\theta} satisfying 3.2 for all θθ^n\theta\neq\widehat{\theta}_{n}, and define λ^nθ^n:=λ^n\widehat{\lambda}_{n}^{\widehat{\theta}_{n}}:=\widehat{\lambda}_{n}. Note that 1.3 directly implies that this λ^nθ\widehat{\lambda}_{n}^{\theta} also satisfies 3.2 at θ=θ^n\theta=\widehat{\theta}_{n}. It remains to show that the resulting θ^n\widehat{\theta}_{n} and λ^nθ\widehat{\lambda}_{n}^{\theta} satisfy 3.3:

𝔼nl(θ^n,λ^nθ^n,Y)infθΘn𝔼nl(θ,λ^n,Y)+η~ninfθΘn𝔼nl(θ,λ^nθ,Y)+η~n+ηn\displaystyle\mathbb{E}_{n}l(\widehat{\theta}_{n},\widehat{\lambda}_{n}^{\widehat{\theta}_{n}},Y)\leq\inf_{\theta\in\Theta_{n}}\mathbb{E}_{n}l(\theta,\widehat{\lambda}_{n},Y)+\widetilde{\eta}_{n}\leq\inf_{\theta\in\Theta_{n}}\mathbb{E}_{n}l(\theta,\widehat{\lambda}_{n}^{\theta},Y)+\widetilde{\eta}_{n}+\eta_{n}

where the first inequality used λ^nθ^n\widehat{\lambda}_{n}^{\widehat{\theta}_{n}} and the Nash condition 1.2, and the second used the fact that λ^nθ\widehat{\lambda}_{n}^{\theta} was constructed to satisfy 3.2. ∎

This reassures us that it suffices to find one set of values θ^n,λ^n\widehat{\theta}_{n},\widehat{\lambda}_{n} which satisfy the Nash condition from the introduction, rather than a continuum of solutions λ^nθ\widehat{\lambda}_{n}^{\theta} indexed by θ\theta. The final θ^n\widehat{\theta}_{n} will satisfy the mini-max condition regardless, for some (unknown) λ^nθ\widehat{\lambda}_{n}^{\theta}. For our theory, it was crucial to derive the uniform convergence of λ^nθ\widehat{\lambda}_{n}^{\theta}, hence we will state the rate results for the more general mini-max definition. For the normality result, it was more convenient to work with the stronger Nash definition.
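To make the distinction concrete, the following is a minimal numerical sketch (the quadratic toy loss, the learning rate, the grid and all constants are our own illustrative choices, not part of the paper's theory). It computes a 'Nash' A-estimator by simultaneous gradient descent-ascent and verifies the approximate best-response conditions 1.2 and 1.3 afterwards, instead of re-solving the inner maximization for every candidate \theta as the sequential definition 3.2-3.3 would:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=500)
ybar = y.mean()

# Toy empirical criterion E_n l(theta, lambda, Y) with the illustrative loss
# l(theta, lam, y) = lam*(y - theta) - lam**2/2; maximizing out lam yields
# (E_n[y - theta])**2 / 2, a continuous-updating GMM-type objective.
def L_n(theta, lam):
    return np.mean(lam * (y - theta) - 0.5 * lam**2)

# 'Nash' A-estimator: simultaneous descent in theta and ascent in lambda.
theta, lam, lr = 0.0, 0.0, 0.1
for _ in range(3000):
    grad_theta = -lam                 # d/dtheta of E_n l
    grad_lam = (ybar - theta) - lam   # d/dlambda of E_n l
    theta -= lr * grad_theta
    lam += lr * grad_lam

# Verify the approximate Nash conditions 1.2-1.3 on a compact grid.
grid = np.linspace(-5, 5, 2001)
gap_theta = L_n(theta, lam) - min(L_n(t, lam) for t in grid)    # approx. 0
gap_lam = max(L_n(theta, lm) for lm in grid) - L_n(theta, lam)  # approx. 0
print(theta, ybar, gap_theta, gap_lam)
```

For this loss, the inner maximization gives \lambda_{*}^{\theta}=\mathbb{E}[Y]-\theta and the outer minimization recovers \theta_{*}=\mathbb{E}[Y], so the simultaneous iteration converges to (\widehat{\theta}_{n},\widehat{\lambda}_{n})\approx(\bar{y},0) with both Nash gaps near zero.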

3.2 Adversaries are Neyman-Orthogonal

For many A-estimators, one could construct non-adversarial estimators which capture the same population objective. Whenever the adversarial nuisance parameter λ\lambda is a function, this usually requires a non-parametric first-step estimation of an alternative nuisance parameter. However, such an alternative estimator may not have a desirable property that is guaranteed for A-estimators: Neyman-orthogonality of θ\theta_{*} with respect to the nuisance parameter.

This property has a long history in statistics, dating back at least to Neyman [1959]. It was popularized in econometrics by Chernozhukov et al. [2017] as a key setting in which standard machine learning methods can be applied without invalidating causal inference, which sparked follow-up work such as Chernozhukov et al. [2021] seeking to reformulate non-orthogonal problems as orthogonal ones. The notion applies to parameters which are identified by a moment restriction of the form:

𝔼[φ(θ,ν,Y)]=0θ=θ\mathbb{E}[\varphi(\theta,\nu_{*},Y)]=0\iff\theta=\theta_{*} (3.4)

where φ\varphi is known and ν\nu_{*} is an unknown nuisance parameter which has to be estimated in a first step. A popular estimator θ^n\widehat{\theta}_{n} in this setting would be Hansen [1982]’s GMM, for example. The moment condition above is called (Neyman-)orthogonal whenever:

νν𝔼[φ(θ,ν,Y)]=0ν\nabla_{\nu_{*}\to\nu}\mathbb{E}[\varphi(\theta_{*},\nu_{*},Y)]=0\quad\forall\nu (3.5)

Intuitively, this states that the condition identifying θ\theta_{*} is “locally robust” against perturbations in ν\nu_{*}. This guarantees that the uncertainty introduced by an appropriate first-step estimation of ν\nu_{*} has no first-order effect on the GMM estimator θ^n\widehat{\theta}_{n}. Specifically, the asymptotic distribution of θ^n\widehat{\theta}_{n} is the same as in the case in which ν\nu_{*} is known. In contrast, when moment restrictions do not satisfy this orthogonality condition, uncertainty about ν\nu_{*} generally amplifies the asymptotic variance of θ^n\widehat{\theta}_{n}, see e.g. Chen and Liao [2015], and normality may break down altogether.

Notably, if (and only if) \theta_{*} is parametric, we can examine the first-order condition for \theta_{*} implied by the A-estimation objective 1.1 within this moment restriction framework.444Note, however, that even when \theta_{*} is parametric, we usually cannot estimate it via GMM, as \nabla_{\theta}l(\theta,\widehat{\lambda}_{n}^{\theta},Y) will not exist if \widehat{\lambda}_{n}^{\theta} is a typical sieve, such as a neural network. For the same reason, the theory developed in this paper must not rely on any finite-sample first-order conditions; instead, it uses only the approximate Nash conditions 1.2 and 1.3. Let \varphi(\theta,\nu_{*},Y)=\nabla_{\theta}l(\theta,\nu_{*}(\theta),Y), where \nu_{*}:\Theta\mapsto\Lambda denotes the functional evaluating to \nu_{*}(\theta)=\lambda_{*}^{\theta}. Orthogonality then follows from the continuum of first-order conditions identifying \nu_{*}:

νν𝔼[l(,ν(),Y)]𝟎νν𝔼[φ(θ,ν,Y)]=ννθ𝔼[l(θ,ν(θ),Y)]=θ𝟎\nabla_{\nu_{*}\to\nu}\mathbb{E}[l(\cdot,\nu_{*}(\cdot),Y)]\equiv\mathbf{0}\implies\nabla_{\nu_{*}\to\nu}\mathbb{E}[\varphi(\theta_{*},\nu_{*},Y)]=\nabla_{\nu_{*}\to\nu}\nabla_{\theta_{*}}\mathbb{E}[l(\theta_{*},\nu_{*}(\theta_{*}),Y)]=\nabla_{\theta_{*}}\mathbf{0}

since the derivative operators are exchangeable. This implies that, as \widehat{\theta}_{n} approaches \theta_{*}, the A-estimator is robust to estimation errors in the adversary \widehat{\lambda}_{n}^{\theta} relative to \lambda_{*}^{\theta}: such errors do not reduce the accuracy of \widehat{\theta}_{n} to first order.

Consider the example of Section 2.1, which estimates \theta_{*} minimizing the f-divergence between the model \mathbb{P}_{\theta} and the data \mathbb{P}. As a non-adversarial alternative, we could re-parametrize the problem and estimate \nu_{*}:=d\mathbb{P} via a first-step kernel density estimator \widehat{\nu}_{n}(Y)=\widehat{d\mathbb{P}}_{n}(Y), and subsequently approximate the f-divergence as the average of f\left(\frac{d\mathbb{P}_{\theta}(Y)}{\widehat{\nu}_{n}(Y)}\right). However, the first-order condition for \theta_{*} would not satisfy orthogonality, so a GMM estimator based on it may not attain the variance of the analogous GMM estimator using \nu_{*} instead. In contrast, the A-estimator of Section 2.1 does attain this variance, due to its orthogonal adversary, as we formally establish in Section 4.1. Moreover, this remains true when \theta_{*} contains unknown functions, in which case no analogous GMM estimator exists that could capture the continuum of first-order conditions in \theta_{*}.
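At its core, the orthogonality in 3.5 is an envelope-theorem property: because \lambda_{*}^{\theta} maximizes the inner objective, perturbing the adversary away from \lambda_{*}^{\theta} moves the population criterion only at second order. A finite-difference sketch with a toy quadratic loss (the loss and all numbers are our own illustrative assumptions, not an example from the paper):

```python
mu = 1.3                    # illustrative population mean E[Y]

def pop_loss(theta, lam):
    # E l(theta, lam, Y) for the toy loss l = lam*(y - theta) - lam**2/2
    return lam * (mu - theta) - 0.5 * lam**2

theta = 0.7
lam_star = mu - theta       # inner maximizer lambda_*^theta = E[Y] - theta
w = 0.9                     # arbitrary perturbation direction

for eps in (1e-2, 1e-3, 1e-4):
    drop = pop_loss(theta, lam_star) - pop_loss(theta, lam_star + eps * w)
    print(eps, drop)        # equals eps**2 * w**2 / 2: no first-order term
```

Quartering eps quarters the drop, confirming that the first-order effect of the perturbation vanishes, which is exactly the local robustness that orthogonality formalizes.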

3.3 Convergence Rate of A-Estimators

We begin with a general theorem characterizing the convergence rates of sieve A-estimators, for arbitrary loss functions and parameter spaces. It can be viewed as a generalization of Shen and Wong [1994]'s M-estimator result. Its proof is provided in Appendix C.1; the main challenge is that Shen and Wong [1994]'s chaining arguments need to be carefully modified to hold uniformly over \Theta. Our theorem adopts a more compact formulation than Shen and Wong [1994], stating its assumptions without requiring any norm over \Theta,\Lambda, although convergence rates are obtained for any (pseudo-)norm d(\theta,\theta_{*}) that is dominated by the objective.

Theorem 3.1 (Convergence Rate of A-Estimators).

Assume that:

  • C1: The criterion variance is bounded by a power γ>0\gamma>0 of its expectation:

    𝕍[l(θ,Y)l(θ,Y)]\displaystyle\mathbb{V}[l(\theta,Y)-l(\theta_{*},Y)] 𝔼[l(θ,Y)l(θ,Y)]γ\displaystyle\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]^{\gamma} (3.6)
    𝕍[lθ(λθ,Y)lθ(λ,Y)]\displaystyle\mathbb{V}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\lambda,Y)] 𝔼[lθ(λθ,Y)lθ(λ,Y)]γ\displaystyle\prec\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\lambda,Y)]^{\gamma} (3.7)

    for all θΘ,λΛ\theta\in\Theta,\lambda\in\Lambda for which the right hand sides are less than some constant.

  • C2: For all small ε>0\varepsilon>0, the covering number (Def. 1) is bounded via

    log𝒩(ε,{l(θ,λ,):θΘn,λΛn},)\displaystyle\log\mathcal{N}(\varepsilon,\{l(\theta,\lambda,\cdot):\theta\in\Theta_{n},\lambda\in\Lambda_{n}\},\|\cdot\|_{\infty}) ns(εr1)/r\displaystyle\prec n^{s}(\varepsilon^{-r}-1)/r (3.8)

    for 0s<10\leq s<1 and r0r\geq 0, where r=0r=0 represents limr0ns(εr1)/r=nslog(1/ε)\lim_{r\to 0}n^{s}(\varepsilon^{-r}-1)/r=n^{s}\log(1/\varepsilon).

Then the following conclusions hold.

  • i)

    The criterion converges at rate:

    𝔼[l(θ,Y)l(θ^n,Y)]\displaystyle\mathbb{E}[l(\theta_{*},Y)-l(\widehat{\theta}_{n},Y)] =O(nτ(γ,s,r,n)+ϵn+ηn+ϵ¯n+η¯n)\displaystyle=O_{\mathbb{P}}(n^{-\tau(\gamma,s,r,n)}+\epsilon_{n}+\eta_{n}+\bar{\epsilon}_{n}+\bar{\eta}_{n}) (3.9)
    supθΘn𝔼[lθ(λθ,Y)lθ(λ^nθ,Y)]\displaystyle\sup_{\theta\in\Theta_{n}}\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\widehat{\lambda}_{n}^{\theta},Y)] =O(nτ(γ,s,r,n)+ϵn+ηn)\displaystyle=O_{\mathbb{P}}(n^{-\tau(\gamma,s,r,n)}+\epsilon_{n}+\eta_{n}) (3.10)

    where ϵ¯n=𝔼[l(πnθ,Y)l(θ,Y)]\bar{\epsilon}_{n}=\mathbb{E}[l(\pi_{n}\theta_{*},Y)-l(\theta_{*},Y)] and ϵn=supθΘn𝔼[lθ(λθ,Y)lθ(πnλθ,Y)]\epsilon_{n}=\sup_{\theta\in\Theta_{n}}\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\pi_{n}\lambda_{*}^{\theta},Y)] are the sieve approximation errors. 3.10 also holds without 3.6. τ(γ,s,r,n)\tau(\gamma,s,r,n) represents:

    τ(γ,s,r,n)={1sloglognlogn, if r=0,γ11s2γ, if r=0,γ<11s2min(1,γ)(2r)/2, if 0<r<21s2loglognlogn, if r=21sr, if r>2\tau(\gamma,s,r,n)=\left\{\begin{array}[]{ll}{1-s}-\frac{\log\log n}{\log n},&\text{ if }r=0,\gamma\geq 1\\ \frac{1-s}{2-\gamma},&\text{ if }r=0,\gamma<1\\ \frac{1-s}{2-\min(1,\gamma)(2-r)/2},&\text{ if }0<r<2\\ \frac{1-s}{2}-\frac{\log\log n}{\log n},&\text{ if }r=2\\ \frac{1-s}{r},&\text{ if }r>2\end{array}\right.
  • ii)

Hence, d(\widehat{\theta}_{n},\theta_{*})=o_{\mathbb{P}}(1) for any (pseudo-)norm d(\cdot,\cdot) under which \Theta is compact and \theta\mapsto\mathbb{E}[l(\theta,Y)] is continuous. If also d(\theta,\theta_{*})^{1/q}\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)] for some q>0, we get:

    d(θ^n,θ)=O(nτ(γ,s,r,n)q+ϵnq+ηnq+ϵ¯nq+η¯nq)d(\widehat{\theta}_{n},\theta_{*})=O_{\mathbb{P}}(n^{-\tau(\gamma,s,r,n)q}+\epsilon_{n}^{q}+\eta_{n}^{q}+\bar{\epsilon}_{n}^{q}+\bar{\eta}_{n}^{q})
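For reference, the piecewise exponent \tau(\gamma,s,r,n) in i) can be transcribed into code directly; the helper below is a plain transcription of the display above (ours, not the paper's):

```python
import math

def tau(gamma, s, r, n):
    """Rate exponent tau(gamma, s, r, n) from Theorem 3.1 (plain transcription)."""
    llog = math.log(math.log(n)) / math.log(n)
    if r == 0:
        return (1 - s) - llog if gamma >= 1 else (1 - s) / (2 - gamma)
    if 0 < r < 2:
        return (1 - s) / (2 - min(1.0, gamma) * (2 - r) / 2)
    if r == 2:
        return (1 - s) / 2 - llog
    return (1 - s) / r   # r > 2

# e.g. gamma = 1, s = 0, r = 0: exponent 1 - log log n / log n
print(tau(1, 0, 0, 10_000))
```

The r=0 branch is the parametric-entropy regime, and for gamma=1 the loglog n/log n correction reproduces the extra log n factor familiar from Shen and Wong [1994]-type results.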
Remark 3.1 (Discussion of Assumptions).

The theorem extends Shen and Wong [1994]'s convergence rate result for sieve M-estimators to A-estimators. There is a direct mapping between our assumptions and theirs: our C1 combines their assumptions C1 and C2, and our C2 corresponds to their C3. Our proof in Appendix C.1 is structured in the same way as that of Shen and Wong [1994], although we need to modify their lemmas to obtain the uniform convergence of the adversary in 3.10, which is crucial to the main result 3.9. The key modifications to our assumptions which allow us to do so are: C1) the constant factor implicit in the “\prec” relation of 3.7 must not depend on \theta, as implied by the definition of “\prec” at the beginning of this section; and C2) the complexity of the joint sieve space \Theta_{n}\times\Lambda_{n} must satisfy the entropy bound. Otherwise, the assumptions are conceptually the same, and we refer the reader to Shen and Wong [1994] for a more detailed discussion.

Remark 3.2.

Using similar arguments as ours, one may establish the uniform convergence of A-estimators over a third parameter space, generalizing the setting to arbitrary finite sequences of min\min’s and max\max’s over different parameter spaces: e.g. minθmaxλminγ𝔼[l(θ,λ,γ,Y)]\min_{\theta}\max_{\lambda}\min_{\gamma}\mathbb{E}[l(\theta,\lambda,\gamma,Y)]. This would yield convergence rates towards more general Stackelberg equilibria in so-called empirical games, for which we are currently only aware of a consistency result by Tuyls et al. [2018].

Remark 3.3.

Beyond convergence rates for θ^n\widehat{\theta}_{n} and λ^nθ\widehat{\lambda}_{n}^{\theta}, it is often useful to control the empirical process of arbitrary functions f(θ,λ,Y)f(\theta,\lambda,Y) of the parameters, e.g. to establish conditions for asymptotic normality required by Theorem 3.3. For this purpose, we provide Lemma B.5 in Appendix B.

3.4 Semiparametric Rates with Neural Networks

Next, we apply the general result of the previous subsection to derive convergence rates for neural network A-estimators. For generality, we consider the semiparametric setting in which \theta,\lambda may contain both Euclidean vectors and functions. These lower-level conditions are easy to verify in practice, yet general enough to apply to all estimators considered in Section 2. We include the proof, as it is short and an instructive application of Theorem 3.1. The theorem allows for two types of function classes, both of which can be viewed as generalizations of traditional Hölder functions with D-dimensional domain, each with its own notion of an intrinsic dimension d^{*}\leq D, which may be smaller than the dimension D of the data. As we review in Remark 3.5, neural networks achieve a reduced curse of dimensionality in these settings.

Theorem 3.2 (Semiparametric Rates with Neural Networks).

Consider the semiparametric setting in which Θ=¯×𝒜¯\Theta=\bar{\mathcal{B}}\times\bar{\mathcal{A}} and Λ=×𝒜\Lambda=\mathcal{B}\times\mathcal{A}, where ¯,\bar{\mathcal{B}},\mathcal{B} are subsets of some Euclidean spaces and 𝒜¯,𝒜\bar{\mathcal{A}},\mathcal{A} are some function spaces. Let Λ,Θ\Lambda,\Theta be compact under \|\cdot\|_{\infty}. For all λ,λΛ,θ,θΘ\lambda,\lambda^{\prime}\in\Lambda,\ \theta,\theta^{\prime}\in\Theta, assume the following conditions hold:

  • A0: Assume that θΘ\theta_{*}\in\Theta_{*} satisfies either

    • a)

      Θ¯×(p¯,𝒳¯)\Theta_{*}\subset\bar{\mathcal{B}}\times\mathcal{H}(\bar{p},\bar{\mathcal{X}}) on some 𝒳¯[0,1]D¯\bar{\mathcal{X}}\subset[0,1]^{\bar{D}} with dimM𝒳¯=d¯D¯\operatorname{dim}_{M}\bar{\mathcal{X}}=\bar{d}^{*}\leq\bar{D} (see Def. 3 and 4)

    • b)

      Θ¯×𝒢(p¯,d¯,[0,1]D¯)\Theta_{*}\subset\bar{\mathcal{B}}\times\mathcal{G}(\bar{p},\bar{d}^{*},[0,1]^{\bar{D}}) (see Def. 6)

    and that {λθ:θΘ}Λ\{\lambda_{*}^{\theta}:\theta\in\Theta\}\subset\Lambda_{*} satisfies either

    • a)

      Λ×(p,𝒳)\Lambda_{*}\subset\mathcal{B}\times\mathcal{H}(p,{\mathcal{X}}) on some 𝒳[0,1]D{\mathcal{X}}\subset[0,1]^{D} with dimM𝒳=dD\operatorname{dim}_{M}{\mathcal{X}}=d^{*}\leq D

    • b)

      Λ×𝒢(p,d,[0,1]D)\Lambda_{*}\subset\mathcal{B}\times\mathcal{G}(p,d^{*},[0,1]^{D})

  • A1: l(θ,λ,Y)l(θ,λ,Y)θθ𝒳¯+λλ𝒳l(\theta,\lambda,Y)-l(\theta^{\prime},\lambda^{\prime},Y)\prec\|\theta-\theta^{\prime}\|_{\bar{\mathcal{X}}}+\|\lambda-\lambda^{\prime}\|_{{\mathcal{X}}}

  • A2: 𝕍[l(θ,Y)l(θ,Y)]𝔼[l(θ,Y)l(θ,Y)]θθ𝒳~2+(x¯𝒳~)𝒳~𝒳¯\mathbb{V}[l(\theta,Y)-l(\theta_{*},Y)]\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]\prec\|\theta-\theta_{*}\|^{2}_{\widetilde{{\mathcal{X}}}}+\mathbb{P}(\bar{x}\not\in\widetilde{{\mathcal{X}}})\ \forall\widetilde{{\mathcal{X}}}\subset\bar{\mathcal{X}}

  • A3: 𝕍[lθ(λθ,Y)lθ(λ,Y)]𝔼[lθ(λθ,Y)lθ(λ,Y)]λλθ𝒳~2+(x𝒳~)𝒳~𝒳\mathbb{V}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\lambda,Y)]\prec\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\lambda,Y)]\prec\|\lambda-\lambda_{*}^{\theta}\|^{2}_{\widetilde{{\mathcal{X}}}}+\mathbb{P}(x\not\in\widetilde{\mathcal{X}})\ \forall\widetilde{{\mathcal{X}}}\subset{\mathcal{X}}

Pick any two values r¯>r¯(dpd¯p¯)\bar{r}>\underline{r}\geq\left(\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}\right). Consider the A-estimator 3.2 with ηn,η¯n=o(n2/(2+r¯))\eta_{n},\bar{\eta}_{n}=o_{\mathbb{P}}(n^{-2/(2+\bar{r})}) where Λn=×σ(L,Wn,wn,κn)\Lambda_{n}=\mathcal{B}\times\mathcal{F}_{\sigma}(L,W_{n},w_{n},\kappa_{n}) and Θn=¯×σ(L¯,W¯n,w¯n,κ¯n)\Theta_{n}=\bar{\mathcal{B}}\times\mathcal{F}_{\sigma}(\bar{L},\bar{W}_{n},\bar{w}_{n},\bar{\kappa}_{n}) implement neural networks (cf. Definition 2) satisfying Wn,W¯n,wn,w¯nnr¯/(r¯+2)W_{n},\bar{W}_{n},w_{n},\bar{w}_{n}\asymp n^{\underline{r}/(\underline{r}+2)} and κn,κ¯nnc\kappa_{n},\bar{\kappa}_{n}\asymp n^{c} for any large enough choice of L,L¯,c>0L,\bar{L},c>0. For A0a) choose σ(x)=ReLU(x)\sigma(x)=\mathrm{ReLU}(x) and for A0b) choose σ(x)=tanh(x)\sigma(x)=\tanh(x). Then:

𝔼[l(θ^n,Y)l(θ,Y)]=o(n2/(2+r¯))\mathbb{E}[l(\widehat{\theta}_{n},Y)-l(\theta_{*},Y)]=o_{\mathbb{P}}(n^{-2/(2+\bar{r})})
supθΘn𝔼[lθ(λθ,Y)lθ(λ^nθ,Y)]=o(n2/(2+r¯))\sup_{\theta\in\Theta_{n}}\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\widehat{\lambda}_{n}^{\theta},Y)]=o_{\mathbb{P}}(n^{-2/(2+\bar{r})})

Hence, d(\widehat{\theta}_{n},\theta_{*})=o_{\mathbb{P}}(1) for any (pseudo-)norm d(\cdot,\cdot) under which \Theta is compact and \theta\mapsto\mathbb{E}[l(\theta,Y)] is continuous. Further, if d(\theta,\theta_{*})^{1/q}\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)] for some q>0, we get:

d(θ^n,θ)=o(n2q/(2+r¯))d(\widehat{\theta}_{n},\theta_{*})=o_{\mathbb{P}}(n^{-2q/(2+\bar{r})})
Proof.

We will verify the conditions of Theorem 3.1. A2 and A3 imply C1 (3.6 and 3.7) with \gamma=1. Lipschitzness A1 together with Lemma B.1 imply C2 (B.3) with s=t/(t+2) for any t:\bar{r}>t>\underline{r} and r=0. Therefore Theorem 3.1 applies with n^{-\tau(\gamma,s,r,n)}=n^{-2/(2+t)}\log n\prec n^{-2/(2+\bar{r})}, which dominates \eta_{n} and \bar{\eta}_{n} by assumption. We are therefore left with bounding \epsilon_{n} and \bar{\epsilon}_{n}. By A3, we can bound \epsilon_{n}\prec\sup_{\theta\in\Theta_{n}}\|\pi_{n}\lambda_{*}^{\theta}-\lambda_{*}^{\theta}\|_{\widetilde{\mathcal{X}}}^{2}+\mathbb{P}(x\not\in\widetilde{\mathcal{X}}) for any \widetilde{\mathcal{X}}\subset\mathcal{X}. In the case of A0a), we set \widetilde{\mathcal{X}}=\mathcal{X} and use Lemma B.2 to obtain \sup_{\theta\in\Theta_{n}}\|\pi_{n}\lambda_{*}^{\theta}-\lambda_{*}^{\theta}\|_{\mathcal{X}}^{2}\prec(W_{n}\wedge w_{n})^{-2p/d^{*}}\prec n^{-2p\underline{r}/(d^{*}(2+\underline{r}))}\prec n^{-2/(2+\underline{r})}, which yields \epsilon_{n}=o(n^{-2/(2+\bar{r})}). For A0b), Lemma B.3 yields the same bound as Lemma B.2, but only over a subset \widetilde{\mathcal{X}}\subset\mathcal{X} with \mathbb{P}(x\not\in\widetilde{\mathcal{X}})\prec n^{-k} for some arbitrarily large constant k>0, which only affects the constant c in the bound on \kappa_{n}. Hence we conclude that \epsilon_{n}\prec n^{-2/(2+\underline{r})}+n^{-k}\prec n^{-2/(2+\underline{r})}. Analogous arguments yield the same bound for \bar{\epsilon}_{n}. ∎
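The tuning prescribed by Theorem 3.2 amounts to simple bookkeeping: pick \underline{r},\bar{r} slightly above d^{*}/p\vee\bar{d}^{*}/\bar{p}, scale the widths as W_{n}\asymp n^{\underline{r}/(\underline{r}+2)}, and read off the criterion rate o_{\mathbb{P}}(n^{-2/(2+\bar{r})}). A sketch of this arithmetic (the helper name, the slack constant and the example inputs are our own illustrative assumptions; the \asymp proportionality constants are set to one):

```python
def sieve_plan(n, p, d_star, p_bar, d_bar_star, slack=0.05):
    """Illustrative tuning for Theorem 3.2 (proportionality constants set to 1)."""
    r_lo = max(d_star / p, d_bar_star / p_bar) + slack  # r_underline >= d*/p v dbar*/pbar
    r_hi = r_lo + slack                                 # any r_bar > r_underline
    width = n ** (r_lo / (r_lo + 2))                    # W_n, w_n growth
    rate = n ** (-2 / (2 + r_hi))                       # the criterion is o_P of this
    return width, rate

width, rate = sieve_plan(n=100_000, p=2.0, d_star=3, p_bar=2.0, d_bar_star=3)
print(round(width), rate)
```

Note that only the ratios d^{*}/p and \bar{d}^{*}/\bar{p} enter, which is precisely the reduced curse of dimensionality discussed in Remark 3.5: the ambient dimension D never appears.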

Remark 3.4 (Discussion of Assumptions).

A0 defines the function classes addressed by the theorem. Both are generalizations of traditional Hölder classes, which arise for d^{*}=D; see Remark 3.5. Condition A1 requires the loss to be Lipschitz in both parameters, which simplifies (but is not necessary for) the verification of C2. Condition A2 (and analogously A3) consists of two parts. First, it states that for a given parameter, the variance of the criterion difference must be bounded by its expectation, a simplified version of Assumption C1 of Theorem 3.1 with \gamma=1, which happens to be satisfied in all of our examples; versions of this theorem with \gamma\neq 1 can be derived via the same steps as the proof above. The second part bounds the expected loss by a squared \sup-norm over any subset \widetilde{\mathcal{X}} of the function domain \mathcal{X}. For the case of A0a), it would have sufficed to state the condition with \widetilde{\mathcal{X}}=\mathcal{X} only, but for A0b) we require arbitrary subsets \widetilde{\mathcal{X}} to apply the approximation result of Lemma B.3. A2 is implied, for example, by \mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]\prec\|h(\theta)-h(\theta_{*})\|^{2}_{L^{q}(\mathcal{X})} for some q and Lipschitz map h:\Theta\mapsto\Theta. The assumption is significantly weaker than Shen et al. [2019] or Farrell et al. [2018], who impose \mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]\asymp\|\theta-\theta_{*}\|^{2}_{L^{2}(\mathcal{X})}, which would not hold for Examples 2.2 or 2.4. It could be generalized further to allow for arbitrary powers of the \sup-norms (and proved in the same way via Theorem 3.1), but the squares arise rather universally via Taylor expansions.

Remark 3.5.

Theorem 3.2 clarifies that neural networks need not exhibit the curse of dimensionality: the lower bound on \bar{r} does not depend on the dimension D of the data, but only on the intrinsic dimension d^{*} of the target function. In the setting A0a), introduced by Nakada and Imaizumi [2020], d^{*} refers to the Minkowski dimension of the manifold \mathcal{X} which supports the data. It has been observed that d^{*}\ll D for many high-dimensional types of data: intuitively, d^{*} is low whenever there is strong statistical dependency between the individual dimensions of the data. Examples include the characteristics of physical products, images and natural language. In the setting A0b), introduced by Bauer et al. [2019], d^{*} refers to the order of a generalized hierarchical interaction model. It is common for structural models in e.g. economics or optimal control to suggest that an unknown function is hierarchically composed of a finite number of individual functions which each depend on only d^{*}\ll D inputs at a time. The result underscores that neural networks can adaptively (that is, without the researcher modifying the estimation procedure) exploit structure in the target function, modeling these relationships more efficiently than standard convergence results suggest.

3.5 Asymptotic Normality of A-Estimators

In applications, we are often interested in estimating a quantity of the form F(\theta_{*}), where F:\Theta\mapsto\mathbb{R} is some known functional. To derive confidence intervals around the plug-in estimate F(\widehat{\theta}_{n}), we need its asymptotic distribution. To this end, we present Theorem 3.3, which can roughly be viewed as a generalization of Shen [1997] to A-estimators. For this section, we make use of the pathwise derivative presented in Definition 7. We require a particular inner product over the space \Theta:

θ,θ:=θθθθ𝔼[l(θ,Y)]\langle\theta,\theta^{\prime}\rangle:=\nabla_{\theta_{*}\to\theta}\nabla_{\theta_{*}\to\theta^{\prime}}\mathbb{E}[l(\theta_{*},Y)]

As discussed in Definition 7, the notation θθ\nabla_{\theta_{*}\to\theta} implicitly assumes that the corresponding limit exists and is linear in θ\theta. For short, we write λθ[v]:=θvλθ\lambda_{*}^{\prime\theta}[v]:=\nabla_{\theta\to v}\lambda_{*}^{\theta}, l(θ,Y)[v]:=θvl(θ,Y)l^{\prime}(\theta,Y)[v]:=\nabla_{\theta\to v}l(\theta,Y) and l(θ,λ,Y)[v,w]:=θvl(θ,λ,Y)+λwl(θ,λ,Y)l^{\prime}(\theta,\lambda,Y)[v,w]:=\nabla_{\theta\to v}l(\theta,\lambda,Y)+\nabla_{\lambda\to w}l(\theta,\lambda,Y).

Theorem 3.3 (General Normality of A-Estimators).

Consider the estimators θ^n,λ^n\widehat{\theta}_{n},\widehat{\lambda}_{n} satisfying the Nash conditions 1.2 and 1.3. Fix a sequence en=o(n1/2)e_{n}=o(n^{-1/2}). Assume FF is smooth enough and θ^n,λ^n\widehat{\theta}_{n},\widehat{\lambda}_{n} converge fast enough such that a Riesz representer vΘv_{*}\in\Theta_{*} exists, satisfying:

supθΘ^n|F(θ)F(θ)θθ,v|=O(en)\sup_{\theta\in\widehat{\Theta}_{n}}|F(\theta)-F(\theta_{*})-\langle\theta-\theta_{*},v_{*}\rangle|=O_{\mathbb{P}}(e_{n}) (3.11)

where \widehat{\Theta}_{n} and \widehat{\Lambda}_{n}(\theta) are the shrinking neighborhoods defined in Lemma B.5. For v\in\{v_{*},-v_{*}\}, define the local perturbations \bar{\theta}_{n}(\theta)=\theta-e_{n}v and \bar{\lambda}_{n}^{\theta}(\lambda)=\lambda+e_{n}\lambda^{\prime\theta}_{*}[v] and assume:

  • CONDITION N1: Stochastic Equicontinuity

    supθΘ^n,λΛ^n(θ)(𝔼n𝔼)l(θ,λ,Y)[v,λθ[v]]l(θ,Y)[v]=O(en)\sup_{\theta\in\widehat{\Theta}_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)}(\mathbb{E}_{n}-\mathbb{E})l^{\prime}(\theta,\lambda,Y)[v,\lambda_{*}^{\prime\theta}[v]]-l^{\prime}(\theta_{*},Y)[v]=O_{\mathbb{P}}(e_{n})
  • CONDITION N2: Population Criterion Difference

    supθΘ^n,λΛ^n(θ)𝔼l(θ,λ,Y)[v,λθ[v]]l(θ,Y)[v]θθ,v=O(en)\sup_{\theta\in\widehat{\Theta}_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)}\mathbb{E}l^{\prime}(\theta,\lambda,Y)[v,\lambda_{*}^{\prime\theta}[v]]-l^{\prime}(\theta_{*},Y)[v]-\langle\theta-\theta_{*},v\rangle=O_{\mathbb{P}}(e_{n})
  • CONDITION N3: Approximation Error

    supθΘ^n,λΛ^n(θ)𝔼nl(θ,λ,Y)[θ¯n(θ)πnθ¯n(θ),λ¯nθ(λ)πnλ¯nθ(λ)]=O(en2)\sup_{\theta\in\widehat{\Theta}_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)}\mathbb{E}_{n}l^{\prime}(\theta,\lambda,Y)[\bar{\theta}_{n}(\theta)-\pi_{n}\bar{\theta}_{n}(\theta),\bar{\lambda}_{n}^{\theta}(\lambda)-\pi_{n}\bar{\lambda}_{n}^{\theta}(\lambda)]=O_{\mathbb{P}}(e_{n}^{2})

If 1.2 and 1.3 are satisfied with η~n,ηn=O(en2)\widetilde{\eta}_{n},\eta_{n}=O_{\mathbb{P}}(e_{n}^{2}), then:

n(F(θ^n)F(θ))𝑑𝒩(0,V), where V=𝕍(l(θ,Y)[v])\sqrt{n}\left(F(\widehat{\theta}_{n})-F(\theta_{*})\right)\overset{d}{\longrightarrow}\mathcal{N}(0,V),\ \text{ where }V=\mathbb{V}\left(l^{\prime}(\theta_{*},Y)[v_{*}]\right)
Remark 3.6 (Discussion of Assumptions).

In contrast to our convergence rate result, our proof requires the A-estimator to satisfy the (stronger) Nash condition from the introduction. Our conditions N1-3 are analogues of Shen [1997]’s and play the same roles in our proof. N1 combines their assumptions A and D, N2 corresponds to their B, and N3 to their C. Shen [1997]’s high-level discussion of their assumptions therefore applies to ours as well, and we again refer the reader there for additional context. The main difference is that their conditions are formulated to control the remainder of a second order Taylor expansion, whereas we look at the convergence of the first derivative, which results in O(en)=o(n1/2)O_{\mathbb{P}}(e_{n})=o_{\mathbb{P}}(n^{-1/2}) requirements for N1 and N2, rather than the O(en2)=o(n1)O_{\mathbb{P}}(e_{n}^{2})=o_{\mathbb{P}}(n^{-1}) found in Shen [1997]’s conditions A and B.

Remark 3.7.

Condition N3 is a version of a known condition on approximation error in M-estimation settings (see Condition C4 in Shen et al. [2019] and Condition C in Shen and Wong [1994]). Its verification usually exploits convexity of Θn\Theta_{n}, such that πnθ¯n(θ)=θ+enπnv\pi_{n}\bar{\theta}_{n}(\theta)=\theta+e_{n}\pi_{n}v_{*}. This holds for series or kernel based estimators, but not neural networks. Shen et al. [2019] therefore leave it as an explicit assumption, concluding that it is unclear how to verify it for neural networks. In Theorem 3.4, we resolve this issue, showing that N3 can be verified for non-convex sieves such as neural networks by adhering to two simple implementation choices: 1) undersmoothing, i.e. choosing a sieve which grows faster than rate-optimal, achieving an approximation error of o(n1)o(n^{-1}) and 2) regularizing the sieves towards the convex target classes containing θ,λ\theta_{*},\lambda_{*}.
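In applied terms, the payoff of Theorem 3.3 is the plug-in interval F(\widehat{\theta}_{n})\pm 1.96\sqrt{\widehat{V}/n}, with \widehat{V} the sample variance of the estimated influence values l^{\prime}(\widehat{\theta}_{n},Y)[\widehat{v}_{*}]. A minimal sketch for the simplest M-estimation special case l(\theta,y)=(y-\theta)^{2} and F(\theta)=\theta, where \langle u,v\rangle=2uv gives v_{*}=1/2 and influence -(Y-\theta_{*}) (a worked special case of ours, not an example from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(2.0, 1.5, size=2000)

# M-estimation special case of 1.1: l(theta, y) = (y - theta)**2, F(theta) = theta.
theta_hat = Y.mean()                 # minimizer of the empirical criterion

# Here <u, v> = 2uv and F(theta) - F(theta_*) = theta - theta_*, so v_* = 1/2
# and the influence function is l'(theta_*, Y)[v_*] = -(Y - theta_*).
score_hat = -(Y - theta_hat)         # plug-in influence values
V_hat = score_hat.var()              # estimate of V in Theorem 3.3

n = len(Y)
half_width = 1.96 * np.sqrt(V_hat / n)
ci = (theta_hat - half_width, theta_hat + half_width)
print(theta_hat, ci)
```

In this special case \widehat{V} reduces to the sample variance of Y, recovering the textbook interval for a sample mean, which is a useful sanity check on the general formula V=\mathbb{V}(l^{\prime}(\theta_{*},Y)[v_{*}]).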

3.6 Semiparametric Normality with Neural Networks

Next, we present Theorem 3.4, which strengthens the assumptions of our previous neural network convergence rate result (Theorem 3.2) in a way that allows us to derive the asymptotic normality of functionals F(\widehat{\theta}_{n}) via Theorem 3.3. A crucial innovation is that we work around the non-convexity issues of deep neural networks discussed in Remark 3.7 to obtain a normality result from low-level conditions, which consist only of general properties that the loss function must satisfy (A4-A7) and certain implementation choices for the neural networks. To the best of our knowledge, the theorem therefore also provides the first low-level conditions for the normality of smooth functionals of deep neural network M-estimators (as the special case where \Lambda is a singleton).

Theorem 3.4 (Semiparametric Normality with Neural Networks).

Let all assumptions of Theorem 3.2 be satisfied with dpd¯p¯<1/4\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}<1/4, and choose 2r¯>r¯>2/32\geq\bar{r}>\underline{r}>2/3. Let Θ,Λ\Theta_{*},\Lambda_{*} be convex and θ,λθ\theta_{*},\lambda_{*}^{\theta} lie in their interior. Replace the neural network sieves Θn,Λn\Theta_{n},\Lambda_{n} with the following regularized versions:

Θn{θΘn:infθΘθθ𝒳¯n1ϵ},Λn{λΛn:infλΛλλ𝒳n1ϵ}\Theta_{n}\leftarrow\{\theta\in\Theta_{n}:\inf_{\theta^{\prime}\in\Theta_{*}}\|\theta-\theta^{\prime}\|_{\bar{\mathcal{X}}}\prec n^{-1-\epsilon}\},\quad\Lambda_{n}\leftarrow\{\lambda\in\Lambda_{n}:\inf_{\lambda^{\prime}\in\Lambda_{*}}\|\lambda-\lambda^{\prime}\|_{{\mathcal{X}}}\prec n^{-1-\epsilon}\}

for any \epsilon>0 which is small enough to guarantee that \Theta_{n},\Lambda_{n} are nonempty. Further, for all \theta,v\in\Theta,\ \lambda,w\in\Lambda, assume:

  • A4: Lipschitz Derivative: l(θ,λ,Y)[v,w]l(θ,λ,Y)[v,w]θθ𝒳¯+λλ𝒳l^{\prime}(\theta,\lambda,Y)[v,w]-l^{\prime}(\theta^{\prime},\lambda^{\prime},Y)[v,w]\prec\|\theta-\theta^{\prime}\|_{\bar{\mathcal{X}}}+\|\lambda-\lambda^{\prime}\|_{\mathcal{X}}

  • A5: The perturbations are smooth: vΘv_{*}\in\Theta_{*}, λθ[v]Λ\lambda_{*}^{\prime\theta}[v_{*}]\in\Lambda_{*}

  • A6: The Taylor remainders vanish with the loss:

    • i)

      |𝔼l(θ,λ,Y)[v,λθ[v]]l(θ,Y)[v]|𝔼[l(θ,λθ,Y)l(θ,λ,Y)]|\mathbb{E}l^{\prime}(\theta,\lambda,Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]-l^{\prime}(\theta,Y)[v_{*}]|\prec\mathbb{E}[l(\theta,\lambda_{*}^{\theta},Y)-l(\theta,\lambda,Y)]

    • ii)

      |𝔼l(θ,Y)[v]l(θ,Y)[v]θθ,v|𝔼[l(θ,Y)l(θ,Y)]|\mathbb{E}l^{\prime}(\theta,Y)[v_{*}]-l^{\prime}(\theta_{*},Y)[v_{*}]-\langle\theta-\theta_{*},v_{*}\rangle|\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]

  • A7: For non-Donsker classes, the variance of the derivatives is bounded by the loss:

    • i)

      𝕍[l(θ,λ,Y)[v,λθ[v]]l(θ,λθ,Y)[v,λθ[v]]]𝔼[l(θ,λθ,Y)l(θ,λ,Y)]\mathbb{V}[l^{\prime}(\theta,\lambda,Y)[v,\lambda_{*}^{\prime\theta}[v]]-l^{\prime}(\theta,\lambda_{*}^{\theta},Y)[v,\lambda_{*}^{\prime\theta}[v]]]\prec\mathbb{E}[l(\theta,\lambda_{*}^{\theta},Y)-l(\theta,\lambda,Y)] or Λ\Lambda_{*} is \mathbb{P}-Donsker

    • ii)

      𝕍[l(θ,Y)[v]l(θ,Y)[v]]𝔼[l(θ,Y)l(θ,Y)]\mathbb{V}[l^{\prime}(\theta,Y)[v]-l^{\prime}(\theta_{*},Y)[v]]\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)] or Θ\Theta_{*} is \mathbb{P}-Donsker

If θ^n,λ^n\widehat{\theta}_{n},\widehat{\lambda}_{n} satisfy the Nash condition 1.2,1.3 with ηn,η~n=o(n1)\eta_{n},\tilde{\eta}_{n}=o_{\mathbb{P}}(n^{-1}), then:

n(F(θ^n)F(θ))𝑑𝒩(0,V), where V=𝕍(l(θ,Y)[v])\sqrt{n}\left(F(\widehat{\theta}_{n})-F(\theta_{*})\right)\overset{d}{\longrightarrow}\mathcal{N}(0,V),\ \text{ where }V=\mathbb{V}\left(l^{\prime}(\theta_{*},Y)[v_{*}]\right)
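To make the Nash condition concrete, the following toy sketch (ours, not taken from the paper) computes a scalar A-estimator in which the adversary's best response is available in closed form and can be profiled out at each step. With the loss l(θ,λ,y) = λ(y−θ) − λ²/4, the inner maximization is solved by λ = 2𝔼n[y−θ], and the resulting saddle point is the sample mean:

```python
import numpy as np

# Toy A-estimator of a population mean (illustrative; our own example).
# Loss: l(theta, lam, y) = lam*(y - theta) - lam**2/4.  The inner max over lam
# has the closed form lam = 2*E_n[y - theta]; substituting it back profiles the
# adversary out, and gradient descent on theta solves the outer minimization.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=2000)

theta, lr = 0.0, 0.1
for _ in range(500):
    lam = 2.0 * np.mean(y - theta)   # adversary's best response
    theta -= lr * (-lam)             # d/dtheta of E_n[lam*(y - theta)] is -lam

assert abs(theta - y.mean()) < 1e-6  # the saddle point is the sample mean
```

With neural-network sieves the inner maximization is instead approximated by gradient ascent steps on λ, which is where the approximate Nash condition 1.2, 1.3 with tolerances ηn, η̃n becomes the natural solution concept.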
Remark 3.8 (Discussion of Assumptions).

The Theorem requires that the neural network sieves Θn,Λn\Theta_{n},\Lambda_{n} are implemented to undersmooth (i.e. grow faster than the rate-optimal sieve would) via the condition on r¯\underline{r}, while being regularized towards the convex target spaces Θ,Λ\Theta_{*},\Lambda_{*}. Note that this does not affect the sieve’s approximation power towards these spaces, and there always exists an ϵ>0\epsilon>0 for which Θn,Λn\Theta_{n},\Lambda_{n} are non-empty due to their o(n1)o(n^{-1}) approximation rates. While in principle just an implementation choice, the current sup\sup-norm regularization is arguably not practical and future work may be able to clarify whether e.g. an appropriate L2 penalty on the weights suffices. Conditions A4-A7 are general conditions on the loss function which can be satisfied in all our examples. A4 is a simple Lipschitz condition analogous to A1. The smoothness of the Riesz representer (A5) is most easily verified by computing and examining a given v,λθ[v]v_{*},\lambda_{*}^{\prime\theta}[v_{*}] directly, although the Riesz representation theorem can provide general conditions under which vv_{*} lives in the same space as θ\theta_{*}. A6 is a standard condition controlling the Taylor remainder. For a discussion, see e.g. Assumptions 4.5 in Ai and Chen [2003] and Ai and Chen [2007], or Assumption 3.5ii) in Chen and Pouzo [2015]. Whether it holds depends on how non-linear the objective is: e.g. for the quadratic objective of Dikkala et al. [2020], the left-hand side is zero. A7 serves to control the empirical process (N1). It can be easily satisfied either by bounding the variances of the derivatives, or by relying on the Donsker property of the target space (cf. Remark 3.9).

Remark 3.9.

Note that the Donsker property, and thus A7, always holds if p>D/2p>D/2: standard results using bracketing number bounds then imply that the Hölder spaces Θ,Λ\Theta_{*},\Lambda_{*} satisfy the Donsker property. We conjecture that this analogously holds for our lower-dimensional classes A0a) and A0b) whenever d/p<2d^{*}/p<2, which would make the verification of A7 unnecessary in general, since we require dpd¯p¯<1/4\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}<1/4. Verifying this conjecture is beyond the scope of this paper, however, so we provide A7 as an explicit assumption for maximum flexibility.

4 Application to Examples

4.1 Minimum ff-Divergence

Applying our general Theorem 3.2 to the estimator of Section 2.1 yields Proposition 4.1, which provides the convergence rate of semiparametric θ^n\widehat{\theta}_{n} if Λ\Lambda and all unknown functions in Θ\Theta are approximated by classes of neural networks.

Proposition 4.1.

Let θΘΘ,λθ=f(dθd)ΛΛ\theta_{*}\in\Theta_{*}\subset\Theta,\lambda_{*}^{\theta}=f^{\prime}(\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}})\in\Lambda_{*}\subset\Lambda, where Θ,Λ\Theta,\Lambda are compact under \|\cdot\|_{\infty} and path-connected, and the target function classes Θ,Λ\Theta_{*},\Lambda_{*} satisfy A0 in Theorem 3.2. Fix some C<C<\infty. For any θΘ\theta\in\Theta, let 0<f′′(dθd(Y))<C0<f^{\prime\prime}\left(\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}(Y)\right)<C wp1 and for any λΛ\lambda\in\Lambda, let 0<f′′(λ(Y))<C0<f^{\prime\prime}_{*}(\lambda(Y))<C wp1. Let dθddθdθθ\left\|\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}-\frac{\mathrm{d}\mathbb{P}_{\theta^{\prime}}}{\mathrm{d}\mathbb{P}}\right\|_{\infty}\prec\|\theta-\theta^{\prime}\|_{\infty}. Let Θn,Λn\Theta_{n},\Lambda_{n} be constructed as in Theorem 3.2, with all neural networks growing in width at some rate nr¯/(r¯+2)n^{\underline{r}/(\underline{r}+2)} satisfying r¯dpd¯p¯\underline{r}\geq\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}. Then for any r¯>r¯\bar{r}>\underline{r}:

Df(θ^n)=o(n2/(2+r¯))D_{f}(\mathbb{P}_{\widehat{\theta}_{n}}\|\mathbb{P})=o_{\mathbb{P}}(n^{-2/(2+\bar{r})})
Remark 4.1.

In general, the slower the growth rate nr¯/(r¯+2)n^{\underline{r}/(\underline{r}+2)} of the neural network, the faster the convergence rate of θ^n\widehat{\theta}_{n}. However, the growth must be fast enough to control the approximation error of the sieves Θn,Λn\Theta_{n},\Lambda_{n} relative to the target function classes Θ,Λ\Theta_{*},\Lambda_{*}. This lower bound depends on the ratio of the smoothness pp and p¯\bar{p} of the target classes to their intrinsic dimensions dd^{*} and d¯\bar{d}^{*}, which may be smaller than that of the data YY, in which case ff-GANs attain a reduced curse-of-dimensionality relative to traditional nonparametric density estimators.

Remark 4.2.

This convergence rate result stands in contrast to Arora et al. [2017], who argued that Generative Adversarial Networks do not generalize with respect to the metric given by the population objective, only under a weaker “neural net distance” which they introduce. The convergence rate result above clarifies that the broad class of ff-GANs in fact does converge quickly under population divergence.
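The population objective behind this result is the variational representation of the f-divergence, D_f(ℙθ‖ℙ) = sup_λ 𝔼_{ℙθ}[λ(Y)] − 𝔼_ℙ[f∗(λ(Y))], attained at λ = f′(dℙθ/dℙ). On a discrete support the adversary is a finite-dimensional vector, so the identity can be checked exactly. A small sketch of ours for the χ² case, where f∗(t) = t + t²/4 (the same conjugate that reappears in Section 4.2):

```python
import numpy as np

# Variational characterization underlying the f-GAN objective (chi^2 case):
# D_f(P||Q) = sup_lam E_P[lam(Y)] - E_Q[f*(lam(Y))],  with f*(t) = t + t^2/4,
# attained at lam = f'(p/q) = 2*(p/q - 1).  With discrete support the adversary
# is a vector, so the supremum can be evaluated in closed form.
p = np.array([0.2, 0.5, 0.3])   # "model" distribution
q = np.array([0.4, 0.4, 0.2])   # "data" distribution

chi2 = np.sum((p / q - 1.0) ** 2 * q)        # closed-form chi^2 divergence

lam = 2.0 * (p / q - 1.0)                    # the optimal adversary f'(p/q)
f_star = lambda t: t + t ** 2 / 4.0
variational = np.sum(p * lam) - np.sum(q * f_star(lam))

assert np.isclose(chi2, variational)         # the sup attains the divergence
```

An f-GAN replaces the exact optimal adversary by a neural network and both expectations by sample averages, which is exactly the setting covered by Proposition 4.1.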

While a fast convergence rate of the model distribution θ^n\mathbb{P}_{\widehat{\theta}_{n}} is a key goal in semi- and nonparametric estimation, whenever some function F(θ^n)F(\widehat{\theta}_{n}) of the estimate informs downstream decision-making, we are often interested in obtaining confidence intervals around F(θ^n)F(\widehat{\theta}_{n}). To this end, we derive the asymptotic normality of smooth functionals under the adversarial ff-divergence objective - to the best of our knowledge, an entirely novel result at this level of generality. First, we compute the inner product defined in Section 3.5, which can be expressed concisely:

θ,θ=θθθθ𝔼[f(dθ(Y)d(Y))]=𝔼[θθlogdθ(Y)θθlogdθ(Y)]\langle\theta,\theta^{\prime}\rangle=\nabla_{\theta_{*}\to\theta}\nabla_{\theta_{*}\to\theta^{\prime}}\mathbb{E}\left[f\left(\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}(Y)}{\mathrm{d}\mathbb{P}(Y)}\right)\right]=\mathbb{E}\left[\nabla_{\theta_{*}\to\theta}\log\mathrm{d}\mathbb{P}_{\theta_{*}}(Y)\cdot\nabla_{\theta_{*}\to\theta^{\prime}}\log\mathrm{d}\mathbb{P}_{\theta_{*}}(Y)\right]

where θθlogdθ(Y)=θθdθ(Y)d(Y)\nabla_{\theta_{*}\to\theta}\log\mathrm{d}\mathbb{P}_{\theta_{*}}(Y)=\nabla_{\theta_{*}\to\theta}\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}(Y)}{\mathrm{d}\mathbb{P}(Y)} is a pathwise derivative of the Radon-Nikodym derivative. Conditions under which the normality result of Section 3.5 applies are presented in Proposition 4.2.
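In the parametric likelihood case, ∇_{θ∗→θ} log dℙ_{θ∗} = (θ−θ∗)′s(Y) with score s(Y) = ∇_θ log dℙ_θ(Y)|_{θ∗}, so the inner product reduces to the familiar Fisher information form (a standard computation, included here for intuition; it is what delivers the information matrix I in Corollary 4.2.1 below):

```latex
\langle\theta,\theta^{\prime}\rangle
  = (\theta-\theta_{*})^{\prime}\,
    \mathbb{E}\!\left[s(Y)\,s(Y)^{\prime}\right]
    (\theta^{\prime}-\theta_{*})
  = (\theta-\theta_{*})^{\prime}\, I(\theta_{*})\,(\theta^{\prime}-\theta_{*})
```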

Proposition 4.2.

Consider a functional F(θ)F(\theta) for which a Riesz representer vv_{*} exists satisfying 3.11 with ,\langle\cdot,\cdot\rangle defined above. Let all assumptions of Proposition 4.1 be satisfied for d/pd¯/p¯<1/4d^{*}/p\vee\bar{d}^{*}/\bar{p}<1/4 and assume that Θ\mathcal{\Theta}_{*} is Donsker. Let Θ,Λ\Theta_{*},\Lambda_{*} be convex, let θ,λθ\theta_{*},\lambda_{*}^{\theta} lie in their interior, and let them contain v,λ,θ[v]v_{*},\lambda^{\prime,\theta}_{*}[v_{*}]. Assume the Lipschitz condition θvdθdθvdθdθθ\|\nabla_{\theta\to v}\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}-\nabla_{\theta^{\prime}\to v}\frac{\mathrm{d}\mathbb{P}_{\theta^{\prime}}}{\mathrm{d}\mathbb{P}}\|_{\infty}\prec\|\theta-\theta^{\prime}\|_{\infty} and let f′′f^{\prime\prime} be Lipschitz. Pick 2r¯>r¯>2/32\geq\bar{r}>\underline{r}>2/3 and regularize Θn,Λn\Theta_{n},\Lambda_{n} as in Theorem 3.4. Finally, for any θ~,θ~\widetilde{\theta},\widetilde{\theta}^{\prime} on a path between θ\theta_{*} and θ\theta, assume that:

θ~θ~θθ~θθθ~vDf(θ~θ)Df(θθ)\nabla_{\widetilde{\theta}\to\widetilde{\theta}^{\prime}-\theta_{*}}\nabla_{\widetilde{\theta}\to\theta-\theta_{*}}\nabla_{\widetilde{\theta}\to v_{*}}D_{f}(\mathbb{P}_{\widetilde{\theta}}\|\mathbb{P}_{\theta_{*}})\prec D_{f}(\mathbb{P}_{\theta}\|\mathbb{P}_{\theta_{*}})

Then:

n(F(θ^n)F(θ))𝑑𝒩(0,v,v)\sqrt{n}(F(\widehat{\theta}_{n})-F(\theta_{*}))\overset{d}{\rightarrow}\mathcal{N}(0,\langle v_{*},v_{*}\rangle) (4.1)
Remark 4.3.

In applications, the key difficulty lies in verifying that the third derivative above is bounded by the loss. This condition serves to control the higher order term of the Taylor expansion. Such assumptions are common in the semiparametric literature, e.g. Ai and Chen [2003]’s Assumptions 4.5 and 4.6 play the same role. It is easiest to verify in the parametric setting, where

θ~θ~θθ~θθθ~vDf(θθ)θθ22Df(θθ)\nabla_{\widetilde{\theta}\to\widetilde{\theta}^{\prime}-\theta_{*}}\nabla_{\widetilde{\theta}\to\theta-\theta_{*}}\nabla_{\widetilde{\theta}\to v_{*}}D_{f}(\mathbb{P}_{\theta}\|\mathbb{P}_{\theta_{*}})\asymp\|\theta-\theta_{*}\|^{2}_{2}\asymp D_{f}(\mathbb{P}_{\theta}\|\mathbb{P}_{\theta_{*}})

Note that ,\langle\cdot,\cdot\rangle and hence the asymptotics of θ^n\widehat{\theta}_{n} are independent of ff, so the ff-divergences are asymptotically equivalent. An example of a smooth functional F(θ)F(\theta) that is of particular interest in the semiparametric setting θ=(β,α)\theta=(\beta,\alpha) is F(θ)=βζF(\theta)=\beta^{\prime}\zeta, which “picks out” a linear combination of the parametric components. This allows us to derive the asymptotic normality of the vector n(β^β)\sqrt{n}(\widehat{\beta}-\beta_{*}) in the following Corollary, which makes use of the orthogonal scores assumption that is standard in the semiparametric literature.

Corollary 4.2.1.

In addition to the assumptions of Proposition 4.2, assume the orthogonal scores condition holds:

𝔼[ββlogdβ,α(Y)ααlogdβ,α(Y)]=0β,α\mathbb{E}\left[\nabla_{\beta_{*}\to\beta}\log\mathrm{d}\mathbb{P}_{\beta_{*},\alpha_{*}}(Y)\nabla_{\alpha_{*}\to\alpha}\log\mathrm{d}\mathbb{P}_{\beta_{*},\alpha_{*}}(Y)\right]=0\quad\forall\beta,\alpha

Then the parametric component β^n\widehat{\beta}_{n} attains the Cramér-Rao bound:

n(β^nβ)𝑑𝒩(0,I1), where I=𝔼[βlogdβ,α(Y)βlogdβ,α(Y)]\sqrt{n}(\widehat{\beta}_{n}-\beta_{*})\overset{d}{\rightarrow}\mathcal{N}\left(0,I^{-1}\right),\text{ where }I=\mathbb{E}\left[\nabla_{\beta_{*}}\log\mathrm{d}\mathbb{P}_{\beta_{*},\alpha_{*}}(Y)\cdot\nabla_{\beta_{*}^{\prime}}\log\mathrm{d}\mathbb{P}_{\beta_{*},\alpha_{*}}(Y)\right]
Proof.

We simply choose v=(I1ζ,𝟎)v_{*}=(I^{-1}\zeta,\mathbf{0}), such that θθ,v=(ββ)ζ=F(θ)F(θ)\langle\theta-\theta_{*},v_{*}\rangle=(\beta-\beta_{*})^{\prime}\zeta=F(\theta)-F(\theta_{*}). Since v,v=ζI1ζ\langle v_{*},v_{*}\rangle=\zeta^{\prime}I^{-1}\zeta, Proposition 4.2 yields n(β^nβ)ζ𝑑𝒩(0,ζI1ζ)\sqrt{n}(\widehat{\beta}_{n}-\beta_{*})^{\prime}\zeta\overset{d}{\to}\mathcal{N}(0,\zeta^{\prime}I^{-1}\zeta). The result then follows via the Cramér-Wold device. ∎

The ff-GAN objective therefore attains the efficient asymptotics of maximum likelihood, but does not require explicit knowledge of the model density θ\mathbb{P}_{\theta}.

4.2 Generalized Empirical Likelihood

For the class of Generalized Empirical Likelihood estimators introduced in Section 2.2, the n\sqrt{n}-normality and asymptotic efficiency of θ^n\widehat{\theta}_{n} is long established in the parametric case (Imbens et al. [1998], Imbens [2002], Newey and Smith [2004]). However, our theoretical framework still allows us to extend the known results to the semiparametric case where θ\theta may contain unknown functions, which are approximated by a class of neural networks Θn\Theta_{n} which may grow with nn. In this case, we can characterize the convergence rate of θ^n\widehat{\theta}_{n} to the identified set Θ={θΘ:𝔼[m(Y,θ)]=0}\Theta_{*}=\{\theta\in\Theta:\mathbb{E}[m(Y,\theta)]=0\}, which is unlikely to be a singleton, since an infinite-dimensional parameter is generally not pinned down by a finite number of unconditional moment restrictions. We obtain the following result:

Proposition 4.3.

Let Df=χ2D_{f}=\chi^{2} and consider the A-estimator θ^n,λ^n\widehat{\theta}_{n},\widehat{\lambda}_{n} satisfying 1.2,1.3 with l(θ,λ,Y)=f(λm(Y,θ))l(\theta,\lambda,Y)=-f_{*}(\lambda^{\prime}m(Y,\theta)). Let Θ,Θn\Theta_{*},\Theta_{n} be as in Theorem 3.2 and Λ=Λn=dim(m)\Lambda_{*}=\Lambda_{n}=\mathbb{R}^{\operatorname{dim}(m)}, with r¯>dp\bar{r}>\frac{d^{*}}{p}. Assume that m(Y,θ)m(Y,θ)θθm(Y,\theta)-m(Y,\theta^{\prime})\prec\|\theta-\theta^{\prime}\|_{\infty} and |m(Y,θ)|<|m(Y,\theta)|<\infty. Then:

𝔼[m(Y,θ^n)]=o(n1/(2+r¯))\mathbb{E}\left[m(Y,\widehat{\theta}_{n})\right]=o_{\mathbb{P}}(n^{-1/(2+\bar{r})})
Proof.

We verify the conditions of Theorem 3.2. Assumption A0 holds by assumption, and A1 follows from the Lipschitzness of m(Y,)m(Y,\cdot) and that of f(t)=t+t2/4f_{*}(t)=t+t^{2}/4. To verify Assumption A2, note that l(θ,Y)=0l(\theta_{*},Y)=0 and boundedness of m(Y,θ)m(Y,\theta) imply:

𝕍[l(θ,Y)l(θ,Y)]𝔼[(m(Y,θ)λθ)2]𝔼[l(θ,Y)l(θ,Y)]\mathbb{V}[l(\theta,Y)-l(\theta_{*},Y)]\prec\mathbb{E}[(m(Y,\theta)^{\prime}\lambda_{*}^{\theta})^{2}]\asymp\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]

For the second part of condition A2, simply verify that 𝔼[l(θ,Y)l(θ,Y)]λθλθ22θθ𝒳~2+(x¯𝒳~)\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]\prec\|\lambda_{*}^{\theta}-\lambda_{*}^{\theta_{*}}\|^{2}_{2}\prec\|\theta-\theta_{*}\|_{\widetilde{\mathcal{X}}}^{2}+\mathbb{P}(\bar{x}\not\in\widetilde{\mathcal{X}}), which follows by applying the Lipschitzness of mm in θ\theta and the tower-property of 𝔼\mathbb{E} to λθ=2𝔼[m(Y,θ)m(Y,θ)]1𝔼[m(Y,θ)]\lambda_{*}^{\theta}=-2\mathbb{E}[m(Y,\theta)m(Y,\theta)^{\prime}]^{-1}\mathbb{E}[m(Y,\theta)], akin to the proof of Proposition 4.1. Assumption A3 can be verified for the Euclidean λ\lambda via a Taylor expansion, yielding: 𝕍[l(θ,λ,Y)l(θ,λθ,Y)]λλθ22𝔼[l(θ,λ,Y)l(θ,λθ,Y)]\mathbb{V}[l(\theta,\lambda,Y)-l(\theta,\lambda_{*}^{\theta},Y)]\asymp\|\lambda-\lambda_{*}^{\theta}\|_{2}^{2}\asymp\mathbb{E}[l(\theta,\lambda,Y)-l(\theta,\lambda_{*}^{\theta},Y)]. ∎
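The closed form of the adversary used in this proof makes the χ² case easy to check numerically: profiling λ out of the objective 𝔼n[−f∗(λ′m)] yields exactly the continuous-updating GMM criterion 𝔼n[m]′𝔼n[mm′]⁻¹𝔼n[m]. A stylized sketch of ours, estimating a scalar mean from two overidentifying moments:

```python
import numpy as np

# With f*(t) = t + t^2/4 (chi^2 case), the adversary's best response is
# lam_theta = -2 E_n[m m']^{-1} E_n[m]; profiling it out turns the adversarial
# GEL objective into the continuous-updating GMM criterion.
# Toy moments: m(y, theta) = (y - theta, (y - theta)^2 - 1).
rng = np.random.default_rng(1)
y = rng.normal(size=4000)                       # true theta* = 0

def cue(theta):
    m = np.column_stack([y - theta, (y - theta) ** 2 - 1.0])
    g, S = m.mean(axis=0), (m.T @ m) / len(y)
    lam = -2.0 * np.linalg.solve(S, g)          # adversary's best response
    inner = m @ lam
    profiled = -np.mean(inner + inner ** 2 / 4)  # sup_lam E_n[-f*(lam'm)]
    assert np.isclose(profiled, g @ np.linalg.solve(S, g))  # CUE criterion
    return profiled

grid = np.linspace(-0.5, 0.5, 201)
theta_hat = grid[np.argmin([cue(t) for t in grid])]
assert abs(theta_hat) < 0.1                     # close to theta* = 0
```

The grid search stands in for the outer minimization only because θ is scalar here; in the semiparametric case the outer problem runs over the neural-network sieve Θn.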

4.3 Off-Policy Reinforcement Learning

Next, we use our theory to extend the known results about SBEED, the off-policy RL algorithm of Dai et al. [2018] introduced in Section 2.3. Theorem 3.2 makes it easy to obtain the convergence rates of the corresponding A-estimator:

Proposition 4.4.

Consider the A-estimator θ^n,λ^n\widehat{\theta}_{n},\widehat{\lambda}_{n} satisfying 1.2,1.3 with l(θ,λ,Y)l(\theta,\lambda,Y) as in 2.4. Assume the observations are iid for simplicity, and that P=PθP_{*}=P_{\theta_{*}} and V=VθV_{*}=V_{\theta_{*}}, where θΘ,λθΛ\theta_{*}\in\Theta_{*},\lambda_{*}^{\theta}\in\Lambda_{*} satisfy A0 in Theorem 3.2 with 𝒳=𝒳¯=𝒮×𝒜\mathcal{X}=\bar{\mathcal{X}}=\mathcal{S}\times\mathcal{A}. Let ΘΘ,ΛΛ\Theta_{*}\subset\Theta,\Lambda_{*}\subset\Lambda, with Θ,Λ\Theta,\Lambda compact under \|\cdot\|_{\infty} and path-connected. Let R(,),Vθ(),Pθ(|)R(\cdot,\cdot),V_{\theta}(\cdot),P_{\theta}(\cdot|\cdot) be continuous. Let the parametrizations Pθ,VθP_{\theta},V_{\theta} satisfy the Lipschitz conditions logPθlogPθθθ\left\|\log P_{\theta}-\log P_{\theta^{\prime}}\right\|_{\infty}\prec\|\theta-\theta^{\prime}\|_{\infty} and VθVθθθ\left\|V_{\theta}-V_{\theta^{\prime}}\right\|_{\infty}\prec\|\theta-\theta^{\prime}\|_{\infty}. Let the neural network classes Θn,Λn\Theta_{n},\Lambda_{n} be constructed as in Theorem 3.2, for any r¯dpd¯p¯\underline{r}\geq\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}. Then for any r¯>r¯\bar{r}>\underline{r}:

𝔼s,a[(R(s,a)+β𝔼[Vθ^n(s+)|s,a]Vθ^n(s)logPθ^n(a|s))2]=o(n2/(2+r¯))\mathbb{E}_{s,a}\left[\left(R(s,a)+\beta\mathbb{E}[V_{\widehat{\theta}_{n}}(s^{+})|s,a]-V_{\widehat{\theta}_{n}}(s)-\log P_{\widehat{\theta}_{n}}(a|s)\right)^{2}\right]=o_{\mathbb{P}}(n^{-2/(2+\bar{r})})
Remark 4.4.

In contrast to the original work, our result also applies in the case where 𝒜\mathcal{A} and 𝒮\mathcal{S} are continuous, and we characterize the optimal rate of growth for the neural network function approximators, which optimally trade off bias and variance. While following almost trivially from the general Theorem 3.2, our result yields significantly faster convergence rates than the O(1/n)O_{\mathbb{P}}(1/\sqrt{n}) rates obtained by Dai et al. [2018], and our rates further exhibit the reduced curse of dimensionality of neural networks.

Remark 4.5.

We note that SBEED can be viewed as a special case of some of the econometric conditional moment estimators treated in Example 2.4, such as Dikkala et al. [2020]. We therefore refer the reader to Section 4.4 for an application of our asymptotic normality result. Interestingly, neither literature seems to be aware of this connection. Dai et al. [2018] cite convex conjugation and the interchangeability principle as the inspiration for their objective, whereas the adversarial conditional moment estimators in econometrics were inspired by Hansen [1982]’s Generalized Method of Moments.

4.4 A-Estimators for Conditional Moment Restrictions

We will now apply our theory to examine the asymptotic behavior of the conditional moment estimator of Dikkala et al. [2020], introduced in Section 2.4. We can apply Theorem 3.2 to obtain the rate at which θ^n\widehat{\theta}_{n} converges:

Proposition 4.5.

Let Θn,Θ,Λn,Λ\Theta_{n},\Theta_{*},\Lambda_{n},\Lambda_{*} be as in Theorem 3.2. Let m(X,θ)m(X,\theta) be \|\cdot\|_{\infty}-Lipschitz in θ\theta. Let the support of YY be bounded. Then, for any r¯>dpd¯p¯\bar{r}>\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}, we get:

𝔼[𝔼[m(X,θ^n)m(X,θ)|Z]22]=o(n2/(2+r¯))\mathbb{E}\left[\left\|\mathbb{E}[m(X,\widehat{\theta}_{n})-m(X,\theta_{*})|Z]\right\|^{2}_{2}\right]=o_{\mathbb{P}}(n^{-2/(2+\bar{r})})

For the instrumental variable regression setting studied by Dikkala et al. [2020], where m(X,θ)=yθ(x)m(X,\theta)=y-\theta(x), this implies:

𝔼[𝔼[θ^n(x)θ(x)|Z]22]=o(n2/(2+r¯))\mathbb{E}\left[\left\|\mathbb{E}\left[\widehat{\theta}_{n}(x)-\theta_{*}(x)\big{|}Z\right]\right\|^{2}_{2}\right]=o_{\mathbb{P}}(n^{-2/(2+\bar{r})})
Proof.

Condition A0 is satisfied by assumption, and A1 follows from Lipschitzness of m(X,)m(X,\cdot) and boundedness. Assumptions A2 and A3 can be verified by using boundedness to establish

𝕍[l(θ,Y)l(θ,Y)]\displaystyle\mathbb{V}[l(\theta,Y)-l(\theta_{*},Y)] λθ(𝒵)22𝔼[l(θ,Y)l(θ,Y)]\displaystyle\prec\|\lambda_{*}^{\theta}\|_{\mathcal{L}(\mathcal{Z})^{2}}^{2}\asymp\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]
𝕍[l(θ,λ,Y)l(θ,λθ,Y)]\displaystyle\mathbb{V}[l(\theta,\lambda,Y)-l(\theta,\lambda_{*}^{\theta},Y)] λλθ(𝒵)22𝔼[l(θ,λ,Y)l(θ,λθ,Y)]\displaystyle\prec\|\lambda-\lambda_{*}^{\theta}\|_{\mathcal{L}(\mathcal{Z})^{2}}^{2}\asymp\mathbb{E}[l(\theta,\lambda,Y)-l(\theta,\lambda_{*}^{\theta},Y)]
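To build intuition for this estimator, note that under linear sieves θ(x) = x′β, λ(z) = z′γ the adversary again has a closed-form best response. With the quadratic normalization 𝔼n[λ(Z)(Y−θ(X)) − λ(Z)²/4] (scalings differ across papers; this one is our choice for the sketch), the profiled criterion is 𝔼n[Ze]′𝔼n[ZZ′]⁻¹𝔼n[Ze] with e = Y − X′β, whose minimizer is two-stage least squares:

```python
import numpy as np

# Adversarial conditional-moment estimation with linear sieves reduces to 2SLS:
# profiling out lam(z) = z'gamma yields the criterion
# g(beta)' M^{-1} g(beta),  g = E_n[Z(y - X beta)],  M = E_n[Z Z'].
rng = np.random.default_rng(2)
n = 50_000
z = rng.normal(size=(n, 2))                                    # instruments
u = rng.normal(size=n)                                         # confounder
x = z @ np.array([1.0, -0.5]) + 0.8 * u + rng.normal(size=n)   # endogenous
y = 2.0 * x + u                                                # beta* = 2

X, Z = x[:, None], z
M = Z.T @ Z / n
A = X.T @ Z / n @ np.linalg.solve(M, Z.T @ X / n)   # first-order condition
b = X.T @ Z / n @ np.linalg.solve(M, Z.T @ y / n)
beta_hat = np.linalg.solve(A, b).item()

assert abs(beta_hat - 2.0) < 0.05   # consistent despite endogeneity of x
```

Replacing the linear sieves with growing neural-network classes Θn, Λn gives the estimator covered by Proposition 4.5.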

Remark 4.6.

Note that just like in the previous Example 2.2, this result does not require the parameter θ\theta_{*} to be identified by the restriction 2.5. If that is the case however, the above rates can be translated into similar rates in any norm θ^nθ\|\widehat{\theta}_{n}-\theta_{*}\| which is dominated by the objective, usually by construction. See Ai and Chen [2003] for an example of such a norm in the semi-parameteric setting.

Remark 4.7.

In contrast to Dikkala et al. [2020], our convergence rate result allows for general mm and possibly vector-valued, semiparametric Θ\Theta in which unknown functions are approximated by neural networks. Our rates also exhibit the reduced curse of dimensionality of neural networks.

Next, we use Theorem 3.4 to derive the asymptotic variance of the estimator, showing that it is inefficient in general. For this purpose, it suffices to consider the simpler parametric setting.

Proposition 4.6.

Consider the parametric case where Θn=Θ=Θ\Theta_{n}=\Theta_{*}=\Theta is Euclidean. In addition to the assumptions of Proposition 4.5, assume that the identification condition 2.5 holds. Let d(X,θ):=θm(X,θ)d(X,\theta):=\nabla_{\theta}m(X,\theta) be bounded and satisfy the Lipschitz condition |d(X,θ)d(X,θ)|θθ|d(X,\theta)-d(X,\theta^{\prime})|\prec\|\theta-\theta^{\prime}\|_{\infty}. Assume that 𝔼[l(θ,Y)]\mathbb{E}[l(\theta,Y)] is three times differentiable in θ\theta. For all θΘ\theta\in\Theta, let λθ:=2𝔼[m(X,θ)|Z]Λ\lambda_{*}^{\theta}:=2\mathbb{E}[m(X,\theta)|Z]\in\Lambda_{*} for a Λ\Lambda_{*} satisfying A0 with dp<14\frac{d^{*}}{p}<\frac{1}{4}, let θ,λθ\theta_{*},\lambda_{*}^{\theta} lie in the respective interiors of Θ,Λ\Theta,\Lambda_{*}, and let λθ[v]():=2v𝔼[d(X,θ)|Z=]Λ\lambda_{*}^{\prime\theta}[v_{*}](\cdot):=2v_{*}^{\prime}\mathbb{E}[d(X,\theta)|Z=\cdot]\in\Lambda_{*} for any vΘv_{*}\in\Theta. Let Λn\Lambda_{n} be regularized as in Theorem 3.4. Then:

n(θ^nθ)𝑑𝒩(0,V)\sqrt{n}(\widehat{\theta}_{n}-\theta_{*})\overset{d}{\to}\mathcal{N}(0,V)

where V=𝔼[𝔼[θm(X,θ)|Z]𝔼[m(X,θ)m(X,θ)|Z]𝔼[θm(X,θ)|Z]]1V=\mathbb{E}\left[\mathbb{E}\left[\nabla_{\theta_{*}}m(X,\theta_{*})^{\prime}|Z\right]\mathbb{E}\left[m(X,\theta_{*})m(X,\theta_{*})^{\prime}|Z\right]\mathbb{E}\left[\nabla_{\theta_{*}}m(X,\theta_{*})|Z\right]\right]^{-1}.

Chamberlain [1987] derived the efficiency bound for the parametric conditional moment setting, corresponding to the smallest (in a p.s.d. sense) n\sqrt{n}-asymptotic variance for any unbiased estimator. It is given by the covariance matrix:

V=𝔼[𝔼[θm(X,θ)|Z]𝔼[m(X,θ)m(X,θ)|Z]1𝔼[θm(X,θ)|Z]]1V_{*}=\mathbb{E}\left[\mathbb{E}\left[\nabla_{\theta_{*}}m(X,\theta_{*})^{\prime}|Z\right]\mathbb{E}\left[m(X,\theta_{*})m(X,\theta_{*})^{\prime}|Z\right]^{-1}\mathbb{E}\left[\nabla_{\theta_{*}}m(X,\theta_{*})|Z\right]\right]^{-1}

Note that VVV\neq V_{*} in general, implying that θ^n\widehat{\theta}_{n} is an inefficient estimator. By extension, this also applies to the Reinforcement Learning algorithm of Example 2.3. Comparing the GMM objective of Example 2.2 - which is known to be efficient in the unconditional moment setting - to the population objective of the present example, this may be unsurprising: in contrast to GMM, the population objective of Dikkala et al. [2020] corresponds to a regular 2\ell^{2} norm, without the inverse covariance weighting which is crucial for asymptotic efficiency in the unconditional case. Generalizing GEL to the conditional moment setting by replacing the constant adversary with a neural network Λn\Lambda_{n}, Metzger [2022] therefore proposes the A-estimator given by:

infθΘnsupλΛn𝔼n[f(m(X,θ)λ(Z))]\inf_{\theta\in\Theta_{n}}\sup_{\lambda\in\Lambda_{n}}\mathbb{E}_{n}\left[-f_{*}\left(m(X,\theta)^{\prime}\lambda(Z)\right)\right]

which nests a simplified variant of Bennett and Kallus [2020] for Df=χ2D_{f}=\chi^{2}, and for Df=DKLD_{f}=D_{KL} can be viewed as an alternative to the kernel approach of Kitamura et al. [2004]. Metzger [2022] provides an information-theoretic foundation similar to that of the GEL estimator and - building on the theory developed in the present paper - derives the convergence rates and asymptotic efficiency of this estimator, where Θn\Theta_{n} may contain unknown functions which are modeled as neural networks.

4.5 Estimating Riesz Representers

Finally, we show that Theorem 3.2 can be used to quickly derive the convergence rates of Chernozhukov et al. [2020]’s adversarial estimator for Riesz representers, which we introduced in Section 2.5.

Proposition 4.7.

Let Θn,Θ,Λn,Λ\Theta_{n},\Theta_{*},\Lambda_{n},\Lambda_{*} be as in Theorem 3.2. Let m(Y,λ)=m(Y,λ(x))m(Y,\lambda)=m(Y,\lambda(x)) be Lipschitz in λ(x)\lambda(x). Let the support of YY be bounded. Then, for any r¯>dpd¯p¯\bar{r}>\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}:

θ^nθ2(x)=o(n1/(2+r¯))\|\widehat{\theta}_{n}-\theta_{*}\|_{\mathcal{L}^{2}(x)}=o_{\mathbb{P}}(n^{-1/(2+\bar{r})})

This result clarifies that the Riesz representer of Chernozhukov et al. [2020] can similarly benefit from the adaptivity properties of neural networks, which yield faster rates for our target classes if d<Dd^{*}<D. In combination with their Lemma 17, this implies that compared to other non-parametric sieves, neural networks guarantee the asymptotic normality of the orthogonalized estimator ϕ~n\tilde{\phi}_{n} under weaker conditions on smoothness and DD. Since the normality of ϕ~nϕ\tilde{\phi}_{n}-\phi_{*} is of primary interest and already follows from Chernozhukov et al. [2020]’s Lemma 17 given our convergence rates, we refrain from deriving it for arbitrary functionals ϕ^n(g)=𝔼n[θ^n(x)g(x)]\widehat{\phi}_{n}(g)=\mathbb{E}_{n}[\widehat{\theta}_{n}(x)g(x)], although it would be possible to use Theorem 3.4 to derive n(ϕ^n(g)ϕ(g))𝑑𝒩(0,Vg)\sqrt{n}(\widehat{\phi}_{n}(g)-\phi(g))\overset{d}{\to}\mathcal{N}(0,V_{g}) for some VgV_{g}, for example.
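The defining property of the Riesz representer, 𝔼[m(W,λ)] = 𝔼[θ∗(X)λ(X)] for all λ, can be made concrete with a saturated linear sieve, where the minimizer of the profiled adversarial objective is ρ = 𝔼[bb′]⁻¹𝔼[m(W,b)]. For the ATE functional m(W,g) = g(1,Z) − g(0,Z) this recovers the inverse propensity weights. A stylized sketch of ours (names and setup are illustrative, not Chernozhukov et al. [2020]'s notation):

```python
import numpy as np

# Riesz representer for the ATE functional m(W, g) = g(1, Z) - g(0, Z) via a
# saturated sieve b(d, z): rho = E_n[b b']^{-1} E_n[m(W, b)] recovers the
# inverse propensity weights D/e(Z) - (1-D)/(1-e(Z)).
rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, size=n)                 # binary covariate
e = np.where(z == 0, 0.3, 0.7)                 # propensity P(D=1|Z)
d = (rng.random(n) < e).astype(float)          # treatment

def b(d, z):                                   # saturated basis in (d, z)
    return np.column_stack([d * (z == 0), d * (z == 1),
                            (1 - d) * (z == 0), (1 - d) * (z == 1)])

B = b(d, z)
G = B.T @ B / n                                # E_n[b b']
M = (b(np.ones(n), z) - b(np.zeros(n), z)).mean(axis=0)   # E_n[m(W, b)]
rho = np.linalg.solve(G, M)
theta_hat = B @ rho                            # estimated Riesz representer

riesz_true = d / e - (1 - d) / (1 - e)
assert np.max(np.abs(theta_hat - riesz_true)) < 0.1
```

The adversarial estimator replaces the saturated basis with a neural-network class Λn, which is what yields the reduced curse of dimensionality in Proposition 4.7.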

5 Conclusion

We characterize the general class of adversarial estimators (‘A-estimators’), subsuming many estimators independently proposed in the fields of econometrics and machine learning. Our unified framework suggests interesting commonalities between A-estimators: their adversary is always Neyman-orthogonal with respect to the main model, guaranteeing that its estimation errors have no first-order asymptotic impact on the estimated model. Most objectives have versions which asymptotically minimize an ff-divergence criterion and are asymptotically efficient. Typically, A-estimators adaptively learn how to optimally emphasize the restrictions implied by the researcher’s estimation assumptions, performing particularly well when this set is large. This makes them a promising framework for incorporating machine learning methods into causal inference, where even simple target parameters often satisfy a continuum of restrictions. We characterize the convergence rates of A-estimators, as well as the asymptotic normality of smooth functionals of their parameters. We also provide low-level analogues of these results for semi-parametric models, in which unknown functions are approximated by deep neural networks. Our convergence and normality results also extend the theory of neural network M-estimators, as a special case: building on recent results in approximation theory, our neural network convergence rates exhibit a reduced curse of dimensionality for more general losses than previously examined, holding uniformly over a second parameter space. Our normality result overcomes a problem previously posed by the non-convexity of neural network sieves, showing that a particular regularization, combined with under-smoothing, can be used to satisfy a strong, high-level approximation error condition which the literature had hitherto left unverified.

References

  • Ai and Chen [2003] Chunrong Ai and Xiaohong Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003. doi: https://doi.org/10.1111/1468-0262.00470. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00470.
  • Ai and Chen [2007] Chunrong Ai and Xiaohong Chen. Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics, 141(1):5–43, 2007. URL https://EconPapers.repec.org/RePEc:eee:econom:v:141:y:2007:i:1:p:5-43.
  • Angrist and Pischke [2010] Joshua D. Angrist and Jörn-Steffen Pischke. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3–30, June 2010. doi: 10.1257/jep.24.2.3. URL https://www.aeaweb.org/articles?id=10.1257/jep.24.2.3.
  • Arora et al. [2017] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/arora17a.html.
  • Athey and Wager [2021] Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021. doi: https://doi.org/10.3982/ECTA15732. URL https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA15732.
  • Athey et al. [2019] Susan Athey, Guido W Imbens, Jonas Metzger, and Evan M Munro. Using Wasserstein generative adversarial networks for the design of Monte Carlo simulations. Technical report, National Bureau of Economic Research, 2019.
  • Bauer et al. [2019] Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Annals of Statistics, 47(4):2261–2285, 2019.
  • Belomestny et al. [2021] Denis Belomestny, Eric Moulines, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Rates of convergence for density estimation with GANs. arXiv preprint arXiv:2102.00199, 2021.
  • Bennett and Kallus [2020] Andrew Bennett and Nathan Kallus. The variational method of moments. CoRR, abs/2012.09422, 2020. URL https://arxiv.org/abs/2012.09422.
  • Bennett et al. [2019a] Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In NeurIPS, 2019a.
  • Bennett et al. [2019b] Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper/2019/file/15d185eaa7c954e77f5343d941e25fbd-Paper.pdf.
  • Chamberlain [1987] Gary Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.
  • Chen et al. [2020] Minshuo Chen, Wenjing Liao, Hongyuan Zha, and Tuo Zhao. Statistical guarantees of generative adversarial networks for distribution estimation, 2020.
  • Chen and Liao [2015] Xiaohong Chen and Zhipeng Liao. Sieve semiparametric two-step GMM under weak dependence. Journal of Econometrics, 189(1):163–186, 2015. ISSN 0304-4076. doi: https://doi.org/10.1016/j.jeconom.2015.07.001. URL https://www.sciencedirect.com/science/article/pii/S0304407615002031.
  • Chen and Pouzo [2015] Xiaohong Chen and Demian Pouzo. Sieve Wald and QLR inferences on semi/nonparametric conditional moment models. Econometrica, 2015.
  • Chen and Qiu [2016] Xiaohong Chen and Yin Jia Qiu. Methods for nonparametric and semiparametric regressions with endogeneity: a gentle guide. Cowles Foundation Discussion Papers 2032, Cowles Foundation for Research in Economics, Yale University, 2016. URL https://EconPapers.repec.org/RePEc:cwl:cwldpp:2032.
  • Chen et al. [2003] Xiaohong Chen, Oliver Linton, and Ingrid Van Keilegom. Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608, 2003.
  • Chen et al. [2019] Xiaohong Chen, Demian Pouzo, and James L. Powell. Penalized sieve gel for weighted average derivatives of nonparametric quantile iv regressions. Journal of Econometrics, 2019.
  • Chernozhukov et al. [2017] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double/debiased/neyman machine learning of treatment effects. The American Economic Review, 107:261–265, 2017.
  • Chernozhukov et al. [2020] Victor Chernozhukov, Whitney Newey, Rahul Singh, and Vasilis Syrgkanis. Adversarial estimation of riesz representers, 2020.
  • Chernozhukov et al. [2021] Victor Chernozhukov, Whitney Newey, Victor Quintas-Martinez, and Vasilis Syrgkanis. Automatic debiased machine learning via neural nets for generalized linear regression. 2021.
  • Cosslett [1981] Stephen Cosslett. Maximum likelihood estimator for choice-based samples. Econometrica, 49:1289–1316, 02 1981. doi: 10.2307/1912755.
  • Cotter et al. [2019] Andrew Cotter, Heinrich Jiang, and Karthik Sridharan. Two-player games for efficient non-convex constrained optimization. In Aurélien Garivier and Satyen Kale, editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 300–332. PMLR, 22–24 Mar 2019. URL https://proceedings.mlr.press/v98/cotter19a.html.
  • Dai et al. [2018] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. CoRR, abs/1712.10285, 2018. URL http://arxiv.org/abs/1712.10285.
  • Dikkala et al. [2020] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimation of conditional moment models. arXiv preprint arXiv:2006.07201, 2020.
  • Farrell et al. [2018] Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. arXiv preprint arXiv:1809.09953, 2018.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Grenander [1981] Ulf Grenander. Abstract inference. Technical report, 1981.
  • Hansen [1982] Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054, 1982. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/1912775.
  • Hansen et al. [1996] Lars Peter Hansen, John Heaton, and Amir Yaron. Finite-sample properties of some alternative gmm estimators. Journal of Business & Economic Statistics, 14(3):262–280, 1996. ISSN 07350015. URL http://www.jstor.org/stable/1392442.
  • Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. NIPS, abs/1606.03476, 2016. URL http://arxiv.org/abs/1606.03476.
  • Imbens [2002] Guido W. Imbens. Generalized method of moments and empirical likelihood. Journal of Business & Economic Statistics, 20(4):493–506, 2002. ISSN 07350015. URL http://www.jstor.org/stable/1392419.
  • Imbens et al. [1998] Guido W. Imbens, Richard H. Spady, and Phillip Johnson. Information theoretic approaches to inference in moment condition models. Econometrica, 66(2):333–357, 1998. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/2998561.
  • Jabbar et al. [2022] Abdul Jabbar, Xi Li, and Bourahla Omar. A survey on generative adversarial networks: Variants, applications, and training. ACM Computing Surveys (CSUR), 54:1 – 49, 2022.
  • Kaji et al. [2020] Tetsuya Kaji, Elena Manresa, and Guillaume Pouliot. An adversarial approach to structural estimation, 2020.
  • Kitamura et al. [2004] Yuichi Kitamura, Gautam Tripathi, and Hyungtaik Ahn. Empirical likelihood-based inference in conditional moment restriction models. Econometrica, 72(6):1667–1714, 2004.
  • Lewis and Syrgkanis [2018] Greg Lewis and Vasilis Syrgkanis. Adversarial generalized method of moments, 2018.
  • Liang [2021] Tengyuan Liang. How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1–41, 2021.
  • Lucas [1976] Robert E. Lucas. Econometric policy evaluation: A critique. Carnegie-Rochester Conference Series on Public Policy, 1:19–46, 1976. ISSN 0167-2231. doi: https://doi.org/10.1016/S0167-2231(76)80003-6. URL https://www.sciencedirect.com/science/article/pii/S0167223176800036.
  • Mao et al. [2017] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks, 2017.
  • Metzger [2022] Jonas Metzger. Adversarial conditional moment estimation, 2022.
  • Nakada and Imaizumi [2020] Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38, 2020. URL http://jmlr.org/papers/v21/20-002.html.
  • Newey and Smith [2004] Whitney K. Newey and Richard J. Smith. Higher order properties of gmm and generalized empirical likelihood estimators. Econometrica, 72(1):219–255, 2004. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/3598854.
  • Neyman [1959] Jerzy Neyman. Optimal asymptotic tests of composite statistical hypotheses. Probability and Statistics: The Harald Cramer Volume, 1959.
  • Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization, 2016.
  • Owen [1990] Art Owen. Empirical Likelihood Ratio Confidence Regions. The Annals of Statistics, 18(1):90 – 120, 1990. doi: 10.1214/aos/1176347494. URL https://doi.org/10.1214/aos/1176347494.
  • Owen [1988] Art B. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249, 06 1988. ISSN 0006-3444. doi: 10.1093/biomet/75.2.237. URL https://doi.org/10.1093/biomet/75.2.237.
  • Qin and Lawless [1994] Jin Qin and Jerry Lawless. Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1):300–325, 1994. ISSN 00905364. URL http://www.jstor.org/stable/2242455.
  • Shen [1997] Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, 25(6):2555–2591, 1997. ISSN 00905364. URL http://www.jstor.org/stable/2959045.
  • Shen and Wong [1994] Xiaotong Shen and Wing Hung Wong. Convergence rate of sieve estimates. Ann. Statist., 22(2):580–615, 06 1994. doi: 10.1214/aos/1176325486. URL https://doi.org/10.1214/aos/1176325486.
  • Shen et al. [2019] Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.
  • Singh et al. [2018] Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabás Póczos. Nonparametric density estimation under adversarial losses. arXiv preprint arXiv:1805.08836, 2018.
  • Tao et al. [2018] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin Duke. Chi-square generative adversarial network. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4887–4896. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/tao18b.html.
  • Tuyls et al. [2018] K Tuyls, J Perolat, M Lanctot, JZ Leibo, and T Graepel. A generalised method for empirical game theoretic analysis. In AAMAS’18: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 77–85. ACM, 2018.
  • Yarotsky [2017] Dmitry Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
  • Zhan et al. [2021] Ruohan Zhan, Vitor Hadad, David A. Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 2125–2135, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383325. doi: 10.1145/3447548.3467456. URL https://doi.org/10.1145/3447548.3467456.

Appendix A Definitions

Definition 1 (Covering Number).

For some norm \|\cdot\| over some metric space Λ\Lambda, the covering number 𝒩(δ,Λ,)\mathcal{N}(\delta,\Lambda,\|\cdot\|) is defined as the cardinality of the smallest set CΛC\subset\Lambda such that supλΛinfcCλcδ\sup_{\lambda\in\Lambda}\inf_{c\in C}\|\lambda-c\|\leq\delta. The quantity log𝒩(δ,Λ,)\log\mathcal{N}(\delta,\Lambda,\|\cdot\|) is also called metric entropy.
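As an informal illustration of Definition 1 (not part of the formal development), a δ-cover can be built by a greedy pass: pick an uncovered point as a center, discard everything within δ of it, and repeat. The Python sketch below (the helper name `greedy_cover` is ours) does this for Λ=[0,1] under the absolute value and verifies the defining property supλ∈Λ infc∈C |λ−c| ≤ δ; the size of the greedy cover is an upper bound on 𝒩(δ,Λ,|·|).

```python
import numpy as np

def greedy_cover(points, delta):
    # Greedy pass: take the first uncovered point as a center, discard all
    # points within delta of it, repeat. Returns a valid delta-cover whose
    # cardinality upper-bounds the covering number N(delta, Lambda, |.|).
    centers, remaining = [], list(points)
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [p for p in remaining if abs(p - c) > delta]
    return centers

# Lambda = [0, 1], discretized finely; an optimal cover needs about 1/(2*delta) centers.
grid = np.linspace(0.0, 1.0, 2001)
cover = greedy_cover(grid, delta=0.1)

# Covering property of Definition 1: sup_lambda inf_c |lambda - c| <= delta
worst = max(min(abs(p - c) for c in cover) for p in grid)
```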

Definition 2 (Deep Neural Networks).

We define the class of deep σ\sigma networks fσ(L,W,w,κ,B)f\in\mathcal{F}_{\sigma}(L,W,w,\kappa,B) as parametrized functions of the form:

f(x)=A(L)σ(A(L1)σ(A(1)x+b(1))+b(L1))+b(L)f(x)=A^{(L)}\cdot\sigma\left(A^{(L-1)}\cdots\sigma\left(A^{(1)}x+b^{(1)}\right)\cdots+b^{(L-1)}\right)+b^{(L)}

where the A(l)A^{(l)}’s are weight matrices and b(l)b^{(l)}’s are intercept vectors with real-valued elements, and σ:\sigma:\mathbb{R}\mapsto\mathbb{R} is applied element-wise. For example, the choice σ(x)=ReLU(x)=max{0,x}\sigma(x)=\operatorname{ReLU}(x)=\max\{0,x\} (rectified linear unit) gives rise to the class of deep ReLU networks, and σ(x)=tanh(x)\sigma(x)=\tanh{(x)} gives rise to the class of tanh\tanh networks. We say the network is LL layers deep and call the upper bound supldim(b(l))w\sup_{l}\operatorname{dim}(b^{(l)})\leq w its width. Further, we assume that

maxi,j,l|Aij(l)|κ,maxi,l|bi(l)|κ,l=1LA(l)0+b(l)0W\max_{i,j,l}\left|A^{(l)}_{ij}\right|\leq\kappa,\quad\max_{i,l}|b^{(l)}_{i}|\leq\kappa,\quad\sum_{l=1}^{L}\left\|A^{(l)}\right\|_{0}+\left\|b^{(l)}\right\|_{0}\leq W

i.e. all elements in the A(l)A^{(l)}’s and b(l)b^{(l)}’s are bounded in absolute value by κ\kappa, and there are at most WW non-zero parameters in total. Finally, we assume fB<\|f\|_{\infty}\leq B<\infty for all ff. If the particular value BB is an arbitrary large enough constant, we may suppress the notation and write σ(L,W,w,κ,B)=σ(L,W,w,κ)\mathcal{F}_{\sigma}(L,W,w,\kappa,B)=\mathcal{F}_{\sigma}(L,W,w,\kappa).
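To make Definition 2 concrete, here is a minimal Python sketch (helper names are ours, not from the paper) of the forward pass f(x)=A⁽ᴸ⁾σ(⋯σ(A⁽¹⁾x+b⁽¹⁾)⋯+b⁽ᴸ⁻¹⁾)+b⁽ᴸ⁾, together with a routine reading off the attributes (L,W,w,κ). The toy network computes ReLU(x)−ReLU(−x)=x, the identity on ℝ.

```python
import numpy as np

def deep_net_forward(x, weights, biases, sigma=lambda z: np.maximum(z, 0.0)):
    # f(x) = A^(L) sigma( ... sigma(A^(1) x + b^(1)) ... + b^(L-1) ) + b^(L),
    # with sigma applied element-wise in every layer except the last.
    h = x
    for A, b in zip(weights[:-1], biases[:-1]):
        h = sigma(A @ h + b)
    return weights[-1] @ h + biases[-1]

def network_stats(weights, biases):
    # The attributes (L, W, w, kappa) of Definition 2: depth, number of
    # non-zero parameters, width, and the bound on parameter magnitudes.
    L = len(weights)
    W = int(sum((A != 0).sum() for A in weights) + sum((b != 0).sum() for b in biases))
    w = max(b.size for b in biases)
    kappa = max(max(np.abs(A).max() for A in weights),
                max(np.abs(b).max() for b in biases))
    return L, W, w, kappa

# A depth-2, width-2 ReLU network computing ReLU(x) - ReLU(-x) = x.
weights = [np.array([[1.0], [-1.0]]), np.array([[1.0, -1.0]])]
biases = [np.zeros(2), np.zeros(1)]
y = deep_net_forward(np.array([-3.0]), weights, biases)
```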

Definition 3 (Minkowski Dimension).

The (upper) Minkowski dimension of a set 𝒳[0,1]D{\mathcal{X}}\subset[0,1]^{D} is defined as

dimM𝒳:=inf{d0lim supε0𝒩(ε,𝒳,)εd=0}\operatorname{dim}_{M}{\mathcal{X}}:=\inf\left\{d^{*}\geq 0\mid\limsup_{\varepsilon\downarrow 0}\mathcal{N}(\varepsilon,{\mathcal{X}},\|\cdot\|_{\infty})\varepsilon^{d^{*}}=0\right\}

where 𝒩(ε,𝒳,)\mathcal{N}(\varepsilon,{\mathcal{X}},\|\cdot\|_{\infty}) is given by Definition 1. As shown in Nakada and Imaizumi [2020], this definition generalizes many other notions of intrinsic dimension, such as the manifold dimension.
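A numerical illustration of Definition 3 (a sketch with hypothetical helper names, not part of the theory): for a smooth curve embedded in [0,1]², the covering number 𝒩(ε,𝒳,∥·∥∞) grows roughly like ε⁻¹, so the box-counting slope of log 𝒩 against log(1/ε) estimates an intrinsic dimension near d*=1 even though the ambient dimension is D=2.

```python
import numpy as np

def box_count(points, eps):
    # Number of eps-cells (sup-norm boxes) hit by the point cloud; a
    # computable proxy for the covering number N(eps, X, sup-norm).
    return len({tuple(np.floor(p / eps).astype(int)) for p in points})

# X = a smooth curve in [0,1]^2: Minkowski dimension 1, ambient dimension D = 2.
t = np.linspace(0.0, 1.0, 20000)
curve = np.stack([t, 0.5 + 0.4 * np.sin(2 * np.pi * t)], axis=1)

eps_vals = np.array([0.1, 0.05, 0.025, 0.0125])
counts = [box_count(curve, e) for e in eps_vals]
# The slope of log N(eps) against log(1/eps) estimates dim_M X.
slope = np.polyfit(np.log(1.0 / eps_vals), np.log(counts), 1)[0]
```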

Definition 4 (Hölder Space).

For a function f:D,df(x)f:\mathbb{R}^{D}\rightarrow\mathbb{R},\partial_{d}f(x) denotes the partial derivative with respect to the dd-th component, and αf:=1α1DαDf\partial^{\alpha}f:=\partial_{1}^{\alpha_{1}}\cdots\partial_{D}^{\alpha_{D}}f for the multi-index α=(α1,,αD)\alpha=\left(\alpha_{1},\ldots,\alpha_{D}\right). For zz\in\mathbb{R}, z\lfloor z\rfloor denotes the largest integer strictly less than zz. Let p>0p>0 be the degree of smoothness. For f:[0,1]D,f:[0,1]^{D}\rightarrow\mathbb{R}, the Hölder norm is defined as

\|f\|_{\mathcal{H}\left(p,[0,1]^{D}\right)}:=\max_{\alpha:\|\alpha\|_{1}<\lfloor p\rfloor}\sup_{x\in[0,1]^{D}}\left|\partial^{\alpha}f(x)\right|+\max_{\alpha:\|\alpha\|_{1}=\lfloor p\rfloor}\sup_{x,x^{\prime}\in[0,1]^{D},x\neq x^{\prime}}\frac{\left|\partial^{\alpha}f(x)-\partial^{\alpha}f\left(x^{\prime}\right)\right|}{\left\|x-x^{\prime}\right\|_{\infty}^{p-\lfloor p\rfloor}}

Then, the Hölder space on [0,1]D[0,1]^{D} is defined as

(p,[0,1]D)={fCp([0,1]D)f(p,[0,1]D)<}\mathcal{H}\left(p,[0,1]^{D}\right)=\left\{f\in C^{\lfloor p\rfloor}\left([0,1]^{D}\right)\mid\|f\|_{\mathcal{H}\left(p,[0,1]^{D}\right)}<\infty\right\}

Also, (p,[0,1]D,M)={f(p,[0,1]D)f(p,[0,1]D)M}\mathcal{H}\left(p,[0,1]^{D},M\right)=\left\{f\in\mathcal{H}\left(p,[0,1]^{D}\right)\mid\|f\|_{\mathcal{H}\left(p,[0,1]^{D}\right)}\leq M\right\} denotes the MM-radius closed ball in (p,[0,1]D)\mathcal{H}\left(p,[0,1]^{D}\right).

Definition 5 ((p, C)-smoothness).

Let p=q+sp=q+s for some q0q\in\mathbb{N}_{0} and 0<s10<s\leq 1. A function m:dm:\mathbb{R}^{d}\rightarrow\mathbb{R} is called (p,C)(p,C)-smooth if, for every α=(α1,,αd)0d\alpha=\left(\alpha_{1},\ldots,\alpha_{d}\right)\in\mathbb{N}_{0}^{d} with j=1dαj=q\sum_{j=1}^{d}\alpha_{j}=q, the partial derivative qmx1α1xdαd\frac{\partial^{q}m}{\partial x_{1}^{\alpha_{1}}\cdots\partial x_{d}^{\alpha_{d}}} exists and satisfies

|qmx1α1xdαd(x)qmx1α1xdαd(z)|Cxz2s\left|\frac{\partial^{q}m}{\partial x_{1}^{\alpha_{1}}\cdots\partial x_{d}^{\alpha_{d}}}(x)-\frac{\partial^{q}m}{\partial x_{1}^{\alpha_{1}}\cdots\partial x_{d}^{\alpha_{d}}}(z)\right|\leq C\cdot\|x-z\|_{2}^{s}

for all x,zdx,z\in\mathbb{R}^{d}.

Definition 6 (Generalized Hierarchical Interaction Models).

Let C0C\in\mathbb{R}_{\geq 0}, D,d{1,,D}D\in\mathbb{N},d^{*}\in\{1,\ldots,D\}, m:Dm:\mathbb{R}^{D}\rightarrow\mathbb{R} and p=q+sp=q+s for some q0q\in\mathbb{N}_{0} and 0<s10<s\leq 1.

  • a)

    We say that mm satisfies a generalized hierarchical interaction model of order dd^{*} and level 0 with bound CC, if there exist a1,,adDa_{1},\ldots,a_{d^{*}}\in\mathbb{R}^{D} and some f:df:\mathbb{R}^{d^{*}}\rightarrow\mathbb{R} such that

    m(x)=f(a1Tx,,adTx) for all xDm(x)=f\left(a_{1}^{T}x,\ldots,a_{d^{*}}^{T}x\right)\quad\text{ for all }x\in\mathbb{R}^{D}

and where ff is Lipschitz continuous with constant CC and all of its partial derivatives of order less than or equal to qq are bounded in absolute value by CC.

  • b)

    We say that mm satisfies a generalized hierarchical interaction model of order dd^{*} and level l+1l+1 with bound CC if there exist KK\in\mathbb{N}, gk:d(k=1,,K)g_{k}:\mathbb{R}^{d^{*}}\rightarrow\mathbb{R}(k=1,\ldots,K) and f1,k,,fd,k:D(k=1,,K)f_{1,k},\ldots,f_{d^{*},k}:\mathbb{R}^{D}\rightarrow\mathbb{R}(k=1,\ldots,K) such that f1,k,,fd,k(k=1,,K)f_{1,k},\ldots,f_{d^{*},k}(k=1,\ldots,K) satisfy a generalized hierarchical interaction model of order dd^{*} and level ll and

    m(x)=k=1Kgk(f1,k(x),,fd,k(x)) for all xDm(x)=\sum_{k=1}^{K}g_{k}\left(f_{1,k}(x),\ldots,f_{d^{*},k}(x)\right)\quad\text{ for all }x\in\mathbb{R}^{D}

    where gkg_{k} are Lipschitz continuous with constant CC and all of their partial derivatives of order less than or equal to qq are bounded by some constant CC.

  • c)

    We say that the generalized hierarchical interaction model defined above is (p,C)(p,C)-smooth, if all functions occurring in its definition are (p,C)(p,C)-smooth, cf. Definition 5.

  • d)

    We define 𝒢(p,d,C,[0,1]D)\mathcal{G}(p,d^{*},C,[0,1]^{D}) as the class of all functions m:[0,1]Dm:[0,1]^{D}\rightarrow\mathbb{R} satisfying a (p,C)(p,C)-smooth generalized hierarchical interaction model of order dd^{*} and level ll with bound CC, where lCl\leq C. Since the particular value of CC is not important as long as C<C<\infty, we also write 𝒢(p,d,[0,1]D)\mathcal{G}(p,d^{*},[0,1]^{D}).

Definition 7 (Pathwise Derivatives).

For some θΘ\theta\in\Theta, λΛ\lambda\in\Lambda and some functional l:Θ×Λdl:\Theta\times\Lambda\mapsto\mathbb{R}^{d}, we define the first pathwise derivative in the direction θΘ\theta^{\prime}\in\Theta as

θθl(θ,λ):=limτ0τl(θ+τθ,λ)\nabla_{\theta\to\theta^{\prime}}l(\theta,\lambda):=\lim_{\tau\to 0}\frac{\partial}{\partial\tau}l(\theta+\tau\theta^{\prime},\lambda)

where τ\tau\in\mathbb{R}. Throughout this paper, the use of θθ\nabla_{\theta\to\theta^{\prime}} implicitly assumes that the derivative and limit on the RHS exist and that the resulting map is linear in θ\theta^{\prime}.
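As an informal numerical check of Definition 7 (hypothetical helper names, not the paper's machinery), the pathwise derivative can be approximated by a central finite difference in τ and compared against its closed form for the sample criterion l(θ)=𝔼ₙ[(Y−θ)²], whose pathwise derivative in direction θ′ is −2𝔼ₙ[Y−θ]·θ′.

```python
import numpy as np

def pathwise_derivative(l, theta, direction, tau=1e-6):
    # Central finite-difference approximation of Definition 7:
    # the derivative of tau -> l(theta + tau * theta') at tau = 0.
    return (l(theta + tau * direction) - l(theta - tau * direction)) / (2 * tau)

# Sample criterion l(theta) = E_n[(Y - theta)^2]; its pathwise derivative in
# direction theta' is -2 * E_n[Y - theta] * theta'.
rng = np.random.default_rng(0)
Y = rng.normal(size=1000)
l = lambda theta: np.mean((Y - theta) ** 2)

num = pathwise_derivative(l, theta=0.3, direction=1.0)
exact = -2.0 * np.mean(Y - 0.3) * 1.0
```

Because l is quadratic in θ, the central difference is exact up to floating-point error, which makes the comparison sharp.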

Appendix B Supporting Lemmas

Lemma B.1 (Covering Number of Neural Networks).

Consider the class of deep neural networks fσ(L,W,w,κ)f\in\mathcal{F}_{\sigma}(L,W,w,\kappa) (Definition 2), with an activation σ\sigma satisfying |σ(x)||x||\sigma(x)|\leq|x| and |σ(x)σ(x)||xx||\sigma(x)-\sigma(x^{\prime})|\leq|x-x^{\prime}| for all x,xx,x^{\prime}\in\mathbb{R} (e.g. ReLU, tanh\tanh), and consider the norm f=supx𝒳|f(x)|\|f\|_{\infty}=\sup_{x\in{\mathcal{X}}}|f(x)| for some 𝒳[0,1]D{\mathcal{X}}\subset[0,1]^{D} where DwD\leq w. Its δ\delta-covering number (Definition 1) can be bounded by:

𝒩(δ,σ(L,W,w,κ),)(2L2(w+2)(κw)L+1δ)W\mathcal{N}\left(\delta,\mathcal{F}_{\sigma}(L,W,w,\kappa),\|\cdot\|_{\infty}\right)\leq\left(\frac{2L^{2}(w+2)(\kappa w)^{L+1}}{\delta}\right)^{W}
Proof.

This is Lemma 7 in Chen et al. [2020]. While they only state the lemma for ReLU networks σ(x)=max(0,x)\sigma(x)=\max(0,x), their proof works for any activation σ\sigma satisfying |σ(x)||x||\sigma(x)|\leq|x| and |σ(x)σ(x)||xx||\sigma(x)-\sigma(x^{\prime})|\leq|x-x^{\prime}| for all x,xx,x^{\prime}\in\mathbb{R}. We substituted the bound B=1B=1 and renamed some variables. ∎
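To convey the scaling behind Lemma B.1, the sketch below simply evaluates the stated bound: the metric entropy is linear in the sparsity W and, for a fixed architecture, only logarithmic in 1/δ, which is what keeps neural network sieves tractable in the rate calculations. The function name is ours, not from Chen et al. [2020].

```python
import math

def log_covering_bound(delta, L, W, w, kappa):
    # Metric-entropy bound of Lemma B.1:
    # log N(delta, F_sigma(L, W, w, kappa), sup-norm)
    #   <= W * log( 2 L^2 (w + 2) (kappa w)^(L + 1) / delta )
    return W * math.log(2 * L**2 * (w + 2) * (kappa * w) ** (L + 1) / delta)

b1 = log_covering_bound(delta=0.1, L=3, W=100, w=10, kappa=1.0)
b2 = log_covering_bound(delta=0.1, L=3, W=200, w=10, kappa=1.0)   # doubling W doubles the bound
b3 = log_covering_bound(delta=0.01, L=3, W=100, w=10, kappa=1.0)  # shrinking delta adds only W*log(10)
```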

Lemma B.2 (Approximation by Deep ReLU Networks on Low Dimensional Data).

Consider the Hölder space (p,[0,1]D)\mathcal{H}\equiv\mathcal{H}\left(p,[0,1]^{D}\right) (Definition 4) and some support 𝒳[0,1]D{\mathcal{X}}\subset[0,1]^{D} with Minkowski dimension (Definition 3) bounded by dimM𝒳dD\operatorname{dim}_{M}{\mathcal{X}}\leq d^{*}\leq D. For any small enough ϵ>0\epsilon>0, the class of deep ReLU networks ReLU(L,W(ϵ),w(ϵ),κ(ϵ))\mathcal{F}\equiv\mathcal{F}_{\mathrm{ReLU}}(L,W(\epsilon),w(\epsilon),\kappa(\epsilon)) (Definition 2) satisfies:

supfinffsupx𝒳|f(x)f(x)|<ϵ\sup_{f_{*}\in\mathcal{H}}\mathop{\mathrm{inf}\vphantom{\mathrm{sup}}}_{f\in\mathcal{F}}\sup_{x\in{\mathcal{X}}}|f(x)-f_{*}(x)|<\epsilon

as long as W(ϵ)c1ϵd/p,w(ϵ)c2ϵd/p,κ(ϵ)c3ϵc4W(\epsilon)\geq c_{1}\epsilon^{-d^{*}/p},\ w(\epsilon)\geq c_{2}\epsilon^{-d^{*}/p},\ \kappa(\epsilon)\geq c_{3}\epsilon^{-c_{4}} for any large enough choice of LL, c1c_{1}, c2c_{2}, c3c_{3}, c4>0c_{4}>0.

Proof.

The case d<Dd^{*}<D is covered by Theorem 5 in Nakada and Imaizumi [2020]. While they do not state a bound on the width w(ϵ)w(\epsilon), it is easy to see that any network described by Definition 2 with at most W(ϵ)W(\epsilon) non-zero parameters can be represented by a network with width bounded by w(ϵ)W(ϵ)w(\epsilon)\leq W(\epsilon). In the case of d=Dd^{*}=D, the Lemma simply states the approximation error for conventional Hölder spaces as established in Yarotsky [2017]. ∎

Lemma B.3 (Approximation of Generalized Interaction Models by Deep ReLU Networks).

Consider the function class 𝒢𝒢(p,d,[0,1]D)\mathcal{G}\equiv\mathcal{G}(p,d^{*},[0,1]^{D}) (Definition 6d) and consider some arbitrary random variable x[0,1]Dx\in[0,1]^{D} with probability measure x\mathbb{P}_{x}. For any small enough ϵ,η>0\epsilon,\eta>0, the class of deep tanh networks tanh(L,W(ϵ),w(ϵ),κ(ϵ))\mathcal{F}\equiv\mathcal{F}_{\tanh}(L,W(\epsilon),w(\epsilon),\kappa(\epsilon)) (Definition 2) satisfies:

supf𝒢inffsupx𝒳|f(x)f(x)|<ϵ\sup_{f_{*}\in\mathcal{G}}\mathop{\mathrm{inf}\vphantom{\mathrm{sup}}}_{f\in\mathcal{F}}\sup_{x\in{\mathcal{X}}}|f(x)-f_{*}(x)|<\epsilon

for some subset 𝒳[0,1]D{\mathcal{X}}\subset[0,1]^{D} with x(x𝒳)η\mathbb{P}_{x}(x\not\in{\mathcal{X}})\leq\eta as long as W(ϵ)=c1ϵd/p,w(ϵ)=c2ϵd/p,κ(ϵ)=c3ϵc4/ηW(\epsilon)=c_{1}\epsilon^{-d^{*}/p},\ w(\epsilon)=c_{2}\epsilon^{-d^{*}/p},\ \kappa(\epsilon)=c_{3}\epsilon^{-c_{4}}/\eta for any large enough choice of LL, c1c_{1}, c2c_{2}, c3c_{3}, c4>0c_{4}>0.

Proof.

This follows directly from Theorem 3 in Bauer et al. [2019]; our notation is greatly simplified because we are not interested in most of their constants and have offloaded most of the assumptions into Definition 6. What matters is that the network they construct has a depth that is bounded by a constant (their equation (6)), and a number of non-zero parameters that is proportional to what they call (Mn+1)d(M_{n}+1)^{d^{*}} in their Theorem 3 (by their equations (7) and (5) and the definition of MM^{*} in their Theorem 3). Since we assumed bounded support (leaving their ana_{n} as a constant), their bound yields an approximation error of ϵ=:cMnp\epsilon=:cM_{n}^{-p} for some c>0c>0, such that the number of non-zero parameters can be bounded as W(ϵn)=O((Mn+1)d)=O(ϵd/p)W(\epsilon_{n})=O\left((M_{n}+1)^{d^{*}}\right)=O(\epsilon^{-d^{*}/p}). They bound κ(ϵ)\kappa(\epsilon) (α\alpha in their notation) in terms of MnM_{n} and η\eta, yielding κ(ϵ)=O(ϵc4/η)\kappa(\epsilon)=O(\epsilon^{-c_{4}}/\eta) for some large enough constant c4>0c_{4}>0. Finally, their theorem holds only for activation functions σ\sigma which satisfy a property they call N-admissible. While this is technically not satisfied by σ(x)=tanh(x)\sigma(x)=\tanh{(x)}, it is easy to verify that it is satisfied by σ~(x)=1/2+tanh(x)/2\widetilde{\sigma}(x)=1/2+\tanh{(x)}/2. Since for any f~σ~(L,W,w,κ)\tilde{f}\in\mathcal{F}_{\widetilde{\sigma}}(L,W,w,\kappa) there exists some ftanh(L,W,w,2κ+1/2)f\in\mathcal{F}_{\tanh}(L,W,w,2\kappa+1/2) such that f~=f\tilde{f}=f, the same approximation bound holds with σ(x)=tanh(x)\sigma(x)=\tanh{(x)}. ∎

Lemma B.4 (Empirical Process of Donsker Classes).

If Y𝒴Y\in\mathcal{Y} is iid and {l(θ,Y):θΘ}\{l(\theta,Y):\theta\in\Theta\} is \mathbb{P}-Donsker for some l:Θ×𝒴l:\Theta\times\mathcal{Y}\mapsto\mathbb{R} satisfying

limθΘ:θθ0𝔼[(l(θ,Y)l(θ,Y))2]=0,\lim_{\theta\in\Theta:\|\theta-\theta_{*}\|\to 0}\mathbb{E}\left[\left(l(\theta,Y)-l(\theta_{*},Y)\right)^{2}\right]=0,

then

supθΘ:θθδn(𝔼𝔼n)[l(θ,Y)l(θ,Y)]=o(n1/2)\sup_{\theta\in\Theta:\|\theta-\theta_{*}\|\leq\delta_{n}}(\mathbb{E}-\mathbb{E}_{n})[l(\theta,Y)-l(\theta_{*},Y)]=o_{\mathbb{P}}(n^{-1/2})

for any δn=o(1)\delta_{n}=o_{\mathbb{P}}(1).

Proof.

This directly follows from Lemma 1 in Chen et al. [2003]. ∎

Lemma B.5 (Empirical Process Rates for A-Estimators).

Under the assumptions of Theorem 3.1, for any function f(θ,λ,Y)f(\theta,\lambda,Y) satisfying the following conditions:

  • For any sequence en0e_{n}\geq 0 and all θΘ,λΛ\theta\in\Theta,\lambda\in\Lambda:

    𝕍[f(θ,λθ,Y)f(θ,λθ,Y)]\displaystyle\mathbb{V}[f(\theta,\lambda_{*}^{\theta},Y)-f(\theta_{*},\lambda_{*}^{\theta_{*}},Y)] 𝔼[l(θ,Y)l(θ,Y)+en]γ\displaystyle\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)+e_{n}]^{\gamma} (B.1)
    𝕍[f(θ,λ,Y)f(θ,λθ,Y)]\displaystyle\mathbb{V}[f(\theta,\lambda,Y)-f(\theta,\lambda_{*}^{\theta},Y)] 𝔼[lθ(λθ,Y)lθ(λ,Y)+en]γ\displaystyle\prec\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\lambda,Y)+e_{n}]^{\gamma} (B.2)

    at least if the right hand sides are smaller than some C>0C>0.

  • For all small ε>0\varepsilon>0, we have:

    log𝒩(ε,{f(θ,λ,):θΘn,λΛn},)ns(εr1)/r\log\mathcal{N}(\varepsilon,\{f(\theta,\lambda,\cdot):\theta\in\Theta_{n},\lambda\in\Lambda_{n}\},\|\cdot\|_{\infty})\prec n^{s}(\varepsilon^{-r}-1)/r (B.3)

we obtain the following empirical processes bounds:

supθΘ^n(𝔼𝔼n)[f(θ,πnλθ,Y)f(θ,λθ,Y)]\displaystyle\sup_{\begin{subarray}{c}\theta\in\widehat{\Theta}_{n}\end{subarray}}(\mathbb{E}-\mathbb{E}_{n})[f(\theta,\pi_{n}\lambda_{*}^{\theta},Y)-f(\theta_{*},\lambda_{*}^{\theta_{*}},Y)] =O(nτ(γ,s,r,n)+ϵn+ηn+ϵ¯n+η¯n+en)\displaystyle=O_{\mathbb{P}}(n^{-\tau(\gamma,s,r,n)}+\epsilon_{n}+\eta_{n}+\bar{\epsilon}_{n}+\bar{\eta}_{n}+e_{n})
supθΘ^nλΛ^n(θ)(𝔼𝔼n)[f(θ,λ,Y)f(θ,πnλθ,Y)]\displaystyle\sup_{\begin{subarray}{c}\theta\in\widehat{\Theta}_{n}\\ \lambda\in\widehat{\Lambda}_{n}(\theta)\end{subarray}}(\mathbb{E}-\mathbb{E}_{n})[f(\theta,\lambda,Y)-f(\theta,\pi_{n}\lambda_{*}^{\theta},Y)] =O(nτ(γ,s,r,n)+ϵn+ηn+en)\displaystyle=O_{\mathbb{P}}(n^{-\tau(\gamma,s,r,n)}+\epsilon_{n}+\eta_{n}+e_{n})

where Λ^n(θ):={λΛn:𝔼[lθ(λθ,Y)lθ(λ,Y)]𝔼[lθ(λθ,Y)lθ(λ^nθ,Y)]}\widehat{\Lambda}_{n}(\theta):=\{\lambda\in\Lambda_{n}:\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\lambda,Y)]\prec\mathbb{E}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\widehat{\lambda}_{n}^{\theta},Y)]\} and Θ^n:={θΘ:𝔼[l(θ,Y)l(θ,Y)]𝔼[l(θ^n,Y)l(θ,Y)]}\widehat{\Theta}_{n}:=\{\theta\in\Theta:\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]\prec\mathbb{E}[l(\widehat{\theta}_{n},Y)-l(\theta_{*},Y)]\} are shrinking neighborhoods around λθ\lambda_{*}^{\theta} and θ\theta_{*} containing λ^nθ\widehat{\lambda}_{n}^{\theta} and θ^n\widehat{\theta}_{n}.

Appendix C Proofs

C.1 Theorem 3.1 and Lemma B.5

Theorem 3.1 and Lemma B.5 are simplified versions of the slightly more general Theorems C.1 and C.2, which modify Shen and Wong [1994]’s M-estimator convergence rate arguments to hold uniformly over another parameter space and accommodate estimators which are finite-sample optimal up to some stochastic remainder. Theorem C.1 is presented in C.1.1 and derives the uniform convergence rates for λ^nθ\widehat{\lambda}_{n}^{\theta}. Theorem C.2 is presented in C.1.2 and derives the rates for θ^n\widehat{\theta}_{n}. In C.1.3, we then discuss how Theorem 3.1 and Lemma B.5 follow from these results.

C.1.1 Uniform convergence rate of λ^nθ\widehat{\lambda}_{n}^{\theta}

Theorem C.1 (Uniform Convergence Rates of Sieve M-Estimators).

Let ρθ(,)\rho^{\theta}(\cdot,\cdot) be a pseudo-distance on Λ\Lambda, possibly indexed by θΘ\theta\in\Theta. For the estimator λ^nθ\widehat{\lambda}_{n}^{\theta} of 3.2, assume:

  • CONDITION C1a. For some constants A1>0A_{1}>0 and α>0,\alpha>0, and all small ε>0\varepsilon>0:

    inf{ρθ(λ,λθ)ε,λΛ,θΘ}𝔼[lθ(λθ,Y)lθ(λ,Y)]2A1ε2α\inf_{\left\{\rho^{\theta}\left(\lambda,\lambda_{*}^{\theta}\right)\geq\varepsilon,\lambda\in\Lambda,\theta\in\Theta\right\}}\mathbb{E}\left[l^{\theta}\left(\lambda_{*}^{\theta},Y\right)-l^{\theta}(\lambda,Y)\right]\geq 2A_{1}\varepsilon^{2\alpha}
  • CONDITION C1b. For some constants A2>0A_{2}>0 and β>0,\beta>0, and all small ε>0\varepsilon>0:

    sup{ρθ(λ,λθ)ε,λΛ,θΘ}𝕍[lθ(λ,Y)lθ(λθ,Y)]A2ε2β\sup_{\left\{\rho^{\theta}\left(\lambda,\lambda_{*}^{\theta}\right)\leq\varepsilon,\lambda\in\Lambda,\theta\in\Theta\right\}}\mathbb{V}\left[l^{\theta}(\lambda,Y)-l^{\theta}\left(\lambda_{*}^{\theta},Y\right)\right]\leq A_{2}\varepsilon^{2\beta}
  • CONDITION C2. Let n={lθ(λ,)lθ(πnλθ,):λΛn,θΘn}\mathcal{F}_{n}=\left\{l^{\theta}(\lambda,\cdot)-l^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},\cdot\right):\lambda\in\Lambda_{n},\theta\in\Theta_{n}\right\}. For some r0<12r_{0}<\frac{1}{2}, A3>0A_{3}>0 and all small ε>0\varepsilon>0, its entropy (Def. 1) is bounded as:

    log𝒩(ε,n,)A3n2r0εr\log\mathcal{N}\left(\varepsilon,\mathcal{F}_{n},\|\cdot\|_{\infty}\right)\leq A_{3}n^{2r_{0}}\varepsilon^{-r}

    where either r>0r>0 or r=0+r=0^{+}, which is understood to represent ε0+=log(1/ε)\varepsilon^{-0^{+}}=\log(1/\varepsilon).

Let ϵn:=supθΘnρθ(πnλθ,λθ)|𝔼[lθ(λθ,Y)lθ(πnλθ,Y)]|1/2α\epsilon_{n}:=\sup_{\theta\in\Theta_{n}}\rho^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},\lambda_{*}^{\theta}\right)\vee\left|\mathbb{E}\left[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\pi_{n}\lambda_{*}^{\theta},Y)\right]\right|^{1/2\alpha}, then

supθΘnρθ(λ^nθ,λθ)=O(nτ+ϵn+ηn1/2α),\sup_{\theta\in\Theta_{n}}\rho^{\theta}\left(\widehat{\lambda}_{n}^{\theta},\lambda_{*}^{\theta}\right)=O_{\mathbb{P}}\left(n^{-\tau}+\epsilon_{n}+\eta_{n}^{1/2\alpha}\right),

where τ=τ(α,β,r,r0,n)\tau=\tau(\alpha,\beta,r,r_{0},n) is given by:

τ={12r02αloglogn2αlogn, if r=0+,βα12r04α2β, if r=0+,β<α12r04αmin(α,β)(2r), if 0<r<212r04αloglogn2αlogn, if r=212r02αr, if r>2\tau=\left\{\begin{array}[]{ll}\frac{1-2r_{0}}{2\alpha}-\frac{\log\log n}{2\alpha\log n},&\text{ if }r=0^{+},\beta\geq\alpha\\ \frac{1-2r_{0}}{4\alpha-2\beta},&\text{ if }r=0^{+},\beta<\alpha\\ \frac{1-2r_{0}}{4\alpha-\min(\alpha,\beta)(2-r)},&\text{ if }0<r<2\\ \frac{1-2r_{0}}{4\alpha}-\frac{\log\log n}{2\alpha\log n},&\text{ if }r=2\\ \frac{1-2r_{0}}{2\alpha r},&\text{ if }r>2\end{array}\right.

And for any f(θ,λ,Y)f(\theta,\lambda,Y) satisfying C1b and C2 when lθ(λ,Y)l^{\theta}(\lambda,Y) is replaced by f(θ,λ,Y)f(\theta,\lambda,Y), we can bound the empirical process as follows:

supθΘn,λΛ^n(θ)(𝔼𝔼n)[f(θ,λ,Y)f(θ,πnλθ,Y)]=O(nτ+ϵn+ηn1/2α)\sup_{\begin{subarray}{c}\theta\in\Theta_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)\end{subarray}}(\mathbb{E}-\mathbb{E}_{n})[f(\theta,\lambda,Y)-f(\theta,\pi_{n}\lambda_{*}^{\theta},Y)]=O_{\mathbb{P}}(n^{-\tau}+\epsilon_{n}+\eta_{n}^{1/2\alpha}) (C.1)

where Λ^n(θ)\widehat{\Lambda}_{n}(\theta) is defined as in Lemma B.5.

Proof.

The theorem generalizes Theorem 1 of Shen and Wong [1994] such that it holds uniformly over a family of losses indexed by the parameter θΘn\theta\in\Theta_{n}, and to allow for the finite-sample optimum to hold approximately up to a possibly stochastic sequence ηn\eta_{n}. Fortunately, the proof can remain almost identical. Shen and Wong [1994] prove the Theorem by induction, through a chaining argument. They use their Lemma 2 to derive an initial, slow rate which corresponds to the induction start, yielding the assumptions of their Lemma 3 at step k=2k=2. Next, their Lemma 3 is repeatedly applied as the induction steps until the rates of Theorem C.1 are obtained. We do not reproduce these algebraic steps, as they are the same as in Shen and Wong [1994]. Like Shen and Wong [1994], we also do not provide the proof for the induction start as it is similar, but simpler than the proof of the induction step, which we present in Lemma C.1. ∎

Lemma C.1 (Induction Step for Theorem C.1).

Suppose Conditions C1a, C1b and C2 hold. If at step k1k-1 we have a rate εn(k1)=nαk1>max(n(12r0)/[α(r+2)],ϵn)\varepsilon_{n}^{(k-1)}=n^{-\alpha_{k-1}}>\max\left(n^{-\left(1-2r_{0}\right)/[\alpha(r+2)]},\epsilon_{n}\right) so that

(supθΘnρθ(λ^nθ,λθ)Dεn(k1))5[exp((1ε)max(D4α,D2α)M1n2r0)+(k1)exp(Lnδ)]\mathbb{P}\left(\sup_{\theta\in\Theta_{n}}\rho^{\theta}\left(\widehat{\lambda}_{n}^{\theta},\lambda_{*}^{\theta}\right)\geq D\varepsilon_{n}^{(k-1)}\right)\leq 5\left[\exp\left(-(1-\varepsilon)\max\left(D^{4\alpha},D^{2\alpha}\right)M_{1}n^{2r_{0}}\right)+(k-1)\exp\left(-Ln^{\delta_{*}}\right)\right]

where δ=min(r+4r0r+2,βr(12r0)4α+r0)\delta_{*}=\min\left(\frac{r+4r_{0}}{r+2},\frac{\beta r\left(1-2r_{0}\right)}{4\alpha}+r_{0}\right) and L=(1ε)min(M2D2α,M3D4αβ(2r)/2)L=(1-\varepsilon)\min\left(M_{2}D^{2\alpha},M_{3}D^{4\alpha-\beta(2-r)/2}\right). Then at step kk, we can find an improved rate

εn(k)=max(nαk,n(12r0)/[α(r+2)],ϵn,ηn1/2α)\varepsilon_{n}^{(k)}=\max\left(n^{-\alpha_{k}},n^{-\left(1-2r_{0}\right)/[\alpha(r+2)]},\epsilon_{n},\eta_{n}^{1/2\alpha}\right)

where αk=(12r0)/(4α)+αk1β(2r)/(4α),\alpha_{k}=\left(1-2r_{0}\right)/(4\alpha)+\alpha_{k-1}\beta(2-r)/(4\alpha), so that

(supθΘnρθ(λ^nθ,λθ)Dεn(k))5[exp((1ε)max(D4α,D2α)M1n2r0)+kexp(Lnδ)]\mathbb{P}\left(\sup_{\theta\in\Theta_{n}}\rho^{\theta}\left(\widehat{\lambda}_{n}^{\theta},\lambda_{*}^{\theta}\right)\geq D\varepsilon_{n}^{(k)}\right)\leq 5\left[\exp\left(-(1-\varepsilon)\max\left(D^{4\alpha},D^{2\alpha}\right)M_{1}n^{2r_{0}}\right)+k\exp\left(-L{n}^{\delta_{*}}\right)\right]

And for any function f(θ,λ,Y)f(\theta,\lambda,Y) satisfying C1b and C2 when lθ(λ,Y)l^{\theta}(\lambda,Y) is replaced by f(θ,λ,Y)f(\theta,\lambda,Y), and Λ^n(θ)\widehat{\Lambda}_{n}(\theta) is defined as in Lemma B.5, the same bound applies to:

(supθΘn,λΛ^n(θ)(𝔼𝔼n)[f(θ,λ,Y)f(θ,πnλθ,Y)]A1(Dεn(k))2α)\mathbb{P}\left(\sup_{\begin{subarray}{c}\theta\in\Theta_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)\end{subarray}}(\mathbb{E}-\mathbb{E}_{n})[f(\theta,\lambda,Y)-f(\theta,\pi_{n}\lambda_{*}^{\theta},Y)]\geq A_{1}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}\right)
Proof.

We assume D>1D>1 (wlog) and we only prove the case of 4αβ(2r)/2.4\alpha\geq\beta(2-r)/2. Let Bn(i)={Dεn(i)supθΘnρθ(λ^nθ,λθ)<Dεn(i1)}B_{n}^{(i)}=\left\{D\varepsilon_{n}^{(i)}\leq\sup_{\theta\in\Theta_{n}}\rho^{\theta}\left(\widehat{\lambda}_{n}^{\theta},\lambda_{*}^{\theta}\right)<D\varepsilon_{n}^{(i-1)}\right\} for i=2,,ki=2,\ldots,k. Then

(supθΘnρθ(λ^nθ,λθ)Dεn(k))(supθΘnρθ(λ^nθ,λθ)Dεn(k1))+(Bn(k))\mathbb{P}\left(\sup_{\theta\in\Theta_{n}}\rho^{\theta}\left(\widehat{\lambda}_{n}^{\theta},\lambda_{*}^{\theta}\right)\geq D\varepsilon_{n}^{(k)}\right)\leq\mathbb{P}\left(\sup_{\theta\in\Theta_{n}}\rho^{\theta}\left(\widehat{\lambda}_{n}^{\theta},\lambda_{*}^{\theta}\right)\geq D\varepsilon_{n}^{(k-1)}\right)+\mathbb{P}\left(B_{n}^{(k)}\right)

To prove the Lemma, it remains to bound (Bn(k))\mathbb{P}\left(B_{n}^{(k)}\right). By Condition C1a,

inf{ρθ(λ,λθ)Dεn(k),λΛn,θΘn}𝔼[lθ(πnλθ,Y)lθ(λ,Y)]ηn2A1(Dεn(k))2αsupθΘn𝔼[lθ(λθ,Y)lθ(πnλθ,Y)]ηnA1(Dεn(k))2α\begin{array}[]{l}\inf_{\left\{\rho^{\theta}\left(\lambda,\lambda_{*}^{\theta}\right)\geq D\varepsilon_{n}^{(k)},\lambda\in\Lambda_{n},\theta\in\Theta_{n}\right\}}\mathbb{E}\left[l^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},Y\right)-l^{\theta}(\lambda,Y)\right]-\eta_{n}\\ \quad\geq 2A_{1}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}-\sup_{\theta\in\Theta_{n}}\mathbb{E}\left[l^{\theta}\left(\lambda_{*}^{\theta},Y\right)-l^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},Y\right)\right]-\eta_{n}\geq A_{1}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}\end{array}

The last inequality requires A1(Dεn(k))2αsupθΘn𝔼[lθ(λθ,Y)lθ(πnλθ,Y)]ηn>0A_{1}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}-\sup_{\theta\in\Theta_{n}}\mathbb{E}\left[l^{\theta}\left(\lambda_{*}^{\theta},Y\right)-l^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},Y\right)\right]-\eta_{n}>0. This holds for A1>2A_{1}>2 (wlog), since εn(k)ϵn\varepsilon_{n}^{(k)}\geq\epsilon_{n} and εn(k)ηn1/2α\varepsilon_{n}^{(k)}\geq\eta_{n}^{1/2\alpha}. Thus

(Bn(k))\displaystyle\mathbb{P}\left(B_{n}^{(k)}\right) (sup{Dεn(k)ρθ(λ,λθ)<Dεn(k1),λΛn,θΘn}𝔼n[lθ(λ,Y)lθ(πnλθ,Y)]ηn)\displaystyle\leq\mathbb{P}\left(\sup_{\left\{D\varepsilon_{n}^{(k)}\leq\rho^{\theta}\left(\lambda,\lambda_{*}^{\theta}\right)<D\varepsilon_{n}^{(k-1)},\lambda\in\Lambda_{n},\theta\in\Theta_{n}\right\}}\mathbb{E}_{n}\left[l^{\theta}(\lambda,Y)-l^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},Y\right)\right]\geq-\eta_{n}\right)
(sup{Dεn(k)ρθ(λ,λθ)<Dεn(k1),λΛn,θΘn}n1/2(𝔼n𝔼)[lθ(λ,Y)lθ(πnλθ,Y)]A1n1/2(Dεn(k))2α)\displaystyle\leq\mathbb{P}\left(\sup_{\left\{D\varepsilon_{n}^{(k)}\leq\rho^{\theta}\left(\lambda,\lambda_{*}^{\theta}\right)<D\varepsilon_{n}^{(k-1)},\lambda\in\Lambda_{n},\theta\in\Theta_{n}\right\}}n^{1/2}(\mathbb{E}_{n}-\mathbb{E})\left[l^{\theta}(\lambda,Y)-l^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},Y\right)\right]\geq A_{1}n^{1/2}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}\right)

Let vk=sup{Dεn(k)ρθ(λ,λθ)<Dεn(k1),λΛn,θΘn}Var(lθ(πnλθ,Y)lθ(λ,Y))v_{k}=\sup_{\left\{D\varepsilon_{n}^{(k)}\leq\rho^{\theta}\left(\lambda,\lambda_{*}^{\theta}\right)<D\varepsilon_{n}^{(k-1)},\lambda\in\Lambda_{n},\theta\in\Theta_{n}\right\}}\operatorname{Var}\left(l^{\theta}\left(\pi_{n}\lambda_{*}^{\theta},Y\right)-l^{\theta}(\lambda,Y)\right). By Condition C1b and εn(k1)ϵn\varepsilon_{n}^{(k-1)}\geq\epsilon_{n} we get vk4A2(Dεn(k1))2βv_{k}\leq 4A_{2}\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta}. Since εn(k)\varepsilon_{n}^{(k)} satisfies

εn(k)nmin((12r0)/(4α)+αk1β(2r)/(4α),(12r0)/[α(r+2)])\varepsilon_{n}^{(k)}\geq n^{-\min\left(\left(1-2r_{0}\right)/(4\alpha)+\alpha_{k-1}\beta(2-r)/(4\alpha),\left(1-2r_{0}\right)/[\alpha(r+2)]\right)}

we know that n1/2(Dεn(k))2αmax(c1n(2r8r0)/[2(r+2)],c2(Dεn(k1))2β(2r)/4nr0)n^{1/2}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}\geq\max\left(c_{1}n^{-\left(2-r-8r_{0}\right)/[2(r+2)]},c_{2}\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta(2-r)/4}n^{r_{0}}\right) for some constants c1>0c_{1}>0 and c2>0c_{2}>0. We can therefore apply Shen and Wong [1994]’s Lemma 1 and obtain:

(Bn(k))exp(ψ1(A1n1/2(Dεn(k))2α,vk,n))\mathbb{P}\left(B_{n}^{(k)}\right)\leq\exp\left(-\psi_{1}\left(A_{1}n^{1/2}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha},v_{k},\mathcal{F}_{n}\right)\right)

The behavior of ψ1()\psi_{1}(\cdot) can be analyzed via Shen and Wong [1994]’s Remark 12.
(i) If (Dεn(k))2αA1>12(Dεn(k1))2β,\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}A_{1}>12\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta}, then

ψ1(A1n1/2(Dεn(k))2α,vk,n)3A14n(Dεn(k))2αM2D2αnn2(12r0)/(r+2)M2D2αn(r+4r0)/(r+2)\psi_{1}\left(A_{1}n^{1/2}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha},v_{k},\mathcal{F}_{n}\right)\geq\frac{3A_{1}}{4}n\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}\geq M_{2}D^{2\alpha}nn^{-2\left(1-2r_{0}\right)/(r+2)}\geq M_{2}D^{2\alpha}n^{\left(r+4r_{0}\right)/(r+2)}

for some constant M2>0M_{2}>0. (ii) If (Dεn(k))2αA112(Dεn(k1))2β\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}A_{1}\leq 12\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta}, then

ψ1\displaystyle\psi_{1} (A1n1/2(Dεn(k))2α,vk,n)(A1n1/2(Dεn(k))2α)24(4A2)(Dεn(k1))2β\displaystyle\left(A_{1}n^{1/2}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha},v_{k},\mathcal{F}_{n}\right)\geq\frac{\left(A_{1}n^{1/2}\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}\right)^{2}}{4\left(4A_{2}\right)\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta}}
M3D4αβ(2r)/2(εn(k1))2β(2r)/2nr0(εn(k1))2β\displaystyle\geq M_{3}D^{4\alpha-\beta(2-r)/2}\left(\varepsilon_{n}^{(k-1)}\right)^{2\beta(2-r)/2}\frac{n^{r_{0}}}{\left(\varepsilon_{n}^{(k-1)}\right)^{2\beta}}
M3D4αβ(2r)/2(εn(1))βrnr0M3D4αβ(2r)/2nβr(12r0)/(4α)+r0\displaystyle\geq M_{3}D^{4\alpha-\beta(2-r)/2}\left(\varepsilon_{n}^{(1)}\right)^{-\beta r}n^{r_{0}}\geq M_{3}D^{4\alpha-\beta(2-r)/2}n^{\beta r\left(1-2r_{0}\right)/(4\alpha)+r_{0}}

for some M3>0M_{3}>0. Hence,

(Bn(k)){5exp((1ε)M2D2αn(r+4r0)/(r+2)) if (Dεn(k))2αA1>12(Dεn(k1))2β5exp((1ε)M3D4αβ(2r)/2nβr(12r0)/(4α)+r0) if (Dεn(k))2αA112(Dεn(k1))2β\mathbb{P}\left(B_{n}^{(k)}\right)\leq\left\{\begin{aligned} 5\exp\left(-(1-\varepsilon)M_{2}D^{2\alpha}n^{\left(r+4r_{0}\right)/(r+2)}\right)&\text{ if }\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}A_{1}>12\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta}\\ 5\exp\left(-(1-\varepsilon)M_{3}D^{4\alpha-\beta(2-r)/2}n^{\beta r\left(1-2r_{0}\right)/(4\alpha)+r_{0}}\right)&\text{ if }\left(D\varepsilon_{n}^{(k)}\right)^{2\alpha}A_{1}\leq 12\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta}\end{aligned}\right.

Take δ\delta_{*}, LL and εn(k)\varepsilon_{n}^{(k)} as defined in the Lemma, and we obtain (Bn(k))5exp(Lnδ)\mathbb{P}\left(B_{n}^{(k)}\right)\leq 5\exp\left(-L{n}^{\delta_{*}}\right). This yields the convergence rate of λ^nθ\widehat{\lambda}_{n}^{\theta}. The statement about arbitrary f(θ,λ,Y)f(\theta,\lambda,Y) follows by applying analogous arguments (starting at the definition of vkv_{k}) to the expression f(θ,πnλθ,Y)f(θ,λ,Y)f(\theta,\pi_{n}\lambda_{*}^{\theta},Y)-f(\theta,\lambda,Y) instead of lθ(πnλθ,Y)lθ(λ,Y)l^{\theta}(\pi_{n}\lambda_{*}^{\theta},Y)-l^{\theta}(\lambda,Y). ∎
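For intuition (our own remark, not part of the proof): the recursion for αk\alpha_{k} is linear, so when β(2r)<4α\beta(2-r)<4\alpha the contraction factor β(2r)/(4α)\beta(2-r)/(4\alpha) is below one and the rate exponents converge to a fixed point:

```latex
\alpha_{k}=\frac{1-2r_{0}}{4\alpha}+\alpha_{k-1}\,\frac{\beta(2-r)}{4\alpha}
\;\xrightarrow[k\to\infty]{}\;
\alpha_{\infty}=\frac{1-2r_{0}}{4\alpha-\beta(2-r)}
\qquad\text{whenever }\beta(2-r)<4\alpha .
```

In the common case β=α\beta=\alpha, this fixed point equals (12r0)/[α(r+2)]\left(1-2r_{0}\right)/[\alpha(r+2)], exactly the second exponent in the definition of εn(k)\varepsilon_{n}^{(k)}, so a finite number of induction steps already attains the final rate.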

C.1.2 Convergence of θ^n\widehat{\theta}_{n}

Theorem C.2 (Convergence Rate of A-Estimators).

Consider the family of M-estimators λ^nθ\widehat{\lambda}_{n}^{\theta} defined in 3.2 and the A-estimator θ^n\widehat{\theta}_{n} defined in 3.3. Assume that conditions C1b and C2 are satisfied with ρθ(λ,λ):=|𝔼[lθ(λ,Y)lθ(λ,Y)]|\rho^{\theta}(\lambda,\lambda^{\prime}):=\left|\mathbb{E}[l^{\theta}(\lambda,Y)-l^{\theta}(\lambda^{\prime},Y)]\right| (hence C1a is automatically satisfied with α=1/2\alpha=1/2). Let ρ¯(,)\bar{\rho}(\cdot,\cdot) be some pseudo-distance on Θ\Theta. Assume that the following conditions are satisfied:

  • CONDITION C1a’ For some constants A¯1>0\bar{A}_{1}>0 and α¯>0\bar{\alpha}>0, and all small ε>0\varepsilon>0:

    inf{ρ¯(θ,θ)ε,θΘn}𝔼[l(θ,Y)l(θ,Y)]2A¯1ε2α¯\inf_{\{\bar{\rho}\left(\theta,\theta_{*}\right)\geq\varepsilon,\theta\in\Theta_{n}\}}\mathbb{E}\left[l\left(\theta,Y\right)-l(\theta_{*},Y)\right]\geq 2\bar{A}_{1}\varepsilon^{2\bar{\alpha}}
  • CONDITION C1b’. For some constants A¯2>0\bar{A}_{2}>0 and β¯>0,\bar{\beta}>0, and all small ε>0\varepsilon>0:

    sup{ρ¯(θ,θ)ε,θΘn}𝕍[l(θ,Y)l(θ,Y)]A¯2ε2β¯\sup_{\left\{\bar{\rho}\left(\theta,\theta_{*}\right)\leq\varepsilon,\theta\in\Theta_{n}\right\}}\mathbb{V}\left[l(\theta,Y)-l\left(\theta_{*},Y\right)\right]\leq\bar{A}_{2}\varepsilon^{2\bar{\beta}}
  • CONDITION C2’. Let ¯n={l(θ,πnλθ,)l(πnθ,πnλπnθ,):θΘn}\bar{\mathcal{F}}_{n}=\left\{l(\theta,\pi_{n}\lambda_{*}^{\theta},\cdot)-l\left(\pi_{n}\theta_{*},\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},\cdot\right):\theta\in\Theta_{n}\right\}. For some r¯0<12\bar{r}_{0}<\frac{1}{2}, A¯3>0\bar{A}_{3}>0 and all small ε>0\varepsilon>0, its entropy (Def. 1) is bounded as:

    log𝒩(ε,¯n,)A¯3n2r¯0εr¯\log\mathcal{N}\left(\varepsilon,\bar{\mathcal{F}}_{n},\|\cdot\|_{\infty}\right)\leq\bar{A}_{3}n^{2\bar{r}_{0}}\varepsilon^{-\bar{r}}

    where either r¯>0\bar{r}>0 or r¯=0+\bar{r}=0^{+}, the latter denoting ε0+:=log(1/ε)\varepsilon^{-0^{+}}:=\log(1/\varepsilon).

Let ϵ¯n:=ρ¯(πnθ,θ)|𝔼[l(πnθ,Y)l(θ,Y)]|1/2α¯\bar{\epsilon}_{n}:=\bar{\rho}\left(\pi_{n}\theta_{*},\theta_{*}\right)\vee\left|\mathbb{E}[l(\pi_{n}\theta_{*},Y)-l(\theta_{*},Y)]\right|^{1/2\bar{\alpha}} be the approximation error of Θn\Theta_{n}. Then:

ρ¯(θ^n,θ)=O(nτ¯+ϵ¯n+(η¯n+nτ+ϵn+ηn)1/2α¯)\bar{\rho}\left(\widehat{\theta}_{n},\theta_{*}\right)=O_{\mathbb{P}}\left(n^{-\bar{\tau}}+\bar{\epsilon}_{n}+(\bar{\eta}_{n}+n^{-\tau}+\epsilon_{n}+\eta_{n})^{1/2\bar{\alpha}}\right)

where τ=τ(1/2,β,r,r0,n)\tau=\tau(1/2,\beta,r,r_{0},n) and ϵn\epsilon_{n} are defined as in Thm C.1 and τ¯=τ(α¯,β¯,r¯,r¯0,n)\bar{\tau}=\tau(\bar{\alpha},\bar{\beta},\bar{r},\bar{r}_{0},n). Also, for every f(θ,λ,Y)f(\theta,\lambda,Y) satisfying C1b and C2 when l(θ,λ,Y)l(\theta,\lambda,Y) is replaced by f(θ,λ,Y)f(\theta,\lambda,Y) (recall l(θ,Y)=l(θ,λθ,Y)l(\theta,Y)=l(\theta,\lambda_{*}^{\theta},Y)), we can bound the empirical process:

supθΘ^n(𝔼𝔼n)[f(θ,πnλθ,Y)f(θ,λθ,Y)]=O(nτ¯+ϵ¯n+(η¯n+nτ+ϵn+ηn)1/2α¯)\sup_{\theta\in\widehat{\Theta}_{n}}(\mathbb{E}-\mathbb{E}_{n})[f(\theta,\pi_{n}\lambda_{*}^{\theta},Y)-f(\theta_{*},\lambda_{*}^{\theta_{*}},Y)]=O_{\mathbb{P}}\left(n^{-\bar{\tau}}+\bar{\epsilon}_{n}+(\bar{\eta}_{n}+n^{-\tau}+\epsilon_{n}+\eta_{n})^{1/2\bar{\alpha}}\right)
Proof.

The proof is similar to that of Theorem C.1. Again, we will only prove the induction step via Lemma D.1 in the Online Appendix, as the remaining arguments are analogous to the proof of the previous Theorem or that of Theorem 1 in Shen and Wong [1994]. The proof of Lemma D.1 largely mirrors that of Lemma C.1, but uses the results of Theorem C.1 to control the convergence of the adversary. The main additional complexity lies in properly switching back and forth between the sieve spaces and the target function spaces, when bounding the empirical process terms and variances respectively. ∎

C.1.3 Proofs of Theorem 3.1 and Lemma B.5

Proof.

To see that Theorem 3.1 and Lemma B.5 follow from the previous results, simply choose ρθ(λ,λ)=|𝔼[lθ(λ,Y)lθ(λ,Y)]|\rho^{\theta}(\lambda,\lambda^{\prime})=|\mathbb{E}[l^{\theta}(\lambda,Y)-l^{\theta}(\lambda^{\prime},Y)]| and ρ¯(θ,θ)=|𝔼[l(θ,Y)l(θ,Y)]|\bar{\rho}(\theta,\theta^{\prime})=|\mathbb{E}[l(\theta,Y)-l(\theta^{\prime},Y)]|, such that Conditions C1a and C1a’ are automatically satisfied with α=1/2\alpha=1/2. Further, substitute γ=2β2β¯\gamma=2\beta\vee 2\bar{\beta} such that Conditions C1b and C1b’ hold by assumptions 3.7 and 3.6 for all ε<1C\varepsilon<1\wedge C. Conditions C2 and C2’ directly follow from 3.8, substituting s=2r0=2r¯0s=2r_{0}=2\bar{r}_{0} and fixing r=r¯r=\bar{r}. This yields Theorem 3.1, and Lemma B.5 with en=0e_{n}=0.

For a proof of Lemma B.5 with en0e_{n}\neq 0, note that Condition C1b in Theorem C.1 is only needed to verify vk4A2(Dεn(k1))2βv_{k}\leq 4A_{2}\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta} in the proof of Lemma C.1. Hence we can re-define ϵnϵn+en\epsilon_{n}\leftarrow\epsilon_{n}+e_{n} such that the definition of εn(k1)ϵn\varepsilon_{n}^{(k-1)}\geq\epsilon_{n} automatically ensures vk(Dεn(k1))2βv_{k}\prec\left(D\varepsilon_{n}^{(k-1)}\right)^{2\beta}. This change in constants does not affect the result. Analogous arguments can be applied to Lemma D.1 and thus Theorem C.2. ∎

C.2 Theorem 3.3

Proof.

The approximate Nash conditions 1.2 and 1.3 imply

O(en2)𝔼nl(θ^n,πnλ¯nθ^n(λ^n),Y)l(πnθ¯(θ^n),λ^n,Y)=𝔼nl(θ^n,λ^n,Y)[θ^nπnθ¯n(θ^n),πnλ¯nθ^n(λ^n)λ^n]+O(en2)=𝔼nl(θ^n,λ^n,Y)[env,enλθ^n[v]]+O(en2)\displaystyle\begin{split}O_{\mathbb{P}}(e_{n}^{2})&\geq\mathbb{E}_{n}l(\widehat{\theta}_{n},\pi_{n}\bar{\lambda}_{n}^{\widehat{\theta}_{n}}(\widehat{\lambda}_{n}),Y)-l(\pi_{n}\bar{\theta}(\widehat{\theta}_{n}),\widehat{\lambda}_{n},Y)\\ &=\mathbb{E}_{n}l^{\prime}(\widehat{\theta}_{n},\widehat{\lambda}_{n},Y)[\widehat{\theta}_{n}-\pi_{n}\bar{\theta}_{n}(\widehat{\theta}_{n}),\pi_{n}\bar{\lambda}_{n}^{\widehat{\theta}_{n}}(\widehat{\lambda}_{n})-\widehat{\lambda}_{n}]+O_{\mathbb{P}}(e_{n}^{2})\\ &=\mathbb{E}_{n}l^{\prime}(\widehat{\theta}_{n},\widehat{\lambda}_{n},Y)[e_{n}v,e_{n}\lambda_{*}^{\prime\widehat{\theta}_{n}}[v]]+O_{\mathbb{P}}(e_{n}^{2})\end{split} (C.2)

The second line uses Taylor’s theorem. The third line substitutes the definitions of θ¯n,λ¯nθ^n\bar{\theta}_{n},\bar{\lambda}_{n}^{\widehat{\theta}_{n}} and applies Condition N3. Since the signs of v,λθ^n[v]v,\lambda_{*}^{\prime\widehat{\theta}_{n}}[v] are arbitrary, we may replace the inequality with an equality, which yields O(en)=𝔼nl(θ^n,λ^n,Y)[v,λθ^n[v]]O_{\mathbb{P}}(e_{n})=\mathbb{E}_{n}l^{\prime}(\widehat{\theta}_{n},\widehat{\lambda}_{n},Y)[v,\lambda_{*}^{\prime\widehat{\theta}_{n}}[v]]. Adding and subtracting a few terms, we get:

𝔼nl(θ,Y)[v]=(𝔼n𝔼)[l(θ^n,λ^n,Y)[v,λθ^n[v]]l(θ,Y)[v]]+𝔼[l(θ^n,λ^n,Y)[v,λθ^n[v]]l(θ,Y)[v]]+O(en)=θ^nθ,v+O(en)\displaystyle\begin{split}\mathbb{E}_{n}l^{\prime}(\theta_{*},Y)[v]=&(\mathbb{E}_{n}-\mathbb{E})[l^{\prime}(\widehat{\theta}_{n},\widehat{\lambda}_{n},Y)[v,\lambda_{*}^{\prime\widehat{\theta}_{n}}[v]]-l^{\prime}(\theta_{*},Y)[v]]\\ &+\mathbb{E}[l^{\prime}(\widehat{\theta}_{n},\widehat{\lambda}_{n},Y)[v,\lambda_{*}^{\prime\widehat{\theta}_{n}}[v]]-l^{\prime}(\theta_{*},Y)[v]]+O_{\mathbb{P}}(e_{n})\\ =&\langle\widehat{\theta}_{n}-\theta_{*},v\rangle+O_{\mathbb{P}}(e_{n})\end{split} (C.3)

where the last line uses Conditions N1 and N2. Substituting v=vv=v_{*}, we get:

n(F(θ^n)F(θ))=nθ^nθ,v+o(1)=n𝔼n[l(θ,Y)[v]]+o(1)𝑑𝒩(0,V)\sqrt{n}\left(F(\widehat{\theta}_{n})-F(\theta_{*})\right)=\sqrt{n}\langle\widehat{\theta}_{n}-\theta_{*},v_{*}\rangle+o_{\mathbb{P}}(1)=\sqrt{n}\mathbb{E}_{n}[l^{\prime}(\theta_{*},Y)[v_{*}]]+o_{\mathbb{P}}(1)\overset{d}{\longrightarrow}\mathcal{N}(0,V)

with V=𝕍(l(θ,Y)[v])V=\mathbb{V}\left(l^{\prime}(\theta_{*},Y)[v_{*}]\right) by the standard central limit theorem. ∎
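As an illustrative specialization (ours, under the additional assumptions that θ\theta is finite-dimensional, F(θ)=ζθF(\theta)=\zeta^{\prime}\theta, and H:=θθ𝔼l(θ,Y)H:=\nabla_{\theta}\nabla_{\theta^{\prime}}\mathbb{E}l(\theta_{*},Y) is nonsingular): the inner product in N2 becomes u,v=uHv\langle u,v\rangle=u^{\prime}Hv, so the Riesz representer and asymptotic variance take the familiar sandwich form:

```latex
v_{*}=H^{-1}\zeta,\qquad
V=\mathbb{V}\left(l^{\prime}(\theta_{*},Y)[v_{*}]\right)
=\zeta^{\prime}H^{-1}\,\mathbb{V}\!\left[\nabla_{\theta}l(\theta_{*},Y)\right]H^{-1}\zeta .
```

When the information equality H=𝕍[θl(θ,Y)]H=\mathbb{V}[\nabla_{\theta}l(\theta_{*},Y)] holds, this collapses to ζH1ζ\zeta^{\prime}H^{-1}\zeta, matching Proposition 4.6 below.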

C.3 Theorem 3.4

Proof of Theorem 3.4.

Note that the regularization does not interfere with Theorem 3.2: the approximation power relative to Θ,Λ\Theta_{*},\Lambda_{*} is not reduced, as we remove only elements from the sieves Θn,Λn\Theta_{n},\Lambda_{n} that are far away from Θ,Λ\Theta_{*},\Lambda_{*}, and the regularization is sufficiently slow to guarantee that the sieves are nonempty for small enough ϵ>0\epsilon>0. Hence Theorem 3.2 holds with r¯=2\bar{r}=2, yielding rates o(n2/(2+r¯))=o(n1/2)o_{\mathbb{P}}(n^{-2/(2+\bar{r})})=o_{\mathbb{P}}(n^{-1/2}). Also note that the assumption dpd¯p¯<1/4\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}<1/4 along with the lower bound r¯>2/3\underline{r}>2/3 ensures supθΘθπnθ𝒳¯=o(n1)\sup_{\theta\in\Theta_{*}}\|\theta-\pi_{n}\theta\|_{\bar{{\mathcal{X}}}}=o(n^{-1}) and supλΛλπnλ𝒳=o(n1)\sup_{\lambda\in\Lambda_{*}}\|\lambda-\pi_{n}\lambda\|_{{\mathcal{X}}}=o(n^{-1}). We first verify condition N1, decomposing it into two parts by adding and subtracting l(θ,λθ,Y)[v,λθ[v]]=l(θ,Y)[v]l^{\prime}(\theta,\lambda_{*}^{\theta},Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]=l^{\prime}(\theta,Y)[v_{*}]. First, we show

supθΘ^n,λΛ^n(θ)(𝔼n𝔼)l(θ,Y)[v]l(θ,Y)[v]=o(n1/2)\sup_{\theta\in\widehat{\Theta}_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)}(\mathbb{E}_{n}-\mathbb{E})l^{\prime}(\theta,Y)[v_{*}]-l^{\prime}(\theta_{*},Y)[v_{*}]=o_{\mathbb{P}}(n^{-1/2})

If A7ii) holds with 𝕍[l(θ,Y)[v]l(θ,Y)[v]]𝔼[l(θ,Y)l(θ,Y)]\mathbb{V}[l^{\prime}(\theta_{*},Y)[v]-l^{\prime}(\theta,Y)[v]]\prec\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)], then this can be established with Lemma B.5 for γ=1\gamma=1, using the Lipschitz condition A4 and analogous arguments to those in the proof of Theorem 3.2. Lemma B.5 then yields the same o(n1/2)o_{\mathbb{P}}(n^{-1/2}) rates as Theorem 3.2. If A7ii) instead asserts that Θ\Theta_{*} is \mathbb{P}-Donsker, the same result follows from Lemma B.4, which can be applied because the Lipschitz continuity A4 implies the L2 continuity required by the Lemma. This implies condition N1, together with:

supθΘ^n,λΛ^n(θ)(𝔼n𝔼)l(θ,λ,Y)[v,λθ[v]]l(θ,λθ,Y)[v,λθ[v]]=o(n1/2)\sup_{\theta\in\widehat{\Theta}_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)}(\mathbb{E}_{n}-\mathbb{E})l^{\prime}(\theta,\lambda,Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]-l^{\prime}(\theta,\lambda_{*}^{\prime\theta},Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]=o_{\mathbb{P}}(n^{-1/2})

which can be established via analogous arguments and A7i). We proceed to verify condition N2. Using a similar decomposition, we note that

supθΘ^n,λΛ^n(θ)𝔼l(θ,λ,Y)[v,λθ[v]]l(θ,λθ,Y)[v,λθ[v]]=O(en)\sup_{\theta\in\widehat{\Theta}_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)}\mathbb{E}l^{\prime}(\theta,\lambda,Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]-l^{\prime}(\theta,\lambda_{*}^{\prime\theta},Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]=O_{\mathbb{P}}(e_{n})

which follows from A6i) and the o(n1/2)o_{\mathbb{P}}(n^{-1/2}) rates of Theorem 3.2. Similarly, A6ii) implies supθΘ^n,λΛ^n(θ)𝔼l(θ,Y)[v]l(θ,Y)[v]θθ,v=O(en)\sup_{\theta\in\widehat{\Theta}_{n},\lambda\in\widehat{\Lambda}_{n}(\theta)}\mathbb{E}l^{\prime}(\theta,Y)[v_{*}]-l^{\prime}(\theta_{*},Y)[v_{*}]-\langle\theta-\theta_{*},v_{*}\rangle=O_{\mathbb{P}}(e_{n}), hence condition N2 holds. Finally, we verify condition N3. Define πθ:=arginfθΘθθ\pi_{*}\theta:=\arg\inf_{\theta^{\prime}\in\Theta_{*}}\|\theta^{\prime}-\theta\|_{\infty} as the projection onto Θ\Theta_{*}. Similarly, πλ:=arginfλΛλλ\pi_{*}\lambda:=\arg\inf_{\lambda^{\prime}\in\Lambda_{*}}\|\lambda^{\prime}-\lambda\|_{\infty}. Due to the regularization, we know θπθ=o(n1)\|\theta-\pi_{*}\theta\|_{\infty}=o(n^{-1}). Therefore θ¯n(θ)(πθenv)=o(n1)\|\bar{\theta}_{n}(\theta)-(\pi_{*}\theta-e_{n}v_{*})\|_{\infty}=o(n^{-1}). By convexity of Θ\Theta_{*}, we have (πθenv)Θ(\pi_{*}\theta-e_{n}v_{*})\in\Theta_{*} for nn large enough. Given that supθΘθπnθ=o(n1)\sup_{\theta^{\prime}\in\Theta_{*}}\|\theta^{\prime}-\pi_{n}\theta^{\prime}\|_{\infty}=o(n^{-1}) due to dpd¯p¯<1/4\frac{d^{*}}{p}\vee\frac{\bar{d}^{*}}{\bar{p}}<1/4 and our choice of r¯\underline{r}, we get (πθenv)πn(πθenv)=o(n1)\|(\pi_{*}\theta-e_{n}v_{*})-\pi_{n}(\pi_{*}\theta-e_{n}v_{*})\|_{\infty}=o(n^{-1}). Taken together, these statements imply θ¯n(θ)πnθ¯n(θ)=o(n1)\|\bar{\theta}_{n}(\theta)-\pi_{n}\bar{\theta}_{n}(\theta)\|_{\infty}=o(n^{-1}). Analogous arguments yield λ¯nθ(λ)πnλ¯nθ(λ)=o(n1)\|\bar{\lambda}_{n}^{\theta}(\lambda)-\pi_{n}\bar{\lambda}_{n}^{\theta}(\lambda)\|_{\infty}=o(n^{-1}). Hence N3 holds. ∎

Appendix D Online Appendix

D.1 Proof of Theorem C.2

The induction step for the proof of Theorem C.2 is given by the following Lemma.

Lemma D.1 (Induction Step for Theorem C.2).

Suppose the Conditions of Theorem C.2 hold. If at Step k1k-1 we have a rate

εn(k1)=nα¯k1>max(n(12r¯0)/[α¯(r¯+2)],ϵ¯n,ϵn)\varepsilon_{n}^{(k-1)}=n^{-\bar{\alpha}_{k-1}}>\max\left(n^{-\left(1-2{\bar{r}_{0}}\right)/[{\bar{\alpha}}(\bar{r}+2)]},{\bar{\epsilon}}_{n},\epsilon_{n}\right)

so that

(ρ¯(θ^n,θ)Dεn(k1))5[exp((1ε)max(D4α¯,D2α¯)M1n2r¯0)+(k1)exp(Lnδ)]\mathbb{P}\left(\bar{\rho}\left(\widehat{\theta}_{n},\theta_{*}\right)\geq D\varepsilon_{n}^{(k-1)}\right)\leq 5\left[\exp\left(-(1-\varepsilon)\max\left(D^{4{\bar{\alpha}}},D^{2{\bar{\alpha}}}\right)M_{1}n^{2{\bar{r}_{0}}}\right)+(k-1)\exp\left(-Ln^{\delta_{*}}\right)\right]

where δ=min(r¯+4r¯0r¯+2,β¯r¯(12r¯0)4α¯+r¯0)\delta_{*}=\min\left(\frac{\bar{r}+4{\bar{r}_{0}}}{\bar{r}+2},\frac{{\bar{\beta}}\bar{r}\left(1-2{\bar{r}_{0}}\right)}{4{\bar{\alpha}}}+{\bar{r}_{0}}\right) and L=(1ε)min(M2D2α¯,M3D4α¯β¯(2r¯)/2)L=(1-\varepsilon)\min\left(M_{2}D^{2{\bar{\alpha}}},M_{3}D^{4{\bar{\alpha}}-{\bar{\beta}}(2-\bar{r})/2}\right), then at Step kk, we can find an improved rate

εn(k)=max(nα¯k,n(12r¯0)/[α¯(r¯+2)],ϵ¯n,(η¯n+rn)1/2α¯)\varepsilon_{n}^{(k)}=\max\left(n^{-{\bar{\alpha}}_{k}},n^{-\left(1-2{\bar{r}_{0}}\right)/[{\bar{\alpha}}(\bar{r}+2)]},{\bar{\epsilon}}_{n},({\bar{\eta}}_{n}+r_{n})^{1/2{\bar{\alpha}}}\right)

where α¯k=(12r¯0)/(4α¯)+α¯k1β¯(2r¯)/(4α¯),{\bar{\alpha}}_{k}=\left(1-2{\bar{r}_{0}}\right)/(4{\bar{\alpha}})+{\bar{\alpha}}_{k-1}{\bar{\beta}}(2-\bar{r})/(4{\bar{\alpha}}), so that

(ρ¯(θ^n,θ)Dεn(k))5[exp((1ε)max(D4α¯,D2α¯)M1n2r¯0)+kexp(Lnδ)]\mathbb{P}\left(\bar{\rho}\left(\widehat{\theta}_{n},\theta_{*}\right)\geq D\varepsilon_{n}^{(k)}\right)\leq 5\left[\exp\left(-(1-\varepsilon)\max\left(D^{4{\bar{\alpha}}},D^{2{\bar{\alpha}}}\right)M_{1}n^{2{\bar{r}_{0}}}\right)+k\exp\left(-L{n}^{\delta_{*}}\right)\right]

Furthermore, for every f(θ,λ,Y)f(\theta,\lambda,Y) satisfying Conditions C1b and C2 when l(θ,λ,Y)l(\theta,\lambda,Y) is replaced by f(θ,λ,Y)f(\theta,\lambda,Y) (recall l(θ,Y)=l(θ,λθ,Y)l(\theta,Y)=l(\theta,\lambda_{*}^{\theta},Y)), the same bound applies to:

(supθΘ^n(𝔼𝔼n)[f(θ,πnλθ,Y)f(θ,λθ,Y)]Dεn(k))\mathbb{P}\left(\sup_{\theta\in\widehat{\Theta}_{n}}(\mathbb{E}-\mathbb{E}_{n})[f(\theta,\pi_{n}\lambda_{*}^{\theta},Y)-f(\theta_{*},\lambda_{*}^{\theta_{*}},Y)]\geq D\varepsilon_{n}^{(k)}\right)
Proof.

As in the proof of Lemma C.1 we assume D>1D>1 (wlog) and we only prove the case of 4α¯β¯(2r¯)/24{\bar{\alpha}}\geq{\bar{\beta}}(2-\bar{r})/2. Let Bn(i)={Dεn(i)ρ¯(θ^n,θ)<Dεn(i1)}B_{n}^{(i)}=\left\{D\varepsilon_{n}^{(i)}\leq\bar{\rho}\left(\widehat{\theta}_{n},\theta_{*}\right)<D\varepsilon_{n}^{(i-1)}\right\} for i=2,,ki=2,\ldots,k. As before, we only need to bound (Bn(k))\mathbb{P}\left(B_{n}^{(k)}\right). To this end, it will be useful to define

rn:=supθΘn𝔼[l(θ,λθ,Y)l(θ,λ^nθ,Y)]supθΘn(𝔼𝔼n)[l(θ,λ^nθ,Y)l(θ,πnλθ,Y)]r_{n}:=\sup_{\theta\in\Theta_{n}}\mathbb{E}[l(\theta,\lambda_{*}^{\theta},Y)-l(\theta,\widehat{\lambda}_{n}^{\theta},Y)]\vee\sup_{\theta\in\Theta_{n}}(\mathbb{E}-\mathbb{E}_{n})[l(\theta,\widehat{\lambda}_{n}^{\theta},Y)-l(\theta,\pi_{n}\lambda_{*}^{\theta},Y)]

such that Theorem C.1 implies rn=O(nτ+ϵn+ηn)r_{n}=O_{\mathbb{P}}(n^{-\tau}+\epsilon_{n}+\eta_{n}), which also implies, by definition of rnr_{n} and ϵn\epsilon_{n}: supθΘn|𝔼n[l(θ,λ^nθ,Y)l(θ,πnλθ,Y)]|rn+ϵn2rn\sup_{\theta\in\Theta_{n}}\left|\mathbb{E}_{n}[l(\theta,\widehat{\lambda}_{n}^{\theta},Y)-l(\theta,\pi_{n}\lambda_{*}^{\theta},Y)]\right|\leq r_{n}+\epsilon_{n}\leq 2r_{n}. Together with 3.3 this yields:

𝔼n[l(θ^n,πnλθ^n,Y)l(θ,πnλθ,Y)]2rn+η¯nθΘn\mathbb{E}_{n}[l(\widehat{\theta}_{n},\pi_{n}\lambda_{*}^{\widehat{\theta}_{n}},Y)-l(\theta,\pi_{n}\lambda_{*}^{\theta},Y)]\leq 2r_{n}+\bar{\eta}_{n}\quad\forall\theta\in\Theta_{n} (D.1)

By Condition C1a’, we can therefore bound:

infθΘn:ρ¯(θ,θ)Dεn(k)𝔼[l(θ,πnλθ,Y)l(πnθ,πnλπnθ,Y)]η¯n2rn=infθΘn:ρ¯(θ,θ)Dεn(k)𝔼[l(θ,Y)l(θ,Y)]+𝔼[l(θ,Y)l(πnθ,Y)]η¯n2rn+𝔼[l(θ,πnλθ,Y)l(θ,Y)]+𝔼[l(πnθ,Y)l(πnθ,πnλπnθ,Y)]2A¯1(Dεn(k))2α¯ϵ¯n2α¯η¯nrnϵnϵnA¯1(Dεn(k))2α¯\displaystyle\begin{split}\inf_{{\theta\in\Theta_{n}:\bar{\rho}\left(\theta,\theta_{*}\right)\geq D\varepsilon_{n}^{(k)}}}&\mathbb{E}\left[l(\theta,\pi_{n}\lambda_{*}^{\theta},Y)-l(\pi_{n}\theta_{*},\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)\right]-{\bar{\eta}}_{n}-2r_{n}\\ =\inf_{{\theta\in\Theta_{n}:\bar{\rho}\left(\theta,\theta_{*}\right)\geq D\varepsilon_{n}^{(k)}}}&\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]+\mathbb{E}[l(\theta_{*},Y)-l(\pi_{n}\theta_{*},Y)]-{\bar{\eta}}_{n}-2r_{n}\\ &+\mathbb{E}[l(\theta,\pi_{n}\lambda_{*}^{\theta},Y)-l\left(\theta,Y\right)]+\mathbb{E}[l(\pi_{n}\theta_{*},Y)-l(\pi_{n}\theta_{*},\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)]\\ \geq&2{\bar{A}}_{1}\left(D\varepsilon_{n}^{(k)}\right)^{2{\bar{\alpha}}}-\bar{\epsilon}_{n}^{2\bar{\alpha}}-{\bar{\eta}}_{n}-r_{n}-\epsilon_{n}-\epsilon_{n}\geq\bar{A}_{1}\left(D\varepsilon_{n}^{(k)}\right)^{2{\bar{\alpha}}}\end{split} (D.2)

where the last line used C1a’, the definition of the approximation errors, assumed a large enough A¯1\bar{A}_{1} (wlog), and used various lower bounds implied by the definition of εn(k)\varepsilon_{n}^{(k)}. Together with D.1, this yields:

(Bn(k))\displaystyle\mathbb{P}\left(B_{n}^{(k)}\right) (sup{Dεn(k)ρ¯(θ,θ)<Dεn(k1),θΘn}𝔼n[l(πnθ,πnλπnθ,Y)l(θ,πnλθ,Y)]η¯n2rn)\displaystyle\leq\mathbb{P}\left(\sup_{\left\{D\varepsilon_{n}^{(k)}\leq\bar{\rho}\left(\theta,\theta_{*}\right)<D\varepsilon_{n}^{(k-1)},\theta\in\Theta_{n}\right\}}\mathbb{E}_{n}\left[l(\pi_{n}\theta_{*},\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)-l\left(\theta,\pi_{n}\lambda_{*}^{\theta},Y\right)\right]\geq-{\bar{\eta}}_{n}-2r_{n}\right)
(supθΘnDεn(k)ρ¯(θ,θ)<Dεn(k1)n1/2(𝔼n𝔼)[l(πnθ,πnλπnθ,Y)l(θ,πnλθ,Y)]A¯1n1/2(Dεn(k))2α¯)\displaystyle\leq\mathbb{P}\left(\sup_{\begin{subarray}{c}\theta\in\Theta_{n}\\ D\varepsilon_{n}^{(k)}\leq\bar{\rho}\left(\theta,\theta_{*}\right)<D\varepsilon_{n}^{(k-1)}\end{subarray}}n^{1/2}(\mathbb{E}_{n}-\mathbb{E})\left[l(\pi_{n}\theta_{*},\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)-l\left(\theta,\pi_{n}\lambda_{*}^{\theta},Y\right)\right]\geq\bar{A}_{1}n^{1/2}\left(D\varepsilon_{n}^{(k)}\right)^{2{\bar{\alpha}}}\right)

Let vk=sup{Dεn(k)ρ¯(θ,θ)<Dεn(k1),θΘn}𝕍[l(πnθ,πnλπnθ,Y)l(θ,πnλθ,Y)]v_{k}=\sup_{\left\{D\varepsilon_{n}^{(k)}\leq\bar{\rho}\left(\theta,\theta_{*}\right)<D\varepsilon_{n}^{(k-1)},\theta\in\Theta_{n}\right\}}\mathbb{V}\left[l(\pi_{n}\theta_{*},\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)-l\left(\theta,\pi_{n}\lambda_{*}^{\theta},Y\right)\right]. To bound vkv_{k}, we add and subtract terms and apply the Cauchy–Schwarz inequality:

𝕍\displaystyle\mathbb{V} [l(πnθ,πnλπnθ,Y)l(θ,πnλθ,Y)]\displaystyle[l(\pi_{n}\theta_{*},\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)-l(\theta,\pi_{n}\lambda_{*}^{\theta},Y)]
3𝕍[l(πnθ,λπnθ,Y)l(θ,λθ,Y)]+3𝕍[lπnθ(πnλπnθ,Y)lπnθ(λπnθ,Y)]+3𝕍[lθ(λθ,Y)lθ(πnλθ,Y)]\displaystyle\leq 3\mathbb{V}[l(\pi_{n}\theta_{*},\lambda_{*}^{\pi_{n}\theta_{*}},Y)-l(\theta,\lambda_{*}^{\theta},Y)]+3\mathbb{V}[l^{\pi_{n}\theta_{*}}(\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)-l^{\pi_{n}\theta_{*}}(\lambda_{*}^{\pi_{n}\theta_{*}},Y)]+3\mathbb{V}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\pi_{n}\lambda_{*}^{\theta},Y)]
𝕍[l(θ,Y)l(θ,Y)]+𝕍[l(πnθ,Y)l(θ,Y)]+𝕍[lπnθ(πnλπnθ,Y)lπnθ(λπnθ,Y)]\displaystyle\prec\mathbb{V}[l(\theta,Y)-l(\theta_{*},Y)]+\mathbb{V}[l(\pi_{n}\theta_{*},Y)-l(\theta_{*},Y)]+\mathbb{V}[l^{\pi_{n}\theta_{*}}(\pi_{n}\lambda_{*}^{\pi_{n}\theta_{*}},Y)-l^{\pi_{n}\theta_{*}}(\lambda_{*}^{\pi_{n}\theta_{*}},Y)]
+𝕍[lθ(λθ,Y)lθ(πnλθ,Y)]\displaystyle\quad+\mathbb{V}[l^{\theta}(\lambda_{*}^{\theta},Y)-l^{\theta}(\pi_{n}\lambda_{*}^{\theta},Y)]

By Conditions C1b and C1b’, and since εn(k1)ϵ¯n\varepsilon_{n}^{(k-1)}\geq{\bar{\epsilon}}_{n}, we obtain vk4A¯2(Dεn(k1))2β¯v_{k}\leq 4\bar{A}_{2}\left(D\varepsilon_{n}^{(k-1)}\right)^{2{\bar{\beta}}}, assuming A¯2\bar{A}_{2} is large enough (wlog). The remaining arguments are unchanged from the proof of Lemma C.1, which eventually yields: (Bn(k))5exp(Lnδ)\mathbb{P}\left(B_{n}^{(k)}\right)\leq 5\exp\left(-L{n}^{\delta_{*}}\right). This completes the proof for the convergence rate of θ^n\widehat{\theta}_{n}. To prove the statement about arbitrary f(θ,λ,Y)f(\theta,\lambda,Y) satisfying C1b and C2, simply repeat the arguments of the previous proof (starting at the definition of vkv_{k}) with l(θ,λ,Y)l(\theta,\lambda,Y) replaced by f(θ,λ,Y)f(\theta,\lambda,Y). ∎

D.2 Proof of Proposition 4.1

Proof.

The first order conditions for λ\lambda in 2.1 yield the optimal population adversary λθ(y)\lambda_{*}^{\theta}(y):

λθ(y)dθ(y)f(λθ(y))d(y)=!0λθ(y)=f(dθ(y)d(y))\lambda_{*}^{\theta}(y)\mathrm{d}\mathbb{P}_{\theta}(y)-f_{*}^{\prime}(\lambda_{*}^{\theta}(y))\mathrm{d}\mathbb{P}(y)\overset{!}{=}0\iff\lambda_{*}^{\theta}(y)=f^{\prime}\left(\frac{\mathrm{d}\mathbb{P}_{\theta}(y)}{\mathrm{d}\mathbb{P}(y)}\right) (D.3)
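As a concrete instance (our own worked example): for the KL-type choice f(t)=tlogtf(t)=t\log t, we have f(t)=1+logtf^{\prime}(t)=1+\log t and conjugate f(s)=es1f_{*}(s)=e^{s-1}, so D.3 specializes to

```latex
\lambda_{*}^{\theta}(y)=1+\log\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}(y),
\qquad
f_{*}^{\prime}\left(\lambda_{*}^{\theta}(y)\right)=e^{\lambda_{*}^{\theta}(y)-1}
=\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}(y),
```

and the first-order condition indeed holds; in general the equivalence in D.3 uses that (f)1=f(f_{*}^{\prime})^{-1}=f^{\prime} by convex conjugacy.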

We verify the conditions of Theorem 3.2. The Lipschitz condition A1 can be verified by writing l(θ,λ,Y)=λ(y)dθd(y)d(y)f(λ(Y))l(\theta,\lambda,Y)=\int\lambda(y)\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}(y)\mathrm{d}\mathbb{P}(y)-f_{*}(\lambda(Y)) and using the Lipschitzness of dθd\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}} in θ\theta and that of ff_{*} (which follows from the boundedness of Λ\Lambda and differentiability of ff). Towards A2, apply a 2nd-order Taylor expansion (with mean-value remainder) of Df(θ)D_{f}(\mathbb{P}_{\theta}\|\mathbb{P}) at dθd=dθd\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}=\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}}{\mathrm{d}\mathbb{P}} in the direction of dθd\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}, which yields for some θ~Θ\widetilde{\theta}\in\Theta on a path from θ\theta_{*} to θ\theta:

Df(θ)=f(dθd)d=12(dθddθd)2f′′(dθ~d)ddθddθdL2(𝒴)2D_{f}(\mathbb{P}_{\theta}\|\mathbb{P})=\int f\left(\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}\right)\mathrm{d}\mathbb{P}=\frac{1}{2}\int\left(\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}-\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}}{\mathrm{d}\mathbb{P}}\right)^{2}f^{\prime\prime}\left(\frac{\mathrm{d}\mathbb{P}_{\widetilde{\theta}}}{\mathrm{d}\mathbb{P}}\right)\mathrm{d}\mathbb{P}\asymp\left\|\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}-\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}}{\mathrm{d}\mathbb{P}}\right\|_{L^{2}(\mathcal{Y})}^{2}

where the last step follows from strict positivity and boundedness of f′′(dθd)f^{\prime\prime}\left(\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}\right) wp1. Further note that

𝕍[l(θ,Y)l(θ,Y)]=𝕍[f(dθd(Y))f(dθd(Y))]dθddθdL2(𝒴)2\mathbb{V}[l(\theta,Y)-l(\theta_{*},Y)]=\mathbb{V}\left[f_{*}\left(\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}(Y)\right)-f_{*}\left(\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}}{\mathrm{d}\mathbb{P}}(Y)\right)\right]\prec\left\|\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}-\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}}{\mathrm{d}\mathbb{P}}\right\|_{L^{2}(\mathcal{Y})}^{2}

due to Lipschitzness of ff_{*}. Also note that Lipschitzness of dθd\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}} in θ\theta implies

dθddθdL2(𝒴)2dθddθd𝒴~2+(y𝒴~)θθ2+(y𝒴~)\left\|\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}-\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}}{\mathrm{d}\mathbb{P}}\right\|_{L^{2}(\mathcal{Y})}^{2}\prec\left\|\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}-\frac{\mathrm{d}\mathbb{P}_{\theta_{*}}}{\mathrm{d}\mathbb{P}}\right\|_{\widetilde{\mathcal{Y}}}^{2}+\mathbb{P}(y\in\widetilde{\mathcal{Y}})\prec\|\theta-\theta_{*}\|_{\infty}^{2}+\mathbb{P}(y\in\widetilde{\mathcal{Y}})

Taken together, this implies A2. A3 can be verified analogously, starting with a Taylor expansion yielding 𝔼[l(θ,λ,Y)l(θ,λθ,Y)]=f′′(λ~)(λλθ)2dλλθL2(𝒴)2\mathbb{E}[l(\theta,\lambda,Y)-l(\theta,\lambda_{*}^{\theta},Y)]=-\int f_{*}^{\prime\prime}(\widetilde{\lambda})(\lambda-\lambda_{*}^{\theta})^{2}\mathrm{d}\mathbb{P}\asymp-\|\lambda-\lambda_{*}^{\theta}\|^{2}_{L^{2}(\mathcal{Y})} for some λ~Λ\widetilde{\lambda}\in\Lambda on a path from λθ\lambda_{*}^{\theta} to λ\lambda. ∎
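The closed form λθ=f(dθ/d)\lambda_{*}^{\theta}=f^{\prime}(\mathrm{d}\mathbb{P}_{\theta}/\mathrm{d}\mathbb{P}) can also be checked numerically. The sketch below (our own illustration, not the paper's procedure) solves the pointwise inner maximization λλdθ(y)f(λ)d(y)\lambda\mapsto\lambda\,\mathrm{d}\mathbb{P}_{\theta}(y)-f_{*}(\lambda)\,\mathrm{d}\mathbb{P}(y) by grid search for the KL case f(t)=tlogtf(t)=t\log t, f(s)=es1f_{*}(s)=e^{s-1}; the two Gaussian densities are hypothetical stand-ins for θ\mathbb{P}_{\theta} and \mathbb{P}.

```python
import numpy as np

# Illustrative numerical check (ours, not from the paper): for
# f(t) = t*log(t) the conjugate is f*(s) = exp(s - 1), and the
# population adversary solving (D.3) is
#   lambda*(y) = f'(p_theta(y)/p(y)) = 1 + log(p_theta(y)/p(y)).
# The inner maximization is pointwise in y: maximize
#   lam * p_theta(y) - f*(lam) * p(y)  over lam,
# solved here by brute-force grid search.

def optimal_adversary_grid(p_theta, p, lam_grid):
    """Grid-search maximizer of lam*p_theta(y) - exp(lam-1)*p(y) for each y."""
    obj = lam_grid[:, None] * p_theta[None, :] \
        - np.exp(lam_grid[:, None] - 1.0) * p[None, :]
    return lam_grid[np.argmax(obj, axis=0)]

# Hypothetical densities evaluated on a grid of y-values.
y = np.linspace(-2.0, 2.0, 41)
p = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)                 # N(0,1) density
p_theta = np.exp(-0.5 * (y - 0.3)**2) / np.sqrt(2 * np.pi)   # N(0.3,1) density

lam_grid = np.linspace(-3.0, 3.0, 60001)
lam_hat = optimal_adversary_grid(p_theta, p, lam_grid)
lam_star = 1.0 + np.log(p_theta / p)  # closed form f'(dP_theta/dP)

# The pointwise objective is strictly concave in lam, so the grid
# maximizer agrees with the closed form up to the grid spacing.
assert np.max(np.abs(lam_hat - lam_star)) < 1e-3
```

This mirrors how the empirical adversary λ^nθ\widehat{\lambda}_{n}^{\theta} approximates λθ\lambda_{*}^{\theta}, except that here the maximization is exact up to discretization rather than based on a sample.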

D.3 Proof of Proposition 4.2

Proof.

First, we verify the conditions of Theorem 3.4. Note that

l(θ,λ,Yi)[v,w]=λ(y)θvdθ(y)d(y)d(y)+w(y)dθ(y)d(y)d(y)f(λ(Yi))w(Yi)l^{\prime}(\theta,\lambda,Y_{i})[v,w]=\int\lambda(y)\nabla_{\theta\to v}\frac{\mathrm{d}\mathbb{P}_{\theta}(y)}{\mathrm{d}\mathbb{P}(y)}\mathrm{d}\mathbb{P}(y)+\int w(y)\frac{\mathrm{d}\mathbb{P}_{\theta}(y)}{\mathrm{d}\mathbb{P}(y)}\mathrm{d}\mathbb{P}(y)-f_{*}^{\prime}(\lambda(Y_{i}))w(Y_{i})

The Lipschitz condition A4 is therefore satisfied by the Lipschitzness of ff_{*}^{\prime}, dθd\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}} and that of θvdθd\nabla_{\theta\to v}\frac{\mathrm{d}\mathbb{P}_{\theta}}{\mathrm{d}\mathbb{P}}. Towards A6 i), we apply a Taylor expansion with 2nd-order mean-value remainder around λ=λθ\lambda=\lambda_{*}^{\theta}, yielding for some λ~θ\widetilde{\lambda}^{\theta} on a path between λθ\lambda_{*}^{\theta} and λ\lambda:

𝔼[l(θ,λ,Yi)[v,λθ[v]]l(θ,λθ,Yi)[v,λθ[v]]]\displaystyle\mathbb{E}[l^{\prime}(\theta,\lambda,Y_{i})[v,\lambda_{*}^{\prime\theta}[v]]-l^{\prime}(\theta,\lambda_{*}^{\theta},Y_{i})[v,\lambda_{*}^{\prime\theta}[v]]]
=θvλθλλθ𝔼[l(θ,λθ,Y)]+𝔼[f′′(λ~θ(Y))(λλθ)2w(Y)]𝔼[(λλθ)2]\displaystyle=\nabla_{\theta\to v}\nabla_{\lambda_{*}^{\theta}\to\lambda-\lambda_{*}^{\theta}}\mathbb{E}[l(\theta,\lambda_{*}^{\theta},Y)]+\mathbb{E}[f_{*}^{\prime\prime}(\widetilde{\lambda}^{\theta}(Y))(\lambda-\lambda_{*}^{\theta})^{2}w(Y)]\prec\mathbb{E}[(\lambda-\lambda_{*}^{\theta})^{2}]

where we used \nabla_{\lambda_{*}^{\theta}\to\lambda-\lambda_{*}^{\theta}}\mathbb{E}[l(\theta,\lambda_{*}^{\theta},Y)]\equiv 0 to get rid of the first-order term. The last line follows from the boundedness (in absolute value) of f_{*}^{\prime\prime}(\cdot) and w(\cdot) on their compact support. Hence A6 i) is satisfied. A7 i) follows from:

\mathbb{V}[l^{\prime}(\theta,\lambda,Y_{i})[v,w]-l^{\prime}(\theta,\lambda_{*}^{\theta},Y_{i})[v,w]]=\mathbb{V}[(f_{*}^{\prime}(\lambda(Y))-f_{*}^{\prime}(\lambda_{*}^{\theta}(Y)))w(Y)]\prec\mathbb{E}[(\lambda-\lambda_{*}^{\theta})^{2}]

where the last step used the Lipschitzness of f_{*}^{\prime} and again the boundedness of w(\cdot). Assumption A7 ii) is satisfied since \Theta_{*} is Donsker by assumption. Finally, we verify A6 ii). Applying the mean value theorem twice, we get that for some \widetilde{\theta},\widetilde{\theta}^{\prime} on a path between \theta_{*} and \theta, \mathbb{E}[l^{\prime}(\theta,Y)[v_{*}]-l^{\prime}(\theta_{*},Y)[v_{*}]-\langle\theta-\theta_{*},v_{*}\rangle]=\nabla_{\widetilde{\theta}\to\widetilde{\theta}^{\prime}-\theta_{*}}\nabla_{\widetilde{\theta}\to\theta-\theta_{*}}\mathbb{E}[l^{\prime}(\widetilde{\theta},Y)[v_{*}]], which is dominated by \mathbb{E}[l(\theta,Y)]=D_{f}(\mathbb{P}_{\theta}\|\mathbb{P}_{\theta_{*}}) via the last assumption stated in the proposition. Therefore A6 ii) is satisfied, and Theorem 3.4 applies. ∎
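The A7 i) step above rests on the elementary chain \mathbb{V}[X]\leq\mathbb{E}[X^{2}]\leq L^{2}C^{2}\mathbb{E}[(\lambda-\lambda_{*}^{\theta})^{2}] for X=(f_{*}^{\prime}(\lambda(Y))-f_{*}^{\prime}(\lambda_{*}^{\theta}(Y)))w(Y), with L the Lipschitz constant of f_{*}^{\prime} and |w|\leq C. A minimal Monte Carlo sanity check of this bound; every concrete choice in the snippet (f_{*}^{\prime}=\tanh, Y uniform, the functions lam, lam_star, w) is a hypothetical stand-in, not an object from the proposition:

```python
import numpy as np

# Sanity check of the A7 i)-type variance bound
#   Var[(f*'(lam(Y)) - f*'(lam_star(Y))) * w(Y)] <= L^2 C^2 E[(lam(Y) - lam_star(Y))^2]
# for an L-Lipschitz f*' and a weight bounded by C.

rng = np.random.default_rng(0)
Y = rng.uniform(-1.0, 1.0, size=100_000)

f_prime = np.tanh                            # Lipschitz with constant L = 1
L, C = 1.0, 1.0

lam = lambda y: 0.5 * y**2                   # candidate adversary
lam_star = lambda y: 0.5 * y**2 - 0.1 * y    # stand-in for lambda_*^theta
w = lambda y: np.cos(y)                      # bounded weight, |w| <= 1 on [-1, 1]

lhs = np.var((f_prime(lam(Y)) - f_prime(lam_star(Y))) * w(Y))
rhs = L**2 * C**2 * np.mean((lam(Y) - lam_star(Y)) ** 2)

assert lhs <= rhs  # Var[X] <= E[X^2] <= L^2 C^2 E[(lam - lam_star)^2]
```

Because the Lipschitz and boundedness inequalities hold pointwise, the assertion holds for every draw, not merely in expectation.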

D.4 Proof of Proposition 4.6

Proof.

Note that V=\nabla_{\theta_{*}}\nabla_{\theta_{*}^{\prime}}\mathbb{E}[l(\theta_{*},Y)]=\mathbb{V}[\nabla_{\theta_{*}}l(\theta_{*},Y)]. We verify the conditions of Theorem 3.4 for v_{*}=V^{-1}\zeta, such that its conclusion becomes \sqrt{n}\langle\theta-\theta_{*},v_{*}\rangle=\sqrt{n}(\theta-\theta_{*})^{\prime}\zeta\to\mathcal{N}(0,\zeta^{\prime}V^{-1}\zeta), which yields the Proposition via the Cramér-Wold device. Note that l^{\prime}(\theta,\lambda,Y)[v,w]=v^{\prime}d(X,\theta)^{\prime}\lambda(Z)+m(X,\theta)^{\prime}w(Z)-\frac{1}{2}\lambda(Z)^{\prime}w(Z). Hence assumption A4 follows from boundedness and Lipschitzness of d(X,\cdot),m(X,\cdot). A5 holds by assumption. To verify A6 i), notice that:

\mathbb{E}[l^{\prime}(\theta,\lambda,Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]-l^{\prime}(\theta,\lambda_{*}^{\theta},Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]]=\mathbb{E}\left[\left(v_{*}^{\prime}d(X,\theta)-\frac{1}{2}\lambda_{*}^{\prime\theta}[v_{*}](Z)\right)\left(\lambda-\lambda_{*}^{\theta}\right)(Z)\right]=0

where the last equality used the fact that \mathbb{E}[v_{*}^{\prime}d(X,\theta)|Z]=\frac{1}{2}\lambda_{*}^{\prime\theta}[v_{*}](Z). Towards Assumption A6 ii), note that we can apply a first-order Taylor expansion with mean-value remainder twice, which yields for some \bar{\theta},\widetilde{\theta} on a path between \theta_{*} and \theta:

\mathbb{E}[l^{\prime}(\theta,Y)[v_{*}]-l^{\prime}(\theta_{*},Y)[v_{*}]-\langle\theta-\theta_{*},v_{*}\rangle]=(\bar{\theta}-\theta_{*})^{\prime}\nabla_{\widetilde{\theta}}\nabla_{\widetilde{\theta}^{\prime}}\mathbb{E}[l^{\prime}(\widetilde{\theta},Y)[v_{*}]](\theta-\theta_{*})\prec\|\theta-\theta_{*}\|_{2}^{2}

Given the identification assumption 2.5, we can use a second-order Taylor expansion with mean-value remainder to show that \|\theta-\theta_{*}\|_{2}^{2}\asymp\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)], which then yields A6 ii). Similarly, we can show A7 ii) via a mean-value remainder:

\mathbb{E}[(l^{\prime}(\theta,Y)[v]-l^{\prime}(\theta_{*},Y)[v])^{2}]=\mathbb{E}\left[\left((\theta-\theta_{*})^{\prime}\nabla_{\widetilde{\theta}}l^{\prime}(\widetilde{\theta},Y)[v]\right)^{2}\right]\prec\|\theta-\theta_{*}\|_{2}^{2}

We can similarly establish A7 i) via:

\mathbb{V}[l^{\prime}(\theta,\lambda,Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]-l^{\prime}(\theta,\lambda_{*}^{\theta},Y)[v_{*},\lambda_{*}^{\prime\theta}[v_{*}]]]\prec\mathbb{E}[(\lambda(Z)-\lambda_{*}^{\theta}(Z))^{2}]\asymp\mathbb{E}[l(\theta,\lambda,Y)-l(\theta,\lambda_{*}^{\theta},Y)] ∎
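As a remark, the identity \mathbb{E}[v_{*}^{\prime}d(X,\theta)|Z]=\frac{1}{2}\lambda_{*}^{\prime\theta}[v_{*}](Z) invoked for A6 i) can be recovered from the adversary's first-order condition. A sketch, assuming d(X,\theta)=\nabla_{\theta}m(X,\theta) (consistent with the directional derivative l^{\prime} above) and that differentiation may be interchanged with conditional expectations:

0=\nabla_{\lambda_{*}^{\theta}\to w}\mathbb{E}[l(\theta,\lambda_{*}^{\theta},Y)]=\mathbb{E}\left[\left(m(X,\theta)-\tfrac{1}{2}\lambda_{*}^{\theta}(Z)\right)^{\prime}w(Z)\right]\quad\forall w\;\implies\;\lambda_{*}^{\theta}(Z)=2\,\mathbb{E}[m(X,\theta)|Z]

Differentiating this expression along the direction \theta\to v_{*} yields \lambda_{*}^{\prime\theta}[v_{*}](Z)=2\,\mathbb{E}[d(X,\theta)v_{*}|Z], which is the identity above.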

D.5 Proof of Proposition 4.4

Proof.

A0 holds by assumption, and condition A1 is implied by Lipschitzness of V_{\theta},P_{\theta} in \theta. Continuity of V_{\theta},P_{\theta}, compactness of \Theta and A0 further imply that there is some 0\leq M<\infty such that

\sup_{s,a,s^{+},\theta}|R_{\theta}(s,a)+\beta V_{\theta}(s^{+})-V_{\theta}(s)-\log P_{\theta}(a|s)|\leq M

Given that \lambda_{*}^{\theta_{*}}\equiv 0\implies l(\theta_{*},Y)=0, condition A2 then follows from

\mathbb{V}[l(\theta,Y)-l(\theta_{*},Y)]\prec\mathbb{E}[l(\theta,Y)^{2}]\leq 2M^{2}\mathbb{E}[\lambda_{*}^{\theta}(s,a)^{2}]\prec\mathbb{E}[l(\theta,Y)]=\mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]

as well as \mathbb{E}[l(\theta,Y)-l(\theta_{*},Y)]\prec\mathbb{E}[(\lambda_{*}^{\theta}(s,a)-\lambda_{*}^{\theta_{*}}(s,a))^{2}|(s,a)\in\widetilde{\mathcal{X}}]+\mathbb{P}((s,a)\not\in\widetilde{\mathcal{X}})\prec\|\theta-\theta_{*}\|^{2}_{\widetilde{\mathcal{X}}}+\mathbb{P}((s,a)\not\in\widetilde{\mathcal{X}}) for all \widetilde{\mathcal{X}}\subset\bar{\mathcal{X}}. A3 can be established analogously, hence the conclusions of Theorem 3.2 hold. ∎

D.6 Proof of Proposition 4.7

Proof.

Note that \lambda_{*}^{\theta}(x)=\theta(x)-\theta_{*}(x) follows from the first-order conditions of the adversary. Condition A0 is satisfied by assumption, and A1 follows from the Lipschitzness of m(Y,\cdot) and boundedness. Lipschitzness of m(Y,\lambda(x)) in \lambda(x) and boundedness imply that l(\theta,\lambda,Y)=m(Y,\lambda)-\theta(x)\lambda(x)-\lambda(x)^{2}/2 is Lipschitz in \lambda(x). This implies

\mathbb{V}[l(\theta,\lambda,Y)-l(\theta,\lambda_{*}^{\theta},Y)]\prec\mathbb{E}[(\lambda(x)-\lambda_{*}^{\theta}(x))^{2}]

and together with \lambda_{*}^{\theta}(x)=\theta(x)-\theta_{*}(x), this yields:

\mathbb{V}[l(\theta_{*},\lambda_{*}^{\theta_{*}},Y)-l(\theta,\lambda_{*}^{\theta},Y)]\prec\mathbb{E}[(\theta(x)-\theta_{*}(x))^{2}]

These bounds imply A3 and A2, respectively. The result follows because the loss \mathbb{E}[l(\theta,Y)] is proportional to the squared L^{2} norm of \theta-\theta_{*}. ∎