Dyadic Regression with Sample Selection

Kensuke Sakamoto, University of Wisconsin-Madison

(Date: 07/03/2024)

Abstract.

This paper addresses the sample selection problem in panel dyadic regression analysis. Dyadic data often include many zeros in the main outcomes due to the underlying network formation process. This not only contaminates popular estimators used in practice but also complicates the inference due to the dyadic dependence structure. We extend Kyriazidou (1997)’s approach to dyadic data and characterize the asymptotic distribution of our proposed estimator. The convergence rates are $\sqrt{n}$ or $\sqrt{n^{2}h_{n}}$ , depending on the degeneracy of the Hájek projection part of the estimator, where $n$ is the number of nodes and $h_{n}$ is a bandwidth. We propose a bias-corrected confidence interval and a variance estimator that adapts to the degeneracy. A Monte Carlo simulation shows the good finite sample performance of our estimator and highlights the importance of bias correction in both asymptotic regimes when the fraction of zeros in outcomes varies. We illustrate our procedure using data from Moretti and Wilson (2017)’s paper on migration.

The author thanks Jack Porter, Bruce Hansen, Xiaoxia Shi, Harold Chiang, Kohei Yata, and participants in New York Camp Econometrics XVII for their support and comments. This paper was supported by the Summer Fellowship from the Department of Economics at the University of Wisconsin-Madison.

Keywords: Dyadic Data, Sample Selection, Fixed Effects, Network Formation, Bias Correction.

1. Introduction

Dyadic data describe pairwise outcomes, such as trade volume between countries. Numerous applications have analyzed such data using a regression model, referred to as dyadic regression. Examples include gravity equations in trade, migration, and urban economics (Helpman et al., 2008; Moretti and Wilson, 2017; Monte et al., 2018), and risk-sharing networks in development economics (Fafchamps and Gubert, 2007). One of the prominent features of dyadic data is the non-negligible number of zeros in the outcomes of interest,¹ possibly due to economic mechanisms such as prohibitive fixed costs. This paper deals with panel dyadic data, where zeros are prevalent both cross-sectionally and in the time series.

¹ Helpman et al. (2008) document that there was no trade among roughly 50% of country pairs from 1970 to 1997. In 2017, there was no migration among about 60% of country pairs (the author's calculation using data available from the World Bank: https://www.worldbank.org/en/topic/migrationremittancesdiasporaissues/brief/migration-remittances-data).

How should we treat zeros in dyadic regression? In applications, zeros are often discarded due to the log-linear specification (Moretti and Wilson, 2017). The Poisson pseudo-maximum-likelihood (PPML) estimator is also frequently used to avoid discarding zeros and address issues related to log-linearization (Silva and Tenreyro, 2006). These approaches implicitly assume that zeros occur exogenously. Since a zero in a pairwise outcome results from no link between two units, we can associate zeros with the underlying network formation mechanism that determines which pairs appear in a sample. If the network is formed endogenously as a result of an interaction between two agents, the empirical practices mentioned above can be subject to sample selection bias, as in Heckman (1979).

This paper has two primary objectives. First, we aim to jointly model network formation and the outcome generation on such networks. This joint modeling allows identification of the effects of changes in pair-level or individual-level characteristics, separating them from the effects caused by changes in networks. In contrast, the dyadic regression literature has primarily focused on regression with fixed or exogenous networks. Second, we develop a robust inference method that accounts for the dyadic dependence structure. Pairwise outcomes are likely to be dependent on each other through common shocks to individuals. This dyadic dependence can be especially important in the presence of zeros and network formation because a few individuals can have significantly more links than others,² which strengthens the influence of shocks to those individuals on the dyadic dependence. At the same time, it is known that with dyadic data, we can have different asymptotic regimes depending on the nature of those individual-level shocks (Menzel, 2021). To be practitioner-friendly, our inference method needs to account for the dyadic dependence and adapt to the different resulting asymptotic regimes.

² For example, in Moretti and Wilson (2017)’s migration flow data, star scientists’ migration from or to California constituted approximately 14% of the links in the sample on average. This percentage is much higher than the expected 2% when considering all potential links in the sample.

Our setup will be a linear panel dyadic regression model, featuring the network formation process as a sample selection mechanism that generates both zeros and unobservable outcomes. To capture the dyadic dependence structure, we incorporate two types of unobservable individual heterogeneity into the model: time-invariant fixed effects and time-varying random effects, which is a new modeling strategy in the literature. We extend Kyriazidou (1997)’s identification argument, originally designed for individualistic data, to dyadic data, and correspondingly propose a semiparametric, kernel-based estimator that assigns weights to pairs whose selection index remains stable over time. A significant challenge we face when analyzing our estimator is the need to address the dependence structure caused by node-level shocks, which is absent in individualistic data models analyzed in Kyriazidou (1997). To control for this type of dependence, we utilize the U-statistic-like structure of our estimator, which gives us a mutually uncorrelated decomposition into the node-level Hájek projection part and the dyad-level projection error part.

We show that our estimator is asymptotically normal with two different convergence rates depending on the nature of errors. If the Hájek projection is non-degenerate (i.e., each summand has positive variance), our estimator achieves $\sqrt{n}$ -asymptotic normality, where $n$ is the number of nodes. In this case, our estimator not only has zero asymptotic bias but also shares the same convergence rate as the usual fixed effect estimator and the PPML estimator when their leading terms are also non-degenerate. The latter point implies that using a kernel-based local method entails no loss in effective sample size relative to the usual non-weighted estimators. If the Hájek projection is degenerate, our estimator achieves $\sqrt{Nh_{n}}$ -asymptotic normality, where $N\sim n^{2}$ is the number of dyads and $h_{n}$ is a bandwidth. While the usual fixed effect estimator and the PPML estimator can be non-Gaussian in the limit (Menzel, 2021), our estimator is guaranteed to be asymptotically normal regardless of degeneracy. This result is analogous to Hall (1984)’s central limit theorem for degenerate U-statistics, allowing common statistics of interest, such as confidence intervals, to be constructed in a standard manner. In the degenerate case, our estimator exhibits asymptotic bias, which motivates us to introduce a bias correction.

We propose a variance estimator and bias-corrected confidence intervals that adapt to the degeneracy. Our variance estimator is similar to the one proposed by Graham et al. (2019) for nonparametric dyadic density estimation. We show that our estimator is consistent for the asymptotic variances in both non-degenerate and degenerate cases, after being rescaled by $\sqrt{n}$ or $\sqrt{Nh_{n}}$ , respectively. For the bias correction, we use a consistent estimator for the asymptotic bias in the degenerate case. We show that the correction term is negligible in the non-degenerate case after being rescaled by $\sqrt{n}$ . Combining the bias-corrected estimator with the variance estimator, we can construct bias-corrected confidence intervals for our estimator. These intervals have asymptotically correct sizes regardless of the (non-)degeneracy of the leading term in our estimator.

We conduct a simple simulation exercise to demonstrate the performance of our estimators compared to the usual fixed effect estimator and PPML estimator, as we vary the fraction of selected dyads from 10% to 90%. Our proposed estimator exhibits better finite sample properties than the other two estimators. Our bias-corrected confidence intervals also outperform the alternatives in coverage probabilities, regardless of degeneracy. This result underscores the importance of bias correction in finite samples, even though the asymptotic bias is zero in the non-degenerate case, which is a new finding in the literature.

As an empirical application, we extend and apply our estimator to the regression specification proposed by Moretti and Wilson (2017), which estimates the effects of state tax differences on the internal migration flows in the U.S. Comparing our proposed estimator with Moretti and Wilson (2017)’s, we find that their conclusion, which suggests that state tax differences have a significant impact on internal migration, may not be robust in the presence of a dyadic dependence structure and sample selection biases.

This paper is closely related to the burgeoning literature on dyadic regression (Cameron and Miller, 2014; Tabord-Meehan, 2019; Bonhomme, 2020; Zeleneev, 2020; Graham, 2020; Graham et al., 2021). Except for Bonhomme (2020) and Zeleneev (2020), these papers do not touch on non-random sample selection but instead focus on the consequence of the dyadic dependence structure. Bonhomme (2020) mainly focuses on the case where the selection is conditionally random with random effects. He also discusses the case where the selection is conditionally non-random without a theoretical analysis. Zeleneev (2020) studies identification and estimation of a dyadic regression model with flexible fixed effects. While his focus is on identification and a rate of convergence analysis, our paper provides the full inference results.

This paper also contributes to the literature on econometric analysis of models with endogenous network formation. Examples include Johnsson and Moon (2021), Auerbach (2022), and Jochmans (2023). While these papers study social interaction/peer effects type models where outcomes of interest are individualistic, our paper studies the direct consequence of network formation on dyadic outcomes.

2. Model

There are $n$ nodes in the data (e.g., states, countries), indexed by $i=1,...,n$ . Let $\{(X_{it},Z_{it})_{t=1,...,T}\}_{i=1}^{n}$ be the node-level observations, where $X_{it}\in\mathbb{R}^{q_{x}}$ and $Z_{it}\in\mathbb{R}^{q_{z}}$ . For each dyad $ij$ and time $t$ , $Y_{ijt}\in\mathbb{R}$ is a main outcome, and we observe a binary variable $d_{ijt}\in\{0,1\}$ , which indicates that $Y_{ijt}$ is observable only if $d_{ijt}=1$ .³ We can interpret the adjacency matrix $D_{t}\equiv[d_{ijt}]_{i,j=1,...,n}$ as a network that summarizes the existence of interactions between nodes. In this paper, we restrict our attention to a model with $T=2$ and an undirected graph where $Y_{ijt}=Y_{jit}$ , $d_{ijt}=d_{jit}$ for all $i,j,t$ . We also rule out self-loops by convention: $Y_{iit}=d_{iit}=0$ for all $i,t$ . An extension to $T>2$ and a directed graph is discussed in Section 4.1.

³ Since we focus on a linear model, we can interchange unobservability with zero. Alternatively, we can interpret $Y_{ijt}$ as the logarithm of $\tilde{Y}_{ijt}\geq 0$ .

The data is generated according to the following model:

$$W_{ijt}=w(X_{it},X_{jt}),\quad R_{ijt}=r(Z_{it},Z_{jt}), \tag{2.1}$$

$$Y_{ijt}^{*}=W_{ijt}^{\prime}\beta+A_{i}+A_{j}+\epsilon_{ijt}, \tag{2.2}$$

$$d_{ijt}=\boldsymbol{1}\{R_{ijt}^{\prime}\gamma+B_{i}+B_{j}-\eta_{ijt}\geq 0\}, \tag{2.3}$$

$$Y_{ijt}=\begin{cases}Y_{ijt}^{*} & \text{if }d_{ijt}=1,\\ \text{unobserved} & \text{if }d_{ijt}=0.\end{cases} \tag{2.4}$$

The regressors $W_{ijt}\in\mathbb{R}^{q_{w}}$ and $R_{ijt}\in\mathbb{R}^{q_{r}}$ are constructed from some user-specified symmetric functions $w:\mathbb{R}^{q_{x}}\times\mathbb{R}^{q_{x}}\to\mathbb{R}^{q_{w}}$ and $r:\mathbb{R}^{q_{z}}\times\mathbb{R}^{q_{z}}\to\mathbb{R}^{q_{r}}$ such that $w(x,y)=w(y,x)$ and $r(x^{\prime},y^{\prime})=r(y^{\prime},x^{\prime})$ for any $x,y\in\mathbb{R}^{q_{x}}$ and $x^{\prime},y^{\prime}\in\mathbb{R}^{q_{z}}$ . For example, we can specify $w$ to be a pairwise summation $w(x,y)=x+y$ . The symmetry in these functions is needed as our graphs are undirected; we can relax this requirement with directed graphs, as discussed in Section 4.1. The node-level fixed effects $A_{i},B_{i}\in\mathbb{R}$ are unobservable, and we allow them to correlate with the regressors, as in the usual fixed effect model.

We specify the structure of errors $\epsilon_{ijt},\eta_{ijt}$ as follows: For $1\leq i<j\leq n$ ,

$$(\epsilon_{ij1},\epsilon_{ij2},\eta_{ij1},\eta_{ij2})=\tau(U_{i1},U_{i2},U_{j1},U_{j2},U_{ij1},U_{ij2}), \tag{2.5}$$

where $U_{i}\equiv(U_{i1},U_{i2})$ and $U_{ij}\equiv(U_{ij1},U_{ij2})$ are node-level and dyad-level random vectors, respectively, and $\tau$ is an unknown multivariate function.⁴

⁴ Here, we need not specify the dimensions of those vectors and the function, since the following results do not depend on them as long as those dimensions are fixed.

Let $\xi_{i}\equiv(X_{i1},X_{i2},Z_{i1},Z_{i2},A_{i},B_{i})$ be a vector that contains observed and unobserved information in the two periods with respect to node $i$ . We impose the following distributional assumption:

Assumption 1.

(1) $\xi_{i}$ , $i=1,...,n$ , are independently and identically distributed.

(2) $(\epsilon_{ijt},\eta_{ijt})_{t=1,2}$ , $1\leq i<j\leq n$ , are generated according to (2.5).

(3) Conditionally on $\{\xi_{i}\}_{i=1}^{n}$ , $U_{i}$ , $i=1,...,n$ , are independent, $U_{ij}$ , $1\leq i<j\leq n$ , are independent, and the two collections are mutually independent.

(4) For $i<j$ , $(U_{i},U_{j},U_{ij})$ conditional on $\{\xi_{i}\}_{i=1}^{n}$ has the same distribution as $(U_{i},U_{j},U_{ij})$ conditional on $\xi_{i},\xi_{j}$ .

(5) For $i<j<k$ , if $\xi_{i}=\xi_{j}=\xi_{k}$ , $(U_{i},U_{j},U_{ij})$ and $(U_{i},U_{k},U_{ik})$ have the same distribution conditionally on $\xi_{i},\xi_{j},\xi_{k}$ .

Part (1) imposes homogeneity on the node-level data-generating process. Parts (2) and (3) are new to the literature on dyadic regression with fixed effects. While the previous literature assumes conditional independence of dyadic-level errors (Graham, 2017; Zeleneev, 2020; Candelaria, 2020), our error structure (2.5) allows for the conditional dependence between errors with a common node (e.g., $\epsilon_{ij1}$ and $\epsilon_{ik1}$ ) through $U_{i}$ , but also includes conditional independence as a special case where node-level random vectors $U_{it},U_{jt}$ are degenerate given $\{\xi_{i}\}_{i=1}^{n}$ . Part (4) is the standard assumption in the literature and excludes "externalities," where dyad $ij$ can be affected by nodes other than $i$ or $j$ . Part (5) ensures the conditional exchangeability of $(\epsilon_{ijt},\eta_{ijt})_{t=1,2}$ across dyads.
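To fix ideas, a data-generating process of the form (2.1)-(2.5) can be simulated as below. This is only a minimal sketch under assumptions that are not part of the model: scalar regressors, $w(x,y)=x+y$ and $r(z,z^{\prime})=z+z^{\prime}$, Gaussian shocks, and a particular additive choice of $\tau$. The function name `simulate_dyadic_panel` and all numerical constants are hypothetical.

```python
import random


def simulate_dyadic_panel(n, beta=1.0, gamma=1.0, seed=0):
    """Simulate an undirected dyadic panel with T = 2 and scalar regressors.

    Illustrative assumptions: w(x, y) = x + y, r(z, z') = z + z', Gaussian
    shocks, and tau equal to a sum of node-level and dyad-level shocks.
    """
    rng = random.Random(seed)
    # Node-level data, i.i.d. across nodes (Assumption 1(1)).
    X = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(n)]
    Z = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(n)]
    # Fixed effects are allowed to correlate with the regressors.
    A = [0.5 * sum(X[i]) + rng.gauss(0.0, 1.0) for i in range(n)]
    B = [0.5 * sum(Z[i]) + rng.gauss(0.0, 1.0) for i in range(n)]
    # Node-level shocks U_it: the source of dependence across dyads sharing a node.
    U = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(n)]

    dyads = []
    for i in range(n - 1):
        for j in range(i + 1, n):  # undirected graph: visit each pair once
            rec = {"i": i, "j": j, "W": [], "R": [], "d": [], "Y": []}
            for t in range(2):
                w = X[i][t] + X[j][t]
                r = Z[i][t] + Z[j][t]
                node_part = 0.5 * (U[i][t] + U[j][t])
                u_ij = rng.gauss(0.0, 1.0)  # dyad-level shock shared by both errors
                eta = node_part + u_ij + rng.gauss(0.0, 1.0)
                eps = node_part + 0.5 * u_ij + rng.gauss(0.0, 1.0)
                d = 1 if r * gamma + B[i] + B[j] - eta >= 0 else 0  # selection (2.3)
                y_star = w * beta + A[i] + A[j] + eps               # latent outcome (2.2)
                rec["W"].append(w)
                rec["R"].append(r)
                rec["d"].append(d)
                rec["Y"].append(y_star if d == 1 else None)         # observation rule (2.4)
            dyads.append(rec)
    return dyads
```

Because the shocks are drawn independently and identically across the two periods, this particular design is exchangeable over time; the correlation between `eps` and `eta` is what makes selection non-random.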

2.1. Identification

The following two assumptions are crucial for the identification of $\beta$ :

Assumption 2.

$(\epsilon_{ij1},\epsilon_{ij2},\eta_{ij1},\eta_{ij2})$ and $(\epsilon_{ij2},\epsilon_{ij1},\eta_{ij2},\eta_{ij1})$ are identically distributed conditionally on $\xi_{i},\xi_{j}$ .

Assumption 3.

$E[d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}|\Delta R_{ij}^{\prime}\gamma=0]$ is non-singular where $\Delta W_{ij}=W_{ij1}-W_{ij2}$ and $\Delta R_{ij}=R_{ij1}-R_{ij2}$ .

Assumption 2 excludes cases where, for example, the conditional variance of $\epsilon_{ijt}$ depends only on period $t$ ’s information: $Var(\epsilon_{ijt}|\xi_{i},\xi_{j})=\sigma^{2}\times W_{ijt}^{\prime}\beta$ . However, it allows time-invariant heteroskedasticity such as $Var(\epsilon_{ijt}|\xi_{i},\xi_{j})=\sigma^{2}(W_{ij1}+W_{ij2})^{\prime}\beta\times A_{i}\times A_{j}$ . From (2.5), this assumption is implied by the conditional exchangeability of $U_{it}$ and $U_{ijt}$ with respect to time. Assumption 3 excludes cases where $W_{ijt}$ is exactly the same as $R_{ijt}$ and implies that some variables in $R_{ijt}$ must be excluded from $W_{ijt}$ . Since

$$\begin{aligned}
&E[d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}|\Delta R_{ij}^{\prime}\gamma=0]\\
&\quad=Pr(d_{ij1}d_{ij2}=1|\Delta R_{ij}^{\prime}\gamma=0)\times E[\Delta W_{ij}\Delta W_{ij}^{\prime}|d_{ij1}d_{ij2}=1,\Delta R_{ij}^{\prime}\gamma=0],
\end{aligned}$$

this assumption also implies that the networks $D_{1},D_{2}$ are locally dense across time in the sense that $Pr(d_{ij1}d_{ij2}=1|\Delta R_{ij}^{\prime}\gamma=0)>0$ .

Our identification argument is summarized in the following two steps, similarly to Kyriazidou (1997). First, take the time-difference on observed outcomes (dyads with $d_{ij1}=d_{ij2}=1$ ) to eliminate the fixed effects:

$$\Delta Y_{ij}=\Delta W_{ij}^{\prime}\beta+\epsilon_{ij1}-\epsilon_{ij2}.$$

Taking the expectation of both sides conditionally on $d_{ij1}=d_{ij2}=1$ and $\xi_{i},\xi_{j}$ gives

$$E[\Delta Y_{ij}|d_{ij1}d_{ij2}=1,\xi_{i},\xi_{j}]=\Delta W_{ij}^{\prime}\beta+\underbrace{E[\epsilon_{ij1}-\epsilon_{ij2}|d_{ij1}d_{ij2}=1,\xi_{i},\xi_{j}]}_{\text{Sample selection effect}}.$$

Note that, in general, the sample selection effect is not $0$ .

Second, we seek to find conditions to eliminate the selection effect. Assumption 2 is equivalent to

$$F(\epsilon_{ij1},\epsilon_{ij2},\eta_{ij1},\eta_{ij2}|\xi_{i},\xi_{j})=F(\epsilon_{ij2},\epsilon_{ij1},\eta_{ij2},\eta_{ij1}|\xi_{i},\xi_{j}),$$

where $F$ is the conditional distribution of the errors given $\xi_{i},\xi_{j}$ . Then, for dyad $ij$ with $\Delta R_{ij}^{\prime}\gamma=R_{ij1}^{\prime}\gamma-R_{ij2}^{\prime}\gamma=0$ ,

$$\begin{aligned}
&E[\epsilon_{ij1}|d_{ij1}d_{ij2}=1,\xi_{i},\xi_{j},\Delta R_{ij}^{\prime}\gamma=0]\\
&\quad=E[\epsilon_{ij1}|R_{ij1}^{\prime}\gamma+B_{i}+B_{j}\geq\eta_{ij1},R_{ij2}^{\prime}\gamma+B_{i}+B_{j}\geq\eta_{ij2},\xi_{i},\xi_{j},\Delta R_{ij}^{\prime}\gamma=0]\\
&\quad=E[\epsilon_{ij2}|R_{ij2}^{\prime}\gamma+B_{i}+B_{j}\geq\eta_{ij2},R_{ij1}^{\prime}\gamma+B_{i}+B_{j}\geq\eta_{ij1},\xi_{i},\xi_{j},\Delta R_{ij}^{\prime}\gamma=0]\\
&\quad=E[\epsilon_{ij2}|d_{ij1}d_{ij2}=1,\xi_{i},\xi_{j},\Delta R_{ij}^{\prime}\gamma=0].
\end{aligned}$$

Hence, the conditional expectation of $\Delta Y_{ij}$ given $d_{ij1}d_{ij2}=1$ , $\xi_{i},\xi_{j}$ , and $\Delta R_{ij}^{\prime}\gamma=0$ is

$$E[\Delta Y_{ij}|d_{ij1}d_{ij2}=1,\xi_{i},\xi_{j},\Delta R_{ij}^{\prime}\gamma=0]=\Delta W_{ij}^{\prime}\beta.$$

Multiplying both sides by $\Delta W_{ij}$ and aggregating over $\xi_{i},\xi_{j}$ , we get

$$E[\Delta W_{ij}\Delta Y_{ij}|d_{ij1}d_{ij2}=1,\Delta R_{ij}^{\prime}\gamma=0]=E[\Delta W_{ij}\Delta W_{ij}^{\prime}|d_{ij1}d_{ij2}=1,\Delta R_{ij}^{\prime}\gamma=0]\beta.$$

Then, under Assumption 3, $\beta$ is uniquely written as

$$\beta=E[d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}|\Delta R_{ij}^{\prime}\gamma=0]^{-1}E[d_{ij1}d_{ij2}\Delta W_{ij}\Delta Y_{ij}|\Delta R_{ij}^{\prime}\gamma=0]. \tag{2.6}$$

2.2. Estimation

Estimation is done in two steps: in the first step, we estimate $\gamma$ with a consistent estimator $\hat{\gamma}_{n}$ , and in the second step, we estimate $\beta$ with $\hat{\beta}_{n}$ , a sample analogue of the identified $\beta$ with $\gamma$ replaced by $\hat{\gamma}_{n}$ .

In the following, we focus on the second step. The sample-analogue of (2.6) is given by

$$\hat{\beta}_{n}=\left[\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n})\right]^{-1}\left[\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta Y_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n})\right],$$

where $\sum_{i<j}=\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}$ , $K_{h_{n}}(v)=h_{n}^{-1}K(v/h_{n})$ is a kernel, and $h_{n}$ is a bandwidth. The weight function is used to smooth the condition $\Delta R_{ij}^{\prime}\gamma=0$ and puts larger weight on observations with small $\Delta R_{ij}^{\prime}\hat{\gamma}_{n}$ .
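The second-step estimator can be sketched in a few lines for the special case of scalar regressors ($q_{w}=q_{r}=1$), where the matrix inverse reduces to a ratio. This is an illustrative sketch rather than a reference implementation: the function names, the flat-list input layout, and the choice of a biweight kernel are assumptions made here, and `gamma_hat` is taken as given from some first-step estimator.

```python
def biweight(v):
    """Biweight kernel K(v) = 15/16 * (1 - v^2)^2 on [-1, 1], zero elsewhere."""
    return 15.0 / 16.0 * (1.0 - v * v) ** 2 if abs(v) <= 1.0 else 0.0


def kernel_weighted_beta(d1, d2, dW, dY, dR, gamma_hat, h):
    """Scalar version of the kernel-weighted estimator of beta.

    Each argument is a list with one entry per dyad (i < j):
    d1, d2 are the selection indicators in the two periods,
    dW, dY, dR are the time differences W1 - W2, Y1 - Y2, R1 - R2
    (dY may be None when the dyad is not selected in both periods).
    """
    num = 0.0
    den = 0.0
    for a, b, w, y, r in zip(d1, d2, dW, dY, dR):
        if a * b == 0:
            continue  # only dyads observed in both periods contribute
        k_h = biweight(r * gamma_hat / h) / h  # K_h(v) = K(v / h) / h
        num += w * y * k_h
        den += w * w * k_h
    if den == 0.0:
        raise ValueError("no selected dyad receives positive kernel weight")
    return num / den
```

Dyads whose estimated index difference is far from zero relative to `h` receive zero weight under a compactly supported kernel, so in small samples the effective number of contributing dyads can be much smaller than the number of selected dyads.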

To evaluate $\hat{\beta}_{n}$ in terms of $\beta$ , rewrite the time-differenced model as

$$\Delta Y_{ij}=\Delta W_{ij}^{\prime}\beta+\lambda_{ij}+\nu_{ij},$$

where

$$\begin{aligned}
\lambda_{ij}&\equiv E[\epsilon_{ij1}-\epsilon_{ij2}|d_{ij1}d_{ij2}=1,\xi_{i},\xi_{j}],\\
\nu_{ij}&\equiv\epsilon_{ij1}-\epsilon_{ij2}-\lambda_{ij}.
\end{aligned}$$

Note that $E[\nu_{ij}|d_{ij1}d_{ij2}=1,\xi_{i},\xi_{j}]=0$ by construction. Define

$$\begin{aligned}
\hat{S}_{WW}&\equiv\frac{1}{N}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}),\\
\hat{S}_{W\lambda}&\equiv\frac{1}{N}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\lambda_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}),\\
\hat{S}_{W\nu}&\equiv\frac{1}{N}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\nu_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}).
\end{aligned}$$

Substituting $\Delta Y_{ij}$ into $\hat{\beta}_{n}$ yields

$$\hat{\beta}_{n}=\beta+\hat{S}_{WW}^{-1}\hat{S}_{W\lambda}+\hat{S}_{WW}^{-1}\hat{S}_{W\nu}.$$

The terms $\hat{S}_{WW}^{-1}\hat{S}_{W\lambda}$ and $\hat{S}_{WW}^{-1}\hat{S}_{W\nu}$ can be understood as the selection bias term and the stochastic error term in the estimator, respectively.
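The substitution step can be written out explicitly; the following is a short sketch that uses only the definitions above. On dyads with $d_{ij1}d_{ij2}=1$ , the time-differenced model gives $\Delta Y_{ij}=\Delta W_{ij}^{\prime}\beta+\lambda_{ij}+\nu_{ij}$ , so that

$$\begin{aligned}
\frac{1}{N}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta Y_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n})
&=\frac{1}{N}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\left(\Delta W_{ij}^{\prime}\beta+\lambda_{ij}+\nu_{ij}\right)K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n})\\
&=\hat{S}_{WW}\beta+\hat{S}_{W\lambda}+\hat{S}_{W\nu}.
\end{aligned}$$

Since the factor $1/N$ cancels between the two brackets in the definition of $\hat{\beta}_{n}$ , the estimator equals $\hat{S}_{WW}^{-1}$ times the left-hand side, and premultiplying the last line by $\hat{S}_{WW}^{-1}$ yields the three-term decomposition.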

3. Asymptotic Analysis

3.1. Regularity Conditions

For ease of notation, we write the following conditions in terms of dyads $12$ and $13$ , which entails no loss of generality under the undirected graph and Assumption 1.

Let $f_{R\gamma,2}$ be the joint density of $\Delta R_{12}^{\prime}\gamma$ and $\Delta R_{13}^{\prime}\gamma$ when it exists. Let $f_{R\gamma}$ be the marginal density and $f_{R\gamma|\xi_{1},U_{1}}$ be the conditional density given $\xi_{1},U_{1}$ .

Assumption 4.

The joint distribution of $\Delta R_{12}^{\prime}\gamma$ and $\Delta R_{13}^{\prime}\gamma$ is absolutely continuous, and for some $\kappa_{0}>0$ , the following hold in the neighborhoods $(-\kappa_{0},\kappa_{0})^{2}$ or $(-\kappa_{0},\kappa_{0})$ around $(0,0)$ or $(0)$ , respectively:

(1) The density $f_{R\gamma,2}(\cdot,\cdot)$ is $k\geq 2$ times continuously differentiable, and the derivatives $\frac{\partial^{p+q}}{\partial x^{p}\partial y^{q}}f_{R\gamma,2}(x,y)$ are uniformly bounded for $p+q\leq k,p,q\geq 0$ and bounded away from $0$ .

(2) The marginal density $f_{R\gamma}(\cdot)$ is bounded away from $0$ .

(3) The conditional density $f_{R\gamma|\xi_{1},U_{1}}(\cdot)$ given $\xi_{1},U_{1}$ is continuous and uniformly bounded almost surely.

Part (1) is a smoothness assumption on the density as in the nonparametric regression literature. Part (2) ensures that we observe $\Delta R_{12}^{\prime}\gamma$ around $0$ , which is crucial for identification. Part (3) essentially requires well-behaved $r(\cdot,\cdot)$ in (2.1).

Define $(w_{1},w_{2})\mapsto\Lambda(w_{1},w_{2},\xi_{1},\xi_{2})$ as

$$\Lambda(w_{1},w_{2},\xi_{1},\xi_{2})\equiv E[\epsilon_{12t}|\eta_{12t}\leq w_{1},\eta_{12s}\leq w_{2},\xi_{1},\xi_{2}]$$

with $t,s=1,2,t\neq s$ . This $\Lambda$ is the sample selection effect caused by the correlation between errors $\epsilon_{12t},\epsilon_{12s}$ and $\eta_{12t},\eta_{12s}$ . Note that the function $\Lambda$ does not depend on time $t$ or $s$ because of Assumption 2.

Assumption 5.

The function $(w_{1},w_{2})\mapsto\Lambda(w_{1},w_{2},\xi_{1},\xi_{2})$ is differentiable in the neighborhoods $(-\kappa_{0},\kappa_{0})^{2}$ around $(0,0)$ for some $\kappa_{0}>0$ .

This assumption is essential for controlling the sample selection effect and characterizing the asymptotic bias in some cases. An implication of this assumption is that for some $\Lambda_{12}\equiv\tilde{\Lambda}(w_{1},w_{2},\xi_{1},\xi_{2})$ ,

$$\Lambda(w_{1},w_{2},\xi_{1},\xi_{2})-\Lambda(w_{2},w_{1},\xi_{1},\xi_{2})=\Lambda_{12}\times(w_{1}-w_{2}),$$

by the multivariate mean-value theorem. This assumption is strong because the difference in $\Lambda$ must be exactly linear in the first and second elements. If we focus on the non-degenerate case discussed below, since the asymptotic bias is $0$ in that case, we can relax the differentiability to Lipschitz-like continuity on $\Lambda$ : $|\Lambda(w_{1},w_{2},\xi_{1},\xi_{2})-\Lambda(w_{2},w_{1},\xi_{1},\xi_{2})|\leq|\Lambda_{12}|\times|w_{1}-w_{2}|$ .
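The mean-value step can be made explicit as follows. This is a sketch in which the path $\phi$ , the point $s^{*}$ , and the partial-derivative notation $\partial_{1}\Lambda,\partial_{2}\Lambda$ are auxiliary notation introduced here for illustration. For $(w_{1},w_{2})\in(-\kappa_{0},\kappa_{0})^{2}$ , define, for $s\in[0,1]$ ,

$$\phi(s)\equiv\Lambda\big(w_{2}+s(w_{1}-w_{2}),\,w_{1}+s(w_{2}-w_{1}),\,\xi_{1},\xi_{2}\big),$$

so that $\phi(1)=\Lambda(w_{1},w_{2},\xi_{1},\xi_{2})$ and $\phi(0)=\Lambda(w_{2},w_{1},\xi_{1},\xi_{2})$ . The segment stays inside the neighborhood because the square is convex. By the chain rule and the univariate mean-value theorem, for some $s^{*}\in(0,1)$ ,

$$\phi(1)-\phi(0)=\phi^{\prime}(s^{*})=\big[\partial_{1}\Lambda-\partial_{2}\Lambda\big](\bar{w}_{1},\bar{w}_{2},\xi_{1},\xi_{2})\times(w_{1}-w_{2}),$$

where $(\bar{w}_{1},\bar{w}_{2})$ is the point on the segment corresponding to $s^{*}$ . One admissible choice is therefore $\Lambda_{12}=[\partial_{1}\Lambda-\partial_{2}\Lambda](\bar{w}_{1},\bar{w}_{2},\xi_{1},\xi_{2})$ .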

Let $\|\cdot\|$ denote the Euclidean norm of vectors.

Assumption 6.

For some $\kappa_{0}>0$ , the following hold in the neighborhoods $(-\kappa_{0},\kappa_{0})^{2}$ or $(-\kappa_{0},\kappa_{0})$ around $(0,0)$ or $(0)$ , respectively.

(1) The following moments are uniformly bounded almost surely:

$$\begin{aligned}
&E[\|\Delta W_{12}\|^{8}\mid\Delta R_{12}^{\prime}\gamma=\cdot,\xi_{1},U_{1}],\quad E[\|\Delta R_{12}\|^{6}\mid\Delta R_{12}^{\prime}\gamma=\cdot,\xi_{1},U_{1}],\\
&E[\nu_{12}^{8}\mid\Delta R_{12}^{\prime}\gamma=\cdot,\xi_{1},U_{1}],\quad E[\Lambda_{12}^{6}\mid\Delta R_{12}^{\prime}\gamma=\cdot,\xi_{1},U_{1}].
\end{aligned}$$

(2) The following moments are continuous and bounded, and the first two are positive definite:

$$\begin{aligned}
&E[d_{121}d_{122}\Delta W_{12}\Delta W_{12}^{\prime}\mid\Delta R_{12}^{\prime}\gamma=\cdot],\quad E[d_{121}d_{122}\Delta W_{12}\Delta W_{12}^{\prime}\nu_{12}^{2}\mid\Delta R_{12}^{\prime}\gamma=\cdot],\\
&E[d_{121}d_{122}d_{131}d_{132}\Delta W_{12}\Delta W_{13}^{\prime}\nu_{12}\nu_{13}\mid\Delta R_{12}^{\prime}\gamma=\cdot,\Delta R_{13}^{\prime}\gamma=\cdot].
\end{aligned}$$

(3) $g(\cdot)\equiv E[d_{121}d_{122}\Delta W_{12}\Lambda_{12}|\Delta R_{12}^{\prime}\gamma=\cdot]f_{R\gamma}(\cdot)$ is $k$ -times continuously differentiable with uniformly bounded derivatives.

(4) $g_{\xi_{1},U_{1}}(\cdot)\equiv E[d_{121}d_{122}\Delta W_{12}\nu_{12}|\Delta R_{12}^{\prime}\gamma=\cdot,\xi_{1},U_{1}]f_{R\gamma|\xi_{1},U_{1}}(\cdot)$ is $k$ -times continuously differentiable with uniformly bounded derivatives almost surely.

Part (1) assumes the existence of conditional moments for the relevant variables. The conditioning on $\xi_{1}$ and $U_{1}$ is needed for controlling the dyadic dependence structure. Part (2) is crucial for obtaining the convergence results used below, and the positive definiteness is needed for ensuring the non-degeneracy of our estimator in the limit. Part (3) is used for characterizing the asymptotic bias provided below. Part (4) is essential for the negligibility of the approximation error of our variance estimator.

Assumption 7.

The following moments exist:

$$E[\|\Delta W_{12}\|^{8}],\quad E[\|\Delta R_{12}\|^{8}],\quad E[\Lambda_{12}^{6}],\quad E[\nu_{12}^{6}].$$

In addition to Assumption 6, which restricts the moments locally around $(0,0)$ or $(0)$ , we use the existence of these unconditional moments when bounding the error terms that arise from using $\hat{\gamma}_{n}$ in place of $\gamma$ .

Assumption 8.

A kernel function $K(\cdot)$ satisfies the following:

(1) For some $\kappa>0$ , $K$ is $0$ outside of $[-\kappa,\kappa]$ , bounded in $[-\kappa,\kappa]$ , and three times continuously differentiable with bounded derivatives in $(-\kappa,\kappa)$ .

(2) $\int K(s)ds=1$ .

(3) $\int s^{i}K(s)ds=0$ for $i=1,...,k$ .

For example, a biweight kernel $K(x)=15/16(1-x^{2})^{2}\boldsymbol{1}\{|x|\leq 1\}$ satisfies this assumption with $\kappa=1$ and $k=2$ .

Assumption 9.

The sequence of bandwidths $\{h_{n}\}$ satisfies $h_{n}\to 0$ and $nh_{n}\to\infty$ as $n\to\infty$ .

This assumption is standard in the nonparametric regression literature. We impose further conditions on $\{h_{n}\}$ in each statement below.

Assumption 10.

The first-step estimator $\hat{\gamma}_{n}$ satisfies $\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)=o_{p}(1)$ .

This assumption requires the first-step estimator to be consistent and converge faster than our estimator. For example, if $\eta_{ijt}\sim Logistic(0,1)$ independently across $ij$ and $t$ , we can show that Chamberlain (1980)’s conditional logit estimator satisfies $\hat{\gamma}_{n}-\gamma=O_{p}(1/\sqrt{N})$ so that $\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)=O_{p}(\sqrt{h_{n}})=o_{p}(1)$ . In Section 3.6, we discuss the availability of alternative estimators for $\gamma$ .

3.2. Asymptotic Normality

Define the following components that will appear in the asymptotic bias and variance expression:

$$\begin{aligned}
\Sigma_{WW}&\equiv f_{R\gamma}(0)E[d_{121}d_{122}\Delta W_{12}\Delta W_{12}^{\prime}|\Delta R_{12}^{\prime}\gamma=0],\\
\Sigma_{W\lambda}&\equiv\frac{1}{k!}\frac{\partial^{k}g(0)}{\partial w^{k}}\int s^{k+1}K(s)ds,\\
\Sigma_{W\nu,1}&\equiv 4f_{R\gamma,2}(0,0)E[d_{121}d_{122}d_{131}d_{132}\Delta W_{12}\Delta W_{13}^{\prime}\nu_{12}\nu_{13}|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0],\\
\Sigma_{W\nu,2}&\equiv f_{R\gamma}(0)E[d_{121}d_{122}\Delta W_{12}\Delta W_{12}^{\prime}\nu_{12}^{2}|\Delta R_{12}^{\prime}\gamma=0]\int K^{2}(s)ds.
\end{aligned}$$

We have the following result:

Theorem 1.

Suppose that Assumptions 1-10 hold. Fix an arbitrary non-zero vector $c\in\mathbb{R}^{q_{w}}$ and some constant $h\in(0,\infty)$ . Let $c_{W}=\Sigma_{WW}^{-1}c$ . Then, as $n\to\infty$ , we have the following three cases:

(1) If $Nh_{n}^{2k+3}\to h$ and $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}>0$ :

$$\sqrt{n}c^{\prime}(\hat{\beta}_{n}-\beta)\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}).$$

(2) If $Nh_{n}^{2k+3}\to h$ and $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ :

$$\sqrt{Nh_{n}}c^{\prime}(\hat{\beta}_{n}-\beta)\to_{d}\mathcal{N}(\sqrt{h}c_{W}^{\prime}\Sigma_{W\lambda},c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}).$$

(3) If $Nh_{n}^{2k+3}\to\infty$ and $nh^{2k+2}_{n}\to\infty$ :

$$h_{n}^{-(k+1)}(\hat{\beta}_{n}-\beta)\to_{p}\Sigma_{WW}^{-1}\Sigma_{W\lambda}.$$

Parts (1) and (2) of Theorem 1 show that our estimator is asymptotically normal with different convergence rates depending on $\Sigma_{W\nu,1}$ . Part (1) departs from Kyriazidou (1997)’s result in that we have a parametric convergence rate based on the number of nodes $n$ , not the number of dyads (or units) $N$ . Under the condition that $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}>0$ in part (1), the covariance of two summands that share a common index (e.g., dyads $ij$ and $ik$ ) in the estimator does not vanish in the limit, which reduces the effective sample size to $n$ , the number of nodes. At the same time, the leading term is an average of conditional means of the summand given $\xi_{i}$ and $U_{i}$ , which averages out and drops $h_{n}$ from the convergence rate. This $\sqrt{n}$ -asymptotic normality is aligned with the dyadic non-parametric density estimation literature (Graham et al., 2019). Once we have $\Sigma_{W\nu,1}=0$ as in part (2), our result is aligned with Kyriazidou (1997) in that the convergence rate is non-parametric and based on the number of dyads (units). Part (3) of Theorem 1 shows that our estimator converges to the asymptotic bias part with suitable normalization, regardless of the degeneracy. We utilize this result to propose the bias-corrected estimator in a later section.
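To get a feel for the two normalizations, the arithmetic can be sketched as below. The choice of a bandwidth that satisfies $Nh_{n}^{2k+3}=h$ exactly is only one convenient sequence consistent with the limit condition in parts (1) and (2); the function name and the default values of `k` and `h` are hypothetical.

```python
import math


def bandwidth_and_rates(n, k=2, h=1.0):
    """Return N, a bandwidth with N * h_n**(2k+3) == h, and both normalizations."""
    N = n * (n - 1) // 2                     # number of undirected dyads
    h_n = (h / N) ** (1.0 / (2 * k + 3))     # solves N * h_n^(2k+3) = h
    return {
        "N": N,
        "h_n": h_n,
        "sqrt_n": math.sqrt(n),              # normalization in part (1)
        "sqrt_Nh": math.sqrt(N * h_n),       # normalization in part (2)
    }
```

Under this bandwidth choice, $\sqrt{Nh_{n}}$ grows faster than $\sqrt{n}$ , which reflects that the degenerate case uses dyad-level rather than node-level variation.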

We can compare our estimator with the usual fixed effect estimator:

$$\hat{\beta}_{FE}=\left[\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\right]^{-1}\left[\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta Y_{ij}\right], \tag{3.1}$$

which is biased because of the selection effect $\lambda_{ij}$ . First, in the case of non-degeneracy, our estimator and the re-centered (infeasible) fixed effect estimator share the same convergence rate of $\sqrt{n}$ (Davezies et al., 2021). This implies that using our kernel-based local estimator entails no reduction in the effective sample size, which alleviates the need for fairly large samples discussed in Kyriazidou (1997). Second, in the case of degeneracy, the fixed effect estimator applied to our model can exhibit a non-Gaussian distribution in the limit (Menzel, 2021), while our estimator is asymptotically normal regardless of the degeneracy. This guaranteed asymptotic normality is analogous to Hall (1984)’s central limit theorem for degenerate U-statistics, and thus the common statistics of interest, such as confidence intervals, can be constructed in a standard manner.

If we interpret the structural equation (2.2) as the log-linearized version of the canonical gravity model (Silva and Tenreyro, 2006; Head and Mayer, 2014),

\displaystyle\tilde{Y}_{ijt}=exp(W_{ijt}^{\prime}\beta+A_{i}+A_{j})\times\underbrace{\eta_{ijt}}_{=d_{ijt}exp(\epsilon_{ijt})},

the Poisson pseudo-maximum-likelihood estimator (PPML) for $\beta$ can be compared with our estimator. The PPML estimator is given by

\displaystyle\sum_{i<j}\sum_{t=1}^{2}\left[\tilde{Y}_{ijt}-exp(W_{ijt}^{\prime}\hat{\beta}_{PPML}+a_{i}\boldsymbol{1}\{i=1\}+a_{j}\boldsymbol{1}\{j=1\})\right]\begin{pmatrix}W_{ijt},\boldsymbol{1}\{i=1\},\boldsymbol{1}\{j=1\}\end{pmatrix}^{\prime}=0.

(3.2)

We can make a comparison similar to that for the fixed effect estimator based on the results of Davezies et al. (2021) and Menzel (2021): $\hat{\beta}_{PPML}$ will be biased because of the misspecified errors, and the re-centered $\hat{\beta}_{PPML}$ is asymptotically normal at the rate of $\sqrt{n}$ in the non-degenerate case and can be non-Gaussian in the degenerate case.

3.3. Variance Estimation

Since our estimator exhibits different asymptotic distributions depending on $\Sigma_{W\nu,1}$ , it is desirable to have a variance estimator that adapts to the degeneracy.

First, we estimate $\Sigma_{W\nu,1}$ . Define

\displaystyle\hat{S}_{ij}\equiv 2d_{ij1}d_{ij2}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n})\Delta W_{ij}\Delta\hat{\epsilon}_{ij},

where $\Delta\hat{\epsilon}_{ij}$ is a residual $\Delta Y_{ij}-\Delta W_{ij}^{\prime}\hat{\beta}_{n}$ . Then, we propose an estimator for $\Sigma_{W\nu,1}$ as

\displaystyle\hat{\Sigma}_{W\nu,1}={n\choose 3}^{-1}\sum_{i<j<k}\frac{1}{3}(\hat{S}_{ij}\hat{S}_{ik}^{\prime}+\hat{S}_{ij}\hat{S}_{jk}^{\prime}+\hat{S}_{ik}\hat{S}_{jk}^{\prime}).

Next, we estimate $\Sigma_{W\nu,2}$ by

\displaystyle\hat{\Sigma}_{W\nu,2}=\frac{h_{n}}{N}\sum_{i<j}d_{ij1}d_{ij2}K_{h_{n}}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n})^{2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta\hat{\epsilon}_{ij}^{2}.

The following result shows the consistency of these estimators and their usefulness in adaptive variance estimation.

Proposition 1.

Suppose that Assumptions 1-10 hold. Set $h_{n}=hN^{-1/(2k+3)}$ for some $h\in(0,\infty)$ . We have

	$\displaystyle\hat{\Sigma}_{W\nu,1}$	$\displaystyle\to_{p}\Sigma_{W\nu,1},$
	$\displaystyle\hat{\Sigma}_{W\nu,2}$	$\displaystyle\to_{p}\Sigma_{W\nu,2},$

as $n\to\infty$ . If $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ with $c_{W}=\Sigma_{WW}^{-1}c$ for some $c\in\mathbb{R}^{q_{w}}$ , we have

\displaystyle nh_{n}c^{\prime}\hat{S}_{WW}^{-1}\hat{\Sigma}_{W\nu,1}\hat{S}_{WW}^{-1}c\to_{p}0,

as $n\to\infty$ .

We now propose our variance estimator as follows:

\displaystyle\hat{\Sigma}\equiv\hat{S}_{WW}^{-1}\left[\frac{n-2}{n(n-1)}\hat{\Sigma}_{W\nu,1}+\frac{1}{Nh_{n}}\hat{\Sigma}_{W\nu,2}\right]\hat{S}_{WW}^{-1}.

We can see that this estimator is adaptive to the degeneracy: When $\Sigma_{W\nu,1}$ is positive definite, since $n/(Nh_{n})=o(1)$ ,

\displaystyle nc^{\prime}\hat{\Sigma}c=c^{\prime}\hat{S}_{WW}^{-1}\left[\frac{n-2}{n-1}\hat{\Sigma}_{W\nu,1}+\frac{n}{Nh_{n}}\hat{\Sigma}_{W\nu,2}\right]\hat{S}_{WW}^{-1}c\to_{p}c_{W}^{\prime}\Sigma_{W\nu,1}c_{W},

as $n\to\infty$ by Proposition 1 and Lemma 1 in Appendix A. When $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ , since $nh_{n}c^{\prime}\hat{S}_{WW}^{-1}\hat{\Sigma}_{W\nu,1}\hat{S}_{WW}^{-1}c=o_{p}(1)$ by Proposition 1,

\displaystyle Nh_{n}c^{\prime}\hat{\Sigma}c=c^{\prime}\hat{S}_{WW}^{-1}\left[\frac{(n-2)h_{n}}{2}\hat{\Sigma}_{W\nu,1}+\hat{\Sigma}_{W\nu,2}\right]\hat{S}_{WW}^{-1}c\to_{p}c_{W}^{\prime}\Sigma_{W\nu,2}c_{W},

as $n\to\infty$ .

Our variance estimator is adapted from the one provided in Graham et al. (2019) for a dyadic non-parametric density estimator. They show that this type of estimator can be adaptive to the "knife edge" case, where $nh_{n}$ is bounded from above and below asymptotically so that $Nh_{n}\sim n$ . Here, we additionally show that the estimator is adaptive to the degeneracy by showing that the term involving $\hat{\Sigma}_{W\nu,1}$ decays fast enough to be negligible when the convergence rate is $\sqrt{Nh_{n}}$ .
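The construction of $\hat{\Sigma}$ can be sketched numerically as follows. This is only an illustrative sketch, not the implementation used in the paper: the function names `sigma1_hat` and `sigma_hat` and the array layout `S[i, j]` for $\hat{S}_{ij}$ are hypothetical, $N=n(n-1)/2$ is assumed to be the number of unordered dyads, and the triple sum is evaluated by a direct $O(n^{3})$ loop for readability.

```python
from itertools import combinations
from math import comb

import numpy as np


def sigma1_hat(S):
    """Triple-sum estimator of Sigma_{W nu,1}.

    S has shape (n, n, q) with S[i, j] = S[j, i] holding the q-vector S_hat_ij
    (hypothetical layout; diagonal entries are never used).
    """
    n, _, q = S.shape
    total = np.zeros((q, q))
    for i, j, k in combinations(range(n), 3):
        # Three pairs of dyads in the triple {i, j, k} share exactly one node.
        total += (
            np.outer(S[i, j], S[i, k])
            + np.outer(S[i, j], S[j, k])
            + np.outer(S[i, k], S[j, k])
        ) / 3.0
    return total / comb(n, 3)


def sigma_hat(sigma1, sigma2, S_WW, n, h_n):
    """Sandwich estimator combining the sqrt(n) and sqrt(N h_n) components."""
    N = n * (n - 1) // 2  # assumption: N counts unordered dyads
    middle = (n - 2) / (n * (n - 1)) * sigma1 + sigma2 / (N * h_n)
    S_inv = np.linalg.inv(S_WW)
    return S_inv @ middle @ S_inv
```

When the first component is degenerate, the weight on `sigma1` shrinks at rate $1/n$ and the second term dominates, which mirrors the adaptivity argument above.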

3.4. Bandwidth Selection

From the asymptotic distributional approximation result in Theorem 1, we can write down the mean squared error of our estimator (without negligible parts)

\displaystyle MSE(c^{\prime}\hat{\beta}_{n})=h_{n}^{2(k+1)}(c_{W}^{\prime}\Sigma_{W\lambda})^{2}+\frac{1}{n}c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}+\frac{1}{Nh_{n}}c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}.

The optimal solution for minimizing this mean squared error with respect to $h_{n}$ is given by

	$\displaystyle h_{n}^{*}$	$\displaystyle=\left(\frac{c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}}{2(k+1)N(c_{W}^{\prime}\Sigma_{W\lambda})^{2}}\right)^{\frac{1}{2k+3}}$
		$\displaystyle=h^{*}N^{-\frac{1}{2k+3}}.$
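As a check on this expression, the first-order condition can be written out explicitly. The sketch below assumes that the squared bias enters the mean squared error through $\Sigma_{W\lambda}$ and that the only bandwidth-dependent variance term is the one involving $\Sigma_{W\nu,2}$, as suggested by Theorem 1; the term of order $1/n$ does not depend on $h_{n}$ and drops out.

```latex
\frac{\partial MSE(c^{\prime}\hat{\beta}_{n})}{\partial h_{n}}
  =2(k+1)h_{n}^{2k+1}(c_{W}^{\prime}\Sigma_{W\lambda})^{2}
   -\frac{1}{Nh_{n}^{2}}c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}=0
\quad\Longrightarrow\quad
h_{n}^{2k+3}=\frac{c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}}{2(k+1)N(c_{W}^{\prime}\Sigma_{W\lambda})^{2}}.
```

Taking the $(2k+3)$-th root separates the constant $h^{*}$ from the rate $N^{-1/(2k+3)}$.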

We can estimate $h^{*}$ by the plug-in method. By Proposition 1, we have a consistent estimator for the variance part. For the bias part, we use a pilot bandwidth given by

\displaystyle h_{n,\delta}=hN^{-\delta/(2k+3)},

for some $\delta\in(0,\frac{2k+3}{4k+4})$ and $h>0$ . Let $\hat{\beta}_{n,\delta}$ be our estimator calculated with $h_{n,\delta}$ . We can check that this bandwidth satisfies $Nh_{n,\delta}^{2k+3}\to\infty$ and $nh_{n,\delta}^{2k+2}\to\infty$ . Thus, by Theorem 1,

\displaystyle h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n,\delta}-\beta)\to_{p}\Sigma_{WW}^{-1}\Sigma_{W\lambda},

as $n\to\infty$ . By replacing $\beta$ by $\hat{\beta}_{n}$ , calculated with $h_{n}=hN^{-\frac{1}{2k+3}}$ , we have the following result:

Proposition 2.

Suppose that Assumptions 1-10 hold. Let $\hat{\beta}_{n}$ and $\hat{\beta}_{n,\delta}$ be the proposed estimators with bandwidths $h_{n}=hN^{-1/(2k+3)}$ and $h_{n,\delta}=hN^{-\delta/(2k+3)}$ , respectively, for some $h>0$ and $\delta\in(0,\frac{2k+3}{4k+4})$ . Then,

\displaystyle h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n,\delta}-\hat{\beta}_{n})\to_{p}\Sigma_{WW}^{-1}\Sigma_{W\lambda},

as $n\to\infty$ .

Thus,

\displaystyle\hat{h}^{*}=\left(\frac{c^{\prime}\hat{S}_{WW}^{-1}\hat{\Sigma}_{W\nu,2}\hat{S}_{WW}^{-1}c}{2(k+1)\{h_{n,\delta}^{-(k+1)}c^{\prime}(\hat{\beta}_{n,\delta}-\hat{\beta}_{n})\}^{2}}\right)^{\frac{1}{2k+3}}

is a consistent estimator for $h^{*}$ by Propositions 1 and 2.
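The plug-in formula can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the function name `plug_in_bandwidth_constant` and its argument names are hypothetical, $\hat{S}_{WW}$ is assumed symmetric and invertible, and the estimated bias is assumed to be non-zero.

```python
import numpy as np


def plug_in_bandwidth_constant(c, S_WW, Sigma2_hat, beta_n, beta_delta, h_pilot, k):
    """Plug-in estimate of h* from the variance and bias estimators.

    h_pilot is the pilot bandwidth h_{n,delta} used to compute beta_delta.
    """
    c = np.asarray(c, dtype=float)
    beta_n = np.asarray(beta_n, dtype=float)
    beta_delta = np.asarray(beta_delta, dtype=float)
    # S_WW is symmetric, so c' S_WW^{-1} Sigma2 S_WW^{-1} c = c_W' Sigma2 c_W.
    c_W = np.linalg.solve(np.asarray(S_WW, dtype=float), c)
    variance = c_W @ np.asarray(Sigma2_hat, dtype=float) @ c_W
    # Rescaled difference of the two estimators estimates the bias constant.
    bias = h_pilot ** (-(k + 1)) * (c @ (beta_delta - beta_n))
    return (variance / (2 * (k + 1) * bias**2)) ** (1.0 / (2 * k + 3))
```

Because the exponent $1/(2k+3)$ is small, moderate errors in the variance or bias estimates translate into comparatively small changes in the resulting constant.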

3.5. Bias Correction

Notice from Theorem 1 that our estimator has the asymptotic bias of $\sqrt{h^{2k+3}}\Sigma_{WW}^{-1}\Sigma_{W\lambda}$ in the case of degeneracy, $\Sigma_{W\nu,1}=0$ . If the bias is non-negligible, it distorts the coverage probability of the confidence interval. Correcting the bias is desirable as it is generally unknown whether the degeneracy occurs. Fortunately, given that the asymptotic distribution in the degenerate case is similar to that in Kyriazidou (1997), we can use her bias correction strategy as follows.

Note that $h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n,\delta}-\beta)$ directly estimates the asymptotic bias from Theorem 1. We can construct a bias-corrected estimator $\hat{\beta}_{n,bc}(\beta)$ by subtracting this bias estimator from the original estimator with suitable normalization: Let $r_{n,\delta}=N^{(1-\delta)/(2k+3)}$ . The bias-corrected estimator is given by

\displaystyle\hat{\beta}_{n,bc}(\beta)=\hat{\beta}_{n}-r_{n,\delta}^{-(k+1)}(\hat{\beta}_{n,\delta}-\beta).

We can check that this estimator is asymptotically unbiased regardless of the degeneracy: When $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}>0$ ,

	$\displaystyle\sqrt{n}c^{\prime}(\hat{\beta}_{n,bc}(\beta)-\beta)$	$\displaystyle=\sqrt{n}c^{\prime}(\hat{\beta}_{n}-\beta)-\sqrt{n}r_{n,\delta}^{-(k+1)}c^{\prime}(\hat{\beta}_{n,\delta}-\beta)$
		$\displaystyle=\underbrace{\sqrt{n}c^{\prime}(\hat{\beta}_{n}-\beta)}_{\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,1}c_{W})}-\underbrace{\sqrt{n}h^{k+1}N^{-(k+1)/(2k+3)}}_{\to 0}\underbrace{h_{n,\delta}^{-(k+1)}c^{\prime}(\hat{\beta}_{n,\delta}-\beta)}_{\to_{p}c_{W}^{\prime}\Sigma_{W\lambda}}$
		$\displaystyle\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}),$

as $n\to\infty$ . When $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ ,

	$\displaystyle\sqrt{Nh_{n}}c^{\prime}(\hat{\beta}_{n,bc}(\beta)-\beta)$	$\displaystyle=\sqrt{Nh_{n}}c^{\prime}(\hat{\beta}_{n}-\beta)-\sqrt{Nh_{n}}r_{n,\delta}^{-(k+1)}c^{\prime}(\hat{\beta}_{n,\delta}-\beta)$
		$\displaystyle=\underbrace{\sqrt{Nh_{n}}c^{\prime}(\hat{\beta}_{n}-\beta)}_{\to_{d}\mathcal{N}(\sqrt{h^{2k+3}}c_{W}^{\prime}\Sigma_{W\lambda},c_{W}^{\prime}\Sigma_{W\nu,2}c_{W})}-\underbrace{\sqrt{h^{2k+3}}h_{n,\delta}^{-(k+1)}c^{\prime}(\hat{\beta}_{n,\delta}-\beta)}_{\to_{p}\sqrt{h^{2k+3}}c_{W}^{\prime}\Sigma_{W\lambda}}$
		$\displaystyle\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}),$

as $n\to\infty$ . Thus, given the adaptivity of $\hat{\Sigma}$ to the degeneracy, we have

\displaystyle(c^{\prime}\hat{\Sigma}c)^{-1/2}c^{\prime}(\hat{\beta}_{n,bc}(\beta)-\beta)\to_{d}\mathcal{N}(0,1),

as $n\to\infty$ for an arbitrary non-zero vector $c\in\mathbb{R}^{q_{w}}$ .

Then, we can construct the bias-corrected confidence interval as follows. Letting $\Phi_{1-\alpha/2}^{-1}$ be the $(1-\alpha/2)$ quantile of the standard normal distribution and noting that $r_{n,\delta}^{-(k+1)}<1$ for $N>1$ , we have

	$\displaystyle-\Phi_{1-\alpha/2}^{-1}\leq(c^{\prime}\hat{\Sigma}c)^{-1/2}c^{\prime}(\hat{\beta}_{n,bc}(\beta)-\beta)\leq\Phi_{1-\alpha/2}^{-1}$
	$\displaystyle\Longleftrightarrow-(c^{\prime}\hat{\Sigma}c)^{1/2}\Phi_{1-\alpha/2}^{-1}\leq c^{\prime}\hat{\beta}_{n}-r_{n,\delta}^{-(k+1)}c^{\prime}\hat{\beta}_{n,\delta}-(1-r_{n,\delta}^{-(k+1)})c^{\prime}\beta\leq(c^{\prime}\hat{\Sigma}c)^{1/2}\Phi_{1-\alpha/2}^{-1}$
	$\displaystyle\Longleftrightarrow CI_{L,\alpha,c}\leq c^{\prime}\beta\leq CI_{U,\alpha,c},$

where

	$\displaystyle CI_{L,\alpha,c}$	$\displaystyle\equiv(1-r_{n,\delta}^{-(k+1)})^{-1}\left[c^{\prime}\hat{\beta}_{n}-r_{n,\delta}^{-(k+1)}c^{\prime}\hat{\beta}_{n,\delta}-(c^{\prime}\hat{\Sigma}c)^{1/2}\Phi_{1-\alpha/2}^{-1}\right],$
	$\displaystyle CI_{U,\alpha,c}$	$\displaystyle\equiv(1-r_{n,\delta}^{-(k+1)})^{-1}\left[c^{\prime}\hat{\beta}_{n}-r_{n,\delta}^{-(k+1)}c^{\prime}\hat{\beta}_{n,\delta}+(c^{\prime}\hat{\Sigma}c)^{1/2}\Phi_{1-\alpha/2}^{-1}\right].$

The full inference procedure is summarized as follows:

(1) Compute the first step estimator $\hat{\gamma}_{n}$ .
(2) Choose $k\geq 2,\delta\in(0,\frac{2k+3}{4k+4})$ , and $h>0$ to compute $\hat{\beta}_{n}$ and $\hat{\beta}_{n,\delta}$ with bandwidths $h_{n}=hN^{-1/(2k+3)}$ and $h_{n,\delta}=hN^{-\delta/(2k+3)}$ , respectively.
(3) Compute $\hat{\Sigma}$ and $h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n,\delta}-\hat{\beta}_{n})$ to estimate the asymptotic variance and bias and obtain $\hat{h}^{*}$ .
(4) Update $\hat{\beta}_{n}$ and $\hat{\beta}_{n,\delta}$ with bandwidths $h_{n}=\hat{h}^{*}N^{-1/(2k+3)}$ and $h_{n,\delta}=\hat{h}^{*}N^{-\delta/(2k+3)}$ , respectively.
(5) Construct the confidence interval by computing $CI_{L,\alpha,c}$ and $CI_{U,\alpha,c}$ from $\hat{\beta}_{n},\hat{\beta}_{n,\delta}$ , and $c^{\prime}\hat{\Sigma}c$ .
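The last step of the procedure can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the function name `bias_corrected_ci` is hypothetical, the weight on $\hat{\beta}_{n,\delta}$ is assumed to be $r_{n,\delta}^{-(k+1)}$ with $r_{n,\delta}=N^{(1-\delta)/(2k+3)}$ as in the definition of the bias-corrected estimator, and the half width is assumed to be the normal quantile times $(c^{\prime}\hat{\Sigma}c)^{1/2}$.

```python
from statistics import NormalDist

import numpy as np


def bias_corrected_ci(c, beta_n, beta_delta, Sigma_hat, N, k, delta, alpha=0.05):
    """Bias-corrected confidence interval for c'beta.

    beta_n and beta_delta are the estimates computed with h_n and h_{n,delta};
    Sigma_hat is the adaptive variance estimate of beta_n.
    """
    c = np.asarray(c, dtype=float)
    beta_n = np.asarray(beta_n, dtype=float)
    beta_delta = np.asarray(beta_delta, dtype=float)
    Sigma_hat = np.asarray(Sigma_hat, dtype=float)

    r = N ** ((1.0 - delta) / (2 * k + 3))
    w = r ** (-(k + 1))  # weight r_{n,delta}^{-(k+1)}; below 1 whenever N > 1
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = np.sqrt(c @ Sigma_hat @ c)

    center = c @ beta_n - w * (c @ beta_delta)
    lower = (center - z * se) / (1.0 - w)
    upper = (center + z * se) / (1.0 - w)
    return float(lower), float(upper)
```

Dividing by $1-w$ reflects that the unknown $c^{\prime}\beta$ appears on both sides of the pivot, so the interval is wider than the conventional one built from $\hat{\beta}_{n}$ alone.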

3.6. First-step Estimator

Remember that we want to estimate $\gamma$ from the selection equation or network formation process (2.3):

\displaystyle d_{ijt}=\boldsymbol{1}\{R_{ijt}^{\prime}\gamma+B_{i}+B_{j}-\eta_{ijt}\geq 0\}.

This DGP can be interpreted as a panel discrete choice model as well as a network formation model. Estimators for discrete choice models such as Chamberlain (1980), Manski (1987), or Horowitz (1992) can be candidates for estimating $\gamma$ . Also, estimators for network formation models such as Graham (2017) or Candelaria (2020) can be applicable under additional conditions.

Whether those estimators can be used as our first-step estimator $\hat{\gamma}_{n}$ boils down to their convergence rates. Recall that Assumption 10 requires that $\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)=o_{p}(1)$ , which implies that the first-step estimator needs to converge faster than $\hat{\beta}_{n}$ . We conjecture that, without additional conditions on $\eta_{ijt}$ , the convergence rates of those estimators are $\sqrt{n}$ in the worst case due to the conditional dependence across dyads. A $\sqrt{n}$ rate is incompatible with Assumption 10. In the following, we discuss what kind of additional conditions are needed to ensure Assumption 10.
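To make the rate comparison concrete, the following back-of-the-envelope calculation assumes the bandwidth $h_{n}=hN^{-1/(2k+3)}$ used elsewhere in the paper and $N\asymp n^{2}$ (both are assumptions of this sketch, not additional results).

```latex
\sqrt{Nh_{n}}=\sqrt{h}\,N^{\frac{k+1}{2k+3}}\asymp n^{\frac{2k+2}{2k+3}},
\qquad
\frac{\sqrt{Nh_{n}}}{\sqrt{n}}\asymp n^{\frac{2k+1}{2(2k+3)}}\to\infty,
\qquad
\frac{\sqrt{Nh_{n}}}{\sqrt{N}}=\sqrt{h_{n}}\to 0.
```

Hence a first-step estimator that is only $\sqrt{n}$-consistent leaves $\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)$ unbounded in general, whereas a $\sqrt{N}$-consistent one makes it $o_{p}(1)$.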

We may assume additive separability for $\eta_{ijt}$ : $\eta_{ijt}=V_{it}+V_{jt}+V_{ijt}$ , where, conditionally on $\{\xi_{i}\}_{i=1}^{n}$ , $(V_{i1},V_{i2}),i=1,...,n$ are independent, $(V_{ij1},V_{ij2}),1\leq i<j\leq n$ are independent, and the two collections are mutually independent. This assumption is weaker than assuming that $(\eta_{ij1},\eta_{ij2}),1\leq i<j\leq n$ are conditionally independent given $\{\xi_{i}\}_{i=1}^{n}$ , which corresponds to the case where $V_{it}$ is degenerate. With additional conditions, we can directly apply Graham (2017)’s joint maximum likelihood estimator or Candelaria (2020)’s semiparametric estimator, both of which leverage the cross-sectional variation in $d_{ijt}$ and $R_{ijt}$ . We can show that in our setting (especially Assumptions 1 and 6), the limiting networks are dense, which implies that both Graham (2017) and Candelaria (2020)’s estimators satisfy $\sqrt{N}(\hat{\gamma}_{n}-\gamma)=O_{p}(1)$ and Assumption 10.

Alternatively, we may assume that $(\eta_{ij1},\eta_{ij2}),1\leq i<j\leq n$ is conditionally independent given $\{\xi_{i}\}_{i=1}^{n}$ . Graham (2017) and Candelaria (2020)’s estimators still satisfy Assumption 10, but we can also show that Chamberlain (1980)’s conditional logit estimator and Horowitz (1992)’s smoothed maximum score estimator can satisfy Assumption 10. Under the conditional independence assumption, the latter two estimators can be written in an asymptotically locally linear form where the corresponding influence function is indexed by $ij$ with $0$ covariances. Thus, the convergence rates are based on $N$ and Assumption 10 can be satisfied depending on the tuning parameters.

4. Extension

4.1. Directed Graph with Multiple Periods

In the above analysis, we restricted our attention to an undirected graph; the variables are all symmetric with respect to nodes (e.g., $Y_{ijt}=Y_{jit}$ ). Also, there were only two time periods, $t=1,2$ . The extension to a directed graph case with $t=1,...,T\,(T\geq 2)$ is straightforward: letting $\Delta_{st}A\equiv A_{s}-A_{t}$ denote the time difference between $s$ and $t$ , we propose the following estimator:

	$\displaystyle\hat{\beta}_{n}=$	$\displaystyle\left[\sum_{s<t}\sum_{i=1}^{n}\sum_{j\neq i}d_{ijs}d_{ijt}\Delta_{st}W_{ij}\Delta_{st}W_{ij}^{\prime}K_{h_{n}}(\Delta_{st}R_{ij}^{\prime}\hat{\gamma}_{n})\right]^{-1}$
		$\displaystyle\times\left[\sum_{s<t}\sum_{i=1}^{n}\sum_{j\neq i}d_{ijs}d_{ijt}\Delta_{st}W_{ij}\Delta_{st}Y_{ij}K_{h_{n}}(\Delta_{st}R_{ij}^{\prime}\hat{\gamma}_{n})\right].$

All the results and their proofs are valid with some modification because we can always rewrite the double sum $\sum_{i=1}^{n}\sum_{j\neq i}A_{ij}$ as $\sum_{i<j}(A_{ij}+A_{ji})$ for any variables $\{A_{ij}\}$ . We will use this version of the estimator in our empirical application.
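A minimal numerical sketch of this kernel-weighted estimator is given below. It is an illustration under stated assumptions rather than the code used in the paper: the function names are hypothetical, the inputs are assumed to be stacked over all combinations of $s<t$ and ordered pairs $(i,j)$ with $j\neq i$, the biweight kernel of the simulation section is used, and $K_{h}(u)=K(u/h)/h$ is an assumed normalization (the factor $1/h$ cancels in $\hat{\beta}_{n}$).

```python
import numpy as np


def biweight(u):
    """Biweight kernel K(u) = 15/16 (1 - u^2)^2 on |u| <= 1, zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)


def kernel_weighted_beta(dW, dY, dR, both_observed, gamma_hat, h_n):
    """Kernel-weighted first-difference estimator.

    dW: (m, q_w) differences of W;  dY: (m,) differences of Y;
    dR: (m, q_r) differences of R;  both_observed: (m,) values d_ijs * d_ijt.
    Each row is one (s < t, i, j != i) combination.
    """
    index_diff = dR @ gamma_hat
    # Rows with a first-step index difference near zero get the largest weight.
    weights = both_observed * biweight(index_diff / h_n) / h_n
    weighted_dW = dW * weights[:, None]
    lhs = weighted_dW.T @ dW
    rhs = weighted_dW.T @ dY
    return np.linalg.solve(lhs, rhs)
```

Stacking the rows in this way makes the undirected two-period case a special case in which each unordered dyad contributes a single row.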

4.2. Pairwise Fixed Effects

In the model (2.2) and (2.3), all the fixed effects are node-wise. Since we are interested in coefficients on time-varying dyadic variables, it is possible to include pairwise fixed effects $A_{ij}$ and $B_{ij}$ in each equation, additionally to $A_{i},A_{j}$ and $B_{i},B_{j}$ . Clearly, with pairwise fixed effects, the identification and estimator will be the same as with node-wise fixed effects since we are leveraging the time variation. Thus, a similar asymptotic analysis will also hold as long as $(A_{ij},B_{ij}),1\leq i<j\leq n$ are independently distributed conditionally on $\{\xi_{i}\}_{i=1}^{n}$ .

Alternatively, we can also do away with the additive separability by incorporating node-wise fixed effects into pairwise ones:

\displaystyle A_{ij}=\tilde{\tau}\left(\tilde{A}_{i},\tilde{A}_{j},\tilde{A}_{ij}\right),

where $\tilde{\tau}$ is some unknown function, $\tilde{A}_{i}$ is a node-wise fixed effect, and $\tilde{A}_{ij}$ is a pairwise fixed effect. We can impose a similar structure for $B_{ij}$ . Again, the asymptotic analysis will hold as long as $(\tilde{A}_{ij},\tilde{B}_{ij}),1\leq i<j\leq n$ are conditionally independent. With a more general dependence structure, we could show a similar asymptotic result using Kojevnikov et al. (2021)’s central limit theorem for $\psi$ -dependent data.

4.3. Sparsity

Above, we argued that our model and assumptions imply that the limiting networks $D_{1}$ and $D_{2}$ are locally dense around $\Delta R_{ij}^{\prime}\gamma\sim 0$ . Thus, we limit our attention to cases where the number of dyads in the sample is proportional to $N$ . Our modeling is appropriate in some applications, such as trade or migration, where the networks are rather dense. However, it can be inappropriate for applications where the networks are sparse, such as employer-employee or bank-firm matched data (e.g., Abowd et al. (1999), Jiménez et al. (2014)).

We can accommodate sparse networks by the following modification: let us modify Assumption 1 so that $\xi_{i},i=1,...,n$ are drawn from some distribution that is allowed to depend on $n$ . For example, as argued in Graham (2017), we can consider a distribution where the fixed effects are such that $\liminf_{1\leq i\leq n}B_{i}=-\infty$ . Then, we can discuss identification and estimation with fixed $n$ , and the moments of interest all depend on $n$ . In particular, we can consider the sequence of networks such that $Pr(d_{121}d_{122}=1|\Delta R_{12}^{\prime}\gamma=0)\to 0$ and $r_{n}Pr(d_{121}d_{122}=1|\Delta R_{12}^{\prime}\gamma=0)=\Omega(1)$ for some $r_{n}\to\infty$ to incorporate sparsity. We do not pursue sparsity in this paper and leave it for future projects.

5. Simulation

To see the performance of the estimator, we conduct some simulation exercises. Consider the following data-generating process:

	$\displaystyle W_{ijt}=X_{it}+X_{jt},R_{ijt}=(W_{ijt},Z_{it}+Z_{jt})^{\prime},$
	$\displaystyle A_{i}=\frac{X_{i1}+X_{i2}}{2},B_{i}=\frac{Z_{i1}+Z_{i2}}{2},$
	$\displaystyle d_{ijt}=\boldsymbol{1}\{R_{ijt}^{\prime}(1,1)^{\prime}+\theta\times(B_{i}+B_{j})-\eta_{ijt}\geq 0\},$
	$\displaystyle Y_{ijt}=d_{ijt}(W_{ijt}+A_{i}+A_{j}+\epsilon_{ijt})$

where

	$\displaystyle X_{it},Z_{it}\sim\mathcal{N}(2,1),i.i.d.\text{ across }i,t,$
	$\displaystyle\eta_{ijt}\sim Logistic(0,1),i.i.d.\text{ across }ij,t,$
	$\displaystyle\epsilon_{ijt}=U_{it}+U_{jt}+\eta_{ijt},\text{ where }U_{it}\sim\mathcal{N}(0,\sigma),i.i.d.\text{ across }i,t.$

Note that $\beta=1$ and $\gamma=(1,1)^{\prime}$ . We have $\theta\in\{-0.3,-2.0,-3.0\}$ inside of $d_{ijt}$ to control the fraction of zeros in the simulated data set:

\displaystyle Pr(d_{121}\times d_{122}=0)\sim\begin{cases}20\%\text{ if }\theta=-0.3\\ 75\%\text{ if }\theta=-2.0\\ 90\%\text{ if }\theta=-3.0\end{cases}.

We also change $\sigma\in\{0.0,1.0\}$ for $U_{it}$ so that $\sigma=0.0$ ( $\sigma=1.0$ ) corresponds to the degenerate (non-degenerate) case.
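One draw from this data-generating process can be sketched as follows. This is an illustrative sketch only: the function name `simulate_dyadic_panel` and the returned dictionary layout are hypothetical, $\sigma$ is treated as a standard deviation (an assumption, since the text does not say whether it is a variance), and diagonal entries $i=j$ are generated but carry no meaning.

```python
import numpy as np


def simulate_dyadic_panel(n, theta, sigma, rng):
    """Simulate one undirected two-period dyadic panel from the design above."""
    T = 2
    X = rng.normal(2.0, 1.0, size=(n, T))
    Z = rng.normal(2.0, 1.0, size=(n, T))
    A = X.mean(axis=1)  # A_i = (X_i1 + X_i2) / 2
    B = Z.mean(axis=1)  # B_i = (Z_i1 + Z_i2) / 2
    U = rng.normal(0.0, sigma, size=(n, T))  # sigma used as a std. dev. (assumption)

    # Dyad-level logistic shocks, symmetrized so eta[i, j, t] == eta[j, i, t].
    eta = rng.logistic(0.0, 1.0, size=(n, n, T))
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)
    eta = np.where(upper[:, :, None], eta, np.transpose(eta, (1, 0, 2)))

    W = X[:, None, :] + X[None, :, :]
    Z_pair = Z[:, None, :] + Z[None, :, :]
    B_pair = B[:, None, None] + B[None, :, None]
    A_pair = A[:, None, None] + A[None, :, None]

    # Selection: R'gamma with gamma = (1, 1)', plus scaled fixed effects.
    d = (W + Z_pair + theta * B_pair - eta >= 0).astype(int)
    eps = U[:, None, :] + U[None, :, :] + eta
    Y = d * (W + A_pair + eps)  # beta = 1; outcome is zero when d = 0
    return {"W": W, "R": np.stack([W, Z_pair], axis=-1), "d": d, "Y": Y}
```

A call such as `simulate_dyadic_panel(100, -2.0, 1.0, np.random.default_rng(0))` returns arrays indexed by `[i, j, t]`, from which the dyads with $i<j$ can be extracted via `np.triu_indices(n, k=1)`.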

As described above, we can interpret this data-generating process as a log-linearized version of the canonical gravity model (Head and Mayer, 2014); by writing $\tilde{Y}_{ijt}$ as an observable outcome, we redefine the main equation as

\displaystyle\tilde{Y}_{ijt}=exp(W_{ijt}+A_{i}+A_{j})\times\underbrace{\eta_{ijt}}_{=d_{ijt}exp(\epsilon_{ijt})}.

We can take a log and recover the original model for a unit with $d_{ijt}=1$ . This modeling allows a mass at $\tilde{Y}_{ijt}=0$ , one important feature of dyadic data.

We conduct experiments for $n\in\{50,100,150,200\}$ , $\theta\in\{-0.3,-2.0,-3.0\}$ , and $\sigma\in\{0.0,1.0\}$ , and run $2000$ replications for each design. We calculate $\hat{\gamma}_{n}$ by Chamberlain (1980)’s conditional logit estimator:

\displaystyle\hat{\gamma}_{n}=\underset{g\in\mathcal{G}}{argmax}\sum_{i<j:d_{ij1}+d_{ij2}=1}M_{ij}(g)

where $\mathcal{G}$ is a compact subset of $\mathbb{R}^{q_{r}}$ and

\displaystyle M_{ij}(g)=\boldsymbol{1}\{d_{ij1}=1\}\ln\left(\frac{exp(\Delta R_{ij}^{\prime}g)}{1+exp(\Delta R_{ij}^{\prime}g)}\right)+\boldsymbol{1}\{d_{ij2}=1\}\ln\left(\frac{1}{1+exp(\Delta R_{ij}^{\prime}g)}\right).
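The maximization of this objective can be sketched with a plain Newton iteration. This is a minimal sketch under stated assumptions, not the routine used in the paper: the function name `conditional_logit` is hypothetical, $\Delta R_{ij}$ is assumed to be the period-1 value minus the period-2 value (the sign convention implied by the objective), and the compactness restriction on $\mathcal{G}$ is not enforced.

```python
import numpy as np


def conditional_logit(dR, d1, d2, n_iter=50, tol=1e-10):
    """Conditional logit on dyads whose link status switches between periods.

    dR: (m, q_r) differences R_ij1 - R_ij2;  d1, d2: (m,) link indicators.
    """
    dR = np.asarray(dR, dtype=float)
    d1 = np.asarray(d1)
    d2 = np.asarray(d2)
    switch = (d1 + d2) == 1  # only switchers enter the objective
    X = dR[switch]
    y = d1[switch].astype(float)  # 1 if linked in period 1 only

    g = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ g))
        score = X.T @ (y - p)
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hessian, score)
        g = g - step  # Newton update for a concave log-likelihood
        if np.max(np.abs(step)) < tol:
            break
    return g
```

Since only switchers contribute, the fixed effects $B_{i}+B_{j}$ difference out of the objective, which is the reason this estimator is a candidate for the first step.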

For $\hat{\beta}_{n}$ , we use a biweight kernel for $K(\cdot)$ , given by $K(x)=15/16(1-x^{2})^{2}\boldsymbol{1}\{|x|\leq 1\}$ . This choice implies that we assume that the smoothness of the model is given by $k=2$ . We set $\delta=0.4$ and $h=3.0$ and calculate each estimator and confidence interval according to the inference procedure discussed above.

For comparison, we calculate the fixed effect estimator $\hat{\beta}_{FE}$ given by (3.1). The standard error is calculated by $\hat{\Sigma}$ , with $K_{h_{n}}(\cdot)$ replaced by $1$ . We also calculate the Poisson pseudo-maximum-likelihood (PPML) estimator $\hat{\beta}_{PPML}$ given by (3.2). We compute $\hat{\beta}_{PPML}$ and its standard error using the penppml package in R (Ferreras Garrucho and Zylkin, 2023). The standard error is clustered at the node level, which is close to $\hat{\Sigma}_{WW}^{-2}\hat{\Sigma}_{W\nu,1}$ in our setting (Graham, 2020).

The results are summarized in TABLE 1 and TABLE 2. In TABLE 1, we evaluate the three estimators by mean bias (MeanBias) and root mean squared error (RMSE) for $\sigma=0,1$ . In TABLE 2, we compute 95% coverage probabilities (Coverage) of four different confidence intervals: $CI_{conv}$ (conventional CI from $\hat{\beta}_{n}$ and $\hat{\Sigma}$ ), $CI_{bc}$ (bias-corrected CI given by $CI_{L,0.05}$ and $CI_{U,0.05}$ ), $CI_{FE}$ (conventional CI from $\hat{\beta}_{FE}$ and $\hat{\Sigma}$ with a flat kernel), and $CI_{PPML}$ (conventional CI from $\hat{\beta}_{PPML}$ and its node-level clustered standard error).

From TABLE 1(a) and 1(b), we can see that our estimator performs better than the fixed effect estimator and the PPML estimator in terms of bias, which shows that the weights given by the first step estimator work well in eliminating the bias. Our estimator also outperforms the competitors regarding RMSE, which implies that the loss in precision is not severe. Our estimator also performs well even when there is a large fraction of zeros in $Y$ ( $Pr(d_{ij1}\times d_{ij2}=0)\sim 90\%$ when $\theta=-3.0$ ). There is little difference between $\sigma=0$ and $\sigma=1$ other than added variances in the estimators.

From TABLE 2(a) and 2(b), we can see that the coverage of $CI_{bc}$ is close to 95% regardless of the degeneracy ( $\Sigma_{W\nu,1}=0$ or $>0$ ) while the others are off from the targeted nominal coverage. This result confirms the effectiveness of the bias correction strategy as well as the adaptivity of our variance estimator, as claimed in Section 3.3. Also, it is notable that the bias correction is important for obtaining correct coverage probabilities even when $\sigma=1.0$ , so that $\Sigma_{W\nu,1}>0$ : in that case the asymptotic bias is $0$ (Theorem 1) and $CI_{conv}$ would deliver asymptotically correct coverage.
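For reference, the summary statistics reported in the tables can be computed from the Monte Carlo draws along the following lines. This is a generic sketch, assuming one scalar estimate and one confidence interval per replication; the function name `monte_carlo_summary` is hypothetical.

```python
import numpy as np


def monte_carlo_summary(estimates, ci_lower, ci_upper, truth):
    """Mean bias, RMSE, and empirical coverage across replications."""
    estimates = np.asarray(estimates, dtype=float)
    ci_lower = np.asarray(ci_lower, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)
    errors = estimates - truth
    covered = (ci_lower <= truth) & (truth <= ci_upper)
    return {
        "MeanBias": float(np.mean(errors)),
        "RMSE": float(np.sqrt(np.mean(errors**2))),
        "Coverage": float(np.mean(covered)),
    }
```

In the design above, `truth` would be $\beta=1$ and the arrays would have one entry per replication.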

Table 1. Finite sample properties of $\hat{\beta}_{n}$ , $\hat{\beta}_{FE}$ , and $\hat{\beta}_{PPML}$

(a) $\sigma=1.0$

		MeanBias			RMSE
$\theta$	$n$	$\hat{\beta}_{n}$	$\hat{\beta}_{FE}$	$\hat{\beta}_{PPML}$	$\hat{\beta}_{n}$	$\hat{\beta}_{FE}$	$\hat{\beta}_{PPML}$
-0.3	50	0.045	0.133	0.185	0.122	0.160	0.430
-2.0	50	0.141	0.352	0.467	0.210	0.377	0.617
-3.0	50	0.162	0.369	0.582	0.273	0.415	0.752
-0.3	100	0.038	0.136	0.195	0.087	0.148	0.376
-2.0	100	0.099	0.349	0.438	0.142	0.359	0.536
-3.0	100	0.117	0.359	0.542	0.184	0.378	0.657
-0.3	150	0.028	0.135	0.193	0.070	0.143	0.327
-2.0	150	0.075	0.346	0.427	0.112	0.353	0.496
-3.0	150	0.095	0.356	0.527	0.145	0.367	0.607
-0.3	200	0.024	0.134	0.193	0.060	0.140	0.305
-2.0	200	0.061	0.344	0.417	0.091	0.348	0.471
-3.0	200	0.076	0.352	0.510	0.118	0.360	0.572

(b) $\sigma=0.0$

		MeanBias			RMSE
$\theta$	$n$	$\hat{\beta}_{n}$	$\hat{\beta}_{FE}$	$\hat{\beta}_{PPML}$	$\hat{\beta}_{n}$	$\hat{\beta}_{FE}$	$\hat{\beta}_{PPML}$
-0.3	50	0.048	0.134	0.194	0.082	0.142	0.399
-2.0	50	0.140	0.352	0.468	0.176	0.365	0.586
-3.0	50	0.161	0.369	0.581	0.229	0.397	0.714
-0.3	100	0.037	0.135	0.193	0.053	0.138	0.332
-2.0	100	0.093	0.348	0.438	0.110	0.352	0.508
-3.0	100	0.113	0.359	0.546	0.145	0.368	0.630
-0.3	150	0.028	0.135	0.191	0.039	0.136	0.301
-2.0	150	0.071	0.345	0.427	0.082	0.348	0.477
-3.0	150	0.089	0.354	0.529	0.108	0.359	0.592
-0.3	200	0.024	0.135	0.191	0.031	0.136	0.278
-2.0	200	0.058	0.345	0.415	0.067	0.347	0.451
-3.0	200	0.074	0.355	0.508	0.089	0.358	0.553

Table 2. 95% coverage probabilities of $CI_{conv}$ , $CI_{bc}$ , $CI_{FE}$ , and $CI_{PPML}$

(a) $\sigma=1.0$

		Coverage
$\theta$	$n$	$CI_{conv}$	$CI_{bc}$	$CI_{FE}$	$CI_{PPML}$
-0.3	50	0.790	0.961	0.498	0.537
-2.0	50	0.646	0.963	0.150	0.236
-3.0	50	0.640	0.901	0.311	0.211
-0.3	100	0.785	0.978	0.233	0.498
-2.0	100	0.668	0.970	0.011	0.173
-3.0	100	0.674	0.953	0.072	0.143
-0.3	150	0.790	0.971	0.103	0.472
-2.0	150	0.689	0.949	0.001	0.117
-3.0	150	0.688	0.944	0.016	0.09
-0.3	200	0.817	0.964	0.040	0.426
-2.0	200	0.730	0.947	0.000	0.08
-3.0	200	0.720	0.946	0.004	0.08

(b) $\sigma=0.0$

		Coverage
$\theta$	$n$	$CI_{conv}$	$CI_{bc}$	$CI_{FE}$	$CI_{PPML}$
-0.3	50	0.698	0.918	0.141	0.547
-2.0	50	0.535	0.935	0.026	0.204
-3.0	50	0.592	0.869	0.168	0.182
-0.3	100	0.655	0.960	0.001	0.515
-2.0	100	0.482	0.958	0.000	0.12
-3.0	100	0.571	0.944	0.004	0.106
-0.3	150	0.673	0.977	0.000	0.45
-2.0	150	0.471	0.945	0.000	0.073
-3.0	150	0.532	0.949	0.001	0.065
-0.3	200	0.660	0.970	0.000	0.407
-2.0	200	0.444	0.939	0.000	0.052
-3.0	200	0.520	0.933	0.000	0.047

6. Empirical example

6.1. Background

As a leading application of our model, consider Moretti and Wilson (2017). They study how state-level tax differences affect migration by top scientists in the U.S. Specifically, they estimate the following model implied by their economic theory:

	$\displaystyle log(P_{ijt}/P_{iit})$	$\displaystyle=\eta\left[log(1-\tau_{jt})-log(1-\tau_{it})\right]$
		$\displaystyle+\eta^{\prime}\left[log(1-\tau_{jt}^{\prime})-log(1-\tau_{it}^{\prime})\right]+\gamma_{j}+\gamma_{i}+u_{ijt},$

where $P_{ijt}$ is the number of scientists migrating to state $j$ from state $i$ at year $t$ , $\tau_{it}$ and $\tau_{it}^{\prime}$ are personal and corporate taxes imposed in state $i$ at year $t$ , $\gamma_{i}$ is a state fixed effect, and $u_{ijt}$ is an error term.

Note that if there is no migration from $i$ to $j$ at year $t$ , $P_{ijt}=0$ and $log(P_{ijt}/P_{iit})$ is undefined. In Moretti and Wilson (2017)’s dataset, more than 70% of state-pairs exhibit no migration flow:

Figure 6.1. Fraction of positive migration flows in Moretti and Wilson (2017)’s dataset. The migration flow is positive in a given year if at least one scientist moves from state $i$ to $j$ (scientists are "stars": they are at or above the 95% quantile in the number of patents over the past ten years).

When running a regression, they are concerned with a potential sample selection bias stemming from these undefined outcomes. They argue that if the main regressors are not systematically associated with whether there is a positive migration flow, the selection bias should be minimal. Running OLS on the linear probability model, they find little correlation between the main regressors and no flow. Recalculating their regression with our estimator helps check the validity of their argument and the appropriateness of using the linear probability model.

When applying our model to their context, we must consider what $R_{ijt}$ should be. Since Moretti and Wilson (2017)’s underlying theory is based on scientists’ and firms’ discrete choice, one consistent way to generate zero migration flows between some states is to consider endogenous choice sets as in Dubé et al. (2021). Formally, we can write the choice set of representative scientists in state $i$ as $C_{it}=\{j\in\{1,...,51\}:d_{ijt}=1\}$ . Here, $\{d_{ijt}\}$ represents the job-market network; if $d_{ijt}=1$ , it is possible to move from $i$ to $j$ , and vice versa. We can attribute the determinants of the network to the utilities and profits of scientists and firms, as well as the matching costs between the two parties. Such costs are not present in the structural equation if they are not compensated through wages; the structural equation consists of the determinants of log wage differences between two states. Thus, in the selection equation (2.3), in addition to $W_{ijt}$ , we can include variables in $R_{ijt}$ that capture non-monetary matching costs between two states $i$ and $j$ ; this choice does not violate Assumption 3, as $R_{ijt}$ satisfies the exclusion restriction.

6.2. Implementation

For $W_{ijt}$ , as in Moretti and Wilson (2017), we include the state-to-state differences in (i) the individual income average tax rate (ATR) faced by a hypothetical taxpayer at the 99% quantile of the national income distribution, (ii) the corporate tax rate (CIT), (iii) the investment tax credit (ITC), and (iv) the R&D tax credit (R&D credit). This is the same set of regressors as Moretti and Wilson (2017)’s baseline regression. For $R_{ijt}$ , we use $W_{ijt}$ plus the state-to-state difference in the logarithm of population (POP) and a dummy variable that indicates whether the governors of $i$ and $j$ belong to the same political party (GOV). The additional variables in $R_{ijt}$ arguably measure non-monetary costs of connecting firms and workers in two states.

We implement the first step estimation as follows. We use the conditional logit estimator extended to the directed graph with multiple periods case, which is given by

\displaystyle\hat{\gamma}_{n}=\underset{g\in\mathcal{G}}{argmax}\sum_{s<t}\sum_{i,j}M_{ij,st}(g),

where $\mathcal{G}$ is a compact subset of $\mathbb{R}^{q_{r}}$ and

	$\displaystyle M_{ij,st}(g)$	$\displaystyle=\boldsymbol{1}\{d_{ijs}+d_{ijt}=1\}\times\left[\boldsymbol{1}\{d_{ijs}=1\}ln(e_{ij,st})+\boldsymbol{1}\{d_{ijt}=1\}ln(1-e_{ij,st})\right],$
	$\displaystyle e_{ij,st}$	$\displaystyle=\frac{exp(\Delta_{st}R_{ij}^{\prime}g)}{1+exp(\Delta_{st}R_{ij}^{\prime}g)}.$

TABLE 3 reports the first step estimation result. We can see that the coefficients on the newly added variables $GOV$ and $POP$ deviate from zero, which suggests that a part of the identification assumptions (Assumption 3) is satisfied. Also, for these two variables, the estimated coefficients imply that the job market network exhibits homophily; similar states are more likely to be connected.

Table 3. First Step Estimation Result

Variable	$\hat{\gamma}_{n}$
GOV	0.162
POP	-13.144
ATR	8.561
CIT	5.850
ITC	5.350
R $\&$ D credit	-0.230

For the second step estimator, we use our $\hat{\beta}_{n}$ defined above and extend it to the directed graph with multiple periods case, as discussed in Section 4. We use a biweight kernel for $K(\cdot)$ (so $k=2$ ), choose $h=3.0$ as an initial constant for pilot bandwidths, and use $\delta=0.4$ for calculating $h_{n,\delta}$ . We extend and use $\hat{\Sigma}$ to calculate the standard error while taking into account the correlation across time (case 1). We also calculate the bias-corrected $95\%$ confidence interval by computing $CI_{L,0.05}$ and $CI_{U,0.05}$ as defined above. In addition, we list $\hat{\beta}_{MW}$ , an estimate from Moretti and Wilson (2017) (page 1883, TABLE 2A, specification (3)), and calculate the conventional $95\%$ confidence interval based on their standard errors. Note that $\hat{\beta}_{MW}$ is the fixed effect estimator, whose standard error is calculated by clustering across time and the origin, destination, and origin-destination pairs.

We summarize the result in TABLE 4. We can see that $\hat{\beta}_{n}$ returns similar values to $\hat{\beta}_{MW}$ , which supports the robust positive effect of income and corporate-related tax differences on migration. Thus, Moretti and Wilson (2017)’s estimates are not likely to be qualitatively affected by the sample selection effects. However, while Moretti and Wilson (2017)’s estimates are statistically significant at the $5\%$ level, our confidence intervals show that none of the estimates except for ITC remains statistically significant at that level. Our insignificance result is driven by both the increase in standard errors⁵ and the asymptotic bias correction. Thus, our exercise shows that some of the results in Moretti and Wilson (2017) may not be robust to the presence of sample selection due to the endogeneity of the job market network.

⁵ Our standard error is from $\hat{\Sigma}$ , which fully takes into account the dependence among pairs that share a state as origin or destination, such as California $\to$ Wisconsin and New York $\to$ California. Moretti and Wilson (2017)’s standard error calculation ignores such dependence structure.

Table 4. Comparison of our estimator and Moretti and Wilson (2017)

(a) This paper

Variable	Estimate (s.e.)	95% CI
ATR	1.634 (1.886)	$[-2.656,5.872]$
CIT	1.666 (2.040)	$[-2.948,6.276]$
ITC	1.980 (0.584)	$[0.651,3.290]$
R $\&$ D credit	0.429 (0.780)	$[-1.352,2.117]$

(b) Moretti and Wilson (2017)

Variable	Estimate (s.e.)	95% CI
ATR	1.926 (0.514)	$[0.918,2.933]$
CIT	1.840 (0.588)	$[0.687,2.992]$
ITC	1.793 (0.411)	$[0.987,2.598]$
R $\&$ D credit	0.368 (0.182)	$[0.011,0.724]$
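As a quick arithmetic check of the conventional intervals in panel (b), the sketch below recomputes estimate $\pm 1.96\times$ s.e. from the reported point estimates and standard errors. It assumes the rows of panel (b) follow the same variable order as panel (a); the dictionary and function names are illustrative only.

```python
Z_975 = 1.96  # rounded two-sided 5% critical value of the standard normal

# (estimate, standard error, reported CI) copied from panel (b) of Table 4.
# Assumption: rows are in the same variable order as panel (a).
mw_rows = {
    "ATR": (1.926, 0.514, (0.918, 2.933)),
    "CIT": (1.840, 0.588, (0.687, 2.992)),
    "ITC": (1.793, 0.411, (0.987, 2.598)),
    "R&D credit": (0.368, 0.182, (0.011, 0.724)),
}


def conventional_ci(estimate, se, z=Z_975):
    """Symmetric normal-approximation confidence interval."""
    return estimate - z * se, estimate + z * se


# Largest absolute gap between recomputed and reported endpoints.
max_gap = 0.0
for name, (est, se, reported) in mw_rows.items():
    lower, upper = conventional_ci(est, se)
    max_gap = max(max_gap, abs(lower - reported[0]), abs(upper - reported[1]))
```

The recomputed endpoints agree with the reported ones up to rounding in the third decimal. The intervals in panel (a) are bias-corrected and therefore need not be symmetric around the point estimate.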

7. Conclusion

This paper studies identification and inference of a panel dyadic data sample selection model. We show that Kyriazidou (1997)’s identification strategy can be extended to our dyadic data setting, and we prove asymptotic normality of the proposed estimator.

Our estimator has some appealing properties. The distributional result implies that our estimator has the same convergence rate as the usual estimators used in practice in the non-degenerate case, so there is no loss of effective sample size from using our nonparametric-type estimator. Also, our estimator is guaranteed to be asymptotically normal, while others can be non-Gaussian in the limit.

We also provide consistent estimators for the asymptotic bias and variance that adapt to the degeneracy. Specifically, the bias-corrected confidence interval has asymptotically correct size. Our simple simulation exercise confirms the validity of these estimators and highlights the importance of bias correction in both degenerate and non-degenerate cases.

References

Abowd et al. (1999) Abowd, J. M., F. Kramarz, and D. N. Margolis (1999): “High Wage Workers and High Wage Firms,” Econometrica, 67, 251–333.
Ahn and Powell (1993) Ahn, H. and J. L. Powell (1993): “Semiparametric estimation of censored selection models with a nonparametric selection mechanism,” Journal of Econometrics, 58, 3–29.
Auerbach (2022) Auerbach, E. (2022): “Identification and Estimation of a Partially Linear Regression Model Using Network Data,” Econometrica, 90, 347–365.
Bonhomme (2020) Bonhomme, S. (2020): “Econometric analysis of bipartite networks,” The Econometric Analysis of Network Data.
Cameron and Miller (2014) Cameron, A. C. and D. Miller (2014): “Robust Inference for Dyadic Data,” Unpublished manuscript.
Candelaria (2020) Candelaria, L. E. (2020): “A Semiparametric Network Formation Model with Unobserved Linear Heterogeneity,” .
Chamberlain (1980) Chamberlain, G. (1980): “Analysis of Covariance with Qualitative Data,” The Review of Economic Studies, 47, 225–238.
Davezies et al. (2021) Davezies, L., X. D’Haultfœuille, and Y. Guyonvarch (2021): “Empirical process results for exchangeable arrays,” The Annals of Statistics, 49, 845 – 862.
Dubé et al. (2021) Dubé, J. P., A. Hortaçsu, and J. Joo (2021): “Random-coefficients logit demand estimation with zero-valued market shares,” Marketing Science, 40.
Fafchamps and Gubert (2007) Fafchamps, M. and F. Gubert (2007): “The formation of risk sharing networks,” Journal of Development Economics, 83.
Ferreras Garrucho and Zylkin (2023) Ferreras Garrucho, D. and T. Zylkin (2023): penppml: Penalized Poisson Pseudo Maximum Likelihood Regression, r package version 0.2.3.
Graham (2017) Graham, B. S. (2017): “An Econometric Model of Network Formation With Degree Heterogeneity,” Econometrica, 85, 1033–1063.
Graham (2020) ——— (2020): “Dyadic regression,” The Econometric Analysis of Network Data.
Graham et al. (2019) Graham, B. S., F. Niu, and J. L. Powell (2019): “Kernel Density Estimation for Undirected Dyadic Data,” .
Graham et al. (2021) ——— (2021): “Minimax Risk and Uniform Convergence Rates for Nonparametric Dyadic Regression,” .
Hall (1984) Hall, P. (1984): “Central limit theorem for integrated square error of multivariate nonparametric density estimators,” Journal of Multivariate Analysis, 14, 1–16.
Head and Mayer (2014) Head, K. and T. Mayer (2014): “Gravity Equations: Workhorse, Toolkit, and Cookbook,” Handbook of International Economics, 4, 131–195.
Heckman (1979) Heckman, J. (1979): “Sample Selection Bias as a Specification Error,” Econometrica, 47, 153–161.
Helpman et al. (2008) Helpman, E., M. Melitz, and Y. Rubinstein (2008): “Estimating trade flows: Trading partners and trading volumes,” Quarterly Journal of Economics, 123, 441–487.
Hoeffding (1961) Hoeffding, W. (1961): “The strong law of large numbers for U-statistics,” .
Horowitz (1992) Horowitz, J. L. (1992): “A Smoothed Maximum Score Estimator for the Binary Response Model,” Econometrica, 60.
Jiménez et al. (2014) Jiménez, G., S. Ongena, J.-L. Peydró, and J. Saurina (2014): “Hazardous Times for Monetary Policy: What Do Twenty-Three Million Bank Loans Say About the Effects of Monetary Policy on Credit Risk-Taking?” Econometrica, 82, 463–505.
Jochmans (2023) Jochmans, K. (2023): “Peer effects and endogenous social interactions,” Journal of Econometrics, 235, 1203–1214.
Johnsson and Moon (2021) Johnsson, I. and H. R. Moon (2021): “Estimation of Peer Effects in Endogenous Social Networks: Control Function Approach,” The Review of Economics and Statistics, 103, 328–345.
Kojevnikov et al. (2021) Kojevnikov, D., V. Marmer, and K. Song (2021): “Limit theorems for network dependent random variables,” Journal of Econometrics, 222, 882–908.
Kyriazidou (1997) Kyriazidou, E. (1997): “Estimation of a Panel Data Sample Selection Model,” Econometrica, 65.
Manski (1987) Manski, C. F. (1987): “Semiparametric Analysis of Random Effects Linear Models from Binary Panel Data,” Econometrica, 55, 357–362.
Menzel (2021) Menzel, K. (2021): “Bootstrap With Cluster-Dependence in Two or More Dimensions,” Econometrica, 89, 2143–2188.
Monte et al. (2018) Monte, F., S. J. Redding, and E. Rossi-Hansberg (2018): “Commuting, migration, and local employment elasticities,” American Economic Review, 108.
Moretti and Wilson (2017) Moretti, E. and D. J. Wilson (2017): “The Effect of State Taxes on the Geographical Location of Top Earners: Evidence from Star Scientists,” American Economic Review, 107, 1858–1903.
Silva and Tenreyro (2006) Silva, J. M. S. and S. Tenreyro (2006): “The log of gravity,” Review of Economics and Statistics, 88.
Tabord-Meehan (2019) Tabord-Meehan, M. (2019): “Inference With Dyadic Data: Asymptotic Behavior of the Dyadic-Robust t-Statistic,” Journal of Business and Economic Statistics, 37.
White (2001) White, H. (2001): Asymptotic Theory for Econometricians, Academic Press.
Zeleneev (2020) Zeleneev, A. (2020): “Identification and Estimation of Network Models with Nonparametric Unobserved Heterogeneity,” https://www.princeton.edu/~zeleneev/azeleneev_jmp.pdf.

Appendix A. Proofs

Proof of Theorem 1

Proof.

First, consider the infeasible version of $\hat{\beta}_{n}$ , where $\hat{\gamma}_{n}$ is replaced by the true $\gamma$ :

\displaystyle\tilde{\beta}_{n}=\beta+S_{WW}^{-1}S_{W\lambda}+S_{WW}^{-1}S_{W\nu},

where $S_{WW},S_{W\lambda},$ and $S_{W\nu}$ are the same as $\hat{S}_{WW},\hat{S}_{W\lambda},$ and $\hat{S}_{W\nu}$ except that $\hat{\gamma}_{n}$ is replaced by $\gamma$ . We use the following lemmas:

Lemma 1.

Suppose Assumptions 1-9 hold. Then,

\displaystyle S_{WW}\to_{p}\Sigma_{WW},

as $n\to\infty$ .

Lemma 2.

Suppose Assumptions 1-9 hold. Fix some $h\in(0,\infty)$ . If $Nh_{n}^{2k+3}\to h$ , then

\displaystyle\sqrt{Nh_{n}}S_{W\lambda}\to_{p}\sqrt{h}\Sigma_{W\lambda},

as $n\to\infty$ . If $Nh_{n}^{2k+3}\to\infty$ and $nh_{n}^{2k+2}\to\infty$ , then

\displaystyle h_{n}^{-(k+1)}S_{W\lambda}\to_{p}\Sigma_{W\lambda},

as $n\to\infty$ .

Lemma 3.

Suppose Assumptions 1-9 hold. Fix an arbitrary nonzero vector $c\in\mathbb{R}^{q_{w}}$ and some constant $h\in(0,\infty]$ . Let $c_{W}=\Sigma_{WW}^{-1}c$ . If $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}>0$ and $Nh_{n}^{2k+3}\to h$ , then

\displaystyle\sqrt{n}c^{\prime}S_{WW}^{-1}S_{W\nu}\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}),

as $n\to\infty$ . If $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ and $Nh_{n}^{2k+3}\to h$ , then

\displaystyle\sqrt{Nh_{n}}c^{\prime}S_{WW}^{-1}S_{W\nu}\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}),

as $n\to\infty$ .

By combining Lemmas 1-3, the statement of Theorem 1 follows for $\tilde{\beta}_{n}$ . The following lemmas are used to show the negligibility of $\hat{\beta}_{n}-\tilde{\beta}_{n}$ :

Lemma 4.

Suppose Assumptions 1-10 hold. Fix some constant $h\in(0,\infty]$ . If $Nh_{n}^{2k+3}\to h$ , then,

\displaystyle\hat{S}_{WW}=S_{WW}+o_{p}(1).

Lemma 5.

Suppose Assumptions 1-10 hold. Fix some constant $h\in(0,\infty]$ . If $Nh_{n}^{2k+3}\to h$ , then

\displaystyle\hat{S}_{W\lambda}=S_{W\lambda}+o_{p}\left(\frac{1}{\sqrt{Nh_{n}}}\right).

Lemma 6.

Suppose Assumptions 1-10 hold. Fix some constant $h\in(0,\infty]$ . If $Nh_{n}^{2k+3}\to h$ , then

\displaystyle\hat{S}_{W\nu}=S_{W\nu}+o_{p}\left(\frac{1}{\sqrt{Nh_{n}}}\right).

By combining Lemmas 4-6, we have

\displaystyle\hat{\beta}_{n}-\beta=\tilde{\beta}_{n}-\beta+o_{p}\left(\frac{1}{\sqrt{Nh_{n}}}\right).

Thus, the normalization $r_{n}\in\{\sqrt{n},\sqrt{Nh_{n}},h_{n}^{-(k+1)}\}$ corresponding to each case results in

\displaystyle r_{n}(\hat{\beta}_{n}-\beta)=r_{n}(\tilde{\beta}_{n}-\beta)+o_{p}(1).

Since $\tilde{\beta}_{n}$ satisfies the statement of Theorem 1, this completes the proof. ∎

Proof of Proposition 1

Proof.

We show the claim by the following steps.

Step 1: $\hat{\Sigma}_{W\nu,2}\to_{p}\Sigma_{W\nu,2}$

By expanding $K^{2}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}/h_{n})$ around $\Delta R_{ij}^{\prime}\gamma$ , we get

\displaystyle K^{2}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}/h_{n})=K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})+\frac{2\Delta R_{ij}^{\prime}(\hat{\gamma}_{n}-\gamma)}{h_{n}}K^{\prime}(c_{ij,n}^{*}/h_{n})K(c_{ij,n}^{*}/h_{n}),

where $c_{ij,n}^{*}$ is between $\Delta R_{ij}^{\prime}\gamma$ and $\Delta R_{ij}^{\prime}\hat{\gamma}_{n}$ . Then,

	$\displaystyle\hat{\Sigma}_{W\nu,2}$	$\displaystyle=\underbrace{\frac{1}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta\hat{\varepsilon}^{2}_{ij}}_{D_{p1,1}}$
		$\displaystyle+\underbrace{\frac{2}{Nh_{n}^{2}}\sum_{i<j}\Delta R_{ij}^{\prime}(\hat{\gamma}_{n}-\gamma)K^{\prime}(c_{ij,n}^{*}/h_{n})K(c_{ij,n}^{*}/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta\hat{\varepsilon}^{2}_{ij}}_{D_{p1,2}}.$

Sub-Step 1: $D_{p1,1}\to_{p}\Sigma_{W\nu,2}$

Observe that

	$\displaystyle\Delta\hat{\varepsilon}_{ij}^{2}-\nu_{ij}^{2}$	$\displaystyle=\left(\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n})\right)^{2}+\lambda_{ij}^{2}+2\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n})\lambda_{ij}$
		$\displaystyle+2\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n})\nu_{ij}+2\lambda_{ij}\nu_{ij}.$

Thus,

	$\displaystyle D_{p1,1}$	$\displaystyle=\frac{1}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\nu^{2}_{ij}$
		$\displaystyle+\frac{1}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\left(\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n})\right)^{2}$
		$\displaystyle+\frac{1}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\lambda_{ij}^{2}$
		$\displaystyle+\frac{2}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n})\lambda_{ij}$
		$\displaystyle+\frac{2}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n})\nu_{ij}$
		$\displaystyle+\frac{2}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\gamma/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\lambda_{ij}\nu_{ij}.$

We denote the six terms on the right-hand side, one per row, by $D_{p1,1}^{i}$ for $i=1,...,6$ .

The first term $D_{p1,1}^{1}$ converges to $\Sigma_{W\nu,2}$ . Its expectation coincides with $\Sigma_{W\nu,2}$ in the limit as

	$\displaystyle E[D_{p1,1}^{1}]$	$\displaystyle=\frac{1}{h_{n}}\int E[d_{121}d_{122}\Delta W_{12}\Delta W_{12}^{\prime}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r]K^{2}(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=\int E[d_{121}d_{122}\Delta W_{12}\Delta W_{12}^{\prime}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]K^{2}(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=\Sigma_{W\nu,2}+o(1),$

where the last line holds by the dominated convergence theorem under Assumptions 4, 6, and 8. For the variance, denoting each summand by $D_{p1,1,ij}^{1}$ and for any vector $a$ with $\|a\|=1$ , we have

	$\displaystyle Var[\\|D_{p1,1}^{1}\\|]$	$\displaystyle\leq\frac{1}{Nh_{n}^{2}}E\left[\\|D_{p1,1,12}^{1}\\|^{2}\right]$
		$\displaystyle+\frac{2(n-2)}{Nh_{n}^{2}}E[\\|D_{p1,1,12}^{1}\\|\times\\|D_{p1,1,13}^{1}\\|].$

The first term in the right hand side is $O(1/(Nh_{n}))$ because,

	$\displaystyle E\left[\\|D_{p1,1,12}^{1}\\|^{2}\right]$	$\displaystyle\leq\int E[\\|\Delta W_{12}\\|^{4}\nu_{12}^{4}\|\Delta R_{12}^{\prime}\gamma=r]K^{2}(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=h_{n}\int E[\\|\Delta W_{12}\\|^{4}\nu_{12}^{4}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]K^{2}(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O(h_{n}),$

where the last line holds from Assumptions 4, 6, and 8. The second term on the right-hand side is $O(1/n)$ because

	$\displaystyle E[\\|D_{p1,1,12}^{1}\\|\times\\|D_{p1,1,13}^{1}\\|]$	$\displaystyle\leq E\Big{[}\int E[\\|\Delta W_{12}\\|^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|^{2}\nu_{13}^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2},\xi_{1},U_{1}]$
		$\displaystyle\times K^{2}(r_{1}/h_{n})K^{2}(r_{2}/h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{1})f_{R\gamma\|\xi_{1},U_{1}}(r_{2})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=h_{n}^{2}E\Big{[}\int E[\\|\Delta W_{12}\\|^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|^{2}\nu_{13}^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times K^{2}(r_{1})K^{2}(r_{2})f_{R\gamma\|\xi_{1},U_{1}}(r_{1}h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{2}h_{n})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=O(h_{n}^{2}),$

where the last line follows from Assumptions 4, 6, and 8. Thus,

\displaystyle Var[\|D_{p1,1}^{1}\|]=O\left(\frac{1}{Nh_{n}}\right)+O\left(\frac{1}{n}\right)=o(1).

This implies that $D_{p1,1}^{1}\to_{p}\Sigma_{W\nu,2}$ as $n\to\infty$ .

The second term $D_{p1,1}^{2}$ converges to $0$ . Observe that, as $K$ is bounded by Assumption 8, for some absolute constant $C>0$ ,

\displaystyle\|D_{p1,1}^{2}\|\leq\frac{\|\beta-\hat{\beta}_{n}\|^{2}}{h_{n}}\times\frac{C}{N}\sum_{i<j}\|\Delta W_{ij}\|^{4}.

Since $E[\|\Delta W_{ij}\|^{4}]<\infty$ by Assumption 7, we can apply the law of large numbers for U-statistics (Hoeffding (1961)) to $C/N\sum_{i<j}\|\Delta W_{ij}\|^{4}$ , which is therefore $O_{p}(1)$ . Also, since $\|\beta-\hat{\beta}_{n}\|=O_{p}(1/\sqrt{n})$ (which is the worst-case rate for the specified $h_{n}$ by Theorem 1), we have $\|\beta-\hat{\beta}_{n}\|^{2}/h_{n}=O_{p}(1/(nh_{n}))=o_{p}(1)$ as $nh_{n}\sim n\times n^{-2/(2k+3)}=n^{(2k+1)/(2k+3)}$ diverges. Thus,

\displaystyle\|D_{p1,1}^{2}\|=o_{p}(1)\times O_{p}(1)=o_{p}(1),

and $D_{p1,1}^{2}\to_{p}0$ as $n\to\infty$ .

The third term $D_{p1,1}^{3}$ converges to $0$ . Observe that,

	$\displaystyle E[\\|D_{p1,1}^{3}\\|]$	$\displaystyle\leq\int E[\\|\Delta W_{12}\\|^{2}\Lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r]r^{2}K^{2}(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=h_{n}^{3}\int E[\\|\Delta W_{12}\\|^{2}\Lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]r^{2}K^{2}(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O(h_{n}^{3}),$

where the last line follows from Assumptions 4, 6, and 8. Thus, $E[\|D_{p1,1}^{3}\|]=o(1)$ . Observe that, by writing each summand of $D_{p1,1}^{3}$ as $D_{p1,1,ij}^{3}$ ,

\displaystyle Var[\|D_{p1,1}^{3}\|]\leq\frac{1}{Nh_{n}^{2}}E[\|D_{p1,1,12}^{3}\|^{2}]+\frac{2(n-2)}{Nh_{n}^{2}}E[\|D_{p1,1,12}^{3}\|\times\|D_{p1,1,13}^{3}\|].

The first term on the right-hand side is $O(h_{n}^{3}/N)$ because

	$\displaystyle E[\\|D_{p1,1,12}^{3}\\|^{2}]$	$\displaystyle\leq\int E[\\|\Delta W_{12}\\|^{4}\Lambda_{12}^{4}\|\Delta R_{12}^{\prime}\gamma=r]r^{4}K^{2}(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=h_{n}^{5}\int E[\\|\Delta W_{12}\\|^{4}\Lambda_{12}^{4}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]r^{4}K^{2}(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O(h_{n}^{5}),$

where the last line holds from Assumptions 4, 6, and 8. The second term on the right hand side is $O(h_{n}^{4}/n)$ because

	$\displaystyle E[\\|D_{p1,1,12}^{3}\\|\times\\|D_{p1,1,13}^{3}\\|]$	$\displaystyle\leq E\Big{[}\int E[\\|\Delta W_{12}\\|^{2}\Lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|^{2}\Lambda_{13}^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2},\xi_{1},U_{1}]$
		$\displaystyle\times r_{1}^{2}r_{2}^{2}K^{2}(r_{1}/h_{n})K^{2}(r_{2}/h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{1})f_{R\gamma\|\xi_{1},U_{1}}(r_{2})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=h_{n}^{6}E\Big{[}\int E[\\|\Delta W_{12}\\|^{2}\Lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|^{2}\Lambda_{13}^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times r_{1}^{2}r_{2}^{2}K^{2}(r_{1})K^{2}(r_{2})f_{R\gamma\|\xi_{1},U_{1}}(r_{1}h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{2}h_{n})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=O(h_{n}^{6}),$

where the last line follows from Assumptions 4, 6, and 8. Hence, we have

\displaystyle Var[\|D_{p1,1}^{3}\|]=O\left(\frac{h_{n}^{3}}{N}\right)+O\left(\frac{h_{n}^{4}}{n}\right)=o(1).

This implies that $D_{p1,1}^{3}\to_{p}0$ as $n\to\infty$ .

The fourth term $D_{p1,1}^{4}$ converges to $0$ . Observe that, since $K$ is bounded by Assumption 8 and $\|\gamma\|<\infty$ , for some constant $C>0$ ,

\displaystyle\|D_{p1,1}^{4}\|\leq\frac{C\|\beta-\hat{\beta}_{n}\|}{h_{n}}\times\frac{1}{N}\sum_{i<j}\|\Delta W_{ij}\|^{3}\|\Delta R_{ij}\||\Lambda_{ij}|.

The sum converges to the expectation of its summand by the law of large numbers for U-statistics (Hoeffding (1961)), since $E[\|\Delta W_{12}\|^{3}\|\Delta R_{12}\||\Lambda_{12}|]<\infty$ by the Cauchy-Schwarz inequality and Assumption 7. Thus, this part is $O_{p}(1)$ . Also note that

\displaystyle\frac{\|\beta-\hat{\beta}_{n}\|}{h_{n}}=O_{p}\left(\frac{1}{nh_{n}}\right)=o_{p}(1),

by Assumption 9. Hence,

\displaystyle\|D_{p1,1}^{4}\|=o_{p}(1).

This shows that $D_{p1,1}^{4}\to_{p}0$ as $n\to\infty$ .

The fifth term $D_{p1,1}^{5}$ converges to $0$ . Observe that, since $K$ is bounded by Assumption 8, for some constant $C>0$ ,

\displaystyle\|D_{p1,1}^{5}\|\leq\frac{C}{N}\sum_{i<j}\|\Delta W_{ij}\|^{3}|\nu_{ij}|\times\frac{\|\beta-\hat{\beta}_{n}\|}{h_{n}}.

The sum part is $O_{p}(1)$ because

\displaystyle E[\|\Delta W_{12}\|^{3}|\nu_{12}|]<\infty,

by Assumption 7 and

\displaystyle Var\left[\frac{1}{N}\sum_{i<j}\|\Delta W_{ij}\|^{3}|\nu_{ij}|\right]\leq\frac{E[\|\Delta W_{12}\|^{6}\nu_{12}^{2}]}{N}+\frac{2(n-2)}{N}E[\|\Delta W_{12}\|^{3}\|\Delta W_{13}\|^{3}|\nu_{12}||\nu_{13}|]=o(1),

as these two moments are bounded by Assumption 7. Thus,

\displaystyle\|D_{p1,1}^{5}\|=o_{p}(1),

by the previous calculation for the term involving $\hat{\beta}_{n}-\beta$ . This shows that $D_{p1,1}^{5}\to_{p}0$ as $n\to\infty$ .

The sixth term $D_{p1,1}^{6}$ converges to $0$ . Its expectation is exactly $0$ by the conditional mean independence of $\nu_{ij}$ . Also, by repeating the calculation for $Var[\|D_{p1,1}^{1}\|]$ (with $\nu_{ij}^{2}$ replaced by $\lambda_{ij}\nu_{ij}$ ), we have

\displaystyle Var[\|D_{p1,1}^{6}\|]=O\left(\frac{1}{N}\right)+O\left(\frac{h_{n}^{2}}{n}\right)=o(1).

This shows that $D_{p1,1}^{6}\to_{p}0$ as $n\to\infty$ .

Sub-Step 2: $D_{p1,2}\to_{p}0$

As before, we can decompose $D_{p1,2}$ into $D_{p1,2}^{i}$ for $i=1,...,6$ . Unlike in $D_{p1,1}$ , we can no longer have the moments scaled by $h_{n}^{\alpha}$ because the middle values $c_{ij,n}^{*}$ are in the kernels. Thus, by the previous calculation for $D_{p1,1}$ , the $D_{p1,2}^{i}$ that involves $\nu_{ij}^{2}$ , $\lambda_{ij}^{2}$ , or $\lambda_{ij}\nu_{ij}$ will have the slowest convergence rate. So, it suffices to show that those terms converge to $0$ in probability.

Consider the term involving $\nu_{ij}^{2}$ , namely $D_{p1,2}^{1}$ , which is given by

\displaystyle D_{p1,2}^{1}=\frac{2}{Nh_{n}^{2}}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta R_{ij}^{\prime}\nu_{ij}^{2}K^{\prime}(c_{ij,n}^{*}/h_{n})K(c_{ij,n}^{*}/h_{n})(\hat{\gamma}_{n}-\gamma).

Observe that, for some constant $C>0$ ,

\displaystyle\|D_{p1,2}^{1}\|\leq\frac{C}{N}\sum_{i<j}\|\Delta W_{ij}\|^{2}\|\Delta R_{ij}\|\nu_{ij}^{2}\times\frac{\|\hat{\gamma}_{n}-\gamma\|}{h_{n}^{2}}.

The sum part is $O_{p}(1)$ because

\displaystyle E[\|\Delta W_{12}\|^{2}\|\Delta R_{12}\|\nu_{12}^{2}]<\infty,

by Assumption 7, and

	$\displaystyle Var\left[\frac{1}{N}\sum_{i<j}\\|\Delta W_{ij}\\|^{2}\\|\Delta R_{ij}\\|\nu_{ij}^{2}\right]$
	$\displaystyle\leq\frac{E[\\|\Delta W_{12}\\|^{4}\\|\Delta R_{12}\\|^{2}\nu_{12}^{4}]}{N}+\frac{2(n-2)}{N}E[\\|\Delta W_{12}\\|^{2}\\|\Delta W_{13}\\|^{2}\\|\Delta R_{12}\\|\\|\Delta R_{13}\\|\nu_{12}^{2}\nu_{13}^{2}]$
	$\displaystyle=o(1),$

as these moments are bounded by Assumption 7. The term involving $\hat{\gamma}_{n}$ is $o_{p}(1)$ because

\displaystyle\frac{\|\hat{\gamma}_{n}-\gamma\|}{h_{n}^{2}}=\frac{\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|}{\sqrt{Nh_{n}^{5}}}=o_{p}(1),

by Assumption 10 and because $Nh_{n}^{5}=Nh_{n}^{2k+3}\times h_{n}^{-2k+2}$ diverges for $k\geq 2$ . Hence,

\displaystyle\|D_{p1,2}^{1}\|=O_{p}(1)\times o_{p}(1)=o_{p}(1).

This shows that $D_{p1,2}^{1}\to_{p}0$ as $n\to\infty$ . Thus, by the above argument, it follows that $D_{p1,2}\to_{p}0$ as $n\to\infty$ .

These two sub-steps conclude that

\displaystyle\hat{\Sigma}_{W\nu,2}\to_{p}\Sigma_{W\nu,2},

as $n\to\infty$ . This finishes Step 1.
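For concreteness, the estimator analyzed in Step 1, $\hat{\Sigma}_{W\nu,2}=\frac{1}{Nh_{n}}\sum_{i<j}K^{2}(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}/h_{n})d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta\hat{\varepsilon}_{ij}^{2}$ , can be sketched as below. This is only an illustration, assuming a biweight kernel and dyad-level inputs already stacked into arrays; the function and argument names are hypothetical.

```python
import numpy as np


def biweight_kernel(u):
    """Biweight kernel: (15/16) * (1 - u^2)^2 on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)


def sigma_hat_wnu2(dW, index_diff, selected, resid, h_n):
    """Kernel-weighted second-moment matrix over dyads.

    dW         : (N, q) array, rows are Delta W_ij
    index_diff : (N,) array, Delta R_ij' gamma_hat
    selected   : (N,) 0/1 array, d_ij1 * d_ij2
    resid      : (N,) array, residuals Delta eps_hat_ij
    h_n        : bandwidth
    """
    dW = np.asarray(dW, dtype=float)
    k_sq = biweight_kernel(np.asarray(index_diff, dtype=float) / h_n) ** 2
    weights = k_sq * np.asarray(selected, dtype=float) * np.asarray(resid, dtype=float) ** 2
    n_dyads = dW.shape[0]
    # Sum of weight_ij * dW_ij dW_ij', scaled by 1 / (N h_n).
    return (dW * weights[:, None]).T @ dW / (n_dyads * h_n)


# Hypothetical single-dyad input to show the call signature.
example = sigma_hat_wnu2(
    dW=[[1.0, 2.0]], index_diff=[0.0], selected=[1], resid=[2.0], h_n=0.5
)
```

Dyads with $|\Delta R_{ij}^{\prime}\hat{\gamma}_{n}|>h_{n}$ receive zero weight under a compactly supported kernel, so only near-matched dyads contribute.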

Step 2: $\hat{\Sigma}_{W\nu,1}\to_{p}\Sigma_{W\nu,1}$

Define

\displaystyle S_{ij}\equiv 2d_{ij1}d_{ij2}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)\Delta W_{ij}\Delta\hat{\varepsilon}_{ij},

and let $\tilde{\Sigma}_{W\nu,1}$ be $\hat{\Sigma}_{W\nu,1}$ with $\hat{S}_{ij}$ replaced by $S_{ij}$ . First, we use the following result:

Lemma 7.

Suppose that Assumptions 1-10 hold. If $h_{n}=hN^{-1/(2k+3)}$ for some $h>0$ , we have

\displaystyle\tilde{\Sigma}_{W\nu,1}\to_{p}\Sigma_{W\nu,1},

as $n\to\infty$ .

Then, it is enough to show that $\hat{\Sigma}_{W\nu,1}$ is well approximated by $\tilde{\Sigma}_{W\nu,1}$ :

Lemma 8.

Suppose that Assumptions 1-10 hold. If $h_{n}=hN^{-1/(2k+3)}$ for some $h>0$ , we have

\displaystyle\|\hat{\Sigma}_{W\nu,1}-\tilde{\Sigma}_{W\nu,1}\|=o_{p}(1).

Lemmas 7 and 8 imply that

\displaystyle\|\hat{\Sigma}_{W\nu,1}-\Sigma_{W\nu,1}\|\leq\|\hat{\Sigma}_{W\nu,1}-\tilde{\Sigma}_{W\nu,1}\|+\|\tilde{\Sigma}_{W\nu,1}-\Sigma_{W\nu,1}\|=o_{p}(1),

which shows the consistency of $\hat{\Sigma}_{W\nu,1}$ for $\Sigma_{W\nu,1}$ . This finishes Step 2.

Step 3: $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ case

Observe that, by some algebra,

	$\displaystyle nh_{n}c^{\prime}\hat{S}_{WW}^{-1}\hat{\Sigma}_{W\nu,1}\hat{S}_{WW}^{-1}c$
	$\displaystyle=nh_{n}c^{\prime}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})\hat{\Sigma}_{W\nu,1}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})c$
	$\displaystyle+nh_{n}c^{\prime}\Sigma_{WW}^{-1}\hat{\Sigma}_{W\nu,1}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})c+nh_{n}c^{\prime}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})\hat{\Sigma}_{W\nu,1}\Sigma_{WW}^{-1}c$
	$\displaystyle+nh_{n}c^{\prime}\Sigma_{WW}^{-1}\hat{\Sigma}_{W\nu,1}\Sigma_{WW}^{-1}c.$
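The algebra behind a decomposition of this type is the identity $AMA=(A-B)M(A-B)+BM(A-B)+(A-B)MB+BMB$ , with $A$ standing in for $\hat{S}_{WW}^{-1}$ , $B$ for $\Sigma_{WW}^{-1}$ , and $M$ for $\hat{\Sigma}_{W\nu,1}$ . The following sketch checks it numerically on randomly generated matrices; the matrices are arbitrary stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
q = 3


def random_spd(size):
    """Random symmetric positive definite matrix (illustrative stand-in)."""
    g = rng.normal(size=(size, size))
    return g @ g.T + size * np.eye(size)


A = random_spd(q)  # plays the role of S_hat_WW^{-1}
B = random_spd(q)  # plays the role of Sigma_WW^{-1}
M = random_spd(q)  # plays the role of Sigma_hat_{W nu,1}
c = rng.normal(size=q)

lhs = c @ A @ M @ A @ c
rhs = (
    c @ (A - B) @ M @ (A - B) @ c
    + c @ B @ M @ (A - B) @ c
    + c @ (A - B) @ M @ B @ c
    + c @ B @ M @ B @ c
)
identity_holds = bool(np.isclose(lhs, rhs))
```

Each cross term carries exactly one factor of $A-B$ , which is why the rate of $\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1}$ enters once in the middle terms and twice in the first.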

We first show the negligibility of the first term on the right-hand side of this decomposition. By the proof of Lemma 1, we have that

\displaystyle\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1}=o_{p}(n^{-\alpha/2})

for any $\alpha\in(0,1)$ . Thus,

\displaystyle nh_{n}c^{\prime}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})\hat{\Sigma}_{W\nu,1}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})c=n^{1-\alpha}h_{n}o_{p}(1)\hat{\Sigma}_{W\nu,1}o_{p}(1)=o_{p}(1),

for $\alpha\in[(2k+1)/(2k+3),1)$ as $n^{1-\alpha}h_{n}=n^{(2k+1-\alpha(2k+3))/(2k+3)}=o(1)$ and $\hat{\Sigma}_{W\nu,1}=O_{p}(1)$ by the above Step 2.

The remaining terms are shown to be negligible by applying the following lemmas:

Lemma 9.

Suppose that Assumptions 1-10 hold. If $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ and $h_{n}=hN^{-1/(2k+3)}$ for some $h\in(0,\infty)$ , we have

	$\displaystyle n^{1-\alpha/2}h_{n}\hat{\Sigma}_{W\nu,1}c_{W}$	$\displaystyle\to_{p}0,$
	$\displaystyle n^{1-\alpha/2}h_{n}c_{W}^{\prime}\hat{\Sigma}_{W\nu,1}$	$\displaystyle\to_{p}0,$

as $n\to\infty$ for $\alpha\in[6/(2k+3),1)$ .

Lemma 10.

Suppose that Assumptions 1-10 hold. If $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ and $h_{n}=hN^{-1/(2k+3)}$ for some $h\in(0,\infty)$ , we have

\displaystyle nh_{n}c_{W}^{\prime}\hat{\Sigma}_{W\nu,1}c_{W}\to_{p}0,

as $n\to\infty$ .

Then, by Lemmas 9 and 10, the last two lines are shown to be

	$\displaystyle nh_{n}c^{\prime}\Sigma_{WW}^{-1}\hat{\Sigma}_{W\nu,1}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})c+nh_{n}c^{\prime}(\hat{S}_{WW}^{-1}-\Sigma_{WW}^{-1})\hat{\Sigma}_{W\nu,1}\Sigma_{WW}^{-1}c$
	$\displaystyle+nh_{n}c^{\prime}\Sigma_{WW}^{-1}\hat{\Sigma}_{W\nu,1}\Sigma_{WW}^{-1}c$
	$\displaystyle=n^{1-\alpha/2}h_{n}c^{\prime}\Sigma_{WW}^{-1}\hat{\Sigma}_{W\nu,1}o_{p}(1)+c^{\prime}o_{p}(1)n^{1-\alpha/2}h_{n}\hat{\Sigma}_{W\nu,1}\Sigma_{WW}^{-1}c+o_{p}(1)$
	$\displaystyle=o_{p}(1).$

Hence,

\displaystyle nh_{n}c^{\prime}\hat{S}_{WW}^{-1}\hat{\Sigma}_{W\nu,1}\hat{S}_{WW}^{-1}c=o_{p}(1).

Steps 1-3 finish the proof of Proposition 1. ∎

Proof of Proposition 2

Proof.

Since

\displaystyle h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n,\delta}-\hat{\beta}_{n})=h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n,\delta}-\beta)-h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n}-\beta),

where the first term on the right hand side converges to $\Sigma_{WW}^{-1}\Sigma_{W\lambda}$ by Theorem 1 as $Nh_{n,\delta}^{2k+3}\to\infty$ , it suffices to show that

\displaystyle h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n}-\beta)=o_{p}(1).

Take an arbitrary non-zero vector $c\in\mathbb{R}^{q_{w}}$ . Since $\hat{\beta}_{n}$ is calculated based on $h_{n}=hN^{-1/(2k+3)}$ such that $Nh_{n}^{2k+3}\to h$ , by Theorem 1,

\displaystyle h_{n,\delta}^{-(k+1)}c^{\prime}(\hat{\beta}_{n}-\beta)=\frac{1}{\sqrt{nh_{n,\delta}^{2(k+1)}}}\times\underbrace{\sqrt{n}c^{\prime}(\hat{\beta}_{n}-\beta)}_{=O_{p}(1)}=o_{p}(1)

since

\displaystyle nh_{n,\delta}^{2(k+1)}\sim n\times n^{-4\delta(k+1)/(2k+3)}=n^{\frac{2k+3-4\delta(k+1)}{2k+3}}

diverges for $\delta\in(0,\frac{2k+3}{4k+4})$ , which is assumed by the hypothesis. Since $c$ is arbitrary, $h_{n,\delta}^{-(k+1)}(\hat{\beta}_{n}-\beta)=o_{p}(1)$ , which completes the proof. ∎
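The exponent condition at the end of this proof can be checked with exact arithmetic. The sketch below evaluates $(2k+3-4\delta(k+1))/(2k+3)$ and the upper bound $(2k+3)/(4k+4)$ for the values $k=2$ and $\delta=0.4$ used in the empirical illustration; the function names are illustrative.

```python
from fractions import Fraction


def divergence_exponent(k, delta):
    """Exponent a such that n * h_{n,delta}^{2(k+1)} grows like n^a."""
    k = Fraction(k)
    delta = Fraction(delta)
    return (2 * k + 3 - 4 * delta * (k + 1)) / (2 * k + 3)


def delta_upper_bound(k):
    """Largest delta (exclusive) for which the exponent stays positive."""
    return Fraction(2 * k + 3, 4 * k + 4)


exponent = divergence_exponent(2, Fraction(2, 5))  # k = 2, delta = 0.4
bound = delta_upper_bound(2)                       # 7/12 for k = 2
```

With these values the exponent is $11/35>0$ and $\delta=0.4<7/12$ , so the hypothesis of Proposition 2 is satisfied by the choice made in the application.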

Proofs of Lemmas

Proof of Lemma 1

Proof.

Write each summand of $S_{WW}$ as $S_{WW,ij}$ . Since it suffices to show the element-wise convergence of $S_{WW}$ to $\Sigma_{WW}$ , we use a unit vector $e\in\mathbb{R}^{q_{w}}$ with an arbitrary element equal to $1$ and $0$ elsewhere. Observe that

	$\displaystyle E[e^{\prime}S_{WW}e]=E[e^{\prime}S_{WW,ij}e]$	$\displaystyle=\frac{1}{h_{n}}\int E[d_{121}d_{122}e^{\prime}\Delta W_{12}\Delta W_{12}^{\prime}e\|\Delta R_{12}^{\prime}\gamma=r]K(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=\int E[d_{121}d_{122}e^{\prime}\Delta W_{12}\Delta W_{12}^{\prime}e\|\Delta R_{12}^{\prime}\gamma=rh_{n}]K(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=e^{\prime}\Sigma_{WW}e+o(1),$

where the last line holds from the dominated convergence theorem under Assumptions 4, 6 and 8. Since $S_{WW,ij}$ and $S_{WW,kl}$ are independent if $i\neq k,l$ and $j\neq k,l$ by Assumption 1, observe that

\displaystyle Var(e^{\prime}S_{WW}e)=\frac{Var(e^{\prime}S_{WW,12}e)}{N}+\frac{2(n-2)}{N}Cov(e^{\prime}S_{WW,12}e,e^{\prime}S_{WW,13}e).

For the variance, we have

	$\displaystyle Var(e^{\prime}S_{WW,12}e)$	$\displaystyle\leq E[(e^{\prime}S_{WW,12}e)^{2}]$
		$\displaystyle\leq\frac{1}{h_{n}^{2}}\int E[\\|\Delta W_{12}\\|^{4}\|\Delta R_{12}^{\prime}\gamma=r]K^{2}(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=\frac{1}{h_{n}}\int E[\\|\Delta W_{12}\\|^{4}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]K^{2}(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O\left(\frac{1}{h_{n}}\right),$

where the last line holds from Assumptions 4, 6 and 8. For the covariance, by the conditional independence of $\Delta W_{12}$ and $\Delta W_{13}$ , we have

	$\displaystyle Cov(e^{\prime}S_{WW,12}e,e^{\prime}S_{WW,13}e)$
	$\displaystyle\leq E[e^{\prime}S_{WW,12}e\times e^{\prime}S_{WW,13}e]$
	$\displaystyle\leq\frac{1}{h_{n}}\int E[\\|\Delta W_{12}\\|^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1},\xi_{1},U_{1}]\times E[\\|\Delta W_{13}\\|^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2},\xi_{1},U_{1}]$
	$\displaystyle\times\|K(r_{1}/h_{n})\|\|K(r_{2}/h_{n})\|f_{R\gamma\|\xi_{1},U_{1}}(r_{1})f_{R\gamma\|\xi_{1},U_{1}}(r_{2})dr_{1}dr_{2}$
	$\displaystyle=h_{n}\int E[\\|\Delta W_{12}\\|^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\xi_{1},U_{1}]\times E[\\|\Delta W_{13}\\|^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2}h_{n},\xi_{1},U_{1}]$
	$\displaystyle\times\|K(r_{1})\|\|K(r_{2})\|f_{R\gamma\|\xi_{1},U_{1}}(r_{1}h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{2}h_{n})dr_{1}dr_{2}$
	$\displaystyle=O(h_{n}),$

where the last line holds from Assumptions 4, 6 and 8. Thus,

\displaystyle Var(e^{\prime}S_{WW}e)=O\left(\frac{1}{Nh_{n}}\right)+O\left(\frac{h_{n}}{n}\right)=o(1).

By Chebyshev’s inequality, $e^{\prime}S_{WW}e\to_{p}e^{\prime}\Sigma_{WW}e$ as $n\to\infty$ . Since $e$ is arbitrary, this completes the proof.

∎
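The change of variables and limiting step used for the expectation in this proof can be illustrated numerically. The sketch below evaluates $\frac{1}{h}\int m(r)K(r/h)f(r)dr=\int m(sh)K(s)f(sh)ds$ for shrinking $h$ and compares it with $m(0)f(0)$ . The functions $m$ and $f$ are hypothetical smooth stand-ins for the conditional moment and the density of $\Delta R_{12}^{\prime}\gamma$ , and the biweight kernel is assumed.

```python
import numpy as np


def biweight_kernel(s):
    """Biweight kernel: (15/16) * (1 - s^2)^2 on [-1, 1], zero elsewhere."""
    s = np.asarray(s, dtype=float)
    return np.where(np.abs(s) <= 1.0, (15.0 / 16.0) * (1.0 - s**2) ** 2, 0.0)


def smoothed_moment(m, f, h, grid_size=20001):
    """Midpoint-rule value of the integral of m(s h) K(s) f(s h) over [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, grid_size + 1)
    s = 0.5 * (edges[:-1] + edges[1:])
    ds = edges[1] - edges[0]
    return float(np.sum(m(s * h) * biweight_kernel(s) * f(s * h)) * ds)


def m(r):
    """Hypothetical conditional moment function, smooth near zero."""
    return 1.0 + r**2


def f(r):
    """Hypothetical index density: standard normal."""
    return np.exp(-0.5 * r**2) / np.sqrt(2.0 * np.pi)


values = {h: smoothed_moment(m, f, h) for h in (1.0, 0.1, 0.01)}
limit = float(m(0.0) * f(0.0))
```

As $h$ shrinks, the smoothed moment approaches $m(0)f(0)$ , which is the pattern behind the $\Sigma_{WW}+o(1)$ step.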

Proof of Lemma 2

Proof.

Write each summand of $S_{W\lambda}$ as $S_{W\lambda,ij}$ . We use a unit vector $e\in\mathbb{R}^{q_{w}}$ with an arbitrary element being $1$ and $0$ elsewhere. Observe that, for large enough $n$ ,

	$\displaystyle E[e^{\prime}S_{W\lambda}]=E[e^{\prime}S_{W\lambda,ij}]$	$\displaystyle=\frac{1}{h_{n}}\int E[d_{121}d_{122}e^{\prime}\Delta W_{12}\lambda_{12}\|\Delta R_{12}^{\prime}\gamma=r]K(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=h_{n}\int e^{\prime}g(rh_{n})rK(r)dr$
		$\displaystyle=\frac{h_{n}^{k+1}}{k!}\int\left(e^{\prime}\frac{\partial^{k}g(rh_{n})}{\partial r^{k}}+o(1)\right)r^{k+1}K(r)dr$
		$\displaystyle=h_{n}^{k+1}e^{\prime}\Sigma_{W\lambda}+o(h_{n}^{k+1}),$

where the second line holds from $\lambda_{12}=\Lambda_{12}\times\Delta R_{12}^{\prime}\gamma$ , the third line holds from Assumption 8 eliminating $\int s^{i}K(s)$ for $i=1,...,k$ , and the last line holds from the dominated convergence theorem under Assumptions 6 and 8. Observe that

\displaystyle Var[e^{\prime}S_{W\lambda}]=\frac{Var[e^{\prime}S_{W\lambda,12}]}{N}+\frac{2(n-2)}{N}Cov[e^{\prime}S_{W\lambda,12},e^{\prime}S_{W\lambda,13}].

For the variance, we have

	$\displaystyle Var[e^{\prime}S_{W\lambda,12}]$	$\displaystyle\leq E[(e^{\prime}S_{W\lambda,12})^{2}]$
		$\displaystyle\leq\frac{1}{h_{n}^{2}}\int E[\\|\Delta W_{12}\\|^{2}\lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r]K^{2}(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=h_{n}\int E[\\|\Delta W_{12}\\|^{2}\Lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]r^{2}K^{2}(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O(h_{n}),$

where the last line holds from the Cauchy-Schwarz inequality and Assumptions 4, 6, and 8. For the covariance, we have

	$\displaystyle Cov[e^{\prime}S_{W\lambda,12},e^{\prime}S_{W\lambda,13}]$
	$\displaystyle\leq E[e^{\prime}S_{W\lambda,12}\times e^{\prime}S_{W\lambda,13}]$
	$\displaystyle=\frac{1}{h_{n}^{2}}E\Big{[}\int E[d_{121}d_{122}e^{\prime}\Delta W_{12}\Lambda_{12}\|\Delta R_{12}^{\prime}\gamma=r_{1},\xi_{1},U_{1}]E[d_{131}d_{132}e^{\prime}\Delta W_{13}\Lambda_{13}\|\Delta R_{13}^{\prime}\gamma=r_{2},\xi_{1},U_{1}]$
	$\displaystyle\times r_{1}r_{2}K(r_{1}/h_{n})K(r_{2}/h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{1})f_{R\gamma\|\xi_{1},U_{1}}(r_{2})dr_{1}dr_{2}\Big{]}$
	$\displaystyle\leq h_{n}^{2}E\Big{[}\int E[\\|\Delta W_{12}\\|\|\Lambda_{12}\|\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\xi_{1},U_{1}]E[\\|\Delta W_{13}\\|\|\Lambda_{13}\|\|\Delta R_{13}^{\prime}\gamma=r_{2}h_{n},\xi_{1},U_{1}]$
	$\displaystyle\times r_{1}r_{2}K(r_{1})K(r_{2})f_{R\gamma\|\xi_{1},U_{1}}(r_{1}h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{2}h_{n})dr_{1}dr_{2}\Big{]}$
	$\displaystyle=O(h_{n}^{2}),$

where the last line holds from Cauchy-Schwarz and Assumptions 4, 6, and 8. Thus,

\displaystyle Var[e^{\prime}S_{W\lambda}]=O\left(\frac{h_{n}}{N}\right)+O\left(\frac{h_{n}^{2}}{n}\right)=O\left(\frac{h_{n}^{2}}{n}\right),

since $O(h_{n}/N)=O(h_{n}^{2}/n)\times O(1/(nh_{n}))=o(h_{n}^{2}/n)$ under Assumption 9.
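The coefficient $2(n-2)/N$ on the covariance term in the variance decomposition comes from counting ordered pairs of distinct dyads that share exactly one node. The following Python sketch checks that count by brute-force enumeration. It assumes $N=n(n-1)/2$ (the number of dyads $i<j$) and that dyads with no common node are uncorrelated; the function name `covariance_weight` is illustrative and does not come from the paper.

```python
from fractions import Fraction
from itertools import combinations


def covariance_weight(n):
    """Return (weight, N), where weight is the factor multiplying
    Cov[S_12, S_13] in Var[(1/N) * sum_{i<j} S_ij], found by counting."""
    dyads = list(combinations(range(n), 2))  # all dyads i < j
    N = len(dyads)                           # N = n(n-1)/2
    # Ordered pairs of distinct dyads that share exactly one node.
    shared = sum(
        1 for a in dyads for b in dyads if len(set(a) & set(b)) == 1
    )
    return Fraction(shared, N * N), N


for n in (4, 6, 9):
    weight, N = covariance_weight(n)
    # Compare the brute-force weight with the closed form 2(n-2)/N.
    print(n, weight, weight == Fraction(2 * (n - 2), N))
```

Exact rational arithmetic is used so that the comparison with the closed form is not affected by floating-point rounding.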

If $Nh_{n}^{2k+3}\to h$ for some $0<h<\infty$ , note that

\displaystyle\sqrt{Nh_{n}}E[e^{\prime}S_{W\lambda}]=\sqrt{Nh_{n}^{2k+3}}e^{\prime}\Sigma_{W\lambda}+o(\sqrt{Nh_{n}^{2k+3}})\to\sqrt{h}e^{\prime}\Sigma_{W\lambda},

as $n\to\infty$ . Also,

\displaystyle Var[\sqrt{Nh_{n}}e^{\prime}S_{W\lambda}]=O\left(\frac{Nh_{n}^{3}}{n}\right)=O(nh_{n})\times O(h_{n}^{2})=o(1),

by Assumption 9. Thus, by Chebyshev’s inequality, we have

\displaystyle\sqrt{Nh_{n}}e^{\prime}S_{W\lambda}\to_{p}\sqrt{h}e^{\prime}\Sigma_{W\lambda},

as $n\to\infty$ .

If $Nh_{n}^{2k+3}\to\infty$ , note that

\displaystyle h_{n}^{-(k+1)}E[e^{\prime}S_{W\lambda}]=e^{\prime}\Sigma_{W\lambda}+o(1)\to e^{\prime}\Sigma_{W\lambda},

as $n\to\infty$ . Also,

\displaystyle Var[h_{n}^{-(k+1)}e^{\prime}S_{W\lambda}]=O\left(\frac{h_{n}^{2}}{nh_{n}^{2k+2}}\right)=o(1),

as $nh_{n}^{2k+2}\to\infty$ by the hypothesis. Thus, by Chebyshev’s inequality, we have

\displaystyle h_{n}^{-(k+1)}e^{\prime}S_{W\lambda}\to_{p}e^{\prime}\Sigma_{W\lambda},

as $n\to\infty$ . Since $e$ is arbitrary, this completes the proof. ∎
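The two cases in this proof are distinguished by the limit of $Nh_{n}^{2k+3}$. As a rough numerical illustration, the sketch below evaluates this quantity for polynomial bandwidths $h_{n}=n^{-a}$, assuming $N=n(n-1)/2$. The exponent $a$ and the helper name `nh_power` are hypothetical choices made for illustration only, and they are not checked against the paper's other bandwidth restrictions (such as Assumption 9).

```python
def nh_power(n, k, a):
    """Evaluate N * h_n^(2k+3) for h_n = n^(-a), with N = n(n-1)/2."""
    N = n * (n - 1) / 2
    h = n ** (-a)
    return N * h ** (2 * k + 3)


k = 2
for n in (10**2, 10**4, 10**6):
    # a = 2/(2k+3): N * h^(2k+3) = (n-1)/(2n), which tends to 1/2 (finite-limit case).
    boundary = nh_power(n, k, a=2 / (2 * k + 3))
    # a = 1/(2k+3): N * h^(2k+3) = (n-1)/2, which grows without bound (diverging case).
    larger_h = nh_power(n, k, a=1 / (2 * k + 3))
    print(n, boundary, larger_h)
```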

Proof of Lemma 3

Proof.

The proof is done in the following steps:

Step 0: Decomposition

Observe that

\displaystyle c^{\prime}S_{WW}^{-1}S_{W\nu}=c^{\prime}(S_{WW}^{-1}-\Sigma_{WW}^{-1})S_{W\nu}+c^{\prime}\Sigma_{WW}^{-1}S_{W\nu}

In Steps 1-2, we verify the asymptotic normality of $S_{W\nu}$ , with the worst-case convergence rate being $\sqrt{n}$ . Given that result, the first term on the right-hand side is shown to be negligible even when normalized by $\sqrt{Nh_{n}}$ :

	$\displaystyle\sqrt{Nh_{n}}c^{\prime}(S_{WW}^{-1}-\Sigma_{WW}^{-1})S_{W\nu}$	$\displaystyle=\sqrt{Nh_{n}}o_{p}(n^{-\alpha/2})O_{p}(1/\sqrt{n})$
		$\displaystyle=\sqrt{n^{1-\alpha}h_{n}}o_{p}(1)=o_{p}(1)$

because by Lemma 1, $S_{WW}^{-1}-\Sigma_{WW}^{-1}=o_{p}(n^{-\alpha/2})$ for any $\alpha\in(0,1)$ and $n^{1-\alpha}h_{n}=o(1)$ for sufficiently large $\alpha$ under the hypothesis. Thus,

\displaystyle c^{\prime}S_{WW}^{-1}S_{W\nu}=c^{\prime}\Sigma_{WW}^{-1}S_{W\nu}+o_{p}(1/\sqrt{Nh_{n}}),

and it suffices to establish the asymptotic normality of $S_{W\nu}$ . Write $c=c_{W}$ for short. Observe that, since $E[S_{W\nu}]=0$ by the definition of $\nu_{ij}$ , $c^{\prime}S_{W\nu}$ can be decomposed as

\displaystyle c^{\prime}S_{W\nu}=\underbrace{\frac{1}{n}\sum_{i=1}^{n}L_{i,W\nu}}_{L_{W\nu}}+\underbrace{\frac{1}{N}\sum_{i<j}Q_{ij,W\nu}}_{Q_{W\nu}},

where

	$\displaystyle L_{i,W\nu}$	$\displaystyle\equiv 2E[d_{ij1}d_{ij2}c^{\prime}\Delta W_{ij}\nu_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)\|\xi_{i},U_{i}]$
	$\displaystyle Q_{ij,W\nu}$	$\displaystyle=d_{ij1}d_{ij2}c^{\prime}\Delta W_{ij}\nu_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)-E[d_{ij1}d_{ij2}c^{\prime}\Delta W_{ij}\nu_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)\|\xi_{i},U_{i}]$
		$\displaystyle-E[d_{ij1}d_{ij2}c^{\prime}\Delta W_{ij}\nu_{ij}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)\|\xi_{j},U_{j}].$

By design, we have that $Cov[L_{i,W\nu},L_{j,W\nu}]=Cov[L_{i,W\nu},Q_{kl,W\nu}]=Cov[Q_{ij,W\nu},Q_{kl,W\nu}]=0$ for any $i\neq j$ , $k\neq l$ , and $ij\neq kl$ . We show the asymptotic normality of $c^{\prime}S_{W\nu}$ in the following.
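The split into a projection part and a degenerate remainder is an algebraic identity once the conditional means are known. The sketch below illustrates it on a toy dyadic summand, $s_{ij}=x_{i}+x_{j}+x_{i}x_{j}$ with $E[x_{i}]=0$. This summand is an assumption chosen so that $E[s_{ij}|x_{i}]=x_{i}$ is available in closed form; it is not the paper's summand. Here $L_{i}=2x_{i}$ and $Q_{ij}=x_{i}x_{j}$ play the roles of $L_{i,W\nu}$ and $Q_{ij,W\nu}$.

```python
import random
from itertools import combinations


def decompose(x):
    """Return (S, L_bar, Q_bar) for the toy summand s_ij = x_i + x_j + x_i*x_j,
    where E[x_i] = 0 so that E[s_ij | x_i] = x_i."""
    n = len(x)
    pairs = list(combinations(range(n), 2))
    N = len(pairs)
    S = sum(x[i] + x[j] + x[i] * x[j] for i, j in pairs) / N
    L_bar = sum(2 * xi for xi in x) / n              # (1/n) * sum_i L_i
    Q_bar = sum(x[i] * x[j] for i, j in pairs) / N   # (1/N) * sum_{i<j} Q_ij
    return S, L_bar, Q_bar


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(50)]
S, L_bar, Q_bar = decompose(x)
# The difference is zero up to floating-point rounding.
print(abs(S - (L_bar + Q_bar)))
```

The identity holds because each node appears in $n-1$ dyads, so $(1/N)\sum_{i<j}(x_{i}+x_{j})=(2/n)\sum_{i}x_{i}$.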

Step 1: Asymptotic Normality of $L_{W\nu}$

Define $V_{L}$ by

\displaystyle V_{L}=\sqrt{n}L_{W\nu}=\sum_{i=1}^{n}\underbrace{\frac{L_{i,W\nu}}{\sqrt{n}}}_{V_{i,L}}.

Note that $E[V_{i,L}]=0$ by the mean independence of $\nu_{ij}$ . Observe that

	$\displaystyle Var[V_{L}]$	$\displaystyle=Var[L_{i,W\nu}]$
		$\displaystyle=4E\left[E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)\|\xi_{1},U_{1}]^{2}\right]$
		$\displaystyle=4E[d_{121}d_{122}d_{131}d_{132}c^{\prime}\Delta W_{12}c^{\prime}\Delta W_{13}\nu_{12}\nu_{13}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)K_{h_{n}}(\Delta R_{13}^{\prime}\gamma)]$
		$\displaystyle=\frac{4}{h_{n}^{2}}\int E[d_{121}d_{122}d_{131}d_{132}c^{\prime}\Delta W_{12}c^{\prime}\Delta W_{13}\nu_{12}\nu_{13}\|\Delta R_{12}^{\prime}\gamma=r_{1},\Delta R_{13}^{\prime}\gamma=r_{2}]$
		$\displaystyle\times K(r_{1}/h_{n})K(r_{2}/h_{n})f_{R\gamma,2}(r_{1},r_{2})dr_{1}dr_{2}$
		$\displaystyle=4\int E[d_{121}d_{122}d_{131}d_{132}c^{\prime}\Delta W_{12}c^{\prime}\Delta W_{13}\nu_{12}\nu_{13}\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\Delta R_{13}^{\prime}\gamma=r_{2}h_{n}]$
		$\displaystyle\times K(r_{1})K(r_{2})f_{R\gamma,2}(r_{1}h_{n},r_{2}h_{n})dr_{1}dr_{2}$
		$\displaystyle=c^{\prime}\Sigma_{W\nu,1}c+o(1),$

where the last line holds from the dominated convergence theorem under Assumptions 4, 6, and 8. Furthermore, note that

	$\displaystyle E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)\|\xi_{1},U_{1}]$
	$\displaystyle=\frac{1}{h_{n}}\int E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}\|\Delta R_{12}^{\prime}\gamma=r,\xi_{1},U_{1}]K(r/h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r)dr$
	$\displaystyle=\int E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}\|\Delta R_{12}^{\prime}\gamma=rh_{n},\xi_{1},U_{1}]K(r)f_{R\gamma\|\xi_{1},U_{1}}(rh_{n})dr$
	$\displaystyle=O(1),$

almost surely for sufficiently large $n$ by Assumptions 4, 6, and 8. Thus, we have

\displaystyle\sum_{i=1}^{n}E[|V_{i,L}|^{3}]=n\times O\left(\frac{1}{n\sqrt{n}}\right)=o(1).

If $\Sigma_{W\nu,1}$ is positive definite, we have $c^{\prime}\Sigma_{W\nu,1}c>0$ . Thus, by the Lyapunov CLT, we have

\displaystyle V_{L}/\sqrt{Var[V_{L}]}\to_{d}\mathcal{N}(0,1),

as $n\to\infty$ . Thus, $V_{L}=\sqrt{n}L_{W\nu}\to_{d}\mathcal{N}(0,c^{\prime}\Sigma_{W\nu,1}c)$ as $n\to\infty$ .

If $c^{\prime}\Sigma_{W\nu,1}c=0$ , observe that, for some constant $C>0$

	$\displaystyle Var[L_{i,W\nu}]$
	$\displaystyle=4E\left[\left(\int h_{n}^{-1}E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}\|\Delta R_{12}^{\prime}\gamma=r,\xi_{1},U_{1}]K(r/h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r)dr\right)^{2}\right]$
	$\displaystyle=4E\left[\left(\int g_{\xi_{1},U_{1}}(rh_{n})K(r)dr\right)^{2}\right]$
	$\displaystyle\leq 4E\left[\left(g_{\xi_{1},U_{1}}(0)+Ch_{n}^{k}\right)^{2}\right]$
	$\displaystyle=O(h_{n}^{2k}),$

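where $g_{\xi_{1},U_{1}}(r)\equiv E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}|\Delta R_{12}^{\prime}\gamma=r,\xi_{1},U_{1}]f_{R\gamma|\xi_{1},U_{1}}(r)$ , the inequality follows from the Taylor expansion of $g_{\xi_{1},U_{1}}(rh_{n})$ around zero together with $\int r^{i}K(r)dr=0$ for $i=1,...,k$ , and the last line holds since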

	$\displaystyle E\left[\left(g_{\xi_{1},U_{1}}(0)\right)^{2}\right]$
	$\displaystyle=E\left[\left(E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}f_{R\gamma\|\xi_{1},U_{1}}(0)\|\Delta R_{12}^{\prime}\gamma=0,\xi_{1},U_{1}]\right)^{2}\right]$
	$\displaystyle=E\left[E[d_{121}d_{122}d_{131}d_{132}c^{\prime}\Delta W_{12}c^{\prime}\Delta W_{13}\nu_{12}\nu_{13}f_{R\gamma,2}(0,0)\|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0,\xi_{1},U_{1}]\right]$
	$\displaystyle=f_{R\gamma,2}(0,0)E[d_{121}d_{122}d_{131}d_{132}c^{\prime}\Delta W_{12}c^{\prime}\Delta W_{13}\nu_{12}\nu_{13}\|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0]$
	$\displaystyle=c^{\prime}\Sigma_{W\nu,1}c/4=0,$

and $E[g_{\xi_{1},U_{1}}]=0$ by the conditional mean independence of $\nu_{12}$ . Thus,

\displaystyle Var[\sqrt{Nh_{n}}L_{W\nu}]=nh_{n}\times O(h_{n}^{2k})=h_{n}^{k-1/2}\times O(\sqrt{Nh_{n}^{2k+3}})=o(1),

since $\sqrt{Nh_{n}^{2k+3}}=O(1)$ by the hypothesis and $k\geq 2$ so that $h_{n}^{k-1/2}=o(1)$ . Hence, by Chebyshev’s inequality,

\displaystyle\sqrt{Nh_{n}}L_{W\nu}\to_{p}0,

as $n\to\infty$ when $c^{\prime}\Sigma_{W\nu,1}c=0$ .

Step 2: Asymptotic Normality of $Q_{W\nu}$

We use the CLT for martingale differences (Theorem 5.24 and Corollary 5.26 in White (2001)). Define $V_{n,t}(1\leq t\leq N)$ , a triangular array, as

	$\displaystyle V_{n,1}$	$\displaystyle=\frac{1}{N}Q_{12,W\nu},$
	$\displaystyle V_{n,2}$	$\displaystyle=\frac{1}{N}Q_{13,W\nu},$
		$\displaystyle\vdots$
	$\displaystyle V_{n,n-1}$	$\displaystyle=\frac{1}{N}Q_{1n,W\nu},$
		$\displaystyle\vdots$
	$\displaystyle V_{n,N}$	$\displaystyle=\frac{1}{N}Q_{n-1n,W\nu}.$
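The triangular array simply lists the $N$ dyads in lexicographic order. The following is a minimal sketch of that ordering, assuming $N=n(n-1)/2$ and node labels $1,...,n$; the helper name `dyad_position` is hypothetical.

```python
from itertools import combinations


def dyad_position(n):
    """Map each dyad (i, j) with i < j to its 1-based position t
    in lexicographic order."""
    dyads = combinations(range(1, n + 1), 2)
    return {pair: t for t, pair in enumerate(dyads, start=1)}


n = 5
position = dyad_position(n)
N = n * (n - 1) // 2
# Positions of the first dyad, the last dyad involving node 1, and the last dyad overall.
print(position[(1, 2)], position[(1, n)], position[(n - 1, n)] == N)
```

Under this ordering the dyads $(1,2),...,(1,n)$ occupy the first $n-1$ positions, and the final dyad $(n-1,n)$ occupies position $N$.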

Notice that $Q_{ij,W\nu}$ is independent of $Q_{km,W\nu}$ if $i\neq k,m$ and $j\neq k,m$ . Also, $Q_{ij,W\nu}$ is conditionally independent of $Q_{km,W\nu}$ even if $i=k$ or $m$ , or $j=k$ or $m$ : $(\epsilon_{ijt},\eta_{ijt})_{t=1,2}$ and $(\epsilon_{imt},\eta_{imt})_{t=1,2}$ are conditionally independent given $\xi_{i},U_{i}$ by Assumption 1. Since $(\xi_{i},U_{i}),i=1,...,n$ are i.i.d., this implies that, for $1<t\leq N$ ,

	$\displaystyle E[V_{n,t}\|\{V_{n,s};s<t\}]$	$\displaystyle=E\left[E[V_{n,t}\|\{V_{n,s};s<t\},\{\xi_{i},U_{i}\}_{i=1}^{n}]\|\{V_{n,s};s<t\}\right]$
		$\displaystyle=E\left[E[V_{n,t}\|\xi_{t},U_{t}]\|\{V_{n,s};s<t\}\right]$
		$\displaystyle=0,$

where

	$\displaystyle E[V_{n,t}\|\xi_{t},U_{t}]$
	$\displaystyle=\frac{1}{N}E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)\|\xi_{1},U_{1}]-\frac{1}{N}E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)\|\xi_{1},U_{1}]$
	$\displaystyle-\frac{1}{N}E\left[E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)\|\xi_{2},U_{2}]\|\xi_{1},U_{1}\right]$
	$\displaystyle=-\frac{1}{N}E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)]$
	$\displaystyle=0$

because $\xi_{1},U_{1}$ is independent of $\xi_{2},U_{2}$ . Thus, letting $\mathcal{F}_{t}\equiv\sigma(V_{n,s}|1\leq s\leq t)$ be the sigma-algebra generated by $V_{n,1},...,V_{n,t}$ ( $\mathcal{F}_{0}$ is set to be the trivial $\sigma$ -algebra) and $\mathbb{F}\equiv(\mathcal{F}_{t})_{0\leq t\leq N}$ be a filtration, we have

\displaystyle E[V_{n,t}|\mathcal{F}_{t-1}]=0

for $1\leq t\leq N$ . Also, for each $t$ , for some constant $C>0$ ,

\displaystyle E[|V_{n,t}|]\leq\frac{3}{N}E[\|\Delta W_{12}\||\nu_{12}||K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)|]\leq\frac{C}{Nh_{n}}E[\|\Delta W_{12}\|^{2}]^{1/2}E[\nu_{12}^{2}]^{1/2}<\infty,

by Assumptions 7 and 8. This shows that $\{V_{n,t}\}$ is a martingale difference sequence.

Let $V_{n}=\sum_{t=1}^{N}V_{n,t}$ . Define the variance of this sequence by

\displaystyle v_{n}^{2}=Var\left[\sum^{N}_{t=1}V_{n,t}\right]=NVar[V_{n,1}]=\frac{1}{N}E[Q_{12,W\nu}^{2}].

We can calculate that

	$\displaystyle E[Q_{12,W\nu}^{2}]$	$\displaystyle=E\left[d_{121}d_{122}(c^{\prime}\Delta W_{12})^{2}\nu_{12}^{2}K_{h_{n}}^{2}(\Delta R_{12}^{\prime}\gamma)\right]$
		$\displaystyle-2E\left[E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)\|\xi_{1},U_{1}]^{2}\right].$

Observe that

	$\displaystyle E\left[d_{121}d_{122}(c^{\prime}\Delta W_{12})^{2}\nu_{12}^{2}K_{h_{n}}^{2}(\Delta R_{12}^{\prime}\gamma)\right]$
	$\displaystyle=\frac{1}{h_{n}^{2}}\int E[d_{121}d_{122}(c^{\prime}\Delta W_{12})^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r]K^{2}(r/h_{n})f_{R\gamma}(r)dr$
	$\displaystyle=\frac{1}{h_{n}}\int E[d_{121}d_{122}(c^{\prime}\Delta W_{12})^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]K^{2}(r)f_{R\gamma}(rh_{n})dr$
	$\displaystyle=\frac{1}{h_{n}}c^{\prime}\Sigma_{W\nu,2}c+o\left(\frac{1}{h_{n}}\right),$

where the last line holds from the dominated convergence theorem under Assumptions 4, 6, and 8, and

	$\displaystyle E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}K_{h_{n}}(\Delta R_{12}^{\prime}\gamma)\|\xi_{1},U_{1}]$
	$\displaystyle=\frac{1}{h_{n}}\int E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}\|\Delta R_{12}^{\prime}\gamma=r,\xi_{1},U_{1}]K(r/h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r)dr$
	$\displaystyle=\int E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}\|\Delta R_{12}^{\prime}\gamma=rh_{n},\xi_{1},U_{1}]K(r)f_{R\gamma\|\xi_{1},U_{1}}(rh_{n})dr$
	$\displaystyle=O(1),$

almost surely for sufficiently large $n$ by Assumptions 4, 6, and 8. Since $1/h_{n}$ diverges as $n\to\infty$ ,

\displaystyle E[Q_{12,W\nu}^{2}]=\frac{1}{h_{n}}c^{\prime}\Sigma_{W\nu,2}c+o\left(\frac{1}{h_{n}}\right).

Hence, we have

\displaystyle v_{n}^{2}=\frac{1}{Nh_{n}}c^{\prime}\Sigma_{W\nu,2}c+o\left(\frac{1}{Nh_{n}}\right).

The CLT for martingale differences holds if we can show the following two conditions:

\displaystyle\sum_{t=1}^{N}E\left[\left|\frac{V_{n,t}}{v_{n}}\right|^{2+\delta}\right]\to 0\text{ (Lyapunov)},

for some $\delta>0$ as $n\to\infty$ and

\displaystyle\sum_{t=1}^{N}\left(\frac{V_{n,t}}{v_{n}}\right)^{2}\to_{p}1\text{ (Stability)},

as $n\to\infty$ . If these conditions are met, we can apply Theorem 5.24 and Corollary 5.26 in White (2001) to show that

\displaystyle\frac{V_{n}}{v_{n}}\to_{d}\mathcal{N}(0,1),

as $n\to\infty$ . Since $\sqrt{Nh_{n}}v_{n}\to\sqrt{c^{\prime}\Sigma_{W\nu,2}c}$ , by Slutsky’s lemma,

\displaystyle\sqrt{Nh_{n}}V_{n}\to_{d}\mathcal{N}(0,c^{\prime}\Sigma_{W\nu,2}c),

which is equivalent to

\displaystyle\sqrt{Nh_{n}}Q_{W\nu}\to_{d}\mathcal{N}(0,c^{\prime}\Sigma_{W\nu,2}c),

as $n\to\infty$ .

For Lyapunov’s condition, observe that for some constant $C>0$ ,

	$\displaystyle E\left[\|V_{n,1}\|^{3}\right]$	$\displaystyle\leq\frac{C}{(Nh_{n})^{3}}\int E[\\|\Delta W_{12}\\|^{3}\|\nu_{12}\|^{3}\|\Delta R_{12}^{\prime}\gamma=r]\|K(r/h_{n})\|^{3}f_{R\gamma}(r)dr$
		$\displaystyle=\frac{Ch_{n}}{(Nh_{n})^{3}}\int E[\\|\Delta W_{12}\\|^{3}\|\nu_{12}\|^{3}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]\|K(r)\|^{3}f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O\left(\frac{1}{N^{3}h_{n}^{2}}\right),$

where the first inequality follows from Jensen’s inequality, and the last line follows from Cauchy-Schwarz and Assumptions 4, 6, and 8. Since $v_{n}^{-1}=O(\sqrt{Nh_{n}})$ , we have

\displaystyle\sum_{t=1}^{N}E\left[\left|\frac{V_{n,t}}{v_{n}}\right|^{3}\right]=NO\left(\frac{(Nh_{n})^{3/2}}{N^{3}h_{n}^{2}}\right)=O\left(\frac{1}{\sqrt{Nh_{n}}}\right)=o(1)

by Assumption 9. Thus Lyapunov’s condition holds.

For the stability condition, we can alternatively show that

\displaystyle\frac{1}{v_{n}^{2}}\sum_{t=1}^{N}\left(V_{n,t}^{2}-E[V_{n,t}^{2}]\right)\to_{p}0,

as $n\to\infty$ . Note that

\displaystyle\frac{1}{v_{n}^{2}}\sum_{t=1}^{N}\left(V_{n,t}^{2}-E[V_{n,t}^{2}]\right)=\frac{1}{Nv_{n}^{2}}\left(\frac{1}{N}\sum_{i<j}Q_{ij,W\nu}^{2}-E[Q_{12,W\nu}^{2}]\right).

Since $Nv_{n}^{2}=O(1/h_{n})$ , we need to show that the remaining term is $o_{p}(1/h_{n})$ . Since $Q_{ij,W\nu}$ is independent of $Q_{km,W\nu}$ if there is no common node,

	$\displaystyle E\left[\left(\frac{1}{N}\sum_{i<j}Q_{ij,W\nu}^{2}-E[Q_{12,W\nu}^{2}]\right)^{2}\right]$
	$\displaystyle=\frac{Var[Q_{12,W\nu}^{2}]}{N}+\frac{2(n-2)}{N}Cov[Q_{12,W\nu}^{2},Q_{13,W\nu}^{2}]$
	$\displaystyle\leq\frac{E[Q_{12,W\nu}^{4}]}{N}+\frac{2(n-2)}{N}E[Q_{12,W\nu}^{2}\times Q_{13,W\nu}^{2}].$

The first term on the far right-hand side is bounded as follows: for some constant $C>0$ ,

	$\displaystyle\frac{E[Q_{12,W\nu}^{4}]}{N}$	$\displaystyle\leq\frac{C}{Nh_{n}^{4}}\int E[\\|\Delta W_{12}\\|^{4}\nu_{12}^{4}\|\Delta R_{12}^{\prime}\gamma=r]K^{4}(r/h_{n})f_{R\gamma}(r)dr$
		$\displaystyle=\frac{C}{Nh_{n}^{3}}\int E[\\|\Delta W_{12}\\|^{4}\nu_{12}^{4}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]K^{4}(r)f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O\left(\frac{1}{Nh_{n}^{3}}\right),$

where the first inequality follows from Jensen’s inequality, and the last line follows from Cauchy-Schwarz and Assumptions 4, 6, and 8. The second term on the far right-hand side is bounded as follows: for some constant $C>0$ ,

	$\displaystyle\frac{2(n-2)}{N}E[Q_{12,W\nu}^{2}\times Q_{13,W\nu}^{2}]$
	$\displaystyle\leq\frac{C(n-2)}{Nh_{n}^{4}}\int E[\\|\Delta W_{12}\\|^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1},\xi_{1},U_{1}]\times E[\\|\Delta W_{13}\\|^{2}\nu_{13}^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2},\xi_{1},U_{1}]$
	$\displaystyle\times K^{2}(r_{1}/h_{n})K^{2}(r_{2}/h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{1})f_{R\gamma\|\xi_{1},U_{1}}(r_{2})dr_{1}dr_{2}$
	$\displaystyle=\frac{C(n-2)}{Nh_{n}^{2}}\int E[\\|\Delta W_{12}\\|^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\xi_{1},U_{1}]\times E[\\|\Delta W_{13}\\|^{2}\nu_{13}^{2}\|\Delta R_{13}^{\prime}\gamma=r_{2}h_{n},\xi_{1},U_{1}]$
	$\displaystyle\times K^{2}(r_{1})K^{2}(r_{2})f_{R\gamma\|\xi_{1},U_{1}}(r_{1}h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{2}h_{n})dr_{1}dr_{2}$
	$\displaystyle=O\left(\frac{1}{nh_{n}^{2}}\right)$

Thus,

\displaystyle h_{n}E\left[\left(\frac{1}{N}\sum_{i<j}Q_{ij,W\nu}^{2}-E[Q_{12,W\nu}^{2}]\right)^{2}\right]=O\left(\frac{1}{Nh_{n}^{2}}\right)+O\left(\frac{1}{nh_{n}}\right)=o(1),

and by Markov’s inequality,

\displaystyle\sqrt{h_{n}}\left(\frac{1}{N}\sum_{i<j}Q_{ij,W\nu}^{2}-E[Q_{12,W\nu}^{2}]\right)=o_{p}(1).

Then,

\displaystyle\frac{1}{v_{n}^{2}}\sum_{t=1}^{N}\left(V_{n,t}^{2}-E[V_{n,t}^{2}]\right)=O(h_{n})\times o_{p}\left(\frac{1}{\sqrt{h_{n}}}\right)=o_{p}(\sqrt{h_{n}})=o_{p}(1),

which shows the stability condition.

Step 3: Conclusion

By Steps 0-2, we have established that if $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}>0$ ,

	$\displaystyle\sqrt{n}c^{\prime}S_{WW}^{-1}S_{W\nu}$
	$\displaystyle=\underbrace{\sqrt{n}L_{W\nu}}_{\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,1}c_{W})}+\underbrace{\frac{\sqrt{n}}{\sqrt{Nh_{n}}}}_{\to 0}\times\underbrace{\sqrt{Nh_{n}}Q_{W\nu}}_{\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,2}c_{W})}+o_{p}(\sqrt{n}/\sqrt{Nh_{n}})\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}),$

as $n\to\infty$ by Assumption 9, and if $c_{W}^{\prime}\Sigma_{W\nu,1}c_{W}=0$ ,

\displaystyle\sqrt{Nh_{n}}c^{\prime}S_{WW}^{-1}S_{W\nu}=\underbrace{\sqrt{Nh_{n}}L_{W\nu}}_{\to_{p}0}+\underbrace{\sqrt{Nh_{n}}Q_{W\nu}}_{\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,2}c_{W})}+o_{p}(1)\to_{d}\mathcal{N}(0,c_{W}^{\prime}\Sigma_{W\nu,2}c_{W}),

as $n\to\infty$ . This completes the proof. ∎
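Which normalization applies is governed by the ratio $\sqrt{n}/\sqrt{Nh_{n}}$ that multiplies the $Q_{W\nu}$ term in the non-degenerate case. Assuming $N=n(n-1)/2$, this ratio equals $\sqrt{2/((n-1)h_{n})}$, so it vanishes exactly when $nh_{n}\to\infty$. The sketch below tabulates it for a hypothetical bandwidth $h_{n}=n^{-2/7}$, an illustrative choice corresponding to $k=2$ rather than a recommendation from the paper; the name `rate_ratio` is likewise an assumption.

```python
import math


def rate_ratio(n, h):
    """Return sqrt(n) / sqrt(N * h) with N = n(n-1)/2."""
    N = n * (n - 1) / 2
    return math.sqrt(n) / math.sqrt(N * h)


for n in (10**2, 10**3, 10**4, 10**5):
    h = n ** (-2 / 7)  # illustrative bandwidth only
    print(n, round(rate_ratio(n, h), 4))
```

The ratio shrinks slowly in $n$ for this bandwidth, which suggests that the $Q_{W\nu}$ term need not be negligible in moderate samples even when the projection part is non-degenerate.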

Proof of Lemma 4

Proof.

By expanding $K(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}/h_{n})$ around $\Delta R_{ij}^{\prime}\gamma$ , we have

\displaystyle\hat{S}_{WW}=S_{WW}+\frac{1}{Nh_{n}^{2}}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta W_{ij}^{\prime}\Delta R_{ij}^{\prime}(\hat{\gamma}_{n}-\gamma)K^{\prime}(c_{ij,n}^{*}/h_{n}),

where $c_{ij,n}^{*}$ is in between $\Delta R_{ij}^{\prime}\gamma$ and $\Delta R_{ij}^{\prime}\hat{\gamma}_{n}$ . Thus, for some constant $C>0$ ,

\displaystyle\|\hat{S}_{WW}-S_{WW}\|\leq\underbrace{\frac{C}{N}\sum_{i<j}\|\Delta W_{ij}\|^{2}\|\Delta R_{ij}\|}_{D_{4,1}}\times h_{n}^{-2}\|\hat{\gamma}_{n}-\gamma\|.

Notice that $\|\Delta W_{ij}\|=\|w(X_{i1},X_{j1})-w(X_{i2},X_{j2})\|$ and $\|\Delta R_{ij}\|=\|r(Z_{i1},Z_{j1})-r(Z_{i2},Z_{j2})\|$ are symmetric in $i$ and $j$ by the symmetry of $w$ , $r$ , and $\|\cdot\|$ , so that $D_{4,1}$ is a second-order U-statistic. Also,

\displaystyle E[\|\Delta W_{12}\|^{2}\|\Delta R_{12}\|]<\infty

by Cauchy-Schwarz with Assumption 7. Thus, we can apply the law of large numbers for U-statistics (Hoeffding (1961)):

\displaystyle D_{4,1}=O_{p}(1).

By the hypothesis and Assumption 10,

	$\displaystyle h_{n}^{-2}\\|\hat{\gamma}_{n}-\gamma\\|$	$\displaystyle=\frac{\\|\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)\\|}{\sqrt{Nh_{n}^{5}}}$
		$\displaystyle=\frac{\\|\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)\\|}{\sqrt{Nh_{n}^{2k+3}}}\times\sqrt{h_{n}^{2k-2}}$
		$\displaystyle=o_{p}(1),$

as $\sqrt{Nh_{n}^{2k+3}}$ either diverges or converges to a positive constant, and $\sqrt{h_{n}^{2k-2}}=o(1)$ for $k\geq 2$ . Thus,

\displaystyle\|\hat{S}_{WW}-S_{WW}\|=O_{p}(1)\times o_{p}(1)=o_{p}(1).

This shows that $\hat{S}_{WW}=S_{WW}+o_{p}(1)$ . ∎

Proof of Lemma 5

Proof.

Expanding $K(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}/h_{n})$ around $\Delta R_{ij}^{\prime}\gamma$ , for some constant $C>0$ , we have

	$\displaystyle\sqrt{Nh_{n}}\\|\hat{S}_{W\lambda}-S_{W\lambda}\\|$
	$\displaystyle\leq\underbrace{\frac{1}{Nh_{n}^{2}}\sum_{i<j}\\|\Delta W_{ij}\\|\\|\Delta R_{ij}\\|\|\lambda_{ij}\|\|K^{\prime}(\Delta R_{ij}^{\prime}\gamma/h_{n})\|}_{D_{5,1}}\times\sqrt{Nh_{n}}\\|\hat{\gamma}_{n}-\gamma\\|$
	$\displaystyle+\underbrace{\frac{1}{Nh_{n}^{2}}\sum_{i<j}\\|\Delta W_{ij}\\|\\|\Delta R_{ij}\\|^{2}\|\lambda_{ij}\|\|K^{\prime\prime}(\Delta R_{ij}^{\prime}\gamma/h_{n})\|}_{D_{5,2}}\times\frac{\sqrt{Nh_{n}}\\|\hat{\gamma}_{n}-\gamma\\|^{2}}{2h_{n}}$
	$\displaystyle+C\underbrace{\frac{1}{N}\sum_{i<j}\\|\Delta W_{ij}\\|\\|\Delta R_{ij}\\|^{3}\|\lambda_{ij}\|}_{D_{5,3}}\times\frac{\sqrt{Nh_{n}}\\|\hat{\gamma}_{n}-\gamma\\|^{3}}{6h_{n}^{4}}$

We bound the right-hand side in the following steps.
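The three terms in the bound correspond to the first-order, second-order, and remainder terms of a third-order Taylor expansion of the kernel in its argument. The sketch below checks the Lagrange remainder bound numerically, using the standard normal density as a stand-in for $K$. This choice is an assumption made for illustration: the paper's kernel is a higher-order kernel satisfying Assumption 8, and the names `phi`, `taylor_remainder`, and `SUP_K3` are hypothetical.

```python
import math


def phi(u):
    """Standard normal density, used here as a stand-in smooth kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def taylor_remainder(x, delta, h):
    """Return |K((x + delta)/h) - second-order Taylor polynomial around x/h|."""
    u = x / h
    k0 = phi(u)
    k1 = -u * k0               # K'(u) for the Gaussian density
    k2 = (u * u - 1.0) * k0    # K''(u) for the Gaussian density
    step = delta / h
    approx = k0 + k1 * step + 0.5 * k2 * step ** 2
    return abs(phi((x + delta) / h) - approx)


# Upper bound on sup_u |K'''(u)| = sup_u |(3u - u^3) phi(u)|, which is about 0.5506.
SUP_K3 = 0.56

x, h = 0.3, 0.5
for delta in (0.1, 0.05, 0.025):
    bound = SUP_K3 * abs(delta) ** 3 / (6 * h ** 3)
    print(delta, taylor_remainder(x, delta, h), bound)
```

Here `delta` plays the role of $\Delta R_{ij}^{\prime}(\hat{\gamma}_{n}-\gamma)$; the cubic dependence of the bound on `delta` and the negative powers of `h` mirror the factor $\|\hat{\gamma}_{n}-\gamma\|^{3}$ divided by a power of $h_{n}$ in the last term.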

Step 1: $D_{5,1}$ and $D_{5,2}$

Observe that

	$\displaystyle E[D_{5,1}]$	$\displaystyle=\frac{1}{h_{n}^{2}}\int E[\\|\Delta W_{12}\\|\\|\Delta R_{12}\\|\|\Lambda_{12}\|\|\Delta R_{12}^{\prime}\gamma=r]\|r\|\|K^{\prime}(r/h_{n})\|f_{R\gamma}(r)dr$
		$\displaystyle=\int E[\\|\Delta W_{12}\\|\\|\Delta R_{12}\\|\|\Lambda_{12}\|\|\Delta R_{12}^{\prime}\gamma=rh_{n}]\|r\|\|K^{\prime}(r)\|f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O(1),$

where the last line holds from Assumptions 4, 6 and 8. Also, writing each summand of $D_{5,1}$ by $D_{5,1,ij}$ , we have

	$\displaystyle Var[D_{5,1}]$	$\displaystyle=\frac{1}{Nh_{n}^{4}}Var[D_{5,1,12}]+\frac{2(n-2)}{Nh_{n}^{4}}Cov[D_{5,1,12},D_{5,1,13}]$
		$\displaystyle\leq\frac{1}{Nh_{n}^{4}}E[D^{2}_{5,1,12}]+\frac{2(n-2)}{Nh_{n}^{4}}E[D_{5,1,12}\times D_{5,1,13}].$

The first term on the far right side is $O(1/(Nh_{n}^{2}))$ because

	$\displaystyle E[D_{5,1,12}^{2}]$	$\displaystyle=\int E[\\|\Delta W_{12}\\|^{2}\\|\Delta R_{12}\\|^{2}\Lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r]r^{2}K^{\prime}(r/h_{n})^{2}f_{R\gamma}(r)dr$
		$\displaystyle=h_{n}^{3}\int E[\\|\Delta W_{12}\\|^{2}\\|\Delta R_{12}\\|^{2}\Lambda_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]r^{2}K^{\prime}(r)^{2}f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O(h_{n}^{3})=O(h_{n}^{2}),$

where the last line holds from Assumptions 4, 6, and 8. The second term on the far right side is $O(1/n)$ because

	$\displaystyle E[D_{5,1,12}\times D_{5,1,13}]$	$\displaystyle=E\Big{[}\int E[\\|\Delta W_{12}\\|\\|\Delta R_{12}\\|\|\Lambda_{12}\|\|\Delta R_{12}^{\prime}\gamma=r_{1},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|\\|\Delta R_{13}\\|\|\Lambda_{13}\|\|\Delta R_{13}^{\prime}\gamma=r_{2},\xi_{1},U_{1}]$
		$\displaystyle\times\|r_{1}\|\|r_{2}\|\|K^{\prime}(r_{1}/h_{n})\|\|K^{\prime}(r_{2}/h_{n})\|f_{R\gamma\|\xi_{1},U_{1}}(r_{1})f_{R\gamma\|\xi_{1},U_{1}}(r_{2})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=h_{n}^{4}E\Big{[}\int E[\\|\Delta W_{12}\\|\\|\Delta R_{12}\\|\|\Lambda_{12}\|\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|\\|\Delta R_{13}\\|\|\Lambda_{13}\|\|\Delta R_{13}^{\prime}\gamma=r_{2}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times\|r_{1}\|\|r_{2}\|\|K^{\prime}(r_{1})\|\|K^{\prime}(r_{2})\|f_{R\gamma\|\xi_{1},U_{1}}(r_{1}h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{2}h_{n})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=O(h_{n}^{4}),$

where the last line holds from Assumptions 4, 6, and 8. Thus,

\displaystyle Var[D_{5,1}]=O\left(\frac{1}{Nh_{n}^{2}}\right)+O\left(\frac{1}{n}\right)=o(1),

since $Nh_{n}^{2}=Nh_{n}^{2k+3}\times h_{n}^{-(2k+1)}$ diverges by the hypothesis. Thus,

\displaystyle D_{5,1}=O_{p}(1).

By a similar calculation, we have

\displaystyle D_{5,2}=O_{p}(1).

Step 2: $D_{5,3}$

First, observe that

\displaystyle E[D_{5,3}]\leq\|\gamma\|E[\|\Delta W_{12}\|\|\Delta R_{12}\|^{4}|\Lambda_{12}|]<\infty

since $|\lambda_{12}|\leq|\Lambda_{12}|\|\Delta R_{12}\|\|\gamma\|$ , and by Hölder’s inequality with Assumption 7. Note that by construction, $\Lambda_{ij}$ is a function of $\xi_{i}$ and $\xi_{j}$ that is symmetric in $i$ and $j$ , which implies that $D_{5,3}$ is a second-order U-statistic. Since each summand is non-negative and has a finite mean, we can apply the law of large numbers for U-statistics (Hoeffding (1961)) to show that

\displaystyle D_{5,3}=O_{p}(1).

Step 3: Conclusion

Finally, by Assumption 10 and the hypothesis,

	$\displaystyle\sqrt{Nh_{n}}\\|\hat{\gamma}_{n}-\gamma\\|=o_{p}(1),$
	$\displaystyle\frac{\sqrt{Nh_{n}}\\|\hat{\gamma}_{n}-\gamma\\|^{2}}{2h_{n}}=\frac{\\|\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)\\|^{2}}{2\sqrt{Nh_{n}^{3}}}=o_{p}(1),$
	$\displaystyle\frac{\sqrt{Nh_{n}}\\|\hat{\gamma}_{n}-\gamma\\|^{3}}{6h_{n}^{4}}=\frac{\\|\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)\\|^{3}}{6Nh_{n}^{5}}=o_{p}(1).$

Thus,

\displaystyle\sqrt{Nh_{n}}\|\hat{S}_{W\lambda}-S_{W\lambda}\|=o_{p}(1).

This implies that

\displaystyle\hat{S}_{W\lambda}=S_{W\lambda}+o_{p}\left(\frac{1}{\sqrt{Nh_{n}}}\right).

This completes the proof. ∎

Proof of Lemma 6

Proof.

By expanding $K(\Delta R_{ij}^{\prime}\hat{\gamma}_{n}/h_{n})$ around $\Delta R_{ij}^{\prime}\gamma$ , we have

	$\displaystyle\sqrt{Nh_{n}}(\hat{S}_{W\nu}-S_{W\nu})$
	$\displaystyle=\underbrace{\frac{1}{Nh_{n}^{2}}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta R_{ij}^{\prime}\nu_{ij}K^{\prime}(\Delta R_{ij}^{\prime}\gamma/h_{n})}_{D_{6,1}}\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)$
	$\displaystyle+(\hat{\gamma}_{n}-\gamma)^{\prime}\underbrace{\frac{1}{Nh_{n}^{2}}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\Delta R_{ij}\Delta R_{ij}^{\prime}\nu_{ij}K^{\prime\prime}(\Delta R_{ij}^{\prime}\gamma/h_{n})}_{D_{6,2}}\sqrt{Nh_{n}}\frac{(\hat{\gamma}_{n}-\gamma)}{h_{n}}$
	$\displaystyle+\underbrace{\frac{\sqrt{Nh_{n}}}{Nh_{n}^{4}}\sum_{i<j}d_{ij1}d_{ij2}\Delta W_{ij}\nu_{ij}K^{\prime\prime\prime}(c_{ij,n}^{*}/h_{n})\left(\Delta R_{ij}^{\prime}(\hat{\gamma}_{n}-\gamma)\right)^{3}}_{D_{6,3}}$

We bound each component by the following steps.

Step 1: $D_{6,1}$ and $D_{6,2}$

Note that $E[D_{6,1}]=E[D_{6,2}]=0$ by the conditional mean independence of $\nu_{ij}$ . Write $D_{6,1,ij}$ for each summand of $D_{6,1}$ . Observe that, by a calculation similar to the one above,

\displaystyle Var[\|D_{6,1}\|]\leq\frac{1}{Nh_{n}^{4}}E[\|D_{6,1,12}\|^{2}]+\frac{2(n-2)}{Nh_{n}^{4}}E[\|D_{6,1,12}\|\times\|D_{6,1,13}\|].

The first term on the right hand side is $O(1/(Nh_{n}^{3}))$ since

	$\displaystyle E[\\|D_{6,1,12}\\|^{2}]$	$\displaystyle\leq\int E[\\|\Delta W_{12}\\|^{2}\\|\Delta R_{12}\\|^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=r]K^{\prime}(r/h_{n})^{2}f_{R\gamma}(r)dr$
		$\displaystyle=h_{n}\int E[\\|\Delta W_{12}\\|^{2}\\|\Delta R_{12}\\|^{2}\nu_{12}^{2}\|\Delta R_{12}^{\prime}\gamma=rh_{n}]K^{\prime}(r)^{2}f_{R\gamma}(rh_{n})dr$
		$\displaystyle=O(h_{n}),$

where the last line holds from Assumptions 4, 6, and 8. The second term on the right hand side is $O(1/(nh_{n}^{2}))$ since

	$\displaystyle E[\\|D_{6,1,12}\\|\times\\|D_{6,1,13}\\|]$	$\displaystyle\leq E\Big{[}\int E[\\|\Delta W_{12}\\|\\|\Delta R_{12}\\|\|\nu_{12}\|\|\Delta R_{12}^{\prime}\gamma=r_{1},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|\\|\Delta R_{13}\\|\|\nu_{13}\|\|\Delta R_{13}^{\prime}\gamma=r_{2},\xi_{1},U_{1}]$
		$\displaystyle\times\|K^{\prime}(r_{1}/h_{n})\|\|K^{\prime}(r_{2}/h_{n})\|f_{R\gamma\|\xi_{1},U_{1}}(r_{1})f_{R\gamma\|\xi_{1},U_{1}}(r_{2})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=h_{n}^{2}E\Big{[}\int E[\\|\Delta W_{12}\\|\\|\Delta R_{12}\\|\|\nu_{12}\|\|\Delta R_{12}^{\prime}\gamma=r_{1}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times E[\\|\Delta W_{13}\\|\\|\Delta R_{13}\\|\|\nu_{13}\|\|\Delta R_{13}^{\prime}\gamma=r_{2}h_{n},\xi_{1},U_{1}]$
		$\displaystyle\times\|K^{\prime}(r_{1})\|\|K^{\prime}(r_{2})\|f_{R\gamma\|\xi_{1},U_{1}}(r_{1}h_{n})f_{R\gamma\|\xi_{1},U_{1}}(r_{2}h_{n})dr_{1}dr_{2}\Big{]}$
		$\displaystyle=O(h_{n}^{2}),$

where the last line holds from Assumptions 4, 6, and 8. Hence,

\displaystyle Var[\|D_{6,1}\|]=O\left(\frac{1}{Nh_{n}^{3}}\right)+O\left(\frac{1}{nh_{n}^{2}}\right)=o(1),

since both $Nh_{n}^{3}=Nh_{n}^{2k+3}\times h_{n}^{-2k}$ and $nh_{n}^{2}\sim\sqrt{Nh_{n}^{4}}=\sqrt{Nh_{n}^{2k+3}}\times\sqrt{h_{n}^{-2k+1}}$ diverge under the hypothesis. This shows that

\displaystyle D_{6,1}=o_{p}(1).

A similar calculation shows that

\displaystyle D_{6,2}=o_{p}(1),

as well.

Step 2: $D_{6,3}$

Observe that, for some constant $C>0$

\displaystyle\|D_{6,3}\|\leq C\underbrace{\frac{1}{N}\sum_{i<j}\|\Delta W_{ij}\|\|\Delta R_{ij}\|^{3}|\nu_{ij}|}_{D_{6,4}}\times\frac{\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|^{3}}{h_{n}^{4}}.

Observe that

\displaystyle E[D_{6,4}]=E[\|\Delta W_{12}\|\|\Delta R_{12}\|^{3}|\nu_{12}|]<\infty,

by Cauchy-Schwarz with Assumption 6. Also, by writing each summand of $D_{6,4}$ as $D_{6,4,ij}$ , we have

\displaystyle Var[D_{6,4}]\leq\frac{E[D_{6,4,12}^{2}]}{N}+\frac{2(n-2)}{N}E[D_{6,4,12}\times D_{6,4,13}].

Since

	$\displaystyle E[D_{6,4,12}^{2}]$	$\displaystyle=E[\\|\Delta W_{12}\\|^{2}\\|\Delta R_{12}\\|^{6}\nu_{12}^{2}]<\infty$
	$\displaystyle E[D_{6,4,12}\times D_{6,4,13}]$	$\displaystyle=E[\\|\Delta W_{12}\\|\\|\Delta W_{13}\\|\\|\Delta R_{12}\\|^{3}\\|\Delta R_{13}\\|^{3}\|\nu_{12}\|\|\nu_{13}\|]<\infty$

by Hölder’s inequality with Assumption 7,

\displaystyle Var[D_{6,4}]=O\left(\frac{1}{N}\right)+O\left(\frac{1}{n}\right)=o(1).

This shows that

\displaystyle D_{6,4}=O_{p}(1).

Hence, by the previous calculation for the term involving $\hat{\gamma}_{n}-\gamma$ ,

\displaystyle\|D_{6,3}\|=O_{p}(1)\times o_{p}(1)=o_{p}(1).

Step 3: Conclusion

By the above steps and the hypothesis on $\hat{\gamma}_{n}-\gamma$ ,

\displaystyle\sqrt{Nh_{n}}\|\hat{S}_{W\nu}-S_{W\nu}\|=o_{p}(1).

This implies that

\displaystyle\hat{S}_{W\nu}=S_{W\nu}+o_{p}\left(\frac{1}{\sqrt{Nh_{n}}}\right).

This completes the proof. ∎

Proof of Lemma 7

Proof.

Define

	$\displaystyle S_{ij,1}$	$\displaystyle\equiv 2d_{ij1}d_{ij2}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)\Delta W_{ij}\nu_{ij},$
	$\displaystyle S_{ij,2}$	$\displaystyle\equiv 2d_{ij1}d_{ij2}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)\Delta W_{ij}\lambda_{ij},$
	$\displaystyle S_{ij,3}$	$\displaystyle\equiv 2d_{ij1}d_{ij2}K_{h_{n}}(\Delta R_{ij}^{\prime}\gamma)\Delta W_{ij}\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n}).$

Since

\displaystyle\Delta\hat{\epsilon}_{ij}=\Delta W_{ij}^{\prime}(\beta-\hat{\beta}_{n})+\lambda_{ij}+\nu_{ij},

we have

\displaystyle S_{ij}=S_{ij,1}+S_{ij,2}+S_{ij,3}.

Thus,

	$\displaystyle\tilde{\Sigma}_{W\nu,1}$
	$\displaystyle=\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\underbrace{\frac{1}{3}(S_{ij,1}S_{ik,1}^{\prime}+S_{ij,1}S_{jk,1}^{\prime}+S_{ik,1}S_{jk,1}^{\prime})}_{D_{7,ijk}}}_{D_{7}}+\mathcal{O}_{7},$

where $\mathcal{O}_{7}$ is the remainder term.

We first show that $c^{\prime}D_{7}c\to_{p}c^{\prime}\Sigma_{W\nu,1}c$ . Note that

	$\displaystyle E[c^{\prime}D_{7}c]$	$\displaystyle=E[c^{\prime}S_{12,1}S_{13,1}^{\prime}c]$
		$\displaystyle=\frac{4}{h_{n}^{2}}\int E[d_{121}d_{122}d_{131}d_{132}c^{\prime}\Delta W_{12}\Delta W_{13}^{\prime}c\nu_{12}\nu_{13}\|\Delta R_{12}^{\prime}\gamma=s_{1},\Delta R_{13}^{\prime}\gamma=s_{2}]$
		$\displaystyle\times K(s_{1}/h_{n})K(s_{2}/h_{n})f_{R\gamma,2}(s_{1},s_{2})ds_{1}ds_{2}$
		$\displaystyle=4\int E[d_{121}d_{122}d_{131}d_{132}c^{\prime}\Delta W_{12}\Delta W_{13}^{\prime}c\nu_{12}\nu_{13}\|\Delta R_{12}^{\prime}\gamma=s_{1}h_{n},\Delta R_{13}^{\prime}\gamma=s_{2}h_{n}]$
		$\displaystyle\times K(s_{1})K(s_{2})f_{R\gamma,2}(s_{1}h_{n},s_{2}h_{n})ds_{1}ds_{2}$
		$\displaystyle\to c^{\prime}\Sigma_{W\nu,1}c,$

as $n\to\infty$ by the dominated convergence theorem under Assumptions 4, 6, and 8. Define the third-order U-statistic

\displaystyle U_{n,1}={n\choose 3}^{-1}\sum_{i<j<k}p_{n}(\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k}),

where $\boldsymbol{\xi}_{i}=(\xi_{i},U_{i})$ and

\displaystyle p_{n}(\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k})=E[c^{\prime}D_{7,ijk}c|\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k}]

By the calculation of Graham et al. (2019) in Appendix B,

	$\displaystyle E[(c^{\prime}D_{7}c-U_{n,1})^{2}]$	$\displaystyle={n\choose 3}^{-1}E\left[(c^{\prime}D_{7,123}c-E[c^{\prime}D_{7,123}c\|\boldsymbol{\xi}_{1},\boldsymbol{\xi}_{2},\boldsymbol{\xi}_{3}])^{2}\right]$
		$\displaystyle+{n\choose 3}^{-2}\times 3{n\choose 2}{n-2\choose 2}\times E\left[c^{\prime}D_{7,123}c-E[c^{\prime}D_{7,123}c\|\boldsymbol{\xi}_{1},\boldsymbol{\xi}_{2},\boldsymbol{\xi}_{3}]\right]$
		$\displaystyle\times E\left[c^{\prime}D_{7,124}c-E[c^{\prime}D_{7,124}c\|\boldsymbol{\xi}_{1},\boldsymbol{\xi}_{2},\boldsymbol{\xi}_{4}]\right]$
		$\displaystyle=O\left(\frac{E[(c^{\prime}D_{7,123}c)^{2}]}{n^{3}}\right).$

Observe that

	$\displaystyle E[(c^{\prime}D_{7,123}c)^{2}]$	$\displaystyle=\frac{1}{9}\left(3E[(c^{\prime}S_{12,1}c\times c^{\prime}S_{13,1}c)^{2}]+6E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c\times c^{\prime}S_{23,1}c]\right)$
		$\displaystyle=O\left(\frac{1}{h_{n}^{2}}\right),$

since for some positive constant $C>0$ ,

	$\displaystyle E[(c^{\prime}S_{12,1}c\times c^{\prime}S_{13,1}c)^{2}]$	$\displaystyle\leq\frac{C}{h_{n}^{4}}\int E[\|\Delta W_{12}\|^{2}\|\Delta W_{13}\|^{2}\nu_{12}^{2}\nu_{13}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1},\Delta R_{13}^{\prime}\gamma=s_{2}]$
		$\displaystyle\times K^{2}(s_{1}/h_{n})K^{2}(s_{2}/h_{n})f_{R\gamma,2}(s_{1},s_{2})ds_{1}ds_{2}$
		$\displaystyle=\frac{C}{h_{n}^{2}}\int E[\|\Delta W_{12}\|^{2}\|\Delta W_{13}\|^{2}\nu_{12}^{2}\nu_{13}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1}h_{n},\Delta R_{13}^{\prime}\gamma=s_{2}h_{n}]$
		$\displaystyle\times K^{2}(s_{1})K^{2}(s_{2})f_{R\gamma,2}(s_{1}h_{n},s_{2}h_{n})ds_{1}ds_{2}$
		$\displaystyle=O\left(\frac{1}{h_{n}^{2}}\right),$

as $n\to\infty$, with the last line coming from Assumptions 4, 6, and 8, and

	$\displaystyle E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c\times c^{\prime}S_{23,1}c]$
	$\displaystyle=E[(c^{\prime}S_{12,1}c)^{2}\times E[c^{\prime}S_{13,1}c|\xi_{1},U_{1}]\times E[c^{\prime}S_{23,1}c|\xi_{2},U_{2}]]$
	$\displaystyle=E[E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c|\xi_{1},U_{1}]\times E[c^{\prime}S_{23,1}c|\xi_{2},U_{2}]]$
	$\displaystyle=E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c]\times E[c^{\prime}S_{12,1}c]$
	$\displaystyle\leq O(1)\times\frac{C}{h_{n}^{3}}\int E[\|\Delta W_{12}\|^{2}\nu_{12}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1},\xi_{1},U_{1}]\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2},\xi_{1},U_{1}]$
	$\displaystyle\times K^{2}(s_{1}/h_{n})K(s_{2}/h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{1})f_{R\gamma|\xi_{1},U_{1}}(s_{2})ds_{1}ds_{2}$
	$\displaystyle=O(1)\times\frac{C}{h_{n}}\int E[\|\Delta W_{12}\|^{2}\nu_{12}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1}h_{n},\xi_{1},U_{1}]\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2}h_{n},\xi_{1},U_{1}]$
	$\displaystyle\times K^{2}(s_{1})K(s_{2})f_{R\gamma|\xi_{1},U_{1}}(s_{1}h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{2}h_{n})ds_{1}ds_{2}$
	$\displaystyle=O\left(\frac{1}{h_{n}}\right),$

where the first to third lines follow from the conditional independence of $S_{ij,1}$ , the random sampling of $\xi_{i}$ , and the conditional independence and exchangeability of $U_{i}$ under Assumption 1, and the last line follows from Assumptions 4, 6, and 8. Observe that, by conditional independence of $S_{ij,1}$ and $S_{ik,1}$ given $\xi_{i},U_{i}$ and $S_{ij,1}=S_{ji,1}$ , one can show that

	$\displaystyle E[c^{\prime}D_{7,123}c\times c^{\prime}D_{7,124}c]$	$\displaystyle=\frac{1}{9}\big{\{}2E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c\times c^{\prime}S_{14,1}c]$
		$\displaystyle+2E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c]\times E[c^{\prime}S_{13,1}c]+5E[c^{\prime}S_{12,1}c\times c^{\prime}S_{13,1}c]^{2}\big{\}}$
		$\displaystyle=O\left(\frac{1}{h_{n}}\right),$

where the last line holds since

	$\displaystyle E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c\times c^{\prime}S_{14,1}c]$
	$\displaystyle\leq\frac{C}{h_{n}^{4}}E\bigg{[}\int E[\|\Delta W_{12}\|^{2}\nu_{12}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1},\xi_{1},U_{1}]\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2},\xi_{1},U_{1}]$
	$\displaystyle\times E[\|\Delta W_{14}\||\nu_{14}||\Delta R_{14}^{\prime}\gamma=s_{3},\xi_{1},U_{1}]\times K^{2}(s_{1}/h_{n})K(s_{2}/h_{n})K(s_{3}/h_{n})$
	$\displaystyle\times f_{R\gamma|\xi_{1},U_{1}}(s_{1})f_{R\gamma|\xi_{1},U_{1}}(s_{2})f_{R\gamma|\xi_{1},U_{1}}(s_{3})ds_{1}ds_{2}ds_{3}\bigg{]}$
	$\displaystyle=\frac{C}{h_{n}}E\bigg{[}\int E[\|\Delta W_{12}\|^{2}\nu_{12}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1}h_{n},\xi_{1},U_{1}]\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2}h_{n},\xi_{1},U_{1}]$
	$\displaystyle\times E[\|\Delta W_{14}\||\nu_{14}||\Delta R_{14}^{\prime}\gamma=s_{3}h_{n},\xi_{1},U_{1}]\times K^{2}(s_{1})K(s_{2})K(s_{3})$
	$\displaystyle\times f_{R\gamma|\xi_{1},U_{1}}(s_{1}h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{2}h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{3}h_{n})ds_{1}ds_{2}ds_{3}\bigg{]}$
	$\displaystyle=O\left(\frac{1}{h_{n}}\right),$

where the last line holds by Assumptions 4, 6, and 8,

	$\displaystyle E[(c^{\prime}S_{12,1}c)^{2}\times c^{\prime}S_{13,1}c]$
	$\displaystyle\leq\frac{C}{h_{n}^{3}}E\bigg{[}\int E[\|\Delta W_{12}\|^{2}\nu_{12}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1},\xi_{1},U_{1}]$
	$\displaystyle\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2},\xi_{1},U_{1}]K^{2}(s_{1}/h_{n})K(s_{2}/h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{1})f_{R\gamma|\xi_{1},U_{1}}(s_{2})ds_{1}ds_{2}\bigg{]}$
	$\displaystyle\leq\frac{C}{h_{n}}E\bigg{[}\int E[\|\Delta W_{12}\|^{2}\nu_{12}^{2}|\Delta R_{12}^{\prime}\gamma=s_{1}h_{n},\xi_{1},U_{1}]$
	$\displaystyle\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2}h_{n},\xi_{1},U_{1}]K^{2}(s_{1})K(s_{2})f_{R\gamma|\xi_{1},U_{1}}(s_{1}h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{2}h_{n})ds_{1}ds_{2}\bigg{]}$
	$\displaystyle=O\left(\frac{1}{h_{n}}\right),$

where the last equality holds from Assumptions 4, 6, and 8, and

	$\displaystyle E[c^{\prime}S_{12,1}c\times c^{\prime}S_{13,1}c]$
	$\displaystyle\leq\frac{C}{h_{n}^{2}}E\bigg{[}\int E[\|\Delta W_{12}\||\nu_{12}||\Delta R_{12}^{\prime}\gamma=s_{1},\xi_{1},U_{1}]$
	$\displaystyle\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2},\xi_{1},U_{1}]K(s_{1}/h_{n})K(s_{2}/h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{1})f_{R\gamma|\xi_{1},U_{1}}(s_{2})ds_{1}ds_{2}\bigg{]}$
	$\displaystyle\leq CE\bigg{[}\int E[\|\Delta W_{12}\||\nu_{12}||\Delta R_{12}^{\prime}\gamma=s_{1}h_{n},\xi_{1},U_{1}]$
	$\displaystyle\times E[\|\Delta W_{13}\||\nu_{13}||\Delta R_{13}^{\prime}\gamma=s_{2}h_{n},\xi_{1},U_{1}]K(s_{1})K(s_{2})f_{R\gamma|\xi_{1},U_{1}}(s_{1}h_{n})f_{R\gamma|\xi_{1},U_{1}}(s_{2}h_{n})ds_{1}ds_{2}\bigg{]}$
	$\displaystyle=O(1),$

where the last equality holds from Assumptions 4, 6, and 8. Thus,

\displaystyle E\left[(c^{\prime}D_{7}c-U_{n,1})^{2}\right]=O\left(\frac{1}{n^{3}h_{n}^{2}}\right)=o(1).

Thus, $c^{\prime}D_{7}c$ is well approximated by $U_{n,1}$ . Also, since $nh_{n}^{2}\to\infty$ with the stated assumption on $h_{n}$ ,

\displaystyle E\left[\left(p_{n}(\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k})\right)^{2}\right]=O\left(E[(c^{\prime}D_{7,123}c)^{2}]\right)=O\left(\frac{n}{nh_{n}^{2}}\right)=o(1)\times O(n),

and by Lemma A.3 of Ahn and Powell (1993), we have

\displaystyle U_{n,1}=E[U_{n,1}]+o_{p}(1).

This shows that

	$\displaystyle c^{\prime}D_{7}c$	$\displaystyle=E[U_{n,1}]+\underbrace{c^{\prime}D_{7}c-U_{n,1}}_{=o_{p}(1)}+\underbrace{U_{n,1}-E[U_{n,1}]}_{=o_{p}(1)}$
		$\displaystyle=E[c^{\prime}D_{7}c]+o_{p}(1)$
		$\displaystyle=c^{\prime}\Sigma_{W\nu,1}c+o_{p}(1).$

This completes the proof that $c^{\prime}D_{7}c\to_{p}c^{\prime}\Sigma_{W\nu,1}c$ as $n\to\infty$ .
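As a quick check of the rate in the approximation step above, suppose the bandwidth satisfies $h_{n}\sim n^{-2/(2k+3)}$ . This rate is an assumption made here for illustration; it is inferred from the relation $nh_{n}^{2}\sim n^{(2k-1)/(2k+3)}$ used in the proof of Lemma 8 rather than quoted from the assumptions. Under it, the mean-squared approximation error vanishes for every $k\geq 1$ :

```latex
\frac{1}{n^{3}h_{n}^{2}}
  \sim n^{-3+\frac{4}{2k+3}}
  = n^{-\frac{6k+5}{2k+3}}
  \to 0 .
```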

The remainder term $\mathcal{O}_{7}$ , in which each term involves $S_{ij,2}$ and/or $S_{ij,3}$ , is of smaller order than $D_{7}$ since $S_{ij,2}$ and $S_{ij,3}$ involve $\|\hat{\beta}_{n}-\beta\|=O_{p}(1/\sqrt{n})$ and $\lambda_{ij}\sim h_{n}$ for large $n$ . By computing in a similar way as before, we can establish that $E[|c^{\prime}\mathcal{O}_{7}c|]=o(1)$ and $Var[c^{\prime}\mathcal{O}_{7}c]=o(1)$ so that $|c^{\prime}\mathcal{O}_{7}c|\to_{p}0$ . Hence,

\displaystyle|c^{\prime}\tilde{\Sigma}_{W\nu,1}c-c^{\prime}\Sigma_{W\nu,1}c|\leq|c^{\prime}D_{7}c-c^{\prime}\Sigma_{W\nu,1}c|+|c^{\prime}\mathcal{O}_{7}c|=o_{p}(1),

which completes the proof for Lemma 7. ∎
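The bounds in the proof above all rely on the same change of variables $s_{j}=h_{n}u_{j}$ , which absorbs one factor of $h_{n}$ per kernel. A generic sketch, in which $g$ is a placeholder (not a symbol from the paper) for a bounded function that is continuous at the origin, such as the product of a conditional expectation and a density, and $K$ is integrable:

```latex
\begin{aligned}
\frac{1}{h_{n}^{2}}\iint g(s_{1},s_{2})
   K\!\left(\frac{s_{1}}{h_{n}}\right)K\!\left(\frac{s_{2}}{h_{n}}\right)ds_{1}ds_{2}
 &= \iint g(h_{n}u_{1},h_{n}u_{2})K(u_{1})K(u_{2})\,du_{1}du_{2} \\
 &\to g(0,0)\left(\int K(u)\,du\right)^{2}
 \quad\text{as } h_{n}\to 0,
\end{aligned}
```

where the limit follows from dominated convergence. Each additional kernel factor removes one power of $h_{n}$ from the prefactor, which is how the orders $O(1/h_{n}^{2})$ , $O(1/h_{n})$ , and $O(1)$ arise in the displays above.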

Proof of Lemma 8

Proof.

Define

\displaystyle\hat{S}_{ij,1}=\frac{2}{h_{n}^{2}}d_{ij1}d_{ij2}K^{\prime}\left(\frac{c_{ij,n}^{*}}{h_{n}}\right)\Delta W_{ij}\Delta R_{ij}^{\prime}\Delta\hat{\epsilon}_{ij},

where $c_{ij,n}^{*}$ is in between $\Delta R_{ij}^{\prime}\gamma$ and $\Delta R_{ij}^{\prime}\hat{\gamma}_{n}$ . In the following argument, we treat $\Delta\hat{\epsilon}_{ij}$ in $\hat{S}_{ij,1}$ as $\nu_{ij}$ because only the existence of higher moments is important and bounding the terms involving $\nu_{ij}$ suffices. By the expression for $\Delta\hat{\epsilon}_{ij}$ , we have

\displaystyle\hat{S}_{ij}=S_{ij,1}+S_{ij,2}+S_{ij,3}+\hat{S}_{ij,1}(\hat{\gamma}_{n}-\gamma).

By the proof of Lemma 7, we know that

\displaystyle{n\choose 3}^{-1}\sum_{i<j<k}\frac{1}{3}(S_{ij,p}S_{ik,p}^{\prime}+S_{ij,p}S_{jk,p}^{\prime}+S_{ik,p}S_{jk,p}^{\prime})=o_{p}(1),

for $p=2,3$ . Then,

	$\displaystyle\|\hat{\Sigma}_{W\nu,1}-\tilde{\Sigma}_{W\nu,1}\|$	$\displaystyle\leq\sum_{p=1}^{3}\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\underbrace{\frac{h_{n}^{2}}{3}(\|S_{ij,p}\|\|\hat{S}_{ik,1}\|+\|S_{ij,p}\|\|\hat{S}_{jk,1}\|+\|S_{ik,p}\|\|\hat{S}_{jk,1}\|)}_{D_{8,1,ijk}^{p}}}_{D_{8,1}^{p}}\frac{\|\hat{\gamma}_{n}-\gamma\|}{h_{n}^{2}}$
		$\displaystyle+\underbrace{\sum_{p=1}^{3}{n\choose 3}^{-1}\sum_{i<j<k}\frac{h_{n}^{2}}{3}(\|\hat{S}_{ij,1}\|\|S_{ik,p}\|+\|\hat{S}_{ij,1}\|\|S_{jk,p}\|+\|\hat{S}_{ik,1}\|\|S_{jk,p}\|)}_{D_{8,2}}\frac{\|\hat{\gamma}_{n}-\gamma\|}{h_{n}^{2}}$
		$\displaystyle+\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{h_{n}^{4}}{3}(\|\hat{S}_{ij,1}\|\|\hat{S}_{ik,1}\|+\|\hat{S}_{ij,1}\|\|\hat{S}_{jk,1}\|+\|\hat{S}_{ik,1}\|\|\hat{S}_{jk,1}\|)}_{D_{8,3}}\frac{\|\hat{\gamma}_{n}-\gamma\|^{2}}{h_{n}^{4}}.$

For $D_{8,1}^{p}$ , it suffices to bound $D_{8,1}^{1}$ , as a similar calculation applies to the other terms. By Assumption 8, for some constant $C>0$ ,

	$\displaystyle\|S_{ij,1}\|$	$\displaystyle\leq\frac{2}{h_{n}}\underbrace{\|\Delta W_{ij}\||\nu_{ij}||K(\Delta R_{ij}^{\prime}\gamma/h_{n})|}_{g_{ij,1}},$
	$\displaystyle\|\hat{S}_{ij,1}\|$	$\displaystyle\leq\frac{C}{h_{n}^{2}}\underbrace{\|\Delta W_{ij}\|\|\Delta R_{ij}\||\nu_{ij}|}_{g_{ij,2}}.$

Thus,

\displaystyle D_{8,1}^{1}\leq\frac{C}{h_{n}}{n\choose 3}^{-1}\sum_{i<j<k}\underbrace{(g_{ij,1}g_{ik,2}+g_{ij,1}g_{jk,2}+g_{ik,1}g_{jk,2})}_{g_{ijk,12}}.

Observe that

	$\displaystyle E[D_{8,1}^{1}]$	$\displaystyle\leq\frac{1}{h_{n}}E[g_{12,1}g_{13,2}]$
		$\displaystyle=\frac{1}{h_{n}}E\Big{[}\int E[\|\Delta W_{12}\||\nu_{12}||\Delta R_{12}^{\prime}\gamma=r,\xi_{1},U_{1}]K(r/h_{n})f_{R\gamma|\xi_{1},U_{1}}(r)dr$
		$\displaystyle\times E[\|\Delta W_{13}\|\|\Delta R_{13}\||\nu_{13}||\xi_{1},U_{1}]\Big{]}$
		$\displaystyle=E\Big{[}\int E[\|\Delta W_{12}\||\nu_{12}||\Delta R_{12}^{\prime}\gamma=rh_{n},\xi_{1},U_{1}]K(r)f_{R\gamma|\xi_{1},U_{1}}(rh_{n})dr$
		$\displaystyle\times E[\|\Delta W_{13}\|\|\Delta R_{13}\||\nu_{13}||\xi_{1},U_{1}]\Big{]}$
		$\displaystyle=O(1),$

where the last equality follows from Assumptions 4, 6, 7, and 8. For variance, the leading term involves covariances between variables with one common node, which has $n\times{n-1\choose 4}$ elements (up to some constant scale):

\displaystyle Var[D_{8,1}^{1}]=O\Bigg{(}\frac{1}{h_{n}^{2}}\times{n\choose 3}^{-2}\times n\times{n-1\choose 4}\times E[D_{8,1,123}^{1}\times D_{8,1,145}^{1}]\Bigg{)}=O\left(\frac{1}{nh_{n}^{2}}\right)=o(1),

since $nh_{n}^{2}\sim n^{(2k-1)/(2k+3)}$ diverges for $k\geq 2$ and

	$\displaystyle E[D_{8,1,123}^{1}\times D_{8,1,145}^{1}]$	$\displaystyle\leq E[g_{123,12}\times g_{145,12}]$
		$\displaystyle\leq E[g_{123,12}^{2}]$
		$\displaystyle\leq CE[\|\Delta W_{12}\|^{2}\|\Delta W_{13}\|^{2}\|\Delta R_{13}\|^{2}\nu_{12}^{2}\nu_{13}^{2}]$
		$\displaystyle=O(1),$

by Cauchy-Schwarz under Assumption 7. Thus, $D_{8,1}^{1}=O_{p}(1)$ , and the same argument gives $D_{8,1}^{p}=O_{p}(1)$ for $p=2,3$ . Similarly, $D_{8,2}=O_{p}(1)$ and $D_{8,3}=O_{p}(1)$ hold.

Notice that

\displaystyle\frac{\|\hat{\gamma}_{n}-\gamma\|}{h_{n}^{2}}=\frac{\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|}{\sqrt{Nh_{n}^{5}}}=o_{p}(1),

since $\sqrt{Nh_{n}^{5}}$ diverges by the hypothesis, and similarly,

\displaystyle\frac{\|\hat{\gamma}_{n}-\gamma\|^{2}}{h_{n}^{4}}=\frac{(\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|)^{2}}{Nh_{n}^{5}}=o_{p}(1).

Hence,

\displaystyle\|\hat{\Sigma}_{W\nu,1}-\tilde{\Sigma}_{W\nu,1}\|\leq O_{p}(1)\times o_{p}(1)+o_{p}(1)=o_{p}(1),

which completes the proof for Lemma 8. ∎
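To see why the normalization by $\sqrt{Nh_{n}^{5}}$ works in the proof above, the following sketch computes its order. It assumes, for illustration only, that $N=n(n-1)/2$ is the number of dyads and that $h_{n}\sim n^{-2/(2k+3)}$ (the latter inferred from $nh_{n}^{2}\sim n^{(2k-1)/(2k+3)}$ ); neither is restated from the paper's assumptions here.

```latex
\begin{aligned}
Nh_{n}^{5} &\sim n^{2}\cdot n^{-\frac{10}{2k+3}}
            = n^{\frac{4k-4}{2k+3}}, \\
k=2:\quad Nh_{n}^{5} &\sim n^{4/7}\to\infty,
\qquad
k=1:\quad Nh_{n}^{5} \sim n^{0}=O(1).
\end{aligned}
```

So the ratio $\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|/\sqrt{Nh_{n}^{5}}$ is $O_{p}(1)$ divided by a diverging sequence exactly when $k\geq 2$ , which matches the condition used for $nh_{n}^{2}$ earlier in the proof.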

Proof of Lemma 9

Proof.

We only show $nh_{n}c_{W}^{\prime}\hat{\Sigma}_{W\nu,1}=o_{p}(1)$ , as the other case follows by taking the transpose. We also write $c_{W}$ as $c$ for short. The statement is proved by showing that Lemmas 7 and 8 hold even after rescaling by $n^{1-\alpha/2}h_{n}$ . First, we show that $c^{\prime}\Sigma_{W\nu,1}c=0$ implies that $c^{\prime}\Sigma_{W\nu,1}=0$ .

Step 1: The implication of $c^{\prime}\Sigma_{W\nu,1}c=0$

Note that

	$\displaystyle\Sigma_{W\nu,1}$	$\displaystyle=f_{R\gamma,2}(0,0)Pr(d_{121}d_{122}d_{131}d_{132}=1|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0)$
		$\displaystyle\times E[\Delta W_{12}\Delta W_{13}^{\prime}\nu_{12}\nu_{13}|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0].$

By Assumption 4, $f_{R\gamma,2}(0,0)>0$ . Also, by the conditional independence of $d_{12t}$ and $d_{13s}$ under Assumption 1,

	$\displaystyle Pr(d_{121}d_{122}d_{131}d_{132}=1|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0)$
	$\displaystyle=E\left[\left(Pr(d_{121}d_{122}=1|\Delta R_{12}^{\prime}\gamma=0,\xi_{1},U_{1})\right)^{2}|\Delta R_{12}^{\prime}\gamma=0\right].$

Since $Pr(d_{121}d_{122}=1|\Delta R_{12}^{\prime}\gamma=0)>0$ is implied by Assumption 3, it must be that

\displaystyle Pr(d_{121}d_{122}d_{131}d_{132}=1|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0)>0,

as otherwise $Pr(d_{121}d_{122}=1|\Delta R_{12}^{\prime}\gamma=0,\xi_{1},U_{1})$ is constant at $0$ , which contradicts the locally positive probability of $d_{121}d_{122}=1$ . Thus, $c^{\prime}\Sigma_{W\nu,1}c=0$ is equivalent to

	$\displaystyle E[c^{\prime}\Delta W_{12}\Delta W_{13}^{\prime}c\nu_{12}\nu_{13}|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0]$
	$\displaystyle=E\left[\left(E[c^{\prime}\Delta W_{12}\nu_{12}|\Delta R_{12}^{\prime}\gamma=0,\xi_{1},U_{1}]\right)^{2}|\Delta R_{12}^{\prime}\gamma=0\right]$
	$\displaystyle=0,$

which, in turn, is equivalent to (by the mean independence of $\nu_{12}$ ),

\displaystyle E[c^{\prime}\Delta W_{12}\nu_{12}|\Delta R_{12}^{\prime}\gamma=0,\xi_{1},U_{1}]=0,

almost surely. Thus, $c^{\prime}\Sigma_{W\nu,1}c=0$ implies that

	$\displaystyle c^{\prime}\Sigma_{W\nu,1}$	$\displaystyle=f_{R\gamma,2}(0,0)Pr(d_{121}d_{122}d_{131}d_{132}=1|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0)$
		$\displaystyle\times E\left[E[c^{\prime}\Delta W_{12}\nu_{12}|\Delta R_{12}^{\prime}\gamma=0,\xi_{1},U_{1}]E[\Delta W_{13}^{\prime}\nu_{13}|\Delta R_{13}^{\prime}\gamma=0,\xi_{1},U_{1}]|\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0\right]$
		$\displaystyle=0.$
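The last implication in Step 1 can be spelled out as follows. The shorthand $Z$ and $V$ is introduced only for this sketch and is not notation from the paper; all expectations are conditional on $\Delta R_{12}^{\prime}\gamma=\Delta R_{13}^{\prime}\gamma=0$ , and $E\|V\|^{2}$ is assumed finite.

```latex
\begin{aligned}
Z &:= E[c^{\prime}\Delta W_{12}\nu_{12}\mid\Delta R_{12}^{\prime}\gamma=0,\xi_{1},U_{1}],
\qquad
V := E[\Delta W_{13}\nu_{13}\mid\Delta R_{13}^{\prime}\gamma=0,\xi_{1},U_{1}], \\
E[Z^{2}] &= 0 \;\Longrightarrow\; Z=0 \ \text{almost surely}, \\
\left\|E[ZV^{\prime}]\right\| &\leq \left(E[Z^{2}]\right)^{1/2}\left(E\|V\|^{2}\right)^{1/2} = 0 .
\end{aligned}
```

The final line is the Cauchy-Schwarz inequality; it shows that a degenerate quadratic form $c^{\prime}\Sigma_{W\nu,1}c=0$ forces the whole row vector $c^{\prime}\Sigma_{W\nu,1}$ to vanish, since the density and probability factors multiplying it are finite.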

Step 2: $nh_{n}c^{\prime}\tilde{\Sigma}_{W\nu,1}=o_{p}(1)$

Remember that

\displaystyle\tilde{\Sigma}_{W\nu,1}=D_{7}+\mathcal{O}_{7}.

For $D_{7}$ , from the calculation in the proof of Lemma 3, we have that

\displaystyle E[c^{\prime}D_{7}]=E[c^{\prime}S_{12,1}S_{13,1}^{\prime}]=O(h_{n}^{k}),

which shows $nh_{n}E[c^{\prime}D_{7}]=O(nh_{n}^{k+1})=o(1)$ under the hypothesis. For any non-zero vector $a\in\mathbb{R}^{q_{w}}$ , redefining $U_{n}$ and $U_{n,1}$ with the kernel $p_{n}(\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k})=E[c^{\prime}D_{7,ijk}a|\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k}]$ , we can repeat the calculation in the proof of Lemma 7 to get

\displaystyle E[(nh_{n}c^{\prime}D_{7}a-nh_{n}U_{n,1})^{2}]=n^{2}h_{n}^{2}\times O\left(\frac{1}{n^{3}h_{n}^{2}}\right)=o(1).

Also, $E[nh_{n}p_{n}(\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k})]=E[nh_{n}c^{\prime}D_{7}]=o(1)$ and

\displaystyle E[(nh_{n}p_{n}(\boldsymbol{\xi}_{i},\boldsymbol{\xi}_{j},\boldsymbol{\xi}_{k}))^{2}]=O(n),

so that by Lemma A.3 of Ahn and Powell (1993),

\displaystyle nh_{n}U_{n}=nh_{n}E[U_{n,1}]+o_{p}(1)=o_{p}(1).

This shows that, since $a$ is arbitrary,

\displaystyle nh_{n}c^{\prime}D_{7}=o_{p}(1).

For the remainder term $\mathcal{O}_{7}$ , this should again be of smaller order than $nh_{n}c^{\prime}D_{7}$ since $\beta-\hat{\beta}_{n}=O_{p}(1/\sqrt{n})$ and $\lambda_{ij}=\Delta R_{ij}^{\prime}\gamma\Lambda_{ij}$ is locally $O(h_{n}^{k+1})$ under the smoothing kernel and smoothness conditions on the density. For example, one of the elements in $\mathcal{O}_{7}$ is given by

	$\displaystyle nh_{n}{n\choose 3}^{-1}\sum_{i<j<k}\frac{1}{3}(S_{ij,2}S_{ik,2}^{\prime}+S_{ij,2}S_{jk,2}^{\prime}+S_{ik,2}S_{jk,2}^{\prime})$
	$\displaystyle\leq{n\choose 3}^{-1}\sum_{i<j<k}\frac{4}{3h_{n}^{2}}\Bigg{(}\|\Delta W_{ij}\|^{2}\|\Delta W_{ik}\|^{2}|K(\Delta R_{ij}^{\prime}\gamma/h_{n})||K(\Delta R_{ik}^{\prime}\gamma/h_{n})|$
	$\displaystyle+\|\Delta W_{ij}\|^{2}\|\Delta W_{jk}\|^{2}|K(\Delta R_{ij}^{\prime}\gamma/h_{n})||K(\Delta R_{jk}^{\prime}\gamma/h_{n})|$
	$\displaystyle+\|\Delta W_{ik}\|^{2}\|\Delta W_{jk}\|^{2}|K(\Delta R_{ik}^{\prime}\gamma/h_{n})||K(\Delta R_{jk}^{\prime}\gamma/h_{n})|\Bigg{)}nh_{n}\|\beta-\hat{\beta}_{n}\|^{2}$
	$\displaystyle=O_{p}(h_{n})=o_{p}(1),$

where the last line follows by the same calculation as before, which shows that the summation part is $O_{p}(1)$ , together with $\|\beta-\hat{\beta}_{n}\|^{2}=O_{p}(1/n)$ from Theorem 1. Similarly, we can show the negligibility of the other elements in $\mathcal{O}_{7}$ . This finishes Step 2.

Step 3: $n^{1-\alpha/2}h_{n}\|\hat{\Sigma}_{W\nu,1}-\tilde{\Sigma}_{W\nu,1}\|=o_{p}(1)$

By the proof of Lemma 8, we have

\displaystyle n^{1-\alpha/2}h_{n}\|\hat{\Sigma}_{W\nu,1}-\tilde{\Sigma}_{W\nu,1}\|\leq O_{p}(1)\Bigg{(}\frac{n^{1-\alpha/2}\|\hat{\gamma}_{n}-\gamma\|}{h_{n}}+\frac{n^{1-\alpha/2}\|\hat{\gamma}_{n}-\gamma\|^{2}}{h_{n}^{3}}\Bigg{)}.

Observe that

	$\displaystyle\frac{n^{1-\alpha/2}\|\hat{\gamma}_{n}-\gamma\|}{h_{n}}$	$\displaystyle=O\Bigg{(}\frac{\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|}{\sqrt{n^{\alpha}h_{n}^{3}}}\Bigg{)}=o_{p}(1),$
	$\displaystyle\frac{n^{1-\alpha/2}\|\hat{\gamma}_{n}-\gamma\|^{2}}{h_{n}^{3}}$	$\displaystyle=O\Bigg{(}\frac{\|\sqrt{Nh_{n}}(\hat{\gamma}_{n}-\gamma)\|^{2}}{\sqrt{n^{2+\alpha}h_{n}^{7}}}\Bigg{)}=o_{p}(1),$

by Assumption 10 and

	$\displaystyle n^{\alpha}h_{n}^{3}$	$\displaystyle\sim n^{(\alpha(2k+3)-6)/(2k+3)}\to\infty,$
	$\displaystyle n^{2+\alpha}h_{n}^{7}$	$\displaystyle\sim n^{((2k+3)\alpha+4k-8)/(2k+3)}\to\infty,$

for $\alpha\in[6/(2k+3),1)$ . Hence,

\displaystyle n^{1-\alpha/2}h_{n}\|\hat{\Sigma}_{W\nu,1}-\tilde{\Sigma}_{W\nu,1}\|=o_{p}(1).

Steps 1-3 complete the proof of Lemma 9. ∎
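The two exponents used in Step 3 can be checked directly. The sketch below assumes $h_{n}\sim n^{-2/(2k+3)}$ , a rate inferred from the relation $nh_{n}^{2}\sim n^{(2k-1)/(2k+3)}$ in the proof of Lemma 8 rather than restated from the assumptions.

```latex
\begin{aligned}
n^{\alpha}h_{n}^{3} &\sim n^{\alpha-\frac{6}{2k+3}}
   = n^{\frac{\alpha(2k+3)-6}{2k+3}}, \\
n^{2+\alpha}h_{n}^{7} &\sim n^{2+\alpha-\frac{14}{2k+3}}
   = n^{\frac{(2k+3)\alpha+4k-8}{2k+3}}, \\
k=2:\quad n^{\alpha}h_{n}^{3} &\sim n^{\alpha-6/7},
\qquad n^{2+\alpha}h_{n}^{7}\sim n^{\alpha}.
\end{aligned}
```

For $k=2$ the second sequence diverges for any $\alpha>0$ , while the first has a strictly positive exponent only when $\alpha>6/7$ , so the first condition is the binding one.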

Proof of Lemma 10

Proof.

Write $c_{W}$ as $c$ for short. It suffices to show that $nh_{n}\|c^{\prime}\hat{\Sigma}_{W\nu,1}c-c^{\prime}\tilde{\Sigma}_{W\nu,1}c\|=o_{p}(1)$ , since we already showed, in the proof of Lemma 9, that $nh_{n}c^{\prime}\tilde{\Sigma}_{W\nu,1}=o_{p}(1)$ , which implies $nh_{n}c^{\prime}\tilde{\Sigma}_{W\nu,1}c=o_{p}(1)$ . To save space, in the following argument, we treat $\Delta\hat{\epsilon}_{ij}$ as $\nu_{ij}$ . The other terms are similarly bounded using the properties of $\beta-\hat{\beta}_{n}$ and $\lambda_{ij}$ .

Re-define

	$\displaystyle S_{ij}$	$\displaystyle=\frac{2}{h_{n}}d_{ij1}d_{ij2}K\left(\frac{\Delta R_{ij}^{\prime}\gamma}{h_{n}}\right)c^{\prime}\Delta W_{ij}\nu_{ij},$
	$\displaystyle\hat{S}_{ij,1}$	$\displaystyle=\frac{2}{h_{n}^{2}}d_{ij1}d_{ij2}K^{\prime}\left(\frac{\Delta R_{ij}^{\prime}\gamma}{h_{n}}\right)c^{\prime}\Delta W_{ij}\Delta R_{ij}^{\prime}\nu_{ij},$
	$\displaystyle\hat{S}_{ij,2}$	$\displaystyle=\frac{1}{h_{n}^{3}}d_{ij1}d_{ij2}K^{\prime\prime}\left(\frac{c_{ij,n}^{*}}{h_{n}}\right)\Delta R_{ij}c^{\prime}\Delta W_{ij}\Delta R_{ij}^{\prime}\nu_{ij},$

where $c_{ij,n}^{*}$ is in between $\Delta R_{ij}^{\prime}\gamma$ and $\Delta R_{ij}^{\prime}\hat{\gamma}_{n}$ . We have that

\displaystyle\hat{S}_{ij}=S_{ij}+\hat{S}_{ij,1}(\hat{\gamma}_{n}-\gamma)+(\hat{\gamma}_{n}-\gamma)^{\prime}\hat{S}_{ij,2}(\hat{\gamma}_{n}-\gamma).

Note that, for some constant $C>0$ , $\|\hat{S}_{ij,1}\|\leq Ch_{n}^{-2}g_{ij,2}$ as before and

\displaystyle\|\hat{S}_{ij,2}\|\leq\frac{C}{h_{n}^{3}}\underbrace{\|\Delta W_{ij}\|\|\Delta R_{ij}\|^{2}|\nu_{ij}|}_{g_{ij,3}}.

Observe that

	$\displaystyle c^{\prime}\hat{\Sigma}_{W\nu,1}c-c^{\prime}\tilde{\Sigma}_{W\nu,1}c$
	$\displaystyle\leq\frac{(\hat{\gamma}_{n}-\gamma)^{\prime}}{\sqrt{h_{n}}}\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\underbrace{\frac{\sqrt{h_{n}}}{3}(S_{ij}\hat{S}_{ik,1}^{\prime}+S_{ij}\hat{S}_{jk,1}^{\prime}+S_{ik}\hat{S}_{jk,1}^{\prime})}_{D_{10,1,ijk}}}_{D_{10,1}}$
	$\displaystyle+\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{\sqrt{h_{n}}}{3}(\hat{S}_{ij,1}S_{ik}+\hat{S}_{ij,1}S_{jk}+\hat{S}_{ik,1}S_{jk})}_{D_{10,2}}\frac{(\hat{\gamma}_{n}-\gamma)}{\sqrt{h_{n}}}$
	$\displaystyle+\frac{(\hat{\gamma}_{n}-\gamma)^{\prime}}{h_{n}^{3/2}}\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{h_{n}^{3}}{3}(S_{ij}\hat{S}_{ik,2}^{\prime}+S_{ij}\hat{S}_{jk,2}^{\prime}+S_{ik}\hat{S}_{jk,2}^{\prime})}_{D_{10,3}}\frac{(\hat{\gamma}_{n}-\gamma)}{h_{n}^{3/2}}$
	$\displaystyle+\frac{(\hat{\gamma}_{n}-\gamma)^{\prime}}{h_{n}^{3/2}}\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{h_{n}^{3}}{3}(\hat{S}_{ij,2}S_{ik}+\hat{S}_{ij,2}S_{jk}+\hat{S}_{ik,2}S_{jk})}_{D_{10,4}}\frac{(\hat{\gamma}_{n}-\gamma)}{h_{n}^{3/2}}$
	$\displaystyle+C^{2}\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{1}{3}(g_{ij,2}g_{ik,3}+g_{ij,2}g_{jk,3}+g_{ik,2}g_{jk,3})}_{D_{10,5}}\frac{\|\hat{\gamma}_{n}-\gamma\|^{3}}{h_{n}^{5}}$
	$\displaystyle+C^{2}\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{1}{3}(g_{ij,3}g_{ik,2}+g_{ij,3}g_{jk,2}+g_{jk,3}g_{ik,2})}_{D_{10,6}}\frac{\|\hat{\gamma}_{n}-\gamma\|^{3}}{h_{n}^{5}}$
	$\displaystyle+\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{h_{n}^{2}}{3}(\|\hat{S}_{ij,1}\|\|\hat{S}_{ik,1}\|+\|\hat{S}_{ij,1}\|\|\hat{S}_{jk,1}\|+\|\hat{S}_{jk,1}\|\|\hat{S}_{ik,1}\|)}_{D_{10,7}}\frac{\|\hat{\gamma}_{n}-\gamma\|^{2}}{h_{n}^{2}}$
	$\displaystyle+C^{2}\underbrace{{n\choose 3}^{-1}\sum_{i<j<k}\frac{1}{3}(g_{ij,3}g_{ik,3}+g_{ij,3}g_{jk,3}+g_{jk,3}g_{ik,3})}_{D_{10,8}}\frac{\|\hat{\gamma}_{n}-\gamma\|^{4}}{h_{n}^{6}}.$

First we stochastically bound $D_{10,1}$ and $D_{10,2}$ . For any vector $a\in\mathbb{R}^{q_{r}}$ and some constant $C>0$ ,

	$\displaystyle E[a^{\prime}D_{10,1}]$	$\displaystyle=\frac{\sqrt{h_{n}}}{h_{n}^{3}}E\left[E[c^{\prime}S_{12}|\xi_{1},U_{1}]E[a^{\prime}\hat{S}_{13}c|\xi_{1},U_{1}]\right]$
		$\displaystyle\leq\frac{1}{h_{n}^{3/2}}\left\{E\left[\int E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}|\Delta R_{12}^{\prime}\gamma=s_{1}h_{n},\xi_{1},U_{1}]^{2}K(s_{1})f_{R\gamma|\xi_{1},U_{1}}(s_{1}h_{n})ds_{1}\right]\right\}^{1/2}$
		$\displaystyle\times CE[\|\Delta W_{13}\|^{2}\|\Delta R_{13}\|^{2}\nu_{13}^{2}]^{1/2}$
		$\displaystyle=O(h_{n}^{(2k-1)/2})=o(1),$

where the first line holds from the conditional independence, the inequality follows from Cauchy-Schwarz and Assumption 8, and the final line holds from the implication of $c^{\prime}\Sigma_{W\nu,1}c=0$ and Assumptions 4, 6, 7, and 8. Repeating the calculation in the proof for Lemma 7 and adjusting for covariances with one index in common,

\displaystyle Var[a^{\prime}D_{10,1}]=O\left(\frac{1}{n^{2}h_{n}^{3}}\right)+O\left(\frac{h_{n}E[S_{12}a^{\prime}\hat{S}_{13,1}^{\prime}S_{14}a^{\prime}\hat{S}_{15,1}^{\prime}]}{n}\right)=o(1),

where the last equality holds from the stated assumption on $h_{n}$ for $k\geq 1$ and

	$\displaystyle E[S_{12}a^{\prime}\hat{S}_{13,1}^{\prime}S_{14}a^{\prime}\hat{S}_{15,1}^{\prime}]$
	$\displaystyle=\frac{16}{h_{n}^{2}}E\bigg{[}\int E[d_{121}d_{122}c^{\prime}\Delta W_{12}\nu_{12}|\Delta R_{12}^{\prime}\gamma=h_{n}s_{1},\boldsymbol{\xi}_{1}]$
	$\displaystyle\times E[d_{131}d_{132}c^{\prime}\Delta W_{13}a^{\prime}\Delta R_{13}\nu_{13}|\Delta R_{13}^{\prime}\gamma=h_{n}s_{2},\boldsymbol{\xi}_{1}]$
	$\displaystyle\times E[d_{141}d_{142}c^{\prime}\Delta W_{14}\nu_{14}|\Delta R_{14}^{\prime}\gamma=h_{n}s_{3},\boldsymbol{\xi}_{1}]$
	$\displaystyle\times E[d_{151}d_{152}c^{\prime}\Delta W_{15}a^{\prime}\Delta R_{15}\nu_{15}|\Delta R_{15}^{\prime}\gamma=h_{n}s_{4},\boldsymbol{\xi}_{1}]$
	$\displaystyle\times K(s_{1})K^{\prime}(s_{2})K(s_{3})K^{\prime}(s_{4})\Pi_{i=1}^{4}f_{R\gamma|\xi_{1},U_{1}}(s_{i}h_{n})ds_{1}ds_{2}ds_{3}ds_{4}\bigg{]}$
	$\displaystyle=O\left(\frac{1}{h_{n}^{2}}\right)$

by Assumptions 4, 6, and 8. Thus, $D_{10,1}=o_{p}(1)$ . Similarly, we have $D_{10,2}=o_{p}(1)$ .

Next, we stochastically bound $D_{10,3}$ and $D_{10,4}$ . For any finite $a,b\in\mathbb{R}^{q_{r}}$ , for some $C>0$ ,

	$\displaystyle E[a^{\prime}D_{10,3}b]$
	$\displaystyle\leq\frac{C}{h_{n}}E[|S_{12}|g_{13,3}]$
	$\displaystyle=CE\bigg{[}\int E[d_{121}d_{122}|c^{\prime}\Delta W_{12}||\nu_{12}||\Delta R_{12}^{\prime}\gamma=h_{n}s_{1},\boldsymbol{\xi}_{1}]$
	$\displaystyle\times E[g_{13,3}|\boldsymbol{\xi}_{1}]$
	$\displaystyle\times K(s_{1})f_{R\gamma|\xi_{1},U_{1}}(s_{1}h_{n})ds_{1}\bigg{]}$
	$\displaystyle=O(1),$

where the last equality holds from Assumptions 4, 6, 7 and 8. The variance is calculated similarly as before:

\displaystyle Var[a^{\prime}D_{10,3}b]=O\left(\frac{1}{n^{2}h_{n}}\right)+O\left(\frac{h_{n}}{n}\right)=o(1).

Thus, $D_{10,3}=O_{p}(1)$ . Similarly, $D_{10,4}=O_{p}(1)$ .

$D_{10,5},D_{10,6}$ , and $D_{10,8}$ are all $O_{p}(1)$ by a computation similar to that in the proof of Lemma 7.

$D_{10,7}$ is stochastically bounded as follows. Observe that

	$\displaystyle E[D_{10,7}]$	$\displaystyle=\frac{4}{h_{n}^{2}}E[\|\hat{S}_{12,1}\|\|\hat{S}_{13,1}\|]$
		$\displaystyle\leq 4\int E[|c^{\prime}\Delta W_{12}||c^{\prime}\Delta W_{13}|\|\Delta R_{12}\|\|\Delta R_{13}\||\nu_{12}||\nu_{13}||\Delta R_{12}^{\prime}\gamma=h_{n}s_{1},\Delta R_{13}^{\prime}\gamma=h_{n}s_{2}]$
		$\displaystyle\times K^{\prime}(s_{1})K^{\prime}(s_{2})f_{R\gamma,2}(h_{n}s_{1},h_{n}s_{2})ds_{1}ds_{2}$
		$\displaystyle=O(1),$

where the last line holds from Assumptions 4, 6, and 8. The variance is calculated similarly as before:

\displaystyle Var[D_{10,7}]=O\left(\frac{1}{n^{2}h_{n}^{2}}\right)+O\left(\frac{1}{n}\right)=o(1).

Thus, $D_{10,7}=O_{p}(1)$ .

Finally, the above implies

	$\displaystyle nh_{n}|c^{\prime}\hat{\Sigma}_{W\nu,1}c-c^{\prime}\tilde{\Sigma}_{W\nu,1}c|$
	$\displaystyle\leq O_{p}(1)\times\left(\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|}{h_{n}}+\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|^{2}}{h_{n}^{3}}+\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|^{3}}{h_{n}^{5}}+\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|^{4}}{h_{n}^{6}}\right)$	$\displaystyle=o_{p}(1),$

because

	$\displaystyle\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|}{h_{n}}$	$\displaystyle=O(\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|)=o_{p}(1),$
	$\displaystyle\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|^{2}}{h_{n}^{3}}$	$\displaystyle=O\left(\frac{(\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|)^{2}}{nh_{n}^{3}}\right)=o_{p}(1),$
	$\displaystyle\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|^{3}}{h_{n}^{5}}$	$\displaystyle=O\left(\frac{(\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|)^{3}}{n^{2}h_{n}^{11/2}}\right)=o_{p}(1),$
	$\displaystyle\frac{nh_{n}\|\hat{\gamma}_{n}-\gamma\|^{4}}{h_{n}^{6}}$	$\displaystyle=O\left(\frac{(\sqrt{Nh_{n}}\|\hat{\gamma}_{n}-\gamma\|)^{4}}{n^{3}h_{n}^{7}}\right)=o_{p}(1),$

by Assumption 10, with $nh_{n}^{3}=O(n^{(2k-3)/(2k+3)})$ , $n^{2}h_{n}^{11/2}=O(n^{(4k-5)/(2k+3)})$ , and $n^{3}h_{n}^{7}=O(n^{(6k-5)/(2k+3)})$ all diverging for $k\geq 2$ . This completes the proof for Lemma 10. ∎
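The three rates in the last step can be verified by direct arithmetic. The sketch assumes $h_{n}\sim n^{-2/(2k+3)}$ , which is inferred from the relation $nh_{n}^{2}\sim n^{(2k-1)/(2k+3)}$ in the proof of Lemma 8 and is not restated from the assumptions here.

```latex
\begin{aligned}
nh_{n}^{3} &\sim n^{1-\frac{6}{2k+3}} = n^{\frac{2k-3}{2k+3}}, &
n^{2}h_{n}^{11/2} &\sim n^{2-\frac{11}{2k+3}} = n^{\frac{4k-5}{2k+3}}, &
n^{3}h_{n}^{7} &\sim n^{3-\frac{14}{2k+3}} = n^{\frac{6k-5}{2k+3}}, \\
k=2:\quad nh_{n}^{3} &\sim n^{1/7}, &
n^{2}h_{n}^{11/2} &\sim n^{3/7}, &
n^{3}h_{n}^{7} &\sim n .
\end{aligned}
```

With $k=1$ the first exponent is $-1/5$ , so $nh_{n}^{3}$ does not diverge; this is why the argument requires $k\geq 2$ .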
