
Neural Networks Perform Sufficient Dimension Reduction

Shuntuo Xu, Zhou Yu (corresponding author)
Abstract

This paper investigates the connection between neural networks and sufficient dimension reduction (SDR), demonstrating that neural networks inherently perform SDR in regression tasks under appropriate rank regularizations. Specifically, the weights in the first layer span the central mean subspace. We establish the statistical consistency of the neural network-based estimator for the central mean subspace, underscoring the suitability of neural networks in addressing SDR-related challenges. Numerical experiments further validate our theoretical findings, and highlight the underlying capability of neural networks to facilitate SDR compared to the existing methods. Additionally, we discuss an extension to unravel the central subspace, broadening the scope of our investigation.

Introduction

Neural networks have achieved significant success in a tremendous variety of applications (Lee and Abu-El-Haija 2017; Silver et al. 2018; Jumper et al. 2021; Brandes et al. 2022; Thirunavukarasu et al. 2023). As the most essential foundation, a feedforward neural network is typically constructed by a series of linear transformations and nonlinear activations. To be specific, a function $f$ implemented by a feedforward neural network with $L$ layers can be represented as

f(x)=\phi_{L}\circ\sigma_{L-1}\circ\phi_{L-1}\circ\cdots\circ\sigma_{0}\circ\phi_{0}(x). \quad (1)

Here, $\circ$ is the functional composition operator, $\phi_{i}(z)=W_{i}^{\top}z+b_{i}$ denotes a linear transformation, and $\sigma_{i}$ represents an elementwise nonlinear activation function. Despite the clear architecture, formula (1) provides limited insight into how the information within the input data is processed by neural networks. To comprehensively understand how neural networks retrieve task-relevant information, there have been extensive efforts towards unraveling their interpretability (Doshi-Velez and Kim 2017; Guidotti et al. 2018; Zhang et al. 2021). For instance, post-hoc interpretability (Lipton 2018) focuses on the predictions generated by neural networks while disregarding the detailed mechanism and feature importance. Ghorbani, Abid, and Zou (2019) highlighted the fragility of this type of interpretation, as indistinguishable perturbations could result in completely different interpretations.

For the sake of striking a balance between the complexity of representation and the power of prediction, Tishby and Zaslavsky (2015) and Saxe et al. (2019) pioneered the interpretability of deep neural networks via the information bottleneck theory. Ghosh (2022) further linked the information bottleneck to sufficient dimension reduction (SDR) (Li 1991; Cook 1996), a rapidly developing research field in regression diagnostics, data visualization, pattern recognition, and beyond.

In this paper, we provide a theoretical understanding of neural networks for representation learning from the SDR perspective. Let $x\in\mathbb{R}^{p}$ represent the covariates and $y\in\mathbb{R}$ represent the response. Consider the following regression model

y=f_{0}(B_{0}^{\top}x)+\epsilon, \quad (2)

where $B_{0}\in\mathbb{R}^{p\times d}$ is a nonrandom matrix with $d\leq p$, $f_{0}:\mathbb{R}^{d}\to\mathbb{R}$ is an unknown function, and $\epsilon$ is the noise such that $\mathbb{E}(\epsilon|x)=0$ and $\mathrm{Var}(\epsilon|x)=\nu^{2}$ for some positive constant $\nu$. Intuitively, the semiparametric and potentially multi-index model (2) asserts that the core information for regression is encoded in the low-dimensional linear representation $B_{0}^{\top}x$. In the literature of SDR, model (2) has been extensively studied. Based on model (2), Cook and Li (2002) proposed the objective of sufficient mean dimension reduction as

y\perp\!\!\!\perp\mathbb{E}(y|x)\,|\,B^{\top}x \quad (3)

for some matrix $B$, where $\perp\!\!\!\perp$ means statistical independence. Denote the column space spanned by $B$ as $\Pi_{B}$, which is commonly referred to as a mean dimension-reduction subspace. It is evident that the matrix $B$ satisfying (3) is far from unique. Hence, we focus on the intersection of all possible mean dimension-reduction subspaces, which is itself a mean dimension-reduction subspace under mild conditions (e.g., when the domain of $x$ is open and convex; see Cook and Li (2002)), and term it the central mean subspace. Under certain assumptions, $B_{0}$ defined in model (2) satisfies condition (3) and spans the central mean subspace, denoted as $\Pi_{B_{0}}$. Statistical estimation and inference about the central mean subspace is the primary goal of SDR. Popular statistical methods for recovering the central mean subspace include ordinary least squares (Li and Duan 1989), principal Hessian directions (Li 1992), minimum average variance estimation (Xia et al. 2002), the semiparametric approach (Ma and Zhu 2013), generalized kernel-based dimension reduction (Fukumizu and Leng 2014), and many others. Although numerous studies have demonstrated the ability of neural networks to approximate complex functions (Hornik, Stinchcombe, and White 1989; Barron 1993; Yarotsky 2017; Shen, Yang, and Zhang 2021) and to adapt to low-dimensional structures (Bach 2017; Bauer and Kohler 2019; Schmidt-Hieber 2020; Abbe, Adsera, and Misiakiewicz 2022; Jiao et al. 2023; Troiani et al. 2024), it is of subsequent interest to investigate whether neural networks are capable of correctly identifying the intrinsic structure encapsulated in $B_{0}$, thereby deepening the interpretation of neural networks.

Our study was inspired by the observation that, under rank regularization, the weight matrix in the first layer, i.e., $W_{1}$, could accurately detect the presence of $B_{0}$ in a toy data set. Specifically, consider the toy model $y=(B_{0}^{\top}x)^{3}+\epsilon$ where $x\sim\mathrm{Uniform}([0,1]^{10})$, $B_{0}=(1,-2,0,\ldots,0)^{\top}$ and $\epsilon\sim\mathrm{Normal}(0,0.1^{2})$. We trained neural networks using the least squares loss with $W_{1}=W_{11}W_{12}$, where $W_{11}\in\mathbb{R}^{10\times q}$ and $W_{12}\in\mathbb{R}^{q\times 64}$, for $q=1,\ldots,10$. Evidently, the rank of $W_{1}$ did not exceed $q$. It was then observed that for each $q$, (i) $B_{0}$ was closely contained within the column space of $W_{11}$, as the absolute cosine similarity between $B_{0}$ and its projection on $\Pi_{W_{11}}$ was close to 1, and (ii) the leading eigenvector of $W_{1}W_{1}^{\top}$ closely aligned with $B_{0}$ (see Figure 1). This observation indicates that the first layer of a neural network may potentially discover the underlying low-dimensional intrinsic structure.

Figure 1: Absolute cosine similarity between (i) $B_{0}$ and its projection on $\Pi_{W_{11}}$ (the dotted line with triangle marks), and (ii) $B_{0}$ and the leading eigenvector of $W_{1}W_{1}^{\top}$ (the solid line with square marks).
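For concreteness, the following PyTorch sketch mimics the toy experiment: the first layer is factorized as $W_{1}=W_{11}W_{12}$ to cap its rank at $q$, the network is trained with the least squares loss, and the two diagnostics above are computed. The hidden architecture, optimizer, learning rate, and number of iterations are illustrative assumptions rather than the exact configuration behind Figure 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model: y = (B0^T x)^3 + eps, x ~ Uniform([0,1]^10), B0 = (1, -2, 0, ..., 0)^T
n, p, q = 2000, 10, 3                        # q caps the rank of W1 = W11 W12
B0 = torch.zeros(p); B0[0], B0[1] = 1.0, -2.0
x = torch.rand(n, p)
y = (x @ B0) ** 3 + 0.1 * torch.randn(n)

class RankRegularizedNet(nn.Module):
    def __init__(self, p, q, h=64):
        super().__init__()
        self.W11 = nn.Linear(p, q, bias=False)   # realizes W11 in R^{p x q}
        self.W12 = nn.Linear(q, h)               # realizes W12 in R^{q x h}
        self.body = nn.Sequential(nn.ReLU(), nn.Linear(h, h), nn.ReLU(), nn.Linear(h, 1))

    def forward(self, x):
        return self.body(self.W12(self.W11(x))).squeeze(-1)

net = RankRegularizedNet(p, q)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):                            # plain least squares training
    opt.zero_grad()
    ((net(x) - y) ** 2).mean().backward()
    opt.step()

with torch.no_grad():
    W11 = net.W11.weight.T                       # p x q, following the paper's convention
    W1 = W11 @ net.W12.weight.T                  # p x h, the effective first-layer weight
    proj_B0 = W11 @ torch.linalg.pinv(W11) @ B0  # projection of B0 onto the span of W11
    cos_proj = torch.abs(proj_B0 @ B0) / (proj_B0.norm() * B0.norm())
    lead_vec = torch.linalg.eigh(W1 @ W1.T).eigenvectors[:, -1]
    cos_eig = torch.abs(lead_vec @ B0) / B0.norm()
    print(f"|cos| with projection: {cos_proj.item():.3f}, "
          f"with leading eigenvector: {cos_eig.item():.3f}")
```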

It is important to note that the application of neural networks for estimating the central mean subspace based on model (2) was previously explored by Kapla, Fertl, and Bura (2022) through a two-stage method. The first stage focused on obtaining a preliminary estimate of $\Pi_{B_{0}}$, which subsequently served as an initial point for the joint estimation of $\Pi_{B_{0}}$ and $f_{0}$ in the second stage. Given our toy example, however, it is prudent to critically evaluate the necessity of the first stage. Furthermore, their work lacks a comprehensive theoretical guarantee. Additionally, another related work by Liang, Sun, and Liang (2022) concentrated on seeking nonlinear sufficient representations. In contrast, we focus more on revealing the fundamental nature of neural networks.

In this paper, we show that, with suitable rank regularization, the first layer of a feedforward neural network conducts SDR in a regression task, in the sense that $d(W_{11},B_{0})\to 0$ in probability for a certain distance metric $d(\cdot,\cdot)$. Furthermore, numerical experiments provide empirical evidence supporting this result, while demonstrating the efficiency of neural networks in addressing the issue of SDR.

Throughout this paper, we use $\|v\|_{2}$ to represent the Euclidean norm of a vector $v$. For a matrix $A$, $\|A\|_{F}$ is the Frobenius norm of $A$, $\pi_{A}=A(A^{\top}A)^{-}A^{\top}$ denotes the projection matrix of $A$ where $A^{-}$ is the generalized inverse of $A$, and $\Pi_{A}$ stands for the linear space spanned by the columns of $A$. For a measurable function $f:\mathcal{X}\to\mathbb{R}$, $\|f\|_{L^{2}(\mu)}$ represents the $L^{2}$ norm of $f$ with respect to a given probability measure $\mu$, and $\|f\|_{L^{\infty}(\mathcal{X})}=\sup_{x\in\mathcal{X}}|f(x)|$ represents the supremum norm of $f$ over a set $\mathcal{X}$. $\mathcal{B}(\mathcal{X})$ is the unit ball induced by $\mathcal{X}$, such that $\mathcal{B}(\mathcal{X})\subset\mathcal{X}$ and $\|v\|_{2}\leq 1$ for any $v\in\mathcal{B}(\mathcal{X})$.

Theoretical Justifications

Population-Level Unbiasedness

Suppose the true intrinsic dimension $d$ defined in model (2) is known and the covariates satisfy $x\in\mathcal{B}([0,1]^{p})$. As $B_{0}$ is not identifiable in (2) and (3), it is assumed without loss of generality that $B_{0}^{\top}B_{0}=I_{d}$, where $I_{d}$ is the $d\times d$ identity matrix. By defining $\Psi_{d}=\{B\in\mathbb{R}^{p\times d}:B^{\top}B=I_{d}\}$, we have $B_{0}\in\Psi_{d}$. In this paper, we consider the following neural network function class

\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}=\Big\{f(x)=\phi_{L}\circ\sigma_{L-1}\circ\phi_{L-1}\circ\cdots\circ\sigma_{0}\circ\phi_{0}(B^{\top}x):
B\in\Psi_{d},\ \phi_{i}(z)=W_{i}^{\top}z+b_{i},\ W_{i}\in\mathbb{R}^{d_{i}\times d_{i+1}},\ i=0,\ldots,L,
\text{with } L=\mathcal{L},\ \max_{i=1,\dots,L}d_{i}=\mathcal{M},\ \sum_{i=0}^{L}(d_{i}+1)d_{i+1}=\mathcal{S},\ \|f\|_{L^{\infty}(\mathcal{B}([0,1]^{p}))}\leq\mathcal{R}\Big\}.

The activation functions $\sigma_{i}$ ($i=0,\ldots,L-1$) are the rectified linear units, i.e., $\sigma_{i}(x)=\max(x,0)$. We emphasize that $\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$ incorporates a rank regularization of $d$ in the first layer. For an arbitrary $f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$, we use $\mathcal{T}(f)$ to represent the component $B\in\Psi_{d}$ in the first layer of $f$.
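In practice, the constraint $B\in\Psi_{d}$ can be imposed directly during training. One possible realization, sketched below with PyTorch's orthogonal parametrization, keeps the first-layer weight orthonormal throughout optimization; this is only one of several ways to enforce the constraint, and post-hoc orthonormalization (e.g., a QR decomposition of the learned weight) works as well.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

# Enforce B in Psi_d (i.e., B^T B = I_d) for the first layer x -> B^T x by
# parametrizing the d x p weight matrix to have orthonormal rows.
p, d = 10, 2
first_layer = orthogonal(nn.Linear(p, d, bias=False))

B = first_layer.weight.detach().T                        # T(f): a p x d matrix
print(torch.allclose(B.T @ B, torch.eye(d), atol=1e-5))  # True: B^T B = I_d
```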

For a regression task, the theoretical study is closely related to the smoothness of the underlying conditional mean function (Yang and Barron 1999). Here, we introduce the following assumptions on model (2).

Assumption 1 (Smoothness).

$f_{0}$ is a Hölder continuous function of order $\alpha\in(0,1]$ with constant $\lambda$, i.e., $|f_{0}(x)-f_{0}(z)|\leq\lambda\|x-z\|_{2}^{\alpha}$ for any $x,z\in[0,1]^{d}$. Additionally, $\|f_{0}\|_{L^{\infty}([0,1]^{d})}\leq\mathcal{R}_{0}$ for some constant $\mathcal{R}_{0}\geq 1$.

Assumption 2 (Sharpness).

For any scalar $\delta>0$ and $B\in\Psi_{d}$, $\|\pi_{B}-\pi_{B_{0}}\|_{F}>\delta$ implies $\mathbb{E}\{\mathrm{Var}[f_{0}(B_{0}^{\top}x)|B^{\top}x]\}>M(\delta)$ for some $M(\delta)>0$.

Assumption 3.

$y$ is sub-exponentially distributed, i.e., there exists $\tau>0$ satisfying $\mathbb{E}[\exp(\tau|y|)]<\infty$.

Assumption 1 is a technical condition to study the approximation capability of neural networks (Shen, Yang, and Zhang 2020). Alternatively, other functional spaces, such as the Sobolev space, can also be employed for this purpose (Abdeljawad and Grohs 2022; Shen et al. 2022). Furthermore, Assumption 2 establishes a restriction on the sharpness of $f_{0}$. Consider the case where $f_{0}$ is identically zero. Then a trivial neural network, obtained by setting all the parameters except $B$ to zero, perfectly fits $f_{0}$ regardless of the value of $B$. With Assumption 2, it becomes difficult to accurately capture the overall behavior of $f_{0}(B_{0}^{\top}x)$ using a biased $B$. In other words, $B_{0}^{\top}x$ is sufficient and necessary for recovering $\mathbb{E}(y|x)$, i.e., $\Pi_{B_{0}}$ is the central mean subspace. A similar condition was adopted in Theorem 4.2 of Li and Dong (2009) to distinguish the sufficient directions $B_{0}$ from other $B$. Assumption 3 is a commonly used condition for applying empirical process tools and concentration inequalities (Van der Vaart 2000; Zhu, Jiao, and Jordan 2022).

             Configuration      Stat.   NN      MAVE    GKDR    SIR     SAVE    PHD
Setting 1    (n,p)=(100,10)     mean    0.135   0.160   0.388   0.897   0.315   0.623
                                std     0.064   0.056   0.126   0.186   0.082   0.150
             (n,p)=(200,30)     mean    0.276   0.214   1.012   0.946   0.298   0.807
                                std     0.274   0.051   0.299   0.113   0.049   0.107
             (n,p)=(300,30)     mean    0.120   0.130   0.542   0.900   0.337   0.709
                                std     0.038   0.029   0.166   0.136   0.063   0.108
Setting 2    σ=0.1              mean    0.296   0.730   1.130   0.665   0.319   1.550
                                std     0.126   0.312   0.271   0.166   0.076   0.126
             σ=0.2              mean    0.628   0.899   1.155   0.705   0.338   1.526
                                std     0.269   0.329   0.237   0.149   0.074   0.138
             σ=0.5              mean    1.197   1.187   1.278   0.830   0.367   1.567
                                std     0.213   0.204   0.200   0.169   0.072   0.131
Setting 3    n=200              mean    0.639   1.246   1.759   1.669   1.728   1.740
                                std     0.418   0.289   0.149   0.235   0.136   0.221
             n=500              mean    0.248   1.076   1.752   1.650   1.683   1.721
                                std     0.242   0.331   0.125   0.271   0.201   0.228
             n=1000             mean    0.075   0.924   0.554   1.652   1.678   1.737
                                std     0.081   0.382   0.371   0.259   0.245   0.191
Setting 4    n=1000, d=4        mean    0.127   0.293   0.368   1.363   0.429   0.960
                                std     0.161   0.050   0.079   0.167   0.079   0.178
             n=1500, d=6        mean    0.144   0.415   0.386   1.530   0.467   0.902
                                std     0.067   0.076   0.077   0.153   0.082   0.206
             n=2000, d=8        mean    0.140   0.360   0.344   1.410   0.364   0.735
                                std     0.055   0.084   0.010   0.163   0.065   0.175

Table 1: The average and standard deviation of $\|\pi_{\hat{B}}-\pi_{B_{0}}\|_{F}$ over 100 replicates across different methods. NN represents the neural network-based method.
Theorem 1.

Suppose that Assumptions 1 and 2 hold. Let $f^{*}=\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}\mathbb{E}[y-f(x)]^{2}$. Then

\Pi_{\mathcal{T}(f^{*})}=\Pi_{B_{0}},

provided that $\mathcal{R}$ is sufficiently large, and $\mathcal{L}$ and $\mathcal{M}$ tend to infinity.

Theorem 1 builds a bridge connecting neural networks and SDR. It demonstrates that neural networks indeed achieve representation learning, as the first layer of the optimal neural network perfectly reaches the target of SDR at the population level. Theorem 1 also inspires us to perform SDR based on neural networks with a minor adjustment of the first layer. The detailed proof of Theorem 1 can be found in Section Proofs.

Sample Estimation Consistency

We now investigate the theoretical properties of the neural network-based sample-level estimator for SDR. Given sample observations $\mathcal{D}_{n}=\{(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})\}$, where $(X_{i},Y_{i})$ is an independent copy of $(x,y)$ for $i=1,\ldots,n$, the commonly used least squares loss is adopted, i.e.,

L_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}[Y_{i}-f(X_{i})]^{2}.

Denote the optimal neural network estimator at the sample level as

\hat{f}_{n}=\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}L_{n}(f).

$\mathcal{T}(\hat{f}_{n})$ is then the sample estimator approximately spanning the central mean subspace.

To examine the closeness between $\Pi_{\mathcal{T}(\hat{f}_{n})}$ and $\Pi_{B_{0}}$, we define the distance metric $d(\cdot,\cdot)$ as

d(B,B_{0})=\min_{Q\in\mathcal{Q}}\|B_{0}-BQ\|_{2},

where $B\in\Psi_{d}$ and $\mathcal{Q}$ is the collection of all orthogonal matrices in $\mathbb{R}^{d\times d}$. We see that $d(B,B_{0})=0$ if and only if $\Pi_{B}=\Pi_{B_{0}}$. We then make the following assumption, which is essentially another view of Assumption 2.
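As a computational aside, the minimization over $\mathcal{Q}$ can be handled in practice with an orthogonal Procrustes alignment. The sketch below uses the Frobenius-optimal rotation $Q=UV^{\top}$ obtained from the SVD of $B^{\top}B_{0}$ as the aligning matrix, which gives an easily computed surrogate (an upper bound) for the metric; the spectral norm is used for the final evaluation.

```python
import numpy as np

def procrustes_distance(B, B0):
    """Surrogate for d(B, B0) = min_Q ||B0 - B Q||_2 over orthogonal Q, using the
    Frobenius-optimal Procrustes rotation Q = U V^T from the SVD of B^T B0."""
    U, _, Vt = np.linalg.svd(B.T @ B0)
    Q = U @ Vt                                           # orthogonal alignment
    return np.linalg.norm(B0 - B @ Q, ord=2)             # spectral norm

# Two bases of the same subspace are (numerically) at distance zero.
rng = np.random.default_rng(0)
B0, _ = np.linalg.qr(rng.standard_normal((10, 2)))       # B0 in Psi_d
R, _ = np.linalg.qr(rng.standard_normal((2, 2)))         # an arbitrary rotation
print(procrustes_distance(B0 @ R, B0))                   # approximately 0
```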

Assumption 4.

For any positive scalar $\delta$ and $B\in\Psi_{d}$, $d(B,B_{0})>\delta$ implies $\mathbb{E}\{\mathrm{Var}[f_{0}(B_{0}^{\top}x)|B^{\top}x]\}>M_{1}(\delta)$ for some $M_{1}(\delta)>0$.

Theorem 2.

Suppose Assumptions 1, 3 and 4 hold. Then we have

d(\mathcal{T}(\hat{f}_{n}),B_{0})\stackrel{P}{\longrightarrow}0,

when the depth $\mathcal{L}=O(n^{d/(2d+8\alpha)})$, the width $\mathcal{M}=O(1)$ and $\mathcal{R}\geq\mathcal{R}_{0}$.

Theorem 2 confirms that $\Pi_{\mathcal{T}(\hat{f}_{n})}$ converges to $\Pi_{B_{0}}$ in probability. As a result, under the least squares loss, neural networks with appropriate rank regularization provide a consistent estimator of the central mean subspace. Therefore, it is promising to adopt the neural network function class $\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$ to address sufficient mean dimension reduction problems. More importantly, this approach offers several advantages over existing methods. Unlike SIR (Li 1991), SAVE (Cook and Weisberg 1991) and PHD (Li 1992), our method does not impose stringent probabilistic distributional conditions on $x$, such as linearity, constant variance and normality assumptions. Compared to MAVE (Xia et al. 2002), our method adopts powerful neural networks to estimate the link function, thereby performing similarly to or better than classical nonparametric tools based on first-order Taylor expansions. We present the proof of Theorem 2 in Section Proofs.

The results illustrated in Theorems 1 and 2 are contingent on the availability of the true intrinsic dimension $d$. In the case where $d$ is unknown, a natural modification of $\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$ is to set $B\in\mathbb{R}^{p\times p}$ without rank regularization. Under Assumptions 1 and 2, in this scenario, the optimal $f^{*}$ at the population level still ensures that $\Pi_{\mathcal{T}(f^{*})}$ encompasses $\Pi_{B_{0}}$, where the sharpness of $f_{0}$ plays a crucial role; see the Supplementary Material for more details.

Simulation Study

From the perspective of numerical experiments, we utilized simulated data sets to demonstrate that (i) the column space of $\mathcal{T}(\hat{f}_{n})$ approached the central mean subspace $\Pi_{B_{0}}$ as the sample size $n$ increased, and (ii) the performance of the neural network-based method in conducting SDR was comparable to classical methods. In some cases, the neural network-based method outperformed classical methods, particularly when the latent intrinsic dimension $d$ was large. Five commonly used methods, sliced inverse regression (SIR), sliced average variance estimation (SAVE), principal Hessian directions (PHD), minimum average variance estimation (MAVE) and generalized kernel dimension reduction (GKDR), were included for comparison.

The neural network-based method was implemented in Python using PyTorch. Specifically, we utilized a linear layer without a bias term to represent the matrix $B$ (recall that we suppose $d$ is known), which was followed by a fully-connected feedforward neural network with the numbers of neurons being $h$-$h/2$-$1$. Here, we set $h=64$ when the sample size was less than 1000, and $h=128$ when the sample size was between 1000 and 2000. Overall, the neural network architecture was $p$-$d$-$h$-$h/2$-$1$. Code is available at https://github.com/oaksword/DRNN.
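The sketch below mirrors this description; it is a simplified stand-in for the released code, and the optimizer, learning rate, and orthonormalization step are illustrative choices rather than the exact settings used in the experiments.

```python
import torch
import torch.nn as nn

class DRNet(nn.Module):
    """The p-d-h-h/2-1 architecture: a bias-free linear layer for B (rank d),
    followed by a fully-connected ReLU network."""
    def __init__(self, p, d, h=64):
        super().__init__()
        self.B = nn.Linear(p, d, bias=False)              # first layer: x -> B^T x
        self.body = nn.Sequential(
            nn.Linear(d, h), nn.ReLU(),
            nn.Linear(h, h // 2), nn.ReLU(),
            nn.Linear(h // 2, 1),
        )

    def forward(self, x):
        return self.body(self.B(x)).squeeze(-1)

def estimate_B(X, Y, d, h=64, iters=1000, lr=1e-2):
    """Minimize the least squares loss L_n and return an orthonormal basis of the
    estimated central mean subspace (QR of the learned first-layer weight)."""
    net = DRNet(X.shape[1], d, h)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        ((net(X) - Y) ** 2).mean().backward()
        opt.step()
    B_hat = net.B.weight.detach().T                       # p x d
    Q, _ = torch.linalg.qr(B_hat)                         # orthonormalize the columns
    return Q
```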

We considered the following four scenarios, two of them with small intrinsic dimensions ($d=1,2$) and the rest with $d\geq 3$.

Setting 1: $y=x_{1}^{4}+\epsilon$ where $x\sim\mathrm{Normal}(0,I_{p})$ and $\epsilon\sim\mathrm{t}_{5}$.

Setting 2: $y=\log(x_{1}+x_{1}x_{2})+\epsilon$ where $x\sim\mathrm{Uniform}([0,1]^{10})$ and $\epsilon\sim\mathrm{Normal}(0,\sigma^{2})$.

Setting 3: $y=(1+\beta_{1}^{\top}x)^{2}\exp(\beta_{2}^{\top}x)+5\cdot 1(\beta_{3}^{\top}x>0)+\epsilon$ where $x\sim\mathrm{Uniform}([-1,1]^{6})$, $\epsilon\sim\chi^{2}_{2}-2$, $\beta_{1}=(-2,-1,0,1,2,3)^{\top}$, $\beta_{2}=1_{6}$ and $\beta_{3}=(1,-1,1,-1,1,-1)^{\top}$.

Setting 4:

y=\sum_{k=1}^{d}\frac{e^{x_{k}}}{2+\sin(\pi x_{k})}+\epsilon,

where $\epsilon\sim\mathrm{Normal}(0,0.1^{2})$ and

x\sim\mathrm{Normal}\left(1_{10},I_{10}-1_{10}1_{10}^{\top}/20\right).

In setting 1, we tested the combinations $(n,p)=(100,10),(200,30)$ and $(300,30)$, where $n$ was the sample size. In setting 2, we fixed $n=200$, $p=10$ and set $\sigma=0.1,0.2,0.5$. In setting 3, we fixed $p=6$ and tested $n=200,500,1000$. In setting 4, we fixed $p=10$ and tested $(n,d)=(1000,4),(1500,6)$ and $(2000,8)$. Settings 1, 2, and 4 were equipped with continuous regression functions, whereas setting 3 involved discontinuity. We set the number of iterations of our method to 1000 for settings 1 and 2, and increased it to 2000 and 4000 for settings 3 and 4, respectively, due to their higher complexity.
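For reference, a sketch of the data-generating process for Setting 3 is given below; the seed and sample size are arbitrary.

```python
import numpy as np

def generate_setting3(n, rng):
    """Data-generating process of Setting 3 (p = 6, three sufficient directions)."""
    beta1 = np.array([-2, -1, 0, 1, 2, 3], dtype=float)
    beta2 = np.ones(6)
    beta3 = np.array([1, -1, 1, -1, 1, -1], dtype=float)
    x = rng.uniform(-1.0, 1.0, size=(n, 6))
    eps = rng.chisquare(2, size=n) - 2.0                  # centered chi-squared noise
    y = (1 + x @ beta1) ** 2 * np.exp(x @ beta2) + 5.0 * (x @ beta3 > 0) + eps
    B0 = np.column_stack([beta1, beta2, beta3])           # its span is the target subspace
    return x, y, B0

X, Y, B0 = generate_setting3(500, np.random.default_rng(0))
```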

To evaluate the performance of each method, we calculated the distance metric $\|\pi_{\hat{B}}-\pi_{B_{0}}\|_{F}$, where $\hat{B}$ represented the estimate of $B_{0}$. In particular, for our neural network-based method, $\hat{B}=\mathcal{T}(\hat{f}_{n})$. A smaller value of this metric indicated better performance. We ran 100 replicates for each setting. The results are displayed in Table 1 and Figure 2.
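The evaluation metric itself is straightforward to compute from the notation introduced earlier; a minimal sketch follows.

```python
import numpy as np

def projection_distance(B_hat, B0):
    """Compute ||pi_{B_hat} - pi_{B0}||_F with pi_A = A (A^T A)^- A^T."""
    def proj(A):
        return A @ np.linalg.pinv(A.T @ A) @ A.T
    return np.linalg.norm(proj(B_hat) - proj(B0), ord="fro")
```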

In simple cases (settings 1 and 2), the neural network-based method showed performance similar to classical methods, with sliced average variance estimation being the most effective method in setting 2. However, the complex settings demonstrated the significant superiority of the neural network-based method over the other methods. Specifically, in setting 3, as the sample size $n$ increased, the metric $\|\pi_{\hat{B}}-\pi_{B_{0}}\|_{F}$ decreased rapidly, indicating the convergence of $\Pi_{\hat{B}}$ to $\Pi_{B_{0}}$. According to the results in setting 4, the neural network-based method was capable of handling higher-dimensional scenarios. In summary, the numerical studies advocated neural networks as a powerful tool for SDR.

Figure 2: Boxplots of $\|\pi_{\hat{B}}-\pi_{B_{0}}\|_{F}$ on 100 replicates across different methods. In each panel, six methods for SDR are included, among which NN represents the neural network-based method.

Real Data Analysis

We further applied the neural network-based method involving rank regularization to a real regression task. In practice, the precise intrinsic dimension $d$ is unknown, and it is uncertain whether a low-dimensional structure exists. To address this issue, we used cross validation to determine an appropriate $d$ from the range of values $\{1,\ldots,p\}$. More specifically, the raw data set was divided into 80% training data and 20% testing data. The optimal value of $d$ was determined through cross validation on the training data. Subsequently, the final model was fit using the selected $d$ on the training data, and the mean squared error on the testing data was evaluated.
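The following sketch illustrates this cross-validation step, reusing the DRNet class from the simulation section sketch; the fold count, iteration budget, and learning rate are assumptions for illustration only.

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def select_d_by_cv(X, Y, d_grid, n_splits=5, iters=500, lr=1e-2):
    """Pick d from d_grid by K-fold cross validation of the rank-regularized
    network (DRNet, defined in the simulation sketch) on the training data."""
    cv_errors = []
    for d in d_grid:
        fold_errors = []
        for tr, va in KFold(n_splits, shuffle=True, random_state=0).split(X):
            Xtr = torch.tensor(X[tr], dtype=torch.float32)
            Ytr = torch.tensor(Y[tr], dtype=torch.float32)
            Xva = torch.tensor(X[va], dtype=torch.float32)
            Yva = torch.tensor(Y[va], dtype=torch.float32)
            net = DRNet(X.shape[1], d)
            opt = torch.optim.Adam(net.parameters(), lr=lr)
            for _ in range(iters):
                opt.zero_grad()
                ((net(Xtr) - Ytr) ** 2).mean().backward()
                opt.step()
            with torch.no_grad():
                fold_errors.append(((net(Xva) - Yva) ** 2).mean().item())
        cv_errors.append(np.mean(fold_errors))
    return d_grid[int(np.argmin(cv_errors))]
```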

In order to reduce randomness, the aforementioned process was repeated 20 times and the resulting testing errors were recorded. Additionally, we conducted a comparative analysis between the neural network-based approach and alternative methods, including a vanilla neural network without rank regularization, SIR-based regression, SAVE-based regression, and MAVE-based regression. For the latter three techniques, we initially executed SDR to acquire the embedded data, followed by the utilization of a fully-connected neural network for prediction. The optimal value of $d$ was also determined through cross validation.

We utilized a data set of weather records from Seoul, South Korea during the summer months from 2013 to 2017 (Cho et al. 2020), available at the UCI data set repository (bias correction of numerical prediction model temperature forecast). This data set contained 7750 observations with 23 predictors and 2 responses, specifically the next-day maximum air temperature and the next-day minimum air temperature. After excluding the variables for station and date, the data set was reduced to 21 predictors, which were further standardized using the StandardScaler in the scikit-learn package.

            NN-RR   NN-VN   MAVE    SIR     SAVE
mean        0.602   0.774   0.743   1.324   0.772
std         0.043   0.116   0.159   0.190   0.161
$\bar{d}$   19.6    --      19.25   19.6    20.6

Table 2: The results of testing errors on 20 replicates across different methods. We report the average and standard deviation of the testing errors, along with the averaged optimal $d$ determined through cross validation. NN-RR, NN-VN, MAVE, SIR, SAVE represent the neural network-based method with rank regularization, vanilla neural network regression, MAVE-based regression, SIR-based regression, and SAVE-based regression, respectively.

It is evident from Table 2 that the neural network-based method with rank regularization outperformed the other methods, demonstrating the effectiveness of the modification compared to the vanilla neural network, as well as sound performance in reducing dimensions compared to the other SDR methods. The presence of latent structures was partially supported by the averaged optimal $d$: it is possible that 19 or 20 combinations of the raw predictors are sufficient, as opposed to the original 21 predictors.

Proofs

Proof of Theorem 1

Define the vanilla neural network function class without rank regularization as

\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}^{*}=\Big\{f(x)=\phi_{L}\circ\sigma_{L-1}\circ\phi_{L-1}\circ\cdots\circ\sigma_{0}\circ\phi_{0}(x):
\phi_{i}(z)=W_{i}^{\top}z+b_{i},\ W_{i}\in\mathbb{R}^{d_{i}\times d_{i+1}},\ i=0,\ldots,L,
\text{with } L=\mathcal{L},\ \max_{i=1,\dots,L}d_{i}=\mathcal{M},\ \sum_{i=0}^{L}(d_{i}+1)d_{i+1}=\mathcal{S},\ \|f\|_{L^{\infty}([0,1]^{d})}\leq\mathcal{R}\Big\}.
Lemma 1.

Suppose that $h$ is Hölder continuous with order $\alpha\in(0,1]$ and constant $\lambda>0$. Then, for any $\zeta>0$, there exists a function $g$ in the neural network function class $\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}^{*}$, with rectified linear unit activation and $\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}$ large enough, such that

\|h-g\|_{L^{\infty}([0,1]^{d})}<\zeta.

We note that Lemma 1 is a simplified version of Theorem 1.1 in Shen, Yang, and Zhang (2020).

Proof of Theorem 1.

For $f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$ such that $\pi_{\mathcal{T}(f)}\neq\pi_{B_{0}}$, under Assumption 2, there exists a scalar $t>0$ satisfying

\mathbb{E}\left\{\mathrm{Var}\left[f_{0}(B_{0}^{\top}x)\,|\,\mathcal{T}(f)^{\top}x\right]\right\}>t.

Then, we have

\mathbb{E}[y-f(x)]^{2} = \mathbb{E}[\epsilon+f_{0}(B_{0}^{\top}x)-f(x)]^{2}
= \mathbb{E}(\epsilon^{2})+\mathbb{E}[f_{0}(B_{0}^{\top}x)-f(x)]^{2}
= \nu^{2}+\mathrm{Var}[f_{0}(B_{0}^{\top}x)-f(x)]+\mathbb{E}^{2}[f_{0}(B_{0}^{\top}x)-f(x)]
\geq \nu^{2}+\mathrm{Var}[f_{0}(B_{0}^{\top}x)-f(x)]
\geq \nu^{2}+\mathbb{E}\{\mathrm{Var}[f_{0}(B_{0}^{\top}x)\,|\,\mathcal{T}(f)^{\top}x]\}
> \nu^{2}+t.

On the other hand, for $\tilde{f}\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$ such that $\pi_{\mathcal{T}(\tilde{f})}=\pi_{B_{0}}$, there exists an orthogonal matrix $Q\in\mathbb{R}^{d\times d}$ such that $B_{0}=\mathcal{T}(\tilde{f})Q$. Hence, $\tilde{f}(x)=\tilde{f}\circ\mathcal{T}(\tilde{f})\circ Q(B_{0}^{\top}x)$ and $\tilde{f}\circ\mathcal{T}(\tilde{f})\circ Q$ is still a neural network function. Assumption 1 and Lemma 1 imply that there is a neural network function $g$ satisfying

\|f_{0}(B_{0}^{\top}x)-g(B_{0}^{\top}x)\|_{L^{\infty}(\mathcal{B}([0,1]^{p}))}<t^{1/2}/2,

for sufficiently large $\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}$. As a result, select $\tilde{f}$ such that $\tilde{f}\circ\mathcal{T}(\tilde{f})\circ Q=g$, and it follows that

\mathbb{E}[y-\tilde{f}(x)]^{2}=\nu^{2}+\mathbb{E}[f_{0}(B_{0}^{\top}x)-g(B_{0}^{\top}x)]^{2}<\nu^{2}+t/4.

Therefore, for any $f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$ such that $\pi_{\mathcal{T}(f)}\neq\pi_{B_{0}}$, there exists a neural network function $\tilde{f}$, with the same rank regularization, satisfying

\mathbb{E}[y-\tilde{f}(x)]^{2}<\mathbb{E}[y-f(x)]^{2}.

To conclude, $\pi_{\mathcal{T}(f^{*})}=\pi_{B_{0}}$, which entails that $\Pi_{\mathcal{T}(f^{*})}=\Pi_{B_{0}}$. ∎

Proof of Theorem 2

Let $L(f)=\mathbb{E}[y-f(x)]^{2}$. Without loss of generality, suppose $\mathcal{R}=\mathcal{R}_{0}$. We first present some useful lemmas.

Lemma 2.

For sufficiently large $n$, it follows that

\sup_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}|L(f)-L_{n}(f)|=O_{p}\left(\sqrt{\frac{\mathcal{S}\mathcal{L}\log(\mathcal{S})(\log n)^{5}}{n}}\right).

The proof of Lemma 2 is presented in the Supplementary Material. We note that this convergence rate may not be optimal, but it is sufficient for deducing consistency.

Lemma 3.

Under Assumptions 1 and 3, for sufficiently large $n$, we have

\mathbb{E}[y-\hat{f}_{n}(x)]^{2}\leq\nu^{2}+Cn^{-2\alpha/(d+4\alpha)}(\log n)^{3},

where $C$ is a constant depending on $\mathcal{R}_{0}$ and $d$.

Proof of Lemma 3.

We begin with the decomposition that

L(\hat{f}_{n})\leq \mathbb{E}\left[L_{n}(\hat{f}_{n})+\sup_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}|L(f)-L_{n}(f)|\right]
\leq \mathbb{E}\Bigg[\inf_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}L_{n}(f\circ\mathcal{T}(f)\circ B_{0}^{\top})+\sup_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}|L(f)-L_{n}(f)|\Bigg]
\leq \mathbb{E}\Bigg[\inf_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}L(f\circ\mathcal{T}(f)\circ B_{0}^{\top})+2\sup_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}|L(f)-L_{n}(f)|\Bigg]
\leq \inf_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}L(f\circ\mathcal{T}(f)\circ B_{0}^{\top})+C_{1}\sqrt{\mathcal{S}\mathcal{L}\log(\mathcal{S})(\log n)^{5}n^{-1}},

where $C_{1}$ is a constant, for sufficiently large $n$. Hence,

L(\hat{f}_{n})-\nu^{2} = L(\hat{f}_{n})-L(f_{0}\circ B_{0}^{\top})
= L(\hat{f}_{n})-\inf_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}L(f\circ\mathcal{T}(f)\circ B_{0}^{\top})+\inf_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}L(f\circ\mathcal{T}(f)\circ B_{0}^{\top})-L(f_{0}\circ B_{0}^{\top})
\leq C_{1}\sqrt{\mathcal{S}\mathcal{L}\log(\mathcal{S})(\log n)^{5}n^{-1}}+\inf_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}\mathbb{E}[f_{0}(B_{0}^{\top}x)-f\circ\mathcal{T}(f)(B_{0}^{\top}x)]^{2}
\leq C_{1}\sqrt{\mathcal{S}\mathcal{L}\log(\mathcal{S})(\log n)^{5}n^{-1}}+\inf_{g\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}^{*}}\|f_{0}-g\|_{L^{\infty}([0,1]^{d})}^{2}.

By Theorem 1.1 in Shen, Yang, and Zhang (2020), there exists $g^{*}$ with $\mathcal{M}=3^{d+3}\max(d\lfloor K^{1/d}\rfloor,K+1)$ and $\mathcal{L}=12D+14+2d$ for some constants $K,D>0$ such that

\|f_{0}-g^{*}\|_{L^{\infty}([0,1]^{d})}\leq 19\sqrt{d}\lambda(KD)^{-2\alpha/d}.

Let

D=O\left(n^{d/(2d+8\alpha)}\right),\quad K=O(1),

and observe that $\mathcal{S}=O(\mathcal{M}^{2}\mathcal{L})$. Then, it follows that

L(\hat{f}_{n})=\mathbb{E}[y-\hat{f}_{n}(x)]^{2}\leq\nu^{2}+Cn^{-2\alpha/(d+4\alpha)}(\log n)^{3},

where $C$ is a constant depending on $\mathcal{R}_{0}$ and $d$. ∎

With Lemma 3, the proof of Theorem 2 is a direct application of Theorem 5.9 in Van der Vaart (2000).

Proof of Theorem 2.

Recall that $m_{0}=f_{0}\circ B_{0}^{\top}$. Let $R(f)=L(f)-L(m_{0})=\mathbb{E}[f_{0}(B_{0}^{\top}x)-f(x)]^{2}\geq 0$ and $R_{n}(f)=L_{n}(f)-L_{n}(m_{0})$. Then, Lemma 2 indicates that

\sup_{f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}}|R_{n}(f)-R(f)|\to 0,\quad\text{in probability},

yielding $R_{n}(\hat{f}_{n})=o_{p}(1)$ when combined with Lemma 3.

We denote the metric

d_{1}(f,m_{0})=\min_{Q\in\mathcal{Q}}\|B_{0}-\mathcal{T}(f)Q\|_{2}\vee\|f_{0}\circ Q-f\circ\mathcal{T}(f)\|_{L^{2}(\mu)}.

Here, $f\in\mathcal{F}_{\mathcal{L},\mathcal{M},\mathcal{S},\mathcal{R}}$, $m_{0}=f_{0}\circ B_{0}^{\top}$, $a\vee b$ means $\max(a,b)$, and $\mu$ is the probability distribution of $\mathcal{T}(f)^{\top}x$. Recall that $d(B,B_{0})=\min_{Q\in\mathcal{Q}}\|B_{0}-BQ\|_{2}$, where $B\in\Psi_{d}$.

For $f$ and $\delta>0$ such that $d_{1}(f,m_{0})\geq\delta$, if

d(\mathcal{T}(f),B_{0})\leq\tilde{\delta}=\min\{[\delta^{2}/(8\mathcal{R}_{0}\lambda)]^{1/\alpha},\delta/2\},

we have

|f_{0}(B_{0}^{\top}x)-f_{0}(Q^{\top}\mathcal{T}(f)^{\top}x)| \leq \lambda\|B_{0}^{\top}x-Q^{\top}\mathcal{T}(f)^{\top}x\|_{2}^{\alpha}
\leq \lambda\|B_{0}^{\top}-Q^{\top}\mathcal{T}(f)^{\top}\|_{2}^{\alpha}
\leq \frac{\delta^{2}}{8\mathcal{R}_{0}},

for some orthogonal matrix $Q$. Hence,

R(f) = \mathbb{E}[f_{0}(B_{0}^{\top}x)-f(x)]^{2}
\geq \mathbb{E}[f_{0}(Q^{\top}\mathcal{T}(f)^{\top}x)-f(x)]^{2}+2\mathbb{E}[f_{0}(B_{0}^{\top}x)-F_{0}][F_{0}-f(x)]
\geq -\frac{\delta^{2}}{2}+d_{1}(f,m_{0})^{2}\geq\frac{\delta^{2}}{2}>0,

where $F_{0}=f_{0}(Q^{\top}\mathcal{T}(f)^{\top}x)$.

For the case where $d(\mathcal{T}(f),B_{0})>\tilde{\delta}$, by Assumption 4, we have $R(f)\geq\xi>0$ for some positive constant $\xi$. To conclude, we obtain

\inf_{f:d_{1}(f,m_{0})\geq\delta}R(f)>0,

for any $\delta>0$. Finally, applying Theorem 5.9 in Van der Vaart (2000), $R_{n}(\hat{f}_{n})=o_{p}(1)$ implies that $d_{1}(\hat{f}_{n},m_{0})\to 0$ in probability. Hence, $d(\mathcal{T}(\hat{f}_{n}),B_{0})\to 0$ in probability. ∎

Discussion

We have demonstrated that neural networks attain the capability of detecting the underlying low-dimensional structure that preserves all information in $x$ about the conditional mean function $\mathbb{E}(y|x)$. As a result, neural networks are well suited to estimating the central mean subspace $\Pi_{B_{0}}$. The theoretical investigations sharpen our understanding of neural networks, while broadening the scope of SDR as well.

In the context of SDR, a more general scenario than sufficient mean dimension reduction emerges when considering

y\perp\!\!\!\perp x\,|\,B_{0}^{\top}x. \quad (4)

In this case, the column space spanned by $B_{0}$ corresponds to the central subspace (Cook 1998a; Li 2018). It is clear that relation (4) is equivalent to $p(y|x)=p(y|B_{0}^{\top}x)$, where $p(\cdot|\cdot)$ represents the conditional probability density function. Following the work of Xia (2007), we can further adapt neural networks to estimate the central subspace by modifying the loss function.

Under mild conditions, Xia (2007) showed that

\mathbb{E}[K_{h}(y-y_{0})|x=x_{0}]\to p(y_{0}|B_{0}^{\top}x_{0}),\quad\text{as }h\to 0^{+}.

Here, $(x_{0},y_{0})$ is a fixed point, and $K_{h}(\cdot)$ is a suitable kernel function with a bandwidth $h$. This finding then implies that

K_{h}(y-y_{0})=\mathbb{E}[K_{h}(y-y_{0})|x]+\text{remainder term}\approx p(y_{0}|B_{0}^{\top}x)+\text{remainder term}.

Based on this discovery, it is natural to employ a neural network function $f(B^{\top}x,y_{0}):\mathbb{R}^{d+1}\to\mathbb{R}$ to approximate $p(y_{0}|B_{0}^{\top}x)$, and obtain the estimate of $B_{0}$ at the population level by solving the following problem

(B^{*},f^{*})=\mathop{\mathrm{argmin}}_{B,f}\mathbb{E}[K_{h}(y-\tilde{y})-f(B^{\top}x,\tilde{y})]^{2},

where $\tilde{y}$ is an independent copy of $y$. Empirically, we define the loss function as

\tilde{L}_{n}(B,f)=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}[K_{h}(y_{j}-y_{i})-f(B^{\top}x_{j},y_{i})]^{2}.
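A minimal sketch of this empirical loss is given below, assuming a Gaussian kernel and a network `net` that maps a pair $(x,y_{0})$ to $f(B^{\top}x,y_{0})$ with $B$ realized as its first (bias-free) layer; the kernel choice and bandwidth are illustrative assumptions.

```python
import torch

def gaussian_kernel(u, h):
    # K_h(u) = exp(-u^2 / (2 h^2)) / (h * sqrt(2 * pi)); an illustrative kernel choice
    return torch.exp(-0.5 * (u / h) ** 2) / (h * (2 * torch.pi) ** 0.5)

def central_subspace_loss(net, X, Y, h=0.5):
    """Empirical loss (1/n^2) sum_{i,j} [K_h(y_j - y_i) - f(B^T x_j, y_i)]^2, where
    net(x, y0) is assumed to compute f(B^T x, y0)."""
    n = Y.shape[0]
    K = gaussian_kernel(Y.reshape(1, n) - Y.reshape(n, 1), h)   # K[i, j] = K_h(y_j - y_i)
    Xj = X.repeat(n, 1)                    # covariates x_j, tiled over all pairs (i, j)
    Yi = Y.repeat_interleave(n)            # responses y_i, matched to the same pairs
    F = net(Xj, Yi).reshape(n, n)          # F[i, j] = f(B^T x_j, y_i)
    return ((K - F) ** 2).mean()           # average over all n^2 pairs
```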

We provide some additional simulation results to verify the feasibility of utilizing neural networks for the estimation of the central subspace; see the Supplementary Material. Theoretical analysis of neural networks for the estimation of the central subspace, including unbiasedness and consistency, deserves further study.

References

  • Abbe, Adsera, and Misiakiewicz (2022) Abbe, E.; Adsera, E. B.; and Misiakiewicz, T. 2022. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, 4782–4887. PMLR.
  • Abdeljawad and Grohs (2022) Abdeljawad, A.; and Grohs, P. 2022. Approximations with deep neural networks in Sobolev time-space. Analysis and Applications, 20(03): 499–541.
  • Bach (2017) Bach, F. 2017. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19): 1–53.
  • Barron (1993) Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3): 930–945.
  • Bauer and Kohler (2019) Bauer, B.; and Kohler, M. 2019. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4): 2261 – 2285.
  • Brandes et al. (2022) Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; and Linial, M. 2022. ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics, 38(8): 2102–2110.
  • Cho et al. (2020) Cho, D.; Yoo, C.; Im, J.; and Cha, D.-H. 2020. Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth and Space Science, 7(4): e2019EA000740.
  • Cook (1996) Cook, R. D. 1996. Graphics for regressions with a binary response. Journal of the American Statistical Association, 91(435): 983–992.
  • Cook (1998a) Cook, R. D. 1998a. Regression graphics. New York: Wiley.
  • Cook and Li (2002) Cook, R. D.; and Li, B. 2002. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2): 455–474.
  • Cook and Weisberg (1991) Cook, R. D.; and Weisberg, S. 1991. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414): 328–332.
  • Doshi-Velez and Kim (2017) Doshi-Velez, F.; and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv:1702.08608.
  • Fukumizu and Leng (2014) Fukumizu, K.; and Leng, C. 2014. Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109(505): 359–370.
  • Ghorbani, Abid, and Zou (2019) Ghorbani, A.; Abid, A.; and Zou, J. 2019. Interpretation of neural networks is fragile. In Proceedings of the AAAI conference on artificial intelligence, volume 33, 3681–3688.
  • Ghosh (2022) Ghosh, D. 2022. Sufficient dimension reduction: An information-theoretic viewpoint. Entropy, 24(2): 167.
  • Guidotti et al. (2018) Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; and Pedreschi, D. 2018. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5): 1–42.
  • Hornik, Stinchcombe, and White (1989) Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural networks, 2(5): 359–366.
  • Jiao et al. (2023) Jiao, Y.; Shen, G.; Lin, Y.; and Huang, J. 2023. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2): 691–716.
  • Jumper et al. (2021) Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873): 583–589.
  • Kapla, Fertl, and Bura (2022) Kapla, D.; Fertl, L.; and Bura, E. 2022. Fusing sufficient dimension reduction with neural networks. Computational Statistics & Data Analysis, 168: 107390.
  • Lee and Abu-El-Haija (2017) Lee, J.; and Abu-El-Haija, S. 2017. Large-scale content-only video recommendation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 987–995.
  • Li (2018) Li, B. 2018. Sufficient dimension reduction: Methods and applications with R. CRC Press.
  • Li and Dong (2009) Li, B.; and Dong, Y. 2009. Dimension reduction for nonelliptically distributed predictors. The Annals of Statistics, 37(3): 1272 – 1298.
  • Li (1991) Li, K.-C. 1991. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414): 316–327.
  • Li (1992) Li, K.-C. 1992. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma. Journal of the American Statistical Association, 87(420): 1025–1039.
  • Li and Duan (1989) Li, K.-C.; and Duan, N. 1989. Regression analysis under link violation. The Annals of Statistics, 17(3): 1009–1052.
  • Liang, Sun, and Liang (2022) Liang, S.; Sun, Y.; and Liang, F. 2022. Nonlinear sufficient dimension reduction with a stochastic neural network. Advances in Neural Information Processing Systems, 35: 27360–27373.
  • Lipton (2018) Lipton, Z. C. 2018. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3): 31–57.
  • Ma and Zhu (2013) Ma, Y.; and Zhu, L. 2013. Efficient estimation in sufficient dimension reduction. Annals of statistics, 41(1): 250.
  • Saxe et al. (2019) Saxe, A. M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B. D.; and Cox, D. D. 2019. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): 124020.
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. 2020. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875–1897.
  • Shen et al. (2022) Shen, G.; Jiao, Y.; Lin, Y.; and Huang, J. 2022. Approximation with cnns in Sobolev space: With applications to classification. Advances in Neural Information Processing Systems, 35: 2876–2888.
  • Shen, Yang, and Zhang (2020) Shen, Z.; Yang, H.; and Zhang, S. 2020. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5): 1768–1811.
  • Shen, Yang, and Zhang (2021) Shen, Z.; Yang, H.; and Zhang, S. 2021. Neural network approximation: Three hidden layers are enough. Neural Networks, 141: 160–173.
  • Silver et al. (2018) Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419): 1140–1144.
  • Thirunavukarasu et al. (2023) Thirunavukarasu, A. J.; Ting, D. S. J.; Elangovan, K.; Gutierrez, L.; Tan, T. F.; and Ting, D. S. W. 2023. Large language models in medicine. Nature medicine, 29(8): 1930–1940.
  • Tishby and Zaslavsky (2015) Tishby, N.; and Zaslavsky, N. 2015. Deep learning and the information bottleneck principle. In 2015 ieee information theory workshop (itw), 1–5. IEEE.
  • Troiani et al. (2024) Troiani, E.; Dandi, Y.; Defilippis, L.; Zdeborova, L.; Loureiro, B.; and Krzakala, F. 2024. Fundamental limits of weak learnability in high-dimensional multi-index models. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning.
  • Van der Vaart (2000) Van der Vaart, A. W. 2000. Asymptotic statistics, volume 3. Cambridge university press.
  • Xia (2007) Xia, Y. 2007. A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35(6): 2654 – 2690.
  • Xia et al. (2002) Xia, Y.; Tong, H.; Li, W. K.; and Zhu, L.-X. 2002. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 64(3): 363–410.
  • Yang and Barron (1999) Yang, Y.; and Barron, A. 1999. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 1564–1599.
  • Yarotsky (2017) Yarotsky, D. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks, 94: 103–114.
  • Zhang et al. (2021) Zhang, Y.; Tiňo, P.; Leonardis, A.; and Tang, K. 2021. A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5): 726–742.
  • Zhu, Jiao, and Jordan (2022) Zhu, B.; Jiao, J.; and Jordan, M. I. 2022. Robust estimation for non-parametric families via generative adversarial networks. In 2022 IEEE International Symposium on Information Theory (ISIT), 1100–1105. IEEE.