Neural Networks Perform Sufficient Dimension Reduction
Abstract
This paper investigates the connection between neural networks and sufficient dimension reduction (SDR), demonstrating that neural networks inherently perform SDR in regression tasks under appropriate rank regularization. Specifically, the weights in the first layer span the central mean subspace. We establish the statistical consistency of the neural network-based estimator for the central mean subspace, underscoring the suitability of neural networks in addressing SDR-related challenges. Numerical experiments further validate our theoretical findings and highlight the underlying capability of neural networks to facilitate SDR compared with existing methods. Additionally, we discuss an extension to unravel the central subspace, broadening the scope of our investigation.
Introduction
Neural networks have achieved significant success in a tremendous variety of applications (Lee and Abu-El-Haija 2017; Silver et al. 2018; Jumper et al. 2021; Brandes et al. 2022; Thirunavukarasu et al. 2023). At their foundation, a feedforward neural network is typically constructed from a series of linear transformations and nonlinear activations. To be specific, a function implemented by a feedforward neural network with $D$ layers can be represented as

$$f(x) = \ell_D \circ \sigma \circ \ell_{D-1} \circ \sigma \circ \cdots \circ \sigma \circ \ell_1(x). \tag{1}$$

Here, $\circ$ is the functional composition operator, $\ell_i$ denotes a linear transformation, and $\sigma$ represents the elementwise nonlinear activation function. Despite this clear architecture, formula (1) provides limited insight into how the information within the input data is processed by the network. To comprehensively understand how neural networks retrieve task-relevant information, there have been extensive efforts toward unraveling their interpretability (Doshi-Velez and Kim 2017; Guidotti et al. 2018; Zhang et al. 2021). For instance, post-hoc interpretability (Lipton 2018) focuses on the predictions generated by neural networks while disregarding the detailed mechanism and feature importance. Ghorbani, Abid, and Zou (2019) highlighted the fragility of this type of interpretation, as indistinguishable perturbations can result in completely different interpretations.
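As a concrete illustration of formula (1), the following minimal PyTorch sketch composes linear maps and elementwise ReLU activations. The layer widths are placeholders chosen for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class FeedforwardNet(nn.Module):
    """A plain feedforward network: alternating linear maps l_i and activations sigma."""
    def __init__(self, widths=(10, 64, 64, 1)):    # illustrative widths
        super().__init__()
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers.append(nn.Linear(w_in, w_out))  # linear transformation l_i
            layers.append(nn.ReLU())               # elementwise activation sigma
        layers.pop()                               # no activation after the output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# usage: FeedforwardNet()(torch.randn(5, 10)) has shape (5, 1)
```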
For the sake of striking a balance between the complexity of representation and the power of prediction, Tishby and Zaslavsky (2015) and Saxe et al. (2019) pioneered the interpretability of deep neural networks via the information bottleneck theory. Ghosh (2022) further linked the information bottleneck to sufficient dimension reduction (SDR) (Li 1991; Cook 1996), a rapidly developing research field with applications in regression diagnostics, data visualization, pattern recognition, and beyond.
In this paper, we provide a theoretical understanding of neural networks for representation learning from the SDR perspective. Let $X \in \mathbb{R}^p$ represent the covariates and $Y \in \mathbb{R}$ represent the response. Consider the following regression model

$$Y = g(\beta^\top X) + \varepsilon, \tag{2}$$

where $\beta \in \mathbb{R}^{p \times d}$ is a nonrandom matrix with $d \le p$, $g$ is an unknown function, and $\varepsilon$ is the noise such that $E(\varepsilon \mid X) = 0$ and $E(\varepsilon^2 \mid X) \le c_0$ for some positive constant $c_0$. Intuitively, the semiparametric and potentially multi-index model (2) asserts that the core information for regression is encoded in the low-dimensional linear representation $\beta^\top X$. Model (2) has been extensively studied in the SDR literature. Based on model (2), Cook and Li (2002) proposed the objective of sufficient mean dimension reduction as

$$Y \perp\!\!\!\perp E(Y \mid X) \mid B^\top X \tag{3}$$

for some matrix $B$, where $\perp\!\!\!\perp$ denotes statistical independence. Denote the column space spanned by $B$ as $\mathcal{S}(B)$, which is commonly referred to as a mean dimension-reduction subspace. It is evident that a matrix $B$ satisfying (3) is far from unique. Hence, we focus on the intersection of all possible mean dimension-reduction subspaces, which is itself a mean dimension-reduction subspace under mild conditions (e.g., when the domain of $X$ is open and convex; see Cook and Li (2002)), and term it the central mean subspace. In accordance with condition (3), the matrix $\beta$ defined in model (2) spans the central mean subspace, denoted $\mathcal{S}_{E(Y \mid X)}$, under certain assumptions. Statistical estimation and inference for the central mean subspace is the primary goal of SDR. Popular statistical methods for recovering the central mean subspace include ordinary least squares (Li and Duan 1989), principal Hessian directions (Li 1992), minimum average variance estimation (Xia et al. 2002), the semiparametric approach (Ma and Zhu 2013), generalized kernel-based dimension reduction (Fukumizu and Leng 2014), and many others. Although numerous studies have demonstrated the ability of neural networks to approximate complex functions (Hornik, Stinchcombe, and White 1989; Barron 1993; Yarotsky 2017; Shen, Yang, and Zhang 2021) and to adapt to low-dimensional structures (Bach 2017; Bauer and Kohler 2019; Schmidt-Hieber 2020; Abbe, Adsera, and Misiakiewicz 2022; Jiao et al. 2023; Troiani et al. 2024), it is of subsequent interest to investigate whether neural networks are capable of correctly identifying the intrinsic structure encapsulated in $\mathcal{S}(\beta)$, thereby deepening the interpretation of neural networks.
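To make model (2) concrete, the following sketch simulates data in which the response depends on the covariates only through a two-dimensional linear representation. The dimensions, link function, and noise level are arbitrary choices for illustration rather than the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 10, 2, 1000                                # illustrative dimensions and sample size
beta = np.linalg.qr(rng.standard_normal((p, d)))[0]  # p x d matrix with orthonormal columns

def g(z):
    # an arbitrary smooth link acting only on the d-dimensional features
    return np.sin(z[:, 0]) + z[:, 1] ** 2

X = rng.standard_normal((n, p))
Y = g(X @ beta) + 0.1 * rng.standard_normal(n)       # Y depends on X only through beta^T X
```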
Our study was inspired by the observation that, with rank regularization, the weight matrix in the first layer could accurately detect the presence of $\beta$ in a toy data set. Specifically, we trained neural networks on data generated from a toy instance of model (2) using the least squares loss, with the first-layer weight matrix factorized as a product of two low-rank factors so that its rank did not exceed $d$. It was then observed that, for each trained network, (i) $\beta$ was closely contained within the column space spanned by the first-layer weights, as the absolute cosine similarity between $\beta$ and its projection onto that space was close to 1, and (ii) the leading eigenvector of the Gram matrix of the first-layer weights closely aligned with the column space of $\beta$ (see Figure 1). This observation indicates that the first layer of a neural network may discover the underlying low-dimensional intrinsic structure.
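The sketch below reproduces the spirit of this toy experiment under the assumptions of the simulated data above (it reuses p, d, X, Y, and beta from the previous snippet): the first-layer weight is parameterized as a product of two thin factors so that its rank cannot exceed d, and after training we check how well the columns of beta are captured by the row space of that weight. The architecture, learning rate, and iteration count are illustrative guesses, not the paper's configuration.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
m = 64                                   # first-layer width (illustrative)

class RankRegularizedNet(nn.Module):
    """First-layer weight W1 = U @ V has rank at most d by construction."""
    def __init__(self, p, d, m):
        super().__init__()
        self.U = nn.Parameter(torch.randn(m, d) / d ** 0.5)
        self.V = nn.Parameter(torch.randn(d, p) / p ** 0.5)
        self.rest = nn.Sequential(nn.ReLU(), nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        W1 = self.U @ self.V             # (m, p), rank <= d
        return self.rest(x @ W1.T)

net = RankRegularizedNet(p, d, m)
X_t = torch.tensor(X, dtype=torch.float32)
Y_t = torch.tensor(Y, dtype=torch.float32).unsqueeze(1)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):                    # plain full-batch least squares training
    opt.zero_grad()
    nn.functional.mse_loss(net(X_t), Y_t).backward()
    opt.step()

# Diagnostic: project each column of beta onto the row space of the learned W1
W1 = (net.U @ net.V).detach().numpy()
_, _, Vt = np.linalg.svd(W1, full_matrices=False)
Q = Vt[:d].T                             # orthonormal basis of the (at most d-dim) row space
P = Q @ Q.T
for j in range(d):
    b, proj = beta[:, j], P @ beta[:, j]
    cos = abs(b @ proj) / (np.linalg.norm(b) * np.linalg.norm(proj))
    print(f"column {j}: |cosine similarity| = {cos:.3f}")   # near 1 if captured
```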

It is important to note that the application of neural networks to estimating the central mean subspace based on model (2) was previously explored by Kapla, Fertl, and Bura (2022) through a two-stage method. The first stage focused on obtaining a preliminary estimate of $\beta$, which subsequently served as an initial point for the joint estimation of $\beta$ and $g$ in the second stage. Given our toy example, however, it is prudent to critically evaluate the necessity of the first stage. Furthermore, their work lacks a comprehensive theoretical guarantee. Another related work, by Liang, Sun, and Liang (2022), concentrated on seeking nonlinear sufficient representations. In contrast, we focus on revealing the fundamental nature of neural networks themselves.
In this paper, we show that, with suitable rank regularization, the first layer of a feedforward neural network conducts SDR in a regression task, in the sense that the subspace spanned by the first-layer weights converges to $\mathcal{S}(\beta)$ in probability under a suitable distance metric. Furthermore, numerical experiments provide empirical evidence supporting this result, while demonstrating the efficiency of neural networks in addressing the SDR problem.
Throughout this paper, we use $\|v\|_2$ to represent the Euclidean norm of a vector $v$. For a matrix $A$, $\|A\|_F$ is the Frobenius norm of $A$, $P_A = A(A^\top A)^{-}A^\top$ denotes the projection matrix onto the column space of $A$, where $(A^\top A)^{-}$ is the generalized inverse of $A^\top A$, and $\mathcal{S}(A)$ stands for the linear space spanned by the columns of $A$. For a measurable function $f$, $\|f\|_{L^2(\mu)}$ represents the $L^2$ norm of $f$ with respect to a given probability measure $\mu$, and $\|f\|_{\infty}$ represents the supremum norm of $f$ over a given set. The unit ball induced by a norm is the set of elements whose norm is at most one.
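The projection matrix and Frobenius norm in this notation translate directly into code; the helpers below are a small sketch (the function names are ours) that mirrors these definitions.

```python
import numpy as np

def projection(B):
    """Projection matrix onto the column space of B, via the generalized inverse."""
    return B @ np.linalg.pinv(B.T @ B) @ B.T

def frob(A):
    """Frobenius norm of a matrix."""
    return float(np.linalg.norm(A, ord="fro"))
```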
Theoretical Justifications
Population-Level Unbiasedness
Suppose the true intrinsic dimension $d$ defined in model (2) is known and the covariates satisfy $X \in \mathbb{R}^p$. As $\beta$ is not identifiable in (2) and (3), it is assumed without loss of generality that $\beta^\top \beta = I_d$, where $I_d$ is the identity matrix with $d$ rows. By writing $f_0(x) = g(\beta^\top x)$, we have $E(Y \mid X) = f_0(X)$. In this paper, we consider a neural network function class, denoted $\mathcal{F}_d$, of feedforward networks whose first-layer weight matrix is constrained to have rank at most $d$.
The activation functions utilized are the rectified linear units, i.e., $\sigma(u) = \max\{u, 0\}$. We emphasize that $\mathcal{F}_d$ incorporates a rank regularization of the weight matrix in the first layer. For an arbitrary $f \in \mathcal{F}_d$, we use $\beta_f$ to represent the $p \times d$ component in the first layer of $f$.
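The precise width and depth constraints of the class are not reproduced here; the following sketch captures its key feature, namely a first layer whose weight matrix has rank at most $d$, implemented, consistently with the implementation described later, as a bias-free linear map from $p$ to $d$ dimensions followed by a ReLU feedforward network. The class name SDRNet and the hidden widths are our own choices.

```python
import torch
import torch.nn as nn

class SDRNet(nn.Module):
    """Sketch of the rank-regularized class: x -> h(B^T x) with B of size p x d."""
    def __init__(self, p, d, hidden=(64, 64)):
        super().__init__()
        self.B = nn.Linear(p, d, bias=False)        # first layer, rank at most d
        layers, w_in = [], d
        for w in hidden:
            layers += [nn.Linear(w_in, w), nn.ReLU()]
            w_in = w
        layers.append(nn.Linear(w_in, 1))
        self.h = nn.Sequential(*layers)             # ReLU feedforward link estimator

    def forward(self, x):
        return self.h(self.B(x))

# usage: SDRNet(p=10, d=2)(torch.randn(4, 10)) has shape (4, 1)
```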
For a regression task, the theoretical study is highly related to the smoothness of the underlying conditional mean function (Yang and Barron 1999). Here, we introduce the following assumptions on model (2).
Assumption 1 (Smoothness).
$g$ is a Hölder continuous function of order $\alpha$ with constant $\lambda$, i.e., $|g(u_1) - g(u_2)| \le \lambda \|u_1 - u_2\|_2^{\alpha}$ for any $u_1, u_2$. Additionally, $\|g\|_{\infty} \le M$ for some constant $M$.
Assumption 2 (Sharpness).
For any scalar and , implies for some .
Assumption 3.
is sub-exponentially distributed such that there exists satisfying .
Assumption 1 is a technical condition for studying the approximation capability of neural networks (Shen, Yang, and Zhang 2020). Alternatively, other function spaces, such as the Sobolev space, can also be employed for this purpose (Abdeljawad and Grohs 2022; Shen et al. 2022). Furthermore, Assumption 2 places a restriction on the sharpness of $g$. Consider the case where $g$ is identically zero. Then a trivial neural network, obtained by setting all parameters except the first-layer weights to zero, perfectly fits the conditional mean regardless of the value of the first-layer weights. With Assumption 2, it becomes difficult to accurately capture the overall behavior of $E(Y \mid X)$ using a biased first-layer weight matrix. In other words, $\mathcal{S}(\beta)$ is both sufficient and necessary for recovering $E(Y \mid X)$, i.e., $\mathcal{S}(\beta)$ is the central mean subspace. A similar condition was adopted in Theorem 4.2 of Li and Dong (2009) to distinguish sufficient directions from others. Assumption 3 is a commonly used condition for applying empirical process tools and concentration inequalities (Van der Vaart 2000; Zhu, Jiao, and Jordan 2022).
|  |  | NN | MAVE | GKDR | SIR | SAVE | PHD |
|---|---|---|---|---|---|---|---|
| Setting 1 | mean | 0.135 | 0.160 | 0.388 | 0.897 | 0.315 | 0.623 |
|  | std | 0.064 | 0.056 | 0.126 | 0.186 | 0.082 | 0.150 |
|  | mean | 0.276 | 0.214 | 1.012 | 0.946 | 0.298 | 0.807 |
|  | std | 0.274 | 0.051 | 0.299 | 0.113 | 0.049 | 0.107 |
|  | mean | 0.120 | 0.130 | 0.542 | 0.900 | 0.337 | 0.709 |
|  | std | 0.038 | 0.029 | 0.166 | 0.136 | 0.063 | 0.108 |
| Setting 2 | mean | 0.296 | 0.730 | 1.130 | 0.665 | 0.319 | 1.550 |
|  | std | 0.126 | 0.312 | 0.271 | 0.166 | 0.076 | 0.126 |
|  | mean | 0.628 | 0.899 | 1.155 | 0.705 | 0.338 | 1.526 |
|  | std | 0.269 | 0.329 | 0.237 | 0.149 | 0.074 | 0.138 |
|  | mean | 1.197 | 1.187 | 1.278 | 0.830 | 0.367 | 1.567 |
|  | std | 0.213 | 0.204 | 0.200 | 0.169 | 0.072 | 0.131 |
| Setting 3 | mean | 0.639 | 1.246 | 1.759 | 1.669 | 1.728 | 1.740 |
|  | std | 0.418 | 0.289 | 0.149 | 0.235 | 0.136 | 0.221 |
|  | mean | 0.248 | 1.076 | 1.752 | 1.650 | 1.683 | 1.721 |
|  | std | 0.242 | 0.331 | 0.125 | 0.271 | 0.201 | 0.228 |
|  | mean | 0.075 | 0.924 | 0.554 | 1.652 | 1.678 | 1.737 |
|  | std | 0.081 | 0.382 | 0.371 | 0.259 | 0.245 | 0.191 |
| Setting 4 | mean | 0.127 | 0.293 | 0.368 | 1.363 | 0.429 | 0.960 |
|  | std | 0.161 | 0.050 | 0.079 | 0.167 | 0.079 | 0.178 |
|  | mean | 0.144 | 0.415 | 0.386 | 1.530 | 0.467 | 0.902 |
|  | std | 0.067 | 0.076 | 0.077 | 0.153 | 0.082 | 0.206 |
|  | mean | 0.140 | 0.360 | 0.344 | 1.410 | 0.364 | 0.735 |
|  | std | 0.055 | 0.084 | 0.010 | 0.163 | 0.065 | 0.175 |
Theorem 1.
Theorem 1 builds a bridge connecting neural networks and SDR. It demonstrates that neural networks indeed achieve representation learning, as the first layer of the optimal neural network exactly attains the target of SDR at the population level. Theorem 1 also inspires us to perform SDR based on neural networks with a minor adjustment to the first layer. The detailed proof of Theorem 1 can be found in Section Proofs.
Sample Estimation Consistency
We now investigate the theoretical properties of the neural network-based sample-level estimator for SDR. Given sample observations $\{(X_i, Y_i)\}_{i=1}^n$, where $(X_i, Y_i)$ is an independent copy of $(X, Y)$ for $i = 1, \ldots, n$, the commonly used least squares loss is adopted, i.e.,
$$\mathcal{L}_n(f) = \frac{1}{n}\sum_{i=1}^n \big(Y_i - f(X_i)\big)^2.$$
Denote the optimal neural network estimator at the sample level as
$$\hat{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{F}_d} \mathcal{L}_n(f).$$
Its first-layer component $\hat{\beta}_n = \beta_{\hat{f}_n}$ is then the sample estimator approximately spanning the central mean subspace.
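A minimal sketch of this sample-level estimation, assuming the SDRNet class from the earlier snippet and illustrative optimizer settings, simply minimizes the empirical least squares loss; the fitted first-layer weight then serves as the estimate spanning the central mean subspace.

```python
import torch
import torch.nn as nn

def fit_sdr_net(X, Y, d, epochs=1000, lr=1e-3, hidden=(64, 64)):
    """Minimize the empirical least squares loss over the rank-regularized sketch class.

    X: (n, p) float tensor, Y: (n,) float tensor. Hyperparameters are illustrative.
    """
    net = SDRNet(X.shape[1], d, hidden)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    Y = Y.reshape(-1, 1)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(net(X), Y).backward()   # (1/n) sum_i (f(X_i) - Y_i)^2
        opt.step()
    return net

# e.g., beta_hat = fit_sdr_net(X_t, Y_t, d).B.weight.detach().T   # (p, d) sample estimate
```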
To examine the closeness between $\hat{\beta}_n$ and $\beta$, we define the distance metric as
$$D(\hat{\beta}_n, \beta) = \min_{Q \in \mathcal{O}(d)} \big\|\hat{\beta}_n Q - \beta\big\|_F,$$
where $Q$ ranges over $\mathcal{O}(d)$, the collection of all orthogonal matrices in $\mathbb{R}^{d \times d}$. We see that $D(\hat{\beta}_n, \beta) = 0$ if and only if $\mathcal{S}(\hat{\beta}_n) = \mathcal{S}(\beta)$. We then make the following assumption, which is essentially another view of Assumption 2.
Assumption 4.
For any positive scalar , implies for some .
Theorem 2 confirms that $\hat{\beta}_n$ converges to $\beta$ in probability under the metric $D$. As a result, under the least squares loss, neural networks with appropriate rank regularization provide a consistent estimator of the central mean subspace. Therefore, it is promising to adopt the neural network function class to address sufficient mean dimension reduction problems. More importantly, this approach offers several advantages over existing methods. Unlike SIR (Li 1991), SAVE (Cook and Weisberg 1991), and PHD (Li 1992), our method does not impose stringent distributional conditions on $X$, such as the linearity, constant variance, and normality assumptions. Compared with MAVE (Xia et al. 2002), our method adopts powerful neural networks to estimate the link function, thereby performing similarly to or better than classical nonparametric tools based on first-order Taylor expansion. We present the proof of Theorem 2 in Section Proofs.
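Reading the definition of $D$ above as an orthogonal Procrustes problem, it can be computed in closed form; the sketch below also includes a rotation-free alternative based on projection matrices. Both function names are ours, and the exact metric used in the paper may differ in normalization.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_distance(beta_hat, beta):
    """min over orthogonal Q of ||beta_hat @ Q - beta||_F."""
    Q, _ = orthogonal_procrustes(beta_hat, beta)
    return float(np.linalg.norm(beta_hat @ Q - beta, ord="fro"))

def projection_distance(beta_hat, beta):
    """Alternative comparison of the two column spaces via projection matrices."""
    proj = lambda B: B @ np.linalg.pinv(B.T @ B) @ B.T
    return float(np.linalg.norm(proj(beta_hat) - proj(beta), ord="fro"))
```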
The results illustrated in Theorems 1 and 2 are contingent on the availability of the true intrinsic dimension $d$. In the case where $d$ is unknown, a natural modification is to remove the rank regularization on the first layer. Under Assumptions 1 and 2, the optimal network at the population level in this scenario still ensures that the space spanned by its first-layer weights encompasses $\mathcal{S}(\beta)$, where the sharpness of $g$ plays a crucial role; see the Supplementary Material for more details.
Simulation Study
From the perspective of numerical experiments, we utilized simulated data sets to demonstrate that (i) the column space of the estimated first-layer weight matrix approached the central mean subspace as the sample size increased, and (ii) the performance of the neural network-based method in conducting SDR was comparable to that of classical methods. In some cases, the neural network-based method outperformed classical methods, particularly when the latent intrinsic dimension was large. Five commonly used methods were included for comparison: sliced inverse regression (SIR), sliced average variance estimation (SAVE), principal Hessian directions (PHD), minimum average variance estimation (MAVE), and generalized kernel dimension reduction (GKDR).
The neural network-based method was implemented in Python using PyTorch. Specifically, we utilized a linear layer without a bias term to represent the matrix $\beta$ (recall that we suppose $d$ is known), which was then followed by a fully-connected feedforward neural network. The width of the hidden layers depended on the sample size, with one choice for sample sizes below 1000 and a different choice for sample sizes between 1000 and 2000. Code is available at https://github.com/oaksword/DRNN.
We considered the following four scenarios, two of them of relatively small dimension and the rest of larger dimension.
Setting 1: where and .
Setting 2: where and .
Setting 3: where , , , and .
Setting 4:
where ,
In setting 1, we tested the combinations of and , where was the sample size. In setting 2, we fixed and set . In setting 3, we fixed and tested . In setting 4, we fixed and tested and . Settings 1, 2, and 4 were equipped with continuous regression functions, while setting 3 involved discontinuity. We set the number of training iterations of our method to 1000 for settings 1 and 2, and increased the iterations to 2000 and 4000 for settings 3 and 4, respectively, due to their higher complexity.
To evaluate the performance of each method, we calculated the distance metric between the estimate and $\beta$, where the estimate was produced by each competing method; in particular, for our neural network-based method, it was extracted from the first layer of the fitted network. Smaller values of this metric indicated better performance. We ran 100 replicates for each setting. The results are displayed in Table 1 and Figure 2.
In simple cases (settings 1 and 2), the neural network-based method showed performance similar to that of classical methods, with sliced average variance estimation being the most effective method in setting 2. The more complex settings, however, demonstrated the significant superiority of the neural network-based method. Specifically, in setting 3, the metric decreased rapidly as the sample size increased, indicating the convergence of the estimated subspace to $\mathcal{S}(\beta)$. According to the results in setting 4, the neural network-based method was capable of handling higher-dimensional scenarios. In summary, the numerical studies advocate neural networks as a powerful tool for SDR.

Real Data Analysis
We further applied the neural network-based method with rank regularization to a real regression task. In practice, the precise intrinsic dimension is unknown, and it is uncertain whether a low-dimensional structure exists. To address this issue, we used cross-validation to determine an appropriate $d$ from a range of candidate values. More specifically, the raw data set was divided into 80% training data and 20% testing data. The optimal value of $d$ was determined through cross-validation on the training data. Subsequently, the final model was fit with the selected $d$ on the training data, and the mean squared error on the testing data was evaluated.
To reduce randomness, the aforementioned process was repeated 20 times and the resulting testing errors were recorded. Additionally, we conducted a comparative analysis between the neural network-based approach and alternative methods, including a vanilla neural network without rank regularization, SIR-based regression, SAVE-based regression, and MAVE-based regression. For the latter three techniques, we first performed SDR to acquire the embedded data and then used a fully-connected neural network for prediction. The optimal value of $d$ was also determined through cross-validation.
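A sketch of such a cross-validation loop over candidate dimensions, assuming the fit_sdr_net helper defined earlier (the candidate list, fold count, and seed are illustrative):

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def select_d(X, Y, candidate_ds, n_splits=5):
    """Choose the reduced dimension by K-fold cross-validated mean squared error."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    cv_mse = {}
    for d in candidate_ds:
        fold_errors = []
        for tr, va in kf.split(X):
            net = fit_sdr_net(torch.tensor(X[tr], dtype=torch.float32),
                              torch.tensor(Y[tr], dtype=torch.float32), d)
            with torch.no_grad():
                pred = net(torch.tensor(X[va], dtype=torch.float32)).squeeze(1).numpy()
            fold_errors.append(float(np.mean((pred - Y[va]) ** 2)))
        cv_mse[d] = float(np.mean(fold_errors))
    return min(cv_mse, key=cv_mse.get), cv_mse
```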
We utilized a data set of weather records from Seoul, South Korea, during the summer months from 2013 to 2017 (Cho et al. 2020), available at the UCI data set repository (bias correction of numerical prediction model temperature forecast). This data set contained 7750 observations with 23 predictors and 2 responses, namely the next-day maximum and next-day minimum air temperatures. After excluding the variables for station and date, the data set was reduced to 21 predictors, which were further standardized using the StandardScaler in the scikit-learn package.
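A possible preprocessing sketch is given below; the file name and column names are assumptions about the UCI download rather than details taken from the paper, so they may need adjusting.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# File and column names ("Bias_correction_ucl.csv", "station", "Date",
# "Next_Tmax", "Next_Tmin") are assumptions; adjust to the actual header.
df = pd.read_csv("Bias_correction_ucl.csv").dropna()
y = df[["Next_Tmax", "Next_Tmin"]].to_numpy()                                   # two responses
X = df.drop(columns=["station", "Date", "Next_Tmax", "Next_Tmin"]).to_numpy()   # 21 predictors

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```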
|  | NN-RR | NN-VN | MAVE | SIR | SAVE |
|---|---|---|---|---|---|
| mean | 0.602 | 0.774 | 0.743 | 1.324 | 0.772 |
| std | 0.043 | 0.116 | 0.159 | 0.190 | 0.161 |
| avg. optimal dimension | 19.6 | — | 19.25 | 19.6 | 20.6 |
It is evident from Table 2 that the neural network-based method with rank regularization outperformed the other methods, demonstrating both the effectiveness of the modification compared with the vanilla neural network and the sound performance in reducing dimensions compared with the other SDR methods. The presence of latent structures was partially supported by the averaged optimal dimension: 19 or 20 combinations of the raw predictors might be sufficient, as opposed to the original 21 predictors.
Proofs
Proof of Theorem 1
Define the vanilla neural network function class without rank regularization as
Lemma 1.
Suppose that is Hölder continuous with and . Then, for any , there exists a function in neural network function class , with rectified linear unit activation and large enough, such that
Proof of Theorem 1.
For such that , under Assumption 2, there exists a scalar satisfying
Then, we have
On the other hand, for such that , there exists an orthogonal matrix such that . Hence, and is still a neural network function. Assumption 1 and Lemma 1 imply that there is a neural network function satisfying
for sufficiently large . As a result, select such that , and it follows that
Therefore, for any such that , there exists a neural network function , with the same rank regularization, satisfying
To conclude, , which entails that . ∎
Proof of Theorem 2
Let . Without loss of generality, suppose . We first present some useful lemmas.
Lemma 2.
For sufficiently large , it follows that
The proof of Lemma 2 is presented in the Supplementary Material. We note this convergence rate may not be optimal, but is sufficient for deducing consistency.
Lemma 3.
Proof of Lemma 3.
We begin with the decomposition that
where is a constant, for sufficiently large . Hence,
By Theorem 1.1 in Shen, Yang, and Zhang (2020), there exists with for some constant such that
Let
and observe that . Then, it follows that
where is a constant depending on and . ∎
With Lemma 3, the proof of Theorem 2 is a direct application of Theorem 5.9 in Van der Vaart (2000).
Proof of Theorem 2.
We denote the metric
Here, , , means , and is the probability distribution of . Recall that , where .
For and such that , if
we have
for some orthogonal matrix . Hence,
where .
Discussion
We have demonstrated that neural networks possess the capability of detecting the underlying low-dimensional structure that preserves all the information in $X$ about the conditional mean $E(Y \mid X)$. As a result, neural networks are well suited for estimating the central mean subspace. These theoretical investigations sharpen our understanding of neural networks while broadening the scope of SDR.
In the context of SDR, a more general scenario than sufficient mean dimension reduction emerges when considering

$$Y \perp\!\!\!\perp X \mid B^\top X. \tag{4}$$

The intersection of the column spaces $\mathcal{S}(B)$ of all matrices $B$ satisfying (4) corresponds to the central subspace (Cook 1998a; Li 2018). It is clear that relation (4) is equivalent to requiring that the conditional probability density function of $Y$ given $X$ depends on $X$ only through $B^\top X$. Following the work of Xia (2007), we can further adapt neural networks to estimate the central subspace by modifying the loss function.
Under mild conditions, Xia (2007) showed that the conditional density of $Y$ given $X$ can be recovered as the limit, as the bandwidth $h \to 0$, of the conditional expectation of a kernel-smoothed response, $E\{K_h(Y - y) \mid X\}$. Here, $y$ is a fixed point, and $K_h$ is a suitable kernel function with bandwidth $h$. Under relation (4), this finding implies that the same conditional expectation depends on $X$ only through $B^\top X$. Based on this discovery, it is natural to employ a neural network function to approximate this conditional expectation and obtain the estimate of the central subspace at the population level by minimizing a squared-error criterion, where the evaluation point is taken as $\tilde{Y}$, an independent copy of $Y$. Empirically, we define the loss function by replacing the expectations with averages over sample pairs.
We provide additional simulation results to verify the feasibility of utilizing neural networks for the estimation of the central subspace; see the Supplementary Material. Theoretical analysis of neural networks for the estimation of the central subspace, including unbiasedness and consistency, deserves further study.
References
- Abbe, Adsera, and Misiakiewicz (2022) Abbe, E.; Adsera, E. B.; and Misiakiewicz, T. 2022. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, 4782–4887. PMLR.
- Abdeljawad and Grohs (2022) Abdeljawad, A.; and Grohs, P. 2022. Approximations with deep neural networks in Sobolev time-space. Analysis and Applications, 20(03): 499–541.
- Bach (2017) Bach, F. 2017. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19): 1–53.
- Barron (1993) Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3): 930–945.
- Bauer and Kohler (2019) Bauer, B.; and Kohler, M. 2019. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4): 2261–2285.
- Brandes et al. (2022) Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; and Linial, M. 2022. ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics, 38(8): 2102–2110.
- Cho et al. (2020) Cho, D.; Yoo, C.; Im, J.; and Cha, D.-H. 2020. Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth and Space Science, 7(4): e2019EA000740.
- Cook (1996) Cook, R. D. 1996. Graphics for regressions with a binary response. Journal of the American Statistical Association, 91(435): 983–992.
- Cook (1998a) Cook, R. D. 1998a. Regression graphics. New York: Wiley.
- Cook and Li (2002) Cook, R. D.; and Li, B. 2002. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2): 455–474.
- Cook and Weisberg (1991) Cook, R. D.; and Weisberg, S. 1991. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414): 328–332.
- Doshi-Velez and Kim (2017) Doshi-Velez, F.; and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv:1702.08608.
- Fukumizu and Leng (2014) Fukumizu, K.; and Leng, C. 2014. Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109(505): 359–370.
- Ghorbani, Abid, and Zou (2019) Ghorbani, A.; Abid, A.; and Zou, J. 2019. Interpretation of neural networks is fragile. In Proceedings of the AAAI conference on artificial intelligence, volume 33, 3681–3688.
- Ghosh (2022) Ghosh, D. 2022. Sufficient dimension reduction: An information-theoretic viewpoint. Entropy, 24(2): 167.
- Guidotti et al. (2018) Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; and Pedreschi, D. 2018. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5): 1–42.
- Hornik, Stinchcombe, and White (1989) Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359–366.
- Jiao et al. (2023) Jiao, Y.; Shen, G.; Lin, Y.; and Huang, J. 2023. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2): 691–716.
- Jumper et al. (2021) Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873): 583–589.
- Kapla, Fertl, and Bura (2022) Kapla, D.; Fertl, L.; and Bura, E. 2022. Fusing sufficient dimension reduction with neural networks. Computational Statistics & Data Analysis, 168: 107390.
- Lee and Abu-El-Haija (2017) Lee, J.; and Abu-El-Haija, S. 2017. Large-scale content-only video recommendation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 987–995.
- Li (2018) Li, B. 2018. Sufficient dimension reduction: Methods and applications with R. CRC Press.
- Li and Dong (2009) Li, B.; and Dong, Y. 2009. Dimension reduction for nonelliptically distributed predictors. The Annals of Statistics, 37(3): 1272–1298.
- Li (1991) Li, K.-C. 1991. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414): 316–327.
- Li (1992) Li, K.-C. 1992. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma. Journal of the American Statistical Association, 87(420): 1025–1039.
- Li and Duan (1989) Li, K.-C.; and Duan, N. 1989. Regression analysis under link violation. The Annals of Statistics, 17(3): 1009–1052.
- Liang, Sun, and Liang (2022) Liang, S.; Sun, Y.; and Liang, F. 2022. Nonlinear sufficient dimension reduction with a stochastic neural network. Advances in Neural Information Processing Systems, 35: 27360–27373.
- Lipton (2018) Lipton, Z. C. 2018. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3): 31–57.
- Ma and Zhu (2013) Ma, Y.; and Zhu, L. 2013. Efficient estimation in sufficient dimension reduction. The Annals of Statistics, 41(1): 250.
- Saxe et al. (2019) Saxe, A. M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B. D.; and Cox, D. D. 2019. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): 124020.
- Schmidt-Hieber (2020) Schmidt-Hieber, J. 2020. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875–1897.
- Shen et al. (2022) Shen, G.; Jiao, Y.; Lin, Y.; and Huang, J. 2022. Approximation with cnns in Sobolev space: With applications to classification. Advances in Neural Information Processing Systems, 35: 2876–2888.
- Shen, Yang, and Zhang (2020) Shen, Z.; Yang, H.; and Zhang, S. 2020. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5): 1768–1811.
- Shen, Yang, and Zhang (2021) Shen, Z.; Yang, H.; and Zhang, S. 2021. Neural network approximation: Three hidden layers are enough. Neural Networks, 141: 160–173.
- Silver et al. (2018) Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419): 1140–1144.
- Thirunavukarasu et al. (2023) Thirunavukarasu, A. J.; Ting, D. S. J.; Elangovan, K.; Gutierrez, L.; Tan, T. F.; and Ting, D. S. W. 2023. Large language models in medicine. Nature Medicine, 29(8): 1930–1940.
- Tishby and Zaslavsky (2015) Tishby, N.; and Zaslavsky, N. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), 1–5. IEEE.
- Troiani et al. (2024) Troiani, E.; Dandi, Y.; Defilippis, L.; Zdeborova, L.; Loureiro, B.; and Krzakala, F. 2024. Fundamental limits of weak learnability in high-dimensional multi-index models. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning.
- Van der Vaart (2000) Van der Vaart, A. W. 2000. Asymptotic statistics, volume 3. Cambridge university press.
- Xia (2007) Xia, Y. 2007. A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35(6): 2654–2690.
- Xia et al. (2002) Xia, Y.; Tong, H.; Li, W. K.; and Zhu, L.-X. 2002. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 64(3): 363–410.
- Yang and Barron (1999) Yang, Y.; and Barron, A. 1999. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5): 1564–1599.
- Yarotsky (2017) Yarotsky, D. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks, 94: 103–114.
- Zhang et al. (2021) Zhang, Y.; Tiňo, P.; Leonardis, A.; and Tang, K. 2021. A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5): 726–742.
- Zhu, Jiao, and Jordan (2022) Zhu, B.; Jiao, J.; and Jordan, M. I. 2022. Robust estimation for non-parametric families via generative adversarial networks. In 2022 IEEE International Symposium on Information Theory (ISIT), 1100–1105. IEEE.