Going Beyond Linear RL: Sample Efficient Neural Function Approximation
Abstract
Deep Reinforcement Learning (RL) powered by neural net approximation of the Q function has had enormous empirical success. While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions. This is the focus of this work, where we study function approximation with two-layer neural networks (considering both ReLU and polynomial activation functions). Our first result is a computationally and statistically efficient algorithm in the generative model setting under completeness for two-layer neural networks. Our second result considers this setting but under only realizability of the neural net function class. Here, assuming deterministic dynamics, the sample complexity scales linearly in the algebraic dimension. In all cases, our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
1 Introduction
In reinforcement learning (RL), an agent aims to learn the optimal decision-making rule by interacting with an unknown environment [SB18]. Deep Reinforcement Learning, empowered by deep neural networks [LBH15, GBC16], has achieved tremendous success in various real-world applications, such as Go [SHM+16], Atari [MKS+13a], Dota 2 [BBC+19], Texas Hold'em poker [MSB+17], and autonomous driving [SSSS16]. These modern RL applications are characterized by large state-action spaces, and the empirical success of deep RL corroborates the observation that deep neural networks can extrapolate well across state-action spaces [HIB+18, MKS+13b, LHP+15].
Although non-linear function approximation schemes are prevalent in practice, theoretical understanding of the sample complexity of RL has mainly focused on tabular or linear function approximation settings [SLW+06, JOA10, AOM17, JAZBJ18, Rus19, ZB19, AYLSW19, JYWJ20a, JYWJ20b, WWDK21]. These results rely on finite state spaces or exact linear approximations. Recently, sample-efficient algorithms for non-linear function approximation settings have been proposed [WVR17, DJK+18, DKJ+19, DPWZ20, LCYW19, WCYW20, DYM21, ZLG20, YJW+20]. These algorithms are based on Bellman rank [JKA+17], eluder dimension [RVR13b], the neural tangent kernel [JGH18, AZLS19, DLL+19, ZCZG18], or sequential Rademacher complexity [RST15a, RST15b]. However, our understanding of how deep RL learns and generalizes in large state spaces is far from complete. Whereas the aforementioned works study function approximation structures that possess the nice properties of linear models, such as low information gain and low eluder dimension, the highly non-linear nature of neural networks makes it difficult to apply these existing analyses to deep RL. For one thing, recent results in deep learning theory cast doubt on the ability of the neural tangent kernel and random features to model actual neural networks: the neural tangent kernel approximately reduces a neural network to a random feature model, but the RKHS norm required to represent neural networks can be exponential in the dimension [YS19]. Moreover, it remains unclear which neural network models have low eluder dimension; recent works [DYM21, LKFS21] show that even two-layer neural networks have eluder dimension exponential in the feature dimension. Furthermore, [MPSL21] demonstrates hard instances where RL is exponentially hard even when the target value function can be approximated by a neural network and the optimal policy is softmax-linear. Thus, the mismatch between the empirical success of deep RL and its theoretical understanding remains significant, which yields the following important question:
What are the structural properties that allow sample-efficient algorithms for RL with neural network function approximation?
In this paper, we advance the understanding of the above question by exhibiting several structural properties that allow efficient RL algorithms with neural function approximation. We consider several value function approximation models that possess high information gain and high eluder dimension. Specifically, we study two structures, namely two-layer neural networks and structured polynomials (i.e., two-layer neural networks with polynomial activation functions), under two RL settings, namely RL with a simulator model and online RL. In the simulator (generative model) setting [Kak03, SWW+18], the agent can simulate the MDP at any state-action pair. In online RL, the agent can only start at an initial state and interact with the MDP step by step. The goal in both settings is to find a near-optimal policy while minimizing the number of samples used.
We obtain the following results. For the simulator setting, we propose sample-efficient algorithms for RL with two-layer neural network function approximation. Under policy completeness, Bellman completeness, or a gap condition, our method provably learns a near-optimal policy with polynomial sample complexity. For online RL, we provide sample-efficient algorithms for RL with structured polynomial function approximation. When the transition is deterministic, we also present sample-efficient algorithms under only the realizability assumption [DKWY20a, WWK21]. Our main techniques are based on neural network recovery [ZSJ+17, JSA15, GLM18a] and algebraic geometry [Mil17, Sha13, BCR13, WX19].
1.1 Summary of our results
Our main results in different settings are summarized in Table 1. We consider two-layer neural networks (with ReLU activation) and low-rank polynomials (see Example 4.3).
Table 1: Summary of our results.

| | Low-rank polynomial | | Neural net | | |
|---|---|---|---|---|---|
| | Sim. + Det. (R) | Onl. + Det. (R) | Sim. + Det. (R) | Sim. + Gap (R) | Sim. + Stoch. (C) |
| Baseline | | | (*) | | |
| Our results | | | | | |
We make the following elaborations on Table 1.
- For simplicity, we display only the dependence on the feature dimension, the network width or polynomial rank, the target precision, and the degree of the polynomials.
- In the table, Sim. denotes the simulator model, Onl. denotes online RL, Det. denotes deterministic transitions, Stoch. denotes stochastic transitions, Gap denotes the gap condition, (R) denotes the realizability assumption only, and (C) denotes a completeness assumption (either policy completeness or Bellman completeness) together with realizability.
- We apply [DLMW20] for the deterministic-transition baseline and [DKWY20b] for the stochastic-transition baseline. We are unaware of any methods that can directly learn MDPs with neural network value function approximation (prior work on neural function approximation has focused on neural tangent kernels, which would require an exponentially large number of features to approximate a two-layer network [GMMM21]).
- In the polynomial case, the baseline first vectorizes the underlying tensor into a vector whose dimension is the feature dimension raised to the polynomial degree, and then runs a linear method on this vector. In the neural network case, the baseline first approximates the neural network by a polynomial up to the desired precision and then vectorizes that polynomial in the same way. The baseline for the realizable model (denoted by (*)) further needs the gap to dominate the approximation error in order to keep the error from escalating [DLMW20]; for sufficiently small target precision this condition never holds, but we include it in the table for the sake of comparison.
- In the low-rank polynomial case, our result for the simulator model can be found in Theorem 4.7 and our result for the online RL model in Theorem 4.8. These results require only a realizability assumption; efficient exploration is guaranteed by algebraic-geometric arguments. In the neural network model, our result for the simulator model with deterministic transitions can be found in Theorem 3.4; it also relies only on the realizability assumption. For stochastic transitions, our results hold under either policy completeness or Bellman completeness, as stated in Theorem 3.5 and Theorem 3.6 respectively. The result under the gap condition can be found in Theorem 3.8.
1.2 Related Work
Linear Function Approximation.
RL with linear function approximation has been widely studied under various settings, including the linear MDP and the linear mixture MDP [JYWJ20b, ZLKB20, YW20]. While these papers prove efficient regret and sample complexity bounds, their analyses rely heavily on two techniques: confidence ellipsoids to quantify uncertainty and the elliptical potential lemma to bound the total uncertainty [AYPS11]. Both techniques are integral to these analyses yet restrictive enough that they generally do not extend to the nonlinear case.
Eluder Dimension.
[RVR13a, ORVR13] proposed the eluder dimension, a complexity measure of the function class, and proved regret and sample complexity bounds that scale with the eluder dimension for bandits and reinforcement learning [WSY20a, JLM21]. They also showed that the eluder dimension is small in several settings, including generalized linear models and LQR. However, as shown in [DYM21], the eluder dimension can be exponentially large even for a single ReLU neuron, which suggests that eluder-dimension arguments face difficulty in dealing with neural networks. The eluder dimension is only known to give non-trivial bounds for linear function classes and monotone functions of linear function classes. For structured polynomial classes, the eluder-dimension approach simply embeds the class into an ambient linear space whose dimension is the feature dimension raised to the polynomial degree. This parallels the lower bounds in the linearization / neural tangent kernel (NTK) literature [WLLM19, GMMM19, AZL19], which show that linearization incurs a similarly large sample-complexity penalty, and that more advanced algorithm design is needed to circumvent linearization [BL20, CBL+20, FLYZ20, WGL+19, GCL+19, NGL+19, GLM18b, MGW+20, HWLM20, WWL+20, DML21].
Bellman Rank and Completeness.
[JKA+17, SJK+19] studied RL with general function approximation. They used the Bellman rank to measure the error of the function class under the Bellman operator and proved sample complexity bounds in terms of it. Recently, [DKL+21a] proposed the bilinear rank, which encompasses more function approximation models. However, it is hard to bound either the Bellman rank or the bilinear rank for neural networks, so these results are not known to cover the neural network approximation setting. Another line of work shows that exponential sample complexity is unavoidable even with good representations [DKWY20b, WAS20, WWK21], which implies that the realizability assumption alone might be insufficient for sample-efficient function approximation.
Deterministic RL
Deterministic systems are often the starting point in the study of sample-efficient algorithms: since both the transition kernel and the reward function are deterministic, the exploration-exploitation trade-off is revealed more clearly. The seminal work [WVR13] proposes a sample-efficient Q-learning algorithm that works for a family of function classes. Recently, [DLMW20] studied the agnostic setting where the optimal Q-function can only be approximated by a function class up to some approximation error; the algorithm in [DLMW20] learns a near-optimal policy with a number of trajectories linear in the eluder dimension.
2 Preliminaries
An episodic Markov Decision Process (MDP) is defined by the tuple $M = (\mathcal{S}, \mathcal{A}, H, P, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action set, $H$ is the number of time steps in each episode, $P = \{P_h\}_{h \in [H]}$ is the transition kernel, and $r = \{r_h\}_{h \in [H]}$ is the reward function. In each episode the agent starts at a fixed initial state $s_1$, and at each time step $h \in [H]$ it takes an action $a_h$, receives the reward $r_h(s_h, a_h)$, and transitions to $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$.
A deterministic policy $\pi$ is a length-$H$ sequence of functions $\pi = \{\pi_h : \mathcal{S} \to \mathcal{A}\}_{h \in [H]}$. Given a policy $\pi$, we define the value function $V_h^\pi(s)$ as the expected sum of rewards under policy $\pi$ starting from $s_h = s$:
$$V_h^\pi(s) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\ a_{h'} = \pi_{h'}(s_{h'})\Big],$$
and we define the Q function $Q_h^\pi(s,a)$ as the expected sum of rewards obtained by taking action $a$ in state $s$ at step $h$ and then following $\pi$:
$$Q_h^\pi(s,a) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\ a_h = a,\ a_{h'} = \pi_{h'}(s_{h'}) \text{ for } h' > h\Big].$$
The Bellman operator $\mathcal{T}_h$ applied to a Q-function $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined as follows:
$$(\mathcal{T}_h Q)(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q(s', a')\Big].$$
There exists an optimal policy $\pi^\star$ that attains the optimal value function for all states, i.e., $V_h^{\pi^\star}(s) = \max_\pi V_h^\pi(s)$ for all $h \in [H]$ and $s \in \mathcal{S}$. For notational simplicity we abbreviate $V^{\pi^\star}$ as $V^\star$ and correspondingly $Q^{\pi^\star}$ as $Q^\star$. Therefore $Q^\star$ satisfies the following Bellman optimality equations for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $h \in [H]$ (with $Q_{H+1}^\star \equiv 0$):
$$Q_h^\star(s,a) = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\Big[\max_{a' \in \mathcal{A}} Q_{h+1}^\star(s', a')\Big].$$
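As a concrete illustration of the Bellman optimality equations, the following is a minimal backward-induction sketch for a small tabular MDP; the paper works with function approximation, so this is only for intuition, and the array layout is our own assumption.

```python
import numpy as np

def optimal_q_values(P, r):
    """Backward induction for Q*_h on a finite-horizon tabular MDP.

    P: shape (H, S, A, S), P[h, s, a, s'] = probability of moving to state s'.
    r: shape (H, S, A), deterministic rewards r_h(s, a).
    Returns Q of shape (H, S, A) satisfying the Bellman optimality equations.
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)                    # V*_{H+1} = 0
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V_next         # Q*_h = r_h + E_{s'}[V*_{h+1}(s')]
        V_next = Q[h].max(axis=1)           # V*_h(s) = max_a Q*_h(s, a)
    return Q
```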
The goal is to find a policy $\pi$ that is $\epsilon$-optimal, in the sense that $V_1^\pi(s_1) \ge V_1^\star(s_1) - \epsilon$, within a small number of samples. We consider two query models of interacting with the MDP:
- In the simulator (generative model) setting, the agent can query any state-action pair $(s, a)$ at any time step $h$ and observe the reward $r_h(s,a)$ together with a sample of the next state $s' \sim P_h(\cdot \mid s, a)$.
- In online RL, the agent can only start at the initial state $s_1$ and interact with the MDP by executing a policy and observing the rewards and next states. In each episode $k$, the agent proposes a policy $\pi^k$ based on the history up to episode $k-1$ and executes it to generate a single trajectory $\{(s_h^k, a_h^k, r_h^k)\}_{h=1}^{H}$ with $a_h^k = \pi_h^k(s_h^k)$ and $s_{h+1}^k \sim P_h(\cdot \mid s_h^k, a_h^k)$.
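The two access models can be summarized by the following minimal interfaces; this is a sketch, and the method names and signatures are our own rather than part of the paper.

```python
from typing import Any, Protocol, Tuple

class GenerativeModel(Protocol):
    """Simulator access: query any (h, s, a) and observe the reward and a sampled next state."""
    def query(self, h: int, s: Any, a: Any) -> Tuple[float, Any]: ...

class OnlineMDP(Protocol):
    """Online access: episodes always start at s_1; only sequential interaction is allowed."""
    def reset(self) -> Any: ...                        # returns the fixed initial state s_1
    def step(self, a: Any) -> Tuple[float, Any]: ...   # returns (r_h, s_{h+1})
```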
2.1 Function approximation
In reinforcement learning with value function approximation, the learner is given a function class $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, where each $\mathcal{F}_h \subseteq \{f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\}$ is a set of candidate functions used to approximate $Q_h^\star$. The following assumption is commonly adopted in the literature [JYWJ20a, WSY20b, JLM21, DKL+21b].
Assumption 2.1 (Realizability).
$Q_h^\star \in \mathcal{F}_h$ for all $h \in [H]$.
The function approximation is equipped with a feature mapping $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ that is known to the agent. We focus on the continuous-action setting (e.g., control and robotics problems) and make the following regularity assumption on the feature function $\phi$.
Assumption 2.2 (Bounded features).
Assume that $\|\phi(s,a)\|_2$ is bounded by an absolute constant for all $(s,a) \in \mathcal{S} \times \mathcal{A}$.
Notation
For a vector $v$, $\|v\|$ denotes its Euclidean norm. For a matrix $A$, $\sigma_i(A)$ denotes its $i$-th singular value, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimum and maximum eigenvalues, and the condition number is defined by $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$. We use $\otimes$ to denote the Kronecker product and $\odot$ to denote the Hadamard product. For an integer $n$, we use $[n]$ to denote the set $\{1, \ldots, n\}$. For a function $f$, we use $f^{-1}(\cdot)$ to denote the preimage. We use the shorthand $x \lesssim y$ (resp. $x \gtrsim y$) to indicate $x \le C y$ (resp. $x \ge C y$) for an absolute constant $C > 0$.
3 Neural Network Function Approximation
In this section we present sample-efficient algorithms with neural network function approximation. The function class of interest is given in the following definition; a more general neural network class is discussed in Appendix A.5.
Definition 3.1 (Neural Network Function Class).
We use $\mathcal{F}$ to denote the class of functions of the form $f(s,a) = \sum_{i=1}^{k} a_i\, \sigma(w_i^\top \phi(s,a))$, i.e., a two-layer neural network of width $k$ applied to the features, where $\sigma$ is the ReLU activation and the weights $w_i \in \mathbb{R}^d$ and $a_i \in \mathbb{R}$ are bounded in norm. Here $\phi$ is a known feature map whose image contains a ball around the origin. (The specific ball is chosen only for simplicity; this assumption can be relaxed to the image of $\phi$ containing an arbitrary dense ball near the origin, since one can always rescale the feature mapping in the neural function approximation.)
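For concreteness, a width-$k$ two-layer ReLU network over the feature map can be evaluated as below; this is a sketch, the norm constraints of Definition 3.1 are omitted, and the variable names are ours.

```python
import numpy as np

def two_layer_q(phi_sa: np.ndarray, W: np.ndarray, a: np.ndarray) -> float:
    """Evaluate Q(s, a) = sum_i a_i * relu(w_i^T phi(s, a)).

    phi_sa: feature vector phi(s, a) in R^d (phi is the known feature map).
    W: weight matrix of shape (k, d) whose rows are the w_i.
    a: output weights of shape (k,).
    """
    return float(a @ np.maximum(W @ phi_sa, 0.0))
```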
We introduce the following completeness properties in the setting of value function approximation. Along with Assumption 2.1, they are commonly adopted in the literature.
Definition 3.2 (Policy complete).
Given an MDP $M$, the function class $\mathcal{F}$ is called policy complete if for every policy $\pi$ and every $h \in [H]$, $Q_h^\pi \in \mathcal{F}_h$.
Definition 3.3 (Bellman complete).
Given an MDP $M$, the function class $\mathcal{F}$ is called Bellman complete if for every $h \in [H]$ and every $f \in \mathcal{F}_{h+1}$, $\mathcal{T}_h f \in \mathcal{F}_h$.
3.1 Warmup: Realizable with deterministic transition
We start by considering the case where the transition kernel is deterministic. In this case only Assumption 2.1 is required for the expressiveness of the neural network function approximation. Algorithm 1 learns the optimal policy backward from time step $H$ down to $1$. Suppose we have already learned the policies at levels $h+1, \ldots, H$ and they are exactly the optimal policies. We first explore by drawing features from a standard Gaussian distribution, and whenever a drawn feature falls outside the image of the feature map we simply skip that trial. Recall that the image of $\phi$ contains a ball around the origin, so with high probability (over the Gaussian draw) almost all feature samples will be explored. We next construct an estimate of $Q_h^\star$ by collecting cumulative rewards using the learned policies as the roll-out. Since the transition is deterministic, these returns equal $Q_h^\star$ exactly for all explored samples. Recall that $Q_h^\star$ is a two-layer neural network, so we can recover its parameters in Line 12 exactly by invoking techniques from neural network optimization (see, e.g., [JSA15, GLM17, ZSJ+17]). Details of this step can be found in Appendix A.5, where the method is mainly based on [ZSJ+17]. This means the Q-function reconstructed in Line 13 is precisely $Q_h^\star$, and the algorithm can thus find the optimal policy at level $h$.
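The level-by-level structure of Algorithm 1 can be sketched as follows. The helpers rollout_reward and recover_network stand in for the roll-out data collection and the exact parameter-recovery routine of Appendix A.5; both names and signatures are our own assumptions.

```python
import numpy as np

def learn_deterministic_mdp(H, d, n, rollout_reward, recover_network):
    """Backward level-by-level learning under deterministic transitions (sketch).

    rollout_reward(h, phi, q_funcs): cumulative reward obtained by playing the action with
        feature phi at level h and then acting greedily w.r.t. the already learned
        Q-functions at levels h+1..H (generative-model access; assumed helper).
    recover_network(X, y): returns (W, a) of a two-layer ReLU net fitting y exactly
        (method of moments + local gradient descent; assumed helper).
    """
    q_funcs = {}                                   # level -> recovered Q*_h
    for h in range(H, 0, -1):
        X = np.random.randn(n, d)                  # explore features phi ~ N(0, I_d)
        # (Algorithm 1 additionally skips draws outside the image of the feature map.)
        y = np.array([rollout_reward(h, phi, q_funcs) for phi in X])
        W, a = recover_network(X, y)               # exact recovery since y is noiseless
        q_funcs[h] = lambda phi, W=W, a=a: float(a @ np.maximum(W @ phi, 0.0))
    return q_funcs                                 # greedy w.r.t. q_funcs[h] is pi*_h
```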
Theorem 3.4.
(Informal) If the number of samples per level is sufficiently large (polynomial in the relevant problem parameters), then with high probability Algorithm 1 learns the optimal policy.
The formal statement and complete proof are deferred to Appendix A.1. The main idea of the exact neural network recovery is as follows. We first use the method of moments to obtain a 'rough' estimate of the parameters. If this rough estimate is sufficiently close to the true parameters, the empirical loss function is locally strongly convex around them and has a unique minimizer in that region. We can then apply gradient descent to find this minimizer, which is exactly the true parameter vector.
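A minimal sketch of the second stage: given a method-of-moments initialization that already lies in the locally strongly convex region, plain gradient descent on the empirical loss converges to its unique minimizer there, which equals the true parameters in the noiseless case. The loss_grad oracle is an assumed helper.

```python
import numpy as np

def refine(theta0: np.ndarray, loss_grad, lr: float = 1e-2, steps: int = 10000) -> np.ndarray:
    """Gradient descent from a warm start inside the strongly convex basin."""
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * loss_grad(theta)   # loss_grad(theta): gradient of the empirical loss
    return theta
```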
3.2 Policy complete neural function approximation
Now we consider general stochastic transitions. Difficulties arise in this scenario due to noise in the estimation of the Q-functions: in the presence of model misspecification, this noise can cause estimation errors to amplify across levels and require exponentially many samples. In this section, we show that neural network function approximation is still learnable, assuming the function class is policy complete with respect to the MDP $M$. Thus, for every policy $\pi$ and level $h$, the Q-function $Q_h^\pi$ is itself a two-layer neural network in $\mathcal{F}_h$.
Algorithm 2 learns the policy backward from level $H$ to level $1$. At level $h$, the algorithm has already learned a policy for the later levels that is only slightly suboptimal. It then explores features drawn from a standard Gaussian distribution, queries the simulator at the corresponding state-action pairs, and uses the learned policy as the roll-out to collect unbiased estimates of the Q-function of that roll-out policy. Since, by policy completeness, this Q-function is itself a two-layer neural network, it can be recovered from these samples. Details of this step can be found in Appendix A.5, where the methods are mainly based on [ZSJ+17]. The algorithm then reconstructs this Q-function and takes the greedy policy with respect to it as the policy at level $h$.
Theorem 3.5.
(Informal) Fix $\epsilon, \delta > 0$. If the number of samples is sufficiently large (polynomial in the relevant problem parameters and $1/\epsilon$), then with probability at least $1 - \delta$ Algorithm 2 returns an $\epsilon$-optimal policy.
The formal statement and complete proof are deferred to Appendix A.2. Notice that, unlike the case of Theorem 3.4, the sample complexity here avoids a dependence that could be exponential in the worst case.
The main idea of the proof is that at each time step a neural-network surrogate of $Q_h^\star$ can be constructed from the policies already learned at later levels. Suppose we have learned a roll-out policy for levels $h+1, \ldots, H$; by policy completeness its Q-function at level $h$ belongs to $\mathcal{F}_h$, and we can interact with the simulator to estimate it. If the recovery error is small, planning greedily with the recovered surrogate is not far from acting optimally with respect to the surrogate itself. Therefore the suboptimality decouples into the error of recovering the surrogate at level $h$ and the suboptimality of the roll-out policy at level $h+1$. This reasoning can then be applied recursively across all levels, which bounds the suboptimality of the returned policy.
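A sketch of the one-step error decomposition behind this argument, in assumed notation: $\pi_{h+1}$ denotes the roll-out policy already learned for levels $h+1,\ldots,H$, $\widehat{Q}_h$ the recovered surrogate with uniform error $\epsilon_h$, and $\pi_h$ the greedy policy with respect to $\widehat{Q}_h$.

```latex
\begin{align*}
V_h^{\star}(s) - V_h^{(\pi_h,\,\pi_{h+1})}(s)
 &= \bigl[Q_h^{\star}(s,\pi_h^{\star}(s)) - Q_h^{\pi_{h+1}}(s,\pi_h^{\star}(s))\bigr]
  + \bigl[Q_h^{\pi_{h+1}}(s,\pi_h^{\star}(s)) - Q_h^{\pi_{h+1}}(s,\pi_h(s))\bigr] \\
 &\le \max_{s'}\bigl(V_{h+1}^{\star}(s') - V_{h+1}^{\pi_{h+1}}(s')\bigr) + 2\epsilon_h .
% The first bracket passes the error to level h+1; the second is at most 2*eps_h because
% pi_h maximizes \widehat{Q}_h, which is eps_h-close to Q_h^{\pi_{h+1}} in sup norm.
\end{align*}
```

Unrolling this recursion over the levels bounds the total suboptimality by twice the sum of the per-level recovery errors.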
3.3 Bellman complete neural function approximation
In addition to policy completeness, we show that neural network function approximation can also learn efficiently in the setting where the function class is Bellman complete with respect to the MDP $M$. Specifically, for every $h \in [H]$ and every $f \in \mathcal{F}_{h+1}$, there exist network weights such that $\mathcal{T}_h f$ is the corresponding two-layer neural network in $\mathcal{F}_h$.
Algorithm 3 is similar to the algorithm in the previous section. Suppose that at level $h+1$ the algorithm has constructed a Q-function that is close to the optimal $Q_{h+1}^\star$. It then recovers the weights of the Bellman backup of this Q-function at level $h$ from unbiased estimates collected via the simulator. The Q-function reconstructed from these weights is thus close to $Q_h^\star$.
Theorem 3.6.
(Informal) Fix $\epsilon, \delta > 0$. If the number of samples is sufficiently large (polynomial in the relevant problem parameters and $1/\epsilon$), then with probability at least $1 - \delta$ Algorithm 3 returns an $\epsilon$-optimal policy.
Due to Bellman completeness, the estimation error can be controlled recursively; in fact, we can show by induction that the error of the reconstructed Q-function remains small at every level. The formal statement and detailed proof are deferred to Appendix A.3. Similar to Theorem 3.5, the sample complexity avoids a dependence that could be exponential in the worst case.
3.4 Realizable with optimality gap
In this section we consider MDPs where there is a strictly positive gap between the value of the optimal action and that of any other action. This concept, known as the optimality gap, is widely used in the reinforcement learning and bandit literature [DKJ+19, DKWY20b, DLMW20].
Definition 3.7.
The optimality gap is defined as
$$\rho = \min_{h \in [H]}\; \min_{s \in \mathcal{S}}\; \min_{a \neq \pi_h^\star(s)} \Big( Q_h^\star\big(s, \pi_h^\star(s)\big) - Q_h^\star(s, a) \Big).$$
We show that in the presence of a positive optimality gap, there exists an algorithm that can learn the optimal policy with polynomially many samples even without the completeness assumptions. Intuitively, this is because one only needs to recover the neural network up to precision $\rho/2$ in order to make sure the greedy policy is identical to the optimal one. The formal statement and proof are deferred to Appendix A.4.
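A sketch of why precision $\rho/2$ suffices, in assumed notation ($\rho$ the optimality gap from Definition 3.7, $\widehat{Q}_h$ the recovered Q-function): if $\sup_{s,a} |\widehat{Q}_h(s,a) - Q_h^{\star}(s,a)| < \rho/2$, then for every state $s$ and every action $a \neq \pi_h^{\star}(s)$,

```latex
\widehat{Q}_h\bigl(s,\pi_h^{\star}(s)\bigr)
  \;>\; Q_h^{\star}\bigl(s,\pi_h^{\star}(s)\bigr) - \tfrac{\rho}{2}
  \;\ge\; Q_h^{\star}(s,a) + \rho - \tfrac{\rho}{2}
  \;>\; \widehat{Q}_h(s,a),
```

so the greedy policy with respect to $\widehat{Q}_h$ coincides with $\pi_h^{\star}$ and no error propagates across levels.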
Theorem 3.8.
(Informal) Fix $\delta > 0$. If the number of samples is sufficiently large (polynomial in the relevant problem parameters and the inverse gap $1/\rho$), then with probability at least $1 - \delta$ there exists an algorithm that returns the optimal policy.
Remark 3.9.
In all of the aforementioned methods, two key components enable efficient learning. First, the exploration is conducted in a way that guarantees a uniform ($\ell_\infty$) recovery of candidate functions: the algorithm recovers a candidate Q-function in the class deviating from the target function by at most a small amount uniformly over all state-action pairs in the domain of interest. This notion of learning guarantee has been studied in active learning [Han14, KAH+17] and has recently gained interest in contextual bandits [FAD+18]. Second, the agent constructs unbiased estimators of certain surrogates of $Q^\star$ that lie in the neural function approximation class. This allows the recovery error to decouple linearly across time steps, which is possible in several well-posed MDP instances, such as deterministic MDPs, MDPs with completeness assumptions, and MDPs with gap conditions. In principle, provably efficient RL algorithms with general function approximation are possible whenever these two components are present. We will see in the next section another example of RL with highly non-convex function approximation, where the function class of interest, admissible polynomial families, also admits exploration schemes that achieve such uniform recovery.
4 Polynomial Realizability
In this section, we study the sample complexity of learning deterministic MDPs under polynomial realizability. We identify sufficient and necessary conditions for efficiently learning such MDPs in two different settings: the generative model setting and the online RL setting. Specifically, we show that if the image of the feature map satisfies certain positive-measure conditions, then by exploring randomly we can identify the optimal policy with a number of samples linear in the algebraic dimension of the underlying polynomial class. We also provide a lower-bound example showing a separation between the two settings.
Next, we introduce the notion of Admissible Polynomial Families, which are the families of structured polynomials that enable efficient learning.
Definition 4.1 (Admissible Polynomial Families).
Consider polynomials over the feature space whose degree is at most a fixed bound. For an algebraic variety $V$ of parameters, we define the associated polynomial family as the set of polynomials parameterized by points of $V$. We say the family is admissible if imposing any additional (non-trivial) linear constraint on the parameters reduces the dimension of $V$ by one; intuitively, every new measurement cuts down the parameter variety. We define the dimension of the family to be the dimension of $V$.
The following theorem shows that to learn an admissible polynomial family, the sample complexity only scales with the algebraic dimension of the polynomial family.
Theorem 4.2 ([HHK+21]).
Consider a polynomial family of dimension $k$. Given a number of noiseless measurements that is linear in $k$, there exists a Lebesgue-measure-zero set of query configurations such that, whenever the query points avoid this set, for any observed values there is a unique parameter (or no parameter at all) solving the system of equations that matches the observations at those points.
We give two important examples of admissible polynomial families with low dimension.
Example 4.3.
(Low-rank polynomial of rank $k$) The function is a polynomial that is a sum of $k$ terms, each of which depends on the input only through a single linear projection of the features. The dimension of this family is upper bounded by a quantity linear in $k$ and the feature dimension. Neural networks with monomial or polynomial activation functions are low-rank polynomials.
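One natural instance of this family (in our own notation; the exact definition in Example 4.3 allows more general combinations of term degrees) is the following:

```python
import numpy as np

def low_rank_polynomial(x: np.ndarray, W: np.ndarray, lam: np.ndarray, p: int) -> float:
    """f(x) = sum_{i=1}^k lam_i * (w_i^T x)^p, a rank-k polynomial of degree p.

    The family is parameterized by the k directions w_i (rows of W) and coefficients lam_i,
    so its algebraic dimension is O(k d), far below the ambient dimension obtained by
    vectorizing all degree-p monomials.
    """
    return float(lam @ (W @ x) ** p)
```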
Example 4.4.
The function is of the form $f(x) = q(Ux)$, where $U$ is a $k \times d$ matrix and $q$ is a degree-$p$ polynomial in $k$ variables; both the polynomial $q$ and the matrix $U$ are unknown. The dimension of this family is upper bounded by $kd$ plus the number of coefficients of a degree-$p$ polynomial in $k$ variables.
Next, we introduce the notion of positive measure.
Definition 4.5.
We say a measurable set $K \subseteq \mathbb{R}^d$ is of positive measure if $\mu(K) > 0$, where $\mu$ is the standard Lebesgue measure on $\mathbb{R}^d$.
If a measurable set $K$ is of positive measure, then there exists a procedure to draw samples from $K$ such that, for any set $\mathcal{V}$ of Lebesgue measure zero, the probability that the sample falls in $\mathcal{V}$ is zero; for instance, one can sample according to the normalized restriction of the Lebesgue measure to $K$, i.e., with probability $\mu(\cdot \cap K)/\mu(K)$. The intuition behind this definition is that, for all admissible polynomial families, the set of query points that carry "redundant information" about the parameters has Lebesgue measure zero. Therefore, a positive-measure set allows one to query randomly while avoiding such degenerate, coherent measurements.
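A minimal sketch of such a sampling procedure; the membership oracle `indicator` and the bounding box are our own assumptions.

```python
import numpy as np

def sample_positive_measure(indicator, d, box=1.0, max_tries=100_000, rng=None):
    """Rejection sampling from a positive-measure set K restricted to [-box, box]^d.

    Because the resulting draw is (conditionally) uniform on a positive-measure set, it
    falls inside any fixed Lebesgue-measure-zero set with probability zero, which is
    exactly what Theorem 4.2 requires of the query points.
    """
    rng = rng or np.random.default_rng()
    for _ in range(max_tries):
        x = rng.uniform(-box, box, size=d)
        if indicator(x):
            return x
    raise RuntimeError("no point of K found in the box; K may have negligible mass here")
```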
The next two theorems identify sufficient conditions for efficiently learning deterministic MDPs under polynomial realizability. Specifically, in the online RL setting we require the strong assumption that the set of features reachable from each individual state is of positive measure at every level, while in the generative model setting we only require the union of these feature sets over all states to be of positive measure at every level. The algorithms for the two cases are summarized in Algorithms 4 and 5.
Assumption 4.6 (Polynomial Realizability).
For all $h \in [H]$, the optimal Q-function $Q_h^\star$, viewed as a function of the feature $\phi(s,a)$, lies in some admissible polynomial family of dimension bounded by $k$.
Theorem 4.7.
For the generative model setting, assume that the set of all features $\phi(\mathcal{S} \times \mathcal{A})$ is of positive measure at every level. Under polynomial realizability, Algorithm 4 almost surely learns the optimal policy using a number of samples linear in the dimension of the polynomial family at each level.
Theorem 4.8.
For the online RL setting, assume that the feature set $\phi(\{s\} \times \mathcal{A})$ is of positive measure for every state $s$ at every level. Under polynomial realizability, within a number of episodes linear in the dimension of the polynomial family at each level, Algorithm 5 learns the optimal policy almost surely.
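A sketch of the per-level fitting step shared by Algorithms 4 and 5: query noiseless Q-values at randomly explored features and solve for the parameters of the polynomial family. The paper solves the resulting polynomial system exactly by algebraic arguments; the nonlinear least-squares call below (and the names `f`, `theta0`) is only a numerical stand-in under that assumption.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_polynomial_family(Phi, y, f, theta0):
    """Fit a parameterized family f(phi; theta) to noiseless values y_i = Q_h(phi_i).

    Phi: (n, d) explored features with n exceeding the family dimension; under the
    positive-measure condition, Theorem 4.2 guarantees the solution is a.s. unique.
    f(phi, theta) -> float evaluates a member of the family (assumed helper).
    """
    residual = lambda theta: np.array([f(phi, theta) - yi for phi, yi in zip(Phi, y)])
    return least_squares(residual, theta0).x
```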
We remark that our Theorem 4.8 for learning MDPs in the online RL setting relies on a very strong assumption, which lets the learner explore randomly from any state. However, this assumption is necessary in some sense, as suggested by our lower-bound example in the next subsection.
4.1 Necessity of Generic Feature Maps in Online RL
In this section, we consider lower bounds for learning deterministic MDPs with polynomial realizability in the online RL setting. Our goal is to show that, in the online setting, the generic assumption on the feature maps is necessary. In contrast, under the generative model setting one can efficiently learn the MDP without such a strong assumption, since at every level we can set the state arbitrarily.
MDP construction
We briefly introduce the intuition of our construction. Consider a family of MDPs with only two states, a good state and a bad state. We choose the feature map such that, at the good state, it allows the learner to explore randomly, i.e., the set of features reachable from the good state is of positive measure.
However, at the bad state all actions are mapped into a restricted set, which forbids random exploration, i.e., the set of features reachable from the bad state has measure zero. This is illustrated in Figure 1.
Specifically, at the bad state the learner must essentially enumerate the candidate monomials to identify the ground-truth polynomial, whereas a small number of random actions suffices at the good state.
The transition is constructed so that every action leads to the bad state, which makes it impossible in the online setting to reach the good state after the first level.

Theorem 4.9.
There exists a family of MDPs satisfying Assumption 4.6 such that the set of all features is of positive measure at every level, but at every level beyond the first there is a state whose feature set has measure zero. In the online RL setting, any algorithm needs to play a number of episodes that scales with the number of candidate monomials to identify the optimal policy. In contrast, in the generative model setting, a number of samples linear in the dimension of the family suffices.
5 Conclusions
In this paper, we consider neural network and polynomial function approximation in the simulator and online settings. To our knowledge, this is the first paper that shows sample-efficient reinforcement learning is possible with neural network function approximation. Our results substantially improve upon what can be achieved with existing results, which primarily rely on embedding neural networks into linear function classes. The analysis reveals that for function classes that allow efficient recovery, such as two-layer neural networks and admissible polynomial families, reinforcement learning can be reduced to parameter recovery problems, which are well studied in the theory of deep learning, phase retrieval, and related areas. Our method can also potentially be extended to handle three-layer and deeper neural networks, using the more advanced tools of [FKR19, FVD19].
Our results for polynomial activations require deterministic transitions, since we cannot yet handle how noise propagates when solving polynomial equations. We leave to future work an in-depth study of the stability of roots of polynomial systems under noise, a fundamental mathematical problem that remains open even for homogeneous polynomials. In particular, noisy tensor decomposition approaches combined with zeroth-order optimization may allow for stochastic transitions [HHK+21].
In the online RL setting, we can only show efficient learning under a very strong yet necessary assumption on the feature mapping. We leave to future work identifying more realistic and natural conditions which permit efficient learning in the online RL setting.
Acknowledgment
JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, NSF CCF 2002272, and an ONR Young Investigator Award. QL is supported by NSF 2030859 and the Computing Research Association for the CIFellows Project. SK acknowledges funding from the NSF Award CCF-1703574 and the ONR award N00014-18-1-2247. BH is supported by the Elite Undergraduate Training Program of School of Mathematical Sciences at Peking University.
References
- [AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
- [AYLSW19] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019.
- [AYPS11] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- [AZL19] Zeyuan Allen-Zhu and Yuanzhi Li. What can resnet learn efficiently, going beyond kernels? arXiv preprint arXiv:1905.10337, 2019.
- [AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), pages 242–252, 2019.
- [BBC+19] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, and Chris Hesse. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- [BCR13] Jacek Bochnak, Michel Coste, and Marie-Françoise Roy. Real algebraic geometry, volume 36. Springer Science & Business Media, 2013.
- [BL20] Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2020.
- [CBL+20] Minshuo Chen, Yu Bai, Jason D Lee, Tuo Zhao, Huan Wang, Caiming Xiong, and Richard Socher. Towards understanding hierarchical learning: Benefits of neural representations. Neural Information Processing Systems (NeurIPS), 2020.
- [DHK+20] Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
- [DJK+18] Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. On oracle-efficient PAC-RL with rich observations. In Advances in Neural Information Processing Systems, 2018.
- [DKJ+19] Simon S Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674, 2019.
- [DKL+21a] Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. arXiv preprint arXiv:2103.10897, 2021.
- [DKL+21b] Simon S Du, Sham M Kakade, Jason D Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.
- [DKWY20a] Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.
- [DKWY20b] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.
- [DLL+19] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), pages 1675–1685. PMLR, 2019.
- [DLMW20] Simon S Du, Jason D Lee, Gaurav Mahajan, and Ruosong Wang. Agnostic Q-learning with function approximation in deterministic systems: Tight bounds on approximation error and sample complexity. Advances in Neural Information Processing Systems, 2020.
- [DML21] Alex Damian, Tengyu Ma, and Jason Lee. Label noise sgd provably prefers flat global minimizers. arXiv preprint arXiv:2106.06530, 2021.
- [DPWZ20] Kefan Dong, Jian Peng, Yining Wang, and Yuan Zhou. -regret for learning in Markov decision processes with function approximation and low Bellman rank. In Conference on Learning Theory, pages 1554–1557. PMLR, 2020.
- [DYM21] Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. arXiv preprint arXiv:2102.04168, 2021.
- [FAD+18] Dylan J. Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. Proceedings of the 35th International Conference on Machine Learning, 2018.
- [FKR19] Massimo Fornasier, Timo Klock, and Michael Rauchensteiner. Robust and resource efficient identification of two hidden layer neural networks. arXiv preprint arXiv:1907.00485, 2019.
- [FLYZ20] Cong Fang, Jason D Lee, Pengkun Yang, and Tong Zhang. Modeling from features: a mean-field framework for over-parameterized deep neural networks. arXiv preprint arXiv:2007.01452, 2020.
- [FVD19] Massimo Fornasier, Jan Vybíral, and Ingrid Daubechies. Robust and resource efficient identification of shallow neural networks by fewest samples. arXiv preprint arXiv:1804.01592, 2019.
- [GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- [GCL+19] Ruiqi Gao, Tianle Cai, Haochuan Li, Liwei Wang, Cho-Jui Hsieh, and Jason D Lee. Convergence of adversarial training in overparametrized networks. Neural Information Processing Systems (NeurIPS), 2019.
- [GLM17] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
- [GLM18a] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In International Conference on Learning Representations (ICLR), 2018.
- [GLM18b] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. International Conference on Learning Representations (ICLR), 2018.
- [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.
- [GMMM21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
- [Han14] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
- [HHK+21] Baihe Huang, Kaixuan Huang, Sham M Kakade, Jason D Lee, Qi Lei, Runzhe Wang, and Jiaqi Yang. Optimal gradient-based algorithms for non-concave bandit optimization. arXiv preprint arXiv:2107.04518, 2021.
- [HIB+18] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- [HWLM20] Jeff Z. HaoChen, Colin Wei, Jason D. Lee, and Tengyu Ma. Shape matters: Understanding the implicit bias of the noise covariance. arXiv preprint arXiv:2006.08680, 2020.
- [JAZBJ18] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NeurIPS), pages 8571–8580, 2018.
- [JKA+17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, pages 1704–1713, 2017.
- [JLM21] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. arXiv preprint arXiv:2102.00875, 2021.
- [JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of nonconvexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- [JYWJ20a] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- [JYWJ20b] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143, 2020.
- [KAH+17] Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daume III, and John Langford. Active learning for cost-sensitive classification. Proceedings of the 34th International Conference on Machine Learning, 2017.
- [Kak03] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
- [KCL15] Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization via matrix factorization. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), page 507–516, 2015.
- [LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [LCYW19] Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
- [LHP+15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- [LKFS21] Gene Li, Pritish Kamath, Dylan J. Foster, and Nathan Srebro. Eluder dimension and generalized rank. arXiv preprint arXiv:2104.06970, 2021.
- [MGW+20] Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D Lee, Nathan Srebro, and Daniel Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Neural Information Processing Systems (NeurIPS), 2020.
- [Mil17] James S. Milne. Algebraic geometry (v6.02), 2017. Available at www.jmilne.org/math/.
- [MKS+13a] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [MKS+13b] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [MPSL21] Dhruv Malik, Aldo Pacchiano, Vishwak Srinivasan, and Yuanzhi Li. Sample efficient reinforcement learning in continuous state spaces: A perspective beyond linearity. arXiv preprint arXiv:2106.07814, 2021.
- [MSB+17] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
- [NGL+19] Mor Shpigel Nacson, Suriya Gunasekar, Jason Lee, Nathan Srebro, and Daniel Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019.
- [ORVR13] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
- [RST15a] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 16(6):155–186, 2015.
- [RST15b] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111–153, 2015.
- [Rus19] Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems, pages 14433–14443, 2019.
- [RVR13a] Dan Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
- [RVR13b] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, 2013.
- [SB18] Richard S Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [Sha13] Igor R Shafarevich. Basic Algebraic Geometry 1: Varieties in Projective Space. Springer Science & Business Media, 2013.
- [SHM+16] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- [SJK+19] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933, 2019.
- [SLW+06] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
- [SSSS16] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
- [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin F Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted markov decision process with a generative model. arXiv preprint arXiv:1806.01492, 2018.
- [WAS20] Gellert Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in mdps with linearly-realizable optimal action-value functions. arXiv preprint arXiv:2010.01374, 2020.
- [WCYW20] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2020.
- [WGL+19] Blake Woodworth, Suriya Genesekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Kernel and deep regimes in overparametrized models. In Conference on Learning Theory (COLT), 2019.
- [WLLM19] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pages 9709–9721, 2019.
- [WSY20a] Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
- [WSY20b] Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. Advances in Neural Information Processing Systems, 2020.
- [WVR13] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
- [WVR17] Zheng Wen and Benjamin Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Math. Oper. Res., 42(3):762–782, 2017.
- [WWDK21] Yining Wang, Ruosong Wang, Simon S. Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. In International Conference on Learning Representations, 2021.
- [WWK21] Yuanhao Wang, Ruosong Wang, and Sham M. Kakade. An exponential lower bound for linearly-realizable mdps with constant suboptimality gap. arXiv preprint arXiv:2103.12690, 2021.
- [WWL+20] Xiang Wang, Chenwei Wu, Jason D Lee, Tengyu Ma, and Rong Ge. Beyond lazy training for over-parameterized tensor decomposition. Neural Information Processing Systems (NeurIPS), 2020.
- [WX19] Yang Wang and Zhiqiang Xu. Generalized phase retrieval: measurement number, matrix recovery and beyond. Applied and Computational Harmonic Analysis, 47(2):423–446, 2019.
- [YHLD20] Jiaqi Yang, Wei Hu, Jason D Lee, and Simon S Du. Provable benefits of representation learning in linear bandits. arXiv preprint arXiv:2010.06531, 2020.
- [YJW+20] Zhuoran Yang, Chi Jin, Zhaoran Wang, Mengdi Wang, and Michael Jordan. Provably efficient reinforcement learning with kernel and neural function approximations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13903–13916. Curran Associates, Inc., 2020.
- [YS19] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687, 2019.
- [YW20] Lin F Yang and Mengdi Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. International Conference on Machine Learning, 2020.
- [ZB19] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.
- [ZCZG18] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
- [ZLG20] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. Proceedings of the 37th International Conference on Machine Learning, 2020.
- [ZLKB20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, 2020.
- [ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. Proceedings of the Thirty-fourth International Conference on Machine Learning (ICML), 70, 2017.
Appendix A Omitted Proofs in Section 3
A.1 Proof of Section 3.1
Theorem A.1 (Formal statement of Theorem 3.4).
Proof.
Use to denote the global optimal policy. We prove that Algorithm 1 learns from to .
At level $H$, the queries obtain exact values of $Q_H^\star$. Therefore, by Theorem A.15, the network is recovered exactly, and the optimal planning finds the optimal policy at level $H$. Suppose we have learned the optimal policies at levels $h+1, \ldots, H$. Due to the deterministic transitions, the queries at level $h$ obtain exact values of $Q_h^\star$. Therefore, by Theorem A.15, the recovery is again exact, and the optimal planning finds the optimal policy at level $h$. Recursively applying this argument down to level $1$ completes the proof. ∎
A.2 Proof of Section 3.2
Theorem A.2 (Formal statement of Theorem 3.5).
Proof.
Use to denote the global optimal policy. We prove for all ,
At level , let , then . From Theorem A.13, we have satisfies for all . Therefore for all ,
where in the second step we used by optimality of and .
Suppose we have learned policies , we use to denote the optimal policy of . Let
then is zero mean sub-Gaussian (notice that is unbiased estimate of , and ). From Theorem A.13, we have satisfies for all . Therefore for all ,
where in the second step we used by optimality of and .
It thus follows that
where in the second step we use from optimality of . Repeating this argument to completes the proof ∎
A.3 Proof of Section 3.3
Theorem A.3 (Formal statement of Theorem 3.6).
Proof.
Use to denote the global optimal policy. We prove
(1)
for all .
Suppose we have learned with . At level , let
then is zero mean sub-Gaussian (notice that is unbiased estimate of , and ). From Theorem A.13, we have satisfies for all . Therefore
holds for all .
A.4 Proof of Section 3.4
With the gap condition, either Algorithm 2 or Algorithm 3 will work as long as we select the recovery precision to be smaller than half the gap. The following is an adaptation of Algorithm 2.
Theorem A.4 (Formal statement of Theorem 3.8).
Proof.
Use to denote the global optimal policy. Similar to Theorem A.1, we prove that Algorithm 6 learns from to .
At level , the algorithm uses trajectories to obtain such that by Theorem A.13. Therefore
where the third inequality uses the optimality of under . Thus Definition 3.7 gives . Suppose we have learned at level . We apply the same argument to derive . Recursively applying this process to , we complete the proof. ∎
A.5 Neural network recovery
This section considers recovering a two-layer neural network $f(x) = \sum_{i=1}^{k} a_i\, \sigma(w_i^\top x)$ from the following two observation models.
- Noisy samples:
(2) $y = f(x) + \zeta$, where $\zeta$ is sub-Gaussian noise.
- Noiseless samples:
(3) $y = f(x)$.
Recovering a neural network has received comprehensive study in deep learning theory [JSA15, ZSJ+17, GLM17]. The analysis in this section is mainly based on the method of moments in [ZSJ+17]. However, the above learning tasks differ from those considered in [ZSJ+17] due to the presence of noise and the truncated signals, so additional considerations are needed in the analysis.
We consider more general homogeneous activation functions, specified by the assumptions that follow. Since the activation function is homogeneous, we assume in the following, without loss of generality, that the output weights $a_i$ take values in $\{-1, +1\}$.
Assumption A.5 (Property 3.1 of [ZSJ+17]).
Assume is nonnegative and homogeneously bounded, i.e. for some constants and .
Definition A.6 (Part of property 3.2 of [ZSJ+17]).
Define , where , and for .
Assumption A.7 (Part of property 3.2 of [ZSJ+17]).
The first derivative satisfies that, for all , we have .
Assumption A.8 (Property 3.3 of [ZSJ+17]).
The second derivative is either (a) globally bounded or (b) except for finite points.
Notice that the ReLU, squared ReLU, leaky ReLU, and polynomial activation functions all satisfy the above assumptions. We make the following assumption on the dimension of the feature vectors, which governs how well features can extract information about the neural network from noisy samples. The dimension only has to exceed a term logarithmic in the problem parameters and the norm of the parameters.
Assumption A.9 (Rich feature).
Assume .
First we introduce a notation from [ZSJ+17].
Definition A.10.
Define outer product as follows. For a vector and an identity matrix ,
For a symmetric rank- matrix and an identity matrix ,
where , , , , , .
Now we define some moments.
Definition A.11.
Define as follows:
The above expectations are all with respect to and .
Assumption A.12 (Assumption 5.3 of [ZSJ+17]).
Assume the activation function satisfies the followings:
- If , then for all .
- At least one of and is not zero.
- If , then is an even function.
- If , then is an odd function.
Now we state the theoretical result that recovers neural networks from noisy data.
Theorem A.13 (Neural network recovery from noisy data).
Let the activation function satisfies Assumption A.5 and Assumption A.12. Let be the condition number of . Given samples from Eq. (2). For any and such that Assumption A.9 holds, if
then there exists an algorithm that takes time and returns a matrix and a vector such that with probability at least ,
The algorithm and proof are shown in Appendix A.5.1. By Assumption A.5, the following corollary is therefore straightforward.
Corollary A.14.
In the same setting as Theorem A.13. For any and suppose and Assumption A.9 holds. Given samples from Eq. (2). If
then there exists an algorithm that takes time and outputs a matrix and a vector such that with probability at least , for all
In particular, when the following sample complexity suffices
Now we state the theoretical result that precisely recovers neural networks from noiseless data. The proof and method are shown in Appendix A.5.2.
Theorem A.15 (Exact neural network recovery from noiseless data).
A.5.1 Recover neural networks from noisy data
In this section we prove Theorem A.13. Denote where and .
Definition A.16.
Given a vector . Define where and where .
The method of moments is presented in Algorithm 7. Here we sketch its ideas and refer readers to [ZSJ+17] for thorough explanations. There are three main steps. In the first step, the algorithm computes the span of the rows of the weight matrix $W$. By the power method, Line 7 finds the top eigenvalues and eigenvectors of the estimated moment matrices. It then picks the largest eigenvalues by invoking TopK in Line 15. Finally it orthogonalizes the corresponding eigenvectors in Line 19 and finds an orthonormal matrix whose columns span the row space of $W$.
In the second step, the algorithm forms a third-order moment tensor and uses the robust tensor decomposition method of [KCL15] to find vectors that approximate the projected weight directions up to unknown signs. In the third step, the algorithm determines the signs and the remaining magnitudes. Since the activation function is homogeneous, we may normalize the weights without loss of generality. For illustration, we define the following quantities.
(4)
(5)
where such that and such that are specified later. Then the solutions of the following linear systems
(6)
are the following:
When and do not have the same sign, we can recover and by , and recover by
In Algorithm 7, we use to approximate , and use moment estimators and to approximate and . Then the solutions to the optimization problems in Line 29 should approximate and , due to robustness for solving linear systems. As such, the outputs approximately recover the true model parameters.
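To make the first step concrete, here is a minimal sketch of the subspace-estimation stage in our own notation; the exact moment used in Algorithm 7 and its constants differ, and the noise/truncation handling is omitted.

```python
import numpy as np

def estimate_weight_subspace(X, y, k):
    """Estimate an orthonormal basis of span{w_1, ..., w_k} from samples (x_i, y_i).

    Uses the empirical second moment M2 = (1/n) sum_i y_i (x_i x_i^T - I).  For Gaussian
    inputs and a two-layer network y = sum_i a_i sigma(w_i^T x) with homogeneous sigma,
    this matrix is (generically) supported on the span of the w_i, so its top-k
    eigenvectors by magnitude recover that subspace.
    """
    n, d = X.shape
    M2 = (X.T * y) @ X / n - y.mean() * np.eye(d)   # empirical E[y (x x^T - I)]
    eigvals, eigvecs = np.linalg.eigh(M2)
    top = np.argsort(np.abs(eigvals))[-k:]          # top-k eigenvalues by magnitude
    return eigvecs[:, top]
```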
Since Algorithm 7 carries out the same computations as [ZSJ+17], the computational complexity is the same. The difference in sample complexity comes from the noise in the model and the truncation of the standard Gaussian. The proof entails bounding the estimation errors in Line 4, Line 20, and Line 28. In the following, unless specified otherwise, the expectations are all with respect to the feature distribution and the noise.
Lemma A.17.
Proof.
It suffices to bound , and . The main strategy is to bound all relevant moment terms and to invoke Claim D.6.
Specifically, we show that with probability at least ,
(7)
(8)
(9)
Recall that for sample , where is independent of . Consider each component . Define as follows:
and . Then from Claim D.5 we have . We calculate
By Assumption A.5, using Claim D.1 and we have
Also, .
For any constant , we have with probability ,
where the first step comes from Assumption A.5 and the second step comes from Claim D.2 and Claim D.3.
Using Claim D.4, we have
Furthermore we have,
Then by Claim D.6, with probability at least ,
∎
Lemma A.18.
Proof.
From the definition of , it suffices to bound and . The proof is similar to the previous one.
Specifically, we show that with probability at least ,
(10)
(11)
Recall that for sample , where is independent of . Consider each component . Define :
where . Flatten along the first dimension to obtain , flatten along the first dimension to obtain .
From Claim D.7, . Therefore we have,
For any constant , we have with probability ,
where the first step comes from Assumption A.5 and the second step comes from Claim D.2 and Claim D.3.
Furthermore we have,
Then by Claim D.6, with probability at least ,
∎
Lemma A.19.
Proof.
Lemma A.20 (Adapted from Theorem 3 of [KCL15]).
Given a tensor . Assume incoherence . Let and . Then there exists an algorithm such that, with probability at least , for every , the algorithm returns a such that
where and is the inverse of the full-rank extension of with unit-norm columns.
Lemma A.21 (Adapted from Lemma E.6 of [ZSJ+17]).
Lemma A.22 (Adapted from Lemma E.13 in [ZSJ+17]).
Let be the orthogonal column span of . Let denote an orthogonal matrix satisfying that . For each , let denote the vector satisfying . Let be defined as in Eq. (4) and be the empirical version of such that .
Lemma A.23 (Adapted from Lemma E.14 in [ZSJ+17]).
Let be the orthogonal column span of and be an orthogonal matrix satisfying that . For each , let denote the vector satisfying .
Now we are in the position of proving Theorem A.13.
Proof.
It thus follows that
(14) |
where the first step applies triangle inequality and the last step uses Eq. (A.5.1) and Eq. (13).
We proceed to bound the error in and . We have,
(15) |
where the first step comes from Lemma A.23 and the second step comes from Lemma A.19 and Eq. (A.5.1). Furthermore,
(16) |
where the first step comes from Lemma A.22 and the second step comes from combining Lemma A.19, Lemma A.21, and Eq. (A.5.1). Finally, combining Eq. (A.5.1), Eq. (A.5.1) and Eq. (A.5.1), the output in Line 32 satisfies . Since are discrete values, they are exactly recovered.
∎
A.5.2 Exact recovery of neural networks from noiseless data
In this section we prove Theorem A.15. Similar to Appendix A.5.1, denote where and . We use to denote the distribution of and . We define the empirical loss for explored features and the population loss as follows,
(17)
(18)
Definition A.24.
Let be the -th singular value of , . Let .
We use the following results adapted from [ZSJ+17]. The only difference is that the rewards are potentially truncated, and we can bound the difference from the standard Gaussian case in the same way as in Appendix A.5.1.
Lemma A.25 (Concentration, adapted from Lemma D.11 in [ZSJ+17]).
Let samples size , then with probability at least ,
Lemma A.26 (Adapted from Lemma D.16 in [ZSJ+17]).
Now we prove Theorem A.15.
Proof.
The exact recovery consists of first finding parameters close enough to the truth by the tensor method (Appendix A.5.1), and then minimizing the empirical loss. We will prove that the empirical loss is locally strongly convex, so that this minimization recovers the parameters precisely.
From Lemma D.3 from [ZSJ+17] we know:
(19) |
Combining Lemma A.25, , and , we know must be positive definite.
Next we uniformly bound the Lipschitzness of the Hessian of the empirical loss. From Lemma A.26 there exists a universal constant such that, for all parameters sufficiently close to the truth, the required bound holds uniformly. So there is a unique minimizer of the empirical loss in this region.
Since the true parameters minimize the empirical loss in the noiseless case, we can find them by directly minimizing the empirical loss, as long as we start from any point in this region. Such a point can be obtained by the tensor method in Appendix A.5.1. We thus complete the proof. ∎
Appendix B Omitted Proofs in Section 4
Lemma B.1.
Consider a polynomial family of dimension $k$. For any set of positive measure, by sampling sufficiently many points i.i.d. from it (linearly many in $k$) and observing the noiseless feedback, one can almost surely uniquely determine the underlying polynomial by solving the resulting system of equations.
Proof.
By Theorem 4.2, there exists a set of Lebesgue measure zero such that, if the sample points avoid this set, one can uniquely determine the polynomial from the observations on the samples. Therefore, we only need to show that with probability 1 the sampling procedure returns points outside this measure-zero set. This holds because the sampling distribution assigns zero probability to any set of Lebesgue measure zero.
∎
Appendix C Omitted Constructions and Proofs in Subsection 4.1
Construction of the Reward Functions
The following construction of the polynomial hard case is adopted from [HHK+21].
Let be the dimension of the feature space. Let denotes the -th standard orthonormal basis of , i.e., has only one at the -th entry and ’s for other entries. Let denote the highest order of the polynomial. We assume . We use to denote a subset of the -th multi-indices
For an , denote , .
The model space is a subset of rank-1 -th order tensors, which is defined as . We define two subsets of feature space and as . For , , define as . We assume that for each level , there is a , and the noiseless reward is .
We have the following properties.
Proposition C.1 ([HHK+21]).
For and , we have
Proposition C.2.
For , we have
Proof of Proposition C.2.
For all , since , we can write
where and . Therefore,
Plug in the expression of , we have
Therefore,
Finally, since , we have . ∎
MDP constructions
Consider a family of MDPs with only two states . The action set is set to be . Let be a mapping from to such that is identity when restricted to . For all level , we define the feature map to be
Given an unknown sequence of indices , the reward function at level is . Specifically, we have
The transition is constructed as
This construction means it is impossible, in the online setting, to reach the good state at any level after the first.
The next proposition shows that is polynomial realizable and falls into the case of Example 4.4.
Proposition C.3.
We have for all and , , and . Furthermore, , viewed as the function of , is a polynomial of the form for some degree- polynomial and .
Proof of Proposition C.3.
First notice that by Proposition C.2, for all and , we have
Therefore, by induction, suppose we have proved for all , , then we have
Then we have .
Furthermore, we have
where and is a matrix with as the -th row. ∎
Theorem C.4.
Under the online RL setting, any algorithm needs to play at least episodes to identify and thus to identify the optimal policy.
Proof of Theorem C.4.
Under the online RL setting, any algorithm enters and remains in for . When , no matter what the algorithm chooses, we have . Notice that for any and any , we have as Proposition C.1 suggests. Hence, we need to play times at level in the worst case to find out . The argument holds for all . ∎
Theorem C.5.
Under the generative model setting, by querying samples, we can almost surely identify , and thus identify the optimal policy.
Proof of Theorem C.5.
By Proposition C.3, we know that , viewed as the function of , falls into the case of Example 4.4 with .
Next, notice that the feature set of the good state is not itself of positive measure, but we can still determine the reward at any point of it because the reward is homogeneous: for every feature that is a rescaling of a queried feature, the reward is the corresponding power of the scaling factor times the queried reward. Therefore, to get the reward at such a feature, we only need to query the generative model at the rescaled feature and then multiply the observed reward accordingly.
Notice that is of positive Lebesgue measure. By Theorem 4.7, we know that only samples are needed to determine the optimal policy almost surely.
∎
Appendix D Technical claims
Claim D.1.
Let $\chi^2_d$ denote the chi-squared distribution with $d$ degrees of freedom. For any we have,
We use the following facts from [ZSJ+17].
Claim D.2.
Given a fixed vector , for any and , we have
Claim D.3.
For any and , we have
Claim D.4.
Let be three constants, let be three vectors, we have
Claim D.5.
Let be defined in Definition A.11. For each , .
Claim D.6.
Let denote a distribution over . Let . Let be i.i.d. random matrices sampled from . Let and . For parameters , if the distribution satisfies the following four properties,
Then we have for any and , if
with probability at least ,